How exactly does the full attention block in a Transformer work, and how does it fundamentally differ from an ordinary layer?

### Full Attention Block in Neural Network Architectures

A full attention block differs from a traditional convolutional or recurrent layer in that it lets the model weight the relevant parts of its input dynamically. Each position in the sequence can attend to every position in the previous layer's output, so the block captures dependencies regardless of how far apart the positions are.

A typical full attention block consists of two key components:

- **Multi-head self-attention**: lets the model jointly attend to information from different representation subspaces at different positions.
- **Feed-forward network (FFN)**: applied identically at every position, it passes the attended features through two linear transformations separated by an activation function such as ReLU.

#### Implementation Example Using PyTorch

Below is a simplified version of how one might implement a full attention block in Python with the PyTorch library:

```python
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert (
            self.head_dim * heads == embed_size
        ), "Embedding size needs to be divisible by heads"

        # Project the full embedding; forward() splits the result into heads.
        self.values = nn.Linear(embed_size, embed_size, bias=False)
        self.keys = nn.Linear(embed_size, embed_size, bias=False)
        self.queries = nn.Linear(embed_size, embed_size, bias=False)
        self.fc_out = nn.Linear(embed_size, embed_size)

    def forward(self, values, keys, query, mask):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Apply the learned projections, then split the embedding into self.heads pieces
        values = self.values(values).reshape(N, value_len, self.heads, self.head_dim)
        keys = self.keys(keys).reshape(N, key_len, self.heads, self.head_dim)
        queries = self.queries(query).reshape(N, query_len, self.heads, self.head_dim)

        # energy has shape (N, heads, query_len, key_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)

        if mask is not None:
            # mask must broadcast to the shape of energy; zeros mark hidden positions
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Scale by the square root of the per-head dimension before the softmax over keys
        attention = torch.softmax(energy / (self.head_dim ** 0.5), dim=3)

        # Weighted sum of values, then concatenate the heads back to embed_size
        out = torch.einsum("nhql,nlhd->nqhd", attention, values).reshape(
            N, query_len, self.embed_size
        )
        return self.fc_out(out)


class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super().__init__()
        self.attention = MultiHeadSelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)

        # Residual connection around attention, followed by layer normalization
        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        # Residual connection around the feed-forward network
        out = self.dropout(self.norm2(forward + x))
        return out
```

This snippet builds a multi-head self-attention module and wraps it in a Transformer block that adds residual connections, layer normalization, and a position-wise feed-forward network[^1].
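
To make the "every position attends to every position" idea concrete, here is a minimal single-head sketch in plain Python with no PyTorch dependency. It is an illustration under assumptions, not the implementation above: the helper names `softmax` and `scaled_dot_product_attention`, the toy input `x`, and the causal mask are all hypothetical, and the learned projections are left out so that only the score, mask, softmax, and weighted-sum steps remain.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values, mask=None):
    """Single-head attention on plain lists of vectors (hypothetical helper).

    mask[i][j] == 0 hides key j from query i, mirroring the
    `mask == 0` convention used in the PyTorch block.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        # Dot product of this query with every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if mask is not None:
            scores = [s if mask[i][j] else -1e20 for j, s in enumerate(scores)]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors
        out = [
            sum(w[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Toy sequence of three 2-dimensional token vectors (assumed data)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Without a mask, the last position puts nonzero weight on the first one
out_full, w_full = scaled_dot_product_attention(x, x, x)

# With a causal mask, position 0 can only see itself
causal = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
out, w = scaled_dot_product_attention(x, x, x, mask=causal)
```

Each row of `w` is a probability distribution over the keys, which is why a masked position ends up with zero weight rather than being removed from the sequence. A convolution would instead restrict each position to a fixed local window, and a recurrent layer would have to pass information step by step.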

Content declaration: parts of this article were generated with AI assistance (AIGC) and are for reference only.
