The Transformer Encoder in ViT

### Vision Transformer (ViT) Architecture Details

In a Vision Transformer, the core component is the Transformer encoder, which processes image patches as a sequence, much as text tokens are handled in natural language processing[^1]. Each input image is first divided into fixed-size patches, typically \(16 \times 16\) pixels, and each patch is flattened into a one-dimensional vector. A linear embedding layer then projects these vectors to the model dimension expected by the encoder.

#### Input Embedding Process

Before the sequence enters the encoder, positional embeddings are added to the patch embeddings. Self-attention by itself has no notion of token order, so without positional embeddings the spatial arrangement of the patches would be lost in later layers[^2].

The module below implements the patch projection only. A convolution whose kernel size and stride both equal the patch size is equivalent to flattening each patch and applying one shared linear layer.

```python
import torch.nn as nn


class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, embed_dim=768):
        super().__init__()
        # Number of non-overlapping patches, e.g. (224 // 16) ** 2 = 196
        self.num_patches = (img_size // patch_size) ** 2
        # kernel_size == stride == patch_size: one projection per patch
        self.patch_embeddings = nn.Conv2d(in_channels=3, out_channels=embed_dim,
                                          kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # (B, 3, H, W) -> (B, embed_dim, H/P, W/P) -> (B, num_patches, embed_dim)
        x = self.patch_embeddings(x).flatten(2).transpose(1, 2)
        return x
```

#### Multi-head Self Attention Mechanism

The heart of every Transformer encoder is multi-head self-attention. Queries, keys, and values are all derived from the same input sequence, and attention scores are computed over every pair of positions. Each head can attend to different positions or different aspects of the representation. This lets the model capture long-range dependencies between distant image regions directly, rather than building them up through the local receptive fields of the convolutional filters in the CNNs traditionally used for computer vision[^3].
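As a rough illustration of the "multi-head" part, the sketch below splits each embedding across several heads, computes attention separately in each head, and merges the results. The class name `MultiHeadSelfAttention`, the fused `qkv` projection, and the default `num_heads=12` are assumptions chosen for illustration and are not specified in the text above; `embed_dim=768` simply matches the patch embedding default.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadSelfAttention(nn.Module):
    """Illustrative multi-head self-attention over a sequence of patch embeddings."""

    def __init__(self, embed_dim=768, num_heads=12):  # num_heads is an assumed default
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # One linear layer produces queries, keys, and values together
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        batch, seq_len, embed_dim = x.shape
        # (B, N, 3*D) -> (B, N, 3, H, head_dim) -> (3, B, H, N, head_dim)
        qkv = self.qkv(x).reshape(batch, seq_len, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        # Scaled dot-product attention, computed independently per head
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        weights = F.softmax(scores, dim=-1)          # (B, H, N, N)
        out = weights @ v                            # (B, H, N, head_dim)
        # Merge heads back into a single embedding per position
        out = out.transpose(1, 2).reshape(batch, seq_len, embed_dim)
        return self.proj(out), weights
```

Each head works on a `head_dim`-sized slice of the embedding, so the total cost is comparable to a single full-width attention while allowing different heads to learn different attention patterns.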
```python
from math import sqrt

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    # Similarity between every query and every key
    attn_logits = torch.matmul(q, k.transpose(-2, -1))
    # Scale by sqrt(d_k) to keep the softmax from saturating
    attn_logits = attn_logits / sqrt(d_k)
    if mask is not None:
        # Large negative value so masked positions get ~0 weight
        attn_logits = attn_logits.masked_fill(mask == 0, -9e15)
    attention = F.softmax(attn_logits, dim=-1)
    values = torch.matmul(attention, v)
    return values, attention
```

#### Feed Forward Neural Network Layer

After multi-head self-attention comes the position-wise feed-forward network, which is applied independently at every sequence position. Each sub-layer is wrapped in a residual connection and paired with layer normalization, which keeps training stable in deep architectures built entirely from stacked encoder blocks[^4].

```python
import torch.nn as nn


class MLPBlock(nn.Module):
    def __init__(self, hidden_features, output_features):
        super().__init__()
        # hidden_features is the input width; the inner layer is 4x output_features
        self.fc1 = nn.Linear(hidden_features, output_features * 4)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(output_features * 4, output_features)

    def forward(self, x):
        x = self.fc1(x)   # expand
        x = self.act(x)   # non-linearity
        x = self.fc2(x)   # project back down
        return x
```
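To show how the pieces fit together, here is a minimal sketch of one encoder block that combines self-attention, the feed-forward network, residual connections, and layer normalization. It assumes the pre-norm arrangement (LayerNorm before each sub-layer) and uses PyTorch's built-in `nn.MultiheadAttention` instead of a hand-written attention function. The name `EncoderBlock` and the parameters `num_heads`, `mlp_ratio`, and `dropout` are illustrative assumptions, not taken from the text above.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """Illustrative pre-norm Transformer encoder block (an assumed layout)."""

    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        # batch_first=True so inputs are (batch, sequence, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        # Position-wise feed-forward network with a 4x hidden expansion
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(embed_dim * mlp_ratio, embed_dim),
        )

    def forward(self, x):
        # Sub-layer 1: self-attention with a residual connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Sub-layer 2: feed-forward network with a residual connection
        x = x + self.mlp(self.norm2(x))
        return x


# Usage: 196 patches = (224 // 16) ** 2, each embedded in 768 dimensions
patches = torch.randn(2, 196, 768)
pos_embed = nn.Parameter(torch.zeros(1, 196, 768))  # learnable positional embedding
block = EncoderBlock()
out = block(patches + pos_embed)
print(out.shape)  # torch.Size([2, 196, 768])
```

Because the block preserves the input shape, several of them can be stacked in an `nn.Sequential` to form the full encoder.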

Authoring statement: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.
