
Transformer Architecture


Posted by LXG on April 25, 2026

Notes on Dive into Deep Learning (d2l)


To sum it up in one engineering-minded sentence:

Transformer = "a global information-exchange system computed in parallel with matrices"

Or more bluntly: it lets every token "see the entire sequence at once"

(Figure: the overall Transformer architecture)

Position-wise Feed-Forward Network (FFN)


import math
import pandas as pd
import torch
from torch import nn
from d2l import torch as d2l


def log_tensor_info(name, tensor):
    """Print a tensor's shape, dtype, and full contents in a uniform format."""
    print(f"[LOG] {name}: shape={tuple(tensor.shape)}, dtype={tensor.dtype}")
    print(f"[LOG] {name} full contents:\n{tensor.detach().cpu()}")


#@save
class PositionWiseFFN(nn.Module):
    """Position-wise feed-forward network"""
    def __init__(self, ffn_num_input, ffn_num_hiddens, ffn_num_outputs,
                 **kwargs):
        super(PositionWiseFFN, self).__init__(**kwargs)
        # First linear layer: maps each position's features from the input
        # dimension to the hidden dimension
        # (last dim: ffn_num_input -> ffn_num_hiddens)
        self.dense1 = nn.Linear(ffn_num_input, ffn_num_hiddens)
        # Non-linear activation: gives the network non-linear expressive power
        self.relu = nn.ReLU()
        # Second linear layer: maps the hidden dimension to the output dimension
        # (ffn_num_hiddens -> ffn_num_outputs)
        self.dense2 = nn.Linear(ffn_num_hiddens, ffn_num_outputs)

        # Initialization log, to make the network structure easy to follow
        print("[LOG] PositionWiseFFN initialized")
        print(
            f"[LOG] input_dim={ffn_num_input}, hidden_dim={ffn_num_hiddens}, output_dim={ffn_num_outputs}"
        )

    def forward(self, X):
        # X typically has shape (batch_size, num_steps, d_model).
        # What "position-wise" means:
        # - positions are never mixed inside the FFN (no cross-position computation)
        # - each position's vector independently goes through "linear -> ReLU -> linear"
        print("\n[LOG] ===== entering PositionWiseFFN.forward =====")
        log_tensor_info("input X", X)

        h1 = self.dense1(X)
        log_tensor_info("dense1(X)", h1)

        h2 = self.relu(h1)
        log_tensor_info("after ReLU", h2)

        out = self.dense2(h2)
        log_tensor_info("dense2 output", out)
        print("[LOG] ===== leaving PositionWiseFFN.forward =====\n")
        return out


# Example: input dim = 4, hidden dim = 4, output dim = 8
ffn = PositionWiseFFN(4, 4, 8)
# Switch to inference mode for reproducible output (this module has no dropout,
# but it is good practice anyway)
ffn.eval()

# Build the input:
# batch_size=2, num_steps=3, d_model=4
X = torch.ones((2, 3, 4))
log_tensor_info("example input X", X)

# Run the feed-forward network
Y = ffn(X)
log_tensor_info("example output Y", Y)

# Print sample 0's output at every position (shape: 3x8)
print("[LOG] Y[0] =")
print(Y[0])

Input data


[LOG] example input X: shape=(2, 3, 4), dtype=torch.float32
[LOG] example input X full contents:
tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])

Dimension  Meaning
2          batch_size (2 samples)
3          seq_len (3 tokens per sample)
4          d_model (4 features per token)

Entering forward

Step 1: dense1 (first linear layer)


h1 = self.dense1(X)
📌 The underlying computation

For every token:

h1 = X @ W1^T + b1
📐 Shape change
input:   (2, 3, 4)
weights: (4 → 4)
output:  (2, 3, 4)

👉 Note:

nn.Linear(in, out) actually stores its weight as (out, in)
and applies the matrix multiplication to the last dimension automatically


[LOG] dense1 parameters
[LOG] dense1.weight: shape=(4, 4), dtype=torch.float32
[LOG] dense1.weight full contents:
tensor([[ 0.0336,  0.1130, -0.3063,  0.4771],
        [-0.2043,  0.2148, -0.2774,  0.2155],
        [-0.1847,  0.0449, -0.3454, -0.3507],
        [ 0.1168,  0.2330, -0.1549,  0.2228]])
        
[LOG] dense1.bias: shape=(4,), dtype=torch.float32
[LOG] dense1.bias full contents:
tensor([ 0.3793,  0.3294, -0.1541,  0.0368])

Computed as: h1 = X @ W1^T + b1

[LOG] dense1(X): shape=(2, 3, 4), dtype=torch.float32
[LOG] dense1(X) full contents:
tensor([[[ 0.6967,  0.2780, -0.9898,  0.4545],
         [ 0.6967,  0.2780, -0.9898,  0.4545],
         [ 0.6967,  0.2780, -0.9898,  0.4545]],

        [[ 0.6967,  0.2780, -0.9898,  0.4545],
         [ 0.6967,  0.2780, -0.9898,  0.4545],
         [ 0.6967,  0.2780, -0.9898,  0.4545]]])


Step 2: ReLU (non-linear activation)


h2 = self.relu(h1)
📌 The underlying computation
ReLU(x) = max(0, x)
🧠 What it does
Sets negative values to 0
Keeps positive values
🔥 Why is it necessary?

Without ReLU:

dense2(dense1(X)) = one big linear layer

👉 The whole network would collapse into:

❌ a linear model (very limited expressive power)

✅ With ReLU:

👉 the network can represent:

non-linear relations
conditional activation (a kind of "switch")

[LOG] after ReLU: shape=(2, 3, 4), dtype=torch.float32
[LOG] after ReLU full contents:
tensor([[[0.6967, 0.2780, 0.0000, 0.4545],
         [0.6967, 0.2780, 0.0000, 0.4545],
         [0.6967, 0.2780, 0.0000, 0.4545]],

        [[0.6967, 0.2780, 0.0000, 0.4545],
         [0.6967, 0.2780, 0.0000, 0.4545],
         [0.6967, 0.2780, 0.0000, 0.4545]]])

Step 3: dense2 (second linear layer)


out = self.dense2(h2)
📌 The underlying computation
out = h2 @ W2^T + b2
📐 Shape change
input:   (2, 3, 4)
weights: (4 → 8)
output:  (2, 3, 8)
🧠 What it does

👉 Projects the activated features into a higher-dimensional space:

4 dims → 8 dims
🔥 Intuition

This layer performs:

a re-encoding of the information (feature projection)

[LOG] dense2 parameters
[LOG] dense2.weight: shape=(8, 4), dtype=torch.float32
[LOG] dense2.weight full contents:
tensor([[-3.5300e-01, -3.9332e-01,  2.8701e-01, -3.4535e-02],
        [ 3.8852e-01, -3.8515e-01, -3.9982e-01, -6.5256e-02],
        [-4.0756e-01, -1.3083e-04,  1.9923e-01,  2.6752e-01],
        [-9.9782e-02,  5.5980e-02, -4.2391e-01, -4.8913e-02],
        [-5.7005e-02,  2.1488e-01,  2.4811e-01, -3.3240e-01],
        [ 6.8557e-02, -4.1339e-01, -3.3388e-01, -2.0628e-01],
        [-2.1272e-01,  5.8841e-02, -1.2252e-01, -4.8725e-02],
        [ 4.8711e-01, -3.8770e-02,  3.9174e-01,  4.7053e-01]])
[LOG] dense2.bias: shape=(8,), dtype=torch.float32
[LOG] dense2.bias full contents:
tensor([-0.3661, -0.3997, -0.1815, -0.2088,  0.2070,  0.1019, -0.4531, -0.0070])

[LOG] dense2 output: shape=(2, 3, 8), dtype=torch.float32
[LOG] dense2 output full contents:
tensor([[[-0.7371, -0.2657, -0.3439, -0.2850,  0.0760, -0.0590, -0.6071, 0.5354],
         [-0.7371, -0.2657, -0.3439, -0.2850,  0.0760, -0.0590, -0.6071, 0.5354],
         [-0.7371, -0.2657, -0.3439, -0.2850,  0.0760, -0.0590, -0.6071, 0.5354]],

        [[-0.7371, -0.2657, -0.3439, -0.2850,  0.0760, -0.0590, -0.6071, 0.5354],
         [-0.7371, -0.2657, -0.3439, -0.2850,  0.0760, -0.0590, -0.6071, 0.5354],
         [-0.7371, -0.2657, -0.3439, -0.2850,  0.0760, -0.0590, -0.6071, 0.5354]]])
[LOG] ===== leaving PositionWiseFFN.forward =====

FFN architecture

What the Position-wise Feed-Forward Network (FFN) does

In a Transformer, the FFN is often summarized in one sentence: a two-layer MLP applied to each token independently, used to strengthen the representation (a non-linear feature transform)


            input token vector x (4-dim)
                   │
                   ▼
        ┌─────────────────────┐
        │   Linear (dense1)   │
        │   W1: 4 → 4         │
        └─────────────────────┘
                   │
                   ▼
          intermediate features h1
                   │
                   ▼
        ┌─────────────────────┐
        │   ReLU activation   │
        └─────────────────────┘
                   │
                   ▼
           activated features h2
                   │
                   ▼
        ┌─────────────────────┐
        │   Linear (dense2)   │
        │   W2: 4 → 8         │
        └─────────────────────┘
                   │
                   ▼
            output vector y (8-dim)

Key structural properties (the most important points in the diagram)


1️⃣ Position-wise processing (the core idea)
batch 0:
  token1 ──► FFN ──► y1
  token2 ──► FFN ──► y2
  token3 ──► FFN ──► y3

batch 1:
  token1 ──► FFN ──► y1
  token2 ──► FFN ──► y2
  token3 ──► FFN ──► y3

👉 Every token is:

processed completely independently
with one shared set of weights
2️⃣ No mixing across tokens (very important)
token1 ❌ never sees token2
token2 ❌ never sees token3

👉 So the FFN is not a "sequence-modeling layer"; this can be checked directly, as in the sketch below.
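
A minimal sketch (reusing the ffn and X defined above) of the position-wise property: applying the FFN to the whole tensor and applying it one position at a time give identical results.


with torch.no_grad():
    full = ffn(X)  # (2, 3, 4) -> (2, 3, 8) in one shot
    # Apply the same FFN to each position's vectors separately
    per_pos = torch.stack(
        [ffn(X[:, t, :]) for t in range(X.shape[1])], dim=1)
    # True: no information ever crosses positions inside the FFN
    print(torch.allclose(full, per_pos))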

Residual connections and layer normalization

Part 1: code walkthrough


# LayerNorm vs BatchNorm comparison
# - LayerNorm: normalizes over each sample's own feature dimension (independent of batch size)
# - BatchNorm1d: normalizes each feature dimension across the batch (depends on batch statistics)
ln = nn.LayerNorm(2)
bn = nn.BatchNorm1d(2)
X = torch.tensor([[1, 2], [2, 3]], dtype=torch.float32)
# In training mode, BatchNorm computes the mean and variance of X
print("\n[LOG] ===== LayerNorm vs BatchNorm example =====")
log_tensor_info("input X", X)
ln_out = ln(X)
bn_out = bn(X)
log_tensor_info("LayerNorm output", ln_out)
log_tensor_info("BatchNorm output", bn_out)
print("[LOG] ===== end of LayerNorm vs BatchNorm example =====\n")

Input data


[LOG] ===== LayerNorm vs BatchNorm example =====
[LOG] input X: shape=(2, 2), dtype=torch.float32
[LOG] input X full contents:
tensor([[1., 2.],
        [2., 3.]])

LayerNorm: computed per sample


[LOG] LayerNorm output: shape=(2, 2), dtype=torch.float32
[LOG] LayerNorm output full contents:
tensor([[-1.0000,  1.0000],
        [-1.0000,  1.0000]])

👉 Each sample is individually pulled to zero mean and unit variance

BatchNorm1d: computed per feature (column)


[LOG] BatchNorm output: shape=(2, 2), dtype=torch.float32
[LOG] BatchNorm output full contents:
tensor([[-1.0000, -1.0000],
        [ 1.0000,  1.0000]])

Method      Normalization direction
LayerNorm   each row (within a sample)
BatchNorm   each column (across the batch)
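
A minimal sketch verifying both outputs by hand, reusing the X from the example above (both layers use the biased variance):


row_mean = X.mean(dim=1, keepdim=True)               # per-sample mean
row_std = X.std(dim=1, unbiased=False, keepdim=True)
print((X - row_mean) / row_std)                      # matches the LayerNorm output

col_mean = X.mean(dim=0, keepdim=True)               # per-feature mean over the batch
col_std = X.std(dim=0, unbiased=False, keepdim=True)
print((X - col_mean) / col_std)                      # matches the BatchNorm output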

Part 2: code walkthrough


#@save
class AddNorm(nn.Module):
    """Residual connection followed by layer normalization"""
    def __init__(self, normalized_shape, dropout, **kwargs):
        super(AddNorm, self).__init__(**kwargs)
        # Dropout is applied to the sublayer output Y, then the residual X is
        # added, and finally LayerNorm is applied.
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(normalized_shape)
        print("[LOG] AddNorm initialized")
        print(f"[LOG] normalized_shape={normalized_shape}, dropout={dropout}")

    def forward(self, X, Y):
        print("\n[LOG] ===== entering AddNorm.forward =====")
        log_tensor_info("residual input X", X)
        log_tensor_info("sublayer output Y", Y)
        y_drop = self.dropout(Y)
        log_tensor_info("dropout(Y)", y_drop)
        added = y_drop + X
        log_tensor_info("residual sum (Y_drop + X)", added)
        out = self.ln(added)
        log_tensor_info("output after LayerNorm", out)
        print("[LOG] ===== leaving AddNorm.forward =====\n")
        return out

add_norm = AddNorm([3, 4], 0.5)
add_norm.eval()
addnorm_out = add_norm(torch.ones((2, 3, 4)), torch.ones((2, 3, 4)))
log_tensor_info("AddNorm example output", addnorm_out)
print(f"[LOG] AddNorm example output shape: {tuple(addnorm_out.shape)}")

Step 1: input data


[LOG] ===== entering AddNorm.forward =====
[LOG] residual input X: shape=(2, 3, 4), dtype=torch.float32
[LOG] residual input X full contents:
tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])

[LOG] sublayer output Y: shape=(2, 3, 4), dtype=torch.float32
[LOG] sublayer output Y full contents:
tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])

Step 2: Dropout


add_norm.eval()

👉 In eval mode:

dropout is disabled

so:

y_drop = Y

[LOG] dropout(Y): shape=(2, 3, 4), dtype=torch.float32
[LOG] dropout(Y) full contents:
tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])

Step 3: residual addition


[LOG] residual sum (Y_drop + X): shape=(2, 3, 4), dtype=torch.float32
[LOG] residual sum (Y_drop + X) full contents:
tensor([[[2., 2., 2., 2.],
         [2., 2., 2., 2.],
         [2., 2., 2., 2.]],

        [[2., 2., 2., 2.],
         [2., 2., 2., 2.],
         [2., 2., 2., 2.]]])

Step 4: LayerNorm (the key step)


[LOG] output after LayerNorm: shape=(2, 3, 4), dtype=torch.float32
[LOG] output after LayerNorm full contents:
tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]])

Why all zeros?

Because every entry equals 2: LayerNorm first subtracts the per-row mean, which maps every value to 0 (the variance is 0, and the eps term in the denominator prevents a division by zero)
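
A minimal sketch of the same effect in isolation:


# A constant vector has zero variance: after mean subtraction every entry is 0,
# and LayerNorm's eps in the denominator avoids dividing by zero.
ln4 = nn.LayerNorm(4)
print(ln4(torch.full((1, 4), 2.0)))  # tensor([[0., 0., 0., 0.]], grad_fn=...)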

AddNorm structure


            X (residual input)
             │
             │
             │        Y (sublayer output)
             │               │
             │               ▼
             │        Dropout (training only)
             │               │
             │               ▼
             └──────►  add  ◄──────┘
                       │
                       ▼
                 LayerNorm
                       │
                       ▼
                    output

1️⃣ Residual connection X + Y

Purpose:

  • prevents vanishing gradients
  • preserves the original information
  • lets each sublayer "fine-tune rather than rewrite" the representation

2️⃣ LayerNorm

Purpose:

  • keeps values in a controlled range
  • stabilizes training
  • speeds up convergence

3️⃣ Dropout

Purpose: prevents overfitting (during training); see the sketch below
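
A minimal sketch (reusing add_norm from above) of how training mode changes the behaviour: nn.Dropout with p=0.5 zeroes roughly half the entries and rescales the survivors by 1/(1-p)=2; in eval mode it is the identity.


add_norm.train()
torch.manual_seed(0)  # reproducible dropout mask
print(add_norm.dropout(torch.ones(1, 4)))  # each entry is either 0.0 or 2.0
add_norm.eval()  # back to inference mode: dropout passes values through unchanged
print(add_norm.dropout(torch.ones(1, 4)))  # tensor([[1., 1., 1., 1.]])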

Encoder

Part 1: understanding the code


#@save
class EncoderBlock(nn.Module):
    """Transformer encoder block

    Structure:
      input X
        │
        ├──► multi-head self-attention (Q=K=V=X) ──► AddNorm (residual + LayerNorm) ──► Y
        │
        └──► PositionWiseFFN(Y) ──► AddNorm (residual + LayerNorm) ──► output
    """
    def __init__(self, key_size, query_size, value_size, num_hiddens,
                 norm_shape, ffn_num_input, ffn_num_hiddens, num_heads,
                 dropout, use_bias=False, **kwargs):
        super(EncoderBlock, self).__init__(**kwargs)
        # Sublayer 1: multi-head self-attention.
        # Q, K, V all come from the same input X (self-attention); output dim is num_hiddens.
        # Newer d2l.MultiHeadAttention signature: (num_hiddens, num_heads, dropout, bias);
        # key_size/query_size/value_size no longer need to be passed separately.
        self.attention = d2l.MultiHeadAttention(
            num_hiddens, num_heads, dropout, use_bias)
        # Residual connection + layer normalization after sublayer 1
        self.addnorm1 = AddNorm(norm_shape, dropout)
        # Sublayer 2: position-wise FFN, an independent non-linear transform per position
        self.ffn = PositionWiseFFN(
            ffn_num_input, ffn_num_hiddens, num_hiddens)
        # Residual connection + layer normalization after sublayer 2
        self.addnorm2 = AddNorm(norm_shape, dropout)

        print("[LOG] EncoderBlock initialized")
        print(f"[LOG] num_hiddens={num_hiddens}, num_heads={num_heads}, "
              f"ffn_num_hiddens={ffn_num_hiddens}, dropout={dropout}")

    def forward(self, X, valid_lens):
        """
        Args:
            X          : (batch_size, num_steps, num_hiddens)  encoder input
            valid_lens : (batch_size,) or None, masks out padded positions
        Returns:
            encoded output with the same shape as X
        """
        print("\n[LOG] ========== entering EncoderBlock.forward ==========")
        log_tensor_info("input X", X)
        print(f"[LOG] valid_lens={valid_lens}")

        # --- Sublayer 1: multi-head self-attention ---
        # Q=K=V=X: every position can attend to all positions (subject to valid_lens)
        attn_out = self.attention(X, X, X, valid_lens)
        log_tensor_info("multi-head self-attention output attn_out", attn_out)

        # Residual + LayerNorm: attn_out goes through dropout, is added to X, then normalized
        Y = self.addnorm1(X, attn_out)
        log_tensor_info("output Y after AddNorm1", Y)

        # --- Sublayer 2: position-wise FFN ---
        ffn_out = self.ffn(Y)
        log_tensor_info("FFN output ffn_out", ffn_out)

        # Residual + LayerNorm: ffn_out goes through dropout, is added to Y, then normalized
        out = self.addnorm2(Y, ffn_out)
        log_tensor_info("output after AddNorm2 (final EncoderBlock output)", out)
        print("[LOG] ========== leaving EncoderBlock.forward ==========\n")
        return out


print("\n[LOG] ===== 编码器块测试 =====")
# 构造测试输入:batch_size=2, num_steps=100, num_hiddens=24
X = torch.ones((2, 100, 24))
log_tensor_info("测试输入X", X)

# valid_lens: 第0个样本只看前3个位置,第1个样本只看前2个位置
valid_lens = torch.tensor([3, 2])
print(f"[LOG] valid_lens={valid_lens}")

# 初始化编码器块
# key/query/value_size=24, num_hiddens=24, norm_shape=[100,24]
# ffn_num_input=24, ffn_num_hiddens=48, num_heads=8, dropout=0.5
encoder_blk = EncoderBlock(24, 24, 24, 24, [100, 24], 24, 48, 8, 0.5)
encoder_blk.eval()  # 推理模式,dropout不生效,结果可复现

output = encoder_blk(X, valid_lens)
print(f"[LOG] EncoderBlock输出形状: {tuple(output.shape)}")
print("[LOG] ===== 编码器块测试结束 =====\n")

Architecture


X
│
├── Self-Attention
│      ↓
├── Add & Norm  (residual 1)
│      ↓
├── FFN
│      ↓
└── Add & Norm  (residual 2)
       ↓
     output

Input data


X.shape = (2, 100, 24)

[LOG] test input X: shape=(2, 100, 24), dtype=torch.float32
[LOG] test input X full contents:
tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.]]])

Dimension  Meaning
2          batch
100        sequence length
24         embedding dimension

valid_lens = [3, 2]

👉 Effect:

sample 0: may attend only to the first 3 tokens
sample 1: may attend only to the first 2 tokens

👉 All later tokens are masked out, as the sketch below demonstrates
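
A minimal sketch of what valid_lens does inside the attention layer; d2l.masked_softmax is the helper that d2l's attention implementations use internally.


# Attention scores past each row's valid length receive weight 0 after the softmax.
scores = torch.rand(2, 1, 4)  # (batch_size, num_queries, num_keys)
print(d2l.masked_softmax(scores, torch.tensor([3, 2])))
# row 0: non-zero weights only on keys 0-2; row 1: only on keys 0-1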

PositionWiseFFN


[LOG] PositionWiseFFN initialized
[LOG] input_dim=24, hidden_dim=48, output_dim=24
[LOG] dense1 parameters
[LOG] dense1.weight: shape=(48, 24), dtype=torch.float32
[LOG] dense1.weight full contents:
tensor([[-0.0627,  0.1583,  0.0363,  ..., -0.1293, -0.1403,  0.0511],
        [-0.1481,  0.0495, -0.0628,  ..., -0.0917, -0.1495, -0.0644],
        [-0.0271, -0.1870, -0.0948,  ..., -0.1679, -0.1096,  0.0957],
        ...,
        [-0.0276,  0.1536,  0.0568,  ..., -0.1469, -0.0915, -0.0856],
        [ 0.1135, -0.0184,  0.0330,  ...,  0.0470,  0.2012, -0.0248],
        [ 0.0607,  0.0418, -0.1447,  ...,  0.0322, -0.0400, -0.0710]])
[LOG] dense1.bias: shape=(48,), dtype=torch.float32
[LOG] dense1.bias full contents:
tensor([ 0.1649,  0.1784, -0.0196, -0.0276,  0.0587, -0.1541, -0.1756, -0.1351,
         0.1887, -0.1075, -0.0332, -0.0600, -0.1659, -0.1472, -0.1123, -0.1028,
         0.0076,  0.0641,  0.1130,  0.0777, -0.0257, -0.0656, -0.0449, -0.1324,
        -0.0187, -0.1960, -0.1980, -0.1332, -0.1313, -0.1265,  0.1737, -0.0182,
        -0.0558, -0.0620, -0.1547,  0.0924, -0.0586, -0.1146,  0.1801, -0.0604,
        -0.1654, -0.1518,  0.1121,  0.1409,  0.0836, -0.0488, -0.1851, -0.0886])

[LOG] dense2 parameters
[LOG] dense2.weight: shape=(24, 48), dtype=torch.float32
[LOG] dense2.weight full contents:
tensor([[ 0.0248, -0.1010,  0.0445,  ..., -0.1299,  0.1192, -0.1108],
        [ 0.0041, -0.1158,  0.1042,  ...,  0.0748,  0.0463,  0.0654],
        [-0.0787, -0.0308, -0.1035,  ...,  0.0877,  0.1203, -0.1250],
        ...,
        [-0.0394, -0.0815,  0.0370,  ..., -0.1113, -0.0425,  0.0554],
        [ 0.1180, -0.1212,  0.0735,  ..., -0.0208, -0.1322, -0.0840],
        [ 0.1339,  0.0996,  0.0584,  ..., -0.0632, -0.0570, -0.0192]])
[LOG] dense2.bias: shape=(24,), dtype=torch.float32
[LOG] dense2.bias full contents:
tensor([-0.0400,  0.0554,  0.0028, -0.1291, -0.0754,  0.1313, -0.0341,  0.1149,
         0.0035, -0.1052,  0.0067, -0.0261, -0.0183, -0.0008,  0.0731,  0.0730,
         0.1362,  0.1111, -0.0523, -0.1191, -0.1024, -0.1101,  0.1146, -0.1102])

Step 1: multi-head self-attention


attn_out = self.attention(X, X, X, valid_lens)
📌 In essence
Q = X
K = X
V = X

👉 Every token:

"looks at the whole sequence (subject to valid_lens)"

📐 Output shape
attn_out.shape = (2, 100, 24)
🔥 Key insight

This layer is responsible for the information exchange between tokens

[LOG] multi-head self-attention output attn_out: shape=(2, 100, 24), dtype=torch.float32
[LOG] multi-head self-attention output attn_out full contents:
tensor([[[ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         ...,
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334]],

        [[ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         ...,
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334]]])

Step 2: AddNorm1 (residual + LayerNorm)


Y = self.addnorm1(X, attn_out)

What it does

1️⃣ Residual
X + attn_out

👉 preserves the original information

2️⃣ LayerNorm

👉 stabilizes the value distribution

📐 Output
Y.shape = (2, 100, 24)

[LOG] sublayer output Y: shape=(2, 100, 24), dtype=torch.float32
[LOG] sublayer output Y full contents:
tensor([[[ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         ...,
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334]],

        [[ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         ...,
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334],
         [ 0.4831, -0.0471, -0.7931,  ...,  0.2540,  0.2923, -0.1334]]])

[LOG] residual sum (Y_drop + X): shape=(2, 100, 24), dtype=torch.float32
[LOG] residual sum (Y_drop + X) full contents:
tensor([[[1.4831, 0.9529, 0.2069,  ..., 1.2540, 1.2923, 0.8666],
         [1.4831, 0.9529, 0.2069,  ..., 1.2540, 1.2923, 0.8666],
         [1.4831, 0.9529, 0.2069,  ..., 1.2540, 1.2923, 0.8666],
         ...,
         [1.4831, 0.9529, 0.2069,  ..., 1.2540, 1.2923, 0.8666],
         [1.4831, 0.9529, 0.2069,  ..., 1.2540, 1.2923, 0.8666],
         [1.4831, 0.9529, 0.2069,  ..., 1.2540, 1.2923, 0.8666]],

        [[1.4831, 0.9529, 0.2069,  ..., 1.2540, 1.2923, 0.8666],
         [1.4831, 0.9529, 0.2069,  ..., 1.2540, 1.2923, 0.8666],
         [1.4831, 0.9529, 0.2069,  ..., 1.2540, 1.2923, 0.8666],
         ...,
         [1.4831, 0.9529, 0.2069,  ..., 1.2540, 1.2923, 0.8666],
         [1.4831, 0.9529, 0.2069,  ..., 1.2540, 1.2923, 0.8666],
         [1.4831, 0.9529, 0.2069,  ..., 1.2540, 1.2923, 0.8666]]])
         
# normalization
[LOG] output after LayerNorm: shape=(2, 100, 24), dtype=torch.float32
[LOG] output after LayerNorm full contents:
tensor([[[ 1.2714, -0.0241, -1.8467,  ...,  0.7115,  0.8052, -0.2350],
         [ 1.2714, -0.0241, -1.8467,  ...,  0.7115,  0.8052, -0.2350],
         [ 1.2714, -0.0241, -1.8467,  ...,  0.7115,  0.8052, -0.2350],
         ...,
         [ 1.2714, -0.0241, -1.8467,  ...,  0.7115,  0.8052, -0.2350],
         [ 1.2714, -0.0241, -1.8467,  ...,  0.7115,  0.8052, -0.2350],
         [ 1.2714, -0.0241, -1.8467,  ...,  0.7115,  0.8052, -0.2350]],

        [[ 1.2714, -0.0241, -1.8467,  ...,  0.7115,  0.8052, -0.2350],
         [ 1.2714, -0.0241, -1.8467,  ...,  0.7115,  0.8052, -0.2350],
         [ 1.2714, -0.0241, -1.8467,  ...,  0.7115,  0.8052, -0.2350],
         ...,
         [ 1.2714, -0.0241, -1.8467,  ...,  0.7115,  0.8052, -0.2350],
         [ 1.2714, -0.0241, -1.8467,  ...,  0.7115,  0.8052, -0.2350],
         [ 1.2714, -0.0241, -1.8467,  ...,  0.7115,  0.8052, -0.2350]]])
[LOG] ===== leaving AddNorm.forward =====

Step 3: FFN (position-wise feed-forward network)


ffn_out = self.ffn(Y)
📌 In essence

For every token:

y → Linear(24→48) → ReLU → Linear(48→24)
🔥 Properties
no mixing across tokens
each position processed independently
parameters shared across positions
📐 Output
ffn_out.shape = (2, 100, 24)
🔥 Role

applies a non-linear feature transform to each token

[LOG] dense1(X): shape=(2, 100, 48), dtype=torch.float32
[LOG] dense1(X) full contents:
tensor([[[ 0.5987,  0.1660,  0.3433,  ...,  0.3329, -0.4658, -0.0718],
         [ 0.5987,  0.1660,  0.3433,  ...,  0.3329, -0.4658, -0.0718],
         [ 0.5987,  0.1660,  0.3433,  ...,  0.3329, -0.4658, -0.0718],
         ...,
         [ 0.5987,  0.1660,  0.3433,  ...,  0.3329, -0.4658, -0.0718],
         [ 0.5987,  0.1660,  0.3433,  ...,  0.3329, -0.4658, -0.0718],
         [ 0.5987,  0.1660,  0.3433,  ...,  0.3329, -0.4658, -0.0718]],

        [[ 0.5987,  0.1660,  0.3433,  ...,  0.3329, -0.4658, -0.0718],
         [ 0.5987,  0.1660,  0.3433,  ...,  0.3329, -0.4658, -0.0718],
         [ 0.5987,  0.1660,  0.3433,  ...,  0.3329, -0.4658, -0.0718],
         ...,
         [ 0.5987,  0.1660,  0.3433,  ...,  0.3329, -0.4658, -0.0718],
         [ 0.5987,  0.1660,  0.3433,  ...,  0.3329, -0.4658, -0.0718],
         [ 0.5987,  0.1660,  0.3433,  ...,  0.3329, -0.4658, -0.0718]]])
[LOG] after ReLU: shape=(2, 100, 48), dtype=torch.float32
[LOG] after ReLU full contents:
tensor([[[0.5987, 0.1660, 0.3433,  ..., 0.3329, 0.0000, 0.0000],
         [0.5987, 0.1660, 0.3433,  ..., 0.3329, 0.0000, 0.0000],
         [0.5987, 0.1660, 0.3433,  ..., 0.3329, 0.0000, 0.0000],
         ...,
         [0.5987, 0.1660, 0.3433,  ..., 0.3329, 0.0000, 0.0000],
         [0.5987, 0.1660, 0.3433,  ..., 0.3329, 0.0000, 0.0000],
         [0.5987, 0.1660, 0.3433,  ..., 0.3329, 0.0000, 0.0000]],

        [[0.5987, 0.1660, 0.3433,  ..., 0.3329, 0.0000, 0.0000],
         [0.5987, 0.1660, 0.3433,  ..., 0.3329, 0.0000, 0.0000],
         [0.5987, 0.1660, 0.3433,  ..., 0.3329, 0.0000, 0.0000],
         ...,
         [0.5987, 0.1660, 0.3433,  ..., 0.3329, 0.0000, 0.0000],
         [0.5987, 0.1660, 0.3433,  ..., 0.3329, 0.0000, 0.0000],
         [0.5987, 0.1660, 0.3433,  ..., 0.3329, 0.0000, 0.0000]]])
[LOG] dense2 output: shape=(2, 100, 24), dtype=torch.float32
[LOG] dense2 output full contents:
tensor([[[ 0.1860,  0.0345,  0.1935,  ..., -0.1780, -0.1262, -0.2303],
         [ 0.1860,  0.0345,  0.1935,  ..., -0.1780, -0.1262, -0.2303],
         [ 0.1860,  0.0345,  0.1935,  ..., -0.1780, -0.1262, -0.2303],
         ...,
         [ 0.1860,  0.0345,  0.1935,  ..., -0.1780, -0.1262, -0.2303],
         [ 0.1860,  0.0345,  0.1935,  ..., -0.1780, -0.1262, -0.2303],
         [ 0.1860,  0.0345,  0.1935,  ..., -0.1780, -0.1262, -0.2303]],

        [[ 0.1860,  0.0345,  0.1935,  ..., -0.1780, -0.1262, -0.2303],
         [ 0.1860,  0.0345,  0.1935,  ..., -0.1780, -0.1262, -0.2303],
         [ 0.1860,  0.0345,  0.1935,  ..., -0.1780, -0.1262, -0.2303],
         ...,
         [ 0.1860,  0.0345,  0.1935,  ..., -0.1780, -0.1262, -0.2303],
         [ 0.1860,  0.0345,  0.1935,  ..., -0.1780, -0.1262, -0.2303],
         [ 0.1860,  0.0345,  0.1935,  ..., -0.1780, -0.1262, -0.2303]]])
[LOG] ===== leaving PositionWiseFFN.forward =====

Step 4: AddNorm2 (the second residual)


[LOG] output after AddNorm2 (final EncoderBlock output): shape=(2, 100, 24), dtype=torch.float32
[LOG] output after AddNorm2 (final EncoderBlock output) full contents:
tensor([[[ 1.3459,  0.0299, -1.4830,  ...,  0.5057,  0.6380, -0.4027],
         [ 1.3459,  0.0299, -1.4830,  ...,  0.5057,  0.6380, -0.4027],
         [ 1.3459,  0.0299, -1.4830,  ...,  0.5057,  0.6380, -0.4027],
         ...,
         [ 1.3459,  0.0299, -1.4830,  ...,  0.5057,  0.6380, -0.4027],
         [ 1.3459,  0.0299, -1.4830,  ...,  0.5057,  0.6380, -0.4027],
         [ 1.3459,  0.0299, -1.4830,  ...,  0.5057,  0.6380, -0.4027]],

        [[ 1.3459,  0.0299, -1.4830,  ...,  0.5057,  0.6380, -0.4027],
         [ 1.3459,  0.0299, -1.4830,  ...,  0.5057,  0.6380, -0.4027],
         [ 1.3459,  0.0299, -1.4830,  ...,  0.5057,  0.6380, -0.4027],
         ...,
         [ 1.3459,  0.0299, -1.4830,  ...,  0.5057,  0.6380, -0.4027],
         [ 1.3459,  0.0299, -1.4830,  ...,  0.5057,  0.6380, -0.4027],
         [ 1.3459,  0.0299, -1.4830,  ...,  0.5057,  0.6380, -0.4027]]])

Architecture summary


input X (2,100,24)
      │
      ▼
┌────────────────────┐
│ Multi-HeadAttention│
└────────────────────┘
      │
      ▼
┌────────────────────┐
│ AddNorm (X + attn) │
└────────────────────┘
      │
      ▼
        Y
      │
      ▼
┌────────────────────┐
│ PositionWise FFN   │
└────────────────────┘
      │
      ▼
┌────────────────────┐
│ AddNorm (Y + ffn)  │
└────────────────────┘
      │
      ▼
    output out

1️⃣ Division of labor: Attention vs FFN

Module     Role
Attention  communication between tokens
FFN        processing within each token

2️⃣ What the residual does

X + sublayer output

👉 prevents:

  • vanishing gradients
  • loss of information

3️⃣ What LayerNorm does

👉 ensures:

  • numerical stability
  • training convergence

Full encoder code


#@save
class TransformerEncoder(d2l.Encoder):
    """Transformer encoder"""
    def __init__(self, vocab_size, key_size, query_size, value_size,
                 num_hiddens, norm_shape, ffn_num_input, ffn_num_hiddens,
                 num_heads, num_layers, dropout, use_bias=False, **kwargs):
        super(TransformerEncoder, self).__init__(**kwargs)
        # Keep the hidden dimension; used later to rescale the embeddings
        self.num_hiddens = num_hiddens
        # Word embedding: maps discrete token ids to continuous vectors
        self.embedding = nn.Embedding(vocab_size, num_hiddens)
        # Positional encoding: injects position information into the sequence
        self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout)
        # Container for the stack of encoder blocks
        self.blks = nn.Sequential()
        for i in range(num_layers):
            # Add encoder blocks layer by layer to form a deep feature extractor
            self.blks.add_module("block"+str(i),
                EncoderBlock(key_size, query_size, value_size, num_hiddens,
                             norm_shape, ffn_num_input, ffn_num_hiddens,
                             num_heads, dropout, use_bias))

        print("[LOG] TransformerEncoder initialized")
        print(f"[LOG] vocab_size={vocab_size}, num_hiddens={num_hiddens}, "
              f"num_heads={num_heads}, num_layers={num_layers}, dropout={dropout}")
        print(f"[LOG] norm_shape={norm_shape}, ffn_num_input={ffn_num_input}, "
              f"ffn_num_hiddens={ffn_num_hiddens}")

    def forward(self, X, valid_lens, *args):
        print("\n[LOG] ========= entering TransformerEncoder.forward =========")
        log_tensor_info("TransformerEncoder input token ids X", X)
        print(f"[LOG] valid_lens={valid_lens}")

        # Because the positional-encoding values lie between -1 and 1,
        # the embeddings are rescaled by the square root of the embedding
        # dimension before the two are summed.
        emb = self.embedding(X)
        log_tensor_info("embedding output embedding(X)", emb)

        scaled_emb = emb * math.sqrt(self.num_hiddens)
        log_tensor_info("scaled embedding scaled_emb", scaled_emb)

        X = self.pos_encoding(scaled_emb)
        log_tensor_info("input after positional encoding", X)

        self.attention_weights = [None] * len(self.blks)
        for i, blk in enumerate(self.blks):
            print(f"[LOG] ---- entering encoder block {i} ----")
            X = blk(X, valid_lens)
            log_tensor_info(f"block{i} output", X)
            self.attention_weights[
                i] = blk.attention.attention.attention_weights
            log_tensor_info(f"block{i} attention weights", self.attention_weights[i])
            print(f"[LOG] ---- leaving encoder block {i} ----")

        print("[LOG] ========= leaving TransformerEncoder.forward =========\n")
        return X
    
print("\n[LOG] ===== 完整TransformerEncoder测试 =====")
encoder = TransformerEncoder(
    200, 24, 24, 24, 24, [100, 24], 24, 48, 8, 2, 0.5)
encoder.eval()
enc_input = torch.ones((2, 100), dtype=torch.long)
log_tensor_info("完整编码器测试输入 enc_input", enc_input)
enc_output = encoder(enc_input, valid_lens)
print(f"[LOG] 完整编码器输出形状: {tuple(enc_output.shape)}")
log_tensor_info("完整编码器输出 enc_output", enc_output)
print("[LOG] ===== 完整TransformerEncoder测试结束 =====\n")

The big picture


token id
   │
   ▼
Embedding (word vectors)
   │
   ▼
scale by √d_model
   │
   ▼
positional encoding
   │
   ▼
EncoderBlock × N (2 layers here)
   │
   ▼
output features

Input data

Dimension  Meaning
2          batch
100        sequence length
value=1    every token id is 1

enc_input = torch.ones((2, 100), dtype=torch.long)

👉 This detail matters:

⚠️ every token is the same word

[LOG] full encoder test input enc_input: shape=(2, 100), dtype=torch.int64
[LOG] full encoder test input enc_input full contents:
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1]])

Step 1: Embedding


emb = self.embedding(X)

📌 In essence
token id → table lookup → vector

📐 Shape change
(2,100) → (2,100,24)
🔥 Key observation

Because:

every token id = 1

👉 it follows that:

every row of the embedding output is identical

🧠 State at this point

All tokens in the sequence:
identical content ❌
position unknown ❌


[LOG] embedding output embedding(X): shape=(2, 100, 24), dtype=torch.float32
[LOG] embedding output embedding(X) full contents:
tensor([[[ 1.1125, -0.0217,  0.2430,  ...,  0.4983, -0.5469, -0.2665],
         [ 1.1125, -0.0217,  0.2430,  ...,  0.4983, -0.5469, -0.2665],
         [ 1.1125, -0.0217,  0.2430,  ...,  0.4983, -0.5469, -0.2665],
         ...,
         [ 1.1125, -0.0217,  0.2430,  ...,  0.4983, -0.5469, -0.2665],
         [ 1.1125, -0.0217,  0.2430,  ...,  0.4983, -0.5469, -0.2665],
         [ 1.1125, -0.0217,  0.2430,  ...,  0.4983, -0.5469, -0.2665]],

        [[ 1.1125, -0.0217,  0.2430,  ...,  0.4983, -0.5469, -0.2665],
         [ 1.1125, -0.0217,  0.2430,  ...,  0.4983, -0.5469, -0.2665],
         [ 1.1125, -0.0217,  0.2430,  ...,  0.4983, -0.5469, -0.2665],
         ...,
         [ 1.1125, -0.0217,  0.2430,  ...,  0.4983, -0.5469, -0.2665],
         [ 1.1125, -0.0217,  0.2430,  ...,  0.4983, -0.5469, -0.2665],
         [ 1.1125, -0.0217,  0.2430,  ...,  0.4983, -0.5469, -0.2665]]])

👉 In one sentence: it turns discrete token ids into continuous vector representations


token id:     1      5      10

              │      │      │
              ▼      ▼      ▼

Embedding:
        [v1]   [v5]   [v10]

        │      │      │
        ▼      ▼      ▼

now continuous vectors (similarities can be computed)
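
A minimal sketch of the lookup itself: nn.Embedding is just a learnable table, and indexing row i returns token i's vector.


emb_table = nn.Embedding(10, 4)  # 10 tokens, 4-dim vectors
ids = torch.tensor([1, 5, 1])
vecs = emb_table(ids)            # rows 1, 5, 1 of the table -> shape (3, 4)
print(vecs.shape)
print(torch.equal(vecs[0], vecs[2]))  # True: same id, same vector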

Step 2: scaling (an important detail)


scaled_emb = emb * sqrt(24)
📌 Why multiply by √d_model?

👉 To prevent:

the embeddings from being too small and drowned out by the positional encoding
🔥 In essence

it keeps the embeddings and the positional encoding on comparable scales, as the sketch below illustrates
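
A minimal sketch of the magnitudes involved (assumption: embedding weights are roughly N(0, 1), which is nn.Embedding's default initialization), compared with sin/cos values that lie in [-1, 1].


e = torch.randn(100, 24)                  # stand-in for raw embedding outputs
print(e.abs().mean())                     # ~0.8: same order as the positional encoding
print((e * math.sqrt(24)).abs().mean())   # ~3.9: after scaling, embeddings dominate the sum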

[LOG] scaled embedding scaled_emb: shape=(2, 100, 24), dtype=torch.float32
[LOG] scaled embedding scaled_emb full contents:
tensor([[[ 5.4503, -0.1061,  1.1904,  ...,  2.4411, -2.6792, -1.3057],
         [ 5.4503, -0.1061,  1.1904,  ...,  2.4411, -2.6792, -1.3057],
         [ 5.4503, -0.1061,  1.1904,  ...,  2.4411, -2.6792, -1.3057],
         ...,
         [ 5.4503, -0.1061,  1.1904,  ...,  2.4411, -2.6792, -1.3057],
         [ 5.4503, -0.1061,  1.1904,  ...,  2.4411, -2.6792, -1.3057],
         [ 5.4503, -0.1061,  1.1904,  ...,  2.4411, -2.6792, -1.3057]],

        [[ 5.4503, -0.1061,  1.1904,  ...,  2.4411, -2.6792, -1.3057],
         [ 5.4503, -0.1061,  1.1904,  ...,  2.4411, -2.6792, -1.3057],
         [ 5.4503, -0.1061,  1.1904,  ...,  2.4411, -2.6792, -1.3057],
         ...,
         [ 5.4503, -0.1061,  1.1904,  ...,  2.4411, -2.6792, -1.3057],
         [ 5.4503, -0.1061,  1.1904,  ...,  2.4411, -2.6792, -1.3057],
         [ 5.4503, -0.1061,  1.1904,  ...,  2.4411, -2.6792, -1.3057]]])

Step 3: positional encoding (the key step)


X = self.pos_encoding(scaled_emb)

📌 In essence

X = embedding + position_encoding

🔥 Effect (crucial)

👉 it gives each token a "position signature"

🧠 A concrete picture
token0 → embedding + pos(0)
token1 → embedding + pos(1)
token2 → embedding + pos(2)

🎯 Result

👉 even though:

all tokens have identical content

still:

the final vectors are no longer identical
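
A minimal sketch of the sinusoidal encoding that d2l.PositionalEncoding uses: P[pos, 2j] = sin(pos / 10000^(2j/d)) and P[pos, 2j+1] = cos(pos / 10000^(2j/d)), so every position gets a distinct pattern.


d, max_len = 24, 100
pos = torch.arange(max_len, dtype=torch.float32).reshape(-1, 1)
freq = torch.pow(10000, torch.arange(0, d, 2, dtype=torch.float32) / d)
P = torch.zeros(max_len, d)
P[:, 0::2] = torch.sin(pos / freq)  # even feature indices
P[:, 1::2] = torch.cos(pos / freq)  # odd feature indices
print(P[0, :4], P[1, :4])           # position 0 and position 1 already differ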


[LOG] input after positional encoding: shape=(2, 100, 24), dtype=torch.float32
[LOG] input after positional encoding full contents:
tensor([[[ 5.4503,  0.8939,  1.1904,  ...,  3.4411, -2.6792, -0.3057],
         [ 6.2918,  0.4342,  1.6380,  ...,  3.4411, -2.6789, -0.3057],
         [ 6.3596, -0.5223,  1.9910,  ...,  3.4411, -2.6787, -0.3057],
         ...,
         [ 5.8299, -1.0313,  2.0533,  ...,  3.4401, -2.6583, -0.3059],
         [ 4.8770, -0.9254,  2.1882,  ...,  3.4401, -2.6580, -0.3059],
         [ 4.4511, -0.0663,  2.1119,  ...,  3.4401, -2.6578, -0.3059]],

        [[ 5.4503,  0.8939,  1.1904,  ...,  3.4411, -2.6792, -0.3057],
         [ 6.2918,  0.4342,  1.6380,  ...,  3.4411, -2.6789, -0.3057],
         [ 6.3596, -0.5223,  1.9910,  ...,  3.4411, -2.6787, -0.3057],
         ...,
         [ 5.8299, -1.0313,  2.0533,  ...,  3.4401, -2.6583, -0.3059],
         [ 4.8770, -0.9254,  2.1882,  ...,  3.4401, -2.6580, -0.3059],
         [ 4.4511, -0.0663,  2.1119,  ...,  3.4401, -2.6578, -0.3059]]])

Step 4: through the EncoderBlocks (loop)


for i, blk in enumerate(self.blks):
    X = blk(X, valid_lens)

Here:

num_layers = 2

👉 so the loop runs:

block0 → block1

block0


[LOG] block0 output: shape=(2, 100, 24), dtype=torch.float32
[LOG] block0 output full contents:
tensor([[[ 1.3703,  0.3976,  0.2605,  ...,  1.5080, -0.5639, -0.1134],
         [ 1.5673,  0.3092,  0.3691,  ...,  1.5192, -0.5570, -0.1067],
         [ 1.5783,  0.1033,  0.4716,  ...,  1.5373, -0.5553, -0.1022],
         ...,
         [ 1.4608,  0.0285,  0.4712,  ...,  1.5040, -0.6001, -0.1518],
         [ 1.2321,  0.0456,  0.5011,  ...,  1.5086, -0.6077, -0.1597],
         [ 1.1312,  0.2147,  0.4753,  ...,  1.5047, -0.5998, -0.1663]],

        [[ 1.3797,  0.3918,  0.2551,  ...,  1.5219, -0.5632, -0.1195],
         [ 1.5769,  0.3033,  0.3638,  ...,  1.5330, -0.5562, -0.1128],
         [ 1.5878,  0.0973,  0.4663,  ...,  1.5512, -0.5544, -0.1084],
         ...,
         [ 1.4708,  0.0216,  0.4664,  ...,  1.5181, -0.5991, -0.1568],
         [ 1.2421,  0.0383,  0.4963,  ...,  1.5232, -0.6076, -0.1645],
         [ 1.1413,  0.2074,  0.4706,  ...,  1.5192, -0.5997, -0.1710]]])


[LOG] ---- leaving encoder block 0 ----

block1


[LOG] block1 output: shape=(2, 100, 24), dtype=torch.float32
[LOG] block1 output full contents:
tensor([[[ 1.8015,  0.0294,  0.2230,  ...,  1.4384, -1.0806, -0.3897],
         [ 1.9988, -0.0460,  0.3466,  ...,  1.4391, -1.0542, -0.3975],
         [ 2.0112, -0.2198,  0.4557,  ...,  1.4599, -1.0405, -0.4251],
         ...,
         [ 1.8576, -0.2355,  0.4971,  ...,  1.4461, -1.1070, -0.4749],
         [ 1.6374, -0.2058,  0.5191,  ...,  1.4544, -1.1267, -0.4890],
         [ 1.5407, -0.0497,  0.4826,  ...,  1.4481, -1.1297, -0.4755]],

        [[ 1.8135,  0.0225,  0.2285,  ...,  1.4585, -1.0721, -0.3998],
         [ 2.0110, -0.0528,  0.3524,  ...,  1.4586, -1.0452, -0.4077],
         [ 2.0234, -0.2268,  0.4614,  ...,  1.4795, -1.0315, -0.4353],
         ...,
         [ 1.8688, -0.2437,  0.5032,  ...,  1.4652, -1.0981, -0.4842],
         [ 1.6486, -0.2145,  0.5251,  ...,  1.4740, -1.1187, -0.4980],
         [ 1.5521, -0.0584,  0.4887,  ...,  1.4675, -1.1217, -0.4844]]])

[LOG] ---- leaving encoder block 1 ----

What happens inside each block (recap)

Each block = one round of "global information fusion + local feature processing + stable updates"


          token representations
               │
               ▼
      ┌────────────────┐
      │ Self-Attention │  ← information exchange
      └────────────────┘
               │
               ▼
        Add & Norm
               │
               ▼
      ┌────────────────┐
      │      FFN       │  ← feature processing
      └────────────────┘
               │
               ▼
        Add & Norm
               │
               ▼
      stronger representations

Layer   Role
block0  basic semantic relations
block1  higher-level abstraction
blockN  deep semantics / reasoning

Saving attention_weights


self.attention_weights[i] = blk.attention.attention.attention_weights

📌 Purpose

👉 keeps each layer's attention matrix:

shape = (batch × heads, seq, seq), here (16, 100, 100), with the heads folded into the first axis

🔥 Useful for:

visualizing attention
analyzing what the model attends to (see the sketch below)
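
A minimal sketch of recovering per-head weights (assumption: d2l's transpose_qkv lays out the first axis batch-major, i.e. index = batch * num_heads + head):


w = encoder.attention_weights[0]   # (batch*heads, seq, seq) = (16, 100, 100)
w = w.reshape(2, 8, 100, 100)      # (batch, heads, seq, seq)
print(w[0, 0, 0, :5])              # head 0's weights for query 0 of sample 0
# d2l.show_heatmaps(w.detach(), xlabel='Key', ylabel='Query') would plot them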

block0


[LOG] block0 attention weights: shape=(16, 100, 100), dtype=torch.float32
tensor([[[0.3146, 0.3766, 0.3088,  ..., 0.0000, 0.0000, 0.0000],
         [0.3127, 0.3777, 0.3096,  ..., 0.0000, 0.0000, 0.0000],
         [0.3107, 0.3761, 0.3132,  ..., 0.0000, 0.0000, 0.0000],
         ...,
         [0.3032, 0.3818, 0.3151,  ..., 0.0000, 0.0000, 0.0000],
         [0.3031, 0.3805, 0.3164,  ..., 0.0000, 0.0000, 0.0000],
         [0.3035, 0.3818, 0.3148,  ..., 0.0000, 0.0000, 0.0000]],

        [[0.3847, 0.2943, 0.3210,  ..., 0.0000, 0.0000, 0.0000],
         [0.3868, 0.2923, 0.3209,  ..., 0.0000, 0.0000, 0.0000],
         [0.3883, 0.2909, 0.3209,  ..., 0.0000, 0.0000, 0.0000],
         ...,
         [0.3890, 0.2887, 0.3223,  ..., 0.0000, 0.0000, 0.0000],
         [0.3895, 0.2891, 0.3214,  ..., 0.0000, 0.0000, 0.0000],
         [0.3903, 0.2895, 0.3203,  ..., 0.0000, 0.0000, 0.0000]],

        [[0.2770, 0.3312, 0.3917,  ..., 0.0000, 0.0000, 0.0000],
         [0.2777, 0.3305, 0.3918,  ..., 0.0000, 0.0000, 0.0000],
         [0.2801, 0.3319, 0.3880,  ..., 0.0000, 0.0000, 0.0000],
         ...,
         [0.2881, 0.3383, 0.3737,  ..., 0.0000, 0.0000, 0.0000],
         [0.2906, 0.3403, 0.3692,  ..., 0.0000, 0.0000, 0.0000],
         [0.2908, 0.3404, 0.3688,  ..., 0.0000, 0.0000, 0.0000]],

        ...,

        [[0.4926, 0.5074, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.4922, 0.5078, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.4919, 0.5081, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         ...,
         [0.4905, 0.5095, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.4915, 0.5085, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.4927, 0.5073, 0.0000,  ..., 0.0000, 0.0000, 0.0000]],

        [[0.4938, 0.5062, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.4911, 0.5089, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.4904, 0.5096, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         ...,
         [0.4954, 0.5046, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.4975, 0.5025, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.4977, 0.5023, 0.0000,  ..., 0.0000, 0.0000, 0.0000]],

        [[0.5312, 0.4688, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.5317, 0.4683, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.5338, 0.4662, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         ...,
         [0.5346, 0.4654, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.5341, 0.4659, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.5326, 0.4674, 0.0000,  ..., 0.0000, 0.0000, 0.0000]]])

block1


[LOG] block1 attention weights: shape=(16, 100, 100), dtype=torch.float32
[LOG] block1 attention weights full contents:
tensor([[[0.3344, 0.3326, 0.3330,  ..., 0.0000, 0.0000, 0.0000],
         [0.3344, 0.3326, 0.3330,  ..., 0.0000, 0.0000, 0.0000],
         [0.3343, 0.3326, 0.3330,  ..., 0.0000, 0.0000, 0.0000],
         ...,
         [0.3343, 0.3327, 0.3330,  ..., 0.0000, 0.0000, 0.0000],
         [0.3343, 0.3327, 0.3330,  ..., 0.0000, 0.0000, 0.0000],
         [0.3342, 0.3327, 0.3330,  ..., 0.0000, 0.0000, 0.0000]],

        [[0.3332, 0.3316, 0.3352,  ..., 0.0000, 0.0000, 0.0000],
         [0.3330, 0.3316, 0.3353,  ..., 0.0000, 0.0000, 0.0000],
         [0.3328, 0.3316, 0.3355,  ..., 0.0000, 0.0000, 0.0000],
         ...,
         [0.3321, 0.3312, 0.3368,  ..., 0.0000, 0.0000, 0.0000],
         [0.3322, 0.3311, 0.3367,  ..., 0.0000, 0.0000, 0.0000],
         [0.3323, 0.3311, 0.3366,  ..., 0.0000, 0.0000, 0.0000]],

        [[0.3324, 0.3322, 0.3354,  ..., 0.0000, 0.0000, 0.0000],
         [0.3324, 0.3323, 0.3354,  ..., 0.0000, 0.0000, 0.0000],
         [0.3323, 0.3323, 0.3354,  ..., 0.0000, 0.0000, 0.0000],
         ...,
         [0.3321, 0.3321, 0.3358,  ..., 0.0000, 0.0000, 0.0000],
         [0.3321, 0.3321, 0.3357,  ..., 0.0000, 0.0000, 0.0000],
         [0.3322, 0.3321, 0.3357,  ..., 0.0000, 0.0000, 0.0000]],

        ...,

        [[0.4932, 0.5068, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.4930, 0.5070, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.4928, 0.5072, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         ...,
         [0.4938, 0.5062, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.4939, 0.5061, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.4939, 0.5061, 0.0000,  ..., 0.0000, 0.0000, 0.0000]],

        [[0.5011, 0.4989, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.5012, 0.4988, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.5011, 0.4989, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         ...,
         [0.4990, 0.5010, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.4984, 0.5016, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.4981, 0.5019, 0.0000,  ..., 0.0000, 0.0000, 0.0000]],

        [[0.5016, 0.4984, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.5017, 0.4983, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.5017, 0.4983, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         ...,
         [0.5020, 0.4980, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.5019, 0.4981, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.5018, 0.4982, 0.0000,  ..., 0.0000, 0.0000, 0.0000]]])

TransformerEncoder.forward does three things:

  • 👉 turns tokens into vectors (embedding)
  • 👉 injects position information (positional encoding)
  • 👉 extracts contextual features with a stack of EncoderBlocks

Parameter count of the encoder

Model configuration


vocab_size = 200
num_hiddens = 24
num_heads = 8
num_layers = 2
ffn_num_hiddens = 48

1️⃣ Embedding layer


nn.Embedding(200, 24)

👉 parameter count:

200 × 24 = 4800

2️⃣ Positional encoding


d2l.PositionalEncoding

👉 is:

a fixed sin/cos function ❗

👉 parameter count:

0

3️⃣ Parameters of one EncoderBlock


3.1 Multi-head attention (MultiHeadAttention)

In the d2l implementation this amounts to four linear layers:

Wq, Wk, Wv, Wo

Each linear layer (use_bias=False here, the default, so no bias term):
24 × 24 = 576

All four together: 576 × 4 = 2304

✅ Attention total: 2304

3.2 AddNorm (first)
LayerNorm([100,24])

👉 parameters:

gamma: 100×24
beta:  100×24
= 2 × (100×24)
= 4800


3.3 FFN (feed-forward network)
24 → 48 → 24
First layer:
24×48 + 48 = 1152 + 48 = 1200
Second layer:
48×24 + 24 = 1152 + 24 = 1176
Total:
1200 + 1176 = 2376
3.4 AddNorm (second)

Same as above:

4800

✅ Total for one EncoderBlock
Attention     = 2304
AddNorm1      = 4800
FFN           = 2376
AddNorm2      = 4800
--------------------------------
Total         = 14280

2 EncoderBlocks

14280 × 2 = 28560

Grand total


Embedding        = 4800
EncoderBlocks    = 28560
--------------------------------
Total            = 33360
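
A minimal sketch that checks the arithmetic directly against the encoder instantiated above (33360 assumes use_bias=False, the default, so the four attention projections carry no bias terms):


total = sum(p.numel() for p in encoder.parameters())
print(total)  # expected: 33360 = 4800 (embedding) + 2 × 14280 (per block)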

Decoder

Decoder code, part 1


class DecoderBlock(nn.Module):
    """The i-th block of the decoder"""
    def __init__(self, key_size, query_size, value_size, num_hiddens,
                 norm_shape, ffn_num_input, ffn_num_hiddens, num_heads,
                 dropout, i, **kwargs):
        super(DecoderBlock, self).__init__(**kwargs)
        # Record this block's index so state[2] can cache each layer's
        # previously decoded representations
        self.i = i
        # Sublayer 1: masked self-attention (decoder self-attention)
        # Newer d2l.MultiHeadAttention signature: (num_hiddens, num_heads, dropout, bias=False)
        self.attention1 = d2l.MultiHeadAttention(
            num_hiddens, num_heads, dropout)
        self.addnorm1 = AddNorm(norm_shape, dropout)
        # Sublayer 2: encoder-decoder (cross) attention
        self.attention2 = d2l.MultiHeadAttention(
            num_hiddens, num_heads, dropout)
        self.addnorm2 = AddNorm(norm_shape, dropout)
        # Sublayer 3: position-wise feed-forward network
        self.ffn = PositionWiseFFN(ffn_num_input, ffn_num_hiddens,
                                   num_hiddens)
        self.addnorm3 = AddNorm(norm_shape, dropout)

        print(f"[LOG] DecoderBlock initialized: block_index={i}")
        print(f"[LOG] num_hiddens={num_hiddens}, num_heads={num_heads}, "
              f"ffn_num_hiddens={ffn_num_hiddens}, dropout={dropout}")

    def forward(self, X, state):
        print(f"\n[LOG] ===== entering DecoderBlock.forward (block {self.i}) =====")
        log_tensor_info("DecoderBlock input X", X)

        enc_outputs, enc_valid_lens = state[0], state[1]
        log_tensor_info("encoder output enc_outputs", enc_outputs)
        print(f"[LOG] encoder valid lengths enc_valid_lens={enc_valid_lens}")

        # During training, all tokens of the output sequence are processed at
        # the same time, so state[2][self.i] is initialized to None.
        # During prediction, the output sequence is decoded token by token, so
        # state[2][self.i] holds the representations decoded by block i up to
        # the current time step.
        if state[2][self.i] is None:
            key_values = X
            print(f"[LOG] block {self.i} cache empty; key_values is just the current X")
        else:
            key_values = torch.cat((state[2][self.i], X), dim=1)
            print(f"[LOG] block {self.i} concatenating cached history with current X")
        state[2][self.i] = key_values
        log_tensor_info("current block cache key_values", key_values)

        if self.training:
            batch_size, num_steps, _ = X.shape
            # dec_valid_lens: (batch_size, num_steps), where every row is
            # [1, 2, ..., num_steps]
            dec_valid_lens = torch.arange(
                1, num_steps + 1, device=X.device).repeat(batch_size, 1)
            log_tensor_info("training-mode dec_valid_lens", dec_valid_lens)
        else:
            dec_valid_lens = None
            print("[LOG] inference mode: dec_valid_lens=None")

        # Sublayer 1: masked self-attention + AddNorm
        X2 = self.attention1(X, key_values, key_values, dec_valid_lens)
        log_tensor_info("self-attention output X2", X2)
        Y = self.addnorm1(X, X2)
        log_tensor_info("AddNorm1 output Y", Y)

        # Encoder-decoder attention.
        # enc_outputs has shape (batch_size, num_steps, num_hiddens)
        # Sublayer 2: cross-attention + AddNorm
        Y2 = self.attention2(Y, enc_outputs, enc_outputs, enc_valid_lens)
        log_tensor_info("cross-attention output Y2", Y2)
        Z = self.addnorm2(Y, Y2)
        log_tensor_info("AddNorm2 output Z", Z)

        # Sublayer 3: FFN + AddNorm
        ffn_out = self.ffn(Z)
        log_tensor_info("FFN output ffn_out", ffn_out)
        out = self.addnorm3(Z, ffn_out)
        log_tensor_info("final DecoderBlock output out", out)
        print(f"[LOG] ===== leaving DecoderBlock.forward (block {self.i}) =====\n")
        return out, state
    
decoder_blk = DecoderBlock(24, 24, 24, 24, [100, 24], 24, 48, 8, 0.5, 0)
decoder_blk.eval()
X = torch.ones((2, 100, 24))
state = [encoder_blk(X, valid_lens), valid_lens, [None]]
print("\n[LOG] ===== decoder block test =====")
log_tensor_info("decoder test input X", X)
decoder_out, new_state = decoder_blk(X, state)
print(f"[LOG] DecoderBlock output shape: {tuple(decoder_out.shape)}")
log_tensor_info("DecoderBlock output decoder_out", decoder_out)
print(f"[LOG] number of cached layers in state: {len(new_state[2])}")
print("[LOG] ===== end of decoder block test =====")

Overall structure


input X (the sequence generated so far)
        │
        ▼
1️⃣ Masked Self-Attention   ← may only look at the "past"
        │
        ▼
2️⃣ Cross Attention         ← looks at the encoder (source sentence)
        │
        ▼
3️⃣ FFN                     ← non-linear processing

Constructing a DecoderBlock


DecoderBlock(24, 24, 24, 24, [100, 24], 24, 48, 8, 0.5, 0)

Parameter  Meaning
24         hidden size
48         FFN intermediate dimension
8          8 attention heads
[100,24]   LayerNorm shape
i=0        decoder block 0

Input data


[LOG] decoder test input X: shape=(2, 100, 24), dtype=torch.float32
[LOG] decoder test input X full contents:
tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.]]])

Constructing state


state = [encoder_blk(X, valid_lens), valid_lens, [None]]
Structure
state = [
    enc_outputs,        # encoder output
    valid_lens,         # encoder mask
    cache               # decoder cache
]
cache
[None]

👉 means:

decoder block 0 has no cached history yet

# encoder output

tensor([[[-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         ...,
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131]],

        [[-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         ...,
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131]]])

Now into DecoderBlock.forward

Step 1: fetch the encoder information


enc_outputs, enc_valid_lens = state[0], state[1]
Result
enc_outputs.shape = (2, 100, 24)
enc_valid_lens = [3, 2]

👉 meaning:

each sample attends only to its first few source tokens

[LOG] encoder output enc_outputs: shape=(2, 100, 24), dtype=torch.float32
[LOG] encoder output enc_outputs full contents:
tensor([[[-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         ...,
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131]],

        [[-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         ...,
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131],
         [-0.6796,  1.1780, -1.1474,  ...,  2.4620,  0.2389,  0.1131]]])
[LOG] encoder valid lengths enc_valid_lens=tensor([3, 2])
[LOG] block 0 cache empty; key_values is just the current X

Step 2: building key_values (the cache mechanism)


if state[2][self.i] is None:
    key_values = X
Current situation
state[2][0] = None

👉 therefore:

key_values = X
Updating the cache
state[2][self.i] = key_values

👉 the cache becomes:

state[2] = [X]

🧠 Explanation

👉 This is part of an inference-time optimization (a KV cache):

but here we

feed all 100 tokens at once (training style)

👉 so the cache provides no benefit in this run; the sketch below shows it paying off during step-by-step decoding
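
A minimal sketch of the cache during token-by-token decoding. A fresh block with norm_shape=[24] is used (LayerNorm over the feature dimension only), because the [100, 24] shape above hard-wires a sequence length of 100 and would reject single-step inputs.


blk = DecoderBlock(24, 24, 24, 24, [24], 24, 48, 8, 0.5, 0).eval()
# Stand-in encoder output (2, 6, 24) with valid lengths [3, 2]
step_state = [torch.ones(2, 6, 24), torch.tensor([3, 2]), [None]]
for t in range(3):
    # One time step per call, as in autoregressive inference
    _, step_state = blk(torch.ones(2, 1, 24), step_state)
    print(step_state[2][0].shape)  # (2,1,24) → (2,2,24) → (2,3,24): the cache grows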

Step 3: dec_valid_lens (the mask)


if self.training:
    ...
else:
    dec_valid_lens = None
We are in eval mode, so
dec_valid_lens = None

👉 which means:

no mask ❗
⚠️ Important

This implies:

tokens can see the future ❗

👉 That does not match genuine decoder inference, where tokens are generated one at a time and later positions simply do not exist yet

Step 4: masked self-attention


X2 = self.attention1(X, key_values, key_values, dec_valid_lens)
Which here amounts to
Q = X
K = X
V = X

👉 because:

key_values = X
⚠️ and there is no mask

👉 so in effect this is:

= plain self-attention (just like the encoder)
⚠️ On top of that
X is all ones

👉 therefore:

the attention output is virtually identical at every position

[LOG] self-attention output X2: shape=(2, 100, 24), dtype=torch.float32
[LOG] self-attention output X2 full contents:
tensor([[[-0.1445,  0.3495,  0.0163,  ..., -0.2074, -0.1041, -0.3527],
         [-0.1445,  0.3495,  0.0163,  ..., -0.2074, -0.1041, -0.3527],
         [-0.1445,  0.3495,  0.0163,  ..., -0.2074, -0.1041, -0.3527],
         ...,
         [-0.1445,  0.3495,  0.0163,  ..., -0.2074, -0.1041, -0.3527],
         [-0.1445,  0.3495,  0.0163,  ..., -0.2074, -0.1041, -0.3527],
         [-0.1445,  0.3495,  0.0163,  ..., -0.2074, -0.1041, -0.3527]],

        [[-0.1445,  0.3495,  0.0163,  ..., -0.2074, -0.1041, -0.3527],
         [-0.1445,  0.3495,  0.0163,  ..., -0.2074, -0.1041, -0.3527],
         [-0.1445,  0.3495,  0.0163,  ..., -0.2074, -0.1041, -0.3527],
         ...,
         [-0.1445,  0.3495,  0.0163,  ..., -0.2074, -0.1041, -0.3527],
         [-0.1445,  0.3495,  0.0163,  ..., -0.2074, -0.1041, -0.3527],
         [-0.1445,  0.3495,  0.0163,  ..., -0.2074, -0.1041, -0.3527]]])

Step 5: AddNorm1


Y = self.addnorm1(X, X2)
Computation
Y = LayerNorm(X + X2)

👉 output:

shape = (2, 100, 24)

[LOG] AddNorm1 output Y: shape=(2, 100, 24), dtype=torch.float32
[LOG] AddNorm1 output Y full contents:
tensor([[[-0.5387,  0.8173, -0.0974,  ..., -0.7115, -0.4280, -1.1104],
         [-0.5387,  0.8173, -0.0974,  ..., -0.7115, -0.4280, -1.1104],
         [-0.5387,  0.8173, -0.0974,  ..., -0.7115, -0.4280, -1.1104],
         ...,
         [-0.5387,  0.8173, -0.0974,  ..., -0.7115, -0.4280, -1.1104],
         [-0.5387,  0.8173, -0.0974,  ..., -0.7115, -0.4280, -1.1104],
         [-0.5387,  0.8173, -0.0974,  ..., -0.7115, -0.4280, -1.1104]],

        [[-0.5387,  0.8173, -0.0974,  ..., -0.7115, -0.4280, -1.1104],
         [-0.5387,  0.8173, -0.0974,  ..., -0.7115, -0.4280, -1.1104],
         [-0.5387,  0.8173, -0.0974,  ..., -0.7115, -0.4280, -1.1104],
         ...,
         [-0.5387,  0.8173, -0.0974,  ..., -0.7115, -0.4280, -1.1104],
         [-0.5387,  0.8173, -0.0974,  ..., -0.7115, -0.4280, -1.1104],
         [-0.5387,  0.8173, -0.0974,  ..., -0.7115, -0.4280, -1.1104]]])

AddNorm1 = fuse "the original information + the contextual information", while keeping the values numerically stable

Step 6: cross-attention


Y2 = self.attention2(Y, enc_outputs, enc_outputs, enc_valid_lens)

Spelled out:

Q = Y
K = enc_outputs
V = enc_outputs

[LOG] cross-attention output Y2: shape=(2, 100, 24), dtype=torch.float32
[LOG] cross-attention output Y2 full contents:
tensor([[[ 0.5185,  0.3513, -0.2478,  ...,  0.0662, -0.2986, -0.0262],
         [ 0.5185,  0.3513, -0.2478,  ...,  0.0662, -0.2986, -0.0262],
         [ 0.5185,  0.3513, -0.2478,  ...,  0.0662, -0.2986, -0.0262],
         ...,
         [ 0.5185,  0.3513, -0.2478,  ...,  0.0662, -0.2986, -0.0262],
         [ 0.5185,  0.3513, -0.2478,  ...,  0.0662, -0.2986, -0.0262],
         [ 0.5185,  0.3513, -0.2478,  ...,  0.0662, -0.2986, -0.0262]],

        [[ 0.5185,  0.3513, -0.2478,  ...,  0.0662, -0.2986, -0.0262],
         [ 0.5185,  0.3513, -0.2478,  ...,  0.0662, -0.2986, -0.0262],
         [ 0.5185,  0.3513, -0.2478,  ...,  0.0662, -0.2986, -0.0262],
         ...,
         [ 0.5185,  0.3513, -0.2478,  ...,  0.0662, -0.2986, -0.0262],
         [ 0.5185,  0.3513, -0.2478,  ...,  0.0662, -0.2986, -0.0262],
         [ 0.5185,  0.3513, -0.2478,  ...,  0.0662, -0.2986, -0.0262]]])

Cross-attention = use "the current generation state" to query "the information in the source sentence"

Type             Q        K        V        Role
Self-Attention   current  current  current  fuse information within the sequence
Cross-Attention  decoder  encoder  encoder  read the input's information

Step 7: AddNorm2


[LOG] AddNorm2 output Z: shape=(2, 100, 24), dtype=torch.float32
[LOG] AddNorm2 output Z full contents:
tensor([[[-0.0642,  1.0724, -0.3748,  ..., -0.6618, -0.7396, -1.1315],
         [-0.0642,  1.0724, -0.3748,  ..., -0.6618, -0.7396, -1.1315],
         [-0.0642,  1.0724, -0.3748,  ..., -0.6618, -0.7396, -1.1315],
         ...,
         [-0.0642,  1.0724, -0.3748,  ..., -0.6618, -0.7396, -1.1315],
         [-0.0642,  1.0724, -0.3748,  ..., -0.6618, -0.7396, -1.1315],
         [-0.0642,  1.0724, -0.3748,  ..., -0.6618, -0.7396, -1.1315]],

        [[-0.0642,  1.0724, -0.3748,  ..., -0.6618, -0.7396, -1.1315],
         [-0.0642,  1.0724, -0.3748,  ..., -0.6618, -0.7396, -1.1315],
         [-0.0642,  1.0724, -0.3748,  ..., -0.6618, -0.7396, -1.1315],
         ...,
         [-0.0642,  1.0724, -0.3748,  ..., -0.6618, -0.7396, -1.1315],
         [-0.0642,  1.0724, -0.3748,  ..., -0.6618, -0.7396, -1.1315],
         [-0.0642,  1.0724, -0.3748,  ..., -0.6618, -0.7396, -1.1315]]])

Step 8: FFN


[LOG] FFN output ffn_out: shape=(2, 100, 24), dtype=torch.float32
[LOG] FFN output ffn_out full contents:
tensor([[[-0.3130, -0.1238, -0.2373,  ..., -0.1869,  0.0442,  0.1742],
         [-0.3130, -0.1238, -0.2373,  ..., -0.1869,  0.0442,  0.1742],
         [-0.3130, -0.1238, -0.2373,  ..., -0.1869,  0.0442,  0.1742],
         ...,
         [-0.3130, -0.1238, -0.2373,  ..., -0.1869,  0.0442,  0.1742],
         [-0.3130, -0.1238, -0.2373,  ..., -0.1869,  0.0442,  0.1742],
         [-0.3130, -0.1238, -0.2373,  ..., -0.1869,  0.0442,  0.1742]],

        [[-0.3130, -0.1238, -0.2373,  ..., -0.1869,  0.0442,  0.1742],
         [-0.3130, -0.1238, -0.2373,  ..., -0.1869,  0.0442,  0.1742],
         [-0.3130, -0.1238, -0.2373,  ..., -0.1869,  0.0442,  0.1742],
         ...,
         [-0.3130, -0.1238, -0.2373,  ..., -0.1869,  0.0442,  0.1742],
         [-0.3130, -0.1238, -0.2373,  ..., -0.1869,  0.0442,  0.1742],
         [-0.3130, -0.1238, -0.2373,  ..., -0.1869,  0.0442,  0.1742]]])

AddNorm3 (final output)


[LOG] final DecoderBlock output out: shape=(2, 100, 24), dtype=torch.float32
[LOG] final DecoderBlock output out full contents:
tensor([[[-0.3319,  0.9702, -0.5628,  ..., -0.7952, -0.6445, -0.9018],
         [-0.3319,  0.9702, -0.5628,  ..., -0.7952, -0.6445, -0.9018],
         [-0.3319,  0.9702, -0.5628,  ..., -0.7952, -0.6445, -0.9018],
         ...,
         [-0.3319,  0.9702, -0.5628,  ..., -0.7952, -0.6445, -0.9018],
         [-0.3319,  0.9702, -0.5628,  ..., -0.7952, -0.6445, -0.9018],
         [-0.3319,  0.9702, -0.5628,  ..., -0.7952, -0.6445, -0.9018]],

        [[-0.3319,  0.9702, -0.5628,  ..., -0.7952, -0.6445, -0.9018],
         [-0.3319,  0.9702, -0.5628,  ..., -0.7952, -0.6445, -0.9018],
         [-0.3319,  0.9702, -0.5628,  ..., -0.7952, -0.6445, -0.9018],
         ...,
         [-0.3319,  0.9702, -0.5628,  ..., -0.7952, -0.6445, -0.9018],
         [-0.3319,  0.9702, -0.5628,  ..., -0.7952, -0.6445, -0.9018],
         [-0.3319,  0.9702, -0.5628,  ..., -0.7952, -0.6445, -0.9018]]])
[LOG] ===== Exiting DecoderBlock.forward (block 0) =====

What actually happened in this run


👉 Because of how this test was set up:

❗ Special point 1
X is all ones

👉 → every token is identical, so every output row is identical

❗ Special point 2
eval mode → no causal mask (d2l builds dec_valid_lens only in training mode)

👉 → each position can attend to future positions

❗ Special point 3
all 100 tokens are fed in at once

👉 → the KV cache never gets a chance to help

🔥 Net effect

👉 Your DecoderBlock effectively degenerates into:

≈ EncoderBlock + CrossAttention
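
Special point 2 follows directly from how d2l's DecoderBlock builds its mask; a paraphrased sketch of that branch (not the full forward method) shows why eval mode gets no causal mask:


# Paraphrased from the masking branch in d2l's DecoderBlock.forward
if self.training:
    batch_size, num_steps, _ = X.shape
    # Row i allows attention to positions 0..i only (the causal mask)
    dec_valid_lens = torch.arange(
        1, num_steps + 1, device=X.device).repeat(batch_size, 1)
else:
    # Prediction normally decodes one token at a time, so no mask is
    # built -- feeding 100 tokens at once in eval mode therefore lets
    # every position attend to every other position.
    dec_valid_lens = None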

The TransformerDecoder


class TransformerDecoder(d2l.AttentionDecoder):
    """Transformer decoder

    Structure:
      input token_ids (batch_size, target_seq_len)
        │
        ├──► embedding ──► scale ──► positional encoding ──► X0
        │
        ├──► [loop over num_layers decoder blocks]
        │   ├── block 0: Masked Self-Attn + Cross-Attn + FFN ──► X1
        │   ├── block 1: Masked Self-Attn + Cross-Attn + FFN ──► X2
        │   └── ...
        │   └── block (L-1): ... ──► XL
        │
        └──► linear projection (num_hiddens -> vocab_size) ──► output logits
    """
    def __init__(self, vocab_size, key_size, query_size, value_size,
                 num_hiddens, norm_shape, ffn_num_input, ffn_num_hiddens,
                 num_heads, num_layers, dropout, **kwargs):
        super(TransformerDecoder, self).__init__(**kwargs)
        # Store the key hyperparameters
        self.num_hiddens = num_hiddens
        self.num_layers = num_layers
        # Target-side embedding: map target-language token ids to dense vectors
        self.embedding = nn.Embedding(vocab_size, num_hiddens)
        # Positional encoding: inject position information into the target sequence
        self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout)
        # Container for the stack of decoder blocks
        self.blks = nn.Sequential()
        for i in range(num_layers):
            # Each decoder block contains 3 sublayers:
            # 1. Masked Self-Attention (sees only the current and earlier positions)
            # 2. Cross-Attention (attends over the encoder outputs)
            # 3. Position-wise FFN
            self.blks.add_module("block"+str(i),
                DecoderBlock(key_size, query_size, value_size, num_hiddens,
                             norm_shape, ffn_num_input, ffn_num_hiddens,
                             num_heads, dropout, i))
        # Final projection: map the hidden dimension back to the vocabulary
        # size, producing the scores used to generate output tokens
        self.dense = nn.Linear(num_hiddens, vocab_size)

        print("[LOG] TransformerDecoder initialized")
        print(f"[LOG] vocab_size={vocab_size}, num_hiddens={num_hiddens}, "
              f"num_heads={num_heads}, num_layers={num_layers}, dropout={dropout}")
        print(f"[LOG] norm_shape={norm_shape}, ffn_num_input={ffn_num_input}, "
              f"ffn_num_hiddens={ffn_num_hiddens}")

    def init_state(self, enc_outputs, enc_valid_lens, *args):
        """Initialize the decoder state

        State layout:
          [0]: enc_outputs - encoder representations, used by Cross-Attention
          [1]: enc_valid_lens - valid-length mask for the encoder side
          [2]: [None] * num_layers - per-block cache (for incremental decoding at inference time)
        """
        print("\n[LOG] ===== TransformerDecoder.init_state: initializing decoder state =====")
        log_tensor_info("encoder outputs enc_outputs", enc_outputs)
        print(f"[LOG] encoder valid lengths enc_valid_lens={enc_valid_lens}")
        state = [enc_outputs, enc_valid_lens, [None] * self.num_layers]
        print(f"[LOG] initialized caches for {self.num_layers} decoder blocks")
        print("[LOG] ===== state initialization done =====\n")
        return state

    def forward(self, X, state):
        """Forward pass: decode target-sequence representations

        Args:
            X: (batch_size, target_seq_len) target-sequence token indices
            state: [enc_outputs, enc_valid_lens, decoder_cache]
        Returns:
            logits: (batch_size, target_seq_len, vocab_size)
            state: the updated decoder state
        """
        print("\n[LOG] ========= Entering TransformerDecoder.forward ==========")
        log_tensor_info("target-sequence input X", X)

        # Embedding
        emb = self.embedding(X)
        log_tensor_info("target-sequence embedding embedding(X)", emb)

        # Scaling
        scaled_emb = emb * math.sqrt(self.num_hiddens)
        log_tensor_info("scaled embedding scaled_emb", scaled_emb)

        # Add the positional encoding
        X = self.pos_encoding(scaled_emb)
        log_tensor_info("input after positional encoding", X)

        # Storage for attention weights (for later visualization)
        # [0]: decoder self-attention weights, [1]: cross-attention weights
        self._attention_weights = [[None] * len(self.blks) for _ in range(2)]

        # Pass through the num_layers decoder blocks
        for i, blk in enumerate(self.blks):
            print(f"[LOG] ---- entering decoder block block{i} ----")
            X, state = blk(X, state)
            log_tensor_info(f"block{i} output", X)
            # Save this layer's self-attention weights
            self._attention_weights[0][i] = blk.attention1.attention.attention_weights
            log_tensor_info(f"block{i} self-attention weights", self._attention_weights[0][i])
            # Save this layer's cross-attention (encoder-decoder attention) weights
            self._attention_weights[1][i] = blk.attention2.attention.attention_weights
            log_tensor_info(f"block{i} cross-attention weights", self._attention_weights[1][i])
            print(f"[LOG] ---- exiting decoder block block{i} ----")

        # Final linear projection to vocabulary size, producing the logits
        logits = self.dense(X)
        log_tensor_info("final logits (after the dense layer)", logits)
        print(f"[LOG] logits shape: {tuple(logits.shape)}")
        print("[LOG] ========= Exiting TransformerDecoder.forward ==========\n")
        return logits, state

    @property
    def attention_weights(self):
        """Return the attention weights of every layer

        Returns:
            [[self_attn_weights_layer_0, ..., self_attn_weights_layer_{L-1}],
             [cross_attn_weights_layer_0, ..., cross_attn_weights_layer_{L-1}]]
        """
        return self._attention_weights

Information flow through the whole Decoder


           input tokens
                ↓
          Embedding + PE
                ↓
        ┌──────────────────┐
        │ Masked Self-Attn │ ← attends to the history
        └──────────────────┘
                ↓
        ┌──────────────────┐
        │ Cross Attention  │ ← attends to the input
        └──────────────────┘
                ↓
        ┌──────────────────┐
        │       FFN        │ ← transforms the features
        └──────────────────┘
                ↓
       repeated over L layers
                ↓
            Linear
                ↓
            logits

Encoder-decoder test code


# valid_lens: sample 0 sees only the first 3 positions, sample 1 the first 2
valid_lens = torch.tensor([3, 2])
print(f"[LOG] valid_lens={valid_lens}")

print("\n[LOG] ============= Full TransformerDecoder test =============")
# Build the full decoder
# vocab_size=200, num_hiddens=24, num_heads=8, num_layers=2
decoder = TransformerDecoder(
    200, 24, 24, 24, 24, [100, 24], 24, 48, 8, 2, 0.5)
decoder.eval()  # inference mode

print("[LOG] ---- Step 1: encode the source language ----")
# Source-language sequence
src_X = torch.ones((2, 100), dtype=torch.long)
log_tensor_info("source-sequence input src_X", src_X)

# Encode
encoder = TransformerEncoder(200, 24, 24, 24, 24, [100, 24], 24, 48, 8, 2, 0.5)
encoder.eval()
enc_outputs = encoder(src_X, valid_lens)
log_tensor_info("encoder outputs enc_outputs", enc_outputs)
print(f"[LOG] encoder output shape: {tuple(enc_outputs.shape)}")

print("\n[LOG] ---- Step 2: initialize the decoder state ----")
# Initialize the decoder state
state = decoder.init_state(enc_outputs, valid_lens)
print(f"[LOG] state[0] (enc_outputs) shape: {tuple(state[0].shape)}")
print(f"[LOG] state[1] (enc_valid_lens): {state[1]}")
print(f"[LOG] state[2] (decoder_cache) number of layers: {len(state[2])}")

print("\n[LOG] ---- Step 3: decode on the target side ----")
# Target-language sequence (normally generated step by step at prediction time)
# Note: the target length must match the first dimension of norm_shape (100 here)
tgt_X = torch.ones((2, 100), dtype=torch.long)  # target sequence of length 100
log_tensor_info("target-sequence input tgt_X", tgt_X)

# Decode
logits, state_updated = decoder(tgt_X, state)
log_tensor_info("decoder output logits", logits)
print(f"[LOG] decoder logits shape: {tuple(logits.shape)}")
print("[LOG] expected shape: (batch_size=2, tgt_seq_len=100, vocab_size=200)")

print("\n[LOG] ---- Step 4: inspect the attention weights ----")
attn_weights = decoder.attention_weights
print(f"[LOG] decoder self-attention (Masked Self-Attention) layers: {len(attn_weights[0])}")
print(f"[LOG] cross-attention (Cross-Attention) layers: {len(attn_weights[1])}")
for i in range(len(attn_weights[0])):
    if attn_weights[0][i] is not None:
        print(f"[LOG] layer {i} self-attention weights shape: {tuple(attn_weights[0][i].shape)}")
    if attn_weights[1][i] is not None:
        print(f"[LOG] layer {i} cross-attention weights shape: {tuple(attn_weights[1][i].shape)}")

print("\n[LOG] ---- Step 5: decoder cache state ----")
print(f"[LOG] updated state[2] (decoder_cache) number of cached blocks: {len(state_updated[2])}")
for i in range(len(state_updated[2])):
    if state_updated[2][i] is not None:
        print(f"[LOG] block{i} cache shape: {tuple(state_updated[2][i].shape)}")
    else:
        print(f"[LOG] block{i} cache: None")

print("\n[LOG] ============= Full TransformerDecoder test finished =============\n")

Architecture


Source sequence src_X (B,S)
        │
        ▼
┌─────────────── Encoder ────────────────┐
│ Embedding → Scale → PosEncoding        │
│      │                                 │
│      ▼                                 │
│  [EncoderBlock × L]                    │
│   Self-Attn → AddNorm → FFN → AddNorm  │
│      │                                 │
└──────▼─────────────────────────────────┘
   enc_outputs (B,S,d)
        │
        ▼
state = [enc_outputs, valid_lens, cache]
        │
        ▼
Target sequence tgt_X (B,T)
        │
        ▼
┌─────────────── Decoder ────────────────┐
│ Embedding → Scale → PosEncoding        │
│      │                                 │
│      ▼                                 │
│  [DecoderBlock × L]                    │
│   ① Masked Self-Attn                   │
│   ② Cross-Attn                         │
│   ③ FFN                                │
│   + AddNorm after each sublayer        │
│      │                                 │
└──────▼─────────────────────────────────┘
      X_L (B,T,d)
        │
        ▼
Linear → logits (B,T,V)

Step 1: the Encoder in detail


src_X (B,S)
  │  token ids (discrete)
  ▼
Embedding
  │  (B,S) → (B,S,d)
  │  each token becomes a vector
  ▼
Scale (×√d)
  │  scale the embedding up
  ▼
PosEncoding
  │  inject position information
  ▼
X0 (B,S,d)
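
A minimal sketch of these three input steps, with dimensions matching the test above (dropout=0 and the tensor names are illustrative choices for a deterministic run):


import math
import torch
from torch import nn
from d2l import torch as d2l

embedding = nn.Embedding(200, 24)                  # vocab_size=200, d=24
pos_encoding = d2l.PositionalEncoding(24, dropout=0)

src_X = torch.ones((2, 100), dtype=torch.long)     # (B=2, S=100) token ids
emb = embedding(src_X)                             # (2, 100, 24)
scaled = emb * math.sqrt(24)                       # rescale before adding the
                                                   # [-1, 1] positional encoding
X0 = pos_encoding(scaled)                          # (2, 100, 24)
print(X0.shape)                                    # torch.Size([2, 100, 24])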

EncoderBlock (repeated L times)


Input: X (B,S,d)
  │
  ├─► Self-Attention
  │      Q=K=V=X
  │      ↓
  │      fuses the "global context"
  │
  ├─► AddNorm
  │      X + Attn(X)
  │      ↓
  │      stabilizes training + keeps the original signal
  │
  ├─► FFN
  │      applied to each token independently:
  │      d → d_ff → d (4d in the original paper; 24 → 48 → 24 here)
  │
  └─► AddNorm
         stabilizes again
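
The AddNorm sublayer is small enough to sketch in full; this follows the d2l-style implementation (residual connection, dropout on the sublayer output, then LayerNorm):


import torch
from torch import nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalization."""
    def __init__(self, normalized_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(normalized_shape)

    def forward(self, X, Y):
        # X: the sublayer input (residual branch); Y: the sublayer output
        return self.ln(self.dropout(Y) + X)

add_norm = AddNorm([3, 4], dropout=0.5)
add_norm.eval()  # disable dropout for a deterministic check
out = add_norm(torch.ones((2, 3, 4)), torch.ones((2, 3, 4)))
print(out.shape)  # torch.Size([2, 3, 4])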

Encoder output


enc_outputs (B,S,d)

Meaning:
each token → a “context-aware semantic representation”

Step 2: initialize the Decoder state


state = [
  enc_outputs,        # the input semantics
  valid_lens,         # the mask
  [None, None, ...]   # cache (one slot per layer)
]
🧠 What the cache means
cache[i] = the K/V that layer i has already computed

At inference time:
it avoids recomputing the historical tokens (see the sketch below)
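
How the cache is used follows the pattern inside d2l's DecoderBlock.forward, where state[2][i] accumulates the token representations that serve as keys/values. A paraphrased sketch of that step, not the full method:


# Inside decoder block i, at each forward call (paraphrased sketch):
if state[2][self.i] is None:
    # Training / first decoding step: keys and values are the current input
    key_values = X
else:
    # Later decoding steps: append the new token's representation to
    # everything decoded so far, so old tokens are reused, not recomputed
    key_values = torch.cat((state[2][self.i], X), axis=1)
state[2][self.i] = key_values  # updated cache, carried into the next step
# Masked self-attention then runs with Q = X and K = V = key_values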

Step 3: Decoder input processing


tgt_X (B,T)
  │
  ▼
Embedding
  │ (B,T) → (B,T,d)
  ▼
Scale
  ▼
PosEncoding
  ▼
X0 (B,T,d)

Step 4: the DecoderBlock (the core)


Input: X (B,T,d)

        ┌──────────────────────────────┐
        │ ① Masked Self-Attention      │
        │   Q = the current token      │
        │   K,V = history + current    │
        │                              │
        │   Role:                      │
        │   can only see the "past"    │
        └───────────────┬──────────────┘
                        ▼
                    AddNorm1
                        │
                        ▼
        ┌──────────────────────────────┐
        │ ② Cross-Attention            │
        │   Q = decoder state          │
        │   K,V = enc_outputs          │
        │                              │
        │   Role:                      │
        │   pulls information from     │
        │   the input sentence         │
        └───────────────┬──────────────┘
                        ▼
                    AddNorm2
                        │
                        ▼
        ┌──────────────────────────────┐
        │ ③ FFN                        │
        │   per-token, independent     │
        │   nonlinear transform        │
        └───────────────┬──────────────┘
                        ▼
                    AddNorm3
                        │
                        ▼
                 Output: X (B,T,d)

Step 5: stack the blocks

The DecoderBlock above is repeated num_layers times, each block's output feeding the next (the [DecoderBlock × L] box in the architecture diagram).

Step 6: the Linear output layer


X (B,T,d)
  │
  ▼
Linear(d → vocab_size)
  │
  ▼
logits (B,T,V)

🎯 Meaning
each position gets one score vector over the vocabulary (a probability distribution after softmax)
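
To turn the logits into tokens, a greedy decoder takes the argmax over the vocabulary dimension; softmax is only needed when you want the actual probabilities. A sketch with stand-in logits of the same shape as the test above:


import torch

logits = torch.randn(2, 100, 200)     # stand-in for (B=2, T=100, V=200)

probs = torch.softmax(logits, dim=-1) # per-position distribution over the vocab
print(probs.sum(dim=-1)[0, :3])       # ≈ tensor([1., 1., 1.]): rows sum to 1
next_tokens = logits.argmax(dim=-1)   # (2, 100); softmax is monotone, so
                                      # argmax(logits) == argmax(probs)
print(next_tokens.shape)              # torch.Size([2, 100])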

transformer_encode_decode

Differences between training and inference

| Aspect | Training | Inference |
| --- | --- | --- |
| Mode | model.train() | model.eval() |
| Dropout | ✅ on (fights overfitting) | ❌ off (stable outputs) |
| Gradients | ✅ computed + backprop | torch.no_grad() |
| Parameter updates | ✅ weights updated | ❌ not updated |
| Input style | parallel (whole sentence) | autoregressive (token by token) |
| Masking | causal mask + padding mask | mainly the causal mask |
| Self-Attention | full sequence in parallel | only the current token + cache |
| KV cache | ❌ not used | ✅ effectively required |
| Complexity | O(n²) (per layer) | O(n) per token (with cache) |
| Memory | high (stores gradients) | low (forward pass only) |
| Output | logits (for the loss) | tokens (for generation) |
| Goal | learn the parameters | generate results |
| Sees the future? | ✅ (teacher forcing) | ❌ (history only) |
| Bottleneck | backpropagation | sequential autoregression (latency) |
| Parallelism | high (GPU-friendly) | low (step-by-step generation) |
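
The table can be summarized in two code paths using the classes defined above. One wrinkle: the demo's norm_shape=[100, 24] normalizes over (seq_len, hidden) and only accepts full-length inputs, so this sketch builds models with norm_shape=[24] to allow token-by-token decoding; bos_id and the 10-step loop are illustrative assumptions.


import torch

# Decoder/encoder with per-hidden-dim LayerNorm so seq_len may vary
dec = TransformerDecoder(200, 24, 24, 24, 24, [24], 24, 48, 8, 2, 0.5)
enc = TransformerEncoder(200, 24, 24, 24, 24, [24], 24, 48, 8, 2, 0.5)

valid_lens = torch.tensor([3, 2])

# --- Training: one parallel pass over the whole teacher-forced target ---
dec.train(); enc.train()
enc_outputs = enc(torch.ones((2, 100), dtype=torch.long), valid_lens)
tgt = torch.ones((2, 100), dtype=torch.long)           # teacher-forced targets
logits, _ = dec(tgt, dec.init_state(enc_outputs, valid_lens))
# logits: (2, 100, 200), would be fed to the cross-entropy loss

# --- Inference: autoregressive, one token at a time, reusing the cache ---
dec.eval(); enc.eval()
with torch.no_grad():                                  # no gradients at inference
    enc_outputs = enc(torch.ones((2, 100), dtype=torch.long), valid_lens)
    state = dec.init_state(enc_outputs, valid_lens)
    bos_id = 1                                         # illustrative <bos> id
    cur = torch.full((2, 1), bos_id, dtype=torch.long)
    outputs = []
    for _ in range(10):                                # generate 10 tokens
        step_logits, state = dec(cur, state)           # state[2] caches past K/V
        cur = step_logits.argmax(dim=-1)               # greedy pick, (2, 1)
        outputs.append(cur)
generated = torch.cat(outputs, dim=1)                  # (2, 10)
print(generated.shape)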