Karpathy GPT Tutorial Notes (Part 5)

布客飞龙

布客飞龙 · Published 2026-06-22 03:01:23

To achieve this, we need to modify the Flatten layer. We create a new FlattenConsecutive layer that concatenates every n consecutive elements and adds a "group" dimension.

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_281.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_283.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_285.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_287.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_289.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_291.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_293.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_295.png

class FlattenConsecutive:
    def __init__(self, n):
        self.n = n
    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)
        if x.shape[1] == 1:
            x = x.squeeze(1)
        return x

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_297.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_299.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_301.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_303.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_305.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_307.png

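Next, we redesign the model architecture. The first FlattenConsecutive(2) layer splits the 8 characters into 4 groups and concatenates the embeddings of the 2 characters in each group. The linear layer that follows only processes the information from those 2 characters. We then apply FlattenConsecutive(2) again to merge the 4 groups into 2, and so on, forming a small hierarchical network.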
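
To make the shape bookkeeping concrete, here is a minimal sketch that traces a dummy batch through three rounds of FlattenConsecutive(2) followed by a linear layer. The batch size (4), embedding size (10), and hidden size (68) are illustrative assumptions, and `torch.nn.Linear` stands in for the custom Linear layer used in these notes.

```python
import torch

class FlattenConsecutive:
    """Concatenate every n consecutive positions along the feature axis."""
    def __init__(self, n):
        self.n = n
    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)
        if x.shape[1] == 1:          # drop the group axis once only one group is left
            x = x.squeeze(1)
        return x

B, T, n_embd, n_hidden = 4, 8, 10, 68    # assumed sizes, for illustration only
x = torch.randn(B, T, n_embd)            # stands in for the embedding output
flatten = FlattenConsecutive(2)

shapes = [tuple(x.shape)]
in_features = n_embd * 2
for _ in range(3):
    x = flatten(x)                                   # halve the groups, double the features
    shapes.append(tuple(x.shape))
    x = torch.nn.Linear(in_features, n_hidden)(x)    # mix the features within each group
    shapes.append(tuple(x.shape))
    in_features = n_hidden * 2

for s in shapes:
    print(s)
```

Each flatten step halves the number of groups (8, 4, 2, 1) while the linear layer acts only on the last axis, so the final tensor is two-dimensional and ready for the output projection.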

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_309.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_311.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_313.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_315.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_317.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_319.png

model = Sequential([
    Embedding(vocab_size, n_embd),
    FlattenConsecutive(2), Linear(n_embd * 2, n_hidden), BatchNorm(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden), BatchNorm(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden), BatchNorm(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size)
])

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_321.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_323.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_325.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_327.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_329.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_331.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_333.png

Fixing the Batch Normalization Layer

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_335.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_337.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_339.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_341.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_343.png

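In the previous section we built the hierarchical model. In this section we fix a key problem: how the BatchNorm layer handles multi-dimensional inputs.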

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_345.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_347.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_349.png

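Our original BatchNorm implementation assumed a two-dimensional input of shape (batch_size, features). In the new architecture, FlattenConsecutive produces a three-dimensional input of shape (batch_size, groups, features). During training, BatchNorm therefore needs to compute the mean and variance over both the batch and groups dimensions, so that each feature channel gets a single pair of statistics.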
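
The difference is easiest to see in the shapes of the statistics. The sketch below, which assumes an input of shape (32, 4, 68) chosen only for illustration, compares reducing over the batch dimension alone with reducing over batch and groups together.

```python
import torch

torch.manual_seed(0)
x = torch.randn(32, 4, 68)    # (batch, groups, features), assumed sizes

# Reducing over dim 0 only keeps a separate statistic for every group position.
mean_batch_only = x.mean(0, keepdim=True)           # shape (1, 4, 68)

# Reducing over dims (0, 1) gives one statistic per feature channel.
mean_batch_groups = x.mean((0, 1), keepdim=True)    # shape (1, 1, 68)
var_batch_groups = x.var((0, 1), keepdim=True)

x_hat = (x - mean_batch_groups) / torch.sqrt(var_batch_groups + 1e-5)
print(mean_batch_only.shape, mean_batch_groups.shape)
```

With the second form, every channel is normalized with statistics estimated from 32 × 4 values instead of 32, which is the behavior described above.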

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_351.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_353.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_355.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_357.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_359.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_361.png

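class BatchNorm:
    def __call__(self, x):
        if self.training:
            # Reduce over every dimension except the last (feature) one
            dims = (0, 1) if x.ndim == 3 else (0,)
            xmean = x.mean(dims, keepdim=True)
            xvar = x.var(dims, keepdim=True)
        # ... normalization and the running-statistics update follow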

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_363.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_365.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_367.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_369.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_371.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_373.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_375.png

After fixing this bug, model performance improved by a small but consistent margin.

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_377.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_379.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_381.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_383.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_385.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_387.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_389.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_391.png

Experimental Results and Future Directions

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_393.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_395.png

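By increasing model capacity (for example the embedding dimension and the hidden layer size), we eventually brought the validation loss down to about 1.993, crossing the 2.0 mark.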

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_397.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_399.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_401.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_403.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_405.png

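In this lesson we implemented a simplified WaveNet-style architecture. We learned how to:

Use modular building blocks (such as Sequential) to organize a complex network.
Fuse information hierarchically with FlattenConsecutive and linear layers.
Adjust BatchNorm so that it handles multi-dimensional inputs correctly.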

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_407.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_409.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_411.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_413.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_415.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_417.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_419.png

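However, what we implemented is only the core skeleton of the WaveNet idea. The full WaveNet also includes gated activation units, residual connections, and dilated causal convolutions (for efficient computation). We also lack a systematic hyperparameter search and experiment framework, so our tuning so far has been mostly guess and check.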

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_421.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_423.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_425.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_427.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_429.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_431.png

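In future lessons we can:

Implement dilated convolutions to compute the outputs for an entire input sequence efficiently.
Add residual connections to train deeper networks.
Build an experiment pipeline for large-scale hyperparameter optimization.
Explore recurrent neural networks (RNN/LSTM) and the Transformer architecture.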

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_433.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_435.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_437.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_439.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_441.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_443.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_445.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_447.png

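Challenge: try tuning this lesson's model (for example the channel counts per layer or the embedding dimension), or read the WaveNet paper and implement its more complex layers, and see whether you can beat the validation loss of 1.993.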

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_449.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_451.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_453.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_455.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_457.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_459.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_461.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_463.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_465.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_467.png

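Summary: in this lesson we started from a basic MLP and step by step built a hierarchical, WaveNet-like character-level language model. We refactored the code to make it clearer, introduced the idea of hierarchical information fusion, and fixed how the batch normalization layer handles multi-dimensional inputs. Performance improved, but this is only the beginning of exploring modern deep neural network architectures.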

Lesson P7: Building GPT from Scratch 🧠

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_1.png

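In this lesson we learn how to build a GPT-like Transformer language model from scratch. We use a simple character-level dataset (Tiny Shakespeare) and implement the model's core components one by one, including self-attention, multi-head attention, the feed-forward network, and residual connections. Along the way you will gain a solid understanding of the basic principles behind modern large language models such as ChatGPT.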

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_3.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_5.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_7.png

Overview 📋

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_9.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_11.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_13.png

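The Transformer architecture is at the core of many of today's advanced AI systems. It was first proposed in the 2017 paper "Attention Is All You Need", and GPT (Generative Pre-trained Transformer) is built on it. In this tutorial we focus on building a decoder-only Transformer for character-level language modeling. We cannot reproduce a system as complex as ChatGPT, but building a miniature version lets us see clearly how it works.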

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_15.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_17.png

We start with data processing, implement the key parts of the model step by step, train on the Tiny Shakespeare dataset, and finally generate Shakespeare-style text.

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_19.png

1. Data Preparation and Tokenization 📚

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_21.png

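First, we need to prepare the data and convert it into a format the model can process. We use the Tiny Shakespeare dataset, a concatenation of Shakespeare's works in a single text file of roughly 1 MB.

1.1 Reading the Data

We download the dataset from the given URL and read it into one long string.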

import torch
import requests

# Download the dataset
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text = requests.get(url).text
print(f"Dataset length (characters): {len(text)}")
print(text[:1000])  # Print the first 1000 characters

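1.2 Building the Vocabulary

Next, we collect every unique character in the dataset to build a vocabulary. Each character is mapped to a unique integer (a token).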

# Collect all unique characters and sort them
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size}")
print(''.join(chars))  # Print all characters

# Create the encoder and decoder mappings
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer
itos = {i: ch for i, ch in enumerate(chars)}  # integer -> character

def encode(s):
    return [stoi[c] for c in s]  # string -> list of integers

def decode(l):
    return ''.join([itos[i] for i in l])  # list of integers -> string

# Test the encode/decode round trip
test_str = "hi there"
encoded = encode(test_str)
decoded = decode(encoded)
print(f"Original string: {test_str}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")

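1.3 Splitting the Dataset

We split the data into a training set (the first 90%) and a validation set (the last 10%). The validation set is used to assess how well the model generalizes and to detect overfitting.

# Encode the entire text as a tensor of integers
data = torch.tensor(encode(text), dtype=torch.long)

# Split into training and validation sets
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]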

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_23.png

2. Batching the Data 🔄

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_25.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_27.png

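Because we cannot feed the entire dataset into the model at once, we train on small chunks sampled at random from the data. Each batch contains several independent sequences that the model processes in parallel.

The following function creates a batch of data: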

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_29.png

def get_batch(split):
    # Choose the training or validation set according to split
    data = train_data if split == 'train' else val_data
    # Sample random starting indices
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # Build the inputs x and the targets y (x shifted by one position)
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

# Set the hyperparameters
batch_size = 4
block_size = 8

# Fetch one batch
xb, yb = get_batch('train')
print('Shape of inputs xb:', xb.shape)
print('Shape of targets yb:', yb.shape)
print('Example inputs:\n', xb)
print('Example targets:\n', yb)

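In this batch, xb is the model input and yb holds, for each position, the next character that should be predicted. The model's task is to predict yb from the context in xb.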
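
Each row of a batch actually packs several training examples, one per position. The sketch below unrolls a single row to show this; the token ids are made up for illustration and do not come from the dataset.

```python
# One row of a batch with block_size = 8, and its targets shifted by one.
# These ids are invented for the example.
x_row = [24, 43, 58, 5, 57, 1, 46, 43]
y_row = [43, 58, 5, 57, 1, 46, 43, 39]

examples = []
for t in range(len(x_row)):
    context = x_row[: t + 1]   # everything up to and including position t
    target = y_row[t]          # the token that follows that context
    examples.append((context, target))
    print(f"when the input is {context} the target is {target}")
```

A row of length block_size therefore yields block_size examples, with context lengths from 1 up to block_size, which also lets the model get used to generating from short contexts.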

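3. Baseline Model: The Bigram Language Model 🔤

Before diving into the Transformer, we implement the simplest possible language model, the Bigram model. It predicts the next character from the identity of the current character alone and ignores all earlier context.

3.1 Model Definition

The Bigram model is essentially a lookup table in which each character directly yields the distribution over the next character.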

import torch.nn as nn

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_31.png>

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Each token maps directly to the logits for the next token
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both integer tensors of shape (B, T)
        logits = self.token_embedding_table(idx)  # (B, T, C)

        if targets is None:
            loss = None
        else:
            # Reshape to match what PyTorch's cross-entropy loss expects
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = nn.functional.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is the current context, of shape (B, T)
        for _ in range(max_new_tokens):
            # Get the predictions
            logits, loss = self(idx)
            # Focus on the last time step only
            logits = logits[:, -1, :]  # becomes (B, C)
            # Apply softmax to get probabilities
            probs = nn.functional.softmax(logits, dim=-1)  # (B, C)
            # Sample the next token from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # Append the sampled token to the sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx

# Instantiate the model
model = BigramLanguageModel(vocab_size)

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_33.png

3.2 Training and Generation

We can train this model with a simple optimization loop and look at what it generates.

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for steps in range(10000):
    # Fetch a batch of data
    xb, yb = get_batch('train')
    # Forward pass: compute the loss
    logits, loss = model(xb, yb)
    # Backward pass: update the parameters
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(f"Final loss: {loss.item()}")

# Generate text, starting from a single token with id 0
context = torch.zeros((1, 1), dtype=torch.long)
print(decode(model.generate(context, max_new_tokens=500)[0].tolist()))

The Bigram model is very limited because it makes no use of context. Next, we introduce self-attention so that characters can communicate with each other.
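
A useful reference point when judging any of these models is the loss of a predictor that knows nothing and spreads its probability uniformly over the vocabulary. The sketch below computes that value, assuming a vocabulary of 65 characters (the vocabulary size is an assumption here; use whatever `vocab_size` the code above prints).

```python
import math

vocab_size = 65  # assumed character vocabulary size
# Cross-entropy of a uniform guess: -ln(1 / vocab_size) = ln(vocab_size)
uniform_loss = -math.log(1.0 / vocab_size)
print(f"loss of a uniform guess: {uniform_loss:.4f}")
```

A randomly initialized lookup table usually starts somewhat above this value because its logits are not exactly uniform, and any trained model should end up well below it.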

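4. Self-Attention 🤝

Self-attention is the core component of the Transformer. It lets each element (token) in a sequence aggregate information from the other elements according to how it relates to them.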

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_35.png

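4.1 The Math

The key idea of self-attention is that each token produces three vectors: a query, a key, and a value.

Query (Q): "what am I looking for?"
Key (K): "what information do I contain?"
Value (V): "what will I pass along if I am attended to?"

The affinity between tokens (the attention score) is the dot product of a query with a key: affinity = Q @ K^T. After normalization, these scores become the weights of a weighted sum over the values, which is how information is aggregated.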

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_37.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_39.png

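To enforce causality in language modeling (the current token must not see future tokens), we use a lower-triangular mask that sets the attention scores at future positions to negative infinity, so that their weights become 0 after the softmax.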
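
The masking step can be tried in isolation. In the sketch below the raw affinities are all set to zero (an assumption made purely so the result is easy to read), which turns the masked softmax into a plain average over the current and past positions.

```python
import torch
import torch.nn.functional as F

T = 4
tril = torch.tril(torch.ones(T, T))               # 1 on and below the diagonal
wei = torch.zeros(T, T)                           # raw affinities, all equal here
wei = wei.masked_fill(tril == 0, float('-inf'))   # hide the future positions
wei = F.softmax(wei, dim=-1)                      # each row sums to 1
print(wei)
```

Row t ends up with weight 1/(t+1) on each of positions 0..t and exactly 0 on everything after t. With learned queries and keys the non-zero weights are no longer uniform, but the zero pattern stays the same.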

4.2 Implementing a Single Attention Head

Here is a PyTorch implementation of a single head of self-attention:

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_41.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_43.png

class Head(nn.Module):
    """ One head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask used to implement causal attention
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        # Compute the attention scores (affinities), scaled by 1/sqrt(head_size).
        # The scale uses the key dimension, not the input channel count C.
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5  # (B, T, T)
        # Apply the causal mask
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = nn.functional.softmax(wei, dim=-1)  # (B, T, T)
        wei = self.dropout(wei)
        # Weighted aggregation of the values
        v = self.value(x)  # (B, T, head_size)
        out = wei @ v  # (B, T, head_size)
        return out

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_45.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_46.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_48.png

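In this implementation:

We define linear projection layers for the key, query, and value.
We compute scaled dot-product attention scores and apply the causal mask.
We use softmax to turn the scores into a probability distribution (the attention weights).
We use those weights to take a weighted sum of the value vectors, which gives the output.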
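
The reason for the scaling is that the dot product of two vectors with unit-variance entries has a variance that grows with their length, and large scores push the softmax toward one-hot weights. The following sketch checks this numerically; the tensor sizes are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)
B, T, head_size = 32, 16, 64              # assumed sizes, for illustration
q = torch.randn(B, T, head_size)          # unit-variance queries
k = torch.randn(B, T, head_size)          # unit-variance keys

raw = q @ k.transpose(-2, -1)             # variance is roughly head_size
scaled = raw * head_size ** -0.5          # variance is roughly 1
print(f"variance: raw {raw.var().item():.2f}, scaled {scaled.var().item():.2f}")
```

Keeping the scores near unit variance at initialization keeps the softmax diffuse, so each token starts out aggregating information from many positions instead of locking onto a single one.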

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_50.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_52.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_54.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_56.png

5. Multi-Head Attention and the Transformer Block 🧩

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_58.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_60.png

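A single attention head may focus on only one kind of relationship. To capture richer information we run several heads in parallel, which is multi-head attention.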

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_62.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_64.png

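5.1 Implementing Multi-Head Attention

We concatenate the outputs of several single attention heads along the channel dimension.

class MultiHeadAttention(nn.Module):
    """ Multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        # Projection layer; its input is the concatenation of all head outputs
        self.proj = nn.Linear(num_heads * head_size, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Run every head on the same input and concatenate the results
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))  # project back onto the residual pathway
        return out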

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_66.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_68.png

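5.2 The Feed-Forward Network

After self-attention has done the communication, each token needs to process the information it gathered on its own. This is done by a simple feed-forward network (FFN), typically a two-layer MLP applied to every position independently.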

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_70.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_72.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_74.png

class FeedForward(nn.Module):
    """ A simple feed-forward network """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand the dimension
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),  # project back to the original dimension
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_76.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_77.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_79.png

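5.3 Building the Transformer Block

We now combine multi-head attention and the feed-forward network into a Transformer block. To make deep networks trainable, we add residual connections and layer normalization.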

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_81.png

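Residual connections: the input of each sub-layer is added directly to its output. This creates a gradient highway that helps mitigate vanishing gradients in deep networks.
Layer normalization: the features of each token are normalized inside the block, which stabilizes training. Here it is applied before each sub-layer (the pre-norm arrangement).

class Block(nn.Module):
    """ Transformer block: communication (attention) followed by computation (feed-forward) """

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)  # multi-head self-attention
        self.ffwd = FeedForward(n_embd)                  # feed-forward network
        self.ln1 = nn.LayerNorm(n_embd)                  # layer norm 1
        self.ln2 = nn.LayerNorm(n_embd)                  # layer norm 2

    def forward(self, x):
        # Self-attention with layer norm on its input and a residual connection
        x = x + self.sa(self.ln1(x))
        # Feed-forward network with layer norm on its input and a residual connection
        x = x + self.ffwd(self.ln2(x))
        return x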
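
To see what layer normalization does to each token, the sketch below passes a deliberately shifted and scaled random tensor through `nn.LayerNorm` and inspects the per-token statistics. The sizes and the offset are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 32                                   # assumed embedding size
ln = nn.LayerNorm(n_embd)                     # starts with weight 1 and bias 0

x = torch.randn(2, 5, n_embd) * 3.0 + 1.0     # (B, T, C) with non-unit scale and offset
y = ln(x)

per_token_mean = y.mean(dim=-1)                                # (B, T)
per_token_std = y.var(dim=-1, unbiased=False).sqrt()           # (B, T)
print(per_token_mean.abs().max().item(), per_token_std.mean().item())
```

Unlike batch normalization, the statistics are computed across the feature axis of each token separately, so the result does not depend on the other sequences in the batch and behaves the same during training and inference.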

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_83.png

6. Building the Full GPT Model 🏗️

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_85.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_87.png

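We can now put all the components together into the full GPT model. It consists of:

A token embedding layer that turns integer tokens into vectors.
A position embedding layer that supplies position information for each position in the sequence.
A stack of Transformer blocks (decoder blocks).
A final layer normalization and a linear projection that predicts the next token.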

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_89.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_91.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_93.png

class GPTLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # One embedding vector per token
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        # One embedding vector per position
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        # Stack of Transformer blocks
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        # Final layer norm
        self.ln_f = nn.LayerNorm(n_embd)
        # Language-modeling head: projects the features back onto the vocabulary
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # Look up the token and position embeddings
        tok_emb = self.token_embedding_table(idx)  # (B, T, n_embd)
        # Build the position indices on the same device as the input
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T, n_embd)
        x = tok_emb + pos_emb  # (B, T, n_embd), pos_emb is broadcast over the batch
        # Pass through the Transformer blocks
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = nn.functional.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is the current context, of shape (B, T)
        for _ in range(max_new_tokens):
            # Crop the context to the last block_size tokens if it is too long
            idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
            # Get the predictions
            logits, loss = self(idx_cond)
            # Focus on the last time step
            logits = logits[:, -1, :]  # (B, C)
            # Apply softmax to get probabilities
            probs = nn.functional.softmax(logits, dim=-1)  # (B, C)
            # Sample the next token from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # Append the sampled token to the sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_95.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_97.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_99.png

7. Training and Evaluating the Model 🚀

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_101.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_103.png

We can now train our GPT model with larger hyperparameters and see how it performs.

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_105.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_107.png

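7.1 Hyperparameters and Device

# Hyperparameters
batch_size = 64         # number of independent sequences processed per batch
block_size = 256        # maximum context length
max_iters = 5000        # number of training iterations
eval_interval = 500     # evaluate every this many steps
learning_rate = 3e-4    # learning rate
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # use the GPU if available
eval_iters = 200        # number of batches averaged when estimating the loss
n_embd = 384            # embedding dimension
n_head = 6              # number of attention heads
n_layer = 6             # number of Transformer blocks
dropout = 0.2           # dropout rate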

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_109.png>

# Instantiate the model and move it to the device
model = GPTLanguageModel()
m = model.to(device)
print(f"Number of parameters: {sum(p.numel() for p in m.parameters())/1e6:.2f} M")
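
As a cross-check on the printed figure, the parameter count can be worked out by hand from the layer shapes defined in these notes. The sketch below does this in plain Python; it assumes a vocabulary of 65 characters, and the variable names are illustrative only.

```python
vocab_size = 65   # assumption: size of the character vocabulary
block_size, n_embd, n_head, n_layer = 256, 384, 6, 6
head_size = n_embd // n_head

tok_emb = vocab_size * n_embd                       # token embedding table
pos_emb = block_size * n_embd                       # position embedding table
attn_qkv = n_head * 3 * n_embd * head_size          # key/query/value projections, no bias
attn_proj = n_embd * n_embd + n_embd                # output projection with bias
ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
layer_norms = 2 * (2 * n_embd)                      # two LayerNorms, each with weight and bias
per_block = attn_qkv + attn_proj + ffwd + layer_norms

ln_f = 2 * n_embd                                   # final LayerNorm
lm_head = n_embd * vocab_size + vocab_size          # output projection with bias

total = tok_emb + pos_emb + n_layer * per_block + ln_f + lm_head
print(f"{total / 1e6:.2f} M parameters")
```

Almost all of the parameters sit in the six Transformer blocks, and within each block the feed-forward network holds about twice as many as the attention layers. The causal mask is registered as a buffer, so it does not count as a parameter.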

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_111.png>

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_113.png>

# Create the optimizer
optimizer =

# Lesson P8: State of GPT 🧠

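In this lesson we look at how large language models such as GPT are trained and how to apply them effectively to real tasks. The lesson has two parts: the first walks through the full pipeline for training a GPT assistant, and the second discusses how to get the most out of these assistants in practice.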

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_1.png>

## Part 1: How to Train a GPT Assistant 🏗️

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_3.png>

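Training an assistant model like GPT is a multi-stage process with four main stages: pretraining, supervised fine-tuning, reward modeling, and reinforcement learning. We go through each in turn.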

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_5.png>

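### 1. Pretraining: Building the Base Model

Pretraining is the core of the whole process and consumes the vast majority of the compute and time. The goal of this stage is for the model to learn to understand and generate human language.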

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_7.png>

First, we need to collect a massive amount of text. This data usually comes from the internet and mixes many sources.

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_9.png>

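Common sources in the training data mixture include:
*   Common Crawl (web-crawl data)
*   C4 (another widely used web-crawl dataset, itself derived from Common Crawl)
*   Higher-quality datasets such as GitHub code, Wikipedia, books, academic papers, and Stack Exchange Q&A.

These sources are sampled in fixed proportions to form the training set for the neural network. Before training, the text goes through a preprocessing step called tokenization, which losslessly converts raw text into a sequence of integers, the "native" format a GPT model works with. A commonly used algorithm is byte pair encoding (BPE).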

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_11.png>

**Tokenization example**: `"Hello world!"` might be converted to the integer sequence `[15496, 995, 0]`.
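
To give a feel for how byte pair encoding arrives at such ids, here is a toy sketch of a single BPE merge step. It is not the tokenizer used by any GPT model; the helper names, the example string, and the new id 256 are assumptions made for illustration.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))   # start from raw bytes (ids 0..255)
pair = most_frequent_pair(ids)              # the most frequent adjacent pair
merged = merge(ids, pair, 256)              # 256 is the first id beyond the byte range
print(pair, len(ids), "->", len(merged))
```

A real tokenizer repeats this step many thousands of times, recording each merge. Because every merge can be undone by looking up the pair it replaced, the mapping between text and ids stays lossless while the sequences get shorter.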

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_13.png>

Next, let us look at some of the key hyperparameters that govern this stage, using GPT-3 and LLaMA as examples:

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_15.png>

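*   **Vocabulary size**: usually in the tens of thousands (for example, 50,257 tokens).
*   **Context length**: the number of tokens the model can look at once. Early models used 2K or 4K; newer models reach 100K or more, with some now at about 1 million.
*   **Parameter count**: GPT-3 has 175 billion parameters and LLaMA has 65 billion.
*   **Training data volume**: GPT-3 was trained on about 300 billion tokens, while LLaMA was trained on about 1.4 trillion tokens.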

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_17.png>

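How capable a model is depends not only on its parameter count but also, to a large degree, on how much data it is trained on and for how long. The hyperparameters that specify the Transformer architecture include the number of heads, the dimension, the number of layers, and so on. Training a 65-billion-parameter model can take roughly 2,000 GPUs running for 21 days, at a cost of several million dollars.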
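
The figures quoted above can be turned into two quick back-of-the-envelope ratios. The sketch below only does arithmetic on the numbers already given in these notes; it does not estimate a dollar cost, since GPU prices are not stated.

```python
# Parameter and token counts as quoted in the notes above.
models = {
    "GPT-3": {"params": 175e9, "tokens": 300e9},
    "LLaMA": {"params": 65e9, "tokens": 1.4e12},
}

# Training tokens seen per model parameter.
tokens_per_param = {name: m["tokens"] / m["params"] for name, m in models.items()}
for name, ratio in tokens_per_param.items():
    print(f"{name}: {ratio:.1f} training tokens per parameter")

# Total compute time for the quoted "about 2,000 GPUs for 21 days".
gpus, days = 2000, 21
gpu_hours = gpus * days * 24
print(f"{gpu_hours:,} GPU-hours")
```

The smaller model sees more than ten times as many tokens per parameter, which is one concrete way to read the statement that parameter count alone does not determine capability.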

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_19.png>

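So how does pretraining actually work? We arrange the tokenized data into batches. Each batch is an array whose rows are each exactly one context length long. Documents are packed into the rows one after another, separated by a special end-of-text token that tells the model where a new document begins.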
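The packing step can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline any particular model uses: the helper name `pack_documents` is hypothetical, the documents are tiny made-up token lists, and the end-of-text ID 50256 is simply the value GPT-2 uses.

```python
EOT = 50256  # assumed end-of-text token ID (the value used by GPT-2)

def pack_documents(docs, context_length, eot=EOT):
    """Concatenate tokenized documents, separated by `eot`, into fixed-length rows."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eot)  # marks where one document ends
    # Drop the trailing partial row so every row has exactly context_length tokens.
    n_rows = len(stream) // context_length
    return [stream[r * context_length:(r + 1) * context_length] for r in range(n_rows)]

# Three toy "documents" that are already tokenized.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
rows = pack_documents(docs, context_length=4)
for row in rows:
    print(row)
```

Note that a row can start in the middle of one document and end in the middle of another; the end-of-text token is the only signal of the boundary.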

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_21.png>

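The model's task is to predict the next token in the sequence. Take the green cell in the figure: the Transformer looks at all the yellow tokens before it (the context) and tries to predict what the next, red token is. The model outputs a probability for every possible token in the vocabulary.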

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_23.png>

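**Training objective**: compare the model's predicted probabilities with the actual next token (the supervision signal) and use backpropagation to keep adjusting the Transformer's billions of parameters so that its predictions become more and more accurate.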

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_25.png>

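At the start of training the weights are random and the output is gibberish. As training proceeds, the model gradually picks up words, grammar, and the structure of text. We track progress by watching the training loss fall: a lower loss means the model assigns a higher probability to the correct next token.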
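To make the link between loss and probability concrete, here is a small sketch of the per-token loss, assuming the standard cross-entropy formulation (the negative log of the probability assigned to the correct token). The four-token vocabulary and the logit values are made up for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target):
    """Cross-entropy for one position: -log p(correct next token)."""
    probs = softmax(logits)
    return -math.log(probs[target])

# An untrained model is roughly uniform over the vocabulary.
uniform = next_token_loss([0.0, 0.0, 0.0, 0.0], target=2)
# A trained model puts most of its probability on the right token.
confident = next_token_loss([0.0, 0.0, 5.0, 0.0], target=2)
print(f"uniform guess: {uniform:.3f}, confident guess: {confident:.3f}")
```

With a uniform guess over four tokens the loss equals log(4); as the model concentrates probability on the correct token, the loss approaches zero.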

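When pretraining finishes we have a "base model". It turns out that a model trained on such a huge corpus learns powerful general-purpose language representations and can be adapted efficiently to all kinds of downstream tasks.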

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_27.png>

### 2. From Base Model to Assistant Model

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_29.png>

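A base model is essentially a document completer: all it wants to do is continue what it thinks is a document. If you ask it "What is the capital of France?", it might continue with "The capital of France is a common question, and the answer is Paris." instead of simply giving the answer.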

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_31.png>

To turn the model into a useful assistant, it needs further tuning. There are two main routes:

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_33.png>

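**Route 1: Prompt engineering**
We can "trick" the base model into performing a task by carefully designing the input text. With few-shot prompting, for example, we put several question-and-answer examples before the actual question so that the model imitates the format and answers it. We can even construct a document that looks like a conversation between a human and an assistant to coax the base model into playing the assistant role. This approach is not always reliable, however.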

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_35.png>

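**Route 2: Supervised fine-tuning**
This is the more reliable way to create a real assistant model. In this stage we collect a dataset that is small but of high quality.

The dataset is built as follows:
*   Human contractors are hired to write prompts and the corresponding ideal responses, following detailed guidelines (responses should be helpful, truthful, and harmless).
*   Typically tens of thousands of such examples are needed.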

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_37.png>

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_39.png>

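We then continue with the same **language modeling task** on this new dataset. The algorithm is unchanged; only the training data is swapped from internet documents to high-quality prompt-response pairs. The resulting model is called the "SFT model", and it is an assistant that can be deployed directly.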

### 3. Reinforcement Learning from Human Feedback

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_41.png>

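To make the assistant better still, we can add reinforcement learning from human feedback (RLHF). This stage has two steps: reward modeling and reinforcement learning.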

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_43.png>

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_45.png>

**Step 1: Reward modeling**
We change the form of data collection from writing answers to comparing answers.

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_47.png>

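The data collection process:
1.  Use the existing SFT model to generate several (for example, three) different responses to the same prompt.
2.  Ask human labelers to rank these responses by quality.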

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_49.png>

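Next we train a "reward model". Its task: given a prompt and a response, predict a scalar reward that represents the quality of the response. During training we push the reward model's predictions to agree as closely as possible with the human labelers' rankings.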
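The notes do not spell out the loss used to fit the reward model. One common choice, assumed here, is a pairwise ranking loss: a ranking is expanded into (preferred, rejected) pairs, and the model is penalized whenever the preferred response does not get the higher reward. The function names and reward values below are hypothetical.

```python
import math

def ranking_to_pairs(ranked_ids):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    return [
        (ranked_ids[i], ranked_ids[j])
        for i in range(len(ranked_ids))
        for j in range(i + 1, len(ranked_ids))
    ]

def pairwise_ranking_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), written as log(1 + exp(-diff))."""
    diff = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-diff))

# A labeler ranked three responses: "a" is best, "c" is worst.
pairs = ranking_to_pairs(["a", "b", "c"])
print(pairs)

# Made-up scalar rewards from a reward model for the same three responses.
rewards = {"a": 1.5, "b": 0.2, "c": -0.7}
total = sum(pairwise_ranking_loss(rewards[w], rewards[l]) for w, l in pairs)
print(f"total loss: {total:.3f}")
```

The loss is small when the reward model agrees with the human ordering and grows when it scores a rejected response above a preferred one.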

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_51.png>

**Step 2: Reinforcement learning**
We now freeze the reward model and use it to guide further optimization of the SFT model.

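The reinforcement learning loop:
1.  Collect a large batch of prompts.
2.  Generate a response to each prompt with the current SFT model.
3.  Score each response with the reward model.
4.  Adjust the SFT model's parameters so that responses that earned a high reward become more likely in the future and responses that earned a low reward become less likely.

This step typically uses a reinforcement learning algorithm such as Proximal Policy Optimization (PPO). The final result is the "RLHF model". ChatGPT, for example, is an RLHF model.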

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_53.png>

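So why bother with RLHF? Experiments show that humans generally prefer the output of RLHF models. One possible reason is that, for a human, comparing two answers and saying which is better is far easier than writing a perfect answer from scratch. RLHF therefore makes more efficient use of human judgment.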

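Keep in mind, though, that RLHF models are not better than base models in every respect. They can lose some creativity or diversity, and their output becomes more peaked and conservative. When you need diverse output, such as brainstorming creative names, a base model may have the advantage.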

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_55.png>

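At present, most of the strongest assistant models (such as GPT-4 and Claude) have gone through RLHF training, while many open-source models (such as Koala) are SFT models.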

## Part 2: How to Use GPT Assistants Effectively 🛠️

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_57.png>

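With the training process in mind, let's look at how to use these models well in practice. We start with a concrete example that shows how differently a human and an LLM go about solving a problem.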

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_59.png>

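Suppose you want to write the sentence "California's population is 53 times that of Alaska." Your thought process might look like this:
1.  **Awareness**: I need to compare the populations of two states.
2.  **Knowledge retrieval**: I don't know the exact numbers, so I need to look them up on Wikipedia.
3.  **Tool use**: with the numbers in hand, I need a calculator to do the division.
4.  **Reflection and verification**: is 53 times a plausible result? California is the most populous state, so it seems reasonable.
5.  **Drafting and revising**: I try a phrasing, find it clunky, delete it, rewrite it, and settle on the final sentence.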

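This process involves a rich inner monologue, tool use, and repeated verification. GPT, by contrast, only sees one token after another. It spends the same, bounded amount of computation on every token (an 80-layer Transformer, for example, takes 80 steps of "thinking" per token). It has no ongoing inner monologue, and it does not check for mistakes or reach for external tools on its own initiative. It simply imitates the probability of the next token as it appeared in the training data.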

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_61.png>

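We can therefore think of prompt engineering as a bridge across the gap between the cognitive architectures of humans and LLMs. Some core strategies follow.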

### 1. Give the Model "Time to Think"

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_63.png>

An LLM needs tokens in order to "think". For a hard problem you cannot expect it to produce the answer within a single token.

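**Key techniques**:
*   **Chain of thought**: ask the model in the prompt to reason step by step or to show its work. This forces the model to spread its reasoning over many output tokens, which makes a correct answer more likely. For example, open the prompt with "Let's think step by step...".
*   **Self-consistency**: don't sample just once. Have the model generate several responses, then aggregate them by voting or by picking the best one, so that the result does not hinge on the randomness of a single sample.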
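Self-consistency by majority vote can be sketched as follows. The `sample_fn` argument stands in for whatever call produces one sampled answer from a model; here it is replaced by a hypothetical `fake_sampler` that returns canned strings, so the snippet runs without any model.

```python
from collections import Counter

def self_consistent_answer(sample_fn, prompt, n_samples=5):
    """Sample several answers and return the most common one with its vote count."""
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes

# Stand-in for a real model call: returns canned answers, two of them wrong.
_canned = iter(["53", "52", "53", "53", "35"])

def fake_sampler(prompt):
    return next(_canned)

answer, votes = self_consistent_answer(
    fake_sampler,
    "How many times larger is California's population than Alaska's?",
    n_samples=5,
)
print(answer, votes)
```

In practice the final answer has to be extracted from each sampled response before voting, since chain-of-thought outputs rarely match word for word.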

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_65.png>

### 2. Ask Explicitly for High-Quality Output

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_67.png>

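The training data of an LLM contains both high-quality and low-quality answers, and by default the model imitates all of it. You need to ask explicitly for an expert-level answer.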

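**Key techniques**:
*   Specify a role or a quality bar in the prompt, such as "You are a leading physicist" or "Make sure the answer is correct".
*   This helps the model concentrate its probability mass on high-quality output rather than spreading it over every possible continuation.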

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_69.png>

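### 3. Compensate for the Model's Weaknesses

LLMs may be weak at exact arithmetic, at accessing up-to-date information, or at handling particular formats. We make up for this through prompting or external tools.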

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_71.png>

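**Key techniques**:
*   **Tool use**: tell the model explicitly, "You are not very good at mental arithmetic; use the calculator tool provided", and define the format for calling the tool. Many frameworks (such as ReAct) weave tool calls into the model's reasoning process.
*   **Retrieval augmentation**: combine the model's vast internal memory with external retrieval. Using techniques such as a vector database, retrieve the document chunks relevant to the task and insert them into the model's context, which acts as its "working memory". This can greatly improve performance in a specific domain.
*   **Output constraints**: use techniques such as guided (constrained) sampling to force the output into a specific format (such as JSON or XML), so that downstream programs can parse it reliably.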
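The retrieval step can be illustrated with a deliberately tiny sketch. It assumes a bag-of-words "embedding" and cosine similarity in place of a real embedding model and vector database, and the three text chunks are made-up examples with approximate figures; only the overall shape (embed, rank, paste into the prompt) is the point.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would use a learned embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "California has a population of about 39 million",
    "Alaska has a population of about 0.74 million",
    "Byte pair encoding merges frequent pairs",
]
top = retrieve("population of Alaska", chunks, k=1)

# The retrieved text is placed in the context ahead of the actual question.
prompt = "Context:\n" + "\n".join(top) + "\n\nQuestion: What is the population of Alaska?"
print(prompt)
```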

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_73.png>

### 4. Beyond a Single Prompt: Building Systems

Complex tasks usually cannot be completed in a single question-and-answer exchange.

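**Key techniques**:
*   **Prompt chaining**: string several prompts together into a workflow. For example, have the model plan the steps first, then execute them one at a time, and finally summarize.
*   **Reflect and retry**: have the model evaluate whether the answer it generated is correct and, if not, try again. This mimics the way humans correct themselves.
*   **Tree search**: like AlphaGo, maintain several candidate reasoning paths (a tree of thoughts), evaluate and expand them, and pick the best one in the end. This requires Python glue code to coordinate multiple LLM calls.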
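The reflect-and-retry pattern is mostly glue code around model calls. The sketch below assumes two callables, `generate` and `check`, which in a real system would each wrap an LLM call; here they are replaced by hypothetical stand-ins (`fake_generate`, `fake_check`) so that the control flow can be run as is.

```python
def solve_with_retry(generate, check, task, max_attempts=3):
    """Generate an answer, ask a checker to judge it, and retry with feedback if needed."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        answer = generate(task, feedback)
        ok, feedback = check(task, answer)
        if ok:
            return answer, attempt
    return None, max_attempts  # give up after max_attempts failures

# Stand-ins for LLM calls: the first draft is wrong, the revision is right.
def fake_generate(task, feedback):
    return "53" if feedback else "35"

def fake_check(task, answer):
    if answer == "53":
        return True, ""
    return False, "The digits look transposed; recompute the ratio."

result = solve_with_retry(fake_generate, fake_check, "California vs. Alaska population ratio")
print(result)
```

Passing the checker's feedback back into the next generation call is what distinguishes this from simply sampling again.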

### Practical Recommendations and Summary

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_75.png>

For beginners and application developers, the recommended path is:

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_77.png>

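1.  **Start with prompt engineering**: begin with the most capable model available (such as GPT-4) and write detailed prompts that include examples and background information. Keep the "psychology" of LLMs in mind and use techniques such as chain of thought and retrieval augmentation.
2.  **Think about system design**: don't limit yourself to a single prompt. Consider how code can glue together multiple prompts, tool calls, and control logic into a reliable system.
3.  **Consider fine-tuning last**: turn to fine-tuning only once prompt engineering has been exhausted. Supervised fine-tuning is relatively straightforward but needs high-quality data. RLHF is very complex and unstable, and is not recommended for beginners at this point.
4.  **Know the limitations and use the model safely**: always remember that LLMs hallucinate, carry biases, have outdated knowledge, and are vulnerable to attacks. Use them in low-stakes settings, as a "copilot" that offers inspiration and suggestions, and keep a human in the loop.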

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_79.png>

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_81.png>

---

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_83.png>

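**Lesson summary**: we covered the four core stages of training a GPT assistant (pretraining, supervised fine-tuning, reward modeling, reinforcement learning) and the difference between a base model and an assistant model. More importantly, we looked at how prompt engineering, tool use, and system design can bridge the cognitive gap between humans and LLMs, so that these powerful models can be used effectively and reliably in practice. Remember that an LLM is a remarkable "token simulator"; our job is to steer it and to create the conditions in which it can "think".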

# Lesson P9: Building the GPT Tokenizer 🧩

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_1.png>

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_3.png>

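In this lesson we study a key but often overlooked component of large language models (LLMs): the tokenizer. We cover what tokenization is and why it matters, and we implement a byte pair encoding (BPE) tokenizer from scratch. By the end you will understand how tokenization affects model performance and know the core steps for building and training a custom tokenizer.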

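## Overview: What Is Tokenization?

Tokenization is the process of converting a text string into a sequence of integers called tokens, which are the basic units a language model can understand and process. In the earlier lesson "Let's build GPT from scratch" we used a simple character-level tokenizer. LLMs used in practice, such as the GPT series, rely on more sophisticated schemes such as byte pair encoding.

Tokenization is at the root of much of the odd behavior of LLMs, such as trouble with spelling, poor results on non-English languages, and weak arithmetic. Understanding how it works is essential to understanding LLMs in depth.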

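## From Character-Level to Subword Tokenization

The previous section recalled simple character-level tokenization. In this section we turn to the more advanced subword approach.

In character-level tokenization, every character (such as `'h'` or `'i'`) is mapped to its own integer. This is simple, but it produces very long sequences and is inefficient. The sentence "hello there", for example, is encoded as one integer per character.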
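A character-level tokenizer of this kind takes only a few lines. The sketch below builds the vocabulary from the example sentence itself; the names `stoi`, `itos`, `encode_chars`, and `decode_chars` are chosen for illustration.

```python
text = "hello there"

# The vocabulary is the sorted set of characters that occur in the text.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer
itos = {i: ch for ch, i in stoi.items()}      # integer -> character

def encode_chars(s):
    """Map each character to its integer ID."""
    return [stoi[c] for c in s]

def decode_chars(ids):
    """Map integer IDs back to characters."""
    return "".join(itos[i] for i in ids)

ids = encode_chars(text)
print(ids)
print(len(text), "characters ->", len(ids), "tokens")
```

The sequence is exactly as long as the text, which is the inefficiency that subword tokenization addresses.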

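In practice we use subword tokenization. It merges frequent character combinations (such as `'he'` or `'ll'`) into single tokens, which shortens the sequence. Algorithms such as byte pair encoding implement this.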

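## The Byte Pair Encoding Algorithm in Detail

Byte pair encoding began as a data compression algorithm and was later adopted for tokenization in NLP. Its core idea is to iteratively merge the most frequent pair of adjacent bytes in the data.

The basic steps of BPE:

1.  Encode the text as a UTF-8 byte sequence; the initial vocabulary is the 256 byte values (0-255).
2.  Count how often each pair of adjacent bytes occurs.
3.  Find the most frequent pair.
4.  Create a new token for that pair and add it to the vocabulary.
5.  Replace every occurrence of the pair in the data with the new token.
6.  Repeat steps 2-5 until the target vocabulary size is reached or there are no pairs left to merge.

In this way we start from raw bytes and gradually build up tokens that represent frequent character combinations, compressing the text efficiently.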
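A single round of steps 2-5 can be traced on a short string. The sketch below uses the toy input `aaabdaaabac`, which is an assumption of this example rather than something from the lesson, and `collections.Counter` for the counting.

```python
from collections import Counter

# Step 1: text -> UTF-8 bytes -> list of integers (97 is "a", 98 is "b", ...).
ids = list("aaabdaaabac".encode("utf-8"))

# Steps 2-3: count adjacent pairs and pick the most frequent one.
pair_counts = Counter(zip(ids, ids[1:]))
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)

# Steps 4-5: assign the first free ID (256) and replace the pair left to right.
new_id = 256
merged = []
i = 0
while i < len(ids):
    if i < len(ids) - 1 and (ids[i], ids[i + 1]) == top_pair:
        merged.append(new_id)
        i += 2  # skip both members of the pair
    else:
        merged.append(ids[i])
        i += 1
print(len(ids), "->", len(merged), "tokens:", merged)
```

The pair `(97, 97)` occurs four times, and merging it shortens the sequence from 11 to 9 tokens. Repeating the round on `merged` would pick the next most frequent pair.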

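## Implementing a BPE Tokenizer

Now let's implement a basic BPE tokenizer. We will write a training function that learns the merge rules from data, and encode/decode functions that convert between text and tokens.

First we need a function that counts how often each pair occurs.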

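```python
def get_stats(ids):
    """
    Count how often each pair of adjacent elements occurs in a list of integer IDs.

    Args:
        ids: list of integers representing bytes or tokens.

    Returns:
        A dict mapping (element1, element2) tuples to their occurrence counts.
    """
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts
```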

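Next, a function that merges a given pair, which during training will be the most frequent one.

```python
def merge(ids, pair, idx):
    """
    Replace every occurrence of the given pair in the ID sequence with a new ID.

    Args:
        ids: list of integers.
        pair: the pair to merge, e.g. (101, 32).
        idx: the new token ID used as the replacement (e.g. 256).

    Returns:
        The new list of IDs after merging.
    """
    newids = []
    i = 0
    while i < len(ids):
        # If the pair matches at this position, emit the new ID and skip both elements.
        if i < len(ids) - 1 and (ids[i], ids[i+1]) == pair:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids
```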

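Now we can write the training loop, which merges iteratively and builds the vocabulary.

```python
def train_bpe(text, vocab_size):
    """
    Train a BPE tokenizer on a text.

    Args:
        text: the training text as a string.
        vocab_size: the target vocabulary size.

    Returns:
        merges: dict of merge rules mapping a pair (id1, id2) to its new token ID.
        vocab: mapping from token ID to its byte representation.
    """
    # 1. Encode the text as UTF-8 bytes and convert to a list of integers.
    tokens = list(text.encode('utf-8'))
    # The initial vocabulary is the 256 byte values (0-255).
    num_merges = vocab_size - 256
    merges = {}  # (id1, id2) -> new_id
    vocab = {idx: bytes([idx]) for idx in range(256)}  # id -> bytes
    for i in range(num_merges):
        # 2. Count pair frequencies in the current token sequence.
        stats = get_stats(tokens)
        if not stats:
            break
        # 3. Find the most frequent pair.
        top_pair = max(stats, key=stats.get)
        # 4. Assign a new ID (starting at 256).
        idx = 256 + i
        # 5. Record the merge rule.
        merges[top_pair] = idx
        # 6. Update the vocabulary: the new token is the concatenation of its parts' bytes.
        vocab[idx] = vocab[top_pair[0]] + vocab[top_pair[1]]
        # 7. Apply the merge to the sequence.
        tokens = merge(tokens, top_pair, idx)
    return merges, vocab
```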

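## Encoding and Decoding

Once the tokenizer is trained (giving us `merges` and `vocab`), we need to implement encoding (text -> tokens) and decoding (tokens -> text).

Decoding is the simpler direction: map each token ID back to its bytes through `vocab`, concatenate the bytes, and decode them into a string.

```python
def decode(ids, vocab):
    """
    Decode a sequence of token IDs into a text string.

    Args:
        ids: list of token IDs.
        vocab: mapping from token ID to its byte representation.

    Returns:
        The decoded string.
    """
    # Convert each ID to its bytes and concatenate.
    tokens_bytes = b''.join(vocab[idx] for idx in ids)
    # Decode the bytes to a string; 'replace' handles invalid byte sequences.
    text = tokens_bytes.decode('utf-8', errors='replace')
    return text
```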
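The reason for `errors='replace'` is that not every token sequence corresponds to valid UTF-8: a model may emit a lone byte token that cannot stand on its own. The short check below illustrates this with the single byte 128, which is a UTF-8 continuation byte.

```python
raw = bytes([128])  # 0x80 is a continuation byte and is invalid on its own

# Strict decoding raises an error on invalid input.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With errors="replace", the bad byte becomes U+FFFD, the replacement character.
lenient = raw.decode("utf-8", errors="replace")
print(strict_ok, repr(lenient))
```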

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_5.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_7.png

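Encoding has to replay the merges learned during training: convert the text to bytes, then apply the merge rules repeatedly, always taking the earliest-learned applicable rule first.

```python
def encode(text, merges):
    """
    Encode a text string into a sequence of token IDs.

    Args:
        text: the input text.
        merges: the merge rules learned during training.

    Returns:
        A list of token IDs.
    """
    # Convert the text to UTF-8 bytes, then to a list of integers.
    tokens = list(text.encode('utf-8'))
    # Keep merging for as long as some pair can still be merged.
    while True:
        stats = get_stats(tokens)
        # Find the mergeable pair with the highest priority (lowest new ID in merges).
        pair_to_merge = None
        min_idx = float('inf')
        for pair in stats:
            idx = merges.get(pair)
            if idx is not None and idx < min_idx:
                min_idx = idx
                pair_to_merge = pair
        # Stop when no pair can be merged.
        if pair_to_merge is None:
            break
        # Apply the merge.
        idx = merges[pair_to_merge]
        tokens = merge(tokens, pair_to_merge, idx)
    return tokens
```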
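To see training, encoding, and decoding work together, here is a round-trip check. So that the snippet runs on its own, it restates the functions above in compact, equivalent form; the toy corpus, the sample string, and the vocabulary size of 260 are assumptions chosen for illustration.

```python
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    tokens = list(text.encode("utf-8"))
    merges, vocab = {}, {i: bytes([i]) for i in range(256)}
    for i in range(vocab_size - 256):
        stats = get_stats(tokens)
        if not stats:
            break
        pair = max(stats, key=stats.get)
        merges[pair] = 256 + i
        vocab[256 + i] = vocab[pair[0]] + vocab[pair[1]]
        tokens = merge(tokens, pair, 256 + i)
    return merges, vocab

def encode(text, merges):
    tokens = list(text.encode("utf-8"))
    while len(tokens) >= 2:
        # Pick the pair whose merge was learned earliest; unknown pairs rank last.
        pair = min(get_stats(tokens), key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break
        tokens = merge(tokens, pair, merges[pair])
    return tokens

def decode(ids, vocab):
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

corpus = "low lower lowest low low"
merges, vocab = train_bpe(corpus, vocab_size=260)  # 256 bytes + 4 merges

sample = "lowest 你好"  # includes characters never seen in training
ids = encode(sample, merges)
print(len(sample.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(decode(ids, vocab) == sample)
```

Because the base vocabulary covers all 256 byte values, text that never appeared in training still encodes and decodes correctly; it just compresses less.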

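## Complications in Real Tokenizers

What we implemented above is a basic, purely algorithmic BPE tokenizer. Tokenizers used in practice (as in GPT-2 and GPT-4) add further rules to handle more complicated cases.

**Pre-splitting rules**: GPT-2, for example, uses a fairly complex regular expression to split the text into chunks (letters, numbers, punctuation, and so on) before any BPE merge happens. Merges are then only allowed inside a chunk, never across chunk boundaries. This keeps a common word such as "dog" from being glued to whatever punctuation follows it and ending up as separate tokens for "dog.", "dog!", and "dog?", which makes tokenization more consistent.

**Special tokens**: besides the tokens learned from data, a tokenizer also introduces special tokens, such as `<|endoftext|>` to separate documents, or tokens that mark user, assistant, and system messages in chat models. These tokens have their own IDs in the vocabulary and are handled specially during processing.

**Effect of vocabulary size**: vocabulary size is a key hyperparameter. A vocabulary that is too small (character-level, for instance) leads to very long sequences and wastes compute. A vocabulary that is too large makes each token rarer, which can leave its embedding undertrained, and it also increases the cost of the model's output layer. Current state-of-the-art models typically use somewhere between tens of thousands and about a hundred thousand tokens.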
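The pre-splitting idea can be tried out with the standard library. The pattern below is a simplified approximation written for this note, not GPT-2's actual pattern: the original relies on Unicode property classes (`\p{L}`, `\p{N}`) that require the third-party `regex` module, whereas this sketch uses rough `re` equivalents.

```python
import re

# Simplified stand-in for a GPT-2-style split pattern (an assumption of this sketch).
SPLIT_PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"  # common English contractions
    r"| ?[^\W\d_]+"             # optional leading space + letters
    r"| ?\d+"                   # optional leading space + digits
    r"| ?(?:[^\s\w]|_)+"        # optional leading space + punctuation or symbols
    r"|\s+(?!\S)|\s+"           # runs of whitespace
)

text = "Hello world! dog. dog!"
chunks = SPLIT_PATTERN.findall(text)
print(chunks)

# BPE would then run on each chunk separately, so "dog" and "." can never merge.
chunk_bytes = [list(chunk.encode("utf-8")) for chunk in chunks]
```

Each occurrence of " dog" ends up as its own chunk, separate from the punctuation that follows it, and joining the chunks gives back the original text.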

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_9.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_11.png

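## How the Tokenizer Relates to Model Training

It is important to be clear that training the tokenizer and training the language model are two separate stages.

*   **Tokenizer training**: take a representative dataset (possibly different from the model's training set), run the BPE algorithm, and fix the merge rules and the final vocabulary. This produces the two core components, `merges` and `vocab`.
*   **Model training**: use the trained tokenizer to convert the entire, massive model training text into token sequences. These sequences are stored, and the language model is trained on them to predict the next token.

This separation means the tokenizer can be designed and tuned for specific goals (such as multilingual support or code) on its own, before any expensive model training starts. Once a model has been trained on a tokenizer's output, however, it is tied to that vocabulary.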

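## Summary

In this lesson we covered the core ideas behind building a GPT tokenizer:

*   **Why tokenization matters**: tokenization is the bridge through which text enters an LLM, and its design directly affects how well the model handles spelling, multiple languages, arithmetic, and code.
*   **How BPE works**: it builds a vocabulary by iteratively merging the most frequent pair, giving a compressed representation that goes from bytes up to subwords.
*   **Tokenizer implementation**: we implemented the core functions `train_bpe`, `encode`, and `decode` and ended up with a working basic tokenizer.
*   **Practical considerations**: real tokenizers (such as OpenAI's tiktoken) add pre-splitting rules and special tokens, and involve design choices such as vocabulary size.
*   **Training flow**: tokenizer training and language model training are two separate stages, carried out one after the other.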

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_13.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_15.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_17.png

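Tokenization is only a preprocessing step, but it has a deep influence on the behavior and capabilities of a language model. Hopefully this tutorial has taken some of the mystery out of it and gives you a solid base for understanding and using large language models.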
