History of language models
🏗 Early foundation models (late 2010s)
- 2018: ELMo (LSTM-based pretraining + fine-tuning) [Peters+ 2018]
- 2018: BERT (Transformer-based pretraining + fine-tuning) [Devlin+ 2018]
- 2019: Google T5 (unified everything as text-to-text) [Raffel+ 2019]
🚀 Scaling & closed models (early 2020s)
- 2019: OpenAI GPT-2 (1.5B): fluent text generation, early zero-shot behavior [Radford+ 2019]
- 2020: Scaling laws proposed, predicting the performance of larger models [Kaplan+ 2020]
- 2020: OpenAI GPT-3 (175B): in-context learning [Brown+ 2020]
- 2022: Google PaLM (540B): very large but undertrained [Chowdhery+ 2022]
- 2022: DeepMind Chinchilla (70B): compute-optimal scaling [Hoffmann+ 2022]
🌍 Open models (mid 2020s)
- 2020/2021: EleutherAI: The Pile dataset + GPT-J [Gao+ 2020][Wang+ 2021]
- 2022: Meta OPT (175B): a GPT-3 replication [Zhang+ 2022]
- 2022: Hugging Face / BigScience BLOOM: focus on data sourcing [Workshop+ 2022]
- 2023: Meta LLaMA series [Touvron+ 2023]
- 2024: Alibaba Qwen series [Qwen+ 2024]
- 2024: DeepSeek series [DeepSeek-AI+ 2024]
- 2024: AI2 OLMo 2 [Groeneveld+ 2024][OLMo+ 2024]
🔓 Levels of openness
- 2023: Closed models, e.g., OpenAI GPT-4o [OpenAI+ 2023]
- 2024: Open-weight models, e.g., DeepSeek [DeepSeek-AI+ 2024]
- 2024: Open-source models, e.g., OLMo (open weights + open data) [Groeneveld+ 2024]
🌌 Today’s frontier models (2025)
- 2025: OpenAI o3 → https://openai.com/index/openai-o3-mini/
- 2025: Anthropic Claude 3.7 Sonnet → https://www.anthropic.com/news/claude-3-7-sonnet
- 2025: xAI Grok 3 → https://x.ai/news/grok-3
- 2025: Google Gemini 2.5 → https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
- 2025: Meta LLaMA 3.3 → https://ai.meta.com/blog/meta-llama-3/
- 2025: DeepSeek-R1 → [DeepSeek-AI+ 2025]
- 2025: Alibaba Qwen2.5-Max → https://qwenlm.github.io/blog/qwen2.5-max/
- 2025: Tencent Hunyuan-T1 → https://tencent.github.io/llm.hunyuan.T1/README_EN.html
Efficiency components
Basics
- Tokenization
- Architecture
- Loss function
- Optimizer
- Learning rate
Systems
- Kernels
- Parallelism
- Quantization
- Activation checkpointing
- CPU offloading
- Inference
Scaling laws
- Scaling sequence
- Model complexity
- Loss metric
- Parametric form
Data
- Evaluation
- Curation
- Transformation
- Filtering
- Deduplication
- Mixing
Alignment
- Supervised fine-tuning
- Reinforcement learning
- Preference data
- Synthetic data
- Verifiers
Tokenization
Byte-Pair Encoding (BPE) tokenizer [Sennrich+ 2015]
👉 The core idea: repeatedly find the most frequent pair of symbols in the corpus, merge it into a new token, and iterate until the target vocabulary size is reached. BPE has become the de facto standard tokenization scheme for most mainstream LLMs (e.g., the GPT series); a minimal sketch of the merge loop is given below.
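To make the merge loop concrete, here is a toy Python sketch of BPE training. It works on the characters of a tiny made-up word list; the helper names (`most_frequent_pair`, `merge_pair`) and the vocabulary size are illustrative assumptions, not any production tokenizer.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences; return the most common one."""
    pairs = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair, new_symbol):
    """Replace every occurrence of `pair` in `seq` with `new_symbol`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(corpus, vocab_size):
    """Toy BPE trainer: start from characters, repeatedly merge the most frequent pair."""
    sequences = [list(word) for word in corpus]
    vocab = sorted({ch for seq in sequences for ch in seq})
    merges = []
    while len(vocab) < vocab_size:
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        new_symbol = pair[0] + pair[1]   # the merged token is just the concatenation
        merges.append(pair)
        vocab.append(new_symbol)
        sequences = [merge_pair(seq, pair, new_symbol) for seq in sequences]
    return vocab, merges

if __name__ == "__main__":
    corpus = ["low", "lower", "lowest", "newer", "wider"]
    vocab, merges = train_bpe(corpus, vocab_size=20)
    print("merges:", merges)
    print("vocab:", vocab)
```

Production tokenizers (e.g., GPT-style byte-level BPE) start from bytes rather than characters, weight words by their corpus frequency, and apply pre-tokenization rules, but the merge loop above is the core mechanism.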
There are also lines of work that skip the tokenizer altogether:
For example, the tokenizer-free approaches of [Xue+ 2021][Yu+ 2023][Pagnoni+ 2024][Deiseroth+ 2024] operate directly on bytes.
These methods are promising, since they remove the brittle tokenization step, but they have not yet been adopted at the scale of BPE in today's frontier models.
Architecture
Variants:
- Activation functions: ReLU, SwiGLU [Shazeer 2020]
- Positional encodings: sinusoidal, RoPE [Su+ 2021]
- Normalization: LayerNorm, RMSNorm [Ba+ 2016][Zhang+ 2019] (a sketch of RMSNorm and SwiGLU follows this list)
- Placement of normalization: pre-norm versus post-norm [Xiong+ 2020]
- MLP: dense, mixture of experts [Shazeer+ 2017]
- Attention: full, sliding window, linear [Jiang+ 2023][Katharopoulos+ 2020]
- Lower-dimensional attention: grouped-query attention (GQA), multi-head latent attention (MLA) [Ainslie+ 2023][DeepSeek-AI+ 2024]
- State-space models: Hyena [Poli+ 2023]
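As a concrete reference for two of the variants above, here is a minimal NumPy sketch of RMSNorm [Zhang+ 2019] and a SwiGLU feed-forward block [Shazeer 2020]. The shapes, epsilon, and random weights are illustrative assumptions, not any specific model's configuration.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square of the features (no mean subtraction, no bias)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def swish(x):
    """Swish / SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: (swish(x @ W_gate) * (x @ W_up)) @ W_down."""
    return (swish(x @ w_gate) * (x @ w_up)) @ w_down

if __name__ == "__main__":
    d_model, d_ff = 8, 16                       # illustrative sizes
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, d_model))           # (batch, d_model)
    gain = np.ones(d_model)
    w_gate = rng.normal(size=(d_model, d_ff)) * 0.1
    w_up   = rng.normal(size=(d_model, d_ff)) * 0.1
    w_down = rng.normal(size=(d_ff, d_model)) * 0.1
    h = rms_norm(x, gain)                       # pre-norm: normalize before the sublayer
    out = x + swiglu_mlp(h, w_gate, w_up, w_down)   # then add the residual
    print(out.shape)                            # (4, 8)
```

The usage at the bottom also illustrates the pre-norm placement from the list: the input is normalized before entering the MLP, and the residual is added to the unnormalized input.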
Training
- Optimizer (e.g., AdamW, Muon, SOAP)
- Learning rate schedule (e.g., cosine, WSD; a warmup + cosine sketch follows this list)
- Batch size (e.g., critical batch size)
- Regularization (e.g., dropout, weight decay)
- Hyperparameters (number of heads, hidden dimension): grid search
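To illustrate the learning rate schedule item, here is a minimal sketch of a linear-warmup + cosine-decay schedule. The warmup length, peak, and minimum learning rates are made-up values for illustration, not a recommendation.

```python
import math

def cosine_schedule(step, total_steps, warmup_steps, peak_lr, min_lr):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

if __name__ == "__main__":
    total_steps, warmup_steps = 10_000, 500     # illustrative values
    for step in [0, 250, 500, 5_000, 10_000]:
        lr = cosine_schedule(step, total_steps, warmup_steps, peak_lr=3e-4, min_lr=3e-5)
        print(f"step {step:>6}: lr = {lr:.2e}")
```

A WSD (warmup-stable-decay) schedule differs by holding the learning rate constant after warmup for most of training and decaying it only in a final phase.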