LLaMA 3
On April 18, 2024, Meta released Meta Llama 3, the next generation of its state-of-the-art open large language models. The release includes pretrained and instruction-fine-tuned models with 8B and 70B parameters that support a broad range of use cases. This generation of Llama demonstrates state-of-the-art performance on a wide range of industry benchmarks and adds new capabilities, including improved reasoning.
Versions and Performance
The new 8B and 70B parameter Llama 3 models are a major leap over Llama 2 and establish a new state of the art for LLMs at these scales. Thanks to improvements in pretraining and post-training, they are the best models available today at the 8B and 70B parameter scales. Improvements in our post-training procedures substantially reduced false refusal rates, improved alignment, and increased the diversity of model responses. We also saw large gains in capabilities such as reasoning, code generation, and instruction following, making Llama 3 more steerable.
Model Architecture
Architecturally, LLaMA 3 is essentially unchanged from LLaMA 2: it uses a decoder-only Transformer with RMSNorm pre-normalization, the SwiGLU activation function, rotary position embeddings (RoPE), and the improved grouped-query attention (GQA) mechanism, with an increased context length. These components are therefore not explained in detail again here.
For details of these techniques, refer to the earlier LLaMA 2 blog post.
The model code is shown below; it comes from the LLaMA 3 repository: https://github.com/meta-llama/llama3
```python
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed in accordance with the terms of the Llama 3 Community License Agreement.

import math
from dataclasses import dataclass
from typing import Optional, Tuple

import fairscale.nn.model_parallel.initialize as fs_init
import torch
import torch.nn.functional as F
from fairscale.nn.model_parallel.layers import (
    ColumnParallelLinear,
    RowParallelLinear,
    VocabParallelEmbedding,
)
from torch import nn


@dataclass
class ModelArgs:
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: Optional[int] = None
    vocab_size: int = -1
    multiple_of: int = 256  # make SwiGLU hidden layer size multiple of large power of 2
    ffn_dim_multiplier: Optional[float] = None
    norm_eps: float = 1e-5
    rope_theta: float = 500000

    max_batch_size: int = 32
    max_seq_len: int = 2048


class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        output = self._norm(x.float()).type_as(x)
        return output * self.weight


def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, device=freqs.device, dtype=torch.float32)
    freqs = torch.outer(t, freqs)
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64
    return freqs_cis


def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
    ndim = x.ndim
    assert 0 <= 1 < ndim
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)


def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)


def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """torch.repeat_interleave(x, dim=2, repeats=n_rep)"""
    bs, slen, n_kv_heads, head_dim = x.shape
    if n_rep == 1:
        return x
    return (
        x[:, :, :, None, :]
        .expand(bs, slen, n_kv_heads, n_rep, head_dim)
        .reshape(bs, slen, n_kv_heads * n_rep, head_dim)
    )


class Attention(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
        model_parallel_size = fs_init.get_model_parallel_world_size()
        self.n_local_heads = args.n_heads // model_parallel_size
        self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
        self.n_rep = self.n_local_heads // self.n_local_kv_heads
        self.head_dim = args.dim // args.n_heads

        self.wq = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wk = ColumnParallelLinear(
            args.dim,
            self.n_kv_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wv = ColumnParallelLinear(
            args.dim,
            self.n_kv_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wo = RowParallelLinear(
            args.n_heads * self.head_dim,
            args.dim,
            bias=False,
            input_is_parallel=True,
            init_method=lambda x: x,
        )

        self.cache_k = torch.zeros(
            (
                args.max_batch_size,
                args.max_seq_len,
                self.n_local_kv_heads,
                self.head_dim,
            )
        ).cuda()
        self.cache_v = torch.zeros(
            (
                args.max_batch_size,
                args.max_seq_len,
                self.n_local_kv_heads,
                self.head_dim,
            )
        ).cuda()

    def forward(
        self,
        x: torch.Tensor,
        start_pos: int,
        freqs_cis: torch.Tensor,
        mask: Optional[torch.Tensor],
    ):
        bsz, seqlen, _ = x.shape
        xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)

        xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
        xk = xk.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
        xv = xv.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)

        xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)

        self.cache_k = self.cache_k.to(xq)
        self.cache_v = self.cache_v.to(xq)

        self.cache_k[:bsz, start_pos : start_pos + seqlen] = xk
        self.cache_v[:bsz, start_pos : start_pos + seqlen] = xv

        keys = self.cache_k[:bsz, : start_pos + seqlen]
        values = self.cache_v[:bsz, : start_pos + seqlen]

        # repeat k/v heads if n_kv_heads < n_heads
        keys = repeat_kv(
            keys, self.n_rep
        )  # (bs, cache_len + seqlen, n_local_heads, head_dim)
        values = repeat_kv(
            values, self.n_rep
        )  # (bs, cache_len + seqlen, n_local_heads, head_dim)

        xq = xq.transpose(1, 2)  # (bs, n_local_heads, seqlen, head_dim)
        keys = keys.transpose(1, 2)  # (bs, n_local_heads, cache_len + seqlen, head_dim)
        values = values.transpose(
            1, 2
        )  # (bs, n_local_heads, cache_len + seqlen, head_dim)
        scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores + mask  # (bs, n_local_heads, seqlen, cache_len + seqlen)
        scores = F.softmax(scores.float(), dim=-1).type_as(xq)
        output = torch.matmul(scores, values)  # (bs, n_local_heads, seqlen, head_dim)
        output = output.transpose(1, 2).contiguous().view(bsz, seqlen, -1)
        return self.wo(output)


class FeedForward(nn.Module):
    def __init__(
        self,
        dim: int,
        hidden_dim: int,
        multiple_of: int,
        ffn_dim_multiplier: Optional[float],
    ):
        super().__init__()
        hidden_dim = int(2 * hidden_dim / 3)
        # custom dim factor multiplier
        if ffn_dim_multiplier is not None:
            hidden_dim = int(ffn_dim_multiplier * hidden_dim)
        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

        self.w1 = ColumnParallelLinear(
            dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
        )
        self.w2 = RowParallelLinear(
            hidden_dim, dim, bias=False, input_is_parallel=True, init_method=lambda x: x
        )
        self.w3 = ColumnParallelLinear(
            dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
        )

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, args: ModelArgs):
        super().__init__()
        self.n_heads = args.n_heads
        self.dim = args.dim
        self.head_dim = args.dim // args.n_heads
        self.attention = Attention(args)
        self.feed_forward = FeedForward(
            dim=args.dim,
            hidden_dim=4 * args.dim,
            multiple_of=args.multiple_of,
            ffn_dim_multiplier=args.ffn_dim_multiplier,
        )
        self.layer_id = layer_id
        self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
        self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)

    def forward(
        self,
        x: torch.Tensor,
        start_pos: int,
        freqs_cis: torch.Tensor,
        mask: Optional[torch.Tensor],
    ):
        h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
        out = h + self.feed_forward(self.ffn_norm(h))
        return out


class Transformer(nn.Module):
    def __init__(self, params: ModelArgs):
        super().__init__()
        self.params = params
        self.vocab_size = params.vocab_size
        self.n_layers = params.n_layers

        self.tok_embeddings = VocabParallelEmbedding(
            params.vocab_size, params.dim, init_method=lambda x: x
        )

        self.layers = torch.nn.ModuleList()
        for layer_id in range(params.n_layers):
            self.layers.append(TransformerBlock(layer_id, params))

        self.norm = RMSNorm(params.dim, eps=params.norm_eps)
        self.output = ColumnParallelLinear(
            params.dim, params.vocab_size, bias=False, init_method=lambda x: x
        )

        self.freqs_cis = precompute_freqs_cis(
            params.dim // params.n_heads,
            params.max_seq_len * 2,
            params.rope_theta,
        )

    @torch.inference_mode()
    def forward(self, tokens: torch.Tensor, start_pos: int):
        _bsz, seqlen = tokens.shape
        h = self.tok_embeddings(tokens)
        self.freqs_cis = self.freqs_cis.to(h.device)
        freqs_cis = self.freqs_cis[start_pos : start_pos + seqlen]

        mask = None
        if seqlen > 1:
            mask = torch.full((seqlen, seqlen), float("-inf"), device=tokens.device)

            mask = torch.triu(mask, diagonal=1)

            # When performing key-value caching, we compute the attention scores
            # only for the new sequence. Thus, the matrix of scores is of size
            # (seqlen, cache_len + seqlen), and the only masked entries are (i, j) for
            # j > cache_len + i, since row i corresponds to token cache_len + i.
            mask = torch.hstack(
                [torch.zeros((seqlen, start_pos), device=tokens.device), mask]
            ).type_as(h)

        for layer in self.layers:
            h = layer(h, start_pos, freqs_cis, mask)
        h = self.norm(h)
        output = self.output(h).float()
        return output
```
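The two pieces of this code that differ most from a vanilla Transformer are the rotary position embeddings and the GQA head expansion. The short standalone sketch below (not part of the repository) exercises `precompute_freqs_cis`, `apply_rotary_emb`, and `repeat_kv` on random tensors. It assumes the listing above is saved as `model.py` and that torch and fairscale are installed; importing the module does not require a GPU or a model-parallel group.

```python
# Minimal sketch exercising the RoPE and GQA helpers defined above.
# Assumes the code above is saved as model.py; no GPU or distributed setup needed.
import torch

from model import apply_rotary_emb, precompute_freqs_cis, repeat_kv

bsz, seqlen, n_heads, n_kv_heads, head_dim = 2, 16, 32, 8, 128

# Rotary embeddings: one complex rotation per (position, head_dim/2) pair.
freqs_cis = precompute_freqs_cis(head_dim, seqlen, theta=500000.0)
xq = torch.randn(bsz, seqlen, n_heads, head_dim)
xk = torch.randn(bsz, seqlen, n_kv_heads, head_dim)
xq_rot, xk_rot = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)
print(xq_rot.shape, xk_rot.shape)  # shapes unchanged: the rotation only mixes adjacent pairs

# GQA: each of the 8 KV heads is shared by 32 // 8 = 4 query heads, so the
# cached K/V tensors are expanded before the usual attention product.
keys = torch.randn(bsz, seqlen, n_kv_heads, head_dim)
expanded = repeat_kv(keys, n_heads // n_kv_heads)
print(expanded.shape)  # (2, 16, 32, 128)
```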
Tokenizer
LLaMA 3 uses an improved tokenizer: a tiktoken-based BPE tokenizer with a 128K-token vocabulary, which encodes text more efficiently and speeds up the handling of long inputs.
```python
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed in accordance with the terms of the Llama 3 Community License Agreement.

import os
from logging import getLogger
from pathlib import Path
from typing import (
    AbstractSet,
    cast,
    Collection,
    Dict,
    Iterator,
    List,
    Literal,
    Sequence,
    TypedDict,
    Union,
)

import tiktoken
from tiktoken.load import load_tiktoken_bpe

logger = getLogger(__name__)


Role = Literal["system", "user", "assistant"]


class Message(TypedDict):
    role: Role
    content: str


Dialog = Sequence[Message]


class Tokenizer:
    """
    Tokenizing and encoding/decoding text using the Tiktoken tokenizer.
    """

    special_tokens: Dict[str, int]

    num_reserved_special_tokens = 256

    pat_str = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"  # noqa: E501

    def __init__(self, model_path: str):
        """
        Initializes the Tokenizer with a Tiktoken model.

        Args:
            model_path (str): The path to the Tiktoken model file.
        """
        assert os.path.isfile(model_path), model_path

        mergeable_ranks = load_tiktoken_bpe(model_path)
        num_base_tokens = len(mergeable_ranks)
        special_tokens = [
            "<|begin_of_text|>",
            "<|end_of_text|>",
            "<|reserved_special_token_0|>",
            "<|reserved_special_token_1|>",
            "<|reserved_special_token_2|>",
            "<|reserved_special_token_3|>",
            "<|start_header_id|>",
            "<|end_header_id|>",
            "<|reserved_special_token_4|>",
            "<|eot_id|>",  # end of turn
        ] + [
            f"<|reserved_special_token_{i}|>"
            for i in range(5, self.num_reserved_special_tokens - 5)
        ]
        self.special_tokens = {
            token: num_base_tokens + i for i, token in enumerate(special_tokens)
        }
        self.model = tiktoken.Encoding(
            name=Path(model_path).name,
            pat_str=self.pat_str,
            mergeable_ranks=mergeable_ranks,
            special_tokens=self.special_tokens,
        )
        logger.info(f"Reloaded tiktoken model from {model_path}")

        self.n_words: int = self.model.n_vocab
        # BOS / EOS token IDs
        self.bos_id: int = self.special_tokens["<|begin_of_text|>"]
        self.eos_id: int = self.special_tokens["<|end_of_text|>"]
        self.pad_id: int = -1
        self.stop_tokens = {
            self.special_tokens["<|end_of_text|>"],
            self.special_tokens["<|eot_id|>"],
        }
        logger.info(
            f"#words: {self.n_words} - BOS ID: {self.bos_id} - EOS ID: {self.eos_id}"
        )

    def encode(
        self,
        s: str,
        *,
        bos: bool,
        eos: bool,
        allowed_special: Union[Literal["all"], AbstractSet[str]] = set(),
        disallowed_special: Union[Literal["all"], Collection[str]] = (),
    ) -> List[int]:
        """
        Encodes a string into a list of token IDs.

        Args:
            s (str): The input string to be encoded.
            bos (bool): Whether to prepend the beginning-of-sequence token.
            eos (bool): Whether to append the end-of-sequence token.
            allowed_special ("all"|set[str]): allowed special tokens in string
            disallowed_special ("all"|set[str]): special tokens that raise an error when in string

        Returns:
            list[int]: A list of token IDs.

        By default, setting disallowed_special=() encodes a string by ignoring
        special tokens. Specifically:
        - Setting `disallowed_special` to () will cause all text corresponding
          to special tokens to be encoded as natural text (instead of raising
          an error).
        - Setting `allowed_special` to "all" will treat all text corresponding
          to special tokens to be encoded as special tokens.
        """
        assert type(s) is str

        # The tiktoken tokenizer can handle <=400k chars without
        # pyo3_runtime.PanicException.
        TIKTOKEN_MAX_ENCODE_CHARS = 400_000

        # https://github.com/openai/tiktoken/issues/195
        # Here we iterate over subsequences and split if we exceed the limit
        # of max consecutive non-whitespace or whitespace characters.
        MAX_NO_WHITESPACES_CHARS = 25_000

        substrs = (
            substr
            for i in range(0, len(s), TIKTOKEN_MAX_ENCODE_CHARS)
            for substr in self._split_whitespaces_or_nonwhitespaces(
                s[i : i + TIKTOKEN_MAX_ENCODE_CHARS], MAX_NO_WHITESPACES_CHARS
            )
        )
        t: List[int] = []
        for substr in substrs:
            t.extend(
                self.model.encode(
                    substr,
                    allowed_special=allowed_special,
                    disallowed_special=disallowed_special,
                )
            )
        if bos:
            t.insert(0, self.bos_id)
        if eos:
            t.append(self.eos_id)
        return t

    def decode(self, t: Sequence[int]) -> str:
        """
        Decodes a list of token IDs into a string.

        Args:
            t (List[int]): The list of token IDs to be decoded.

        Returns:
            str: The decoded string.
        """
        # Typecast is safe here. Tiktoken doesn't do anything list-related with the sequence.
        return self.model.decode(cast(List[int], t))

    @staticmethod
    def _split_whitespaces_or_nonwhitespaces(
        s: str, max_consecutive_slice_len: int
    ) -> Iterator[str]:
        """
        Splits the string `s` so that each substring contains no more than
        `max_consecutive_slice_len` consecutive whitespaces or consecutive
        non-whitespaces.
        """
        current_slice_len = 0
        current_slice_is_space = s[0].isspace() if len(s) > 0 else False
        slice_start = 0

        for i in range(len(s)):
            is_now_space = s[i].isspace()

            if current_slice_is_space ^ is_now_space:
                current_slice_len = 1
                current_slice_is_space = is_now_space
            else:
                current_slice_len += 1
                if current_slice_len > max_consecutive_slice_len:
                    yield s[slice_start:i]
                    slice_start = i
                    current_slice_len = 1
        yield s[slice_start:]


class ChatFormat:
    def __init__(self, tokenizer: Tokenizer):
        self.tokenizer = tokenizer

    def encode_header(self, message: Message) -> List[int]:
        tokens = []
        tokens.append(self.tokenizer.special_tokens["<|start_header_id|>"])
        tokens.extend(self.tokenizer.encode(message["role"], bos=False, eos=False))
        tokens.append(self.tokenizer.special_tokens["<|end_header_id|>"])
        tokens.extend(self.tokenizer.encode("\n\n", bos=False, eos=False))
        return tokens

    def encode_message(self, message: Message) -> List[int]:
        tokens = self.encode_header(message)
        tokens.extend(
            self.tokenizer.encode(message["content"].strip(), bos=False, eos=False)
        )
        tokens.append(self.tokenizer.special_tokens["<|eot_id|>"])
        return tokens

    def encode_dialog_prompt(self, dialog: Dialog) -> List[int]:
        tokens = []
        tokens.append(self.tokenizer.special_tokens["<|begin_of_text|>"])
        for message in dialog:
            tokens.extend(self.encode_message(message))
        # Add the start of an assistant message for the model to complete.
        tokens.extend(self.encode_header({"role": "assistant", "content": ""}))
        return tokens
```
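As a usage illustration (not part of the repository code), the sketch below loads the tokenizer and builds a chat prompt with `ChatFormat`. It assumes the listing above is saved as `tokenizer.py`, and the model path is a placeholder for wherever your downloaded `tokenizer.model` lives.

```python
# Illustrative usage sketch; the model path below is a placeholder, not a real file.
from tokenizer import ChatFormat, Tokenizer

tok = Tokenizer(model_path="Meta-Llama-3-8B-Instruct/tokenizer.model")  # placeholder path
fmt = ChatFormat(tok)

# Plain text round trip.
ids = tok.encode("Hello, Llama 3!", bos=True, eos=False)
print(ids)
print(tok.decode(ids))

# Chat prompt: header tokens wrap each role, <|eot_id|> closes each turn, and an
# empty assistant header is appended for the model to complete.
dialog = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is grouped-query attention?"},
]
prompt_ids = fmt.encode_dialog_prompt(dialog)
print(len(prompt_ids), prompt_ids[:8])
```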
- To guard against performance problems caused by very long strings, the `encode` method loops over substrings of at most 400,000 characters. This avoids runtime errors, such as memory errors or panics in the native extension libraries (written in C or Rust) that Python calls into.
- The `_split_whitespaces_or_nonwhitespaces` helper handles long runs of consecutive whitespace or non-whitespace characters, capping each slice at 25,000 characters. This keeps encoding flexible while avoiding the problems that extremely long slices can cause (a short demo follows this list).
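The demo below shows the chunking rule on a tiny string with an artificially small limit; it only assumes that `tokenizer.py` (and tiktoken) is importable, since `_split_whitespaces_or_nonwhitespaces` is a static method that needs no model file.

```python
# Small demo of the chunking rule described above; no tokenizer.model is needed.
from tokenizer import Tokenizer

s = "aaaaaa   bbb cc"
# With a limit of 4, the six-character run of 'a's triggers a split after the
# fourth character; everything after the split point is yielded as one final chunk.
chunks = list(Tokenizer._split_whitespaces_or_nonwhitespaces(s, max_consecutive_slice_len=4))
print(chunks)  # ['aaaa', 'aa   bbb cc']
```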
Training Data
Training the best language models starts with curating a large, high-quality training dataset, and Meta AI invested heavily in pretraining data. Llama 3 is pretrained on over 15T tokens, all collected from publicly available sources. The training dataset is seven times larger than the one used for Llama 2 and includes four times more code. To prepare for upcoming multilingual use cases, more than 5% of the Llama 3 pretraining data consists of high-quality non-English data covering over 30 languages, although the same level of performance as in English is not expected for these languages.
To ensure Llama 3 is trained on data of the highest quality, we developed a series of data-filtering pipelines. These pipelines use heuristic filters, NSFW filters, semantic deduplication, and text classifiers that predict data quality. We found that previous generations of Llama are surprisingly good at identifying high-quality data, so we used Llama 2 to generate the training data for the text-quality classifiers that power Llama 3.
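Meta has not released these pipelines, so the sketch below is only a toy illustration of what "heuristic filters + dedup + a learned quality score" can look like in practice. The thresholds, features, and the `quality_score` callable are invented for this example.

```python
# Toy document-quality filter; thresholds and scoring are illustrative, not Meta's pipeline.
import hashlib
from typing import Callable, Iterable, Iterator


def heuristic_ok(doc: str) -> bool:
    """Cheap rules that weed out obviously low-quality documents."""
    words = doc.split()
    if len(words) < 50:                        # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:     # highly repetitive text
        return False
    alpha = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    return alpha > 0.6                         # reject mostly symbols / markup


def dedup_key(doc: str) -> str:
    """Exact-duplicate key; real pipelines use semantic / fuzzy deduplication instead."""
    return hashlib.sha1(" ".join(doc.lower().split()).encode()).hexdigest()


def filter_corpus(
    docs: Iterable[str],
    quality_score: Callable[[str], float],     # e.g. a classifier trained on Llama 2 labels
    threshold: float = 0.5,
) -> Iterator[str]:
    seen = set()
    for doc in docs:
        if not heuristic_ok(doc):
            continue
        key = dedup_key(doc)
        if key in seen:
            continue
        seen.add(key)
        if quality_score(doc) >= threshold:
            yield doc
```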
To make effective use of our pretraining data in Llama 3 models, we put substantial effort into scaling up pretraining. Specifically, we developed a series of detailed scaling laws for downstream benchmark evaluations. These scaling laws allow us to select an optimal data mix and to make informed decisions about how best to use our training compute. Importantly, they let us predict the performance of our largest models on key tasks before we actually train them, which helps ensure strong performance of the final models across a variety of use cases and capabilities.
During the development of Llama 3, we made several new observations about scaling behavior. For example, while the Chinchilla-optimal amount of training compute for an 8B parameter model corresponds to roughly 200B tokens, we found that model performance keeps improving even after the model is trained on two orders of magnitude more data. Both our 8B and 70B parameter models continued to improve log-linearly after being trained on up to 15T tokens. Larger models can match the performance of these smaller models with less training compute, but smaller models are generally preferred because they are much more efficient at inference time.
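To make the "log-linear improvement" claim concrete, the snippet below fits a log-linear trend to a handful of checkpoint losses and extrapolates it. The numbers are made up for illustration and are not Llama 3 measurements.

```python
# Illustrative only: fit loss ≈ a + b * log10(tokens) to made-up checkpoint losses.
import numpy as np

tokens = np.array([0.2e12, 1e12, 5e12, 15e12])   # tokens seen at each checkpoint
loss = np.array([2.05, 1.90, 1.78, 1.70])        # fictitious validation losses

b, a = np.polyfit(np.log10(tokens), loss, deg=1)  # slope b, intercept a
print(f"fitted change in loss per decade of data: {b:.3f}")

# Predict the loss if training continued to 30T tokens under the same trend.
print(f"extrapolated loss at 30T tokens: {a + b * np.log10(30e12):.3f}")
```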
To train our largest Llama 3 models, we combined three types of parallelism: data parallelism, model parallelism, and pipeline parallelism. Our most efficient implementation achieves a compute utilization of more than 400 TFLOPS per GPU when training on 16K GPUs simultaneously. We performed the training runs on two custom-built 24K GPU clusters. To maximize GPU uptime, we developed a new, advanced training stack that automates error detection, handling, and maintenance. We also greatly improved hardware reliability and our detection mechanisms for silent data corruption, and developed new scalable storage systems that reduce the overhead of checkpointing and rollback. These improvements resulted in an overall effective training time of more than 95%. Taken together, they made training Llama 3 roughly three times more efficient than Llama 2.
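A quick back-of-envelope calculation helps put "400 TFLOPS per GPU on 16K GPUs" in perspective. The peak figure and the 6 × parameters × tokens FLOPs rule below are assumptions of this sketch (H100-class GPUs with roughly 989 dense BF16 TFLOPS, and the usual training-FLOPs approximation), not numbers from the post.

```python
# Back-of-envelope arithmetic; the peak-TFLOPS figure and FLOPs rule are assumptions.
achieved_tflops = 400e12           # reported per-GPU throughput
peak_tflops = 989e12               # assumed H100-class dense BF16 peak
n_gpus = 16_000

print(f"per-GPU utilization (MFU): {achieved_tflops / peak_tflops:.1%}")
print(f"aggregate throughput: {achieved_tflops * n_gpus / 1e18:.1f} EFLOPS")

# Rough training-time estimate using the common ~6 * params * tokens FLOPs rule.
params, tokens = 70e9, 15e12
total_flops = 6 * params * tokens
days = total_flops / (achieved_tflops * n_gpus) / 86_400
print(f"~{days:.0f} days of compute for a 70B model on 15T tokens at this throughput")
```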
Instruction Fine-Tuning
To fully unlock the potential of our pretrained models in chat use cases, we also innovated on our approach to instruction tuning. Our post-training approach is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO). The quality of the prompts used in SFT and of the preference rankings used in PPO and DPO has an outsized influence on the performance of the aligned models. Some of our biggest improvements in model quality came from carefully curating this data and performing multiple rounds of quality assurance on the annotations provided by human annotators.
Learning from preference rankings via PPO and DPO also greatly improved Llama 3's performance on reasoning and coding tasks. We found that if you ask a model a reasoning question that it struggles to answer, it will sometimes produce the correct reasoning trace: the model knows how to produce the right answer, but it does not know how to select it. Training on preference rankings teaches the model how to select it.
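The post does not publish Meta's exact post-training recipe, but the preference-ranking objective it names, DPO, has a compact standard form (Rafailov et al., 2023). The sketch below implements that textbook loss in PyTorch; the beta value and the toy log-probabilities are illustrative.

```python
# Minimal sketch of the standard DPO objective, not Meta's internal implementation.
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """-log sigmoid(beta * ((chosen - ref_chosen) - (rejected - ref_rejected)))"""
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()


# Tiny usage example with per-sequence log-probabilities for a batch of two pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -9.5]),
    policy_rejected_logps=torch.tensor([-14.0, -9.0]),
    ref_chosen_logps=torch.tensor([-12.5, -9.8]),
    ref_rejected_logps=torch.tensor([-13.5, -9.4]),
)
print(loss)
```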