LLMs之PEFT之Llama-2：《LoRA Learns Less and Forgets LessLoRA學得更少但遺忘得也更少》翻譯與解讀

導讀：該論文比較了LoRA與完全微調在代碼與數學兩個領域的表現。

背景問題：微調大規模語言模型需要非常大的GPU內存。LoRA這一參數高效微調方法通過僅微調選擇性權重矩陣的低秩擾動來節省內存。

解決方案：LoRA假設微調后的權重矩陣的變化可以近似表示成基模型權重矩陣的低秩擾動。LoRA僅微調選擇模塊的A和B矩陣來學習這個低秩擾動。

核心步驟：實驗使用Llama-2語言模型，在代碼和數學兩個領域下比較LoRA與完全微調。包括繼續預訓練和指令微調兩個訓練場景，使用不同數據集和輪數進行訓練，并用專業評估指標如HumanEval和GSM8K評估學習效果。

結果和分析：LoRA在代碼領域明顯劣于完全微調，在數學領域效果更近。但LoRA相對來說遺忘源領域知識較少。LoRA相比常見正則如dropout和權重衰減起到更強的正則作用。完全微調找到的權重擾動秩遠大于LoRA配置的秩。

最佳實踐：LoRA適用于指令微調，學習率的選擇很關鍵；優先選擇所有模塊作為目標，以低秩16為妥協。

優勢：LoRA相對完全微調節省內存開銷，并能在一定程度上保留基模型的廣泛能力，避免過分專注于目標領域。

總之，這篇論文系統、全面地比較了LoRA與完全微調在代碼和數學領域的表現，提出了LoRA的最佳實踐方法。它為參數高效微調提供了有價值的參考。

《LoRA Learns Less and Forgets Less》翻譯與解讀

Abstract摘要

1 Introduction引言

Figure 1: Learning vs. forgetting tradeoff curves for Llama-2-7B and Llama-2-13B trained on Starcoder-Python. Gray regions are hypothetical Pareto frontiers for performance on the source domain and the code target domain.圖 1：Llama-2-7B 和 Llama-2-13B 在 Starcoder-Python 上的學習與遺忘權衡曲線。灰色區域是源領域和代碼目標領域性能的假設帕累托前沿。

6 Discussion討論

7 Conclusion結論

《LoRA Learns Less and Forgets Less》翻譯與解讀

地址	論文地址：https://arxiv.org/abs/2405.09673
時間	2024 年 5 月15 日
作者	哥倫比亞大學和Databricks

Abstract摘要

Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning (≈100K prompt-response pairs) and continued pretraining (≈10B unstructured tokens) data regimes. Our results show that, in most settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA exhibits a desirable form of regularization: it better maintains the base model’s performance on tasks outside the target domain. We show that LoRA provides stronger regularization compared to common techniques such as weight decay and dropout; it also helps maintain more diverse generations. We show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.

低秩適應（LoRA）是一種廣泛使用的大型語言模型參數高效微調方法。LoRA通過只訓練選定的權重矩陣的低秩擾動來節省內存。在這項工作中，我們比較了LoRA和全微調在編程和數學兩個目標領域的性能。我們考慮了指令微調（≈100K提示-響應對）和持續預訓練（≈10B非結構化令牌）數據制度。我們的結果顯示，在大多數設置中，LoRA的性能明顯低于全微調。盡管如此，LoRA表現出一種理想的正則化形式：它更好地保持了基礎模型在目標領域外任務上的性能。我們表明，與權重衰減和dropout等常見技術相比，LoRA提供了更強的正則化；它還有助于保持更多樣化的生成。我們顯示，全微調學習的擾動秩比典型的LoRA配置大10-100倍，這可能解釋了一些報告的差距。我們最后提出了使用LoRA進行微調的最佳實踐。

1 Introduction引言

Finetuning large language models (LLMs) with billions of weights requires a non-trivial amount of GPU memory. Parameter-efficient finetuning methods reduce the memory footprint during training by freezing a pretrained LLM and only training a small number of additional parameters, often called adapters. Low-Rank Adaptation (LoRA; Hu et al. (2021)) trains adapters that are low-rank perturbations to selected weight matrices.

Since its introduction, LoRA has been touted as a strict efficiency improvement that does not compromise accuracy on the new, target domain (Hu et al., 2021; Dettmers et al., 2024; Raschka, 2023; Zhao et al., 2024b). However, only a handful of studies benchmark LoRA against full finetuning for LLMs with billions of parameters, (Ivison et al., 2023; Zhuo et al., 2024; Dettmers et al., 2024), reporting mixed results. Some of these studies rely on older models (e.g. RoBERTa), or coarse evaluation benchmarks (such as GLUE or ROUGE) that are less relevant for contemporary LLMs. In contrast, more sensitive domain-specific evaluations, e.g., code, reveal cases where LoRA is inferior to full finetuning (Ivison et al., 2023; Zhuo et al., 2024). Here we ask: under which conditions does LoRA approximate full finetuning accuracy on challenging target domains, such as code and math?

微調具有數十億權重的大型語言模型（LLMs）需要大量的GPU內存。參數高效微調方法通過凍結預訓練的LLM并僅訓練少量附加參數（通常稱為適配器）來減少訓練期間的內存占用。低秩適配（LoRA；Hu等，2021）訓練的適配器是對選定權重矩陣的低秩擾動。

自從引入以來，LoRA被譽為一種嚴格的效率提升方法，在新目標領域的準確性上沒有妥協（Hu等，2021；Dettmers等，2024；Raschka，2023；Zhao等，2024b）。然而，只有少數研究對具有數十億參數的LLMs進行LoRA與完全微調的基準測試（Ivison等，2023；Zhuo等，2024；Dettmers等，2024），并報告了混合結果。這些研究中有些依賴于較舊的模型（例如RoBERTa）或粗略的評估基準（如GLUE或ROUGE），這些基準對于當代LLMs的相關性較低。相比之下，更敏感的領域特定評估（如代碼）顯示出LoRA在某些情況下不如完全微調（Ivison等，2023；Zhuo等，2024）。在此，我們提出問題：在何種條件下，LoRA在挑戰性的目標領域（如代碼和數學）中可以接近完全微調的準確性？

By training fewer parameters, LoRA is assumed to provide a form of regularization that constrains the finetuned model’s behavior to remain close to that of the base model (Sun et al., 2023; Du et al., 2024). We also ask: does LoRA act as a regularizer that mitigates “forgetting” of the source domain?

In this study, we rigorously compare LoRA and full finetuning for Llama-2 7B (and in some cases, 13B) models across two challenging target domains, code and mathematics. Within each domain, we explore two training regimes. The first is instruction finetuning, the common scenario for LoRA involving question-answer datasets with tens to hundreds of millions of tokens. Here, we use Magicoder-Evol-Instruct-110K (Wei et al., 2023)and MetaMathQA (Yu et al., 2023). The second regime is continued pretraining, a less common application for LoRA which involves training on billions of unlabeled tokens; here we use the StarCoder-Python (Li et al., 2023) and OpenWebMath (Paster et al., 2023) datasets (Table 1).

通過訓練較少的參數，LoRA被認為提供了一種正則化形式，使微調模型的行為保持接近基礎模型（Sun等，2023；Du等，2024）。我們還問：LoRA是否作為一種正則化工具，可以減輕對源領域的“遺忘”？

在本研究中，我們嚴格比較了LoRA和完全微調在Llama-2 7B（某些情況下為13B）模型在代碼和數學這兩個挑戰性目標領域中的表現。在每個領域中，我們探索了兩種訓練方式。第一種是指令微調，這是一種涉及數千萬到數億個標記的問題回答數據集的常見LoRA場景。在這里，我們使用Magicoder-Evol-Instruct-110K（Wei等，2023）和MetaMathQA（Yu等，2023）。第二種方式是持續預訓練，這是LoRA不常見的應用，涉及數十億個未標記的標記的訓練；在這里，我們使用StarCoder-Python（Li等，2023）和OpenWebMath（Paster等，2023）數據集（表1）。

We evaluate target-domain performance (henceforth, learning) via challenging coding and math benchmarks (HumanEval; Chen et al. (2021), and GSM8K; Cobbe et al. (2021)). We evaluate source-domain forgetting performance on language understanding, world knowledge, and common-sense reasoning tasks (Zellers et al., 2019; Sakaguchi et al., 2019; Clark et al., 2018).

We find that for code, LoRA substantially underper-forms full finetuning, whereas for math, LoRA closes more of the gap (Sec. 4.1), while requiring longer training. Despite this performance gap, we show that LoRA better maintains source-domain performance compared to full finetuning (Sec. 4.2). Furthermore, we characterize the tradeoff between performance on the target versus source domain (learning ver-sus forgetting). For a given model size and dataset, we find that LoRA and full finetuning form a simi-lar learning-forgetting tradeoff curve: LoRA’s that learn more generally forget as much as full finetun-ing, though we find cases (for code) where LoRA can learn comparably but forgets less (Sec. 4.3).

我們通過挑戰性的代碼和數學基準（HumanEval；Chen等，2021，以及GSM8K；Cobbe等，2021）評估目標領域的性能（以下簡稱學習）。我們在語言理解、世界知識和常識推理任務上評估源領域遺忘性能（Zellers等，2019；Sakaguchi等，2019；Clark等，2018）。

我們發現，對于代碼，LoRA明顯不如完全微調，而對于數學，LoRA縮小了更多差距（第4.1節），盡管需要更長的訓練時間。盡管存在性能差距，我們顯示出LoRA在保持源領域性能方面比完全微調更好（第4.2節）。此外，我們刻畫了目標域與源域性能之間的權衡（學習與遺忘）。對于給定的模型大小和數據集，我們發現LoRA和完全微調形成了類似的學習-遺忘權衡曲線：LoRA的學習越多，一般遺忘也與完全微調一樣多，盡管我們發現某些情況下（對于代碼）LoRA可以學習得相當，但遺忘更少（第4.3節）。

We then show that LoRA – even with a less restric-tive rank – provides stronger regularization when compared to classic regularization methods such as dropout (Srivastava et al., 2014), and weight decay (Goodfellow et al., 2016). We also show that LoRA provides regularization at the output level: we ana-lyze the generated solutions to HumanEval problems and find that while full finetuning collapses to a lim-ited set of solutions, LoRA maintains a diversity of solutions more similar to the base model (Sun et al., 2023; Du et al., 2024).

Why does LoRA underperform full finetuning? LoRA was originally motivated in part by the hypothesis that finetuning results in low-rank perturbations to the base model’s weight matrix (Li et al., 2018; Aghajanyan et al., 2020; Hu et al., 2021). However, the tasks explored by these works are relatively easy for modern LLMs, and certainly easier than the coding and math domains studied here. Thus, we perform a singular value decomposition to show that full finetuning barely changes the spectrum of the base model’s weight matrices, and yet the difference between the two (i.e. the perturbation) is high rank. The rank of the perturbation grows as training progresses, with ranks 10-100× higher than typical LoRA configurations (Figure 7).

然后我們展示了LoRA——即使在不太限制的秩下——與經典的正則化方法（如dropout（Srivastava等，2014）和權重衰減（Goodfellow等，2016））相比，提供了更強的正則化。我們還展示了LoRA在輸出級別提供的正則化：我們分析了對HumanEval問題生成的解決方案，發現雖然完全微調收斂到一組有限的解決方案，但LoRA保持了與基礎模型更相似的解決方案多樣性（Sun等，2023；Du等，2024）。

為什么LoRA的表現不如完全微調？LoRA的初衷部分是基于這樣一個假設：微調導致基礎模型權重矩陣的低秩擾動（Li等，2018；Aghajanyan等，2020；Hu等，2021）。然而，這些工作的任務對于現代LLMs來說相對容易，肯定比我們這里研究的代碼和數學領域容易。因此，我們進行了奇異值分解，顯示完全微調幾乎沒有改變基礎模型權重矩陣的譜，但兩者之間的差異（即擾動）是高秩的。隨著訓練的進展，擾動的秩增長，通常比典型的LoRA配置高10-100倍（圖7）。

We conclude by proposing best practices for training models with LoRA. We find that LoRA is especially sensitive to learning rates, and that the performance is affected mostly by the choice of target modules and to a smaller extent by rank.

To summarize, we contribute the following results:

? Full finetuning is more accurate and sample-efficient than LoRA in code and math (Sec.4.1).

? LoRA forgets less of the source domain, providing a form of regularization (Sec. 4.2 and 4.3).

? LoRA’s regularization is stronger compared to common regularization techniques; it also helps maintaining the diversity of generations (Sec. 4.4).

? Full finetuning finds high rank weight perturbations (Sec. 4.5).

? Compared to full finetuning, LoRA is more sensitive to hyperparameters, namely learning rate, target modules, and rank (in decreasing order; Sec. 4.6) .

最后，我們提出了使用LoRA訓練模型的最佳實踐。我們發現LoRA對學習率特別敏感，性能主要受目標模塊選擇的影響，程度較小的影響來自秩。

總結來說，我們貢獻了以下結果：

>> 完全微調在代碼和數學上比LoRA更準確且樣本效率更高（第4.1節）。

>> LoRA對源領域遺忘較少，提供了一種正則化形式（第4.2節和第4.3節）。

>> LoRA的正則化比常見的正則化技術更強；它還幫助維持生成的多樣性（第4.4節）。

>> 完全微調找到了高秩權重擾動（第4.5節）。

>> 與完全微調相比，LoRA對超參數更敏感，即學習率、目標模塊和秩（按遞減順序；第4.6節）。

Figure 1: Learning vs. forgetting tradeoff curves for Llama-2-7B and Llama-2-13B trained on Starcoder-Python. Gray regions are hypothetical Pareto frontiers for performance on the source domain and the code target domain.圖 1：Llama-2-7B 和 Llama-2-13B 在 Starcoder-Python 上的學習與遺忘權衡曲線。灰色區域是源領域和代碼目標領域性能的假設帕累托前沿。
?

6 Discussion討論

Does the difference between LoRA and full finetuning decrease with model size? Studies in the past have hinted at a relationship between the effectiveness of finetuning and model size (Aghajanyan et al., 2020; Hu et al., 2021; Zhuo et al., 2024). While recent studies have successfully applied LoRA to 70B parameter models (Ivison et al., 2023; Yu et al., 2023), we leave a rigorous study of these intriguing scaling properties to future work.	LoRA與全微調之間的差異是否隨著模型大小的增加而減少？過去的研究暗示了微調有效性與模型大小之間的關系（Aghajanyan et al., 2020; Hu et al., 2021; Zhuo et al., 2024）。雖然最近的研究已成功將LoRA應用于70B參數模型（Ivison et al., 2023; Yu et al., 2023），但我們把對這些迷人擴展性質的嚴格研究留作未來工作。
Limitations of the spectral analysis. The observation that full finetuning tends to find high rank solutions does not rule out the possibility of low-rank solutions; rather, it shows that they are not typically found. An alternative interpretation is that the rank needed to reconstruct the weight matrix is higher than the rank needed for a downstream task.	譜分析的局限性。全微調傾向于找到高秩解的觀察結果并不排除低秩解的可能性；相反，這表明它們通常不被發現。另一種解釋是，重構權重矩陣所需的秩比下游任務所需的秩要高。
Why does LoRA perform well on math and not code? One hypothesis is that math datasets involve a smaller domain shift; they include a larger percentage of English and lead to decreased forgetting. The second hypothesis is that the GSM8K evaluation is too easy and does not capture the new college-level math learned in finetuning.	為什么LoRA在數學上表現良好而在代碼上不行？一種假設是數學數據集涉及的領域轉移較小；它們包含更大比例的英語，導致遺忘減少。第二個假設是GSM8K評估太簡單，無法捕捉到微調中學習的新大學級數學。

7 Conclusion結論

This work sheds light on the downstream performance of contemporary LLMs (with 7 and 13 billion parameters) trained with LoRA. Different from most prior work, we use domain-specific datasets in code and math, associated with sensitive evaluation metrics. We show that LoRA underperforms full finetuning in both domains. We also show that LoRA keeps the finetuned model’s behavior close to that of the base model,?with diminished source-domain forgetting and more diverse generations at inference time. We investigate LoRA’s regularization properties, and show that full finetuning finds weight perturbations are far from being low-rank. We conclude by analyzing LoRA’s increased sensitivity to hyperparameters and highlighting best practices.

這項工作揭示了使用LoRA訓練的當代LLM（具有70億和130億參數）的下游性能。與大多數先前工作不同，我們使用與代碼和數學相關的領域特定數據集，并關聯敏感的評估指標。我們顯示LoRA在這兩個領域中的性能都低于全微調。我們還顯示LoRA保持了微調模型的行為接近基礎模型，減少了源領域遺忘，并在推理時產生更多樣化的生成。我們調查了LoRA的正則化屬性，并顯示全微調發現的權重擾動遠非低秩。我們最后分析了LoRA對超參數的敏感性增加，并強調了最佳實踐。