Fine-Tuning Llama2 with LoRA


  • 1. What is LoRA?
  • 2. How does LoRA work?
  • 3. Applying LoRA to Llama2 models
  • 4. LoRA finetuning recipe in torchtune
  • 5. Trading off memory and model performance with LoRA

https://docs.pytorch.org/torchtune/main/tutorials/lora_finetune.html

This guide will teach you about LoRA, a parameter-efficient finetuning technique, and show you how you can use torchtune to finetune a Llama2 model with LoRA.

LoRA: Low-Rank Adaptation of Large Language Models
https://arxiv.org/abs/2106.09685

1. What is LoRA?

LoRA is an adapter-based method for parameter-efficient finetuning that adds trainable low-rank decomposition matrices to different layers of a neural network, then freezes the network’s remaining parameters. LoRA is most commonly applied to transformer models, in which case it is common to add the low-rank matrices to some of the linear projections in each transformer layer’s self-attention.

  • Note
    If you’re unfamiliar, check out these references for the definition of rank https://en.wikipedia.org/wiki/Rank_(linear_algebra) and discussion of low-rank approximations https://en.wikipedia.org/wiki/Low-rank_approximation.

By finetuning with LoRA (as opposed to finetuning all model parameters), you can expect to see memory savings due to a substantial reduction in the number of parameters with gradients. When using an optimizer with momentum, like AdamW https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html,
you can expect to see further memory savings from the optimizer state.
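As a rough back-of-the-envelope illustration of where those savings come from (a sketch in plain Python, not a measurement of any particular recipe), consider a single 4096 x 4096 projection trained in fp32 with AdamW, which keeps two state tensors per trainable parameter:

# Gradient + AdamW state (exp_avg, exp_avg_sq) in fp32, 4 bytes per element.
# Illustrative only; real recipes may use mixed precision and shard state across devices.
bytes_per_elem = 4
in_dim = out_dim = 4096
rank = 8

full_trainable = in_dim * out_dim           # full finetune: every weight is trainable
lora_trainable = rank * (in_dim + out_dim)  # LoRA: only A and B are trainable

full_extra = full_trainable * 3 * bytes_per_elem  # grad + 2 optimizer state tensors
lora_extra = lora_trainable * 3 * bytes_per_elem

print(f"full finetune: {full_extra / 2**20:.1f} MiB of grad + optimizer state")
print(f"LoRA (r=8):    {lora_extra / 2**20:.2f} MiB of grad + optimizer state")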

  • Note
    LoRA memory savings come primarily from gradient and optimizer states, so if your model’s peak memory comes in its forward() method, then LoRA may not reduce peak memory.

2. How does LoRA work?

LoRA replaces weight update matrices with a low-rank approximation. In general, weight updates for an arbitrary nn.Linear(in_dim,out_dim) layer could have rank as high as min(in_dim,out_dim). LoRA (and other related papers such as Aghajanyan et al.) hypothesize that the intrinsic dimension https://en.wikipedia.org/wiki/Intrinsic_dimension of these updates during LLM fine-tuning can in fact be much lower. To take advantage of this property, LoRA finetuning will freeze the original model, then add a trainable weight update from a low-rank projection. More explicitly, LoRA trains two matrices A and B. A projects the inputs down to a much smaller rank (often four or eight in practice), and B projects back up to the dimension output by the original linear layer.


Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
https://arxiv.org/abs/2012.13255

The image below gives a simplified representation of a single weight update step from a full finetune (on the left) compared to a weight update step with LoRA (on the right). The LoRA matrices A and B serve as an approximation to the full rank weight update in blue.

[Figure: a single weight update step in a full finetune (left) vs. a LoRA weight update via the low-rank matrices A and B (right)]

Although LoRA introduces a few extra parameters in the model forward(), only the A and B matrices are trainable. This means that with a rank r LoRA decomposition, the number of gradients we need to store reduces from in_dim * out_dim to r * (in_dim+out_dim). (Remember that in general r is much smaller than in_dim and out_dim.)

For example, in the 7B Llama2’s self-attention, in_dim=out_dim=4096 for the Q, K, and V projections. This means a LoRA decomposition of rank r=8 will reduce the number of trainable parameters for a given projection from 4096 * 4096 ≈ 16.8M to 8 * 8192 ≈ 65K, a reduction of over 99%.

Let’s take a look at a minimal implementation of LoRA in native PyTorch.

import torch
from torch import nn

class LoRALinear(nn.Module):
    def __init__(
        self,
        in_dim: int,
        out_dim: int,
        rank: int,
        alpha: float,
        dropout: float,
    ):
        super().__init__()
        # These are the weights from the original pretrained model
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

        # These are the new LoRA params. In general rank << in_dim, out_dim
        self.lora_a = nn.Linear(in_dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_dim, bias=False)

        # Rank and alpha are commonly-tuned hyperparameters
        self.rank = rank
        self.alpha = alpha

        # Most implementations also include some dropout
        self.dropout = nn.Dropout(p=dropout)

        # The original params are frozen, and only LoRA params are trainable.
        self.linear.weight.requires_grad = False
        self.lora_a.weight.requires_grad = True
        self.lora_b.weight.requires_grad = True

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # This would be the output of the original model
        frozen_out = self.linear(x)

        # lora_a projects inputs down to the much smaller self.rank,
        # then lora_b projects back up to the output dimension
        lora_out = self.lora_b(self.lora_a(self.dropout(x)))

        # Finally, scale by the alpha parameter (normalized by rank)
        # and add to the original model's outputs
        return frozen_out + (self.alpha / self.rank) * lora_out
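As a quick sanity check of the class above (the dimensions and hyperparameters here are just illustrative), we can confirm that the output shape matches the frozen linear layer and that only the LoRA matrices are trainable:

lora_linear = LoRALinear(in_dim=4096, out_dim=4096, rank=8, alpha=16, dropout=0.0)

x = torch.randn(2, 4096)   # a dummy batch of activations
y = lora_linear(x)
print(y.shape)             # torch.Size([2, 4096])

trainable = [name for name, p in lora_linear.named_parameters() if p.requires_grad]
print(trainable)           # ['lora_a.weight', 'lora_b.weight']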

There are some other details around initialization which we omit here, but if you’d like to know more you can see our implementation in LoRALinear https://docs.pytorch.org/torchtune/main/generated/torchtune.modules.peft.LoRALinear.html. Now that we understand what LoRA is doing, let’s look at how we can apply it to our favorite models.

3. Applying LoRA to Llama2 models

With torchtune, we can easily apply LoRA to Llama2 with a variety of different configurations. Let’s take a look at how to construct Llama2 models in torchtune with and without LoRA.

from torchtune.models.llama2 import llama2_7b, lora_llama2_7b

# Build Llama2 without any LoRA layers
base_model = llama2_7b()

# The default settings for lora_llama2_7b will match those for llama2_7b
# We just need to define which layers we want LoRA applied to.
# Within each self-attention, we can choose from ["q_proj", "k_proj", "v_proj", and "output_proj"].
# We can also set apply_lora_to_mlp=True or apply_lora_to_output=True to apply LoRA to other linear
# layers outside of the self-attention.
lora_model = lora_llama2_7b(lora_attn_modules=["q_proj", "v_proj"])
  • Note
    Calling lora_llama2_7b https://docs.pytorch.org/torchtune/main/generated/torchtune.models.llama2.lora_llama2_7b.html alone will not handle the definition of which parameters are trainable.

Let’s inspect each of these models a bit more closely.

# Print the first layer's self-attention in the usual Llama2 model
>>> print(base_model.layers[0].attn)
MultiHeadAttention(
  (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
  (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
  (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
  (output_proj): Linear(in_features=4096, out_features=4096, bias=False)
  (pos_embeddings): RotaryPositionalEmbeddings()
)

# Print the same for Llama2 with LoRA weights
>>> print(lora_model.layers[0].attn)
MultiHeadAttention(
  (q_proj): LoRALinear(
    (dropout): Dropout(p=0.0, inplace=False)
    (lora_a): Linear(in_features=4096, out_features=8, bias=False)
    (lora_b): Linear(in_features=8, out_features=4096, bias=False)
  )
  (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
  (v_proj): LoRALinear(
    (dropout): Dropout(p=0.0, inplace=False)
    (lora_a): Linear(in_features=4096, out_features=8, bias=False)
    (lora_b): Linear(in_features=8, out_features=4096, bias=False)
  )
  (output_proj): Linear(in_features=4096, out_features=4096, bias=False)
  (pos_embeddings): RotaryPositionalEmbeddings()
)

Notice that our LoRA model’s layer contains additional weights in the Q and V projections, as expected. Additionally, inspecting the types of lora_model and base_model would show that they are both instances of the same TransformerDecoder https://docs.pytorch.org/torchtune/main/generated/torchtune.modules.TransformerDecoder.html.
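A quick way to confirm this (a sketch, assuming the base_model and lora_model built above):

from torchtune.modules import TransformerDecoder

# Both models are plain TransformerDecoder instances; the LoRA layers are just
# swapped in for some of the linear projections.
print(isinstance(base_model, TransformerDecoder))  # True
print(isinstance(lora_model, TransformerDecoder))  # True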

Why does this matter? torchtune makes it easy to load checkpoints for LoRA directly from our Llama2 model without any wrappers or custom checkpoint conversion logic.

# Assuming that base_model already has the pretrained Llama2 weights,
# this will directly load them into your LoRA model without any conversion necessary.
lora_model.load_state_dict(base_model.state_dict(), strict=False)
  • Note
    Whenever loading weights with strict=False, you should verify that any missing or extra keys in the loaded state_dict are as expected. torchtune’s LoRA recipes do this by default via validate_missing_and_unexpected_for_lora() https://docs.pytorch.org/torchtune/main/generated/torchtune.modules.peft.validate_missing_and_unexpected_for_lora.html.
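For example (a sketch, assuming the base_model and lora_model built above), load_state_dict returns the missing and unexpected keys, so we can check that the only missing keys are the newly initialized LoRA parameters:

result = lora_model.load_state_dict(base_model.state_dict(), strict=False)

# We expect every missing key to be a LoRA parameter (e.g. "...q_proj.lora_a.weight")
# and no unexpected keys at all.
print(all("lora" in key for key in result.missing_keys))  # True
print(result.unexpected_keys)                             # []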

Once we’ve loaded the base model weights, we also want to set only LoRA parameters to trainable.

from torchtune.modules.peft.peft_utils import get_adapter_params, set_trainable_params

# Fetch all params from the model that are associated with LoRA.
lora_params = get_adapter_params(lora_model)

# Set requires_grad=True on lora_params, and requires_grad=False on all others.
set_trainable_params(lora_model, lora_params)

# Print the total number of parameters
total_params = sum([p.numel() for p in lora_model.parameters()])
trainable_params = sum([p.numel() for p in lora_model.parameters() if p.requires_grad])
print(
    f"""
    {total_params} total params,
    {trainable_params} trainable params,
    {(100.0 * trainable_params / total_params):.2f}% of all params are trainable.
    """
)

6742609920 total params,
4194304 trainable params,
0.06% of all params are trainable.
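With only ~4M of the ~6.7B parameters trainable, the optimizer state stays small. A minimal sketch of how one might build an optimizer over just those parameters (the learning rate here is illustrative, not the recipe default):

from torch.optim import AdamW

# Only parameters with requires_grad=True (the LoRA matrices) receive gradients
# and optimizer state.
optimizer = AdamW(
    (p for p in lora_model.parameters() if p.requires_grad),
    lr=3e-4,
)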
  • Note
    If you are directly using the LoRA recipe, you need only pass the relevant checkpoint path. Loading model weights and setting trainable parameters will be taken care of in the recipe.

4. LoRA finetuning recipe in torchtune

Finally, we can put it all together and finetune a model using torchtune’s LoRA recipe. Make sure that you have first downloaded the Llama2 weights and tokenizer by following these instructions. You can then run the following command to perform a LoRA finetune of Llama2-7B with two GPUs (each having VRAM of at least 16GB):

tune run --nnodes 1 --nproc_per_node 2 lora_finetune_distributed --config llama2/7B_lora
  • Note
    Make sure to point to the location of your Llama2 weights and tokenizer. This can be done either by adding checkpointer.checkpoint_files=[my_model_checkpoint_path] and tokenizer_checkpoint=my_tokenizer_checkpoint_path to the command, or by directly modifying the 7B_lora.yaml file. See our “All About Configs” tutorial https://docs.pytorch.org/torchtune/main/deep_dives/configs.html for more details on how you can easily clone and modify torchtune configs.

  • Note
    You can modify the value of nproc_per_node depending on (a) the number of GPUs you have available, and (b) the memory constraints of your hardware.

The preceding command will run a LoRA finetune with torchtune’s factory settings, but we may want to experiment a bit. Let’s take a closer look at some of the lora_finetune_distributed config.

# Model Arguments
model:
  _component_: lora_llama2_7b
  lora_attn_modules: ['q_proj', 'v_proj']
  lora_rank: 8
  lora_alpha: 16
...

We see that the default is to apply LoRA to Q and V projections with a rank of 8. Some experiments with LoRA have found that it can be beneficial to apply LoRA to all linear layers in the self-attention, and to increase the rank to 16 or 32. Note that this is likely to increase our max memory, but as long as we keep rank<<embed_dim, the impact should be relatively minor.

Let’s run this experiment. We can also increase alpha (in general it is good practice to scale alpha and rank together).

tune run --nnodes 1 --nproc_per_node 2 lora_finetune_distributed --config llama2/7B_lora \
lora_attn_modules=['q_proj','k_proj','v_proj','output_proj'] \
lora_rank=32 lora_alpha=64 output_dir=./lora_experiment_1
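One way to see why we scale alpha with rank: in the LoRALinear sketch above, the LoRA branch is multiplied by alpha / rank, so scaling both together keeps the effective scale of the update constant while the rank (and thus the capacity of the update) changes. A quick check:

# Effective scaling factor applied to the LoRA branch, as in the sketch above.
for rank, alpha in [(8, 16), (32, 64)]:
    print(f"rank={rank:2d}  alpha={alpha:3d}  alpha/rank={alpha / rank}")
# Both configurations keep alpha / rank = 2.0.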

A comparison of the (smoothed) loss curves between this run and our baseline over the first 500 steps can be seen below.

[Figure: smoothed loss curves for the baseline run and the all-layers, rank-32 experiment over the first 500 steps]

  • Note
    The above figure was generated with W&B. You can use torchtune’s WandBLogger https://docs.pytorch.org/torchtune/main/generated/torchtune.training.metric_logging.WandBLogger.html to generate similar loss curves, but you will need to install W&B and set up an account separately. For more details on using W&B in torchtune, see our “Logging to Weights & Biases” tutorial https://docs.pytorch.org/torchtune/main/deep_dives/wandb_logging.html.

5. Trading off memory and model performance with LoRA

In the preceding example, we ran LoRA on two devices. But given LoRA’s low memory footprint, we can run fine-tuning on a single device using most commodity GPUs that support the bfloat16 floating-point format https://en.wikipedia.org/wiki/Bfloat16_floating-point_format. This can be done via the command:

tune run lora_finetune_single_device --config llama2/7B_lora_single_device

On a single device, we may need to be more cognizant of our peak memory. Let’s run a few experiments to see our peak memory during a finetune. We will experiment along two axes: first, which model layers have LoRA applied, and second, the rank of each LoRA layer. (We will scale alpha in parallel to LoRA rank, as discussed above.)

To compare the results of our experiments, we can evaluate our models on truthfulqa_mc2 https://github.com/sylinrl/TruthfulQA, a task from the TruthfulQA https://arxiv.org/abs/2109.07958 benchmark for language models. For more details on how to run this and other evaluation tasks with torchtune’s EleutherAI evaluation harness integration, see our End-to-End Workflow Tutorial.

Previously, we only enabled LoRA for the linear layers in each self-attention module, but in fact there are other linear layers we can apply LoRA to: MLP layers and our model’s final output projection. Note that for Llama2-7B the final output projection maps to the vocabulary dimension (32000 instead of 4096 as in the other linear layers), so enabling LoRA for this layer will increase our peak memory a bit more than the other layers. We can make the following changes to our config:

# Model Arguments
model:
  _component_: lora_llama2_7b
  lora_attn_modules: ['q_proj', 'k_proj', 'v_proj', 'output_proj']
  apply_lora_to_mlp: True
  apply_lora_to_output: True
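To get a feel for why the output projection adds a bit more memory than the attention projections, here is a rough count of the LoRA parameters each one contributes (a sketch; rank 8 as in the default config):

# LoRA adds rank * (in_dim + out_dim) trainable parameters per linear layer.
rank = 8
attn_proj = rank * (4096 + 4096)     # 65,536 for a 4096 -> 4096 attention projection
output_proj = rank * (4096 + 32000)  # 288,768 for the 4096 -> 32000 output projection
print(attn_proj, output_proj)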

  • Note
    All the finetuning runs below use the llama2/7B_lora_single_device config https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama2/7B_lora_single_device.yaml, which has a default batch size of 2. Modifying the batch size (or other hyperparameters, e.g. the optimizer) will impact both peak memory and final evaluation results.

LoRA Layers   | Rank | Alpha | Peak Memory | Accuracy (truthfulqa_mc2)
Q and V only  | 8    | 16    | 15.57 GB    | 0.475
all layers    | 8    | 16    | 15.87 GB    | 0.508
Q and V only  | 64   | 128   | 15.86 GB    | 0.504
all layers    | 64   | 128   | 17.04 GB    | 0.514

We can see that our baseline settings give the lowest peak memory, but our evaluation performance is relatively lower. By enabling LoRA for all linear layers and increasing the rank to 64, we see almost a 4% absolute improvement in our accuracy on this task, but our peak memory also increases by about 1.4 GB. These are just a couple of simple experiments; we encourage you to run your own finetunes to find the right tradeoff for your particular setup.

Additionally, if you want to decrease your model’s peak memory even further (and still potentially achieve similar model quality results), you can check out our QLoRA tutorial.

