Table of Contents
1. prompt-tuning background
2. Prompt Tuning Models
2.1 2021 Prefix-tuning
2.2 2021 P-tuning v1
2.3 2021 Parameter-efficient prompt tuning (PET)
2.4 2022 P-tuning v2
2.5 2019 Adapter
2.6 2021 LoRA (Low-Rank Adaptation)
2.7 2024 DoRA (Weight-Decomposed Low-Rank Adaptation)
3. LoRA Implementation
3.1 LoRA Reproduction 01: MiniLoRA
3.1.1 core codes: torch.nn.utils.parametrize.register_parametrization, the parametrization registration function
3.2 LoRA Reproduction 02: LoRA from Scratch on MNIST
3.2.1 core codes: the Lightning deep-learning framework
3.3 LoRA Reproduction 03: Torch tutorial with torchtune
3.3.1 core codes: introduction to the torchtune package
3.4 LoRA Reproduction 04: peft implementation
3.4.1 core codes: introduction to AutoModelForSeq2SeqLM
3.4.2 core codes: introduction to the peft package
3.5 *LoRA 05: Explanation
3.6 *LoRA 06: Huanhuan Chat
1. prompt-tuning background
problem: previous fine-tuning / model-tuning re-trains the large model for each downstream task, i.e. it updates the whole set of model parameters. Because LLMs have so many parameters, this kind of fine-tuning requires large amounts of data and compute to update them, which makes it impractical.
solution: prompt-tuning (p-tuning) optimizes a generative pre-trained model (e.g. GPT) through prompt tokens. It improves performance on a specific task by tuning the prompts instead of the full set of model parameters, saving compute and resources while maintaining or even improving model performance.
In chronological order, prompt tuning evolved as follows: prefix-tuning, p-tuning v1, parameter-efficient prompt tuning, and p-tuning v2.
2. Prompt Tuning Models
2.1 2021 Prefix-tuning
Prefix-tuning (paper: Prefix-Tuning: Optimizing Continuous Prompts for Generation) prepends a few task-specific tokens to the input tokens and trains their embeddings separately.
Note: the prefix is not spliced into the text itself. The original input tokens still get their embeddings from the Transformer, whose parameters stay frozen. The prefix embeddings h_i are drawn from a trainable matrix (reparameterized through an MLP during training), while the remaining tokens' embeddings are computed by the Transformer as usual.
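To make this concrete, here is a minimal PyTorch sketch of the idea (illustrative only; the class name PrefixEncoder, the dimensions, and the frozen embedding table are assumptions, not the paper's code): a small trainable matrix, reparameterized through an MLP, produces the prefix embeddings h_i, which are prepended to the frozen embeddings of the real input tokens.

import torch
import torch.nn as nn

class PrefixEncoder(nn.Module):  # hypothetical name, for illustration only
    """Generates trainable prefix embeddings via a small MLP reparameterization."""
    def __init__(self, prefix_len=10, hidden_dim=768, bottleneck=512):
        super().__init__()
        self.prefix_ids = nn.Parameter(torch.randn(prefix_len, bottleneck))  # trainable "virtual token" matrix
        self.mlp = nn.Sequential(nn.Linear(bottleneck, bottleneck), nn.Tanh(),
                                 nn.Linear(bottleneck, hidden_dim))          # MLP reparameterization

    def forward(self, batch_size):
        prefix = self.mlp(self.prefix_ids)                     # (prefix_len, hidden_dim)
        return prefix.unsqueeze(0).expand(batch_size, -1, -1)  # (batch, prefix_len, hidden_dim)

# Frozen pre-trained pieces (stand-ins): only the prefix encoder is updated during training.
embed = nn.Embedding(30522, 768)   # frozen token embedding table
for p in embed.parameters():
    p.requires_grad = False

prefix_encoder = PrefixEncoder()
input_ids = torch.randint(0, 30522, (2, 16))              # a dummy batch of token ids
h_prefix = prefix_encoder(batch_size=input_ids.size(0))   # trainable prefix embeddings h_i
h_tokens = embed(input_ids)                               # frozen embeddings of the original tokens
hidden = torch.cat([h_prefix, h_tokens], dim=1)           # the prefix is prepended, not merged into the text
print(hidden.shape)  # torch.Size([2, 26, 768])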
- Pros: simple to implement, efficient to train, consistent with the task.
- Cons: limited applicability; prefix-tuning is weaker than p-tuning on some tasks. E.g. context limitations: because the prefix embeddings always sit at the front of the sequence, they may not fully exploit the contextual information of the input.
2.2 2021 P-tuning v1
P-tuning v1 (paper: GPT Understands, Too) improves model performance by inserting trainable prompt token embeddings at fixed positions of a prompt template in the input layer.
problem: previous prompting methods live in a discrete space: a word v_i is picked from the vocabulary V as the prompt token for the i-th slot of the template, and a prompt generator produces the prompt embeddings. Such fixed prompts are called hard prompts, and improving beyond them requires fine-tuning all of the pre-trained model's parameters.
solution: p-tuning v1 works in a continuous space: a prompt encoder generates trainable, parameterized prompt embeddings that replace the vocabulary words v_i in the input layer. These generated trainable prompts are called soft prompts (see the sketch after the list below).
- Initialize prompts: <T1> <T2> The movie was fantastic <T3> <T4>. -> train / optimize -> inference (no backpropagation at this stage).
- Pros: few parameters, better performance, broadly applicable.
- Cons: training is more complex; performance depends on the prompt positions.
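Here is the promised sketch (illustrative assumptions: an LSTM+MLP prompt encoder and a frozen embedding table standing in for the real model; this is not the paper's code). The prompt encoder turns a few trainable pseudo-token vectors into continuous soft prompts that fill the template slots <T1> ... <T4>.

import torch
import torch.nn as nn

class PromptEncoder(nn.Module):  # hypothetical; p-tuning uses an LSTM/MLP encoder over pseudo-token embeddings
    def __init__(self, num_prompts=4, hidden_dim=768):
        super().__init__()
        self.pseudo_tokens = nn.Parameter(torch.randn(num_prompts, hidden_dim))
        self.lstm = nn.LSTM(hidden_dim, hidden_dim // 2, bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim))

    def forward(self):
        out, _ = self.lstm(self.pseudo_tokens.unsqueeze(0))  # (1, num_prompts, hidden_dim)
        return self.mlp(out).squeeze(0)                      # continuous soft prompt embeddings

embed = nn.Embedding(30522, 768)          # frozen pre-trained embedding table (stand-in)
embed.weight.requires_grad = False

prompt_encoder = PromptEncoder()
soft_prompts = prompt_encoder()           # embeddings for <T1> <T2> <T3> <T4>
text_ids = torch.randint(0, 30522, (5,))  # ids for "The movie was fantastic ."
text_emb = embed(text_ids)

# Fill the template <T1> <T2> [text] <T3> <T4> with soft prompts instead of vocabulary words.
inputs_embeds = torch.cat([soft_prompts[:2], text_emb, soft_prompts[2:]], dim=0)
print(inputs_embeds.shape)  # torch.Size([9, 768])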
2.3 2021 Parameter-efficient prompt tuning (PET)
Parameter-efficient prompt tuning (paper: The Power of Scale for Parameter-Efficient Prompt Tuning) can insert trainable prompt embeddings at arbitrary positions of the input sequence.
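The peft library ships an implementation of this soft-prompt approach (it prepends the virtual tokens to the input). A hedged sketch, assuming peft's PromptTuningConfig API and using the small gpt2 checkpoint purely for illustration:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptTuningConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,  # causal LM task
    num_virtual_tokens=8,          # 8 trainable prompt embeddings added to every input
)
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()  # only the virtual-token embeddings are trainable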
2.4 2022 P-tuning v2
P-tuning v2 (paper: P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks) uses multi-layer prompts: prefix prompt embeddings are added at every layer.
problem: for models with fewer than 10B parameters, prompt tuning performs worse than fine-tuning.
solution: p-tuning v2 adds prefix prompt embeddings at every layer; different tasks can share the same network parameters, which supports multi-task learning (a sketch follows the list below).
- Pros: captures and exploits contextual information better, further improving performance; better generalization; more flexible.
- Cons: more complex to implement; higher computational cost.
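A minimal sketch of the per-layer prefix idea (illustrative; the class DeepPrefix and its shapes are assumptions, not the official implementation): every Transformer layer gets its own trainable prefix key/value vectors, which would be concatenated with that layer's keys and values inside attention.

import torch
import torch.nn as nn

class DeepPrefix(nn.Module):  # hypothetical container for per-layer prefixes
    """One trainable prefix (key/value pair) per Transformer layer, as in deep prefix tuning / p-tuning v2."""
    def __init__(self, num_layers=12, prefix_len=16, num_heads=12, head_dim=64):
        super().__init__()
        # shape: (num_layers, 2 for key/value, prefix_len, num_heads, head_dim)
        self.prefix = nn.Parameter(torch.randn(num_layers, 2, prefix_len, num_heads, head_dim) * 0.02)

    def forward(self, layer_idx, batch_size):
        k, v = self.prefix[layer_idx]                      # each: (prefix_len, num_heads, head_dim)
        expand = lambda t: t.unsqueeze(0).expand(batch_size, -1, -1, -1)
        return expand(k), expand(v)                        # to be prepended to this layer's K and V

deep_prefix = DeepPrefix()
prefix_k, prefix_v = deep_prefix(layer_idx=0, batch_size=2)
print(prefix_k.shape)  # torch.Size([2, 16, 12, 64]); inside attention these are concatenated with K/V along the sequence axis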
2.5 2019 Adapter
paper: Parameter-Efficient Transfer Learning for NLP.
2.6 2021 LoRA (Low-Rank Adaptation)
paper: Low-Rank Adaptation of Large Language Models.
LoRA keeps the pre-trained model parameters frozen and only adds new parameters alongside the original weight matrix, and these added parameters are far fewer than those of the original matrix.
problem: if we construct a new matrix with the same n×m dimensions as W_orig to fine-tune the model, performance does not improve and the parameter count doubles!
solution: this is where the low-rank idea comes in. Pick a small rank r with r << n and r << m, and build the update from a low-rank matrix product: ΔW = B·A, with B of shape n×r and A of shape r×m. The product B·A has the same dimensions as W_orig but is made of far fewer parameters. Because we want the increment to be zero at the start of training, so that fine-tuning starts off behaving exactly like the original model, B is usually initialized as a zero matrix while A is initialized with random values (e.g. from a normal distribution).
For example, with input dim = 1024, the original W has 1024 × 1024 ≈ 1M parameters, while the low-rank pair with r = 4 has only 1024 × 4 + 4 × 1024 ≈ 8K.
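The same arithmetic in a few lines of PyTorch (a minimal sketch mirroring the initialization described above):

import torch
import torch.nn as nn

n, m, r = 1024, 1024, 4
W_orig = torch.randn(n, m)            # frozen pre-trained weight: 1024 * 1024 ≈ 1M parameters
B = nn.Parameter(torch.zeros(n, r))   # initialized to zero so the update starts at 0
A = nn.Parameter(torch.randn(r, m))   # initialized from a normal distribution

delta_W = B @ A                       # same shape as W_orig, built from 1024*4 + 4*1024 ≈ 8K parameters
W_adapted = W_orig + delta_W          # at initialization this equals W_orig exactly

print(W_orig.numel())                  # 1048576
print(B.numel() + A.numel())           # 8192
print(torch.equal(W_adapted, W_orig))  # True, because B is zero at the start of training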
Pros:
- Efficient: uses far fewer trainable parameters.
- Better generalization <-- restricting model capacity helps prevent overfitting.
- Can be integrated seamlessly into existing neural networks.
2.7 2024 DoRA (Weight-Decomposed Low-Rank Adaptation)
Core idea: DoRA decomposes each pre-trained weight matrix W into a magnitude component m and a direction component, and applies the low-rank update B·A only to the direction. It can be written as: W' = m · (W_orig + B·A) / ||W_orig + B·A||_c, where ||·||_c is the column-wise norm; only m, B, and A are trained.
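A rough PyTorch sketch of this decomposition (a simplification for intuition, not the DoRA reference implementation; shapes and initialization are assumptions):

import torch
import torch.nn as nn

n, m, r = 64, 64, 4
W0 = torch.randn(n, m)                                        # frozen pre-trained weight
B = nn.Parameter(torch.zeros(n, r))                           # LoRA update applied to the direction component
A = nn.Parameter(torch.randn(r, m))
magnitude = nn.Parameter(W0.norm(p=2, dim=0, keepdim=True))   # trainable per-column magnitude, initialized from W0

V = W0 + B @ A                                   # updated direction (un-normalized)
V_norm = V / V.norm(p=2, dim=0, keepdim=True)    # column-wise normalization
W_adapted = magnitude * V_norm                   # DoRA: magnitude * normalized direction
print(W_adapted.shape)                           # torch.Size([64, 64]); equals W0 at init since B is zero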
3. LoRA Implementation
LoRA implementation formula: W' = W_orig + ΔW = W_orig + B·A, where B has shape n×r, A has shape r×m, and r << min(n, m).
my github link: GitHub - yuyongsheng1990/LLM_Prompts
3.1 LoRA Reproduction 01: MiniLoRA
Simple, accessible, easy to understand, and powerful.
reference: minLoRA/demo.ipynb at main · cccntu/minLoRA · GitHub
3.1.1 core codes: torch.nn.utils.parametrize.register_parametrization, the parametrization registration function
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize
from functools import partial  # partial fixes some arguments of a function, producing a new function that remembers those fixed arguments and uses them when called.
'''
simple example: torch.nn.utils.parametrize.register_parametrization
output: the original parameter (weight or bias) is replaced by a parameter generated through the registered parametrization module, e.g.
Linear(
  (weight): ParametrizationList((0): MyParametrization())
  (bias): Parameter containing: [torch.FloatTensor of size 5]
)
'''
# -----------------single lora parameters---------------
linear = nn.Linear(5, 5)
print(linear)

class LowRankParametrization(nn.Module):
    def __init__(self, original_weight, rank=4):
        super().__init__()
        self.rank = rank
        self.U = nn.Parameter(torch.randn(original_weight.size(0), rank))
        self.V = nn.Parameter(torch.randn(rank, original_weight.size(1)))

    def forward(self, x):
        # x is the original weight handed in by the parametrization machinery;
        # add the low-rank update to it (LoRA-style) instead of discarding it.
        return x + self.U @ self.V

# Register the low-rank parametrization
'''
torch.nn.utils.parametrize.register_parametrization registers a new parametrization on a model parameter.
It lets you apply a transformation (here LowRankParametrization) to an existing parameter such as layer.weight,
which is exactly the hook LoRA needs.
'''
parametrize.register_parametrization(linear, 'weight', LowRankParametrization(linear.weight))
# ----------------multiple lora parameters-------------------
# Several parametrizations can be applied sequentially; just keep registering them <-- this is the stacking pattern DoRA builds on
# Define a second parametrization (despite its name, the original demo gives it the same low-rank form as the first)
class MultiplyByTwoParametrization(nn.Module):
    def __init__(self, original_weight, rank=4):
        super().__init__()
        self.rank = rank
        self.U = nn.Parameter(torch.randn(original_weight.size(0), rank))
        self.V = nn.Parameter(torch.randn(rank, original_weight.size(1)))

    def forward(self, x):
        return x + self.U @ self.V

parametrize.register_parametrization(linear, 'weight', MultiplyByTwoParametrization(linear.weight, rank=3))

# Print the linear layer to inspect the parametrized result
print(linear)
'''
output:
Linear(in_features=5, out_features=5, bias=True)   # the original linear layer
-------------------------------------------------
ParametrizedLinear(                                 # the parametrized linear layer that replaces it
  in_features=5, out_features=5, bias=True          # the layer's original configuration
  (parametrizations): ModuleDict(                   # registered parametrizations live in a ModuleDict, a dict-like module container
    (weight): ParametrizationList(                  # the original weight is now transformed by one or more parametrizations
      (0): LowRankParametrization()                 # (0) is the first registered parametrization
      (1): MultiplyByTwoParametrization()           # when several are registered, they are applied to weight in order
    )
  )
)
'''
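As a quick follow-up (a small sketch, not part of the original notebook): reading linear.weight now recomputes the weight through the registered parametrizations, the untouched original is kept under parametrizations.weight.original, and remove_parametrizations can merge everything back into a plain parameter.

print(linear.weight.shape)                            # still torch.Size([5, 5]), but the value is computed through the parametrizations
print(linear.parametrizations.weight.original.shape)  # the frozen original weight is stored separately

# Merge ("bake in") the parametrized weight and drop the parametrization machinery.
parametrize.remove_parametrizations(linear, 'weight', leave_parametrized=True)
print(linear)                                         # back to a plain Linear with the merged weight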
3.2 LoRA Reproduction 02: LoRA from Scratch on MNIST
reference: lora_from_scratch/lora_on_mnist.ipynb at main · sunildkumar/lora_from_scratch · GitHub
3.2.1 core codes: the Lightning deep-learning framework
import lightning as L  # Lightning is a high-level deep-learning framework built on top of PyTorch; it simplifies and speeds up model development and training.
from lightning.pytorch.loggers import CSVLogger  # logs training metrics to a CSV file for later analysis and visualization.
from lightning.pytorch.callbacks import LearningRateFinder  # searches for a good learning rate during training to improve model performance.
from lightning.pytorch.callbacks.early_stopping import EarlyStopping  # stops training early when the validation loss stops improving, to prevent overfitting.
from pytorch_lightning import Callback  # base class for custom callbacks that run at specific points of training, e.g. logging, saving the model, or adjusting the learning rate.
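To show how these pieces fit together, here is a minimal, hypothetical Lightning training setup (the module TinyMNISTClassifier and the dataloader names are made up for illustration; the referenced notebook wires LoRA layers into an MNIST classifier in the same style):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMNISTClassifier(L.LightningModule):  # hypothetical module name; reuses the lightning import above
    def __init__(self, lr=1e-3):
        super().__init__()
        self.lr = lr
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", F.cross_entropy(self.net(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)

# Trainer wiring: CSV logging plus early stopping on the validation loss.
trainer = L.Trainer(
    max_epochs=5,
    logger=CSVLogger("logs"),
    callbacks=[EarlyStopping(monitor="val_loss", patience=3)],
)
# trainer.fit(TinyMNISTClassifier(), train_dataloaders=train_loader, val_dataloaders=val_loader)  # dataloaders assumed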
3.3 LoRA Reproduction 03: Torch tutorial with torchtune
reference: Finetuning Llama2 with LoRA — TorchTune documentation
3.3.1 core codes: introduction to the torchtune package
from torchtune.models.llama2 import llama2_7b, lora_llama2_7b  # torchtune is a PyTorch library for easily authoring, fine-tuning, and experimenting with LLMs.
'''
torchtune, https://pytorch.org/torchtune/stable/index.html
- Llama3 in torchtune
- Finetuning with LoRA in torchtune
- Understanding QLoRA in TorchTune
- End-to-End Workflow with torchtune
'''
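A short sketch of how the tutorial uses these constructors (a sketch only; the keyword arguments follow the torchtune LoRA tutorial, but check the docs for the full signatures). lora_llama2_7b builds Llama2-7B with LoRA injected into the chosen attention projections, and the relative size of the LoRA update can be checked by parameter name:

from torchtune.models.llama2 import lora_llama2_7b

# LoRA-wrapped Llama2-7B with LoRA on the Q/V attention projections.
lora_model = lora_llama2_7b(
    lora_attn_modules=["q_proj", "v_proj"],  # which attention projections receive LoRA
    lora_rank=8,                             # rank r of the low-rank update
    lora_alpha=16,                           # scaling factor
)

# Rough check of how small the LoRA update is: count parameters whose name contains "lora".
total = sum(p.numel() for p in lora_model.parameters())
lora_only = sum(p.numel() for name, p in lora_model.named_parameters() if "lora" in name)
print(f"total params: {total:,}  LoRA params: {lora_only:,}  ({100 * lora_only / total:.2f}%)")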
3.4 LoRA Reproduction 04: peft implementation
reference: LoRA-Implementation/prepare_data.py at main · hahuyhoang411/LoRA-Implementation · GitHub
3.4.1 core codes: introduction to AutoModelForSeq2SeqLM
'''
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Specify the model name or path
model_name = "t5-small"

# Load the pre-trained model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Input text
input_text = "Translate English to French: How are you?"

# Encode the text into the model's expected input format
inputs = tokenizer(input_text, return_tensors="pt")

# Generate the output
outputs = model.generate(**inputs)

# Decode the output text
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Input: {input_text}")
print(f"Output: {output_text}")
'''
3.4.2 core codes: introduction to the peft package
'''
peft (Parameter-Efficient Fine-Tuning) package introduction:
Fine-tuning large pretrained models is often prohibitively costly due to their scale. PEFT methods enable efficient adaptation of large pretrained models to various downstream applications by only fine-tuning a small number of (extra) model parameters instead of all the model's parameters. This significantly decreases the computational and storage costs. Recent state-of-the-art PEFT techniques achieve performance comparable to fully fine-tuned models. PEFT is integrated with Transformers for easy model training and inference.

peft simplifies model configuration and loading for LLM fine-tuning, especially with techniques such as LoRA:
- LoraConfig: configures the LoRA parameters.
- TaskType: defines the task type, e.g. task_type = TaskType.SEQ_2_SEQ_LM
- get_peft_config: returns a peft configuration.
- get_peft_model: wraps a pre-trained model into a peft model.
'''
'''
----------------peft translation model---------------------
# Translation model bigscience/mt0-large: English -> French
'''
# Prepare a model for training with a PEFT method such as LoRA by wrapping the base model and PEFT configuration with get_peft_model.
# For the bigscience/mt0-large model, you are only training 0.19% of the parameters!
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer  # load a pre-trained seq2seq model and its tokenizer
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType

# Load the pre-trained model and tokenizer
model_name = 'bigscience/mt0-large'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define the LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)

# Get the peft model
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # prints the peft model's trainable parameter count

# Prepare the input data
input_text = "Translate English to French: How are you?"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate output with the PEFT model
outputs = peft_model.generate(**inputs)
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)  # decode
print(outputs)
print(output_text)
'''
------------peft causal language model----------------------
# Causal LM examples: ybelkada/opt-350m-lora; gpt2
'''
from peft import AutoPeftModelForCausalLM  # loads a causal LM together with its PEFT (e.g. LoRA) adapter weights
from transformers import AutoTokenizer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoPeftModelForCausalLM.from_pretrained('ybelkada/opt-350m-lora').to(device)
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-350m')
model.eval()

inputs = tokenizer('Preheat the oven to 350 degrees and place the cookie dough', return_tensors='pt')
outputs = model.generate(input_ids=inputs['input_ids'].to(device), max_new_tokens=50)  # generate the continuation
outputs_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]  # decode the generated token ids back to text
print(outputs)
print(outputs_text)
3.5 *LoRA 05: Explanation
***Optional: too hard and too complex, so it is not reproduced here.
reference: 使用Pytorch從零開始構建LoRA_torch lora 使用 nn - CSDN博客
3.6 *LoRA 06: Huanhuan Chat
***Optional: too hard and too complex, so it is not reproduced here.
reference: https://github.com/datawhalechina/self-llm/blob/master/GLM-4/05-GLM-4-9B-chat%20Lora%20%E5%BE%AE%E8%B0%83.ipynb