InternLM2.5 Open-Sourced: A New Benchmark for Reasoning Ability

Overview

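On July 3, 2024, Shanghai AI Laboratory and SenseTime, together with The Chinese University of Hong Kong and Fudan University, officially released the new-generation large language model InternLM2.5 (書生·浦語2.5). Compared with the previous generation, InternLM2.5 has three highlights: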

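Greatly improved reasoning ability, surpassing Llama3-70B, a model ten times its size, on some dimensions;
Support for a 1M-token context, enough to process documents of around a million characters;
Strong autonomous planning and tool-calling ability, for example searching over a hundred web pages and integrating the results to answer a complex question.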

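The InternLM2.5-7B model is open-sourced and available as of the release date, and larger and smaller models will be released as open source in the near future. Shanghai AI Laboratory follows the principle of "empowering innovation through sustained, high-quality open source": while consistently providing the community with high-quality open-source models, it will also continue to offer a free commercial-use license.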

GitHub link: https://github.com/InternLM/InternLM

**Model link:** https://www.modelscope.cn/models/Shanghai_AI_Laboratory/internlm2_5-7b-chat

**InternLM homepage:** https://internlm.intern-ai.org.cn/

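Leading Reasoning Ability

Strong reasoning ability is an important foundation on the path from large models to artificial general intelligence. InternLM2.5 treats reasoning as the model's core capability and optimizes for it, which provides a solid basis for deploying applications in complex scenarios.

Using the OpenCompass (司南) open-source evaluation framework, the research team evaluated the model on several authoritative reasoning benchmarks with a unified, reproducible method. Compared with the previous generation, InternLM2.5 improves substantially on these benchmarks. On MATH, a math benchmark built from competition problems, its score improved by 100%, reaching 60% accuracy with only 7B parameters (on par with GPT-4 Turbo 1106), which demonstrates the model's strong mathematical reasoning.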


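1M-Token Long-Context Support, with a Fully Open-Source Document Chat System

Application scenarios such as long-document understanding and complex agent interaction place higher demands on a model's context length. InternLM2.5 extends the context length from 200K in the previous generation, InternLM2, to 1M tokens (about 1.2 million Chinese characters), further unlocking the model's potential for ultra-long-text applications. During pre-training, texts of 256K tokens in length were selected from natural corpora. To avoid the domain shift caused by an overly narrow range of corpus types, these were supplemented with synthetic data, so that the model retains as much of its capability as possible while its context is extended.

The popular "needle in a haystack" test was used to evaluate the model's recall of information from long texts. The figure below shows that InternLM2.5 achieves almost perfect needle-in-a-haystack recall within a 1M-token range, demonstrating very strong long-text processing ability.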

[Figure: needle-in-a-haystack recall of InternLM2.5 within a 1M-token range]
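The idea behind a needle-in-a-haystack test can be sketched roughly as follows. This is only an illustration of the general technique, not the evaluation code used by the InternLM team: the function names `build_haystack_prompt` and `recall_score`, the filler text, and the "passcode" fact are all hypothetical, and a real evaluation would sweep many context lengths and insertion depths.

```python
def build_haystack_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    context = " ".join(sentences)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def recall_score(model_answer, expected):
    """Return 1.0 if the expected fact appears in the model's answer, else 0.0."""
    return 1.0 if expected.lower() in model_answer.lower() else 0.0


# Hypothetical haystack: 1000 filler sentences with one fact hidden in the middle.
filler = [f"This is filler sentence number {i}." for i in range(1000)]
needle = "The secret passcode is 7421."
prompt = build_haystack_prompt(
    filler, needle, depth=0.5, question="What is the secret passcode?"
)
```

The prompt would then be sent to the model, and `recall_score(answer, "7421")` recorded for each combination of context length and depth.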

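Solving Complex Problems Efficiently with Web Information

For complex problems that require large-scale information search and integration, InternLM2.5 introduces the MindSearch multi-agent framework. It simulates the human thought process through steps such as task planning, task decomposition, large-scale web search, and summarization of multi-source information, so that web information is integrated effectively. The planner focuses on task planning, decomposition, and information summarization; it plans by means of graph-structured programming and expands the graph dynamically according to task status. The searcher performs divergent searches and summarizes the web search results. Together they allow the framework to filter, browse, and integrate information from over a hundred web pages.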
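To make the planner/searcher split more concrete, here is a minimal sketch of executing a plan expressed as a dependency graph. It is not MindSearch's actual implementation, which is not shown in this article: `run_search_plan` and `fake_searcher` are hypothetical names, the searcher is a stub that performs no web search, and the graph is static rather than dynamically expanded.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_search_plan(dependencies, search_fn):
    """Answer sub-questions in dependency order.

    `dependencies` maps each sub-question to the set of sub-questions it depends on.
    `search_fn(question, context)` stands in for a searcher agent and returns a summary.
    """
    results = {}
    for question in TopologicalSorter(dependencies).static_order():
        # Pass the summaries of prerequisite sub-questions to the searcher.
        context = {dep: results[dep] for dep in dependencies.get(question, ())}
        results[question] = search_fn(question, context)
    return results


def fake_searcher(question, context):
    # Placeholder: a real searcher would query the web and summarize the pages.
    return f"summary({question}; uses {sorted(context)})"


# Hypothetical plan: B needs A's result, and the final answer needs both.
plan = {
    "final answer": {"sub-question A", "sub-question B"},
    "sub-question A": set(),
    "sub-question B": {"sub-question A"},
}
results = run_search_plan(plan, fake_searcher)
```

A graph makes dependencies between sub-questions explicit, so independent branches could in principle be searched in parallel, which this sequential sketch does not attempt.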


Model Download

Download the model with the SDK:

# Download the model
from modelscope import snapshot_download
model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2_5-7b-chat')

Or download it with the CLI:

modelscope download --model=Shanghai_AI_Laboratory/internlm2_5-7b-chat --local_dir ./internlm2_5-7b-chat/

Downloading with git clone is also supported:

git clone https://www.modelscope.cn/Shanghai_AI_Laboratory/internlm2_5-7b-chat.git

Model Inference

Inference with transformers:

import torch
from modelscope import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Shanghai_AI_Laboratory/internlm2_5-7b-chat", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("Shanghai_AI_Laboratory/internlm2_5-7b-chat", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
# Hello! How can I help you today?
response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
print(response)

GPU memory usage:

[Figure: GPU memory usage during inference]
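The comment in the inference code about float16 versus float32 can be checked with some back-of-the-envelope arithmetic. The sketch below estimates the memory taken by the weights alone; it assumes a round 7 billion parameters (the exact parameter count is not given in this article) and ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params, dtype):
    """Approximate memory for model weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


ASSUMED_PARAMS = 7_000_000_000  # assumption: "7B" taken as exactly 7e9 parameters

for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_memory_gib(ASSUMED_PARAMS, dtype):.1f} GiB")
```

Under these assumptions, float32 weights alone exceed the memory of a 24GB card, while float16 weights take half as much, which is consistent with the OOM warning in the loading code.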

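Model Fine-Tuning

This article shows how to use ms-swift to fine-tune internlm2.5-7b-chat for self-cognition, and how to run inference, deployment, and evaluation on the model before and after fine-tuning. swift is the official LLM toolbox from the ModelScope community; it supports fine-tuning, inference, quantization, evaluation, and deployment for 300+ large language models and 50+ multimodal large models.

swift open-source repository: https://github.com/modelscope/swift

Self-cognition dataset: https://modelscope.cn/datasets/swift/self-cognition

To fine-tune internlm2.5-7b-chat on another dataset, just change --dataset. A custom dataset can be given as a local path or as a dataset_id from modelscope or huggingface. See the documentation: https://github.com/modelscope/swift/blob/main/docs/source/LLM/%E8%87%AA%E5%AE%9A%E4%B9%89%E4%B8%8E%E6%8B%93%E5%B1%95.md#%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AE%E9%9B%86

Before starting fine-tuning, make sure your environment is installed correctly: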

git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'

Fine-tuning script:

# Experimental environment: A10, 3090, V100, ...
# 22GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type internlm2_5-7b-chat \
    --dataset alpaca-zh#500 alpaca-en#500 self-cognition#500 \
    --logging_steps 5 \
    --max_length 2048 \
    --learning_rate 1e-4 \
    --output_dir output \
    --lora_target_modules ALL \
    --model_name 小黃 'Xiao Huang' \
    --model_author 魔搭 ModelScope

# Experimental environment: A10, 3090, V100, ...
# 4 * 20GB GPU memory
# Deepspeed-ZeRO2
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model_type internlm2_5-7b-chat \
    --dataset alpaca-zh#500 alpaca-en#500 self-cognition#500 \
    --logging_steps 5 \
    --max_length 2048 \
    --learning_rate 1e-4 \
    --output_dir output \
    --lora_target_modules ALL \
    --model_name 小黃 'Xiao Huang' \
    --model_author 魔搭 ModelScope \
    --deepspeed default-zero2

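GPU memory usage during fine-tuning:

[Figure: GPU memory usage during fine-tuning]

Loss curve during fine-tuning:

[Figure: training loss curve]

The inference script after fine-tuning is as follows; ckpt_dir must be changed to the last checkpoint folder produced by training. vLLM can be used to accelerate inference on the merged checkpoint.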

# Experimental environment: A10, 3090, V100, ...
# Merge the LoRA weights into the base model
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir output/internlm2_5-7b-chat/vx-xxx/checkpoint-xxx \
    --merge_lora true

# Use vLLM to accelerate inference
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/internlm2_5-7b-chat/vx-xxx/checkpoint-xxx-merged \
    --infer_backend vllm --max_model_len 4096

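Example of the fine-tuned model running inference on the validation set:

[Figure: sample inference output of the fine-tuned model on the validation set]

Evaluate the model before and after self-cognition fine-tuning: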

# Experimental environment: A100
# The evaluation backend is provided by the llmuses library: https://github.com/modelscope/eval-scope
# Original model
CUDA_VISIBLE_DEVICES=0 swift eval \
    --model_type internlm2_5-7b-chat \
    --eval_dataset arc ceval gsm8k --eval_backend Native \
    --infer_backend vllm

# Fine-tuned model
CUDA_VISIBLE_DEVICES=0 swift eval \
    --ckpt_dir output/internlm2_5-7b-chat/vx-xxx/checkpoint-xxx-merged \
    --eval_dataset arc ceval gsm8k --eval_backend Native \
    --infer_backend vllm

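Model	arc	ceval	gsm8k
Original model	0.843	0.7452	0.8173
Fine-tuned model	0.8404	0.7489	0.8082

As the table shows, self-cognition fine-tuning changes the model's evaluation results only slightly (arc and gsm8k drop a little, ceval rises a little); mixing in a better general-purpose dataset would mitigate this effect.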
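The size of the change can be read directly off the table. The short sketch below computes the per-benchmark difference between the fine-tuned and original scores; the numbers are copied from the table above, and the dictionary layout and the `score_deltas` helper are just illustrative choices.

```python
# Scores copied from the evaluation table above.
scores = {
    "original": {"arc": 0.843, "ceval": 0.7452, "gsm8k": 0.8173},
    "fine_tuned": {"arc": 0.8404, "ceval": 0.7489, "gsm8k": 0.8082},
}


def score_deltas(before, after):
    """Return after - before for each benchmark, rounded to 4 decimal places."""
    return {name: round(after[name] - before[name], 4) for name in before}


deltas = score_deltas(scores["original"], scores["fine_tuned"])
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

All three differences are below one percentage point in absolute value.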

Model Deployment

Deploy with lmdeploy:

pip install lmdeploy

Run local batch inference with Python code:

import lmdeploy
pipe = lmdeploy.pipeline("/mnt/workspace/internlm2_5-7b-chat")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

You can also launch an OpenAI-compatible service with a single command:

lmdeploy serve api_server /mnt/workspace/internlm2_5-7b-chat --model-name internlm2_5-7b-chat --server-port 23333

Calling the API:

curl http://localhost:23333/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "internlm2_5-7b-chat",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Introduce deep learning to me."}
        ]
    }'
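The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl call above; `build_chat_request` and `send_chat_request` are hypothetical helper names, and the response parsing assumes the server returns the usual OpenAI chat-completions shape (`choices[0].message.content`), which is not shown in this article.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message,
                       system_message="You are a helpful assistant."):
    """Build a POST request for an OpenAI-style /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body; requires the server to be running.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Same host, port, model name, and prompt as the curl example above.
request = build_chat_request(
    "http://localhost:23333", "internlm2_5-7b-chat", "Introduce deep learning to me."
)
```

With the lmdeploy server from the previous step running, `print(send_chat_request(request))` would print the model's reply.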