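With the rapid development of artificial intelligence, large-scale pre-trained language models have become a focus of research and public attention. In 2018 OpenAI introduced the first-generation GPT model, opening up the generative pre-training route for language models, and followed it with GPT-2 and GPT-3. Meanwhile, Google explored other large-scale pre-training approaches such as T5 and Flan. In November 2022 OpenAI released ChatGPT, whose question answering, logical reasoning, and content creation abilities brought these models to a practical level and changed perceptions of what large models can do. In March 2023 OpenAI released the upgraded GPT-4 model, which added multimodal capabilities, further extending the boundary of large language models and moving a step closer to artificial general intelligence. After ChatGPT and GPT-4 appeared, Microsoft used its productization strength to quickly integrate them into its search engine and Office suite, producing New Bing and Office Copilot. Google quickly launched Bard, based on its own PaLM and PaLM-2 models, to compete head-on with OpenAI and Microsoft. Many companies and research institutions in China are also developing large models: Baidu, Alibaba, Huawei, SenseTime, and iFLYTEK have each released their own large language models, and universities such as Tsinghua and Fudan have released models such as GLM and MOSS.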
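To assess large models accurately and fairly, institutions in China and abroad have made many attempts at large-model evaluation. Stanford University proposed HELM, a fairly systematic evaluation framework that evaluates models along dimensions such as accuracy, safety, robustness, and fairness. New York University, together with Google and Meta, proposed the SuperGLUE benchmark, a language model evaluation dataset of 8 subtasks covering reasoning, commonsense understanding, and question answering. The University of California, Berkeley proposed the MMLU test set, which draws on a range of high-school and university exams to assess a model's knowledge and reasoning. Google proposed Big-Bench, a benchmark of more than 200 subtasks spanning mathematics and science, programming, reading comprehension, and logical reasoning, for systematic evaluation of model capabilities. For Chinese, domestic academic institutions have proposed datasets such as CLUE and CUGE, which evaluate the Chinese abilities of language models in text classification, reading comprehension, logical reasoning, and other areas.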
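As large models flourish, how to evaluate their capabilities comprehensively and systematically has become a pressing problem. Because large language models and multimodal models are powerful and widely applicable, current academic and industrial evaluation schemes tend to cover only some capability dimensions and lack a systematic framework of capability dimensions and evaluation methods. OpenCompass offers a comprehensive, efficient, and extensible evaluation scheme that assesses model capability, performance, and safety. It provides a distributed, automated evaluation system that supports comprehensive, systematic capability evaluation of large (language/multimodal) models.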
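Introduction to OpenCompass
Evaluation Targets
This library mainly evaluates large language models and large multimodal models. We use large language models as the example when describing the specific model types.
- Base models: typically obtained by self-supervised training on massive text corpora (for example OpenAI's GPT-3 and Meta's LLaMA); they usually have strong text continuation ability.
- Chat models: typically obtained from a base model through instruction fine-tuning or human preference alignment (for example OpenAI's ChatGPT and Shanghai AI Laboratory's InternLM, 書生·浦語); they can understand human instructions and have strong conversational ability.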
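Tool Architecture
- Model layer: the main categories of models involved in large-model evaluation. OpenCompass focuses on base models and chat models.
- Capability layer: OpenCompass designs its evaluation dimensions around general capabilities and specialized capabilities. General capabilities are evaluated along dimensions such as language, knowledge, understanding, reasoning, and safety. Specialized capabilities are evaluated along dimensions such as long context, code, tool use, and knowledge augmentation.
- Method layer: OpenCompass uses both objective and subjective evaluation. Objective evaluation conveniently measures a model on tasks with definite answers (multiple choice, fill-in-the-blank, closed-ended question answering, and so on). Subjective evaluation measures how satisfied users actually are with a model's replies; OpenCompass supports both model-assisted and human-feedback-based subjective evaluation.
- Tool layer: OpenCompass provides a rich set of features for automated, efficient evaluation of large language models, including distributed evaluation, prompt engineering, integration with evaluation databases, leaderboard publishing, and evaluation report generation.
Capability Dimensions
Evaluation Methods
OpenCompass combines objective and subjective evaluation. For capability dimensions and scenarios with definite answers, it builds rich, well-developed test sets to assess model capability comprehensively. For open-ended or semi-open-ended questions that reveal model capability, and for model safety issues, it combines subjective and objective evaluation.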
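Objective Evaluation
For objective questions with standard answers, we can use quantitative metrics to measure the difference between the model's output and the standard answer and judge the model's performance from the result. Because the output of a large language model is highly unconstrained, its inputs and outputs need some standardization and design at evaluation time to minimize the effect of noisy output; only then can the model's capability be judged completely and objectively.
To better elicit the model's ability in the domain being tested, and to guide it to answer in a given template, OpenCompass uses prompt engineering and in-context learning for objective evaluation.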
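As a rough illustration of in-context learning, the sketch below assembles a few-shot prompt: solved examples come first, then the open question with its answer slot left blank. The function name, the "Question:/Answer:" template, and the sample questions are all hypothetical and are not taken from OpenCompass.

```python
def build_few_shot_prompt(examples, question, options):
    """Assemble a prompt: solved examples first, then the open question."""

    def render(q, opts, answer=""):
        lines = [f"Question: {q}"]
        lines += [f"{label}. {text}" for label, text in opts]
        # With an empty answer this renders as a bare "Answer:" slot.
        lines.append(f"Answer: {answer}".rstrip())
        return "\n".join(lines)

    blocks = [render(q, opts, ans) for q, opts, ans in examples]
    blocks.append(render(question, options))  # answer left blank for the model
    return "\n\n".join(blocks)


# Hypothetical demonstration data.
examples = [
    ("2 + 2 = ?", [("A", "3"), ("B", "4")], "B"),
]
prompt = build_few_shot_prompt(examples, "3 + 5 = ?", [("A", "8"), ("B", "9")])
print(prompt)
```

The fixed template nudges the model to reply with a single option letter, which makes the output easier to score automatically.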
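In practice, objective evaluation usually scores model output in one of two ways:
- Discriminative evaluation: combine the question with each candidate answer, compute the model's perplexity on every combination, and take the answer with the lowest perplexity as the model's final output. For example, if the model's perplexity is 0.1 on "question? answer1" and 0.2 on "question? answer2", answer1 is chosen as the model's output.
- Generative evaluation: used mainly for generation tasks such as translation, program synthesis, and logical analysis questions. The question is given as the model's raw input, with the answer region left blank for the model to complete. The output usually also needs post-processing so that it meets the dataset's requirements.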
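The two scoring styles can be sketched in a few lines. This is a minimal illustration and not OpenCompass's implementation: the per-token log-probabilities are made-up numbers standing in for what a language model would return, and the helper names are hypothetical. The post-processing rule (take the first standalone option letter) is deliberately simple and would misfire on text that uses "A" as an article.

```python
import math
import re


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_by_perplexity(candidates):
    """Discriminative scoring: candidates maps an answer label to the
    token log-probs of 'question + answer'; the lowest perplexity wins."""
    return min(candidates, key=lambda label: perplexity(candidates[label]))


def first_option_letter(generated_text, options="ABCD"):
    """Generative scoring: post-process free-form output by returning
    the first standalone option letter, or None if there is none."""
    match = re.search(rf"\b([{options}])\b", generated_text)
    return match.group(1) if match else None


# Hypothetical per-token log-probabilities for two candidate completions.
scores = {
    "answer1": [-0.1, -0.2, -0.1],
    "answer2": [-0.9, -1.2, -0.7],
}
print(pick_by_perplexity(scores))  # answer1
print(first_option_letter("The answer is B, because ..."))  # B
```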
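Subjective Evaluation
Language is vivid and endlessly varied, and many scenarios and capabilities cannot be assessed with objective metrics. For areas such as model safety and language ability, evaluation based on human judgment better reflects a model's real capability and better matches how large models are actually used.
OpenCompass's subjective evaluation relies on the judgment of human raters to assess chat-capable large language models. In practice, we build a set of subjective test questions in advance based on the model's capability dimensions, show raters the replies of different models to the same question, and collect their subjective scores. Because subjective testing is expensive, the scheme also uses high-performing large language models to simulate human scoring. In actual evaluations, subjective assessment by human experts is combined with model-based scoring.
When carrying out subjective evaluation, OpenCompass works in two ways: satisfaction statistics for a single model's replies, and satisfaction comparison across multiple models.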
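Both views reduce to simple aggregation once the ratings are collected. The sketch below is an assumption-laden illustration rather than OpenCompass code: the 1-5 rating scale, the threshold of 4, the model names, and the vote format are all hypothetical.

```python
from collections import Counter


def satisfaction_rate(scores, threshold=4):
    """Single-model view: share of replies rated at or above `threshold`
    (a 1-5 scale is assumed here)."""
    return sum(s >= threshold for s in scores) / len(scores)


def win_rates(pairwise_votes):
    """Multi-model view: fraction of comparisons each model wins.
    Each vote is (model_a, model_b, winner); winner=None means a tie."""
    wins = Counter()
    appearances = Counter()
    for model_a, model_b, winner in pairwise_votes:
        appearances[model_a] += 1
        appearances[model_b] += 1
        if winner is not None:
            wins[winner] += 1
    return {m: wins[m] / appearances[m] for m in appearances}


# Hypothetical ratings from human raters or from a judge model.
ratings = [5, 4, 2, 3, 5]
votes = [
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_y", "model_y"),
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_y", None),
]
print(satisfaction_rate(ratings))  # 0.6
print(win_rates(votes))  # {'model_x': 0.5, 'model_y': 0.25}
```

The same functions apply whether the scores come from human experts or from a judge model, so the two sources can be compared side by side.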
Hands-on Practice
Installation
conda create --name opencompass --clone=/root/share/conda_envs/internlm-base
conda activate opencompass
git clone https://github.com/open-compass/opencompass
cd opencompass
pip install -e .
Data Preparation
cp /share/temp/datasets/OpenCompassData-core-20231110.zip /root/opencompass/
unzip OpenCompassData-core-20231110.zip
Listing Supported Datasets and Models
# List all configs related to internlm and ceval
python tools/list_configs.py internlm ceval
Copying the Model
cp -r /root/share/model_repos/internlm2-chat-7b /root/model
Launching the Evaluation
python run.py --datasets ceval_gen --hf-path /root/model/internlm2-chat-7b/ --tokenizer-path /root/model/internlm2-chat-7b/ --tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True --model-kwargs trust_remote_code=True device_map='auto' --max-seq-len 2048 --max-out-len 16 --batch-size 4 --num-gpus 1 --debug
Command Breakdown
--datasets ceval_gen                                  # dataset config to evaluate
--hf-path /root/model/internlm2-chat-7b/              # HuggingFace model path
--tokenizer-path /root/model/internlm2-chat-7b/       # HuggingFace tokenizer path (can be omitted if it is the same as the model path)
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True   # arguments for building the tokenizer
--model-kwargs trust_remote_code=True device_map='auto'                           # arguments for building the model
--max-seq-len 2048                                    # maximum sequence length the model accepts
--max-out-len 16                                      # maximum number of tokens to generate
--batch-size 4                                        # batch size
--num-gpus 1                                          # number of GPUs needed to run the model
--debug                                               # debug mode: tasks run one at a time and output is printed in real time
Evaluation Results
20240301_214622
tabulate format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
dataset version metric mode opencompass.models.huggingface.HuggingFace_model_internlm2-chat-7b
---------------------------------------------- --------- ------------- ------ --------------------------------------------------------------------
ceval-computer_network db9ce2 accuracy gen 47.37
ceval-operating_system 1c2571 accuracy gen 57.89
ceval-computer_architecture a74dad accuracy gen 42.86
ceval-college_programming 4ca32a accuracy gen 51.35
ceval-college_physics 963fa8 accuracy gen 36.84
ceval-college_chemistry e78857 accuracy gen 33.33
ceval-advanced_mathematics ce03e2 accuracy gen 15.79
ceval-probability_and_statistics 65e812 accuracy gen 27.78
ceval-discrete_mathematics e894ae accuracy gen 18.75
ceval-electrical_engineer ae42b9 accuracy gen 40.54
ceval-metrology_engineer ee34ea accuracy gen 58.33
ceval-high_school_mathematics 1dc5bf accuracy gen 44.44
ceval-high_school_physics adf25f accuracy gen 47.37
ceval-high_school_chemistry 2ed27f accuracy gen 52.63
ceval-high_school_biology 8e2b9a accuracy gen 26.32
ceval-middle_school_mathematics bee8d5 accuracy gen 26.32
ceval-middle_school_biology 86817c accuracy gen 66.67
ceval-middle_school_physics 8accf6 accuracy gen 57.89
ceval-middle_school_chemistry 167a15 accuracy gen 95
ceval-veterinary_medicine b4e08d accuracy gen 39.13
ceval-college_economics f3f4e6 accuracy gen 47.27
ceval-business_administration c1614e accuracy gen 51.52
ceval-marxism cf874c accuracy gen 84.21
ceval-mao_zedong_thought 51c7a4 accuracy gen 70.83
ceval-education_science 591fee accuracy gen 72.41
ceval-teacher_qualification 4e4ced accuracy gen 79.55
ceval-high_school_politics 5c0de2 accuracy gen 21.05
ceval-high_school_geography 865461 accuracy gen 47.37
ceval-middle_school_politics 5be3e7 accuracy gen 42.86
ceval-middle_school_geography 8a63be accuracy gen 58.33
ceval-modern_chinese_history fc01af accuracy gen 65.22
ceval-ideological_and_moral_cultivation a2aa4a accuracy gen 89.47
ceval-logic f5b022 accuracy gen 54.55
ceval-law a110a1 accuracy gen 41.67
ceval-chinese_language_and_literature 0f8b68 accuracy gen 56.52
ceval-art_studies 2a1300 accuracy gen 69.7
ceval-professional_tour_guide 4e673e accuracy gen 86.21
ceval-legal_professional ce8787 accuracy gen 43.48
ceval-high_school_chinese 315705 accuracy gen 68.42
ceval-high_school_history 7eb30a accuracy gen 75
ceval-middle_school_history 48ab4a accuracy gen 68.18
ceval-civil_servant 87d061 accuracy gen 55.32
ceval-sports_science 70f27b accuracy gen 73.68
ceval-plant_protection 8941f9 accuracy gen 77.27
ceval-basic_medicine c409d6 accuracy gen 63.16
ceval-clinical_medicine 49e82d accuracy gen 45.45
ceval-urban_and_rural_planner 95b885 accuracy gen 58.7
ceval-accountant 002837 accuracy gen 44.9
ceval-fire_engineer bc23f5 accuracy gen 38.71
ceval-environmental_impact_assessment_engineer c64e2d accuracy gen 45.16
ceval-tax_accountant 3a5e3c accuracy gen 51.02
ceval-physician 6e277d accuracy gen 51.02
ceval-stem - naive_average gen 44.33
ceval-social-science - naive_average gen 57.54
ceval-humanities - naive_average gen 65.31
ceval-other - naive_average gen 54.94
ceval-hard - naive_average gen 34.62
ceval - naive_average gen 53.55
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
-------------------------------------------------------------------------------------------------------------------------------- THIS IS A DIVIDER --------------------------------------------------------------------------------------------------------------------------------
csv format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
dataset,version,metric,mode,opencompass.models.huggingface.HuggingFace_model_internlm2-chat-7b
ceval-computer_network,db9ce2,accuracy,gen,47.37
ceval-operating_system,1c2571,accuracy,gen,57.89
ceval-computer_architecture,a74dad,accuracy,gen,42.86
ceval-college_programming,4ca32a,accuracy,gen,51.35
ceval-college_physics,963fa8,accuracy,gen,36.84
ceval-college_chemistry,e78857,accuracy,gen,33.33
ceval-advanced_mathematics,ce03e2,accuracy,gen,15.79
ceval-probability_and_statistics,65e812,accuracy,gen,27.78
ceval-discrete_mathematics,e894ae,accuracy,gen,18.75
ceval-electrical_engineer,ae42b9,accuracy,gen,40.54
ceval-metrology_engineer,ee34ea,accuracy,gen,58.33
ceval-high_school_mathematics,1dc5bf,accuracy,gen,44.44
ceval-high_school_physics,adf25f,accuracy,gen,47.37
ceval-high_school_chemistry,2ed27f,accuracy,gen,52.63
ceval-high_school_biology,8e2b9a,accuracy,gen,26.32
ceval-middle_school_mathematics,bee8d5,accuracy,gen,26.32
ceval-middle_school_biology,86817c,accuracy,gen,66.67
ceval-middle_school_physics,8accf6,accuracy,gen,57.89
ceval-middle_school_chemistry,167a15,accuracy,gen,95.00
ceval-veterinary_medicine,b4e08d,accuracy,gen,39.13
ceval-college_economics,f3f4e6,accuracy,gen,47.27
ceval-business_administration,c1614e,accuracy,gen,51.52
ceval-marxism,cf874c,accuracy,gen,84.21
ceval-mao_zedong_thought,51c7a4,accuracy,gen,70.83
ceval-education_science,591fee,accuracy,gen,72.41
ceval-teacher_qualification,4e4ced,accuracy,gen,79.55
ceval-high_school_politics,5c0de2,accuracy,gen,21.05
ceval-high_school_geography,865461,accuracy,gen,47.37
ceval-middle_school_politics,5be3e7,accuracy,gen,42.86
ceval-middle_school_geography,8a63be,accuracy,gen,58.33
ceval-modern_chinese_history,fc01af,accuracy,gen,65.22
ceval-ideological_and_moral_cultivation,a2aa4a,accuracy,gen,89.47
ceval-logic,f5b022,accuracy,gen,54.55
ceval-law,a110a1,accuracy,gen,41.67
ceval-chinese_language_and_literature,0f8b68,accuracy,gen,56.52
ceval-art_studies,2a1300,accuracy,gen,69.70
ceval-professional_tour_guide,4e673e,accuracy,gen,86.21
ceval-legal_professional,ce8787,accuracy,gen,43.48
ceval-high_school_chinese,315705,accuracy,gen,68.42
ceval-high_school_history,7eb30a,accuracy,gen,75.00
ceval-middle_school_history,48ab4a,accuracy,gen,68.18
ceval-civil_servant,87d061,accuracy,gen,55.32
ceval-sports_science,70f27b,accuracy,gen,73.68
ceval-plant_protection,8941f9,accuracy,gen,77.27
ceval-basic_medicine,c409d6,accuracy,gen,63.16
ceval-clinical_medicine,49e82d,accuracy,gen,45.45
ceval-urban_and_rural_planner,95b885,accuracy,gen,58.70
ceval-accountant,002837,accuracy,gen,44.90
ceval-fire_engineer,bc23f5,accuracy,gen,38.71
ceval-environmental_impact_assessment_engineer,c64e2d,accuracy,gen,45.16
ceval-tax_accountant,3a5e3c,accuracy,gen,51.02
ceval-physician,6e277d,accuracy,gen,51.02
ceval-stem,-,naive_average,gen,44.33
ceval-social-science,-,naive_average,gen,57.54
ceval-humanities,-,naive_average,gen,65.31
ceval-other,-,naive_average,gen,54.94
ceval-hard,-,naive_average,gen,34.62
ceval,-,naive_average,gen,53.55
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
-------------------------------------------------------------------------------------------------------------------------------- THIS IS A DIVIDER --------------------------------------------------------------------------------------------------------------------------------
raw format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-------------------------------
Model: opencompass.models.huggingface.HuggingFace_model_internlm2-chat-7b
ceval-computer_network: {'accuracy': 47.368421052631575}
ceval-operating_system: {'accuracy': 57.89473684210527}
ceval-computer_architecture: {'accuracy': 42.857142857142854}
ceval-college_programming: {'accuracy': 51.35135135135135}
ceval-college_physics: {'accuracy': 36.84210526315789}
ceval-college_chemistry: {'accuracy': 33.33333333333333}
ceval-advanced_mathematics: {'accuracy': 15.789473684210526}
ceval-probability_and_statistics: {'accuracy': 27.77777777777778}
ceval-discrete_mathematics: {'accuracy': 18.75}
ceval-electrical_engineer: {'accuracy': 40.54054054054054}
ceval-metrology_engineer: {'accuracy': 58.333333333333336}
ceval-high_school_mathematics: {'accuracy': 44.44444444444444}
ceval-high_school_physics: {'accuracy': 47.368421052631575}
ceval-high_school_chemistry: {'accuracy': 52.63157894736842}
ceval-high_school_biology: {'accuracy': 26.31578947368421}
ceval-middle_school_mathematics: {'accuracy': 26.31578947368421}
ceval-middle_school_biology: {'accuracy': 66.66666666666666}
ceval-middle_school_physics: {'accuracy': 57.89473684210527}
ceval-middle_school_chemistry: {'accuracy': 95.0}
ceval-veterinary_medicine: {'accuracy': 39.130434782608695}
ceval-college_economics: {'accuracy': 47.27272727272727}
ceval-business_administration: {'accuracy': 51.515151515151516}
ceval-marxism: {'accuracy': 84.21052631578947}
ceval-mao_zedong_thought: {'accuracy': 70.83333333333334}
ceval-education_science: {'accuracy': 72.41379310344827}
ceval-teacher_qualification: {'accuracy': 79.54545454545455}
ceval-high_school_politics: {'accuracy': 21.052631578947366}
ceval-high_school_geography: {'accuracy': 47.368421052631575}
ceval-middle_school_politics: {'accuracy': 42.857142857142854}
ceval-middle_school_geography: {'accuracy': 58.333333333333336}
ceval-modern_chinese_history: {'accuracy': 65.21739130434783}
ceval-ideological_and_moral_cultivation: {'accuracy': 89.47368421052632}
ceval-logic: {'accuracy': 54.54545454545454}
ceval-law: {'accuracy': 41.66666666666667}
ceval-chinese_language_and_literature: {'accuracy': 56.52173913043478}
ceval-art_studies: {'accuracy': 69.6969696969697}
ceval-professional_tour_guide: {'accuracy': 86.20689655172413}
ceval-legal_professional: {'accuracy': 43.47826086956522}
ceval-high_school_chinese: {'accuracy': 68.42105263157895}
ceval-high_school_history: {'accuracy': 75.0}
ceval-middle_school_history: {'accuracy': 68.18181818181817}
ceval-civil_servant: {'accuracy': 55.319148936170215}
ceval-sports_science: {'accuracy': 73.68421052631578}
ceval-plant_protection: {'accuracy': 77.27272727272727}
ceval-basic_medicine: {'accuracy': 63.1578947368421}
ceval-clinical_medicine: {'accuracy': 45.45454545454545}
ceval-urban_and_rural_planner: {'accuracy': 58.69565217391305}
ceval-accountant: {'accuracy': 44.89795918367347}
ceval-fire_engineer: {'accuracy': 38.70967741935484}
ceval-environmental_impact_assessment_engineer: {'accuracy': 45.16129032258064}
ceval-tax_accountant: {'accuracy': 51.02040816326531}
ceval-physician: {'accuracy': 51.02040816326531}
ceval-stem: {'naive_average': 44.330303885938896}
ceval-social-science: {'naive_average': 57.54025149079596}
ceval-humanities: {'naive_average': 65.30999398082602}
ceval-other: {'naive_average': 54.944902032059396}
ceval-hard: {'naive_average': 34.6171418128655}
ceval: {'naive_average': 53.554085553239936}
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
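The naive_average rows appear to be plain unweighted means of the per-subject accuracies. As a check, the sketch below parses the first 20 subject rows of the CSV summary and averages them. Two assumptions are made: these 20 rows are taken to form the ceval-stem group (the report itself does not state the grouping), and the last column is renamed to "score" for brevity, whereas the real header carries the full model identifier.

```python
import csv
import io

# First 20 subject rows copied from the CSV summary above;
# the last column is renamed to "score" for illustration.
csv_text = """dataset,version,metric,mode,score
ceval-computer_network,db9ce2,accuracy,gen,47.37
ceval-operating_system,1c2571,accuracy,gen,57.89
ceval-computer_architecture,a74dad,accuracy,gen,42.86
ceval-college_programming,4ca32a,accuracy,gen,51.35
ceval-college_physics,963fa8,accuracy,gen,36.84
ceval-college_chemistry,e78857,accuracy,gen,33.33
ceval-advanced_mathematics,ce03e2,accuracy,gen,15.79
ceval-probability_and_statistics,65e812,accuracy,gen,27.78
ceval-discrete_mathematics,e894ae,accuracy,gen,18.75
ceval-electrical_engineer,ae42b9,accuracy,gen,40.54
ceval-metrology_engineer,ee34ea,accuracy,gen,58.33
ceval-high_school_mathematics,1dc5bf,accuracy,gen,44.44
ceval-high_school_physics,adf25f,accuracy,gen,47.37
ceval-high_school_chemistry,2ed27f,accuracy,gen,52.63
ceval-high_school_biology,8e2b9a,accuracy,gen,26.32
ceval-middle_school_mathematics,bee8d5,accuracy,gen,26.32
ceval-middle_school_biology,86817c,accuracy,gen,66.67
ceval-middle_school_physics,8accf6,accuracy,gen,57.89
ceval-middle_school_chemistry,167a15,accuracy,gen,95.00
ceval-veterinary_medicine,b4e08d,accuracy,gen,39.13
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
# Unweighted mean: every subject counts equally, regardless of its size.
stem_mean = sum(float(r["score"]) for r in rows) / len(rows)
print(round(stem_mean, 2))  # 44.33
```

The result matches the reported ceval-stem value of 44.33, which is consistent with the unweighted-mean reading. Because each subject counts equally, a small subject with an extreme score (such as middle_school_chemistry at 95) moves the group average as much as a large one.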