Key Benchmarks for Evaluating the Capabilities of Large Language Models (LLMs)

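These benchmarks span multiple domains (language understanding, mathematics, reasoning, programming, medicine, and others) and multiple capability dimensions (factual retrieval, calculation, code generation, chain-of-thought reasoning, and multilingual processing). They are commonly used for comparative evaluation when a model is released, for example in the papers or reports for GPT-4, Claude, Gemini, and Mistral.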

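Benchmark	Description	Purpose	Links	License
MMLU	Massive Multitask Language Understanding	Tests model performance on multi-subject exam questions (history, law, medicine, and others)	https://arxiv.org/abs/2009.03300, https://github.com/hendrycks/test	MIT License
MATH	Mathematical Problem Solving	Tests the ability to solve competition-level mathematics problems	https://arxiv.org/abs/2103.03874, https://github.com/hendrycks/math	MIT License
GPQA	Graduate-level, Google-proof Q&A	Graduate-level questions in biology, physics, and chemistry that are hard to answer with a search engine	https://arxiv.org/abs/2311.12022, https://github.com/idavidrein/gpqa/	MIT License
DROP	Discrete Reasoning Over Paragraphs	Reading comprehension test that emphasizes numerical operations, reasoning, and combining information across a passage	https://arxiv.org/abs/1903.00161, https://allenai.org/data/drop	Apache 2.0
MGSM	Multilingual Grade School Math	Grade-school math problems in multiple languages, testing chain-of-thought reasoning	https://arxiv.org/abs/2210.03057, https://github.com/google-research/url-nlp	CC-BY 4.0
HumanEval	Code Generation and Evaluation	Tests code generation and functional correctness on Python programming problems	https://arxiv.org/abs/2107.03374, https://github.com/openai/human-eval	MIT License
SimpleQA	Short-form Factuality Benchmark	Tests accuracy on short factual questions (such as "How far is the Earth from the Sun?")	https://openai.com/index/introducing-simpleqa	MIT License
BrowseComp	Web-based Browsing Agent Task	Tests how well agents with web-browsing capability perform in task scenarios	https://openai.com/index/browsecomp	MIT License
HealthBench	Health-related LLM Evaluation	Evaluates model capability in medical and health scenarios, with emphasis on factual accuracy and safety	https://openai.com/index/healthbench	MIT License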
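
To make the "functional correctness" entry for HumanEval more concrete: the HumanEval paper reports results as pass@k, the probability that at least one of k sampled solutions passes a problem's unit tests. The following is a minimal sketch of the unbiased estimator described in that paper, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a problem and c is the number that pass. The function names `pass_at_k` and `mean_pass_at_k` and the sample counts below are illustrative assumptions, not taken from the official repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed the unit tests
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset has no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts for three problems: (samples drawn, samples passing).
    made_up_results = [(10, 3), (10, 0), (10, 10)]
    print(mean_pass_at_k(made_up_results, k=1))
    print(mean_pass_at_k(made_up_results, k=5))
```

Most of the other benchmarks in the table are scored differently (for example, by answer accuracy on multiple-choice or short-answer items), so this estimator applies only to code-generation tasks that are checked by running tests.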
