My own original post: https://blog.51cto.com/whaosoft/13877107
#PHYBench
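200 collaborators from Peking University's School of Physics, more than 50 of them gold medalists! PHYBench: can large models really understand physics?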
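This project was coordinated and supervised by Zhu Huaxing (朱華星) and Associate Dean Cao Qinghong (曹慶宏) of the School of Physics, Peking University. The benchmark design, project management, and data integration were carried out mainly by a core student team, whose members include Qiu Shi (仇是), Guo Shaoyang (郭紹陽), Song Zhuoyang (宋卓洋), Sun Yunbo (孫韞博), Cai Zeyu (蔡則宇), Wei Jiashen (衛家燊), and Luo Tianyu (羅天宇). The project also received strong support from Academician Luo Minxing (羅民興) of the Peking University Computing Center and Zhang Muhan (張牧涵) of the Institute for Artificial Intelligence.
The PHYBench project brought together more than 200 students from the School of Physics and sister departments, who shared the work of writing problems, reviewing them, and taking the human baseline test. The team includes at least 50 gold medalists of the national physics competition for secondary-school students, as well as gold medalists of the Asian Physics Olympiad and the International Physics Olympiad. Collaboration at this scale and quality reflects the students' academic grounding and organizational ability, and it underpins the quality of PHYBench.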
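As large language models (LLMs) advance rapidly, reasoning ability has become almost synonymous with model capability. Frontier models such as OpenAI's o-series and DeepSeek R1 have been released one after another. Aided by reinforcement learning, they keep setting new records on scientific benchmarks and are even claimed to "surpass human experts".
However, as the arms race between model capability and benchmarks heats up, more and more benchmarks have had to turn to obscure knowledge points or abstract math-competition problems. Such problems can still "separate" models, but they drift away from real-world scenarios and may not reflect how models actually perform.
Recently, the School of Physics at Peking University, together with the Institute for Artificial Intelligence and several other departments, released a new benchmark, PHYBench. PHYBench contains 500 carefully designed, high-quality physics problems (see Figure 1), with difficulty spanning high-school physics, undergraduate physics, and Physics Olympiad competitions. The problems are grounded in real physical scenarios and are not abstract for humans, yet large models perform poorly on them. The models' chains of thought on these problems also expose weaknesses in perception and reasoning.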
- Paper: https://arxiv.org/abs/2504.16074
- Project page: https://phybench-official.github.io/phybench-demo/
- Dataset: https://huggingface.co/datasets/Eureka-Lab/PHYBench
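Perhaps physics is the subject best suited to probing AI reasoning? PHYBench offers a new tool and perspective for evaluating the reasoning ability that actually matters in large models.
Figure 1: A sample problem and the two evaluation methods: expression tree edit distance and accuracy.
Table 1: Comparison with existing benchmarks. Among high-difficulty datasets, PHYBench is relatively large, and it introduces a new scoring metric: expression tree edit distance.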
Innovation in evaluation method
Expression tree edit distance (EED Score)
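Traditional benchmarks usually rely on accuracy as the single metric: there is one correct answer, and the model scores only on an exact match. To make grading easier, free-response questions are often rewritten as multiple-choice questions or made to require numerical substitution. This severely compresses the information in the answer. Supplying too many conditions may also let a model "guess the derivation from the options", and numeric answers leave untested its ability to express general relations as analytic expressions. On hard samples, 0/1 scoring also drives every model's score to zero, so differences in strength cannot show.
EED Score (Expression-tree Edit Distance) offers a scheme closer to how humans grade. It parses mathematical expressions into expression trees and computes the edit distance between the model's answer and the reference answer: the closer the tree structures, the higher the score. The mechanism yields continuous, fine-grained scores, discriminates between models on more problems, and markedly improves statistical power.
Experiments show that 500 problems scored with EED Score have discriminative power equivalent to 1,500 problems scored with 0/1 accuracy. Figure 1 above compares three different answers to the same problem under accuracy and EED Score: the former can only give a coarse "all wrong / all right" verdict, while the latter quantifies the "distance" between the model's answer and the correct one.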
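To make the idea of an expression-tree edit distance concrete, here is a minimal sketch. It is not the PHYBench implementation: it assumes Python's `ast` module as the parser, unit costs for insert, delete, and relabel, and a simple recursive ordered-forest distance. The names `to_tree`, `forest_distance`, and `similarity` are hypothetical, and the normalization in `similarity` is an illustrative choice, not the paper's scoring rule.

```python
import ast
from functools import lru_cache


def to_tree(expr: str):
    """Parse an arithmetic expression into a (label, children) tuple tree."""
    def build(node):
        if isinstance(node, ast.BinOp):
            return (type(node.op).__name__, (build(node.left), build(node.right)))
        if isinstance(node, ast.UnaryOp):
            return (type(node.op).__name__, (build(node.operand),))
        if isinstance(node, ast.Call):
            return (ast.unparse(node.func), tuple(build(a) for a in node.args))
        if isinstance(node, ast.Name):
            return (node.id, ())
        if isinstance(node, ast.Constant):
            return (repr(node.value), ())
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")
    return build(ast.parse(expr, mode="eval").body)


def size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests (tuples of trees), unit costs."""
    if not f and not g:
        return 0
    if not f:
        return sum(size(t) for t in g)   # insert everything in g
    if not g:
        return sum(size(t) for t in f)   # delete everything in f
    (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
    return min(
        # delete the root of the last tree in f (its children move up)
        forest_distance(f[:-1] + v_kids, g) + 1,
        # insert the root of the last tree in g
        forest_distance(f, g[:-1] + w_kids) + 1,
        # match the two roots, paying 1 if their labels differ
        forest_distance(v_kids, w_kids)
        + forest_distance(f[:-1], g[:-1])
        + (v_label != w_label),
    )


def tree_edit_distance(answer: str, reference: str) -> int:
    return forest_distance((to_tree(answer),), (to_tree(reference),))


def similarity(answer: str, reference: str) -> float:
    """Illustrative score in [0, 1]; an assumption, not the paper's rule."""
    d = tree_edit_distance(answer, reference)
    return max(0.0, 1.0 - d / size(to_tree(reference)))


reference = "2*m*g/(3*k)"
for answer in ["2*m*g/(3*k)", "m*g/(3*k)", "k+1"]:
    print(answer, tree_edit_distance(answer, reference), similarity(answer, reference))
```

An answer that drops only a coefficient stays close to the reference, while a structurally unrelated answer is far away. A 0/1 accuracy metric would score both as simply wrong.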
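Experimental results
The gap between frontier models and human experts
The PHYBench team recruited 81 Peking University students, each working on 8 problems within a 3-hour limit, for a "human vs. machine" contest against state-of-the-art AI models.
The results show that even the strongest model, Gemini 2.5 pro, answered only 36.9% of the problems correctly, with an EED score of 49.5%. The "human experts" won easily, with an average accuracy of 61.9% and an EED score of 70.5%. Participants in the top 25% reached 71.4% accuracy, almost twice that of the strongest AI. The gap between the other models and humans is larger still. This gap reveals the current bottleneck of LLMs in physical reasoning.
PHYBench also compares model capabilities at a fine-grained level. Strong reasoning models such as Gemini 2.5 pro and o3 still trail humans by a wide margin, but they clearly improve on the previous generation of reasoning models. Base models such as DeepSeek-V3 do not surpass mainstream reasoning models but still post impressive results. Small reasoning models such as QwQ-32B and the DeepSeek32B distilled model perform disappointingly on PHYBench, which may be attributed to insufficient physical perception.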
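Chain-of-thought error analysis: PP × RR
The PHYBench team systematically analyzed model errors and divided the reasoning process and reasoning ability into two key modules: Physical Perception (PP) and Robust Reasoning (RR):
- Physical Perception (PP): In this phase the model carries out dense textual reasoning. It must identify the physical objects, variables, and dynamical relations relevant to the problem, and judge qualitatively which physical effects matter and which can be neglected. If PP goes wrong, all subsequent reasoning goes off track. (Example 1 shows a typical PP failure.)
- Robust Reasoning (RR): In this phase the model writes out a large amount of "scratch work", simplifying expressions and solving equations step by step. Current reasoning models are still inefficient here: their scratch work is far longer than a human's, and they often make "careless mistakes". (Example 2 shows a typical RR failure.)
PP and RR alternate to form a typical chain of thought for solving physics problems.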
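Outlook
Advancing AI's physical understanding and reasoning
PHYBench's vision goes beyond "evaluation": it aims to "lead" AI in exploring the possibilities of the physical world.
The release of PHYBench provides a new benchmark for evaluating the physical perception and reasoning abilities of large language models, and it points to where future AI systems need to improve. Our realistic, complex physical scenarios are designed to elicit and verify AI's ability to understand the world and reason reliably, pushing AI systems toward truly understanding, engaging with, and changing the world.
Looking ahead, the PHYBench team will keep expanding and innovating on the dataset, with plans to include more frontier physics topics, interdisciplinary content, and even scientific puzzles that humans have not yet solved. We believe that by offering deeper and broader physics challenges, PHYBench will help drive AI toward becoming an "intelligent partner" or "super assistant" that pushes cognitive boundaries and explores the unknown.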
#DIFF Transformer
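Differential attention leads the change: DIFF Transformer tackles long-sequence modeling
In recent years, the Transformer architecture has achieved great success in natural language processing. From machine translation to text generation, its modeling power has brought major breakthroughs in language understanding and generation.
However, as models grow and application scenarios become more complex, the conventional Transformer architecture has shown its weaknesses. In tasks such as long-text processing, key information retrieval, and hallucination mitigation in particular, Transformer often over-attends to irrelevant context, which limits model performance.
To address this, a research team from Microsoft and Tsinghua proposed DIFF Transformer, a new foundation model architecture based on a differential attention mechanism.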
- Paper title: Differential Transformer
- Paper link: https://openreview.net/pdf?id=OvoCm1gGhN
- Code link: https://aka.ms/Diff-Transformer
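Its core idea is to compute the difference between two softmax attention maps, which amplifies attention to key context while cancelling attention noise. DIFF Transformer has the following notable advantages:
- In language modeling, DIFF Transformer scales well in model size and number of training tokens: it needs only about 65% of the model size or training tokens to match the performance of a conventional Transformer, substantially improving general language model performance.
- Across long-text modeling, key information retrieval, mathematical reasoning, hallucination mitigation, in-context learning, and activation quantization, DIFF Transformer shows distinct advantages and clear gains over the conventional Transformer.
These properties give DIFF Transformer broad application prospects in natural language processing, and it may become a new driver of language model development. In addition, follow-up work has preliminarily validated the method in vision and multimodal settings, suggesting cross-modal generality. The paper has been accepted to ICLR 2025 and selected as an Oral paper (selection rate 1.8%).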
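Method
The paper proposes a foundation model architecture called Differential Transformer (DIFF Transformer), which aims to solve the conventional Transformer's over-allocation of attention to irrelevant context in long-text modeling. The method uses differential attention to amplify attention to key context while cancelling attention noise, which significantly improves performance across a range of tasks.
Differential attention mechanism
The conventional Transformer attention mechanism weights the tokens in the input sequence through the softmax function, but the nature of softmax makes it hard for the model to fully eliminate the influence of irrelevant context: every softmax weight is strictly positive, so no token can receive exactly zero attention. To overcome this, DIFF Transformer introduces differential attention.
Specifically, the mechanism splits the query and key vectors into two groups along the attention-head dimension, computes a softmax attention map for each group, and takes the difference of the two as the final attention scores. The design is analogous to differential amplifiers in electrical engineering and to noise-cancelling headphones, which subtract two signals to remove their common noise.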
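Differential attention is expressed mathematically as follows:

$$[Q_1; Q_2] = XW^Q,\quad [K_1; K_2] = XW^K,\quad V = XW^V$$

$$\mathrm{DiffAttn}(X) = \left(\mathrm{softmax}\!\left(\frac{Q_1 K_1^{\top}}{\sqrt{d}}\right) - \lambda\,\mathrm{softmax}\!\left(\frac{Q_2 K_2^{\top}}{\sqrt{d}}\right)\right)V$$

where $Q_1, Q_2$ and $K_1, K_2$ are the two groups of query and key vectors, $V$ is the value vector, and $\lambda$ is a learnable scalar parameter that adjusts the weights of the two attention maps. The computation is illustrated in Figure 1.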
Figure 1. Illustration and pseudocode of the differential attention mechanism
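The subtraction of two softmax attention maps described above can be sketched in a few lines. This is a minimal single-head illustration in pure Python, assuming small list-of-lists matrices, no causal mask, and a fixed scalar `lam`. The helper names `attention_map` and `diff_attention` are hypothetical, and this is not the authors' released code.

```python
import math


def softmax(row):
    m = max(row)                                   # subtract max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an n x m matrix by an m x p matrix (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_map(q, k):
    """softmax(Q K^T / sqrt(d)), computed row by row."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]           # transpose of K
    scores = matmul(q, k_t)
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]


def diff_attention(q1, k1, q2, k2, v, lam):
    """(softmax(Q1 K1^T / sqrt(d)) - lam * softmax(Q2 K2^T / sqrt(d))) V."""
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    return matmul(diff, v)


q = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
# If both groups produce the same map and lam = 1, the shared component cancels.
print(diff_attention(q, q, q, q, v, lam=1.0))
```

With `lam=0.0` the function reduces to ordinary softmax attention. With identical maps and `lam=1.0` the output is all zeros, which is the "common-mode noise cancels" intuition behind the differential amplifier analogy.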
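To synchronize the learning dynamics, $\lambda$ is re-parameterized as:

$$\lambda = \exp(\lambda_{q_1} \cdot \lambda_{k_1}) - \exp(\lambda_{q_2} \cdot \lambda_{k_2}) + \lambda_{\mathrm{init}}$$

where $\lambda_{q_1}, \lambda_{k_1}, \lambda_{q_2}, \lambda_{k_2}$ are learnable vectors and $\lambda_{\mathrm{init}}$ is a constant used for initialization.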
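As a small illustration of this re-parameterization, the sketch below computes the scalar from four vectors and an initialization constant, using the form given in the Differential Transformer paper (the difference of two exponentiated dot products plus the constant). The function name `reparam_lambda` and the example values are hypothetical.

```python
import math


def reparam_lambda(lq1, lk1, lq2, lk2, lambda_init):
    """lambda = exp(lq1 . lk1) - exp(lq2 . lk2) + lambda_init."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return math.exp(dot(lq1, lk1)) - math.exp(dot(lq2, lk2)) + lambda_init


zeros = [0.0, 0.0, 0.0]
# With all learnable vectors at zero the two exponentials cancel,
# so the scalar starts exactly at the initialization constant.
print(reparam_lambda(zeros, zeros, zeros, zeros, lambda_init=0.8))
```

The value 0.8 here is only an example input, not a setting taken from the article.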
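Multi-head differential attention
To further increase expressive power, DIFF Transformer uses a multi-head mechanism. Each attention head computes differential attention independently, and the head outputs are concatenated into the final result. Concretely:

$$\mathrm{head}_i = \mathrm{DiffAttn}(X; W_i^Q, W_i^K, W_i^V, \lambda)$$

$$\overline{\mathrm{head}_i} = (1 - \lambda_{\mathrm{init}}) \cdot \mathrm{LN}(\mathrm{head}_i)$$

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\overline{\mathrm{head}_1}, \dots, \overline{\mathrm{head}_h})\, W^O$$

where $h$ is the number of attention heads and $W^O$ is the output projection matrix. To keep gradients consistent with Transformer, DIFF Transformer applies an independent normalization layer $\mathrm{LN}(\cdot)$ to each head's output, implemented with RMSNorm and scaled by the fixed factor $(1 - \lambda_{\mathrm{init}})$.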
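The per-head normalization and concatenation step can be sketched as follows. This is a hedged illustration, not the reference implementation: it assumes RMSNorm without a learnable gain, takes the fixed `(1 - lambda_init)` scale from the Differential Transformer paper as an assumption, and treats the output projection `w_o` as optional. The names `rms_norm` and `combine_heads` are hypothetical.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Divide a vector by its root-mean-square (no learnable gain in this sketch)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def combine_heads(head_outputs, lambda_init, w_o=None):
    """Normalize each head per token, scale, concatenate, then project.

    head_outputs: list over heads of [n_tokens][d_head] matrices.
    w_o: optional [h * d_head][d_model] output projection (skipped if None).
    """
    n_tokens = len(head_outputs[0])
    rows = []
    for t in range(n_tokens):
        row = []
        for head in head_outputs:
            # each head gets its own normalization before concatenation
            row.extend((1.0 - lambda_init) * x for x in rms_norm(head[t]))
        rows.append(row)
    if w_o is None:
        return rows
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*w_o)] for row in rows]


head_a = [[3.0, 4.0]]   # one token, d_head = 2
head_b = [[1.0, 1.0]]
print(combine_heads([head_a, head_b], lambda_init=0.5))
```

Normalizing each head separately keeps heads with very different output magnitudes on a comparable scale before they are mixed by the output projection.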
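Figure 2. Visualization of attention score distributions for Transformer and DIFF Transformer
Figure 2 shows the marked difference between DIFF Transformer and the conventional Transformer in how attention scores are allocated. The authors insert a piece of key information in the middle of a long stretch of irrelevant text and visualize the attention scores when the model extracts that information.
The conventional Transformer spreads its attention scores across the whole context, with only a small share on the key information. DIFF Transformer concentrates higher scores on the target answer and assigns almost no attention to irrelevant context.
This sparsity and precision in attention allocation also make DIFF Transformer significantly better than Transformer at key information retrieval in long text.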
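Experiments
The authors validate DIFF Transformer through a series of experiments covering several aspects, showing its potential and advantages for large language models.
Language modeling
The authors study how DIFF Transformer performs as model size and training data scale up, as shown in Figure 3. DIFF Transformer needs only about 65% of the parameters or training data to match Transformer's language modeling performance. For example, a 6.8B-parameter DIFF Transformer matches an 11B-parameter Transformer in language modeling loss (a parameter ratio of about 62%).
Figure 3. Scalability in model parameters and training data on language modeling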
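Long-text modeling
The authors extend the model to a 64K context length and evaluate it on long book data. Measured by cumulative average negative log-likelihood (NLL), DIFF Transformer outperforms Transformer at every sequence position and makes more effective use of long context.
Figure 4. Model performance on long book data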
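Key information retrieval
The authors use a Multi-Needle Retrieval experiment to evaluate the ability to extract key information from large contexts, as shown in Figure 5. DIFF Transformer achieves higher accuracy across context lengths and answer depths, with the advantage most pronounced when the text is long and the answer sits earlier in the text. For example, in a 64K context with the answer at 25% depth, DIFF Transformer's accuracy is 76% higher than Transformer's. In addition, statistics show that DIFF Transformer's attention allocation is more focused: it locates key information accurately and exhibits a higher signal-to-noise ratio.
Figure 5. Multi-needle retrieval evaluation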
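In-context learning
The authors evaluate in-context learning from two angles: many-shot in-context learning and robustness to example order. As shown in Figure 6, for many-shot in-context learning they use 4 datasets (TREC, TREC-fine, Banking-77, and Clinic-150) and gradually increase the number of demonstrations until the total length reaches 64K tokens. DIFF Transformer outperforms Transformer on every dataset, with a notable gain in average accuracy.
Figure 6. Many-shot in-context learning
In the robustness test, the authors evaluate performance stability by shuffling the order of the demonstrations. As shown in Figure 7, the performance variance of DIFF Transformer across different orderings is significantly lower than that of Transformer, indicating lower sensitivity to input order and stronger robustness.
Figure 7. Robustness to example order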
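Hallucination evaluation
The authors use text summarization and question answering as two representative settings to evaluate how well DIFF Transformer reduces hallucination in large models. As shown in Figure 8, DIFF Transformer significantly improves accuracy and reduces hallucination when generating summaries and answering questions. This is because differential attention can locate the important passages accurately and avoids interference from irrelevant context in the model's predictions.
Figure 8. Hallucination evaluation on text summarization and question answering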
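Activation outlier analysis
The authors also find that DIFF Transformer significantly reduces outliers in model activations, which opens new possibilities for activation quantization. The maximum activation values in attention logits and hidden states are significantly lower for DIFF Transformer than for Transformer. For example, the Top-1 activation value of the attention logits is nearly 8 times lower in DIFF Transformer. Thanks to this property, DIFF Transformer also outperforms Transformer under low-bit quantization of attention logits, as shown in Figure 9.
Figure 9. Low-bit quantization of attention logits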
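Why smaller activation outliers help low-bit quantization can be seen in a toy example. The sketch below assumes symmetric absmax quantization, which is a common scheme but not necessarily the one used in the paper's experiments. The values and the names `absmax_quantize` and `mse` are made up for illustration.

```python
def absmax_quantize(values, bits):
    """Symmetric absmax quantization to signed integers, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                     # e.g. 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax     # one scale for the whole tensor
    return [round(v / scale) * scale for v in values]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


base = [0.1 * i for i in range(-10, 11)]   # well-behaved activations in [-1, 1]
with_outlier = base + [10.0]               # same values plus one large outlier

err_clean = mse(base, absmax_quantize(base, bits=4))
# Quantize with the outlier present, then measure error on the original values only.
err_outlier = mse(base, absmax_quantize(with_outlier, bits=4)[:-1])

print(f"error without outlier: {err_clean:.5f}")
print(f"error with outlier:    {err_outlier:.5f}")
```

A single large value stretches the quantization scale, so most of the ordinary activations collapse onto a few levels. Lower maximum activations leave more of the low-bit range for the values that matter.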
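Mathematical reasoning
The authors further validate DIFF Transformer on mathematical reasoning. They use two-stage training: supervised fine-tuning on top of a 3B pretrained model, with evaluation on 8 math datasets including MATH. In the first stage, the model is fine-tuned on 20B tokens of synthetic math data to acquire basic mathematical ability, with results shown in Figure 10. From 15B tokens onward, DIFF Transformer shows markedly stronger mathematical ability than Transformer, and by the end at 20B tokens the accuracy gap reaches about 11%.
Figure 10. Stage one: fine-tuning on synthetic math data
In the second stage, the authors distill the model using OpenThoughts-114K-Math, a dataset constructed from Deepseek-R1 outputs, to give the model stronger deep reasoning ability. As shown in Figure 11, DIFF Transformer improves over Transformer to varying degrees on all 8 datasets, with average accuracy up by 7.5%, which suggests that the stronger context modeling of differential attention also matters for reasoning tasks.
Figure 11. Stage two: evaluation of deep reasoning ability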
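Discussion and future work
DIFF Transformer has attracted considerable attention and discussion since its release. The authors have held in-depth discussions with the community on the Hugging Face paper discussion platform and on alphaXiv. On X (formerly Twitter), Petar Veličković, Senior Staff Research Scientist at Google DeepMind, discussed the theoretical analysis in the paper with the authors, and Lucas Beyer, a core author of ViT, wrote an in-depth summary after reading the paper. The related posts have received hundreds of thousands of views. DIFF Transformer has also been integrated into Hugging Face's transformers library.
- Hugging Face: https://huggingface.co/papers/2410.05258
- alphaXiv: https://www.alphaxiv.org/abs/2410.05258v1
- Petar Veličković: https://x.com/PetarV_93/status/1874820028975267866
- Lucas Beyer: https://x.com/giffmana/status/1873869654252544079
- transformers library: https://github.com/huggingface/transformers/tree/main/src/transformers/models/diffllama
As for future work, the authors suggest using the properties of DIFF Transformer to design low-bit attention kernels, and using the sparsity of differential attention to prune the key-value cache. Applying DIFF Transformer to modalities other than language is also worth exploring. The recent work DiffCLIP extends differential attention to vision and multimodal settings, revealing more structural properties and application potential of DIFF Transformer across modalities.
- DiffCLIP: https://arxiv.org/abs/2503.06626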
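Summary
The contributions of this work are twofold:
(1) Through its differential attention mechanism, DIFF Transformer addresses the noise interference and inaccurate attention allocation that the conventional Transformer suffers from when processing text.
(2) By focusing on key information and resisting noise, DIFF Transformer performs well in language modeling, long-text modeling, key information retrieval, mathematical reasoning, hallucination mitigation, in-context learning, and activation quantization, and it is a promising foundation model architecture for natural language processing, multimodal applications, and other fields.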
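#LLM Engineer Toolkit
A complete guide to 120+ LLM libraries!
A collection of more than 120 libraries for large language model (LLM) developers, organized into 14 categories such as training, inference, and application development, and covering tools from data extraction to safety evaluation, to help developers filter and use resources efficiently.
With LLMs developing rapidly, developers face a flood of resources and tools. Filtering and using them efficiently has become a key task for every LLM developer. The GitHub repository introduced here, LLM Engineer Toolkit, may turn out to be a handy helper.
https://github.com/KalyanKS-NLP/llm-engineer-toolkit
Created by KalyanKS-NLP, this repository curates more than 120 LLM-related libraries and sorts them by category. Whether you need training, inference, application development, data extraction, or safety evaluation, you can find matching tools here.
Tool categories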
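- 🚀 LLM Training: tools for LLM training and fine-tuning that help you optimize models faster and more efficiently.
- 🧱 LLM Application Development: frameworks, multi-API access, caching, and low-code development, covering the full range of application development needs.
- 🩸 LLM RAG: libraries for Retrieval-Augmented Generation that improve a model's knowledge retrieval.
- 🟩 LLM Inference: tools for inference acceleration and optimization, so models run more smoothly.
- 🚧 LLM Serving: solutions for model deployment and inference serving.
- 📤 LLM Data Extraction: data extraction tools that help you obtain high-quality data from a variety of sources.
- 🌠 LLM Data Generation: synthetic data generation to enrich your training set.
- 💎 LLM Agents: tools for building agents, automating tasks, and multi-agent collaboration.
- LLM Evaluation: evaluation tools that check model performance meets expectations.
- 🔍 LLM Monitoring: tools for monitoring models in operation so problems are found and fixed promptly.
- 📅 LLM Prompts: tools for optimizing and managing prompts to improve output quality.
- 📝 LLM Structured Outputs: tools for generating structured outputs, so model results are easier to use.
- 🛑 LLM Safety and Security: tools for keeping models safe and reliable.
- 💠 LLM Embedding Models: state-of-the-art text embedding models.
- Others: other practical tools covering further development scenarios.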
LLM Training and Fine-Tuning
Library | Description |
unsloth | Fine-tune LLMs faster with less memory. |
PEFT | State-of-the-art Parameter-Efficient Fine-Tuning library. |
TRL | Train transformer language models with reinforcement learning. |
Transformers | Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. |
Axolotl | Tool designed to streamline post-training for various AI models. |
LLMBox | A comprehensive library for implementing LLMs, including a unified training pipeline and comprehensive model evaluation. |
LitGPT | Train and fine-tune LLM lightning fast. |
Mergoo | A library for easily merging multiple LLM experts, and efficiently train the merged LLM. |
Llama-Factory | Easy and efficient LLM fine-tuning. |
Ludwig | Low-code framework for building custom LLMs, neural networks, and other AI models. |
Txtinstruct | A framework for training instruction-tuned models. |
Lamini | An integrated LLM inference and tuning platform. |
XTuring | xTuring provides fast, efficient and simple fine-tuning of open-source LLMs, such as Mistral, LLaMA, GPT-J, and more. |
RL4LMs | A modular RL library to fine-tune language models to human preferences. |
DeepSpeed | DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective. |
torchtune | A PyTorch-native library specifically designed for fine-tuning LLMs. |
PyTorch Lightning | A library that offers a high-level interface for pretraining and fine-tuning LLMs. |
LLM Application Development
Frameworks
Library | Description |
LangChain | LangChain is a framework for developing applications powered by large language models (LLMs). |
Llama Index | LlamaIndex is a data framework for your LLM applications. |
HayStack | Haystack is an end-to-end LLM framework that allows you to build applications powered by LLMs, Transformer models, vector search and more. |
Prompt flow | A suite of development tools designed to streamline the end-to-end development cycle of LLM-based AI applications. |
Griptape | A modular Python framework for building AI-powered applications. |
Weave | Weave is a toolkit for developing Generative AI applications. |
Llama Stack | Build Llama Apps. |
Data Preparation
Library | Description |
Data Prep Kit | Data Prep Kit accelerates unstructured data preparation for LLM app developers. Developers can use Data Prep Kit to cleanse, transform, and enrich use case-specific unstructured data to pre-train LLMs, fine-tune LLMs, instruct-tune LLMs, or build RAG applications. |
Multi API Access
Library | Description |
LiteLLM | Library to call 100+ LLM APIs in OpenAI format. |
AI Gateway | A Blazing Fast AI Gateway with integrated Guardrails. Route to 200+ LLMs, 50+ AI Guardrails with 1 fast & friendly API. |
Routers
Library | Description |
RouteLLM | Framework for serving and evaluating LLM routers - save LLM costs without compromising quality. Drop-in replacement for OpenAI's client to route simpler queries to cheaper models. |
Memory
Library | Description |
mem0 | The Memory layer for your AI apps. |
Memoripy | An AI memory layer with short- and long-term storage, semantic clustering, and optional memory decay for context-aware applications. |
Letta (MemGPT) | An open-source framework for building stateful LLM applications with advanced reasoning capabilities and transparent long-term memory |
Memobase | A user profile-based memory system designed to bring long-term user memory to your Generative AI applications. |
Interface
Library | Description |
Streamlit | A faster way to build and share data apps. Streamlit lets you transform Python scripts into interactive web apps in minutes |
Gradio | Build and share delightful machine learning apps, all in Python. |
AI SDK UI | Build chat and generative user interfaces. |
AI-Gradio | Create AI apps powered by various AI providers. |
Simpleaichat | Python package for easily interfacing with chat apps, with robust features and minimal code complexity. |
Chainlit | Build production-ready Conversational AI applications in minutes. |
Low Code
Library | Description |
LangFlow | LangFlow is a low-code app builder for RAG and multi-agent AI applications. It’s Python-based and agnostic to any model, API, or database. |
Cache
Library | Description |
GPTCache | A Library for Creating Semantic Cache for LLM Queries. Slash Your LLM API Costs by 10x 💰, Boost Speed by 100x. Fully integrated with LangChain and LlamaIndex. |
LLM RAG
Library | Description |
FastGraph RAG | Streamlined and promptable Fast GraphRAG framework designed for interpretable, high-precision, agent-driven retrieval workflows. |
Chonkie | RAG chunking library that is lightweight, lightning-fast, and easy to use. |
RAGChecker | A Fine-grained Framework For Diagnosing RAG. |
RAG to Riches | Build, scale, and deploy state-of-the-art Retrieval-Augmented Generation applications. |
BeyondLLM | Beyond LLM offers an all-in-one toolkit for experimentation, evaluation, and deployment of Retrieval-Augmented Generation (RAG) systems. |
SQLite-Vec | A vector search SQLite extension that runs anywhere! |
fastRAG | fastRAG is a research framework for efficient and optimized retrieval-augmented generative pipelines, incorporating state-of-the-art LLMs and Information Retrieval. |
FlashRAG | A Python Toolkit for Efficient RAG Research. |
Llmware | Unified framework for building enterprise RAG pipelines with small, specialized models. |
Rerankers | A lightweight unified API for various reranking models. |
Vectara | Build Agentic RAG applications. |
LLM Inference
Library | Description |
LLM Compressor | Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment. |
LightLLM | Python-based LLM inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance. |
vLLM | High-throughput and memory-efficient inference and serving engine for LLMs. |
torchchat | Run PyTorch LLMs locally on servers, desktop, and mobile. |
TensorRT-LLM | TensorRT-LLM is a library for optimizing Large Language Model (LLM) inference. |
WebLLM | High-performance In-browser LLM Inference Engine. |
LLM Serving
Library | Description |
Langcorn | Serving LangChain LLM apps and agents automagically with FastAPI. |
LitServe | Lightning-fast serving engine for any AI model of any size. It augments FastAPI with features like batching, streaming, and GPU autoscaling. |
LLM Data Extraction
Library | Description |
Crawl4AI | Open-source LLM Friendly Web Crawler & Scraper. |
ScrapeGraphAI | A web scraping Python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). |
Docling | Docling parses documents and exports them to the desired format with ease and speed. |
Llama Parse | GenAI-native document parser that can parse complex document data for any downstream LLM use case (RAG, agents). |
PyMuPDF4LLM | PyMuPDF4LLM library makes it easier to extract PDF content in the format you need for LLM & RAG environments. |
Crawlee | A web scraping and browser automation library. |
MegaParse | Parser for every type of document. |
ExtractThinker | Document Intelligence library for LLMs. |
LLM Data Generation
Library | Description |
DataDreamer | DataDreamer is a powerful open-source Python library for prompting, synthetic data generation, and training workflows. |
fabricator | A flexible open-source framework to generate datasets with large language models. |
Promptwright | Synthetic Dataset Generation Library. |
EasyInstruct | An Easy-to-use Instruction Processing Framework for Large Language Models. |
LLM Agents
Library | Description |
CrewAI | Framework for orchestrating role-playing, autonomous AI agents. |
LangGraph | Build resilient language agents as graphs. |
Agno | Build AI Agents with memory, knowledge, tools, and reasoning. Chat with them using a beautiful Agent UI. |
Agents SDK | Build agentic apps using LLMs with context, tools, hand off to other specialized agents. |
AutoGen | An open-source framework for building AI agent systems. |
Smolagents | Library to build powerful agents in a few lines of code. |
Pydantic AI | Python agent framework to build production grade applications with Generative AI. |
BeeAI | Build production-ready multi-agent systems in Python. |
gradio-tools | A Python library for converting Gradio apps into tools that can be leveraged by an LLM-based agent to complete its task. |
Composio | Production Ready Toolset for AI Agents. |
Atomic Agents | Building AI agents, atomically. |
Memary | Open Source Memory Layer For Autonomous Agents. |
Browser Use | Make websites accessible for AI agents. |
OpenWebAgent | An Open Toolkit to Enable Web Agents on Large Language Models. |
Lagent | A lightweight framework for building LLM-based agents. |
LazyLLM | A Low-code Development Tool For Building Multi-agent LLMs Applications. |
Swarms | The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework. |
ChatArena | ChatArena is a library that provides multi-agent language game environments and facilitates research about autonomous LLM agents and their social interactions. |
Swarm | Educational framework exploring ergonomic, lightweight multi-agent orchestration. |
AgentStack | The fastest way to build robust AI agents. |
Archgw | Intelligent gateway for Agents. |
Flow | A lightweight task engine for building AI agents. |
AgentOps | Python SDK for AI agent monitoring. |
Langroid | Multi-Agent framework. |
Agentarium | Framework for creating and managing simulations populated with AI-powered agents. |
Upsonic | Reliable AI agent framework that supports MCP. |
LLM Evaluation
Library | Description |
Ragas | Ragas is your ultimate toolkit for evaluating and optimizing Large Language Model (LLM) applications. |
Giskard | Open-Source Evaluation & Testing for ML & LLM systems. |
DeepEval | LLM Evaluation Framework |
Lighteval | All-in-one toolkit for evaluating LLMs. |
Trulens | Evaluation and Tracking for LLM Experiments |
PromptBench | A unified evaluation framework for large language models. |
LangTest | Deliver Safe & Effective Language Models. 60+ Test Types for Comparing LLM & NLP Models on Accuracy, Bias, Fairness, Robustness & More. |
EvalPlus | A rigorous evaluation framework for LLM4Code. |
FastChat | An open platform for training, serving, and evaluating large language model-based chatbots. |
judges | A small library of LLM judges. |
Evals | Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks. |
AgentEvals | Evaluators and utilities for evaluating the performance of your agents. |
LLMBox | A comprehensive library for implementing LLMs, including a unified training pipeline and comprehensive model evaluation. |
Opik | An open-source end-to-end LLM Development Platform which also includes LLM evaluation. |
LLM Monitoring
Library | Description |
MLflow | An open-source end-to-end MLOps/LLMOps Platform for tracking, evaluating, and monitoring LLM applications. |
Opik | An open-source end-to-end LLM Development Platform which also includes LLM monitoring. |
LangSmith | Provides tools for logging, monitoring, and improving your LLM applications. |
Weights & Biases (W&B) | W&B provides features for tracking LLM performance. |
Helicone | Open source LLM-Observability Platform for Developers. One-line integration for monitoring, metrics, evals, agent tracing, prompt management, playground, etc. |
Evidently | An open-source ML and LLM observability framework. |
Phoenix | An open-source AI observability platform designed for experimentation, evaluation, and troubleshooting. |
Observers | A Lightweight Library for AI Observability. |
LLM Prompts
Library | Description |
PCToolkit | A Unified Plug-and-Play Prompt Compression Toolkit of Large Language Models. |
Selective Context | Selective Context compresses your prompt and context to allow LLMs (such as ChatGPT) to process 2x more content. |
LLMLingua | Library for compressing prompts to accelerate LLM inference. |
betterprompt | Test suite for LLM prompts before pushing them to production. |
Promptify | Solve NLP Problems with LLMs & easily generate different NLP Task prompts for popular generative models like GPT, PaLM, and more with Promptify. |
PromptSource | PromptSource is a toolkit for creating, sharing, and using natural language prompts. |
DSPy | DSPy is the open-source framework for programming—rather than prompting—language models. |
Py-priompt | Prompt design library. |
Promptimizer | Prompt optimization library. |
LLM Structured Outputs
Library | Description |
Instructor | Python library for working with structured outputs from large language models (LLMs). Built on top of Pydantic, it provides a simple, transparent, and user-friendly API. |
XGrammar | An open-source library for efficient, flexible, and portable structured generation. |
Outlines | Robust (structured) text generation |
Guidance | Guidance is an efficient programming paradigm for steering language models. |
LMQL | A language for constraint-guided and efficient LLM programming. |
Jsonformer | A Bulletproof Way to Generate Structured JSON from Language Models. |
LLM Safety and Security
Library | Description |
JailbreakEval | A collection of automated evaluators for assessing jailbreak attempts. |
EasyJailbreak | An easy-to-use Python framework to generate adversarial jailbreak prompts. |
Guardrails | Adding guardrails to large language models. |
LLM Guard | The Security Toolkit for LLM Interactions. |
AuditNLG | AuditNLG is an open-source library that can help reduce the risks associated with using generative AI systems for language. |
NeMo Guardrails | NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. |
Garak | LLM vulnerability scanner |
DeepTeam | The LLM Red Teaming Framework |
LLM Embedding Models
Library | Description |
Sentence-Transformers | State-of-the-Art Text Embeddings |
Model2Vec | Fast State-of-the-Art Static Embeddings |
Text Embedding Inference | A blazing fast inference solution for text embeddings models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. |
Others
Library | Description |
Text Machina | A modular and extensible Python framework, designed to aid in the creation of high-quality, unbiased datasets to build robust models for MGT-related tasks such as detection, attribution, and boundary detection. |
LLM Reasoners | A library for advanced large language model reasoning. |
EasyEdit | An Easy-to-use Knowledge Editing Framework for Large Language Models. |
CodeTF | CodeTF: One-stop Transformer Library for State-of-the-art Code LLM. |
spacy-llm | This package integrates Large Language Models (LLMs) into spaCy, featuring a modular system for fast prototyping and prompting, and turning unstructured responses into robust outputs for various NLP tasks. |
pandas-ai | Chat with your database (SQL, CSV, pandas, polars, MongoDB, NoSQL, etc.). |
LLM Transparency Tool | An open-source interactive toolkit for analyzing internal workings of Transformer-based language models. |
Vanna | Chat with your SQL database. Accurate Text-to-SQL Generation via LLMs using RAG. |
mergekit | Tools for merging pretrained large language models. |
MarkLLM | An Open-Source Toolkit for LLM Watermarking. |
LLMSanitize | An open-source library for contamination detection in NLP datasets and Large Language Models (LLMs). |
Annotateai | Automatically annotate papers using LLMs. |
LLM Reasoner | Make any LLM think like OpenAI o1 and DeepSeek R1. |