51c Large Models ~ Collection 122

My own original post: https://blog.51cto.com/whaosoft/13877107

#PHYBench

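200 collaborators from Peking University's School of Physics, more than 50 of them gold medalists! PHYBench: can large models really understand physics?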

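This project was coordinated and supervised by Zhu Huaxing (朱華星) and Associate Dean Cao Qinghong (曹慶宏) of the School of Physics at Peking University. Benchmark design, project management, and data integration were carried out mainly by a core student team whose members include Qiu Shi (仇是), Guo Shaoyang (郭紹陽), Song Zhuoyang (宋卓洋), Sun Yunbo (孫韞博), Cai Zeyu (蔡則宇), Wei Jiashen (衛家燊), and Luo Tianyu (羅天宇). The project also received strong support from Academician Luo Minxing (羅民興) of the Peking University Computer Center and Zhang Muhan (張牧涵) of the Institute for Artificial Intelligence.

PHYBench brought together more than 200 students from the School of Physics and sister departments, who shared the work of writing problems, reviewing them, and taking the human baseline test. The participants include at least 50 gold medalists of the Chinese Physics Olympiad, as well as gold medalists of the Asian Physics Olympiad and the International Physics Olympiad. This large-scale, high-quality collaboration showcases the students' academic depth and organizational ability, and it underpins the quality of PHYBench.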

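As large language models (LLMs) advance rapidly, reasoning ability has become almost synonymous with model capability. Frontier models such as OpenAI's o series and DeepSeek R1 have been released in quick succession. Aided by reinforcement learning, they keep setting new records on scientific benchmarks and are even claimed to "surpass human experts".

However, as the arms race between model capability and benchmarks heats up, more and more benchmarks have had to turn to obscure knowledge or abstract math-competition problems. Such problems can still "separate" models, but they drift away from real-world scenarios and may not reflect how models actually perform.

Recently, the School of Physics at Peking University, together with the Institute for Artificial Intelligence and several other departments, released a new benchmark, PHYBench. It contains 500 carefully designed, high-quality physics problems (see Figure 1) ranging in difficulty from high-school physics through undergraduate physics to Physics Olympiad problems. The problems are grounded in real physical scenarios and are not abstract for humans, yet the whole field of large models performs poorly on them. The models' chains of thought on these problems also expose weaknesses in perception and reasoning.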

Paper: https://arxiv.org/abs/2504.16074
Project page: https://phybench-official.github.io/phybench-demo/
Dataset: https://huggingface.co/datasets/Eureka-Lab/PHYBench

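Could physics be the discipline best suited to testing AI reasoning? PHYBench offers a new tool and a new perspective for evaluating the reasoning that actually works in large models.

Figure 1: A sample problem and the two evaluation metrics: expression tree edit distance and accuracy.

Table 1: Comparison with existing benchmarks. Among high-difficulty datasets, PHYBench is relatively large, and it introduces a new scoring metric: expression tree edit distance.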

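Innovations in evaluation methodology

Expression tree edit distance (EED Score)

Traditional benchmarks usually rely on accuracy as the single metric: there is one correct answer, and a model scores only on an exact match. To simplify grading, open-ended questions are often rewritten as multiple-choice questions or made to require numerical substitution. This severely compresses the information carried by the answer. Supplying so many conditions can also let the model "guess the derivation from the options", and it leaves untested the ability to express general relations in closed form. On very hard samples, 0/1 scoring also drives every model's score to zero, so differences in strength cannot show.

The EED Score (Expression-tree Edit Distance) is closer to how a human grader works. It parses mathematical expressions into expression trees and computes the edit distance between the model's answer and the reference answer: the closer the tree structures, the higher the score. The result is a continuous, fine-grained score that discriminates on more problems and markedly improves statistical power.

Experiments show that 500 problems scored with EED discriminate between models as well as 1,500 problems scored with 0/1 accuracy. Figure 1 above compares three different answers to the same problem under accuracy and under the EED Score: the former gives only a coarse "all wrong / all right" verdict, while the latter quantifies the "distance" between the model's solution and the correct answer.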
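To make the idea of scoring by tree distance concrete, here is a minimal sketch in plain Python. It is not the PHYBench implementation: the nested-tuple tree format, the simplified top-down edit distance, and the linear mapping from distance to score are all assumptions chosen for illustration.

```python
from functools import lru_cache

# Hypothetical tree format: (label, child, child, ...); a leaf is a 1-tuple.

def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1:])

@lru_cache(maxsize=None)
def tree_distance(a, b):
    """Simplified top-down edit distance between two ordered trees.

    Relabeling a node costs 1; inserting or deleting a whole subtree
    costs its node count. Children are aligned with a sequence-edit DP.
    """
    cost = 0 if a[0] == b[0] else 1
    ca, cb = a[1:], b[1:]
    # dp[i][j]: cost of aligning the first i children of a with the first j of b
    dp = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    for i in range(1, len(ca) + 1):
        dp[i][0] = dp[i - 1][0] + size(ca[i - 1])
    for j in range(1, len(cb) + 1):
        dp[0][j] = dp[0][j - 1] + size(cb[j - 1])
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(ca[i - 1]),                       # delete subtree
                dp[i][j - 1] + size(cb[j - 1]),                       # insert subtree
                dp[i - 1][j - 1] + tree_distance(ca[i - 1], cb[j - 1]),  # match
            )
    return cost + dp[-1][-1]

def similarity_score(answer, reference):
    """Assumed linear score in [0, 1]: 1 minus distance relative to reference size."""
    return max(0.0, 1.0 - tree_distance(answer, reference) / size(reference))

# Reference answer (1/2) * m * v^2, and a model answer that drops the factor 1/2.
reference = ("*", ("/", ("1",), ("2",)), ("m",), ("^", ("v",), ("2",)))
answer = ("*", ("m",), ("^", ("v",), ("2",)))

print(tree_distance(answer, reference))     # 3
print(similarity_score(answer, reference))  # 0.625
```

Under 0/1 accuracy the answer above scores zero. Under a distance-based score it keeps partial credit, because only the three-node subtree for 1/2 is missing.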

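Experimental results

The gap between frontier models and human experts

The PHYBench team recruited 81 Peking University students, each solving 8 problems within a 3-hour limit, for a "human versus machine" contest against state-of-the-art AI models.

The results show that even the strongest model, Gemini 2.5 Pro, answered only 36.9% of the problems correctly, with an EED score of 49.5%. The "human experts" won comfortably, with an average accuracy of 61.9% and an EED score of 70.5%. The top 25% of participants reached 71.4% accuracy, almost twice that of the strongest AI. The gap between the other models and humans is larger still. This gap reveals the current bottleneck of LLMs in physical reasoning.

PHYBench also compares model capabilities at a finer grain. Strong reasoning models such as Gemini 2.5 Pro and o3 still trail humans by a wide margin, but they improve clearly on the previous generation of reasoning models. Base models such as DeepSeek-V3 do not surpass mainstream reasoning models, yet they post impressive results. Small reasoning models such as QwQ-32B and the DeepSeek 32B distilled model perform disappointingly on PHYBench, which may be due to weak physical perception.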

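Chain-of-thought error analysis: PP × RR

The PHYBench team systematically analyzed the models' errors and divided the reasoning process and reasoning ability into two key modules, Physical Perception (PP) and Robust Reasoning (RR):

Physical Perception (PP): In this phase the model reasons densely in text. It must identify the physical objects, variables, and dynamical relations relevant to the problem and judge qualitatively which physical effects matter and which are negligible. If PP goes wrong, all subsequent reasoning goes off track. (Example 1 shows a typical PP failure.)
Robust Reasoning (RR): In this phase the model writes out a large amount of "scratch work", simplifying expressions and solving equations step by step. Current reasoning models are still inefficient here: their scratch work is far longer than a human's, and they often make "careless mistakes". (Example 2 shows a typical RR failure.)

PP and RR alternate to form a typical chain of thought for solving a physics problem.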

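Outlook

Advancing AI's physical understanding and reasoning

PHYBench aims to do more than "evaluate": it aims to "lead" AI in exploring the possibilities of the physical world.

The release of PHYBench not only provides a new, authoritative benchmark for assessing LLMs' physical perception and reasoning, but also points to where future AI systems need the most work. Our carefully designed, realistic, and complex physical scenarios are meant to elicit and verify AI's ability to understand the world and reason reliably about it, pushing AI systems toward truly understanding, integrating into, and transforming the world.

Looking ahead, the PHYBench team will keep expanding and renewing the dataset, with plans to include more frontier physics topics, interdisciplinary content, and even scientific puzzles that humans have not yet solved. We believe that, by offering deeper and broader physics challenges, PHYBench will help drive AI toward becoming an "intelligent partner" or "super assistant" that pushes past cognitive boundaries and explores the unknown.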

#DIFF Transformer

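Differential attention leads the way: DIFF Transformer tackles long-sequence modeling

In recent years the Transformer architecture has been hugely successful in natural language processing. From machine translation to text generation, its modeling power has brought unprecedented advances in language understanding and generation.

However, as models grow and applications become more complex, the traditional Transformer shows its shortcomings. Especially in long-text processing, key information retrieval, and hallucination mitigation, it often over-attends to irrelevant context, which limits model performance.

To address this, a research team from Microsoft and Tsinghua University proposed DIFF Transformer, a new foundation model architecture based on a differential attention mechanism.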

Paper title: Differential Transformer
Paper: https://openreview.net/pdf?id=OvoCm1gGhN
Code: https://aka.ms/Diff-Transformer

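Its core idea is to take the difference between two softmax attention maps, amplifying attention to key context while canceling attention noise. DIFF Transformer has the following notable advantages:

In language modeling, DIFF Transformer scales well with model size and number of training tokens: it needs only about 65% of the model size or training tokens to match a traditional Transformer, substantially improving general language-model performance.

Across long-text modeling, key information retrieval, mathematical reasoning, hallucination mitigation, in-context learning, and activation quantization, DIFF Transformer shows distinctive advantages and clear gains over the traditional Transformer.

These properties give DIFF Transformer broad application prospects in natural language processing, and it may become a new driver of language-model development. Follow-up work has also given preliminary evidence that the method is effective in vision and multimodal settings, suggesting cross-modal generality. The paper has been accepted to ICLR 2025 and selected as an Oral (selection rate 1.8%).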

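Method

The paper proposes a foundation model architecture called Differential Transformer (DIFF Transformer), which targets the traditional Transformer's tendency to over-allocate attention to irrelevant context in long-text modeling. Its differential attention mechanism amplifies attention to key context and cancels attention noise, which markedly improves performance across a range of tasks.

Differential attention

The attention mechanism of the traditional Transformer weights the tokens of the input sequence with a softmax function, but the nature of softmax makes it hard for the model to fully eliminate the influence of irrelevant context. DIFF Transformer introduces differential attention to overcome this.

Concretely, the mechanism splits the query and key vectors into two groups along the attention-head dimension, computes a softmax attention map for each group, and uses the difference of the two maps as the final attention scores. The design resembles a differential amplifier in electrical engineering, or noise-canceling headphones: subtracting two signals removes the noise they share.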

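Differential attention is expressed mathematically as:

$$\mathrm{DiffAttn}(X) = \left(\mathrm{softmax}\!\left(\frac{Q_1 K_1^\top}{\sqrt{d}}\right) - \lambda\,\mathrm{softmax}\!\left(\frac{Q_2 K_2^\top}{\sqrt{d}}\right)\right) V$$

where $Q_1, Q_2$ and $K_1, K_2$ are the two groups of query and key vectors (each of dimension $d$), $V$ is the value vector, and $\lambda$ is a learnable scalar that balances the weight of the two attention maps. The computation is illustrated in Figure 1.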
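The subtraction of two softmax maps can be sketched for a single head in NumPy. This is a minimal illustration rather than the authors' code: the function and weight names, the tensor shapes, the fixed value of `lam`, and the omission of a causal mask are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(X, W_q, W_k, W_v, lam):
    """Single-head differential attention, without a causal mask.

    X: (n, d_model); W_q, W_k: (d_model, 2 * d); W_v: (d_model, d_v).
    Returns the output (n, d_v) and the differential attention map (n, n).
    """
    Q1, Q2 = np.split(X @ W_q, 2, axis=-1)  # two query groups, each (n, d)
    K1, K2 = np.split(X @ W_k, 2, axis=-1)  # two key groups, each (n, d)
    V = X @ W_v
    d = Q1.shape[-1]
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))
    A = A1 - lam * A2                        # shared "noise" cancels in the difference
    return A @ V, A

# Illustrative random inputs; sizes are arbitrary.
rng = np.random.default_rng(0)
n, d_model, d = 6, 16, 4
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(d_model, 2 * d))
W_k = rng.normal(size=(d_model, 2 * d))
W_v = rng.normal(size=(d_model, 2 * d))

out, A = diff_attention(X, W_q, W_k, W_v, lam=0.8)
print(out.shape)       # (6, 8)
print(A.sum(axis=-1))  # every row sums to 1 - lam
```

Unlike a plain softmax map, the differential map can contain negative entries and its rows sum to 1 − λ rather than 1. This is what lets attention mass that both maps place on irrelevant tokens cancel.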

Figure 1. Illustration and pseudocode of the differential attention mechanism

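To synchronize learning rates, $\lambda$ is reparameterized as:

$$\lambda = \exp(\lambda_{q_1} \cdot \lambda_{k_1}) - \exp(\lambda_{q_2} \cdot \lambda_{k_2}) + \lambda_{\mathrm{init}}$$

where $\lambda_{q_1}, \lambda_{k_1}, \lambda_{q_2}, \lambda_{k_2}$ are learnable vectors and $\lambda_{\mathrm{init}}$ is a constant used for initialization.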
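A short numerical sketch of the reparameterization, assuming the form λ = exp(λ_q1 · λ_k1) − exp(λ_q2 · λ_k2) + λ_init given in the Differential Transformer paper. The function name, the vector size, and the value 0.8 for the initialization constant are illustrative assumptions.

```python
import numpy as np

def reparam_lambda(lq1, lk1, lq2, lk2, lambda_init):
    """Scalar lambda from four learnable vectors and an initialization constant."""
    return float(np.exp(np.dot(lq1, lk1)) - np.exp(np.dot(lq2, lk2)) + lambda_init)

d = 4
zeros = np.zeros(d)
halves = np.full(d, 0.5)  # dot(halves, halves) == 1.0 for d == 4

# With all vectors at zero the two exponentials cancel, so lambda == lambda_init.
lam_at_zero = reparam_lambda(zeros, zeros, zeros, zeros, lambda_init=0.8)

# Moving only the first pair away from zero shifts lambda by exp(1) - 1.
lam_shifted = reparam_lambda(halves, halves, zeros, zeros, lambda_init=0.8)

print(lam_at_zero)
print(lam_shifted)
```

When the learnable vectors are near zero, λ stays near the initialization constant, which is why that constant sets the starting point of training.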

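Multi-head differential attention

To further increase expressive power, DIFF Transformer uses a multi-head mechanism. Each attention head computes differential attention independently, and the head outputs are concatenated to form the final result:

$$\mathrm{head}_i = \mathrm{DiffAttn}_i(X), \qquad \mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$

where $h$ is the number of attention heads and $W^O$ is the output projection matrix. To keep gradients consistent with the Transformer, DIFF Transformer applies an independent normalization layer, implemented with RMSNorm, to the output of each head.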
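The multi-head variant with per-head normalization might look like the following NumPy sketch. It is an illustration under stated assumptions, not the reference implementation: all names and shapes are hypothetical, RMSNorm is written without a learnable gain, λ is shared across heads as a fixed number, and any extra scaling the paper applies after normalization is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rms_norm(x, eps=1e-6):
    """RMSNorm over the last axis, without a learnable gain (simplification)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def diff_head(X, W_q, W_k, W_v, lam):
    """One differential attention head (no causal mask)."""
    Q1, Q2 = np.split(X @ W_q, 2, axis=-1)
    K1, K2 = np.split(X @ W_k, 2, axis=-1)
    d = Q1.shape[-1]
    A = softmax(Q1 @ K1.T / np.sqrt(d)) - lam * softmax(Q2 @ K2.T / np.sqrt(d))
    return A @ (X @ W_v)

def multi_head_diff_attention(X, heads, W_o, lam):
    """Normalize each head's output independently, concatenate, then project."""
    outs = [rms_norm(diff_head(X, W_q, W_k, W_v, lam)) for (W_q, W_k, W_v) in heads]
    return np.concatenate(outs, axis=-1) @ W_o

# Illustrative random weights; sizes are arbitrary.
rng = np.random.default_rng(0)
n, d_model, d, h = 5, 16, 4, 2
X = rng.normal(size=(n, d_model))
heads = [
    tuple(rng.normal(size=(d_model, 2 * d)) for _ in range(3))  # (W_q, W_k, W_v)
    for _ in range(h)
]
W_o = rng.normal(size=(h * 2 * d, d_model))

Y = multi_head_diff_attention(X, heads, W_o, lam=0.8)
print(Y.shape)  # (5, 16)
```

Normalizing each head separately keeps the scale of the head outputs comparable even though the rows of a differential attention map no longer sum to 1.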

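Figure 2. Visualization of attention score distributions in Transformer and DIFF Transformer

Figure 2 shows the marked difference in how DIFF Transformer and the traditional Transformer allocate attention scores. The authors inserted a piece of key information into the middle of a long stretch of irrelevant text and visualized how each model allocates attention scores when extracting it.

The traditional Transformer spreads its attention scores across the whole context, with only a tiny share going to the key information. DIFF Transformer concentrates much higher scores on the target answer and assigns almost no attention to irrelevant context.

This sparse and precise allocation of attention scores also makes DIFF Transformer markedly better than Transformer at key information retrieval over long text.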

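Experiments

Through a series of experiments, the authors verified DIFF Transformer's strong performance in several respects and demonstrated its distinctive potential and advantages for large language models.

Language modeling

The authors studied how DIFF Transformer performs as model size and training data scale up, as shown in Figure 3. DIFF Transformer needs only about 65% of the parameters or training data to match Transformer's language modeling performance. For example, a 6.8B-parameter DIFF Transformer matches an 11B-parameter Transformer in language modeling loss.

Figure 3. Scalability experiments on model parameters and training data for language modeling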

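Long-text modeling

The authors extended the models to a 64K context length and evaluated them on long book data. Measured by cumulative average negative log-likelihood (NLL), DIFF Transformer outperforms Transformer at every sequence position, making more effective use of long context.

Figure 4. Model performance on long book data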

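Key information retrieval

The authors used a "Multi-Needle Retrieval" experiment to evaluate how well the models extract key information from a large context, as shown in Figure 5. DIFF Transformer is more accurate across context lengths and answer depths, with a larger advantage when the text is long and when the answer lies nearer the beginning. For example, in a 64K context with the answer at 25% depth, its accuracy is 76% higher than Transformer's. Statistics also show that DIFF Transformer allocates attention scores with sharper focus, locating the key information accurately and achieving a higher signal-to-noise ratio.

Figure 5. Multi-needle retrieval evaluation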

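In-context learning

The authors evaluated DIFF Transformer's in-context learning from two angles: many-shot in-context learning and robustness to example order. As shown in Figure 6, for many-shot in-context learning they used 4 datasets (TREC, TREC-fine, Banking-77, and Clinic-150) and gradually increased the number of examples until the total length reached 64K tokens. DIFF Transformer outperforms Transformer on every dataset, with a clear gain in average accuracy.

Figure 6. Many-shot in-context learning

In the robustness test, the authors assessed performance stability by shuffling the order of the examples. As shown in Figure 7, the variance of DIFF Transformer's performance across orderings is much lower than Transformer's, indicating that it is less sensitive to input order and more robust.

Figure 7. Robustness to example order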

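Hallucination evaluation

Using text summarization and question answering as two typical settings, the authors evaluated how well DIFF Transformer reduces hallucination in large models. As shown in Figure 8, DIFF Transformer is markedly more accurate when generating summaries and answering questions, and it hallucinates less. The reason is that differential attention locates the important passages accurately and keeps irrelevant context from interfering with the model's predictions.

Figure 8. Hallucination evaluation on text summarization and question answering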

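Activation outlier analysis

The authors also found that DIFF Transformer markedly reduces outliers in model activations, which opens new possibilities for activation quantization. The maximum activation values in the attention logits and hidden states are much lower for DIFF Transformer than for Transformer. For example, the top-1 activation value in the attention logits is nearly 8 times lower. Thanks to this property, DIFF Transformer also outperforms Transformer under low-bit quantization of the attention logits, as shown in Figure 9.

Figure 9. Low-bit quantization of attention logits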

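Mathematical reasoning

The authors further validated DIFF Transformer on mathematical reasoning. They used two-stage training, applying supervised fine-tuning on top of a 3B pretrained model, and evaluated on 8 math datasets including MATH. In the first stage, the model was fine-tuned on 20B tokens of synthetic math data to acquire basic math ability; the results are shown in Figure 10. From 15B tokens onward, DIFF Transformer shows markedly stronger math ability than Transformer, and by the end of training at 20B tokens the accuracy gap reaches about 11%.

Figure 10. Stage one: fine-tuning on synthetic math data

In the second stage, the authors distilled the model on OpenThoughts-114K-Math, a dataset built from DeepSeek-R1 outputs, to give it stronger deep-reasoning ability. As shown in Figure 11, DIFF Transformer improves over Transformer to varying degrees on all 8 datasets, with average accuracy up 7.5%. This indicates that the stronger context modeling of differential attention also matters for reasoning tasks.

Figure 11. Stage two: evaluation of deep-reasoning ability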

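Discussion and future work

Since its release, DIFF Transformer has attracted considerable attention and discussion. The authors have discussed it in depth with the community on the Hugging Face paper discussion page and on alphaXiv. On X (formerly Twitter), Petar Veličković, Senior Staff Research Scientist at Google DeepMind, discussed the paper's theoretical analysis with the authors, and Lucas Beyer, a core author of ViT, wrote an in-depth summary after reading the paper. The related posts have received hundreds of thousands of views. DIFF Transformer has also been integrated into Hugging Face's transformers library.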

Hugging Face: https://huggingface.co/papers/2410.05258
alphaXiv: https://www.alphaxiv.org/abs/2410.05258v1
Petar Veličković: https://x.com/PetarV_93/status/1874820028975267866
Lucas Beyer: https://x.com/giffmana/status/1873869654252544079
transformers library: https://github.com/huggingface/transformers/tree/main/src/transformers/models/diffllama

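As for future work, the authors suggest using the properties of DIFF Transformer to design low-bit attention kernels, and using the sparsity of differential attention to prune the key-value cache. Applying DIFF Transformer to modalities other than language is also worth exploring. The recent work DiffCLIP extends differential attention to vision and multimodal settings, revealing more structural properties and application potential of DIFF Transformer across modalities.

DiffCLIP: https://arxiv.org/abs/2503.06626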

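Summary

The paper makes two main contributions:

(1) Through its differential attention mechanism, DIFF Transformer effectively addresses the noise interference and imprecise attention allocation that affect the traditional Transformer when processing text.

(2) By focusing on key information and resisting noise, DIFF Transformer performs well in language modeling, long-text modeling, key information retrieval, mathematical reasoning, hallucination mitigation, in-context learning, and activation quantization, and it is a promising foundation architecture for natural language processing and multimodal domains.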

#LLM 工程師工具箱

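A complete guide to 120+ LLM libraries!

More than 120 libraries for large language model (LLM) developers, organized into 14 categories such as training, inference, and application development, covering everything from data extraction to safety evaluation, to help developers find and use resources efficiently.

With LLMs developing rapidly, developers face an overwhelming choice of resources and tools. Filtering and using them efficiently has become a key task for every LLM developer. The GitHub repository introduced today, LLM Engineer Toolkit, may be a useful aid.

https://github.com/KalyanKS-NLP/llm-engineer-toolkit

Created by KalyanKS-NLP, the repository curates more than 120 LLM-related libraries and groups them by category. Whether you need training, inference, application development, data extraction, or safety evaluation, you can find a matching tool here.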

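Tool categories

🚀 LLM Training: tools for LLM training and fine-tuning that help you optimize models faster and more efficiently.
🧱 LLM Application Development: end-to-end support for application development, from frameworks and multi-API access to caching and low-code tools.
🩸 LLM RAG: libraries for Retrieval-Augmented Generation that improve a model's knowledge retrieval.
🟩 LLM Inference: tools for accelerating and optimizing inference so that models run more smoothly.
🚧 LLM Serving: solutions for model deployment and inference serving.
📤 LLM Data Extraction: tools for extracting high-quality data from a variety of sources.
🌠 LLM Data Generation: synthetic data generation to enrich your training set.
💎 LLM Agents: tools for building agents that automate tasks and collaborate in multi-agent setups.
LLM Evaluation: evaluation tools for checking that model performance meets expectations.
🔍 LLM Monitoring: tools for monitoring running models so that problems are found and fixed promptly.
📅 LLM Prompts: tools for optimizing and managing prompts to improve output quality.
📝 LLM Structured Outputs: tools for generating structured outputs that are easier to consume.
🛑 LLM Safety and Security: tools for keeping models safe and reliable.
💠 LLM Embedding Models: state-of-the-art text embedding models.
Others: other useful tools covering further development scenarios.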

LLM Training and Fine-Tuning

Library	Description
unsloth	Fine-tune LLMs faster with less memory.
PEFT	State-of-the-art Parameter-Efficient Fine-Tuning library.
TRL	Train transformer language models with reinforcement learning.
Transformers	Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.
Axolotl	Tool designed to streamline post-training for various AI models.
LLMBox	A comprehensive library for implementing LLMs, including a unified training pipeline and comprehensive model evaluation.
LitGPT	Train and fine-tune LLM lightning fast.
Mergoo	A library for easily merging multiple LLM experts, and efficiently train the merged LLM.
Llama-Factory	Easy and efficient LLM fine-tuning.
Ludwig	Low-code framework for building custom LLMs, neural networks, and other AI models.
Txtinstruct	A framework for training instruction-tuned models.
Lamini	An integrated LLM inference and tuning platform.
XTuring	xTuring provides fast, efficient and simple fine-tuning of open-source LLMs, such as Mistral, LLaMA, GPT-J, and more.
RL4LMs	A modular RL library to fine-tune language models to human preferences.
DeepSpeed	DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
torchtune	A PyTorch-native library specifically designed for fine-tuning LLMs.
PyTorch Lightning	A library that offers a high-level interface for pretraining and fine-tuning LLMs.

LLM Application Development

Frameworks

Library	Description
LangChain	LangChain is a framework for developing applications powered by large language models (LLMs).
Llama Index	LlamaIndex is a data framework for your LLM applications.
HayStack	Haystack is an end-to-end LLM framework that allows you to build applications powered by LLMs, Transformer models, vector search and more.
Prompt flow	A suite of development tools designed to streamline the end-to-end development cycle of LLM-based AI applications.
Griptape	A modular Python framework for building AI-powered applications.
Weave	Weave is a toolkit for developing Generative AI applications.
Llama Stack	Build Llama Apps.

Data Preparation

Library	Description
Data Prep Kit	Data Prep Kit accelerates unstructured data preparation for LLM app developers. Developers can use Data Prep Kit to cleanse, transform, and enrich use case-specific unstructured data to pre-train LLMs, fine-tune LLMs, instruct-tune LLMs, or build RAG applications.

Multi API Access

Library	Description
LiteLLM	Library to call 100+ LLM APIs in OpenAI format.
AI Gateway	A Blazing Fast AI Gateway with integrated Guardrails. Route to 200+ LLMs, 50+ AI Guardrails with 1 fast & friendly API.

Routers

Library	Description
RouteLLM	Framework for serving and evaluating LLM routers - save LLM costs without compromising quality. Drop-in replacement for OpenAI's client to route simpler queries to cheaper models.

Memory

Library	Description
mem0	The Memory layer for your AI apps.
Memoripy	An AI memory layer with short- and long-term storage, semantic clustering, and optional memory decay for context-aware applications.
Letta (MemGPT)	An open-source framework for building stateful LLM applications with advanced reasoning capabilities and transparent long-term memory
Memobase	A user profile-based memory system designed to bring long-term user memory to your Generative AI applications.

Interface

Library	Description
Streamlit	A faster way to build and share data apps. Streamlit lets you transform Python scripts into interactive web apps in minutes
Gradio	Build and share delightful machine learning apps, all in Python.
AI SDK UI	Build chat and generative user interfaces.
AI-Gradio	Create AI apps powered by various AI providers.
Simpleaichat	Python package for easily interfacing with chat apps, with robust features and minimal code complexity.
Chainlit	Build production-ready Conversational AI applications in minutes.

Low Code

Library	Description
LangFlow	LangFlow is a low-code app builder for RAG and multi-agent AI applications. It’s Python-based and agnostic to any model, API, or database.

Cache

Library	Description
GPTCache	A Library for Creating Semantic Cache for LLM Queries. Slash Your LLM API Costs by 10x 💰, Boost Speed by 100x. Fully integrated with LangChain and LlamaIndex.

LLM RAG

Library	Description
FastGraph RAG	Streamlined and promptable Fast GraphRAG framework designed for interpretable, high-precision, agent-driven retrieval workflows.
Chonkie	RAG chunking library that is lightweight, lightning-fast, and easy to use.
RAGChecker	A Fine-grained Framework For Diagnosing RAG.
RAG to Riches	Build, scale, and deploy state-of-the-art Retrieval-Augmented Generation applications.
BeyondLLM	Beyond LLM offers an all-in-one toolkit for experimentation, evaluation, and deployment of Retrieval-Augmented Generation (RAG) systems.
SQLite-Vec	A vector search SQLite extension that runs anywhere!
fastRAG	fastRAG is a research framework for efficient and optimized retrieval-augmented generative pipelines, incorporating state-of-the-art LLMs and Information Retrieval.
FlashRAG	A Python Toolkit for Efficient RAG Research.
Llmware	Unified framework for building enterprise RAG pipelines with small, specialized models.
Rerankers	A lightweight unified API for various reranking models.
Vectara	Build Agentic RAG applications.

LLM Inference

Library	Description
LLM Compressor	Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment.
LightLLM	Python-based LLM inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
vLLM	High-throughput and memory-efficient inference and serving engine for LLMs.
torchchat	Run PyTorch LLMs locally on servers, desktop, and mobile.
TensorRT-LLM	TensorRT-LLM is a library for optimizing Large Language Model (LLM) inference.
WebLLM	High-performance In-browser LLM Inference Engine.

LLM Serving

Library	Description
Langcorn	Serving LangChain LLM apps and agents automagically with FastAPI.
LitServe	Lightning-fast serving engine for any AI model of any size. It augments FastAPI with features like batching, streaming, and GPU autoscaling.

LLM Data Extraction

Library	Description
Crawl4AI	Open-source LLM Friendly Web Crawler & Scraper.
ScrapeGraphAI	A web scraping Python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.).
Docling	Docling parses documents and exports them to the desired format with ease and speed.
Llama Parse	GenAI-native document parser that can parse complex document data for any downstream LLM use case (RAG, agents).
PyMuPDF4LLM	PyMuPDF4LLM library makes it easier to extract PDF content in the format you need for LLM & RAG environments.
Crawlee	A web scraping and browser automation library.
MegaParse	Parser for every type of document.
ExtractThinker	Document Intelligence library for LLMs.

LLM Data Generation

Library	Description
DataDreamer	DataDreamer is a powerful open-source Python library for prompting, synthetic data generation, and training workflows.
fabricator	A flexible open-source framework to generate datasets with large language models.
Promptwright	Synthetic Dataset Generation Library.
EasyInstruct	An Easy-to-use Instruction Processing Framework for Large Language Models.

LLM Agents

Library	Description
CrewAI	Framework for orchestrating role-playing, autonomous AI agents.
LangGraph	Build resilient language agents as graphs.
Agno	Build AI Agents with memory, knowledge, tools, and reasoning. Chat with them using a beautiful Agent UI.
Agents SDK	Build agentic apps using LLMs with context, tools, hand off to other specialized agents.
AutoGen	An open-source framework for building AI agent systems.
Smolagents	Library to build powerful agents in a few lines of code.
Pydantic AI	Python agent framework to build production grade applications with Generative AI.
BeeAI	Build production-ready multi-agent systems in Python.
gradio-tools	A Python library for converting Gradio apps into tools that can be leveraged by an LLM-based agent to complete its task.
Composio	Production Ready Toolset for AI Agents.
Atomic Agents	Building AI agents, atomically.
Memary	Open Source Memory Layer For Autonomous Agents.
Browser Use	Make websites accessible for AI agents.
OpenWebAgent	An Open Toolkit to Enable Web Agents on Large Language Models.
Lagent	A lightweight framework for building LLM-based agents.
LazyLLM	A Low-code Development Tool For Building Multi-agent LLMs Applications.
Swarms	The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework.
ChatArena	ChatArena is a library that provides multi-agent language game environments and facilitates research about autonomous LLM agents and their social interactions.
Swarm	Educational framework exploring ergonomic, lightweight multi-agent orchestration.
AgentStack	The fastest way to build robust AI agents.
Archgw	Intelligent gateway for Agents.
Flow	A lightweight task engine for building AI agents.
AgentOps	Python SDK for AI agent monitoring.
Langroid	Multi-Agent framework.
Agentarium	Framework for creating and managing simulations populated with AI-powered agents.
Upsonic	Reliable AI agent framework that supports MCP.

LLM Evaluation

Library	Description
Ragas	Ragas is your ultimate toolkit for evaluating and optimizing Large Language Model (LLM) applications.
Giskard	Open-Source Evaluation & Testing for ML & LLM systems.
DeepEval	LLM Evaluation Framework
Lighteval	All-in-one toolkit for evaluating LLMs.
Trulens	Evaluation and Tracking for LLM Experiments
PromptBench	A unified evaluation framework for large language models.
LangTest	Deliver Safe & Effective Language Models. 60+ Test Types for Comparing LLM & NLP Models on Accuracy, Bias, Fairness, Robustness & More.
EvalPlus	A rigorous evaluation framework for LLM4Code.
FastChat	An open platform for training, serving, and evaluating large language model-based chatbots.
judges	A small library of LLM judges.
Evals	Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
AgentEvals	Evaluators and utilities for evaluating the performance of your agents.
LLMBox	A comprehensive library for implementing LLMs, including a unified training pipeline and comprehensive model evaluation.
Opik	An open-source end-to-end LLM Development Platform which also includes LLM evaluation.

LLM Monitoring

Library	Description
MLflow	An open-source end-to-end MLOps/LLMOps Platform for tracking, evaluating, and monitoring LLM applications.
Opik	An open-source end-to-end LLM Development Platform which also includes LLM monitoring.
LangSmith	Provides tools for logging, monitoring, and improving your LLM applications.
Weights & Biases (W&B)	W&B provides features for tracking LLM performance.
Helicone	Open source LLM-Observability Platform for Developers. One-line integration for monitoring, metrics, evals, agent tracing, prompt management, playground, etc.
Evidently	An open-source ML and LLM observability framework.
Phoenix	An open-source AI observability platform designed for experimentation, evaluation, and troubleshooting.
Observers	A Lightweight Library for AI Observability.

LLM Prompts

Library	Description
PCToolkit	A Unified Plug-and-Play Prompt Compression Toolkit of Large Language Models.
Selective Context	Selective Context compresses your prompt and context to allow LLMs (such as ChatGPT) to process 2x more content.
LLMLingua	Library for compressing prompts to accelerate LLM inference.
betterprompt	Test suite for LLM prompts before pushing them to production.
Promptify	Solve NLP Problems with LLMs & easily generate different NLP Task prompts for popular generative models like GPT, PaLM, and more with Promptify.
PromptSource	PromptSource is a toolkit for creating, sharing, and using natural language prompts.
DSPy	DSPy is the open-source framework for programming—rather than prompting—language models.
Py-priompt	Prompt design library.
Promptimizer	Prompt optimization library.

LLM Structured Outputs

Library	Description
Instructor	Python library for working with structured outputs from large language models (LLMs). Built on top of Pydantic, it provides a simple, transparent, and user-friendly API.
XGrammar	An open-source library for efficient, flexible, and portable structured generation.
Outlines	Robust (structured) text generation
Guidance	Guidance is an efficient programming paradigm for steering language models.
LMQL	A language for constraint-guided and efficient LLM programming.
Jsonformer	A Bulletproof Way to Generate Structured JSON from Language Models.

LLM Safety and Security

Library	Description
JailbreakEval	A collection of automated evaluators for assessing jailbreak attempts.
EasyJailbreak	An easy-to-use Python framework to generate adversarial jailbreak prompts.
Guardrails	Adding guardrails to large language models.
LLM Guard	The Security Toolkit for LLM Interactions.
AuditNLG	AuditNLG is an open-source library that can help reduce the risks associated with using generative AI systems for language.
NeMo Guardrails	NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.
Garak	LLM vulnerability scanner
DeepTeam	The LLM Red Teaming Framework

LLM Embedding Models

Library	Description
Sentence-Transformers	State-of-the-Art Text Embeddings
Model2Vec	Fast State-of-the-Art Static Embeddings
Text Embedding Inference	A blazing fast inference solution for text embeddings models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.

Others

Library	Description
Text Machina	A modular and extensible Python framework, designed to aid in the creation of high-quality, unbiased datasets to build robust models for MGT-related tasks such as detection, attribution, and boundary detection.
LLM Reasoners	A library for advanced large language model reasoning.
EasyEdit	An Easy-to-use Knowledge Editing Framework for Large Language Models.
CodeTF	CodeTF: One-stop Transformer Library for State-of-the-art Code LLM.
spacy-llm	This package integrates Large Language Models (LLMs) into spaCy, featuring a modular system for fast prototyping and prompting, and turning unstructured responses into robust outputs for various NLP tasks.
pandas-ai	Chat with your database (SQL, CSV, pandas, polars, MongoDB, NoSQL, etc.).
LLM Transparency Tool	An open-source interactive toolkit for analyzing internal workings of Transformer-based language models.
Vanna	Chat with your SQL database. Accurate Text-to-SQL Generation via LLMs using RAG.
mergekit	Tools for merging pretrained large language models.
MarkLLM	An Open-Source Toolkit for LLM Watermarking.
LLMSanitize	An open-source library for contamination detection in NLP datasets and Large Language Models (LLMs).
Annotateai	Automatically annotate papers using LLMs.
LLM Reasoner	Make any LLM think like OpenAI o1 and DeepSeek R1.