目錄
- SmolLM2:當Smol變大 - 以數據為中心的小型語言模型訓練
- 英文摘要
- 中文摘要
- OmniHuman-1:重新思考單階段條件人類動畫模型的規模擴展
- 英文摘要
- 中文摘要
- s1:簡單的測試時縮放
- 英文摘要
- 中文摘要
- 直接對齊算法間的差異日漸模糊
- 英文摘要
- 中文摘要
- VideoJAM:用於增強視頻模型運動生成的聯合外觀-運動表示
- 英文摘要
- 中文摘要
- LIMO:對推理而言,少即是多
- 英文摘要
- 中文摘要
- 分析特徵流以增強語言模型中的解釋與引導
- 英文摘要
- 中文摘要
- 通過隱式獎勵進行過程強化
- 英文摘要
- 中文摘要
- 揭開LLM中長思維鏈推理的神秘面紗
- 英文摘要
- 中文摘要
- AlphaGeometry2 在奧賽幾何問題中的金牌級表現
- 英文摘要
- 中文摘要
- Preference Leakage:LLM-as-a-judge中的污染問題
- 英文摘要
- 中文摘要
- AlignVLM:橋接視覺與語言潛在空間以實現多模態理解
- 英文摘要
- 中文摘要
- 獎勵引導的投機解碼,實現高效的LLM推理
- 英文摘要
- 中文摘要
- TwinMarket:金融市場的可擴展行為和社交模擬
- 英文摘要
- 中文摘要
- ConceptAttention:擴散Transformer學習高度可解釋的特徵
- 英文摘要
- 中文摘要
- MatAnyone:基於一致記憶傳播的穩定視頻摳圖
- 英文摘要
- 中文摘要
- 大模型的思維趨同,這對 AI 監管構成了挑戰
- 英文摘要
- 中文摘要
- Ola:以漸進式模態對齊推動全模態語言模型的前沿
- 英文摘要
- 中文摘要
- DynVFX:用動態內容增強真實視頻
- 英文摘要
- 中文摘要
- SafeRAG:大型語言模型檢索增強生成的安全性基準測試
- 英文摘要
- 中文摘要
- ACECODER:通過自動化測試用例合成精進編碼器RL
- 英文摘要
- 中文摘要
- 逆橋匹配蒸餾
- 英文摘要
- 中文摘要
- Llasa:為基於Llama的語音合成擴展訓練時與推理時計算
- 英文摘要
- 中文摘要
- SliderSpace:分解擴散模型的視覺能力
- 英文摘要
- 中文摘要
- 自監督量化表示,用于無縫集成知識圖譜與大型語言模型
- 英文摘要
- 中文摘要
- BOLT:無需蒸餾即可自舉語言模型的長思維鏈
- 英文摘要
- 中文摘要
- 在語言模型中縮放嵌入層
- 英文摘要
- 中文摘要
- DeepRAG:面向大型語言模型的逐步思考式檢索
- 英文摘要
- 中文摘要
- MM-IQ:多模態模型中類人抽象與推理的基準測試
- 英文摘要
- 中文摘要
SmolLM2:當Smol變大 - 以數據為中心的小型語言模型訓練
-
標題: SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model
-
作者: Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf
-
日期: 2025-02-04
-
論文鏈接: https://arxiv.org/pdf/2502.02737
英文摘要
While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art “small” (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations as well as a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 as well as all of the datasets we prepared in the course of this project.
中文摘要
盡管大型語言模型在人工智能的許多應用中都促進了突破性,但它們的固有寬容使它們在計算上昂貴且充滿挑戰,可以在資源受限的設置中部署。在本文中,我們記錄了SmollM2的開發,SmollM2是一種最先進的“小”(17億個參數)語言模型(LM)。為了實現強大的性能,我們使用多階段訓練過程將?11萬億個數據的SmollM2推出了大約11萬億代幣,該過程將Web文本與專門的數學,代碼和跟隨數據的數據混合在一起。我們還在階段引入了新的專業數據集(Finemath,stack-edu和smoltalk),在那里我們發現現有的數據集在有問題的小或低質量上。為了告知我們的設計決策,我們同時執行小規模消融和手動完善過程,該過程會根據上一階段的性能在每個階段更新數據集混合率。最終,我們證明了SmollM2優于其他最近的小型LM,包括QWEN2.5-1.5B和LLAMA3.2-1B。為了促進對LM開發以及小型LM的應用的未來研究,我們均釋放SmollM2以及我們在本項目過程中準備的所有數據集。
?
OmniHuman-1:重新思考單階段條件人類動畫模型的規模擴展
-
標題: OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
-
作者: Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang
-
日期: 2025-02-03
-
論文鏈接: https://arxiv.org/pdf/2502.01061
英文摘要
End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in the recent few years. However, existing methods still struggle to scale up as large general video generation models, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals). Video samples are provided on the project page (https://omnihuman-lab.github.io).
中文摘要
端到端人類動畫(例如音頻驅動的說話人生成)在近幾年取得了顯著進展。然而,現有方法仍難以像大型通用視頻生成模型那樣進行規模擴展,限制了它們在實際應用中的潛力。在本文中,我們提出了OmniHuman,一個基於擴散Transformer的框架,通過在訓練階段混入與運動相關的條件來擴展數據規模。為此,我們為這些混合條件引入了兩條訓練原則,以及相應的模型架構和推理策略。這些設計使OmniHuman能夠充分利用數據驅動的運動生成,最終實現高度逼真的人類視頻生成。更重要的是,OmniHuman支持多種人像內容(面部特寫、肖像、半身、全身),支持說話和唱歌,能處理人與物體的交互和具有挑戰性的身體姿勢,並適應不同的圖像風格。與現有的端到端音頻驅動方法相比,OmniHuman不僅生成更逼真的視頻,還在輸入上提供了更大的靈活性。它還支持多種驅動模態(音頻驅動、視頻驅動以及組合驅動信號)。視頻樣例見項目頁面(https://omnihuman-lab.github.io)。
?
s1:簡單的測試時縮放
-
標題: s1: Simple test-time scaling
-
作者: Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto
-
日期: 2025-01-31
-
論文鏈接: https://arxiv.org/pdf/2501.19393
英文摘要
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI’s o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.
中文摘要
測試時縮放是一種有前景的語言建模新方法,利用額外的測試時計算來提升性能。最近,OpenAI的o1模型展示了這種能力,但沒有公開其方法,引發了許多復現工作。我們尋求實現測試時縮放和強推理性能的最簡單方法。首先,我們依據通過消融實驗驗證的三個標準(難度、多樣性和質量),策劃了一個由1000個問題及其推理軌跡組成的小型數據集s1K。其次,我們提出「預算強制」(budget forcing)來控制測試時計算:或者強制終止模型的思考過程,或者在模型試圖結束時向其生成內容多次追加「Wait」來延長思考。這會促使模型複查自己的答案,常常能修正錯誤的推理步驟。在s1K上對Qwen2.5-32B-Instruct語言模型進行監督微調並配備預算強制後,我們的模型s1在競賽數學問題上超過o1-preview最多達27%(MATH和AIME24)。此外,借助預算強制對s1進行縮放,可以在沒有測試時干預的情況下外推其性能:在AIME24上從50%提升到57%。我們的模型、數據和代碼在https://github.com/simplescaling/s1開源。
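摘要中的「預算強制」本質上是一段解碼控制邏輯:思考不足時追加「Wait」續寫,超出預算時強制收尾。下面是一個與具體推理框架無關的Python草稿,其中generate_until為假設的生成接口,</think>等標記也僅為示意:
```python
# 示意:預算強制(budget forcing)的解碼控制邏輯。
# generate_until 為假設接口:從前綴繼續生成,直到遇到停止串或達到 token 上限,
# 返回 (新文本, 消耗的token數, 是否因停止串而停)。

def budget_forced_generate(generate_until, prompt: str,
                           min_think: int, max_think: int,
                           end_think: str = "</think>", wait: str = "Wait") -> str:
    text, used = prompt, 0
    while used < max_think:
        chunk, n_tokens, hit_stop = generate_until(
            text, stop=end_think, max_new_tokens=max_think - used)
        text, used = text + chunk, used + n_tokens
        if not hit_stop:          # 預算耗盡:強制終止思考
            break
        if used >= min_think:     # 已達最小思考預算,允許結束
            break
        text += wait              # 追加 "Wait" 延長思考,常能觸發模型複查答案
    return text + end_think       # 之後再讓模型生成最終答案
```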
?
直接對齊算法間的差異日漸模糊
-
標題: The Differences Between Direct Alignment Algorithms are a Blur
-
作者: Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov
-
日期: 2025-02-03
-
論文鏈接: https://arxiv.org/pdf/2502.01237
英文摘要
Direct Alignment Algorithms (DAAs) simplify language model alignment by replacing reinforcement learning (RL) and reward modeling (RM) in Reinforcement Learning from Human Feedback (RLHF) with direct policy optimization. DAAs can be classified by their ranking losses (pairwise vs. pointwise), by the rewards used in those losses (e.g., likelihood ratios of policy and reference policy, or odds ratios), or by whether a Supervised Fine-Tuning (SFT) phase is required (two-stage vs. one-stage). We first show that one-stage methods underperform two-stage methods. To address this, we incorporate an explicit SFT phase and introduce the beta parameter, controlling the strength of preference optimization, into single-stage ORPO and ASFT. These modifications improve their performance in Alpaca Eval 2 by +3.46 (ORPO) and +8.27 (ASFT), matching two-stage methods like DPO. Further analysis reveals that the key factor is whether the approach uses pairwise or pointwise objectives, rather than the specific implicit reward or loss function. These results highlight the importance of careful evaluation to avoid premature claims of performance gains or overall superiority in alignment algorithms.
中文摘要
直接對齊算法(DAA)用直接的策略優化取代基於人類反饋的強化學習(RLHF)中的強化學習(RL)和獎勵建模(RM),從而簡化語言模型對齊。DAA可以按其排序損失(成對式與逐點式)分類,可以按這些損失中使用的獎勵(例如策略與參考策略的似然比,或優勢比)分類,也可以按是否需要監督微調(SFT)階段(兩階段與單階段)分類。我們首先表明,單階段方法的表現不如兩階段方法。為了解決這一問題,我們在單階段的ORPO和ASFT中引入顯式的SFT階段,並引入控制偏好優化強度的beta參數。這些修改使它們在AlpacaEval 2上的性能分別提升+3.46(ORPO)和+8.27(ASFT),與DPO等兩階段方法持平。進一步分析表明,關鍵因素在於方法使用成對式還是逐點式目標,而非具體的隱式獎勵或損失函數。這些結果強調了仔細評估的重要性,以避免對性能增益或對齊算法整體優越性的過早斷言。
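摘要中的beta參數控制偏好優化的「強度」。下面以成對式目標的DPO形式給出一個最小PyTorch草稿,說明beta作用在策略與參考模型的對數似然比上;這只是對成對式DAA損失的通用示意,並非對ORPO或ASFT修改版的復現:
```python
# 示意:成對式直接對齊損失中 beta 的作用(DPO 形式)。
# logp_* 為策略/參考模型對被選與被拒回答的對數似然(張量);
# beta 越大,偏好優化越尖銳。

import torch.nn.functional as F

def pairwise_daa_loss(logp_chosen, logp_rejected,
                      ref_logp_chosen, ref_logp_rejected, beta: float):
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)      # 隱式獎勵
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```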
?
VideoJAM:用於增強視頻模型運動生成的聯合外觀-運動表示
-
標題: VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
-
作者: Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, Shelly Sheynin
-
日期: 2025-02-04
-
論文鏈接: https://arxiv.org/pdf/2502.02492
英文摘要
Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior to video generators, by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model’s own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation. Project website: https://hila-chefer.github.io/videojam-paper.github.io/
中文摘要
儘管近期進展巨大,生成式視頻模型仍難以捕捉真實世界的運動、動態和物理規律。我們表明,這一局限源於傳統的像素重建目標,它使模型偏向外觀保真度而犧牲運動連貫性。為解決這一問題,我們提出VideoJAM,一個新穎的框架,通過鼓勵模型學習聯合的外觀-運動表示,為視頻生成器注入有效的運動先驗。VideoJAM由兩個互補的單元組成。在訓練階段,我們擴展目標函數,從單一的學習表示中同時預測生成的像素及其對應的運動。在推理階段,我們引入「內部引導」(Inner-Guidance)機制,利用模型自身不斷演化的運動預測作為動態引導信號,將生成導向連貫的運動。值得注意的是,我們的框架只需極少的適配即可應用於任何視頻模型,無需修改訓練數據或擴大模型規模。VideoJAM在運動連貫性上達到了最先進的性能,超越了競爭力極強的專有模型,同時也提升了生成結果的感知視覺質量。這些發現強調,外觀與運動可以互補,而且在有效整合時能同時增強視頻生成的視覺質量和連貫性。項目網站:https://hila-chefer.github.io/videojam-paper.github.io/
?
LIMO:對推理而言,少即是多
-
標題: LIMO: Less is More for Reasoning
-
作者: Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu
-
日期: 2025-02-05
-
論文鏈接: https://arxiv.org/pdf/2502.03387
英文摘要
We present a fundamental discovery that challenges our understanding of how complex reasoning emerges in large language models. While conventional wisdom suggests that sophisticated reasoning tasks demand extensive training data (>100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples. Through comprehensive experiments, our proposed model LIMO demonstrates unprecedented performance in mathematical reasoning. With merely 817 curated training samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, improving from previous SFT-based models’ 6.5% and 59.2% respectively, while only using 1% of the training data required by previous approaches. LIMO demonstrates exceptional out-of-distribution generalization, achieving 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data, challenging the notion that SFT leads to memorization rather than generalization. Based on these results, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes. This hypothesis posits that the elicitation threshold for complex reasoning is determined by two key factors: (1) the completeness of the model’s encoded knowledge foundation during pre-training, and (2) the effectiveness of post-training examples as “cognitive templates” that show the model how to utilize its knowledge base to solve complex reasoning tasks. To facilitate reproducibility and future research in data-efficient reasoning, we release LIMO as a comprehensive open-source suite at https://github.com/GAIR-NLP/LIMO.
中文摘要
我們提出一項基本發現,挑戰了我們對大型語言模型中複雜推理如何湧現的理解。傳統觀點認為複雜的推理任務需要大量訓練數據(超過10萬個示例),但我們證明,複雜的數學推理能力可以用出人意料地少的示例有效激發。通過全面的實驗,我們提出的模型LIMO在數學推理上展示了前所未有的性能。僅用817個精心策劃的訓練樣本,LIMO在AIME上達到57.1%的準確率,在MATH上達到94.8%,分別由之前基於SFT的模型的6.5%和59.2%提升而來,且只使用了以往方法所需訓練數據的1%。LIMO展現出卓越的分布外泛化能力,在10個不同基準上取得40.5%的絕對提升,優於在100倍數據上訓練的模型,挑戰了「SFT導致記憶而非泛化」的觀念。基於這些結果,我們提出「少即是多推理假說」(LIMO假說):在預訓練階段已全面編碼領域知識的基礎模型中,複雜的推理能力可以通過極少但精心編排的認知過程示範而湧現。該假說認為,複雜推理的激發閾值由兩個關鍵因素決定:(1)模型在預訓練期間編碼的知識基礎的完備性;(2)訓練後示例作為「認知模板」的有效性,即它們如何向模型展示利用其知識庫解決複雜推理任務的方式。為了促進數據高效推理的可復現性和未來研究,我們在https://github.com/GAIR-NLP/LIMO發布了LIMO作為一個全面的開源套件。
?
分析特徵流以增強語言模型中的解釋與引導
-
標題: Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
-
作者: Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
-
日期: 2025-02-05
-
論文鏈接: https://arxiv.org/pdf/2502.03032
英文摘要
We introduce a new approach to systematically map features discovered by sparse autoencoder across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
中文摘要
我們提出一種新方法,系統地在大型語言模型的連續各層之間映射稀疏自編碼器發現的特徵,擴展了此前研究層間特徵關聯的工作。通過使用一種無需數據的余弦相似度技術,我們追蹤特定特徵在每個階段如何延續、轉化或首次出現。該方法產生特徵演化的細粒度流圖,為模型計算提供細粒度的可解釋性和機制性洞察。至關重要的是,我們展示了這些跨層特徵圖如何通過放大或抑制選定特徵來直接引導模型行為,在文本生成中實現有針對性的主題控制。總之,我們的發現突顯了一個因果性的跨層可解釋性框架的實用性:它不僅闡明了特徵如何在前向傳播中發展,還為透明地操控大型語言模型提供了新手段。
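摘要所說的「無數據余弦相似度」可以理解為直接比較兩層SAE的特徵方向,而無需跑任何輸入樣本。下面的PyTorch草稿假設用解碼器權重的行向量代表特徵方向(這一選擇是我們的假設),演示跨層特徵匹配:
```python
# 示意:用無數據的余弦相似度在相鄰兩層的 SAE 特徵之間建立對應。
# W_dec_a / W_dec_b 為假設的兩層 SAE 解碼器權重,形狀 (特徵數, 模型維度)。

import torch
import torch.nn.functional as F

def match_features(W_dec_a: torch.Tensor, W_dec_b: torch.Tensor):
    a = F.normalize(W_dec_a, dim=-1)
    b = F.normalize(W_dec_b, dim=-1)
    sim = a @ b.T                      # (n_a, n_b) 余弦相似度矩陣
    score, idx = sim.max(dim=-1)       # 每個 a 層特徵在 b 層的最佳匹配
    return idx, score                  # 低分者可視為「消失」;反向匹配低分者為「新出現」
```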
?
通過隱式獎勵進行過程強化
-
標題: Process Reinforcement through Implicit Rewards
-
作者: Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
-
日期: 2025-02-03
-
論文鏈接: https://arxiv.org/pdf/2502.01456
英文摘要
Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
中文摘要
在大型語言模型(LLM)的推理時縮放中,密集的過程獎勵已被證明是比稀疏的結果級獎勵更有效的替代方案,尤其是在需要複雜多步推理的任務中。密集獎勵對LLM的強化學習(RL)同樣是有吸引力的選擇,因為其細粒度的獎勵有潛力解決結果獎勵的一些固有問題,如訓練效率和信用分配;但這一潛力在很大程度上仍未實現。這主要歸因於在線訓練過程獎勵模型(PRM)的挑戰:收集高質量的過程標籤代價高昂得令人卻步,使其特別容易受到獎勵破解(reward hacking)的影響。為應對這些挑戰,我們提出PRIME(通過隱式獎勵進行過程強化),它僅通過策略rollout和結果標籤,借助隱式過程獎勵實現在線PRM更新。PRIME可與多種優勢函數很好地結合,並省去了現有方法所需的專門獎勵模型訓練階段,大幅降低了開發開銷。我們在競賽數學和編碼任務上展示了PRIME的有效性。從Qwen2.5-Math-7B-Base出發,PRIME在多個關鍵推理基準上相比SFT模型平均提升15.1%。值得注意的是,我們得到的模型Eurus-2-7B-PRIME僅用Qwen2.5-Math-7B-Instruct訓練數據的10%,就在七個推理基準上超越了後者。
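摘要的核心是把策略與參考模型的對數似然比當作token級的隱式過程獎勵,從而僅憑結果標籤在線更新隱式PRM。下面的PyTorch草稿示意這一思路;beta取值與接口均為假設,並非論文的完整訓練流程:
```python
# 示意:隱式過程獎勵 r_t = beta * (log pi(y_t) - log pi_ref(y_t)),
# 並用結果標籤(0/1)對序列級獎勵做交叉熵,在線更新隱式 PRM。

import torch
import torch.nn.functional as F

def implicit_process_rewards(logp_model: torch.Tensor,
                             logp_ref: torch.Tensor,
                             beta: float = 0.05) -> torch.Tensor:
    """logp_*: (序列長,) 逐 token 對數概率;返回 token 級過程獎勵。"""
    return beta * (logp_model - logp_ref)

def outcome_loss(logp_model, logp_ref, outcome: int, beta: float = 0.05):
    # 序列級隱式獎勵 = token 級獎勵之和;outcome 為結果標籤(0/1)
    seq_logit = implicit_process_rewards(logp_model, logp_ref, beta).sum()
    return F.binary_cross_entropy_with_logits(
        seq_logit, torch.tensor(float(outcome)))
```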
?
揭開LLM中長思維鏈推理的神秘面紗
-
標題: Demystifying Long Chain-of-Thought Reasoning in LLMs
-
作者: Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue
-
日期: 2025-02-05
-
論文鏈接: https://arxiv.org/pdf/2502.03373
英文摘要
Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.
中文摘要
擴展推理計算可以增強大型語言模型(LLM)的推理能力,長思維鏈(CoT)使回溯和糾錯等策略成為可能。強化學習(RL)已成為發展這些能力的關鍵方法,但長CoT湧現的條件仍不清楚,且RL訓練需要謹慎的設計選擇。在這項研究中,我們系統地考察長CoT推理的機制,找出使模型能夠生成長CoT軌跡的關鍵因素。通過大量的監督微調(SFT)和RL實驗,我們提出四項主要發現:(1)雖然SFT並非嚴格必要,但它能簡化訓練並提高效率;(2)推理能力傾向於隨訓練計算的增加而湧現,但其發展並無保證,因此獎勵塑形對於穩定CoT長度的增長至關重要;(3)擴展可驗證的獎勵信號對RL至關重要。我們發現,利用帶過濾機制的、從網絡提取的含噪聲解答展現出強大潛力,尤其是在STEM推理等分布外(OOD)任務上;(4)糾錯等核心能力在基礎模型中本已存在,但要通過RL有效地激勵這些技能以完成複雜任務需要可觀的計算量,而衡量它們的湧現需要細緻的方法。這些洞見為優化訓練策略以增強LLM中的長CoT推理提供了實用指導。我們的代碼見:https://github.com/eddycmu/demystify-long-cot。
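發現(2)提到獎勵塑形對穩定CoT長度增長至關重要。下面給出一個純屬假設的塑形函數草稿,僅用於說明「正確性主導、超長部分輕微懲罰」這類設計的形態,並非論文提出的具體公式:
```python
# 示意(假設性):一種以正確性為主、對超出長度預算部分線性扣分的獎勵塑形,
# 用於防止 RL 訓練中思維鏈長度失控。閾值與係數均為任意示例。

def shaped_reward(correct: bool, cot_len: int,
                  budget: int = 4096, overlong_penalty: float = 0.5) -> float:
    r = 1.0 if correct else -1.0
    if cot_len > budget:
        r -= overlong_penalty * min(1.0, (cot_len - budget) / budget)
    return r
```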
?
AlphaGeometry2 在奧賽幾何問題中的金牌級表現
-
標題: Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2
-
作者: Yuri Chervonyi, Trieu H. Trinh, Miroslav Olšák, Xiaomeng Yang, Hoang Nguyen, Marcelo Menegali, Junehyuk Jung, Vikas Verma, Quoc V. Le, Thang Luong
-
日期: 2025-02-05
-
論文鏈接: https://arxiv.org/pdf/2502.03544
英文摘要
We present AlphaGeometry2, a significantly improved version of AlphaGeometry introduced in Trinh et al. (2024), which has now surpassed an average gold medalist in solving Olympiad geometry problems. To achieve this, we first extend the original AlphaGeometry language to tackle harder problems involving movements of objects, and problems containing linear equations of angles, ratios, and distances. This, together with other additions, has markedly improved the coverage rate of the AlphaGeometry language on International Math Olympiads (IMO) 2000-2024 geometry problems from 66% to 88%. The search process of AlphaGeometry2 has also been greatly improved through the use of Gemini architecture for better language modeling, and a novel knowledge-sharing mechanism that combines multiple search trees. Together with further enhancements to the symbolic engine and synthetic data generation, we have significantly boosted the overall solving rate of AlphaGeometry2 to 84% for all geometry problems over the last 25 years, compared to 54% previously. AlphaGeometry2 was also part of the system that achieved silver-medal standard at IMO 2024 https://dpmd.ai/imo-silver. Last but not least, we report progress towards using AlphaGeometry2 as a part of a fully automated system that reliably solves geometry problems directly from natural language input.
中文摘要
我們介紹AlphaGeometry2,它是Trinh等人(2024)提出的AlphaGeometry的顯著改進版本,如今在解決奧賽幾何問題上已超越平均金牌得主水平。為實現這一點,我們首先擴展了原始的AlphaGeometry語言,以處理涉及物體運動的更難問題,以及包含角度、比例和距離的線性方程的問題。這與其他改進一起,將AlphaGeometry語言對2000-2024年國際數學奧林匹克(IMO)幾何題的覆蓋率從66%顯著提升到88%。AlphaGeometry2的搜索過程也得到極大改進:使用Gemini架構以獲得更好的語言建模,並採用一種結合多棵搜索樹的新穎知識共享機制。連同對符號引擎和合成數據生成的進一步增強,我們將AlphaGeometry2在過去25年全部幾何問題上的總體求解率從之前的54%顯著提升到84%。AlphaGeometry2也是在IMO 2024上達到銀牌標準的系統的一部分(https://dpmd.ai/imo-silver)。最後同樣重要的是,我們報告了將AlphaGeometry2用作全自動系統一部分的進展,該系統能直接從自然語言輸入可靠地求解幾何問題。
?
Preference Leakage:LLM-as-a-judge中的污染問題
-
標題: Preference Leakage: A Contamination Problem in LLM-as-a-judge
-
作者: Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu
-
日期: 2025-02-03
-
論文鏈接: https://arxiv.org/pdf/2502.01534
英文摘要
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common relatednesses between data generator LLM and judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive issue that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all codes and data at: https://github.com/David-Li0406/Preference-Leakage.
中文摘要
作為評判者的大型語言模型(LLM)和基於LLM的數據合成,已成為模型開發中兩種基本的LLM驅動數據標注方法。雖然二者的結合顯著提升了模型訓練和評估的效率,但這種新的模型開發範式可能帶來的污染卻鮮少受到關注。在這項工作中,我們揭示了偏好泄漏(preference leakage):一種由合成數據生成器與基於LLM的評估器之間的相關性引起的LLM-as-a-judge污染問題。為研究此問題,我們首先定義了數據生成器LLM與評判LLM之間三種常見的相關性:是同一個模型、存在繼承關係,以及屬於同一模型家族。通過大量實驗,我們在多個LLM基線和基準上實證確認了偏好泄漏導致的評判者對其相關學生模型的偏袒。進一步分析表明,與LLM-as-a-judge場景中此前已識別的偏差相比,偏好泄漏是一個更難檢測的普遍問題。所有這些發現都表明,偏好泄漏是LLM-as-a-judge領域一個普遍且具有挑戰性的問題。我們在https://github.com/David-Li0406/Preference-Leakage發布了全部代碼和數據。
?
AlignVLM:橋接視覺與語言潛在空間以實現多模態理解
-
標題: AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
-
作者: Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar
-
日期: 2025-02-03
-
論文鏈接: https://arxiv.org/pdf/2502.01341
英文摘要
Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), often produce out-of-distribution or noisy inputs, leading to misalignment between the modalities. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where scanned document images must be accurately mapped to their textual content. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods. We provide further analysis demonstrating improved vision-text feature alignment and robustness to noise.
中文摘要
將視覺特征與語言嵌入對齊是視覺模型(VLMS)的關鍵挑戰。此類模型的性能取決于擁有一個良好的連接器,該連接器將視覺編碼器生成的視覺特征映射到使用LLM的共享嵌入空間,同時保持語義相似性。現有的連接器(例如多層感知器(MLP))通常會產生分布式或嘈雜的輸入,從而導致模態之間的錯位。在這項工作中,我們提出了一種新穎的視覺文本對準方法AlignVLM,該方法將視覺特征映射到LLM文本嵌入的加權平均值。我們的方法利用了LLM編碼的語言先驗,以確保將視覺特征映射到LLM可以有效解釋的空間區域。AlignVLM對于文檔理解任務特別有效,必須將掃描的文檔圖像準確地映射到其文本內容。我們的廣泛實驗表明,與先前的對齊方法相比,AlignVLM實現了最先進的性能。我們提供了進一步的分析,證明了視力文本的提高特征特征對齊和與噪聲的魯棒性。
?
獎勵引導的投機解碼,實現高效的LLM推理
-
標題: Reward-Guided Speculative Decoding for Efficient LLM Reasoning
-
作者: Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, Caiming Xiong
-
日期: 2025-01-31
-
論文鏈接: https://arxiv.org/pdf/2501.19324
英文摘要
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains against decoding with the target model only (up to 4.4x fewer FLOPs), while achieving significantly better accuracy than parallel decoding methods on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios.
中文摘要
我們提出獎勵引導的投機解碼(RSD),一個旨在提升大型語言模型(LLM)推理效率的新框架。RSD協同地把一個輕量級草稿模型與一個更強大的目標模型結合起來,並引入受控偏置以優先產出高獎勵輸出,這與現有強制嚴格無偏的投機解碼方法形成對比。RSD採用過程獎勵模型來評估中間解碼步驟,並動態決定是否調用目標模型,從而優化計算成本與輸出質量之間的權衡。我們從理論上證明,基於閾值的混合策略能在資源利用與性能之間達到最優平衡。在包括奧賽級任務在內的高難度推理基準上的廣泛評估表明,RSD相比僅用目標模型解碼可帶來顯著的效率提升(FLOPs最多減少4.4倍),同時平均準確率顯著優於並行解碼方法(最多+3.5)。這些結果突顯了RSD是在資源密集場景中部署LLM的一種穩健且具有成本效益的方法。
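摘要中的「基於閾值的混合策略」可以概括為:草稿模型先生成一步,由過程獎勵模型打分,達標則接受,否則才調用目標大模型。下面是一個接口全為假設的Python草稿:
```python
# 示意:獎勵引導的投機解碼的閾值混合策略。
# draft_step / target_step / prm_score 均為假設接口;threshold 為示例值。

def rsd_generate(prompt: str, draft_step, target_step, prm_score,
                 threshold: float = 0.7, max_steps: int = 32) -> str:
    text = prompt
    for _ in range(max_steps):
        step = draft_step(text)                # 輕量草稿模型先生成一個推理步驟
        if prm_score(text, step) < threshold:  # 獎勵不足:回退到目標大模型重寫
            step = target_step(text)
        text += step
        if "[EOS]" in step:                    # 假設的終止標記
            break
    return text
```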
?
TwinMarket:金融市場的可擴展行為和社交模擬
-
標題: TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets
-
作者: Yuzhe Yang, Yifei Zhang, Minghao Wu, Kaidi Zhang, Yunmiao Zhang, Honghai Yu, Yan Hu, Benyou Wang
-
日期: 2025-02-03
-
論文鏈接: https://arxiv.org/pdf/2502.01506v2.pdf
英文摘要
The study of social emergence has long been a central focus in social science. Traditional modeling approaches, such as rule-based Agent-Based Models (ABMs), struggle to capture the diversity and complexity of human behavior, particularly the irrational factors emphasized in behavioral economics. Recently, large language model (LLM) agents have gained traction as simulation tools for modeling human behavior in social science and role-playing applications. Studies suggest that LLMs can account for cognitive biases, emotional fluctuations, and other non-rational influences, enabling more realistic simulations of socio-economic dynamics. In this work, we introduce TwinMarket, a novel multi-agent framework that leverages LLMs to simulate socio-economic systems. Specifically, we examine how individual behaviors, through interactions and feedback mechanisms, give rise to collective dynamics and emergent phenomena. Through experiments in a simulated stock market environment, we demonstrate how individual actions can trigger group behaviors, leading to emergent outcomes such as financial bubbles and recessions. Our approach provides valuable insights into the complex interplay between individual decision-making and collective socio-economic patterns.
中文摘要
對社會湧現現象的研究長期以來都是社會科學的核心焦點。傳統的建模方法,例如基於規則的智能體模型(ABM),難以捕捉人類行為的多樣性和複雜性,尤其是行為經濟學所強調的非理性因素。最近,大型語言模型(LLM)智能體作為模擬工具,在社會科學和角色扮演應用中建模人類行為方面受到關注。研究表明,LLM可以刻畫認知偏差、情緒波動及其他非理性影響,從而實現對社會經濟動態更逼真的模擬。在這項工作中,我們介紹TwinMarket,一個利用LLM模擬社會經濟系統的新型多智能體框架。具體而言,我們考察個體行為如何通過交互和反饋機制引發集體動態和湧現現象。通過在模擬股票市場環境中的實驗,我們演示了個體行動如何觸發群體行為,導致金融泡沫和衰退等湧現結果。我們的方法為個體決策與集體社會經濟模式之間的複雜相互作用提供了寶貴的洞見。
?
ConceptAttention:擴散Transformer學習高度可解釋的特徵
-
標題: ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features
-
作者: Alec Helbling, Tuna Han Salih Meral, Ben Hoover, Pinar Yanardag, Duen Horng Chau
-
日期: 2025-02-06
-
論文鏈接: https://arxiv.org/pdf/2502.04320
英文摘要
Do the rich representations of multi-modal diffusion transformers (DiTs) exhibit unique properties that enhance their interpretability? We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized concept embeddings, contributing the major discovery that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps compared to commonly used cross-attention mechanisms. Remarkably, ConceptAttention even achieves state-of-the-art performance on zero-shot image segmentation benchmarks, outperforming 11 other zero-shot interpretability methods on the ImageNet-Segmentation dataset and on a single-class subset of PascalVOC. Our work contributes the first evidence that the representations of multi-modal DiT models like Flux are highly transferable to vision tasks like segmentation, even outperforming multi-modal foundation models like CLIP.
中文摘要
多模態擴散Transformer(DiT)的豐富表示是否具有增強其可解釋性的獨特性質?我們提出ConceptAttention,一種利用DiT注意力層的表達能力生成高質量顯著性圖的新方法,可精確定位圖像中的文本概念。無需額外訓練,ConceptAttention重用DiT注意力層的參數來產生高度上下文化的概念嵌入,並貢獻了一項重要發現:在DiT注意力層的輸出空間中做線性投影,得到的顯著性圖比常用的交叉注意力機制清晰得多。值得注意的是,ConceptAttention在零樣本圖像分割基準上甚至達到了最先進的性能,在ImageNet-Segmentation數據集和PascalVOC的單類子集上優於其他11種零樣本可解釋性方法。我們的工作首次提供了證據,表明Flux等多模態DiT模型的表示可以高度遷移到分割等視覺任務,甚至優於CLIP等多模態基礎模型。
?
MatAnyone:基於一致記憶傳播的穩定視頻摳圖
-
標題: MatAnyone: Stable Video Matting with Consistent Memory Propagation
-
作者: Peiqing Yang, Shangchen Zhou, Jixin Zhao, Qingyi Tao, Chen Change Loy
-
日期: 2025-01-24
-
論文鏈接: https://arxiv.org/pdf/2501.14677
-
項目鏈接: https://pq-yang.github.io/projects/MatAnyone/
英文摘要
Auxiliary-free human video matting methods, which rely solely on input frames, often struggle with complex or ambiguous backgrounds. To address this, we propose MatAnyone, a robust framework tailored for target-assigned video matting. Specifically, building on a memory-based paradigm, we introduce a consistent memory propagation module via region-adaptive memory fusion, which adaptively integrates memory from the previous frame. This ensures semantic stability in core regions while preserving fine-grained details along object boundaries. For robust training, we present a larger, high-quality, and diverse dataset for video matting. Additionally, we incorporate a novel training strategy that efficiently leverages large-scale segmentation data, boosting matting stability. With this new network design, dataset, and training strategy, MatAnyone delivers robust and accurate video matting results in diverse real-world scenarios, outperforming existing methods.
中文摘要
僅依賴輸入幀的無輔助人類視頻摳圖方法,常在複雜或歧義的背景下表現不佳。為此,我們提出MatAnyone,一個為目標指定式視頻摳圖量身定制的穩健框架。具體而言,在基於記憶的範式之上,我們通過區域自適應記憶融合引入一個一致記憶傳播模塊,自適應地整合來自前一幀的記憶。這確保了核心區域的語義穩定性,同時保留物體邊界處的細粒度細節。為了實現穩健的訓練,我們提供了一個規模更大、質量更高且更多樣的視頻摳圖數據集。此外,我們引入一種新穎的訓練策略,高效地利用大規模分割數據,提升摳圖的穩定性。憑藉這一新的網絡設計、數據集和訓練策略,MatAnyone在多樣的真實場景中給出穩健而精確的視頻摳圖結果,優於現有方法。
?
大模型的思維趨同,這對 AI 監管構成了挑戰
-
標題: Great Models Think Alike and this Undermines AI Oversight
-
作者: Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping
-
日期: 2025-02-06
-
論文鏈接: https://arxiv.org/pdf/2502.04313
英文摘要
As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as “AI Oversight”. We study how model similarity affects both aspects of AI oversight by proposing a probabilistic metric for LM similarity based on overlap in model mistakes. Using this metric, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from “weak-to-strong generalization”. As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend – model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.
中文摘要
隨著語言模型(LM)能力的提升,大規模地評估和監督它們對人類來說越來越困難。人們希望其他語言模型能自動化這兩項任務,我們稱之為「AI監督」(AI Oversight)。我們提出一種基於模型錯誤重疊的概率性LM相似度度量,以此研究模型相似性如何影響AI監督的這兩個方面。使用該度量,我們首先表明LLM-as-a-judge的打分偏向與評判者相似的模型,推廣了近期的自我偏好結果。然後,我們研究在LM標注上的訓練,發現弱監督者與強學生模型之間的互補知識在「弱到強泛化」的收益中起著關鍵作用。隨著模型能力的提升,發現它們的錯誤變得更難,我們可能會更多地依賴AI監督。然而,我們觀察到一個令人擔憂的趨勢:隨著能力增強,模型的錯誤正變得越來越相似,這指向相關性失效的風險。我們的工作強調了報告並校正模型相似性的重要性,尤其是在AI監督這一新興範式中。
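摘要提出的LM相似度度量基於模型錯誤的重疊並做了機會一致性校正。下面給出一個簡化的、Cohen's kappa風格的Python草稿以傳達這一思路;它並非論文中完整的概率性定義:
```python
# 示意(簡化):按機會一致性校正的錯誤重疊相似度,kappa 風格。
# wrong_a / wrong_b:兩個模型在同一題集上逐題是否答錯的布爾列表。

def error_overlap_kappa(wrong_a: list, wrong_b: list) -> float:
    n = len(wrong_a)
    observed = sum(a == b for a, b in zip(wrong_a, wrong_b)) / n   # 實際一致率
    p_a, p_b = sum(wrong_a) / n, sum(wrong_b) / n                  # 各自錯誤率
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)                   # 獨立假設下的一致率
    return (observed - expected) / (1 - expected + 1e-9)
```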
?
Ola:以漸進式模態對齊推動全模態語言模型的前沿
-
標題: Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
-
作者: Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao
-
日期: 2025-02-06
-
論文鏈接: https://arxiv.org/pdf/2502.04328
英文摘要
Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, there is still a notable lag behind specialized single-modality models in performance. In this paper, we present Ola, an Omni-modal language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts. The core design of Ola lies in its progressive modality alignment strategy that extends the supporting modality of the language model progressively. Our training pipeline begins with the most distinct modalities: image and text, then gradually expands the skill sets of the model using speech data that connects language and audio knowledge, and video data that connects all modalities. The progressive learning pipeline also enables us to maintain a relatively small size of the cross-modal alignment data, making developing omni-modal from existing vision-language models easy and less costly. Moreover, to unlock an advanced interactive experience like GPT-4o, we further design a sentence-wise decoding solution for streaming speech generation. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.
中文摘要
大型語言模型的最新進展,尤其是在GPT-4O之后,引發了人們對開發能夠理解更多模式的Omni-Modal模型的興趣。盡管已經出現了一些開源替代方案,但性能中的專用單模式模型仍然存在著一個顯著的滯后。在本文中,我們提出了Ola,這是一種Omni-Modal語言模型,與專業對應物相比,在圖像,視頻和音頻理解中實現了競爭性能。OLA的核心設計在于其漸進式形態對準策略,該策略逐漸擴展了語言模型的支持方式。我們的培訓管道始于最不同的方式:圖像和文本,然后使用連接語言和音頻知識的語音數據以及連接所有模式的視頻數據逐漸擴展模型的技能。漸進式學習管道還使我們能夠保持相對較小的跨模式對齊數據,從而使現有視覺模型的Omni-Modal模型變得更加容易且成本較低。此外,要解鎖GPT-4O等先進的互動體驗,我們進一步設計了句子解碼的解決方案,用于流式語音生成。廣泛的實驗表明,與具有類似尺寸的最新專業模型相比,OLA在所有模式中超過了現有的Omni-Modal LLM,同時實現了高度競爭性能。我們的目標是使Ola成為一個完全開放的Omni-Modal理解解決方案,以推進這個新興領域的未來研究。型號的權重,代碼和數據在https://github.com/ola-omni/ola上進行開源。
?
DynVFX:用動態內容增強真實視頻
-
標題: DynVFX: Augmenting Real Videos with Dynamic Content
-
作者: Danah Yatim, Rafail Fridman, Omer Bar-Tal, Tali Dekel
-
日期: 2025-02-05
-
論文鏈接: https://arxiv.org/pdf/2502.03621
英文摘要
We present a method for augmenting real-world videos with newly generated dynamic content. Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects that naturally interact with the existing scene over time. The position, appearance, and motion of the new content are seamlessly integrated into the original footage while accounting for camera motion, occlusions, and interactions with other dynamic objects in the scene, resulting in a cohesive and realistic output video. We achieve this via a zero-shot, training-free framework that harnesses a pre-trained text-to-video diffusion transformer to synthesize the new content and a pre-trained Vision Language Model to envision the augmented scene in detail. Specifically, we introduce a novel inference-based method that manipulates features within the attention mechanism, enabling accurate localization and seamless integration of the new content while preserving the integrity of the original scene. Our method is fully automated, requiring only a simple user instruction. We demonstrate its effectiveness on a wide range of edits applied to real-world videos, encompassing diverse objects and scenarios involving both camera and object motion.
中文摘要
我們提出一種用新生成的動態內容增強真實世界視頻的方法。給定一段輸入視頻和一條描述所需內容的簡單用戶文本指令,我們的方法會合成動態物體或複雜的場景效果,使其隨時間自然地與現有場景交互。新內容的位置、外觀和運動被無縫地整合進原始素材,同時兼顧相機運動、遮擋以及與場景中其他動態物體的交互,從而產生連貫且逼真的輸出視頻。我們通過一個零樣本、免訓練的框架實現這一點:利用預訓練的文本到視頻擴散Transformer合成新內容,並利用預訓練的視覺語言模型詳細構想增強後的場景。具體而言,我們引入一種新穎的基於推理的方法,在注意力機制內部操縱特徵,在保持原始場景完整性的同時,實現新內容的精確定位和無縫整合。我們的方法完全自動化,只需要一條簡單的用戶指令。我們在應用於真實視頻的大量編輯上展示了其有效性,涵蓋多種物體以及同時涉及相機和物體運動的場景。
?
SafeRAG:大型語言模型檢索增強生成的安全性基準測試
-
標題: SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
-
作者: Xun Liang, Simin Niu, Zhiyu Li, Sensen Zhang, Hanyu Wang, Feiyu Xiong, Jason Zhaoxin Fan, Bo Tang, Shichao Song, Mengwei Wang, Jiawei Yang
-
日期: 2025-01-28
-
論文鏈接: https://arxiv.org/pdf/2501.18636
英文摘要
The indexing-retrieval-generation paradigm of retrieval-augmented generation (RAG) has been highly successful in solving knowledge-intensive tasks by integrating external knowledge into large language models (LLMs). However, the incorporation of external and unverified knowledge increases the vulnerability of LLMs because attackers can perform attack tasks by manipulating knowledge. In this paper, we introduce a benchmark named SafeRAG designed to evaluate the RAG security. First, we classify attack tasks into silver noise, inter-context conflict, soft ad, and white Denial-of-Service. Next, we construct RAG security evaluation dataset (i.e., SafeRAG dataset) primarily manually for each task. We then utilize the SafeRAG dataset to simulate various attack scenarios that RAG may encounter. Experiments conducted on 14 representative RAG components demonstrate that RAG exhibits significant vulnerability to all attack tasks and even the most apparent attack task can easily bypass existing retrievers, filters, or advanced LLMs, resulting in the degradation of RAG service quality. Code is available at: https://github.com/IAAR-Shanghai/SafeRAG.
中文摘要
檢索增強生成(RAG)的「索引-檢索-生成」範式通過將外部知識整合進大型語言模型(LLM),在解決知識密集型任務方面非常成功。然而,引入外部且未經驗證的知識增加了LLM的脆弱性,因為攻擊者可以通過操縱知識來執行攻擊任務。在本文中,我們引入一個名為SafeRAG的基準,旨在評估RAG的安全性。首先,我們將攻擊任務分為四類:銀色噪聲(silver noise)、上下文間衝突、軟性廣告和白色拒絕服務。接著,我們主要以人工方式為每類任務構建RAG安全評估數據集(即SafeRAG數據集)。然後,我們利用SafeRAG數據集模擬RAG可能遇到的各種攻擊場景。在14個代表性RAG組件上進行的實驗表明,RAG對所有攻擊任務都表現出明顯的脆弱性,即使是最顯而易見的攻擊任務也能輕易繞過現有的檢索器、過濾器或先進的LLM,導致RAG服務質量下降。代碼見:https://github.com/IAAR-Shanghai/SafeRAG。
?
ACECODER:通過自動化測試用例合成精進編碼器RL
-
標題: ACECODER: Acing Coder RL via Automated Test-Case Synthesis
-
作者: Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, Wenhu Chen
-
日期: 2025-02-03
-
論文鏈接: https://arxiv.org/pdf/2502.01718
英文摘要
Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/model in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipeline that generates extensive (question, test-cases) pairs from existing code data. Using these test cases, we construct preference pairs based on pass rates over sampled programs to train reward models with Bradley-Terry loss. It shows an average of 10-point improvement for Llama-3.1-8B-Ins and 5-point improvement for Qwen2.5-Coder-7B-Ins through best-of-32 sampling, making the 7B model on par with 236B DeepSeek-V2.5. Furthermore, we conduct reinforcement learning with both reward models and test-case pass rewards, leading to consistent improvements across HumanEval, MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow the R1-style training to start from Qwen2.5-Coder-base directly and show that our RL training can improve model on HumanEval-plus by over 25% and MBPP-plus by 6% for merely 80 optimization steps. We believe our results highlight the huge potential of reinforcement learning in coder models.
中文摘要
近期編碼器模型的大部分進展由監督微調(SFT)驅動,而強化學習(RL)的潛力在很大程度上仍未被探索,這主要是因為代碼領域缺乏可靠的獎勵數據/模型。在本文中,我們通過利用自動化的大規模測試用例合成來增強代碼模型訓練,從而應對這一挑戰。具體而言,我們設計了一條管線,從現有代碼數據中生成大量(問題,測試用例)對。利用這些測試用例,我們根據採樣程序的通過率構建偏好對,用Bradley-Terry損失訓練獎勵模型。通過best-of-32採樣,該獎勵模型使Llama-3.1-8B-Ins平均提升10分,使Qwen2.5-Coder-7B-Ins提升5分,讓7B模型達到與236B的DeepSeek-V2.5相當的水平。此外,我們同時使用獎勵模型和測試用例通過獎勵進行強化學習,在HumanEval、MBPP、BigCodeBench和LiveCodeBench(V4)上取得一致的提升。值得注意的是,我們採用R1風格的訓練,直接從Qwen2.5-Coder-Base出發,結果表明僅需80個優化步,我們的RL訓練就能使模型在HumanEval-plus上提升超過25%,在MBPP-plus上提升6%。我們相信這些結果突顯了強化學習在編碼器模型中的巨大潛力。
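摘要描述的獎勵模型訓練分兩步:按測試用例通過率構造偏好對,再用Bradley-Terry損失擬合。下面的草稿示意這兩步;通過率差距閾值margin與接口均為假設:
```python
# 示意:由通過率構造 (較優, 較劣) 程序偏好對,並以 Bradley-Terry 損失訓練獎勵模型。

import torch.nn.functional as F

def make_pairs(programs: list, pass_rates: list, margin: float = 0.4):
    pairs = []
    for i, p_i in enumerate(programs):
        for j, p_j in enumerate(programs):
            if pass_rates[i] - pass_rates[j] >= margin:  # 差距足夠大才成對
                pairs.append((p_i, p_j))
    return pairs

def bradley_terry_loss(r_chosen, r_rejected):
    """r_*: 獎勵模型對較優/較劣程序的標量打分(張量)。"""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```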
?
逆橋匹配蒸餾
-
標題: Inverse Bridge Matching Distillation
-
作者: Nikita Gushchin, David Li, Daniil Selikhanovych, Evgeny Burnaev, Dmitry Baranchuk, Alexander Korotin
-
日期: 2025-02-03
-
論文鏈接: https://arxiv.org/pdf/2502.01362
英文摘要
Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from the problem of slow inference. To address it, we propose a novel distillation technique based on the inverse bridge matching formulation and derive the tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional types of DBMs, distill models in a one-step generator, and use only the corrupted images for training. We evaluate our approach for both conditional and unconditional types of bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique allows us to accelerate the inference of DBMs from 4x to 100x and even provide better generation quality than used teacher model depending on particular setup.
中文摘要
學習擴散橋模型很容易;讓它們又快又實用則是一門藝術。擴散橋模型(DBM)是擴散模型在圖像到圖像翻譯應用中的一個有前景的擴展。然而,與許多現代擴散和流模型一樣,DBM存在推理緩慢的問題。為了解決這一問題,我們提出一種基於逆橋匹配(inverse bridge matching)公式的新型蒸餾技術,並推導出可在實踐中求解它的可處理目標。與此前開發的DBM蒸餾技術不同,所提方法既能蒸餾條件型也能蒸餾無條件型的DBM,能把模型蒸餾為單步生成器,且僅需使用受損圖像進行訓練。我們在包括超分辨率、JPEG修復、草圖到圖像以及其他任務在內的多種設置上,評估了條件與無條件兩類橋匹配的蒸餾方法,結果表明我們的蒸餾技術可將DBM的推理加速4倍到100倍,並且在特定設置下甚至能提供比所用教師模型更好的生成質量。
?
Llasa:為基於Llama的語音合成擴展訓練時與推理時計算
-
標題: Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
-
作者: Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi DAI, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue
-
日期: 2025-02-06
-
論文鏈接: https://arxiv.org/pdf/2502.04128
英文摘要
Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose a simple framework Llasa for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we released the checkpoint and training code for our TTS model (1B, 3B, 8B) and codec model publicly available.
中文摘要
基于文本的大語言模型(LLM)的最新進展,尤其是在GPT系列和O1模型中,已經證明了擴展訓練時間和推理時間計算的有效性。但是,利用LLM的當前最新TTS系統通常是多階段,需要單獨的模型(例如,LLM后的擴散模型),使是否在訓練或測試過程中縮放特定模型的決定變得復雜。這項工作做出了以下貢獻:首先,我們探討了火車時間和推理時間計算的縮放語音合成。其次,我們提出了一個簡單的框架LLASA用于語音合成,該框架采用單層矢量量化器(VQ)編解碼器和單個Transformers 體系結構,以完全與標準LLMS(例如Llama)完全保持一致。我們的實驗表明,LLASA的縮放列車時間計算始終改善綜合語音的自然性,并能夠產生更復雜,更準確的韻律模式。此外,從縮放推理時間計算的角度來看,我們在搜索過程中采用語音理解模型作為驗證者,發現縮放推理時間計算會將采樣模式轉移到特定驗證者的偏好上,從而提高情緒表達,音色的一致性和內容準確性。此外,我們發布了TTS模型(1B,3B,8B)和編解碼器模型的檢查點和培訓代碼。
?
SliderSpace:分解擴散模型的視覺能力
-
標題: SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
-
作者: Rohit Gandikota, Zongze Wu, Richard Zhang, David Bau, Eli Shechtman, Nick Kolkin
-
日期: 2025-02-03
-
論文鏈接: https://arxiv.org/pdf/2502.01639
-
項目鏈接: https://sliderspace.baulab.info
英文摘要
We present SliderSpace, a framework for automatically decomposing the visual capabilities of diffusion models into controllable and human-understandable directions. Unlike existing control methods that require a user to specify attributes for each edit direction individually, SliderSpace discovers multiple interpretable and diverse directions simultaneously from a single text prompt. Each direction is trained as a low-rank adaptor, enabling compositional control and the discovery of surprising possibilities in the model’s latent space. Through extensive experiments on state-of-the-art diffusion models, we demonstrate SliderSpace’s effectiveness across three applications: concept decomposition, artistic style exploration, and diversity enhancement. Our quantitative evaluation shows that SliderSpace-discovered directions decompose the visual structure of model’s knowledge effectively, offering insights into the latent capabilities encoded within diffusion models. User studies further validate that our method produces more diverse and useful variations compared to baselines. Our code, data and trained weights are available at https://sliderspace.baulab.info
中文摘要
我們提出SliderSpace,一個將擴散模型的視覺能力自動分解為可控且人類可理解方向的框架。與需要用戶為每個編輯方向單獨指定屬性的現有控制方法不同,SliderSpace從單條文本提示中同時發現多個可解釋且多樣的方向。每個方向被訓練為一個低秩適配器,實現組合式控制,並能發現模型潛在空間中令人驚訝的可能性。通過在最先進的擴散模型上的大量實驗,我們展示了SliderSpace在三類應用中的有效性:概念分解、藝術風格探索和多樣性增強。我們的定量評估表明,SliderSpace發現的方向有效地分解了模型知識的視覺結構,為擴散模型中編碼的潛在能力提供了洞見。用戶研究進一步驗證,與基線相比,我們的方法產生了更多樣、更有用的變化。我們的代碼、數據和訓練好的權重見https://sliderspace.baulab.info
?
自監督量化表示,用于無縫集成知識圖譜與大型語言模型
-
標題: Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models
-
作者: Qika Lin, Tianzhe Zhao, Kai He, Zhen Peng, Fangzhi Xu, Ling Huang, Jingying Ma, Mengling Feng
-
日期: 2025-01-30
-
論文鏈接: https://arxiv.org/pdf/2501.18119
英文摘要
Due to the presence of the natural gap between Knowledge Graph (KG) structures and the natural language, the effective integration of holistic structural information of KGs with Large Language Models (LLMs) has emerged as a significant question. To this end, we propose a two-stage framework to learn and apply quantized codes for each entity, aiming for the seamless integration of KGs with LLMs. Firstly, a self-supervised quantized representation (SSQR) method is proposed to compress both KG structural and semantic knowledge into discrete codes (i.e., tokens) that align the format of language sentences. We further design KG instruction-following data by viewing these learned codes as features to directly input to LLMs, thereby achieving seamless integration. The experiment results demonstrate that SSQR outperforms existing unsupervised quantized methods, producing more distinguishable codes. Further, the fine-tuned LLaMA2 and LLaMA3.1 also have superior performance on KG link prediction and triple classification tasks, utilizing only 16 tokens per entity instead of thousands in conventional prompting methods.
中文摘要
由于知識圖(kg)結構與自然語言之間存在自然差距,因此出現了KG的整體結構信息與大語言模型(LLMS)的有效整合是一個重要的問題。為此,我們提出了一個兩階段的框架,以學習和應用每個實體的量化代碼,旨在將KG與LLMS無縫集成。首先,提出了一種自我監督的量化表示(SSQR)方法,以將KG結構知識和語義知識壓縮為離合語言句子格式的離散代碼(\ ie,tokens)。我們通過將這些學習的代碼視為直接輸入LLM的功能,從而進一步設計KG指導跟隨數據,從而實現無縫集成。實驗結果表明,SSQR的表現優于現有的無監督量化方法,從而產生更多可區分的代碼。此外,微調的Llama2和Llama3.1在KG鏈接預測和三重分類任務上也具有出色的性能,在常規提示方法中僅利用16個令牌,而不是數千個代幣。
?
BOLT:無需蒸餾即可自舉語言模型的長思維鏈
-
標題: BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation
-
作者: Bo Pang, Hanze Dong, Jiacheng Xu, Silvio Savarese, Yingbo Zhou, Caiming Xiong
-
日期: 2025-02-06
-
論文鏈接: https://arxiv.org/pdf/2502.03860
英文摘要
Large language models (LLMs), such as o1 from OpenAI, have demonstrated remarkable reasoning capabilities. o1 generates a long chain-of-thought (LongCoT) before answering a question. LongCoT allows LLMs to analyze problems, devise plans, reflect, and backtrack effectively. These actions empower LLM to solve complex problems. After the release of o1, many teams have attempted to replicate its LongCoT and reasoning capabilities. In terms of methods, they primarily rely on knowledge distillation with data from existing models with LongCoT capacities (e.g., OpenAI-o1, Qwen-QwQ, DeepSeek-R1-Preview), leaving significant uncertainties on systematically developing such reasoning abilities. In terms of data domains, these works focus narrowly on math while a few others include coding, limiting their generalizability. This paper introduces a novel approach to enable LLM’s LongCoT capacity without distillation from o1-like models or expensive human annotations, where we bootstrap LongCoT (BOLT) from a standard instruct model. BOLT involves three stages: 1) LongCoT data bootstrapping with in-context learning on a standard instruct model; 2) LongCoT supervised finetuning; 3) online training to further refine LongCoT capacities. In BOLT, only a few in-context examples need to be constructed during the bootstrapping stage; in our experiments, we created 10 examples, demonstrating the feasibility of this approach. We use Llama-3.1-70B-Instruct to bootstrap LongCoT and apply our method to various model scales (7B, 8B, 70B). We achieve impressive performance on a variety of benchmarks, Arena-Hard, MT-Bench, WildBench, ZebraLogic, MATH500, which evaluate diverse task-solving and reasoning capabilities.
中文摘要
大型語言模型(LLM),例如OpenAI的o1,已展示出卓越的推理能力。o1在回答問題之前會生成一條長思維鏈(LongCoT)。LongCoT使LLM能夠分析問題、制定計劃、反思並有效回溯,這些行為使LLM能夠解決複雜問題。o1發布後,許多團隊試圖復現其LongCoT和推理能力。在方法上,它們主要依賴對具備LongCoT能力的現有模型(如OpenAI-o1、Qwen-QwQ、DeepSeek-R1-Preview)的數據進行知識蒸餾,這使得如何系統地發展此類推理能力仍存在重大不確定性。在數據領域上,這些工作狹窄地聚焦於數學,少數包含編碼,限制了其泛化能力。本文提出一種新穎方法,使LLM無需從類o1模型蒸餾或依賴昂貴的人工標注即可獲得LongCoT能力:我們從一個標準指令模型出發自舉LongCoT(BOLT)。BOLT包含三個階段:1)在標準指令模型上通過上下文學習自舉LongCoT數據;2)LongCoT監督微調;3)在線訓練以進一步精煉LongCoT能力。在BOLT中,自舉階段只需構造少量上下文示例;在我們的實驗中,我們創建了10個示例,證明了該方法的可行性。我們使用Llama-3.1-70B-Instruct來自舉LongCoT,並將方法應用於多種模型規模(7B、8B、70B)。我們在Arena-Hard、MT-Bench、WildBench、ZebraLogic、MATH500等評估多種任務求解和推理能力的基準上取得了出色的表現。
?
在語言模型中縮放嵌入層
-
標題: Scaling Embedding Layers in Language Models
-
作者: Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, Chiyuan Zhang
-
日期: 2025-02-03
-
論文鏈接: https://arxiv.org/pdf/2502.01637
英文摘要
We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a method for extending input embedding layers to enhance language model performance as layer size scales. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide contextualized representation for each input token and are learned with a separate model during training. During inference, they are precomputed and stored in off-accelerator memory with minimal impact on inference speed. SCONE enables two new scaling strategies: increasing the number of cached n-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
中文摘要
我們提出SCONE(Scalable, Contextualized, Offloaded, N-gram Embedding),一種擴展輸入嵌入層、使語言模型性能隨嵌入層規模提升的方法。為避免增加解碼成本,SCONE保留原始詞表,同時為一組高頻n-gram引入嵌入。這些嵌入為每個輸入token提供上下文化表示,並在訓練期間用一個單獨的模型學習。在推理期間,它們被預先計算並存儲在加速器之外的內存中,對推理速度的影響極小。SCONE帶來兩種新的縮放策略:增加緩存的n-gram嵌入數量,以及擴大用於學習它們的模型,而這一切都在保持推理時FLOPS固定的前提下進行。我們表明,同時擴展這兩個方面,使SCONE在多種語料上超越1.9B參數的基線,而推理時FLOPS僅為其一半。
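摘要的關鍵在於:推理時對最近的n-gram查一張預計算的嵌入表(可放在加速器之外),命中則與原token嵌入疊加,因此不增加推理時FLOPS。下面的PyTorch草稿示意這一查找;「取最長匹配」與「相加融合」均為我們的示意性假設:
```python
# 示意:SCONE 式的 n-gram 嵌入查找。table 為預計算的 n-gram -> 嵌入向量映射,
# 可存放在加速器外;未命中時回退到更短的 n-gram,最後只用原 token 嵌入。

import torch

def contextual_embedding(tokens: list, table: dict,
                         token_embed: torch.Tensor, max_n: int = 3) -> torch.Tensor:
    base = token_embed[tokens[-1]]            # 原詞表嵌入保持不變
    for n in range(max_n, 1, -1):             # 優先匹配最長的 n-gram
        key = tuple(tokens[-n:])
        if key in table:
            return base + table[key]          # 疊加上下文化的 n-gram 嵌入
    return base
```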
?
DeepRAG:面向大型語言模型的逐步思考式檢索
-
標題: DeepRAG: Thinking to Retrieval Step by Step for Large Language Models
-
作者: Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Jie Zhou
-
日期: 2025-02-03
-
論文鏈接: https://arxiv.org/pdf/2502.01142
英文摘要
Large Language Models (LLMs) have shown remarkable potential in reasoning while they still suffer from severe factual hallucinations due to timeliness, accuracy, and coverage of parametric knowledge. Meanwhile, integrating reasoning with retrieval-augmented generation (RAG) remains challenging due to ineffective task decomposition and redundant retrieval, which can introduce noise and degrade response quality. In this paper, we propose DeepRAG, a framework that models retrieval-augmented reasoning as a Markov Decision Process (MDP), enabling strategic and adaptive retrieval. By iteratively decomposing queries, DeepRAG dynamically determines whether to retrieve external knowledge or rely on parametric reasoning at each step. Experiments show that DeepRAG improves retrieval efficiency while improving answer accuracy by 21.99%, demonstrating its effectiveness in optimizing retrieval-augmented reasoning.
中文摘要
大型語言模型(LLM)展現出卓越的推理潛力,但由於參數化知識的時效性、準確性和覆蓋範圍有限,它們仍然飽受嚴重的事實幻覺之苦。同時,由於任務分解低效和冗餘檢索會引入噪聲並降低回答質量,將推理與檢索增強生成(RAG)相結合仍具挑戰性。在本文中,我們提出DeepRAG,一個將檢索增強推理建模為馬爾可夫決策過程(MDP)的框架,從而實現有策略的自適應檢索。通過迭代地分解查詢,DeepRAG在每一步動態決定是檢索外部知識還是依賴參數化推理。實驗表明,DeepRAG在提升檢索效率的同時,將答案準確率提高了21.99%,證明了其在優化檢索增強推理方面的有效性。
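把檢索增強推理視為MDP,意味著每一步都在「檢索」與「參數化推理」兩個動作之間做決策。下面的Python草稿示意這一逐步流程;decompose、should_retrieve等接口均為假設:
```python
# 示意:DeepRAG 式的逐步決策——對每個子查詢選擇檢索外部知識或依賴參數化推理。

def deeprag_answer(question: str, decompose, should_retrieve, retrieve, reason,
                   max_steps: int = 5) -> str:
    context = []
    for _ in range(max_steps):
        sub_q = decompose(question, context)   # 生成下一個子查詢;返回 None 表示結束
        if sub_q is None:
            break
        if should_retrieve(sub_q, context):    # MDP 中的「檢索」動作
            context.append(retrieve(sub_q))
        else:                                  # 「參數化推理」動作
            context.append(reason(sub_q, context))
    return reason(question, context)           # 匯總得到最終答案
```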
?
MM-IQ:多模態模型中類人抽象與推理的基準測試
-
標題: MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models
-
作者: Huanqia Cai, Yijun Yang, Winston Hu
-
日期: 2025-02-02
-
論文鏈接: https://arxiv.org/pdf/2502.00698
英文摘要
IQ testing has served as a foundational methodology for evaluating human cognitive capabilities, deliberately decoupling assessment from linguistic background, language proficiency, or domain-specific knowledge to isolate core competencies in abstraction and reasoning. Yet, artificial intelligence research currently lacks systematic benchmarks to quantify these critical cognitive dimensions in multimodal systems. To address this critical gap, we propose MM-IQ, a comprehensive evaluation framework comprising 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms. Through systematic evaluation of leading open-source and proprietary multimodal models, our benchmark reveals striking limitations: even state-of-the-art architectures achieve only marginally superior performance to random chance (27.49% vs. 25% baseline accuracy). This substantial performance chasm highlights the inadequacy of current multimodal systems in approximating fundamental human reasoning capacities, underscoring the need for paradigm-shifting advancements to bridge this cognitive divide.
中文摘要
智商(IQ)測試一直是評估人類認知能力的基礎方法,它刻意使評估與語言背景、語言熟練度或領域特定知識脫鉤,以分離出抽象和推理方面的核心能力。然而,人工智能研究目前缺乏系統的基準來量化多模態系統中這些關鍵的認知維度。為填補這一關鍵空白,我們提出MM-IQ,一個包含2710個精心策劃的測試項、涵蓋8種不同推理範式的綜合評估框架。通過對領先的開源和專有多模態模型的系統評估,我們的基準揭示出顯著的局限:即使是最先進的架構,其表現也僅略高於隨機水平(27.49%對25%的基線準確率)。這一巨大的性能鴻溝凸顯了當前多模態系統在逼近人類基本推理能力方面的不足,強調需要範式轉變式的進步來彌合這一認知鴻溝。
?