[Paper Digest] 2025 Week 04 (Robotics/Embodied AI/LLM)

Contents

  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  • Evolving Deeper LLM Thinking
  • Kimi k1.5: Scaling Reinforcement Learning with LLMs
  • Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
  • VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
  • MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
  • FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
  • SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
  • Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
  • GameFactory: Creating New Games with Generative Interactive Videos
  • Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
  • UI-TARS: Pioneering Automated GUI Interaction with Native Agents
  • Improving Video Generation with Human Feedback
  • PaSa: An LLM Agent for Comprehensive Academic Paper Search
  • Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models
  • TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
  • InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
  • Autonomy-of-Experts Models
  • Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
  • Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
  • Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
  • Reasoning Language Models: A Blueprint
  • Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
  • VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
  • O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

  • Authors: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.12948

Abstract

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.


Evolving Deeper LLM Thinking

  • Authors: Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, Xinyun Chen

  • Date: 2025-01-17

  • Paper link: https://arxiv.org/pdf/2501.09891

Abstract

We explore an evolutionary search strategy for scaling inference-time compute in Large Language Models. The proposed approach, Mind Evolution, uses a language model to generate, recombine and refine candidate responses. The approach avoids the need to formalize the underlying inference problem whenever a solution evaluator is available. Controlling for inference cost, we find that Mind Evolution significantly outperforms other inference strategies such as Best-of-N and Sequential Revision in natural language planning tasks. In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more than 98% of the problem instances using Gemini 1.5 Pro without the use of a formal solver.
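
The generate/recombine/refine loop described above maps onto a standard evolutionary search. Below is a minimal Python sketch of that loop; `generate`, `recombine`, `refine`, and `evaluate` are hypothetical stand-ins for LLM calls and the task-specific solution evaluator, not the paper's actual implementation.

```python
import random

def evolve(generate, recombine, refine, evaluate,
           population_size=8, generations=5):
    """Evolutionary search over candidate LLM responses (sketch).

    generate()          -> str         : sample a fresh candidate from the LLM
    recombine(a, b)     -> str         : ask the LLM to merge two candidates
    refine(c, feedback) -> str         : ask the LLM to revise a candidate
    evaluate(c)         -> (float, str): fitness score and textual feedback
    """
    population = [generate() for _ in range(population_size)]
    for _ in range(generations):
        scored = sorted(population, key=lambda c: evaluate(c)[0], reverse=True)
        parents = scored[: population_size // 2]        # keep the fittest half
        children = []
        while len(parents) + len(children) < population_size:
            a, b = random.sample(parents, 2)
            child = recombine(a, b)                     # crossover via the LLM
            score, feedback = evaluate(child)
            children.append(refine(child, feedback))    # mutation guided by the evaluator
        population = parents + children
    return max(population, key=lambda c: evaluate(c)[0])
```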


Kimi k1.5: Scaling Reinforcement Learning with LLMs

  • Authors: Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Zhao, Jin Zhang, Jingyuan Liu, Junjie Yan, Junyan Wu, Lidong Shi, Ling Ye, Longhui Yu, Mengnan Dong, Neo Zhang, Ningchen Ma, Qiwei Pan, Qucheng Gong, Shaowei Liu, Shengling Ma, Shupeng Wei, Sihan Cao, Siying Huang, Tao Jiang, Weihao Gao, Weimin Xiong, Weiran He, Weixiao Huang, Wenhao Wu, Wenyang He, Xianghui Wei, Xianqing Jia, Xingzhe Wu, Xinran Xu, Xinxing Zu, Xinyu Zhou, Xuehai Pan, Y. Charles, Yang Li, Yangyang Hu, Yangyang Liu, Yanru Chen, Yejie Wang, Yibo Liu, Yidao Qin, Yifeng Liu, Ying Yang, Yiping Bao, Yulun Du, Yuxin Wu, Yuzhi Wang, Zaida Zhou, Zhaoji Wang, Zhaowei Li, Zhen Zhu, Zheng Zhang, Zhexu Wang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Ziyao Xu, Zonghan Yang

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.12599

Abstract

Language model pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH 500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).


Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training

  • Authors: Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, Jiecao Chen

  • Date: 2025-01-20

  • Paper link: https://arxiv.org/pdf/2501.11425

Abstract

Large Language Model (LLM) agents are increasingly pivotal for addressing complex tasks in interactive environments. Existing work mainly focuses on enhancing performance through behavior cloning from stronger experts, yet such approaches often falter in real-world applications, mainly due to the inability to recover from errors. However, step-level critique data is difficult and expensive to collect. Automating and dynamically constructing self-critique datasets is thus crucial to empowering models with intelligent agent capabilities. In this work, we propose an iterative self-training framework, Agent-R, that enables language Agents to Reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent-R leverages MCTS to construct training data that recover correct trajectories from erroneous ones. A key challenge of agent reflection lies in the necessity for timely revision rather than waiting until the end of a rollout. To address this, we introduce a model-guided critique construction mechanism: the actor model identifies the first error step (within its current capability) in a failed trajectory. Starting from it, we splice it with the adjacent correct path, which shares the same parent node in the tree. This strategy enables the model to learn reflection based on its current policy, therefore yielding better learning efficiency. To further explore the scalability of this self-improvement paradigm, we investigate iterative refinement of both error correction capabilities and dataset construction. Our findings demonstrate that Agent-R continuously improves the model's ability to recover from errors and enables timely error correction. Experiments on three interactive environments show that Agent-R effectively equips agents to correct erroneous actions while avoiding loops, achieving superior performance compared to baseline methods (+5.59%).
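
The splicing step described above can be pictured with a small sketch. The data layout (trajectories as lists of steps, an error index supplied by the actor model, a fixed reflection utterance) is assumed for illustration; the paper's MCTS bookkeeping is more involved.

```python
def splice_revision_trajectory(bad_traj, good_traj, first_error_step, revision_msg):
    """Build a reflection training trajectory by splicing (sketch).

    bad_traj / good_traj: lists of (state, action) steps that share the same
    prefix up to the parent node of the first error, as siblings in a search tree.
    first_error_step: index of the first incorrect action in bad_traj,
    identified by the actor model itself.
    revision_msg: a short reflection utterance, e.g. "This action is wrong;
    let me reconsider."
    """
    shared_prefix = bad_traj[:first_error_step]       # steps before the error
    erroneous_step = [bad_traj[first_error_step]]     # keep the mistake so the model sees what to revise
    correction = [("reflection", revision_msg)]       # the timely revision signal
    good_suffix = good_traj[first_error_step:]        # sibling correct path from the same parent
    return shared_prefix + erroneous_step + correction + good_suffix
```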


VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

  • Authors: Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.13106

  • Project link: https://github.com/DAMO-NLP-SG/VideoLLaMA3

Abstract

In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the vision-centric training paradigm and the vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding. Instead of preparing massive video-text datasets, we focus on constructing large-scale and high-quality image-text datasets. VideoLLaMA3 has four training stages: 1) the vision-centric alignment stage, which warms up the vision encoder and projector; 2) the vision-language pretraining stage, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, charts) as well as text-only data; 3) the multi-task fine-tuning stage, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding; 4) video-centric fine-tuning, which further improves the model's capability in video understanding. As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into correspondingly many vision tokens, rather than a fixed number of tokens. For video inputs, we reduce the number of vision tokens according to their similarity so that the representation of videos is more precise and compact. Benefiting from these vision-centric designs, VideoLLaMA3 achieves compelling performance on both image and video understanding benchmarks.
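
The abstract does not spell out the exact token-reduction rule, but the idea of pruning vision tokens by similarity can be sketched as below; the greedy threshold rule over ordered frame tokens is a hypothetical illustration, not the paper's algorithm.

```python
import numpy as np

def reduce_video_tokens(tokens, threshold=0.9):
    """Drop temporally redundant vision tokens (sketch).

    tokens: array of shape (num_tokens, dim), ordered in time.
    A token is kept only if its cosine similarity to the last kept token
    is below `threshold`; otherwise it is treated as redundant.
    """
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    kept = [0]                                     # always keep the first token
    for i in range(1, len(tokens)):
        if normed[i] @ normed[kept[-1]] < threshold:
            kept.append(i)
    return tokens[kept]

# Example: 100 nearly identical frame tokens collapse to a handful.
tokens = np.random.randn(1, 64).repeat(100, axis=0) + 0.01 * np.random.randn(100, 64)
print(reduce_video_tokens(tokens).shape)
```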


MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

  • Authors: Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, Arman Cohan

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.12380

  • Project link: https://mmvu-benchmark.github.io/

Abstract

We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch, and we implement strict data quality controls to ensure the high quality of the dataset. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU. The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise. Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.


FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces

  • Authors: Zhenran Xu, Longyue Wang, Jifang Wang, Zhouyi Li, Senbao Shi, Xue Yang, Yiyu Wang, Baotian Hu, Jun Yu, Min Zhang

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.12909

Abstract

Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language agent-based societies, this paper introduces FilmAgent, a novel LLM-based multi-agent collaborative framework for end-to-end film automation in our constructed 3D virtual spaces. FilmAgent simulates various crew roles, including directors, screenwriters, actors, and cinematographers, and covers key stages of a film production workflow: (1) idea development transforms brainstormed ideas into structured story outlines; (2) scriptwriting elaborates on dialogue and character actions for each scene; (3) cinematography determines the camera setups for each shot. A team of agents collaborates through iterative feedback and revisions, thereby verifying intermediate scripts and reducing hallucinations. We evaluate the generated videos on 15 ideas and 4 key aspects. Human evaluation shows that FilmAgent outperforms all baselines across all aspects and scores 3.98 out of 5 on average, showing the feasibility of multi-agent collaboration in filmmaking. Further analysis reveals that FilmAgent, despite using the less advanced GPT-4o model, surpasses the single-agent o1, showing the advantage of a well-coordinated multi-agent system. Lastly, we discuss the complementary strengths and weaknesses of OpenAI's text-to-video model Sora and our FilmAgent in filmmaking.


SRMT: Shared Memory for Multi-agent Lifelong Pathfinding

  • Authors: Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.13200

Abstract

Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT), which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor, and on the POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared recurrent memory into transformer-based architectures can enhance coordination in decentralized multi-agent systems. The source code for training and evaluation is available on GitHub: https://github.com/Aloriosa/srmt.
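
A minimal sketch of the "pool and globally broadcast individual working memories" idea follows, assuming each agent keeps one memory vector and reads the pooled set with a single attention step; the actual SRMT architecture is recurrent and trained end-to-end, so treat this only as an illustration of the information flow.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_memory_read(memories, dim_key=16, seed=0):
    """Pool and broadcast per-agent working memories (sketch).

    memories: array (num_agents, dim), each agent's recurrent memory vector.
    Every agent attends over the full set of memories (its own included),
    which is one simple way to realize implicit information exchange.
    """
    rng = np.random.default_rng(seed)
    dim = memories.shape[1]
    w_q = rng.standard_normal((dim, dim_key)) / np.sqrt(dim)   # stand-in learned projections
    w_k = rng.standard_normal((dim, dim_key)) / np.sqrt(dim)
    q, k = memories @ w_q, memories @ w_k
    attn = softmax(q @ k.T / np.sqrt(dim_key))                 # (agents, agents)
    return attn @ memories                                     # row i: what agent i reads back

memories = np.random.randn(4, 32)                              # 4 agents, 32-dim memories
print(shared_memory_read(memories).shape)                      # (4, 32)
```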


Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models

  • Authors: Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.11873

Abstract

This paper revisits the implementation of the Load-balancing Loss (LBL) when training Mixture-of-Experts (MoE) models. Specifically, LBL for MoEs is defined as N_E · Σ_{i=1}^{N_E} f_i p_i, where N_E is the total number of experts, f_i represents the frequency of expert i being selected, and p_i denotes the average gating score of expert i. Existing MoE training frameworks usually employ the parallel training strategy, so f_i and the LBL are calculated within a micro-batch and then averaged across parallel groups. In essence, a micro-batch for training billion-scale LLMs normally contains very few sequences, so the micro-batch LBL is almost at the sequence level, and the router is pushed to distribute the tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a global-batch to loosen this constraint. Because a global-batch contains much more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize f_i across micro-batches and then use it to calculate the LBL. Through experiments on training MoE-based LLMs (up to 42.8B total parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.
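
Because the loss is given explicitly, the micro-batch vs. global-batch difference is easy to illustrate. The sketch below computes LBL = N_E · Σ_i f_i p_i over synthetic router outputs; the top-k normalization of f_i and the distributed synchronization (here simulated by simple concatenation) are assumptions rather than any specific framework's implementation.

```python
import numpy as np

def load_balancing_loss(gate_probs, top_k=2):
    """LBL = N_E * sum_i f_i * p_i for one batch of tokens (sketch).

    gate_probs: (num_tokens, num_experts) softmax router outputs.
    f_i: fraction of top-k selections that pick expert i.
    p_i: average gating probability assigned to expert i.
    """
    num_tokens, num_experts = gate_probs.shape
    topk = np.argsort(gate_probs, axis=1)[:, -top_k:]           # selected experts per token
    f = np.bincount(topk.ravel(), minlength=num_experts) / (num_tokens * top_k)
    p = gate_probs.mean(axis=0)
    return num_experts * float(f @ p)

# Four "domain" micro-batches, each routing mostly to one distinct expert.
rng = np.random.default_rng(0)
micro_batches = []
for domain_expert in range(4):
    logits = rng.standard_normal((64, 8))
    logits[:, domain_expert] += 4.0                              # this domain prefers one expert
    micro_batches.append(np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True))

micro_lbl = np.mean([load_balancing_loss(mb) for mb in micro_batches])  # high: punishes specialization
global_lbl = load_balancing_loss(np.concatenate(micro_batches))         # lower: corpus is still balanced
print(round(micro_lbl, 3), round(global_lbl, 3))
```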


GameFactory: Creating New Games with Generative Interactive Videos

  • Authors: Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu

  • Date: 2025-01-14

  • Paper link: https://arxiv.org/pdf/2501.08325

  • Project link: https://yujiwen.github.io/gamefactory/

Abstract

Generative game engines have the potential to revolutionize game development by autonomously creating new content and reducing manual workload. However, existing video-based game generation methods fail to address the critical challenge of scene generalization, limiting their applicability to existing games with fixed styles and scenes. In this paper, we present GameFactory, a framework focused on exploring scene generalization in game video generation. To enable the creation of entirely new and diverse games, we leverage pre-trained video diffusion models trained on open-domain video data. To bridge the domain gap between open-domain priors and the small-scale game dataset, we propose a multi-phase training strategy that decouples game style learning from action control, preserving open-domain generalization while achieving action controllability. Using Minecraft as our data source, we release GF-Minecraft, a high-quality and diverse action-annotated video dataset for research. Furthermore, we extend our framework to enable autoregressive action-controllable game video generation, allowing the production of unlimited-length interactive game videos. Experimental results demonstrate that GameFactory effectively generates open-domain, diverse, and action-controllable game videos, representing a significant step forward in AI-driven game generation. Our dataset and project page are publicly available at https://vvictoryuki.github.io/gamefactory/.


Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

  • Authors: Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.12895

Abstract

Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth during inference. Through case studies, we illustrate how TPO exploits the innate capacity of LLMs to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly. Our code is publicly available at https://github.com/yafuly/TPO.
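
A minimal sketch of the iterative textual-feedback loop follows, assuming hypothetical `llm`, `reward_model`, and `critic` callables; the concrete prompting templates and search schedule in TPO differ from this simplification.

```python
def tpo(prompt, llm, reward_model, critic, width=4, depth=3):
    """Test-time preference optimization loop (sketch).

    llm(prompt)                    -> str   : sample a candidate response
    reward_model(prompt, response) -> float : scalar preference score
    critic(prompt, best, worst)    -> str   : textual critique of why `best` beats `worst`
    `width` candidates per round, `depth` refinement rounds.
    """
    candidates = [llm(prompt) for _ in range(width)]
    for _ in range(depth):
        scored = sorted(candidates, key=lambda r: reward_model(prompt, r))
        worst, best = scored[0], scored[-1]
        critique = critic(prompt, best, worst)            # numerical reward -> textual reward
        revised_prompt = (f"{prompt}\n\nDraft answer:\n{best}\n\n"
                          f"Feedback:\n{critique}\n\nRewrite the answer accordingly.")
        candidates = [llm(revised_prompt) for _ in range(width)]
    return max(candidates, key=lambda r: reward_model(prompt, r))
```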


UI-TARS: Pioneering Automated GUI Interaction with Native Agents

  • Authors: Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.12326

Abstract

This paper introduces UI-TARS, a native GUI agent model that solely perceives screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc.; (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.


Improving Video Generation with Human Feedback

  • Authors: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang

  • Date: 2025-01-23

  • Paper link: https://arxiv.org/pdf/2501.13918

Abstract

Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multiple dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies, direct preference optimization for flow (Flow-DPO) and reward-weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.


PaSa: An LLM Agent for Comprehensive Academic Paper Search

  • Authors: Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, Weinan E

  • Date: 2025-01-17

  • Paper link: https://arxiv.org/pdf/2501.10120

Abstract

We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholarly queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4 for paraphrased queries, ChatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50. It also exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision. Model, datasets, and code are available at https://github.com/bytedance/pasa.


Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models

  • Authors: Zhenghao Lin, Zihao Tang, Xiao Liu, Yeyun Gong, Yi Cheng, Qi Chen, Hang Li, Ying Xin, Ziyue Yang, Kailai Yang, Yu Yan, Xiao Liang, Shuai Lu, Yiming Huang, Zheheng Luo, Lei Qu, Xuan Feng, Yaoxiang Wang, Yuqing Xia, Feiyang Chen, Yuting Jiang, Yasen Hu, Hao Ni, Binyang Li, Guoshuai Zhao, Jui-Hao Chiang, Zhongxin Guo, Chen Lin, Kun Kuang, Wenjie Li, Yelong Shen, Jian Jiao, Peng Cheng, Mao Yang

  • Date: 2025-01-23

  • Paper link: https://arxiv.org/pdf/2501.13629

Abstract

We introduce Sigma, an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention, and pre-trained on our meticulously collected system domain data. DiffQKV attention significantly enhances the inference efficiency of Sigma by optimizing the Query (Q), Key (K), and Value (V) components in the attention mechanism differentially, based on their varying impacts on model performance and efficiency indicators. Specifically, we (1) conduct extensive experiments that demonstrate the model's varying sensitivity to the compression of K and V components, leading to the development of differentially compressed KV, and (2) propose augmented Q to expand the Q head dimension, which enhances the model's representation capacity with minimal impact on inference speed. Rigorous theoretical and empirical analyses reveal that DiffQKV attention significantly enhances efficiency, achieving up to a 33.36% improvement in inference speed over conventional grouped-query attention (GQA) in long-context scenarios. We pre-train Sigma on 6T tokens from various sources, including 19.5B system domain data that we carefully collect and 1T tokens of synthesized and rewritten data. In general domains, Sigma achieves performance comparable to other state-of-the-art models. In the system domain, we introduce the first comprehensive benchmark, AIMicius, where Sigma demonstrates remarkable performance across all tasks, significantly outperforming GPT-4 with an absolute improvement of up to 52.5%.


TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space

  • Authors: Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, Tali Dekel

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.12224

Abstract

We present TokenVerse, a method for multi-concept personalization leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. Project webpage: https://token-verse.github.io/


InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

  • Authors: Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.12368

Abstract

Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) providing a supervisory signal for RL training: integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) selecting the best response from candidate responses for test-time scaling; and (3) filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer


Autonomy-of-Experts Models

  • Authors: Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.13074

Abstract

Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models having 700M up to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
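
A toy sketch of router-free expert selection by activation norm follows, using a low-rank first projection as the cheap pre-computation. The norm choice, gating weights, and factorization details here are assumptions for illustration rather than the paper's exact design.

```python
import numpy as np

def aoe_layer(x, down_projs, up_projs, top_k=2):
    """Autonomy-of-Experts style expert selection (sketch).

    x: (dim,) token activation.
    down_projs[i]: (dim, low_rank) factorized first projection of expert i
                   (the cheap part every expert pre-computes).
    up_projs[i]:   (low_rank, dim) the rest of expert i, run only if selected.
    Experts are ranked by the norm of their own pre-computed activation;
    no separate router is involved.
    """
    pre = [x @ w for w in down_projs]                    # every expert's cheap pre-activation
    norms = np.array([np.linalg.norm(h) for h in pre])   # self-assessed "competence"
    selected = np.argsort(norms)[-top_k:]                # top-k experts continue
    weights = norms[selected] / norms[selected].sum()    # assumed mixing weights
    return sum(w * np.maximum(pre[i], 0) @ up_projs[i]   # finish forward pass for winners
               for w, i in zip(weights, selected))

rng = np.random.default_rng(0)
dim, rank, n_experts = 64, 8, 4
down = [rng.standard_normal((dim, rank)) for _ in range(n_experts)]
up = [rng.standard_normal((rank, dim)) for _ in range(n_experts)]
print(aoe_layer(rng.standard_normal(dim), down, up).shape)   # (64,)
```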


Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

  • Authors: Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Zhichao Hu, Lei Qin, Jianbing Peng, Zhan Li, Minghui Chen, Xipeng Zhang, Lin Niu, Paige Wang, Yingkai Wang, Haozhao Kuang, Zhongyi Fan, Xu Zheng, Weihao Zhuang, YingPing He, Tian Liu, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Jingwei Huang, Chunchao Guo

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.12202

Abstract

We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model, Hunyuan3D-DiT, and a large-scale texture synthesis model, Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio, a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including open-source and closed-source models, in geometry details, condition alignment, texture quality, etc. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2


Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step

  • Authors: Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng

  • Date: 2025-01-23

  • Paper link: https://arxiv.org/pdf/2501.13926

Abstract

Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct unsatisfactory generated images. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT


Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong

  • Authors: Tairan Fu, Javier Conde, Gonzalo Martínez, María Grandury, Pedro Reviriego

  • Date: 2025-01-16

  • Paper link: https://arxiv.org/pdf/2501.09775

Abstract

One of the most widely used methods to evaluate LLMs is Multiple Choice Question (MCQ) tests. MCQ benchmarks enable the testing of LLM knowledge on almost any topic at scale, as the results can be processed automatically. To help the LLM answer, a few examples, called few shots, can be included in the prompt. Moreover, the LLM can be asked to answer the question directly with the selected option, or to first provide the reasoning and then the selected answer, which is known as chain of thought. In addition to checking whether the selected answer is correct, the evaluation can look at the LLM-estimated probability of its response as an indication of the confidence of the LLM in the response. In this paper, we study how LLM confidence in its answer depends on whether the model has been asked to answer directly or to provide the reasoning before answering. The results of the evaluation of questions on a wide range of topics in seven different models show that LLMs are more confident in their answers when they provide reasoning before the answer. This occurs regardless of whether the selected answer is correct. Our hypothesis is that this behavior is due to the reasoning modifying the probability of the selected answer, as the LLM predicts the answer based on the input question and the reasoning that supports the selection made. Therefore, LLM-estimated probabilities seem to have intrinsic limitations that should be understood in order to use them in evaluation procedures. Interestingly, the same behavior has been observed in humans, for whom explaining an answer increases confidence in its correctness.
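
The confidence measure discussed above is simply the probability mass the model places on the chosen option letter. Below is a minimal sketch, assuming the per-option log-probabilities at the answer position have already been obtained from the model or API (how to obtain them is model-specific and not shown).

```python
import math

def option_confidence(option_logprobs):
    """Confidence as the normalized probability of the chosen MCQ option (sketch).

    option_logprobs: dict mapping each option letter ("A".."D") to the
    log-probability the model assigned to that letter at the answer position.
    Returns the selected option and its probability renormalized over the options.
    """
    probs = {opt: math.exp(lp) for opt, lp in option_logprobs.items()}
    total = sum(probs.values())
    best = max(probs, key=probs.get)
    return best, probs[best] / total

# Direct answering vs. answering after a reasoning chain would be compared by
# collecting option_logprobs under both prompting styles and contrasting the
# returned confidences.
print(option_confidence({"A": -0.2, "B": -2.3, "C": -3.0, "D": -4.1}))
```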


Reasoning Language Models: A Blueprint

  • Authors: Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler

  • Date: 2025-01-20

  • Paper link: https://arxiv.org/pdf/2501.11223

Abstract

Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), such as OpenAI's o1 and o3, DeepSeek-V3, and Alibaba's QwQ, have redefined AI's problem-solving capabilities by extending large language models (LLMs) with advanced reasoning mechanisms. Yet their high costs, proprietary nature, and complex architectures - uniquely combining Reinforcement Learning (RL), search heuristics, and LLMs - present accessibility and scalability challenges. To address these, we propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all RLM works. This blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy and value models, among others), and supervision schemes (Output-Based and Process-Based Supervision). We also provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation. By showing how schemes like LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint's versatility and unifying potential. To illustrate its utility, we introduce x1, a modular implementation for rapid RLM prototyping and experimentation. Using x1 and a literature review, we provide key insights, such as multi-phase training for policy and value models, and the importance of familiar training distributions. Finally, we outline how RLMs can integrate with a broader LLM ecosystem, including tools and databases. Our work demystifies RLM construction, democratizes advanced reasoning capabilities, and fosters innovation, aiming to mitigate the gap between "rich AI" and "poor AI" by lowering barriers to RLM development and experimentation.


Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks

  • Authors: Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji

  • Date: 2025-01-20

  • Paper link: https://arxiv.org/pdf/2501.11733

Abstract

Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents -- Perceptor, Operator, Action Reflector, and Notetaker -- which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent.


VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

  • Authors: Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin

  • Date: 2025-01-16

  • Paper link: https://arxiv.org/pdf/2501.09781

Abstract

This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning, and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models in CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research.


O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning

  • Authors: Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.12570

Abstract

Recently, long-thought reasoning LLMs, such as OpenAI's O1, have adopted extended reasoning processes similar to how humans ponder over complex problems. This reasoning paradigm significantly enhances the model's problem-solving abilities and has achieved promising results. However, the long-thought reasoning process leads to a substantial increase in inference time. A pressing challenge is reducing the inference overhead of long-thought LLMs while ensuring accuracy. In this paper, we experimentally demonstrate that long-thought reasoning models struggle to effectively allocate token budgets based on problem difficulty and reasoning redundancies. To address this, we propose Length-Harmonizing Fine-Tuning (O1-Pruner), aiming at minimizing reasoning overhead while maintaining accuracy. This fine-tuning method first estimates the LLM's baseline performance through pre-sampling and then uses RL-style fine-tuning to encourage the model to generate shorter reasoning processes under accuracy constraints. This allows the model to achieve efficient reasoning with lower redundancy while maintaining accuracy. Experiments on various mathematical reasoning benchmarks show that O1-Pruner not only significantly reduces inference overhead but also achieves higher accuracy, providing a novel and promising solution to this challenge. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner
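
The abstract only states that a pre-sampled baseline is compared against shorter generations under an accuracy constraint. The reward below is an illustrative guess at such a length-harmonizing objective, not the paper's formula.

```python
def length_harmonizing_reward(pred_len, pred_correct,
                              baseline_len, baseline_acc, acc_weight=2.0):
    """One plausible reward for RL-style length pruning (sketch, assumed form).

    pred_len / pred_correct: length and correctness (0/1) of the sampled solution.
    baseline_len / baseline_acc: averages obtained by pre-sampling the reference
    model on the same problem.
    """
    length_gain = baseline_len / max(pred_len, 1) - 1.0    # positive when shorter than baseline
    accuracy_term = acc_weight * (float(pred_correct) - baseline_acc)
    return length_gain + accuracy_term

# A short, correct answer on a problem the baseline solves 60% of the time
# with ~800 reasoning tokens earns a positive reward:
print(length_harmonizing_reward(pred_len=300, pred_correct=1,
                                baseline_len=800, baseline_acc=0.6))
```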

