Table of Contents
- Preface
- Abstract
- I. Introduction
- II. Related Work
- 1. Visual reasoning with large language models
- 2. Chain-of-thought in large language models
- 3. Inference time scaling
- III. Method
- 1. Enhancing Reasoning Capability through Structured Thinking
- 1. Reasoning Stages
- 2. Data Preparation and Model Training
- 2. Effective Inference Time Scaling using Stage-level Beam Search
- IV. Post-Training Performance
- 1. Experimental Setup
- 2. Benchmark Results
- 3. Ablation Study
- V. Inference Time Scaling
- 1. Benchmark Results
- 2. Comparison to Baseline Methods
- 3. Scaling Trend of Stage-level Beam Search
- VI. Comparison to State-of-the-Art VLMs
- VII. Conclusion
- VIII. LLaVA-CoT Data Construction
- 1. Training Data Format
- 2. Data Construction (very important)
Preface
Note: LLaVA-CoT is not built on a LLaVA base model; it uses Llama-3.2-11B-Vision-Instruct [42] as its base model.
Paper: https://arxiv.org/abs/2411.10440
Code: https://github.com/PKU-YuanGroup/LLaVA-CoT
Dataset: https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k/tree/master
Abstract
Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI’s o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT1, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference time scaling method, LLaVA-CoT not only outperforms its base model by 7.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The code, dataset, and pre-trained weights are publicly available at https://github.com/PKU-YuanGroup/LLaVA-CoT.
I. Introduction
Large language models, represented by OpenAI o1 [67], demonstrate strong capabilities for systematic and in-depth reasoning, validating the efficacy of inference-time scaling for language models [50]. However, vision is equally important for enabling models to fully understand the world and extend their cognitive abilities [6]. Therefore, developing a multimodal model that integrates both language and vision while facilitating effective, systematic, and deep reasoning holds substantial significance.
Early open-source vision language models (VLMs) mainly employ a direct prediction approach [23, 32, 34], generating brief answers immediately in response to a question. The main limitation of this direct-response paradigm is its lack of a structured reasoning process, making it less effective for tasks demanding logical reasoning [66]. Recent studies have shown that incorporating Chain-of-Thought (CoT) reasoning encourages the model to reason step by step, significantly improving its question-answering capabilities [56]. However, even with CoT reasoning, most VLMs frequently produce errors or hallucinated outputs during the reasoning process [26, 33, 53].
Our findings suggest that a significant cause of these issues is the insufficiently systematic and structured nature of the reasoning process in existing VLMs. Specifically, by systematic, we mean that the model does not generate a direct reasoning chain but instead engages in multistage reasoning. Structured, on the other hand, refers to the model’s ability to clearly identify the reasoning stage it is in and understand the primary task to be addressed at each stage. We observe that VLMs often initiate responses without adequately organizing the problem and the available information. Moreover, they frequently deviate from logical reasoning toward conclusions, instead presenting a conclusion prematurely and subsequently attempting to justify it. Given that language models generate responses token-by-token, once an erroneous conclusion is introduced, the model typically continues along a flawed reasoning path. Examples of these issues can be found in Appendix A.
OpenAI o1 [67] addresses these issues effectively by enabling the model to independently engage in systematic and structured reasoning through language. Building on this insight, we design LLaVA-CoT. While the community has made some preliminary explorations into the underlying mechanisms of OpenAI o1 [45, 58], the model remains a black box, with its technical details largely unknown. This work demonstrates a potential way to enhance a model’s ability to perform autonomous, stage-by-stage reasoning by employing supervised fine-tuning. Specifically, LLaVA-CoT is capable of generating four distinct stages: summary, caption, reasoning, and conclusion. Each stage serves a unique purpose in the reasoning process.
- Summary: A brief outline in which the model summarizes the forthcoming task.
- Caption: A description of the relevant parts of an image (if present), focusing on elements related to the question.
- Reasoning: A detailed analysis in which the model systematically considers the question.
- Conclusion: A concise summary of the answer, providing a final response based on the preceding reasoning.
To enhance the understanding of the CoT process in LLMs, LLaVA-CoT marks each stage with a dedicated pair of tags (e.g., <SUMMARY>...</SUMMARY>) to denote its beginning and end. These tags enable the model to maintain clarity throughout the reasoning process. Unlike traditional CoT reasoning, which allows the model to think freely, our method promotes structured thinking by first organizing the problem and known information, followed by a detailed thought process, and then deriving a conclusion. To achieve this, we construct the LLaVA-CoT-100k dataset by generating responses stage by stage using GPT-4o [3] and then train the model using supervised fine-tuning.
The structured reasoning in LLaVA-CoT also facilitates efficient inference time scaling. In contrast to conventional scaling methods, such as best-of-N sampling [4, 55] and sentence-level beam search [17, 52], LLaVA-CoT employs a novel stage-level beam search method that generates multiple candidate results at each stage and selects the best one to continue the generation process.
We conduct experiments on several multimodal reasoning benchmarks, including MMStar [10], MMBench [35], MMVet [64], MathVista [37], AI2D [25], and HallusionBench [18], and observed that LLaVA-CoT offers two primary advantages: First, enabling the model to perform structured reasoning independently substantially outperforms traditional CoT prompting, particularly in complex reasoning tasks that require systematic analysis. Second, our stage-level beam search method is scalable and improves performance reliability, making it more effective in achieving stable and accurate results. Our contributions are summarized as follows:
- We introduce LLaVA-CoT, a visual language model designed for systematic reasoning, demonstrating exceptional performance on tasks that require structured thinking and reasoning.
- We demonstrate that LLaVA-CoT, using stage-level beam search, is inference-time scalable. This means that with increased computational resources, the performance of our approach can be further enhanced, making it applicable to more complex tasks and scenarios.
- Extensive experiments on various benchmarks demonstrate that our method achieves superior performance relative to larger and closed-source models, underscoring the effectiveness of LLaVA-CoT for multimodal reasoning.
II. Related Work
1. Visual reasoning with large language models
Visual reasoning demands the model’s visual perception capability and the high-level cognition ability [24, 39]. Several tasks have been applied to evaluate the visual reasoning ability of VLMs, including VQA [22, 28] requiring models to answer from visual contents and textual questions, and Visual Entailment [51] requiring models to determine the consistency of text descriptions and visual contents, etc. Traditional vision-language models employ neural symbolic approaches [5, 12] to explicitly model the visual reasoning process. With the development of LLMs, vision-language models leverage the advanced reasoning abilities of LLMs to interpret visual tasks [34, 63]. Some vision-language models enhance visual reasoning by optimizing the visual encoding strategy [23, 31, 34] to produce cognition-focused visual tokens. VISPROG [19] positions the LLM as a decision-making agent, enhancing visual reasoning by invoking task-specific visual modules. Hu et al. [20] improves reasoning capabilities through sequential instruction tuning. Additionally, instructing learning techniques for language models, including prompt tuning [65], in-context learning, and supervised fine-tuning [49], also contribute to improvements in visual reasoning capabilities.
2. Chain-of-thought in large language models
Chain-of-thought prompting [56] offers a step-by-step reasoning trajectory when LLM faces hard questions including commonsense reasoning [16, 47], logical reasoning [29, 59], etc. Specifically, CoT prompting decomposes the question into a group of reasoning steps and builds a chain to guide the model to generate the results of complex problems step-by-step [13]. Recent works have demonstrated that CoT prompting substantially improves the LLM’s capability on reasoning tasks. For instance, Prism [44] prompts LLMs by dividing the process into a perception stage and a reasoning stage. MSG [8] pioneers the use of forced Chain-of-Thoughts, establishing a new direction for structured prompting techniques.
3. Inference time scaling
Existing methods for inference time scaling fall into two main categories: those that rely on an external verifier for selection [27, 60] and those that operate independently of any external verifier [21, 57]. The external verifier selection methods can be employed in prevalent methods. On the other hand, inference time scaling methods that do not rely on an external verifier primarily include majority voting [21], best-of-N search [4, 55], and sentence-level beam search [17, 52]. Majority voting is effective for certain types of problems that have standard answers, but it is not suitable for open-ended tasks. Best-of-N search generates N complete answers and allows the model to select the best response. However, generating full answers for selection can complicate the evaluation of their accuracy. Sentence-level beam search generates multiple candidate sentences, selects the best one, and iteratively continues this process. However, this approach operates at too granular a level, which makes it difficult for the model to effectively assess the quality of its responses on a per-sentence basis.
III. Method
Our LLaVA-CoT facilitates a progressive, step-by-step reasoning process that enhances the reasoning capabilities of Vision-Language Models (VLMs) and allows for effective inference time scaling [50]. Using structured thinking, LLaVA-CoT achieves a systematic and efficient reasoning process. Its inference-time reasoning framework enables it to outperform existing methods in inference time scalability. This design ensures both robustness and accuracy in complex tasks requiring reasoning, which separates it from traditional approaches. Figure 1 illustrates our general framework of the reasoning process.
1. Enhancing Reasoning Capability through Structured Thinking
Our goal during training time is to develop a visual language model capable of extended chains of reasoning, allowing it to engage in systematic and in-depth reasoning.
1. Reasoning Stages
Our proposed model, LLaVA-CoT, decomposes the answer generation process into four structured reasoning stages:
- Summary Stage: In this initial phase, LLaVA-CoT provides a high-level summary interpretation of the question, outlining the primary aspects of the problem it intends to address.
- Caption Stage: If an image is present, LLaVA-CoT offers a concise overview of the visual elements relevant to the question, helping to understand multimodal input.
- Reasoning Stage: Building on the initial summary, LLaVA-CoT conducts structured, logical reasoning to derive a preliminary answer.
- Conclusion Stage: In this final stage, LLaVA-CoT synthesizes an answer based on the preceding reasoning. Here, the output from the conclusion stage is the direct response provided to the user, while the prior three stages are internal “hidden stages” representing LLaVA-CoT’s reasoning process. The output at this stage adapts to the user’s requirements: for instance, if the user requests a brief answer, the conclusion will be concise; if detailed explanations are desired, the conclusion provides a thorough, comprehensive response.
Each stage is initiated at the model’s discretion, without external prompt engineering frameworks or additional prompting. Specifically, we provide the model with four pairs of special tags: <SUMMARY></SUMMARY>, <CAPTION></CAPTION>, <REASONING></REASONING>, and <CONCLUSION></CONCLUSION>. These tags correspond to summarizing the response approach, describing relevant image content, conducting reasoning, and preparing a final answer, respectively.
Upon training, the model autonomously selects these tags as needed, activating each stage based on its own judgment. As with OpenAI o1 [67], all stages are completed by the model in a single inference pass. This structured approach enables the model to independently manage its reasoning process, improving its adaptability and performance on complex reasoning tasks.
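As an illustration of how the tagged output can be consumed downstream, here is a small sketch (my own, not from the paper) that splits a response into its four stages; the example response is invented.

```python
import re

STAGE_NAMES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_stages(response: str) -> dict:
    """Extract the text enclosed by each <STAGE>...</STAGE> tag pair."""
    parsed = {}
    for stage in STAGE_NAMES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", response, re.DOTALL)
        parsed[stage.lower()] = match.group(1).strip() if match else None
    return parsed

example = (
    "<SUMMARY>I will identify the objects in the image and count them.</SUMMARY>"
    "<CAPTION>The image shows three red cubes and one blue sphere.</CAPTION>"
    "<REASONING>Three cubes plus one sphere gives four objects in total.</REASONING>"
    "<CONCLUSION>4</CONCLUSION>"
)
print(parse_stages(example)["conclusion"])  # -> 4
```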
2. Data Preparation and Model Training
Most existing VQA datasets lack the detailed reasoning processes needed to train the LLaVA-CoT model. Therefore, we compile a new dataset, integrating samples from several widely used VQA datasets, resulting in a total of 99k image QA pairs (each pair may include one or multiple rounds of questioning). As shown in Figure 3, since no multimodal model currently exists that can directly produce systematic, structured reasoning, we use GPT-4o [3] to generate detailed reasoning processes, including summary, caption, reasoning, and conclusion, and compile these into the LLaVA-CoT-100k dataset, which we plan to release for public use. Details of the generation process and examples of the generated data are provided in Appendix B. We include data from both general-purpose VQA datasets and science-targeted VQA datasets specified below:
General VQA Datasets: We include several general-purpose VQA datasets with distinct focuses. ShareGPT4V [9] provides multi-turn question-answering data from GPT-4V [61] interactions. ChartQA [40] focuses on interpreting charts and graphs. A-OKVQA [48] emphasizes external knowledge beyond visible content. DocVQA [41] involves document-based questions requiring textual comprehension. We also include PISC [30] to understand social relationships, and CLEVR [24] to address object properties, spatial relationships, and counting tasks.
Science-Targeted VQA Datasets: These datasets include GeoQA+ [7] for geometric reasoning, along with AI2D [25] and ScienceQA [36], which target scientific questions. CLEVR-Math [14], an extension of CLEVR, focuses on arithmetic analysis in visual contexts. Table 1 shows the number of QA pairs selected from each dataset.
Model Training: The LLaVA-CoT-100k dataset we construct can be used to further conduct Supervised Fine-Tuning (SFT) on any existing model to enhance reasoning capabilities. In this work, we select the Llama-3.2-11B-Vision-Instruct [42] model as the base model, and perform full-parameter fine-tuning using the LLaVA-CoT-100k dataset. The training is conducted on a single node with 8 H100 GPUs. Details on the specific training parameters, including training epochs, learning rate, and optimization settings, are provided in Appendix C.
2. Effective Inference Time Scaling using Stage-level Beam Search
After training, our objective is to further enhance the model’s reasoning ability during inference. Specifically, we leverage the stage-based outputs of LLaVA-CoT, which provides an ideal granularity for inference time scaling. Our method follows the steps below:
- Sample N responses for the first stage in the solution.
- Randomly sample 2 responses and let the model determine which is better, keeping the better response.
- Repeat this comparison N − 1 times, retaining the best response.
- Sample N responses for the next stage, then repeat steps 2-4 until all stages are processed.
Notably, it is the structured output design of LLaVA-CoT that makes this approach feasible, enabling efficient and accurate verification at each stage. This validates the effectiveness of structured output in improving inference time scaling. An illustration of the three approaches is shown in Figure 4, and the detailed implementation of our stage-level beam search can be found in Appendix D.
Excerpt from Appendix D (stage-level beam search) begins:
D. Implementation Details of Stage-level Beam Search
In this section, we provide the implementation details of Stage-level Beam Search. As mentioned in the main paper, we randomly sample two responses and allow the model to determine which response is better, retaining the superior one. However, the implementation details for this process are omitted. Here, we elaborate on our method using prompt engineering within LLaVA-CoT, wherein the model acts as a judge to evaluate responses.
The general form of the prompt we use is:
Prompt for Stage-level Beam Search
Now you act as a judge, helping me determine which of the two texts I provide better provides a summary/caption/reasoning/conclusion to solve the question.
For each stage, we provide the model with specific guidance to ensure accurate evaluation:
- Summary stage: Please note that a better summary should focus on outlining the main approach instead of stating specific analytical reasoning or math formula.
- Caption stage: Please note that a better caption should be as thorough as possible while remaining accurate, capturing as many details as possible rather than providing only general commentary.
- Reasoning stage: Begin by thoroughly reviewing the question, followed by an in-depth examination of each answer individually, noting any differences. Subsequently, analyze these differences to determine which response demonstrates stronger reasoning and provide a clear conclusion.
- Conclusion stage: Please note that a better conclusion should align with the reasoning. The conclusion should never refuse to answer the question.
The model follows these guidelines to evaluate the two responses for each stage and selects the superior one, ensuring that the final output follows a high-quality reasoning pathway.
End of the Appendix D excerpt.
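To illustrate how such a judge prompt could be assembled in code, here is a rough sketch; the stage-specific hints paraphrase the guidance quoted above, and `build_judge_prompt` is my own hypothetical helper, not the authors' implementation.

```python
# Stage-specific guidance, paraphrased from the Appendix D excerpt above.
STAGE_HINTS = {
    "summary": "A better summary outlines the main approach instead of stating "
               "specific analytical reasoning or math formulas.",
    "caption": "A better caption is as thorough as possible while remaining "
               "accurate, capturing details rather than general commentary.",
    "reasoning": "Review the question, examine each answer individually, note "
                 "the differences, then decide which shows stronger reasoning.",
    "conclusion": "A better conclusion aligns with the reasoning and never "
                  "refuses to answer the question.",
}

def build_judge_prompt(stage: str, question: str, text_a: str, text_b: str) -> str:
    """Assemble the pairwise-comparison prompt for one reasoning stage."""
    return (
        "Now you act as a judge, helping me determine which of the two texts I "
        f"provide better provides a {stage} to solve the question.\n"
        f"{STAGE_HINTS[stage]}\n"
        f"Question: {question}\nText A: {text_a}\nText B: {text_b}"
    )
```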
We provide an example in Figure 5. When inference time scaling is not applied, although the model generates correct reasoning steps, it fails to arrive at a concrete answer during the reasoning process. This causes the model to make a guess in the conclusion phase, leading to an incorrect result. In contrast, with inference time scaling, the model retains the reasoning steps leading to the final result, ensuring the correctness of the answer.
My take: this is seriously impressive!
IV. Post-Training Performance
In this section, we compare LLaVA-CoT with the base model, Llama-3.2-11B-Vision-Instruct, on six commonly used multimodal benchmarks to demonstrate the effectiveness of our approach during the training phase. Following this comparison, we conduct ablation studies to evaluate the contribution of each component within our method, addressing the following three key questions: (1) Is our LLaVA-CoT-100k dataset more effective than directly using the original dataset’s Q&A pairs? (2) What is the impact of structured tags on the performance? Specifically, we explore whether LLaVA-CoT can function without tags by implicitly segmenting different stages of the response. (3) In which specific areas does our model show the most improvement compared to the base model, and does it genuinely enhance reasoning capabilities?
1. Experimental Setup
We selected six widely used and challenging benchmarks for our experiments: MMStar [10], MMBench V1.1 [35], MMVet [64], MathVista [37], AI2D [25], and HallusionBench [18]. MMStar, MMBench, and MMVet primarily evaluate the general visual question-answering capabilities of models, while MathVista, and AI2D focus on models’ proficiency in mathematical and scientific reasoning. HallusionBench specifically assesses the models’ handling of language hallucinations and visual illusions. For MMBench, we use the V1.1 version of the test set, MathVista is evaluated using the testmini set, and the remaining datasets each have a single test set. To ensure fairness and reproducibility, all evaluations are conducted using VLMEvalKit [15], an open-source evaluation toolkit for large vision-language models. The performance metrics of all baseline models are derived from VLMEvalKit’s testing results [1].
2. Benchmark Results
We found that LLaVA-CoT achieves significant performance improvements, despite using only 100k data. According to Table 2, compared to the base model, Llama-3.2-11B-Vision-Instruct, LLaVA-CoT demonstrates notable improvements across general VQA, mathematical reasoning, scientific VQA, and hallucination control tasks, with an average benchmark score increase of 5.8%, thereby validating the effectiveness of our approach.
3. Ablation Study
Effectiveness of LLaVA-CoT-100k Compared to Original Datasets. To demonstrate the effectiveness of our improved LLaVA-CoT-100k dataset, we present a comparison between LLaVA-CoT and the model trained on the original Q&A pairs across different benchmarks in Table 2. Although the model trained directly on the original Q&A pairs shows some overall improvement on the base model, its average performance remains significantly lower. In particular, on the MMVet benchmark, which requires more detailed responses, its performance is even worse than the base model. This result underscores the importance of the multi-stage format of our LLaVA-CoT-100k dataset for training models capable of advanced reasoning.
Structured Tags are Essential for Enhanced Performance. To examine whether the four tags we introduced improve the model’s performance, we compare LLaVA-CoT with the model trained on the LLaVA-CoT-100k dataset with structured tags removed. As shown in Table 2, our results show a significant drop in performance when the tags are removed, indicating that the structured tagging facilitates reasoning and improves model performance. To the best of our knowledge, LLaVA-CoT is the first attempt to successfully enhance a model’s reasoning ability and overall performance through a structured reasoning with tags.
Performance Gains Primarily in Reasoning-Intensive Areas. To analyze the specific areas in which LLaVA-CoT has improved compared to the base model, we conduct a detailed assessment of the model’s performance across different skills on the MMStar benchmark. MMStar is designed to evaluate six key capabilities: coarse perception, fine-grained perception, instance reasoning, logical reasoning, math, and science & technology. In Table 3, we compare the base model with LLaVA-CoT. Our analysis reveals that LLaVA-CoT demonstrates notable improvements in tasks requiring systematic reasoning, such as instance reasoning, logical reasoning, math, and science & technology, while showing relatively smaller gains in coarse perception and fine-grained perception. This suggests that our method can mainly improve reasoning capabilities of the model.
V. Inference Time Scaling
In this section, we aim to compare the effectiveness of our stage-level beam search approach with traditional methods like best-of-N and sentence-level beam search under comparable computational constraints. The experimental setup mirrors that used in the previous section, with evaluations conducted across the same six benchmarks: MMStar, MMBench V1.1, MMVet, MathVista, AI2D, and HallusionBench. All methods are evaluated using VLMEvalKit to ensure reproducibility.
1. Benchmark Results
As shown in Table 4, stage-level beam search demonstrates substantial effectiveness in leveraging the structured reasoning stages of LLaVA-CoT. By evaluating outputs at each reasoning stage, this approach strikes a balance between rigorous quality control and computational efficiency, yielding higher inference accuracy on complex reasoning tasks without significant computational overhead. These findings suggest that stage-level beam search, which is made possible by the structured output design of LLaVA-CoT, is an effective and powerful approach for inference time scaling.
2. Comparison to Baseline Methods
We compare our method with baseline inference scaling methods on the MMVet benchmark to evaluate relative performance. For a fair comparison, our stage-level beam search method and the baseline models are evaluated using comparable levels of inference time compute. Specifically, we set N = 10 for the best-of-N method, generate 4 candidate responses per stage for our stage-level beam search, and use a sentence-level beam search generating 2 candidates per sentence. As shown in Table 5, the best-of-N method yields only a modest improvement of 0.6%, while sentence-level beam search even shows a 1.9% decrease in performance. We examine the sub-scores and found that the main reason for the performance drop in sentence-level beam search is the excessively granular sentence-level approach, which struggles to effectively address open-ended questions. In contrast, our stage-level beam search improved performance by 2.6%, highlighting the superiority of stage-based search.
3. Scaling Trend of Stage-level Beam Search
To better illustrate the effectiveness of our stage-level beam search as inference time compute increases, we evaluate LLaVA-CoT with different beam sizes on the MMVet benchmark. As shown in Table 6, we test the performance of the model by generating 1 (ie, no inference time scaling), 2, 3, and 4 candidate responses at each reasoning stage, allowing the model to select the best answer from these options. Our findings show that as the number of candidate responses increases, the model’s performance consistently improves, confirming that our stage-level beam search approach is scalable. Due to computational resource constraints, we only test a beam size of 2 across all benchmarks. However, it is expected that increasing the beam size will lead to even more significant improvements.
VI. Comparison to State-of-the-Art VLMs
As shown in Table 7, we compare LLaVA-CoT with other state-of-the-art open-source and closed-source vision language models (VLM) across six benchmarks that require advanced reasoning capabilities: MMStar-R, MMBench-R, MMVet-R, MathVista, AI2D, and HallusionBench. MMStar-R, MMBench-R, and MMVet-R are custom benchmarks derived from MMStar, MMBench V1.1, and MMVet, respectively, with tasks requiring only coarse perception, fine-grained perception, and OCR removed. These filtered benchmarks retain tasks that demand complex reasoning, with further details on the selection criteria available in Appendix E. MathVista, AI2D, and HallusionBench inherently focus on advanced reasoning, so we retained all tasks within these benchmarks.
Our results show that LLaVA-CoT consistently outperforms many open-source models of similar or even larger sizes, such as InternVL2-8B [11], Ovis1.5-Gemma2-9B [38], MiniCPM-V2.6-8B [62], Llama-3.2-90B-Vision-Instruct [42], and VILA-1.5-40B [32]. Remarkably, LLaVA-CoT even surpasses certain closed-source models like GPT-4o-mini [43] and Gemini-1.5-pro [46], underscoring the effectiveness of our structured reasoning approach. This comparison validates the advantages of our method, particularly in benchmarks that heavily depend on reasoning skills, and highlights LLaVA-CoT as a competitive model in the domain of reasoning-intensive VLM tasks.
VII. Conclusion
In this paper, we present LLaVA-CoT, a novel vision language model that performs structured, autonomous reasoning in multiple stages. By introducing four distinct stages, LLaVA-CoT achieves a systematic reasoning process. Our contributions are twofold: first, the creation of the LLaVA-CoT-100k dataset with detailed reasoning annotations, which supports training on systematic, structured responses; and second, the proposal of a stage-level beam search method, enabling effective inference time scaling. Overall, LLaVA-CoT establishes a new standard for multimodal reasoning in VLMs, offering robust performance and scalability, especially in inference time. Our work paves the way for future research on structured reasoning in VLMs, including potential expansions with external verifiers and the use of reinforcement learning to further enhance complex multimodal reasoning capabilities.
VIII. LLaVA-CoT Data Construction
1. Training Data Format
I looked up this model's training data format and list just one example here (shown in the figure below). We can see that the SUMMARY, CAPTION, REASONING, and other stage outputs all sit inside the gpt turn, so my understanding is that the loss is simply computed over all of them together.
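For reference, a training record presumably looks something like the sketch below (written as a Python dict for readability). The field names follow the common ShareGPT/LLaVA convention and the content is invented for illustration; the exact schema of the released LLaVA-CoT-100k files may differ.

```python
sample = {
    "image": "chartqa/train/png/example.png",  # hypothetical path
    "conversations": [
        {"from": "human",
         "value": "<image>\nWhich month has the highest sales?"},
        {"from": "gpt",
         "value": ("<SUMMARY>I will read the bar chart and compare the monthly sales values.</SUMMARY>"
                   "<CAPTION>The chart shows sales bars for January through June; the tallest bar is over March.</CAPTION>"
                   "<REASONING>Comparing bar heights, March exceeds every other month.</REASONING>"
                   "<CONCLUSION>March</CONCLUSION>")},
    ],
}
```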
2. Data Construction (very important)
Overall, we provide GPT-4o with a question, an image, and the original dataset’s answer to generate systematic and structured datasets.
Specifically, we guide GPT-4o to generate response data in stages using a carefully designed prompt. The prompt is formatted as follows:
Prompt for data generation
I have an image and a question that I want you to answer. I need you to strictly follow the format with four specific sections: SUMMARY, CAPTION, REASONING, and CONCLUSION. It is crucial that you adhere to this structure exactly as outlined and that the final answer in the CONCLUSION matches the standard correct answer precisely.
To explain further: In SUMMARY, briefly explain what steps you’ll take to solve the problem. In CAPTION, describe the contents of the image, specifically focusing on details relevant to the question. In REASONING, outline a step-by-step thought process you would use to solve the problem based on the image. In CONCLUSION, give the final answer in a direct format, and it must match the correct answer exactly. If it’s a multiple choice question, the conclusion should only include the option without repeating what the option is.
Here’s how the format should look:
<SUMMARY> [Summarize how you will approach the problem and explain the steps you will take to reach the answer.] </SUMMARY>
<CAPTION> [Provide a detailed description of the image, particularly emphasizing the aspects related to the question.] </CAPTION>
<REASONING> [Provide a chain-of-thought, logical explanation of the problem. This should outline step-by-step reasoning.] </REASONING>
<CONCLUSION> [State the final answer in a clear and direct format. It must match the correct answer exactly.] </CONCLUSION> (Do not forget </CONCLUSION>!)
Please apply this format meticulously to analyze the given image and answer the related question, ensuring that the answer matches the standard one perfectly.
After generating data using this prompt, we verify whether the data generated by GPT-4o adheres to the prescribed format and filter out any data that does not comply. Next, we extract the content within the <CONCLUSION>...</CONCLUSION> tags and apply the following prompt to filter out cases where GPT-4o either refuses to answer or provides an answer that is inconsistent with the original dataset's standard answer:
Prompt for data verification
Evaluate whether the assistant’s response is valid. Respond with ‘valid’ if the assistant’s response is not a refusal and it aligns with the standard answer in meaning. Respond with ‘invalid’ if the response is a refusal or differs from the standard answer in a meaningful way.
A refusal means the assistant states it cannot recognize a specific person/object or refuses to answer the question. Do not consider a response to be a refusal just because it includes the word ‘no’ or other negative terms.
Standard answer: {standard answer}
Assistant’s response: {assistant response}
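Putting the two prompts together, the generation-plus-filtering pipeline might look roughly like the sketch below. It assumes the standard OpenAI Python client; the helper names and prompt plumbing are my own, and the authors' actual scripts may differ.

```python
import base64
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
TAGS = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def call_gpt4o(prompt: str, image_path: str) -> str:
    """One multimodal chat call carrying the data-generation prompt and the image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def follows_format(text: str) -> bool:
    """Format check: all four tag pairs must be present."""
    return all(f"<{t}>" in text and f"</{t}>" in text for t in TAGS)

def conclusion_of(text: str) -> str:
    m = re.search(r"<CONCLUSION>(.*?)</CONCLUSION>", text, re.DOTALL)
    return m.group(1).strip() if m else ""

def build_sample(generation_prompt, verification_prompt, question, image_path, standard_answer):
    staged = call_gpt4o(f"{generation_prompt}\n\nQuestion: {question}", image_path)
    if not follows_format(staged):
        return None  # discard malformed outputs
    # Answer-consistency check using the verification prompt quoted above.
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": verification_prompt
                       .replace("{standard answer}", standard_answer)
                       .replace("{assistant response}", conclusion_of(staged))}],
    ).choices[0].message.content
    return staged if verdict.strip().lower().startswith("valid") else None
```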