Paper:
[2507.07998v1] PyVision: Agentic Vision with Dynamic Tooling (https://arxiv.org/abs/2507.07998)
1. Background
Existing agents generally work by having a large model plan over and call a set of predefined tools (in practice, functions) to solve a problem. This leaves the agent inflexible on task-specific problems. This paper proposes PyVision: when solving a particular problem, the agent generates tools (functions, i.e. code) tailored to that task on the fly, improving its ability to solve the problem.
2. Framework Architecture
As the diagram shows, PyVision lets a multimodal large language model (MLLM) dynamically generate and execute Python code during inference. In each session, the MLLM receives an input, generates the corresponding Python code, and runs it in an isolated Python environment. The output, textual, visual, or both, is fed back into the MLLM's context, letting it iterate on and refine its reasoning over multiple turns until it produces a final answer.
Here:
- code_block_i is the Python code the MLLM generates in turn i.
- mm_clue_i is the multimodal output produced by the Python interpreter after executing it.
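Conceptually, each session runs the loop sketched below. All names here are illustrative rather than the repo's actual API (the real implementation is walked through in section 5):

```python
# Minimal sketch of the PyVision loop; mllm, executor, and extract_code
# are hypothetical stand-ins, not the repo's actual interfaces.
def pyvision_session(mllm, executor, user_input, max_turns=8):
    context = [user_input]
    for i in range(max_turns):
        reply = mllm.generate(context)          # may contain a <code> block
        if "<code>" not in reply:
            return reply                        # final answer, stop tooling
        code_block_i = extract_code(reply)      # code generated in turn i
        mm_clue_i = executor.run(code_block_i)  # interpreter's multimodal output
        context += [reply, mm_clue_i]           # feed the clue back for turn i+1
    return reply
```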
3. Reasoning Case Studies
The paper presents examples of applying PyVision to visual reasoning across several domain-specific tasks.
3.1 Visual Search
3.2 Medical Image Analysis
3.3 Symbolic Visual Puzzles
3.4 Visual Sketching
3.5 Visual Difference Comparison
3.6 Video Understanding
4. Conclusions
As the figure shows, for the selected benchmark datasets:
- MathVista and MathVision-mini: multimodal math benchmarks, testing performance on problems that require both visual perception and numerical reasoning.
- MMMU: cross-disciplinary, domain-specific reasoning.
- VisualPuzzles and VLMAreBlind-mini: symbolic visual puzzles, probing the limits of parsing and reasoning over abstract, structured visual primitives.
- V*: tests the model's ability to precisely identify subtle visual details.
As the figure shows, adding PyVision on top of GPT-4.1 (PyVision-GPT-4.1) lifts MathVista by 1.8 percentage points (from 69.9% to 71.7%), with similar gains on the other tasks. It still trails models like o1 and o3 by a clear margin, though, which suggests the backend model behind the framework matters a great deal for overall problem-solving.
5. Code Reproduction
Project source: https://github.com/agents-x-project/PyVision
Demo: https://huggingface.co/spaces/Agents-X/PyVision
Source code walkthrough:
1. Configuring the LLM API
The project ships three LLM client configurations: openai, azure, and vllm. The config files live at:
./api_config_files/api_config_*
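As an example, an OpenAI-style config might be consumed roughly as follows. The file name and field names here are assumptions, not the repo's actual schema:

```python
# Hypothetical sketch of loading one of the api_config files.
import json
from openai import OpenAI

with open("./api_config_files/api_config_openai.json") as f:
    cfg = json.load(f)  # assumed fields: api_key, optional base_url

client = OpenAI(api_key=cfg["api_key"], base_url=cfg.get("base_url"))
```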
2. Prompt templates
# English original
{"retool": "Solve the following problem step by step. You now have the ability to selectively write executable Python code to enhance your reasoning process. The Python code will be executed by an external sandbox, and the output (wrapped in `<interpreter>output_str</interpreter>`) can be returned to aid your reasoning and help you arrive at the final answer. The Python code should be complete scripts, including necessary imports. \nEach code snippet is wrapped with `<code>\n```python\ncode snippet\n```\n</code>`.\nThe last part of your response should be in the following format:\n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>\n\n*user question:*\nAnswer the following Math Problem and put the answer in the format of \\boxed{{answer}}\n\n{query}\n\n\nRemember to place the final answer in the last part using the format: \n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>","vistool": "Solve the following problem step by step. You now have the ability to selectively write executable Python code to enhance your reasoning process. The Python code will be executed by an external sandbox.\n\nFor all the provided images, in order, the i-th image has already been read into the global variable `image_clue_i` using the PIL.Image.open() function. When writing Python code, you can directly use these variables without needing to read them again.\n\nSince you are dealing with the VQA task, you MUST use the python tool (e.g., matplotlib library) to analyze or transform images whenever it could improve your understanding or aid your reasoning. This includes but is not limited to zooming in, rotating, adjusting contrast, computing statistics, or isolating features. \n\nNote that when you use matplotlib to visualize data or further process images, you need to use plt.show() to display these images; there is no need to save them. Do not use image processing libraries like cv2 or PIL. If you want to check the value of a variable, you MUST use print() to check it.\n\nThe output (wrapped in `<interpreter>output_str</interpreter>`) can be returned to aid your reasoning and help you arrive at the final answer. The Python code should be complete scripts, including necessary imports. \nEach code snippet is wrapped with `<code>\n```python\ncode snippet\n```\n</code>`.\nThe last part of your response should be in the following format:\n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>\n\n*user question:*\nAnswer the following Problem with an image provided and put the answer in the format of \\boxed{{answer}}\n\n{query}\n\nRemember to place the final answer in the last part using the format: \n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>","vistool_with_img_info": "Solve the following problem step by step. You now have the ability to selectively write executable Python code to enhance your reasoning process. The Python code will be executed by an external sandbox.\n\nFor all the provided images, in order, the i-th image has already been read into the global variable `image_clue_i` using the PIL.Image.open() function. When writing Python code, you can directly use these variables without needing to read them again.\n\nSince you are dealing with the VQA task, you MUST use the python tool (e.g., matplotlib library) to analyze or transform images whenever it could improve your understanding or aid your reasoning. This includes but is not limited to zooming in, rotating, adjusting contrast, computing statistics, or isolating features. 
\n\nNote that when you use matplotlib to visualize data or further process images, you need to use plt.show() to display these images; there is no need to save them. Do not use image processing libraries like cv2 or PIL. If you want to check the value of a variable, you MUST use print() to check it.\n\nThe output (wrapped in `<interpreter>output_str</interpreter>`) can be returned to aid your reasoning and help you arrive at the final answer. The Python code should be complete scripts, including necessary imports. \nEach code snippet is wrapped with `<code>\n```python\ncode snippet\n```\n</code>`.\nThe last part of your response should be in the following format:\n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>\n\n*image resolution:*\n\nImage Width: {width}; Image Height: {height}\n\n*user question:*\nAnswer the following Problem with an image provided and put the answer in the format of \\boxed{{answer}}\n\n{query}\n\nRemember to place the final answer in the last part using the format: \n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>","vistool_with_img_info_multi_image": "Solve the following problem step by step. You now have the ability to selectively write executable Python code to enhance your reasoning process. The Python code will be executed by an external sandbox.\n\nFor all the provided images, in order, the i-th image has already been read into the global variable `image_clue_i` using the PIL.Image.open() function. When writing Python code, you can directly use these variables without needing to read them again.\n\nSince you are dealing with the VQA task, you MUST use the python tool (e.g., matplotlib library) to analyze or transform images whenever it could improve your understanding or aid your reasoning. This includes but is not limited to zooming in, rotating, adjusting contrast, computing statistics, or isolating features. \n\nNote that when you use matplotlib to visualize data or further process images, you need to use plt.show() to display these images; there is no need to save them. Do not use image processing libraries like cv2 or PIL. If you want to check the value of a variable, you MUST use print() to check it.\n\nThe output (wrapped in `<interpreter>output_str</interpreter>`) can be returned to aid your reasoning and help you arrive at the final answer. The Python code should be complete scripts, including necessary imports. \nEach code snippet is wrapped with `<code>\n```python\ncode snippet\n```\n</code>`.\nThe last part of your response should be in the following format:\n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>\n\n*image resolution:*\n\n{image_information}\n\n*user question:*\nAnswer the following Problem with an image provided and put the answer in the format of \\boxed{{answer}}\n\n{query}\n\nRemember to place the final answer in the last part using the format: \n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>","vistool_with_img_info_v2": "You are an agent - please keep going until the user’s query is completely resolved, before ending your turn and yielding back to the user. Only terminate your turn when you are sure that the problem is solved. \n\nSolve the following problem step by step. You now have the ability to selectively write executable Python code to enhance your reasoning process. The Python code will be executed by an external sandbox. \n\nYou MUST plan extensively before each function call, and reflect extensively on the outcomes of the previous function calls. 
DO NOT do this entire process by making function calls only, as this can impair your ability to solve the problem and think insightfully.\n\nFor all the provided images, in order, the i-th image has already been read into the global variable `image_clue_i` using the PIL.Image.open() function. When writing Python code, you can directly use these variables without needing to read them again.\n\nSince you are dealing with the vision-related question answering task, you MUST use the python tool (e.g., matplotlib library) to analyze or transform images whenever it could improve your understanding or aid your reasoning. This includes but is not limited to zooming in, rotating, adjusting contrast, computing statistics, or isolating features. \n\nNote that when you use matplotlib to visualize data or further process images, you need to use plt.show() to display these images; there is no need to save them. Do not use image processing libraries like cv2 or PIL. If you want to check the value of a variable, you MUST use print() to check it.\n\nThe output (wrapped in `<interpreter>output_str</interpreter>`) can be returned to aid your reasoning and help you arrive at the final answer. The Python code should be complete scripts, including necessary imports. \nEach code snippet is wrapped with `<code>\n```python\ncode snippet\n```\n</code>`.\nThe last part of your response should be in the following format:\n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>\n\n*image resolution:*\n\nImage Width: {width}; Image Height: {height}\n\n*user question:*\nAnswer the following Problem with an image provided and put the answer in the format of \\boxed{{answer}}\n\n{query}\n\nRemember to place the final answer in the last part using the format: \n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>","no_tool": "You are a helpful assistant. And you are dealing with the VQA tasks. Solve the visual questions step by step and give the correct answer. Note: put your answer in the format of \"\\boxed{{the right answer here}}\"\n *user question*:\n{query}","no_tool_no_cot": "Question:\n{query}\nGive the correct answer directly, in the format of \"Final Answer:\\boxed{{the final answer here}}\"\n"
}
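Note the doubled braces ({{...}}) throughout the templates: they escape literal braces so that Python's str.format can fill in the {query}, {width}, and {height} placeholders. A quick sketch of instantiating a template (the file name "prompt_template.json" is an assumption):

```python
# Sketch: fill a template with str.format; doubled braces in the template
# survive as literal braces (e.g. \boxed{...}) after formatting.
import json

with open("prompt_template.json") as f:
    templates = json.load(f)

prompt = templates["vistool_with_img_info"].format(
    width=1024, height=768,
    query="How many giraffes are in the image?",
)
print(prompt[:200])
```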
3. Launching main.py
```python
from openai import OpenAI
from inference_engine.vis_inference_demo_gpt import evaluate_single_data, evaluate_single_with_cleanup
from inference_engine.safe_persis_shared_vis_python_exe import PythonExecutor
# ... (remaining imports elided)

# Run inference with safe execution
print(f"Processing image: {args.image_path}")
print(f"Question: {args.question}")
print("Running inference with safe execution...")
# messages, final_response = evaluate_single_with_cleanup(eval_args, data, client)
executor = PythonExecutor()
messages, final_response = evaluate_single_data(eval_args, data, client, executor)

# Save results
os.makedirs(args.output_dir, exist_ok=True)
if args.save_messages:
    messages_path = os.path.join(args.output_dir, "test_messages.json")
    with open(messages_path, "w", encoding="utf-8") as f:
        json.dump(messages, f, indent=4, ensure_ascii=False)
    print(f"Messages saved to: {messages_path}")
```
The overall logic is straightforward: once the arguments are configured, evaluate_single_data runs the inference and returns the model's result.
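Assuming main.py defines its flags via argparse with names matching the args.* attributes above (which is an inference, not confirmed from the repo), an invocation would look roughly like `python main.py --image_path demo.jpg --question "How many giraffes are in the image?" --output_dir ./results --save_messages`.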
4. inference_engine
This module is the core of the project: it handles code execution and the reasoning loop for the visual question answering (VQA) task.
evaluate_single_data
evaluate_single_data is the heart of the system; it implements the complete pipeline for agentic, dynamic-tooling visual question answering.
```python
# Extract arguments
prompt_template = args.prompt_template
prompt = args.prompt
exe_code = args.exe_code
max_tokens = args.max_tokens
temperature = args.temperature
api_name = args.api_name

# Prompt-template selection logic
if "no_tool" in prompt:
    # Text-only reasoning, no tools
    if len(image_path_list) == 1:
        messages = process_prompt_init(...)
    elif len(image_path_list) >= 2:
        messages = process_prompt_init_multi_images(...)
else:
    # Tool-augmented reasoning
    if len(image_path_list) == 1:
        prompt = "vistool_with_img_info_v2"  # single image, enhanced template
        messages = process_prompt_init(...)
    elif len(image_path_list) >= 2:
        prompt = "vistool_with_img_info_multi_image"  # multiple images
        messages = process_prompt_init_multi_images(...)
```
````python
# Iterative reasoning loop
while True:
    if exe_code and pred_stop_reason == "</code>":
        # The model emitted a code block that must be executed
        # 1. Extract the code
        code_to_execute = response_text.split("```python")[-1].split("```")[0].strip()
        # 2. Execute it in the sandbox
        exe_result = execute_codes([code_to_execute], messages, executor)[0][0]
        # 3. Handle the execution result
        if report == "Done":
            # Execution succeeded
            text_result = exe_result[0]['text']
            images_result = exe_result[0]['images']
        else:
            # Execution failed
            error_result = report
        # 4. Fold the interpreter output back into the message history
        messages, new_image_clue_idx = update_messages_with_execute_content(...)
        # 5. Continue generating the next part of the response
        response_text, pred_stop_reason = call_chatgpt_api(...)
    else:
        # No code to execute: reasoning is complete
        final_response = response_text
        break
````
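The extraction step is plain string splitting. Standalone, it behaves like this:

````python
# Standalone illustration of the code-extraction step above.
response_text = (
    "Let me zoom in first.\n"
    "<code>\n```python\nprint(image_clue_0.size)\n```\n</code>"
)
code_to_execute = response_text.split("```python")[-1].split("```")[0].strip()
print(code_to_execute)  # -> print(image_clue_0.size)
````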
call_chatgpt_api function - API call wrapper
```python
# Multi-backend API support
if client_type == "openai" or client_type == "azure":
    # OpenAI / Azure API
    response = client.chat.completions.create(...)
elif client_type == "anthropic":
    # Claude API
    message = client.messages.create(...)
elif client_type == "vllm":
    # vLLM exposes an OpenAI-compatible API
    response = client.chat.completions.create(...)
```
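All three backends are reachable with standard clients. A hedged sketch of how such clients might be constructed (the function and the cfg field names are illustrative, not the repo's code):

```python
# Illustrative only: one way to build a client per backend.
from openai import OpenAI, AzureOpenAI

def make_client(client_type: str, cfg: dict):
    if client_type == "openai":
        return OpenAI(api_key=cfg["api_key"])
    if client_type == "azure":
        return AzureOpenAI(api_key=cfg["api_key"],
                           azure_endpoint=cfg["endpoint"],
                           api_version=cfg["api_version"])
    if client_type == "vllm":
        # vLLM serves an OpenAI-compatible endpoint, so the OpenAI client is reused
        return OpenAI(api_key="EMPTY", base_url=cfg["base_url"])
    if client_type == "anthropic":
        import anthropic
        return anthropic.Anthropic(api_key=cfg["api_key"])
    raise ValueError(f"unknown client_type: {client_type}")
```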
```python
# Stop-condition detection: find which stop sequence fired
if stop and any(s in response_text for s in stop):
    for s in stop:
        if s in response_text:
            stop_reason = s
            break

# Special handling for code blocks
if "<code>" in response_text:
    stop_reason = "</code>"
```
process_prompt_init function - prompt construction
```python
# Image encoding
if "claude" in api_name:
    img_result = encode_image_with_resize(image_path)  # Claude requires resized images
else:
    img_result = encode_image(image_path)  # other APIs encode the image directly

# Message construction: in tool mode, wrap each image in image_clue tags
content.append({"type": "text", "text": "<image_clue_0>"})
content.append({"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}})
content.append({"type": "text", "text": "</image_clue_0>\n\n"})
```
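encode_image presumably just base64-encodes the file. A minimal sketch of what such a helper looks like (an assumption about its body, consistent with the base64 data URLs built above):

```python
# Minimal sketch of an encode_image helper (assumed, not the repo's code).
import base64

def encode_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```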
execute_codes function - code execution management
```python
def execute_codes(codes, messages, executor: PythonExecutor):
    no_code_idx = []
    codes_use = []
    # Filter out empty snippets, remembering their indices
    for i, code in enumerate(codes):
        if code == "":
            no_code_idx.append(i)
        else:
            codes_use.append(code)
    # Execute the remaining snippets as a batch
    batch_results = executor.batch_apply(codes_use, messages)
    return batch_results, no_code_idx
```
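Usage sketch: empty snippets are skipped and their positions returned, so the caller can line results back up with the original list (the executor setup here is simplified):

```python
# Usage sketch for execute_codes.
executor = PythonExecutor()
results, skipped = execute_codes(["print(1 + 1)", ""], messages=[], executor=executor)
# results holds outputs for the non-empty snippets; skipped == [1]
```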
update_messages_with_execute_content function - merging execution results into the context
```python
# Successful execution
if error_result is None:
    # Build the interpreter message
    interpreter_message_text_prefix = [{"type": "text", "text": f"<interpreter>\nText Result:\n{text_result}\nImage Result:\n"}]
    # Attach any images produced by the executed code
    if images_result is not None:
        for image_base64_item in images_result:
            interpreter_message_images = [
                {"type": "text", "text": f"<image_clue_{image_clue_idx}>"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64_item}"}},
                {"type": "text", "text": f"</image_clue_{image_clue_idx}>"},
            ]
            image_content += interpreter_message_images
            image_clue_idx += 1
# Failed execution
else:
    interpreter_message_text_prefix = [{"type": "text", "text": f"<interpreter>{error_result}"}]
```