[LLM] Vision-Language Models: Using Qwen2.5-VL

Official GitHub repository: https://github.com/QwenLM/Qwen2.5-VL

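Qwen2.5-VL: the latest member of the Qwen family

Key enhancements

Powerful document parsing: text recognition is upgraded to full-document parsing. The model handles multi-scene, multilingual documents and documents with embedded content such as handwriting, tables, charts, chemical formulas, and music scores.

Precise object grounding across formats: improved accuracy in detecting, pointing at, and counting objects, with support for absolute coordinates and JSON output for advanced spatial reasoning.

Ultra-long video understanding and fine-grained video grounding: native dynamic resolution is extended to the temporal dimension, which improves understanding of videos lasting several hours while still allowing event segments to be located to within seconds.

Enhanced agent capabilities for computers and mobile devices: advanced reasoning and decision-making abilities give the model stronger agent functionality on smartphones and computers.

Model architecture updates

Dynamic resolution and frame-rate training for video understanding

Dynamic resolution is extended to the temporal dimension through dynamic FPS sampling, so the model can understand videos at different sampling rates. Accordingly, mRoPE is updated in the temporal dimension with ids aligned to absolute time, which lets the model learn temporal order and speed and ultimately pinpoint specific moments.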
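
To make the idea of absolute time alignment more concrete, here is a minimal sketch. It is not the model's actual implementation. It assumes that each temporal grid step covers `temporal_patch_size` sampled frames and that temporal ids advance at a fixed `tokens_per_second` rate; both parameter names and their default values are illustrative assumptions.

```python
def temporal_position_ids(num_frames, fps, temporal_patch_size=2, tokens_per_second=2):
    """Return one temporal id per grid step, proportional to elapsed seconds."""
    # Seconds of video covered by one temporal grid step (assumed grouping).
    seconds_per_grid = temporal_patch_size / fps
    num_grids = num_frames // temporal_patch_size
    # Ids grow with absolute time, not with the frame index.
    return [int(i * seconds_per_grid * tokens_per_second) for i in range(num_grids)]


# The same 4-second clip sampled at two different frame rates.
ids_at_2fps = temporal_position_ids(num_frames=8, fps=2.0)
ids_at_1fps = temporal_position_ids(num_frames=4, fps=1.0)
print(ids_at_2fps)  # [0, 2, 4, 6]
print(ids_at_1fps)  # [0, 4]
```

Under these assumptions, the grid step that starts at second 2 receives id 4 at both sampling rates, which is the property the paragraph describes: ids track wall-clock time, so playback speed becomes learnable.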

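Streamlined and efficient vision encoder

Training and inference speed are improved by strategically applying window attention in the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.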
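
For readers unfamiliar with the two building blocks named here, the following is a minimal pure-Python sketch of RMSNorm and of the gating step in SwiGLU, written from their standard definitions. It operates on plain lists and omits the linear projections that surround the gate in a real layer; the function names are hypothetical and this is not the model's code.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then scale by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_gate(gate, up):
    """Element-wise SwiGLU gating: silu(gate) * up."""
    return [silu(g) * u for g, u in zip(gate, up)]


normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
gated = swiglu_gate(gate=[0.0, 2.0], up=[5.0, 1.0])
print(normed)
print(gated)
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so the normalized vector has a mean square of roughly 1 rather than zero mean and unit variance.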

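Quick start

The examples below show how to use Qwen2.5-VL with ModelScope and Transformers.

The Qwen2.5-VL code is included in the latest Hugging Face transformers. Building from source with the following command is recommended: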

pip install git+https://github.com/huggingface/transformers accelerate

Install dependencies

pip install qwen-vl-utils
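
After installing, it can be useful to confirm that the environment actually exposes the names used in the rest of this article. The helper below is a small sketch using only the standard library; `has_symbol` is a hypothetical name, not part of either package.

```python
import importlib
import importlib.util


def has_symbol(module_name, attr):
    """Return True if module_name is importable and exposes attr."""
    if importlib.util.find_spec(module_name) is None:
        return False
    module = importlib.import_module(module_name)
    return hasattr(module, attr)


if __name__ == "__main__":
    # Both names are imported by the chat example in this article.
    print(has_symbol("transformers", "Qwen2_5_VLForConditionalGeneration"))
    print(has_symbol("qwen_vl_utils", "process_vision_info"))
```

If the first check prints False, the installed transformers version most likely predates Qwen2.5-VL support and should be rebuilt from source.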

Chatting with Transformers

The following snippet shows how to use the chat model with transformers and qwen_vl_utils:

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# (requires `import torch`)
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"type": "text", "text": "Describe this image."},
]}]

# Preparation for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
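
The slicing step near the end of the snippet is easy to misread. The generated sequence contains the prompt tokens followed by the new tokens, so each output row is cut at the length of its input row. The toy example below uses plain lists with made-up token ids in place of tensors.

```python
# Made-up token ids standing in for inputs.input_ids and generated_ids.
input_ids = [[101, 7, 8], [101, 9, 10]]
generated_ids = [[101, 7, 8, 42, 43], [101, 9, 10, 55, 56]]

# Keep only what comes after the prompt in each row.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
]
print(generated_ids_trimmed)  # [[42, 43], [55, 56]]
```

Without this step, the decoded text would start with the full chat prompt before the model's answer.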

Multi-image inference

# Messages containing multiple images and a text query
messages = [{"role": "user", "content": [
    {"type": "image", "image": "file:///path/to/image1.jpg"},
    {"type": "image", "image": "file:///path/to/image2.jpg"},
    {"type": "text", "text": "Identify the similarities between these images."},
]}]

# Preparation for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
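
When the number of images varies, writing the message by hand becomes tedious. The helper below is a small convenience sketch that builds the same structure from a list of image references; `build_image_message` is a hypothetical name and not part of transformers or qwen_vl_utils.

```python
def build_image_message(image_refs, prompt):
    """Build a single-turn user message: all images first, then the text query."""
    content = [{"type": "image", "image": ref} for ref in image_refs]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_image_message(
    ["file:///path/to/image1.jpg", "file:///path/to/image2.jpg"],
    "Identify the similarities between these images.",
)
print(messages)
```

The result can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in the multi-image example.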

More usage tips

Input images can be given as local files, base64 strings, or URLs. For videos, only local files are currently supported.

# You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
## Local file path
messages = [{"role": "user", "content": [
    {"type": "image", "image": "file:///path/to/your/image.jpg"},
    {"type": "text", "text": "Describe this image."},
]}]
## Image URL
messages = [{"role": "user", "content": [
    {"type": "image", "image": "http://path/to/your/image.jpg"},
    {"type": "text", "text": "Describe this image."},
]}]
## Base64 encoded image
messages = [{"role": "user", "content": [
    {"type": "image", "image": "data:image;base64,/9j/..."},
    {"type": "text", "text": "Describe this image."},
]}]
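
For the base64 form, the string can be produced from raw image bytes with the standard library. The sketch below shows one way to do it; `to_base64_image_ref` is a hypothetical helper name, and in practice the bytes would come from reading an image file.

```python
import base64


def to_base64_image_ref(image_bytes):
    """Encode raw image bytes in the 'data:image;base64,...' form used in the messages."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image;base64,{encoded}"


# Three bytes that JPEG files commonly start with, for illustration only.
ref = to_base64_image_ref(b"\xff\xd8\xff")
print(ref)  # data:image;base64,/9j/
```

This also explains why base64-encoded JPEG data typically begins with `/9j/`, as in the example message.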

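Image resolution for better performance

The model accepts a wide range of input resolutions. By default it uses the native resolution of the input, but higher resolutions can improve performance at the cost of more computation. Users can set the minimum and maximum number of pixels to find the configuration that best fits their needs, for example a token count range of 256-1280, to balance speed and memory usage.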

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
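
The pixel bounds follow directly from the token range. The sketch below assumes, as the `256 * 28 * 28` arithmetic implies, that one visual token corresponds to a 28×28 pixel area; the helper names are hypothetical and the token estimate is only an approximation.

```python
PIXELS_PER_TOKEN = 28 * 28  # assumed pixel area covered by one visual token


def pixels_for_tokens(num_tokens):
    """Pixel budget that corresponds to a given number of visual tokens."""
    return num_tokens * PIXELS_PER_TOKEN


def approx_tokens(height, width):
    """Approximate visual-token count for an image of the given size."""
    return (height * width) // PIXELS_PER_TOKEN


print(pixels_for_tokens(256))   # 200704
print(pixels_for_tokens(1280))  # 1003520
print(approx_tokens(280, 420))  # 150
```

Thinking in tokens rather than pixels makes it easier to budget memory, since the cost of attention grows with the number of visual tokens.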

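In addition, there are two ways to control the size of the images fed to the model at a finer granularity:

1. Specify exact dimensions: set resized_height and resized_width directly. These values are rounded to the nearest multiple of 28.

2. Define min_pixels and max_pixels: the image is resized, preserving its aspect ratio, so that its pixel count falls within the range from min_pixels to max_pixels.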

# resized_height and resized_width
messages = [{"role": "user", "content": [
    {"type": "image", "image": "file:///path/to/your/image.jpg", "resized_height": 280, "resized_width": 420},
    {"type": "text", "text": "Describe this image."},
]}]
# min_pixels and max_pixels
messages = [{"role": "user", "content": [
    {"type": "image", "image": "file:///path/to/your/image.jpg", "min_pixels": 50176, "max_pixels": 50176},
    {"type": "text", "text": "Describe this image."},
]}]
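
The two controls can be approximated in a few lines. This is a simplified sketch under stated assumptions, not the code that qwen_vl_utils actually runs: the helper name `sketch_resize` is hypothetical, and the default bounds simply reuse the 256-1280 token range from the earlier example.

```python
import math

FACTOR = 28  # both sides are snapped to multiples of this value


def sketch_resize(height, width, min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28):
    """Pick a size close to (height, width) whose sides are multiples of FACTOR
    and whose area is pushed into [min_pixels, max_pixels]."""
    # Snap each side to the nearest multiple of FACTOR (at least one FACTOR).
    h = max(FACTOR, round(height / FACTOR) * FACTOR)
    w = max(FACTOR, round(width / FACTOR) * FACTOR)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(FACTOR, math.floor(height / scale / FACTOR) * FACTOR)
        w = max(FACTOR, math.floor(width / scale / FACTOR) * FACTOR)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / FACTOR) * FACTOR
        w = math.ceil(width * scale / FACTOR) * FACTOR
    return h, w


print(sketch_resize(1000, 1500))
print(sketch_resize(480, 640, min_pixels=50176, max_pixels=50176))
```

Because both sides must be multiples of 28, the aspect ratio is only approximately preserved, and setting min_pixels equal to max_pixels (as in the second call) yields an area at or just below that value rather than exactly equal to it.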

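Adding ids for multiple image inputs

By default, image and video content is included directly in the conversation. When working with multiple images, adding labels to the images and videos makes them easier to refer to. Users can control this behavior with the following settings:

1. Add vision ids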

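Speeding up generation with Flash-Attention 2

First, make sure the latest version of Flash Attention 2 is installed:

pip install -U flash-attn --no-build-isolation

In addition, the hardware must be compatible with Flash-Attention 2. See the official documentation of the flash attention repository for details. FlashAttention-2 can only be used when the model is loaded in torch.float16 or torch.bfloat16.

To load and run the model with Flash Attention-2, add attn_implementation="flash_attention_2" when loading the model, as shown below: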

import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
)
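
If the same script has to run on machines with and without flash-attn installed, the attention backend can be chosen at runtime. The sketch below only checks whether the `flash_attn` package is importable; it does not verify hardware compatibility. The fallback value "sdpa" (PyTorch scaled dot-product attention in transformers) is a choice made for this example, not something the original instructions prescribe, and `pick_attn_implementation` is a hypothetical helper name.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" if flash_attn is importable, otherwise "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_impl = pick_attn_implementation()
print(attn_impl)
```

The returned string can then be passed as `attn_implementation=attn_impl` to `from_pretrained`, keeping `torch_dtype=torch.bfloat16` or `torch.float16` when the Flash Attention path is selected.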

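Processing long texts

The current config.json sets a context length of up to 32,768 tokens. To handle inputs longer than 32,768 tokens, YaRN can be used. It is a technique for improving length extrapolation that helps maintain performance on long texts.

For supported frameworks, YaRN can be enabled by adding the following to config.json: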

{
    ...,
    "type": "yarn",
    "mrope_section": [16, 24, 24],
    "factor": 4,
    "original_max_position_embeddings": 32768
}

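Note, however, that this method significantly affects performance on temporal and spatial localization tasks, so it is not recommended.

For long video inputs, on the other hand, MRoPE itself consumes position ids more economically, so max_position_embeddings can simply be changed to a larger value, such as 64k.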
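
Changing max_position_embeddings amounts to editing one value in config.json. The sketch below does this with the standard library, under two assumptions: that "64k" means 65536, and that max_position_embeddings is a top-level key in the file. The function name is hypothetical, and the demonstration runs on a throwaway file rather than a real model directory.

```python
import json
import tempfile
from pathlib import Path


def set_max_position_embeddings(config_path, new_value=65536):
    """Rewrite max_position_embeddings in a config.json file and return the old value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    old_value = config.get("max_position_embeddings")
    config["max_position_embeddings"] = new_value
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return old_value


# Demonstration on a temporary file with a placeholder starting value.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(json.dumps({"max_position_embeddings": 32768}), encoding="utf-8")
    previous = set_max_position_embeddings(demo_path)
    updated = json.loads(demo_path.read_text(encoding="utf-8"))["max_position_embeddings"]

print(previous, updated)  # 32768 65536
```

Keeping a backup of the original config.json before editing makes it easy to revert if the larger value causes memory problems.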

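Docker

To simplify deployment, a Docker image with a pre-built environment is provided: qwenllm/qwenvl. Only the GPU driver needs to be installed and the model files downloaded before launching the demo.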

docker run --gpus all --ipc=host --network=host --rm --name qwen2.5 -it qwenllm/qwenvl:2.5-cu121 bash