[GPT Primer] Lesson 67: Multimodal models in practice: local deployment of a text-to-video model and an image reasoning model
- 1. Local deployment of the text-to-video model CogVideoX-5b
- 1.1 Model introduction
- 1.2 Environment setup
- 1.3 Model download
- 1.4 Test
- 2. Deploying the image reasoning model llama3.2-vision with Ollama
- 2.1 Model introduction
- 2.2 Install Ollama
- 2.3 Download the model
- 2.4 Set up the test environment
- 2.5 Test
1. Local deployment of the text-to-video model CogVideoX-5b
1.1 Model introduction
Model page: https://www.modelscope.cn/models/ZhipuAI/CogVideoX-5b/summary
1.2 Environment setup
Download and install Miniconda:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda create --prefix /root/autodl-tmp/xxzhenv/video python=3.10 -y
or
conda create --name video python=3.10
Activate the environment (pass the --prefix path or the --name, matching how it was created), then install the dependencies:
conda activate video
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
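Before downloading a multi-gigabyte model it is worth confirming the environment actually resolved. A minimal sketch using only the Python standard library (the helper name `check_packages` is ours, not part of any of the installed libraries); it reports the installed version of each package from the pip step, or "missing" if it cannot be found:

```python
# Sanity-check the freshly created conda environment: report the installed
# version of each package, or "missing" if the import machinery cannot
# find it (meaning the pip install step needs to be re-run while the
# environment is activated).
from importlib import metadata
from importlib.util import find_spec


def check_packages(names):
    """Return {package: version or 'missing'} for each distribution name."""
    # Note: find_spec() takes a module name and metadata.version() takes a
    # distribution name; for the packages checked here the two coincide.
    report = {}
    for name in names:
        if find_spec(name) is None:
            report[name] = "missing"
        else:
            report[name] = metadata.version(name)
    return report


if __name__ == "__main__":
    for pkg, ver in check_packages(
        ["transformers", "accelerate", "diffusers", "torch"]
    ).items():
        print(f"{pkg}: {ver}")
```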
1.3 Model download
modelscope download --model ZhipuAI/CogVideoX-5b --local_dir /root/autodl-tmp/models_xxzh/ZhipuAI/CogVideoX-5b
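The CLI download can also be scripted. A sketch assuming the `modelscope` Python package, whose `snapshot_download` helper is the programmatic counterpart of the CLI command above; the `local_dir_for` helper is ours, and simply mirrors the org/name layout used by `--local_dir`:

```python
from pathlib import Path

MODEL_ID = "ZhipuAI/CogVideoX-5b"
BASE_DIR = "/root/autodl-tmp/models_xxzh"


def local_dir_for(model_id: str, base: str) -> str:
    """Mirror the org/name directory layout of the CLI's --local_dir."""
    return str(Path(base) / model_id)


def download_model() -> str:
    """Programmatic counterpart of the `modelscope download` CLI command."""
    from modelscope import snapshot_download  # pip install modelscope
    return snapshot_download(
        MODEL_ID, local_dir=local_dir_for(MODEL_ID, BASE_DIR)
    )
```

Calling `download_model()` fetches several gigabytes of weights, so run it only on the deployment machine.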
1.4 Test
Run:
import torch
from modelscope import CogVideoXPipeline  # also available as: from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden "
    "stool in a serene bamboo forest. The panda's fluffy paws strum a miniature "
    "acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas "
    "gather, watching curiously and some clapping in rhythm. Sunlight filters "
    "through the tall bamboo, casting a gentle glow on the scene. The panda's "
    "face is expressive, showing concentration and joy as it plays. The "
    "background includes a small, flowing stream and vibrant green foliage, "
    "enhancing the peaceful and magical atmosphere of this unique musical "
    "performance."
)

pipe = CogVideoXPipeline.from_pretrained(
    "/root/autodl-tmp/models_xxzh/ZhipuAI/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
)

# Memory savers: stream weights from CPU and tile/slice the VAE decode.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
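One way to read the sampling parameters above: `num_frames` and the export `fps` together fix the clip length. A tiny illustrative helper (`video_duration_seconds` is ours, not part of the pipeline API) makes the arithmetic explicit:

```python
def video_duration_seconds(num_frames: int, fps: int) -> float:
    """Clip length implied by a frame count and an export frame rate."""
    return num_frames / fps


# The settings above (num_frames=49 exported at fps=8) give a clip of
# about six seconds.
print(video_duration_seconds(49, 8))  # 6.125
```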
2. Deploying the image reasoning model llama3.2-vision with Ollama
2.1 Model introduction
Official page: https://ollama.com/library/llama3.2-vision
The Llama 3.2-Vision family of multimodal large language models (LLMs) is a collection of instruction-tuned image reasoning generative models in two sizes, 11B and 90B parameters, supporting a "text + image in, text out" interaction mode.
The instruction-tuned Llama 3.2-Vision models are optimized for visual recognition, image reasoning, image captioning, and answering general questions about an image. On common industry benchmarks they outperform many existing open-source and closed-source multimodal models.
Supported languages
- Text-only tasks: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported (8 languages). In addition, Llama 3.2 was trained on a broader range of languages beyond these eight.
- Image + text tasks: note that only English is currently supported.
2.2 Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
2.3 Download the model
ollama pull llama3.2-vision
2.4 Set up the test environment
conda create --prefix /root/autodl-tmp/xxzhenv/ollama python=3.10 -y
conda activate ollama
pip install ollama
2.5 Test
Put an image file (here named image.jpeg) in the same directory as the script.
import ollama

response = ollama.chat(
    model='llama3.2-vision',
    messages=[{
        'role': 'user',
        'content': 'What is in this image?',
        'images': ['image.jpeg'],
    }]
)
print(response)
Output:
(/root/autodl-tmp/xxzhenv/ollama) root@autodl-container-b197439d52-c6eeee38:~/autodl-tmp/xxzh# python test01.py
model='llama3.2-vision' created_at='2025-09-12T07:40:47.282497498Z' done=True done_reason='stop' total_duration=9314004386 load_duration=6304258184 prompt_eval_count=16 prompt_eval_duration=1965372891 eval_count=74 eval_duration=1036467359 message=Message(role='assistant', content='The image is a painting of a starry night sky with a village below, featuring a large cypress tree and a bright crescent moon. The painting is called "The Starry Night" and was created by Vincent van Gogh in 1889. It is one of his most famous works and is widely considered a masterpiece of Post-Impressionism.', thinking=None, images=None, tool_name=None, tool_calls=None)
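The printed object above mixes timing metadata with the answer; in practice you usually want only the assistant text. A small sketch (the `extract_content` helper is ours; it assumes the same `ollama` Python package, whose chat response supports both attribute and dict-style access):

```python
def extract_content(response) -> str:
    """Return just the assistant text from an ollama.chat() response."""
    try:
        # ChatResponse objects expose the message as an attribute.
        return response.message.content
    except AttributeError:
        # Fall back to dict-style access (e.g. raw JSON from the REST API).
        return response["message"]["content"]
```

With the run above, `print(extract_content(response))` would print only the description of the painting rather than the full response object.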