Table of Contents
- 1 Introduction
- 2 Local Deployment
- 2.1 Setting Up the Environment
- 2.2 Downloading the Models
- 3 Text-to-Video
- 3.1 Running the Command
- 3.2 Results
- 4 Image-to-Video
- 4.1 Running the Command
- 4.2 Results
- 5 First-and-Last-Frame-to-Video
- 5.1 Running the Command
- 5.2 Results
- 6 Prompt Extension
- 7 LoRA Fine-Tuning
1 Introduction
Tongyi Wanxiang (Wan) 2.1 was launched in January 2025, and on February 25 Alibaba announced that the model was being fully open-sourced. The release matters because it lets developers worldwide take part in applying and improving the model. It comes in two sizes, a 14B professional version and a 1.3B lightweight version, covering different use cases. On the VBench leaderboard, the 14B version scored 86.22% overall, ahead of domestic and international models such as Sora and Luma, and took first place in 5 of the 14 major evaluation dimensions, including motion quality and visual quality. The 1.3B lightweight version focuses on efficiency: it needs only 8.2 GB of VRAM on an RTX 4090 to generate 480P video, finishes a 5-second clip within 4 minutes, and runs on mainstream consumer GPUs.
This post puts it to the test locally.
2 Local Deployment
2.1 Setting Up the Environment
Clone the Wan2.1 repository:
git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1
Install the dependencies:
# Ensure torch >= 2.4.0
pip install -r requirements.txt
If installing flash_attn is slow, you can download a prebuilt flash-attn wheel and install it offline instead. Download it from: https://github.com/Dao-AILab/flash-attention/releases
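For example, after downloading the wheel that matches your Python, CUDA, and torch versions (a minimal sketch; the exact filename will differ):
# install the downloaded wheel offline
pip install ./flash_attn-*.whl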
2.2 Downloading the Models
Download the model from ModelScope:
pip install modelscope
modelscope download Wan-AI/Wan2.1-T2V-14B --local_dir ./Wan2.1-T2V-14B
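The later sections also use the 1.3B, I2V, and FLF2V checkpoints; they can be fetched the same way (a sketch, assuming the ModelScope repo IDs mirror the checkpoint directory names used in the commands below):
modelscope download Wan-AI/Wan2.1-T2V-1.3B --local_dir ./Wan2.1-T2V-1.3B
modelscope download Wan-AI/Wan2.1-I2V-14B-720P --local_dir ./Wan2.1-I2V-14B-720P
modelscope download Wan-AI/Wan2.1-FLF2V-14B-720P --local_dir ./Wan2.1-FLF2V-14B-720P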
3 Text-to-Video
3.1 Running the Command
Single-GPU:
python generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
If you have limited VRAM and run into out-of-memory (OOM) errors, you can use the --offload_model True and --t5_cpu options to reduce GPU memory usage. For example, on an RTX 4090:
python generate.py --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --offload_model True --t5_cpu --sample_shift 8 --sample_guide_scale 6 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
Multi-GPU:
pip install "xfuser>=0.4.1"
torchrun --nproc_per_node=8 generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
3.2 Results
Using the Wan2.1-T2V-14B model with 50 sampling steps:
A single GPU took about 2 hours, using over 80 GB of VRAM.
(This does not match the official claim of about 10 minutes for a 5-second video. What could be the reason?)
With 8 GPUs it took about 15 minutes, with each card using over 30 GB of VRAM.
Generated result: bilibili
4 Image-to-Video
4.1 Running the Command
Single-GPU:
python generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."
Multi-GPU:
torchrun --nproc_per_node=8 generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."
4.2 Results
With 8 GPUs it took about 12 minutes, with each card using around 48 GB of VRAM.
Input image:
Generated result: bilibili
5 First-and-Last-Frame-to-Video
5.1 Running the Command
Single-GPU:
python generate.py --task flf2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-FLF2V-14B-720P --first_frame examples/flf2v_input_first_frame.png --last_frame examples/flf2v_input_last_frame.png --prompt "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird’s feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
Multi-GPU:
torchrun --nproc_per_node=8 generate.py --task flf2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-FLF2V-14B-720P --first_frame examples/flf2v_input_first_frame.png --last_frame examples/flf2v_input_last_frame.png --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird’s feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
5.2 Results
With 8 GPUs it took about 30 minutes, with each card using around 48 GB of VRAM.
Input first frame:
Input last frame:
Generated result: bilibili
6 Prompt Extension
We can also run prompt extension with a local model.
For text-to-video tasks, you can use models such as Qwen/Qwen2.5-14B-Instruct, Qwen/Qwen2.5-7B-Instruct, and Qwen/Qwen2.5-3B-Instruct.
For image-to-video or first-and-last-frame-to-video tasks, you can use models such as Qwen/Qwen2.5-VL-7B-Instruct and Qwen/Qwen2.5-VL-3B-Instruct.
Larger models generally give better extension results but require more GPU memory.
The model used for extension can be changed with the --prompt_extend_model parameter, for example:
- Text-to-video:
python generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage" --use_prompt_extend --prompt_extend_method 'local_qwen' --prompt_extend_target_lang 'zh'
- Image-to-video:
python generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --use_prompt_extend --prompt_extend_model Qwen/Qwen2.5-VL-7B-Instruct --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."
- First-and-last-frame-to-video:
python generate.py --task flf2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-FLF2V-14B-720P --first_frame examples/flf2v_input_first_frame.png --last_frame examples/flf2v_input_last_frame.png --use_prompt_extend --prompt_extend_model Qwen/Qwen2.5-VL-7B-Instruct --prompt "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird’s feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
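The text-to-video example above uses the default local extension model; to pin a specific text-only Qwen model from the list above, --prompt_extend_model can be added there as well (a sketch combining flags that already appear in the commands above):
python generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --use_prompt_extend --prompt_extend_method 'local_qwen' --prompt_extend_model Qwen/Qwen2.5-7B-Instruct --prompt_extend_target_lang 'zh' --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage"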
7 LoRA Fine-Tuning
See the Wan2.1 LoRA fine-tuning tutorial (based on DiffSynth-Studio):
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
Step 1: Install additional packages
pip install peft lightning pandas
Step 2: Prepare your dataset
You need to manage the training videos as follows:
data/example_dataset/
├── metadata.csv
└── train
    ├── video_00001.mp4
    └── image_00002.jpg
metadata.csv:
file_name,text
video_00001.mp4,"video description"
image_00002.jpg,"video description"
We support both images and videos. An image is treated as a single frame of video.
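If there are many clips, the skeleton of metadata.csv can be generated from the contents of train/ with a small script (a sketch; the captions written here are placeholders that should be replaced with real descriptions):
cd data/example_dataset
shopt -s nullglob
echo "file_name,text" > metadata.csv
for f in train/*.mp4 train/*.jpg; do
  # file_name is the bare filename, matching the example metadata.csv above
  echo "$(basename "$f"),\"placeholder caption, replace with a real description\"" >> metadata.csv
done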
Step 3: Data process
CUDA_VISIBLE_DEVICES="0" python examples/wanvideo/train_wan_t2v.py \
  --task data_process \
  --dataset_path data/example_dataset \
  --output_path ./models \
  --text_encoder_path "models/Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth" \
  --vae_path "models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth" \
  --tiled \
  --num_frames 81 \
  --height 480 \
  --width 832
After that, some cached files will be stored in the dataset folder.
data/example_dataset/
├── metadata.csv
└── train
    ├── video_00001.mp4
    ├── video_00001.mp4.tensors.pth
    ├── video_00002.mp4
    └── video_00002.mp4.tensors.pth
Step 4: Train
LoRA training:
CUDA_VISIBLE_DEVICES="0" python examples/wanvideo/train_wan_t2v.py \
  --task train \
  --train_architecture lora \
  --dataset_path data/example_dataset \
  --output_path ./models \
  --dit_path "models/Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors" \
  --steps_per_epoch 500 \
  --max_epochs 10 \
  --learning_rate 1e-4 \
  --lora_rank 16 \
  --lora_alpha 16 \
  --lora_target_modules "q,k,v,o,ffn.0,ffn.2" \
  --accumulate_grad_batches 1 \
  --use_gradient_checkpointing
Full training:
CUDA_VISIBLE_DEVICES="0" python examples/wanvideo/train_wan_t2v.py \
  --task train \
  --train_architecture full \
  --dataset_path data/example_dataset \
  --output_path ./models \
  --dit_path "models/Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors" \
  --steps_per_epoch 500 \
  --max_epochs 10 \
  --learning_rate 1e-4 \
  --accumulate_grad_batches 1 \
  --use_gradient_checkpointing
If you wish to train the 14B model, please separate the safetensor files with a comma. For example: models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00001-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00002-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00003-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00004-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00005-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00006-of-00006.safetensors.
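Concretely, the LoRA training command from Step 4 would look like this for the 14B model (a sketch; only --dit_path changes, everything else is the same as above):
CUDA_VISIBLE_DEVICES="0" python examples/wanvideo/train_wan_t2v.py \
  --task train \
  --train_architecture lora \
  --dataset_path data/example_dataset \
  --output_path ./models \
  --dit_path "models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00001-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00002-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00003-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00004-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00005-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00006-of-00006.safetensors" \
  --steps_per_epoch 500 \
  --max_epochs 10 \
  --learning_rate 1e-4 \
  --lora_rank 16 \
  --lora_alpha 16 \
  --lora_target_modules "q,k,v,o,ffn.0,ffn.2" \
  --accumulate_grad_batches 1 \
  --use_gradient_checkpointing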
If you wish to train the image-to-video model, please add an extra parameter --image_encoder_path "models/Wan-AI/Wan2.1-I2V-14B-480P/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth".
For LoRA training, the Wan-1.3B-T2V model requires 16G of VRAM for processing 81 frames at 480P, while the Wan-14B-T2V model requires 60G of VRAM for the same configuration. To further reduce VRAM requirements by 20%-30%, you can include the parameter --use_gradient_checkpointing_offload.
Step 5: Test
Test LoRA:
import torch
from diffsynth import ModelManager, WanVideoPipeline, save_video, VideoData

model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cpu")
model_manager.load_models([
    "models/Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors",
    "models/Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
    "models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
])
model_manager.load_lora("models/lightning_logs/version_1/checkpoints/epoch=0-step=500.ckpt", lora_alpha=1.0)
pipe = WanVideoPipeline.from_model_manager(model_manager, device="cuda")
pipe.enable_vram_management(num_persistent_param_in_dit=None)
video = pipe(prompt="...", negative_prompt="...", num_inference_steps=50, seed=0, tiled=True)
save_video(video, "video.mp4", fps=30, quality=5)
Test fine-tuned base model:
import torch
from diffsynth import ModelManager, WanVideoPipeline, save_video, VideoData

model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cpu")
model_manager.load_models([
    "models/lightning_logs/version_1/checkpoints/epoch=0-step=500.ckpt",
    "models/Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
    "models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
])
pipe = WanVideoPipeline.from_model_manager(model_manager, device="cuda")
pipe.enable_vram_management(num_persistent_param_in_dit=None)
video = pipe(prompt="...", negative_prompt="...", num_inference_steps=50, seed=0, tiled=True)
save_video(video, "video.mp4", fps=30, quality=5)