[MAX] Add Wan-Animate pipeline support by kkimmk · Pull Request #6347 · modular/modular
Summary
This PR adds support for the Wan-Animate pipeline (Wan-AI/Wan2.2-Animate-14B-Diffusers), a video generation model for motion transfer and character replacement built on top of the existing Wan I2V pipeline.
- **CLIP vision encoder**: Adds a MAX-native CLIP vision encoder (`clip_encoder.py`, `layers/`) used as an image conditioning signal. The existing CLIP text encoder is extracted into `clip_modulev3` to avoid a naming collision.
- **Wan-Animate transformer**: Extends `WanTransformerModel` with pose injection (Conv3d), CLIP image cross-attention, a face adapter (`WanAnimateFaceEncoder`), and a motion encoder (`WanAnimateMotionEncoder`). Transformer layers are refactored into a `layers/` subpackage.
- **Wan-Animate pipeline**: Adds `WanAnimatePipeline`, extending `WanI2VPipeline`, which supports animate mode (motion transfer) and replace mode (background-preserving character replacement via mask) with multi-segment processing and temporal overlap.
- **Tokenizer and preprocessing**: Extends `PixelGenerationTokenizer` and adds `video_processor.py` to handle pose video, face pixels, background video, and mask inputs for the Animate pipeline.
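To make the multi-segment processing with temporal overlap concrete, here is a minimal sketch of the general idea: long videos are generated in fixed-length segments, and consecutive segments are cross-faded over a few overlapping frames. The function and parameter names are illustrative only, not the actual MAX implementation.

```python
import numpy as np

def blend_segments(segments: list[np.ndarray], overlap: int) -> np.ndarray:
    """Concatenate video segments of shape (frames, H, W, C), linearly
    cross-fading the last `overlap` frames of each segment into the
    first `overlap` frames of the next one."""
    out = segments[0].astype(np.float32)
    for seg in segments[1:]:
        seg = seg.astype(np.float32)
        # Fade weights go from 0 (keep previous segment) to 1 (use next).
        w = np.linspace(0.0, 1.0, overlap, dtype=np.float32)[:, None, None, None]
        blended = (1.0 - w) * out[-overlap:] + w * seg[:overlap]
        out = np.concatenate([out[:-overlap], blended, seg[overlap:]], axis=0)
    return out
```

Each pair of adjacent segments shares `overlap` frames, so N segments of length F produce `N * F - (N - 1) * overlap` output frames.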
How to Run
The Wan-Animate pipeline requires preprocessed driving videos — a skeleton pose video (DWPose renders) and a cropped face video — rather than raw footage. To generate them from a raw driving video, use the official preprocessing scripts.
Alternatively, pre-processed sample assets are available for download at https://huggingface.co/datasets/squeezebits/diffusion-benchmark.
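One way to fetch those sample assets locally is with the `huggingface_hub` client; the snippet below is a convenience sketch, not part of this PR.

```python
from huggingface_hub import snapshot_download

# Download the pre-processed sample assets (pose/face/background/mask
# videos) from the dataset linked above into the local HF cache.
local_dir = snapshot_download(
    repo_id="squeezebits/diffusion-benchmark",
    repo_type="dataset",
)
print(local_dir)  # path to the downloaded dataset snapshot
```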
Animate Mode
```shell
./bazelw run //max/examples/diffusion:simple_offline_video_generation -- \
  --model Wan-AI/Wan2.2-Animate-14B-Diffusers \
  --input-image <character.jpeg> \
  --pose-video <pose.mp4> \
  --face-video <face.mp4> \
  --prompt "A character moving naturally." \
  --height 480 --width 848 --num-frames 77 \
  --num-inference-steps 40 --seed 42 --num-warmups 1 \
  --guidance-scale 1.0 \
  --output output_animate.mp4
```
Replace Mode
```shell
./bazelw run //max/examples/diffusion:simple_offline_video_generation -- \
  --model Wan-AI/Wan2.2-Animate-14B-Diffusers \
  --input-image <character.jpeg> \
  --pose-video <pose.mp4> \
  --face-video <face.mp4> \
  --prompt "A character moving naturally." \
  --height 480 --width 848 --num-frames 77 \
  --num-inference-steps 40 --seed 42 --num-warmups 1 \
  --guidance-scale 1.0 \
  --animate-mode replace \
  --background-video <background.mp4> \
  --mask-video <mask.mp4> \
  --output output_replace.mp4
```
Checklist
- PR is small and focused — consider splitting larger changes into a sequence of smaller PRs
- I ran `./bazelw run format` to format my changes
- I added or updated tests to cover my changes
- If AI tools assisted with this contribution, I have included an `Assisted-by:` trailer in my commit message or this PR description (see AI Tool Use Policy)
Assisted-by: Claude Code