[MAX] Add Wan-Animate pipeline support by kkimmk · Pull Request #6347 · modular/modular

Summary

This PR adds support for the Wan-Animate pipeline (Wan-AI/Wan2.2-Animate-14B-Diffusers), a video generation model for motion transfer and character replacement built on top of the existing Wan I2V pipeline.

  • CLIP vision encoder: Adds a MAX-native CLIP vision encoder (clip_encoder.py, layers/) used to produce the image conditioning signal. The existing CLIP text encoder is extracted into clip_modulev3 to avoid a naming collision.
  • Wan-Animate transformer: Extends WanTransformerModel with pose injection (Conv3d), CLIP image cross-attention, a face adapter (WanAnimateFaceEncoder), and a motion encoder (WanAnimateMotionEncoder). Transformer layers are refactored into a layers/ subpackage.
  • Wan-Animate pipeline: Adds WanAnimatePipeline extending WanI2VPipeline, supporting animate mode (motion transfer) and replace mode (background-preserving character replacement via mask) with multi-segment processing and temporal overlap.
  • Tokenizer and preprocessing: Extends PixelGenerationTokenizer and adds video_processor.py to handle pose video, face pixels, background video, and mask inputs for the Animate pipeline.
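The multi-segment processing with temporal overlap mentioned above can be sketched roughly as follows. This is a minimal illustration of the windowing scheme, not the pipeline's actual API; the function name and parameters are hypothetical:

```python
def split_segments(num_frames: int, segment_len: int, overlap: int) -> list[tuple[int, int]]:
    """Split a frame range into overlapping segments.

    Each segment shares `overlap` frames with its predecessor so the
    pipeline can blend the shared frames for temporal consistency
    across segment boundaries.
    """
    assert 0 <= overlap < segment_len
    segments = []
    start = 0
    while start < num_frames:
        end = min(start + segment_len, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break
        start = end - overlap
    return segments

# e.g. 77 frames in segments of 32 with a 5-frame overlap
print(split_segments(77, 32, 5))  # [(0, 32), (27, 59), (54, 77)]
```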

How to Run

The Wan-Animate pipeline requires preprocessed driving videos — a skeleton pose video (DWPose renders) and a cropped face video — rather than raw footage. To generate them from a raw driving video, use the official preprocessing scripts.

Alternatively, preprocessed sample assets are available for download at https://huggingface.co/datasets/squeezebits/diffusion-benchmark.

Animate Mode

./bazelw run //max/examples/diffusion:simple_offline_video_generation -- \
      --model Wan-AI/Wan2.2-Animate-14B-Diffusers \
      --input-image <character.jpeg> \
      --pose-video <pose.mp4> \
      --face-video <face.mp4> \
      --prompt "A character moving naturally." \
      --height 480 --width 848 --num-frames 77 \
      --num-inference-steps 40 --seed 42 --num-warmups 1 \
      --guidance-scale 1.0 \
      --output output_animate.mp4
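Wan-family models use a causal 3D VAE with 4x temporal compression, so `--num-frames` should normally be of the form 4k + 1 (77 = 4·19 + 1 above). Assuming this PR inherits that behavior from the existing Wan I2V pipeline, a quick sanity-check helper (my own, not part of the CLI) would be:

```python
def latent_frames(num_frames: int, temporal_stride: int = 4) -> int:
    """Number of latent frames after the VAE's temporal compression,
    assuming the usual (frames - 1) / stride + 1 layout where the
    first frame is encoded on its own."""
    assert (num_frames - 1) % temporal_stride == 0, \
        "num_frames should be of the form stride * k + 1"
    return (num_frames - 1) // temporal_stride + 1

print(latent_frames(77))  # 20 latent frames for the 77-frame example
```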

Replace Mode

./bazelw run //max/examples/diffusion:simple_offline_video_generation -- \
      --model Wan-AI/Wan2.2-Animate-14B-Diffusers \
      --input-image <character.jpeg> \
      --pose-video <pose.mp4> \
      --face-video <face.mp4> \
      --prompt "A character moving naturally." \
      --height 480 --width 848 --num-frames 77 \
      --num-inference-steps 40 --seed 42 --num-warmups 1 \
      --guidance-scale 1.0 \
      --animate-mode replace \
      --background-video <background.mp4> \
      --mask-video <mask.mp4> \
      --output output_replace.mp4
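In replace mode, the background-preserving composite described in the summary amounts conceptually to a per-pixel mask blend between the generated character and the original background video. A NumPy sketch of the idea (not the pipeline's actual implementation, which operates inside the diffusion loop):

```python
import numpy as np

def composite_replace(generated: np.ndarray,
                      background: np.ndarray,
                      mask: np.ndarray) -> np.ndarray:
    """Blend generated character pixels into the background video.

    generated, background: (T, H, W, C) float arrays in [0, 1]
    mask: (T, H, W, 1) float array, 1.0 where the generated character
    should appear, 0.0 where the original background is preserved.
    """
    return mask * generated + (1.0 - mask) * background

# toy example: a single 2x2 frame, left column replaced by the character
gen = np.ones((1, 2, 2, 3))
bg = np.zeros((1, 2, 2, 3))
mask = np.zeros((1, 2, 2, 1))
mask[:, :, 0, :] = 1.0  # character occupies the left column
out = composite_replace(gen, bg, mask)
print(out[0, 0, 0, 0], out[0, 0, 1, 0])  # 1.0 0.0
```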

Checklist

  • PR is small and focused — consider splitting larger changes into a
    sequence of smaller PRs
  • I ran ./bazelw run format to format my changes
  • I added or updated tests to cover my changes
  • If AI tools assisted with this contribution, I have included an
    Assisted-by: trailer in my commit message or this PR description
    (see AI Tool Use Policy)

Assisted-by: Claude Code