[MAX] Refactor + Optimization for Z-Image, Z-Image Turbo Pipeline (modulev3) by byungchul-sqzb · Pull Request #6329 · modular/modular
added 5 commits

April 1, 2026 10:02

…e postprocessing

[Diffusion] Fix Z-Image modulev3 pipeline autoencoder import and image postprocessing

- Fix autoencoder import: use `autoencoders_modulev3` instead of `autoencoders` to match the modulev3 pipeline, resolving a `Buffer.cast` AttributeError
- Fix `_to_numpy` to denormalize VAE output from [-1, 1] to [0, 255] uint8 (BCHW -> BHWC), matching the diffusers reference postprocessing
- Remove stale `type: ignore` comments no longer needed with the correct import

E2E verified (1024x1024, 50 steps, guidance=4.0, seed=42):
- E2E execute: 7535.868 ms
- transformer (100): 5866.055 ms (avg 58.661 ms/step)
- vae.decode: 70.709 ms
- text_encoder (2): 12.520 ms (avg 6.260 ms)
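The `_to_numpy` fix above can be illustrated with a small NumPy sketch of the diffusers-style postprocessing it now matches. The function name and shapes here are illustrative, not the pipeline's actual API:

```python
import numpy as np

def to_numpy_image(vae_out: np.ndarray) -> np.ndarray:
    """Denormalize VAE output from [-1, 1] to [0, 255] uint8 and
    convert BCHW -> BHWC, as in diffusers-style postprocessing."""
    img = (vae_out / 2 + 0.5).clip(0.0, 1.0)       # [-1, 1] -> [0, 1]
    img = (img * 255).round().astype(np.uint8)     # [0, 1] -> [0, 255]
    return img.transpose(0, 2, 3, 1)               # BCHW -> BHWC

# A pure black frame (-1 everywhere) maps to uint8 zeros in BHWC layout.
black = to_numpy_image(np.full((1, 3, 2, 2), -1.0))
```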
…r z_image

Removes the per-step eager `F.mul(noise_pred, -1.0)` from the denoising hot loop and folds the sign flip into the compiled `scheduler_step` graph (`latents + dt * noise_pred` → `latents - dt * noise_pred`). This eliminates 50 Python↔GPU synchronization round-trips per generation, reducing E2E latency by ~424 ms (5913 → 5489 ms) at 1024×1024 / 50 steps.
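The sign-flip fold is numerically a no-op; the win is purely in removing the eager op from the hot loop. A minimal NumPy sketch of the before/after update (illustrative names, not the MAX graph API):

```python
import numpy as np

def scheduler_step_eager(latents, noise_pred, dt):
    # Old hot loop: the eager negation ran outside the compiled graph,
    # forcing a Python<->GPU round-trip every denoising step.
    negated = noise_pred * -1.0          # eager F.mul(noise_pred, -1.0)
    return latents + dt * negated

def scheduler_step_fused(latents, noise_pred, dt):
    # New: sign flip folded into the compiled scheduler_step graph,
    # so each step is a single fused update.
    return latents - dt * noise_pred

rng = np.random.default_rng(0)
lat = rng.standard_normal((1, 4, 8, 8))
eps = rng.standard_normal((1, 4, 8, 8))
```

Both formulations produce identical latents; only the dispatch pattern changes.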
Batch the positive and negative transformer calls into a single batch=2B call, reducing transformer invocations from 100 to 50 per generation (50 steps).

- Add compiled `duplicate_batch` (broadcast-based, no copy) and `cfg_finalize_batched` (split + CFG formula in a single compiled graph)
- Always apply `_align_prompt_seq_len` for both implicit and explicit negative prompts to enable batched CFG in all cases
- Pre-allocate batch=2B timestep tensors and a `guidance_scale` scalar tensor
- Use float32 intermediate precision in the CFG formula, matching the reference implementation, to ensure numerical consistency
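The two helpers named above can be sketched in NumPy. This is a sketch of the batched-CFG idea only; the actual compiled `duplicate_batch` is broadcast-based with no copy, and the pos/neg ordering within the 2B batch is an assumption here:

```python
import numpy as np

def duplicate_batch(latents: np.ndarray) -> np.ndarray:
    # Stack the same latents for the positive and negative prompt
    # halves: batch B -> 2B. np.broadcast_to illustrates the
    # copy-free duplication done in the compiled graph.
    b = latents.shape[0]
    view = np.broadcast_to(latents, (2,) + latents.shape)
    return view.reshape((2 * b,) + latents.shape[1:])

def cfg_finalize_batched(noise_pred_2b: np.ndarray,
                         guidance_scale: float) -> np.ndarray:
    # Split the batch=2B prediction and apply the CFG formula in
    # float32 for numerical consistency with the reference.
    pos, neg = np.split(noise_pred_2b.astype(np.float32), 2, axis=0)
    return neg + guidance_scale * (pos - neg)

pred = np.concatenate([np.full((1, 3), 2.0), np.zeros((1, 3))])
```

With `guidance_scale=1.0` the formula reduces to the positive prediction, which is a quick sanity check on the split.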
…g, eager reduction

- Fused decode: unpack + denorm + decode + uint8 in a single compiled graph, replacing the eager `_postprocess_latents` + `_to_numpy`
- Scheduler caching: pre-computed per-step scalar tensors for timesteps, eliminating eager tensor slicing (50 slices/generation)
- Patchify and pack: combine the 4D→6D reshape + pack into a single compiled op
- `F.chunk` modulation → direct slicing in transformer blocks
- Attention `to_out`: ModuleList → Linear, removing wrapper overhead
- Remove the `build_prepare_scheduler` compiled function
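The `F.chunk` → direct-slicing change is equivalence-preserving; a NumPy sketch (the three-way shift/scale/gate split is an assumption about the modulation layout, common in DiT-style blocks):

```python
import numpy as np

def modulation_chunk(mod: np.ndarray):
    # Old: chunk-style split of the modulation vector along the
    # last axis (here np.split stands in for F.chunk).
    return np.split(mod, 3, axis=-1)

def modulation_slice(mod: np.ndarray):
    # New: direct slicing produces the same three views without the
    # chunk wrapper in the hot path.
    d = mod.shape[-1] // 3
    return mod[..., :d], mod[..., d:2 * d], mod[..., 2 * d:]

mod = np.arange(12.0).reshape(1, 12)
```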
- Compute RoPE frequencies once on the concatenated [img_ids, txt_ids] and slice them for the image/text refiners instead of invoking the embedder twice.
- Produce interleaved [cos, sin] RoPE frequencies directly in `RopeEmbedder` and feed them to `rope_ragged_with_position_ids(interleaved=True)` in attention, removing per-block format conversion from the hot path.
- Cast RoPE frequencies to the model dtype in the transformer preamble so attention no longer performs repeated dtype conversions.
- Remove redundant RoPE rank checks in the compiled path; `input_types` already guarantees 2D position-id tensors.

Profiling (1024×1024 / 50 steps):
- E2E execute: 5870.526 → 5174.589 ms (−695.937 ms)
- component/transformer avg: 114.907 → 100.880 ms/call (−14.027 ms/call)
- component/transformer total: 5745.339 → 5043.993 ms (−701.346 ms)
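The interleaved-frequency change above can be sketched as follows. This is an illustrative NumPy version of producing [cos, sin] pairs already interleaved along the feature axis, so the consumer needs no format conversion; the function name and signature are assumptions, not the `RopeEmbedder` API:

```python
import numpy as np

def rope_interleaved(positions: np.ndarray, dim: int,
                     theta: float = 10000.0) -> np.ndarray:
    """Emit RoPE frequencies with cos/sin interleaved along the last
    axis: [cos0, sin0, cos1, sin1, ...], shape (seq, dim)."""
    inv_freq = theta ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = positions[:, None] * inv_freq[None, :]     # (seq, dim/2)
    out = np.empty((positions.shape[0], dim), dtype=np.float32)
    out[:, 0::2] = np.cos(angles)   # even slots: cos
    out[:, 1::2] = np.sin(angles)   # odd slots:  sin
    return out

# At position 0, every angle is 0: cos slots are 1, sin slots are 0.
freqs = rope_interleaved(np.zeros(4), 8)
```

Computing these once on the concatenated position ids and slicing per consumer avoids a second embedder invocation.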
byungchul-sqzb changed the title from "Bc/z image refactor modulev3" to "[MAX] Refactor + Optimization for Z-Image, Z-Image Turbo Pipeline (modulev3)"