[MAX] Refactor + Optimization for Z-Image, Z-Image Turbo Pipeline (modulev3) by byungchul-sqzb · Pull Request #6329 · modular/modular

added 5 commits

April 1, 2026 10:02
…e postprocessing

BEGIN_PUBLIC
[Diffusion] Fix Z-Image modulev3 pipeline autoencoder import and image postprocessing

- Fix autoencoder import: use autoencoders_modulev3 instead of autoencoders
  to match the modulev3 pipeline, resolving Buffer.cast AttributeError
- Fix _to_numpy to denormalize VAE output from [-1,1] to [0,255] uint8
  (BCHW -> BHWC), matching diffusers reference postprocessing
- Remove stale type: ignore comments no longer needed with correct import
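The denormalization described above can be sketched eagerly in NumPy (function name and exact clipping/rounding details are illustrative; the actual `_to_numpy` lives in the pipeline and may differ):

```python
import numpy as np

def to_numpy(latents: np.ndarray) -> np.ndarray:
    """Hypothetical sketch: map VAE output from [-1, 1] floats to uint8 RGB.

    Assumes `latents` is a float array in BCHW layout; returns BHWC uint8,
    mirroring the diffusers reference postprocessing convention.
    """
    images = (latents / 2 + 0.5).clip(0.0, 1.0)         # [-1, 1] -> [0, 1]
    images = (images * 255.0).round().astype(np.uint8)  # [0, 1] -> [0, 255]
    return images.transpose(0, 2, 3, 1)                 # BCHW -> BHWC
```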

E2E verified (1024x1024, 50 steps, guidance=4.0, seed=42):
  E2E execute:       7535.868 ms
  transformer (100): 5866.055 ms (avg 58.661 ms/step)
  vae.decode:          70.709 ms
  text_encoder (2):    12.520 ms (avg 6.260 ms)
END_PUBLIC
…r z_image

Removes the per-step eager `F.mul(noise_pred, -1.0)` from the denoising hot
loop and folds the sign flip into the compiled `scheduler_step` graph
(`latents + dt * noise_pred` → `latents - dt * noise_pred`).
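In eager NumPy terms, the folded step is a minimal sketch like the following (names are illustrative; the real `scheduler_step` is a compiled MAX graph):

```python
import numpy as np

def scheduler_step(latents: np.ndarray, noise_pred: np.ndarray,
                   dt: float) -> np.ndarray:
    # Sign flip folded into the update: algebraically identical to
    # latents + dt * (noise_pred * -1.0), but without a separate eager
    # multiply (and its sync round-trip) in the denoising hot loop.
    return latents - dt * noise_pred
```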

This eliminates 50 Python↔GPU synchronisation round-trips per generation,
reducing E2E latency by ~424 ms (5913 → 5489 ms) on 1024×1024 / 50 steps.

Batch positive and negative transformer calls into a single batch=2B call,
reducing transformer invocations from 100 to 50 per generation (50 steps).

- Add compiled `duplicate_batch` (broadcast-based, no copy) and
  `cfg_finalize_batched` (split + CFG formula in single compiled graph)
- Always apply `_align_prompt_seq_len` for both implicit and explicit
  negative prompts to enable batched CFG in all cases
- Pre-allocate batch=2B timestep tensors and guidance_scale scalar tensor
- Use float32 intermediate precision in CFG formula matching reference
  implementation to ensure numerical consistency
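The batched-CFG shape of the change above can be sketched as follows (an eager NumPy approximation: the compiled `duplicate_batch` uses a broadcast rather than a real copy, and the ordering of the positive/negative halves here is an assumption):

```python
import numpy as np

def duplicate_batch(x: np.ndarray) -> np.ndarray:
    # Eager stand-in: stack the batch twice so one transformer call covers
    # both the conditional and unconditional branches. The compiled graph
    # achieves this with a broadcast, avoiding the copy made here.
    return np.concatenate([x, x], axis=0)

def cfg_finalize(noise_pred_2b: np.ndarray, guidance_scale: float) -> np.ndarray:
    # Split the batch=2B prediction back into halves and apply the standard
    # classifier-free-guidance formula in float32 for numerical parity with
    # the reference implementation. Conditional-half-first is assumed.
    cond, uncond = np.split(noise_pred_2b.astype(np.float32), 2, axis=0)
    return uncond + guidance_scale * (cond - uncond)
```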
…g, eager reduction

- Fused decode: unpack + denorm + decode + uint8 in single compiled graph,
  replacing eager _postprocess_latents + _to_numpy
- Scheduler caching: pre-computed per-step scalar tensors for timesteps,
  eliminating eager tensor slicing (50 slices/gen)
- Patchify and pack: combine 4D→6D reshape + pack into single compiled op
- F.chunk modulation → direct slicing in transformer blocks
- Attention to_out: ModuleList → Linear, removing wrapper overhead
- Remove build_prepare_scheduler compiled function
- Compute RoPE frequencies once on concatenated [img_ids, txt_ids] and slice them for the image/text refiners instead of invoking the embedder twice.
- Produce interleaved [cos, sin] RoPE frequencies directly in RopeEmbedder and feed them to rope_ragged_with_position_ids(interleaved=True) in attention, removing per-block format conversion in the hot path.
- Cast RoPE frequencies to the model dtype in the transformer preamble so attention no longer performs repeated dtype conversions.
- Remove redundant RoPE rank checks in the compiled path; input_types already guarantee 2D position-id tensors.
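As a small illustration of the `F.chunk` → direct-slicing change above, the modulation split in a transformer block reduces to plain slices (the three-way shift/scale/gate split is an assumption for illustration; the actual block may use a different number of modulation terms):

```python
import numpy as np

def modulation_params(mod: np.ndarray, hidden_dim: int):
    # Direct slices replacing an F.chunk(mod, 3, axis=-1) call in the hot
    # path; each slice is a view, so no wrapper op is materialized.
    shift = mod[..., :hidden_dim]
    scale = mod[..., hidden_dim:2 * hidden_dim]
    gate = mod[..., 2 * hidden_dim:3 * hidden_dim]
    return shift, scale, gate
```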

Profiling (1024x1024 / 50 steps):
- E2E execute: 5870.526 -> 5174.589 ms (-695.937 ms)
- component/transformer avg: 114.907 -> 100.880 ms/call (-14.027 ms/call)
- component/transformer total: 5745.339 -> 5043.993 ms (-701.346 ms)

@byungchul-sqzb byungchul-sqzb changed the title Bc/z image refactor modulev3 [MAX] Refactor + Optimization for Z-Image, Z-Image Turbo Pipeline (modulev3)

Apr 1, 2026
