[MAX] Optimize QwenImage DiT RoPE: graph ops → fused GPU kernel by byungchul-sqzb · Pull Request #6345 · modular/modular


Replace graph-level complex multiplication (~10 ops per call) with the
fused `rope_ragged_with_position_ids` GPU kernel for RoPE application
in QwenImage transformer blocks. This mirrors the approach already used
by flux2 and z-image.
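For context, the graph-level path being replaced amounts to an explicit per-pair complex multiplication. A minimal pure-Python sketch of that math (illustrative only, not the actual MAX graph code; the pairwise cos/sin layout is an assumption):

```python
import math

def rope_complex_multiply(x, cos, sin):
    """Apply RoPE to one head vector via explicit pairwise rotation.

    x:   flat list of even length D; each pair (x[2i], x[2i+1]) is treated
         as a complex number rotated by angle theta_i.
    cos: list of length D // 2 with cos(theta_i) per pair.
    sin: list of length D // 2 with sin(theta_i) per pair.
    """
    out = [0.0] * len(x)
    for i in range(len(x) // 2):
        re, im = x[2 * i], x[2 * i + 1]
        # complex multiply: (re + j*im) * (cos + j*sin)
        out[2 * i] = re * cos[i] - im * sin[i]
        out[2 * i + 1] = re * sin[i] + im * cos[i]
    return out

# Rotating the pair (1, 0) by 90 degrees yields (0, 1); the second
# pair uses angle 0 and passes through unchanged.
x = [1.0, 0.0, 0.0, 1.0]
rotated = rope_complex_multiply(
    x,
    cos=[math.cos(math.pi / 2), 1.0],
    sin=[math.sin(math.pi / 2), 0.0],
)
```

Expressed as individual graph ops (slices, multiplies, adds, concat), this is roughly the ~10 ops per call that the fused kernel collapses into a single launch.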

With 60 dual-stream blocks applying RoPE to both Q and K (120 calls per
step), this eliminates significant kernel launch overhead and intermediate
memory allocations. Measured improvement: average transformer step time
100.6 ms → 81.9 ms (-18.5%).

Changes:
- QwenImagePosEmbed: return interleaved freqs_cis [S, D] instead of
  separate (cos, sin) tuple
- QwenImageAttention: use rope_ragged_with_position_ids kernel
- qwen_image.py: cast freqs_cis to model dtype (z-image pattern)
- Update parity test fixtures for new single-tensor interface
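The interface change in `QwenImagePosEmbed` can be pictured as packing the former separate cos/sin tables into one interleaved `[S, D]` tensor. A standalone sketch under assumptions (hypothetical helper name, standard RoPE base 10000, cos/sin-pair interleaving; the real layer emits a graph tensor, not Python lists):

```python
import math

def make_freqs_cis_interleaved(positions, dim):
    """Build an interleaved [S, D] freqs_cis table.

    For each position, the last axis holds pairs
    (cos(theta_i), sin(theta_i)) for i in range(dim // 2),
    instead of two separate [S, D // 2] cos and sin tensors.
    """
    half = dim // 2
    table = []
    for pos in positions:
        row = []
        for i in range(half):
            # Standard RoPE frequency schedule (base assumed to be 10000).
            theta = pos / (10000.0 ** (2 * i / dim))
            row.extend([math.cos(theta), math.sin(theta)])
        table.append(row)
    return table

# The old (cos, sin) tuple is recoverable by slicing even/odd columns,
# so parity fixtures only need the layout updated, not the values.
freqs_cis = make_freqs_cis_interleaved(positions=[0, 1, 2], dim=4)
cos_row = freqs_cis[1][0::2]
sin_row = freqs_cis[1][1::2]
```

A single interleaved tensor also matches what a fused kernel like `rope_ragged_with_position_ids` can consume in one contiguous read, rather than gathering from two separate buffers.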