[MAX] Optimize QwenImage DiT RoPE: graph ops → fused GPU kernel by byungchul-sqzb · Pull Request #6345 · modular/modular
Replace the graph-level complex multiplication (~10 ops per call) with the fused `rope_ragged_with_position_ids` GPU kernel for RoPE application in QwenImage transformer blocks, mirroring the approach already used by flux2 and z-image. With 60 dual-stream blocks applying RoPE to both Q and K (120 calls per step), this eliminates significant kernel launch overhead and intermediate memory allocations.

Measured improvement: average transformer step time 100.6 ms → 81.9 ms (-18.5%).

Changes:
- `QwenImagePosEmbed`: return interleaved `freqs_cis` of shape `[S, D]` instead of a separate `(cos, sin)` tuple
- `QwenImageAttention`: use the `rope_ragged_with_position_ids` kernel
- `qwen_image.py`: cast `freqs_cis` to the model dtype (z-image pattern)
- Update parity test fixtures for the new single-tensor interface
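For reviewers unfamiliar with the pattern being replaced: RoPE treats each adjacent pair of channels as a complex number and rotates it by a position-dependent angle. A minimal numpy sketch of that reference semantics is below, assuming an interleaved `(cos, sin)` layout for the single `freqs_cis` tensor; the names `rope_reference` and `make_freqs_cis` are illustrative only and are not the MAX kernel or this PR's code. The point of the PR is that the per-pair multiply/add arithmetic shown here, when expressed as separate graph ops, costs ~10 kernel launches per call, whereas the fused kernel does it in one.

```python
import numpy as np

def rope_reference(x: np.ndarray, freqs_cis: np.ndarray) -> np.ndarray:
    """Apply rotary embeddings to x of shape [S, D] (D even).

    freqs_cis is interleaved [S, D]: (cos0, sin0, cos1, sin1, ...) per
    position, i.e. a single tensor rather than a separate (cos, sin) tuple.
    Layout here is an illustrative assumption, not necessarily the kernel's.
    """
    cos = freqs_cis[:, 0::2]              # [S, D/2]
    sin = freqs_cis[:, 1::2]              # [S, D/2]
    x_c = x[:, 0::2] + 1j * x[:, 1::2]    # view channel pairs as complex numbers
    rot = cos + 1j * sin                  # unit rotations e^{i*theta}
    y_c = x_c * rot                       # the complex multiply the kernel fuses
    out = np.empty_like(x)
    out[:, 0::2] = y_c.real
    out[:, 1::2] = y_c.imag
    return out

def make_freqs_cis(S: int, D: int, base: float = 10000.0) -> np.ndarray:
    """Build interleaved rotation factors for seq length S and head dim D."""
    inv_freq = 1.0 / (base ** (np.arange(0, D, 2) / D))  # [D/2] frequencies
    angles = np.outer(np.arange(S), inv_freq)            # [S, D/2] theta = pos * freq
    freqs = np.empty((S, D))
    freqs[:, 0::2] = np.cos(angles)
    freqs[:, 1::2] = np.sin(angles)
    return freqs
```

Since each pair is multiplied by a unit complex number, the rotation is norm-preserving and is the identity at position 0 (all angles zero), which makes for easy sanity checks in parity tests.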