Optimize the SM100 softmax GPU path by prsabahrami · Pull Request #6335 · modular/modular
Summary
- replace the SM100 softmax GPU path with the preserved optimized variant from the best checkpoint
- use vectorized online max/sum accumulation with `block.max` and `block.sum`
- enable the higher-throughput launch configuration and PDL attributes for the GPU kernel path
Benchmark Notes
- rerun command: `./bazelw run //max/kernels/benchmarks/autotune:kbench -- max/kernels/benchmarks/gpu/nn/bench_softmax.yaml --skip-clock-check`
- in this rerun, both `base.csv` and `pr.csv` reported only zeroed rows across the suite (`met (ms)=0`, `iters=0`)
- the corresponding log tails also showed `0 ns` for the tested shapes
- as run here, this benchmark target does not support a measurable before/after claim for this branch pair, so this PR body does not claim a verified speedup from the rerun
Copilot AI review requested due to automatic review settings
April 2, 2026 01:36
Pull request overview
This PR replaces the existing GPU softmax implementation with an optimized variant aimed at higher SM100 throughput, using vectorized online max/sum accumulation and block-level reductions, and enabling PDL launch attributes.
Changes:
- Added a `softmax_gpu(...)` convenience wrapper that directly dispatches the GPU path.
- Reworked `softmax_kernel` to use an online max/sum algorithm with `block.max`/`block.sum`, plus a vectorized fast path.
- Increased the GPU launch block size and enabled PDL launch attributes for the kernel enqueue.
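The online max/sum scheme at the heart of the rework can be sketched in plain Python (the function name and the scalar loop are illustrative only; the actual kernel does this per thread on SIMD vectors and then combines partials with `block.max`/`block.sum`):

```python
import math

def online_softmax(row):
    """One-pass online max/sum: each new element updates the running max
    and rescales the previously accumulated sum to the new max."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in row:
        new_max = max(running_max, x)
        # Rescale the old sum so all terms share the new max.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += math.exp(x - new_max)
        running_max = new_max
    # Second pass: normalize against the final max and sum.
    return [math.exp(x - running_max) / running_sum for x in row]

out = online_softmax([1.0, 2.0, 3.0])
assert abs(sum(out) - 1.0) < 1e-9
```

The rescale step is what lets the max and sum be accumulated in a single sweep instead of the classic two-pass (max, then sum) formulation.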
Comment on lines +744 to +747
```mojo
comptime if sink:
    sink_val = sink_weights[Int(row_idx % UInt(shape[0]))].cast[
        accum_type
    ]()
```
`sink_weights` is now an `UnsafePointer`, but when `sink` is enabled the indexing uses `row_idx % shape[0]`. For common call sites (e.g., attention where `shape[0] = batch_size * num_heads` but `sink_weights` is length `num_heads`), this can read out-of-bounds and also selects the wrong sink weight for a given row. Consider indexing via the computed `row_coords` (e.g., derive the head/batch-head index) and/or pass the `sink_weights` length (or keep a `LayoutTensor` wrapper with runtime layout) so bounds and mapping are correct.
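The indexing concern can be made concrete with a small Python sketch (the sizes and the row-major `(batch, head)` row layout are assumptions for illustration, not taken from the kernel):

```python
# Assumed sizes: rows correspond to (batch, head) pairs, but
# sink_weights holds one entry per head, not one per row.
batch_size, num_heads = 2, 4
shape0 = batch_size * num_heads              # shape[0] as seen by the kernel: 8
sink_weights = [0.1 * h for h in range(num_heads)]  # length 4, not 8

for row_idx in range(shape0):
    kernel_idx = row_idx % shape0            # what the kernel computes: just row_idx
    head_idx = row_idx % num_heads           # mapping derived from the row coords
    assert head_idx < len(sink_weights)      # always in bounds
    if row_idx >= num_heads:
        # The kernel's index walks past the end of sink_weights.
        assert kernel_idx >= len(sink_weights)
```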
Comment on lines +877 to +880
```mojo
var sink_ptr = UnsafePointer[Scalar[sink_type], ImmutAnyOrigin]()
if sink_weights:
    sink_ptr = sink_weights.value().ptr
```
When `sink` is enabled, `_softmax_gpu` will launch the kernel even if `sink_weights` is `None`, leaving `sink_ptr` as a null pointer. The kernel then dereferences `sink_weights[...]`, which can cause a GPU fault. Add a runtime assertion/early error when `sink == True` and `sink_weights` is not provided (similar to other attention kernels), or make the kernel guard the dereference on a non-null pointer.
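The suggested guard amounts to failing fast on the host before the launch; a Python sketch of the pattern (function name and signature are hypothetical, not the kernel's API):

```python
def launch_softmax(sink: bool, sink_weights=None) -> str:
    """Reject an inconsistent configuration on the host rather than
    letting the GPU kernel dereference a null sink pointer."""
    if sink and sink_weights is None:
        raise ValueError("sink=True requires sink_weights to be provided")
    # ... enqueue the kernel here ...
    return "launched"

assert launch_softmax(False) == "launched"
assert launch_softmax(True, sink_weights=[0.5]) == "launched"
```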
```mojo
var row_size = UInt(shape[axis])
var num_rows = UInt(shape.flattened_length()) // row_size
var tid = Int(thread_idx.x)
var use_vectorized = row_size >= UInt(4 * BLOCK_SPAN)
```
Current GPU tests for `_softmax_gpu` appear to exercise only the scalar branch (`use_vectorized` is false with the existing test shapes, e.g. last-dim 515). Since correctness now depends on the new vectorized online max/sum + vectorized normalize path, add a test case with a sufficiently large inner dimension to force `use_vectorized == true` and compare against the CPU/reference implementation.
Suggested change:

```diff
- var use_vectorized = row_size >= UInt(4 * BLOCK_SPAN)
+ var use_vectorized = row_size >= UInt(BLOCK_SPAN)
```
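The threshold arithmetic behind this comment can be checked with a quick Python sketch. The `BLOCK_SPAN` value of 256 here is a hypothetical placeholder (substitute the kernel's actual constant); the point is that a last dim of 515 sits below `4 * BLOCK_SPAN` but above `BLOCK_SPAN`:

```python
BLOCK_SPAN = 256  # hypothetical value; substitute the kernel's real constant

def use_vectorized(row_size: int, factor: int = 4) -> bool:
    # Mirrors: row_size >= UInt(factor * BLOCK_SPAN)
    return row_size >= factor * BLOCK_SPAN

# The existing test's last dim of 515 never reaches the vectorized path...
assert not use_vectorized(515)
# ...while a larger inner dimension would exercise it,
assert use_vectorized(4096)
# as would the suggested lower threshold (factor of 1).
assert use_vectorized(515, factor=1)
```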
Comment on lines +681 to +705
```mojo
def softmax_gpu[
    dtype: DType,
    simd_width: Int,
    rank: Int,
](
    ctx: DeviceContext,
    input: TileTensor[dtype, ...],
    output: TileTensor[mut=True, dtype, ...],
    axis: Int,
) raises:
    @parameter
    @always_inline
    def input_fn[
        _simd_width: Int, _rank: Int
    ](coords: IndexList[_rank]) -> SIMD[dtype, _simd_width]:
        return input.load_linear[width=_simd_width, alignment=1](coords)

    _softmax_gpu[dtype, simd_width, rank, input_fn](
        rebind[IndexList[rank]](
            coord_to_index_list(input.layout.shape_coord())
        ),
        output,
        axis,
        ctx,
    )
```
`softmax_gpu` bypasses the `softmax(...)` wrapper's empty-tensor early-exit. If any dimension is 0, `_softmax_gpu` will compute `num_blocks = 0` and call `ctx.enqueue_function(... grid_dim=0 ...)`, which other GPU code explicitly avoids. Add an early return in `softmax_gpu` (or `_softmax_gpu`) when `shape.flattened_length() == 0` (and/or when the reduced axis length is 0) to prevent invalid launches/div-by-zero in the kernel.
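The suggested early-exit can be sketched in Python (the function and parameter names are illustrative; `enqueue` stands in for the real device-context launch call):

```python
def flattened_length(shape):
    """Product of all dimensions; 0 if any dimension is 0."""
    n = 1
    for d in shape:
        n *= d
    return n

def softmax_gpu_guarded(shape, axis, enqueue):
    """Skip the launch entirely for empty tensors, avoiding both a
    grid_dim=0 enqueue and a division by a zero-length reduced axis."""
    if flattened_length(shape) == 0 or shape[axis] == 0:
        return  # nothing to do; mirrors the softmax(...) wrapper's early exit
    row_size = shape[axis]
    num_rows = flattened_length(shape) // row_size
    enqueue(grid_dim=num_rows)

launches = []
softmax_gpu_guarded((0, 8), 1, lambda **kw: launches.append(kw))  # skipped
softmax_gpu_guarded((2, 8), 1, lambda **kw: launches.append(kw))  # launches
assert launches == [{"grid_dim": 2}]
```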