Optimize the SM100 softmax GPU path by prsabahrami · Pull Request #6335 · modular/modular
Summary
- replace the SM100 softmax GPU path with the preserved optimized variant from the best checkpoint
- use vectorized online max/sum accumulation with `block.max` and `block.sum`
- enable the higher-throughput launch configuration and PDL attributes for the GPU kernel path
Benchmark Notes
- rerun command: `./bazelw run //max/kernels/benchmarks/autotune:kbench -- max/kernels/benchmarks/gpu/nn/bench_softmax.yaml --skip-clock-check`
- in this rerun, both `base.csv` and `pr.csv` reported only zeroed rows across the suite (`met (ms)=0`, `iters=0`)
- the corresponding log tails also showed `0 ns` for the tested shapes
- as run here, this benchmark target does not support a measurable before/after claim for this branch pair, so this PR body does not claim a verified speedup from the rerun
Copilot AI review requested due to automatic review settings
April 2, 2026 01:36
Pull request overview
This PR replaces the existing GPU softmax implementation with an optimized variant aimed at higher SM100 throughput, using vectorized online max/sum accumulation and block-level reductions, and enabling PDL launch attributes.
Changes:
- Added a `softmax_gpu(...)` convenience wrapper that directly dispatches the GPU path.
- Reworked `softmax_kernel` to use an online max/sum algorithm with `block.max`/`block.sum`, plus a vectorized fast path.
- Increased the GPU launch block size and enabled PDL launch attributes for the kernel enqueue.
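The online max/sum scheme at the heart of the rework can be sketched in plain Python (the function name and the scalar loop are illustrative only; the actual kernel does this per thread on SIMD vectors and then combines partials with `block.max`/`block.sum`):

```python
import math

def online_softmax(row):
    """One-pass online max/sum: each new element updates the running max
    and rescales the previously accumulated sum to the new max."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in row:
        new_max = max(running_max, x)
        # Rescale the old sum so all terms share the new max.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += math.exp(x - new_max)
        running_max = new_max
    # Second pass: normalize against the final max and sum.
    return [math.exp(x - running_max) / running_sum for x in row]

out = online_softmax([1.0, 2.0, 3.0])
assert abs(sum(out) - 1.0) < 1e-9
```

The rescale step is what lets the max and sum be accumulated in a single sweep instead of the classic two-pass (max, then sum) formulation.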
Comment on lines +744 to +747
```mojo
comptime if sink:
    sink_val = sink_weights[Int(row_idx % UInt(shape[0]))].cast[
        accum_type
    ]()
```
`sink_weights` is now an `UnsafePointer`, but when `sink` is enabled the indexing uses `row_idx % shape[0]`. For common call sites (e.g., attention where `shape[0] = batch_size * num_heads` but `sink_weights` is length `num_heads`), this can read out-of-bounds and also selects the wrong sink weight for a given row. Consider indexing via the computed `row_coords` (e.g., derive the head/batch-head index) and/or pass the `sink_weights` length (or keep a `LayoutTensor` wrapper with runtime layout) so bounds and mapping are correct.
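The indexing concern can be made concrete with a small Python sketch (the sizes and the row-major `(batch, head)` row layout are assumptions for illustration, not taken from the kernel):

```python
# Assumed sizes: rows correspond to (batch, head) pairs, but
# sink_weights holds one entry per head, not one per row.
batch_size, num_heads = 2, 4
shape0 = batch_size * num_heads              # shape[0] as seen by the kernel: 8
sink_weights = [0.1 * h for h in range(num_heads)]  # length 4, not 8

for row_idx in range(shape0):
    kernel_idx = row_idx % shape0            # what the kernel computes: just row_idx
    head_idx = row_idx % num_heads           # mapping derived from the row coords
    assert head_idx < len(sink_weights)      # always in bounds
    if row_idx >= num_heads:
        # The kernel's index walks past the end of sink_weights.
        assert kernel_idx >= len(sink_weights)
```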
Comment on lines +877 to +880
```mojo
var sink_ptr = UnsafePointer[Scalar[sink_type], ImmutAnyOrigin]()
if sink_weights:
    sink_ptr = sink_weights.value().ptr
```
When `sink` is enabled, `_softmax_gpu` will launch the kernel even if `sink_weights` is `None`, leaving `sink_ptr` as a null pointer. The kernel then dereferences `sink_weights[...]`, which can cause a GPU fault. Add a runtime assertion/early error when `sink == True` and `sink_weights` is not provided (similar to other attention kernels), or make the kernel guard the dereference on a non-null pointer.
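The suggested guard amounts to failing fast on the host before the launch; a Python sketch of the pattern (function name and signature are hypothetical, not the kernel's API):

```python
def launch_softmax(sink: bool, sink_weights=None) -> str:
    """Reject an inconsistent configuration on the host rather than
    letting the GPU kernel dereference a null sink pointer."""
    if sink and sink_weights is None:
        raise ValueError("sink=True requires sink_weights to be provided")
    # ... enqueue the kernel here ...
    return "launched"

assert launch_softmax(False) == "launched"
assert launch_softmax(True, sink_weights=[0.5]) == "launched"
```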
```mojo
var row_size = UInt(shape[axis])
var num_rows = UInt(shape.flattened_length()) // row_size
var tid = Int(thread_idx.x)
var use_vectorized = row_size >= UInt(4 * BLOCK_SPAN)
```
Current GPU tests for `_softmax_gpu` appear to exercise only the scalar branch (`use_vectorized` is false with the existing test shapes, e.g. last-dim 515). Since correctness now depends on the new vectorized online max/sum + vectorized normalize path, add a test case with a sufficiently large inner dimension to force `use_vectorized == true` and compare against the CPU/reference implementation.
Suggested change:

```diff
- var use_vectorized = row_size >= UInt(4 * BLOCK_SPAN)
+ var use_vectorized = row_size >= UInt(BLOCK_SPAN)
```
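The threshold arithmetic behind this comment can be checked with a quick Python sketch. The `BLOCK_SPAN` value of 256 here is a hypothetical placeholder (substitute the kernel's actual constant); the point is that a last dim of 515 sits below `4 * BLOCK_SPAN` but above `BLOCK_SPAN`:

```python
BLOCK_SPAN = 256  # hypothetical value; substitute the kernel's real constant

def use_vectorized(row_size: int, factor: int = 4) -> bool:
    # Mirrors: row_size >= UInt(factor * BLOCK_SPAN)
    return row_size >= factor * BLOCK_SPAN

# The existing test's last dim of 515 never reaches the vectorized path...
assert not use_vectorized(515)
# ...while a larger inner dimension would exercise it,
assert use_vectorized(4096)
# as would the suggested lower threshold (factor of 1).
assert use_vectorized(515, factor=1)
```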
Comment on lines +681 to +705
```mojo
def softmax_gpu[
    dtype: DType,
    simd_width: Int,
    rank: Int,
](
    ctx: DeviceContext,
    input: TileTensor[dtype, ...],
    output: TileTensor[mut=True, dtype, ...],
    axis: Int,
) raises:
    @parameter
    @always_inline
    def input_fn[
        _simd_width: Int, _rank: Int
    ](coords: IndexList[_rank]) -> SIMD[dtype, _simd_width]:
        return input.load_linear[width=_simd_width, alignment=1](coords)

    _softmax_gpu[dtype, simd_width, rank, input_fn](
        rebind[IndexList[rank]](
            coord_to_index_list(input.layout.shape_coord())
        ),
        output,
        axis,
        ctx,
    )
```
`softmax_gpu` bypasses the `softmax(...)` wrapper's empty-tensor early-exit. If any dimension is 0, `_softmax_gpu` will compute `num_blocks = 0` and call `ctx.enqueue_function(... grid_dim=0 ...)`, which other GPU code explicitly avoids. Add an early return in `softmax_gpu` (or `_softmax_gpu`) when `shape.flattened_length() == 0` (and/or when the reduced axis length is 0) to prevent invalid launches/div-by-zero in the kernel.
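The suggested early-exit can be sketched in Python (the function and parameter names are illustrative; `enqueue` stands in for the real device-context launch call):

```python
def flattened_length(shape):
    """Product of all dimensions; 0 if any dimension is 0."""
    n = 1
    for d in shape:
        n *= d
    return n

def softmax_gpu_guarded(shape, axis, enqueue):
    """Skip the launch entirely for empty tensors, avoiding both a
    grid_dim=0 enqueue and a division by a zero-length reduced axis."""
    if flattened_length(shape) == 0 or shape[axis] == 0:
        return  # nothing to do; mirrors the softmax(...) wrapper's early exit
    row_size = shape[axis]
    num_rows = flattened_length(shape) // row_size
    enqueue(grid_dim=num_rows)

launches = []
softmax_gpu_guarded((0, 8), 1, lambda **kw: launches.append(kw))  # skipped
softmax_gpu_guarded((2, 8), 1, lambda **kw: launches.append(kw))  # launches
assert launches == [{"grid_dim": 2}]
```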