Add blocksize=64 4-bit quantization support for ROCm CDNA (warp64) GPUs by Abdennacer-Badaoui · Pull Request #1856 · bitsandbytes-foundation/bitsandbytes

Description:

Summary

  • Add kQuantizeBlockwise64 kernel that supports blocksize=64 with 4-bit quantization (FP4/NF4) on both warp32 (RDNA) and warp64 (CDNA) hardware
  • Previously, blocksize=64 for 4-bit was only supported on consumer RDNA GPUs (warp size 32). Data center CDNA GPUs (MI300, MI325) could not use it because the existing kernel launches threads == blocksize/2 = 32 threads per block, which underutilizes the 64-wide wavefront
  • The new kernel processes 2 quantization blocks of 64 values per thread block using 64 threads, with logical warps of 32 (WarpReduce<float, 32>) to perform independent reductions per block
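The thread layout described above can be modeled in a few lines. The following is a plain-NumPy sketch (not the actual HIP kernel): one 64-thread block covers two quantization blocks of 64 values, threads 0-31 form logical warp 0 and threads 32-63 form logical warp 1, each thread loads 2 contiguous values, and each logical warp computes the absmax of its own quantization block independently.

```python
import numpy as np

def simulate_thread_block(values):
    """values: the 128 floats (two 64-value quantization blocks)
    handled by one 64-thread block."""
    absmax = np.zeros(2, dtype=values.dtype)
    for tid in range(64):             # 64 threads per thread block
        warp = tid // 32              # logical warp id == quantization block id
        lane = tid % 32               # lane within the logical warp
        base = warp * 64 + lane * 2   # each thread loads 2 contiguous values
        local_max = np.abs(values[base:base + 2]).max()
        # In the kernel the 32 lane-local maxima are combined with
        # WarpReduce<float, 32>; here we just accumulate sequentially.
        absmax[warp] = max(absmax[warp], local_max)
    return absmax

rng = np.random.default_rng(42)
vals = rng.standard_normal(128).astype(np.float32)
ref = np.abs(vals.reshape(2, 64)).max(axis=1)
assert np.allclose(simulate_thread_block(vals), ref)
```

Because the two reductions never exchange data, splitting the 64-wide wavefront into two logical 32-lane warps keeps all 64 threads busy while preserving per-block absmax semantics.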

Quick comparison

Test configuration:

Device: AMD Instinct MI325X VF
PyTorch: 2.8.0+rocm7.1.0.git7a520360
HIP: 7.1.25424-4179531dcd

FP4 Quantization Error (Mean Absolute Error)

| Shape | Blocksize=128 | Blocksize=64 | Error Reduction |
| --- | --- | --- | --- |
| 1K x 1K | 0.102941 | 0.096551 | +6.2% |
| 2K x 2K | 0.102949 | 0.096549 | +6.2% |
| 4K x 4K | 0.102950 | 0.096545 | +6.2% |
| 8K x 4K | 0.102948 | 0.096545 | +6.2% |
| 4K x 11K (LLaMA FFN) | 0.102948 | 0.096545 | +6.2% |
| 4K x 14K (LLaMA2 FFN) | 0.102946 | 0.096545 | +6.2% |

NF4 Quantization Error (Mean Absolute Error)

| Shape | Blocksize=128 | Blocksize=64 | Error Reduction |
| --- | --- | --- | --- |
| 1K x 1K | 0.076826 | 0.072796 | +5.2% |
| 2K x 2K | 0.076834 | 0.072794 | +5.3% |
| 4K x 4K | 0.076836 | 0.072794 | +5.3% |
| 8K x 4K | 0.076836 | 0.072796 | +5.3% |
| 4K x 11K (LLaMA FFN) | 0.076835 | 0.072796 | +5.3% |
| 4K x 14K (LLaMA2 FFN) | 0.076835 | 0.072796 | +5.3% |
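The direction of these results can be reproduced on CPU without the GPU kernel. Below is a rough NumPy sanity check of blockwise NF4 quantization (the 16-level codebook is the one from the QLoRA paper, which bitsandbytes ships): smaller blocks track the local absmax more tightly, so blocksize=64 should yield a lower mean absolute error than blocksize=128. Exact values will differ from the table above, which was measured on real weight shapes with the GPU path.

```python
import numpy as np

# 16-level NF4 codebook (normalized quantiles of a standard normal,
# from the QLoRA paper).
NF4 = np.array([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
], dtype=np.float32)

def nf4_roundtrip(x, blocksize):
    """Blockwise quantize x to the nearest NF4 code, then dequantize."""
    blocks = x.reshape(-1, blocksize)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)   # per-block scale
    codes = np.abs(blocks[:, :, None] / absmax[:, :, None] - NF4).argmin(axis=2)
    return (NF4[codes] * absmax).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(64 * 1024).astype(np.float32)
mae = {bs: np.abs(w - nf4_roundtrip(w, bs)).mean() for bs in (64, 128)}
print(mae)  # expect mae[64] < mae[128]
```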