Add blocksize=64 4-bit quantization support for ROCm CDNA (warp64) GPUs by Abdennacer-Badaoui · Pull Request #1856 · bitsandbytes-foundation/bitsandbytes
Description:
Summary
- Add a `kQuantizeBlockwise64` kernel that supports blocksize=64 with 4-bit quantization (FP4/NF4) on both warp32 (RDNA) and warp64 (CDNA) hardware
- Previously, blocksize=64 for 4-bit was only supported on consumer RDNA GPUs (warp size 32). Data center CDNA GPUs (MI300, MI325) could not use it because the existing kernel requires threads == blocksize/2 = 32, which underutilizes the 64-wide wavefront
- The new kernel processes 2 quantization blocks of 64 values per thread block using 64 threads, with logical warps of 32 (`WarpReduce<float, 32>`) to perform independent reductions per block
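For reviewers who want to see the thread layout, here is a minimal Python simulation of the mapping described above. It is a sketch, not the HIP kernel source. The constant and function names are hypothetical. It assumes each thread handles two adjacent values (consistent with threads == blocksize/2), so threads 0..31 cover the first 64-value block and threads 32..63 cover the second.

```python
# Illustrative simulation only; this is not the HIP kernel source.
THREADS = 64            # threads per thread block (one CDNA wavefront)
QBLOCK = 64             # quantization blocksize
VALUES_PER_THREAD = 2   # assumed: two 4-bit codes are packed into one byte
LOGICAL_WARP = 32       # width of each independent reduction


def block_of_thread(t):
    """Quantization block (0 or 1) that thread t contributes to."""
    return (t * VALUES_PER_THREAD) // QBLOCK


def absmax_per_block(values):
    """Return one absmax per 64-value quantization block of a 128-value tile."""
    assert len(values) == THREADS * VALUES_PER_THREAD
    # Each thread loads two adjacent values and keeps its local absmax.
    local = [
        max(abs(values[t * VALUES_PER_THREAD]),
            abs(values[t * VALUES_PER_THREAD + 1]))
        for t in range(THREADS)
    ]
    # Threads 0..31 reduce together and threads 32..63 reduce together,
    # mimicking two independent 32-wide logical warp reductions.
    absmax = []
    for w in range(THREADS // LOGICAL_WARP):
        lanes = local[w * LOGICAL_WARP:(w + 1) * LOGICAL_WARP]
        absmax.append(max(lanes))
    return absmax
```

Because each reduction stays inside a 32-lane group, the two quantization blocks never mix their absmax values even though they share one 64-wide wavefront.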
Quick comparison
Test configuration:
Device: AMD Instinct MI325X VF
PyTorch: 2.8.0+rocm7.1.0.git7a520360
HIP: 7.1.25424-4179531dcd
FP4 Quantization Error (Mean Absolute Error)
| Shape | Blocksize=128 | Blocksize=64 | Error Reduction |
|---|---|---|---|
| 1K x 1K | 0.102941 | 0.096551 | +6.2% |
| 2K x 2K | 0.102949 | 0.096549 | +6.2% |
| 4K x 4K | 0.102950 | 0.096545 | +6.2% |
| 8K x 4K | 0.102948 | 0.096545 | +6.2% |
| 4K x 11K (LLaMA FFN) | 0.102948 | 0.096545 | +6.2% |
| 4K x 14K (LLaMA2 FFN) | 0.102946 | 0.096545 | +6.2% |
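The "Error Reduction" column appears to be the relative drop in mean absolute error when moving from blocksize=128 to blocksize=64. The formula below is an assumption inferred from the table values, which it reproduces. The function name is hypothetical.

```python
def error_reduction_pct(mae_128, mae_64):
    """Relative MAE reduction of blocksize=64 versus blocksize=128, in percent."""
    return (mae_128 - mae_64) / mae_128 * 100.0


# FP4, 1K x 1K row from the table above.
fp4_1k = round(error_reduction_pct(0.102941, 0.096551), 1)
print(fp4_1k)  # 6.2
```

The same calculation applies to the NF4 table.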
NF4 Quantization Error (Mean Absolute Error)
| Shape | Blocksize=128 | Blocksize=64 | Error Reduction |
|---|---|---|---|
| 1K x 1K | 0.076826 | 0.072796 | +5.2% |
| 2K x 2K | 0.076834 | 0.072794 | +5.3% |
| 4K x 4K | 0.076836 | 0.072794 | +5.3% |
| 8K x 4K | 0.076836 | 0.072796 | +5.3% |
| 4K x 11K (LLaMA FFN) | 0.076835 | 0.072796 | +5.3% |
| 4K x 14K (LLaMA2 FFN) | 0.076835 | 0.072796 | +5.3% |
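For intuition on why a smaller blocksize lowers the error, here is a rough pure-Python sketch of blockwise absmax quantization. It is not the benchmark script used for the tables. Several parts are assumptions: the NF4 levels are rounded values recalled from the QLoRA reference table, rounding is to the nearest level, and the input is standard normal. All helper names are hypothetical.

```python
import random

# Assumed NF4 code values, rounded to 4 decimals.
NF4 = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
       0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]


def quant_dequant(values, blocksize, code=NF4):
    """Quantize each block against its own absmax, then dequantize."""
    out = []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        # Guard against an all-zero block to avoid division by zero.
        absmax = max(abs(v) for v in block) or 1.0
        for v in block:
            nearest = min(code, key=lambda c: abs(c - v / absmax))
            out.append(nearest * absmax)
    return out


def mean_abs_error(values, blocksize):
    """Mean absolute error between the input and its round trip."""
    deq = quant_dequant(values, blocksize)
    return sum(abs(a - b) for a, b in zip(values, deq)) / len(values)


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(1 << 14)]
mae_128 = mean_abs_error(weights, 128)
mae_64 = mean_abs_error(weights, 64)
print(f"blocksize=128: {mae_128:.6f}  blocksize=64: {mae_64:.6f}")
```

A 64-value block usually has a smaller absmax than the 128-value block containing it, so the 16 levels are spread over a narrower range and the rounding error shrinks.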