[XPU] Implemented 32bit optimizers in triton by YangKai0616 · Pull Request #1710

[XPU] Implemented 32bit optimizers in triton by YangKai0616 · Pull Request #1710 · bitsandbytes-foundation/bitsandbytes

Depends on #1692.

Implemented 32bit optimizers in triton to use of XPU devices.

The PR includes two implementations:

Pure Torch implementation: utilizing torch.compile
Pure Triton implementation: utilizing triton.jit

For the benchmarking on 4096*4096 shapes, the results are as follows:

Pure Torch implementation:

Torch step (eager): 1.075ms
BNB step: 0.516ms
Torch step (eager): 1.058ms
BNB step: 0.517ms
Torch step (eager): 1.080ms
BNB step: 0.527ms
Torch step (eager): 1.069ms
BNB step: 0.539ms
Torch step (eager): 1.034ms
BNB step: 0.526ms

Pure Triton implementation:

Torch step (eager): 1.034ms
BNB step: 0.524ms
Torch step (eager): 1.054ms
BNB step: 0.488ms
Torch step (eager): 1.031ms
BNB step: 0.526ms
Torch step (eager): 1.047ms
BNB step: 0.538ms
Torch step (eager): 1.045ms
BNB step: 0.489ms

For the benchmarking on 1024*1024 shapes, the results are as follows:
Pure Torch implementation:

Torch step (eager): 0.345ms
BNB step: 0.335ms
Torch step (eager): 0.354ms
BNB step: 0.226ms
Torch step (eager): 0.347ms
BNB step: 0.227ms
Torch step (eager): 0.358ms
BNB step: 0.232ms
Torch step (eager): 0.349ms
BNB step: 0.225ms

Pure Triton implementation:

Torch step (eager): 0.346ms
BNB step: 0.226ms
Torch step (eager): 0.337ms
BNB step: 0.216ms
Torch step (eager): 0.338ms
BNB step: 0.215ms
Torch step (eager): 0.333ms
BNB step: 0.226ms
Torch step (eager): 0.349ms
BNB step: 0.235ms

The test platform is Intel(R) Data Center GPU Max 1550. Test script reference #1692. Torch(eager) is 32bit optimizer from torch, BNB is 32bit optimizer.

Considering that the performance gap between torch.compile and Triton implementations is not significant, but triton's implementation compiles faster, and #1692 was implemented with Triton, this PR adopts the Triton version for submission.

Note:Currently, XPU does not support the allocation of memory buffers using a paging mechanism. Therefore, these tests are skipped in tests/test_optim.py::test_optimizer32bit. This functionality will be developed in the future to support full optimizer capabilities.