[MLAS] Fix Lut GEMM Flakiness and Accuracy by tianleiwu · Pull Request #27216 · microsoft/onnxruntime


tianleiwu changed the title from "[MLAS] Fix Lut GEMM" to "[MLAS] Fix Lut GEMM Flakiness and Accuracy"

Jan 31, 2026


tianleiwu added a commit that referenced this pull request

Feb 12, 2026
This PR resolves flakiness and accuracy issues in the
`MatMulNBitsLutGemm` operator.

## Root Cause Analysis

The `MatMulNBitsLutGemm` operator failed intermittently and produced
numerically inaccurate results. This analysis covers the root causes
addressed by the changes.

## Identified Root Causes

### 1. Data Race in [LutGemmPackQuantBData](https://github.com/microsoft/onnxruntime/blob/cee825d34d533ca325bfd8f8269c86133ae512e6/onnxruntime/core/mlas/lib/qlutgemm.cpp#L166-L295)
- **Issue**: The weight packing loop was parallelized across output
features ($N$). Since T-MAC packs multiple features into a single byte,
concurrent updates to the same byte caused bit-level corruption.
- **Fix**: Serialized the sub-byte accumulation phase of the weight
packing process.
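
To make the hazard concrete, here is a minimal sketch of sub-byte packing. It is not the MLAS code: the function name `PackTwoBitSerial` and the layout (four 2-bit codes per byte, lowest index in the lowest bits) are assumptions chosen for illustration. Because neighbouring elements share a byte, the `|=` below is a read-modify-write on shared storage. Splitting the loop across threads at element granularity would let two threads update the same byte at once, which is why this phase is kept serial.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not the MLAS implementation: packs 2-bit weight codes
// (values 0..3) four per byte, with the lowest-index code in the lowest bits.
std::vector<uint8_t> PackTwoBitSerial(const std::vector<uint8_t>& codes) {
    std::vector<uint8_t> packed((codes.size() + 3) / 4, 0);
    // Serial loop: elements i, i+1, i+2, i+3 may land in the same byte, so
    // this read-modify-write must not run concurrently for those elements.
    for (size_t i = 0; i < codes.size(); ++i) {
        assert(codes[i] < 4);  // each code must fit in 2 bits
        packed[i / 4] |= static_cast<uint8_t>(codes[i] << (2 * (i % 4)));
    }
    return packed;
}
```

If parallelism is still wanted, one option is to partition the work so that each thread owns whole output bytes; that is a design alternative, not something this PR describes.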

### 2. Thread-Safety in Global Configuration Map
- **Issue**: `tmac_kernel_configs` (a static `std::unordered_map`) was
read and written concurrently without synchronization. Insertions (and
the rehashing they can trigger) during initialization raced with lookups
on other threads that still held references into the map.
- **Fix**: Added `std::mutex` protection and modified the parameter
getter to return by value.
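
The pattern described in this fix can be sketched as follows. The names `KernelParams`, `KernelConfigRegistry`, `Set`, and `Get` are hypothetical, and the fields are placeholders; the real MLAS types and keys may differ. The point is that every access takes the lock, and the getter hands back a copy, so callers never touch the map's storage after the lock is released.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical parameter struct; the real fields in MLAS may differ.
struct KernelParams {
    int tile_n;
    int bits;
};

// Hypothetical registry illustrating "mutex + return by value".
class KernelConfigRegistry {
public:
    void Set(const std::string& key, const KernelParams& params) {
        std::lock_guard<std::mutex> lock(mutex_);
        configs_[key] = params;  // insertion may rehash; the lock covers it
    }

    // Returns a copy, so the caller holds no reference into the map.
    KernelParams Get(const std::string& key) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = configs_.find(key);
        assert(it != configs_.end());  // sketch only: key must exist
        return it->second;
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<std::string, KernelParams> configs_;
};
```

A value obtained from `Get` stays unchanged even if another thread later overwrites the same key, at the cost of copying a small struct per lookup.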

### 3. Tiling Dimension Mismatch and Buffer Safety
- **Issue**: The orchestrator used batch size ($M$) for kernel
configuration, while weights are tiled by features ($N$). Additionally,
the kernel lacked clamping for partial tiles, leading to potential
overruns.
- **Fix**: Synchronized tiling logic by using $N$ for initialization,
passing `TotalN` for parameter retrieval, and implementing explicit
clamping and tail-case handling in the AVX2 kernel.
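
As an illustration of the clamping idea only, the sketch below walks the feature dimension in fixed-size tiles and clamps the last tile to the columns that actually remain. `TileWidths` and its parameters are hypothetical names, and the actual AVX2 kernel is considerably more involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: returns the width of each tile when total_n
// columns are processed in tiles of tile_n, clamping the final partial tile.
std::vector<size_t> TileWidths(size_t total_n, size_t tile_n) {
    assert(tile_n > 0);
    std::vector<size_t> widths;
    for (size_t n0 = 0; n0 < total_n; n0 += tile_n) {
        // Without this clamp, a kernel assuming full tiles would touch
        // columns past total_n whenever total_n is not a multiple of tile_n.
        widths.push_back(std::min(tile_n, total_n - n0));
    }
    return widths;
}
```

For example, 100 columns with a tile width of 32 yield three full tiles followed by a tail of 4 columns.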

## Verification Results
- `MatMulNBitsLutGemm.Float32_2Bits_Asymmetric_Batch32_256x256` passed
100 consecutive iterations.
- The full MatMul2Bits suite passed all 10 tests with the standard
tolerance of **0.15f**.

tianleiwu added a commit that referenced this pull request

Feb 13, 2026
This cherry-picks the following commits for the 1.24.2 release:
- #27096
- #27077
- #26677
- #27238
- #27213
- #27256
- #27278
- #27275
- #27276
- #27216
- #27271
- #27299
- #27294
- #27266
- #27176
- #27126
- #27252

---------

Co-authored-by: Xiaofei Han <xiaofeihan@microsoft.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: qti-monumeen <monumeen@qti.qualcomm.com>
Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: guschmue <22941064+guschmue@users.noreply.github.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: angelser <32746004+angelser@users.noreply.github.com>
Co-authored-by: Angela Serrano Brummett <angelser@microsoft.com>
Co-authored-by: Misha Chornyi <99709299+mc-nv@users.noreply.github.com>
Co-authored-by: hariharans29 <9969784+hariharans29@users.noreply.github.com>
Co-authored-by: eserscor <erscor@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Ti-Tai Wang <titaiwang@microsoft.com>
Co-authored-by: bmehta001 <bmehta001@users.noreply.github.com>
