[QNN-EP] Add MatMulNBits translation for GPU by quic-tirupath · Pull Request #26340

[QNN-EP] Add MatMulNBits translation for GPU by quic-tirupath · Pull Request #26340 · microsoft/onnxruntime

 - Also fix unit test failure.

alex-spacemit pushed a commit to spacemit-com/onnxruntime that referenced this pull request

Jan 20, 2026

### Description  
Add support for translation of MatMulNBits contrib op to
  QNN with FullyConnected operation with INT4 BlockQuantized weights

Implementation details:
 - Translate MatMulNBits to FullyConnected in OpBuilder
 - Support QNN_QUANTIZATION_ENCODING_BLOCK for INT4 weights
- Pass INT4 weights and quant params as BlockQuantization encoding
params in QNN

Testing:
 - Added new unit tests for MNB -> QNN-GPU
 - Validated all OnnxRuntime tests
- Validated the following LLMs through Olive and ORT-GenAI execution
flow
   - LlaMA3.2 1B
   - Qwen2.5
   - DeepSeek-R1-Qwen 1.5b
   - Phi3.5-mini-instruct

### Motivation and Context
LLMs with INT4 quantization pass in Olive will generate a model with
MatMulMBits contrib ops.
To run these ops via QNN-EP, MatMulNBits is translated to QNN
FullyConnected op with INT4 weights.

---------

Co-authored-by: tirupath-qti <tirupath@qti.qualcomm.com>

tianleiwu pushed a commit that referenced this pull request

Jan 21, 2026

### Description
Add support for translation of MatMulNBits contrib op to
  QNN with FullyConnected operation with INT4 BlockQuantized weights

Implementation details:
 - Translate MatMulNBits to FullyConnected in OpBuilder
 - Support QNN_QUANTIZATION_ENCODING_BLOCK for INT4 weights
- Pass INT4 weights and quant params as BlockQuantization encoding
params in QNN

Testing:
 - Added new unit tests for MNB -> QNN-GPU
 - Validated all OnnxRuntime tests
- Validated the following LLMs through Olive and ORT-GenAI execution
flow
   - LlaMA3.2 1B
   - Qwen2.5
   - DeepSeek-R1-Qwen 1.5b
   - Phi3.5-mini-instruct

### Motivation and Context
LLMs with INT4 quantization pass in Olive will generate a model with
MatMulMBits contrib ops.
To run these ops via QNN-EP, MatMulNBits is translated to QNN
FullyConnected op with INT4 weights.

---------

Co-authored-by: tirupath-qti <tirupath@qti.qualcomm.com>
(cherry picked from commit ba11af4)

This was referenced

Jan 21, 2026

tianleiwu added a commit that referenced this pull request

Jan 23, 2026

### Description
This PR cherry-picks the following changes for the 1.24.0 release.

### Cherry-picked Commits
| Commit | Commit Title | Author |
|---|---|---|
| 744e7fe | Add type definitions, registration, utilities for
INT2/UINT2 support (#26824) | vraspar |
| 530a1fb | [QNN EP] Add BFloat16 dtype support in QNN EP (#26987) |
tirupath-qti |
| 8e050d1 | Implement new experimental lookup-based matrix
multiplication method(TMAC) (#26695) | vraspar |
| 2d2ba6b | [MLAS/CPU EP] Improve performance of Silu activation path
within the QuickGelu CPU kernel (#26753) | Hariharan Seshadri |
| 1c02b79 | [QNN EP] Add support for handling 0-dimension for Concat
Op (#27000) | Ashwath Shankarnarayan |
| cc2b01b | Fix ClipQuantFusion crash when Clip has multiple input
edges (#27016) | Edward Chen |
| bbd3850 | [QNN EP] Support quantized BatchNorm with per-channel DQ
params on QNN HTP (#26959) | qti-yuduo |
| d8f0318 | Add API to get ep graph partitioning info (#26781) |
Adrian Lizarraga |
| b912b18 | [OVEP] OpenVINO EP Features and bug-fixes for ORT-1.24 -
Follow up (#27007) | Preetha Veeramalai |
| ba11af4 | [QNN-EP] Add MatMulNBits translation for GPU (#26340) |
quic-tirupath |
| c03c419 | [MLAS/NEON] Add dedicated kernel for depthwise
convolution for ARM64 using NEON intrinsics (#26688) | Hariharan
Seshadri |
| e7dfd69 | [QNN-EP] Support alternate Layernorm fusion pattern in
QNN preprocess (#26060) | qti-mattsinc |
| 4013dc1 | Implement multithreading in qgemm_kleidi (#26301) |
Melike Kaptan |
| 9f06181 | [CXX] Enable users to specify custom OrtSyncStream via
RunOptions (#26988) | Dmitri Smirnov |
| cfccd64 | Added support for QMX kernels in MLAS (#26849) |
qti-vaiskv |
| 29d9b2f | Tweak external resource importer handle structs (#27040)
| Scott McKay |
| 9d108d0 | [QNN EP] Add QuickGELU operator support for QNN provider
(#27034) | tirupath-qti |
| b35688f | Add INT2 and UINT2 support for QDQ, transpose and cast
ops (#27022) | vraspar |
| 6d34aba | Introducing BF16 Pointwise NCHWc Convolution for Arm64
(#26838) | Rohanjames1997 |
| 36017ad | [EP ABI] Add CreateCustomOpDomains() API for plugin EP to
register custom ops (#27050) | Chi Lo |
| 50a03e4 | Add a new pipeline for CUDA 13 nuget builds (#27023) |
eserscor |
| a0d4439 | [EP ABI] Update Graph_GetGraphView() implementation
(#26711) | Chi Lo |
| 34bb209 | [webgpu] Fix a bug for im2col (#27069) | Wenqin Yang |
| 46e8d45 | [QNN EP] Add FusedMatMul operator support (#27044) |
tirupath-qti |
| 5e7e7a3 | Disable Float32_2Bits_Asymmetric_256x256 test (#27046) |
vraspar |
| 39f966e | Fix Doxygen documentation build error in
onnxruntime_c_api.h (#27083) | Nick Eubanks |
| 8a7a797 | Print tensor for new packed type of 2 bits (#27064) |
Tianlei Wu |
| 01f40e6 | Fix GPU JAR testing on Linux (#27011) | eserscor |
| b6ed7f3 | Fix warning around ununsed code in QNN Android Emulator
builds by clang (#27026) | Hariharan Seshadri |
| d7daa45 | Raise the timeout for the ios simulator job (#27045) |
Hariharan Seshadri |
| 7e1d818 | upgrade emsdk to 4.0.23 (#27029) | Yulong Wang |
| 347b990 | Fix failing mainline build on Arm64 linux (#27101) |
Rohanjames1997 |
| f481b17 | Add dedicated API to support extracting compatibility
string from model metadata (#27015) | adrastogi |

---------

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: bfilipek <bartlomiej.filipek@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Signed-off-by: Christian Bourjau <christian.bourjau@quantco.com>
Signed-off-by: melkap01 <melike.kaptan@arm.com>
Co-authored-by: vraspar <vrajang@outlook.com>
Co-authored-by: tirupath-qti <tirupath@qti.qualcomm.com>
Co-authored-by: Ashwath Shankarnarayan <ashwshan@qti.qualcomm.com>
Co-authored-by: Liqun Fu <liqun.fu@microsoft.com>
Co-authored-by: carzh <wolfivyaura@gmail.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: carzh <carolinezhu@microsoft.com>
Co-authored-by: Vrajang Parikh <vrparikh@microsoft.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Yuduo Wu <yuduow@qti.qualcomm.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Javier Martinez <javier.e.martinez@intel.com>
Co-authored-by: Bartlomiej Filipek <bartlomiej.filipek@intel.com>
Co-authored-by: bopeng1234 <bo.peng@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: Yaru Du <yaru.du@intel.com>
Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com>
Co-authored-by: Dvoretckii, Mikhail <mikhail.dvoretckii@intel.com>
Co-authored-by: Pallavi Gupta <pallavi.gupta@intel.com>
Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Fei Chen <feich@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Akupadhye <aupadhye@qti.qualcomm.com>
Co-authored-by: Wang Ning <ning4.wang@intel.com>
Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: quic-calvnguy <quic_calvnguy@quicinc.com>
Co-authored-by: Jie Chen <jie.a.chen@intel.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: quic-hungjuiw <quic_hungjuiw@quicinc.com>
Co-authored-by: Ian Hunter <ianfhunter@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Jeff Kilpatrick <jkilpatrick@qti.qualcomm.com>
Co-authored-by: Jeff Kilpatrick <jkilpat@qti.qualcomm.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Nenad Banfic <46795300+nenad1002@users.noreply.github.com>
Co-authored-by: derdeljan-msft <derdeljan@microsoft.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: Ryan Metcalfe <ryan.metcalfe@intel.com>
Co-authored-by: Jaswanth Gannamaneni <jaswanth.gannamaneni@intel.com>
Co-authored-by: Klimenko, Mikhail <mikhail.klimenko@intel.com>
Co-authored-by: liang <gxgaoliang@126.com>
Co-authored-by: Garth Long <garth.long@intel.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Akshay Sonawane <111780983+apsonawane@users.noreply.github.com>
Co-authored-by: Christopher Warrington <chwarr@microsoft.com>
Co-authored-by: Ishwar Raut <iraut@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Xinpeng Dou <15529241576@163.com>
Co-authored-by: adrastogi <aditya.rastogi@microsoft.com>
Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: qti-hungjuiw <hungjuiw@qti.qualcomm.com>
Co-authored-by: Pradeep Sakhamoori <psakhamoori@microsoft.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: mingyue <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: Susanta Bhattacharjee <susanta.bhattacharjee@intel.com>
Co-authored-by: Jozef Wludzik <jozef.wludzik@intel.com>
Co-authored-by: Rajeev Sekar <rajeevsekar21@gmail.com>
Co-authored-by: Mayuresh M Varerkar <mayuresh.m.varerkar@intel.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Wenqin Yang <wenqin.yang@intel.com>
Co-authored-by: xieofxie <xieofxie@126.com>
Co-authored-by: hualxie <hualxie@microsoft.com>
Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: Christian Bourjau <cbourjau@users.noreply.github.com>
Co-authored-by: Xiaofei Han <xiaofeihan@microsoft.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: chunghow-qti <chunghow@qti.qualcomm.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Jiawei Shao <jiawei.shao@intel.com>
Co-authored-by: czekun <chen.zekun@intel.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
Co-authored-by: quic-tirupath <quic_tirupath@quicinc.com>
Co-authored-by: qti-mattsinc <mattsinc@qti.qualcomm.com>
Co-authored-by: Melike Kaptan <melike.kaptan@arm.com>
Co-authored-by: Damien Dooley <damien.dooley@arm.com>
Co-authored-by: qti-vaiskv <vaiskv@qti.qualcomm.com>
Co-authored-by: Rohanjames1997 <rohan.james4@gmail.com>
Co-authored-by: eserscor <erscor@microsoft.com>
Co-authored-by: eserscor <247253654+eserscor@users.noreply.github.com>
Co-authored-by: Nick Eubanks <nieubank@microsoft.com>
Co-authored-by: adrastogi <8368026+adrastogi@users.noreply.github.com>
Co-authored-by: Rohanjames1997 <rohanjms@amazon.com>

alex-spacemit pushed a commit to spacemit-com/onnxruntime that referenced this pull request

Jan 27, 2026

### Description  
Add support for translation of MatMulNBits contrib op to
  QNN with FullyConnected operation with INT4 BlockQuantized weights

Implementation details:
 - Translate MatMulNBits to FullyConnected in OpBuilder
 - Support QNN_QUANTIZATION_ENCODING_BLOCK for INT4 weights
- Pass INT4 weights and quant params as BlockQuantization encoding
params in QNN

Testing:
 - Added new unit tests for MNB -> QNN-GPU
 - Validated all OnnxRuntime tests
- Validated the following LLMs through Olive and ORT-GenAI execution
flow
   - LlaMA3.2 1B
   - Qwen2.5
   - DeepSeek-R1-Qwen 1.5b
   - Phi3.5-mini-instruct

### Motivation and Context
LLMs with INT4 quantization pass in Olive will generate a model with
MatMulMBits contrib ops.
To run these ops via QNN-EP, MatMulNBits is translated to QNN
FullyConnected op with INT4 weights.

---------

Co-authored-by: tirupath-qti <tirupath@qti.qualcomm.com>