Add API for precompiled model compatibility check using just the compat info by adrastogi · Pull Request #25841 · microsoft/onnxruntime

github-advanced-security[bot] found potential problems Aug 25, 2025

adrianlizarraga

Aditya Rastogi added 3 commits

August 25, 2025 13:22
…tation to use the underlying hardware devices as input into the validation decision)

adrastogi deleted the adrastogi/c-model-compatibility-api branch

August 27, 2025 20:15

snnn pushed a commit that referenced this pull request

Aug 28, 2025
…at info (#25841)

### Description
This PR adds a new API that applications can use to verify compatibility
of a precompiled model with the underlying system, using only the
compatibility info string from the model's metadata.

### Motivation and Context
- This is a feature to enable apps to check compatibility of a
precompiled model without necessarily having the model locally on the
device. This enables precompiled models to be stored remotely and
downloaded once the application has been able to confirm the validity of
a given model with EPs on the device.
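The remote-check flow described above can be sketched as follows. This is an illustrative stand-in, not the ORT C API: `validate_compat_info`, `should_download`, the `"name;version"` string format, and the EP names are all invented for the example; the real check is performed by the device's EPs through the new API.

```python
# Hypothetical sketch of the app flow this API enables: validate a remote
# precompiled model's compatibility info string locally, before downloading
# the (potentially large) model itself.

def validate_compat_info(compat_info: str, device_ep_versions: dict) -> bool:
    """Stand-in for the real compatibility check: parse an invented
    'ep_name;version' string and accept only if the named EP on this
    device matches the required version."""
    try:
        ep_name, required_version = compat_info.split(";")
    except ValueError:
        return False  # malformed compatibility string
    return device_ep_versions.get(ep_name) == required_version

def should_download(remote_compat_info: str, device_ep_versions: dict) -> bool:
    # Only fetch the precompiled model if this device can actually run it.
    return validate_compat_info(remote_compat_info, device_ep_versions)

device = {"ExampleNpuEp": "1.2"}
assert should_download("ExampleNpuEp;1.2", device) is True
assert should_download("ExampleNpuEp;1.3", device) is False
```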

### Testing
- New unit tests pass 
- For regression testing, built a private version of WinML + AMD NPU EP
with these changes. Ran the Cpp Selfcontained Desktop sample
successfully; ran with compilation and also re-ran using the
already-compiled model to verify that session initialization continued
to work as expected.

---------

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>

snnn mentioned this pull request

Aug 28, 2025

adrastogi added a commit that referenced this pull request

Aug 29, 2025
### Description
This change builds on top of #25841 and adds the scaffolding necessary
to call into this API from C++ / C# / Python.

### Motivation and Context
#25454 talks more about the broader notion of precompiled model
compatibility. This change is directed at app developers whose apps may
want to determine if a particular precompiled model (e.g. on a server
somewhere) is compatible with the device where the application is
running. There is functionality in `OrtEpFactory` for making this
determination, which was exposed as a C API in #25841, and this change
makes the API more broadly available in other languages.

### Testing and Validation
Introduced new unit test cases across each language, and verified that
the API was being called and returned the correct result for the default
CPU EP.

---------

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>

snnn pushed a commit that referenced this pull request

Aug 29, 2025

snnn pushed a commit that referenced this pull request

Aug 29, 2025
- **Relax WeightBiasQuantization constraint for larger QDQ node group
(#25673)**
- **Add cuda graph implementation for NV TRT RTX EP (#25787)**
- **python GPU IO Bindings for NVIDIA  (#25776)**
- **Fixes for DynamicQuantizeMatMul and Attention3D tests (#25814)**
- **Fix a long standing bug on file memory mapping on windows.
(#25833)**
- **Add API for precompiled model compatibility check using just the
compat info (#25841)**
- **Enable ABSL_FLAGS flag registration for onnxruntime_perf_test for
mobile build (#25849)**
- **Add default constructor to Ort::Status. (#25860)**
- #25871
- #25878
- #25884
- #25886
- #25866

preetha-intel added a commit to intel/onnxruntime that referenced this pull request

Sep 1, 2025
* [CPU] Optimize GQA attention bias application for FP16 (microsoft#25871)

### Description

When using the attention bias input for the GQA op with FP16, platforms
that don't natively support FP16 math need a cast to fp32, and thus a
temporary buffer must be created to store the fp32 values. The issue was
that this temporary buffer was being allocated and deallocated inside a
loop for every token processed. Refactored the implementation so that
the allocation takes place only once.

Phi model throughput increased by 15%.

* Fixes for DynamicQuantizeMatMul and Attention3D tests (microsoft#25814)

### Description
This change fixes correctness issues in two areas that were causing
failures in onnxruntime_test_all:

- DynamicQuantizeMatMul.WithConstantBInputs
- AttentionTest.Attention3DDefault
- AttentionTest.Attention3DWithPastAndPresentQkMatmul

What was wrong and how it’s fixed
1) DynamicQuantizeMatMul.WithConstantBInputs
- Root cause: The Kleidi dynamic quantization GEMM path could be
selected even when the B scales contained invalid values (zero,
negative, or non-finite). That violates kernel assumptions and can lead
to incorrect results.
- Fix: In
`onnxruntime/contrib_ops/cpu/quantization/dynamic_quantize_matmul.cc`,
we now explicitly validate that all B scales are finite and strictly
positive before enabling the Kleidi/MLAS dynamic path. If any scale is
invalid, we disable that path.
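The scale-validation guard amounts to a finite-and-strictly-positive check over all B scales. A minimal sketch (the function name is invented; the real check lives in `dynamic_quantize_matmul.cc`):

```python
import math

# Enable the fast dynamic-quant path only when every B scale is finite
# and strictly positive, as the fix describes.

def can_use_fast_path(b_scales) -> bool:
    return all(math.isfinite(s) and s > 0.0 for s in b_scales)

assert can_use_fast_path([0.1, 0.02]) is True
assert can_use_fast_path([0.1, 0.0]) is False        # zero scale
assert can_use_fast_path([0.1, -0.5]) is False       # negative scale
assert can_use_fast_path([0.1, float("nan")]) is False  # non-finite
```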

2) Attention tests (Attention3DDefault,
Attention3DWithPastAndPresentQkMatmul)
- Root causes in
`onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp`:
- Incorrect handling of GEMM corner cases for alpha/beta and K==0 (e.g.,
not respecting C = beta*C when alpha==0 or K==0).
  - Unnecessary or premature fallbacks for small shapes.
- Fixes:
- Add early-outs for degenerate sizes: if M==0 or N==0, return handled.
  - Correctly implement alpha/beta semantics:
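A pure-Python sketch of these corner-case semantics (illustrative only, not the Kleidi kernel): when alpha == 0 or K == 0 the result must still be C = beta*C, and degenerate M == 0 / N == 0 shapes return immediately.

```python
# Reference GEMM semantics: C = alpha * (A @ B) + beta * C, with the
# corner cases the fix describes handled explicitly.

def gemm(alpha, A, B, beta, C):
    M = len(C)
    N = len(C[0]) if M else 0
    if M == 0 or N == 0:           # early-out for degenerate sizes
        return C
    K = len(A[0]) if A and A[0] else 0
    if alpha == 0.0 or K == 0:     # C = beta * C; A and B are not touched
        return [[beta * c for c in row] for row in C]
    return [[beta * C[i][j] + alpha * sum(A[i][k] * B[k][j] for k in range(K))
             for j in range(N)] for i in range(M)]

C = [[1.0, 2.0]]
assert gemm(0.0, [[9.0]], [[9.0, 9.0]], 0.5, C) == [[0.5, 1.0]]   # alpha == 0
assert gemm(1.0, [[2.0]], [[3.0, 4.0]], 1.0, C) == [[7.0, 10.0]]  # normal path
```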

---------

Signed-off-by: Jonathan Clohessy <jonathan.clohessy@arm.com>

* Fix MoE CPP tests (microsoft#25877)

This change skips the QMoE CPU tests when running on the TensorRT or
CUDA EP.
The QMoE kernel also had a memory-overwrite bug in the accumulation
step; fixing it brought the Python tests back to passing.

* [c++] Eliminate dynamic initialization of static Ort::Global<void>::api_ (microsoft#25741)

### Description

Delay the call to `OrtGetApiBase()` until the first call to
`Ort::GetApi()` so that `OrtGetApiBase()` is typically called after
dynamic library loading.

### Motivation and Context

When ORT_API_MANUAL_INIT is not defined (which is the default), the
static `Ort::Global<void>::api_` has a dynamic initializer that calls
`OrtGetApiBase()->GetApi(ORT_API_VERSION)` This dynamic initialization
can cause problems when it interacts with other global/static
initialization. On Windows in particular, it can also cause deadlocks
when used in a dynamic library if OrtGetApiBase()->GetApi() attempts to
load any other libraries.

* Replace the templated `Global<void>::api_` with an inline static
initialized to nullptr.
* `Ort::GetApi()` now calls `detail::Global::GetApi()` which calls
`detail::Global::DefaultInit()` if initialization is needed.
* When `ORT_API_MANUAL_INIT` is defined, `DefaultInit()` returns
nullptr, which will eventually cause the program to crash. The callers
have violated the initialization contract by not calling one of the
`Ort::InitApi` overloads.
* When `ORT_API_MANUAL_INIT` is not defined, `DefaultInit()` uses a
function-level static to compute the result of
`OrtGetApiBase()->GetApi(ORT_API_VERSION)` once and return it.
* `Ort::Global<void>` has been replaced with a non-templated type and
moved inside a `detail` namespace. Since the `Global<void>` object was
documented as being used internally, it is believed that these changes
here are non-breaking, as they do not impact a public API. The public
APIs, `Ort::InitApi()` and `Ort::InitApi(const OrtApi*)` remain
unchanged.
* Add `#pragma detect_mismatch` to surface issues with compilation units
that disagree on how ORT_API_MANUAL_INIT is defined. (MSVC only.)
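The core pattern here is deferring expensive global initialization from static-init time to the first call. A language-shifted sketch (in the C++ change this is a function-level static; in Python a cached getter plays the same role; all names below are illustrative):

```python
# Lazy initialization on first use, instead of at "load" time.

_api = None
init_calls = 0

def _expensive_init():
    global init_calls
    init_calls += 1            # stands in for OrtGetApiBase()->GetApi(...)
    return {"version": 23}

def get_api():
    global _api
    if _api is None:           # first call pays the cost; later calls reuse
        _api = _expensive_init()
    return _api

assert init_calls == 0         # nothing happens before the first call
a = get_api()
b = get_api()
assert a is b and init_calls == 1
```

This is why moving the `OrtGetApiBase()` call out of the dynamic initializer avoids load-order and loader-lock problems: no other library is touched until the program explicitly asks for the API.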

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* python GPU IO Bindings for NVIDIA  (microsoft#25776)

### Description
1. A Small change to use the shared allocator in Python binding. 
2. Remove the FP64 support from the EP. 


### Motivation and Context

The Python GPU IO binding is necessary for performance. The change
enables the shared allocator for GPU allocation.
FP64 was silently using FP32 inference, so FP64 support was removed to
align with TRT RTX support.

---------

Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>

* [CANN] Add a `enable_cann_subgraph` feature parameter (microsoft#25867)

### Description

Add an `enable_cann_subgraph` feature parameter. This parameter controls
whether graph splitting is performed and can help quickly identify
issues in certain scenarios.

* [EP ABI] Add OpAttr_GetTensorAttributeAsOrtValue and replace the existing Node_GetTensorAttributeAsOrtValue (microsoft#25886)

### Description
Replace `Node_GetTensorAttributeAsOrtValue` with
`OpAttr_GetTensorAttributeAsOrtValue`.
Change the API signature to make it one of the `OpAttr` interfaces
instead of the `OrtNode` interface.

The original API was added
[here](microsoft#25566).

* Language bindings for model compatibility API (microsoft#25878)


* [QNN-EP] Introduce Level1 Transformer into qnn.preprocess (microsoft#25883)

### Description
- Introduce Level1 Transformer into qnn.preprocess to support various optimizations.

### Motivation and Context
- This change brings in several useful optimizations such as `ConvBnFusion` and `ConstantFolding`, which are part of
`TransformerLevel::Level1` and can benefit QNNEP.
- The goal is to optimize the ONNX model before quantization by integrating these passes into the Python tooling workflow.

* [QNN EP] Minor fix weight name missing when not valid QDQ node group (microsoft#25887)

### Description
Minor fix for a weight name going missing when a node group is not a valid QDQ node group.

### Motivation and Context
Some quantized models fail QDQ node group validation, so their weights are not folded as initializers. QNN EP then failed to handle the dynamic weights because of the transpose op input name lookup. This change makes sure we process the weights tensor before adding the transposes.

* Add custom ops library_path to EP metadata (microsoft#25830)

## Summary
Adds EP metadata library path support to enable custom ops DLL
registration with proper path resolution.

## Changes
- Added `library_path` metadata key to EP metadata infrastructure
- Pass resolved library path directly to `EpLibraryProviderBridge`
constructor
- Simplified implementation per reviewer feedback (removed virtual
method complexity)
- Added `#include <utility>` for std::move compliance

## Purpose
Enables downstream applications (like onnxruntime-genai) to resolve
relative custom ops library paths using EP metadata, improving DLL
registration reliability.
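A hedged sketch of how a downstream app might use the new `library_path` metadata key: resolve a relative custom-ops library path against the EP library's own directory. The metadata dict shape and the helper are invented for illustration; only the key name comes from the description above.

```python
from pathlib import PurePosixPath

# Resolve a custom-ops library path relative to the EP library location
# reported in EP metadata (illustrative helper, POSIX paths for clarity).

def resolve_custom_ops_path(ep_metadata: dict, relative_lib: str) -> str:
    base = PurePosixPath(ep_metadata["library_path"]).parent
    return str(base / relative_lib)

meta = {"library_path": "/opt/eps/example_ep/example_ep.so"}
assert resolve_custom_ops_path(meta, "custom_ops.so") == \
    "/opt/eps/example_ep/custom_ops.so"
```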

## Files Modified
- `plugin_ep/ep_factory_provider_bridge.h`
- `plugin_ep/ep_library.h` 
- `plugin_ep/ep_library_plugin.h`
- `plugin_ep/ep_library_provider_bridge.cc`
- `plugin_ep/ep_library_provider_bridge.h`
- `utils.cc`

* [OVEP] OpenVINO EP Features and bug-fixes for ORT-1.23  (microsoft#25884)

### Description
This update introduces multiple improvements, fixes, and feature enhancements to the OpenVINO Execution Provider (OVEP) and related components in ONNX Runtime:

#### Configuration & Properties

- Updated load_config mapping to act as a passthrough to OpenVINO properties.
- Added support for providing layout information to inputs/outputs in OpenVINO.

#### Inference & Tensor Handling

- Improved OVInferRequest::SetTensor to correctly handle cached binding shape mismatches.
- Added support for self-detecting on-the-fly bfloat16 → float16 conversion.
- Fixed issues with input ONNX models when used with shared execution contexts.

#### Model Handling & Operator Support

- Fixed model copying behavior for QDQ stripping.
- Updated operator support status for OpenVINO 2025.2.

#### Platform & Integration Fixes

- Applied multiple PSU Lora fixes and related updates.
- Resolved filename confusion issues with wrapped OVIRs in EPCtx.
- Enabled memory-mapped native binaries for OpenVINO 2025.3.

#### Quality & Maintenance

- Addressed linting issues.
- Fixed coverage gaps in OVEP.
- Added a new test script for OpenVINO with ORT ABI integration.

---------

Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com>
Co-authored-by: Klimenko, Mikhail <mikhail.klimenko@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Garth Long <garth.long@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: Javier Martinez <javier.e.martinez@intel.com>

* [java] Auto EP and compile model support (microsoft#25131)

### Description
Java API for compile model and EP discovery APIs. Roughly equivalent to
the C# version in microsoft#24604.

cc: @skottmckay.

I haven't quite got the CMake configured so the Java tests for the ep
registration only run when the ONNX Runtime shared provider support is
built, but everything else works. I expect that to be a quick fix, but
I'm not sure in what conditions it should be built and how we should
handle it so I don't know where/when to plumb it through.

### Motivation and Context
API parity for Java.

* Add error handling to extract_nuget_files.ps1 (microsoft#25866)

### Description
1. Check process exit code when running 7z.exe . Currently the errors
were silently ignored.
2. Add snld20 flag to the 7z.exe commands, which is needed to be
compatible with the latest 7z release.

* [Fix] illegal memory access in GetInputIndices with optional inputs (microsoft#25881)

### Description
Fix illegal memory access in GetInputIndices with optional inputs

### Motivation and Context
When an input is optional, its ValueInfo may be nullptr.
The current implementation directly called InputValueInfo->GetName(), leading to illegal memory access.

Updated the logic to skip optional inputs when the ValueInfo is nullptr.
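The fix reduces to a null guard inside the collection loop. A minimal sketch (`None` stands in for a nullptr ValueInfo; the dict shape is invented):

```python
# Skip optional inputs whose value info is absent instead of
# dereferencing it, as the fix describes.

def get_input_names(value_infos):
    names = []
    for vi in value_infos:
        if vi is None:            # optional input not provided: skip safely
            continue
        names.append(vi["name"])  # previously crashed when vi was null
    return names

assert get_input_names([{"name": "x"}, None, {"name": "mask"}]) == ["x", "mask"]
```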

* Re-enable cpuinfo for ARM64EC (microsoft#25863)

### Description

Re-enable cpuinfo for ARM64EC build and fix `CPUIDINFO_ARCH_ARM` so it
is actually used.

Patch cpuinfo to support vcpkg ARM64EC build. See
pytorch/cpuinfo#324.

### Motivation and Context

Fix for workaround in microsoft#25831.

---------

Signed-off-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: derdeljan-msft <derdeljan@microsoft.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Akshay Sonawane <111780983+apsonawane@users.noreply.github.com>
Co-authored-by: Christopher Warrington <chwarr@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Ishwar Raut <iraut@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Xinpeng Dou <15529241576@163.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: adrastogi <aditya.rastogi@microsoft.com>
Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: qti-hungjuiw <hungjuiw@qti.qualcomm.com>
Co-authored-by: qti-yuduo <yuduow@qti.qualcomm.com>
Co-authored-by: Pradeep Sakhamoori <psakhamoori@microsoft.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com>
Co-authored-by: Klimenko, Mikhail <mikhail.klimenko@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Garth Long <garth.long@intel.com>
Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: Javier Martinez <javier.e.martinez@intel.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: mingyue <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

gedoensmax pushed a commit to gedoensmax/onnxruntime that referenced this pull request

Sep 2, 2025
…at info (microsoft#25841)


Jaswanth51 pushed a commit to intel/onnxruntime that referenced this pull request

Sep 3, 2025

yuslepukhin pushed a commit that referenced this pull request

Dec 10, 2025
…val and validation (#26699)

### Description

Adds support for compiled model compatibility information retrieval and
validation in the VitisAI EP. This enables runtime validation of
compiled models against the execution environment to prevent failures
and provide clear compatibility feedback.

**Key Changes:**
- Implemented `GetCompiledModelCompatibilityInfo` to collect and
serialize compatibility metadata during model compilation
- Added `ValidateCompiledModelCompatibilityInfo` to validate
compatibility at runtime against the current environment

### Motivation and Context
Compiled models may fail at runtime due to missing backend plugins,
version mismatches, or hardware platform differences.
ONNX Runtime added two APIs to support the compiled model compatibility
validation system. Ref PRs:
    #25841
    #25749
    

This PR implements a compatibility validation system for Vitis AI EP
that:

- Detects incompatibilities before model loading to prevent runtime
failures
- Enables cross-version compatibility checking between different EP
versions
- Provides clear feedback through specific compatibility status codes
- Maintains backward compatibility with legacy EPs
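The get/validate pair described above can be sketched as a two-sided protocol: serialize the environment at compile time, compare against the current environment at load time, and report a specific status. Everything below is invented for illustration (the string format, the status names, the field names); the real APIs are `GetCompiledModelCompatibilityInfo` and `ValidateCompiledModelCompatibilityInfo`.

```python
from enum import Enum

class CompatStatus(Enum):
    COMPATIBLE = 0
    VERSION_MISMATCH = 1
    PLATFORM_MISMATCH = 2

def get_compat_info(ep_version: str, platform: str) -> str:
    # Serialized into the model's metadata at compile time.
    return f"{ep_version}|{platform}"

def validate_compat_info(info: str, ep_version: str, platform: str) -> CompatStatus:
    # Checked at runtime against the current environment.
    v, p = info.split("|")
    if p != platform:
        return CompatStatus.PLATFORM_MISMATCH
    if v != ep_version:
        return CompatStatus.VERSION_MISMATCH
    return CompatStatus.COMPATIBLE

info = get_compat_info("3.5", "npu-gen2")
assert validate_compat_info(info, "3.5", "npu-gen2") is CompatStatus.COMPATIBLE
assert validate_compat_info(info, "3.6", "npu-gen2") is CompatStatus.VERSION_MISMATCH
```

Returning distinct status codes (rather than a bare boolean) is what enables the "clear feedback" goal above.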

Kevin-Taha pushed a commit that referenced this pull request

Dec 11, 2025
…val and validation (#26699)


Sumit2318 pushed a commit that referenced this pull request

Jan 6, 2026
…val and validation (#26699)


This was referenced

Jan 15, 2026

adrianlizarraga pushed a commit that referenced this pull request

Jan 21, 2026
…el metadata (#27015)

### Description
This change proposes a new helper ORT API for callers that need to
extract the model compatibility string from a precompiled model.

### Motivation and Context
See #25749 for more background on the model compatibility concept and
infrastructure; #25841 provides a related helper API for an application
to call to do a validation check using the compatibility info string.
However, there is no direct way to get to the model metadata without
creating a session (which some callers may prefer to avoid) or by taking
a dependency on a separate library to parse the model's protobuf (which
again callers may prefer to avoid).

This change proposes a separate helper API which can be used to retrieve
the compatibility info string, thereby avoiding session creation or an
external dependency. This does incur some amount of redundant work in
that the model protobuf will be parsed again during session creation,
but for some callers this tradeoff may be acceptable.
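From the caller's side, the helper boils down to a metadata lookup that avoids session creation. A hedged sketch (the dict shape and the `"compat_info"` key name are invented stand-ins for the model's custom metadata map):

```python
# Lightweight metadata read: fetch the stored compatibility string
# without creating a session or parsing the model with another library.

def get_compat_info_from_metadata(model_metadata: dict):
    # Returns the stored compatibility string, or None if absent
    # (e.g. the model was not compiled with a compatibility-aware EP).
    return model_metadata.get("compat_info")

assert get_compat_info_from_metadata({"compat_info": "epA;1.0"}) == "epA;1.0"
assert get_compat_info_from_metadata({}) is None
```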

---------

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: adrastogi <8368026+adrastogi@users.noreply.github.com>

tianleiwu pushed a commit that referenced this pull request

Jan 22, 2026
…el metadata (#27015)

(cherry picked from commit f481b17)

quic-muchhsu pushed a commit to CodeLinaro/onnxruntime that referenced this pull request

Feb 27, 2026
* Bump version to 1.25.0 (microsoft#27048)

Increase version number to 1.25.0.

* [webgpu] Optimize generic 4D Transpose using OIHW2OHWI Program (microsoft#26942)

### Description
This PR migrates the `OIHW2OHWI` Program from `Im2ColMatMul` to the
`Transpose` operator. By centralizing this logic, we leverage the
specialized shader to optimize generic 4D transpositions (specifically
the {0, 2, 3, 1} permutation pattern) while reducing code duplication.

While this shader is capable of supporting 2D/3D transpositions, those
optimizations are reserved for follow-up PRs.

### Motivation and Context
See above.

* Fix failing mainline build on Arm64 linux (microsoft#27101)

### Description
`sconv.h` was renamed to `sconv_nchwc_kernel_neon.h` in microsoft#26688 but the
reference to the old name was still in a new file added at around the
same time in microsoft#26838.
The CI doesn't include building for this configuration yet - it will be
added after the 1.24 release.



### Motivation and Context
Fixes failing mainline build on Arm64 linux when
`--enable_arm_neon_nchwc` is supplied.


### Testing
This now passes on Arm64 linux
`./build.sh --config Release --build_shared_lib --parallel
--compile_no_warning_as_error --skip_submodule_sync --skip_tests
--enable_pybind --build_wheel --enable_arm_neon_nchwc`

* Add dedicated API to support extracting compatibility string from model metadata (microsoft#27015)


* Move model compatibility checks ahead of session initialization (microsoft#27037)

### Description
The current infrastructure for validating compatibility of a precompiled
model does the check after session initialization occurs, which turns
out to be quite costly. The check should ideally happen beforehand, to
short-circuit those expensive operations.


### Motivation and Context
This change will make it more tractable for applications to rely on the
existing session machinery to check compatibility of any of their
models.
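The reordering above is a short-circuit: run the cheap compatibility check first and bail out before the expensive initialization. A minimal sketch with invented stub steps:

```python
# Check compatibility first; only pay for session initialization when
# the model is compatible, as the change describes.

def try_load(compatible: bool, events: list) -> bool:
    events.append("compat_check")     # cheap, always runs first
    if not compatible:
        return False                  # expensive init is skipped entirely
    events.append("session_init")     # costly step, now conditional
    return True

events = []
assert try_load(False, events) is False
assert events == ["compat_check"]     # init never ran for this model
```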

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>

* [test] refactor common test target settings (microsoft#27013)

### Description
- factor duplicated test target settings into helper functions
- reuse helpers for onnxruntime_test_all and onnxruntime_provider_test
- keep target-specific settings intact


### Motivation and Context

There is some duplicated code in onnxruntime_unittests. Originally there
was only one unit test, `onnxruntime_test_all`, which was later split
into two: `onnxruntime_test_all` and `onnxruntime_provider_test`. Some
lines for setting up build flags were simply copied, which risks
inconsistency in the future.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix OrtApi static_assert violation, add instructions for updating additional API structs. (microsoft#27100)

### Description

Fix OrtApi 1.24 API size static_assert violation triggered by addition
of new APIs in
microsoft@f481b17.

Add version update instructions for updating additional API structs.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix build on main.

Add info about other API structs to version update instructions.

* Linux device discovery for TRT-RTX Ep (microsoft#26210)

### Description
<!-- Describe your changes. -->

This change adds PCIe bus_id to the properties detected
during Linux device discovery.

This property is used to enable device discovery on Linux for the
TRT-RTX execution provider.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve? -->
I want to use device discovery for the TRT-RTX EP on Linux as well.


These changes have already been tested with the newly added inference
samples:
microsoft/onnxruntime-inference-examples#529 .

@gedoensmax for visibilty

* Add absl cuda warnings patch (microsoft#27096)

Some PRs that use core/common/inlined_containers.h can cause failures in
the CUDA CI pipeline.

```
E:\_work\_temp\build\RelWithDebInfo\vcpkg_installed\x64-windows-static-md\include\absl/hash/internal/hash.h(481): error #68-D: integer conversion resulted in a change of sign [E:\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
          sizeof(T) == -1,
                       ^
  Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

E:\_work\_temp\build\RelWithDebInfo\vcpkg_installed\x64-windows-static-md\include\absl/hash/hash.h(337): error #549-D: variable "s" is used before its value is set [E:\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
        return s;
               ^
E:\_work\_temp\build\RelWithDebInfo\vcpkg_installed\x64-windows-static-md\include\absl/container/internal/raw_hash_set.h(468): error #69-D: integer conversion resulted in truncation [E:\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
          static_cast<uint16_t>(reinterpret_cast<uintptr_t>(&seed));
                      ^
  3 errors detected in the compilation of "E:/_work/onnxruntime/onnxruntime/onnxruntime/contrib_ops/cuda/sparse/block_mask.cu".
```

This change adds a patch to Abseil to mitigate those failures.


This solution has been verified to be effective in PR
microsoft#27087.

* [webgpu] Support Identity (microsoft#27067)

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

* Add enable_profiling in runoptions (microsoft#26846)

### Description
Support run-level profiling

This PR adds support for profiling individual Run executions, similar to
session-level profiling. Developers can enable run-level profiling by
setting `enable_profiling` and `profile_file_prefix` in RunOptions. Once
the run completes, a JSON profiling file will be saved using
profile_file_prefix + timestamp.

<img width="514" height="281" alt="png (2)"
src="https://github.com/user-attachments/assets/8a997068-71d9-49ed-8a5c-00e0fa8853af"
/>


### Key Changes
1. Introduced a local variable `run_profiler` in
`InferenceSession::Run`, which is destroyed after the run completes.
Using a dedicated profiler per run ensures that profiling data is
isolated and prevents interleaving or corruption across runs.
2. To maintain accurate execution time when both session-level and
run-level profiling are enabled, overloaded `Start` and
`EndTimeAndRecordEvent` functions have been added. These allow the
caller to provide timestamps instead of relying on
`std::chrono::high_resolution_clock::now()`, avoiding potential timing
inaccuracies.
3. Added a TLS variable `tls_run_profiler_` to support run-level
profiling with WebGPU Execution Provider (EP). This ensures that when
multiple threads enable run-level profiling, each thread logs only to
its own WebGPU profiler, keeping thread-specific data isolated.
4. Use `HH:MM:SS.mm` instead of `HH:MM:SS` in the JSON filename to prevent
conflicts when profiling multiple consecutive runs.

### Motivation and Context
Previously, profiling was only available at the session level. Developers
sometimes want to profile a specific run, hence this PR.


### Some details

When profiling is enabled via RunOptions, it should ideally collect two
types of events:
1. Profiler events
Used to calculate the CPU execution time of each operator.
2. Execution Provider (EP) profiler events
Used to measure GPU kernel execution time. 

Unlike session-level profiling, we need to ensure that event collection is
correct in multi-threaded scenarios.

For 1, this can be supported easily (sequential_executor.cc). We use a
thread-local storage (TLS) variable, RunLevelState (defined in
profiler.h), to maintain run-level profiling state for each thread.

For 2, each Execution Provider (EP) has its own profiler implementation,
and each EP must ensure correct behavior under run-level profiling. This
PR ensures that the WebGPU profiler works correctly with run-level
profiling.

# Test Cases

| Scenario | Example | Expected Result |
|---------|---------|-----------------|
| Concurrent runs on the same session with different run-level profiling
settings| t1: `sess1.Run({ enable_profiling: true })`<br>t2:
`sess1.Run({ enable_profiling: false })`<br>t3: `sess1.Run({
enable_profiling: true })` | Two trace JSON files are generated: one for
`t1` and one for `t3`. |
| Run-level profiling enabled together with session-level profiling|
`sess1 = OrtSession({ enable_profiling: true })`<br>`sess1.Run({
enable_profiling: true })` | Two trace JSON files are generated: one
corresponding to session-level profiling and one corresponding to
run-level profiling. |
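The thread-isolation idea behind `tls_run_profiler_` (Key Change 3) can be sketched with Python's `threading.local`. This is an illustrative mock, not ORT code; `run_with_profiling` and its event format are inventions of this sketch.

```python
import threading

# Sketch of the thread isolation behind tls_run_profiler_: each thread that
# enables run-level profiling logs to its own per-run profiler object, so
# concurrent runs on the same session cannot interleave events.

_tls = threading.local()

def run_with_profiling(run_id: str, enabled: bool, events: list) -> None:
    _tls.profiler = [] if enabled else None  # per-run, per-thread profiler
    # ... operator execution would record events via _tls.profiler here ...
    if _tls.profiler is not None:
        _tls.profiler.append(f"{run_id}:kernel")
        events.extend(_tls.profiler)  # stand-in for writing the JSON file
    _tls.profiler = None  # the run profiler is destroyed after the run
```

Run the first scenario from the table above (t1 enabled, t2 disabled, t3 enabled) and only t1's and t3's events are recorded, each isolated to its own thread.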

* Fix GQA Parity (microsoft#27108)

Fix [microsoft#27079](microsoft#27079) -
Qwen3 model quality regression on CUDA backend.
### Root Cause Analysis
The parity issue was caused by **buffer pointer misconfiguration** in
the GQA (Group Query Attention) QKV preprocessing pipeline. The original
implementation used multiple separate kernels for:
1. Unpacking packed QKV tensor
2. Applying RoPE (Rotary Position Embedding) to Q and K 
3. Appending K/V to cache
This multi-kernel approach created opportunities for misconfiguration:
- Buffers were allocated but not properly used
- Pointers could reference memory that was not yet allocated or
initialized
- Buffer sharing logic was fragmented across different code paths
### Solution
Consolidate QKV preprocessing into a **single fused kernel**
(`UnpackRoPEAppend`) that performs all operations in one pass:
1. **Unified kernel design**: A single kernel handles unpacking, RoPE
application, and cache append operations
2. **Simplified buffer management**: The new `PrepareQKV` function
clearly manages buffer allocation and ensures proper initialization
3. **Explicit past-to-present cache copy**: When
`past_present_share_buffer` is false, explicitly copy past KV cache to
present buffer before appending new tokens
4. **Zero-initialization for non-shared buffers**: Clear present KV
buffers when not sharing with past to ensure deterministic output
### Changes Summary
| File | Changes |
|------|---------|
|
[group_query_attention_qkv.cuh](cci:7://file:///home/tlwu/onnxruntime/onnxruntime/contrib_ops/cuda/bert/group_query_attention_qkv.cuh:0:0-0:0)
| New fused `UnpackRoPEAppend` kernel with shared memory optimization
for non-interleaved RoPE |
| `group_query_attention_impl.cu` | New `PrepareQKV` helper function
that orchestrates buffer setup and kernel launch |
| `group_query_attention.cc` | Simplified operator logic by delegating
QKV prep to unified helper |
| `test_gqa.py` | Enhanced test coverage for various QKV configurations
|
### Key Improvements
- **Reduced kernel launches**: From 4-5 separate kernel calls to a
single fused kernel
- **Better memory safety**: All buffer pointers are validated in a
single location
- **Improved RoPE handling**: Uses shared memory for efficient
non-interleaved RoPE computation
- **Deterministic output**: Explicit buffer initialization ensures
consistent results across runs
- **Compatible with quantized KV cache**: The new preprocessing kernel
design supports future quantization work
### Testing
- All existing GQA unit tests pass
- Verified Qwen3 model no longer produces gibberish output
- Tested both fp16/bf16 and various head configurations
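The RoPE step that the fused kernel applies can be sketched in scalar Python. This illustrates the standard non-interleaved ("half-rotated") rotary embedding formulation, as an assumption about the kernel's math; the actual CUDA code is vectorized and uses shared memory.

```python
import math

# Scalar sketch of non-interleaved RoPE: element i is paired with element
# i + d/2 and the pair is rotated by a position-dependent angle.
# Illustrative only; not the CUDA kernel.

def rope_non_interleaved(x, pos, base=10000.0):
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = pos / (base ** (2 * i / d))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out
```

A useful sanity property: position 0 is the identity, and rotation preserves the vector's norm, which is one way a parity test can catch a misconfigured buffer feeding this step.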

* [QNN EP] Fix error messages being logged as VERBOSE instead of ERROR (microsoft#24931)

## Problem

QNN error messages were being logged at VERBOSE level instead of ERROR
level, making them invisible unless verbose logging was enabled. Users
would only see unhelpful generic error messages like:

```
Failed to finalize QNN graph. Error code: 1002 at location qnn_model.cc:167 FinalizeGraphs
```

But the actual detailed error messages from QNN were hidden in verbose
logs:

```
tcm_migration.cc:2088:ERROR:Operator named q::*InputSlicePad (0x1654900000002) not sufficiently tiled to fit in TCM. Requires 12441600 bytes
graph_prepare.cc:2808:ERROR:Graph prepare TCM Migration action failed
graph_prepare.cc:2868:ERROR:Graph prepare failed during optimization with err: 17, Fatal Optimize
```

## Root Cause

The `QnnLogging` callback function in `qnn_backend_manager.cc` was
ignoring the `level` parameter from QNN and hardcoding all messages as
`kVERBOSE` severity:

```cpp
void QnnLogging(const char* format, QnnLog_Level_t level, uint64_t timestamp, va_list argument_parameter) {
  ORT_UNUSED_PARAMETER(level);  // ❌ Ignoring the actual log level
  // ...
  const auto severity = ::onnxruntime::logging::Severity::kVERBOSE;  // ❌ Hardcoded as VERBOSE
```

## Solution

Modified the `QnnLogging` function to properly map QNN log levels to
appropriate ORT severity levels:

- `QNN_LOG_LEVEL_ERROR` → `logging::Severity::kERROR` ✅ **Key fix**
- `QNN_LOG_LEVEL_WARN` → `logging::Severity::kWARNING`
- `QNN_LOG_LEVEL_INFO` → `logging::Severity::kINFO`
- `QNN_LOG_LEVEL_VERBOSE/DEBUG` → `logging::Severity::kVERBOSE`
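The mapping can be sketched as follows. This is a Python mock of the C++ fix in `qnn_backend_manager.cc`; the enum names mirror QNN's `QnnLog_Level_t` and ORT's `logging::Severity`, but the numeric values here are illustrative.

```python
from enum import Enum

# Mock enums standing in for QnnLog_Level_t and logging::Severity.
class QnnLogLevel(Enum):
    ERROR = 1
    WARN = 2
    INFO = 3
    VERBOSE = 4
    DEBUG = 5

class OrtSeverity(Enum):
    kVERBOSE = 0
    kINFO = 1
    kWARNING = 2
    kERROR = 3

def map_qnn_level_to_ort_severity(level: QnnLogLevel) -> OrtSeverity:
    # Previously every message was hardcoded to kVERBOSE; honoring the QNN
    # level lets ERROR-level diagnostics surface in default logging output.
    return {
        QnnLogLevel.ERROR: OrtSeverity.kERROR,
        QnnLogLevel.WARN: OrtSeverity.kWARNING,
        QnnLogLevel.INFO: OrtSeverity.kINFO,
    }.get(level, OrtSeverity.kVERBOSE)
```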

## Changes Made

1. **Modified `QnnLogging` function**: Removed hardcoded `kVERBOSE` and
added proper level mapping
2. **Added `MapQNNLogLevelToOrtSeverity` function**: For potential
future reuse
3. **Minimal and surgical changes**: Only 37 lines added, 2 removed

## Impact

QNN error messages will now appear as ERROR-level logs in normal logging
output, making debugging much easier for users without requiring verbose
logging to be enabled.

Fixes microsoft#24876.


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: vraspar <51386888+vraspar@users.noreply.github.com>
Co-authored-by: yuslepukhin <11303988+yuslepukhin@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>

* [webgpu] Use LazyRelease for prepack allocator (microsoft#27077)

BUG microsoft#27068

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>

* [webgpu] fix broadcast for SkipLayerNorm (microsoft#27107)

### Description

Fix the bug discovered by microsoft#27014:

```
SkipLayerNormTest.SkipLayerNormBatch2_Skip_Broadcast_No_Batch_Size
SkipLayerNormTest.SkipLayerNormBatch2_Skip_Broadcast_Batch_Size_1
```

* [webgpu] Support int64 for range (microsoft#26673)

### Description  
- Add a new registerInt64Ops option to WebGpuExecutionProviderConfig
- Int64 support is now enabled when enable_graph_capture OR register_int64_ops is true
- Refactor Range kernel registration to support conditional int64 registration
- Update kernel registry caching to handle all 4 combinations of flags
- Rename parameters from enable_graph_capture to enable_int64 for clarity
- Add config parsing in webgpu_provider_factory.cc for the registerInt64Ops option

### Motivation
Needed for updating position ids with an ONNX model in GenAI.

Continuous decoding mode: `position_ids[i] = i + total_length -
new_kv_length`

We can use an ONNX model that includes a Range op to implement the
position-id update:
Inputs: start (total_length - new_kv_length), limit (total_length),
delta (1)
    Output: position_ids (1D tensor of size new_kv_length)

* Remove x86 from nuget (microsoft#27124)

Related issue: microsoft#26985

* perftest: support plugin eps for compile_ep_context (microsoft#27121)

* Extend compile_ep_context to also support plugin eps
* Adds a compile_only option to skip execution; useful when compiling
for virtual devices

compile_ep_context (physical device)
<img width="1259" height="510" alt="image"
src="https://github.com/user-attachments/assets/14650c17-0c8a-4002-a7ce-e8e4c815a516"
/>

compile_ep_context + compile_only (virtual device)
<img width="1262" height="173" alt="image"
src="https://github.com/user-attachments/assets/2f0844cc-5e83-4b2d-bf0a-0d815d9bad29"
/>

* [CPU] Fix arithmetic overflow and legacy TODO in Det operator (microsoft#27070)

### Description
This PR fixes the legacy `TODO: fix the warnings` in the `Det` operator.
The arithmetic overflow warning (C26451) is addressed by using `int64_t`
for tensor dimension and batch size calculations, ensuring safe pointer
arithmetic.

### Motivation and Context
- Removes unused warning suppression pragma.
- Prevents potential overflow when handling large batches of matrices.
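The overflow class that moving to `int64_t` avoids can be illustrated in Python by simulating 32-bit wraparound (Python ints are arbitrary precision, so the mask stands in for a narrow C type; the function names are illustrative, not the operator's code):

```python
# Illustration of the C26451-style overflow the fix avoids: computing a
# buffer offset as batch_index * n * n in 32-bit arithmetic wraps for large
# matrices, while 64-bit (int64_t) arithmetic stays exact.

INT32_MASK = 0xFFFFFFFF

def offset_int32(batch_index: int, n: int) -> int:
    # Simulated 32-bit wraparound, as if the product were computed in int.
    return (batch_index * n * n) & INT32_MASK

def offset_int64(batch_index: int, n: int) -> int:
    # Python ints stand in for int64_t: no wraparound at these sizes.
    return batch_index * n * n
```

For example, with 10 batches of 30000x30000 matrices the true offset (9e9 elements) exceeds the 32-bit range, so the wrapped value silently points at the wrong element.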

* Engine compatibility validity API implementation (microsoft#26774)

Added support for the engine validation check for EP Context models.

### Motivation and Context
We wanted to implement support for the GetModelCompatibilityForEpDevices()
API so that end users have an API for the engine validation check on EP
context models. This change adds that support and the necessary function
implementations.
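From the application's side, the status returned for a set of EP devices drives a download-vs-recompile decision. A Python mock of that decision follows; the status names are modeled on the OrtCompiledModelCompatibility values associated with this API, and their exact spelling should be treated as an assumption of this sketch.

```python
from enum import Enum, auto

# Mock of the compatibility statuses an app can act on. Names modeled on
# OrtCompiledModelCompatibility; illustrative, not the C header.
class ModelCompatibility(Enum):
    EP_NOT_APPLICABLE = auto()
    EP_SUPPORTED_OPTIMAL = auto()
    EP_SUPPORTED_PREFER_RECOMPILATION = auto()
    EP_UNSUPPORTED = auto()

def should_download_precompiled(status: ModelCompatibility) -> bool:
    # Download the remote precompiled model only when the EP can run it;
    # otherwise fall back to compiling locally.
    return status in (ModelCompatibility.EP_SUPPORTED_OPTIMAL,
                      ModelCompatibility.EP_SUPPORTED_PREFER_RECOMPILATION)
```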

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com>
Co-authored-by: Rohanjames1997 <rohanjms@amazon.com>
Co-authored-by: adrastogi <aditya.rastogi@microsoft.com>
Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: adrastogi <8368026+adrastogi@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Stephan Seitz <stephan.seitz@fau.de>
Co-authored-by: Xiaofei Han <xiaofeihan@microsoft.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: vraspar <51386888+vraspar@users.noreply.github.com>
Co-authored-by: yuslepukhin <11303988+yuslepukhin@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
Co-authored-by: Theodore Cooper <63190431+the0cp@users.noreply.github.com>
Co-authored-by: umangb-09 <umangb@nvidia.com>
Co-authored-by: ortqnnepci <ortqnnepci@qti.qualcomm.com>