Introducing BF16 Pointwise NCHWc Convolution for Arm64 by Rohanjames1997 · Pull Request #26838 · microsoft/onnxruntime
* Bump version to 1.25.0 (microsoft#27048)

Increase version number to 1.25.0.

* [webgpu] Optimize generic 4D Transpose using OIHW2OHWI Program (microsoft#26942)

### Description

This PR migrates the `OIHW2OHWI` Program from `Im2ColMatMul` to the `Transpose` operator. By centralizing this logic, we leverage the specialized shader to optimize generic 4D transpositions (specifically the {0, 2, 3, 1} permutation pattern) while reducing code duplication.

While this shader is capable of supporting 2D/3D transpositions, those optimizations are reserved for follow-up PRs.

### Motivation and Context

See above.

* Fix failing mainline build on Arm64 linux (microsoft#27101)

### Description

`sconv.h` was renamed to `sconv_nchwc_kernel_neon.h` in microsoft#26688 but the reference to the old name was still in a new file added at around the same time in microsoft#26838. The CI doesn't include building for this configuration yet - it will be added after the 1.24 release.

### Motivation and Context

Fixes failing mainline build on Arm64 linux when `--enable_arm_neon_nchwc` is supplied.

### Testing

This now passes on Arm64 linux:

`./build.sh --config Release --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync --skip_tests --enable_pybind --build_wheel --enable_arm_neon_nchwc`

* Add dedicated API to support extracting compatibility string from model metadata (microsoft#27015)

### Description

This change proposes a new helper ORT API for callers that need to extract the model compatibility string from a precompiled model.

### Motivation and Context

See microsoft#25749 for more background on the model compatibility concept and infrastructure; microsoft#25841 provides a related helper API for an application to call to do a validation check using the compatibility info string.
However, there is no direct way to get to the model metadata without either creating a session (which some callers may prefer to avoid) or taking a dependency on a separate library to parse the model's protobuf (which again callers may prefer to avoid). This change proposes a separate helper API which can be used to retrieve the compatibility info string, thereby avoiding session creation or an external dependency. This does incur some amount of redundant work in that the model protobuf will be parsed again during session creation, but for some callers this tradeoff may be acceptable.

---------

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: adrastogi <8368026+adrastogi@users.noreply.github.com>

* Move model compatibility checks ahead of session initialization (microsoft#27037)

### Description

The current infrastructure for validating compatibility of a precompiled model does the check after session initialization occurs, which turns out to be quite costly. The check should ideally happen beforehand, to short-circuit those expensive operations.

### Motivation and Context

This change will make it more tractable for applications to rely on the existing session machinery to check compatibility of any of their models.

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>

* [test] refactor common test target settings (microsoft#27013)

### Description

- factor duplicated test target settings into helper functions
- reuse helpers for onnxruntime_test_all and onnxruntime_provider_test
- keep target-specific settings intact

### Motivation and Context

There is some duplicated code in onnxruntime_unittests.
Originally there was only one unit test target, `onnxruntime_test_all`; it was later split into two: `onnxruntime_test_all` and `onnxruntime_provider_test`. Some of the lines that set up build flags were simply copied, which risks inconsistency in the future.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix OrtApi static_assert violation, add instructions for updating additional API structs. (microsoft#27100)

### Description

Fix OrtApi 1.24 API size static_assert violation triggered by addition of new APIs in microsoft@f481b17. Add version update instructions for updating additional API structs.

### Motivation and Context

Fix build on main. Add info about other API structs to version update instructions.

* Linux device discovery for TRT-RTX Ep (microsoft#26210)

### Description

This change adds PCIe bus_id to the properties detected during Linux device discovery. This property is used to enable device discovery on Linux for the TRT-RTX execution provider.

### Motivation and Context

I want to use device discovery for TRT-EP on Linux as well. These changes have already been tested with the newly added inference samples in microsoft/onnxruntime-inference-examples#529.

@gedoensmax for visibility

* Add absl cuda warnings patch (microsoft#27096)

Some PRs that use core/common/inlined_containers.h can cause failures in the CUDA CI pipeline.
```
E:\_work\_temp\build\RelWithDebInfo\vcpkg_installed\x64-windows-static-md\include\absl/hash/internal/hash.h(481): error #68-D: integer conversion resulted in a change of sign [E:\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
    sizeof(T) == -1,
    ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

E:\_work\_temp\build\RelWithDebInfo\vcpkg_installed\x64-windows-static-md\include\absl/hash/hash.h(337): error #549-D: variable "s" is used before its value is set [E:\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
    return s;
    ^

E:\_work\_temp\build\RelWithDebInfo\vcpkg_installed\x64-windows-static-md\include\absl/container/internal/raw_hash_set.h(468): error #69-D: integer conversion resulted in truncation [E:\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
    static_cast<uint16_t>(reinterpret_cast<uintptr_t>(&seed));
    ^

3 errors detected in the compilation of "E:/_work/onnxruntime/onnxruntime/onnxruntime/contrib_ops/cuda/sparse/block_mask.cu".
```

This change adds a patch to Abseil to mitigate those failures. This solution has been verified to be effective in PR microsoft#27087.

* [webgpu] Support Identity (microsoft#27067)

* Add enable_profiling in runoptions (microsoft#26846)

### Description

Support run-level profiling

This PR adds support for profiling individual Run executions, similar to session-level profiling. Developers can enable run-level profiling by setting `enable_profiling` and `profile_file_prefix` in RunOptions. Once the run completes, a JSON profiling file will be saved using profile_file_prefix + timestamp.
<img width="514" height="281" alt="png (2)" src="https://github.com/user-attachments/assets/8a997068-71d9-49ed-8a5c-00e0fa8853af" />

### Key Changes

1. Introduced a local variable `run_profiler` in `InferenceSession::Run`, which is destroyed after the run completes. Using a dedicated profiler per run ensures that profiling data is isolated and prevents interleaving or corruption across runs.
2. To maintain accurate execution time when both session-level and run-level profiling are enabled, overloaded `Start` and `EndTimeAndRecordEvent` functions have been added. These allow the caller to provide timestamps instead of relying on `std::chrono::high_resolution_clock::now()`, avoiding potential timing inaccuracies.
3. Added a TLS variable `tls_run_profiler_` to support run-level profiling with the WebGPU Execution Provider (EP). This ensures that when multiple threads enable run-level profiling, each thread logs only to its own WebGPU profiler, keeping thread-specific data isolated.
4. Use `HH:MM:SS.mm` instead of `HH:MM:SS` in the JSON filename to prevent conflicts when profiling multiple consecutive runs.

### Motivation and Context

Previously, profiling was available only at the session level. Developers sometimes want to profile a specific run, which this PR enables.

### Some details

When profiling is enabled via RunOptions, it should ideally collect two types of events:

1. Profiler events: used to calculate the CPU execution time of each operator.
2. Execution Provider (EP) profiler events: used to measure GPU kernel execution time.

Unlike session-level profiling, run-level profiling must collect events correctly when multiple threads run concurrently.

For 1, this is straightforward to support (see `sequential_executor.cc`). We use a thread-local storage (TLS) variable, RunLevelState (defined in profiler.h), to maintain run-level profiling state for each thread.

For 2, each Execution Provider (EP) has its own profiler implementation, and each EP must ensure correct behavior under run-level profiling.
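As a rough illustration of the thread-local isolation described in key change 3, the sketch below keeps a per-thread pointer to the profiler owned by the current run. `RunProfiler`, `RecordEvent`, `Run`, and `RunConcurrently` are hypothetical names invented for this example and are not the ONNX Runtime implementation; only the idea of a thread-local run profiler comes from the description above.

```cpp
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-run profiler: collects event names for a single Run call.
struct RunProfiler {
  std::vector<std::string> events;
};

// Each thread sees its own copy of this pointer, so concurrent runs on
// different threads cannot write into each other's profiler.
thread_local RunProfiler* tls_run_profiler = nullptr;

// Records an event only if the calling thread has run-level profiling enabled.
void RecordEvent(const std::string& name) {
  if (tls_run_profiler != nullptr) {
    tls_run_profiler->events.push_back(name);
  }
}

// Simulates one Run call. The profiler is a local object that lives only for
// the duration of the run, mirroring the "dedicated profiler per run" idea.
std::vector<std::string> Run(bool enable_profiling, const std::string& tag) {
  RunProfiler profiler;
  tls_run_profiler = enable_profiling ? &profiler : nullptr;
  RecordEvent(tag + ":kernel_a");
  RecordEvent(tag + ":kernel_b");
  tls_run_profiler = nullptr;  // avoid leaving a dangling pointer behind
  return profiler.events;
}

// Launches three concurrent runs with different profiling settings and
// returns the events each run collected.
std::vector<std::vector<std::string>> RunConcurrently() {
  std::vector<std::vector<std::string>> results(3);
  std::thread t1([&] { results[0] = Run(true, "t1"); });
  std::thread t2([&] { results[1] = Run(false, "t2"); });
  std::thread t3([&] { results[2] = Run(true, "t3"); });
  t1.join();
  t2.join();
  t3.join();
  return results;
}
```

In this sketch, only the runs that enabled profiling collect events, and each collects only its own, regardless of how the threads interleave.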
This PR ensures that the WebGPU profiler works correctly with run-level profiling.

### Test Cases

| Scenario | Example | Expected Result |
|---------|---------|-----------------|
| Concurrent runs on the same session with different run-level profiling settings | t1: `sess1.Run({ enable_profiling: true })`<br>t2: `sess1.Run({ enable_profiling: false })`<br>t3: `sess1.Run({ enable_profiling: true })` | Two trace JSON files are generated: one for `t1` and one for `t3`. |
| Run-level profiling enabled together with session-level profiling | `sess1 = OrtSession({ enable_profiling: true })`<br>`sess1.Run({ enable_profiling: true })` | Two trace JSON files are generated: one corresponding to session-level profiling and one corresponding to run-level profiling. |

* Fix GQA Parity (microsoft#27108)

Fix microsoft#27079 - Qwen3 model quality regression on CUDA backend.

### Root Cause Analysis

The parity issue was caused by **buffer pointer misconfiguration** in the GQA (Group Query Attention) QKV preprocessing pipeline. The original implementation used multiple separate kernels for:

1. Unpacking packed QKV tensor
2. Applying RoPE (Rotary Position Embedding) to Q and K
3. Appending K/V to cache

This multi-kernel approach created opportunities for misconfiguration:

- Buffers were allocated but not properly used
- Pointers could reference memory that was not yet allocated or initialized
- Buffer sharing logic was fragmented across different code paths

### Solution

Consolidate QKV preprocessing into a **single fused kernel** (`UnpackRoPEAppend`) that performs all operations in one pass:

1. **Unified kernel design**: A single kernel handles unpacking, RoPE application, and cache append operations
2. **Simplified buffer management**: The new `PrepareQKV` function clearly manages buffer allocation and ensures proper initialization
3.
**Explicit past-to-present cache copy**: When `past_present_share_buffer` is false, explicitly copy past KV cache to present buffer before appending new tokens
4. **Zero-initialization for non-shared buffers**: Clear present KV buffers when not sharing with past to ensure deterministic output

### Changes Summary

| File | Changes |
|------|---------|
| `group_query_attention_qkv.cuh` | New fused `UnpackRoPEAppend` kernel with shared memory optimization for non-interleaved RoPE |
| `group_query_attention_impl.cu` | New `PrepareQKV` helper function that orchestrates buffer setup and kernel launch |
| `group_query_attention.cc` | Simplified operator logic by delegating QKV prep to unified helper |
| `test_gqa.py` | Enhanced test coverage for various QKV configurations |

### Key Improvements

- **Reduced kernel launches**: From 4-5 separate kernel calls to a single fused kernel
- **Better memory safety**: All buffer pointers are validated in a single location
- **Improved RoPE handling**: Uses shared memory for efficient non-interleaved RoPE computation
- **Deterministic output**: Explicit buffer initialization ensures consistent results across runs
- **Compatible with quantized KV cache**: The new preprocessing kernel design supports future quantization work

### Testing

- All existing GQA unit tests pass
- Verified Qwen3 model no longer produces gibberish output
- Tested both fp16/bf16 and various head configurations

* [QNN EP] Fix error messages being logged as VERBOSE instead of ERROR (microsoft#24931)

## Problem

QNN error messages were being logged at VERBOSE level instead of ERROR level, making them invisible unless verbose logging was enabled. Users would only see unhelpful generic error messages like: ``` Failed to finalize QNN graph.
Error code: 1002 at location qnn_model.cc:167 FinalizeGraphs ``` But the actual detailed error messages from QNN were hidden in verbose logs:

```
tcm_migration.cc:2088:ERROR:Operator named q::*InputSlicePad (0x1654900000002) not sufficiently tiled to fit in TCM. Requires 12441600 bytes
graph_prepare.cc:2808:ERROR:Graph prepare TCM Migration action failed
graph_prepare.cc:2868:ERROR:Graph prepare failed during optimization with err: 17, Fatal Optimize
```

## Root Cause

The `QnnLogging` callback function in `qnn_backend_manager.cc` was ignoring the `level` parameter from QNN and hardcoding all messages as `kVERBOSE` severity:

```cpp
void QnnLogging(const char* format, QnnLog_Level_t level, uint64_t timestamp, va_list argument_parameter) {
  ORT_UNUSED_PARAMETER(level);  // ❌ Ignoring the actual log level
  // ...
  const auto severity = ::onnxruntime::logging::Severity::kVERBOSE;  // ❌ Hardcoded as VERBOSE
```

## Solution

Modified the `QnnLogging` function to properly map QNN log levels to appropriate ORT severity levels:

- `QNN_LOG_LEVEL_ERROR` → `logging::Severity::kERROR` ✅ **Key fix**
- `QNN_LOG_LEVEL_WARN` → `logging::Severity::kWARNING`
- `QNN_LOG_LEVEL_INFO` → `logging::Severity::kINFO`
- `QNN_LOG_LEVEL_VERBOSE/DEBUG` → `logging::Severity::kVERBOSE`

## Changes Made

1. **Modified `QnnLogging` function**: Removed hardcoded `kVERBOSE` and added proper level mapping
2. **Added `MapQNNLogLevelToOrtSeverity` function**: For potential future reuse
3. **Minimal and surgical changes**: Only 37 lines added, 2 removed

## Impact

QNN error messages will now appear as ERROR-level logs in normal logging output, making debugging much easier for users without requiring verbose logging to be enabled.

Fixes microsoft#24876.
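The level mapping listed in the QNN EP fix above can be sketched as follows. This is a minimal, self-contained illustration and not the code from the PR: `QnnLogLevel` and `Severity` are stand-in enums defined here so the example compiles without the QNN SDK or ONNX Runtime headers, their numeric values are illustrative, and the fallback to verbose for unrecognized levels is an assumption.

```cpp
#include <cstdint>

// Stand-in for the QNN SDK log-level type (assumption: the real
// QnnLog_Level_t comes from the QNN headers; values here are illustrative).
enum class QnnLogLevel : std::uint32_t {
  kError = 1,
  kWarn,
  kInfo,
  kVerbose,
  kDebug
};

// Stand-in for onnxruntime::logging::Severity (illustrative subset).
enum class Severity { kVERBOSE, kINFO, kWARNING, kERROR };

// Maps a QNN log level to a severity, following the table in the text:
// ERROR -> kERROR, WARN -> kWARNING, INFO -> kINFO, VERBOSE/DEBUG -> kVERBOSE.
// Unknown levels fall back to kVERBOSE (an assumption, not from the PR).
constexpr Severity MapQnnLogLevelToSeverity(QnnLogLevel level) {
  switch (level) {
    case QnnLogLevel::kError:
      return Severity::kERROR;
    case QnnLogLevel::kWarn:
      return Severity::kWARNING;
    case QnnLogLevel::kInfo:
      return Severity::kINFO;
    case QnnLogLevel::kVerbose:
    case QnnLogLevel::kDebug:
    default:
      return Severity::kVERBOSE;
  }
}
```

A logging callback would call such a function with the `level` argument it receives instead of discarding it, so that error messages keep their severity.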
---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: vraspar <51386888+vraspar@users.noreply.github.com>
Co-authored-by: yuslepukhin <11303988+yuslepukhin@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>

* [webgpu] Use LazyRelease for prepack allocator (microsoft#27077)

BUG microsoft#27068

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>

* [webgpu] fix broadcast for SkipLayerNorm (microsoft#27107)

### Description

Fix the bug discovered by microsoft#27014:

```
SkipLayerNormTest.SkipLayerNormBatch2_Skip_Broadcast_No_Batch_Size
SkipLayerNormTest.SkipLayerNormBatch2_Skip_Broadcast_Batch_Size_1
```

* [webgpu] Support int64 for range (microsoft#26673)

### Description

- Add new registerInt64Ops option to WebGpuExecutionProviderConfig
- Int64 support now enabled when enable_graph_capture OR register_int64_ops is true
- Refactor Range kernel registration to support conditional int64 registration
- Update kernel registry caching to handle all 4 combinations of flags
- Rename parameters from enable_graph_capture to enable_int64 for clarity
- Add config parsing in webgpu_provider_factory.cc for registerInt64Ops option

### Motivation

Needed for updating position ids with an ONNX model in genai.
Continuous decoding mode: `position_ids[i] = i + total_length - new_kv_length`

We can use an ONNX model that includes a Range op to update the position ids:

- Inputs: start (total_length - new_kv_length), limit (total_length), delta (1)
- Output: position_ids (1D tensor of size new_kv_length)

* Remove x86 from nuget (microsoft#27124)

Related issue: microsoft#26985

* perftest: support plugin eps for compile_ep_context (microsoft#27121)

  * Extend compile_ep_context to also support plugin eps
  * Adds compile_only option to skip execution, can be used when compiling for virtual devices

compile_ep_context (physical device)

<img width="1259" height="510" alt="image" src="https://github.com/user-attachments/assets/14650c17-0c8a-4002-a7ce-e8e4c815a516" />

compile_ep_context + compile_only (virtual device)

<img width="1262" height="173" alt="image" src="https://github.com/user-attachments/assets/2f0844cc-5e83-4b2d-bf0a-0d815d9bad29" />

* [CPU] Fix arithmetic overflow and legacy TODO in Det operator (microsoft#27070)

### Description

This PR fixes the legacy `TODO: fix the warnings` in the `Det` operator. The arithmetic overflow warning (C26451) is addressed by using `int64_t` for tensor dimension and batch size calculations, ensuring safe pointer arithmetic.

### Motivation and Context

- Removes unused warning suppression pragma.
- Prevents potential overflow when handling large batches of matrices.

* Engine compatibility validity API implementation (microsoft#26774)

Added support for engine validation check for EP Context models.

### Motivation and Context

We wanted to implement support for the `GetModelCompatibilityForEpDevices()` API so that end users have an API available for the engine validation check for EP context models.
Added this support and the necessary function implementation.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com>
Co-authored-by: Rohanjames1997 <rohanjms@amazon.com>
Co-authored-by: adrastogi <aditya.rastogi@microsoft.com>
Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: adrastogi <8368026+adrastogi@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Stephan Seitz <stephan.seitz@fau.de>
Co-authored-by: Xiaofei Han <xiaofeihan@microsoft.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: vraspar <51386888+vraspar@users.noreply.github.com>
Co-authored-by: yuslepukhin <11303988+yuslepukhin@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
Co-authored-by: Theodore Cooper <63190431+the0cp@users.noreply.github.com>
Co-authored-by: umangb-09 <umangb@nvidia.com>
Co-authored-by: ortqnnepci <ortqnnepci@qti.qualcomm.com>