Explicit SIMD design notes — oneAPI DPC++ Compiler documentation
This document is a collection of notes describing the design and/or implementation of various parts of the ESIMD programming model support within DPC++.
Overview of ESIMD support in DPC++ components¶
ESIMD support is spread across a number of components in the oneAPI software stack, spanning compile time, link time and runtime. The picture below shows a simplified view of the DPC++ compiler and runtime, and where the ESIMD (sub-)components fit into it.

User program¶
A user program can contain both SYCL and ESIMD kernels, either in the same or
in different translation units. DPC++ ESIMD support automatically splits the
device code into SYCL and ESIMD parts and redirects them to different back-ends.
To facilitate this splitting, the compiler automatically identifies, marks up
and clones parts of the ESIMD call graph, starting from kernels and functions
explicitly marked with the intel::sycl_explicit_simd attribute.
Clang driver¶
TODO: describe driver modifications.
Source locations:
clang/lib/Driver/ToolChains/Clang.cpp
clang/include/clang/Driver/Options.td
Clang front-end¶
Code (LLVM IR) generator¶
ESIMD-specific code generator tweaks are mostly translations of the internal FE representation of various ESIMD attributes into LLVM IR attributes or metadata.
Kernel signature generation¶
For ESIMD kernels, a number of additional attributes are generated for the kernel function itself as well as for certain arguments.
Kernels are annotated with the sycl_explicit_simd and intel_reqd_sub_group_size attributes. The latter must always be 1 for an ESIMD kernel or function.
An argument which conveys an accessor’s pointer is assigned the kernel_arg_accessor_ptr attribute.
Global variable code generation¶
ESIMD supports “private globals”: global variables which have one copy per
thread of execution (similar to C++ thread_local), normally allocated in the Gen
register file. To make a global variable a “private global”, the
__attribute__((opencl_private)) __attribute__((sycl_explicit_simd))
attributes are used. Globals of this kind can be forced to a specific register
using the __attribute__((register_num(n))) attribute. The clang code generator
translates these to the genx_volatile and genx_byte_offset LLVM IR attributes.
Function attributes translations¶
sycl_esimd_vectorize -> CMGenxSIMT
Source locations:
clang/lib/CodeGen/CGSYCLRuntime.cpp
clang/lib/CodeGen/CodeGenFunction.cpp
clang/lib/CodeGen/CodeGenModule.cpp
Clang middle-end¶
ESIMD API restriction verifier¶
This component is an LLVM IR pass over a compiled translation unit. It checks
for the presence of certain SYCL APIs which are disallowed within ESIMD code.
For example, SYCL reductions are not allowed in ESIMD. The verifier does this by
demangling all the call targets within ESIMD code and matching them against
internal sub-string filters. It is invoked from clang/lib/CodeGen/BackendUtil.cpp.
Source locations:
llvm/lib/SYCLLowerIR/ESIMD/ESIMDVerifier.cpp
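The demangle-and-match approach described above can be illustrated with a small standalone sketch. This is not the code of ESIMDVerifier.cpp: the filter list, the function names and the sample symbols are hypothetical, and the sketch works on plain strings rather than LLVM IR. It assumes a toolchain that provides the Itanium C++ ABI demangler (abi::__cxa_demangle from <cxxabi.h>, available with GCC and Clang).

```cpp
#include <cxxabi.h>

#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Demangle an Itanium-ABI symbol name. Returns the input unchanged if the
// name cannot be demangled (for example, a plain C symbol).
std::string demangle(const std::string &Mangled) {
  int Status = 0;
  char *Raw = abi::__cxa_demangle(Mangled.c_str(), nullptr, nullptr, &Status);
  if (Status != 0 || Raw == nullptr)
    return Mangled;
  std::string Result(Raw);
  std::free(Raw); // the buffer is malloc-allocated by the demangler
  return Result;
}

// Hypothetical filter list for illustration only; the real verifier keeps
// its own internal list of disallowed API name fragments.
const std::vector<std::string> DisallowedSubstrings = {"sycl::reduction"};

// Returns true if the demangled callee name contains a disallowed fragment.
bool isDisallowedInESIMD(const std::string &MangledCallee) {
  const std::string Name = demangle(MangledCallee);
  for (const std::string &Filter : DisallowedSubstrings) {
    assert(!Filter.empty() && "an empty filter would match every callee");
    if (Name.find(Filter) != std::string::npos)
      return true;
  }
  return false;
}
```

In this toy model, a made-up callee such as `_ZN4sycl9reductionEv` demangles to `sycl::reduction()` and is rejected, while `_Z3foov` (`foo()`) passes. Sub-string matching is simple and needs no type information, at the cost of requiring carefully chosen filters to avoid false positives.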
sycl-post-link transformations¶
As a part of the input device code module transformation pipeline, the
sycl-post-link tool splits the input module (or modules resulting from splitting
by other characteristics, such as aspects) into two: a SYCL module and an ESIMD
module. Shared functions invoked from both SYCL and ESIMD code are cloned during
the process. This is necessary because the SYCL and ESIMD parts must undergo
different sets of transformations before the resulting SPIR-V is generated.
ESIMD modules resulting from splitting are marked with the device binary
property isEsimdImage.
sycl-post-link is the post-link process driver; it invokes the necessary
transformations as well as optimizations on fully linked device code. As a part
of the process it splits the SYCL and ESIMD parts of the code into separate LLVM
IR modules and invokes different sets of transformations on them. If a program
has an invoke_simd call in it, then sycl-post-link will link the SYCL and ESIMD
parts back together, cloning overlapping parts as needed.
Source locations:
llvm/tools/sycl-post-link/sycl-post-link.cpp
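The splitting-with-cloning idea can be sketched as a reachability computation over a call graph. The following is a toy model in standard C++, not the sycl-post-link implementation: functions are plain strings, and the names CallGraph, reachableFrom, SplitResult and splitModule are hypothetical. A function reachable from both SYCL and ESIMD entry points ends up in both modules, which corresponds to cloning it.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Caller name -> names of the functions it calls.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Collect every function reachable from the given entry points.
std::set<std::string> reachableFrom(const CallGraph &CG,
                                    const std::vector<std::string> &Roots) {
  std::set<std::string> Seen;
  std::vector<std::string> Worklist(Roots);
  while (!Worklist.empty()) {
    const std::string F = Worklist.back();
    Worklist.pop_back();
    if (!Seen.insert(F).second)
      continue; // already visited
    const auto It = CG.find(F);
    if (It != CG.end())
      for (const std::string &Callee : It->second)
        Worklist.push_back(Callee);
  }
  return Seen;
}

struct SplitResult {
  std::set<std::string> SyclModule;  // functions kept in the SYCL module
  std::set<std::string> EsimdModule; // functions kept in the ESIMD module
  std::set<std::string> Cloned;      // functions present in both modules
};

// Split by entry-point kind; shared callees are duplicated into both parts.
SplitResult splitModule(const CallGraph &CG,
                        const std::vector<std::string> &SyclKernels,
                        const std::vector<std::string> &EsimdKernels) {
  SplitResult R;
  R.SyclModule = reachableFrom(CG, SyclKernels);
  R.EsimdModule = reachableFrom(CG, EsimdKernels);
  for (const std::string &F : R.SyclModule)
    if (R.EsimdModule.count(F) != 0)
      R.Cloned.insert(F);
  return R;
}
```

For a graph where `sycl_kernel` calls `helper` and `sycl_only`, and `esimd_kernel` calls `helper` and `esimd_only`, only `helper` is reported as cloned. Each copy can then be transformed independently by the pipeline of the module it belongs to.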
ESIMD Lowerer¶
The ESIMD part of the device code undergoes a set of ESIMD-specific transformations.
First, the intrinsic lowering and metadata generation phase happens. It is
implemented in the SYCLLowerESIMDPass LLVM IR Module pass. Its primary
purposes are:
translate __esimd_* intrinsic calls into the corresponding genx.* intrinsics known to the VC BE
  in some cases, there is no direct equivalent (for example, __esimd_pack_mask), in which case the lowerer generates LLVM IR with the desired semantics
translate some of the __spirv.* intrinsics to something acceptable to the VC BE
Source locations:
LowerESIMD.cpp
ESIMDOptimizeVecArgCallConv.cpp
LowerESIMDVecArg.cpp
LowerESIMDVLoadVStore.cpp
Genx SPIR-V writer adaptor¶
(part of vc-intrinsics repo)
SYCL Runtime¶
The SYCL runtime (RT) handles ESIMD specially in a few places:
When setting kernel invocation arguments corresponding to an accessor, the RT skips setting the offset, memory range and access range arguments (normally set for usual SYCL kernels), because ESIMD does not support these. In other words, an accessor used within a kernel (and captured in the kernel lambda) is translated to 4 SPIR-V kernel arguments for a normal SYCL kernel, and to just 1 argument for an ESIMD kernel.
When creating JIT compilation options, the SYCL runtime checks whether the device binary image to be JIT-compiled has the “isEsimdImage” property, in which case it adds the -vc-codegen JIT option, which makes the Intel GPU runtime use the vector backend (aka ‘VC BE’) to JIT-compile the device binary (SPIR-V).
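The two special cases above can be summarized in a minimal sketch. It is a toy model in standard C++, not the SYCL runtime code: ArgKind, accessorKernelArgs and jitCompileOptions are hypothetical names, and the order of the extra accessor arguments is an assumption made for illustration. Only the argument counts (4 versus 1) and the -vc-codegen option come from the description above.

```cpp
#include <string>
#include <vector>

// Kinds of kernel arguments an accessor may expand to (illustrative names).
enum class ArgKind { AccessorPtr, AccessRange, MemoryRange, Offset };

// Arguments set for one accessor: 4 for a regular SYCL kernel, only the
// pointer for an ESIMD kernel. The ordering here is an assumption.
std::vector<ArgKind> accessorKernelArgs(bool IsESIMD) {
  std::vector<ArgKind> Args = {ArgKind::AccessorPtr};
  if (!IsESIMD) {
    Args.push_back(ArgKind::AccessRange);
    Args.push_back(ArgKind::MemoryRange);
    Args.push_back(ArgKind::Offset);
  }
  return Args;
}

// Append -vc-codegen when the device image carries the ESIMD property.
std::string jitCompileOptions(std::string Options, bool IsEsimdImage) {
  if (IsEsimdImage) {
    if (!Options.empty())
      Options += ' ';
    Options += "-vc-codegen";
  }
  return Options;
}
```

Keying the JIT option on an image property, rather than on each kernel, fits the earlier splitting step: after sycl-post-link, a device image is either entirely ESIMD or entirely SYCL.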
TODOs¶
This section lists current major ESIMD gaps/TODOs.
Move all APIs out of the experimental namespace. One of the major APIs there is LSC memory accesses. The main roadblock to making it a stable API is the absence of a specification for cache hints, which should be shared between SYCL and ESIMD.
Architecture-specific APIs should be explicitly marked as such in the user documentation, with references to the list of architectures known to oneAPI.
Properly mark up architecture-specific APIs, such as dpas, with the required aspects, according to the “optional device features” design. This might require splitting implementations into per-architecture variants. The if_device_has feature may help avoid duplication of common parts and dispatch to architecture-dependent code at a fine-grained level from within a function.
As the VC BE moves away from genx.* intrinsics, replacing them with __spirv_* ones defined in various extensions, ESIMD should catch up.
Unification of the common simd_view/simd interfaces in fact leads to significant complication of the implementation rather than the intended simplification via avoiding code duplication; it might make sense to have separate implementations.
Directions¶
This section lists possible directions for ESIMD improvements.
Support std::simd. This is the standard C++ way for explicit SIMD programming. It can help run (a subset of) ESIMD on CPU efficiently in the future.
Clear (via namespace?) separation of ESIMD APIs into portable and architecture-specific parts.
Standardizing simd_view or an equivalent. This is effectively a reference to a subset of an esimd::simd vector object’s elements. The subset is defined in a regular way via a starting offset, a stride and the number of elements in the subset. This has proved to be very useful and is loved by users. It is missing in std::simd.
Design something like invoke_spmd (similar to the invoke_simd extension) to be able to invoke SPMD functions from ESIMD code while vectorizing the calls in the back-end. This would replace sycl_esimd_vectorize and make this concept usable by all users, not only by the internal ESIMD implementation.
Create a specification for the ESIMD kernel ABI and stand-alone kernel declaration rules to make ESIMD kernels callable by arbitrary host offload runtimes, such as Level Zero.
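To make the simd_view idea from the list above concrete, here is a minimal sketch of a strided view in standard C++. It is not the ESIMD simd_view class and does not follow its interface: ToyVec and ToyView are hypothetical names, the viewed vector is modeled with std::array, and the offset, stride and element count are run-time values chosen for brevity.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy stand-in for a fixed-size vector; not the real esimd::simd class.
template <typename T, std::size_t N> using ToyVec = std::array<T, N>;

// A reference to Count elements of a ToyVec, starting at Offset and stepping
// by Stride. Writes through the view modify the viewed vector.
template <typename T, std::size_t N> class ToyView {
public:
  ToyView(ToyVec<T, N> &Vec, std::size_t Offset, std::size_t Stride,
          std::size_t Count)
      : Base(Vec), Off(Offset), Step(Stride), Cnt(Count) {
    // The last selected element must lie inside the vector.
    assert(Count == 0 || Offset + (Count - 1) * Stride < N);
  }

  std::size_t size() const { return Cnt; }

  // Access the I-th selected element of the underlying vector.
  T &operator[](std::size_t I) {
    assert(I < Cnt);
    return Base[Off + I * Step];
  }

  // Assign the same value to every selected element.
  ToyView &operator=(const T &Value) {
    for (std::size_t I = 0; I < Cnt; ++I)
      (*this)[I] = Value;
    return *this;
  }

private:
  ToyVec<T, N> &Base;
  std::size_t Off, Step, Cnt;
};
```

For example, a view over an 8-element vector with offset 1, stride 2 and count 4 selects elements 1, 3, 5 and 7; assigning a scalar to the view updates exactly those elements and leaves the others untouched.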