Proposed design for offloading model — oneAPI DPC++ Compiler documentation
Clang Linker Wrapper¶
The clang-linker-wrapper provides the interface to perform the needed link
steps when consuming fat binaries. The linker wrapper performs a majority of
the work involved during the link step during an offload compilation,
significantly reducing the amount of work that is occurring in the compiler
driver. From the compilation perspective, the linker wrapper replaces the
typical call to the host link. This allows for the responsibility of the
compiler driver to be nearly identical when performing a regular compilation
vs an offloading compilation.
From a high level, using the clang-linker-wrapper provides a few benefits:
Moves all of the device linking responsibility out of the compiler driver.
Allows for a more direct ability to perform linking for offloading without requiring the use of the driver, using more linker like calls.
Provides additional flexibility with the ability to dynamically modify the toolchain execution.
Example usage of the external clang-linker-wrapper call:
clang-linker-wrapper <wrapper opts> -- <linker opts>
Device Linking¶
During this phase, all of the individual device binaries that are extracted and
are associated with a given target are worked on given the type of binary we
are working with. The default device code is typically represented in LLVM-IR
which requires an additional link step of the device code before being
wrapped and integrated into the final executable. As mentioned in
Packager device representation in SPIR-V should be considered
with the ability to link native SPIR-V modules. Performing the device link
against LLVM-IR is performed by llvm-link.
To reduce the potential size of the linked device binary, an additional host
link step is performed to gather dependency information when static device
libraries are being compiled. This information is sent through the
clang-offload-deps tool to generate a dependency IR file which is used
during the device link step.
The use of the clang-linker-wrapper introduces the support of LTO for device code. We can leverage this and move away from the dependency gathering information step with ‘clang-offload-deps’ and use thinLTO for device code.
There are multiple device linking steps that also occur. The first step links
together all of the objects and the required device libraries. The second is
performed including all of optional device libraries, the static device
libraries and the dependency information that was gathered above. This link
step is performed with --only-needed to streamline the final device binary.
A list of device libraries that need to linked in with user code is provided by
the driver. The driver is also responsible for letting the
clang-linker-wrapper know the location of the device libraries.
Option |
Expected Behavior |
|---|---|
|
A comma separated list of device libraries that are linked during the device link |
|
The location in which the device libraries reside |
Table: Options to pass device libraries to the clang-linker-wrapper
The device libraries are controlled via the -fno-sycl-device-lib=arg option
where the driver determines based on this option which libraries to tell the
linker wrapper to pull in.
Post-link and SPIR-V translation¶
After the device binaries are linked together, two additional steps are performed to prepare the device binary for consumption by an offline compilation tool for AOT or to be wrapped for JIT processing.
The sycl-post-link tool is used after the device link is performed,
applying any changes such as optimizations and code splitting before passing
off to the llvm-spirv tool, which translates the LLVM-IR to SPIR-V.
Option |
Expected Behavior |
|---|---|
|
Options that will control sycl-post-link step |
|
Options that will control llvm-spirv step |
Table: Options to pass sycl-post-link and llvm-spirv options to the clang-linker-wrapper
Options that will be used by clang-linker-wrapper when invoking the sycl-post-link
tool are provided by the driver via the --sycl-post-link-options=<arg> option.
Options that will be used by clang-linker-wrapper when invoking the llvm-spirv
tool are provided by the driver via the --llvm-spirv-options=<arg> option.
Ahead Of Time Compilation¶
The updated offloading model will integrate the Ahead of Time (AOT) compilation behaviors into the clang-linker-wrapper. The actions will typically take place after the device link, post link, and LLVM-IR to SPIR-V translation steps.
Regardless of the AOT target, the flow is similar, only modifying the offline compiler that is used to create the target device image. It is expected that the offline compiler will also use unique command lines specific to the tool to create the image.
To support the needed option passing triggered by use of the
-Xsycl-target-backend option and implied options based on the optional
device behaviors for AOT compilations for GPU new command line interfaces
are needed to pass along this information.
Target |
Triple |
Offline Tool |
Option for Additional Args |
|---|---|---|---|
CPU |
spir64_x86_64 |
opencl-aot |
|
GPU |
spir64_gen |
ocloc |
|
FPGA |
spir64_fpga |
aoc/opencl-aot |
|
Table: Ahead of Time Info
To complete the support needed for the various targets using the
clang-linker-wrapper as the main interface, a few additional options will
be needed to communicate from the driver to the tool. Further details of usage
are given further below.
Option Name |
Purpose |
|---|---|
|
Tells the link step to perform ‘early’ or ‘image’ processing to create archives for FPGA |
|
Provide the number of parallel jobs that will be used when processing split jobs |
Table: Additional Options for clang-linker-wrapper
The clang-linker-wrapper provides an existing option named -wrapper-jobs
that may be useful for our usage.
spir64_gen support¶
Compilation behaviors involving AOT for GPU involve an additional call to
the OpenCL Offline compiler (OCLOC). This call occurs after the post-link
step performed by sycl-post-link and the SPIR-V translation step which is done
by llvm-spirv. Additional options passed by the user via the
-Xsycl-target-backend=spir64_gen <opts> command as well as the implied
options set via target options such as -fsycl-targets=intel_gpu_skl
will be processed by a new options to the wrapper, --gpu-tool-arg=<arg>
To support multiple target specifications, for instance:
-fsycl-targets=intel_gpu_skl,intel_gpu_pvc, multiple --gpu-tool-arg
options can be passed on the command line. Each instance will be considered
a separate OCLOC call passing along the <args> as options to the OCLOC call.
The compiler driver will be responsible for putting together the full option
list to be passed along.
-fsycl -fsycl-targets=spir64_gen,intel_gpu_skl -Xsycl-target-backend=spir64_gen “-device pvc -options -extraopt_pvc” -Xsycl-target-backend=intel_gpu_skl “-options -extraopt_skl”
Example: spir64_gen enabling options
–gpu-tool-arg=”-device pvc -options extraopt_pvc” –gpu-tool-arg=”-options -extraopt_skl”
Example: clang-linker-wrapper options
Each OCLOC call will be represented as a separate device binary that is individually wrapped and linked into the final executable.
Additionally, the syntax can be expanded to enable the ability to pass specific
options to a specific device GPU target for spir64_gen. The syntax will
resemble --gpu-tool-arg=<arch> <arg>. This corresponds to the existing
option syntax of -fsycl-targets=intel_gpu_arch where arch can be a fixed
set of targets.
–offload-arch¶
For SYCL offloading to Intel GPUs, Intel CPUs, NVidia and AMD GPUs, specify the device architecture using --offload-arch option. For instance
--offload-arch=sm_80 to target an NVidia Tesla A100,
--offload-arch=gfx90a to target an AMD Instinct MI250X, or
--offload-arch=sm_80,gfx90a to target both.
For Intel Graphics AOT target, valid values for --offload-arch are mapped to
valid device names accepted by OCLOC (the Intel GPU AOT compiler) via the -device option.
SYCL offloading with --offload-arch for Intel CPUs and Intel GPUs is currently enabled only with --offload-new-driver option.
Example: $ clang++ -fsycl -offload-arch=bdw --offload-new-driver -c foo.cpp // SYCL AOT for Intel GPU. $ clang++ -fsycl -offload-arch=broadwell --offload-new-driver -c foo.cpp // SYCL AOT for Intel CPU.
The following table shows a mapping of the accepted values for --offload-arch to enable SYCL offloading to Intel GPUs and the corresponding -device value passed to OCLOC.
Intel GPU device |
|
OCLOC -device value |
|---|---|---|
Intel(R) microarchitecture code name Broadwell Intel graphics architecture |
bdw |
bdw |
Intel(R) microarchitecture code name Skylake Intel graphics architecture |
skl |
skl |
Kaby Lake Intel graphics architecture |
kbl |
kbl |
Coffee Lake Intel graphics architecture |
cfl |
cfl |
Apollo Lake Intel graphics architecture |
apl |
apl |
Broxton Intel graphics architecture |
bxt |
apl |
Gemini Lake Intel graphics architecture |
glk |
glk |
Whiskey Lake Intel graphics architecture |
whl |
whl |
Amber Lake Intel graphics architecture |
aml |
aml |
Comet Lake Intel graphics architecture |
cml |
cml |
Ice Lake Intel graphics architecture |
icl, icllp |
icllp |
Elkhart Lake Intel graphics architecture |
ehl |
ehl |
Jasper Lake Intel graphics architecture |
jsl |
jsl |
Tiger Lake Intel graphics architecture |
tgl, tgllp |
tgllp |
Rocket Lake Intel graphics architecture |
rkl |
rkl |
Alder Lake S Intel graphics architecture |
adl_s |
adl_s |
Raptor Lake Intel graphics architecture |
rpl_s |
adl_s |
Alder Lake P Intel graphics architecture |
adl_p |
adl_p |
Alder Lake N Intel graphics architecture |
adl_n |
adl_n |
DG1 Intel graphics architecture |
dg1 |
dg1 |
Alchemist G10 Intel graphics architecture |
acm_g10, dg2_g10 |
acm_g10 |
Alchemist G11 Intel graphics architecture |
acm_g11, dg2_g11 |
acm_g11 |
Alchemist G12 Intel graphics architecture |
acm_g12, dg2_g12 |
acm_g12 |
Ponte Vecchio Intel graphics architecture |
pvc |
pvc |
Ponte Vecchio VG Intel graphics architecture |
pvc_vg |
pvc_vg |
Meteor Lake U/S or Arrow Lake U/S Intel graphics architecture |
mtl_u, mtl_s, arl_u |
mtl_s |
Meteor Lake H Intel graphics architecture |
mtl_h |
mtl_h |
Arrow Lake H Intel graphics architecture |
arl_h |
arl_h |
Battlemage G21 Intel graphics architecture |
bmg_g21 |
bmg_g21 |
Lunar Lake Intel graphics architecture |
lnl_m |
lnl_m |
nvptx64-nvidia-cuda support¶
For SYCL offloading to NVidia GPUs using --offload-arch option, the following table
lists the accepted values.
NVidia GPU device name |
|
|---|---|
NVIDIA Maxwell architecture (compute capability 5.0) |
sm_50 |
NVIDIA Maxwell architecture (compute capability 5.2) |
sm_52 |
NVIDIA Maxwell architecture (compute capability 5.3) |
sm_53 |
NVIDIA Pascal architecture (compute capability 6.0) |
sm_60 |
NVIDIA Pascal architecture (compute capability 6.1) |
sm_61 |
NVIDIA Pascal architecture (compute capability 6.2) |
sm_62 |
NVIDIA Volta architecture (compute capability 7.0) |
sm_70 |
NVIDIA Volta architecture (compute capability 7.2) |
sm_72 |
NVIDIA Turing architecture (compute capability 7.5) |
sm_75 |
NVIDIA Ampere architecture (compute capability 8.0) |
sm_80 |
NVIDIA Ampere architecture (compute capability 8.6) |
sm_86 |
NVIDIA Jetson/Drive AGX Orin architecture |
sm_87 |
NVIDIA Ada Lovelace architecture |
sm_89 |
NVIDIA Hopper architecture |
sm_90 |
NVIDIA Hopper architecture (with wgmma and setmaxnreg instructions) |
sm_90a |
amdgcn-amd-amdhsa support¶
For SYCL offloading to AMD GPUs using --offload-arch option, the following table
lists the accepted values.
AMD GPU device name |
|
|---|---|
AMD GCN GFX7 (Sea Islands (CI)) architecture |
gfx700 |
AMD GCN GFX7 (Sea Islands (CI)) architecture |
gfx701 |
AMD GCN GFX7 (Sea Islands (CI)) architecture |
gfx702 |
AMD GCN GFX8 (Volcanic Islands (VI)) architecture |
gfx801 |
AMD GCN GFX8 (Volcanic Islands (VI)) architecture |
gfx802 |
AMD GCN GFX8 (Volcanic Islands (VI)) architecture |
gfx803 |
AMD GCN GFX8 (Volcanic Islands (VI)) architecture |
gfx805 |
AMD GCN GFX8 (Volcanic Islands (VI)) architecture |
gfx810 |
AMD GCN GFX9 (Vega) architecture |
gfx900 |
AMD GCN GFX9 (Vega) architecture |
gfx902 |
AMD GCN GFX9 (Vega) architecture |
gfx904 |
AMD GCN GFX9 (Vega) architecture |
gfx906 |
AMD GCN GFX9 (Vega) architecture |
gfx908 |
AMD GCN GFX9 (Vega) architecture |
gfx909 |
AMD GCN GFX9 (Vega) architecture |
gfx90a |
AMD GCN GFX9 (Vega) architecture |
gfx90c |
AMD GCN GFX9 (Vega) architecture |
gfx940 |
AMD GCN GFX9 (Vega) architecture |
gfx941 |
AMD GCN GFX9 (Vega) architecture |
gfx942 |
AMD GCN GFX10.1 (RDNA 1) architecture |
gfx1010 |
AMD GCN GFX10.1 (RDNA 1) architecture |
gfx1011 |
AMD GCN GFX10.1 (RDNA 1) architecture |
gfx1012 |
AMD GCN GFX10.1 (RDNA 1) architecture |
gfx1013 |
AMD GCN GFX10.3 (RDNA 2) architecture |
gfx1030 |
GCN GFX10.3 (RDNA 2) architecture |
gfx1031 |
GCN GFX10.3 (RDNA 2) architecture |
gfx1032 |
GCN GFX10.3 (RDNA 2) architecture |
gfx1033 |
GCN GFX10.3 (RDNA 2) architecture |
gfx1034 |
GCN GFX10.3 (RDNA 2) architecture |
gfx1035 |
GCN GFX10.3 (RDNA 2) architecture |
gfx1036 |
GCN GFX11 (RDNA 3) architecture |
gfx1100 |
GCN GFX11 (RDNA 3) architecture |
gfx1101 |
GCN GFX11 (RDNA 3) architecture |
gfx1102 |
GCN GFX11 (RDNA 3) architecture |
gfx1103 |
GCN GFX11 (RDNA 3) architecture |
gfx1150 |
GCN GFX11 (RDNA 3) architecture |
gfx1151 |
GCN GFX12 (RDNA 4) architecture |
gfx1200 |
GCN GFX12 (RDNA 4) architecture |
gfx1201 |
spir64_fpga support¶
Compilation behaviors involving AOT for FPGA involve an additional call to
either aoc (for Hardware/Simulation) or opencl-aot (for Emulation). This
call occurs after the post-link step performed by sycl-post-link and the
SPIR-V translation step performed by llvm-spirv. Additional options passed
by the user via the -Xsycl-target-backend=spir64_fpga <opts> command will be
processed by a new options to the wrapper,
--fpga-tool-arg=<arg>
The FPGA target also has support for additional generated binaries that contain intermediate files specific for FPGA. These binaries (aoco, aocr and aocx) can reside in archives and are treated differently than traditional device binaries.
Generation of the AOCR and AOCX type binary is triggered by the command line
option -fsycl-link, where -fsycl-link=image creates AOCX archives and
-fsycl-link=early generates AOCR archives. The files generated by these
options are handled in a specific manner when encountered.
Any archive with an AOCR type device binary will have the AOCR binary
extracted and passed to aoc to produce an AOCX final image. This final
image is wrapped and added to the final binary during the host link. The use
of -fsycl-link=image with an AOCR binary will create an AOCX based archive
instead of completing the host link. Any archive with an AOCX type device
binary skips the aoc step and is wrapped and added to the final binary during
the host link. Archives with any AOCO device binaries are extracted and passed
through to aoc -library-list=<listfile>
As the clang-linker-wrapper is responsible for understanding the archives
that are added on the command line, it will need to know when to look for
these unique device binaries based on the expected compilation targets. The
behavior of creating the AOCX/AOCR type archive will be triggered via an
additional command line option specified by the driver when -fsycl-link
options are used. The --fpga-link=<type> option will tell the wrapper when
these handlings need to occur.
When using the -fintelfpga option to enable AOT for FPGA, there are
additional expectations during the compilation. Use of the option will enable
debug generation and also generate dependency information. The dependency
generation should be packaged along with the device binary for use during
the link phase. It is expected that the full fat object, containing host,
device and dependency file is generated before being passed to the link phase.
The dependency information is only used when compiling for hardware.
The clang-linker-wrapper tool will be responsible to determine which FPGA
tool is being used during the AOT device compilation phase. The use of
-simulation or -hardware as passed in by --fpga-tool-arg signifies
which tool is used.
spir64_x86_64 support¶
Compilation behaviors involving AOT for CPU involve an additional call to
opencl-aot. This call occurs after the post-link step performed by
sycl-post-link and the SPIR-V translation step performed by llvm-spirv.
Additional options passed by the user via the
-Xsycl-target-backend=spir64_x86_64 <opts> command will be processed by a new
option to the wrapper, --cpu-tool-arg=<arg>
Similar to SYCL offloading to Intel GPUs using --offload-arch, SYCL AOT for Intel CPUs
will also leverage the --offload-arch option.
The valid CPU device names accepted for --offload-arch are CPU names from clang -march.
These names are more verbose, and do not overlap with the Intel GPU names.
These user input CPU names are mapped to the corresponding opencl-aot -march option.
The following table shows a mapping of the accepted values for --offload-arch to enable SYCL offloading to Intel CPUs and the corresponding -march value passed to opencl-aot.
Intel CPU device |
|
opencl-aot -march value |
|---|---|---|
Intel(R) Advanced Vector Extensions 512 |
skylake-avx512 |
avx512 |
Intel(R) Advanced Vector Extensions 2 |
core-avx2 |
avx2 |
Intel(R) Advanced Vector Extensions |
corei7-avx |
avx |
Intel(R) Streaming SIMD Extensions 4.2 |
corei7 |
sse4.2 |
Intel(R) microarchitecture code name Westmere |
westmere |
wsm |
Intel(R) microarchitecture code name Sandy Bridge |
sandybridge |
snb |
Intel(R) microarchitecture code name Ivy Bridge |
ivybridge |
ivyb |
Intel(R) microarchitecture code name Broadwell |
broadwell |
bdw |
Intel(R) microarchitecture code name Coffee Lake |
coffeelake |
cfl |
Intel(R) microarchitecture code name Alder Lake |
alderlake |
adl |
Intel(R) microarchitecture code name Skylake (client) |
skylake |
skylake |
Intel(R) microarchitecture code name Skylake (server) |
skx |
skx |
Intel(R) microarchitecture code name Cascade Lake |
cascadelake |
clk |
Intel(R) microarchitecture code name Ice Lake (client) |
icelake-client |
icl |
Intel(R) microarchitecture code name Ice Lake (server) |
icelake-server |
icx |
Intel(R) microarchitecture code name Sapphire Rapids |
sapphirerapids |
spr |
Intel(R) microarchitecture code name Granite Rapids |
graniterapids |
gnr |
Wrapping of device image¶
Once the device binary is pulled out of the fat binary, the binary must be
wrapped and provided the needed entry points to be used during execution. This
is performed during the link phase and controlled by the clang-linker-wrapper.
It is expected that the wrap information that is generated to be wrapped
around the device binary will match current wrapping information that is used
for the existing offload model. The wrapping in the old model is using the
clang-offload-wrapper tool.
Integration of llvm-foreach¶
Use of llvm-foreach is used frequently during the offloading process. The
functionality was introduced due to the toolchain building limitations of the
driver and the need perform multiple compilations during a specific step of
the compilation. For example, the support for device code splitting performed
by the sycl-post-link tool would require consumers of the split code to work
each item individually. The driver toolchain does not know that this split
occurs, or how many splits were performed. llvm-foreach permits this
obfuscation.
With the new model, the compilation sequence is built dynamically within the
clang-linker-wrapper, allowing for the use of llvm-foreach to become
obsolete. The clang-linker-wrapper is then responsible for performing the
needed individual calls work item which can be done using parallel for.
Beyond llvm-foreach and similar job hiding tools¶
Tools like llvm-foreach, file-table-tform, spirv-to-ir-wrapper were all
introduced to provide a way to manipulate behaviors that could only be
determined at runtime of the compiler toolchain. These were needed to work
around the fact that the toolchain commands constructed by the driver is a fixed
state of commands.
Moving the functionality into clang-linker-wrapper presents the opportunity
step away from the static command construction and create the call chain on
the fly based on real time output from corresponding tools being called.
Host Link¶
The final host link is also performed by the linker wrapper. This link is built upon the full link command line as constructed by the compiler driver, including all libraries and the linked/wrapped device binaries to complete the compilation process.
The provided command line for the final host link step contains the full list of libraries and objects to be linked against. The expectation is for this list to be complete. With the old model, the host objects are directly passed to the host link step. The device objects are processed separately. As we are passing the full command line to the link step, the objects provided will need to be full fat objects. This is different from the old model which will require for an additional step before the link to create the full fat object that is properly represented on the host link command line. This additional step is necessary due to the fact that we are creating the fat objects during a separate step as opposed to integrating the offload binaries during the host object generation. See Fat Binary Generation.