coder.gpu.kernelfun - Pragma that maps function to GPU kernels - MATLAB
Pragma that maps function to GPU kernels
Syntax
coder.gpu.kernelfun()
Description
coder.gpu.kernelfun() is a global-level pragma that attempts
to map all the computation within the function it
resides in onto the GPU. Loops within this function are parallelized into GPU
kernels only if they pass the parallel-loop analysis check. This analysis tries to
prove that every loop iteration is independent of the others. In addition, the code
generator does not create GPU kernels for simple loops, such as those with few
iterations or with minimal computation per iteration. Parallelizing such loops can
decrease performance because of the overhead of
kernel launches and memory transfers between the GPU and the CPU.
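As a hypothetical illustration of the parallel-loop analysis (the function and variable names below are invented for this sketch, not part of the product documentation), the first loop has independent iterations and can become a kernel, while the second has a loop-carried dependency and remains sequential:

```matlab
function y = loopExamples(x)
coder.gpu.kernelfun;
n = numel(x);
y = coder.nullcopy(zeros(1, n));
% Each iteration writes only y(i) and reads only x(i), so the
% iterations are independent; this loop can map to a GPU kernel.
for i = 1:n
    y(i) = 2 * x(i);
end
% Each iteration reads y(i-1), a value written by the previous
% iteration. This loop-carried dependency fails the parallel-loop
% analysis, so this loop is expected to run sequentially.
for i = 2:n
    y(i) = y(i) + y(i-1);
end
end
```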
The kernelfun pragma does not require
any input parameters. It generates kernels whose
dimensions are computed automatically based on loop parameters.
This function is a code generation function. It has no effect in MATLAB®.
Examples
This example shows how to use the
kernelfun pragma in a function and generate CUDA® code.
In one file, write the entry-point function scalars
that accepts two vector inputs x,y of size
1x4096 and one scalar input scale.
The function has two for-loops of different iteration
lengths: one performs vector addition and one computes a scaled sum of the elements of x.
Place the coder.gpu.kernelfun() pragma within the
scalars function.
function [vout, sout1] = scalars(x,y,scale)
coder.gpu.kernelfun;
sout1 = 0;
vout = coder.nullcopy(zeros(1,1024));
for i = 1:1024
    vout(i) = x(i) + y(i);
end
for i = 1:4096
    sout1 = (x(i)*scale) + sout1;
end
end
Use the codegen function to generate
a CUDA MEX function.
codegen -config coder.gpuConfig('mex') ...
    -args {ones(1,4096,'double'),ones(1,4096,'double'),coder.typeof(0)} ...
    -report scalars
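Assuming code generation succeeds, the generated MEX function can be called from MATLAB with the same arguments as the original function. By default the generated MEX file takes the entry-point name with a _mex suffix, so scalars_mex is the assumed name here:

```matlab
% Call the generated CUDA MEX function (assumed name: scalars_mex)
% with inputs matching the -args specification above.
x = rand(1,4096);
y = rand(1,4096);
[vout, sout1] = scalars_mex(x, y, 2.0);
```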
GPU Coder creates kernels for the vector addition and the scaled sum reduction.
scalars_kernel1<<<dim3(8U, 1U, 1U), dim3(128U, 1U, 1U)>>>(*gpu_y, *gpu_x,
*gpu_vout);
scalars_kernel2<<<dim3(4U, 1U, 1U), dim3(1024U, 1U, 1U)>>>(scale, *gpu_x,
gpu_sout1);
The kernel for vector addition launches 8 blocks of 128 threads, a total of 1024 threads, one for adding each element. Similarly, the kernel for the sum reduction launches 4 blocks of 1024 threads, a total of 4096 threads.
Version History
Introduced in R2017b
See Also
Functions
codegen | coder.gpu.kernel | stencilfun | coder.gpu.constantMemory | gpucoder.reduce | gpucoder.sort | coder.gpu.nokernel