coder.gpu.kernel
Pragma that maps for-loops to GPU kernels
Syntax
coder.gpu.kernel()
coder.gpu.kernel(B,T)
coder.gpu.kernel(B,T,M,name)
Description
coder.gpu.kernel() is a loop-level pragma that you must place
immediately before a for-loop. This pragma generates a kernel and
computes the launch parameters from the loop parameters.
The coder.gpu.kernel pragma overrides
all parallel loop analysis checks. This override
allows GPU Coder™ to parallelize loops in situations where parallel loop analysis cannot
prove that all iterations are independent. Consider
using coder.gpu.kernelfun to parallelize loops in functions that pass the
parallel loop analysis check.
Note
Using the coder.gpu.kernel pragma before a
for-loop that contains reductions is not
recommended.
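For instance, a loop that accumulates a running sum carries a dependency between iterations, so forcing a kernel with the pragma can produce incorrect results. The sketch below (function name is illustrative, not from the shipped examples) shows the pattern to avoid:

```matlab
function s = sumVector(a) %#codegen
% Sum reduction: each iteration reads the value of s written by the
% previous iteration, so the iterations are NOT independent. Do not
% place coder.gpu.kernel before this loop; consider gpucoder.reduce
% for generating a reduction kernel instead.
s = 0;
for i = 1:numel(a)
    s = s + a(i);   % loop-carried dependency on s
end
end
```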
coder.gpu.kernel(B,T) generates a kernel with the dimensions
specified by B and T.
B = [Bx,By,Bz] is an array that defines the number of blocks in
the grid along dimensions x and y
(the z dimension is not used). T = [Tx,Ty,Tz] is an array that
defines the number of threads in the block along dimensions x,
y, and z.
A value of -1 for B and T indicates that
GPU Coder must infer the grid and block dimensions automatically. The
coder.gpu.kernel pragma generates errors for invalid grid
and block dimensions.
coder.gpu.kernel(B,T,M,name) specifies optional arguments
M and name. M is a
positive integer that specifies the minimum number of blocks per streaming
multiprocessor. Increasing M can reduce the register usage within
a kernel and improve kernel occupancy. A value of -1 for M
indicates that GPU Coder must use the default value of 1. name is
a character array that allows you to customize the name of the generated
kernel.
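As a sketch of the four-argument form (the function, the value of M, and the kernel name here are illustrative assumptions, not from the shipped examples):

```matlab
function out = scaleVector(x) %#codegen
out = zeros(size(x));
% Let GPU Coder infer the grid and block dimensions (-1 for B and T),
% request a minimum of 2 blocks per streaming multiprocessor, and
% name the generated kernel 'scaleKernel'.
coder.gpu.kernel(-1,-1,2,'scaleKernel');
for i = 1:size(x,2)
    out(i) = 2*x(i);
end
end
```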
This function is a code generation function. It has no effect in MATLAB®.
Examples
This example shows how to use the
coder.gpu.kernel pragma to generate a CUDA® kernel.
Create a function named multiplyVectors that performs
element-wise multiplication on two 1-by-1024 input vectors, a
and b. The function contains a for-loop
that multiplies the elements of the
vectors.
function out = multiplyVectors(a,b)
%#codegen
out = zeros(size(a));
for i = 1:size(a,2)
    out(i) = a(i)*b(i);
end
end
To generate a kernel from the for-loop, add the
coder.gpu.kernel pragma before the
for-loop. To compute the kernel launch parameters
from the loop parameters, specify the coder.gpu.kernel
pragma without input arguments.
function out = multiplyVectors(a,b)
%#codegen
out = zeros(size(a));
coder.gpu.kernel();
for i = 1:size(a,2)
    out(i) = a(i)*b(i);
end
end
Use the codegen command to generate
code from multiplyVectors. The generated code contains a
kernel named
multiplyVectors_kernel1.
cfg = coder.gpuConfig("mex");
a = ones([1 1024]);
b = ones([1 1024]);
codegen -config cfg -args {a,b} -report multiplyVectors
This example shows how to use the
coder.gpu.kernel pragma to generate a CUDA kernel and specify the launch parameters.
Create a function named addVectors that accepts two
1-by-4096 inputs, x and y. The
function has one for-loop that adds x
and y.
function out = addVectors(x,y)
%#codegen
out = zeros(size(x));
for i = 1:size(x,2)
    out(i) = x(i)+y(i);
end
end
To create a kernel, place the coder.gpu.kernel pragma
immediately before the vector addition loop. To automatically determine the
number of blocks, specify the number of blocks as -1, and
specify 128 threads per block.
function out = addVectors(x,y)
%#codegen
out = zeros(size(x));
coder.gpu.kernel(-1,128);
for i = 1:size(x,2)
    out(i) = x(i)+y(i);
end
end
Use the codegen command to generate
CUDA code.
cfg = coder.gpuConfig("mex");
x = ones([1 4096]);
y = ones([1 4096]);
codegen -config cfg -args {x,y} -report addVectors
The generated code contains a kernel named
addVectors_kernel1. The kernel launches with 32
blocks and 128 threads per block.
addVectors_kernel1<<<dim3(32U, 1U, 1U), dim3(128U, 1U, 1U)>>>(*gpu_y, *gpu_x, *gpu_out);
You can use variables or expressions when specifying the kernel launch
parameters. For example, you can add an input argument named
T to the addVectors function and
specify T as the number of threads by using
coder.gpu.kernel.
function out = addVectors(x,y,T)
%#codegen
out = zeros(size(x));
coder.gpu.kernel(1,T);
for i = 1:size(x,2)
    out(i) = x(i)+y(i);
end
end
Use the codegen function to generate
CUDA code. The generated code uses the input variable
T to determine the number of threads for each
block.
cfg = coder.gpuConfig("dll");
x = ones([1 4096]);
y = ones([1 4096]);
T = 512;
codegen -config cfg -args {x,y,T} -report addVectors
Version History
Introduced in R2017b
See Also
Functions
codegen | coder.gpu.kernelfun | stencilfun | coder.gpu.constantMemory | gpucoder.reduce | gpucoder.sort | coder.gpu.nokernel