Compute Shader :: Vulkan Documentation Project
Dispatch
Now it’s time to actually tell the GPU to do some compute.
This is done by calling computeCommandBuffers[frameIndex]->dispatch inside a command buffer.
While the analogy isn’t perfect, a dispatch is to compute what a draw call like commandBuffers[frameIndex]->draw is to graphics.
It dispatches a given number of compute work items in at most three dimensions.
computeCommandBuffers[frameIndex]->begin({});
...
computeCommandBuffers[frameIndex]->bindPipeline(vk::PipelineBindPoint::eCompute, *computePipeline);
computeCommandBuffers[frameIndex]->bindDescriptorSets(vk::PipelineBindPoint::eCompute, *computePipelineLayout, 0, {computeDescriptorSets[frameIndex]}, {});
computeCommandBuffers[frameIndex]->dispatch( PARTICLE_COUNT / 256, 1, 1 );
...
computeCommandBuffers[frameIndex]->end();
The computeCommandBuffers[frameIndex]->dispatch call will dispatch PARTICLE_COUNT / 256 work groups in the x dimension.
As our particle array is linear, we leave the other two dimensions at one, resulting in a one-dimensional dispatch.
But why do we divide the number of particles (in our array) by 256?
That’s because in the previous paragraph, we defined that every work group will run 256 compute shader invocations (a local size of 256 in x).
So if we were to have 4096 particles, we would dispatch 16 work groups, with each work group running 256 compute shader invocations.
Getting the two numbers right usually takes some tinkering and profiling, depending on your workload and the hardware you’re running on.
If your particle count is dynamic and can’t always be divided by e.g., 256, you can always check gl_GlobalInvocationID at the start of your compute shader and return early if the global invocation index is greater than or equal to the number of your particles.
And just as was the case for the compute pipeline, a compute command buffer has a lot less state than a graphics command buffer. There’s no need to start a render pass or set a viewport.
Submitting work
As our sample does both compute and graphics operations, we’ll be doing two submits per frame: one to the compute queue and one to the graphics queue (see the drawFrame function):
...
computeQueue->submit(submitInfo, **computeInFlightFences[frameIndex]);
...
graphicsQueue->submit(submitInfo, **inFlightFences[frameIndex]);
The first submit to the compute queue updates the particle positions using the compute shader, and the second submit will then use that updated data to draw the particle system.
Synchronizing graphics and compute
Synchronization is an important part of Vulkan, even more so when doing compute in conjunction with graphics. Wrong or lacking synchronization may result in the vertex stage starting to draw (=read) particles while the compute shader hasn’t finished updating (=write) them (read-after-write hazard), or the compute shader could start updating particles that are still in use by the vertex part of the pipeline (write-after-read hazard).
So we must make sure that those cases don’t happen by properly synchronizing the graphics and the compute load. There are different ways of doing so, depending on how you submit your compute workload, but in our case with two separate submits, we’ll be using semaphores and fences to ensure that the vertex shader won’t start fetching vertices until the compute shader has finished updating them.
This is necessary as even though the two submits are ordered one-after-another, there is no guarantee that they execute on the GPU in this order. Adding in wait and signal semaphores ensures this execution order.
So we first add a new set of synchronization primitives for the compute work in createSyncObjects.
The compute fences, just like the graphics fences, are created in the signaled state because otherwise, the first frame would wait forever on fences that are never signaled:
std::vector<std::unique_ptr<vk::raii::Fence>> computeInFlightFences;
std::vector<std::unique_ptr<vk::raii::Semaphore>> computeFinishedSemaphores;
...
computeInFlightFences.resize(MAX_FRAMES_IN_FLIGHT);
computeFinishedSemaphores.resize(MAX_FRAMES_IN_FLIGHT);
for (size_t i = 0; i < MAX_FRAMES_IN_FLIGHT; i++) {
...
computeFinishedSemaphores[i] = std::make_unique<vk::raii::Semaphore>(*device, vk::SemaphoreCreateInfo());
computeInFlightFences[i] = std::make_unique<vk::raii::Fence>(*device, vk::FenceCreateInfo(vk::FenceCreateFlagBits::eSignaled));
}
We then use these to synchronize the compute buffer submission with the graphics submission:
{
// Compute submission
while ( vk::Result::eTimeout == device->waitForFences(**computeInFlightFences[frameIndex], vk::True, UINT64_MAX) )
;
updateUniformBuffer(frameIndex);
device->resetFences( **computeInFlightFences[frameIndex] );
computeCommandBuffers[frameIndex]->reset();
recordComputeCommandBuffer();
const vk::SubmitInfo submitInfo({}, {}, {**computeCommandBuffers[frameIndex]}, { **computeFinishedSemaphores[frameIndex]});
computeQueue->submit(submitInfo, **computeInFlightFences[frameIndex]);
}
{
// Graphics submission
while ( vk::Result::eTimeout == device->waitForFences(**inFlightFences[frameIndex], vk::True, UINT64_MAX))
    ;
...
device->resetFences( **inFlightFences[frameIndex] );
commandBuffers[frameIndex]->reset();
recordCommandBuffer(imageIndex);
vk::Semaphore waitSemaphores[] = {**presentCompleteSemaphore[frameIndex], **computeFinishedSemaphores[frameIndex]};
vk::PipelineStageFlags waitDestinationStageMask[] = { vk::PipelineStageFlagBits::eVertexInput, vk::PipelineStageFlagBits::eColorAttachmentOutput };
const vk::SubmitInfo submitInfo( waitSemaphores, waitDestinationStageMask, {**commandBuffers[frameIndex]}, {**renderFinishedSemaphore[frameIndex]} );
graphicsQueue->submit(submitInfo, **inFlightFences[frameIndex]);
}
Similar to the sample in the
semaphore chapter, this setup will immediately run the
compute shader as we haven’t specified any wait semaphores.
Note that we’re using scoping braces above to ensure that the RAII temporary
variables we use get a chance to clean themselves up between the compute and
the graphics stage.
This is fine, as we are waiting for the compute command buffer of the current frame to finish execution before the compute submission with the device->waitForFences call.
The graphics submission, on the other hand, needs to wait for the compute work to finish so it doesn’t start fetching vertices while the compute buffer is still updating them.
So we wait on the computeFinishedSemaphores for the current frame and have the graphics submission wait on the vk::PipelineStageFlagBits::eVertexInput stage, where vertices are consumed.
But it also needs to wait for image acquisition, so the fragment shader won’t write to the color attachment until the swapchain image is actually available.
So we also wait on the presentCompleteSemaphore for the current frame at the vk::PipelineStageFlagBits::eColorAttachmentOutput stage.
Timeline semaphores: An improved synchronization mechanism
The synchronization approach described above uses binary semaphores, which have a simple signaled/unsignaled state. While this works well for many scenarios, Vulkan also offers a more powerful synchronization primitive: timeline semaphores.
Timeline semaphores were introduced as an extension and later promoted to core in Vulkan 1.2. Unlike binary semaphores, timeline semaphores have a 64-bit unsigned integer counter value that can be waited on and signaled to specific values. This provides several advantages over binary semaphores:
- Reusability: A single timeline semaphore can be used for multiple synchronization points, reducing the number of semaphores needed.
- Host synchronization: Timeline semaphores can be signaled and waited on from the host (CPU) without submitting commands to a queue.
- Out-of-order signaling: You can signal a timeline semaphore to a value higher than what’s currently being waited on, allowing for more flexible synchronization patterns.
- Multiple pending signals: Unlike binary semaphores, which can only have one pending signal at a time, timeline semaphores can have multiple pending signals.
Let’s see how we can modify our particle system example to use timeline semaphores instead of binary semaphores:
First, we need to enable the timeline semaphore feature when creating the logical device:
vk::PhysicalDeviceTimelineSemaphoreFeaturesKHR timelineSemaphoreFeatures;
timelineSemaphoreFeatures.timelineSemaphore = vk::True;
// Chain this to your device creation info
Instead of creating multiple binary semaphores, we create a single timeline semaphore:
vk::SemaphoreTypeCreateInfo semaphoreType{ .semaphoreType = vk::SemaphoreType::eTimeline, .initialValue = 0 };
semaphore = vk::raii::Semaphore(device, {.pNext = &semaphoreType});
timelineValue = 0;
In our draw frame function, we use incrementing timeline values to coordinate work between compute and graphics:
// Update timeline value for this frame
uint64_t computeWaitValue = timelineValue;
uint64_t computeSignalValue = ++timelineValue;
uint64_t graphicsWaitValue = computeSignalValue;
uint64_t graphicsSignalValue = ++timelineValue;
For the compute submission, we use a timeline semaphore submit info structure:
vk::TimelineSemaphoreSubmitInfo computeTimelineInfo{
.waitSemaphoreValueCount = 1,
.pWaitSemaphoreValues = &computeWaitValue,
.signalSemaphoreValueCount = 1,
.pSignalSemaphoreValues = &computeSignalValue
};
vk::PipelineStageFlags waitStages[] = {vk::PipelineStageFlagBits::eComputeShader};
vk::SubmitInfo computeSubmitInfo{
.pNext = &computeTimelineInfo,
.waitSemaphoreCount = 1,
.pWaitSemaphores = &*semaphore,
.pWaitDstStageMask = waitStages,
.commandBufferCount = 1,
.pCommandBuffers = &*computeCommandBuffers[frameIndex],
.signalSemaphoreCount = 1,
.pSignalSemaphores = &*semaphore
};
computeQueue.submit(computeSubmitInfo, nullptr);
Similarly, for the graphics submission:
vk::PipelineStageFlags waitStage = vk::PipelineStageFlagBits::eVertexInput;
vk::TimelineSemaphoreSubmitInfo graphicsTimelineInfo{
.waitSemaphoreValueCount = 1,
.pWaitSemaphoreValues = &graphicsWaitValue,
.signalSemaphoreValueCount = 1,
.pSignalSemaphoreValues = &graphicsSignalValue
};
vk::SubmitInfo graphicsSubmitInfo{
.pNext = &graphicsTimelineInfo,
.waitSemaphoreCount = 1,
.pWaitSemaphores = &*semaphore,
.pWaitDstStageMask = &waitStage,
.commandBufferCount = 1,
.pCommandBuffers = &*commandBuffers[frameIndex],
.signalSemaphoreCount = 1,
.pSignalSemaphores = &*semaphore
};
graphicsQueue.submit(graphicsSubmitInfo, nullptr);
Before presenting, we wait for the graphics work to complete:
vk::SemaphoreWaitInfo waitInfo{
.semaphoreCount = 1,
.pSemaphores = &*semaphore,
.pValues = &graphicsSignalValue
};
// Wait for graphics to complete before presenting
auto result = device.waitSemaphores(waitInfo, UINT64_MAX);
if (result != vk::Result::eSuccess)
{
throw std::runtime_error("failed to wait for semaphore!");
}
vk::PresentInfoKHR presentInfo{
.waitSemaphoreCount = 0, // No binary semaphores needed
.pWaitSemaphores = nullptr,
.swapchainCount = 1,
.pSwapchains = &*swapChain,
.pImageIndices = &imageIndex
};
This timeline semaphore approach offers several benefits over the binary semaphore implementation:
- Simplified resource management: We only need a single semaphore instead of multiple semaphores per frame in flight.
- More explicit synchronization: The timeline values make it clear which operations depend on each other.
- Reduced overhead: With fewer synchronization objects, there’s less overhead in managing them.
- More flexible synchronization patterns: Timeline semaphores enable more complex synchronization scenarios that would be difficult with binary semaphores.
Timeline semaphores are particularly useful in scenarios with multiple dependent operations, like our compute-then-graphics workflow, or when you need to synchronize between the host and device. They provide a more powerful and flexible synchronization mechanism that can simplify your code while enabling more complex synchronization patterns.