Use the subgroup feature of Vulkan 1.1 to accelerate prefix-sum computation in a compute shader. References:
Vulkan Subgroup Tutorial - Khronos Blog - The Khronos Group Inc
Single-pass Parallel Prefix Scan with Decoupled Look-back | Research
Related knowledge
Compute model
shared memory
Shared variables are shared within a single work group. This article uses them to store the per-subgroup prefix-sum results.
subgroup
On GPUs, threads usually execute in groups (typically 32 or 64 threads). This article uses subgroupInclusiveAdd to compute the prefix sum within a single subgroup; see /blog/vulkan-subgroup-tutorial for details.
Suppose there are 8 blocks whose active states are as follows:
id : 0 1 2 3 4 5 6 7
val: 0 1 0 1 1 0 0 1
//subgroupInclusiveAdd
val: 0 1 1 2 3 3 3 4
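The scan in the table above can be reproduced with a minimal compute shader. This is a sketch: the SSBO name and binding are assumptions, and the GL_KHR_shader_subgroup_arithmetic extension must be enabled before subgroupInclusiveAdd is available.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_arithmetic : enable

layout(local_size_x = 8, local_size_y = 1, local_size_z = 1) in;
// hypothetical SSBO holding the 8 values from the table above
layout(std430, binding = 0) buffer Vals { uint val[]; };

void main() {
    // every invocation contributes one element; the inclusive scan
    // of the whole subgroup finishes in a single instruction
    val[gl_GlobalInvocationID.x] =
        subgroupInclusiveAdd(val[gl_GlobalInvocationID.x]);
}
```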
Process Summary
Goal: Calculate the prefix sum of data with size = n
- Split the data into
work_group_nums = (n + 1023) / 1024
work groups with local_size = (1024, 1, 1). Each work group has 1024 invocations, which are divided into 32 subgroups (gl_SubgroupSize = 32 on NVIDIA).
- subgroupInclusiveAdd computes the prefix sum within each of the 32 subgroups, and each subgroup's last invocation (gl_SubgroupInvocationID = 31) stores its result in
shared uint sg_offset[32];
(shared variables are visible to the whole current work_group).
- subgroupInclusiveAdd is then applied to sg_offset, updating it in place; afterwards
sg_offset[gl_SubgroupSize - 1]
is the total of the current work_group, and the result is stored in ss_wg_offset_[gl_WorkGroupID.x].
- A final pass computes the prefix sum over ss_wg_offset_. Since its elements are per-work-group values rather than invocations within a single work group, subgroupInclusiveAdd no longer applies, so the accumulation is done by a manual traversal that writes back with
atomicExchange(ss_wg_offset_[gl_WorkGroupID.x], final_res);
Implementation details
layout(local_size_x = 1024, local_size_y = 1, local_size_z = 1) in;
// shared memory holding each subgroup's partial result across the work group
shared uint sg_offset[32];
// subgroup id within the work group
uint sg_id = gl_LocalInvocationIndex / gl_SubgroupSize;
// is there any active voxel in the previous block?
// (invocationActives is assumed to handle the out-of-range id
//  produced when gl_GlobalInvocationID.x == 0)
uint prev_inv_actives = invocationActives(gl_GlobalInvocationID.x - 1) > 0 ? 1 : 0;
// inclusive prefix sum within the subgroup
uint wg_offset = subgroupInclusiveAdd(prev_inv_actives);
// the last invocation of each of the 32 subgroups stores its total
if (gl_SubgroupInvocationID == gl_SubgroupSize - 1) {
    sg_offset[sg_id] = wg_offset;
}
barrier();
if (sg_id == 0) {
    // scan sg_offset once and write the result back in place
    sg_offset[gl_SubgroupInvocationID] =
        subgroupInclusiveAdd(sg_offset[gl_SubgroupInvocationID]);
    // the result is stored in ss_wg_offset_ (the encode step is omitted)
    atomicExchange(ss_wg_offset_[gl_WorkGroupID.x], your_value_encode);
}
}
barrier();
// Simple final pass, omitted
barrier();
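The omitted final pass could be sketched as a single-invocation dispatch that serially accumulates the per-work-group totals. Assumptions here: ss_wg_offset_ lives in an SSBO at an arbitrary binding, and work_group_nums arrives as a push constant; the decoupled look-back technique from the referenced paper would avoid this serial step entirely.

```glsl
#version 450
layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
layout(std430, binding = 1) buffer WgOffsets { uint ss_wg_offset_[]; };
// assumed push constant carrying work_group_nums
layout(push_constant) uniform Push { uint work_group_nums; };

void main() {
    uint acc = 0;
    for (uint i = 0; i < work_group_nums; ++i) {
        // atomicCompSwap(x, 0, 0) emulates an atomic read; atomics are
        // kept to match the atomicExchange writes of the previous pass
        acc += atomicCompSwap(ss_wg_offset_[i], 0u, 0u);
        atomicExchange(ss_wg_offset_[i], acc);
    }
}
```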
Tip: GLSL does not provide an atomicRead; it can be emulated with atomicCompSwap(target, 0, 0), which returns the current value without modifying memory.
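For example (a sketch; target stands for any atomically written uint in an SSBO or shared memory):

```glsl
// compare == data == 0: if the value is 0 the swap writes 0 (a no-op),
// otherwise the comparison fails and nothing is written; either way
// memory is unchanged and the function returns the value it read
uint cur = atomicCompSwap(target, 0u, 0u);
```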