
Fast prefix sum implementation using subgroups in a GLSL compute shader


Use the subgroup feature of Vulkan 1.1 to accelerate the prefix sum calculation in a compute shader. References:
Vulkan Subgroup Tutorial - Khronos Blog - The Khronos Group Inc
Single-pass Parallel Prefix Scan with Decoupled Look-back | Research
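
The subgroup built-ins used below need the corresponding GLSL extensions enabled. A minimal shader header (the version number is illustrative; any Vulkan-targeting GLSL works):

#version 450
// GL_KHR_shader_subgroup_basic provides gl_SubgroupSize / gl_SubgroupInvocationID,
// GL_KHR_shader_subgroup_arithmetic provides subgroupInclusiveAdd.
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable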

Related knowledge

Compute model

flowchart TD
    subgraph Subgroup["Subgroup"]
        Inv0["invocation 0"]
        Inv1["invocation 1"]
        InvDots["..."]
        Inv31["invocation 31"]
    end
    subgraph Workgroup["Workgroup"]
        SG0["Subgroup 0"]
        SG1["Subgroup 1"]
        SGDots["..."]
        SGM["Subgroup m"]
    end
    subgraph Dispatch["Dispatch"]
        WG0["Workgroup 0"]
        WG1["Workgroup 1"]
        WGDots["..."]
        WGN["Workgroup n"]
    end
    %% lay each level out as a horizontal row
    WG0 --- WG1 --- WGDots --- WGN
    SG0 --- SG1 --- SGDots --- SGM
    Inv0 --- Inv1 --- InvDots --- Inv31
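
In GLSL these three levels map onto built-in variables. A quick orientation (the ranges assume local_size_x = 1024 and gl_SubgroupSize = 32, as used later in this article):

// gl_WorkGroupID.x        : which workgroup within the dispatch
// gl_LocalInvocationIndex : 0..1023, invocation index within the workgroup
// gl_SubgroupInvocationID : 0..31, invocation index within the subgroup
uint sg_id = gl_LocalInvocationIndex / gl_SubgroupSize; // subgroup index within the workgroup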

Shared memory

Shared variables are visible to all invocations within a single work group. In this article, shared memory is used to store the prefix-sum results of the multiple subgroups inside a workgroup.
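
For example, the declaration used later in this article; a barrier() is required between the writes and the reads so that one subgroup's results become visible to the others:

shared uint sg_offset[32]; // one slot per subgroup, shared by the whole work_group
// ... each subgroup writes its own slot ...
barrier(); // make the writes visible to every invocation in the work_group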

Subgroup

On a GPU, threads are executed in groups (usually 32 or 64 threads). This article uses subgroupInclusiveAdd to compute the prefix sum within a single subgroup; for details see the Vulkan Subgroup Tutorial (/blog/vulkan-subgroup-tutorial).

Suppose there are 8 blocks whose active states are as follows:

id : 0  1  2  3  4  5  6  7
val: 0  1  0  1  1  0  0  1 
// after subgroupInclusiveAdd
val: 0  1  1  2  3  3  3  4
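
A minimal sketch of how this table is produced (isBlockActive is a hypothetical lookup standing in for the article's real per-block data):

uint val = isBlockActive(gl_SubgroupInvocationID) ? 1u : 0u; // the first "val" row
uint prefix = subgroupInclusiveAdd(val); // the second row; invocation 7 ends up with 4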

Process Summary

Goal: Calculate the prefix sum of data with size = n

  1. Split the data into work_group_nums = (n + 1023) / 1024 workgroups with local_size = <1024, 1, 1>. Each work_group has 1024 invocations, which are split into 32 subgroup prefix sums (gl_SubgroupSize = 32 on NVIDIA); a worked example follows this list.
  2. subgroupInclusiveAdd computes the prefix sum within each of the 32 sub_groups, and the last invocation of each sub_group (gl_SubgroupInvocationID = 31) stores its result in shared uint sg_offset[32]; (shared variables are visible within the current work_group).
  3. subgroupInclusiveAdd then computes the prefix sum of sg_offset and writes it back in place; sg_offset[gl_SubgroupSize - 1] is now the prefix sum (total) of the current work_group, and the result is stored in ss_wg_offset_[gl_WorkGroupID.x].
  4. A final pass computes the prefix sum over ss_wg_offset_. Since the unit is now a work_group rather than an invocation, subgroupInclusiveAdd no longer applies, so the accumulation is traversed manually and written with atomicExchange(ss_wg_offset_[gl_WorkGroupID.x], final_res);.
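
As a worked example of step 1, with an illustrative size of n = 100000:

work_group_nums         = (100000 + 1023) / 1024 = 98   (integer division)
invocations covered     = 98 * 1024 = 100352 >= n
subgroups per workgroup = 1024 / 32 = 32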

Implementation details

layout(local_size_x = 1024, local_size_y = 1, local_size_z = 1) in;
 // shared memory: temporary storage for results across subgroups
 shared uint sg_offset[32];

 // sub_group index within the work_group
 uint sg_id = gl_LocalInvocationIndex / gl_SubgroupSize;

 // Does the previous block contain any active voxel?
 uint prev_inv_actives = invocationActives(gl_GlobalInvocationID.x - 1) > 0 ? 1 : 0;
 // Prefix sum within the sub_group
 uint wg_offset = subgroupInclusiveAdd(prev_inv_actives);
 // sg_offset stores each of the 32 sub_groups' final prefix sums
 if (gl_SubgroupInvocationID == gl_SubgroupSize-1) {
	 sg_offset[sg_id] = wg_offset;
 }

 barrier();

 if (sg_id == 0) {
	 // Compute the prefix sum of sg_offset once and write it back in place
	 sg_offset[gl_SubgroupInvocationID] =
	 subgroupInclusiveAdd(sg_offset[gl_SubgroupInvocationID]);
	 // The result is stored in ss_wg_offset_; the encode step is omitted
	 atomicExchange(ss_wg_offset_[gl_WorkGroupID.x], your_value_encode);
 }

 barrier();

 // Simple final pass, omitted
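 // A minimal sketch of that final pass as a serial chain across work_groups.
 // Assumptions not taken from the original: ss_wg_offset_ is declared coherent,
 // the encode above publishes the raw work_group sum without the READY bit, and
 // work_groups are scheduled roughly in increasing order (the decoupled
 // look-back paper referenced above removes this serialization by using
 // aggregates plus status flags instead of a plain chain).
 uint wg_sum = sg_offset[gl_SubgroupSize - 1]; // total of this work_group (step 3)
 const uint READY = 0x80000000u;               // spare high bit marks a published prefix
 if (gl_LocalInvocationIndex == 0u) {
	 uint prev = 0u;
	 if (gl_WorkGroupID.x > 0u) {
		 uint enc;
		 do { // atomic read via atomicCompSwap, see the tip below
			 enc = atomicCompSwap(ss_wg_offset_[gl_WorkGroupID.x - 1], 0u, 0u);
		 } while ((enc & READY) == 0u); // spin until the predecessor publishes
		 prev = enc & ~READY;
	 }
	 uint final_res = prev + wg_sum;
	 atomicExchange(ss_wg_offset_[gl_WorkGroupID.x], final_res | READY);
 }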

 barrier();

Tip: GLSL does not provide an atomicRead; it can be emulated with atomicCompSwap(target, 0, 0).
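
For example, atomically reading an entry of ss_wg_offset_ (i here is any valid index): the compare-and-swap writes 0 back only if the slot already held 0, so the stored value never changes and the current value is returned:

uint v = atomicCompSwap(ss_wg_offset_[i], 0u, 0u); // atomic read: returns the value, never modifies it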