Nits by b0nes164 · Pull Request #44 · gridwise-webgpu/gridwise

b0nes164 · 2026-06-21T17:26:20Z

Just some general nits from going through the documentation.

"they require one hardware instruction to do what would take many hardware instructions in emulation"
I don't think any GPU has a dedicated instruction for scan/reduce; instead they are composed by shuffling. The subgroupScan exposed by the shading languages wraps the shuffling for you. (And maybe provides some non-divergence guarantees? That's purely speculation.)

I think it is generally true that the shfl/intrinsic scans are more instruction efficient, but I would qualify it. It's very easy to interpret that as "intrinsics are more depth efficient," which isn't true. (E.g., regardless of whether you use a pure shared memory, pure shfl/intrinsic, or hybridized scan, the depth will always be O(log n), assuming the component scans are also minimal depth.) Instead, the savings in instructions come from avoiding the setup required for shared memory.
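To make the depth point concrete, here is a minimal sketch in Python that simulates a shuffle-based inclusive scan over one subgroup. It is only a lane-level model, not the Gridwise implementation or any vendor's: `shuffle_up` and `subgroup_inclusive_scan` are hypothetical names, and the Kogge-Stone pattern is an assumption about how a shuffle scan is typically composed. The step counter shows that the scan still takes log2(n) rounds, so any savings come from skipping shared-memory setup rather than from reduced depth.

```python
def shuffle_up(values, delta):
    """Toy model of a subgroup shuffle-up (hypothetical helper).

    Lane i reads the value held by lane i - delta. Lanes with i < delta
    have no source lane and simply keep their own value.
    """
    return [values[i - delta] if i >= delta else values[i]
            for i in range(len(values))]


def subgroup_inclusive_scan(values):
    """Kogge-Stone inclusive sum scan over one simulated subgroup.

    Returns (scanned_values, shuffle_steps). The subgroup size is assumed
    to be a power of two, as it is on common hardware.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "subgroup size assumed to be a power of two"

    lanes = list(values)
    steps = 0
    delta = 1
    while delta < n:
        shifted = shuffle_up(lanes, delta)
        # Only lanes with a valid source lane accumulate in this round.
        lanes = [lanes[i] + shifted[i] if i >= delta else lanes[i]
                 for i in range(n)]
        delta *= 2
        steps += 1
    return lanes, steps


# A 32-lane subgroup of ones: the scan finishes in log2(32) = 5 shuffle rounds.
scanned, steps = subgroup_inclusive_scan([1] * 32)
```

A shared-memory Kogge-Stone scan would run the same number of rounds; in this model the only difference is that each round exchanges data through a shuffle instead of a store, a barrier, and a load.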

I double checked these for scan/reduce instructions, and didn't see anything:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
https://dougallj.github.io/applegpu/docs.html

"On Hardware: Native hardware subgroups use execution masks to dynamically disable inactive lanes. Threads that have already completed lookback (lookbackComplete == true) simply bypass the branch, and the hardware evaluates subgroupAny correctly using only the active participating lanes."

It is true that the hardware will mask away inactive lanes and the fix that was made to the onesweep kernel was correct, but the reasoning here is wrong.

Even though the hardware is able to mask away lanes (which prevents the deadlocking observed in swiftshader), the result would not necessarily be correct, because the lowest rank lane is still responsible for broadcasting an "incomplete." But the code was passing validation on the devices you ran it on. Instead, what must have been happening is that on hardware, all the writes to the spine were made in subgroup lockstep, so there was never a partial subgroup result, and the !lookbackComplete branch never diverged.

As I mentioned in the email, I'm inclined to just remove this block?

Thomas Smith added 3 commits June 20, 2026 21:27

nits

c5a95e1

more nits

20b56bc

polish

f290cb8

b0nes164 wants to merge 3 commits into gridwise-webgpu:main from b0nes164:main
