Skip to content

Nits#44

Open
b0nes164 wants to merge 3 commits into
gridwise-webgpu:mainfrom
b0nes164:main
Open

Nits#44
b0nes164 wants to merge 3 commits into
gridwise-webgpu:mainfrom
b0nes164:main

Conversation

@b0nes164

Copy link
Copy Markdown
Member

Just some general nits from going through the documentation.

"they require one hardware instruction to do what would take many hardware instructions in emulation"
I don't think any GPU has a dedicated instruction for scan/reduce; instead they are composed by shuffling. The subgroupScan exposed by the shading languages wrap the shuffling for you. (And maybe provide some non-divergence guarantees? That's purely speculation).

I think it is generally true that the shfl/intrinsic scans are more instruction efficient, but I would qualify it. To me I think it's very easy to interpret that as "intrinsics are more depth efficient," which isn't true. (e.g. Regardless of whether you use a pure shared memory, pure shfl/intrinsic, or hybridized scan, the depth will always be O(logn), assuming the component scans are also minimal depth.) Instead the savings in instructions come from avoiding the setup required for shared memory.

I double checked these for scan/reduce instructions, and didn't see anything:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
https://dougallj.github.io/applegpu/docs.html


"On Hardware: Native hardware subgroups use execution masks to dynamically disable inactive lanes. Threads that have already completed lookback (lookbackComplete == true) simply bypass the branch, and the hardware evaluates subgroupAny correctly using only the active participating lanes."

It is true that the hardware will mask away inactive lanes and the fix that was made to the onesweep kernel was correct, but the reasoning here is wrong.

Even though the hardware is able to mask away lanes (which prevents the deadlocking observed in swiftshader), the result would not necessarily be correct, because the lowest rank lane is still responsible for broadcast an "incomplete." But the code was passing validation on the devices you ran it on. Instead, what mast have been happening is that on hardware, all the writes to spine must have been written in subgroup lockstep, so there was never a partial subgroup result, so the !lookbackComplete branch was never diverged.

As I mentioned in the email, I'm inclined to just remove this block?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant