Nits#44
Open
b0nes164 wants to merge 3 commits into
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Just some general nits from going through the documentation.
"they require one hardware instruction to do what would take many hardware instructions in emulation"
I don't think any GPU has a dedicated instruction for scan/reduce; instead they are composed by shuffling. The
subgroupScanexposed by the shading languages wrap the shuffling for you. (And maybe provide some non-divergence guarantees? That's purely speculation).I think it is generally true that the shfl/intrinsic scans are more instruction efficient, but I would qualify it. To me I think it's very easy to interpret that as "intrinsics are more depth efficient," which isn't true. (e.g. Regardless of whether you use a pure shared memory, pure shfl/intrinsic, or hybridized scan, the depth will always be
O(logn), assuming the component scans are also minimal depth.) Instead the savings in instructions come from avoiding the setup required for shared memory.I double checked these for scan/reduce instructions, and didn't see anything:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
https://dougallj.github.io/applegpu/docs.html
"On Hardware: Native hardware subgroups use execution masks to dynamically disable inactive lanes. Threads that have already completed lookback (lookbackComplete == true) simply bypass the branch, and the hardware evaluates subgroupAny correctly using only the active participating lanes."
It is true that the hardware will mask away inactive lanes and the fix that was made to the onesweep kernel was correct, but the reasoning here is wrong.
Even though the hardware is able to mask away lanes (which prevents the deadlocking observed in swiftshader), the result would not necessarily be correct, because the lowest rank lane is still responsible for broadcast an "incomplete." But the code was passing validation on the devices you ran it on. Instead, what mast have been happening is that on hardware, all the writes to spine must have been written in subgroup lockstep, so there was never a partial subgroup result, so the
!lookbackCompletebranch was never diverged.As I mentioned in the email, I'm inclined to just remove this block?