Move the CUDA sync caused by the max operation to a higher level #1532
base: develop
Conversation
What is the performance impact of the synchronization, i.e. how much faster is the code with the max operation moved up?
The problem isn’t an explicit CUDA synchronization. I’m trying to compile the model incrementally, starting from the lowest level (the attention module). At this level, the class that uses flash_attn_varlen_func cannot be compiled: computing the maxima from the GPU-resident cumulative sequence lengths forces a device-to-host transfer, which is an implicit sync and breaks the compiled graph. Therefore, we should avoid computing max_seqlen_q/max_seqlen_k at this level and pass them in from higher up the call stack.
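To make the point concrete, here is a minimal sketch of the pattern being discussed; the function and tensor names are illustrative, not this repository’s actual code, and only flash_attn_varlen_func is taken from the flash-attn package:

```python
import torch
from flash_attn import flash_attn_varlen_func


def attn_forward_with_sync(q, k, v, cu_seqlens_q, cu_seqlens_k):
    # Computing the maxima from GPU-resident cu_seqlens and converting them to
    # Python ints forces a device-to-host copy: an implicit CUDA sync and a
    # graph break under torch.compile.
    max_seqlen_q = int((cu_seqlens_q[1:] - cu_seqlens_q[:-1]).max())
    max_seqlen_k = int((cu_seqlens_k[1:] - cu_seqlens_k[:-1]).max())
    return flash_attn_varlen_func(
        q, k, v, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k
    )


def attn_forward_without_sync(
    q, k, v, cu_seqlens_q, cu_seqlens_k, max_seqlen_q: int, max_seqlen_k: int
):
    # Taking the maxima as plain ints keeps this level free of the sync;
    # whoever builds the batch computes them once, higher up the call stack.
    return flash_attn_varlen_func(
        q, k, v, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k
    )
```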
I can see how the C++ interface needs these as host-side integers:

int max_seqlen_q,
const int max_seqlen_k,
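For illustration (assuming the upstream flash-attn Python wrapper rather than this project’s code): since the kernel ultimately wants host-side ints, a GPU tensor has to be turned into a Python integer somewhere, and that conversion is itself the synchronization; the only real choice is where in the call stack it happens.

```python
# cu_seqlens_q: CUDA int32 tensor of cumulative sequence lengths.
# .max() runs on the GPU, but .item() copies the scalar back to the host
# and blocks until the GPU has caught up -- this is the hidden sync.
max_seqlen_q = int((cu_seqlens_q[1:] - cu_seqlens_q[:-1]).max().item())
```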
But flash_attn is not compilable because of the C++ code dependence (there’s in principle a way to wrap flash_attn so that it is compilable, but I’m unconvinced that it will work). The issue won’t be solved by moving the max computation.
Description
The max operation needed for flash_attn_varlen_func introduces an unnecessary CUDA synchronization. This sync can also prevent the module from being compiled. This PR moves the max operation as far up the call stack as possible to avoid the sync at the attention-module level.
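A sketch of the direction described above, under the assumption that the per-sample sequence lengths are still available on the CPU when the batch is assembled; the helper name and call sites are hypothetical, not this repository’s API:

```python
import torch


def prepare_varlen_metadata(seq_lens_cpu: torch.Tensor, device: torch.device):
    # Hypothetical helper at the top of the call stack. The lengths are still
    # on the CPU here, so taking the max involves no CUDA synchronization.
    max_seqlen = int(seq_lens_cpu.max())
    cu_seqlens = torch.zeros(seq_lens_cpu.numel() + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(seq_lens_cpu, dim=0)
    return cu_seqlens.to(device), max_seqlen


# The int is then threaded down to the attention module, which never calls
# .max()/.item() on a CUDA tensor and therefore stays sync-free and compilable:
# cu_seqlens, max_seqlen = prepare_varlen_metadata(seq_lens_cpu, device)
# out = model(x, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen)
```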
Issue Number
Closes #1531
Is this PR a draft? Mark it as draft.
Checklist before asking for review
- ./scripts/actions.sh lint
- ./scripts/actions.sh unit-test
- ./scripts/actions.sh integration-test
- launch-slurm.py --time 60