I would suggest force-pushing main to sync with upstream before merging this PR (I will do it later) so that we can actually see the changes clearly. One suggestion though: should we enable TritonAttention by default for this fork? If one doesn't use it, the results will be incorrect.
This PR implements a simpler and often faster alternative strategy for enabling spans in vLLM.
Here's the explanation:
Spans in vLLM are pre-computed KV vectors that are position-independent, so they can appear at any position in a future request's prompt.
But K vectors are stored with positional encodings applied to them, so if a vector `k` is stored with positional encoding `n` and a new request loads `k` at position `m` where `n != m`, the positional encodings have to be re-adjusted.

Currently on main, custom code keeps track of which positional encodings are applied to which K vectors, and manages their re-adjustment when needed. This creates overhead, since it launches extra kernels on the GPU before the scheduled compute can execute, and it is potentially unable to perform all positional-encoding adjustments without additional batching, which slows vLLM down further. It also causes a bug when two requests use the same span at two different positions in their prompts, because the KV cache can only hold one positional encoding at a time without duplicating blocks. After exploring an approach where blocks are duplicated in this manner, we found that it would require large changes to vLLM and cause bugs whose fixes would require many more changes still.
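To make the re-adjustment concrete, here is a minimal sketch (the helper name `rope` and all shapes are illustrative, not the PR's actual code) of why a cached K vector is tied to the position it was rotated for: RoPE rotations compose additively, so a vector cached at position `n` has to be rotated by a further `m - n` before it is usable at position `m`.

```python
import torch

def rope(x: torch.Tensor, pos: int, base: float = 10000.0) -> torch.Tensor:
    """Rotate a vector of even dimension by RoPE angles for position `pos`."""
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos * inv_freq                       # one angle per (x1, x2) pair
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

k = torch.randn(64)          # un-rotated K vector, head_dim = 64
n, m = 10, 25                # position the vector was cached at vs. its position in the new prompt

k_cached = rope(k, n)        # what main stores in the KV cache today
k_needed = rope(k, m)        # what the new request actually needs
k_readjusted = rope(k_cached, m - n)   # the re-adjustment main has to perform

assert torch.allclose(k_readjusted, k_needed, atol=1e-5)
```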
In this PR we remove the need for positional-encoding adjustment and reduce the number of changes to vLLM (compared to what's currently on main). In this approach, K vectors stored in the KV cache do not contain any positional encodings. Instead, the encodings are applied as part of the Triton attention kernel. This increases the amount of compute for attention at `O(n^2)` scale, but we find that it's still faster than the overhead caused by the previous approach.
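For illustration, here is a plain-PyTorch reference sketch of that layout (function names, shapes, and positions are assumptions for the example; the PR itself does this inside the Triton attention kernel): K is cached without any positional encoding, and each cached K vector is rotated by the position it occupies in the current prompt at score time. Inside a fused kernel this rotation is presumably recomputed as key blocks are streamed through for each query block, which would account for the extra `O(n^2)`-scale compute mentioned above; the reference below rotates each key only once for readability.

```python
import torch

def rope_rows(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each row x[i] (even dim) by RoPE angles for position pos[i]."""
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos[:, None].float() * inv_freq[None, :]        # (T, d/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def attention_unrotated_kv(q, k_cache, v_cache, q_pos, k_pos):
    """q: (Tq, d) un-rotated queries; k_cache, v_cache: (Tk, d) with K stored
    *without* positional encodings; q_pos, k_pos: positions in the current prompt."""
    d = q.shape[-1]
    q_rot = rope_rows(q, q_pos)        # queries are rotated once, as usual
    k_rot = rope_rows(k_cache, k_pos)  # rotation moved from cache-write time to attention time
    scores = (q_rot @ k_rot.T) / d ** 0.5
    # causal mask: a query may only attend to keys at earlier or equal positions
    scores = scores.masked_fill(k_pos[None, :] > q_pos[:, None], float("-inf"))
    return torch.softmax(scores, dim=-1) @ v_cache

# Toy usage: a small cached span reused starting at position 3 of a new prompt.
Tk, Tq, d = 8, 4, 64
k_cache, v_cache = torch.randn(Tk, d), torch.randn(Tk, d)   # no RoPE applied to K
q = torch.randn(Tq, d)
k_pos = torch.arange(3, 3 + Tk)             # wherever the span lands in this prompt
q_pos = torch.arange(3 + Tk, 3 + Tk + Tq)   # new tokens follow the span
out = attention_unrotated_kv(q, k_cache, v_cache, q_pos, k_pos)   # (Tq, d)
```

Because the cached K carries no positional information, the same span can sit at different positions in different requests without touching or duplicating any cache blocks.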
Experimental results:

When pre-loading increasing numbers of spans of fixed token length (1024 tokens) into the KV cache, and when loading those spans into a single prompt, we find our approach:
For the given range of prompt sizes.
Besides that, we have benchmarks comparing batched prefill speed and batched overall request throughput for samples taken from 2WikiMultihopQA.