
Spans V2 #94

Merged
tdoublep merged 11 commits into main from repos-kernel on Nov 20, 2025

Conversation

nataxcan (Member) commented Nov 20, 2025

Implements a simpler and often faster alternative strategy for enabling spans in vLLM.

Here's the explanation:

Spans in vLLM are pre-computed KV vectors that are position-independent, so they can appear at any position in a future request's prompt.

But K vectors are stored with positional encodings already applied, so if a vector k is stored with the encoding for position n and a new request loads k at position m where n != m, its positional encoding has to be re-adjusted.
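
To make the position dependence concrete, here is a minimal sketch of a rotary position embedding (RoPE) in plain PyTorch. It is illustrative only, not vLLM's actual RoPE code, and the helper name `rope_rotate` is made up for this sketch; the point is just that the same k vector rotated for two different positions yields two different cached tensors.

```python
import torch

def rope_rotate(x: torch.Tensor, pos, base: float = 10000.0) -> torch.Tensor:
    """Apply a RoPE-style rotation to x (..., d) at position(s) `pos`.

    `pos` is an int for a single vector, or a 1-D tensor with one position per row.
    Illustrative only -- not vLLM's actual RoPE implementation.
    """
    d = x.shape[-1]
    # One rotation frequency per pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    pos = torch.as_tensor(pos, dtype=torch.float32)
    angles = pos[..., None] * inv_freq            # (..., d // 2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]           # adjacent dims form 2-D pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin          # rotate each pair by its angle
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

k = torch.randn(64)
k_at_5 = rope_rotate(k, 5)
k_at_9 = rope_rotate(k, 9)
# A K cached rotated for position 5 cannot be reused verbatim when the same
# span lands at position 9 in a later prompt.
assert not torch.allclose(k_at_5, k_at_9)
```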

Currently on main, custom code keeps track of which positional encodings are applied to which k vectors and manages their re-adjustment when needed. This creates overhead, because it launches extra kernels on the GPU before the scheduled compute can run, and it may not be able to perform all positional encoding adjustments without additional batching, which slows vLLM down further. It also causes a bug when two requests use the same span at two different positions in their prompts, because the KV cache can only hold one positional encoding at a time without duplicating blocks. After exploring an approach that duplicates blocks in this way, we found it would require large changes to vLLM and introduce bugs needing many more changes still.
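
Roughly what that re-adjustment amounts to, as a hedged sketch of the underlying math rather than the actual vLLM kernels (it reuses the illustrative `rope_rotate` helper from the sketch above): because RoPE rotations compose additively in the angle, a K vector cached with the encoding for position n can be moved to position m by applying the rotation for m - n. Tracking which encoding each cached block currently carries, and launching the kernels that apply these deltas, is the overhead described above.

```python
# Hedged sketch of the re-adjustment step on main (the underlying math only,
# not the actual vLLM code); reuses the illustrative rope_rotate() from above.
def readjust(k_cached: torch.Tensor, cached_pos: int, new_pos: int) -> torch.Tensor:
    """Re-rotate a cached K vector from `cached_pos` to `new_pos`."""
    # RoPE rotations compose additively in the angle, so rotating by the delta
    # moves the vector from one position's encoding to the other's.
    return rope_rotate(k_cached, new_pos - cached_pos)

k = torch.randn(64)
k_cached = rope_rotate(k, 5)            # stored in the KV cache rotated for position 5
k_moved = readjust(k_cached, 5, 9)      # span reloaded at position 9
assert torch.allclose(k_moved, rope_rotate(k, 9), atol=1e-5)
```

The sketch also makes the two-request bug concrete: after re-adjustment the cached block carries the encoding for the new position, so a concurrent request that needs the same span at a different position cannot be served from the same block without duplication.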

In this PR we remove the need for positional encoding adjustment and reduce the number of changes to vLLM (compared to what is currently on main). In this approach, K vectors stored in the KV cache do not contain any positional encodings; instead, the encodings are applied as part of the Triton attention kernel. This increases the amount of compute for attention at the O(n^2) scale, but we find it is still faster than the overhead caused by the previous approach.
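
Below is a hedged, pure-PyTorch reference of the math behind this approach, not the Triton kernel from this PR (it again reuses the illustrative `rope_rotate` helper from the first sketch): K is written to the cache without any positional encoding, and the rotation is applied to Q and to the cached K inside the attention computation, using positions from the current request. In a reference like this the rotation only adds O(n) work, but inside a tiled attention kernel it is naturally recomputed per tile of the score matrix, which is one way to read the O(n^2)-scale compute increase mentioned above.

```python
# Hedged reference of the math (not the actual Triton kernel in this PR):
# K is cached WITHOUT positional encodings and RoPE is applied inside the
# attention computation itself, using positions from the current request.
def attention_with_rope_in_kernel(q, k_cache, v_cache, q_pos, k_pos):
    d = q.shape[-1]
    q_rot = rope_rotate(q, q_pos)            # rotate queries for their positions
    k_rot = rope_rotate(k_cache, k_pos)      # rotate un-encoded cached keys on the fly
    scores = (q_rot @ k_rot.T) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ v_cache

# The same cached span can be attended to at any offset, with no re-adjustment
# or bookkeeping, because the cache carries no positional encoding at all.
d, n_span = 64, 8
k_span, v_span = torch.randn(n_span, d), torch.randn(n_span, d)
q = torch.randn(1, d)
out_a = attention_with_rope_in_kernel(q, k_span, v_span,
                                      q_pos=torch.tensor([10]), k_pos=torch.arange(2, 10))
out_b = attention_with_rope_in_kernel(q, k_span, v_span,
                                      q_pos=torch.tensor([100]), k_pos=torch.arange(92, 100))
```

The two calls at the end attend to the same cached span at two different offsets, which is exactly the situation that forces re-adjustment or block duplication in the previous approach.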

Experimental results:

When pre-loading an increasing number of fixed-length spans (1024 tokens each) into the KV cache and then loading those spans into a single prompt, we find that our approach:

  1. does not slow down document prefilling, and
  2. loads the spans strictly faster than the previous approach, across the given range of prompt sizes.
[Figure: fusedrope_ttft]

As a percentage, we find up to 50% reductions in TTFT:

[Figure: fusedrope_ttft_relative]

Besides that, we have benchmarks comparing batched prefill speed and batched overall request throughput for samples taken from 2WikiMultihopQA.

End-to-end serving speed for 1024 samples from 2WikiMultihopQA:
- main branch: 22.72 samples/s
- fused RoPE:  24.79 samples/s

TTFT for 16 2WikiMultihopQA samples given to vLLM in a single batch, with their documents already preloaded:
- main branch: 0.0817 seconds
- fused RoPE:  0.0427 seconds

tdoublep and others added 10 commits November 10, 2025 08:18
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Nathan Ordonez <nathanaxcan@gmail.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
nataxcan requested a review from tdoublep as a code owner November 20, 2025 17:08
tdoublep (Member) commented:

I would suggest we force-push main to sync with upstream before merging this PR (I will do it later) so that we can actually see the changes clearly.

One suggestion though: should we enable TritonAttention by default for this fork? If one doesn't use it, the results will be incorrect.

@tdoublep tdoublep force-pushed the main branch 2 times, most recently from 6f7de33 to 5a9e573 Compare November 20, 2025 17:55
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

tdoublep (Member) left a comment

LGTM

tdoublep merged commit 433f014 into main on Nov 20, 2025
1 check passed
tdoublep deleted the repos-kernel branch November 20, 2025 19:07