I would suggest force-pushing main to sync with upstream before merging this PR (I will do it later) so that we can actually see the changes clearly. One suggestion though: should we enable TritonAttention by default for this fork? If one doesn't use it, the results will be incorrect.
This PR implements a simpler and often faster alternative strategy for enabling spans in vLLM.
Here's the explanation:
Spans in vLLM are pre-computed KV vectors that are position-independent, so they can appear at any position in a future request's prompt.
But K vectors are stored with positional encodings applied to them, so if a vector `k` is stored with positional encoding `n` and a new request loads `k` at position `m` where `n != m`, the positional encodings have to be re-adjusted.

Currently on main, custom code keeps track of which positional encodings are applied to which K vectors, and manages their re-adjustment when needed. This creates overhead, since it launches extra kernels on the GPU before the scheduled compute can execute, and it is potentially unable to perform all positional-encoding adjustments without additional batching, which slows vLLM down further. It also causes a bug when two requests use the same span at two different positions in their prompts, because the KV cache can only hold one positional encoding at a time without duplicating blocks. After exploring an approach where blocks are duplicated in this manner, we found that it would require large changes to vLLM and cause bugs whose fixes would require many more changes still.
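To make the re-adjustment concrete, here is a minimal sketch (the helper name `rope` and all shapes are illustrative, not the PR's actual code) of why a cached K vector is tied to the position it was rotated for: RoPE rotations compose additively, so a vector cached at position `n` has to be rotated by a further `m - n` before it is usable at position `m`.

```python
import torch

def rope(x: torch.Tensor, pos: int, base: float = 10000.0) -> torch.Tensor:
    """Rotate a vector of even dimension by RoPE angles for position `pos`."""
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos * inv_freq                       # one angle per (x1, x2) pair
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

k = torch.randn(64)          # un-rotated K vector, head_dim = 64
n, m = 10, 25                # position the vector was cached at vs. its position in the new prompt

k_cached = rope(k, n)        # what main stores in the KV cache today
k_needed = rope(k, m)        # what the new request actually needs
k_readjusted = rope(k_cached, m - n)   # the re-adjustment main has to perform

assert torch.allclose(k_readjusted, k_needed, atol=1e-5)
```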
In this PR we remove the need for positional-encoding adjustment and reduce the number of changes to vLLM (compared to what's currently on main). In this approach, K vectors stored in the KV cache do not contain any positional encodings. Instead, the encodings are applied as part of the Triton attention kernel. This increases the amount of compute for attention at `O(n^2)` scale, but we find that it's still faster than the overhead caused by the previous approach.
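For illustration, here is a plain-PyTorch reference sketch of that layout (function names, shapes, and positions are assumptions for the example; the PR itself does this inside the Triton attention kernel): K is cached without any positional encoding, and each cached K vector is rotated by the position it occupies in the current prompt at score time. Inside a fused kernel this rotation is presumably recomputed as key blocks are streamed through for each query block, which would account for the extra `O(n^2)`-scale compute mentioned above; the reference below rotates each key only once for readability.

```python
import torch

def rope_rows(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each row x[i] (even dim) by RoPE angles for position pos[i]."""
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos[:, None].float() * inv_freq[None, :]        # (T, d/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def attention_unrotated_kv(q, k_cache, v_cache, q_pos, k_pos):
    """q: (Tq, d) un-rotated queries; k_cache, v_cache: (Tk, d) with K stored
    *without* positional encodings; q_pos, k_pos: positions in the current prompt."""
    d = q.shape[-1]
    q_rot = rope_rows(q, q_pos)        # queries are rotated once, as usual
    k_rot = rope_rows(k_cache, k_pos)  # rotation moved from cache-write time to attention time
    scores = (q_rot @ k_rot.T) / d ** 0.5
    # causal mask: a query may only attend to keys at earlier or equal positions
    scores = scores.masked_fill(k_pos[None, :] > q_pos[:, None], float("-inf"))
    return torch.softmax(scores, dim=-1) @ v_cache

# Toy usage: a small cached span reused starting at position 3 of a new prompt.
Tk, Tq, d = 8, 4, 64
k_cache, v_cache = torch.randn(Tk, d), torch.randn(Tk, d)   # no RoPE applied to K
q = torch.randn(Tq, d)
k_pos = torch.arange(3, 3 + Tk)             # wherever the span lands in this prompt
q_pos = torch.arange(3 + Tk, 3 + Tk + Tq)   # new tokens follow the span
out = attention_unrotated_kv(q, k_cache, v_cache, q_pos, k_pos)   # (Tq, d)
```

Because the cached K carries no positional information, the same span can sit at different positions in different requests without touching or duplicating any cache blocks.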
Experimental results:

When pre-loading increasing numbers of spans of fixed token length (1024 tokens) into the KV cache, and when loading those spans into a single prompt, we find our approach:
For the given range of prompt sizes.
Besides that, we have benchmarks comparing batched prefill speed and batched overall request throughput for samples taken from 2WikiMultihopQA.