Spans block duplication by nataxcan · Pull Request #92 · IBM/vllm

nataxcan · 2025-09-23T08:54:37Z

This PR implements block duplication.
Previously, span cache hits were handled the way prefix caching handles them: the new request was served from the same memory reference.
That shared reference holds the same semantic content (the KV vectors of the same input tokens), but it cannot hold two different positional encodings at once.
So instead of reusing the memory reference, we let vLLM allocate new blocks and, rather than prefilling them, copy the KV vectors into them, adjusting the positional encodings where needed.
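
The description does not say which positional encoding is involved or how the adjustment is computed. As a minimal sketch, assuming rotary position embeddings (RoPE) applied to keys, a cached key can be moved to a new position by rotating it by the position offset, because RoPE rotation angles are linear in the position and therefore add up. The names `rope_rotate` and `duplicate_block` are hypothetical and are not taken from the vLLM code base.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by the RoPE angles for `position`.

    Assumption: the standard RoPE frequency schedule base ** (-i / dim).
    """
    dim = len(vec)
    out = [0.0] * dim
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


def duplicate_block(cached_keys, old_start, new_start):
    """Copy a block of cached keys into fresh storage (hypothetical helper).

    If the span now sits at a different offset, re-rotate each key by the
    difference instead of recomputing it with a prefill.
    """
    delta = new_start - old_start
    if delta == 0:
        return [list(k) for k in cached_keys]  # plain copy, no adjustment
    return [rope_rotate(k, delta) for k in cached_keys]


# A key cached at position 3 and duplicated to position 10 should match
# the same raw key encoded directly at position 10.
raw_key = [0.5, -1.0, 2.0, 0.25]
cached = [rope_rotate(raw_key, 3)]
moved = duplicate_block(cached, old_start=3, new_start=10)[0]
direct = rope_rotate(raw_key, 10)
```

Under this assumption only the keys need adjusting; value vectors carry no rotary encoding and would be copied unchanged.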

An initial implementation of span semantics in vLLM. Please note that this has a known bug dealing with concurrent sequences that re-use the same span in different locations. We are working on a solution for this, but in the meantime accuracy may be negatively affected.

nataxcan · 2025-09-24T10:47:24Z

currently running evals to check accuracy...

starpit · 2025-09-24T17:25:02Z

i see that this addresses the regression in SPANS_DEBUG. but the hashes printed out aren't useful. to turn bytes into a string, i think we need something like?

# just for example...
from binascii import hexlify

b = bytes.fromhex("abcd1234")

# turn b back into the string "abcd":
def pretty(b):
    return hexlify(b).decode('utf-8')[:4]

print(pretty(b))  # prints: abcd

whereas this PR uses str(b)[:4] which produces pretty much garbage output.
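
To make the difference concrete, here is a small sketch comparing the two approaches on the same bytes. It uses the built-in `bytes.hex()`, which gives the same result as `hexlify(...).decode()`; the helper name `short_hash` is hypothetical and not part of the PR.

```python
def short_hash(digest: bytes, length: int = 4) -> str:
    """Return the first `length` hex characters of a bytes digest."""
    return digest.hex()[:length]


b = bytes.fromhex("abcd1234")

readable = short_hash(b)  # first four hex digits of the digest
garbled = str(b)[:4]      # first four characters of the bytes repr

print(readable)
print(garbled)
```

`str(b)` returns the Python repr of the bytes object, so slicing it yields the `b'` prefix and the start of an escape sequence rather than any part of the hash.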

nataxcan force-pushed the spans-block-duplication branch from 5b31ef2 to 919e73c Compare September 24, 2025 10:10

tdoublep and others added 7 commits September 24, 2025 06:17

Initial supports for spans/block-attention.

ef59a8f

Co-authored-by: Nathan Ordonez <Nathan.Ordonez@ibm.com> Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Signed-off-by: Nathan Ordonez <nathanaxcan@gmail.com>

initial impl (runs, but accuracy dropped)

6491b42

Signed-off-by: Nathan Ordonez <nathanaxcan@gmail.com>

bug fix (block duplication seems to work)

7a4e46b

Signed-off-by: Nathan Ordonez <nathanaxcan@gmail.com>

bugfix repositioning

92812ef

Signed-off-by: Nathan Ordonez <nathanaxcan@gmail.com>

bugfix, benefits now show up (and including benchmark that shows said benefits)

4f5c00f

Signed-off-by: Nathan Ordonez <nathanaxcan@gmail.com>

development folder

3998c6f

Signed-off-by: Nathan Ordonez <nathanaxcan@gmail.com>

nataxcan force-pushed the spans-block-duplication branch from 919e73c to 3998c6f Compare September 24, 2025 10:27

nataxcan added 2 commits September 24, 2025 06:33

Merge branch 'main' into spans-block-duplication

3d84e53

Signed-off-by: Nathan Ordonez <nathanaxcan@gmail.com>

Merge branch 'main' into spans-block-duplication

ffcd788

Signed-off-by: Nathan Ordonez <nathanaxcan@gmail.com>

bugfix

cdae9f9

Signed-off-by: Nathan Ordonez <nathanaxcan@gmail.com>

speed optimizations (from 6x to 1.3x overhead)

116b457

tdoublep force-pushed the main branch 3 times, most recently from 2b047d1 to 6ff59f0 Compare November 21, 2025 11:11

nataxcan wants to merge 11 commits into main from spans-block-duplication

3 participants
