bfroemel commented Feb 3, 2026

related to #19164, PoC of #19164 (comment)

Track a capped score for each ngram in the pool, initialized to SCORE_INS on insert. If an ngram was used in a successful draft, increment its score; if the draft was rejected, decrement it. On low-acceptance streaks, remove all ngrams whose score is below SCORE_THR.
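A minimal sketch of the saturating score update described above; the concrete constant values and the `ngram_entry`/`update_score` names are hypothetical (only the SCORE_MIN/SCORE_MAX/SCORE_INS parameter names come from the PR):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical values -- these are the tunable parameters mentioned in the PR.
constexpr int8_t SCORE_MIN = -4;
constexpr int8_t SCORE_MAX =  4;
constexpr int8_t SCORE_INS =  0; // score assigned when an ngram is inserted

struct ngram_entry {
    int8_t score = SCORE_INS;
    // ... token payload omitted
};

// Saturating update: increment when the ngram contributed to an accepted
// draft, decrement when the draft was rejected; clamp to [SCORE_MIN, SCORE_MAX].
inline void update_score(ngram_entry & e, bool accepted) {
    const int s = accepted ? e.score + 1 : e.score - 1;
    e.score = (int8_t) std::clamp(s, (int) SCORE_MIN, (int) SCORE_MAX);
}
```

Capping the score bounds how long a stale ngram can survive repeated rejections before a streak prunes it.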

I did only superficial testing, but the speedup is more consistent throughout processing of the whole request; there are no more sudden drops in speedup after (early) low-acceptance streaks. Pruning currently walks all 4M cache pool entries, and together with the scoring this has a minor but noticeable effect on performance; there is still optimization potential.

Also added some hash pool stats (scoring state + collisions) that might help to further fine-tune the parameters (SCORE_MIN, SCORE_MAX, SCORE_INS, ..).

Here are logs where the prompt looked like this: [GIVEN_SOURCE_CODE|TASK], and the model was tasked to generate [A|GIVEN_SOURCE_CODE|B|GIVEN_SOURCE_CODE]. A and B are sampled stochastically; GIVEN_SOURCE_CODE is known to be in the hash pool. Before this change, a low-acceptance streak was sometimes encountered (early), the entire hash pool was cleared, and there was no speedup afterwards (see: #19164 (comment)). With this change we only prune low-scored ngrams on streaks, and (still useful) ngrams with a score greater than or equal to SCORE_THR remain in the hash pool.

Log
srv  params_from_: Chat format: GPT-OSS
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.317
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 4415, total state size = 191.067 MiB
srv          load:  - looking for better prompt, base f_keep = 0.317, sim = 1.000
srv        update:  - cache state: 7 prompts, 1576.120 MiB (limits: 8192.000 MiB, 64000 tokens, 158645 est)
srv        update:    - prompt 0x560874484e20:    4353 tokens, checkpoints:  1,   224.899 MiB
srv        update:    - prompt 0x560874652c40:    4356 tokens, checkpoints:  1,   225.004 MiB
srv        update:    - prompt 0x560865f90410:    4302 tokens, checkpoints:  1,   223.105 MiB
srv        update:    - prompt 0x560874bb1f50:    4288 tokens, checkpoints:  1,   222.613 MiB
srv        update:    - prompt 0x560874aa4050:    4377 tokens, checkpoints:  1,   225.743 MiB
srv        update:    - prompt 0x560874ce33e0:    4432 tokens, checkpoints:  1,   227.677 MiB
srv        update:    - prompt 0x560874647ca0:    4415 tokens, checkpoints:  1,   227.079 MiB
srv  get_availabl: prompt cache update took 46.56 ms
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> ?top-k -> ?typical -> ?top-p -> min-p -> ?xtc -> ?temp-ext -> dist 
slot launch_slot_: id  3 | task 4301 | processing task, is_child = 0
slot update_slots: id  3 | task 4301 | new prompt, n_ctx_slot = 64000, n_keep = 0, task.n_tokens = 1401
slot update_slots: id  3 | task 4301 | n_past = 1401, slot.prompt.tokens.size() = 4415, seq_id = 3, pos_min = 3397, n_swa = 128
slot update_slots: id  3 | task 4301 | restored context checkpoint (pos_min = 313, pos_max = 1336, size = 36.012 MiB)
slot update_slots: id  3 | task 4301 | n_tokens = 1336, memory_seq_rm [1336, end)
slot update_slots: id  3 | task 4301 | prompt processing progress, n_tokens = 1337, batch.n_tokens = 1, progress = 0.954318
slot update_slots: id  3 | task 4301 | n_tokens = 1337, memory_seq_rm [1337, end)
slot update_slots: id  3 | task 4301 | prompt processing progress, n_tokens = 1401, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id  3 | task 4301 | prompt done, n_tokens = 1401, batch.n_tokens = 64
slot init_sampler: id  3 | task 4301 | init sampler, took 0.17 ms, tokens: text = 1401, total = 1401
begin: ngram_mod occupancy = 4050/4194304 (0.00)
accept: accepted 4 tokens from 64 drafted tokens
accept: accepted 43 tokens from 64 drafted tokens
accept: accepted 3 tokens from 64 drafted tokens
accept: accepted 3 tokens from 64 drafted tokens
accept: accepted 8 tokens from 64 drafted tokens
accept: low acceptance streak (3) - pruning ngram_mod (collisions=140)
accept: before prune scores - below_thr=26, at_min=16, at_max=0, at_ins=2029
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 57 tokens from 57 drafted tokens
accept: accepted 8 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 57 tokens from 64 drafted tokens
slot print_timing: id  3 | task 4301 | 
prompt eval time =      54.12 ms /    65 tokens (    0.83 ms per token,  1201.15 tokens per second)
       eval time =    6246.35 ms /  3056 tokens (    2.04 ms per token,   489.25 tokens per second)
      total time =    6300.47 ms /  3121 tokens
draft acceptance rate = 0.83980 ( 2359 accepted /  2809 generated)
statistics ngram_mod: #calls = 4976, #gen drafts = 322, #acc drafts = 316, #gen tokens = 20601, #acc tokens = 18795, dur(b,g,a) = 1.101, 7.910, 4.906 ms
slot      release: id  3 | task 4301 | stop processing: n_tokens = 4456, truncated = 0
srv  update_slots: all slots are idle

bfroemel changed the title from "Feat spec ngram mod scoreeviction" to "spec: ngram-mod, score-based pruning" on Feb 3, 2026
bfroemel commented Feb 3, 2026

@ggerganov @srogmann Please let me know if you see any merit in further pursuing this PR. Thanks!
