
Conversation

@ggerganov (Member) commented Feb 2, 2026

fix #19267

Memory modules that do not support removing trailing tokens from the context (such as recurrent modules) cannot be used for speculative decoding. Add a new common_speculative_is_compat() to query this capability, and use it in llama-server to disable speculative decoding for such contexts.
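A minimal sketch of the resulting gating in llama-server, assuming common_speculative_is_compat() takes the target llama_context and returns a bool (the exact signature is not shown in this thread), with use_speculative as an illustrative local flag:

```cpp
// Sketch only: the helper name comes from this PR; its signature and
// the surrounding flag are assumptions for illustration.
bool use_speculative = true; // e.g. the user configured ngram drafting

if (use_speculative && !common_speculative_is_compat(ctx)) {
    // recurrent memory modules cannot remove trailing tokens, so
    // rejected draft tokens could not be rolled back
    LOG_WRN("%s: speculative decoding is not compatible with this context, disabling\n", __func__);
    use_speculative = false;
}
```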

@ngxson (Collaborator) left a comment


Just wondering if we can do this inside common_speculative_init instead.

For example, common_speculative_init can try to evaluate 2 tokens, then remove the first one. If llama_memory_seq_rm returns an error, we can throw an error saying the model is not compatible.

Btw, I think it's better to throw an error and exit rather than just print a warning.
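A minimal sketch of that probe, assuming the current llama_memory_seq_rm() semantics (it returns false when a partial removal is unsupported, as with recurrent modules, while removing an entire sequence always succeeds). The function name and token values are illustrative, and this version probes suffix removal, which is the operation speculative decoding actually relies on:

```cpp
#include "llama.h"

// Illustrative probe: decode two tokens, then attempt a partial removal.
// Recurrent memory modules cannot roll back individual positions, so the
// partial llama_memory_seq_rm() call fails for them.
static bool speculative_probe_compat(llama_context * ctx, llama_token tok) {
    llama_token tokens[2] = { tok, tok }; // e.g. the BOS token

    if (llama_decode(ctx, llama_batch_get_one(tokens, 2)) != 0) {
        return false;
    }

    llama_memory_t mem = llama_get_memory(ctx);

    // try to remove only the second token (positions [1, end))
    const bool compat = llama_memory_seq_rm(mem, 0, 1, -1);

    // removing a whole sequence is supported by every memory module,
    // so the probe tokens can always be cleaned up afterwards
    llama_memory_seq_rm(mem, 0, -1, -1);

    return compat;
}
```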

@coder543 commented Feb 3, 2026

I just ran into #19267, and it would be cool if there were a way to make this compatible rather than just disabling it, but disabling it is better than crashing. With Qwen3-Coder-Next, ngram-mod could provide large speedups during coding workflows.

@ggerganov (Member, Author) commented

@ngxson Implemented this idea in a new common_speculative_is_compat() helper function.

> Btw, I think it's better to throw an error and exit rather than just print a warning.

Do you have something specific in mind? In my server config, I want to set ngram-based spec decoding as a default and have it applied to all routed models. When a routed model does not support it, the server should simply continue to work without it, so I think a warning is better.

@ggerganov changed the title from "llama : add llama_memory_can_rm_suffix()" to "common : add common_speculative_is_compat()" on Feb 4, 2026


Successfully merging this pull request may close: Eval bug: spec-type ngram-mod crash with Qwen3Next (#19267)
