kv-cache : support V-less cache #19067
Conversation
The GLM-4.7-Flash config.json does not include deepseek2.attention.key_length_mla. Was hoping to squeeze my GLM model a bit more. 😿 Nevermind, the commit computes it.
@John-Dekka These are
JohannesGaessler left a comment
The CUDA changes are correct. The changes in the llama.cpp user code also seem correct to me, though I am not as familiar with that part of the codebase.
Will the
very cool

master:
PR:
Force-pushed from c843f3a to 6d7ce2e
if (wo) {
    cur = build_lora_mm(wo, cur);
    if (arch == LLM_ARCH_GLM4 || arch == LLM_ARCH_GLM4_MOE) {
        // GLM4 and GLM4_MOE seem to have numerical issues with half-precision accumulators
        ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
    }
}
We might need to add LLM_ARCH_DEEPSEEK2 here in case we see similar numerical issues with GLM 4.7 Flash - something to keep in mind. cc @jeffbolznv @JohannesGaessler
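For reference, a minimal sketch of what such a change could look like (hypothetical, not part of this PR; it simply extends the existing F32-accumulator override shown above):

    if (wo) {
        cur = build_lora_mm(wo, cur);
        // hypothetical: also force F32 accumulation for DeepSeek2-style MLA models
        if (arch == LLM_ARCH_GLM4 || arch == LLM_ARCH_GLM4_MOE || arch == LLM_ARCH_DEEPSEEK2) {
            ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
        }
    }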
IIRC @ngxson tried this during our PR and it made no difference in his testing
At that time, the wrong gating function was used, so we can't conclude anything based on that test. Plus, this is somewhat backend-specific - e.g. it's not a problem for Metal since we always accumulate in F32.
When V is a view of K but with different head dimensions (e.g., GLM-4.7-Flash with K=576, V=512), we cannot simply reuse K's data pointer for V. For MLA models, the K tensor layout is [kv_lora_scaled (DV), pe (DQK-DV)], so the V data is the first DV elements of each K row.

This fix extracts the correct V data from K when DQK != DV in:
- ggml_sycl_op_flash_attn_1 (basic FA path)
- ggml_sycl_op_flash_attn_coopmat (XMX path)
- ggml_sycl_op_flash_attn_mkl (oneMKL path)

Fixes GPU memory faults and incorrect results in backend tests for hsk=576, hsv=512 configurations. Aligns with upstream PRs ggml-org#18953, ggml-org#18986, ggml-org#19067, which implement the V-less KV cache for MLA models such as DeepSeek and GLM-4.7-Flash.

Amp-Thread-ID: https://ampcode.com/threads/T-019bf97a-9105-718e-84fb-320913c5f0c6
Co-authored-by: Amp <amp@ampcode.com>
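To illustrate the layout described in that commit message, here is a minimal sketch of taking V as a strided view of the first DV elements of each K row using the ggml API (hypothetical helper; names and shapes are illustrative, not the PR's actual SYCL code):

    // K cache rows are laid out as [kv_lora (DV) | rope/pe (DK - DV)],
    // so V can be a strided view of the first DV elements of each K row.
    // Hypothetical sketch; the real sizes and strides come from the KV cache.
    static ggml_tensor * view_v_from_k(ggml_context * ctx, ggml_tensor * k, int64_t dv) {
        // k: [DK, n_kv, n_head_kv, n_seq]
        return ggml_view_4d(ctx, k,
            dv, k->ne[1], k->ne[2], k->ne[3], // new shape: [DV, n_kv, n_head_kv, n_seq]
            k->nb[1], k->nb[2], k->nb[3],     // keep K's strides so rows stay DK elements apart
            0);                               // offset 0: V data sits at the start of each K row
    }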
cont #18986
Support V-less KV cache. This is useful for MLA models such as DeepSeek and GLM 4.7 Flash, where the combined latent data is stored in the K cache. Results in almost 2x less memory for the KV cache.
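For example, with the MLA head dimensions mentioned in this thread (K = 576, V = 512), storing only K means keeping 576 values per token instead of 576 + 512 = 1088, i.e. roughly a 1.9x reduction in KV cache size (assuming the same element type for K and V).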
Also:
- llama_hparams::is_mla()
- llama_hparams::n_embd_head_k_mla()
- llama_hparams::n_embd_head_v_mla()
- llama_hparams::get_n_embd_out() -> llama_hparams::n_embd_out()
- class llm_graph_input_attn_k - similar to class llm_graph_input_attn_kv, but only K data
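A rough sketch of how the new hparams helpers could be shaped (hypothetical, for illustration only; the actual llama.cpp implementation may differ):

    #include <cstdint>

    // Hypothetical sketch, not the actual llama.cpp code.
    struct llama_hparams_sketch {
        uint32_t n_embd_head_k_mla = 0; // MLA K head size (0 when the model is not MLA)
        uint32_t n_embd_head_v_mla = 0; // MLA V head size

        // MLA models store the combined latent data in the K cache only
        bool is_mla() const {
            return n_embd_head_k_mla != 0 && n_embd_head_v_mla != 0;
        }
    };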