
Conversation

@ggerganov
Member

@ggerganov ggerganov commented Jan 24, 2026

cont #18986

Support V-less KV cache. This is useful for MLA models such as DeepSeek and GLM 4.7 Flash, where the combined latent data is stored in the K cache. This almost halves the memory used by the KV cache.

Also:

  • Add llama_hparams::is_mla()
  • Add llama_hparams::n_embd_head_k_mla()
  • Add llama_hparams::n_embd_head_v_mla()
  • Rename llama_hparams::get_n_embd_out() -> llama_hparams::n_embd_out()
  • Add class llm_graph_input_attn_k - similar to class llm_graph_input_attn_kv, but only K data
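
For illustration, a minimal sketch of the gating idea, assuming hypothetical names (llama_hparams_sketch, plan_kv_alloc) that only stand in for the real llama.cpp code:

#include <cstdint>

// Hypothetical sketch - the real llama_hparams and KV-cache allocation code differ.
struct llama_hparams_sketch {
    uint32_t n_embd_head_k_mla = 0;
    uint32_t n_embd_head_v_mla = 0;

    // MLA models store the combined latent data in the K cache only.
    bool is_mla() const {
        return n_embd_head_k_mla != 0 && n_embd_head_v_mla != 0;
    }
};

// Decide per layer whether a V tensor is needed at all.
static void plan_kv_alloc(const llama_hparams_sketch & hp, bool & alloc_k, bool & alloc_v) {
    alloc_k = true;
    alloc_v = !hp.is_mla(); // V-less cache: skip the V tensors entirely for MLA models
}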

@github-actions github-actions bot added model Model specific Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Jan 24, 2026
@John-Dekka

John-Dekka commented Jan 24, 2026

The commit computes is_mla from the model hparams (hparams.n_embd_head_k_mla != 0 && hparams.n_embd_head_v_mla != 0). When that is true, the KV cache skips allocating the V tensors and the graph uses the new K-only input path. The decision is driven by the hparams at model load time.

GLM-4.7-Flash config.json does not include the n_embd_head_k_mla / n_embd_head_v_mla fields.

Was hoping to squeeze my GLM model a bit more. 😿


Never mind - the values are exported as GGUF keys:

deepseek2.attention.key_length_mla
deepseek2.attention.value_length_mla
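
As an aside, these values can be inspected straight from a GGUF file with the public gguf C API. The helper below is hypothetical and assumes the keys are stored as u32; it is only a sketch, not the llama.cpp loader code:

#include <stdint.h>
#include "gguf.h"

// Hypothetical helper: read a u32 metadata key, returning 0 when the key is absent.
static uint32_t get_u32_or_zero(const struct gguf_context * ctx, const char * key) {
    const int64_t id = gguf_find_key(ctx, key);
    return id < 0 ? 0 : gguf_get_val_u32(ctx, id);
}

int main(void) {
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ NULL };
    struct gguf_context * gctx = gguf_init_from_file("model.gguf", params);
    const uint32_t k_mla = get_u32_or_zero(gctx, "deepseek2.attention.key_length_mla");
    const uint32_t v_mla = get_u32_or_zero(gctx, "deepseek2.attention.value_length_mla");
    gguf_free(gctx);
    return (k_mla != 0 && v_mla != 0) ? 0 : 1; // both present => MLA path
}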

@ggerganov
Member Author

GLM-4.7-Flash config.json does not include the n_embd_head_k_mla / n_embd_head_v_mla fields.

@John-Dekka These are llama.cpp-specific parameters, they don't have to be present in the config.json. This patch applies to GLM-4.7-Flash.

Collaborator

@JohannesGaessler JohannesGaessler left a comment

The CUDA changes are correct. The changes in the llama.cpp user code also seem correct to me, though I am not as familiar with that part of the codebase.

@eapache

eapache commented Jan 24, 2026

Will the --fit algorithm pick up the changes in memory requirements here, or does it need to be adjusted as well to expect the smaller KV cache for these models?

@JohannesGaessler
Collaborator

llama_kv_cache has a member std::vector<std::pair<ggml_context_ptr, ggml_backend_buffer_ptr>> ctxs_bufs;. As long as the KV cache is allocated in ctxs_bufs it should work correctly with -fit regardless of the specific tensors.
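
To illustrate the point, the memory accounting just sums whatever backend buffers were actually allocated, so a cache without V tensors reports smaller buffers automatically. A hedged sketch (the helper itself is hypothetical; it assumes the ggml-cpp.h smart-pointer typedefs used by llama.cpp):

#include <vector>
#include <utility>
#include "ggml-backend.h"
#include "ggml-cpp.h" // ggml_context_ptr, ggml_backend_buffer_ptr

// Hypothetical helper: total bytes held by the KV cache's backend buffers.
static size_t kv_cache_bytes(const std::vector<std::pair<ggml_context_ptr, ggml_backend_buffer_ptr>> & ctxs_bufs) {
    size_t total = 0;
    for (const auto & [ctx, buf] : ctxs_bufs) {
        total += ggml_backend_buffer_get_size(buf.get()); // no V tensors => smaller buffer
    }
    return total;
}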

@jacekpoplawski
Contributor

very cool

master:

llama_params_fit_impl: projected to use 57339 MiB of device memory vs. 71537 MiB of free device memory
llama_kv_cache: size = 19525.56 MiB (200192 cells,  47 layers,  1/1 seqs), K (f16): 10337.06 MiB, V (f16): 9188.50 MiB

PR:

llama_params_fit_impl: projected to use 48150 MiB of device memory vs. 71537 MiB of free device memory
llama_kv_cache: size = 10337.06 MiB (200192 cells,  47 layers,  1/1 seqs), K (f16): 10337.06 MiB, V (f16):    0.00 MiB
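
For reference, the drop in projected usage matches the removed V tensors almost exactly: 57339 - 48150 = 9189 MiB vs. the 9188.50 MiB V (f16) size, and the KV cache total shrinks from 19525.56 MiB to the K-only 10337.06 MiB.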

@ggerganov ggerganov force-pushed the gg/kv-cache-support-no-v branch from c843f3a to 6d7ce2e Compare January 25, 2026 08:10
Comment on lines +1961 to +1967
if (wo) {
cur = build_lora_mm(wo, cur);
if (arch == LLM_ARCH_GLM4 || arch == LLM_ARCH_GLM4_MOE) {
// GLM4 and GLM4_MOE seem to have numerical issues with half-precision accumulators
ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
}
}
Member Author

We might need to add LLM_ARCH_DEEPSEEK2 here if we see similar numerical issues with GLM 4.7 Flash - something to keep in mind. cc @jeffbolznv @JohannesGaessler
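
For concreteness, the suggested (not yet applied) change would simply extend the arch check above - a sketch:

if (arch == LLM_ARCH_GLM4 || arch == LLM_ARCH_GLM4_MOE || arch == LLM_ARCH_DEEPSEEK2) {
    // also force F32 accumulation for the DeepSeek2-style (MLA) output projection,
    // pending evidence of similar precision issues
    ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
}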

Contributor

IIRC @ngxson tried this during our PR and it made no difference in his testing

Member Author

At that time, the wrong gating function was used, so we can't draw a conclusion from that. Also, this is somewhat backend-specific - e.g. it's not a problem for Metal, since we always accumulate in F32 there.

@ggerganov ggerganov merged commit d9c6ce4 into master Jan 25, 2026
78 of 81 checks passed
@ggerganov ggerganov deleted the gg/kv-cache-support-no-v branch January 25, 2026 18:02
maxious added a commit to maxious/llama.cpp that referenced this pull request Jan 27, 2026
When V is a view of K but with different head dimensions (e.g., GLM-4.7-Flash
with K=576, V=512), we cannot simply reuse K's data pointer for V.

For MLA models, the K tensor layout is [kv_lora_scaled (DV), pe (DQK-DV)],
so V data is the first DV elements of each K row.

This fix extracts the correct V data from K when DQK != DV in:
- ggml_sycl_op_flash_attn_1 (basic FA path)
- ggml_sycl_op_flash_attn_coopmat (XMX path)
- ggml_sycl_op_flash_attn_mkl (oneMKL path)

Fixes GPU memory faults and incorrect results in backend tests for
hsk=576,hsv=512 configurations.

Aligns with upstream PRs ggml-org#18953, ggml-org#18986, ggml-org#19067 that implement V-less KV cache
for MLA models like DeepSeek and GLM-4.7-Flash.

Amp-Thread-ID: https://ampcode.com/threads/T-019bf97a-9105-718e-84fb-320913c5f0c6
Co-authored-by: Amp <amp@ampcode.com>
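
To make the described layout concrete, here is a hedged sketch (a hypothetical helper, not the SYCL patch itself) of carving a V view out of the K tensor by keeping only the first DV elements of each row:

#include "ggml.h"

// Hypothetical helper: when DQK != DV (e.g. K = 576, V = 512 for GLM-4.7-Flash),
// V is the leading DV elements of every K row, so a strided view over K is
// enough - no copy needed. Assumes k has shape [DQK, n_kv, n_head_kv].
static struct ggml_tensor * v_view_from_k(struct ggml_context * ctx, struct ggml_tensor * k, int64_t DV) {
    return ggml_view_3d(ctx, k,
            DV, k->ne[1], k->ne[2],  // truncate each row to its first DV elements
            k->nb[1], k->nb[2],      // keep K's original row/plane strides
            0);                      // V data starts at offset 0 of each K row
}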
maxious added a commit to maxious/llama.cpp that referenced this pull request Jan 31, 2026 (same commit message as above)
maxious added a commit to maxious/llama.cpp that referenced this pull request Feb 1, 2026 (same commit message as above)