CUDA: add gqa_ratio 4 for GLM 4.7 flash #18953
Conversation
|
Looks like there is a bug there, taking a look |
|
I would have thought the correct patch is this, but the results are wrong:
diff --git a/ggml/src/ggml-cuda/fattn-mma-f16.cuh b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
index 9e98da95f..6cca5b2ec 100644
--- a/ggml/src/ggml-cuda/fattn-mma-f16.cuh
+++ b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
@@ -510,7 +510,6 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
}
}
} else {
- static_assert(cols_per_warp != 8, "cols_per_warp == 8 not implemented");
#pragma unroll
for (int k_KQ_0 = k0_start; k_KQ_0 < k0_stop; k_KQ_0 += T_A_KQ::J) {
load_ldmatrix(Q_B[0], tile_Q + (threadIdx.y / np)*(T_B_KQ::I*stride_tile_Q) + k_KQ_0, stride_tile_Q);
@@ -522,14 +521,18 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
T_A_KQ K_A;
load_ldmatrix(K_A, tile_K + i_KQ_0*stride_tile_K + (k_KQ_0 - k0_start), stride_tile_K);
- // Wide version of KQ_C is column-major
+ if constexpr (cols_per_warp == 8) {
+ mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], K_A, Q_B[0]);
+ } else {
+ // Wide version of KQ_C is column-major
#if defined(AMD_WMMA_AVAILABLE)
- // RDNA matrix C is column-major.
- mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], K_A, Q_B[0]);
+ // RDNA matrix C is column-major.
+ mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], K_A, Q_B[0]);
#else
- // swap A and B for CUDA.
- mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], Q_B[0], K_A);
+ // swap A and B for CUDA.
+ mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], Q_B[0], K_A);
#endif // defined(AMD_WMMA_AVAILABLE)
+ }
}
}
} |
|
I forgot that the tests in |
|
Addressed the rest of your comments. In general, this model spews out nonsense with and without FA; I think that should be fixed by #18980
We can add the tests, but unless there is a technical blocker, it would be useful to support separate K and V data in the CUDA implementation as well. |
|
This patch is working well for me, combined with --override-kv deepseek2.expert_gating_func=int:2 to resolve most of the issues with GLM 4.7 Flash: 2100 tokens/sec prompt processing and 90 tokens/sec generation on an RTX 6000 Blackwell. Output looks overall pretty decent when using the Unsloth FP16 GGUF.
|
@JohannesGaessler @am17an This PR should be OK to merge. I have outlined a plan for improving the implementation and fixing the tests in #18986. Let's continue there after we merge this.
|
In the CI, the following tests are failing, but locally I am able to get the tests to pass. So I'm thinking there's probably a bug in the tile kernel that we haven't fixed.
|
I didn't touch any of the tests so it's expected that they are still failing. |
@am17an The tests should fail locally too. For me, they fail on DGX Spark. On which setup are they passing for you?
|
Btw, the windows workflow fails to compile: https://github.com/ggml-org/llama.cpp/actions/runs/21195664178/job/60970976275?pr=18953 Any ideas? |
|
The tests now fail for me too. Is the fix going to go in #18986? |
Yes, it needs some more work and will be merged after this PR. Regarding the Windows build: from the logs, I think this is the problematic part: llama.cpp/ggml/src/ggml-cuda/fattn-mma-f16.cuh Lines 447 to 456 in fe1703a
If |
|
Forgive me if this is covered in previous discussions and I'm not understanding, but I'm attempting to build this PR myself. I'm building my own Docker containers (with a bunch of other stuff bundled, hence why I'm not using the official ones) and finally building with
And running into a failure during the build after applying it:
|
The compilation problems have to do with Volta, I'll push a fix. |
|
For Volta the minimum number of KQ columns is 32 so for the kernel templates with 8 and 16 columns the number of parallel warps per CUDA block was being calculated incorrectly. This tripped the static asserts even though that code should never actually be executed. I fixed the calculation of parallel warps to consider this edge case correctly. |
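For illustration, a minimal sketch of the clamping described above; the names and the exact relation between column count and warp count are assumptions for this sketch, not the actual fattn-mma-f16.cuh code:

// Volta requires at least 32 KQ columns, so the number of parallel warps per
// CUDA block has to be derived from the clamped column count. Otherwise the
// kernel templates instantiated with 8 or 16 columns compute an inconsistent
// warp count and trip static_asserts in code paths that never run on Volta.
constexpr int ncols_min_volta = 32;

constexpr int parallel_warps(int ncols, int cols_per_warp) {
    const int ncols_eff = ncols < ncols_min_volta ? ncols_min_volta : ncols;
    return ncols_eff / cols_per_warp; // assumed relation, for illustration only
}

static_assert(parallel_warps( 8, 8) == 4, "8-column template still gets a consistent warp count");
static_assert(parallel_warps(16, 8) == 4, "16-column template still gets a consistent warp count");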
Awesome, glad I helped find something :) I'll give the updated branch a test shortly. |
|
Is -ctk q8_0 -ctv q8_0 broken too? For me it again uses only the CPU. Without it, everything is OK.
|
I forgot: I reverted the change to exclude kernels with 8 KQ columns. I meant in my review comment that this variant should be excluded if and only if there are problems with it (but those are now fixed). |
|
Perplexity over Wikitext is fundamentally the wrong metric for judging the quality of an instruct-tuned model. What you should look at when it comes to the impact of quantization is KL divergence vs. the full-precision model. For judging the quality in an absolute sense there currently just isn't good tooling in the llama.cpp ecosystem. |
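For concreteness, a generic sketch of the suggested comparison (this is not the llama.cpp tooling; p is the full-precision model's next-token distribution at a given position and q is the quantized model's distribution over the same vocabulary):

#include <cmath>
#include <vector>

// KL(p || q): the information lost when the quantized model's distribution q
// is used in place of the full-precision distribution p at one position.
// Averaging this over many evaluation positions gives the quantization metric.
double kl_divergence(const std::vector<double> & p, const std::vector<double> & q) {
    double kl = 0.0;
    for (size_t i = 0; i < p.size(); ++i) {
        if (p[i] > 0.0 && q[i] > 0.0) {
            kl += p[i] * std::log(p[i] / q[i]);
        }
    }
    return kl;
}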
|
@JohannesGaessler Good to merge? |
|
I don't know if it's worth mentioning, but with my P40 + 3090 setup, FA on for PP is half the speed of FA off. Freshly built from this PR.
[benchmark tables omitted: combined P40 + 3090, 3090 only, and P40 only; all runs on build: a10d87b (7786)]
JohannesGaessler
left a comment
My general preference would be that we fix CI failures prior to a merge but it's fine if we take care of it soon after.
|
The changes do not improve performance with a quantized KV cache, which still goes to the CPU. Is that to be expected?
[benchmark table omitted; build: b70d251 (7803)]
…tn-mma-f16
This reverts commit b70d251.
This enables MMA-based flash attention on RDNA3 GPUs (gfx1100/1101/1102) for models with head size 576, such as GLM-4.7-Flash and other MLA (Multi-head Latent Attention) models. Previously, flash attention with head size 576 only worked on CUDA (via PR ggml-org#18953) and RDNA4; RDNA3 users had to disable flash attention, resulting in ~3x slower inference.
Changes:
- fattn.cu: Route RDNA3 + head size 576 to the MMA kernel (was RDNA4-only)
- fattn-mma-f16.cuh: Enable AMD WMMA for all RDNA3/RDNA4, allow DKQ==576
- mma.cuh: Add RDNA3 to make_identity_mat(), add f16->f16 WMMA intrinsic
Tested on AMD RX 7900 XTX (gfx1100) with GLM-4.7-Flash-REAP-23B:
- FA off: ~77 t/s
- FA on (before, broken): ~27 t/s
- FA on (after fix): ~83 t/s
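For illustration, the routing change described above boils down to something like the following sketch; the real dispatch lives in fattn.cu and the predicate names here are assumptions:

// Only the MLA head size 576 is affected; the MMA kernel is now selected for
// both RDNA3 and RDNA4 instead of RDNA4 only.
static bool use_mma_fattn_for_dkq576(bool is_rdna3, bool is_rdna4, long dkq) {
    if (dkq != 576) {
        return false;
    }
    return is_rdna3 || is_rdna4; // previously: is_rdna4 only
}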
When V is a view of K but with different head dimensions (e.g., GLM-4.7-Flash with K=576, V=512), we cannot simply reuse K's data pointer for V. For MLA models, the K tensor layout is [kv_lora_scaled (DV), pe (DQK-DV)], so V data is the first DV elements of each K row.
This fix extracts the correct V data from K when DQK != DV in:
- ggml_sycl_op_flash_attn_1 (basic FA path)
- ggml_sycl_op_flash_attn_coopmat (XMX path)
- ggml_sycl_op_flash_attn_mkl (oneMKL path)
Fixes GPU memory faults and incorrect results in backend tests for hsk=576,hsv=512 configurations. Aligns with upstream PRs ggml-org#18953, ggml-org#18986, ggml-org#19067 that implement V-less KV cache for MLA models like DeepSeek and GLM-4.7-Flash.
Amp-Thread-ID: https://ampcode.com/threads/T-019bf97a-9105-718e-84fb-320913c5f0c6
Co-authored-by: Amp <amp@ampcode.com>
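To make that layout concrete, a minimal sketch of deriving a V row from K under the MLA layout; the function name and parameters are illustrative assumptions, not the actual SYCL code:

#include <cstddef>
#include <cstdint>

// Each K row is laid out as [ kv_lora (DV elements) | rope pe (DQK-DV elements) ],
// so the V row is simply the first DV elements of the corresponding K row.
// k_data is the base pointer of the K tensor and nb1_k its row stride in bytes.
static inline const void * v_row_from_k(const void * k_data, size_t nb1_k, int64_t row) {
    return (const char *) k_data + row*nb1_k; // elements [0, DV) of this K row form the V row
}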

Enable FA for GLM 4.7. I'm not sure it's optimal, but at least it does not fall back to the CPU. Fixes #18944
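For context, a short sketch of what the gqa_ratio in the title refers to: the kernel is specialized on the ratio of query heads to KV heads, and with a ratio of 4 every four query heads share one KV head. The mapping below is generic GQA semantics for illustration, not the kernel code:

constexpr int gqa_ratio = 4;

constexpr int kv_head_for(int q_head) {
    return q_head / gqa_ratio; // query heads 0..3 read KV head 0, heads 4..7 read KV head 1, ...
}

static_assert(kv_head_for(0) == 0 && kv_head_for(3) == 0 && kv_head_for(4) == 1, "four query heads per KV head");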