vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. #19281

jeffbolznv · 2026-02-03T03:46:26Z

Write out a 2-bit code per block and avoid loading the mask when it matches these two common cases.

Apply this optimization when the mask is relatively large (i.e. prompt processing).
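
As a rough illustration of the preprocessing step described above, the sketch below classifies each block of an additive attention mask and packs the results two bits at a time. It is a CPU-side Python model, not the Vulkan shader from this PR: the code values (`MIXED`, `ALL_NEG_INF`, `ALL_ZERO`), the block sizes, the 16-codes-per-word packing, and all function names are assumptions made for the example.

```python
import math

# Hypothetical 2-bit encoding; the PR text does not state the actual values.
MIXED, ALL_NEG_INF, ALL_ZERO = 0, 1, 2

def classify_block(mask, r0, c0, br, bc):
    """Return the 2-bit code for the br x bc block whose top-left is (r0, c0)."""
    values = [v for row in mask[r0:r0 + br] for v in row[c0:c0 + bc]]
    if all(v == -math.inf for v in values):
        return ALL_NEG_INF
    if all(v == 0.0 for v in values):
        return ALL_ZERO
    return MIXED

def preprocess_mask(mask, br, bc):
    """Classify every block of a 2D mask (list of rows of floats)."""
    n_rows, n_cols = len(mask), len(mask[0])
    return [[classify_block(mask, r0, c0, br, bc)
             for c0 in range(0, n_cols, bc)]
            for r0 in range(0, n_rows, br)]

def pack_codes(codes):
    """Pack a flat list of 2-bit codes, 16 per 32-bit word (assumed layout)."""
    words = [0] * ((len(codes) + 15) // 16)
    for i, code in enumerate(codes):
        words[i // 16] |= code << (2 * (i % 16))
    return words

def unpack_code(words, i):
    """Read back the i-th 2-bit code from the packed words."""
    return (words[i // 16] >> (2 * (i % 16))) & 3

def causal_mask(n_q, n_past):
    """Additive causal mask: query i may attend to KV positions j <= i + n_past."""
    n_kv = n_past + n_q
    return [[0.0 if j <= i + n_past else -math.inf for j in range(n_kv)]
            for i in range(n_q)]

mask = causal_mask(n_q=4, n_past=4)
codes = preprocess_mask(mask, br=2, bc=2)
flat = [c for row in codes for c in row]
words = pack_codes(flat)
```

With a plain causal mask, every block that lies entirely in the already-processed context classifies as all-zero, and only blocks that straddle the diagonal come out as mixed.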

coopmat2 before

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 512 -n 0 -d 0-32768+8192 -m c:\models\GLM-4.7-Flash-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf -m c:\models\Qwen3-Next-80B-A3B-Instruct-Q2_K_L.gguf -m c:\models\llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           pp512 |      8365.94 ± 63.04 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3086.12 ± 16.37 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |       1883.63 ± 4.58 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |       1350.52 ± 2.89 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |       1046.75 ± 3.22 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |     11062.13 ± 67.12 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      9142.91 ± 79.52 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      7746.07 ± 66.12 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      6790.61 ± 41.74 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      6012.63 ± 50.81 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |     10578.41 ± 64.75 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      6987.35 ± 64.34 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      5200.53 ± 32.42 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      4107.49 ± 34.16 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3395.78 ± 19.24 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |           pp512 |      4540.43 ± 18.07 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     3975.27 ± 128.27 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     3509.32 ± 101.67 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      3156.87 ± 89.19 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      2850.34 ± 62.98 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           pp512 |     12846.73 ± 26.25 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      8687.44 ± 27.04 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     6481.34 ± 297.18 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |     5018.36 ± 364.86 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |     4201.95 ± 253.85 |

coopmat2 after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 512 -n 0 -d 0-32768+8192 -m c:\models\GLM-4.7-Flash-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf -m c:\models\Qwen3-Next-80B-A3B-Instruct-Q2_K_L.gguf -m c:\models\llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           pp512 |      8405.37 ± 58.64 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3208.47 ± 20.88 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |       1972.69 ± 4.23 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |       1414.26 ± 2.20 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |       1092.84 ± 1.57 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |     11094.89 ± 40.27 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      9413.05 ± 80.21 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      8232.54 ± 56.29 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      7296.55 ± 68.10 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      6557.01 ± 39.06 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |     10578.99 ± 53.19 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      7226.89 ± 50.77 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      5515.21 ± 39.63 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      4451.93 ± 25.74 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3725.73 ± 19.27 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |           pp512 |      4532.59 ± 22.57 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     4055.62 ± 134.18 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     3642.60 ± 104.75 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      3323.83 ± 97.42 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3047.56 ± 68.54 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           pp512 |     12798.62 ± 76.10 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     9033.60 ± 144.83 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     6896.18 ± 215.43 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |     5539.29 ± 171.84 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |     4816.55 ± 164.99 |

coopmat1 before

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 512 -n 0 -d 0-32768+8192 -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |     7307.18 ± 104.06 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      5619.02 ± 33.00 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      4575.58 ± 18.48 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      3844.27 ± 13.42 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3319.90 ± 10.61 |

coopmat1 after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 512 -n 0 -d 0-32768+8192 -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |     7260.44 ± 147.86 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      5797.43 ± 47.50 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      4833.98 ± 29.90 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      4131.64 ± 25.95 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3629.55 ± 17.00 |

scalar before

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 512 -n 0 -d 0-32768+8192 -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |     4739.86 ± 116.28 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      2768.71 ± 24.03 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |       1942.46 ± 6.94 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |       1487.37 ± 5.60 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |       1206.77 ± 1.97 |

scalar after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 512 -n 0 -d 0-32768+8192 -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |     4723.24 ± 107.93 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3648.14 ± 30.70 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |       2927.92 ± 9.00 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      2425.63 ± 11.23 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |       2073.52 ± 8.59 |

jeffbolznv · 2026-02-03T04:08:09Z

I was initially surprised at how much the scalar path benefited from this. I had assumed that slower math would mean the stalls would be a smaller percentage of the time. But I guess the smaller tile size means less math per stall, so it ends up benefiting as much as or more than the coopmat paths.
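
To put numbers on that comparison, the short script below computes the speedup at `pp512 @ d32768` for gpt-oss 20B on each path, using only the t/s means copied from the tables above. The dictionary layout and variable names are made up for the example; the measurement noise (the ± columns) is ignored.

```python
# (before, after) tokens/s for gpt-oss 20B at pp512 @ d32768, from the tables above.
results = {
    "coopmat2": (6012.63, 6557.01),
    "coopmat1": (3319.90, 3629.55),
    "scalar":   (1206.77, 2073.52),
}

# Speedup is simply after / before for each path.
speedups = {path: after / before for path, (before, after) in results.items()}

for path, s in speedups.items():
    print(f"{path:9s} {s:.3f}x  ({(s - 1) * 100:+.1f}%)")
```

The two coopmat paths come out at roughly 9% faster, while the scalar path is roughly 72% faster at this depth.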

ggerganov · 2026-02-03T06:54:15Z

Do I understand correctly that all gains here are effectively thanks to the "skip loading of all-zero masks", since the "all-inf" case was already being handled in #17186?

jeffbolznv · 2026-02-03T14:37:42Z

The previous change skipped the math for all-neg-inf blocks. This change skips loading the mask for both all-neg-inf and all-zero blocks. The mask load is relatively expensive, and with the way the code is structured there wasn't anything for the SM to do while waiting on the load.
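
The distinction between skipping the math and skipping the load can be sketched with a small Python model of one row of attention scores. This is not the shader code: the 2-bit code values, the `masked_scores` function, and the load counter are hypothetical, and a real flash-attention kernel would drop an all-neg-inf block entirely rather than emit `-inf` values.

```python
import math

# Hypothetical 2-bit encoding; the PR text does not state the actual values.
MIXED, ALL_NEG_INF, ALL_ZERO = 0, 1, 2

def masked_scores(scores, mask, codes, bc):
    """Apply an additive mask to one row of scores, one block of bc columns at a time.

    Returns the masked scores and the number of mask blocks actually read.
    """
    out = []
    loads = 0
    for b, code in enumerate(codes):
        lo, hi = b * bc, min((b + 1) * bc, len(scores))
        if code == ALL_NEG_INF:
            # Fully masked block: neither the mask nor the scores are needed.
            out.extend([-math.inf] * (hi - lo))
        elif code == ALL_ZERO:
            # Adding zero is a no-op: keep the math, skip the mask load.
            out.extend(scores[lo:hi])
        else:
            # Mixed block: the only case that still reads the mask.
            loads += 1
            out.extend(s + m for s, m in zip(scores[lo:hi], mask[lo:hi]))
    return out, loads

scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mask = [0.0, 0.0, 0.0, -math.inf, -math.inf, -math.inf]
codes = [ALL_ZERO, MIXED, ALL_NEG_INF]

out, loads = masked_scores(scores, mask, codes, bc=2)
reference = [s + m for s, m in zip(scores, mask)]
```

In this toy row only one of the three blocks reads the mask, and the result still matches the straightforward `scores + mask` reference.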

jeffbolznv requested review from 0cc4m and ggerganov as code owners February 3, 2026 03:46

github-actions bot added the testing, Vulkan, and ggml labels Feb 3, 2026

loci-dev mentioned this pull request Feb 3, 2026

UPSTREAM PR #19281: vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. auroralabs-loci/llama.cpp#1145

Open

vulkan: Preprocess FA mask to detect all-neg-inf and all-zero.

a95c994

jeffbolznv force-pushed the fa_mask_opt branch from 6195385 to a95c994 Compare February 3, 2026 18:32

jeffbolznv mentioned this pull request Feb 3, 2026

vulkan: make FA mask/softcap enables spec constants #19309

Open

ggerganov mentioned this pull request Feb 4, 2026

metal : skip loading all-zero mask #19337

Open
