HIP: Enable MMA flash attention for RDNA3 with head size 576 #19063
Closed
linus-amg wants to merge 1 commit into ggml-org:master
Conversation
This enables MMA-based flash attention on RDNA3 GPUs (gfx1100/1101/1102) for models with head size 576, such as GLM-4.7-Flash and other MLA (Multi-head Latent Attention) models. Previously, flash attention with head size 576 only worked on CUDA (via PR ggml-org#18953) and RDNA4; RDNA3 users had to disable flash attention, resulting in roughly 3x slower inference.

Changes:
- fattn.cu: route RDNA3 + head size 576 to the MMA kernel (was RDNA4-only)
- fattn-mma-f16.cuh: enable AMD WMMA for all RDNA3/RDNA4, allow DKQ == 576
- mma.cuh: add RDNA3 to make_identity_mat(), add an f16->f16 WMMA intrinsic

Performance, tested on an AMD RX 7900 XTX (gfx1100) with GLM-4.7-Flash-REAP-23B:
- FA off: ~77 t/s
- FA on (before, broken): ~27 t/s
- FA on (after fix): ~83 t/s
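To make the routing change concrete, here is a minimal, self-contained sketch of the kind of kernel selection described in the Changes list. It is not the actual fattn.cu code: AmdArch, FattnKernel, and select_fattn_kernel are hypothetical names used only for illustration, and the real selection logic in ggml also weighs batch size, precision, and other constraints.

```cpp
// Hypothetical sketch of the head-size-576 dispatch change described above.
// None of these names come from ggml/llama.cpp; they only illustrate extending
// an RDNA4-only MMA path to RDNA3 (gfx1100/1101/1102).
#include <cstdio>

enum class AmdArch     { RDNA2, RDNA3, RDNA4, Other };
enum class FattnKernel { Tile, Mma };

static bool is_rdna3(AmdArch a) { return a == AmdArch::RDNA3; }
static bool is_rdna4(AmdArch a) { return a == AmdArch::RDNA4; }

// Pick a flash-attention kernel for a given head size (DKQ).
static FattnKernel select_fattn_kernel(AmdArch arch, int dkq) {
    if (dkq == 576) {
        // Before this PR only RDNA4 took the MMA branch for DKQ == 576;
        // the change adds RDNA3 to the same condition.
        if (is_rdna3(arch) || is_rdna4(arch)) {
            return FattnKernel::Mma;
        }
        return FattnKernel::Tile; // other GPUs keep the tile fallback
    }
    return FattnKernel::Tile; // selection for other head sizes elided
}

int main() {
    const bool mma = select_fattn_kernel(AmdArch::RDNA3, 576) == FattnKernel::Mma;
    std::printf("gfx1100, DKQ=576 -> %s kernel\n", mma ? "MMA" : "tile");
    return 0;
}
```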
Author
Closing this PR: the RDNA3 f16→f16 WMMA implementation produces incorrect output because its unpacked output format is incompatible with the MMA tile structure. RDNA3 works correctly with the tile-based flash attention path instead of MMA. This may be revisited with a proper fix in the future.
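For context on the failure mode: on RDNA3, the f16-accumulate WMMA instruction writes each lane's eight results into only one 16-bit half of eight 32-bit VGPRs, selected by the OPSEL operand, so the accumulator comes back in an "unpacked" layout rather than as densely packed half2 pairs. The sketch below only illustrates that layout issue; it is not the mma.cuh code from this PR, the wmma_16x16x16_f16 wrapper name is made up, and the exact element positions should be double-checked against the gfx11 ISA documentation.

```cpp
// Illustration (not the PR's mma.cuh code) of the RDNA3 f16 -> f16 WMMA
// output layout referred to in the closing comment. Intended to be compiled
// as HIP device code for gfx1100/1101/1102; the wrapper name is hypothetical.
#include <hip/hip_runtime.h>

typedef _Float16 half16 __attribute__((ext_vector_type(16)));

__device__ half16 wmma_16x16x16_f16(half16 a, half16 b, half16 c) {
#if defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__)
    // gfx11 builtin: D(f16) = A(f16) * B(f16) + C(f16), wave32.
    // The last argument is OPSEL. With OPSEL = false the eight results per
    // lane land in the low 16 bits of eight consecutive VGPRs -- i.e. (as I
    // understand the ISA) in every other element of the 16-element vector,
    // while the remaining elements keep whatever C held. A consumer that
    // expects a densely packed half2 accumulator, as the MMA tile layout
    // does, therefore reads the wrong values, which is why this PR was closed.
    return __builtin_amdgcn_wmma_f16_16x16x16_f16_w32(a, b, c, false);
#else
    return c; // fallback so the sketch compiles for other targets
#endif
}
```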
Testing

Built with `-DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON -DGPU_TARGETS="gfx1100"` and tested on an AMD RX 7900 XTX (gfx1100) with GLM-4.7-Flash-REAP-23B-A3B.