MXFP4 Cast Transpose Triton [WIP] #422

sarthak-amd · 2026-01-20T00:32:53Z

Description

Implements a fused Triton quantization + cast kernel that converts FP32/BF16 inputs to MXFP4 in both rowwise and columnwise layouts.

Verify Tolerances and functional Unit Tests
The Triton kernel te_cast_transpose_mxfp4_triton currently outputs FP4 data in a linear [M, N/2] layout with contiguous byte packing. AITER's gemm_a4w4 requires the B matrix in the MFMA shuffle layout for tensor cores. This layout shuffle could be fused into the Triton kernel in the future.
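For reviewers who want a host-side reference when checking tolerances, here is a minimal NumPy sketch of rowwise MXFP4 quantization into a linear [M, N/2] packed layout. It is not the kernel from this PR. It assumes the OCP MX conventions (32-element blocks, an E8M0 shared exponent of floor(log2(amax)) - 2, E2M1 elements), and further assumes that the even-indexed element goes in the low nibble and that rounding ties resolve toward the smaller magnitude. The function names are hypothetical, and the actual kernel may differ on any of these points.

```python
import numpy as np

# Magnitudes representable in E2M1, indexed by the 3-bit magnitude code.
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 32  # elements sharing one scale in the MX format


def quantize_mxfp4_rowwise(x):
    """Hypothetical reference: packed uint8 [M, N/2] and biased E8M0 exponents [M, N/32]."""
    m, n = x.shape
    assert n % BLOCK == 0, "N must be a multiple of the block size"
    blocks = x.astype(np.float32).reshape(m, n // BLOCK, BLOCK)
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    # Shared exponent: floor(log2(amax)) minus the largest E2M1 exponent (2).
    # All-zero blocks use amax = 1 here, which is an arbitrary choice.
    safe = np.where(amax > 0, amax, 1.0)
    exp = np.clip(np.floor(np.log2(safe)).astype(np.int32) - 2, -127, 127)
    scaled = blocks / np.exp2(exp).astype(np.float32)
    # Nearest representable magnitude; values above 6 saturate to 6.
    # argmin returns the first minimum, so ties go to the smaller magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_VALUES).argmin(axis=-1).astype(np.uint8)
    codes = (idx | (np.signbit(scaled).astype(np.uint8) << 3)).reshape(m, n)
    # Assumed nibble order: even element in the low nibble, odd element in the high nibble.
    packed = codes[:, 0::2] | (codes[:, 1::2] << 4)
    scales = (exp.reshape(m, n // BLOCK) + 127).astype(np.uint8)
    return packed, scales


def dequantize_mxfp4_rowwise(packed, scales):
    """Inverse of the sketch above, returning float32 [M, N]."""
    m, half = packed.shape
    codes = np.empty((m, half * 2), dtype=np.uint8)
    codes[:, 0::2] = packed & 0x0F
    codes[:, 1::2] = packed >> 4
    vals = E2M1_VALUES[codes & 0x7] * np.where(codes & 0x8, -1.0, 1.0)
    exp = scales.astype(np.int32) - 127
    out = vals.reshape(m, -1, BLOCK) * np.exp2(exp)[..., None]
    return out.reshape(m, -1).astype(np.float32)


# Usage: values that are exactly representable round-trip without error.
x = np.tile(np.array([0.5, -1.0, 1.5, 6.0], dtype=np.float32), (2, 8))  # shape (2, 32)
packed, scales = quantize_mxfp4_rowwise(x)
restored = dequantize_mxfp4_rowwise(packed, scales)
```

Under the same assumptions, a columnwise reference can be obtained by calling `quantize_mxfp4_rowwise(x.T)`, which yields an [N, M/2] packed tensor. Whether that matches the kernel's columnwise output exactly should be confirmed against the unit test in this PR.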


sarthak-amd added 6 commits January 19, 2026 18:28

MXFP4 Tensor support in TE

fd7129d

fused cast transpose mxfp4

aca9e33

add E2M1 Dtype

b7cc9f2

Add unit test

7b2b4e5

update unit test and unify the api with upcoming hip kernel

df39c9a

Merge remote-tracking branch 'origin/dev' into feature/cast-transpose…

c1680cb

…-mxfp4