Fix QMoE CPU Operator by tianleiwu · Pull Request #27360 · microsoft/onnxruntime

tianleiwu · 2026-02-16T19:32:04Z

This PR addresses several issues in the QMoE CPU implementation and improves the MLAS documentation.

Changes

1. QMoE CPU Operator Fixes

Corrected Bias Handling: Renamed fc2_bias_handled_by_q4_gemm to fc2_bias_added_by_mlas and updated the logic to consistently track whether FC2 bias has been applied. This ensures that bias is not double-counted or missed when using DirectQ4Gemm.
SwiGLU Attribute Update: Switched from swiglu_interleaved to swiglu_fusion in both the C++ operator and the Python test infrastructure to align with the latest QMoE implementation standards.
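
The bias-tracking idea described above can be illustrated with a minimal Python sketch. This is not the operator's C++ code: `fc2_forward`, `matmul`, and the `use_direct_q4_gemm` switch are hypothetical names introduced here, and the assumption is simply that one GEMM path folds the bias in while the other does not. Only the flag name `fc2_bias_added_by_mlas` comes from the PR description.

```python
def matmul(x, w):
    """Plain matmul on nested lists: x is [M, K], w is [K, N]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def fc2_forward(x, w, bias, use_direct_q4_gemm):
    """Hypothetical FC2 step that applies the bias exactly once."""
    fc2_bias_added_by_mlas = False  # flag name taken from the PR description

    if use_direct_q4_gemm:
        # Assumed fast path: the GEMM routine folds the bias into its output.
        out = [[v + b for v, b in zip(row, bias)] for row in matmul(x, w)]
        fc2_bias_added_by_mlas = True
    else:
        # Assumed fallback path: the GEMM returns the raw product only.
        out = matmul(x, w)

    if not fc2_bias_added_by_mlas:
        # Add the bias here only if the GEMM has not already done so.
        out = [[v + b for v, b in zip(row, bias)] for row in out]
    return out
```

Both paths should yield the same result for the same inputs; a mismatch would indicate that the bias was either skipped or added twice.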

2. MLAS Documentation

Clarified Buffer Shapes: Added explicit documentation to MlasQ4GemmPackB to specify that the input FpData buffer expects a shape of [K, N]. This helps prevent layout-related errors in future integrations.

3. Test Updates

PyTorch Parity Fixes: Refactored onnxruntime/test/python/transformers/test_qmoe_cpu.py to use swiglu_fusion and improved the test structure for better parity checks with PyTorch.
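
To make the layout question behind swiglu_fusion concrete, here is a rough Python sketch of a fused FC1 output being split into gate and linear halves. It assumes that the interleaved layout corresponds to swiglu_fusion=1 (the review summary mentions weight interleaving for that value) and uses a textbook SiLU-gated product; the activation in onnxruntime may include extra scaling or clamping that is not shown. The helper names `silu`, `swiglu`, and `interleave` are illustrative and not taken from test_qmoe_cpu.py.

```python
import math


def silu(v):
    """Textbook SiLU: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(fc1_out, interleaved):
    """Apply a textbook SwiGLU to one fused FC1 output row.

    interleaved=True  : values arrive as [g0, l0, g1, l1, ...]
    interleaved=False : values arrive as [g0, g1, ..., l0, l1, ...]
    """
    half = len(fc1_out) // 2
    if interleaved:
        gate, linear = fc1_out[0::2], fc1_out[1::2]
    else:
        gate, linear = fc1_out[:half], fc1_out[half:]
    return [silu(g) * l for g, l in zip(gate, linear)]


def interleave(gate_rows, linear_rows):
    """Interleave gate and linear weight rows so FC1 emits [g0, l0, g1, l1, ...]."""
    fused = []
    for g, l in zip(gate_rows, linear_rows):
        fused.append(g)
        fused.append(l)
    return fused
```

A parity check in this style compares the two layouts on the same logical values: `swiglu([1.0, 3.0, 2.0, 4.0], True)` and `swiglu([1.0, 2.0, 3.0, 4.0], False)` use the same gate and linear values and should agree.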

Verification

Verified by running test_qmoe_cpu.py to ensure all QMoE parity tests pass on CPU.

Copilot

Pull request overview

This PR fixes issues in the QMoE CPU operator implementation, specifically correcting bias handling logic and updating attribute naming to match the actual C++ implementation. The changes also improve MLAS documentation for better clarity on input buffer layout requirements.

Changes:

Fixed FC2 bias handling in QMoE CPU operator by tracking when MLAS DirectQ4Gemm adds bias
Added transpose logic to convert weight matrices from [N, K] to [K, N] layout required by MlasQ4GemmPackB
Updated Python tests to use swiglu_fusion attribute instead of incorrect swiglu_interleaved attribute
Enhanced MLAS documentation to clarify that MlasQ4GemmPackB expects FpData with shape [K, N]
Added proper bias collection and passing in Python test infrastructure
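
The [N, K] to [K, N] conversion mentioned in the list above amounts to a transpose of a row-major buffer. The following Python sketch shows the index arithmetic under that assumption; `transpose_nk_to_kn` is a hypothetical helper and is not the code in moe_quantization_cpu.cc.

```python
def transpose_nk_to_kn(weights, n, k):
    """Return a flat row-major [K, N] copy of a flat row-major [N, K] buffer."""
    if len(weights) != n * k:
        raise ValueError("buffer length does not match n * k")
    out = [0.0] * (k * n)
    for row in range(n):
        for col in range(k):
            # Element (row, col) of [N, K] becomes element (col, row) of [K, N].
            out[col * n + row] = weights[row * k + col]
    return out
```

Passing an [N, K] buffer where [K, N] is expected would not fail loudly when the element count is the same, which is why spelling out the expected shape in the documentation is useful.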

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
onnxruntime/core/mlas/inc/mlas_q4.h	Updated documentation for MlasQ4GemmPackB to clarify FpData shape [K, N] and parameter meanings
onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc	Added transpose logic for weight matrices, renamed fc2_bias_handled_by_q4_gemm to fc2_bias_added_by_mlas, removed unused fc1_used_direct_q4 flag
onnxruntime/test/python/transformers/test_qmoe_cpu.py	Migrated from swiglu_interleaved to swiglu_fusion attribute, added bias collection/passing logic, updated swiglu function signature, improved weight interleaving for swiglu_fusion=1

Fix QMoE CPU

ed0478a

tianleiwu requested review from apsonawane and Copilot February 16, 2026 19:32

Copilot started reviewing on behalf of tianleiwu February 16, 2026 19:32

Copilot AI reviewed Feb 16, 2026

tianleiwu added release:1.24.2 and removed release:1.24.2 labels Feb 17, 2026

Fix QMoE CPU Operator #27360
tianleiwu wants to merge 1 commit into main from tlwu/20260216/fix_qmoe_cpu
