
Conversation

@hipudding (Collaborator) commented Jan 31, 2026

Implement the ggml_cann_mul_mat_id_quant function to support quantized matrix multiplication for Mixture of Experts (MoE) architectures on the CANN backend.

Key features:

  • Support Q4_0 and Q8_0 quantized weight formats
  • Use IndexSelect to dynamically route expert-specific weights based on expert indices (see the routing sketch after this list)
  • Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
  • Handle automatic F16 type conversion for hardware compatibility
  • Support both per-expert and broadcast input modes
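
The routing logic can be pictured with a small CPU reference. This is a minimal sketch, assuming a simplified Q8_0-style block layout (a float scale per 32-value block, where ggml actually stores an F16 scale) and contiguous per-expert weights; BlockQ8, matvec_q8, and mul_mat_id_q8_ref are illustrative names, not the CANN implementation, which gathers expert weights with IndexSelect and runs the product through WeightQuantBatchMatmulV2 on device. The broadcast input mode is elided here: each token carries its own activation row.

```cpp
// CPU reference sketch of MUL_MAT_ID with block-quantized expert weights.
// Simplified layout; the CANN kernel performs the same (token, expert)
// loop with IndexSelect + WeightQuantBatchMatmulV2 instead.
#include <cstdint>
#include <vector>

constexpr int QK8_0 = 32;      // values per Q8_0 block (as in ggml)

struct BlockQ8 {               // simplified Q8_0 block (scale kept as float)
    float  d;                  // per-block dequantization scale
    int8_t qs[QK8_0];          // quantized values
};

// y[r] = sum_c dequant(w[r][c]) * x[c] for one expert's weight matrix.
// Assumes cols is a multiple of QK8_0.
static void matvec_q8(const BlockQ8* w, const float* x, float* y,
                      int64_t rows, int64_t cols) {
    const int64_t blocks_per_row = cols / QK8_0;
    for (int64_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int64_t b = 0; b < blocks_per_row; ++b) {
            const BlockQ8& blk = w[r * blocks_per_row + b];
            for (int i = 0; i < QK8_0; ++i)
                acc += blk.d * blk.qs[i] * x[b * QK8_0 + i];
        }
        y[r] = acc;
    }
}

// Each token row of x is multiplied by the weights of the n_used experts
// selected for it in `ids` -- the routing this PR implements on device.
void mul_mat_id_q8_ref(const std::vector<std::vector<BlockQ8>>& experts,
                       const float* x, const int32_t* ids, float* dst,
                       int64_t n_tokens, int64_t n_used,
                       int64_t rows, int64_t cols) {
    for (int64_t t = 0; t < n_tokens; ++t)
        for (int64_t e = 0; e < n_used; ++e)
            matvec_q8(experts[ids[t * n_used + e]].data(),  // gather by index
                      x   + t * cols,
                      dst + (t * n_used + e) * rows,
                      rows, cols);
}
```

The double loop over (token, selected expert) mirrors the per-batch, per-expert processing described in the implementation details below.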

Implementation details:

  • Extract expert weights and scales using CANN IndexSelect operation
  • Process each batch and expert combination independently
  • Create proper tensor views with the correct strides for matmul operations (see the view sketch after this list)
  • Cast inputs and outputs to/from F16 automatically as needed
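
For the view/stride step, here is a hedged sketch of the pointer arithmetic involved, assuming a contiguous buffer laid out as [n_batch, n_expert, rows, cols]; View2D and make_expert_view are hypothetical helpers for illustration, not the actual CANN tensor-descriptor API.

```cpp
// Carving a per-(batch, expert) 2-D view out of a larger buffer with
// explicit byte strides, mirroring the "view with correct stride" step.
#include <cstddef>
#include <cstdint>

struct View2D {
    void*   data;     // pointer to the first element of the view
    int64_t ne[2];    // extents: {cols, rows}
    size_t  nb[2];    // byte strides: {per element, per row}
};

// Select the slice for (batch b, expert e) from a contiguous buffer of
// `elem_size`-byte elements shaped [n_batch, n_expert, rows, cols].
static View2D make_expert_view(void* base, int64_t b, int64_t e,
                               int64_t n_expert, int64_t rows, int64_t cols,
                               size_t elem_size) {
    View2D v;
    const size_t row_bytes    = (size_t) cols * elem_size;
    const size_t expert_bytes = (size_t) rows * row_bytes;
    v.data  = (char*) base + ((size_t) b * n_expert + (size_t) e) * expert_bytes;
    v.ne[0] = cols;
    v.ne[1] = rows;
    v.nb[0] = elem_size;
    v.nb[1] = row_bytes;  // contiguous rows; a strided source would differ here
    return v;
}
```

On the device side the same extents and byte strides would be handed to the CANN tensor descriptor rather than dereferenced directly.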

Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).


@hipudding added the Ascend NPU label (issues specific to Ascend NPUs) on Jan 31, 2026
@hipudding self-assigned this on Jan 31, 2026
@github-actions added the ggml label (changes relating to the ggml tensor library for machine learning) on Jan 31, 2026
@hipudding marked this pull request as ready for review on February 3, 2026
@hipudding requested a review from noemotiovon on February 3, 2026