Conversation

@hipudding
Collaborator

Implement the ggml_cann_mul_mat_id_quant function to support quantized matrix
multiplication for Mixture of Experts (MoE) architectures on the CANN backend.

Key features:
- Support Q4_0 and Q8_0 quantized weight formats
- Use IndexSelect to dynamically route expert-specific weights based on indices
- Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
- Handle automatic F16 type conversion for hardware compatibility
- Support both per-expert and broadcast input modes

Implementation details:
- Extract expert weights and scales using CANN IndexSelect operation
- Process each batch and expert combination independently
- Create proper tensor views with correct stride for matmul operations
- Automatic input/output type casting to/from F16 as needed
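The per-expert routing described above can be sketched in plain C++. This is a toy illustration of the control flow only, not the actual CANN kernel: `Q8Row` and `mul_mat_id_quant` are hypothetical stand-ins, and the scale here is per-row rather than Q8_0's per-32-element blocks.

```cpp
#include <cstdint>
#include <vector>

// Toy stand-in for one quantized weight row (real Q8_0 uses per-block scales).
struct Q8Row {
    std::vector<int8_t> q;  // quantized values
    float scale;            // dequantization scale
};

// For each token t: select the expert row named by ids[t] (the IndexSelect
// step), dequantize it, and multiply with the token's input x[t].
std::vector<float> mul_mat_id_quant(const std::vector<Q8Row>& expert_weights,
                                    const std::vector<int>& ids,
                                    const std::vector<std::vector<float>>& x) {
    std::vector<float> y(ids.size(), 0.0f);
    for (std::size_t t = 0; t < ids.size(); ++t) {
        const Q8Row& w = expert_weights[ids[t]];   // gather expert weights + scale
        for (std::size_t k = 0; k < w.q.size(); ++k) {
            y[t] += w.scale * w.q[k] * x[t][k];    // dequantize + accumulate
        }
    }
    return y;
}
```

In the real kernel the gather runs once per batch via IndexSelect and the inner loop is replaced by WeightQuantBatchMatmulV2 on F16 inputs.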

Code quality improvements:
- Clear variable naming (n_batches, n_experts, batch_idx, expert_idx)
- Reusable lambda function for F16 buffer preparation
- Simplified array initialization and memory layout calculations
- Comprehensive inline documentation

Testing: All 297 test cases passed for supported types (F32, F16, Q4_0, Q8_0)
across various configurations (different n_mats, n_used, batch parameters).

Implement the ggml_backend_cann_graph_optimize function for the CANN backend,
ported from the Vulkan backend (PRs ggml-org#15489 and ggml-org#15850).

Key changes:
- Add graph optimization to reorder nodes based on dependency analysis
- Group non-dependent nodes together for potential parallel execution
- Preserve fusion patterns (RMS_NORM+MUL, MUL_MAT+ADD, ADD+RMS_NORM)
- Add GGML_CANN_DISABLE_GRAPH_OPTIMIZE env var to disable optimization

This is the first step toward multi-stream parallel execution on Ascend NPU.
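The grouping idea behind the reordering can be sketched as follows (an illustrative sketch, not the actual ggml/CANN code): partition topologically ordered nodes into "waves" so that every node in a wave depends only on earlier waves, making nodes within one wave candidates for parallel execution.

```cpp
#include <algorithm>
#include <vector>

// deps[i] lists the indices of node i's input nodes; nodes are assumed to be
// in topological order already (as in a ggml compute graph).
std::vector<std::vector<int>> group_independent(
        const std::vector<std::vector<int>>& deps) {
    std::vector<int> wave(deps.size(), -1);
    std::vector<std::vector<int>> waves;
    for (int i = 0; i < (int)deps.size(); ++i) {
        int w = 0;
        for (int d : deps[i]) {
            w = std::max(w, wave[d] + 1);  // must come after all inputs' waves
        }
        wave[i] = w;
        if (w == (int)waves.size()) waves.emplace_back();
        waves[w].push_back(i);             // nodes in one wave are independent
    }
    return waves;
}
```

The actual implementation additionally keeps fusable node pairs (RMS_NORM+MUL, MUL_MAT+ADD, ADD+RMS_NORM) adjacent so the reorder does not break fusion.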
@hipudding hipudding self-assigned this Feb 3, 2026
@hipudding hipudding added the Ascend NPU issues specific to Ascend NPUs label Feb 3, 2026
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Feb 3, 2026
@hipudding hipudding closed this Feb 3, 2026
@hipudding hipudding reopened this Feb 3, 2026

- Replace tensor-pointer-based dependency tracking with memory-address-based tracking
- Use std::map<void*, int> to track pending writes per stream
- Implement smart stream selection:
  - No dependencies: round-robin distribution
  - Single dependency: execute on same stream (avoid sync overhead)
  - Multiple dependencies: sync all streams
- Add WAW (Write-After-Write) hazard detection
- Fix output corruption issue when using multi-stream execution

Enable with: GGML_CANN_MULTI_STREAM=1
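The stream-selection rule above can be sketched like this (a hedged illustration; `StreamPicker` and its members are hypothetical names, not the CANN backend's actual types): a `std::map<void*, int>` records which stream last wrote each memory address, and the next node's stream is chosen from how many distinct streams its inputs, plus its own output for WAW hazards, depend on.

```cpp
#include <map>
#include <set>
#include <vector>

struct StreamPicker {
    std::map<void*, int> last_writer;  // memory address -> stream with pending write
    int n_streams = 1;
    int rr = 0;                        // round-robin cursor

    // Returns the chosen stream; needs_sync receives streams that must be
    // waited on first (only the other dependent streams, not all streams).
    int pick(const std::vector<void*>& srcs, void* dst, std::set<int>& needs_sync) {
        std::set<int> dep_streams;
        for (void* s : srcs) {
            auto it = last_writer.find(s);
            if (it != last_writer.end()) dep_streams.insert(it->second);
        }
        auto waw = last_writer.find(dst);           // WAW hazard on the output
        if (waw != last_writer.end()) dep_streams.insert(waw->second);

        int stream;
        if (dep_streams.empty()) {
            stream = rr++ % n_streams;              // no deps: round-robin
        } else if (dep_streams.size() == 1) {
            stream = *dep_streams.begin();          // one dep: reuse its stream
        } else {
            stream = *dep_streams.begin();          // many deps: sync the rest
            needs_sync = dep_streams;
            needs_sync.erase(stream);
        }
        last_writer[dst] = stream;
        return stream;
    }
};
```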

When GGML_CANN_MULTI_STREAM=1 is set, ACL graph capture/execution must
be disabled since they are incompatible. The previous code had a bug
where the prefill_use_graph check would overwrite use_cann_graph after
it was set to false for multi-stream mode.

Fix by wrapping the prefill_use_graph check inside if (use_cann_graph)
to ensure it only runs when ACL graph is not already disabled.
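The control-flow fix can be reduced to a small sketch (names follow the commit message; the exact code in the CANN backend differs):

```cpp
// Before the fix, the prefill check ran unconditionally and could flip
// use_cann_graph back on after multi-stream mode had already disabled it.
bool decide_use_graph(bool multi_stream, bool prefill_use_graph) {
    bool use_cann_graph = true;
    if (multi_stream) {
        use_cann_graph = false;    // ACL graph is incompatible with multi-stream
    }
    if (use_cann_graph) {          // fixed: only consult prefill_use_graph when
        if (!prefill_use_graph) {  // the graph has not already been disabled
            use_cann_graph = false;
        }
    }
    return use_cann_graph;
}
```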
- Use parse_bool() for GGML_CANN_MULTI_STREAM environment variable
  parsing, consistent with other env var handling
- Only synchronize dependent streams instead of all streams when
  a node has multiple dependencies, reducing sync overhead
- Performance improvement: ~9% faster prompt processing on 0.5B model
  (1838 t/s vs 1688 t/s with ACL graph disabled)
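For reference, parse_bool-style env handling looks roughly like this (a sketch; the real parse_bool in the CANN backend may accept a different set of spellings):

```cpp
#include <cstdlib>
#include <cstring>

// Treat "1"/"on"/"yes"/"true" as enabled, anything else (or unset) as disabled.
static bool env_flag(const char* name) {
    const char* v = std::getenv(name);
    if (v == nullptr) return false;
    return std::strcmp(v, "1") == 0 || std::strcmp(v, "on") == 0 ||
           std::strcmp(v, "yes") == 0 || std::strcmp(v, "true") == 0;
}
```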