Conversation

@hipudding
Collaborator

Implement the ggml_cann_mul_mat_id_quant function to support quantized matrix
multiplication for Mixture of Experts (MoE) architectures on the CANN backend.

Key features:
- Support Q4_0 and Q8_0 quantized weight formats
- Use IndexSelect to dynamically route expert-specific weights based on indices
- Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
- Handle automatic F16 type conversion for hardware compatibility
- Support both per-expert and broadcast input modes

Implementation details:
- Extract expert weights and scales using CANN IndexSelect operation
- Process each batch and expert combination independently
- Create proper tensor views with correct stride for matmul operations
- Automatic input/output type casting to/from F16 as needed
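The per-expert routing described above can be sketched in plain C++. This is a toy illustration of the control flow only, not the actual CANN kernel: `Q8Row` and `mul_mat_id_quant` are hypothetical stand-ins, and the scale here is per-row rather than Q8_0's per-32-element blocks.

```cpp
#include <cstdint>
#include <vector>

// Toy stand-in for one quantized weight row (real Q8_0 uses per-block scales).
struct Q8Row {
    std::vector<int8_t> q;  // quantized values
    float scale;            // dequantization scale
};

// For each token t: select the expert row named by ids[t] (the IndexSelect
// step), dequantize it, and multiply with the token's input x[t].
std::vector<float> mul_mat_id_quant(const std::vector<Q8Row>& expert_weights,
                                    const std::vector<int>& ids,
                                    const std::vector<std::vector<float>>& x) {
    std::vector<float> y(ids.size(), 0.0f);
    for (std::size_t t = 0; t < ids.size(); ++t) {
        const Q8Row& w = expert_weights[ids[t]];   // gather expert weights + scale
        for (std::size_t k = 0; k < w.q.size(); ++k) {
            y[t] += w.scale * w.q[k] * x[t][k];    // dequantize + accumulate
        }
    }
    return y;
}
```

In the real kernel the gather runs once per batch via IndexSelect and the inner loop is replaced by WeightQuantBatchMatmulV2 on F16 inputs.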

Code quality improvements:
- Clear variable naming (n_batches, n_experts, batch_idx, expert_idx)
- Reusable lambda function for F16 buffer preparation
- Simplified array initialization and memory layout calculations
- Comprehensive inline documentation

Testing: All 297 test cases passed for supported types (F32, F16, Q4_0, Q8_0)
across various configurations (different n_mats, n_used, batch parameters).

Implement the ggml_backend_cann_graph_optimize function for the CANN backend,
ported from the Vulkan backend (PRs ggml-org#15489 and ggml-org#15850).

Key changes:
- Add graph optimization to reorder nodes based on dependency analysis
- Group non-dependent nodes together for potential parallel execution
- Preserve fusion patterns (RMS_NORM+MUL, MUL_MAT+ADD, ADD+RMS_NORM)
- Add GGML_CANN_DISABLE_GRAPH_OPTIMIZE env var to disable optimization

This is the first step toward multi-stream parallel execution on Ascend NPU.
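The grouping idea behind the reordering can be sketched as follows (an illustrative sketch, not the actual ggml/CANN code): partition topologically ordered nodes into "waves" so that every node in a wave depends only on earlier waves, making nodes within one wave candidates for parallel execution.

```cpp
#include <algorithm>
#include <vector>

// deps[i] lists the indices of node i's input nodes; nodes are assumed to be
// in topological order already (as in a ggml compute graph).
std::vector<std::vector<int>> group_independent(
        const std::vector<std::vector<int>>& deps) {
    std::vector<int> wave(deps.size(), -1);
    std::vector<std::vector<int>> waves;
    for (int i = 0; i < (int)deps.size(); ++i) {
        int w = 0;
        for (int d : deps[i]) {
            w = std::max(w, wave[d] + 1);  // must come after all inputs' waves
        }
        wave[i] = w;
        if (w == (int)waves.size()) waves.emplace_back();
        waves[w].push_back(i);             // nodes in one wave are independent
    }
    return waves;
}
```

The actual implementation additionally keeps fusable node pairs (RMS_NORM+MUL, MUL_MAT+ADD, ADD+RMS_NORM) adjacent so the reorder does not break fusion.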
@hipudding hipudding self-assigned this Feb 3, 2026
@hipudding hipudding added the Ascend NPU issues specific to Ascend NPUs label Feb 3, 2026
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Feb 3, 2026
@hipudding hipudding closed this Feb 3, 2026
@hipudding hipudding reopened this Feb 3, 2026

- Replace tensor-pointer-based dependency tracking with memory-address-based tracking
- Use std::map<void*, int> to track pending writes per stream
- Implement smart stream selection:
  - No dependencies: round-robin distribution
  - Single dependency: execute on same stream (avoid sync overhead)
  - Multiple dependencies: sync all streams
- Add WAW (Write-After-Write) hazard detection
- Fix output corruption issue when using multi-stream execution

Enable with: GGML_CANN_MULTI_STREAM=1
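The stream-selection rule above can be sketched like this (a hedged illustration; `StreamPicker` and its members are hypothetical names, not the CANN backend's actual types): a `std::map<void*, int>` records which stream last wrote each memory address, and the next node's stream is chosen from how many distinct streams its inputs, plus its own output for WAW hazards, depend on.

```cpp
#include <map>
#include <set>
#include <vector>

struct StreamPicker {
    std::map<void*, int> last_writer;  // memory address -> stream with pending write
    int n_streams = 1;
    int rr = 0;                        // round-robin cursor

    // Returns the chosen stream; needs_sync receives streams that must be
    // waited on first (only the other dependent streams, not all streams).
    int pick(const std::vector<void*>& srcs, void* dst, std::set<int>& needs_sync) {
        std::set<int> dep_streams;
        for (void* s : srcs) {
            auto it = last_writer.find(s);
            if (it != last_writer.end()) dep_streams.insert(it->second);
        }
        auto waw = last_writer.find(dst);           // WAW hazard on the output
        if (waw != last_writer.end()) dep_streams.insert(waw->second);

        int stream;
        if (dep_streams.empty()) {
            stream = rr++ % n_streams;              // no deps: round-robin
        } else if (dep_streams.size() == 1) {
            stream = *dep_streams.begin();          // one dep: reuse its stream
        } else {
            stream = *dep_streams.begin();          // many deps: sync the rest
            needs_sync = dep_streams;
            needs_sync.erase(stream);
        }
        last_writer[dst] = stream;
        return stream;
    }
};
```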

When GGML_CANN_MULTI_STREAM=1 is set, ACL graph capture/execution must
be disabled since they are incompatible. The previous code had a bug
where the prefill_use_graph check would overwrite use_cann_graph after
it was set to false for multi-stream mode.

Fix by wrapping the prefill_use_graph check inside if (use_cann_graph)
to ensure it only runs when ACL graph is not already disabled.
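The control-flow fix can be reduced to a small sketch (names follow the commit message; the exact code in the CANN backend differs):

```cpp
// Before the fix, the prefill check ran unconditionally and could flip
// use_cann_graph back on after multi-stream mode had already disabled it.
bool decide_use_graph(bool multi_stream, bool prefill_use_graph) {
    bool use_cann_graph = true;
    if (multi_stream) {
        use_cann_graph = false;    // ACL graph is incompatible with multi-stream
    }
    if (use_cann_graph) {          // fixed: only consult prefill_use_graph when
        if (!prefill_use_graph) {  // the graph has not already been disabled
            use_cann_graph = false;
        }
    }
    return use_cann_graph;
}
```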
- Use parse_bool() for GGML_CANN_MULTI_STREAM environment variable
  parsing, consistent with other env var handling
- Only synchronize dependent streams instead of all streams when
  a node has multiple dependencies, reducing sync overhead
- Performance improvement: ~9% faster prompt processing on 0.5B model
  (1838 t/s vs 1688 t/s with ACL graph disabled)
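For reference, parse_bool-style env handling looks roughly like this (a sketch; the real parse_bool in the CANN backend may accept a different set of spellings):

```cpp
#include <cstdlib>
#include <cstring>

// Treat "1"/"on"/"yes"/"true" as enabled, anything else (or unset) as disabled.
static bool env_flag(const char* name) {
    const char* v = std::getenv(name);
    if (v == nullptr) return false;
    return std::strcmp(v, "1") == 0 || std::strcmp(v, "on") == 0 ||
           std::strcmp(v, "yes") == 0 || std::strcmp(v, "true") == 0;
}
```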