
Conversation

@chraac (Contributor) commented Feb 3, 2026

Key changes

  • Optimization: Implemented in-place F32-to-F16 conversion for row data, eliminating duplicate conversions inside the inner loop.
  • New Compute Primitives: Added fused multiply-add (MAD) support (hvx_mad_f32_f16_aa_rx2) for FP16 inputs with separate scaling factors.

Implementation Details

  • In-place F32 to F16 Conversion: The row conversion logic was hoisted out of the inner loop, so each row is converted once instead of once per iteration.
  • hvx_mad_f32_f16_aa_rx2: A new HVX intrinsic wrapper allowing efficient accumulation of two FP16 vectors, each with its own scaling factor (see the sketch after this list).
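The signature of hvx_mad_f32_f16_aa_rx2 is not shown in this excerpt, so the following is only a scalar reference sketch of the assumed semantics: accumulate two aligned FP16 rows into an FP32 accumulator, each scaled by its own factor. All names and the conversion helper below are illustrative, not the actual HVX implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t f16_t; /* IEEE 754 half precision, stored as raw bits */

/* Minimal FP16 -> FP32 conversion for this sketch (subnormals are flushed
 * to zero here; the real kernel relies on hardware conversion instead). */
static float f16_to_f32(f16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        bits = sign;                                   /* zero / subnormal */
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13);      /* inf / NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical scalar model of what hvx_mad_f32_f16_aa_rx2 is assumed to
 * compute: acc[i] += a[i] * scale_a + b[i] * scale_b over n elements,
 * where a and b are FP16 rows and acc is an FP32 accumulator. The real
 * primitive would process aligned ("_aa") data two rows at a time ("_rx2")
 * in HVX vector registers rather than element by element. */
static void mad_f32_f16_rx2_ref(float *acc,
                                const f16_t *a, float scale_a,
                                const f16_t *b, float scale_b,
                                size_t n) {
    for (size_t i = 0; i < n; ++i) {
        acc[i] += f16_to_f32(a[i]) * scale_a + f16_to_f32(b[i]) * scale_b;
    }
}
```

Presumably the point of the rx2 variant is that consuming two rows per call halves the number of accumulation passes compared with issuing two single-row MADs.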

Performance

TODO: Add benchmark comparisons (e.g., tokens/s improvements on target hardware).

@chraac chraac marked this pull request as draft February 3, 2026 03:53
@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Feb 3, 2026
```diff
-const uint8_t * q_ptr_vtcm = dma_queue_pop(dma).dst;
+uint8_t * q_ptr_vtcm = dma_queue_pop(dma).dst;
+if (is_q_fp32) {
+    hvx_copy_f16_f32_aa(q_ptr_vtcm, q_ptr_vtcm, DK); // in-place convert f32 to f16
+}
```
@chraac (author) commented on this diff:
Pre-conversion to F16: Converted the row to F16 upfront to avoid repeated on-the-fly conversion in the code below.
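As a rough illustration of the pre-conversion described above: a sketch only, assuming the inner loop computes dot products against this row. Every name below is a placeholder, not the kernel's real API; convert_f32_to_f16_inplace stands in for hvx_copy_f16_f32_aa from the diff.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t f16_t;

/* Placeholder declarations for the in-place VTCM row conversion and an
 * F16 dot product. */
extern void  convert_f32_to_f16_inplace(void *row, size_t n);
extern float dot_f16(const f16_t *a, const f16_t *b, size_t n);

/* Schematic of the change: previously an F32 -> F16 conversion of the row
 * ran inside the loop, redoing identical work each iteration. Converting
 * the row once, in place, before the loop removes that duplicate work; the
 * F16 data occupies the front half of the same buffer, so no extra scratch
 * memory is needed. */
void score_rows(void *q_row_vtcm, int q_is_f32,
                const f16_t *k_rows, size_t dk,
                size_t n_rows, float *scores) {
    if (q_is_f32) {
        convert_f32_to_f16_inplace(q_row_vtcm, dk);      /* hoisted: once per row */
    }
    const f16_t *q_f16 = (const f16_t *) q_row_vtcm;
    for (size_t j = 0; j < n_rows; ++j) {
        scores[j] = dot_f16(q_f16, k_rows + j * dk, dk); /* no per-iteration conversion */
    }
}
```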

