
[Hexagon] Optimize argsort with vectorized Quicksort partition #30

Closed
max-krasnyansky wants to merge 2 commits into master from hexagon-vector-argsort-9639307229427924630


@max-krasnyansky

This PR optimizes the argsort operation in the Hexagon backend by vectorizing the Quicksort partition loop.

Changes:

  1. ggml/src/ggml-hexagon/htp/hvx-base.h: Added hvx_vec_get_i32 helper function to extract a scalar integer from a vector, necessary for the reduction check.
  2. ggml/src/ggml-hexagon/htp/argsort-ops.c:
    • Replaced quicksort_indices_asc and quicksort_indices_desc with quicksort_values_indices_asc and quicksort_values_indices_desc.
    • The new sorting functions sort the values scratchpad buffer directly and mirror the swaps to the indices buffer. This allows for contiguous vector loads from the values array, significantly speeding up the partition scan.
    • Implemented the partition scanning loop using HVX intrinsics (Q6_Q_vcmp_gt_VsfVsf).
    • Implemented a workaround for the missing Q6_Q_all_P instruction by using Q6_V_vmux_QVV to create a mask of 1s/0s and summing them with hvx_vec_reduce_sum_i32 to check if a whole block of 32 elements satisfies the pivot condition.
    • Updated htp_argsort_f32 to use the new sorting functions.
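
The "sort the values, mirror the swaps" strategy described above can be sketched in portable scalar C. This is an illustrative model only, not the code from this PR: the names `swap_pair`, `sort_values_indices_asc`, and `argsort_f32_asc` are hypothetical, and the partition scheme (middle-element pivot, Hoare-style scan) is an assumption.

```c
#include <stdint.h>

/* Swap two values and mirror the same swap in the index buffer. */
static void swap_pair(float *vals, int32_t *idx, int a, int b) {
    float   tv = vals[a]; vals[a] = vals[b]; vals[b] = tv;
    int32_t ti = idx[a];  idx[a]  = idx[b];  idx[b]  = ti;
}

/* Ascending quicksort over vals[lo..hi]; idx is kept in lockstep with vals.
 * Both scans read vals[] contiguously, which is what makes them vectorizable. */
void sort_values_indices_asc(float *vals, int32_t *idx, int lo, int hi) {
    if (lo >= hi) return;
    float pivot = vals[lo + (hi - lo) / 2];   /* assumed pivot choice */
    int i = lo, j = hi;
    while (i <= j) {
        while (vals[i] < pivot) i++;          /* left scan  */
        while (vals[j] > pivot) j--;          /* right scan */
        if (i <= j) {
            swap_pair(vals, idx, i, j);
            i++;
            j--;
        }
    }
    sort_values_indices_asc(vals, idx, lo, j);
    sort_values_indices_asc(vals, idx, i, hi);
}

/* Hypothetical driver: copy src into a scratch buffer, fill idx with 0..n-1,
 * then sort the scratch values while permuting idx the same way. */
void argsort_f32_asc(const float *src, float *scratch, int32_t *idx, int n) {
    for (int k = 0; k < n; k++) {
        scratch[k] = src[k];
        idx[k] = k;
    }
    sort_values_indices_asc(scratch, idx, 0, n - 1);
}
```

After `argsort_f32_asc` returns, `scratch[k] == src[idx[k]]` holds for every `k`, so `idx` is the argsort result and the sorted scratch values are a by-product.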

Performance:
The vectorized scan reduces the number of scalar comparisons and branch mispredictions during the partitioning phase of Quicksort, which is the most compute-intensive part of the operation.
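
To illustrate where the saving comes from, here is a scalar C model of the reduction-based "all lanes pass" check. It is a sketch under assumptions, not the HVX code from this PR: `count_lt_block` stands in for the compare, mux-to-1/0, and reduce-sum sequence, `scan_left_asc` is a hypothetical name, and the explicit `hi` bound is added so the sketch is safe to run on its own.

```c
#include <stdint.h>

#define LANES 32  /* one block of 32 fp32 elements, as described in the PR */

/* Scalar stand-in for: vector compare against the pivot, mux to 1/0 per lane,
 * then a horizontal sum. Returns how many of the 32 lanes satisfy v[k] < pivot. */
static int32_t count_lt_block(const float *v, float pivot) {
    int32_t sum = 0;
    for (int k = 0; k < LANES; k++) {
        sum += (v[k] < pivot) ? 1 : 0;
    }
    return sum;
}

/* Advance i over elements of vals[i..hi] that are below the pivot.
 * Whole blocks are skipped with a single check when all 32 lanes pass;
 * the first block that fails, and any tail, is finished element by element. */
int scan_left_asc(const float *vals, int i, int hi, float pivot) {
    while (i + LANES - 1 <= hi && count_lt_block(vals + i, pivot) == LANES) {
        i += LANES;                    /* one branch per 32 elements */
    }
    while (i <= hi && vals[i] < pivot) {
        i++;                           /* scalar fallback */
    }
    return i;
}
```

In this model the block loop takes one data-dependent branch per 32 elements instead of one per element. The descending sort and the right-hand scan would use the mirrored comparison.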


PR created automatically by Jules for task 9639307229427924630 started by @max-krasnyansky

max-krasnyansky and others added 2 commits February 4, 2026 13:46
Replaced the scalar Quicksort implementation with a vectorized version using HVX intrinsics.
- Changed sorting strategy to direct sort on values buffer with mirrored index swaps for better vectorization.
- Added `hvx_vec_get_i32` to `hvx-base.h`.
- Implemented partition loop using vector comparisons and reduction-based "all check" (workaround for missing `Q6_Q_all_P`).

Co-authored-by: max-krasnyansky <1380796+max-krasnyansky@users.noreply.github.com>
