[Hexagon] Optimize argsort with vectorized Quicksort partition#30
max-krasnyansky wants to merge 2 commits into master
Replaced the scalar Quicksort implementation with a vectorized version using HVX intrinsics.

- Changed sorting strategy to direct sort on values buffer with mirrored index swaps for better vectorization.
- Added `hvx_vec_get_i32` to `hvx-base.h`.
- Implemented partition loop using vector comparisons and reduction-based "all check" (workaround for missing `Q6_Q_all_P`).

Co-authored-by: max-krasnyansky <1380796+max-krasnyansky@users.noreply.github.com>
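To make the "direct sort on values buffer with mirrored index swaps" strategy concrete, here is a minimal host-side sketch in plain scalar C. It is not the code from this PR: the names `sketch_swap_pair`, `sketch_sort_values_indices_asc`, and `sketch_argsort_f32_asc` are hypothetical, the partition scheme (Hoare-style with a middle pivot) is an assumption, and the input is assumed to contain no NaNs.

```c
#include <stdint.h>

/* Swap two entries in values and apply the same swap to indices,
 * so indices[k] always names the source position of values[k]. */
static void sketch_swap_pair(float *values, int32_t *indices, int a, int b) {
    float   tv = values[a];  values[a]  = values[b];  values[b]  = tv;
    int32_t ti = indices[a]; indices[a] = indices[b]; indices[b] = ti;
}

/* Hypothetical scalar reference: ascending quicksort over values[lo..hi]
 * (inclusive bounds), comparing values directly instead of going through
 * indices. The partition scheme is an assumption, not taken from the PR. */
static void sketch_sort_values_indices_asc(float *values, int32_t *indices,
                                           int lo, int hi) {
    if (lo >= hi) return;
    float pivot = values[lo + (hi - lo) / 2];
    int i = lo, j = hi;
    while (i <= j) {
        while (values[i] < pivot) i++;   /* contiguous reads of values */
        while (values[j] > pivot) j--;
        if (i <= j) {
            sketch_swap_pair(values, indices, i, j);
            i++;
            j--;
        }
    }
    sketch_sort_values_indices_asc(values, indices, lo, j);
    sketch_sort_values_indices_asc(values, indices, i, hi);
}

/* Copy src into a scratch buffer, initialise indices to 0..n-1, then sort
 * the scratch copy. Afterwards src[indices[k]] is ascending in k. */
static void sketch_argsort_f32_asc(const float *src, float *scratch,
                                   int32_t *indices, int n) {
    for (int k = 0; k < n; k++) {
        scratch[k] = src[k];
        indices[k] = k;
    }
    sketch_sort_values_indices_asc(scratch, indices, 0, n - 1);
}
```

Because the inner scans read `values[i]` sequentially rather than `values[indices[i]]`, they touch contiguous memory, which is what makes a vector load over the same range possible.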
5aaf5de to bdcb213
This PR optimizes the `argsort` operation in the Hexagon backend by vectorizing the Quicksort partition loop.

Changes:

- `ggml/src/ggml-hexagon/htp/hvx-base.h`: Added `hvx_vec_get_i32` helper function to extract a scalar integer from a vector, necessary for the reduction check.
- `ggml/src/ggml-hexagon/htp/argsort-ops.c`:
  - Replaced `quicksort_indices_asc` and `quicksort_indices_desc` with `quicksort_values_indices_asc` and `quicksort_values_indices_desc`.
  - The new functions sort the `values` scratchpad buffer directly and mirror the swaps to the `indices` buffer. This allows for contiguous vector loads from the `values` array, significantly speeding up the partition scan.
  - Implemented the partition scan using HVX vector comparisons (`Q6_Q_vcmp_gt_VsfVsf`).
  - Worked around the missing `Q6_Q_all_P` instruction by using `Q6_V_vmux_QVV` to create a mask of 1s/0s and summing them with `hvx_vec_reduce_sum_i32` to check if a whole block of 32 elements satisfies the pivot condition.
  - Updated `htp_argsort_f32` to use the new sorting functions.

Performance:
The vectorized scan reduces the number of scalar comparisons and branch mispredictions during the partitioning phase of Quicksort, which is the most compute-intensive part of the operation.
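The block-wise scan with a sum-based "all check" can be modelled on the host in scalar C. The sketch below is only an illustration of the idea, not the HVX code from this PR: `SKETCH_LANES`, `sketch_block_all_lt`, and `sketch_scan_left` are hypothetical names, the per-lane select stands in for the compare-plus-mux step, and the accumulation stands in for the reduce-sum. The operand order of the real compare and the handling of the tail are assumptions.

```c
#include <stdint.h>

/* 32 fp32 lanes per block, matching the block size in the description. */
#define SKETCH_LANES 32

/* Scalar model of the "all check": build a 1/0 value per lane (the role of
 * the compare followed by a mux), sum the lanes (the role of the reduction),
 * and report whether every lane satisfied block[l] < pivot. */
static int sketch_block_all_lt(const float *block, float pivot) {
    int32_t sum = 0;
    for (int l = 0; l < SKETCH_LANES; l++) {
        int32_t lane = (pivot > block[l]) ? 1 : 0;
        sum += lane;
    }
    return sum == SKETCH_LANES;
}

/* Left-to-right partition scan over values[start..end), end exclusive.
 * Skips whole blocks while every element is below the pivot, then finishes
 * with a scalar loop. Returns the first index whose value is >= pivot,
 * or end if there is none. */
static int sketch_scan_left(const float *values, int start, int end,
                            float pivot) {
    int i = start;
    while (i + SKETCH_LANES <= end && sketch_block_all_lt(values + i, pivot)) {
        i += SKETCH_LANES;   /* one block-level decision per 32 elements */
    }
    while (i < end && values[i] < pivot) {
        i++;                 /* scalar tail locates the exact stop position */
    }
    return i;
}
```

In this model the loop makes one data-dependent branch per 32 elements for as long as whole blocks pass, which is the effect the paragraph above attributes to the vectorized scan.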
PR created automatically by Jules for task 9639307229427924630 started by @max-krasnyansky