
@willieyz commented Jan 22, 2026

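In this PR, we replace the AVX2 intrinsics implementation of `poly_caddq` with an x86_64 assembly version.
To estimate the performance impact, we compare the results shown in the two tables below.
Overall, for keypair, sign, and verify in the opt build, the performance difference stays within about ±1%. The only change above 1% in magnitude is −1.13% for ML-DSA-44 verify with the unrolled version, which is a speedup. This is in line with the no-opt build, where all differences stay within ±0.6%.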
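For context, `poly_caddq` maps every coefficient of a polynomial from a signed representative to a non-negative one by adding q whenever the coefficient is negative. The following is a minimal portable C sketch of that operation, assuming coefficients in (−q, q), 256 coefficients per polynomial, and q = 8380417 as in ML-DSA. The names `poly_caddq_ref`, `MLDSA_N` and `MLDSA_Q` are hypothetical, and this is neither the AVX2 intrinsics code nor the new assembly from this PR.

```c
#include <assert.h>
#include <stdint.h>

#define MLDSA_N 256     /* coefficients per polynomial */
#define MLDSA_Q 8380417 /* ML-DSA modulus q = 2^23 - 2^13 + 1 */

/* Hypothetical scalar sketch of poly_caddq: add q to every negative
 * coefficient and leave non-negative ones unchanged, without branches.
 * (a >> 31) is all ones for negative a and zero otherwise; this relies on
 * >> of a negative int32_t being an arithmetic shift, which ISO C leaves
 * implementation-defined but GCC and Clang guarantee. */
static void poly_caddq_ref(int32_t coeffs[MLDSA_N])
{
  for (int i = 0; i < MLDSA_N; i++)
  {
    int32_t a = coeffs[i];
    assert(a > -MLDSA_Q && a < MLDSA_Q); /* assumed input bound */
    coeffs[i] = a + ((a >> 31) & MLDSA_Q);
  }
}
```

Under the assumed input bound, every output coefficient lies in [0, q). The mask-and-add form avoids a data-dependent branch, which matters for constant-time code.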

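In the component-level benchmark for `mld_poly_caddq` (opt build), the assembly version needs noticeably more cycles than the AVX2 intrinsics: between 17.95% and 34.21% more, depending on the parameter set. After unrolling the assembly loop by a factor of 4, the gap shrinks to between 5.00% and 10.53%. In the no-opt build, the two versions are within ±0.26% of each other.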
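To make "unrolling by a factor of 4" concrete, here is the same hypothetical scalar loop with its body replicated four times per iteration. This only illustrates the transformation in portable C. The PR applies it to x86_64 assembly, where each step presumably operates on a 256-bit register holding eight 32-bit coefficients rather than on a single coefficient; that code is not reproduced here. Unrolling cuts the per-element loop overhead (counter update, compare, branch) and exposes more independent work per iteration, which is a plausible reason for the smaller gap in the unrolled measurements.

```c
#include <assert.h>
#include <stdint.h>

#define MLDSA_N 256     /* coefficients per polynomial */
#define MLDSA_Q 8380417 /* ML-DSA modulus */

/* The unrolled loop below has no remainder handling, so N must be a
 * multiple of the unroll factor (256 = 4 * 64). */
static_assert(MLDSA_N % 4 == 0, "MLDSA_N must be a multiple of 4");

/* Hypothetical helper: branch-free "add q if negative" for one coefficient
 * (assumes arithmetic right shift of negative int32_t, as in GCC/Clang). */
static inline int32_t caddq_one(int32_t a)
{
  return a + ((a >> 31) & MLDSA_Q);
}

/* Hypothetical unrolled variant: each iteration processes four independent
 * coefficients, so the counter update, compare and branch run 64 times
 * instead of 256. */
static void poly_caddq_unroll4(int32_t coeffs[MLDSA_N])
{
  for (int i = 0; i < MLDSA_N; i += 4)
  {
    coeffs[i + 0] = caddq_one(coeffs[i + 0]);
    coeffs[i + 1] = caddq_one(coeffs[i + 1]);
    coeffs[i + 2] = caddq_one(coeffs[i + 2]);
    coeffs[i + 3] = caddq_one(coeffs[i + 3]);
  }
}
```

In C, an optimizing compiler may unroll or vectorize the simple loop on its own; in hand-written assembly nothing does this automatically, so the unrolling has to be written out explicitly.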

- bench components
  - Δ (%) = (asm − AVX2) / AVX2 × 100

| Component | Implementation | Build | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 | Notes |
|---|---|---|---|---|---|---|
| mld_poly_caddq (avg) | AVX2 intrinsics | no-opt | 391 | 393 | 391 | |
| | x86_64 asm | no-opt | 390 | 393 | 392 | |
| | Δ (%) | no-opt | -0.26 | 0.00 | +0.26 | |
| mld_poly_caddq (avg) | AVX2 intrinsics | opt | 38 | 40 | 39 | |
| | x86_64 asm | opt | 51 | 50 | 46 | |
| | x86_64 asm (unroll) | opt | 42 | 42 | 42 | unroll by 4 |
| | Δ (%) | opt | +34.21 | +25.00 | +17.95 | |
| | Δ (%) (unroll) | opt | +10.53 | +5.00 | +7.69 | unroll by 4 |
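The Δ rows can be reproduced directly from the cycle counts. A small C helper with a hypothetical name, applying the formula stated above:

```c
#include <assert.h>

/* Relative difference in percent, as defined for these tables:
 *   delta = (asm - avx2) / avx2 * 100
 * A positive result means the assembly version needed more cycles
 * (slower); a negative result means it needed fewer (faster). */
static double delta_percent(double asm_cycles, double avx2_cycles)
{
  assert(avx2_cycles > 0.0); /* baseline must be a positive cycle count */
  return (asm_cycles - avx2_cycles) / avx2_cycles * 100.0;
}
```

For example, `delta_percent(51, 38)` is 13/38 × 100 ≈ 34.21, the +34.21 reported for ML-DSA-44 (opt), and `delta_percent(42, 38)` ≈ 10.53 matches the unrolled row.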

- bench
  - Δ (%) = (asm − AVX2) / AVX2 × 100

| Component | Implementation | Build | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 | Notes |
|---|---|---|---|---|---|---|
| keypair cycles (avg) | AVX2 | no-opt | 134355 | 226117 | 377069 | baseline (main) |
| | x86_64 asm | no-opt | 133831 | 226345 | 374963 | |
| | Δ (%) | no-opt | -0.39 | +0.10 | -0.56 | |
| | AVX2 | opt | 60367 | 105019 | 166676 | baseline (main) |
| | x86_64 asm | opt | 60535 | 104479 | 165781 | |
| | x86_64 asm (unroll) | opt | 59921 | 104367 | 165795 | unroll by 4 |
| | Δ (%) | opt | +0.28 | -0.51 | -0.54 | |
| | Δ (%) (unroll) | opt | -0.74 | -0.62 | -0.53 | unroll by 4 |
| sign cycles (avg) | AVX2 | no-opt | 473892 | 779091 | 998026 | baseline (main) |
| | x86_64 asm | no-opt | 473262 | 779359 | 993245 | |
| | Δ (%) | no-opt | -0.13 | +0.03 | -0.48 | |
| | AVX2 | opt | 179804 | 301077 | 364509 | baseline (main) |
| | x86_64 asm | opt | 180253 | 298598 | 363742 | |
| | x86_64 asm (unroll) | opt | 178255 | 299153 | 363505 | unroll by 4 |
| | Δ (%) | opt | +0.25 | -0.82 | -0.21 | |
| | Δ (%) (unroll) | opt | -0.86 | -0.64 | -0.28 | unroll by 4 |
| verify cycles (avg) | AVX2 | no-opt | 140765 | 228322 | 379244 | baseline (main) |
| | x86_64 asm | no-opt | 140872 | 228255 | 377091 | |
| | Δ (%) | no-opt | +0.08 | -0.03 | -0.57 | |
| | AVX2 | opt | 63674 | 105734 | 164897 | baseline (main) |
| | x86_64 asm | opt | 63924 | 105192 | 164131 | |
| | x86_64 asm (unroll) | opt | 62955 | 105111 | 163861 | unroll by 4 |
| | Δ (%) | opt | +0.39 | -0.51 | -0.46 | |
| | Δ (%) (unroll) | opt | -1.13 | -0.59 | -0.63 | unroll by 4 |

@willieyz force-pushed the eliminate-caddq-intrinsics branch from 00b155f to 3819863 on January 23, 2026 06:52
@oqs-bot commented Jan 23, 2026

CBMC Results (ML-DSA-87)

Full Results (174 proofs)
| Proof | Current | Previous | Change |
|---|---|---|---|
| **TOTAL** | 2138s | 2313s | -7.6% |
| mld_attempt_signature_generation | 200s | 210s | -5% |
| polyvec_matrix_expand | 182s | 203s | -10% |
| polyvecl_pointwise_acc_montgomery_c | 154s | 196s | -21% |
| poly_pointwise_montgomery_c | 127s | 152s | -16% |
| rej_uniform_native | 127s | 136s | -7% |
| sign_verify_internal | 119s | 122s | -2% |
| polyvec_matrix_expand_serial | 101s | 107s | -6% |
| mld_ct_memcmp | 77s | 89s | -13% |
| mld_invntt_layer | 60s | 66s | -9% |
| sign_signature_internal | 57s | 60s | -5% |
| mld_ntt_layer | 46s | 46s | +0% |
| keccak_squeezeblocks_x4 | 44s | 46s | -4% |
| mld_compute_t0_t1_tr_from_sk_components | 25s | 29s | -14% |
| polymat_permute_bitrev_to_custom | 22s | 25s | -12% |
| fqmul | 20s | 21s | -5% |
| poly_chknorm_c | 19s | 17s | +12% |
| rej_uniform_c | 19s | 19s | +0% |
| rej_uniform | 18s | 22s | -18% |
| poly_uniform_eta_4x | 17s | 17s | +0% |
| polyt0_unpack | 17s | 16s | +6% |
| polyveck_add | 15s | 15s | +0% |
| polyvec_matrix_pointwise_montgomery | 14s | 15s | -7% |
| keccakf1600x4_permute_native | 13s | 12s | +8% |
| poly_uniform_4x | 13s | 14s | -7% |
| keccak_absorb_once_x4 | 12s | 12s | +0% |
| mld_ntt_butterfly_block | 12s | 12s | +0% |
| polyeta_unpack | 12s | 15s | -20% |
| polyveck_use_hint | 12s | 8s | +50% |
| mld_polyvecl_permute_bitrev_to_custom_native | 11s | 12s | -8% |
| polyveck_power2round | 11s | 15s | -27% |
| poly_decompose_c | 10s | 11s | -9% |
| polyveck_reduce | 10s | 10s | +0% |
| keccakf1600_permute | 9s | 6s | +50% |
| polyveck_ntt | 9s | 9s | +0% |
| sign | 9s | 8s | +12% |
| mld_sample_s1_s2 | 8s | 7s | +14% |
| poly_invntt_tomont_c | 8s | 10s | -20% |
| polyveck_chknorm | 8s | 9s | -11% |
| polyvecl_ntt | 8s | 10s | -20% |
| rej_eta_c | 8s | 5s | +60% |
| keccak_absorb | 7s | 5s | +40% |
| keccakf1600_permute_native | 7s | 8s | -12% |
| mld_check_pct | 7s | 8s | -12% |
| poly_uniform_eta | 7s | 5s | +40% |
| poly_uniform_gamma1_4x | 7s | 3s | +133% |
| polyveck_pointwise_poly_montgomery | 7s | 8s | -12% |
| sign_pk_from_sk | 7s | 9s | -22% |
| fqscale | 6s | 2s | +200% |
| mld_compute_pack_z | 6s | 6s | +0% |
| mld_prepare_domain_separation_prefix | 6s | 4s | +50% |
| mld_sample_s1_s2_serial | 6s | 7s | -14% |
| pack_pk | 6s | 5s | +20% |
| poly_challenge | 6s | 5s | +20% |
| polyveck_caddq | 6s | 6s | +0% |
| polyveck_decompose | 6s | 7s | -14% |
| polyveck_invntt_tomont | 6s | 8s | -25% |
| polyveck_shiftl | 6s | 6s | +0% |
| polyvecl_unpack_z | 6s | 5s | +20% |
| sign_signature_pre_hash_internal | 6s | 8s | -25% |
| unpack_pk | 6s | 4s | +50% |
| mld_ct_cmask_nonzero_u8 | 5s | 2s | +150% |
| mld_h | 5s | 2s | +150% |
| poly_sub | 5s | 3s | +67% |
| poly_uniform | 5s | 5s | +0% |
| polyeta_pack | 5s | 3s | +67% |
| polyveck_make_hint | 5s | 6s | -17% |
| polyvecl_chknorm | 5s | 3s | +67% |
| polyvecl_unpack_eta | 5s | 4s | +25% |
| rej_eta_native | 5s | 5s | +0% |
| sign_keypair | 5s | 3s | +67% |
| sign_keypair_internal | 5s | 5s | +0% |
| sign_open | 5s | 3s | +67% |
| sign_signature | 5s | 5s | +0% |
| sign_signature_extmu | 5s | 6s | -17% |
| sign_signature_pre_hash_shake256 | 5s | 7s | -29% |
| unpack_sk | 5s | 6s | -17% |
| keccak_squeeze | 4s | 4s | +0% |
| keccakf1600x4_extract_bytes | 4s | 4s | +0% |
| keccakf1600x4_permute | 4s | 2s | +100% |
| mld_keccakf1600_extract_bytes | 4s | 4s | +0% |
| pack_sig_z | 4s | 2s | +100% |
| poly_caddq_native | 4s | 2s | +100% |
| poly_caddq_native_aarch64 | 4s | 5s | -20% |
| poly_chknorm | 4s | 4s | +0% |
| poly_chknorm_native | 4s | 4s | +0% |
| poly_invntt_tomont | 4s | 4s | +0% |
| poly_make_hint | 4s | 4s | +0% |
| poly_pointwise_montgomery_native | 4s | 5s | -20% |
| poly_power2round | 4s | 3s | +33% |
| poly_use_hint_native | 4s | 4s | +0% |
| polyt0_pack | 4s | 4s | +0% |
| polyveck_sub | 4s | 5s | -20% |
| polyveck_unpack_t0 | 4s | 6s | -33% |
| polyvecl_uniform_gamma1 | 4s | 4s | +0% |
| polyvecl_uniform_gamma1_serial | 4s | 5s | -20% |
| polyz_unpack | 4s | 2s | +100% |
| shake256_release | 4s | 3s | +33% |
| sign_verify_extmu | 4s | 3s | +33% |
| sign_verify_pre_hash_internal | 4s | 4s | +0% |
| unpack_hints | 4s | 4s | +0% |
| caddq | 3s | 2s | +50% |
| decompose | 3s | 2s | +50% |
| keccakf1600_xor_bytes | 3s | 3s | +0% |
| keccakf1600x4_xor_bytes | 3s | 3s | +0% |
| mld_ct_abs_i32 | 3s | 2s | +50% |
| mld_ct_cmask_nonzero_u32 | 3s | 3s | +0% |
| mld_value_barrier_u32 | 3s | 2s | +50% |
| mld_value_barrier_u8 | 3s | 1s | +200% |
| poly_add | 3s | 7s | -57% |
| poly_caddq | 3s | 4s | -25% |
| poly_decompose | 3s | 3s | +0% |
| poly_decompose_native | 3s | 2s | +50% |
| poly_invntt_tomont_native | 3s | 3s | +0% |
| poly_ntt_c | 3s | 4s | -25% |
| poly_ntt_native | 3s | 4s | -25% |
| poly_pointwise_montgomery | 3s | 3s | +0% |
| poly_shiftl | 3s | 1s | +200% |
| poly_uniform_gamma1 | 3s | 3s | +0% |
| polyt1_unpack | 3s | 4s | -25% |
| polyveck_pack_eta | 3s | 4s | -25% |
| polyveck_pack_t0 | 3s | 2s | +50% |
| polyvecl_pack_eta | 3s | 3s | +0% |
| polyvecl_pointwise_acc_montgomery_native | 3s | 2s | +50% |
| polyz_pack | 3s | 5s | -40% |
| polyz_unpack_c | 3s | 4s | -25% |
| rej_eta | 3s | 4s | -25% |
| shake128_squeeze | 3s | 2s | +50% |
| shake256 | 3s | 3s | +0% |
| shake256_absorb | 3s | 2s | +50% |
| shake256_finalize | 3s | 2s | +50% |
| shake256x4_absorb_once | 3s | 4s | -25% |
| sign_verify | 3s | 3s | +0% |
| unpack_sig | 3s | 4s | -25% |
| keccak_finalize | 2s | 2s | +0% |
| keccak_init | 2s | 1s | +100% |
| keccakf1600_extract_bytes (big endian) | 2s | 2s | +0% |
| keccakf1600_xor_bytes (big endian) | 2s | 2s | +0% |
| mld_ct_get_optblocker_i64 | 2s | 2s | +0% |
| mld_ct_get_optblocker_u32 | 2s | 3s | -33% |
| mld_ct_get_optblocker_u8 | 2s | 1s | +100% |
| mld_ct_sel_int32 | 2s | 3s | -33% |
| montgomery_reduce | 2s | 3s | -33% |
| ntt_native_x86_64 | 2s | 3s | -33% |
| pack_sig_c_h | 2s | 5s | -60% |
| poly_caddq_c | 2s | 3s | -33% |
| poly_ntt | 2s | 5s | -60% |
| poly_reduce | 2s | 6s | -67% |
| poly_use_hint | 2s | 3s | -33% |
| poly_use_hint_c | 2s | 2s | +0% |
| polyt1_pack | 2s | 3s | -33% |
| polyveck_pack_w1 | 2s | 3s | -33% |
| polyveck_unpack_eta | 2s | 4s | -50% |
| polyvecl_permute_bitrev_to_custom | 2s | 3s | -33% |
| polyvecl_pointwise_acc_montgomery | 2s | 3s | -33% |
| power2round | 2s | 4s | -50% |
| shake128_absorb | 2s | 1s | +100% |
| shake128_finalize | 2s | 2s | +0% |
| shake128_init | 2s | 4s | -50% |
| shake128_release | 2s | 2s | +0% |
| shake128x4_absorb_once | 2s | 2s | +0% |
| shake128x4_squeezeblocks | 2s | 2s | +0% |
| shake256_squeeze | 2s | 3s | -33% |
| shake256x4_squeezeblocks | 2s | 4s | -50% |
| sign_verify_pre_hash_shake256 | 2s | 8s | -75% |
| sys_check_capability | 2s | 2s | +0% |
| make_hint | 1s | 2s | -50% |
| mld_ct_cmask_neg_i32 | 1s | 3s | -67% |
| mld_value_barrier_i64 | 1s | 2s | -50% |
| pack_sk | 1s | 3s | -67% |
| polyw1_pack | 1s | 4s | -75% |
| polyz_unpack_native | 1s | 3s | -67% |
| reduce32 | 1s | 3s | -67% |
| shake256_init | 1s | 3s | -67% |
| use_hint | 1s | 2s | -50% |

@oqs-bot commented Jan 23, 2026

CBMC Results (ML-DSA-44)

Full Results (174 proofs)
| Proof | Current | Previous | Change |
|---|---|---|---|
| **TOTAL** | 1987s | 1873s | +6.1% |
| mld_attempt_signature_generation | 237s | 235s | +1% |
| polyvecl_pointwise_acc_montgomery_c | 223s | 196s | +14% |
| poly_pointwise_montgomery_c | 142s | 135s | +5% |
| rej_uniform_native | 135s | 130s | +4% |
| sign_verify_internal | 126s | 126s | +0% |
| mld_ct_memcmp | 88s | 81s | +9% |
| mld_invntt_layer | 76s | 71s | +7% |
| keccak_squeezeblocks_x4 | 47s | 43s | +9% |
| mld_ntt_layer | 46s | 45s | +2% |
| sign_signature_internal | 39s | 38s | +3% |
| rej_uniform | 21s | 20s | +5% |
| fqmul | 20s | 22s | -9% |
| rej_uniform_c | 20s | 17s | +18% |
| poly_chknorm_c | 18s | 16s | +12% |
| mld_compute_t0_t1_tr_from_sk_components | 17s | 15s | +13% |
| polyeta_unpack | 17s | 12s | +42% |
| polyvec_matrix_expand | 17s | 16s | +6% |
| poly_uniform_eta_4x | 16s | 17s | -6% |
| polymat_permute_bitrev_to_custom | 16s | 17s | -6% |
| polyt0_unpack | 16s | 16s | +0% |
| keccak_absorb_once_x4 | 14s | 11s | +27% |
| mld_ntt_butterfly_block | 14s | 13s | +8% |
| poly_uniform_4x | 14s | 14s | +0% |
| keccakf1600x4_permute_native | 12s | 14s | -14% |
| polyz_unpack_c | 12s | 12s | +0% |
| mld_polyvecl_permute_bitrev_to_custom_native | 11s | 9s | +22% |
| poly_invntt_tomont_c | 10s | 11s | -9% |
| polyveck_ntt | 10s | 5s | +100% |
| polyvec_matrix_expand_serial | 9s | 7s | +29% |
| polyveck_reduce | 9s | 3s | +200% |
| sign_pk_from_sk | 9s | 6s | +50% |
| keccakf1600_permute | 8s | 8s | +0% |
| polyveck_add | 8s | 8s | +0% |
| polyveck_caddq | 8s | 5s | +60% |
| keccakf1600_permute_native | 7s | 8s | -12% |
| polyveck_pointwise_poly_montgomery | 7s | 5s | +40% |
| polyveck_power2round | 7s | 7s | +0% |
| keccak_squeeze | 6s | 3s | +100% |
| mld_check_pct | 6s | 7s | -14% |
| mld_compute_pack_z | 6s | 5s | +20% |
| mld_h | 6s | 3s | +100% |
| poly_caddq | 6s | 2s | +200% |
| poly_uniform_eta | 6s | 3s | +100% |
| polyvec_matrix_pointwise_montgomery | 6s | 7s | -14% |
| polyveck_sub | 6s | 4s | +50% |
| polyvecl_unpack_z | 6s | 3s | +100% |
| sign | 6s | 7s | -14% |
| sign_keypair | 6s | 3s | +100% |
| sign_verify_extmu | 6s | 5s | +20% |
| sign_verify_pre_hash_internal | 6s | 3s | +100% |
| unpack_sk | 6s | 3s | +100% |
| keccak_absorb | 5s | 6s | -17% |
| keccakf1600x4_xor_bytes | 5s | 3s | +67% |
| mld_sample_s1_s2 | 5s | 4s | +25% |
| mld_sample_s1_s2_serial | 5s | 4s | +25% |
| ntt_native_x86_64 | 5s | 2s | +150% |
| poly_make_hint | 5s | 2s | +150% |
| poly_pointwise_montgomery_native | 5s | 3s | +67% |
| poly_shiftl | 5s | 3s | +67% |
| poly_uniform | 5s | 4s | +25% |
| polyveck_decompose | 5s | 7s | -29% |
| polyveck_invntt_tomont | 5s | 6s | -17% |
| polyvecl_chknorm | 5s | 6s | -17% |
| polyvecl_ntt | 5s | 4s | +25% |
| polyvecl_pointwise_acc_montgomery_native | 5s | 4s | +25% |
| rej_eta | 5s | 4s | +25% |
| shake256_absorb | 5s | 3s | +67% |
| shake256_init | 5s | 4s | +25% |
| unpack_pk | 5s | 5s | +0% |
| keccakf1600_xor_bytes (big endian) | 4s | 2s | +100% |
| mld_keccakf1600_extract_bytes | 4s | 2s | +100% |
| mld_prepare_domain_separation_prefix | 4s | 4s | +0% |
| pack_sig_c_h | 4s | 2s | +100% |
| poly_caddq_native | 4s | 7s | -43% |
| poly_caddq_native_aarch64 | 4s | 4s | +0% |
| poly_decompose_native | 4s | 3s | +33% |
| poly_invntt_tomont | 4s | 3s | +33% |
| poly_ntt_native | 4s | 4s | +0% |
| poly_reduce | 4s | 5s | -20% |
| poly_uniform_gamma1_4x | 4s | 3s | +33% |
| poly_use_hint_native | 4s | 3s | +33% |
| polyt0_pack | 4s | 3s | +33% |
| polyt1_unpack | 4s | 2s | +100% |
| polyveck_shiftl | 4s | 7s | -43% |
| polyveck_use_hint | 4s | 6s | -33% |
| polyvecl_uniform_gamma1 | 4s | 4s | +0% |
| polyvecl_uniform_gamma1_serial | 4s | 4s | +0% |
| polyz_pack | 4s | 3s | +33% |
| polyz_unpack | 4s | 3s | +33% |
| rej_eta_c | 4s | 4s | +0% |
| rej_eta_native | 4s | 5s | -20% |
| shake128x4_absorb_once | 4s | 2s | +100% |
| shake256_finalize | 4s | 3s | +33% |
| shake256_release | 4s | 3s | +33% |
| shake256x4_squeezeblocks | 4s | 2s | +100% |
| sign_keypair_internal | 4s | 5s | -20% |
| sign_open | 4s | 4s | +0% |
| sign_signature | 4s | 6s | -33% |
| sign_signature_extmu | 4s | 4s | +0% |
| sign_verify | 4s | 6s | -33% |
| unpack_sig | 4s | 4s | +0% |
| use_hint | 4s | 4s | +0% |
| make_hint | 3s | 5s | -40% |
| mld_ct_cmask_nonzero_u8 | 3s | 1s | +200% |
| mld_ct_get_optblocker_i64 | 3s | 1s | +200% |
| mld_value_barrier_i64 | 3s | 3s | +0% |
| montgomery_reduce | 3s | 2s | +50% |
| pack_pk | 3s | 2s | +50% |
| pack_sk | 3s | 3s | +0% |
| poly_add | 3s | 3s | +0% |
| poly_challenge | 3s | 4s | -25% |
| poly_chknorm_native | 3s | 2s | +50% |
| poly_pointwise_montgomery | 3s | 3s | +0% |
| poly_power2round | 3s | 4s | -25% |
| poly_uniform_gamma1 | 3s | 2s | +50% |
| poly_use_hint_c | 3s | 3s | +0% |
| polyeta_pack | 3s | 3s | +0% |
| polyt1_pack | 3s | 3s | +0% |
| polyveck_chknorm | 3s | 5s | -40% |
| polyveck_pack_eta | 3s | 7s | -57% |
| polyveck_pack_t0 | 3s | 4s | -25% |
| polyveck_pack_w1 | 3s | 2s | +50% |
| polyveck_unpack_eta | 3s | 2s | +50% |
| polyveck_unpack_t0 | 3s | 3s | +0% |
| polyvecl_pack_eta | 3s | 3s | +0% |
| polyvecl_pointwise_acc_montgomery | 3s | 3s | +0% |
| polyvecl_unpack_eta | 3s | 4s | -25% |
| polyw1_pack | 3s | 2s | +50% |
| polyz_unpack_native | 3s | 3s | +0% |
| shake128_finalize | 3s | 1s | +200% |
| shake128_init | 3s | 2s | +50% |
| shake128_squeeze | 3s | 2s | +50% |
| shake128x4_squeezeblocks | 3s | 2s | +50% |
| shake256_squeeze | 3s | 2s | +50% |
| sign_verify_pre_hash_shake256 | 3s | 4s | -25% |
| sys_check_capability | 3s | 3s | +0% |
| unpack_hints | 3s | 5s | -40% |
| caddq | 2s | 2s | +0% |
| fqscale | 2s | 2s | +0% |
| keccak_finalize | 2s | 2s | +0% |
| keccak_init | 2s | 1s | +100% |
| keccakf1600_extract_bytes (big endian) | 2s | 2s | +0% |
| keccakf1600_xor_bytes | 2s | 1s | +100% |
| keccakf1600x4_extract_bytes | 2s | 1s | +100% |
| keccakf1600x4_permute | 2s | 5s | -60% |
| mld_ct_cmask_neg_i32 | 2s | 2s | +0% |
| mld_ct_cmask_nonzero_u32 | 2s | 2s | +0% |
| mld_ct_get_optblocker_u8 | 2s | 3s | -33% |
| mld_ct_sel_int32 | 2s | 2s | +0% |
| mld_value_barrier_u8 | 2s | 2s | +0% |
| pack_sig_z | 2s | 3s | -33% |
| poly_caddq_c | 2s | 2s | +0% |
| poly_chknorm | 2s | 1s | +100% |
| poly_decompose | 2s | 4s | -50% |
| poly_decompose_c | 2s | 1s | +100% |
| poly_invntt_tomont_native | 2s | 3s | -33% |
| poly_ntt | 2s | 2s | +0% |
| poly_ntt_c | 2s | 2s | +0% |
| poly_sub | 2s | 5s | -60% |
| poly_use_hint | 2s | 4s | -50% |
| polyveck_make_hint | 2s | 4s | -50% |
| polyvecl_permute_bitrev_to_custom | 2s | 5s | -60% |
| power2round | 2s | 2s | +0% |
| reduce32 | 2s | 6s | -67% |
| shake128_absorb | 2s | 1s | +100% |
| shake128_release | 2s | 2s | +0% |
| shake256 | 2s | 2s | +0% |
| shake256x4_absorb_once | 2s | 4s | -50% |
| sign_signature_pre_hash_internal | 2s | 2s | +0% |
| sign_signature_pre_hash_shake256 | 2s | 3s | -33% |
| decompose | 1s | 5s | -80% |
| mld_ct_abs_i32 | 1s | 2s | -50% |
| mld_ct_get_optblocker_u32 | 1s | 2s | -50% |
| mld_value_barrier_u32 | 1s | 4s | -75% |

@oqs-bot commented Jan 23, 2026

CBMC Results (ML-DSA-65)

Full Results (174 proofs)
| Proof | Current | Previous | Change |
|---|---|---|---|
| **TOTAL** | 2432s | 2432s | +0.0% |
| mld_attempt_signature_generation | 415s | 409s | +1% |
| polyvecl_pointwise_acc_montgomery_c | 235s | 247s | -5% |
| sign_verify_internal | 172s | 171s | +1% |
| poly_pointwise_montgomery_c | 146s | 143s | +2% |
| rej_uniform_native | 135s | 137s | -1% |
| polyvec_matrix_expand | 134s | 134s | +0% |
| mld_ct_memcmp | 87s | 88s | -1% |
| polyvec_matrix_expand_serial | 65s | 63s | +3% |
| mld_invntt_layer | 63s | 63s | +0% |
| sign_signature_internal | 47s | 49s | -4% |
| keccak_squeezeblocks_x4 | 44s | 46s | -4% |
| mld_ntt_layer | 43s | 46s | -7% |
| mld_compute_t0_t1_tr_from_sk_components | 26s | 27s | -4% |
| rej_uniform | 22s | 22s | +0% |
| polymat_permute_bitrev_to_custom | 21s | 20s | +5% |
| polyveck_decompose | 19s | 16s | +19% |
| rej_uniform_c | 19s | 18s | +6% |
| poly_uniform_eta_4x | 18s | 17s | +6% |
| fqmul | 17s | 20s | -15% |
| polyt0_unpack | 16s | 15s | +7% |
| polyvec_matrix_pointwise_montgomery | 16s | 14s | +14% |
| poly_uniform_4x | 15s | 15s | +0% |
| keccak_absorb_once_x4 | 14s | 15s | -7% |
| keccakf1600x4_permute_native | 14s | 13s | +8% |
| mld_ntt_butterfly_block | 12s | 14s | -14% |
| poly_chknorm_c | 12s | 13s | -8% |
| sign | 12s | 10s | +20% |
| polyveck_use_hint | 11s | 13s | -15% |
| mld_check_pct | 9s | 11s | -18% |
| mld_polyvecl_permute_bitrev_to_custom_native | 9s | 9s | +0% |
| polyveck_add | 9s | 10s | -10% |
| keccakf1600_permute_native | 8s | 9s | -11% |
| mld_sample_s1_s2 | 8s | 5s | +60% |
| poly_decompose_c | 8s | 7s | +14% |
| poly_invntt_tomont_c | 8s | 10s | -20% |
| poly_uniform | 8s | 4s | +100% |
| polyveck_pointwise_poly_montgomery | 8s | 6s | +33% |
| polyveck_power2round | 8s | 10s | -20% |
| polyveck_reduce | 8s | 4s | +100% |
| polyveck_shiftl | 8s | 7s | +14% |
| polyveck_sub | 8s | 6s | +33% |
| polyvecl_ntt | 8s | 6s | +33% |
| keccakf1600_permute | 7s | 9s | -22% |
| polyveck_invntt_tomont | 7s | 7s | +0% |
| polyveck_ntt | 7s | 9s | -22% |
| sign_keypair_internal | 7s | 4s | +75% |
| sign_pk_from_sk | 7s | 8s | -12% |
| sign_verify_pre_hash_internal | 7s | 5s | +40% |
| sign_verify_pre_hash_shake256 | 7s | 3s | +133% |
| mld_sample_s1_s2_serial | 6s | 5s | +20% |
| poly_make_hint | 6s | 3s | +100% |
| poly_ntt | 6s | 2s | +200% |
| poly_uniform_eta | 6s | 4s | +50% |
| poly_use_hint_c | 6s | 5s | +20% |
| polyeta_unpack | 6s | 9s | -33% |
| polyveck_caddq | 6s | 7s | -14% |
| polyveck_chknorm | 6s | 2s | +200% |
| polyvecl_pointwise_acc_montgomery_native | 6s | 5s | +20% |
| polyz_unpack_c | 6s | 6s | +0% |
| rej_eta_c | 6s | 6s | +0% |
| rej_eta_native | 6s | 3s | +100% |
| sign_signature | 6s | 6s | +0% |
| mld_compute_pack_z | 5s | 7s | -29% |
| poly_pointwise_montgomery_native | 5s | 4s | +25% |
| polyvecl_permute_bitrev_to_custom | 5s | 3s | +67% |
| shake256x4_absorb_once | 5s | 3s | +67% |
| sign_open | 5s | 3s | +67% |
| sign_signature_pre_hash_internal | 5s | 2s | +150% |
| unpack_hints | 5s | 4s | +25% |
| unpack_sk | 5s | 5s | +0% |
| use_hint | 5s | 3s | +67% |
| caddq | 4s | 1s | +300% |
| keccak_absorb | 4s | 7s | -43% |
| keccak_squeeze | 4s | 3s | +33% |
| mld_value_barrier_u32 | 4s | 1s | +300% |
| pack_sk | 4s | 3s | +33% |
| poly_add | 4s | 5s | -20% |
| poly_caddq | 4s | 2s | +100% |
| poly_caddq_native | 4s | 5s | -20% |
| poly_challenge | 4s | 4s | +0% |
| poly_chknorm_native | 4s | 2s | +100% |
| poly_ntt_c | 4s | 4s | +0% |
| poly_sub | 4s | 3s | +33% |
| poly_uniform_gamma1_4x | 4s | 6s | -33% |
| poly_use_hint | 4s | 3s | +33% |
| poly_use_hint_native | 4s | 6s | -33% |
| polyveck_make_hint | 4s | 5s | -20% |
| polyveck_pack_eta | 4s | 2s | +100% |
| polyveck_pack_w1 | 4s | 2s | +100% |
| polyveck_unpack_t0 | 4s | 5s | -20% |
| polyvecl_chknorm | 4s | 6s | -33% |
| polyvecl_uniform_gamma1 | 4s | 5s | -20% |
| polyz_unpack | 4s | 3s | +33% |
| power2round | 4s | 4s | +0% |
| reduce32 | 4s | 1s | +300% |
| shake128_init | 4s | 1s | +300% |
| sign_verify_extmu | 4s | 3s | +33% |
| unpack_pk | 4s | 3s | +33% |
| decompose | 3s | 2s | +50% |
| keccak_finalize | 3s | 3s | +0% |
| keccakf1600_extract_bytes (big endian) | 3s | 3s | +0% |
| keccakf1600x4_permute | 3s | 3s | +0% |
| make_hint | 3s | 2s | +50% |
| mld_ct_cmask_nonzero_u32 | 3s | 1s | +200% |
| mld_ct_cmask_nonzero_u8 | 3s | 5s | -40% |
| mld_ct_get_optblocker_u8 | 3s | 3s | +0% |
| mld_value_barrier_i64 | 3s | 2s | +50% |
| montgomery_reduce | 3s | 3s | +0% |
| ntt_native_x86_64 | 3s | 3s | +0% |
| pack_pk | 3s | 3s | +0% |
| pack_sig_z | 3s | 2s | +50% |
| poly_caddq_c | 3s | 3s | +0% |
| poly_caddq_native_aarch64 | 3s | 3s | +0% |
| poly_decompose_native | 3s | 4s | -25% |
| poly_invntt_tomont | 3s | 3s | +0% |
| poly_ntt_native | 3s | 4s | -25% |
| polyeta_pack | 3s | 3s | +0% |
| polyt0_pack | 3s | 5s | -40% |
| polyt1_pack | 3s | 3s | +0% |
| polyveck_pack_t0 | 3s | 3s | +0% |
| polyvecl_pack_eta | 3s | 4s | -25% |
| polyvecl_unpack_z | 3s | 3s | +0% |
| polyz_unpack_native | 3s | 7s | -57% |
| rej_eta | 3s | 2s | +50% |
| shake128_release | 3s | 3s | +0% |
| shake128x4_absorb_once | 3s | 5s | -40% |
| shake128x4_squeezeblocks | 3s | 3s | +0% |
| shake256 | 3s | 3s | +0% |
| shake256_release | 3s | 1s | +200% |
| shake256_squeeze | 3s | 2s | +50% |
| shake256x4_squeezeblocks | 3s | 2s | +50% |
| sign_keypair | 3s | 4s | -25% |
| sign_signature_extmu | 3s | 6s | -50% |
| sign_signature_pre_hash_shake256 | 3s | 5s | -40% |
| sign_verify | 3s | 4s | -25% |
| unpack_sig | 3s | 4s | -25% |
| fqscale | 2s | 3s | -33% |
| keccak_init | 2s | 2s | +0% |
| keccakf1600_xor_bytes | 2s | 3s | -33% |
| keccakf1600x4_extract_bytes | 2s | 2s | +0% |
| mld_ct_abs_i32 | 2s | 3s | -33% |
| mld_ct_cmask_neg_i32 | 2s | 2s | +0% |
| mld_ct_get_optblocker_i64 | 2s | 2s | +0% |
| mld_ct_get_optblocker_u32 | 2s | 3s | -33% |
| mld_ct_sel_int32 | 2s | 2s | +0% |
| mld_h | 2s | 5s | -60% |
| mld_keccakf1600_extract_bytes | 2s | 3s | -33% |
| mld_prepare_domain_separation_prefix | 2s | 4s | -50% |
| mld_value_barrier_u8 | 2s | 4s | -50% |
| pack_sig_c_h | 2s | 3s | -33% |
| poly_chknorm | 2s | 4s | -50% |
| poly_decompose | 2s | 2s | +0% |
| poly_pointwise_montgomery | 2s | 1s | +100% |
| poly_power2round | 2s | 2s | +0% |
| poly_reduce | 2s | 3s | -33% |
| poly_shiftl | 2s | 4s | -50% |
| poly_uniform_gamma1 | 2s | 4s | -50% |
| polyt1_unpack | 2s | 5s | -60% |
| polyveck_unpack_eta | 2s | 4s | -50% |
| polyvecl_pointwise_acc_montgomery | 2s | 2s | +0% |
| polyvecl_uniform_gamma1_serial | 2s | 5s | -60% |
| polyvecl_unpack_eta | 2s | 4s | -50% |
| polyw1_pack | 2s | 1s | +100% |
| polyz_pack | 2s | 3s | -33% |
| shake128_absorb | 2s | 4s | -50% |
| shake128_finalize | 2s | 2s | +0% |
| shake128_squeeze | 2s | 5s | -60% |
| shake256_absorb | 2s | 2s | +0% |
| shake256_finalize | 2s | 1s | +100% |
| shake256_init | 2s | 2s | +0% |
| sys_check_capability | 2s | 2s | +0% |
| keccakf1600_xor_bytes (big endian) | 1s | 1s | +0% |
| keccakf1600x4_xor_bytes | 1s | 1s | +0% |
| poly_invntt_tomont_native | 1s | 4s | -75% |

@github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (opt)

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 46204 cycles | 46203 cycles | 1.00 |
| ML-DSA-44 sign | 131289 cycles | 131278 cycles | 1.00 |
| ML-DSA-44 verify | 47763 cycles | 47762 cycles | 1.00 |
| ML-DSA-65 keypair | 81015 cycles | 81014 cycles | 1.00 |
| ML-DSA-65 sign | 215763 cycles | 215783 cycles | 1.00 |
| ML-DSA-65 verify | 80054 cycles | 80051 cycles | 1.00 |
| ML-DSA-87 keypair | 132159 cycles | 132161 cycles | 1.00 |
| ML-DSA-87 sign | 276888 cycles | 276854 cycles | 1.00 |
| ML-DSA-87 verify | 130426 cycles | 130402 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (no-opt)

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 114189 cycles | 114160 cycles | 1.00 |
| ML-DSA-44 sign | 418072 cycles | 417949 cycles | 1.00 |
| ML-DSA-44 verify | 122294 cycles | 122254 cycles | 1.00 |
| ML-DSA-65 keypair | 195495 cycles | 195504 cycles | 1.00 |
| ML-DSA-65 sign | 682472 cycles | 682465 cycles | 1.00 |
| ML-DSA-65 verify | 197737 cycles | 197733 cycles | 1.00 |
| ML-DSA-87 keypair | 322648 cycles | 322653 cycles | 1.00 |
| ML-DSA-87 sign | 864619 cycles | 864668 cycles | 1.00 |
| ML-DSA-87 verify | 328624 cycles | 328682 cycles | 1.00 |

@oqs-bot left a comment

Intel Xeon 4th gen (c7i)

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 34471 cycles | 34561 cycles | 1.00 |
| ML-DSA-44 sign | 120257 cycles | 119604 cycles | 1.01 |
| ML-DSA-44 verify | 38102 cycles | 38161 cycles | 1.00 |
| ML-DSA-65 keypair | 61645 cycles | 61342 cycles | 1.00 |
| ML-DSA-65 sign | 202965 cycles | 201886 cycles | 1.01 |
| ML-DSA-65 verify | 62950 cycles | 63038 cycles | 1.00 |
| ML-DSA-87 keypair | 94655 cycles | 93985 cycles | 1.01 |
| ML-DSA-87 sign | 237727 cycles | 239107 cycles | 0.99 |
| ML-DSA-87 verify | 94851 cycles | 96550 cycles | 0.98 |

@github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 237495 cycles | 229189 cycles | 1.04 |
| ML-DSA-44 sign | 617962 cycles | 646441 cycles | 0.96 |
| ML-DSA-44 verify | 221067 cycles | 226888 cycles | 0.97 |
| ML-DSA-65 keypair | 392612 cycles | 411260 cycles | 0.95 |
| ML-DSA-65 sign | 1045982 cycles | 1058663 cycles | 0.99 |
| ML-DSA-65 verify | 377333 cycles | 393295 cycles | 0.96 |
| ML-DSA-87 keypair | 648120 cycles | 682350 cycles | 0.95 |
| ML-DSA-87 sign | 1340435 cycles | 1396069 cycles | 0.96 |
| ML-DSA-87 verify | 621700 cycles | 651014 cycles | 0.95 |

@oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 93617 cycles | 93653 cycles | 1.00 |
| ML-DSA-44 sign | 333332 cycles | 333354 cycles | 1.00 |
| ML-DSA-44 verify | 99744 cycles | 99709 cycles | 1.00 |
| ML-DSA-65 keypair | 160092 cycles | 160242 cycles | 1.00 |
| ML-DSA-65 sign | 545851 cycles | 546031 cycles | 1.00 |
| ML-DSA-65 verify | 160873 cycles | 160833 cycles | 1.00 |
| ML-DSA-87 keypair | 268252 cycles | 267347 cycles | 1.00 |
| ML-DSA-87 sign | 707830 cycles | 706548 cycles | 1.00 |
| ML-DSA-87 verify | 270627 cycles | 270921 cycles | 1.00 |

@oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 69051 cycles | 69255 cycles | 1.00 |
| ML-DSA-44 sign | 187895 cycles | 188098 cycles | 1.00 |
| ML-DSA-44 verify | 69083 cycles | 69380 cycles | 1.00 |
| ML-DSA-65 keypair | 119654 cycles | 120115 cycles | 1.00 |
| ML-DSA-65 sign | 299489 cycles | 301522 cycles | 0.99 |
| ML-DSA-65 verify | 115283 cycles | 115505 cycles | 1.00 |
| ML-DSA-87 keypair | 203725 cycles | 204908 cycles | 0.99 |
| ML-DSA-87 sign | 392930 cycles | 396816 cycles | 0.99 |
| ML-DSA-87 verify | 195673 cycles | 197182 cycles | 0.99 |

@oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 56503 cycles | 59724 cycles | 0.95 |
| ML-DSA-44 sign | 180796 cycles | 193453 cycles | 0.93 |
| ML-DSA-44 verify | 61187 cycles | 65079 cycles | 0.94 |
| ML-DSA-65 keypair | 98682 cycles | 104370 cycles | 0.95 |
| ML-DSA-65 sign | 298537 cycles | 315933 cycles | 0.94 |
| ML-DSA-65 verify | 100423 cycles | 106176 cycles | 0.95 |
| ML-DSA-87 keypair | 156536 cycles | 162885 cycles | 0.96 |
| ML-DSA-87 sign | 364760 cycles | 379531 cycles | 0.96 |
| ML-DSA-87 verify | 156758 cycles | 164743 cycles | 0.95 |

@oqs-bot left a comment

AMD EPYC 4th gen (c7a)

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 42082 cycles | 41564 cycles | 1.01 |
| ML-DSA-44 sign | 134163 cycles | 133585 cycles | 1.00 |
| ML-DSA-44 verify | 45157 cycles | 44717 cycles | 1.01 |
| ML-DSA-65 keypair | 73088 cycles | 72591 cycles | 1.01 |
| ML-DSA-65 sign | 214510 cycles | 214322 cycles | 1.00 |
| ML-DSA-65 verify | 73546 cycles | 73308 cycles | 1.00 |
| ML-DSA-87 keypair | 108117 cycles | 108001 cycles | 1.00 |
| ML-DSA-87 sign | 252050 cycles | 253603 cycles | 0.99 |
| ML-DSA-87 verify | 111866 cycles | 109742 cycles | 1.02 |

@oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'AMD EPYC 4th gen (c7a)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: 3819863 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-65 keypair | 75829 cycles | 72591 cycles | 1.04 |
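The Ratio column in these bot comments appears to be current cycles divided by previous cycles, rounded to two decimals, and the alert fires when that ratio exceeds the configured threshold (1.03 here). A minimal C sketch of that rule, assuming this "smaller is better" interpretation; the function names are hypothetical and this is not github-action-benchmark's own code:

```c
#include <assert.h>
#include <stdbool.h>

/* Ratio for a "smaller is better" metric such as cycles:
 * values above 1.0 mean the current commit is slower. */
static double bench_ratio(double current, double previous)
{
  assert(previous > 0.0); /* previous run must have a positive cycle count */
  return current / previous;
}

/* Flag a possible regression when the ratio exceeds the threshold. */
static bool is_regression(double current, double previous, double threshold)
{
  return bench_ratio(current, previous) > threshold;
}
```

With the numbers above, 75829 / 72591 ≈ 1.045, which exceeds 1.03 and triggers the alert, while the f9a6d30 measurement for the same benchmark (73088 cycles, ratio ≈ 1.007) does not.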

@oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 135619 cycles | 134864 cycles | 1.01 |
| ML-DSA-44 sign | 526267 cycles | 523274 cycles | 1.01 |
| ML-DSA-44 verify | 148308 cycles | 147587 cycles | 1.00 |
| ML-DSA-65 keypair | 226650 cycles | 226332 cycles | 1.00 |
| ML-DSA-65 sign | 860097 cycles | 860452 cycles | 1.00 |
| ML-DSA-65 verify | 234863 cycles | 234687 cycles | 1.00 |
| ML-DSA-87 keypair | 370322 cycles | 370327 cycles | 1.00 |
| ML-DSA-87 sign | 1078704 cycles | 1078650 cycles | 1.00 |
| ML-DSA-87 verify | 381978 cycles | 382103 cycles | 1.00 |

@oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 157726 cycles | 166712 cycles | 0.95 |
| ML-DSA-44 sign | 550116 cycles | 582606 cycles | 0.94 |
| ML-DSA-44 verify | 169086 cycles | 179293 cycles | 0.94 |
| ML-DSA-65 keypair | 267836 cycles | 285346 cycles | 0.94 |
| ML-DSA-65 sign | 902166 cycles | 964659 cycles | 0.94 |
| ML-DSA-65 verify | 274146 cycles | 292689 cycles | 0.94 |
| ML-DSA-87 keypair | 447769 cycles | 479766 cycles | 0.93 |
| ML-DSA-87 sign | 1157051 cycles | 1244761 cycles | 0.93 |
| ML-DSA-87 verify | 457856 cycles | 490453 cycles | 0.93 |

@oqs-bot left a comment

Graviton4

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 68281 cycles | 68206 cycles | 1.00 |
| ML-DSA-44 sign | 201979 cycles | 201988 cycles | 1.00 |
| ML-DSA-44 verify | 70738 cycles | 70695 cycles | 1.00 |
| ML-DSA-65 keypair | 121375 cycles | 121162 cycles | 1.00 |
| ML-DSA-65 sign | 330717 cycles | 331255 cycles | 1.00 |
| ML-DSA-65 verify | 118005 cycles | 118031 cycles | 1.00 |
| ML-DSA-87 keypair | 198121 cycles | 198330 cycles | 1.00 |
| ML-DSA-87 sign | 426802 cycles | 426779 cycles | 1.00 |
| ML-DSA-87 verify | 194748 cycles | 194253 cycles | 1.00 |

@oqs-bot left a comment

Graviton3

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 72334 cycles | 72197 cycles | 1.00 |
| ML-DSA-44 sign | 212135 cycles | 212050 cycles | 1.00 |
| ML-DSA-44 verify | 75716 cycles | 75727 cycles | 1.00 |
| ML-DSA-65 keypair | 127531 cycles | 127412 cycles | 1.00 |
| ML-DSA-65 sign | 350281 cycles | 350180 cycles | 1.00 |
| ML-DSA-65 verify | 125483 cycles | 125339 cycles | 1.00 |
| ML-DSA-87 keypair | 205301 cycles | 208131 cycles | 0.99 |
| ML-DSA-87 sign | 443389 cycles | 448959 cycles | 0.99 |
| ML-DSA-87 verify | 205204 cycles | 205063 cycles | 1.00 |

@oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 120766 cycles | 120426 cycles | 1.00 |
| ML-DSA-44 sign | 452473 cycles | 449374 cycles | 1.01 |
| ML-DSA-44 verify | 130839 cycles | 131829 cycles | 0.99 |
| ML-DSA-65 keypair | 205544 cycles | 208687 cycles | 0.98 |
| ML-DSA-65 sign | 729796 cycles | 740391 cycles | 0.99 |
| ML-DSA-65 verify | 211042 cycles | 213436 cycles | 0.99 |
| ML-DSA-87 keypair | 338441 cycles | 337635 cycles | 1.00 |
| ML-DSA-87 sign | 929258 cycles | 924886 cycles | 1.00 |
| ML-DSA-87 verify | 348567 cycles | 345917 cycles | 1.01 |

@oqs-bot left a comment

Graviton4 (no-opt)

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 128278 cycles | 128259 cycles | 1.00 |
| ML-DSA-44 sign | 447568 cycles | 447651 cycles | 1.00 |
| ML-DSA-44 verify | 138373 cycles | 138315 cycles | 1.00 |
| ML-DSA-65 keypair | 220146 cycles | 220341 cycles | 1.00 |
| ML-DSA-65 sign | 727221 cycles | 727602 cycles | 1.00 |
| ML-DSA-65 verify | 223062 cycles | 223189 cycles | 1.00 |
| ML-DSA-87 keypair | 365113 cycles | 365093 cycles | 1.00 |
| ML-DSA-87 sign | 926622 cycles | 926051 cycles | 1.00 |
| ML-DSA-87 verify | 372778 cycles | 372761 cycles | 1.00 |

@oqs-bot left a comment

Graviton3 (no-opt)

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 138533 cycles | 138530 cycles | 1.00 |
| ML-DSA-44 sign | 484119 cycles | 484127 cycles | 1.00 |
| ML-DSA-44 verify | 148714 cycles | 148699 cycles | 1.00 |
| ML-DSA-65 keypair | 242002 cycles | 242316 cycles | 1.00 |
| ML-DSA-65 sign | 792696 cycles | 792717 cycles | 1.00 |
| ML-DSA-65 verify | 241201 cycles | 241180 cycles | 1.00 |
| ML-DSA-87 keypair | 396212 cycles | 396270 cycles | 1.00 |
| ML-DSA-87 sign | 1012825 cycles | 1012390 cycles | 1.00 |
| ML-DSA-87 verify | 402487 cycles | 402495 cycles | 1.00 |

@oqs-bot left a comment

Graviton2

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 113890 cycles | 113785 cycles | 1.00 |
| ML-DSA-44 sign | 356860 cycles | 356400 cycles | 1.00 |
| ML-DSA-44 verify | 118531 cycles | 118156 cycles | 1.00 |
| ML-DSA-65 keypair | 197180 cycles | 196636 cycles | 1.00 |
| ML-DSA-65 sign | 589760 cycles | 589236 cycles | 1.00 |
| ML-DSA-65 verify | 194927 cycles | 194738 cycles | 1.00 |
| ML-DSA-87 keypair | 323665 cycles | 323344 cycles | 1.00 |
| ML-DSA-87 sign | 755812 cycles | 754065 cycles | 1.00 |
| ML-DSA-87 verify | 321048 cycles | 320254 cycles | 1.00 |

@github-actions bot left a comment

SpacemiT K1 8 (Banana Pi F3) benchmarks (no-opt)

Benchmark suite Current: f9a6d30 Previous: 9258ea1 Ratio
ML-DSA-44 keypair 829325 cycles 827416 cycles 1.00
ML-DSA-44 sign 3239633 cycles 3233101 cycles 1.00
ML-DSA-44 verify 925127 cycles 922573 cycles 1.00
ML-DSA-65 keypair 1413241 cycles 1410466 cycles 1.00
ML-DSA-65 sign 5353017 cycles 5337064 cycles 1.00
ML-DSA-65 verify 1482432 cycles 1479411 cycles 1.00
ML-DSA-87 keypair 2313345 cycles 2308452 cycles 1.00
ML-DSA-87 sign 6671028 cycles 6657983 cycles 1.00
ML-DSA-87 verify 2417189 cycles 2413172 cycles 1.00

@oqs-bot left a comment

Graviton2 (no-opt)

Benchmark suite Current: f9a6d30 Previous: 9258ea1 Ratio
ML-DSA-44 keypair 213826 cycles 212836 cycles 1.00
ML-DSA-44 sign 762167 cycles 760705 cycles 1.00
ML-DSA-44 verify 241958 cycles 229196 cycles 1.06
ML-DSA-65 keypair 381627 cycles 380999 cycles 1.00
ML-DSA-65 sign 1253488 cycles 1254188 cycles 1.00
ML-DSA-65 verify 372913 cycles 372030 cycles 1.00
ML-DSA-87 keypair 606826 cycles 604389 cycles 1.00
ML-DSA-87 sign 1594099 cycles 1595105 cycles 1.00
ML-DSA-87 verify 618467 cycles 618551 cycles 1.00

@github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)

Benchmark suite Current: f9a6d30 Previous: 9258ea1 Ratio
ML-DSA-44 keypair 309195 cycles 299195 cycles 1.03
ML-DSA-44 sign 1168415 cycles 1162268 cycles 1.01
ML-DSA-44 verify 335171 cycles 330180 cycles 1.02
ML-DSA-65 keypair 561211 cycles 555502 cycles 1.01
ML-DSA-65 sign 1917406 cycles 1912815 cycles 1.00
ML-DSA-65 verify 537657 cycles 527139 cycles 1.02
ML-DSA-87 keypair 863436 cycles 868917 cycles 0.99
ML-DSA-87 sign 2447930 cycles 2435700 cycles 1.01
ML-DSA-87 verify 887441 cycles 879744 cycles 1.01

@github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: f9a6d30 Previous: 9258ea1 Ratio
ML-DSA-44 keypair 309195 cycles 299195 cycles 1.03

@github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)

Benchmark suite Current: f9a6d30 Previous: 9258ea1 Ratio
ML-DSA-44 keypair 272576 cycles 271747 cycles 1.00
ML-DSA-44 sign 801220 cycles 799220 cycles 1.00
ML-DSA-44 verify 273471 cycles 272494 cycles 1.00
ML-DSA-65 keypair 466875 cycles 469149 cycles 1.00
ML-DSA-65 sign 1312938 cycles 1319018 cycles 1.00
ML-DSA-65 verify 449913 cycles 451950 cycles 1.00
ML-DSA-87 keypair 806611 cycles 805651 cycles 1.00
ML-DSA-87 sign 1800985 cycles 1810381 cycles 0.99
ML-DSA-87 verify 782904 cycles 783507 cycles 1.00

@willieyz force-pushed the eliminate-caddq-intrinsics branch 2 times, most recently from 72bc3f8 to d186f5e on January 26, 2026 10:12
This commit replaces the current AVX2 intrinsics implementation of caddq with x86_64
assembly code.

Signed-off-by: willieyz <willie.zhao@chelpis.com>
@willieyz force-pushed the eliminate-caddq-intrinsics branch from a467f42 to 10f3614 on January 26, 2026 10:49
This commit adds mld_poly_caddq to the benchmark components to evaluate
the performance impact of replacing the caddq AVX2 intrinsics
with x86_64 assembly code.

Signed-off-by: willieyz <willie.zhao@chelpis.com>

@github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)

Benchmark suite Current: f9a6d30 Previous: 9258ea1 Ratio
ML-DSA-44 keypair 113342 cycles 113370 cycles 1.00
ML-DSA-44 sign 356021 cycles 355986 cycles 1.00
ML-DSA-44 verify 117872 cycles 118036 cycles 1.00
ML-DSA-65 keypair 196532 cycles 196544 cycles 1.00
ML-DSA-65 sign 589189 cycles 589033 cycles 1.00
ML-DSA-65 verify 194577 cycles 194759 cycles 1.00
ML-DSA-87 keypair 322408 cycles 322752 cycles 1.00
ML-DSA-87 sign 752104 cycles 753067 cycles 1.00
ML-DSA-87 verify 319915 cycles 320159 cycles 1.00

@github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

Benchmark suite Current: f9a6d30 Previous: 9258ea1 Ratio
ML-DSA-44 keypair 212629 cycles 212540 cycles 1.00
ML-DSA-44 sign 760055 cycles 759958 cycles 1.00
ML-DSA-44 verify 228848 cycles 228975 cycles 1.00
ML-DSA-65 keypair 380543 cycles 380692 cycles 1.00
ML-DSA-65 sign 1252397 cycles 1252836 cycles 1.00
ML-DSA-65 verify 371721 cycles 371790 cycles 1.00
ML-DSA-87 keypair 604737 cycles 604270 cycles 1.00
ML-DSA-87 sign 1593720 cycles 1593938 cycles 1.00
ML-DSA-87 verify 618504 cycles 618393 cycles 1.00

@oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Graviton2 (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: f9a6d30 Previous: 9258ea1 Ratio
ML-DSA-44 verify 241958 cycles 229196 cycles 1.06

@github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (no-opt)

Benchmark suite Current: f9a6d30 Previous: 9258ea1 Ratio
ML-DSA-44 keypair 461720 cycles 461204 cycles 1.00
ML-DSA-44 sign 2132007 cycles 2133615 cycles 1.00
ML-DSA-44 verify 546386 cycles 546329 cycles 1.00
ML-DSA-65 keypair 773928 cycles 774556 cycles 1.00
ML-DSA-65 sign 3496455 cycles 3505243 cycles 1.00
ML-DSA-65 verify 849310 cycles 849774 cycles 1.00
ML-DSA-87 keypair 1253417 cycles 1251282 cycles 1.00
ML-DSA-87 sign 4370207 cycles 4327691 cycles 1.01
ML-DSA-87 verify 1368861 cycles 1367270 cycles 1.00

@github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: f9a6d30 Previous: 9258ea1 Ratio
ML-DSA-44 keypair 237495 cycles 229189 cycles 1.04

@willieyz force-pushed the eliminate-caddq-intrinsics branch from 6faaac2 to 5b1b8a7 on January 27, 2026 10:46
Signed-off-by: willieyz <willie.zhao@chelpis.com>
@willieyz force-pushed the eliminate-caddq-intrinsics branch from 5b1b8a7 to f9a6d30 on January 28, 2026 03:58
@willieyz marked this pull request as ready for review on January 28, 2026 06:42
@willieyz requested a review from a team as a code owner on January 28, 2026 06:42
Development

Successfully merging this pull request may close these issues.

AVX2: Replace intrinsics implementation of poly_caddq with assembly

4 participants