Optimize is_permutation for vector<bool> #6148

Open
AlexGuteniev wants to merge 9 commits into microsoft:main from AlexGuteniev:order-mutant-vector-bool

Conversation

@AlexGuteniev
Contributor

@AlexGuteniev AlexGuteniev commented Mar 7, 2026

I was looking into bit tricks for next_permutation / prev_permutation.
It seems there will be no impressive results there, as these operations already run on the order of ~15 ns.
I may look into them further, though.

But let's go for the easiest optimization here that gains 100x and more!

The optimized algorithm first calls mismatch to skip the common prefix, then counts the true values in each of the remaining ranges and compares the counts.
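As a rough illustration of the two steps described above, here is a minimal sketch written against the public algorithms only. The name `is_permutation_bool_sketch` is hypothetical, and this is not the code in the PR; it assumes both ranges have `bool` elements compared with plain equality.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical name; illustrates the idea only, not the code in this PR.
// Assumes both ranges hold bool values compared with plain equality.
template <class FwdIt1, class FwdIt2>
bool is_permutation_bool_sketch(FwdIt1 first1, FwdIt1 last1, FwdIt2 first2, FwdIt2 last2) {
    // Ranges of different length can never be permutations of each other.
    if (std::distance(first1, last1) != std::distance(first2, last2)) {
        return false;
    }

    // Skip the common prefix; fully equal ranges finish here after N comparisons.
    const auto rest = std::mismatch(first1, last1, first2, last2);
    if (rest.first == last1) {
        return true;
    }

    // The tails have equal length, so equal counts of true imply equal counts of false.
    return std::count(rest.first, last1, true) == std::count(rest.second, last2, true);
}
```

With bool elements there are only two possible values, so two equal-length ranges are permutations of each other exactly when they contain the same number of true values; that is what makes the count comparison sufficient.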

Mismatch

The mismatch step preserves the equal-ranges optimization implied by the standard, which allows on the order of N**2 comparisons in general, but only N if the ranges are equal. If neither range is a vector<bool>, we have these cases:

  • Both ranges contiguous. mismatch becomes vectorized mismatch, and count becomes vectorized count.
    • The reason to call the algorithms, rather than use an inline implementation, is specifically vectorization.
    • In the equal case, vectorized mismatch is slightly faster.
    • The unequal case is likely to be slightly slower, because the mismatch call is wasted and will likely just misalign the input for the count call by advancing a few elements.
  • Both ranges noncontiguous. With mismatch, the number of comparisons is halved in the equal case.
  • One contiguous range, one noncontiguous range. Similar to the case above, except that one of the counts is vectorized, so mismatch is not that much faster.

So overall mismatch should be there.

If one or both of the ranges are vector<bool>, the conclusion flips:

  • mismatch for vector<bool> is not currently optimized.
  • If we optimize it, we are not likely to optimize it for misaligned cases.
  • Even if misaligned cases are optimized, they will not perform as well as aligned ones.
  • There are also mixed cases, which we are even less likely to optimize.
  • In contrast, count is optimized equally well for aligned and misaligned cases.
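One plausible reason counting copes well with misalignment is that only the first and last words of a packed bit range need masking, while every word in between is counted whole. The sketch below illustrates that idea under stated assumptions: `count_bits_sketch` and `popcount_sketch` are hypothetical helpers, the 64-bit word layout is assumed for illustration, and a real implementation would likely use a hardware popcount rather than the portable loop shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical portable popcount (clears the lowest set bit per iteration).
inline std::size_t popcount_sketch(std::uint64_t x) {
    std::size_t n = 0;
    while (x != 0) {
        x &= x - 1;
        ++n;
    }
    return n;
}

// Hypothetical helper: counts set bits at bit positions [first, last) of a
// packed word array. The 64-bit word layout is an assumption for illustration.
inline std::size_t count_bits_sketch(
    const std::vector<std::uint64_t>& words, std::size_t first, std::size_t last) {
    if (first >= last) {
        return 0;
    }

    constexpr std::size_t bits = 64;
    const std::size_t first_word = first / bits;
    const std::size_t last_word  = (last - 1) / bits; // word holding the final bit

    // Keep only bits at or above `first` in the first word.
    const std::uint64_t head_mask = ~std::uint64_t{0} << (first % bits);
    // Keep only bits below `last` in the last word.
    const std::uint64_t tail_mask = ~std::uint64_t{0} >> (bits - 1 - (last - 1) % bits);

    if (first_word == last_word) {
        return popcount_sketch(words[first_word] & head_mask & tail_mask);
    }

    std::size_t total = popcount_sketch(words[first_word] & head_mask);
    for (std::size_t i = first_word + 1; i < last_word; ++i) {
        total += popcount_sketch(words[i]); // interior words need no masking
    }
    return total + popcount_sketch(words[last_word] & tail_mask);
}
```

A bitwise mismatch over two ranges with different bit offsets would instead have to shift one side to line it up with the other for every word, which is consistent with the concern about misaligned and mixed cases above.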

Benchmark results

The Interim version is without mismatch. Its timings are not used in the speedup calculation.

Array:

| Benchmark | Before | Interim | After | Speedup after |
|---|---|---|---|---|
| perm_arr_check<equality::eq, args::three>/64 | 25.8 ns | 10.2 ns | 2.82 ns | 9.15 |
| perm_arr_check<equality::eq, args::three>/4096 | 976 ns | 85.0 ns | 61.2 ns | 16 |
| perm_arr_check<equality::eq, args::three>/65536 | 15737 ns | 1733 ns | 1385 ns | 11.4 |
| perm_arr_check<equality::eq, args::four>/64 | 24.1 ns | 10.1 ns | 2.80 ns | 8.61 |
| perm_arr_check<equality::eq, args::four>/4096 | 982 ns | 85.3 ns | 59.2 ns | 16.6 |
| perm_arr_check<equality::eq, args::four>/65536 | 15470 ns | 1727 ns | 1394 ns | 11.1 |
| perm_arr_check<equality::neq, args::three>/64 | 46.0 ns | 9.83 ns | 13.3 ns | 3.46 |
| perm_arr_check<equality::neq, args::three>/4096 | 2960 ns | 86.0 ns | 103 ns | 28.7 |
| perm_arr_check<equality::neq, args::three>/65536 | 48750 ns | 1751 ns | 2151 ns | 22.7 |
| perm_arr_check<equality::neq, args::four>/64 | 45.4 ns | 9.79 ns | 14.4 ns | 3.15 |
| perm_arr_check<equality::neq, args::four>/4096 | 2989 ns | 84.6 ns | 107 ns | 27.9 |
| perm_arr_check<equality::neq, args::four>/65536 | 47932 ns | 1751 ns | 2140 ns | 22.4 |

vector<bool> (the Interim column is omitted, as it shows no data different from After):

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| perm_vbool_check<equality::eq, args::three>/64 | 107 ns | 6.64 ns | 16.1 |
| perm_vbool_check<equality::eq, args::three>/4096 | 6631 ns | 96.4 ns | 68.8 |
| perm_vbool_check<equality::eq, args::three>/65536 | 105642 ns | 1452 ns | 72.8 |
| perm_vbool_check<equality::eq, args::four>/64 | 109 ns | 7.11 ns | 15.3 |
| perm_vbool_check<equality::eq, args::four>/4096 | 6525 ns | 95.6 ns | 68.2 |
| perm_vbool_check<equality::eq, args::four>/65536 | 103770 ns | 1449 ns | 71.6 |
| perm_vbool_check<equality::neq, args::three>/64 | 163 ns | 6.59 ns | 24.7 |
| perm_vbool_check<equality::neq, args::three>/4096 | 10094 ns | 97.4 ns | 104 |
| perm_vbool_check<equality::neq, args::three>/65536 | 383338 ns | 1472 ns | 260 |
| perm_vbool_check<equality::neq, args::four>/64 | 168 ns | 7.23 ns | 23.2 |
| perm_vbool_check<equality::neq, args::four>/4096 | 10432 ns | 95.1 ns | 110 |
| perm_vbool_check<equality::neq, args::four>/65536 | 382361 ns | 1464 ns | 261 |

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner March 7, 2026 16:37
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Mar 7, 2026
@StephanTLavavej StephanTLavavej added the performance Must go faster label Mar 7, 2026
@StephanTLavavej
Member

I believe this could be generalized further. For any ranges where the value types are bool and the predicate is equality, you should be able to count true. This could even benefit mixed comparisons of vector<bool> versus array<bool, N>, for example.
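The condition described here could be expressed as a compile-time check along the following lines. This is only a sketch: `can_count_true_sketch` is a hypothetical trait, not an STL internal, and it recognizes only the transparent `std::equal_to<>` as the "equality" predicate, which is a simplifying assumption.

```cpp
#include <cassert>
#include <functional>
#include <iterator>
#include <type_traits>
#include <vector>

// Hypothetical trait (not an STL internal): true when both iterators have a
// value type of bool and the predicate is the transparent std::equal_to<>.
// Recognizing only std::equal_to<> is a simplifying assumption.
template <class It1, class It2, class Pred>
constexpr bool can_count_true_sketch =
    std::is_same<typename std::iterator_traits<It1>::value_type, bool>::value
    && std::is_same<typename std::iterator_traits<It2>::value_type, bool>::value
    && std::is_same<Pred, std::equal_to<>>::value;
```

Because the check looks only at value types, a mixed pair such as `std::vector<bool>::iterator` and `const bool*` (the latter standing in for iterators into an array of bool) would qualify as well.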

@AlexGuteniev
Contributor Author

Generalized; benchmark results are about the same.
