
Add Neon implementation of bitset_to_string#6153

Open
hazzlim wants to merge 1 commit into microsoft:main from hazzlim:bitset-to-string-neon-pr

Conversation

@hazzlim hazzlim commented Mar 11, 2026

This PR adds a Neon implementation of bitset_to_string.

There's a slightly ugly workaround for DevCom-11056873 which hopefully isn't deemed too egregious.

Benchmark results ⏲️

| Benchmark | MSVC | Clang |
| --- | --- | --- |
| `BM_bitset_to_string<15, char>` | 1 | 0.996 |
| `BM_bitset_to_string<64, char>` | 2.516 | 1.294 |
| `BM_bitset_to_string_large_single<512, char>` | 7.873 | 7.064 |
| `BM_bitset_to_string_large_single<2048, char>` | 11.442 | 9.977 |
| `BM_bitset_to_string<7, wchar_t>` | 1 | 1.008 |
| `BM_bitset_to_string<64, wchar_t>` | 2.304 | 1.242 |
| `BM_bitset_to_string_large_single<512, wchar_t>` | 5.8 | 4.293 |
| `BM_bitset_to_string_large_single<2048, wchar_t>` | 7.778 | 5.714 |

@hazzlim hazzlim requested a review from a team as a code owner March 11, 2026 13:19
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Mar 11, 2026
```cpp
}

static _Vec_t _Load_constant() {
    // We do not omit static here, despite DevCom-11055227, because codegen is worse - see DevCom-11056805.
```
hazzlim (Contributor Author):
I actually wonder if this doesn't belong in a comment now (and below), but I thought I would explain why I didn't avoid static here (and below).

Contributor:

I'm 💯 sure these things are good to comment, and it is not too verbose to comment every occurrence like you did.

```cpp
static _Vec_t __forceinline _Step(
    const uint16_t _Val, const _Vec_t _Px0, const _Vec_t _Px1, const _Vec_t _Idx) noexcept {
    const auto _Vx0 = vdupq_n_u16(_Val);
    const auto _Vx1 = vqtbl1q_u8(vreinterpretq_u8_u16(_Vx0), _Idx);
```
AlexGuteniev (Contributor) commented Mar 11, 2026:
But is this the most efficient approach? If vqtbl is not very efficient, there may be another approach: making a vector of two halves, with the `_Val & 0xFF` and `_Val >> 8` bytes broadcast into them, may work.

hazzlim (Contributor Author):

TBL is performant on Neoverse cores; e.g. on Neoverse N2 it has 2-cycle latency and a throughput of 2 (it executes on all vector pipelines). The TBL approach (once we've loaded the indices) uses one DUP and one TBL instruction, whereas the non-TBL approach would require two DUPs plus one additional instruction to combine the two vectors.

It can be wise to avoid TBL when optimizing for AArch64 little core CPUs, but I don't think we care much about these here.
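For readers without an AArch64 machine to hand, here is a hypothetical scalar model of what the DUP + TBL + AND + CMEQ + BSL sequence computes for one 16-bit step. The exact lane layout of `_Idx` is an assumption inferred from the disassembly quoted later in the thread, not copied from the PR:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Scalar model (illustrative, not the STL's code) of one vectorized step:
// broadcast the 16-bit value (DUP), route each output lane to the byte
// holding its bit (TBL via _Idx), AND with a per-lane single-bit mask
// (the 0x0102040810204080 constant), compare (CMEQ), then select between
// the '1' and '0' characters (BSL). Lane layout is assumed MSB-first,
// matching how bitset::to_string prints the most significant bit first.
std::string step_scalar(uint16_t val, char zero = '0', char one = '1') {
    std::string out(16, zero);
    for (int lane = 0; lane < 16; ++lane) {
        // TBL: lanes 0..7 read the high byte, lanes 8..15 the low byte
        const uint8_t byte =
            lane < 8 ? static_cast<uint8_t>(val >> 8) : static_cast<uint8_t>(val & 0xFF);
        // AND + CMEQ: the mask walks 0x80, 0x40, ..., 0x01 within each byte
        const uint8_t mask = static_cast<uint8_t>(0x80u >> (lane % 8));
        // BSL: pick the '1' character where the bit is set
        out[lane] = (byte & mask) != 0 ? one : zero;
    }
    return out;
}
```

For example, `step_scalar(0x8001)` yields `"1000000000000001"`.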

```cpp
}

static _Vec_t _Load_constant() {
    // We do not omit static here, despite DevCom-11055227, because codegen is worse - see DevCom-11056805.
```
Contributor:

I especially dislike that "constant" has a different meaning in _Traits_1_neon and _Traits_2_neon. But the different argument naming should be sufficient to draw attention to this (it worked for me).

hazzlim (Contributor Author):

Yes, it's not ideal... but hopefully the argument naming softens the blow.

```cpp
    const uint16_t _Val, const _Vec_t _Px0, const _Vec_t _Px1, const _Vec_t _Idx) noexcept {
    const auto _Vx0 = vdupq_n_u16(_Val);
    const auto _Vx1 = vqtbl1q_u8(vreinterpretq_u8_u16(_Vx0), _Idx);
    const auto _Msk = vandq_u8(_Vx1, vreinterpretq_u8_u64(vdupq_n_u64(0x0102040810204080)));
```
Contributor:
Is the compiler capable of moving this constant out of the loop, but not the other constant? 👀

hazzlim (Contributor Author):

Yes (for some reason) - you can see in DevCom-11056873 that v20 is not loaded inside the loop:

```asm
|$LL7@std_bitset|
        ldrh        w8,[x1],#2
        ldr         q17,[x10]
        dup         v16.8h,w8
        tbl         v16.16b,{v16.16b},v17.16b
        and         v16.16b,v16.16b,v20.16b
        cmeq        v16.16b,v16.16b,v21.16b
        bsl         v16.16b,v19.16b,v18.16b
        str         q16,[x9,#-0x10]!
        cmp         x9,x11
        bne         |$LL7@std_bitset|
```

It's almost as though the compiler is willing to hoist only one Neon constant per loop (and no more!)
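As an aside for readers decoding the listing: the `and` against `v20` applies the broadcast 64-bit constant byte-wise. A small portable check (illustrative only) of how `0x0102040810204080` decomposes on a little-endian target such as AArch64:

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Viewed through vdupq_n_u64 on a little-endian target, the constant's
// bytes are the descending single-bit masks 0x80, 0x40, ..., 0x01 that
// vandq_u8 applies to successive character lanes.
std::array<uint8_t, 8> mask_bytes() {
    const uint64_t constant = 0x0102040810204080u;
    std::array<uint8_t, 8> bytes{};
    std::memcpy(bytes.data(), &constant, sizeof(constant));
    return bytes;
}
```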

AlexGuteniev (Contributor) commented:

> `BM_bitset_to_string<64, wchar_t>` 2.304

This is small.

Is this threshold large enough then?

```cpp
constexpr size_t _Bitset_to_string_vector_threshold = 32;
```

@StephanTLavavej StephanTLavavej added performance Must go faster ARM64 Related to the ARM64 architecture ARM64EC I can't believe it's not x64! labels Mar 11, 2026

hazzlim commented Mar 12, 2026

> `BM_bitset_to_string<64, wchar_t>` 2.304
>
> This is small.
>
> Is this threshold large enough then?
>
> `constexpr size_t _Bitset_to_string_vector_threshold = 32;`

I added a couple more sizes, and it seems reasonable to keep the threshold at 32:

| Benchmark | MSVC | Clang |
| --- | --- | --- |
| `BM_bitset_to_string<32, char>` | 1.83 | 0.98 |
| `BM_bitset_to_string<48, char>` | 2.14 | 1.22 |
| `BM_bitset_to_string<32, wchar_t>` | 1.64 | 0.93 |
| `BM_bitset_to_string<48, wchar_t>` | 2.01 | 1.17 |
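A sketch of the size dispatch these numbers justify, mirroring the `_Bitset_to_string_vector_threshold` constant quoted above (names here are illustrative, not the STL's actual control flow):

```cpp
#include <cstddef>

// Illustrative dispatch: the benchmarks show a speedup already at 32 bits,
// so bitsets at or above the threshold take the Neon path and smaller
// ones stay on the scalar path.
constexpr std::size_t bitset_to_string_vector_threshold = 32;

constexpr bool use_vector_path(std::size_t bits) noexcept {
    return bits >= bitset_to_string_vector_threshold;
}
```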

@StephanTLavavej StephanTLavavej self-assigned this Mar 13, 2026