
Add Neon implementation of bitset_to_string#6153

Open
hazzlim wants to merge 1 commit into microsoft:main from hazzlim:bitset-to-string-neon-pr

Conversation

@hazzlim hazzlim commented Mar 11, 2026

This PR adds a Neon implementation of bitset_to_string.

There's a slightly ugly workaround for DevCom-11056873 which hopefully isn't deemed too egregious.

Benchmark results ⏲️

| Benchmark | MSVC | Clang |
| --- | --- | --- |
| `BM_bitset_to_string<15, char>` | 1 | 0.996 |
| `BM_bitset_to_string<64, char>` | 2.516 | 1.294 |
| `BM_bitset_to_string_large_single<512, char>` | 7.873 | 7.064 |
| `BM_bitset_to_string_large_single<2048, char>` | 11.442 | 9.977 |
| `BM_bitset_to_string<7, wchar_t>` | 1 | 1.008 |
| `BM_bitset_to_string<64, wchar_t>` | 2.304 | 1.242 |
| `BM_bitset_to_string_large_single<512, wchar_t>` | 5.8 | 4.293 |
| `BM_bitset_to_string_large_single<2048, wchar_t>` | 7.778 | 5.714 |

@hazzlim hazzlim requested a review from a team as a code owner March 11, 2026 13:19
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Mar 11, 2026
```cpp
}

static _Vec_t _Load_constant() {
    // We do not omit static here, despite DevCom-11055227, because codegen is worse - see DevCom-11056805.
```
hazzlim (Contributor Author):
I actually wonder if this doesn't belong in a comment now (and below), but I thought I would explain why I didn't avoid static here (and below).

Contributor:

I'm 💯 sure these things are good to comment, and it is not too verbose to comment every occurrence like you did.

```cpp
static _Vec_t __forceinline _Step(
    const uint16_t _Val, const _Vec_t _Px0, const _Vec_t _Px1, const _Vec_t _Idx) noexcept {
    const auto _Vx0 = vdupq_n_u16(_Val);
    const auto _Vx1 = vqtbl1q_u8(vreinterpretq_u8_u16(_Vx0), _Idx);
```
AlexGuteniev (Contributor) commented Mar 11, 2026:
But is this the most efficient approach? If vqtbl is not very efficient, there may be another approach: making a vector of two halves, with the `_Val & 0xFF` and `_Val >> 8` bytes broadcast into them, may work.

hazzlim (Contributor Author):

TBL is performant on Neoverse cores; e.g. on Neoverse N2 it has 2-cycle latency and a throughput of 2 (it executes on all vector pipelines). The TBL approach (once we've loaded the indices) uses one DUP and one TBL instruction, whereas the non-TBL approach would require two DUPs plus one additional instruction to combine the two vectors.

It can be wise to avoid TBL when optimizing for AArch64 little core CPUs, but I don't think we care much about these here.
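For readers without an AArch64 machine to hand, here is a hypothetical scalar model of what the DUP + TBL + AND + CMEQ + BSL sequence computes for one 16-bit step. The exact lane layout of `_Idx` is an assumption inferred from the disassembly quoted later in the thread, not copied from the PR:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Scalar model (illustrative, not the STL's code) of one vectorized step:
// broadcast the 16-bit value (DUP), route each output lane to the byte
// holding its bit (TBL via _Idx), AND with a per-lane single-bit mask
// (the 0x0102040810204080 constant), compare (CMEQ), then select between
// the '1' and '0' characters (BSL). Lane layout is assumed MSB-first,
// matching how bitset::to_string prints the most significant bit first.
std::string step_scalar(uint16_t val, char zero = '0', char one = '1') {
    std::string out(16, zero);
    for (int lane = 0; lane < 16; ++lane) {
        // TBL: lanes 0..7 read the high byte, lanes 8..15 the low byte
        const uint8_t byte =
            lane < 8 ? static_cast<uint8_t>(val >> 8) : static_cast<uint8_t>(val & 0xFF);
        // AND + CMEQ: the mask walks 0x80, 0x40, ..., 0x01 within each byte
        const uint8_t mask = static_cast<uint8_t>(0x80u >> (lane % 8));
        // BSL: pick the '1' character where the bit is set
        out[lane] = (byte & mask) != 0 ? one : zero;
    }
    return out;
}
```

For example, `step_scalar(0x8001)` yields `"1000000000000001"`.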

```cpp
}

static _Vec_t _Load_constant() {
    // We do not omit static here, despite DevCom-11055227, because codegen is worse - see DevCom-11056805.
```
Contributor:

I especially dislike that "constant" has a different meaning in _Traits_1_neon and _Traits_2_neon. But the different argument naming should be sufficient to draw attention to this (it worked for me).

hazzlim (Contributor Author):

Yes, it's not ideal... but hopefully the argument naming softens the blow.

```cpp
    const uint16_t _Val, const _Vec_t _Px0, const _Vec_t _Px1, const _Vec_t _Idx) noexcept {
    const auto _Vx0 = vdupq_n_u16(_Val);
    const auto _Vx1 = vqtbl1q_u8(vreinterpretq_u8_u16(_Vx0), _Idx);
    const auto _Msk = vandq_u8(_Vx1, vreinterpretq_u8_u64(vdupq_n_u64(0x0102040810204080)));
```
Contributor:
Is the compiler capable of moving this constant out of the loop, but not the other constant? 👀

hazzlim (Contributor Author):

Yes (for some reason) - you can see in DevCom-11056873 that v20 is not loaded inside the loop:

```asm
|$LL7@std_bitset|
        ldrh        w8,[x1],#2
        ldr         q17,[x10]
        dup         v16.8h,w8
        tbl         v16.16b,{v16.16b},v17.16b
        and         v16.16b,v16.16b,v20.16b
        cmeq        v16.16b,v16.16b,v21.16b
        bsl         v16.16b,v19.16b,v18.16b
        str         q16,[x9,#-0x10]!
        cmp         x9,x11
        bne         |$LL7@std_bitset|
```

It's almost as though the compiler is willing to hoist only one Neon constant per loop (and no more!)
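As an aside for readers decoding the listing: the `and` against `v20` applies the broadcast 64-bit constant byte-wise. A small portable check (illustrative only) of how `0x0102040810204080` decomposes on a little-endian target such as AArch64:

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Viewed through vdupq_n_u64 on a little-endian target, the constant's
// bytes are the descending single-bit masks 0x80, 0x40, ..., 0x01 that
// vandq_u8 applies to successive character lanes.
std::array<uint8_t, 8> mask_bytes() {
    const uint64_t constant = 0x0102040810204080u;
    std::array<uint8_t, 8> bytes{};
    std::memcpy(bytes.data(), &constant, sizeof(constant));
    return bytes;
}
```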

AlexGuteniev (Contributor) commented:

> `BM_bitset_to_string<64, wchar_t>` 2.304

This is small.

Is this threshold large enough then?

```cpp
constexpr size_t _Bitset_to_string_vector_threshold = 32;
```

@StephanTLavavej StephanTLavavej added performance Must go faster ARM64 Related to the ARM64 architecture ARM64EC I can't believe it's not x64! labels Mar 11, 2026

hazzlim commented Mar 12, 2026

> `BM_bitset_to_string<64, wchar_t>` 2.304
>
> This is small.
>
> Is this threshold large enough then?
>
> `constexpr size_t _Bitset_to_string_vector_threshold = 32;`

I added a couple more sizes, and it seems reasonable to keep the threshold at 32:

| Benchmark | MSVC | Clang |
| --- | --- | --- |
| `BM_bitset_to_string<32, char>` | 1.83 | 0.98 |
| `BM_bitset_to_string<48, char>` | 2.14 | 1.22 |
| `BM_bitset_to_string<32, wchar_t>` | 1.64 | 0.93 |
| `BM_bitset_to_string<48, wchar_t>` | 2.01 | 1.17 |
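A sketch of the size dispatch these numbers justify, mirroring the `_Bitset_to_string_vector_threshold` constant quoted above (names here are illustrative, not the STL's actual control flow):

```cpp
#include <cstddef>

// Illustrative dispatch: the benchmarks show a speedup already at 32 bits,
// so bitsets at or above the threshold take the Neon path and smaller
// ones stay on the scalar path.
constexpr std::size_t bitset_to_string_vector_threshold = 32;

constexpr bool use_vector_path(std::size_t bits) noexcept {
    return bits >= bitset_to_string_vector_threshold;
}
```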

@StephanTLavavej StephanTLavavej self-assigned this Mar 13, 2026