Add Neon implementation of bitset_to_string #6153
hazzlim wants to merge 1 commit into microsoft:main

Conversation
```cpp
}

static _Vec_t _Load_constant() {
    // We do not omit static here, despite DevCom-11055227, because codegen is worse - see DevCom-11056805.
```
I actually wonder whether this still belongs in a comment (here and below), but I thought I would explain why I didn't avoid static here (and below).
I'm 💯 sure these things are good to comment, and it is not too verbose to comment every occurrence like you did.
```cpp
static _Vec_t __forceinline _Step(
    const uint16_t _Val, const _Vec_t _Px0, const _Vec_t _Px1, const _Vec_t _Idx) noexcept {
    const auto _Vx0 = vdupq_n_u16(_Val);
    const auto _Vx1 = vqtbl1q_u8(vreinterpretq_u8_u16(_Vx0), _Idx);
```
But is this the most efficient approach?
If vqtbl is not very efficient, there may be another approach: making a vector whose two halves are the broadcast `_Val & 0xFF` and `_Val >> 8` bytes may work.
TBL is performant on Neoverse cores, e.g. on Neoverse N2 it has 2c latency and throughput of 2 (it executes on all vector pipelines). The TBL approach (once we've loaded the indices) uses one DUP, and one TBL instruction. Whereas non-TBL would require two DUPs and one additional instruction to combine the two vectors.
It can be wise to avoid TBL when optimizing for AArch64 little core CPUs, but I don't think we care much about these here.
```cpp
}

static _Vec_t _Load_constant() {
    // We do not omit static here, despite DevCom-11055227, because codegen is worse - see DevCom-11056805.
```
I especially dislike that constant has a different meaning in _Traits_1_neon and _Traits_2_neon.
But the different argument naming should be sufficient to draw attention to this (it worked for me).
Yes, it's not ideal... but hopefully the argument naming softens the blow.
```cpp
    const uint16_t _Val, const _Vec_t _Px0, const _Vec_t _Px1, const _Vec_t _Idx) noexcept {
    const auto _Vx0 = vdupq_n_u16(_Val);
    const auto _Vx1 = vqtbl1q_u8(vreinterpretq_u8_u16(_Vx0), _Idx);
    const auto _Msk = vandq_u8(_Vx1, vreinterpretq_u8_u64(vdupq_n_u64(0x0102040810204080)));
```
Is the compiler capable of hoisting this constant out of the loop, but not the other constant? 👀
Yes (for some reason) - you can see in DevCom-11056873 that v20 is not loaded inside the loop:
```asm
|$LL7@std_bitset|
        ldrh        w8,[x1],#2
        ldr         q17,[x10]
        dup         v16.8h,w8
        tbl         v16.16b,{v16.16b},v17.16b
        and         v16.16b,v16.16b,v20.16b
        cmeq        v16.16b,v16.16b,v21.16b
        bsl         v16.16b,v19.16b,v18.16b
        str         q16,[x9,#-0x10]!
        cmp         x9,x11
        bne         |$LL7@std_bitset|
```
It's almost as though the compiler is willing to hoist only one Neon constant per loop (and no more!)
This is small. Is this threshold large enough, then? (Line 554 in 26b7ca5)
I added a couple more sizes and it seems reasonable to keep the threshold at 32.
This PR adds a Neon implementation of bitset_to_string. There's a slightly ugly workaround for DevCom-11056873 which hopefully isn't deemed too egregious.
Benchmark results ⏲️
- BM_bitset_to_string<15, char>
- BM_bitset_to_string<64, char>
- BM_bitset_to_string_large_single<512, char>
- BM_bitset_to_string_large_single<2048, char>
- BM_bitset_to_string<7, wchar_t>
- BM_bitset_to_string<64, wchar_t>
- BM_bitset_to_string_large_single<512, wchar_t>
- BM_bitset_to_string_large_single<2048, wchar_t>