Add a "slide" operation (like x86's alignr and ARM's vext)
#164
All the merge conflicts are now resolved, since I've rebased this post-codegen rework. I'm using this operation in my video effect project, to implement an IIR filter.
LaurenzV left a comment:
I haven't tried to fully understand all of the logic of the added (helper) methods, but I've gotten a good overview and overall it seems fine to add. However, I do have some comments/remarks.
Do we really need all these tests? I feel like the ones in mod.rs should be enough?
These tests aren't compiled in by default, but given how intricate the block-selection logic is, I think it's good to have some way to exhaustively test all shift amounts.
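Since the scalar semantics are simple, exhaustively checking all 16 byte-shift amounts is cheap. A portable sketch of such a test (using a hypothetical scalar stand-in for the vector op, and exploiting the fact that sliding a vector against itself is a rotate) might look like:

```rust
/// Scalar stand-in for the vector slide: the 16-byte window of
/// the concatenation [a ++ b] starting at byte offset `shift`.
fn slide_scalar(a: [u8; 16], b: [u8; 16], shift: usize) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        let idx = i + shift;
        out[i] = if idx < 16 { a[idx] } else { b[idx - 16] };
    }
    out
}

fn main() {
    let a: [u8; 16] = core::array::from_fn(|i| i as u8);
    // Passing the same vector twice yields a rotate, which we can
    // check against the standard library's rotate_left for every shift.
    for shift in 0..16 {
        let mut expected = a;
        expected.rotate_left(shift);
        assert_eq!(slide_scalar(a, a, shift), expected);
    }
    println!("all 16 shifts ok");
}
```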
```diff
  let method_ident = Ident::new(self.method, Span::call_site());
  let sig_inner = match &self.sig {
-     OpSig::Splat | OpSig::LoadInterleaved { .. } | OpSig::StoreInterleaved { .. } => {
+     OpSig::Splat => {
```
Why was Splat changed here as well?
I noticed that the signatures for splat and slide could be autogenerated, which avoids duplicating the doc comments in two different places. For splat this isn't a big deal, since the doc comment is quite simple, but slide's doc comment is a lot more involved.
Looking back on it, I should probably just revert that change for splat.
```rust
/// expected to be constant in practice, so the match statement will be optimized out. This exists because
/// Rust doesn't currently let you do math on const generics.
#[inline(always)]
unsafe fn dyn_vext_128(a: uint8x16_t, b: uint8x16_t, shift: usize) -> uint8x16_t {
```
If we already have an unsafe block inside we don't need it on the function itself, no?
It's marked as unsafe to match the unsafety of the underlying vextq_[...] function that it calls. Rust now wants you to use unsafe blocks even inside unsafe functions.
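For reference, this is the `unsafe_op_in_unsafe_fn` lint (warn-by-default as of the 2024 edition): the body of an `unsafe fn` is no longer implicitly an unsafe block, so unsafe operations inside it still need explicit `unsafe { }`. A minimal illustration (example function is hypothetical, not from this PR):

```rust
/// An `unsafe fn` no longer implies an unsafe *block* for its body,
/// so the raw-pointer dereference must still be wrapped explicitly.
unsafe fn first_byte(p: *const u8) -> u8 {
    // SAFETY: the caller guarantees `p` is valid for reads.
    unsafe { *p }
}

fn main() {
    let data = [7u8, 8, 9];
    // SAFETY: pointer to a live array element.
    let b = unsafe { first_byte(data.as_ptr()) };
    assert_eq!(b, 7);
    println!("ok");
}
```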
```rust
#[doc = r" Concatenates `b` and `a` (each N blocks) and extracts N blocks starting at byte offset `shift_bytes`."]
#[doc = r" Extracts from [b : a] (b in low bytes, a in high bytes), matching `alignr` semantics."]
#[inline(always)]
unsafe fn cross_block_alignr_128x4(
```
I'm fine leaving it this way for now, but I'm wondering whether this is really going to be faster than just using the fallback approach? Have you done any benchmarks on that? (Also for 128x2 and 256x2 in AVX2) Especially because cross_block_slide_blocks_at does quite a bit of work and is called 4 times.
(Also applies to NEON and WASM, basically anywhere where we polyfill a larger vector width than the base one.)
cross_block_slide_blocks_at is intended to be fully evaluated at compile time. All it does is determine which blocks to combine given a shift value, and since the shift value is constant, the selected blocks should be too. If Rust had better const-eval support, we could make everything a const fn, but sadly we're not there yet.
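To illustrate the intent with a hypothetical scalar analogue (the real helper's signature may differ): the block-selection math is ordinary integer arithmetic on the shift, so once `shift` is a constant and everything is inlined, the result tuple folds away at compile time.

```rust
/// Hypothetical analogue of `cross_block_slide_blocks_at`: for a given
/// output block and a byte shift across 16-byte blocks, pick the two
/// source blocks to combine and the residual in-block byte shift that
/// gets fed to the alignr/vext-style primitive.
#[inline(always)]
fn select_blocks(block: usize, shift: usize) -> (usize, usize, usize) {
    let whole = shift / 16; // whole blocks the window skips past
    let rem = shift % 16;   // in-block byte shift for the intrinsic
    (block + whole, block + whole + 1, rem)
}

fn main() {
    // Output block 0 with a 20-byte shift draws from source blocks 1
    // and 2, with a 4-byte in-block shift.
    assert_eq!(select_blocks(0, 20), (1, 2, 4));
    println!("ok");
}
```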
Depends on #159. Working on this is what sent me down that rabbit hole in the first place.
Progress towards #29, implementing the first lane-shuffling operations.
This PR adds an operation that concatenates two vectors and then takes a window of the concatenation. In other words, it takes two n-element vectors and a "window shift" of s, and returns the last n - s elements of the first vector concatenated with the first s elements of the second. This is like the `vext` family on ARM or `alignr` on x86. This can be used to implement "shift items" or "rotate items" operations, by providing a zero vector for one operand to get "shift" behavior and providing the same operand twice to get "rotate" behavior.
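The windowing semantics can be pinned down with a short scalar model (the function name is illustrative, not part of the PR):

```rust
/// Scalar model of "slide": view [a ++ b] as one 2N-element sequence
/// and return the N-element window starting at offset `shift`. The low
/// elements come from `a`'s tail and the high elements from `b`'s head,
/// matching alignr-style semantics.
fn slide_ref<const N: usize>(a: [u8; N], b: [u8; N], shift: usize) -> [u8; N] {
    let mut out = [0u8; N];
    for i in 0..N {
        let idx = i + shift; // index into the virtual concatenation
        out[i] = if idx < N { a[idx] } else { b[idx - N] };
    }
    out
}

fn main() {
    let a = [0u8, 1, 2, 3, 4, 5, 6, 7];
    let b = [8u8, 9, 10, 11, 12, 13, 14, 15];
    // shift = 3 keeps the last 5 elements of `a` and the first 3 of `b`.
    assert_eq!(slide_ref(a, b, 3), [3, 4, 5, 6, 7, 8, 9, 10]);
    println!("ok");
}
```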
There are two variants of this operation: one that operates over the full width of the vector and one that operates within 128-bit blocks. Even on AVX2, `_mm256_alignr_epi8` operates within 128-bit lanes, and it takes some extra permutes to make a full-width version, so I think it makes sense to provide a per-block version. This will also be the case when I implement fully general swizzles.

The shift amount is provided as a const generic argument, since the underlying intrinsics also expose it that way. In many cases, we need to do math on that const generic argument before passing it to the intrinsic: we might need to convert it from a `usize` to `i32`, divide it by the number of bytes per scalar element, wrap it modulo 16, etc. Rust doesn't let us do this yet, so I've added "faux-dynamic" versions of the `alignr`/`vext` intrinsics that are implemented as a huge match statement, with one arm for each of the 16 byte shift amounts. Since we inline everything, these should be evaluated at compile time.

I haven't yet confirmed that this generates the LLVM IR that we expect on all targets. The codegen seems to be producing `shufflevector`s on x86 and AArch64 (I haven't looked at WebAssembly), but all the functions go through some level of indirection via `vectorize` or `call_once` or something, so it's hard to match them up to what they're supposed to be doing.

I'm not fully tied to the name "slide". It's hard to find a good name for this operation. x86's "alignr" makes me think of memory alignment, and ARM's "vext" ("vector extend") sounds like you're just combining two vectors into a wider one.
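The "faux-dynamic" pattern described in the PR (a runtime parameter that is constant at every call site, dispatched to const-generic arms) can be sketched portably like this; the functions here are illustrative stand-ins, not the real intrinsic wrappers:

```rust
/// Const-generic "intrinsic" stand-in: rotate left by a compile-time
/// bit count, mirroring how alignr/vext take their shift as a constant.
#[inline(always)]
fn rotl_const<const BITS: u32>(x: u32) -> u32 {
    x.rotate_left(BITS)
}

/// Faux-dynamic wrapper taking the shift in bytes. The caller passes a
/// value that is constant in practice, so after inlining the match
/// collapses to a single arm and the dispatch is free.
#[inline(always)]
fn rotl_dyn(x: u32, shift_bytes: usize) -> u32 {
    match shift_bytes {
        0 => rotl_const::<0>(x),
        1 => rotl_const::<8>(x),
        2 => rotl_const::<16>(x),
        3 => rotl_const::<24>(x),
        _ => unreachable!("shift out of range"),
    }
}

fn main() {
    assert_eq!(rotl_dyn(0x1234_5678, 1), 0x3456_7812);
    println!("ok");
}
```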