Contributor

@bonega bonega commented Jan 17, 2026

Summary

Fix [u8]::is_ascii() performance regression when compiled with -C target-cpu=native on AVX-512 CPUs.

Problem

When is_ascii is compiled with AVX-512 enabled, LLVM's auto-vectorizer generates ~31 kshiftrd instructions to extract mask bits one at a time instead of using the efficient pmovmskb instruction. This causes a ~22x performance regression.

Because is_ascii is marked #[inline], it gets inlined and recompiled with the user's target settings, affecting anyone using -C target-cpu=native on AVX-512 CPUs.
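
For readers unfamiliar with the pattern, the following is a minimal sketch of a counting loop of the kind described above. It is not the actual library source: the function name `is_ascii_counting` and the 32-byte chunk size are assumptions chosen for illustration.

```rust
/// Hypothetical stand-in for the counting-loop pattern; not the real
/// `core` implementation. The chunk size of 32 is an assumption.
fn is_ascii_counting(bytes: &[u8]) -> bool {
    const CHUNK: usize = 32;
    let mut chunks = bytes.chunks_exact(CHUNK);
    for chunk in &mut chunks {
        // Count the bytes whose high bit is set. The loop body is
        // branch-free, which is what invites auto-vectorization.
        let mut non_ascii = 0u8;
        for &b in chunk {
            non_ascii += (b >= 0x80) as u8;
        }
        if non_ascii != 0 {
            return false;
        }
    }
    // Handle the tail that does not fill a whole chunk.
    chunks.remainder().iter().all(|b| b.is_ascii())
}
```

How the inner loop is lowered depends on the enabled target features, which is why the same source can produce pmovmskb on a default build and a kshiftrd sequence when AVX-512 is enabled.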

Solution

Replace the counting loop with explicit SSE2 intrinsics (_mm_movemask_epi8) that force pmovmskb codegen regardless of CPU features.
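
A rough sketch of what such an intrinsics-based check can look like is shown below. It is an illustration under stated assumptions, not the code in this PR: the function name `is_ascii_sse2`, the 32-byte chunking, and the fallback for other architectures are all hypothetical.

```rust
/// Hypothetical sketch: OR two 16-byte loads together and test the
/// high bits with `_mm_movemask_epi8` (which lowers to pmovmskb).
#[cfg(target_arch = "x86_64")]
fn is_ascii_sse2(bytes: &[u8]) -> bool {
    use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8, _mm_or_si128};

    let mut chunks = bytes.chunks_exact(32);
    for chunk in &mut chunks {
        // SAFETY: SSE2 is part of the x86_64 baseline, and each chunk is
        // exactly 32 bytes, so both unaligned 16-byte loads are in bounds.
        let mask = unsafe {
            let ptr = chunk.as_ptr() as *const __m128i;
            let lo = _mm_loadu_si128(ptr);
            let hi = _mm_loadu_si128(ptr.add(1));
            // Each mask bit is the high bit of one byte of `lo | hi`.
            _mm_movemask_epi8(_mm_or_si128(lo, hi))
        };
        if mask != 0 {
            return false;
        }
    }
    // Scalar check for the tail shorter than one chunk.
    chunks.remainder().iter().all(|b| b.is_ascii())
}

/// Assumed fallback so the sketch compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn is_ascii_sse2(bytes: &[u8]) -> bool {
    bytes.iter().all(|b| b.is_ascii())
}
```

Because the intrinsic is named explicitly, the mask extraction does not depend on which vectorization strategy LLVM would otherwise pick for the enabled target features.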

Godbolt Links (Rust 1.92)

| Pattern | Target | Link | Result |
|---|---|---|---|
| Counting loop (old) | Default SSE2 | https://godbolt.org/z/sE86xz4fY | pmovmskb |
| Counting loop (old) | AVX-512 (znver4) | https://godbolt.org/z/b3jvMhGd3 | 31x kshiftrd (broken) |
| SSE2 intrinsics (fix) | Default SSE2 | https://godbolt.org/z/hMeGfeaPv | pmovmskb |
| SSE2 intrinsics (fix) | AVX-512 (znver4) | https://godbolt.org/z/Tdvdqjohn | vpmovmskb (fixed) |

Benchmark Results

AMD Ryzen 5 7500F (Zen 4 with AVX-512):

| Build | Before | After | Improvement |
|---|---|---|---|
| Default | ~73 GB/s | ~74 GB/s | No regression |
| -C target-cpu=native | ~3 GB/s | ~67 GB/s | 22x |

Note: these figures are for pure-ASCII input; the other cases show a similar pattern.
See the linked bench project.
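
To make the GB/s figures concrete, here is a minimal throughput sketch using only the standard library. It is not the Criterion benchmark from the linked project; the helper names `gb_per_s` and `measure_is_ascii` and the buffer parameters are assumptions for illustration.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Convert a byte count and elapsed time to GB/s (1 GB = 1e9 bytes).
fn gb_per_s(bytes: usize, elapsed: Duration) -> f64 {
    bytes as f64 / elapsed.as_secs_f64() / 1e9
}

/// Hypothetical micro-benchmark: scan a pure-ASCII buffer repeatedly
/// with `[u8]::is_ascii` and report the observed throughput.
fn measure_is_ascii(len: usize, iterations: usize) -> f64 {
    let data = vec![b'a'; len];
    let start = Instant::now();
    for _ in 0..iterations {
        // `black_box` keeps the optimizer from hoisting or removing the call.
        black_box(black_box(data.as_slice()).is_ascii());
    }
    gb_per_s(len * iterations, start.elapsed())
}
```

A hand-rolled loop like this is much noisier than Criterion, so it is only suitable for spotting order-of-magnitude differences such as the one in the table. Build with and without `-C target-cpu=native` to compare.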

Test Plan

  • Assembly test (slice-is-ascii-avx512.rs) verifies no kshiftrd with AVX-512
  • Existing codegen test updated to loongarch64-only (auto-vectorization still used there)
  • Fuzz testing confirms old/new implementations produce identical results (~53M iterations)
  • Benchmarks confirm performance improvement
  • Tidy checks pass
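
The idea behind comparing two implementations can be sketched with a small deterministic differential check. This is a simplified stand-in, not the libFuzzer harness from the PR: the functions `reference` and `candidate`, the xorshift generator, and the input-length bound are all assumptions.

```rust
/// Byte-by-byte reference implementation.
fn reference(bytes: &[u8]) -> bool {
    bytes.iter().all(|&b| b < 0x80)
}

/// Hypothetical chunked implementation under test.
fn candidate(bytes: &[u8]) -> bool {
    let mut chunks = bytes.chunks_exact(32);
    for chunk in &mut chunks {
        // OR all bytes together; any set high bit means non-ASCII.
        let acc = chunk.iter().fold(0u8, |acc, &b| acc | b);
        if acc & 0x80 != 0 {
            return false;
        }
    }
    chunks.remainder().iter().all(|&b| b < 0x80)
}

/// Small xorshift PRNG so the check is reproducible without extra crates.
fn xorshift64(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// Run both implementations on pseudo-random inputs and count disagreements.
fn count_mismatches(iterations: usize, seed: u64) -> usize {
    let mut state = seed.max(1); // xorshift must not start at zero
    let mut mismatches = 0;
    for _ in 0..iterations {
        let len = (xorshift64(&mut state) % 100) as usize;
        // Start from ASCII-only bytes...
        let mut input: Vec<u8> = (0..len)
            .map(|_| (xorshift64(&mut state) & 0x7f) as u8)
            .collect();
        // ...and set one high bit in roughly half of the non-empty inputs.
        if len > 0 && xorshift64(&mut state) % 2 == 0 {
            let pos = (xorshift64(&mut state) % (len as u64)) as usize;
            input[pos] |= 0x80;
        }
        if reference(&input) != candidate(&input) {
            mismatches += 1;
        }
    }
    mismatches
}
```

Input lengths up to 99 bytes cover empty slices, tails shorter than one chunk, and several full chunks, which are the boundaries where chunked implementations tend to go wrong.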

Reproduction / Test Projects

Standalone validation tools: https://github.com/bonega/is-ascii-fix-validation

  • bench/ - Criterion benchmarks for SSE2 vs AVX-512 comparison
  • fuzz/ - Compares old/new implementations with libfuzzer

Related Issues

When `[u8]::is_ascii()` is compiled with `-C target-cpu=native` on
AVX-512 CPUs, LLVM generates inefficient code. Because `is_ascii` is
marked `#[inline]`, it gets inlined and recompiled with the user's
target settings. The previous implementation used a counting loop that
LLVM auto-vectorizes to `pmovmskb` on SSE2, but with AVX-512 enabled,
LLVM uses k-registers and extracts bits individually with ~31
`kshiftrd` instructions.

This fix replaces the counting loop with explicit SSE2 intrinsics
(`_mm_loadu_si128`, `_mm_or_si128`, `_mm_movemask_epi8`) for x86_64.
`_mm_movemask_epi8` compiles to `pmovmskb`, forcing efficient codegen
regardless of CPU features.

Benchmark results on AMD Ryzen 5 7500F (Zen 4 with AVX-512):
- Default build: ~73 GB/s → ~74 GB/s (no regression)
- With -C target-cpu=native: ~3 GB/s → ~67 GB/s (22x improvement)

The loongarch64 implementation retains the original counting loop
since it doesn't have this issue.

Regression from: rust-lang#130733
Collaborator

rustbot commented Jan 17, 2026

⚠️ #[rustc_allow_const_fn_unstable] needs careful audit to avoid accidentally exposing unstable
implementation details on stable.

cc @rust-lang/wg-const-eval

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jan 17, 2026
Collaborator

rustbot commented Jan 17, 2026

r? @tgross35

rustbot has assigned @tgross35.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer
