Fixing rust-lang/stdarch issues in LLVM
I work on rust-lang/stdarch, which defines vendor-specific APIs that are used by the Rust standard library and by Rust users writing explicit SIMD code. The stdarch source is not pretty: we maintain long files that contain definitions like this:
#[allow(improper_ctypes)]
unsafe extern "C" {
    #[link_name = "llvm.x86.pclmulqdq"]
    fn pclmulqdq(a: __m128i, round_key: __m128i, imm8: u8) -> __m128i;
}

pub fn _mm_clmulepi64_si128<const IMM8: i32>(a: __m128i, b: __m128i) -> __m128i {
    static_assert_uimm_bits!(IMM8, 8);
    unsafe { pclmulqdq(a, b, IMM8 as u8) }
}
For hopefully obvious reasons we want to maintain as little of this code as possible. In this post we'll look at how we do that: first by using cross-platform functions that are built into the compiler itself, and, when that doesn't work, by going up the supply chain and attempting to fix issues in LLVM.
This post gets extremely technical very quickly, but the tl;dr is that I believe there is a lot of value in Rust (compiler) developers being able to go and fix the LLVM issues that bother us. Especially for anything but the most common targets (e.g. wasm, embedded), it can otherwise take years for issues to get resolved.
Anatomy of a stdarch function
The stdarch crate exposes so-called platform intrinsics as Rust functions. Intrinsics are functions that, roughly, map to a single instruction, although in practice they sometimes represent a couple of instructions. These intrinsic functions are part of the platform, and are typically provided as C functions, for instance in xmmintrin.h for 128-bit SIMD on x86_64.
In stdarch we link to the intrinsics that LLVM provides, and re-expose them in a safer way: we enforce that arguments that should be constant are actually const, and that the required target features are specified. In some cases the arguments and/or result must be cast or transmuted.
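As a usage sketch of what those guarantees buy the caller (my own example, reusing the _mm_clmulepi64_si128 definition from above): the immediate is a const generic, so a runtime selector simply does not compile, and the unsafe block is where the caller vouches for the target-feature requirement.
use core::arch::x86_64::{__m128i, _mm_clmulepi64_si128};

/// Carry-less multiply of the low 64-bit halves of `a` and `b`.
fn clmul_low(a: __m128i, b: __m128i) -> __m128i {
    // The selector is a const generic: passing a runtime value is rejected
    // at compile time rather than miscompiled.
    // Safety: the caller must ensure the `pclmulqdq` target feature is available.
    unsafe { _mm_clmulepi64_si128::<0b0000_0000>(a, b) }
}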
Each platform has its own custom instructions, but there is a common set that most platforms provide. We try to use platform-independent functionality to slightly ease our maintenance burden.
Using core::intrinsics::simd
The Rust compiler defines a number of cross-platform SIMD primitives in core::intrinsics::simd. These functions are baked into the compiler itself, and can therefore be slightly magical:
/// Adds two simd vectors elementwise.
///
/// `T` must be a vector of integers or floats.
#[rustc_intrinsic]
#[rustc_nounwind]
pub unsafe fn simd_add<T>(x: T, y: T) -> T;
Many platform-specific functions can be implemented in terms of these cross-platform SIMD intrinsics. For example:
/// Adds packed 8-bit integers in `a` and `b`.
#[target_feature(enable = "sse2")]
pub fn _mm_add_epi8(a: __m128i, b: __m128i) -> __m128i {
    // First cast to the right type so that `simd_add` has the right behavior.
    unsafe { transmute(simd_add(a.as_i8x16(), b.as_i8x16())) }
}
This approach has several advantages over directly linking with the LLVM intrinsic:
- the implementation is clearer to Rust users
- there is less platform-specific code for us to maintain (see the sketch after this list)
- this implementation works with non-LLVM backends, e.g. miri, cranelift and gcc
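As a sketch of the second point: with this approach, the adds for other element widths differ only in the cast. This is the general shape of the stdarch code, assuming an as_i16x8 helper analogous to the as_i8x16 above.
/// Adds packed 16-bit integers in `a` and `b`.
#[target_feature(enable = "sse2")]
pub fn _mm_add_epi16(a: __m128i, b: __m128i) -> __m128i {
    // Identical to `_mm_add_epi8` except for the element type of the cast.
    unsafe { transmute(simd_add(a.as_i16x8(), b.as_i16x8())) }
}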
However, using the cross-platform intrinsics is only an option when they generate good code. For common functions like simd_add that is the case, but for more complex operations LLVM does not always recognize that the cross-platform function should generate a specific instruction.
Saturating addition on aarch64
The vqaddq_s64 intrinsic performs 64-bit signed saturating addition. The stdarch definition looked like this:
pub fn vqaddq_s64(a: int64x2_t, b: int64x2_t) -> int64x2_t {
    unsafe extern "unadjusted" {
        #[cfg_attr(
            any(target_arch = "aarch64", target_arch = "arm64ec"),
            link_name = "llvm.aarch64.neon.sqadd.v2i64"
        )]
        #[cfg_attr(target_arch = "arm", link_name = "llvm.sadd.sat.v2i64")]
        fn _vqaddq_s64(a: int64x2_t, b: int64x2_t) -> int64x2_t;
    }
    unsafe { _vqaddq_s64(a, b) }
}
Using the functions from core::intrinsics::simd, it should be possible to write this function as simply:
pub fn vqaddq_s64(a: int64x2_t, b: int64x2_t) -> int64x2_t {
    unsafe { simd_saturating_add(a, b) }
}
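Both definitions have the same semantics: the addition clamps at the bounds of i64 instead of wrapping. A quick illustration (my own example, not from the stdarch test suite):
use core::arch::aarch64::*;

fn saturation_check() {
    // Safety: requires NEON, which is baseline on aarch64.
    unsafe {
        let a = vdupq_n_s64(i64::MAX);
        let b = vdupq_n_s64(1);
        // Saturates to i64::MAX instead of wrapping to i64::MIN.
        assert_eq!(vgetq_lane_s64::<0>(vqaddq_s64(a, b)), i64::MAX);
    }
}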
Unfortunately, it turned out that this simpler implementation optimized less well in some cases: when used on their own both versions generate the same instruction, but the platform-specific version is able to fuse with other aarch64 instructions in ways that the cross-platform version is not. The difference is visible at https://godbolt.org/z/r9WEM4r57:
specific: // @specific
    ldr q0, [x1]
    ldr q1, [x2]
    ldr q2, [x0]
    sqdmlal2 v2.2d, v0.4s, v1.s[1]
    str q2, [x8]
    ret

generic: // @generic
    ldr q0, [x1]
    ldr q1, [x2]
    sqdmull2 v0.2d, v0.4s, v1.s[1]
    ldr q1, [x0]
    sqadd v0.2d, v1.2d, v0.2d
    str q0, [x8]
    ret
The target-specific version fuses sqdmull and sqadd into sqdmlal, while the cross-platform one does not. We don't want to regress the quality of the generated code, so this is a deal breaker.
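For reference, here is a Rust sketch of the kind of input behind that Godbolt link (my reconstruction; the exact source may differ): a saturating doubling multiply-long followed by a saturating add into an accumulator.
use core::arch::aarch64::*;

#[target_feature(enable = "neon")]
unsafe fn multiply_accumulate(acc: int64x2_t, a: int32x4_t, b: int32x4_t) -> int64x2_t {
    // With the target-specific vqaddq_s64, the sqdmull2 produced by the
    // multiply fuses with the sqadd into a single sqdmlal2; with the
    // simd_saturating_add version (at the time) they stayed separate.
    vqaddq_s64(acc, vqdmull_high_laneq_s32::<1>(a, b))
}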
I opened an LLVM issue (https://github.com/llvm/llvm-project/issues/94463), hoping that this problem would be resolved. AArch64 is a common target and saturating arithmetic is a reasonably common operation, so this seemed like a real issue to me, independent of how a fix would make Rust's stdarch slightly simpler.
However, opening LLVM issues can be a bit like screaming into the void: the rate of new issues far outpaces the rate of PRs, leading to an ever-growing pile of issues. Someone started work on a fix, but over time it became clear that I'd have to go solve this myself...
Building LLVM
The LLVM codebase is a beast (indeed, the dragon logo is apt). It is also written in C++, a language that I'm not especially familiar with (though I have loved to hate it from a distance). But, as the internet tells me, the only way to learn is by playing.
I found some commands to build LLVM on the internet. The llc and FileCheck binaries are the ones I needed to run the relevant tests.
git clone git@github.com:llvm/llvm-project.git
cd llvm-project
mkdir build
cd build
cmake -G Ninja -DLLVM_TARGETS_TO_BUILD="ARM;AArch64;X86" -DLLVM_ENABLE_PROJECTS="clang" -DLLVM_ENABLE_DUMP=ON ../llvm
ninja bin/llc bin/FileCheck
The actual cold build takes forever, but incremental builds are reasonably quick (in the sense that they are not substantially worse than incremental changes to rust-lang/rust; objectively it's still slow).
Making a plan
Searching for the sqdmlal instruction in the AArch64 backend led me to this piece of code:
defm SQDMLAL : SIMDThreeScalarMixedTiedHS<0, 0b10010, "sqdmlal">;
def : Pat<(i64 (int_aarch64_neon_sqadd (i64 FPR64:$Rd),
                   (i64 (int_aarch64_neon_sqdmulls_scalar (i32 FPR32:$Rn),
                                                          (i32 FPR32:$Rm))))),
          (SQDMLALi32 FPR64:$Rd, FPR32:$Rn, FPR32:$Rm)>;
Not very readable, certainly not to me at the time. However, the pattern at the bottom does seem to, somehow, perform the transformation that the assembly also shows: an int_aarch64_neon_sqadd and an int_aarch64_neon_sqdmulls_scalar fuse together into a SQDMLAL.
So, then, how do we make llvm.sadd.sat.v2i64 do the same thing?
After some discussion in the issue thread, we actually settled on optimizing llvm.sadd.sat.v2i64, and having the platform-specific @llvm.aarch64.neon.sqadd.v2i64 map to it. That way the aarch64 backend might benefit from more general optimizations that apply to llvm.sadd.sat.
// in AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN
case Intrinsic::aarch64_neon_sqadd:
  if (Op.getValueType().isVector())
    return DAG.getNode(ISD::SADDSAT, dl, Op.getValueType(), Op.getOperand(1),
                       Op.getOperand(2));
This bit of code maps Intrinsic::aarch64_neon_sqadd to an ISD::SADDSAT DAG node. The cross-platform llvm.sadd.sat automatically maps to such a DAG node.
Just making this change breaks the optimization, because int_aarch64_neon_sqadd is now rewritten and never makes it to the phase where the patterns that combine it with sqdmull would apply.
In fact we've broken instruction selection for saturating addition entirely: the backend does not know what instruction to emit for saddsat. That can be fixed by having saddsat map to the SQADD instruction.
@ llvm/lib/Target/AArch64/AArch64InstrInfo.td
- defm SQADD : SIMDThreeSameVector<0,0b00001,"sqadd", int_aarch64_neon_sqadd>;
+ defm SQADD : SIMDThreeSameVector<0,0b00001,"sqadd", saddsat>;
Finally, the fusing with the multiplication can be recovered with:
- defm SQDMLAL : SIMDLongThreeVectorSQDMLXTiedHS<0, 0b1001, "sqdmlal",
-                                                int_aarch64_neon_sqadd>;
+ defm SQDMLAL : SIMDLongThreeVectorSQDMLXTiedHS<0, 0b1001, "sqdmlal", saddsat>;
Running tests
We appended new tests to the llvm/test/CodeGen/AArch64/arm64-vmul.ll test file. This file starts with the following lines:
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=aarch64-none-elf -mattr=+aes < %s | FileCheck %s --check-prefixes=CHECK,CHECK-SD
; RUN: llc -mtriple=aarch64-none-elf -mattr=+aes -global-isel -global-isel-abort=2 2>&1 < %s | FileCheck %s --check-prefixes=CHECK,CHECK-GI
This test can be executed from the build directory with:
bin/llvm-lit -v ../llvm/test/CodeGen/AArch64/arm64-vmul.ll
The actual tests look like this, with assertions about the generated assembly in comments:
define <4 x i32> @sqdmlal4s_lib(<4 x i32> %dst, <4 x i16> %v1, <4 x i16> %v2) {
; CHECK-LABEL: sqdmlal4s_lib:
; CHECK:       // %bb.0:
; CHECK-NEXT:    sqdmlal.4s v0, v1, v2
; CHECK-NEXT:    ret
  %tmp = call <4 x i32> @llvm.aarch64.neon.sqdmull.v4i32(<4 x i16> %v1, <4 x i16> %v2)
  %sum = call <4 x i32> @llvm.sadd.sat.v4i32(<4 x i32> %dst, <4 x i32> %tmp)
  ret <4 x i32> %sum
}
The CHECK lines are machine-generated with update_llc_test_checks.py:
llvm/utils/update_llc_test_checks.py --llc-binary=build/bin/llc llvm/test/CodeGen/AArch64/arm64-vmul.ll --force-update
NOTE: This is just the workflow that I've been able to cobble together; discovering these workflows is really hard. Something I enjoyed in the early days of COVID was watching folks stream their programming: you learn things, both in terms of tooling and process, that just can't really be written down. I'd love to watch over the shoulder of someone experienced while they fix some LLVM bugs, a sort of "Everything I know about debugging LLVM".
Conclusion
After an LLVM PR gets merged, we get to wait up to six months for the next LLVM release to make it into nightly Rust. Finally, in late August 2025, a year and a half after first looking at this problem in stdarch, we were able to merge PR 1575, which makes the simplifications.
The same LLVM bump brought similar improvements to s390x in PR 1903, and I've since made additional PRs for s390x, powerpc and wasm32 to hopefully reduce the number of target-specific intrinsics even further. Once you're familiar with some of the basic concepts, there is a lot of low-hanging fruit to pick.
I mostly found using the C++ language to be fine. There is lots of code to steal and pattern-match from. What really bites is the lack of good error messages. It's surprising that the companies with these enormous C++ codebases haven't been able to coordinate something here; Rust very clearly demonstrates that there is enormous value in thoughtful error messages.
I think there is a lot of value for Rust in us actually being able to go and fix more of the LLVM issues we run into: past experience shows that LLVM issues often collect dust if we don't chase them down ourselves. It's not easy, but I'd encourage anyone who's curious to give it a go.