Fixing rust-lang stdarch issues in LLVM

Folkert
Systems software engineer
A couple of months ago I became a co-maintainer of rust-lang/stdarch, which defines vendor-specific APIs that are used by the Rust standard library and Rust users writing explicit SIMD code.

The stdarch source is not pretty: we maintain long files that contain definitions like this:

#[allow(improper_ctypes)]
unsafe extern "C" {
    #[link_name = "llvm.x86.pclmulqdq"]
    fn pclmulqdq(a: __m128i, round_key: __m128i, imm8: u8) -> __m128i;
}

pub fn _mm_clmulepi64_si128<const IMM8: i32>(a: __m128i, b: __m128i) -> __m128i {
    static_assert_uimm_bits!(IMM8, 8);
    unsafe { pclmulqdq(a, b, IMM8 as u8) }
}

For hopefully obvious reasons we want to maintain as little such code as possible. In this post we'll look at how we do that, firstly by using cross-platform functions that are built into the compiler itself, and, when that doesn't work, by going up the supply chain and attempting to fix issues in LLVM.

This post gets extremely technical very quickly, but the tl;dr is that I believe there is a lot of value in Rust (compiler) developers being able to go and fix the LLVM issues that bother us. Especially for anything but the most common targets (e.g. wasm, embedded), it can otherwise take years for issues to get resolved.

Anatomy of a stdarch function

The stdarch crate exposes so-called platform intrinsics as Rust functions. Intrinsics are functions that, roughly, map to a single instruction, although in practice they sometimes represent a couple of instructions. These intrinsic functions are part of the platform, and are typically provided as C functions, for instance in xmmintrin.h for 128-bit SIMD on x86_64.

In stdarch we link to the intrinsics that LLVM provides, and re-expose them in a safer way: we enforce that arguments that should be constant are actually const, and that the required target features are specified. In some cases the arguments and/or result must be cast or transmuted.

Each platform has its own custom instructions, but there is a common set that most platforms provide. We try to use platform-independent functionality to slightly ease our maintenance burden.

Using core::intrinsics::simd

The Rust compiler defines a number of cross-platform SIMD primitives in core::intrinsics::simd. These functions are baked into the compiler itself, and can therefore be slightly magical:

/// Adds two simd vectors elementwise.
///
/// `T` must be a vector of integers or floats.
#[rustc_intrinsic]
#[rustc_nounwind]
pub unsafe fn simd_add<T>(x: T, y: T) -> T;

Many platform-specific functions can be implemented in terms of these cross-platform SIMD intrinsics. For example:

/// Adds packed 8-bit integers in `a` and `b`.
#[target_feature(enable = "sse2")]
pub fn _mm_add_epi8(a: __m128i, b: __m128i) -> __m128i {
    // First cast to the right type so that `simd_add` has the right behavior.
    unsafe { transmute(simd_add(a.as_i8x16(), b.as_i8x16())) }
}

This approach has several advantages over directly linking with the LLVM intrinsic:

  • the implementation is clearer to Rust users
  • there is less platform-specific code for us to maintain
  • this implementation works with non-LLVM backends, e.g. miri, cranelift and gcc

However, using the cross-platform intrinsics is only an option when they generate good code. For the common functions like simd_add that is the case, but for more complex operations LLVM does not always recognize that the cross-platform function should generate a specific instruction.

Saturating addition on aarch64

The vqaddq_s64 intrinsic performs 64-bit signed saturating addition. The stdarch definition looked like this:

pub fn vqaddq_s64(a: int64x2_t, b: int64x2_t) -> int64x2_t {
    unsafe extern "unadjusted" {
        #[cfg_attr(
            any(target_arch = "aarch64", target_arch = "arm64ec"),
            link_name = "llvm.aarch64.neon.sqadd.v2i64"
        )]
        #[cfg_attr(target_arch = "arm", link_name = "llvm.sadd.sat.v2i64")]
        fn _vqaddq_s64(a: int64x2_t, b: int64x2_t) -> int64x2_t;
    }
    unsafe { _vqaddq_s64(a, b) }
}

Using the functions from core::intrinsics::simd, it should be possible to write this function as simply

pub fn vqaddq_s64(a: int64x2_t, b: int64x2_t) -> int64x2_t {
    unsafe { simd_saturating_add(a, b) }
}

Unfortunately, it turned out that this simpler implementation optimized less well in some cases: when used on their own both versions generate the same instruction, but the platform-specific version is able to fuse with other aarch64 instructions in ways that the cross-platform version is not.

https://godbolt.org/z/r9WEM4r57

specific:                               // @specific
        ldr     q0, [x1]
        ldr     q1, [x2]
        ldr     q2, [x0]
        sqdmlal2        v2.2d, v0.4s, v1.s[1]
        str     q2, [x8]
        ret
generic:                                // @generic
        ldr     q0, [x1]
        ldr     q1, [x2]
        sqdmull2        v0.2d, v0.4s, v1.s[1]
        ldr     q1, [x0]
        sqadd   v0.2d, v1.2d, v0.2d
        str     q0, [x8]
        ret

The target-specific version fuses sqdmull and sqadd into sqdmlal; the cross-platform one does not. We don't want to regress the quality of the generated code, so this is a deal breaker.

I opened an LLVM issue (https://github.com/llvm/llvm-project/issues/94463), hoping that this problem would be resolved. AArch64 is a common target and saturating arithmetic a reasonably common operation, so this seemed like a real issue to me, independent of how a fix would make Rust's stdarch slightly simpler.

However, opening LLVM issues can be a bit like screaming into the void: the rate of new issues far outpaces PRs, leading to an ever-growing pile of issues. Someone started work on a fix, but over time it became clear that I'd have to go solve this myself...

Building LLVM

The LLVM codebase is a beast (indeed, the dragon logo is apt). It is also written in C++, a language that I'm not especially familiar with (though one I've loved to hate from a distance). But, as the internet tells me, the only way to learn is by playing.

I found some commands to build LLVM on the internet. The llc and FileCheck binaries are the ones I needed to run the relevant tests.

git clone git@github.com:llvm/llvm-project.git
cd llvm-project
mkdir build
cd build
cmake -G Ninja -DLLVM_TARGETS_TO_BUILD="ARM;AArch64;X86" -DLLVM_ENABLE_PROJECTS="clang" -DLLVM_ENABLE_DUMP=ON ../llvm
ninja bin/llc bin/FileCheck

The actual cold build takes forever, but incremental builds are reasonably quick (in the sense that they are not substantially worse than incremental changes to rust-lang/rust; objectively it's still slow).

Making a plan

Searching for the sqdmlal instruction in the AArch64 backend led me to this piece of code

defm SQDMLAL  : SIMDThreeScalarMixedTiedHS<0, 0b10010, "sqdmlal">;

def : Pat<(i64 (int_aarch64_neon_sqadd (i64 FPR64:$Rd),
                   (i64 (int_aarch64_neon_sqdmulls_scalar (i32 FPR32:$Rn),
                                                        (i32 FPR32:$Rm))))),
          (SQDMLALi32 FPR64:$Rd, FPR32:$Rn, FPR32:$Rm)>;

Not very readable, certainly not to me at the time. However, the pattern at the bottom does seem to, somehow, perform the transformation that the assembly also shows: an int_aarch64_neon_sqadd and an int_aarch64_neon_sqdmulls_scalar fuse together into a SQDMLAL.

So, then, how do we make llvm.sadd.sat.v2i64 do the same thing?


After some discussion in the issue thread, we actually settled on optimizing llvm.sadd.sat.v2i64, and having the platform-specific llvm.aarch64.neon.sqadd.v2i64 map to it. That way the aarch64 backend might benefit from more general optimizations that apply to llvm.sadd.sat.

// in AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN
  case Intrinsic::aarch64_neon_sqadd:
    if (Op.getValueType().isVector())
      return DAG.getNode(ISD::SADDSAT, dl, Op.getValueType(), Op.getOperand(1),
                         Op.getOperand(2));

This bit of code maps Intrinsic::aarch64_neon_sqadd to an ISD::SADDSAT DAG node. The cross-platform llvm.sadd.sat automatically maps to such a DAG node.

Just making this change breaks the optimization, because int_aarch64_neon_sqadd is now rewritten and never makes it to the phase where the patterns to combine it with sqdmull would apply.

In fact we've broken instruction selection for saturating addition entirely: the backend does not know what instruction to emit for saddsat. That can be fixed by having saddsat map to the SQADD instruction.

@ llvm/lib/Target/AArch64/AArch64InstrInfo.td
- defm SQADD    : SIMDThreeSameVector<0,0b00001,"sqadd", int_aarch64_neon_sqadd>;
+ defm SQADD    : SIMDThreeSameVector<0,0b00001,"sqadd", saddsat>;

Finally, the fusing with the multiplication can be recovered with:

- defm SQDMLAL : SIMDLongThreeVectorSQDMLXTiedHS<0, 0b1001, "sqdmlal",
-                                                int_aarch64_neon_sqadd>;
+ defm SQDMLAL : SIMDLongThreeVectorSQDMLXTiedHS<0, 0b1001, "sqdmlal", saddsat>;

Running tests

We appended new tests to the llvm/test/CodeGen/AArch64/arm64-vmul.ll test file. This file starts with the following lines:

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=aarch64-none-elf -mattr=+aes < %s | FileCheck %s --check-prefixes=CHECK,CHECK-SD
; RUN: llc -mtriple=aarch64-none-elf -mattr=+aes -global-isel -global-isel-abort=2 2>&1 < %s | FileCheck %s --check-prefixes=CHECK,CHECK-GI

This test can be executed from the build directory with:

bin/llvm-lit -v ../llvm/test/CodeGen/AArch64/arm64-vmul.ll

The actual tests look like this, with assertions about the generated output in comments:

define <4 x i32> @sqdmlal4s_lib(<4 x i32> %dst, <4 x i16> %v1, <4 x i16> %v2) {
; CHECK-LABEL: sqdmlal4s_lib:
; CHECK:       // %bb.0:
; CHECK-NEXT:    sqdmlal.4s v0, v1, v2
; CHECK-NEXT:    ret
  %tmp  = call <4 x i32> @llvm.aarch64.neon.sqdmull.v4i32(<4 x i16> %v1, <4 x i16> %v2)
  %sum = call <4 x i32> @llvm.sadd.sat.v4i32(<4 x i32> %dst, <4 x i32> %tmp)
  ret <4 x i32> %sum
}

The CHECKs are machine-generated with update_llc_test_checks.py:

llvm/utils/update_llc_test_checks.py --llc-binary=build/bin/llc llvm/test/CodeGen/AArch64/arm64-vmul.ll --force-update

NOTE: this is just the workflow that I've been able to cobble together; these workflows are really hard to discover. Something I enjoyed in the early days of COVID was watching folks stream their programming: you learn things, both in terms of tooling and process, that just can't really be written down. I'd love to watch over the shoulder of someone experienced while they fix some LLVM bugs, something along the lines of "Everything I know about debugging LLVM".

Conclusion

After an LLVM PR gets merged, we get to wait up to 6 months for the next LLVM release to make it into nightly Rust. Finally, a year and a half after first looking at this problem in stdarch, we were able to merge PR 1575, which makes the simplifications, in late August 2025.

The same LLVM bump brought some similar improvements to s390x in PR 1903, and I've since made some further PRs for s390x, powerpc and wasm32 to hopefully reduce the number of target-specific intrinsics even further. Once you're familiar with some of the basic concepts, there is a lot of low-hanging fruit to pick.

I mostly found working with C++ itself to be fine: there is lots of code to steal and pattern-match from. What really bites is the lack of good error messages. It's surprising that the companies with these enormous C++ codebases haven't been able to coordinate something here; Rust very clearly demonstrates that there is enormous value in thoughtful error messages.

I think there is a lot of value for Rust in us actually being able to go and fix more of the LLVM issues we run into: past experience shows that LLVM issues often collect dust if we don't chase them down ourselves. It's not easy, but I'd encourage anyone who's curious to give it a go.
