SIMD in zlib-rs (part 2): compare256

Folkert
Systems software engineer

In part 1 of the "SIMD in zlib-rs" series, we saw that, with a bit of nudging, autovectorization can produce optimal code for some problems.

But that does not always work: clever programmers using explicit SIMD can still beat the compiler. This time we'll look at a problem where the compiler is not currently capable of using the SIMD capabilities of modern CPUs effectively.

This article was also posted on the Trifecta Tech Foundation blog, the non-profit backed by Tweede golf and the long-term home for open infrastructure projects like zlib-rs.

The problem

The problem at hand is the compare256 function below: given two [u8; 256] values, it counts, starting from the left, how many positions contain matching elements before the first mismatch. This function is at the core of the zlib-rs compression algorithm that finds repetitions in the input.

https://godbolt.org/z/7nzYKrhrT

pub fn compare256(src0: &[u8; 256], src1: &[u8; 256]) -> usize {
    src0.iter().zip(src1).take_while(|(x, y)| x == y).count()
}
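
As a quick sanity check of that contract, consider this made-up example (not from the zlib-rs test suite): the buffers below agree on indices 0 through 9 and differ at index 10.

fn main() {
    let a = [0u8; 256];
    let mut b = [0u8; 256];
    b[10] = 1;

    // ten matching bytes before the first mismatch
    assert_eq!(compare256(&a, &b), 10);

    // identical inputs match on all 256 positions
    assert_eq!(compare256(&a, &a), 256);
}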

Compiled for x86_64, compare256 generates the following assembly:

compare256:
        xor     eax, eax
.LBB0_1:
        movzx   ecx, byte ptr [rdi + rax]
        cmp     cl, byte ptr [rsi + rax]
        jne     .LBB0_6
        movzx   ecx, byte ptr [rdi + rax + 1]
        cmp     cl, byte ptr [rsi + rax + 1]
        jne     .LBB0_3
        movzx   ecx, byte ptr [rdi + rax + 2]
        cmp     cl, byte ptr [rsi + rax + 2]
        jne     .LBB0_4
        movzx   ecx, byte ptr [rdi + rax + 3]
        cmp     cl, byte ptr [rsi + rax + 3]
        jne     .LBB0_5
        add     rax, 4
        cmp     rax, 256
        jne     .LBB0_1
        mov     eax, 256
        ret
.LBB0_3:
        inc     rax
        ret
.LBB0_4:
        add     rax, 2
        ret
.LBB0_5:
        add     rax, 3
.LBB0_6:
        ret

No autovectorization occurs this time (no xmm registers are used), but the compiler did unroll the loop 4 times (processing 4 bytes per iteration) and is reasonably smart about doing the counting.

Altogether not terrible, but we can do much, much better with explicit SIMD.

The idea

Let's consider an example using a hypothetical 32-bit SIMD vector. That means we iterate over the input in chunks of four 8-bit elements, e.g.:

let src0 = b"abcd"; 
let src1 = b"abce"; 

The chunks are loaded into the vector register like so (the chunk is now treated like a number, so the "lower" elements are on the right):

[b'd', b'c', b'b', b'a']
[b'e', b'c', b'b', b'a']

Next we perform an element-wise equality operation. The output vector contains at each position:

  • 0xFF (all ones in binary) if the elements were equal
  • 0x00 (all zeros in binary) if the elements were different

This element-wise equality is the crucial step that non-SIMD code cannot perform elegantly and efficiently. For our example, the output is:

[0x00, 0xFF, 0xFF, 0xFF]

A vector where every element is either all zeros (0x00) or all ones (0xFF) is called a mask vector. Next, there is an operation to turn this SIMD mask vector into an integer, resulting in:

0b0111

Finally, the .trailing_ones method can be used to count how many elements matched before the first one that didn't. For our example it returns 3, and indeed there are 3 matching characters before the first non-matching character.
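
In plain, non-SIMD Rust, the whole idea for this hypothetical 4-element vector can be sketched as follows (the compare4 name is made up; real SIMD performs the comparison in a single instruction instead of a loop):

fn compare4(src0: &[u8; 4], src1: &[u8; 4]) -> u32 {
    // build a mask with one bit per matching element
    let mut mask = 0u32;
    for i in 0..4 {
        if src0[i] == src1[i] {
            mask |= 1 << i;
        }
    }

    // for b"abcd" vs. b"abce" the mask is 0b0111, and trailing_ones
    // returns 3: three elements match before the first mismatch
    mask.trailing_ones()
}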

Now for real

Next let's actually implement this on the real 128-bit SIMD registers that modern CPUs have. We'll use these imports:

use core::arch::x86_64::{
    __m128i,
    _mm_cmpeq_epi8,
    _mm_loadu_si128,
    _mm_movemask_epi8,
};

  • __m128i is a 128-bit integer type. It is stored in the xmm registers and used by the intrinsics;
  • _mm_loadu_si128 loads an __m128i from a pointer;
  • _mm_cmpeq_epi8 performs element-wise equality;
  • _mm_movemask_epi8 converts the output of _mm_cmpeq_epi8 (a __m128i) into an integer, with each bit indicating whether the element at that position was equal; our code casts it to a u16, one bit per 8-bit element.
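
To see the mask in action, here is a small demo (made up for this post) of the three operations on two chunks that differ in a single byte. It assumes an x86_64 target, where sse2 is always available:

use core::arch::x86_64::{__m128i, _mm_cmpeq_epi8, _mm_loadu_si128, _mm_movemask_epi8};

unsafe fn movemask_demo() {
    // the two 16-byte chunks differ at index 8 ('i' vs. 'X')
    let a = _mm_loadu_si128(b"abcdefghijklmnop".as_ptr() as *const __m128i);
    let b = _mm_loadu_si128(b"abcdefghXjklmnop".as_ptr() as *const __m128i);

    let mask = _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) as u16;

    // every bit is set except bit 8
    assert_eq!(mask, 0b1111_1110_1111_1111);
    assert_eq!(mask.trailing_ones(), 8);
}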

With those ingredients, this is our implementation:

https://godbolt.org/z/11z1GavW9

#[target_feature(enable = "sse2,bmi1")]
pub unsafe fn compare256(src0: &[u8; 256], src1: &[u8; 256]) -> usize {
    let src0 = src0.chunks_exact(16);
    let src1 = src1.chunks_exact(16);

    let mut len = 0;

    unsafe {
        for (chunk0, chunk1) in src0.zip(src1) {
            // load the next chunks into a simd register
            let xmm_src0 = _mm_loadu_si128(chunk0.as_ptr() as *const __m128i);
            let xmm_src1 = _mm_loadu_si128(chunk1.as_ptr() as *const __m128i);

            // element-wise compare of the 8-bit elements
            let xmm_cmp = _mm_cmpeq_epi8(xmm_src0, xmm_src1);

            // turn a 16 * 8-bit vector into a 16-bit integer.
            // a bit in the output is set if the corresponding element is non-zero.
            let mask = _mm_movemask_epi8(xmm_cmp) as u16;

            if mask != 0xFFFF /* i.e. all 1 bits */ {
                let match_byte = mask.trailing_ones();
                return len + match_byte as usize;
            }

            len += 16;
        }
    }

    256
}
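
Note that because of the target_feature attribute, this function is unsafe to call: the caller must guarantee that the CPU supports sse2 and bmi1. A minimal sketch of a safe wrapper could look like this (the compare256_checked name is made up; how to do this dispatch efficiently is the topic of the next post):

pub fn compare256_checked(src0: &[u8; 256], src1: &[u8; 256]) -> usize {
    if is_x86_feature_detected!("sse2") && is_x86_feature_detected!("bmi1") {
        // SAFETY: we just verified that the required features are available
        unsafe { compare256(src0, src1) }
    } else {
        // the scalar version from the start of this post
        src0.iter().zip(src1).take_while(|(x, y)| x == y).count()
    }
}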

The assembly is too long to include here because the loop is unrolled 16 times, so that the whole 256-byte input is processed without any loop logic. Here is the final part (processing the 16th chunk):

.LBB0_30:
        movdqu  xmm0, xmmword ptr [rdi + 240]
        movdqu  xmm1, xmmword ptr [rsi + 240]
        pcmpeqb xmm1, xmm0
        pmovmskb        eax, xmm1
        cmp     eax, 65535       ; this is 0xFFFF
        je      .LBB0_34
        mov     ecx, 240
.LBB0_32:
        not     eax
        or      eax, 65536
        tzcnt   eax, eax         ; finds the first zero bit
        or      rax, rcx         ; really this is an add
        ret
.LBB0_34:
        mov     eax, 256
        ret

Some observations:

  • _mm_loadu_si128 is lowered to the movdqu instruction;
  • _mm_cmpeq_epi8 is lowered as pcmpeqb;
  • _mm_movemask_epi8 is lowered as pmovmskb; note how its input is xmm1 (a SIMD register) and its output is eax (a standard register);
  • When it is safe (no overlapping bits), the compiler will sometimes use a bitwise or instead of an add instruction, as the example below shows.
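
The or trick works here because len is always a multiple of 16 (its low four bits are zero), while the match offset within a chunk is at most 15 (only its low four bits can be set), so the two operands never share a set bit:

fn main() {
    // 240 is 0b1111_0000 and 8 is 0b0000_1000: no overlapping bits,
    // so addition and bitwise or give the same result
    let len: u64 = 240;
    let match_byte: u64 = 8;
    assert_eq!(len + match_byte, len | match_byte);
}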

Benchmarks

The manual SIMD implementation is extremely effective: on these simple inputs it is roughly 6x faster than the scalar version. The benchmark setup can be found here.

name                         fastest   slowest   median    mean      samples  iters
compare256_new equal         6.745 ns  12.22 ns  7.374 ns  7.408 ns  100      51200
compare256_new early return  4.066 ns  4.165 ns  4.124 ns  4.125 ns  100      51200
compare256_old equal         45.46 ns  85.59 ns  45.81 ns  48.16 ns  100      3200
compare256_old early return  24.28 ns  48.53 ns  24.35 ns  24.59 ns  100      12800

Conclusion

This post shows a basic, but very effective, example of a custom SIMD implementation beating the compiler. Examples for other instruction sets (avx2, neon, wasm32) can be found in compare256.rs.
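
To give an impression, here is a rough sketch of what an avx2 version can look like, using 32-byte chunks so that the mask becomes a u32 (an illustration in the style of this post, not the exact code from compare256.rs):

use core::arch::x86_64::{
    __m256i, _mm256_cmpeq_epi8, _mm256_loadu_si256, _mm256_movemask_epi8,
};

#[target_feature(enable = "avx2")]
pub unsafe fn compare256_avx2(src0: &[u8; 256], src1: &[u8; 256]) -> usize {
    let mut len = 0;

    for (chunk0, chunk1) in src0.chunks_exact(32).zip(src1.chunks_exact(32)) {
        let ymm_src0 = _mm256_loadu_si256(chunk0.as_ptr() as *const __m256i);
        let ymm_src1 = _mm256_loadu_si256(chunk1.as_ptr() as *const __m256i);

        // element-wise compare of the 8-bit elements
        let ymm_cmp = _mm256_cmpeq_epi8(ymm_src0, ymm_src1);

        // turn the 32-element mask vector into a 32-bit integer
        let mask = _mm256_movemask_epi8(ymm_cmp) as u32;

        if mask != u32::MAX {
            return len + mask.trailing_ones() as usize;
        }

        len += 32;
    }

    256
}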

Next time: given our beautiful hand-crafted SIMD solutions, how do we efficiently pick the right one to use for the actual CPU we're running on?
