Debloat your async Rust

Dion
Embedded software engineer
Async Rust is amazing, but far from flawless. In this blog, I'll walk you through the current struggles and possible solutions.

So async Rust is amazing. It makes it possible to write code that can run concurrently with other code without having to manage dozens of manually written state machines.

This is great for resource use. While waiting on data on a socket, the code won't spin or block a thread. Instead another task can run and do something useful. And when there are no tasks to run, the processor doesn't have to do anything, which saves power.

Async Rust can be pretty difficult at times, since you need to know how it works to be proficient with it. Of course, if you give yourself time to learn, that will sort itself out eventually.

So when you've got a decent competency with async, you start using it in many places. Then after a while, especially when your project grows, you'll notice something isn't quite right.

What that something is depends on what you're building.

  • Maybe you're writing a serverless app and notice the program is using way more RAM than is sensible. This is increasing your monthly bill.
  • Or you're writing a high performance video streaming server, except you can't quite get the performance you were looking for.
  • On embedded, the code size may be bigger than expected, and eventually it doesn't fit your devices anymore.

All of these common issues have the same source: async bloat.

What is async bloat not?

Some of these problems are expected. Not all situations where async is slower or bigger are actual bloat. We're explicitly asking for the compiler to generate a state machine by using async/await, so it's unfair to expect async code to have the exact same properties as blocking code.

Let's take some async code as an example and then translate that into a handwritten state machine to compare.

async fn foo(num: i32) -> i32 {
    bar().await;
    let result = quux(num).await;
    result * 2
}

We see an async function with two await points. We can model this with an enum as a state machine:

// The function the user calls. It returns an instance of a future
fn foo(num: i32) -> FooFut {
    // We just construct the future. The first work only happens after the first poll
    FooFut::Unresumed(num)
}

// Our future type
enum FooFut {
    Unresumed(i32),
    Suspend0(i32, BarFut),
    Suspend1(QuuxFut),
    Returned,
}

impl Future for FooFut {
    type Output = i32;
    
    // We're ignoring pinning for brevity
    fn poll(self: &mut Self, cx: &mut Context<'_>) -> Poll<Self::Output> {
        loop {
            match self {
                Self::Unresumed(num) => {
                    // Get the bar future
                    let bar = bar();
                    // `num` is a `&mut i32` here, so copy the value out
                    *self = Self::Suspend0(*num, bar);
                }
                Self::Suspend0(num, bar) => {
                    // bar.await
                    if bar.poll(cx).is_pending() {
                        return Poll::Pending;
                    }

                    // Get the quux future
                    let quux = quux(*num);
                    *self = Self::Suspend1(quux);
                }
                Self::Suspend1(quux) => {
                    // quux.await
                    return match quux.poll(cx) {
                        Poll::Pending => Poll::Pending,
                        Poll::Ready(result) => {
                            *self = Self::Returned;
                            // Done, but do the multiply by 2
                            Poll::Ready(result * 2)
                        },
                    };
                }
                Self::Returned => {
                    panic!("Polled after future has returned ready");
                }
            }
        }    
    }
}

Even for a reasonable implementation, this is a lot more code than the blocking version would be. We need to juggle the state and may have to poll bar and quux many times (potentially forever, if they never complete).
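The enum above skips pinning for brevity. Here is a sketch that does compile and run, under the assumption that bar and quux return std::future::Ready. That type is Unpin, so the enum can be reached through Pin::get_mut. The bodies of bar and quux, the NoopWaker type and the block_on helper are made up for this example and are not what the compiler generates.

```rust
use std::future::{ready, Future, Ready};
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical stand-ins for the futures of `bar` and `quux`.
type BarFut = Ready<()>;
type QuuxFut = Ready<i32>;

fn bar() -> BarFut {
    ready(())
}

fn quux(num: i32) -> QuuxFut {
    ready(num + 1)
}

enum FooFut {
    Unresumed(i32),
    Suspend0(i32, BarFut),
    Suspend1(QuuxFut),
    Returned,
}

fn foo(num: i32) -> FooFut {
    FooFut::Unresumed(num)
}

impl Future for FooFut {
    type Output = i32;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<i32> {
        // Every field is `Unpin`, so `FooFut` is `Unpin` and `get_mut` is available.
        let this = self.get_mut();
        loop {
            match this {
                FooFut::Unresumed(num) => {
                    let num = *num;
                    *this = FooFut::Suspend0(num, bar());
                }
                FooFut::Suspend0(num, bar_fut) => {
                    if Pin::new(bar_fut).poll(cx).is_pending() {
                        return Poll::Pending;
                    }
                    let quux_fut = quux(*num);
                    *this = FooFut::Suspend1(quux_fut);
                }
                FooFut::Suspend1(quux_fut) => {
                    return match Pin::new(quux_fut).poll(cx) {
                        Poll::Pending => Poll::Pending,
                        Poll::Ready(result) => {
                            *this = FooFut::Returned;
                            Poll::Ready(result * 2)
                        }
                    };
                }
                FooFut::Returned => panic!("polled after completion"),
            }
        }
    }
}

// A waker that does nothing; enough for a demo executor.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

// Minimal busy-polling executor, only suitable for this demo.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    // (20 + 1) * 2
    println!("{}", block_on(foo(20))); // prints 42
}
```

Real futures are usually not Unpin, which is why compiler-generated state machines need the pinning machinery this sketch sidesteps.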

This is what you're asking for when using async Rust.

So let's see when using async can be very inopportune and what the alternatives are.

Async, but no awaits

What does this function do?

async fn foo() -> i32 {
    5
}

It returns 5, right?

Well, no, it returns a state machine that, when polled to completion, will return 5. But nobody would write this. So why can it show up in codebases regardless?
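You can see this by polling the future by hand. A minimal sketch using only std; NoopWaker and poll_once are helper names made up for this example:

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

async fn foo() -> i32 {
    5
}

// A waker that does nothing; good enough for polling by hand.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

// Polls a future exactly once and returns whatever `poll` returned.
fn poll_once<F: Future>(fut: F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    let result = fut.as_mut().poll(&mut cx);
    result
}

fn main() {
    // Calling `foo` only constructs the future; the body has not run yet.
    let fut = foo();
    // The value only appears once something polls the future.
    assert_eq!(poll_once(fut), Poll::Ready(5));
}
```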

This happens most with traits where there's an abstraction that needs to be async sometimes, but not always.

Let's illustrate it with a config that can be loaded from disk or from a database. But you may have an implementation of that trait that doesn't load anything and just returns a value:

struct Config { /*...*/ }

trait ConfigLoader {
    async fn load(&mut self) -> Result<Config, Error>;
}

impl ConfigLoader for FileLoader { /*...*/ }
impl ConfigLoader for DbLoader { /*...*/ }
impl ConfigLoader for DefaultLoader {
    async fn load(&mut self) -> Result<Config, Error> {
        Ok(Config::new())
    }
}

async fn run_system(mut loader: impl ConfigLoader) -> Result<(), Error> {
    // ...
    let config = loader.load().await?;
    // ...
}

So when the DefaultLoader is passed to run_system, we're still polling a state machine. That costs more CPU time, RAM and code size than simply returning the value.

So how can we fix this?

It boils down to finding a way to get rid of the state machine.

One way would be to change the loader parameter to an option. If there is no loader, the function would instead use a default value. A different way is to keep the structure as is, but manually provide a much simpler future that doesn't really have a state machine.

impl ConfigLoader for DefaultLoader {
    // Manually desugared async function so we get control of what's being returned
    fn load(&mut self) -> impl Future<Output = Result<Config, Error>> {
        // Use the `Ready` type from std
        std::future::ready(Ok(Config::new()))
    }
}

ready just returns the value when polled. This is much smaller than a real state machine. It is also what a new Clippy lint suggests; my colleague Wouter opened a PR for it.
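To try this out end to end, here is a self-contained sketch. The Config and Error definitions, the NoopWaker type and the poll_once helper are stand-ins invented for this example; only the shape of the trait and the ready-based impl come from the snippets above.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical stand-ins for the types from the snippets above.
#[derive(Debug, Default, PartialEq)]
struct Config {
    verbose: bool,
}

#[derive(Debug, PartialEq)]
struct Error;

trait ConfigLoader {
    async fn load(&mut self) -> Result<Config, Error>;
}

struct DefaultLoader;

impl ConfigLoader for DefaultLoader {
    // Not an `async fn`: we hand out a `Ready` future instead of a state machine.
    fn load(&mut self) -> impl Future<Output = Result<Config, Error>> {
        std::future::ready(Ok(Config::default()))
    }
}

async fn run_system(mut loader: impl ConfigLoader) -> Result<Config, Error> {
    let config = loader.load().await?;
    Ok(config)
}

// A waker that does nothing; good enough for polling by hand.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

// Polls a future exactly once.
fn poll_once<F: Future>(fut: F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    let result = fut.as_mut().poll(&mut cx);
    result
}

fn main() {
    // `ready` never returns `Pending`, so a single poll is enough here.
    let result = poll_once(run_system(DefaultLoader));
    assert_eq!(result, Poll::Ready(Ok(Config::default())));
}
```

Callers such as run_system don't change at all: they still just await load.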

Async pass-through

In Rust we love to use abstractions. They're great! We use them a bunch in embedded too to do hardware abstractions. But with async Rust they can come at a cost.

Let's look at the embedded-hal-async trait for I2c:

/// Async I2c.
pub trait I2c<A: AddressMode = SevenBitAddress>: ErrorType {
    /* Skipped provided methods */

    async fn transaction(
        &mut self,
        address: A,
        operations: &mut [Operation<'_>],
    ) -> Result<(), Self::Error>;
}

This is a nice trait that lets us do I2c transactions on almost any hardware. Every hardware abstraction layer (HAL) will implement the trait for their I2c drivers. The one from embassy-stm32 looks like this (adapted for brevity):

impl<'d, IM: MasterMode> I2c for I2c<'d, Async, IM> {
    async fn transaction(
        &mut self,
        address: u8,
        operations: &mut [Operation<'_>],
    ) -> Result<(), Self::Error> {
        self.transaction(address, operations).await
    }
}

The I2c driver already has its own transaction method, and the trait impl simply calls it.

So what does this implementation do?

Pseudo:

enum TraitTransactionFut {
    Unresumed(self, address, operations),
    Polling(DriverTransactionFut), // Polling the future of the driver
    Returned,
}

Well, it does exactly what we're asking of it, which is to create a state machine that calls another state machine. That's more state machine than we need!

We want less bloat, so that means fewer state machines. So instead, we must tell the compiler what we actually want. Our function doesn't need to be async and can simply return the state machine that does need to be polled.

impl<'d, IM: MasterMode> I2c for I2c<'d, Async, IM> {
    // No longer async
    // This is allowed in Rust even when the trait does use async
    fn transaction(
        &mut self,
        address: u8,
        operations: &mut [Operation<'_>],
    ) -> impl Future<Output = Result<(), Self::Error>> {
    //   ^^^^^^^^^^^ Now returning impl Future
        self.transaction(address, operations)
        // No more await                     ^^^^^^
    }
}

Now we're simply forwarding the future and no extra state machine is created.
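One way to observe the difference is to compare the sizes of the two futures. The sketch below uses a made-up Driver type and a simplified Bus trait rather than the real embedded-hal-async and embassy types, so read it as an illustration of the pattern only. No numbers are given because the exact sizes depend on the compiler version and target.

```rust
use std::future::Future;
use std::mem::size_of_val;

// Hypothetical driver with its own async method, standing in for a HAL driver.
struct Driver;

impl Driver {
    async fn transaction(&mut self, address: u8, data: &mut [u8]) -> Result<(), ()> {
        // Stand-in for real bus I/O.
        std::future::ready(()).await;
        data.fill(address);
        Ok(())
    }
}

// Hypothetical, simplified version of an async bus trait.
trait Bus {
    async fn transaction(&mut self, address: u8, data: &mut [u8]) -> Result<(), ()>;
}

struct Wrapped(Driver);
struct Forwarded(Driver);

impl Bus for Wrapped {
    // Async wrapper: a second state machine around the driver's future.
    async fn transaction(&mut self, address: u8, data: &mut [u8]) -> Result<(), ()> {
        self.0.transaction(address, data).await
    }
}

impl Bus for Forwarded {
    // Pass-through: hands out the driver's future directly.
    fn transaction(
        &mut self,
        address: u8,
        data: &mut [u8],
    ) -> impl Future<Output = Result<(), ()>> {
        self.0.transaction(address, data)
    }
}

// Returns (size of the wrapped future, size of the forwarded future) in bytes.
fn future_sizes() -> (usize, usize) {
    let mut buf = [0u8; 4];
    let mut wrapped = Wrapped(Driver);
    let mut forwarded = Forwarded(Driver);
    let wrapped_size = size_of_val(&wrapped.transaction(0x42, &mut buf));
    let forwarded_size = size_of_val(&forwarded.transaction(0x42, &mut buf));
    (wrapped_size, forwarded_size)
}

fn main() {
    let (wrapped, forwarded) = future_sizes();
    println!("wrapped: {wrapped} bytes, forwarded: {forwarded} bytes");
}
```

The wrapped future has to store the driver's future plus its own state, so it can never be smaller than the forwarded one.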

It's simple when we're purely forwarding, but a bit harder if the function has a 'preamble' and/or 'postamble' where stuff needs to be done before or after the await point.

For example:

async fn foo() -> i32 {
    let a = quux(); // <- Not async
    let num = bar(a).await;
    num * 2
}

We can't pull off the same trick exactly. That is, not without the commonly used futures crate. That crate provides us with a method that would help us here for the postamble.

use futures::future::FutureExt;

fn foo() -> impl Future<Output = i32> {
    let a = quux();
    bar(a).map(|num| num * 2)
}

Here we again forward the future, but map the output. This makes it so we still don't need an additional state machine.

There are more handy extension methods in the FutureExt trait that can help with async bloat.

Do note however, that we changed the behavior for the preamble execution here! Normally a future is evaluated lazily. That means that quux is not executed until the future is polled. But in our transformed version, quux gets executed immediately. This is almost never an issue, but it's something to be aware of.
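The difference in preamble timing can be made visible with a call counter. In this sketch the counter parameter, the bodies of quux and bar, and the names lazy_foo, eager_foo and calls_before_poll are all invented for illustration. The postamble is left out so that the example needs nothing beyond std.

```rust
use std::cell::Cell;
use std::future::Future;

// Hypothetical preamble that records how often it was called.
fn quux(calls: &Cell<u32>) -> i32 {
    calls.set(calls.get() + 1);
    21
}

async fn bar(a: i32) -> i32 {
    a * 2
}

// Regular async fn: the preamble runs on the first poll.
async fn lazy_foo(calls: &Cell<u32>) -> i32 {
    let a = quux(calls);
    bar(a).await
}

// Pass-through version: the preamble runs as soon as the function is called.
fn eager_foo(calls: &Cell<u32>) -> impl Future<Output = i32> {
    let a = quux(calls);
    bar(a)
}

// Creates both futures without polling them and reports the quux call counts.
fn calls_before_poll() -> (u32, u32) {
    let lazy_calls = Cell::new(0);
    let eager_calls = Cell::new(0);
    let _lazy = lazy_foo(&lazy_calls);
    let _eager = eager_foo(&eager_calls);
    (lazy_calls.get(), eager_calls.get())
}

fn main() {
    let (lazy, eager) = calls_before_poll();
    println!("quux calls before polling: lazy = {lazy}, eager = {eager}");
    // prints "quux calls before polling: lazy = 0, eager = 1"
}
```

If the preamble has side effects that must not happen before the first poll, keep the function async.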

Smaller state machines are better state machines

The Rust compiler backend, LLVM, is really smart. So there are a lot of code patterns that get optimized for us.

For example:

pub fn process_command() {
    match get_command() {
        CommandId::A => send_response(123),
        CommandId::B => send_response(456),
    }
}

Once compiled with optimizations turned on, the compiler generates the following assembly:

; x86-64
process_command:
        push    rax
        call    qword ptr [rip + get_command@GOTPCREL]
        test    al, al ; match on enum
        mov     eax, 456 ; prepare B value
        mov     edi, 123 ; prepare A value
        cmovne  edi, eax ; use B value if discriminant is 1 (B)
        pop     rax
        ; call send_response with edi as argument
        jmp     qword ptr [rip + send_response@GOTPCREL]

This is really nice code! It calculates the value beforehand and then does one call to send_response.

Let's translate it to async now and examine that:

pub async fn process_command() {
    match get_command().await {
        CommandId::A => send_response(123).await,
        CommandId::B => send_response(456).await,
    }
}

This is the same code, but made async. What's the assembly of the poll implementation on the generated future? (-O2 for more clarity)

example::run2::h225245dd7a7d0876:
        push    rbx
        mov     rbx, rdi
        movzx   eax, byte ptr [rdi]
        lea     rcx, [rip + .LJTI2_0] ; Load jump table pointer
        movsxd  rax, dword ptr [rcx + 4*rax] ; Load jump address based on state of future
        add     rax, rcx
        jmp     rax ; Do the jump
.LBB2_4: ; unresumed
        mov     byte ptr [rbx + 1], 0
.LBB2_5: ; get_command
        lea     rdi, [rbx + 1]
        call    example::get_command::{{closure}}::h8b5d742a2e42c0db
        jmp     .LBB2_8
.LBB2_7: ; send_response A or B
        lea     rdi, [rbx + 4]
        call    example::send_response::{{closure}}::h3ca77514be95ddc4
        jmp     .LBB2_8
.LBB2_1:
        ; ... panic
.LBB2_2:
        ; ... panic
.LBB2_11: ; send_response B or A
        lea     rdi, [rbx + 4]
        call    example::send_response::{{closure}}::h3ca77514be95ddc4
.LBB2_8: ; Some sort of corruption guard
        ud2
        jmp     .LBB2_10
        jmp     .LBB2_10
.LBB2_10: ; Some sort of corruption guard
        mov     byte ptr [rbx], 2
        mov     rdi, rax
        call    _Unwind_Resume@PLT
.LJTI2_0: ; Jump table
        .long   .LBB2_4-.LJTI2_0
        .long   .LBB2_1-.LJTI2_0
        .long   .LBB2_2-.LJTI2_0
        .long   .LBB2_5-.LJTI2_0
        .long   .LBB2_7-.LJTI2_0
        .long   .LBB2_11-.LJTI2_0

Oh dear... I've tried to annotate what's going on and removed the panicking paths for extra clarity.

In any case, the jump table shows we've got 6 states. 3 are there by default (unresumed, returned and panicked), so the other 3 are from our code. That means every await point has gotten its own state and the call to send_response is duplicated. We've got 2 calls instead of 1.

Instead we can change the code to this:

pub async fn process_command() {
    let response = match get_command().await {
        CommandId::A => 123,
        CommandId::B => 456,
    };
    send_response(response).await;
}

This prevents one of the states being generated. The generated assembly is very similar, except it has one fewer state.

That doesn't sound like much. The original had 58 lines of assembly and our optimized version has 52, which is a reduction of ~10%. But remember that optimizations stack, so this optimized version may allow the compiler to optimize something else which makes the impact bigger. This is especially true in larger async functions.

What variables are part of the state machine anyways?

Code needs data to operate on, so the generated state machines need to store the variables they need.

Luckily, only the data we actually need in the state machine is being captured, i.e., initially the variables captured by an async block or the parameters of an async fn. After that, any variable that's kept across an await point will be stored in the state machine.

The way the variables are allocated into the state machine is quite sub-optimal, but luckily there's a Rust PR in the works that tackles that problem.

Still, we want to be cognizant of the data we store in the futures because things can get out of hand (even when the compiler does it optimally). It continues the path of 'you get what you ask for'.

Consider the following code:

async fn foo_big(mut buffer: [u8; 1024]) {
    let result = fill_async(&mut buffer).await;
    println!("{result}");
}

async fn foo_small(buffer: &mut [u8; 1024]) {
    let result = fill_async(buffer).await;
    println!("{result}");
}

async fn fill_async(buffer: &mut [u8]) -> u8 {
    todo!()
}

fn main() {
    println!(
        "big: {}, small: {}",
        std::mem::size_of_val(&foo_big([0; 1024])),
        std::mem::size_of_val(&foo_small(&mut [0; 1024]))
    )
}

What would you expect the output to be?

When compiled with Rust 1.94 on 64-bit Linux, it prints: big: 2080, small: 40.

This means there's space allocated for buffer twice instead of just once.

Both async functions will get you the same end result, but the big variant will use orders of magnitude more memory. Not only that, the memory also needs to be lugged around. Less data will be in registers, more data will need to be memcpy'd and the optimizer will have fewer opportunities to do its thing.

This example uses a byte array, but the issue is there with big structs too.

So, when working with futures, pass references to large variables instead of moving the variables in.
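The same comparison for a struct could look like the sketch below. Telemetry, flush and the two process functions are hypothetical names invented for this example. The exact sizes depend on the compiler and target, so the example prints them instead of promising specific values.

```rust
use std::mem::{size_of, size_of_val};

// Hypothetical large struct.
struct Telemetry {
    samples: [u32; 512],
    count: usize,
}

async fn flush(telemetry: &mut Telemetry) -> u32 {
    // Stand-in for real async I/O.
    std::future::ready(()).await;
    let sum = telemetry.samples[..telemetry.count].iter().sum();
    telemetry.count = 0;
    sum
}

// Takes ownership: the whole struct lives inside the future.
async fn process_owned(mut telemetry: Telemetry) -> u32 {
    flush(&mut telemetry).await
}

// Borrows: the future only stores a reference to the struct.
async fn process_borrowed(telemetry: &mut Telemetry) -> u32 {
    flush(telemetry).await
}

// Returns (size of the owning future, size of the borrowing future) in bytes.
fn future_sizes() -> (usize, usize) {
    let mut telemetry = Telemetry { samples: [0; 512], count: 0 };
    let borrowed = size_of_val(&process_borrowed(&mut telemetry));
    let owned = size_of_val(&process_owned(telemetry));
    (owned, borrowed)
}

fn main() {
    let (owned, borrowed) = future_sizes();
    println!(
        "struct: {} bytes, owned future: {owned} bytes, borrowed future: {borrowed} bytes",
        size_of::<Telemetry>()
    );
}
```

The owning future can never be smaller than the struct itself, while the borrowing one stays at a handful of pointers and state bytes.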

Wrap-up

I've laid out some tips to battle the bloat that async Rust can bring along.

To summarize, they are:

  • Avoid async fns that don't actually need to be async
  • Use -> impl Future for async 'pass-through' functions
  • Use the futures crate to allow more async fns to be rewritten as a 'pass-through' function
  • When possible, refactor code to share await points
  • Pass references to large variables instead of moving them in

If you're wondering why you need to debloat your async code yourself at all instead of having the compiler do it for you, you're not alone! This has been a known issue for some time already.

A while ago I took some time to read through the compiler code that deals with async. To my surprise, I found that async Rust never left the MVP state.

But I've come up with a plan! Stay tuned for the next blog post to find out what that is.

Until I've had the opportunity to execute my plan, at least now you know how to improve your async Rust in your own code.
