The promise of Rust async-await for embedded

Wouter
Embedded software consultant
Typically embedded devices are developed using C++. At Tweede golf we have chosen to use Rust instead for implementing our embedded devices. This is controversial as the embedded hardware field is generally quite conservative. Convincing our clients to adopt Rust for their products can be a challenge.

The main reason we choose Rust over C++ to develop our devices is that the language offers more secure semantics and primitives to develop with:

  • It is a strongly typed language with no implicit conversions.
  • Uninitialized memory or null pointers are explicit.
  • There are strong notions of ownership, borrowing and mutability.
  • Important return values must be consumed and cannot be dropped by accident.
  • Pattern matching is exhaustive and enforced.
  • Unsafe operations are marked explicitly as such.
  • Error handling is provided by the language with Result<T, E>. Code should not crash or panic unless the programmer uses panic!, unwrap or similar. Even then, panicking can be handled gracefully, as long as one does not panic in the panic handler. Some operations can panic implicitly, such as array indexing and integer division, but only if there is a bug in your program.
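
Several of the points above can be seen in a few lines of Rust. The following is a minimal sketch, not code from our devices: the SensorError type, the 12-bit ADC range and the 3300 mV reference are assumptions made up for illustration.

```rust
/// Hypothetical error type for a sensor read; not taken from any real driver.
#[derive(Debug, PartialEq)]
enum SensorError {
    NotReady,
    OutOfRange(i32),
}

/// Converts a raw reading to millivolts, or reports why it could not.
/// `Result` is marked `#[must_use]`, so silently ignoring it yields a compiler warning.
fn to_millivolts(raw: Option<i32>) -> Result<i32, SensorError> {
    // A missing value is explicit: there is no null pointer to dereference by accident.
    let raw = raw.ok_or(SensorError::NotReady)?;
    if !(0..=4095).contains(&raw) {
        return Err(SensorError::OutOfRange(raw));
    }
    // Assumed 12-bit ADC with a 3300 mV reference.
    Ok(raw * 3300 / 4095)
}

/// Every variant must be handled; removing one of the arms is a compile error.
fn describe(result: &Result<i32, SensorError>) -> String {
    match result {
        Ok(mv) => format!("{} mV", mv),
        Err(SensorError::NotReady) => String::from("sensor not ready"),
        Err(SensorError::OutOfRange(raw)) => format!("raw value {} out of range", raw),
    }
}
```

Calling describe(&to_millivolts(None)) yields the "sensor not ready" message without any panic or crash involved.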

To further reinforce these security and correctness properties, there are efforts to create a Sealed Rust. This effort aims to formalize the Rust semantics, and to yield a compiler that is guaranteed to generate assembly conforming to these semantics. In the future Rust will thus be formally suitable for automotive and aerospace applications, although support for a Cortex-R or similar automotive-grade processor is still lacking.

With the introduction of Rust async-await we feel that there will be another strong reason to choose Rust for these applications. As you will discover in this article, we are not fully there yet.

In this article I will detail what async-await is, and why it has great potential. You need not know a lot about Rust or about embedded devices. This article is also a preview for a series of articles detailing our journey towards more robust and more maintainable embedded devices. We will use our experiences with developing the Trigly Holter patch, a recording device for ECG signals and ancillary medical data, to illustrate this concept.

Waiting

To grasp the concepts of async and await, we first need to know why devices need to wait. An embedded device is composed of many smaller devices working in tandem. These devices operate independently but still depend on each other for their operation. To work together they need to be able to communicate with each other. However, this communication takes time, and the devices need not run at the same clock frequency. Hence a device regularly needs to wait on something.

Synchronous code

Waiting can be done by continuously checking whether a task has been completed. Checking continuously is also called polling. For example, using the Arduino standard library one can wait a single second by invoking the following function:

delay(1000);

Under the hood delay is implemented for AVR based Arduino cores as follows:

void delay(unsigned long ms)
{
    uint32_t start = micros();

    while (ms > 0) {
        // [...]
        while ( ms > 0 && (micros() - start) >= 1000) {
            ms--;
            start += 1000;
        }
    }
}

As you can see, the program continuously loops, and for each iteration of the loop checks (i.e. polls) whether enough time has passed. This works similarly for hardware communication protocols. Instead of checking the time, we would for example have to check whether a signal line has been pulled to a high voltage.
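
The same polling loop can be sketched in Rust. This is an illustration only: delay_ms and its micros parameter are hypothetical names, with micros standing in for whatever free-running microsecond counter the hardware offers. The function returns how often the clock was polled, which makes the wasted work visible.

```rust
/// Busy-waits for `ms` milliseconds by polling a free-running microsecond counter.
/// `micros` is a hypothetical clock source. Returns the number of polls in the loop.
fn delay_ms(mut ms: u32, mut micros: impl FnMut() -> u32) -> u32 {
    let mut start = micros();
    let mut polls = 0;
    while ms > 0 {
        let now = micros();
        polls += 1;
        // `wrapping_sub` keeps the comparison correct when the counter overflows.
        while ms > 0 && now.wrapping_sub(start) >= 1000 {
            ms -= 1;
            start = start.wrapping_add(1000);
        }
    }
    polls
}
```

On real hardware the number of polls per millisecond is far higher than in any simulation, and every one of them is a moment in which the processor does nothing useful.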

This waiting in a loop is called busy waiting or blocking, and is synchronous code. This type of code is great because it is very straightforward and easy to understand. However, whilst the processor is waiting for time to pass or for the line to be pulled to a higher voltage, it is wasting processor cycles. These cycles could also be used for other things. Let's take a look at a real-world example:

Earlier we wrote a device driver for the ECG sensor for the Trigly Holter Patch. Initializing this driver requires first that the chip is enabled and powered for a short while, in order for the chip to start up.

pub fn exec(&mut self) {
    let spid: SpiD = SpiDevice::new(spi, ncs, timer);
    self.nreset.set_low();
    self.start.set_low();
    block!(delay(100));
    self.nreset.set_high();
    block!(delay(100));
    let device = Ads1292::new(spid);
    // [...]
}

Note that we wait 100 milliseconds to be sure that the device is powered up, and subsequently reset to the default settings. The block! macro invocation is basically a busy-waiting loop, waiting until the function is done. The function delay can return a WouldBlock error, upon which the block! macro simply tries again.
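
To make the behaviour of block! concrete, here is a minimal sketch of such a retry loop. It deliberately does not use the real macro: NbError, FakeTimer and block_on_nb are stand-ins invented for this illustration, and the timer simply counts polls instead of measuring time.

```rust
/// Stand-in for a non-blocking error: either "not done yet" or a real failure.
#[derive(Debug, PartialEq)]
enum NbError<E> {
    WouldBlock,
    Other(E),
}

/// Hypothetical timer that reports `WouldBlock` until `remaining` polls have passed.
struct FakeTimer {
    remaining: u32,
}

impl FakeTimer {
    fn wait(&mut self) -> Result<(), NbError<()>> {
        if self.remaining == 0 {
            Ok(())
        } else {
            self.remaining -= 1;
            Err(NbError::WouldBlock)
        }
    }
}

/// Busy-waits on `op`, retrying for as long as it reports `WouldBlock`.
/// Returns the final result together with the number of calls that were made.
fn block_on_nb<T, E>(mut op: impl FnMut() -> Result<T, NbError<E>>) -> (Result<T, E>, u32) {
    let mut calls = 0;
    loop {
        calls += 1;
        match op() {
            Err(NbError::WouldBlock) => continue, // nothing else can run in the meantime
            Err(NbError::Other(e)) => return (Err(e), calls),
            Ok(value) => return (Ok(value), calls),
        }
    }
}
```

With a FakeTimer { remaining: 3 }, block_on_nb(|| timer.wait()) calls wait four times before it returns: three fruitless attempts and one success.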

We still use this driver in its current form, for reasons I will explain later in this article. As you can imagine, we would rather use this 200 ms startup time to initialize our other peripheral devices as well. We can achieve this by doing multiple tasks asynchronously.

Asynchronous code

Instead of looping over one thing, we can try looping over multiple things. This seems like a boring difference, but is actually the only real and essential difference. It requires that we do three things:

  • We define tasks that run concurrently.
  • We loop over all these tasks and ask them to progress.
  • Each task yields if no more progress can be made or when it feels it has taken up enough time for now.

In Trigly the main task loop looks like this:

loop {
    // [...]
    feedback.poll();
    disklog_test.poll(&state, &mut disklog);
    message_test.poll();
    // [...]
    error_log.poll((&mut disklog, &mut usart_write));
    watchdog.poll();
    // [...]
}
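
The idea behind such a loop can be sketched in a self-contained way. The Task trait, the Countdown task and run_all below are hypothetical and much simpler than the Trigly tasks: each task needs a fixed number of polls, does one unit of work per poll and then yields.

```rust
/// A task makes a little progress each time it is polled and reports whether it is done.
trait Task {
    fn poll(&mut self) -> bool;
}

/// Hypothetical task that needs a fixed number of polls to finish.
struct Countdown {
    steps_left: u32,
}

impl Task for Countdown {
    fn poll(&mut self) -> bool {
        if self.steps_left > 0 {
            self.steps_left -= 1; // do one small unit of work, then yield
        }
        self.steps_left == 0
    }
}

/// Polls every task in turn until all report completion; returns the number of rounds.
fn run_all(tasks: &mut [&mut dyn Task]) -> u32 {
    let mut rounds = 0;
    loop {
        rounds += 1;
        let mut all_done = true;
        for task in tasks.iter_mut() {
            // `&=` does not short-circuit, so every task is polled in every round.
            all_done &= task.poll();
        }
        if all_done {
            return rounds;
        }
    }
}
```

Two tasks needing 2 and 5 steps finish together after 5 rounds, instead of the 7 steps that running them one after the other would take.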

In essence each task is polled in turn, and returns as soon as it is done or cannot make further progress. This requires that each task remembers for itself where it was when it yielded. Each task is thus stateful. We will now reconsider our previous example for our ECG sensor chipset, and rewrite it to be a stateful task with a state for each occurrence of block!:

enum State {
    //[...]

    /// The task is starting the ECG peripheral
    Startup(SpiL, PinNCS, Timer<TIM>),
    /// The task is waiting for the ECG peripheral to boot
    Powerup(SpiD, Delay),
    /// The task is configuring the ECG peripheral
    Initializing(SpiD, Delay),
    /// The task is active and reading data
    Active(Device),
}

// Note: simplified and shortened for this blog.
pub fn poll(&mut self) {
    match self.state {
        // [...]
        Startup(spi, ncs, timer) => {
            let spid: SpiD = SpiDevice::new(spi, ncs, timer);
            self.nreset.set_low();
            self.start.set_low();
            self.state = Powerup(spid, Delay::new(100));
        }
        Powerup(spid, d) => {
            if d.poll() {
                self.nreset.set_high();
                self.state = Initializing(spid, Delay::new(100));
            }
        }
        Initializing(spid, d) => {
            if d.poll() {
                self.state = Active(Ads1292::new(spid));
            }
        }
        Active(mut ads) => {
            // [...]
        }
    }
}

This driver takes ownership of the underlying communication and timer devices that it needs to work with this sensor. Depending on the state we might need to deal with the separate components or the complete device driver. Hence we move these individual components between states. It is wonderful that Rust enables and indeed forces us to do this. It is however also annoying and cumbersome. We wrote all the Trigly Holter code like this, for 22 separate tasks. Note that the number of lines required to describe the task in this manner is a multiple of that of our first blocking implementation.

Mind that the same implementation using block! is also quite stateful: the processor keeps a program counter, and all variables are stored on the stack. However, for multiple tasks we have to do this administration by hand, and basically move the "stack" to the state object. The program counter is encoded by which variant of the State enum is active. For each possible moment of blocking (or yielding for time-consuming operations) we have a state variant.
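
The moving of components between states can be shown in a compilable form. The sketch below is a simplification under assumptions: Bus, Driver and the poll-counting Delay are stand-ins, and the extra Transition variant is one common way (not necessarily ours) to take the old state by value so that its contents can be moved into the next one.

```rust
use std::mem;

/// Hypothetical delay that completes after a fixed number of polls.
struct Delay {
    polls_left: u32,
}

impl Delay {
    fn new(polls: u32) -> Self {
        Delay { polls_left: polls }
    }

    fn poll(&mut self) -> bool {
        if self.polls_left > 0 {
            self.polls_left -= 1;
        }
        self.polls_left == 0
    }
}

/// Stand-ins for the bus handle and the fully constructed driver.
struct Bus;
struct Driver {
    _bus: Bus,
}

enum State {
    Startup(Bus),
    Powerup(Bus, Delay),
    Active(Driver),
    /// Placeholder that is only present while ownership moves between states.
    Transition,
}

struct EcgTask {
    state: State,
}

impl EcgTask {
    fn poll(&mut self) {
        // Take the state by value, so the bus can be moved into the next variant.
        self.state = match mem::replace(&mut self.state, State::Transition) {
            State::Startup(bus) => State::Powerup(bus, Delay::new(3)),
            State::Powerup(bus, mut delay) => {
                if delay.poll() {
                    State::Active(Driver { _bus: bus })
                } else {
                    State::Powerup(bus, delay)
                }
            }
            other => other, // Active (and Transition) stay as they are
        };
    }

    fn is_active(&self) -> bool {
        matches!(self.state, State::Active(_))
    }
}
```

The active enum variant plays the role of the program counter, and the fields of that variant play the role of the local variables on the stack.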

Futures

This might sound a bit familiar if you've already used Rust, as there is a way to describe these asynchronous tasks already in the language: futures. You could consider the above examples to be our own interpretation of futures, though we did leave out a lot of details, not least the PollResult of these tasks.
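
For comparison, this is roughly what a hand-written future looks like when it implements the standard Future trait. It is a sketch under assumptions: PollDelay counts polls rather than time, and polls_until_ready with its do-nothing NoopWake waker is a throwaway helper, not a real executor.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical delay that becomes ready after a fixed number of polls.
struct PollDelay {
    polls_left: u32,
}

impl Future for PollDelay {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.polls_left == 0 {
            Poll::Ready(())
        } else {
            self.polls_left -= 1;
            // Ask to be polled again; a real driver would leave this to an interrupt.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Waker that does nothing; sufficient when the caller polls in a loop anyway.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls the future in a busy loop and returns how many polls it took to complete.
fn polls_until_ready<F: Future + Unpin>(mut fut: F) -> u32 {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        polls += 1;
        if Pin::new(&mut fut).poll(&mut cx).is_ready() {
            return polls;
        }
    }
}
```

The shape is the same as our own poll functions: the Poll return value takes the place of the boolean, and the Context carries the waker.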

Futures bring a language-wide interface to the table, allowing us to use futures defined in various crates. However, futures also necessitate some infrastructure in the form of executors and wakers. This infrastructure implies some overhead: in the effort needed to create conforming implementations, in memory requirements and in runtime. Considering that there are not yet any embedded drivers using futures, this overhead is probably not worth it, unless we also leverage async-await.
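
To give an impression of that infrastructure, here is a minimal single-task executor with a flag-based waker. It is a sketch under assumptions rather than something fit for a device: it uses std and a heap allocation for brevity, which embedded executors typically avoid, and the names FlagWaker, block_on and YieldOnce are made up for this example.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that records that its task wants to be polled again.
struct FlagWaker {
    woken: AtomicBool,
}

impl Wake for FlagWaker {
    fn wake(self: Arc<Self>) {
        self.woken.store(true, Ordering::SeqCst);
    }
}

/// Minimal single-task executor: polls the future whenever its wake flag is set.
/// Returns the output of the future and the number of polls that were needed.
fn block_on<F: Future>(fut: F) -> (F::Output, u32) {
    let mut fut = Box::pin(fut); // heap allocation, only to keep the sketch short
    let flag = Arc::new(FlagWaker { woken: AtomicBool::new(true) });
    let waker = Waker::from(flag.clone());
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        if flag.woken.swap(false, Ordering::SeqCst) {
            polls += 1;
            if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
                return (out, polls);
            }
        } else {
            // No wake-up pending: a real device could go to sleep at this point.
            std::hint::spin_loop();
        }
    }
}

/// Future that yields to the executor exactly once before completing.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref(); // request another poll
            Poll::Pending
        }
    }
}
```

Here block_on(YieldOnce { yielded: false }) completes in two polls. Multiplying this machinery by the number of tasks, and making it work without an allocator, is where the effort and memory cost come from.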

With Rust async-await

Rust async-await introduces a new syntax that enables us to forgo the aforementioned explicit state keeping. Instead the compiler computes a state machine corresponding to the code, where each possible state is a point in the code where one calls .await.

The above code example would look as follows in this new syntax:

pub async fn exec(&mut self) {
    let spid: SpiD = SpiDevice::new(spi, ncs, timer);
    self.nreset.set_low();
    self.start.set_low();
    Delay::new(100).await;
    self.nreset.set_high();
    Delay::new(100).await;
    let device = Ads1292::new(spid);
    // [...]
}

This is analogous to our first example with block!, but even more concise and clean. Internally the compiler emits a so-called generator with a structure very similar to our State enum.
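
That every .await is a point where the task can be suspended can be observed by polling an async function by hand. The sketch below makes some assumptions: the hardware steps are replaced by entries in a log, and YieldOnce, yield_once and poll_once are helper names invented here.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Future that returns `Pending` once and `Ready` on the next poll.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            Poll::Pending
        }
    }
}

fn yield_once() -> YieldOnce {
    YieldOnce { yielded: false }
}

/// Hypothetical startup sequence; each `.await` is a point where it can be suspended.
async fn startup(log: &RefCell<Vec<&'static str>>) {
    log.borrow_mut().push("reset low");
    yield_once().await;
    log.borrow_mut().push("reset high");
    yield_once().await;
    log.borrow_mut().push("driver ready");
}

/// Waker that does nothing; the caller decides when to poll again.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls the future a single time; returns `true` once it has completed.
fn poll_once<F: Future>(fut: &mut Pin<Box<F>>) -> bool {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    fut.as_mut().poll(&mut cx).is_ready()
}
```

After the first poll_once on Box::pin(startup(&log)) the log holds only "reset low", after the second it also holds "reset high", and the third poll completes the future. The compiler has kept track of where to resume, which is exactly the bookkeeping we did by hand with the State enum.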

This becomes quite a bit more exciting when considering that this can be applied to all asynchronous code. In the above example we omitted the actual hardware communication with the ECG sensor. This communication is implemented in the Ads1292 ECG hardware driver. This driver is a generic driver crate based on the embedded-hal crate. All driver communication is implemented using blocking primitives of this hardware abstraction layer crate. To make that code asynchronous, we would have to also keep track of the state in that driver. We chose not to do this, as it would not gain us enough in terms of (battery) performance.

Hence it is a trade-off between programmer effort and cycles (and power) wasted on busy waiting. With async-await, programmer effort is reduced. This reduces development and maintenance costs. For the Trigly Holter Patch we estimate that 60% of the lines of code could be eliminated by using async-await. We would also be able to make more code asynchronous with little effort, reducing the total amount of active CPU time. Depending on the application this skews the trade-off in favor of making the code asynchronous.
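
The gain in time can be illustrated with a simulated clock. The following sketch rests on assumptions: time is a tick counter in a Cell, TickDelay and init are made-up stand-ins for real delays and startup sequences, and run_both is a bare-bones loop that polls two futures in turn.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Delay measured against a shared, simulated tick counter.
struct TickDelay<'a> {
    clock: &'a Cell<u32>,
    deadline: u32,
}

impl<'a> TickDelay<'a> {
    fn new(clock: &'a Cell<u32>, ticks: u32) -> Self {
        TickDelay { clock, deadline: clock.get() + ticks }
    }
}

impl Future for TickDelay<'_> {
    type Output = ();

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        // No waker is registered: the loop below polls unconditionally.
        if self.clock.get() >= self.deadline {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }
}

/// Hypothetical peripheral startup consisting of two waits of `ticks` each.
async fn init(clock: &Cell<u32>, ticks: u32) {
    TickDelay::new(clock, ticks).await; // e.g. hold the reset line low
    TickDelay::new(clock, ticks).await; // e.g. wait for the chip to boot
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls both futures in turn, one clock tick per round; returns the ticks elapsed.
fn run_both<A: Future, B: Future>(clock: &Cell<u32>, a: A, b: B) -> u32 {
    let (mut a, mut b) = (Box::pin(a), Box::pin(b));
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let (mut a_done, mut b_done) = (false, false);
    let start = clock.get();
    loop {
        if !a_done {
            a_done = a.as_mut().poll(&mut cx).is_ready();
        }
        if !b_done {
            b_done = b.as_mut().poll(&mut cx).is_ready();
        }
        if a_done && b_done {
            return clock.get() - start;
        }
        clock.set(clock.get() + 1); // one timer tick passes
    }
}
```

Running init(&clock, 100) and init(&clock, 50) together takes 200 ticks in this simulation, whereas running them one after the other takes 300.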

Interrupts and sleeping

The ideal situation is when all tasks are waiting and thus the processor can go to sleep. Some processors have special sleep modes that can greatly reduce the power consumption of the device. For the Trigly Holter Patch sleeping is absolutely essential to enable recording data for 48 hours straight on a small cell battery. Sleeping however necessitates the direct use of interrupts to wake the processor up again when a task can be resumed. Implementations of the blocking and asynchronous communication interfaces in the embedded-hal crate do not necessarily configure these interrupts to be fired. For proper async implementations of device drivers we will first need an alternative embedded-hal interface (i.e. embedded-hal-async) that is Future-compatible.
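
How an interrupt, a waker and sleeping fit together can be simulated on a desktop machine. The sketch below is an assumption-laden illustration: Signal stands in for a peripheral event, on_interrupt is called by hand instead of by hardware, and the sleep_until_interrupt closure takes the place of a sleep instruction such as wfi on Arm.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical peripheral event, shared between a task and an interrupt handler.
struct Signal {
    fired: AtomicBool,
    waker: Mutex<Option<Waker>>,
}

impl Signal {
    fn new() -> Self {
        Signal { fired: AtomicBool::new(false), waker: Mutex::new(None) }
    }

    /// What the interrupt handler would do: record the event and wake the task.
    fn on_interrupt(&self) {
        self.fired.store(true, Ordering::SeqCst);
        if let Some(waker) = self.waker.lock().unwrap().take() {
            waker.wake();
        }
    }
}

/// Future that completes once the signal has fired.
struct WaitFor<'a>(&'a Signal);

impl Future for WaitFor<'_> {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0.fired.load(Ordering::SeqCst) {
            return Poll::Ready(());
        }
        // Register the waker first, then check again so that no event is missed.
        *self.0.waker.lock().unwrap() = Some(cx.waker().clone());
        if self.0.fired.load(Ordering::SeqCst) {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }
}

struct WakeFlag(AtomicBool);

impl Wake for WakeFlag {
    fn wake(self: Arc<Self>) {
        self.0.store(true, Ordering::SeqCst);
    }
}

/// Polls only after a wake-up and "sleeps" otherwise; returns (polls, sleeps).
fn run_with_sleep<F: Future>(fut: F, mut sleep_until_interrupt: impl FnMut()) -> (u32, u32) {
    let mut fut = Box::pin(fut);
    let flag = Arc::new(WakeFlag(AtomicBool::new(true)));
    let waker = Waker::from(flag.clone());
    let mut cx = Context::from_waker(&waker);
    let (mut polls, mut sleeps) = (0, 0);
    loop {
        if flag.0.swap(false, Ordering::SeqCst) {
            polls += 1;
            if fut.as_mut().poll(&mut cx).is_ready() {
                return (polls, sleeps);
            }
        } else {
            sleeps += 1;
            sleep_until_interrupt(); // stands in for a sleep instruction
        }
    }
}
```

In this simulation the task is polled exactly twice, once to register its waker and once after the interrupt, and the processor sleeps in between instead of spinning.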

Relative costs

A state machine created by hand, as we have done for the Trigly Holter Patch, is not necessarily more (or less) efficient than one generated by async-await. Generally the generated code is quite similar, and has the potential of being more efficient, as detailed in a blog series by Tyler Mandry.

However, most executors require some arena (a piece of memory) to be allocated. This arena is used to store the futures themselves. For Trigly our faux-futures are stored on the stack. In theory there should not be a huge difference between storing them on the stack and storing them in a pre-allocated arena, except that once allocated the arena cannot be used for the stack anymore. This might be a considerable problem for small, memory-constrained embedded devices with only kilobytes of RAM. The waker architecture provides a potential computational benefit by limiting the tasks that need to be woken up, but only if our embedded primitives implement their futures to strictly use interrupts to wake these tasks. Also, the effectiveness of these futures depends on the precise executor chosen.

From our experiments we have found that the futures generated by async/await waste memory copiously. For one of our more trivial futures, which uses only 32 bytes when implemented by hand, async/await generates a future of between 106 and 176 bytes, depending on trivial details. In our use case, this is unacceptable. Some of the underlying issues in the compiler are known, and we might be able to contribute to some of these.
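
Measuring the size of a generated future is straightforward with size_of_val. The sketch below does not reproduce our measurements: with_buffer, without_buffer and YieldOnce are made-up examples, and the exact sizes depend on the compiler version, so only lower bounds are stated.

```rust
use std::future::Future;
use std::mem::size_of_val;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Future that returns `Pending` once and `Ready` on the next poll.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            Poll::Pending
        }
    }
}

fn yield_once() -> YieldOnce {
    YieldOnce { yielded: false }
}

/// `buf` is used after the `.await`, so it has to be stored inside the future.
async fn with_buffer(seed: u8) -> u8 {
    let buf = [seed; 64];
    yield_once().await;
    buf[63]
}

/// Same control flow, but nothing large is kept alive across the `.await`.
async fn without_buffer(seed: u8) -> u8 {
    yield_once().await;
    seed
}

/// Returns the sizes in bytes of the two futures: (with buffer, without buffer).
fn future_sizes() -> (usize, usize) {
    // Calling an async fn only constructs the future; nothing runs until it is polled.
    (size_of_val(&with_buffer(0)), size_of_val(&without_buffer(0)))
}
```

The first future must be at least 64 bytes large, since the buffer lives across the suspension point. How much larger it is than strictly necessary is what this kind of measurement brings to light.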

We have also noticed that debugging these futures is harder compared to our faux-futures. Using debuggers like gdb to inspect the memory is only possible when inside the future you want to debug. This is in contrast to our faux-futures, where each future can be inspected at any time because their types are transparent and they are allocated on the stack.

In a system with adequate RAM the benefits of the await syntax greatly outweigh these potential downsides. I intend to make a small benchmark to compare performance, code size, memory usage, and generated assembly to make this more concrete.

State of Rust Async Embedded

Recently Ferrous Systems has worked hard on getting async-await usable on no_std (core-only) embedded systems. At Tweede golf we are helping out in this regard, by experimenting with async-await and writing HAL drivers using it. We are also looking into making the generated futures smaller and more efficient.

Our next blog posts on this topic will be more in-depth and technical, and will concern our effort to implement multiple simultaneous timers and SPI drivers. We are also working on a proof of concept showcase of this functionality.

We'd love to hear what you think over at r/rust.

Stay tuned for more Embedded Rust at Tweede golf!
