Making embedded robust with Rust

Lars
Software engineer
Making embedded robust with Rust

Embedded software has an issue that most software doesn't: It can be very hard to get it patched. Sometimes a device hangs 5 meters high on a street light in the middle of a highway in another country. Sometimes a device is attached to a customer's heart. Sometimes strict validation requirements make changes to the software very expensive. In each case it is important to build software that doesn't fail, even in unpredictable conditions.

Recently, we had the opportunity to write the software for a medical device. We had only a few months to write it before it needed to pass IEC 62304 validation. In the final phase of the project, the device was already working but we hadn't spent much time on error handling yet. In order to make our code as robust as possible and have it pass validation, we took the following approach to error handling.

Categorizing errors

In order to make our software robust, we've already been using the Rust programming language from the start. It has a strict type system that helps avoid many potential errors by checking them before we ever run the code. We still needed to handle hardware errors though. We began by categorizing these errors by their severity and recovery methods.

IO errors

Some errors are the result of simple operations that are expected to fail occasionally. These are usually operations that involve communication with a hardware device that can be affected by electrical noise. When communication with a peripheral fails, the operation can simply be retried immediately and it will usually succeed. For the case it keeps failing, we give these operations a retry limit as well as a time limit. When those are exceeded, we consider the component to have failed.

Component errors

Errors that can't be solved by simply retrying the operation are usually handled by resetting and reinitializing the relevant components. This can be expensive, so when you're on limited battery power, it's good to wait a few seconds before reinitializing. This both helps the remaining parts to stabilize and prevents an expensive reset loop from eating the entire battery. If the component is critical for the purpose of the device, the reset should be attempted indefinitely until the power runs out, but if the device can keep functioning without it, it should just give up at some point and leave the component disabled.

Fatal errors

The severest kinds of errors are the kinds that can only be resolved by power cycling or not at all. In some cases it's best to shut down and wait for the device to be retrieved for repair. In our case, we wanted to keep trying until the battery runs out, since we wouldn't need it for anything else. These kinds of errors usually occur when the software or a peripheral ends up in a bad state. For example, an SD card can switch to inactive mode if initialization fails and there is no way to get it out of that state without power cycling it.

The ability to turn the power off for a peripheral cannot be taken for granted in an embedded setting. The hardware usually needs to be designed to include a software controllable power switch, or multiple, one for each peripheral. Without these, recovery from fatal errors may not be possible without human intervention.

It can also be useful to turn on the watchdog peripheral, which reboots the device after a timeout. As long as the device keeps passing through its main loop, it will avoid the reboot by regularly bumping the timeout. If on the other hand it gets stuck somewhere and fails to bump the timeout for a while, it will be reset, breaking the cycle.

Don't panic

In Rust, most errors are reported using result types. This means the result of a function can indicate either success or some kind of error. Rust makes it easy to bubble these errors up to the points where we want to deal with them and it warns us if we ignore them. However, some sources of errors are too common to encode in the type system.

Indexing an array is one example. If an index is out of range, instead of returning an error that needs to be handled all throughout the type system, Rust panics. This is a type of error that is not meant to be recoverable. On server applications, the application can usually log an error message before failing, but on embedded systems, there is usually nothing to report to, so all the device can do is reboot. Rebooting can be expensive because every component will need to be reinitialized, so it's best to avoid that.

Rustlang Ferris panics

Unfortunately, Rust's type system does not encode whether a function can panic or not. The best way to find sources of panics is to search for the functions that are usually used to cause them. The most common ones are panic, unwrap, assert and expect. When we give our code a robustness pass, we search for these keywords and replace each call with one of the error handling tactics from above.

This method won't help with panics caused by indexing though. Currently, the only way to protect against those is careful writing and rigorous testing. We're looking forward to the moment a feature called "const generics"1 becomes available in Rust, because it extends the type system so that many potential indexing errors can be proven impossible.

Reporting errors

Thanks to our efforts to attempt recovery from every error, we don't expect the devices to brick themselves over something minor. Nonetheless, we would still like to investigate any errors they do encounter. Our devices are offline, so they can't just send us a crash report. Instead, they log their sensor data to an SD card which is read out near the end of its life cycle.

Ideally, we'd write entire stack traces to the SD card. In practice, we're on a limited power budget and SD cards use a lot of power, so we don't want to write too much. We opted to give every location in our code that can fail its own unique error tag from a single large enum. These errors only cost us 3 bytes to write. Using a single error type throughout our code base also limits the amount of error conversion code we need to write.

In some cases a unique location in the source code still doesn't give us enough information. Usually, there is a high-level operation that failed and a low-level cause. In these cases, we simply log two errors in a row, making a mini stack trace of 6 bytes.

We also need to watch out that we don't report 6 bytes of errors every millisecond whenever recovery fails. To solve that, we limit the number of errors that can be written in a given time period. In order to not miss out completely on errors after the limit, we keep track of flags for the peripherals and severities that those errors occured at.

Concluding remarks

By combining cheap yet sufficiently detailed error logging with appropriate recovery tactics, we've built a robust product that is unlikely to fail and easy to debug if it does. By relying on unwrap initially, we made it easy for ourselves to find most error sources. If you have any thoughts about this approach or would like our help developing your project, get in touch with me or Hugo.

Stay up-to-date

Stay up-to-date with our work and blog posts?

Related articles

June 10, 2024

Tock binary size

Tock is a powerful and secure embedded operating system. While Tock was designed with resource constraints in mind, years of additional features, generalizing to more platforms, and security improvements have brought resource, and in particular, code size bloat.

While using a full-blown filesystem for storing your data in non-volatile memory is common practice, those filesystems are often too big, not to mention annoying to use, for the things I want to do. My solution?

I've been hard at work creating the sequential-storage crate. In this blog post I'd like to go over what it is, why I created it and what it does.

At Tweede golf we're big fans of creating applications on embedded devices with Rust and we've written a lot about it.

But if you're a hardware vendor (be it chips or full devices/systems), should you give your users Rust support in addition to your C support?

In this blog I argue that the answer to the question is yes.