Embedded software has an issue that most software doesn't: It can be very hard to get it patched. Sometimes a device hangs 5 meters high on a street light in the middle of a highway in another country. Sometimes a device is attached to a customer's heart. Sometimes strict validation requirements make changes to the software very expensive. In each case it is important to build software that doesn't fail, even in unpredictable conditions.
Recently, we had the opportunity to write the software for a medical device. We had only a few months to write it before it needed to pass IEC 62304 validation. In the final phase of the project, the device was already working but we hadn't spent much time on error handling yet. In order to make our code as robust as possible and have it pass validation, we took the following approach to error handling.
To make our software robust, we had been using the Rust programming language from the start. It has a strict type system that helps avoid many potential errors by checking for them before we ever run the code. We still needed to handle hardware errors though. We began by categorizing these errors by their severity and recovery methods.
Some errors are the result of simple operations that are expected to fail occasionally. These are usually operations that involve communication with a hardware device that can be affected by electrical noise. When communication with a peripheral fails, the operation can simply be retried immediately and it will usually succeed. In case it keeps failing, we give these operations a retry limit as well as a time limit. When either is exceeded, we consider the component to have failed.
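As an illustration of the retry-with-limits idea, here is a minimal sketch. It is not the project's actual code: the names `retry` and `RetryError`, the millisecond clock closure `now_ms`, and the choice to always make at least one attempt are all assumptions made for the example.

```rust
/// Why a retried operation was finally given up on (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum RetryError<E> {
    /// The attempt limit was reached; carries the last error seen.
    AttemptsExceeded(E),
    /// The time budget ran out; carries the last error seen.
    TimedOut(E),
}

/// Calls `op` until it succeeds, until it has failed `max_attempts` times,
/// or until `time_limit_ms` has elapsed according to `now_ms`.
/// At least one attempt is always made.
pub fn retry<T, E>(
    max_attempts: u32,
    time_limit_ms: u32,
    mut now_ms: impl FnMut() -> u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, RetryError<E>> {
    let start = now_ms();
    let mut attempts: u32 = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => {
                attempts = attempts.saturating_add(1);
                if attempts >= max_attempts {
                    return Err(RetryError::AttemptsExceeded(err));
                }
                // wrapping_sub keeps the comparison valid if the tick counter overflows.
                if now_ms().wrapping_sub(start) >= time_limit_ms {
                    return Err(RetryError::TimedOut(err));
                }
            }
        }
    }
}
```

Passing the clock in as a closure keeps the helper independent of any particular timer peripheral and makes it testable on a desktop. Either error variant would then be treated as "the component has failed" by the caller.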
Errors that can't be solved by simply retrying the operation are usually handled by resetting and reinitializing the relevant components. This can be expensive, so when you're on limited battery power, it's good to wait a few seconds before reinitializing. This both helps the remaining parts to stabilize and prevents an expensive reset loop from eating the entire battery. If the component is critical for the purpose of the device, the reset should be attempted indefinitely until the power runs out, but if the device can keep functioning without it, it should just give up at some point and leave the component disabled.
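The decision described above can be written down as a small pure function. This is only a sketch under assumed names (`Criticality`, `ResetDecision`, `next_reset`); the delay and the reset limit are parameters because the text does not give concrete values.

```rust
/// Whether the device can keep doing its job without a component (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Criticality {
    /// The device is useless without it: keep resetting until the power runs out.
    Critical,
    /// The device can work without it: give up after `max_resets` attempts.
    Optional { max_resets: u32 },
}

/// What to do after a component failed and retrying did not help.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ResetDecision {
    /// Wait this many milliseconds, then reset and reinitialize the component.
    ResetAfterMs(u32),
    /// Stop trying and leave the component disabled.
    Disable,
}

/// Decides the next step given how many resets have already been attempted.
pub fn next_reset(criticality: Criticality, resets_so_far: u32, delay_ms: u32) -> ResetDecision {
    match criticality {
        Criticality::Critical => ResetDecision::ResetAfterMs(delay_ms),
        Criticality::Optional { max_resets } if resets_so_far < max_resets => {
            ResetDecision::ResetAfterMs(delay_ms)
        }
        Criticality::Optional { .. } => ResetDecision::Disable,
    }
}
```

Keeping the policy separate from the code that actually toggles hardware makes it easy to test the "critical components never give up" rule on its own.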
The most severe errors are those that can only be resolved by power cycling, or not at all. In some cases it's best to shut down and wait for the device to be retrieved for repair. In our case, we wanted to keep trying until the battery runs out, since we wouldn't need it for anything else. These kinds of errors usually occur when the software or a peripheral ends up in a bad state. For example, an SD card can switch to inactive mode if initialization fails and there is no way to get it out of that state without power cycling it.
The ability to turn the power off for a peripheral cannot be taken for granted in an embedded setting. The hardware usually needs to be designed to include a software-controllable power switch, or several of them, one for each peripheral. Without these, recovery from fatal errors may not be possible without human intervention.
It can also be useful to turn on the watchdog peripheral, which reboots the device after a timeout. As long as the device keeps passing through its main loop, it will avoid the reboot by regularly bumping the timeout. If on the other hand it gets stuck somewhere and fails to bump the timeout for a while, it will be reset, breaking the cycle.
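A real watchdog is a hardware peripheral configured through the chip's registers or HAL, so the following is only a software model of its behavior, useful for reasoning about the timeout logic. The type `WatchdogModel` and its methods are hypothetical names, not an API of any real HAL.

```rust
/// A software model of a watchdog timer (illustration only, not a driver).
pub struct WatchdogModel {
    timeout_ms: u32,
    last_fed_ms: u32,
}

impl WatchdogModel {
    /// Starts the watchdog at time `now_ms` with the given timeout.
    pub fn start(now_ms: u32, timeout_ms: u32) -> Self {
        WatchdogModel { timeout_ms, last_fed_ms: now_ms }
    }

    /// Called once per pass through the main loop to postpone the reboot.
    pub fn feed(&mut self, now_ms: u32) {
        self.last_fed_ms = now_ms;
    }

    /// True when the main loop has not fed the watchdog in time;
    /// on real hardware this is the moment the device would be reset.
    pub fn expired(&self, now_ms: u32) -> bool {
        now_ms.wrapping_sub(self.last_fed_ms) >= self.timeout_ms
    }
}
```

The timeout should be chosen longer than the slowest legitimate pass through the main loop, otherwise a healthy device would reboot itself.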
In Rust, most errors are reported using result types. This means the result of a function can indicate either success or some kind of error. Rust makes it easy to bubble these errors up to the points where we want to deal with them and it warns us if we ignore them. However, some sources of errors are too common to encode in the type system.
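For readers less familiar with Rust, here is a small sketch of what that looks like. The sensor, the error type `SensorError`, and the conversion formula are invented for the example; only `Result`, `Option::ok_or` and the `?` operator are standard Rust.

```rust
/// Hypothetical error type for a temperature sensor.
#[derive(Debug, PartialEq)]
pub enum SensorError {
    BusNoise,
    OutOfRange,
}

/// Low-level read: `None` stands in for a failed bus transfer.
pub fn read_raw(sample: Option<u16>) -> Result<u16, SensorError> {
    sample.ok_or(SensorError::BusNoise)
}

/// Higher-level read: `?` returns early with the low-level error,
/// so the caller decides how to recover.
pub fn read_celsius(sample: Option<u16>) -> Result<i32, SensorError> {
    let raw = read_raw(sample)?;
    if raw > 4095 {
        return Err(SensorError::OutOfRange);
    }
    // Made-up conversion: 16 counts per degree, offset by -40 degrees.
    Ok(i32::from(raw) / 16 - 40)
}
```

Because `Result` is marked `#[must_use]`, calling `read_celsius` and discarding the return value produces a compiler warning.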
Indexing an array is one example. If an index is out of range, Rust panics instead of returning an error that would have to be handled at every indexing site. A panic is a type of error that is not meant to be recoverable. On server applications, the application can usually log an error message before failing, but on embedded systems, there is usually nothing to report to, so all the device can do is reboot. Rebooting can be expensive because every component will need to be reinitialized, so it's best to avoid that.
Unfortunately, Rust's type system does not encode whether a function can panic or not. The best way to find sources of panics is to search for the macros and methods that are usually used to cause them. The most common ones are panic!, unwrap, assert! and expect. When we give our code a robustness pass, we search for these names and replace each call with one of the error handling tactics from above.
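To make such a robustness pass concrete, here is a before-and-after sketch. The function names and the `ConfigError` type are hypothetical; the point is only the mechanical replacement of unwrap with an error value.

```rust
/// Hypothetical error type for reading a configuration field.
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    Missing,
    NotANumber,
}

/// Before the pass: quick to write, but panics on a missing or malformed field.
pub fn parse_interval_panicking(field: Option<&str>) -> u32 {
    field.unwrap().trim().parse().unwrap()
}

/// After the pass: every former unwrap is now an error the caller must handle.
pub fn parse_interval(field: Option<&str>) -> Result<u32, ConfigError> {
    let text = field.ok_or(ConfigError::Missing)?;
    text.trim().parse::<u32>().map_err(|_| ConfigError::NotANumber)
}
```

As a complement to searching by hand, Clippy has opt-in lints such as `clippy::unwrap_used`, `clippy::expect_used` and `clippy::panic` that flag these calls. Whether they fit a given project is a judgment call.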
This method won't help with panics caused by indexing though. Currently, the only way to protect against those is careful writing and rigorous testing. We're looking forward to the moment a feature called "const generics" [1] becomes available in Rust, because it extends the type system so that many potential indexing errors can be proven impossible.
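Where an out-of-range index is a realistic possibility at a particular site, one complementary option is the checked accessor `get` on slices, which returns an `Option` instead of panicking. The sketch below uses a hypothetical helper `sample_at`; it does not replace careful writing and testing, it only turns one class of panic into a value that can be handled.

```rust
/// Returns the sample at `index`, or `None` if the index is out of range.
/// Unlike `buffer[index]`, this never panics.
pub fn sample_at(buffer: &[u16], index: usize) -> Option<u16> {
    buffer.get(index).copied()
}
```

Clippy's opt-in lint `clippy::indexing_slicing` can help locate indexing expressions, although on a large code base it may produce a lot of noise.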
Thanks to our efforts to attempt recovery from every error, we don't expect the devices to brick themselves over something minor. Nonetheless, we would still like to investigate any errors they do encounter. Our devices are offline, so they can't just send us a crash report. Instead, they log their sensor data to an SD card, which is read out near the end of the device's life cycle.
Ideally, we'd write entire stack traces to the SD card. In practice, we're on a limited power budget and SD cards use a lot of power, so we don't want to write too much. We opted to give every location in our code that can fail its own unique error tag from a single large enum. These errors only cost us 3 bytes to write. Using a single error type throughout our code base also limits the amount of error conversion code we need to write.
In some cases a unique location in the source code still doesn't give us enough information. Usually, there is a high-level operation that failed and a low-level cause. In these cases, we simply log two errors in a row, making a mini stack trace of 6 bytes.
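The tagging scheme from the last two paragraphs could look roughly like the sketch below. The text only says that an error costs 3 bytes, so the layout here is an assumption: one marker byte followed by a 16-bit tag in little-endian order. The tag names, the marker value `0xEE`, and the order "operation first, cause second" are likewise invented for the example.

```rust
/// One tag per failure location, all in a single enum (names are hypothetical).
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(u16)]
pub enum ErrorTag {
    SdInitTimeout = 0x0001,
    SdWriteFailed = 0x0002,
    SensorBusNoise = 0x0101,
    SensorResetFailed = 0x0102,
}

/// Assumed marker byte that distinguishes error records from sensor data.
pub const ERROR_RECORD_MARKER: u8 = 0xEE;

/// Encodes one error as 3 bytes: marker, tag low byte, tag high byte.
pub fn encode(tag: ErrorTag) -> [u8; 3] {
    let [lo, hi] = (tag as u16).to_le_bytes();
    [ERROR_RECORD_MARKER, lo, hi]
}

/// Encodes a 6-byte mini stack trace: the failed high-level operation
/// followed by its low-level cause.
pub fn encode_pair(operation: ErrorTag, cause: ErrorTag) -> [u8; 6] {
    let [a0, a1, a2] = encode(operation);
    let [b0, b1, b2] = encode(cause);
    [a0, a1, a2, b0, b1, b2]
}
```

Since every fallible function can return the same `ErrorTag`, the `?` operator works across module boundaries without any `From` conversions.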
We also need to watch out that we don't report 6 bytes of errors every millisecond whenever recovery fails. To solve that, we limit the number of errors that can be written in a given time period. In order to not miss out completely on errors after the limit, we keep track of flags for the peripherals and severities that those errors occurred at.
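A minimal sketch of such a limiter, assuming a fixed time window and a single bitmask in which each peripheral (and, if desired, each severity) owns one bit. `ErrorLimiter`, its fields and its methods are hypothetical names; the actual window length and limit are not given in the text.

```rust
/// Limits how many errors are written per time window and remembers,
/// as bit flags, where the dropped errors came from.
pub struct ErrorLimiter {
    window_ms: u32,
    max_per_window: u8,
    window_start_ms: u32,
    written_in_window: u8,
    suppressed_flags: u8,
}

impl ErrorLimiter {
    pub fn new(window_ms: u32, max_per_window: u8) -> Self {
        ErrorLimiter {
            window_ms,
            max_per_window,
            window_start_ms: 0,
            written_in_window: 0,
            suppressed_flags: 0,
        }
    }

    /// Returns true if the error may be written to the SD card.
    /// Otherwise the error is dropped and only its flag is remembered.
    pub fn admit(&mut self, now_ms: u32, source_flag: u8) -> bool {
        // Start a fresh window once the current one has elapsed.
        if now_ms.wrapping_sub(self.window_start_ms) >= self.window_ms {
            self.window_start_ms = now_ms;
            self.written_in_window = 0;
        }
        if self.written_in_window < self.max_per_window {
            self.written_in_window += 1;
            true
        } else {
            self.suppressed_flags |= source_flag;
            false
        }
    }

    /// Returns the flags of all dropped errors and clears them,
    /// for example to write one summary byte at the end of a window.
    pub fn take_suppressed(&mut self) -> u8 {
        let flags = self.suppressed_flags;
        self.suppressed_flags = 0;
        flags
    }
}
```

With a one-second window and a limit of two, a peripheral that fails every millisecond costs two log entries plus one flag byte per second instead of thousands of entries.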
By combining cheap yet sufficiently detailed error logging with appropriate recovery tactics, we've built a robust product that is unlikely to fail and easy to debug if it does. By relying on unwrap initially, we made it easy for ourselves to find most error sources. If you have any thoughts about this approach or would like our help developing your project, get in touch with me or Hugo.
[1] Const generics have already been in the works for more than two years: https://github.com/rust-lang/rust/issues/44580