The hunt for error -22
Sit back and grab a drink. This one is a doozy.
Background
At Tweede golf we do lots of engineering work, and a while ago we worked for a client using the nRF9160 microcontroller. This chip has an LTE modem with which you can reach the internet while still not using a lot of power. It's pretty neat!
I built the majority of this client's new firmware using Rust and Embassy and aside from some low-power/sleep woes, it went well. The client took the maintenance and further development in-house and my job was done. Or so I thought.
Half a year later, we were contacted by the client because they were seeing strange things, the sort of things you don't want to see in your in-production firmware.
See, when they compiled the firmware, everything functioned fine most of the time, but sometimes a small change in any place would make it so sending data to the server always or often returned Error::NrfError(-22)
. For example, the problem could trigger if you read some config on line X, but not when you read the config on line X+1. It was incredibly sensitive. Changing the compiler version and optimization settings could also affect the problem.
Initial exploration
The error was hard to replicate. It was so random that you pretty much had to stumble over it to find it. So luckily the client had made a branch out of one exact version of the codebase that exhibited the behavior.
I was tasked with finding and fixing the cause and went to the client's office for two days. There I started my first exploration of the problem. Of course I looked at what the client had changed in the codebase in the half year before. Luckily for me, it was not much, but nothing really stood out either.
So there I was with a really vague problem. At this point it could be anything, but mostly likely one of these:
- Bug in the compiler
- Some bad code somewhere triggering UB
While rare, bugs in the compiler do happen, so it's good to look out for these, especially for such a sensitive issue as this one.
It's more probable, though, that we had some bad code in our codebase or dependencies. However, all of our code was written in Rust, a language we use often and one that is famous for preventing issues like this one.
But we do have unsafe code in our codebase; Especially on embedded, you can't do without, because we need to talk to the hardware and implement very low-level datastructures to be efficient (and to avoid allocations in some cases). Obviously I looked through all of our own unsafe blocks and at least at a quick glance they all seemed alright.
There's one exception, though, to our Rust usage and that's the giant blob of C code to talk to the modem.
How does the modem work, anyways?
Well... I don't know.
See, the modem is proprietary. It's some IP that Nordic bought at some point.
The nRF9160 is a dual-core chip. The first core is free to use for the user and you can run any software on it. The second core is the modem core and it runs its own modem firmware. As a user you really can't do anything with that core. It's not even accessible via the debugger as far as I know and you can really only flash updated firmware to it that is downloadable on Nordic's website.
Even the protocol to talk with the modem core is proprietary. We do get some hints of how it works from their documentation. The way to talk with the modem is by using their libmodem
library and the accompanying header files. It's distributed as a precompiled binary blob, so that's always fun to work with...
They've got a page on how to port libmodem
(called nrf_modem
in their image) to your own system and it shows some diagrams:
Figure 1: sequence diagram of receiving data from a socket. Source: Nordic Semiconductor
There is some shared memory and the IPC (inter process communication) peripheral is used to send events between the cores.
As an 'OS' you need to provide all the required functions from the nrf_modem_os.h
header. These are things like allocation, semaphores and some hardware setup. It's all made with the idea that the OS is an RTOS with threads. In Rust we tend to not use an RTOS, which means we often have to emulate one.
In any case, we were using the nrf-modem
crate as a wrapper around libmodem
. It's a crate I wrote based on Jonathan Pallant's previous wrapper nrfxlib
. My new wrapper was fully async, which wasn't a thing yet on embedded when JP worked on his wrapper, and fully updated to the newest libmodem
version.
Back to the problem, what's error -22?
With some quick debugging I found out the error came from this call to libmodem
:
let mut send_result = unsafe {
nrfxlib_sys::nrf_send(self.fd, buffer.as_ptr() as *const _, buffer.len(), 0)
};
The code passes the filedescriptor of the socket, a pointer to the buffer &[u8]
slice, the length of the buffer and some flags (0). We only use TCP, which means I was getting the error after the modem had booted and connected to the server through the TCP socket.
I needed to know what the error code means. The nrf-modem
crate defines many custom errors. Any error that is expected should be covered by a 'proper' error variant and not an error code. But when an unexpected error code is returned by libmodem
, then, well, the only option you have left is returning the error code as is. So because we're getting the raw error code, we know we're off the expected path.
Nordic has a header with defines for all error codes. We can look up what 22 is (it's negative for unimportant reasons). Looking through the list I saw:
#define NRF_EINVAL 22 /**< Invalid argument */
Ok... That's weird. Invalid argument? I'm not exactly changing anything related to the modem code and setup between working versions and broken versions. All arguments given come from safe Rust. It would be weird if they were wrong.
Luckily, this function is documented:
ssize_t nrf_send(int socket, const void *buffer, size_t length, int flags)
Send a message on a connected socket. See POSIX.1-2017 article for normative description. In addition, the function shall return -1 and set the following errno:
- [NRF_ESHUTDOWN] Modem was shut down.
- [NRF_ECANCELED] Operation canceled because of APN rate control.
- [NRF_ENOMEM] TCP stream interrupted because of no heap memory.
- [NRF_EPROTO] Request failed because DTLS context was serialized.
Ok, that's not enough because they don't say what my EINVAL
error means, but apparently they're using a normal posix interface and they even link to a good docs page for this function.
Reading through that linked page I saw many errors explained. Some aren't even errors like EWOULDBLOCK
. Most important though, the error EINVAL
is not mentioned! This means that I was getting an error that isn't supposed to be possible...
Cool, just my luck.
What now?
To sum up what I had learned so far:
- The bug was incredibly sensitive to small changes anywhere in the codebase
- Compiler versions and settings affect the bug appearance
- The only thing that seemingly changes is that I was getting an error from an API that officially can't return that error
Great...
Well, if the bug is sensitive to the compiler version, then maybe it's a compiler bug? The project was pinned to a nightly version at 2023-12-26. This means there are many newer nightlies, one of which might already have fixed the bug. My idea was that if I found a newer version where the problem didn't reproduce anymore, that might be an easy fix.
One problem I ran into was what we would later call 'modem jail'. This is where you've connected to the cell towers too many times in a given timespan and they just won't let you connect anymore for a while. This happens when you connect 10-15 or so times in an hour. Once you've been thrown in modem jail, you have to sit out your time. Very annoying and it slows everything down.
Eventually I found that the problem still reproduced on 2024-02-05, but not on 2024-02-06. Good news right? But when I checked the differences between these two versions, there were over 200 commits and too many PRs. I neither had the time nor compiler knowledge to review all of this. I did check if there had been an LLVM backend update, as that can change lots of things, but no, the LLVM version was the same in both Rust versions.
Why error -22?
If it was a compiler bug, it would be very hard to find in the time I had left, so I went back to looking at the error itself. According to Nordic's docs, the error could not be returned by the nrf_send
function, but obviously it was. I needed to know why.
Nordic has a forum where you can ask questions and I've used it in the past. I've found it difficult to get help there before though, because I use Rust and none of the tooling they provide, which makes them hesitant to get involved.
So... I decided to look at the disassembly, which technically goes against their license (section 5).
I'm definitely not the first one to break that provision from their license though, because anyone doing an objdump on their own binary has already broken it. Also, I think it's quite ridiculous that I'm not allowed to look at the disassembly of my own program. And for a previous problem I had with this library, I had a private chat with one of Nordic's engineers and he essentially encouraged me to reverse engineer something because he told me there might have been some datastructure in the shared memory that might have solved the problem I was having at that point.
So the first thing I noticed when looking at the disassembly was that all the debug info you could wish for is still present in the blob. It's almost as if they want me to look; Nothing is obscured and all functions have proper names! I could even see which source file and line every instruction belonged to.
Because we're going to look at a lot of assembly code, here's a very quick overview of some ARM assembly instructions and names:
push
: Pushes registers on the stackmov
&ldr
: Moves or loads a value from right to leftstr
: Stores the value in memory somewhere[]
: This 'dereferences' the value in here. Sostrd r1, [r0]
stores the value in r1 at the address that r0 contains.b
: Branches (or jumps) to something. There are loads of variants likebne
(branch if not equal) andbl
(branch and link back)lr
: Links register. This contains the program address to return to when we return from the current functionpc
: Programs counter, the current position in the programsp
: Stacks pointer
So, what does nrf_send
do?
; ssize_t nrf_send(int socket, const void *buffer, size_t length, int flags)
00061eac <nrf_send>:
61eac: b513 push {r0, r1, r4, lr}
61eae: 2400 movs r4, #0x0
61eb0: e9cd 4400 strd r4, r4, [sp] ; `sockaddr` ptr and length are set to null
61eb4: f7ff feee bl 0x61c94 <nrf_sendto> @ imm = #-0x224
61eb8: b002 add sp, #0x8
61eba: bd10 pop {r4, pc}
It turns out, not much. It just calls nrf_sendto
, which is mostly there for connectionless UDP. The function is the same as nrf_send
, except you also need to provide a destination address and length. The assembly shows it's calling the function with a null pointer for the destination address and 0 for the length. Because I am calling this with a connected TCP socket, I guess the destination address is already known internally.
ssize_t nrf_sendto(int socket, const void *message, size_t length, int flags, const struct nrf_sockaddr *dest_addr, nrf_socklen_t dest_len)
nrf_sendto
does a couple of things. I'll summarize because this function is too big to show here:
- Some state checks
- Preparing some data
- Putting an rpc request in shared memory
- Waiting for an rpc response
- Return
Nothing crazy.
So I looked at how error handling is done and this was rather easy.
There are several blocks like this:
61cf6: f7ff fcbf bl 0x61678 <sock_sockaddr_fam_is_valid> @ imm = #-0x682
61cfa: b908 cbnz r0, 0x61d00 <nrf_sendto+0x6c> @ imm = #0x2
61cfc: 206a movs r0, #0x6a
61cfe: e7d9 b 0x61cb4 <nrf_sendto+0x20> @ imm = #-0x4e
61d00: 4638 ...
On line 1 we jump to a function that does some sort of check. From the name I can guess this returns a boolean value and with the C ABI the return value can normally be found in r0
.
Line 2 has a compare and branch when not zero
, which checks r0
. If r0
contains true, we jump to the address given as the second argument to the instruction which is line 5. If the value is false (i.e. 0), then we don't branch, which means we end up at line 3.
Line 3 loads the value 0x6A (or 106 in decimal) into r0
after which we unconditionally branch to another part of the function on line 4.
The part we jump to is this:
61cb4: f7fc f970 bl 0x5df98 <nrf_modem_os_errno_set> @ imm = #-0x3d20
61cb8: f04f 34ff mov.w r4, #0xffffffff
61cbc: 4620 mov r0, r4
61cbe: b005 add sp, #0x14
61cc0: e8bd 8ff0 pop.w {r4, r5, r6, r7, r8, r9, r10, r11, pc}
The first thing that's done on line 1 is jumping to nrf_modem_os_errno_set
. This is an OS function we actually control! It has one argument, the error number, and it's the way the library communicates errors to us. The C ABI puts the first argument of a function in r0
, so we know this code is now reporting error 106.
The rest of the code just returns from the function. Line 2 and 3 get -1 (0xFFFFFFFF in hex) into the r0
register (I have no idea why it goes through r4
first) and then pops everything it needs to restore from the stack.
The contract of the send function is that it returns -1 when an error occurs, after which the errno can be read. So we can see that I was looking right at the error handling code!
And indeed, the error codes header says that 106 is NRF_EAFNOSUPPORT
, which has the comment /**< Address family not supported */
.
Great, I was now able to read all the error codes here and what causes them!
[side tangent]
There's something quite clever here that the compiler does. In the beginning it pushes registers to the stack like this:
61c94: e92d 4ff0 push.w {r4, r5, r6, r7, r8, r9, r10, r11, lr}
Lots of registers, but most importantly: the lr
register in the end. This is the link register and on an ARM chip like mine, that contains the program counter value to jump back to. Returning from a function often has this shape:
b lr
So, branch(jump) to what's in the lr
register.
But look back at the pop instruction. The lr
register was pushed, but it's popped back as the pc
register directly! This saves the normal b lr
instruction and makes the function return immediately.
So now it's:
pop.w {r4, r5, r6, r7, r8, r9, r10, r11, pc}
instead of:
pop.w {r4, r5, r6, r7, r8, r9, r10, r11, lr}
b lr
I just think that's neat.
[/side tangent]
Back to reading the error codes: I scanned though the assembly to see if the number 22 was loaded into r0
anywhere before jumping to the error handling portion. And it was, even though supposedly this function doesn't return it!
So, what causes error 22: sock_sockaddr_len_is_valid
Huh... I would not have guessed that. I'm not even providing a socket address since this is a connected tcp stream... The address is probably stored in some array somewhere together with other socket info.
Thinking about it now, at the time of writing, I realize that the nrf_sendto
posix api does have a valid error that uses EINVAL
. It even describes it like this:
[EINVAL] The dest_len argument is not a valid length for the address family.
The error is logical for the sendto
function and the Nordic engineers must have thought this error would never propagate to the send
function. I had not realized this back when I discovered all this. In the end that luckily didn't matter.
So what was happening here? Was there some memory corruption?
I decided this was a good point to start the debugger. I knew where to set the breakpoints and what to look for. But things weren't great: The debugger has to work well with Rust code and source-less C code at the same time; There is some time sensitivity because there's the other modem core still doing things while my program is halted; And there's the modem jail always ruining my life.
I could break at the start of the function, but whatever I did, I just couldn't break on the code that created this error 22 case. Was this another problem resulting from the same bug? Was this code timing sensitive? Not sure.
At this point the planned two days were almost over.
I looked at the implementation of sock_sockaddr_len_is_valid
and it looked fine. It checks if the address is IPv4 or IPv6 and then does some checks based on that. Totally fine. Another thing I noticed is that the nrf_connect
function, which lets us connect a socket to a tcp server, also does this check and also emits the EINVAL
(22) error in the same way.
My conclusion was that there probably was some corruption of memory somewhere between the time the connect function was called, when the check succeeds, and the time the send function was called, where the check fails. But I couldn't find what the corruption was and what could cause it.
At the end of the two days I reported back that there definitely was an issue that was worth investigating further. In theory, if the problem was in some crate somewhere or in the compiler, there would be a decent chance that simply updating everything would solve the problem. The annoying thing with that is that you wouldn't know whether the error had been fixed, as reproducing this was so flaky.
After some consideration, the client decided to give us more time to look into this problem and we also involved our good friend and former colleague Wouter Geraedts (aka Wassasin online). Together we would tackle this problem.
Continuing with a fresh look
Wouter started optimistically, convinced we could solve it in a day. I wish he had been right.
First I needed to get Wouter up to speed and so I showed him everything I had learned about the problem. Just like me, he was sceptical about sock_sockaddr_len_is_valid
and looked at it again. But it really turns out that the function looks fine and gets called just fine.
Except, well, did it get called? When I looked at it before I said I had huge trouble stepping through that area of the code. Looking more closely at the code, that function seems to only be called when the sockaddr argument to the nrf_sendto
is not null (though that wasn't super clear). This actually makes a lot of sense because why would the sockaddr need to be checked again? It had already been checked by the connect function.
But if this check is the only thing that returns the error 22 and it isn't run, then how come we see the error anyways?
Well, Wouter found one other potential place where an error was returned. It didn't return a hardcoded number, but instead it returned a runtime error. What was it?
The libmodem communicates with the modem core using IPC and shared memory. More specifically, it uses those to do Remote Procedure Calls (RPC). The error path we found was for failed RPC calls. And indeed when we put a breakpoint there, it stopped there and we saw the number 22 in one of the registers. Aha! We had found something new!
Failing RPC call
Luckily for us, there was only one RPC call in the nrf_sendto
function so it likely roughly mirrored the function itself. Stepping though the RPC code, we saw that the modem indeed returned an error. But it was error 6, not 22!
What does that mean? Well, apparently nothing special. It turns out that number 6 is used as an index into a data table and it just so happens that the 7th element contains the number 22. This is probably an error code translation table. Still we wanted to figure out what error 6 meant and after some searching in the debug info, we found that the name was EINVAL
... So, yeah, translating that to error 22, which means the exact same thing, seems fair.
An invalid value error is generally returned when an input argument is wrong or illegal. It's either that or there's a bug in the modem itself. We can't look into the modem, but we can look at the inputs.
I still don't know fully how the Remote Procedure Call (RPC) works but roughly the following steps are done for sending data over a socket:
- Copy the data buffer to the shared TX memory
- Construct the RPC data in the shared memory
- The RPC data contains a pointer and a length to what was allocated in the TX memory
Did I mention before that running the debugger was a nightmare? Yeah, we were still battling with modem-jail. In fact, it was around this time that we invented the term.
It was hard to figure out where in shared memory all this data was stored. So instead of figuring that out, we simply dumped all the shared memory contents to the logs so we could look at it later. On its own that didn't give many insights, but we knew that if we changed the program a little bit the problem would go away. If the input values were indeed invalid in some way, we should see it if we compared the shared memory state.
What do you mean the length is 0?
The shared memory states looked very similar when comparing the working version to the broken version. There were small differences but nothing really stood out. The shared memory is over a kilobyte so that wasn't a surprise.
After staring at it long enough though, I found something interesting!
In the working version, I found a section that had two interesting words: one seemed to be a pointer to the TX shared memory and the other had a value of 44, which was the length of the message we were sending to the server. We guessed this had to be where the RPC was stored. And when looking over at the broken version of the code, those two words both had the value 0
.
So when it works, the RPC has a valid pointer and length. And when it doesn't work, the RPC has a null pointer and length of 0...
Yeah, I guess it's fair that the modem thinks a null pointer and length is not a valid input. But why the heck is this happening!?
Fidgeting with the buffer
This was so weird. The pointer being null, well, we don't really see the pointer being created, so ok. But the length being 0 is strange, because we can see the length being copied to the RPC in the assembly.
Let's look at how we're calling the nrf_send
function in the wrapper:
/// Call the [nrfxlib_sys::nrf_send] in an async fashion
pub async fn write(&self, buffer: &[u8], token: &CancellationToken) -> Result<usize, Error> {
token.bind_to_current_task().await;
core::future::poll_fn(|cx| {
if token.is_cancelled() {
return Poll::Ready(Err(Error::OperationCancelled));
}
register_socket_waker(cx.waker().clone(), self.fd, SocketDirection::Out);
let mut send_result = unsafe {
nrfxlib_sys::nrf_send(self.fd, buffer.as_ptr() as *const _, buffer.len(), 0)
};
if send_result == -1 {
send_result = get_last_error().abs().neg();
}
const NRF_EWOULDBLOCK: isize = -(nrfxlib_sys::NRF_EWOULDBLOCK as isize);
const NRF_ENOTCONN: isize = -(nrfxlib_sys::NRF_ENOTCONN as isize);
match send_result {
0 if !buffer.is_empty() => Poll::Ready(Err(Error::Disconnected)),
NRF_ENOTCONN => Poll::Ready(Err(Error::Disconnected)),
bytes_sent @ 0.. => Poll::Ready(Ok(bytes_sent as usize)),
NRF_EWOULDBLOCK => Poll::Pending,
error => Poll::Ready(Err(Error::NrfError(error))),
}
})
.await
}
So we're getting the buffer as a slice, some bookkeeping, using poll_fn
to create a future very easily, more bookkeeping, and then sending the buffer. The pointer and the length are taken from the buffer and then passed to the nrf_send
function.
This looks completely fine and works as you'd expect. That's good, because I was the one who wrote this wrapper!
So what's happening? How could the length be 0? The slice is created in safe Rust.
While I was looking at this, Wouter found that there's an alternative version of libmodem
that has logs in it. When we looked at it with those logs enabled, it actually logged a message length of 0! I wish we had found that earlier. At least now we had a way to easily spot the moment things went wrong.
Back to that buffer slice. We still couldn't rule out some memory corruption somewhere, or a miscompilation, even in otherwise safe code.
One thing you learn when you do benchmarking in Rust is using black_box
. This is a hint to the compiler that it should be pessimistic about that code and not perform as many optimizations.
If there's a miscompilation or some UB that's manifested by optimizations, then adding a black box might 'fix' the problem right? The only thing I changed was adding block_box(buffer)
right before getting the pointer and length of it.
To be honest, this was a throwaway thought. I just tried it because why not and when it turned out to work I was very surprised. Suddenly we didn't observe the error anymore!
So, was there some optimization that broke our code?
Since using the black box should disable some optimizations, it was now important to find out what had changed between the versions, because that could be key in figuring out why this was suddenly working. And that was easier said than done...
Comparing two big binaries
How do you compare two binaries? Or more accurately, how do you compare the assembly of two binaries?
It turns out there are no good tools for this. For some reason there were quite some changes in the two binaries with and without black box.
Comparing single functions is very easy. Just do an object dump, copy the function to a text editor, and let it then do a compare. Depending on your editor (vscode in our case) it's a pretty nice comparison tool that detects reordered sections.
But do a compare on a whole-program dump and it's simply too large. The smart comparison tools like in vscode never come through and other dumber diffing tools give awful output for reordered code segments.
We were stuck here for a long time, only very slowly making progress. We also made our own diffing tool, which wasn't a success. We used a tool made by the Rust project and that didn't work for us either.
This period was really frustrating: No progress and a client that wanted to see progress.
In the black box version we were seeing (very small) changes in the wrapper function, while all the functions in libmodem
were the same. We also noticed that there were more dereferences being done for some reason. To what, and why, was unclear. Mind you, we were looking at the assembly that was generated by the Rust async machinery. Finding out which data types you're dealing with and where they are stored was very hard. This points to another lack of tooling. It's really hard to peek into the generated futures and see what everything does.
I'll show what I mean with the assembly doing more dereferences.
The broken version did this to call the nrf_send
function:
ldr.w r1, [r5, #0x4c4] ; Load the buffer pointer. r5 is likely the pointer to the future object
movs r3, #0x0 ; Use 0 for the flags
ldr.w r5, [r5, #0x4c8] ; Load the buffer length to r5 for some reason (likely gonna be used later)
ldr r0, [r4, #0x4] ; Load the `fd` of the socket. r4 is likely the `self` reference
mov r2, r5 ; Move the buffer length to the correct register
bl 0x61eac <nrf_send> @ imm = #0xc7fc
Meanwhile the working black box version did (roughly) this:
ldr.w r6, [r5, #0x4c4] ; Dereference the buffer
movs r3, #0x0
ldr.w r1, r5, [r6] ; Dereference again? And load the pointer and length in one instruction
ldr r0, [r4, #0x4]
mov r2, r5
bl 0x61eac <nrf_send> @ imm = #0xc7fc
We could not figure out why this extra derefence was there and why this made things work. At some point, we switched focus from the assembly to the generated MIR and LLVM-IR.
MIR is the Mid-level Intermediate Representation of the program. This is what the Rust code looks like after some slight processing and interpretation from the compiler. For example, RAII exists in Rust code, but not in MIR. This means that you can already see when variables are being dropped in MIR.
However, the MIR was virtually the same, so that was kind of lame. To the Rust compiler frontend, our code didn't change. But then why is the assembly so different?
Maybe the answer was in the LLVM-IR?
References galore
And there was more going on there. I don't quite remember all those details since this period was all a blur and it was Wouter who looked at the IR the most, but we found the same thing going on there as in the assembly: For some reason, more things were being dereferenced when the black box was used. The LLVM-IR does keep some more names and data around, so we could see this applied to the buffer.
Again, we didn't know why this was working. And it seemed incorrect as well. A slice is &[u8]
(which is a fat pointer) and through the closure we had a reference to it (&&[u8]
) so doing one dereference (*&&[u8]
) to get the pointer and the length would make sense. But we were seeing the equivalent of **&&[u8]
. Really strange.
To try and make more sense of it, I looked at the debug info again using dwex. And at some point suddenly everything clicked. We were looking at a complete red herring.
Let's look at the definition of black box:
pub fn black_box<T>(dummy: T) -> T
It moves in the T value. It doesn't get it by reference.
Add to that the fact that this black box is done in a closure with a buffer that is outside the closure. Now, I'd be a liar if I claimed to know all the closure capture rules, so I don't fully understand why things are as they are, but by adding a move in the closure, the capture of the buffer is done differently.
The buffer is normally only used by reference in the as_ptr
, len
, and is_empty
functions. Those all technically ask for a reference to the slice, but the compiler likely optimizes that so the closure captures the buffer directly.
But the black box doesn't ask for a &T
, but for a T
by move. For some reason, this changes the closure so it captures the buffer without that optimization which keeps an extra layer of indirection. This explains what all those extra dereferences were all about.
And why did this work? Well, no idea really, but one of the problems was that super small changes to the program could make the bug manifest or make it stop manifest. We concluded that the program behaved differently simply because the program was different.
This was a real bummer; All that work for nothing.
Almost ready to give up
We had not gotten further in a long time. At this point, we had already spent around 20 days on this issue collectively and we had no clear direction to go in. Maybe we could take a more brute-force approach, but that was infeasible due to the modem jail.
Why was the RPC call made with length 0 and a null pointer?
The only lead we had left was the new compiler version that made things work. Was it a bug that was fixed in the new compiler version?
Between the two nightly versions where the code was broken and working were a lot of PRs. But since we didn't have anything better to do, I decided to try and find the specific PR that seemingly fixed things.
So in a sort of manual binary search through the PRs, I built the Rust compiler and tried it out on the device. This was a nice and easy time because building the compiler, recompiling the firmware and waiting on modem jail gave a lot of down time.
In the end I found the exact PR! It's a PR that introduces local_waker
to async contexts.
We investigated it, but in the end had to conclude that this too gave different program behavior simply because the program was different. The local waker didn't fix anything.
Last efforts
Even though we had no clear direction, we were still confident that if we simply put enough effort into this problem, we'd eventually be able to find the source of it. The biggest issue was that it was unclear how much effort that would be.
The alternative to continuing was to just update everything to the most recent versions (dependencies and compiler) and hope that something fixed our bug. But we would not be able to know for sure whether the bug had been fixed.
If we were able to make our lives easier and make the debugging faster, the mental calculation would fall in favor of continuing. If we couldn't do that, then we'd be better off giving up and updating and hoping for the best.
So we had time to try two more things:
- Mocking the modem
- Getting the watchpoint to work
Both of these would improve our debug speed and open up new possibilities.
If we could mock the modem and still reproduce the issue, then we'd know the bug is not in the modem, we would not get thrown in modem jail anymore and we could potentially run it on desktop through MIRI.
Getting the watchpoint - like a breakpoint but for a memory location - to work would allow us to keep tabs on when memory locations were being written. When the location is written, the debugger halts the program and you can see the code that performed the write. For some reason, our debug setup didn't support watchpoints while the mcu datasheet suggested that they were supported in hardware.
I would focus on the mock and Wouter would work on the watchpoints.
Mocking the modem
This was rather easy. We had a header file of the libmodem
api and I simply had to create an implementation of it and stop linking to the existing binary. This is very easy to do with use of the cc
crate that can compile C code as part of the cargo build script.
If you're interested, you can still find the mock file here: github/gist.
The results, however, were a bit disappointing. Whatever I did, I could not reproduce the issue. I had made the outputs better, made it respond more closely to the real values, and I had even introduced a work queue so I could replicate the timing of everything in case the bug was timing sensitive (which is something we were suspicious of from the start).
Nothing worked. And normally you'd learn that the bug must be within the libmodem
implementation. But we've seen before that making small changes to the program leads to the bug being triggered, or not, very easily. Our mock was different... so... We didn't really learn anything.
Watchpoint
Wouter quickly got the watchpoints to work. This was lucky because we were pessimistic and had already looked at configuring the DWT (Data Watchpoint and Trace Unit) through the firmware if the debugger didn't cooperate.
We tried using watchpoints at the start of the adventure, but not all hardware supports it. So when we didn't get it to work, we assumed this microcontroller also had no support for it. But simply updating the debugging stack to the newest versions did the trick. So if you're ever debugging something and the experience is kind of bad, then make sure to download the newest versions! Even when that means going outside of your OS package manager.
Once Wouter got things going, the whole debugging experience became much nicer.
We decided that he could just step through the code and watch 'random' memory locations while I was still working on the mock.
Things went fast from here. He was able to find where exactly the rpc length was written. He noticed that the length was lowered from the 44 input to the function to 0. Why?
Well, apparently there was one more sneaky step in determining the length that would be sent to the modem. See, the data is written to the TX shared memory. But that memory has a max size and if the length is longer than that size then the length is shortened to fit in it. But our TX shared memory was 8kB; Why would the code think it's 0?
A pointer to oblivion
The sizes of the TX and RX buffers are given by the users of libmodem
, so us, in our case. We could confirm that we had configured 8kB of TX shared memory by looking at our source code. We were using the default memory layout defined in the nrf-modem
wrapper and indeed, that's 8kB.
So this should not go wrong. There must be a reason why the library thinks this size is 0, and not just always 0, but a really sensitive, sometimes 0.
Wouter looked more closely at the length checking code. It seemed like the TX shared mem size was read through a pointer and using his new watchpoint abilities, he tried to watch what was written at that memory location and when. The first time the program halted was in the nrf_modem_init
function here: github
That's super logical, since that's when the size of tx region is set. But soon after, the program halts again and the value is set to 0. Looking at the stack trace, we see... something that doesn't make sense...
The program is halted in the middle of nowhere, pretty much in a random place in the application. How come our program randomly writes to where the shared mem region config is stored? It can't be! It's not in an unsafe block or anything and everything around it is also using safe Rust code. What the heck.
Then the puzzle pieces started to fall into place. What if the 'random' application code isn't wrong? And what if the libmodem
is also not looking at the wrong memory location?
We looked through the nrf_modem_init
disassembly and found our problem. The modem init function has a pointer to a config struct as a parameter. It then creates a pointer to one of the fields of the config struct and then keeps reusing it. Let's look at the init code of the wrapper that calls the modem init function:
/// Start the NRF Modem library with a manually specified memory layout
pub async fn init_with_custom_layout(
mode: SystemMode,
memory_layout: MemoryLayout,
) -> Result<(), Error> {
// ...
// Tell nrf_modem what memory it can use.
let params = nrfxlib_sys::nrf_modem_init_params {
shmem: nrfxlib_sys::nrf_modem_shmem_cfg {
ctrl: nrfxlib_sys::nrf_modem_shmem_cfg__bindgen_ty_1 {
base: memory_layout.base_address,
size: nrfxlib_sys::NRF_MODEM_SHMEM_CTRL_SIZE,
},
tx: nrfxlib_sys::nrf_modem_shmem_cfg__bindgen_ty_2 {
base: memory_layout.base_address + nrfxlib_sys::NRF_MODEM_SHMEM_CTRL_SIZE,
size: memory_layout.tx_area_size,
},
rx: nrfxlib_sys::nrf_modem_shmem_cfg__bindgen_ty_3 {
base: memory_layout.base_address
+ nrfxlib_sys::NRF_MODEM_SHMEM_CTRL_SIZE
+ memory_layout.tx_area_size,
size: memory_layout.rx_area_size,
},
trace: nrfxlib_sys::nrf_modem_shmem_cfg__bindgen_ty_4 {
base: memory_layout.base_address
+ nrfxlib_sys::NRF_MODEM_SHMEM_CTRL_SIZE
+ memory_layout.tx_area_size
+ memory_layout.rx_area_size,
size: memory_layout.trace_area_size,
},
},
ipc_irq_prio: 0,
fault_handler: Some(modem_fault_handler),
};
// ...
// OK, let's start the library
unsafe { nrfxlib_sys::nrf_modem_init(¶ms) }.into_result()?;
// ...
Ok(())
}
See what's happening here? We create a config struct on the stack that has a field for the tx area size and pass it to the init function by pointer. Internally, libmodem
is taking a pointer to that field. Then after the init is done, the config is dropped and doesn't exist anymore. This means the pointer that libmodem
is holding on to is invalid.
So in the program versions that didn't work, this 'random' place on the stack simply had 0 written to it by accident. And in program versions that did work, there was a non-zero value by accident.
So why was the program so sensitive to small changes? It's all got to do with the exact stack layout. Make a type larger or smaller, or add or remove a variable from the stack and a different value will be written in the memory location that once inhibited the tx area size value.
This is awful. How could this have gone wrong? Who wrote this!? (It was me.)
It's not my fault, I swear
My wrapper nrf-modem
is a sort of continuation of Jonathan Pallant's nrfxlib
crate. The major difference is that the new wrapper makes everything async and is built on a much newer version of libmodem
.
There's a lot of overlap though, including the modem init. If you look at the init code, you'll see the same general structure.
This pointer mishap was also present in the older nrfxlib
. So it's Jonathan's fault? Well, no.
We looked at how Nordic does its initialization in Zephyr and they initialize their config struct as a static. So at least they are doing it properly. But Jonathan is an experienced programmer. We didn't think he would've made a mistake like this. And he hasn't because, when we looked at when his code was written and then looked at Nordic's own init code at that same time, they also put the config struct on the stack!
There are two possibilities here:
- That code used to be correct and the init didn't require a static config but they changed it so the init now does require a static config. Well, then Nordic messed up! This change hasn't been mentioned in the changelog and there's no docs that I found that mention this requirement.
- It was wrong in their code so they fixed it. Here Nordic still messed up! The lack of docs made them introduce a bug in their own code. And once they discovered it, they simply fixed it but didn't spend any time to prevent their users from writing the same bug.
Both these possibilities are awful and this has definitely lowered my opinion of Nordic.
This would have been so simple to put in the docs. I've opened a ticket on their DevZone forum. As of writing they've still not updated the docs of the init function.
No way to prevent this, says only language where this regularly happens
Look...
Please use Rust.
This bug would not have happened if libmodem
had been written in Rust, because the init function would have taken a reference to the config struct with a static lifetime. Anything passed to it that doesn't have a static lifetime would have resulted in a compiler error and the requirement itself would be part of the function signature.
Please, please, please stop creating new projects in C. It's actively harmful to our industry. I'm not suggesting to rewrite the whole world; Obviously you can maintain and extend existing code. But for new things, let's do better.
At least I'm glad we were able to find the issue. I just wish it hadn't ever been an issue.