September 14, 2022

Using C libraries in your Rust project

Folkert

Embedded software engineer

Recently, we gave a workshop for the folks at iHub about using Rust, specifically looking at integrating Rust with cryptography libraries written in C.

Rolling your own crypto is known to be a bad idea, but what are your options if your language of choice just does not have an implementation for the cryptographic primitives you want? Or perhaps an implementation does exist, but it hasn't been under the same scrutiny as some more standard implementation in another language. Blindly trusting someone else's crypto is not that much better than rolling your own.

Rust has many benefits for new software development, but many industry-standard cryptography implementations are written in C. In this post, we'll look at using these libraries from Rust.

Why can't we just import C?

Languages like Zig, D, and Nim can just import C header files and then call C functions with minimal friction. In Rust, using C code is less straightforward.

The problem is that other languages disagree with C on memory layout (e.g., the ordering of struct fields) and calling convention (how do arguments get passed to a function, how is the result returned).

Hence, the only way to make interoperability seamless is to have a C compiler frontend (lexer, parser, error reporting) in your compiler, and furthermore, to make accommodations for C's quirks in the rest of your compiler. That's a lot of complexity and a huge maintenance burden for the people writing a compiler for a language that is (often very deliberately) not C.
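The layout disagreement mentioned above can be observed from within Rust itself. The sketch below uses two hypothetical structs with the same fields: one with Rust's default layout, where the compiler may reorder fields, and one marked #[repr(C)], which keeps declaration order and C-style padding. The struct and function names are made up for illustration, and the exact sizes depend on the target.

```rust
use std::mem::size_of;

// Default Rust layout: the compiler is free to reorder the fields.
#[allow(dead_code)]
struct RustLayout {
    a: u8,
    b: u32,
    c: u8,
}

// C layout: fields stay in declaration order, padded the way a C compiler would.
#[allow(dead_code)]
#[repr(C)]
struct CLayout {
    a: u8,
    b: u32,
    c: u8,
}

fn rust_layout_size() -> usize {
    size_of::<RustLayout>()
}

fn c_layout_size() -> usize {
    size_of::<CLayout>()
}

fn main() {
    println!("default layout: {} bytes", rust_layout_size());
    println!("repr(C) layout: {} bytes", c_layout_size());
}
```

On mainstream targets, where u32 is 4-byte aligned, the #[repr(C)] version needs padding after both u8 fields, while the default layout is usually smaller because the fields get reordered.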

Aside: It is also not evident that having seamless interop is actually desirable for a language. Idiomatic C is very different from idiomatic Rust, and it's very possible that "idiomatic Rust" would look quite different if, on day 1, most Rust code was just bindings to C.

In the absence of seamless interop, we must fall back to the C application binary interface (ABI), a specification of what data structures and function calls look like in memory. We compile our Rust code into an object file, the C code into an object file, make both adhere to the C ABI, and then have the linker stitch the two together.
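Adhering to the C ABI works in both directions. As a minimal sketch, here is a Rust function declared with extern "C" so that it uses the C calling convention; the names add_c_abi and call_through_pointer are hypothetical. Actually exporting it to a C caller would additionally need an unmangled symbol name, which is left out here.

```rust
use std::os::raw::c_int;

// `extern "C"` selects the platform's C calling convention for this function.
extern "C" fn add_c_abi(a: c_int, b: c_int) -> c_int {
    a + b
}

// The calling convention is part of the function pointer type.
type CAdd = extern "C" fn(c_int, c_int) -> c_int;

fn call_through_pointer(f: CAdd, a: c_int, b: c_int) -> c_int {
    f(a, b)
}

fn main() {
    println!("{}", call_through_pointer(add_c_abi, 2, 3));
}
```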

Aside: while this works, it is far from ideal; see also these two excellent recent blog posts: "to save C we must save ABI" and "C is not a language anymore".

Our project

As an example, we'll use the tweetnacl C library. It is a compact implementation of much of the NaCl cryptography library, shipped as a single C file plus a header. The full source fits in 100 (140-character) tweets.

Let's see how we can use this library from Rust, and how we can go about writing a proper wrapper around this library.

Generating bindings

We will use the rust-bindgen tool to automate writing the bindings. It can be done manually, but that is rather laborious.

Given a header file, bindgen will produce a Rust file with all the public functions and constants from the header file.

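Let's try it. Install bindgen with the command below (newer bindgen releases publish the command-line tool as the separate bindgen-cli crate, so there you would run cargo install bindgen-cli instead):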

cargo install bindgen

With a clone of the tweetnacl-bindgen repository, we can then generate our bindings with:

bindgen tweetnacl.h -o src/bindings.rs

This creates an absolute mess of a file, consisting first of many constants:

/* automatically generated by rust-bindgen 0.59.1 */

pub const crypto_auth_PRIMITIVE: &'static [u8; 14usize] = b"hmacsha512256\0";
pub const crypto_auth_hmacsha512256_tweet_BYTES: u32 = 32;
pub const crypto_auth_hmacsha512256_tweet_KEYBYTES: u32 = 32;
pub const crypto_auth_hmacsha512256_tweet_VERSION: &'static [u8; 2usize] = b"-\0";
pub const crypto_auth_hmacsha512256_BYTES: u32 = 32;
pub const crypto_auth_hmacsha512256_KEYBYTES: u32 = 32;
pub const crypto_auth_hmacsha512256_VERSION: &'static [u8; 2usize] = b"-\0";
...

and then extern "C" function signatures:

extern "C" {
    pub fn crypto_auth_hmacsha512256_tweet(
        arg1: *mut ::std::os::raw::c_uchar,
        arg2: *const ::std::os::raw::c_uchar,
        arg3: ::std::os::raw::c_ulonglong,
        arg4: *const ::std::os::raw::c_uchar,
    ) -> ::std::os::raw::c_int;
}
...

The generated code uses some Rust features that we don't typically see in everyday Rust code. Let's go through them.

the extern "C" block

A function signature without an implementation inside an extern "C" block indicates that Rust expects a function with this name and type to be provided by the linker. When Rust calls this function, the C calling convention and ABI will be respected.

raw pointers

Raw pointers can either be constant *const T or mutable *mut T. They are similar to pointers in C, and more low-level than the references &T that most Rust APIs use.

In Rust, references are like pointers, but come with additional guarantees:

  • a reference has the same in-memory representation as a pointer
  • a reference is NEVER NULL!
  • a reference ALWAYS points to valid (allocated, aligned) memory

The upside is that these guarantees are implicit within (safe) Rust code and never need to be defensively checked.

But when calling a C function, we cannot assume these promises are kept. For instance, if we give a C function a reference from Rust, C might free() the memory behind it, and we get undefined behavior on the Rust side.

Therefore we must fall back to raw pointers to communicate with C, and take responsibility for ensuring that when we turn a pointer into a reference, the conditions are met.
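As a small sketch of what "taking responsibility" can look like, the hypothetical helper below receives a raw pointer (as it might come from C) and only turns it into a reference after a null check. The name read_or_default is made up; the null check is done by the standard as_ref method on raw pointers, while alignment and liveness remain the caller's obligation.

```rust
use std::os::raw::c_int;

/// Reads the value behind `ptr`, or returns `default` for a null pointer.
///
/// # Safety
/// A non-null `ptr` must be aligned and point to a live, initialized `c_int`.
unsafe fn read_or_default(ptr: *const c_int, default: c_int) -> c_int {
    // `as_ref` yields `None` for null; it cannot detect dangling pointers.
    match unsafe { ptr.as_ref() } {
        Some(value) => *value,
        None => default,
    }
}

fn main() {
    let x: c_int = 42;
    // A reference coerces to a raw pointer without any cast.
    let from_valid = unsafe { read_or_default(&x, -1) };
    let from_null = unsafe { read_or_default(std::ptr::null(), -1) };
    println!("{} {}", from_valid, from_null);
}
```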

C types

The c_ types are used because C integer types don't map 1-to-1 to Rust types. The Rust standard library provides aliases that ensure the mapping is correct for the particular target you use.
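To get a feel for these aliases, the sketch below prints their sizes on the current target and converts a Rust length into the C type that the generated bindings expect. The helper name len_for_c is hypothetical; the aliases themselves come from std::os::raw.

```rust
use std::mem::size_of;
use std::os::raw::{c_int, c_long, c_uchar, c_ulonglong};

// Converts a slice length (usize) into the C `unsigned long long` type.
fn len_for_c(data: &[u8]) -> c_ulonglong {
    data.len() as c_ulonglong
}

fn main() {
    // c_long in particular differs between targets.
    println!("c_uchar:     {} byte(s)", size_of::<c_uchar>());
    println!("c_int:       {} byte(s)", size_of::<c_int>());
    println!("c_long:      {} byte(s)", size_of::<c_long>());
    println!("c_ulonglong: {} byte(s)", size_of::<c_ulonglong>());
    println!("length as c_ulonglong: {}", len_for_c(b"foobarbaz"));
}
```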

Compiling the C code

We can instruct cargo to automatically build and link our C code into the executable it produces.

Create build.rs and insert

fn main() {
    cc::Build::new()
        .warnings(false)
        .extra_warnings(false)
        .file("tweetnacl.c")
        .compile("tweetnacl"); // outputs `libtweetnacl.a`
}

And add this section to your Cargo.toml

[build-dependencies]
cc = "1"

To silence some warnings about the naming of the generated constants, it's helpful to put this line at the top of src/bindings.rs:

#![allow(non_upper_case_globals)]

Writing a safe Rust wrapper

Alright, with all that setup out of the way, we can finally get to writing a safe Rust wrapper around one of the C functions.

Our running example will be:

extern "C" {
    pub fn crypto_hash_sha512_tweet(
        arg1: *mut ::std::os::raw::c_uchar,
        arg2: *const ::std::os::raw::c_uchar,
        arg3: ::std::os::raw::c_ulonglong,
    ) -> ::std::os::raw::c_int;
}

Now, what Rust type should we give this function? As a first guess, maybe

pub fn crypto_hash_sha512_tweet(out: &mut [u8], data: &[u8]) -> i32 {
    todo!()
}

The C function seems to take an output pointer, followed by a pointer + length combination for the input. In Rust, the pointer + length pattern is represented by a slice &[T]. We don't know exactly where the length of the output slice comes from, yet.
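The correspondence between a slice and a pointer + length pair can be made explicit. In this sketch, the helper names to_parts and from_parts are hypothetical; as_ptr, len and std::slice::from_raw_parts are the standard library functions doing the actual work.

```rust
// Splits a slice into the (pointer, length) pair that a C API expects.
fn to_parts(data: &[u8]) -> (*const u8, usize) {
    (data.as_ptr(), data.len())
}

/// Rebuilds a slice from a pointer and a length.
///
/// # Safety
/// `ptr` must be non-null, aligned, and valid for reads of `len` bytes
/// for as long as the returned slice is used.
unsafe fn from_parts<'a>(ptr: *const u8, len: usize) -> &'a [u8] {
    unsafe { std::slice::from_raw_parts(ptr, len) }
}

fn main() {
    let data = [1u8, 2, 3];
    let (ptr, len) = to_parts(&data);
    let round_trip = unsafe { from_parts(ptr, len) };
    println!("{:?}", round_trip);
}
```

Going from slice to parts is safe; going back is not, because the compiler cannot check that the pointer and length really describe valid memory.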

Let's look at the source. Finding the source is actually not trivial because the function is re-exported under a different name. The actual source is in crypto_hash:

int crypto_hash(u8 *out,const u8 *m,u64 n) {
  u8 h[64],x[256];
  u64 i,b = n;

  FOR(i,64) h[i] = iv[i];

  crypto_hashblocks(h,m,n);
  m += n;
  n &= 127;
  m -= n;

  FOR(i,256) x[i] = 0;
  FOR(i,n) x[i] = m[i];
  x[n] = 128;

  n = 256-128*(n<112);
  x[n-9] = b >> 61;
  ts64(x+n-8,b<<3);
  crypto_hashblocks(h,x,n);

  FOR(i,64) out[i] = h[i];

  return 0;
}

The exact implementation logic is not important. We can, however, make two observations:

  • the only possible return value is 0. That means the return value is meaningless, and we can remove it in Rust.
  • the output is always 64 bytes (it's a sha512, 512 bits is 64 bytes, makes sense)

That brings us to:

pub fn crypto_hash_sha512_tweet(out: &mut [u8; 64], data: &[u8]) {
    todo!()
}

Finally, in Rust we like to return values, rather than storing them into a pointer, so our final signature will be

pub fn crypto_hash_sha512_tweet(data: &[u8]) -> [u8; 64] {
    todo!()
}

Writing the wrapper

We can reach the bindings from main.rs with e.g.

tweetnacl_bindgen::bindings::crypto_hash_sha512_tweet(a,b,c);

Here tweetnacl_bindgen is the name of the project, specified in the package section of the Cargo.toml (the hyphen in the package name becomes an underscore in the crate name):

[package]
name = "tweetnacl-bindgen"

Calling extern functions is unsafe, so we need an unsafe block:

pub fn crypto_hash_sha512_tweet(data: &[u8]) -> [u8; 64] {
    unsafe { 
        tweetnacl_bindgen::bindings::crypto_hash_sha512_tweet(
            todo!(),
            todo!(),
            todo!(),
        );
    }
}

Next, we can decompose our data slice into its components

pub fn crypto_hash_sha512_tweet(data: &[u8]) -> [u8; 64] {
    unsafe { 
        tweetnacl_bindgen::bindings::crypto_hash_sha512_tweet(
            todo!(),
            data.as_ptr(),
            data.len() as ::std::os::raw::c_ulonglong,
        );
    }
}

Then we can create an array, and pass a mutable pointer to our extern function

pub fn crypto_hash_sha512_tweet(data: &[u8]) -> [u8; 64] {
    let mut result = [ 0; 64 ];

    unsafe { 
        tweetnacl_bindgen::bindings::crypto_hash_sha512_tweet(
            &mut result as *mut _, // the `_` is inferred
            data.as_ptr(),
            data.len() as u64,
        );
    }

    result
}

And with that, we have a nice Rust API for the crypto_hash_sha512_tweet function.
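To experiment with this wrapper pattern without building the C library, the sketch below swaps the real binding for a stand-in written in Rust. The function fake_hash is purely an assumption for illustration: it has the same signature as the generated binding, but simply fills the output with the low byte of the input length instead of computing a hash.

```rust
use std::os::raw::{c_int, c_uchar, c_ulonglong};

// Stand-in for the real binding (NOT a hash): same signature, trivial body.
unsafe extern "C" fn fake_hash(out: *mut c_uchar, _m: *const c_uchar, n: c_ulonglong) -> c_int {
    for i in 0..64 {
        unsafe { *out.add(i) = n as u8 };
    }
    0
}

// Safe wrapper: slices in, fixed-size array out, status code dropped.
pub fn hash(data: &[u8]) -> [u8; 64] {
    let mut result = [0u8; 64];
    // SAFETY: `result` is valid for 64 writes, `data` for `data.len()` reads.
    let status = unsafe {
        fake_hash(result.as_mut_ptr(), data.as_ptr(), data.len() as c_ulonglong)
    };
    debug_assert_eq!(status, 0);
    result
}

fn main() {
    let digest = hash(b"foobarbaz");
    println!("first byte: {}, length: {}", digest[0], digest.len());
}
```

Callers of hash never see a raw pointer or an unsafe block; all the obligations towards the C side are discharged in one place.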

Initializing memory

But wait a minute, what does this do?

pub fn crypto_hash_sha512_tweet(data: &[u8]) -> [u8; 64] {
    let mut result = [ 0; 64 ];

    ...
}

Variables in Rust must be initialized before use. The array is initialized with zeros. But filling it with zeros is wasted effort because the C function will overwrite all the bytes anyway.

Normally, the Rust compiler could figure this out and remove this inefficiency. But because Rust doesn't know anything about the implementation of the C function we're calling, in this case the initialization cannot be elided.

You may not care about this inefficiency, and indeed in this case it's unlikely to matter. But just for the sake of argument, we can make the Rust version have just as little overhead as in plain C.

Again, Rust mandates that all values are initialized. If we must initialize all values, then what we want is a type whose initialization does nothing. Enter MaybeUninit:

pub union MaybeUninit<T> {
    uninit: (),
    value: ManuallyDrop<T>,
}

This is a union, a primitive that we don't often see in Rust because the compiler cannot verify that it is used safely. Unions are different from enums in that they don't store which of the variants they currently hold. This is called an untagged union. The tag, indicating which variant the bytes represent, is meant to be stored elsewhere.

We will look at ManuallyDrop in a moment. In terms of memory layout, the union is equivalent to:

pub union MaybeUninit<T> {
    uninit: (),
    value: T,
}

It represents a piece of memory that could contain either a unit (), or a value of type T. Note that whichever variant you pick, it always uses the same amount of space as the biggest variant. So a value created with MaybeUninit::uninit() uses the same amount of space as a value of type T. But the value () has a trivial initialization (do nothing), so it is perfect for our use case.

A limitation of being untagged is that when dropping a union value, it is unclear what to do: the union doesn't know which variant it is. Therefore, union variants cannot contain types that have a custom Drop implementation. The ManuallyDrop type ignores any Drop implementation that the wrapped type might have, and hence always satisfies this "no custom Drop" constraint, whereas T might not (e.g. if you pick T = String).
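A small sketch can make these union properties concrete. The two unions below are hypothetical examples, not types from the standard library: Bits shows that reading a field just reinterprets the shared bytes, and MaybeName shows a non-Copy field wrapped in ManuallyDrop.

```rust
use std::mem::{size_of, ManuallyDrop, MaybeUninit};

// An untagged union: both fields share the same four bytes.
#[allow(dead_code)]
union Bits {
    int: u32,
    float: f32,
}

// A field with a Drop implementation must be wrapped in ManuallyDrop.
#[allow(dead_code)]
union MaybeName {
    nothing: (),
    name: ManuallyDrop<String>,
}

fn float_bits(x: f32) -> u32 {
    let b = Bits { float: x };
    // Reading is unsafe: nothing records which field was last written.
    unsafe { b.int }
}

fn main() {
    println!("bits of 1.0: {:#x}", float_bits(1.0));
    // The union is as large as its biggest field.
    println!("MaybeName: {} bytes", size_of::<MaybeName>());
    println!("MaybeUninit<[u8; 64]>: {} bytes", size_of::<MaybeUninit<[u8; 64]>>());
}
```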

With MaybeUninit, we can write:

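use std::mem::MaybeUninit;

pub fn crypto_hash_sha512_tweet(data: &[u8]) -> [u8; 64] {
    // reserves 64 bytes on the stack without writing to them
    let mut result: MaybeUninit<[u8; 64]> = MaybeUninit::uninit();

    unsafe { 
        tweetnacl_bindgen::bindings::crypto_hash_sha512_tweet(
            result.as_mut_ptr() as *mut _,
            data.as_ptr(),
            data.len() as _,
        );

        // sound only because the C function writes all 64 bytes
        result.assume_init()
    }
}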

This creates a MaybeUninit with the uninit variant (initialization is free, but space for a value of type [u8; 64] is reserved on the stack), converts it into a mutable raw pointer, has C write the result through that pointer (thus properly initializing the bytes), and then declares the value initialized.
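The same sequence of steps can be tried in isolation, without the C library. In the sketch below, fill_64 is a hypothetical stand-in for the C function, written in Rust with the C ABI; the only property that matters is that it writes all 64 output bytes, which is what makes the final assume_init sound.

```rust
use std::mem::MaybeUninit;
use std::os::raw::{c_int, c_uchar, c_ulonglong};

// Stand-in for the C function (an assumption): writes every output byte.
unsafe extern "C" fn fill_64(out: *mut c_uchar, _m: *const c_uchar, n: c_ulonglong) -> c_int {
    for i in 0..64usize {
        // `write` stores a value without reading the old, uninitialized one.
        unsafe { out.add(i).write((i as u8).wrapping_add(n as u8)) };
    }
    0
}

pub fn hash_uninit(data: &[u8]) -> [u8; 64] {
    let mut result = MaybeUninit::<[u8; 64]>::uninit();
    unsafe {
        fill_64(
            result.as_mut_ptr() as *mut c_uchar,
            data.as_ptr(),
            data.len() as c_ulonglong,
        );
        // SAFETY: `fill_64` has written all 64 bytes.
        result.assume_init()
    }
}

fn main() {
    let out = hash_uninit(b"abc");
    println!("{} {}", out[0], out[63]);
}
```

If the callee wrote fewer than 64 bytes, calling assume_init would be undefined behavior, so this optimization ties the wrapper to an implementation detail of the C function.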

We can verify that MaybeUninit actually removes the initialization by looking at the LLVM IR that rustc generates. LLVM IR is the lowest-level representation that the Rust compiler itself produces; it is then handed to LLVM, which optimizes it and generates machine code. We can look at the LLVM IR after optimizations to see what our Rust code turns into (without quite going to the assembly level).

With an array of zeros:

%result.i = alloca <64 x i8>, align 1
%0 = getelementptr inbounds <64 x i8>, <64 x i8>* %result.i, i64 0, i64 0
call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 1 dereferenceable(64) %0, i8 0, i64 64, i1 false), !alias.scope !8, !noalias !11
%_2.i = call i32 @bindings::crypto_hash_sha512_tweet(i8* nonnull %0, i8* nonnull "foobarbaz", i64 9)

With MaybeUninit:

%result.i = alloca <64 x i8>, align 1
%0 = getelementptr inbounds <64 x i8>, <64 x i8>* %result.i, i64 0, i64 0

%_3.i = call i32 @bindings::crypto_hash_sha512_tweet(i8* nonnull %0, i8* nonnull "foobarbaz", i64 9), !noalias !6

Conclusion

We have seen the primitives that Rust provides to talk to C code (or anything else that uses the C ABI). Using these primitives can be tedious, but does force you to think about the right way to express the API in Rust, rather than blindly copying the lowest common denominator that is C.

With these interop primitives, we can build safe and convenient APIs around battle-tested C libraries, an important step towards safer software.
