Optimizing a parser/compiler with data-oriented design: a case study

While working on the Roc compiler, we regularly dive deep on computer science topics. A recurring theme is speed, both the runtime performance of the code that we generate, as well as the performance of our compiler itself.

One extremely useful technique that we have been playing with is data-oriented design: the idea that the actual data you have should guide how code is structured.

DoD is commonly used in game programming, where runtime speed defines what is possible. Recently it's seen more use in other domains, for instance in the Zig and Rust compilers, and projects that focus on runtime speed like Mold and Bun.

The talk Practical Data-oriented design by Zig creator Andrew Kelly is an excellent introduction to the ideas behind DoD. In this post, I'll show a change we made to the roc compiler using some of these ideas.

The collection is the abstraction

The goal of DoD is faster programs, but it also provides an interesting lens for program design in general.

The most common example of DoD is Array of Structs vs. Struct of Arrays. In most programming traditions, it is natural to have values of some shape, the structs, and if you have many of them, you put them into an array. The unit of thought is the individual struct: it is what methods are defined on.

struct Point2d { 
    x: f32, 
    y: f32,
}

type Polygon = Vec<Point2d>;

This is intuitive, but DoD takes the opposite perspective, the collection as a whole is the unit of thought:

struct Polygon { 
    xs: Vec<f32>,
    ys: Vec<f32>,
}

While this example illustrates the transformation, it does not really show the advantages that DoD has. Let's try to apply the DoD hammer to some real problems.

DoD and compilers

Compilers are usually considered as CPU-bound: they transform input (the source code) into output (executables), and are limited by how fast the machine can process the data.

However in practice the problem is a bit more nuanced. Many modern applications are not limited by the speed of the CPU -- how long it takes to perform an instruction -- but rather by getting the instructions and the data to the CPU. Compilers are mostly memory-bound.

Data-oriented design takes this as a starting point, and provides ideas for how to structure code such that your hardware can perform your computation efficiently.

Example problem

The roc parser looks at a roc source file, and produces a list of top-level definitions in that file. These are stored as a list of Defs combined with their start and end position in the file:

/// A span in the input file
#[derive(Clone, Eq, Copy, PartialEq, PartialOrd, Ord, Hash)]
pub struct Region {
    start_position: u32,
    end_position: u32,
}

#[derive(Clone, Eq, Copy, PartialEq, PartialOrd, Ord, Hash)]
pub struct Loc<T> {
    pub region: Region,
    pub value: T,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Def<'a> {
    Type(TypeDef<'a>),
    Value(ValueDef<'a>),

    // Blank Space (e.g. comments, spaces, newlines) before or after a def.
    // We preserve this for the formatter; canonicalization ignores it.
    SpaceBefore(&'a Def<'a>, &'a [CommentOrNewline<'a>]),
    SpaceAfter(&'a Def<'a>, &'a [CommentOrNewline<'a>]),
}

type Defs<'a> = Vec<Loc<Def<'a>>>;

The parsed representation is used both to generate an executable (a program you can run), but also to format roc code. For the formatter, the order of definitions matters, and also comments must be preserved: we don't want the formatter to re-order top-level definitions and throw away all comments!

The Def type is at first sight a nice representation: we use enums so we can later match on all the cases. This is close to how one might represent this idea in haskell or elm.

But when working with this type, there are some problems: there are illegal states that can be represented. For instance, we could in theory see

SpaceAfter(SpaceAfter(..., &[ ... ]), &[ ... ])

We should never generate such a value, but structurally the type allows it to exist, and rust makes us deal with such technically-possible cases.

How can data-oriented design help?

Feeding the CPU

Data-oriented design is commonly used as an optimization technique. The goal is to do the actual work that needs to be performed as fast as possible on actual hardware. This in turn means that the data must be represented in a form that the hardware can deal with efficiently.

In university courses we typically learn about performance as big-O classes. But good algorithm design is not enough to make a program run fast in the real world. We must also play to the strengths of the hardware.

The limiting factor in our compiler, and most data-intensive programs, is usually not the CPU, but how much data we can feed the CPU. Most of the time, the CPU is waiting for new inputs to become available. But we can design our programs in a way that utilizes the CPU more effectively.

Loading data

To execute an instruction, two things must be done. First, the instruction and its arguments must be loaded, and second the CPU actually performs the instruction.

On modern hardware, loading the instruction and its arguments usually has a higher latency than performing the instruction. That counter-intuitively means it can be faster to perform a computation repeatedly rather than caching it. In other words, the CPU is not actually the bottleneck. This was not always the case, but over the decades we've seen much larger improvements in CPU speed than memory speed.

To remedy the (relative) slowness of retrieving data from memory, CPUs cache the most recent data that they've used: there is a good chance it will be used again. CPUs have several layers of caches:

> lscpu  | grep "cache:"
L1d cache:                       256 KiB
L1i cache:                       512 KiB
L2 cache:                        4 MiB
L3 cache:                        16 MiB

The level 1 (L1) cache has 2 flavors: one cache for the instructions of your program, and one for data used by your program. Levels 2 and 3 don't make this distinction. These caches make a speed versus size tradeoff: L1 is tiny but loading data stored there is very fast, L3 is significantly bigger but somewhat slower (still faster than main memory though).

The memory in the caches is not just a collection of bytes: it is partitioned into cache lines. Today these are usually 64 bytes wide. When a value is loaded from memory, what is really loaded is its surrounding cache line. For instance, my_vector[0] will load not just the element, but a full cache line from main memory. If we subsequently use my_vector[1], the value is likely in that same cache line, which is already loaded and stored in a CPU cache. Consequently the second lookup is very, very fast.

But, the cache has finite space, so putting something new in the cache means something old needs to go: an existing cache line must be evicted.

In summary: to go fast, the data you are using should be in cache. To take advantage of the cache, all data you'll use should be adjacent in memory, so one cache line contains as much useful data as possible. Finally, your data should be compact, such that loading it evicts as few cache lines as possible.

Struct of Arrays

Struct of Arrays is a label for an approach that tries to achieve these things: group similar data (likely to be used together) adjacent in memory.

Usually when we iterate over a collection, we only use a couple of fields from the elements. But the whole element is loaded, and so the amount of useful information per cache line is low.

Let's look at our example:

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Def<'a> {
    Type(TypeDef<'a>),
    Value(ValueDef<'a>),

    // Blank Space (e.g. comments, spaces, newlines) before or after a def.
    // We preserve this for the formatter; canonicalization ignores it.
    SpaceBefore(&'a Def<'a>, &'a [CommentOrNewline<'a>]),
    SpaceAfter(&'a Def<'a>, &'a [CommentOrNewline<'a>]),
}

type Defs<'a> = Vec<Loc<Def<'a>>>

The idea is to represent the Defs as several arrays.

struct Defs<'a> { 
    defs: Vec<Def<'a>>,
    regions: Vec<Region>,
}

The Defs and Regions are now stored separately, in parallel arrays. For instance, the Region at index 12 corresponds to the Def at index 12. We now have to refer to definitions by their index. A rust reference would no longer work because a reference to a Def does not contain the Region any more.

This approach stores the data more densely, and unused data is never loaded. For instance when formatting definitions, the regions are not important. With this representation, during formatting the regions are never loaded into a CPU cache, and hence never evict other useful information.

But why stop here: we can add more arrays to store data that we only use some of the time, data that is usually in the way. Let's pull out the spaces:

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Def<'a> {
    Type(TypeDef<'a>),
    Value(ValueDef<'a>),
}

struct Defs<'a> { 
    defs: Vec<Def<'a>>,
    regions: Vec<Region>,

    space_before: Vec<&'a [CommentOrNewline<'a>]>,
    space_after: Vec<&'a [CommentOrNewline<'a>]>,
}

In the type checker, where spaces are irrelevant, we just don't look at the space_* arrays, and never have to worry about whether there are spaces. In the formatter, we can easily look them up given the index of the Def that we're processing.

We made this transformation based on a performance argument, but actually, this representation eliminates the invalid double SpaceAfter case: our code got better!

One more step

Let's make one more data structure change. In the formatter, only the ordering of definitions is important. But in most other places, we want to treat type and value definitions separately.

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DefTag<'a> {
    Type { type_defs_index : u32 },
    Value { value_defs_index: u32 },
}

struct Defs<'a> { 
    tags: Vec<DefTag<'a>>,
    regions: Vec<Region>,

    space_before: Vec<&'a [CommentOrNewline<'a>]>,
    space_after: Vec<&'a [CommentOrNewline<'a>]>,

    value_defs: Vec<ValueDef<'a>>
    type_defs: Vec<TypeDef<'a>>
}

We could go further still, by storing the tag as an i32 and using the sign to indicate if we're dealing with a type or value. Maybe the space_before and space_after vectors could be combined. DoD can be a really fun puzzle.

Results

We've moved a lot of code around, but did it actually make a difference? Measuring performance is always tricky, but I have run some benchmarks. The code for them is here.

The benchmark parses a very simple sequence of tokens into our two data structures. The first thing I tried is to run a cargo bench test with criterion. That did not work, because the standard::parser runs out of memory. The dod::parser runs just fine, showing that at least the DoD approach uses less memory.

The easiest way I know to display memory usage is valgrind:

> valgrind ./target/release/standard
...
==197038==   total heap usage: 46 allocs, 46 frees, 75,499,477 bytes allocated
...
> valgrind ./target/release/dod
...
==197006==   total heap usage: 99 allocs, 99 frees, 67,110,853 bytes allocated
...

For this example we're using roughly 8Mb, or 12% less memory. Nice!

In terms of runtime, we can fall back to hyperfine:

> hyperfine target/release/standard target/release/dod
Benchmark #1: target/release/standard
  Time (mean ± σ):      26.6 ms ±   1.6 ms    [User: 14.9 ms, System: 11.8 ms]
  Range (min … max):    24.1 ms …  33.8 ms    91 runs
 
Benchmark #2: target/release/dod
  Time (mean ± σ):      23.6 ms ±   1.1 ms    [User: 14.3 ms, System: 9.5 ms]
  Range (min … max):    21.3 ms …  27.8 ms    108 runs
 
Summary
  'target/release/dod' ran
    1.12 ± 0.09 times faster than 'target/release/standard'

Not a big difference here.

Compare with boxed

We're helped in the above benchmark by already using arena allocation in the starting code. So what if we had used the more standard Box<Def> instead of &'a Def, like so?

#[derive(Debug, Clone)]
pub enum Def<'a> {
    Type(TypeDef<'a>),
    Value(ValueDef<'a>),

    // was 
    // SpaceBefore(&'a Def<'a>, &'a [CommentOrNewline<'a>]),
    // SpaceAfter(&'a Def<'a>, &'a [CommentOrNewline<'a>]),
    SpaceBefore(Box<Def<'a>>, &'a [CommentOrNewline<'a>]),
    SpaceAfter(Box<Def<'a>>, &'a [CommentOrNewline<'a>]),
}

Using Box is significantly worse, because every type, value or space wrapper requires its own small allocation:

==196475==   total heap usage: 399,862 allocs, 399,862 frees, 71,516,421 bytes allocated

As a result it is roughly 70% slower:

> hyperfine target/release/standard target/release/boxed
Benchmark #1: target/release/standard
  Time (mean ± σ):      25.1 ms ±   1.8 ms    [User: 14.3 ms, System: 10.8 ms]
  Range (min … max):    23.0 ms …  35.4 ms    82 runs
 
Benchmark #2: target/release/boxed
  Time (mean ± σ):      42.5 ms ±   2.3 ms    [User: 28.3 ms, System: 14.2 ms]
  Range (min … max):    38.9 ms …  49.9 ms    70 runs
 
Summary
  'target/release/standard' ran
    1.69 ± 0.15 times faster than 'target/release/boxed'

Conclusion

We've seen that the ideas from data-oriented design can make code both faster and better. Like Andrew Kelly mentions in the talk, playing with these ideas gives me the feeling that my programming abilities are improving (again). DoD is not always worth it, it can create a bit of a mess of arrays and indices, and there is no safety around lifetimes and mutability. But it is an amazing tool to have in the toolbox.

Stay up-to-date

Stay up-to-date with our work and blog posts?

Related articles

The Dutch government offers the AHN [[1]](https://www.ahn.nl/) as a way to get information about the height of any specific place in the country. They offer this data by using a point cloud. That is, a large set of points with some additional meta information. With the current version of the AHN the resolution of the dataset is about eight points per square meter. This results in about 2.5TB of compressed data for the relatively small area of the Netherlands. While this is something that is not impossible to store locally, it does offer some challenges.

Asynchronous programming is pretty weird. While it is straightforward enough to understand in principle (write code that looks synchronous, but may be run concurrently yada yada yada), it is not so obvious how and when async functions actually perform work. This blog aims to shed light on how that works in Rust.
Ever wanted to have a quickly put together command-line tool to delete large chunks of your project automatically? Me neither, but my colleague Marc made a pretty convincing argument as to why such a tool could be useful. So we went ahead and made it. Here are the results.