Implementing Lempel-Ziv Jaccard Distance (LZJD) in Rust

Henk
Embedded software engineer

One of our clients helps companies become GDPR-compliant. One goal is to recognize sensitive pieces of user data in a big pile of registrations, receipts, emails, and transcripts, and flag them for later review. As companies collect more and more data, finding and eliminating sensitive data becomes ever harder, to the point where it is no longer possible for mere human employees to keep up without assistance.

A crate I’ve been working on, LZJD-rs, implements an algorithm that can be used to automate this process. It is my first serious attempt at an open-source contribution in Rust, and an interesting one at that.

LZJD

To detect similar documents containing sensitive information, one can use the Lempel-Ziv Jaccard Distance (LZJD) algorithm proposed by Edward Raff and Charles Nicholas [1]. LZJD is a means of calculating the distance between binary sequences. It is based on Normalized Compression Distance (NCD), which is used in malware classification software. NCD is essentially quite a simple algorithm: two binary sequences are first compressed separately, and the sum of the sizes of the compressed sequences is compared to the size of the compression output of the two sequences concatenated. As compression eliminates duplications within a sequence, the greater the difference between the compared sizes, the more alike the two sequences are.
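
To make this concrete: NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(·) is the compressed size. Below is a minimal sketch of NCD in Rust; my choice of the flate2 crate as the compressor is an assumption, any compressor will do:

use flate2::write::ZlibEncoder;
use flate2::Compression;
use std::io::Write;

/// Compressed size of `data` in bytes.
fn compressed_len(data: &[u8]) -> usize {
    let mut encoder = ZlibEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(data).unwrap();
    encoder.finish().unwrap().len()
}

/// Normalized Compression Distance: small for similar sequences.
fn ncd(x: &[u8], y: &[u8]) -> f64 {
    let cx = compressed_len(x) as f64;
    let cy = compressed_len(y) as f64;
    let cxy = compressed_len(&[x, y].concat()) as f64;
    (cxy - cx.min(cy)) / cx.max(cy)
}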

A significant portion of the work done in NCD, like substituting parts of the sequence and writing the compressed output, can be eliminated by simply comparing the substitution dictionaries themselves. LZJD does exactly that, significantly improving calculation speed. What’s more, LZJD compares only a small portion of the dictionaries to speed up the process even more. The steps the algorithm takes to compare two binary sequences are roughly the following:

  1. Generate a Lempel-Ziv dictionary for each of the sequences;
  2. Generate a hash of each of the entries in both dictionaries;
  3. Sort the hashes and keep only the k=1024 smallest;
  4. Calculate the Jaccard distance between the two lists of smallest hashes (see the sketch below).
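
Here is a minimal sketch of step 4 in Rust, assuming each dictionary’s k smallest hashes are kept in a sorted, deduplicated list:

/// Jaccard distance between two sorted, deduplicated hash lists:
/// 1 - |A ∩ B| / |A ∪ B|, computed with a single merge pass.
fn jaccard_distance<T: Ord>(a: &[T], b: &[T]) -> f64 {
    let (mut i, mut j, mut intersection) = (0, 0, 0usize);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Equal => {
                intersection += 1;
                i += 1;
                j += 1;
            }
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
        }
    }
    // |A ∪ B| = |A| + |B| - |A ∩ B|
    let union = a.len() + b.len() - intersection;
    if union == 0 {
        return 0.0; // both lists empty: define the distance as zero
    }
    1.0 - intersection as f64 / union as f64
}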

This rather simple approach could enable classification processes to scale significantly [2].

A Rust implementation

Raff and Nicholas provided a reference implementation, jLZJD, written in Java. jLZJD’s interface is a lot like that of the well-known classification tool sdhash. Given that our client has written their entire application in Rust, they asked us to create an open-source Rust implementation of LZJD: LZJD-rs.

My initial implementation of LZJD-rs differed from jLZJD in more than just programming language. To optimize the architecture a bit, I introduced some implementation differences as well. For instance, jLZJD uses a simplified hash set, called IntSetNoRemove, to store dictionary entry hashes in. After processing the sequence, the entries in the hash set are sorted and truncated to the desired length for comparison.

In contrast, the first version of LZJD-rs used a fixed-length vector into which entries are inserted such that the result is always an ordered list that can be compared immediately. To keep the entries sorted, a binary search finds the spot at which a hash should be inserted. This way, no unnecessary allocations are made to retain data that will be thrown away later.

// Needs `use std::hash::{BuildHasher, Hasher};` in scope.
pub fn from_bytes_stream<I, H>(seq_iter: I, build_hasher: &H, k: usize) -> Self
where
    I: Iterator<Item = u8>,
    H: BuildHasher,
{
    let mut entries = Vec::with_capacity(k);
    let mut hasher = build_hasher.build_hasher();

    seq_iter.for_each(|byte| {
        // Update hash
        hasher.write_u8(byte);
        let hash = hasher.finish();

        if let Err(insert_at) = entries.binary_search(&hash) {
            // If entries does not yet contain current hash
            if entries.len() < k {
                // There's room for another hash without reallocating
                entries.insert(insert_at, hash); // Insert current hash
                hasher = build_hasher.build_hasher(); // Reset hasher
            } else if hash < *entries.last().unwrap() {
                // Current hash is smaller than largest in entries
                entries.pop(); // Remove greatest hash

                entries.insert(insert_at, hash); // Insert current hash
                hasher = build_hasher.build_hasher(); // Reset hasher
            }
        }
        // else it's already in there and we can go on
    });

    LZDict { entries }
}

Quite proud of the optimization I had added, I asked Edward Raff to review my code. He responded with an observation: alas, my optimization was incorrect. Truncating the dictionary while processing makes the algorithm forget sequences it has already seen, so they are treated as new even when they are not. Thank you for the feedback, Edward!

My colleague Ruben and I started working on a new implementation, more similar to jLZJD. Like jLZJD, it internally uses a hash set to store dictionary entry hashes: std::collections::HashSet. After processing the sequence, the entries in the hash set are sorted and truncated to the desired length for comparison. jLZJD uses the MurmurHash3 algorithm to hash dictionary entries, whereas the LZJD-rs library is designed to work with any hashing algorithm. The binary, however, uses MurmurHash3, just like jLZJD.

// Needs `use std::collections::HashSet;` and
// `use std::hash::{BuildHasher, Hasher};` in scope.
pub fn from_bytes_stream<I, H>(seq_iter: I, build_hasher: &H) -> Self
where
    I: Iterator<Item = u8>,
    H: BuildHasher,
{
    let mut dict = HashSet::new();
    let mut hasher = build_hasher.build_hasher();

    for byte in seq_iter {
        // Extend the current subsequence by one byte
        hasher.write_u8(byte);
        let hash = hasher.finish() as i32;
        if dict.insert(hash) {
            // New dictionary entry: start a fresh subsequence
            hasher = build_hasher.build_hasher();
        }
    }

    // Sort all entry hashes and keep only the k = 1024 smallest
    let mut entries: Vec<_> = dict.into_iter().collect();
    entries.sort_unstable();
    entries.truncate(1024);

    LZDict { entries }
}

Putting it to the test

One of Rust’s design goals is efficiency. LZJD-rs being written in Rust, would it be faster than jLZJD? Well, let’s have a go at benchmarking both implementations. Initially, I created a basic test environment in which both implementations compare two 200 MB random files, using a single thread to compute the LZ dictionaries. LZJD-rs did it in a whopping 72 seconds on average on my machine. jLZJD, however, to my surprise, did it in about 38 seconds on average. That’s almost twice as fast!
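
For reference, this is roughly what such a harness could look like. The file names are made up, RandomState stands in here for a MurmurHash3 BuildHasher, and jaccard_distance is the helper sketched earlier, assuming the dictionary entries are accessible:

use std::collections::hash_map::RandomState;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // Two 200 MB files of random bytes (hypothetical names)
    let a = std::fs::read("random_a.bin")?;
    let b = std::fs::read("random_b.bin")?;

    let build_hasher = RandomState::new(); // stand-in for MurmurHash3

    let start = Instant::now();
    let dict_a = LZDict::from_bytes_stream(a.into_iter(), &build_hasher);
    let dict_b = LZDict::from_bytes_stream(b.into_iter(), &build_hasher);
    println!("distance: {}", jaccard_distance(&dict_a.entries, &dict_b.entries));
    println!("elapsed:  {:?}", start.elapsed());
    Ok(())
}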

Not as easy as I thought it would be. To find the cause of my implementation’s slowness, I used a profiling tool called perf. Perf uses hardware and operating-system counters to mark so-called hot spots in compiled programs, so that optimization effort can be focused on the parts with the biggest potential gains. Below is a screenshot of the hottest spots found in LZJD-rs:

The lines in white, starting with _ZN, are the mangled names of the functions to which the instructions below them belong. Both red-colored instructions are inside Rust’s std::collections::HashSet implementation. It would seem that mimicking jLZJD’s IntSetNoRemove in LZJD-rs, or providing another, more tailor-made hash set implementation, could benefit the performance of LZJD-rs.

Future work

LZJD-rs is published on Crates.io and GitHub. I’ve been thinking about implementing a simple linear-probing hash set, similar to IntSetNoRemove, to improve performance; a rough sketch of what I have in mind is below.
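
A minimal, untested sketch of such an insert-only linear-probing set for i32 hashes, along the lines of jLZJD’s IntSetNoRemove; the struct name and the multiplicative hash constant are illustrative:

/// Insert-only hash set for i32 hashes using linear probing.
pub struct IntSetNoRemove {
    slots: Vec<Option<i32>>, // capacity is always a power of two
    len: usize,
}

impl IntSetNoRemove {
    pub fn with_capacity(capacity: usize) -> Self {
        IntSetNoRemove {
            slots: vec![None; capacity.next_power_of_two()],
            len: 0,
        }
    }

    /// Returns true if the value was newly inserted.
    pub fn insert(&mut self, value: i32) -> bool {
        if self.len * 2 >= self.slots.len() {
            self.grow(); // keep the load factor below 1/2
        }
        let mask = self.slots.len() - 1;
        // Cheap multiplicative hash; the constant is arbitrary
        let mut idx = (value as usize).wrapping_mul(0x9E37_79B9) & mask;
        loop {
            match self.slots[idx] {
                None => {
                    self.slots[idx] = Some(value);
                    self.len += 1;
                    return true;
                }
                Some(v) if v == value => return false, // already present
                _ => idx = (idx + 1) & mask,           // probe the next slot
            }
        }
    }

    /// Double the capacity and re-insert all existing entries.
    fn grow(&mut self) {
        let new_cap = self.slots.len() * 2;
        let old = std::mem::replace(&mut self.slots, vec![None; new_cap]);
        self.len = 0;
        for v in old.into_iter().flatten() {
            self.insert(v);
        }
    }
}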

It would be a valuable learning experience if other developers proposed optimizations. Create a pull request on GitHub and I’d be happy to review it!

Citations

[1] Raff, E., & Nicholas, C. K. (2017). An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. http://doi.org/10.1145/3097983.3098111

[2] LZJD was considered by the Dutch government for classification purposes as part of their Di-Stroy program, see https://www.linkedin.com/posts/activity-6547029450402476033-i3AJ/ (Dutch)
