Blog

algorithms

Implementing Lempel-Ziv Jaccard Distance (LZJD) in Rust

Henk Dieter
Henk Dieter

One of our clients helps companies in becoming GDPR-compliant. A goal is to recognize sensitive pieces of user data in a big pile of registrations, receipts, emails, and transcripts, and mark them to be checked out later. As more and more data is collected by companies, finding and eliminating sensitive data becomes harder and harder, to the point where it is no longer possible for mere human employees to keep up without assistance. A crate I’ve been working on, LZJD-rs, implements an algorithm that can be used to automate this process. It is my first serious attempt at an open source contribution in Rust, and an interesting one at that. LZJD To detect similar documents containing sensitive information, one can use the Lempel-Ziv Jaccard Distance or LZJD algorithm proposed by Edward Raff and Charles Nicholas 1]. LZJD is a means of calculating edit distance between binary sequences. It is based on [Normalized Compression Distance (NCD), which is used in malware classification software. NCD essentially is quite a simple algorithm. In NCD, two binary sequences are first compressed apart from each other. The total of the sizes of the compressed sequences is then compared to the size of the compression output of the two binary sequences concatenated. As compression eliminates duplications in sequences, the greater the difference between the compared sizes, the more the two sequences were alike. A significant portion of the work being done in NCD, like substituting parts of the sequence and writing the compressed file, can be eliminated by simply comparing the size of the substitution dictionaries. LZJD does exactly that, significantly improving the calculation speed. What’s more, LZJD only compares a small portion of the dictionaries in order to speed up the process even more. The steps taken by the algorithm in order to compare two binary sequences are roughly the following: Generate a Lempel-Ziv dictionary for each of the sequences; Generate a hash of each of the entries in both dictionaries; Sort the hashes and keep only the k=1024 smallest; Calculate the Jaccard distance between both of the lists of smallest hashes. This rather simple approach could enable scaling classification processes significantly [2]. A Rust implementation Raff and Nicholas provided a reference implementation, jLZJD, written in Java. jLZJD’s interface is a lot like that of the well-known classification tool sdhash. Given that our client has written their entire application in Rust, they asked us to create an open-source Rust implementation of LZJD: LZJD-rs. My initial implementation of LZJD-rs and jLZJD did not only differ in programming language. To optimize the architecture a bit, I introduced implementation differences as well. For instance, jLZJD uses a simplified hash set to store dictionary entry hashes in, called IntSetNoRemove. After processing the sequence, the entries in the hash set are sorted and truncated to the desired length for comparison. In contrast, LZJD-rs’ first version used a fixed-length vector in which the entries are inserted in such a way that the result is always an ordered list that can be compared immediately. To keep the entries sorted, a binary search is performed to find the spot in which a hash should be inserted. This way, no unnecessary allocations are done to retain data that will be thrown away later. pub fn frombytesstream(seqiter: I, buildhasher: &H, k: usize) -> Self where I: Iterator, H: BuildHasher, { let mut entries = Vec::with_capacity(k); let mut hasher = buildhasher.buildhasher(); seqiter.foreach(|byte| { // Update hash hasher.write_u8(byte); let hash = hasher.finish(); if let Err(insertat) = entries.binarysearch(&hash) { // If entries does not yet contain current hash if entries.len() (seqiter: I, buildhasher: &H) -> Self where I: Iterator, H: BuildHasher, { let mut dict = HashSet::new(); let mut hasher = buildhasher.buildhasher(); for byte in seq_iter { hasher.write_u8(byte); let hash = hasher.finish() as i32; if dict.insert(hash) { hasher = buildhasher.buildhasher(); } } let mut dict: Vec = dict.iter().cloned().collect(); dict.sort(); LZDict { entries: dict.iter().cloned().take(1000).collect() } } Putting it to the test One of the goals of Rust is optimizing efficiency. LZJD-rs being written in Rust, would it be faster than jLZJD? Well, let’s have a go at benchmarking both implementations. Initially, I created a basic test environment in which both implementations compare two 200MB random files, using a single thread to compute the LZ-dictionaries. LZJD-rs did it in a whopping 72 seconds on average on my machine. However, jLZJD, to my surprise, did it in about 38 seconds on average. That’s almost twice as fast! Not as easy as I thought it would be. To find out what was the cause of my implementation being slow, I used a profiling tool called Perf. Perf uses hardware and operating systems counters to mark so-called hot spots in compiled programs, so that optimization can be focused on the parts where the biggest potential gains lie. Below is a screenshot of the hottest spots found in LZJD-rs: The lines in white, starting with \_ZN, are the mangled names of the functions to which the instructions below them belong. Both red colored instructions are inside Rust’s std::collections::HashSet implementation. It would seem that mimicking jLZJD’s IntSetNoRemove in LZJS-rs or providing another more tailor-made HashSet implementation could benefit the performance LZJS-rs. Future work LZJD-rs is published on Crates.io and GitHub. I’ve been thinking about implementing a simple linear-probing hash set similar to IntSetNoRemove in order to improve performance. It would be a valuable learning experience if other developers propose optimizations. Create a pull request on GitHub and I’d be happy to review it! Citations \[1\] Raff, E., & Nicholas, C. K. (2017). An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. http://doi.org/10.1145/3097983.3098111 \[2\] LZJD was considered by the Dutch government for classification purposes as part of their Di-Stroy program, see https://www.linkedin.com/posts/activity-6547029450402476033-i3AJ/ (Dutch)

Our first Rust crate: decrypting ansible vaults

Wouter
Wouter

At Tweede golf I've been working with Rust a lot lately. My interest in Rust has been there for years, so I was very happy to start applying it in my working life, about a year ago. Since then I have worked with Rust both for our clients as well as employing it for our operations setup. I have also experimented with Rust for web \[1\]. Until now however we did not contribute to the Rust ecosystem. About time that we get our feet wet and publish our first crate! The use case: CI with Kubernetes Of course we like to continuously deploy our projects. For older projects we use Ansible \[2\] to directly deploy to virtual machines. Our newer projects however are deployed to a Kubernetes cluster hosted by Google \[3\]. In order to generate all the necessary Kubernetes Objects for our projects, we have created our own command line utility called Kuberwave. I will discuss this utility in a future blog post, but for now it suffices to say that it is written in Rust. In order to enable Continuous Deployment our CI-servers require: Credentials which grant access to the Kubernetes cluster. Credentials which grant access to a docker image registry (to enable kubernetes to pull new images). Credentials for resources like databases and third party API’s. We encrypt these secrets in so-called ansible vaults \[4\]. Ansible comes with a handy command line utility called ansible-vault to create, edit and rekey these vaults. Vaults are stored in the git repository of the project itself. As long as we only hand the ansible vault key to the people needing direct access to the kubernetes project namespace (for example, for maintenance), this setup is safe enough. Kuberwave also has the need to access these secrets. Because our staff is already comfortable with using these vaults, we’ve decided to also employ them for our Kubernetes setup. Decrypting ansible vaults For this I created ansible-vault-rs \[5\], a library that can decrypt ansible vaults. Note that it can not create or edit vaults, because I have no need (yet) for this functionality. $ANSIBLE_VAULT;1.1;AES256 30613330376362653133323866376665376365303734633836666136616531396361623761363061 3962356534626335323730643565346134663930363430320a313666376239373162636637613961 64366333313066646635343735626661366537666665653930646562643533363364363037653761 3730336235613732390a363166396664666262316632633536646332316334363361303030383036 3538 > A small ansible vault. The ansible-vault format is not pretty, or clearly documented \[6\]. I had to resort to reading the original source code \[7\] to understand how the format works. I have various problems with this format: It is purely an ascii-based file. Most of the file is comprised of base16 encoded text spaced nicely on 80 character long lines. Clearly this is not efficient, and would only be served by the fact that git diffs look nice. Except that the entire point of the file format is that the contents are encrypted, and thus look like gibberish. The file is wrapped in two layers of base16 encoded text. It becomes quite complicated to implement a streaming reader for this format. Unfortunately due to how an hmac is used, we are only able to verify the passphrase after we have read the entire file. Thus we first read the entire file into memory, verify the HMAC and only then yield the file as a complete byte buffer. The format would benefit from per-block HMAC verification. Our implementation is thus not able to stream the file. For our use-case the vaults are rarely more than a few kilobytes in size. Also we will only parse a handful of files in a run. Most of the resources go into computing the derived key of a vault using PBKDF2. For us this is not a problem, but it might be to others wanting to decrypt large files or a large number of files. If I ever have to store and encrypt large files in a git repository (probably not, due to the nature of git repositories) we will consider moving to a different file format. For now Ansible vault serves our purpose just fine. Publishing the crate Initially the implementation was written as part of Kuberwave, but as an exercise I decided to publish it to crates.io. I added more explicit error handling (as per \[8\]) and added a few tests. When creating a separate cargo.toml file and re-adding the specific dependencies, I noticed I was using a deprecated package: rust-crypto. \[9\] This package has not been updated since May 2016. For a crypto library this is quite dangerous. Cargo would benefit greatly from explicitly deprecating this package, as already suggested in an issue \[10\]. I re-implemented the crypto to use the RustCrypto \[11\] family of crates. The only thing still missing is more extensive documentation. I will probably also implement creating vaults. I am already working on a tool to automagically generate the Kuberwave configuration for simple projects, which will (probably) need this. Crates are easy Publishing the crate to crates.io was easy a pie \[12\]. I only had to register to crates.io as a publisher, add some fields to the cargo.toml file, and run cargo publish. Even though this crate will probably not be used widely and isn't all that exciting, it does what it is meant to do, and I hope some of you can benefit from it. For me it has shown Rust's package manager is easy as well as powerful. The next time I write something in Rust that is general enough to open source it, we will do so. \[1\] https://tweedegolf.nl/blog/23/rust-als-webplatform (Dutch) \[2\] https://github.com/ansible/ansible \[3\] https://cloud.google.com/kubernetes-engine/ \[4\] https://docs.ansible.com/ansible/2.4/vault.html \[5\] https://crates.io/crates/ansible-vault \[6\] https://docs.ansible.com/ansible/2.4/vault.html#vault-payload-format-1-1 \[7\] https://github.com/ansible/ansible/blob/devel/lib/ansible/parsing/vault/init.py \[8\] https://doc.rust-lang.org/rust-by-example/error/multipleerrortypes/defineerrortype.html \[9\] https://crates.io/crates/rust-crypto \[10\] https://github.com/DaGenix/rust-crypto/issues/440 \[11\] https://github.com/RustCrypto \[12\] https://doc.rust-lang.org/cargo/reference/publishing.html