Cloud storage simplification and abstraction for Node.js

The API documentation of cloud storage providers can be quite intimidating. If you are simply looking for a few straightforward storage actions, these extensive APIs might seem a bit overkill. Another hurdle is that storage providers each define their own distinctive APIs. That is why we've decided to write an abstraction layer for the APIs of the most-used storage providers that exposes only a limited set of common storage actions. The keywords are abstraction and simplification, as mentioned in the title.

A little history

The first completely web-based commercial cloud storage was "PersonaLink Service" by AT&T in 1994 [1], but the use of cloud storage really took off after Amazon launched S3 in 2006. Today, S3 is still one of the most popular cloud storage services. Because Amazon S3 was the first cloud service that got widely adopted, other cloud services implemented the S3 API in their services to be compatible with S3 and make it easier for customers to hop over to their services. As a consequence, the S3 API has more or less become the standard for cloud services.

Amazon S3 and other services

There are cloud services that use a different API, such as Google Cloud, Microsoft Azure and Backblaze B2, but they also provide tools for easy migration from S3. Google offers tools [2] that are currently only available for Java and Python, and Backblaze has recently added an extra S3-compliant API to their B2 service [3]. We can discern a tendency of convergence towards S3, and while this may or may not be good news, at least it unifies the APIs of some storage providers. However, S3 is still a rather comprehensive API.

Testing with local storage

In addition to the reasons mentioned above (abstraction and simplification), we wanted a tool that makes it possible to use storage on a local disk and storage in the cloud interchangeably.
During development, we often use local disk storage to save network calls, but we want to be able to seamlessly switch to a cloud service on the production server.

How it works

Our simplified bare-necessity API is defined in a thin wrapper that forwards all API calls to adapters. There is an adapter for every supported storage type (Amazon, Google, Backblaze and local disk), and based on the configuration the appropriate adapter will be used. The adapters translate the generic API calls to storage-specific calls. To make it transparent to the wrapper which adapter is used (necessary for friction-free switching between storage types), the adapters also implement the API methods. This way an API call can be forwarded one-to-one to an adapter, for instance:

```typescript
async listFiles(numFiles?: number): Promise<[string, number][]> {
  return this.adapter.listFiles(numFiles);
}
```

Another result of this setup is that new adapters can be added very easily, both by ourselves and by others.

Some code examples

You start by creating an instance of Storage. The constructor requires a configuration object or a configuration URL. Both configuration options have their pros and cons; it is also a matter of taste:

```typescript
const config = {
  type: StorageType.LOCAL,
  directory: "path/to/folder/bucket",
  mode: "750",
};
const s = new Storage(config);

// or

const url = "local://path/to/folder/bucket?mode=750";
const s = new Storage(url);
```

Once you have created an instance, you can call all API methods directly on this instance.
Note that most methods return promises:

```typescript
const files = await s.listFiles();
```

There are also some handy introspective methods:

```typescript
s.getType(); // local, gcs, s3 or b2
s.getConfiguration(); // shows the configuration as
                      // provided during instantiation
```

And you can even switch between adapters at runtime; this feature is used in the example application:

```typescript
// local storage
const urlLocal = "local://path/to/folder/bucket?mode=750";

// Google Cloud storage
const urlGoogle = "gcs://path/to/keyFile.json:projectId@bucketName";

// connect to local storage
const s = new Storage(urlLocal);

// switch to Google Cloud storage
s.switchStorage(urlGoogle);
```

Further reading

The code and extensive documentation are available on GitHub. Please don't hesitate to drop us a line should you have any questions.

Links to materials used or mentioned in this blog post:

[1] Wikipedia about cloud storage
[2] Google Storage migration from S3
[3] Backblaze announcing S3 support

Microsoft announcing S3 support
Microsoft Azure and Minio server
Using Minio to make Google Storage S3 compliant
Painters and clouds

Painting: Jean-Michel Cels (1819–1894), Cloud Study, ca. 1838–42, oil on cardboard, Thaw Collection.

Implementing Lempel-Ziv Jaccard Distance (LZJD) in Rust

Henk Dieter

One of our clients helps companies become GDPR-compliant. A goal is to recognize sensitive pieces of user data in a big pile of registrations, receipts, emails, and transcripts, and mark them to be checked out later. As companies collect more and more data, finding and eliminating sensitive data becomes harder and harder, to the point where it is no longer possible for mere human employees to keep up without assistance. A crate I've been working on, LZJD-rs, implements an algorithm that can be used to automate this process. It is my first serious attempt at an open source contribution in Rust, and an interesting one at that.

LZJD

To detect similar documents containing sensitive information, one can use the Lempel-Ziv Jaccard Distance (LZJD) algorithm proposed by Edward Raff and Charles Nicholas [1]. LZJD is a means of calculating the edit distance between binary sequences. It is based on Normalized Compression Distance (NCD), which is used in malware classification software.

NCD is essentially quite a simple algorithm. In NCD, two binary sequences are first compressed separately. The total of the sizes of the compressed sequences is then compared to the size of the compression output of the two binary sequences concatenated. As compression eliminates duplications in sequences, the greater the difference between the compared sizes, the more the two sequences are alike. A significant portion of the work being done in NCD, like substituting parts of the sequence and writing the compressed file, can be eliminated by simply comparing the sizes of the substitution dictionaries. LZJD does exactly that, significantly improving the calculation speed. What's more, LZJD only compares a small portion of the dictionaries in order to speed up the process even more.
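The size comparison described above is usually expressed in normalized form. As a sketch of that idea (my own illustration in Rust, not part of LZJD-rs; the compressor itself is left out and only the compressed sizes C(x), C(y) and C(xy) are passed in):

```rust
/// Normalized Compression Distance, computed from the compressed size
/// of sequence x, of sequence y, and of their concatenation xy.
/// The closer to 0.0, the more alike the two sequences are.
fn ncd(c_x: usize, c_y: usize, c_xy: usize) -> f64 {
    (c_xy - c_x.min(c_y)) as f64 / c_x.max(c_y) as f64
}
```

For two identical sequences a good compressor yields C(xy) ≈ C(x), so the distance approaches 0; for unrelated sequences C(xy) ≈ C(x) + C(y), pushing the distance towards 1.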
The steps taken by the algorithm in order to compare two binary sequences are roughly the following:

1. Generate a Lempel-Ziv dictionary for each of the sequences;
2. Generate a hash of each of the entries in both dictionaries;
3. Sort the hashes and keep only the k=1024 smallest;
4. Calculate the Jaccard distance between both of the lists of smallest hashes.

This rather simple approach could enable scaling classification processes significantly [2].

A Rust implementation

Raff and Nicholas provided a reference implementation, jLZJD, written in Java. jLZJD's interface is a lot like that of the well-known classification tool sdhash. Given that our client has written their entire application in Rust, they asked us to create an open-source Rust implementation of LZJD: LZJD-rs.

My initial implementation of LZJD-rs and jLZJD did not only differ in programming language. To optimize the architecture a bit, I introduced implementation differences as well. For instance, jLZJD uses a simplified hash set called IntSetNoRemove to store dictionary entry hashes in. After processing the sequence, the entries in the hash set are sorted and truncated to the desired length for comparison. In contrast, the first version of LZJD-rs used a fixed-length vector in which the entries are inserted in such a way that the result is always an ordered list that can be compared immediately. To keep the entries sorted, a binary search is performed to find the spot at which a hash should be inserted. This way, no unnecessary allocations are done to retain data that will be thrown away later.
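The final step of the algorithm, the Jaccard distance itself, is cheap once both lists of smallest hashes are sorted: a single merge pass counts the shared hashes. A sketch (the function name is illustrative, not the LZJD-rs API):

```rust
/// Jaccard distance between two sorted, deduplicated hash lists:
/// 1 - |A ∩ B| / |A ∪ B|. Returns 0.0 for identical lists.
fn jaccard_distance(a: &[u64], b: &[u64]) -> f64 {
    let (mut i, mut j, mut shared) = (0, 0, 0usize);
    // Merge pass: advance the side with the smaller hash,
    // counting hashes present in both lists.
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                shared += 1;
                i += 1;
                j += 1;
            }
        }
    }
    let union = a.len() + b.len() - shared;
    1.0 - shared as f64 / union as f64
}
```

Because only the k smallest hashes per sequence are kept, this comparison runs in O(k) regardless of the input sizes.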
```rust
pub fn from_bytes_stream<I, H>(seq_iter: I, build_hasher: &H, k: usize) -> Self
where
    I: Iterator<Item = u8>,
    H: BuildHasher,
{
    let mut entries = Vec::with_capacity(k);
    let mut hasher = build_hasher.build_hasher();
    seq_iter.for_each(|byte| {
        // Update hash
        hasher.write_u8(byte);
        let hash = hasher.finish();
        if let Err(insert_at) = entries.binary_search(&hash) {
            // If entries does not yet contain the current hash,
            // insert it at the position that keeps the list sorted,
            // retaining only the k smallest hashes
            if entries.len() < k {
                entries.insert(insert_at, hash);
            } else if insert_at < k {
                entries.pop();
                entries.insert(insert_at, hash);
            }
            // A new dictionary entry was found, so start a fresh hash
            hasher = build_hasher.build_hasher();
        }
    });
    LZDict { entries }
}
```

The variant below instead collects the hashes in a standard HashSet and sorts and truncates the result afterwards, much like jLZJD does:

```rust
pub fn from_bytes_stream<I, H>(seq_iter: I, build_hasher: &H) -> Self
where
    I: Iterator<Item = u8>,
    H: BuildHasher,
{
    let mut dict = HashSet::new();
    let mut hasher = build_hasher.build_hasher();
    for byte in seq_iter {
        hasher.write_u8(byte);
        let hash = hasher.finish() as i32;
        if dict.insert(hash) {
            hasher = build_hasher.build_hasher();
        }
    }
    let mut dict: Vec<i32> = dict.iter().cloned().collect();
    dict.sort();
    LZDict {
        entries: dict.iter().cloned().take(1000).collect(),
    }
}
```

Putting it to the test

One of the goals of Rust is optimizing efficiency. LZJD-rs being written in Rust, would it be faster than jLZJD? Well, let's have a go at benchmarking both implementations. Initially, I created a basic test environment in which both implementations compare two 200MB random files, using a single thread to compute the LZ dictionaries. LZJD-rs did it in a whopping 72 seconds on average on my machine. However, jLZJD, to my surprise, did it in about 38 seconds on average. That's almost twice as fast! Not as easy as I thought it would be.

To find out what caused my implementation to be slow, I used a profiling tool called Perf. Perf uses hardware and operating system counters to mark so-called hot spots in compiled programs, so that optimization can be focused on the parts where the biggest potential gains lie. Below is a screenshot of the hottest spots found in LZJD-rs:

The lines in white, starting with _ZN, are the mangled names of the functions to which the instructions below them belong. Both red-colored instructions are inside Rust's std::collections::HashSet implementation.
It would seem that mimicking jLZJD's IntSetNoRemove in LZJD-rs, or providing another, more tailor-made HashSet implementation, could benefit the performance of LZJD-rs.

Future work

LZJD-rs is published on crates.io and GitHub. I've been thinking about implementing a simple linear-probing hash set similar to IntSetNoRemove in order to improve performance. It would be a valuable learning experience if other developers propose optimizations. Create a pull request on GitHub and I'd be happy to review it!

Citations

[1] Raff, E., & Nicholas, C. K. (2017). An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[2] LZJD was considered by the Dutch government for classification purposes as part of their Di-Stroy program, see (Dutch)

Our first Rust crate: decrypting ansible vaults

At Tweede golf I've been working with Rust a lot lately. My interest in Rust has been there for years, so I was very happy to start applying it in my working life, about a year ago. Since then I have worked with Rust both for our clients and for our operations setup. I have also experimented with Rust for the web [1]. Until now, however, we did not contribute to the Rust ecosystem. About time that we get our feet wet and publish our first crate!

The use case: CI with Kubernetes

Of course we like to continuously deploy our projects. For older projects we use Ansible [2] to deploy directly to virtual machines. Our newer projects, however, are deployed to a Kubernetes cluster hosted by Google [3]. In order to generate all the necessary Kubernetes objects for our projects, we have created our own command line utility called Kuberwave. I will discuss this utility in a future blog post, but for now it suffices to say that it is written in Rust.

In order to enable continuous deployment, our CI servers require:

- Credentials that grant access to the Kubernetes cluster.
- Credentials that grant access to a Docker image registry (to enable Kubernetes to pull new images).
- Credentials for resources like databases and third-party APIs.

We encrypt these secrets in so-called Ansible vaults [4]. Ansible comes with a handy command line utility called ansible-vault to create, edit and rekey these vaults. Vaults are stored in the git repository of the project itself. As long as we only hand the Ansible vault key to the people needing direct access to the Kubernetes project namespace (for example, for maintenance), this setup is safe enough.

Kuberwave also needs access to these secrets. Because our staff is already comfortable with using these vaults, we've decided to also employ them for our Kubernetes setup.

Decrypting ansible vaults

For this I created ansible-vault-rs [5], a library that can decrypt Ansible vaults.
Note that it cannot create or edit vaults, because I have no need (yet) for this functionality.

```
$ANSIBLE_VAULT;1.1;AES256
30613330376362653133323866376665376365303734633836666136616531396361623761363061
3962356534626335323730643565346134663930363430320a313666376239373162636637613961
64366333313066646635343735626661366537666665653930646562643533363364363037653761
3730336235613732390a363166396664666262316632633536646332316334363361303030383036
3538
```

> A small ansible vault.

The ansible-vault format is not pretty, nor clearly documented [6]. I had to resort to reading the original source code [7] to understand how the format works. I have various problems with this format:

- It is a purely ASCII-based file. Most of the file consists of base16-encoded text spaced nicely on 80-character lines. Clearly this is not efficient, and would only be justified by the fact that git diffs look nice. Except that the entire point of the file format is that the contents are encrypted, and thus look like gibberish.
- The file is wrapped in two layers of base16-encoded text, which makes it quite complicated to implement a streaming reader for this format.
- Unfortunately, due to how the HMAC is used, we are only able to verify the passphrase after we have read the entire file. Thus we first read the entire file into memory, verify the HMAC and only then yield the file as a complete byte buffer. The format would benefit from per-block HMAC verification.

Our implementation is thus not able to stream the file. For our use case the vaults are rarely more than a few kilobytes in size, and we only parse a handful of files in a run. Most of the resources go into computing the derived key of a vault using PBKDF2. For us this is not a problem, but it might be for others wanting to decrypt large files or a large number of files.
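To illustrate the two base16 layers, here is a sketch of just the unwrapping step, without the key derivation, HMAC verification or AES decryption (the function names are mine, not the ansible-vault-rs API). After the header line, the body decodes to three hex strings on separate lines: the salt, the HMAC and the ciphertext.

```rust
// Decode a base16 ("hex") string into bytes, ignoring line breaks.
fn hex_decode(s: &str) -> Option<Vec<u8>> {
    let s: String = s.chars().filter(|c| !c.is_whitespace()).collect();
    if s.len() % 2 != 0 {
        return None;
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
        .collect()
}

/// Unwraps a 1.1 vault into (salt, hmac, ciphertext) byte buffers.
fn unwrap_vault(text: &str) -> Option<(Vec<u8>, Vec<u8>, Vec<u8>)> {
    let mut lines = text.lines();
    let header = lines.next()?;
    if !header.starts_with("$ANSIBLE_VAULT;1.1;AES256") {
        return None;
    }
    // First layer: the remaining lines form one long hex string.
    let outer: String = lines.collect();
    let inner = hex_decode(&outer)?;
    let inner = String::from_utf8(inner).ok()?;
    // Second layer: salt, HMAC and ciphertext, each hex-encoded
    // on its own line.
    let mut parts = inner.split('\n').map(hex_decode);
    Some((parts.next()??, parts.next()??, parts.next()??))
}
```

Note that this sketch still has to materialize the whole body before the inner layer can be split, which is exactly the streaming problem described above.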
If I ever have to store and encrypt large files in a git repository (probably not, due to the nature of git repositories), we will consider moving to a different file format. For now, Ansible vault serves our purpose just fine.

Publishing the crate

Initially the implementation was written as part of Kuberwave, but as an exercise I decided to publish it to crates.io. I added more explicit error handling (as per [8]) and added a few tests. When creating a separate Cargo.toml file and re-adding the specific dependencies, I noticed I was using a deprecated package: rust-crypto [9]. This package has not been updated since May 2016. For a crypto library this is quite dangerous. Cargo would benefit greatly from explicitly deprecating this package, as already suggested in an issue [10]. I re-implemented the crypto to use the RustCrypto [11] family of crates.

The only thing still missing is more extensive documentation. I will probably also implement creating vaults. I am already working on a tool to automagically generate the Kuberwave configuration for simple projects, which will (probably) need this.

Crates are easy

Publishing the crate to crates.io was easy as pie [12]. I only had to register on crates.io as a publisher, add some fields to the Cargo.toml file, and run cargo publish. Even though this crate will probably not be used widely and isn't all that exciting, it does what it is meant to do, and I hope some of you can benefit from it. For me it has shown that Rust's package manager is easy as well as powerful. The next time I write something in Rust that is general enough to open source, we will do so.

[1] (Dutch)
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]