BonsaiDb April Update: Dogfooding and Fixing Bugs

Written by Jonathan Johnson. Published 2022-05-02.

What is BonsaiDb?

BonsaiDb is a new database aiming to be the most developer-friendly Rust database. BonsaiDb has a unique feature set geared toward solving many common data problems. We have a page dedicated to answering the question: What is BonsaiDb?

This month has been a journey of interesting projects that I have been itching to write about. The primary focus of the month was to work on a new project to use and test BonsaiDb. Before I dive into that topic, I want to focus on a bug fix for Nebari that arose from testing this new project.

Nebari 0.5.2 Update

Yesterday, I published an update to Nebari which contains bug fixes that prevent potential corruption of databases. For users of BonsaiDb, the view indexer is the only way that this bug could arise, from my understanding of the bugs discovered.

The problem with this bug is that it's a silent bug until a compact operation occurs. There are various debug_assert!() statements littered in the code that assert the internal B+ tree order is correct. These are expensive to perform, so they are only enabled in debug builds. Users who are running BonsaiDb or Nebari with release builds may not notice the bug, and the database may return inconsistent results.
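As a rough illustration of why such a bug can stay silent, here is a minimal sketch of an ordering check guarded by debug_assert!(). The Node type and is_strictly_sorted helper are hypothetical stand-ins invented for this example, not Nebari's actual types.

```rust
/// Returns true when every key is strictly greater than the one before it.
fn is_strictly_sorted(keys: &[u16]) -> bool {
    keys.windows(2).all(|pair| pair[0] < pair[1])
}

/// Hypothetical stand-in for a tree node's key list; not Nebari's real type.
struct Node {
    keys: Vec<u16>,
}

impl Node {
    /// Inserts `key`, keeping `keys` sorted and free of duplicates.
    fn insert(&mut self, key: u16) {
        if let Err(index) = self.keys.binary_search(&key) {
            self.keys.insert(index, key);
        }
        // This O(n) scan is compiled out of release builds, so a broken
        // invariant would go unnoticed there.
        debug_assert!(is_strictly_sorted(&self.keys), "node keys out of order");
    }
}
```

In a debug build a violated invariant panics right at the faulty operation, while a release build skips the check and carries on with whatever order the keys ended up in.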

The only way I was able to reproduce the bug involved many multi-key Modifications that exercised an edge case in the tree balancing algorithm. The edge case only occurred when Nebari attempted to avoid splitting a node by moving entries to both the previous and next nodes during the same operation. This is impossible on small trees, and it cannot be triggered by operations that touch only a few keys each or by a workflow that rarely deletes data.

Although I feel fairly confident users haven't encountered this yet, I would rest easier knowing users have run cargo update to get the latest version.

Fuzzing Nebari

After I fixed the bug in Nebari, I proceeded to try to get a fuzzing suite set up. For those unfamiliar, fuzzing is an algorithmic approach to automated testing where a "fuzzer" repeatedly generates parameters for a test in an attempt to discover bugs. The test I wrote was essentially a dumb fuzzer -- the outer loop could be changed to loop forever.

The advantage of my hand-written test is that it was able to narrow down the failure case quickly, as I knew some of the constraints of the inputs that produced the bug I encountered. The disadvantage is that it only would find a specific pattern of failure, as my inputs were not completely random.

I ported the same approach to a fuzzer. The fuzzer is written to take an arbitrary Vec<BTreeSet<u16>>. For each BTreeSet, a TreeFile is modified using a CompareSwap that inserts the key if a value isn't already stored or removes the key if a value is already stored. The fuzzer...

xkcd #1745 "Record Scratch"1

Hey, who would have guessed it? Fuzzing works! As I was writing the last paragraph, the fuzzer found that my fix in v0.5.1 was incomplete. I couldn't bring myself to continue writing the post and shifted to working on a fix. I held off on releasing v0.5.2 until last night, which allowed me to continue running fuzzing for 6 more hours after fixing the bug. It's been running continuously overnight and up to the point of publishing this post without further incident.

The fuzzer tries over and over to break my test by sending in different values for the Vec<BTreeSet<u16>>. Once it discovers a failure, it saves the problematic input so that it can be repeated on-demand.
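To make the shape of that test concrete, here is a std-only sketch of the toggling logic. It is a model rather than the real fuzz target: an in-memory BTreeMap stands in for Nebari's TreeFile, and the function names apply_batches, expected_keys, and check are invented for this example.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Applies each batch to `tree`: a key that is absent gets inserted, and a
/// key that is present gets removed. `BTreeMap` is only a stand-in for the
/// on-disk tree, which this sketch does not model.
fn apply_batches(tree: &mut BTreeMap<u16, u16>, batches: &[BTreeSet<u16>]) {
    for batch in batches {
        for &key in batch {
            if tree.remove(&key).is_none() {
                tree.insert(key, key);
            }
        }
    }
}

/// Computes the expected keys independently: a key survives only if it was
/// toggled an odd number of times across all batches.
fn expected_keys(batches: &[BTreeSet<u16>]) -> BTreeSet<u16> {
    let mut counts: BTreeMap<u16, usize> = BTreeMap::new();
    for batch in batches {
        for &key in batch {
            *counts.entry(key).or_insert(0) += 1;
        }
    }
    counts
        .into_iter()
        .filter(|&(_, count)| count % 2 == 1)
        .map(|(key, _)| key)
        .collect()
}

/// The property a fuzzer would check for every generated input.
fn check(batches: &[BTreeSet<u16>]) -> bool {
    let mut tree = BTreeMap::new();
    apply_batches(&mut tree, batches);
    tree.keys().copied().collect::<BTreeSet<u16>>() == expected_keys(batches)
}
```

A fuzzer would call something like check with each generated input and treat a false result (or a panic) as a failure worth saving.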

But that's only part of what fuzzing can do for us. By running an "input minimization" operation on the crashing input, the fuzzer will try to find the smallest input that can reproduce the failure. It saves each of the discovered failure cases so they can be tested independently if desired. Once it can no longer find a smaller input, it reports the minimized test case.
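The idea behind input minimization can be sketched with a simple greedy loop. This is an assumption-laden illustration, not how any particular fuzzer implements it; real fuzzers use more sophisticated strategies, and the minimize function below is invented for this example.

```rust
/// Greedily shrinks `input` while `fails` keeps returning true.
/// Assumes `fails(input)` is true for the original input.
fn minimize<T: Clone>(input: &[T], fails: impl Fn(&[T]) -> bool) -> Vec<T> {
    let mut current = input.to_vec();
    loop {
        let length_before = current.len();
        let mut index = 0;
        while index < current.len() {
            let mut candidate = current.clone();
            candidate.remove(index);
            if fails(&candidate) {
                // Still failing without this element: keep the smaller input.
                current = candidate;
            } else {
                // This element is needed to reproduce the failure.
                index += 1;
            }
        }
        // Stop once a full pass could not remove anything.
        if current.len() == length_before {
            return current;
        }
    }
}
```

For a failure that only needs two specific elements to be present, this loop strips everything else away and returns just those two.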

I was feeling pretty confident in Nebari's stability before Friday. My confidence has been shaken, but as I write more fuzzing tests and run them for extended periods of time, I'm hopeful I can weed out any remaining bugs and restore my confidence.

Dogfooding: Shrinking Git Repository Sizes

At the start of the month, I really wanted to build something with BonsaiDb. One problem BonsaiDb contributors have is related to my choice to use GitHub Pages to build a deployed version of BonsaiDb's documentation, benchmark results, and user's guide on every commit. The way this works is that a workflow generates the static websites into a gh-pages branch, and GitHub deploys the latest version of that branch automatically.

This comes at a cost, however. If you have cloned BonsaiDb's repository, you probably noticed that the repository is quite large. Using Git, we can inspect how much disk space the main branch takes up compared to the gh-pages branch:

# Total size of commits reachable from `main`
git rev-list --disk-usage --objects refs/heads/main
>     4,302,211 (~4.3 MB)
# Total size of commits not reachable from `main`
git rev-list --disk-usage --objects --all --not refs/heads/main
> 3,215,042,793 (~3.2 GB)

By moving gh-pages hosting somewhere else, I can reduce BonsaiDb's repository size down to a few megabytes instead of a few gigabytes. I'm aware there are other approaches that can be taken to alleviate this problem, but when paired with a request to add support for file storage, I saw it as an opportunity to dogfood BonsaiDb.

On Friday, I deployed Dossier and loaded all of the BonsaiDb gh-pages files, but then encountered a slower sync time than I had hoped for. I had an idea for an optimization for listing files in bonsaidb-files, so I updated the View used for that operation and incremented its version number. This informs BonsaiDb that it should discard all of the previously indexed data and reindex that view. While testing this locally against a database that had been used extensively for days and had over 100,000 files in it, I encountered the Nebari bug covered above. This caused me to pause progress on Dossier, as I will always prioritize any bugs relating to data integrity.
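The version-bump behavior can be modeled in a few lines. This is a simplified sketch of the concept only: ViewIndex and ensure_current are hypothetical names for this example and do not reflect BonsaiDb's actual API or storage format.

```rust
use std::collections::BTreeMap;

/// Hypothetical index state persisted alongside a view; not BonsaiDb's types.
struct ViewIndex {
    version: u64,
    entries: BTreeMap<String, u64>,
}

/// Returns an index that is valid for `current_version`, plus a flag telling
/// whether a rebuild happened. When the stored version differs, the old
/// entries are discarded and rebuilt from `documents`.
fn ensure_current(
    stored: ViewIndex,
    current_version: u64,
    documents: &[(String, u64)],
) -> (ViewIndex, bool) {
    if stored.version == current_version {
        return (stored, false);
    }
    // Nothing from the stale index is reused.
    let entries = documents.iter().cloned().collect();
    let rebuilt = ViewIndex {
        version: current_version,
        entries,
    };
    (rebuilt, true)
}
```

With a large database, that full rebuild touches every document, which is the kind of heavy multi-key workload that surfaced the Nebari bug.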

I was hoping this blog post would announce that all of BonsaiDb's static pages, including this blog, were hosted using Dossier, but I haven't quite reached that milestone. I'm hopeful that will be true within the next week, however!

Easier Administration of BonsaiDb

Another challenge I faced in building Dossier is that I needed a way to administer permissions. One morning I had the thought: "It'd be great if I had a tool like psql." For those unfamiliar, psql is a simple command line interface that allows you to execute commands on a PostgreSQL database. Of course, most databases have SQL, and BonsaiDb has no language.

Previously, I had considered the extensible command line interface to be a good approach, but as I added authentication support, I realized that every operation was going to re-authenticate. One advantage a tool like psql has is that it authenticates once upon launch and then reuses the same connection for subsequent commands.

Because developing programming languages is a hobby of mine, one day I whipped up a demo exploring the idea of a language that worked both as a proc macro and as a string:

let results = bql!( let my_new_user = user "ecton" create with password "hunter2" );
println!("Created user id: {:?}", results.get("my_new_user").unwrap());

let results = Program::parse(r#"let jon = user "jon" create;"#);
println!("Created user id: {:?}", results.get("jon").unwrap());

I was envisioning allowing this language to be extended. For example, the bonsaidb-files crate should be able to define its own commands. This creates an interesting problem that I decided wasn't worth tackling: proc-macros cannot invoke code in types that are outside of their own list of dependencies.

In the end, I decided that a language for use within Rust would be unlikely to surpass the fluidity of a well-designed API. I put aside this experiment for now, and for Dossier, I've continued to expand BonsaiDb's built-in command line interface.

Open File Limits

Earlier in the month, I was testing something on my Mac laptop, and I decided to invoke the entire test suite with all features enabled on that machine. I ran into an error due to having too many open files. I expected to eventually run into this, and I was proud that my test suite caught it before I heard about it from a user. One might ask, why does the test suite use so many files that it's a problem on that machine?

Most of the unit tests in BonsaiDb are written against the core traits like Connection and StorageConnection. The core-suite integration test uses bonsaidb-client to run the test suite against a shared bonsaidb-server instance. Each unit test for each connection type gets its own database. The schema most tests utilize has multiple views, each of which uses a few files to track its state.

But still, with roughly 100 unit tests in the suite and two connection types, that is only a few hundred files. It turns out that macOS has a default per-process limit of 256 open files.

In production, this is easily worked around by configuring the server's environment, so this issue wasn't a huge priority to me. But my mind started exploring what sort of data structure I might need to solve this problem. I wanted an LRU cache that worked with a BTreeMap instead of a HashMap. Alas, none of the existing crates seemed to offer this functionality. I whipped up my own LRU crate (awaiting publication).

Despite being written in 100% safe Rust, the new crate performs very similarly to the unsafe crate Nebari currently uses.
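For readers curious what an LRU cache built on BTreeMap might look like, here is a small sketch in safe Rust. It is not the crate mentioned above; LruBTreeMap and its methods are invented for illustration, and the design (a tick counter plus a second map ordered by recency) is just one simple way to do it.

```rust
use std::collections::BTreeMap;

/// A tiny least-recently-used cache whose entries are kept in key order.
/// Illustrative sketch only, not the crate mentioned in the post.
struct LruBTreeMap<K, V> {
    capacity: usize,
    next_tick: u64,
    /// Key -> (tick of last use, value).
    entries: BTreeMap<K, (u64, V)>,
    /// Tick of last use -> key, so the smallest tick is the eviction victim.
    recency: BTreeMap<u64, K>,
}

impl<K: Ord + Clone, V> LruBTreeMap<K, V> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least 1");
        Self {
            capacity,
            next_tick: 0,
            entries: BTreeMap::new(),
            recency: BTreeMap::new(),
        }
    }

    fn tick(&mut self) -> u64 {
        let tick = self.next_tick;
        self.next_tick += 1;
        tick
    }

    /// Looks up `key` and marks it as the most recently used entry.
    fn get(&mut self, key: &K) -> Option<&V> {
        let tick = self.tick();
        let (last_used, value) = self.entries.get_mut(key)?;
        self.recency.remove(&*last_used);
        *last_used = tick;
        self.recency.insert(tick, key.clone());
        Some(&*value)
    }

    /// Inserts or replaces `key`. Returns the evicted key when the cache
    /// grew past its capacity.
    fn insert(&mut self, key: K, value: V) -> Option<K> {
        let tick = self.tick();
        if let Some((old_tick, _)) = self.entries.insert(key.clone(), (tick, value)) {
            // Replacing an existing key only refreshes its recency.
            self.recency.remove(&old_tick);
        }
        self.recency.insert(tick, key);
        if self.entries.len() <= self.capacity {
            return None;
        }
        // The smallest tick belongs to the least recently used key.
        let oldest_tick = *self.recency.keys().next()?;
        let evicted = self.recency.remove(&oldest_tick)?;
        self.entries.remove(&evicted);
        Some(evicted)
    }

    /// Returns the cached keys in ascending key order.
    fn keys(&self) -> Vec<K> {
        self.entries.keys().cloned().collect()
    }
}
```

Every operation here costs O(log n) because of the two ordered maps. A production implementation would likely avoid cloning keys and keep recency in a linked structure instead.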

I attempted a few approaches in Nebari, but it's a tough problem to solve. As this work was dragging on, I really wanted to stop chasing this squirrel and make progress on my original goal of the month. I will be finishing up this problem in the next month or two. Until then, if you run into this issue, you can configure your system to increase the maximum open files for your process, user, or system.

RFC: Token Authentication

I've written up a request for comments discussion about how I've implemented token authentication for BonsaiDb. I'm dogfooding token authentication within Dossier by using it to power API Tokens that each GitHub Repository can have to upload files. The Dossier server uses BonsaiDb's built-in Role-Based Access Control to verify permissions when synchronizing files.

Since I've created a new approach and am using a newer algorithm (BLAKE3), I wanted to see if there were any inherent problems with the approach I'm taking. Unless I hear from someone who is a cryptographer, I'm likely to document this feature as "experimental" for the foreseeable future.

What's next?

I'm going to continue expanding the fuzzing suite for Nebari to try to ensure it's rock-solid. After that, I want to finish moving most of Khonsu Labs' self-hosted static pages onto Dossier. Once that is all done, I expect I will feel ready to release the next update to BonsaiDb.

Getting Started

Our homepage has basic setup instructions and a list of examples. We have started writing a user's guide, and we have tried to write good documentation.

We would love to hear from you if you have questions or feedback. We have community Discourse forums and a Discord server, but also welcome anyone to open an issue with any questions or feedback.

We dream big with BonsaiDb, and we believe that it can simplify writing and deploying complex, data-driven applications in Rust. We would love additional contributors who have similar passions and ambitions.

Lastly, if you build something with one of our libraries, we would love to hear about it. Nothing makes us happier than hearing people are building things with our crates!


"Record Scratch" by xkcd is licensed under CC BY-NC 2.5