Written by Jonathan Johnson. Published 2022-05-02.
What is BonsaiDb?
BonsaiDb is a new database aiming to be the most developer-friendly Rust database. BonsaiDb has a unique feature set geared toward solving many common data problems. We have a page dedicated to answering the question: What is BonsaiDb?. All source code is dual-licensed under the MIT and Apache License 2.0 licenses.
This month has been a journey of interesting projects that I have been itching to write about. The primary focus of the month was to work on a new project to use and test BonsaiDb. Before I dive into that topic, I want to focus on a bug fix for Nebari that arose from testing this new project.
Yesterday, I published an update to Nebari containing bug fixes that prevent potential database corruption. For users of BonsaiDb, from my understanding of the bugs discovered, the view indexer is the only place this bug could arise.
The problem with this bug is that it's silent until a compact operation occurs. There are various `debug_assert!()` statements littered throughout the code that assert that the internal B+ tree order is correct. These are expensive to perform, so they are only enabled in debug builds. Users who are running BonsaiDb or Nebari with release builds may not notice the bug, and the database may return inconsistent results.
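To illustrate the pattern (a hypothetical check, not Nebari's actual code), an invariant assertion like this disappears entirely from release builds:

```rust
/// Hypothetical example of the pattern; not Nebari's actual invariant check.
fn assert_node_order(keys: &[u64]) {
    // `debug_assert!` compiles to nothing unless debug assertions are
    // enabled -- the default for `cargo build`, but not for `--release`.
    debug_assert!(
        keys.windows(2).all(|pair| pair[0] < pair[1]),
        "B+ tree node keys are out of order: {keys:?}"
    );
}
```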
The only way I was able to reproduce the bug involved many multi-key `Modification`s that operated in such a way that a bug in the tree-balancing algorithm was exposed. The edge case only occurred when Nebari attempted to avoid splitting a node by moving entries to both the previous and next nodes during the same operation. This is impossible on small trees, and it can't be triggered by operations that touch only a few keys at a time or by workloads that rarely delete data.
Despite feeling fairly confident that users haven't encountered this yet, I would rest better knowing users have run `cargo update` to get the latest version.
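For projects that depend on Nebari directly or through BonsaiDb, that is a one-liner:

```sh
# Update the lockfile's pinned nebari version to the latest
# semver-compatible release.
cargo update -p nebari
```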
After I fixed the bug in Nebari, I proceeded to try to get a fuzzing suite set up. For those unfamiliar, fuzzing is an algorithmic approach to automated testing where a "fuzzer" repeatedly generates parameters for a test in an attempt to discover bugs. The test I had written was essentially a dumb fuzzer -- the outer loop could have been changed to loop forever.
The advantage of my hand-written test was that it narrowed down the failure case quickly, because I knew some of the constraints on the inputs that produced the bug I encountered. The disadvantage is that it would only find a specific pattern of failure, as my inputs were not completely random.
I ported the same approach to a fuzzer. The fuzzer is written to take an arbitrary `Vec<BTreeSet<u16>>`. For each `BTreeSet`, a `TreeFile` is modified using a `CompareSwap` that inserts the key if a value isn't already stored or removes the key if a value is already stored. The fuzzer...
Hey, who would have guessed it? Fuzzing works! As I was writing that last paragraph, the fuzzer found that my fix in v0.5.1 was incomplete. I couldn't bring myself to continue writing the post and shifted to working on a fix. I held off on releasing v0.5.2 until last night, which allowed me to keep the fuzzer running for six more hours after fixing the bug. It has been running continuously overnight and up to the point of publishing this post without further incident.
The fuzzer tries over and over to break my test by sending in different values for the `Vec<BTreeSet<u16>>`. Once it discovers a failure, it saves the problematic input so that it can be replayed on-demand.
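To make that shape concrete, here is a minimal sketch of a cargo-fuzz harness taking the same input type. It is illustrative only: the `std` set below stands in for the `TreeFile`, and the real harness applies each batch through a `CompareSwap` modification.

```rust
#![no_main]
// Illustrative cargo-fuzz harness, not Nebari's actual fuzz target.
// libfuzzer-sys (with its `arbitrary` feature) derives structured
// inputs like Vec<BTreeSet<u16>> directly from the fuzzer's raw bytes.
use std::collections::BTreeSet;

use libfuzzer_sys::fuzz_target;

fuzz_target!(|batches: Vec<BTreeSet<u16>>| {
    // Stand-in for the on-disk tree; the real target applies each
    // batch to a `TreeFile` via a `CompareSwap`.
    let mut tree = BTreeSet::new();
    for batch in batches {
        for key in batch {
            // Toggle: remove the key if present, insert it otherwise.
            if !tree.remove(&key) {
                tree.insert(key);
            }
        }
    }
});
```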
But that's only part of what fuzzing can do for us. By running an "input minimization" operation on the crashing input, the fuzzer will try to find the smallest input that can reproduce the failure. It saves each of the discovered failure cases so they can be tested independently if desired. Once it can no longer find a smaller input, it reports the minimized test case.
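With cargo-fuzz, replaying and minimizing both work from the saved artifact. The target name and artifact filename below are placeholders:

```sh
# Replay a saved crash against the fuzz target.
cargo fuzz run my_target fuzz/artifacts/my_target/crash-...
# Search for the smallest input that still reproduces the failure.
cargo fuzz tmin my_target fuzz/artifacts/my_target/crash-...
```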
I was feeling pretty confident in Nebari's stability before Friday. My confidence has been shaken, but as I write more fuzzing tests and run them for extended periods of time, I'm hopeful I can weed out any remaining bugs and restore my confidence.
At the start of the month, I really wanted to build something with BonsaiDb. One problem BonsaiDb contributors have is related to my choice to use GitHub Pages to build a deployed version of BonsaiDb's documentation, benchmark results, and user's guide on every commit. The way this works is that a workflow generates the static websites into a `gh-pages` branch, and GitHub deploys the latest version of that branch automatically.
This comes at a cost, however. If you have cloned BonsaiDb's repository, you probably noticed that the repository is quite large. Using Git, we can inspect how much disk space the `main` branch takes up compared to the `gh-pages` branch:
```sh
# Total size of objects reachable from `main`
git rev-list --disk-usage --objects refs/heads/main
> 4,302,211 (~4.3 MB)

# Total size of objects not reachable from `main`
git rev-list --disk-usage --objects --all --not refs/heads/main
> 3,215,042,793 (~3.2 GB)
```
By moving `gh-pages` hosting somewhere else, I can reduce BonsaiDb's repository size down to a few megabytes instead of a few gigabytes. I'm aware there are other approaches that could alleviate this problem, but when paired with a request to add support for file storage, I saw it as an opportunity to dogfood BonsaiDb.
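In the meantime, anyone cloning the repository today can sidestep most of the weight with a single-branch clone:

```sh
# Fetch only `main`, skipping the multi-gigabyte gh-pages history.
git clone --single-branch --branch main https://github.com/khonsulabs/bonsaidb.git
```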
On Friday, I deployed Dossier to khonsu.dev. I loaded all of the BonsaiDb `gh-pages` files, but then encountered a slower sync time than I had hoped for. I had an idea for an optimization for listing files in `bonsaidb-files`, so I updated the View used for that operation and incremented its version number. This informs BonsaiDb that it should discard all of the previously indexed data and reindex that view. While testing this locally against a database that had been used extensively for days and had over 100,000 files in it, I encountered the Nebari bug covered above. This caused me to pause progress on Dossier, as I will always prioritize any bugs relating to data integrity.
I was hoping this blog post would announce that all of BonsaiDb's static pages, including this blog, were hosted using Dossier, but I haven't quite reached that milestone. I'm hopeful that will be true within the next week, however!
Another challenge I faced in building Dossier is that I needed a way to administer permissions. One morning I had the thought: "It'd be great if I had a tool like `psql`." For those unfamiliar, `psql` is a simple command line interface that allows you to execute commands on a PostgreSQL database. Of course, most databases have SQL, but BonsaiDb has no query language.
Previously, I had considered the extensible command line interface to be a good approach, but as I added authentication support, I realized that every operation was going to re-authenticate. One advantage a tool like `psql` has is that it authenticates once upon launch and then reuses the same connection for subsequent commands.
Because developing programming languages is a hobby of mine, one day I whipped up a demo exploring the idea of a language that worked both as a proc macro and as a string:
```rust
let results = bql!(let my_new_user = user "ecton" create with password "hunter2")
    .execute_on(&storage)
    .unwrap();
println!("Created user id: {:?}", results.get("my_new_user").unwrap());

let results = Program::parse(r#"let jon = user "jon" create;"#)
    .unwrap()
    .execute_on(&storage)
    .unwrap();
println!("Created user id: {:?}", results.get("jon").unwrap());
```
I was envisioning allowing this language to be extended. For example, the `bonsaidb-files` crate should be able to define its own commands. This creates an interesting problem that I decided wasn't worth tackling: a proc-macro cannot invoke code in types that are outside of its own list of dependencies.
In the end, I decided that a language for use within Rust would be unlikely to surpass the fluidity of a well-designed API. I've put this experiment aside for now, and for Dossier, I've continued to expand BonsaiDb's built-in command line interface.
Earlier in the month, I was testing something on my Mac laptop, and I decided to invoke the entire test suite with all features enabled on that machine. I ran into an error due to having too many open files. I had expected to run into this eventually, and I was proud that my own test suite surfaced it before any user reported it. One might ask: why does the test suite use so many files that it's a problem on that machine?
Most of the unit tests in BonsaiDb are written against the core traits like `Connection` and `StorageConnection`. The `core-suite` integration test uses `bonsaidb-client` to run the test suite against a shared `bonsaidb-server` instance. Each unit test for each connection type gets its own database. The schema most tests utilize has multiple views, each of which uses a few files to track its state.
But still, with roughly 100 unit tests in the suite and two connection types, that is only a few hundred files. It turns out that macOS has a default per-process limit of 256 open files.
In production, this is easily worked around by configuring the server's environment, so this issue wasn't a huge priority for me. But my mind started exploring what sort of data structure I might need to solve this problem. I thought I wanted an LRU cache that worked with a `BTreeMap` instead of a `HashMap`. Alas, none of the existing crates seemed to offer this functionality, so I whipped up my own LRU crate (awaiting publication).
Despite being 100% safe code, its performance is very similar to that of the unsafe crate Nebari currently uses.
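To sketch the idea (a from-scratch illustration, not the unpublished crate): keep the entries in a `BTreeMap`, and pair it with a second `BTreeMap` keyed by a monotonically increasing access tick to track recency. The ordered entry map is what enables range scans, which a `HashMap`-backed LRU gives up.

```rust
use std::collections::BTreeMap;

/// Sketch of an LRU cache whose entries live in a `BTreeMap`,
/// enabling ordered iteration and range scans.
struct OrderedLru<K: Ord + Clone, V> {
    entries: BTreeMap<K, (V, u64)>, // key -> (value, last-used tick)
    recency: BTreeMap<u64, K>,      // last-used tick -> key
    next_tick: u64,
    capacity: usize,
}

impl<K: Ord + Clone, V> OrderedLru<K, V> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        Self {
            entries: BTreeMap::new(),
            recency: BTreeMap::new(),
            next_tick: 0,
            capacity,
        }
    }

    fn insert(&mut self, key: K, value: V) {
        if let Some((_, tick)) = self.entries.remove(&key) {
            // Re-inserting an existing key just refreshes its recency.
            self.recency.remove(&tick);
        } else if self.entries.len() == self.capacity {
            // Evict the least recently used entry: the smallest tick.
            let lru_tick = *self.recency.keys().next().unwrap();
            let lru_key = self.recency.remove(&lru_tick).unwrap();
            self.entries.remove(&lru_key);
        }
        self.recency.insert(self.next_tick, key.clone());
        self.entries.insert(key, (value, self.next_tick));
        self.next_tick += 1;
    }

    fn get(&mut self, key: &K) -> Option<&V> {
        let now = self.next_tick;
        if let Some((_, tick)) = self.entries.get_mut(key) {
            // Move this entry to the most recently used position.
            self.recency.remove(tick);
            self.recency.insert(now, key.clone());
            *tick = now;
            self.next_tick += 1;
        }
        self.entries.get(key).map(|(value, _)| value)
    }
}
```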
I attempted a few approaches in Nebari, but it's a tough problem to solve. As this work was dragging on, I really wanted to stop chasing this squirrel and make progress on my original goal of the month. I will be finishing up this problem in the next month or two. Until then, if you run into this issue, you can configure your system to increase the maximum open files for your process, user, or system.
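For example, on macOS or Linux, raising the limit for the current shell (and anything launched from it) looks like this:

```sh
# Show the current per-process open-file limit (defaults to 256 on macOS).
ulimit -n
# Raise it for this shell session and any process it launches.
ulimit -n 4096
```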
I've written up a request for comments discussion about how I've implemented token authentication for BonsaiDb. I'm dogfooding token authentication within Dossier by using it to power API Tokens that each GitHub Repository can have to upload files. The Dossier server uses BonsaiDb's built-in Role-Based Access Control to verify permissions when synchronizing files.
Since I've created a new approach and am using a newer algorithm (BLAKE3), I wanted to see if there were any inherent problems with the approach I'm taking. Unless I hear from someone who is a cryptographer, I'm likely to document this feature as "experimental" for the foreseeable future.
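As a rough illustration of the primitive involved (not BonsaiDb's actual token protocol, whose details live in the RFC discussion), the `blake3` crate's keyed-hash mode lets a client answer a server-issued challenge without transmitting the token itself:

```rust
// Illustrative only; not BonsaiDb's actual token-authentication scheme.
// The client proves it holds the 32-byte token by returning a keyed
// hash of a random challenge issued by the server.
fn prove(token: &[u8; 32], challenge: &[u8]) -> blake3::Hash {
    blake3::keyed_hash(token, challenge)
}

fn verify(token: &[u8; 32], challenge: &[u8], proof: &blake3::Hash) -> bool {
    // `blake3::Hash` compares in constant time, avoiding timing leaks.
    &blake3::keyed_hash(token, challenge) == proof
}
```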
I'm going to be continuing to expand the fuzzing suite for Nebari to try to ensure it's rock-solid. After that, I want to finish moving most of Khonsu Labs' self-hosted static pages onto Dossier. Once that is all done, I expect I will feel ready to release the next update to BonsaiDb.
Our homepage has basic setup instructions and a list of examples. We have started writing a user's guide, and we have tried to write good documentation.
We would love to hear from you if you have questions or feedback. We have community Discourse forums and a Discord server, but also welcome anyone to open an issue with any questions or feedback.
We dream big with BonsaiDb, and we believe that it can simplify writing and deploying complex, data-driven applications in Rust. We would love additional contributors who have similar passions and ambitions.
Lastly, if you build something with one of our libraries, we would love to hear about it. Nothing makes us happier than hearing people are building things with our crates!
"Record Scratch" by xkcd is licensed under CC BY-NC 2.5