Written by Jonathan Johnson. Published 2022-06-14.
I'm building a database, and I consider SQLite a "gold standard" to compare my database against. While benchmarking new code recently, I noticed Apple's bundled version of SQLite is not ACID compliant.
I do not consider myself an expert on these topics. If there are any errors in my analysis, please reach out to me, and I will correct them immediately. I'm learning by doing, and as evidenced by my last post on this topic, I have made my own fair share of mistakes in trying to implement a fast, ACID-compliant database.
On February 17, 2022, Scott Perry wrote this in a conversation on Twitter:
There's a third sync operation that lets you have your performance and write ordering too: F_BARRIERFSYNC. SQLite already uses it on Darwin, and it's part of the best practices guide for I/O reduction. https://developer.apple.com/documentation/xcode/reducing-disk-writes
Some people (myself included) interpretted the statement "SQLite already uses it
on Darwin" to mean that it's the default behavior. My post will show that this
is not the case. By default, the bundled version of SQLite distributed in macOS
12.4 (21F79) relies on fsync()
for synchronization.
From my investigation, Apple's version of SQLite instead replaces PRAGMA fullfsync = on
's implementation to use F_BARRIERFSYNC
.
SQLite users who are expecting PRAGMA fullfsync
to provide durability
guarantees in the event of power failures or kernel panics can override xSync
via a custom VFS or build SQLite from source.
To understand why, let's review how to ensure persistent writes on Apple's operating systems.
There are two APIs we need to cover: fsync()
and fcntl()
. On Linux,
fsync()
is the system call that is tells the kernel to fully synchronize its
file state with the disk. It is debatable whether the original POSIX
specification intends for this level of durability guarantees of fsync()
, but
on Linux it tries its best to guarantee all bits changed have been synchronized
to the disk including issuing a flush of any affected volatile write caches.
However, on macOS, the man page for fsync()
reads:
Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence.
Specifically, if the drive loses power or the OS crashes, the application may find that only some or none of their data was written. The disk drive may also re-order the data so that later writes may be present, while earlier writes are not.
This is not a theoretical edge case. This scenario is easily reproduced with real world workloads and drive power failures.
For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail.
Apple's documentation clearly states that for any guarantees about data loss due
to power loss or kernel panic, you must use the fcntl()
API with the
F_FULLFSYNC
command.
Back in February of this year, this topic circulated fairly widely, and this
post from Michael Tsai has a summary of the findings. In short, it was
noted that F_FULLFSYNC
is incredibly slow in its current implementation. It
was noted that Apple points users to another fcntl()
command in its "Reducing
Disk Writes" article:
Some apps require a write barrier to ensure data persistence before subsequent operations can proceed. Most apps can use the fcntl(::) F_BARRIERFSYNC for this.
Only use F_FULLFSYNC when your app requires a strong expectation of data persistence. Note that F_FULLFSYNC represents a best-effort guarantee that iOS writes data to the disk, but data can still be lost in the case of sudden power loss.
F_BARRIERFSYNC
issues an IO barrier such that all subsequent IO operations
must wait for all current writes to succeed. The fcntl()
call returns after
issuing the barrier, but before the data is synchronized. This is why using
F_BARRIERFSYNC
doesn't fulfill the durability requirement of ACID: the changes
are confirmed before the data is fully synchronized.
I should note that while fcntl()
is an API that is available on Linux,
F_FULLFSYNC
and F_BARRIERFSYNC
are specific to Apple OSes. Linux has no need
for these options as fsync()
provides the guarantees needed.
When starting my new low-level storage layer (Sediment), I added
support to optionally use F_BARRIERFSYNC
instead of F_FULLFSYNC
on macOS.
By default, F_FULLSYNC
would still be used as I wanted the user to explicitly
opt-out of ACID if they needed the extra performance on Apple hardware. This was
based on the idea that SQLite was using this same approach to achieve its very
fast speed on macOS.
Yesterday, I created a simple benchmark to see where Sediment's performance was currently at. I'm not ready to share numbers, and that's not the point of this post. The summary, however, is that Sediment was faster than SQLite on Linux, but slower than SQLite on my M1 Macbook Air.
That puzzled me, because if both SQLite and Sediment are using the same
synchronization primitives, how could the performance difference be inverted
between by switching operating systems? I decided to investigate how SQLite
utilized F_BARRIERFSYNC
.
My first stop was the documentation. SQLite has a pragma to enable
F_FULLFSYNC
, but I could not find any documentation talking
about F_BARRIERFSYNC
. The documentation for PRAGMA fullfsync
states that the
default value is off.
My next stop was the SQLite source code: full_fsync()
is defined in
os_unix.c. Its responsibility is to perform a full fsync based
on the available and configured options. This section is what is relevant for
Apple OSes:
#elif HAVE_FULLFSYNC
if( fullSync ){
rc = osFcntl(fd, F_FULLFSYNC, 0);
}else{
rc = 1;
}
/* If the FULLFSYNC failed, fall back to attempting an fsync().
** It shouldn't be possible for fullfsync to fail on the local
** file system (on OSX), so failure indicates that FULLFSYNC
** isn't supported for this file system. So, attempt an fsync
** and (for now) ignore the overhead of a superfluous fcntl call.
** It'd be better to detect fullfsync support once and avoid
** the fcntl call every time sync is called.
*/
if( rc ) rc = fsync(fd);
#elif defined(__APPLE__)
/* fdatasync() on HFS+ doesn't yet flush the file size if it changed correctly
** so currently we default to the macro that redefines fdatasync to fsync
*/
rc = fsync(fd);
The SQLite source code shows the implementation for calling fcntl()
with
F_FULLFSYNC
, but has no mention of F_BARRIERFSYNC
.
There's one last thing to check: maybe Apple ships a custom build of SQLite that
utilizes F_BARRIERFSYNC
. The best way to verify is to use dtrace to log out
the system calls the process makes.
I disabled System Integrity Protection so that I could trace the sqlite3
executable that ships with macOS 12.4 (21F79):
~ % sudo dtruss -t fcntl sqlite3 test.sqlite
SYSCALL(args) = return
SQLite version 3.37.0 2021-12-09 01:34:53
Enter ".help" for usage hints.
fcntl(0x3, 0x5F, 0x1) = 0 0
fcntl(0x3, 0x3F, 0x6BDBD9C0) = 3 0
fcntl(0x4, 0x32, 0x16BDBDDA8) = 0 0
sqlite> insert into test (a) values (1);
fcntl(0x3, 0x5A, 0x16BDBD0B8) = 0 0
fcntl(0x3, 0x5A, 0x16BDBD0B8) = 0 0
fcntl(0x3, 0x5A, 0x16BDBD0B8) = 0 0
fcntl(0x3, 0x5A, 0x16BDBCBA8) = 0 0
fcntl(0x3, 0x5A, 0x16BDBDE98) = 0 0
fcntl(0x3, 0x5A, 0x16BDBDE98) = 0 0
fcntl(0x3, 0x5A, 0x16BDBDE98) = 0 0
fcntl(0x3, 0x5A, 0x16BDBDF08) = 0 0
fcntl(0x4, 0x5F, 0x1) = 0 0
fcntl(0x4, 0x3F, 0x1) = 3 0
fcntl(0x5, 0x5F, 0x1) = 0 0
fcntl(0x3, 0x5A, 0x16BDBDEE8) = 0 0
fcntl(0x3, 0x5A, 0x16BDBDEE8) = 0 0
fcntl(0x5, 0x5F, 0x1) = 0 0
fcntl(0x3, 0x5A, 0x16BDBDE88) = 0 0
fcntl(0x3, 0x5A, 0x16BDBDE88) = 0 0
fcntl(0x3, 0x5A, 0x16BDBDEB8) = 0 0
This log shows all of the fcntl()
calls issued by SQLite to perform the
insert into test (a) values (1);
statement. The second argument is the
command. We can see SQLite is using commands 0x3F, 0x5A, and 0x5F. In decimal,
those are 63, 90, and 95 respectively. Looking in fcntl.h
, we see these values:
#define F_FULLFSYNC 51 /* fsync + ask the drive to flush to the media */
#define F_GETPROTECTIONCLASS 63 /* Get the protection class of a file from the EA, returns int */
#define F_BARRIERFSYNC 85 /* fsync + issue barrier to drive */
The commands for 90 and 95 are private:
#define F_OFD_SETLK 90 /* Acquire or release open file description lock */
#define F_SETCONFINED 95 /* "confine" OFD to process */
We did not see any fcntl()
calls with the command argument being 85 (0x55).
Let's try enabling PRAGMA fullfsync
:
sqlite> pragma fullfsync=on;
sqlite> insert into test (a) values (1);
fcntl(0x3, 0x5A, 0x16DD9DEF8) = 0 0
fcntl(0x3, 0x5A, 0x16DD9DEF8) = 0 0
fcntl(0x3, 0x5A, 0x16DD9DEF8) = 0 0
fcntl(0x3, 0x5A, 0x16DD9DF68) = 0 0
fcntl(0x4, 0x5F, 0x1) = 0 0
fcntl(0x4, 0x3F, 0x1) = 3 0
fcntl(0x3, 0x5A, 0x16DD9DF48) = 0 0
fcntl(0x3, 0x5A, 0x16DD9DF48) = 0 0
fcntl(0x4, 0x55, 0x0) = 0 0
fcntl(0x5, 0x5F, 0x1) = 0 0
fcntl(0x4, 0x55, 0x0) = 0 0
fcntl(0x3, 0x55, 0x0) = 0 0
fcntl(0x3, 0x5A, 0x16DD9DEE8) = 0 0
fcntl(0x3, 0x5A, 0x16DD9DEE8) = 0 0
fcntl(0x3, 0x5A, 0x16DD9DF18) = 0 0
As expected, we have a new fcntl()
command: 0x55. Unexpectedly, however,
instead of enabling F_FULLFSYNC
(0x33) as we would expect from reading the
publicly available SQLite code, we see 0x55 instead which is F_BARRIERFSYNC
.
To summarize, Apple's SQLite doesn't use F_BARRIERFSYNC
or F_FULLFSYNC
by
default, and it replaces fcntl(.., F_FULLFSYNC, ..)
with fcntl(.., F_BARRIERFSYNC, ..)
when PRAGMA fullfsync
is enabled.
This behavior was confirmed by Scott Perry as I was editing this post.
For most consumer applications, F_BARRIERFSYNC
will be enough to provide
reasonable durability with the benefit of performing much more quickly. However,
there are some situations where true ACID compliance is desired. Many (but not
all) of those situations involve server software.
With Apple no longer shipping server hardware and the performance of
F_FULLFSYNC
on Apple's drives, it's hard to fault Apple for
making the decision to use F_BARRIERFSYNC
in their version of SQLite. I wish
they would have opted to do it in a different way, such as a new pragma or
changing the default fsync()
behavior instead of replacing PRAMGA fullfsync
.
It's very confusing when a feature that's documented to be specific to macOS doesn't behave as documented on macOS. As it stands, if a developer wants the documented behavior, the easiest way probably is to build SQLite from source.
I did not test any of these findings on iOS -- it's been years since I have tried doing any tracing on a device. I suspect Apple doesn't maintain separate versions of SQLite for iOS and macOS, but because their version of SQLite is closed source, we cannot verify easily.
Regardless of whether Apple changes how SQLite synchronizes in the future, I encourage Apple to publish their updates to SQLite alongside their other open source repositories. I can't imagine the changes made to SQLite would be considered proprietary, and the ability to understand what differs between SQLite's source code and the shipping version in Apple's operating systems is important in understanding what guarantees SQLite provides on Apple's hardware.