As dataset sizes grow, you need to buy larger/faster database machines, or shard data across multiple machines. sortdb is a database server we have written to help fill a specific need for performant access to static data.
One observation about many datasets is that there is often a small dataset that is heavily updated, and a larger older related dataset that does not change. It is this second case where sortdb can be used to expose a query interface to static datasets.
sortdb has a HTTP interface built on libevent, and exposes
fwmatch endpoints to key / value data in
csv data file. Internally, the datafile is mmaped and a binary search is
performed to find rows in the file. It is part of
simplehttp, a family of libraries and daemons for building
scalable web infrastructures.
By mmap’ing the data file sortdb allows the operating system to handle maintaining the most important parts of the data in memory while retrieving others from disk transparently. Since the file is searched using a binary search, it means the first few common steps of the search tree are quickly cached in ram and execute very quickly. The operating system will also share mmap’d data between multiple processes. Practically, this means you can start a second sortdb process pointed at the same data file for higher read throughput.
In database terms, a sorted key/value datafile is essentially a covering index where all the values are stored with the key. Because there is no need to store a separate index of the data, this is a compact on-disk representation. (removing the need for separate indexes can often give a 20~30% savings in disk usage).
Operationally managing sortdb is easy because the data files it uses are static and read-only. This makes it easy to
copy the file to multiple servers and distribute requests to multiple sortdb instances on separate servers to handle a
higher workload, or to add redundancy. It is also possible to point sortdb to a new file with zero downtime (literally
mv data.csv old.csv && mv new.csv data.csv && kill -s HUP $pid).
At bitly we use sortdb successfully as a core component of our metrics system and are very happy with it’s performance characteristics. Our long term metrics system is composed of two different storage systems. One “realtime” system stores metrics for the past 48 hours and processes all increment requests. A second “static” storage system is built on sortdb and stores all data prior to the past 48 hours. Every day, 24 hours of data is exported from the realtime system, and new static data files are re-built for the static system comprising all the existing data, and the new 24 hour data segment.
A small slice of our “static” metrics data looks like this:
c.u|jehiah.c3h5,US:2 c.u|jehiah.c3h7,US:1 c.u|jehiah.c3h9,None:2 US:3 c.u|jehiah.c3ha,None:2 c.u|jehiah.c3hb,None:2 US:3 c.u|jehiah.c3hc,None:3 US:4 c.u|jehiah.c3hd,None:5 US:4 c.u|jehiah.c3he,None:2 US:6 c.u|jehiah.c3hf,None:3 US:5 c.u|jehiah.c3hg,None:4 US:5 c.u|jehiah.c3hh,DE:1 None:5 US:7 c.u|jehiah.c3hi,CA:1 None:6 US:5 u|jehiah.c3hh,13 u|jehiah.c3i0,7
You will notice that these are
key,value records where the key is a compound key often made up of a namespace and
sometimes a time component, and the value is sometimes a repeating (
With that dataset in plain text file named
data.csv, you can point sortdb at it with this run command (setting comma
as the field separator)
$ sortdb --field-separator=, --db-file=data.csv
Now it’s possible to query for a single record
$ curl 'http://127.0.0.1:8080/get?key=u|jehiah.c3hh' 13 $ curl 'http://127.0.0.1:8080/get?key=c.u|jehiah.c3hh' DE:1 None:5 US:7
Or forward match a range of records
$ curl 'http://127.0.0.1:8080/fwmatch?key=u|jehiah.' u|jehiah.c3hh,13 u|jehiah.c3i0,7
If this sound like a fun project to hack on, bitly is hiring.