One of the core product features of bitly has always been metrics. Unlike tools such as Google Analytics, bitly metrics let you gather metrics for a website you don't control. Want metrics for a book you wrote on Amazon? Use a bitly link. Want metrics for a blog post? You can use a bitly link for that, too.
As bitly usage has grown, our metrics platform has evolved several times along the way. Originally it was implemented as log files with a hierarchical timestamp key; later, metrics data was loaded into a flat mysql table. A subsequent revision moved that data into a cluster of tokyocabinet servers, which was eventually supplemented with additional metrics datasets in mongodb. By early 2011 we were using 3 different metrics systems, and had outgrown 3 others. We started with a set of goals to build a scalable time series database application, which we now call Clickatron.
The goals we wanted to achieve with a new metrics system were:
1) Move from daily to hourly granularity (not everyone should have to pretend they live in EST). We also wanted to be able to satisfy read requests on the fly in any unit of time (hour, day, week, month) for any timezone, based on a single hourly dataset (removing the duplication present in other systems).
2) Compact on-disk format. One of the previous metrics systems had a very inefficient data structure, causing both bloated dataset sizes and performance problems from insufficient disk IO. To satisfy read requests at different units of time, it also stored several copies of the data.
3) Ability to bulk-load new datasets. Two problems with previous metrics systems inhibited adding new metrics datasets: they lacked proper namespacing, which led to essentially duplicate systems for each dataset, and there was not enough excess capacity to insert new metrics going backwards in time.
4) Scalability. Our existing metrics systems were at their limits in terms of handling real-time increments and we didn’t have a simple path forward to add capacity to those systems.
As mentioned in our post about sortdb, many datasets have a small subset that is heavily updated, and a larger related dataset that does not change. Metrics data is a textbook example where updates are limited to a small time range, and an overwhelming majority of the data is static.
We chose to take advantage of this and implement an architecture that split data into two separate storage systems. A Realtime System is responsible for per-hour data covering the past 48 hours and also handles increment operations. A separate Archival System stores all of the older data and is updated daily with a new set of static database files from a Rebuild System.
Each of these two systems is itself a cluster of key/value databases. The keys are a combination of namespace, record key, and time components (explained in more detail below). Data is spread across the shards of each cluster by a simple crc(shard_key) % number_shards. The shard key comprises only parts of the actual key in order to improve data locality on a per-request basis. For example, the shard key excludes time components so that all data for a request is stored on the same shard regardless of the time range requested.
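The sharding scheme can be sketched as follows. The key-splitting rule here is an assumption for illustration (the post doesn't spell out how the shard key is extracted), but the crc-modulo step matches the description above:

```python
import binascii

def shard_for(full_key: str, number_shards: int) -> int:
    # Hypothetical shard-key extraction: drop the trailing time component
    # ("u|jehiah.c41l" -> "u|jehiah") so every hour of the same record
    # maps to the same shard.
    shard_key = full_key.rsplit(".", 1)[0]
    return binascii.crc32(shard_key.encode("utf-8")) % number_shards

# Two hours of the same record land on the same shard:
shard_for("u|jehiah.c41k", 16) == shard_for("u|jehiah.c41l", 16)  # True
```

Because the time component never influences shard placement, a query spanning many hours touches exactly one shard.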
The Realtime System consists of a 3-position ring of storage engines with positions for Yesterday, Today, and Tomorrow. The storage engines in the Today position handle increments. The storage engine in the Yesterday position has a 24-hour window to be exported before being cleared for re-use in the Tomorrow position. Each spot on the ring corresponds to a cluster of ttserver pairs which data is sharded across. Records in this system are stored in <key>,<value> format where each record represents an integer value for a single hour.
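One way to picture the rotation is to map each day to a slot by modulo. This is a hypothetical sketch (the post doesn't describe the exact mechanism), but it shows why a cleared Yesterday slot can be re-used for Tomorrow:

```python
from datetime import date, timedelta

POSITIONS = 3  # ring slots: one each for Yesterday, Today, Tomorrow

def ring_position(day: date) -> int:
    """Map a calendar day to one of the 3 ring slots by rotating through them."""
    return day.toordinal() % POSITIONS

today = date(2012, 4, 1)
# Tomorrow re-uses the slot that held data from two days earlier,
# i.e. the slot whose export window has already closed:
ring_position(today + timedelta(days=1)) == ring_position(today - timedelta(days=2))  # True
```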
The Rebuild System serves 3 purposes. Its primary role is to do the export from the Realtime System every 24 hours and combine that new dataset with all previous datasets to generate new data files for the Archive System. As part of that export process, the Rebuild System generates a second copy of the merged data files to serve as backup. Because it generates new data files from scratch every day, it also provides a spot to introduce new historical data independent of the number of records added. We have been able to use this to introduce new datasets as large as 13 billion keys. Good luck waiting for that many inserts into [insert database name].
The rebuild process is a set of steps that involve sorting, merging, reformatting keys, and combining records. That’s followed by a final sharding and another sort and merge operation. These steps happen largely in parallel and are each split into smaller segments of work that get distributed across several machines. Many of these steps also intelligently avoid redoing unnecessary work by keeping intermediate files split by time range.
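The core sort-and-merge step can be sketched with a toy merge over pre-sorted (key, count) streams. This is a minimal illustration, assuming each segment arrives already sorted; the real pipeline distributes this work over files on several machines:

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_counts(*sorted_streams):
    """Merge already-sorted (key, count) streams, combining records
    that share a key by summing their counts."""
    merged = heapq.merge(*sorted_streams, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, sum(count for _, count in group)

a = [("u|jehiah.c41k", 3), ("u|jehiah.c41l", 2)]
b = [("u|jehiah.c41l", 3), ("u|other.c41l", 1)]
list(merge_counts(a, b))
# -> [("u|jehiah.c41k", 3), ("u|jehiah.c41l", 5), ("u|other.c41l", 1)]
```

Because the inputs are sorted, the merge is streaming and never needs the full dataset in memory, which is what makes a daily rebuild over billions of keys tractable.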
The Archive System consists of static csv files accessed through sortdb. These files are spread across hosts as needed depending on desired read performance. Multiple sortdb instances for the same data files can be used on separate physical hosts for read availability and fault tolerance.
The API layer directs increment requests to the right spot on the Realtime System ring. For increments, the API layer also writes out a local oplog for data durability in case there is a hardware failure in the Realtime System. For data fetches, the API layer handles the potential need to pull data from both the Realtime System and the Archive System, merging the two data sets into a single response. The components of the API involved in incrementing are written in C using simplehttp, and the components for querying are written in a combination of python using tornadoweb and C using simplehttp.
Database Key Optimizations
Two optimizations directly aimed at lowering disk storage needs are a compact time format and a global lookup table for records with long keys. The choice to use sortdb also directly reduced our storage requirements as the data index is stored with the data (more specifically the index is the data).
For time values we use a compact 4-character YMDH representation. Using this format, c3p1 corresponds to 1am on March 25th, 2012.

YMDH == _b32(year % 100) + _b32(month) + _b32(day) + _b32(hour)
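The encoding can be written in a few lines. The base32 alphabet 0-9a-v is an assumption, but it is consistent with the c3p1 example above:

```python
B32 = "0123456789abcdefghijklmnopqrstuv"  # assumed base32 alphabet

def _b32(n: int) -> str:
    """Encode a value 0..31 as a single base32 character."""
    return B32[n]

def ymdh(year: int, month: int, day: int, hour: int) -> str:
    """Compact 4-character YMDH time value."""
    return _b32(year % 100) + _b32(month) + _b32(day) + _b32(hour)

ymdh(2012, 3, 25, 1)  # -> "c3p1": 1am on March 25th, 2012
```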
Any time a key contains a reserved character (|) or is 12 or more characters long, we create a 12-character hash of it and store a lookup record. This means that if we want to count the number of referrers from http://www.facebook.com/, instead of storing that 24-character string every time we want to count it, we count a smaller 12-character hash such that _hash('http://www.facebook.com/') == 'h0aMB8AuNw4=', and do a reverse lookup to expand the hash back to the full value at query time. It may not seem like much, but a savings of even a few bytes for repeated values makes a big difference on datasets with billions of records.
Our time series database is made up of 3 types of records: Lookup Records, which are the expansions for the 12-character hashes above; Total Records, which are a single key with values per hour over time; and Subtotal Records, which represent the line-item values whose sum rolls up to the Total Record for each hour.
We use a different data format between the Realtime System and the Archive System for storing Total Records and Subtotal Records. The data format used for the Realtime System is optimized for writes while the format used for the Archive System is optimized for compactness and reads.
In the Realtime System records have one data point, but in the Archive System records are multi-column and have many values. Many records in the Realtime System will be collapsed into a single multi-column record in the Archive System. Records in the Realtime System are always for a single hour, but Total Records in the Archive System cover all time ranges, and Subtotal Records similarly cover all subtotal keys (for an hour). This facilitates retrieval, but it also provides more compact storage (the record key is stored only once for the multi-column record, compared with once per hour in the Realtime System). Metrics data is very sparse (many keys only have values for a few hours), so the time values in a Total Record also serve as an index of what data to expect in the equivalent subtotal dataset.
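The collapse from hourly Realtime records to one multi-column Archive record can be sketched like this. The record formats follow the examples later in this post; the grouping helper itself is illustrative:

```python
from collections import defaultdict

def collapse(realtime_records):
    """Collapse per-hour Realtime records ("key.time,count") into
    multi-column Archive records ("key,time1:c1 time2:c2 ...")."""
    grouped = defaultdict(list)
    for record in realtime_records:
        key, count = record.split(",")
        base, time = key.rsplit(".", 1)  # split off the trailing time component
        grouped[base].append(f"{time}:{count}")
    return [f"{base},{' '.join(cols)}" for base, cols in grouped.items()]

collapse(["u|jehiah.c41k,3", "u|jehiah.c41l,5"])
# -> ["u|jehiah,c41k:3 c41l:5"]
```

The shared record key is stored once instead of once per hour, which is where the compactness of the archive format comes from.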
Total Records - Realtime System
A Total Record as stored in the Realtime System consists of a total namespace, a total key, an hourly time value, and the counter value.

Example: These are two Total Records in the Realtime System representing total clicks for a single user in two different hours. u is our namespace for “user clicks”, jehiah is my username, c41k and c41l are time components, and these records have values of 3 and 5.

u|jehiah.c41k,3
u|jehiah.c41l,5
Total Records - Archive System
In the Archive System, all Total Records for the same total namespace and total key combination are collapsed into one record, with the individual time value and counter value pairs as the multi-column component.

Example: These are the same two Total Records from the Realtime System as stored in the Archive System. Here the key is the u namespace and jehiah total key, but the value is multi-column pairs of the time components c41k and c41l and the values 3 and 5. In this small example, the archive record is 27% more compact.

u|jehiah,c41k:3 c41l:5
Subtotal Records - Realtime System
A Subtotal Record as stored in the Realtime System consists of a subtotal namespace, a total namespace, a total key, a subtotal key, an hourly time value, and a counter value. This record is structured such that the (total namespace + total key) matches exactly to a Total Record. Further, the count values are related such that the sum of count values for all subtotal keys for a time value exactly equals the matching value in a Total Record.
Example: Here are 4 Subtotal Records, representing two different datasets related to the Total Record above, as stored in the Realtime System. The dataset with the c subtotal namespace represents the clicks by country, and the r subtotal namespace represents the clicks by referrer. Each of these datasets (4 + 1 and 3 + 2) sums up to the value of 5 in the Total Record above. US and JP are country values in the subtotal key, and 5bmehgOB4+w= and h0aMB8AuNw4= are 12-character lookup values that are placeholders for longer subtotal keys. These records in the Realtime System are per-hour so they can be incremented individually.

c.u|jehiah.US.c41l,4
c.u|jehiah.JP.c41l,1
r.u|jehiah.5bmehgOB4+w=.c41l,3
r.u|jehiah.h0aMB8AuNw4=.c41l,2
Subtotal Records - Archive System
In the Archive System, all Subtotal Records for a given hour for a subtotal namespace collapse so that the subtotal key and count values create the multi-column data.

Example: The same 4 values in the Realtime System are represented as two multi-column records in the Archive System. Here the records are still one per hour, but all subtotal keys for that hour are stored together. For this small example, the archive format is 31% smaller.

c.u|jehiah.c41l,US:4 JP:1
r.u|jehiah.c41l,5bmehgOB4+w=:3 h0aMB8AuNw4=:2
This is a visualization of the data contained within the pair of Total Record (green) and a Subtotal Record (purple) from the Archive System. On the right are the same records but with example values filled in.
Since Clickatron stores data with hourly granularity, the API merges that hourly data on the fly to fulfill requests for other granularities. This approach enables the API to accept an hour_offset parameter in order to fulfill queries for any timezone aligned on the hour (with apologies to India and other locales aligned on the half hour). In our public API endpoint we look up hour_offset from a more generic timezone parameter. The API also uses this technique for day queries, to respond to week and month queries, and can even respond with week units that start on Monday instead of Sunday (we call this “mweek”). You can see an example of these parameters in our API Documentation for user and link metrics.
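A toy sketch of that on-the-fly merge, assuming hours are indexed from an arbitrary epoch (the real API operates on YMDH keys and supports more units than days):

```python
from collections import defaultdict

def rollup_days(hourly, hour_offset=0):
    """Merge hourly (hour_index, count) points into daily totals,
    shifting by hour_offset so day boundaries align to a requested timezone."""
    days = defaultdict(int)
    for hour, count in hourly:
        days[(hour + hour_offset) // 24] += count
    return dict(days)

hourly = [(23, 2), (24, 3), (25, 1)]
rollup_days(hourly, hour_offset=0)   # -> {0: 2, 1: 4}
rollup_days(hourly, hour_offset=1)   # -> {1: 6}
```

Shifting the day boundary rather than storing per-timezone copies is what lets a single hourly dataset answer queries for any hour-aligned timezone.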
Performance in the Real World
Clickatron has been in production for over a year now. It has successfully replaced several other metrics systems, reduced hardware requirements, improved data granularity, and reduced query latency. As previously mentioned, we have leveraged its rebuild functionality to bulk load datasets with as many as 13 billion data points at once with zero service impact - directly contributing to our ability to expand the breadth of our data. Despite an increase in the breadth of metrics, the granularity of data storage, the timespan of accumulated data, and a significant increase in traffic, Clickatron’s current persistent footprint is less than one quarter the size of the disparate systems it replaced a year ago. Clickatron currently tracks well over 100 billion data points, a number that increases every day, and we feel well situated to handle future needs as they arise.
If this sounds like a fun metrics system to use or to work on, bitly is hiring.