Metrics - Building Clickatron

One of the core product features of bitly has always been metrics. Unlike tools like Google Analytics, bitly metrics have always been a way to gather website metrics even when you don’t control the website. Want metrics for a book you wrote on Amazon? Use a bitly link. Want metrics for a blog post? You can use a bitly link for that, too.

As bitly usage has grown, one of the systems that has evolved several times along the way is our metrics platform. Originally it was implemented as log files with a hierarchical timestamp key; later that metrics data was loaded into a flat MySQL table. A subsequent revision moved the data into a cluster of Tokyo Cabinet servers, which was eventually supplemented with additional metrics datasets in MongoDB. By early 2011 we were using 3 different metrics systems and had outgrown 3 others. We started with a set of goals to build a scalable time series database application, which we now call Clickatron.

Goals

The goals we wanted to achieve with a new metrics system were:

1) Move from daily to hourly granularity (not everyone should have to pretend they live in EST). We wanted to be able to satisfy read requests on-the-fly in any unit of time (hour, day, week, month) for any timezone based on a single hourly dataset (removing duplication present in other systems).

2) Compact on-disk format. One of the previous metrics systems had a very inefficient data structure which was causing both bloated data set sizes and performance problems resulting from insufficient disk IO. In order to satisfy read requests at different units of time, it was also storing several copies of the data.

3) Ability to bulk-load new datasets. There were two problems with previous metrics systems that inhibited adding new metrics datasets: they lacked proper namespacing, which led to essentially duplicate systems for each dataset, and there was not enough excess capacity to insert new metrics going backwards in time.

4) Scalability. Our existing metrics systems were at their limits in terms of handling real-time increments and we didn’t have a simple path forward to add capacity to those systems.

Architecture

As mentioned in our post about sortdb, many datasets have a small subset that is heavily updated, and a larger related dataset that does not change. Metrics data is a textbook example where updates are limited to a small time range, and an overwhelming majority of the data is static.

We chose to take advantage of this and implement an architecture that splits data into two separate storage systems. A Realtime System is responsible for per-hour data covering the past 48 hours and also handles increment operations. A separate Archive System stores all of the older data and is updated daily with a new set of static database files from a Rebuild System.

Each of these two systems is itself a cluster of key/value databases. The keys are a combination of namespace, record key, and time components (explained in more detail below). Data is spread across the shards of each cluster by a simple crc(shard_key) % number_shards. The shard key is composed of only parts of the actual key in order to improve data locality on a per-request basis. For example, the shard key excludes time components so that all data for a request is stored on the same shard regardless of the time range requested.
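As a rough sketch (not our actual code), routing a record to a shard works something like the Python below; the use of CRC32 specifically and the 16-shard cluster size are illustrative assumptions.

import zlib

def pick_shard(namespace, record_key, number_shards):
    # the shard key deliberately excludes time components so that every
    # hour of data for a given namespace + record key lands on one shard
    shard_key = "%s|%s" % (namespace, record_key)
    return zlib.crc32(shard_key.encode("utf-8")) % number_shards

pick_shard("u", "jehiah", 16)   # the same shard for every hour of this user's data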

Realtime System

The Realtime System is a 3-position ring of storage engines with positions for Today, Yesterday, and Tomorrow. The storage engines in the Today position handle increments. The storage engines in the Yesterday position have a 24-hour window to be exported before being cleared for re-use in the Tomorrow position.

Each spot on the ring corresponds to a cluster of simpletokyo / ttserver pairs across which data is sharded. Records in this system are stored in <key>,<value> format where each record represents an integer value for a single hour.
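Here is a minimal sketch of that ring in Python; the cluster names, the clearing step, and the rotation helper are illustrative rather than our actual implementation.

from collections import deque

# each position holds a handle to a cluster of simpletokyo / ttserver shards
ring = deque(["cluster_a", "cluster_b", "cluster_c"])
TODAY, YESTERDAY, TOMORROW = 0, 1, 2

def roll_day(ring):
    # the cluster leaving the Yesterday position has had its 24-hour export
    # window; clear it before it is staged in the Tomorrow position
    print("clearing %s for re-use" % ring[YESTERDAY])
    # Tomorrow -> Today, Today -> Yesterday, Yesterday -> Tomorrow
    ring.rotate(1)

roll_day(ring)
print("increments now go to %s" % ring[TODAY])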

Rebuild System

The Rebuild System serves 3 purposes. Its primary role is to do the export from the Realtime System every 24 hours and combine that new dataset with all previous datasets to generate new data files for the Archive System. As part of that export process, the Rebuild System generates a second copy of the merged data files to serve as backup. Because it generates new data files from scratch every day, it also provides a spot to introduce new historical data independent of the number of records added. We have been able to use this to introduce new datasets as large as 13 billion keys. Good luck waiting for that many inserts into [insert database name].

The rebuild process is a set of steps that involve sorting, merging, reformatting keys, and combining records. That’s followed by a final sharding and another sort and merge operation. These steps happen largely in parallel and are each split into smaller segments of work that get distributed across several machines. Many of these steps also avoid redoing work by keeping intermediate files split by time range.
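As a simplified sketch of the sort/merge core, the snippet below treats every record as a single sorted “key,value” line and combines duplicates by summing; the key reformatting, the multi-column archive format, and the distribution of work across machines are left out.

import heapq
from itertools import groupby

def merge_exports(paths, out_path):
    # merge several already-sorted files of "key,value" lines into one sorted
    # output file, summing the values of records that share a key
    files = [open(p) for p in paths]
    try:
        merged = heapq.merge(*files)
        with open(out_path, "w") as out:
            for key, lines in groupby(merged, key=lambda l: l.split(",", 1)[0]):
                total = sum(int(l.split(",", 1)[1]) for l in lines)
                out.write("%s,%d\n" % (key, total))
    finally:
        for f in files:
            f.close()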

Archive System

The Archive System consists of static CSV files accessed through sortdb. These files are spread across hosts as needed depending on desired read performance. Multiple sortdb instances for the same data files can be used on separate physical hosts for read availability and fault tolerance.

API

The API layer directs increment requests to the right spot on the Realtime System ring. For increments, the API layer also writes out a local oplog for data durability in case there is a hardware failure in the Realtime System. For data fetches, the API layer handles the potential need to pull data from both the Realtime System and the Archive System, merging the two data sets into a single response. The components of the API involved in incrementing are written in C using simplehttp, and the components for querying are written in a combination of python using tornadoweb and C using simplehttp.
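A simplified sketch of that read-side merge is below; the {time_value: count} shape and the assumption that overlapping hours from the two systems are summed together are illustrative.

def merge_hourly(realtime, archived):
    # realtime and archived are {time_value: count} mappings returned by the
    # two backends; hours present in both are combined
    merged = dict(archived)
    for time_value, count in realtime.items():
        merged[time_value] = merged.get(time_value, 0) + count
    return merged

merge_hourly({"c41l": 5}, {"c413": 2})
# -> {'c413': 2, 'c41l': 5}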

Architecture Overview

[Figure: Clickatron architecture overview]

Storage Formats

Database Key Optimizations

Two optimizations directly aimed at lowering disk storage needs are a compact time format and a global lookup table for records with long keys. The choice to use sortdb also directly reduced our storage requirements as the data index is stored with the data (more specifically the index is the data).

For time values we use a compact 4-character YMDH representation. In this format, c3p1 corresponds to 1am on March 25th, 2012.

YMDH == _b32(year % 100) + _b32(month) + _b32(day) + _b32(hour)
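A sketch of the encoding in Python is below; the base-32 alphabet of 0-9 followed by lowercase letters is an assumption, but it is consistent with the example values in this post.

_B32 = "0123456789abcdefghijklmnopqrstuv"

def _b32(n):
    return _B32[n]

def ymdh(year, month, day, hour):
    # compact 4-character time value
    return _b32(year % 100) + _b32(month) + _b32(day) + _b32(hour)

ymdh(2012, 3, 25, 1)    # 'c3p1' -- 1am on March 25th, 2012
ymdh(2012, 4, 1, 21)    # 'c41l' -- one of the time values used below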

Any time a key contains a reserved character (<space> : , . |) or is 12 or more characters, we create a 12-character hash of it and store a lookup record. This means that if we want to count the number of referrers from http://www.facebook.com/, instead of storing that 24-character string every time we count it, we count a smaller 12-character hash such that _hash('http://www.facebook.com/') == 'h0aMB8AuNw4=', and do a reverse lookup to expand the hash back to the full value at query time. It may not seem like much, but a savings of even a few bytes for repeated values makes a big difference on datasets with billions of records.
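The sketch below shows the idea; the specific hash function isn’t described in this post, so the truncated, base64-encoded md5 digest here is an illustrative assumption that happens to produce 12-character values like the ones above.

import base64
import hashlib

RESERVED = set(" :,.|")

def shorten(key):
    # keys that are long or contain a reserved character are replaced by a
    # 12-character stand-in; a Lookup Record maps it back to the full value
    if len(key) < 12 and not set(key) & RESERVED:
        return key, None
    digest = hashlib.md5(key.encode("utf-8")).digest()[:8]  # assumed hash
    short = base64.b64encode(digest).decode("ascii")         # 12 characters
    return short, (short, key)

short, lookup_record = shorten("http://www.facebook.com/")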

Record Types

Our time series database is made up of 3 types of records: Lookup Records (the expansions for the 12-character hashes above), Total Records (a single key with values per hour over time), and Subtotal Records (the line-item values whose sum rolls up to the Total Record for each hour).

We use a different data format between the Realtime System and the Archive System for storing Total Records and Subtotal Records. The data format used for the Realtime System is optimized for writes while the format used for the Archive System is optimized for compactness and reads.

In the Realtime System records have one data point, but in the Archive System records are multi-column and have many values. Many records in the Realtime System collapse to a single multi-column record in the Archive System. Records in the Realtime System are always for a single hour, but Total Records in the Archive System cover all time ranges, and Subtotal Records similarly cover all subtotal keys (for an hour). This facilitates retrieval, but it also provides more compact storage (the record key is stored only once for that multi-column record compared with the Realtime System). Metrics data is very sparse (many keys only have values for a few hours) so the time values in a Total Record also serve as an index of what data to expect in the equivalent subtotal dataset.

Record Formats

Total Records - Realtime System

A Total Record as stored in the Realtime System consists of a namespace, a total key, an hourly time value, and the counter value.

[Figure: Total Record Realtime Format]

Example: These are two Total Records in the Realtime System representing total clicks for a single user in two separate hours. u is our namespace for “user clicks”, jehiah is my username, c413 and c41l are time components, and these records have values of 2 and 5 respectively.

u|jehiah.c413,2
u|jehiah.c41l,5

Total Records - Archive System

In the Archive System, all Total Records for the same namespace + total key combination are collapsed into one record with the individual time value and counter value pairs as the multi-column component.

[Figure: Total Record Archive Format]

Example: These are the same two Total Records from the Realtime System as stored in the Archive System. Here the key is only the u namespace and the jehiah total key, and the value is multi-column pairs of the time components c413 and c41l with values of 2 and 5. In this small example, the archive record is 27% more compact.

u|jehiah,c413:2 c41l:5
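A simplified sketch of that collapse (assuming the realtime export arrives as lines sorted by key, and ignoring Subtotal and Lookup Records) might look like this; since “.” is a reserved character, it safely delimits the time component.

from itertools import groupby

def collapse_totals(sorted_lines):
    # turn per-hour realtime Total Records into one multi-column
    # archive record per namespace|total_key
    for base, group in groupby(sorted_lines, key=lambda l: l.rsplit(".", 1)[0]):
        columns = []
        for line in group:
            key, value = line.split(",", 1)
            time_value = key.rsplit(".", 1)[1]
            columns.append("%s:%s" % (time_value, value))
        yield "%s,%s" % (base, " ".join(columns))

list(collapse_totals(["u|jehiah.c413,2", "u|jehiah.c41l,5"]))
# -> ['u|jehiah,c413:2 c41l:5']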

Subtotal Records - Realtime System

A Subtotal Record as stored in the Realtime System consists of a subtotal namespace, a total namespace, a total key, a subtotal key, an hourly time value, and a counter value. This record is structured such that the (total namespace + total key) matches exactly to a Total Record. Further, the count values are related such that the sum of the count values for all subtotal keys for a given subtotal namespace, total namespace, total key, and time value exactly equals the matching value in a Total Record.

[Figure: Subtotal Record Realtime Format]

Example: Here are 4 Subtotal Records that represent two different datasets related to the Total Record above as stored in the Realtime System. One dataset, with the c subtotal namespace, represents clicks by country, and the other, with the r subtotal namespace, represents clicks by referrer. Each of these datasets (4 + 1 and 3 + 2) sums up to the 5 in the Total Record above. US and JP are country values in the subtotal key, and 5bmehgOB4+w= and h0aMB8AuNw4= are 12-character lookup values that stand in for longer subtotal keys. These records in the Realtime System are per-hour so they can be incremented individually.

c.u|jehiah.US.c41l,4
c.u|jehiah.JP.c41l,1
r.u|jehiah.5bmehgOB4+w=.c41l,3 
r.u|jehiah.h0aMB8AuNw4=.c41l,2
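A small check of that relationship using the records above (the parsing here is a toy, not our query path):

def parse(line):
    key, value = line.split(",")
    return key.split("."), int(value)

total_key, total_value = parse("u|jehiah.c41l,5")
country_records = [parse(l) for l in [
    "c.u|jehiah.US.c41l,4",
    "c.u|jehiah.JP.c41l,1",
]]
# the 'c' (country) subtotals for hour c41l add up to the Total Record's value
assert sum(value for _, value in country_records) == total_value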

Subtotal Records - Archive System

In the Archive System, all Subtotal Records for a given hour in a subtotal namespace are collapsed so that the subtotal keys and count values form the multi-column data.

[Figure: Subtotal Record Archive Format]

Example: The same 4 values in the Realtime System are represented as two multi-column records in the Archive System. Here the records are still one per hour, but all subtotal keys for that hour are stored together. For this small example, the archive format is 31% smaller.

c.u|jehiah.c41l,US:4 JP:1
r.u|jehiah.c41l,5bmehgOB4+w=:3 h0aMB8AuNw4=:2

Record Visualization

This is a visualization of the data contained within a pair of a Total Record (green) and a Subtotal Record (purple) from the Archive System. On the right are the same records with example values filled in.

[Figure: Clickatron record format visualization]

Queries

Since Clickatron stores data with hourly granularity, the API merges that hourly data on the fly to fulfill requests for other granularities. This approach enables the API to accept an hour_offset parameter in order to fulfill queries for any timezone aligned on the hour (with apologies to India and other locales aligned on the half hour). In our public API endpoint we look up hour_offset from a more generic timezone parameter. The API also uses the same technique it uses for day queries to respond to week and month queries, and can even respond with week units that start on Monday instead of Sunday (we call this “mweek”). You can see an example of these parameters in our API Documentation for user and link metrics.
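A simplified sketch of the day rollup is below; the {datetime: count} shape and the exact way hour_offset is applied are illustrative.

from collections import defaultdict
from datetime import datetime, timedelta

def rollup_days(hourly, hour_offset=0):
    # hourly: {datetime (UTC, hour resolution): count}; shift each hour by
    # the requested offset, then bucket the counts by local calendar day
    days = defaultdict(int)
    for hour, count in hourly.items():
        local = hour + timedelta(hours=hour_offset)
        days[local.date()] += count
    return dict(days)

hourly = {datetime(2012, 4, 1, 3): 2, datetime(2012, 4, 1, 21): 5}
rollup_days(hourly, hour_offset=-5)
# -> {datetime.date(2012, 3, 31): 2, datetime.date(2012, 4, 1): 5}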

Performance in the Real World

Clickatron has been in production for over a year now. It has successfully replaced several other metrics systems, reduced hardware requirements, improved data granularity, and reduced query latency. As previously mentioned, we have leveraged its rebuild functionality to bulk load datasets with as many as 13 billion data points at once with zero service impact - directly contributing to our ability to expand the breadth of our data. Despite an increase in the breadth of metrics, the granularity of data storage, the timespan of accumulated data, and a significant increase in traffic, Clickatron’s current persistent footprint is less than one quarter the size of the disparate systems it replaced a year ago. Clickatron currently tracks well over 100 billion data points, a number that increases every day, and we feel well situated to handle future needs as they arise.

EOL

If this sounds like a fun metrics system to use or to work on, bitly is hiring.