Metrics - Building Clickatron
One of the core product features of bitly has always been metrics. Unlike tools like google analytics, bitly metrics have always been a way to gather website metrics when you don’t control the website. Want metrics for a book you wrote on Amazon? Use a bitly link. Want metrics for a blog post? You can use a bitly link for that, too.
As bitly usage has grown, one of the systems that has evolved several times along the way is our metrics platform. Originally implemented as log files with a hierarchical timestamp key, later metrics data was loaded into a flat mysql table. A subsequent revision moved that data into a cluster of tokyocabinet servers. That was eventually supplemented with some additional metrics datasets in mongodb. By early 2011 we were using 3 different metrics systems, and had outgrown 3 others. We started with a set of goals to build a scalable time series database application, which we now call Clickatron.
Goals
The goals we wanted to achieve with a new metrics system were:
1) Move from daily to hourly granularity (not everyone should have to pretend they live in EST). We wanted to satisfy read requests on the fly in any unit of time (hour, day, week, month) for any timezone from a single hourly dataset, removing the duplication present in other systems.
2) Compact on-disk format. One of the previous metrics systems had a very inefficient data structure which was causing both bloated data set sizes and performance problems resulting from insufficient disk IO. In order to satisfy read requests at different units of time, it was also storing several copies of the data.
3) Ability to bulk-load new datasets. There were two problems with previous metrics systems that inhibited adding new metrics datasets. Previous systems lacked proper namespacing which caused essentially duplicate systems for each dataset, and there was not enough excess capacity to insert new metrics going backwards in time.
4) Scalability. Our existing metrics systems were at their limits in terms of handling real-time increments and we didn’t have a simple path forward to add capacity to those systems.
Architecture
As mentioned in our post about sortdb, many datasets have a small subset that is heavily updated, and a larger related dataset that does not change. Metrics data is a textbook example where updates are limited to a small time range, and an overwhelming majority of the data is static.
We chose to take advantage of this and implement an architecture that split data into two separate storage systems. A Realtime System is responsible for per-hour data covering the past 48 hours and also handles increment operations. A separate Archive System stores all of the older data and is updated daily with a new set of static database files from a Rebuild System.
Each of these two systems is itself a cluster of key/value databases. The keys are a combination of
namespace, record key, and time components (explained in more detail below). Data is spread across the shards
of each cluster by a simple crc(shard_key) % number_shards. The shard key is composed of only parts of the
actual key in order to improve data locality on a per-request basis. For example, the shard key excludes time components
so that all data for a request is stored on the same shard regardless of the time range requested.
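As a rough illustration of the sharding rule, here is a minimal sketch for Total Record keys. It assumes zlib.crc32 as the CRC and assumes the shard key is the key with its trailing time component dropped. The post does not say which CRC variant or which exact key fields Clickatron uses, so shard_key and shard_for are hypothetical names.

```python
import zlib

def shard_key(key):
    """Drop the trailing '.<time>' component (assumed shard-key rule)."""
    return key.rsplit(".", 1)[0]

def shard_for(key, number_shards):
    """Pick a shard as crc(shard_key) % number_shards."""
    return zlib.crc32(shard_key(key).encode("utf-8")) % number_shards

# Two different hours for the same total key map to the same shard.
print(shard_for("u|jehiah.c413", 8), shard_for("u|jehiah.c41l", 8))
```

Because the time component never reaches the CRC, a query over any time range for one key only has to talk to one shard.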
Realtime System
The Realtime System is composed of a 3-position ring of storage engines with positions for Today, Yesterday, and
Tomorrow. The storage engines in the Today position handle increments. The storage engines in the Yesterday
position have a 24-hour window to be exported before being cleared for re-use in the Tomorrow position.
Each spot on the ring corresponds to a cluster of simpletokyo / ttserver pairs across which data is
sharded. Records in this system are stored in <key>,<value> format where each record represents
an integer value for a single hour.
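One way to picture the ring rotation is to map each calendar day to one of three slots. The sketch below assumes a simple "day number modulo 3" rule. That rule and the names ring_slot and ring_roles are illustrative assumptions, not necessarily how Clickatron assigns positions.

```python
from datetime import date, timedelta

RING_SIZE = 3  # Today, Yesterday, Tomorrow

def ring_slot(day):
    """Map a calendar day to a ring position (assumed rule: ordinal mod 3)."""
    return day.toordinal() % RING_SIZE

def ring_roles(today):
    """Return which slot plays each role on the given day."""
    return {
        "yesterday": ring_slot(today - timedelta(days=1)),
        "today": ring_slot(today),
        "tomorrow": ring_slot(today + timedelta(days=1)),
    }

print(ring_roles(date(2012, 4, 1)))
```

With three slots, the slot that holds Yesterday on one day is the slot that holds Tomorrow on the next day, which is why it has a 24-hour window to be exported and cleared.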
Rebuild System
The Rebuild System serves 3 purposes. Its primary role is to do the export from the Realtime System every 24 hours and combine that new dataset with all previous datasets to generate new data files for the Archive System. As part of that export process, the Rebuild System generates a second copy of the merged data files to serve as backup. Because it generates new data files from scratch every day, it also provides a spot to introduce new historical data independent of the number of records added. We have been able to use this to introduce new datasets as large as 13 billion keys. Good luck waiting for that many inserts into [insert database name].
The rebuild process is a set of steps that involve sorting, merging, reformatting keys, and combining records. That’s followed by a final sharding and another sort and merge operation. These steps happen largely in parallel and are each split into smaller segments of work that get distributed across several machines. Many of these steps also intelligently avoid redoing unnecessary work by keeping intermediate files split by time range.
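The merge step can be sketched with a streaming merge of already-sorted inputs. This is a simplified single-machine illustration that assumes "combining records" means summing counts for duplicate keys. The real rebuild is distributed and its exact steps are not described here, and merge_counts is a hypothetical name.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_counts(*sorted_streams):
    """Merge sorted (key, count) streams, summing counts for equal keys."""
    merged = heapq.merge(*sorted_streams, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, sum(count for _, count in group)

older = [("u|jehiah.c413", 2), ("u|jehiah.c41l", 3)]
newer = [("u|jehiah.c41l", 2), ("u|jehiah.c420", 1)]
print(list(merge_counts(older, newer)))
```

Since every input is already sorted, the merge reads each stream once and never needs the whole dataset in memory, which is what makes loading billions of keys this way practical.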
Archive System
The Archive System consists of static csv files accessed through sortdb. These files are spread across hosts as needed depending on desired read performance. Multiple sortdb instances for the same data files can be used on separate physical hosts for read availability and fault tolerance.
API
The API layer directs increment requests to the right spot on the Realtime System ring. For increments, the API layer also writes out a local oplog for data durability in case there is a hardware failure in the Realtime System. For data fetches, the API layer handles the potential need to pull data from both the Realtime System and the Archive System, merging the two data sets into a single response. The components of the API involved in incrementing are written in C using simplehttp, and the components for querying are written in a combination of python using tornadoweb and C using simplehttp.
Architecture Overview

Storage Formats
Database Key Optimizations
Two optimizations directly aimed at lowering disk storage needs are a compact time format and a global lookup table for records with long keys. The choice to use sortdb also directly reduced our storage requirements as the data index is stored with the data (more specifically the index is the data).
For time values we use a compact 4 character YMDH representation. Using this format c3p1 corresponds to 1am on March 25th, 2012.
YMDH == _b32(year % 100) + _b32(month) + _b32(day) + _b32(hour)
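A minimal sketch of this encoding follows. It assumes the base-32 alphabet is the digits 0-9 followed by the lowercase letters a-v, which is consistent with the c3p1 example. The helper names and the century argument are illustrative assumptions, and a single character only covers values 0 to 31, so this sketch handles years through 2031.

```python
from datetime import datetime

_B32 = "0123456789abcdefghijklmnopqrstuv"  # assumed alphabet

def _b32(n):
    """Encode an integer in the range 0..31 as one base-32 character."""
    return _B32[n]

def ymdh_encode(dt):
    """Encode a datetime (hour precision) as a 4-character YMDH string."""
    return _b32(dt.year % 100) + _b32(dt.month) + _b32(dt.day) + _b32(dt.hour)

def ymdh_decode(s, century=2000):
    """Decode a YMDH string back into a datetime."""
    year, month, day, hour = (_B32.index(c) for c in s)
    return datetime(century + year, month, day, hour)

print(ymdh_encode(datetime(2012, 3, 25, 1)))  # c3p1
```

Because each component is a fixed-width character in an ordered alphabet, YMDH strings sort in chronological order, which suits a sorted data file.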
Any time a key contains a reserved character (<space> : , . |) or is 12 or more characters, we create a
12-character hash of it, and store a lookup record. This means that if we want to count the number of referrers from
http://www.facebook.com/, instead of storing that 24-character string every time we want to count it, we count a
smaller 12-character hash such that _hash('http://www.facebook.com/') == 'h0aMB8AuNw4=', and do a reverse lookup to
expand the hash back to the full value at query time. It may not seem like much, but a savings of even a few bytes for
repeated values makes a big difference on datasets with billions of records.
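The lookup scheme can be sketched as follows. The post does not say which hash function produces values like h0aMB8AuNw4=, so this sketch substitutes base64 of the first 8 bytes of a SHA-256 digest, which also yields 12 characters but will not reproduce the hashes shown here. The names needs_lookup, short_hash, and compact are hypothetical.

```python
import base64
import hashlib

RESERVED = set(" :,.|")

def needs_lookup(key):
    """True if the key is 12+ characters or contains a reserved character."""
    return len(key) >= 12 or any(ch in RESERVED for ch in key)

def short_hash(key):
    """12-character stand-in hash (assumption: truncated SHA-256, base64)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()[:8]
    return base64.b64encode(digest).decode("ascii")

def compact(key, lookup):
    """Return the key to store, recording a lookup entry when it is hashed."""
    if not needs_lookup(key):
        return key
    hashed = short_hash(key)
    lookup[hashed] = key  # the Lookup Record used to expand at query time
    return hashed

lookup = {}
print(compact("US", lookup), compact("http://www.facebook.com/", lookup))
```

Short keys such as country codes pass through unchanged, so only long or awkward values pay the cost of a reverse lookup.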
Record Types
Our time series database is made up of 3 types of records: Lookup Records, which are the expansions for the 12-character hashes above; Total Records, which are a single key with values per hour over time; and Subtotal Records, which represent the line item values whose sum rolls up to the Total Record for each hour.
We use a different data format between the Realtime System and the Archive System for storing Total Records and Subtotal Records. The data format used for the Realtime System is optimized for writes while the format used for the Archive System is optimized for compactness and reads.
In the Realtime System records have one data point, but in the Archive System records are multi-column and have many values. Many records in the Realtime System will be collapsed to a single multi-column record in the Archive System. Records in the Realtime System are always for a single hour, but Total Records in the Archive System are for all time ranges, and Subtotal Records are similarly for all subtotal keys (for an hour). This facilitates retrieval, but it also provides for a more compact storage (the record key is stored only once for that multi-column record compared with the Realtime System). Metrics data is very sparse (many keys only have values for a few hours) so the time values in a Total Record also serve as an index of what data to expect in the equivalent subtotal data set.
Record Formats
Total Records - Realtime System
A Total Record as stored in the Realtime System consists of a namespace, a total key, an hourly time
value, and the counter value.

Example: These are two Total Records in the Realtime System representing total clicks for a single user in two
separate hours. u is our namespace for “user clicks”, jehiah is my username, c413 and c41l are time components,
and these records have values of 2 and 5 respectively.
u|jehiah.c413,2
u|jehiah.c41l,5
Total Records - Archive System
In the Archive System, all Total Records for the same namespace + total key combination are collapsed into one
record with the individual time value and counter value pairs as the multi-column component.

Example: These are the same two Total Records from the Realtime System as stored in the Archive System. Here the key is
only the u namespace and jehiah total key, but the value is multi-column pairs of time components c413 and c41l
and values of 2 and 5. In this small example, the archive record is 27% more compact.
u|jehiah,c413:2 c41l:5
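The collapse from Realtime to Archive format for Total Records can be sketched like this. It is an illustration of the record formats only, not the actual Rebuild System code, and collapse_totals is a hypothetical name.

```python
def collapse_totals(realtime_lines):
    """Collapse 'ns|key.YMDH,count' lines into 'ns|key,YMDH:count ...' lines."""
    grouped = {}
    for line in realtime_lines:
        key, count = line.rsplit(",", 1)
        record_key, ymdh = key.rsplit(".", 1)  # keys never contain '.'
        grouped.setdefault(record_key, []).append(ymdh + ":" + count)
    return [k + "," + " ".join(cols) for k, cols in grouped.items()]

realtime = ["u|jehiah.c413,2", "u|jehiah.c41l,5"]
print(collapse_totals(realtime))
```

Splitting on the last '.' is safe because any record key containing a reserved character is replaced by its 12-character hash before being stored.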
Subtotal Records - Realtime System
A Subtotal Record as stored in the Realtime System consists of a subtotal namespace, a total
namespace, a total key, a subtotal key, an hourly time value and a counter value. This record is structured such
that the (total namespace + total key) matches exactly to a Total Record. Further, the count values are related
such that the sum of count values for all subtotal keys for a subtotal namespace, total namespace, total key, and
time value exactly equals the matching value in a Total Record.

Example: Here are 4 Subtotal Records that represent two different datasets related to the Total Record above as
stored in the Realtime System. One dataset with the c subtotal namespace represents the clicks by country, and the
other with the r subtotal namespace represents the clicks by referrer. Each of these datasets (4 + 1 and 3 + 2) sums
to the 5 in the Total Record above. US and JP are country values in the subtotal key, and 5bmehgOB4+w= and
h0aMB8AuNw4= are 12-character lookup values that are placeholders for longer subtotal keys. These records in the
Realtime System are per-hour so they can be incremented individually.
c.u|jehiah.US.c41l,4
c.u|jehiah.JP.c41l,1
r.u|jehiah.5bmehgOB4+w=.c41l,3
r.u|jehiah.h0aMB8AuNw4=.c41l,2
Subtotal Records - Archive System
In the Archive System all Subtotal Records for a given hour for a subtotal namespace collapse so that the subtotal
key and count values create the multi-column data.

Example: The same 4 values in the Realtime System are represented as two multi-column records in the Archive System. Here the records are still one per hour, but all subtotal keys for that hour are stored together. For this small example, the archive format is 31% smaller.
c.u|jehiah.c41l,US:4 JP:1
r.u|jehiah.c41l,5bmehgOB4+w=:3 h0aMB8AuNw4=:2
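The Subtotal Record collapse, and the rule that subtotals roll up to the matching total, can be sketched as below. This only illustrates the formats shown above. The function names are hypothetical and the real rebuild code is not shown in this post.

```python
def collapse_subtotals(realtime_lines):
    """Collapse 'sub.ns|key.subkey.YMDH,count' lines into one line per hour."""
    grouped = {}
    for line in realtime_lines:
        key, count = line.rsplit(",", 1)
        prefix, subtotal_key, ymdh = key.rsplit(".", 2)
        grouped.setdefault(prefix + "." + ymdh, []).append(subtotal_key + ":" + count)
    return [k + "," + " ".join(cols) for k, cols in grouped.items()]

def subtotal_sum(archive_line):
    """Sum the counts in a multi-column 'key,subkey:count ...' record."""
    _, columns = archive_line.split(",", 1)
    return sum(int(col.rsplit(":", 1)[1]) for col in columns.split(" "))

realtime = [
    "c.u|jehiah.US.c41l,4",
    "c.u|jehiah.JP.c41l,1",
    "r.u|jehiah.5bmehgOB4+w=.c41l,3",
    "r.u|jehiah.h0aMB8AuNw4=.c41l,2",
]
for line in collapse_subtotals(realtime):
    print(line, subtotal_sum(line))
```

Both collapsed records sum to 5, matching the u|jehiah Total Record value for hour c41l.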
Record Visualization
This is a visualization of the data contained within a pair of records from the Archive System: a Total Record (green) and a Subtotal Record (purple). On the right are the same records but with example values filled in.

Queries
Since Clickatron stores data with hourly granularity, the API merges that hourly data on the fly to fulfill requests for
other granularities. This approach enables the API to accept an hour_offset parameter in order to fulfill queries for
any timezone aligned on the hour (with apologies to India and other locales aligned on the half hour). In our public API
endpoint we look up hour_offset from a more generic timezone parameter. The API also uses the technique for fulfilling day
queries to respond to week and month queries, and can even respond with week units that start on Monday instead of
Sunday (we call this “mweek”). You can see an example of these parameters in our API Documentation
for user and link metrics.
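A day-level rollup with hour_offset might look like the sketch below. It assumes the stored hours are in UTC and reuses the assumed 0-9, a-v base-32 alphabet for the YMDH time component. The function names are hypothetical and this is not the actual API code.

```python
from collections import Counter
from datetime import datetime, timedelta

_B32 = "0123456789abcdefghijklmnopqrstuv"  # assumed alphabet

def ymdh_decode(s, century=2000):
    """Decode a 4-character YMDH string into a datetime."""
    year, month, day, hour = (_B32.index(c) for c in s)
    return datetime(century + year, month, day, hour)

def daily_totals(archive_value, hour_offset=0):
    """Roll 'YMDH:count ...' columns up into per-day totals for an offset."""
    days = Counter()
    for column in archive_value.split(" "):
        ymdh, count = column.split(":")
        local = ymdh_decode(ymdh) + timedelta(hours=hour_offset)
        days[local.date()] += int(count)
    return dict(days)

print(daily_totals("c413:2 c41l:5"))                  # one day, total 7
print(daily_totals("c413:2 c41l:5", hour_offset=-4))  # split across two days
```

Shifting each hour before bucketing is what lets one hourly dataset answer day queries for any hour-aligned timezone. Week and month units would bucket on a different function of the shifted time.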
Performance in the Real World
Clickatron has been in production for over a year now. It has successfully replaced several other metrics systems, reduced hardware requirements, improved data granularity, and reduced query latency. As previously mentioned, we have leveraged its rebuild functionality to bulk load datasets with as many as 13 billion data points at once with zero service impact - directly contributing to our ability to expand the breadth of our data. Despite an increase in the breadth of metrics, the granularity of data storage, the timespan of accumulated data, and a significant increase in traffic, Clickatron’s current persistent footprint is less than one quarter the size of the disparate systems it replaced a year ago. Clickatron currently tracks well over 100 billion data points, a number that increases every day, and we feel well situated to handle future needs as they arise.
EOL
If this sounds like a fun metrics system to use or to work on, bitly is hiring.
sortdb - static key value database
As dataset sizes grow, you need to buy larger/faster database machines, or shard data across multiple machines. sortdb is a database server we have written to help fill a specific need for performant access to static data.
One observation about many datasets is that there is often a small dataset that is heavily updated, and a larger older related dataset that does not change. It is this second case where sortdb can be used to expose a query interface to static datasets.
sortdb has an HTTP interface built on libevent, and exposes get, mget and fwmatch endpoints to key / value data in
a sorted csv data file. Internally, the datafile is mmapped and a binary search is
performed to find rows in the file. It is part of
simplehttp, a family of libraries and daemons for building
scalable web infrastructures.
By mmap’ing the data file sortdb allows the operating system to handle maintaining the most important parts of the data in memory while retrieving others from disk transparently. Since the file is searched using a binary search, it means the first few common steps of the search tree are quickly cached in ram and execute very quickly. The operating system will also share mmap’d data between multiple processes. Practically, this means you can start a second sortdb process pointed at the same data file for higher read throughput.
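To make the mmap plus binary search idea concrete, here is a small Python sketch. sortdb itself is written in C, so this is only an illustration of the technique, assuming the file is sorted bytewise by key and uses newline-terminated rows. open_db and get are hypothetical names.

```python
import mmap

def open_db(path):
    """Memory-map a sorted data file read-only."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def get(db, key, sep=b","):
    """Binary search the mapped file for key; return its value or None."""
    lo, hi = 0, len(db)
    while lo < hi:
        mid = (lo + hi) // 2
        start = db.rfind(b"\n", 0, mid) + 1  # start of the line containing mid
        end = db.find(b"\n", start)
        if end == -1:
            end = len(db)
        line_key, _, value = db[start:end].partition(sep)
        if line_key == key:
            return value
        if line_key < key:
            lo = end + 1   # continue in the lines after this one
        else:
            hi = start     # continue in the lines before this one
    return None
```

Each probe touches only the page containing the midpoint, so the operating system decides which parts of the file stay in memory, and the first few midpoints are shared by every lookup.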
In database terms, a sorted key/value datafile is essentially a covering index where all the values are stored with the key. Because there is no need to store a separate index of the data, this is a compact on-disk representation (removing the need for separate indexes can often give a 20~30% savings in disk usage).
Operationally managing sortdb is easy because the data files it uses are static and read-only. This makes it easy to
copy the file to multiple servers and distribute requests to multiple sortdb instances on separate servers to handle a
higher workload, or to add redundancy. It is also possible to point sortdb to a new file with zero downtime (literally
mv data.csv old.csv && mv new.csv data.csv && kill -s HUP $pid).
Using sortdb
At bitly we use sortdb successfully as a core component of our metrics system and are very happy with its performance characteristics. Our long term metrics system is composed of two different storage systems. One “realtime” system stores metrics for the past 48 hours and processes all increment requests. A second “static” storage system is built on sortdb and stores all data prior to the past 48 hours. Every day, 24 hours of data is exported from the realtime system, and new static data files are re-built for the static system comprising all the existing data, and the new 24 hour data segment.
A small slice of our “static” metrics data looks like this:
c.u|jehiah.c3h5,US:2
c.u|jehiah.c3h7,US:1
c.u|jehiah.c3h9,None:2 US:3
c.u|jehiah.c3ha,None:2
c.u|jehiah.c3hb,None:2 US:3
c.u|jehiah.c3hc,None:3 US:4
c.u|jehiah.c3hd,None:5 US:4
c.u|jehiah.c3he,None:2 US:6
c.u|jehiah.c3hf,None:3 US:5
c.u|jehiah.c3hg,None:4 US:5
c.u|jehiah.c3hh,DE:1 None:5 US:7
c.u|jehiah.c3hi,CA:1 None:6 US:5
u|jehiah.c3hh,13
u|jehiah.c3i0,7
You will notice that these are key,value records where the key is a compound key often made up of a namespace and
sometimes a time component, and the value is sometimes a repeating (key + : + value) sequence.
With that dataset in plain text file named data.csv, you can point sortdb at it with this run command (setting comma
as the field separator)
$ sortdb --field-separator=, --db-file=data.csv
Now it’s possible to query for a single record
$ curl 'http://127.0.0.1:8080/get?key=u|jehiah.c3hh'
13
$ curl 'http://127.0.0.1:8080/get?key=c.u|jehiah.c3hh'
DE:1 None:5 US:7
Or forward match a range of records
$ curl 'http://127.0.0.1:8080/fwmatch?key=u|jehiah.'
u|jehiah.c3hh,13
u|jehiah.c3i0,7
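The same queries can be issued from Python with the standard library. This sketch assumes a sortdb instance listening on 127.0.0.1:8080 as in the curl examples and assumes fwmatch returns one key,value row per line as shown above. sortdb_url and fwmatch are hypothetical helper names.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def sortdb_url(endpoint, key, base="http://127.0.0.1:8080"):
    """Build a query URL, percent-encoding characters such as '|'."""
    return "%s/%s?%s" % (base, endpoint, urlencode({"key": key}))

def fwmatch(prefix, base="http://127.0.0.1:8080"):
    """Return (key, value) pairs for every row whose key starts with prefix."""
    with urlopen(sortdb_url("fwmatch", prefix, base)) as resp:
        body = resp.read().decode("utf-8")
    return [tuple(line.split(",", 1)) for line in body.splitlines() if line]

print(sortdb_url("get", "u|jehiah.c3hh"))
```

Calling fwmatch("u|jehiah.") against the sample file should yield the two u|jehiah rows, provided the server is running.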
EOL
If this sounds like a fun project to hack on, bitly is hiring.
Infrastructure as a Platform
Introduction
At bitly, infrastructure has two core responsibilities.
- The obvious - systems architecture, performance, scaling, technology choices, and implementation details.
- The not-so-obvious - new hire on-boarding, developer tools, and idioms that enable non-infrastructure engineers (science/research, application developers, etc.) to build scalable, performant, operationally sound, and easy-to-develop pieces of your product.
I say not-so-obvious because many of these things aren’t pressing issues when teams are small, everyone is in one room, and it’s crystal clear what the most important goal is from an engineering perspective.
As an engineering team grows, it becomes more and more important to develop and evangelize a common way of solving problems and getting things done. Without these guidelines you end up with inconsistent solutions that have a dramatic impact on the ability to operationally handle production systems (let alone the time cost of engineers constantly re-inventing the wheel).
Maybe you’re responsible 24/7 for production systems - consider any combination of the following situations:
- logs are in 14 different places
- an engineer chose a different one of the far too many Python web frameworks
- services are started by the current flavor-of-the-week process manager
- response formats from APIs differ
- techniques used to process data vary
- monitoring is done differently or not at all
- backups aren’t consistent
- no tests
Sounds like one helluva mess to debug. (and it’s always at 3am)
Let’s talk about how we’ve made progress over the past year solving some of these problems.
Development Workflow
A big win here was our switch to Git (and GitHub). The freedom and flexibility this gave us to design a powerful code-flow (develop -> review -> merge -> deploy) has proved extremely valuable.
We keep our code in one repository (application code, infrastructure tools, configuration and system dependencies). When possible we work to move things into external open source projects that are paired with an install script in our main repo. Every engineer develops in their own fork. Code makes it into master in one of two ways:
- cherry-picking - a small, well defined changeset or high-priority bug fix can be cherry-picked into master.
- pull request - anything else follows the process of creating a branch off master, developing the feature, and opening a pull request.
It should be noted that another member of the engineering team will review either of the above and provide constructive feedback, ultimately being the one to bring the code into master. Code review is a first class member of our development workflow - it is hugely beneficial to both the developer and the reviewer. The better you are at reading code, the better you are at writing it. For efficiency, and to respect each other’s time, we often exchange commit hashes and play tit-for-tat with pull-request reviewing.
Another benefit of this model is that master maintains a clean history, always moving forward. In doing so we avoid the pitfalls of coordinating a destructive history change with so many outstanding branches. Still, we understand the value of Git’s power - engineers are encouraged to re-write history to their heart’s content while developing in their own branch. When ready, a comment on the pull request stating “ready for review @github_user” will notify the corresponding person to begin review at their discretion.
Reviewing takes the form of comments on the pull request. As each round of review is completed, the developer will comment “resolved” where appropriate and push additional commits to the branch. This culminates in one final rebase and squash before a merge. We find that looking back you rarely need the back-and-forth of many small commits and we prefer to see the change as a whole as it relates to the issue documented in the pull request.
Developing in a Virtual Machine
If you’re not:
- writing code in an environment that mimics production
- setup by the same scripts and processes that stand up production hosts
- exposed to the challenges of making things play nice together
then as soon as your code is actually running in production, it’s probably not going to go well.
We believe whole-heartedly in the VM approach. Every developer (and even every designer!) has a local VM for developing changes to any component in our infrastructure, front to back.
Most importantly, it forces you to design your applications to be “environment aware”. Your application code should only know the how. The where and how many is the responsibility of configuration files that pivot on the environment the code is running in. (see @jehiah’s post for more info about how we structure this)
The benefits of this are easy to understand - if you can get it working, tested, and deployable in the VM then getting it running successfully in production should be as simple as a deploy.
However, the choice of using a VM doesn’t come without challenges:
- When you’re service-oriented, as the overall architecture grows, it becomes increasingly difficult to “fit” all services in a single VM on your local machine (RAM, etc.).
- Not all services are equal and different engineers tend to touch different components more often.
- Setup, maintenance, and learning curve are a time sink.
We’ve only begun to scratch the surface in addressing these issues. We’ve begun to logically group services so that they can be spun up/down on demand, giving some control over what’s running simultaneously. We’ll surely post a follow up with our findings as we continue to make progress in this area.
The Way of the Bitly
The value of choosing sane and repeatable solutions to common problems becomes clearer and clearer as things grow - whether it’s data, engineering team size, or number of services in your infrastructure.
Often the # of operations engineers doesn’t scale linearly with the # of services they have to monitor in production. It becomes extremely important to make their lives easy by developing good frameworks and solutions for solving problems, and make those solutions readily available, well documented, and easy to use.
We begin this education process early on with new engineers.
An engineer’s first day sets the tone for their future at your company. A new engineer at bitly can expect that on the first day they’ll have some shiny new hardware to play with as well as one simple goal: get your bio on the bitly about page. Since we put a strong focus on working in a VM, that implies that the existing operations and infrastructure team has done their job to have all the appropriate accounts set up, such as access to email, wiki, GitHub, and of course, a VM.
This simple task facilitates the process of familiarizing themselves with the dev workflow described above as well as providing the satisfaction of seeing code go into production on day 1.
We try to expose new engineers to all different facets of bitly infrastructure and encourage them to explore the codebase. Fixing bugs, adding small pieces of functionality, and developing brand new services are common first week tasks.
We think it’s important to step away from the computer, too. We schedule internal tech talks where we discuss the successes (and challenges) of the current state of the infrastructure. It’s an open forum where questions are answered, odd names get definitions, and bitly idioms are discussed.
EOL
There’s lots more to talk about:
- how we deploy…
- what are some of those bitly idioms…
- how we make technology decisions…
- meetings and communication…
and more. We’ll save all that for future posts.
As always, if any of this sounds interesting to you, bitly is hiring.
Introducing Asyncdynamo
When Amazon announced its DynamoDB service in late January, we quickly identified it as a promising candidate to meet many of our database demands: high availability and fault tolerance, low latency, and low maintenance. SSD hardware promised to grant us similar performance to what we experience with an in-memory datastore, while replication handled by Amazon promised to let our ops team sleep soundly at night (a cranky ops team is never a good thing). While it would be impractical for us to move all of our core infrastructure over to hosted services, we felt that Dynamo could be a good persistence option for applications already deployed to EC2.
Since all interactions with Dynamo are carried out over HTTP, there’s no real need for a custom client library to begin performing database operations. Our applications are written primarily in Python, and Boto provided just enough helper methods to handle authentication and proper request formatting (neither of which are very straightforward tasks). However, Boto’s network calls are executed using Python’s built-in httplib, whose blocking nature makes it impractical for use with Tornado, an asynchronous framework.
Consequently, we set out to develop a library that would make use of Boto’s facilities for Dynamo-specific operations (request formatting and signing, as well as response parsing) but would leverage Tornado’s async HTTP client to execute the actual requests. For the benefit of any other Tornado users hoping to make use of DynamoDB, we are proud to announce Asyncdynamo.
Asyncdynamo requires Boto and Tornado to be installed, and must be run with Python 2.7. It replaces Boto’s synchronous calls to Dynamo and to Amazon STS (to retrieve session tokens) with non-blocking Tornado calls. For the end user its interface seeks to mimic that of Boto Layer1, with each method now requiring an additional callback parameter.
A little trickery was involved in making this work: Boto’s methods to sign requests expect to be operating on an instance of boto.connection.HTTPRequest, so we needed to trick them into accepting our tornado.httpclient.HTTPRequest. Working with a dynamic language makes this kind of type-hijacking much easier than it might otherwise be, and after a short bit of trial and error we were able to successfully fire off our hybrid requests to Amazon.
Currently, Boto Layer1 equivalents of the essential operations - Get, Put, and Query - have all been implemented, and any other interaction with Dynamo is possible using a generic make_request method. Once initialized with your AWS keys, Asyncdynamo will handle all authentication and session token management behind the scenes. Our immediate development plans are to fully replicate the methods offered in Boto Layer1, and in the longer term we hope to be able to add a further layer of abstraction similar to Boto Layer2.
Interested in working on these kinds of projects? We’re hiring. We’ll also be at PyCon this week if you’d like to chat about this project, small links, big data, or anything in between.
Use Amazon SNS for Nagios Alerts
Amazon recently added SMS publishing capability to their Simple Notification Service (SNS) platform.
Using SNS to send nagios alerts:
At Bitly, we used individual carriers’ email SMS gateways (e.g. 123456789@text.att.net) to send pages to the on-call person as well as other people on the ops team. This solution was not cutting it as messages were heavily throttled on the carrier’s side (the gateway being a free service provided by the carrier meant that we couldn’t really complain or get the problem rectified). This meant that pages arrived hours later in some cases, which defeated the purpose of an alert during an emergency.
Once SMS capability was announced, it was pretty much a no-brainer to switch to SNS.
I wrote a quick and simple python script with the help of a popular AWS python library called boto. You can find the source over at http://bit.ly/pyamazonsns.
This script can be used to send any kind of SNS message, not just SMS.
Here is a snippet of how this would tie into nagios:
define command{
    command_name    notify-host-by-txt
    command_line    printf "%b" "$HOSTALIAS$" | send_sns.py $CONTACTPAGER$
}
Feel free to fork it, improve on it and add features beyond the scope of our usage. Drop us a line to let us know :)
Introduction to simplehttp
Part of our engineering philosophy is to keep things fast and simple. Aim to serve one purpose and serve it well. Speak HTTP and encode in JSON. Prototype in Python and speed it up in C.
There are a few components that follow these tenets and sit at the core of our infrastructure. We’ve open-sourced them under the simplehttp moniker. They serve as the architectural foundation for higher-order functionality.
simplehttp
At the lowest level is the simplehttp library, an abstraction of libevent’s evhttp functions, aimed at trivializing the task of writing an evented HTTP server in C. It’s dead simple yet provides high-level features such as:
- Tornado inspired options parsing
- Tornado inspired logging
- Automatic per-endpoint stat tracking (request counts, 95% times, averages)
- Clean API to perform async HTTP requests
Built on top of simplehttp, perhaps the most important daemons are simplequeue and pubsub.
simplequeue
A rock-solid in-memory message queue for arbitrary message bodies (we use JSON), providing basic /get and /put endpoints. We use this in several key areas to serve as a work queue for asynchronous processing. During a maintenance window or situations where backend services are degraded it also acts as a buffer, queueing up work that needs to be done when backend services are restored.
We use long-lived Python “queuereaders” to poll the simplequeue and perform work. This work might be writing to a database, logging, aggregating, or anything else that you might not want to perform in a blocking fashion during a request cycle. These queuereaders have built-in backoff timers which slow down the processing rate when errors are detected to allow a struggling backend to recover gracefully and to reduce the load on the machine running the queuereader.
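The backoff behaviour can be sketched as below. This is a generic multiplicative backoff written for illustration, not bitly's queuereader code. The Backoff class, run_once, and the fetch and handle callables are hypothetical, and requeueing a failed message is deliberately left out.

```python
class Backoff:
    """Grow the delay on errors and shrink it again on success."""

    def __init__(self, base=0.1, factor=2.0, max_delay=30.0):
        self.base, self.factor, self.max_delay = base, factor, max_delay
        self.delay = 0.0

    def failure(self):
        """Record an error and return the new (longer) delay in seconds."""
        self.delay = min(self.max_delay, max(self.base, self.delay * self.factor))
        return self.delay

    def success(self):
        """Record a success and return the new (shorter) delay in seconds."""
        self.delay = self.delay / self.factor if self.delay > self.base else 0.0
        return self.delay

def run_once(fetch, handle, backoff):
    """Process one message; return how long to sleep before the next poll."""
    message = fetch()
    if message is None:
        return backoff.base  # queue is empty: idle briefly
    try:
        handle(message)
    except Exception:
        return backoff.failure()
    return backoff.success()
```

A reader loop would call run_once repeatedly and sleep for the returned number of seconds, so a struggling backend sees progressively fewer requests until it recovers.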
Generally, we silo a simplequeue and its associated queuereaders on each host of the service. Meaning a simplequeue on hostA will only contain messages from requests received by that host and its queuereaders will only process messages from its local simplequeue. We do this to address single point of failure issues.
pubsub
We have many different types of data at bitly, each classified into a stream. There are streams of encodes (shortens), decodes (“clicks”), user events, etc. In order to provide a central, consistent, means for developers to access data in realtime we expose these streams via pubsub.
Publishing a message is a simple HTTP request to the /pub endpoint. In our case this usually happens in a queuereader whose sole purpose is to read off a specific simplequeue and write to a specific pubsub.
A client consuming the stream is a long-lived HTTP request to the /sub endpoint. Messages are transmitted as newline-delimited JSON.
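On the consuming side, a client has to reassemble messages from whatever chunk boundaries the HTTP stream happens to deliver. The sketch below shows that buffering step only. iter_messages is a hypothetical helper, and the chunks would come from the long-lived /sub response in practice.

```python
import json

def iter_messages(chunks):
    """Yield decoded JSON objects from an iterable of byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():  # skip blank lines between messages
                yield json.loads(line.decode("utf-8"))

# A message may be split across chunks; it is only emitted once complete.
stream = [b'{"a": 1}\n{"b"', b': 2}\n', b'\n']
print(list(iter_messages(stream)))
```

Framing on newlines keeps the protocol simple: any language with an HTTP client and a JSON parser can consume a stream.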
To pair with pubsub the repository contains three additional utilities built on pubsubclient. ps_to_file and ps_to_http are fairly self-explanatory. One archives a pubsub stream to a file (automatically rolling the output files for you based on a configurable strftime format string), and the other writes a stream of data to destination HTTP endpoints. The latter can be used to send messages to a simplequeue, another pubsub stream, or any other HTTP endpoint. Additionally, pubsub_filtered repeats a pubsub stream and provides the option to remove or obfuscate fields, creating a filtered view on a subset of the data. At bitly we use these tools to archive our data streams and to pass data published by one application into another application (or another datacenter).
EOL
If any of these things sound like fun projects to hack on, bitly is hiring.
Welcome to the bitly Engineering Blog
At bitly, we have been happy to contribute to a number of open source projects, and we have even started a few of our own. We look forward to talking about those, and other engineering details here. You can stay up-to-date by following @bitly