Here is a tale of how we leverage redundant datacenters, redundant code, and multi-tiered fallbacks in the quest for uptime.
High availability is important for any site operating at scale, but at bitly it is particularly important; people expect bitly links to work, no matter what. We have enterprise customers who rely on them for key metrics, users who share them on social networks, and websites with custom short domains that trust us to serve requests with their name on them. A bitly link not working in any of these scenarios would make our users look bad, so it is something we take very seriously.
No matter how redundant, distributed, and fault tolerant your main infrastructure is, things can always go wrong. Recently Google Apps and Search, probably the most distributed infrastructure in existence, went down. There are unknowns everywhere, and ultimately you have to plan for any part of your infrastructure breaking for unknown reasons. Under failure, a distributed system should degrade gracefully, not suddenly. This is why we created Z Proxy.
So What is Z Proxy?
Z Proxy is an application that serves decodes (this is what we call redirecting from a short bitly link to its long URL, and what happens every time you click on a bitly link) without relying on any other part of the bitly infrastructure. This means that it does not use our primary database of urls, or any of our other servers, to do lookups. So how does it work?
How it Works
Z Proxy is essentially a self-contained wrapper around S3, written in Go. When all of bitly is running properly, every time a link is shortened, a message is put on NSQ, which a queuereader later grabs. The queuereader then writes the short and long urls into S3 so that Z Proxy can perform lookups against S3 by short url, get the long url, and serve a 301 or 302 redirect. To the browser, nothing different happened.
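As a rough illustration of that lookup flow, here is a minimal Go sketch. It is not the actual Z Proxy code: the `Store` interface, the in-memory `mapStore` standing in for S3, the `resolve` and `decodeHandler` names, and the choice of always answering with a 301 are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// ErrNotFound is returned when a short link has no stored entry.
var ErrNotFound = errors.New("short link not found")

// Store is a hypothetical abstraction over the S3 lookup: given a short
// link key, it returns the long URL.
type Store interface {
	Get(key string) (string, error)
}

// mapStore is an in-memory stand-in for S3 so the sketch runs anywhere.
type mapStore map[string]string

func (m mapStore) Get(key string) (string, error) {
	v, ok := m[key]
	if !ok {
		return "", ErrNotFound
	}
	return v, nil
}

// resolve turns a request path such as "/abc123" into a long URL and the
// HTTP status to answer with.
func resolve(s Store, path string) (string, int) {
	key := strings.TrimPrefix(path, "/")
	if key == "" {
		return "", http.StatusNotFound
	}
	long, err := s.Get(key)
	if errors.Is(err, ErrNotFound) {
		return "", http.StatusNotFound
	}
	if err != nil {
		return "", http.StatusBadGateway
	}
	return long, http.StatusMovedPermanently
}

// decodeHandler serves the redirect for each incoming short link request.
func decodeHandler(s Store) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		long, status := resolve(s, r.URL.Path)
		if status != http.StatusMovedPermanently {
			http.Error(w, http.StatusText(status), status)
			return
		}
		http.Redirect(w, r, long, status)
	}
}

func main() {
	store := mapStore{"abc123": "https://example.com/a/long/url"}
	long, status := resolve(store, "/abc123")
	fmt.Println(status, long)
	// In a real server the handler would be passed to http.ListenAndServe.
	_ = decodeHandler(store)
}
```

Keeping the lookup behind a small interface means the same handler can be exercised against a map in tests and against a real object store in production.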
There are multiple hosts running Z Proxy in EC2. This location provides proximity to S3, high availability, and most importantly different availability from the main decode infrastructure, which exists outside of AWS. EC2 and S3 can have problems, but the chance of that happening at the same time as a failure in our other datacenter is extremely low, and having both gives us flexibility.
Each host has a local memcached instance used to cache the slow S3 lookups. Usually there are many more steps involved with decodes, but Z Proxy skips most of the steps that are not essential, such as spam checking. Because it has fewer features than the main decode path, and because it is written in optimized Go, this is a lightweight way to serve our decodes (thousands a second) in a failure scenario. We keep sufficient capacity on these systems to be ready for a failure at any time.
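The caching described above follows the familiar cache-aside pattern, which can be sketched in Go as follows. This is only an illustration under stated assumptions: `localCache` is an in-memory map standing in for memcached, and `Lookup` and `cached` are hypothetical names, not part of Z Proxy.

```go
package main

import (
	"fmt"
	"sync"
)

// Lookup is the hypothetical signature of a slow backing-store lookup
// (S3 in the post).
type Lookup func(key string) (string, error)

// localCache is an in-memory stand-in for the local memcached instance.
type localCache struct {
	mu    sync.Mutex
	items map[string]string
}

func newLocalCache() *localCache {
	return &localCache{items: make(map[string]string)}
}

func (c *localCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.items[key]
	return v, ok
}

func (c *localCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = value
}

// cached wraps a slow lookup: it answers from the cache when it can and
// only falls through to the backing store on a miss.
func cached(c *localCache, slow Lookup) Lookup {
	return func(key string) (string, error) {
		if v, ok := c.Get(key); ok {
			return v, nil
		}
		v, err := slow(key)
		if err != nil {
			return "", err // errors are not cached
		}
		c.Set(key, v)
		return v, nil
	}
}

func main() {
	calls := 0
	slow := func(key string) (string, error) {
		calls++
		return "https://example.com/" + key, nil
	}
	lookup := cached(newLocalCache(), slow)
	lookup("abc123")
	lookup("abc123")
	fmt.Println("backing lookups:", calls) // prints "backing lookups: 1"
}
```

The sketch never expires or invalidates entries. That is a simplification that only makes sense when the cached data does not change after it is written.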
Metrics & Background Processing
Because we use NSQ, even if the primary infrastructure is down, hosts running Z Proxy (we call these “lastresort” hosts) can create and queue messages corresponding to each decode request. That means when everything is back up and running, the primary infrastructure will process messages from these hosts. Info+ pages will be updated with clicks that happened when everything was down, ratelimits will be adjusted, realtime will find new trends based on these clicks, and more.
Z Proxy also records metrics for internal use. It sends data to graphite recording response times, types of requests, etc., but of course since it makes no assumptions about anything in our infrastructure working, graphite included, it also aggregates some stats locally.
Normally our DNS points to our load balancers, which send requests off to frontend webservers. Nginx on each frontend webserver is configured to handle local timeouts and failures by transparently retrying the request against a lastresort host. Nginx on each lastresort host then sends the request to one of a few local Z Proxy instances. This is great because it allows failovers on a per request basis, but if our frontend servers or load balancers are taken out (i.e., we lose datacenter connectivity), it doesn’t help. In this case, we can point DNS for all of our domains directly at the lastresort hosts.
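The per-request retry described above is done in nginx configuration, not in application code. Purely to illustrate the control flow, here is a small Go sketch of "try the primary with a timeout, then fall back". Every name in it (`Backend`, `resolveWithFailover`, `primaryTimeout`, and the simulated backends) is hypothetical, and the 50 ms budget is an arbitrary assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Backend resolves a short link to its long URL. Hypothetical type used
// only for this illustration.
type Backend func(ctx context.Context, shortLink string) (string, error)

// primaryTimeout is an assumed time budget for the primary decode path.
const primaryTimeout = 50 * time.Millisecond

// resolveWithFailover tries the primary backend first and transparently
// retries against the last-resort backend on a timeout or error.
func resolveWithFailover(shortLink string, primary, lastResort Backend) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), primaryTimeout)
	defer cancel()
	long, err := primary(ctx, shortLink)
	if err == nil {
		return long, nil
	}
	// The fallback gets a fresh context so it is not penalized for the
	// time the primary already spent.
	return lastResort(context.Background(), shortLink)
}

// hungBackend simulates an unresponsive primary: it blocks until its
// context is cancelled.
func hungBackend(ctx context.Context, _ string) (string, error) {
	<-ctx.Done()
	return "", ctx.Err()
}

// failedBackend simulates a primary that fails immediately.
func failedBackend(_ context.Context, _ string) (string, error) {
	return "", errors.New("primary unavailable")
}

// fixedBackend returns a healthy backend that always answers with longURL.
func fixedBackend(longURL string) Backend {
	return func(_ context.Context, _ string) (string, error) {
		return longURL, nil
	}
}

func main() {
	lastResort := fixedBackend("https://example.com/from-lastresort")
	primaries := []Backend{
		hungBackend,
		failedBackend,
		fixedBackend("https://example.com/from-primary"),
	}
	for _, primary := range primaries {
		long, err := resolveWithFailover("abc123", primary, lastResort)
		fmt.Println(long, err)
	}
}
```

As with the nginx setup, this kind of retry only helps while the component doing the retrying is still reachable, which is why a DNS-level switch is still needed for larger outages.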
But What if Z Proxy Doesn’t Work?
The trust-nobody approach of Z Proxy makes it very stable, but ultimately it could still break, so even the Go app isn’t enough.
To have an additional level of safety, the S3 key is the short link, but the value isn’t actually the long url itself. The S3 value is an HTML blob containing a meta refresh to the destination url. This allows Z Proxy to parse out the long url, but also allows nginx on lastresort hosts to proxy the 200 responses directly from S3 if Z Proxy goes down. This multi-tier approach to failures gives us increasing levels of availability with decreasing levels of features, metrics, performance, and consistency.
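To make the dual-purpose value concrete, here is a minimal Go sketch of writing and parsing such a blob. The exact HTML that bitly stores is not shown in this post, so the tag layout produced by `buildBlob`, the regular expression, and both function names are assumptions for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"html"
	"regexp"
)

// metaRefresh matches a meta refresh tag of the assumed form
// <meta http-equiv="refresh" content="0; url=DEST">.
var metaRefresh = regexp.MustCompile(
	`(?i)<meta\s+http-equiv="refresh"\s+content="\d+;\s*url=([^"]+)"`)

// buildBlob produces the HTML document stored as the object value. A
// browser that receives it directly will follow the meta refresh.
func buildBlob(longURL string) string {
	return fmt.Sprintf(
		`<html><head><meta http-equiv="refresh" content="0; url=%s"></head></html>`,
		html.EscapeString(longURL))
}

// parseBlob extracts the destination URL from a stored blob so that a
// proxy can answer with a real redirect instead of the HTML page.
func parseBlob(blob string) (string, error) {
	m := metaRefresh.FindStringSubmatch(blob)
	if m == nil {
		return "", errors.New("no meta refresh found")
	}
	return html.UnescapeString(m[1]), nil
}

func main() {
	blob := buildBlob("https://example.com/search?q=go&page=2")
	fmt.Println(blob)
	long, err := parseBlob(blob)
	fmt.Println(long, err)
}
```

The same stored bytes serve both tiers: a proxy that can parse them issues a proper redirect, and if it is unavailable the raw HTML can be handed to the browser as a 200 response that still lands the user on the destination.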
This system gives us confidence that we can serve decodes with high availability, and in the event of an outage or failure, it gives us options for where to send traffic. Because our link resolution dataset is immutable, S3 is an invaluable tool. While we might take slightly different approaches with dynamic data, providing layers of fallbacks from S3 and transparent retrying across datacenters is simple and effective at providing high availability.
- Because we have tens of billions of links, this makes our Z Proxy bucket about 1.25% of all S3 objects (about 4TB in size).
- The main decode path is written in Tcl, a strange and interesting language that I had never seen before working at bitly.