Speeding things up with Redshift

Recently we’ve started to experiment with using Redshift, Amazon’s new data warehousing service. More specifically, we’re using it to speed up and expand our ad hoc data analysis.

The Challenge

bitly sees billions of clicks and shortens each month. We often have questions about the data generated by all of this activity. Sometimes these questions are driven by business needs (how much traffic do we see from a potential enterprise customer?), sometimes they are more technically driven (how much traffic will a new sub-system need to handle?), and sometimes we just like to have fun (what are the top trashy celeb stories this week?).

Unfortunately, when working with that volume of data it can be pretty difficult to do much of anything quickly. Pre-Redshift, all of these questions were answered by writing map-reduce jobs to be run on our Hadoop cluster or on Amazon’s EMR.

Whenever we wanted to answer a question with our data, the process would look something like this:

  1. Write a map-reduce job in Python
  2. Run it on some local test data
  3. Fix bugs
  4. Run it on the Hadoop cluster
  5. Wait 20-30 minutes for results
  6. Get an error back from Hadoop
  7. Dig through the logs to find the error
  8. GOTO 3

This is clearly not ideal when all you want to do is get a simple count.

For a lot of the work we do, Hadoop + Python make for an awesome combination, but for these ad hoc aggregation queries they’re very blunt instruments. Both are general-purpose tools that are super flexible, but slow and difficult to use for this specific use case.
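
To make that concrete: even a trivial count meant writing, shipping, and babysitting a pair of Hadoop streaming scripts along these lines (a minimal sketch, not one of our actual jobs; the log field names are made up).

    # mapper.py -- Hadoop streaming mapper for a "simple count"
    # (the 'domain' field and the value we filter on are hypothetical)
    import sys
    import json

    for line in sys.stdin:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip malformed log lines
        if event.get('domain') == 'example.com':
            sys.stdout.write('example.com\t1\n')

    # reducer.py -- sums the 1s emitted by the mapper
    import sys

    total = 0
    for line in sys.stdin:
        _, count = line.rstrip('\n').split('\t')
        total += int(count)
    sys.stdout.write('%d\n' % total)

Every tweak to that logic meant another lap through steps 3-8 above.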

Redshift, on the other hand, is specifically built and optimized for doing aggregation queries over large sets of data. When we want to answer a question with Redshift, we just write a SQL query and get an answer within a few minutes—if not seconds.

Overall, our experience with Redshift has been a positive one but we have run into some gotchas that we’ll get into below.

The Good News

User Experience

From a user perspective, we’re really happy with Redshift. Any one of our developers or data scientists just needs to write a SQL query and they have an answer to their question in less than 5 minutes. Moving from our old Hadoop-based workflow to an interactive console session with Redshift is a major improvement.

Additionally, since the user-facing bits of Redshift are largely based on PostgreSQL, there is a large ecosystem of mature, well-documented tools and libraries for us to take advantage of.
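
As a concrete sketch of what that looks like in practice, here is an ad hoc aggregation run through psycopg2, a standard PostgreSQL driver (the cluster endpoint, credentials, and the clicks table are all placeholders):

    # Hypothetical example: an ad hoc aggregation against Redshift
    # via psycopg2, since Redshift speaks the PostgreSQL protocol.
    import psycopg2

    conn = psycopg2.connect(
        host='example-cluster.abc123.us-east-1.redshift.amazonaws.com',
        port=5439, dbname='analytics', user='analyst', password='...')
    cur = conn.cursor()

    cur.execute("""
        SELECT date_trunc('day', ts) AS click_day, count(*) AS clicks
        FROM clicks                       -- hypothetical table
        WHERE domain = 'example.com'
        GROUP BY 1
        ORDER BY 1
    """)
    for click_day, clicks in cur.fetchall():
        print('%s  %d' % (click_day, clicks))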

Finally, while it can be a bit slow at times, we’ve been very impressed with the web management console Amazon provides with Redshift. For a 1.0 product, the console is comprehensive and offers much more information than we expected it to.

Performance

For our current use case of ad hoc research queries, Redshift’s performance is adequate. Most queries return a response in less than five minutes and we rarely have many users executing queries concurrently.

That being said, we have done some experimentation with competing products (e.g. Vertica) and have seen better performance out of those tools. This is especially true for more complex queries that benefit from projections/secondary indexes and situations where the cluster’s resources are under contention.

Documentation

Just like the rest of AWS, Amazon provides reasonably comprehensive and thorough documentation. For everything that is directly exposed to us as users (e.g. loading operations, configuration parameters, etc.) we are very happy with the documentation. The only places where we wanted more information were those where Amazon makes things “just work”. Most significantly, we would like to see more detail about what exactly happens when a node in the cluster fails and how the cluster is expected to behave in that state.

Cost

We wouldn’t go so far as to call Redshift cheap, but compared to many competitors it is pretty cost effective. The biggest gotcha here is that while Redshift’s simple model for scaling clusters and tuning performance is nice as a user, it does mean that you end up with a bit of a one-size-fits-all situation.

In our case we are computationally and I/O constrained so we’re paying for a bunch of storage capacity and memory that we don’t use. At our current scale, things work out okay but as we continue to grow it may make sense to take advantage of something else that is more flexible in terms of both hardware and tuning.

The Bad News

Data Loading

We had to spend a lot of time getting our data into Redshift. This is partially our fault since our dataset is not the cleanest in the world, but overall this is the place that we felt the most pain from an immature tool chain.

The majority of bitly’s at-rest data is stored in line-oriented (i.e. one JSON blob per line) JSON files. The primary way to load data into Redshift is to use the COPY command to point the cluster at a pile of CSV or TSV files stored on S3. Clearly we needed to build out some kind of tool chain to transform our JSON logs into flat files.
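
The COPY itself is the easy part; issued through the same PostgreSQL-compatible connection, it looks roughly like this (bucket, prefix, credentials, and table name are placeholders):

    # Hypothetical example of bulk-loading flattened TSV files from S3.
    import psycopg2

    conn = psycopg2.connect(
        host='example-cluster.abc123.us-east-1.redshift.amazonaws.com',
        port=5439, dbname='analytics', user='analyst', password='...')
    cur = conn.cursor()
    cur.execute("""
        COPY clicks
        FROM 's3://example-bucket/flattened/2013/05/'
        CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
        DELIMITER '\\t'
        GZIP
    """)
    conn.commit()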

Initially, we just wrote a simple Python script that would do the transformation. Unfortunately, we quickly discovered that this simple approach would be too slow. We estimated that it would have taken a month to process and load all the data we wanted in Redshift.
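
The first pass looked roughly like the sketch below (Python 2, to match the byte/unicode string discussion further down; the column names are made up): read line-oriented JSON on stdin, pick out the columns we care about, and emit a TSV row on stdout.

    # flatten.py -- minimal sketch of the JSON-to-TSV flattening step
    # (column names are hypothetical; real logs have many more fields)
    import sys
    import json

    COLUMNS = ['ts', 'domain', 'country', 'referrer']

    for line in sys.stdin:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip lines that aren't valid JSON
        row = [unicode(record.get(col, u'')) for col in COLUMNS]
        # make sure embedded tabs/newlines can't break the TSV framing
        row = [field.replace(u'\t', u' ').replace(u'\n', u' ') for field in row]
        sys.stdout.write(u'\t'.join(row).encode('utf-8') + '\n')

One nice property of writing it as a plain stdin-to-stdout filter is that the same logic can run under Hadoop streaming with little or no change.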

Next, we realized that we already had a tool for easily doing highly parallelized, distributed text processing: Hadoop. Accordingly, we re-worked our quick Python script into a Hadoop job to transform our logs in one big batch. Since we already keep a copy of our raw logs in S3, EMR proved to be a great tool for this.

Overall this process worked well but we did still run into a few gotchas loading the flattened data into Redshift:

  • Redshift only supports a subset of UTF-8; specifically, characters can be at most 3 bytes long. Amazon does mention this in their docs, but it still bit us a few times.
  • varchar field lengths are in terms of bytes, not characters. Python byte strings treat the two as one and the same; unicode strings do not. It took us a little while to realize this was happening and to get our code set up to do byte-length truncations without truncating mid-character in unicode strings. To get an idea of how we went about doing unicode-aware truncation, check out this relevant Stack Overflow thread (there is also a small sketch after this list).
  • Moving floats between different systems always has some issues. In Python, the default string output of very small float values is scientific notation (e.g. 3.32e-08). Unfortunately, Redshift doesn’t recognize this format, so we needed to force our data prep job to always output floats in decimal format.
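
For the curious, here is a minimal sketch of one way to handle those last two items, in the spirit of the Stack Overflow approach mentioned above: truncate on encoded byte length without splitting a multi-byte character, and render floats in plain decimal notation.

    # Sketch: byte-length-aware truncation for unicode strings, plus
    # decimal-only float formatting. Not our exact code, just the idea.

    def truncate_to_bytes(text, max_bytes):
        """Return text whose UTF-8 encoding fits in max_bytes, without
        cutting a multi-byte character in half."""
        encoded = text.encode('utf-8')
        if len(encoded) <= max_bytes:
            return text
        # 'ignore' silently drops any partial trailing character
        return encoded[:max_bytes].decode('utf-8', 'ignore')

    def format_float(value):
        """Render a float in decimal notation so Redshift will accept it
        (precision is capped at 10 decimal places in this sketch)."""
        return '{0:.10f}'.format(value).rstrip('0').rstrip('.')

    # truncate_to_bytes(u'caf\xe9s', 4) -> u'caf' (the 2-byte \xe9 won't fit)
    # format_float(3.32e-08)            -> '0.0000000332'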

Now that we have a large body of data loaded into Redshift, we’re working on building out tooling based on NSQ to do our data prep work in a streaming fashion, which should let us easily do smaller, incremental updates.
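
That tooling is still taking shape, but one plausible shape for it is sketched below: a pynsq consumer that flattens each message as it arrives and appends it to a batch file that periodically gets shipped to S3 and COPYed in. The topic, channel, and field names are all hypothetical.

    # Rough sketch of streaming data prep with pynsq (all names hypothetical).
    import json
    import nsq

    COLUMNS = ['ts', 'domain', 'country', 'referrer']
    batch = open('/tmp/redshift_batch.tsv', 'a')  # periodically rotated,
                                                  # shipped to S3, and COPYed in

    def handler(message):
        record = json.loads(message.body)
        row = [unicode(record.get(col, u'')) for col in COLUMNS]
        batch.write(u'\t'.join(row).encode('utf-8') + '\n')
        return True  # tells pynsq the message is finished

    nsq.Reader(topic='clicks', channel='redshift_prep',
               message_handler=handler,
               lookupd_http_addresses=['http://127.0.0.1:4161'])
    nsq.run()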

In the end, we worked through our data loading issues with Redshift but it was one of the more acute pain points we encountered. From our conversations with Amazon, they’re definitely aware of this and we’re interested to see what they’ll come out with, but for now the provided tooling is pretty limited.

Availability

Long term, there are a number of periodic and online tasks that we’re thinking about using a tool like Redshift for. Unfortunately, as things stand today we would not be comfortable relying on Redshift as a highly available system.

Currently, if any node within a Redshift cluster fails, all outstanding queries will fail and Amazon will automatically start replacing the failed node. In theory this recovery should happen very quickly. The cluster will be available for querying as soon as the replacement node is added, but performance on the cluster will be degraded while data is restored on to the new node.

At this point, we have no data on how well this recovery process works in the real world. Additionally, we have concerns about how well it will work when there are larger issues within an availability zone or region. Historically, there have been a number of cases where issues within one Amazon service (e.g. EBS) have cascaded into other services, leading to long periods of unavailability or degradation.

Until there’s a significant track record behind a system like this, we’re hesitant to trust anything that will “automatically recover”.

There is the option of running a mirrored Redshift cluster in a different AZ or region, but that gets expensive fast. Additionally, we’d have to build out even more tooling to make sure those two clusters stay in sync with each other.

Limited Feature Set

Redshift is very impressive feature-wise for a 1.0 product. That being said, a number of competing products have been around for a while and offer some major features that Redshift lacks.

The biggest missing feature for us in Redshift is some kind of secondary indexing or projections. Right now, if you sort or filter by any column other than the SORTKEY, Redshift will do a full table scan. Technically you could create a copy of the table with a different sort key, but then it becomes your problem to keep those tables in sync and to query the right table at the right time.
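
To illustrate that workaround (table and column names are hypothetical): you end up maintaining two physical copies of the same data, each with its own SORTKEY, and routing queries to whichever copy matches the filter.

    # Hypothetical illustration of the "second sort key" workaround.
    import psycopg2

    conn = psycopg2.connect(
        host='example-cluster.abc123.us-east-1.redshift.amazonaws.com',
        port=5439, dbname='analytics', user='analyst', password='...')
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE clicks (
            ts       timestamp,
            domain   varchar(255),
            country  varchar(2)
        ) SORTKEY (ts)
    """)
    cur.execute("""
        CREATE TABLE clicks_by_domain (
            ts       timestamp,
            domain   varchar(255),
            country  varchar(2)
        ) SORTKEY (domain)   -- same data, different sort order
    """)
    # Keeping the two copies in sync is now our problem:
    cur.execute("INSERT INTO clicks_by_domain SELECT * FROM clicks")
    conn.commit()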

Some other “missing” features include tools for working with time series data, geospatial query tools, advanced HA features, and more mature data loading tools.

The Bottom Line

Redshift is great for our needs today, but we’ll see if Amazon’s development keeps up as our needs change going forward. Given Amazon’s impressive track record for quickly iterating on and improving products we’re hopeful, but we do have our eyes on competing products as our use of data warehousing tools matures.

by Sean O’Connor