Debugging a Specialized Database Cache

This blog post is a tale of interesting patterns in system and application metrics, optimizing memory use, and debugging. Before we get into that stuff, it helps to be familiar with the application at the center of it.

Introduction to the Network view

A significant new feature of Bitly’s recent consumer relaunch is the ability to browse links shortened by all your connections on Facebook and Twitter, on a page titled “Your Network”. Internally we call this service “network history”. Looking up all your connections and all their links every time you view the “Network” page is, due to Bitly’s scale, a non-trivial problem. Instead we structured the system to do the hard work at insertion time, to make the read side performant. In a minute, we’ll dig into how exactly we tackled that problem.

We use a key-value store (leveldb via simpleleveldb) where the key is the username and the value is the list of bitly links created by their connections. Such an arrangement is very efficient in terms of storage, memory, and cpu usage. However, depending on how many connections a user has, a single new Bitly link can require a significant number of updates (thousands). We needed a way to gain control over the number of writes to our persistent datastore (and thus disk seeks) and provide the performance needed to handle many thousands of inserts per second - enter nhproxy.

nhproxy mechanics

nhproxy, the caching proxy to the network history database, maintains in-memory:

  • a hash-table of all users for which it is caching network history
  • a hash-table of all links in users’ network history
  • attached to each user, a list of pointers to links, which is the network history

The actual link metadata is stored in a separate global hash-table so that it is not duplicated when multiple users have the same link in their history. Each user only stores a pointer into the global data structure.

nhproxy serves two main request types:

  • get requests specify a username whose full network history should be retrieved. If the user’s network history is already in memory it is quickly returned; otherwise it is fetched from the database, loaded into memory, and then returned.

  • add requests specify a bitly link and a list of all the users for which the link will appear (all connections of the creator of the link). All users who aren’t already loaded into memory are fetched from the database and loaded into memory, and then a reference to the link is added to each of their network history lists (the sketch below illustrates this bookkeeping).
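
To make that bookkeeping concrete, here’s a toy, self-contained sketch of the shared link table and the add path. The names and sizes are made up for illustration, the layout is deliberately simplified (the real structures, both before and after our refactor, are shown later in this post), and pruning, flushing, and error handling are omitted.

#include <stdio.h>
#include <string.h>

#define MAX_LINKS   16
#define MAX_HISTORY  8

struct bitly_link {
    char hash[8];
    int ref_count;
};

struct user {
    char name[32];
    struct bitly_link *history[MAX_HISTORY];
    int history_count;
};

/* the global link table, as a toy array instead of a real hash-table */
static struct bitly_link link_table[MAX_LINKS];
static int link_count;

/* find the link in the global table, or add it - every user shares one copy */
static struct bitly_link *link_table_intern(const char *hash)
{
    for (int i = 0; i < link_count; i++)
        if (strcmp(link_table[i].hash, hash) == 0)
            return &link_table[i];
    struct bitly_link *l = &link_table[link_count++];
    snprintf(l->hash, sizeof(l->hash), "%s", hash);
    return l;
}

/* an "add" request: append a pointer to the shared link to each user's history */
static void add_link(const char *hash, struct user **users, int nusers)
{
    struct bitly_link *link = link_table_intern(hash);
    for (int i = 0; i < nusers; i++) {
        struct user *u = users[i];
        u->history[u->history_count++] = link;
        link->ref_count++;
    }
}

int main(void)
{
    struct user alice = { .name = "alice" }, bob = { .name = "bob" };
    struct user *connections[] = { &alice, &bob };

    add_link("abc123", connections, 2);

    printf("links stored: %d, ref_count: %d\n",
           link_count, alice.history[0]->ref_count);
    return 0;
}

Running it prints one stored link with a reference count of 2, which is the whole point: the link metadata lives once, and each user’s history just points at it.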

There are simple strategies for:

  • "pruning" old network history list elements when the list grows beyond a set maximum
  • saving a user’s changed network history to the database after a while
  • discarding the user’s record and history from memory to keep the total number of users in the cache below another set maximum.

It’s interesting to note that the longer one delays saving a user’s changed network history to the persistent datastore, the more changes it accumulates per save, and the lower the overall rate of writes to the store. One can control that rate of writes by adjusting the delay - for example, a user whose network produces 120 new links an hour, flushed every 15 minutes, costs 4 database writes an hour instead of 120. Don’t worry about what exactly these strategies are, as I’ll be focusing on other aspects of nhproxy in this article.

nhproxy memory load

In the initial deployed state, the memory usage of a box running both nhproxy and the database looked like this:

memory graph 1

The reason it fell each night before exhausting all memory is that we cleanly terminated and restarted nhproxy with a cron job. nhproxy saves and reloads everything it’s caching when it restarts, so no un-flushed changes are lost. Still, we would much rather not have to restart it regularly. Also, notice that the data in nhproxy just before exiting and just after starting again is exactly the same (besides change timestamps the flushing logic uses), yet it uses up very different amounts of physical memory.

There had already been some testing and analysis with valgrind; it wasn’t a simple memory leak in nhproxy. One theory was that the problem was fragmentation: the multitude of small memory regions allocated and deallocated left awkwardly sized regions free that were too small to be reused. We considered trying an alternative allocator like jemalloc, which has fancier strategies for dealing with fragmentation, but decided to try some less drastic measures first.

After a bit of thought I realized that many of the small regions we allocated were dominated by the overhead of malloc itself - the extra space from rounding the allocated region up to a well-aligned size, and the metadata recording the region’s size (typically prepended to the region). Reducing that overhead most likely wouldn’t fix the unbounded memory growth problem, but I couldn’t resist first going after some low-hanging fruit to make overall memory use and fragmentation significantly better in the short term.

Initially, the data in memory was represented by structs that looked something like this:

struct user {
    char *username;
    struct linked_list *history;
    // plus a few miscellaneous flags and counters
};

struct linked_list {
    struct linked_list *next;
    struct linked_list *prev;
    struct bitly_link *link;
};

struct bitly_link {
    char *username;
    char *hash;
    int ref_count;
    // plus a few miscellaneous flags and counters
};

That’s a simplification, but what’s accurate is that there was a separate allocation for each username, each bitly link hash, and each link reference in a history list. That is quite a lot of small allocations, with a lot of overhead. For example, each bitly_link hash is (currently) 6 characters, so we malloc()ed 7 bytes to store the string and null terminator, which malloc() rounded up to 8 bytes, and then prepended with the length in another 8 bytes, to end up as a 16 byte allocation. Also, I suspected that the variable size of the username allocation was a significant cause of fragmentation.
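
If you want to see this overhead on your own machine, here’s a tiny stand-alone check. It’s illustrative only: malloc_usable_size() is glibc-specific, and the exact numbers depend on the allocator and platform, so they may not match the figures above.

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *hash = "abc123";               /* a 6-character link hash */
    char *copy = malloc(strlen(hash) + 1);     /* ask for 7 bytes */
    if (copy == NULL)
        return 1;
    strcpy(copy, hash);

    printf("requested %zu bytes, got %zu usable bytes (plus allocator metadata)\n",
           strlen(hash) + 1, malloc_usable_size(copy));

    free(copy);
    return 0;
}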

So I refactored the code to use structures like this:

struct user {
    char username[MAX_USERNAME_LEN+1];
    struct bitly_link *history[MAX_HISTORY_LEN];
    int history_count;
    // plus a few miscellaneous flags and counters
};

struct bitly_link {
    char username[MAX_USERNAME_LEN+1];
    char hash[MAX_HASH_LEN+1];
    int ref_count;
    // plus a few miscellaneous flags and counters
};

Notice that, with the (minimum) 8-byte malloc() overhead on the 64-bit systems we use, each link in an old-design history list costs 4*8 = 32 bytes (a three-pointer list node plus malloc overhead), while the new design pays a flat 1000 * 8 = 8000 bytes for the array of pointers regardless of how full it is. So any history list more than 1/4 of the maximum length uses less space in the new design. In practice, most users in the nhproxy cache have full history lists (maintained at that length by pruning when they grow longer).

Satisfyingly, cpu usage was much improved, because all those extra malloc() and free() calls in the original design had been a significant cost. Here’s what it looked like before:

cpu graph 1

and here’s what it looked like after:

cpu graph 2

(that’s 100% per core)

But what we really cared about was memory use:

memory graph 2

That’s much lower, but it still grew until exhausting all memory - it just took about 3 times longer to do so. Until the real problem was solved, at least it was an improvement, right?

nhproxy history list bug

Unfortunately, we couldn’t roll it out to production yet, because we noticed this other graph:

counters graph

We had a fair number of what we thought were pretty good tests, and the new version of nhproxy passed them all… but why on earth did the number of bitly links in the cache, and the number of references to them in history lists, gravitate to 2/3 of what they used to be? Fortunately, having solid instrumentation into the runtime behavior of the application enabled us to spot this discrepancy. We had theories about bugs in the history pruning strategy, but couldn’t find anything, so we had to write a tool to compare the state of the original system still running in production against the new version we were testing and staging, which was getting the same add requests but not servicing any of the get requests.

We couldn’t compare just what was cached in the nhproxy, because we had actually changed the user flush and drop strategy, and because even if the strategies were the same, the lack of get requests on the new system being staged would have made the list of users in nhproxy different. So I wrote a tool that took a list of user names, made get requests to both systems, compared the results, and output differences, with various options. We were hoping to see some pattern in which links were lost from which network history lists - maybe relatively old or new links? We looked in request logs to get some users with very new changes, or hour-old changes, or older, and fed these usernames to the tool.
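
The interesting part of the tool is just the counting. Here’s a hypothetical, stripped-down sketch of that step - given the link hashes each system returned for one user (sorted here to keep it short), count what they share and what only one side has. It’s not the actual tool, just the shape of it.

#include <stdio.h>
#include <string.h>

/* a and b are the sorted link hashes returned by system 0 and system 1 */
static void compare_histories(const char **a, int na, const char **b, int nb)
{
    int i = 0, j = 0, common = 0, a_only = 0, b_only = 0;

    while (i < na && j < nb) {
        int c = strcmp(a[i], b[j]);
        if (c == 0)      { common++; i++; j++; }
        else if (c < 0)  { a_only++; i++; }
        else             { b_only++; j++; }
    }
    a_only += na - i;    /* whatever is left appears on only one side */
    b_only += nb - j;

    printf("common=%4d   0_only=%4d   1_only=%4d\n", common, a_only, b_only);
}

int main(void)
{
    const char *sys0[] = { "aaa111", "bbb222", "ccc333" };
    const char *sys1[] = { "bbb222", "ccc333", "ddd444" };

    compare_histories(sys0, 3, sys1, 3);
    return 0;
}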

Here’s some sample output:

login <user> common= 998   0_only=   0   1_only=   0   difference=  0%
login <user> common= 783   0_only=   0   1_only=   0   difference=  0%
...
login <user> common= 301   0_only= 315   1_only= 287   difference= 50%
login <user> common=1000   0_only=   0   1_only=   0   difference=  0%
Total successes=192 failures=3

We didn’t notice any clear patterns in the users, links, or ages. But we did notice that the loss wasn’t evenly distributed - some users matched completely between the systems, and others were over 50% different. For a few of the very different users, we manually queried and compared the caches and the backend databases, trying to find a clue. Finally, Jehiah noticed that the backend database was missing many list items for a user that the cache had, even after the user was flushed. We had no test that caused a user’s list to be flushed, then dropped, then re-fetched, so I added one. Soon after, snakes spotted the extra “i++” in the flush routine that iterates over the history list. It had been introduced when a while-loop that already had an “i++” was changed to a for-loop in the memory-saving struct refactor - we were skipping over every other link and not writing it to the persistent datastore. It’s funny how much time such a little bug can take to track down, but our tests and tools are better for it.
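
For the curious, the bug looked roughly like this - an illustrative reconstruction, not the actual flush routine:

#include <stdio.h>

#define HISTORY_LEN 6

int main(void)
{
    const char *history[HISTORY_LEN] = { "aaa111", "bbb222", "ccc333",
                                         "ddd444", "eee555", "fff666" };

    /* the refactored flush loop: i now gets incremented twice per pass */
    for (int i = 0; i < HISTORY_LEN; i++) {
        printf("writing %s to the datastore\n", history[i]);
        i++;    /* leftover from the old while-loop - skips every other link */
    }
    return 0;
}

Only half of each list ever reaches the database, which is the kind of loss the counters graph and the comparison tool were pointing at.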

libevent memory leak

At that point, nhproxy was much more efficient, and at least as correct as the original version, but it still had the unbounded memory use which becomes problematic after only 3 and a half days or so. When I wrote above that valgrind reported no memory leaks, I wasn’t being completely honest - it reported no memory leaks if one was careful to only use HTTP/1.0 requests without keep-alive, or HTTP/1.1 requests with Connection: close, when exercising or querying the nhproxy. There was a known leak with keep-alive requests, but it wasn’t attacked earlier because:

  • while our internal services do often hold persistent connections, they rarely close them, and we didn’t think we experienced this effect to a degree significant enough to devote resources to solving it
  • libevent is a sensitive and core part of our infrastructure, so we didn’t want to migrate to libevent-2.0.x right at that moment, nor did we want to maintain our own branch of libevent-1.4.x

Simultaneously, we were aware of a problem related to how heap memory was freed back to the OS, which we had already seen in other applications. Memory use would grow due to a large number of small allocations (using the glibc provided malloc()) caused by a temporary high load, but when the load fell, the memory use wouldn’t significantly fall back. Even if valgrind reported no real leaks, and even if load fell back to effectively zero, the OS could still report that the process was using a huge amount of private memory (RSS). We confirmed through some tests that a single tiny allocation on the top of the heap could prevent many gigabytes of contiguous free space from being released back to the OS, and when that top allocation was free()d, the rest would finally go.
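
Here’s a toy program that makes that high-water-mark effect visible. It’s illustrative only: sbrk() and malloc_trim() are Linux/glibc specific, and newer allocator versions may behave somewhat differently.

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NBLOCKS 1000000

static char *blocks[NBLOCKS];

static void report(const char *label)
{
    printf("%-28s program break = %p\n", label, sbrk(0));
}

int main(void)
{
    report("at start:");

    /* simulate a burst of load: lots of small allocations */
    for (int i = 0; i < NBLOCKS; i++)
        blocks[i] = malloc(32);
    report("after allocating:");

    /* free everything except the block nearest the top of the heap */
    for (int i = 0; i < NBLOCKS - 1; i++)
        free(blocks[i]);
    malloc_trim(0);                     /* ask glibc to give memory back */
    report("freed all but the top block:");

    /* free the topmost block too - now the heap can actually shrink */
    free(blocks[NBLOCKS - 1]);
    malloc_trim(0);
    report("freed the top block:");

    return 0;
}

Even after freeing 999,999 of the 1,000,000 blocks and explicitly asking the allocator to give memory back, the program break stays near its peak; only once the topmost block is freed can the heap shrink.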

Between the occasional tiny leak and the significant fragmentation, this high-water-mark effect could be a real problem for nhproxy. jemalloc would probably help, but at this point I really wanted to get rid of the leak. I wanted the output of memory use analysis tools like valgrind to be easier to use, and I didn’t want to wonder how much of an effect the leak was really having.

I started working with the original libevent-1.4.14b source, building and installing the shared library in a custom local path, and running nhproxy with it using LD_LIBRARY_PATH.

libevent uses a lot of function-pointer callbacks, and evhttp abstracts outgoing and incoming http connections to share many functions between them, so I started by putting in some useful printf()s to familiarize myself with the flow of the code in a simple, minimal use case. I used curl with and without the --http1.0 argument to exercise nhproxy.

Valgrind reported that the root of the leak was a struct evhttp_request allocated in evhttp_associate_new_request_with_connection(), which makes sense because that’s where all evhttp_requests are allocated. I eventually traced the exceptional case through

  1. evhttp_associate_new_request_with_connection()
  2. evhttp_request_new()
  3. evhttp_start_read()
  4. evhttp_read() (callback set in evhttp_start_read())
  5. evhttp_connection_done()
  6. evhttp_handle_request() (the (*req_cb)())
  7. evhttp_connection_fail()
  8. evhttp_connection_incoming_fail()
  9. evhttp_connection_free()

The very interesting bit is that evhttp_connection_incoming_fail() seemed to intentionally leak the evhttp_request, preventing evhttp_connection_free() from freeing it, with this comment:

/* 
 * these are cases in which we probably should just
 * close the connection and not send a reply.  this
 * case may happen when a browser keeps a persistent
 * connection open and we timeout on the read.  when
 * the request is still being used for sending, we
 * need to disassociated it from the connection here.
 */
if (!req->userdone) {
        /* remove it so that it will not be freed */
        TAILQ_REMOVE(&req->evcon->requests, req, next);
        /* indicate that this request no longer has a
         * connection object
         */
        req->evcon = NULL;
}

I’m still not sure exactly why that case is needed. But I did know that if evhttp_handle_request() is called with req->evcon->state == EVCON_DISCONNECTED, then the request is done as far as the user is concerned, so userdone should be set. My minimal fix was:

@@ -2231,6 +2231,7 @@ evhttp_handle_request(struct evhttp_requ
    if (req->uri == NULL) {
        event_debug(("%s: bad request", __func__));
        if (req->evcon->state == EVCON_DISCONNECTED) {
+           req->userdone = 1;
            evhttp_connection_fail(req->evcon, EVCON_HTTP_EOF);
        } else {
            event_debug(("%s: sending error", __func__));

In our testing, that didn’t result in any use-after-free crashes or unusual failed requests. This is what memory use looked like under real load with that fix:

memory graph 3

That seemed to do it. We weren’t expecting the keep-alive connection leak to be that significant, but apparently it was. (The memory ramp-up before it stabilizes is fragmentation: object allocations keep pushing the footprint up until the gaps they leave behind are large enough to be reused.)

Conclusion

Graphs of system and application metrics can reveal unexpected behaviors that reflect bugs. Create tools to help you investigate issues if you need them. Tiny fixes can have huge effects.