<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><atom:link rel="hub" href="http://tumblr.superfeedr.com/" xmlns:atom="http://www.w3.org/2005/Atom"/><description>The bitly engineering blog</description><title>word</title><generator>Tumblr (3.0; @wordbitly)</generator><link>http://word.bitly.com/</link><item><title>Building NSQ Client Libraries</title><description>&lt;p&gt;Brace yourself, this is a long one.&lt;/p&gt;

&lt;p&gt;The following guide was originally intended for client library developers to describe in detail all
the important features and functionality we expected in an &lt;a href="https://github.com/bitly/nsq"&gt;NSQ&lt;/a&gt; client library.&lt;/p&gt;

&lt;p&gt;While writing it we began to realize that it had value &lt;em&gt;beyond&lt;/em&gt; client library developers. It
incorporates a comprehensive analysis of most of the capabilities of &lt;a href="https://github.com/bitly/nsq"&gt;NSQ&lt;/a&gt; (both client and
server) and is therefore interesting and useful for end-users as well (or anyone using or interested
in infrastructure messaging platforms).&lt;/p&gt;

&lt;p&gt;If you need some background on &lt;a href="https://github.com/bitly/nsq"&gt;NSQ&lt;/a&gt; please see our &lt;a href="http://word.bitly.com/post/33232969144/nsq"&gt;original blog post&lt;/a&gt; or its
follow up, &lt;a href="http://word.bitly.com/post/38385370762/spray-some-nsq-on-it"&gt;spray some NSQ on it&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Intro&lt;/h3&gt;

&lt;p&gt;NSQ&amp;#8217;s design pushes a lot of responsibility onto client libraries in order to maintain overall
cluster robustness and performance.&lt;/p&gt;

&lt;p&gt;This guide attempts to outline the various responsibilities well-behaved client libraries need to
fulfill. Because publishing to &lt;code&gt;nsqd&lt;/code&gt; is trivial (just an HTTP POST to the &lt;code&gt;/put&lt;/code&gt; endpoint), this
document focuses on consumers.&lt;/p&gt;

&lt;p&gt;By setting these expectations we hope to provide a foundation for achieving consistency across
languages for NSQ users.&lt;/p&gt;

&lt;h3&gt;Overview&lt;/h3&gt;

&lt;ol&gt;&lt;li&gt;&lt;a href="#configuration"&gt;Configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#discovery"&gt;Discovery&lt;/a&gt; (optional)&lt;/li&gt;
&lt;li&gt;&lt;a href="#connection_handling"&gt;Connection Handling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#feature_negotiation"&gt;Feature Negotiation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#data_flow"&gt;Data Flow / Heartbeats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#message_handling"&gt;Message Handling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#rdy_state"&gt;RDY State&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#backoff"&gt;Backoff&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;h3&gt;&lt;a name="configuration" id="configuration"&gt;&lt;/a&gt;Configuration&lt;/h3&gt;

&lt;p&gt;At a high level, our philosophy with respect to configuration is to design the system to have the
flexibility to support different workloads, use sane defaults that run well &amp;#8220;out of the box&amp;#8221;, and
minimize the number of dials.&lt;/p&gt;

&lt;p&gt;A client subscribes to a &lt;code&gt;topic&lt;/code&gt; on a &lt;code&gt;channel&lt;/code&gt; over a TCP connection to &lt;code&gt;nsqd&lt;/code&gt; instance(s). You can
only subscribe to one topic per connection so multiple topic consumption needs to be structured
accordingly.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;nsqlookupd&lt;/code&gt; for discovery is optional so client libraries should support a configuration
where a client connects &lt;em&gt;directly&lt;/em&gt; to one or more &lt;code&gt;nsqd&lt;/code&gt; instances or where it is configured to poll
one or more &lt;code&gt;nsqlookupd&lt;/code&gt; instances. When a client is configured to poll &lt;code&gt;nsqlookupd&lt;/code&gt; the polling
interval should be configurable. Additionally, because typical deployments of NSQ are in distributed
environments with many producers and consumers, the client library should automatically add jitter
based on a random % of the configured value. This will help avoid a thundering herd of connections.
For more detail see &lt;a href="#discovery"&gt;Discovery&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;An important performance knob for clients is the number of messages a client can receive before &lt;code&gt;nsqd&lt;/code&gt;
expects a response. This pipelining facilitates buffered, batched, and asynchronous message
handling. By convention this value is called &lt;code&gt;max_in_flight&lt;/code&gt; and it affects how &lt;code&gt;RDY&lt;/code&gt; state is
managed. For more detail see &lt;a href="#rdy_state"&gt;RDY State&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Because NSQ is designed to handle failure gracefully, client libraries are expected to
implement retry handling for failed messages and provide options for bounding that behavior in terms
of number of attempts per message. For more detail see &lt;a href="#message_handling"&gt;Message Handling&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Relatedly, when message processing fails, the client library is expected to automatically handle
re-queueing the message. NSQ supports sending a delay along with the &lt;code&gt;REQ&lt;/code&gt; command. Client libraries
are expected to provide options for what this delay should be set to initially (for the first
failure) and how it should change for subsequent failures. For more detail see &lt;a href="#backoff"&gt;Backoff&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Most importantly, the client library should support some method of configuring callback handlers for
message processing. The signature of these callbacks should be simple, typically accepting a single
parameter (an instance of a &amp;#8220;message object&amp;#8221;).&lt;/p&gt;

&lt;h3&gt;&lt;a name="discovery" id="discovery"&gt;&lt;/a&gt;Discovery&lt;/h3&gt;

&lt;p&gt;An important component of NSQ is &lt;code&gt;nsqlookupd&lt;/code&gt;, which provides a discovery service for consumers to
locate producers of a given topic at runtime.&lt;/p&gt;

&lt;p&gt;Although optional, using &lt;code&gt;nsqlookupd&lt;/code&gt; greatly reduces the amount of configuration required to
maintain and scale a large distributed NSQ cluster.&lt;/p&gt;

&lt;p&gt;When a client uses &lt;code&gt;nsqlookupd&lt;/code&gt; for discovery, the client library should manage the process of
polling all &lt;code&gt;nsqlookupd&lt;/code&gt; instances for an up-to-date set of &lt;code&gt;nsqd&lt;/code&gt; producers and should manage the
connections to those producers.&lt;/p&gt;

&lt;p&gt;Querying an &lt;code&gt;nsqlookupd&lt;/code&gt; instance is straightforward. Perform an HTTP request to the lookup endpoint
with a query parameter of the topic the client is attempting to discover (e.g.
&lt;code&gt;/lookup?topic=clicks&lt;/code&gt;). The response format is JSON:&lt;/p&gt;

&lt;p&gt;&lt;a class="gist" href="https://gist.github.com/mreiferson/5548351"&gt;Example JSON Response From nsqlookupd&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;broadcast_address&lt;/code&gt; and &lt;code&gt;tcp_port&lt;/code&gt; should be used to connect to an &lt;code&gt;nsqd&lt;/code&gt; producer. Because, by
design, &lt;code&gt;nsqlookupd&lt;/code&gt; instances don&amp;#8217;t coordinate their lists of producers, the client library should
union the lists it received from all &lt;code&gt;nsqlookupd&lt;/code&gt; queries to build the final list of &lt;code&gt;nsqd&lt;/code&gt;
producers to connect to. The &lt;code&gt;broadcast_address:tcp_port&lt;/code&gt; combination should be used as the unique
key for this union.&lt;/p&gt;
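&lt;p&gt;The union step can be sketched as follows. This assumes the producer lists have already been pulled out of each JSON response; &lt;code&gt;union_producers&lt;/code&gt; is a hypothetical helper name, and only the &lt;code&gt;broadcast_address&lt;/code&gt; and &lt;code&gt;tcp_port&lt;/code&gt; fields are relied on:&lt;/p&gt;

```python
def union_producers(lookup_results):
    """Merge producer lists from several nsqlookupd responses.

    lookup_results is an iterable of lists of producer dicts. Each dict
    is assumed to carry at least broadcast_address and tcp_port.
    Returns a dict keyed by "broadcast_address:tcp_port".
    """
    merged = {}
    for producers in lookup_results:
        for producer in producers:
            key = "%s:%d" % (producer["broadcast_address"], producer["tcp_port"])
            # The first occurrence wins; duplicates reported by other
            # nsqlookupd instances are ignored.
            merged.setdefault(key, producer)
    return merged
```

&lt;p&gt;Comparing the keys of the result against the set of currently open connections tells the library which &lt;code&gt;nsqd&lt;/code&gt; producers are new.&lt;/p&gt;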

&lt;p&gt;A periodic timer should be used to repeatedly poll the configured &lt;code&gt;nsqlookupd&lt;/code&gt; so that clients will
automatically discover new producers. The client library should automatically initiate connections
to all newly found &lt;code&gt;nsqd&lt;/code&gt; producers.&lt;/p&gt;

&lt;p&gt;When client library execution begins it should bootstrap this polling process by kicking off an
initial set of requests to the configured &lt;code&gt;nsqlookupd&lt;/code&gt; instances.&lt;/p&gt;

&lt;h3&gt;&lt;a name="connection_handling" id="connection_handling"&gt;&lt;/a&gt;Connection Handling&lt;/h3&gt;

&lt;p&gt;Once a client has an &lt;code&gt;nsqd&lt;/code&gt; producer to connect to (via discovery or manual configuration), it
should open a TCP connection to &lt;code&gt;broadcast_address:tcp_port&lt;/code&gt;. A separate TCP connection should be made
to each &lt;code&gt;nsqd&lt;/code&gt; for each topic the client wants to consume.&lt;/p&gt;

&lt;p&gt;When connecting to an &lt;code&gt;nsqd&lt;/code&gt; instance, the client library should send the following data, in order:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;the magic identifier&lt;/li&gt;
&lt;li&gt;an &lt;code&gt;IDENTIFY&lt;/code&gt; command (and payload) and read/verify response&lt;/li&gt;
&lt;li&gt;a &lt;code&gt;SUB&lt;/code&gt; command (specifying desired topic and channel) and read/verify response&lt;/li&gt;
&lt;li&gt;an initial &lt;code&gt;RDY&lt;/code&gt; count of 1 (see &lt;a href="#rdy_state"&gt;RDY State&lt;/a&gt;).&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;(low-level details on the protocol are available in the &lt;a href="http://protocol.md"&gt;spec&lt;/a&gt;)&lt;/p&gt;
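&lt;p&gt;As a rough illustration of that sequence, here is a sketch that only &lt;em&gt;builds&lt;/em&gt; the bytes to send. The wire details (the &lt;code&gt;V2&lt;/code&gt; magic, newline-terminated commands, a 4 byte size prefix on the &lt;code&gt;IDENTIFY&lt;/code&gt; body) reflect our reading of the protocol spec and should be checked against it; the helper names are hypothetical, and reading/verifying the responses between steps is left out:&lt;/p&gt;

```python
import json
import struct

# Assumption: the version 2 magic identifier is two spaces followed by "V2".
MAGIC_V2 = b"  V2"


def identify_command(payload):
    """Encode IDENTIFY followed by a size-prefixed JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return b"IDENTIFY\n" + struct.pack(">I", len(body)) + body


def subscribe_command(topic, channel):
    """Encode SUB for one topic/channel pair."""
    return ("SUB %s %s\n" % (topic, channel)).encode("utf-8")


def ready_command(count):
    """Encode a RDY update."""
    return ("RDY %d\n" % count).encode("utf-8")


def handshake_bytes(topic, channel, identify_payload):
    """Return the byte strings to write, in order.

    A real client must read and verify the response after IDENTIFY and
    after SUB before moving on; that is omitted here.
    """
    return [
        MAGIC_V2,
        identify_command(identify_payload),
        subscribe_command(topic, channel),
        ready_command(1),
    ]
```

&lt;p&gt;For example, &lt;code&gt;handshake_bytes("clicks", "archive", {"short_id": "app01"})&lt;/code&gt; yields four byte strings, the last of which is the initial &lt;code&gt;RDY 1&lt;/code&gt;.&lt;/p&gt;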

&lt;h3&gt;Reconnection&lt;/h3&gt;

&lt;p&gt;Client libraries should automatically handle reconnection as follows:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;p&gt;If the client is configured with a specific list of &lt;code&gt;nsqd&lt;/code&gt; instances, reconnection should be
handled by delaying the retry attempt in an exponential backoff manner (e.g. try to reconnect in
8s, 16s, 32s, etc., up to a max).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the client is configured to discover instances via &lt;code&gt;nsqlookupd&lt;/code&gt;, reconnection should be
handled automatically based on the polling interval (i.e. if a client disconnects from an &lt;code&gt;nsqd&lt;/code&gt;,
the client library should &lt;em&gt;only&lt;/em&gt; attempt to reconnect if that instance is discovered by a 
subsequent &lt;code&gt;nsqlookupd&lt;/code&gt; polling round). This ensures that clients can learn about producers that 
are introduced to the topology &lt;em&gt;and&lt;/em&gt; ones that are removed (or have failed).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;h3&gt;&lt;a name="feature_negotiation" id="feature_negotiation"&gt;&lt;/a&gt;Feature Negotiation&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;IDENTIFY&lt;/code&gt; command can be used to set &lt;code&gt;nsqd&lt;/code&gt; side metadata, modify client settings,
and negotiate features.  It satisfies two needs:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;In certain cases a client would like to modify how &lt;code&gt;nsqd&lt;/code&gt; interacts with it (currently this 
is limited to modifying a client&amp;#8217;s heartbeat interval but you can imagine this could evolve to 
include enabling compression, TLS, output buffering, etc.)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nsqd&lt;/code&gt; responds to the &lt;code&gt;IDENTIFY&lt;/code&gt; command with a JSON payload that includes important server 
side configuration values that the client should respect while interacting with the instance.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;After connecting, based on the user&amp;#8217;s configuration, a client library should send an &lt;code&gt;IDENTIFY&lt;/code&gt;
command (the body of an &lt;code&gt;IDENTIFY&lt;/code&gt; command is a JSON payload), e.g.:&lt;/p&gt;

&lt;p&gt;&lt;a class="gist" href="https://gist.github.com/mreiferson/5548362"&gt;Example IDENTIFY Command JSON Payload&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;feature_negotiation&lt;/code&gt; field indicates that the client can &lt;em&gt;accept&lt;/em&gt; a JSON payload in return. The
&lt;code&gt;short_id&lt;/code&gt; and &lt;code&gt;long_id&lt;/code&gt; are arbitrary text fields that are used by &lt;code&gt;nsqd&lt;/code&gt; (and &lt;code&gt;nsqadmin&lt;/code&gt;) to
identify clients. &lt;code&gt;heartbeat_interval&lt;/code&gt; configures the interval between heartbeats on a per-client
basis.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;nsqd&lt;/code&gt; will respond &lt;code&gt;OK&lt;/code&gt; if it does not support feature negotiation (introduced in &lt;code&gt;nsqd&lt;/code&gt;
v0.2.20+), otherwise:&lt;/p&gt;

&lt;p&gt;&lt;a class="gist" href="https://gist.github.com/mreiferson/5548366"&gt;Example IDENTIFY Response JSON Payload&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More detail on the use of the &lt;code&gt;max_rdy_count&lt;/code&gt; field is in the &lt;a href="#rdy_state"&gt;RDY State&lt;/a&gt; section.&lt;/p&gt;

&lt;h3&gt;&lt;a name="data_flow" id="data_flow"&gt;&lt;/a&gt;Data Flow and Heartbeats&lt;/h3&gt;

&lt;p&gt;Once a client is in a subscribed state, data flow in the NSQ protocol is asynchronous. For
consumers, this means that in order to build truly robust and performant client libraries they
should be structured using asynchronous network IO loops and/or &amp;#8220;threads&amp;#8221; (the scare quotes are used
to represent both OS-level threads and userland threads, like coroutines).&lt;/p&gt;

&lt;p&gt;Additionally, clients are expected to respond to periodic heartbeats from the &lt;code&gt;nsqd&lt;/code&gt; instances
they&amp;#8217;re connected to. By default this happens at 30 second intervals. The client can respond with
&lt;em&gt;any&lt;/em&gt; command but, by convention, it&amp;#8217;s easiest to simply respond with a &lt;code&gt;NOP&lt;/code&gt; whenever a heartbeat
is received. See the &lt;a href="http://protocol.md"&gt;protocol spec&lt;/a&gt; for specifics on how to identify heartbeats.&lt;/p&gt;

&lt;p&gt;A &amp;#8220;thread&amp;#8221; should be dedicated to reading data off the TCP socket, unpacking the data from the
frame, and performing the multiplexing logic to route the data as appropriate. This is also
conveniently the best spot to handle heartbeats.  At the lowest level, reading the protocol 
involves the following sequential steps:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;read 4 byte big endian uint32 size&lt;/li&gt;
&lt;li&gt;read size bytes data&lt;/li&gt;
&lt;li&gt;unpack data&lt;/li&gt;
&lt;li&gt;&amp;#8230;&lt;/li&gt;
&lt;li&gt;profit&lt;/li&gt;
&lt;li&gt;goto 1&lt;/li&gt;
&lt;/ol&gt;&lt;h3&gt;A Brief Interlude on Errors&lt;/h3&gt;

&lt;p&gt;Because commands and responses are asynchronous, it would take a bit of &lt;em&gt;extra&lt;/em&gt; state tracking in order to
correlate protocol errors with the commands that generated them. Instead, we took the &amp;#8220;fail fast&amp;#8221;
approach so the overwhelming majority of protocol-level error handling is fatal. This means that if
the client sends an invalid command (or gets itself into an invalid state) the &lt;code&gt;nsqd&lt;/code&gt; instance it&amp;#8217;s
connected to will protect itself (and the system) by forcibly closing the connection (and, if
possible, sending an error to the client). This, coupled with the connection handling mentioned
above, makes for a more robust and stable system.&lt;/p&gt;

&lt;p&gt;The only errors that are not fatal are:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;code&gt;E_FIN_FAILED&lt;/code&gt; - The client tried to send a &lt;code&gt;FIN&lt;/code&gt; command for an invalid message ID.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;E_REQ_FAILED&lt;/code&gt; - The client tried to send a &lt;code&gt;REQ&lt;/code&gt; command for an invalid message ID.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;E_TOUCH_FAILED&lt;/code&gt; - The client tried to send a &lt;code&gt;TOUCH&lt;/code&gt; command for an invalid message ID.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Because these errors are most often timing issues, they are not considered fatal. These situations
typically occur when a message times out on the &lt;code&gt;nsqd&lt;/code&gt; side and is re-queued and delivered to
another client. The original recipient is no longer allowed to respond on behalf of that message.&lt;/p&gt;

&lt;h3&gt;&lt;a name="message_handling" id="message_handling"&gt;&lt;/a&gt;Message Handling&lt;/h3&gt;

&lt;p&gt;When the IO loop unpacks a data frame containing a message, it should route that message to the
configured handler for processing.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;nsqd&lt;/code&gt; producer expects to receive a reply within its configured message timeout (default: 60
seconds). There are a few possible scenarios:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;The handler indicates that the message was processed successfully.&lt;/li&gt;
&lt;li&gt;The handler indicates that the message processing was unsuccessful.&lt;/li&gt;
&lt;li&gt;The handler decides that it needs more time to process the message.&lt;/li&gt;
&lt;li&gt;The in-flight timeout expires and &lt;code&gt;nsqd&lt;/code&gt; automatically re-queues the message.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;In the first 3 cases, the client library should send the appropriate command on the client&amp;#8217;s behalf
(&lt;code&gt;FIN&lt;/code&gt;, &lt;code&gt;REQ&lt;/code&gt;, and &lt;code&gt;TOUCH&lt;/code&gt; respectively).&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;FIN&lt;/code&gt; command is the simplest of the bunch. It tells &lt;code&gt;nsqd&lt;/code&gt; that it can safely discard the
message. &lt;code&gt;FIN&lt;/code&gt; can also be used to discard a message that you do not want to process or retry.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;REQ&lt;/code&gt; command tells &lt;code&gt;nsqd&lt;/code&gt; that the message should be re-queued (with an optional parameter
specifying the amount of time to &lt;em&gt;defer&lt;/em&gt; additional attempts). If the optional parameter is not
specified by the client, the client library should automatically calculate the duration in relation
to the number of attempts to process the message (a multiple is typically sufficient). The client
library should discard messages that exceed the configured max attempts. When this occurs, a
user-supplied callback should be executed to notify and enable special handling.&lt;/p&gt;

&lt;p&gt;If the message handler requires more time than the configured message timeout, the &lt;code&gt;TOUCH&lt;/code&gt; command
can be used to reset the timer on the &lt;code&gt;nsqd&lt;/code&gt; side. This can be done repeatedly until the client responds with
either &lt;code&gt;FIN&lt;/code&gt; or &lt;code&gt;REQ&lt;/code&gt;, up to the &lt;code&gt;nsqd&lt;/code&gt; producer&amp;#8217;s configured max timeout.  Client libraries should
never &lt;em&gt;automatically&lt;/em&gt; &lt;code&gt;TOUCH&lt;/code&gt; on behalf of the client.&lt;/p&gt;

&lt;p&gt;If the &lt;code&gt;nsqd&lt;/code&gt; instance receives &lt;em&gt;no&lt;/em&gt; response, the message will time out and be automatically
re-queued for delivery to an available client.&lt;/p&gt;

&lt;p&gt;Finally, a property of each message is the number of attempts. Client libraries should compare this
value against the configured max and discard messages that have exceeded it. When a message is
discarded there should be a callback fired. Typical default implementations of this callback might
include writing to a directory on disk, logging, etc. The user should be able to override the
default handling.&lt;/p&gt;
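&lt;p&gt;Pulling the pieces of this section together, the decision a client library makes for each message might look roughly like this. The &lt;code&gt;Message&lt;/code&gt; tuple, the handler convention (return &lt;code&gt;True&lt;/code&gt; for success), and every default value are assumptions for the sake of the sketch; the &lt;code&gt;REQ&lt;/code&gt; delay is expressed in milliseconds, which is our understanding of the protocol:&lt;/p&gt;

```python
from collections import namedtuple

# Hypothetical stand-in for a library's message object.
Message = namedtuple("Message", ["id", "attempts", "body"])


def requeue_delay_ms(attempts, base_delay_ms=90000, max_delay_ms=900000):
    """Delay grows as a multiple of the attempt count, up to a cap."""
    return min(base_delay_ms * attempts, max_delay_ms)


def handle_message(message, handler, max_attempts=5, on_giving_up=None):
    """Run the handler and return the command to send to nsqd.

    Messages past max_attempts are handed to on_giving_up (if set) and
    finished without running the handler. A handler that raises is
    treated the same as one that reports failure.
    """
    if message.attempts > max_attempts:
        if on_giving_up is not None:
            on_giving_up(message)
        return "FIN %s\n" % message.id
    try:
        succeeded = handler(message)
    except Exception:
        succeeded = False
    if succeeded:
        return "FIN %s\n" % message.id
    return "REQ %s %d\n" % (message.id, requeue_delay_ms(message.attempts))
```

&lt;p&gt;Note that nothing here sends &lt;code&gt;TOUCH&lt;/code&gt;; per the above, that stays an explicit choice of the handler.&lt;/p&gt;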

&lt;h3&gt;&lt;a name="rdy_state" id="rdy_state"&gt;&lt;/a&gt;RDY State&lt;/h3&gt;

&lt;p&gt;Because messages are &lt;em&gt;pushed&lt;/em&gt; from &lt;code&gt;nsqd&lt;/code&gt; to clients we needed a way to manage the flow of data in
userland rather than relying on low-level TCP semantics. A client&amp;#8217;s &lt;code&gt;RDY&lt;/code&gt; state is NSQ&amp;#8217;s flow
control mechanism.&lt;/p&gt;

&lt;p&gt;As outlined in the &lt;a href="#configuration"&gt;configuration section&lt;/a&gt;, a consumer is configured with a
&lt;code&gt;max_in_flight&lt;/code&gt;. This is a concurrency and performance knob, e.g. some downstream systems are able
to more easily batch process messages and benefit greatly from a higher &lt;code&gt;max_in_flight&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When a client connects to &lt;code&gt;nsqd&lt;/code&gt; (and subscribes) it is placed in an initial &lt;code&gt;RDY&lt;/code&gt; state of &lt;code&gt;0&lt;/code&gt;. No
messages will be sent to the client.&lt;/p&gt;

&lt;p&gt;Client libraries have a few responsibilities:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;bootstrap and evenly distribute the configured &lt;code&gt;max_in_flight&lt;/code&gt; to all connections.&lt;/li&gt;
&lt;li&gt;never allow the aggregate sum of &lt;code&gt;RDY&lt;/code&gt; counts for all connections (&lt;code&gt;total_rdy_count&lt;/code&gt;) 
to exceed the configured &lt;code&gt;max_in_flight&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;never exceed the per connection &lt;code&gt;nsqd&lt;/code&gt; configured &lt;code&gt;max_rdy_count&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;expose an API method to reliably indicate message flow starvation&lt;/li&gt;
&lt;/ol&gt;&lt;h3&gt;&lt;a name="bootstrap_and_distribution" id="bootstrap_and_distribution"&gt;&lt;/a&gt;1. Bootstrap and Distribution&lt;/h3&gt;

&lt;p&gt;There are a few considerations when choosing an appropriate &lt;code&gt;RDY&lt;/code&gt; count for a connection (in order
to evenly distribute &lt;code&gt;max_in_flight&lt;/code&gt;):&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;the # of connections is dynamic, often not even known in advance (e.g. when
&lt;em&gt;discovering&lt;/em&gt; producers via &lt;code&gt;nsqlookupd&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;max_in_flight&lt;/code&gt; may be &lt;em&gt;lower&lt;/em&gt; than your number of connections&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;To kickstart message flow a client library needs to send an &lt;em&gt;initial&lt;/em&gt; &lt;code&gt;RDY&lt;/code&gt; count. Because the
eventual number of connections is often not known ahead of time it should start with a value of &lt;code&gt;1&lt;/code&gt;
so that the client library does not unfairly favor the first connection(s).&lt;/p&gt;

&lt;p&gt;Additionally, after each message is processed, the client library should evaluate whether or not
it&amp;#8217;s time to update &lt;code&gt;RDY&lt;/code&gt; state. An update should be triggered if the current value is &lt;code&gt;0&lt;/code&gt; or if it
is below ~25% of the last value sent.&lt;/p&gt;
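&lt;p&gt;That trigger is simple enough to write down directly. A sketch, where &lt;code&gt;remaining&lt;/code&gt; is the connection&amp;#8217;s current RDY count and &lt;code&gt;last_sent&lt;/code&gt; is the last value sent to &lt;code&gt;nsqd&lt;/code&gt; (both names, and the exact 0.25 threshold, are illustrative):&lt;/p&gt;

```python
def should_update_rdy(remaining, last_sent, threshold=0.25):
    """Decide whether it is time to send a fresh RDY count.

    True when nothing is left, or when what is left has dropped below
    the given fraction of the last RDY value sent.
    """
    if remaining == 0:
        return True
    return last_sent * threshold > remaining
```

&lt;p&gt;With a last sent value of 100, an update goes out once the remaining count drops to 24 or lower.&lt;/p&gt;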

&lt;p&gt;The client library should always attempt to evenly distribute &lt;code&gt;RDY&lt;/code&gt; count across all connections.
Typically, this is implemented as &lt;code&gt;max_in_flight / num_conns&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;However, when &lt;code&gt;max_in_flight &amp;lt; num_conns&lt;/code&gt; this simple formula isn&amp;#8217;t sufficient. In this state,
client libraries should perform a dynamic runtime evaluation of producer &amp;#8220;liveness&amp;#8221; by measuring the
duration of time since it last received a message for a connection. After a configurable expiration,
it should &lt;em&gt;re-distribute&lt;/em&gt; whatever &lt;code&gt;RDY&lt;/code&gt; count is available to a new (random) set of producers. By
doing this, you guarantee that you&amp;#8217;ll (eventually) find producers with messages. Clearly this has a
latency impact.&lt;/p&gt;
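&lt;p&gt;Both situations can be sketched in one place. This is an illustration under assumptions rather than the reference implementation: &lt;code&gt;redistribute&lt;/code&gt; is a hypothetical name, connections are represented by hashable ids, and the &amp;#8220;liveness&amp;#8221; timer that decides &lt;em&gt;when&lt;/em&gt; to call it is left out:&lt;/p&gt;

```python
import random


def per_connection_rdy(max_in_flight, num_conns):
    """Even share per connection, rounded down."""
    if num_conns == 0:
        return 0
    return max_in_flight // num_conns


def redistribute(conn_ids, max_in_flight, rng=random):
    """Map every connection id to a RDY count.

    The counts never sum to more than max_in_flight. When there are
    more connections than max_in_flight, a random subset gets RDY 1 and
    the rest get RDY 0; calling this again later picks a new subset.
    """
    conn_ids = list(conn_ids)
    if not conn_ids:
        return {}
    share = per_connection_rdy(max_in_flight, len(conn_ids))
    if share >= 1:
        return {conn_id: share for conn_id in conn_ids}
    chosen = set(rng.sample(conn_ids, max_in_flight))
    return {conn_id: (1 if conn_id in chosen else 0) for conn_id in conn_ids}
```

&lt;p&gt;For three connections and a &lt;code&gt;max_in_flight&lt;/code&gt; of 10 each connection gets 3, which also shows the rounding down discussed under message flow starvation below.&lt;/p&gt;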

&lt;h3&gt;2. Maintaining &lt;code&gt;max_in_flight&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;The client library should maintain a ceiling for the maximum number of messages in flight for a
given consumer. Specifically, the aggregate sum of each connection&amp;#8217;s &lt;code&gt;RDY&lt;/code&gt; count should never exceed
the configured &lt;code&gt;max_in_flight&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Below is example code in Python to determine whether or not the proposed RDY count is valid for a
given connection:&lt;/p&gt;

&lt;p&gt;&lt;a class="gist" href="https://gist.github.com/mreiferson/5548381"&gt;send_ready.py&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;3. Producer Max RDY Count&lt;/h3&gt;

&lt;p&gt;Each &lt;code&gt;nsqd&lt;/code&gt; is configurable with a &lt;code&gt;--max-rdy-count&lt;/code&gt; (see &lt;a href="#feature_negotiation"&gt;feature
negotiation&lt;/a&gt; for more information on the handshake a client can perform to
ascertain this value). If the client sends a &lt;code&gt;RDY&lt;/code&gt; count that is outside of the acceptable range its
connection will be forcefully closed. For backwards compatibility, this value should be assumed to
be &lt;code&gt;2500&lt;/code&gt; if the &lt;code&gt;nsqd&lt;/code&gt; instance does not support &lt;a href="#feature_negotiation"&gt;feature negotiation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;4. Message Flow Starvation&lt;/h3&gt;

&lt;p&gt;Finally, the client library should provide an API method to indicate message flow starvation. It is
insufficient for clients (in their message handlers) to simply compare the number of messages they
have in-flight vs. their configured &lt;code&gt;max_in_flight&lt;/code&gt; in order to decide to &amp;#8220;process a batch&amp;#8221;. There
are two cases when this is problematic:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;When clients configure &lt;code&gt;max_in_flight &amp;gt; 1&lt;/code&gt;, due to variable &lt;code&gt;num_conns&lt;/code&gt;, there are
cases where &lt;code&gt;max_in_flight&lt;/code&gt; is not evenly divisible by &lt;code&gt;num_conns&lt;/code&gt;. Because the contract states 
that you should never &lt;em&gt;exceed&lt;/em&gt; &lt;code&gt;max_in_flight&lt;/code&gt;, you must round down, and you end up with cases 
where the sum of all &lt;code&gt;RDY&lt;/code&gt; counts is less than &lt;code&gt;max_in_flight&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Consider the case where only a subset of producers have messages. Because of the expected &lt;a href="#bootstrap_and_distribution"&gt;even 
distribution&lt;/a&gt; of &lt;code&gt;RDY&lt;/code&gt; count, those active producers only have a 
fraction of the configured &lt;code&gt;max_in_flight&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;In both cases, a client will never actually receive &lt;code&gt;max_in_flight&lt;/code&gt; # of messages. Therefore, the
client library should expose a method &lt;code&gt;is_starved&lt;/code&gt; that will evaluate whether &lt;em&gt;any&lt;/em&gt; of the
connections are starved, as follows:&lt;/p&gt;

&lt;p&gt;&lt;a class="gist" href="https://gist.github.com/mreiferson/5548391"&gt;is_starved.py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;is_starved&lt;/code&gt; method should be used by message handlers to reliably identify when to process a
batch of messages.&lt;/p&gt;

&lt;h3&gt;&lt;a name="backoff" id="backoff"&gt;&lt;/a&gt;Backoff&lt;/h3&gt;

&lt;p&gt;The question of what to do when message processing fails is a complicated one to answer. The
&lt;a href="#message_handling"&gt;message handling&lt;/a&gt; section detailed client library behavior that would &lt;em&gt;defer&lt;/em&gt;
the processing of failed messages for some (increasing) duration of time. The other piece of the
puzzle is whether or not to reduce throughput. The interplay between these two pieces of
functionality is crucial for overall system stability.&lt;/p&gt;

&lt;p&gt;By slowing down the rate of processing, or &amp;#8220;backing off&amp;#8221;, the consumer allows the downstream system
to recover from transient failure. However, this behavior should be configurable as it isn&amp;#8217;t always
desirable, such as situations where &lt;em&gt;latency&lt;/em&gt; is prioritized.&lt;/p&gt;

&lt;p&gt;Backoff should be implemented by sending &lt;code&gt;RDY 0&lt;/code&gt; to the appropriate producers, stopping message
flow. The duration of time to remain in this state should be calculated based on the number of
repeated failures (exponential). Similarly, successful processing should reduce this duration until
the reader is no longer in a backoff state.&lt;/p&gt;

&lt;p&gt;While a reader is in a backoff state, after the timeout expires, the client library should only ever
send &lt;code&gt;RDY 1&lt;/code&gt; regardless of &lt;code&gt;max_in_flight&lt;/code&gt;. This effectively &amp;#8220;tests the waters&amp;#8221; before returning to
full throttle. Additionally, during a backoff timeout, the client library should ignore any success
or failure results with respect to calculating backoff duration (i.e. it should only take into
account &lt;em&gt;one&lt;/em&gt; result per backoff timeout).&lt;/p&gt;
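&lt;p&gt;The bookkeeping behind this can be sketched as a small class. Everything here is an assumption made for illustration (the class name, the 1 second base, the 128 second cap, doubling per failure level); the rule of counting only one result per backoff timeout is left to the caller:&lt;/p&gt;

```python
class BackoffTimer:
    """Tracks a failure level and derives an exponential duration."""

    def __init__(self, base_seconds=1.0, max_seconds=128.0):
        self.base_seconds = base_seconds
        self.max_seconds = max_seconds
        self.level = 0  # 0 means the reader is not backing off

    def failure(self):
        """Record a failed result: back off for longer next time."""
        self.level += 1

    def success(self):
        """Record a successful result: step back toward full speed."""
        self.level = max(self.level - 1, 0)

    def in_backoff(self):
        return self.level > 0

    def duration(self):
        """Seconds to stay at RDY 0 for the current level."""
        if self.level == 0:
            return 0.0
        return min(self.base_seconds * 2 ** (self.level - 1), self.max_seconds)


def next_rdy(timer, normal_rdy):
    """RDY count to send when a backoff timeout expires.

    While still in backoff this is always 1, regardless of
    max_in_flight; otherwise the normal per-connection value applies.
    """
    return 1 if timer.in_backoff() else normal_rdy
```

&lt;p&gt;After three consecutive failures this timer sits at level 3 with a 4 second timeout, and it takes three successful &amp;#8220;test&amp;#8221; messages to return to full throttle.&lt;/p&gt;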

&lt;p&gt;&lt;img src="http://media.tumblr.com/7adbf06362cc6530153ef35b4dacf2cb/tumblr_inline_mmjev3stkE1qz4rgp.png" alt="client flow"/&gt;&lt;/p&gt;

&lt;h3&gt;Bringing It All Together&lt;/h3&gt;

&lt;p&gt;Distributed systems are fun.&lt;/p&gt;

&lt;p&gt;The interactions between the various components of an NSQ cluster work in concert to provide a
platform on which to build robust, performant, and stable infrastructure. We hope this guide shed
some light as to how important the client&amp;#8217;s role is.&lt;/p&gt;

&lt;p&gt;In terms of actually implementing all of this, we treat &lt;a href="https://github.com/bitly/pynsq"&gt;pynsq&lt;/a&gt; and &lt;a href="https://github.com/bitly/nsq/tree/master/nsq"&gt;go-nsq&lt;/a&gt; as our
reference codebases. The structure of &lt;a href="https://github.com/bitly/pynsq"&gt;pynsq&lt;/a&gt; can be broken down into three core components:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;code&gt;Message&lt;/code&gt; - a high-level message object, which exposes stateful methods for responding to the 
&lt;code&gt;nsqd&lt;/code&gt; producer (&lt;code&gt;FIN&lt;/code&gt;, &lt;code&gt;REQ&lt;/code&gt;, &lt;code&gt;TOUCH&lt;/code&gt;, etc.) as well as metadata such as attempts and timestamp.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Connection&lt;/code&gt; - a high-level wrapper around a TCP connection to a specific &lt;code&gt;nsqd&lt;/code&gt; producer, which 
has knowledge of in flight messages, its &lt;code&gt;RDY&lt;/code&gt; state, negotiated features, and various timings.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Reader&lt;/code&gt; - the front-facing API a user interacts with, which handles discovery, creates 
connections (and subscribes), bootstraps and manages &lt;code&gt;RDY&lt;/code&gt; state, parses raw incoming data, 
creates &lt;code&gt;Message&lt;/code&gt; objects, and dispatches messages to handlers.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;We&amp;#8217;re happy to help support anyone interested in building client libraries for NSQ. We&amp;#8217;re looking
for contributors to continue to expand our language support as well as flesh out functionality in
existing libraries. The community has already open sourced client libraries for 7 languages:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/bitly/nsq/tree/master/nsq"&gt;go-nsq&lt;/a&gt; Go (official)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bitly/pynsq"&gt;pynsq&lt;/a&gt; Python (official)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mreiferson/libnsq"&gt;libnsq&lt;/a&gt; C&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bitly/nsq-java"&gt;nsq-java&lt;/a&gt; Java&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dustismo/TrendrrNSQClient"&gt;TrendrrNSQClient&lt;/a&gt; Java&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/domwong/nsqjava"&gt;nsqjava&lt;/a&gt; Java&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jmanero/nsq-client"&gt;nsq-client&lt;/a&gt; Node.js&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/phillro/nodensq"&gt;nodensq&lt;/a&gt; Node.js&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/davegardnerisme/nsqphp"&gt;nsqphp&lt;/a&gt; PHP&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ClarityServices/ruby_nsq"&gt;ruby_nsq&lt;/a&gt; Ruby&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;If you&amp;#8217;re interested in hacking on NSQ, check out the &lt;a href="https://github.com/bitly/nsq"&gt;GitHub repo&lt;/a&gt;. We also publish &lt;a href="https://github.com/bitly/nsq/blob/master/INSTALLING.md#binary"&gt;binary
distributions&lt;/a&gt; for Linux and Darwin to make it easy to get started.&lt;/p&gt;

&lt;p&gt;If you have any specific questions, feel free to ping me on Twitter
&lt;a href="http://twitter.com/imsnakes"&gt;@imsnakes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally, a huge thank you to the crew at bitly: &lt;a href="http://twitter.com/mccutchen"&gt;@mccutchen&lt;/a&gt; and &lt;a href="http://twitter.com/ploxiln"&gt;@ploxiln&lt;/a&gt; for their tireless review, and &lt;a href="http://twitter.com/jehiah"&gt;@jehiah&lt;/a&gt;, my partner in crime on NSQ.&lt;/p&gt;

&lt;div class="postmeta"&gt; by &lt;a href="http://twitter.com/imsnakes"&gt;snakes&lt;/a&gt;&lt;/div&gt;</description><link>http://word.bitly.com/post/50027069647</link><guid>http://word.bitly.com/post/50027069647</guid><pubDate>Thu, 09 May 2013 14:58:23 -0400</pubDate><category>nsq</category><category>message queue</category><category>infrastructure</category><category>client library</category></item><item><title>Speeding things up with Redshift</title><description>&lt;p&gt;Recently we&amp;#8217;ve started to experiment with using &lt;a href="http://aws.amazon.com/redshift/"&gt;Redshift&lt;/a&gt;, Amazon&amp;#8217;s
new data warehousing service.  More specifically, we&amp;#8217;re using it to
speed up and expand our ad hoc data analysis.&lt;/p&gt;

&lt;h2&gt;The Challenge&lt;/h2&gt;

&lt;p&gt;bitly sees billions of clicks and shortens each month.  Often we have
various questions about the data generated from this activity.  Sometimes
these questions are driven by business needs (how much traffic do we see from 
a potential enterprise customer), sometimes they are more technically driven 
(how much traffic will a new sub-system need to deal with), and sometimes we 
like to just have fun (what are the top trashy celeb stories this week).&lt;/p&gt;

&lt;p&gt;Unfortunately, when working with that volume of data it can be pretty 
difficult to do much of anything quickly.  Pre-Redshift, all of these questions
were answered by writing map-reduce jobs to be run on our Hadoop cluster or on
Amazon&amp;#8217;s &lt;a href="http://aws.amazon.com/elasticmapreduce/"&gt;EMR&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Whenever we wanted to answer a question with our data, the process would look
something like this:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;Write map-reduce job in Python&lt;/li&gt;
&lt;li&gt;Run it on some local test data&lt;/li&gt;
&lt;li&gt;Fix bugs.&lt;/li&gt;
&lt;li&gt;Run it on the Hadoop cluster&lt;/li&gt;
&lt;li&gt;Wait 20-30 minutes for results&lt;/li&gt;
&lt;li&gt;Get an error back from Hadoop&lt;/li&gt;
&lt;li&gt;Dig through the logs to find the error.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GOTO 3&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;This is clearly not ideal when all you want to do is get a simple count.&lt;/p&gt;

&lt;p&gt;For a lot of the work we do, Hadoop and Python make for an awesome combination,
but for these ad hoc aggregation queries they&amp;#8217;re very blunt instruments.  Both
are general purpose tools that are super flexible, but slow and difficult to
use for this specific use case.&lt;/p&gt;

&lt;p&gt;Redshift, on the other hand, is specifically built and optimized for doing 
aggregation queries over large sets of data.  When we want to answer a 
question with Redshift, we just write a SQL query and get an answer within a
few minutes—if not seconds.&lt;/p&gt;

&lt;p&gt;Overall, our experience with Redshift has been a positive one but we have run
into some gotchas that we&amp;#8217;ll get into below.&lt;/p&gt;

&lt;h2&gt;The Good News&lt;/h2&gt;

&lt;h3&gt;User Experience&lt;/h3&gt;

&lt;p&gt;From a user perspective, we&amp;#8217;re really happy with Redshift.  Any one of our 
developers or data scientists just needs to write a SQL query and they have an
answer to their question in less than 5 minutes.  Moving from our old Hadoop-based
workflow to an interactive console session with Redshift is a major
improvement.&lt;/p&gt;

&lt;p&gt;Additionally, since many of the user-facing bits of Redshift are based on 
PostgreSQL, there is a large ecosystem of mature, well-documented tools and 
libraries for us to take advantage of.&lt;/p&gt;

&lt;p&gt;Finally, while it can be a bit slow at times, we&amp;#8217;ve been very impressed with
the web management console Amazon provides with Redshift.  For a 1.0 product,
the console is comprehensive and offers much more information than we expected
it to.&lt;/p&gt;

&lt;h3&gt;Performance&lt;/h3&gt;

&lt;p&gt;For our current use case of ad hoc research queries, Redshift&amp;#8217;s performance is
adequate.  Most queries return a response in less than five minutes and we 
rarely have many users executing queries concurrently.&lt;/p&gt;

&lt;p&gt;That being said, we have done some experimentation with competing products 
(e.g. Vertica) and have seen better performance out of those tools.  This is 
especially true for more complex queries that benefit from 
projections/secondary indexes and situations where the cluster&amp;#8217;s resources are
under contention.&lt;/p&gt;

&lt;h3&gt;Documentation&lt;/h3&gt;

&lt;p&gt;Just like the rest of AWS, Amazon provides reasonably comprehensive and 
thorough documentation.  For everything that is directly exposed to us
as users (e.g. loading operations, configuration params, etc) we are very 
happy with the documentation.  The only places where we felt we wanted more 
information were those where Amazon makes things &amp;#8220;just work&amp;#8221;.  Most 
significantly we would like to see more details about what exactly happens 
when a node in the cluster fails and how the cluster is expected to behave 
in that state.&lt;/p&gt;

&lt;h3&gt;Cost&lt;/h3&gt;

&lt;p&gt;We wouldn&amp;#8217;t go so far as to call Redshift cheap, but compared to many 
competitors it is pretty cost effective.  The biggest gotcha here is that
while the simple model for scaling Redshift clusters and tuning performance
within a cluster is nice as a user, it does mean that you have a bit of a 
one-size-fits-all situation.&lt;/p&gt;

&lt;p&gt;In our case we are compute and I/O constrained, so we&amp;#8217;re paying for a 
bunch of storage capacity and memory that we don&amp;#8217;t use.  At our current scale,
things work out okay, but as we continue to grow it may make sense to take 
advantage of something else that is more flexible in terms of both hardware 
and tuning.&lt;/p&gt;

&lt;h2&gt;The Bad News&lt;/h2&gt;

&lt;h3&gt;Data Loading&lt;/h3&gt;

&lt;p&gt;We had to spend a lot of time getting our data into Redshift.  This is 
partially our fault since our dataset is not the cleanest in the world,
but overall this is the place that we felt the most pain from an immature 
tool chain.&lt;/p&gt;

&lt;p&gt;The majority of bitly&amp;#8217;s at-rest data is stored in line-oriented (i.e. one JSON
blob per line) JSON files.  The primary way to load data into Redshift is to 
use the &lt;a href="http://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html"&gt;COPY&lt;/a&gt; command to point the cluster at a pile of CSV or TSV
files stored on S3. Clearly we needed to build out some kind of tool chain to 
transform our JSON logs into flat files.&lt;/p&gt;

&lt;p&gt;Initially, we just wrote a simple Python script that would do the 
transformation.  Unfortunately, we quickly discovered that this simple
approach would be too slow.  We estimated that it would have taken a month to 
process and load all the data we wanted in Redshift.&lt;/p&gt;

&lt;p&gt;Next, we realized that we had a tool for easily doing highly parallelized, 
distributed text processing: Hadoop.  Accordingly, we re-worked our quick 
Python script into a Hadoop job to transform our logs in a big batch.  Since 
we already keep a copy of our raw logs in S3, EMR proved to be a great tool
for this.&lt;/p&gt;

&lt;p&gt;Overall this process worked well but we did still run into a few gotchas 
loading the flattened data into Redshift:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;Redshift only supports a subset of UTF-8: characters can be at most
3 bytes long.  Amazon does mention this in their &lt;a href="http://docs.aws.amazon.com/redshift/latest/dg/t_preparing-input-data.html"&gt;docs&lt;/a&gt;, 
but it still bit us a few times.&lt;/li&gt;
&lt;li&gt;varchar field lengths are in terms of bytes, not characters.  In Python, 
byte strings treat the two as one and the same; unicode strings do not.  It
took us a little while to realize this was happening and to get our code set
up to do byte length truncations without truncating mid-character in unicode
strings.  To get an idea of how we went about doing unicode aware truncation,
check out this relevant 
&lt;a href="http://stackoverflow.com/questions/13727977/truncating-string-to-byte-length-in-python"&gt;stack overflow thread&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Moving floats between different systems always has some issues.  In Python
the default string output of small float values is scientific notation 
(e.g. 3.32e-09).  Unfortunately Redshift doesn&amp;#8217;t recognize this format so we needed
to force our data prep job to always output floats in decimal format.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Now that we have a large body of data loaded into Redshift, we&amp;#8217;re working on
building out tooling based on &lt;a href="https://github.com/bitly/nsq"&gt;NSQ&lt;/a&gt; to do our data prep work in a
streaming fashion that should allow us to easily do smaller incremental 
updates.&lt;/p&gt;
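
&lt;p&gt;The three loading gotchas listed above can be sketched in a few lines of Python. This is a minimal
illustration written for Python 3, not the code from our data prep job: the helper names
(&lt;code&gt;strip_four_byte_chars&lt;/code&gt;, &lt;code&gt;truncate_utf8&lt;/code&gt;, &lt;code&gt;format_float&lt;/code&gt;) and the default of 12
decimal places are assumptions made for the example.&lt;/p&gt;

```python
def strip_four_byte_chars(text):
    """Drop characters whose UTF-8 encoding needs 4 bytes."""
    return "".join(ch for ch in text if len(ch.encode("utf-8")) != 4)


def truncate_utf8(text, max_bytes):
    """Cut text so its UTF-8 encoding fits in max_bytes.

    Slicing the encoded bytes may split the last character; decoding
    with errors="ignore" silently drops that partial character.
    """
    encoded = text.encode("utf-8")[:max_bytes]
    return encoded.decode("utf-8", errors="ignore")


def format_float(value, places=12):
    """Render a float in plain decimal notation, never scientific."""
    fixed = "{:.{}f}".format(value, places)
    return fixed.rstrip("0").rstrip(".")


if __name__ == "__main__":
    print(truncate_utf8("h\u00e9llo", 2))   # the 2-byte e-acute does not fit
    print(format_float(3.32e-09))
```

&lt;p&gt;Values smaller than the chosen number of decimal places collapse to zero, so the precision should be
picked to match the column being loaded.&lt;/p&gt;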

&lt;p&gt;In the end, we worked through our data loading issues with Redshift but it was
one of the more acute pain points we encountered.  From our conversations with
Amazon, they&amp;#8217;re definitely aware of this and we&amp;#8217;re interested to see what 
they&amp;#8217;ll come out with, but for now the provided tooling is pretty limited.&lt;/p&gt;

&lt;h3&gt;Availability&lt;/h3&gt;

&lt;p&gt;Long term, there are a number of periodic and online tasks that we&amp;#8217;re thinking
about using a tool like Redshift for.  Unfortunately, as things stand today we
would not be comfortable relying on Redshift as a highly available system.&lt;/p&gt;

&lt;p&gt;Currently, if any node within a Redshift cluster fails, all outstanding 
queries will fail and Amazon will automatically start replacing the failed
node.  In theory this recovery should happen very quickly.  The cluster
will be available for querying as soon as the replacement node is added,
but performance on the cluster will be degraded while data is restored 
on to the new node.&lt;/p&gt;

&lt;p&gt;At this point, we have no data on how well this recovery process works in
the real world.  Additionally, we have concerns about how well this process
will work when there are larger issues happening within an availability zone
or region.  Historically there are a number of cases where issues within one
Amazon service (e.g. EBS) have cascaded into other services leading to long
periods of unavailability or degradation.&lt;/p&gt;

&lt;p&gt;Until there&amp;#8217;s a significant track record behind a system like this, we&amp;#8217;re 
hesitant to trust anything that will &amp;#8220;automatically recover&amp;#8221;.&lt;/p&gt;

&lt;p&gt;There is the option of running a mirrored Redshift cluster in a different AZ
or region, but that gets expensive fast.  Additionally, we&amp;#8217;d have to build
out even more tooling to make sure those two clusters stay in sync with each
other.&lt;/p&gt;

&lt;h3&gt;Limited Feature Set&lt;/h3&gt;

&lt;p&gt;Redshift is very impressive feature-wise for a 1.0 product.  That being said,
a number of the competing products have been around for a while and offer some
major features that Redshift lacks.&lt;/p&gt;

&lt;p&gt;The biggest missing feature for us with Redshift is some kind of secondary 
indexing or projections.  Right now, if you sort or filter by any
column other than the &lt;a href="http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html"&gt;SORTKEY&lt;/a&gt;, Redshift will do a full table 
scan.  Technically you could create a copy of that table with a different sort
key but then it would become your problem to keep those tables in sync and to 
query the right table at the right time.&lt;/p&gt;

&lt;p&gt;Some other &amp;#8220;missing&amp;#8221; features include tools for working with time series data,
geospatial query tools, advanced HA features, and more mature data loading 
tools.&lt;/p&gt;

&lt;h2&gt;The Bottom Line&lt;/h2&gt;

&lt;p&gt;Redshift is great for our needs today, but we&amp;#8217;ll see if Amazon&amp;#8217;s development
keeps up as our needs change going forward.  Given Amazon&amp;#8217;s impressive track
record for quickly iterating on and improving products we&amp;#8217;re hopeful, but we 
do have our eyes on competing products as our use of data warehousing tools
matures.&lt;/p&gt;

&lt;p&gt;by &lt;a href="http://twitter.com/theSeanOC"&gt;Sean O&amp;#8217;Connor&lt;/a&gt;&lt;/p&gt;</description><link>http://word.bitly.com/post/48854093418</link><guid>http://word.bitly.com/post/48854093418</guid><pubDate>Thu, 25 Apr 2013 10:30:19 -0400</pubDate></item><item><title>Securing Internal Applications</title><description>&lt;p&gt;A common infrastructure problem is managing access to internal applications. Some patterns for solving that problem
include: fronting internal apps with HTTP Basic or HTTP Digest authentication through Apache or nginx, running the apps
on secret ports, and bundling an authentication system as part of each application. Many off-the-shelf applications
include no built-in authentication and count on upstream proxies/systems for access control, which limits options for
securing them. &lt;strong&gt;&lt;a href="https://github.com/bitly/google_auth_proxy"&gt;Google Auth Proxy&lt;/a&gt;&lt;/strong&gt; is a tool we have developed to give us a better ability to
secure internal applications by using Google Account authentication (like many other startups, we rely on Google Apps).&lt;/p&gt;

&lt;p&gt;At bitly, we like fronting HTTP applications with Nginx and use that approach for nearly all of our stack, from Tornado
apps handling our core APIs to common internal tools (e.g. Nagios, Munin, Graphite) to homegrown tools like &lt;a href="https://github.com/bitly/nsq"&gt;NSQ&lt;/a&gt;
and our deploy system. For many of those systems, we took advantage of this by configuring HTTP Basic authentication to
restrict access. Using Basic Auth however led to situations where the authentication information would be stored in
configuration files and scripts, leaking it throughout the repository.&lt;/p&gt;

&lt;p&gt;Some of our homegrown tools initially relied on bitly&amp;#8217;s &lt;a href="http://dev.bitly.com/authentication.html"&gt;OAuth 2 service for authentication&lt;/a&gt;, which meant
that sign in was managed through bitly accounts by the OAuth flow. This left each application to handle
authorization by maintaining a list of bitly accounts that actually had permission to access the app. This was an
improvement over HTTP Basic Auth as each application did not need to accept or store passwords, but it was less than
ideal because each app still needed to store and manage its own list of authorized users. Over time, this led to a large
number of disparate systems with separate authorization lists. This approach was only feasible for internal tools and
thus couldn&amp;#8217;t apply to other open source applications we were using.&lt;/p&gt;

&lt;p&gt;We developed &lt;a href="https://github.com/bitly/google_auth_proxy"&gt;Google Auth Proxy&lt;/a&gt; as a new tool to use in securing internal applications. It is an HTTP
reverse proxy that provides authentication using &lt;a href="https://developers.google.com/accounts/docs/OAuth2"&gt;Google&amp;#8217;s OAuth2 API&lt;/a&gt; with flexibility to authorize
individual Google Accounts (by email address) or a whole Google Apps domain. 
For internal applications, this is convenient because we can now allow our whole &lt;code&gt;@bit.ly&lt;/code&gt; Google Apps domain without
separately managing accounts or passwords.&lt;/p&gt;

&lt;p&gt;Google Auth Proxy requires a limited set of privileges in order to authenticate users (asking for only the
&lt;code&gt;userinfo.email&lt;/code&gt; and &lt;code&gt;userinfo.profile&lt;/code&gt;
OAuth2&amp;#160;&lt;a href="https://developers.google.com/accounts/docs/OAuth2Login#scopeparameter"&gt;scopes&lt;/a&gt;). This authentication information
is then passed to the upstream application as HTTP Basic Auth (with an empty value for the password), and in an HTTP
header as &lt;code&gt;X-Forwarded-User&lt;/code&gt; for applications that need that context.&lt;/p&gt;
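
&lt;p&gt;As a rough sketch of what an upstream application might do with that information, the Python helper
below reads the user from a dictionary of request headers. The function name &lt;code&gt;upstream_user&lt;/code&gt; and
the plain-dict representation of headers are assumptions for illustration; the exact value the proxy
places in the user field is treated as opaque here.&lt;/p&gt;

```python
import base64


def upstream_user(headers):
    """Return the user forwarded by the proxy, or None if absent.

    Prefers the X-Forwarded-User header and falls back to the user
    part of an HTTP Basic Auth header (the password is expected to
    be empty and is ignored).
    """
    user = headers.get("X-Forwarded-User")
    if user:
        return user
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "basic" or not token:
        return None
    decoded = base64.b64decode(token).decode("utf-8")
    username, _, _password = decoded.partition(":")
    return username or None
```

&lt;p&gt;An application behind the proxy should only trust these headers when it is reachable exclusively
through the proxy; otherwise any client could set them directly.&lt;/p&gt;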

&lt;p&gt;Google Auth Proxy has been in production for 6 months and has helped reduce overhead for managing internal applications
(no setup time for new or removed accounts) and has made it easier for us to open up access for important tools to the
entire company. It&amp;#8217;s also worth mentioning that if you use
two-factor authentication with your Google accounts, that added security carries over to
Google Auth Proxy.&lt;/p&gt;

&lt;p&gt;We chose to write Google Auth Proxy in &lt;a href="http://golang.org/"&gt;Go&lt;/a&gt; because Go has built-in support for writing a concurrent
&lt;a href="http://golang.org/pkg/net/http/httputil/#ReverseProxy"&gt;ReverseProxy&lt;/a&gt; in the &lt;code&gt;net/http/httputil&lt;/code&gt; package. This meant that little work was needed to proxy
requests and we could focus on just writing the authentication layer.&lt;/p&gt;

&lt;p&gt;If you find this useful, we&amp;#8217;d love to hear about it &lt;a href="http://twitter.com/bitly"&gt;@bitly&lt;/a&gt;.&lt;/p&gt;

&lt;div class="postmeta"&gt; by &lt;a href="http://twitter.com/jehiah"&gt;jehiah&lt;/a&gt;&lt;/div&gt;</description><link>http://word.bitly.com/post/47548678256</link><guid>http://word.bitly.com/post/47548678256</guid><pubDate>Tue, 09 Apr 2013 13:19:51 -0400</pubDate></item><item><title>Forget Table</title><description>&lt;blockquote&gt;
  &lt;p&gt;&lt;a href="http://bitly.github.com/forgettable/"&gt;Forget Table&lt;/a&gt; is a solution to the problem of
  storing the &lt;em&gt;recent&lt;/em&gt; dynamics of &lt;a href="http://en.wikipedia.org/wiki/Categorical_distribution"&gt;categorical
  distributions&lt;/a&gt; that
  change over time (ie: non-stationary distributions).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What does this mean? Why is this useful? What makes this the most important
thing since sliced bread?! Read on!&lt;/p&gt;

&lt;p&gt;Storing distributions is crucial for doing any sort of statistics on a dataset.
Normally, this problem is as easy as keeping counters of how many times we&amp;#8217;ve
seen certain quantities, but when dealing with data streams that constantly
deliver new data, these simple counters no longer do the job.  They fail
because data we saw weeks ago has the same weight as data we have just seen,
even though the fundamental distribution the data is describing may be
changing.  In addition, unbounded counters pose an engineering challenge, since they
would strictly grow and very soon use all the resources available to them.  This
is why we created &lt;a href="http://bitly.github.com/forgettable"&gt;Forget Table&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Background&lt;/h3&gt;

&lt;p&gt;A &lt;a href="http://en.wikipedia.org/wiki/Categorical_distribution"&gt;categorical
distribution&lt;/a&gt; describes
the probability of seeing an event occur out of a set of possible events. So an
example of this at bitly is that every click coming through our system comes
from one of a fixed number of country codes (&lt;a href="http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2"&gt;there are about
260&lt;/a&gt;). We would like to
maintain a categorical distribution that assigns a probability to each
country, describing how likely it is that any given click comes from that country.
With bitly&amp;#8217;s data, this is going to give a high weight to the US and lower
weights to Japan and Brazil, for example.&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s very useful for bitly to store distributions like this, as it gives us a
good idea as to what&amp;#8217;s considered &amp;#8216;normal&amp;#8217;. Most of the time we produce
analytics that show, for example, how many clicks came from a given country, or
referrer. While this gives our clients a sense of where their traffic is coming
from, it doesn&amp;#8217;t directly express how surprising their traffic is.  This can be
remedied by maintaining a distribution over countries for &lt;em&gt;all&lt;/em&gt; of bitly, so
that we can identify anomalous traffic, ie: when a particular link is getting
disproportionally more clicks from Oregon than we would normally expect.&lt;/p&gt;

&lt;p&gt;The difficulty that Forget Table deals with comes from the fact that what bitly
considers normal changes constantly. At 8am on the East Coast of the US, we&amp;#8217;d
be surprised to see a lot of traffic coming from San Francisco (they&amp;#8217;re all
still asleep over there); however, at 11am EST we should expect to see tons of
traffic coming from San Francisco. So unless we allow our understanding of
what&amp;#8217;s considered normal to change over time, we are going to have a skewed
idea of normality.&lt;/p&gt;

&lt;p&gt;This is a general problem over longer time scales, too. The behavior of the
social web is different in December than it is in July, and it&amp;#8217;s different in
2012 than it was in 2011. We need to make sure that our understanding of
normality changes with time. One way of achieving this is to forget old data
that&amp;#8217;s no longer relevant to our current understanding. This is where Forget
Table comes in.&lt;/p&gt;

&lt;h3&gt;Why forget in the first place?&lt;/h3&gt;

&lt;p&gt;The basic problem is that the fundamental distribution (ie: the distribution
that the data is &lt;em&gt;actually&lt;/em&gt; being created from) is changing.  This property is
called being &amp;#8220;non-stationary&amp;#8221;.  Simply storing counts to represent your
categorical distribution makes this problem worse because it keeps the same
weight for data seen weeks ago as it does for the data that was just observed.
As a result, the total count method simply shows a blend of every
distribution the data was ever drawn from, rather than the distribution the
data is being drawn from now.&lt;/p&gt;

&lt;p&gt;A naive solution to this would be to simply have multiple categorical
distributions that are in rotation.  Similar to the way log files get rotated,
we would have different distributions that get updated at different times.  This
approach, however, can lead to many problems.&lt;/p&gt;

&lt;p&gt;When a distribution first gets rotated in, it has no information about the
current state of the world.  Similarly, at the end of a distribution&amp;#8217;s rotation
it is just as affected by events from the beginning of its rotation as it is
by recent events.  This creates artifacts and dynamics in the data which are
only dependent on the time of and time between rotations.  These effects are
similar to various pathologies that come from binning data or by simply keeping
a total count.&lt;/p&gt;

&lt;p&gt;On the other hand, by forgetting things smoothly using rates we have a continuous
window of time that our distribution is always describing.  The further back in
time the event is, the less of an effect it has on the distribution&amp;#8217;s current
state.&lt;/p&gt;

&lt;h3&gt;Forgetting Data Responsibly&lt;/h3&gt;

&lt;p&gt;Forget Table takes a principled approach to forgetting old data, by defining
the &lt;em&gt;rate&lt;/em&gt; at which old data should decay. Each bin of a categorical
distribution forgets its counts depending on how many counts it currently has
and a user specified rate.  With this rule, bins that are more dominant get
decayed faster than bins without many counts.  This method also has the benefit
of being a very simple process (one that was inspired by the decay of
radioactive particles) which can be calculated very quickly.  In addition,
since bins with high counts get decayed faster than bins with low counts, this
process helps smooth out sudden spikes in data automatically.&lt;/p&gt;

&lt;p&gt;If the data suddenly stopped flowing into the Forget Table, then all the
categorical distributions would eventually decay to the uniform distribution -
each bin would have a count of 1 (at which point we stop decaying), and &lt;em&gt;z&lt;/em&gt;
would be equal to the number of bins (see a
&lt;a href="http://bitly.github.com/forgettable/visualization/"&gt;visualization&lt;/a&gt; of this happening). This
captures the fact that we no longer have any information about the distribution
of the variables in Forget Table.&lt;/p&gt;

&lt;p&gt;The result of this approach is that the counts in each bin, in each categorical
distribution, decay exponentially.  Each time we decrement the count in a bin,
we also decrement the normalising constant &lt;em&gt;z&lt;/em&gt;.  When using Forget Table, we can
choose a rate at which things should decay, depending on the dominant time
constants of the system.&lt;/p&gt;

&lt;h3&gt;Building Categorical Distributions in Redis&lt;/h3&gt;

&lt;p&gt;We store the categorical distribution as a set of event counts, along with a
&amp;#8216;normalising constant&amp;#8217; which is simply the number of all the events we&amp;#8217;ve
stored. In the country example, we have ~260 bins, one per country, and in each
bin we store the number of clicks we&amp;#8217;ve seen from each country. Alongside it,
our normalising constant stores the total number of clicks we&amp;#8217;ve seen across
all countries.&lt;/p&gt;

&lt;p&gt;All this lives in a &lt;a href="http://redis.io/topics/data-types"&gt;Redis sorted set&lt;/a&gt; where
the key describes the variable which, in this case, would simply be
&lt;code&gt;bitly_country&lt;/code&gt; and the value would be a categorical distribution. Each element
in the set would be a country and the score of each element would be the number
of clicks from that country. We store a separate element in the set
(traditionally called &lt;em&gt;z&lt;/em&gt;) that records the total number of clicks stored in
the set. When we want to report the categorical distribution, we extract the
whole sorted set, divide each count by &lt;em&gt;z&lt;/em&gt;, and report the result.&lt;/p&gt;
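
&lt;p&gt;To make the bookkeeping concrete, here is a small in-memory Python sketch of the same layout. It is
only a stand-in for the Redis sorted set: the class name &lt;code&gt;CategoricalCounts&lt;/code&gt; is hypothetical, and
&lt;em&gt;z&lt;/em&gt; is kept as a separate attribute here rather than as an element of the set.&lt;/p&gt;

```python
class CategoricalCounts:
    """In-memory stand-in for one sorted set holding a distribution."""

    def __init__(self):
        self.scores = {}  # bin name to count (sorted-set member and score)
        self.z = 0        # normalising constant: total of all counts

    def incr(self, bin_name, amount=1):
        """Record an event: bump the bin and the normalising constant."""
        self.scores[bin_name] = self.scores.get(bin_name, 0) + amount
        self.z += amount

    def distribution(self):
        """Return the probability of each bin (count divided by z)."""
        if self.z == 0:
            return {}
        return {name: count / self.z for name, count in self.scores.items()}


if __name__ == "__main__":
    clicks = CategoricalCounts()
    clicks.incr("US", 3)
    clicks.incr("JP")
    print(clicks.distribution())
```

&lt;p&gt;Each write touches exactly two numbers, which mirrors why the sorted-set layout keeps writes cheap.&lt;/p&gt;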

&lt;p&gt;Storing the categorical distribution in this way allows us to make very rapid
writes (simply increment the score of two elements of the sorted set) and means
we can store millions of categorical distributions in memory. Storing a large
number of these is important, as we&amp;#8217;d often like to know the normal behavior
of a particular key phrase, or the normal behavior of a
&lt;a href="http://rt.ly/#t=economics"&gt;topic&lt;/a&gt;, or a &lt;a href="http://bitly.com/bundles/"&gt;bundle&lt;/a&gt;,
and so on.&lt;/p&gt;

&lt;h3&gt;Forgetting Data Under Strenuous Circumstances&lt;/h3&gt;

&lt;p&gt;This seems simple enough, but we have two problems. The first is that we have
potentially millions of categorical distributions. bitly maintains information
for over a million separate key phrases at any given time, and (for some super
secret future projects) it is necessary to store a few distributions per
key phrase. That makes it impractical to iterate through every key of our Redis table
in order to do our decays, so cron-like decays aren&amp;#8217;t feasible (ie:
decaying every distribution in the database every several minutes).&lt;/p&gt;

&lt;p&gt;The second problem is that data is constantly flowing into multiple
distributions: we sometimes see spikes of up to 3 thousand clicks per second
which can correspond to dozens of increments per second.  At this sort of high
volume, there is simply too much contention between the decay process and the
incoming increments to safely do both.&lt;/p&gt;

&lt;p&gt;So the real contribution of Forget Table is an approach to forgetting data at
read time. When we are interested in the &lt;em&gt;current&lt;/em&gt; distribution of a particular
variable, we extract whatever sorted set is stored against that variable&amp;#8217;s key
and decrement the counts at that time.&lt;/p&gt;

&lt;p&gt;It turns out that, using the simple rate based model of decay from above, we can
decrement each bin by simply sampling an integer from a &lt;a href="http://en.wikipedia.org/wiki/Poisson_distribution"&gt;Poisson
distribution&lt;/a&gt; whose rate is
proportional to the current count of the bin and the length of time it has been
since we last decayed that bin. So, by storing another piece of information,
the time at which we last successfully decayed this distribution, we can calculate
the number of counts to discard very cheaply (this algorithm is an
approximation to &lt;a href="http://en.wikipedia.org/wiki/Gillespie_algorithm"&gt;Gillespie&amp;#8217;s
algorithm&lt;/a&gt; used to simulate
stochastic systems).&lt;/p&gt;
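
&lt;p&gt;The read-time decay described above might look roughly like the following Python sketch. It is an
illustration under stated assumptions, not the Forget Table source: the function names, the
&lt;code&gt;rate&lt;/code&gt; and &lt;code&gt;elapsed&lt;/code&gt; parameters, and the use of Knuth&amp;#8217;s simple Poisson sampler are all
choices made for the example. That sampler is only suitable for small rates; a library routine such as
&lt;code&gt;numpy.random.poisson&lt;/code&gt; would be a better fit for large counts.&lt;/p&gt;

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson sample with mean lam using Knuth's method.

    Multiplies uniform draws until the product falls to exp(-lam).
    Intended for small lam; exp(-lam) underflows for very large values.
    """
    limit = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def decay(counts, rate, elapsed, rng):
    """Return decayed counts after `elapsed` time units.

    Each bin loses a Poisson-distributed number of counts whose mean
    is rate * count * elapsed, and never drops below 1.
    """
    decayed = {}
    for name, count in counts.items():
        lost = sample_poisson(rate * count * elapsed, rng)
        decayed[name] = max(1, count - lost)
    return decayed


if __name__ == "__main__":
    rng = random.Random()
    after = decay({"US": 1000, "JP": 40}, rate=0.01, elapsed=1.0, rng=rng)
    print(after, "z =", sum(after.values()))
```

&lt;p&gt;Because the mean loss scales with the current count, dominant bins shed counts faster than small
ones, and the normalising constant &lt;em&gt;z&lt;/em&gt; can be recomputed as the sum of the decayed bins.&lt;/p&gt;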

&lt;p&gt;In Redis we implement this using pipelines. Using a pipeline, we read the
sorted set, form the distribution, calculate the amount of decay for each bin
and then attempt to perform the decay. Assuming nothing&amp;#8217;s written into the
sorted set in that time, we decay each bin and update the time of the last
decay. If the pipeline has detected a collision &amp;#8212; either another process has
decayed the set or a new event has arrived &amp;#8212; we abandon the update. The
algorithm we&amp;#8217;ve chosen means that it&amp;#8217;s not terribly important to actually store
the decayed version of the distribution, so long as we know the time between
the read and the last decay.&lt;/p&gt;

&lt;h3&gt;Get the Code and We&amp;#8217;re Hiring!&lt;/h3&gt;

&lt;p&gt;The result is a wrapper on top of Redis that runs as a little HTTP service. Its
API has an increment handler on &lt;code&gt;/incr&lt;/code&gt; used for incrementing counts based on
new events, and a get handler on &lt;code&gt;/get&lt;/code&gt; used for retrieving a distribution.  In
addition, there is an &lt;code&gt;/nmostprobable&lt;/code&gt; endpoint that returns the n categories
which have the highest probability of occurring.&lt;/p&gt;

&lt;p&gt;There are two versions, one in Go (for super speed) and the other in Python.
The code is open source and up on GitHub, available at
&lt;a href="http://bitly.github.com/forgettable"&gt;http://bitly.github.com/forgettable&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As always, if you like what you see (or feel the need to make improvements),
don&amp;#8217;t forget that &lt;a href="https://bitly.com/pages/jobs"&gt;bitly is hiring!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Get ready for more posts on science at bitly, and if you have any specific
questions, feel free to ping me on Twitter
&lt;a href="http://twitter.com/mynameisfiber/"&gt;@mynameisfiber&lt;/a&gt;.&lt;/p&gt;

&lt;div class="postmeta"&gt; by &lt;a href="http://micha.gd/"&gt;micha gorelick&lt;/a&gt; and &lt;a href="http://twitter.com/mikedewar"&gt;mike dewar&lt;/a&gt;&lt;/div&gt;</description><link>http://word.bitly.com/post/41284219720</link><guid>http://word.bitly.com/post/41284219720</guid><pubDate>Wed, 23 Jan 2013 11:46:00 -0500</pubDate></item><item><title>spray some NSQ on it</title><description>&lt;p&gt;We released &lt;a href="https://github.com/bitly/nsq"&gt;NSQ&lt;/a&gt; on October 9th 2012. Supported by &lt;a href="https://speakerdeck.com/snakes/nsq-nyc-golang-meetup"&gt;3
talks&lt;/a&gt; and a &lt;a href="http://word.bitly.com/post/33232969144/nsq"&gt;blog post&lt;/a&gt;, it&amp;#8217;s already the &lt;a href="https://github.com/languages/Go/most_watched"&gt;4th most
watched Go project on GitHub&lt;/a&gt;. There are client libraries in 7
languages and we continue to talk with folks experimenting with and
transitioning to the platform.&lt;/p&gt;

&lt;p&gt;This post aims to kickstart the documentation available for &amp;#8220;getting started&amp;#8221;
and describe some NSQ patterns that solve a variety of common problems.&lt;/p&gt;

&lt;p&gt;DISCLAIMER: this post makes &lt;em&gt;some&lt;/em&gt; obvious technology suggestions to the
reader but it generally ignores the deeply personal details of choosing proper
tools, getting software installed on production machines, managing what
service is running where, service configuration, and managing running
processes (daemontools, supervisord, init.d, etc.).&lt;/p&gt;

&lt;h3&gt;Metrics Collection&lt;/h3&gt;

&lt;p&gt;Regardless of the type of web service you&amp;#8217;re building, in most cases you&amp;#8217;re
going to want to collect some form of metrics in order to understand your
infrastructure, your users, or your business.&lt;/p&gt;

&lt;p&gt;For a web service, most often these metrics are produced by events that happen
via HTTP requests, like an API. The naive approach would be to structure
this synchronously, writing to your metrics system directly in the API request
handler.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_mf74kh5r4P1qj3yp2.png" alt="naive approach"/&gt;&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;What happens when your metrics system goes down?&lt;/li&gt;
&lt;li&gt;Do your API requests hang and/or fail?&lt;/li&gt;
&lt;li&gt;How will you handle the scaling challenge of increasing API request volume
or breadth of metrics collection?&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;One way to resolve all of these issues is to somehow perform the work of
writing into your metrics system asynchronously - that is, place the data in
some sort of local queue and write into your downstream system via some other
process (consuming that queue). This separation of concerns allows the system
to be more robust and fault tolerant. At bitly, we use NSQ to achieve this.&lt;/p&gt;

&lt;p&gt;Brief tangent: NSQ has the concept of &lt;strong&gt;topics&lt;/strong&gt; and &lt;strong&gt;channels&lt;/strong&gt;. Basically,
think of a &lt;strong&gt;topic&lt;/strong&gt; as a unique stream of messages (like our stream of API
events above). Think of a &lt;strong&gt;channel&lt;/strong&gt; as a &lt;em&gt;copy&lt;/em&gt; of that stream of messages
for a given set of consumers. Topics and channels are both independent queues,
too. These properties enable NSQ to support both multicast (a topic &lt;em&gt;copying&lt;/em&gt;
each message to N channels) and distributed (a channel &lt;em&gt;equally dividing&lt;/em&gt; its
messages among N consumers) message delivery.&lt;/p&gt;
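&lt;p&gt;As a rough illustration of those two delivery modes, here is a toy in-memory model in Python. It is only a sketch: the &lt;code&gt;Topic&lt;/code&gt; and &lt;code&gt;Channel&lt;/code&gt; classes, the channel names, and the worker names are all hypothetical, and simple round-robin stands in for however &lt;code&gt;nsqd&lt;/code&gt; actually divides a channel among its consumers.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
from collections import defaultdict
from itertools import cycle


class Channel:
    """Toy channel: each message goes to exactly one of its consumers."""

    def __init__(self, consumers):
        self.received = defaultdict(list)
        self._next_consumer = cycle(consumers)

    def deliver(self, message):
        # Round-robin is an assumption made for the sake of the example.
        self.received[next(self._next_consumer)].append(message)


class Topic:
    """Toy topic: every channel gets its own copy of each message."""

    def __init__(self):
        self.channels = {}

    def publish(self, message):
        for channel in self.channels.values():
            channel.deliver(message)


topic = Topic()
topic.channels["metrics"] = Channel(["worker-a", "worker-b"])
topic.channels["archive"] = Channel(["nsq_to_file"])
for n in range(4):
    topic.publish("event-{}".format(n))

metrics = topic.channels["metrics"].received
archive = topic.channels["archive"].received
print(dict(metrics))
print(dict(archive))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The two workers on the first channel split the stream between them, while the single consumer on the second channel sees every message.&lt;/p&gt;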

&lt;p&gt;For a more thorough treatment of these concepts, read through the &lt;a href="https://github.com/bitly/nsq/blob/master/docs/design.md"&gt;design
doc&lt;/a&gt; and the &lt;a href="https://speakerdeck.com/snakes/nsq-nyc-golang-meetup"&gt;slides from our Golang NYC talk&lt;/a&gt;.
In particular, &lt;a href="https://speakerdeck.com/snakes/nsq-nyc-golang-meetup?slide=19"&gt;slides 19 through 33&lt;/a&gt; describe topics and
channels in detail.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_mf74ktpfpP1qj3yp2.png" alt="architecture with NSQ"/&gt;&lt;/p&gt;

&lt;p&gt;Integrating NSQ is straightforward. Let&amp;#8217;s take the simple case:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;Run an instance of &lt;code&gt;nsqd&lt;/code&gt; on the same host that runs your API application.&lt;/li&gt;
&lt;li&gt;Update your API application to write to the local &lt;code&gt;nsqd&lt;/code&gt; instance to 
queue events, instead of directly into the metrics system.  To be able 
to easily introspect and manipulate the stream, we generally format this
type of data in line-oriented JSON.  Writing into &lt;code&gt;nsqd&lt;/code&gt; can be as simple
as performing an HTTP POST request to the &lt;code&gt;/put&lt;/code&gt; endpoint.&lt;/li&gt;
&lt;li&gt;Create a consumer in your preferred language using one of our 
&lt;a href="https://github.com/bitly/nsq#client-libraries"&gt;client libraries&lt;/a&gt;.  This &amp;#8220;worker&amp;#8221; will subscribe to the 
stream of data and process the events, writing into your metrics system. 
It can also run locally on the host running both your API application 
and &lt;code&gt;nsqd&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;Here&amp;#8217;s an example worker written with our official &lt;a href="https://github.com/bitly/pynsq"&gt;Python client
library&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a class="gist" href="https://gist.github.com/4331602"&gt;Python Example Code&lt;/a&gt;&lt;/p&gt;
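&lt;p&gt;For the publishing side (step 2 above), a minimal sketch using only the Python standard library might look like the following. The port (&lt;code&gt;4151&lt;/code&gt;), the &lt;code&gt;topic&lt;/code&gt; query parameter, the event fields, and the helper names are assumptions made for illustration, so check them against the &lt;code&gt;nsqd&lt;/code&gt; version you run.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
import json
import urllib.parse
import urllib.request

# Assumed address of the nsqd HTTP interface on the local host.
NSQD_HTTP_ADDRESS = "http://127.0.0.1:4151"


def build_put_request(topic, event, nsqd_http_address=NSQD_HTTP_ADDRESS):
    """Build (but do not send) an HTTP POST that queues one event on a topic."""
    query = urllib.parse.urlencode({"topic": topic})
    url = "{}/put?{}".format(nsqd_http_address, query)
    # One JSON document per message keeps the stream line-oriented.
    body = json.dumps(event, sort_keys=True).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def publish(topic, event, timeout=1.0):
    """Send the event to the local nsqd; the short timeout protects the API handler."""
    request = build_put_request(topic, event)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200


# Hypothetical event, shown without touching the network.
request = build_put_request("api_requests", {"endpoint": "/v3/shorten", "status": 200})
print(request.full_url)
print(request.data.decode("utf-8"))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the write goes to an &lt;code&gt;nsqd&lt;/code&gt; on the same host, the API handler only ever waits on a local request, never on the metrics system itself.&lt;/p&gt;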

&lt;p&gt;In addition to de-coupling, by using one of our official client libraries,
consumers will degrade gracefully when message processing fails.  Our libraries 
have two key features that help with this:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;&lt;strong&gt;Retries&lt;/strong&gt; - when your message handler indicates failure, that information is
sent to &lt;code&gt;nsqd&lt;/code&gt; in the form of a &lt;code&gt;REQ&lt;/code&gt; (re-queue) command.  Also, &lt;code&gt;nsqd&lt;/code&gt; will
&lt;em&gt;automatically&lt;/em&gt; time out (and re-queue) a message if it hasn&amp;#8217;t been responded 
to in a configurable time window.  These two properties are critical to 
providing a delivery guarantee.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exponential Backoff&lt;/strong&gt; - when message processing fails the reader library will
delay the receipt of additional messages for a duration that scales
exponentially based on the number of consecutive failures.  When a reader in
a backoff state begins to process messages successfully again, the delay
shrinks in the opposite sequence until it reaches 0.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;In concert, these two features allow the system to respond gracefully to 
downstream failure, automagically.&lt;/p&gt;
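&lt;p&gt;To make the backoff behavior concrete, here is a small Python sketch of the bookkeeping involved. It is not the code from our client libraries: the &lt;code&gt;BackoffTracker&lt;/code&gt; class, the one second base, and the 60 second cap are hypothetical values chosen for the example.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
def backoff_delay(failures, base_seconds=1.0, max_seconds=60.0):
    """Delay before asking for more messages, doubling with each failure level."""
    if failures == 0:
        return 0.0
    return min(max_seconds, base_seconds * 2 ** (failures - 1))


class BackoffTracker:
    """Steps the failure level up on error and back down on success, never below 0."""

    def __init__(self):
        self.failures = 0

    def record(self, succeeded):
        if succeeded:
            self.failures = max(0, self.failures - 1)
        else:
            self.failures += 1
        return backoff_delay(self.failures)


tracker = BackoffTracker()
# Three failed messages followed by three successful ones.
outcomes = [False, False, False, True, True, True]
delays = [tracker.record(outcome) for outcome in outcomes]
print(delays)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The delay climbs while the downstream system is failing and unwinds step by step once processing succeeds again, which is what keeps a recovering service from being flooded.&lt;/p&gt;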

&lt;h3&gt;Persistence&lt;/h3&gt;

&lt;p&gt;Ok, great, now you have the ability to withstand a situation where your
metrics system is unavailable with no data loss and no degraded API service to
other endpoints. You also have the ability to scale the processing of this
stream horizontally by adding more worker instances to consume from the same
channel.&lt;/p&gt;

&lt;p&gt;But, it&amp;#8217;s kinda hard ahead of time to think of all the types of metrics you
might want to collect for a given API event.&lt;/p&gt;

&lt;p&gt;Wouldn&amp;#8217;t it be nice to have an archived log of this data stream for any future
operation to leverage? Logs tend to be relatively easy to back up redundantly,
making them a &amp;#8220;plan z&amp;#8221; of sorts in the event of catastrophic downstream data
loss. But, would you want this same consumer to also have the responsibility
of archiving the message data? Probably not, because of that whole &amp;#8220;separation
of concerns&amp;#8221; thing.&lt;/p&gt;

&lt;p&gt;Archiving an NSQ topic is such a common pattern that we built a
utility, &lt;a href="https://github.com/bitly/nsq/blob/master/examples/nsq_to_file/nsq_to_file.go"&gt;nsq_to_file&lt;/a&gt;, packaged with NSQ, that does
exactly what you need.&lt;/p&gt;

&lt;p&gt;Remember, in NSQ, each channel of a topic is independent and receives a
&lt;em&gt;copy&lt;/em&gt; of all the messages. You can use this to your advantage when archiving
the stream by doing so over a new channel, &lt;code&gt;archive&lt;/code&gt;. Practically, this means
that if your metrics system is having issues and the &lt;code&gt;metrics&lt;/code&gt; channel gets
backed up, it won&amp;#8217;t affect the separate &lt;code&gt;archive&lt;/code&gt; channel you&amp;#8217;ll be using to
persist messages to disk.&lt;/p&gt;

&lt;p&gt;So, add an instance of &lt;code&gt;nsq_to_file&lt;/code&gt; to the same host and use a command line
like the following:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/usr/local/bin/nsq_to_file --nsqd-tcp-address=127.0.0.1:4150 --topic=api_requests --channel=archive&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_mf74l5RqlZ1qj3yp2.png" alt="archiving the stream"/&gt;&lt;/p&gt;

&lt;h3&gt;Distributed Systems&lt;/h3&gt;

&lt;p&gt;You&amp;#8217;ll notice that the system has not yet evolved beyond a single production
host, which is a glaring single point of failure.&lt;/p&gt;

&lt;p&gt;Unfortunately, building a distributed system &lt;a href="https://twitter.com/b6n/status/276909760010387456"&gt;is hard&lt;/a&gt;.
Fortunately, NSQ can help. The following changes demonstrate how
NSQ alleviates some of the pain points of building distributed systems
as well as how its design helps achieve high availability and fault tolerance.&lt;/p&gt;

&lt;p&gt;Let&amp;#8217;s assume for a second that this event stream is &lt;em&gt;really&lt;/em&gt; important. You
want to be able to tolerate host failures and continue to ensure that messages
are &lt;em&gt;at least&lt;/em&gt; archived, so you add another host.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_mf74lmYhZa1qj3yp2.png" alt="adding a second host"/&gt;&lt;/p&gt;

&lt;p&gt;Assuming you have some sort of load balancer in front of these two hosts you
can now tolerate any single host failure.&lt;/p&gt;

&lt;p&gt;Now, let&amp;#8217;s say the process of persisting, compressing, and transferring these
logs is affecting performance. How about splitting that responsibility off to
a tier of hosts that have higher IO capacity?&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_mf74m0JHMi1qj3yp2.png" alt="separate archive hosts"/&gt;&lt;/p&gt;

&lt;p&gt;This topology and configuration can easily scale to double-digit hosts, but
you&amp;#8217;re still managing configuration of these services manually, which does not
scale. Specifically, in each consumer, this setup is hard-coding the address
of where &lt;code&gt;nsqd&lt;/code&gt; instances live, which is a pain. What you really want is for
the configuration to evolve and be accessed at runtime based on the state of
the NSQ cluster. This is &lt;em&gt;exactly&lt;/em&gt; what we built &lt;code&gt;nsqlookupd&lt;/code&gt; to
address.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;nsqlookupd&lt;/code&gt; is a daemon that records and disseminates the state of an NSQ
cluster at runtime. &lt;code&gt;nsqd&lt;/code&gt; instances maintain persistent TCP connections to
&lt;code&gt;nsqlookupd&lt;/code&gt; and push state changes across the wire. Specifically, an &lt;code&gt;nsqd&lt;/code&gt;
registers itself as a producer for a given topic as well as all channels it
knows about. This allows consumers to query an &lt;code&gt;nsqlookupd&lt;/code&gt; to determine who
the producers are for a topic of interest, rather than hard-coding that
configuration. Over time, they will &lt;em&gt;learn&lt;/em&gt; about the existence of new
producers and be able to route around failures.&lt;/p&gt;

&lt;p&gt;The only changes you need to make are to point your existing &lt;code&gt;nsqd&lt;/code&gt; and
consumer instances at &lt;code&gt;nsqlookupd&lt;/code&gt; (everyone explicitly knows where
&lt;code&gt;nsqlookupd&lt;/code&gt; instances are but consumers don&amp;#8217;t explicitly know where
producers are, and vice versa). The topology now looks like this:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/edb403d38fc2bcc727b8655ea70eb3a7/tumblr_inline_mf8sfr2sp41qj3yp2.png" alt="adding nsqlookupd"/&gt;&lt;/p&gt;

&lt;p&gt;At first glance this may look &lt;em&gt;more&lt;/em&gt; complicated. It&amp;#8217;s deceptive though, as
the effect this has on a growing infrastructure is hard to communicate
visually. You&amp;#8217;ve effectively decoupled producers from consumers because
&lt;code&gt;nsqlookupd&lt;/code&gt; is now acting as a directory service in between. Adding
additional downstream services that depend on a given stream is trivial: just
specify the topic you&amp;#8217;re interested in (producers will be discovered by
querying &lt;code&gt;nsqlookupd&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;But what about availability and consistency of the lookup data? We generally
recommend deciding how many &lt;code&gt;nsqlookupd&lt;/code&gt; instances to run based on your
availability requirements. &lt;code&gt;nsqlookupd&lt;/code&gt; is not resource intensive and
can be easily homed with other services. Also, &lt;code&gt;nsqlookupd&lt;/code&gt; instances do not
need to coordinate or otherwise be consistent with each other. Consumers will
generally only require one &lt;code&gt;nsqlookupd&lt;/code&gt; to be available with the information
they need (and they will union the responses from all of the &lt;code&gt;nsqlookupd&lt;/code&gt;
instances they know about). Operationally, this makes it easy to migrate to a
new set of &lt;code&gt;nsqlookupd&lt;/code&gt;.&lt;/p&gt;
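&lt;p&gt;The union step can be sketched in a few lines of Python. The response shape used here (a &lt;code&gt;producers&lt;/code&gt; list with &lt;code&gt;address&lt;/code&gt; and &lt;code&gt;tcp_port&lt;/code&gt; fields), the host names, and the &lt;code&gt;union_producers&lt;/code&gt; helper are assumptions for illustration, and &lt;code&gt;None&lt;/code&gt; stands in for an &lt;code&gt;nsqlookupd&lt;/code&gt; that could not be reached.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
def union_producers(lookup_responses):
    """Merge producer lists from several nsqlookupd instances into one sorted list."""
    producers = set()
    for response in lookup_responses:
        if response is None:
            # This nsqlookupd was unreachable; the others still count.
            continue
        for producer in response["producers"]:
            producers.add((producer["address"], producer["tcp_port"]))
    return sorted(producers)


# Three lookup instances: one healthy but stale, one down, one fully up to date.
responses = [
    {"producers": [{"address": "api01", "tcp_port": 4150}]},
    None,
    {"producers": [{"address": "api01", "tcp_port": 4150},
                   {"address": "api02", "tcp_port": 4150}]},
]
nsqd_addresses = union_producers(responses)
print(nsqd_addresses)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the consumer takes the union rather than trusting any one answer, the lookup instances never have to agree with each other, and losing one of them costs nothing as long as another knows about the producers.&lt;/p&gt;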

&lt;h3&gt;EOL&lt;/h3&gt;

&lt;p&gt;Using these same strategies, we&amp;#8217;re now peaking at 65k messages per second
delivered through our production NSQ cluster. Although applicable to many
use cases, we&amp;#8217;re only scratching the surface with the configurations described
above.&lt;/p&gt;

&lt;p&gt;If you&amp;#8217;re interested in hacking on NSQ, check out the &lt;a href="https://github.com/bitly/nsq"&gt;GitHub repo&lt;/a&gt; and
the &lt;a href="https://github.com/bitly/nsq#client-libraries"&gt;list of client libraries&lt;/a&gt;. We also publish &lt;a href="https://github.com/bitly/nsq/downloads"&gt;binary
distributions&lt;/a&gt; for Linux and Darwin to make it easy to get started.&lt;/p&gt;

&lt;p&gt;Stay tuned for more posts on leveraging NSQ to build distributed systems. If
you have any specific questions, feel free to ping me on twitter
&lt;a href="https://twitter.com/imsnakes"&gt;@imsnakes&lt;/a&gt;.&lt;/p&gt;

&lt;div class="postmeta"&gt; by &lt;a href="http://twitter.com/imsnakes"&gt;snakes&lt;/a&gt;&lt;/div&gt;</description><link>http://word.bitly.com/post/38385370762</link><guid>http://word.bitly.com/post/38385370762</guid><pubDate>Thu, 20 Dec 2012 10:19:54 -0500</pubDate><category>nsq</category><category>message queue</category><category>distributed</category><category>scale</category><category>python</category><category>golang</category><category>high availability</category></item><item><title>Improving Frontend Code Quality and Workflow</title><description>&lt;p&gt;When I started at Bitly as a Frontend Engineer, we were about to launch the &lt;a href="http://blog.bitly.com/post/23998132587/welcome-to-the-new-bitly"&gt;new bitly&lt;/a&gt;. It was exciting to be so close to a product launch and we were cranking out lots of code each day. It took me a few days to get my bearings in the sprawling codebase and I saw lots of opportunity for refactoring, cleanup, and style normalization as well as some architectural concepts we weren&amp;#8217;t leveraging.&lt;/p&gt;

&lt;p&gt;Product launch mode kept us from doing housekeeping for the next few months after we released the new product as we collected feedback and iterated on several designs/functionalities.&lt;/p&gt;

&lt;p&gt;Once we had a chance to take a step back and regroup, I saw that we could be more productive if we invested in a common JavaScript style (both syntax and module level patterns) and re-evaluated which libraries we were leveraging to build the site.&lt;/p&gt;

&lt;p&gt;When we decided the next feature work would be on save/share modal dialogs, I used this as an opportunity to evaluate my new picks for tools and libraries. The result was a success and allowed us to adopt the new tools and libraries for all new features (see &lt;a href="http://blog.bitly.com/post/29644087171/easily-save-and-share-the-links-you-love"&gt;Easily Save and Share the Links You Love&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Keep reading to see what tools we ended up using to make us more productive!&lt;/p&gt;

&lt;h3&gt;Coding style guidelines&lt;/h3&gt;

&lt;p&gt;If you&amp;#8217;ve seen frontend JavaScript codebases older than six months with two or more developers working on them, you&amp;#8217;ve seen how hard it is to enforce style.&lt;/p&gt;

&lt;p&gt;The goal of a style guide (and code conventions in general) is to reduce the amount of friction when switching between different modules in a codebase and increase the consistency and readability of code. If every module follows the same patterns, you can easily get your bearings in code written by your coworkers. Every team should have a common style that is enforced in code review and ideally written down in a document that new hires can refer to.&lt;/p&gt;

&lt;p&gt;For JavaScript at Bitly, we mostly use the &lt;a href="http://google-styleguide.googlecode.com/svn/trunk/javascriptguide.xml"&gt;Google Style Guide&lt;/a&gt; without the JSDoc and with a variation that non-method variables are lowercase_with_underscores e.g.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;var Bitly = function() {
};

Bitly.prototype.doThing = function() {
    var an_object = {};
};
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The bigger the team, the more you might have to diverge from your preferred style but you&amp;#8217;ll still get tons of benefit from the &lt;em&gt;consistency&lt;/em&gt; alone.&lt;/p&gt;

&lt;h3&gt;Coffeescript for teams&lt;/h3&gt;

&lt;p&gt;At Bitly, in addition to codifying the style guide, we gave &lt;a href="http://coffeescript.org/"&gt;Coffeescript&lt;/a&gt; a trial run and decided to continue using it for all new code (like &lt;a href="https://github.com/styleguide/javascript"&gt;GitHub does&lt;/a&gt;). Coffeescript is quite contentious in the JS community, but for our frontend team we saw a great improvement in workflow and maintainability.
Since whitespace is significant and &lt;code&gt;{}&lt;/code&gt; and &lt;code&gt;;&lt;/code&gt; are almost always omitted, there is a distinct Coffeescript style that is encouraged by default.&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s a snippet from a Backbone View that declares all the DOM event handlers for the view in a concise way.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;events:
    'click': 'focusInput'
    'focusin .picker-input': 'onFocusIn'
    'focusout .picker-input': 'checkForFullEmail'
    'keydown .picker-display': 'keyDownCheckForFocusedChip'
    'keydown .picker-input': 'inputKeydown'
    'keyup .picker-input': 'inputKeyup'
    'input .picker-input': 'inputNewText'
    'keydown .picker-suggestions': 'suggestionKeydown'
    'mousedown .picker-suggestions a': 'suggestionMousedown'
    'click .picker-suggestions a': 'selectItem'
    'click .picked-item .remove': 'removeItem'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Fat arrow (=&amp;gt;) is another great feature used for binding a function to the current scope and obviates the &lt;code&gt;var self = this;&lt;/code&gt; pattern you will see in almost all JS codebases.&lt;/p&gt;

&lt;p&gt;Here, fat arrow binds &lt;code&gt;selectItem&lt;/code&gt; at constructor time to the newly constructed instance of MultiPickerView (not the prototype) instead of the default jQuery wrapped current target so I can access View methods/members within the handler.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;class App.MultiPickerView extends Backbone.View
# ...
# ...
  selectItem: (e) =&amp;gt;
    e.preventDefault()
    $item = $(e.currentTarget)
    @addEmail $item.data("email")
    # @$input is convention for cached jquery elements 
    # stored on the view instance
    @$input.val ""
    @hideSuggest()
    _.defer =&amp;gt; @$input.focus()
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Fast and (mostly) logic-less templates with Handlebars&lt;/h3&gt;

&lt;p&gt;Handlebars is a template language with partials, custom helpers, and minimal logic, which can be compiled on the client or server side. We take advantage of the watcher system we already had in place for compiling Coffeescript to automatically compile our Handlebars templates ahead of time in order to reduce the workload on clients.&lt;/p&gt;

&lt;h3&gt;Immediate feedback with Coffeescript + File watcher + Growl&lt;/h3&gt;

&lt;p&gt;One of the major benefits of using Coffeescript is the compile-time checking that catches syntax errors. Instead of using the built-in &lt;code&gt;coffee -cw *.coffee&lt;/code&gt; to compile-on-change, I wrote a script that does this but also triggers Growl notifications if the compilation produces errors.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_mevm4iKfxq1qaimnt.png" alt="coffeescript-error"/&gt;&lt;/p&gt;

&lt;p&gt;The script does the same for *.handlebars files (for which there is not a built-in watcher functionality).&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_mevm4vBLe81qaimnt.png" alt="handlebars-error"/&gt;&lt;/p&gt;

&lt;p&gt;See the source: &lt;a href="https://gist.github.com/4217755"&gt;Multiwatcher gist&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Backbone (Models, Views, Events)&lt;/h3&gt;

&lt;p&gt;Historically, frontend JavaScript is fertile ground for spaghetti code, tight coupling, repeated code and scattered entry points. The introduction of jQuery increased this trend by making it easy for developers to handle DOM events and add visual effects wherever it was most convenient. This scales to a point, but when you start doing lots of AJAX, manipulating application state, and generating HTML using the DOM APIs, things can quickly get out of hand.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://backbonejs.org/"&gt;Backbone&lt;/a&gt; introduced a few great primitives that can be leveraged to make more maintainable frontend code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Models&lt;/strong&gt; are a good abstraction when you have a lot of application state that can change based on user interaction and/or AJAX calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Views&lt;/strong&gt; encapsulate DOM elements and provide a declarative way of setting your event listeners using jQuery&amp;#8217;s delegated event listeners.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Events&lt;/strong&gt; implements the pub/sub pattern, which allows you to easily decouple your models and views. &lt;strong&gt;This is probably the most important development for frontend application architecture&lt;/strong&gt;. The order of events typically goes like this:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;Instantiate Model, View&lt;/li&gt;
&lt;li&gt;View listens on Model changes&lt;/li&gt;
&lt;li&gt;Model attribute changes (via AJAX callback or direct user interaction)&lt;/li&gt;
&lt;li&gt;Model fires change event on its event channel&lt;/li&gt;
&lt;li&gt;View&amp;#8217;s listener fires in response and performs some action&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;&lt;strong&gt;The decoupling is that the Model never has knowledge of the View. This means the Model handles only the business logic and can be much more easily tested.&lt;/strong&gt;&lt;/p&gt;
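&lt;p&gt;The pattern itself is not specific to Backbone or even to JavaScript. As a language-neutral illustration, here is the same sequence of steps sketched in Python. The &lt;code&gt;Events&lt;/code&gt;, &lt;code&gt;Model&lt;/code&gt;, and &lt;code&gt;View&lt;/code&gt; classes below are hypothetical stand-ins written for this example, not a port of the Backbone API.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
class Events:
    """Minimal pub/sub: listeners subscribe by event name."""

    def __init__(self):
        self._listeners = {}

    def on(self, name, callback):
        self._listeners.setdefault(name, []).append(callback)

    def trigger(self, name, *args):
        for callback in self._listeners.get(name, []):
            callback(*args)


class Model(Events):
    """Holds state and announces changes; it knows nothing about views."""

    def __init__(self, **attributes):
        super().__init__()
        self.attributes = dict(attributes)

    def set(self, key, value):
        if self.attributes.get(key) != value:
            self.attributes[key] = value
            self.trigger("change:" + key, self, value)


class View:
    """Listens for model changes and records what it would draw."""

    def __init__(self, model):
        self.rendered = []
        model.on("change:title", self.render)

    def render(self, model, value):
        self.rendered.append("title: " + value)


link = Model(title="untitled")   # 1. instantiate Model, View
view = View(link)                # 2. View listens on Model changes
link.set("title", "word")        # 3-5. attribute changes, event fires, View reacts
link.set("title", "word")        # same value again, so no event fires
print(view.rendered)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that &lt;code&gt;Model&lt;/code&gt; can be constructed and tested without any &lt;code&gt;View&lt;/code&gt; existing at all.&lt;/p&gt;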

&lt;p&gt;There has been much written on Backbone and other MV* frameworks, so I won&amp;#8217;t evangelize much here except to say that adding a View and Model layer in your frontend JavaScript will improve your architecture significantly.&lt;/p&gt;

&lt;p&gt;Addy Osmani gives a good treatment as to why you would use MV* and how to select your tools in: &lt;a href="http://coding.smashingmagazine.com/2012/07/27/journey-through-the-javascript-mvc-jungle/"&gt;Journey through the JavaScript MVC jungle&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Unit/Integration Tests with Mocha, Sinon&lt;/h3&gt;

&lt;p&gt;Unit testing is a given for any modern web app with a long lifecycle and high volume of traffic.&lt;/p&gt;

&lt;p&gt;There are many test runner frameworks for testing but you should choose one that matches your style. I like &lt;a href="http://visionmedia.github.com/mocha/"&gt;Mocha&lt;/a&gt; because it can handle BDD, TDD, and export styles, has a nice web browser interface, is being adopted in lots of places, and is written by &lt;a href="https://github.com/visionmedia"&gt;TJ Holowaychuk&lt;/a&gt; who has a great reputation in the node community. The two other big names in testing frameworks are &lt;a href="http://pivotal.github.com/jasmine/"&gt;Jasmine&lt;/a&gt; and &lt;a href="http://qunitjs.com/"&gt;QUnit&lt;/a&gt;, but they are more opinionated and less flexible.&lt;/p&gt;

&lt;p&gt;Besides a test runner, you&amp;#8217;ll want some helper libraries to make your tests easier to write.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://sinonjs.org/"&gt;Sinon&lt;/a&gt; is &lt;em&gt;essential&lt;/em&gt; as it provides spy, stub, and mock functionality to make sure you are testing the unit under test and not its dependencies! The author wrote &lt;a href="http://tddjs.com/"&gt;Test-Driven JavaScript Development&lt;/a&gt; and is the foremost expert on this topic.&lt;/p&gt;

&lt;p&gt;To support different styles of test assertions, you may like &lt;a href="https://github.com/visionmedia/should.js"&gt;Should.js&lt;/a&gt; which provides expressive methods so you can write tests like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;user.pets.should.have.length(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here is a sample test file for our AJAX wrapper method (which uses &lt;a href="https://github.com/appendto/jquery-mockjax/"&gt;mockjax&lt;/a&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$.mockjaxSettings.responseTime = 1
$.mockjax(
  url: '/data/beta/missing'
  status: 404
)
$.mockjax(
  url: '/data/beta/invalid'
  status: 200
  responseText:
    status_code: 500
    status_txt: 'MISSING_ARG'
    data: null
)
$.mockjax(
  url: '/data/beta/valid'
  status: 200
  responseText:
    status_code: 200
    status_txt: 'OK'
    data: []
)

describe 'BITLY', -&amp;gt;
  describe '#post', -&amp;gt;
    it 'should trigger .fail for non-200 response', (done) -&amp;gt;
      failSpy = sinon.spy()
      doneSpy = sinon.spy()
      BITLY.post('/data/beta/missing', {},
        success: doneSpy
        error: failSpy
      )
        .done(doneSpy)
        .done((data) -&amp;gt;
          data.should.be.a 'object'
        )
        .fail(failSpy)
        .always(-&amp;gt;
          doneSpy.called.should.be.false
          failSpy.calledTwice.should.be.true
          done()
        )
    it 'should trigger .fail for non-200 status_code', (done) -&amp;gt;
      failSpy = sinon.spy()
      doneSpy = sinon.spy()
      BITLY.post('/data/beta/invalid', {},
        success: doneSpy
        error: failSpy
      )
        .done(doneSpy)
        .fail(failSpy)
        .always(-&amp;gt;
          doneSpy.called.should.be.false
          failSpy.calledTwice.should.be.true
          done()
        )
    it 'should trigger .done for 200 response', (done) -&amp;gt;
      failSpy = sinon.spy()
      doneSpy = sinon.spy()
      BITLY.post('/data/beta/valid', {},
        success: doneSpy
        error: failSpy
      )
        .done(doneSpy)
        .done((data) -&amp;gt;
          data.should.be.a 'object'
        )
        .fail(failSpy)
        .always(-&amp;gt;
          failSpy.called.should.be.false
          doneSpy.calledTwice.should.be.true
          done()
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here is a sample of our test suite and its test runner:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_mevm50SoFg1qaimnt.png" alt="mocha tests"/&gt;&lt;/p&gt;

&lt;p&gt;Always be looking to improve your toolkit and sweat the small stuff when it comes to code quality. Every engineer wants to work with other great engineers. Ask yourself &amp;#8220;If not me, then who on my team will make the effort?&amp;#8221;&lt;/p&gt;

&lt;p&gt;If you like working with other great engineers, &lt;a href="http://bitly.com/jobs"&gt;bitly is hiring&lt;/a&gt;.&lt;/p&gt;

&lt;div class="postmeta"&gt; by &lt;a href="http://twitter.com/nibblebot"&gt;joshua&lt;/a&gt;&lt;/div&gt;</description><link>http://word.bitly.com/post/37725436365</link><guid>http://word.bitly.com/post/37725436365</guid><pubDate>Tue, 11 Dec 2012 13:45:27 -0500</pubDate></item><item><title>Classifying Human Traffic with Random Forest Decision Trees</title><description>&lt;p&gt;&lt;span&gt;At bitly, we study human behavior on the social web, and we often need to figure out whether data is generated by a deliberate human action (organic data) or by an action taken by a script or without a human&amp;#8217;s knowledge (inorganic data). bitly data scientist &lt;/span&gt;&lt;a href="http://twitter.com/bde"&gt;Brian Eoff&lt;/a&gt;&lt;span&gt; recently gave a talk at &lt;/span&gt;&lt;a href="http://nyc2012.pydata.org/"&gt;PyData NYC 2012&lt;/a&gt;&lt;span&gt; on a fast random forest decision tree approach, implemented in Python with &lt;/span&gt;&lt;a href="http://scikit-learn.org/stable/"&gt;scikit-learn&lt;/a&gt;&lt;span&gt;, to identifying organic vs inorganic data in a realtime stream.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;iframe frameborder="0" height="281" src="http://player.vimeo.com/video/53112972?badge=0" width="500"&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://vimeo.com/53112972"&gt;SciKit Random Forest - Brian Eoff&lt;/a&gt; from &lt;a href="http://vimeo.com/continuumanalytics"&gt;Continuum Analytics&lt;/a&gt; on &lt;a href="http://vimeo.com"&gt;Vimeo&lt;/a&gt;.&lt;/p&gt;
&lt;div class="postmeta"&gt;by &lt;a href="http://twitter.com/hmason"&gt;hilary&lt;/a&gt;&lt;/div&gt;</description><link>http://word.bitly.com/post/35843651276</link><guid>http://word.bitly.com/post/35843651276</guid><pubDate>Fri, 16 Nov 2012 10:38:18 -0500</pubDate></item><item><title>NSQ: realtime distributed message processing at scale</title><description>&lt;p&gt;&lt;strong&gt;NSQ&lt;/strong&gt; is a realtime message processing system designed to operate at bitly&amp;#8217;s scale, handling
billions of messages per day.&lt;/p&gt;

&lt;p&gt;It promotes &lt;em&gt;distributed&lt;/em&gt; and &lt;em&gt;decentralized&lt;/em&gt; topologies &lt;a href="#spof"&gt;without single points of failure&lt;/a&gt;,
enabling fault tolerance and high availability coupled with a reliable &lt;a href="#delivery"&gt;message delivery
guarantee&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Operationally, &lt;strong&gt;NSQ&lt;/strong&gt; is easy to configure and deploy (all parameters are specified on the command
line and compiled binaries have no runtime dependencies). For maximum flexibility, it is agnostic to
data format (messages can be JSON, &lt;a href="http://msgpack.org/"&gt;MsgPack&lt;/a&gt;, &lt;a href="http://code.google.com/p/protobuf/"&gt;Protocol Buffers&lt;/a&gt;, or anything else). Go and
Python libraries are available out of the box.&lt;/p&gt;

&lt;p&gt;This post aims to provide a detailed overview of &lt;strong&gt;NSQ&lt;/strong&gt;, from the problems that inspired us to
build a better solution to how it works inside and out. There&amp;#8217;s a lot to cover so let&amp;#8217;s start off
with a little history&amp;#8230;&lt;/p&gt;

&lt;h2&gt;Background&lt;/h2&gt;

&lt;p&gt;Before &lt;strong&gt;NSQ&lt;/strong&gt;, there was &lt;a href="https://github.com/bitly/simplehttp/tree/master/simplequeue"&gt;simplequeue&lt;/a&gt;, a &lt;em&gt;simple&lt;/em&gt; (shocking, right?) in-memory message queue
with an HTTP interface, developed as part of our open source &lt;a href="https://github.com/bitly/simplehttp"&gt;simplehttp&lt;/a&gt; suite of tools. Like
its successor, &lt;code&gt;simplequeue&lt;/code&gt; is agnostic to the type and format of the data it handles.&lt;/p&gt;

&lt;p&gt;We used &lt;code&gt;simplequeue&lt;/code&gt; as the foundation for a distributed message queue by siloing an instance on
each host that produced messages. This effectively reduced the potential for data loss in a system
which otherwise did not persist messages by guaranteeing that the loss of any single host would
not prevent the rest of the message producers or consumers from functioning.&lt;/p&gt;

&lt;p&gt;We also used &lt;a href="https://github.com/bitly/simplehttp/tree/master/pubsub"&gt;pubsub&lt;/a&gt;, an HTTP server to aggregate streams and provide an endpoint for multiple
clients to subscribe. We used it to transmit streams across hosts (or datacenters) and be queued
again for writing to various downstream services.&lt;/p&gt;

&lt;p&gt;As a glue utility, we used &lt;a href="https://github.com/bitly/simplehttp/tree/master/ps_to_http"&gt;ps_to_http&lt;/a&gt; to subscribe to a &lt;code&gt;pubsub&lt;/code&gt; stream and write the data to
&lt;code&gt;simplequeue&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There are a couple of important properties of these tools with respect to message duplication and
delivery. Each of the &lt;code&gt;N&lt;/code&gt; clients of a &lt;code&gt;pubsub&lt;/code&gt; receives all of the messages published (each
message is delivered to all clients), whereas each of the &lt;code&gt;N&lt;/code&gt; clients of a &lt;code&gt;simplequeue&lt;/code&gt; receives
&lt;code&gt;1 / N&lt;/code&gt; of the messages queued (each message is delivered to 1 client). Consequently, when
multiple applications need to consume data from a single producer, we set up the following workflow:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_marmbwe9QV1qj3yp2.png" alt="old school setup"/&gt;&lt;/p&gt;

&lt;p&gt;The producer publishes to &lt;code&gt;pubsub&lt;/code&gt; and for each downstream service we set up a dedicated
&lt;code&gt;simplequeue&lt;/code&gt; with a &lt;code&gt;ps_to_http&lt;/code&gt; process to route all messages from the &lt;code&gt;pubsub&lt;/code&gt; into the queue.
Each service has its own set of &amp;#8220;queuereaders&amp;#8221; which we scale independently according to the
service&amp;#8217;s needs.&lt;/p&gt;

&lt;p&gt;We used this foundation to process 100s of millions of messages a day. It was the core upon which
bitly was built.&lt;/p&gt;

&lt;p&gt;This setup had several nice properties:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;producers are de-coupled from downstream consumers&lt;/li&gt;
&lt;li&gt;no producer-side single points of failure&lt;/li&gt;
&lt;li&gt;easy to interact with (all HTTP)&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;But, it also had its issues&amp;#8230;&lt;/p&gt;

&lt;p&gt;One is simply the operational overhead/complexity of having to set up and configure the various tools
in the chain. Of particular note are the &lt;code&gt;pubsub &amp;gt; ps_to_http&lt;/code&gt; links. Given this setup, consuming a
stream in a way that avoids SPOFs is a challenge. There are two options, neither of which is ideal:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;just put the &lt;code&gt;ps_to_http&lt;/code&gt; process on a single box and pray&lt;/li&gt;
&lt;li&gt;shard by &lt;em&gt;consuming&lt;/em&gt; the full stream but &lt;em&gt;processing&lt;/em&gt; only a percentage of it on each host 
(though this does not resolve the issue of seamless failover)&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;To make things even more complicated, we needed to repeat this for &lt;em&gt;each&lt;/em&gt; stream of data we were
interested in.&lt;/p&gt;

&lt;p&gt;Also, messages traveling through the system had no delivery guarantee and the responsibility of
re-queueing was placed on the client (for instance, if processing fails). This churn increased the
potential for situations that result in message loss.&lt;/p&gt;

&lt;h2&gt;Enter NSQ&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;NSQ&lt;/strong&gt; is designed to (in no particular order):&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;provide easy topology solutions that enable high-availability and eliminate SPOFs&lt;/li&gt;
&lt;li&gt;address the need for stronger message delivery guarantees&lt;/li&gt;
&lt;li&gt;bound the memory footprint of a single process (by persisting some messages to disk)&lt;/li&gt;
&lt;li&gt;greatly simplify configuration requirements for producers and consumers&lt;/li&gt;
&lt;li&gt;provide a straightforward upgrade path&lt;/li&gt;
&lt;li&gt;improve efficiency&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;To introduce some &lt;strong&gt;NSQ&lt;/strong&gt; concepts, let&amp;#8217;s start off by discussing configuration.&lt;/p&gt;

&lt;h3&gt;Simplifying Configuration and Administration&lt;/h3&gt;

&lt;p&gt;A single &lt;code&gt;nsqd&lt;/code&gt; instance is designed to handle multiple streams of data at once. Streams are called
&amp;#8220;topics&amp;#8221; and a topic has 1 or more &amp;#8220;channels&amp;#8221;. Each channel receives a &lt;em&gt;copy&lt;/em&gt; of all the
messages for a topic. In practice, a channel maps to a downstream service consuming a topic.&lt;/p&gt;

&lt;p&gt;Topics and channels all buffer data independently of each other, preventing a slow consumer from
causing a backlog for other channels (the same applies at the topic level).&lt;/p&gt;

&lt;p&gt;A channel can, and generally does, have multiple clients connected. Assuming all connected clients
are in a state where they are ready to receive messages, each message will be delivered to a random
client.  For example:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_marmk2NL0k1qj3yp2.png" alt="nsqd clients"/&gt;&lt;/p&gt;
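
&lt;p&gt;As a rough illustration of these delivery semantics, here is a minimal Python sketch. It is a toy in-process model, not how &lt;code&gt;nsqd&lt;/code&gt; is implemented; the &lt;code&gt;Topic&lt;/code&gt; class and the channel names are hypothetical. The only point is that every channel sees every message, while within a channel each message reaches a single, randomly chosen client.&lt;/p&gt;

```python
import random
from collections import defaultdict


class Topic:
    """Toy model: each channel gets a copy, one client per channel gets each message."""

    def __init__(self, rng=None):
        self.channels = defaultdict(list)  # channel name -> list of client inboxes
        self.rng = rng or random.Random()

    def subscribe(self, channel):
        """Attach a new client to a channel and return its inbox."""
        inbox = []
        self.channels[channel].append(inbox)
        return inbox

    def publish(self, message):
        for inboxes in self.channels.values():
            # one copy per channel, handed to a single random client in it
            self.rng.choice(inboxes).append(message)


topic = Topic(random.Random(0))
metrics_a = topic.subscribe("metrics")
metrics_b = topic.subscribe("metrics")
archive = topic.subscribe("archive")

for i in range(10):
    topic.publish(i)

# the two "metrics" clients share ten messages, the "archive" client gets all ten
print(len(metrics_a) + len(metrics_b), len(archive))  # prints "10 10"
```

&lt;p&gt;In the terms of the older toolchain, channels behave like &lt;code&gt;pubsub&lt;/code&gt; (every channel gets everything) and the clients of one channel behave like readers of a &lt;code&gt;simplequeue&lt;/code&gt; (they split the work).&lt;/p&gt;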

&lt;p&gt;&lt;strong&gt;NSQ&lt;/strong&gt; also includes a helper application, &lt;code&gt;nsqlookupd&lt;/code&gt;, which provides a directory service where
consumers can look up the addresses of &lt;code&gt;nsqd&lt;/code&gt; instances that provide the topics they are interested
in subscribing to. In terms of configuration, this decouples the consumers from the producers (they
both individually only need to know where to contact common instances of &lt;code&gt;nsqlookupd&lt;/code&gt;, never each
other), reducing complexity and maintenance.&lt;/p&gt;

&lt;p&gt;At a lower level each &lt;code&gt;nsqd&lt;/code&gt; has a long-lived TCP connection to &lt;code&gt;nsqlookupd&lt;/code&gt; over which it
periodically pushes its state. This data is used to inform which &lt;code&gt;nsqd&lt;/code&gt; addresses &lt;code&gt;nsqlookupd&lt;/code&gt; will
give to consumers. For consumers, an HTTP &lt;code&gt;/lookup&lt;/code&gt; endpoint is exposed for polling.&lt;/p&gt;

&lt;p&gt;To introduce a new distinct consumer of a topic, simply start up an &lt;strong&gt;NSQ&lt;/strong&gt; client configured with
the addresses of your &lt;code&gt;nsqlookupd&lt;/code&gt; instances. There are no configuration changes needed to add
either new consumers or new publishers, greatly reducing overhead and complexity.&lt;/p&gt;

&lt;p&gt;NOTE: in future versions, the heuristic &lt;code&gt;nsqlookupd&lt;/code&gt; uses to return addresses could be based on
depth, number of connected clients, or other &amp;#8220;intelligent&amp;#8221; strategies. The current implementation is
simply &lt;em&gt;all&lt;/em&gt;. Ultimately, the goal is to ensure that all producers are being read from such that
depth stays near zero.&lt;/p&gt;

&lt;p&gt;It is important to note that the &lt;code&gt;nsqd&lt;/code&gt; and &lt;code&gt;nsqlookupd&lt;/code&gt; daemons are designed to operate
independently, without communication or coordination between siblings.&lt;/p&gt;

&lt;p&gt;We also think that it&amp;#8217;s really important to have a way to view, introspect, and manage the
cluster in aggregate. We built &lt;code&gt;nsqadmin&lt;/code&gt; to do this. It provides a web UI to browse the hierarchy
of topics/channels/consumers and inspect depth and other key statistics for each layer. Additionally
it supports a few administrative commands such as removing and emptying a channel (which is a useful
tool when messages in a channel can be safely thrown away in order to bring depth back to 0).&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_mbmsd6YMfS1qj3yp2.png" alt="nsqadmin"/&gt;&lt;/p&gt;

&lt;h3&gt;Straightforward Upgrade Path&lt;/h3&gt;

&lt;p&gt;This was one of our &lt;strong&gt;highest&lt;/strong&gt; priorities. Our production systems handle a large volume of traffic,
all built upon our existing messaging tools, so we needed a way to slowly and methodically upgrade
specific parts of our infrastructure with little to no impact.&lt;/p&gt;

&lt;p&gt;First, on the message &lt;em&gt;producer&lt;/em&gt; side we built &lt;code&gt;nsqd&lt;/code&gt; to match &lt;code&gt;simplequeue&lt;/code&gt;. Specifically, nsqd
exposes an HTTP &lt;code&gt;/put&lt;/code&gt; endpoint, just like simplequeue, to POST binary data (with the one caveat
that the endpoint takes an additional query parameter specifying the &amp;#8220;topic&amp;#8221;). Services that want
to start publishing to &lt;code&gt;nsqd&lt;/code&gt; only have to make minor code changes.&lt;/p&gt;

&lt;p&gt;Second, we built libraries in both Python and Go that matched the functionality and idioms we had
been accustomed to in our existing libraries. This eased the transition on the message &lt;em&gt;consumer&lt;/em&gt;
side by limiting the code changes to bootstrapping.  All business logic remained the same.&lt;/p&gt;

&lt;p&gt;Finally, we built utilities to glue old and new components together. These are all available in the
&lt;code&gt;examples&lt;/code&gt; directory in the repository:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;code&gt;nsq_pubsub&lt;/code&gt; - expose a &lt;code&gt;pubsub&lt;/code&gt; like HTTP interface to topics in an &lt;strong&gt;NSQ&lt;/strong&gt; cluster&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nsq_to_file&lt;/code&gt; - durably write all messages for a given topic to a file&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nsq_to_http&lt;/code&gt; - perform HTTP requests for all messages in a topic to (multiple) endpoints&lt;/li&gt;
&lt;/ul&gt;&lt;h3&gt;&lt;a name="spof"&gt;&lt;/a&gt;Eliminating SPOFs&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;NSQ&lt;/strong&gt; is designed to be used in a distributed fashion. &lt;code&gt;nsqd&lt;/code&gt; clients are connected (over TCP) to
&lt;strong&gt;all&lt;/strong&gt; instances providing the specified topic. There are no middle-men, no message brokers, and no
SPOFs:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_mat85kr5td1qj3yp2.png" alt="nsq clients"/&gt;&lt;/p&gt;

&lt;p&gt;This topology eliminates the need to chain single, aggregated feeds. Instead you consume directly
from &lt;strong&gt;all&lt;/strong&gt; producers. &lt;em&gt;Technically&lt;/em&gt;, it doesn&amp;#8217;t matter which client connects to which &lt;code&gt;nsqd&lt;/code&gt;: as
long as there are enough clients connected to all producers to satisfy the volume of messages,
you&amp;#8217;re guaranteed that all will eventually be processed.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;nsqlookupd&lt;/code&gt;, high availability is achieved by running multiple instances. They don&amp;#8217;t
communicate directly to each other and data is considered eventually consistent. Consumers poll
&lt;em&gt;all&lt;/em&gt; of their configured &lt;code&gt;nsqlookupd&lt;/code&gt; instances and union the responses. Stale, inaccessible, or
otherwise faulty nodes don&amp;#8217;t grind the system to a halt.&lt;/p&gt;
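
&lt;p&gt;The &amp;#8220;poll everything and union the responses&amp;#8221; behavior can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than real client code: each lookup function is a hypothetical stand-in for an HTTP request to one &lt;code&gt;nsqlookupd&lt;/code&gt; instance, and the addresses are made up.&lt;/p&gt;

```python
def discover_producers(lookup_fns, topic):
    """Union the nsqd addresses returned by every reachable lookup source.

    Each entry in lookup_fns is a hypothetical callable standing in for an
    HTTP GET against one nsqlookupd instance; it returns (host, port) pairs.
    """
    producers = set()
    for lookup in lookup_fns:
        try:
            producers.update(lookup(topic))
        except OSError:
            # a dead or unreachable lookupd is skipped, it is not fatal
            continue
    return producers


def lookupd_a(topic):
    return [("10.0.0.1", 4150), ("10.0.0.2", 4150)]


def lookupd_b(topic):
    # stale view: this instance has not heard about 10.0.0.1 yet
    return [("10.0.0.2", 4150), ("10.0.0.3", 4150)]


def lookupd_down(topic):
    raise ConnectionRefusedError("lookupd unreachable")


found = discover_producers([lookupd_a, lookupd_down, lookupd_b], "clicks")
print(sorted(found))
# prints [('10.0.0.1', 4150), ('10.0.0.2', 4150), ('10.0.0.3', 4150)]
```

&lt;p&gt;Because the result is a set union, a stale instance can only cause a producer to be discovered later by that path, and an unreachable instance simply contributes nothing.&lt;/p&gt;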

&lt;h3&gt;&lt;a name="delivery"&gt;&lt;/a&gt;Message Delivery Guarantees&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;NSQ&lt;/strong&gt; guarantees that a message will be delivered &lt;strong&gt;at least once&lt;/strong&gt;, though duplicate messages are
possible. Consumers should expect this and de-dupe or perform &lt;a href="http://en.wikipedia.org/wiki/Idempotence"&gt;idempotent&lt;/a&gt; operations.&lt;/p&gt;

&lt;p&gt;This guarantee is enforced as part of the protocol and works as follows (assume the client has
successfully connected and subscribed to a topic):&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;client indicates they are ready to receive messages&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NSQ&lt;/strong&gt; sends a message and temporarily stores the data locally (in the event of re-queue or
 timeout)&lt;/li&gt;
&lt;li&gt;client replies FIN (finish) or REQ (re-queue) indicating success or failure respectively (if the
 client does not reply, &lt;strong&gt;NSQ&lt;/strong&gt; will time out after a configurable duration and automatically
 re-queue the message)&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;This ensures that the only edge case that would result in message loss is an unclean shutdown of an
&lt;code&gt;nsqd&lt;/code&gt; process. In that case, any messages that were in memory (or any buffered writes not flushed
to disk) would be lost.&lt;/p&gt;
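
&lt;p&gt;The three steps above can be modelled with a small in-flight table. The Python below is a simplified sketch, not the &lt;code&gt;nsqd&lt;/code&gt; implementation: the &lt;code&gt;Channel&lt;/code&gt; class, its method names, and the integer clock are assumptions made for the example.&lt;/p&gt;

```python
from collections import deque


class Channel:
    """Toy at-least-once channel: in-flight messages are kept until FIN."""

    def __init__(self, timeout):
        self.queue = deque()
        self.in_flight = {}  # message id -> (body, deadline)
        self.timeout = timeout

    def put(self, msg_id, body):
        self.queue.append((msg_id, body))

    def send(self, now):
        """Hand the next message to a client and remember it until FIN/REQ."""
        msg_id, body = self.queue.popleft()
        self.in_flight[msg_id] = (body, now + self.timeout)
        return msg_id, body

    def fin(self, msg_id):
        """Client succeeded: forget the message."""
        self.in_flight.pop(msg_id, None)

    def req(self, msg_id):
        """Client failed: put the message back on the queue."""
        body, _ = self.in_flight.pop(msg_id)
        self.queue.append((msg_id, body))

    def expire(self, now):
        """Re-queue anything a client has held past its deadline."""
        for msg_id, (body, deadline) in list(self.in_flight.items()):
            if now >= deadline:
                self.req(msg_id)


ch = Channel(timeout=60)
ch.put("a", "first")
ch.put("b", "second")

ch.send(now=0)         # "a" goes out
ch.fin("a")            # processed successfully, never seen again
ch.send(now=0)         # "b" goes out, but the client never answers
ch.expire(now=61)      # the timeout passes and "b" is queued again
print(list(ch.queue))  # prints [('b', 'second')]
```

&lt;p&gt;Note that if the client had actually processed &amp;#8220;b&amp;#8221; and only its reply was lost, the message would be delivered a second time, which is why consumers need to be idempotent.&lt;/p&gt;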

&lt;p&gt;If preventing message loss is of the utmost importance, even this edge case can be mitigated. One
solution is to stand up redundant &lt;code&gt;nsqd&lt;/code&gt; pairs (on separate hosts) that receive copies of the same
portion of messages. Because you&amp;#8217;ve written your consumers to be idempotent, doing double-time on
these messages has no downstream impact and allows the system to endure any single node failure
without losing messages.&lt;/p&gt;

&lt;p&gt;The takeaway is that &lt;strong&gt;NSQ&lt;/strong&gt; provides the building blocks to support a variety of production use
cases and configurable degrees of durability.&lt;/p&gt;

&lt;h3&gt;Bounded Memory Footprint&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;nsqd&lt;/code&gt; provides a configuration option &lt;code&gt;--mem-queue-size&lt;/code&gt; that determines the number of messages
that are kept &lt;em&gt;in memory&lt;/em&gt; for a given queue. If the depth of a queue exceeds this threshold, messages
are transparently written to disk. This bounds the number of messages a given &lt;code&gt;nsqd&lt;/code&gt; process holds in memory to
&lt;code&gt;mem-queue-size * #_of_channels_and_topics&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_mavte17V3t1qj3yp2.png" alt="message overflow"/&gt;&lt;/p&gt;

&lt;p&gt;Also, an astute observer might have identified that this is a convenient way to gain an even higher
guarantee of delivery by setting this value to something low (like 1 or even 0). The disk-backed
queue is designed to survive unclean restarts (although messages might be delivered twice).&lt;/p&gt;

&lt;p&gt;Also, related to message delivery guarantees, &lt;em&gt;clean&lt;/em&gt; shutdowns (by sending a &lt;code&gt;nsqd&lt;/code&gt; process the
TERM signal) safely persist the messages currently in memory, in-flight, deferred, and in various
internal buffers.&lt;/p&gt;

&lt;p&gt;Note that a channel whose name ends in the string &lt;code&gt;#ephemeral&lt;/code&gt; will not be buffered to disk and will
instead drop messages once its depth passes &lt;code&gt;mem-queue-size&lt;/code&gt;. This enables consumers which do not need
message guarantees to subscribe to a channel. An ephemeral channel also does not persist after
its last client disconnects.&lt;/p&gt;
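
&lt;p&gt;To make the overflow rules in this section concrete, here is a minimal Python sketch. It is an assumption-laden toy: &lt;code&gt;BoundedQueue&lt;/code&gt; is a hypothetical name, a second in-process deque stands in for the disk-backed queue, and reads simply drain memory before the overflow.&lt;/p&gt;

```python
from collections import deque


class BoundedQueue:
    """Toy queue: at most mem_queue_size messages in memory, the rest overflow."""

    def __init__(self, mem_queue_size, ephemeral=False):
        self.mem_queue_size = mem_queue_size
        self.ephemeral = ephemeral
        self.memory = deque()
        self.disk = deque()  # stand-in for the disk-backed queue
        self.dropped = 0

    def put(self, message):
        if self.mem_queue_size > len(self.memory):
            self.memory.append(message)
        elif self.ephemeral:
            self.dropped += 1          # ephemeral channels never touch disk
        else:
            self.disk.append(message)  # transparent overflow

    def get(self):
        # simplification: drain memory first, then the overflow
        source = self.memory or self.disk
        return source.popleft()


durable = BoundedQueue(mem_queue_size=2)
ephemeral = BoundedQueue(mem_queue_size=2, ephemeral=True)
for i in range(5):
    durable.put(i)
    ephemeral.put(i)

print(len(durable.memory), len(durable.disk))    # prints "2 3"
print(len(ephemeral.memory), ephemeral.dropped)  # prints "2 3"
```

&lt;p&gt;With &lt;code&gt;mem_queue_size=0&lt;/code&gt; every message in the durable queue of this sketch goes straight to the overflow, which mirrors the &amp;#8220;set it to 0&amp;#8221; trick described above.&lt;/p&gt;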

&lt;h3&gt;Efficiency&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;NSQ&lt;/strong&gt; was designed to communicate over a &amp;#8220;memcached-like&amp;#8221; command protocol with simple
size-prefixed responses. All message data is kept in the core including metadata like number of
attempts, timestamps, etc. This eliminates the copying of data back and forth from server to client,
an inherent property of the previous toolchain when re-queueing a message. This also simplifies
clients as they no longer need to be responsible for maintaining message state.&lt;/p&gt;

&lt;p&gt;Also, by reducing configuration complexity, setup and development time is greatly reduced
(especially in cases where there are &amp;gt;1 consumers of a topic).&lt;/p&gt;

&lt;p&gt;For the data protocol, we made a key design decision that maximizes performance and throughput by
pushing data to the client instead of waiting for it to pull. This concept, which we
call &lt;code&gt;RDY&lt;/code&gt; state, is essentially a form of client-side flow control.&lt;/p&gt;

&lt;p&gt;When a client connects to &lt;code&gt;nsqd&lt;/code&gt; and subscribes to a channel it is placed in a &lt;code&gt;RDY&lt;/code&gt; state of 0.
This means that no messages will be sent to the client. When a client is ready to receive messages
it sends a command that updates its &lt;code&gt;RDY&lt;/code&gt; state to some number it is prepared to handle, say 100. Without
any additional commands, 100 messages will be pushed to the client as they are available (each time
decrementing the server-side RDY count for that client).&lt;/p&gt;

&lt;p&gt;Client libraries are designed to send a command to update &lt;code&gt;RDY&lt;/code&gt; count when it reaches ~25% of the
configurable &lt;code&gt;max-in-flight&lt;/code&gt; setting (and properly account for connections to multiple &lt;code&gt;nsqd&lt;/code&gt;
instances, dividing appropriately).&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_mataigNDn61qj3yp2.png" alt="nsq protocol"/&gt;&lt;/p&gt;
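
&lt;p&gt;The &lt;code&gt;RDY&lt;/code&gt; bookkeeping can be simulated in a few lines of Python. This is a sketch under simplifying assumptions, not a real client: it talks to a single hypothetical &lt;code&gt;Connection&lt;/code&gt; object instead of a socket, ignores FIN/REQ, and does not divide &lt;code&gt;max-in-flight&lt;/code&gt; across several &lt;code&gt;nsqd&lt;/code&gt; connections.&lt;/p&gt;

```python
class Connection:
    """Server-side view of one subscribed client."""

    def __init__(self):
        self.rdy = 0            # new subscribers start at RDY 0
        self.delivered = []

    def set_rdy(self, count):
        self.rdy = count

    def push(self, message):
        """Deliver one message if the client still has RDY credit."""
        if self.rdy == 0:
            return False        # nothing is sent at RDY 0
        self.rdy -= 1           # each delivery uses up one credit
        self.delivered.append(message)
        return True


def consume(conn, messages, max_in_flight):
    """Feed messages through conn, topping RDY up at 25% of max_in_flight."""
    low_water = max_in_flight // 4
    updates = 0
    for message in messages:
        if low_water >= conn.rdy:
            conn.set_rdy(max_in_flight)  # client sends a new RDY command
            updates += 1
        conn.push(message)
    return updates


conn = Connection()
assert conn.push("too early") is False   # RDY is still 0, nothing is pushed
updates = consume(conn, range(200), max_in_flight=100)
print(len(conn.delivered), updates)      # prints "200 3"
```

&lt;p&gt;Two hundred messages flow with only three &lt;code&gt;RDY&lt;/code&gt; commands, which is the sense in which pushing with a credit count avoids a round trip per message.&lt;/p&gt;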

&lt;p&gt;This is a significant performance knob as some downstream systems are able to more-easily batch
process messages and benefit greatly from a higher &lt;code&gt;max-in-flight&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Notably, because it is both buffered &lt;em&gt;and&lt;/em&gt; push based with the ability to satisfy the need for
independent copies of streams (channels), we&amp;#8217;ve produced a daemon that behaves like &lt;code&gt;simplequeue&lt;/code&gt;
and &lt;code&gt;pubsub&lt;/code&gt; &lt;em&gt;combined&lt;/em&gt;. This is powerful in terms of simplifying the topology of our systems where
we would have traditionally maintained the older toolchain discussed above.&lt;/p&gt;

&lt;h3&gt;Go&lt;/h3&gt;

&lt;p&gt;We made a strategic decision early on to build the &lt;strong&gt;NSQ&lt;/strong&gt; core in &lt;a href="http://golang.org"&gt;Go&lt;/a&gt;. We recently blogged
about our &lt;a href="http://word.bitly.com/post/29550171827/go-go-gadget"&gt;use of Go at bitly&lt;/a&gt; and alluded to this very project - it might be helpful to browse
through that post to get an understanding of our thinking with respect to the language.&lt;/p&gt;

&lt;p&gt;Regarding &lt;strong&gt;NSQ&lt;/strong&gt;, Go channels (not to be confused with &lt;strong&gt;NSQ&lt;/strong&gt; channels) and the language&amp;#8217;s
built-in concurrency features are a perfect fit for the internal workings of &lt;code&gt;nsqd&lt;/code&gt;. We leverage buffered
channels to manage our in-memory message queues and seamlessly write overflow to disk.&lt;/p&gt;

&lt;p&gt;The standard library makes it easy to write the networking layer and client code. The built-in
memory and CPU profiling hooks highlight opportunities for optimization and require very little
effort to integrate. We also found it really easy to test components in isolation, mock types using
interfaces, and iteratively build functionality.&lt;/p&gt;

&lt;p&gt;Overall, it&amp;#8217;s been a fantastic project to use as an opportunity to really dig into the language and
see what it&amp;#8217;s capable of on a larger scale. We&amp;#8217;ve been extremely happy with our choice to use
golang, its performance, and how productive we are using it.&lt;/p&gt;

&lt;h2&gt;EOL&lt;/h2&gt;

&lt;p&gt;We&amp;#8217;ve been using &lt;strong&gt;NSQ&lt;/strong&gt; in production for several months and we&amp;#8217;re excited to &lt;a href="https://github.com/bitly/nsq"&gt;share this&lt;/a&gt; with
the open source community.&lt;/p&gt;

&lt;p&gt;Across the 13 services we&amp;#8217;ve upgraded, we&amp;#8217;re processing ~35,000 messages/second at peak through the
cluster. It has proved both performant and stable and made our lives easier operating our
production systems.&lt;/p&gt;

&lt;p&gt;There is more work to be done though — so far we&amp;#8217;ve converted ~40% of our infrastructure.
Fortunately, the upgrade process has been straightforward and well worth the short-term time
tradeoff.&lt;/p&gt;

&lt;p&gt;We&amp;#8217;re really curious to &lt;a href="http://twitter.com/imsnakes"&gt;hear what you think&lt;/a&gt;, so grab the &lt;a href="https://github.com/bitly/nsq"&gt;source from github&lt;/a&gt; and try it
out.&lt;/p&gt;

&lt;p&gt;Finally, this labor of love began as scratching an itch — bitly provided an environment to
experiment, build, and open source it&amp;#8230; &lt;a href="http://bitly.com/jobs"&gt;we&amp;#8217;re always hiring&lt;/a&gt;.&lt;/p&gt;

&lt;div class="postmeta"&gt; by &lt;a href="http://twitter.com/imsnakes"&gt;snakes&lt;/a&gt; (shoutout to &lt;a href="http://twitter.com/jehiah"&gt;jehiah&lt;/a&gt; who co-designed/developed NSQ, &lt;a href="http://twitter.com/danielhfrank"&gt;dan&lt;/a&gt; and &lt;a href="http://twitter.com/ploxiln"&gt;pierce&lt;/a&gt;
for contributing, and &lt;a href="http://twitter.com/mccutchen"&gt;mccutchen&lt;/a&gt; for tirelessly
proofreading this beast)&lt;/div&gt;</description><link>http://word.bitly.com/post/33232969144</link><guid>http://word.bitly.com/post/33232969144</guid><pubDate>Tue, 09 Oct 2012 11:15:51 -0400</pubDate><category>golang</category><category>go</category><category>simplequeue</category><category>message queue</category><category>nsq</category><category>distributed</category></item><item><title>Static Analysis, Syntax Checking, Code Formatting</title><description>&lt;p&gt;Nothing describes the need for consistent code style more than this quote from Martin Fowler: “Any fool can write code that a
computer can understand. Good programmers write code that humans can understand.”&lt;/p&gt;

&lt;p&gt;I would amend that quote though, because in reality it&amp;#8217;s hard to write good code for computers to understand, too. That
is where tools that perform Static Analysis and Syntax Checking come into play. They both help track down and identify
classes of problems that might otherwise go undetected, resulting in production issues.&lt;/p&gt;

&lt;p&gt;We want to share a few specific and practical ways we perform Static Analysis and produce consistently styled code at
bitly.&lt;/p&gt;

&lt;h3&gt;pyflakes and pep8 for Python&lt;/h3&gt;

&lt;p&gt;For python, we like to run our code through &lt;a href="https://github.com/jehiah/pyflakes"&gt;pyflakes&lt;/a&gt;. There are a few different
versions of pyflakes floating around, but we use this one because it&amp;#8217;s been rewritten to use AST so it&amp;#8217;s fast, and it
has output we like for scanning a whole directory tree.&lt;/p&gt;

&lt;p&gt;Pyflakes for us looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ pyflakes --quiet
.................F........

$ pyflakes
./graphite/conf/graphite_settings.py:23: redefinition of unused 
    'rrdtool' from line 21
&lt;/code&gt;&lt;/pre&gt;
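
&lt;p&gt;To give a feel for what an AST-based check involves, here is a minimal sketch that uses the standard library &lt;code&gt;ast&lt;/code&gt; module to find imports that are never referenced. It is not pyflakes code and is far less thorough; the &lt;code&gt;unused_imports&lt;/code&gt; function is a hypothetical name used only for this example.&lt;/p&gt;

```python
import ast


def unused_imports(source):
    """Return (line, name) for every imported name that is never referenced."""
    tree = ast.parse(source)
    imported = {}  # bound name -> line number of the import
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # "import a.b" binds "a"; "import a as x" binds "x"
                name = alias.asname or alias.name.split(".")[0]
                imported[name] = node.lineno
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted((line, name) for name, line in imported.items() if name not in used)


sample = "import os\nimport sys\n\nprint(sys.argv)\n"
print(unused_imports(sample))  # prints [(1, 'os')]
```

&lt;p&gt;Because the source is parsed rather than imported, a check like this never executes the code it inspects.&lt;/p&gt;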

&lt;p&gt;There are ways to plug this type of check into various editors. Some of us use editor plugins for &lt;a href="https://github.com/jehiah/pyflakes.tmbundle#pyflakes-textmate-bundle"&gt;Textmate&lt;/a&gt; and
&lt;a href="https://github.com/SublimeLinter/SublimeLinter"&gt;Sublime&lt;/a&gt; to get feedback immediately during development.&lt;/p&gt;

&lt;p&gt;Pyflakes is good at highlighting syntax issues, but it doesn&amp;#8217;t help resolve inconsistent python style. Do you put
whitespace around operators? Do you inline your if statements? Tabs or spaces? To address some of these, we use a
version of &lt;a href="https://github.com/jehiah/pep8"&gt;pep8&lt;/a&gt; which has the ability to automatically apply consistent formatting.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ pep8 --ignore=W293,W391 --max-line-length=120 .
./deploy_ui.py:95:66: W291 trailing whitespace
./deploy_ui.py:253:62: E225 missing whitespace around operator
./deploy_ui.py:257:5: E303 too many blank lines (2&amp;gt;1)
./settings.py:7:10: E203 whitespace before ':'

$ pep8 --fix --inplace settings.py
settings.py:9:80: E501 line too long (121 characters)
settings.py:16:7: W291 trailing whitespace
settings.py:7:10: E203 whitespace before ':'
settings.py:9:54: E261 at least two spaces before inline comment
settings.py:50:1: E302 expected 2 blank lines, found 1
 - pep8 fix: Writing changes to settings.py
 - pep8 fix: Applying fix for E302 at line 50, offset 0
 - pep8 fix: Applying fix for E261 at line 45, offset 34
 - pep8 fix: Applying fix for E203 at line 45, offset 15
 - pep8 fix: Applying fix for E203 at line 44, offset 26
 - pep8 fix: Applying fix for E261 at line 43, offset 45
 [truncated]
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Bash Syntax Checking&lt;/h3&gt;

&lt;p&gt;Bash has a built in way to syntax check a file by running &lt;code&gt;bash -n $file&lt;/code&gt;. We wrap this in a &lt;a href="https://gist.github.com/3726075"&gt;short
script&lt;/a&gt; to give us similar usage to pyflakes. If all goes well, and there are no
errors, it&amp;#8217;s one dot per file checked. The syntax checking isn&amp;#8217;t as extensive as pyflakes, but it beats finding out in
the middle of a production maintenance that the one script you needed to work has a syntax error.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ test_shell_scripts.sh --quiet
.......................................F...................

$ test_shell_scripts.sh 
./test.sh
./test.sh: line 7: unexpected EOF while looking for matching `)'
./test.sh: line 245: syntax error: unexpected end of file
&lt;/code&gt;&lt;/pre&gt;
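
&lt;p&gt;The wrapper idea is simple enough to sketch. The version below is written in Python for illustration and is not the script linked above; it assumes &lt;code&gt;bash&lt;/code&gt; is on the &lt;code&gt;PATH&lt;/code&gt;, and &lt;code&gt;check_scripts&lt;/code&gt; is a hypothetical name. It prints one marker per file in quiet mode and the offending files with their errors otherwise.&lt;/p&gt;

```python
import subprocess
import sys


def check_scripts(paths, quiet=False):
    """Run `bash -n` on each path; return the list of (path, stderr) failures."""
    failures = []
    for path in paths:
        # -n parses the script without executing any of it
        result = subprocess.run(["bash", "-n", path], capture_output=True, text=True)
        if result.returncode == 0:
            marker = "."
        else:
            marker = "F"
            failures.append((path, result.stderr.strip()))
        if quiet:
            sys.stdout.write(marker)
    if quiet:
        sys.stdout.write("\n")
    else:
        for path, stderr in failures:
            print(path)
            print(stderr)
    return failures
```

&lt;p&gt;A call such as &lt;code&gt;check_scripts(sys.argv[1:], quiet=True)&lt;/code&gt; gives the dot-per-file summary; a non-empty return value is a natural signal to fail a build.&lt;/p&gt;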

&lt;h3&gt;Go Syntax Checking&lt;/h3&gt;

&lt;p&gt;We have written previously about our experience &lt;a href="/post/29550171827/go-go-gadget"&gt;adopting golang&lt;/a&gt; as a primary language at bitly. One of the
golang choices we really like is directly related to code formatting. The Go core developers made a choice to preempt
coding style wars by designing a formatting tool &lt;code&gt;go fmt&lt;/code&gt; as part of the language distribution.&lt;/p&gt;

&lt;p&gt;Additionally, things that many languages list as warnings (or just swallow completely) are compiler errors in Go.
Examples of errors in Go are: unused variables, undefined variables, missing imports, mixing types, etc. These issues
are raised by the compiler and will keep code from being compiled. To this end, the language itself enforces static
analysis and syntax checking at compile time, which is nice because we don&amp;#8217;t need any extra tools.&lt;/p&gt;

&lt;h3&gt;C Style with astyle&lt;/h3&gt;

&lt;p&gt;For C code we don&amp;#8217;t use a static analysis tool (we use dynamic tools like &lt;a href="http://valgrind.org/"&gt;valgrind&lt;/a&gt;), but we do
programmatically apply &lt;a href="http://en.wikipedia.org/wiki/Indent_style#Variant:_1TBS"&gt;One True Brace Style&lt;/a&gt; with &lt;a href="http://astyle.sourceforge.net/"&gt;astyle&lt;/a&gt;. (We suggest a recent version of astyle from
trunk; it has bug fixes we found useful.) Our specific astyle options are applied with the following command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ astyle --style=1tbs \
    --lineend=linux \
    --convert-tabs \
    --preserve-date \
    --fill-empty-lines \
    --pad-header \
    --indent-switches \
    --align-pointer=name \
    --align-reference=name \
    --pad-oper \
    --suffix=none \
    $file
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Javascript with Closure Compiler&lt;/h3&gt;

&lt;p&gt;We use Google&amp;#8217;s &lt;a href="https://developers.google.com/closure/compiler/"&gt;Closure Compiler&lt;/a&gt; to prepare our Javascript files for production use. In addition to
concatenating and compressing our Javascript, the Closure Compiler also provides syntax and type checking, variable
reference checking, and warnings about common Javascript pitfalls:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ java -jar closure-compiler.jar \
    --js input.js \
    --js_output_file output-min.js
input.js:83: WARNING - Suspicious code. This code lacks side-effects.
        Is there a bug?
        if(!data.mode) { data.mode === "save"; }
input.js:1150: ERROR - Parse error. Internet Explorer has a
        non-standard intepretation of trailing commas. Arrays will
        have the wrong length and objects will not parse at all.
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;EOL&lt;/h3&gt;

&lt;p&gt;If this sounds like how you would like to write code, &lt;a href="http://bitly.com/jobs"&gt;bitly is hiring&lt;/a&gt;.&lt;/p&gt;

&lt;div class="postmeta"&gt; by &lt;a href="http://twitter.com/jehiah"&gt;jehiah&lt;/a&gt; (also shout out to &lt;a href="http://twitter.com/imsnakes"&gt;snakes&lt;/a&gt; for helping develop our usage of astyle) &lt;/div&gt;</description><link>http://word.bitly.com/post/31921713978</link><guid>http://word.bitly.com/post/31921713978</guid><pubDate>Thu, 20 Sep 2012 10:13:00 -0400</pubDate></item><item><title>Go Go Gadget</title><description>&lt;p&gt;bitly&amp;#8217;s stack isn&amp;#8217;t fancy. We use a combination of languages and tools that interact well with each
other and promote philosophies such as explicitness, simplicity, and robustness. Most of our
services are Python with many core components in C (of which some are open sourced under
&lt;a href="https://github.com/bitly/simplehttp"&gt;simplehttp&lt;/a&gt;). We believe whole-heartedly in a services oriented approach where components do one
thing and do it well.&lt;/p&gt;

&lt;p&gt;C is a fantastic language, one which I highly encourage &lt;em&gt;all&lt;/em&gt; developers to learn and experiment
with as it pulls away the veil and provides insight into all the things that are happening behind
those seemingly innocent lines of a scripting language. With great power comes great responsibility
and sometimes that power comes at a cost – primarily in terms of development and debugging time and
thus the time it takes for you to ship your product.&lt;/p&gt;

&lt;p&gt;Python has also treated us well. Its readability, standard library, and obviousness make it a great
language to work with on a team. It performs well and is flexible to the point where it can be used
successfully in most any situation – whether it be data science, network daemons, or one-off shell
scripts.&lt;/p&gt;

&lt;h3&gt;Enter Go&lt;/h3&gt;

&lt;p&gt;We identified early on that Go had all the makings of a language that could supersede some of the
places we would have traditionally turned to C &lt;em&gt;and&lt;/em&gt; some of the places where we wanted to move away
from Python.&lt;/p&gt;

&lt;p&gt;For example, we&amp;#8217;ve built many network daemons in C that don&amp;#8217;t require strict, absolute, control over
memory. This is where Go shines. The &amp;#8220;cost&amp;#8221; of a garbage collected language in these types of
situations is overshadowed by the huge benefit to development time and flexibility. Code becomes
more readable as a natural side effect of being able to focus more on the &amp;#8220;meat&amp;#8221; of the problem
instead of having to worry about low level details. Mix in the &amp;#8220;batteries included&amp;#8221; standard library
and you&amp;#8217;re cookin&amp;#8217;.&lt;/p&gt;

&lt;p&gt;Also, all of our C apps are built on &lt;a href="http://libevent.org"&gt;libevent&lt;/a&gt; - i.e. a single-threaded event loop with
callbacks that enables us to efficiently handle many thousands of simultaneous open connections. We
use the same technique in Python via &lt;a href="https://github.com/facebook/tornado"&gt;tornado&lt;/a&gt;, but the amount of work a single Python process
can do is limited largely by runtime overhead. Additionally, this style of code can be hard to read
and debug when you&amp;#8217;re not used to it. Go&amp;#8217;s lightweight concurrency model allows you to write in an
imperative style while providing a built-in way to leverage the multi-core architecture of the
&lt;a href="http://twitter.com/jayridge"&gt;computers&lt;/a&gt; we operate on.&lt;/p&gt;

&lt;p&gt;Just as in C, Go promotes a style of very explicit and localized error handling. While sometimes
verbose, it forces a developer to &lt;em&gt;really&lt;/em&gt; think about how their program might fail and what to do
about it. Unlike C, Go has the built-in capability of returning multiple values from a function,
resulting in clearer and more concise error-handling code.&lt;/p&gt;

&lt;p&gt;The points above are mostly low-level. Go&amp;#8217;s true power comes from the fact that the
language is small and the standard library is large. We&amp;#8217;ve had developers start from &lt;strong&gt;zero&lt;/strong&gt;
experience with Go to writing working, production, code in a day. &lt;em&gt;That&lt;/em&gt; is why we&amp;#8217;re so excited
about its place in our stack.&lt;/p&gt;

&lt;p&gt;Think about it like this&amp;#8230; if you can write something in Go just as fast as you could in Python
and:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;gain the speed and robustness of a compiled, statically typed language without &lt;em&gt;all&lt;/em&gt; of the rope
to hang yourself&lt;/li&gt;
&lt;li&gt;clearly express concurrent solutions to parallelizable problems&lt;/li&gt;
&lt;li&gt;sacrifice little to nothing in terms of functionality&lt;/li&gt;
&lt;li&gt;unambiguously produce consistently styled code (thank you &lt;code&gt;go fmt&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;What would you choose?&lt;/p&gt;

&lt;p&gt;And in an effort to encourage the team to experiment and learn about Go we created the &amp;#8220;Go Gopher
Award&amp;#8221;! The most recent engineer to learn the language and get code into production gets to have the
&lt;a href="http://www.googlestore.com/Fun/Go+Gopher+Squishable.axd"&gt;Go Gopher Squishable&lt;/a&gt; sit with them at their desk. This is an &lt;em&gt;extremely&lt;/em&gt; prestigious award.&lt;/p&gt;

&lt;h3&gt;Go Isn&amp;#8217;t Perfect&lt;/h3&gt;

&lt;p&gt;Let&amp;#8217;s talk about garbage collection. It isn&amp;#8217;t &lt;em&gt;free&lt;/em&gt;. Go&amp;#8217;s garbage collector is a conservative,
stop-the-world, mark-and-sweep GC. Practically speaking this means that the more careless you are in
what, how many, and how often you allocate the longer the world stops when the Go runtime cleans
things up. Unsurprisingly, significant performance gains can be had by being more careful&amp;#8230; not
only in raw speed but in the consistency of it. Fortunately Go provides fantastic hooks (&lt;a href="http://golang.org/pkg/runtime/pprof/"&gt;pprof&lt;/a&gt;)
to introspect the runtime behavior of your application. It&amp;#8217;s generally obvious when your bottleneck
is the GC.&lt;/p&gt;

&lt;p&gt;Interacting with JSON can be a bit clumsy. It&amp;#8217;s not &lt;em&gt;terrible&lt;/em&gt;&amp;#8230; just isn&amp;#8217;t as elegant and readable
as the alternative in Python. Truthfully it&amp;#8217;s quite impressive that it&amp;#8217;s as good as it is
considering the type safety of the language. We&amp;#8217;ve made an effort to make this even easier and open
sourced &lt;a href="https://github.com/bitly/go-simplejson"&gt;simplejson&lt;/a&gt;, a package that exposes a clean, chainable API for interacting with JSON
where the format may not be known ahead of time.&lt;/p&gt;

&lt;p&gt;Packaging has also given us some trouble. This is an area where we&amp;#8217;ve probably gone the most
back-and-forth and are admittedly still in the early stages of finding what works for us. The
biggest problem is somewhat related to its purported greatest strength, i.e. qualifying the full
path when importing an open-source package. This has led to a whole crop of third-party utilities
that work around this by re-writing fully qualified import paths to something the user specifies
(and installing the package as the same, modified, name). We use &lt;a href="https://github.com/mreiferson/go-install-as"&gt;go-install-as&lt;/a&gt; and our policy
is to install non-standard packages under our own custom import paths to ensure that our apps are
&lt;em&gt;always&lt;/em&gt; compiled against the version and source we&amp;#8217;ve strictly enforced to be installed on a
machine.&lt;/p&gt;

&lt;p&gt;One notable feature missing from the HTTP client in Go&amp;#8217;s standard library is the ability to set
timeouts. It is further complicated by the inability of the API to provide easy access to the
connection (to do it yourself). We set out to bridge the gap and wrote &lt;a href="https://github.com/mreiferson/go-httpclient"&gt;go-httpclient&lt;/a&gt; (with the
expectation that these features will eventually make it into the standard library).&lt;/p&gt;

&lt;p&gt;Oh, and one night after way too much drinking the Go authors decided to use &lt;a href="http://golang.org/src/pkg/time/format.go#L58"&gt;a radical syntax&lt;/a&gt; for
formatting time as strings (instead of your standard &lt;code&gt;strftime&lt;/code&gt; parameters). We open sourced
&lt;a href="https://github.com/jehiah/go-strftime"&gt;go-strftime&lt;/a&gt; to remedy that.&lt;/p&gt;

&lt;p&gt;All in all these &amp;#8220;issues&amp;#8221; are minor and ultimately we felt that the benefits far outweigh the
tradeoffs – Go meets our needs and philosophies as software engineers.&lt;/p&gt;

&lt;h3&gt;Go In Production&lt;/h3&gt;

&lt;p&gt;Where are we &lt;em&gt;actually&lt;/em&gt; using Go in production? Let&amp;#8217;s take a step back and talk about how data moves
through our infrastructure first.&lt;/p&gt;

&lt;p&gt;In the spirit of keeping things simple, fault tolerant, and asynchronous, most everything that gets
written to downstream systems is done so via message queues (formatted in JSON with all
communication over HTTP). Some of these message queues process volumes in the thousands per second.
We call the workers &lt;code&gt;queuereaders&lt;/code&gt;, all of which were originally written in Python. At the lower
level are two C apps – &lt;a href="https://github.com/bitly/simplehttp/blob/master/simplequeue/simplequeue.c"&gt;simplequeue&lt;/a&gt;, an HTTP interface to an in-memory data agnostic queue, and
&lt;a href="https://github.com/bitly/simplehttp/blob/master/pubsub/pubsub.c"&gt;pubsub&lt;/a&gt;, an HTTP interface for one-to-many streaming.&lt;/p&gt;

&lt;p&gt;As volume increases, the overhead of processing each message becomes more significant. Therefore,
these applications emerged as one key area where Go could replace Python. We started by re-writing
some queuereaders in Go by building matching libraries to support the same API we expose in Python.
Also, whereas in Python we would run multiple processes, in Go we leverage channels and goroutines
to write straightforward code that parallelizes HTTP requests to downstream systems or aggregates
writes to disk. These same features make it really easy to batch retrieve messages from the
queue while still processing them independently, reducing round-trips.&lt;/p&gt;

&lt;p&gt;By experimenting with non-critical systems at the edge of our infrastructure we were able to safely
and methodically learn about Go&amp;#8217;s behavior in production. Needless to say, we&amp;#8217;ve been very happy.
We&amp;#8217;ve seen consistent, measurable performance gains without sacrificing stability.&lt;/p&gt;

&lt;p&gt;We also wanted a bit more functionality out of the core combination of &lt;code&gt;simplequeue&lt;/code&gt; and &lt;code&gt;pubsub&lt;/code&gt;,
so we decided to build a successor with the goal of improving message delivery guarantees,
maintaining the spirit and simplicity of what we already had, increasing performance, solving some
configuration and setup issues, and providing a straightforward upgrade path.&lt;/p&gt;

&lt;p&gt;We realized Go would be &lt;em&gt;perfect&lt;/em&gt; for this project. We&amp;#8217;ll talk about it in depth in a future blog
post (and don&amp;#8217;t worry, it will be open source). Is it any good? Yes.&lt;/p&gt;

&lt;h2&gt;EOL&lt;/h2&gt;

&lt;p&gt;Finally, we&amp;#8217;ve also open sourced a few other Go related items&amp;#8230;&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/bitly/go-notify"&gt;go-notify&lt;/a&gt; - a package to standardize one-to-many communication over channels&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bitly/file2http"&gt;file2http&lt;/a&gt; - a utility to spray a line-oriented file at an HTTP endpoint&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;As always, &lt;a href="http://bitly.com/jobs"&gt;bitly is hiring&lt;/a&gt;.&lt;/p&gt;

&lt;div class="postmeta"&gt; by &lt;a href="http://twitter.com/imsnakes"&gt;snakes&lt;/a&gt; (shoutouts to &lt;a href="http://twitter.com/jehiah"&gt;jehiah&lt;/a&gt;, &lt;a href="http://twitter.com/danielhfrank"&gt;dan&lt;/a&gt;, &lt;a href="http://twitter.com/mccutchen"&gt;mccutch&lt;/a&gt;, &lt;a href="http://twitter.com/mynameisfiber"&gt;micha&lt;/a&gt;, and &lt;a href="http://twitter.com/ploxiln"&gt;pierce&lt;/a&gt; for testing the Go waters with me – &amp;#8220;if you&amp;#8217;re headed
down a rabbit hole, remember to bring a friend&amp;#8221; ©)&lt;/div&gt;</description><link>http://word.bitly.com/post/29550171827</link><guid>http://word.bitly.com/post/29550171827</guid><pubDate>Thu, 16 Aug 2012 09:42:32 -0400</pubDate><category>golang</category><category>infrastructure</category><category>python</category><category>c</category></item><item><title>dablooms - an open source, scalable, counting bloom filter library</title><description>&lt;h3&gt;Overview&lt;/h3&gt;

&lt;p&gt;This project aims to demonstrate a novel bloom filter implementation that can
scale, and provide not only the addition of new members, but reliable removal
of existing members.&lt;/p&gt;

&lt;h3&gt;Motivation&lt;/h3&gt;

&lt;p&gt;bitly has billions and billions of links and an infrastructure that serves
thousands of decode requests per second, i.e. redirect the user from &lt;code&gt;bit.ly/short&lt;/code&gt;
to the &lt;code&gt;long.destination.url.com/with/path&lt;/code&gt;. Users not only expect this redirect
to be fast, but they also trust bitly to not redirect them to a malicious website.
Unfortunately, a subset of destination urls are malicious, and we identify and
mark them as spam via automated, realtime analysis. This spam dataset is constantly
changing with an increasing rate of additions and a small number of deletions. In
order to maintain performance and provide a warning to users who might be going to a
spam website, we have to keep the content of the spam dataset in resident memory
for each decode request. This is complicated by the fact that the system is designed to
support both domain and path roll up, meaning we can block an entire domain, subdomain,
or top-level path (or any combination thereof) thus marking all children of that parent as
spam. Previously, we had implemented this as a simple hashtable with existence in the table
equating to membership in the spam dataset. This approach was simple and performant, but
not very memory efficient. On the occasions when resident memory was cleared, say on a
restart, it was necessary to reconstruct the dataset from the database. This made restarts
painfully long during the dataset reconstruction, and the hashtable took up considerable
amounts of memory, 2.5 gigabytes to be exact.&lt;/p&gt;

&lt;p&gt;We began exploring bloom filters as a possible solution to our spam problems.
Despite the many bloom filter variations out there - counting, compact,
inverse, scalable, and bloomier - none of them met our criteria entirely.  We needed
the ability to scale and persist reliably while allowing for the removal
of elements.  It appeared that only a mixture of scalable and counting (elaborated on below)
would solve our problems.&lt;/p&gt;

&lt;h2&gt;Bloom Filter Basics&lt;/h2&gt;

&lt;p&gt;Bloom filters are probabilistic data structures that provide
space-efficient storage of elements at the cost of occasional false positives on
membership queries, i.e. a bloom filter may state true on query when it in fact does
not contain said element. A bloom filter is traditionally implemented as an array of
&lt;code&gt;M&lt;/code&gt; bits, where &lt;code&gt;M&lt;/code&gt; is the size of the bloom filter. On initialization all bits are
set to zero. A filter is also parameterized by a constant &lt;code&gt;k&lt;/code&gt; that defines the number
of hash functions used to set and test bits in the filter.  Each hash function should
output one index in &lt;code&gt;M&lt;/code&gt;.  When inserting an element &lt;code&gt;x&lt;/code&gt; into the filter, the bits
in the &lt;code&gt;k&lt;/code&gt; indices &lt;code&gt;h1(x), h2(x), ..., hk(x)&lt;/code&gt; are set.&lt;/p&gt;

&lt;p&gt;In order to query a bloom filter, say for element &lt;code&gt;x&lt;/code&gt;, it suffices to verify if
all bits in indices &lt;code&gt;h1(x), h2(x), ..., hk(x)&lt;/code&gt; are set. If one or more of these
bits is not set then the queried element is definitely not present in the
filter. However, if all these bits are set, then the element is considered to
be in the filter. Given this procedure, an error probability exists for positive
matches, since the tested indices might have been set by the insertion of other
elements.&lt;/p&gt;
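
&lt;p&gt;To make the insert and query procedure concrete, here is a minimal sketch in C. It is not taken
from any particular library: the sizes &lt;code&gt;BLOOM_M&lt;/code&gt; and &lt;code&gt;BLOOM_K&lt;/code&gt;, the FNV-1a style hash, and the
trick of deriving the &lt;code&gt;k&lt;/code&gt; indices from two base hashes are all assumptions chosen to keep the
example short.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#include &lt;stdint.h&gt;

#define BLOOM_M 1024u   /* number of bits in the filter (illustrative) */
#define BLOOM_K 4u      /* number of hash functions (illustrative) */

struct bloom {
    uint8_t bits[BLOOM_M / 8];   /* all zero on initialization */
};

/* FNV-1a style string hash, used only as a convenient stand-in. */
static uint32_t bloom_hash(const char *s, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    for (; *s != '\0'; s++) {
        h ^= (uint8_t)*s;
        h *= 16777619u;
    }
    return h;
}

/* Index of the i-th hash function: h_i(x) = (a + i * b) mod M. */
static uint32_t bloom_index(const char *x, uint32_t i)
{
    uint32_t a = bloom_hash(x, 0u);
    uint32_t b = bloom_hash(x, 0x9e3779b9u) | 1u;   /* odd, so the k indices differ */
    return (a + i * b) % BLOOM_M;
}

/* Insert: set the bit at each of the k indices. */
void bloom_add(struct bloom *f, const char *x)
{
    for (uint32_t i = 0; i &lt; BLOOM_K; i++) {
        uint32_t idx = bloom_index(x, i);
        f-&gt;bits[idx / 8] |= (uint8_t)(1u &lt;&lt; (idx % 8));
    }
}

/* Query: 0 means definitely absent, 1 means probably present. */
int bloom_check(const struct bloom *f, const char *x)
{
    for (uint32_t i = 0; i &lt; BLOOM_K; i++) {
        uint32_t idx = bloom_index(x, i);
        if ((f-&gt;bits[idx / 8] &amp; (1u &lt;&lt; (idx % 8))) == 0) {
            return 0;
        }
    }
    return 1;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A zero-initialized &lt;code&gt;struct bloom&lt;/code&gt; answers 0 for every query, and after &lt;code&gt;bloom_add()&lt;/code&gt; the same
element always answers 1. Only the &amp;#8220;probably present&amp;#8221; answer can be wrong.&lt;/p&gt;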

&lt;h3&gt;Counting Bloom Filters: Solving Removals&lt;/h3&gt;

&lt;p&gt;The same property that results in false positives &lt;em&gt;also&lt;/em&gt; makes it
difficult to remove an element from the filter as there is no
easy means of discerning if another element is hashed to the same bit.
Unsetting a bit that is hashed by multiple elements can cause &lt;strong&gt;false
negatives&lt;/strong&gt;.  Using a counter, instead of a bit, can circumvent this issue.
The counter is incremented when an element is hashed to a
given location, and decremented upon removal.  Membership queries rely on whether a
given counter is greater than zero.  The trade-off is that counters take more room than
single bits, which reduces the exceptional
space-efficiency provided by the standard bloom filter.&lt;/p&gt;
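
&lt;p&gt;As a rough illustration of counters replacing bits, the sketch below uses one byte per counter
for readability. The names, sizes, and hash are hypothetical, as is the choice to stop
incrementing at the maximum value rather than wrap around.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#include &lt;stdint.h&gt;

#define CBF_M 1024u   /* number of counters (illustrative) */
#define CBF_K 4u      /* number of hash functions (illustrative) */

struct counting_bloom {
    uint8_t counters[CBF_M];   /* one byte per counter, all zero initially */
};

/* FNV-1a style string hash, used only as a convenient stand-in. */
static uint32_t cbf_hash(const char *s, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    for (; *s != '\0'; s++) {
        h ^= (uint8_t)*s;
        h *= 16777619u;
    }
    return h;
}

static uint32_t cbf_index(const char *x, uint32_t i)
{
    uint32_t a = cbf_hash(x, 0u);
    uint32_t b = cbf_hash(x, 0x9e3779b9u) | 1u;   /* odd, so the k indices differ */
    return (a + i * b) % CBF_M;
}

/* Addition increments each of the k counters. */
void cbf_add(struct counting_bloom *f, const char *x)
{
    for (uint32_t i = 0; i &lt; CBF_K; i++) {
        uint32_t idx = cbf_index(x, i);
        if (f-&gt;counters[idx] &lt; UINT8_MAX) {   /* overflow guard (assumption) */
            f-&gt;counters[idx]++;
        }
    }
}

/* Removal decrements them; only call this for elements that were added. */
void cbf_remove(struct counting_bloom *f, const char *x)
{
    for (uint32_t i = 0; i &lt; CBF_K; i++) {
        uint32_t idx = cbf_index(x, i);
        if (f-&gt;counters[idx] &gt; 0) {
            f-&gt;counters[idx]--;
        }
    }
}

/* Membership: every counter must be greater than zero. */
int cbf_check(const struct counting_bloom *f, const char *x)
{
    for (uint32_t i = 0; i &lt; CBF_K; i++) {
        if (f-&gt;counters[cbf_index(x, i)] == 0) {
            return 0;
        }
    }
    return 1;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Adding two elements and then removing one leaves the other still reported as present, which is
exactly the property a plain bit array cannot guarantee.&lt;/p&gt;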

&lt;h3&gt;Scalable Bloom Filters: Solving Scale&lt;/h3&gt;

&lt;p&gt;Another important property of a bloom filter is its linear relationship between size
and storage capacity.  If given a maximum acceptable false positive
ratio, it is straightforward to determine how much space is needed.&lt;/p&gt;

&lt;p&gt;If the maximum allowable error probability and the number of elements to store
are both known, it is relatively straightforward to dimension an appropriate
filter. However, it is not always possible to know how many elements
will need to be stored a priori. There is a trade off between over-dimensioning a filter and
suffering from a ballooning error probability as it fills.&lt;/p&gt;

&lt;p&gt;Almeida, Baquero, Preguiça, and Hutchison published a paper in 2006 on
&lt;a href="http://www.sciencedirect.com/science/article/pii/S0020019006003127"&gt;Scalable Bloom Filters&lt;/a&gt;,
which suggested making bloom filters scalable by creating what is essentially
a list of bloom filters that act as one large bloom filter. When greater
capacity is desired, a new filter is added to the list.&lt;/p&gt;

&lt;p&gt;Membership queries are conducted on each filter, and the result is positive
if the element is found in any one of the filters.  Naively, this
leads to a compounding error probability, since the error probability
of the whole structure, with per-filter error probability &lt;code&gt;P&lt;/code&gt;, evaluates to:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;1 - ∏(1 - P)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It is possible to bound this error probability by adding a tightening
ratio, &lt;code&gt;r&lt;/code&gt;, so that filter &lt;code&gt;i&lt;/code&gt; in the list is given the smaller error probability
&lt;code&gt;P0 * r^i&lt;/code&gt;. As a result, the bounded error probability is represented as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;1 - ∏(1 - P0 * r^i) where r is chosen as 0 &amp;lt; r &amp;lt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Since size is simply a function of an error probability and capacity, any
array of growth functions can be applied to scale the size of the bloom filter
as necessary.  We found it sufficient to pick 0.9 for &lt;code&gt;r&lt;/code&gt;.&lt;/p&gt;
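
&lt;p&gt;The effect of the tightening ratio is easy to check numerically. The helper below is only a
sketch (the function name is made up) and assumes the filters err independently, so the compound
error is one minus the product of each filter&amp;#8217;s chance of not producing a false positive. Passing
&lt;code&gt;r = 1.0&lt;/code&gt; reproduces the naive case where every filter has the same error probability.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;/* Compound false positive probability of n stacked filters,
 * where filter i has error probability p0 * r^i. */
double compound_error(double p0, double r, int n)
{
    double none_fire = 1.0;   /* chance that no filter gives a false positive */
    double p = p0;            /* error probability of the current filter */

    for (int i = 0; i &lt; n; i++) {
        none_fire *= 1.0 - p;
        p *= r;               /* tighten the next filter */
    }
    return 1.0 - none_fire;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With &lt;code&gt;p0 = 0.01&lt;/code&gt; and 100 filters, the naive case climbs to roughly 0.63, while &lt;code&gt;r = 0.9&lt;/code&gt; keeps
the result below &lt;code&gt;p0 / (1 - r) = 0.1&lt;/code&gt; no matter how many filters are added, because the per-filter
probabilities form a geometric series.&lt;/p&gt;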

&lt;h2&gt;Problems with Mixing Scalable and Counting Bloom Filters&lt;/h2&gt;

&lt;p&gt;Scalable bloom filters do not allow for the removal of elements from the filter.
In addition, simply converting each bloom filter in a scalable bloom filter into
a counting filter also poses problems. Since an element can be in any filter, and
bloom filters inherently allow for false positives, a given element may appear to
be in two or more filters. If an element is inadvertently removed from a filter
which did not contain it, it would introduce the possibility of &lt;strong&gt;false negatives&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If however, an element can be removed from the correct filter, it maintains
the integrity of said filter, i.e. prevents the possibility of false negatives. Thus,
a scaling, counting, bloom filter is possible if upon additions and deletions
one can correctly decide which bloom filter contains the element.&lt;/p&gt;

&lt;p&gt;There are several advantages to using bloom filters.  A bloom filter
gives us cheap, memory efficient set operations, with no actual data stored
about the given element. Rather, bloom filters allow us to test, with some
given error probability, the membership of an item.  This leads to the
conclusion that the majority of operations performed on bloom filters are
membership queries, rather than the addition and removal of elements.  Thus,
for a scaling, counting, bloom filter, we can optimize for membership queries at
the expense of additions and removals.  This expense comes not in performance,
but in the addition of more metadata concerning an element and its relation to
the bloom filter.  With the addition of some sort of identification of an
element, which does not need to be unique as long as it is fairly distributed, it
is possible to correctly determine which filter an element belongs to, thereby
maintaining the integrity of a given bloom filter with accurate additions
and removals.&lt;/p&gt;

&lt;h2&gt;Enter &lt;a href="http://github.com/bitly/dablooms"&gt;dablooms&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;a href="http://github.com/bitly/dablooms"&gt;dablooms&lt;/a&gt; is one such implementation of a
scaling, counting, bloom filter. During additions and deletions it takes additional metadata
in the form of a monotonically increasing integer, such as a timestamp, that identifies
the element. This is used during additions/removals to easily
classify an element into the correct bloom filter (essentially a comparison against a
range).&lt;/p&gt;

&lt;p&gt;&lt;a href="http://github.com/bitly/dablooms"&gt;dablooms&lt;/a&gt; is designed to scale itself using these monotonically increasing identifiers
and the given capacity. When a bloom filter is at capacity, &lt;a href="http://github.com/bitly/dablooms"&gt;dablooms&lt;/a&gt; will create a new
bloom filter using the to-be-added element&amp;#8217;s identifier as the beginning identifier for
the new bloom filter. Given the fact that the identifiers monotonically increase, new
elements will be added to the newest bloom filter. Note, in theory and as implemented,
nothing prevents one from adding an element to any &amp;#8220;older&amp;#8221; filter. You just run the
increasing risk of the error probability growing beyond the bound as it becomes
&amp;#8220;overfilled&amp;#8221;.&lt;/p&gt;

&lt;p&gt;You can then remove any element from any bloom filter using the identifier to intelligently
pick which bloom filter to remove from.  Consequently, as you continue to remove elements
from bloom filters that you are not continuing to add to, these bloom filters will become
more accurate.&lt;/p&gt;
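
&lt;p&gt;The &amp;#8220;comparison against a range&amp;#8221; can be pictured with the following sketch. It is not the
dablooms source: &lt;code&gt;struct filter_range&lt;/code&gt; and &lt;code&gt;pick_filter()&lt;/code&gt; are hypothetical names, and the sketch
assumes the filters are kept in creation order so that their beginning identifiers ascend.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

struct filter_range {
    uint64_t first_id;   /* identifier of the element that started this filter */
};

/* Returns the index of the filter responsible for id: the newest filter
 * whose first_id is not greater than id. Identifiers older than every
 * filter fall into filter 0. Assumes count is at least 1. */
size_t pick_filter(const struct filter_range *ranges, size_t count, uint64_t id)
{
    size_t chosen = 0;

    for (size_t i = 1; i &lt; count; i++) {
        if (ranges[i].first_id &lt;= id) {
            chosen = i;
        }
    }
    return chosen;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the same identifier is supplied on removal, the lookup lands on the same filter that
received the element when it was added.&lt;/p&gt;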

&lt;p&gt;For performance, the low-level operations are implemented in C.  It is also
memory mapped which provides async flushing and persistence at low cost.
In an effort to maintain memory efficiency, rather than using integers, or
even whole bytes as counters, we use only four bit counters. These four bit
counters allow up to 15 items to share a counter in the map. If more than a
small handful are sharing said counter, the bloom filter would be overloaded
(resulting in excessive false positives anyway) at any sane error rate, so
there is no benefit in supporting larger counters.&lt;/p&gt;
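
&lt;p&gt;Packing two four bit counters into each byte might look like the sketch below. The helper names
are invented for illustration, and holding a counter at 15 instead of letting it wrap is an
assumption, not a statement about how dablooms behaves.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Two counters per byte: even indices use the low nibble, odd the high. */
unsigned nibble_get(const uint8_t *map, size_t idx)
{
    uint8_t byte = map[idx / 2];
    return (idx % 2 == 0) ? (unsigned)(byte &amp; 0x0Fu) : (unsigned)(byte &gt;&gt; 4);
}

static void nibble_set(uint8_t *map, size_t idx, unsigned value)
{
    uint8_t *byte = &amp;map[idx / 2];
    if (idx % 2 == 0) {
        *byte = (uint8_t)((*byte &amp; 0xF0u) | (value &amp; 0x0Fu));
    } else {
        *byte = (uint8_t)((*byte &amp; 0x0Fu) | ((value &amp; 0x0Fu) &lt;&lt; 4));
    }
}

void nibble_increment(uint8_t *map, size_t idx)
{
    unsigned v = nibble_get(map, idx);
    if (v &lt; 15u) {              /* hold at the maximum (assumption) */
        nibble_set(map, idx, v + 1u);
    }
}

void nibble_decrement(uint8_t *map, size_t idx)
{
    unsigned v = nibble_get(map, idx);
    if (v &gt; 0u) {
        nibble_set(map, idx, v - 1u);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Compared with one byte per counter this halves the size of the map, at the cost of a little bit
twiddling on every access.&lt;/p&gt;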

&lt;p&gt;The bloom filter also employs change sequence numbers to track operations performed on the bloom
filter. These allow us to detect failed writes, which leave us with a &amp;#8216;dirty&amp;#8217; filter where an
element is only partially written.  Upon restart, this information allows us to determine if a
filter is clean and thus can just be loaded into memory locally rather than having to be
rebuilt. The sequence numbers also provide a means for us to identify a position
at which the bloom filter is valid in order to replay operations to &amp;#8220;catch up&amp;#8221;.&lt;/p&gt;

&lt;h2&gt;EOL&lt;/h2&gt;

&lt;p&gt;For our spam dataset at scale, initial tests have shown a significant reduction (98.8%)
in resident memory usage on each machine using &lt;a href="http://github.com/bitly/dablooms"&gt;dablooms&lt;/a&gt;
while maintaining a rigid performance threshold. For our use of an identifier,
we simply use timestamps for when the spam link
was added and query our database on removals for the insertion timestamp.&lt;/p&gt;

&lt;p&gt;We have also decided to &lt;strong&gt;open source&lt;/strong&gt; this project, so fork it and submit
patches at the &lt;a href="http://github.com/bitly/dablooms"&gt;github repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If this sounds like a fun project to use or to work on, &lt;a href="http://bitly.com/jobs"&gt;bitly is hiring&lt;/a&gt;,
including interns!&lt;/p&gt;

&lt;div class="postmeta"&gt; by &lt;a href="http://twitter.com/jphines"&gt;jphines&lt;/a&gt;, a hackNY intern,
(also shout out to &lt;a href="http://twitter.com/imsnakes"&gt;snakes&lt;/a&gt; and
&lt;a href="http://twitter.com/ploxiln"&gt;pierce&lt;/a&gt; who provided significant feedback
and helped in the design and development of dablooms). &lt;/div&gt;</description><link>http://word.bitly.com/post/28558800777</link><guid>http://word.bitly.com/post/28558800777</guid><pubDate>Thu, 02 Aug 2012 11:25:06 -0400</pubDate><category>bloom filter</category><category>scalable</category><category>spam</category><category>bitly</category><category>open source</category></item><item><title>Debugging a Specialized Database Cache</title><description>&lt;p&gt;This blog post is a tale of interesting patterns in system and application metrics,
optimizing memory use, and debugging. Before we get into that stuff, it helps to be
familiar with the application at the center of it.&lt;/p&gt;

&lt;h2&gt;Introduction to the Network view&lt;/h2&gt;

&lt;p&gt;A significant new feature of Bitly&amp;#8217;s recent consumer relaunch is the ability to browse
links shortened by all your connections on Facebook and Twitter, on a page titled
&amp;#8220;Your Network&amp;#8221;. Internally we call this service &amp;#8220;network history&amp;#8221;. Looking up all your
connections and all their links every time you view the &amp;#8220;Network&amp;#8221; page is, due to Bitly&amp;#8217;s
scale, a non-trivial problem. Instead we structured the system to do the hard work at
insertion time, to make the read side performant. In a minute, we&amp;#8217;ll dig into how exactly
we tackled that problem.&lt;/p&gt;

&lt;p&gt;We use a key-value store (&lt;a href="http://code.google.com/p/leveldb/"&gt;leveldb&lt;/a&gt; via
&lt;a href="https://github.com/bitly/simplehttp"&gt;simpleleveldb&lt;/a&gt;) where the key is the username and the
value is the list of bitly links created by their connections. Such an arrangement is very
efficient in terms of storage, memory, and cpu usage. However, based on the number of
connections a user has, the number of updates we have to perform could be significant
(thousands) per new Bitly link. We needed a way to gain control over the number of writes to
our persistent datastore (and thus disk seeks) and provide the performance needed to handle
many thousands of inserts per second - enter &lt;strong&gt;nhproxy&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;nhproxy mechanics&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;nhproxy&lt;/strong&gt;, the caching proxy to the network history database, maintains in-memory:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;a hash-table of all users for which it is caching network history&lt;/li&gt;
&lt;li&gt;a hash-table of all links in users&amp;#8217; network history&lt;/li&gt;
&lt;li&gt;attached to each user, a list of pointers to links, which is the network history&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;The actual link metadata is stored in a separate global hash-table so that it is not
duplicated when multiple users have the same link in their history. Each user only
stores a pointer into the global data structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;nhproxy&lt;/strong&gt; serves two main request types:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;em&gt;get&lt;/em&gt; requests specify a username to retrieve the full Network History for. If the
user&amp;#8217;s network history is already in memory it is quickly returned, otherwise it is
fetched from the database and loaded into memory and then returned.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;add&lt;/em&gt; requests specify a bitly link and a list of all the users for which the link
will appear (all connections of the creator of the link). All users who aren&amp;#8217;t
already loaded into memory are fetched from the database and loaded into memory, and
then a reference to the link is added to all their network history lists.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;There are simple strategies for:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&amp;#8220;pruning&amp;#8221; old network history list elements when the list grows beyond a set maximum&lt;/li&gt;
&lt;li&gt;saving a user&amp;#8217;s changed network history to the database after a while&lt;/li&gt;
&lt;li&gt;discarding the user&amp;#8217;s record and history from memory to keep the total number of users
in the cache below another set maximum.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;It&amp;#8217;s interesting to note that the longer one delays saving a user&amp;#8217;s changed network history
to the persistent datastore, the more changes it accumulates per save, and the lower the
overall rate of writes to the store. One can control that rate of writes by adjusting the
delay. Don&amp;#8217;t worry about what exactly these strategies are, as I&amp;#8217;ll be focussing on other
aspects of nhproxy in this article.&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;nhproxy&lt;/strong&gt; memory load&lt;/h2&gt;

&lt;p&gt;In the initial deployed state, the memory usage of a box running both nhproxy and the
database looked like this:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m6jjhyIL1F1rsyjyu.png" alt="memory graph 1"/&gt;&lt;/p&gt;

&lt;p&gt;The reason it fell each night before exhausting all memory is because we cleanly terminated
and restarted nhproxy with a cron job. nhproxy saves and reloads everything it&amp;#8217;s caching
when it restarts, so no un-flushed changes are lost. Still, we would much rather not need
to have it regularly restarted. Also, notice that the data in nhproxy just before exiting
and just after starting again is exactly the same (besides change timestamps the flushing
logic uses), but uses up very different amounts of physical memory.&lt;/p&gt;

&lt;p&gt;There had already been some testing and analysis with valgrind; it wasn&amp;#8217;t a simple memory
leak in nhproxy. One theory was that the problem was fragmentation: the multitude of small
memory regions allocated and deallocated left awkwardly sized regions free that were too
small to be reused. We considered trying an alternative allocator like
&lt;a href="http://www.canonware.com/jemalloc/"&gt;jemalloc&lt;/a&gt; which has fancier strategies for dealing
with fragmentation, but decided to try some less drastic measures first.&lt;/p&gt;

&lt;p&gt;After a bit of thought I realized that many of the small regions we allocated were
dominated by the overhead of malloc itself - the extra space from rounding the allocated
region up to a well-aligned size, and the metadata of the region size (typically prepended
to the region). This most likely wouldn&amp;#8217;t fix the unbounded memory growth problem, but I
couldn&amp;#8217;t resist first going after some low-hanging fruit to make overall memory use and
fragmentation significantly better in the short term.&lt;/p&gt;

&lt;p&gt;Initially, the data in memory was represented by structs that looked something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;struct user {
    char *username;
    struct linked_list *history;
    // plus a few miscellaneous flags and counters
};

struct linked_list {
    struct linked_list *next;
    struct linked_list *prev;
    struct bitly_link *link;
};

struct bitly_link {
    char *username;
    char *hash;
    int ref_count;
    // plus a few miscellaneous flags and counters
};
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That&amp;#8217;s a simplification, but what&amp;#8217;s accurate is that there was a separate allocation for
each username, each bitly link hash, and each link reference in a history list. That is
quite a lot of small allocations, with a lot of overhead. For example, each bitly_link hash
is (currently) 6 characters, so we malloc()ed 7 bytes to store the string and null
terminator, which malloc() rounded up to 8 bytes, and then prepended with the length in
another 8 bytes, to end up as a 16 byte allocation. Also, I suspected that the variable
size of the username allocation was a significant cause of fragmentation.&lt;/p&gt;

&lt;p&gt;So I refactored the code to use structures like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;struct user {
    char username[MAX_USERNAME_LEN+1];
    struct bitly_link *history[MAX_HISTORY_LEN];
    int history_count;
    // plus a few miscellaneous flags and counters
};

struct bitly_link {
    char username[MAX_USERNAME_LEN+1];
    char hash[MAX_HASH_LEN+1];
    int ref_count;
    // plus a few miscellaneous flags and counters
};
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Notice that, with the (minimum) 8-byte malloc() overhead, on 64-bit systems such as we use,
a history list of any size over 1/4 the maximum uses less space in the new design: a fixed array of
1000&amp;#160;8-byte pointers in the new design, vs 4*8 bytes per link (three 8-byte pointers plus the
malloc() overhead) in the old design. In
practice, most users in the nhproxy cache have full history lists (maintained at that
length by pruning when they grow longer).&lt;/p&gt;
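
&lt;p&gt;The break-even point can be worked out with a few lines of C. The constants below restate the
figures from the text (8-byte pointers, 8 bytes of malloc() overhead, a 1000-entry history); the
function names are just for illustration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#include &lt;stddef.h&gt;

#define MALLOC_OVERHEAD 8u      /* minimum per-allocation overhead */
#define POINTER_SIZE    8u      /* 64-bit system */
#define HISTORY_SLOTS   1000u   /* length of the fixed history array */

/* Old design: one malloc()ed node holding next, prev and link per entry. */
size_t old_history_bytes(size_t links)
{
    return links * (3u * POINTER_SIZE + MALLOC_OVERHEAD);
}

/* New design: a fixed array of link pointers embedded in struct user. */
size_t new_history_bytes(void)
{
    return (size_t)HISTORY_SLOTS * POINTER_SIZE;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At 250 links both designs cost 8000 bytes. A full list of 1000 links costs 32000 bytes in the
old design and still only 8000 in the new one.&lt;/p&gt;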

&lt;p&gt;Satisfyingly, cpu usage was much improved, because all those extra malloc() and free() calls
in the original design were significant. Here&amp;#8217;s what it looked like before:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m6jjirUNL11rsyjyu.png" alt="cpu graph 1"/&gt;&lt;/p&gt;

&lt;p&gt;and here&amp;#8217;s what it looked like after:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m6jjj8oqgp1rsyjyu.png" alt="cpu graph 2"/&gt;&lt;/p&gt;

&lt;p&gt;(that&amp;#8217;s 100% per core)&lt;/p&gt;

&lt;p&gt;But what we really cared about was memory use:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m6jjjrHe2U1rsyjyu.png" alt="memory graph 2"/&gt;&lt;/p&gt;

&lt;p&gt;That&amp;#8217;s much lower, but still grew until exhausting all memory, and it only took 3 times
longer to do so. Until the real problem was solved, at least it was an improvement, right?&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;nhproxy&lt;/strong&gt; history list bug&lt;/h2&gt;

&lt;p&gt;Unfortunately, we couldn&amp;#8217;t roll it out to production yet, because we noticed this other
graph:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m6jjkaPZ6v1rsyjyu.png" alt="counters graph"/&gt;&lt;/p&gt;

&lt;p&gt;We had a fair number of what we thought were pretty good tests, and the new version of
nhproxy passed them all&amp;#8230; but why on earth did the number of bitly links in the cache,
and the number of references to them in history lists, gravitate to 2/3 of what they used
to be? Fortunately, having solid instrumentation into the runtime behavior of the
application enabled us to spot this discrepancy. We had theories about bugs in the history
pruning strategy, but couldn&amp;#8217;t find anything, so we had to write a tool to compare the state
of the original system still running in production, and the new version we were testing and
staging, which was getting the same &lt;em&gt;add&lt;/em&gt; requests, but not servicing any of the &lt;em&gt;get&lt;/em&gt;
requests.&lt;/p&gt;

&lt;p&gt;We couldn&amp;#8217;t compare just what was cached in the nhproxy, because we had actually changed
the user flush and drop strategy, and because even if the strategies were the same, the
lack of &lt;em&gt;get&lt;/em&gt; requests on the new system being staged would have made the list of users in
nhproxy different. So I wrote a tool that took a list of user names, made &lt;em&gt;get&lt;/em&gt; requests
to both systems, compared the results, and output differences, with various options.
We were hoping to see some pattern in which links were lost from which network history
lists - maybe relatively old or new links? We looked in request logs to get some users
with very new changes, or hour-old changes, or older, and fed these usernames to the tool.&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s some sample output:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;login &amp;lt;user&amp;gt; common= 998   0_only=   0   1_only=   0   difference=  0%
login &amp;lt;user&amp;gt; common= 783   0_only=   0   1_only=   0   difference=  0%
...
login &amp;lt;user&amp;gt; common= 301   0_only= 315   1_only= 287   difference= 50%
login &amp;lt;user&amp;gt; common=1000   0_only=   0   1_only=   0   difference=  0%
Total successes=192 failures=3
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We didn&amp;#8217;t notice any clear patterns in the users, links, or ages. But we did notice that
the loss wasn&amp;#8217;t evenly distributed - some users matched completely between the systems,
and others were over 50% different. For a few of the very different users, we manually
queried and compared the caches and the backend databases, trying to find a clue.
Finally, Jehiah noticed that the backend database was missing many list items for a user
that the cache had, even after the user was flushed. We had no test that caused a
user&amp;#8217;s list to be flushed, then dropped, then re-fetched, so I added one. Soon after,
snakes spotted the extra &amp;#8220;i++&amp;#8221; in the flush routine, iterating over the history list,
introduced when a while-loop that already had an &amp;#8220;i++&amp;#8221; was changed to a for-loop in the
memory-saving struct refactor - we were skipping over every other link and not writing it
to the persistent datastore. It&amp;#8217;s funny how much time such a little bug can take to track
down, but our tests and tools are better for it.&lt;/p&gt;
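
&lt;p&gt;For readers who have not been bitten by this one, here is a hypothetical reconstruction of the
shape of the bug. It is not the actual nhproxy flush routine: the history is reduced to an array of
ints, and copying into &lt;code&gt;saved&lt;/code&gt; stands in for writing to the persistent datastore.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#include &lt;stddef.h&gt;

/* Buggy: the for-loop already advances i, so the leftover i++ in the
 * body skips every other entry. Returns the number of entries saved. */
size_t flush_buggy(const int *history, size_t count, int *saved)
{
    size_t n = 0;
    for (size_t i = 0; i &lt; count; i++) {
        saved[n++] = history[i];
        i++;   /* left over from the original while-loop */
    }
    return n;
}

/* Fixed: the loop header is the only place i advances. */
size_t flush_fixed(const int *history, size_t count, int *saved)
{
    size_t n = 0;
    for (size_t i = 0; i &lt; count; i++) {
        saved[n++] = history[i];
    }
    return n;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Given five entries, the buggy version saves only the first, third, and fifth. Nothing crashes and
the in-memory list is untouched, which is why the loss only shows up after a user is flushed,
dropped, and fetched again.&lt;/p&gt;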

&lt;h2&gt;&lt;strong&gt;libevent&lt;/strong&gt; memory leak&lt;/h2&gt;

&lt;p&gt;At that point, nhproxy was much more efficient, and at least as correct as the original
version, but it still had the unbounded memory use which becomes problematic after only 3
and a half days or so. When I wrote above that valgrind reported no memory leaks, I wasn&amp;#8217;t
being completely honest - it reported no memory leaks if one was careful to only use
HTTP/1.0 requests without keep-alive, or HTTP/1.1 requests with Connection: close, when
exercising or querying the nhproxy. There was a known leak with keep-alive requests, but
it wasn&amp;#8217;t attacked earlier because:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;while our internal services do often hold persistent connections, they rarely close
them, and we didn&amp;#8217;t think we experienced this effect to a degree significant enough
to devote resources to solving it&lt;/li&gt;
&lt;li&gt;libevent is a sensitive and core part of our infrastructure, so we didn&amp;#8217;t want to
migrate to libevent-2.0.x right at that moment, nor did we want to maintain our own
branch of libevent-1.4.x&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Simultaneously, we were aware of a problem related to how heap memory was freed back to the
OS, which we had already seen in other applications. Memory use would grow due to a large
number of small allocations (using the glibc provided &lt;code&gt;malloc()&lt;/code&gt;) caused by a temporary
high load, but when the load fell, the memory use wouldn&amp;#8217;t significantly fall back. Even if
valgrind reported no real leaks, and even if load fell back to effectively zero, the OS
could still report that the process was using a huge amount of private memory (RSS). We
confirmed through some tests that a single tiny allocation on the top of the heap could
prevent many gigabytes of contiguous free space from being released back to the OS, and
when that top allocation was &lt;code&gt;free()&lt;/code&gt;d, the rest would finally go.&lt;/p&gt;

&lt;p&gt;Between the occasional tiny leak and the significant fragmentation, this high-water-mark
effect could be a real problem for nhproxy. jemalloc would probably help, but at this point
I really wanted to get rid of the leak. I wanted the output of memory use analysis tools like
valgrind to be easier to use, and I didn&amp;#8217;t want to wonder how much of an effect the leak was
really having.&lt;/p&gt;

&lt;p&gt;I started working with the original libevent-1.4.14b source, building and installing the
shared library in a custom local path, and running nhproxy with it using LD_LIBRARY_PATH.&lt;/p&gt;

&lt;p&gt;libevent uses a lot of callbacks of function pointers, and evhttp abstracts outgoing and
incoming http connections to share many functions between them, so I started by putting
in some useful printf()s to familiarize myself with the flow of the code in a simple
minimal use. I used curl with and without the &lt;code&gt;--http1.0&lt;/code&gt; argument to exercise nhproxy.&lt;/p&gt;

&lt;p&gt;Valgrind reports that the root of the leak is a &lt;code&gt;struct evhttp_request&lt;/code&gt; allocated in
&lt;code&gt;evhttp_associate_new_request_with_connection()&lt;/code&gt;, which makes sense because that&amp;#8217;s
where all &lt;code&gt;evhttp_request&lt;/code&gt;s are allocated. I eventually traced the exceptional case
through&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;&lt;code&gt;evhttp_associate_new_request_with_connection()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;evhttp_request_new()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;evhttp_start_read()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;evhttp_read()&lt;/code&gt; (callback set in &lt;code&gt;evhttp_start_read()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;evhttp_connection_done()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;evhttp_handle_request()&lt;/code&gt; (the &lt;code&gt;(*req_cb)()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;evhttp_connection_fail()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;evhttp_connection_incoming_fail()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;evhttp_connection_free()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;The very interesting bit is that &lt;code&gt;evhttp_connection_incoming_fail()&lt;/code&gt; seemed to
intentionally leak the &lt;code&gt;evhttp_request&lt;/code&gt;, preventing &lt;code&gt;evhttp_connection_free()&lt;/code&gt; from
freeing it, with this comment:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;/* 
 * these are cases in which we probably should just
 * close the connection and not send a reply.  this
 * case may happen when a browser keeps a persistent
 * connection open and we timeout on the read.  when
 * the request is still being used for sending, we
 * need to disassociated it from the connection here.
 */
if (!req-&amp;gt;userdone) {
        /* remove it so that it will not be freed */
        TAILQ_REMOVE(&amp;amp;req-&amp;gt;evcon-&amp;gt;requests, req, next);
        /* indicate that this request no longer has a
         * connection object
         */
        req-&amp;gt;evcon = NULL;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I&amp;#8217;m still not sure exactly why that case is needed. But I did know that if
&lt;code&gt;evhttp_handle_request()&lt;/code&gt; is called with &lt;code&gt;req-&amp;gt;evcon-&amp;gt;state == EVCON_DISCONNECTED&lt;/code&gt;,
then both the user and the request are done. My minimal fix was&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@@ -2231,6 +2231,7 @@ evhttp_handle_request(struct evhttp_requ
    if (req-&amp;gt;uri == NULL) {
        event_debug(("%s: bad request", __func__));
        if (req-&amp;gt;evcon-&amp;gt;state == EVCON_DISCONNECTED) {
+           req-&amp;gt;userdone = 1;
            evhttp_connection_fail(req-&amp;gt;evcon, EVCON_HTTP_EOF);
        } else {
            event_debug(("%s: sending error", __func__));
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In our testing, that didn&amp;#8217;t result in any use-after-free crashes or unusual failed
requests. This is what memory use looked like under real load with that fix:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m6jjkvX4KZ1rsyjyu.png" alt="memory graph 3"/&gt;&lt;/p&gt;

&lt;p&gt;That seemed to do it. We weren&amp;#8217;t expecting the keep-alive connection leak to be
that significant, but apparently it was. (The memory ramp up before it stabilizes
is fragmentation of object allocations in memory approaching a maximum before
gaps formed are large enough to be reused.)&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Graphs of system and application metrics can reveal unexpected behaviors that
reflect bugs. Create tools to help you investigate issues if you need them.
Tiny fixes can have huge effects.&lt;/p&gt;

&lt;div class="postmeta"&gt; by &lt;a href="http://twitter.com/ploxiln"&gt;pierce&lt;/a&gt; (also shout out
to &lt;a href="http://twitter.com/jehiah"&gt;Jehiah&lt;/a&gt; who originally designed and developed
the network history service, and &lt;a href="http://twitter.com/imsnakes"&gt;snakes&lt;/a&gt; who helped
with the design and debugging) &lt;/div&gt;</description><link>http://word.bitly.com/post/26351432099</link><guid>http://word.bitly.com/post/26351432099</guid><pubDate>Mon, 02 Jul 2012 12:16:18 -0400</pubDate></item><item><title>Metrics - Building Clickatron</title><description>&lt;p&gt;One of the core product features of bitly has always been metrics. Unlike tools like google analytics,
bitly metrics have always been a way to gather website metrics when you don&amp;#8217;t control the website.
Want metrics for a book you wrote on Amazon? Use a bitly link. Want metrics for a blog post? You can
use a bitly link for that, too.&lt;/p&gt;

&lt;p&gt;As bitly usage has grown, one of the systems that has evolved several times along the way is our metrics
platform. It was originally implemented as log files with a hierarchical timestamp key; later, metrics data was loaded into a
flat mysql table. A subsequent revision moved that data into a cluster of
&lt;a href="http://fallabs.com/tokyocabinet/"&gt;tokyocabinet&lt;/a&gt; servers. That was eventually supplemented with some additional metrics
datasets in &lt;a href="http://www.mongodb.org/"&gt;mongodb&lt;/a&gt;. By early 2011 we were using 3 different metrics systems, and had
outgrown 3 others. We started with a set of goals to build a scalable time series database application, which we now
call Clickatron.&lt;/p&gt;

&lt;h2&gt;Goals&lt;/h2&gt;

&lt;p&gt;The goals we wanted to achieve with a new metrics system were:&lt;/p&gt;

&lt;p&gt;1) Move from daily to hourly granularity (not everyone should have
to pretend they live in EST). We wanted to satisfy read requests on the fly in any
unit of time (hour, day, week, month) for any timezone from a single hourly dataset, removing
the duplication present in other systems.&lt;/p&gt;

&lt;p&gt;2) Compact on-disk format. One of the previous metrics systems had a very inefficient data structure which was causing
both bloated data set sizes and performance problems resulting from insufficient disk IO. In order to satisfy read
requests at different units of time, it was also storing several copies of the data.&lt;/p&gt;

&lt;p&gt;3) Ability to bulk-load new datasets. There were two problems with previous metrics systems that
inhibited adding new metrics datasets. Previous systems lacked proper namespacing which caused
essentially duplicate systems for each dataset, and there was not enough excess capacity
to insert new metrics going backwards in time.&lt;/p&gt;

&lt;p&gt;4) Scalability. Our existing metrics systems were at their limits in terms of handling real-time increments and we
didn&amp;#8217;t have a simple path forward to add capacity to those systems.&lt;/p&gt;

&lt;h2&gt;Architecture&lt;/h2&gt;

&lt;p&gt;As mentioned in our post about &lt;a href="http://word.bitly.com/post/20350137230/sortdb"&gt;sortdb&lt;/a&gt;, many datasets have a small subset that is heavily updated, and a
larger related dataset that does not change. Metrics data is a textbook example where updates are limited to a
small time range, and an overwhelming majority of the data is static.&lt;/p&gt;

&lt;p&gt;We chose to take advantage of this and implement an architecture that split data into two separate storage systems. A
Realtime System is responsible for per-hour data covering the past 48 hours and also handles increment operations. A
separate Archive System stores all of the older data and is updated daily with a new set of static database files from
a Rebuild System.&lt;/p&gt;

&lt;p&gt;Each of these two systems is itself a cluster of key/value databases. The keys are a combination of
namespace, record key, and time components (explained in more detail below). Data is spread across the shards
of each cluster by a simple &lt;code&gt;crc(shard_key) % number_shards&lt;/code&gt;. The shard key is composed of only parts of the
actual key in order to improve data locality on a per-request basis. For example, the shard key excludes time components
so that all data for a request is stored on the same shard regardless of the time range requested.&lt;/p&gt;
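
&lt;p&gt;A rough Python sketch of that sharding rule. The choice of CRC-32, the shard count, and the helper names &lt;code&gt;shard_key()&lt;/code&gt; and &lt;code&gt;shard_for()&lt;/code&gt; are assumptions made for illustration, and it only handles keys whose last &lt;code&gt;.&lt;/code&gt;-separated field is the time value:&lt;/p&gt;

```python
import zlib

NUMBER_SHARDS = 8  # assumed cluster size, for illustration only


def shard_key(record_key):
    """Drop the trailing time component so every hour of a key shares a shard.

    Assumes keys shaped like 'u|jehiah.c41l', where the final
    '.'-separated field is the 4-character time value.
    """
    base, _, _time_value = record_key.rpartition(".")
    return base


def shard_for(record_key, number_shards=NUMBER_SHARDS):
    # crc(shard_key) % number_shards, with CRC-32 as a stand-in checksum
    return zlib.crc32(shard_key(record_key).encode("utf-8")) % number_shards


# Two different hours of the same key land on the same shard.
print(shard_for("u|jehiah.c413") == shard_for("u|jehiah.c41l"))
```

&lt;p&gt;Because the time value never reaches the checksum, a request for any time range of one key only has to talk to one shard.&lt;/p&gt;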

&lt;h3&gt;Realtime System&lt;/h3&gt;

&lt;p&gt;The Realtime System is comprised of a 3-position ring of storage engines with positions for &lt;code&gt;Today&lt;/code&gt;, &lt;code&gt;Yesterday&lt;/code&gt;, and
&lt;code&gt;Tomorrow&lt;/code&gt;. The storage engines in the &lt;code&gt;Today&lt;/code&gt; position handle increments. The storage engine in the &lt;code&gt;Yesterday&lt;/code&gt;
position has a 24-hour window to be exported before being cleared for re-use in the &lt;code&gt;Tomorrow&lt;/code&gt; position.&lt;/p&gt;
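
&lt;p&gt;One way to picture the rotation, as a Python sketch. Assigning slots by day number modulo 3 is an assumption made for illustration (the helper &lt;code&gt;ring_positions()&lt;/code&gt; is hypothetical), not a description of how the roles are actually tracked:&lt;/p&gt;

```python
from datetime import date

RING_SIZE = 3  # Today, Yesterday, Tomorrow


def ring_positions(day):
    """Map the three roles to ring slots 0..2 for a given calendar day.

    Assumes slots are assigned by day number modulo 3; this is an
    illustrative scheme only.
    """
    today_slot = day.toordinal() % RING_SIZE
    return {
        "Yesterday": (today_slot - 1) % RING_SIZE,
        "Today": today_slot,
        "Tomorrow": (today_slot + 1) % RING_SIZE,
    }


# The slot that is Today on one day becomes Yesterday on the next,
# and Yesterday's slot (once exported and cleared) becomes Tomorrow.
before = ring_positions(date(2012, 4, 1))
after = ring_positions(date(2012, 4, 2))
print(before["Today"] == after["Yesterday"])
print(before["Yesterday"] == after["Tomorrow"])
```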

&lt;p&gt;Each spot on the ring corresponds to a cluster of &lt;code&gt;simpletokyo&lt;/code&gt; / &lt;code&gt;ttserver&lt;/code&gt; pairs across which data is
sharded. Records in this system are stored in &lt;code&gt;&amp;lt;key&amp;gt;,&amp;lt;value&amp;gt;&lt;/code&gt; format where each record represents
an integer value for a single hour.&lt;/p&gt;

&lt;h3&gt;Rebuild System&lt;/h3&gt;

&lt;p&gt;The Rebuild System serves 3 purposes. Its primary role is to do the export from the Realtime System every 24 hours and
combine that new dataset with all previous datasets to generate new data files for the Archive System. As part of that
export process, the Rebuild System generates a second copy of the merged data files to serve as backup. Because it
generates new data files from scratch every day, it also provides a spot to introduce new historical data
independent of the number of records added. We have been able to use this to introduce new datasets as large as 13
billion keys. Good luck waiting for that many inserts into [insert database name].&lt;/p&gt;

&lt;p&gt;The rebuild process is a set of steps that involve sorting, merging, reformatting keys, and combining records. That&amp;#8217;s
followed by a final sharding and another sort and merge operation. These steps happen largely in parallel and are each
split into smaller segments of work that get distributed across several machines. Many of these steps also intelligently
avoid redoing unnecessary work by keeping intermediate files split by time range.&lt;/p&gt;

&lt;h3&gt;Archive System&lt;/h3&gt;

&lt;p&gt;The Archive System consists of static csv files accessed through &lt;a href="http://word.bitly.com/post/20350137230/sortdb"&gt;sortdb&lt;/a&gt;. These files are spread across
hosts as needed depending on desired read performance. Multiple sortdb instances for the same data files can be used on
separate physical hosts for read availability and fault tolerance.&lt;/p&gt;

&lt;h3&gt;API&lt;/h3&gt;

&lt;p&gt;The API layer directs increment requests to the right spot on the Realtime System ring. For increments, the API layer
also writes out a local oplog for data durability in case there is a hardware failure in the Realtime System. For data
fetches, the API layer handles the potential need to pull data from both the Realtime System and the Archive System,
merging the two data sets into a single response. The components of the API involved in incrementing are written in C
using &lt;a href="https://github.com/bitly/simplehttp"&gt;simplehttp&lt;/a&gt;, and the components for querying are written in a combination of python using
&lt;a href="http://www.tornadoweb.org"&gt;tornadoweb&lt;/a&gt; and C using simplehttp.&lt;/p&gt;

&lt;h3&gt;Architecture Overview&lt;/h3&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m2yrxsJ5Bd1qz94k4.png" alt="clickatron architecture overview"/&gt;&lt;/p&gt;

&lt;h2&gt;Storage Formats&lt;/h2&gt;

&lt;h3&gt;Database Key Optimizations&lt;/h3&gt;

&lt;p&gt;Two optimizations directly aimed at lowering disk storage needs are a compact time format and a global lookup
table for records with long keys. The choice to use sortdb also directly reduced our storage requirements as the data
index is stored with the data (more specifically the index &lt;em&gt;is&lt;/em&gt; the data).&lt;/p&gt;

&lt;p&gt;For time values we use a compact 4 character &lt;code&gt;YMDH&lt;/code&gt; representation. Using this format &lt;code&gt;c3p1&lt;/code&gt; corresponds to 1am on March 25th, 2012.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;YMDH == _b32(year % 100) + _b32(month) + _b32(day) + _b32(hour)
&lt;/code&gt;&lt;/pre&gt;
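
&lt;p&gt;A small Python sketch of that encoding. The alphabet (digits followed by lowercase letters) is an assumption, chosen because it is consistent with the examples in this post; &lt;code&gt;parse_ymdh()&lt;/code&gt; is a hypothetical helper added to show the reverse direction:&lt;/p&gt;

```python
from datetime import datetime

# Assumed alphabet: digits then lowercase letters, which is consistent
# with the examples in this post (c = 12, p = 25, l = 21).
_ALPHABET = "0123456789abcdefghijklmnopqrstuv"


def _b32(n):
    """Encode an integer in the range 0..31 as a single base-32 character."""
    return _ALPHABET[n]


def ymdh(dt):
    return _b32(dt.year % 100) + _b32(dt.month) + _b32(dt.day) + _b32(dt.hour)


def parse_ymdh(value, century=2000):
    """Inverse of ymdh(); the century is not stored, so it must be supplied."""
    year, month, day, hour = (_ALPHABET.index(ch) for ch in value)
    return datetime(century + year, month, day, hour)


print(ymdh(datetime(2012, 3, 25, 1)))   # c3p1
print(parse_ymdh("c41l"))               # 2012-04-01 21:00:00
```

&lt;p&gt;With this alphabet the encoded strings sort in chronological order, which matters for a sorted data file. A single character only covers values up to 31, so the two-digit year is the first field that would overflow.&lt;/p&gt;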

&lt;p&gt;Any time a key contains a reserved character (&lt;code&gt;&amp;lt;space&amp;gt;&lt;/code&gt; &lt;code&gt;:&lt;/code&gt; &lt;code&gt;,&lt;/code&gt; &lt;code&gt;.&lt;/code&gt; &lt;code&gt;|&lt;/code&gt;) or is 12 or more characters, we create a
12-character hash of it, and store a lookup record. This means that if we want to count the number of referrers from
&lt;code&gt;&lt;a href="http://www.facebook.com/"&gt;http://www.facebook.com/&lt;/a&gt;&lt;/code&gt;, then instead of storing that 24-character string every time we want to count it, we count a
smaller 12-character hash such that &lt;code&gt;_hash('http://www.facebook.com/') == 'h0aMB8AuNw4='&lt;/code&gt;, and do a reverse lookup to
expand the hash back to the full value at query time. It may not seem like much, but a savings of even a few bytes for
repeated values makes a big difference on datasets with billions of records.&lt;/p&gt;
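
&lt;p&gt;A rough Python sketch of that rule. The hash used here (the first 8 bytes of MD5, base64-encoded) is only a stand-in picked because it yields 12 characters; it is an assumption for illustration and will not reproduce the value above. The in-memory &lt;code&gt;lookup_table&lt;/code&gt; and the helper names are likewise hypothetical:&lt;/p&gt;

```python
import base64
import hashlib

RESERVED = set(" :,.|")
lookup_table = {}  # hash mapped to original key (the "lookup records")


def _hash(key):
    """12-character stand-in hash: 8 bytes of MD5, base64-encoded.

    This is an assumed hash function, so it will not reproduce
    values such as 'h0aMB8AuNw4='.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()[:8]
    return base64.b64encode(digest).decode("ascii")


def compact_key(key):
    """Return the key itself, or its hash if it is long or has reserved characters."""
    if len(key) >= 12 or RESERVED.intersection(key):
        hashed = _hash(key)
        lookup_table[hashed] = key  # store a lookup record for query time
        return hashed
    return key


def expand_key(stored):
    """Reverse lookup at query time; short keys were stored as-is."""
    return lookup_table.get(stored, stored)


stored = compact_key("http://www.facebook.com/")
print(len(stored), expand_key(stored))
print(compact_key("US"))
```

&lt;p&gt;Since every hash is exactly 12 characters and every unhashed key is shorter than that, a stored value can never be mistaken for the other kind.&lt;/p&gt;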

&lt;h3&gt;Record Types&lt;/h3&gt;

&lt;p&gt;Our time series database is made up of 3 types of records: &lt;strong&gt;Lookup Records&lt;/strong&gt;, which are the expansions for the
12-character hashes above; &lt;strong&gt;Total Records&lt;/strong&gt;, which are a single key with values per hour over time; and &lt;strong&gt;Subtotal
Records&lt;/strong&gt;, which represent the line item values whose sum rolls up to the &lt;em&gt;Total Record&lt;/em&gt; for each hour.&lt;/p&gt;

&lt;p&gt;We use a different data format between the Realtime System and the Archive System for storing &lt;em&gt;Total Records&lt;/em&gt; and
&lt;em&gt;Subtotal Records&lt;/em&gt;. The data format used for the Realtime System is optimized for writes while the format used for the
Archive System is optimized for compactness and reads.&lt;/p&gt;

&lt;p&gt;In the Realtime System records have one data point, but in the Archive System records are multi-column and have many
values. Many records in the Realtime System will be collapsed to a single multi-column record in the Archive System.
Records in the Realtime System are always for a single hour, but &lt;em&gt;Total Records&lt;/em&gt; in the Archive System are for all time
ranges, and &lt;em&gt;Subtotal Records&lt;/em&gt; are similarly for all subtotal keys (for an hour). This facilitates retrieval, but it
also provides for a more compact storage (the record key is stored only once for that multi-column record compared with
the Realtime System). Metrics data is very sparse (many keys only have values for a few hours) so the time values in a
&lt;em&gt;Total Record&lt;/em&gt; also serve as an index of what data to expect in the equivalent subtotal data set.&lt;/p&gt;

&lt;h3&gt;Record Formats&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Total Records - Realtime System&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A &lt;em&gt;Total Record&lt;/em&gt; as stored in the Realtime System consists of a &lt;code&gt;namespace&lt;/code&gt;, a &lt;code&gt;total key&lt;/code&gt;, an hourly &lt;code&gt;time
value&lt;/code&gt;, and the &lt;code&gt;counter value&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m2yueh0Bir1qz94k4.png" alt="Total Record Realtime Format"/&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt; These are two &lt;em&gt;Total Records&lt;/em&gt; in the Realtime System representing total clicks for a single user in two
separate hours. &lt;code&gt;u&lt;/code&gt; is our namespace for &amp;#8220;user clicks&amp;#8221;, &lt;code&gt;jehiah&lt;/code&gt; is my username, &lt;code&gt;c413&lt;/code&gt; and &lt;code&gt;c41l&lt;/code&gt; are time components,
and these records have values of &lt;code&gt;2&lt;/code&gt; and &lt;code&gt;5&lt;/code&gt; respectively.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;u|jehiah.c413,2
u|jehiah.c41l,5
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Total Records - Archive System&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the Archive System, all &lt;em&gt;Total Records&lt;/em&gt; for the same &lt;code&gt;namespace&lt;/code&gt; + &lt;code&gt;total key&lt;/code&gt; combination are collapsed into one
record with the individual &lt;code&gt;time value&lt;/code&gt; and &lt;code&gt;counter value&lt;/code&gt; pairs as the multi-column component.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m2yudxxere1qz94k4.png" alt="Total Record Archive Format"/&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt; These are the same two &lt;em&gt;Total Records&lt;/em&gt; from the Realtime System as stored in the Archive System. Here the key is
only the &lt;code&gt;u&lt;/code&gt; namespace and &lt;code&gt;jehiah&lt;/code&gt; total key, but the value is multi-column pairs of time components &lt;code&gt;c413&lt;/code&gt; and &lt;code&gt;c41l&lt;/code&gt;
and values of &lt;code&gt;2&lt;/code&gt; and &lt;code&gt;5&lt;/code&gt;. In this small example, the archive record is &lt;strong&gt;27%&lt;/strong&gt; more compact.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;u|jehiah,c413:2 c41l:5
&lt;/code&gt;&lt;/pre&gt;
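
&lt;p&gt;The collapse can be sketched in a few lines of Python. This is an illustration of the record formats only, assuming in-memory lists of lines; &lt;code&gt;collapse_totals()&lt;/code&gt; is a hypothetical helper, not part of the Rebuild System:&lt;/p&gt;

```python
from collections import defaultdict


def collapse_totals(realtime_records):
    """Collapse 'ns|key.YMDH,count' lines into 'ns|key,YMDH:count ...' lines."""
    columns = defaultdict(list)
    for line in realtime_records:
        key, count = line.split(",")
        total_key, time_value = key.rsplit(".", 1)
        columns[total_key].append((time_value, count))
    return [
        total_key + "," + " ".join(t + ":" + c for t, c in sorted(pairs))
        for total_key, pairs in sorted(columns.items())
    ]


realtime = ["u|jehiah.c413,2", "u|jehiah.c41l,5"]
archive = collapse_totals(realtime)
print(archive)

# Size comparison for this example, counting characters and ignoring newlines.
before = sum(len(r) for r in realtime)
after = sum(len(r) for r in archive)
print(round(100 * (before - after) / before))
```

&lt;p&gt;For the two records above this drops 30 characters to 22, which is where the 27% figure comes from.&lt;/p&gt;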

&lt;p&gt;&lt;strong&gt;Subtotal Records - Realtime System&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A &lt;em&gt;Subtotal Record&lt;/em&gt; as stored in the Realtime System consists of a &lt;code&gt;subtotal namespace&lt;/code&gt;, a &lt;code&gt;total
namespace&lt;/code&gt;, a &lt;code&gt;total key&lt;/code&gt;, a &lt;code&gt;subtotal key&lt;/code&gt;, an hourly &lt;code&gt;time value&lt;/code&gt; and a &lt;code&gt;counter value&lt;/code&gt;. This record is structured such
that the (&lt;code&gt;total namespace&lt;/code&gt; + &lt;code&gt;total key&lt;/code&gt;) matches exactly to a &lt;em&gt;Total Record&lt;/em&gt;. Further, the &lt;code&gt;counter values&lt;/code&gt; are related
such that the sum of &lt;code&gt;counter values&lt;/code&gt; for all subtotal keys for a &lt;code&gt;subtotal namespace&lt;/code&gt;, &lt;code&gt;total namespace&lt;/code&gt;, &lt;code&gt;total key&lt;/code&gt;,
&lt;code&gt;time value&lt;/code&gt; exactly equals the matching value in a &lt;em&gt;Total Record&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m2yudcKGDf1qz94k4.png" alt="Subtotal Record Realtime Format"/&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt; Here are 4 &lt;em&gt;Subtotal Records&lt;/em&gt; that represent two different datasets related to the &lt;em&gt;Total Record&lt;/em&gt; above as
stored in the Realtime System. One dataset, with the &lt;code&gt;c&lt;/code&gt; subtotal namespace, represents the clicks by country, and the
other, with the &lt;code&gt;r&lt;/code&gt; subtotal namespace, represents the clicks by referrer. Each of these datasets (&lt;code&gt;4&lt;/code&gt; + &lt;code&gt;1&lt;/code&gt; and &lt;code&gt;3&lt;/code&gt; + &lt;code&gt;2&lt;/code&gt;) sums
to the &lt;code&gt;5&lt;/code&gt; in the &lt;em&gt;Total Record&lt;/em&gt; above. The &lt;code&gt;US&lt;/code&gt; and &lt;code&gt;JP&lt;/code&gt; are country values in the subtotal key, and &lt;code&gt;5bmehgOB4+w=&lt;/code&gt; and
&lt;code&gt;h0aMB8AuNw4=&lt;/code&gt; are 12 character lookup values that are placeholders for longer subtotal keys. These records in the
Realtime System are per-hour so they can be incremented individually.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.u|jehiah.US.c41l,4
c.u|jehiah.JP.c41l,1
r.u|jehiah.5bmehgOB4+w=.c41l,3
r.u|jehiah.h0aMB8AuNw4=.c41l,2
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Subtotal Records - Archive System&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the Archive System, all &lt;em&gt;Subtotal Records&lt;/em&gt; for a given hour for a &lt;code&gt;subtotal namespace&lt;/code&gt; collapse so that the &lt;code&gt;subtotal
key&lt;/code&gt; and &lt;code&gt;counter value&lt;/code&gt; pairs create the multi-column data.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m2yucqHtVQ1qz94k4.png" alt="Subtotal Record Archive Format"/&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt; The same 4 values in the Realtime System are represented as two multi-column records in the Archive System.
Here the records are still one per hour, but all subtotal keys for that hour are stored together. For this small
example, the archive format is &lt;strong&gt;31%&lt;/strong&gt; smaller.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.u|jehiah.c41l,US:4 JP:1
r.u|jehiah.c41l,5bmehgOB4+w=:3 h0aMB8AuNw4=:2
&lt;/code&gt;&lt;/pre&gt;
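
&lt;p&gt;A Python sketch of the same collapse for subtotals, which also checks the rule that each subtotal dataset sums to the hourly total. The helper names are hypothetical, and the sketch keeps subtotal keys in input order to match the example above:&lt;/p&gt;

```python
from collections import defaultdict


def collapse_subtotals(realtime_records):
    """Group 'sub.ns|key.subkey.YMDH,count' lines by everything except subkey."""
    columns = defaultdict(list)
    for line in realtime_records:
        key, count = line.strip().split(",")
        # Subtotal keys never contain '.', since keys with reserved
        # characters are replaced by their 12-character hash.
        prefix, subtotal_key, time_value = key.rsplit(".", 2)
        columns[prefix + "." + time_value].append((subtotal_key, int(count)))
    return columns


def format_archive(columns):
    """Render one multi-column 'key,subkey:count ...' line per hour."""
    return [
        key + "," + " ".join("%s:%d" % pair for pair in pairs)
        for key, pairs in sorted(columns.items())
    ]


realtime = [
    "c.u|jehiah.US.c41l,4",
    "c.u|jehiah.JP.c41l,1",
    "r.u|jehiah.5bmehgOB4+w=.c41l,3",
    "r.u|jehiah.h0aMB8AuNw4=.c41l,2",
]
columns = collapse_subtotals(realtime)
for line in format_archive(columns):
    print(line)

# Each subtotal dataset sums to the Total Record value for that hour (5).
print([sum(count for _, count in pairs) for pairs in columns.values()])
```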

&lt;h3&gt;Record Visualization&lt;/h3&gt;

&lt;p&gt;This is a visualization of the data contained within a pair of records from the Archive System: a &lt;em&gt;Total Record&lt;/em&gt; (green) and a &lt;em&gt;Subtotal Record&lt;/em&gt; (purple).
On the right are the same records but with example values filled in.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m2yubwPNbB1qz94k4.png" alt="Clickatron Record Format Visualization"/&gt;&lt;/p&gt;

&lt;h2&gt;Queries&lt;/h2&gt;

&lt;p&gt;Since Clickatron stores data with hourly granularity, the API merges that hourly data on the fly to fulfill requests for
other granularities. This approach enables the API to accept an &lt;code&gt;hour_offset&lt;/code&gt; parameter in order to fulfill queries for
any timezone aligned on the hour (with apologies to India and other locales aligned on the half hour). In our public API
endpoint we look up &lt;code&gt;hour_offset&lt;/code&gt; from a more generic &lt;code&gt;timezone&lt;/code&gt; parameter. The API also uses the technique for fulfilling day
queries to respond to week and month queries, and can even respond with week units that start on Monday instead of
Sunday (we call this &amp;#8220;mweek&amp;#8221;). You can see an example of these parameters in our &lt;a href="http://dev.bitly.com/link_metrics.html"&gt;API Documentation&lt;/a&gt;
for user and link metrics.&lt;/p&gt;
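
&lt;p&gt;A simplified Python sketch of how hourly data can be rolled up into days for a given &lt;code&gt;hour_offset&lt;/code&gt;. It assumes stored hours are all in one fixed reference timezone and reuses the assumed base-32 alphabet for time values; &lt;code&gt;daily_totals()&lt;/code&gt; is a hypothetical helper, not the API's actual code:&lt;/p&gt;

```python
from collections import Counter
from datetime import datetime, timedelta

_ALPHABET = "0123456789abcdefghijklmnopqrstuv"  # assumed base-32 alphabet


def parse_ymdh(value, century=2000):
    year, month, day, hour = (_ALPHABET.index(ch) for ch in value)
    return datetime(century + year, month, day, hour)


def daily_totals(archive_value, hour_offset=0):
    """Roll 'YMDH:count ...' hourly pairs up into per-day totals.

    hour_offset shifts each stored hour into the caller's timezone
    before bucketing, so day boundaries fall at local midnight.
    """
    days = Counter()
    for pair in archive_value.split():
        time_value, count = pair.split(":")
        local = parse_ymdh(time_value) + timedelta(hours=hour_offset)
        days[local.date().isoformat()] += int(count)
    return dict(days)


value = "c413:2 c41l:5"  # from the Total Record example above
print(daily_totals(value))                  # both hours fall on 2012-04-01
print(daily_totals(value, hour_offset=-4))  # 3am shifts back to 2012-03-31
```

&lt;p&gt;Week and month units follow the same pattern with a different bucket key, and a Monday-based week only changes which day each bucket starts on.&lt;/p&gt;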

&lt;h2&gt;Performance in the Real World&lt;/h2&gt;

&lt;p&gt;Clickatron has been in production for over a year now. It has successfully replaced several other metrics systems,
reduced hardware requirements, improved data granularity, and reduced query latency. As previously mentioned, we have
leveraged its rebuild functionality to bulk load datasets with as many as 13 billion data points at once with zero
service impact - directly contributing to our ability to expand the breadth of our data. Despite an increase in the
breadth of metrics, the granularity of data storage, the timespan of accumulated data, and a significant increase in
traffic, Clickatron&amp;#8217;s current persistent footprint is less than one quarter the size of the disparate systems it
replaced a year ago. Clickatron currently tracks well over 100 billion data points, a number that increases every day,
and we feel well situated to handle future needs as they arise.&lt;/p&gt;

&lt;h3&gt;EOL&lt;/h3&gt;

&lt;p&gt;If this sounds like a fun metrics system to use or to work on, &lt;a href="http://bitly.com/jobs"&gt;bitly is hiring&lt;/a&gt;.&lt;/p&gt;

&lt;div class="postmeta"&gt;
by &lt;a href="http://twitter.com/jehiah"&gt;jehiah&lt;/a&gt; (also shout out to &lt;a href="http://twitter.com/imsnakes"&gt;snakes&lt;/a&gt; who helped design and develop clickatron)
&lt;/div&gt;</description><link>http://word.bitly.com/post/21721687297</link><guid>http://word.bitly.com/post/21721687297</guid><pubDate>Tue, 24 Apr 2012 13:19:16 -0400</pubDate><category>clickatron</category><category>metrics</category></item><item><title>sortdb - static key value database</title><description>&lt;p&gt;As dataset sizes grow, you need to buy larger/faster database machines, or shard data across multiple machines.
&lt;a href="https://github.com/bitly/simplehttp/tree/master/sortdb"&gt;sortdb&lt;/a&gt; is a database server we have written to help fill a specific need for performant access to static data.&lt;/p&gt;

&lt;p&gt;One observation about many datasets is that there is often a small dataset that is heavily updated, and a larger older
related dataset that does not change. It is this second case where sortdb can be used to expose a query interface to
static datasets.&lt;/p&gt;

&lt;p&gt;sortdb has an HTTP interface built on libevent, and exposes &lt;code&gt;get&lt;/code&gt;, &lt;code&gt;mget&lt;/code&gt; and &lt;code&gt;fwmatch&lt;/code&gt; endpoints to key / value data in
a sorted &lt;code&gt;csv&lt;/code&gt; data file. Internally, the datafile is &lt;a href="http://en.wikipedia.org/wiki/Mmap"&gt;mmaped&lt;/a&gt; and a binary search is
performed to find rows in the file. It is part of
&lt;a href="http://word.bitly.com/post/13216970565/intro-to-simplehttp"&gt;simplehttp&lt;/a&gt;, a family of libraries and daemons for building
scalable web infrastructures.&lt;/p&gt;

&lt;p&gt;By mmap&amp;#8217;ing the data file, sortdb allows the operating system to handle maintaining the most important parts of the data
in memory while retrieving others from disk transparently. Since the file is searched using a binary search,
the first few steps of the search, which are common to every lookup, are soon cached in RAM and execute very quickly. The operating system
will also share mmap&amp;#8217;d data between multiple processes. Practically, this means you can start a second sortdb process
pointed at the same data file for higher read throughput.&lt;/p&gt;
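
&lt;p&gt;sortdb itself is written in C, but the core idea fits in a short Python sketch: mmap a sorted file and binary search it by byte offset, snapping each probe to the start of its line. The function names and the temporary-file demo are assumptions for illustration, and the file is assumed to be sorted bytewise by key:&lt;/p&gt;

```python
import mmap
import os
import tempfile


def get(mm, key, sep=b","):
    """Binary search a sorted 'key,value' file held in an mmap."""
    target = key.encode("utf-8")
    lo, hi = 0, len(mm)
    while hi > lo:
        mid = (lo + hi) // 2
        # Snap to the start of the line containing byte offset mid.
        start = mm.rfind(b"\n", 0, mid) + 1
        end = mm.find(b"\n", start)
        if end == -1:
            end = len(mm)
        row_key, _, value = mm[start:end].partition(sep)
        if row_key == target:
            return value.decode("utf-8")
        if target > row_key:
            lo = end + 1  # continue in the rows after this line
        else:
            hi = start    # continue in the rows before this line
    return None


def lookup_all(rows, keys):
    """Write rows to a temporary file, mmap it, and look up each key."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(("\n".join(rows) + "\n").encode("utf-8"))
    try:
        with open(f.name, "rb") as data:
            mm = mmap.mmap(data.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                return [get(mm, key) for key in keys]
            finally:
                mm.close()
    finally:
        os.remove(f.name)


rows = [
    "c.u|jehiah.c3hh,DE:1 None:5 US:7",
    "c.u|jehiah.c3hi,CA:1 None:6 US:5",
    "u|jehiah.c3hh,13",
    "u|jehiah.c3i0,7",
]
print(lookup_all(rows, ["u|jehiah.c3hh", "c.u|jehiah.c3hh", "u|missing"]))
```

&lt;p&gt;Every lookup probes the middle of the file first, then one of the two quarter points, and so on, so those few pages are touched constantly and stay resident in the page cache.&lt;/p&gt;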

&lt;p&gt;In database terms, a sorted key/value datafile is essentially a covering index where all the values are stored with the
key. Because there is no need to store a separate index of the data, this is a compact on-disk representation (removing
the need for separate indexes can often give a 20-30% savings in disk usage).&lt;/p&gt;

&lt;p&gt;Operationally, managing sortdb is easy because the data files it uses are static and read-only. This makes it easy to
copy the file to multiple servers and distribute requests to multiple sortdb instances on separate servers to handle a
higher workload, or to add redundancy. It is also possible to point sortdb to a new file with zero downtime (literally
&lt;code&gt;mv data.csv old.csv &amp;amp;&amp;amp; mv new.csv data.csv &amp;amp;&amp;amp; kill -s HUP $pid&lt;/code&gt;).&lt;/p&gt;

&lt;h2&gt;Using sortdb&lt;/h2&gt;

&lt;p&gt;At bitly we use sortdb successfully as a core component of our metrics system and are very happy with its performance
characteristics. Our long term metrics system is composed of two different storage systems. One &amp;#8220;realtime&amp;#8221; system stores
metrics for the past 48 hours and processes all increment requests. A second &amp;#8220;static&amp;#8221; storage system is built on sortdb
and stores all data prior to the past 48 hours. Every day, 24 hours of data is exported from the realtime system, and
new static data files are re-built for the static system comprising all the existing data, and the new 24 hour data
segment.&lt;/p&gt;

&lt;p&gt;A small slice of our &amp;#8220;static&amp;#8221; metrics data looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.u|jehiah.c3h5,US:2
c.u|jehiah.c3h7,US:1
c.u|jehiah.c3h9,None:2 US:3
c.u|jehiah.c3ha,None:2
c.u|jehiah.c3hb,None:2 US:3
c.u|jehiah.c3hc,None:3 US:4
c.u|jehiah.c3hd,None:5 US:4
c.u|jehiah.c3he,None:2 US:6
c.u|jehiah.c3hf,None:3 US:5
c.u|jehiah.c3hg,None:4 US:5
c.u|jehiah.c3hh,DE:1 None:5 US:7
c.u|jehiah.c3hi,CA:1 None:6 US:5
u|jehiah.c3hh,13
u|jehiah.c3i0,7
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You will notice that these are &lt;code&gt;key,value&lt;/code&gt; records where the key is a compound key often made up of a namespace and
sometimes a time component, and the value is sometimes a repeating (&lt;code&gt;key&lt;/code&gt; + &lt;code&gt;:&lt;/code&gt; + &lt;code&gt;value&lt;/code&gt;) sequence.&lt;/p&gt;

&lt;p&gt;With that dataset in a plain text file named &lt;code&gt;data.csv&lt;/code&gt;, you can point sortdb at it with this run command (setting comma
as the field separator):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sortdb --field-separator=, --db-file=data.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now it&amp;#8217;s possible to query for a single record&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ curl 'http://127.0.0.1:8080/get?key=u|jehiah.c3hh'
13
$ curl 'http://127.0.0.1:8080/get?key=c.u|jehiah.c3hh'
DE:1 None:5 US:7
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Or forward match a range of records&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ curl 'http://127.0.0.1:8080/fwmatch?key=u|jehiah.'
u|jehiah.c3hh,13
u|jehiah.c3i0,7
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;EOL&lt;/h3&gt;

&lt;p&gt;If this sounds like a fun project to hack on, &lt;a href="http://bit.ly/jobs"&gt;bitly is hiring&lt;/a&gt;.&lt;/p&gt;

&lt;div class="postmeta"&gt;
by &lt;a href="http://twitter.com/jehiah"&gt;jehiah&lt;/a&gt;
&lt;/div&gt;</description><link>http://word.bitly.com/post/20350137230</link><guid>http://word.bitly.com/post/20350137230</guid><pubDate>Mon, 02 Apr 2012 10:59:31 -0400</pubDate><category>sortdb</category><category>simplehttp</category></item><item><title>Infrastructure as a Platform</title><description>&lt;h3&gt;Introduction&lt;/h3&gt;

&lt;p&gt;At bitly, infrastructure has two core responsibilities.&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;&lt;strong&gt;The obvious&lt;/strong&gt; - systems architecture, performance, scaling, technology choices, and implementation details.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The not-so-obvious&lt;/strong&gt; - new hire on-boarding, developer tools, and idioms that enable non-infrastructure engineers (science/research, application developers, etc.) to build scalable, performant, operationally sound, and easy-to-develop pieces of your product.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;I say not-so-obvious because many of these things aren&amp;#8217;t pressing issues when teams are small, everyone is in one room, and it&amp;#8217;s crystal clear what the most important goal is from an engineering perspective.&lt;/p&gt;

&lt;p&gt;As an engineering team grows, it becomes more and more important to develop and evangelize a common way of solving problems and getting things done.  Without these guidelines you end up with inconsistent solutions that have a dramatic impact on the ability to operationally handle production systems (let alone the time cost of engineers constantly re-inventing the wheel).&lt;/p&gt;

&lt;p&gt;Maybe you&amp;#8217;re responsible 24/7 for production systems - consider any combination of the following situations:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;logs are in 14 different places&lt;/li&gt;
&lt;li&gt;an engineer chose a different one of the far too many Python web frameworks&lt;/li&gt;
&lt;li&gt;services are started by the current flavor-of-the-week process manager&lt;/li&gt;
&lt;li&gt;response formats from APIs differ&lt;/li&gt;
&lt;li&gt;techniques used to process data vary&lt;/li&gt;
&lt;li&gt;monitoring is done differently or not at all&lt;/li&gt;
&lt;li&gt;backups aren&amp;#8217;t consistent&lt;/li&gt;
&lt;li&gt;no tests&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Sounds like one helluva mess to debug.  (and it&amp;#8217;s &lt;strong&gt;always&lt;/strong&gt; at &lt;em&gt;3am&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;Let&amp;#8217;s talk about how we&amp;#8217;ve made progress over the past year solving some of these problems.&lt;/p&gt;

&lt;h3&gt;Development Workflow&lt;/h3&gt;

&lt;p&gt;A big win here was our switch to Git (and GitHub).  The freedom and flexibility this gave us to design a powerful code-flow (&lt;code&gt;develop&lt;/code&gt; -&amp;gt; &lt;code&gt;review&lt;/code&gt; -&amp;gt; &lt;code&gt;merge&lt;/code&gt; -&amp;gt; &lt;code&gt;deploy&lt;/code&gt;) has proved &lt;em&gt;extremely&lt;/em&gt; valuable.&lt;/p&gt;

&lt;p&gt;We keep our code in one repository (application code, infrastructure tools, configuration and system dependencies).  When possible we work to move things into external &lt;a href="http://github.com/bitly"&gt;open source projects&lt;/a&gt; that are paired with an install script in our main repo.  Every engineer develops in their own fork.  Code makes it into &lt;code&gt;master&lt;/code&gt; in one of two ways:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;&lt;strong&gt;cherry-picking&lt;/strong&gt; - a small, well defined changeset or high-priority bug fix can be cherry-picked into &lt;code&gt;master&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pull request&lt;/strong&gt; - anything else follows the process of creating a branch off &lt;code&gt;master&lt;/code&gt;, developing the feature, and opening a pull request.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;It should be noted that another member of the engineering team will review either of the above and provide constructive feedback, ultimately being the one to bring the code into master.  Code review is a first-class member of our development workflow - it is hugely beneficial to both the developer and the reviewer.  The better you are at reading code, the better you are at writing it.  For efficiency, and to respect each other&amp;#8217;s time, we often exchange commit hashes and play tit-for-tat with pull-request reviewing.&lt;/p&gt;

&lt;p&gt;Another benefit of this model is that &lt;code&gt;master&lt;/code&gt; maintains a clean history, always moving forward.  In doing so we avoid the pitfalls of coordinating a destructive history change with so many outstanding branches.  Still, we understand the value of Git&amp;#8217;s power - engineers are encouraged to re-write history to their heart&amp;#8217;s content while developing in their own branch.  When ready, a comment on the pull request stating &amp;#8220;ready for review &lt;strong&gt;@github_user&lt;/strong&gt;&amp;#8221; will notify the corresponding person to begin review at their discretion.&lt;/p&gt;

&lt;p&gt;Reviewing takes the form of comments on the pull request.  As each round of review is completed, the developer will comment &amp;#8220;resolved&amp;#8221; where appropriate and push additional commits to the branch.  This culminates in one final rebase and squash before a merge.  We find that looking back you rarely need the back-and-forth of many small commits and we prefer to see the change as a whole as it relates to the issue documented in the pull request.&lt;/p&gt;

&lt;h3&gt;Developing in a Virtual Machine&lt;/h3&gt;

&lt;p&gt;If you&amp;#8217;re &lt;strong&gt;not&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;writing code in an environment that mimics production&lt;/li&gt;
&lt;li&gt;working in an environment set up by the same scripts and processes that stand up production hosts&lt;/li&gt;
&lt;li&gt;exposed to the challenges of making things play nice together&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;then as soon as your code is actually running in production, it&amp;#8217;s probably not going to go well.&lt;/p&gt;

&lt;p&gt;We believe whole-heartedly in the VM approach.  Every developer (and even designers!) has a local VM for developing changes to any component in our infrastructure, front to back.&lt;/p&gt;

&lt;p&gt;Most importantly, it forces you to design your applications to be &amp;#8220;environment aware&amp;#8221;.  Your application code should only know the &lt;strong&gt;&lt;em&gt;how&lt;/em&gt;&lt;/strong&gt;.  The &lt;strong&gt;&lt;em&gt;where&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;how many&lt;/em&gt;&lt;/strong&gt; are the responsibility of configuration files that pivot on the &lt;em&gt;environment&lt;/em&gt; the code is running in.  (See &lt;a href="http://twitter.com/jehiah"&gt;@jehiah&lt;/a&gt;&amp;#8217;s &lt;a href="http://jehiah.cz/a/application-settings"&gt;post&lt;/a&gt; for more info about how we structure this.)&lt;/p&gt;
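
&lt;p&gt;A minimal sketch of that separation in Python might look like the following. Everything named here is an assumption made for illustration (the &lt;code&gt;APP_ENVIRONMENT&lt;/code&gt; variable, the setting names, and the example hosts); it is not how bitly&amp;#8217;s settings are actually structured.&lt;/p&gt;

```python
import os

# Hypothetical per-environment settings: the "where" and "how many".
SETTINGS = {
    "dev": {
        "database_hosts": ["127.0.0.1:5432"],
        "worker_count": 1,
    },
    "production": {
        "database_hosts": ["db1.example.com:5432", "db2.example.com:5432"],
        "worker_count": 8,
    },
}


def get_settings(environment=None):
    """Return the settings for the given (or current) environment."""
    if environment is None:
        # APP_ENVIRONMENT is an assumed variable name; default to the VM.
        environment = os.environ.get("APP_ENVIRONMENT", "dev")
    return SETTINGS[environment]


def start_workers(settings):
    """Application code knows only the 'how'; it reads the rest from settings."""
    hosts = settings["database_hosts"]
    return [
        "worker %d using %s" % (n, hosts[n % len(hosts)])
        for n in range(settings["worker_count"])
    ]


for line in start_workers(get_settings("dev")):
    print(line)  # worker 0 using 127.0.0.1:5432
```

&lt;p&gt;The same &lt;code&gt;start_workers&lt;/code&gt; code runs unchanged in the VM and in production; only the settings it is handed differ.&lt;/p&gt;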

&lt;p&gt;The benefits of this are easy to understand - if you can get it working, tested, and deployable in the VM then getting it running successfully in production should be as simple as a deploy.&lt;/p&gt;

&lt;p&gt;However, the choice of using a VM doesn&amp;#8217;t come without challenges:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;When you&amp;#8217;re service-oriented, as the overall architecture grows, it becomes increasingly difficult to &amp;#8220;fit&amp;#8221; &lt;em&gt;all&lt;/em&gt; services in a single VM on your local machine (RAM, etc.).&lt;/li&gt;
&lt;li&gt;Not &lt;em&gt;all&lt;/em&gt; services are equal and different engineers tend to touch different components more often.&lt;/li&gt;
&lt;li&gt;Setup, maintenance, and learning curve are a time sink.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;We&amp;#8217;ve only begun to scratch the surface in addressing these issues.  We&amp;#8217;ve started to logically group services so that they can be spun up/down on demand, giving some control over what&amp;#8217;s running simultaneously.  We&amp;#8217;ll surely post a follow up with our findings as we continue to make progress in this area.&lt;/p&gt;

&lt;h3&gt;The Way of the Bitly&lt;/h3&gt;

&lt;p&gt;The value of choosing sane and repeatable solutions to common problems becomes clearer and clearer as things grow - whether it&amp;#8217;s data, engineering team size, or number of services in your infrastructure.&lt;/p&gt;

&lt;p&gt;Often the # of operations engineers doesn&amp;#8217;t scale linearly with the # of services they have to monitor in production.  It becomes extremely important to make their lives easy by developing good frameworks and solutions for solving problems, and to make those solutions readily available, well documented, and easy to use.&lt;/p&gt;

&lt;p&gt;We begin this education process early on with new engineers.&lt;/p&gt;

&lt;p&gt;An engineer&amp;#8217;s first day sets the tone for their future at your company.  A new engineer at bitly can expect that on the first day they&amp;#8217;ll have some shiny new hardware to play with as well as one simple goal: get their bio on the &lt;a href="http://bitly.com/pages/about"&gt;bitly about page&lt;/a&gt;.  Since we put a strong focus on working in a VM, that implies that the existing operations and infrastructure team has done their job to have all the appropriate accounts set up, such as access to email, wiki, GitHub, and of course, a VM.&lt;/p&gt;

&lt;p&gt;This simple task facilitates the process of familiarizing themselves with the dev workflow described above as well as providing the satisfaction of seeing code go into production on day 1.&lt;/p&gt;

&lt;p&gt;We try to expose new engineers to all different facets of bitly infrastructure and encourage them to explore the codebase.  Fixing bugs, adding small pieces of functionality, and developing brand new services are common first week tasks.&lt;/p&gt;

&lt;p&gt;We think it&amp;#8217;s important to step away from the computer, too.  We schedule internal tech talks where we discuss the successes (and challenges) of the current state of the infrastructure.  It&amp;#8217;s an open forum where questions are answered, odd names get definitions, and bitly idioms are discussed.&lt;/p&gt;

&lt;h3&gt;EOL&lt;/h3&gt;

&lt;p&gt;There&amp;#8217;s lots more to talk about:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;how we deploy&amp;#8230;&lt;/li&gt;
&lt;li&gt;what are some of those bitly idioms&amp;#8230;&lt;/li&gt;
&lt;li&gt;how we make technology decisions&amp;#8230;&lt;/li&gt;
&lt;li&gt;meetings and communication&amp;#8230;&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;and more.  We&amp;#8217;ll save all that for future posts.&lt;/p&gt;

&lt;p&gt;As always, if any of this sounds interesting to you, &lt;a href="https://bitly.com/jobs"&gt;bitly is hiring&lt;/a&gt;.&lt;/p&gt;

&lt;div class="postmeta"&gt;
by &lt;a href="http://twitter.com/imsnakes"&gt;snakes&lt;/a&gt;
&lt;/div&gt;</description><link>http://word.bitly.com/post/19183796149</link><guid>http://word.bitly.com/post/19183796149</guid><pubDate>Mon, 12 Mar 2012 12:52:00 -0400</pubDate><category>infrastructure</category><category>bitly</category><category>git</category><category>github</category><category>engineering</category></item><item><title>Introducing Asyncdynamo</title><description>&lt;p&gt;When Amazon announced its DynamoDB service in late January, we quickly identified it as a promising candidate to meet many of our database demands: high availability and fault tolerance, low latency, and low maintenance. SSD hardware promised to grant us similar performance to what we experience with an in-memory datastore, while replication handled by Amazon promised to let our ops team sleep soundly at night (a cranky ops team is never a good thing). While it would be impractical for us to move all of our core infrastructure over to hosted services, we felt that Dynamo could be a good persistence option for applications already deployed to EC2.&lt;/p&gt;

&lt;p&gt;Since all interactions with Dynamo are carried out over HTTP, there’s no real need for a custom client library to begin performing database operations. Our applications are written primarily in Python, and &lt;a href="http://github.com/boto/boto"&gt;Boto&lt;/a&gt; provided just enough helper methods to handle authentication and proper request formatting (neither of which is a very straightforward task). However, Boto’s network calls are executed using Python’s built-in &lt;code&gt;httplib&lt;/code&gt;, whose blocking nature makes it impractical for use with &lt;a href="http://github.com/facebook/tornado"&gt;Tornado&lt;/a&gt;, an asynchronous framework.&lt;/p&gt;

&lt;p&gt;Consequently, we set out to develop a library that would make use of Boto&amp;#8217;s facilities for Dynamo-specific operations (request formatting and signing, as well as response parsing) but would leverage Tornado&amp;#8217;s async HTTP client to execute the actual requests. For the benefit of any other Tornado users hoping to make use of DynamoDB, we are proud to announce &lt;a href="http://github.com/bitly/asyncdynamo"&gt;Asyncdynamo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Asyncdynamo requires Boto and Tornado to be installed, and must be run with Python 2.7. It replaces Boto&amp;#8217;s synchronous calls to Dynamo and to Amazon STS (to retrieve session tokens) with non-blocking Tornado calls. For the end user its interface seeks to mimic that of Boto Layer1, with each method now requiring an additional callback parameter.&lt;/p&gt;

&lt;p&gt;A little trickery was involved in making this work: Boto&amp;#8217;s methods to sign requests expect to be operating on an instance of &lt;code&gt;boto.connection.HTTPRequest&lt;/code&gt;, so we needed to trick them into accepting our &lt;code&gt;tornado.httpclient.HTTPRequest&lt;/code&gt;. Working with a dynamic language makes this kind of type-hijacking much easier than it might otherwise be, and after a short bit of trial and error we were able to successfully fire off our hybrid requests to Amazon.&lt;/p&gt;

&lt;p&gt;Currently, Boto Layer1 equivalents of the essential operations - Get, Put, and Query - have all been implemented, and any other interaction with Dynamo is possible using a generic &lt;code&gt;make_request&lt;/code&gt; method. Once initialized with your AWS keys, Asyncdynamo will handle all authentication and session token management behind the scenes. Our immediate development plans are to fully replicate the methods offered in Boto Layer1, and in the longer term we hope to be able to add a further layer of abstraction similar to Boto Layer2.&lt;/p&gt;

&lt;p&gt;Interested in working on these kinds of projects? We&amp;#8217;re &lt;a href="http://bit.ly/jobs"&gt;hiring&lt;/a&gt;. We&amp;#8217;ll also be at PyCon this week if you&amp;#8217;d like to chat about this project, small links, big data, or anything in between.&lt;/p&gt;

&lt;div class="postmeta"&gt;
by &lt;a href="http://twitter.com/danielhfrank"&gt;danielhfrank&lt;/a&gt;
&lt;/div&gt;</description><link>http://word.bitly.com/post/18861837158</link><guid>http://word.bitly.com/post/18861837158</guid><pubDate>Tue, 06 Mar 2012 16:20:02 -0500</pubDate></item><item><title>Use Amazon SNS for Nagios Alerts</title><description>&lt;p&gt;&lt;span&gt; &lt;/span&gt;Amazon recently added &lt;a href="https://forums.aws.amazon.com/ann.jspa?annID=1230" target="_self"&gt;SMS publishing capability&lt;/a&gt; to their Simple Notification Service(SNS) platform. &lt;/p&gt;






&lt;p&gt;&lt;strong&gt;Using SNS to send nagios alerts:&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;At Bitly, we used individual carriers’ email-to-SMS gateways (e.g. 123456789@text.att.net) to send pages to the on-call person as well as other people on the ops team. This solution was not cutting it, as messages were heavily throttled on the carrier’s side (the gateway being a free service provided by the carrier meant that we couldn’t really complain or get the problem rectified). As a result, pages arrived hours later in some cases, which defeated the purpose of an alert during an emergency.&lt;/p&gt;



&lt;p&gt;Once SMS capability was announced, it was pretty much a no-brainer to switch to SNS.&lt;/p&gt;



&lt;p&gt;I wrote a quick and simple Python script with the help of a popular AWS Python library called &lt;a href="http://code.google.com/p/boto/"&gt;boto&lt;/a&gt;. You can find the source over at &lt;a href="http://bit.ly/pyamazonsns"&gt;http://bit.ly/pyamazonsns&lt;/a&gt;.&lt;/p&gt;



&lt;p&gt;This script can be used to send any kind of SNS message, not just SMS.&lt;/p&gt;



&lt;p&gt;Here is a snippet of how this would tie into nagios:&lt;/p&gt;



&lt;pre&gt;&lt;code&gt;define command{
    command_name  notify-host-by-txt
    command_line  printf "%b" "$HOSTALIAS$" | send_sns.py $CONTACTPAGER$
}
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Feel free to fork it, improve on it and add features beyond the scope of our usage. Drop us a line to let us know :)&lt;/p&gt;



&lt;div class="postmeta"&gt;by &lt;a href="http://twitter.com/sricola"&gt;sri&lt;/a&gt;&lt;/div&gt;</description><link>http://word.bitly.com/post/15571216901</link><guid>http://word.bitly.com/post/15571216901</guid><pubDate>Mon, 09 Jan 2012 12:13:00 -0500</pubDate></item><item><title>Introduction to simplehttp</title><description>&lt;p&gt;Part of our engineering philosophy is to keep things fast and simple. Aim to serve one purpose and serve it well. Speak HTTP and encode in JSON. Prototype in Python and speed it up in C.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;There are a few components that follow these tenets and sit at the core of our infrastructure. We&amp;#8217;ve open-sourced them under the &lt;a href="https://github.com/bitly/simplehttp"&gt;simplehttp&lt;/a&gt; moniker. They serve as the architectural foundation for higher-order functionality.&lt;/p&gt;

&lt;h2&gt;simplehttp&lt;/h2&gt;

&lt;p&gt;At the lowest level is the &lt;code&gt;simplehttp&lt;/code&gt; library, an abstraction of &lt;a href="http://monkey.org/~provos/libevent/"&gt;libevent&lt;/a&gt;&amp;#8217;s evhttp functions, aimed at trivializing the task of writing an evented HTTP server in C. It&amp;#8217;s dead simple yet provides high-level features such as:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/bitly/simplehttp/blob/master/simplehttp/options.c"&gt;Tornado inspired options parsing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bitly/simplehttp/blob/master/simplehttp/log.c"&gt;Tornado inspired logging&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Automatic per-endpoint &lt;a href="https://github.com/bitly/simplehttp/blob/master/simplehttp/stat.c"&gt;stat tracking&lt;/a&gt; (request counts, 95th percentile times, averages)&lt;/li&gt;
&lt;li&gt;Clean API to perform &lt;a href="https://github.com/bitly/simplehttp/blob/master/simplehttp/async_simplehttp.c"&gt;async HTTP requests&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Built on top of &lt;code&gt;simplehttp&lt;/code&gt;, perhaps the most important daemons are &lt;code&gt;simplequeue&lt;/code&gt; and &lt;code&gt;pubsub&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;simplequeue&lt;/h3&gt;

&lt;p&gt;A rock-solid in-memory message queue for arbitrary message bodies (we use JSON), providing basic &lt;code&gt;/get&lt;/code&gt; and &lt;code&gt;/put&lt;/code&gt; endpoints. We use this in several key areas to serve as a work queue for asynchronous processing.  During a maintenance window or situations where backend services are degraded it also acts as a buffer, queueing up work that needs to be done when backend services are restored.&lt;/p&gt;

&lt;p&gt;We use long-lived Python &amp;#8220;queuereaders&amp;#8221; to poll the &lt;code&gt;simplequeue&lt;/code&gt; and perform work.  This work might be writing to a database, logging, aggregating, or anything else that you might not want to perform in a blocking fashion during a request cycle.  These queuereaders have built-in backoff timers which slow down the processing rate when errors are detected to allow a struggling backend to recover gracefully and to reduce the load on the machine running the queuereader.&lt;/p&gt;
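
&lt;p&gt;The polling-with-backoff behaviour described above can be sketched roughly as follows. This is not bitly&amp;#8217;s queuereader code: the class and function names, the backoff bounds, and the idle interval are all assumptions, and &lt;code&gt;fetch_message&lt;/code&gt; is an injected stand-in for an HTTP request to the &lt;code&gt;/get&lt;/code&gt; endpoint so that the example stays self-contained.&lt;/p&gt;

```python
import time


class BackoffTimer(object):
    """Delay between messages; grows on errors, shrinks on successes.

    The bounds and step are assumed values chosen for illustration.
    """

    def __init__(self, min_interval=0.0, max_interval=8.0, step=0.5):
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.step = step
        self.interval = min_interval

    def failure(self):
        # Double the delay (starting from one step), capped at the maximum.
        self.interval = min(self.max_interval, max(self.step, self.interval * 2))

    def success(self):
        # Recover gradually, one step at a time.
        self.interval = max(self.min_interval, self.interval - self.step)


def run_queuereader(fetch_message, handle_message, timer,
                    sleep=time.sleep, max_polls=None):
    """Poll for messages and process them, slowing down while errors occur.

    fetch_message stands in for an HTTP GET of the simplequeue /get endpoint
    and returns None when the queue is empty.
    """
    polls = 0
    while max_polls is None or polls != max_polls:
        polls += 1
        message = fetch_message()
        if message is None:
            sleep(1.0)  # queue empty, idle briefly (assumed idle interval)
            continue
        try:
            handle_message(message)
            timer.success()
        except Exception:
            # A real reader would log the error and requeue the message here.
            timer.failure()
        if timer.interval:
            sleep(timer.interval)


if __name__ == "__main__":
    queue = ['{"n": 1}', '{"n": 2}']
    run_queuereader(
        fetch_message=lambda: queue.pop(0) if queue else None,
        handle_message=print,
        timer=BackoffTimer(),
        sleep=lambda seconds: None,  # skip real sleeping in the demo
        max_polls=3,
    )
```

&lt;p&gt;Increasing the delay quickly and decreasing it slowly is one reasonable choice here: it backs off fast when a backend starts failing and avoids hammering it the moment a single request succeeds.&lt;/p&gt;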

&lt;p&gt;Generally, we silo a &lt;code&gt;simplequeue&lt;/code&gt; and its associated queuereaders on each host of the service.  This means a &lt;code&gt;simplequeue&lt;/code&gt; on &lt;code&gt;hostA&lt;/code&gt; will only contain messages from requests received by that host, and its queuereaders will only process messages from its local &lt;code&gt;simplequeue&lt;/code&gt;.  We do this to address single point of failure issues.&lt;/p&gt;

&lt;h3&gt;pubsub&lt;/h3&gt;

&lt;p&gt;We have many different types of data at bitly, each classified into a stream. There are streams of &lt;strong&gt;encodes&lt;/strong&gt; (shortens), &lt;strong&gt;decodes&lt;/strong&gt; (&amp;#8220;clicks&amp;#8221;), &lt;strong&gt;user events&lt;/strong&gt;, etc. In order to provide a central, consistent, means for developers to access data in realtime we expose these streams via &lt;code&gt;pubsub&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Publishing a message is a simple HTTP request to the &lt;code&gt;/pub&lt;/code&gt; endpoint. In our case this usually happens in a queuereader whose sole purpose is to read off a specific &lt;code&gt;simplequeue&lt;/code&gt; and write to a specific &lt;code&gt;pubsub&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A client consumes the stream through a long-lived HTTP request to the &lt;code&gt;/sub&lt;/code&gt; endpoint.  Messages are transmitted as newline-delimited JSON.&lt;/p&gt;
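
&lt;p&gt;On the consuming side, the main detail is that the chunks of a long-lived response do not necessarily line up with message boundaries. The Python sketch below shows one way to reassemble newline-separated JSON messages from arbitrary chunks; the function name &lt;code&gt;iter_messages&lt;/code&gt; and the sample payloads are made up for illustration, and the chunks are simulated rather than read from a real &lt;code&gt;/sub&lt;/code&gt; connection.&lt;/p&gt;

```python
import json


def iter_messages(chunks):
    """Yield decoded JSON messages from an iterable of text chunks.

    Chunks may split a message anywhere, so partial lines are buffered
    until their terminating newline arrives.
    """
    buffered = ""
    for chunk in chunks:
        buffered += chunk
        while "\n" in buffered:
            line, buffered = buffered.split("\n", 1)
            if line.strip():
                yield json.loads(line)


# Simulated chunks, as they might arrive from a long-lived /sub request.
stream = ['{"type": "decode", "n": 1}\n{"type"', ': "encode", "n": 2}\n']
for message in iter_messages(stream):
    print(message["type"], message["n"])
```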

&lt;p&gt;To pair with &lt;code&gt;pubsub&lt;/code&gt; the repository contains three additional utilities built on &lt;code&gt;pubsubclient&lt;/code&gt;. &lt;code&gt;ps_to_file&lt;/code&gt; and &lt;code&gt;ps_to_http&lt;/code&gt; are fairly self-explanatory. One archives a &lt;code&gt;pubsub&lt;/code&gt; stream to a file (automatically rolling the output files for you based on a configurable strftime format string), and the other writes a stream of data to destination HTTP endpoints. The latter can be used to send messages to a &lt;code&gt;simplequeue&lt;/code&gt;, another &lt;code&gt;pubsub&lt;/code&gt; stream, or any other HTTP endpoint. Additionally, &lt;code&gt;pubsub_filtered&lt;/code&gt; repeats a pubsub stream and provides the option to remove or obfuscate fields, creating a filtered view on a subset of the data.  At bitly we use these tools to archive our data streams and to pass data published by one application into another application (or another datacenter).&lt;/p&gt;

&lt;h3&gt;EOL&lt;/h3&gt;

&lt;p&gt;If any of these things sound like fun projects to hack on, &lt;a href="http://bit.ly/jobs"&gt;bitly is hiring&lt;/a&gt;.&lt;/p&gt;

&lt;div class="postmeta"&gt;
by &lt;a href="http://twitter.com/imsnakes"&gt;snakes&lt;/a&gt;
&lt;/div&gt;</description><link>http://word.bitly.com/post/13216970565</link><guid>http://word.bitly.com/post/13216970565</guid><pubDate>Thu, 25 Aug 2011 09:00:00 -0400</pubDate></item><item><title>Welcome to the bitly Engineering Blog</title><description>&lt;p&gt;At &lt;a href="http://bitly.com/"&gt;bitly&lt;/a&gt;, we have been happy to 
contribute to a number of open source projects, and we have even started
&lt;a href="https://github.com/bitly/"&gt;a few of our own&lt;/a&gt;. We look forward to talking
about those, and other engineering details here. You can stay up-to-date by following &lt;a href="http://twitter.com/bitly"&gt;@bitly&lt;/a&gt;.&lt;/p&gt;

&lt;div class="postmeta"&gt;
by &lt;a href="http://twitter.com/jehiah"&gt;jehiah&lt;/a&gt;
&lt;/div&gt;</description><link>http://word.bitly.com/post/13216967883</link><guid>http://word.bitly.com/post/13216967883</guid><pubDate>Mon, 22 Aug 2011 15:28:00 -0400</pubDate></item></channel></rss>
