This project aims to demonstrate a novel bloom filter implementation that can scale, and provide not only the addition of new members, but reliable removal of existing members.
bitly has billions and billions of links and an infrastructure that serves
thousands of decode requests per second, i.e. redirect the user from
long.destination.url.com/with/path. Users not only expect this redirect
to be fast, but they also trust bitly to not redirect them to a malicious website.
Unfortunately, a subset of destination urls are malicious, and we identify and
mark them as spam via automated, realtime analysis. This spam dataset is constantly
changing with an increasing rate of additions and a small number of deletions. In
order to maintain performance and provide a warning to users who might be going to a
spam website, we have to keep the content of the spam dataset in resident memory
for each decode request. This is complicated by the fact that the system is designed to
support both domain and path roll up, meaning we can block an entire domain, subdomain,
or top-level path (or any combination thereof) thus marking all children of that parent as
spam. Previously, we had implemented this as a simple hashtable with existence in the table
equating to membership in the spam dataset. This approach was simple and performant, but
not very memory efficient. On the occasions when resident memory was cleared, say on a
restart, it was necessary to reconstruct the dataset from the database. This made restarts
painfully long during the dataset reconstruction, and the hashtable took up considerable
amounts of memory, 2.5 gigabytes to be exact.
We began exploring bloom filters as a possible solution to our spam problems. Despite the many bloom filter variations out there - counting, compact, inverse, scalable, and bloomier - none of them met our criteria entirely. We needed the ability to scale and persist reliably while allowing for the removal of elements. It appeared that only a mixture of scalable and counting (elaborated on below) would solve our problems.
Bloom Filter Basics
Bloom filters are probabilistic data structures that provide
space-efficient storage of elements at the cost of occasional false positives on
membership queries, i.e. a bloom filter may state true on query when it in fact does
not contain said element. A bloom filter is traditionally implemented as an array of
M bits, where
M is the size of the bloom filter. On initialization all bits are
set to zero. A filter is also parameterized by a constant
k that defines the number
of hash functions used to set and test bits in the filter. Each hash function should
output one index in
M. When inserting an element
x into the filter, the bits
h1(x), h2(x), ..., hk(X) are set.
In order to query a bloom filter, say for element
x, it suffices to verify if
all bits in indices
h1(x), h2(x), ..., hk(x) are set. If one or more of these
bits is not set then the queried element is definitely not present in the
filter. However, if all these bits are set, then the element is considered to
be in the filter. Given this procedure, an error probability exists for positive
matches, since the tested indices might have been set by the insertion of other
Counting Bloom Filters: Solving Removals
The same property that results in false positives also makes it difficult to remove an element from the filter as there is no easy means of discerning if another element is hashed to the same bit. Unsetting a bit that is hashed by multiple elements can cause false negatives. Using a counter, instead of a bit, can circumvent this issue. The bit can be incremented when an element is hashed to a given location, and decremented upon removal. Membership queries rely on whether a given counter is greater than zero. This reduces the exceptional space-efficiency provided by the standard bloom filter.
Scalable Bloom Filters: Solving Scale
Another important property of a bloom filter is its linear relationship between size and storage capacity. If given a maximum acceptable false positive ratio, it is straightforward to determine how much space is needed.
If the maximum allowable error probability and the number of elements to store are both known, it is relatively straightforward to dimension an appropriate filter. However, it is not always possible to know how many elements will need to be stored a priori. There is a trade off between over-dimensioning filters or suffering from a ballooning error probability as it fills.
Almeida, Baquero, Preguiça, Hutchison published a paper in 2006, on Scalable Bloom Filters, which suggested a means of scalable bloom filters by creating essentially a list of bloom filters that act as one large bloom filter. When greater capacity is desired, a new filter is added to the list.
Membership queries are conducted on each filter with the positives evaluated if the element is found in any one of the filters. Naively, this leads to an increasing compounding error probability since the probability of the given structure evaluates to:
1 - 𝚺(1 - P)
It is possible to bound this error probability by adding a reducing tightening
r. As a result, the bounded error probability is represented as:
1 - 𝚺(1 - P0 * r^i) where r is chosen as 0 < r < 1
Since size is simply a function of an error probability and capacity, any
array of growth functions can be applied to scale the size of the bloom filter
as necessary. We found it sufficient to pick .9 for
Problems with Mixing Scalable and Counting Bloom Filters
Scalable bloom filters do not allow for the removal of elements from the filter. In addition, simply converting each bloom filter in a scalable bloom filter into a counting filter also poses problems. Since an element can be in any filter, and bloom filters inherently allow for false positives, a given element may appear to be in two or more filters. If an element is inadvertently removed from a filter which did not contain it, it would introduce the possibility of false negatives.
If however, an element can be removed from the correct filter, it maintains the integrity of said filter, i.e. prevents the possibility of false negatives. Thus, a scaling, counting, bloom filter is possible if upon additions and deletions one can correctly decide which bloom filter contains the element.
There are several advantages to using a bloom filters. A bloom filter gives us cheap, memory efficient set operations, with no actual data stored about the given element. Rather, bloom filters allow us to test, with some given error probability, the membership of an item. This leads to the conclusion that the majority of operations performed on bloom filters are the queries of membership, rather than the addition and removal of elements. Thus, for a scaling, counting, bloom filter, we can optimize for membership queries at the expense of additions and removals. This expense comes not in performance, but in the addition of more metadata concerning an element and its relation to the bloom filter. With the addition of some sort of identification of an element, which does not need to be unique as long as it is fairly distributed, it is possible to correctly determine which filter an element belongs to, thereby, Hence, maintaining the integrity of a given bloom filter with accurate additions and removals.
dablooms is one such implementation of a scaling, counting, bloom filter that takes additional metadata during additions and deletions in the form of a monotonically increasing integer to classify elements such as a timestamp. This is used during additions/removals to easily classify an element into the correct bloom filter (essentially a comparison against a range).
dablooms is designed to scale itself using these monotonically increasing identifiers and the given capacity. When a bloom filter is at capacity, dablooms will create a new bloom filter using the to-be-added elements identifier as the beginning identifier for the new bloom filter. Given the fact that the identifiers monotonically increase, new elements will be added to the newest bloom filter. Note, in theory and as implemented, nothing prevents one from adding an element to any “older” filter. You just run the increasing risk of the error probability growing beyond the bound as it becomes “overfilled”.
You can then remove any element from any bloom filter using the identifier to intelligently pick which bloom filter to remove from. Consequently, as you continue to remove elements from bloom filters that you are not continuing to add to, these bloom filters will become more accurate.
For performance, the low-level operations are implemented in C. It is also memory mapped which provides async flushing and persistence at low cost. In an effort to maintain memory efficiency, rather than using integers, or even whole bytes as counters, we use only four bit counters. These four bit counters allow up to 15 items to share a counter in the map. If more than a small handful are sharing said counter, the bloom filter would be overloaded (resulting in excessive false positives anyway) at any sane error rate, so there is no benefit in supporting larger counters.
The bloom filter also employees change sequence numbers to track operations performed on bloom filter. These allow us to determine failed writes, leaving us with a ‘dirty’ filter where an element is only partially written. Upon restart, this information allows us to determine if a filter is clean and thus can just load it into memory locally rather than having to rebuild the filter. The counters also provide a means for us to identify a position at which the bloom filter is valid in order to replay operations to “catch up”.
For our spam dataset at scale, initial tests have shown a significant reduction - 98.8% in resident memory usage on each machine using dablooms while maintaining a rigid performance threshold. For our use of an identifier, we simply use timestamps for when the spam link was added and query our database on removals for the insertion timestamp.
We have also decided to open source this project, so fork it and submit patches at the github repo.
If this sounds like a fun project to use or to work on, bitly is hiring, including interns!