Understanding How Your Past is Blocking Your Abundance

If I added in the scene of you waking up to the same cold February morning, in the same small Pennsylvania town of Punxsutawney, hearing the same Sonny & Cher song, you would be playing Bill Murray’s…

Smartphone

独家优惠奖金 100% 高达 1 BTC + 180 免费旋转




Creating a Bloom Filter with Go

A bloom filter is a set-like data structure that is more space-efficient compared to traditional set-like data structures such as hash tables or trees. The catch is that it is probabilistic, meaning it can not give a 100% certain answer to certain queries.

More specifically, a bloom filter can tell you with 100% certainty that something is not in the set. But it can not tell you with 100% certainty that something is in the set.

If you already know why you’d want to use a bloom filter, just skip the next part below 😃. So why would we want to use a set like this? One way bloom filters are used is to serve as a type of ‘cache’ for large data sets, where you expect that most queries return that an item is not in the set.

So instead of querying the large set directly, you first query the bloom filter. In most cases, the bloom filter tells us that an item is absent and that’s what we care about. And when an item might be present you query the actual data set directly.

You probably interact with them quite often. Let’s look at two examples

Google Bigtable uses bloom filters to reduce disk lookups for non-existent rows or columns. If you’re 100% sure an item is not in Bigtable, your query will never need to read the actual data, thus saving (costly) disk lookups.

Chrome uses a bloom filter to check a URL against a list of known malicious URLs. The first check is a (local) bloom filter, and only when this returns a positive result will an actual full-check be performed. The user only gets a notification if this check is also true.

To avoid using a lot of space, a bloom filter is actually an array of bits. Each item you pass to your bloom filter gets mapped to a location in the bloom filter, by hashing the incoming item. (As in the image at the start of this post)

When you later query your bloom filter, you hash the input and turn the hash into a location inside the bitfield. Next you check if the corresponding bits are set to 1 or 0. If they have been set to 1, than the item is in the set.

The hashing of the input is the reason why you can be certain of absence, but not certain of presence. If a bit is 0 you’re certain that nothing ever ended up on that index. But if it’s 1, it could be due a ‘hash collision’ or simply because multiple hashes mapped to the same position. Don’t worry, this will become clear in the implementation.

For our Go implementation, we’ll steal the example of Chrome given above. We are storing string (urls) and when a user requests a URL we need to make sure it’s not malicious. If the entry is not in our bloom filter, the URL is not malicious and won’t harm our user.

It’s a toy example, so we’ll omit the steps of querying an actual database and just focus on the core part of the bloom filter. As mentioned, often you’d make sure a positive result is not a false positive by performing a query against the actual data.

Next, we’ll want to create a function to take a string input and return the hash from this. We’ll want the hash as a number and not a string representation, as we need to map it to a location in our bitfield.

As it’s an example, let’s ignore the error. 😱
We’ll need to map this result to a location in our bitfield. As we’ll need to get the same location for reads as well as writes, it’s best to create a separate function out of this. (To ensure that the same input always returns the same output).

Great, now we can chain this together with a set and get function.

That’s all there is to create a basic bloom filter. When we store data in it using f.Set("https://www.abc.com") we’ll flip a bit to 1 to indicate that it has been set. When we run f.Get("https://www.abc.com") we use the same hash to find out if the bit has been set.

The above example is an implementation of a quite basic bloom filter, but it’s enough to get the idea of a bloom filter across. But what can we do to make this bloom filter better?

A good bloom filter gives a small amount of false positives, as this avoids the (expensive) lookup in the real data set. One way of doing this is by adding more hash functions, thus decreasing our odds of multiple items mapping to the same location in our bitfield.

A natural question then arises, “how many hash functions do we use?”. It turns out we can calculate how many hash functions we should use. For any given number (m) of bits in the bitfield and number (n) of elements we want to store in the bloom filter, the optimum number of hash functions (k) is given by the formula:

Formula for optimal hash functions

Hence our basic example might be fine for smaller bloom filters, but in the real world you’ll encounter bloom filters that use multiple hash functions.

If you liked this post and 💙 Go as well, consider:

Add a comment

Related posts:

5 Things To Do While Quarantined

I have seen so many people, now unsure what to do with their time as their country goes into either lockdown or are told to self-isolate, having restrictions put on public gatherings and alike which…

Como Ser Emprendedor De Alto Rendimiento

Los valores del emprendedor de alto rendimiento se pueden resumir en 5 principios que deben aplicarse cada día: Un High Performance Entrepreneur sabe qué quiere y por qué lo quiere. No anda dando…

How the BBC and the Guardian tell stories on mobile

Build for desktop then shrink down to mobile, or start with mobile and then scale up? How do we make long articles and complex charts work for small screens and fat fingers instead of wide canvases…