Counting at Scale: HyperLogLog to the Rescue
MediaMath processes many terabytes of data each day for the various reports available in T1. One metric we show is the number of unique impressions for each campaign, there is a big difference between showing an ad to 100 different people and showing the same ad to one person 100 times. While this is conceptually a simple problem, solving it at scale is not quite as straightforward.
The canonical way of solving this problem would be for any given campaign to put the id of each person who saw an ad for that campaign into a set and then check its size. It’s easy to understand the appeal of this approach; sets are provided in all modern standard libraries so it is simple to implement, and is a mathematically sound way of approaching the problem.
Best of all, implementing this on top of Hadoop would be relatively straightforward. The map step would be to place all the items into sets and the reduce phase would be to combine them, then output the sets size.
I thought you said this was going to be difficult?
The problem with this simple solution, though, is that eventually everything will need to be piped to a single reducer. What happens if the sets are so large that combining them takes a prohibitively long amount of time? What happens if the sets are so large they can’t all fit on one machine? This might not seem like a common problem but imagine we are counting the unique number of strings in a text file.
An empty string on the JVM requires 40 bytes and a string with 5 characters, the average length of a word in the English language, requires 48 bytes. This means that for at least half of all English words we are using more memory for overhead than to store the actual characters! 
Even if we are only looking at numeric id’s, a Long is 64 bits. Multiply that by the number of unique impressions for a given campaign and the total number of campaigns being run, it adds up quickly.
To reduce memory usage, we can heavily optimize an algorithm towards doing one thing, counting unique items.
HyperLogLog to the rescue
HyperLogLog, instead of storing each item, makes use of the random uniform distribution of a hash function in order to estimate the number of items seen given a specific phenomenon. If that made sense to you on first pass, congratulation—you know more about this problem when I did when I first tried to tackle it. If not, we will break it apart to understand how it works.
Hash functions are generally understood as a transformation of an object from a larger space to a smaller space. Most developers conceptual understanding of of a hash function is that the hash of an object will be a random looking number which only has a very small chance of being the same as the hash of any other object. Another way of viewing a hash function is that it returns a uniformly distributed bit stream or a list of 1’s and 0’s where the probability of each bit being a specific value is independent of the value of every other bit.
As stated above, we are looking for a specific phenomenon; the number of consecutive 0’s at the beginning of the stream. Because each bit is independent of every other the odds that the first bit is 0 is , the odds that the nth bit is 0 is , and the odds that the first consecutive n bits are 0 is .
To be clear, we are interested in the probability of different patterns not specific numbers. The odds of a function producing the hashes 00101101 and 00010110 are the same but the probability of seeing the pattern 00xxxxx (two 0’s followed by some numbers) is twice as likely as seeing the pattern 000xxxx (three 0’s followed by some numbers).
This means that if the largest number of consecutive 0’s seen is 32 we know that the odds of that happening are 1/4294967296 and we can expect that we have seen 4,294,967,296 distinct items.
If we were to stop here, then we would have an algorithm with a very small memory footprint. The number 32 is the same as the binary number 100000b which is only 6 bits and with those 6 bits we can count up to items, it only requires Log(Log(n)) bits to count up to n.
The problem with this method is that it is incredibly susceptible to outliers. If the first item seen hashes to 000000000001 then we would think we have seen 2048 distinct items when we have only seen one.
HyperLogLog takes this basic principle and iterates on it. One way it minimizes the influence of outliers in a data set is to maintain a group of buckets. For each item seen it will calculate its hash and take the first k bits of the hash to use that to pick a bucket. It will then count the number of 0-bits after that point and insert the number found if it is greater than the number already in that bucket.
So given this hash function and k = 8 we would store the number 7 into bucket 37 if that is greater than the number already in that bucket.
By storing multiple 0-bit sequence lengths, even if we do find one or two outliers we will have many more numbers which accurately represent the number of items seen. The buckets are combined by throwing away the largest 30% of buckets and then taking the harmonic mean of those remaining. There are several constants for range and bias correction that we could explore here, but that is far beyond the scope of this post.
While its memory footprint is much smaller than a basic set, it is still just as easy to parallelize across many machines. Each machine can run the algorithm using the same hash function and the same number of buckets. These results of each machine can then be combined by taking the max value of each corresponding bucket and reducing the buckets to generate a final estimate.
- The Java String class contains multiple fields and so the size in bytes of a string with n characters can be calculated using the following equations. Note, the JVM allocates memory in 8 byte increments so padding must be added to make each object size is a multiple of 8.