Allen_Jeo

HyperLogLog - Estimating the number of distinct elements using statistical and probabilistic principles


Introduction to the HyperLogLog Algorithm#

In big data processing, we often encounter the problem of counting the number of distinct elements. For example, we want to know how many unique users visit a large website every day. The straightforward method is to use a hash table to track each user, but for a huge number of users this requires a large amount of memory: the table must reserve an entry for every distinct user, so it does not scale to big data scenarios. In addition, in most cases we only need to know the approximate scale of the user base, and this is exactly where the HyperLogLog algorithm shines: it provides high accuracy with extremely low memory usage.

1. What is HyperLogLog?#

HyperLogLog (HLL) is an algorithm proposed by Philippe Flajolet and others in 2007 to estimate the number of distinct elements in a large dataset. Compared to direct counting, the beauty of HLL is that it only requires a small fixed amount of memory (usually a few KB).

2. Principles of HyperLogLog#

HLL is built on an observation about the hash values of the input data: a good hash function maps elements to uniformly distributed outputs, so each hash value can be treated as a random bit string.

The core idea of HLL is to look at the number of leading zeros in the binary hash value. For example, the hash value "001010..." has 2 leading zeros. Since hash values behave like random bit strings, the longest run of leading zeros observed so far can serve as an indicator of the cardinality: the larger this value, the more distinct elements we have probably "seen".


For example, if we keep flipping a coin, getting 10 heads in a row is much less likely than getting 5 in a row; generally speaking, you have to flip the coin many more times before you see a run of 10.
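To see this intuition numerically, here is a small self-contained sketch (independent of the implementation at the end of the article; pseudo-random values stand in for hash outputs). The longest run of leading zeros over n random 32-bit values grows roughly like log2(n):

#include <stdio.h>
#include <stdlib.h>

/* Count leading zeros of a 32-bit value (portable loop). */
static int leading_zeros(unsigned int v) {
    int n = 0;
    while (n < 32 && !(v & 0x80000000u)) { n++; v <<= 1; }
    return n;
}

/* Build a 32-bit pseudo-random value (RAND_MAX is only guaranteed to be >= 32767). */
static unsigned int rand32(void) {
    return ((unsigned int)(rand() & 0x7FFF))
         | ((unsigned int)(rand() & 0x7FFF) << 15)
         | ((unsigned int)(rand() & 0x0003) << 30);
}

int main(void) {
    srand(42);
    /* For increasing n, print the maximum run of leading zeros seen over n random values. */
    for (int n = 1 << 6; n <= 1 << 18; n <<= 4) {
        int max_zeros = 0;
        for (int i = 0; i < n; i++) {
            int z = leading_zeros(rand32());
            if (z > max_zeros) max_zeros = z;
        }
        printf("n = %7d  max leading zeros = %2d\n", n, max_zeros);
    }
    return 0;
}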

Since HLL is a probabilistic counting method, the following factors determine the estimation accuracy:

1. Uniform distribution of hash functions#

When data is hashed with a uniformly distributed hash function, the generated binary hash values can be treated as random numbers: each bit of the hash value is 0 or 1 with probability 50%.

2. Performance of hash functions#

The performance of HLL largely depends on the quality of the hash function. The hash function needs to distribute the input data evenly across its output space, ensuring that each bucket receives approximately equal amounts of data, and it also needs to be fast.

3. Leading zeros#

Consider a random binary number: the probability that the highest bit is 0 is 50%, the probability that the two highest bits are both 0 is 25%, and so on. Therefore, observing k consecutive leading zeros becomes exponentially rarer as k grows, and a long run of zeros is evidence that a large number of distinct elements has been hashed.
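Written out: the probability that a random hash value starts with at least $k$ zero bits is $2^{-k}$, so among $n$ distinct hashed elements the longest run of leading zeros observed is typically around $\log_2 n$. Inverting this relation is what turns an observed run length into a cardinality estimate.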

4. Buckets or counters#

To solve the problem of unstable estimates caused by the randomness of individual hash values, HLL uses multiple buckets. The first few bits of the hash value determine which bucket the element falls into, and the remaining bits are used to count the leading zeros.

For example, if 16 buckets are used (2^4 = 16), then the first 4 bits of the hash value will determine the bucket number (0-15). Each bucket maintains the maximum number of leading zeros of the hash values in that bucket.
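A minimal sketch of this bit manipulation (the hash value below is arbitrary and only for illustration; whether the low or high bits pick the bucket is an implementation choice, and the full listing at the end of the article uses the low 10 bits for 1024 buckets):

#include <stdio.h>

int main(void) {
    unsigned int hash = 0x0031F20Bu;     /* an arbitrary 32-bit hash value for illustration */
    int bucket_idx = hash & 0xF;         /* low 4 bits choose one of the 16 buckets (0-15) */
    unsigned int rest = hash >> 4;       /* the remaining 28 bits are used for the leading-zero count */
    printf("bucket = %d, remaining bits = 0x%07X\n", bucket_idx, rest);
    return 0;
}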

5. Merging counter values#

To obtain an overall estimate of the cardinality, we cannot simply add up the values of all buckets. Instead, HLL combines them with the harmonic mean $\frac{m}{\sum_{j=1}^{m} 2^{-M[j]}}$, where $m$ is the number of buckets and $M[j]$ is the value recorded by the $j$-th bucket.
The reason is that each bucket only sees a fraction of the elements, and its individual estimate $2^{M[j]}$ is very noisy. One bucket may observe 5 leading zeros while another observes 3; simply adding the corresponding values ($2^5 + 2^3$) would let a single bucket with an unusually long run of zeros dominate the total. The harmonic mean strongly damps such outliers, and the overall cardinality is then estimated from this combined value rather than by accumulating independent per-bucket estimates.
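For reference, in the original paper the registers store the position of the first 1-bit (the leading-zero count plus one), and the raw estimate scales the harmonic mean by the number of buckets and a bias-correction constant: $E = \alpha_m \, m^2 / \sum_{j=1}^{m} 2^{-M[j]}$, with $\alpha_m \approx 0.7213 / (1 + 1.079/m)$ for large $m$. The simplified C code at the end of this article omits $\alpha_m$ for brevity.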


6. Correcting small and large count biases#

Although the above method works well for medium-sized cardinalities, it may introduce biases for very small or very large cardinalities. To correct these biases, HLL provides the following corrections:

  • Small range correction: when the raw estimate is at most $\tfrac{5}{2}m$ and there are empty buckets, many buckets have seen nothing yet, so the estimate is replaced by a linear counting estimate based on the number of empty buckets.
  • Large range correction: when the estimate approaches $2^{32}$ (for 32-bit hashes), hash collisions become significant, and the algorithm corrects the estimate for the saturation of the hash space. The concrete formulas are listed below.
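For reference, the correction formulas from the original paper, with $E$ the raw estimate above and $V$ the number of empty buckets, are:

  • Small range ($E \le \tfrac{5}{2}m$ and $V > 0$): $E^{*} = m \ln(m / V)$ (linear counting).
  • Large range ($E > \tfrac{1}{30} \cdot 2^{32}$): $E^{*} = -2^{32} \ln\left(1 - E / 2^{32}\right)$.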

3. Data Structure#

HyperLogLog uses an array of buckets (registers) to store information. Each bucket records the maximum number of leading zeros observed among the hash values assigned to it, and the final output of the algorithm is a function of all bucket values.

4. Accuracy and Memory#

The accuracy of HLL depends on the number of buckets. More buckets mean higher accuracy, but also mean more memory usage. Fortunately, even with a few KB of memory, HLL can provide very reasonable estimates.
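Concretely, the relative standard error of HLL with $m$ buckets is about $1.04 / \sqrt{m}$. For example, $m = 1024$ buckets of one byte each take roughly 1 KB of memory and give a typical error of about $1.04 / 32 \approx 3.3\%$.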

5. Merging#

One powerful feature of HLL is its support for merging. If you have two independent HLL structures that track two different datasets, you can merge them into one, and the merged HLL behaves as if it had observed the union of the two datasets.

This means that HLL can be used for distributed processing on multiple machines, which matches typical big data scenarios: each machine tracks its own portion of the data, and the overall result is obtained through a simple "union" (merge) operation (a sketch of such a merge is shown after the code example below).

Summary#

HyperLogLog is an efficient algorithm for estimating the cardinality of large datasets with limited memory. Although it only provides an approximate value, this approximation is sufficient in many practical applications, and the benefits of memory saving make it a valuable tool.

Below is a simple implementation of the algorithm in the C language. This code aims to provide a concise and intuitive example, but it is not production-ready: it omits many details, such as the bias-correction constant $\alpha_m$, the large-range correction, and various optimizations. In addition, the hash function is simplified and not suitable for actual production environments.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#define BUCKET_COUNT 1024  // 2^10 buckets

typedef struct {
    unsigned char buckets[BUCKET_COUNT];
} HyperLogLog;

// Simplified example of a hash function. In actual applications, you would want to use higher quality hash functions like MurmurHash, CityHash, etc.
unsigned int hash_function(const char* data) {
    unsigned int hash = 0;
    while (*data) {
        hash = (hash * 31) + *data++;  // Simple multiplication hash
    }
    return hash;
}

int leading_zeros(unsigned int value) {
    if (value == 0) return 32;
    int count = 0;
    while ((value & 0x80000000) == 0) {
        count++;
        value <<= 1;
    }
    return count;
}

void hll_add(HyperLogLog *hll, const char* data) {
    unsigned int hash = hash_function(data);
    int bucket_idx = hash & (BUCKET_COUNT - 1);      // low 10 bits select the bucket
    // Count leading zeros of the remaining 22 bits only: hash >> 10 leaves 10
    // artificial zero bits at the top, so subtract them from the count.
    int zero_count = leading_zeros(hash >> 10) - 10;
    if (zero_count > hll->buckets[bucket_idx]) {
        hll->buckets[bucket_idx] = zero_count;
    }
}

double hll_count(const HyperLogLog *hll) {
    double sum = 0.0;
    int zero_buckets = 0;                       // empty buckets, used for the small-range correction
    for (int i = 0; i < BUCKET_COUNT; i++) {
        sum += pow(2.0, -hll->buckets[i]);
        if (hll->buckets[i] == 0) zero_buckets++;
    }
    // m times the harmonic mean of 2^M[j] (the full algorithm also applies a bias constant alpha_m)
    double estimate = (double)BUCKET_COUNT * BUCKET_COUNT / sum;
    if (estimate <= 2.5 * BUCKET_COUNT && zero_buckets > 0)  // small-range (linear counting) correction
        estimate = BUCKET_COUNT * log((double)BUCKET_COUNT / zero_buckets);
    return estimate;
}

int main() {
    HyperLogLog hll;
    memset(&hll, 0, sizeof(hll));

    const char* samples[] = {
        "user1", "user2", "user3", "user1", "user2", "user4"
    };
    int sample_count = sizeof(samples) / sizeof(samples[0]);
    for (int i = 0; i < sample_count; i++) {
        hll_add(&hll, samples[i]);
    }
    printf("Estimated unique count: %f\n", hll_count(&hll));
    return 0;
}
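
As mentioned in section 5, merging two HLL structures only requires taking the bucket-wise maximum, because each bucket stores the maximum leading-zero count over the elements it has seen. Below is a minimal sketch of such a merge for the structure above (hll_merge is an addition, not part of the original listing):

// Merge 'other' into 'dst'. After merging, hll_count(dst) estimates the cardinality
// of the union of the two input streams, which enables per-machine counting followed
// by a final merge on one node.
void hll_merge(HyperLogLog *dst, const HyperLogLog *other) {
    for (int i = 0; i < BUCKET_COUNT; i++) {
        if (other->buckets[i] > dst->buckets[i]) {
            dst->buckets[i] = other->buckets[i];
        }
    }
}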