I remember learning about memoization and finding it fascinating. The idea of purposefully sacrificing space to improve speed is beautifully pragmatic. If you're not familiar with memoization, it's where you cache the result of a function call so that when you call it again with the same parameters you can look the result up in the cache rather than recomputing it. Here's an example computing the nth Fibonacci number using memoization:

def fib(n, memo = {})
  # Base cases: fib(0) = 0 and fib(1) = 1.
  if n == 0 || n == 1
    return n
  end

  # Return the cached result if we've already computed fib(n).
  if memo[n]
    return memo[n]
  end

  # Otherwise compute it, cache it, and return it (in Ruby the
  # assignment evaluates to the assigned value).
  memo[n] = fib(n - 1, memo) + fib(n - 2, memo)
end
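
A quick usage sketch (the arguments are arbitrary): thanks to the memo, each Fibonacci number up to n is computed exactly once, so even large inputs return almost instantly, whereas the naive recursive version makes an exponential number of calls.

fib(10)   # => 55
fib(100)  # returns immediately; each sub-result is computed once and then cached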

We can generalise this idea: sacrifice one characteristic to improve another. The one I find particularly interesting is sacrificing accuracy, because it seems antithetical to a lot of what we do as programmers. When writing code the mantra is often: "Make it work, make it right, make it fast"1. However, we can bend "make it work" if we don't need the result to be perfect.

Setting the scene: we want to know how many customers we've ever had. However, the only device we have is very memory constrained: a Raspberry Pi Zero with 512MB of RAM.

We estimate that we've had about 120,000,000 unique customers. Customer IDs are UUIDs, so 128 bits each. The naive solution of storing a set of the customer IDs in memory and counting the number of elements in the set would use 1.92GB of memory (128 bits * 120,000,000 = 15,360,000,000 bits = 1.92GB). This won't fit on the Raspberry Pi Zero, so we have to get more creative. Linear Counting is an algorithm for estimating the cardinality of a set in a single pass, using a fixed amount of memory, by sacrificing accuracy. At a high level the algorithm is relatively simple:

  1. Construct a bit map with all the bits set to 0 (we'll come back to sizing in a minute).
  2. Stream the customer IDs through a hash function to get a number, modulo it with the size of the bit map, and set the bit identified by the value: bit_map[hash(customer_id) % bit_map.size] = 1.
  3. Estimate the number of unique customers as the negation of the bit map size multiplied by the natural logarithm of the ratio of the number of zeros in the bit map to the bit map size2. That is:
-(bit_map.size * Math.log(bit_map.total_unset.fdiv(bit_map.size)))

What this does is map customers to particular buckets, represented by bits in the bit map. However, we only know whether a bucket is empty or not: we've seen at least one customer in that bucket, or none, and several customers can hash to the same bucket. It's not a perfect representation.
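
To make that concrete, here's a minimal Ruby sketch of the algorithm above. A few assumptions of mine: the LinearCounter class name is made up, SHA-256 (via Ruby's Digest module) stands in for the hash function, and a plain array of integers stands in for the bit map; a real implementation would pack the bits rather than spend a whole integer per bucket.

require "digest"
require "securerandom"

class LinearCounter
  def initialize(size)
    @size = size
    @bits = Array.new(size, 0)  # the bit map, all bits set to 0
  end

  # Hash the ID, reduce it modulo the bit map size, and set that bit.
  def add(id)
    @bits[Digest::SHA256.hexdigest(id).to_i(16) % @size] = 1
  end

  # -m * ln(unset / m), as in step 3 above.
  def estimate
    -(@size * Math.log(@bits.count(0).fdiv(@size)))
  end
end

counter = LinearCounter.new(10_112_529)
1_000.times { counter.add(SecureRandom.uuid) }  # stand-in for streaming real customer IDs
counter.estimate                                # => roughly 1000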

In the paper3 they present a statistical analysis showing that the estimate is accurate even when the number of unique customers is much larger than the size of the bit map. This measurement, true cardinality / bit map size, is known as the load factor, and even for load factors as high as 12 the estimate has a standard error of 1%4.

So, can we use Linear Counting on our Raspberry Pi Zero? The Linear Counting paper3 has a table of bit map sizes for various cardinalities at 1% and 10% standard error. For a cardinality of 120,000,000 we need a bit map of 10,112,529 bits, a load factor of roughly 12, to get 1% standard error. So our bit map is 1.26MB (about 0.001GB). We only need to keep one customer ID in memory at a time to hash it and set its bit in the bit map, and at 128 bits that doesn't even move the needle on memory usage. So, whilst the naive solution uses 1.92GB, Linear Counting only uses 1.26MB; that's roughly 1/1500th of the memory and comfortably fits on our Raspberry Pi Zero.
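
For reference, the back-of-the-envelope arithmetic behind those numbers, as a Ruby sanity check:

128 * 120_000_000 / 8               # => 1_920_000_000 bytes: 1.92GB for the naive set
10_112_529 / 8                      # => 1_264_066 bytes: roughly 1.26MB for the bit map
(128 * 120_000_000) / 10_112_529    # => 1518, i.e. roughly 1/1500th of the memory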

Deciding what bit map size you're happy with is difficult because you can't calculate the load factor without knowing the true cardinality of your dataset, and that's the whole point of using Linear Counting. I would recommend using the total size of the dataset in place of the true cardinality; that way you're overestimating the load factor, so the error you end up with should be no worse than you planned for. However, this is a waste of space, as you'll almost always be overestimating the dataset's cardinality. So, if you have knowledge of the dataset, use it to inform your decision.
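
As a rough sketch of that approach, here's one way to size the bit map from an upper bound on the cardinality; the helper name is made up, and the default load factor of 12 is the one quoted above for 1% standard error:

# Size the bit map from an upper bound on the cardinality (e.g. the
# total size of the dataset) and a target load factor.
def bit_map_size(cardinality_upper_bound, load_factor = 12)
  (cardinality_upper_bound / load_factor.to_f).ceil
end

bit_map_size(120_000_000)  # => 10_000_000 bits, in the same ballpark as the paper's 10_112_529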

1

I'm not sure where this phrase first came from, but I've always interpreted it as: first make the code output the right result, then make it readable, then optimise it. We no longer get the "right" result with Linear Counting because it is only an estimate.

2

You can find my implementation of Linear Counting here: https://git.sr.ht/~nds/probably/tree/main/item/lib/probably/linear_counter.rb, along with a few other "pragmatic" data structures.
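
3

The Linear Counting paper: Whang, Vander-Zanden, and Taylor, "A Linear-Time Probabilistic Counting Algorithm for Database Applications", ACM Transactions on Database Systems, 1990.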

4

The standard error is used to estimate how close the estimated cardinality is to the true cardinality. It is the standard deviation of the ratio of the estimated cardinality to the true cardinality. So, a standard error of 1% means that roughly 68% of the time the estimated cardinality will be within 1% of the true cardinality.