Imagine you want to build a system that needs to search through petabytes of log data, with new logs streaming in at multiple terabytes per day. With traditional data structures and exact algorithms, it’s hard to keep up with that kind of scale. Database indices grow unwieldy, memory requirements explode, and query times stretch from milliseconds to minutes or even hours. At this scale, the pursuit of 100% precision can become your worst enemy.
Following up on our exploration of log search engines in “Search Logs Faster than Sonic”, it’s time to introduce a class of solutions that isn’t very common in the standard software engineer’s toolbox but shines best at extreme scale: probabilistic data structures and approximation algorithms.
These tools aren’t just a part of theoretical computer science. They’re working behind the scenes in systems you likely use every day. Redis, Elasticsearch, and ClickHouse rely on them to optimize lookups and provide estimations in queries that would otherwise crash the servers or take forever to complete.
The basic idea is simple: there is a trade-off between accuracy and performance. Sometimes a small compromise on accuracy can result in massive performance gains while still producing a sufficiently good result. Instead of keeping track of everything exactly (which gets expensive fast), these structures and algorithms maintain a good-enough approximation that requires far less memory and processing time. It’s like estimating the number of rubber ducks I have in my collection instead of counting each one – you might be off by a few, but you’ll get a good-enough answer fast, without searching for the ones my cats have “sharded” across the apartment.
Let’s explore how these techniques can help process massive amounts of logs without breaking your infrastructure budget.
The Challenge of Data Sharding #
When working with massive datasets, high-scale systems often split data into smaller, more manageable horizontal partitions called shards.
When you want to query this data, you need to know which shards contain relevant information. Otherwise, you’re forced to read from all of them, leading to many expensive I/O operations, whether the shards are read from disk or over the network (e.g. from S3).
Basic Pruning #
The simplest pruning approach is time-based filtering. Each shard tracks its minimum and maximum timestamps:
```
Shard_1: 2023-01-01T00:00:00Z to 2023-01-01T06:00:00Z
Shard_2: 2023-01-01T03:00:00Z to 2023-01-01T09:00:00Z
Shard_3: 2023-01-01T06:00:00Z to 2023-01-01T12:00:00Z
...
```

When a query comes in requesting data for a specific timeframe:
```
@Table | where timestamp > '2023-01-01T07:00:00Z'
```

We can immediately eliminate Shard_1 from consideration. This concept is widely used: for example, Elasticsearch organizes data into time-based indices and shards within those indices, ClickHouse partitions tables by date ranges, and S3-based data lakes organize files into prefixes and time-based partitions.
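A minimal sketch of what that time-based pruning looks like in code (the shard metadata struct and the 6-hour ranges here are made up to match the example above):

```rust
/// Time range covered by a shard (hypothetical per-shard metadata).
struct ShardTimeRange {
    name: &'static str,
    min_ts: i64, // unix seconds
    max_ts: i64,
}

/// Keep only the shards whose [min_ts, max_ts] overlaps the query window.
fn prune_by_time(shards: &[ShardTimeRange], query_from: i64, query_to: i64) -> Vec<&ShardTimeRange> {
    shards
        .iter()
        .filter(|s| s.max_ts >= query_from && s.min_ts <= query_to)
        .collect()
}

fn main() {
    let shards = [
        ShardTimeRange { name: "Shard_1", min_ts: 1672531200, max_ts: 1672552800 }, // 00:00-06:00
        ShardTimeRange { name: "Shard_2", min_ts: 1672542000, max_ts: 1672563600 }, // 03:00-09:00
        ShardTimeRange { name: "Shard_3", min_ts: 1672552800, max_ts: 1672574400 }, // 06:00-12:00
    ];
    // timestamp > 07:00 => Shard_1 is pruned, Shard_2 and Shard_3 still need scanning.
    for s in prune_by_time(&shards, 1672556400, i64::MAX) {
        println!("scan {}", s.name);
    }
}
```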
But what about other filter conditions? Consider this simple query:
```
@Table | where source.ip = "192.168.1.1" AND timestamp > '2023-01-01T07:00:00Z'
```

Time-based pruning helps with the timestamp condition, but we still need to check all remaining shards for the specific IP.
A naive approach might be to maintain an exact index of all values for each field using a hashmap. The shard can be skipped if the filtered value isn’t present:
```
Shard_2 contains: source.ip: {"192.168.1.1", "10.0.0.1", ... 10,000s more IPs}
```

The problem is that for high-cardinality fields like user IDs, request paths, or (if you’re really unlucky) some UUID, storing and checking complete value lists consumes enormous amounts of memory and processing time.
Enter the Bloom Filter #
A Bloom filter solves this by providing a memory-efficient way to answer a simple question: “Could this value exist in this shard?” It can tell you with certainty when something is NOT in the dataset (no false negatives), while occasionally producing false positives.
You can think of Bloom filters like trying to guess what your coworker is heating up in the office microwave just by the smell, so you know if it’s worth asking for a bite. Smells carry less information than the full dish, but if you recognize the scent of leftover fried chicken, you can usually make a decent guess. The problem is that scents can overlap so you might think it’s fried chicken, but it’s actually reheated chicken nuggets 😕 (that’s a false positive). But if none of the familiar smells are present, you know for sure it’s not what you’re hoping for (no false negatives).
Here’s how a Bloom filter works:
- Start with a bit array of m bits, all initially set to 0
- Choose k different hash functions (scents) that each map an input to one of the m array positions
- To add an element, run it through all k hash functions to get k array positions, then set all those positions to 1
- To check if an element exists, run it through all k hash functions to get k positions. If ALL positions contain a 1, the element is PROBABLY in the set (it could be a false positive due to hash collisions)
- Otherwise, the element is DEFINITELY not in the set.
What I like about Bloom filters is that both adding and searching run in a time complexity that doesn’t depend on the data; it depends solely on the number of chosen hash functions, O(k), which of course affects the false positive rate. So you can control the trade-off between memory usage and false positive rate! The probability of a false positive is approximately:
$$ p \approx \left(1 - e^{-kn/m}\right)^k $$
Where:
- m is the size of the bit array
- n is the number of elements in the set
- k is the number of hash functions
So for our use case, for each shard and each “relevant” field (we’ll touch on when to avoid Bloom filters later on) in the table’s schema, we can maintain a separate Bloom filter that tracks all values for that field in that shard. This lets us quickly eliminate shards that definitely don’t contain our target values.
So let’s say you estimate a particular field will have 1,000 different values in a shard of data and you’re willing to retrieve shards without relevant data (false positives) at a rate of \(1\%\). You would need approximately:
$$ m = -\frac{n \cdot \ln(p)}{(\ln 2)^2} = -\frac{1000 \cdot \ln(0.01)}{(\ln 2)^2} \approx 9585 \text{ Bits} \approx 1198 \text{ Bytes} \approx 1.17 \text{ KB} $$
And you would need approximately: $$ k = \frac{m}{n} \cdot \ln 2 = \frac{9585}{1000} \cdot \ln 2 \approx 6.64 \rightarrow 7 \text{ hash functions} $$
You can forget about the math, there are calculators online to help you out :)
The point is that this is dramatically more space-efficient than storing the complete set of elements. Here’s a simple implementation sketch in Rust.
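It derives the k probe positions from a single 64-bit hash via double hashing; a production implementation would pack the bits tightly and use faster, dedicated hash functions:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A minimal Bloom filter: m bits and k probe positions per item.
struct BloomFilter {
    bits: Vec<bool>, // a real filter would pack bits; Vec<bool> keeps it readable
    k: u32,
}

impl BloomFilter {
    fn new(m: usize, k: u32) -> Self {
        Self { bits: vec![false; m], k }
    }

    /// Derive the k probe positions for an item via double hashing:
    /// position_i = (h1 + i * h2) mod m, from one 64-bit hash split in two.
    fn positions<T: Hash>(&self, item: &T) -> Vec<usize> {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let h = hasher.finish();
        let (h1, h2) = (h as u32 as u64, h >> 32);
        (0..self.k as u64)
            .map(|i| (h1.wrapping_add(i.wrapping_mul(h2 | 1)) % self.bits.len() as u64) as usize)
            .collect()
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        for p in self.positions(item) {
            self.bits[p] = true;
        }
    }

    /// false => definitely not present; true => probably present.
    fn might_contain<T: Hash>(&self, item: &T) -> bool {
        self.positions(item).iter().all(|&p| self.bits[p])
    }
}

fn main() {
    // m and k taken from the sizing example above (1,000 values, ~1% false positives).
    let mut filter = BloomFilter::new(9_585, 7);
    filter.insert(&"192.168.1.1");
    assert!(filter.might_contain(&"192.168.1.1"));
    // Never inserted: almost certainly reported as absent.
    println!("10.9.9.9 maybe present: {}", filter.might_contain(&"10.9.9.9"));
}
```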
As I mentioned you can find them everywhere, for example:
- Elasticsearch is built on Apache Lucene, which uses Bloom filters for efficient term lookups.
- Cassandra uses Bloom filters to avoid checking every SSTable data file for the partition being requested.
- ClickHouse uses Bloom filters in its data-skipping indexes.
- Loki uses Bloom filters to accelerate queries by skipping irrelevant logs as well.
When Bloom Filters Fall Short #
Bloom filters shine when you’re looking for something specific and rare, the classic “needle in a haystack” scenario. But they quickly lose their edge when that needle becomes a recurring pattern.
A classic example is multi-tenancy. When handling logs from many tenants, it’s common to have a tenant_name field. In that case most queries, if not all, will filter on a specific tenant_name:
```
@AuthLogs | where tenant_name = 'ducks-corp' ...
```

As mentioned earlier, shards are often partitioned by time ranges so that we can skip irrelevant data when filtering by timestamp. The problem is that logs from many tenants are usually mixed together across time, so every tenant’s logs are likely to show up in almost every shard. That means a Bloom filter on tenant_name will be pretty useless: it will return “maybe” for almost every shard and we’ll still need to scan all of them.
The tenant_name example is a pretty extreme case, so let’s take a more realistic one: say you’re hunting for activity related to a single user, “ducker”:
```
@AuthLogs | where actor.username == "ducker" and timestamp > ago(7d)
```

You’re in a large organization:
- 1TB worth of data is ingested per day.
- Authentication logs make up about 5% of the total → 50 GB/day.
- Each log entry averages around 1 KB → roughly 50 million @AuthLogs entries per day.
- Each shard contains about 1 million entries → 50 shards per day.
Now assume our suspect ducker appears in just 0.02% of the logs; that’s 10,000 logs total per day. Note that ducker may be a power IT user shared across many people, or an account used by some automation. If the data is uniformly distributed, each shard has about 200 matching entries, and the chance of a shard having zero matches is:

$$ P(\text{no match}) = (1 - 0.0002)^{1{,}000{,}000} \approx 1.35 \times 10^{-87} $$

Either way, Bloom filters mark every shard as a “maybe”, offering no pruning. It’s important to note that although large shards have their benefits, the larger the shard, the more likely that even a low-frequency value will appear at least once, so it becomes much harder for Bloom filters to prune any shard…
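To see why shard size matters so much here, a tiny back-of-the-envelope calculation (assuming the same 0.02% rate and a uniform, independent distribution):

```rust
/// Probability that a shard of `shard_size` entries contains at least one
/// matching log, when the value appears in a fraction `rate` of all logs.
fn p_shard_has_match(rate: f64, shard_size: u32) -> f64 {
    1.0 - (1.0 - rate).powi(shard_size as i32)
}

fn main() {
    let rate = 0.0002; // "ducker" appears in 0.02% of logs
    for shard_size in [10_000u32, 100_000, 1_000_000] {
        // The bigger the shard, the closer this gets to 1.0, i.e. no pruning.
        println!(
            "shard of {:>9} entries -> P(contains ducker) = {:.6}",
            shard_size,
            p_shard_has_match(rate, shard_size)
        );
    }
}
```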
So now we understand that Bloom filters are optimized for infrequent matches. When the match rate is high, the bit array becomes saturated.
A more general rule of thumb: Bloom filters become ineffective when:
- The value you’re searching for is not rare so it appears frequently across many shards.
- Each shard is large enough that even rare terms still appear often.
- The field being filtered has low cardinality, e.g. a categorical field like status or event_type.
So before reaching for a Bloom filter, consider: how rare is the thing you’re looking for? If the answer is “not very” you may just be wasting CPU cycles hashing your way to scanning most of the shards anyway…
Alternative Approach: Data Partitioning #
A simple solution for fields that are too common for Bloom filters is to partition your data by the values of those fields. Instead of probabilistic filtering, you group data by field values into separate shards.
Going back to our tenant_name example, partitioned shards would look like:
```
Shard_1: tenant=ducks-corp, 2023-01-01T00:00:00Z to 2023-01-01T06:00:00Z
Shard_2: tenant=ducks-inc , 2023-01-01T00:00:00Z to 2023-01-01T06:00:00Z
...
```

Now when you query | where tenant_name == "ducks-inc", the system only needs to scan shards tagged with tenant=ducks-inc. It can skip everything else, no probabilistic guessing needed.
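A toy write-side routing function shows the idea; the key layout here is hypothetical, purely for illustration:

```rust
/// Pick the shard (here just a storage prefix) a log line should be written
/// to, based on its tenant and a 6-hour time bucket.
fn shard_key(tenant: &str, timestamp_secs: i64) -> String {
    let bucket = timestamp_secs / (6 * 3600); // 6-hour partitions
    format!("tenant={tenant}/bucket={bucket}")
}

fn main() {
    // Logs from different tenants written at the same time land in different
    // shards, so a query filtered on tenant_name only touches one prefix.
    println!("{}", shard_key("ducks-corp", 1672531200));
    println!("{}", shard_key("ducks-inc", 1672531200));
}
```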
This approach works best for low-cardinality fields with a small, fixed number of possible values like tenant names, regions, or event types. Partitioning by high-cardinality fields like user IDs or UUIDs would create too many tiny shards, making the search operation inefficient (we will probably cover shard merging in a future post).
Beyond Membership: What Else Can We Prune? #
Here’s a challenge: what about the following query, which Bloom filters can’t handle at all?
@AuthLogs | where FailedAttempts > 10Think about it for a moment. Bloom filters are designed for exact membership testing (“is X in the set?”), but this query asks “show me all of the logs with a value greater than 10.” How would you skip irrelevant shards?
Hint: Just like Bloom filters, you would need to store some metadata about the numeric values in a shard.
The answer: for each numeric field, store the min/max range:
```
Shard_1: FailedAttempts: min=0,  max=5
Shard_2: FailedAttempts: min=3,  max=15
Shard_3: FailedAttempts: min=12, max=25
```

Now FailedAttempts > 10 can immediately skip Shard_1 (max=5), while FailedAttempts < 10 can skip Shard_3 (min=12).
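A rough sketch of this kind of min/max pruning, with hypothetical per-shard statistics matching the example above:

```rust
/// Per-shard min/max statistics for a numeric field (hypothetical metadata).
struct NumericStats {
    shard: &'static str,
    min: i64,
    max: i64,
}

/// Could the shard possibly contain rows with field > threshold?
/// If not, it can be skipped without being read at all.
fn may_match_gt(stats: &NumericStats, threshold: i64) -> bool {
    stats.max > threshold
}

/// Same idea for field < threshold.
fn may_match_lt(stats: &NumericStats, threshold: i64) -> bool {
    stats.min < threshold
}

fn main() {
    let shards = [
        NumericStats { shard: "Shard_1", min: 0, max: 5 },
        NumericStats { shard: "Shard_2", min: 3, max: 15 },
        NumericStats { shard: "Shard_3", min: 12, max: 25 },
    ];
    // FailedAttempts > 10: only shards whose max exceeds 10 need scanning.
    for s in shards.iter().filter(|s| may_match_gt(s, 10)) {
        println!("scan {} for FailedAttempts > 10", s.shard);
    }
}
```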
Here’s another puzzle: what about this query?
```
@AuthLogs | where UserAgent contains "Chrome/91"
```

How would you efficiently skip shards that definitely don’t contain that substring? Bloom filters work for exact matches, but substring searches are trickier…
Mutable Shards #
Throughout our examples, we’ve made an important assumption that’s worth calling out: shards are immutable. Once written, they don’t change. This assumption breaks down when you need to update or delete data, which brings us to our next topic.
Cuckoo Filters: When Elements Need to Leave the Nest #
Bloom filters have one big limitation: they don’t forget. Once you add an element, you can’t remove it, because different elements might “share” the same bits. Clearing bits for one element could accidentally wipe out another, leading to false negatives. One workaround is to use a counting Bloom filter, which maintains a counter for each bit position rather than a single bit. When adding an element, you increment the counters; when removing, you decrement them. An element exists if all its positions have counts greater than zero. But this comes at a cost, as each position now requires multiple bits to store the counter.
That’s where Cuckoo filters come in as a more elegant alternative, named after the cuckoo bird’s charming habit of tossing other birds’ eggs out of the nest. Instead of a bit array, Cuckoo filters use a fixed-size hash table that stores short “fingerprints”: small hashes derived from the inserted values, much shorter than the original items, which helps save space. Each fingerprint has two possible “homes” in the table, determined by two different hash functions. If both positions are already occupied, the filter evicts one of the existing fingerprints (just like the cuckoo evicts its nest-mates) and moves it to its alternate location. If that spot is also full, the process continues, evicting again until an empty slot is found or the filter gives up after a fixed number of attempts.
Because each fingerprint is tied to specific slots, deletion is possible: simply remove the fingerprint if you find it in one of the expected slots.
To summarize, they offer:
- Deletion of elements (ideal for expiring old data)
- Lower false positive rates compared to Bloom filters
- Comparable or better space efficiency
The trade-off? Potentially slower insertions due to the eviction logic, and slightly slower lookups.
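To make the fingerprint-and-eviction mechanics concrete, here’s a heavily simplified Cuckoo filter sketch. The bucket size, fingerprint width, and eviction limit are arbitrary illustrative choices, not tuned values from the paper:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const SLOTS_PER_BUCKET: usize = 4;
const MAX_KICKS: usize = 500;

/// Minimal Cuckoo filter: each item is stored as a 1-byte fingerprint
/// in one of two candidate buckets.
struct CuckooFilter {
    buckets: Vec<[Option<u8>; SLOTS_PER_BUCKET]>,
    mask: usize, // buckets.len() - 1; the length must be a power of two
}

fn hash64<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

impl CuckooFilter {
    fn new(num_buckets: usize) -> Self {
        assert!(num_buckets.is_power_of_two());
        Self { buckets: vec![[None; SLOTS_PER_BUCKET]; num_buckets], mask: num_buckets - 1 }
    }

    /// Fingerprint plus the two candidate bucket indices for an item.
    fn candidates<T: Hash>(&self, item: &T) -> (u8, usize, usize) {
        let h = hash64(item);
        let fp = (h >> 56) as u8 | 1; // small fingerprint, never zero
        let i1 = (h as usize) & self.mask;
        let i2 = i1 ^ (hash64(&fp) as usize & self.mask); // XOR trick: alt(alt(i)) == i
        (fp, i1, i2)
    }

    fn alt_index(&self, index: usize, fp: u8) -> usize {
        index ^ (hash64(&fp) as usize & self.mask)
    }

    fn try_place(&mut self, index: usize, fp: u8) -> bool {
        for slot in self.buckets[index].iter_mut() {
            if slot.is_none() {
                *slot = Some(fp);
                return true;
            }
        }
        false
    }

    fn insert<T: Hash>(&mut self, item: &T) -> bool {
        let (mut fp, i1, i2) = self.candidates(item);
        if self.try_place(i1, fp) || self.try_place(i2, fp) {
            return true;
        }
        // Both buckets are full: evict an occupant, move it to its alternate
        // bucket, and repeat until something finds a home or we give up.
        let mut index = i1;
        for kick in 0..MAX_KICKS {
            let victim_slot = kick % SLOTS_PER_BUCKET;
            let evicted = self.buckets[index][victim_slot].replace(fp).unwrap();
            fp = evicted;
            index = self.alt_index(index, fp);
            if self.try_place(index, fp) {
                return true;
            }
        }
        false // the filter is considered full
    }

    fn contains<T: Hash>(&self, item: &T) -> bool {
        let (fp, i1, i2) = self.candidates(item);
        self.buckets[i1].contains(&Some(fp)) || self.buckets[i2].contains(&Some(fp))
    }

    /// Unlike a Bloom filter, removal is possible: drop the fingerprint from
    /// whichever candidate bucket holds it (only safe for items actually inserted).
    fn remove<T: Hash>(&mut self, item: &T) -> bool {
        let (fp, i1, i2) = self.candidates(item);
        for index in [i1, i2] {
            if let Some(slot) = self.buckets[index].iter_mut().find(|s| **s == Some(fp)) {
                *slot = None;
                return true;
            }
        }
        false
    }
}

fn main() {
    let mut filter = CuckooFilter::new(1024);
    filter.insert(&"192.168.1.1");
    assert!(filter.contains(&"192.168.1.1"));
    filter.remove(&"192.168.1.1");
    println!("still present: {}", filter.contains(&"192.168.1.1"));
}
```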
Further Reading:
- Bloom Filter Calculator
- Cuckoo Filter: Practically Better Than Bloom
- Probabilistic Filters Visualized
Let’s switch to a different kind of problem. For security monitoring purposes, you typically need to answer questions like:
“How many unique IP addresses attempted to authenticate to our VPN in the last 24 hours?”
```
@VPNLogs | where timestamp > ago(24h) | summarize unique_ips = dcount(source_ip)
```

“How many distinct hosts communicated with domains on our watchlist this week?”
```
@DNSLogs | where timestamp > ago(7d) and query.domain in (<watchlist_domains>) | summarize unique_hosts = dcount(source_host)
```

“How many different user accounts accessed our internal data-sensitive database this month?”
```
@DBLogs | where timestamp > ago(30d) and db_name == "sensitive_data_db" | summarize unique_users = dcount(actor.username)
```

These seem like simple questions, but at scale they become challenging. The naive approach to counting unique items is straightforward: collect items into a set and return its size.
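In Rust, a toy version of this (with hard-coded sample values standing in for a real log stream) looks like:

```rust
use std::collections::HashSet;

/// Naive exact distinct count: remember every value we have seen.
/// Memory grows linearly with the number of unique values.
fn count_distinct(values: &[&str]) -> usize {
    let mut seen = HashSet::new();
    for v in values {
        seen.insert(*v);
    }
    seen.len()
}

fn main() {
    let source_ips = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "192.168.1.1"];
    println!("unique_ips = {}", count_distinct(&source_ips)); // 3
}
```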
The problem with this approach is that the memory requirements grow linearly with the number of unique elements. In a large-scale data system, we can expect millions of unique IP addresses, hundreds of thousands of unique user accounts, and tens of thousands of unique hostnames. You need to keep track of all of them, and on top of the raw data itself there is significant overhead from the hash-set data structure.
The real problem isn’t just the memory for a single count. In practice, you’re running dozens of these queries simultaneously:
- Different time windows (hourly, daily, weekly, monthly)
- Different log sources (VPN, auth, DNS, network traffic)
- Different groupings (by region, department, risk level)
What seemed like a simple counting problem quickly consumes gigabytes of memory.
Finally, distributing exact counting across multiple machines requires coordination to avoid double-counting elements, which can be tricky as well.
Enter HyperLogLog++: Counting Without Remembering #
HyperLogLog++ solves this using a different approach. Instead of “remembering” every element, it tries to estimate how many unique elements there are using the statistical properties of hash functions. The estimates are pretty accurate while using a tiny, fixed amount of memory.
The high-level idea is hashing each element and looking for rare patterns in the binary representation. The rarer the pattern you’ve observed, the more elements you’ve likely processed.
Think of it like estimating the population of a city by sampling random people and asking where they were born. If you ask 100 people and find that the most remote birthplace is someone from a tiny village 500 miles away, you can infer that the city probably has a pretty large population. The logic behind it is that the odds of randomly finding someone from such a remote place is low unless there are many people to sample from. Another classic analogy is coin flips: if someone tells you they flipped 5 heads in a row, you might guess they’ve done around 32 flips total, since the probability of getting 5 consecutive heads is about \(\frac{1}{32}\). The longer the streak of heads, the more flips they’ve likely made.
HyperLogLog works similarly but with binary patterns. Here’s the intuition:
- Hash everything consistently: Every element gets run through a hash function, giving us a random-looking binary string
- Count leading zeros: Look at how many zeros appear at the start of each hash
- Track the maximum: Keep track of the longest run of leading zeros you’ve ever seen
- Estimate from extremes: The longer the maximum run of zeros, the more unique elements you’ve probably processed
So, similar to the coin flip analogy, if you’ve seen a hash starting with 5 zeros (00000...), it’s safe to assume you’ve processed roughly \(2^5 = 32\) different elements, since the probability of any single hash starting with 5 zeros is about \(\frac{1}{32}\). This of course only works if your hash function produces uniformly random bits, so each bit position should be 0 or 1 with equal probability, independent of the input data and of the other bit positions, just like coin flips.
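To make the intuition concrete, here’s a toy estimator in Rust that keeps only a single maximum; the \(2^{\text{max zeros}}\) estimate mirrors the description above, and it’s deliberately naive (don’t use it for anything real):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy estimator: hash every element, remember the longest run of leading
/// zeros seen, and estimate the cardinality as 2^max_zeros.
fn estimate_unique<T: Hash>(items: impl IntoIterator<Item = T>) -> u64 {
    let mut max_zeros = 0u32;
    for item in items {
        let mut h = DefaultHasher::new();
        item.hash(&mut h);
        max_zeros = max_zeros.max(h.finish().leading_zeros());
    }
    1u64 << max_zeros.min(63)
}

fn main() {
    // 10,000 distinct values: the estimate will be the right order of
    // magnitude at best, and can easily be off by 2-4x either way.
    let estimate = estimate_unique((0..10_000u32).map(|i| format!("user-{i}")));
    println!("estimated unique elements: {estimate}");
}
```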
You’re probably thinking now that relying on a single “maximum” doesn’t sound like a good idea, just like I thought when I first read about it. You might get lucky and see a very rare pattern early, leading to a massive overestimate, or unlucky and never see rare patterns, leading to underestimation. HyperLogLog++ addresses this problem by using multiple independent estimates and combining them to get a much more stable result.
The HyperLogLog++ Algorithm #
Instead of keeping one maximum, HyperLogLog++ maintains many buckets, each tracking the maximum leading zeros for a subset of elements. This provides multiple independent estimates that can be averaged for better accuracy. Here’s how it actually works:
- Hash the element using a good hash function
- Split the hash: Use the first n bits to choose a bucket (2^n total buckets), and count leading zeros in the remaining bits. For example for the hash 101000101... and n=4, we split it as 1010|00101... so the bucket index is 10 (1010 in binary) and we count 2 leading zeros in the remaining part.
- Update the bucket: If this is the longest run of zeros seen for this bucket, update it
- Estimate the total: Combine all bucket values using harmonic mean and bias correction
The formula for the harmonic mean of a set of \(n\) positive real numbers \(x_1, x_2, \dots, x_n\) is:
$$ H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} $$
Why use harmonic mean when estimating the count? Each bucket value represents the maximum leading zeros observed, which corresponds to an estimated count of \(2^{\text{value}}\) elements. Say you have 4 buckets with values \([2, 2, 2, 6]\), representing estimated counts of \([4, 4, 4, 64]\) elements respectively.
- Using arithmetic mean: \(\frac{4 + 4 + 4 + 64}{4} = 19\)
- Using harmonic mean: \(\frac{4}{\frac{1}{4} + \frac{1}{4} + \frac{1}{4} + \frac{1}{64}} \approx 5.2\)
As you can see the harmonic mean is much less sensitive to that one outlier bucket that got lucky with a rare pattern, giving a more stable estimate.
The actual formula the algorithm uses is:

$$ \frac{\alpha \cdot m^2}{\sum_{j=1}^{m} 2^{-M_j}} $$

where \(M_j\) is the value stored in bucket \(j\). It’s based on the harmonic mean but adds:
- An extra factor of \(m\) (so \(m^2\) instead of \(m\)) - to scale from “average per bucket” to “total count”
- The \(\alpha\) constant - used to correct mathematical biases in the harmonic mean estimation and its value depends on the number of buckets.
So the 4 buckets from the example before, with \(\alpha = 0.568\), actually give \(\frac{0.568 \times 4^2}{2^{-2} + 2^{-2} + 2^{-2} + 2^{-6}} \approx 11.9\) total elements.
Note: there’s no predefined \(\alpha\) for 4 buckets, as using HLL with such a small number of buckets isn’t supported by the original algorithm.
This raw estimate has systematic biases, especially when most buckets are still empty (value 0). HyperLogLog++ detects this and switches to a more accurate method for small datasets, plus uses pre-computed correction tables to fix predictable errors across different cardinality ranges.
HyperLogLog with 1,024 buckets estimating 1,000 unique elements. Each bucket represents the maximum number of leading zeros + 1 seen. This "lucky" run achieved 0.2% error, showing how bucket values distribute across the hash space. Try playing with this online calculator
Here’s a simplified Rust implementation.
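It’s a minimal sketch rather than the real thing: bucket selection from the top bits, rank tracking, the harmonic-mean estimate with the standard bias constant, a small-range (linear counting) correction, and a per-bucket-maximum merge. Real HyperLogLog++ additionally keeps a sparse representation for small cardinalities and applies empirical bias-correction tables:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Simplified HyperLogLog: 2^precision buckets, each storing the maximum
/// "leading zeros + 1" (the rank) observed among hashes routed to it.
struct HyperLogLog {
    precision: u32,   // number of bits used to pick a bucket
    buckets: Vec<u8>, // 2^precision bucket values
}

impl HyperLogLog {
    fn new(precision: u32) -> Self {
        Self { precision, buckets: vec![0; 1 << precision] }
    }

    fn add<T: Hash>(&mut self, item: &T) {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let hash = hasher.finish();

        // The first `precision` bits choose the bucket...
        let bucket = (hash >> (64 - self.precision)) as usize;
        // ...the remaining bits determine the rank (leading zeros + 1).
        let remaining = hash << self.precision;
        let rank = remaining.leading_zeros().min(64 - self.precision) as u8 + 1;

        self.buckets[bucket] = self.buckets[bucket].max(rank);
    }

    /// Merge another sketch with the same precision by taking per-bucket
    /// maxima; this is how estimates from multiple machines are combined.
    fn merge(&mut self, other: &HyperLogLog) {
        for (a, b) in self.buckets.iter_mut().zip(&other.buckets) {
            *a = (*a).max(*b);
        }
    }

    fn estimate(&self) -> f64 {
        let m = self.buckets.len() as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m); // standard approximation for m >= 128
        let sum: f64 = self.buckets.iter().map(|&v| 2f64.powi(-(v as i32))).sum();
        let raw = alpha * m * m / sum;

        // Small-range correction (linear counting) while many buckets are empty.
        let zeros = self.buckets.iter().filter(|&&v| v == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}

fn main() {
    let mut hll = HyperLogLog::new(12); // 4,096 buckets
    for i in 0..1_000_000u32 {
        hll.add(&format!("user-{i}"));
    }
    println!("estimated unique users: {:.0}", hll.estimate());
}
```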
Choosing the Right Precision #
For most applications, \(4,096\) buckets (\(2^{12}\)) hit the sweet spot of good accuracy with minimal memory overhead. You can play with different configurations using this HyperLogLog calculator which also has a nice visualization.
To see how significant the memory reduction can be, here’s an example: say you’re tracking 1 million unique users from authentication logs, where each username is 10 characters long on average.
Using HLL++ with 4,096 buckets requires approximately 32KB of memory. According to this paper by Google, the standard error of the cardinality estimate can be calculated as: $$ \text{SE} \approx \frac{1.04}{\sqrt{m}} \rightarrow \frac{1.04}{\sqrt{4096}} \approx 0.01625 $$ An error of \(1.625\%\), which in our example is \(\pm 16,250\), means the estimated cardinality will most likely fall between 983,750 and 1,016,250.
Now let’s write a small Rust program to see how much memory we would need to store 1 million unique usernames, each 10 characters long, in a hash set for an exact count.
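A minimal version of such a program might look like this; the exact username format doesn’t matter, only that we end up with 1 million distinct 10-character strings in a HashSet<String>:

```rust
use std::collections::HashSet;

fn main() {
    let mut usernames: HashSet<String> = HashSet::new();

    // 1 million unique usernames, padded to 10 characters each.
    for i in 0..1_000_000u32 {
        usernames.insert(format!("user{:06}", i));
    }

    println!("unique usernames: {}", usernames.len());
}
```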
Now let’s see how much memory that actually takes with heaptrack:
The measurement shows 93.4 MB total memory usage. This includes overhead from String allocations, HashSet internal structure, and the format! macro. While the code could obviously be optimized, that’s a \(\frac{93.4 \times 1024^2}{32 \times 1024} \approx 2988.8\)x memory reduction for a small accuracy loss – a trade-off worth taking for most applications.
When HyperLogLog++ Stumbles #
HyperLogLog++ has some important limitations worth knowing:
- Small Cardinalities: For datasets with fewer than ~100 unique elements, the probabilistic nature introduces more error than it’s worth. A simple hash set would be more accurate and use similar memory.
- Merge Complexity: In distributed systems, you often need to combine cardinality estimates from multiple sources. While you can merge HyperLogLog++ structures (by taking the maximum value for each bucket), the error accumulates with each merge operation.
- No Individual Elements: Unlike exact approaches, you can’t ask “have I seen element X before?”. You can only get total counts (making it unsuitable for deduplication tasks).
- Non-Uniform Data: HyperLogLog++ assumes your hash function produces truly random bits. If your data has patterns that survive hashing (like sequential IDs), accuracy can suffer. This is rare with good hash functions, but good to know.
This algorithm is the basis for cardinality estimation in many databases and search engines, for example:
- approx_distinct in Trino
- PFCOUNT in Redis
- Cardinality Aggregation in Elasticsearch and QuickWit
- uniqHLL12 in ClickHouse
Further Reading on HyperLogLog:
- HyperLogLog in Presto: A significantly faster way to handle cardinality estimation
- HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
- HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm
We’ve explored how probabilistic data structures like Bloom filters and HyperLogLog++ can be used for shard pruning and cardinality estimation in large-scale log processing systems, trading small amounts of accuracy for massive gains in memory efficiency and query performance.
If you’re interested in learning more about probabilistic structures, here are some more useful ones: Count-Min Sketches estimate item frequencies, MinHash enables fast set similarity, and Quantile Sketches provide accurate percentile calculations. We may explore them in future posts.
Probabilistic structures are just one part of building a scalable log search system. We’ve already looked at query planning and optimization in distributed search in our blog post “Hidden Complexities of Distributed SQL”. Future posts will cover other critical challenges like high-throughput indexing for real-time ingestion, shard merging strategies to improve search efficiency by minimizing the number of shards queried, tokenization and indexing design choices for different search capabilities, and distributed query coordination. All essential for systems that process terabytes of logs every day.