Bloom Filter is a space-efficient data structure for quickly determining whether an element is in the set. It can save a lot of memory, but it has a characteristic: there may be a misjudgment, that is, it may think that an element exists in the set, but in fact does not exist; for the non-existent elements, it is guaranteed that there will be no misjudgment. Bloom filters are suitable for use in applications where storage space requirements are extremely stringent and a small number of false positives are acceptable.
1. Principle of operation of Bloom filters
The core idea of Bloom filters is to use multipleHash Functions (Hash Functions)and a Bit Array. The procedure is as follows:
1.1 Insertion of elements
- When an element is inserted, the Bloom filter hashes the element using a number of different hash functions, obtains a number of hashes (positional indexes), and sets the bit array positions corresponding to these hashes to the
1
。 - For example, if an element goes through 3 hash functions and gets 3 different positions, the Bloom filter sets the value of the bit array at those 3 positions to
1
。
1.2 Query elements
- When querying an element, the Bloom filter uses the same hash function to hash the element, yielding multiple locations. If the bits of all these positions are
1
If the Bloom filter considers this element to beprobable; if the bits in any position are0
, then the element can be identifiedIt must not exist.。
1.3 Characteristics
-
Possible miscarriage of justice: It is possible for the Bloom filter to misjudge, and there is a small probability that a query for a non-existent element will fail because some bits in the bit array are set by other elements of the
1
and mistakenly believe that this element exists. - There will be no missed judgments.: If an element does not exist, the Bloom filter must not misjudge its existence, i.e., when the Bloom filter queries whether an element exists, the result is reliable if it is judged not to exist.
2. Components of a Bloom filter
2.1 Bit Array
The bit array is the core data structure of the Bloom filter. It is an array of lengthm
array, which can only be stored at each location0
maybe1
. Initially, all positions in the bit array have the value of0
. When inserting an element, the hash function generates a number of positional indices based on the element value and sets the bits corresponding to these indices to be1
。
2.2 Hash Functions (Hash Functions)
Bloom filters use multiple hash functions (usually separate hash functions) to hash elements. Each hash function generates a different bit array index that is used to determine where the element is stored.
- The chosen hash function should have good homogeneity to ensure that the hash values are evenly distributed over the array of bits to reduce conflicts.
2.3 Number of hash functions (k-values)
k
Indicates the number of hash functions used for each element. The higher the number of hash functions, the lower the probability of misclassification, but the complexity of the query and insertion will increase. Therefore, thek
of quantities is generally chosen as a suitable intermediate value to strike a balance between query performance and false positive rate.
2.4 Length of bit array (m-value)
m
It is the length of the bit array. The longer the bit array is, the lower the false positive rate is, but the more memory it takes up. Therefore, the length of the bit array should be designed according to the actual business requirements and memory overhead.
3. False positive rate of the Bloom filter
The misclassification rate of a Bloom filter is the probability that, at query time, the Bloom filter incorrectly assumes that a non-existent element exists in the collection. The misclassification rate increases with the number of elements inserted in the collection and is affected by several factors:
- Bit array length (m): The longer the array of bits, the lower the false positive rate.
- Number of hash functions (k): The misclassification rate is lowest when the number of hash functions is moderate, but too many makes the misclassification rate increase.
-
Number of elements (n): The more elements that are inserted, the higher the false positive rate, since the bit array is set to
1
more and more bits of the hash function, the chance of collision of the hash function increases.
The Bloom filter's false positive rate is calculated as follows: p=(1-e-k⋅nm)kp = \left( 1 - e^{- \frac{k \cdot n}{m}} \right)^kp=(1-e-mk⋅n) k
-
p
It's the miscarriage of justice rate; -
k
is the number of hash functions; -
n
is the number of elements inserted; -
m
is the length of the bit array.
4. Advantages and disadvantages of Bloom filters
4.1 Advantages
- Efficient space utilization: Bloom filters can store large amounts of data in a small space, especially when the number of elements is large, and it can save memory significantly.
- Low time complexity of query and insert operations: Whether inserting or querying, the time complexity of the Bloom filter is O(k), i.e., linear in the number of hash functions, which is very fast.
- Suitable for large-scale data filtering: For existence determination of massive data, Bloom filters are very efficient and suitable for use in scenarios where the existence of an element needs to be determined quickly.
4.2 Disadvantages
- There is a miscarriage of justice: Bloom filters may falsely determine that an element exists, i.e. the result may be a False Positive, which means that an element does not exist even though the Bloom filter thinks it does. Bloom filters are not suitable for scenarios that require precision.
- Unable to delete element: Bloom filters cannot delete elements directly because the hash function maps multiple elements to the same array position, and deleting one element may invalidate the hash results of other elements. Although there areCounting Bloom Filter(Counting Bloom Filter) can support the delete operation, but its implementation is more complicated.
5. Typical application scenarios for Bloom filters
5.1 Cache Penetration Protection
-
One of the most common applications for Bloom filters is to prevent
cache passthrough
In a Redis caching scenario. In a Redis caching scenario, users request data that may not exist in either the cache or the database, and if left unguarded, these requests will hit the database directly. With Bloom filters, it is possible to reduce invalid database queries by determining whether an element may exist in the database before it is requested.
- Scene: In an e-commerce system, users may frequently query for product IDs that do not exist. Bloom filters can be used to store all legitimate product IDs and determine them before querying, and if they do not exist in the Bloom filter, they can directly return empty results without querying the database and cache.
5.2 spam filter
- Bloom filters can be used in spam systems to quickly determine if an email address or IP is on a blacklist. Due to the high efficiency of Bloom filters, the speed of spam detection can be greatly improved and memory resources can be saved.
5.3 Big data de-duplication
- In large-scale data processing scenarios, Bloom filters can be used to detect whether an element has already appeared, thus realizing the de-duplication operation. It is especially suitable for systems with strict memory requirements, such as distributed crawler systems that require de-duplication of URL processing.
5.4 Database and storage systems
- Bloom filters are widely used in database and storage systems to reduce unnecessary disk I/O operations. Example:
- HBase Use Bloom filters to speed up lookups and avoid unnecessary disk reads.
- Cassandra Reduce the number of disk scans by using a Bloom filter to determine if an SSTable contains a certain key.
6. Extension of the Bloom filter
6.1 Counting Bloom Filter
The Count Bloom Filter is a type of Bloom Filter that supports the delete operation. Unlike standard Bloom filters, each position in the bit array of a Count Bloom filter is no longer a binary0
cap (a poem)1
, but rather a counter. When an element is inserted, the counter on the bit corresponding to the multiple hash function increases; when an element is removed, the counter decreases accordingly.
The disadvantage of the Count Bloom filter is that it requires more storage space (since each position is a counter), but it allows for the deletion of elements, which makes it suitable for dynamic update scenarios.
6.2 Distributed Bloom Filter
In large-scale distributed systems, Bloom filters can be extended toDistributed Bloom FilterThis means that the bit array is distributed over multiple nodes and each node is responsible for a portion of the bit array storage and hash computation. This improves the scalability of the system and adapts to larger datasets.
summarize
Bloom filter is a spatially efficient data structure for scenarios that need to quickly determine whether an element exists or not, especially for preventing cache penetration, spam filtering, big data de-duplication and other scenarios. Although it has a certain false judgment rate, its excellent spatial efficiency and query performance make it an important tool in many large-scale applications.