Hello, I'm Tomo.
A brother told me privately that he didn't answer one of the interview questions during his interview with Dog East: Redis'Bitmap
cap (a poem)Bloom Filter
What's the difference?
In fact, it is to test the little old man of these two tools of the underlying data structure whether to understand, not too difficult questions. However, bitmap and Bloom filters in large data volumes and high concurrency business use frequency is not low, the knowledge should be mastered, since the question that we simply sort out their underlying principles, application scenarios and the connection between them.
Bitmap
The Bitmap in Redis is a more specialized data type that takes the smallest units ofbit
To store data, we know that a byte consists of 8 bits, which makes it extremely space efficient when dealing with large amounts of binary state (true, false, or 0, 1, etc. with only two states) data compared to traditional data structures that use bytes for storage. However, it is not a completely new data type, and the underlying implementation is still based on theString
Type.
To make it easier to understand, you can think of the underlying structure of a Bitmap as an array of bit bits, where each bit corresponds to an offset (similar to an array subscript). Different states are represented by setting the bit value at a particular offset to 0 or 1.
Let's say we want to design a quiz game system. The rule is: if the user answers all 7 questions correctly, he/she will get the grand prize.
Each user who answers a question has his or her own key, theanswer:user:X
The bitmap is used to record the user's answers, setting the question number as the corresponding offset. Use bitmap to record the user's answer, set the question number to the corresponding offset, when the user answers ✅ the question correctly, the offset bit value is set to 1; when the user answers ❌ the question incorrectly, the bit value is set to 0.
assuming that the useruser:1
If you answer questions 2, 5, or 7 correctly, you can set the bit values corresponding to offsets 2, 5, or 7 to 1, and the rest of the bit values are set to 0. To check the user's answer to a particular question, you only need to iterate through the data structure according to the offsets, and once you come across a bit value of 1, you will know that you answered the question correctly.
After the quiz is over, the next step is to count the winners, i.e., those users who have answered all 7 questions correctly.
To quickly count whether a user has answered all the questions correctly, you can use theBITCOUNT
command to count the number of bit values that are set to 1. You can count the number of bit values that are set to 1 by executing theBITCOUNT answer:userX == 7
Such an operation is judged, and if the result is equal to 7, then the user has answered all the questions correctly.
If you want to use bitmap to determine whether an email address is in the blacklist, how to set the offset? Unfortunately, bitmap doesn't support using string as offset. However, we can hash the mailbox to get the hash value and then calculate the offset.
Since hash operations are used, there will inevitably be data collision problems, i.e., different strings may yield the same hash value. In this way, the state judgment may not be accurate. Don't worry, the Bloom filter is introduced later (Bloom Filter
) See how it comes to optimizing this.
operating command
Bitmap commands are few and simple to use, the main ones areSETBIT
、GETBIT
、BITCOUNT
、BITOP
A few orders.
SETBIT
: Used to set a bit value at a specified offset with a time complexity of O (1). For example, when the user has answered question 7 correctly, the bit value at offset 7 corresponding to the question number can be set to 1 to indicate that the question has been answered correctly.
# user key answer:user:1
# Offset: question number 7
# Set to 1 if the question is answered correctly
SETBIT answer:user:1 7 1
GETBIT
: Get the bit value at a specified offset, again with efficient time complexity. It is possible to quickly query the status of a user's answer to a specific question number.
# Query the user's answer to question 7, 1-right answer 0-incorrect answer
GETBIT answer:user:1 7
BITCOUNT
: Used to count the number of bit values that are set to 1. For example, above I can easily count the users who answered all the questions correctly, but I can only know the number, I can't see which questions.
# Count the number of users with correct answers to question 1
BITCOUNT answer:user:1
BITOP
: Performs bitwise operations on one or more bitmaps and saves the result in a new key, supports AND, OR, NOT and XOR. The usage of this command is to operate on bit values of the same offset in multiple bitmaps. If I want to know the question that both user 1 and user 2 answered correctly, after AND operation, if only question number 1 is the question that both users answered correctly, then only the bit value corresponding to question number 1 will be 1 in the new result set.
# of questions answered correctly by both user1 and user2, it can be seen that only question number 1 is answered correctly
SETBIT answer:user:1 1 1
SETBIT answer:user:1 2 0
SETBIT answer:user:1 3 1
SETBIT answer:user:2 1 1
SETBIT answer:user:2 2 1 1
SETBIT answer:user:2 3 0
BITOP AND resultbitmap answer:user:1 answer:user:2 3 0
make the best use of limited resources
vantage
-
Extremely space-efficient: bitmap is really space-saving data storage. Roughly speaking, a 100 million-bit bitmap only takes up about 12MB of memory, which is a huge space saver compared to other data structures;
-
Fast querying: bit operations are usually faster than other data structure queries. The time complexity is O (1) for both setting the bit value and getting the bit value, enabling fast responses to query requests;
-
Easy to operate: Support single bit operation, bit statistics, bit logic operation, etc., with high operation efficiency, no need to compare and shift;
drawbacks
-
Due to the structural characteristics of the data, it is only suitable for representing two states, 0 and 1. Bitmap is not suitable for cases where more states need to be represented;
-
Only when the data is more dense to have an advantage, if we only set (20, 30, 88888888888) three offsets of the bit value, you need to create a 99999999 length of the BitMap, but in fact, only three data, this time there is a big waste of space, encountered such a problem, then you can through the introduction of another
Roaring BitMap
settle;
application scenario
See Bitmap is still a relatively simple data structure, occupies little space query efficiency, very suitable for recording the state of the scene, it is very common application scenarios, such as:
-
User check-in status (number of consecutive days)
-
Online status of users (counting active users)
-
Questionnaire answers and so on, I guess!
Bloom Filter
As we mentioned above, when bitmap records the state of a character element, it needs to use hash operation to derive the offset first. However, after the introduction of hash operation, hash collision may occur, resulting in misjudgment of the state.
The Bloom filter further optimizes this problem to a manageable false positive rate, when we add an email address to the collection, multiple different hash functions map this email address to different offset locations in the bitmap and set these bit values to one.
To determine whether an email address is in the collection, mapped to multiple locations on the bitmap by the same hash function, theIf the value on each of these bits is 1, the mailboxprobable
in the set; if any of the positions has a value of 0, the elementI'm sure it's not.
muster. This is a feature of the Bloom filter.
Although but Bloom filters can still be misjudged, er~, but the good thing is that we can pass theAdjusting the size of the Bloom filter and the number of hash functions to control the false positive rate。
operating command
There are also not many commands for Bloom filters, the main ones used are as follows:
: Create a new Bloom filter and specify the capacity and error_rate.
<key> <error_rate> <capacity>
myfilter 0.000001 999999
: Get information about Bloom filters, including capacity, false positive rate, etc.
<key>
cap (a poem)
: adding elements to Bloom filters and batch adding, respectively
# Add elements to the Bloom filter
myfilter hello
<key> <item> [item ...]
cap (a poem)
: checking for the presence of an element in a Bloom filter and batch checking for the presence of an element, respectively.
# Whether the element exists in the Bloom filter
myfilter hello
# Whether the element exists in the Bloom filter
<key> <item> [item ...]
make the best use of limited resources
vantage
-
The space consumption of Bloom filter is also very small, it does not store the complete data itself, and like bitmap, the underlying layer also indicates whether the data exists or not by bit bits.
-
The performance is more stable, regardless of the number of elements in the collection, the time complexity of insertion and query operations is very low, usually O (k), where k is the number of hash functions. This means that when dealing with large-scale data, the performance of Bloom filters does not drop drastically as the amount of data increases.
drawbacks
-
There is a certain rate of misidentification: the Bloom filter suffers from misidentification, i.e., it is possible for an element to be judged as being in the set when it is not actually in the set. This is because multiple elements may be mapped to the same location by a hash function, leading to a false judgment. However, when the Bloom filter judges that an element is not in the set, it is 100% correct.
-
Deleting elements is more difficult: in general, it is not possible to delete elements directly from a Bloom filter. This is because a position may be mapped by more than one element, and if the value of that position is set to 0 directly, it may affect the judgment of other elements.
application scenario
The Bloom filter has a certain amount of false positives, so the scenario in which it is used must allow for inaccuracies:
-
Solve Redis cache-penetration problem: Seconds product details are usually cached into Redis. If there are a large number of malicious requests to query for non-existent products, the Bloom filter can quickly determine that these products do not exist, thus avoiding queries to the database and reducing the pressure on the database.
-
Mailbox blacklist filtering: In the mail system, you can use Bloom filter to filter spam and malicious mails. Store known spammers' addresses or characteristics in the Bloom filter, and determine whether the sender is in the blacklist when new emails come.
-
Filtering of crawler URLs: In a crawler program, in order to avoid crawling the same URLs over and over again, a Bloom filter can be used to record URLs that have already been crawled. When a new URL appears, first determine whether it has been crawled.
-
Too much too much .....
summarize
Together combed the bitmap and Bloom filter principle, usage, and their respective advantages and disadvantages and application scenarios, the environment is not good more to improve their technical skills, and now the interview three sentences do not leave the amount of big data and high concurrency, such questions want to cope with freely, not only to have the depth of the breadth of mastering these two knowledge points to provide more of an answer is also good. It is not good to write a bad deal of people to deal with look at it!