Location>code7788 >text

Redis series supplement: talk about Bloom filters (go language practice chapter)

Popularity:51 ℃/2024-09-24 08:00:51

Redis 24-part collection

1 Introduction

Bloom Filter is a new feature provided after Redis version 4.0, we usually load it as a plugin into the Redis Service Server to provide Redis with powerful weight filtering capabilities.

It is a probabilistic data structure that can be used to determine whether an element exists in a set. Compared to the de-duplication function of Set collection, Bloom filter can save 90% + in space, the disadvantage is that the de-duplication rate is around 99%, that is, there is about 1% false positive rate, this error is determined by the structure of Bloom filter itself. It has the following advantages and disadvantages:

  • Pros: space efficiency and query time are much better than the usual algorithms
  • Cons: some misidentification and deletion difficulties

Detailed principles can be found in this author's article onTalking about Bloom filters (Principles section)》。

2 Description of application scenarios

When we encounter a large amount of data, in order to de-emphasize and avoid large quantities of repeated calculations, we can consider using Bloom Filter for filtering.
Specific commonly used classic scenarios are listed below:

  • To solve the problem of cache-penetration under high traffic, refer to the author's article on theA Cache Avalanche Disaster Revisited》。
  • Filtering is blocked, pulled black, reduce the recommended information, generally you are browsing Shake or Baidu App, see do not like will be set to reduce the recommended, blocking such information, etc., can be designed using this principle.
  • Various list filtering, using Bloom filters to implement the first layer of whitelist or blacklist filtering, can be used in various AB scenarios.

The following is a case study with cache penetration as the solution goal.

3 Case Study

A classic application scenario for Bloom filters is to solve the cache-penetration problem!

Cache penetration refers to accessing a non-existent key, the cache does not work, the request will penetrate to the DB, the traffic blowout will cause the DB to hang.

For example, if we query a user's information, the program will go to the cache based on the user's number, and if it can't find it, it will search the database again. If you give a non-existent number: XXXXXXXXXX, then every time you can't match it, it goes through the cache to the database.
This is very risky, and it would be a disaster if for some reason a large number of non-existent numbers were queried, or even maliciously spoofed for a large-scale attack.

The solution is to add a layer of BloomFilter before caching:

  • record the existence of the key in the BloomFilter, in the query first go to the BloomFilter to query whether the key exists, if it does not exist, then the database and cache are not, it will be returned directly.
  • Existence and then go to check the cache , into the database to query, which reduces the pressure on the database.

3.1 Huge Query Scenarios

The following train ticket ordering and query as a case for illustration, if the train ticket is malicious attack, simulated the same structure of the train ticket order number, it is likely that through a large number of requests to penetrate through the cache layer to the database avalanche, so the use of Bloom filters to provide a layer of protection for the service.
The specific approach is that when we buy a successful train ticket, we write (asynchronously or in a message queue) the ID of the order number to the Bloom filter, guaranteeing that all subsequent queries go through the Bloom filter before going into the cache to query.

3.2 Creating a Bloom Filter

The syntax for creating a Bloom Filter is as follows:

#  {key} {error_rate} {capacity} [EXPANSION {expansion}] [NONSCALING]
 ticket_orders 0.01 1000000

The command here is to manually create a Bloom filter with the name ticket_orders, an error rate of 0.01, and an initial capacity of 1000000.
Some points to note over here are:

  • The smaller the error_rate is, the less tolerant it is to collisions and the more storage space it requires. If a certain percentage of inaccuracy is allowed, the error_rate can be set slightly larger for scenarios that do not require high accuracy.
  • capacity Setting it too large will waste storage space, and setting it too small will not be very accurate. So you need to be precise when evaluating, both to avoid wasting space and to ensure the accurate ratio.

For those who don't understand the principle, please refer to this author'sTalking about Bloom filters (Principles section)》。

3.3 Creating a Ticket Order

# {key} {value ... }

# Add a single order number
 ticket_orders 1725681193-350000
(integer) 1

# Add multiple order numbers
 ticket_orders 1725681193-350000 1725681197-270001 1725681350-510007
1) (integer) 1
2) (integer) 1
3) (integer) 1

The above statement is storing already booked ticket order numbers into Bloom Filter, both single at a time and multiple at a time.

The steps to synchronize train ticket orders to Bloom Filter are as follows:
image

3.4 Determine whether train ticket order Id exists or not

# {key} {value}, returns 1 if present, 0 if not.
 ticket_orders 1725681193-350000
(integer) 1

# Batch determine if multiple values exist in the Bloom filter with the following statement:
 ticket_orders 1725681193-350000 1725681197-270001 1725681350-510007
1) (integer) 0
2) (integer) 1
3) (integer) 0

Determines whether an element exists in the Bloom Filter, return value = 1 means it exists, return value = 0 means it does not exist. It can be used to determine a single element at a time, or to determine multiple elements at a time.

image

To summarize, we can implement the construction of a Bloom filter to avoid cache-penetration with a few commands.
If you want to query the cache information, you must first go to the Bloom Filter and run it first, and the non-existent ones are directly filtered out, so that the cache will not be punctured by invalid keys.

4 Description of program implementation

can be used in Golanggo-redis/redis library to encapsulate Bloom filter functionality.
You need to first make sure that your Redis server has installed theRedisBloom module, since Redis itself does not directly support Bloom filters. Once theRedisBloom Installed and configured, you can then pass thego-redis/redis library to call the relevantRedisBloom Command.

package bloomfilter

import (
    "context"
    "fmt"
    "/go-redis/redis/v8"
)

// BloomFilter encapsulates the operations associated with Bloom filters
type BloomFilter struct {
    rdb *
    name string
}

// NewBloomFilter creates a new instance of a Bloom Filter
func NewBloomFilter(rdb *, name string) *BloomFilter {
    return &BloomFilter{
        rdb: rdb,
        name: name, }
    }
}

// Add adds the element to the Bloom Filter
func (bf *BloomFilter) Add(ctx , item string, capacity int64, errorRate float64) error {
    // Note: RedisBloom's commands usually do not require explicit settings for capacity and error rate.
    // because these are set when the Bloom filter is created. Here we simplify it to just adding elements.
    // If you need to dynamically adjust these parameters, you may need to recreate the Bloom filter.
    // But for the sake of the example, we'll assume these parameters were set when the Bloom filter was created.
    _, err := (ctx, "", , item).Result()
    return err
}

// Exists checks if the element may exist in the Bloom filter
func (bf *BloomFilter) Exists(ctx , item string) (bool, error) {
    result, err := (ctx, "", , item).Int()
    Int() if err ! = nil {
        return false, err
    }
    // return 1 for possible existence, 0 for certainty of non-existence
    return result == 1, nil
}

// Note: In practice, you may also need to encapsulate additional operations, such as deleting Bloom filters (although Bloom filters don't usually support deleting individual elements)
// or adjusting the capacity and error rate of a Bloom filter (which usually means that the Bloom filter needs to be recreated).

func main() {
    rdb := (&{
        Addr: "localhost:6379", // Redis address
        Password: "", // Password (if any)
        DB: 0, // Database to use
    })

    bf := NewBloomFilter(rdb, "myBloomFilter")

    ctx := ()

    // Add the element
    err := (ctx, "item1", 100000, 0.01) // Note: the commands do not normally require capacity and errorRate.
    if err ! = nil {
        panic(err)
    }

    // Check if the element exists
    exists, err := (ctx, "item1")
    if err ! = nil {
        panic(err)
    }
    ("Exists:", exists)

    exists, err = (ctx, "item2")
    if err ! = nil {
        panic(err)
    }
    ("Exists:", exists)
}

// Note: The capacity and errorRate parameters in the Add method above are not used directly in the command, // because RedisBloom's commands are primarily used to add elements to an existing Bloom filter.
// because RedisBloom's commands are primarily used to add elements to an existing Bloom filter.
// The capacity and error rate are usually set by the command when the Bloom filter is created.

Important Notes

  • In the code above, theAdd methodologicalcapacity cap (a poem)errorRate parameter is not directly used in the command, because the It is only used to add elements to an already existing Bloom filter. If you need to set the capacity and error rate of a Bloom filter, you should create the Bloom filter using the Command.
  • Bloom filters do not support "deletion" in the traditional sense, because once a bit is set to 1, it cannot be set to 0 again (unless the Bloom filter is recreated).
  • Before actually deploying, make sure that your Redis server has the RedisBloom module installed and that thego-redis/redis library is compatible with your version of Redis server.

5 Summary

This article describes several implementation scenarios for Bloom filters.
It also uses the train ticket order information query as a case study to illustrate how to use Bloom filters to avoid cache-penetration and avoid being maliciously attacked.