
Preface

It’s been a long time since I wrote an article. Let’s learn about the famous Bloom filter today.

Main text

What is a Bloom filter?

This powerful tool was invented in 1970 by Burton Howard Bloom. It is a binary-vector (bit array) data structure designed specifically to solve data membership query problems. It can tell you that something definitely does not exist, or that it may exist.

Compared with traditional data structures such as List, Set, and Map, it is faster and takes up less space. The trade-off is that the returned result is probabilistic rather than exact.

What are Bloom filters used for?

Speaking of the role of Bloom filters, the first thing you might think of is Redis cache penetration, so let's first talk about what cache penetration is.

You must have visited shopping websites such as JD.com and Taobao. Have you ever noticed that, in daily development, the URL of each product page actually corresponds to a specific product?

For example, the 857 marked in red is the SKU number of the product; you can think of it as the unique code for this product model. The product number here is 857, and the page displayed is naturally the corresponding content.

image.png

For example, on the day of the 618 shopping festival, when hundreds of millions of netizens are shopping frantically, the mall application receives a huge number of requests for browsing, ordering, and payment at the same time. How can the machines withstand this? This involves our system architecture, so let's take a look.

image.png

First, a mall user initiates a request, say to view product No. 857.

The mall application will first query the backend Redis cache server.

If there is no data for product No. 857 in the cache, the application has to query the backend database server and then fill the result into Redis. This is the normal workflow.

After a long period of accumulation, the data in the cache server may look like this. For easier understanding, assume the mall has 1,000 products, numbered 1 to 1000.

Now, when a mall user queries product No. 857, the mall application no longer needs to fetch data from the MySQL database; it simply reads the data from the Redis server and returns it. Because Redis is memory-based, it is many times faster than a traditional MySQL database in both throughput and processing speed.

As time goes on, the Redis server ends up caching all product data numbered from 1 to 1000.

Security risks faced by Redis: cache penetration

Please note that the cache currently only contains data for products 1 to 1000.
Now suppose a malicious competitor, or a third-party company running a crawler bot, queries a large batch of numbers in a short period of time, and none of those numbers exist in the database, for example 8888, 8889, and 8890.

image.png

At this time, our system faces a major risk: the mall application queries the backend Redis, finds no such data in the cache, and then goes on to query the database server.

[High concurrency of invalid requests can cause a crash]

Keep in mind that the database server cannot handle instantaneous, ultra-high concurrent access very well. In a short period of time, these invalid requests from crawler or attack bots pour straight into the database server, which severely degrades system performance and can even crash the system.

This access pattern, which bypasses the Redis server and queries the backend database directly, is called cache penetration.

Small-scale cache penetration will not seriously affect the system, but a cache penetration attack is another matter. A cache penetration attack is when a malicious user queries a large amount of non-existent data in a short period of time, causing a flood of requests to hit the database; when the number of requests exceeds the database's load limit, the system responds with high latency or even goes down.

image.png

Preventing cache penetration with an "artifact": the Bloom filter

One of the most common designs in architecture is the Bloom filter, which can effectively reduce cache penetration. The main idea is to use a very long binary array together with a series of hash functions to determine whether a piece of data exists.

That may sound a bit abstract, so let's walk through it with a series of diagrams.

A Bloom filter is essentially an n-bit binary array; each position can only hold 0 or 1. For our current scenario, I have simulated such a binary array, with every position initialized to 0.

image.png

This binary array is stored on the Redis server. So how is the array actually used?

1. Hash each number several times to determine its positions

As mentioned, suppose the mall has 1,000 product numbers, ranging from 1 to 1000. When the Bloom filter is initialized, each product number is hashed several times to determine its positions in the array.

(1) Calculation for product No. 1

For example, for the number "1", we hash it three times. A hash function simply takes the input data and maps it to a specific position.

Hash function 1: it locates the second bit of the binary array and changes its value from 0 to 1;

Hash function 2: it locates the position with index 5 and changes it from 0 to 1;

Hash function 3: it locates the position with index 99 and changes it from 0 to 1.

image.png

(2) Calculation for product No. 2

After product No. 1 is done, it is product No. 2's turn. After three hashes, product No. 2 is located at indexes 1, 3, and 98 respectively.

image.png

(3) Calculation for product No. 1000

Once product No. 2 has been processed, we continue with 3, 4, 5, 6, 7, 8, and so on, all the way to the last number, 1000. When product No. 1000 is processed, it sets the bits at indexes 3, 6, and 98 to 1.

image.png
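The three calculations above can be captured in a small sketch: a bit array plus a few hash functions, filled in for every product number. This is a plain in-memory Java illustration with assumed sizes and hash functions (the article's real bit array lives in Redis), not an exact implementation.

```java
import java.util.BitSet;

// Minimal illustrative Bloom filter: an n-bit array plus k hash functions.
public class SimpleBloomFilter {
    private final BitSet bits;   // the binary array, all positions start at 0
    private final int size;      // number of bits (n)
    private final int[] seeds;   // one seed per hash function (k of them)

    public SimpleBloomFilter(int size, int hashCount) {
        this.size = size;
        this.bits = new BitSet(size);
        this.seeds = new int[hashCount];
        for (int i = 0; i < hashCount; i++) {
            seeds[i] = 31 * (i + 1);          // arbitrary illustrative seeds
        }
    }

    // Map a value to a position in [0, size) using one seed.
    private int hash(long value, int seed) {
        return (int) Math.floorMod(value * seed + (value >>> 16), (long) size);
    }

    // "Adding" a number flips its k hashed positions from 0 to 1.
    public void add(long value) {
        for (int seed : seeds) {
            bits.set(hash(value, seed));
        }
    }

    // If any hashed position is still 0, the value was definitely never added;
    // if all are 1, it may have been added (false positives are possible).
    public boolean mightContain(long value) {
        for (int seed : seeds) {
            if (!bits.get(hash(value, seed))) {
                return false;
            }
        }
        return true;
    }
}
```

Initializing it for products 1 to 1000 is then a simple loop:

```java
SimpleBloomFilter filter = new SimpleBloomFilter(1_000_000, 3);
for (long sku = 1; sku <= 1000; sku++) {
    filter.add(sku);   // each product number sets 3 bit positions
}
```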

2. How the Bloom filter is used for e-commerce products

Once the Bloom filter is stored on the Redis server, how is it actually used? This comes down to how product numbers are checked against it in daily development.

(1) First, a case where the data exists

For example, a user wants to query the data of product No. 858. We know that 858 exists, and its three hashes locate positions 1, 5, and 98 respectively. Since the value at every hashed position is 1, the corresponding number is considered to exist.

image.png

(2) Next, a case where the data does not exist

For example, a user queries 8888. After three hashes, 8888 is located at positions 3, 6, and 100. The value at index 100 is 0, and if any of the hashed positions is 0, the data definitely does not exist.

image.png

To summarize briefly: if all of the Bloom filter's hashed positions are 1, the data may exist.

Note the wording: it may exist; but if any one of the positions is 0, the data definitely does not exist.
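In code, the two cases above come down to a single check against the sketch from earlier (858 and 8888 being the example numbers used above):

```java
filter.mightContain(858L);   // true:  all hashed positions are 1, so it may exist
filter.mightContain(8888L);  // false in the article's example: at least one hashed
                             // position is 0, so it definitely does not exist
```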

The Bloom filter was never designed to give an exact answer; by design, it can produce false positives.

(3) Finally, a false positive case

Consider this demo: we query 8889, and after three hashes every hashed position happens to be 1. Although product 8889 does not exist in the database, the Bloom filter judges that it does. This is the small probability of false positives that a Bloom filter can produce.

image.png

3. How to reduce the Bloom filter's false positive rate?

There are two ways to reduce false positives:

The first is to increase the number of bits in the binary array. In the original setup we used 100 positions; if we enlarge the array 10,000 times, to 1 million positions, the hashed values become more spread out and collisions become rarer. This is the first way.

The second is to increase the number of hash functions. Each hash captures more of the data's characteristics, and the more characteristics we check, the smaller the probability of a false positive.

We currently hash three times; hashing ten times would lower the false positive probability further. But the cost is that the CPU has to perform more calculations, which reduces the Bloom filter's performance.
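For reference, the standard analysis of a Bloom filter with m bits, n inserted items, and k hash functions puts the false positive probability at approximately

$$p \approx \left(1 - e^{-kn/m}\right)^k, \qquad k_{\text{opt}} = \frac{m}{n}\ln 2$$

so enlarging the bit array (a bigger m) always helps, while the number of hash functions has a sweet spot: beyond it, extra hashes cost CPU without lowering the false positive rate.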

By now you should have a basic understanding of Bloom filters. But how do we actually use them during development? Let's take a look.

How to use a Bloom filter during development?

1. Using a Bloom filter in Java

In fact, after so many years of accumulation in Java, classic algorithms like the Bloom filter have long been encapsulated for us. The Redisson library provides a built-in Bloom filter that lets programmers set one up very simply and directly.

image.png
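The screenshot above shows the Redisson usage; in case it is hard to read, here is a comparable sketch (the address, key name, and capacity are illustrative assumptions rather than the article's exact values):

```java
import org.redisson.Redisson;
import org.redisson.api.RBloomFilter;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class BloomFilterDemo {
    public static void main(String[] args) {
        // Point Redisson at the Redis server (address and port are illustrative).
        Config config = new Config();
        config.useSingleServer().setAddress("redis://127.0.0.1:6379");
        RedissonClient redisson = Redisson.create(config);

        // The name is the Redis key under which the Bloom filter's data is stored.
        RBloomFilter<Long> bloomFilter = redisson.getBloomFilter("product:bloom");

        // First parameter: expected number of elements; second: allowed false positive rate (1%).
        bloomFilter.tryInit(100_000L, 0.01);

        // Adding data hashes it several times and flips the corresponding bits to 1.
        bloomFilter.add(1L);

        System.out.println(bloomFilter.contains(1L));    // true  -> may exist
        System.out.println(bloomFilter.contains(8888L)); // false -> definitely does not exist

        redisson.shutdown();
    }
}
```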

The code above shows how to use Redisson. The first few lines set the address and port of the Redis server.

The key point is here: we instantiate a Bloom filter object, and the parameter is the Redis key under which the Bloom filter data is saved.

The next line is critical. We need to call the tryInit method on the Bloom filter, which takes two parameters:

The first parameter is the expected size used to initialize the Bloom filter; the larger it is, the lower the chance of a false positive.

The second parameter, 0.01, means the maximum allowed false positive rate is 1%, which is what we usually set in our projects. If it is set too small, the false positive rate drops, but more hash operations are needed, which reduces system performance (as mentioned earlier), so 1% is also the value I recommend.

After initializing the Bloom filter, we can add data to it with the add method. Adding data means hashing it several times and flipping the corresponding bits from 0 to 1. For example, we add the number 1, and then we can use the Bloom filter's contains method to check whether a given value exists.

If we query 1, it outputs true; if we query 8888, which does not exist, it outputs false.

Please note: the meaning of these two results is different.

If the output is false, it means that the data definitely does not exist;

But if the output is true, there is still up to a 1% chance that the data does not actually exist, because the Bloom filter can produce false positives.

That is how Bloom filters are used in Java. But what does a Bloom filter look like in a real project? What is the processing flow?

2. Using a Bloom filter in a project

Let's look at how a Bloom filter is used in a project. It boils down to the following three parts:

image.png

The first part is to initialize the Bloom filter when the application starts, for example by loading 1,000, 10,000, or 100,000 products into it and flipping the corresponding bits from 0 to 1.

After that, when a user sends a request, it carries the product number. If the Bloom filter determines that the number exists, the application reads the data from the Redis cache directly; if the Redis cache does not yet hold the corresponding product data, it reads the database and loads the result back into the Redis cache, so the next query for the same number can be served from the cache.

The other situation is that the Bloom filter judges that the number does not exist; in that case a "data does not exist" message is returned immediately, so the request is intercepted at the Redis layer.
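Putting the three parts together, a service class in the project might look roughly like the sketch below; Product, ProductRepository, RedisCache, and their methods are hypothetical names used only to show the flow, not an implementation from the article.

```java
import org.redisson.api.RBloomFilter;

// Hypothetical collaborators, defined only so the sketch is self-contained.
record Product(long sku, String name) {}
interface RedisCache { Product get(long sku); void put(long sku, Product p); }
interface ProductRepository { Iterable<Long> findAllSkus(); Product findBySku(long sku); }

// Hypothetical service wiring the Bloom filter into the read path described above.
public class ProductService {
    private final RBloomFilter<Long> bloomFilter;
    private final RedisCache redisCache;
    private final ProductRepository repository;

    public ProductService(RBloomFilter<Long> bloomFilter, RedisCache redisCache,
                          ProductRepository repository) {
        this.bloomFilter = bloomFilter;
        this.redisCache = redisCache;
        this.repository = repository;
    }

    // Part 1: initialize the filter at startup with all existing product numbers.
    public void init() {
        bloomFilter.tryInit(1_000_000L, 0.01);          // capacity and 1% false positive rate
        for (Long sku : repository.findAllSkus()) {     // hypothetical "list all numbers" query
            bloomFilter.add(sku);
        }
    }

    // Parts 2 and 3: check the filter, then the cache, then the database.
    public Product getProduct(long sku) {
        if (!bloomFilter.contains(sku)) {
            return null;                 // definitely does not exist: intercepted early
        }
        Product cached = redisCache.get(sku);
        if (cached != null) {
            return cached;               // served straight from the Redis cache
        }
        Product fromDb = repository.findBySku(sku);
        if (fromDb != null) {
            redisCache.put(sku, fromDb); // backfill the cache for the next query
        }
        return fromDb;
    }
}
```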

You may be wondering: since the Bloom filter has a false positive rate, what happens when a false positive occurs?

In fact, in most cases a false positive has no extra impact on the system. With the 1% false positive rate we just configured, at most about 100 out of 10,000 invalid requests slip through. We have already blocked 99% of the invalid requests, and the few that get through will not have any real impact on the system.

Extended question: what if a product is deleted after initialization?

Finally, a small extended question: what should we do if a product is deleted after the Bloom filter has been initialized? This is one of the Bloom filter's weak points.

The reason is that a given bit in the Bloom filter may be referenced by the hashes of multiple numbers. For example, position 2 in the Bloom filter may be 1 because it is referenced by four product numbers, 3, 5, 100, and 1000, at the same time. We cannot simply clear a single bit of the Bloom filter, or the data would be corrupted. So what should we do?

There are two common solutions in the industry:

  • Asynchronously rebuild the Bloom filter at regular intervals. For example, run a scheduled task on a separate server every 4 hours to regenerate the Bloom filter and replace the existing one.
  • Use a counting Bloom filter. With a standard Bloom filter it is impossible to know how many pieces of data currently reference a given bit, but a counting Bloom filter stores an extra counter at each position to record how many pieces of data reference it, which makes deletion possible; a simplified sketch follows this list. (If you are interested in counting Bloom filters, see the Counting Bloom Filter principle and implementation.)
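A simplified sketch of the counting idea, assuming three illustrative hash seeds (this shows only the principle, not the article's implementation):

```java
// Simplified counting Bloom filter: each position holds a counter instead of a single bit.
public class CountingBloomFilter {
    private final int[] counters;
    private final int size;
    private final int[] seeds = {31, 62, 93};   // illustrative hash seeds (k = 3)

    public CountingBloomFilter(int size) {
        this.size = size;
        this.counters = new int[size];
    }

    private int hash(long value, int seed) {
        return (int) Math.floorMod(value * seed + (value >>> 16), (long) size);
    }

    // Adding increments the k hashed counters instead of just setting bits.
    public void add(long value) {
        for (int seed : seeds) counters[hash(value, seed)]++;
    }

    // Deleting is now safe: decrementing removes only this item's contribution,
    // so positions shared with other items stay non-zero.
    public void remove(long value) {
        for (int seed : seeds) {
            int idx = hash(value, seed);
            if (counters[idx] > 0) counters[idx]--;
        }
    }

    public boolean mightContain(long value) {
        for (int seed : seeds) {
            if (counters[hash(value, seed)] == 0) return false;
        }
        return true;
    }
}
```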

Summary

That's all for today's introduction to Bloom filters.

Reference: Lagou Education, "How to resist 100 million traffic with a Bloom filter"
