Cache Avalanche, Penetration, and Breakdown, Explained So That Even Beginners Can Understand

Hello everyone, my name is Qixi (xī).

As backend developers, we are all too familiar with caching.

This article introduces the business background, solutions, and business reliability handling for cache avalanche, penetration, and breakdown. One caveat up front: the best solution must be adapted to the actual business, and no two businesses handle these problems in exactly the same way.

I have seen plenty of introductions to cache avalanche, penetration, and breakdown online. Perhaps because everyone works on different things, I found that many readers have questions like the following:

After a random time is added to the expiration time, what if requests happen to arrive exactly when a key expires at its randomized time? Wasn't the random time added in vain?
If hot data never expires, won't dirty data keep accumulating?

Regarding the above issues, I will explain them one by one in the text. The cache mentioned below refers to Redis.

I will try to explain this high-frequency interview topic clearly. If, after reading it, you can chat about it at ease with an interviewer, then you are the star of the show.

Now, let's get to the point.

1. Cache Avalanche

A cache avalanche means that a large portion of the cache expires at the same time. If a big wave of requests arrives at that moment, they all go to the database, which eventually cannot keep up and collapses.

1.1 Examples of business scenarios

The app homepage holds a lot of hot data. During a large-scale event, different homepage data has to be shown in different time periods.

For example, at 0:00 the homepage data must be replaced. At that moment the old homepage data expires and the new homepage data has only just started loading.

Also at 0:00, a small event kicks off and a large number of requests pour in. Because the new data has only just started loading, most requests miss the cache and go to the database, and the database eventually goes down.

1.2 Solutions

To emphasize again: any so-called solution must be adapted to the actual business, and different businesses are not handled in exactly the same way.

1.2.1 Method 1

A common approach is to add a random time to the expiration time.

Note that this random time should not be just a few seconds; it can be as long as several minutes. In the example above the amount of data is large, and Redis processes commands on a single thread, so a buffer of a few seconds may not be enough for all the new data to finish loading.

So when choosing the random range, err on the side of longer rather than shorter. The keys will expire eventually either way, so the end result is the same.

A wider expiration range also spreads the keys' expirations further apart, which shortens the time Redis is blocked while expiring keys.

As for the question raised at the beginning: "What if requests happen to arrive exactly when a key expires at its randomized time? Wasn't the random time added in vain?"

Seen against the event example above, is it still a problem? Only a small share of the keys expires at any given moment, so only the requests for those keys fall through to the database. Combine it with the business; you must always combine it with the business.
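
To make Method 1 concrete, here is a minimal Python sketch of computing an expiration time with random jitter. The base lifetime of one hour, the jitter range of ten minutes, and the key names are all assumptions chosen for illustration; pick values that fit your own business.

```python
import random

BASE_TTL_SECONDS = 3600        # assumed base lifetime of a homepage key
MAX_JITTER_SECONDS = 10 * 60   # assumed jitter range: up to 10 minutes


def ttl_with_jitter(base=BASE_TTL_SECONDS, max_jitter=MAX_JITTER_SECONDS, rng=random):
    """Return the base TTL plus a random offset in [0, max_jitter] seconds."""
    return base + rng.randint(0, max_jitter)


# Each key gets its own lifetime, so the keys do not all expire at the same instant.
# The key names here are hypothetical.
ttls = {f"home:item:{i}": ttl_with_jitter() for i in range(5)}
```

When writing to Redis, the computed value would be passed as the key's expiration, for example through the EX option of the SET command.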

1.2.2 Method 2

Add a mutex so that only one request rebuilds the cache while the others wait. However, this causes a significant drop in throughput, so whether to use it still depends on the actual business. It is not suitable for the example above, for instance.
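
The mutex idea can be sketched roughly as follows. This is only an in-process illustration: the dict stands in for Redis, and `load_from_db` is a hypothetical database query, not a real API.

```python
import threading

cache = {}                        # stand-in for Redis
rebuild_lock = threading.Lock()   # a single lock for simplicity; per-key locks are also possible
stats = {"db_calls": 0}           # counts how often the database is hit


def load_from_db(key):
    """Hypothetical slow database query."""
    stats["db_calls"] += 1
    return f"value-of-{key}"


def get(key):
    value = cache.get(key)
    if value is not None:
        return value
    with rebuild_lock:
        # Check again: another thread may have rebuilt the entry while we waited.
        value = cache.get(key)
        if value is None:
            value = load_from_db(key)
            cache[key] = value
    return value
```

All requests that miss the cache queue up behind the lock, which is exactly where the throughput drop comes from. With several service instances, an in-process lock is not enough; a distributed lock would be needed, for example one built on the Redis SET command with the NX and EX options.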

1.2.3 Method 3

Do not set an expiration time on hot data. If it never expires, normal business requests naturally never reach the database.

That raises a new question: what about the dirty data that accumulates because nothing expires?

It's simple: delete it all once the event is over.

So how should the example above be handled? Choose Method 1; alternatively, load the new data needed at 0:00 into Redis in advance instead of waiting until 0:00 to load it.
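
One possible way to load the 0:00 data in advance is to write it under a key that contains the date, so the old and new data can sit in the cache side by side. The sketch below assumes exactly that; the key scheme, the function names, and the dates are made up for illustration, and the dict stands in for Redis.

```python
from datetime import datetime

cache = {}  # stand-in for Redis


def homepage_key(now):
    """Key of the homepage data that is valid on the given day (hypothetical scheme)."""
    return f"home:{now:%Y%m%d}"


def preload(switch_time, data):
    """Run some time before switch_time, for example from a scheduled job."""
    cache[homepage_key(switch_time)] = data


def read_homepage(now):
    return cache.get(homepage_key(now))


# The old data is already cached; the new data is loaded before midnight.
cache[homepage_key(datetime(2024, 6, 17, 12, 0))] = "old homepage"
preload(datetime(2024, 6, 18, 0, 0), "new homepage")
```

Readers switch to the new key by themselves when the date changes, so nothing has to be loaded at the moment the traffic peak arrives.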

2. Cache breakdown

Cache breakdown means that after a hot key expires or is deleted, the large volume of online requests that would have hit that key instantly falls through to the database, eventually overwhelming it.

It has the feel of the proverb: a thousand-mile dike collapses because of a single ant nest.

2.1 Examples of business scenarios

This situation is usually caused by a mistaken operation, such as setting the wrong expiration time or deleting the key by mistake.

Who hasn't made a mistake? Ever heard of "delete the database and run"? I, for one, have deleted the data in a test database by mistake. Fortunately I got away unharmed (dog head).

2.2 Solutions

2.2.1 Method 1

If it is a code problem, do the code reviews that ought to be done.

Be clear about whether hot data should expire at all and, if so, when.

Since it is hot data, it is very likely part of a core flow. Core functionality has to be safeguarded, so reduce the chance of mistakes. If something does go wrong, expect a wave of complaints from users.

2.2.2 Method 2

For mistaken operations in production, tighten permission management. Production permissions in particular must go through review, to guard against a slip of the hand.


3. Cache penetration

Cache penetration means that clients request data that exists in neither the cache nor the database, so every such request reaches the database. If there are many of these requests, the database will still go down.

3.1 Examples of business scenarios

The database primary key ids are all positive, yet a client sends a query with id = -1.
A query interface has a status field, where 0 means started and 1 means ended, yet requests with status=3 keep arriving.

3.2 Solutions

3.2.1 Method 1

Validate parameters properly, and return early for unreasonable ones.

This is very important and applies to every business. The backend must follow a principle of mutual distrust.

Simply put, do not trust request data from the front end, clients, or upstream services. Do the validation that ought to be done.

We never know what strange data users will enter. Even if you have agreed with another developer on how parameters are passed, you cannot be sure they will stick to it. And, taking a step back, what if the interface gets cracked?

You have to protect yourself. Otherwise, when something goes wrong and you tell your boss it happened because someone did not pass parameters as agreed, or because you did not expect users to fill things in that way, just see what your boss says (dog head.jpg).
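
Using the id and status examples from 3.1, parameter validation might look like the sketch below. The function name and the exception class are hypothetical; the point is only that invalid requests are rejected before they touch the cache or the database.

```python
VALID_STATUSES = {0, 1}  # 0 = started, 1 = ended, as in the example in 3.1


class InvalidParameter(ValueError):
    """Raised when a request parameter fails validation."""


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_query(item_id, status):
    """Return the parameters unchanged if they are valid, otherwise raise."""
    if not _is_int(item_id) or item_id <= 0:
        raise InvalidParameter(f"id must be a positive integer, got {item_id!r}")
    if not _is_int(status) or status not in VALID_STATUSES:
        raise InvalidParameter(f"status must be 0 or 1, got {status!r}")
    return item_id, status
```

A request handler would call `validate_query` first and return an error response right away when it raises.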

3.2.2 Method 2

Keys whose data cannot be found are also cached, temporarily.

For example, for 30 seconds. This prevents a large number of identical requests from hitting the database at the same instant and reduces the pressure.

Afterwards, though, you must find out why such requests occur and fix the root cause. This method only mitigates the problem.

If you find that the requests come from certain IPs and the data is illegitimate, you can restrict those IPs at the gateway layer.
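
Caching "not found" results for a short time can be sketched like this. The dicts stand in for Redis and the database, the 30-second lifetime comes from the text above, and the 600-second lifetime for real values is an assumption.

```python
import time

NULL_TTL_SECONDS = 30      # short lifetime for "not found" markers
VALUE_TTL_SECONDS = 600    # assumed lifetime for real values
_MISSING = object()        # sentinel meaning "the database has no such row"

cache = {}                 # key -> (value, expires_at); stand-in for Redis
database = {1: "item-1"}   # hypothetical table
stats = {"db_calls": 0}


def query_db(item_id):
    stats["db_calls"] += 1
    return database.get(item_id)


def get_item(item_id, now=None):
    now = time.monotonic() if now is None else now
    entry = cache.get(item_id)
    if entry is not None and entry[1] > now:
        # Cache hit, possibly a cached "not found".
        return None if entry[0] is _MISSING else entry[0]
    value = query_db(item_id)
    if value is None:
        cache[item_id] = (_MISSING, now + NULL_TTL_SECONDS)
    else:
        cache[item_id] = (value, now + VALUE_TTL_SECONDS)
    return value
```

Repeated requests for the same missing id reach the database at most once per 30 seconds. Keep the lifetime short so that a row created later becomes visible quickly.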

3.2.3 Method 3

Provide an interception mechanism that can quickly decide whether a request is valid, such as a Bloom filter, which Redis supports through the RedisBloom module.

Have it maintain all valid keys. If the requested key is not in the filter, return immediately; otherwise, fetch the data from the cache or the database.

For more on Bloom filters, see my earlier article on Bloom filters.
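
To show the idea behind the filter, here is a small self-contained Bloom filter in Python. It is a teaching sketch, not the implementation used by Redis; the bit array size, the number of hash functions, and the use of SHA-256 are arbitrary choices of mine.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by hashing the key with different seeds.
        data = str(key).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(seed.to_bytes(2, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely not added"; True means "possibly added".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))
```

A Bloom filter can report a key as present when it is not (a false positive), but it never reports an added key as absent. So a request whose key is "definitely not added" can be rejected at once, while the rest continue to the cache or the database. With RedisBloom, the corresponding commands are BF.ADD and BF.EXISTS.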

4. Business reliability processing

As mentioned at the beginning, the cache here refers to Redis.

Improve Redis availability: deploy Redis either as a cluster or as master-slave plus Sentinel, so that Redis stays available.

Master-slave without Sentinel cannot fail over automatically. With only master-slave, what happens if a node goes down during peak hours or at a critical moment of an event?

By the time the alert fires, the problem is located, the information is passed around, and operations has run through its whole procedure, it will be far too late.

Reduce reliance on cache

For hot data, consider adding a local cache, for example Guava or Ehcache, or even something simpler such as a HashMap or a List.

While reducing the pressure on Redis, it can also improve performance, killing two birds with one stone.
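
A local cache in front of Redis can be as small as the sketch below. The class name and the 5-second lifetime are assumptions for illustration, and `loader` stands for whatever reads from Redis and then the database.

```python
import time


class LocalCache:
    """Tiny in-process cache with a per-entry time to live."""

    def __init__(self, ttl_seconds=5.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> (value, expires_at)

    def get(self, key, loader):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]              # served from local memory
        value = loader(key)              # for example a Redis read, then the database
        self.entries[key] = (value, now + self.ttl)
        return value
```

This sketch has no size limit and is not thread-safe, which libraries such as Guava and Ehcache take care of. A short lifetime keeps the local copy from drifting too far from what is in Redis.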

Business downgrade

From the standpoint of protecting downstream systems (interfaces or the database), can you apply rate limiting in high-traffic scenarios? That way, even if the cache goes down, it will not drag every downstream system down with it.

Also consider which functions can be downgraded. Write the downgrade switch and the downgrade logic in advance, and rely on them to stay stable at critical moments.
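
Rate limiting and a downgrade switch can be combined roughly as follows. I use a token bucket here as one common rate-limiting technique; the article does not prescribe a specific algorithm. The `switches` dict is a hypothetical stand-in for a switch that would normally live in a configuration center.

```python
import time


class TokenBucket:
    """Allows at most `capacity` requests in a burst, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated_at = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


switches = {"degraded": False}  # hypothetical downgrade switch


def handle_request(bucket, load, fallback):
    """Serve the fallback when the switch is on or the rate limit is exceeded."""
    if switches["degraded"] or not bucket.allow():
        return fallback()
    return load()
```

Here `load` is the normal path (cache, then database) and `fallback` is the downgrade logic, for example returning default data. Because both are written ahead of time, flipping the switch during an incident takes effect immediately.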

Writing original content is not easy. If you think the article is good, I hope you will follow my official account: 七淅在学Java
