
Hi, I'm Why.

Here's the story. A few days ago, a reader messaged me saying he had run into a scenario question in an interview:

He said he couldn't find an angle of attack on the question in the moment, and felt his answer missed the point.

I thought about it for a while, and the question really is a bit strange: it has a magical power that leaves you with a thousand things to say and no idea where to start.

Why do I say that?

First, read the question carefully. Let me restate it for you:

Redis in production goes down. All the requests then hit the database, and the database goes down as well.

What kind of question is this?

Redis is down, so isn't the whole cache gone?

The cache is gone, so isn't that a cache avalanche?

The cache avalanches, so doesn't that take the database down with it?

The moment "cache avalanche" comes up, its two brothers, cache penetration and cache breakdown, pop into your head by conditioned reflex, together with the standard solutions for each.

Then you start reciting from memory: the cache is gone entirely; the data is in the database but not in the cache; the data is in neither the cache nor the database...

You can open your mouth and go for several minutes without pausing.

On top of that, many students mix up cache breakdown and cache penetration.

Try this mnemonic, looking at the two Chinese terms: 缓存击穿 (cache breakdown) and 缓存穿透 (cache penetration).

击穿 (breakdown) means punching through the cache.

穿透 (penetration) means going all the way through, straight to the end.

Savor that carefully, and... you probably still won't remember it.

Beyond the "Redis cache three-hit combo" boilerplate above, there's another set of rote-essay answers hidden in the question:

Redis went down. Why did it go down? Was there a single point of failure?

Isn't this just asking you about the high availability of the Redis service?

And at the mention of Redis high availability, master-slave, sentinel, and cluster must immediately pop into your mind, right?

Thinking of this, you can again go for several minutes without pausing.

But those several minutes of eloquence are stopped cold by the follow-up question:

How do you recover right now?

At that point, asking how to recover is the natural next question.

You have to talk about recovery first, and only then about prevention.

If you open with the "cache three-hit combo" and "high-availability architecture" mentioned above, plus the multi-level caching, rate limiting, service degradation, and circuit breaking that most students can think of, it all feels a bit beside the point.

After all, these are preventive measures, and reciting them here only makes the rote-memorization marks more obvious.

Of course you still have to cover them. The trick is to move from in-incident recovery to advance prevention during the answer, with the emphasis on prevention. So how do you make that transition feel natural?

Let me start with in-incident recovery. Honestly, I can cover it in a few sentences.

The service is down, brother; how do you recover it? You restart it, of course.

From the operations perspective, the first priority is naturally to restart the Redis and database services.

But before restarting, there's one small step: take the traffic off first. You can intercept it at the entry point; for example, simply and crudely route all requests to a well-designed error page via the Nginx configuration.

The point is to stop a flood of traffic from knocking over the freshly restarted services the moment they come up, one after another.

If they still can't hold up, keep the magic weapons of distributed systems in your heart:

When all else fails, spend money and pile on machines.

If you feel that piling on machines has no technical content, you can add one more answer from the angle of cache warming.

That is, when the Redis service is restarted, the program first loads the known hot keys into the cache, and only then does the system start serving outside traffic, so as to head off a cache-breakdown scene.
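For reference, a minimal sketch of what that warm-up step might look like, assuming Jedis as the client; the hot-key list and the loadFromDb() helper are made up for illustration:

```java
import redis.clients.jedis.Jedis;
import java.util.List;

public class CacheWarmer {
    // Seed the cache with the known hot keys before the service takes traffic
    public static void warmUp(Jedis jedis, List<String> hotKeys) {
        for (String key : hotKeys) {
            String value = loadFromDb(key);   // pull the hot data from the database
            jedis.setex(key, 600, value);     // write it into Redis with a 10-minute TTL
        }
    }

    private static String loadFromDb(String key) {
        return "value-of-" + key;             // stand-in for a real DB query
    }
}
```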

In fact, the series of operations above has little to do with developers; it's mostly the operations students' work.

The most a developer contributes is designing the service to be stateless in the first place, so that it can be scaled out horizontally in a hurry.

As for how to scale out quickly, that's the operations students' business. Let's not grab someone else's job for now.

Once that's answered, you can use a "but" to transition to advance prevention and begin your real performance.

Pretend to ponder for a moment, then say "but" to the interviewer:

I think that, from the perspective of the technical solution, we should be taking precautions in advance.

All of these problems arose because Redis crashed, that is, because a cache avalanche occurred.

Under high concurrency, besides the cache avalanche, we also have to consider cache breakdown and cache penetration.

And why did Redis crash in the first place? Was it deployed the wrong way? Was high availability not in place?

Should the service add rate limiting or circuit breaking to protect the program as much as possible?

Or should we build a multi-level cache, so that when Redis crashes the flood of traffic doesn't hit the MySQL service directly and take the database down?

At this point the "but" has done its job: the direction of the answer has swung back to advance prevention, right into our strong suit, the specialty rote material, and the "recitation" can begin.

Here I'll briefly go over the cache three-hit-combo problems and Redis high availability.

As for multi-level caching, you can read the article I posted earlier: 《…multi-level caching!》.

Cache breakdown

Let me start with the concept of cache breakdown.

Cache breakdown is the situation where the data a request wants is not in the cache but is in the database.

Generally, that happens because the cache entry has expired.

But because this is a hot key, a huge number of users access it concurrently: many requests arrive at the same moment, all miss the cache, and all go to the database for the data at the same time. Database traffic surges, the pressure spikes instantly, and the database crashes right in front of you.

In other words: a piece of data is cached and every request returns quickly from the cache, until at some point the entry expires and a request fails to find the data there. We say that request has "broken through" the cache.

For this scenario, there are generally three corresponding solutions.

  • The first is to let only one request through to the database and have it rebuild the cache.

How do you let only one of many requests through?

Use the Redis setNX command to set a flag. Whoever sets it successfully goes through; everyone else polls and waits, along the lines of the sketch below.
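A minimal sketch of this mutex idea, assuming Jedis as the client; the lock-key naming and the loadFromDb() helper are made up for illustration:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class MutexRebuild {
    public static String get(Jedis jedis, String key) throws InterruptedException {
        String value = jedis.get(key);
        while (value == null) {
            // NX: set only if absent; EX: the flag expires so a crashed holder can't block forever
            String ok = jedis.set("lock:" + key, "1", SetParams.setParams().nx().ex(10));
            if ("OK".equals(ok)) {
                value = loadFromDb(key);          // exactly one request hits the database
                jedis.setex(key, 600, value);     // rebuild the cache with a 10-minute TTL
                jedis.del("lock:" + key);         // release the flag
            } else {
                Thread.sleep(50);                 // lost the race: wait, then re-check the cache
                value = jedis.get(key);
            }
        }
        return value;
    }

    private static String loadFromDb(String key) {
        return "value-of-" + key;                 // stand-in for a real DB query
    }
}
```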

  • The second solution is background renewal: keep the entry alive from behind the scenes.

The idea is to run a scheduled background task that proactively refreshes data that is about to expire.

For example, the program sets the hot key why with a 10-minute expiry; at the 8th minute, a background job queries the database again, writes the data back into the cache, and resets the expiry to 10 minutes.

Doesn't that feel a bit like the watchdog in Redisson's distributed lock?

I'd say the two ideas are cut from the same cloth.

It's just a little fiddly to write when you actually implement it. A minimal sketch is shown below.
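Something along these lines, again assuming Jedis; the hot key "why" and the loadFromDb() helper are illustrative:

```java
import redis.clients.jedis.Jedis;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HotKeyRefresher {
    private static final String HOT_KEY = "why";
    private static final int TTL_SECONDS = 600;           // 10-minute expiry

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Refresh every 8 minutes, before the 10-minute TTL runs out
        scheduler.scheduleAtFixedRate(() -> {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                String fresh = loadFromDb(HOT_KEY);        // re-query the database
                jedis.setex(HOT_KEY, TTL_SECONDS, fresh);  // write back and reset the TTL
            }
        }, 8, 8, TimeUnit.MINUTES);
    }

    private static String loadFromDb(String key) {
        return "value-of-" + key;                          // stand-in for a real DB query
    }
}
```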

I once used exactly this idea when building a serial number system.

It went roughly like this.

The serial number system is a very critical system. To reduce the impact of database failures on the service, after startup it pre-caches 5,000 serial numbers in the cache for each business system.

A background job then periodically checks how many serial numbers are left in the cache; if fewer than 1,000 remain, it generates new ones in advance and tops the cache back up to 5,000.

The benefit: if the database fails, I'm guaranteed up to 5,000 cached serial numbers to keep the upstream businesses running, which buys time to restore the database.

You can think of this as one form of the background-renewal idea.

  • The third method is the simplest: never expire.

Why does the cache get broken through? Because a timeout was set and the entry got evicted, right?

Then just don't set a timeout.

Combine it with the actual scenario: you could figure out with your toes that this key is bound to be hot, that a flood of requests will access this data, and that the value of this key never changes.

Why would you set an expiry on data like that?

Just put it in and let it never expire.

In fact, the ultimate form of the background-renewal idea above is exactly "never expire".

The difference is that background renewal actively updates the cache, which suits scenarios where the data does change; there will be windows of cache inconsistency, and it comes down to how long your business can tolerate them.

In short: specific situations call for specific analysis.

But the ideas themselves must be clear; the final plan is always a combination or a variant of the standard ones.

Cache penetration

So what is cache penetration?

Cache penetration means the data a request wants exists in neither the cache nor the database, yet someone fires off such requests at high density within a short time, and every single one of them lands on the database, putting pressure on the database service.

Generally speaking, such requests are malicious.

For example, mine is a tech public account. You know perfectly well I don't sell beer, yet you insist on coming here to buy a bottle from me. That's a malicious request.

How to solve it?

Two options.

The first: cache the empty object.

Even if the database query finds nothing, we still cache the request's key, with the value set to NULL.

The next identical request then hits this NULL and gets handled at the cache layer, without putting pressure on the database.

This is simple to implement and the development cost is very low.
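A minimal sketch of the idea, assuming Jedis; the queryDb() helper and the "NULL" marker string are illustrative:

```java
import redis.clients.jedis.Jedis;

public class NullCache {
    public static String get(Jedis jedis, String key) {
        String value = jedis.get(key);
        if (value != null) {
            return "NULL".equals(value) ? null : value;  // cached miss answered from Redis
        }
        value = queryDb(key);
        if (value == null) {
            jedis.setex(key, 60, "NULL");  // cache the miss briefly to shield the database
            return null;
        }
        jedis.setex(key, 600, value);
        return value;
    }

    private static String queryDb(String key) {
        return null;                       // stand-in for a real DB query
    }
}
```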

But watch out for this follow-up interview question:

In a malicious attack, the keys are usually different on every request, and each one is requested only once. If you cache those keys, each is still requested only once, so every request still reaches the database and the database isn't protected at all. What then?

For this question: Bloom filter — ever heard of it?

I wrote about Bloom filters before; see 《Bloom, awesome! Cuckoo, awesome!》.

The defining property of a Bloom filter: when it says a value exists, the value may or may not actually exist; when it says a value does not exist, it definitely does not exist.

So, based on this property, you can build all the data that does exist into a Bloom filter and check every request against it first.

That blocks the vast majority of attack traffic. A sketch follows.
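Here's a minimal sketch using Guava's BloomFilter as one possible implementation; the capacity and error rate are illustrative:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomGate {
    public static void main(String[] args) {
        // Size for the expected key count with a 1% false-positive rate
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

        // Build in every key that actually exists (e.g. all valid IDs from the DB)
        filter.put("existing-key-1");
        filter.put("existing-key-2");

        // At request time: "definitely absent" requests never reach the cache or the DB
        if (!filter.mightContain("random-attack-key")) {
            System.out.println("reject: key definitely does not exist");
        }
    }
}
```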

But there's also a rapid-fire follow-up here that's easy to overlook:

Interviewer: a Bloom filter has limited capacity and doesn't support deletion, and as its content grows, the false-positive rate rises. May I ask how you solved that?

There are two directions for the answer.

First, if deletion isn't supported, swap in a "wheel" (an alternative implementation) that does support deletion.

For example, the cuckoo filter mentioned in my previous article.

Or simply rebuild the Bloom filter ahead of time.

For example, when capacity reaches 50%, allocate a new, larger Bloom filter to replace the old one.

The only caveat: to rebuild, you have to know which data needs to go back in, so you need somewhere to record it.

Redis, the database, or even an in-memory cache will all do.

It doesn't matter if you've never actually done this in production; you just need to answer with confidence.

Believe me, 80% of interviewers have never done it in production either; you may well all have read the same material.

Cache avalanche

Cache avalanche is when a large portion of the cached data reaches its expiry time simultaneously while query volume is huge, so suddenly the data is not in the cache but only in the database.

Requests slam into the database, traffic surges, pressure spikes instantly, and the database crashes right in front of you.

This differs from the cache breakdown described earlier: breakdown is a large number of requests concurrently querying the same piece of data,

while an avalanche is many different pieces of data all reaching their expiry, so none of them can be found in the cache.

"Avalanche" really is a vivid name for it.

The way to prevent an avalanche, in one phrase: stagger the expirations.

When setting a key's expiry, add a small random offset so that large numbers of cache entries don't expire at the same moment and trigger an avalanche.
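A minimal sketch of staggered expiry, assuming Jedis; the base TTL and jitter range are illustrative:

```java
import redis.clients.jedis.Jedis;
import java.util.concurrent.ThreadLocalRandom;

public class StaggeredExpiry {
    public static void setWithJitter(Jedis jedis, String key, String value) {
        int baseTtl = 600;                                      // 10-minute base TTL
        int jitter = ThreadLocalRandom.current().nextInt(120);  // up to 2 extra minutes
        jedis.setex(key, baseTtl + jitter, value);              // keys now expire staggered
    }
}
```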

If an avalanche does happen, service degradation, circuit breaking, and rate limiting can reject a portion of the requests and keep the service itself alive.

But these all hurt the user experience to some degree. Even if our program has this logic, it's there as a last line of defense; from the user's point of view, we never want to enter it.

So the emphasis stays on preventing the avalanche in the first place.

Finally, there's one ultimate form of avalanche: the entire Redis service going down.

Which brings us to the high-availability architecture of the Redis service.

Redis high availability architecture

When Redis high availability comes up, basically everyone can name the three modes: master-slave, sentinel, and cluster.

Master-slave is very simple, so I won't dwell on it; its main drawback is that a failure requires manual intervention, and anything that requires manual intervention isn't truly highly available.

Sentinel and Cluster are the two solutions given on the official website:

https://redis.io/topics/introduction

The highlighted sentence there says that Redis provides high availability via Redis Sentinel and Cluster.

Of the two, Sentinel is described as the official high availability solution:

So let's focus on sentinel mode.

Sentinel is used to manage multiple Redis servers; there's a good diagram of this in the book Redis Development and Operations (《Redis开发与运维》).

It mainly performs three types of tasks:

  • Monitoring: Sentinel constantly checks whether your master and slave servers are operating normally.
  • Notification: when a monitored Redis server has a problem, Sentinel can notify administrators or other applications through an API.
  • Automatic failover: when a master server stops working normally, Sentinel starts a failover: it promotes one of the failed master's slaves to be the new master and repoints the failed master's other slaves to replicate from it; and when clients try to connect to the failed master, they are given the new master's address, so the system keeps serving with the new master in place of the failed one.

Sentinel itself is a distributed system: we can run multiple sentinels.

Those sentinels then have to talk to each other, exchange information, and vote on whether to perform an automatic failover and which slave to promote as the new master.

The protocol used among the sentinels is gossip, a decentralized protocol that reaches eventual consistency. It's quite interesting.

There's an introduction to the gossip protocol in Phoenix Architecture (凤凰架构), which I've written about before; see:

https://icyfenix.cn/distribution/consensus/gossip.html

In addition, when the master node goes down, the rules the sentinels use to pick the new master node — what the election process actually looks like — also come up occasionally in interviews.

I've been asked it before; luckily I had it memorized cold at the time.

Briefly, the rules. There's nothing to it beyond recitation:

  • Among the slaves of the downed master, any slave that is marked subjectively offline, is disconnected, or last replied to a PING command more than five seconds ago is disqualified from the election.
  • Among the slaves of the downed master, any slave that was disconnected from the master for more than ten times the duration specified by the down-after-milliseconds option is disqualified.
  • After those two rounds of elimination, among the remaining slaves, the one with the largest replication offset becomes the new master. If the replication offsets are unavailable or equal, the slave with the smallest run ID becomes the new master.

Now, the steps above are carried out by a single sentinel. But we generally run three or more sentinels, so which one gets to perform them?

That sentinel also has to be elected from among the sentinels; the winner is called the leader sentinel.

The leader sentinel election uses the Raft algorithm.
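On the application side, by the way, the client usually discovers the current master through the sentinels rather than a fixed address, so it follows failovers automatically. A minimal sketch with Jedis's JedisSentinelPool; the master name "mymaster" and the sentinel addresses are illustrative:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisSentinelPool;
import java.util.HashSet;
import java.util.Set;

public class SentinelClient {
    public static void main(String[] args) {
        Set<String> sentinels = new HashSet<>();
        sentinels.add("127.0.0.1:26379");
        sentinels.add("127.0.0.1:26380");
        sentinels.add("127.0.0.1:26381");

        // The pool asks the sentinels for the master's address and tracks failovers
        try (JedisSentinelPool pool = new JedisSentinelPool("mymaster", sentinels);
             Jedis jedis = pool.getResource()) {
            jedis.set("why", "hello");
            System.out.println(jedis.get("why"));
        }
    }
}
```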

As for actually setting up sentinel mode, that's generally the operations students' business.

Still, there are plenty of setup tutorials online, and it's well worth following one yourself.

Believe me, you will encounter various problems during the construction process, and these problems are your reward.

Back to the beginning

To wrap up, let's come back to the interview question from the beginning.

Honestly, when I saw this question, I immediately thought of Weibo, which is forever being crashed by breaking celebrity news.

It just so happened that this week I ate another wave of Wu Moufan melon (that is, followed the gossip). I was watching the women's volleyball live stream when the report popped up, and my jaw simply dropped.


On Saturday night I was basically bouncing up and down in the melon fields with that expression. The melon was delicious.

That said, after this Wu Moufan wave I can't tell whether his traffic has waned or whether Weibo's architecture has finally stood the test.

Weibo came through relatively smoothly; there was no large-scale, obviously visible outage.

As I recall, the last time Weibo was truly brought to its knees was Lu Han and Guan Xiaotong.

Not because I followed them, but because I followed the Weibo programmer who happened to be getting married that day.

That classmate, Ding Zhenkai, really has had it rough.

On his wedding day, Lu Han announced his relationship. On his overseas vacation, the "Double Song" celebrity couple made their official announcement. When his wife was about to give birth, Hua Chenyu admitted to having a daughter out of wedlock with Zhang Bichen.

This time, from what I saw, he was comparatively calm. Presumably holding the baby in one arm, scaling up capacity with the other, and eating melon on the side.

Back then, Weibo's official explanation was that a single post about Lu Han had received too many reposts and comments.

That's not the whole picture. A large number of reposts and comments alone won't bring down a platform the size of Weibo. And Lu Han's post that day was probably not even the most-reposted, most-commented one on his account.

The real reason is that the concurrency of the reposting and commenting was too high, too high, far too high — the kind of instantaneous traffic I will never get to touch in my whole career.

The melon-eating masses flocked in, the number of simultaneous online users spiked within minutes, and the servers were killed.

On this question, I saw a comment on Zhihu that I thought was pretty good; you can find it here:

https://www.zhihu.com/question/66346687

See? Isn't that scene exactly like the interviewer's question?

Even a player as strong as Weibo had to add 1,000 servers to cope with that traffic spike.

So, what should I do if the service is down?

Reboot.

What should I do if the restart still doesn't work?

Spend money and add machines.

If, in the Lu Han and Guan Xiaotong affair, the famous paparazzo Zhuo Wei had broken the news in advance and given everyone a heads-up,

perhaps Weibo could have withstood that surge of traffic.

And if, in the Wu Moufan case, the Beijing police had been able to coordinate with Weibo beforehand and tip off the relevant staff before publishing — even just 10 minutes early —

perhaps more people could have eaten their melon in peace.

One last word (a plug for a follow)

Okay, since you've read all the way to here, how about a follow? Publishing weekly is tiring, and I could use a little positive feedback.

Thank you for reading. I insist on original writing, and I warmly welcome and appreciate your follow.

