
Preface

Hello everyone! I am Wan Junfeng, the author of go-zero. Thank you, ArchSummit, for providing such a good opportunity to share the best practices of go-zero caching with you.

First, consider this question: when traffic surges, which part of the server is most likely to become the first bottleneck? What most people encounter is that the database is the first thing that cannot carry the load: queries slow down or even hang completely. At that point, no matter how strong the governance capabilities of the upper-level services are, they are of little help.

That is why we often say that to evaluate a system architecture, you can start by looking at its cache design. We ran into exactly this problem before. Before I joined, our service had no cache at all. Although traffic was not high at the time, everyone was extremely nervous during the daily traffic peak, and we had several outages a week: the database was simply overwhelmed, there was nothing else to be done, and we had to restart it. I was still a consultant at the time; after looking at the system design, I could only ask everyone to add caching first. But because the team did not understand caching well enough, and because of the chaos of the old system, every business developer wired in the cache in their own way. The result was that caches were in use, but the data was scattered everywhere and there was no way to guarantee its consistency. It was a genuinely painful experience, one that I believe many of you can relate to.

Then I rebuilt the entire system from scratch, and the architectural design of the caching layer played a very prominent role in it. That is where today's sharing comes from.

I will mainly cover the following topics with you:

  • Common problems with caching systems
  • Caching and automatic management for single-row queries
  • Caching mechanism for multi-row queries
  • Distributed cache system design
  • Automated cache code generation in practice

A caching system involves many problems and knowledge points, which I will discuss from the following aspects:

  • Stability
  • Correctness
  • Observability
  • Standardization and tool construction

Given length constraints, this article, the first in the series, mainly discusses cache system stability with you.

Cache system stability

When it comes to cache stability, basically every cache-related article and talk online covers three key points:

  • Cache penetration
  • Cache breakdown
  • Cache avalanche

Why do we talk about cache stability first? Recall when caching is typically introduced: generally when the DB is under pressure, or even frequently being overwhelmed. So the caching system is introduced first and foremost to solve stability problems.

Cache penetration

Cache penetration occurs when requests are made for data that does not exist. As the figure shows, request 1 for such data first checks the cache, but because the data does not exist, the cache cannot contain it, so the request falls through to the DB. Requests 2 and 3 for the same data also pass through the cache and land on the DB. When a large number of requests for nonexistent data arrive, the pressure on the DB becomes especially high, and it may even be brought down by malicious traffic (ill-intentioned users may discover that a piece of data does not exist and then deliberately flood the system with requests for it).

go-zero's solution: for a request for nonexistent data, we also store a placeholder in the cache for a short period (for example, one minute). This decouples the number of DB queries for the same nonexistent data from the actual number of requests. Of course, on the business side you can also delete the placeholder when the data is inserted, to ensure the newly added data is immediately queryable.
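To make this concrete, here is a minimal sketch of the placeholder technique. The `Cache` interface, `Take` function, and `queryDB` callback are hypothetical names for illustration, not go-zero's actual API:

```go
package cache

import (
	"errors"
	"time"
)

// notFoundPlaceholder marks "this key does not exist in the DB".
const notFoundPlaceholder = "*"

// placeholderExpiry keeps the placeholder short-lived (e.g. one minute),
// so newly inserted data becomes visible quickly even if the business
// side never deletes the placeholder explicitly.
const placeholderExpiry = time.Minute

// ErrNotFound is returned for rows that exist neither in cache nor DB.
var ErrNotFound = errors.New("record not found")

// Cache is a minimal cache abstraction assumed for this sketch; a real
// system would back it with Redis or similar. Del lets the business
// side remove the placeholder right after inserting new data.
type Cache interface {
	Get(key string) (val string, ok bool)
	SetWithExpiry(key, val string, expiry time.Duration)
	Del(key string)
}

// Take returns the cached value for key, loading it from the DB on a
// miss. Nonexistent rows are cached as a short-lived placeholder, so
// repeated requests for missing data stop hammering the DB.
func Take(c Cache, key string, queryDB func() (string, error)) (string, error) {
	if v, ok := c.Get(key); ok {
		if v == notFoundPlaceholder {
			return "", ErrNotFound // answered from cache, DB untouched
		}
		return v, nil
	}

	v, err := queryDB()
	if errors.Is(err, ErrNotFound) {
		// cache the miss instead of caching nothing
		c.SetWithExpiry(key, notFoundPlaceholder, placeholderExpiry)
		return "", ErrNotFound
	}
	if err != nil {
		return "", err
	}

	c.SetWithExpiry(key, v, time.Hour) // illustrative normal expiry
	return v, nil
}
```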

Cache breakdown

Cache breakdown is caused by the expiration of hot data: precisely because the data is hot, a large number of requests for it may arrive at the moment it expires. If none of those requests find the data in the cache and they all fall through to the DB at once, the DB comes under tremendous pressure instantly and may even hang.

go-zero's solution: for the same piece of data, we use core/syncx/SharedCalls to ensure that only one request reaches the DB at a time; other requests for the same data wait for the first request to return and share its result or error. Depending on the concurrency level, we can choose between an in-process lock (when concurrency is not very high) and a distributed lock (when concurrency is very high). Unless it is truly necessary, we generally recommend the in-process lock; after all, introducing a distributed lock adds complexity and cost. Keep Occam's razor in mind: entities should not be multiplied beyond necessity. A minimal sketch of this singleflight pattern follows the walkthrough below.

Let's walk through the cache breakdown protection flow in the figure above, where different colors indicate different requests:

  • The green request arrives first; finding no data in the cache, it goes to the DB to query
  • The pink request for the same data arrives, finds a request already in flight, and waits for the green request to return (singleflight mode)
  • The green request returns, and the pink request returns with the result shared by the green request
  • Subsequent requests, such as the blue one, fetch the data directly from the cache
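Here is a minimal, self-contained sketch of the singleflight idea that SharedCalls implements; it illustrates the mechanism and is not go-zero's actual implementation:

```go
package cache

import "sync"

// call tracks one in-flight DB load that concurrent callers share.
type call struct {
	wg  sync.WaitGroup
	val interface{}
	err error
}

// Group deduplicates concurrent loads by key, the same idea behind
// go-zero's core/syncx SharedCalls.
type Group struct {
	mu    sync.Mutex
	calls map[string]*call
}

func NewGroup() *Group {
	return &Group{calls: make(map[string]*call)}
}

// Do executes fn at most once per key at a time; callers that arrive
// while a load is in flight wait and share its result or error.
func (g *Group) Do(key string, fn func() (interface{}, error)) (interface{}, error) {
	g.mu.Lock()
	if c, ok := g.calls[key]; ok {
		g.mu.Unlock()
		c.wg.Wait() // another goroutine is already loading; wait for it
		return c.val, c.err
	}

	c := new(call)
	c.wg.Add(1)
	g.calls[key] = c
	g.mu.Unlock()

	c.val, c.err = fn()
	c.wg.Done()

	g.mu.Lock()
	delete(g.calls, key) // allow a fresh load next time
	g.mu.Unlock()

	return c.val, c.err
}
```

With this in place, a thousand concurrent requests for the same hot key translate into a single DB query, while the other 999 callers simply wait and share the result.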

Cache avalanche

A cache avalanche occurs when a large number of cache entries loaded at the same time share the same expiration time. When that time arrives, a huge number of entries expire within a short window, causing many requests to fall on the DB simultaneously; DB pressure then surges and the DB may even hang.

For example, consider online teaching during the epidemic: senior high, junior high, and elementary schools all start classes within a few fixed time slots. A large amount of data is loaded at the same time and given the same expiration time, producing one DB request peak after another; each pressure wave is passed on to the next cycle, and the waves may even overlap.

go-zero's solution:

  • Use a distributed cache to prevent a cache avalanche caused by a single point of failure
  • Add a 5% standard deviation to the expiration time; 5% is the empirical p-value from hypothesis testing (interested readers can look it up)

Let's run an experiment: with 10,000 entries, an expiration time of 1 hour, and a standard deviation of 5%, the expiration times end up fairly evenly distributed between roughly 3,400 and 3,800 seconds. If our default expiration time is 7 days, they will be spread over a 16-hour window centered on 7 days. This does a good job of preventing cache avalanches.
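The following sketch reproduces that experiment. The normally distributed jitter is an assumption made for illustration; the exact distribution matters less than breaking up the shared expiry instant:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredExpiry spreads expiry around base with the given relative
// standard deviation (e.g. 0.05 for 5%), so keys loaded together do
// not all expire at the same instant.
func jitteredExpiry(base time.Duration, stddev float64) time.Duration {
	d := time.Duration(float64(base) * (1 + stddev*rand.NormFloat64()))
	if d <= 0 {
		return base
	}
	return d
}

func main() {
	base := time.Hour
	min, max := base, base
	for i := 0; i < 10000; i++ {
		e := jitteredExpiry(base, 0.05)
		if e < min {
			min = e
		}
		if e > max {
			max = e
		}
	}
	fmt.Printf("10,000 expiries spread roughly from %v to %v\n", min, max)
}
```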

To be continued

This article discussed the common stability problems of caching systems with you. In the next article, I will analyze the data consistency problems of caches with you.

The solutions to all of these problems are built into the go-zero microservice framework. If you want to understand the go-zero project better, please visit the official website, where you will find concrete examples.

Video playback address

ArchSummit Architects Summit: Cache Architecture Design under Massive Concurrency

Project address

https://github.com/tal-tech/go-zero

You are welcome to use go-zero and give it a star to support us!

WeChat Exchange Group

Follow the "Microservice Practice" official account and click "enter the group" to obtain the QR code of the community group.

For the go-zero series of articles, please follow the "Microservice Practice" official account.
