Benefits and costs of caching
Benefits
- Accelerated reads and writes: Because the cache usually lives entirely in memory (for example Redis or Memcache) while the storage layer usually has weaker read/write performance (for example MySQL), using a cache can effectively speed up reads and writes and improve the user experience.
- Reduced back-end load: The cache absorbs accesses and complex calculations (such as very complex SQL statements) that would otherwise reach the back end, which greatly reduces the back-end load.
Costs
- Data inconsistency: There is a time window during which the cache layer and the storage layer are inconsistent, and the size of that window depends on the update strategy.
- Code maintenance cost: After adding a cache, the logic of both the cache layer and the storage layer must be handled, which increases the cost of maintaining the code.
- Operation and maintenance cost: Taking Redis Cluster as an example, introducing it quietly adds operational cost.
Cache usage scenarios basically fall into the following two categories:
- Complex calculations with high overhead: Taking MySQL as an example, some complex operations or calculations (such as large numbers of table joins or some grouping calculations) cannot satisfy high concurrency without a cache, and also place a huge burden on MySQL.
- Accelerated request response: Even if a single back-end query is fast enough, a cache can still be used. Taking Redis as an example, it can complete tens of thousands of reads and writes per second, and the batch operations it provides can optimize the response time of the entire IO chain.
Cache update strategy
Eviction by LRU/LFU/FIFO algorithms
Usage scenario: Usually used to evict existing data when cache usage exceeds the preset maximum. For example, Redis uses the maxmemory-policy configuration as the eviction strategy once maxmemory is reached.
Consistency: Which data gets evicted is determined by the specific algorithm; the developer can only decide which algorithm to use, so data consistency is the worst of the three strategies.
Maintenance cost: The algorithm does not need to be implemented by the developer; usually only maxmemory and the corresponding policy need to be configured. Developers only need to understand what each algorithm does and choose the one that suits them.
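A minimal sketch of this configuration with redis-py; the host, port, and 1 GB limit are assumptions for illustration, and in production these values would normally live in redis.conf.

```python
import redis

# Connect to a local Redis instance (host/port are assumptions for this sketch).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Cap memory at 1 GB and evict least-recently-used keys once the cap is reached.
# The same settings can be written in redis.conf:
#   maxmemory 1gb
#   maxmemory-policy allkeys-lru
r.config_set("maxmemory", "1gb")
r.config_set("maxmemory-policy", "allkeys-lru")

print(r.config_get("maxmemory-policy"))
```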
Timeout eviction
Usage scenario: Timeout eviction sets an expiration time on cached data so that it is deleted automatically once the time passes, for example with the expire command provided by Redis. If the business can tolerate the cache layer and the storage layer being inconsistent for a period of time, an expiration time can be set on the data. After the data expires, it is fetched from the real data source, written back to the cache, and given a new expiration time.
Consistency: There is a consistency problem within a window of time, that is, the cached data may be inconsistent with the data in the real data source.
Maintenance cost: The maintenance cost is not high; you only need to set an expiration time with expire. The premise, of course, is that the application can tolerate the inconsistency that may occur during that window.
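A minimal read-through sketch of this strategy with redis-py; the key name, TTL value, and the load_user_from_mysql loader are assumptions for illustration.

```python
import json
import redis

r = redis.Redis(decode_responses=True)
TTL_SECONDS = 300  # tolerated inconsistency window (assumption)

def load_user_from_mysql(user_id):
    # Placeholder for the real data-source query (assumption).
    ...

def get_user(user_id):
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)                    # cache hit
    user = load_user_from_mysql(user_id)             # miss: go to the real data source
    r.setex(key, TTL_SECONDS, json.dumps(user))      # write back with an expiration time
    return user
```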
Active update
Usage scenario: The application has high data-consistency requirements and needs the cached data to be updated immediately after the real data is updated, for example by using a message system or some other mechanism to notify the cache to update.
Consistency: Consistency is the highest of the three, but if the active update fails, the data may not be refreshed for a long time, so it is recommended to combine it with timeout eviction for better results.
Maintenance cost: The maintenance cost is relatively high; developers need to implement the update themselves and ensure that the update operation is correct.
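A hedged sketch of an active update combined with a safety expiration; the key name, TTL, and update_user_in_mysql helper are assumptions for illustration.

```python
import json
import redis

r = redis.Redis(decode_responses=True)

def update_user_in_mysql(user_id, fields):
    # Placeholder for the real storage-layer update (assumption).
    ...

def update_user(user_id, fields):
    key = f"user:{user_id}"
    update_user_in_mysql(user_id, fields)     # 1. update the real data source first
    # 2. actively refresh the cache right away; the expiration time is kept as a
    #    safety net so stale data is still removed even if a later update is missed.
    r.setex(key, 3600, json.dumps({"id": user_id, **fields}))
```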
Best Practice
For services with low consistency requirements, it is recommended to configure a maximum memory and an eviction policy. Services with high consistency requirements can combine timeout eviction with active update, so that even if an active update fails, the dirty data is still guaranteed to be deleted once it expires.
Cache granularity control
Suppose we need to use Redis to cache MySQL user information, and the user table has 100 columns. At what granularity should it be cached: all attributes, or only a few important ones? This is the cache granularity problem. The following explains it from three perspectives: versatility, space usage, and code maintenance.
Versatility: Caching the full data is more versatile than caching part of it, but practical experience shows that for long stretches the application only needs a few important attributes.
Space usage: Caching the full data occupies more space than caching part of it, which may cause the following problems:
- Caching the full data wastes memory.
- Transmitting the full data each time generates relatively large, time-consuming network traffic, and in extreme cases can saturate the network.
- Serializing and deserializing the full data has a higher CPU overhead.
Code maintenance: Here caching the full data has the clearer advantage; with partial data, the business code must be modified as soon as a new field is needed, and the cached data usually has to be refreshed after the change.
Cache granularity is an easily overlooked problem. Handled improperly, it can waste a great deal of memory, waste network bandwidth, and make the code less versatile, so a trade-off must be made among the three factors of data versatility, space usage, and code maintainability.
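As a hedged illustration of partial-attribute caching, the selected columns can be stored in a Redis hash instead of serializing the whole row; the field names, key pattern, and TTL are assumptions.

```python
import redis

r = redis.Redis(decode_responses=True)

def cache_user_summary(user_row: dict, user_id: int, ttl: int = 600):
    """Cache only the attributes the application actually reads (assumed fields)."""
    important = {k: user_row[k] for k in ("name", "email", "level") if k in user_row}
    key = f"user:summary:{user_id}"
    if important:
        r.hset(key, mapping=important)  # store a hash rather than the serialized full row
        r.expire(key, ttl)

def get_user_summary(user_id: int):
    return r.hgetall(f"user:summary:{user_id}") or None
```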
Cache penetration
Cache penetration refers to querying data that does not exist at all, so neither the cache layer nor the storage layer gets a hit. Usually, for fault-tolerance reasons, data that cannot be found in the storage layer is not written to the cache layer. Cache penetration therefore causes every request for the non-existent data to be sent to the storage layer, which defeats the purpose of the cache as protection for the back-end storage.
Cache penetration can increase the load on the back-end storage; since many back-end stores are not built for high concurrency, it may even bring the back-end storage down. Generally, the total number of calls, cache-layer hits, and storage-layer hits can be counted separately in the program; if a large number of storage-layer misses on empty data is observed, it may indicate a cache penetration problem.
There are two basic causes of cache penetration: first, a problem in the business code or data itself; second, malicious attacks, crawlers, and the like producing a large number of queries for empty data. Let's look at how to solve the cache penetration problem.
Cache empty objects
When the storage layer misses, an empty object is still stored in the cache layer, so later accesses to this data are served from the cache, which protects the back-end data source; a minimal sketch of this pattern follows the list of problems below.
There are two problems with caching empty objects:
- Empty values take up cache space, which means more keys are stored in the cache layer and more memory is required (and the problem is more serious if it is an attack). A more effective approach is to set a shorter expiration time for this kind of data so that it is removed automatically.
- The cache layer and the storage layer will be inconsistent for a period of time, which may affect the business. For example, if the expiration time is set to 5 minutes and the storage layer adds this data in the meantime, the cache layer and the storage layer will be inconsistent for that period. In that case, a message system or other mechanism can be used to clear the empty object from the cache layer.
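A hedged sketch of caching empty objects with redis-py; the sentinel value, TTLs, and load_from_storage loader are assumptions.

```python
import redis

r = redis.Redis(decode_responses=True)
EMPTY_SENTINEL = ""   # marker meaning "the storage layer has no such data" (assumption)
EMPTY_TTL = 60        # short expiration so junk keys are removed automatically

def load_from_storage(key):
    # Placeholder for the real storage-layer query (assumption).
    ...

def get_with_null_caching(key):
    cached = r.get(key)
    if cached is not None:
        return None if cached == EMPTY_SENTINEL else cached
    value = load_from_storage(key)
    if value is None:
        r.setex(key, EMPTY_TTL, EMPTY_SENTINEL)  # cache the empty object briefly
        return None
    r.setex(key, 600, value)
    return value
```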
Bloom filter interception
Before the cache layer and the storage layer are accessed, the keys that actually exist are saved in a Bloom filter in advance, which acts as a first layer of interception. For example, a recommendation system has 400 million user ids; every hour the algorithm engineers compute recommendation data from each user's historical behavior and put it into the storage layer. Since brand-new users have no historical behavior, cache penetration occurs for them. To handle this, all users who have recommendation data can be put into a Bloom filter; if the Bloom filter says a user id does not exist, the storage layer is not accessed, which protects it to a certain extent.
This approach suits scenarios where the hit rate is low, the data set is relatively fixed, and real-time requirements are low (usually a large data set). Code maintenance is more complicated, but the cache space used is small.
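A hedged sketch of Bloom-filter interception backed by a Redis bitmap; the filter size, number of hash functions, and key names are assumptions, and a production system would size them from the expected element count and target false-positive rate.

```python
import hashlib
import redis

r = redis.Redis(decode_responses=True)
FILTER_KEY = "bloom:users"     # bitmap holding the filter (assumed key name)
FILTER_BITS = 2 ** 27          # ~16 MB of bits; sized for illustration only
NUM_HASHES = 3

def _offsets(member: str):
    # Derive NUM_HASHES bit positions from salted SHA-1 digests.
    for i in range(NUM_HASHES):
        digest = hashlib.sha1(f"{i}:{member}".encode()).hexdigest()
        yield int(digest, 16) % FILTER_BITS

def bloom_add(member: str):
    for offset in _offsets(member):
        r.setbit(FILTER_KEY, offset, 1)

def bloom_might_contain(member: str) -> bool:
    return all(r.getbit(FILTER_KEY, offset) for offset in _offsets(member))

def get_recommendations(user_id: str):
    # Interception: only fall through to cache/storage when the id may exist.
    if not bloom_might_contain(user_id):
        return None            # definitely absent: skip both cache and storage
    # ... normal cache-then-storage lookup goes here ...
```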
Comparing the two solutions: caching empty objects is simple to implement but consumes extra cache space and allows a short window of inconsistency, while Bloom filter interception suits large, relatively fixed data sets and uses little cache space at the cost of more complex code maintenance.
Cache bottomless pit
By 2010, Facebook's Memcache deployment had reached 3,000 nodes, holding terabytes of cached data. Developers and operations staff then noticed a problem: to meet business requirements they kept adding Memcache nodes, yet performance did not improve and actually declined. At the time this phenomenon was called the caching "bottomless pit".
Why does this happen? Intuitively, adding nodes should make the Memcache cluster faster, but that is not what happens. Key-value databases usually use a hash function to map keys onto nodes, so the distribution of keys has nothing to do with the business. As data volume and traffic keep growing, large numbers of nodes must be added for horizontal scaling, which spreads the keys across more nodes. As a result, whether with Memcache or distributed Redis, a batch operation (such as mget) usually has to fetch from multiple nodes: compared with a single-machine batch operation that involves only one network round trip, a distributed batch operation involves multiple network round trips.
Analysis of the bottomless pit:
- The client makes multiple network round trips, which means the latency of a batch operation grows as the number of nodes increases.
- The number of network connections grows, which also has some impact on node performance.
To put it plainly, more machines do not necessarily mean higher performance: the so-called "bottomless pit" means that more investment does not necessarily yield more output. Yet distribution is unavoidable, because traffic and data volume keep growing and a single node simply cannot handle them, so how to perform batch operations efficiently in a distributed cache is a hard problem.
The following describes how to optimize batch operations under distributed conditions. Before introducing specific methods, let's take a look at common stand-alone IO optimization ideas:
- Optimization of the command itself, such as optimizing SQL statements, etc.
- Reduce the number of network communications.
- Reduce access cost, for example by using long-lived connections / connection pools, NIO, etc.
Here we assume that commands and client connections are already optimal, and focus on reducing the number of network operations. Taking Redis to obtain n strings in batches as an example, we will combine some features of Redis Cluster to explain the four distributed batch operation methods.
Serial commands
Since the n keys are generally spread across the nodes of a Redis Cluster, they cannot be fetched in a single mget. The simplest way to obtain the values of n keys is to execute n get commands one after another. The time cost of this approach is relatively high: operation time = n network round trips + n command executions, and the number of network round trips is n. Obviously this scheme is not optimal, but it is quite simple to implement.
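A minimal sketch of this serial-command baseline, assuming redis-py 4.x or later where the cluster client is built in; the startup node address and key names are placeholders.

```python
from redis.cluster import RedisCluster

rc = RedisCluster(host="localhost", port=7000, decode_responses=True)

def serial_get(keys):
    # One GET per key: n commands and n network round trips in total.
    return [rc.get(key) for key in keys]

values = serial_get([f"user:{i}" for i in range(100)])
```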
Serial IO
Redis Cluster computes a CRC16 hash of the key and takes it modulo 16384 to obtain the slot, and a smart client keeps the mapping between slots and nodes. With these two pieces of information, the keys belonging to the same node can be grouped together, giving a key sublist per node, and then an mget or pipeline operation is executed against each node. The operation time = number-of-nodes network round trips + n command executions, so the number of network round trips equals the number of nodes. This scheme is clearly much better than the first, but if there are many nodes there are still performance problems.
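A hedged sketch of the grouping step; slot_to_node and node_clients are assumed to come from a smart client that tracks the cluster topology, and binascii.crc_hqx is used here because it implements the CRC16 (XMODEM) variant that Redis Cluster uses.

```python
import binascii
from collections import defaultdict

def key_slot(key: str) -> int:
    # CRC16 (XMODEM) modulo 16384; hash tags ({...}) are ignored to keep the sketch short.
    return binascii.crc_hqx(key.encode(), 0) % 16384

def serial_io_mget(keys, slot_to_node, node_clients):
    """Group keys by owning node, then run one pipeline per node.

    slot_to_node (slot -> node name) and node_clients (node name -> redis.Redis
    connection) are assumed to be maintained by a smart client.
    """
    groups = defaultdict(list)
    for key in keys:
        groups[slot_to_node[key_slot(key)]].append(key)    # key sublist per node

    results = {}
    for node, node_keys in groups.items():                  # one round trip per node
        pipe = node_clients[node].pipeline(transaction=False)
        for key in node_keys:
            pipe.get(key)
        for key, value in zip(node_keys, pipe.execute()):
            results[key] = value
    return results
```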
Parallel IO
This solution changes the last step of solution 2 to multi-threaded execution: the number of network round trips is still the number of nodes, but because they run in parallel threads, the network time becomes O(1). The solution increases programming complexity. Its operation time = max_slow(network time of the slowest node) + n command executions.
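The same grouping as above run in parallel, as a hedged sketch; node_clients, slot_to_node, and the key_slot helper are the illustrative names from the serial IO sketch.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def parallel_io_mget(keys, slot_to_node, node_clients):
    # key_slot() is the helper defined in the serial IO sketch above.
    groups = defaultdict(list)
    for key in keys:
        groups[slot_to_node[key_slot(key)]].append(key)

    def fetch(node, node_keys):
        pipe = node_clients[node].pipeline(transaction=False)
        for key in node_keys:
            pipe.get(key)
        return dict(zip(node_keys, pipe.execute()))

    results = {}
    # One worker per node: total network time is roughly that of the slowest node.
    with ThreadPoolExecutor(max_workers=len(groups)) as pool:
        for partial in pool.map(lambda item: fetch(*item), groups.items()):
            results.update(partial)
    return results
```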
hash_tag implementation
Redis Cluster's hash_tag feature can force multiple keys to be assigned to the same node, so the operation time = 1 network round trip + n command executions.
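A hedged illustration of hash tags: the key names are invented for the example, and it assumes a cluster-aware client (such as redis-py's RedisCluster) that allows multi-key commands when all keys hash to the same slot.

```python
from redis.cluster import RedisCluster

rc = RedisCluster(host="localhost", port=7000, decode_responses=True)

user_id = 17
# Only the substring inside {...} is hashed, so all of these keys land in the same
# slot (and therefore on the same node), which makes a single mget possible.
keys = [f"user:{{{user_id}}}:name", f"user:{{{user_id}}}:age", f"user:{{{user_id}}}:level"]

values = rc.mget(keys)   # one network round trip, n commands
```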
Solution comparison

| Solution | Network round trips | Operation time | Notes |
| --- | --- | --- | --- |
| Serial commands | n | n network round trips + n command executions | Simplest to implement, but not optimal |
| Serial IO | Number of nodes | node-count network round trips + n command executions | Much better; still degrades when there are many nodes |
| Parallel IO | Number of nodes (in parallel) | max_slow(node network time) + n command executions | Network time ~O(1), but more complex to program |
| hash_tag | 1 | 1 network round trip + n command executions | Requires keys to share a hash tag so they map to one node |
Cache avalanche
Because the cache layer carries a large share of requests, it effectively protects the storage layer. But if the cache layer cannot provide service for some reason, every request reaches the storage layer, the number of calls to the storage layer surges, and the storage layer may go down as well in a cascading failure.
To prevent and solve the cache avalanche problem, we can proceed from the following three aspects.
Ensure high availability of cache layer services
If the cache layer is designed for high availability, service can continue even when individual nodes, individual machines, or even entire data centers go down. Redis Sentinel and Redis Cluster, for example, provide such high availability.
Rely on isolation components to rate-limit and degrade requests to the back end
Both the cache layer and the storage layer can fail, and both can be regarded as resources. In a highly concurrent system, if one resource becomes unavailable, all threads may end up blocked on that resource, making the whole system unavailable.
Degradation is very common in high-concurrency systems: in a recommendation service, for example, if personalized recommendations are unavailable, the service can be degraded to fill in hotspot data instead. In real projects, important resources (such as Redis, MySQL, HBase, and external interfaces) should be isolated so that each resource runs in its own thread pool; then even if an individual resource has problems, other services are unaffected. But managing the thread pools, such as closing and opening resource pools and managing pool thresholds, is still fairly complicated.
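A hedged sketch of per-resource isolation with degradation; the pool size, timeout, and the get_personalized/get_hotspot functions are assumptions, and real projects often use a dedicated isolation framework (Hystrix-style components) instead.

```python
from concurrent.futures import ThreadPoolExecutor

# The recommendation resource gets its own pool, so if it hangs it cannot exhaust
# the threads that other resources depend on.
reco_pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="reco")

def get_personalized(user_id):
    # Placeholder for the call that depends on the isolated resource (assumption).
    ...

def get_hotspot():
    # Placeholder for precomputed hotspot data used as the degraded result (assumption).
    ...

def recommendations(user_id):
    future = reco_pool.submit(get_personalized, user_id)
    try:
        return future.result(timeout=0.2)    # bounded wait on the isolated resource
    except Exception:                         # timeout or failure of the resource
        return get_hotspot()                  # degrade instead of blocking the caller
```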
Advance drill
Before the project goes live, drill what happens once the cache layer goes down: the load on the application and the back end and the problems that may appear, and prepare contingency plans on that basis.
Hotspot key reconstruction
Developers use the "cache + expiration time" strategy both to speed up reads and writes and to ensure that data is refreshed periodically, and this model satisfies most needs. However, if two problems occur at the same time, they can do fatal harm to the application:
- The current key is a hot key, and the amount of concurrency is very large.
- Rebuilding the cache cannot be completed in a short time. It may be a complex calculation, such as complex SQL, multiple IOs, multiple dependencies, etc.
At the moment the cache expires, a large number of threads rush to rebuild it, increasing the back-end load and possibly even crashing the application. Solving this is not particularly complicated, but the fix must not create new trouble for the system, so the following goals should be set:
- Reduce the number of times to rebuild the cache.
- The data is as consistent as possible.
- Less potential danger.
Mutex lock
This method allows only one thread to rebuild the cache; the other threads wait for the rebuilding thread to finish and then read the data from the cache again. For example, Redis's setnx command can be used to implement a simple distributed mutex.
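A hedged sketch of the mutex approach with redis-py; the key names, lock TTL, retry sleep, and rebuild_from_storage loader are assumptions.

```python
import time
import redis

r = redis.Redis(decode_responses=True)

def rebuild_from_storage(key):
    # Placeholder for the expensive recomputation (complex SQL, multiple IOs, ...).
    ...

def get_with_mutex(key, ttl=300):
    value = r.get(key)
    if value is not None:
        return value
    lock_key = f"mutex:{key}"
    # Only the thread that wins the lock rebuilds; the lock expires after 3 s
    # so a crashed rebuilder cannot block everyone forever.
    if r.set(lock_key, "1", nx=True, ex=3):
        try:
            value = rebuild_from_storage(key)
            r.setex(key, ttl, value)
        finally:
            r.delete(lock_key)
        return value
    time.sleep(0.05)            # lost the race: wait briefly, then read the cache again
    return get_with_mutex(key, ttl)
```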
Never expire
"Never expires" has two meanings:
- From the cache's perspective, no expiration time is actually set, so the problems caused by a hot key expiring never arise; that is, the key does not expire "physically".
- From a functional point of view, a logical expiration time is set for each value. When a read finds that the logical expiration time has passed, a separate thread is used to rebuild the cache.
This method effectively eliminates the hot-key problem; its only disadvantage is that data inconsistency occurs while the cache is being rebuilt, so it depends on whether the application can tolerate such inconsistency.
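A hedged sketch of logical expiration with redis-py, storing the value together with a logical deadline as JSON; the field names, TTL, and rebuild trigger are assumptions.

```python
import json
import threading
import time
import redis

r = redis.Redis(decode_responses=True)

def rebuild_from_storage(key):
    # Placeholder for the expensive recomputation (assumption).
    ...

def set_logical(key, value, logical_ttl):
    # No physical TTL: the key never expires in Redis itself.
    r.set(key, json.dumps({"value": value, "expire_at": time.time() + logical_ttl}))

def get_logical(key, logical_ttl=300):
    raw = r.get(key)
    if raw is None:
        return None
    entry = json.loads(raw)
    if time.time() > entry["expire_at"]:
        # Logically expired: rebuild in the background and return the stale value now.
        # A real implementation would also take a mutex so only one rebuild runs.
        def _rebuild():
            set_logical(key, rebuild_from_storage(key), logical_ttl)
        threading.Thread(target=_rebuild, daemon=True).start()
    return entry["value"]
```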
Summary
For an application with a large amount of concurrency, there are three goals when using a cache: first, speed up user access and improve the user experience; second, reduce the back-end load and potential risks and keep the system stable; third, ensure that data is updated in time "as much as possible". The table below compares the two hotspot-key solutions described above.

| Solution | Idea | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Mutex | Only one thread rebuilds the cache; the others wait and then re-read the cache | Fewer rebuilds; data stays consistent | Waiting threads are blocked while the rebuild runs |
| Never expire | No physical expiration; a logical expiration time triggers an asynchronous rebuild | Essentially eliminates the hot-key rebuild problem | Data is inconsistent while the cache is being rebuilt |