
1. An introduction to caching

1.1 What is a cache

A cache is a buffer for data exchange; in essence it is an in-memory hash table. Caching is a design that trades space for time, and its goal is to be faster and closer:

  • Read and write data on faster storage (devices);
  • Cache data as close to the application as possible;
  • Cache data as close to the user as possible.

A cache is a hardware or software component used to store data so that subsequent requests for that data can be served faster. The cached data may be pre-computed results, copies of data stored elsewhere, and so on. Typical examples are the CPU cache and the disk cache. The caches discussed in this article are mainly the cache components used in Internet applications.

The cache hit rate is an important measure of the cache, and the higher the hit rate, the better.

Cache hit ratio = number of reads from cache / total number of reads

1.2 When do you need caching

Introducing a cache increases system complexity, so before doing so you need to weigh whether it is worth it. The main considerations are:

  • CPU overhead - if the application performs computations that consume a lot of CPU, consider caching the results. Typical scenarios: complex, frequently invoked calculations (e.g. regular-expression matching); intermediate states of distributed computation; and so on.
  • I/O overhead - if the database connection pool is under pressure, consider caching its query results.

Introducing caching in the data layer has the following benefits:

  • Improve data read speed.
  • Improve the system's ability to scale: carrying capacity can be raised simply by expanding the cache.
  • Reduce storage costs: a Cache + DB setup can carry request volumes that would otherwise require multiple database instances, saving machine costs.

1.3 Basic principles of caching

According to the business scenario, the cache is usually used in the following ways:

  • Lazy loading (triggered on read): on a cache miss, query the DB first and then write the returned data into the cache (a sketch follows this list).
  • Eager loading (triggered on write): after writing to the DB, the related data is also written to the cache.
  • Periodic refresh: suitable for data produced by periodic tasks, or for list-type data that does not require absolute real-time freshness.
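
As a concrete illustration of the lazy-loading (cache-aside) pattern above, here is a minimal Java sketch. UserDao, User and the ConcurrentHashMap standing in for a real cache are hypothetical placeholders, not code from the original article.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal cache-aside (lazy loading) sketch with placeholder types.
class UserService {
    private final Map<Long, User> cache = new ConcurrentHashMap<>();
    private final UserDao dao;

    UserService(UserDao dao) {
        this.dao = dao;
    }

    User getUser(long id) {
        User user = cache.get(id);       // 1. read the cache first
        if (user == null) {
            user = dao.findById(id);     // 2. cache miss: query the database
            if (user != null) {
                cache.put(id, user);     // 3. backfill the cache for later reads
            }
        }
        return user;
    }
}

class User { long id; String name; }
interface UserDao { User findById(long id); }
```

The eager-loading and periodic-refresh variants differ only in when step 3 runs: on every database write, or on a schedule.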

1.4 Cache eviction strategy

Cache eviction can be bounded in several ways:

1) Space-based: set a limit on the size of the cache's memory.

2) Capacity-based: set a limit on the number of entries stored in the cache.

3) Time-based

  • TTL (Time To Live): how long a cached entry lives, from creation to expiration.
  • TTI (Time To Idle): how long a cached entry may go unaccessed before it expires.

Cache eviction algorithms:

1) FIFO: first in, first out . The entry that entered the cache earliest is evicted first. This is the simplest policy, but it leads to a very low hit ratio: if a frequently accessed entry happens to be the first one cached, it will still be the first one evicted even though it is hot data.

2) LRU: least recently used . This algorithm avoids the problem above: every time an entry is accessed it is moved to the tail of the queue, and when eviction is needed the head of the queue is removed. It still has a weakness: if an entry is accessed 10,000 times in the first 59 minutes of an hour (clearly hot data) but not at all in the following minute while other entries are accessed, that hot entry can end up being evicted.

3) LFU: least frequently used . This algorithm improves on LRU by using extra space to record how often each entry is used and evicting the entry with the lowest frequency. This avoids LRU's problem with time windows.

These three eviction algorithms increase in implementation complexity in the order listed, and their hit ratios improve in the same order. In practice a middle ground is usually chosen: an implementation cost that is not too high together with an acceptable hit ratio.

2. Classification of caches

From the perspective of deployment, caching can be divided into client-side caching and server-side caching.

Client-side caches

  • HTTP cache
  • Browser cache
  • App cache (Android, iOS)

Server-side caches

  • CDN cache: Stores static resources such as HTML, CSS, and JS.
  • Reverse proxy cache: separates dynamic and static content; only the static resources requested by users are cached.
  • Database cache: the database (such as MySQL) generally has its own cache, but relying on it is not recommended because of its hit ratio and update frequency.
  • In-process cache: Cache common data such as application dictionaries.
  • Distributed cache: Cache hot data in the database.

Among them, CDN cache, reverse proxy cache, and database cache are generally maintained by full-time personnel (operation and maintenance, DBA). Back-end development generally focuses on in-process caching and distributed caching.

2.1 HTTP caching

2.2 CDN cache

CDN caches data to the server closest to the user's physical distance, so that the user can obtain the requested content nearby. CDNs generally cache static resource files (pages, scripts, images, videos, documents, etc.).

The network in China is extremely complex, and access across carriers can be very slow. To handle cross-carrier access and users in different regions, CDN nodes can be deployed in major cities so that users fetch the content they want from nearby servers, which reduces network congestion and improves both response speed and hit ratio.

(Image source: Why use a CDN)

2.2.1 Principle of CDN

The basic principle of a CDN is to deploy cache servers widely, placing them in regions or networks where user access is concentrated, and to use global load balancing to direct a user's request to the nearest healthy cache server when the user visits the site, so that the cache server answers the request directly.

1) Network path before a CDN is deployed:

  • Request: local network (LAN) => carrier network => application server room
  • Response: application server room => carrier network => local network (LAN)

Without considering complex network conditions, one user access takes 3 nodes and 6 steps from request to response.

2) Network path after a CDN is deployed:

  • Request: local network (LAN) => carrier network
  • Response: carrier network => local network (LAN)

Without considering complex network conditions, one user access takes 2 nodes and 2 steps from request to response. Compared with the non-CDN path, 1 node and 4 steps are saved, which greatly improves response speed.

2.2.2 CDN features

Advantages

  • Local cache acceleration : improves access speed, especially for sites with many images and static pages;
  • Cross-carrier acceleration : removes the bottleneck of interconnection between carriers, so users on different networks all get good access quality;
  • Remote acceleration : DNS load balancing automatically selects the fastest cache server for remote users, speeding up remote access;
  • Bandwidth optimization : remote mirror cache servers are generated automatically, so remote users read from the cache, which reduces origin bandwidth, spreads network traffic, and lowers the load on the origin web server;
  • Cluster attack resistance : widely distributed CDN nodes plus intelligent redundancy between nodes can effectively prevent intrusion and reduce the impact of DDoS attacks on the website while maintaining good quality of service.

Disadvantages

  • Not suitable for caching dynamic resources.
    Solution: cache mainly static resources, and use multi-level caching or near-real-time synchronization for dynamic resources.
  • Data consistency problems.
    Solutions (essentially a trade-off between performance and data consistency):
    1. Set a cache expiration time (e.g. 1 hour; data is re-synchronized after it expires).
    2. Attach a version number to each resource.

2.3 Reverse proxy cache

A reverse proxy is a proxy server that accepts connection requests from the Internet, forwards them to servers on the internal network, and returns the servers' results to the client that made the request. To the outside world, the proxy server itself appears to be the server.

2.3.1 The principle of reverse proxy caching

The reverse proxy is located on the same network as the application server and handles all requests to the web server. The principle of reverse proxy cache:

  • If the page requested by the user is cached on the proxy server, the proxy server directly sends the cached content to the user.
  • If there is no cache, send a request to the WEB server first, retrieve the data, cache it locally, and then send it to the user.

By reducing the number of requests that reach the web server, this approach lowers the web server's load.

The reverse proxy cache generally targets static resources and forwards dynamic requests to the application server. Commonly used caching servers are Varnish, Nginx, and Squid.

2.3.2 Comparison of reverse proxy caches

Commonly used proxy caches include Varnish, Squid, and Nginx. A simple comparison:

  • Varnish and Squid are dedicated cache services, while Nginx needs third-party module support;
  • Varnish caches in memory, which avoids frequently swapping files between memory and disk, so its performance is higher than Squid's;
  • Because Varnish is a memory cache, it is well suited to small files such as CSS, JS and small images; Squid or ATS can be used behind it as a persistent back-end cache;
  • Squid is full-featured and heavyweight, suitable for caching all kinds of static files; an HAProxy or Nginx instance is usually placed in front of it for load balancing across multiple instances;
  • Nginx uses the third-party module ncache for caching, with performance roughly on par with Varnish; it is mostly used as a reverse proxy and can provide simple caching.

3. In-process cache

An in-process cache is a cache inside the application itself. A standard distributed system generally has multiple levels of cache; the local cache is the one closest to the application and usually stores its data either in memory or on disk.

  • Hard disk cache : data is cached on the local disk and read from there. Because local files are read directly, network transfer is avoided, so this is faster than querying the database over the network. It suits scenarios where speed requirements are moderate but a large amount of data must be cached.
  • Memory cache : Directly store data in the local memory, and maintain the cache object directly through the program, which is the fastest way to access.

Common local cache implementation schemes: HashMap, Guava Cache, Caffeine, Ehcache.

3.1 ConcurrentHashMap

The simplest in-process cache can be implemented through the HashMap or ConcurrentHashMap that comes with the JDK.

  • Applicable scenarios: caching data that never needs to be evicted.
  • Disadvantages: entries cannot be evicted, so memory grows without bound.
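
For example, a sketch of caching reflected Method objects with ConcurrentHashMap.computeIfAbsent (the class name and key format are illustrative):

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Caching reflected Method objects, one of the use cases mentioned later in 3.6.
// Entries are never evicted, so this only suits small, fixed key sets.
class MethodCache {
    private static final Map<String, Method> CACHE = new ConcurrentHashMap<>();

    static Method getMethod(Class<?> type, String name) {
        return CACHE.computeIfAbsent(type.getName() + "#" + name, key -> {
            try {
                return type.getMethod(name);        // resolved once, then served from the map
            } catch (NoSuchMethodException e) {
                throw new IllegalArgumentException(e);
            }
        });
    }
}
```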

3.2 LRUHashMap

A simple LRU cache can be built by extending LinkedHashMap and overriding the removeEldestEntry method, which yields a basic least-recently-used policy.
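
A minimal sketch (capacity and names are arbitrary):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A simple LRU cache built on LinkedHashMap: access order is enabled and
// removeEldestEntry is overridden to cap the number of entries.
class LRUHashMap<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LRUHashMap(int maxEntries) {
        super(16, 0.75f, true);          // accessOrder = true -> LRU ordering
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;      // evict the least recently used entry
    }
}
```

LinkedHashMap is not thread-safe, so concurrent use requires external synchronization (for example Collections.synchronizedMap), which is exactly where the lock contention listed below comes from.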

Disadvantages:

  • The lock competition is serious and the performance is relatively low.
  • Expiration times are not supported.
  • Auto refresh is not supported.

3.3 Guava Cache

Guava Cache addresses several of LRUHashMap's shortcomings. It borrows an idea from ConcurrentHashMap: segmented locking, which reduces lock contention.

Guava Cache does not remove expired entries immediately (there is no background thread constantly scanning); instead, expiration is handled during read and write operations. The advantage is that this avoids the global locking a background scan would need: each access simply checks whether the entry meets the refresh conditions and refreshes it if so.
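
A typical Guava LoadingCache configuration might look like the sketch below; the UserDao/User types, sizes and timeouts are illustrative assumptions, and the loader is assumed never to return null (Guava requires a non-null result).

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.concurrent.TimeUnit;

// Guava LoadingCache sketch with placeholder UserDao/User types.
class GuavaUserCache {
    private final LoadingCache<Long, User> cache;

    GuavaUserCache(UserDao dao) {
        this.cache = CacheBuilder.newBuilder()
                .maximumSize(10_000)                      // capacity-based eviction
                .expireAfterWrite(10, TimeUnit.MINUTES)   // TTL
                .refreshAfterWrite(1, TimeUnit.MINUTES)   // refresh on access after 1 minute
                .build(new CacheLoader<Long, User>() {
                    @Override
                    public User load(Long id) {
                        return dao.findById(id);          // called on a cache miss; assumed non-null
                    }
                });
    }

    User get(long id) {
        return cache.getUnchecked(id);
    }
}
```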

3.4 Caffeine

Caffeine implements W-TinyLFU (a variant that combines LFU and LRU ideas), and its hit ratio and read/write throughput are far better than Guava Cache's. Its implementation is more involved; see "The history of cache evolution you should know" in the references.
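
The same cache expressed with Caffeine, whose builder API closely mirrors Guava's (again with hypothetical UserDao/User types and arbitrary sizes/timeouts):

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.util.concurrent.TimeUnit;

// Caffeine LoadingCache sketch; W-TinyLFU decides internally what to evict.
class CaffeineUserCache {
    private final LoadingCache<Long, User> cache;

    CaffeineUserCache(UserDao dao) {
        this.cache = Caffeine.newBuilder()
                .maximumSize(10_000)
                .expireAfterWrite(10, TimeUnit.MINUTES)
                .build(dao::findById);                    // loader invoked on a miss
    }

    User get(long id) {
        return cache.get(id);
    }
}
```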

3.5 Ehcache

Ehcache is a pure-Java in-process caching framework, known for being fast and lean, and is the default CacheProvider in Hibernate.

Advantages

  • Fast and simple;
  • Supports multiple eviction policies: LRU, LFU, FIFO;
  • Two tiers of cache storage, memory and disk, so capacity is less of a concern;
  • Cached data can be written to disk across JVM restarts;
  • Distributed caching can be done through RMI, a pluggable API, etc.;
  • Provides listener interfaces for caches and cache managers;
  • Supports multiple cache manager instances, and multiple cache regions per instance;
  • Provides a cache implementation for Hibernate.

Disadvantages

  • When using the disk cache, it takes up a lot of disk space;
  • does not guarantee the security of data;
  • Although distributed caching is supported, it is not efficient (synchronizing data between different nodes through multicast).

3.6 In-process cache comparison

Comparison of common in-process caching technologies:

  • ConcurrentHashMap : best suited for caching a relatively fixed, small set of elements. Although it looks weaker in the comparison above, it is a JDK built-in class and is therefore still widely used in frameworks, for example to cache reflected Method and Field objects, or to cache connections so they are not created repeatedly. Caffeine itself uses ConcurrentHashMap to store its entries.
  • LRUMap : an option if you do not want to pull in third-party packages but still want eviction.
  • Ehcache : its jar is large, so it is relatively heavyweight. If you need persistence or clustering, Ehcache is an option. Note that although Ehcache supports distributed caching, its inter-node communication is based on RMI and its performance is not as good as Redis, so it is generally not recommended as a distributed cache.
  • Guava Cache : the Guava jar is already included in many Java applications, so it is often convenient to use directly; it is lightweight and feature-rich. If you are not familiar with Caffeine, Guava Cache is a reasonable choice.
  • Caffeine : its hit ratio and read/write performance are far better than Guava Cache's, and its API is almost identical to Guava Cache's (even slightly richer). It has delivered good results in production environments.

To sum up: if you do not need eviction, ConcurrentHashMap is enough; if you need an eviction policy and a richer API, Caffeine is the recommended choice.

4. Distributed cache

Distributed cache solves the biggest problem of in-process cache: if the application is a distributed system, nodes cannot share each other's in-process cache. Application scenarios of distributed cache:

  • Cache data that has undergone complex calculations.
  • Cache frequently accessed hot data in the system to reduce database pressure.

The implementation principles of different distributed caches are often quite different. This article mainly describes Memcached and Redis.

4.1 Memcached

Memcached is a high-performance, distributed memory object caching system. By maintaining a unified huge Hash table in memory, it can be used to store data in various formats, including images, videos, files, and database retrieval results.

Simply put: cache the data in memory and then read it from memory, which greatly improves the reading speed.

4.1.1 Memcached Features

  • Uses physical memory as the cache area and can run independently on the server . Each process can use at most 2 GB; to cache more data you can start more Memcached processes (on different ports) or deploy Memcached in a distributed way to cache data on different physical or virtual machines.
  • Stores data as key-value pairs . This single-index data organization makes lookups of data items O(1).
  • The protocol is simple: a text, line-based protocol . Data can be read and written directly on a Memcached server via telnet, and the protocol is simple enough that various caches have adopted it (see the example after this list).
  • High-performance communication based on libevent . Libevent is a C library that wraps event mechanisms such as BSD's kqueue and Linux's epoll behind a single interface, giving better performance than the traditional select.
  • Distribution depends on the Memcached client; servers do not communicate with each other . Memcached servers are unaware of each other, access their data independently, and share no information. The server side is not distributed; distributed deployment relies on the Memcached client.
  • Uses an LRU eviction policy plus expiration . When storing an item you can specify its expiration time in the cache, which is permanent by default. When the Memcached server runs out of allocated memory, expired data is replaced first, then the least recently used data. Memcached applies lazy expiration: it does not monitor whether stored key/value pairs have expired, but checks the record's timestamp when a key is fetched, which reduces the load on the server.
  • Has a built-in, efficient memory management scheme . It does not cause memory fragmentation, but its biggest drawback is wasted space. When memory is full, unused cache entries are removed automatically via LRU.
  • No persistence . Memcached does not provide disaster recovery; if the service restarts, all data is lost.
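
To illustrate the text protocol mentioned above, a minimal telnet session against a local Memcached might look like this (key, flags, TTL and value are arbitrary): the set line is `set <key> <flags> <ttl-seconds> <bytes>` followed by the value on the next line, and get returns the stored value between VALUE and END.

```
$ telnet 127.0.0.1 11211
set user:1 0 600 5
hello
STORED
get user:1
VALUE user:1 0 5
hello
END
```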

4.1.2 How Memcached Works

1) Memory management

Memcached uses slab allocation to manage memory: the allocated memory is divided into chunks of predetermined sizes, and chunks of the same size are grouped together. When data is stored, it is placed in the chunk size closest to the size of the key/value, so some space is inevitably wasted.

This set of memory management is very efficient and will not cause memory fragmentation, but its biggest disadvantage is that it will lead to waste of space.

2) Eviction strategy

Memcached's eviction strategy is LRU combined with expiration.

When you store an item in Memcached, you can specify its expiration time, which defaults to permanent. When the Memcached server runs out of allocated memory, expired data is replaced first, then the least recently used data.

For LRU, Memcached uses a lazy expiration strategy: it does not monitor whether stored key/value pairs have expired, but checks the record's timestamp when the key is fetched to determine whether it has expired. This reduces the load on the server.

3) Partition

Memcached servers do not communicate with each other, and its distributed capabilities are implemented by clients. Specifically, it is to implement an algorithm on the client side to calculate which server node the data should read/write to according to the key.

There are three common algorithms for selecting cluster nodes (a sketch of the first two follows this list):

  • Hash remainder : compute Hash(key) % N to decide which of the N nodes the data maps to.
  • Consistent hashing : solves the stability problem well. All storage nodes are arranged end-to-end on a hash ring; after hashing, each key is stored on the first node found clockwise. When a node joins or leaves, only the keys on the clockwise-adjacent segment of the ring are affected.
  • Virtual hash slots : a well-distributed hash function maps all data into a fixed range of integers called slots, a range generally much larger than the number of nodes. The slot is the basic unit of data management and migration within a cluster; using a large number of slots makes data splitting and cluster expansion easier. Each node is responsible for a certain number of slots.
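
A Java sketch of the first two selection algorithms, using CRC32 as an illustrative hash function (production consistent-hashing implementations normally add virtual nodes for better balance):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Node selection sketch: hash remainder and a TreeMap-based consistent-hash ring.
class NodeSelector {
    // 1) Hash remainder: simple, but most keys are remapped when N changes.
    static String modulo(String key, List<String> nodes) {
        return nodes.get((int) (hash(key) % nodes.size()));
    }

    // 2) Consistent hashing: a key goes to the first node clockwise from its hash,
    //    so adding/removing a node only moves the keys between it and its neighbour.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    NodeSelector(List<String> nodes) {
        for (String node : nodes) {
            ring.put(hash(node), node);
        }
    }

    String consistent(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```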

4.2 Redis

Redis is an open-source (BSD-licensed), in-memory data structure store that can be used as a database, cache, and message broker.

Redis can also use client-side sharding to scale write performance. It has built-in replication, Lua scripting, LRU eviction, transactions, and several levels of on-disk persistence, and it provides high availability through Redis Sentinel and automatic partitioning with Redis Cluster.

4.2.1 Redis Features

  • Supports multiple data types: string, hash, list, set, sorted set.
  • Supports multiple eviction policies (a configuration example follows this list):

    volatile-lru : evict the least recently used key among keys with an expiration set;

    volatile-ttl : evict the key closest to expiring among keys with an expiration set;

    volatile-random : evict a random key among keys with an expiration set;

    allkeys-lru : evict the least recently used key among all keys;

    allkeys-random : evict a random key among all keys;

    noeviction : never evict data.

  • Provides two persistence methods - RDB and AOF.
  • Provides cluster mode via Redis cluster.
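
As an illustration of how these eviction policies are selected, the corresponding redis.conf directives look like this (the memory limit is an arbitrary example):

```
# cap memory usage and evict with LRU over all keys when the cap is reached
maxmemory 2gb
maxmemory-policy allkeys-lru
```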

4.2.2 Principle of Redis

1) Eviction

Redis expires data in two ways:

  • Passive: when a key is accessed and found to be expired, it is deleted;
  • Active: Redis periodically samples keys that have an expiration set and, according to the eviction policy, deletes some of the expired ones.

2) Partition

  • A Redis Cluster contains 16384 virtual hash slots, and an efficient algorithm (CRC16 of the key modulo 16384) determines which slot a key belongs to (see the sketch after this list).
  • Redis Cluster supports request redirection: when a node receives a command, it first checks whether it is responsible for the slot that the key belongs to. If not, it returns a MOVED error whose payload directs the client to retry the request on the node responsible for that slot.
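
A sketch of the slot calculation a cluster client performs (slot = CRC16(key) mod 16384). The CRC16 below is a plain bitwise CCITT/XMODEM implementation; real clients also honour hash tags ({...}) inside the key, which are ignored here for brevity.

```java
import java.nio.charset.StandardCharsets;

// Maps a key to one of the 16384 Redis Cluster hash slots.
class RedisSlot {
    static int slotFor(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
    }

    // Bitwise CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0x0000.
    private static int crc16(byte[] bytes) {
        int crc = 0x0000;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = (crc & 0x8000) != 0 ? (crc << 1) ^ 0x1021 : crc << 1;
            }
            crc &= 0xFFFF;
        }
        return crc;
    }
}
```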

3) Master-slave replication

  • Asynchronous replication is supported since Redis 2.8. It has two modes:

Full resynchronization - used for initial replication; its steps are essentially the same as those of the SYNC command.

Partial resynchronization - used for replication after a disconnection. When conditions allow, the master sends the slave only the write commands executed while the link was down; the slave simply receives and executes these commands to bring its dataset back in line with the master's.

  • Each node in the cluster periodically sends PING messages to other nodes in the cluster to detect whether the other node is online.
  • If a master node is judged to be offline, one of its slave nodes is elected (via a Raft-style election) and promoted to master.

4) Data consistency

  • Redis does not guarantee strong consistency, as this will greatly reduce cluster performance.
  • Redis achieves eventual consistency through asynchronous replication.

4.3 Distributed cache comparison

Different distributed caches have great differences in functional characteristics and implementation principles, so the scenarios they adapt to are also different.

Here is a comparison of three well-known distributed caches (Memcached, Redis, Tair):

  • Memcached : a purely memory-based caching framework; it supports neither data persistence nor disaster recovery.
  • Redis : supports rich data structures with high read/write performance, and supports persistence, but all data lives in memory, so resource cost must be considered.
  • Tair : supports rich data structures with high read/write performance (some data types are relatively slow); in theory its capacity can be expanded without limit.

Summary: If the service is sensitive to latency and there is a lot of Map/Set data, Redis is more suitable. If the service needs to put a large amount of data into the cache and is not particularly sensitive to latency, then you can choose Memcached.

5. Multi-level cache

5.1 Overall Caching Framework

Usually, the cache of a large software system adopts a multi-level cache scheme:

Request process:

  • The browser initiates a request; if the CDN has a cached copy, it is returned directly;
  • If the CDN has no cache, the reverse proxy server is accessed;
  • If the reverse proxy server has a cache, it returns the result directly;
  • If the reverse proxy server has no cache, or the request is dynamic, the application server is accessed;
  • The application server checks the in-process cache; on a hit it returns the result to the proxy server, which caches it (dynamic responses are not cached);
  • If the in-process cache has no data, the distributed cache is read and the result is returned to the application server, which stores part of the data in its local cache;
  • If the distributed cache has no data either, the application reads the database and puts the result into the distributed cache.

5.2 Using In-Process Cache

If the application is a single-node service, in-process caching is naturally the first choice. An in-process cache is limited by memory size, and other nodes cannot know when one node's cache has been updated, so in general it is suitable for:

  • Data that is not very large and is updated infrequently.
  • Frequently updated data can also use the in-process cache if its expiration time or automatic refresh interval is set short enough.

This scheme has the following problems:

  • If the application service is a distributed system, the cache cannot be shared between application nodes, and there is a data inconsistency problem.
  • Since the in-process cache is limited by the memory size, the cache cannot expand indefinitely.

5.3 Using Distributed Cache

If the application service is a distributed system, then the simplest caching solution is to use the distributed cache directly. Its application scenario is shown in the figure:

Redis is used to store hot data. If the cache misses, it will query the database and update the cache. This scheme has the following problems:

  • If the cache service hangs, the application can only access the database at this time, which is likely to cause a cache avalanche.
  • Accessing the distributed cache service will have a certain I/O and serialization and deserialization overhead. Although the performance is high, it is not as fast as in-memory query.

5.4 Using Multilevel Cache

Purely using in-process cache and distributed cache have their own shortcomings. If higher performance and better availability are required, we can design the cache as a multi-level structure. Store the hottest data in memory using an in-process cache to further improve access speed.

This design idea also exists inside computers themselves: the CPU uses L1, L2 and L3 caches to reduce direct access to main memory and thus speed things up. Generally speaking, a two-level cache architecture is enough for most business requirements; more levels add system complexity and maintenance cost. So more cache levels are not automatically better; the trade-off depends on the actual situation.

A typical two-level architecture uses an in-process cache (e.g. Caffeine, Guava Cache, Ehcache, HashMap) as the first level and a distributed cache (e.g. Redis, Memcached) as the second level.

5.4.1 Multi-level cache query

The multi-level cache query process is as follows:

  • First, query the L1 cache; on a hit, return the result directly, otherwise continue to the next step.
  • Next, query the L2 cache; on a hit, return the result and backfill the L1 cache, otherwise continue to the next step.
  • Finally, query the database, return the result, and backfill the L2 cache and then the L1 cache.
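
A sketch of this read path with Caffeine as L1 and Redis (via Jedis) as L2; the key scheme, TTLs and loadFromDb placeholder are illustrative assumptions, and a real application would use a Jedis connection pool rather than a single connection.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import redis.clients.jedis.Jedis;
import java.util.concurrent.TimeUnit;

// Two-level read-through: L1 (in-process) -> L2 (distributed) -> database.
class TwoLevelCache {
    private final Cache<String, String> l1 = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(1, TimeUnit.MINUTES)   // shorter than the L2 TTL
            .build();
    private final Jedis l2 = new Jedis("localhost", 6379);

    String get(String key) {
        String value = l1.getIfPresent(key);          // 1. L1: in-process cache
        if (value != null) return value;

        value = l2.get(key);                          // 2. L2: distributed cache
        if (value != null) {
            l1.put(key, value);                       //    backfill L1
            return value;
        }

        value = loadFromDb(key);                      // 3. database
        if (value != null) {
            l2.setex(key, 600, value);                //    backfill L2 (10 min TTL)
            l1.put(key, value);                       //    then L1
        }
        return value;
    }

    private String loadFromDb(String key) {
        return null; // placeholder for the real database query
    }
}
```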

5.4.2 Multi-level cache update

For the L1 cache, when data is updated, only the cache on the machine where the update happens can be deleted or refreshed directly; other machines can only refresh their caches through a timeout mechanism. There are two timeout strategies:

  • Expire the entry a fixed time after it is written;
  • Refresh the entry a fixed time after it is written.

For the L2 cache, an update is immediately visible to all machines. A timeout should still be set, and it should be longer than the L1 cache's lifetime. To reduce the inconsistency of in-process caches, the design can be optimized further:

Through a message queue's publish/subscribe mechanism, other application nodes can be notified to invalidate their in-process caches. With this scheme, even if the message queue is down or unreliable, eventual consistency is still guaranteed, because the database is updated first and the in-process cache entries eventually expire and are refreshed.
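
A sketch of such notification using Redis pub/sub via Jedis (the channel name and key format are assumptions; any message queue can play the same role):

```java
import com.github.benmanes.caffeine.cache.Cache;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

// Cross-node L1 invalidation over Redis pub/sub.
class L1Invalidation {
    private static final String CHANNEL = "cache:invalidate";   // assumed channel name

    // Called by the node that just updated the database and the L2 cache.
    static void publishInvalidation(Jedis jedis, String key) {
        jedis.publish(CHANNEL, key);
    }

    // Each application node runs a subscriber on a dedicated connection/thread
    // (subscribe blocks) and evicts the key from its local cache on every message.
    static void listen(Jedis jedis, Cache<String, ?> l1) {
        jedis.subscribe(new JedisPubSub() {
            @Override
            public void onMessage(String channel, String key) {
                l1.invalidate(key);
            }
        }, CHANNEL);
    }
}
```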

6. Cache problems

6.1 Cache Avalanche

A cache avalanche occurs when the cache becomes unavailable, or when a large number of entries expire within the same time window because they share the same TTL; the flood of requests then hits the database directly, and the resulting pressure brings the whole system down.

For example, suppose system A sees 5,000 requests per second at peak, of which the cache handles 4,000. If the cache machine unexpectedly goes down, all 5,000 requests per second fall on the database, which cannot handle that load: it raises alarms and then collapses. Without a plan for this kind of failure, the DBA frantically restarts the database, only for it to be killed again immediately by the incoming traffic.

The main means of solving cache avalanches are as follows:

  • Improve the availability of the cache system (beforehand) . For example: deploy Redis Cluster (master-slave + Sentinel) to make Redis highly available and avoid a total collapse.
  • Adopt a multi-level caching scheme (during) . For example: local cache (Ehcache/Caffeine/Guava Cache) + distributed cache (Redis/Memcached).
  • Rate limiting, degradation and circuit breaking (during) , to avoid being overwhelmed by traffic. For example: use Hystrix for circuit breaking and degradation.
  • If the cache supports persistence , data can be restored once the service recovers (afterwards). For example, Redis supports persistence: after a restart it automatically loads data from disk and quickly restores the cached data.

Taken together, this is essentially a multi-level cache scheme: when the system receives a query, it checks the local cache first, then the distributed cache, and finally the database, returning as soon as any level hits.

The auxiliary means to solve the cache avalanche are as follows:

  • Monitor the cache and expand elastically.
  • Add a random component to cache expiration times so that entries do not all expire at the same moment and cause a sudden spike in database I/O. For example, instead of a fixed 10-minute timeout, let each key expire at a random point between 8 and 13 minutes, so that different keys expire at different times (a sketch follows).
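
A minimal sketch of such TTL jitter, following the 8-13 minute example above:

```java
import java.util.concurrent.ThreadLocalRandom;

// A nominal 10-minute timeout becomes a random value between 8 and 13 minutes,
// so keys written together do not all expire at once.
class CacheTtl {
    static int ttlSecondsWithJitter() {
        return ThreadLocalRandom.current().nextInt(8 * 60, 13 * 60 + 1);
    }
    // usage (Jedis): jedis.setex(key, ttlSecondsWithJitter(), value);
}
```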

6.2 Cache penetration

Cache penetration happens when the queried data does not exist in the database, so it naturally cannot exist in the cache either; the application misses the cache and queries the database every time. When there are many such requests, the pressure on the database grows.

There are generally two ways to solve cache penetration:

1) Cache empty values

When the database returns NULL, the NULL result is still cached; when the lookup throws an exception, nothing is cached.

This approach increases the maintenance cost of the cache: the cached empty value must be deleted when real data is later inserted. Alternatively, a short timeout on the empty value solves the same problem.
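
A sketch of caching empty values with a sentinel and a short TTL (the sentinel string, TTLs and loadFromDb placeholder are assumptions):

```java
import redis.clients.jedis.Jedis;

// Stores a placeholder when the database returns nothing, with a short TTL
// so that newly inserted data becomes visible quickly.
class NullCachingLoader {
    private static final String NULL_PLACEHOLDER = "##NULL##";
    private final Jedis jedis;

    NullCachingLoader(Jedis jedis) {
        this.jedis = jedis;
    }

    String get(String key) {
        String cached = jedis.get(key);
        if (cached != null) {
            return NULL_PLACEHOLDER.equals(cached) ? null : cached;  // cached "not found"
        }
        String value = loadFromDb(key);
        if (value == null) {
            jedis.setex(key, 60, NULL_PLACEHOLDER);   // short TTL for empty results
        } else {
            jedis.setex(key, 600, value);
        }
        return value;
    }

    private String loadFromDb(String key) {
        return null; // placeholder for the real database query
    }
}
```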

2) Filter data that cannot exist

Define rules that filter out data that cannot possibly exist. A Bloom filter works well here (it is a bit-array structure driven by binary operations, so it is very fast). For example, if order IDs are known to fall in the range 1-1000, any ID outside that range can be rejected immediately.
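
A sketch using Guava's BloomFilter to reject keys that cannot exist before the cache or database is touched (the sizing numbers and the idea of preloading all known order IDs are assumptions):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// Rejects order IDs that were never registered: "false" means definitely absent.
class OrderIdFilter {
    private final BloomFilter<String> knownIds = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8),
            1_000_000,   // expected number of IDs
            0.01);       // acceptable false-positive rate

    void register(String orderId) {       // call at startup and whenever an order is created
        knownIds.put(orderId);
    }

    boolean mightExist(String orderId) {  // false -> skip the cache and the database entirely
        return knownIds.mightContain(orderId);
    }
}
```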

For malicious attacks, most of the keys in the attack traffic do not exist, so the first scheme would end up caching a huge number of empty values; it is not suitable here, and the second scheme should be used to filter those keys out entirely. In short: for data with an unusually large number of distinct keys and a low repetition rate, do not cache the misses, filter them with the second scheme; for empty results with a limited set of keys and a high repetition rate, caching the empty values with the first scheme works well.

6.3 Cache Breakdown

Cache breakdown occurs when a hot entry expires and, at that instant, a large number of requests hit the database directly. For example, some keys are hot data and are accessed very frequently; if many requests arrive at the moment one of these keys expires, they all miss the cache and go to the database, and database load spikes.

To avoid this problem, we can take the following two measures:

  • Mutual exclusion (e.g. a distributed lock) : lock the hot key so that only one thread rebuilds it at a time instead of a crowd of threads hitting the database for the same key (see the sketch after this list).
  • Timed asynchronous refresh : some data can be refreshed automatically before it expires rather than simply being evicted on expiration. Eviction exists to keep data fresh, so proactively refreshing achieves the same goal.
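
A single-JVM sketch of the locking idea: only one thread rebuilds a missing hot key while the others wait and then re-check the cache. Across processes the same idea is usually implemented with a distributed lock (for example Redis SET with NX and EX); the types and loadFromDb placeholder here are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Per-key mutex so only one thread reloads an expired hot key.
class HotKeyLoader {
    private final Map<String, Object> locks = new ConcurrentHashMap<>();
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    String get(String key) {
        String value = cache.get(key);
        if (value != null) return value;

        Object lock = locks.computeIfAbsent(key, k -> new Object());
        synchronized (lock) {                       // only one rebuilder per key
            value = cache.get(key);                 // re-check after acquiring the lock
            if (value == null) {
                value = loadFromDb(key);            // single trip to the database
                if (value != null) cache.put(key, value);
            }
        }
        locks.remove(key);                          // best-effort cleanup
        return value;
    }

    private String loadFromDb(String key) {
        return "value-of-" + key;                   // placeholder for the real query
    }
}
```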

6.4 Summary

The sections above covered the common caching problems one by one. Here the solutions are summarized as a whole, organized by the phase in which they apply.

  • Beforehand : a Redis high-availability scheme (Redis Cluster + master-slave + Sentinel) to avoid a total cache collapse.
  • During : (1) a multi-level cache scheme: local cache (Ehcache/Caffeine/Guava Cache) + distributed cache (Redis/Memcached); (2) rate limiting + circuit breaking + degradation (Hystrix) to keep the database from being overwhelmed in extreme cases.
  • Afterwards : Redis persistence (RDB + AOF); after a restart, data is automatically loaded from disk and the cache is quickly restored.

Because Memcached's data types are not as rich as Redis's and it supports neither persistence nor disaster recovery, Redis is generally chosen as the distributed cache.

7. Caching strategy

7.1 Cache warm-up

Cache warm-up means that hot data is queried and cached right after the system starts, so that user requests do not have to hit the database first and then update the cache.

Solutions:

  • Manually refresh the cache: write a cache-refresh page and trigger it manually when going live.
  • Refresh the cache at application startup: if the data volume is not large, it can be loaded automatically when the project starts.
  • Periodically refresh the cache asynchronously.

7.2 How to cache

7.2.1 Do not expire the cache

  • Cache update flow:
  • open a transaction;
  • write the SQL;
  • commit the transaction;
  • write the cache.

Do not put the cache write inside the database transaction, especially for a distributed cache: network jitter can make the cache write very slow and block the database transaction. If the consistency requirements on cached data are not strict and the data volume is not large, a periodic full synchronization of the cache is also an option.

In this mode it is possible for the transaction to succeed while the cache write fails, but the impact of that is smaller than the blocking problem above.

7.2.2 Expired cache

Use lazy loading. For hot data, you can set a short cache time and load it asynchronously at regular intervals.

7.3 Cache update

In general, if the system does not strictly require cache and database consistency, try not to serialize read and write requests. Serialization can guarantee that there will be no data inconsistency, but it will lead to a significant decrease in the throughput of the system.

Generally speaking, there are two situations for cache updates:

  • Delete the cache first, then update the database;
  • Update the database first, then delete the cache;

Why delete the cache instead of updating the cache?

Consider multiple concurrent requests updating the same data: you cannot guarantee that the order of database updates matches the order of cache updates, so the cache could end up inconsistent with the database. Deleting the cache avoids this, so deletion is generally preferred.

1) Delete the cache first, then update the database

For a simple update, the caches at every level are deleted first and then the database is updated. The big problem: after the cache is deleted but before the database is updated, a read request arrives, misses the cache, reads the old value from the database, and loads that old value back into the cache; subsequent requests then keep reading stale data.

Whether the cache operation succeeds or fails, it should not block the database operation, so cache deletion is often done asynchronously, which does not fit the delete-first approach well. Deleting the cache first does have one advantage: if the database operation then fails, the worst outcome is just a cache miss.

2) Update the database first, then delete the cache (note: this strategy is generally recommended)

Updating the database first and then deleting the cache avoids the problem described above.

But it introduces new problems: if a query arrives while the update is in progress, it is still served the old value from the cache; and worse, if the cache deletion fails after the database has been updated, the cache may keep serving dirty data indefinitely.
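
A sketch of the recommended order, update the database first and then delete the cache; the repository interface, key scheme and retry queue are illustrative assumptions (a real system might push failed deletions to a message queue for retry):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import redis.clients.jedis.Jedis;

// Update DB first, then invalidate the cached copy; retry the delete if it fails.
class UserUpdater {
    private final UserRepository repository;
    private final Jedis jedis;
    private final Queue<String> retryQueue = new ConcurrentLinkedQueue<>();

    UserUpdater(UserRepository repository, Jedis jedis) {
        this.repository = repository;
        this.jedis = jedis;
    }

    void updateUser(long id, String newName) {
        repository.update(id, newName);          // 1. commit the database change first
        String key = "user:" + id;
        try {
            jedis.del(key);                      // 2. then delete the cached copy
        } catch (Exception e) {
            retryQueue.add(key);                 // 3. if the delete fails, retry it later
        }
    }
}

interface UserRepository {
    void update(long id, String newName);
}
```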

3) Which update strategy to choose

From the above, we know that both update strategies have concurrency problems.

However, updating the database first and then deleting the cache is recommended, because the probability of its concurrency problem is very low. That problem requires a read to hit an expired cache entry at the same time as a concurrent write, with the read entering the database before the write but updating the cache after the write completes. Since database writes are much slower than reads (and may lock the table), the chance of all these conditions holding at once is very small.

If strong consistency between the database and the cache is required, it can be achieved with 2PC or Paxos. But 2PC is too slow and Paxos is too complex, so unless the data is truly critical a strong-consistency scheme is not recommended. For a more detailed analysis, see "Analysis of distributed database and cache double-write consistency scheme".

8. Summary

Finally, a mind map is used to summarize the knowledge points described in this article to help you have a systematic understanding of caching.

9. References

1. "Technical Architecture of Large Websites: Core Principles and Case Analysis"

2. The history of cache evolution you should know

3. How to design and use cache gracefully?

4. Understand the cache architecture in distributed systems (Part 1)

5. Cache those things

6. Analysis of distributed database and cache double-write consistency scheme

Author: vivo internet server team - Zhang Peng
