The cache is a powerful tool in the architecture of a high-concurrency system. With a cache in place, a system can easily handle thousands of concurrent requests. But while enjoying that convenience, keeping the data in the database and the cache consistent has always been a hard problem. In this article, I will share how to ensure cache consistency at the architecture level.

Overview

Before introducing how to solve the consistency problem between the database and the cache, let's first answer two questions: what the data consistency problem between the database and the cache actually is (What), and why it arises in the first place (Why).

What is the data consistency problem between the database and the cache

First, let's be clear about what the data consistency problem actually is. Most of us are familiar with the CAP theorem; anyone doing distributed systems development has heard of it. C stands for Consistency, A for Availability, and P for Partition tolerance, and the theorem states that a distributed system can guarantee at most two of the three at the same time. The definition of consistency here is: whether all replicas of a piece of data in a distributed system have the same value at the same moment.
We can therefore treat the data in the database and the data in the cache as two replicas of the same data. The consistency problem between the database and the cache then becomes: how do we keep these two replicas in agreement?

Why is there a data consistency problem between the database and the cache

In everyday business development, we rely on the ACID properties of database transactions to ensure data consistency. In a distributed environment there is no equivalent transactional guarantee spanning the database and the cache, so partial failures are easy to hit: the database update succeeds but the cache update fails, or the cache update succeeds but the database update fails. Summing up, there are two main reasons why the database and the cache drift apart.

Network

A distributed system must assume the network is unreliable. Under the CAP theorem, failures caused by the network are considered unavoidable, which is why system designs generally choose CP or AP. Operating on the database and operating on the cache both involve network I/O, so network instability can easily fail part of a request sequence and leave the data inconsistent.

Concurrency

In a distributed environment, unless you synchronize explicitly, requests are processed concurrently by multiple server nodes. Consider the following example. Two concurrent requests update field A at the same time: process 1 updates A to 1 in the database and then sets the cache to 1, while process 2 updates A to 2 and sets the cache to 2. Because the ordering of concurrent operations is not guaranteed, the interleaving below can occur. The final result is that the value of A is 2 in the database but 1 in the cache, and the two are inconsistent. A runnable sketch of this race follows the table.

Time   Process 1                  Process 2
T1     Update database: A = 1
T2                                Update database: A = 2
T3                                Update cache: A = 2
T4     Update cache: A = 1
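
To make the race concrete, here is a minimal, self-contained Java sketch of the interleaving above. The two ConcurrentHashMaps stand in for the real database and cache; the class and values are purely illustrative:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RaceDemo {
    // Stand-ins for the real database and cache
    static final Map<String, Integer> db = new ConcurrentHashMap<>();
    static final Map<String, Integer> cache = new ConcurrentHashMap<>();

    static void update(String key, int value) {
        db.put(key, value);    // update the database first...
        // ...the thread can be preempted here, so another writer may
        // update both the database and the cache in between...
        cache.put(key, value); // ...then update the cache
    }

    public static void main(String[] args) throws InterruptedException {
        Thread p1 = new Thread(() -> update("A", 1));
        Thread p2 = new Thread(() -> update("A", 2));
        p1.start(); p2.start();
        p1.join(); p2.join();
        // Occasionally prints db=2, cache=1: the T1..T4 interleaving above
        System.out.println("db=" + db.get("A") + ", cache=" + cache.get("A"));
    }
}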

Cache read-write modes

In engineering practice, there are several common modes for read-write caching.

Cache Aside

Cache Aside is probably the most commonly used mode; a great deal of business code updates the database and the cache this way. Its main logic is shown in the figure below.

(Figure: Cache Aside activity diagram)

First determine the type of request, and handle read requests and write requests differently:

  • Write request: update the database first, then invalidate the cache once the update succeeds.
  • Read request: first check whether the data is in the cache. On a hit, return it directly; on a miss, query the database, update the cache on success, and finally return the data. (A code sketch of both paths follows the list.)
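
The following minimal Java sketch shows both paths. The Cache and Db interfaces are hypothetical stand-ins for a Redis client and a DAO, not any specific library's API:

import java.util.concurrent.TimeUnit;

public class CacheAsideExample {

    /** Minimal stand-in for a Redis-style cache client (hypothetical API). */
    interface Cache {
        String get(String key);
        void setEx(String key, String value, long ttl, TimeUnit unit);
        void del(String key);
    }

    /** Minimal stand-in for a database access object (hypothetical API). */
    interface Db {
        String query(String key);
        void update(String key, String value);
    }

    private final Cache cache;
    private final Db db;

    CacheAsideExample(Cache cache, Db db) {
        this.cache = cache;
        this.db = db;
    }

    /** Write path: update the database first, then invalidate the cache. */
    public void write(String key, String value) {
        db.update(key, value);
        cache.del(key);
    }

    /** Read path: cache first; on a miss, load from the DB and repopulate. */
    public String read(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        String fromDb = db.query(key);
        if (fromDb != null) {
            // The TTL acts as the bottom-line refresh mechanism discussed later.
            cache.setEx(key, fromDb, 60, TimeUnit.SECONDS);
        }
        return fromDb;
    }
}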

This mode is simple to implement and looks logically sound. In Java, the read-request logic is usually factored out with AOP to avoid duplicating it everywhere.
However, Cache Aside has a data consistency problem in concurrent environments, such as the read-write race described in the following table.

Time   Read request                       Write request
T1     Query cache for field A: miss
T2     Query database: A = 1
T3                                        Update database: A = 2
T4                                        Invalidate cache
T5     Set cache: A = 1

The read request misses the cache for field A and queries the database, getting A = 1; meanwhile the write request updates A to 2. Because the two run concurrently, the write request's cache invalidation happens before the read request's cache write, so the read request sets the cached value of A to 1 and that stale entry is never invalidated, leaving dirty data in the cache. If no expiration time is set on the cache, the data stays wrong forever.

Read Through

The Read Through mode is very similar to Cache Aside. The difference is that in Cache Aside, the application itself implements the miss path of querying the database and then updating the cache, whereas in Read Through the application only talks to the cache service, and the cache service is responsible for loading data on a miss. Guava Cache, commonly used in Java, makes a good example; see the following code.

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.concurrent.ExecutionException;

// The CacheLoader tells the cache how to load an entry on a miss.
LoadingCache<Key, Graph> graphs = CacheBuilder.newBuilder()
       .maximumSize(1000)
       .build(
           new CacheLoader<Key, Graph>() {
             public Graph load(Key key) throws Exception {
               return createExpensiveGraph(key); // e.g. a database query
             }
           });
...
try {
  // On a cache miss, get() transparently invokes the CacheLoader.
  return graphs.get(key);
} catch (ExecutionException e) {
  throw new OtherException(e.getCause());
}

In this code, Guava's CacheLoader loads the cache for us. When the read request calls get() and a cache miss occurs, the CacheLoader is responsible for loading the entry; our code only deals with the graphs object and never sees the loading details. This is the Read Through mode.
Logically there is no essential difference between Read Through and Cache Aside; Read Through simply yields more concise code. By the same token, Read Through suffers from the same concurrent cache/database inconsistency problem as Cache Aside.

Write Through

The Write Through mode mirrors Read Through, but on the write path: every write operation goes through the cache, and the subsequent logic depends on whether the write hits the cache.

Write-through: write is done synchronously both to the cache and to the backing store.

The Wikipedia definition of Write Through emphasizes that, in this mode, a write request writes the cache and the database synchronously; only when both writes succeed is the operation considered successful. The main logic is shown in the figure below.

(Figure: Write Through activity diagram)

In Write Through mode, the cache is only populated on a miss by read requests. A write request that misses the cache does not populate it, but writes directly to the database. If the write hits the cache, the cache is updated first, and the cache itself then writes the data back to the database. What does it mean for the cache to write data back to the database by itself? Ehcache makes a good example. In Ehcache, the CacheLoaderWriter interface implements the Write Through mode; it defines a series of lifecycle hooks for the cache, two of which are the write methods below.

public interface CacheLoaderWriter<K, V> {

    // Invoked when a single entry is written to the cache; the
    // implementation persists the value to the underlying store.
    void write(K key, V value) throws Exception;

    // Batch counterpart of write(), invoked for bulk writes.
    void writeAll(Iterable<? extends Entry<? extends K, ? extends V>> entries) throws BulkCacheWritingException, Exception;
}

Implementing these write-related methods is enough to have data written through to the underlying database whenever the cache is updated. In other words, your code only interacts with the CacheLoaderWriter and never needs to spell out the "update the cache and also write the database" logic itself.
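
As a rough sketch of how this wiring can look in Ehcache 3 (the OrderWriter class and the SQL hinted at in the comments are hypothetical; depending on your Ehcache version you may also need to implement bulk methods such as writeAll):

import org.ehcache.Cache;
import org.ehcache.CacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;
import org.ehcache.spi.loaderwriter.CacheLoaderWriter;

public class WriteThroughDemo {

    /** Hypothetical loader-writer bridging the cache to the database. */
    static class OrderWriter implements CacheLoaderWriter<Long, String> {
        @Override
        public String load(Long key) {
            return null; // would be: SELECT ... WHERE id = key (Read Through)
        }
        @Override
        public void write(Long key, String value) {
            // would be: INSERT/UPDATE the row in the database (Write Through)
        }
        @Override
        public void delete(Long key) {
            // would be: DELETE FROM ... WHERE id = key
        }
    }

    public static void main(String[] args) {
        CacheManager manager = CacheManagerBuilder.newCacheManagerBuilder().build(true);
        Cache<Long, String> orders = manager.createCache("orders",
                CacheConfigurationBuilder
                        .newCacheConfigurationBuilder(Long.class, String.class,
                                ResourcePoolsBuilder.heap(1000))
                        .withLoaderWriter(new OrderWriter())
                        .build());
        // put() updates the cache and, via OrderWriter.write(), synchronously
        // writes the value to the backing store.
        orders.put(1L, "order-1");
        manager.close();
    }
}
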
Looking back at the Write Through flow, its handling of read requests is essentially the same as Read Through, so the two modes are naturally used together.
Does Write Through suffer the same consistency problem as Read Through under concurrency? It clearly does, and for the same reason: in a concurrent scenario there is no guarantee on the relative ordering of the database update and the cache update.

Write Back

Write-back (also called write-behind): initially, writing is done only to the cache. The write to the backing store is postponed until the modified content is about to be replaced by another cache block.

Let's start with Wikipedia's definition of the Write Back mode: a write request only writes to the cache, and the write to the backing store is deferred until the modified data is about to be evicted from the cache. There are two main differences between Write Back and Write Through:

  1. Write Through writes the cache and the database synchronously, while Write Back is asynchronous: a write request touches only the cache, and the data is later written from the cache to the underlying database asynchronously, in batches.
  2. In Write Back mode, a write request that misses the cache writes the data into the cache anyway, unlike Write Through. As a result, Write Back handles read misses and write misses in much the same way.

The implementation logic of Write Back is more complex, mainly because the mode has to track which data is "dirty" and write it to the underlying storage when necessary; if an entry is updated several times, the writes also need to be merged and batched. I will not reproduce the flow diagram here; if you are interested, refer to the diagram on Wikipedia.
Because it is asynchronous, Write Back's strength is high performance; its weakness is that it cannot guarantee consistency between the cache and the database.
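
To make the dirty-data tracking and batched flushing concrete, here is a deliberately simplified, hand-rolled write-behind buffer in plain Java. It is a sketch of the idea only; a real implementation (Ehcache's write-behind support, for instance) also deals with retries, ordering guarantees, and shutdown, and batchUpdateDatabase() is a hypothetical hook:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class WriteBehindCache {

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Dirty set: keys whose latest value has not reached the database yet.
    // Repeated updates to one key overwrite each other here, which is
    // exactly the "merge multiple updates" behavior described above.
    private final Map<String, String> dirty = new ConcurrentHashMap<>();
    private final ScheduledExecutorService flusher =
            Executors.newSingleThreadScheduledExecutor();

    public WriteBehindCache() {
        // Flush dirty entries to the database in batches, once per second.
        flusher.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.SECONDS);
    }

    /** Write path: touch only the cache; the database write is deferred. */
    public void put(String key, String value) {
        cache.put(key, value);
        dirty.put(key, value);
    }

    public String get(String key) {
        return cache.get(key);
    }

    private void flush() {
        if (dirty.isEmpty()) {
            return;
        }
        Map<String, String> batch = new HashMap<>(dirty);
        batchUpdateDatabase(batch);
        // Remove only entries whose value is unchanged, so an update that
        // raced with this flush stays dirty and is flushed next round.
        batch.forEach(dirty::remove);
    }

    private void batchUpdateDatabase(Map<String, String> batch) {
        // Hypothetical hook: a single batched INSERT/UPDATE against the DB.
    }
}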

Reflections

Looking across the implementations of the modes above, we can see two axes of variation: whether to delete or update the cache, and whether to touch the cache or the database first. The following table lists all the combinations and what happens when one of the two operations fails partway, where 1 means the cache and the database end up consistent and 0 means they do not.

Strategy                                          Cache op fails   DB op fails
Update the cache first, then update the database  1                0
Update the database first, then update the cache  0                1
Delete the cache first, then update the database  1                1
Update the database first, then delete the cache  0                1

The table does not consider putting the cache operation inside the database transaction. In general, it is not recommended to put non-database operations such as RPC calls or Redis commands inside a transaction, because they depend on unreliable factors such as the network; once they misbehave, the transaction may fail to commit or turn into a "long transaction".
As the table shows, only "delete the cache first, then update the database" preserves consistency under either partial failure, which gives us conclusion 1: "delete the cache first, then update the database" is the best choice among these. However, delete-then-update makes cache breakdown more likely, which we will discuss in detail later.

We can also observe that Cache Aside, Read Through, and Write Through all suffer cache/database inconsistency under concurrency, and for the same reason: with no guarantee on the relative ordering of the database update and the cache update, the database update can land before a stale cache write, so old data ends up in the cache. From this we get conclusion 2: any mode that solves this ordering problem can keep the cache and the database consistent under concurrency.

Finally, in all three modes, once the cache and the database diverge and the data is never updated again, the cached value stays wrong forever; there is no built-in remedy. Hence conclusion 3: some automatic cache refresh mechanism is needed. The simplest is to set an expiration time on every cache entry, a last line of defense that bounds how long the data can stay wrong.
With these three conclusions in hand, let's look at two more modes.

Delayed double delete

The delayed double delete mode can be seen as an optimized version of Cache Aside. Its main logic is shown in the figure below.

(Figure: delayed double delete flow)

In the write request, following the best-practice conclusion above, first delete the cache, then update the database, and then send an MQ message. The message can be sent by the service itself or emitted by middleware that listens for database changes. After receiving the message, wait for a delay, which can be implemented with the message queue's delayed-message feature or by sleeping on the consumer side, and then delete the cache a second time. The pseudo-code is as follows.

// delete the cache first
redis.delKey(key);
// then update the database
db.update(x);
// send a delayed message (1s delay)
mq.sendDelayMessage(msg, 1s);

...
// consume the delayed message
mq.consumeMessage(msg);
// delete the cache a second time
redis.delKey(key);

The read-request logic is the same as in Cache Aside: on a cache miss, the read request reloads the cache, and a reasonable expiration time should be set when populating it.
Compared with Cache Aside, this mode reduces the likelihood of cache/database inconsistency, but only reduces it. The problem still exists, just under stricter conditions; see the scenario below.

Time   Read request                       Write request
T1     Query cache for field A: miss
T2     Query database: A = 1
T3                                        Update database: A = 2
T4                                        Invalidate cache
T5                                        Send delayed message
T6                                        Consume message, delete cache again
T7     Set cache: A = 1

Because message consumption and the read request still run concurrently, there is no guarantee that the delayed cache deletion happens after the read request's cache write, so inconsistency remains possible, just with lower probability.

Synchronous invalidation and update

Weighing the strengths and weaknesses of the modes above, in actual projects I have adopted another mode, which I call "synchronous invalidation and update". Its main logic is shown in the figure below.

(Figure: synchronous invalidation and update flow)

The idea of this mode is that read requests only read the cache, and all operations on the cache and the database are performed synchronously inside the write request. To prevent concurrent write requests from interleaving, a distributed lock is added to the write path: only after acquiring the lock may the subsequent operations run. This eliminates all the concurrency-induced inconsistencies discussed so far.

The distributed lock can be scoped to the same dimension as the cache; there is no need for a global lock. For example, if the cache is keyed by order, the lock can be per order; if the cache is keyed by user, the lock can be per user. Taking orders as the example, the pseudo-code for the write request is as follows:

// acquire the order-scoped distributed lock
lock(orderID) {
    // delete the cache first
    redis.delKey(key);
    // then update the database
    db.update(x);
    // finally repopulate the cache, with an expiration time
    redis.setEx(key, value, 60s);
}

The advantage of this mode is that it can essentially guarantee consistency between the cache and the database. Performance-wise, read requests pay almost nothing; write requests must write the database and the cache synchronously under a lock, which has some cost, but since most Internet businesses read far more than they write, the impact is small. The mode does have shortcomings, mainly the following two.

Write requests strongly depend on the distributed lock

In this mode, write requests depend strongly on the distributed lock: if acquiring the lock fails, the whole request fails. Normal business flows rely on database transactions for consistency, but in key business scenarios distributed locks are often used alongside transactions anyway, so the lock is frequently not a brand-new dependency. Moreover, large companies generally have mature distributed lock services or components, and even without one, a simple lock built on Redis or ZooKeeper is cheap to implement and reasonably stable. In my own use of this mode, distributed locks have caused essentially no problems.
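
For reference, here is a minimal sketch of such a lock built on the Jedis client: acquire with SET NX PX and a random token, release with a compare-and-delete Lua script. Key names and TTLs are illustrative, and production systems usually prefer a proven component such as Redisson:

import java.util.Collections;
import java.util.UUID;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class RedisLock {

    private static final String RELEASE_SCRIPT =
            "if redis.call('get', KEYS[1]) == ARGV[1] " +
            "then return redis.call('del', KEYS[1]) else return 0 end";

    private final Jedis jedis;

    public RedisLock(Jedis jedis) {
        this.jedis = jedis;
    }

    /** Try to acquire: SET key token NX PX ttl; returns the token on success. */
    public String tryLock(String lockKey, long ttlMillis) {
        String token = UUID.randomUUID().toString();
        String result = jedis.set(lockKey, token,
                SetParams.setParams().nx().px(ttlMillis));
        return "OK".equals(result) ? token : null;
    }

    /** Release only if we still own the lock (compare-and-delete via Lua). */
    public boolean unlock(String lockKey, String token) {
        Object result = jedis.eval(RELEASE_SCRIPT,
                Collections.singletonList(lockKey),
                Collections.singletonList(token));
        return Long.valueOf(1L).equals(result);
    }
}

In the write path above, the lock(orderID) { ... } pseudo-block would become a tryLock("lock:order:" + orderID, ...) call, the critical section, and an unlock() in a finally block.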

A failed cache update in a write request causes cache breakdown

To pursue cache/database consistency, the synchronous invalidation and update mode performs both the cache write and the database write inside the write request, which eliminates the inconsistency caused by interleaved operations under concurrency. The cache is read-only in read requests, which therefore never reload it, not even on a miss.

But precisely because of this design, if updating the cache fails in a write request and no subsequent write request arrives, the cache entry is never reloaded and all subsequent reads go straight to the DB: a cache breakdown. Internet workloads read much more than they write, so the odds of breakdown are relatively high.

The remedy is compensation, for example a timed compensation task or MQ-message-based compensation, either incremental or full. In my experience it is best to add some form of compensation.
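
As an illustration of the timed-task flavor, the sketch below periodically re-reads recently changed rows and rewrites the corresponding cache entries, so a missed cache update heals within one compensation cycle. The Db and Cache interfaces, the "order:" key format, and the time windows are all hypothetical:

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CacheCompensationJob {

    /** Hypothetical stand-ins for the real DAO and cache client. */
    interface Db { List<Order> loadOrdersUpdatedSince(long epochMillis); }
    interface Cache { void setEx(String key, String value, long ttlSeconds); }
    record Order(long id, String payload) {}

    private final Db db;
    private final Cache cache;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    CacheCompensationJob(Db db, Cache cache) {
        this.db = db;
        this.cache = cache;
    }

    /** Incremental compensation: re-sync rows changed in the last 5 minutes. */
    void start() {
        scheduler.scheduleAtFixedRate(() -> {
            long since = System.currentTimeMillis() - TimeUnit.MINUTES.toMillis(5);
            for (Order o : db.loadOrdersUpdatedSince(since)) {
                // Rewrite the cache from the DB (the source of truth); the
                // overlapping window tolerates clock skew and missed runs.
                cache.setEx("order:" + o.id(), o.payload(), 60);
            }
        }, 0, 1, TimeUnit.MINUTES);
    }
}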

Some other issues that need attention

With a reasonable cache read-write mode chosen, let's look at some other issues that must be handled to keep the cache and the database consistent.

Avoid problems that crash the cache server and lead to data inconsistency

  1. Cache penetration, cache breakdown, and cache avalanche

As mentioned earlier, the "delete the cache first, then update the database" mode is exposed to cache breakdown. Alongside breakdown, the related problems of cache penetration and cache avalanche can also crash the cache server and lead to inconsistent data. Their definitions and the conventional mitigations follow.

  • Cache penetration: querying a key that does not exist can never hit the cache, so every such request lands on the DB, which can crush the database. Solutions: 1. cache empty objects; 2. a Bloom filter.
  • Cache breakdown: for a key with an expiration time set, a burst of concurrent requests arriving exactly when it expires can overwhelm the database in an instant. Solutions: 1. a mutex (distributed lock), so only one request reloads the cache at a time; 2. "never expire": no physical expiry, only logical expiry (e.g. a background task refreshes periodically).
  • Cache avalanche: many keys are given the same expiration time and expire together, sending a flood of requests to the database. The difference from breakdown: breakdown concerns a single key, avalanche concerns many keys. Solutions: 1. spread out expiration times, e.g. add a random offset; 2. a mutex (distributed lock), so only one request reloads each cache entry.
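
As a concrete example of the Bloom-filter defense against cache penetration, and since this article already uses Guava, here is a small sketch; the sizing numbers are illustrative:

import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class PenetrationGuard {

    // Sized for ~1M keys with a ~1% false-positive rate (illustrative).
    private final BloomFilter<String> existingKeys = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

    /** Call when a key is created/loaded, so reads can be pre-filtered. */
    public void register(String key) {
        existingKeys.put(key);
    }

    /**
     * Reject queries for keys that definitely do not exist before they reach
     * the cache or the database. False positives fall through harmlessly.
     */
    public boolean mightExist(String key) {
        return existingKeys.mightContain(key);
    }
}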

In practice, a 100% cache hit rate is generally not the goal. Moreover, with "delete the cache first, then update the database", the gap between the two operations is normally very short, so the database will not be flooded; a little cache breakdown is acceptable. For systems with extreme concurrency, such as flash sales, where breakdown is truly unacceptable, you can reload the cache under a mutex, or put the cache operation inside the database transaction so that the "update the database first, then update the cache" mode becomes usable, avoiding the breakdown problem.

  2. Big key / hot key

Big keys and hot keys are mostly business design problems and need to be solved at the design level. Big keys mainly hurt performance; the fix is to split one big key into several smaller keys, which reduces the amount of data transferred per network round trip and thus improves performance.
Hot keys concentrate load on a single cache node, which can crash that server. The fix is to increase the number of copies of the key, or to split the hot key into multiple keys, as in the sketch below.
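
A minimal sketch of the "increase the number of copies" idea for hot keys: fan writes out to N replica keys and spread reads randomly across them. The replica count and key format are illustrative:

import java.util.concurrent.ThreadLocalRandom;

public class HotKeyReplicas {

    private static final int REPLICAS = 8; // tune to the observed hotness

    /** Writes fan out to every replica so all copies stay in sync. */
    static String[] writeKeys(String hotKey) {
        String[] keys = new String[REPLICAS];
        for (int i = 0; i < REPLICAS; i++) {
            keys[i] = hotKey + ":" + i; // e.g. "product:42:3"
        }
        return keys;
    }

    /** Reads pick one replica at random, spreading load across cache nodes. */
    static String readKey(String hotKey) {
        return hotKey + ":" + ThreadLocalRandom.current().nextInt(REPLICAS);
    }
}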

Summary

Note that none of the modes above delivers strong consistency; at best they achieve what the business can treat as eventual consistency. If strong consistency is truly required, you need protocols and consensus algorithms such as 2PC, 3PC, Paxos, or Raft.
Finally, here is a summary of the modes introduced above.

  • Systems without heavy concurrency, or that can tolerate cache/database inconsistency for a while: Cache Aside / Read Through / Write Through.
  • Systems with a certain amount of concurrency, or medium consistency requirements: delayed double delete.
  • Systems with high concurrency, or high consistency requirements: synchronous invalidation and update.
  • Systems that require strong consistency: 2PC, 3PC, Paxos, Raft, and other distributed consistency algorithms.

In short, there is no silver bullet in architecture; every design involves trade-offs. When choosing and designing a cache read-write mode, combine it with the concrete business scenario: how heavy is the concurrency, how strict is the consistency requirement, and so on. Use these modes flexibly, modify them where necessary, and once the general direction is settled, fill in the details, and you will have a sound architecture design.
