
Background

Overview

Recently, our team has been intensively discussing issues around Redis cache consistency. The core e-commerce domains, such as merchandise, marketing, inventory, and orders, each have their own characteristics when choosing a caching approach. Behind these business differences, are there best practices we can draw on? This article tries to explore that question and offer some suggestions.
Before we start, there are two important points to agree on:

  1. Strong consistency cannot be achieved in distributed scenarios: Unlike the MESI protocol used by CPU hardware caches and the strict clock control available in hardware, in a distributed system we cannot make the cache strongly consistent with the underlying database; that is, a change to the cache and a change to the database cannot be turned into one atomic operation. By comparison, hardware engineers introduced the memory barrier (Memory Barrier) to give software developers different consistency options for balancing performance against consistency.
  2. Even eventual consistency is hard to achieve: in a distributed scenario, eventual consistency requires that the cache holds the latest version of the data (or holds nothing at all) shortly after the database is updated. Reaching this state reliably is extremely difficult, because we face failures across many components: hardware, software, the network, and so on.


[Figure: CPU cache structure]

Cache consistency issues

In general, the problem looks like this. As shown in the figure below, the data in the database is updated 5 times, producing 6 versions, V1~V6; the length of each box represents how long that version lives. We expect the cache layer to perceive each database change and react as quickly as possible. The gaps in the cache layer's row mean that no cached data exists during those intervals. V2, V3, and V5 being absent from the cache does not break our eventual-consistency requirement, as long as the final version in the database and the final version in the cache are the same.

How the cache is written

The code that writes to the cache usually lives alongside the code that reads it, and involves 4 steps, as shown in the figure below: W1 reads the cache, W2 checks whether the cached entry exists, W3 assembles the cache data (this usually requires querying the database), and W4 writes to the cache. There is no way to bound how long the pause between steps may last; the pause between W3 and W4 is the deadliest, because it can easily cause an old version of the data to be written to the cache.
We might wonder: would it help if the write in step W4 carried the assumption made in W2, i.e. used WriteIfNotExists semantics?

Consider the following situation. Suppose three cache writes execute concurrently and, because of a burst of database updates in a short period, they assemble the data of versions V1, V2, and V3 respectively. With WriteIfNotExists semantics, exactly two of the three must fail, and there is no guarantee which one succeeds. So we cannot make a simple decision: we need to read the cache again and check whether what we are about to write is the same. If it is the same, we are done; if not, we have two choices:
1) Delete the cache and let a subsequent request handle the refill.
2) Use an atomic operation provided by the cache to write only when our data is a newer version (a minimal sketch of this flow follows).
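To make the race concrete, here is a minimal Python sketch of the W1~W4 flow with WriteIfNotExists (NX) plus the re-read-and-compare step just described. It assumes the redis-py client; load_from_db, the key naming, and the embedded 'version' field are illustrative placeholders, not anything prescribed above.

```python
import json
import redis

r = redis.Redis()  # assumes a local Redis; adjust connection settings as needed

def load_from_db(item_id):
    # Hypothetical W3 helper: query the database and assemble the cacheable payload.
    # A real implementation would run a SELECT; 'version' stands for the row's version column.
    return {"id": item_id, "price": 100, "version": 1}

def read_with_cache_fill(item_id, ttl_seconds=900):
    key = f"item:{item_id}"
    cached = r.get(key)                          # W1: read the cache
    if cached is not None:                       # W2: cache hit, return it
        return json.loads(cached)
    data = load_from_db(item_id)                 # W3: assemble data from the database
    # W4: write only if the key is still absent (WriteIfNotExists / NX).
    if not r.set(key, json.dumps(data), ex=ttl_seconds, nx=True):
        # Another concurrent filler won the race: re-read and compare.
        current = r.get(key)
        if current is not None and json.loads(current)["version"] != data["version"]:
            # Choice 1): delete and let a later request refill.
            # Choice 2) would be an atomic version-compare write (see best practice 2).
            r.delete(key)
    return data
```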

How to perceive changes in the database

After data in the database changes, how do we detect the change and manage the cache effectively? There are usually three ways:

Using the code execution flow

Usually we execute some cache-management code after the database operation completes. The biggest problem with this approach is low reliability: an application restart, an unexpected machine crash, and so on can prevent the subsequent code from ever running.
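As a tiny illustration of this inline approach (reusing the Redis client r from the sketch above; the db handle, SQL, and column names are hypothetical placeholders):

```python
def update_price(item_id, new_price):
    # Database write commits first ('db' is a placeholder for a real DB handle).
    db.execute("UPDATE item SET price = %s, version = version + 1 WHERE id = %s",
               (new_price, item_id))
    # If the process restarts or the machine crashes before the next line runs,
    # the stale cache entry survives until it expires on its own.
    r.delete(f"item:{item_id}")
```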

Using transactional messages

As an improvement on the code execution flow, a transactional message is sent after the database operation completes, and the cache-management operation is performed in the message consumption logic. This solves the reliability problem, but the business side has to add transactional-message logic and bear the extra operational cost.

Using the data change log

Database products usually generate a change log after data changes, such as MySQL's binlog. This lets a middleware team build a product that performs cache-management operations when it receives the changes, such as Alibaba's Jingwei. Reliability is guaranteed, and the change log for a given period can also be replayed, which makes this approach quite powerful.
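A rough sketch of what the consumer side could look like (the event shape and the subscription mechanism are assumptions; a real setup would use a binlog client such as canal or Jingwei, again reusing the Redis client r from the earlier sketch):

```python
def on_row_changed(event):
    # D1: invalidate the cache entry for the changed row.
    # 'event' is assumed to carry the table name and the changed row's primary key.
    if event["table"] == "item":
        r.delete(f"item:{event['pk']}")
```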

Best practice 1: Invalidate cache after database change

This is the most common and simplest approach and should be the preferred solution. The overall execution logic is shown in the figure below:

Step W4 uses the most basic put semantics. The assumption is that a request that writes later usually also carries newer data, which holds in most cases. Step D1 deletes the cache by listening to the DB binlog, i.e. the data change log approach described above.

The disadvantage of this solution is that, under highly concurrent database updates combined with heavy cache read traffic, there is a small probability that an old version of the data ends up in the cache.

There are four common mitigations:
1) Limit the cache's validity period: set an expiration time on the cache, such as 15 minutes. This means we accept that the cache may be stale for at most 15 minutes.
2) Probabilistic cache reload: reload the cache for a small proportion of read traffic, e.g. 1%, to keep the cached data consistent under heavy traffic; it also helps keep the database fully warmed up.
3) Combine with business characteristics: design around the specifics of the business, for example:
For the marketing scenario: the cache is used when computing discounts on the product detail page and the order confirmation page, but not when actually placing the order, so extreme cases can occur without undue business loss.
For the inventory scenario: reading an old version of the data only lets excess traffic proceed to ordering after the product has sold out; the inventory deduction at order time operates on the database, so there is no business loss.
4) Double deletion: perform the cache deletion of step D1 twice, with an interval in between, such as 30 seconds. Both deletions are triggered by the "cache management component", which can therefore support this (a minimal sketch follows this list).
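Mitigation 1) is simply the ex= expiry already passed in the cache-fill sketch earlier. Below is a minimal sketch of mitigation 4), the delayed double delete, again reusing the Redis client r; an in-process timer is a simplification, since both deletions would normally be triggered by the cache management component:

```python
import threading

DOUBLE_DELETE_DELAY_SECONDS = 30   # interval between the two deletions

def invalidate_with_double_delete(key):
    # First delete, issued as soon as the database change is observed (step D1).
    r.delete(key)
    # Second delete catches a stale refill that raced in between a slow
    # W3 (database read) and W4 (cache write) on some reader.
    threading.Timer(DOUBLE_DELETE_DELAY_SECONDS, r.delete, args=(key,)).start()
```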

Best practice 2: Write with version

For scenarios such as the product-information cache, which has a low update frequency, high data-consistency requirements, and heavy cache read traffic, versioned writes are usually adopted. The overall execution logic is shown in the figure below:

The biggest difference from the "invalidate cache after database change" scheme lies in steps W4 and D1: the cache layer must provide a versioned-write API, i.e. a write succeeds only if the data being written carries a newer version; otherwise it fails. This also requires us to add version information to the data in the database.
The eventual-consistency behavior of this scheme is quite good. Only in an extreme case (the newly written version is lost from the cache, after which a write of an old version succeeds) can an old version of the data end up in the cache. Using a write rather than a delete in step D1 largely avoids this extreme case. And because this solution targets scenarios with heavy cache read traffic, it also avoids the burst of W3 requests that would otherwise penetrate to the DB right after a cache deletion.
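One way the cache layer could expose such a versioned-write API is a small Redis Lua script executed atomically; below is a self-contained sketch using redis-py. Storing the payload as JSON with an embedded 'version' field is an assumption made for illustration, not a prescribed format.

```python
import json
import redis

r = redis.Redis()

# Write only if the incoming version is newer than what is currently cached.
# KEYS[1] = cache key, ARGV[1] = new JSON payload, ARGV[2] = new version number.
WRITE_IF_NEWER = """
local current = redis.call('GET', KEYS[1])
if current then
    local cur_ver = tonumber(cjson.decode(current)['version'])
    if cur_ver >= tonumber(ARGV[2]) then
        return 0  -- cached copy is the same or newer: reject the write
    end
end
redis.call('SET', KEYS[1], ARGV[1])
return 1
"""

write_if_newer = r.register_script(WRITE_IF_NEWER)

def write_with_version(item_id, data):
    # Used for both W4 and D1: D1 writes the latest version instead of deleting.
    key = f"item:{item_id}"
    return write_if_newer(keys=[key], args=[json.dumps(data), data["version"]])
```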

Summary and Outlook

For architectures that keep the cache and the database separate, weighing the practical experience of many companies in the industry against the ROI, the two best practices above are the most widely used, and best practice 1 in particular should be the default choice for everyday applications. At the same time, to avoid as far as possible the inconsistencies that each best practice can still allow, we also need to design around the characteristics of the business to guarantee consistency in key scenarios (for example, the marketing case above reads the database rather than the cache when placing an order). This matters because, as explained in the Background, there is no perfect technical solution.

Besides keeping the cache and the database separate, two other solutions that have been applied in the industry are worth learning from:

Ali XKV

Simply put, a Memcache server is deployed inside the database: it bypasses the SQL layer and accesses the storage engine (such as InnoDB) directly, and clients read and write data through a KV protocol. Its distinguishing feature is that the data really is strongly consistent with the database, and performance can be 5 to 10 times better than accessing the database via SQL. The drawback is also obvious: data can only be accessed by primary key or unique key (though this is only a limitation relative to SQL; most caches use KV access protocols anyway).

Tencent DCache

There is no need to maintain a cache and a database as two separate stores yourself: developers are given a single unified view of the data, and DCache persists the data itself after the cache is updated. The disadvantage is that the supported data structures are limited (key-value, kk-row, list, set, zset), and it is difficult for it to support structures as complex as database tables.

Text / Sumu
Follow Dewu Technology and be the most fashionable technical person!

