Author: Tian Weifan, senior database engineer at NetEase Games and Co-Leader of TUG South China. A database veteran who has maintained a variety of databases, he is currently responsible for the operation, maintenance, and development of the database private cloud platform at NetEase Games.
At the TUG NetEase online enterprise event, Tian Weifan, a senior database engineer from NetEase Games, gave a talk on tuning TiDB clusters with massive numbers of regions. The following content is compiled from the transcript of that day's talk, which covered three parts:
- The first part introduces NetEase Games and the current status of TiDB usage there;
- The second part shares a tombstone key case, including the whole troubleshooting process;
- The third part covers first impressions of TiDB 5.0 and future plans.
We are a department that provides one-stop database private cloud services for all of NetEase Games, covering MongoDB, MySQL, Redis, and TiDB. More than 99% of NetEase Games' game businesses use the database private cloud services we provide.
TiDB usage at NetEase Games is as follows: 50+ TiDB clusters in total, 200 TB+ of data across the whole TiDB environment, a maximum of 25+ nodes in a single cluster (mostly TiKV plus a small number of TiFlash nodes), and a maximum of 25 TB+ of data in a single cluster. Although our use of TiDB is still at an early stage, it has already reached a certain scale, and we have run into quite a few problems along the way.
For this talk I have picked out a representative case and will share the whole process of our investigation. I hope you can take some useful references away from it.
"Murder" caused by tombstone key
The case we shared is a "blood case" caused by the tomstone key. Some friends may ask what is the tomstone key? The meaning of tomstone key will be introduced in detail later.
called a "blood case"? is because when you think of the murder, everyone thinks that this problem is really serious.
At the time, this problem affected both the real-time and offline analytics scenarios of a large NetEase game, making the entire analytics service unavailable, which in turn affected the availability of the internal data reporting system. Moreover, because it lasted so long, the game's product and operations staff and the company's senior management could not see the game data in time.
Background
Before analyzing the specific issue, let me briefly introduce the business background: the cluster has a very large number of regions per node, with 25+ TiKV nodes and 90,000+ regions on a single node. The main workload is real-time analytics, with a large number of transactional delete and insert operations.
Current status of the problem
Let's go to the scene of the case. It starts with a phone call: one day at 4 am, we received an alert call. After assessing the impact, we notified the business as soon as possible and then started troubleshooting to find out what had triggered the alarm.
As shown in the figure above, the CPU of one node was completely saturated. Our first guess was that a hotspot was causing it; generally, if a single node's CPU is abnormally high, we treat it as a hotspot issue. To mitigate quickly, we temporarily took the node out of service to keep the whole database from becoming unavailable. It turned out that once that node was offline, the CPU of other TiKV nodes spiked instead; in other words, simply taking a node offline or restarting it could not bring the business back to normal. Since we suspected a hotspot, we investigated it urgently. The heat map did show some hotspots, and we could identify the corresponding tables from it in a targeted way.
Second, through the heat map and some hotspot metadata tables, we identified a large number of hotspot tables and found that they were all built with composite (joint) primary keys. That makes things relatively simple: we set the SHARD_ROW_ID_BITS attribute on them in batches to scatter the writes.
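As an illustration, here is a minimal sketch of how such tables could be scattered in batches. The connection details, table list, and the number of shard bits are hypothetical; `SHARD_ROW_ID_BITS` only takes effect on tables that still use the implicit `_tidb_rowid` (i.e., without an integer clustered primary key).

```python
# Minimal sketch: scatter hotspot tables in batches with SHARD_ROW_ID_BITS.
# Connection details, table names, and bit counts below are hypothetical.
import pymysql

HOT_TABLES = ["game_login_log", "game_pay_log"]  # hypothetical hotspot tables

conn = pymysql.connect(host="tidb.example.com", port=4000,
                       user="root", password="", database="analytics")
try:
    with conn.cursor() as cur:
        for table in HOT_TABLES:
            # 4 shard bits spread new writes across up to 16 buckets;
            # this only works for tables using the implicit _tidb_rowid.
            cur.execute(f"ALTER TABLE `{table}` SHARD_ROW_ID_BITS = 4")
    conn.commit()
finally:
    conn.close()
```

On the PD side, the hot-region scheduling concurrency mentioned below can be raised with `pd-ctl config set hot-region-schedule-limit <n>`; verify the exact parameter name against your PD version.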
At the same time, we contacted the business to identify other potential hotspots and scatter them together. We also set up monitoring on hot region tables: if a table has not been scattered yet, it gets scattered automatically. In addition, we appropriately increased PD's hot-region scheduling concurrency so that hotspots are rebalanced faster, which helps deal with similar hotspot issues quickly.

After this round of adjustments, the CPU did drop and the business reported that everything was back to normal, so it looked as if the problem had been solved. But the next day we received another wave of calls at 10 am. By then the problem was already very urgent for the business, and we immediately re-examined it from scratch. Reviewing how we had handled it the first time, we guessed that either the first fix was incomplete, or the root cause was not a hotspot at all.

So we went back to the problem itself. Since the CPU was spiking, we started from the CPU metrics and found that the Storage ReadPool CPU metric had soared.
I had never seen this metric spike before and was quite puzzled at the time, so we reached out to technical experts from the TiDB community, Qibin, Qihang, and Yueyue, for online support.
Through the logs and some other metrics, including seek duration and Batch Get, we pinned the problem down to seeks over tombstone keys.
Back to the question raised at the beginning: what exactly is a tombstone key? Here is a brief introduction.
Generally, after data is deleted, TiDB still retains multiple versions of it. That is, when new data overwrites old data, the old data is not cleaned up immediately; it is kept alongside the new data, and the versions are distinguished by timestamp. So when do these multi-version records get cleaned up? This is where GC comes in. A background GC thread traverses keys whose versions are older than the GC time and marks them for deletion. At this point they are only marked; that mark is the so-called tombstone key. The marked keys are physically removed later by the RocksDB engine's background compaction.

Let me also briefly introduce RocksDB's compaction process. When data is written to RocksDB, it first goes into the WAL and is then written into the in-memory memtable. When the memtable is full, it is marked as an immutable memtable.
After that, RocksDB flushes the immutable memtable to disk and generates an SST file, which forms the so-called level 0.
As more and more SST files accumulate at level 0, a lot of duplicate data appears, and this is where tombstone keys start to show up. RocksDB then merges these SSTs into level 1 through compaction.
When the level 1 files reach a preset size threshold, they are merged into level 2 in turn. That is a brief description of the RocksDB compaction process.
At this point, the cause of the whole problem was almost confirmed.
Before the failure, the business had deleted a large amount of data, which left a large number of tombstone keys behind. In this situation, when many transactions are executed, TiKV's seeks may have to step over large numbers of deleted tombstone keys. Because all of these keys must be skipped, Batch Get performance can drop severely, which in turn drives the Storage ReadPool CPU up and gets the whole node stuck.
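To make the effect concrete, here is a toy model (not TiKV's actual code) showing why a seek slows down when it has to step over many tombstoned versions before reaching a live key.

```python
# Toy model of an LSM-style seek: the more tombstoned versions a key range
# contains, the more entries a seek must step over before it finds live data.
# This is an illustration only, not TiKV's actual implementation.
from bisect import bisect_left

def build_sorted_entries(num_live, tombstones_per_key):
    """Sorted (key, is_tombstone) entries: each live key is preceded by
    tombstones_per_key deleted versions, as after a large DELETE."""
    entries = []
    for i in range(num_live):
        for v in range(tombstones_per_key):
            entries.append((f"key{i:06d}#v{v:03d}", True))   # deleted versions
        entries.append((f"key{i:06d}#v{tombstones_per_key:03d}", False))  # live
    return entries

def seek_first_live(entries, start_key):
    """Seek to start_key, then count how many tombstones must be skipped."""
    keys = [k for k, _ in entries]
    idx = bisect_left(keys, start_key)
    skipped = 0
    while idx < len(entries) and entries[idx][1]:
        skipped += 1
        idx += 1
    return skipped

few = build_sorted_entries(num_live=1000, tombstones_per_key=1)
many = build_sorted_entries(num_live=1000, tombstones_per_key=500)
print(seek_first_live(few, "key000500"))   # skips ~1 tombstone
print(seek_first_live(many, "key000500"))  # skips ~500 tombstones per seek
```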
One thing is worth reviewing here: when we handled the problem the first time and scattered the hotspots, why did everything seem to recover?
We did a retrospective on this as well. Scattering the hotspots did alleviate the problem to some extent, but the most fundamental reason for the recovery was that, while we were handling it, RocksDB happened to compact the underlying tombstone keys away, so the problem subsided.
After pinning down the problem, the next step was to optimize and solve it thoroughly. To this end, we drew up a set of optimization measures.
How to optimize
- Business side
First, convert tables to partitioned tables. In some scenarios, for example when backfilling data, the business can simply drop partitions by day, and TiDB then uses the fast Delete Ranges path instead of waiting for compaction (see the sketch after these three points).
Second, avoid hotspots. Business read requests were switched to Follower Read, and tables are created with scattering attributes specified, which alleviates read and write hotspots and, to a certain extent, the impact of seeking tombstone keys.
Third, split large transactions as much as possible. An oversized transaction puts a lot of pressure on the whole TiKV layer; it has to operate on a huge number of keys, which makes it much easier to run into the tombstone key problem.
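Below is a minimal sketch of these business-side measures. The table name, partition layout, and batch size are hypothetical; day-based range partitioning, `tidb_replica_read`, and `DELETE ... LIMIT` are standard TiDB/MySQL mechanisms, but verify the details against your version.

```python
# Sketch of the business-side measures: day-based partitions that can be
# dropped via fast Delete Ranges, Follower Read for reads, and batched
# deletes instead of one huge transaction. Names and sizes are hypothetical.
import pymysql

conn = pymysql.connect(host="tidb.example.com", port=4000,
                       user="root", password="", database="analytics")
cur = conn.cursor()

# 1. Day-based range partitioning: dropping a whole day becomes a cheap
#    DDL that goes through Delete Ranges instead of row-by-row deletes.
cur.execute("""
    CREATE TABLE IF NOT EXISTS login_log (
        id BIGINT NOT NULL,
        log_day INT NOT NULL,
        detail JSON,
        PRIMARY KEY (id, log_day)
    )
    PARTITION BY RANGE (log_day) (
        PARTITION p20210401 VALUES LESS THAN (20210402),
        PARTITION p20210402 VALUES LESS THAN (20210403)
    )
""")
cur.execute("ALTER TABLE login_log DROP PARTITION p20210401")

# 2. Follower Read for analytical reads, to keep load off the leaders.
cur.execute("SET SESSION tidb_replica_read = 'follower'")

# 3. Split a huge delete into bounded batches instead of one big transaction.
while True:
    deleted = cur.execute(
        "DELETE FROM login_log WHERE log_day < 20210401 LIMIT 10000")
    conn.commit()
    if deleted == 0:
        break

cur.close()
conn.close()
```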
After putting these optimizations in place at the business layer, we also worked out some handling and optimization measures on the operation and maintenance side.
- O&M side
First, manual compaction. When the problem occurs, we can urgently take the problematic TiKV node out of service, run a manual compaction on it, and then bring it back online (see the sketch below).
Second, hotspot table scattering. We built an automatic monitoring mechanism that watches the cluster's hotspot tables and scatters them automatically according to the hotspot type and table type.
Third, better communication channels. We established a unified communication channel with the official TiDB team to discuss TiDB issues in real time, and we organize offline exchange activities from time to time.
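As an illustration of the first measure, here is a sketch of triggering a manual compaction on a single TiKV node with `tikv-ctl`. The node address and column-family choices are hypothetical, and the exact flags should be checked against the `tikv-ctl` version in use.

```python
# Sketch: manually compact the write and default column families of one
# TiKV node with tikv-ctl. The address is hypothetical; verify the flags
# against your tikv-ctl version before running this in production.
import subprocess

TIKV_ADDR = "10.0.0.1:20160"  # hypothetical TiKV address

for cf in ("write", "default"):
    subprocess.run(
        ["tikv-ctl", "--host", TIKV_ADDR, "compact", "-d", "kv", "-c", cf],
        check=True,
    )
```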
- Thorough optimization ideas
In fact, the best solution to this problem is a thorough optimization at the database layer itself. From the introduction above, the idea should already be fairly clear.
When tombstone keys exist, as long as write operations and Batch Get can bypass them instead of seeking across large numbers of tombstone keys, the problem can be avoided entirely. The related optimizations have been merged upstream into TiDB 5.0, and an optimization was also made in version 4.0.12.
First impressions of TiDB 5.0
We had been looking forward to 5.0 very much, so we ran some tests as soon as it was released. Here are a few of the test cases, which I will briefly share with you.
TiDB 5.0 test experience
- Stability
The first is stability. On versions prior to 4.0, and on 4.0 itself, we would more or less run into problems such as performance jitter under high concurrency, which had been troubling us. In our tests we found that on TiDB 4.0.12, under high concurrency, both the overall duration and QPS fluctuate to a noticeable degree.
In the comparison test with 5.0, the duration and QPS curves were quite stable, almost a flat straight line.
- New features: async commit and MPP
5.0 also brings a number of new features. We were particularly interested in several of them, so we ran tests on those as well.
For example, the two-phase commit optimization: after async commit was enabled during the test, overall insert TPS increased by 33% and response latency dropped by 37%, a sizable improvement over before.
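For reference, here is a minimal sketch of how async commit can be switched on in 5.0. The system variables `tidb_enable_async_commit` and `tidb_enable_1pc` exist in 5.0, but their defaults may differ between versions, so verify before changing them.

```python
# Sketch: turn on async commit (and one-phase commit) globally on TiDB 5.0.
# Verify these system variables and their defaults against your version.
import pymysql

conn = pymysql.connect(host="tidb.example.com", port=4000, user="root", password="")
with conn.cursor() as cur:
    cur.execute("SET GLOBAL tidb_enable_async_commit = ON")
    cur.execute("SET GLOBAL tidb_enable_1pc = ON")
conn.close()
```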
Speaking of 5.0, I have to mention one of its major features, MPP, which is what we were most interested in.
We did a lot of testing here, including benchmarks and some individual SQL tests.
Here I want to share a test case from a real database scenario. The MySQL instance under test held about 6 TB of data, and we compared its performance against TiDB 5.0 with MPP. We found that even with a single TiFlash node, 5.0's MPP outperforms MySQL in most cases, often by several times, and in some queries by as much as ten times.
With multiple TiFlash nodes and the MPP feature enabled, the performance is even better. These tests left us full of confidence in 5.0.
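As a reference, here is a minimal sketch of the setup for such a comparison: adding TiFlash replicas for a table and allowing the MPP engine. The table name and replica count are hypothetical; `SET TIFLASH REPLICA` and `tidb_allow_mpp` are the documented 5.0 mechanisms, but check them against your version.

```python
# Sketch: add TiFlash replicas for a table and allow the MPP engine,
# then run an analytical query. Table name and replica count are hypothetical.
import pymysql

conn = pymysql.connect(host="tidb.example.com", port=4000,
                       user="root", password="", database="analytics")
with conn.cursor() as cur:
    # Replicate the table to TiFlash; MPP reads come from these replicas.
    cur.execute("ALTER TABLE login_log SET TIFLASH REPLICA 2")
    # Wait until AVAILABLE = 1 in information_schema.tiflash_replica
    # before expecting MPP plans (omitted here for brevity).
    cur.execute("SET SESSION tidb_allow_mpp = ON")
    cur.execute("SELECT log_day, COUNT(*) FROM login_log GROUP BY log_day")
    print(cur.fetchall())
conn.close()
```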
TiDB 5.0: a promising future
Next, we will push forward with 5.0 adoption, for example in real-time business analytics scenarios, and we will also promote landing some big data scenarios on TiDB 5.0. In addition, bringing TP-type (transactional) workloads onto TiDB 5.0 is also part of our plan.
For the database private cloud service as a whole, we have built a lot of the surrounding ecosystem during the TiDB rollout, such as database migration, a unified monitoring platform, and some automated provisioning. Going forward, we will keep improving the TiDB ecosystem so that TiDB can land better in NetEase Games' cloud services and resolve our business concerns. Throughout our adoption we have also received strong support from TUG friends, so while benefiting from the TUG community we will also take part in building it, and build the TUG community together.