TPC TiKV: How did the hardest project in Hackathon come to be? | TPC Team Interview

Database tuning can make database applications run faster, but for many people tuning the database kernel is a daunting technical job, a "game" reserved for a small number of kernel developers. Even for them, kernel performance tuning is full of uncertainty. It requires weighing many complex factors together: CPU, I/O, memory, and network at the hardware level; the operating system, middleware configuration, and database parameters; and the various queries and commands running on the database. In the Hackathon 2021 competition, the TPC team took on this challenge. Using a bottom-up design to make better use of hardware resources, they applied the TPC (thread-per-core) threading model to improve TiKV's write performance, performance stability, and adaptability. The hard-core project earned the TPC team the third prize and the Technical Potential Award.

"This is the most hardcore project in this Hackathon, and I gave it a very high score. The TPC team did a lot of work on it, and I have a hunch that it will be difficult to land in the future. They used io_uring, but it seems they also ran into problems with it. After stepping into so many pitfalls, they could also choose AIO or a separate asynchronous thread mechanism. With the new raft engine (which will be GA in TiDB 5.4), it is also very convenient to write logs in parallel and make full use of multi-queue I/O. This feature is also very important on the cloud, because the single-threaded IOPS of EBS disks is actually not high. In addition, I think they will later remove the WAL of the KV RocksDB, so that several thread pools can really be merged and do only computation, with I/O operations becoming completely asynchronous." - Judge Tang Liu

The TPC team consists of Chen Yilin and Zhao Lei from the TiDB distributed transaction R&D team. Chen Yilin has participated in the TiDB Hackathon since 2019, when he won that year's first prize with a thread pool project while still an intern at PingCAP. After graduating, he stayed at PingCAP to continue business-related R&D work, and he is by now a TiDB Hackathon veteran.

The Charm of TiDB Hackathon

Chen Yilin: For PingCAPers, the Hackathon is an opportunity to discover more possibilities. Our day-to-day work is full of urgent projects, which leaves little room to explore new possibilities for TiDB. We often have ideas at work but no chance to try them. The Hackathon gives us that chance: we can put these ideas into practice and demonstrate their effect and potential through a demo. If an idea is implemented well, it may eventually land in production code.

Project Inspiration

Chen Yilin: Zhao Lei was very eager to do this project, and most of the inspiration came from him. In our day-to-day kernel development and when solving user problems, we found that TiKV's overall performance is mediocre, highly unpredictable, and difficult to tune. While studying the code of another database product, Zhao Lei found that some techniques in its architecture could actually improve TiKV's performance. So we wanted to apply some of those architectural ideas to TiKV and see whether they could improve its performance and stability.
The primary goal of the TPC project is to improve performance. TiKV does not use resources very well; for example, it underutilizes CPU and I/O. With this architecture, concurrent WAL writes can make full use of I/O resources. The new thread pool architecture can also plan CPU usage sensibly, which gives TiKV more stable and predictable performance, especially in cloud environments.

Asynchronous collaboration during the competition

Chen Yilin: We started the preliminary development work around the New Year's Day holiday. It was much like our normal work: most of the time we collaborated asynchronously. Whenever I made progress, I synced it directly to Zhao Lei, by email or GitHub notifications. The development was divided into two parts. One was changing TiKV's raftstore itself, which Zhao Lei did. The other concerned the Raft engine, the component TiKV uses to store Raft logs; I worked on making its writes asynchronous and concurrent.

Raftstore contains two thread pools:

The store pool processes raft messages, appends logs, and so on. Raft logs are written to raft db.
The apply pool processes committed logs and writes the data to kv db. Currently, both raft db and kv db use RocksDB; later, raft db will switch to raft-engine.
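
The hand-off between the two pools can be pictured as a two-stage pipeline. The sketch below is only an illustration, not TiKV code: `run_pipeline`, the single thread per stage, the in-memory list and dict standing in for raft db and kv db, and the bounded queue between the stages are all assumptions made for the example.

```python
import queue
import threading


def run_pipeline(messages, max_pending=4):
    """Toy two-stage pipeline: a 'store' stage feeding an 'apply' stage."""
    raft_log = []  # stands in for raft db
    kv = {}        # stands in for kv db
    # Bounded queue (an assumption): the store stage blocks when apply lags.
    committed = queue.Queue(maxsize=max_pending)

    def store_worker():
        for key, value in messages:
            raft_log.append((key, value))  # append the log entry first
            committed.put((key, value))    # then hand it over as committed
        committed.put(None)                # sentinel: no more entries

    def apply_worker():
        while (entry := committed.get()) is not None:
            key, value = entry
            kv[key] = value                # apply the committed entry

    threads = [
        threading.Thread(target=store_worker),
        threading.Thread(target=apply_worker),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return raft_log, kv
```

Calling `run_pipeline([("a", 1), ("b", 2), ("a", 3)])` returns the full log in arrival order and a map in which the later write to `"a"` has replaced the earlier one.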

RocksDB cannot make good use of modern high-speed disks. Its foreground write path (the WAL) provides an I/O depth of only 1, and synchronization and queuing between write groups are costly. High-speed disks such as NVMe SSDs need a high I/O depth to saturate their IOPS, or a large I/O size with a moderate I/O depth to saturate their bandwidth. A large I/O size is not suitable for OLTP systems, however, because large batches usually mean high latency.
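
A rough way to see why I/O depth matters is Little's law: the number of requests in flight equals throughput times latency, so IOPS is bounded by I/O depth divided by per-request latency. The snippet below works this out for a few depths. The 100-microsecond latency is an assumed round number, not a measurement from the project, and the bound only holds while the device is not yet saturated.

```python
def max_iops(io_depth: int, latency_us: float) -> float:
    """Upper bound on IOPS from Little's law: in-flight requests / latency."""
    return io_depth * 1_000_000 / latency_us


if __name__ == "__main__":
    # Assumed per-request latency of 100 microseconds.
    for depth in (1, 8, 32):
        print(f"I/O depth {depth:>2}: at most {max_iops(depth, 100.0):,.0f} IOPS")
```

Under that assumption, a write path stuck at depth 1 tops out at 10,000 IOPS, while the same device could serve 320,000 IOPS at depth 32.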

To optimize TiKV's disk usage, the raft engine needs to either support concurrent WAL writes or split raft db so that multiple WAL files are written in parallel. The TPC project implements parallel WAL writes without splitting raft db. To push the disk as hard as possible, use the CPU better, and get more stable performance, TPC chose async I/O to implement this.
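
One way to picture parallel writes into a single log file is to serialize only the reservation of a byte range and let the writes themselves proceed concurrently. The sketch below is a simplified illustration under stated assumptions, not the raft-engine design: `ParallelLog` is a hypothetical class, and it uses a thread pool with positional `os.pwrite` calls (POSIX only) as a stand-in for io_uring-style async I/O.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor


class ParallelLog:
    """Hypothetical append-only log whose writers do not queue behind each other."""

    def __init__(self, path: str) -> None:
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        self._tail = 0
        self._lock = threading.Lock()

    def append(self, payload: bytes) -> int:
        # Only the offset reservation is serialized; it is very short.
        with self._lock:
            offset = self._tail
            self._tail += len(payload)
        # The write runs outside the lock, so several can be in flight at once.
        written = 0
        while written < len(payload):
            written += os.pwrite(self.fd, payload[written:], offset + written)
        return offset

    def sync(self) -> None:
        os.fsync(self.fd)

    def close(self) -> None:
        os.close(self.fd)


def demo() -> bool:
    """Write 64 records from 8 threads and check each one at its offset."""
    records = [f"entry-{i:04d}".encode() for i in range(64)]
    with tempfile.TemporaryDirectory() as tmp:
        log = ParallelLog(os.path.join(tmp, "raft.wal"))
        with ThreadPoolExecutor(max_workers=8) as pool:
            offsets = list(pool.map(log.append, records))
        log.sync()
        ok = all(
            os.pread(log.fd, len(rec), off) == rec
            for rec, off in zip(records, offsets)
        )
        log.close()
    return ok
```

A real log would also need per-record checksums and a recovery rule for ranges that were reserved but never fully written before a crash; the sketch leaves those out.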

Once the store pool implements the above, its throughput should be significantly higher than that of the apply pool, but it may also consume more resources and hurt overall performance: the apply pool falls behind, too many committed logs accumulate and cause OOM, and so on. The performance of the whole pipeline is limited by its slowest stage, so back pressure has to be applied according to that stage, for example by adjusting the number of threads in the store pool and the apply pool so that their speeds match. But splitting the work across multiple thread pools is inconvenient and inflexible. To avoid manual tuning, we merged the store pool and the apply pool into a single thread pool. Achieving this requires both the raft engine and kv db to use async I/O. In theory, though, kv db does not need to write a WAL at all, because the data can be replayed from the raft log, and a design for this already exists; for the Hackathon we simply removed the WAL of kv db. Besides async I/O, a CPU scheduler is also needed to ensure that when the CPU becomes the bottleneck, the different kinds of tasks in a single thread use resources proportionally, for example with the original store pool tasks and apply pool tasks each getting 50% of the CPU.
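
Proportional sharing inside a single thread can be sketched with a virtual-time scheduler: each task class accumulates virtual time as cost divided by weight, and the class with the smallest virtual time runs next. This is a generic technique chosen for illustration, not the scheduler the team built. `ProportionalScheduler`, the class names, and the caller-declared `cost` are all assumptions; a real scheduler would measure the CPU time each task actually used.

```python
from collections import Counter, deque
from fractions import Fraction


class ProportionalScheduler:
    """Hypothetical single-threaded scheduler sharing CPU by weight."""

    def __init__(self, weights):
        self.weights = dict(weights)
        self.queues = {name: deque() for name in self.weights}
        # Virtual time per class; exact fractions avoid rounding drift.
        self.vtime = {name: Fraction(0) for name in self.weights}

    def submit(self, cls, task, cost=1):
        self.queues[cls].append((task, cost))

    def run_next(self):
        ready = [name for name, q in self.queues.items() if q]
        if not ready:
            return None
        # Run the class that has received the least weighted service so far.
        cls = min(ready, key=lambda name: self.vtime[name])
        task, cost = self.queues[cls].popleft()
        task()
        self.vtime[cls] += Fraction(cost) / self.weights[cls]
        return cls


def share(weights, steps):
    """Keep every class backlogged and count which class runs at each step."""
    sched = ProportionalScheduler(weights)
    for name in weights:
        for _ in range(steps):
            sched.submit(name, lambda: None)
    return Counter(sched.run_next() for _ in range(steps))
```

With equal weights for `"store"` and `"apply"`, `share({"store": 1, "apply": 1}, 100)` runs 50 tasks of each class, which matches the 50% split mentioned above. Changing the weights changes the split without touching any thread counts.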

With the CPU scheduler, more thread pools can be merged to achieve a truly unified thread pool: the gRPC thread pool, scheduler worker pool, unified read pool, RocksDB background threads, backup thread pool, and so on. The CPU scheduler allocates a certain proportion of resources to the tasks of each original thread pool and can adjust that proportion dynamically. This improves performance stability when resources are tight, makes the system self-adaptive, and avoids manual parameter tuning.

The biggest technical difficulty encountered

Chen Yilin: The technologies we used this time were all cutting-edge and low-level. We ran into many unexpected issues in the libraries we depended on and in the Linux kernel, and some things did not behave as expected when we wrote the code. For example, the thread-per-core library we used did not work on most kernels when we wanted to preempt based on latency.
We also tried many kernels on AWS. The default kernel provided by AWS Linux had many problems with io_uring; we later moved to a newer kernel and finally got it to work. File systems were another issue. We commonly use two, ext4 and xfs, and they behave somewhat differently for asynchronous writes. We tried a variety of kernels and file systems before finally finding a combination whose asynchronous write behavior basically met our expectations. The biggest problem overall was that the technology we used was too immature; we hit many pitfalls in the kernel, which was quite painful.

Any regrets during the competition?

Chen Yilin: Unfortunately, time was tight, and we did not get to tune the whole system as well as we wanted. The final result was a little worse than we had imagined. We spent most of the time getting the project up and running and behaving roughly as we expected.
There was one funny thing about the competition. I actually did not know what our team slogan was. When I arrived on site, I found that the small print under the team flag read "Either I am the champion or I am the clown". A few days ago, I suddenly noticed that Zhao Lei had changed his avatar to a clown...

Experience from this competition

Chen Yilin: When Zhao Lei shared the other technical architecture with us before, it was only a concept. What would happen if it were actually applied to TiDB? Were these really TiKV's problems? We were not very clear. Through this Hackathon, we proved that the idea is correct to a certain extent: it is indeed useful, and TiKV improved accordingly. I think the TPC project has validated a sound path for the evolution of TiKV.

What are your expectations for the future of the project?

Chen Yilin: I think applying the Hackathon technology stack directly to TiKV may not be particularly feasible. As Mr. Tang Liu mentioned in his evaluation, we ran into a lot of problems with io_uring, but we could switch to Linux AIO or something similar. Meanwhile, work such as making the Raft engine asynchronous can move forward on its own. The bigger value of this project is that it points out a possible direction for TiKV's evolution.

PingCAP
