This article is based on a talk given by Huang Wei, senior database architect at WeBank, at PingCAP DevCon 2021. It covers how WeBank uses TiDB in practice: why WeBank chose TiDB, how TiDB is deployed, and how it is applied in the core loan batch scenario. It closes with best practices drawn from WeBank's TiDB optimization work and plans for the future.
Product advantages of TiDB
From WeBank's first contact with the TiDB team at the end of 2018 to the first production launch in 2019, TiDB showed many distinctive advantages during database selection.
TiDB is compatible with the MySQL protocol and with MySQL ecosystem tools for backup, recovery, monitoring and so on. For applications, operations staff and developers alike, the cost and barrier of migrating from MySQL to TiDB are low. Because TiDB natively separates computing from storage, users do not have to worry about capacity limits or single-machine performance bottlenecks; to some extent, TiDB can be used as one big MySQL. TiDB's multi-replica storage with strong consistency is very important for financial scenarios. TiDB also natively supports multi-IDC deployment, which lets applications run in a same-city multi-active architecture. In addition, the TiDB open source community is very active. On the AskTUG platform, for example, you can find solutions to many typical user problems and a lot of valuable experience to learn from, which further lowers the barrier to using TiDB.
More and more users are adopting TiDB, from large Internet companies to the financial industry. This reflects the growing maturity of the product and gives new users greater confidence in using TiDB.
TiDB's deployment architecture in WeBank
Can the features of TiDB meet the high-availability architecture requirements of financial institutions?
The figure shows the deployment architecture of TiDB at WeBank. TiKV keeps three replicas, deployed across three data centers in the same city, which provides IDC-level high availability. A set of TiDB Servers is deployed in each IDC and bound to a load balancer that exposes a VIP, so applications can access the cluster in multi-active mode. This architecture has been verified by a real IDC-level failure drill: the network of an entire IDC was disconnected, and the cluster was observed to recover quickly. We therefore believe that TiDB can meet the high-availability requirements of financial scenarios.
Core batch accounting scenario
Core batch accounting for loans is a classic and very important scenario in the financial industry, and we have moved it onto TiDB. The following figure shows the previous architecture of WeBank's loan core batch application. On the left are many business units, which is equivalent to splitting user data into units. The data in each unit may differ, but the architecture and deployment model are the same. The bottom layer is a single-instance database, and the batch runs on each single-instance database. Finally, the batch results are sent to the big data platform through ETL for downstream use. So what are the bottlenecks or optimization points of this architecture?
It is a pure batch application, which means a large volume of batch writes, updates and computations over an extremely large data set, from hundreds of millions to more than a billion rows. As the business grows rapidly, the number of loan records, the number of users and the volume of transaction flow data keep increasing. A stand-alone database is first of all limited by its own performance ceiling, so the batch takes longer and longer to run. The load on the stand-alone databases was already very high, with IO and CPU utilization at 70% to 80%. Raising application concurrency to speed up the batch is risky, because an overloaded database may cause master/backup replication delay or prevent a fast master/backup switchover, so efficiency is hard to improve. Second, it is very difficult to add columns or manage data on tables with hundreds of millions or billions of rows in a stand-alone database. Although WeBank uses online DDL tools such as pt-online-schema-change for table changes, there is still a small probability of table locking. Finally, to make better use of resources, the batch system and the online system shared the same stand-alone database, so a heavy batch load was likely to affect online transactions. Based on these issues, WeBank used TiDB to optimize and upgrade the architecture.
The upgraded architecture is shown in the figure below. WeBank synchronizes and aggregates the data of each business unit into TiDB in real time through the DM tool. The batch app then performs batch calculations directly on TiDB and transfers the results to the big data platform. This uses the horizontal scalability of TiDB to scale batch efficiency horizontally. The previous architecture was traditional MySQL master/backup, which required the app server to be deployed in the same IDC as the MySQL master node. Accessing the master across data centers adds considerable network latency and lengthens the batch, so the app servers in the other IDCs stayed in standby, which wasted resources. In the TiDB architecture, all TiKV nodes can serve reads and writes at the same time, so batch tasks can be started in multiple IDCs simultaneously to maximize resource utilization.
Using TiDB in WeBank's core loan business scenario brings three main benefits.
1. Improved batch efficiency. The left side of the figure below compares the batch duration for one of WeBank's loan businesses. Under the single-instance architecture the batch ran for more than three hours. After WeBank upgraded and optimized the architecture with TiDB, the time dropped to about 30 minutes, a substantial improvement in efficiency.
2. Linear horizontal expansion. WeBank needs not only higher efficiency but also horizontal expansion, that is, elastic scaling, because the number of IOUs and users keeps growing with the business. If hot spots or other bottlenecks exist, further improvement becomes very difficult. The right side of the figure below compares batch durations. With the initial resources, the batch takes about 25 minutes. When the data volume doubles, the time rises to 50 minutes. When the resources are then doubled as well, the time falls back to about 26 minutes, which shows that the system scales close to linearly. Beyond the efficiency gain, the major benefit of linear scalability is that, as the number and amount of loans grow rapidly, this architecture removes the worry about technical bottlenecks during rapid business growth, so the business can focus more on the product itself. This is real business value brought by TiDB.
3. Separation of the batch system from the online transaction system. As mentioned earlier, the batch system previously shared a database with the online system for resource reasons. After the split, the two are completely separated, and there is no master/backup replication delay as with a stand-alone database, so resources can be used to the full to improve batch efficiency.
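The scaling figures quoted in benefit 2 can be checked with a little arithmetic. The sketch below is only an illustration: the function name and the definition of efficiency (ideal time divided by measured time) are assumptions made here, not something taken from the talk.

```python
def scaling_efficiency(base_minutes, measured_minutes, data_factor, resource_factor):
    """Return 1.0 for perfectly linear scaling and a smaller value otherwise.

    Under linear scaling the run time grows with the data volume and shrinks
    with the resources, so the ideal time is base * data / resources.
    """
    ideal_minutes = base_minutes * data_factor / resource_factor
    return ideal_minutes / measured_minutes


# Figures from the text: 25 minutes at baseline, 50 minutes with twice the data,
# about 26 minutes with twice the data and twice the resources.
double_data = scaling_efficiency(25, 50, data_factor=2, resource_factor=1)
double_both = scaling_efficiency(25, 26, data_factor=2, resource_factor=2)
print(f"2x data: {double_data:.2f}, 2x data and 2x resources: {double_both:.2f}")
```

By this measure the 26-minute run sits at roughly 96% of ideal linear scaling, which is consistent with the claim that the architecture scales close to linearly.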
These benefits show clear results. So what optimizations did WeBank make, and what problems did it encounter?
1. SQL pattern optimization. Because of its distributed architecture, TiDB has a higher latency per request than MySQL. Requests that interact frequently with the database therefore need to be batched to minimize round trips: multiple SELECT statements are merged into one SELECT with IN, multiple single-row INSERT statements into one INSERT with multiple values, and multiple UPDATE statements into one REPLACE with multiple values. In addition, because the data of many units is aggregated into one TiDB cluster, individual tables become very large. An inefficient SQL statement can easily bring the cluster down, for example through OOM, so SQL review and tuning deserve special attention. Another example is inaccurate execution plans in earlier versions. Version 4.0 supports SQL execution plan binding, so high-frequency SQL can be bound to make it run more stably. Because WeBank adopted TiDB relatively early, it mainly uses the optimistic locking mode, and many applications have been adapted to it. The adaptation code has now been consolidated into a common module that new systems use directly when they are connected.
2. Hotspot and application concurrency optimization. Frequent TiDB users are probably familiar with hotspot issues. Elastic scaling, mentioned earlier, requires the data to be sufficiently scattered. When WeBank first adopted TiDB, it found that features carried over from MySQL, such as Auto Increment, could cause hotspots. User card numbers and debit numbers may also be consecutive. WeBank therefore made adjustments in both areas. Auto Increment was changed to Auto Random. For card numbers, the distribution ranges of the data were calculated in advance according to its distribution pattern, and the tables were pre-split with the Split Region feature, so that every node is fully used when a large volume of data is written in a burst. In addition, small tables that are modified rarely but read frequently are cached in the application to relieve read hotspots. Besides scattering the data, the application itself also has to be made distributed. An App Master node shards the data and distributes the shard tasks evenly across the App nodes, and the status and progress of every shard task is monitored during the run. Through this joint optimization of data and application, the system as a whole achieves horizontal scalability.
3. Data synchronization and data verification optimization. As mentioned earlier, WeBank uses DM to aggregate the data of the business units. The DM 1.0 version used early on had no high-availability feature, which is fatal in financial scenarios. DM 2.0 steadily delivered several features, including high availability, compatibility with gray-scale DDL, and better ease of use. The second part is data verification. Because this is a core batch scenario, data synchronization must guarantee that no data is lost or corrupted, so the application embeds checksum logic. The checksum of each shard is written into a table upstream and synchronized to the downstream TiDB through DM. During the batch run, the application loads each shard from TiDB, recomputes the checksum of that shard, and compares the upstream and downstream values. This verification mechanism ensures data consistency.
4. Fault drills and contingency plan optimization. This system used to be a MySQL-based batch system, and after migration to TiDB some failure scenarios may behave unexpectedly, so WeBank ran many failure drills. The first kind simulates failures of each TiDB component node to confirm that the application tolerates them; when a batch is interrupted, the application can also resume from the breakpoint. The second kind is rerunning the whole batch, which may be needed because of program bugs or unexpected problems. To restore the starting state quickly for a rerun, the application developed fast backup and rename-based flashback functions. The third kind covers extreme scenarios, such as the entire TiDB cluster becoming unavailable. Here WeBank combined Dumpling and Lightning to back up and restore the whole cluster quickly. The difficulties include quickly confirming the restore point for DM synchronization and manually pre-splitting large tables. The final verification results met the requirements for correctness and timeliness. Because this architecture involves many data flows, a lot of drills were carried out for failure scenarios and corresponding SOPs were prepared.
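To make the request batching from item 1 concrete, here is a minimal sketch of how an application might build batched statements. It is not WeBank's code: the table and column names are hypothetical, the helper names are invented for illustration, and the `%s` placeholders assume a MySQL driver such as PyMySQL that uses that parameter style.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items, so one statement never grows unbounded."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batched_select(table, key_column, keys):
    """One SELECT ... IN (...) instead of one SELECT per key."""
    placeholders = ", ".join(["%s"] * len(keys))
    sql = f"SELECT * FROM {table} WHERE {key_column} IN ({placeholders})"
    return sql, list(keys)


def batched_write(verb, table, columns, rows):
    """One multi-value INSERT or REPLACE instead of one statement per row.

    Table and column names are interpolated directly, so they must be trusted
    constants; only the row values travel as bound parameters.
    """
    if verb not in ("INSERT", "REPLACE"):
        raise ValueError("verb must be INSERT or REPLACE")
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = (
        f"{verb} INTO {table} ({', '.join(columns)}) VALUES "
        + ", ".join([row_placeholder] * len(rows))
    )
    params = [value for row in rows for value in row]
    return sql, params


# Hypothetical table `loan_receipt`: two updates folded into one REPLACE.
sql, params = batched_write(
    "REPLACE", "loan_receipt", ["receipt_id", "balance"], [(1, 100), (2, 200)]
)
print(sql, params)
```

Note that REPLACE deletes the existing row and inserts a new one, so every column of the row has to be supplied, not just the changed ones. The chunk size passed to `chunked` is a tuning knob: larger batches mean fewer round trips but larger transactions.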
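The pre-splitting idea from item 2 can be sketched as follows. The talk does not describe the actual algorithm, so this example assumes a simple quantile approach: take a sample of the keys that will be written, pick boundaries that give each Region a similar share of rows, and emit a statement in the shape of TiDB's `SPLIT TABLE ... BY` syntax. The table name `loan_card` and the integer row keys are hypothetical.

```python
def quantile_split_points(sample_keys, region_count):
    """Pick up to region_count - 1 boundaries that divide the sample into similar-sized ranges."""
    ordered = sorted(sample_keys)
    if region_count < 2 or not ordered:
        return []
    points = [ordered[i * len(ordered) // region_count] for i in range(1, region_count)]
    # A skewed sample can yield the same boundary twice; drop duplicates
    # and sort again so the boundaries stay in ascending order.
    return sorted(set(points))


def split_table_statement(table, split_points):
    """Build a SPLIT TABLE ... BY statement for integer row keys."""
    values = ", ".join(f"({point})" for point in split_points)
    return f"SPLIT TABLE {table} BY {values}"


# Hypothetical sample: keys 0..999 spread evenly, split into 4 Regions.
points = quantile_split_points(range(1000), 4)
print(split_table_statement("loan_card", points))
```

Deriving boundaries from a sample rather than from the minimum and maximum key matters when card numbers cluster in certain ranges, because evenly spaced boundaries would then leave some Regions nearly empty and others overloaded.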
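The shard-level verification from item 3 can be illustrated with a short sketch. The talk does not say which checksum WeBank uses, so the SHA-256 scheme, the function names and the in-memory dictionaries standing in for the upstream checksum table and the downstream shard data are all assumptions made for this example.

```python
import hashlib


def shard_checksum(rows):
    """Order-independent checksum of a shard.

    Each row is hashed on its own and the row digests are sorted before being
    combined, so the result does not depend on the order rows are read in.
    repr() stands in for a real canonical row serialization.
    """
    row_digests = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest() for row in rows
    )
    return hashlib.sha256("".join(row_digests).encode("utf-8")).hexdigest()


def mismatched_shards(upstream_checksums, downstream_rows_by_shard):
    """Return the shard ids whose recomputed downstream checksum differs from the upstream one."""
    mismatches = []
    for shard_id, expected in upstream_checksums.items():
        actual = shard_checksum(downstream_rows_by_shard.get(shard_id, []))
        if actual != expected:
            mismatches.append(shard_id)
    return sorted(mismatches)


# Upstream writes one checksum per shard; downstream recomputes from the synced rows.
upstream = {0: shard_checksum([(1, "a"), (2, "b")]), 1: shard_checksum([(3, "c")])}
downstream = {0: [(2, "b"), (1, "a")], 1: [(3, "x")]}
print(mismatched_shards(upstream, downstream))
```

In this example shard 0 passes even though its rows arrive in a different order, while shard 1 is flagged because one value changed. A shard that is missing downstream is also flagged, since the checksum of an empty shard will not match the upstream value.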
WeBank began research and POC work in 2018 and launched its first TiDB application in 2019. TiDB is now used at WeBank across loans, interbank business, technology management, basic technology and other areas, and more core business scenarios are undergoing POC testing. Plans for the future cover five areas:
1. Cloud native and containerized TiDB. This can bring benefits such as better automated operations and maintenance and more flexible resource allocation.
2. A persistence solution based on Redis + TiKV. The main goal is to replace the Redis + MySQL solution and use TiKV's native high availability for persistence.
3. Low-cost applications based on SAS disks. WeBank has many archiving scenarios with particularly large data volumes, because regulation requires the data to be retained for a long time. For such scenarios, which need large storage capacity but are accessed infrequently, WeBank will run pilots of TiDB on low-cost SAS disks.
4. TiDB on the domestic ARM platform. Last year WeBank already had business running on TiDB on ARM. With the trend toward localization, investment in this area will continue to increase.
5. Evaluation and application of TiFlash. TiFlash provides HTAP capability, which will be of great help in scenarios such as real-time risk control and lightweight AP queries. This is also a key planning direction for WeBank.