About the author: Lei Yu is a TiFlash R&D engineer at PingCAP. He graduated from Sun Yat-sen University with a major in software engineering and currently works mainly on the research and development of TiDB SQL MPP.
This article, shared by PingCAP R&D engineer Lei Yu, analyzes from a macro perspective what TiDB can do, what value it creates, and some of the design considerations behind its development. The article is organized into four parts:
- First, the evolution of data management technology;
- Second, what can TiDB do?
- Third, how do you use TiDB?
- Fourth, the future of TiDB HTAP.
The evolution of data management technology
First, briefly review the evolution of data management technology.
- In the 1970s, IBM developed System R, the world's first relational database and the first database to use SQL as its query language, laying the foundation for the design of later relational databases.
- In the 1980s and 1990s, relational databases grew rapidly and a large number of commercial products emerged, such as the well-known Oracle, IBM's DB2, Microsoft's SQL Server, and the popular open source relational databases PostgreSQL and MySQL. During this period, the technical focus was mainly on enriching database functionality, such as stored procedures, triggers, and various kinds of indexes, to meet different business needs.
- In the early 2000s, the world entered the Internet era and data began to grow exponentially; traditional relational databases could not accommodate such huge volumes of data. At this point, some Internet companies took the lead in opening up their internal approaches to processing massive amounts of data. Around 2004, Google published three landmark papers describing its distributed file system GFS, distributed computing framework MapReduce, and distributed storage system BigTable. Guided by these three papers, the Hadoop ecosystem flourished. Distributed NoSQL databases such as Cassandra and MongoDB also appeared during this period. Traditional relational databases kept developing as well, and some key technologies emerged, such as MySQL's InnoDB engine and Oracle RAC. A single database product could no longer meet every user need, and the technical directions of the data processing field diverged sharply: the OLTP field remained occupied by traditional relational databases, while the OLAP field later became the main battlefield of big data technology.
- In the early 2010s, thanks to hardware advances, memory capacity grew and network bandwidth and latency improved greatly, and database architectures underwent major change: in-memory databases and distributed databases went into production at scale. Representative products: Google Spanner, SAP HANA, SQL Server Hekaton, Amazon Aurora. During this period, the boundary between OLTP and OLAP gradually began to blur, and HTAP was proposed: mixing OLTP and OLAP and handling both workloads on the same database at the same time, returning to the original intention of database products.
- The late 2010s continued this momentum. Various NewSQL databases appeared that can carry more complex workloads. Representative products: CockroachDB, TiDB, VoltDB, Azure Cosmos DB. The various technologies began to move in different directions.
On the whole, since 2000 big data technology has been part of the Internet ecosystem, and building data warehouses with big data technology has become common. Although the concept of the data warehouse appeared back in the 1990s, data warehouse products were not open source at the time and the industry lacked consensus. After Hadoop was open sourced, it gradually became the mainstream choice, giving rise to what is now called the traditional data warehouse architecture.
Traditional data warehouse architecture
As shown in the figure above, on the left is the database used by the OLTP online business. Because analysis cannot be run directly on it, data changes or full data are generally imported into the Hadoop platform through MySQL binlog CDC or ETL jobs that read the database directly. Hive and other tools in Hadoop then analyze the data and generate reports, and the results are written to another OLTP database: the Data Serving layer on the right, which is used to present the offline analysis results. Finally, the Data Serving layer presents the data to the application.
Because this pipeline is very long, and because of various implementation constraints in Hadoop, this architecture could initially only achieve T+1: data written during one day could only be computed and made available the next day.
Although this solved massive storage and computation, it gave up the real-time nature of data processing. In recent years, as real-time requirements have grown, a new architecture emerged to guarantee it: the Lambda architecture.
Lambda real-time data warehouse
The Lambda architecture adds a real-time computing layer, generally called the Speed Layer, alongside offline Hadoop. In the early days, streaming engines such as Spark Streaming or Storm were mainly used to collect OLTP data directly and compute it as real-time data, which was then merged with the offline T+1 data and provided to the application. In this way, the application gets relatively fresh data.
In the traditional data warehouse era, only T+1 could be achieved. With the Lambda architecture, T plus a fraction of a day becomes possible: yesterday's data can be combined with the data produced so far today and processed together. But can data analysis become even more real-time on this basis?
Kappa real-time data warehouse
The Kappa architecture came into being. The pain point of the Lambda architecture is its very complex maintenance, because data must be written both to the offline T+1 part and to the real-time part, and the results of the two parts then have to be merged. With the Kappa architecture, the real-time computing layer simply pulls the changes of the OLTP business on demand, and the computed results can be presented directly. However, for performance reasons, this architecture has not yet been widely adopted.
It can be seen that throughout the evolution of data warehouse architectures, real-time data has become a common demand, and massive data processing capability is indispensable as well. With that in mind, let's see what TiDB can do.
What can TiDB do?
Before TiDB 4.0
When TiDB 1.0 was released, its architecture diagram was as follows; this is also many people's first impression of TiDB.
The architecture of TiDB is very simple. At the front is a Load Balancer, which spreads user SQL requests across the TiDB Servers. TiDB Server is a stateless computing layer that can be scaled out at will. The actual data is stored in TiKV, a distributed KV store. In addition, the PD component schedules and rebalances the data across the cluster.
The most prominent aspect of this architecture is scalability, which is its first priority. Scalability shows up in two ways. One is storage: TiDB can hold data volumes that a traditional single-machine database cannot carry by placing them in distributed storage. The other is computing: a traditional single-machine database cannot withstand higher QPS, but with this scale-out approach, QPS can be spread across different computing nodes.
Before TiDB 4.0, we continued with this architecture. The following summarizes what TiDB could do before 4.0:
- Relational database compatible with MySQL protocol and features;
- Inherently horizontally scalable storage, with no need for sharding (sub-database and sub-table);
- Carries online services with tens of millions of QPS;
- Separation of computing and storage, flexible resource allocation;
- A high-quality carrier for the data warehouse Serving layer (data middle platform).
First, TiDB is positioned as a relational database compatible with the MySQL protocol and MySQL features. It is horizontally scalable: both computing and storage can scale out, with no need for sharding (sub-database and sub-table). On this basis, because computing can scale horizontally, it can carry high-QPS online services; and with storage and computing separated, it provides a foundation for flexible resource allocation.
But beyond our expectations, many open source community users treated TiDB as a high-quality carrier for data warehouses. TiDB can store massive amounts of data while providing a convenient access interface, so many users naturally took it as the middle layer of their data warehouse.
Before TiDB 4.0, this usage was not considered in the design at all, so there were many problems: computation was single-node and could not be scaled out, and some heavier computing tasks were not supported. In addition, TiDB's storage engine TiKV uses a row-based storage format. The advantage of row storage is that it handles concurrent transactions well in OLTP scenarios, but its performance in OLAP scenarios is not ideal.
In response to these user requirements, we developed TiFlash, TiDB's columnar storage engine, to carry TiDB's OLAP workload. In TiDB 4.0, TiFlash officially became a member of the TiDB family.
After TiDB 4.0
Before 4.0, the community already provided TiSpark. TiSpark is essentially a Spark plug-in that lets Spark read and write the data in a TiDB cluster. However, using Spark to access TiDB data has a problem: its highly concurrent table-scan requests affect the OLTP workload on TiKV itself.
With TiFlash, the OLAP and OLTP workloads can be completely isolated while consistency is still guaranteed. TiFlash achieves consistency through the Raft replication protocol. Readers familiar with Raft will know that it is a synchronous replication protocol in which all data is represented as a log. Each log entry has a globally consistent ID, which is also the index of its position. If there are two log entries, 4 and 5, the Raft protocol guarantees that 5 is written only after 4, and that when 5 is written, all clients (TiDB) can already read 4, thereby satisfying linearizability.
Generally speaking, in Raft only the leader serves reads and writes, but with an optimization, once a learner or follower has caught up to the read index obtained from the leader, it can serve reads directly from its own data. TiFlash uses exactly this mechanism to synchronize data from the TiKV cluster and achieve linear consistency. The advantages of this approach are:
First, if binlog or similar mechanisms were used to synchronize data to the columnar analysis engine, there would be additional transmission overhead or middleware-like processing overhead. By writing directly through the Raft protocol, when a piece of data is written to the leader it goes through the Raft quorum confirmation process, and by that time the data has already been sent to TiFlash for writing. In addition, although the TiFlash write does not need to be confirmed synchronously, its data is replicated with the same priority as TiKV's internal high-availability replicas, which is the key to achieving consistency.
Overall, with TiDB 4.0 the analytical capability has been raised to a new level. At this point, we can proudly say that TiDB is a real HTAP database, with the following features:
- HTAP database in the true sense;
- Isolated OLAP and OLTP loads;
- Analysis-friendly columnar storage with strong real-time performance and strong consistency;
- Integrated deployment and operations, with the optimizer intelligently selecting the storage engine;
- `ALTER TABLE db.table SET TIFLASH REPLICA 1`: you can experience the enhancements brought by TiFlash with a single simple SQL statement.
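As a hedged sketch (the database and table names here are placeholders), adding a TiFlash replica and watching it become available might look like this; replication progress is exposed through the `information_schema.tiflash_replica` system table:

```sql
-- Ask the cluster to replicate this table to one TiFlash replica
-- (replication happens through the Raft learner mechanism described above)
ALTER TABLE demo.orders SET TIFLASH REPLICA 1;

-- Check replication status; AVAILABLE = 1 and PROGRESS = 1 mean the replica is fully usable
SELECT table_schema, table_name, replica_count, available, progress
FROM   information_schema.tiflash_replica
WHERE  table_schema = 'demo' AND table_name = 'orders';
```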
TiDB 5.0 HTAP
In 5.0, in order to solve the above-mentioned pain points, we developed the MPP of TiDB. Let's first understand what MPP is.
When executing SQL, TiDB uses the Volcano model. Its advantage is that operators can be decoupled; its disadvantage is that upstream and downstream calls are coupled: the upstream operator must ask the downstream for data, and the downstream computes the data and hands it back to the upstream, so the consumption capacity and production capacity of each operator are often badly mismatched. Although TiDB itself has made many optimizations, using parallelism inside operators to speed up computation, in the end it is a single-machine computing engine with a rather low ceiling. To solve this problem, we make full use of the TiFlash nodes.
First, let's see how to achieve it.
A SQL statement enters TiDB and passes through the Parser and Planner to produce a TiDB logical plan; after the logical plan goes through the TiDB optimizer, the optimizer determines whether it is an OLAP request.
If it is an OLAP request, the optimizer chooses whether to read from TiKV or TiFlash based on cost estimates. In this process, Exchange operators are added around the Join operators, following the approach of the Volcano paper, to generate a parallel execution plan, and the fragments of this execution plan are then pushed down to the corresponding TiFlash nodes for execution.
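As a hedged illustration of this engine selection (the table and column names are just examples), the set of storage engines the optimizer may choose from can be constrained per session, and a hint can force a particular table to be read from TiFlash:

```sql
-- Limit the storage engines the optimizer may consider in this session
SET SESSION tidb_isolation_read_engines = 'tidb,tikv,tiflash';

-- Force one query to read the example table `orders` from TiFlash
SELECT /*+ READ_FROM_STORAGE(TIFLASH[orders]) */
       o_orderstatus, COUNT(*)
FROM   orders
GROUP  BY o_orderstatus;
```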
Let's look at a practical example.
The example above uses the TPC-H data set, which has a table called lineitem that stores all the line-item (product) information, generally about 600 million rows, along with the orders table, the fact table of product orders. After a simple join of the two, we add a Count(*) aggregation. Under the MPP architecture, the plan looks different from before. Previously there would be two Table Scans under the Join; if the computation were done in TiDB, the output of the two Table Scans could be fed directly into the Join operator. With MPP, however, the scanned data is first shuffled by the join key and pushed to the corresponding computing nodes; when the overall computation is complete, the result is pushed to TiDB and returned to the user.
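The query being described is essentially the following sketch, written with the standard TPC-H table and column names (and assuming MPP is enabled, as shown in the summary below):

```sql
-- Count matching rows after joining the ~600-million-row lineitem table with orders;
-- under MPP, both scans are shuffled by the join key across the TiFlash nodes
SELECT COUNT(*)
FROM   lineitem
JOIN   orders ON lineitem.l_orderkey = orders.o_orderkey;
```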
This brings two benefits. On the one hand, if the computation were done on a single TiDB node, a large amount of data would have to be held in memory, and the data might not even fit, in which case it would have to spill to disk and computational efficiency would be very low. After shuffle partitioning, the amount of data each computing node has to process becomes smaller and can fit entirely in memory, which gives an acceleration effect. On the other hand, MPP can use the CPUs of multiple machines at the same time, so in theory it can achieve very strong scalability.
To verify the performance of TiDB MPP, we compared it with other products. The cluster has three nodes, each with NVMe SSDs, to eliminate as much as possible the impact of storage reads on overall computing speed.
As shown in the figure above, the blue bars show the performance of TiFlash MPP, and the bar length represents execution time: the shorter, the better. The figure shows that, compared with Greenplum and Apache Spark, MPP holds the advantage on most queries. The reasons are twofold: on the one hand, TiDB 5.0 itself integrates a columnar computing engine with very strong performance; on the other hand, the advantage of the MPP architecture over batch-processing engines is that all tasks run in parallel without depending on one another, so they can be executed concurrently in a better way. The disadvantage is that, compared with batch processing, it cannot support an excessively large amount of data; in most scenarios, however, the MPP architecture is quite sufficient.
To summarize TiDB's MPP:
- It supports multiple parallel execution algorithms:
  - Broadcast Join;
  - Repartition (Shuffle) Join;
  - Two Phase Aggregation;
  - One Phase Aggregation.
- Scalable, complex query processing capability.
- Highly integrated with TiDB; the optimizer selects it automatically.
- After upgrading to TiDB 5.0, you only need to turn on the switch `SET tidb_allow_mpp=ON` to use it.
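A hedged sketch of using these switches (the threshold values shown are only examples; defaults vary by version):

```sql
-- Enable MPP execution for the current session (available since TiDB 5.0)
SET SESSION tidb_allow_mpp = ON;

-- Below these thresholds the optimizer tends to pick Broadcast Join,
-- above them Repartition (Shuffle) Join
SET SESSION tidb_broadcast_join_threshold_count = 10240;      -- rows
SET SESSION tidb_broadcast_join_threshold_size  = 104857600;  -- bytes
```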
Together with the MPP architecture, several other new features introduced in TiDB 5.0 greatly improve TiDB's HTAP capabilities:
OLTP:
- Async Commit and 1PC provide lower transaction latency.
- Clustered Index improves latency and throughput under certain workloads.
OLAP:
- SQL MPP greatly improves TiDB's ability to handle complex queries.
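A hedged sketch of how the OLTP-side features above are switched on and used (the table definition is a made-up example; the variables' defaults depend on how the cluster was created or upgraded):

```sql
-- Async Commit and 1PC are controlled by system variables
SET GLOBAL tidb_enable_async_commit = ON;
SET GLOBAL tidb_enable_1pc = ON;

-- A clustered index stores row data in the primary key index itself,
-- removing one extra lookup for primary-key queries
CREATE TABLE user_events (
    user_id    BIGINT       NOT NULL,
    event_time DATETIME     NOT NULL,
    payload    VARCHAR(255),
    PRIMARY KEY (user_id, event_time) CLUSTERED
);
```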
The above covered TiDB's functional features and product capabilities at different stages; the following explains in detail how TiDB is actually used.
How do you use TiDB?
Based on user feedback and our own analysis, we have identified the most common scenarios in which TiDB is used.
Transaction/analysis integration
The first is the integration of transactions and analytics. In this scenario, the data volume is generally moderate, that is, at the TB level.
With MySQL alone, this kind of computation cannot be done well, so the data usually has to be imported into an analytical database such as ClickHouse or Greenplum for computation, and the resulting reports are then presented. With TiDB, the two parts can be combined: TP writes go directly to TiDB, AP computations also run directly on TiDB, and the results are presented from there. This greatly reduces operation and maintenance costs and may also bring performance improvements.
Scenarios with integrated transactions and analytics are fairly common, for example CRM and ERP systems; these are also the most complete HTAP scenarios, the ones we value most. However, Internet companies generally cannot rely on this pattern alone; they also need an offline part to process massive amounts of data.
Therefore, in this system, TiDB is mainly used for real-time analysis.
Real-time analysis
Business data flows through Kafka + Flink, where it is pre-aggregated or flattened, and the results are written to TiDB for the application to query. This is a common real-time analysis architecture. If the application's online business already uses TiDB, the whole setup is even more natural: TiDB's CDC feature can be used directly to feed data into Flink for processing.
Since the entire architecture is very practical, it has been widely used in multiple business scenarios, and examples will be given later.
Real-time analysis: Flink architecture
There are also several common architectures for using Flink in real-time analysis.
- Use Flink MySQL connector to parse MySQL CDC
In the first architecture, the front-end business uses MySQL, for example in a sharded (sub-database and sub-table) deployment. The Flink MySQL connector captures the MySQL data changes, and the data is then written to TiDB.
- Use Kafka to push Canal JSON and other formats
In the second architecture, the MySQL binlog is processed by middleware such as Canal, written to Kafka for Flink to consume, and finally written to TiDB. This approach is more common.
- Use TiCDC to push Canal JSON to Kafka
In the third architecture, the front end already uses TiDB. Through TiDB's CDC feature (TiCDC), changes are output in Canal JSON format to Kafka for consumption, and Flink then writes the data to downstream databases such as TiDB or to other sinks.
- Data warehouse acceleration layer / ODS layer
There is also a common solution, the acceleration layer or ODS layer of the data warehouse.
The most common usage is that the data warehouse keeps a separate acceleration layer. With TiDB, the two parts can be combined: user data is written into TiDB in a variety of ways, some ETL is performed inside TiDB, the data then flows into offline computation, and the results are finally fed back to TiDB. TiDB can thus directly provide real-time data analysis services to the outside world, which is also a very popular architecture.
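To make the second and third architectures above concrete, here is a hedged Flink SQL sketch: a Kafka topic carrying Canal JSON change events is consumed, aggregated, and the result is upserted into TiDB through the JDBC (MySQL protocol) connector. All table, topic, and host names are hypothetical.

```sql
-- Flink SQL: consume change events in Canal JSON format from Kafka
CREATE TABLE orders_changes (
    order_id BIGINT,
    status   STRING,
    amount   DECIMAL(18, 2)
) WITH (
    'connector' = 'kafka',
    'topic' = 'orders-binlog',                      -- hypothetical topic
    'properties.bootstrap.servers' = 'kafka:9092',  -- hypothetical broker
    'properties.group.id' = 'flink-orders-demo',
    'format' = 'canal-json'
);

-- Sink table in TiDB, written through the JDBC connector (TiDB speaks the MySQL protocol)
CREATE TABLE order_stats (
    status STRING,
    cnt    BIGINT,
    total  DECIMAL(18, 2),
    PRIMARY KEY (status) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://tidb-host:4000/demo',     -- hypothetical TiDB address
    'table-name' = 'order_stats'
);

-- Continuous aggregation: results are upserted into TiDB as changes arrive
INSERT INTO order_stats
SELECT status, COUNT(*) AS cnt, SUM(amount) AS total
FROM orders_changes
GROUP BY status;
```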
Applications
Next, I will share some real-life company cases.
Zhongtong Express Logistics
The first is ZTO Express (Zhongtong), which everyone is familiar with; it is now one of the largest express delivery companies in the world. In recent years, they began trying TiDB for package tracking management. In the early days, they used TiSpark for computation, assembled the data into wide tables written into TiDB, and then performed aggregations on them. Recently, they have been testing the 5.0 MPP architecture to see whether TiDB 5.0 can provide more help.
Zhongtong Express
- The largest express delivery company in the world.
- Full-link life cycle management of logistics.
- The same TiDB platform serves both package tracking management and real-time reports.
- The peak QPS is 120,000+.
- Real-time statistical analysis.
- Connect to offline platforms through TiSpark.
ZTO's architecture is as shown above. First, package tracking is an online business: data is written into TiDB through Spark Streaming and analyzed in real time. The archived data in TiDB is then sent to ZTO's big data platform for computation, and the results of that computation are written back to TiDB. In this architecture, TiDB is the integration layer of the entire real-time computing pipeline.
Xiaohongshu (Little Red Book)
Xiaohongshu is a content community platform that also does vertical e-commerce, and it currently has a very large number of users and visits.
Xiaohongshu's early architecture used sharded MySQL (sub-database and sub-table) for the business. Business data was sent to offline systems through ETL, and after T+1 computation the results were written back to another sharded MySQL cluster to provide data services externally. The offline data warehouse was also used for risk-control-related business.
The pain point of this architecture is T+1, which is very uncomfortable for both the business and operations. After trying TiDB, they upgraded the architecture.
Currently, the online business layer still uses sharding, but business data is also written directly to TiDB in some simple ways. At the same time, TiDB feeds the data to the offline layer, and after offline processing the results are written back to TiDB.
In this architecture, TiDB is used directly for data analysis and risk control services, and the overall pipeline has gone from T+1 to T+0. According to feedback from Xiaohongshu engineers, using TiDB saves a lot of the operational work of maintaining sharded MySQL, which is also one of TiDB's advantages.
PatSnap (Zhihuiya)
PatSnap is a SaaS provider that delivers big data intelligence services to more than 10,000 technology companies, universities, scientific research institutions, and financial institutions in over 50 countries around the world.
PatSnap
- A fast-growing technology-innovation SaaS provider, serving more than 10,000 technology companies, universities, scientific research institutions, and financial institutions in over 50 countries around the world.
Real-time data warehouse
- Deployed in AWS cloud environment.
- Uses AWS Kinesis / AWS EMR Flink for data warehouse modeling.
All of PatSnap's business is deployed on AWS. In the early days, PatSnap used AWS Redshift for data analysis, but Redshift's speed was not particularly ideal. To get better real-time performance, PatSnap began trying TiDB to build a real-time data warehouse. The data warehouse architecture is very similar to the other companies': Flink handles real-time data processing, various data is written into TiDB, and the results are presented directly to the data application.
These cases are all typical scenarios of using TiDB for real-time data analysis. Some of them lean more toward HTAP, such as the Xiaohongshu architecture, where online business data is written directly to TiDB, making full use of TiDB's OLTP capabilities.
After seeing so many cases, we can also imagine the future of TiDB HTAP.
The future of TiDB HTAP
First, and most importantly: since 5.0, TiDB can be used for complex computation, and we hope more real-time scenarios will emerge to put it to the test.
What does SQL MPP mean?
With SQL and MPP, we get faster computation and the ability to carry more complex computing tasks, coupled with strongly real-time data and strong consistency guarantees. With all of this, what can we do?
Live streaming scenario
First, the live streaming scenario. When a top streamer goes live, users rush in all at once. At that moment, the users' information is inserted into the visit fact table, and the streamer's live room also updates its dimension table. If this architecture follows the traditional approach, Flink might be used to process the data, but there is a problem: the concurrency is very high and the work must be completed in a short time, so Flink would need to maintain rather complicated watermarks and other state, and the preprocessing may itself introduce some delay.
If TiDB is used directly to carry these workloads, the data can be analyzed the moment it is written, analysis reports can be generated, and the results fed back to the platform or the streamer in time for timely business adjustments. Of course, this live streaming application is still hypothetical, and we look forward to TiDB landing in live streaming scenarios.
Real-time risk control scenario
Another scenario is real-time risk control. Online platforms constantly generate transactions and transfers, and the fraud incidents frequently reported in the news are related to this. In fact, financial and other trading platforms generally run risk control services to detect and prevent such incidents.
One problem with earlier risk control is that the crime happens so quickly that the fraud is already complete before the risk control rules have even been triggered. This not only causes economic losses to users, but also hampers the efficiency of police investigations.
If TiDB is applied to the risk control business, an illegal transaction can be analyzed the moment it occurs and trigger the risk control strategy. The end-to-end latency is greatly reduced, which also helps the relevant authorities solve cases faster.
We also welcome everyone to imagine more application scenarios for TiDB HTAP and envision the future of TiDB together.