OPPO Cloud Database Access Service Technology Revealed

1. Background

MySQL is the most widely used relational database in OPPO. Microservices in different programming languages are directly connected to real database instances through the official MySQL SDK. This most traditional usage will bring a series of problems affecting efficiency and stability to business development and database operation and maintenance.

Unreasonable database access, such as inappropriate connection pool settings, non-intercepting high-risk SQL, unlimited flow and circuit breaker and other governance capabilities
No elastic scaling capability. When the performance or capacity of a single-machine database is insufficient, capacity expansion is very cumbersome and inefficient.
Lack of commonly used features, does not support read-write separation, shadow database table, unitized architecture, database encryption
Does not support cross-language, and is strongly coupled with application services, coding and upgrading are very difficult

For the above problems, we talk about solving a series of problems caused by single MySQL through CDAS.

2. Introduction to CDAS products

MySQL has been tested in terms of concurrency, stability, and high availability. It is the preferred storage component for most Internet-connected products in OLTP scenarios. However, single-machine MySQL is inherently insufficient in ultra-high concurrency, mass storage, and OLAP capabilities. Therefore, when the data reaches a certain level After the order of magnitude, the method of sub-database and sub-table is generally used to horizontally expand MySQL and improve the overall processing capacity of the system.

CDAS is designed based on the concept of sharding. The purpose is to solve the performance bottleneck of MySQL single machine, and to provide external MySQL services that exceed concurrency, mass storage, and support HTAP. A set of highly available agent + computing clusters is built in front of a set of MySQL clusters, providing sharding capabilities, automatic elastic scaling capabilities, read-write separation, shadow database, data encryption storage capabilities, and providing users with integrated products.

CDAS Proxy is developed in Java language, based on the open source product Apache ShardingSphere Proxy, and adds many advanced features to support internal business. Our development philosophy is to build services by relying on mature open source products. While absorbing nutrients from the community, we also return the discovered problems and features to the community, and actively participate in and develop together with the community.

2.1 Features of CDAS products

(1) Stability

CDAS has carried out a lot of benchmarking and tuning, and implemented a special dynamic queue + thread pool model to ensure that the overall concurrency of the Proxy will not decrease due to frequent context switching due to too many threads. In this scenario, Full GC will not cause business interruption. At present, CDAS has been used in several business scenarios with QPS: 5000+ in the official environment, and the performance is stable.

(2) Highly scalable

We recommend that the business estimates the total amount of data in the next 3 to 5 years to set the total number of shards when applying for a sharding table. In the early stage, 1 to 2 databases can be applied, and the capacity can be expanded after the amount of data becomes larger.
Theoretically, the scalability is related to the total number of shards. A database instance with the same maximum number of scalable shards can be extended to a maximum of 250TB. At the same time, the Proxy cluster also supports online capacity expansion.

(3) Platform-based operation and maintenance

Unified operation, maintenance and deployment by the cloud platform, MySQL and Proxy both support automated process application and change, and automatically access the metadata management system.

(4) Compatible with MySQL syntax

Most MySQL syntax is supported.

100% compatible with routing to single data shards
Support common query scenarios such as paging, deduplication, sorting, grouping, aggregation, and association

(5) Feature-rich

Multiple sharding algorithms
Self-incrementing distributed ID
Read-write separation, master-slave synchronization tolerance threshold can be set
shadow library
HTAP
Complete monitoring information: audit log, slow log, authentication

3. Core Design

The core goal of CDAS is to solve the problem of mass data storage and access for users, focusing on the sharding scenario, which is mainly reflected in the logic of parsing, routing, rewriting, and aggregation.

3.1 Kernel Architecture

Looking at the Proxy kernel from a height, it is mainly divided into three parts: the connection access module, the I/O multiplexing module, and the parsing and routing execution engine. You can see on the left that the thread model is also divided into three parts, which actually involve There are other sub-logics for thread resources, such as parallel execution of thread pools, etc. Next, we will learn about the details of the kernel architecture around multiple models and execution processes.

3.2 Threading Model

The kernel entry layer is implemented based on Netty. On the whole, it is a typical Netty usage scenario. The kernel diagram is mainly divided into three parts as can be seen.

(1) Boss Thread

Responsible for Accept connection, that is, accepting and establishing connections. Clients generally use connection pooling technology, so there will not be too many requests to establish connections, and they can be processed by a single thread.

(2) I/O Threads

Namely Netty EventLoopGroup, responsible for encoding and decoding, initiating auth authentication, based on Epoll event-driven I/O multiplexing, one thread can handle multiple connections, total number of CPU cores * 2 threads.

(3) Worker Threads

Responsible for core processes such as executing and writing back data.
Worker threads are synchronous as a whole. Parser\Router\Rewriter\RateLimter is pure CPU computing. The bottom layer of Executor Engine and ResultManager is based on JDBC specifications. JDBC currently exposes a synchronous calling model. The thread status is blocked before the database does not respond. , so the configuration of the Worker thread pool determines the overall concurrency capability of the Proxy.
Thread resources are relatively expensive resources in the operating system. When the number of CPU cores is fixed, too many threads will consume a lot of memory, cause frequent context switching, and reduce performance. During the stress test, we paid extra attention to the impact of the number of threads on the service, and found a suitable calculation formula.

Math.min(cpuNum * X, maxThreadsThreshold)

X: Default 75, configurable
maxThreadsThreshold: default 800, configurable

At the same time, we developed DynamicBlockingQueue by ourselves to create new threads in advance to execute tasks after the backlog of tasks in the queue reaches a certain threshold.

3.3 Connection Model

After accepting a client connection, the Proxy internally maintains a logical connection of a Backend Connection for each client connection. The connections list of Backend Connection saves the real connections used in the execution process. After the request is executed or the transaction ends, the connections list is cleared and the physical connection is released.

At the same time, Proxy internally uses connection pooling technology to save physical connections, reduce connection establishment time and control the total number of connections.

3.4 Transaction Model

Next, let's talk about the execution flow in the transaction scenario.

3.4.1 Single database transaction

Assume the following transaction scenario:

 begin;
select * from t_order where order_id = 1;
update t_order set gmt_modify = now() where order_id = 1;
update t_order_item set cnt = 2 where order_id = 1;
commit/rollback;

The transaction process will interact with the database sequence 5 times in total.
Statement 1: Proxy does not know what statement to execute after the start, and cannot route to RDS, but only records a begin status on the connection Statement 2: Execute select with shard key, or take 16 shards as an example, the final statement It will be routed to the table t_order_1, get connection_db1, execute begin first, then select, and put it into the connection list of the logical connection

connection_db1 is the connection on the database where the t_order_1 table is located

Statement 3, statement 4: route to t_order_1, t_order_item_1, and statement 2 to the same DB, reuse connection_db1 to execute statement 5: simple commit or rollback statement, take out all RDS connections of the current logical connection, and execute commit/rollback one by one
In the process of transaction execution, whether a distributed transaction will be generated is completely controlled by the user's SQL statement. If the database is not switched during the transaction execution process, it will degenerate into a simple RDS transaction, maintaining complete ACID characteristics. In the case of routing to multiple DBs, distributed transactions will occur.

3.4.2 Distributed Transactions

At present, there are three main solutions for distributed transactions:

(1) eventually consistent

The eventual consistency scheme initiated by the business, such as Seata\TCC, etc., has a sense of business, and is mostly used in cross-service call scenarios, such as order and inventory systems. The failure of a certain loop submission requires specific logic to roll back as a whole, which is not suitable for proxy scenarios.

(2) Strong consistent XA

It is mostly used in cross-service calling scenarios. At present, there are problems such as poor performance and single-point\Recovery lock occupation of the coordinator. It is not suitable for high concurrency scenarios.

(3) Manufacturers provide distributed transactions

Distributed database vendors such as OceanBase themselves will shard data, and multiple distributed nodes are operated during transaction execution to introduce distributed transactions. ACID is guaranteed for transactions in a single distributed library. The Proxy backend is based on the MySQL protocol, and cannot implement such transactions between multiple RDSs.

(4) Status Quo

CDAS Proxy does not provide XA to guarantee strong consistency across RDS instances, nor does it support eventual consistency internally. If a multi-database transaction is triggered, it will be submitted separately, and there is a risk of some successful submissions and some failures. Therefore, it is recommended that the business side try not to generate cross-database transactions within the RDS cluster when designing transactions.

3.5 Sharding Model

Taking the scenario of 4 libraries and 16 shards as an example to illustrate the difference between the CDAS sharding scheme and the traditional scheme

In the traditional scheme, the shard names in each DB start with a suffix of 0, and each library has the same number of shards.
There are the following disadvantages:

Difficulty dealing with hotspot sharding
Unable to migrate for shards
The shard name is meaningless and cannot be explained by itself

In response to the above problems, we adopt a similar range sharding idea to optimize the shard name.

The shard name is unique, and the shard names located in different DBs are different from each other. This design method is very similar to that different slots in Redis/MongoDB are located on different instances, which is conducive to expansion.

t_order_1 is a hotspot shard or when DB1 is under too much pressure, migrate t_order_1 to the new DB5:

Of course, this migration method is not suitable for all scenarios. For example, the sharding algorithms of tables t_order and t_order_item are exactly the same, so they must fall into the same DB. If only one of the tables is migrated during the migration process, it will inevitably lead to unnecessary Distributed transactions, so the tables in the binding relationship will be migrated together when migrating.

3.6 Execution process

Taking the simplest scenario as an example, the logical table: t_order is divided into 16 slices, and the slice key: order_id
select * from t_order where order_id=10

(1) Analysis

After the worker thread gets this statement, the first thing to do is to parse the SQL statement and implement syntax tree parsing based on the open source product Antlr.

 {type：select，table：t_order，condition: order_id=10}

(2) Routing

After obtaining the sharding table name and conditions, obtain the sharding rule according to the table name and calculate the sharding
order_id mod 16 => 10 mod 16 => 10
Get the real table: t_order_10

(3) Rewrite

The rewrite process will replace the logical table name with the real table name and calculate which DB the real is located in.

 select * from t_order_10 where order_id=10

(4) Current limiting

The current limiting function can control the instantaneous concurrency of certain types of SQL. Currently, only some scenarios are limited, such as OLAP requests.

(5) Execute

The statement will be sent to the real database for execution. Currently, OLTP requests are sent using MySQL official Connector/J, OLAP requests are sent using ClickHouse official SDK, and AP traffic will not be routed + rewritten, because the overall design is distributed in the ClickHouse cluster The formula table name and the logical table name are the same

(6) Aggregation results

The main scenario for aggregation is when query requests are routed to more than one real table, such as

 select * from t_order where order_id in (10,11)

In this case, a SQL will eventually be two sub-requests at the DB level

 select * from t_order_10 where order_id in (10,11)
select * from t_order_11 where order_id in (10,11)

During aggregation, the ResultSets of the two sub-requests are traversed through the JDBC API, and a new logical ResultSet is created in the memory. When the logical ResultSet is traversed, the sub-ResultSet will be triggered to be traversed, and finally the logical ResultSet is encoded in the MySQL protocol package and sent to the client.

If the SQL statement contains order by, it will be merged in memory during aggregation. In-memory sorting needs to load all row data into memory for sorting, and the memory pressure will be very large. Therefore, although there is sorting ability, we still do not recommend using the order by syntax in scenarios where multiple sub-requests are split.

3.7 HTAP

HTAP refers to a mixture of OLTP (Online Transactional Processing) and OLAP (Online Analytical Processing), which supports both online transaction requests and background analysis requests in the database.

MySQL is a typical OLTP type database and is not suitable for OLAP type business, because analysis statements generally need to perform complex aggregation and query a large amount of data, which may affect its stability on OLTP. At the same time, OLAP databases generally use column storage to maximize compression efficiency and space occupation, while MySQL uses row storage. Under normal circumstances, the entire row of records must be queried before filtering out certain columns. There is obvious IO read amplification, and the efficiency is not high.

CDAS transmits and aggregates the data of each shard of MySQL to ClickHouse (analytical column storage database) through the DTS (data transfer) service to form a unified data view. After the AP traffic reaches the Proxy, it will be forwarded to ClickHouse, and the data response will be returned to the client in the form of MySQL data packets. In this way, CDAS implements the HTAP capability based on the MySQL protocol.

4. Deployment Architecture

The deployment topology is as follows:

In the modern architecture, the business side has higher and higher requirements for high availability. At present, most important services are deployed in a multi-machine room master-standby or mutual master-standby solution. Generally, a dual-machine room or dual-center solution is used to ensure that one After the area hangs up, it can still serve the outside world.

At the beginning of the design, CDAS considers that it has extremely high availability requirements as the data traffic entrance of the L7 layer, so it allows users to apply for a double-room Proxy cluster to ensure that after one of the computer rooms hangs up, the other computer room can still undertake external traffic services. Achieve high availability at the computer room level.

The business system establishes a connection with the Proxy through L4 load balancing, and the L4 and the Proxy closely coordinate to ensure graceful offline and new instance discovery, and sense the health status of the Proxy node, so that the unavailable node can be offline in time. At the same time, L4 is used for priority routing and traffic forwarding in the same computer room. All MySQL clusters behind Proxy adopt the semi-sync high-availability architecture mode.

This deployment architecture enables CDAS to have room-level disaster recovery capabilities.

5. Performance test

During the test, we paid special attention to the performance loss, and compared the Proxy proxy and the directly connected RDS, so that we can intuitively get the performance loss degree.

5.1 Pressure measurement tool

Internal pressure testing platform (based on JMeter internally)

5.2 Pressure measurement method

Java SDK + connection pool pressure test is mainly divided into two groups

Stress test Proxy, Proxy connects to 2 RDS clusters
Stress test a single RDS cluster as a benchmark for proxy performance comparison

Stress test statement:

Referring to the database stress test statement of SysBench, since the sharding scenario generally does not directly use the primary key ID for sharding, and the query efficiency of the primary key ID in InnoDB is higher than that of the secondary index, adding the sharding_k column for sharding test is more in line with practical applications Scenes.

 CREATE TABLE `sbctest_x` (  
`id` bigint(11) NOT NULL,
`sharding_k` bigint(11) NOT NULL,
`k` int(11) DEFAULT NOT NULL,  
`name` varchar(100) NOT NULL,  
`ts` timestamp NULL DEFAULT NULL,  
`dt` datetime DEFAULT NULL,  
`c` char(100) DEFAULT NULL,  
`f` float DEFAULT NULL,  
`t` tinyint(4) DEFAULT NULL,  
`s` smallint(6) DEFAULT NULL,  
PRIMARY KEY (`id`),  
KEY `sharding_k_idx` (`sharding_k`),
KEY `k_idx` (`k`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

压测SQL
#QPS 非事务的简单查询
SELECT COUNT(1) FROM $table WHERE sharding_k=?

#TPS 包含 1条查询，4条变更语句
BEGIN;
SELECT c FROM $table WHERE sharding_k=?
UPDATE $table SET k=k+1 WHERE sharding_k=?
UPDATE $table SET c=?,ts=?,dt=? WHERE sharding_k=?
DELETE FROM $table WHERE sharding_k=?
INSERT INTO $table (id,sharding_k,k,name,ts,dt,c,pad,f,t,s) VALUES(?,?,?,?,?,?,?,?,?,?,?)
COMMIT;

5.3 Stress Test Report

Take the 8C16G specification as an example:
SSD\100 slices\single slice 250W rows\data volume is larger than InnoDB buffer pool

(1) QPS scene

The Proxy QPS is only 1/2 of the RDS QPS, and the RT observed during the stress test is about 2 times
The QPS scenario is mainly a scenario of a large number of non-transactional queries. The performance loss of the proxy is about 1/2. During the QPS stress test, we observed that the CPU usage of the RDS instance behind the proxy is low, so after doubling the CPU specification of the proxy, Proxy QPS immediately doubled, approaching the performance of directly connected RDS.

(2) TPS scene

Because TPS involves transaction commit, log and other operations, I/O is frequent, and the bottleneck is mainly I/O, even SSD disk performance is not high.

Two RDS instances are mounted behind the Proxy. The number of concurrent users in TPS is significantly higher than that of a single RDS directly connected, and the number of concurrent users can be doubled, and the performance loss is almost negligible.

(3) Conclusion

If non-transactional query requests account for the vast majority of business scenarios, and the CPU utilization of RDS is high, it is recommended to select the Proxy specification with a higher CPU than the RDS specification. If there are many transaction execution requests in the business scenario, the Proxy will not become a Performance bottleneck, specifications can be consistent with RDS

6. Case

At present, it supports multiple online businesses, and has a complete indicator display and alarm mechanism. Let us give an example of a business console interface and monitoring diagram.
Proxy list:

Single node TPS:

Single node QPS:

During the use of multiple products, CDAS has performed steadily, and we will continue to promote it to more traffic scenarios to serve more business parties.

7. Summary and Outlook

CDAS was born for data sharding, builds internal standards based on MySQL sharding scenarios, realizes cross-language accessibility through MySQL protocol, unifies usage, and simplifies usage costs.

In the future, we will continue to focus on the performance improvement and functional supplement of CDAS, which are mainly reflected in the following two aspects:
1. The ShardingSphere community has planned to replace the database driver of Proxy from MySQL Connector/J to the event-driven component vertx-mysql-client (a Reactive MySQL Client), based on Netty to implement I/O multiplexing and event-driven interaction with MySQL

After the replacement, the worker thread pool will change from blocking waiting to event-driven when executing statements. A small number of threads can support a large amount of concurrency. We will continue to pay attention to the development progress of the community and actively participate in it.

2. During the interaction between CDAS and MySQL, Rows data packets will be automatically decoded, and a large number of objects will be generated and will be destroyed in the blink of an eye, resulting in increased GC pressure, and encoding and decoding will also consume a lot of CPU time.

Rows data packets can be sent in zero-copy based on Netty ByteBuf except in special scenarios such as non-decryption, which is the best solution. We plan to optimize this part of the logic after replacing the open source version with vertx-mysql-client, so that the response time of most requests can reach the performance close to Layer 4 load balancing.

About the Author

jianliu OPPO Senior Backend Engineer

At present, he mainly focuses on database agents, registration centers, and cloud-native related technologies. He has worked in Zhe 800, Jingdong Mall

For more exciting content, please scan the code and follow the [OPPO Digital Intelligence Technology] public account