Foreword

With the rapid growth of Dewu App users and an increasingly rich set of business lines, the pressure on the underlying databases has grown as well. Requirements such as data sharding, read-write separation, and shadow-database routing have become hard requirements for every business line, so a unified middleware layer was needed to support them. This is how "Rainbow Bridge" came into being.

In Norse mythology, the Rainbow Bridge (Bifröst) is a huge bridge connecting Asgard, the realm of the gods, with Midgard, the world of humans. It can be thought of as the passage between the Nine Realms and the only stable entrance into Asgard. Dewu's Rainbow Bridge is the database middleware layer that connects services and databases; it is fair to say that every Dewu order passes through it.

1. Technology selection

In the early stage we researched open-source middleware such as MyCat, ShardingSphere, kingshard, and Atlas, weighing applicability, strengths and weaknesses, reputation, community activity, production case studies, and extensibility. We finally chose ShardingSphere and planned to build on it through secondary development, customizing a database middleware suited to Dewu's internal environment.

Apache ShardingSphere is an open-source distributed database ecosystem project consisting of three products: JDBC, Proxy, and Sidecar (planned). Its core adopts a pluggable architecture, with features extended through components. Facing upward, it provides many enhanced capabilities on top of the database protocol and SQL, including data sharding, access routing, and data security; facing downward, it natively supports MySQL, PostgreSQL, SQL Server, Oracle, and other storage engines. ShardingSphere became a top-level project of the Apache Software Foundation on April 16, 2020, and is used by teams in many countries around the world.

At present we mainly provide services through ShardingSphere's Proxy mode, and will continue to explore the JDBC & Proxy hybrid architecture going forward.

2. Rainbow Bridge's current capabilities

[Figure: Rainbow Bridge capability map]

In the figure, the white modules are the capabilities available today, and the green modules are planned or in-progress features. A few key features are described below; note that all of them are based on the Proxy mode.

2.1 Data sharding

Data sharding means distributing data that would otherwise live in a single database across multiple databases or tables according to some dimension, in order to relieve performance bottlenecks and improve availability. In practice this means splitting a relational database into multiple databases and multiple tables. Both database sharding and table sharding can effectively avoid query bottlenecks caused by data volume exceeding an acceptable threshold. In addition, database sharding disperses single-point access pressure on the database; table sharding, while it does not relieve database pressure, makes it possible to convert distributed transactions into local transactions as far as possible, since cross-database update operations tend to make distributed transactions more complicated. Combined with a multi-master, multi-replica sharding scheme, single points of failure for data can be avoided, improving the availability of the data architecture.

Splitting data across databases and tables to keep each table's data volume below a threshold, and diverting traffic to cope with high access volume, are effective means of handling high concurrency and massive data. Data sharding can be further divided into vertical sharding and horizontal sharding.

Splitting by business is called vertical sharding, also known as vertical partitioning; its core idea is "dedicated databases for dedicated purposes". The main sharding capability Rainbow Bridge provides is horizontal sharding, also known as horizontal partitioning. Unlike vertical sharding, it no longer classifies data by business logic; instead, it scatters data across multiple databases or tables according to some rule over one or more fields, with each shard containing only part of the data. For example, sharding by primary key: records with even primary keys go to database (or table) 0, and records with odd primary keys go to database (or table) 1, as shown in the figure below.

[Figure: horizontal sharding by primary-key parity]
Horizontal sharding theoretically removes the single-machine bottleneck on data volume and processing, and scales out relatively freely, making it the standard approach to data sharding.
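
To make the parity rule above concrete, here is a minimal, self-contained sketch that maps a primary key to a physical database and table. It is not Rainbow Bridge's actual routing code; the two-database, two-table layout and the naming scheme are assumptions for illustration.

```java
/**
 * Minimal illustration of primary-key based horizontal sharding:
 * even keys go to shard 0, odd keys go to shard 1.
 * The database/table counts and names are assumptions, not Dewu's real layout.
 */
public class ParityShardingExample {

    private static final int DB_COUNT = 2;
    private static final int TABLE_COUNT = 2;

    /** Route a primary key to a physical database, e.g. order_db_0 / order_db_1. */
    static String routeDatabase(long id) {
        return "order_db_" + (id % DB_COUNT);
    }

    /** Route a primary key to a physical table inside that database. */
    static String routeTable(long id) {
        return "t_order_" + (id / DB_COUNT % TABLE_COUNT);
    }

    public static void main(String[] args) {
        for (long id : new long[]{1001, 1002, 1003, 1004}) {
            System.out.printf("id=%d -> %s.%s%n", id, routeDatabase(id), routeTable(id));
        }
    }
}
```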

Of course, sharding rules in real scenarios are far more complicated. We provide several built-in algorithms such as modulo, hash modulo, automatic time-range sharding, and inline expressions. When the built-in algorithms cannot meet a requirement, dedicated sharding logic can also be customized with Groovy.


2.2 Read-write separation

Facing ever-increasing system traffic, database throughput becomes a major bottleneck. For applications with many concurrent reads and relatively few writes, splitting the database into a master and one or more slaves, where the master handles transactional inserts, updates, and deletes and the slaves handle queries, effectively avoids row locks caused by data updates and greatly improves the query performance of the whole system. With a one-master, multi-slave configuration, query requests can be spread evenly across multiple copies of the data, further increasing the system's processing capacity.

Unlike horizontal sharding, which scatters data across data nodes according to a sharding key, read-write separation routes read and write operations to the master and slave libraries respectively, based on analysis of the SQL semantics.


The configuration here is relatively simple: bind one or more slave libraries to the target master library and set the corresponding load-balancing algorithm.

The implementation routes query statements to the corresponding slave library through SQL parsing. However, in scenarios that are sensitive to master-slave replication lag, it may be necessary to force queries to the master. For this we also provide an API (implemented with SQL Hints underneath), so that upstream code can force specific modules to go to the master; a related global configuration can also make all read requests inside a transaction go to the master.
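
A minimal sketch of the routing decision described above, assuming a hypothetical "FORCE_MASTER" hint marker and a simple round-robin strategy; it is not ShardingSphere's or Rainbow Bridge's actual implementation.

```java
import java.util.List;
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Illustrative read-write-splitting router: writes and force-master reads go to
 * the master, other reads are load-balanced across slaves round-robin.
 * The "FORCE_MASTER" marker is a made-up placeholder, not the real Hint syntax.
 */
public class ReadWriteRouter {

    private final String master;
    private final List<String> slaves;
    private final AtomicInteger counter = new AtomicInteger();

    public ReadWriteRouter(String master, List<String> slaves) {
        this.master = master;
        this.slaves = slaves;
    }

    public String route(String sql, boolean forceMasterInTransaction) {
        String normalized = sql.trim().toLowerCase(Locale.ROOT);
        boolean isRead = normalized.startsWith("select");
        boolean forceMaster = sql.contains("FORCE_MASTER") || forceMasterInTransaction;
        if (!isRead || forceMaster) {
            return master;                       // writes and hinted reads hit the master
        }
        int idx = Math.floorMod(counter.getAndIncrement(), slaves.size());
        return slaves.get(idx);                  // plain reads are balanced across slaves
    }

    public static void main(String[] args) {
        ReadWriteRouter router = new ReadWriteRouter("master_ds", List.of("slave_ds_0", "slave_ds_1"));
        System.out.println(router.route("SELECT * FROM t_order WHERE id = 1", false));
        System.out.println(router.route("/* FORCE_MASTER */ SELECT * FROM t_order WHERE id = 1", false));
        System.out.println(router.route("UPDATE t_order SET status = 2 WHERE id = 1", false));
    }
}
```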

2.3 Shadow database stress testing

Under a microservice-based distributed architecture, a business request is completed through a chain of service and middleware calls, so stress testing a single service no longer reflects the real scenario. Rebuilding a full stress-testing environment comparable to production in a test environment is too costly and usually cannot reproduce the complexity and traffic of the online environment. The industry therefore usually chooses full-link stress testing, that is, running the stress test in the production environment, so that the results accurately reflect the system's real capacity and performance.

Full-link stress testing is a complex and massive undertaking. It requires coordination across microservices and middleware to propagate the stress-test identifier along with normal traffic, and usually a dedicated stress-testing platform to support different test plans. At the database level, data isolation is required: to keep production data reliable and intact, data generated by the stress test must be routed to the stress-test (shadow) database, so that it does not pollute real data in the production database. This requires that, before SQL is executed, the data can be classified according to the stress-test identifier passed through, and the corresponding SQL routed to the corresponding data source.

The configuration here is similar to read-write separation: a shadow library is bound to the target master library, and when a SQL statement carries the shadow tag it is routed to the shadow library.
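
A minimal sketch of the idea, assuming a hypothetical shadow flag carried with the request context; the flag and the binding structure are placeholders, not Rainbow Bridge's actual configuration model.

```java
import java.util.Map;

/**
 * Illustrative shadow-database routing: if the request carries a stress-test
 * (shadow) flag, the SQL is sent to the shadow data source bound to the target
 * master; otherwise it goes to the production data source.
 */
public class ShadowRouter {

    /** master data source name -> shadow data source name (assumed binding). */
    private final Map<String, String> shadowBindings;

    public ShadowRouter(Map<String, String> shadowBindings) {
        this.shadowBindings = shadowBindings;
    }

    public String route(String masterDataSource, boolean shadowFlag) {
        if (shadowFlag && shadowBindings.containsKey(masterDataSource)) {
            return shadowBindings.get(masterDataSource);   // stress-test traffic -> shadow DB
        }
        return masterDataSource;                           // normal traffic -> production DB
    }

    public static void main(String[] args) {
        ShadowRouter router = new ShadowRouter(Map.of("order_ds", "order_ds_shadow"));
        System.out.println(router.route("order_ds", false));  // order_ds
        System.out.println(router.route("order_ds", true));   // order_ds_shadow
    }
}
```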

2.4 Rate limiting & circuit breaking

When the pressure on a DB exceeds its own water level, the DB will fail. After estimating the water level along a given dimension, we can configure corresponding rate-limiting rules to reject requests beyond that level and protect the DB, letting the system run as close to its maximum throughput as possible while remaining stable overall. In terms of dimensions, we support DB, table, SQL, and DML type.

Hundreds of RDS instances are connected beneath Rainbow Bridge, and any of them can fail in various ways. When a single instance fails, the whole logical database it belongs to is affected, which quickly causes blocking and can trigger an avalanche. We therefore need a fail-fast mechanism to prevent avalanches. At present we support circuit breaking at the DB-instance level, based on two behaviors, acquiring a connection and executing SQL, and on execution time and failure ratio, so that when a DB instance fails we can fail fast.


2.5 Traffic correction

Under an active-active architecture, Rainbow Bridge, as the database proxy layer, can intercept traffic at the bottom layer during a traffic switch to guarantee data consistency. The principle is to match the userId carried in the SQL Hint against the data center's routing rules and intercept traffic that does not belong to the current data center.
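
A minimal sketch of that check, assuming a simple "userId modulo" unit-routing rule; the rule shape and names are assumptions for illustration only.

```java
/**
 * Illustrative traffic-correction check for an active-active deployment:
 * requests whose userId does not map to the current data center are rejected,
 * preventing writes from landing in the wrong unit during a traffic switch.
 * The modulo-based routing rule is an assumption, not Dewu's real rule.
 */
public class TrafficCorrectionFilter {

    private final int totalUnits;     // number of data centers
    private final int currentUnit;    // the unit this proxy cluster belongs to

    public TrafficCorrectionFilter(int totalUnits, int currentUnit) {
        this.totalUnits = totalUnits;
        this.currentUnit = currentUnit;
    }

    /** Returns true if SQL carrying this userId is allowed to execute in this unit. */
    public boolean allow(long userIdFromHint) {
        return Math.floorMod(userIdFromHint, totalUnits) == currentUnit;
    }

    public static void main(String[] args) {
        TrafficCorrectionFilter filter = new TrafficCorrectionFilter(2, 0);
        System.out.println(filter.allow(1024L)); // true  -> belongs to unit 0
        System.out.println(filter.allow(1025L)); // false -> intercepted
    }
}
```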

3. What we have changed based on ShardingSphere

Although ShardingSphere Proxy is powerful in itself, for Dewu's internal environment it still has some shortcomings and missing features, mainly in the following areas:

  • Ease of use
    a. Rule configuration files for sharding, read-write separation, etc. are too complicated and not friendly to business developers
    b. Dynamic rule changes depend entirely on the configuration center and lack a complete change workflow
    c. Connection pool governance capabilities are incomplete
    d. The Hint approach is not friendly enough; the business side has to write RAL statements
    e. Custom sharding algorithms require a Proxy release
    f. SQL compatibility
  • Stability
    a. Lack of multi-cluster governance capabilities
    b. Lack of isolation between logical databases
    c. Missing rate-limiting and circuit-breaking components
    d. Dynamic changes to data sources and rules are lossy
    e. Lack of traffic correction under the active-active architecture
    f. Releases are lossy
  • Observability
    a. The SQL trace capability is incomplete
    b. Monitoring metrics are not comprehensive enough
    c. Lack of SQL insights
  • Performance: because of the extra network hop, the RT of a single SQL statement is 2~3 ms higher than a direct connection

To solve these problems, we made the following changes and optimizations.

3.1 Ease-of-use improvements

Data source and rule configuration, changes, and audit are all integrated into a console for unified operation. The complexity of rule configuration files for sharding, read-write separation, etc. is reduced through a graphical UI, and a series of validations prevent low-level mistakes caused by configuration file errors. An audit workflow was also added to keep dynamic configuration changes safe and controlled.

The management console adds connection pool management, which automatically calculates a safe and reasonable connection pool size based on factors such as the RDS instance's connection limit, the number of Proxy nodes, and the number of mounted data sources.
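
A minimal sketch of such a calculation, assuming a simple "divide the RDS connection budget evenly across Proxy nodes and data sources, with some headroom" formula; the formula and numbers are illustrative assumptions, not the console's actual algorithm.

```java
/**
 * Illustrative connection-pool sizing: split the RDS connection budget across
 * all Proxy nodes and mounted data sources, keeping headroom for administration
 * and emergency connections. The 80% factor and the floor value are assumptions.
 */
public class PoolSizeCalculator {

    static int safeMaxPoolSize(int rdsMaxConnections, int proxyNodes, int dataSourcesOnInstance) {
        int budget = (int) (rdsMaxConnections * 0.8);                 // keep 20% headroom
        int perPool = budget / (proxyNodes * dataSourcesOnInstance);  // even split per pool
        return Math.max(perPool, 5);                                  // never drop below a small floor
    }

    public static void main(String[] args) {
        // e.g. RDS allows 4000 connections, 8 Proxy nodes, 10 data sources on this instance
        System.out.println(safeMaxPoolSize(4000, 8, 10));             // -> 40 connections per pool
    }
}
```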

A Client was added to adapt a series of Hint usages, such as shadow-flag propagation, active-active userId propagation, trace propagation, and forced routing. The user only needs to call the Client API, and when the SQL is issued it is automatically rewritten into a Hint-enhanced statement that the Proxy can recognize.
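
A minimal sketch of the client-side idea, assuming a made-up comment-style hint format; the hint keys and syntax are placeholders, not the real protocol between the Client and the Proxy.

```java
/**
 * Illustrative client-side SQL rewriting: context values set through the API
 * are prepended to the outgoing SQL as a comment-style hint that the proxy
 * could parse. The hint syntax used here is a made-up placeholder.
 */
public class HintClient {

    private final StringBuilder hint = new StringBuilder();

    public HintClient shadow(boolean shadow)  { return append("shadow", String.valueOf(shadow)); }
    public HintClient userId(long userId)     { return append("userId", String.valueOf(userId)); }
    public HintClient traceId(String traceId) { return append("traceId", traceId); }
    public HintClient forceMaster()           { return append("forceMaster", "true"); }

    private HintClient append(String key, String value) {
        if (hint.length() > 0) {
            hint.append(", ");
        }
        hint.append(key).append('=').append(value);
        return this;
    }

    /** Rewrite the SQL so the proxy can read the hint before executing it. */
    public String rewrite(String sql) {
        return hint.length() == 0 ? sql : "/* HINT: " + hint + " */ " + sql;
    }

    public static void main(String[] args) {
        String sql = new HintClient().shadow(true).traceId("a1b2c3")
                .rewrite("SELECT * FROM t_order WHERE id = 1");
        System.out.println(sql);
    }
}
```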

The management console also adds a cluster governance function: since we have deployed multiple Proxy clusters, in order to minimize the blast radius of a failure, we split the Proxy clusters by business domain and try to ensure that traffic for logical databases in the same business domain enters the same cluster.

Groovy support was added for custom sharding algorithms. After the sharding logic is configured in the console and passes review and approval, it takes effect without a Proxy release.
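
A minimal sketch of evaluating a Groovy sharding expression from Java using GroovyShell (assuming the Groovy runtime is on the classpath); the expression and variable names are illustrative, not the configuration format used by Rainbow Bridge.

```java
import groovy.lang.Binding;
import groovy.lang.GroovyShell;

/**
 * Illustrative custom sharding via a Groovy expression evaluated at runtime:
 * the script text could come from the console / config center, so changing it
 * does not require a Proxy release. Requires the Groovy runtime on the classpath.
 */
public class GroovyShardingExample {

    public static void main(String[] args) {
        // Expression configured in the console (illustrative): shard by userId into 4 tables.
        String script = "'t_order_' + (userId % 4)";

        Binding binding = new Binding();
        binding.setVariable("userId", 1234567L);

        GroovyShell shell = new GroovyShell(binding);
        Object targetTable = shell.evaluate(script);

        System.out.println(targetTable);   // -> t_order_3
    }
}
```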

3.2 Stability improvements

3.2.1 Proxy multi-cluster governance

(1) Background As the Proxy carries more and more business domains, if all traffic is routed to Proxy nodes purely through load balancing, a problem in a single database may paralyze the entire cluster, and the blast radius becomes uncontrollable. Moreover, because the number of DB connections is limited, Proxy nodes cannot scale horizontally without bound.

(2) Solution To isolate different business domains, we deployed multiple Proxy clusters and maintain the mapping between each logical database and its cluster in the console, ensuring that traffic for logical databases in the same business domain enters the same cluster. When a cluster fails, its logical databases can be switched to a standby cluster quickly, dynamically, and losslessly, minimizing the business impact of a failure of the Proxy itself.

For connection-count management, when initializing a connection pool the Proxy checks whether the current logical database belongs to this cluster. If not, the pool is created with the minimum connection settings; if so, it is loaded with the normal configuration. Around a cluster switch, the target cluster is warmed up beforehand and the original cluster is restored afterwards. This greatly eases the constraint that DB connection pool resources put on horizontal scaling of Proxy nodes.

(3) Implementation principle After an upstream application introduces Rainbow (our self-developed connection pool), before the connection pool is initialized it looks up which cluster the current logical database belongs to and dynamically replaces the Proxy domain name with the domain name of that cluster. It also listens for cluster configuration changes, so that after a cluster switch is performed on the console, Rainbow creates a new connection pool for the new cluster's domain name and replaces the old pool, which is closed gracefully after a delay. The whole process is imperceptible to upstream applications.
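
A minimal sketch of that graceful swap, using HikariCP as a stand-in for the real Rainbow pool and a hypothetical config-center callback; the class names, the listener API, and the 30-second delay are illustrative assumptions.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Illustrative sketch of the cluster-switch behaviour described above: when the
 * cluster domain for a logical database changes, a new pool is built against the
 * new domain and swapped in atomically, while the old pool is closed gracefully
 * after a delay. HikariCP stands in for the real Rainbow pool.
 */
public class ClusterSwitchingPool {

    private final AtomicReference<HikariDataSource> current = new AtomicReference<>();
    private final ScheduledExecutorService closer = Executors.newSingleThreadScheduledExecutor();

    public ClusterSwitchingPool(String clusterDomain) {
        current.set(buildPool(clusterDomain));
    }

    /** Called by a (hypothetical) config-center listener when the cluster mapping changes. */
    public void onClusterChanged(String newClusterDomain) {
        HikariDataSource fresh = buildPool(newClusterDomain);   // warm up the new cluster first
        HikariDataSource old = current.getAndSet(fresh);        // atomic swap, callers see no gap
        closer.schedule(old::close, 30, TimeUnit.SECONDS);      // let in-flight statements finish
    }

    public HikariDataSource dataSource() {
        return current.get();
    }

    private HikariDataSource buildPool(String clusterDomain) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://" + clusterDomain + ":3307/logic_db"); // illustrative URL
        config.setMaximumPoolSize(20);
        return new HikariDataSource(config);
    }
}
```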

(4) Architecture diagram

[Figure: Proxy multi-cluster governance architecture]

3.2.2 Proxy worker thread pool isolation

(1) Background In the open-source version of Proxy, all logical databases share a single thread pool. When one logical database blocks, the thread pool can be exhausted and other logical databases are affected.

(2) Solution We adopted thread-pool based isolation and introduced the concepts of exclusive and shared thread pools. A logical database uses its exclusive thread pool first; when the exclusive pool is queuing, the shared thread pool is used. Once the shared pool reaches a certain load, requests are forced back to the exclusive pool within a time window. This maximizes the use of thread resources while preserving isolation.
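
A minimal sketch of the exclusive-plus-shared idea, with made-up pool sizes and a simplified overflow policy (spill to the shared pool when the exclusive pool rejects); the real Proxy logic with load thresholds and time windows is more involved.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Illustrative exclusive + shared thread-pool isolation: each logical database
 * owns a small exclusive pool; when it is saturated, work spills over to a
 * shared pool. Sizes and the overflow policy are simplified assumptions.
 */
public class IsolatedExecutor {

    private final ThreadPoolExecutor exclusive;
    private final ThreadPoolExecutor shared;

    public IsolatedExecutor(ThreadPoolExecutor exclusive, ThreadPoolExecutor shared) {
        this.exclusive = exclusive;
        this.shared = shared;
    }

    public void execute(Runnable task) {
        try {
            exclusive.execute(task);          // prefer the logical database's own pool
        } catch (RejectedExecutionException e) {
            shared.execute(task);             // spill over to the shared pool when saturated
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor exclusive = new ThreadPoolExecutor(
                4, 4, 0, TimeUnit.SECONDS, new ArrayBlockingQueue<>(16), new ThreadPoolExecutor.AbortPolicy());
        ThreadPoolExecutor shared = new ThreadPoolExecutor(
                16, 16, 0, TimeUnit.SECONDS, new ArrayBlockingQueue<>(256), new ThreadPoolExecutor.CallerRunsPolicy());

        IsolatedExecutor executor = new IsolatedExecutor(exclusive, shared);
        for (int i = 0; i < 100; i++) {
            int n = i;
            executor.execute(() -> System.out.println("task " + n + " on " + Thread.currentThread().getName()));
        }
        exclusive.shutdown();
        shared.shutdown();
    }
}
```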

3.2.3 Flow control and circuit breaking

(1) Background The open-source version of Proxy lacks flow control along dimensions such as database, table, and SQL. When traffic exceeding the system's water level bursts in a short period, it is likely to overwhelm the DB and cause an online failure. It also lacks a fail-fast mechanism against the DB: when a DB fails (for example CPU at 100%), the Proxy cannot fail fast, which quickly leads to blocking and triggers an avalanche effect.

(2) Solution We added rate limiting along each dimension (DB, table, SQL, statement type); each database can configure reasonable thresholds according to its estimated water level and business needs, protecting the DB as much as possible. We also introduced a circuit-breaking strategy at the DB-instance level, so that a failed instance fails fast. In a sharding scenario this minimizes the impact on other shards, further reduces the blast radius of a failure, and improves Rainbow Bridge's overall stability.

(3) Implementation principle Flow control and circuit breaking are implemented on top of Sentinel, and the corresponding rules can be configured on the console.
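
A minimal sketch of using Sentinel for a per-table QPS limit and an exception-ratio circuit breaker, based on Sentinel's public API; the resource naming convention and thresholds are assumptions, and the real integration inside the Proxy is more elaborate.

```java
import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.Tracer;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

import java.util.Collections;

/**
 * Illustrative Sentinel usage: a QPS limit on a table-level resource and an
 * exception-ratio circuit breaker on a DB-instance-level resource. Resource
 * names and thresholds are assumptions, not Rainbow Bridge's actual rule model.
 */
public class SentinelGuardExample {

    public static void main(String[] args) {
        // Rate limit: at most 200 QPS on this (illustrative) table-level resource.
        FlowRule flowRule = new FlowRule("order_db.t_order");
        flowRule.setGrade(RuleConstant.FLOW_GRADE_QPS);
        flowRule.setCount(200);
        FlowRuleManager.loadRules(Collections.singletonList(flowRule));

        // Circuit breaker: open when more than 50% of calls to this DB instance fail.
        DegradeRule degradeRule = new DegradeRule("rds-instance-01");
        degradeRule.setGrade(RuleConstant.DEGRADE_GRADE_EXCEPTION_RATIO);
        degradeRule.setCount(0.5);
        degradeRule.setTimeWindow(10);   // stay open for 10 seconds before probing again
        DegradeRuleManager.loadRules(Collections.singletonList(degradeRule));

        // Guarding a piece of work (e.g. executing SQL against the instance).
        Entry entry = null;
        try {
            entry = SphU.entry("rds-instance-01");
            // execute the SQL here; failures must be reported for the ratio to work
        } catch (BlockException blocked) {
            // rejected by the flow rule or short-circuited by the breaker -> fail fast
        } catch (Exception businessError) {
            Tracer.trace(businessError);  // record the failure for the exception-ratio breaker
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }
}
```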

3.2.4 Lossless publishing

(1) Background Early on, every Proxy release or restart triggered alarms about upstream SQL execution failures. The main reason is that upstream applications hold long-lived connections to the Proxy; if a connection is executing SQL at that moment, it is forcibly disconnected and the SQL fails, causing a certain amount of loss to the upstream business.

(2) Solution The release system cooperates with the self-developed connection pool Rainbow. Before a Proxy node is released or restarted, Rainbow is notified to gracefully close the connections to the nodes about to restart; the restart and release are performed only after traffic on those Proxy nodes has dropped to zero.

3.3 Observability

3.3.1 Runtime metrics

(1) Background The open-source version exposes very few metrics about the Proxy runtime, making it impossible to observe the detailed running state of each database on the Proxy.

(2) Solution We added instrumentation points at each execution stage of the Proxy, added the following metrics, and built the corresponding monitoring dashboards (a minimal instrumentation sketch follows the list).

  • QPS, RT, error, and slow-SQL metrics at the database & table level
  • Connection metrics for the Proxy-DB connection pools
  • Flow-control and circuit-breaking metrics
  • Thread-pool active threads and queue size metrics
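
As an assumption-laden illustration (the article does not say which metrics library is used), here is how such per-database metrics could be recorded with Micrometer; the metric names and tags are invented for the sketch.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.concurrent.TimeUnit;

/**
 * Illustrative per-database/table instrumentation with Micrometer (library and
 * metric names are assumptions). A timer yields QPS and RT, a counter tracks
 * errors; slow SQL could be derived from the timer histogram.
 */
public class ProxyMetricsExample {

    private final Timer sqlTimer;
    private final Counter errorCounter;

    public ProxyMetricsExample(MeterRegistry registry, String logicDb, String table) {
        this.sqlTimer = Timer.builder("proxy.sql.execute")
                .tag("logic_db", logicDb)
                .tag("table", table)
                .register(registry);
        this.errorCounter = Counter.builder("proxy.sql.error")
                .tag("logic_db", logicDb)
                .tag("table", table)
                .register(registry);
    }

    public void recordExecution(long costMillis, boolean success) {
        sqlTimer.record(costMillis, TimeUnit.MILLISECONDS);
        if (!success) {
            errorCounter.increment();
        }
    }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        ProxyMetricsExample metrics = new ProxyMetricsExample(registry, "order_db", "t_order");
        metrics.recordExecution(12, true);
        metrics.recordExecution(350, false);
        System.out.printf("count=%d mean=%.1fms%n",
                metrics.sqlTimer.count(), metrics.sqlTimer.mean(TimeUnit.MILLISECONDS));
    }
}
```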

(3) Dashboards

[Figures: monitoring dashboards built from these metrics]

3.3.2 Full-link tracing

(1) Background The open-source version's trace only covers the Proxy's internal execution phases and cannot be linked with the upstream, which makes troubleshooting inefficient.

(2) Solution Trace information is propagated mainly through RAL statements and SQL comments, enabling full-link tracing from the upstream down into the Proxy.

(3) Implementation principle The first solution we considered was passing trace information through SQL comments, but problems showed up in production. When the upstream uses prepared-statement mode (useServerPrepStmts=true&cachePrepStmts=true), the Proxy caches the statementId and SQL; since the trace is different every time, the statement cache grows without bound and eventually causes OOM.


Passing trace information through SQL comments is therefore only applicable to non-prepared scenarios. So we added a second solution: before each SQL execution, a RAL statement is sent to transmit the trace information, and the Proxy caches the mapping between channelId and trace information. Since all SQL on a single channel executes serially, and the number of channelIds is bounded, the cache does not grow without limit. However, this solution has a noticeable performance cost because of the extra RAL statement before every SQL execution: monitoring shows that each SQL's RT increases by 2-3 ms (network round trip), which is quite painful for interfaces with long call chains.
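
A minimal sketch of the second scheme on the Proxy side, assuming a simple in-memory map keyed by channel id; the RAL syntax itself is not shown, and the class and method names are placeholders.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative proxy-side cache for the RAL-based scheme: trace information
 * arrives on a channel before the SQL, is stored per channelId, attached to the
 * following SQL executions, and removed when the channel closes. Because the
 * number of live channels is bounded, the cache cannot grow without limit.
 */
public class ChannelTraceRegistry {

    private final Map<String, String> traceByChannel = new ConcurrentHashMap<>();

    /** Called when the trace-carrying RAL statement is received on a channel. */
    public void onTraceStatement(String channelId, String traceId) {
        traceByChannel.put(channelId, traceId);
    }

    /** Called when a normal SQL statement is about to execute on the channel. */
    public String currentTrace(String channelId) {
        return traceByChannel.get(channelId);
    }

    /** Called when the upstream connection (channel) is closed. */
    public void onChannelClosed(String channelId) {
        traceByChannel.remove(channelId);
    }
}
```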

(4) To sum up, both solutions have obvious shortcomings. A member of the ShardingSphere community reached out to us for an in-depth discussion of this problem, and Brother Liang proposed a more feasible approach: passing trace information through a virtual column. However, it requires the upstream to rewrite its SQL, which may add a burden to upstream applications, so we have not started on it yet.

3.3.3 SQL Insights

(1) Background Although the Proxy can log both logical SQL and physical SQL, the volume of production requests means that enabling the log puts heavy pressure on IO, so it is currently disabled in our production environment. Moreover, this log cannot be linked with the upstream, so its value for troubleshooting is limited.

(2) Solution At present we only have a rough idea, not a complete design. The desired effect is to collect all logical SQL, physical SQL, and upstream information executed by the Proxy, including SQL that does not go through JDBCDatabaseCommunicationEngine (such as TCL statements), and query it in real time from the console. The final effect would be similar to the paid SQL Insights feature of Alibaba Cloud RDS.

3.4 Bug fixes

For historical reasons, our secondary development is based on Apache ShardingSphere 5.0.0-alpha. We ran into quite a few bugs in actual use; most of them have been reported to the community as issues and fixed in the Rainbow Bridge version, and the ShardingSphere community also provided a lot of help and ideas for the fixes.

3.5 JDBC & Proxy hybrid architecture

(1) Background Most of the four areas above involve changes to the Proxy module or the upstream connection pool, but some problems cannot be solved in Proxy mode for the time being, such as the SQL compatibility and performance issues mentioned earlier. Also, if the entire Proxy cluster goes down, we have no fallback mechanism, so the Proxy mode is not suitable for every scenario. The official Apache ShardingSphere documentation makes a related point:
[Excerpt from the Apache ShardingSphere documentation on ShardingSphere-JDBC, ShardingSphere-Proxy, and their hybrid deployment]

So we plan to explore the JDBC & Proxy hybrid architecture further.

(2) Solution The mode of each logical database is configured on the console, and the self-developed connection pool Rainbow perceives the mode and starts the corresponding type of data source. After the mode is switched in the console, the pool is adjusted dynamically and losslessly, so the change is completely imperceptible to the user. Applications with high requirements on SQL performance and compatibility can be switched to JDBC mode; at the same time, if all Proxy clusters go down there is a fallback, so the whole site does not collapse.

(3) Implementation principle
When the Rainbow connection pool starts, it looks up the mode of the current logical database. If it is Proxy mode, it connects directly to the Proxy to start the pool. If it is JDBC mode, it loads a JDBC-mode data source based on the logical database's data source configuration and its sharding, read-write separation, and shadow-library rules; the corresponding DataSource is GovernanceShardingSphereDataSource. Rainbow also listens for mode changes: when the mode changes, the current pool is replaced dynamically and losslessly, using a replacement scheme similar to the cluster switching described earlier. There are also monitoring and connection-pool management issues to solve: after switching from Proxy to JDBC mode, metrics are exposed from the upstream node instead of the Proxy node, so the dashboards need to be merged accordingly, and connection pool management needs to be adapted to the new sizing calculation.

4. Current pain points

For historical reasons, our secondary development is based on Apache ShardingSphere 5.0.0-alpha, while the community's latest version is 5.1.2-release. A lot of optimizations and bug fixes went in between 5.0.0-alpha and 5.1.2-release, but we have no good way to merge the community's code into our internal codebase, because we cannot determine whether changes in the open-source version would affect existing business; for now we cannot enjoy the community's dividends. At the same time, we are looking for ways to contribute some of our own optimizations back to the community, as a form of giving back and a contribution to open source in China.

5. Closing words

The ShardingSphere source code is excellent: the design is very clever in many places, and the module division is clear. But the codebase is very large, and reading it for the first time is still quite difficult. Although most of this article points out problems with the current open-source version, no product, however good, can fit every scenario. At the beginning of this year, members of the ShardingSphere team, including Brother Liang, came to Dewu headquarters for an offline exchange, shared a lot of valuable experience, and gave very useful ideas and guidance. We very much look forward to the next offline exchange.

*Text / Chen Hao
@德物科技 public account

