About the author:
Yang Chuanhui, CTO of OceanBase. He joined the OceanBase team in 2010 as one of its founding members and led the architecture design and R&D of its earlier versions, taking OceanBase from zero to full deployment across Ant Group. He also led OceanBase through two TPC-C benchmark runs that broke world records, and is the author of "Large-scale Distributed Storage Systems: Principles and Practice". Today, Yang Chuanhui leads the OceanBase technical team in building a more open, flexible, efficient, and easy-to-use next-generation enterprise-grade distributed database.
On August 10, 2022, OceanBase 4.0 ("Xiaoyu") debuted at our annual conference, officially introducing the single-machine distributed integrated architecture. From my point of view, OceanBase has gone through three major architecture upgrades. The first was OceanBase 0.1~0.5 (2010~2015), which achieved quasi-distribution through a single-write-multiple-read architecture, with several distinct roles such as UpdateServer, ChunkServer, and MergeServer. The second was OceanBase 1.0~3.0 (2016~2022), which moved to a peer-to-peer, fully distributed architecture in which every node is readable and writable, and which gradually grew to support complete SQL functionality. The third is the 4.0 release announced this time, in which we formally propose the single-machine distributed integrated architecture.
This article mainly shares with you the design concepts and technical thinking behind 4.0.
Designed for the cloud, with distribution as the cornerstone
In the 70 years of database development, centralized databases have produced many excellent systems such as Oracle, SQL Server, MySQL, and PostgreSQL. They excel at single-machine performance and SQL functionality, but have no way to scale horizontally. Distributed databases solve the horizontal scaling problem, but there is a clear gap between their single-machine performance and SQL functionality and those of centralized databases. Many distributed systems emerged along the way: some are really distributed storage systems that support only simple NoSQL features or limited SQL; others support both horizontal scaling and complete SQL functionality. The latter are often called NewSQL, such as CockroachDB and Google Spanner, but their single-node performance is less than one third that of MySQL.
Choosing between a centralized and a distributed database has therefore become one of the most agonizing decisions. The usual rule of thumb is data volume: if the data volume is relatively small, choose a centralized database with complete functionality; if the data volume is large, choose a distributed database or distributed storage system, sacrifice functionality and single-machine performance, and make up the difference by modifying the business or piling on machines.
There are different views on distributed databases. Many people, and many systems, treat distribution as a supplement to the database, suited only to niche scenarios with especially large data volumes or especially high concurrency. My view is different: I believe distribution is the cornerstone of the next generation of cloud-oriented databases, and that the underlying layers of future cloud databases will be natively distributed.
By analogy, a cloud database is like an electric vehicle, and distribution is its battery technology. With distribution, the basic usage of the database is not much different from before, but the user experience begins to diverge: concepts such as zero configuration and Serverless are gradually being built into the latest cloud databases. This is much like electric vehicles: once battery efficiency improved and charging stations became widespread, EVs began to pull away from fuel vehicles in user experience. For many people, the strongest impression Tesla leaves is not so-called autonomous driving but the simplicity of operation: press the accelerator to speed up, release it to slow down, often without touching the brake. It feels a little uncomfortable at first, but before long you realize this is how a car should be. Distribution and cloud-native should likewise be how databases are. By default, future mainstream databases will have cloud-oriented flexibility, simplicity, and openness, including:
1) Cloud-oriented flexibility: scale arbitrarily, expand storage capacity and computing power on demand, and support pay-as-you-go, covering a customer's database needs across the whole life cycle from small to large. When the business volume is small, a customer can choose a small standalone instance (such as 4C8G), or even a shared instance or pay-per-request (Serverless); as the business grows, the database's storage and computing capacity can be increased very flexibly; when a traffic peak passes and load returns to normal, capacity can be shrunk smoothly and dynamically on demand to reduce cost.
2) Cloud-oriented simplicity: whether in functionality or in operations and maintenance, users should make fewer choices, or even none. The database handles operations such as automatic data sharding, automatic fault tolerance, and cross-server distributed transactions as much as possible, and stays compatible with user interfaces such as MySQL/Oracle/PostgreSQL. It supports HTAP mixed workloads, improves query optimization and adaptive processing, and minimizes user configuration.
3) Cloud-oriented openness: moving from single-cloud "cloud native" to customer-centric multi-cloud native, reducing dependence on the IaaS layer. It supports "one database, multiple clouds" and "one database, multiple chips", adapting quickly and efficiently to new cloud platforms and new hardware such as CPUs as the business requires.
To this end, we propose the OceanBase 4.0 single-machine distributed integration concept, making distribution the cornerstone of the database rather than something applied only in scalability niches. Achieving this requires not only scalability but also complete SQL functionality and extreme single-machine performance, supporting users' business across its full life cycle from small to large. Because the single-machine distributed integrated architecture lowers the entry threshold to that of a centralized database, and has clear advantages in cloud-oriented high availability, scalability, and cost-effectiveness, I believe it will gradually become mainstream and develop an open source community on the same scale as MySQL and PostgreSQL. We therefore decided, starting from OceanBase 4.0, to unify the open source branch and the commercial branch into a single branch to accelerate the development of the open source community.
The difficulty that must be overcome: unifying single-machine and distributed with dynamic log streams
The single-machine distributed integrated architecture demands both distributed scalability and the functionality and single-machine performance of a centralized database. Transaction ACID (Atomicity, Consistency, Isolation, Durability) is a basic requirement of any database; the hard part for a distributed database is guaranteeing ACID under failure. The core problems are how to implement failure recovery based on the redo log, and how to guarantee the atomicity of distributed transactions under failure, again based on redo logs.
Let's first look at the existing solutions in the industry:
A) Server-level static log streams: the typical representative is sharding via middleware on top of an open source database (MySQL/PostgreSQL). The database distinguishes a primary from standbys, with the primary synchronizing redo logs to the standbys. This scheme does not support partition-level fine-grained migration and replication, and therefore cannot scale out online. When capacity must be added, expansion is usually done in multiples, the process affects the business, and it is carried out manually by DBAs.
B) Partition-level static log streams: typical representatives are NewSQL databases such as CockroachDB and TiDB, as well as OceanBase 1.0~3.0. Each data partition/shard has its own log stream, supporting partition-level fine-grained migration and replication, and thus horizontal scaling. The problem is that the distributed overhead is proportional to the number of partitions, which caps the number of tables a single server can host. Even when all business data fits on one server, the system still pays the distribution-related overhead.
C) Separating logs from data: the typical representative is FoundationDB. Storage servers and log servers are separate; write transactions go only to the log servers, and storage servers periodically pull logs from the log servers and apply them locally to serve reads. The problem is that the data on the storage servers lags slightly behind the logs, which hurts the efficiency of reading the latest data.
Scheme A avoids distributed overhead and has better single-machine performance, but cannot scale out online; Scheme B supports online horizontal scaling, but the distributed overhead it introduces costs performance; Scheme C is flexible to deploy, but hurts read performance. Starting from OceanBase 4.0, we therefore unify Schemes A and B through dynamic log streams, obtaining both single-machine performance and distributed horizontal scalability. When the system is in a steady state, dynamic log streams behave like Scheme A: each server has only a fixed number of log streams. Unlike Scheme A, however, OceanBase 4.0 supports migrating a partition from one log stream to another, which gives it the partition-level online horizontal scaling of Scheme B. This seemingly easy-to-understand scheme is extremely challenging to implement. Here are three practical scenarios:
First, when a partition is migrated from one log stream to another, how do we ensure the migration is atomic, i.e., that no matter what happens, all replicas of the partition either migrate successfully or none do?
Second, during partition migration, how do we ensure ongoing transactions are unaffected? We must handle the case where a transaction is half done, or one of its SQL statements is half executed.
Third, during partition migration, how do we ensure downstream consumers always subscribe to the correct data, with nothing lost or duplicated no matter what happens?
OceanBase 4.0 solves these technical problems and achieves online horizontal scaling without adding distribution-related overhead. It can therefore be deployed on small servers like a centralized database, and its single-node performance can reach or even exceed that of centralized databases.
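To make the dynamic log stream idea concrete, here is a minimal Python sketch under my own assumptions (the class and method names are invented for illustration; this is not OceanBase's actual implementation). A server hosts a fixed, small number of log streams, each serving many partitions, and a partition can be transferred between streams; the transfer itself is logged so that replay makes it all-or-nothing.

```python
class LogStream:
    """A shared redo log serving many partitions (illustrative)."""
    def __init__(self, stream_id):
        self.stream_id = stream_id
        self.partitions = set()   # partitions whose redo goes to this stream
        self.log = []             # the shared redo log

    def append(self, partition, record):
        assert partition in self.partitions, "partition not on this stream"
        self.log.append((partition, record))


class Server:
    def __init__(self, num_streams=2):
        # a fixed number of streams per server: distributed overhead stays
        # constant instead of growing with the number of partitions
        self.streams = [LogStream(i) for i in range(num_streams)]
        self.placement = {}       # partition -> its current log stream

    def create_partition(self, partition):
        # place new partitions on the least-loaded stream
        stream = min(self.streams, key=lambda s: len(s.partitions))
        stream.partitions.add(stream_part := partition)
        self.placement[stream_part] = stream

    def transfer(self, partition, dst):
        # the transfer is recorded in the destination log, so crash replay
        # re-applies it and the move is atomic in this simplified model
        src = self.placement[partition]
        if src is dst:
            return
        dst.log.append((partition, "TRANSFER_IN"))
        src.partitions.discard(partition)
        dst.partitions.add(partition)
        self.placement[partition] = dst
```

In steady state everything can live on one stream (Scheme A behavior, no per-partition overhead); when scaling out, `transfer` moves individual partitions to a stream on another server (Scheme B behavior). The hard parts the article lists, multi-replica atomicity, in-flight transactions, and downstream subscription, are exactly what this toy model omits.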
Born for Multi-Cloud Architecture: From Cloud-Native to Multi-Cloud-Native
Cloud computing has undergone many technical iterations since its inception, from the earliest managed virtual machines, to managed services, to today's multi-cloud and serverless, constantly pursuing flexibility, simplicity, and openness. OceanBase 4.0 is designed for multi-cloud architecture.
Flexible: Built-in automatic horizontal scaling
As cloud users grow from small to large, they typically move from shared instances, to small dedicated instances, to large instances, and finally to a distributed deployment. Through its built-in multi-tenant isolation architecture, OceanBase can offer shared instances to cut costs as far as possible. As the user's business volume grows, the single-machine distributed integrated architecture allows gradual expansion from small instances to distributed instances: small instances carry no distribution-related overhead, while distributed instances can expand both storage capacity and computing power. It supports the pay-as-you-go billing model, with both rapid scale-out and rapid scale-in.
We'll also explore serverless options, enabling pay-per-request. The serverless solution involves several important technical points:
- When the user has no requests at all, can the database achieve zero consumption?
- How can capacity be scaled elastically, and when a computing node fails, how can a new node be brought up to resume service within seconds or less?
- How can I measure the cost of different SQL requests to enable per-request billing? Some SQL may be simple, and some SQL may be complex and require access to large amounts of data.
These questions are very interesting, and you are welcome to discuss them with us.
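On the per-request billing question, one plausible approach is to meter each SQL request in abstract "request units" derived from the resources it consumed. The sketch below is purely illustrative: the weights and names (`request_units`, `Meter`) are my own assumptions, not OceanBase's billing model.

```python
# Invented weights for illustration: units per ms of CPU, per block read,
# and per row examined.
CPU_WEIGHT = 1.0
IO_WEIGHT = 0.5
ROW_WEIGHT = 0.001

def request_units(cpu_ms, blocks_read, rows_examined):
    """Abstract cost of one SQL request, so a point lookup is billed far
    less than a large scan."""
    return (cpu_ms * CPU_WEIGHT
            + blocks_read * IO_WEIGHT
            + rows_examined * ROW_WEIGHT)

class Meter:
    """Accumulates billable units per tenant."""
    def __init__(self):
        self.usage = {}

    def charge(self, tenant, cpu_ms, blocks_read, rows_examined):
        units = request_units(cpu_ms, blocks_read, rows_examined)
        self.usage[tenant] = self.usage.get(tenant, 0.0) + units
        return units
```

Under such a model a simple indexed lookup might cost around one unit while a full scan costs hundreds, which is what makes per-request billing fair across very different SQL shapes.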
Simple: HTAP = OLTP+
OceanBase 4.0 aims to reduce the complexity of database selection and operations, using one database to meet the needs of different scenarios as far as possible. On the one hand, the differences between single-machine and distributed architectures are unified by the dynamic log stream scheme described above; on the other hand, we observe that most NoSQL, NewSQL, and OLAP systems on the market use LSM-Tree-like storage engines underneath, so a single system can handle OLTP, real-time OLAP, Key-Value, JSON, GIS, and other data models, saving users from running multiple systems just because the infrastructure falls short.
I do not believe in "One Size Fits All", and the single-machine distributed integrated architecture is not omnipotent, but I do think "One Size Fits a Bunch" is achievable without sacrificing performance or cost. Along these lines, I wrote an article a while ago, "What Real HTAP Means to Users and Developers": I believe HTAP is really OLTP+. Start with high-performance OLTP, then build real-time analytics on top of it, plus multi-model support such as Key-Value, JSON, and GIS. This suits online query and online analysis workloads, but is not optimal for offline analysis and big data analysis.
Open: Separation of storage and computing for multi-cloud
Early versions of OceanBase supported on-premises deployment, and later gradually added multi-cloud deployments such as Alibaba Cloud and AWS. Storage-compute separation is standard for cloud databases, and there are two design approaches:
The first separates storage and compute inside the database: split the database into a SQL layer and a KV layer, with the stateless SQL layer doing computation and the KV layer doing storage. The advantage is simplicity of implementation: distributed features such as scalability and high availability can be pushed down into the KV layer, and transaction processing is built on a scalable, highly available distributed KV. The drawback is equally significant: this scheme sacrifices performance and openness, since every SQL operation requires an RPC round trip, and external data must first be imported into the database.
The second separates storage and compute between the database and the file layer: abandon the SQL/KV layering and have the database read and write data blocks through various local or remote distributed file systems, with the database doing computation and the file layer doing storage. The biggest difficulty is that high availability and scalability must be woven into distributed transaction processing, which is complicated to implement. The advantage is performance and openness: no RPC interaction is needed between the SQL and storage modules inside the database, and the system can adapt to the various storage services of multi-cloud platforms.
OceanBase 4.0 chooses the second option, which I call "open storage-compute separation". In addition, to adapt better to different cloud platforms, OceanBase's storage format is designed to minimize dependence on the external file system. The OceanBase storage engine uses an LSM-Tree-like format in which each SSTable is divided into 2MB macroblocks. When the LSM-Tree engine runs a Compaction, only the modified 2MB macroblocks need to be rewritten. Each macroblock write is asynchronous, and the 2MB macroblock size also makes good use of the performance of the distributed file systems on different cloud platforms.
This reduces dependence on the underlying storage system, allowing users to deploy the database to different cloud platforms, or even multiple clouds, according to business needs, while preserving stability and high performance.
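The macroblock idea can be sketched in a few lines of Python (a simplified model with invented names, not OceanBase's on-disk format): an SSTable is a sequence of fixed-size macroblocks, each covering a key range, and compaction rewrites only the macroblocks whose range is touched by changes, reusing the rest by reference.

```python
MACROBLOCK_BYTES = 2 * 1024 * 1024   # 2MB, as described in the text

class Macroblock:
    """An immutable block of sorted rows covering a key range."""
    def __init__(self, rows):             # rows: mapping key -> value
        self.rows = dict(sorted(rows.items()))
        self.lo, self.hi = min(self.rows), max(self.rows)

def compact(sstable, changes):
    """Merge `changes` (key -> new value) into a list of macroblocks.
    Only blocks whose key range is touched get rewritten; changes falling
    outside every block's range are ignored in this sketch."""
    out, rewritten = [], 0
    for block in sstable:
        touched = {k: v for k, v in changes.items()
                   if block.lo <= k <= block.hi}
        if not touched:
            out.append(block)             # reuse the untouched block as-is
        else:
            merged = dict(block.rows)
            merged.update(touched)
            out.append(Macroblock(merged))  # rewrite only this block
            rewritten += 1
    return out, rewritten
```

The point of the design is that Compaction I/O is proportional to the size of the changes rather than the size of the table, and because whole 2MB blocks are written asynchronously, the scheme maps well onto cloud object and file storage.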
Small is Big: The Battle for Scale
I believe the single-machine distributed integration field will develop a technical community on the same scale as MySQL and PostgreSQL, with millions of enterprise users. By the 80/20 rule, most users' data fits on one machine, yet users also need room for rapid business growth, and the integrated architecture resolves this tension well. A distributed database is inevitably far more complex than a centralized one; the integrated architecture lowers the barrier to that of a centralized database while the user is still small, and removes the need to worry in advance about the scalability problems business growth brings: choose once, and it lasts. Scaling up adoption requires:
First, support single-machine and small-scale deployment. OceanBase has an interesting design here: every distributed system needs a master control node for global management and scheduling, but OceanBase's is not a separate process; it is a service, RootService, integrated into the OBServer. The advantage is that a single-node deployment needs only one process, and OBServers can be added online, achieving both a minimal single-node footprint and the unification of single-machine and distributed architectures. OceanBase 4.0 reduces the minimum deployment specification to 4C8G, and will reduce it further in the future.
Second, be compatible with the behavior and usage of centralized databases. Although OceanBase is a fully self-developed database, it adheres to compatibility with MySQL and Oracle and, in principle, invents no new syntax. We want to support smooth business migration from a centralized database without modifying business code. OceanBase 3.x already supports stored procedures, triggers, foreign keys, XA, and more; OceanBase 4.0 further strengthens support for general DDL, table locks, general LOB, GIS, JSON, and other features.
Third, take a completely open source route. On June 1, 2021, we announced that OceanBase was officially open sourced. At that time the open source branch and the commercial branch were still two separate code branches, and the OceanBase R&D team periodically patched commercial-branch code into the open source branch, so the open source branch always lagged the commercial branch by some amount of time. With OceanBase 4.0 we went a step further and merged the two into one code branch, keeping them completely in sync.
Stability and reliability come first: pursuing extreme performance and extreme high availability
As basic software, a database must meet many visible and invisible requirements: stability and reliability, performance, cost, high availability, SQL support, ease of use, and so on. In OceanBase's technical philosophy, stability and reliability come first, and every other requirement yields to them. For example, OceanBase has a data verification mechanism that automatically performs real-time transaction consistency verification and periodic baseline data consistency verification across replicas. This verification costs some performance and imposes certain constraints on the storage format design, but it must be done. Stability and reliability are the first commitment we make to our customers, and the most basic first principle of the OceanBase database.
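The baseline verification idea can be illustrated with a small sketch (simplified and with invented names; OceanBase's real mechanism verifies at the transaction and Compaction level): each replica computes a checksum over the same version of the baseline data, and any disagreement flags silent corruption.

```python
import hashlib

def baseline_checksum(rows):
    """Deterministic checksum over a replica's baseline rows.
    Keys are visited in sorted order so replicas hash identically."""
    h = hashlib.sha256()
    for key in sorted(rows):
        h.update(f"{key}={rows[key]};".encode())
    return h.hexdigest()

def verify_replicas(replicas):
    """True if all replicas agree on the baseline checksum."""
    checksums = {baseline_checksum(r) for r in replicas}
    return len(checksums) == 1
```

The trade-off the article mentions follows directly: computing checksums costs CPU during writes and Compaction, and the storage format must make a deterministic traversal order cheap, which constrains its design.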
The essence of a database is how fast it runs and how little it costs; only a database that pursues extreme performance can serve core scenarios. Database performance is comparable to a chip fabrication process: a database with poor performance is like a 28nm process, usable in some scenarios, while a database with good performance is like a 7nm process, usable everywhere, including the most performance-demanding scenarios such as mobile phones.
OceanBase 4.0 is written in C++, but we do not allow ourselves the advanced features of the language (including the C++ STL), using only basic facilities such as classes to organize the code. We manage all memory and IO inside the database rather than leaving it to the underlying OS. Optimizing performance means optimizing every operator and minimizing the number of CPU instructions executed per SQL statement. A distributed database typically faces the following performance challenges:
- Primary/standby strong synchronization. Strong synchronization increases transaction commit latency, and if the SQL execution thread blocked waiting for it to return, the thread context switches would be costly. OceanBase's approach is asynchronous: after submitting the synchronization task, the SQL execution thread immediately processes the next statement, and a callback responds to the client once strong synchronization succeeds.
- Query performance of LSM-Tree-like engines. Reads must merge the SSTables on disk with the MemTable in memory, which affects query performance. This involves the Compaction strategy, the design of query operators, and more, and needs continuous optimization.
- Cross-server operation performance. This involves both distributed transactions and the performance of remotely executed SQL. SQL execution must be localized as much as possible, and the efficiency of remote execution continuously optimized.
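The LSM read-merge cost mentioned above can be sketched as follows (a toy model with invented names, not OceanBase's engine): a point lookup consults the MemTable first, then SSTables from newest to oldest, returning the first hit. Every extra level consulted adds latency, which is why Compaction strategy and operator design matter so much for query performance.

```python
TOMBSTONE = object()   # marker recording that a key was deleted

class LSMStore:
    def __init__(self):
        self.memtable = {}
        self.sstables = []            # newest first

    def put(self, key, value):
        self.memtable[key] = value

    def delete(self, key):
        # deletes are writes too: a tombstone shadows older versions
        self.memtable[key] = TOMBSTONE

    def flush(self):
        # freeze the MemTable into a new immutable SSTable
        self.sstables.insert(0, dict(self.memtable))
        self.memtable = {}

    def get(self, key):
        # newest-to-oldest scan: the first version found wins
        for table in [self.memtable] + self.sstables:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None
```

In a real engine, bloom filters and block caches prune most of these level lookups, and Compaction bounds how many SSTables a read can touch; this sketch shows only the merge semantics.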
OceanBase has integrated Paxos into the database to achieve lossless disaster recovery since version 0.5. Today, OceanBase's pioneering "RPO = 0, RTO < 30 seconds" has become the de facto standard for high availability in the database industry. Of course, recovery within 30 seconds (RTO < 30 seconds) is not absolute; it depends on the Paxos lease time and re-election time across a large number of partitions.
With OceanBase 4.0's single-machine distributed integrated architecture, plus further optimization of the Paxos election protocol and a comprehensive failure detection mechanism, we can finally achieve RTO < 8 seconds, bringing stronger continuous availability to the business. Going from 30 seconds to 8 seconds sounds simple, but a great deal lies behind those 22 seconds. An analogy: one essential step in Formula 1 is the tire change, a race against time whose record has fallen from minutes in the 1950s to 1.92 seconds. A database runs like a racing car, and RTO is its mid-race tire change. What makes it harder for us is that an F1 pit stop is planned in advance and executed by a team of more than ten people, while a database failure can happen at any time and must be handled with no human involvement at all.
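As a back-of-the-envelope model of why RTO is bounded by lease and election times (the numbers below are illustrative, not OceanBase's actual parameters): a follower cannot safely start an election until the old leader's lease expires, so worst-case recovery is roughly detection (lease expiry) plus election plus log replay.

```python
def worst_case_rto(lease_s, election_s, replay_s):
    """Rough upper bound on recovery time: a new leader cannot serve
    until the old lease expires, a new election completes, and pending
    redo is replayed."""
    return lease_s + election_s + replay_s

# Illustrative parameter sets: shrinking the lease and speeding up
# elections is what moves RTO from tens of seconds to a few seconds.
old_era = worst_case_rto(lease_s=20, election_s=8, replay_s=2)   # ~30s class
new_era = worst_case_rto(lease_s=3, election_s=2, replay_s=1)    # <8s class
```

The catch, which the text hints at, is that shorter leases mean more frequent heartbeats and a higher risk of spurious elections under load, so cutting RTO safely requires better failure detection, not just smaller timeouts.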
In closing: growing with users and developers
The OceanBase 4.0 "Xiaoyu" released at this conference implements the single-machine distributed integrated architecture through dynamic log streams and supports small single-machine deployments. In hands-on trials we saw that, on the same hardware, OceanBase 4.0's single-machine sysbench results beat MySQL. Version 4.0 also brings multi-cloud flexibility, simplicity, openness, and large-scale delivery capability. Of course, the maturity of every version depends on polishing against a large number of real business scenarios, and features such as Serverless and HTAP are still being refined.
Many of OceanBase 4.0's innovations and ideas come from users' needs and suggestions. We will keep to the MySQL-style strategy of being completely open source, and work with our users and developers to make the single-machine distributed integrated architecture truly mainstream, building an integrated database product and community on the same scale as MySQL and PostgreSQL.