Why must the future of relational databases be distributed databases?

In this article, OceanBase CTO Yang Chuanhui reviews the history of databases, looks at future trends, and tells the story of how OceanBase was born.

What is a database

First, what is a database? As the name suggests, a database is a warehouse that organizes, stores, and manages data according to a data structure. Historically, there have been hierarchical databases, network databases, and relational databases.

Today, relational databases are the industry mainstream. The software that manages one is formally called a Relational Database Management System, or RDBMS for short. Relational databases are mainly used in the core business scenarios of core industries, the so-called mission-critical scenarios, which involve applications that must manage people, money, and goods precisely. The first law of a relational database is never to lose a single piece of data.

Relational databases can be divided into three basic modules:

The relational model, that is, the tables, indexes, foreign keys, normal forms, and so on that we often hear about;
Transaction processing, that is, the ACID properties of transactions: Atomicity, Consistency, Isolation, and Durability;
Query optimization, that is, SQL parsing, rewriting, optimization, execution, and so on.
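To make the "A" in ACID concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names alice and bob, and the transfer and demo helpers are illustrative assumptions, not anything taken from OceanBase. The point is only that a transfer either applies both updates or neither.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one atomic transaction."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit
            # above is rolled back together with it.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
    failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
    balances = dict(conn.execute("SELECT name, balance FROM accounts"))
    return ok, failed, balances
```

Calling demo() runs one transfer that succeeds and one that is rejected. After the rejected transfer, bob's balance does not include the 500 that was credited inside the failed transaction.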

The flourishing of database technology owes a great deal to many excellent scientists. There are four Turing Award winners in the database field. The first was Charles W. Bachman, the founder of the network database, who won in 1973. The second is Edgar F. Codd, who proposed relational theory and won in 1981. The third is Jim Gray, a pioneer of transaction processing, who won in 1998. Jim Gray also described the four paradigms of scientific research, proposing data-intensive science as the fourth. He was a genius scientist. Here is an interesting story: Microsoft wanted to recruit Jim Gray to its headquarters in Seattle, but he did not like the weather in Seattle, so Microsoft built a separate research institute for him in San Francisco.

The fourth Turing Award winner in the database field is Michael Stonebraker. He is not only a professor but also the founder of several database companies, and he has created many different types of database products. Except for Oracle and DB2, most database products are more or less related to him. One of his most famous database products is Ingres, the forerunner of a series of products such as Informix, Sybase, SQL Server, and PostgreSQL. Michael Stonebraker won the Turing Award in 2014.

The history of relational databases

According to the timeline in the figure below, let's review the development history of relational databases.

In 1970, Edgar F. Codd first proposed the relational model. Two prototype systems based on the relational model followed: System R from IBM, and Ingres, led by Michael Stonebraker. IBM proposed the relational model and built the first prototype system, but it did not seize the historic opportunity to commercialize relational databases. Instead, the opportunity went to a man named Larry Ellison. In 1979, Oracle, modeling its product on IBM's System R, shipped the first commercial relational database release. IBM's own first commercial database release, DB2, did not arrive until 1983, long overdue, and by then it had missed the best days of database development.

In 1987, Sybase and Microsoft built Sybase SQL Server together. In 1989, Microsoft bought the rights to the Sybase SQL Server code and created an independent branch, MS SQL Server. The SQL Server we hear about today generally refers to Microsoft's MS SQL Server.

In the same year, the famous open source database PostgreSQL was also born.

In 1995, MySQL was born. Its founder is Michael Widenius, known as Monty. Remarkably, in the early days of MySQL most of the code was contributed by Monty, and to this day he still keeps the habit of writing code.

In 2004, columnar database products, including MonetDB and C-Store, joined the ranks of relational database products.

Distributed system development

Next, let's look at the development of distributed systems.

Distributed systems are also a relatively old field. However, only in the last ten or twenty years have they moved from theory to large-scale engineering practice. The company that has contributed the most to distributed systems engineering practice is Google. Google's infrastructure rests on a troika: the Google File System, Google MapReduce, and Google BigTable. The three papers Google published on these systems essentially laid the theoretical foundation for large-scale distributed storage systems in the industry. I recommend that anyone interested in distributed systems engineering read these three papers carefully.

In 2005, the Hadoop project was founded. Its original goal was to build an open source implementation of Google's troika.

In 2007, Amazon published the Dynamo paper. Dynamo's design is also interesting: it takes a P2P approach to distributed storage and uses a series of interesting techniques, including consistent hashing and NWR (N replicas, W write acknowledgments, R read responses). In the end, however, the P2P approach did not become the industry mainstream, because it cannot guarantee strong consistency.
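The two techniques named above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Dynamo's actual implementation: the HashRing class, the vnodes parameter, the SHA-256 based hash, and the node names are all hypothetical choices made for this example.

```python
import bisect
import hashlib

def _hash(key):
    """Map a string to a 64-bit position on the ring (illustrative choice)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

class HashRing:
    """Consistent hashing ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node owns several points on the ring to even out load.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._positions = [position for position, _ in self._points]

    def preference_list(self, key, n):
        """Return the first n distinct nodes clockwise from the key's position."""
        start = bisect.bisect_right(self._positions, _hash(key))
        result = []
        for offset in range(len(self._points)):
            node = self._points[(start + offset) % len(self._points)][1]
            if node not in result:
                result.append(node)
                if len(result) == n:
                    break
        return result

def quorums_overlap(n, w, r):
    """NWR rule: a read set and a write set must intersect when R + W > N."""
    return r + w > n
```

With N = 3, choosing W = 2 and R = 2 makes every read overlap the latest acknowledged write, while W = 1 and R = 1 does not. A useful property of the ring is that adding a node only moves keys onto the new node; keys never shuffle between existing nodes.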

In 2009, the Spark project was established.

In 2010, the OceanBase project, which I work on, was established. The OceanBase team believes in and practices distributed databases. OceanBase is positioned as a native distributed database, with the goal of becoming a leader in the field.

In 2011, another cloud computing giant, Microsoft, released Windows Azure Storage.

In 2012, Google published the Spanner paper. Spanner is the world's first global database. It uses a series of distributed technologies, such as TrueTime, Paxos, and two-phase commit, to build a global-scale, highly scalable, strongly consistent distributed database.

In 2016, Amazon released Aurora, a system that runs on the public cloud. Its design is very clever: it separates storage from computing, which makes scaling storage capacity very simple. Aurora has a core design philosophy: the log is the database.
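Of the techniques listed for Spanner, two-phase commit is the easiest to show in isolation. The following is a minimal in-memory sketch, not Spanner's implementation: the Participant class and its can_commit flag are hypothetical, and a real system would also persist each decision to a replicated log and handle coordinator and participant failures.

```python
class Participant:
    """A hypothetical resource manager taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a local conflict or failure
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Commit everywhere only if every participant votes yes in phase 1."""
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2a: unanimous yes, so tell everyone to commit.
        for p in participants:
            p.commit()
        return True
    # Phase 2b: at least one no, so tell everyone to abort.
    for p in participants:
        p.abort()
    return False
```

A single "no" vote aborts the transaction on every participant, which is what keeps a transaction spanning several shards atomic.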

Lessons learned from relational databases

Relational databases have evolved over the years, and there are many lessons to be learned.

From that experience, I will share two points today:

The first point: innovation is application-driven. Application innovation and technological innovation complement and promote each other. Many relational database technologies were driven by applications, and together they formed a very powerful technology ecosystem.

The second point: abstraction and standardization. The relational data model and the transaction processing model are fundamental abstractions, abstractions worthy of a Turing Award. Relational databases also have many standards, the most famous of which are the SQL standard and the TPC benchmark standards.

The early commercial database market was also chaotic, with every vendor claiming to be the best. Eventually the TPC organization stepped in, formulated a series of benchmark standards such as TPC-C and TPC-H, and brought in professional third-party auditors to conduct strict audits. The TPC gives commercial database vendors a fair arena to compete in and promotes healthy competition for everyone.

Distributed Databases: The Future of Relational Databases

I firmly believe that the future of relational databases must be distributed databases. Why am I so convinced? Because distributed databases are fully compatible with how centralized databases are used, including the relational model, the transaction processing model, and the SQL standard, and because they integrate advanced distributed and cloud-native technologies, they can fully enjoy the benefits of distributed technology, including high availability, scalability, low cost, intelligence, and more.

The relationship between a distributed database and a centralized database is a bit like that between the car and the carriage. When the car first appeared, it was not as easy to use as the carriage, but we all know that over time the car gradually replaced the carriage. The same is true for distributed databases: they are fully compatible with centralized databases, include the capabilities of centralized databases, and scale better. Therefore, distributed databases will certainly replace centralized databases in the future.

Native distributed database OceanBase

As mentioned above, we firmly believe that the era of distributed databases is coming. Next, let me introduce the enterprise-level distributed database we created: OceanBase.

OceanBase is a transparently scalable enterprise-level database built on a native distributed architecture. This design lets users fully enjoy the benefits of distributed technology, including:

High availability: RPO = 0 and RTO < 30 seconds, with support for deployments of up to five data centers across three locations. This means that when a server, a data center, or even a whole city fails, OceanBase recovers within 30 seconds with no data loss;

Transparent scaling: complete support for distributed transactions, distributed queries, global secondary indexes, and global consistency;

Performance: the only distributed database in the world that has passed the TPC-C test and audit, with a transaction processing performance of 707 million tpmC, an order of magnitude higher than competing products.
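The high-availability item above can be illustrated with a little arithmetic. The sketch below rests on two assumptions that the article does not spell out: that the five replicas are placed 2 + 2 + 1 across the three locations, and that a write counts as durable once a majority of replicas has acknowledged it. The LAYOUT dictionary and both function names are hypothetical.

```python
def majority(total_replicas):
    """Smallest number of replicas that forms a majority."""
    return total_replicas // 2 + 1

def survives_city_loss(layout, lost_city):
    """True if a majority of replicas remains after losing one whole city."""
    total = sum(layout.values())
    remaining = total - layout[lost_city]
    return remaining >= majority(total)

# Hypothetical placement: five data centers spread over three cities.
LAYOUT = {"city_a": 2, "city_b": 2, "city_c": 1}
```

Under these assumptions, losing any single city leaves at least three of the five replicas, which is still a majority, so committed data remains available. A 3 + 2 placement across only two cities would not survive the loss of the larger one.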

From the user's point of view, OceanBase is compatible with traditional enterprise-level databases: it supports MySQL and Oracle syntax and enterprise-level features and, like Oracle, can handle mixed workloads efficiently. OceanBase carries 100% of the traffic of all core businesses at Ant Group and MYbank, and it supports the core systems of important customers in industries vital to the national economy and people's livelihood, such as banking, insurance, securities, telecom operators, and public utilities.

As mentioned above, OceanBase was born in 2010. From the start of the project, our goal was to create a native distributed database. The industry offered no reference solution, so OceanBase had to be developed 100% independently, from zero to one. From 2010 to 2014, OceanBase was rolled out on Alibaba's e-commerce platform, serving dozens of e-commerce business systems. In 2014, OceanBase carried that year's Double 11 traffic peak, its first breakthrough in core transaction scenarios.

Over the following three years, Alipay moved all of its core businesses, including transactions, payments, and member accounts, onto OceanBase.

In 2017, OceanBase went beyond Alibaba and Ant Group for the first time and began external commercialization. Bank of Nanjing was the first external customer to run on OceanBase.

In 2019, OceanBase took the TPC-C test for the first time and achieved a score of 60.88 million tpmC, breaking the previous world record.

In 2020, OceanBase took the TPC-C test again and achieved a score of 707 million tpmC. In the same year, OceanBase began operating as an independent company, establishing Beijing Aoxing Base Technology Co., Ltd., which is dedicated to the design, research and development, sales, and service of OceanBase, a distributed relational database, and to helping customers move to a distributed architecture.

In 2021, the OceanBase database was officially open-sourced under the widely used Open Core model. The OceanBase database kernel, distributed components, and interface drivers were fully open-sourced, about 3 million lines of code in total, including the SQL engine, transaction engine, and storage engine. The open-source code covers core technologies such as multi-replica support, distributed transactions, high performance, scalability, fault recovery, the optimizer, multi-active disaster recovery, and syntax compatibility.


Author of this article | Yang Chuanhui

Yang Chuanhui is the current CTO of OceanBase. He previously worked on the research and development of large-scale cloud computing systems at Baidu. In 2010, he joined the OceanBase team as one of its founding members. He has led the successive architecture designs and the technology research and development of OceanBase, and he took OceanBase from scratch to full adoption within Ant Group. He also led both OceanBase TPC-C tests, which broke the world record, and wrote the monograph "Large-scale Distributed Storage System: Principles and Practice".

