The content of this article is based on the keynote speech "Open Source to the World", delivered by PingCAP Senior Vice President Fan Ruohan at BDTC 2021. From the perspectives of collaboration and technology evolution, it explores the inseparable relationship between open source and globalization, and is divided into two parts. [The first part introduces how open source builds a global stage](). The theme of this part is: the power of open source as seen through the evolution of database technology.
We believe that the evolution of data technology is driven mainly by three forces: theoretical foundations promote software innovation, infrastructure guarantees the realization of software capabilities, and business requirements, the real "field of use", polish technology through continuous engineering and productization.
Database Evolution History - Driven by Basic Theory
According to the time and function dimensions, we have divided the data ecology, roughly including SQL ecology, big data ecology, NoSQL ecology, NewSQL ecology, and SQL cloud ecology. The evolution of each ecology is inseparable from the development of basic theories.
In 1970, the relational model proposed at IBM, followed by the System R prototype, laid the foundation for the birth of commercial databases such as Oracle, DB2, and Microsoft SQL Server. Later, MySQL and PostgreSQL developed rapidly in the form of open source and became the most widely used databases in the world.
From 2003 to 2006, the publication of Google's three landmark papers on GFS, MapReduce, and BigTable laid the theoretical foundation for large-scale distributed storage systems in the industry. Popular systems such as Hadoop, Spark, MongoDB, and HBase are also built on these theories. As you can see, these data products were developed and grew under an open source model; the closed-source model, with its slow iteration speed and high unit cost, could no longer meet the needs of a large number of users.
From 2012 to 2014, Google's Spanner and F1 papers, together with the Raft paper from Stanford University, promoted the development of NewSQL databases. PingCAP's TiDB is also a productized realization of these theoretical foundations, and continues to innovate on that basis.
Database Evolution History - Driven by Business Innovation
Let's look at how to understand the "field of use" just mentioned. In general, business requirements are reflected in three aspects. The first is transaction characteristics, commonly referred to as ACID: atomicity, consistency, isolation, and durability. Generally speaking, process digitization and moving business online involve serious workloads, such as finance and telecommunications, as well as enterprise-level ERP and CRM, all of which require reliable transaction characteristics.
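The atomicity in ACID can be illustrated with a minimal sketch using Python's built-in `sqlite3` module (the table and account names here are illustrative, not from the talk): a transfer either applies both balance updates or neither.

```python
import sqlite3

# In-memory database for the sketch; any ACID-compliant store behaves the same way.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` between accounts: either both updates apply or neither."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            balance = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                   (src,)).fetchone()[0]
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

print(transfer(conn, "alice", "bob", 60))   # True: both updates committed
print(transfer(conn, "alice", "bob", 500))  # False: partial debit rolled back
print(conn.execute("SELECT balance FROM accounts ORDER BY name").fetchall())
```

After the failed transfer, alice still holds 40 and bob 60: the aborted debit left no trace, which is exactly the guarantee that serious workloads like payment depend on.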
The second is data scale, mainly reflected in the explosive growth of data brought by the Internet: the comprehensive move of user behavior online, the abundance of data collection brought by mobile devices, and the surge of created content itself, from text to pictures, animations, short and long videos, games, and the recently popular metaverse. Accelerated by the pandemic, the digital transformation of various industries has spawned a new round of data growth.
The third is processing latency. In today's mobile Internet and digitalized world, expectations for user experience keep rising, and ToC services hope to respond faster in order to compete for users' fragmented time and the business opportunities it carries. This demands faster business response, more real-time data analysis, and more agile operational decisions.
These three factors developed and combined differently in different periods, continuously spawning new data technologies; each database ecosystem is the result of a different combination of business drivers. In the early information age, the digital foundation was weak, and the main problems to solve were the accuracy and efficiency of key business. Serious business with small data volume requires strong transaction characteristics, while the data structure is stable, the rules are clear, and the amount of data is limited. This kind of demand, which calls for efficiency and stability, can be met by the relational SQL stand-alone database ecosystem.
Around 2000, we entered the era of big data. After long-term informatization, a large amount of data had accumulated, and new data grew at unprecedented volume and speed, so stand-alone relational databases gradually showed their limits. To store and analyze massive data, especially data accumulated offline, all kinds of efficient, scalable big data processing platforms deployable on low-cost hardware emerged one after another.
In the early Internet era, user content and online behavior were extremely rich, but the workload was mainly massive unstructured data (video, audio, images, social relations, and so on). The huge data scale, high concurrent traffic, and the need to respond quickly to user access with a low-latency experience drove the development of the NoSQL ecosystem. Because early Internet business was largely free rather than monetized, and the data to be processed was mostly users' browsing records and social relations, the requirements for transaction characteristics were not so high.
Entering the mobile Internet era, with the rapid growth of data volume, businesses had to complete transactions and monetize while ensuring a good user experience. Business agility requires not only that systems respond quickly to business changes and data growth, but also that they support massive, highly reliable transactions for serious matters such as payment. It can be seen that all three elements of business driving force entered the picture in this period. Some enterprises are still satisfied by moving the SQL ecosystem to the cloud, but we have also seen in practice that when the volume of user data, and especially of data updates, exceeds a certain range, natively distributed NewSQL is the choice for an advanced architecture. Now that data technology has entered the stage of comprehensive cloud service, the differences between architectures are exposed even further.
At the same time, real-time insight requires data decision-making to be upgraded from T+1 to T+0, even to second-level or millisecond-level analytical responses. Real-time aggregation of multi-source data, dynamic updates, and flexible computing are emerging needs; the demarcation between transactional computing and analytical computing is becoming more and more blurred, and technological innovation in databases and big data will continue to converge.
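The T+1 versus T+0 contrast can be sketched in a few lines of Python. The event shape and names below are purely illustrative: a batch report only becomes available after the day's window closes, while an incremental aggregator answers immediately after each event arrives.

```python
from collections import defaultdict

# Hypothetical order events; the fields are illustrative, not from the talk.
events = [
    {"user": "u1", "amount": 30},
    {"user": "u2", "amount": 50},
    {"user": "u1", "amount": 20},
]

def batch_report(events):
    """T+1 style: aggregate a full window of events in one offline pass."""
    totals = defaultdict(int)
    for e in events:
        totals[e["user"]] += e["amount"]
    return dict(totals)

class RealtimeAggregator:
    """T+0 style: keep totals up to date as each event is ingested."""
    def __init__(self):
        self.totals = defaultdict(int)

    def ingest(self, event):
        self.totals[event["user"]] += event["amount"]

    def query(self, user):
        return self.totals[user]

agg = RealtimeAggregator()
for e in events:
    agg.ingest(e)          # totals are queryable after every single event

print(batch_report(events))  # same numbers, but only once the window closes
print(agg.query("u1"))       # 50, available the moment the last event lands
```

Real HTAP systems achieve the same effect at scale by running analytical queries directly against continuously updated transactional data instead of waiting for an overnight ETL pass.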
Database Evolution History - Infrastructure Driven
Finally, hardware is the cornerstone of software, and the development of data technology is inseparable from the development of infrastructure.
From mainframes to x86 servers to cloud computing, infrastructure deployment has gone through subversive changes, from "years" to "months" to "days" to "seconds", and resources have moved from proprietary and closed to on-demand startup and elastic scaling. The cloud-native era has once again expanded the scale of resources while reducing their granularity; APIs and microservices have further pushed the speed of bringing business online and updating it to the second level.
In the future, the design of resource separation will unleash even greater power on the cloud. The database timeline above stops at NewSQL, but data technology is still evolving, and I believe the value that open source can deliver in this process will keep growing: the core behind today's cloud products comes from open source, and so does the source of innovation. Here is a bold hypothesis put forward by Tunghsu at the 2021 PingCAP DevCon: in the cloud-native era, everything that can be separated will be separated, and economies of scale will dominate everything. This separation includes the separation of storage and computing, and, more radically, the separation of storage for different purposes; business computing can be further separated into distributed computing and transactional computing. The goal is to push economies of scale and resource efficiency to the extreme, so that users only need to pay attention to the business itself and leave the rest to the cloud database.
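The storage-compute separation described above can be reduced to a toy sketch. The class names are purely illustrative: state lives only in a shared storage layer, so stateless compute nodes can be added or removed freely and scaled on a different axis from capacity.

```python
# Minimal sketch of storage-compute separation. All names are hypothetical.

class StorageLayer:
    """Shared, durable state; scaled for capacity."""
    def __init__(self):
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value

    def scan(self):
        return list(self.rows.values())

class ComputeNode:
    """Stateless worker; any number of these can attach to the same storage."""
    def __init__(self, storage):
        self.storage = storage

    def sum_values(self):
        return sum(self.storage.scan())

storage = StorageLayer()
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    storage.put(k, v)

# Compute scales independently: three nodes, one shared state, identical answers.
nodes = [ComputeNode(storage) for _ in range(3)]
print([n.sum_values() for n in nodes])
```

Because the compute nodes hold no state, dropping one or adding ten changes nothing about correctness, which is what lets a cloud database bill compute and storage on separate, elastic curves.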
TiDB Product Iteration
Under the action of these three driving forces, we can review TiDB's product iterations over the past six years as evidence.
The biggest advantage of TiDB is the openness of its technology. An open architecture means more connections can be formed, and more connections mean faster iteration and more possibilities. TiDB set out to provide a natively distributed database with solid OLTP transaction support, so that DBAs no longer work overtime sharding massive data across databases and tables; TiDB 1.0 and 2.0 solved this problem. Later, with the real-time demands brought by digitization, one-stack HTAP became our direction of effort, and with the release of TiFlash MPP this year we achieved comprehensive HTAP capabilities.
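The one-stack HTAP idea can be sketched as a toy store that keeps the same data in two layouts: point reads and writes hit a row store, while analytical scans hit a columnar copy kept in sync on write. This is only a loose analogy for the row-store/column-store split behind HTAP designs; the class and method names are invented for illustration.

```python
# Toy sketch of one-stack HTAP: one write path feeds both a row layout
# (for OLTP point access) and a column layout (for OLAP scans).

class HTAPStore:
    def __init__(self, columns):
        self.row_store = {}                        # OLTP: key -> row dict
        self.col_store = {c: [] for c in columns}  # OLAP: column -> values

    def insert(self, key, row):
        self.row_store[key] = row                  # transactional write
        for c, v in row.items():                   # replicate into columnar form
            self.col_store[c].append(v)

    def get(self, key):
        """OLTP point read: fetch one whole row by key."""
        return self.row_store[key]

    def column_sum(self, column):
        """OLAP scan: aggregate a single column without touching full rows."""
        return sum(self.col_store[column])

store = HTAPStore(columns=["qty", "price"])
store.insert("o1", {"qty": 2, "price": 10})
store.insert("o2", {"qty": 1, "price": 25})
print(store.get("o1"))            # {'qty': 2, 'price': 10}
print(store.column_sum("price"))  # 35
```

The point of the single stack is that the analytical answer reflects every committed write immediately, with no separate ETL pipeline between the two workloads.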
As a cloud-native distributed database, we launched TiDB Cloud this year, along with a Developer Tier that is free for developers to try: users can run a TiDB cluster on Amazon Web Services free for one year. TiDB Cloud takes over all background database management, such as infrastructure management, cluster deployment, and backup management, so that developers can focus on building excellent applications and switch over in seconds. All of this is built on the iteration speed and innovation that open source brings us.
Finally, I conclude with a quote from Red Hat CEO Paul Cormier in a recent TV interview: open source software is the heart of the technology behind cloud computing.
Whether to be open source, whether to do open source, and whether to use open source as an external driver of the company's continuous innovation: these, I think, are questions worth deep thought for every basic software company.