During the Russia-Ukraine conflict some time ago, Oracle announced that it had "suspended all operations in Russia".
The database is known as one of the three core technologies of the IT field (the other two being the CPU and the operating system), and it has long been monopolized by international giants. With the core in someone else's hands, this kind of risk is always hanging over us.
The only way out is to strengthen ourselves: master core database technology in our own hands and build our own domestic databases. In fact, this has been discussed in China for decades. As early as the 1980s, a "national team" built around research institutes and universities began investing in domestic database R&D, and in the 1990s several database products were launched one after another. Unfortunately, from the very beginning this R&D lacked input from the industrial side; it was driven not by real demand but purely by the goal of ownership. As a result, these products also expanded weakly in the commercial market. As pursuers, they never even caught sight of the back of the opponents they were chasing.
There is a question on Zhihu: "Has China crossed the mountain of databases?" In plain terms: is there a domestic database that is completely, independently developed? There are more than 100 answers, but reading through them, they either popularize database knowledge or promote the authors' own products; most do not face the question directly. There is really no way to face it, because we cannot honestly say that we have crossed this mountain.
Status Quo of Domestic Databases
In recent years, hundreds of domestic databases have sprung up, but how many of them have original technology?
Not many, actually! To put it bluntly: almost none!
Among these hundreds of domestic databases, the vast majority (more than 90%) are built on open source databases, and most of those (roughly 90%) are reworked from MySQL or PostgreSQL.
As the most famous open source database, MySQL is unsurprisingly used by many domestic vendors as the basis of their own products, thanks to its huge user base, strong compatibility, and rich interfaces. After all, many developers are already familiar with it, and the cost of adapting it is relatively low.
However, there are even more products packaged on top of PostgreSQL (commonly known as PG) than on MySQL. This is because PG's BSD-style open source license is very permissive: the source code may be modified and then closed, even without a copyright statement. PG has therefore become the favorite of many domestic database vendors, who package their own "original" domestic databases on top of it, including some famous vendors known for innovation. As the saying goes, "as soon as it is open sourced abroad, it becomes 'self-developed' at home"; some vendors do not even bother to modify it (or perhaps are unable to), and even the drivers are borrowed directly.
Besides the MySQL and PG camps, a small number of products are packaged from other open source databases. Some domestic databases look original, but are in fact based on an ancient open source database that has long since left the stage; because the origin is hard to recognize, they get mistaken for original work.
Besides packaging open source code, some domestic database vendors achieve "autonomy" by purchasing source code. In 2015, for example, several Chinese companies purchased the Informix source code to develop their own databases.
Most of these non-original vendors who "stand on others' shoulders" have not mastered the core technology. After all, digesting tens of millions of lines of code is no easy task; even with the source code in hand, deep transformation remains difficult, and future upgrades and evolution are still at the mercy of others. Sometimes there are even licensing and legal issues. MySQL, for instance, is now owned by Oracle, and who knows what Oracle might do to us if it becomes unhappy one day.
Fortunately, there are still a small number of commendable vendors who have implemented their databases independently from zero. A representative example is OceanBase. Born inside an Internet company facing rapidly expanding business, it became hard to keep using foreign commercial databases in terms of both cost and capacity, so there was strong motivation to shake off the dependence on foreign products and take the road of self-development. Of course, developing a database from scratch is not easy; it is the hard work of spending ten years sharpening one sword, and it is rare indeed for a vendor to persist like this.
In addition, there is another, more unusual product that has also been honed for ten years, a product that does not look like a database yet can complete a large number of database tasks: esProc SPL, developed by Raqsoft. Not only is its engineering implementation completely self-developed, even the theoretical model behind it is original. What it breaks through is not just the database itself but the theoretical framework beneath it. Such a product can fairly be called unique in China.
What is SPL? What does it have to do with databases? How well does it work? What is the theory behind it? Let's talk about it below.
Origin of SPL
The main developer of SPL is Raqsoft. You may have heard of or used Raqsoft's reporting tool, an innovative product from 20 years ago for building complex Chinese-style reports, based on the original non-linear report model theory. Reporting is a computation-heavy scenario: the data in the database is still far from the data to be presented, and many steps of complex processing are needed to get there. A reporting tool can only handle the small amount of calculation in the presentation step and can do nothing about the data preparation before the data ever reaches the tool. As a result, although mature reporting tools have solved the formatting and presentation problems, report development is still difficult.
The industry has no good solution for this. One can only write complex SQL (and stored procedures) or program in a high-level language (such as Java) inside the application, which is cumbersome and inefficient. Moreover, given the development characteristics of SQL and Java, it also brings problems such as tight coupling and difficult maintenance.
Against this background, we hoped to find a way to solve the problem of data computation being both hard to write and slow to run. After summarizing and analyzing the many data computation problems we had encountered, we found that staying within the technical system of SQL could not solve this problem; it would just be old wine in a new bottle.
The theoretical basis of SQL is relational algebra, and the fundamental reason SQL struggles with complex data computation lies in that theory. To solve the problem at its root, we could no longer build on relational algebra.
What to do then?
Since nothing ready-made was available, the only option was to invent something new: use a new theoretical model to solve the computation problem!
However, this is easier said than done. Starting in 2007, it took us more than ten years and four major rewrites before the model and architecture stabilized, forming a theoretical model of our own: discrete data sets. Based on this model we developed SPL (Structured Process Language), a programming language specifically for structured data computation; together with its storage mechanism, it can also be understood as a data warehouse product.
Since SPL adopts a new theoretical model, there were no other products on the market to reference, let alone ready-made open source code to "borrow"; it could only be written line by line. The core computing engine of SPL is therefore completely independent and original from head to toe. When even the theoretical basis is your own invention, the code can only be original. Independent enough?
Having said all that, you may notice that SPL looks quite different from a traditional database. How does it perform in practice?
SPL application effect
For big data computing tasks, SPL performs very well in practice: when implementing complex calculations, not only is the code shorter, but the performance is usually more than an order of magnitude faster than traditional databases.
A celestial-object calculation scenario at the National Astronomical Observatory: 11 photos, each containing 500,000 celestial objects (the target scale is 5 million). Objects whose astronomical distance is sufficiently close (computed with trigonometric functions) are regarded as the same object; these "identical" objects are then merged and their attributes re-aggregated.
The technical essence of this task is a non-equi join, and the amount of computation is quadratic (that is, 500,000 * 500,000 = 250 billion comparisons). The Python implementation was about 200 lines and took 6.5 days single-threaded; extrapolating from that speed, the target scale of 5 million objects would take nearly two years, which is completely impractical. The SQL implementation on a major domestic vendor's distributed database used 100 CPUs and still took 3.8 hours, with per-core speed slower than Python. The optimized SPL implementation was only a little over 50 lines; by exploiting the characteristics of the task it greatly reduced the amount of computation (far below 250 billion) and finished in just over 2 minutes, and at the target scale of 5 million the calculation can be done in a few hours, which is completely practical.
Behind this gap: limited by its theoretical model, SQL cannot express this optimization and can only watch computing resources burn; hard-coding in Python could implement the optimized algorithm, but the workload is huge and the code would be far more than 200 lines; only with SPL is the code both shorter and faster to run.
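To make the scale of the problem concrete, here is a minimal Python sketch. It is not SPL's actual implementation, and the matching radius, coordinate format and the declination-band trick are all illustrative assumptions; it only shows why a naive cross-match costs |A| * |B| comparisons, and how exploiting the task's spatial locality can prune most candidate pairs.

```python
import math
from collections import defaultdict

THRESHOLD = 1e-4  # matching radius in radians (hypothetical value)

def close(a, b):
    # Angular separation via the spherical law of cosines
    # (the kind of trigonometric distance the article mentions).
    ra1, dec1 = a
    ra2, dec2 = b
    cos_d = (math.sin(dec1) * math.sin(dec2)
             + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
    return math.acos(min(1.0, max(-1.0, cos_d))) < THRESHOLD

def naive_match(photo_a, photo_b):
    # The brute-force non-equi join: every object in one photo is
    # compared with every object in the other, so the cost is |A| * |B|.
    return [(i, j) for i, a in enumerate(photo_a)
                   for j, b in enumerate(photo_b) if close(a, b)]

def binned_match(photo_a, photo_b, cell=THRESHOLD):
    # One way to exploit the task characteristics: bucket objects by a
    # coarse declination band, then compare only neighbouring bands.
    # Matching pairs can never be more than one band apart, so the result
    # is identical while the number of candidate pairs drops drastically.
    bands = defaultdict(list)
    for j, b in enumerate(photo_b):
        bands[int(b[1] // cell)].append((j, b))
    pairs = []
    for i, a in enumerate(photo_a):
        band = int(a[1] // cell)
        for nb in (band - 1, band, band + 1):
            for j, b in bands.get(nb, ()):
                if close(a, b):
                    pairs.append((i, j))
    return pairs

# Tiny usage example with random coordinates in radians.
import random
photo_a = [(random.uniform(0, 2 * math.pi), random.uniform(-1.5, 1.5)) for _ in range(1000)]
photo_b = [(random.uniform(0, 2 * math.pi), random.uniform(-1.5, 1.5)) for _ in range(1000)]
assert sorted(naive_match(photo_a, photo_b)) == sorted(binned_match(photo_a, photo_b))
```

The binned version only illustrates the general idea of pruning candidate pairs; the specific optimization used in the SPL solution is not described in this article.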
Not only that, but in other industries, the advantages of SPL are also obvious.
In an insurance company's group-insurance statement query scenario, SPL improved performance by 2,000 times over Oracle and reduced the amount of code by more than 5 times...
In an insurance company's auto-insurance batch calculation scenario, SPL cut the RDB batch-run time from 2 hours to 17 minutes, and the implementation shrank from the original 2,000 lines of code to under 500 lines...
In a financial user's report query scenario, SPL shortened the report calculation time from 3,700 seconds to 105 seconds, a speedup of more than 35 times...
There are many such cases, and SPL has not missed in any of them: on average the speed improves by more than an order of magnitude while the amount of code drops several-fold.
There is also a performance test report, the "National Computing Database Performance Test Report" (http://c.raqsoft.com.cn/article/1564972044122): running on domestic chips, SPL can outperform Oracle running on Intel chips. This is the result of SPL's theoretical innovation (discrete data sets).
Why is SPL stronger
Having seen these results, one cannot help asking: what kind of magic lets SPL achieve them? What does the discrete data set model, the theoretical basis behind SPL, actually look like?
SPL's advantages concentrate on two points: the code for data computation is short (easy to write), and the performance is high (runs fast). Did SPL change the speed of the computer? No; software cannot change the performance of hardware. SPL is stronger because it has designed many algorithms (and storage mechanisms) that others do not have; with these algorithms, the computer performs fewer operations and thus achieves high performance. Most of these algorithms can only be implemented well on the basis of discrete data set theory.
The following are some of SPL's algorithms, many of which are SPL's original inventions.
Take the common TopN operation. SPL treats TopN as a kind of aggregation, which turns a high-complexity full sort into a low-complexity aggregation operation and also widens the range of situations it applies to.
Unlike SQL, SPL needs no sort keyword in the statement to express this operation, so no big sorting action is ever triggered. The syntax for computing TopN over the whole set or within groups is essentially the same, which is not only simpler to write but also faster to run. SQL, by contrast, can only express it with a sort clause, and whether it runs fast depends entirely on the database's optimizer. In simple cases the database can cope, but when the situation gets complicated even a veteran database like Oracle gets "dizzy". There is a detailed test case here: Performance Optimization Tips: TopN. A sketch of the idea follows below.
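As an illustration of the idea only (in plain Python rather than SPL syntax; the data and function names are hypothetical), TopN can be computed like an aggregation with a small bounded heap, never sorting the full set, and the grouped case reuses exactly the same pattern:

```python
import heapq
from collections import defaultdict

def topn(values, n):
    # TopN as an aggregation: scan once, keep at most n candidates in a
    # small heap -- roughly O(total * log n) instead of sorting everything.
    heap = []
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)

def topn_by_group(rows, key, value, n):
    # The grouped variant is the same aggregation applied per group,
    # so no global sort is ever needed.
    heaps = defaultdict(list)
    for r in rows:
        h = heaps[key(r)]
        v = value(r)
        if len(h) < n:
            heapq.heappush(h, v)
        elif v > h[0]:
            heapq.heapreplace(h, v)
    return {k: sorted(h, reverse=True) for k, h in heaps.items()}

# Example: top 3 amounts overall and per customer (hypothetical data).
rows = [{"cust": "A", "amt": 10}, {"cust": "B", "amt": 70},
        {"cust": "A", "amt": 55}, {"cust": "B", "amt": 20},
        {"cust": "A", "amt": 90}, {"cust": "B", "amt": 35}]
print(topn([r["amt"] for r in rows], 3))                  # [90, 70, 55]
print(topn_by_group(rows, lambda r: r["cust"], lambda r: r["amt"], 3))
```

The point of the sketch is that the whole-set form and the per-group form share one aggregation pattern, which is what treating TopN as an aggregation makes possible at the language level.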
We have organized SPL's discrete data set theory into a paper (http://c.raqsoft.com.cn/article/1653097658478), which rigorously defines the algebraic system of discrete data sets and describes how it differs from relational algebra.
High performance ultimately depends not on code but on algebra; code is just the means of implementation. The key is the data types, algorithms and storage models provided by the theoretical system behind SPL.
There is also an article on SPL, the database language that is simple to write and fast to run, which explains the principles of SPL's efficiency in more accessible terms. Relational algebra and SQL are like primary school arithmetic, with only addition, subtraction, multiplication and division, while discrete data sets and SPL add the powers, exponentials and logarithms of secondary school. Addition, subtraction, multiplication and division are enough for everyday grocery shopping, but building airplanes and skyscrapers requires more mathematics.
Knowing this, the earlier result, where SPL on domestic chips surpassed Oracle on Intel chips, no longer looks so surprising. Even though domestic chips still have a long way to go, building a completely independent and efficient domestic database on SPL is already realistic, and it can help domestic chips take off.
The future of SPL
Of course, SPL itself still has a long way to go. The currently released functionality targets only OLAP (data analysis) scenarios, mainly solving data computation problems. We know that besides computation, databases also handle transactions, commonly referred to as OLTP capability. In transaction-oriented scenarios, SPL will likewise solve the problems current databases face through innovation.
Innovation, again
Moving databases to the cloud is now a general trend, but simply lifting a relational database from on-premises to the cloud does not reflect the characteristics of cloud applications. A basic feature of cloud applications is the diversity of data structures. A cloud database must serve many users at once; different users may have different data structures, and even the same user's structures change over time, so a large amount of differently structured data accumulates and must be stored and computed together. This creates a contradiction between personalization (different data structures) and massive numbers of users, a problem relational databases cannot solve.
In fact, databases born 50 years ago were never designed with this problem in mind (nor could they have anticipated the needs of 50 years later), so relational algebra has almost nothing for handling diversely structured data. To solve this problem, one can no longer stay within the relational algebra system.
At the same time, the cost a relational database pays to guarantee consistency is too high: resource consumption is heavy and concurrency suffers. Yet high concurrency is a typical feature of cloud applications, so the two become an irreconcilable contradiction. The root of this problem lies in the data organization mechanism (data types), which is again determined by the underlying theory, relational algebra. To achieve both consistency and high concurrency, one must break out of the constraints of relational algebra and organize and store data in a different way.
Breaking through the theoretical limitation solves the problem at its root, and SPL (discrete data sets) arrives just in time!
This future is not far away. SPL's OLTP-oriented functionality has been polished in the lab for several years; after a further period of refinement it will emerge. By then, a domestic database built entirely on independent, original theory will truly take off.
Surpassing
At the same time, theoretical innovation may bring another result: surpassing! That is, surpassing foreign products in the database field.
We understand that for a chaser, a strategy of merely following the leader's technology is hopeless. At present the vast majority of domestic databases are still relational databases, which makes them technical followers, while the foreign giants have been doing this for decades; they are strong and deep-pocketed. We have no magical "three heads and six arms", so how can we overtake them? The only possibility would be for the opponents to make mistakes, but as one chaser among many we cannot count on all the front-runners stumbling at the same time. Hoping that some policy will shut foreign products out of the door is also rather feeble, and unlikely to happen in this open era.
That leaves only innovation!
In databases, we must do better than the opponents, much better, to have a chance of surpassing them and making up for the weaker ecosystem. To do much better, we need disruptive technology; in the face of new technology, we and our opponents stand on the same starting line.
Relational databases were invented decades ago and have long since failed to adapt to modern, more complex application requirements and more powerful hardware environments. Many seemingly simple problems are very hard to solve well, development and maintenance costs are high, computing resources cannot be fully utilized, and users have to tolerate low performance.
The relational database giants, who must answer to shareholders and maintain stable revenue, cannot simply stage a revolution against themselves, which puts them in a relatively unfavorable position. This gives products capable of innovating at the theoretical level their opportunity, and surpassing them is not unrealistic.
No matter how luxurious a carriage is, it is still a carriage; no matter how it is optimized, it is still pulled by horses. A newly invented car will of course feel awkward to operate at first, with many unsatisfactory features, but it is driven by an engine, and over time its enormous advantages will surely overwhelm the carriage.
Let's wait and see, and let's move forward!