With the development of cloud computing and big data, traditional IT applications have been profoundly disrupted, and databases, as foundational software, face both new challenges and new opportunities. At the same time, data usage scenarios are diversifying and data volumes are growing explosively, which places far higher demands on the storage and computing capabilities of databases.

In more and more scenarios, traditional databases built on a single physical server node will be replaced by distributed databases, and the application prospects of distributed databases will keep widening.

The fourth episode of "Dialogue ACE" focuses on the challenges and opportunities in the future development of distributed databases. It invites Zhou Yanwei, founder & CEO of Jishu Yunzhou and Oracle ACE Director, and Wang Nan, Senior Director of OceanBase's Products and Solutions Department, to jointly explore the future development of distributed databases and promote the healthy development of distributed database technology.

The following is the transcript of the conversation:

📣Wang Nan, Senior Director of OceanBase Products and Solutions Department

Q: OceanBase is a fully self-developed domestic distributed database. What difficulties did it face at the beginning of its design?

OceanBase was not designed from day one to solve every long-term problem. There was a process of decision-making and project approval: what problems could be solved, what value would be generated, how much investment would be required, and how long it would take. In the early stage, problems were solved incrementally, scenario by scenario, and the overall understanding was built up in stages.

The first stage originated in the Taobao scenario, where the core problem was data scale and data volume, including access throughput, so that the system could keep pace with continuously growing business. The first step was to solve the distribution problem itself, which produced a distributed NoSQL system without a relational model. After the Taobao scenario was handled, more business scenarios brought scalability demands. Many of those scenarios were originally backed by relational databases; some could be handled by reworking the application layer, but for a large number of general-purpose scenarios the relational model remained a hard requirement.

The second stage, after the distribution problem was solved, was to gradually complete the relational model and support SQL capabilities. As OceanBase moved from Ant and Alipay into the general market, it was clear that many business scenarios were built on the MySQL open-source ecosystem or the Oracle commercial ecosystem, so how to migrate those applications to a new distributed database was a real problem.

The third stage was achieving good compatibility. For users, what are the costs at the application layer, including during migration? After solving MySQL compatibility, OceanBase went on to full Oracle compatibility, covering syntax, semantics, stored procedures, and so on.

The key problem and challenge that comes later is that as data volume grows to a certain scale, demand for AP (analytical processing) capability becomes quite strong. HTAP capability is the key difficulty and technical challenge at the current stage. Likewise, with the development of cloud computing, many customers will demand support for cloud infrastructure and even cross-cloud, heterogeneous deployments.

So the technical difficulties were not all foreseen at design time; they were discovered gradually in the process of supporting and serving customers.

Q: In financial scenarios such as banking and insurance, many companies have made distributed databases their first choice for core systems. In which other scenarios will we see similar situations in the future?

In the past, OceanBase was applied mostly in financial scenarios. In the last two years, beyond finance, we have started deploying it in many enterprises in the general market. Many factors influence this question; today I will mainly discuss it from the perspective of product and technology.

Among users who have already adopted distributed databases in core scenarios, the reasons fall into several categories, and each user chooses based on different demands and considerations.

First, users with core-system migration needs. These users are often not driven by business demands: in this scenario, the choice of architecture, centralized or distributed, is not a particularly strong constraint. They care more about compatibility, and about whether the switch away from Oracle can be a cost-effective and smooth migration.

Second, users solving continuity problems. Continuity is becoming one of the important drivers, as database procurement and database service capabilities face growing challenges.

Third, users with real business demands. The choice ultimately returns to the business, or to the market: is the product economical, cost-effective, and correct? From this perspective, either rapid growth in the customer's business volume creates demand for scaling out the database or data-management layer, or the data volume is simply too large for a centralized architecture to meet future business needs. The demands of such users are rigid and urgent, and they are the core factor that can truly drive users to commit to a technology upgrade.

In this process, beyond the database's own capabilities (whether its features, performance, specifications, and security are sufficient), the key factor users care about is the cost of application migration. A database is very different from ordinary application software: replacing it has a large impact and cost across many complex business systems.

So "when to choose distributed" and "when to choose a new database" are really two different questions. When choosing any database, everyone should consider whether data security and specifications can be met. But choosing distributed requires some core demand: either the original single-node system simply cannot meet requirements, or you are making a technical upgrade and reserve for the future. Which scenarios end up choosing distributed databases must, in the end, return to the real choices of the market itself.

Q: Do you have any suggestions for the operation and maintenance of distributed database products, or for large-scale intelligent operation and maintenance?

The larger a product's application scale, the higher the demands and challenges. When a product is not deployed at scale, almost any approach, even purely manual operations, can keep it running. Once the volume grows, operation and maintenance becomes a big challenge, especially for database products, which differ greatly from ordinary applications and are affected by many factors at runtime, such as software and hardware failures. That is why a professional group like DBAs exists specifically for database operations. After decades of development, traditional databases have formed a relatively mature DBA community. On top of that, distributed databases bring challenges that centralized databases do not.

The first is the technical architecture, which itself brings higher complexity. Compared with a centralized database, a distributed one places higher demands on high-availability failure recovery and on tuning execution plans. For a new product and architecture, it takes a long process to learn and understand it, form a shared understanding, and become proficient in using it. This is not specific to any one database; the distributed architecture and its technical characteristics determine that such a process is unavoidable. We have also practiced and explored this operations problem at several levels.

The first level is product capability. Beyond the database kernel, really using a product requires the tooling around it: cluster management, monitoring, full-link diagnostics, automatic optimizers, tuning functions, and self-service operations. Solving this problem with tools is critical. Of course DBAs and people are still needed to cultivate these abilities, but better and more complete tooling, backed by better supporting technology, is an enormous help. So in terms of product capability, we have to build that foundation.

The second level is user-side understanding, including building up a training system for operations personnel. OceanBase is now cultivating users through open source, and we encourage everyone to use it, to learn the database's characteristics and the problems they will encounter: the more people use it, the more people will understand how to operate it. At the same time, there are training certifications such as OceanBase's current OBCA/OBCP/OBCE, a full certification system for different groups and levels, so that more people can quickly learn, use, and understand this distributed database. This is not only for OceanBase; it cultivates a talent base for the operation, maintenance, and tuning of distributed databases in general.

The third level is something we gradually realized in the process of commercialization and delivery: relying on ourselves alone for database operations is far from enough. Even for a product and company as mature as Oracle, operations are largely handled by third-party professional service companies and the DBA community. Among distributed databases, OceanBase is a start-up and not yet large; once it is deployed at scale across more industries and user scenarios, the original vendor cannot support all services by itself. So for database operations, I think we must comprehensively leverage the capabilities of third-party professional service companies, the ecosystem, and DBAs.

In addition, there are further explorations, such as intelligent capabilities including AI, which is a direction we are exploring for the future. At this stage, though, the three levels just mentioned are the ones we need to focus on.

Q: Many emerging database products in China draw on Google Spanner's papers. What do you think are the gaps between domestic and foreign distributed database technology?

In the past, under the traditional database architecture, we faced business challenges such as the rapid explosion of data scale and growing analytical demands. Those scenarios inspired a lot of different technical innovation, and that kind of breakthrough in thinking is of great significance. Take Google as an example: its work originated in its own business, because Google is global, cross-regional, and operates at a very large scale. In solving its own problems, based on its own scenario demands, it accumulated technical solutions and capabilities, and then published some of them.

Another example is the Spanner architecture, which is one direction for solving such problems, but its actual effectiveness is hard to judge. At the very least it does not seem to be at a very complete stage, and the commercialization of Google's own databases can hardly be called a clear success. As for some of its core technologies, such as the TrueTime API and global-scale capabilities, they are not necessarily applicable to all scenarios.

The demands and scenarios of different enterprises vary greatly. Whether domestically or globally, the promotion of distributed databases is mainly driven by a few big cloud companies. Large cloud vendors share certain characteristics with Google's own business, namely sufficiently large scale and heavy concentration of data, and such scenarios do have core demands for distributed databases, but they also carry a strong vendor bias. How to meet diverse scenarios is of very important practical significance.

So should distributed databases be built around the data-centralization demands of cloud vendors, or focus on Internet companies, or go all-in on cloud? Actually, no. In China, for example, there is better soil for promoting and developing diversified scenarios, rather than everyone crowding onto the same road of global-scale or public-cloud-native solutions.

Overall, our technology gap with foreign countries is not that big, because the scenarios to be solved are different: we have our advantages, and they may have had some early-stage advantages. At this stage, in terms of technological advancement, technical strength, and capability, we are confident enough to face any scenario, including today's large-scale domestic challenges and the demands of globalization; we are confident we can solve these problems.

Q: Under the current trend of cloud native, what are the breakthrough directions in the development of distributed database technology?

Databases today are in a state of a hundred flowers blooming. What we insist on is building a distributed relational database that is transparent to user applications. That sounds simple, but it contains several key elements.

First, application transparency. Problems should not need to be perceived or solved at the application layer, and this is not a middleware solution; the problems are solved at the database layer. In other words, leave the complexity to the database and the simplicity to the user.

Second, strict ACID guarantees. The database must ensure the correctness of data. In other words, OceanBase has always built HTAP on a foundation of strict OLTP, continuously supporting and expanding our AP analysis capabilities on top of it, not supporting strict OLTP on top of an AP base. This is one of the core elements of our insistence in the distributed direction.

Beyond these two technical challenges, I want to raise another topic. As distributed databases move toward market application, the biggest problem and challenge is working out what problems customers have actually encountered and what needs to be solved. Beyond the technical competitiveness and advantages of distributed databases, vendors really need to come back down to earth and get everyone actually using them. There are several key points here.

First, for large-scale distributed transactions, transaction consistency must be supported and guaranteed. Under normal conditions, I believe many systems can do this; but in the event of software or hardware failures and other abnormalities, recovery must be done well, and the recovery process must not affect the business, which is a huge challenge for a production system. From the user's point of view, a great concern is whether the distributed technology stack can solve this problem.
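To make the point concrete, the coordinator-participant pattern behind many distributed commit protocols can be sketched as classic two-phase commit. This is a didactic toy under simplified assumptions (in-memory state, no timeouts, no coordinator crash recovery), not OceanBase's actual protocol; all names are illustrative. It shows why the failure path, not the happy path, is where the difficulty lies: one unhealthy participant forces a global abort.

```python
# Toy two-phase commit (2PC). Illustrative sketch only: real systems
# must also persist participant state, handle coordinator crashes,
# and recover in-doubt transactions -- the hard part discussed above.

class Participant:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.state = "init"

    def prepare(self, txn):
        # Phase 1: a participant votes yes only if it can durably
        # hold the transaction; a failed node cannot vote yes.
        if not self.healthy:
            return False
        self.state = "prepared"
        return True

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(coordinator_log, participants, txn):
    votes = [p.prepare(txn) for p in participants]
    if all(votes):
        # The decision must be made durable before telling anyone.
        coordinator_log.append(("commit", txn))
        for p in participants:
            p.commit()
        return "committed"
    coordinator_log.append(("abort", txn))
    for p in participants:
        p.abort()
    return "aborted"
```

A single unhealthy participant turns the whole transaction into an abort, and every already-prepared participant must be rolled back without disturbing other traffic, which is exactly the recovery burden described above.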

Second, whether migration can be smooth. When a large number of applications are migrated from the original systems to a distributed database, is there a general solution? Another reality is that we need to serve many different industries and scenarios, and the migration of massive numbers of applications is also something many large users pay close attention to.

Third, if we broaden the scope, the application scenarios of different users, and even of a single user, are very complex. They may involve different infrastructure (private cloud, public cloud, hybrid cloud), and a large number of systems cannot be switched all at once; that risk is too costly, so migration must be incremental. Such scenarios bring demands for supporting different infrastructures and heterogeneous deployments, which in turn requires the database layer to provide a consistent capability experience and consistent guarantees.

Finally, I want to talk about HTAP, that is, whether TP and AP capabilities can be integrated. At first the two were not actually separated; later, with the growth of data volume and stronger analytical demands, the original database could no longer cope, so they were split. But the current architecture brings many problems, including forcing users to build different business systems on top, one role for daily production transactions and another for analysis.

In terms of application complexity and customer cost, this returns to the essence of IT: it is uneconomical in computing power and storage consumption. So a solution that integrates TP and AP, with good isolation of workload resources so they do not affect each other, has huge practical and economic significance. This is also where OceanBase will focus its investment next, and its key breakthrough point.

📣 Zhou Yanwei, founder of Jishu Yunzhou

Q: How can one quickly understand what a real distributed database is?

To answer this question succinctly, it can be split into two parts: what is "distributed", and what is a "database".

First, a database solves the problem of storing and computing over data. Second, "distributed" means using multiple resources to handle the storage and computation of large-scale data, spreading what a single server would do across multiple servers. That is the literal meaning of a distributed database, but is such a product really distributed? I don't think that is the whole story; by itself it is only an academic, theoretical product that does not start from real demand scenarios.

Beyond allocating computing power and storage capacity, a distributed database must also be examined against the capabilities and demands of real needs. Distribution can also be understood as the allocation of all kinds of resources, including asymmetric allocation. Most of the time people focus on data storage and computation allocated across peer nodes, but some products have different requirements. For example, in edge computing, results must be computed directly at the remote edge and then associated with the central end. In such a scenario the allocation of resources is obviously unequal, yet distributed database technology is still needed to combine data from the edge to the center and then perform further computation and summarization. Is that a distributed database? If it is designed well, it is of course a kind of distribution; that is the first example. The second example is the distribution of unequal computation. As with HTAP, which spans many computation types, data is kept in different storage formats, and different environments use different structures with different computational requirements or algorithms. A distributed system then has to solve not only the allocation of computing power but also the allocation of different operations and algorithms. If such a system is well designed, it should also fall within the category of distributed databases.

Q: Distributed databases have been very popular in recent years, but some mainstream overseas database vendors (Oracle, IBM) have not pushed them vigorously. What do you think of this?

In the past, the core products mainly promoted by IBM, Oracle, and SQL Server were still standalone versions; distributed databases were never promoted or developed as their mainstream. From another perspective, regarding the growth in data volume, I think it comes mainly not from people but from machines and devices.

There is no essential link between distributed databases and data volume, so why don't the mainstream vendors do this (distributed)? I have thought about it. Teams focused on building distributed databases fall roughly into two categories. One is the idealistic, somewhat academic team, which starts from theory, builds a distributed database, and hopes to see what kind of result, or what better product, the database will yield in the future. The other type doesn't know much about databases, or is a newborn calf unafraid of tigers: they draw the grand blueprint first, claim to know the outcome, and meanwhile burn other people's money.

Why do I say that? A database must first ensure that data is absolutely correct, and only then consider performance and other capabilities. But "distributed" means guaranteeing the consistency of computation and data across networked resources, and these guarantees are foundational database capabilities. So mainstream database vendors, out of business need and social responsibility, must first fully ensure the correctness of data storage, while also weighing performance and the input-output ratio, and only then consider whether to go distributed. In other words, if single-machine deployment or a cluster can, to a sufficient extent, solve data computation and storage allocation, or if the problem can be solved without a distributed architecture inside the database kernel, then better returns can be had on data security and consistency assurance. That may be the important reason mainstream databases have not become distributed databases.

With the continuous development of hardware, networking, and storage, resource allocation itself splits into several kinds of distribution. From the computing-power perspective, allocation can be based on CPU/GPU, but limited by Moore's law, single-machine gains will gradually shrink while cross-network distribution of physical hardware will gradually grow. Networks already offer high-speed access such as RDMA. I don't know whether CPU/GPU hardware will be fused across machines in the future, but I have seen progress on the storage side. We used to use RAID to combine multiple disks into one block and deploy the database directly on that single machine; now there is a category of unified storage that spans networks and machines.

That is to say, distributed storage may be solved from the hardware side, so you need to consider carefully whether your database needs to solve this problem at all, because the essence of a database is to guarantee the storage and computation of data, above all the correctness of that storage and computation. It may look as if Oracle, IBM, and the others are reluctant to implement distribution at the software level, but this may be exactly the consideration behind it.

If, in the future, CPUs and memory can also be unified across hosts through intelligent networks and intelligent switches, that would be a hardware-level distribution. With such resources, you could focus on the storage-and-computation problem itself and not necessarily consider distribution in software. Although this is still immature, I have seen research under way, and such products may appear. So I do think distribution will happen, but as I said, the demand for unequal and heterogeneous computation may be the new challenge for databases. This is what the Data Fabric we are building is meant to solve: a generalized distributed system that spans environment dimensions and data dimensions.

This debate may continue for a while. Software and hardware iterate on each other to drive development; when hardware distribution is done well enough, software distribution will have to prove itself, though at a different scope. I believe that in the near future, generalized distribution will dominate.

Q: Distributed databases come in several forms (middleware, NewSQL, distributed architecture + enterprise-level kernel, compute-storage separation). Can you analyze the advantages and disadvantages of these forms?

This is a very fundamental topic. From my point of view, we can analyze the advantages and disadvantages by looking at where distribution happens. I personally think distribution occurs at three levels: user interaction, data computation, and data storage. Let us look at the three levels at which distribution takes place:

First, the routing layer. When SQL comes in, it must be dispatched to the right database; with middleware, that is what the middleware's routing layer solves. The database-and-table sharding, middleware, and even data middle-platforms we built long ago all had such a middleware component to handle SQL dispatch, allocation, data caching, computation, and result reassembly. That is distribution at the routing layer.

Second, the computing layer. If the routing layer is the earliest and most rudimentary form, then distribution at the computing layer truly touches the core of the database; the technical capability required, and the time needed for the software to mature into a complete system, are much greater.

Third, the storage layer: the shared storage just mentioned, or the earliest cloud-native idea, such as the theoretical system represented by AWS. I think its difficulty sits between the routing layer and the computing layer, and it is the balanced result of weighing security and performance. Let us analyze the characteristics of these three forms.

As for the routing layer: more than ten years ago, companies like Renren and Taobao used the routing layer as their main form, and the earliest open-source middleware was also released by Alibaba; that is the earliest form. Some commercial middleware of this kind still exists today, but I think it was a product of an early stage of technical development, solving the problems of that time. Unable to find a better way to allocate resources, people sharded databases and tables on the back end and coordinated them with middleware. Once technology breaks through the barriers of that era, this solution will quickly be eliminated, because middleware has to solve fused computation over various data, caching, compatibility with various syntaxes, and data consistency; these are its weak points, and some of them cannot be solved at all. So we will see less and less distribution based on this architecture, and it will gradually be swept away by the tide of technology. Middleware-based distribution has no future.
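The routing-layer form described above can be sketched in a few lines. This is a hypothetical toy, not any real middleware's API: a router hashes a sharding key to pick a back-end shard. Single-shard operations are easy, while a query without the sharding key degenerates into scatter-gather in the middleware, with no cross-shard transactions at all, which is exactly the weakness noted above.

```python
# Minimal sketch of "sub-database, sub-table" routing middleware.
# All names are illustrative; dicts stand in for back-end databases.

class ShardRouter:
    def __init__(self, shards):
        self.shards = shards  # list of dicts acting as databases

    def shard_for(self, key):
        # Hash-based routing: same key always lands on the same shard.
        return self.shards[hash(key) % len(self.shards)]

    def insert(self, key, row):
        self.shard_for(key)[key] = row

    def get(self, key):
        return self.shard_for(key).get(key)

    def scatter_gather(self, predicate):
        # A query without the sharding key must fan out to every
        # shard and merge results in the middleware: the expensive
        # case, with no transactional guarantee across shards.
        return [row for shard in self.shards
                for row in shard.values() if predicate(row)]
```

The fan-out path is where consistency, caching, and syntax compatibility all have to be re-implemented above the databases, which is the structural limitation of this form.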

The other two should be the main development directions in the distributed field: distribution at the computing layer and distribution at the storage layer. Distributed computation at the computing layer, with sharded storage underneath, can be called the ideal distributed form, as it solves resource allocation in the ideal way. I think OceanBase is a form of this kind of distribution. That is its advantage, but it also has disadvantages. Where? We can examine how a database computes over data: the core demands are correctness of computation, security of data, and efficiency of computation, and only secondarily how resources are allocated.

If I allocate resources better while guaranteeing data security and efficiency, I must sacrifice something. Doing thorough distributed transactions means that in some extreme cases performance cannot match a single machine; it will not exceed a single machine. A simple computing model like MySQL can be faster than a distributed one precisely because it is simple, hence efficient, and can be polished further. So distribution based on this computing form can reach a relatively ideal state, but the complexity of implementation rises greatly, which increases the complexity of the database itself and the difficulty of maturing it, and may affect data security and computing performance in some cases.

Linking back to the previous topic: why doesn't Oracle build such a product? Why is IBM's DB2 still mainly a single-node database? I think real database people weigh absolute data security against performance improvement and choose the best trade-off on cost-effectiveness.

As for distribution at the storage layer: it is an ideal for now, and although not the ultimate form, it is a very smart approach. If we want to scale database capacity, it must be split into two directions, computing and storage, which gives us the concept of compute-storage separation, because the two really can be decoupled. To support truly large-scale data, compute-storage separation must come first; only after it is achieved can we consider what to do at the storage layer and the computing layer respectively. Only under such conditions can distributed storage at the storage layer behave just like the original single machine, with the upper layers no longer having to worry about data safety and immediacy.

Of course, this does not rule out building more ideal and extreme products. But technical R&D must be combined with actual needs: if the need is real, we build it. Conversely, a product may be excellent while actual demand is small; it may then persist as a research product waiting for future application. From a commercialization perspective, though, it is not only the pursuit of technical perfection that matters, but whether the technology covers the demand while ensuring efficiency and safety. I think this may also be how many established database vendors design their products.

In addition, combining this with predictions about hardware trends: if resource allocation is solved at the hardware level, then there is every reason not to ask for trouble in software, but to do the database's own work on top of the hardware implementation.

As for the several forms of distribution, I don't think there are many beyond these three levels. If the levels are distinguished this way, each level's characteristics are fairly distinct and its advantages and disadvantages clear at a glance. Which product to deploy therefore depends on actual needs and on the stage of development.

Q: What is the value of the separation of storage and computation of distributed databases for business?

My view here may be extreme: a true distributed database requires separation of storage and computing. Storage-compute separation is a means; its goal is to solve the distributed problem. Only when storage and computing are separated can we examine, from the "storage" perspective, issues such as distributed scaling, capacity expansion, high availability, and replicas, and, from the "computing" perspective, how to achieve elasticity, improve resource utilization, allocate on demand, and compute in parallel.

The business does not need to know where content is stored, because after storage-compute separation that part is transparent to the user. Transparent high availability is also built into the whole system. In a traditional architecture where each node carries both storage and computing, it is hard to switch a business over. After separating storage and computing, a serverless-style architecture becomes possible: the switching logic is very simple, and the business layer does not need to consider node failover at all. This is very valuable for business development, operations, and scaling. So my view is, again, extreme: only when the computing layer and the storage layer are separated, each distributed in its own way, is a system truly a distributed database that delivers value to the business.
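The failover argument above can be made concrete with a minimal sketch (hypothetical names, not OceanBase's actual architecture): once compute nodes hold no state, "switching" is just routing a request to another node, invisible to the business layer.

```python
# Illustrative sketch of storage-compute separation (hypothetical design,
# not any real product's API): compute nodes are stateless, so failover
# is simply re-routing; all state lives in the shared storage layer.

class StorageLayer:
    """Replicated, highly available storage; holds all state."""
    def __init__(self):
        self.data = {}
    def read(self, key):
        return self.data.get(key)
    def write(self, key, value):
        self.data[key] = value

class ComputeNode:
    """Stateless: any node can serve any request against shared storage."""
    def __init__(self, name, storage, healthy=True):
        self.name, self.storage, self.healthy = name, storage, healthy
    def query(self, key):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return self.storage.read(key)

class Router:
    """The business layer never sees which compute node served it."""
    def __init__(self, nodes):
        self.nodes = nodes
    def query(self, key):
        for node in self.nodes:
            try:
                return node.query(key)
            except RuntimeError:
                continue  # transparent failover: just try the next node
        raise RuntimeError("no healthy compute node")

storage = StorageLayer()
storage.write("balance", 100)
n1 = ComputeNode("node-1", storage, healthy=False)  # simulated failure
n2 = ComputeNode("node-2", storage)
router = Router([n1, n2])
print(router.query("balance"))  # served by node-2; the caller never notices
```

In a coupled architecture the failure of node-1 would lose the data it carried; here the router's retry loop is the entire failover story.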

Q: What aspects should you pay attention to in the study and practice of distributed databases?

I think there are two points. First, know not only what it is, but also why it is so. Second, true knowledge comes from practice.

On knowing the why as well as the what: the more optimized and simpler a complex system appears, the more is hidden inside it. All the specialized machinery is encapsulated, so if you only ever use it at the surface, you may not be able to fix it when it goes wrong.

For example, when I was at Qunar.com, I enthusiastically promoted a MySQL component called Galera, because for a period of time it solved multi-node synchronous writes in a small and elegant way. I found it very valuable, yet I have seen many people complain that the architecture is hard to use, that the open-source component is hard to use, and ask what the problem is. But I ran hundreds of clusters online without issue, and later wrote it up in the book "MySQL Operation and Maintenance Internal Reference".

Why is that? Beyond the product's own capability, it depends on the user's understanding and mastery of it. If you only work at the surface, it is hard to use any product well. So even when a distributed database is extremely complex and heavily encapsulated, we still need to understand its essence.

Second, practice produces true knowledge, which means giving yourself the chance to make mistakes. Talking on paper is useless; you have to actually run it for a year or two, and if possible in a live environment, choosing one that will not affect your online business, to understand the differences between database products. A database is a dynamic system in continuous operation, and in operation it will sooner or later develop problems. Only by working through those problems will you be driven to read the source code, know where to start reading, and gain a deeper understanding of the product and the system itself.

Q&A

Q: Does the distributed database have a particularly good guarantee for the availability and integrity of the database, and can it be widely used in the financial and operator industries like Oracle?

Wang Nan

This question goes to our very concept and positioning; solving it is our goal. Put simply, OceanBase aims to enter the enterprise's core scenarios and then build, layer by layer inside the database, everything those applications demand: functionality, performance, security, and guarantees of data consistency, integrity, and correctness.

Many middleware-based solutions impose dependencies and constraints on the application, for example two-phase commit for cross-node transactions. When a fault occurs, transaction consistency is not fully guaranteed, including residual state left by rollbacks, and correctness can only be ensured by constraints or handling at the application layer. OceanBase will never throw this problem back to the user; it solves it at the database level. In the core scenarios of finance and telecom operators, the key requirements are data reliability, security, and correctness, and compared with Internet scenarios the tolerance for error and the expected maturity are different. That maturity spans the whole lifecycle of the OceanBase database: installation and deployment, development, operation and maintenance, monitoring, and fault handling.
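The "residual state" problem described above can be sketched in a few lines (a toy model of classic two-phase commit, not any specific middleware's protocol): if the coordinator fails between the prepare and commit phases, every participant is left holding a prepared transaction that it cannot decide on its own, and cleanup falls to the application.

```python
# Toy two-phase commit illustrating the residual-state problem the answer
# describes: a coordinator crash after phase 1 leaves participants
# "prepared" but in doubt, holding locks until someone resolves them.

class Participant:
    def __init__(self, name):
        self.name = name
        self.state = "idle"   # idle -> prepared -> committed / aborted
    def prepare(self):
        self.state = "prepared"
        return True           # vote yes
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants, coordinator_crashes_after_prepare=False):
    # Phase 1: ask every participant to prepare (vote)
    if not all(p.prepare() for p in participants):
        for p in participants:
            p.abort()
        return "aborted"
    if coordinator_crashes_after_prepare:
        # Residual state: everyone is "prepared" but nobody knows the
        # outcome, and no participant may decide unilaterally.
        return "in-doubt"
    # Phase 2: broadcast the commit decision
    for p in participants:
        p.commit()
    return "committed"

ps = [Participant("shard-a"), Participant("shard-b")]
outcome = two_phase_commit(ps, coordinator_crashes_after_prepare=True)
print(outcome, [p.state for p in ps])  # in-doubt ['prepared', 'prepared']
```

When 2PC is driven from middleware, resolving these in-doubt transactions becomes the application's burden; a database that owns the transaction protocol can resolve them internally.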

Q: Does Arkcontrol plan to access OceanBase?

Zhou Yanwei

Arkcontrol is a component of our product system and serves two purposes.

One is to provide users with a management layer for our entire product system, responsible for the operation and maintenance, monitoring, data backup, and other management of the Yunzhou Data Jingwei Platform (DTArk).

The second is to meet users' needs directly, solving the problem of unified management of heterogeneous databases on the user's side. Besides our own products, users may be running MySQL, MongoDB, Oracle, and Redis, and we can give them one platform to manage them all. It is a convenience tool, and its core value is management.

As for whether to support OceanBase, I think that has to come back to the market; we will decide after seeing what products users have deployed. If many users have deployed OceanBase in their IDCs, then for their convenience we would also hope to support OceanBase systems. I have always held the view that one should go deep inside to do things, rather than float on the surface and the periphery. Of course, in the course of supporting it, we hope OceanBase can open up more interfaces to us so that the management can be done well.

Q: Distributed databases have very high hardware requirements. Is there any plan for a reduced-spec version that can run on ordinary PCs?

Wang Nan

To be honest, we have been paying attention to this, and we have been working on it for a while. First of all, on the public cloud there are a large number of small and medium customers with exactly this rigid need. Small customers in particular do not really need especially powerful servers to solve their current problems, and there are also many developers and individual users who want to run a small scenario, or even just test and experiment on their own hardware.

We are working on this in several directions. The first is resource consumption: OceanBase will launch products with lower resource requirements. The current requirement is roughly 8 cores and 64 GB of memory; this year we will bring it down to 4 cores, and in the future we will launch 2-core and even 1-core editions. At the same time, to let more customers use and experience it conveniently, we are also exploring single-node deployments.

In addition, on the public cloud, multi-tenancy can address the cost problem: if a large number of small users each run exclusively on dedicated resources, the cost overhead is high, whereas for non-core business a customer can relax some resource demands in exchange for economic savings, and multi-tenancy fits this well. We already have the multi-tenancy capability in the kernel, and next we will build resource isolation for storage, computing, and memory. In the future we will also release this as a pay-as-you-go cloud offering so users can experience it quickly. To summarize: the work of reducing specifications for these small-spec demands is already under way, you will see great progress this year, and OceanBase will continue investing in it.
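The economics of multi-tenancy described above can be sketched as a simple admission check (a hypothetical model, not OceanBase's real resource-unit configuration): many small tenants with capped CPU and memory share one cluster that none of them could justify alone.

```python
# Hypothetical sketch of tenant resource units (illustrative only, not
# OceanBase's actual configuration syntax): multi-tenancy packs many
# small capped tenants onto one shared cluster.

CLUSTER = {"cpu": 16, "memory_gb": 128}
tenants = {}

def create_tenant(name, cpu, memory_gb):
    """Admit a tenant only if its quota still fits in the cluster."""
    used_cpu = sum(t["cpu"] for t in tenants.values())
    used_mem = sum(t["memory_gb"] for t in tenants.values())
    if used_cpu + cpu > CLUSTER["cpu"] or used_mem + memory_gb > CLUSTER["memory_gb"]:
        raise ValueError(f"cluster cannot fit tenant {name}")
    tenants[name] = {"cpu": cpu, "memory_gb": memory_gb}

# Small tenants (matching the 1C/2C/4C specs mentioned above) share
# hardware that none of them alone would fully use.
create_tenant("dev-sandbox", cpu=1, memory_gb=4)
create_tenant("small-app", cpu=2, memory_gb=8)
create_tenant("mid-app", cpu=4, memory_gb=32)
print(sum(t["cpu"] for t in tenants.values()))  # 7 of 16 cores allocated
```

The real engineering work, of course, is the enforcement side mentioned in the answer: isolating CPU, storage, and memory so that one tenant's load cannot spill into another's quota.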

Q: How to understand the intermediate state of HTAP?

Zhou Yanwei

There may be many ways to understand it. As product designers, we should start from a higher goal. What is the lofty goal? HTAP is the fusion of TP and AP, a combination of two basic database capabilities. If you design a product that way, is that the ultimate state? Obviously not, because beyond TP and AP, data computation also includes edge computing, search computing, graph computing, and so on, and it is precisely these computing requirements that have produced the many databases on the market. HTAP may be a 1+1 model that joins the original TP side with the later big-data-based AP side. Can three be combined into one, or four into one? The design of a product determines what it can become: if you only consider the two-dimensional intermediate state of HTAP, it will be very difficult to extend the product later.

Therefore, in terms of product design, if we can solve the product form for multi-dimensional computing, we can then look back at HTAP: it is an intermediate state, and a very important step from one dimension to two. But this step is not an end, only a starting point; there is still much to do. From this perspective, when designing a product it is best to build a framework that can support two, three, or even more dimensions. The technical person must in turn become a product manager, analyzing requirements and product realization at the same time. If you focus only on the current dimension, then when a higher product requirement arrives later, much of the code you write now may have to be thrown away and redone. We are reluctant to do that, so we must think more comprehensively and build a larger framework that implements the two-dimensional case while staying compatible with future multi-dimensional needs. That is the ideal scenario.

Our DTArk is built on this kind of thinking. At present it has realized fused TP/AP/FP computation, and graph computation is expected to be supported soon. This multi-dimensional, extensible architecture is completely different from setting out to implement just an HTAP; put another way, if today's HTAP products want to fuse more dimensions, they may have to be torn down and rewritten.

Q: Is there a recommended distributed database that serves the whole world from one deployment, the kind that covers Asia, America, and Europe?

Wang Nan

I suspect the questioner is not merely asking whether some database can be installed in those regions. "Global server" probably means: is there a database on the public cloud that can provide consistent capabilities across regions worldwide? Demands for this kind of global application and data deployment are indeed quite challenging today.

There is also an implicit question here: can one and the same cloud provide such capabilities in all regions of the world? Many customers do not want to be bound to a particular cloud. If the question is simply whether any distributed databases or cloud services offer global coverage, there are actually quite a few, with some differences in capability; the several large cloud vendors all have offerings, including RDS, shared-storage services, and AP products. If the demand is cross-cloud, however, the customer will not be bound to one cloud, which is especially true of large customers.

Building your application infrastructure on only one cloud actually carries many risks: technical, business-security, and cost risks. Many people have now raised cross-cloud demands, but not everyone can meet them, because what a cloud vendor solves is how to provide global cross-region services on its own infrastructure. For cross-cloud, independent database products and vendors may still be needed. OceanBase is considering and working on this, and you will soon see products and services from us.

Zhou Yanwei

From a purely technical point of view, the answer is certainly yes: relying on the distribution of the network, and setting aside performance tolerance and time cost, distributed global deployment poses no fundamental problem. In other words, I think the bottlenecks of global deployment are the network and permissions; cross those two points and it can be done.

In actual operation you really must consider not only crossing data centers, clouds, and continents, but also performance, and combining those seems hard today, because the network is latency-bound and the speed of light is hard to break through. So the pragmatic alternative is to do only shared storage within the distributed system, or to do both distributed and shared storage, depending mainly on the business's timeliness or performance tolerance. When we implemented Data Fabric products like DTArk, we also proposed an innovative technique we call the data shuttle. The idea is simple: you may not need full-volume, large-scale synchronization; often only a certain batch of data needs to travel. You do not need globally consistent synchronization, with its large amounts of time wasted waiting on the network; instead you can move data across domains or regions along different dimensions for particular data, business, and computing scenarios. What matters then is not instant synchronization but eventual consistency, not instant computation but computation over a configured time window. The key point, I think, is the network: network latency is the essential problem underlying all these requirements.
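The data-shuttle idea as described can be sketched roughly like this (a hypothetical toy, not DTArk's implementation): rather than replicating everything synchronously worldwide, ship only the subset of changes a remote region cares about, batched per time window, accepting eventual rather than immediate consistency.

```python
# Hypothetical sketch of the "data shuttle" idea described above:
# ship only the keys a remote region needs, batched by time window,
# instead of synchronously replicating the full dataset worldwide.

class Region:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.changelog = []  # (timestamp, key, value)
    def write(self, ts, key, value):
        self.data[key] = value
        self.changelog.append((ts, key, value))

def shuttle(src, dst, window_start, window_end, keys_of_interest):
    """Apply only interesting changes from one time window, in order."""
    for ts, key, value in src.changelog:
        if window_start <= ts < window_end and key in keys_of_interest:
            dst.data[key] = value

asia = Region("asia")
europe = Region("europe")
asia.write(1, "orders:eu", 10)
asia.write(2, "orders:cn", 99)   # not needed in Europe; never shipped
asia.write(3, "orders:eu", 12)

# A periodic shuttle run replaces per-write global synchronization.
shuttle(asia, europe, window_start=0, window_end=5,
        keys_of_interest={"orders:eu"})
print(europe.data)  # Europe converges on the subset it actually needs
```

Between shuttle runs the regions disagree; after each run the destination has converged on the shipped subset, which is precisely the eventual-consistency, time-window tradeoff the answer describes.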


OceanBase技术站
海量记录,笔笔算数