We are honored that our latest paper, "Manu: A Cloud Native Vector Database Management System," was accepted at VLDB'22, the top international conference in the database field. Having just presented the paper at the conference, we are striking while the iron is hot: this article organizes and shares its content and discusses the design and thinking behind it.
This article explains the key design concepts and principles of Manu (the codename for Milvus 2.0), a cloud-native database system designed for vector data management. For more details, please refer to our published paper and the Milvus project's open-source [GitHub repository](https://github.com/milvus-io/milvus).
Background of the project
When initially designing and implementing Milvus 1.0, our main goal was to support vector management features well and to optimize vector retrieval performance. As our communication with users deepened, we identified several common requirements for vector databases in real-world business scenarios that were difficult to address well within the previous Milvus architecture.
We can group these requirements into the following categories: continuously evolving requirements, the need for a more flexible consistency strategy, the need for component-level elastic scaling, and the opportunity for a simpler, more efficient transaction processing model.
Continuously evolving requirements
For vector data processing, business requirements are still far from settled.
They have evolved from the earliest K-nearest-neighbor search on vector data to range search, support for various custom distance metrics, joint queries over vector and scalar data, multi-modal queries, and increasingly diverse query semantics.
This requires the vector database architecture to be flexible enough to support new requirements quickly and agilely.
Need for a more flexible consistency strategy
Take content recommendation as an example: the business has high timeliness requirements. When new content is generated, it is usually acceptable to surface it to users within minutes or even seconds; if content only becomes recommendable a day or more later, the recommendation quality suffers noticeably. In such scenarios, enforcing strong consistency across the board incurs large system overhead, but providing only eventual consistency makes the business outcome hard to guarantee.
To address this problem, we propose a solution: users specify, according to business needs, the maximum delay they can tolerate between data being inserted and it becoming queryable, and the system adjusts its data-processing mechanisms accordingly to guarantee the final business outcome.
Need for component-level elastic scaling
The resource requirements of the vector database's components differ greatly, and so does each component's load intensity across application scenarios. For example, the vector retrieval and query components require large amounts of compute and memory to sustain performance, while the components responsible for data archiving and metadata management need only a small amount of resources to function properly.
From the perspective of application types, the most critical requirement for recommendation applications is large-scale concurrent query capability, so usually only the query component bears a high load; analytical applications often need to import large amounts of offline data, and the load pressure then falls on the two components responsible for data insertion and index building.
To improve resource utilization, each functional module needs independent elastic scaling capability, so that the system's resource usage better matches the application's actual needs.
A simpler, more efficient transaction processing model
Strictly speaking, this is not a required feature but an optimization opportunity available in the system design.
As the descriptive power of machine learning models grows, businesses tend to fuse data from multiple dimensions of a single entity into one unified vector representation. For example, when building user profiles, personal data, preference features, and social relationships are all integrated. As a result, a vector database can maintain data in a single table and does not need to implement JOIN operations like a traditional database. The system therefore only needs to support ACID semantics at the granularity of a single table, not complex transactions spanning multiple tables, which leaves a large design space for component decoupling and performance optimization.
As the second major version of Milvus, Manu is positioned as a distributed vector database system with a cloud-native design.
When designing Manu, we comprehensively considered the requirements above, combined them with the common needs of distributed system design, and set five major goals: continuous evolvability, tunable consistency, good elasticity, high availability, and high performance.
Continuous evolution capability
To keep system complexity within a controllable range as functionality evolves, we need to decouple the system well, ensuring that the functional components can be added, removed, and modified independently.
Tunable consistency

To let users specify the visibility latency of newly inserted data, the system needs to support delta consistency: every query must see at least all relevant data older than delta time units, where delta can be specified by the application according to business requirements.
Good elasticity

To improve resource-use efficiency, fine-grained elasticity must be achieved at the component level, and the resource allocation strategy must also account for the differing hardware resource requirements of the components.
High availability and high performance
High availability is a basic requirement of all cloud databases. The system must recover effectively from failures of a few service nodes or components without affecting the normal operation of its other services.
High performance is a perennial concern for vector databases. During the design process, the overhead introduced at the system-framework level must be strictly controlled to ensure good performance.
Manu uses a four-layer architecture that separates read from write, compute from storage, and stateful from stateless components.
As shown in the figure below, the architecture is divided into four layers: the access layer, the coordinator layer, the worker layer, and the storage layer. In addition, Manu uses the log as the system's backbone for collaboration and communication between components.
Access layer

The access layer consists of several stateless proxies.
These proxies receive user requests, forward the processed requests to the relevant components for processing, and collect, consolidate, and return the results to the user. Each proxy caches some system metadata and uses it to check the validity of user requests, for example, whether the dataset being queried actually exists.
Coordinator layer

The coordinator layer is responsible for managing system status, maintaining system metadata, and coordinating the functional components to complete the system's various tasks.
Manu has four types of coordinators. Separating the coordinators by function both isolates faults and allows each functional component to evolve independently.
Also, for reliability, each type of coordinator usually runs multiple instances:
- The root coordinator handles data-management requests such as dataset creation and deletion, and manages each dataset's metadata, for example, the attributes it contains and each attribute's data type.
- The data coordinator manages data persistence. It coordinates the data nodes to process data update requests, and maintains each dataset's storage metadata, for example, the list of data shards in each dataset and each shard's storage path.
- The index coordinator manages indexing work. It coordinates the index nodes to complete indexing tasks, and records each dataset's index information, including index type, related parameters, and storage paths.
- The query coordinator manages the status of the query nodes and adjusts the distribution of data and indexes across them as the load changes.
Worker layer

The worker layer performs the actual execution of the system's various tasks.
All worker nodes are stateless: they only read read-only copies of the data and perform the relevant operations, and they never need to communicate or cooperate with each other. The number of worker nodes can therefore be easily adjusted as the load changes. Manu uses different types of worker nodes for different data-processing tasks, so that each functional component can scale elastically and independently according to its load and QoS requirements.
Storage layer

The storage layer persists system state, metadata, user data, and the associated index data.
Manu uses a highly available distributed KV store such as etcd to record system state and metadata. When updating such data, it is first written to the KV store and then synchronized to the caches of the relevant coordinators. For large-scale data such as user data and index data, Manu uses an object storage service (such as S3). The object storage service's high latency does not hurt Manu's data-processing performance, because a worker node fetches a read-only copy of the relevant data from object storage and caches it locally before processing, so most data processing is done locally.
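The local-caching behavior described above can be sketched as a simple read-through cache (all names here are illustrative; Manu's actual worker-node caching against a real object store is more sophisticated):

```python
class ShardCache:
    """Read-through cache of read-only shard copies (illustrative sketch)."""
    def __init__(self, fetch_fn):
        self._fetch = fetch_fn   # e.g. a wrapper around an S3 GET
        self._local = {}         # storage path -> cached read-only bytes

    def get(self, path):
        if path not in self._local:
            # The high-latency object-storage read happens at most once per shard.
            self._local[path] = self._fetch(path)
        return self._local[path]

fetches = []
def fake_s3_get(path):          # stand-in for an object-storage client
    fetches.append(path)
    return b"shard-bytes for " + path.encode()

cache = ShardCache(fake_s3_get)
first = cache.get("s3://bucket/shard-0")
second = cache.get("s3://bucket/shard-0")   # served from the local cache
assert first is second and len(fetches) == 1
```

Because the cached copies are read-only, no invalidation protocol between worker nodes is needed; updates arrive separately through the log.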
Log as system backbone
To better decouple the functional components so that they can scale and evolve independently, Manu adopts the "log as data" design philosophy and uses the log as the system backbone connecting the components. In Manu, logs are organized as durable publish-subscribe message streams, and the components of the system are subscribers to the log data.
Manu's log content mainly includes two types: the WAL (write-ahead log) and the binlog. The existing portion of the data in the system is organized in the binlog, while the incremental portion lives in the WAL; the two play complementary roles in terms of latency, capacity, and cost.
As shown in the figure above, the logger writes data to the WAL and is the entry point of the entire logging system. The data node subscribes to the WAL, processes the data, and stores it in the binlog. Other components, such as the query nodes and index nodes, remain independent of one another and rely only on subscribing to the logs to synchronize data.
In addition, the log system handles some communication between the system's internal components: each component can use the log to broadcast internal system events. For example, a data node can inform other components which new data shards have been written to object storage, and an index node can inform the query coordinator that a new index has been built. Messages are organized into different channels by functional category, and each component subscribes only to the channels relevant to its own function rather than listening to all broadcast content.
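The channel-based subscription model can be illustrated with a minimal in-process pub/sub sketch (Manu builds this on a persistent log service such as Pulsar or Kafka; the names below are hypothetical):

```python
from collections import defaultdict

class LogBroker:
    """Toy stand-in for the log system's channel mechanism."""
    def __init__(self):
        self._subs = defaultdict(list)   # channel name -> subscriber callbacks

    def subscribe(self, channel, callback):
        self._subs[channel].append(callback)

    def publish(self, channel, event):
        # Only components subscribed to this channel observe the event.
        for cb in self._subs[channel]:
            cb(event)

broker = LogBroker()
seen_by_query_coord = []
broker.subscribe("index-events", seen_by_query_coord.append)   # query coordinator
broker.publish("index-events", "index built for shard 7")      # from an index node
broker.publish("data-events", "shard 8 flushed")               # different channel, ignored
assert seen_by_query_coord == ["index built for shard 7"]
```

The key property is that a component's cost scales with the channels it cares about, not with total system traffic.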
Next, we introduce the workflows of the system's main tasks from three aspects: data insertion, index building, and vector query.
The diagram above shows the components involved in the data-insertion workflow.
After a data insertion request is processed by the proxy, it is hashed into one of several buckets. The system usually runs multiple loggers, which divide the hash buckets among themselves via consistent hashing, and the data in each hash bucket is written to a uniquely corresponding WAL channel. When a logger receives a request, it assigns the request a globally unique logical sequence number (LSN) and writes the request to the corresponding WAL channel. LSNs are generated by the timestamp oracle (TSO); each logger periodically obtains LSNs from the TSO and caches them locally.
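The path from request to WAL channel can be sketched roughly as follows (hypothetical names and a simplified `(timestamp, sequence)` LSN; in Manu the timestamp comes from the TSO and loggers share buckets via consistent hashing):

```python
import itertools
import zlib

NUM_BUCKETS = 4   # illustrative; the bucket count is a deployment choice

def bucket_of(primary_key):
    # Each hash bucket maps to exactly one WAL channel.
    return zlib.crc32(primary_key.encode()) % NUM_BUCKETS

class Logger:
    """Hypothetical sketch of a logger stamping requests with LSNs."""
    def __init__(self, tso_timestamp):
        self._ts = tso_timestamp        # fetched periodically from the TSO
        self._seq = itertools.count()   # local sequence within the timestamp

    def append(self, wal_channel, request):
        lsn = (self._ts, next(self._seq))   # ordered by (timestamp, sequence)
        wal_channel.append((lsn, request))
        return lsn

wal_channels = {b: [] for b in range(NUM_BUCKETS)}
logger = Logger(tso_timestamp=1000)

b = bucket_of("entity-42")
lsn = logger.append(wal_channels[b], {"op": "insert", "pk": "entity-42"})
assert lsn == (1000, 0) and 0 <= b < NUM_BUCKETS
```

Because every request carries an LSN, downstream subscribers can replay the WAL channel in a well-defined order.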
To support low-latency, fine-grained data subscription, Manu stores the WAL data row-wise and streams it to each subscribing component. The WAL can usually be implemented with a message queue such as Kafka or Pulsar. The data node is one of the WAL's subscribers: after receiving updated data from the WAL, it converts the data from row-based to column-based storage and writes it to the binlog. Columnar storage lays out the data of each column contiguously, which is friendlier to data compression and access. For example, when an index node needs to build an index on a column of vector data, it only needs to read that vector column from the binlog without accessing the other columns.
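The row-to-column conversion a data node performs before writing the binlog can be sketched as follows (field names are illustrative):

```python
# Rows as they arrive from the WAL (one dict per inserted entity).
rows = [
    {"id": 1, "vec": [0.1, 0.2]},
    {"id": 2, "vec": [0.3, 0.4]},
]

def to_columns(rows):
    # Pivot row records into one contiguous list per field.
    cols = {}
    for row in rows:
        for field, value in row.items():
            cols.setdefault(field, []).append(value)
    return cols

cols = to_columns(rows)
assert cols["id"] == [1, 2]                          # one column per field
assert cols["vec"] == [[0.1, 0.2], [0.3, 0.4]]       # readable without touching "id"
```

An index node building a vector index would then read only the `vec` column from the binlog.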
Manu supports both batch and streaming index building. When a user builds an index on a dataset that already contains data, a batch index build is triggered: the index coordinator obtains the storage paths of all the dataset's data shards from the data coordinator and schedules the index nodes to build indexes for those shards. If the user continues inserting new data after the index is built, streaming index building is triggered.
After a data node writes a new data shard to the binlog, the data coordinator notifies the index coordinator to create an index-building task for the new shard. In either mode, once an index node finishes building an index, it saves the index to the object storage service and sends the storage path to the index coordinator, which in turn notifies the query coordinator to arrange for the relevant query nodes to load a copy of the index locally.
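The two triggering modes can be sketched together (hypothetical names and a simple round-robin assignment; real scheduling also accounts for node load and failures):

```python
class IndexCoordinator:
    """Toy sketch: assigns shard-indexing tasks to index nodes round-robin."""
    def __init__(self, index_nodes):
        self._nodes = index_nodes   # each node modeled as a task list
        self._next = 0

    def _assign(self, shard):
        self._nodes[self._next % len(self._nodes)].append(shard)
        self._next += 1

    def batch_build(self, existing_shards):
        # Triggered when the user creates an index on pre-existing data.
        for shard in existing_shards:
            self._assign(shard)

    def on_new_shard(self, shard):
        # Streaming mode: the data coordinator reports a freshly flushed shard.
        self._assign(shard)

nodes = [[], []]
coord = IndexCoordinator(nodes)
coord.batch_build(["shard-0", "shard-1", "shard-2"])   # batch path
coord.on_new_shard("shard-3")                          # streaming path
assert nodes[0] == ["shard-0", "shard-2"] and nodes[1] == ["shard-1", "shard-3"]
```

Both paths converge on the same per-shard task, which is what lets batch and streaming builds share the index nodes.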
To process query requests in parallel, Manu divides each dataset into fixed-size data shards and places them on different query nodes. The proxy obtains the shard-distribution information from the query coordinator and caches it locally. On receiving a query request, the proxy distributes it to every query node holding relevant shards; each node queries its relevant local shards in turn, combines the results, and returns them to the proxy. Once the proxy has received the results from all the relevant query nodes, it merges them further and returns the final result to the client.
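The proxy's final merge step can be sketched as a standard top-k merge over per-node partial results (assuming smaller distances are better; names are illustrative):

```python
import heapq

def merge_topk(partial_results, k):
    """Merge per-node (distance, id) hit lists into a global top-k."""
    all_hits = (hit for node_hits in partial_results for hit in node_hits)
    return heapq.nsmallest(k, all_hits)   # k best by distance

# Each query node returns its own local top-k over the shards it holds.
node_a = [(0.12, "doc-3"), (0.40, "doc-9")]
node_b = [(0.05, "doc-7"), (0.33, "doc-1")]

top2 = merge_topk([node_a, node_b], k=2)
assert top2 == [(0.05, "doc-7"), (0.12, "doc-3")]
```

Since every node already returns at most k hits, the proxy's merge touches only `k × num_nodes` candidates regardless of dataset size.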
A query node's data comes from three sources: the binlog, index files, and the WAL. For existing data, the query node reads the corresponding binlog or index files from the object storage service. For incremental data, it streams directly from the WAL; obtaining incremental data from the binlog instead would cause a large visibility delay, that is, a long interval between when data is inserted and when it can be queried, which is hard to reconcile with applications that have high consistency requirements.
As mentioned above, Manu implements delta consistency to give users more flexible consistency choices. Delta consistency requires the system to guarantee that after it receives a data update (insertion or deletion) request, the updated content becomes queryable within at most delta time units.
To achieve delta consistency, Manu attaches an LSN, which carries timestamp information, to every insert and query request. When executing a query, the query node compares the query's timestamp Lr with the timestamp Ls of the latest update request it has processed. The query can be executed only if the interval between the two is less than delta; otherwise, the node must first process the data-update records in the WAL. To prevent Ls from lagging far behind the current system time when users stop updating data for a long period, which would block query execution, Manu periodically inserts special control messages into the WAL to force the query nodes to advance their timestamps.
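The check can be sketched with plain timestamps standing in for full LSNs (illustrative only; the real mechanism also handles the periodic control messages and batched WAL consumption):

```python
def can_serve(query_ts, last_synced_ts, delta):
    # Serve only if this node has seen all updates up to query_ts - delta;
    # otherwise it must first consume more of the WAL.
    return query_ts - last_synced_ts < delta

assert can_serve(query_ts=105, last_synced_ts=100, delta=10)       # fresh enough
assert not can_serve(query_ts=120, last_synced_ts=100, delta=10)   # must catch up

def catch_up(query_ts, last_synced_ts, delta, wal):
    """Consume WAL entries until the delta-consistency check passes."""
    for update_ts in wal:   # control messages also advance last_synced_ts
        if can_serve(query_ts, last_synced_ts, delta):
            break
        last_synced_ts = update_ts
    return last_synced_ts

# A node at Ls=100 serving a query stamped 120 with delta=10 must first
# replay WAL entries until its Ls is within delta of the query timestamp.
assert catch_up(120, 100, 10, wal=[108, 112, 116]) == 112
```

A larger delta means fewer forced catch-ups and hence lower query latency, which matches the trade-off shown in the evaluation.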
In the paper we conduct a comprehensive performance evaluation of the system against actual usage scenarios; only part of the results are shown here.
In this figure, we compare the query performance of Manu with four other open-source vector retrieval systems (anonymized). Manu shows a clear advantage in vector retrieval performance on both the SIFT and DEEP datasets.
In this figure, we show Manu's query performance with different numbers of query nodes. Across datasets and similarity metrics, Manu's query performance scales approximately linearly with the number of query nodes.
The last figure shows Manu's query performance when users choose different consistency requirements. The x-axis is the value of delta in delta consistency, and the different series represent different frequencies of sending control messages to the WAL to force the query nodes to synchronize their time. As delta increases, Manu's query latency drops rapidly; users should therefore choose a delta value appropriate to their application's performance and consistency needs.
Finally, let's summarize.
This article mainly introduced the special requirements applications place on vector databases, and explained the Manu system's design philosophy and the workflows of its key functions. There are two main design ideas:
- Using the log as the system backbone to connect the components, which greatly facilitates each component's independent elasticity and functional evolution, as well as the isolation of resources and faults;
- Implementing delta consistency on top of logs and LSNs, allowing users to trade off more freely among consistency, cost, and performance.
The main contribution of our VLDB paper is to present users' actual needs for vector databases and, accordingly, the basic architecture of a cloud-native vector database. Of course, many issues remain worth exploring under this framework, such as:
- How to jointly retrieve the vector data of multiple modalities;
- How to make better use of the storage hierarchy on the cloud, including local disks, cloud disks, and other storage services, to design efficient data retrieval solutions;
- How to use FPGAs, GPUs, RDMA, NVM, and other new compute, storage, and communication hardware to design indexing and retrieval solutions with extreme performance.
Closing words
The idea for this article first came up about a year ago, when our CEO Xingjue and I had just finished attending SIGMOD in Xi'an and were heading back to Shanghai for the Milvus 2.0 GA release event. Discussing our impressions of the conference, we both felt that cloud-native databases were gradually becoming a new research hotspot in academia. Since Milvus 2.0 is precisely our cloud-native database system designed for vector data management, the idea of writing this paper arose naturally.
We hope our work helps attract more scholars and friends from industry to explore and research related topics with us.
In addition, this paper was jointly completed by the Zilliz team and the database team of Southern University of Science and Technology. I would like to thank Tang Bo, Yan Xiao, and Xiang Long for their contributions to this work.
If you found this content helpful, please don't hesitate to give us some encouragement: like it, star us, or share it with your friends!
For event information, technology sharing and recruitment express, please follow: https://zilliz.gitee.io/welcome/
If you are interested in our projects please follow: