Editor's note: This article explains in detail how Milvus 2.0 manages the data on query nodes and how it provides query capabilities.
Content outline:
- A quick review of the processes and mechanisms of data insertion and persistent storage in Milvus;
- How data is loaded into the query node for query operations;
- Operations and processes related to real-time query in Milvus.
A quick review of the processes and mechanisms related to data insertion and persistence in Milvus
A Quick Review of Milvus Architecture
As shown in the figure below, the overall architecture of the Milvus vector database can be divided into coordinator services, worker nodes, message storage, and object storage.
The main job of the coordinator services is to coordinate the work of the worker nodes: each coordinator module has a one-to-one correspondence with a type of worker node and coordinates and manages the work of those nodes. As shown in the architecture diagram, the query coordinator corresponds to and coordinates the query nodes, the data coordinator corresponds to and coordinates the data nodes, and the index coordinator corresponds to and coordinates the index nodes.
The data node is responsible for the persistent storage of data, a largely I/O-intensive workload, writing data from the log broker into the final object storage. The index node is responsible for building vector indexes, and the query node is responsible for all query work in Milvus. These latter two types of nodes are data-intensive.
In addition, there are two more important parts in the system architecture: message storage and object storage.
Message storage is equivalent to a WAL (write-ahead log): once data has been inserted into it, the system guarantees that the data will not be lost. The log broker stores data for 7 days by default; during this period, even if some downstream worker nodes go down, the system can recover data and state from the log broker. Object storage provides persistent data storage: the data in the log broker is eventually persisted to object storage for long-term retention.
In general, this architecture separates storage from computing: the data side is responsible for data storage, and the query side is responsible for query computation.
Data Insertion Process
Step 1: After an insert message is sent from the SDK to the proxy, the proxy writes it into the corresponding log broker. Each message inserted into the log broker carries a unique primary key and a timestamp;
Step 2: Once inserted into the log broker, the data is consumed by the data node;
Step 3: The data node writes the data into persistent storage, where the data is ultimately organized at segment granularity. That is, in addition to the primary key and timestamp, each message is assigned a segment ID identifying which segment the data will eventually belong to. After receiving this information, the data node writes the data to the corresponding segment and finally into persistent storage.
Steps 4 and 5: After the data is persisted, querying directly against the raw data would be relatively slow, so indexes are usually built to speed up queries. At this point, the index node pulls the data from persistent storage and builds the index, and the resulting index files are written back to persistent storage (S3, MinIO, etc.). Sometimes multiple indexes need to be built so that the one with the fastest query speed can be selected; this, too, happens on the index node.
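To make steps 1 through 3 more tangible, here is a minimal sketch of the fields such an insert message carries on its way from the proxy to the data node. The class and helper below are hypothetical illustrations of the description above, not actual Milvus source code.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class InsertMessage:
    """Hypothetical shape of an insert message, per steps 1-3 above."""
    primary_key: int                  # unique primary key, assigned on insertion
    timestamp: int                    # timestamp assigned when entering the log broker
    row: Dict[str, Any]               # one row of user data for the collection
    segment_id: Optional[int] = None  # assigned before the data node persists the row

def assign_segment(msg: InsertMessage, segment_id: int) -> InsertMessage:
    """Step 3: tag the message with the segment it will be persisted into."""
    msg.segment_id = segment_id
    return msg
```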
The log broker and object storage are also the two parts of the Milvus architecture that ensure data reliability. In the system design, both can be backed by third-party components to provide reliability in different scenarios.
A common situation is that data is being inserted while queries are running. At such a moment, part of the data sits in the log broker and part in object storage. We name these two kinds of data separately: the data in object storage is batch data, and the data in the log broker is streaming data. Obviously, in a real-time query scenario, if you want to cover all the data that has been inserted, you must query both the streaming data and the batch data at the same time in order to return correct real-time results.
Data organization mechanism
Next, let's take a look at the relevant data storage mechanism. The data is stored in two places: part of it in object storage, and part in the log broker.
First, let's take a look at how the data is organized in the log broker.
You can refer to the figure below. The data is organized by unique collection ID, unique partition ID, and unique segment ID.
Each collection is allocated a specified number of channels in the system — a concept similar to a topic in Kafka, or to a shard in a traditional database.
In the figure, we assign three channels to the collection. Suppose we insert 100 rows of data: the 100 rows are divided evenly across the three channels, and within each channel the data is further split at segment granularity. Each segment currently has an upper capacity limit, 512 MB by default. During continuous insertion, the system preferentially keeps writing to the current segment, but once its capacity exceeds 512 MB, the system assigns a new segment ID and continues inserting there. So in real scenarios, each channel contains many segments. To sum up, the data in the log broker is organized into collections, partitions, and segments, and what we actually store in the system is many small segments.
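To make the channel routing and segment rollover concrete, here is a minimal sketch, assuming rows are distributed across channels by hashing the primary key and each segment is capped at 512 MB. All names and the hashing policy are illustrative, not Milvus internals.

```python
SEGMENT_CAPACITY = 512 * 1024 * 1024  # default per-segment upper limit, in bytes

class Channel:
    """One shard-like channel of a collection; its data is split into segments."""
    def __init__(self, channel_id: int):
        self.channel_id = channel_id
        self.segments = []            # list of [segment_id, bytes_written]
        self._next_segment_id = 0

    def append(self, row_size: int) -> int:
        """Write a row, rolling over to a new segment when the cap is exceeded."""
        if not self.segments or self.segments[-1][1] + row_size > SEGMENT_CAPACITY:
            self._next_segment_id += 1
            self.segments.append([self._next_segment_id, 0])
        self.segments[-1][1] += row_size
        return self.segments[-1][0]

def route_to_channel(primary_key: int, channels: list) -> Channel:
    """Distribute rows across the collection's channels, e.g. by hashing the key."""
    return channels[hash(primary_key) % len(channels)]
```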
Next, let's take a look at how data is organized in object storage.
Like the log broker, the data node organizes the data it receives from insert messages by segment. When a segment reaches the default 512 MB upper limit, or the user explicitly stops inserting data into it, the segment is persisted to object storage. In persistent storage, each segment is stored as smaller log snapshots, split into multiple columns; the number of columns matches the schema of the collection being inserted into. If the collection's schema has 4 columns, the data inserted into the segment is also stored in 4 columns. So, in the end, the data in object storage takes the form of many log snapshots.
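As an illustration of this column-wise layout, the sketch below constructs one object-storage key per field of a segment. The path format here is an assumption made for illustration and is not guaranteed to match Milvus's actual object naming.

```python
def snapshot_keys(collection_id, partition_id, segment_id, field_ids, log_id):
    # One binlog-style log snapshot per column (field) of the segment.
    # The path layout is hypothetical, shown only to illustrate the idea.
    return [
        f"insert_log/{collection_id}/{partition_id}/{segment_id}/{fid}/{log_id}"
        for fid in field_ids
    ]

# A collection whose schema has 4 fields yields 4 log snapshots per flush:
print(snapshot_keys(1, 1, 42, field_ids=[100, 101, 102, 103], log_id=7))
```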
How to load data into the query node
Detailed explanation of data loading process
After clarifying how the data is organized, let's take a look at the specific process of querying and loading the data.
In the query node, the streaming data from the log broker is referred to as streaming, and the batch data from object storage as historical. The loading process for streaming and batch data is as follows:
First, the query coord asks the data coord. Because the data coord continuously handles incoming data, it can report two kinds of information back to the query coord: which segments have already been persisted, and the checkpoint corresponding to each persisted segment. From the checkpoint, you can tell the last position at which each segment was consumed from the log broker.
Then, after receiving these two pieces of information, the query coord produces an allocation strategy. The strategy has two parts: allocation by segment (the segment allocator in the figure) and allocation by channel (the channel allocator in the figure).
The segment allocator assigns the different segments in persistent storage — that is, the batch data — to different query nodes for processing. As shown in the figure, S1 and S3 are assigned to query node 1, and S2 and S4 to query node 2. The channel allocator assigns the different channels in the log broker to different query nodes to watch. As shown in the figure, query node 1 listens to Ch 1, and query node 2 listens to Ch 2.
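A minimal sketch of these two allocation strategies, assuming a simple round-robin policy (the real query coord may use more sophisticated strategies, such as balancing by segment size):

```python
from itertools import cycle

def allocate(segments, channels, query_nodes):
    """Round-robin sealed segments and channels across query nodes."""
    seg_plan, ch_plan = {}, {}
    nodes = cycle(query_nodes)
    for seg in segments:                  # segment allocator: batch data
        seg_plan.setdefault(next(nodes), []).append(seg)
    nodes = cycle(query_nodes)
    for ch in channels:                   # channel allocator: streaming data
        ch_plan.setdefault(next(nodes), []).append(ch)
    return seg_plan, ch_plan

seg_plan, ch_plan = allocate(["S1", "S2", "S3", "S4"], ["Ch1", "Ch2"],
                             ["query_node_1", "query_node_2"])
# seg_plan: {'query_node_1': ['S1', 'S3'], 'query_node_2': ['S2', 'S4']}
# ch_plan:  {'query_node_1': ['Ch1'],     'query_node_2': ['Ch2']}
```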
After these allocation strategies are delivered, each query node performs the corresponding load and watch operations. On query node 1, the historical (batch data) part loads the S1 and S3 data assigned to it from persistent storage, while the streaming part subscribes to Ch 1 in the log broker to receive that portion of the streaming data.
Because data keeps flowing into Ch 1 (streaming data), the segments built on the query node from this data are defined as growing segments — they keep growing as incremental data arrives, as shown by G5. Correspondingly, the segments in the historical part are defined as sealed segments: static, existing data.
Data management and maintenance
For the management of the sealed segment, the system design mainly considers load balancing and downtime.
As shown in the figure, if query node 4 holds many sealed segments while the other nodes hold few, query node 4 may become a bottleneck for the whole query. So the system should load-balance these sealed segments onto other nodes.
In another case, if a node suddenly goes down, its load can be quickly migrated to other healthy nodes to ensure that query results remain correct.
For incremental data, as mentioned earlier, once the query node watches the corresponding dmChannel, incremental data flows into the query node. But how exactly does it get in? Here we use a flowgraph model, a state-driven dataflow model. The whole flowgraph consists of four parts: the input node, the filter node, the insert node, and the service time node. First, the input node receives insert messages from the stream; the filter node then filters them. Why is filtering needed? Because the user may have loaded only certain partitions of the collection. After filtering, the insert node inserts the data into the underlying growing segment. Finally, the service time node is responsible for updating the query's serviceable time.
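A minimal sketch of this four-node flowgraph, modeling each node as a stage in a Python generator pipeline. The structure mirrors the description above, but the implementation itself is illustrative:

```python
def input_node(stream):
    """Receive insert and timetick messages from the subscribed dmChannel."""
    yield from stream

def filter_node(msgs, loaded_partitions):
    """Drop rows belonging to partitions the user did not load."""
    for msg in msgs:
        if msg["type"] == "timetick" or msg.get("partition") in loaded_partitions:
            yield msg

def insert_node(msgs, growing_segment):
    """Append insert messages to the underlying growing segment."""
    for msg in msgs:
        if msg["type"] == "insert":
            growing_segment.append(msg)
        yield msg

def service_time_node(msgs, state):
    """Advance tsafe whenever a timetick message flows through."""
    for msg in msgs:
        if msg["type"] == "timetick":
            state["tsafe"] = msg["timestamp"]

def run_flowgraph(stream, loaded_partitions, growing_segment, state):
    """Chain the four stages: input -> filter -> insert -> service time."""
    service_time_node(
        insert_node(filter_node(input_node(stream), loaded_partitions),
                    growing_segment),
        state)
```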
At the beginning, when we reviewed the data insert process, we mentioned that a timestamp is assigned to each insert message.
You can refer to the example on the left side of the figure. If data is inserted sequentially from left to right, the timestamp of the first message is 1, the timestamp of the second is 2, and the timestamp of the third is 6. Why is the fourth item marked in red? It is a timetick message inserted by the system, not an insert message. A timetick indicates that all inserted data with timestamps smaller than this timetick is already in the log broker. In other words, the insert messages that appear after this timetick of 5 will all have timestamps no smaller than 5. As you can see, the subsequent timestamps are 7, 8, 9, and 10 — all greater than 5 — meaning any insert message with a timestamp smaller than 5 is guaranteed to appear to the left of the timetick. Put differently, when the query node receives a message with timetick = 5, it can be sure that all messages with timestamps before 5 have already entered the query node, which is how query correctness is confirmed. The service time node then updates a tsafe value whenever it receives a timetick from the insert node — 5 or 9 in the figure — which acts as a safety timestamp: once tsafe reaches 5, all data with timestamps before 5 can be queried.
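Using the numbers from the figure, the sketch below runs the message stream through the tsafe update rule just described (an illustrative model, not Milvus code):

```python
# Messages as they appear from left to right in the figure.
stream = [
    {"type": "insert", "timestamp": 1},
    {"type": "insert", "timestamp": 2},
    {"type": "insert", "timestamp": 6},
    {"type": "timetick", "timestamp": 5},  # no later insert will have ts < 5
    {"type": "insert", "timestamp": 7},
    {"type": "insert", "timestamp": 8},
    {"type": "timetick", "timestamp": 9},  # no later insert will have ts < 9
    {"type": "insert", "timestamp": 10},
]

tsafe = 0
for msg in stream:
    if msg["type"] == "timetick":
        tsafe = msg["timestamp"]  # data with timestamps before tsafe is safe to query
print(tsafe)  # 9 -> data earlier than timestamp 9 can now be queried
```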
With this groundwork laid, let's talk about how this part of the query is actually performed.
Related operations and processes for real-time query on Milvus
First, let's talk about how the query request (query message) is defined.
The query message is also inserted into the log broker by the proxy, and then the query node will obtain the query message by listening to the query channel in the log broker.
What exactly does the Query message look like?
- Message ID: a globally unique ID assigned to this query request;
- Collection ID: the ID of the collection targeted by the query request. Since the query runs against a particular collection, the corresponding collection ID must be specified. Of course, on the SDK side you actually specify the collection name; the name and ID are mapped one-to-one inside the system.
- execPlan: the execution plan, corresponding to the operations specified on the SDK side — equivalent to the expression given when querying through the SDK. For vector queries, it is mainly used for attribute filtering, e.g. keeping only rows where an attribute is greater than or equal to 10.
- Service timestamp: updated whenever the tsafe mentioned above is updated; it indicates the current serviceable time — data inserted before it can be queried.
- Travel timestamp: if you need to query data as it existed before a certain point in time, you can use (service timestamp - travel timestamp) to demarcate the timestamp and data range for the query;
- Guarantee timestamp: if you need to query data after a certain point in time, the query is executed only once the condition service timestamp >= guarantee timestamp is met.
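Putting these fields together, the query message can be pictured as follows. This is a hypothetical sketch mirroring the list above, not the actual Milvus message definition:

```python
from dataclasses import dataclass

@dataclass
class QueryMessage:
    """Illustrative shape of a query message, field per field as listed above."""
    message_id: int           # globally unique ID for this query request
    collection_id: int        # resolved from the collection name given on the SDK side
    exec_plan: str            # execution plan, e.g. the filter expression "attr >= 10"
    service_timestamp: int    # advanced with tsafe; data before it is queryable
    travel_timestamp: int     # bound for querying data as of an earlier point in time
    guarantee_timestamp: int  # query runs only once service_timestamp >= this value
```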
Now look at the specific query operation process:
After receiving a query message, the system first makes a judgment: if the service time is greater than the guarantee timestamp in the query message, the query is executed. The query runs as two parallel parts: one over the historical data in persistent storage, the other over the streaming data in the log broker. Finally, a local reduce is performed. As mentioned earlier, there may be some overlap between the historical and streaming data for various reasons, so a reduce is needed here first.
The above is the smooth path. If, however, the timestamp check in the first step finds that the serviceable time has not yet advanced past the guarantee timestamp, the query is put into an unsolved message queue, where it waits until the condition is met and can then be executed.
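A minimal sketch of this dispatch logic, reusing the QueryMessage sketch above. The two search functions are stubs standing in for the parallel historical and streaming search paths, and local_reduce deduplicates rows that appear in both; all names are illustrative:

```python
def search_historical(query):
    """Stub: search the sealed segments loaded from object storage."""
    return [(1, 0.12), (2, 0.30)]             # (primary_key, distance) pairs

def search_streaming(query):
    """Stub: search the growing segments fed from the log broker."""
    return [(2, 0.30), (3, 0.18)]             # note the duplicate primary key 2

def local_reduce(hist, stream, top_k=10):
    """Merge the two partial results, deduplicating rows present in both."""
    best = {}
    for pk, dist in list(hist) + list(stream):
        if pk not in best or dist < best[pk]:
            best[pk] = dist                   # keep the best score per primary key
    return sorted(best.items(), key=lambda r: r[1])[:top_k]

def execute_or_defer(query, tsafe, unsolved_queue):
    """Run the query if tsafe has caught up; otherwise park it for later."""
    if tsafe >= query.guarantee_timestamp:
        return local_reduce(search_historical(query), search_streaming(query))
    unsolved_queue.append(query)              # re-examined once tsafe advances
    return None
```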
The final results are pushed to the result channel and received by the proxy. Of course, the proxy receives results from many query nodes and performs a round of global reduce over them. At this point, the entire query process is complete.
But one problem remains: how does the proxy determine that it has received all the query results before returning the final result to the SDK? For this, we adopt a strategy: each returned result message also records which sealed segments were searched (searched sealed segments), which dmChannels were searched (dmChannels searched), and which sealed segments exist on the query node (global sealed segments). If the union of the searched sealed segments across all query nodes' results covers the global sealed segments, and the incremental data of all the collection's dmChannels has been queried, then all query results are considered received; the proxy can then perform the reduce and return the final result to the SDK.
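Finally, a sketch of this proxy-side completeness check and the global reduce; the result-message field names mirror the prose above, and the implementation is illustrative:

```python
def all_results_received(results, global_sealed_segments, collection_channels):
    """Have the partial results from the query nodes covered everything?"""
    searched_segments = set().union(*(r["searched_sealed_segments"] for r in results))
    searched_channels = set().union(*(r["dm_channels_searched"] for r in results))
    return (searched_segments >= set(global_sealed_segments)
            and searched_channels >= set(collection_channels))

def global_reduce(results, top_k=10):
    """Merge the per-node top-k lists into the final top-k for the SDK."""
    hits = [hit for r in results for hit in r["hits"]]  # (primary_key, distance)
    hits.sort(key=lambda h: h[1])                       # smaller distance = better
    return hits[:top_k]
```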