Editor's note: This article describes the Milvus 2.0 data insertion process and persistence scheme in detail.
Introduction to the overall architecture of Milvus 2.0
Introduction to components related to data writing
- Proxy
- DataCoord
- Data Node
- RootCoord & Time Tick
Data allocation
- Data organization
File structure and data persistence
Introduction to the overall architecture of Milvus 2.0
The figure above shows the overall architecture of Milvus 2.0. Requests enter through the SDK on the far left and pass through the Load Balancer to the Proxy layer. The Proxy then interacts with the Coordinator Service at the top (Root Coord, Query Coord, Data Coord and Index Coord) and writes DDL and DML requests to Message Storage.
The Worker Nodes below (Query Node, Data Node and Index Node) consume this request data from Message Storage. The Query Node handles queries, the Data Node handles writing and persistence, and the Index Node builds the indexes that accelerate queries.
The bottom layer is the data storage layer (Object Storage), which mainly uses MinIO, S3 or Azure Blob to store log, delta and index files.
Introduction to components related to data writing
Proxy
The Proxy is the entry point for data requests. It first accepts insert requests from the SDK, then hashes the data in those requests into multiple buckets, and then asks DataCoord (the data coordinator) to allocate segment space. (A segment is the smallest unit of data storage in Milvus and is described in detail later.) The Proxy then writes the data that fits in the allocated space into message storage. Once the data has been written to message storage, it will not be lost.
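The hashing step can be sketched as follows. This is a minimal illustration that assumes rows are bucketed by a hash of their primary key modulo the number of channels; the helper name `hash_rows_to_buckets` and the choice of CRC32 are assumptions made for the example, not a description of what Milvus does internally.

```python
import zlib
from collections import defaultdict

def hash_rows_to_buckets(primary_keys, num_channels):
    """Group primary keys into one bucket per channel (hypothetical helper)."""
    buckets = defaultdict(list)
    for pk in primary_keys:
        # CRC32 is an assumed stand-in for whatever hash the Proxy really uses.
        channel = zlib.crc32(str(pk).encode("utf-8")) % num_channels
        buckets[channel].append(pk)
    return dict(buckets)

buckets = hash_rows_to_buckets(range(10), num_channels=4)
```

Each bucket would then be written to the channel it belongs to, so rows with the same primary key always land in the same channel.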
Next, let's look at some details of the data flow:
- There can be multiple Proxies.
- A collection has VChannels (virtual channels); in the figure these are V1, V2, V3 and V4.
- C1, C2, C3 and C4 are PChannels, which we call physical channels.
- Multiple VChannels can map to the same PChannel.
- Each Proxy corresponds to all VChannels: every Proxy that serves a collection must be responsible for all of that collection's channels.
- The many-to-one mapping from VChannels to PChannels exists to avoid the excessive resource consumption that too many VChannels would otherwise cause.
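The many-to-one relationship between VChannels and PChannels can be pictured with a small sketch. The round-robin policy and the function name `map_vchannels` are assumptions for illustration only; the text does not say how Milvus chooses the PChannel.

```python
def map_vchannels(vchannels, pchannels):
    """Assign each VChannel to a PChannel round-robin (assumed policy)."""
    return {v: pchannels[i % len(pchannels)] for i, v in enumerate(vchannels)}

# Four virtual channels share two physical channels.
mapping = map_vchannels(["V1", "V2", "V3", "V4"], ["C1", "C2"])
```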
DataCoord
DataCoord has several functions:
- Allocating segment space: after segment space is allocated to the Proxy, the Proxy can use that space to insert data.
- Recording allocations and their expiration times: an allocation is not permanent; each one has an expiration time.
- Segment flush logic: when a segment is full, it is flushed to disk.
- Channel allocation and subscription: a collection can have many channels, and DataCoord decides which channels are consumed by which Data Nodes.
Data Node
Data Node has several functions:
- Consume data from the message stream and serialize it.
- Cache the written data in memory, then automatically flush it to disk once it reaches a certain volume.
In summary:
DataCoord manages the allocation of channels and segments; the Data Node is mainly responsible for consumption and persistence.
The relationship between DataNode and Channel
If a collection has four channels, one possible distribution is that two Data Nodes each consume two VChannels. DataCoord makes this assignment. Why can't a VChannel be assigned to multiple Data Nodes? Because the data would then be consumed multiple times, which would duplicate segment data.
RootCoord & Time Tick
Time Tick (timestamp) is a very important concept in Milvus 2.0: it is what drives the entire system forward. RootCoord acts as a TSO (timestamp oracle) service that distributes the global clock, and every request is assigned a timestamp. Time ticks increase monotonically and indicate the point in time to which the system has advanced, which matters a great deal for both writing and querying. RootCoord is responsible for assigning timestamps, and the default interval is 0.2 seconds.
When a Proxy writes data, each request carries a timestamp. The Data Node consumes data in intervals delimited by time ticks. Take the figure above as an example: the arrow shows the direction in which data is written, and the numbers 1, 2, 6, 5, 7, 8 are timestamp values. The "Written by" line shows which Proxy wrote each message, with P1 meaning Proxy 1. With time ticks as the consumption boundaries, the first read (up to time tick 5) returns only messages 1 and 2. Message 6 is greater than 5, so it and the later messages are consumed in the next read, which covers the interval from 5 to 9.
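The interval-based consumption can be sketched as below. The example assumes one reading of the figure description: inserts stamped 1, 2, 6, 7 and 8, with 5 and 9 acting as time ticks. The function name `consume_until_tick` is hypothetical.

```python
def consume_until_tick(pending, tick):
    """Split pending insert timestamps into (ready, still_pending) for one time tick."""
    ready = sorted(ts for ts in pending if ts <= tick)
    still_pending = [ts for ts in pending if ts > tick]
    return ready, still_pending

# Assumed reading of the figure: inserts stamped 1, 2, 6, 7, 8; time ticks 5 and 9.
pending = [1, 2, 6, 7, 8]
first, pending = consume_until_tick(pending, tick=5)
second, pending = consume_until_tick(pending, tick=9)
```

The first call releases only the messages at or below tick 5; everything newer stays buffered until the next tick arrives.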
Data Allocation
Data organization
The relationship between Collection, Partition, Channel and Segment:
- Collection: The outermost layer is a collection (equivalent to the concept of a table), and a collection contains multiple partitions.
- Partition: Each partition is divided by time. Partitions and channels are orthogonal: each combination of a partition and a channel defines the location of a segment.
(Note: Channel and shard are the same concept. Some parts of our documentation say shard; here we use channel throughout for consistency.)
- Segment: A segment is defined by collection + partition + channel together. The segment is the smallest unit of data allocation. Indexes are built per segment, and queries are load balanced across QueryNodes per segment. A segment contains a number of Binlogs: consuming data produces Binlog files.
A segment in memory has three states: Growing, Sealed and Flushed.
- Growing: A newly created segment is in the growing state, and its space can be allocated.
- Sealed: The segment has been closed, and its space can no longer be allocated.
- Flushed: The segment has been written to disk.
The space inside a growing segment can be divided into three parts:
- Used: space that has been consumed by the Data Node.
- Allocated: space in the segment that DataCoord (the data coordinator) has allocated to a Proxy at the Proxy's request.
- Free: unused space.
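The segment identity, its three states and its three space regions can be modeled with a small data structure. This is only a sketch: the class and field names are invented for the example, and measuring space in rows is an assumption.

```python
from dataclasses import dataclass
from enum import Enum

class SegmentState(Enum):
    GROWING = "growing"
    SEALED = "sealed"
    FLUSHED = "flushed"

@dataclass
class Segment:
    # A segment is identified by collection + partition + channel.
    collection_id: int
    partition_id: int
    channel: str
    max_rows: int            # estimated capacity, in rows (assumed unit)
    used_rows: int = 0       # already consumed by the Data Node
    allocated_rows: int = 0  # allocated to Proxies by DataCoord
    state: SegmentState = SegmentState.GROWING

    @property
    def free_rows(self):
        # Free is whatever is neither used nor allocated.
        return self.max_rows - self.used_rows - self.allocated_rows

seg = Segment(collection_id=1, partition_id=1, channel="V1",
              max_rows=1000, used_rows=300, allocated_rows=200)
```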
Channel:
What is the allocation logic for channels?
Each collection is divided into multiple channels, and each channel is given to one Data Node, which consumes its data. Several strategies can be used for this assignment. Milvus currently implements two:
- Consistent hash: the current default strategy. Each channel is hashed to a position on the ring, the nearest node in the clockwise direction is found, and the channel is assigned to that DataNode. For example, Channel 1 is assigned to Data Node 2 and Channel 2 to Data Node 3.
- Try to distribute the channels of the same collection across different DataNodes and keep the number of channels on each DataNode roughly equal, to achieve load balancing.
With the consistent hashing scheme, a DataNode going online or offline causes some channels to be reassigned. How is this handled? DataCoord watches the status of DataNodes through etcd. When a DataNode goes online or offline, DataCoord is notified and then decides where the affected channels will be assigned.
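A consistent hash ring of the kind described above can be sketched in a few lines. The use of MD5, the absence of virtual nodes and the names `assign_channels` and `_position` are all simplifying assumptions for this example rather than details of the Milvus implementation.

```python
import bisect
import hashlib

def _position(key):
    """Map a string to a position on the ring (MD5 chosen only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def assign_channels(channels, nodes):
    """Assign each channel to the first node clockwise from its ring position."""
    ring = sorted((_position(n), n) for n in nodes)
    positions = [p for p, _ in ring]
    assignment = {}
    for ch in channels:
        # bisect finds the next node clockwise; wrap to the ring start if needed.
        idx = bisect.bisect_right(positions, _position(ch)) % len(ring)
        assignment[ch] = ring[idx][1]
    return assignment

channels = ["ch1", "ch2", "ch3", "ch4"]
before = assign_channels(channels, ["node1", "node2", "node3"])
after = assign_channels(channels, ["node1", "node2"])  # node3 went offline
```

Comparing `before` and `after` shows the useful property of this scheme: only the channels that were on the departed node move, while the others keep their DataNode.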
When are channels allocated?
- When a DataNode starts up or goes offline
- When a Proxy requests segment space
When does data allocation take place?
The process starts with the client (as shown in the figure above):
- The client sends an insert request, which is assigned a timestamp t1.
- The Proxy sends DataCoord a request to allocate segment space.
- DataCoord allocates the space and stores the allocation in the meta server for persistence.
- DataCoord then returns the allocated space to the Proxy, and the Proxy can use it to store data. In the figure, the insert request has timestamp t1 and the returned segment allocation has expiration time t2, so t1 must be less than t2. This will be explained in detail in a later article.
How are segments assigned?
How does DataCoord allocate space when it receives an allocation request?
First, let's look at what an InsertRequest contains: CollectionID, PartitionID, Channel and NumOfRows.
Milvus currently has multiple strategies:
Default strategy: if an existing segment has enough space to store the rows, that segment's space is used first; if not, a new segment is created. How do we tell whether there is enough space? As mentioned earlier, a segment has three parts: used, allocated and free. Therefore free space = total size - used - allocated. The result may be fairly small, but allocations expire over time, so the free part grows as they expire.
A request can return one or more segment spaces. The maximum segment size is defined in the file data_coord.yaml.
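The default strategy can be sketched as follows. This simplified version returns a single segment per request and ignores requests larger than a whole segment; the dictionary shape and the function name `allocate` are assumptions made for the example.

```python
def allocate(segments, num_rows, max_rows):
    """Return the segment that receives num_rows, creating one if none has room.

    Each segment is a dict with 'used' and 'allocated' row counts (assumed shape).
    """
    for seg in segments:
        # free = total size - used - allocated
        free = max_rows - seg["used"] - seg["allocated"]
        if free >= num_rows:
            seg["allocated"] += num_rows
            return seg
    new_seg = {"id": len(segments) + 1, "used": 0, "allocated": num_rows}
    segments.append(new_seg)
    return new_seg

segments = [{"id": 1, "used": 600, "allocated": 300}]
a = allocate(segments, num_rows=50, max_rows=1000)   # fits: 100 rows are free
b = allocate(segments, num_rows=200, max_rows=1000)  # does not fit: only 50 left
```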
Data expiration logic
- Each allocation has an expiration time T, which is comparable with time ticks.
- A time tick is assigned when data is inserted, and only then is DataCoord asked to allocate segment space, so this time tick must be less than T.
- The default expiration time is 2000 milliseconds, defined by the segment.assignmentExpiration parameter in data_coord.yaml.
When is a segment sealed?
The allocation described above applies only to segments in the growing state. When does a segment become sealed?
A sealed segment is one whose space can no longer be allocated. A segment is sealed under any of the following conditions:
- Its space usage has reached the upper limit (75%).
- A Flush request for the collection has been received: all data in the collection must be persisted, so the segment can no longer allocate space.
- The segment has survived too long.
- There are too many growing segments, which increases memory usage on the DataNode, so the longest-surviving segment is forcibly closed.
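The first three seal conditions can be combined into a single predicate, sketched below. The 75% proportion comes from the text; `max_lifetime_seconds` is a made-up placeholder, and the fourth condition (too many growing segments) is left out because it depends on state across segments.

```python
def should_seal(used_rows, max_rows, age_seconds, flush_requested,
                seal_proportion=0.75, max_lifetime_seconds=3600):
    """Return True if any of the per-segment seal conditions holds.

    seal_proportion mirrors the 75% limit in the text; max_lifetime_seconds
    is a hypothetical value, not a Milvus default.
    """
    if flush_requested:
        return True
    if used_rows >= seal_proportion * max_rows:
        return True
    return age_seconds >= max_lifetime_seconds
```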
When is a segment flushed to disk?
Flushing persists segment data to object storage.
We need to wait for the allocated space to expire before we can perform the flush. Once the flush is finished, the segment is a flushed segment.
What exactly does this waiting involve?
The DataNode reports the time tick it has consumed up to, which is then compared with the expiration time tick of each allocation. If the reported time tick is larger, that allocation can be released. If it is larger than the timestamp of the last allocation, all allocated space has been released and no new data will be written to the segment, so the segment can be flushed.
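The comparison described above can be sketched as two small functions. The names `can_flush` and `release_expired` and the representation of allocations as a list of expiration time ticks are assumptions for illustration.

```python
def can_flush(allocation_expirations, consumed_tick):
    """A sealed segment can flush once every allocation has expired.

    allocation_expirations: expiration time ticks of the segment's allocations.
    consumed_tick: latest time tick reported by the DataNode for the channel.
    """
    return all(consumed_tick > expire for expire in allocation_expirations)

def release_expired(allocation_expirations, consumed_tick):
    """Keep only the allocations the DataNode has not yet passed."""
    return [e for e in allocation_expirations if e >= consumed_tick]
```

With allocations expiring at ticks 10 and 20, a reported tick of 15 releases the first allocation only, and the segment becomes flushable once the reported tick exceeds 20.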
Frequently Asked Questions and Details
- How do we ensure that a segment is flushed only after all of its data has been consumed?
The Data Node reports to DataCoord the time tick up to which it has consumed the current channel. That time tick indicates that all earlier data has been consumed, so it is safe to close the segment at that point.
- After a segment is flushed, how do we ensure that no more data is written to it?
Segments in the sealed or flushed state no longer allocate space, so no more data is written to them.
- Is the segment size strictly limited to the maximum size?
No. The number of rows a segment can hold is an estimate, so the limit is not strict.
- How is it estimated?
From the schema.
- What happens if the user calls Flush frequently?
Many small segments are generated, which hurts query efficiency.
- How is data prevented from being consumed multiple times after a DataNode restarts?
DataCoord records the position of the latest segment data in the message channel. The next time the channel is assigned, it tells the Data Node the position up to which each segment has been consumed, and the Data Node filters out earlier data. (The filtering is not complete.)
- When is an index created?
  - When the user manually requests it through the SDK
  - Automatically, after the segment flush is completed
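For the question above about estimating segment capacity from the schema, one plausible approach is to sum the per-row size of each field and divide the maximum segment size by it. The type names, byte sizes, the 512 MiB limit and both function names below are assumptions for the sketch, not values taken from Milvus.

```python
# Assumed per-value sizes in bytes; a float vector costs 4 bytes per dimension.
SCALAR_BYTES = {"int64": 8, "float": 4, "double": 8, "bool": 1}

def estimate_row_bytes(schema):
    """schema: list of (type_name, dim) pairs; dim is used only for 'float_vector'."""
    total = 0
    for type_name, dim in schema:
        if type_name == "float_vector":
            total += 4 * dim
        else:
            total += SCALAR_BYTES[type_name]
    return total

def estimate_max_rows(schema, segment_max_bytes):
    """Estimated number of rows that fit in one segment."""
    return segment_max_bytes // estimate_row_bytes(schema)

# Hypothetical schema: an int64 primary key plus a 128-dimensional float vector.
schema = [("int64", None), ("float_vector", 128)]
rows = estimate_max_rows(schema, segment_max_bytes=512 * 1024 * 1024)
```

Because variable-length data and encoding overhead are ignored, such a figure is only approximate, which is consistent with the segment size limit not being strict.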
File structure and data persistence
DataNode Flush
The DataNode subscribes to the message store, because all insert requests are in the message store. By subscribing, it can continuously consume insert messages and place them in a memory buffer. Once the buffer has accumulated to a certain size, it is flushed to object storage. (Binlogs are stored in object storage.)
Because the buffer size is limited, the DataNode does not wait until an entire segment has been consumed before writing it out; doing so could easily exhaust memory.
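The buffering behavior can be sketched with a small class. The threshold is counted in rows here for simplicity, and `InsertBuffer` and its `sink` callable (standing in for object storage) are hypothetical names.

```python
class InsertBuffer:
    """Accumulate rows in memory and flush once a size threshold is reached."""

    def __init__(self, flush_threshold, sink):
        self.flush_threshold = flush_threshold  # max rows held in memory (assumed unit)
        self.sink = sink                        # callable standing in for object storage
        self.rows = []

    def append(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.rows:
            self.sink(list(self.rows))  # one call per flushed batch
            self.rows.clear()

flushed = []
buf = InsertBuffer(flush_threshold=3, sink=flushed.append)
for i in range(7):
    buf.append(i)
```

After seven appends, two full batches have been written out and one row is still waiting in memory, so a single segment can end up spread over several flushed files.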
File structure
The structure of the Binlog file is similar to that of the MySQL binlog.
Binlog has two main functions: the first is to restore data, and the second is to build indexes.
A Binlog is divided into many events, and each event has two parts: an event header and event data. The event header stores meta information such as the creation time, the ID of the writing node, the event length and NextPosition (the offset of the next event).
Event data is divided into two parts: the fixed part, whose length is fixed, and the variable part, whose length varies and which is reserved for future expansion.
The fixed part of the event data of an INSERT_EVENT has three main fields: StartTimestamp, EndTimestamp and reserved. The reserved field sets aside space for extending the fixed part.
The variable part stores the actual inserted data, which is serialized in Parquet format and stored in the file.
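The header-plus-data layout of an event can be illustrated with Python's `struct` module. The field widths, field order and byte order below are assumptions chosen for the example; the text names the header fields but does not give their binary layout.

```python
import struct

# Assumed layout: little-endian; creation timestamp (u64), writer node ID (u32),
# event length (u32), next position (u32). Field widths are illustrative only.
HEADER = struct.Struct("<QIII")

def pack_event(timestamp, node_id, offset, payload):
    """Serialize one event: header followed by opaque event data."""
    event_length = HEADER.size + len(payload)
    next_position = offset + event_length  # offset of the next event in the file
    return HEADER.pack(timestamp, node_id, event_length, next_position) + payload

def unpack_header(buf):
    """Return (timestamp, node_id, event_length, next_position)."""
    return HEADER.unpack_from(buf, 0)

event = pack_event(timestamp=42, node_id=7, offset=0, payload=b"abc")
```

Storing the next position in each header lets a reader skip from event to event without parsing the event data.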
Binlog persistence
If the schema has multiple columns (say columns 1, 2, 3, 4 and 5), Milvus stores the Binlog in columnar form.
In the figure above, the first file is the Binlog of the primary key, then comes the timestamp column, and then columns 1 to 5 defined in the schema. The paths in MinIO are organized as follows: first a tenant ID, followed by an insert log, then the collection, partition, segment ID, field ID and log index. The log index is a unique ID. Multiple Binlogs are merged during deserialization.
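The path layout can be sketched as a simple key builder. The component order follows the text; the literal `insert_log` directory name and the function name `binlog_path` are assumptions for the example.

```python
def binlog_path(tenant_id, collection_id, partition_id, segment_id,
                field_id, log_index):
    """Build an object-storage key in the order the text lists the components."""
    parts = [tenant_id, "insert_log", collection_id, partition_id,
             segment_id, field_id, log_index]
    return "/".join(str(p) for p in parts)

path = binlog_path("tenant", 1, 2, 3, 100, 7)
```

Because the field ID is part of the key, each column of a segment gets its own series of files, which matches the columnar storage described above.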
In the recently released version, some users reported that they needed to delete data by specifying an ID, so we implemented fine-grained deletion (delete by ID). Specified content can now be deleted efficiently without waiting. At the same time, we added a compaction function, which reclaims the space freed by deletes and merges small segments to improve query efficiency.
At present, inserting a large amount of data one record at a time is inefficient. To solve this, we are working on a bulk load function, which lets users organize their data into a certain form and then load it into the system.
If you have any improvements or suggestions for Milvus, please keep in touch with us on GitHub or through our official channels.