Editor's note: This article introduces Knowhere, the core computing engine of Milvus 2.0, in detail, including a code overview, how to add an index, and the optimizations made to Faiss.
Knowhere overview
If Milvus is likened to a sports car, Knowhere is its engine. Knowhere can be defined in a narrow sense and a broad sense. Knowhere in the narrow sense is the operation interface between the lower-level vector search libraries (such as Faiss, HNSW, Annoy) and the upper-level service scheduling. Heterogeneous computing is also controlled by the Knowhere layer, which manages which hardware index construction and query operations run on, such as CPU or GPU, and can be extended to DPU, TPU, and so on; hence the origin of the name: know where. Knowhere in the broad sense also includes Faiss and all other third-party indexing libraries. Therefore, Knowhere can be understood as the core computing engine of Milvus.
As can be seen from the above definition, Knowhere is only responsible for tasks related to data operations; other system-level tasks such as data sharding, load balancing, and disaster recovery are not within its scope. Also, starting with Milvus 2.0.1, Knowhere in the broad sense has been spun off from the Milvus project as a separate project.
As an AI database, Milvus' operations can be divided into scalar operations and vector operations. Knowhere only handles vector operations.
The figure above is the architecture diagram of the Knowhere module in the Milvus project. From bottom to top are the system hardware, the third-party vector search libraries, and Knowhere, which in turn interacts with index_node and query_node through CGO. The part shown in orange is Knowhere in the narrow sense, and the part in the blue box is Knowhere in the broad sense. This article covers Knowhere in the broad sense.
Knowhere code overview
Having a general understanding of Knowhere's code structure will make it easier for users to read or contribute to the code later.
Milvus data model
First, let us introduce the data model of Milvus.
- Database: Milvus does not currently support multi-tenancy, so there is only one database.
- Collection: because Milvus is a distributed system, a collection can be loaded on multiple nodes; each node loads part of the collection, and each part is called a shard.
- Partition: a logical division of the data within a collection, used to speed up queries.
- Segment: a data block within a partition.
The smallest unit of query in Milvus is the segment: a query on a collection is ultimately decomposed into queries on all segments of the collection (or of the specified partitions), and the results of all segments are then merged to obtain the final result.
As shown in the figure above, in order to support streaming data insertion, segments are divided into growing segments and sealed segments. A growing segment is a dynamic segment to which data can still be added, but it has no index and can only be queried by brute-force search; after reaching a size or time threshold, a growing segment becomes a sealed segment. Each segment contains multiple fields; among them, Primary Key and Timestamp are default system fields, and the other fields are specified by the user when creating the table.
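The decompose-and-merge step described above can be sketched as follows. This is a simplified illustration rather than Milvus' actual implementation: each segment returns its local top-k (distance, id) hits, and the final answer is the global top-k across all segments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One (distance, id) hit returned by a single segment's query.
struct Hit {
    float distance;
    int64_t id;
};

// Merge the per-segment top-k result lists into one global top-k.
// Assumes a smaller distance means a better match (e.g. L2 distance).
std::vector<Hit> MergeTopK(const std::vector<std::vector<Hit>>& per_segment,
                           size_t k) {
    std::vector<Hit> all;
    for (const auto& seg : per_segment) {
        all.insert(all.end(), seg.begin(), seg.end());
    }
    std::sort(all.begin(), all.end(),
              [](const Hit& a, const Hit& b) { return a.distance < b.distance; });
    if (all.size() > k) all.resize(k);
    return all;
}
```

In the real system each segment's list is already sorted, so a heap-based k-way merge would avoid the full sort; the full sort here keeps the sketch short.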
Currently a collection supports only a single vector field; Knowhere only deals with vector fields, and indexing and querying apply only to the vector fields in segments.
Index
An index is a data structure independent of the original vector data. Building most indexes requires four steps: create, insert data, train, and build.
For some AI applications, the training data set and the query data set are separate: training is done first on the training data set, and the query data is then inserted based on the training results. For example, the public dataset sift1M / sift1B is divided into dedicated training data and test data. For Knowhere, however, there is no distinction between training data and query data: for each segment, Knowhere trains on the full data of the segment, and then inserts that same full data to build the index based on the training result.
Knowhere Code Architecture
All operations in Knowhere revolve around indexes.
DataObj, on the far left of the figure below, is the base class of all data structures in Knowhere, with only one virtual method, Size(). The Index class inherits from DataObj; it has a field named size_ and adds the virtual methods Serialize() and Load(). VecIndex, derived from Index, is a pure virtual base class for all vector indexes. As can be seen from the figure, it provides methods such as Train(), Query(), GetStatistics(), and ClearStatistics().
Several other index types are shown on the right side of the image above.
- Faiss native indexes have two base classes: FaissBaseIndex is the base class of all Faiss native FLOAT indexes; FaissBaseBinaryIndex is the base class of all Faiss native BINARY indexes.
- GPUIndex is the base class for all Faiss native GPU indexes.
- OffsetBaseIndex is a self-developed index base class in which only the vector IDs are stored. For 128-dimensional float vectors, storing only the ID instead of the 512 bytes of raw vector data shrinks the index file by roughly two orders of magnitude. As a consequence, this kind of index must be used together with the original vectors when querying.
IDMAP is, strictly speaking, not an index at all; it is commonly known as "brute-force search". After the original vectors are inserted, no training or construction is required, and the original vector data is queried directly by brute-force search. To stay consistent with the other indexes, however, IDMAP also inherits from VecIndex and implements all its virtual interfaces, so it is used in the same way as any other index.
The picture above shows the IVF series, which are also the most frequently used index types. IVF is derived from VecIndex and FaissBaseIndex, and IVFSQ and IVFPQ are in turn derived from IVF; GPUIVF is derived from GPUIndex and IVF, and GPUIVFSQ and GPUIVFPQ are derived from GPUIVF.
IVFSQHybrid is a self-developed hybrid index. The coarse quantizer runs on the GPU, while the in-bucket query runs on the CPU. It takes advantage of the GPU's computing power while reducing memory copies between CPU and GPU, so its query recall rate is the same as GPUIVFSQ, but with higher query performance.
In addition, the class structure of the binary indexes is relatively simple: BinaryIDMAP and BinaryIVF are derived from FaissBaseBinaryIndex and VecIndex, and will not be discussed further.
At present, in addition to the Faiss family of indexes, only two other third-party indexes are supported: the tree-based index Annoy and the graph-based index HNSW. These two are currently the most widely used, and both derive directly from VecIndex.
How to add an index to Knowhere
To add a new index to Knowhere, it is recommended to refer to an existing one: for a vector-quantization-based index, refer to IVF_FLAT; for a graph-based index, refer to HNSW; for a tree-based index, refer to Annoy.
Specific steps are as follows:
- Add the new index name string in IndexEnum;
- Add validity checks for the new index's parameters in ConfAdapter.cpp (mainly parameter checks for train and query);
- Create a separate file for the new index; the new index class should derive from VecIndex (at least) and implement the virtual interfaces VecIndex requires;
- Add the creation logic for the new index in VecIndexFactory::CreateVecIndex();
- Finally, add unit tests in the unittest directory.
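The factory step can be sketched like this. It is a hypothetical, heavily simplified version of VecIndexFactory::CreateVecIndex(); the real code dispatches on the name strings defined in IndexEnum, and MyNewIndex stands in for the index being added:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Simplified sketch of a VecIndex factory. The classes below are placeholders
// for illustration, not Knowhere's real types.
struct VecIndex {
    virtual ~VecIndex() = default;
    virtual std::string Name() const = 0;
};

struct IvfFlatIndex : VecIndex {
    std::string Name() const override { return "IVF_FLAT"; }
};

struct MyNewIndex : VecIndex {  // the hypothetical new index being added
    std::string Name() const override { return "MY_NEW_INDEX"; }
};

std::unique_ptr<VecIndex> CreateVecIndex(const std::string& type) {
    if (type == "IVF_FLAT") return std::make_unique<IvfFlatIndex>();
    if (type == "MY_NEW_INDEX") return std::make_unique<MyNewIndex>();  // new branch
    return nullptr;  // unknown index type
}
```

Keeping creation behind a single factory function is what lets the new index plug into the rest of the pipeline untouched: callers only ever see the VecIndex interface.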
Knowhere Optimizations for Faiss
Knowhere has made many functional extensions and performance optimizations to Faiss.
1. Support BitsetView
Bitset was initially introduced to support "soft deletion". Each bit in the bitset corresponds to a row vector in the index; if the bit is 1, the row vector is considered deleted and does not participate in the computation during a query.
Later, the use of the bitset was expanded beyond deletion, but its basic semantics remain unchanged: as long as a bit is 1, the corresponding vector does not participate in the query.
All Faiss index query interfaces exposed in Knowhere have been given a bitset parameter, covering both CPU and GPU indexes.
For a detailed description of the bitset, please refer to the article Bitset Application Details.
2. Support more Binary Index distance calculation methods: Jaccard, Tanimoto, Superstructure, Substructure
Jaccard distance and Tanimoto distance can be used to calculate the similarity between samples; SuperStructure and Substructure can be used to calculate the similarity between chemical formulas.
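For binary vectors these metrics reduce to popcount arithmetic over packed bits. A sketch of the standard Jaccard distance, 1 − |A∩B| / |A∪B|, over 64-bit words (illustration only, not Knowhere's implementation):

```cpp
#include <bitset>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Jaccard distance between two binary vectors packed into 64-bit words:
// 1 - popcount(a & b) / popcount(a | b).
double JaccardDistance(const std::vector<uint64_t>& a,
                       const std::vector<uint64_t>& b) {
    uint64_t inter = 0, uni = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        inter += std::bitset<64>(a[i] & b[i]).count();  // bits set in both
        uni   += std::bitset<64>(a[i] | b[i]).count();  // bits set in either
    }
    return uni == 0 ? 0.0 : 1.0 - (double)inter / (double)uni;
}
```

The other binary metrics follow the same pattern with different ratios of the popcount terms, which is why supporting them in a Faiss fork is mostly a matter of adding new distance kernels.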
3. Support AVX512 instruction set
The instruction sets natively supported by Faiss include AARCH64 / SSE42 / AVX2; we added AVX512 support on top of the AVX2 implementation. Compared with AVX2, AVX512 improves performance by 20% - 30% for both index building and queries.
Please refer to the article Milvus performance comparison between AVX-512 and AVX2
4. Support dynamic loading of instruction set
The instruction set used by native Faiss must be specified through macros at compile time. With this approach, Milvus would need to compile a dedicated Milvus image for each instruction set at release time, and users would also have to select the specific Milvus image matching their hardware environment. This is inconvenient for both Milvus distribution and its users.
To solve this problem, Knowhere defines a unified function interface that each instruction set must implement, puts the implementations for the different instruction sets into separate files, and then compiles those files with different compilation parameters.
At runtime, Knowhere provides an interface that allows users to manually select which instruction-set implementation to run; otherwise, Knowhere first checks the highest instruction set supported by the CPU of the current environment and then binds the functions corresponding to that instruction set.
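The pattern can be sketched as follows (simplified, with hypothetical function names). Every variant shares one signature; at startup a function pointer is bound to the best variant the CPU supports, falling back to a scalar version:

```cpp
#include <cassert>
#include <cstddef>

// Runtime instruction-set dispatch sketch: all variants implement the same
// signature, and a function pointer is bound once based on CPU capability.
// In a real build the variants live in separate files compiled with different
// flags (e.g. -mavx2, -mavx512f); here both are plain C++ for illustration.
using L2SqrFunc = float (*)(const float*, const float*, size_t);

static float L2SqrScalar(const float* a, const float* b, size_t n) {
    float s = 0;
    for (size_t i = 0; i < n; ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
}

// Stand-in for the AVX512 variant: identical math here, a vectorized
// implementation in a separately compiled file in a real build.
static float L2SqrAvx512(const float* a, const float* b, size_t n) {
    return L2SqrScalar(a, b, n);
}

// Hypothetical capability probe; real code would query cpuid (or compiler
// builtins) for the highest supported instruction set.
static bool CpuSupportsAvx512() { return false; }

// Bind the function pointer once, picking the best supported variant.
static L2SqrFunc fvec_L2sqr = CpuSupportsAvx512() ? L2SqrAvx512 : L2SqrScalar;
```

Because all callers go through the function pointer, a single binary can ship every variant and still use the fastest one available on the machine it lands on, which is exactly what removes the need for per-instruction-set images.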
5. Other performance optimizations
Please refer to our paper Milvus: A Purpose-Built Vector Data Management System, published at SIGMOD.
For the full video explanation, please see: Deep Dive # Milvus 2.0 Knowhere Overview.
If you have any improvements or suggestions while using Milvus, feel free to keep in touch with us on GitHub or through our official channels.
With a vision to redefine data science, Zilliz is committed to building a global leader in open source technology innovation and unlocking the hidden value of unstructured data for enterprises through open source and cloud-native solutions.
Zilliz built the Milvus vector database to accelerate the development of next-generation data platforms. The Milvus database is a graduated project of the LF AI & Data Foundation. It can manage massive unstructured data sets and has a wide range of applications in new drug discovery, recommendation systems, chatbots, and more.