What kind of retrieval technology powers Taobao's search recommendations and video search? What problems do unstructured data retrieval, vector retrieval, and multimodal retrieval solve? Today, scientists from Alibaba DAMO Academy start from the business problems and take a deep look at the internals of DAMO Academy's vector search engine, Proxima, as well as its practical application in the industry templates of Alibaba Cloud's open search products.

Speakers:

  • Wang Shaoyi (Dasha), Senior Technical Expert, Machine Intelligence Laboratory, Alibaba DAMO Academy
  • Xiao Yunfeng (He Chong), Senior Technical Expert, Machine Intelligence Laboratory, Alibaba DAMO Academy

Alibaba Cloud search and recommendation products:

  • OpenSearch: https://www.aliyun.com/product/opensearch
  • Intelligent recommendation (AIRec): https://www.aliyun.com/product/bigdata/airec

Artificial intelligence (AI) is a technical field that has existed since computers were invented. One of its core features is that it can assist human work in a human-like manner: through a series of mathematical methods drawn from probability theory, statistics, linear algebra, and so on, it analyzes and designs algorithms that allow computers to learn automatically.

As shown in the figure below, artificial intelligence algorithms can abstract the various kinds of unstructured data (voice, pictures, video, language and text, behavior, etc.) generated by people, objects, and scenes in the physical world and turn them into multi-dimensional vectors. These vectors act like coordinates in a mathematical space, identifying entities and the relationships between them. The process of turning unstructured data into vectors is generally called Embedding, and unstructured retrieval is the process of searching over these generated vectors to find the corresponding entities.

The essence of unstructured retrieval is vector retrieval technology. Its main application areas include face recognition, recommendation systems, image search, video fingerprinting, speech processing, natural language processing, and file search. With the widespread application of AI technology and the continuous growth of data scale, vector retrieval has gradually become an indispensable part of the AI technology stack. It complements traditional search technology and enables multi-modal search.

1. Business scenario

1.1 Voice/image/video retrieval

The first major application of vector retrieval is retrieving the most common unstructured data humans encounter: voice, images, and video. Traditional search engines only index the names and descriptions of such multimedia; they do not try to understand and index the content of the unstructured data itself. The search results of traditional engines are therefore very limited.

With the development of artificial intelligence, AI allows us to understand this unstructured data quickly and at low cost, which makes it possible to retrieve the content of the data directly. A very important component of this is vector retrieval.

As shown in the figure below, taking image search as an example: we first run machine learning analysis on all historical images offline, abstract each image (or the objects segmented from it) into a high-dimensional feature vector, and then build all the features into an efficient vector index. When a new query (image) arrives, we analyze it with the same machine learning method to produce a representation vector, then use this vector to look up the most similar results in the previously built vector index. This completes an image retrieval based on the image's content.
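
A minimal sketch of this offline-index / online-query pipeline. The toy pixel-based `embed` function is a stand-in for the learned feature extractor (a real system would use a deep model), and the brute-force matrix product stands in for a real vector index; all names here are illustrative:

```python
import numpy as np

# Toy stand-in for a learned image encoder; a real system would use a deep
# model that maps an image to a high-dimensional semantic feature vector.
def embed(image):
    v = image.astype(float).ravel()
    return v / np.linalg.norm(v)  # L2-normalize: dot product == cosine similarity

# Offline: analyze all historical images and build the (here: brute-force) index.
gallery = [np.random.default_rng(i).random((8, 8)) for i in range(100)]
index = np.stack([embed(img) for img in gallery])       # shape (100, 64)

# Online: embed the query with the same model, then find the most similar images.
query_img = gallery[42] + 0.01                          # near-duplicate of image 42
scores = index @ embed(query_img)                       # cosine similarity per image
top5 = np.argsort(-scores)[:5]                          # ids of 5 most similar images
```

A production engine replaces the brute-force scan with an ANN index, but the embed-index-query structure is the same.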

1.2 Text search

In fact, vector retrieval has been used in common full-text retrieval for a long time.

Here we use address retrieval as an example to briefly introduce the application and value of vector retrieval technology in text retrieval.

Example 1: As shown on the left side of the figure below, suppose we want to search for "Zheyi Hospital" in a standard address database in which the keyword "Zheyi" never appears (the standard address of "Zheyi Hospital" is "The First Affiliated Hospital of Zhejiang University School of Medicine"). If we only use text segmentation ("Zheyi" and "Hospital"), no relevant result will be found, because no address contains "Zheyi". But if we analyze people's historical language and even past click associations to build a semantic relevance model, and express all addresses as high-dimensional features, then the similarity between "Zheyi Hospital" and "The First Affiliated Hospital of Zhejiang University School of Medicine" may be very high, so the address can be retrieved.

Example 2: As shown on the right side of the figure below, this is also an address query. If we want to search the standard address library for the address of "Hangzhou Alibaba", text recall alone can hardly find similar results. But if we analyze the click behavior of a large number of users and combine it with the address text to form a high-dimensional vector, then the addresses with high click-through rates can be naturally recalled and ranked first during retrieval.

1.3 Search/Recommendation/Advertising

In the search/recommendation/advertising scenarios of e-commerce, the common requirements are finding similar products of the same type and recommending products that interest users. Most of these requirements used to be met with item-collaboration and user-collaboration strategies. The new generation of search and recommendation systems absorbs the Embedding capability of deep learning and implements fast lookup through vector recall methods such as Item-Item (i2i), User-Item (u2i), User-User-Item (u2u2i), and User-Item-Item (u2i2i).

By abstracting the similarity and correlation of products, as well as users' browsing and purchasing behavior, algorithm engineers encode them into high-dimensional feature vectors and store them in a vector engine. When we need to find products similar to a given product (i2i), we can retrieve them from the vector engine efficiently and quickly.
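
Once items have been embedded, an i2i lookup is just a nearest-neighbor query over the item vectors, excluding the trigger item itself. A hedged toy sketch, with random vectors in place of learned item embeddings:

```python
import numpy as np

# Stand-in for learned item embeddings; a real system would derive these from
# product similarity/correlation and user browse/purchase behavior.
rng = np.random.default_rng(5)
item_vecs = rng.normal(size=(500, 16))
item_vecs /= np.linalg.norm(item_vecs, axis=1, keepdims=True)

def i2i(item_id, k=10):
    # Score the trigger item against every item, drop the item itself,
    # and return the k most similar item ids.
    scores = item_vecs @ item_vecs[item_id]
    order = np.argsort(-scores).tolist()
    return [i for i in order if i != item_id][:k]

similar = i2i(42)
```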

1.4 Almost all AI scenarios are covered

In fact, the application scenarios of vector retrieval go far beyond those mentioned above. As shown in the figure below, they cover most of the business scenarios where AI can be applied.

2. Current status and challenges of vector retrieval

2.1 Numerous search algorithms

The essence of vector retrieval is solving two problems: KNN and RNN. KNN (K-Nearest Neighbor) finds the K points closest to the query point; RNN (Radius Nearest Neighbor) finds all points, or up to N points, within a certain radius of the query point. With large amounts of data, solving the KNN or RNN problem with 100% accuracy is computationally expensive, so approximate methods are introduced. The practical problem of large-scale data retrieval is therefore the ANN (Approximate Nearest Neighbor) problem.
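
The two problem definitions can be written down exactly with a brute-force scan, which is the 100%-accurate baseline that ANN methods approximate:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 16))     # the indexed data set
q = points[0] + 0.001                    # a query very close to point 0

dists = np.linalg.norm(points - q, axis=1)

# KNN: the K points closest to the query point.
K = 5
knn_ids = np.argsort(dists)[:K]

# RNN: all points within a given radius of the query point.
radius = 2.0
rnn_ids = np.flatnonzero(dists <= radius)
```

Both scans cost O(n) distance computations per query, which is exactly what makes exact answers expensive at scale.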

Many retrieval algorithms have been proposed in industry to solve the ANN problem. The commonly used ones can be traced back to the KD-Tree, proposed in 1975, which works in Euclidean space and uses a multi-dimensional binary tree data structure to solve the ANN retrieval problem. At the end of the 1980s, the ideas of spatial encoding and hashing emerged, mainly represented by fractal curves and locality-sensitive hashing. Both belong to the family of spatial encoding and transformation; algorithms in the same family include Product Quantization (PQ) and others. These quantization algorithms map high-dimensional problems into low-dimensional ones to improve retrieval efficiency. At the beginning of the 21st century, the idea of using neighbor graphs to solve ANN problems began to sprout. Neighbor graphs are mainly based on the assumption that "a neighbor's neighbor is likely also a neighbor": the neighbor relationships of all points in the data set are established in advance to form a graph with certain properties; at query time the graph is traversed and the search converges to the result.

There are many vector retrieval algorithms and they lack generality: different data dimensions and distributions call for different algorithms. Overall, however, they can be classified into three families of ideas: space partitioning, spatial encoding and transformation, and neighbor graphs. Space partitioning methods, represented by KD-Tree and cluster-based retrieval, divide the data set into small subsets; at retrieval time these subsets are located quickly, reducing the number of data points that need to be scanned and improving efficiency. Spatial encoding and transformation methods, such as p-stable LSH and PQ, re-encode or transform the data set and map it into a smaller data space, reducing the computation spent scanning data points. Neighbor graph methods, such as HNSW, SPTAG, and ONNG, pre-build a relationship graph to speed up convergence during retrieval, again reducing the number of points that need to be scanned.
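
A sketch of the space-partitioning idea, in the style of cluster-based (IVF-like) retrieval: build a crude k-means partition, then at query time probe only the few clusters nearest the query instead of scanning the whole data set. Everything here is a simplified illustration, not any engine's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 8))

# Build: partition the data set with a few k-means iterations.
n_clusters = 16
centroids = data[rng.choice(len(data), n_clusters, replace=False)]
for _ in range(10):
    assign = np.argmin(
        np.linalg.norm(data[:, None] - centroids[None], axis=2), axis=1)
    centroids = np.stack([
        data[assign == c].mean(axis=0) if np.any(assign == c) else centroids[c]
        for c in range(n_clusters)])
# Final assignment against the final centroids.
assign = np.argmin(np.linalg.norm(data[:, None] - centroids[None], axis=2), axis=1)

def ivf_search(q, n_probe=4, k=5):
    # Probe only the n_probe clusters whose centroids are nearest the query,
    # then rank the candidates inside those clusters by exact distance.
    near = np.argsort(np.linalg.norm(centroids - q, axis=1))[:n_probe]
    cand = np.flatnonzero(np.isin(assign, near))
    d = np.linalg.norm(data[cand] - q, axis=1)
    return cand[np.argsort(d)[:k]]

result = ivf_search(data[7])
```

With `n_probe=4` of 16 clusters, only about a quarter of the points are scanned; raising `n_probe` trades speed for recall.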

2.2 Technical Challenges Facing

Some excellent open source work has emerged during the development of vector retrieval, such as FLANN and Faiss. These projects provide unified implementations and optimizations of the commonly used, effective ANN algorithms in industry and package them as runtime libraries. On top of these libraries and their improvements, industry has also produced service-oriented engines such as Milvus and Vearch.

Although vector retrieval has been developed for many years and has gradually become the mainstream method of unstructured retrieval, there are still many technical challenges and problems.

2.2.1 Accuracy and performance of very large-scale indexes

Large-scale data derives from the variety and complexity of unstructured data, and vector retrieval is inherently meant for retrieval at this scale. However, when facing scenarios with hundreds of millions or even billions of vectors, many retrieval algorithms still face challenges, and engineering implementations have problems as well: either the construction cost is huge, or retrieval efficiency is low.

In addition, growth in dimensionality reduces the efficiency of some vector retrieval methods, which look attractive but fall apart in high-dimensional spaces; it also raises the engineering cost of computation and storage. Furthermore, the algorithms lack full generality: no retrieval algorithm yet achieves universally consistent retrieval, that is, remains effective for any data distribution.

At present, the industry still cannot handle high-dimensional, billion-scale data in a single index; instead, multiple indexes are searched separately and their results merged, which increases the actual computational cost.

2.2.2 Distributed construction and retrieval

Vector retrieval currently achieves horizontal scaling through data sharding. However, too many shards easily increase the amount of computation, which lowers retrieval efficiency. On the distributed side, fast merging of vector indexes remains an open problem, so once the data is sharded, the Map-Reduce computation model cannot be applied to merge the shards into a more efficient index.
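
The retrieval side of sharding is straightforward to sketch: each shard returns a local top-k and a merge step keeps the global top-k. (The hard part described above, merging the *indexes* themselves, is what remains difficult.) All names below are illustrative:

```python
import heapq
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(3000, 8))

# Horizontal scaling: split the document ids into shards (simple row split).
shards = np.array_split(np.arange(len(data)), 4)

def shard_topk(q, ids, k):
    # Local brute-force top-k inside one shard, sorted by distance.
    d = np.linalg.norm(data[ids] - q, axis=1)
    order = np.argsort(d)[:k]
    return [(float(d[i]), int(ids[i])) for i in order]

def distributed_search(q, k=10):
    # Merge the sorted per-shard result lists and keep the global top-k.
    partials = [shard_topk(q, ids, k) for ids in shards]
    return [doc for dist, doc in heapq.nsmallest(k, heapq.merge(*partials))]

top = distributed_search(data[1234])
```

Note that every shard must compute a full top-k even though most of its results are discarded, which is why over-sharding inflates total computation.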

2.2.3 Online update of streaming index

Traditional retrieval methods easily implement CRUD operations, but vector retrieval depends on the data distribution and the distance metric. Some methods also require training on the data set, and a change to a single data point may affect the whole index. Realizing fully streaming construction of a vector index from 0 to 1, while supporting insert-and-query, real-time persistence, and real-time dynamic updates of the index, therefore still poses challenges for both algorithms and engineering.

At present, retrieval methods that require no training can conveniently support online dynamic insertion and querying of fully in-memory indexes. But once requirements such as operating cost enter the picture, real-time performance cannot yet be satisfied.

2.2.4 Joint retrieval of label + vector

In most business scenarios, both label (attribute) retrieval conditions and similarity retrieval requirements must be met, for example querying similar images under a combination of attribute conditions. We call this type of retrieval "conditional vector retrieval".

At present, the industry uses a multi-way merge approach: tags and vectors are searched separately and the results are then combined. Although this solves some problems, the results are unsatisfactory in most cases. The main reason is that vector retrieval has no filtering scope; its goal is to keep the top-K as accurate as possible, and when K is very large the accuracy tends to drop, leading to inaccurate or even empty merged results.
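
The difference is easy to demonstrate: post-filtering a vector top-K by label (the multi-way merge style) versus pushing the label condition into candidate selection. This is a toy sketch of the idea only; a real index-layer mechanism is more sophisticated:

```python
import numpy as np

rng = np.random.default_rng(3)
vecs = rng.normal(size=(1000, 8))
labels = rng.integers(0, 10, size=1000)   # one categorical attribute per item

q, want = vecs[500], labels[500]
k = 5

# Multi-way merge style (post-filter): vector top-K first, then apply the
# label condition. If few of the top-K carry the label, results thin out.
topk = np.argsort(np.linalg.norm(vecs - q, axis=1))[:k]
post = [int(i) for i in topk if labels[i] == want]

# Conditional vector retrieval style (pre-filter): restrict candidates to the
# matching label first, then rank by distance, so k results are guaranteed
# whenever enough items match the condition.
cand = np.flatnonzero(labels == want)
pre = cand[np.argsort(np.linalg.norm(vecs[cand] - q, axis=1))[:k]]
```

With selective conditions, post-filtering must enlarge K drastically to fill k results, which is exactly where its accuracy degrades.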

2.2.5 Complex multi-scene adaptation

Vector retrieval is a general-purpose capability, but there is currently no universal algorithm that adapts to every scene and every data set. Even the same algorithm needs different parameter configurations for different data. For the multi-layer clustering retrieval algorithm, for example: which clustering algorithm to use, how many layers, how many clusters, and what convergence threshold to use during search all differ across scenarios and data sets. Precisely because of this hyper-parameter tuning, the barrier to adoption is greatly raised.

To make things easier for users, scene adaptation must be considered. This mainly includes data adaptation (data scale, data distribution, data dimensionality, etc.) and requirement adaptation (recall rate, throughput, latency, streaming, real-time behavior, etc.). Based on the data distribution at hand, appropriate algorithms and parameters can then be selected to meet actual business needs.
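
One simple form of such requirement adaptation is a pre-experiment on sampled queries: measure the recall of an approximate setting against brute-force ground truth, and pick the cheapest setting that meets the target. In this toy sketch the "knob" is simply what fraction of the index gets scanned; real systems tune real index parameters:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(size=(2000, 8))
perm = rng.permutation(len(data))

def approx_search(q, fraction, k=10):
    # Crude stand-in for an ANN parameter: scan only `fraction` of the index.
    cand = perm[: int(len(data) * fraction)]
    d = np.linalg.norm(data[cand] - q, axis=1)
    return set(cand[np.argsort(d)[:k]].tolist())

def brute_search(q, k=10):
    return set(np.argsort(np.linalg.norm(data - q, axis=1))[:k].tolist())

# Pre-experiment: on sampled queries, choose the cheapest setting whose
# average recall@10 meets the target.
queries = data[rng.choice(len(data), 20, replace=False)]
target = 0.95
for fraction in (0.1, 0.2, 0.4, 0.8, 1.0):
    recall = np.mean([len(approx_search(q, fraction) & brute_search(q)) / 10
                      for q in queries])
    if recall >= target:
        break
```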

3. Demystifying the Vector Retrieval Technology of DAMO Academy

Proxima is a vector retrieval kernel developed in-house by Alibaba DAMO Academy. Its core capabilities are widely used in many businesses within Alibaba and Ant Group, such as Taobao search and recommendation, Ant Group's face-based payment, Youku video search, and Alimama advertising search. Proxima is also deeply integrated into various big data and database products to provide vector retrieval capabilities, such as Alibaba Cloud Hologres, OpenSearch, the search engines ElasticSearch and ZSearch, and the offline engine MaxCompute (ODPS).

Proxima is a general-purpose vector retrieval engine that realizes high-performance similarity search over big data. It supports multiple hardware platforms such as ARM64, x86, and GPU; it runs on embedded devices as well as high-performance servers, covering everything from edge computing to cloud computing; and it supports high-accuracy, high-performance index construction and retrieval for billion-scale single-machine indexes.

3.1 Core capabilities

As shown in the figure above, Proxima's main core capabilities are as follows:

Ultra-large-scale index construction and retrieval: Proxima refines the engineering implementation and low-level optimization of its algorithms and introduces a composite retrieval algorithm. Within a bounded construction cost, it implements a highly efficient retrieval method; a single index can reach a scale of several billion vectors.

Horizontal index scaling: Proxima uses non-uniform sharding to implement distributed retrieval. For the neighbor graph index, it solves the problem of quickly merging graph indexes within a bounded precision loss, so the index can be combined effectively with the Map-Reduce computation model.

High dimensionality & high precision: Proxima supports a variety of retrieval algorithms and abstracts them more deeply into an algorithm framework. Different algorithms, or combinations of algorithms, are selected for different data dimensions and distributions, and the balance between accuracy and performance can be tuned to the requirements of a specific scenario.

Streaming real-time & online updates: Proxima uses a flat index structure to support streaming construction of large-scale online vector indexes from 0 to 1, and uses the convenience and data characteristics of neighbor graphs to achieve insert-and-query, real-time persistence, and real-time dynamic updates of the index.

Label + vector retrieval: Proxima implements "conditional vector retrieval" at the indexing algorithm layer, which resolves the unsatisfactory results of traditional multi-way merge recall and better satisfies combined retrieval requirements.

Heterogeneous computing: Proxima supports high-volume, high-throughput offline retrieval acceleration and solves the problem of building neighbor graph indexes on GPUs. It also solves the resource utilization problem of small-batch + low-latency + high-throughput online retrieval, and is fully deployed in Taobao's search and recommendation system.

High performance and low cost: maximizing performance and meeting business needs within a bounded cost are the main problems vector retrieval must solve. Proxima is optimized for multiple platforms and hardware, supports cloud servers and some embedded devices, realizes offline data retrieval and training through integration with a distributed scheduling engine, and achieves fast search over cold data through flat indexing and disk-based retrieval schemes.

Scene adaptation: by combining methods such as hyper-parameter tuning and composite indexing with data sampling and pre-experiments, Proxima solves part of the problem of intelligently adapting to data scenarios, thereby improving the automation of the system and its ease of use.

3.2 Industry comparison

The vector retrieval library most commonly used in industry today is Faiss (Facebook AI Similarity Search), open-sourced by the Facebook AI team. Faiss is very good and serves as the core of many service engines, but it still has limitations in large-scale general retrieval scenarios, such as streaming real-time computing, offline distributed processing, online heterogeneous acceleration, joint tag & vector retrieval, cost control, and servitization.

For example, on the publicly available billion-scale ANN_SIFT1B data set (source: corpus-texmex.irisa.fr), on a server with an Intel(R) Xeon(R) Platinum 8163 CPU and 512 GB of memory, Faiss demands too many computing resources to build and retrieve a single-machine billion-scale index, while Proxima easily completes both construction and retrieval at the billion scale in the same environment with the same data volume.

For the feasibility of the test, Faiss and Proxima were compared on index construction and retrieval at the same 200-million data volume, and on single-card heterogeneous (GPU) computing at the same 20-million data volume. For the billion-scale data volume, Proxima provides its own test data separately. The specific results are as follows.

3.2.1 Search comparison

Proxima's retrieval performance is several times better than Faiss's, and it achieves higher-precision recall; TOP1 retrieval is even better. In addition, Faiss has design flaws in some algorithm implementations, for example its HNSW implementation, whose retrieval performance is very low on large-scale indexes.

3.2.2 Construction comparison

Building the 200-million-scale index takes Faiss 45 hours, which can be shortened to 15 hours with HNSW optimization. With the same resources, Proxima builds the index in a little over an hour, with smaller index storage and higher accuracy (see the search comparison).

3.2.3 Heterogeneous Computing

Proxima uses a GPU computing approach different from Faiss's, specially optimized for the "small batch + low latency + high throughput" online retrieval scenario.

Proxima shows striking advantages in small-batch scenarios (small batches, low latency, high throughput) and can make full use of GPU resources. This retrieval scheme is already applied at large scale in Alibaba's search and recommendation business.

3.2.4 Billion scale

Proxima supports streaming indexing and a semi-in-memory construction and retrieval mode, truly achieving single-machine billion-scale index construction under limited resources, together with high-performance, high-precision retrieval. Proxima's high-performance, low-cost capabilities provide strong basic support for large-scale offline AI training and online retrieval.

4. Application of vector retrieval in industry search

4.1 Alibaba Cloud Intelligent Search Development Platform: OpenSearch

OpenSearch is a one-stop intelligent search business development platform based on the large-scale distributed search engine independently developed by Alibaba. It currently provides search service support for the core businesses of Alibaba Group, including Taobao and Tmall. With built-in capabilities such as per-industry query semantics understanding and machine-learned ranking algorithms, it fully opens up the engine's capabilities to help developers quickly build smart search services with higher performance and a higher baseline of search quality.

4.2 Industry Search Application

Each industry's search business has its own characteristics and requirements. OpenSearch has created vertical industry solutions based on years of accumulated experience. With the help of DAMO Academy's advanced language processing technology, it addresses industry pain points and needs: it provides industry-specific query analysis capabilities, built-in industry ranking expressions and industry algorithm capabilities, lowers the threshold for onboarding, enables one-click configuration, improves integration efficiency, and gives enterprises better search results.

4.2.1 E-commerce industry applications

The OpenSearch e-commerce industry template productizes industry search. Users do not need to explore the technology in every direction themselves: simply connecting to the template gives them a strong search service, eliminating a great deal of data labeling and model training, with the search algorithm capabilities of the Taobao family built in directly. It supports personalized search and service capabilities, and through multi-channel recall on the engine side it realizes important services such as search results, drop-down suggestions, and hint words. As the e-commerce industry changes, the built-in capabilities are iteratively updated to guarantee more timely service.

Vector recall within the multi-channel recall of product search:

Learn more: https://www.aliyun.com/page-source//data-intelligence/activity/opensearch

4.2.2 Application of photo search for online education

The OpenSearch photo-search solution:

Multi-channel recall in educational question search:

Why multi-channel recall for question search?

There are significant differences between the educational photo-search scene and web/e-commerce text search:

  • The search query is particularly long: regular search terms are capped at 30, while photo-search questions need up to 100;
  • The search query is text produced by OCR from the photo, and recognition errors on key terms seriously affect recall and ranking;

Plain text query scheme

1. OR logic query

  • To reduce the no-result rate, systems commonly used by search customers rely on ElasticSearch's default OR logic, which has high latency and high computational cost;
  • OpenSearch also supports OR logic, and the high latency can be mitigated by parallel seek, but the overall computational cost remains high;

2. AND logic query

  • With a general-purpose query analysis module, the no-result rate is high and overall accuracy is worse than OR logic;
  • With a query analysis module customized for the education domain, the effect improves greatly and accuracy approaches OR logic;

How can computational cost and search accuracy be balanced? We introduced text vector retrieval.

Text vector retrieval

Goal: recall through text vector retrieval, combined with AND-logic queries, to achieve higher accuracy with lower latency and computational cost than OR logic;

Vector recall uses a state-of-the-art BERT model, with the following special optimizations:

  • The BERT model adopts StructBERT, self-developed by DAMO Academy, with models customized for the education industry;
  • The vector search engine adopts the Proxima engine developed by DAMO Academy, which is far more accurate and faster than open source systems;
  • Training data can be continuously accumulated from the customer's search logs, so the effect keeps improving;

As the figure shows, with vector recall added, the recall rate reaches that of OR logic, while accuracy now exceeds OR logic by 3 to 5 points. With the overall number of recalled documents reduced by a factor of 40, latency drops by more than a factor of 10.

Effect:

  1. Recall rate reaches that of OR logic
  2. Accuracy exceeds OR logic by 3%-5%
  3. The overall number of recalled docs drops by 40x, and latency drops by more than 10x

Advantages of multi-channel recall:

The combination of text recall and semantic vector recall has been verified effective in search scenarios. OpenSearch's multi-channel recall architecture leaves room for more channels: image vector recall, formula recall, and personalized recall.
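
A toy sketch of the multi-channel idea described above: an AND-logic keyword channel and a semantic vector channel whose candidate sets are merged before ranking. A bag-of-words vector stands in for the BERT-style embedding; all documents and names here are illustrative:

```python
import numpy as np

docs = ["triangle area formula", "solve quadratic equation",
        "area of a circle", "derivative of sine"]

# Channel 1: AND-logic keyword recall (every query term must appear).
def and_recall(query):
    terms = query.split()
    return {i for i, d in enumerate(docs) if all(t in d.split() for t in terms)}

# Channel 2: semantic vector recall. A real system embeds query and docs with
# a BERT-style model; a toy bag-of-words vector stands in for it here.
vocab = sorted({w for d in docs for w in d.split()})
def embed(text):
    v = np.array([text.split().count(w) for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

doc_vecs = np.stack([embed(d) for d in docs])

def vector_recall(query, k=2):
    scores = doc_vecs @ embed(query)
    return set(np.argsort(-scores)[:k].tolist())

# Multi-channel merge: the union of both channels forms the candidate set
# that is handed to the ranking stage.
query = "area formula"
candidates = and_recall(query) | vector_recall(query)
```

The vector channel surfaces "area of a circle", which the strict AND channel misses, without resorting to expensive OR-logic recall.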

In addition to OpenSearch's built-in vector models, we will also support customers' own vector indexes, and we welcome customers to work with us on deeper search algorithm optimization.

Learn more: https://www.aliyun.com/page-source/data-intelligence/activity/edusearch

5. Technology Outlook

With the widespread application of AI technology and the continuous growth of data scale, vector retrieval, as the mainstream retrieval method for deep learning representations, will further develop its general retrieval and multi-modal search capabilities. The entities and features of the physical world are represented and combined through vectorization and mapped into the digital world; computers then perform computation and retrieval over them, mining latent logic and implicit relationships to serve human society more intelligently.

In the future, besides coping with ever-growing data scale, vector retrieval still needs to solve the algorithmic problems of hybrid-space retrieval, sparse-space retrieval, ultra-high dimensionality, and universal consistency. On the engineering side, the scenarios it faces will become broader and more complex; how to form a strong, systematic stack that runs through scenarios and applications will be the focus of the next stage of development.

Copyright statement: the content of this article was contributed spontaneously by Alibaba Cloud real-name registered users, and the copyright belongs to the original author. The Alibaba Cloud Developer Community does not own the copyright and does not assume the corresponding legal responsibility. For the specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and the "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find suspected plagiarism in this community, fill in the infringement complaint form to report it; once verified, the community will immediately delete the suspected infringing content.
