Elasticsearch vector search
This article introduces two ways to run vector search in Elasticsearch (ES).
Vector search
When it comes to vector search, you are probably wondering:
- What is vector search?
- What are the application scenarios of vector search?
- How is vector search different from full-text search?
In short, ES's full-text search segments the text into terms and then uses the BM25 algorithm to compute a relevance score from those terms, finding text similar to the search sentence. It is essentially a term-based (word-based) search.
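To make the term-based idea concrete, here is a minimal sketch of BM25 scoring in Python. It is not how ES implements it internally: whitespace splitting stands in for a real analyzer, the function name `bm25_scores` is made up for this illustration, and `k1=1.2`, `b=0.75` are assumed as the commonly used defaults.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with the classic BM25 formula."""
    # Naive whitespace tokenization stands in for a real ES analyzer (assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms missing from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "vector search in elasticsearch",
    "full text search with bm25",
    "cooking pasta at home",
]
scores = bm25_scores("vector search", docs)
print(scores)
```

A document that shares no term with the query scores zero, however close it is in meaning. That is the gap vector search is meant to fill.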
Full-text search is widely used and its core technology is mature. However, real-world data comes in many forms besides text, such as pictures, audio, and video. Can we search these as well?
The answer is yes!
With the development of machine learning and artificial intelligence, everything can be embedded. In other words, embedding techniques can convert text, pictures, audio, video, and other data into feature vectors. Once the vectors are available, the demand for vector search grows, and its application scenarios become almost endless.
ES Vector Search Description
ES currently offers two ways to run a vector search:
- script_score
- _knn_search
script_score exact search
As of version 7.6, ES gives a stability guarantee for the dense_vector field type, which is used to represent vector data.
Data modeling example:
PUT my-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 128
      },
      "my_text": {
        "type": "keyword"
      }
    }
  }
}
As shown above, the index now has a vector field, my_vector, with 128 dimensions (dims), alongside a keyword field, my_text.
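Every document indexed into this mapping has to carry a vector with exactly `dims` values. The following is a small sketch of a client-side check in Python. The helper `make_document` and the sample values are hypothetical, and the dictionary it returns is only the request body; sending it to ES is left out.

```python
import json

DIMS = 128  # must match "dims" in the mapping above

def make_document(vector, text):
    """Build a document body for my-index, checking the vector length first."""
    if len(vector) != DIMS:
        raise ValueError(f"expected {DIMS} dimensions, got {len(vector)}")
    # Field names follow the mapping: my_vector (dense_vector), my_text (keyword).
    return {"my_vector": [float(x) for x in vector], "my_text": text}

doc = make_document([0.1] * DIMS, "example")  # placeholder values, not real embeddings
print(json.dumps(doc)[:60])
```

Checking the length before indexing gives a clearer error than waiting for the cluster to reject the document.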
script_score search example:
GET my-index/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my_vector') + 1.0",
        "params": { "query_vector": [0.3, 0.1, 1.2, ...] }
      }
    }
  }
}
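The query above uses cosineSimilarity, a function built into ES since version 7.3, to calculate the similarity score between the query vector and each document's vector. Because cosine similarity ranges from -1 to 1 and ES does not allow negative scores, the script adds 1.0 to the result.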
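The scoring expression `cosineSimilarity(params.query_vector, 'my_vector') + 1.0` can be reproduced in a few lines of Python. This is only a sketch of the arithmetic, not ES's implementation, and the function names are chosen for this illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def script_score(query_vector, doc_vector):
    # Mirrors the script source: cosine similarity shifted by 1.0,
    # so the result lies between 0.0 and 2.0 instead of -1.0 and 1.0.
    return cosine_similarity(query_vector, doc_vector) + 1.0

print(script_score([1.0, 0.0], [1.0, 0.0]))   # same direction
print(script_score([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(script_score([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Vectors pointing in the same direction get the highest score and opposite vectors the lowest, regardless of their lengths.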
Note that script_score first executes the inner query and then computes the vector similarity score only for the documents that match. This implies two things:
- Vector fields can be modeled alongside other field types, so mixed queries are supported (full-text search first, then vector scoring on the results).
- script_score is a brute-force calculation: the larger the data set to score, the greater the performance cost.
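The two points above can be sketched as a "filter first, then score everything that matched" loop. The Python below is an in-memory illustration only. The documents, their `id` key, and the exact-match filter on `my_text` are assumptions standing in for a real inner query.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filtered_vector_search(docs, query_vector, text_filter, size=10):
    """Brute-force search: run the filter, then score every matching document."""
    # Step 1: the inner "query" selects candidate documents.
    candidates = [d for d in docs if d["my_text"] == text_filter]
    # Step 2: each candidate is scored, so the cost grows linearly with the matches.
    scored = [(cosine(query_vector, d["my_vector"]) + 1.0, d["id"]) for d in candidates]
    scored.sort(reverse=True)
    return scored[:size]

# Hypothetical two-dimensional documents, kept tiny for readability.
docs = [
    {"id": "a", "my_text": "cat", "my_vector": [1.0, 0.0]},
    {"id": "b", "my_text": "cat", "my_vector": [0.0, 1.0]},
    {"id": "c", "my_text": "dog", "my_vector": [1.0, 0.0]},
]
hits = filtered_vector_search(docs, [1.0, 0.0], "cat")
print(hits)
```

Document `c` is as close to the query vector as `a`, but it never gets scored because the filter removed it first. A narrow inner query is therefore the main way to keep script_score cheap.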
_knn_search search
Because of the performance problems with script_score, ES introduced a new vector search method, _knn_search, in version 8.0 (experimental at the time of writing).
_knn_search is actually an approximate nearest neighbor (ANN) search. It prioritizes search speed at the expense of some accuracy.
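One common way to quantify "some accuracy" is recall: the share of the true nearest neighbors that the approximate search actually returned. Here is a minimal Python sketch. The helper `recall_at_k` and the document ids are hypothetical, and in practice the exact list would come from a brute-force search such as script_score.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the true top-k neighbors found by the approximate search."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)

exact = ["d1", "d2", "d3", "d4"]   # hypothetical ids from an exact search
approx = ["d1", "d2", "d4", "d9"]  # hypothetical ids from an ANN search
print(recall_at_k(exact, approx))  # 0.75
```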
To use _knn_search, the data must be modeled differently.
Example:
PUT my-index-knn
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 128,
        "index": true,
        "similarity": "dot_product"
      }
    }
  }
}
As shown above, we must additionally specify:
- index: set to true.
- similarity: the vector similarity algorithm, which can be one of l2_norm, dot_product, or cosine.
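The three similarity options correspond to three familiar vector measures. The Python sketch below computes the raw measures only. ES turns them into relevance scores with its own formulas, which are not reproduced here. As far as I know, ES also expects unit-length vectors when similarity is dot_product, so a `normalize` helper (a name chosen for this example) is included.

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_norm(a, b):
    """Euclidean distance between two vectors (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot_product(v, v))
    return [x / length for x in v]

a, b = [3.0, 4.0], [4.0, 3.0]
print(l2_norm(a, b))
print(cosine(a, b))
# For unit-length vectors, the dot product equals the cosine similarity.
print(dot_product(normalize(a), normalize(b)))
```

This is why dot_product on normalized vectors ranks results the same way as cosine while skipping the length computation at query time.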
index must be set to true because, to support _knn_search, ES has to build an additional data structure under the hood (currently an HNSW graph).
_knn_search search example:
GET my-index-knn/_knn_search
{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.3, 0.1, 1.2, ...],
    "k": 10,
    "num_candidates": 100
  },
  "_source": ["name", "date"]
}
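When the request is assembled in application code, a small builder keeps the parameters consistent. The Python sketch below only produces the request body shown above. The function `build_knn_request` is hypothetical, the check that `k` does not exceed `num_candidates` is an assumed sanity rule, and the three-element vector is a placeholder (a real query vector needs all 128 dimensions).

```python
import json

def build_knn_request(field, query_vector, k=10, num_candidates=100, source_fields=None):
    """Build the body of a _knn_search request as a plain dictionary."""
    # Assumed sanity rule: asking for more hits than candidates makes no sense.
    if k > num_candidates:
        raise ValueError("k should not exceed num_candidates")
    body = {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,                            # number of nearest neighbors to return
            "num_candidates": num_candidates,  # candidates to consider during the search
        }
    }
    if source_fields:
        body["_source"] = list(source_fields)
    return body

body = build_knn_request(
    "my_vector",
    [0.3, 0.1, 1.2],  # placeholder, shortened for readability
    k=10,
    num_candidates=100,
    source_fields=["name", "date"],
)
print(json.dumps(body, indent=2))
```

Raising num_candidates generally trades speed for accuracy, since more candidates are examined before the top k are returned.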
The advantage of _knn_search is that it is very fast. The disadvantages are that its results are not guaranteed to be 100% accurate and that, as of ES 8.0, it cannot be combined with Query DSL, so it cannot be used for mixed search.