Elasticsearch vector search

This article introduces the two ways to run a vector search in Elasticsearch (ES).

Vector search

When it comes to vector search, you are probably wondering:

  1. What is vector search?
  2. What are the application scenarios of vector search?
  3. How is vector search different from full-text search?

In short, full-text search in ES splits the text into terms (tokens), then uses the BM25 algorithm to compute a relevance score from those terms and find the documents most relevant to the search phrase. It is essentially a term-based (word-based) search.
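
To make "term-based" concrete, here is a minimal Python sketch of BM25 scoring. It is an illustration under assumptions, not what ES runs internally: the whitespace tokenizer is a stand-in for a real analyzer, the formula is the textbook BM25 variant, and the function name bm25_scores and the sample documents are made up for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    if not docs:
        return []
    # Naive segmentation: lowercase and split on whitespace (assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term that is absent contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "vector search in elasticsearch",
    "full text search with bm25",
    "cooking pasta at home",
]
scores = bm25_scores("vector search", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

Only documents that share at least one term with the query get a non-zero score, which is exactly the limitation that vector search tries to remove.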

Full-text search is widely used, and its core technology is mature. But real-world data comes in many forms besides text, such as images, audio, and video. Can we search that data too?

The answer is yes!

With the development of machine learning and artificial intelligence, "everything can be embedded". In other words, embedding techniques can convert any kind of data, such as text, images, audio, and video, into feature vectors. Once the vectors are available, the demand for vector search grows stronger and stronger, and the application scenarios for vector search become practically endless.

Image source: damo.alibaba.com/events/112

ES vector search overview

ES currently offers two ways to do vector search:

  • script_score
  • _knn_search

script_score exact search

ES 7.6 declared the dense_vector field type, which stores vector data, stable.

Data modeling example:

 PUT my-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 128
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}

As shown above, the index defines a vector field, my_vector, whose dims (number of dimensions) is 128.

script_score search example:

 GET my-index/_search
{
  "query": {
    "script_score": {
      "query": {"match_all": {}},
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my_vector') + 1.0",
        "params": {"query_vector": [0.3, 0.1, 1.2, ...]}
      }
    }
  }
}

The query above uses cosineSimilarity, a function built into ES since version 7.3, to compute the similarity score between the query vector and each document's vector. The 1.0 is added because cosine similarity ranges from -1 to 1 and ES does not allow negative scores.
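
To see what that score looks like for a single document, here is a minimal Python sketch. It is not ES code: the function names are hypothetical, and it only mirrors the expression cosineSimilarity(params.query_vector, 'my_vector') + 1.0 from the query above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def script_score(query_vector, doc_vector):
    """Mirror of the script source: cosine similarity shifted into [0, 2]."""
    return cosine_similarity(query_vector, doc_vector) + 1.0


same_direction = script_score([1.0, 0.0], [1.0, 0.0])
orthogonal = script_score([1.0, 0.0], [0.0, 1.0])
opposite = script_score([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score highest, unrelated (orthogonal) vectors land in the middle, and opposite vectors score lowest.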

Note that script_score first executes the inner query and then computes the vector similarity score only for the documents that match. This implies two things:

  • When modeling data, vector fields can sit alongside other field types, so mixed queries are supported (run a full-text search first, then a vector search over its results).
  • script_score is a brute-force computation: it scores every matching document, so the larger the data set, the greater the performance cost.
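
The filter-then-score flow can be sketched in plain Python. This is only an illustration under assumptions: docs is an in-memory list standing in for the index, matches is a hypothetical predicate standing in for the inner query, and the sample documents are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def filtered_vector_search(docs, query_vector, matches, size=10):
    """Step 1: keep documents accepted by the query. Step 2: score each one."""
    candidates = [doc for doc in docs if matches(doc)]
    # Brute force: one similarity computation per matching document,
    # so the cost grows linearly with the number of matches.
    scored = [(cosine(query_vector, doc["my_vector"]) + 1.0, doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:size]


docs = [
    {"my_text": "cat", "my_vector": [1.0, 0.0]},
    {"my_text": "dog", "my_vector": [0.0, 1.0]},
    {"my_text": "cat", "my_vector": [0.6, 0.8]},
]
hits = filtered_vector_search(docs, [1.0, 0.0], lambda doc: doc["my_text"] == "cat", size=2)
for score, doc in hits:
    print(round(score, 3), doc["my_text"], doc["my_vector"])
```

With match_all as the inner query, the predicate accepts everything and every document in the index gets scored, which is where the performance cost comes from.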

_knn_search approximate search

Because of the performance problems with script_score, ES 8.0 introduced a new vector search method, _knn_search (experimental at the time of writing).

Despite its name, _knn_search is actually an approximate nearest neighbor (ANN) search. It prioritizes search speed at the cost of some accuracy.

To use _knn_search, the data must be modeled differently.

Example:

 PUT my-index-knn
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 128,
        "index": true,
        "similarity": "dot_product"
      }
    }
  }
}

As shown above, two additional settings are required:

  • index must be true.
  • similarity specifies the vector similarity metric and must be one of l2_norm, dot_product, or cosine.

index must be true because, to support _knn_search, ES has to build an additional underlying data structure (currently an HNSW graph).
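
For intuition about the three metrics, here is a small Python sketch of the underlying math. The function names simply reuse the option names and are not ES APIs, and ES may transform these raw values before reporting a _score. As far as I understand the ES documentation, dot_product expects unit-length vectors, which is why the sketch normalizes first; for such vectors the dot product and the cosine coincide.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def l2_norm(a, b):
    """Euclidean (straight-line) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(dot_product(v, v))
    return [x / length for x in v]


q = normalize([3.0, 4.0])  # becomes roughly [0.6, 0.8]
d = normalize([4.0, 3.0])  # becomes roughly [0.8, 0.6]

print("dot_product:", dot_product(q, d))
print("cosine:     ", cosine(q, d))
print("l2_norm:    ", l2_norm(q, d))
```

Note the direction of each metric: larger dot_product and cosine values mean more similar vectors, while a smaller l2_norm distance means more similar vectors.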

_knn_search search example:

 GET my-index-knn/_knn_search
{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.3, 0.1, 1.2, ...],
    "k": 10,
    "num_candidates": 100
  },
  "_source": ["name", "date"]
}
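
If you assemble this request body in application code, a small helper can catch obvious mistakes before the request is sent. The sketch below is hypothetical: build_knn_search_body is not part of any ES client, the three-element vector is a shortened placeholder (a real one must have 128 elements to match dims), and the check that num_candidates is at least k reflects my understanding of the parameter rules rather than something this article states.

```python
import json


def build_knn_search_body(field, query_vector, k, num_candidates, source_fields=None):
    """Build the JSON body for a _knn_search request as a plain dict."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if num_candidates < k:
        # Assumption: there must be at least as many candidates as requested hits.
        raise ValueError("num_candidates must be at least k")
    body = {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }
    if source_fields is not None:
        body["_source"] = list(source_fields)
    return body


body = build_knn_search_body(
    "my_vector",
    [0.3, 0.1, 1.2],  # placeholder vector, shortened for readability
    k=10,
    num_candidates=100,
    source_fields=["name", "date"],
)
print(json.dumps(body, indent=2))
```

Here k is the number of nearest neighbors to return, and num_candidates is the size of the candidate pool considered before the top k are picked, so raising it generally trades speed for accuracy.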

The advantage of _knn_search is that it is very fast. The disadvantages are that the results are not guaranteed to be 100% accurate and that, in the version described here, it cannot be combined with Query DSL, so it cannot be used for mixed search.
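
"Not 100% accurate" is usually quantified as recall: the share of the true nearest neighbors that the approximate search also returned. A minimal Python sketch, assuming you already have the document ids from an exact search (for example script_score) and from _knn_search for the same query; the ids below are made up.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k results that also appear in the approximate results."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)


exact_ids = ["a", "b", "c", "d", "e"]   # hypothetical ids from the exact search
approx_ids = ["a", "b", "c", "e", "x"]  # hypothetical ids from the approximate search
recall = recall_at_k(exact_ids, approx_ids)
print(recall)
```

Here the approximate search missed one of the five true neighbors, so four out of five were found.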

凌虚