
Many features in the apps we use every day depend on similarity retrieval technology: the news and video feeds that recommend one item after another, the chatbots we talk to, the risk-control systems that keep our accounts safe, the music apps that can find a song from a hummed melody, and even the route planning behind food delivery.

Many readers may be hearing about Faiss for the first time, or know it only by name without knowing how to use it. In this article, let's talk about Faiss and see how this piece of "black technology" works its magic.

Before We Start

Faiss is a leading similarity retrieval solution. It is an open source project from Meta AI (formerly Facebook AI Research), and one of the most popular and efficient similarity retrieval libraries available. Although similarity retrieval technology is widely used and shows up in the familiar apps of many "big tech" companies, it remains a niche field with a high barrier to entry and considerable complexity to master.

So don't try to master it all in one go; let's take it step by step.

Of course, if you'd rather skip the details and just write a few lines of CRUD to get efficient vector retrieval, you can try spinning up a Milvus instance. Or, to save even more effort, you can use Milvus's cloud service for high-performance vector retrieval.

Understanding how Faiss works and where it fits

Before formally using Faiss, we need to understand how it works.

When we feed it data produced by a model or AI application ("a pile of feature vectors"), Faiss indexes the data using well-established techniques, much like the query optimization and acceleration found in traditional databases. This spares us from clumsy one-by-one comparisons across massive datasets at query time, and it is the secret behind Faiss's "high-performance vector retrieval."

In the lucrative "search, advertising, and recommendation" businesses that Internet companies are known for, Faiss is used for vector recall. In these scenarios the system must compute associations across multiple dimensions of data, and because real-world data volumes are enormous, naive approaches quickly blow up into something like a Cartesian product. Computing pairwise similarity head-on across massive data is simply unrealistic.

Faiss is one of the few reliable solutions for quickly retrieving the results most similar to a query (the Top-K similar results) in such massive-data scenarios.
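To make the brute-force baseline concrete, here is a minimal sketch in NumPy (with made-up random data; none of this is from the original article). Scanning every stored vector for each query is exactly the O(N) work that Faiss's indexes are designed to avoid at scale:

```python
import numpy as np

np.random.seed(0)
database = np.random.random((10000, 64)).astype("float32")  # 10k stored vectors, 64-dim
query = database[42]  # use a known stored vector as the query

# Brute force: compute the L2 distance from the query to every stored vector,
# then keep the indices of the Top-K smallest distances
distances = np.linalg.norm(database - query, axis=1)
top_k = np.argsort(distances)[:5]

print(top_k[0])  # 42 — the query's own vector is its nearest neighbor
```

This works fine at 10,000 vectors, but the cost grows linearly with the dataset, which is why indexing is needed.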

Just as we specify field types in ordinary databases, Faiss lets us specify index types: IndexFlatL2, IndexHNSW, IndexIVF, and more than twenty others. The names look strange, nothing like traditional strings, numbers, or dates, but these index types deliver surprisingly high retrieval performance across different data scales and business scenarios. Conversely, with different scenarios, data volumes, index types, and parameter settings, application performance can vary enormously; choosing the right index is an art in itself (more on this below).

Besides its rich set of index types, Faiss runs in both CPU and GPU environments and can be called from C++ or Python; community developers have also built go-faiss so Faiss can be used from Golang.

With this preliminary understanding of Faiss, let's get ready to use it.

Environmental preparation

To minimize unnecessary problems, this article uses the Linux operating system as the base environment for Faiss and Python as the way to interact with it.

In previous articles I covered how to prepare Linux and Python environments. If you are new to Linux, you can read "Building a Cost-Effective Linux Learning Environment on a Laptop: The Basics" to set up the system from scratch; if you are unfamiliar with configuring Python environments, I recommend the "Preparation" section of "Make Your Own Animated Video with the AI Model That Surprised Makoto Shinkai" to complete the Conda installation and configuration.

Once everything is ready, we can install either the CPU or the GPU version of Faiss depending on our hardware, and optionally pin a specific CUDA version:

# Create a clean environment
conda create -n faiss -y
# Activate the environment
conda activate faiss
# Install the CPU version
conda install -c pytorch python=3.8 faiss-cpu -y
# Or, install the GPU version
conda install -c pytorch python=3.8 faiss-gpu -y
# Or, pin a specific CUDA version
conda install -c pytorch python=3.8 faiss-gpu cudatoolkit=10.2 -y

When installing, it is recommended to use Python 3.8 to avoid unnecessary compatibility issues. With the environment ready, we can formally enter the magical world of vector data.

Build vector data

As mentioned above, the stage where Faiss shines is the world of vector data, so we first need to construct some vector data of our own.

As an introductory article, this one won't cover how to build vectors from data such as voice (audio), movies (video), or fingerprints and faces (images). We'll start with the simplest text data and implement a "text search feature backed by vector retrieval." I'll use my favorite novel, "Harry Potter," as the example; feel free to substitute any text you like. Download the text data (a txt file) you want to vectorize from the web.

Simple ETL for data

My original TXT file is about 3 MB. To cut down on unnecessary vector computation, we first preprocess the content (the ETL step), removing duplicate content, blank lines, and so on:

 cat /Users/soulteary/《哈利波特》.txt | tr -d ' ' | sed '/^[[:space:]]*$/d' > data.txt

Open the file and look closely: some lines of text are extremely long, containing many sentences, which will hurt both our vector feature computation and the precision of retrieval results. We therefore need to adjust the content further, splitting long multi-sentence lines into one short sentence per line.

To handle sentence wrapping properly, and to avoid splitting a single quoted dialogue into multiple lines, we can use a simple Node.js script to process the data:

const { readFileSync, writeFileSync } = require("fs");

const raw = readFileSync("./data.txt", "utf-8")
  .split("\n")
  // Break each line after every full stop, so long lines become one sentence per line
  .map((line) => line.replace(/。/g, "。\n").split("\n"))
  .flat()
  .join("\n")
  // Re-join quoted dialogue that was split across lines
  .replace(/“([\S]+?)”/g, (match) => match.replace(/\n/g, ""))
  .replace(/“([\S\r\n]+?)”/g, (match) => match.replace(/[\r\n]/g, ""))
  .split("\n")
  // Strip remaining whitespace from each line
  .map((line) => line.replace(/\s/g, "").trim())
  // Drop empty lines
  .filter((line) => line)
  .join("\n");

writeFileSync("./ready.txt", raw);

After we run the script with node to process the text, a file named ready.txt will appear in the current folder.

To get a better sense later of the vector database's resource footprint, let's check how much disk space the cleaned-up text file occupies:

 du -hs ready.txt 
5.5M        ready.txt

Convert text to vectors using a model

To convert text into vector data, we need a model that can produce text embeddings. The model I chose is the pre-trained model from "UER: An Open-Source Toolkit for Pre-training Models," a joint effort by Renmin University of China, Tencent AI Lab, and Peking University (in paper author order).

Details about this pre-trained model can be found on its model page, uer/sbert-base-chinese-nli.

To use the model, we need to install some basic Python packages first:

 pip install sentence_transformers pandas

Once the dependencies are installed, typing python in the terminal opens the Python interactive shell. First, we use pandas to parse the prepared text file into a DataFrame.

import pandas as pd
# "#" never appears in the text, so each line is parsed as a single column
df = pd.read_csv("ready.txt", sep="#", header=None, names=["sentence"])
print(df)

After execution, we will see results similar to the following:

 sentence
0                                  《哈利波特》J.K罗琳
1                                第一部 第一章 幸存的男孩
2          住在四号普里怀特街的杜斯利先生及夫人非常骄傲地宣称自己是十分正常的人。
3      但是他们最不希望见到的就是任何奇怪或神秘故事中的人物因为他们对此总是嗤之以鼻。
4                      杜斯利先生是一家叫作格朗宁斯的钻机工厂的老板。
...                                        ...
60023                 哈利看着她茫然地低下头摸了摸额头上闪电形的伤疤。
60024                                “我知道他会的。”
60025                         十九年来哈利的伤疤再也没有疼过。
60026                                   一切都很好。
60027                                    (全书完)

[60028 rows x 1 columns]

Next, we compute vectors for the text loaded into memory, extracting a feature vector for each row of data:

 from sentence_transformers import SentenceTransformer
model = SentenceTransformer('uer/sbert-base-chinese-nli')
sentences = df['sentence'].tolist()
sentence_embeddings = model.encode(sentences)

This step takes a long time, and how long depends on your machine's performance. On my ordinary Zen 2 laptop it took about half an hour, so you might as well stand up and stretch while it runs to relieve the fatigue of the day.

Once vectorization finishes, we can run sentence_embeddings.shape to inspect the result:

 (60028, 768)

We'll see output like the above: roughly 60,000 sentences, each vectorized into a 768-dimensional vector.

Wrapping up

We've got the "vector data" done, and in the next article, we'll learn how to use Faiss to implement vector similarity retrieval.


Author: Su Yang

Original post: "Vector Database Primer: On Faiss, the Similarity Retrieval Technology from the 'Metaverse Giant'"

Link: https://zhuanlan.zhihu.com/p/560981386


If you find the content we share worthwhile, please don't hesitate to give us some encouragement: like, comment, or share it with your friends!

For event information, technology sharing and recruitment express, please follow: https://zilliz.gitee.io/welcome/

If you are interested in our projects please follow:

Milvus, a database for storing vectors and building indexes

Towhee, a framework for building model inference pipelines

