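Build Your Own ArXiv Paper Similarity Search Engine with Dask, SBERT SPECTER, and Milvus

With vector similarity search, semantic search queries over ~640K papers can be answered in ~50 ms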

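Most readers will know arXiv.org: reading the open-access research papers published there is one of the best ways to learn data science. But even for experienced researchers, finding what to read among such a large volume of papers is hard. Tools such as Connected Papers help, but they measure similarity from the citations and bibliographies that papers share. That approach is simple and works well, yet the semantic meaning of a paper's text is also an important signal of similarity.

In this post we build a semantic similarity search engine by hand. It takes a single paper as the "query" and finds the top-K most similar papers. In our test on a single laptop, queries against the 640K computer science papers in the arXiv corpus had a latency under 50 ms! The post is organized as follows: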

Set up the environment and download the arXiv data from Kaggle
Load the data into Python with Dask
Semantic similarity search with the Milvus vector database

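The techniques used here are not limited to scientific papers; they can serve as a template for building any NLP semantic similarity search engine. The only difference is the pre-trained model used.

This post uses the arXiv dataset from Kaggle, which is released under the CC0: Public Domain license, so please read its usage terms first.

Milvus is a vector database. Other vector databases would work too; if you swap one in, many of the steps stay exactly the same and the changes are not complicated.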

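Set up the environment and download the arXiv data from Kaggle

Cornell University has uploaded the entire arXiv corpus to Kaggle under the CC0: Public Domain license. We can download the dataset directly with the Kaggle API.

We also use a Conda environment here, named semantic_similarity. The steps below create the necessary directories and the Conda environment, install the required Python libraries, and download the arXiv dataset from Kaggle.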

 # Create the necessary directories 
 mkdir -p semantic_similarity/notebooks semantic_similarity/data semantic_similarity/milvus
 
 # CD into the data directory
 cd semantic_similarity/data
 
 # Create and activate a conda environment
 conda create -n semantic_similarity python=3.9
 conda activate semantic_similarity
 
 ## Create Virtual Environment using venv if not using conda
 # python -m venv semantic_similarity
 # source semantic_similarity/bin/activate
 
 # Pip install the necessary libraries
 pip install jupyterlab kaggle matplotlib scikit-learn tqdm ipywidgets 
 pip install "dask[complete]" sentence-transformers
 pip install pandas pyarrow pymilvus protobuf==3.20.0
 
 # Download data using the kaggle API
 kaggle datasets download -d Cornell-University/arxiv
 
 # Unzip the data into the local directory
 unzip arxiv.zip
 
 # Delete the Zip file
 rm arxiv.zip

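Load the data into Python with Dask

The data downloaded from Kaggle is a 3.3 GB JSON file containing roughly 2 million papers! With a dataset this large, loading everything into memory with pandas is not a good idea. Instead, we use Dask to split the data into multiple partitions and load only the partitions being processed into memory.

Dask

Dask is an open-source library that lets us run parallel computations through a pandas-like API. Install it locally by running "pip install dask[complete]". Once it is installed, import the necessary libraries.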

 import dask.bag as db
 import json
 from datetime import datetime
 import time
 
 data_path = '../data/arxiv-metadata-oai-snapshot.json'

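We will use two Dask components to process the large arXiv JSON file efficiently.

Dask Bag: lets us load the JSON file in fixed-size blocks and run preprocessing functions on each row of data
Dask DataFrame: the Dask Bag is converted to a Dask DataFrame, which can be accessed through a pandas-like API

Step 1: Load the JSON file into a Dask Bag

Load the JSON file into a Dask Bag in blocks of 10 MB each. The blocksize argument controls the size of each block. Then use the .map() function to apply json.loads to every row of the Dask Bag, parsing each JSON string into a Python dictionary.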

 # Read the file in blocks of 10MB and parse the JSON.
 papers_db = db.read_text(data_path, blocksize="10MB").map(json.loads)
 
 # Print the first row
 papers_db.take(1)
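
To make the idea concrete without the full dataset, here is a minimal pure-Python sketch of what this step does conceptually: read a line-delimited JSON file lazily and parse each line into a dictionary. It is only an analogue of db.read_text(...).map(json.loads), not how Dask implements it. The helper name iter_json_lines, the temporary file, and the two sample records are made up for illustration.

```python
import json
import os
import tempfile


def iter_json_lines(path):
    """Yield one parsed record per non-empty line, without loading the whole file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Two made-up records in the same one-JSON-object-per-line layout
sample_records = [
    {"id": "0001.0001", "title": "A sample title", "categories": "cs.CL"},
    {"id": "0001.0002", "title": "Another title", "categories": "math.CO"},
]

# Write the sample records to a temporary file, one JSON object per line
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    for record in sample_records:
        tmp.write(json.dumps(record) + "\n")
    tmp_path = tmp.name

parsed = list(iter_json_lines(tmp_path))
os.remove(tmp_path)

print(parsed[0]["id"], len(parsed))
```

Dask adds two things on top of this pattern: the blocks can be processed in parallel, and nothing is read until a result is requested (for example with .take()).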

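Step 2: Write the preprocessing helper functions

The printed output shows that each row contains several pieces of metadata about a paper. Let's write three helper functions to preprocess the dataset.

v1_date(): extracts the date on which the authors uploaded the first version of the paper to arXiv. We convert the date to a UNIX timestamp and store it in a new field on the row.

text_col(): joins the "title" and "abstract" fields with a "[SEP]" token so that the text can be fed to the SPECTER embedding model.

filters(): keeps only rows that meet certain criteria, such as a maximum text length for each column and membership in a computer science category.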
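
As a small worked example of the date conversion that v1_date() performs, the sketch below parses one timestamp string and converts it to a UNIX timestamp. The sample "versions" entry is made up in the layout the format string expects. calendar.timegm is used here so that the result does not depend on the local time zone; that is a choice made for this sketch.

```python
import calendar
from datetime import datetime

# Hypothetical "versions" entry in the shape described above
sample_version = {"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"}

# Parse the string into a naive datetime (the %Z directive consumes "GMT")
parsed = datetime.strptime(sample_version["created"], "%a, %d %b %Y %H:%M:%S %Z")

# calendar.timegm treats the struct_time as UTC, so the result does not
# depend on the time zone of the machine running the code
unix_time = calendar.timegm(parsed.timetuple())

print(parsed.isoformat(), unix_time)
```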

 def v1_date(row):
     """
     For each row in the dask bag, 
     find the date of the first version of the paper 
     and add it to the row as a new column
     Args:
       row: a row of the dask bag
     Returns:
       A row of the dask bag with added "unix_time" column
     """
     
     versions = row["versions"]
 
     date = None
     for version in versions:
         if version["version"] == "v1":
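             date = datetime.strptime(version["created"], "%a, %d %b %Y %H:%M:%S %Z")
             # "created" is in GMT, so compute the offset from the epoch directly;
             # time.mktime would interpret the value as local time
             date = int((date - datetime(1970, 1, 1)).total_seconds())
             break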
 
     row["unix_time"] = date
 
     return row
 
 
 def text_col(row):
     """
     It takes a row of a dataframe, adds a new column called 'text' 
     that is the concatenation of the 'title' and 'abstract' columns
     Args:
       row: the row of the dataframe
     Returns:
       A row with the text column added.
     """
 
     row["text"] = row["title"] + "[SEP]" + row["abstract"]
     return row
 
 
 def filters(row):
     """
     For each row in the dask bag, only keep the row if it meets the filter criteria
     
     Args:
       row: the row of the dataframe
     Returns:
       Boolean mask
     """
     
     return ((len(row["id"])<16) and 
             (len(row["categories"])<200) and
             (len(row["title"])<4096) and
             (len(row["abstract"])<65535) and
             ("cs." in row["categories"]) # Keep only CS papers
            )
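
One caveat about the category check: the substring test "cs." in row["categories"] also matches category names such as physics.optics, because "physics." itself ends in "cs.". If that is not intended, a stricter variant can test the prefix of each space-separated category instead. The helper name is_cs_paper and the sample strings below are hypothetical, and switching to it would likely change the number of papers kept.

```python
def is_cs_paper(categories):
    """Return True if any space-separated category starts with 'cs.'."""
    return any(cat.startswith("cs.") for cat in categories.split())


# Made-up category strings in arXiv's space-separated layout
examples = ["cs.CL cs.LG", "physics.optics", "math.CO cs.DM", "stat.ML"]

# Compare the substring test with the prefix test on each sample
for cats in examples:
    substring_match = "cs." in cats
    prefix_match = is_cs_paper(cats)
    print(f"{cats!r}: substring={substring_match}, prefix={prefix_match}")
```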

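Step 3: Run the preprocessing helper functions on the Dask Bag

As shown below, the .map() and .filter() functions run a function on every row of the Dask Bag. Because Dask supports method chaining, we can take this opportunity to keep only the required columns and drop the rest.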

 # Specify columns to keep in the final table
 cols_to_keep = ["id", "categories", "title", "abstract", "unix_time", "text"]
 
 # Apply the pre-processing
 papers_db = (
     papers_db.map(v1_date)
     .map(text_col)
     .map(
         lambda row: {
             key: value 
             for key, value in row.items() 
             if key in cols_to_keep
         }
     )
     .filter(filters)
 )
 
 # Print the first row
 papers_db.take(1)
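
The same map-then-filter pattern can be tried on a couple of in-memory rows with Python's built-in map() and filter(). This is only an analogue for intuition, not Dask code. The rows, the shortened cols_to_keep list, and the project and short_title helpers are made up for illustration.

```python
cols_to_keep = ["id", "title", "text"]

# Made-up rows with one extra field that should be dropped
rows = [
    {"id": "a1", "title": "Short", "text": "Short[SEP]abc", "comments": "5 pages"},
    {"id": "b2", "title": "T" * 10, "text": "x", "comments": ""},
]


def project(row):
    """Keep only the whitelisted keys."""
    return {key: value for key, value in row.items() if key in cols_to_keep}


def short_title(row):
    """Example predicate: keep rows whose title is shorter than 8 characters."""
    return len(row["title"]) < 8


# map() and filter() are lazy here too; list() forces evaluation
result = list(filter(short_title, map(project, rows)))
print(result)
```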

Step 4: Convert the Dask Bag to a Dask DataFrame

The last data-loading step is to convert the Dask Bag to a Dask DataFrame so that we can access it through a pandas-like API.

 # Convert the Dask Bag to a Dask Dataframe
 schema = {
     "id": str,
     "title": str,
     "categories": str,
     "abstract": str,
     "unix_time": int,
     "text": str,
 }
 papers_df = papers_db.to_dataframe(meta=schema)
 
 # Display first 5 rows
 papers_df.head()

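Semantic similarity search with the Milvus vector database

Milvus is one of the most popular open-source vector databases, which is why we use it in this post. We use the standalone version, since we only run Milvus on a local machine.

Step 1: Install the Milvus vector database locally

Installing the Milvus vector database with Docker is straightforward, so we first need Docker installed. Then download docker-compose.yml and start the Docker containers, as shown below. The milvus.io website offers many other options for installing Milvus standalone and Milvus cluster; if you need to install on a Kubernetes cluster or offline, refer to the relevant documentation.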

 # CD into milvus directory
 cd semantic_similarity/milvus 
 
 # Download the Standalone version of Milvus docker compose
 wget https://github.com/milvus-io/milvus/releases/download/v2.1.0/milvus-standalone-docker-compose.yml -O ./docker-compose.yml
 
 # Run the Milvus server docker container on your local
 sudo docker-compose up -d

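Step 2: Create a Milvus collection

We can interact with the Milvus vector database service through the pymilvus library. The emb_dim parameter is the dimension of the embedding that text is converted into. For SPECTER, the embedding dimension is 768.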

 # Make sure a Milvus server is already running
 from pymilvus import connections, utility
 from pymilvus import Collection, CollectionSchema, FieldSchema, DataType
 
 # Connect to Milvus server
 connections.connect(alias="default", host="localhost", port="19530")
 
 # Collection name
 collection_name = "arxiv"
 
 # Embedding size
 emb_dim = 768
 
 # # Check for existing collection and drop if exists
 # if utility.has_collection(collection_name):
 #     print(utility.list_collections())
 #     utility.drop_collection(collection_name)

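A Milvus collection is similar to a table in a traditional database. To create a collection, first specify its schema. This example takes advantage of the string indexes and fields in Milvus 2.1 to store all the necessary metadata for each paper. The primary key idx and the other fields categories, title, and abstract are of the VARCHAR data type, while embedding is a FLOAT_VECTOR field holding embeddings of dimension emb_dim. Milvus supports several other data types as well.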

 # Create a schema for the collection
 idx = FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, max_length=16)
 categories = FieldSchema(name="categories", dtype=DataType.VARCHAR, max_length=200)
 title = FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=4096)
 abstract = FieldSchema(name="abstract", dtype=DataType.VARCHAR, max_length=65535)
 unix_time = FieldSchema(name="unix_time", dtype=DataType.INT64)
 embedding = FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=emb_dim)
 
 # Fields in the collection
 fields = [idx, categories, title, abstract, unix_time, embedding]
 schema = CollectionSchema(
     fields=fields, description="Semantic Similarity of Scientific Papers"
 )
 
 # Create a collection with the schema
 collection = Collection(
     name=collection_name, schema=schema, using="default", shards_num=10
 )

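Once the collection is created, text and vectors can be inserted into it.

Step 3: Iterate over the Dask partitions, embed the text with SPECTER, and insert the results into Milvus

To run semantic similarity search, the text in the Dask DataFrame must be converted into embedding vectors, so we first need to generate embeddings for the text. This post uses an SBERT bi-encoder model named SPECTER.

SPECTER: Scientific Paper Embeddings using Citation-informed TransformERs.

Simply put, SPECTER is a model trained specifically on scientific paper data, so it outperforms SciBERT on topic classification, citation prediction, and scientific paper recommendation. That is why we chose it.

With the Sentence Transformers library, using the pre-trained SPECTER model is simple. A single line of code downloads the pre-trained model. We also write a small helper function that converts the entire text column of a Dask DataFrame partition into embeddings.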

 from sentence_transformers import SentenceTransformer
 from tqdm import tqdm
 
 # Scientific Papers SBERT Model
 model = SentenceTransformer('allenai-specter')
 
 def emb_gen(partition):
     # Pass a plain list of strings so encode() does not depend on the Series index
     return model.encode(partition["text"].tolist()).tolist()

We can use the Dask DataFrame .map_partitions() API to apply the embedding function to each partition, and then upload the data to Milvus with collection.insert.

 # Initialize
 collection = Collection(collection_name)
 
 for partition in tqdm(range(papers_df.npartitions)):
     # Get the dask dataframe for the partition
     subset_df = papers_df.get_partition(partition)
 
     # Check if dataframe is empty
     if len(subset_df.index) != 0:
         # Metadata
         data = [
             subset_df[col].values.compute().tolist()
             for col in ["id", "categories", "title", "abstract", "unix_time"]
         ]
 
         # Embeddings
         data += [
             subset_df
             .map_partitions(emb_gen)
             .compute()[0]
         ]
 
         # Insert data
         collection.insert(data)

Note that the columns added to the data variable must be in the same order as the fields variable defined when the schema was created!
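
One way to make the ordering harder to get wrong is to drive the column layout from a single list of field names. The sketch below is a pure-Python illustration that does not call Milvus; the helper name rows_to_columns and the sample rows are hypothetical, and the two-value embeddings stand in for real 768-dimensional vectors.

```python
def rows_to_columns(rows, field_order):
    """Turn a list of row dicts into a list of column lists, ordered like field_order."""
    return [[row[name] for row in rows] for name in field_order]


# Field names in the same order as the collection schema
field_order = ["id", "categories", "title", "abstract", "unix_time", "embedding"]

# Made-up rows; real embeddings would have 768 values
rows = [
    {"id": "a1", "title": "T1", "abstract": "A1", "categories": "cs.CL",
     "unix_time": 1, "embedding": [0.1, 0.2]},
    {"id": "b2", "title": "T2", "abstract": "A2", "categories": "cs.LG",
     "unix_time": 2, "embedding": [0.3, 0.4]},
]

# The key order inside each row dict no longer matters
data = rows_to_columns(rows, field_order)
print([column[0] for column in data])
```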

Step 4: Create an approximate nearest neighbor (ANN) index on the inserted data

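After all the embeddings are inserted into the Milvus vector database, we still need to create an ANN index to speed up search. This example uses an HNSW index, one of the fastest and most accurate ANN indexes. For more on the HNSW index and its parameters, see the Milvus documentation.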
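
For intuition about what an ANN index approximates, here is a minimal sketch of exact (brute-force) nearest-neighbor search under the L2 metric. It compares the query with every stored vector, so the cost grows linearly with the size of the collection; an index such as HNSW avoids that full scan at the price of occasionally missing a true neighbor. The function names and the tiny 2-D vectors are made up, and this is not how Milvus computes results internally.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_k(query, vectors, k):
    """Exact nearest-neighbor search: score every vector, then keep the k closest."""
    scored = [(l2_distance(query, vec), idx) for idx, vec in enumerate(vectors)]
    return sorted(scored)[:k]


# Tiny made-up 2-D "embeddings"
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
query = [1.0, 1.0]

# Print the index and distance of the two closest vectors
for distance, idx in exact_top_k(query, vectors, k=2):
    print(idx, round(distance, 3))
```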

 # Add an ANN index to the collection
 index_params = {
     "metric_type": "L2",
     "index_type": "HNSW",
     "params": {"efConstruction": 128, "M": 8},
 }
 
 collection.create_index(field_name="embedding", index_params=index_params)

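Step 5: Run a vector similarity search query!

With the steps above complete, we can query the data in the Milvus collection. First, load the collection: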

 collection = Collection(collection_name)
 collection.load()

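Next, I write a simple helper function that takes query_text, converts it to a SPECTER embedding, runs an ANN search against the Milvus collection, and prints the results. Several search_params control the quality and speed of the search; see the Milvus documentation.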

 def query_and_display(query_text, collection, num_results=10):
     # Embed the Query Text
     query_emb = [model.encode(query_text)]
 
     # Search Params
     search_params = {"metric_type": "L2", "params": {"ef": 128}}
 
     # Search
     query_start = datetime.now()
     results = collection.search(
         data=query_emb,
         anns_field="embedding",
         param=search_params,
         limit=num_results,
         expr=None,
         output_fields=["title", "abstract"],
     )
     query_end = datetime.now()
 
     # Print Results
     print(f"Query Speed: {(query_end - query_start).total_seconds():.2f} s")
     print("Results:")
     for res in results[0]:
         title = res.entity.get("title").replace("\n ", "")
         print(f"➡️ ID: {res.id}. L2 Distance: {res.distance:.2f}")
         print(f"Title: {title}")
         print(f"Abstract: {res.entity.get('abstract')}")
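
The helper above times a single call with datetime.now(). A single wall-clock measurement can be noisy, so repeating the call and using time.perf_counter may give a steadier picture of latency. The sketch below is one way to do that; measure_latency_ms and fake_search are hypothetical names, and the stand-in workload does not call Milvus.

```python
import statistics
import time


def measure_latency_ms(func, repeats=20):
    """Call func() repeatedly and return (median, max) latency in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings), max(timings)


# Stand-in workload; in practice this would be a call such as
# lambda: collection.search(...) with the arguments used above
def fake_search():
    return sum(i * i for i in range(10_000))


median_ms, worst_ms = measure_latency_ms(fake_search, repeats=20)
print(f"median: {median_ms:.2f} ms, worst: {worst_ms:.2f} ms")
```

Reporting the median alongside the worst case makes it easier to tell a consistently fast query from one that is fast only on a warm cache.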

Let's see how the results look.

 # Query for papers that are similar to the SimCSE paper
 title = "SimCSE: Simple Contrastive Learning of Sentence Embeddings"
 abstract = """This paper presents SimCSE, a simple contrastive learning framework that greatly advances state-of-the-art sentence embeddings. We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. We find that dropout acts as minimal data augmentation, and removing it leads to a representation collapse. Then, we propose a supervised approach, which incorporates annotated pairs from natural language inference datasets into our contrastive learning framework by using "entailment" pairs as positives and "contradiction" pairs as hard negatives. We evaluate SimCSE on standard semantic textual similarity (STS) tasks, and our unsupervised and supervised models using BERT base achieve an average of 76.3% and 81.6% Spearman's correlation respectively, a 4.2% and 2.2% improvement compared to the previous best results. We also show -- both theoretically and empirically -- that the contrastive learning objective regularizes pre-trained embeddings' anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available."""
 
 query_text = f"{title}[SEP]{abstract}"
 query_and_display(query_text, collection, num_results=10)

When no more queries are needed, release the collection to free the machine's memory.

 collection.release()

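This works well when running on a single machine, but avoid it when serving online traffic, because every load has to read the data from disk, which is slow.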

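Summary

In this post, we implemented a scalable semantic search service for scientific papers in a few simple steps, using SPECTER embeddings and the Milvus vector database. The approach can scale to hundreds of millions or even billions of vectors in production. Milvus returned the top 10 results within 30 ms, which is quite good for our purposes.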

https://avoid.overfit.cn/post/36ad9ebf46ad43f78b84595e793e1a34

Author: Marie Stephen Leo
