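By chance, I helped my team's data specialists build a transitional wrapper for moving from Milvus 1.x to 2.x. With this wrapper, teams still on the old version could migrate to the new Milvus at minimal cost: once it was finished, they only had to change the name they import. Along the way I learned new concepts about data and storage, and came to appreciate the ingenuity that has gone into storage systems.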

What is Milvus

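Milvus is a vector database: it stores the large numbers of embedding vectors generated by deep neural networks and other machine learning (ML) models.
An embedding vector is a feature abstraction of unstructured data such as emails, IoT sensor data, Instagram photos, or protein structures. Mathematically, an embedding vector is an array of floating-point or binary numbers.
Unstructured data, including images, video, audio, and natural language, is information that does not follow a predefined model or organization. It accounts for roughly 80% of the world's data and can be converted into vectors using various artificial intelligence (AI) and machine learning (ML) models.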
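To make "an array of floating-point numbers" concrete, here is a minimal sketch of what a similarity search over embeddings boils down to: score every stored vector against the query with the inner product (the metric the wrapper below hardcodes as "IP") and keep the best k. The function names and the three sample embeddings are made up for illustration; a real vector database uses indexes rather than scanning every vector like this.

```python
from typing import List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def top_k_by_inner_product(
    query: Sequence[float],
    vectors: List[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k highest-scoring vectors."""
    scored = [(i, inner_product(query, v)) for i, v in enumerate(vectors)]
    # Highest score first: a larger inner product means "more similar".
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Three made-up 3-dimensional embeddings.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
hits = top_k_by_inner_product([1.0, 0.0, 0.0], embeddings, k=2)
print(hits)
```

The brute-force scan is linear in the number of stored vectors, which is exactly the cost that index types such as IVF_FLAT or HNSW are meant to avoid.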

What Milvus can do

It stores, indexes, and manages vectors. As a database designed specifically for queries over input vectors, it can index vectors at trillion scale.

What I did with Milvus

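I built a transition package that lets 1.x users move to 2.x with almost no code changes, which meant I had to know the full API of both Milvus versions inside out. I learned the differences between Milvus 1 and Milvus 2 from a blog post by the Milvus developers (reading link)
I work in Python, so the reference links given here are for pymilvus.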
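The sketch below shows what "only change the import" looks like from the caller's side. The module name `milvus_wrapper` in the comment is hypothetical, and `FakeMilvus` / `FakeStatus` are in-memory stand-ins, not the wrapper itself: they only mimic the 1.x-style `(status, result)` return convention so the snippet runs without a Milvus server.

```python
# Before (Milvus 1.x client):
#   from milvus import Milvus
# After (hypothetical module name for the wrapper in this post):
#   from milvus_wrapper import Milvus


class FakeStatus:
    """Stand-in for the wrapper's Status class."""

    def __init__(self, code=0, message="Success"):
        self.code = code
        self.message = message

    def OK(self):
        return self.code == 0


class FakeMilvus:
    """In-memory stand-in that mimics 1.x-style (status, result) returns."""

    def __init__(self, host, port):
        self._collections = {}

    def create_collection(self, collection_param):
        self._collections[collection_param["collection_name"]] = {}
        return FakeStatus()

    def insert(self, collection_name, records, ids):
        if collection_name not in self._collections:
            return FakeStatus(1, "collection not found"), []
        self._collections[collection_name].update(zip(ids, records))
        return FakeStatus(), list(ids)

    def count_entities(self, collection_name):
        if collection_name not in self._collections:
            return FakeStatus(1, "collection not found"), None
        return FakeStatus(), len(self._collections[collection_name])


# Caller code in the 1.x style; this part stays the same after the swap.
client = FakeMilvus(host="localhost", port="19530")
create_status = client.create_collection({"collection_name": "demo", "dimension": 2})
insert_status, inserted_ids = client.insert(
    "demo", records=[[0.1, 0.2], [0.3, 0.4]], ids=[1, 2]
)
count_status, count = client.count_entities("demo")
print(create_status.OK(), inserted_ids, count)
```

Because every call returns a status object first, existing error handling written against 1.x can keep checking `status.OK()` unchanged.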

What I learned

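First, I brushed up on Python, which I had not used in a long time, and picked up some new APIs and language conventions.
Second, while learning Milvus I met many new concepts. One important parameter in Milvus is consistency_level, the data consistency level, a concept that comes up often in storage systems. Milvus defines four levels: strong, eventual, bounded (bounded staleness), and session (client-side) consistency. The levels differ in how fresh the data returned by a query is; the links above explain this clearly. Writing the wrapper also taught me more about documentation conventions: the comment for every API is written out according to the convention, covering what the API does, a detailed description of each parameter, and the returned content and its type.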
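The difference between the four levels can be sketched with a toy model. This is an illustration under stated assumptions, not Milvus internals: each level is reduced to the timestamp a read must have caught up to before it answers, and the timestamps, the 5-unit staleness window, and the function names are all made up.

```python
from enum import Enum


class ConsistencyLevel(Enum):
    STRONG = "Strong"
    BOUNDED = "Bounded"
    SESSION = "Session"
    EVENTUALLY = "Eventually"


def guarantee_timestamp(level, now, last_write_ts, staleness):
    """Toy model: the timestamp a read must have caught up to."""
    if level is ConsistencyLevel.STRONG:
        return now                       # must see everything written so far
    if level is ConsistencyLevel.BOUNDED:
        return max(now - staleness, 0)   # may lag by at most `staleness`
    if level is ConsistencyLevel.SESSION:
        return last_write_ts             # must see this client's own writes
    return 0                             # EVENTUALLY: answer with whatever is there


def visible_ids(rows, synced_up_to, level, now, last_write_ts=0, staleness=5):
    """rows is a list of (id, write_ts); synced_up_to is how far reads have caught up."""
    needed = guarantee_timestamp(level, now, last_write_ts, staleness)
    # A real system would wait until it has caught up; the toy just assumes it did.
    horizon = max(synced_up_to, needed)
    return [row_id for row_id, write_ts in rows if write_ts <= horizon]


# Row 2 was written by this client at t=96; row 3 by someone else at t=100.
rows = [(1, 10), (2, 96), (3, 100)]
results = {
    level.value: visible_ids(rows, synced_up_to=90, level=level, now=100, last_write_ts=96)
    for level in ConsistencyLevel
}
print(results)
```

In this model the stricter levels return fresher data but imply more waiting, which is the trade-off the `consistency_level` argument exposes in `create_collection` and `search`.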
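Finally, when I ran performance tests after finishing the wrapper, I also learned how to test this kind of functionality. Since this was my first contact with this kind of database, I had many knowledge gaps and had to learn while building. The process was somewhat painful, but I got a lot out of it. The wrapper's code is attached below.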

from pymilvus import connections, CollectionSchema, FieldSchema, DataType, Collection, utility
from enum import IntEnum

'''
_ID_FIELD_NAME is the id field name in the schema when creating a collection; it is also used in query expressions
'''
_ID_FIELD_NAME = 'id'

'''
  _VECTOR_FIELD_NAME used for vector field schema name when create collection
  & for query
  & for search
  & for create index
'''
_VECTOR_FIELD_NAME = 'embedding'  # embedding is fixed and cannot be changed in 2.x

# Vector parameters
_INDEX_FILE_SIZE = 512  # max file size of stored index, this is a cluster-wide configuration param and 512 is the suggested value

'''
In Milvus 2.x, the metric_type is required in create_index and search but is unnecessary in Milvus 1.x.
So, defined here as constant.
If you want to use other types, please try to use Milvus 2.x API.
'''
_METRIC_TYPE = "IP"

_COLLECTION_NAME = "collection_name"
_DIMENSION = "dimension"

_ALIAS = 'default'

# Milvus 2.x has no Status class, but code written for Milvus 1.x requires it
class Status:
  """
  :attribute code: int (optional) default as ok

  :attribute message: str (optional) current status message
  """

  SUCCESS = 0
  UNEXPECTED_ERROR = 1
  CONNECT_FAILED = 2
  PERMISSION_DENIED = 3
  COLLECTION_NOT_EXISTS = 4
  ILLEGAL_ARGUMENT = 5
  ILLEGAL_RANGE = 6
  ILLEGAL_DIMENSION = 7
  ILLEGAL_INDEX_TYPE = 8
  ILLEGAL_COLLECTION_NAME = 9
  ILLEGAL_TOPK = 10
  ILLEGAL_ROWRECORD = 11
  ILLEGAL_VECTOR_ID = 12
  ILLEGAL_SEARCH_RESULT = 13
  FILE_NOT_FOUND = 14
  META_FAILED = 15
  CACHE_FAILED = 16
  CANNOT_CREATE_FOLDER = 17
  CANNOT_CREATE_FILE = 18
  CANNOT_DELETE_FOLDER = 19
  CANNOT_DELETE_FILE = 20
  BUILD_INDEX_ERROR = 21
  ILLEGAL_NLIST = 22
  ILLEGAL_METRIC_TYPE = 23
  OUT_OF_MEMORY = 24

  def __init__(self, code=SUCCESS, message="Success"):
    self.code = code
    self.message = message

  def __repr__(self):
    attr_list = ['%s=%r' % (key, value)
                  for key, value in self.__dict__.items()]
    return '%s(%s)' % (self.__class__.__name__, ', '.join(attr_list))

  def __eq__(self, other):
    """
    Make Status comparable with self by code
    """
    if isinstance(other, int):
      return self.code == other

    return isinstance(other, self.__class__) and self.code == other.code

  def __ne__(self, other):
    # Delegate to __eq__; `self != other` here would call __ne__ again and recurse forever
    return not self.__eq__(other)

  def OK(self):
    return self.code == Status.SUCCESS


class CollectionSchemaInfo:
  def __init__(self, collection_name, dimension, index_file_size, metric_type):
    self.collection_name = collection_name
    self.dimension = dimension
    self.index_file_size = index_file_size
    self.metric_type = metric_type

# Milvus 2.x has no MetricType class, but code written for Milvus 1.x requires it
class MetricType(IntEnum):
  INVALID = 0
  L2 = 1
  IP = 2
  # Only supported for byte vectors
  HAMMING = 3
  JACCARD = 4
  TANIMOTO = 5
  #
  SUBSTRUCTURE = 6
  SUPERSTRUCTURE = 7

  def __repr__(self):
    return "<{}: {}>".format(self.__class__.__name__, self._name_)

  def __str__(self):
    return self._name_


class IndexType:
  INVALID = 'INVALID'
  FLAT = 'FLAT'
  IVF_FLAT = 'IVF_FLAT'
  IVF_SQ8 = 'IVF_SQ8'
  RNSG = 'RNSG'
  IVF_SQ8H = 'IVF_SQ8H'
  IVF_PQ = 'IVF_PQ'
  HNSW = 'HNSW'
  ANNOY = 'ANNOY'
  IVFLAT = 'IVF_FLAT'
  IVF_SQ8_H = 'IVF_SQ8H'


class Milvus:
  def __init__(self, host, port, user='', password=''):
    connections.connect(
      alias=_ALIAS,
      host=host,
      port=port,
      user=user,
      password=password,
    )
    self.registered_collections = {}
    self.visited_collections = {}
    self.__init_collections()

  def __del__(self):
    self.disconnect()

  def __init_collections(self):
    '''
    Loads all the existing collections once during initialization, which is required before search
    '''
    status, collections = self.list_collections()
    for item in collections:
      current_collection = Collection(item)
      current_collection.load()
      self.registered_collections[item] = current_collection
      self.visited_collections[item] = current_collection

  def create_collection(self, collection_param, consistency_level='Bounded', description=''):
    '''
    Creates a collection.
    Args: 
      collection_param[required]: dict
        collection_param is same as param in 1.x, includes collection_name, dimension, index_file_size, metric_type.
      consistency_level[optional]: string
        Set consistency level to get data visibility.
      description[optional]: string
        Description of the CollectionSchema.
    Returns: A new collection object created with the specified schema or an existing collection object by name.
    '''
    collection_name = collection_param[_COLLECTION_NAME]
    dimension = collection_param[_DIMENSION]
 
    id_field = FieldSchema(
      name=_ID_FIELD_NAME, 
      dtype=DataType.INT64, 
      is_primary=True, 
    )
    vector_field = FieldSchema(
      name=_VECTOR_FIELD_NAME, 
      dtype=DataType.FLOAT_VECTOR, 
      dim=dimension
    )
    schema = CollectionSchema(
      fields=[id_field, vector_field], 
      description=description
    )

    try:
      Collection(
        name=collection_name, 
        schema=schema,
        consistency_level=consistency_level
      )
      return Status(code=0, message='Create collection successfully!')
    except Exception as e: 
      return Status(code=1, message=e)


  def has_collection(self, collection_name):
    '''
    Checks whether a collection exists.
    Args: 
      collection_name[required]: string
    Returns: 
      The operation status and the flag indicating if collection exists. Succeed if Status.OK() is True. If status is not OK, the flag is always False.
    '''
    try:
      # Avoid shadowing the built-in `bool`
      exists = utility.has_collection(collection_name)
      return (Status(code=0, message='Success'), exists)
    except Exception as e: 
      return (Status(code=1, message=e), False)


  def list_collections(self, timeout=30):
    '''
    Returns collection list.
    Args: 
      timeout[optional]: float
         An optional duration of time in seconds to allow for the RPC. When timeout is set to None, client waits until server responses or error occurs.
    Returns: 
      The operation status and collection name list. Succeed if Status.OK() is True. If status is not OK, the returned name list is always [].
    '''
    try:
      collections = utility.list_collections(timeout=timeout)
      return (Status(code=0, message='Show collections successfully!'), collections)
    except Exception as e: 
      return (Status(code=1, message=e), [])

  def get_collection_info(self, collection_name):
    '''
    Returns information about a collection, including dimension, index_file_size and metric_type.
    In Milvus 1.x, metric_type is passed in when creating a collection; there is no way to get it back in Milvus 2.x, so it is hardcoded here according to the use case.
    Args: 
      collection_name[required]: string
    Returns: 
      The operation status and collection information. Succeed if Status.OK() is True. If status is not OK, the returned information is always None.
    Return type:
      Status, CollectionSchemaInfo
    '''
    try:
      if (collection_name not in self.visited_collections):
        self.visited_collections[collection_name] = Collection(name=collection_name)
      current_collection = self.visited_collections[collection_name]
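      fields = current_collection.schema.fields
      dimension = None  # stays None if no vector field is found
      for field in fields:
        if field.name == _VECTOR_FIELD_NAME:
          # FieldSchema.params is a dict of type params in pymilvus 2.x
          dimension = field.params.get('dim')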
      index_file_size = _INDEX_FILE_SIZE
      metric_type = MetricType.IP

      return (Status(code=0, message='Success'), CollectionSchemaInfo(collection_name, dimension, index_file_size, metric_type))
    except Exception as e: 
      return (Status(code=1, message=e), None)


  def drop_collection(self, collection_name, timeout=30):
    '''
    Deletes a collection by name.
    Args: 
      collection_name[required]: string
      timeout[optional]: float
         An optional duration of time in seconds to allow for the RPC. When timeout is set to None, client waits until server responses or error occurs.
    Returns:
      The operation status. Succeed if Status.OK() is True.
    '''
    try:
      utility.drop_collection(collection_name, timeout=timeout)
      # pop() avoids a KeyError when the collection was never cached
      self.registered_collections.pop(collection_name, None)
      self.visited_collections.pop(collection_name, None)
      return Status(code=0, message='Delete collection successfully!')
    except Exception as e: 
      return Status(code=1, message=e)

  def get_entity_by_id(self, collection_name, ids, timeout=None):
    '''
    Returns raw vectors according to ids.
    Args: 
      collection_name[required]: string
      ids[required]: list
      timeout[optional]: float
        An optional duration of time in seconds to allow for the RPC. When timeout is set to None, client waits until server responses or error occurs.
    Returns:
      The operation status and entities. Succeed if Status.OK() is True. If status is not OK, the returned entities is always [].
    '''
    try:
      if (collection_name not in self.registered_collections):
        new_collection = Collection(name=collection_name)      # Get an existing collection.
        new_collection.load()
        self.registered_collections[collection_name] = new_collection
        self.visited_collections[collection_name] = new_collection

      current_collection = self.registered_collections[collection_name]
      field_name = _ID_FIELD_NAME
      expr = f"{field_name} in {ids}"
      res = current_collection.query(expr, output_fields=[_VECTOR_FIELD_NAME], timeout=timeout)
      result = []
      for entity in res:
        result.append(entity[_VECTOR_FIELD_NAME])
      return (Status(code=0, message='Success'), result)
    except Exception as e: 
      return (Status(code=1, message=e), [])

  def delete_entity_by_id(self, collection_name, ids, timeout=None):
    '''
    Deletes vectors in a collection by vector ID.
    Args: 
      collection_name[required]: string
      ids[required]: list
      timeout[optional]: float
        An optional duration of time in seconds to allow for the RPC. When timeout is set to None, client waits until server responses or error occurs.
    Returns:
      The operation status. If a specified ID doesn't exist, the Milvus server skips it and tries to delete the next entity; this is still regarded as one successful operation. Succeed if Status.OK() is True.
    '''
    try:
      if (collection_name not in self.visited_collections):
        self.visited_collections[collection_name] = Collection(name=collection_name)
      current_collection = self.visited_collections[collection_name]
      field_name = _ID_FIELD_NAME
      expr = f"{field_name} in {ids}"
      current_collection.delete(expr, timeout=timeout)
      return Status(code=0, message='Success')
    except Exception as e: 
      return Status(code=1, message=e)

  def count_entities(self, collection_name):
    '''
    Returns the number of vectors in a collection.
    Args: 
      collection_name[required]: string
    Returns:
      The operation status and row count. Succeed if Status.OK() is True. If status is not OK, the returned count is always None.
    '''
    # count after flushed
    try:
      if (collection_name not in self.visited_collections):
        self.visited_collections[collection_name] = Collection(name=collection_name)
      current_collection = self.visited_collections[collection_name]
      count = current_collection.num_entities
      return (Status(code=0, message='Success'), count)
    except Exception as e: 
      return (Status(code=1, message=e), None)

  def create_index(self, collection_name, index_type=IndexType.FLAT, index_param={}, timeout=None):
    '''
    Creates index for a collection.
    metric_type is not passed to this call in Milvus 1.x, so the module-level constant is used here.
    Args: 
      collection_name[required]: string
      index_type[required]: string
        FLAT by default. Tip: In 2.x, if by default the has_index() will return false
      index_param[required]: dict
      timeout[optional]: float
        An optional duration of time in seconds to allow for the RPC. When timeout is set to None, client waits until server responses or error occurs.
    Returns:
      The operation status. Succeed if Status.OK() is True.
    '''
    try:
      index_params = {
        "metric_type": _METRIC_TYPE,
        "index_type": index_type,
        "params": index_param
      }
      if (collection_name not in self.visited_collections):
        self.visited_collections[collection_name] = Collection(name=collection_name)
      current_collection = self.visited_collections[collection_name]
      current_collection.create_index(
        field_name=_VECTOR_FIELD_NAME,
        index_params=index_params,
        timeout=timeout
      )
      return Status(code=0, message='Success')
    except Exception as e: 
      return Status(code=1, message=e)

  def drop_index(self, collection_name, timeout=30):
    '''
    Removes an index.
    Args: 
      collection_name[required]: string
      timeout[optional]: float
        An optional duration of time in seconds to allow for the RPC. When timeout is set to None, client waits until server responses or error occurs.
    Returns:
      The operation status. Succeed if Status.OK() is True.
    '''
    try:
      if (collection_name not in self.visited_collections):
        self.visited_collections[collection_name] = Collection(name=collection_name)
      current_collection = self.visited_collections[collection_name]
      current_collection.drop_index(timeout=timeout)
      return Status(code=0, message='Success')
    except Exception as e: 
      return Status(code=1, message=e)

  def flush(self):
    '''
    flush is not exposed in 2.x; search visibility is controlled by setting the consistency level instead.
    Kept as a no-op so that 1.x callers keep working.
    '''
    return Status(code=0, message='Success')


  def insert(self, collection_name, records, ids, timeout=None):
    '''
    Inserts vectors to a collection.
    Args: 
      collection_name[required]: string
      records[required]: list[list[float]]
         List of vectors to insert.
      ids[required]: list[int]
      timeout[optional]: float
        An optional duration of time in seconds to allow for the RPC. When timeout is set to None, client waits until server responses or error occurs.
    Returns:
      The operation status and IDs of inserted entities. Succeed if Status.OK() is True. If status is not OK, the returned IDs is always [].
    '''
    try:
      data = [ids, records] 
      if (collection_name not in self.visited_collections):
        self.visited_collections[collection_name] = Collection(name=collection_name)
      current_collection = self.visited_collections[collection_name]
      mutation_result = current_collection.insert(data, timeout=timeout)
      return (Status(code=0, message='Success'), mutation_result.primary_keys)
    except Exception as e: 
      return (Status(code=1, message=e), [])

  def search(self, collection_name, top_k, query_records, params=None, consistency_level='Bounded', timeout=None):
    '''
    Search vectors in a collection.
    metric_type is not passed to this call in Milvus 1.x, so the module-level constant is used here.
    Args: 
      collection_name[required]: string
      top_k[required]: int
        Number of most similar vectors to return for each query vector.
      query_records[required]: list[list[float32]]
        Vectors to query
      params[optional]: dict
        Search params; they depend on the index type the collection was built with.
      consistency_level[optional]: string
        The consistency level determines the freshness of the returned data.
      timeout[optional]: float
        An optional duration of time in seconds to allow for the RPC. When timeout is set to None, client waits until server responses or error occurs.
    Returns:
      The operation status and search result.
    '''
    try:
      # Fall back to an empty dict when the caller passes no index-specific params
      search_params = {"metric_type": _METRIC_TYPE, "params": params or {}}

      if (collection_name not in self.registered_collections):
        new_collection = Collection(name=collection_name)      # Get an existing collection.
        new_collection.load()
        self.registered_collections[collection_name] = new_collection
        self.visited_collections[collection_name] = new_collection

      current_collection = self.registered_collections[collection_name]

      results = current_collection.search(
        data=query_records, 
        anns_field=_VECTOR_FIELD_NAME, 
        param=search_params, 
        limit=top_k,
        timeout=timeout,
        consistency_level=consistency_level,
      )
      # 1.x TopKQueryResult has property [id_array] & [distance_array]
      id_array = []
      distance_array = []
      for res in results:
        id_array.append(res.ids)
        distance_array.append(res.distances)
      
      results.id_array = id_array
      results.distance_array = distance_array

      return (Status(code=0, message='Search vectors successfully!'), results)
    except Exception as e: 
      return (Status(code=1, message=e), None)


  def load_collection(self, collection_name, timeout=None):
    '''
    Loads a collection for caching.
    Args: 
      collection_name[required]: string
      timeout[optional]: float
        An optional duration of time in seconds to allow for the RPC. When timeout is set to None, client waits until server responses or error occurs.
    Returns:
      The operation status. Succeed if Status.OK() is True
    '''
    try:
      if (collection_name not in self.visited_collections):
        self.visited_collections[collection_name] = Collection(name=collection_name)
      current_collection = self.visited_collections[collection_name]
      current_collection.load(partition_names=None, timeout=timeout)
      return Status(code=0, message='Success')
    except Exception as e: 
      return Status(code=1, message=e)

  def index_building_progress(self, collection_name, index_name='', using=_ALIAS):
    '''
    Get the progress of index building.
    Args: 
      collection_name[required]: string
      index_name[optional]: string
        Name of the index to build. Default index will be checked if it is left blank.
      using[optional]: string
        Milvus Connection used to build the index
    Returns:
      A dict contains the number of the indexed entities and the total entity number.
    '''
    try:
      progress = utility.index_building_progress(collection_name, index_name=index_name, using=using)
      return progress
    except Exception as e: 
      return Status(code=1, message=e)

  def create_credential(self, user, password, using=_ALIAS):
    '''
    Create an authenticated user with username and password.
    Args:
      user[required]: string
        Username must not be empty, and must not exceed 32 characters in length. It must start with a letter, and only contains underscores, letters, or numbers.
      password[required]: string
        Password must have at least 6 characters and must not exceed 256 characters in length.
      using[optional]: string
    Returns: 
      The operation status. Succeed if Status.OK() is True.
    '''
    try:
      utility.create_user(user, password, using=using)
      return Status(code=0, message='Success')
    except Exception as e: 
      return Status(code=1, message=e)

  def reset_password(self, user, old_password, new_password, using=_ALIAS):
    '''
    Change the password for an existing user.
    Args:
      user[required]: string
        Credential name or username.
      old_password[required]: string
      new_password[required]: string
      using[optional]: string
    Returns:
      The operation status. Succeed if Status.OK() is True.
    '''
    try:
      utility.reset_password(user, old_password, new_password, using=using)
      return Status(code=0, message='Success')
    except Exception as e: 
      return Status(code=1, message=e)

  def delete_credential(self, user, using=_ALIAS ):
    '''
    Delete an authenticated user.
    Args:
      user[required]: string
        Credential name or username.
      using[optional]: string
    Returns:
      The operation status. Succeed if Status.OK() is True.
    Tip: Re-authentication is required after deleting a user.
    '''
    try:
      utility.delete_user(user, using=using)
      return Status(code=0, message='Success')
    except Exception as e: 
      return Status(code=1, message=e)

  def list_users(self, include_role_info=False, using=_ALIAS):
    '''
    List all the credential users.
    Args:
      using[optional]: string
    Returns:
      The operation status and credential users list. Succeed if Status.OK() is True. If status is not OK, the returned list is always [].
    '''
    try:
      users = utility.list_users(include_role_info, using=using)
      return (Status(code=0, message='Get credential users successfully!'), users)
    except Exception as e: 
      return (Status(code=1, message=e), [])

  def disconnect(self, using=_ALIAS):
    '''
    This method disconnects the client from the specified Milvus connection.
    Args:
      using[optional]: string
    Returns:
      The operation status. Succeed if Status.OK() is True.
    '''
    try:
      connections.disconnect(using)
      return Status(code=0, message='Success')
    except Exception as e: 
      return Status(code=1, message=e)
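
Almost every method above repeats the same shape: try the 2.x call, return a success Status with the result, or catch the exception and return an error Status with a fallback value. One possible way to factor that out is a decorator. This is only a sketch of the idea, not how the wrapper is written: `returns_status`, `default_factory`, and `lookup` are hypothetical names, and the `Status` class here is a cut-down stand-in for the one above.

```python
import functools


class Status:
    """Minimal stand-in for the wrapper's Status class."""

    SUCCESS = 0
    UNEXPECTED_ERROR = 1

    def __init__(self, code=SUCCESS, message="Success"):
        self.code = code
        self.message = message

    def OK(self):
        return self.code == Status.SUCCESS


def returns_status(default_factory=lambda: None):
    """Wrap a function so it returns (Status, value) instead of raising."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return Status(), func(*args, **kwargs)
            except Exception as exc:
                # str(exc) keeps Status.message a plain string
                return Status(Status.UNEXPECTED_ERROR, str(exc)), default_factory()
        return wrapper
    return decorator


@returns_status(default_factory=list)
def lookup(store, keys):
    """Toy operation standing in for a 2.x call that may raise."""
    return [store[key] for key in keys]


ok_status, values = lookup({"a": 1, "b": 2}, ["a", "b"])
bad_status, fallback = lookup({"a": 1}, ["missing"])
print(ok_status.OK(), values)
print(bad_status.OK(), bad_status.message, fallback)
```

Using a factory for the fallback (rather than a shared `[]`) means each failed call gets its own empty list, so one caller mutating the result cannot affect another.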

superMin