
Starting from real risk control scenarios at Yipay, this article describes how the image review product "Detective Map" detects false business licenses, in-store and storefront photos, and similar template photos. A distinguishing feature of "Detective Map" is its use of Milvus for vector retrieval. When querying a library of tens of millions of vectors, a single vector query takes no more than 1 second, and the average time per query in a batch is no more than 0.08 seconds, which greatly reduces development costs and significantly improves retrieval performance.

Background of the project

In recent years, with the rise of e-commerce and the spread of online payment, black-market operations that exploit Internet loopholes and use technical means to "pull wool" (harvest promotional benefits at scale) have grown rapidly. These "wool parties" are remarkably effective and now operate as an organized industry. They routinely use technical means to synthesize fake photos that pass authentication, and they register large numbers of fake accounts on platforms for illegal profit. A judgment published by China Judgment Documents Network [3] shows that in December 2021 a group purchased users' personal information through illegal channels, registered on a bank's mobile app to fraudulently obtain points, exchanged the points for gifts, and resold the gifts for profit. In less than 3 months, the group took tens of thousands of video membership cards, more than 700 Starbucks drink coupons, and many other gifts from the bank. There are countless similar cases. Once such fake accounts are registered, they not only take benefits intended for ordinary consumers but also cause irreparable losses to the platform.

The black market relies on technology to expand and spread rapidly. If a platform loophole is not fixed in time, it can turn into a sinkhole within minutes. Faced with huge and rapidly changing data volumes, traditional risk control methods can no longer detect risks effectively. For this reason, and driven by the needs of various application scenarios, Yipay combined deep learning and digital image processing to create "Detective Map", an image review product with deep learning at its core. "Detective Map" covers many image review scenarios, and one of the most important is the identification and detection of false business licenses, in-store and storefront photos, and similar template photos (as shown in the figure below).


(False business license template)

Traditional algorithms for comparing similar images include PSNR [1] and ORB [2]. These algorithms are both slow and inaccurate. They are generally suitable only for offline tasks and cannot be applied to large-scale real-time workloads. Deep learning can process large-scale image data in real time and is the most suitable method for similar image comparison. A deep learning model transforms image data into a massive number of feature vectors, and we use the Milvus vector search engine to handle this unstructured data. Milvus can index vector data at the trillion scale, so the similar risk template photo detection system built on it can efficiently retrieve target risk template photos from tens of millions of images.

"Detection" product introduction

Working together with the Milvus community, the Yipay R&D team used a deep learning model to convert massive image data into feature vectors, imported them into Milvus, and built a similar risk template photo detection system as part of "Detective Map", Yipay's overall visual risk control product.

"Detective Map" is a multimedia visual risk control product developed in-house by Yipay. It offers industry-leading capabilities such as a complete face recognition solution, license identification, and clustering of intermediaries by image background. It deeply integrates machine learning, neural network image recognition, and other technologies. With its built-in algorithm models, the product can accurately identify fraud risks and intermediary gang risks in user authentication, respond within milliseconds, and block risks in advance. At the deployment level, it relies on the big data platform to break through underlying data barriers across domains, meets the needs of information identification and invocation in high-concurrency business scenarios, and supports horizontal expansion of business functions. With its leading technology and original solutions, "Detective Map" has had 5 patent applications approved and has obtained 2 software copyrights. It is also already in use at many banks and financial institutions, helping them detect risks in advance.

System flow

Yipay currently has more than 10 million merchant ID photos, and the data volume is still growing exponentially as the business develops. To quickly retrieve the most similar and potentially risky template photos from such a large image library, "Detective Map" uses Milvus as its feature vector similarity engine. The general structure of the similar risk template photo detection system is shown in the figure below.

(Structure diagram of the similar risk template photo detection system)

The business process of the system can be roughly divided into four steps:

  1. Image preprocessing. Apply preprocessing operations such as denoising and contrast enhancement to the input image. Preprocessing preserves the original information while removing useless information from the image signal.
  2. Feature vector extraction. Extract feature vectors from the images using a specially trained deep learning model. Converting images to vectors and then running a similarity search on them is standard practice.
  3. Normalization. Normalize the extracted feature vectors, which improves the efficiency of subsequent processing.
  4. Milvus retrieval. Insert the normalized feature vectors into the Milvus database for vector similarity retrieval.
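
Steps 3 and 4 can be illustrated with a minimal Python sketch that uses only the standard library. It is not the production code: the three-dimensional vectors, the template names, and the helper functions l2_normalize, inner_product, and top_k are hypothetical, and a brute-force scan stands in for Milvus. The sketch shows why normalization matters for retrieval: once vectors have unit length, their inner product equals their cosine similarity.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (step 3, normalization)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def inner_product(a, b):
    """Inner product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, library, k=1):
    """Brute-force stand-in for step 4: rank library vectors by inner product."""
    q = l2_normalize(query)
    scored = [(inner_product(q, vec), name) for name, vec in library.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(name, score) for score, name in scored[:k]]


# Hypothetical 3-dimensional "features"; real model outputs are much larger.
raw_library = {
    "template_a": [1.0, 2.0, 2.0],
    "template_b": [-2.0, 1.0, 0.0],
}
library = {name: l2_normalize(vec) for name, vec in raw_library.items()}

# A query that points in almost the same direction as template_a.
matches = top_k([2.0, 4.0, 4.1], library, k=1)
print(matches)
```

Because both the stored vectors and the query are normalized, the scores fall in the range from -1 to 1, which makes it easier to choose a similarity threshold for flagging a photo as a likely template match.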

Deployment plan

Next, we briefly describe how the similar risk template photo detection system of "Detective Map" is deployed.

(Milvus system architecture diagram)

The figure above shows the system architecture of Milvus. We deploy the Milvus cluster on Kubernetes so that the system is highly available and benefits from the elasticity of cloud services.
The rough steps are:

  1. To view the available resources, run the command kubectl describe nodes, which shows the resources that the Kubernetes cluster can allocate to the created instances.
  2. To configure resources, run the command kubectl apply -f xxx.yaml; memory and CPU resources for the Milvus cluster components are allocated through Helm.
  3. To apply the new configuration, run the following command.
    helm upgrade my-release milvus/milvus --reuse-values -f resources.yaml
  4. The new configuration then takes effect in the Milvus cluster.

A cluster deployed in this way is easy to scale as business needs change, and it better meets the project's requirements for high-performance retrieval over massive vector data.

For different business scenarios, we can adjust the system parameters of Milvus so that different types of data get good query performance. Here are two examples.

When building the vector index, we chose the following index parameters based on the actual application scenario:

 index = {"index_type": "IVF_PQ", "params": {"nlist": 2048}, "metric_type": "IP"}

IVF_PQ builds on IVF_FLAT and adds product quantization (PQ), a lossy compression algorithm for vector data. It offers high-speed queries with a very small memory footprint, which fits the application scenario of "Detective Map".
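
To get a feel for how much memory product quantization can save, the following back-of-the-envelope sketch compares raw float32 storage with PQ codes. The library size of 10 million vectors comes from the figures in this article. The dimension of 512, the 64 sub-quantizers (m), and the 8-bit codes (nbits) are assumptions chosen for illustration, not the values used in production. The estimate ignores codebooks, IVF centroids, and IDs.

```python
def raw_bytes(num_vectors, dim, bytes_per_float=4):
    """Memory needed to store uncompressed float32 vectors."""
    return num_vectors * dim * bytes_per_float


def pq_code_bytes(num_vectors, m, nbits=8):
    """Memory needed for PQ codes: m codes of nbits bits per vector."""
    return num_vectors * m * nbits // 8


NUM_VECTORS = 10_000_000  # order of magnitude stated in the article
DIM = 512                 # assumed feature dimension
M = 64                    # assumed number of PQ sub-quantizers

raw = raw_bytes(NUM_VECTORS, DIM)
codes = pq_code_bytes(NUM_VECTORS, M)
ratio = raw / codes

print(f"raw vectors: {raw / 1e9:.2f} GB")
print(f"PQ codes:    {codes / 1e9:.2f} GB")
print(f"compression: {ratio:.0f}x")
```

Under these assumptions the raw vectors take about 20 GB while the codes take well under 1 GB. The price is the quantization error introduced by the lossy compression.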

For search, the parameters we found optimal are:

 search_params = {"metric_type": "IP", "params": {"nprobe": 32}}

Because the vectors are normalized before storage, we use the inner product (IP) to measure the similarity between two vectors. In our practice, using the inner product (IP) instead of the Euclidean distance (L2) improved accuracy by about 15%.
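
The roles of nlist and nprobe in the settings above can be illustrated with a toy inverted-file (IVF) search in plain Python. This is a simplified sketch, not how Milvus is implemented: the centroids are sampled at random instead of being trained with k-means, there is no product quantization, and the data set size and dimension are made up. Here nlist is the number of buckets the vectors are partitioned into, and nprobe is the number of buckets scanned per query.

```python
import math
import random


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_buckets(vectors, centroids):
    """Assign each vector id to the bucket of its most similar centroid."""
    buckets = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        best = max(range(len(centroids)),
                   key=lambda c: inner_product(vec, centroids[c]))
        buckets[best].append(idx)
    return buckets


def ivf_search(query, vectors, centroids, buckets, nprobe, k):
    """Scan only the nprobe buckets whose centroids best match the query."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: inner_product(query, centroids[c]),
                    reverse=True)
    candidates = [idx for c in ranked[:nprobe] for idx in buckets[c]]
    candidates.sort(key=lambda idx: inner_product(query, vectors[idx]),
                    reverse=True)
    return candidates[:k], len(candidates)


rng = random.Random(0)  # fixed seed so the run is repeatable
DIM, NUM_VECTORS, NLIST = 8, 1000, 16  # toy sizes, far below production scale
vectors = [l2_normalize([rng.gauss(0, 1) for _ in range(DIM)])
           for _ in range(NUM_VECTORS)]
centroids = rng.sample(vectors, NLIST)  # simplification: no k-means training
buckets = build_buckets(vectors, centroids)
query = l2_normalize([rng.gauss(0, 1) for _ in range(DIM)])

exact_ids, exact_scanned = ivf_search(query, vectors, centroids, buckets,
                                      nprobe=NLIST, k=5)
fast_ids, fast_scanned = ivf_search(query, vectors, centroids, buckets,
                                    nprobe=2, k=5)
print("vectors scanned with nprobe=nlist:", exact_scanned)
print("vectors scanned with nprobe=2:", fast_scanned)
```

Raising nprobe scans more buckets, which tends to improve recall at the cost of speed. With nprobe equal to nlist the search becomes exhaustive, so the useful range lies in between and has to be found by testing on real data.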

These two examples show that Milvus parameters can be tested and tuned for different business scenarios and performance requirements.

In addition, Milvus integrates several index libraries and supports different index types and similarity metrics. Milvus officially provides SDKs in multiple languages and a rich set of APIs for operations such as insertion and query, which makes it convenient for our front-end business teams to call the risk control center through the SDK.

Results in production

At present, the similar risk template photo detection system runs stably in real time in production and helps business teams discover risk template photos every day. In 2021, it identified a cumulative total of more than 20,000 false template licenses in advance. In terms of retrieval speed, a single vector query against tens of millions of vectors takes no more than 1 second, and the average time per query in a batch is no more than 0.08 seconds. The retrieval performance of Milvus meets the business requirements for accuracy and concurrency.

Acknowledgments

"Our team has been following and using Milvus since 2020. We have reported the problems we ran into back to the community and have contributed many constructive issues to the Milvus project. During the development of this system in particular, we maintained a positive relationship with the community. Whenever we encountered a problem, the community and the official team helped solve it within 24 hours. We are very grateful to the excellent Milvus R&D team!"

References:

[1] https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio
[2] Aglave P, Kolkure V S. Implementation of High Performance Feature Extraction Method Using Oriented Fast and Rotated Brief Algorithm[J]. Int. J. Res. Eng. Technol, 2015, 4: 394-397.
[3] https://wenshu.court.gov.cn/website/wenshu/181107ANFZ0BXSK4/index.html?docId=e9940c28c79143cf9e61ae08002bbabd

About the authors:

Shi Yan, Senior Algorithm Engineer at Yipay

Tang Minwei, Senior Algorithm Engineer at Yipay

About the editors:

Ye Xiong, Zilliz Community Intern

Zang Peng, Zilliz Community Intern

Company profiles:

Yipay is a "dual pilot" enterprise under the SASAC "Double Hundred" reform and the fourth batch of pilots of the Development and Reform Commission, and it is the only financial technology company among the "dual pilot" enterprises. Yipay is also an important part of China Telecom's financial technology strategy. It owns sub-brands such as Sweet Orange Finance, Orange Installment, Sweet Orange Insurance, Sweet Orange Credit, Sweet Orange Factoring, and Tianyi Digital. Yipay is committed to empowering business innovation with financial technology, and it builds intelligent products, risk control, and services around cutting-edge technologies such as big data, artificial intelligence, and cloud computing.


With a vision to redefine data science, Zilliz is committed to becoming a global leader in open source technology innovation and to unlocking the hidden value of unstructured data for enterprises through open source and cloud-native solutions.

Zilliz built the Milvus vector database to accelerate the development of next-generation data platforms. Milvus is a graduated project of the LF AI & Data Foundation. It can manage large volumes of unstructured data and is widely used in new drug discovery, recommendation systems, chatbots, and other fields.

