This article is compiled from a talk given by Akulaku's anti-fraud team at nMeetup·Shenzhen. For the video on Bilibili, see: https://www.bilibili.com/video/BV1nQ4y1B7Qd

Akulaku's Intelligent Risk Control Practice

In this talk, I will mainly introduce the practice of intelligent risk control with Nebula Graph at Akulaku, divided into the following six parts:

  • Basic concepts and application scenarios of graphs
  • Graph database selection
  • Building the graph database platform
  • Nebula application cases
  • Graph model training and deployment
  • Summary and outlook

[Image: Akulaku's Intelligent Risk Control Practice]

Let's first go over the basic concepts of graphs. A graph is a collection of nodes and edges that describes relationships. The biggest advantage of a graph is that it is intuitive. For example, the figure above shows the graph structure of a desensitized fraud gang. You can see at a glance whether a user has abnormal relationships with other nodes. With a purely row-oriented (relational) database we could not spot the anomaly, but from the graph perspective we can easily find the abnormality in the data.

Next, the application scenarios. In Akulaku's case, graphs are mainly used for relationship mining and visual analysis, and for replacing complex queries with graph queries. Let me explain the latter: your application background may have nothing to do with graphs, but the back-end operations involved have a certain depth. Modeling such queries as graph relationships makes the query statements easier to understand and the operations easier to maintain.

[Image: Akulaku's Intelligent Risk Control Practice]

Next, graph database selection. Let's talk about the pitfalls Akulaku ran into when choosing a graph database. At the beginning we used Neo4j, mainly to compute some relationship features. Neo4j has high query efficiency, but scalability is its weak point: the performance of distributed Neo4j is not much better than that of the stand-alone version. We also tried other graph databases. Our business requirements were:

  • Good scalability
  • Fast data import
  • Good query efficiency

In terms of scalability, Neo4j does not scale well, so we passed on it. Since our graphs are very large (in our financial risk control scenarios a graph can reach one billion nodes and tens of billions of edges), fast data import is required for initialization. Here we also tried Dgraph. We had read its related academic papers beforehand; the papers are very well written, but the engineering implementation is not as good. In particular, for batch import, when the imported data exceeds a certain order of magnitude, problems such as memory leaks appear, so we passed on Dgraph as well.

The last point is query efficiency. Here we will talk about JanusGraph. Its advantage is that the backend can integrate with other storage engines, which is the main reason we tested JanusGraph at the time. But after importing and initializing the data, we found that its query efficiency was very poor.

Here's a look at the scalability and query performance tests done by the Akulaku team on Nebula Graph:

  • Graph scale: 1 billion nodes, 10 billion edges
  • Test method: nebula-bench https://github.com/vesoft-inc/nebula-bench
  • Query statements:

    • Two first-degree queries
    • Two second-degree queries
    • One third-degree query
  • Random source: randomly sampled 5 million registered mobile phone numbers

During the stress test, each query draws a value at random from the random source of phone numbers. The horizontal axis is concurrency, the vertical axis is QPS, and the curves of different colors represent different numbers of cluster nodes. Here we can see that the overall query performance of Nebula Graph is quite good.
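As a rough illustration (not the actual benchmark code), such first/second/third-degree queries could be issued through the Python client as sketched below; the space name, edge type, vertex IDs, and graphd address are all hypothetical, and both the client package and the nGQL syntax depend on the Nebula version in use.

```python
# Minimal sketch of issuing 1-/2-/3-degree queries against Nebula Graph.
# Space name, edge type, phone-number IDs, and the graphd address are
# assumptions; syntax shown is for recent Nebula versions (nebula3-python).
import random

from nebula3.Config import Config
from nebula3.gclient.net import ConnectionPool

config = Config()
config.max_connection_pool_size = 10

pool = ConnectionPool()
pool.init([("127.0.0.1", 9669)], config)

phones = ["13800000001", "13800000002"]  # stand-in for the 5M random source

with pool.session_context("user", "password") as session:
    session.execute("USE risk_graph;")  # hypothetical graph space
    src = random.choice(phones)
    for steps in (1, 2, 3):
        # N-degree expansion over a hypothetical edge type `relation`
        stmt = f'GO {steps} STEPS FROM "{src}" OVER relation YIELD dst(edge) AS dst;'
        result = session.execute(stmt)
        print(steps, result.row_size())

pool.close()
```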

[Image: Akulaku's Intelligent Risk Control Practice]

As can be seen from the figure, scalability peaks at around 12 machines. If more nodes are added beyond that, the distributed overhead starts to exceed the benefit brought by the extra concurrency, so as the number of nodes keeps increasing, query performance drops. The query workload was as described above: randomly select from the 5 million sampled nodes and run multiple batches of queries, each batch containing two first-degree queries, two second-degree queries, and one third-degree query. During the test we also ran into hotspot problems. The figure above shows the final verification result.

During the stress test, the Nebula version used was v1.x; after Nebula v2.x was released, the Akulaku team began trying to upgrade. But right after upgrading to v2.0.1, we found some problems:

  • Leader change

    • When importing data, frequent leader changes occurred, causing write failures
    • "Not found" errors occurred when querying data
  • High CPU load was observed, mainly caused by very large subgraphs

The first problem mainly occurred when importing data: frequent leader changes would happen, which affected the efficiency of online calls. The other observed phenomenon was high CPU load, caused by some very large nodes. So we first rolled back to v1.2. After v2.5.0 was released, the Akulaku team ran another test, similar to the previous stress test.

  • Graph scale: 1 billion nodes, 10 billion edges
  • Machine configuration: 7 machines, 256 GB memory and 32 cores each
  • Test method: nebula-bench https://github.com/vesoft-inc/nebula-bench
  • Query statements:

    • Two first-degree queries
    • Two second-degree queries
    • One third-degree query
  • Random source: randomly sampled 5 million registered mobile phone numbers

In this release, the previously encountered issues of leader change and high CPU load were resolved, so the Akulaku team started applying v2.5.0 to the business.

[Image: Akulaku's Intelligent Risk Control Practice]

The right side of the above figure is the concurrent number, and the left side is the QPS.

[Image: Akulaku's Intelligent Risk Control Practice]

Now let's talk about the graph analytics platform, which is mainly built around two engines: the graph database platform (Nebula Graph) and the real-time graph computation engine. Because Akulaku's main application scenario is anti-fraud, which has strict timeliness requirements, a series of real-time graph algorithms are needed; to develop these algorithms, the graph analytics platform needs a real-time graph computation engine in addition to the graph database. These engines rely on offline scheduling modules, such as the landsat task scheduling platform, the offline data warehouse, the real-time data warehouse, and data and task monitoring. The bottom layer relies on big data clusters, such as the common Spark, Hive, Hadoop, Flink, HBase, and so on.

So this picture is application-platform-infrastructure from top to bottom.

If we look at a graph database platform in isolation:

[Image: Akulaku's Intelligent Risk Control Practice]

To build the graph database platform, the Akulaku team mainly does two things: data import and high availability. Data import is based on the offline data warehouse for batch writing, plus real-time writing based on real-time data sources. For real-time data sources, the graph database storage has two modes:

One is dual-cluster master-slave: online services are provided by two clusters in a master-slave setup, and the real-time data source double-writes to both the master and the slave cluster;

The other is a single-cluster solution, where each application can have a separate instance.
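As a rough sketch of the dual-cluster double-write idea (not Akulaku's actual implementation), a real-time consumer might apply every write to both clusters and flag failures for repair; the hosts, space, schema, and INSERT statement below are all hypothetical.

```python
# Minimal sketch of double-writing a real-time event to a master and a
# slave Nebula cluster. Cluster addresses, space, and the INSERT statement
# are assumptions; real code would add retries and reconciliation.
import logging

from nebula3.Config import Config
from nebula3.gclient.net import ConnectionPool


def make_pool(host: str, port: int) -> ConnectionPool:
    pool = ConnectionPool()
    pool.init([(host, port)], Config())
    return pool


master = make_pool("nebula-master", 9669)  # hypothetical hosts
slave = make_pool("nebula-slave", 9669)


def double_write(stmt: str) -> None:
    """Apply the same write to both clusters; log divergences for repair."""
    for name, pool in (("master", master), ("slave", slave)):
        with pool.session_context("user", "password") as session:
            session.execute("USE risk_graph;")
            resp = session.execute(stmt)
            if not resp.is_succeeded():
                # In practice this would go to a repair queue to keep
                # the two clusters consistent.
                logging.error("write to %s failed: %s", name, resp.error_msg())


# Example: insert an edge between a user and a device (hypothetical schema).
double_write('INSERT EDGE uses_device() VALUES "user_1"->"device_9":();')
```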

To build graph database storage in this way, the platform also needs supporting modules such as monitoring (for both services and data), a sensitive data sandbox, cluster scaling, and a scheduling system.

[Image: Akulaku's Intelligent Risk Control Practice]

From a business perspective, the entire platform is built for exploration and visualization based on graph relationships. The above picture is just a schematic diagram, not a graph actually used in business.

Having covered the application scenarios, let's now talk about the specific application cases of Nebula in Akulaku.

First, visualized fraud case analysis and deep correlation mining. Second, device ID association calculation. The last and most common application is the deployment of various graph models, including label propagation and others. The details are explained below.

[Image: Akulaku's Intelligent Risk Control Practice]

The picture above shows a fraud case analyzed through graph visualization (again a schematic diagram, not real data). Anti-fraud investigators use the graph relationships and the graph database's visualization tools to analyze associations, including operations such as drilling down into the graph and viewing node attributes.

[Image: Akulaku's Intelligent Risk Control Practice]

The second case is device ID association calculation, which relates back to the earlier point about graph concepts: the problem itself may have nothing to do with graphs, but it is more natural to express it as a graph and easier to maintain. Specifically, a device ID needs to be computed from a series of elements, and fraudsters try to bypass anti-fraud policies by constantly changing those elements. In practice, however, they can only change one element at a time, while the other elements still maintain a certain relationship with it. With certain rules, the actual mapping can be recovered from these associations. The query depth here is shallow (a first-degree query), and the main difficulty is not the logic but data consistency. For example, if device ID computations run concurrently, there will be consistency problems: an edge that should not be deleted gets deleted, or data that should not be added gets added. So this process needs to lock the data.

[Image: Akulaku's Intelligent Risk Control Practice]

The specific locking method is to lock the batch of nodes involved in a device ID computation, release them after the calculation is completed, and only then allow other processes to modify the data.
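Below is a minimal sketch of this batch-locking pattern. The article does not say how Akulaku implements the lock, so Redis as the lock store and all key names are assumptions for illustration only.

```python
# Sketch of locking a batch of nodes for a device-ID computation.
# Redis as the lock store and the key names are assumptions.
import time
import uuid

import redis

r = redis.Redis(host="localhost", port=6379)
LOCK_TTL_MS = 30_000  # auto-expire so a crashed worker cannot block others forever


def lock_nodes(node_ids, token):
    """Try to lock every node in the batch; roll back if any is already locked."""
    acquired = []
    for nid in node_ids:
        if r.set(f"lock:device_calc:{nid}", token, nx=True, px=LOCK_TTL_MS):
            acquired.append(nid)
        else:
            unlock_nodes(acquired, token)
            return False
    return True


def unlock_nodes(node_ids, token):
    # Only release locks we own (a production version would do this
    # atomically, e.g. with a Lua script).
    for nid in node_ids:
        key = f"lock:device_calc:{nid}"
        if r.get(key) == token.encode():
            r.delete(key)


def compute_device_id(node_ids):
    token = str(uuid.uuid4())
    while not lock_nodes(node_ids, token):
        time.sleep(0.05)  # another worker holds part of the batch; retry
    try:
        pass  # run the first-degree association queries and update edges here
    finally:
        unlock_nodes(node_ids, token)
```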

[Image: Akulaku's Intelligent Risk Control Practice]

The third major type of application is graph model training and deployment, such as deploying graph models based on subgraph expansion. To explain: the result of such a model is computed on an extracted subgraph, and the subgraph is generally obtained by expanding outward from a central node. What are the concrete models based on subgraph expansion? For example, subgraph features centered on the current node, label propagation results based on the current subgraph, or a graph convolution model. What are the difficulties here?

First, the backtesting logic is complicated. Backtesting mainly refers to data backtracking: depending on the requirements of the scenario, we need to reconstruct the graph relationships as they were when the event occurred, then perform feature extraction and model building, and the logic is relatively complex. In addition, the timeliness requirements for deploying graph models are very high. In an anti-fraud scenario, the model is generally deployed in the credit-granting or ordering flow, which requires low latency. Depending on the characteristics of each training and deployment scenario, we look at it from the following four perspectives:

  1. The timeliness requirements of the business link. For example, the credit-granting link generally has lower timeliness requirements than the ordering link;
  2. Subgraph size. Look at the scale of the subgraph involved in the deployed model. If it is very large, is sampling allowed? And if sampling is not allowed, how do we handle it?
  3. Which is larger: the graph update frequency or the model call volume?
  4. Backtesting complexity: is the amount of data to backtest large or small? (A data-backtracking sketch follows this list.)
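As a rough illustration of the data-backtracking idea (purely an assumption about how it could be done, since the article does not describe Akulaku's implementation), one common approach is to stamp every edge with a creation time and filter by the event time when rebuilding the subgraph. The `created_at` property, the edge type, and the session setup (as in the earlier client sketch) are hypothetical.

```python
# Sketch of time-aware backtesting: rebuild the neighborhood of a node as it
# looked at event time. Assumes a hypothetical edge property `created_at`
# (epoch seconds); nGQL syntax shown is for recent Nebula versions.
def subgraph_as_of(session, vid: str, event_ts: int, steps: int = 2):
    """Return destination vertices reachable within `steps` hops, using only
    edges that already existed when the event occurred."""
    stmt = (
        f'GO {steps} STEPS FROM "{vid}" OVER relation '
        f"WHERE properties(edge).created_at <= {event_ts} "
        "YIELD dst(edge) AS dst;"
    )
    resp = session.execute(stmt)
    return [v.as_string() for v in resp.column_values("dst")]
```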

Here are a few examples.

[Image: Akulaku's Intelligent Risk Control Practice]

The first example is subgraph features in the credit-granting link.

Specifically, in the credit-granting flow we need to compute features of the subgraph that a given uid belongs to, for example the proportions of node types in the N-degree subgraph, or topological features. The first characteristic is that the timeliness requirement of this business link is relatively low. The second is that the subgraph may be large and we may run into exploding nodes, but sampling is not allowed. The third is that the volume of subgraph updates is much larger than the volume of model calls; in other words, the number of credit applications per unit time is far smaller than the frequency of graph updates. The fourth is that the backtesting volume is relatively small.

What is the solution for these four characteristics? Because the update volume is much larger than the call volume, it is best to compute the features directly when the model is called, and the score backtest can be based on the graph database, that is, backtest the scores directly against historical model-call events. Since the model computation queries the graph when the business link is called, we must ensure the availability of the graph database cluster.
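As a hedged sketch of what computing such a feature at call time might look like (the MATCH pattern, tag names, and session setup from the earlier client sketch are illustrative assumptions, not Akulaku's actual feature code):

```python
# Sketch: at model-call time, expand the 2-degree neighborhood of a uid and
# compute the proportion of each node type (tag). Deduplication of repeated
# paths is omitted for brevity.
from collections import Counter


def node_type_proportions(session, uid: str):
    stmt = (
        f'MATCH (v)-[*1..2]-(n) WHERE id(v) == "{uid}" '
        "RETURN labels(n) AS tags;"
    )
    resp = session.execute(stmt)
    counts = Counter()
    for value in resp.column_values("tags"):
        for tag in value.as_list():      # a vertex can carry several tags
            counts[tag.as_string()] += 1
    total = sum(counts.values()) or 1
    return {tag: n / total for tag, n in counts.items()}
```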

[Image: Akulaku's Intelligent Risk Control Practice]

The second example is label propagation in the ordering link.

Specifically, label propagation spreads a label on a node, such as a black/grey label or a specific business attribute label, according to certain rules. So, what are the characteristics of the label propagation scenario?

First, the timeliness requirement of the business link is relatively high, because it is the ordering link: there must not be a large delay, otherwise orders will be blocked. Second, as above, the subgraph may explode; a three-degree subgraph may be on the order of millions of nodes, and it cannot be sampled. Why not? Because label propagation involves business rules, and sampling may affect the stability of the score, so sampling is generally not allowed. Third, the volume of data updates is much smaller than the volume of model calls, that is, the graph update frequency is low but the number of calls is high. So when should the subgraph computation be done? It is performed when the data is updated, so the amount of data that needs to be computed each time is relatively small. The backtesting data volume for label propagation in the ordering link is relatively large: in terms of volume, the number of orders is much larger than the number of credit applications.

Given these characteristics, if the update volume is smaller than the model call volume (the number of calls in the business link), the model results are computed when the graph is updated. The advantage is that model computation is decoupled from the score call: the business link directly fetches the score, the computation does not happen during the actual call, and a certain delay has to be tolerated. This design amounts to two processes: an offline process that corrects scores at T+1, and a real-time process, the real-time graph model deployment shown on the right side of the figure. As soon as the real-time data source is updated, the system updates the label propagation value, and the business link fetches the score through the model result query service instead of querying the graph database directly, which separates querying from computation. You can even upgrade the graph database painlessly without affecting online services.
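Below is a minimal sketch of this compute-on-update pattern: when an update event arrives, a simple label propagation is run on the affected subgraph and the resulting label is written to a result store that the business link queries. The event shape, the propagation rule, and Redis as the result store are all assumptions, not Akulaku's actual implementation.

```python
# Sketch of compute-on-update label propagation with a decoupled result store.
from collections import defaultdict

import redis

results = redis.Redis(host="localhost", port=6379)  # model-result store (assumption)


def propagate_labels(seed_labels: dict, neighbors: dict, iterations: int = 3):
    """Tiny label-propagation loop: each node adopts the most common label
    among its neighbors; seed labels are kept fixed."""
    labels = dict(seed_labels)
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seed_labels:
                continue
            votes = defaultdict(int)
            for n in nbrs:
                if n in labels:
                    votes[labels[n]] += 1
            if votes:
                updated[node] = max(votes, key=votes.get)
        labels.update(updated)
    return labels


def on_graph_update(event):
    """Called by the real-time pipeline when an edge or vertex changes."""
    # In the real system the subgraph would be pulled from Nebula around the
    # updated node; here `event` is assumed to already carry it.
    labels = propagate_labels(event["seed_labels"], event["neighbors"])
    for node, label in labels.items():
        # The ordering link reads this key instead of querying the graph DB.
        results.set(f"label_prop:{node}", label)
```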

The above processing solution has two data streams (offline data stream and real-time data stream), so the system complexity is high, and data synchronization needs to be maintained.

[Image: Akulaku's Intelligent Risk Control Practice]

The third example is the graph convolution of the ordering link.

Specifically, the graph convolution is computed over the node attributes of a property graph, and its result is called in the ordering link. Its characteristics are similar to the previous example: the business link has high timeliness requirements; exploding subgraphs may also exist at this scale, and again sampling is not allowed; as in the second example, the update volume is smaller than the model call volume; the volume of orders is relatively large; and the backtesting data volume is relatively large. So what we have is the architecture on the right side of the figure: the system has a T+1 correction, that is, a full T+1 graph convolution pass updates the graph convolution results every day, while local score updates are driven in real time by the real-time data source. In short, it is a T+1 full refresh plus a real-time partial refresh.
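A compact sketch of the "T+1 full refresh plus real-time partial refresh" pattern follows; the scheduler entry points, the placeholder model call, and the Redis result store are illustrative assumptions only.

```python
# Sketch of combining a daily full refresh of graph-convolution scores with
# event-driven partial refreshes. `gcn_score`, the event shape, and the
# result store are placeholders for illustration.
import redis

results = redis.Redis(host="localhost", port=6379)


def gcn_score(node_id: str) -> float:
    """Placeholder for running the graph convolution model on the node's
    subgraph; in the real system this would read features from the graph."""
    return 0.0


def full_refresh(all_node_ids):
    """T+1 job: recompute and overwrite every score once a day."""
    for node_id in all_node_ids:
        results.set(f"gcn:{node_id}", gcn_score(node_id))


def on_realtime_update(event):
    """Real-time path: only nodes touched by the update are refreshed, so the
    ordering link always reads a reasonably fresh score."""
    for node_id in event["affected_nodes"]:
        results.set(f"gcn:{node_id}", gcn_score(node_id))
```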

The above are three applications of Nebula in the Akulaku team.

[Image: Akulaku's Intelligent Risk Control Practice]

Overall, Nebula's greatest value to Akulaku is its excellent import performance and scalability. On import speed: it is very fast, with QPS reaching 110,000 (asynchronous writes, of course), which is much better than other graph databases, and the scalability is also very good. In terms of concrete applications, Nebula Graph is used in many of Akulaku's graph learning model deployment scenarios. Going forward, we will focus on improving the stability of this platform and continue to provide feedback and suggestions to the community to build the product together. In addition, the graph analytics platform will be further optimized to reduce the difficulty of building graph models, backtesting, and model deployment.

