Abstract: In this article, Peng Mingde describes how Aunt Qian and the Alibaba Cloud Flink real-time computing team jointly built a real-time risk control rule engine that accurately identifies "wool parties" (users who systematically exploit promotions and coupons) and prevents marketing budgets from being drained. The main contents include:

  1. Project background
  2. Business architecture
  3. Rule model
  4. Technical challenges
  5. Review and outlook

1. Project Background

At present, Aunt Qian has built an omni-channel data center that integrates offline and real-time data on top of cloud-native big data components (DataWorks, MaxCompute, Flink, Hologres), providing BI reports and data interfaces for each business line. Beyond data warehouse analytics, Aunt Qian also needs risk control in its business systems. For example, in each quarter's marketing spend, a large number of "wool parties" (users who systematically exploit promotions) take benefits intended for normal users. On the one hand, this can damage the brand's reputation among users; on the other hand, it rapidly inflates the operating budget of the affected campaigns and causes financial losses. Aunt Qian and the Alibaba Cloud Flink real-time computing team therefore jointly built a real-time risk control rule engine to accurately identify wool parties and prevent the loss of marketing budgets.

Figure 1: Schematic diagram of Aunt Qian's real-time risk control process

2. Business Architecture

Aunt Qian's risk control business architecture is divided into four parts, as shown in Figure 2: event access, risk perception, risk response, and risk backtracking. The real-time user profile tags and sales fact metrics produced by Flink online ETL are displayed as online BI metrics and on real-time dashboards, and they also provide important data for the event access layer of the real-time rule engine.

  1. Event access. Inputs include blacklist, whitelist and greylist databases, user profile feature data, and behavioral and transaction data from the middle platform.
  2. Risk perception. Once a strategy has been researched, it is published to the rule engine; offline regression is run, and alert results are delivered through multiple channels.
  3. Risk response. For rules that involve financial settlement, provide re-review, an exemption mechanism, or manual compensation.
  4. Risk backtracking. After a strategy is hit, the hits are counted and classified by risk level, followed by offline early warning and review, until the risk control event is closed.

Figure 2: Aunt Qian's real-time risk control business architecture diagram

3. Rule Model

Risk control business specialists can publish risk control rules dynamically and in real time through simple configuration on the product interface, adding, updating and deleting rules in the online Flink job. Rule models fall into two main types: statistical rules and sequence rules. Rules of the same model can nest sub-rules, and rules of different models can be combined with AND/OR relationships.
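The nesting and AND/OR combination described above can be pictured as a small rule tree. The following is a minimal sketch only: the article does not show the engine's real classes, so `Leaf`, `Combo`, `evaluate` and the fact names such as `orders_5min` are all hypothetical, and each statistical or sequence rule is reduced to a predicate over already-computed facts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

# All names here are illustrative assumptions, not the engine's actual API.
Facts = Dict[str, float]


@dataclass
class Leaf:
    """A single statistical or sequence rule, reduced to a predicate over facts."""
    name: str
    predicate: Callable[[Facts], bool]


@dataclass
class Combo:
    """A nested rule: child rules joined by 'and' or 'or'."""
    op: str  # "and" or "or"
    children: List["Rule"]


Rule = Union[Leaf, Combo]


def evaluate(rule: Rule, facts: Facts) -> bool:
    """Recursively evaluate a rule tree against pre-computed facts."""
    if isinstance(rule, Leaf):
        return rule.predicate(facts)
    results = (evaluate(child, facts) for child in rule.children)
    if rule.op == "and":
        return all(results)
    if rule.op == "or":
        return any(results)
    raise ValueError(f"unknown operator: {rule.op}")


# Hypothetical combination: many orders AND (high spend OR unused coupon).
example_rule = Combo("and", [
    Leaf("orders_5min > 150", lambda f: f["orders_5min"] > 150),
    Combo("or", [
        Leaf("amount_5min > 300", lambda f: f["amount_5min"] > 300),
        Leaf("coupon claimed, no order", lambda f: f["coupon_no_order"] == 1),
    ]),
])
```

In a real job the leaf results would come from Flink windows or CEP matches rather than a plain dictionary; the tree only illustrates how sub-rules can be nested and joined.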

Figure 3: DAG abstract diagram of Aunt Qian's Flink job

The following are the configuration items that require dynamic configuration capability in the rule combination:

  1. Grouping fields. Grouping by different fields, and by several fields at once, is very common in risk control rules. Sample rules:

    1. Group by user ID: "Number of orders placed by the user";
    2. Group by user ID and region ID: "Number of orders the user placed in different regions within the same period".
  2. Aggregate function. Aggregate functions cover the aggregation logic commonly used in the business. The rule engine builds on Flink's rich set of built-in accumulators and adds custom ones, implemented on the Accumulator interface, for specific scenarios. Sample rules:

    1. Store A has fewer than 100 distinct consumers in the last 30 minutes;
    2. The consumption amount of new customers in store B is more than 300.
  3. Window period. The window period is the size of each window. For example, the business side may want to run a rule only during a 30-minute flash sale, or may want to focus on abnormal time periods. Sample rules:

    1. In every 30-minute time window, a single user initiates more than 20 unpaid orders;
    2. From 1:00 am to 3:00 am, a single user pays for more than 50 orders.
  4. Window type. To meet different business requirements, we integrated the window types commonly used in business rules into the rule engine. These include sliding windows, cumulative windows, and even no window at all (instant triggering).
  5. Filter conditions before aggregation:

    1. Only count "order events";
    2. Filter out a store's "virtual users".
  6. Filter conditions after aggregation:

    1. User A placed orders "more than 150 times" within 5 minutes;
    2. User B purchased "more than 300 yuan" within 5 minutes.
  7. Expression evaluation. The fields used in risk control rules often have to be derived by combining several raw fields. For compiling and evaluating expressions we integrated the lightweight, high-performance Aviator expression engine. Sample rules:

    1. The receivable amount is greater than 150 yuan (receivable amount = total product amount + shipping fee + total discount);
    2. The receivable amount paid through the POS terminal is greater than 150 yuan.
  8. Behavior sequence. A behavior sequence is a combination of several events. It removes the limitation that earlier risk control rules could only describe facts along a single event dimension: the rule engine now also captures the relationships between events. Sample rules:

    1. User A clicks, adds to favorites, and adds to cart in sequence within 5 minutes;
    2. User B claimed a coupon 30 minutes ago but has not placed an order.
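To make the configuration items above concrete, here is a minimal sketch of how a statistical rule could be represented and evaluated. It is an illustration under stated assumptions, not the engine's implementation: `StatRule`, `run_rule` and the event fields (`user_id`, `type`, `paid`, `ts`) are hypothetical, only a fixed (tumbling) window is modeled, and plain Python callables stand in for Flink accumulators and Aviator expressions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]  # hypothetical event shape, e.g. {"user_id": ..., "ts": ...}


@dataclass
class StatRule:
    group_fields: Tuple[str, ...]              # grouping fields
    window_seconds: int                        # window period (tumbling only here)
    pre_filter: Callable[[Event], bool]        # filter before aggregation
    aggregate: Callable[[List[Event]], float]  # aggregate function
    post_filter: Callable[[float], bool]       # filter after aggregation


def run_rule(rule: StatRule, events: List[Event]) -> List[Tuple[tuple, int, float]]:
    """Return (group key, window start, aggregate) for each window that hits the rule."""
    buckets: Dict[Tuple[tuple, int], List[Event]] = defaultdict(list)
    for event in events:
        if not rule.pre_filter(event):
            continue
        key = tuple(event[field] for field in rule.group_fields)
        # Align the event timestamp (seconds) to the start of its window.
        window_start = int(event["ts"]) // rule.window_seconds * rule.window_seconds
        buckets[(key, window_start)].append(event)

    hits = []
    for (key, window_start), group in sorted(buckets.items(), key=lambda kv: kv[0]):
        value = rule.aggregate(group)
        if rule.post_filter(value):
            hits.append((key, window_start, value))
    return hits


# "In every 30-minute time window, a single user initiates more than 20 unpaid orders"
unpaid_orders = StatRule(
    group_fields=("user_id",),
    window_seconds=30 * 60,
    pre_filter=lambda e: e["type"] == "order" and not e["paid"],
    aggregate=len,
    post_filter=lambda count: count > 20,
)
```

Because every configuration item is plain data or a small function, a rule like `unpaid_orders` could be swapped at runtime without touching `run_rule`, which is the property the dynamic configuration is after.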

Figure 4: Simple diagram of real-time risk control rule configuration business logic

4. Technical Challenges

For the streaming sequence data in the rule model, we chose Flink CEP to handle event sequence matching. Our entire risk control job is implemented in Flink, and Flink CEP is a library natively supported by Flink, so event sequence matching can be achieved with tight integration and without introducing additional components. The job is expected to let users hot-publish rules from the product interface, but open source Flink CEP poses the following difficulties for dynamic rule updates:

  1. The Flink community's CEP API does not support dynamic modification of a Pattern, so it cannot be integrated with the upstream rule platform and risk control platform;
  2. The Flink community's CEP API does not support Pattern-defined timeouts between events.

Alibaba Cloud's Flink real-time computing team and Aunt Qian's engineers worked together to solve these problems. They initiated the following two FLIP proposals in the Flink community and delivered the corresponding features in Alibaba Cloud's real-time computing product:

  1. FLIP-200 : CEP supports multiple rules and dynamic pattern changes;
  2. FLIP-228 : CEP supports Pattern-defined timeouts between events.
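To illustrate what a timeout between events means for sequence matching, the sketch below matches an ordered list of event types where each matched event must follow the previous one within a maximum gap. It is a plain-Python illustration, not the Flink CEP API: `match_sequence`, `max_gap` and the event types are hypothetical, and for each step the function simply takes the first later event of the required type without backtracking to alternatives.

```python
from typing import List, Optional, Sequence, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, event type); hypothetical shape


def match_sequence(events: Sequence[Event], steps: Sequence[str],
                   max_gap: int) -> Optional[List[Event]]:
    """Find the first in-order occurrence of `steps` (non-empty) in time-ordered
    `events`, skipping unrelated events, where each matched event follows the
    previously matched one within `max_gap` seconds. Returns None if no match."""
    for start, first in enumerate(events):
        if first[1] != steps[0]:
            continue
        matched = [first]
        for ts, kind in events[start + 1:]:
            if len(matched) == len(steps):
                break
            if ts - matched[-1][0] > max_gap:
                break  # gap timeout: abandon this partial match
            if kind == steps[len(matched)]:
                matched.append((ts, kind))
        if len(matched) == len(steps):
            return matched
    return None


clicks = [(0, "click"), (20, "view"), (60, "favorite"), (100, "add_to_cart")]
steps = ["click", "favorite", "add_to_cart"]
```

With these sample events, a generous gap lets the partial match survive the unrelated `view` event, while a tight gap expires it before the `favorite` arrives. A limit on the whole match (for example "within 5 minutes" overall) would be a separate check on the first and last matched timestamps.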

Alibaba Cloud's real-time computing product supports multiple rules and dynamic rule changes, Pattern-defined timeouts between events, and IterativeCondition-based accumulators. These commercial features broaden Flink's capabilities in real-time risk control and are already running in Aunt Qian's production environment. The interaction between the internal components of the Flink CEP dynamic pattern update mechanism is outlined below:

Figure 5: Community Flink CEP Dynamic Pattern Mechanism

The product interface is the entry point for risk control rules, which are written to Hologres. JDBCPatternProcessorDiscover polls the rule table periodically to detect rule changes. The rule table has the following structure:

  1. Id: the rule ID;
  2. Version: the version number of the rule;
  3. Keyby: the grouping field of the rule (if grouping is required);
  4. Pattern: the CEP Pattern serialized as a JSON string;
  5. Function: the PatternProcessFunction invoked after a CEP match;
  6. Relation: the AND or OR relationship between the statistical rule and the sequence rule (premise: both rules share the same ID).
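The polling-based discovery of rule changes can be sketched as follows. This is an assumption-laden illustration rather than the actual JDBC discoverer: `RuleRow` mirrors the six columns listed above, `PollingDiscoverer` is a hypothetical class, and `fetch_rows` stands in for the JDBC query against Hologres. A rule counts as changed when its ID is new or its version differs from the last poll.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class RuleRow:
    """One row of the rule table (field names follow the column list above)."""
    id: str
    version: int
    keyby: Optional[str]     # grouping field, if any
    pattern: str             # CEP Pattern serialized as a JSON string
    function: str            # name of the PatternProcessFunction to invoke
    relation: Optional[str]  # "AND" / "OR", if the rule is combined


class PollingDiscoverer:
    """Detects rule changes by comparing (id, version) between polls."""

    def __init__(self, fetch_rows: Callable[[], Iterable[RuleRow]]):
        self._fetch_rows = fetch_rows     # stand-in for a JDBC query
        self._known: Dict[str, int] = {}  # rule id -> last seen version

    def poll(self) -> Tuple[List[RuleRow], List[str]]:
        """Return (new or updated rows, ids of removed rules) since the last poll."""
        rows = {row.id: row for row in self._fetch_rows()}
        changed = [row for row in rows.values()
                   if self._known.get(row.id) != row.version]
        removed = [rule_id for rule_id in self._known if rule_id not in rows]
        self._known = {rule_id: row.version for rule_id, row in rows.items()}
        return changed, removed
```

A caller would invoke `poll()` on a timer and pass the changed rows on to the operators that rebuild their patterns; scheduling and distribution to the running job are deliberately left out of the sketch.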

Figure 6: Community Flink dynamic CEP rule table

5. Review and Outlook

The Flink-based real-time risk control solution has been applied in the internal production environment of Aunt Qian (Qiandama Group). The solution introduces no new technical components or programming languages. It maximizes the reuse of existing Flink resources to meet real-time risk control requirements and greatly reduces the operation and maintenance risks that new components would bring. It also greatly lowers the learning cost for the R&D team, frees up engineering effort for real-time computing, and brings the following benefits to R&D and business applications:

  • Flink job logic development is decoupled from business rule definition;
  • Business rules are stored in a database, which makes it easy to view the current status and historical versions of the rules;
  • Changing a rule only requires modifying the rule stored in the database; Flink automatically loads the change and updates the rule list in the job;
  • Combined with the Flink ecosystem, it is easy to read from and write to heterogeneous event data sources;
  • Combined with Flink's distributed capabilities, rule matching can scale to a parallelism in the thousands.

Going forward, Aunt Qian and the Alibaba Cloud real-time computing product team will continue to jointly build and improve the Flink-based real-time risk control solution. Future planning for Flink CEP will focus on the following three directions:

  1. Further enhancement of Flink CEP capabilities;
  2. Dynamic capabilities of Flink CEP SQL;
  3. Native support for Flink + DSL.

Company profile: Aunt Qian is a pioneer in the community fresh-food chain industry, with the brand concept of "not selling overnight meat". From its founding, it rethought the standards of the traditional fresh-food industry from the perspective of freshness and gave the meat market a new definition. Aunt Qian operates in nearly 30 cities across the country, with more than 3,000 stores in total, serving over 10 million families.

About the author: Peng Mingde currently works at Aunt Qian as a big data development engineer in the Omni-Channel Data Center.

We also hope that more people who have real-time risk control needs, or who are interested in building risk control scenarios, will join the discussion in the Flink community risk control DingTalk group:

Figure 7: Flink community real-time risk control group QR code

