Background introduction
Introduction to Risk Control
In the 21st century, the information age has let the Internet industry grow far faster than other industries. As soon as a business model proves profitable, capital pours in, pushing more companies into the market to replicate and iterate rapidly in hopes of becoming the next "industry leader".
Players who enter the market backed by capital feel little financial pressure, so they focus on business growth and tend to ignore business risk. Even a giant like Pinduoduo has lost tens of millions of dollars to the "wool-pulling" army of organized coupon abusers.
Risk control, or risk management, is a management process covering the definition, measurement, and assessment of risk and the strategies to handle it. The purpose is to minimize avoidable risks, costs, and losses [1].
Introduction to Feature Platform
Internet companies face attacks from black- and gray-market operations around the clock. The business security team assesses the risks in a business flow in advance, then sets up checkpoints to collect the relevant business information and determine whether the current request is risky. Long-running confrontation produces expert experience (prevention and control strategies).
Deploying a strategy requires feature support, feature by feature. So what is a feature?
Features fall into basic features, derived features, statistical features, and so on. For example (a small code sketch follows this list):
- Basic features: obtained directly from the business, such as the order amount, the buyer's mobile phone number, the buyer's address, the seller's address, etc.
- Derived features: require secondary calculation, such as the distance from buyer to seller, or the first 3 digits of the mobile phone number, etc.
- Statistical features: require real-time statistics, such as the number of orders placed by a mobile phone number within 5 minutes, or the number of orders with an amount greater than 20,000 yuan within 10 minutes, etc.
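As a hedged illustration only (the class, field, and method names below are hypothetical, not our platform's actual code), the three kinds of features might look like this:

```java
import java.util.Map;

public class FeatureExamples {

    // Basic feature: read directly from the business message.
    static String buyerPhone(Map<String, Object> order) {
        return (String) order.get("buyerPhone");
    }

    // Derived feature: secondary calculation, e.g. haversine distance
    // between the buyer's and seller's coordinates, in kilometers.
    static double buyerSellerDistanceKm(double lat1, double lon1,
                                        double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 6371 * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    // Statistical feature: a windowed counter keyed by phone number; the real
    // counting runs in Flink, this only shows the shape of the counter key.
    static String orderCount5mKey(String phone, long windowStartMillis) {
        return "cnt_order_5m:" + phone + ":" + windowStartMillis;
    }
}
```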
As the business grew rapidly, pure expert experience could no longer meet the needs of risk identification. Bringing in the algorithm team made interception markedly more accurate, and by unifying the algorithm engineering framework the team solved the systematic problem of model and feature iteration, greatly improving iteration efficiency.
By function, the algorithm platform divides into three parts: model serving, model training, and the feature platform. Model serving provides online model inference, model training produces trained models, and the feature platform supplies the data behind features and samples. This article focuses on the challenges we met while building real-time computation on the feature platform and the ideas we used to optimize it.
Challenges and Solutions
Challenges
In the early stage of the business we could meet the strategists' feature requirements through hard coding, and collaboration worked well enough. But as the business accelerated, business lines multiplied, marketing gameplay grew more complex, and users and requests increased exponentially. The early hard-coding approach ran into many problems: strategies scattered and unmanageable, logic strongly coupled to the business, strategy iteration throttled by development capacity, and high hand-off costs. We urgently needed a feature management platform that could be configured online, hot-updated, and used for fast trial and error.
Weaknesses of the old framework
Real-time Framework 1.0: Built on the Flink DataStream API
If you are familiar with the Flink DataStream API, you will notice that Flink's design naturally fits the real-time feature computation scenario in risk control. Only a few simple steps are needed to compute an indicator, as shown in the figure below:
Flink DataStream flow graph
The sample code of real-time feature statistics is as follows:
```java
// Data stream, e.g. a Kafka topic
DataStream<ObjectNode> dataStream = ...
SingleOutputStreamOperator<AllDecisionAnalyze> windowOperator = dataStream
    // Filter
    .filter(this::filterStrategy)
    // Data conversion
    .flatMap(this::convertData)
    // Configure the watermark
    .assignTimestampsAndWatermarks(timestampAndWatermarkAssigner(config))
    // Group by key
    .keyBy(this::keyByStrategy)
    // 5-minute tumbling window
    .window(TumblingEventTimeWindows.of(Time.seconds(300)))
    // Custom aggregate function with user-defined internal logic
    .aggregate(AllDecisionAnalyzeCountAgg.create(), AllDecisionAnalyzeWindowFunction.create());
```
Shortcomings of the 1.0 framework:
- Features depend heavily on developer coding: simple statistical features can be abstracted, but anything slightly more complex requires custom development
- Iteration efficiency is low: a strategy requirement passes through product scheduling, development, and testing before delivery, and the whole cycle takes at least two weeks
- Features are strongly coupled and tasks are hard to split: one job carries too much logic, and new feature logic can destabilize previously stable indicators
Overall, 1.0 suited the early stage of the business well, but as the business developed, development speed gradually became the bottleneck; it was not a sustainable, manageable real-time feature cleaning architecture.
Real-time Framework 2.0: Built on Flink SQL
The root shortcoming of the 1.0 architecture is that requirements and development speak different languages. How can requirements be translated efficiently, or better, how can strategy staff configure the feature cleaning logic online themselves? At a two-week iteration pace, the production environment may already have been mauled beyond recognition by black- and gray-market attackers.
At this point our team turned to Flink SQL. SQL is the most widely used data analysis language, and strategy and operations staff already have the basic SQL skills; arguably SQL is the cheapest way to turn requirements into implementation.
Take a look at an example Flink SQL implementation:
```sql
-- Error log monitoring
-- Kafka source
CREATE TABLE rcp_server_log (
    thread varchar,
    level varchar,
    loggerName varchar,
    message varchar,
    endOfBatch varchar,
    loggerFqcn varchar,
    instant varchar,
    threadId varchar,
    threadPriority varchar,
    appName varchar,
    triggerTime as LOCALTIMESTAMP,
    proctime as PROCTIME(),
    WATERMARK FOR triggerTime AS triggerTime - INTERVAL '5' SECOND
) WITH (
    'connector.type' = 'kafka',
    'connector.version' = '0.11',
    'connector.topic' = '${sinkTopic}',
    'connector.startup-mode' = 'latest-offset',
    'connector.properties.group.id' = 'streaming-metric',
    'connector.properties.bootstrap.servers' = '${sinkBootstrapServers}',
    'connector.properties.zookeeper.connect' = '${sinkZookeeperConnect}',
    'update-mode' = 'append',
    'format.type' = 'json'
);

-- Creation of sink_feature_indicator omitted here; see the source table for reference
-- Decision distribution per business line, by day and by city
INSERT INTO sink_feature_indicator
SELECT
    level,
    loggerName,
    COUNT(*)
FROM rcp_server_log
WHERE
    (level <> 'INFO' AND `appName` <> 'AppTestService')
    OR loggerName <> 'com.test'
GROUP BY
    TUMBLE(triggerTime, INTERVAL '5' SECOND),
    level,
    loggerName;
```
While building platform support for Flink SQL, we ran into the following problems:
- One SQL statement per indicator means each job reads the data source separately, a huge waste
- SQL merging, i.e. combining tasks that share the same source into one job, greatly increases job complexity and makes boundaries impossible to define
- Going live with new SQL requires stopping and restarting the job; if that job also carries many stable indicators, every restart becomes a moment of risk
Technical realization
Pain point summary
Business & R&D Pain Point Map
Real-time computing architecture
Every day, strategy and algorithm staff analyze real-time and offline data to watch for risks online; for risky scenarios they design prevention and control strategies, which reach the development side as, in effect, new real-time features. The delivery speed, quality, and usability of real-time features therefore determine whether online risk scenarios can be plugged in time.
Before the unified real-time feature computing platform was built, the output of real-time features mainly had the following problems:
- Slow delivery, slow iteration: a strategy goes to product, then development, then testing, then online observation until stable; the pace is extremely slow
- Strong coupling, where one change shakes everything: monster jobs mixing many business features together, with no priority guarantees between businesses
- Repeated development: without a unified real-time feature management platform, many features already existed under different names, causing great waste
The most important thing in building a platform is abstraction of the whole process; the goal is for the platform to be usable, useful, and easy to use. Following this idea, we distilled the real-time feature development pain points into templating + configuration: the platform provides a template for creating real-time features, and users generate the features they need through simple configuration on top of it.
Flink real-time computing architecture diagram
Computing layer
- Data source cleaning: abstract Flink connectors over different data sources, producing standardized output for downstream use
- Data fission: one message splits into N; a real-time message may carry multiple events, so the data must undergo fission
- Dynamic configuration: cleaning logic for features can be updated or added without stopping the job
- Script loading: Groovy support with hot updates
- RTC: Real-Time Calculate, the real-time feature computation module, a highly abstracted encapsulation
- Task awareness: tasks are isolated by a feature's business domain, priority, and stability, decoupling businesses from one another
Service layer
- Unified query SDK: a unified query SDK for real-time features that shields callers from the underlying implementation
Based on this unified Flink real-time computing architecture, we redesigned the real-time feature cleaning architecture:
Flink real-time computing data flow graph
Feature Configuration & Store/Read
The underlying storage of a feature should be "atomic": the smallest indivisible unit. Why design it this way? Real-time statistical features are tied to window size, and different strategists need different feature windows for prevention and control. For example:
- Trusted-device determination: the window over the current mobile phone number's login duration should be moderate; too short a window risks disturbing legitimate users
- Withdrawal-fraud determination: the window over the current mobile phone number's login duration should be as short as possible, so that a quick log-in-and-withdraw pattern, combined with other dimensions, can locate risk fast
Given the above, we urgently needed a general real-time feature read module that satisfies strategists' need for arbitrary windows and developers' need for fast configuration and cleaning. Our refactored feature configuration module looks like this:
Feature Configuration Abstraction Module
Real-time feature module (a sketch of this metadata as a Java class follows the list):
- Feature unique identifier
- Feature name
- Window support: sliding, tumbling, or fixed-size windows
- Time slice unit: minute, hour, day, week
- Main attributes: the grouping columns; there can be several
- Dependent (subordinate) attributes: the inputs consumed by aggregate functions, such as the base features needed for deduplication
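A minimal sketch of this metadata as a Java class, assuming illustrative field names rather than the platform's actual schema:

```java
import java.util.List;

public class RealTimeFeatureConfig {

    private String featureId;                  // feature unique identifier
    private String featureName;                // human-readable feature name
    private WindowType windowType;             // window support
    private TimeSliceUnit sliceUnit;           // time slice unit
    private List<String> mainAttributes;       // grouping columns; may be several
    private List<String> dependentAttributes;  // inputs for aggregate functions, e.g. dedup columns
    private String aggregateFunction;          // COUNT, COUNT_DISTINCT, SUM, MAX ...

    public enum WindowType { NONE, SLIDING, TUMBLING, FIXED }
    public enum TimeSliceUnit { MINUTE, HOUR, DAY, WEEK }

    // getters/setters omitted
}
```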
The business leaves risk control little time: most scenarios must finish within 100 ms, and real-time feature retrieval gets even less. From past development experience, RT must stay within 10 ms to ensure strategy execution does not time out, so we store features in Redis to keep performance from becoming the bottleneck.
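Because features are stored as atomic time slices, any window a strategist asks for can be served at read time by summing the slices it covers. A minimal read-side sketch with Jedis, assuming a key layout of feature:{featureId}:{mainAttr}:{minuteBucket} (the layout and names are assumptions, not our actual schema):

```java
import redis.clients.jedis.Jedis;

public class FeatureReader {

    private final Jedis jedis = new Jedis("localhost", 6379);

    // Answer "count over the last N minutes" by summing per-minute atomic slices.
    public long readWindowCount(String featureId, String mainAttr, int minutes) {
        long nowMinute = System.currentTimeMillis() / 60_000;
        long total = 0;
        for (int i = 0; i < minutes; i++) {
            String key = "feature:" + featureId + ":" + mainAttr + ":" + (nowMinute - i);
            String value = jedis.get(key);
            if (value != null) {
                total += Long.parseLong(value);
            }
        }
        return total;
    }
}
```

In production the per-slice reads would be batched with MGET or a pipeline to stay inside the 10 ms budget.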
Hot deployment of cleaning scripts
As mentioned above, the real-time feature computation module depends heavily on the main and subordinate attributes carried in upstream messages, and this stage is where development has to step in: if the main attribute field is missing from a message, development must fill it in. If that required a code release, we would be back to the original problem of constantly restarting the Flink job, which is clearly unacceptable.
At this point we thought of Groovy. Could Flink + Groovy give us hot-deployable code? The answer is yes!
Since we abstracted the computation flow graph of the entire Flink job, the operators themselves never change; the DAG is fixed, and what varies is the cleaning logic for associated events inside the operators. As long as only the association rules and the cleaning scripts change, hot deployment completes without restarting the Flink job.
The core logic of Groovy hot deployment is shown in the figure:
Cleaning script configuration and loading diagram
Developers or strategists add cleaning scripts in the operations console, which stores them in the database. The script cache module inside the Flink job then perceives the added or modified script (how it perceives this is detailed in the overall process below):
- warm up: the first run of a script is slow, so scripts are pre-warmed and pre-executed on first start-up or cache update, ensuring real traffic runs through them quickly
- cache: caches already-compiled Groovy scripts
- push/poll: the cache updates in both push and pull modes to ensure no update is lost (a polling sketch follows this list)
- router: script routing ensures each message finds its corresponding script to execute
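As one hedged illustration of the poll side (the push side could be, say, a broadcast over a config topic), a scheduled refresh loop over a hypothetical script DAO:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ScriptCacheRefresher {

    // channel + eventCode -> script text; the routing-key shape is an assumption.
    private final Map<String, String> scriptCache = new ConcurrentHashMap<>();

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start(ScriptDao dao) {
        // Poll the script store periodically; a push update can invoke
        // refresh() directly so changes land without waiting for the poll.
        scheduler.scheduleWithFixedDelay(() -> refresh(dao), 0, 30, TimeUnit.SECONDS);
    }

    public void refresh(ScriptDao dao) {
        // Overwrite changed entries; unchanged scripts keep their compiled cache.
        dao.loadAll().forEach(scriptCache::put);
    }

    /** Hypothetical DAO over the script table. */
    public interface ScriptDao {
        Map<String, String> loadAll();
    }
}
```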
Core script-loading code:
```java
// Cache; otherwise unbounded reloading eventually causes a metaspace OutOfMemory
private final static Map<String, GroovyObject> groovyObjectCache = new ConcurrentHashMap<>();

/**
 * Compile a script, or return the cached instance.
 * @param script Groovy source text
 * @return instantiated script object
 */
public static GroovyObject buildScript(String script) {
    if (StringUtils.isEmpty(script)) {
        throw new RuntimeException("script is empty");
    }
    String cacheKey = DigestUtils.md5DigestAsHex(script.getBytes());
    if (groovyObjectCache.containsKey(cacheKey)) {
        log.debug("groovyObjectCache hit");
        return groovyObjectCache.get(cacheKey);
    }
    GroovyClassLoader classLoader = new GroovyClassLoader();
    try {
        Class<?> groovyClass = classLoader.parseClass(script);
        GroovyObject groovyObject = (GroovyObject) groovyClass.newInstance();
        classLoader.clearCache();
        groovyObjectCache.put(cacheKey, groovyObject);
        log.info("groovy buildScript success: {}", groovyObject);
        return groovyObject;
    } catch (Exception e) {
        throw new RuntimeException("buildScript error", e);
    } finally {
        try {
            classLoader.close();
        } catch (IOException e) {
            log.error("close GroovyClassLoader error", e);
        }
    }
}
```
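Invoking a loaded script could then look like the snippet below; the entry-point method name clean is an assumed convention of ours, not a Groovy requirement:

```java
// Compile the script (or fetch it from cache), then call its cleaning entry point.
GroovyObject cleaner = buildScript(scriptText);
Object cleaned = cleaner.invokeMethod("clean", new Object[]{rawMessage});
```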
Standard Message & Cleaning Procedure
The messages the strategies need to count span many dimensions and multiple businesses, and development itself also needs real-time features for monitoring, so the data sources behind real-time features are diverse. Fortunately, Flink supports many data sources, and for specific sources we only need to implement a custom Flink connector. Below I take Kafka as an example to walk through how the overall process cleans real-time statistical features.
First, the overall risk control data flow: multiple business scenarios connect to the risk control middle platform, whose core internal links are the decision engine, the rule engine, and the feature service.
For each business request decision, we record it asynchronously and send a Kafka message for real-time feature computation and offline tracking.
Risk control core data flow diagram
Standardized message templates
When the Flink real-time job receives an MQ message, the first step is to parse it against a standardized message template. Different topics carry different formats, such as JSON, CSV, or heterogeneous formats (e.g. error log messages that are space-separated, or objects nesting JSON objects).
To facilitate unified processing by downstream operators, the standardized message structure is as follows:
```java
public class RcpPreProcessData {

    /**
     * Channel; the topic name can be used directly
     */
    private String channel;

    /**
     * Message class; channel + eventCode should uniquely identify a class of messages
     */
    private String eventCode;

    /**
     * All main and subordinate attributes
     */
    private Map<String, Object> featureParamMap;

    /**
     * Raw message
     */
    private ObjectNode node;
}
```
Message fission
A "rich message" may contain a large amount of business information, and some real-time characteristics may need to be counted separately. For example, a business request risk control context message includes whether the message is rejected, that is, how many policy rules are hit. The hit rules are an array and may contain multiple hit rules. At this time, if you want to associate other attribute statistics based on a hit rule, you need to use message fission to change from 1 to N.
The logic of message fission is written by the operation background through Groovy script. The logic of positioning and cleaning script is channel (parent) + eventCode (child). Here, the logic is divided into "parent and child". The "parent" logic is applicable to all the logic under the current channel, to avoid It is cumbersome to configure N eventCodes individually, and the "sub" logic is applicable to a specific eventCode.
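A hedged Java sketch of fission as a Flink FlatMapFunction; in reality this logic lives in Groovy scripts, and the hitRules field plus the getters/setters on RcpPreProcessData are assumptions for illustration:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.util.Collector;

public class HitRuleFissionFunction
        implements FlatMapFunction<RcpPreProcessData, RcpPreProcessData> {

    @Override
    public void flatMap(RcpPreProcessData in, Collector<RcpPreProcessData> out) {
        // One message carries N hit rules; emit one message per rule so each
        // rule can be aggregated independently downstream.
        @SuppressWarnings("unchecked")
        List<String> hitRules = (List<String>) in.getFeatureParamMap().get("hitRules");
        if (hitRules == null || hitRules.isEmpty()) {
            out.collect(in);
            return;
        }
        for (String rule : hitRules) {
            RcpPreProcessData child = new RcpPreProcessData();
            child.setChannel(in.getChannel());
            child.setEventCode(in.getEventCode());
            Map<String, Object> params = new HashMap<>(in.getFeatureParamMap());
            params.put("hitRule", rule);
            child.setFeatureParamMap(params);
            out.collect(child);
        }
    }
}
```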
Message Cleaning & Pruning
Cleaning a message means knowing which main and subordinate attributes the features require; cleaning with a clear goal is more precise. The cleaning script is located the same way as above, by channel + eventCode. The cleaned main and subordinate attributes are stored in featureParamMap for downstream real-time computation.
Note that we keep passing the raw message downstream, but once the cleaned main and subordinate attributes are confirmed, the raw message is no longer needed. At that point we "prune" it to cut the I/O cost of the subsequent RPC calls.
At this point an original message has been reduced to channel, eventCode, and featureParamMap (all main and subordinate attributes); downstream operators need these, and only these, to compute.
Real-time computing
Like the two operators above, the real-time computation operator locates real-time feature metadata by channel + eventCode; one event may map to several real-time feature configurations. After the operations platform saves a feature configuration, the cache update mechanism quickly distributes it to the tasks, where the key constructor generates the corresponding key and the downstream operator sinks it directly to Redis. A sketch of this step follows.
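A hedged sketch of key construction and the Redis sink, mirroring the assumed slice-key layout from the read-side sketch above (all names here are illustrative):

```java
import redis.clients.jedis.Jedis;

public class FeatureKeyBuilder {

    // Build the atomic-slice key: feature id + main-attribute value + minute bucket.
    public static String build(String featureId, String mainAttrValue, long eventTimeMillis) {
        long minuteBucket = eventTimeMillis / 60_000;
        return "feature:" + featureId + ":" + mainAttrValue + ":" + minuteBucket;
    }

    // Increment the slice counter and bound its lifetime to the largest
    // window any feature configuration can ask for.
    public static void sink(Jedis jedis, String key, int maxWindowSeconds) {
        jedis.incr(key);
        jedis.expire(key, maxWindowSeconds);
    }
}
```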
Task Troubleshooting & Tuning Ideas
Task troubleshooting rests on thorough monitoring, and Flink provides many useful metrics for locating problems. Below are the common task anomalies I have run into; I hope the list helps.
Troubleshooting TaskManager Full GC
Possible causes:
- Large windows: 90% of TaskManager memory blow-ups are caused by oversized windows
- Memory leaks: custom operators that involve caching can easily cause memory to balloon
Solutions:
- Set window sizes sensibly and allocate TaskManager memory accordingly (the Flink 1.10 default is 1 GB); let Flink state manage aggregated data instead of hand-rolled object caches
- Take a heap snapshot of the anomalous process and analyze it with a tool such as MAT; with some tuning experience this locates the problem quickly
Flink Job Backpressure
Possible causes:
- Data skew: 90% of backpressure cases are caused by data skew
- Ill-chosen parallelism: the data traffic or the computing capacity of a single operator was misestimated
Solutions:
- For data skew, see the next section
- For parallelism, instrument the message path with timing points to measure each node's cost
Data skew
Core idea:
- Salt the key with a random number, so keyBy partitions on the new key; the hot key's records scatter across subtasks and the skew disappears
- Run a second keyBy on the restored key to aggregate the final result
Core key-salting code (a two-phase usage sketch follows it):
```java
public class KeyByRouter {

    private final static String SPLIT_CHAR = "#";

    /**
     * Don't scatter too widely, or the second aggregation will itself skew.
     *
     * @param sourceKey original key
     * @return salted key
     */
    public static String randomKey(String sourceKey) {
        int endExclusive = (int) Math.pow(2, 7);
        return sourceKey + SPLIT_CHAR + (RandomUtils.nextInt(0, endExclusive) + 1);
    }

    public static String restoreKey(String randomKey) {
        if (StringUtils.isEmpty(randomKey)) {
            return null;
        }
        return randomKey.split(SPLIT_CHAR)[0];
    }
}
```
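Used in a job, the salted key drives the first aggregation and the restored key drives the second. A hedged sketch, assuming the usual Flink imports, that events is a DataStream of objects exposing getKey(), and illustrative window sizes:

```java
// Phase 1: pre-aggregate on the salted key so the hot key spreads across subtasks.
DataStream<Tuple2<String, Long>> preAgg = events
        .map(e -> Tuple2.of(KeyByRouter.randomKey(e.getKey()), 1L))
        .returns(Types.TUPLE(Types.STRING, Types.LONG))
        .keyBy(t -> t.f0)
        .window(TumblingEventTimeWindows.of(Time.seconds(60)))
        .sum(1);

// Phase 2: strip the salt and merge the partial counts per original key.
DataStream<Tuple2<String, Long>> counts = preAgg
        .map(t -> Tuple2.of(KeyByRouter.restoreKey(t.f0), t.f1))
        .returns(Types.TUPLE(Types.STRING, Types.LONG))
        .keyBy(t -> t.f0)
        .window(TumblingEventTimeWindows.of(Time.seconds(60)))
        .sum(1);
```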
Suspending a job with retained state fails
Possible causes:
- The job is already under backpressure and checkpoints are probably failing, so the savepoint taken when suspending with retained state will certainly fail too
- The job state is very large and the savepoint times out
- The job's checkpoint timeout is set too short, so the savepoint is discarded before it can complete
Solutions:
- In code, set the checkpoint timeout as generously as practical, e.g. 10 min; jobs with large state can go higher (see the snippet below)
- If the job does not need to retain state, simply stop and restart it
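For reference, the checkpoint timeout is set on the job's CheckpointConfig; the values here are illustrative:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Checkpoint every 60 s.
env.enableCheckpointing(60_000);
// Give large-state jobs a generous timeout so savepoints can complete.
env.getCheckpointConfig().setCheckpointTimeout(10 * 60 * 1000L);
```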
Summary and Outlook
This article walked through our current, stable real-time computing architecture: the evolution of the real-time feature cleaning framework, feature configuration, and hot deployment of cleaning logic. After nearly two years of iteration, the architecture performs well in stability, resource utilization, and performance overhead, and gives strong support to business strategists and algorithm staff.
Going forward, we expect feature configuration to return to SQL. Although today's configuration is simple enough, it is a domain-specific language of our own making and carries a learning cost for new strategists and product staff. What we really want is SQL-like configuration, in the spirit of offline Hive queries, shielding the underlying computation and helping the business grow better.
Welcome to follow my official account: 咕咕鸡技术专栏 (Gugu Chicken Tech Column). Personal technical blog: https://jifuwei.github.io/
References:
[1] Risk management, Wikipedia: https://zh.wikipedia.org/wiki/%E9%A3%8E%E9%99%A9%E7%AE%A1%E7%90%86