Abstract: This article is compiled from a talk delivered by Guo Yubo, a senior big data platform development expert at Zhongan Insurance, at the Flink Forward Asia 2021 industry practice session. The main contents include:
- Overview
- Smart marketing application
- Real-time feature application
- Anti-fraud application
- Future plans
1. Overview
The figure above shows the overall architecture of our real-time computing. The bottom layer is the data source layer, which includes business data and message data from the application systems, user behavior tracking (event) data, and application log data. This data is ingested into the real-time data warehouse through Flink.
The real-time data warehouse is divided into three layers:
- The first layer is the ODS layer. Data written by Flink into the ODS layer lands in raw tables that correspond one-to-one with the data sources, together with view tables that perform simple cleaning of the raw data;
- The data is then written to the DWD layer through Flink. The DWD layer is organized by subject domain; we currently have a user data domain, a marketing data domain, a credit data domain, an insurance data domain, and so on;
- There is also a DIM layer, which contains dimension tables for users, products, channels, and so on. DIM-layer data is stored in HBase.
After cleaning in the DWD layer, the data flows into the DWS layer, which integrates and aggregates it, typically into a metric wide table and a multi-dimensional detail wide table. The data then enters the ADS layer, which is backed by several OLAP storage engines: we mainly use ClickHouse for large-scale real-time reports, HBase and Alibaba Cloud TableStore for user tags and feature engineering, and ES mainly for real-time monitoring scenarios.
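To make the layering concrete, below is a minimal sketch of what an ODS-to-DWD cleaning task could look like as a Flink SQL job driven from the Java Table API. The topics, fields, and connector options are illustrative assumptions, not the actual production definitions.

```java
// A minimal sketch (not the production jobs) of an ODS -> DWD cleaning task
// using Flink SQL from the Java Table API.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class OdsToDwdJob {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // ODS source: raw business events from Kafka (hypothetical topic/fields)
        tEnv.executeSql(
            "CREATE TABLE ods_user_event (" +
            "  user_id STRING, event_type STRING, amount DECIMAL(18,2)," +
            "  event_time TIMESTAMP(3)," +
            "  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka', 'topic' = 'ods_user_event'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'format' = 'json', 'scan.startup.mode' = 'group-offsets')");

        // DWD sink: cleaned events for the user domain (hypothetical)
        tEnv.executeSql(
            "CREATE TABLE dwd_user_event (" +
            "  user_id STRING, event_type STRING, amount DECIMAL(18,2)," +
            "  event_time TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'kafka', 'topic' = 'dwd_user_event'," +
            "  'properties.bootstrap.servers' = 'kafka:9092', 'format' = 'json')");

        // Simple cleaning: drop malformed rows and normalize the event type
        tEnv.executeSql(
            "INSERT INTO dwd_user_event " +
            "SELECT user_id, LOWER(event_type), amount, event_time " +
            "FROM ods_user_event WHERE user_id IS NOT NULL");
    }
}
```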
The picture above shows the architecture of our real-time computing platform, which can be divided into three parts. The first part is the task management backend, where tasks are edited and submitted. The task editor supports both Flink SQL and Flink JAR tasks, provides convenient Flink SQL editing and debugging, and supports multiple startup strategies, such as starting from a checkpoint, a specific offset, a point in time, or the earliest position; it also supports triggering checkpoints on a schedule or on demand. After a task is submitted, it is deployed to our self-built CDH cluster through the Flink client, and the task management service periodically pulls the real-time status of tasks from YARN.
For monitoring, Flink pushes metric data to PushGateway; Prometheus collects these metrics from PushGateway, and the data is visualized in Grafana. In addition to monitoring abnormal task status, we also send real-time alerts for situations such as resource usage and message backlog. The platform also ships many connectors, such as Alibaba Cloud ODPS, TableStore, and Hologres, comes with a rich set of built-in UDFs, and supports user-defined UDFs.
The picture above shows the task editor of our real-time computing platform. It supports Flink SQL and Flink JAR tasks; SQL tasks support both DML and DDL and can be submitted as a whole from the editor. Task management also keeps a version for every change. In addition, the editor supports more advanced task configuration, including checkpoint settings, Kafka source parallelism, and state management.
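As an illustration of the startup strategies mentioned above, the sketch below shows how they could map onto the standard options of Flink's Kafka SQL connector; the table, topic, and timestamp are hypothetical, and restoring from a checkpoint or savepoint is handled at submission time rather than in the DDL.

```java
// A minimal sketch of mapping startup strategies onto Kafka connector options;
// not the platform's actual implementation.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StartupStrategyExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Pick one scan.startup.mode depending on the chosen strategy:
        //   'earliest-offset'  -> start from the earliest position
        //   'group-offsets'    -> resume from committed consumer-group offsets
        //   'timestamp'        -> start from a point in time
        //                         (requires 'scan.startup.timestamp-millis')
        // Recovering from a checkpoint/savepoint is done at submission time,
        // e.g. `flink run -s <savepointPath> ...`, rather than in the DDL.
        tEnv.executeSql(
            "CREATE TABLE user_behavior (" +
            "  user_id STRING, action STRING, ts TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'user_behavior'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'properties.group.id' = 'demo-group'," +
            "  'format' = 'json'," +
            "  'scan.startup.mode' = 'timestamp'," +
            "  'scan.startup.timestamp-millis' = '1640966400000')");
    }
}
```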
2. Smart marketing application
Next, we will focus on the use of Flink in intelligent marketing application scenarios.
The bottom layer of the marketing platform's architecture is again the data source layer, including financial business data, insurance business data, user behavior data, third-party platform data, and operational result data. Offline data enters the offline data warehouse through ETL, and real-time data enters the real-time data warehouse through Flink.
Above the real-time and offline data warehouses is the tag service layer. The platform provides offline and real-time tag management, as well as governance of these tags, such as data permission control. It also monitors tag data, so that anomalies can be found in time, and keeps accurate statistics on how tags are used.
Above the tag layer is the tag application layer. We have a marketing A/B lab and a traffic A/B lab. The difference is that the marketing A/B lab is oriented toward customer groups: whether they are static, rule-based customer groups or real-time customer groups ingested through Flink, it runs process-based marketing and intelligent reach against them. The traffic A/B lab is a tag-based data service capability used for personalized recommendations on the app side. The platform also provides customer group portrait analysis, which quickly shows how similar customer groups performed and how a group performed in past campaigns, helping operations select and market to customer groups more effectively.
After the marketing A/B and traffic A/B experiments, an effect analysis service collects the results in real time, which helps the operations team make rapid strategy adjustments.
At present, there are more than 500 tags in total, about 2 million marketing tasks are executed per day, and traffic A/B receives more than 20 million calls per day, mainly to support personalized display of resource slots on the front end and per-user personalization scenarios.
The picture above is the data flow diagram of the smart marketing platform. On the left are the data sources, including business data from the business systems as well as tracking and event data. This data reaches the real-time data warehouse through Kafka. After processing in the real-time data warehouse, some of it becomes real-time tags stored in Alibaba Cloud TableStore, and some is processed into real-time customer groups that are sent back to Kafka, on which the marketing A/B lab then runs intelligent marketing.
Tag data produced by the offline data warehouse is synchronized to Hologres using DataX as the ETL tool. Hologres connects seamlessly with ODPS, and by using its foreign-table acceleration for ODPS it can synchronize data at millions of rows per second. Operators can build their own customer groups on the marketing platform, and the interactive analysis capability of Hologres supports generating complex customer groups in seconds.
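For the operator-facing customer group selection, a second-level query against Hologres could look like the minimal sketch below, which assumes access through Hologres's PostgreSQL-compatible endpoint over JDBC; the table name, columns, and tag conditions are hypothetical.

```java
// A minimal sketch of a customer-group selection query over JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class CustomerGroupQuery {
    public static List<String> selectGroup(String url, String user, String pwd) throws Exception {
        String sql =
            "SELECT user_id FROM dws_user_tag_wide " +   // hypothetical tag wide table
            "WHERE age BETWEEN ? AND ? " +
            "  AND city = ? " +
            "  AND last_active_date >= CURRENT_DATE - 30";
        List<String> userIds = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(url, user, pwd);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, 25);
            ps.setInt(2, 40);
            ps.setString(3, "Shanghai");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    userIds.add(rs.getString("user_id"));
                }
            }
        }
        return userIds;
    }
}
```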
The characteristics of the entire marketing platform can be summarized into three points:
- First, real-time profiles. By standardizing real-time events and data structures and leveraging Flink's real-time computing capability, new real-time tags can be onboarded automatically;
- Second, smarter marketing strategies. Users can configure componentized marketing processes directly on the marketing platform, with rich timing strategies and a variety of intelligent marketing channels. Flexible, multi-branch business flows are supported, and users are split into A/B experiments with a consistent hashing algorithm (see the sketch after this list);
- Third, real-time analysis. We also use Flink to collect marketing results in real time; with funnel analysis and business-metric effectiveness analysis, this better empowers the marketing business.
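The sketch below illustrates the user-splitting idea from the second point with a simplified, stable hash-based bucketing scheme; it is a stand-in for the consistent hashing the talk mentions, and the hash function, bucket count, and 50/50 split are illustrative choices.

```java
// A minimal sketch of stable hash-based A/B bucketing per experiment.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class AbBucketing {
    private static final int BUCKETS = 100; // 100 buckets => percentage-level splits

    /** Returns a stable bucket in [0, BUCKETS) for the user within an experiment. */
    public static int bucketOf(String userId, String experimentId) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest((experimentId + ":" + userId).getBytes(StandardCharsets.UTF_8));
        // Use the first 4 bytes as an int, then map onto the bucket ring
        int hash = ((digest[0] & 0xff) << 24) | ((digest[1] & 0xff) << 16)
                 | ((digest[2] & 0xff) << 8) | (digest[3] & 0xff);
        return Math.floorMod(hash, BUCKETS);
    }

    /** Example: first 50 buckets get strategy A, the rest strategy B. */
    public static String variantOf(String userId, String experimentId) throws Exception {
        return bucketOf(userId, experimentId) < 50 ? "A" : "B";
    }
}
```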
3. Real-time feature application
Feature engineering mainly serves financial risk control scenarios such as the decision engine, anti-fraud, and risk control model services. Its purpose is to transform raw data into features that better represent the underlying problem, so that predictions on unseen data become more accurate; in financial business scenarios this improves our ability to identify user risk.
Feature engineering is the most time-consuming and important step in the whole data mining workflow, and it provides the core data support for end-to-end risk control of the financial business. It is mainly divided into three parts:
- The first is feature mining, mainly done by the risk control strategy and model development teams. They analyze and process data against business indicators and then extract effective, compliant features;
- Once a feature has been mined, it is handed to the feature development team, which integrates different data sources depending on where the feature comes from: some features come from third parties, some are processed offline, and some are processed in real time; there are also features that are further derived by machine learning models;
- The developed features are exposed to online services through the feature center, and the stability of the entire feature pipeline must also be guaranteed.
There are more than 100 Flink real-time tasks currently used in feature engineering, generating more than 10,000 features, and more than 30 million feature calls per day.
Among the core requirements for financial risk control features, the most important is compliance: every feature must be built on a compliant basis. We also need to ensure the accuracy of feature processing, the freshness of feature values, fast response of feature calculation, and the high availability and stability of the whole platform.
Based on these requirements, we use Flink as the real-time computing engine and HBase and Alibaba Cloud TableStore as high-performance storage engines, and we deliver the overall service and platform through a microservice architecture.
The overall architecture of the feature platform can be divided into five parts. Upstream systems include front-office systems, decision systems, and protection systems. All requests from the business side go through the feature gateway, which orchestrates the call links according to the feature's data sources: some calls go to third-party data or the People's Bank of China credit data, and some to data from the data mart. Once the data is retrieved, it enters the feature data processing layer, which includes feature processing services for third-party data, real-time financial feature calculation, and anti-fraud feature calculation services such as relationship graphs and list-based features.
Some basic features produced by this layer can be served to the upstream business systems directly, while others need further processing by the feature combination service. The feature combination service and the risk control model service are implemented with a low-code editor, and features can be reprocessed on the machine learning platform.
The basic service layer is mainly for feature administration and real-time monitoring. Real-time features rely on the real-time computing platform, while offline features rely on the offline scheduling platform.
To sum up, the feature platform is a microservice-based feature service system: by ingesting third-party data, credit data, internal data, real-time data, and offline data for feature processing and serving, it forms a complete risk control data product for feature computation.
The figure above shows the flow of real-time financial feature data. The data mainly comes from the business databases of the front-office, middle-office, and other business systems and is delivered to Kafka as binlog; our data middleware (blcs) converts binlog into Kafka messages. User behavior data is sent to Kafka directly. The data then enters the real-time data warehouse through Flink, and after computation the multi-dimensional detail data is written to TableStore.
We originally used HBase, and for stability reasons we later upgraded to TableStore. Because the feature service has very high stability requirements, we still keep both storages, with HBase serving as the fallback.
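The degradation idea can be captured in a small sketch like the one below, where reads go to the primary store and fall back to HBase on failure; the DetailStore abstraction and its implementations are hypothetical, not the platform's actual classes.

```java
// A minimal sketch of storage degradation: primary store with an HBase fallback.
import java.util.List;

interface DetailStore {
    List<String> queryDetails(String userId, long fromTs, long toTs) throws Exception;
}

public class FallbackDetailStore implements DetailStore {
    private final DetailStore primary;   // e.g. a TableStore-backed implementation
    private final DetailStore secondary; // e.g. an HBase-backed implementation

    public FallbackDetailStore(DetailStore primary, DetailStore secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }

    @Override
    public List<String> queryDetails(String userId, long fromTs, long toTs) throws Exception {
        try {
            return primary.queryDetails(userId, fromTs, toTs);
        } catch (Exception e) {
            // Degrade to the secondary storage so feature serving stays available
            return secondary.queryDetails(userId, fromTs, toTs);
        }
    }
}
```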
Because financial features need to describe a user's full life cycle, we need not only real-time data but also full offline data. Offline data is loaded back to HDFS through DataX and then written to the online storage engine TableStore using Spark's offline computing power.
Risk control now requires more and more fine-grained feature processing. Even a simple feature such as payment amount may need windows over the last half hour, 3 hours, 6 hours, 1 day, 7 days, 15 days, 30 days, and so on. Computing all of these directly in real time would require a large number of windows, and computing over the full data would also reduce Flink's throughput. Therefore, our real-time tasks mainly clean and lightly integrate the data and write the detail records back to the storage engine; the configurable feature processing is then done by the application system's feature calculation engine.
Risk control feature scenarios are relatively fixed and are basically computed along dimensions such as ID card number, user ID, or mobile phone number, so we abstracted a set of user entity relationship mapping tables that map ID card, mobile phone number, device fingerprint, and other identifiers to the user ID. Business data is stored with userID as the dimension-table key, and detailed user data is queried through the two dimensions of entity relationship plus business data. Thanks to the high-performance query capability of TableStore, we can handle highly concurrent feature calculations. Some features need not only real-time data but also data obtained by calling business system interfaces and aggregating it, which makes it impossible to complete all feature calculations inside Flink. Therefore, Flink only processes and aggregates the detail data, and the feature calculation engine computes the final feature results.
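As an illustration of computing the many look-back windows in the feature calculation engine rather than in Flink, the sketch below derives several payment-amount features from detail records fetched from storage; the record shape and window set are assumptions.

```java
// A minimal sketch of multi-window feature derivation from stored detail records.
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiWindowFeatureCalculator {

    public static class PaymentDetail {
        public final Instant payTime;
        public final double amount;
        public PaymentDetail(Instant payTime, double amount) {
            this.payTime = payTime;
            this.amount = amount;
        }
    }

    /** Sums payment amounts over each configured look-back window. */
    public static Map<String, Double> paymentAmountFeatures(List<PaymentDetail> details,
                                                            Instant now) {
        Map<String, Duration> windows = new HashMap<>();
        windows.put("pay_amount_3h", Duration.ofHours(3));
        windows.put("pay_amount_1d", Duration.ofDays(1));
        windows.put("pay_amount_7d", Duration.ofDays(7));
        windows.put("pay_amount_30d", Duration.ofDays(30));

        Map<String, Double> features = new HashMap<>();
        for (Map.Entry<String, Duration> w : windows.entrySet()) {
            Instant from = now.minus(w.getValue());
            double sum = details.stream()
                    .filter(d -> !d.payTime.isBefore(from))
                    .mapToDouble(d -> d.amount)
                    .sum();
            features.put(w.getKey(), sum);
        }
        return features;
    }
}
```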
At present, our real-time feature calculation mainly combines DWD data from the real-time data warehouse with the feature calculation engine. DWD data is written back to Alibaba Cloud Tablestore, and feature processing and calculation are then driven by configuration. To save query cost, the computation granularity is the feature group: a feature group queries the data source only once, and a feature group has a one-to-many relationship with features.
Briefly, the feature calculation process is as follows: first, the relevant detail data is scanned according to the feature's query conditions; then, according to the specific feature configuration under the feature group, such as time granularity and dimension, the feature is computed with a custom statistical function. If multiple data sources need to be joined, the dependent feature factors are computed first and then used in the next calculation. When our custom functions cannot satisfy a calculation, the system also allows feature processing to be written as a Groovy script. In addition, some feature sources are business system interfaces, in which case only the first step of data acquisition switches from querying Tablestore to calling the interface; any other feature data source can be plugged in by implementing the standard data interface, and the feature calculation engine itself needs no changes.
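The "standard data interface" idea can be sketched as follows: the calculation engine depends only on a data-source abstraction, so the first step can switch between scanning Tablestore and calling a business interface without touching the engine. All class names here are hypothetical.

```java
// A minimal sketch of a pluggable data source behind a standard interface.
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Standard data interface: every feature data source implements this. */
interface FeatureDataSource {
    List<Map<String, Object>> fetch(Map<String, Object> queryConditions) throws Exception;
}

/** Data source backed by the detail tables in Tablestore (implementation omitted). */
class TablestoreDataSource implements FeatureDataSource {
    @Override
    public List<Map<String, Object>> fetch(Map<String, Object> queryConditions) {
        // Scan multi-dimensional detail data by the feature group's conditions
        throw new UnsupportedOperationException("sketch only");
    }
}

/** Data source backed by a business system interface (implementation omitted). */
class BusinessApiDataSource implements FeatureDataSource {
    @Override
    public List<Map<String, Object>> fetch(Map<String, Object> queryConditions) {
        // Call the business system's API and normalize the response
        throw new UnsupportedOperationException("sketch only");
    }
}

/** The engine computes all features of a group from one fetch of the source. */
class FeatureGroupEngine {
    private final FeatureDataSource dataSource;
    FeatureGroupEngine(FeatureDataSource dataSource) { this.dataSource = dataSource; }

    Map<String, Object> calculate(Map<String, Object> queryConditions) throws Exception {
        List<Map<String, Object>> details = dataSource.fetch(queryConditions);
        // ... apply the configured statistical functions / Groovy scripts here ...
        return Collections.<String, Object>singletonMap("detailCount", details.size());
    }
}
```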
4. Anti-fraud application
The figure above is the data flow diagram of the real-time anti-fraud feature application. It is similar to the flow of the real-time financial feature service, with some differences: besides business data, the data sources here focus more on user behavior data and device data, which are of course collected only with the user's permission. The data also flows through Kafka into Flink for processing. Anti-fraud mainly uses a graph database to store user relationship data. For complex feature calculations that need historical data, we use bitmaps as Flink state, clean them up with the timer service, and store feature calculation results in Redis.
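A minimal sketch of the bitmap-plus-timer pattern is shown below: a KeyedProcessFunction keeps a per-user RoaringBitmap of hashed device IDs in Flink state, registers a timer to clean the state up, and emits a count that a downstream sink could write to Redis. The event fields, the use of RoaringBitmap, and the one-day TTL are assumptions.

```java
// A minimal sketch of bitmap state with timer-based cleanup in Flink.
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import org.roaringbitmap.RoaringBitmap;

class DeviceEvent {
    public String userId;
    public String deviceId;
}

class FeatureResult {
    public String userId;
    public long distinctDevices;
    public FeatureResult(String userId, long distinctDevices) {
        this.userId = userId;
        this.distinctDevices = distinctDevices;
    }
}

public class DistinctDeviceCounter
        extends KeyedProcessFunction<String, DeviceEvent, FeatureResult> {

    private static final long TTL_MS = 24 * 60 * 60 * 1000L; // keep 1 day of history
    private transient ValueState<RoaringBitmap> devicesState;

    @Override
    public void open(Configuration parameters) {
        devicesState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("devices", RoaringBitmap.class));
    }

    @Override
    public void processElement(DeviceEvent event, Context ctx, Collector<FeatureResult> out)
            throws Exception {
        RoaringBitmap bitmap = devicesState.value();
        if (bitmap == null) {
            bitmap = new RoaringBitmap();
            // Schedule cleanup of this key's state after the TTL
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + TTL_MS);
        }
        bitmap.add(event.deviceId.hashCode()); // compact set membership via bitmap
        devicesState.update(bitmap);
        // A downstream sink (not shown) would write this result to Redis
        out.collect(new FeatureResult(ctx.getCurrentKey(), bitmap.getCardinality()));
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<FeatureResult> out) {
        devicesState.clear(); // expire the historical bitmap
    }
}
```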
GPS-related anti-fraud features use TableStore's search index and its LBS (geo) functions to perform location-recognition feature calculations. The anti-fraud relationship graph and relationship communities are provided to anti-fraud staff for case investigation through data visualization.
We classify anti-fraud features into 4 categories:
- The first category is location recognition, which uses the user's location information together with the GeoHash algorithm to compute location-aggregation features (see the sketch after this list). For example, location-aggregation features surfaced some suspicious users; anti-fraud investigation of their face recognition photos showed very similar backgrounds, and they had all applied for business at the same company. So we can combine location features with AI image recognition to locate similar fraudulent behavior more accurately;
- The second category is device association, implemented mainly through the relationship graph. By retrieving the users associated with the same device, bonus hunters and simple fraud can be located fairly quickly;
- The third category is graph relationship features. In scenarios such as user login, registration, usage, and credit granting, we capture the user's device fingerprint, mobile phone number, contacts, and other information in real time to build the adjacency relationships of the graph. Then, based on the user's adjacent edges and node degree, we judge whether the user is connected to blacklist or graylist users and identify the risk;
- The fourth category is statistical community features based on community detection algorithms. By looking at the size of a community and the behavior of users inside it, we distill statistical rule features.
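The location-aggregation feature from the first category could be sketched as below: each application's GPS coordinates are encoded into a GeoHash cell and distinct applicants are counted per cell, so unusually dense cells can be flagged. The ch.hsr.geohash library and the 7-character precision are assumptions about tooling, not necessarily what the platform uses.

```java
// A minimal sketch of a location-aggregation feature using GeoHash cells.
import ch.hsr.geohash.GeoHash;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LocationAggregation {

    public static class ApplyEvent {
        public final String userId;
        public final double lat;
        public final double lon;
        public ApplyEvent(String userId, double lat, double lon) {
            this.userId = userId; this.lat = lat; this.lon = lon;
        }
    }

    /** Counts distinct applicants per GeoHash cell (7 chars is roughly 150m precision). */
    public static Map<String, Integer> applicantsPerCell(List<ApplyEvent> events) {
        Map<String, Set<String>> usersByCell = new HashMap<>();
        for (ApplyEvent e : events) {
            String cell = GeoHash.withCharacterPrecision(e.lat, e.lon, 7).toBase32();
            usersByCell.computeIfAbsent(cell, k -> new HashSet<>()).add(e.userId);
        }
        Map<String, Integer> counts = new HashMap<>();
        usersByCell.forEach((cell, users) -> counts.put(cell, users.size()));
        return counts;
    }
}
```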
Most of the relationship graph features mentioned above are stored in the graph engine NebulaGraph. We tested the more common JanusGraph and OrientDB, but once the data volume exceeded a certain order of magnitude we ran into instability, so we tried NebulaGraph and found its stability comparatively high: it uses a shared-nothing distributed storage design and can support computation over very large graphs at the trillion scale. Its services are divided into three parts:
- the graph service, mainly responsible for real-time graph computation;
- the meta service, mainly responsible for data management, schema operations, user permissions, and so on;
- the storage service, mainly responsible for data storage.
Nebula also adopts an architecture that separates computing from storage: the compute layer and the storage layer can each be scaled independently. It also supports pushing computation down to storage, which reduces data movement. In both the meta layer and the storage layer, data consistency is achieved through the Raft protocol.
In addition, NebulaGraph provides rich client access: it supports Java, Go, Python, and other clients, and it also offers a Flink connector and a Spark connector, so it integrates easily with today's mainstream big data computing engines.
The implementation path of the relationship graph has four parts. The first is the graph's data sources: to build a valuable relationship graph, we need accurate and rich data for graph modeling, mainly user data such as mobile phone numbers, ID cards, device information, and contacts, which is synchronized into the graph. Besides real-time data, historical data is also cleaned through offline Spark jobs. The query language provided by NebulaGraph supports rich graph functions such as adjacent edges, maximum paths, and shortest paths; community detection we implemented with Spark GraphX. Finally, the graph data is exposed as services through APIs: we currently serve graph features directly to the decision engine and to anti-fraud services, and some data services may later even consider community-based recommendation algorithms for marketing.
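As an example of serving a graph feature through the API layer, the sketch below uses the nebula-java client to run an nGQL query that counts a user's neighbors within two hops; the addresses, credentials, space and edge names, and the exact nGQL syntax (which varies across NebulaGraph versions) are all assumptions.

```java
// A minimal sketch of a graph-feature lookup via the nebula-java client.
import com.vesoft.nebula.client.graph.NebulaPoolConfig;
import com.vesoft.nebula.client.graph.data.HostAddress;
import com.vesoft.nebula.client.graph.data.ResultSet;
import com.vesoft.nebula.client.graph.net.NebulaPool;
import com.vesoft.nebula.client.graph.net.Session;
import java.util.Collections;

public class GraphFeatureService {
    public static long twoHopNeighborCount(String userId) throws Exception {
        NebulaPool pool = new NebulaPool();
        pool.init(Collections.singletonList(new HostAddress("graphd-host", 9669)),
                  new NebulaPoolConfig());
        Session session = pool.getSession("root", "nebula", false);
        try {
            session.execute("USE antifraud_graph;");
            // Count of users reachable within 2 hops over device/phone edges,
            // used as a simple "neighborhood size" feature (illustrative nGQL).
            ResultSet rs = session.execute(
                "GO 1 TO 2 STEPS FROM \"" + userId + "\" OVER use_device, has_phone "
                + "YIELD dst(edge) AS neighbor | YIELD count(*) AS n;");
            return rs.colValues("n").get(0).asLong();
        } finally {
            session.release();
            pool.close();
        }
    }
}
```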
5. Future plans
Going forward, we will first strengthen the real-time computing platform itself, add lineage management for real-time data, and try Flink on Kubernetes for dynamic scaling of resources.
Secondly, we hope to build a graph platform based on Flink + NebulaGraph. Real-time and offline computing currently follow a Lambda architecture, so we also want to try Flink + Hologres to unify stream and batch processing and address this.
Finally, we will try Flink ML in the risk control anti-fraud scenario to enable online machine learning, improve model development efficiency, iterate models quickly, and support intelligent real-time risk control.