Introduction: Zhou Yunfeng (Yunyan) of Alibaba Cloud and Huang Jiaming (Ming Xiao) of the Taobao Technology Department jointly interpret the problem of the 3rd Apache Flink Geek Challenge and AAIG CUP.
Authors: Zhou Yunfeng (Yunyan), Alibaba Cloud; Huang Jiaming (Ming Xiao), Taobao Technology Department
Competition link: https://tianchi.aliyun.com/competition/entrance/531925/introduction
Since the 3rd Apache Flink Geek Challenge and AAIG CUP opened on August 17, more than 4,000 teams have signed up. The knowledge points and tools involved in the competition problem, "Recognizing 'thigh-hugging' attacks in e-commerce recommendation", have already been covered in weekly live courses. This article walks through the problem in detail from the following angles to help contestants better understand its core content.
Detailed analysis of the question
Competition background
With the development of the Internet, online shopping has become the choice of more and more people. According to Alibaba's financial report, the total gross merchandise volume on Alibaba's platforms exceeded one trillion US dollars in fiscal year 2020, with 960 million annual active consumers worldwide.
To meet the individual needs of different users, an e-commerce platform recommends suitable products according to each user's preferences, presenting a personalized storefront to every visitor. Common recall paths in recommendation systems include U2I (User-to-Item) and I2I (Item-to-Item). U2I recommends products to a user based on the user's profile information, while I2I recommends related products based on the list of products the user has clicked.
The purpose of a recommendation system is to make recommendations based on the preferences of different users. A traditional offline recommendation system processes historical user behavior data into feature samples, trains a model offline, and then deploys it as an online service. However, user preferences are diverse, and the distribution of user behavior changes over time; an offline model cannot capture these dynamics. Real-time feature updates and model parameter updates are therefore required to better capture users' behavioral preferences. To improve both the timeliness and the accuracy of recommendations, the platform performs real-time U2I and I2I updates based on network-wide user behavior and makes recommendations according to each user's recent actions.
To gain more traffic exposure and show their products to more consumers, some merchants try to hack the platform's recommendation mechanism to inflate their exposure. One typical tactic is the "thigh-hugging" attack: a group of malicious users is hired to click on both the target products and popular products, establishing a relationship between them and boosting the I2I correlation between the target products and the popular ones. In this way, merchants exploit consumers' expectations of hot-selling items to induce purchases of products that do not live up to their descriptions. This not only harms consumers' interests and degrades their shopping experience, but also damages the reputation of the platform and other merchants, seriously undermining the platform's fairness. We therefore need a risk control system to filter out such malicious traffic and prevent it from polluting the recommendation model.
Since all user behavior is filtered by the risk control system before being fed into the recommendation system, the risk control system must itself run in real time if the recommendation system is to do so. Intercepting malicious behavior in real time protects the real-time recommendation system from attacks while preserving the timeliness of recommendations.
A real-time risk control system also has higher data security requirements. If its interception algorithm were leaked, attackers could better disguise their malicious traffic and make it harder to detect. Such a system therefore needs to be deployed in an encrypted, trusted environment.
In summary, to ensure the accuracy of the real-time recommendation system, the competition asks contestants to implement a real-time risk control system that also guarantees data security.
Data description
Given malicious clicks, normal clicks, and the associated product and user attribute information (a dataset for local debugging can be downloaded from the competition website), contestants implement a real-time malicious-click classification algorithm covering both model training and model prediction. In the evaluation environment, the system uses 1,000,000 records for model training and 100,000 records for model prediction. In addition, a dataset of 500,000 records is provided for debugging the algorithm locally.
The competition provides data in the following format for training and prediction. All data is stored in CSV files, i.e., the columns below are separated by commas. Each record represents one user's click on one product, and its features are derived from the associated user and product.
| uuid | visit_time | user_id | item_id | features | label |
| --- | --- | --- | --- | --- | --- |
- uuid: the id of each record, unique within the dataset.
- visit_time: the time at which the behavior occurred. In the real-time prediction stream, this value is roughly monotonically increasing.
- user_id: the id of the user associated with this record.
- item_id: the id of the product associated with this record.
- features: the features of the record, consisting of N space-separated floating-point numbers. The 1st through Mth numbers are product features, and the (M+1)th through Nth numbers are user features.
- label: 0 or 1, indicating whether the record is normal behavior.
The training data includes all of the above columns; the prediction data includes every column except label.
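For concreteness, below is a minimal Python sketch of parsing one record of the training CSV into its fields. The file name and the item-feature count M are hypothetical placeholders, and we assume the file has no header row:

```python
# Minimal sketch: parse the training CSV described above.
# "train.csv" and M are hypothetical placeholders; no header row assumed.
import csv

M = 50  # assumed number of product features within the N feature values

with open("train.csv") as f:
    for row in csv.reader(f):
        uuid, visit_time, user_id, item_id, features, label = row
        values = [float(x) for x in features.split()]  # N space-separated floats
        item_features = values[:M]   # 1st through Mth values: product features
        user_features = values[M:]   # (M+1)th through Nth values: user features
        label = int(label)           # 0 or 1, per the label definition above
```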
The input and output format of the model file
Contestants who only want to optimize at the algorithm level just need to ensure that the saved model file's input and output follow the format below. The code in the sample image we provide preprocesses the input data, parses the inference results of the TensorFlow model, and finally generates a CSV file that meets the requirements of the evaluation program.
The prediction model takes input in the following tensor format, where N is the number of features:
Tensor("input:0", shape=(?, N), dtype=float32)
The prediction model produces output in the following tensor format. The output value is 0 or 1, indicating whether the input behavior is malicious:
Tensor("output:0", shape=(?, 1), dtype=float32)
Demo analysis
This competition focuses on the combination of algorithms and engineering. A solution will typically go through the following stages: model training, model prediction, optimal threshold selection, and online prediction with category determination.
- Model training: the training data is already structured, so no feature extraction stage is required and the data can be fed to the model directly. The demo builds a feed-forward network and trains it to fit the sample labels directly (a minimal sketch follows this list);
- Model prediction: to better separate the training and prediction phases, the prediction phase uses Cluster Serving, so prediction only needs to load the trained model and run inference;
- Threshold selection: the online system outputs a category directly rather than a probability, which matches the actual business scenario. When categories are output directly, the choice of threshold has a particularly large impact on the model's online effect, so the optimal threshold should be found on validation data and used for online judgment. The demo currently uses a threshold of 0.5;
- Online prediction and category determination: in the final output, the predicted probability and the chosen threshold determine the predicted category of each sample (cheating or not).
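To make the first stage concrete, here is a minimal sketch of a demo-style feed-forward classifier in Keras. The layer sizes, N, and the training settings are our own placeholder choices, and the random arrays stand in for features and labels parsed from the training CSV:

```python
# Minimal sketch: a feed-forward network fitted directly to the 0/1 labels.
# N, layer sizes, and training settings are assumed placeholders.
import numpy as np
import tensorflow as tf

N = 100  # assumed total number of features

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(N,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of class 1
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Stand-in data; in practice X and y come from the training CSV.
X = np.random.rand(1000, N).astype("float32")
y = np.random.randint(0, 2, size=(1000,))
model.fit(X, y, batch_size=256, epochs=3, validation_split=0.1)
```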
Demo optimization
- Real-time features: currently only static user/product features are provided, but the data also contains the user-product click relationship. Contestants can consider building real-time features from it, such as the cumulative clicks of a user/product so far that day. Note that if real-time features are used in the prediction phase, the same features must be built in the training phase; otherwise the mismatch between training and prediction features will cause the model to fail or degrade. In addition, the training set reveals which products/users have cheated, and this information can also be used to construct model features;
- Model training: there are many mature DNN models in the industry. The demo currently uses a three-layer structure; contestants can try more complex models to achieve a better fit. Also, rather than relying on a single "super model", consider ensembling multiple models/strategies for prediction;
- Optimal threshold selection: the demo currently uses a threshold of 0.5, but the optimal threshold should be chosen according to the model's predictions on a validation set. In practice, a script can search for the optimal threshold on the validation set (see the sketch after this list);
- Online prediction: the demo's online model predicts on all streaming data, but once one sample's prediction is delayed, subsequent predictions may be delayed in cascade, causing severe overall online latency. Besides optimizing the algorithm and engineering to minimize latency, contestants can also monitor latency to mitigate the long-tail effect.
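The threshold search mentioned above can be a few lines of code. Below is a minimal sketch, assuming probs is a NumPy array of the model's validation-set probabilities and labels the corresponding ground truth; the candidate grid is illustrative:

```python
# Minimal sketch: pick the threshold that maximizes F1 on a validation set.
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(probs, labels, candidates=np.linspace(0.01, 0.99, 99)):
    """Return the (threshold, F1) pair with the highest validation F1."""
    scores = [f1_score(labels, probs >= t) for t in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Usage (hypothetical names): probs = model.predict(X_val).ravel()
# threshold, f1 = best_threshold(probs, y_val)
```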
Scoring metric
A submission's score is the product of two factors, representing the algorithmic and the engineering performance of the submission respectively. As a formula:
score = F1 * valid_latency
On the algorithm side, the submission is scored by the F1 score of its inference results, i.e., the harmonic mean of the precision and recall of the inference results.
On the engineering side, since the competition simulates a real-time risk control scenario, it constrains latency during real-time inference. A contestant's program must serve inference for the real-time data stream arriving in Kafka, and as long as the stream's throughput stays below a given threshold, the latency of each individual record must not exceed 500 ms.
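For intuition, here is a minimal sketch of such a service loop using kafka-python. The topic names, broker address, message layout, and predict_fn are all hypothetical stand-ins, not the competition's actual configuration:

```python
# Minimal sketch: read records from Kafka, infer, write results back.
# Topic names, broker address, and predict_fn are hypothetical.
from kafka import KafkaConsumer, KafkaProducer

predict_fn = lambda record: 0  # stub; replace with real model inference

consumer = KafkaConsumer("input_topic", bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for msg in consumer:                    # one message per behavior record
    record = msg.value.decode("utf-8")  # e.g. "uuid,visit_time,...,features"
    uuid = record.split(",")[0]
    pred = predict_fn(record)
    # Respond promptly: the gap between the input record's Kafka timestamp
    # and this output's timestamp is the per-record latency being scored.
    producer.send("output_topic", f"{uuid},{pred}".encode("utf-8"))
```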
The deployed inference service must read the records to be inferred from Kafka and write the inference results back to Kafka. A record's latency is defined as the difference between the Kafka timestamps of the input record and its inference result. The valid_latency in the formula above is the proportion of records whose latency meets the requirement. A record delayed beyond 500 ms not only lowers valid_latency, and hence the score, but is also excluded from the F1 calculation.
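Putting the two factors together, a sketch of the scoring rule might look as follows; the function and variable names are our own, and sklearn's f1_score stands in for the official F1 computation:

```python
# Minimal sketch of the scoring rule: records over 500 ms are dropped from
# the F1 computation and reduce valid_latency.
from sklearn.metrics import f1_score

def compete_score(latencies_ms, preds, labels, limit_ms=500):
    valid = [i for i, d in enumerate(latencies_ms) if d <= limit_ms]
    valid_latency = len(valid) / len(latencies_ms)
    f1 = f1_score([labels[i] for i in valid], [preds[i] for i in valid])
    return f1 * valid_latency
```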
Technology Introduction
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink runs in all common cluster environments and performs computations at in-memory speed and at any scale.
On top of Flink, Flink AI Flow serves as a big data + AI workflow abstraction and supporting service that accommodates stream computing, providing an end-to-end solution for machine learning.
Analytics Zoo and BigDL are Intel's open-source unified big data analytics and AI platforms. They support distributed TensorFlow and PyTorch training and inference, and use the OpenVINO toolkit and the DL Boost instruction set to accelerate deep learning workloads. Cluster Serving is the distributed inference solution of Analytics Zoo/BigDL and can be deployed on Apache Flink clusters for distributed computation.
Occlum is Ant Group's open-source LibOS based on Intel SGX. It lets Linux applications run inside an SGX enclave with little or no code modification, keeping data encrypted and strongly isolated, thereby protecting data security and user privacy.
References
Base image instructions and related technology introduction: https://code.aliyun.com/flink-tianchi/antispam-2021/tree/master
Flink 1.11 Chinese documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/
Flink AI Flow Wiki: https://github.com/alibaba/flink-ai-extended/wiki
Analytics Zoo Cluster Serving Programming Guide: https://github.com/intel-analytics/analytics-zoo/blob/master/docs/docs/ClusterServingGuide/ProgrammingGuide.md
Occlum GitHub repo: https://github.com/occlum/occlum
Learning materials
Learning Forum: https://tianchi.aliyun.com/competition/entrance/531925/forum
Learning video: https://flink-learning.org.cn/activity/detail/99fac57d602922669b0ad11eecd5df01
Contest Q&A Exchange DingTalk Group: 35732860