This article is compiled from a talk by JD.com search and recommendation algorithm engineers Zhang Ying and Liu Lu at Flink Forward Asia 2021. The main contents include:
- Background
- The current state of machine learning for JD.com search and recommendation
- Online learning based on Alink
- TensorFlow on Flink applications
- Planning
1. Background
Search and recommendation are the two core entry points of Internet applications, and most traffic comes from these two scenarios. By site, JD Retail is divided into the main site, Jingxi, overseas sites, and a number of vertical sites.
For the search business, each site offers keyword search, drop-down suggestions, and sub-page searches such as shops, coupons, and orders; for the recommendation business, there are hundreds of recommendation slots of various sizes, divided by application domain.
Each of the above business scenarios contains more than ten strategy stages that require the support of machine learning models, with massive product data and user behavior data serving as features and samples for machine learning.
In addition to the typical intent recognition, recall, ranking, and relevance models in the search and recommendation domain, JD.com search and recommendation also introduces more and more models for decision-making in intelligent operations, intelligent risk control, and effect analysis, in order to better maintain the ecosystem of users, merchants, and the platform.
2. The current state of machine learning for JD.com search and recommendation
We divide these machine learning scenarios into three categories according to differences in service scenario and timeliness requirements:
- The first category covers models that are requested instantly when a user visits a search or recommendation page, such as product recall, ranking, and intent recognition. These models have extremely strict response-time requirements, and their prediction services sit in the online system.
- The second category has low requirements on service response time but certain timeliness requirements for model training and prediction, such as real-time user profiles and real-time anti-cheating models. We call this the near-line scenario.
- The third category covers purely offline model scenarios, such as long-term product or user profiles and knowledge graphs for various material labels. Training and prediction in these scenarios have relatively low timeliness requirements and are carried out entirely in an offline environment.
Let's take a look at the current main model service architecture:
Because the business systems themselves differ, JD.com's search and recommendation are built from different core pipeline modules, forming a search system and a recommendation system.
A user search request passes through each level of the pipeline in turn. It first goes through the QP service for the keyword, which uses the intent recognition model; the recall service then requests each recall channel in parallel, calling the recall model, relevance model, and coarse ranking model in turn; finally, after the ranking service aggregates the result set, it calls the fine ranking model, re-ranking model, and so on.
The process for a user visit in the recommendation business differs in some details, but the overall flow is similar.
At the lower layers, the two businesses share some offline and near-line models for basic purposes, such as user profiles, material labels, and various kinds of metric analysis.
The model service architecture they access consists of two parts, training and prediction, bridged by the model repository and the parameter server. On the feature side, online scenarios rely on a feature server, while offline scenarios are fed by data pipelines.
In terms of model form, we can divide the existing models into two types:
- The model on the left is relatively complex as a single model. It uses data parallelism to train the same set of parameters and relies on a self-developed parameter server to train ultra-large sparse parameters. Its training and prediction architectures are separate.
- The model on the right is relatively simple as a single model, but the data volume and business granularity vary. The data is split by business granularity and modeled separately, with the data flow driven by the streaming computing framework, so the training and prediction architectures are integrated.
Because of the architectural differences between online serving and offline training, most model systems take the form of separate online and offline systems. The training process is a layer of encapsulation on top of TensorFlow and PyTorch.
Sample production and preprocessing are based on a sample pipeline framework built with Flink, and many features used by online services come from the FeatureLog feature logs of those services. Model training and sample production make up the offline part, relying on common basic components such as Hive, Kafka, and HDFS. Prediction runs on a self-developed prediction engine, performing inference on CPU or GPU, with large-scale sparse vectors served by an independent parameter server. The feature service supplies input data for prediction; because feature sources differ between prediction and training, it consists of a unified feature-access interface and a corresponding feature-extraction library, both self-developed.
Feature extraction and model prediction make up the online part, separate from the offline part.
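The parameter server referred to throughout this architecture is self-developed and its API is not public, so the following is only a minimal, hypothetical sketch of what such a client interface might look like. The `PsClient` name, method signatures, and the scope/version notions (used later for hot switching) are assumptions for illustration; the later sketches in this article reuse them.

```java
import java.util.Map;

/**
 * Hypothetical client interface for a self-developed parameter server.
 * Names and signatures are illustrative only; the real JD.com PS API is not public.
 */
public interface PsClient extends AutoCloseable {

    /** Pull the sparse parameters for the given keys from one scope/version of the PS. */
    Map<Long, float[]> pull(String scope, long version, Iterable<Long> keys);

    /** Push updated parameters back to the PS (e.g. after a mini-batch of training). */
    void push(String scope, long version, Map<Long, float[]> updates);

    /** Snapshot the server-side state, e.g. when a Flink checkpoint completes. */
    void save(String snapshotPath);

    /** Restore the server-side state from a snapshot, e.g. on failover. */
    void load(String snapshotPath);
}
```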
In terms of model iteration, models with high timeliness requirements are generally first trained offline on historically accumulated batch data to obtain a base model. After the base model is deployed online, training continues on top of it with real-time streaming samples, and the iterated model goes live.
Since prediction and training sit on two different architectures, continuous iteration involves interaction, data transfer, and consistency requirements between the two.
Training and prediction also need to implement fault tolerance and failure recovery on their own, in combination with the state of the data. Combining distributed data processing and the model's distributed mode into a whole that is easy to deploy and maintain is not a trivial capability to build, and it is hard to unify how pre-trained parameters are loaded and switched across different models.
3. Online learning based on Alink
First, let's analyze the pain points of the online learning system:
- Offline/streaming training architectures are hard to unify: typical online learning first trains a base model from a large amount of offline data, then keeps training in streaming mode on top of that base model. Offline training and online training are two different systems with two different code bases; for example, offline training may be an ordinary batch job, while online training may be a long-running training job started on a single machine. Even if online training runs on Spark/Flink, the code itself is usually different.
- Data model: as mentioned above, since the training architecture is hard to unify, users on a business engine have to maintain two sets of environments and two code bases. Much common logic cannot be reused, and data quality and consistency are hard to guarantee. The underlying data models and parsing logic may also be inconsistent, which forces us to write a lot of glue logic, and even to do extensive data comparisons (year-over-year, period-over-period) and secondary processing for the sake of consistency, which is inefficient and error-prone.
- Prediction service: traditional model prediction requires deploying a separate model service, which jobs then call over HTTP/RPC to obtain results, and this requires extra manpower to maintain the servers. Moreover, in real-time/offline prediction scenarios the RPC/HTTP server does not need to exist all the time; it only needs to start when the job starts and stop when the job ends. How an offline-trained model should serve online traffic is another headache.
- Model upgrade: upgrading a model has a certain impact on it; here we mainly discuss the loss of model parameters caused by a model upgrade.
This is a simple, classic online learning flow chart. Let me explain how this flow is implemented in the Alink pipeline:
- Offline training job: this Alink job loads training data from HDFS, runs feature engineering, trains the model offline, and after training writes the model info and model parameters to the parameter server. The job runs at day level, and each run trains on, for example, the last 28 days of data.
- Real-time training job: this Alink job reads sample data from Kafka and accumulates it into mini-batches by time (hourly, minutely) or by record count. It first pulls the model parameters and hyperparameters from the parameter server; after loading the model, it may run a predict step if prediction is needed, or train directly otherwise, and then pushes the updated model data back to the parameter server.
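As a rough illustration of the real-time training job described above, here is a minimal Flink DataStream sketch that reads samples from Kafka, groups them into small batches with a processing-time window, and lets a trainer pull from and push to the parameter server per batch. This is not the actual Alink implementation; the topic names, `MiniBatchTrainer`, `PsClient`, and `PsClientFactory` are assumptions.

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;

public class StreamingMiniBatchTrainJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // assumption
        props.setProperty("group.id", "online-train");        // assumption

        env.addSource(new FlinkKafkaConsumer<>("sample-topic", new SimpleStringSchema(), props))
           // Accumulate a small batch of samples, e.g. every 5 minutes.
           .windowAll(TumblingProcessingTimeWindows.of(Time.minutes(5)))
           .process(new ProcessAllWindowFunction<String, String, TimeWindow>() {
               private transient PsClient ps;               // hypothetical (see earlier sketch)
               private transient MiniBatchTrainer trainer;  // hypothetical training logic

               @Override
               public void open(Configuration conf) {
                   ps = PsClientFactory.create();           // assumption: how the client is obtained
                   trainer = new MiniBatchTrainer();
               }

               @Override
               public void process(Context ctx, Iterable<String> samples, Collector<String> out) {
                   // Pull the current parameters/hyperparameters, optionally predict,
                   // train on the accumulated mini-batch, then push the updates back.
                   trainer.trainOn(samples, ps);
                   out.collect("trained one mini-batch ending at " + ctx.window().getEnd());
               }
           })
           .print();

        env.execute("Alink-style streaming mini-batch training (sketch)");
    }
}
```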
Next, let's look at how the real-time trained model serves the online prediction scenario:
- First, real-time training does not change the model structure; it only updates the model parameters;
- Second, the prediction PS and the training PS must be separate, so the problem becomes how to synchronize data between them.
There are roughly two implementations in the industry:
- Scheme A: for some small models, the Alink job directly pushes the trained parameters to both the offline PS and the online PS.
- Scheme B: introduce a role similar to a PS controller, which is responsible for computing the parameters and pushing them to the offline PS and the online PS at the same time.
Alternatively, we can let Alink's training job write to the training PS, and have the PS server write each update to a Kafka-like queue at the same time, while a prediction PS server consumes the parameter updates from the queue. In this way, data is synchronized between the training PS and the prediction PS.
There are many options; just choose the one that suits your scenario.
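For the queue-based synchronization just described (training PS writes each update to a Kafka-like queue, serving PS consumes it), here is a minimal consumer-side sketch. It is not the production implementation; the topic name, the `ParameterUpdate` message format, and the `PsClient`/`PsClientFactory` types are assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ServingPsSyncer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");  // assumption
        props.setProperty("group.id", "serving-ps-sync");       // assumption
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
             PsClient servingPs = PsClientFactory.create()) {    // hypothetical client
            consumer.subscribe(Collections.singletonList("ps-update-topic")); // assumption

            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, byte[]> record : records) {
                    // Decode one parameter update written by the training PS and
                    // apply it to the serving PS, keeping the two in sync.
                    ParameterUpdate update = ParameterUpdate.decode(record.value()); // hypothetical
                    servingPs.push(update.scope(), update.version(), update.values());
                }
            }
        } catch (Exception e) {
            throw new RuntimeException("serving PS sync failed", e);
        }
    }
}
```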
Let's look at why parameters are lost when the model version is upgraded:
- Suppose that in the early morning of the 1st we train on the previous 28 days of data and write the parameters to the parameter server after training. Streaming training then runs continuously between the 1st and the 2nd, incrementally writing to the parameter server until the early morning of the 2nd.
- In the early morning of the 2nd, training on the previous 28 days of data starts again; assume it takes 1 hour. If the result is written directly to the PS, that hour of incremental updates is overwritten. For some time-insensitive models this is tolerable, and at least no error is thrown; but for the Prophet time-series model used in this business it is a problem, because the model parameters are missing 1 hour of data and accuracy may degrade.
In fact, every model iteration takes a certain amount of time to finish offline training. If the result is written directly over the PS, the parameters produced during that window are lost. We must therefore ensure that the parameters in the PS are continuous in time.
This figure mainly introduces the PS cold start and hot switch process:
- After a cold start of model training, the model is temporarily unavailable because parameters are missing; it becomes available after the first warm start;
- The parameter server supports multiple scopes and multiple versions. During a hot switch of the model, only the PS new scope is updated; during a warm start, all scopes are updated;
- The model pulls only the old scope when predicting, and pulls the new scope when performing a warm start.
The following describes the entire pipeline in detail:
- During a cold start, the offline job needs a certain amount of time to train the model, so the parameters in the PS lack the data for that period. We can therefore only perform a warm start first to fill in the parameters, writing both the PS old scope and the PS new scope;
- After that, prediction and warm start proceed normally. Prediction pulls only the PS old scope, because the data in the PS new scope will be overwritten during the hot switch, causing parameter loss, and a PS with missing parameters cannot be used for prediction;
- In the early morning of the next day, a hot switch is performed, updating only the PS new scope;
- After that, prediction continues to pull the PS old scope, while the warm start process pulls the PS new scope.
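To make the old-scope/new-scope convention above concrete, here is a small sketch of how prediction, warm start, and the hot switch might use different scopes of the hypothetical `PsClient` from earlier; the scope names and method shapes are assumptions, not the real PS API.

```java
import java.util.Map;

/**
 * Sketch of the scope convention described above, using the hypothetical PsClient.
 * "old" is read by prediction; "new" is written by the daily hot switch and
 * read by the periodic warm start.
 */
public class ScopedModelAccess {

    private static final String OLD_SCOPE = "old";   // assumption: scope naming
    private static final String NEW_SCOPE = "new";

    private final PsClient ps;
    private final long version;

    public ScopedModelAccess(PsClient ps, long version) {
        this.ps = ps;
        this.version = version;
    }

    /** Prediction always reads the old scope, which the hot switch never overwrites. */
    public Map<Long, float[]> paramsForPredict(Iterable<Long> keys) {
        return ps.pull(OLD_SCOPE, version, keys);
    }

    /** Warm start pulls the new scope as the base for continued training. */
    public Map<Long, float[]> paramsForWarmStart(Iterable<Long> keys) {
        return ps.pull(NEW_SCOPE, version, keys);
    }

    /** After a warm-start training round, write the result to both scopes. */
    public void pushAfterWarmStart(Map<Long, float[]> trainedParams) {
        ps.push(OLD_SCOPE, version, trainedParams);
        ps.push(NEW_SCOPE, version, trainedParams);
    }

    /** The daily hot switch (after offline retraining) only overwrites the new scope. */
    public void pushAfterHotSwitch(Map<Long, float[]> offlineParams) {
        ps.push(NEW_SCOPE, version, offlineParams);
    }
}
```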
Next, let me introduce the pain points of streaming training:
- Failover is not supported for online training. As everyone knows, online training will inevitably be interrupted for various reasons (such as network jitter), so an appropriate failover strategy is very important. We introduce Flink's failover strategy into our self-developed model training operator to support model failover.
- An appropriate pretrain strategy: the embedding layer of a model does not need to be pulled from the PS on every step. The industry generally develops some form of local PS to store these sparse vectors locally. We could also bring such a local PS into Flink, but in some scenarios Flink lets us simply replace the local PS with the state backend. By fusing Flink state with the parameter server, part of the parameter server's hot data is loaded into state during init or failover to pretrain the model (see the sketch after this list).
- Distributed execution is hard to achieve. Architectures that support distribution themselves are fine, but some algorithms do not (for example Facebook's open-source Prophet); in that case, running a large amount of data without distribution can be extremely time-consuming. Alink naturally supports distribution: as an algorithm library built on top of Flink, it inherits all of Flink's distributed capabilities, supports all of the Flink master's scheduling strategies, and can even support various fine-grained data distribution strategies.
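As a sketch of using Flink state as a local parameter cache (the "local PS" replacement mentioned above), the keyed function below looks up an embedding in MapState first and falls back to the parameter server on a miss; a pretrain step could bulk-load hot parameters into the same state at startup or on failover. Only the Flink APIs are real here; `Sample`, `Prediction`, `PsClient`, and `PsClientFactory` are hypothetical.

```java
import java.util.Collections;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

/**
 * Sketch: keyed Flink state used as a local cache in front of the parameter server.
 * Applied after keyBy(featureId), so each subtask only caches the keys routed to it.
 */
public class EmbeddingLookupFunction extends RichFlatMapFunction<Sample, Prediction> {

    private transient MapState<Long, float[]> embeddingCache;
    private transient PsClient ps;

    @Override
    public void open(Configuration parameters) throws Exception {
        embeddingCache = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("embedding-cache", Long.class, float[].class));
        ps = PsClientFactory.create();  // hypothetical client factory
        // A pretrain step could bulk-load hot parameters into this state here,
        // or in initializeState() when restoring after a failover.
    }

    @Override
    public void flatMap(Sample sample, Collector<Prediction> out) throws Exception {
        long featureId = sample.featureId();              // assumption: sample accessor
        float[] embedding = embeddingCache.get(featureId);
        if (embedding == null) {
            // Cache miss: fall back to the parameter server, then remember the result
            // in state so later records with the same key stay local.
            // Scope/version handling is simplified for the sketch.
            embedding = ps.pull("old", 0L, Collections.singletonList(featureId))
                          .get(featureId);
            embeddingCache.put(featureId, embedding);
        }
        out.collect(sample.predictWith(embedding));       // assumption: scoring step
    }

    @Override
    public void close() throws Exception {
        if (ps != null) {
            ps.close();
        }
    }
}
```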
Failover strategy for streaming training:
During online distributed training, a machine often becomes abnormal for some reason (such as the network). If you want to recover, there are generally two situations:
Data loss is allowed
General training jobs can tolerate a small amount of data loss, so we hope to sacrifice some data in exchange for continuous training of the overall job. Introducing a local recovery strategy greatly improves job continuity and avoids a full restart of the job caused by a single point of failure due to some external reason.
Data loss is not allowed
Here we only discuss the at-least-once case (exactly-once requires the PS to support transactions). If the business has strict data requirements, we can adopt the global failover strategy; a failed single-point redeployment also falls back to global failover. In this business, we adopt the local recovery strategy to prioritize continuous training of the job.
The following describes the restart strategies for the training job in detail:
- Global recovery: this is the failover concept commonly used in Flink, so I won't go into details.
- Single task recovery: in this case a TaskManager has a heartbeat timeout due to a network abnormality. To guarantee data consistency, the Flink job would normally fail over and recover from the last checkpoint; but if a small amount of data loss is acceptable and continuous output must be guaranteed, local recovery can be enabled, so the job only restarts that TaskManager, which preserves the continuity of training.
- Single-point redeploy exception: if the job fails for some reason and an exception also occurs during single-point recovery, so that single-point recovery fails, a single-point redeploy exception is raised. This exception cannot be resolved locally, and the only way out is to fail over the whole job. Depending on the needs of the job, you can configure whether to resume from the checkpoint or to continue training without resuming.
- Here I will focus on the scenario of recovering from a checkpoint when the job fails over: when the job fails, the save method is executed first to snapshot the current PS state, and the data in the Flink state backend is saved; when the job recovers, the load method is executed to restore the PS. If you think about it carefully, you will notice that this causes some parameters to be trained repeatedly (the checkpoint time and the save time are not aligned), which is something to keep in mind.
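The save/load behavior described above can be sketched with Flink's CheckpointedFunction hooks: on each checkpoint the operator asks the parameter server to snapshot itself, and on restore it reloads that snapshot. This is only an illustrative sketch, not the self-developed training operator; `Sample`, the snapshot path handling, and `PsClient`/`PsClientFactory` are assumptions.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

/**
 * Sketch: tie the parameter server's save/load to Flink's checkpoint lifecycle.
 */
public class PsCheckpointedTrainOperator extends RichMapFunction<Sample, Sample>
        implements CheckpointedFunction {

    private static final String SNAPSHOT_DIR = "hdfs:///ps-snapshots/"; // assumption

    private transient PsClient ps;

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // Called when Flink takes a checkpoint: snapshot the PS alongside Flink state.
        ps.save(SNAPSHOT_DIR + context.getCheckpointId());
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ps = PsClientFactory.create();  // hypothetical
        if (context.isRestored()) {
            // On failover, restore the PS from the snapshot taken with the checkpoint.
            // Mini-batches trained between that snapshot and the failure are replayed
            // from the source, so some parameters are trained twice.
            ps.load(SNAPSHOT_DIR + "latest");  // assumption: snapshot resolution simplified
        }
    }

    @Override
    public Sample map(Sample sample) throws Exception {
        // The actual training step (pull, update, push) is omitted; see earlier sketches.
        return sample;
    }
}
```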
The Alink-based streaming training pretrain strategy can be roughly divided into three modes: cold start, global recovery, and single task recovery:
- During a cold start, the model parameters and hyperparameter information are first pulled from the PS, then state backends such as ListState, MapState, and ValueState are initialized, along with the PS scope and version information.
- Global recovery is Flink's default failover strategy. In this mode, the job first saves the model, i.e., serializes the model information in the PS to disk, and also saves the data in the Flink state backend. During initialization it no longer needs to pull hyperparameters and other information; instead it restores the hyperparameters from the state backend and reloads the model parameters to continue training.
- Single task recovery: this mode allows a small amount of data loss and is used only to guarantee continuous output of the job. In this mode the job only restarts the affected TaskManager, which preserves stable, continuous training to the greatest extent.
- In the currently popular 3D-parallel and 5D-parallel architectures, data parallelism is the most basic and most important part.
- Flink's basic data distribution strategies include rebalance, rescale, hash, broadcast, and more. Users can freely control data distribution by implementing a StreamPartitioner, and can use load balancing and similar schemes to solve the problem of parallel workers waiting on each other for model parameters caused by data skew (see the sketch after this list).
- In this mode, we have also opened up the path of calling Python methods in a distributed manner in Alink, which maximizes the efficiency of data parallelism.
- Data parallelism is agnostic to stream versus batch. We integrate Alink's Mapper component to unify batch and streaming for training and model-variable updates.
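As an illustration of the custom data distribution mentioned above, Flink's partitionCustom lets you plug in your own partitioner, for example to spread skewed keys more evenly so no single subtask becomes a straggler. Only the Flink APIs are real here; `Sample`, its `featureId()` accessor, and the hot-key set are assumptions, and this is a sketch rather than JD.com's actual load-balancing scheme.

```java
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;

/**
 * Sketch: custom data distribution with partitionCustom to mitigate skew,
 * so that parallel training workers do not wait on one overloaded subtask.
 */
public class SamplePartitioning {

    // Assumption: keys known (e.g. from monitoring) to be heavily skewed.
    private static final Set<Long> HOT_KEYS = Set.of(42L, 1024L);

    public static DataStream<Sample> repartition(DataStream<Sample> samples) {
        Partitioner<Long> skewAware = (key, numPartitions) -> {
            if (HOT_KEYS.contains(key)) {
                // Spread records of a hot key over all partitions instead of one.
                return ThreadLocalRandom.current().nextInt(numPartitions);
            }
            // Everything else is routed deterministically by its key.
            return (int) Math.floorMod(key, (long) numPartitions);
        };

        KeySelector<Sample, Long> byFeatureId = sample -> sample.featureId(); // assumption

        return samples.partitionCustom(skewAware, byFeatureId);
    }
}
```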
4. TensorFlow on Flink applications
Let me first introduce the differences between the TensorFlow on Flink prediction service and the traditional online prediction pipeline:
- Unlike online prediction, real-time/offline prediction does not require a service to exist all the time; loading the model into the TaskManager greatly saves maintenance manpower and resource costs.
- Because the architectures across the pipeline differ, separate systems and code structures have to be maintained for the data model, data processing, model training, and model inference.
There are currently several options for the TensorFlow on Flink prediction service, for example:
- Option A: deploy an RPC or HTTP server, and have Flink call it as a client over RPC or HTTP.
- Option B: load the TensorFlow model into the Flink TaskManager and call it directly (a sketch follows the list of Option A's disadvantages below).
Option A has the following disadvantages:
- The RPC or HTTP server requires extra manpower to maintain.
- Unlike online prediction, real-time/offline prediction does not need the RPC or HTTP server to exist all the time; it only needs to start when the job starts and stop when the job ends. Keeping it running permanently wastes resources, while converting it to start and stop with each job would undoubtedly cost more manpower and maintenance.
- There is still the problem of a non-unified architecture: the RPC or HTTP server and the real-time/offline data processing are usually not the same system, which again raises the architecture-inconsistency issue emphasized earlier, so I won't repeat it here.
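As a minimal sketch of Option B, the function below loads a TensorFlow SavedModel with the TensorFlow Java API (libtensorflow 1.x style) inside open() and runs inference in map(). The model directory, tensor names, and input shape are placeholders, and this hand-rolled version only illustrates the idea; it is not necessarily how flink-ai-extended wires the integration.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.tensorflow.SavedModelBundle;
import org.tensorflow.Tensor;

/**
 * Sketch: load a TensorFlow SavedModel into the Flink TaskManager and call it
 * directly, so no separate RPC/HTTP model server has to be maintained.
 */
public class TfInferenceFunction extends RichMapFunction<float[], Float> {

    private static final String MODEL_DIR = "/models/ctr/1";                  // assumption
    private static final String INPUT_TENSOR = "serving_default_input:0";     // assumption
    private static final String OUTPUT_TENSOR = "StatefulPartitionedCall:0";  // assumption

    private transient SavedModelBundle model;

    @Override
    public void open(Configuration parameters) {
        // Load the model once per task slot, when the job starts.
        model = SavedModelBundle.load(MODEL_DIR, "serve");
    }

    @Override
    public Float map(float[] features) {
        // Wrap the features as a [1, n] batch, run the graph, and read the score.
        try (Tensor<?> input = Tensor.create(new float[][]{features});
             Tensor<?> output = model.session().runner()
                     .feed(INPUT_TENSOR, input)
                     .fetch(OUTPUT_TENSOR)
                     .run()
                     .get(0)) {
            float[][] score = new float[1][1];
            output.copyTo(score);
            return score[0][0];
        }
    }

    @Override
    public void close() {
        if (model != null) {
            model.close();  // released when the job ends
        }
    }
}
```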
5. Planning
- Use Flink SQL to implement unified batch-stream model training, striving to make model training more convenient.
- TensorFlow inference on Flink should support large models by implementing dynamic embedding storage based on a PS: business scenarios such as search and recommendation contain a large number of ID-type features whose embeddings swallow most of the TaskManager's memory, and native TensorFlow variables have inconveniences such as requiring dimensions to be specified in advance and not supporting dynamic expansion. We therefore plan to replace the embedded variables with our self-developed PS, which supports distributed serving at the scale of hundreds of billions of parameters.
- Dynamically load embeddings from the PS into the TaskManager's state to reduce access pressure on the PS: Flink usually uses keyBy to hash certain fixed keys to specific subtasks, so we can cache the embeddings corresponding to those keys in state, which reduces the access pressure on the PS.
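A sketch of this keyBy-based caching idea: routing by feature ID means each subtask only ever sees its own shard of keys, so a keyed-state cache (like the EmbeddingLookupFunction sketched earlier) stays small and PS traffic drops. The stream types and the key selector are assumptions.

```java
import org.apache.flink.streaming.api.datastream.DataStream;

public class EmbeddingCachePipeline {

    public static DataStream<Prediction> wire(DataStream<Sample> samples) {
        return samples
                // Hash each sample to a fixed subtask by its embedding key, so the
                // keyed-state cache on that subtask covers exactly these keys.
                .keyBy(sample -> sample.featureId())        // assumption: accessor
                // Look up the embedding in MapState first, fall back to the PS on a miss
                // (see the EmbeddingLookupFunction sketch earlier in this article).
                .flatMap(new EmbeddingLookupFunction());
    }
}
```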
6. Thanks
- First of all, I would like to thank all colleagues on the data timeliness Flink optimization team of the JD.com Data and Intelligence Department for their help and support.
- Thanks to all colleagues in the Alink community for their help and support.
- Thanks to all colleagues on the Flink ecosystem technology team of the Alibaba Cloud Computing Platform Division for their help and support.
Below are the GitHub links for Alink and flink-ai-extended; stars are welcome.
https://github.com/alibaba/Alink.git
https://github.com/flink-extended/flink-ai-extended.git