This article introduces the application of the machine learning workflow platform Ultron in multiple machine learning scenarios at station B (Bilibili).

Sharing guest: Zhang Yang, senior development engineer at station B

Introduction: The entire machine learning process, from data reporting, feature computation, and model training to online deployment and final effect evaluation, is very long. At station B, multiple teams used to build their own machine learning pipelines to meet their own requirements, so engineering efficiency and data quality were difficult to guarantee. Therefore, based on the AIFlow project from the Flink community, we built a complete, standardized machine learning workflow platform to accelerate the construction of machine learning pipelines and improve the timeliness and accuracy of data in multiple scenarios. This talk introduces the application of station B's machine learning workflow platform, Ultron, in multiple machine learning scenarios.

Contents:

1. Real-time machine learning

2. How Flink is used for machine learning at station B

3. Machine learning workflow platform construction

4. Future planning

GitHub address
https://github.com/apache/flink
Everyone is welcome to give Flink a like and a star~

1. Real-time machine learning

img

First, let's talk about making machine learning real-time, which mainly involves three parts:

  • The first is real-time samples. In traditional machine learning, all samples are t+1, which means today's model uses yesterday's training data: every early morning, the previous full day's data is used to train the model;
  • The second is real-time features. Features used to be basically t+1 as well, which leads to somewhat inaccurate recommendations. For example, even if I watch a lot of new videos today, the items recommended to me are still content I watched yesterday or long ago;
  • The third is real-time model training. Once samples and features are real-time, model training can also move to real-time online training, which brings more timely recommendation results.

Traditional offline link

img

The above picture is a diagram of the traditional offline pipeline. First, the app or the server generates logs. All of this data lands on HDFS through the data pipeline, and then feature generation and model training are done at t+1 every day. The generated features are written to the feature store, which might be Redis or some other KV store, and are then served to the online inference service above.

Shortcomings of traditional offline links

img

What's wrong with it?

  • The first is that with t+1 data, the timeliness of features and of the model is very low, and it is difficult to deliver particularly time-sensitive updates;
  • The second is that the whole model training and feature production process consumes a full day of data each day, so training and feature production take a very long time and demand very high computing power from the cluster.

Real-time link

img

The figure above shows the real-time pipeline after our optimization; the parts marked with red crosses have been removed. After data is reported, it is sent directly into real-time Kafka through the pipeline, and then real-time feature generation and real-time sample generation are performed. Feature results are written into the feature store, and sample generation also reads some features from the feature store.

After the samples are generated, we perform real-time training directly. The whole long pipeline on the right has been removed, but we still keep the offline feature part, because some special features still need offline computation, for example those that are particularly complex, hard to compute in real time, or have no real-time requirement.

2. How Flink is used for machine learning at station B

img

Let's talk about how we achieve real-time samples, real-time features and real-time effect evaluation.

  • The first is real-time samples. Flink currently carries the production of sample data for all recommendation businesses at station B;
  • The second is real-time features. At present a considerable portion of features are computed in real time with Flink, with very high timeliness. Many features are produced with a combination of offline and real-time computation: historical data is computed offline, recent data is computed with Flink, and the two are spliced together when the features are read.

    However, these two sets of computation logic sometimes cannot be reused, so we are also trying to use Flink for unified batch-stream processing: all feature definitions are written with Flink, and depending on business needs they run as real-time or offline computation, with Flink as the underlying engine in both cases (see the sketch after this list);

  • The third is real-time effect evaluation. We use Flink + OLAP to connect the whole real-time computation + real-time analysis pipeline for final model effect evaluation.
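To make the batch-stream unification idea above a bit more concrete, here is a minimal PyFlink sketch. The table name, field names, and the one-hour window are assumptions for illustration, not station B's actual feature definitions; the point is that the same feature SQL can run as a streaming job or as an offline backfill simply by switching the environment mode.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

def build_feature_env(streaming: bool) -> TableEnvironment:
    # The same feature definition runs on Flink in either mode;
    # only the environment settings differ.
    settings = (EnvironmentSettings.in_streaming_mode()
                if streaming else EnvironmentSettings.in_batch_mode())
    return TableEnvironment.create(settings)

# Hypothetical feature: per-user play count over the last hour.
FEATURE_SQL = """
SELECT
  user_id,
  TUMBLE_END(event_time, INTERVAL '1' HOUR) AS window_end,
  COUNT(*) AS play_cnt_1h
FROM play_events
GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' HOUR)
"""

t_env = build_feature_env(streaming=True)  # False for an offline backfill
# The DDL registering `play_events` and the feature-store sink is omitted;
# in streaming mode it would point at Kafka, in batch mode at HDFS/Hive.
# t_env.execute_sql(FEATURE_SQL)
```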

Real-time sample generation

img

The above figure shows the current real-time sample generation for the whole recommendation pipeline. After log data lands in Kafka, we first run a Flink label-join job that splices clicks and impressions together. The result lands in Kafka again and is consumed by another Flink job that performs the feature join. The feature join splices in multiple features; some are public-domain features and some are private-domain features of the business side, and the sources are quite diverse, including both offline and real-time. Once all the features are joined, an instance (sample) record is generated and written to Kafka for the downstream training model.
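As a rough illustration of the label-join step described above, the sketch below uses a Flink SQL interval join to splice impressions with the clicks that follow them within a bounded time range. The table names, fields, and the 10-minute attribution window are assumptions, not station B's actual schema.

```python
# Minimal sketch of a label-join: left-join impressions with the clicks that
# arrive within 10 minutes, producing a click / no-click label per impression.
# Assumes `impressions` and `clicks` are registered as Kafka-backed tables
# with event-time attributes; all names and the window size are illustrative.
LABEL_JOIN_SQL = """
SELECT
  i.request_id,
  i.item_id,
  i.event_time AS impression_time,
  CASE WHEN c.request_id IS NOT NULL THEN 1 ELSE 0 END AS label
FROM impressions AS i
LEFT JOIN clicks AS c
  ON  i.request_id = c.request_id
  AND i.item_id = c.item_id
  AND c.event_time BETWEEN i.event_time
                       AND i.event_time + INTERVAL '10' MINUTE
"""
```

The result of this join is what feeds both the feature join for sample generation and the CTR calculation used for effect evaluation.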

Real-time feature generation

img

The above figure shows real-time feature generation. This is a relatively complex feature pipeline: the whole computation involves 5 jobs. The first job is an offline job, followed by 4 Flink jobs. The feature produced after this series of complex computations lands in Kafka, is then written into the feature store, and is used for online prediction or real-time training.

Real-time effect evaluation

img

The above picture shows real-time effect evaluation. A core indicator that the recommendation algorithm cares about is CTR, the click-through rate. Once the label-join is done, the CTR data can be computed. Besides feeding the next step of sample generation, this data is also branched into ClickHouse; a reporting system connected to ClickHouse then shows the effect in real time. The data itself carries experiment tags, so in ClickHouse we can split by tag and see the effect of each experiment.
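To give a feel for what the reporting side might look like, here is a hedged sketch of a per-experiment CTR query against ClickHouse. The table name, columns, and the use of the clickhouse-driver client are assumptions for illustration.

```python
from clickhouse_driver import Client  # assumes the clickhouse-driver package

client = Client(host="clickhouse-host")  # placeholder host

# Hypothetical table: one row per labeled impression, carrying an experiment tag.
CTR_SQL = """
SELECT
    exp_tag,
    toStartOfMinute(impression_time) AS ts,
    sum(label) / count(*)            AS ctr
FROM labeled_impressions
WHERE impression_time >= now() - INTERVAL 1 HOUR
GROUP BY exp_tag, ts
ORDER BY exp_tag, ts
"""

for exp_tag, ts, ctr in client.execute(CTR_SQL):
    print(exp_tag, ts, round(ctr, 4))
```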

3. Machine learning workflow platform construction

Pain points

img

  • The whole machine learning pipeline includes sample generation, feature generation, training, prediction, and effect evaluation. Each part requires configuring and developing many jobs, so launching one model ultimately spans many jobs and the pipeline is very long.
  • It is difficult for new algorithm engineers to understand the full picture of such a complex pipeline, and the learning cost is extremely high.
  • A change anywhere in the pipeline can affect everything else, so failures are very easy to introduce.
  • The computing layer uses multiple engines, batch and stream are mixed, and it is difficult to keep semantics consistent. The same logic has to be developed twice, and it is also hard to keep the two implementations free of gaps.
  • The bar for doing real-time work is also relatively high: it requires strong real-time and offline capabilities, which is difficult for many small business teams to achieve without platform support.

img

The figure above shows the approximate process a model goes through from data preparation to training, involving seven or eight nodes. Could we complete all of these operations on one platform? And why Flink? Because our team's real-time computing platform is built on Flink, and we also see Flink's potential in unified batch-stream processing and in future directions such as real-time model training and deployment.

Introducing AIFlow

img

AIFlow is a machine learning workflow platform open sourced by Alibaba's Flink ecosystem team, focused on standardizing the process and the whole machine learning pipeline. After getting in touch with the team around August and September last year, we introduced the system, built and improved it together with them, and gradually started putting it into production at station B. It abstracts the whole machine learning process into the example, transform, train, validation, and inference stages shown in the figure. A core capability of the project's architecture is that scheduling supports mixed dependencies between streaming and batch jobs, and the metadata layer supports model management, which makes iterating and updating models very convenient. We built our machine learning workflow platform on this basis.

Platform features

img

Next, let’s talk about platform features:

  • The first is using Python to define workflows. In the AI field people use Python heavily, and we also looked at external references; Netflix, for example, also uses Python to define this kind of machine learning workflow.
  • The second is supporting mixed dependencies between batch and stream jobs. All the real-time and offline processes involved in a complete pipeline can be added to it, and batch and stream jobs can depend on each other through signals.
  • The third is supporting one-click cloning of a whole experiment. From the raw logs to the final training, we want to be able to clone the overall pipeline with one click and quickly spin up a brand-new experimental pipeline.
  • The fourth is some performance optimizations, including support for resource sharing.
  • The fifth is supporting feature backtracking with unified batch and stream processing. The cold start of many features requires computing over a long stretch of historical data. Writing a dedicated set of offline feature computation logic just for cold start is very expensive, and it is hard to keep its results aligned with the real-time feature computation, so we support backtracking offline features directly on the real-time pipeline.

Basic structure

img

The figure above shows the basic architecture, with the business layer at the top and the engines at the bottom. Many engines are currently supported: Flink, Spark, Hive, Kafka, HBase, and Redis, covering both compute engines and storage engines. With AIFlow as the intermediate workflow management layer and Flink as the core compute engine, we designed the whole workflow platform.

Workflow description

img

The entire workflow is described in Python. In Python, users only need to define compute nodes and resource nodes, as well as the dependencies between these nodes. The syntax is somewhat like that of the scheduling framework Airflow.
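To make this more concrete, below is a pseudo-API sketch of what such a Python workflow definition might look like. The module and function names (workflow, flink_job, spark_job, depends_on) are invented purely for illustration and are not AIFlow's actual API; the point is only that nodes and their batch/stream dependencies are declared in Python, much like an Airflow DAG.

```python
# Illustrative pseudo-API, not AIFlow's real interface: nodes are Python
# objects and dependencies (including batch/stream mixes) are declared
# between them.
from my_platform import workflow, flink_job, spark_job  # hypothetical helpers

with workflow(name="ctr_model_experiment") as wf:
    # Batch node: backfill historical features from HDFS.
    offline_features = spark_job("offline_feature_backfill",
                                 script="backfill_features.py")

    # Streaming nodes: real-time feature and sample production on Kafka.
    realtime_features = flink_job("realtime_features", sql_file="features.sql")
    sample_join = flink_job("sample_join", sql_file="label_feature_join.sql")

    # Streaming training node consuming the sample stream.
    training = flink_job("online_training", script="train.py")

    # Mixed dependencies, expressed through signals between jobs.
    realtime_features.depends_on(offline_features)  # batch -> stream
    sample_join.depends_on(realtime_features)       # stream -> stream
    training.depends_on(sample_join)                # stream -> stream
```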

Dependency definition

img

There are mainly four kinds of batch-stream dependencies: stream-to-batch, stream-to-stream, batch-to-stream, and batch-to-batch. They basically cover all of our current business needs.

Resource Sharing

img

Resource sharing is mainly for performance, because a machine learning pipeline is often very long. For example, in the figure just shown, I may frequently change only five or six nodes. When I want to pull up the whole experimental process again, the entire graph is cloned; if I only need to change some of the nodes in the middle, the upstream nodes can share their data.

img

This is the technical implementation: after cloning, status tracking is performed on the shared nodes.

Real-time training

img

The picture above shows the process of real-time training. Feature crossing is a very common problem; it occurs when the progress of multiple computing tasks is inconsistent. On the workflow platform we only need to define the dependencies between nodes; once nodes depend on each other, their processing progress is synchronized, roughly meaning the fast one waits for the slow one, so that feature crossing is avoided. In Flink, we use watermarks to represent processing progress.
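As a small illustration of using watermarks to represent processing progress, here is a minimal PyFlink DataStream sketch; the field layout and the 30-second out-of-orderness bound are assumptions. Each stream's watermark advertises how far its event time has advanced, which is the progress signal used to keep the faster job from running ahead of the slower one.

```python
from pyflink.common import Duration, WatermarkStrategy
from pyflink.common.watermark_strategy import TimestampAssigner

class EventTimeAssigner(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp):
        # Assume the event-time millisecond timestamp is the first field.
        return value[0]

# Watermarks lag the maximum observed event time by 30 seconds; downstream
# operators (and the platform) read the watermark as "processing progress".
watermark_strategy = (
    WatermarkStrategy
    .for_bounded_out_of_orderness(Duration.of_seconds(30))
    .with_timestamp_assigner(EventTimeAssigner())
)
# stream = stream.assign_timestamps_and_watermarks(watermark_strategy)
```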

Feature backtracking

img

The above figure shows the process of feature backtracking. We use the real-time pipeline to directly backtrack over its historical data. Offline and real-time data are, after all, different, and there are many problems to solve along the way; part of it is currently done with Spark, and we will migrate that to Flink later.

The problem of feature backtracking

img

Feature backtracking has several big problems:

  • The first is how to guarantee the order of data. The implicit semantics of real-time data is that data arrives in order and is processed as soon as it is produced, which naturally gives a certain ordering. Offline HDFS data is not like that: HDFS has partitions, and the data within a partition is completely unordered. A large number of computations in real business rely on timing, so how to handle the disorder of offline data is a big problem.
  • The second is how to guarantee the consistency of feature and sample versions. For example, there are two pipelines, one producing features and the other producing samples, and sample production depends on feature production. How do we guarantee that their versions are consistent and there is no crossing?
  • The third is how to guarantee that the computation logic of the real-time pipeline and the backtracking pipeline are consistent. In fact, we do not need to worry about this problem, because we backtrack offline data directly on the real-time pipeline.
  • The fourth is performance: how to quickly compute a large amount of historical data.

Solutions

img

The following are solutions to the first and second problems:

  • For the first question, data ordering: our HDFS offline data is processed as if it were in Kafka. Instead of actually pouring it into Kafka, we simulate Kafka's data structure, which is partitioned and ordered within each partition. We process the HDFS data into a similar layout, simulated as logical partitions that are ordered internally, and Flink's HDFS source has been extended to support this simulated data structure (see the first sketch after this list). This simulation computation is currently done with Spark; we will switch it to Flink later.
  • The second question is divided into two parts:

    • For the real-time feature part, the solution relies on HBase storage, because HBase supports querying by version. After a feature is computed, it is written into HBase with its version. When a sample is generated, HBase is queried with the corresponding version number; the version here is usually the data time (see the second sketch after this list).
    • For the offline feature part, there is no need to recompute anything, because the offline features are already stored on HDFS. However, HDFS does not support point lookups, so we turn this part of the data into a KV store, and for performance we added asynchronous preloading.
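Below is a rough, local-Python sketch of the "simulate Kafka over HDFS" idea from the first point: records are hashed into a fixed number of logical partitions and each partition is sorted by event time, so that data is ordered within a partition just as it would be within a Kafka partition. The field names and partition count are illustrative; the real computation runs as a distributed Spark (later Flink) job, not as local Python.

```python
NUM_LOGICAL_PARTITIONS = 32  # illustrative partition count

def to_logical_partitions(records):
    """Hash records into ordered logical partitions, mimicking Kafka."""
    partitions = [[] for _ in range(NUM_LOGICAL_PARTITIONS)]
    for rec in records:  # rec: {"user_id": ..., "event_time": ..., ...}
        idx = hash(rec["user_id"]) % NUM_LOGICAL_PARTITIONS
        partitions[idx].append(rec)
    for part in partitions:
        # Within a logical partition the data is ordered by event time,
        # which is the ordering guarantee the downstream job relies on.
        part.sort(key=lambda r: r["event_time"])
    return partitions
```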
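For the second point, feature/sample version consistency, here is a minimal sketch of writing and reading features by version with HBase, using the happybase client. The table name, column family, and the choice of the data time as the HBase cell timestamp are assumptions for illustration.

```python
import happybase

connection = happybase.Connection("hbase-thrift-host")  # placeholder host
features = connection.table("realtime_features")        # hypothetical table

def write_feature(user_id: str, value: bytes, data_time_ms: int) -> None:
    # The feature version is the data time, stored as the HBase cell timestamp.
    features.put(user_id.encode(),
                 {b"f:play_cnt_1h": value},
                 timestamp=data_time_ms)

def read_feature(user_id: str, sample_time_ms: int) -> bytes:
    # When generating a sample, read the feature as of the sample's version:
    # the latest cell whose timestamp is <= the sample's data time.
    # (+1 because HBase treats the upper bound of the time range as exclusive.)
    row = features.row(user_id.encode(),
                       columns=[b"f:play_cnt_1h"],
                       timestamp=sample_time_ms + 1)
    return row.get(b"f:play_cnt_1h")
```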

img

The process of asynchronous preloading is shown in the figure.

4. Future planning

Finally, let's introduce our future plans.

img

  • The first is data quality assurance. The whole pipeline is getting longer and longer; there may be 10 or 20 nodes, so when something goes wrong, how do we find the problem quickly? Here we want to do DQC (data quality checks) over the set of nodes: for each node we can define custom data quality verification rules, and the data is bypassed to a unified dqc-center for rule evaluation and alerting.

img

  • The second is end-to-end exactly-once for the whole pipeline. How to guarantee exact consistency between workflow nodes is not yet clear to us.

img

  • The third is adding model training and deployment nodes to the workflow. Training and deployment can connect to other platforms, or use the training and model deployment services supported by Flink itself.

Guest introduction: Zhang Yang joined station B in 2017 and works on big data.

