Cao Fuqiang, head of data computing at the Weibo Machine Learning R&D Center and senior system engineer, introduces the application of Flink real-time computing at Weibo. Contents include:
1. Introduction to Weibo
2. Introduction to data computing platform
3. Typical application of Flink in data computing platform
1. Introduction to Weibo
This talk covers the application of Flink real-time computing at Weibo. Weibo is China's leading social media platform, with 241 million daily active users and 550 million monthly active users, of which mobile users account for more than 94%.
2. Introduction to Data Computing Platform
1. Overview of Data Computing Platform
The figure below is the architecture diagram of the data computing platform.
- At the bottom is scheduling: Flink and Storm clusters are deployed on Kubernetes and YARN for real-time data processing, alongside SQL services for offline processing.
- On top of the clusters sits Weibo's AI platform, which manages jobs, data, resources, samples, and so on.
- On top of the platform we have built a number of services that support the various business teams in a service-oriented way.
1. Real-time computing services mainly include data synchronization, content deduplication, multi-modal content understanding, real-time feature generation, real-time sample joining, and streaming model training, all closely tied to the business. In addition, the platform supports Flink and Storm real-time computing as general-purpose computing frameworks.
2. For the offline part, we combined Hive SQL and Spark SQL to build a SQL computing service, which now supports most business teams at Weibo.
- For data output, the data warehouse and feature engineering form a data mid-platform that serves data externally. Overall, we currently run more than 1,000 real-time computing jobs and more than 5,000 offline jobs online, processing over 3 PB of data per day.
2. Data computing
The following two figures show data computing: one for real-time computing, the other for offline computing.
- Real-time computing mainly covers real-time feature generation, multimedia feature generation, and real-time sample generation, all closely tied to the business. It also provides general-purpose Flink and Storm real-time computing.
- Offline computing mainly covers SQL computing: SQL ad-hoc queries, data generation, data queries, and table management. Table management is mainly the management of the data warehouse, including table metadata, table access permissions, and the lineage between a table's upstream and downstream.
3. Real-time features
As shown in the figure below, we built a real-time feature generation service based on Flink and Storm. Overall it is divided into job details, input sources, feature generation, output, and resource configuration. Users develop feature-generation UDFs against our predefined interface; everything else, such as reading the input and writing the features, is provided automatically by the platform and only needs to be configured on the page. In addition, the platform automatically provides monitoring of input data sources, job exceptions, feature writes, feature reads, and so on.
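To make the division of labor concrete, here is a minimal sketch of what such a UDF contract might look like. The interface and class names (`FeatureUDF`, `extract`, `ClickCountFeature`) are illustrative assumptions, not Weibo's actual API; in Flink the per-key counts would live in managed keyed state rather than a plain dict.

```python
# Hypothetical sketch of a platform-provided UDF interface for real-time
# feature generation. The platform handles input reading and feature writing;
# the user implements only extract().

class FeatureUDF:
    """Base contract the platform calls for each incoming event."""
    def extract(self, event: dict) -> dict:
        raise NotImplementedError

class ClickCountFeature(FeatureUDF):
    """Example user UDF: running click count per user from a behavior log."""
    def __init__(self):
        # Stand-in for Flink keyed state; illustrative only.
        self.counts = {}

    def extract(self, event: dict) -> dict:
        uid = event["uid"]
        if event.get("action") == "click":
            self.counts[uid] = self.counts.get(uid, 0) + 1
        return {"uid": uid, "click_count": self.counts.get(uid, 0)}
```

The platform would wire `extract` into the stream and write the returned feature dict to the configured feature store.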
4. Stream-batch integration
The following introduces our stream-batch integration based on FlinkSQL. First we unify metadata, bringing real-time logs and offline logs together through the metadata management platform. Once that is unified, jobs submitted by users go through a unified scheduling layer, which dispatches each job to a different cluster according to its type, its characteristics, and the current cluster load.
At present, the computing engines supported by the scheduling layer are mainly HiveSQL, SparkSQL, and FlinkSQL. Hive and Spark SQL are mainly used for batch computing, while FlinkSQL handles mixed batch-stream workloads. All results are written to the data warehouse for the business side to use. Stream-batch integration has roughly four key points:
- First, unified batch and stream code, to improve development efficiency.
- Second, unified batch and stream metadata, managed in one place to keep metadata consistent.
- Third, mixed deployment of batch and stream programs, to save resources.
- Fourth, unified batch scheduling, to improve cluster utilization.
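The dispatch decision described above can be sketched as a small routing function. This is an illustrative assumption about the policy, not Weibo's actual scheduler: the engine names mirror the article, but the rule (streaming and mixed jobs to FlinkSQL, batch jobs to the less-loaded batch engine) and the load representation are invented.

```python
# Illustrative sketch of a scheduling-layer routing policy.
# cluster_load maps engine name -> load in [0, 1]; unknown engines count as idle.

def route_job(job_type: str, cluster_load: dict) -> str:
    """Pick a computing engine for a submitted SQL job."""
    if job_type in ("stream", "mixed"):
        # FlinkSQL handles streaming and mixed batch-stream workloads.
        return "FlinkSQL"
    # Pure batch: choose the less-loaded of the two batch engines.
    return min(("HiveSQL", "SparkSQL"),
               key=lambda engine: cluster_load.get(engine, 0.0))
```

A real scheduler would also weigh job characteristics (data volume, SLA, dependencies), but the shape of the decision is the same.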
5. Data Warehouse
- For the offline warehouse, we divide the data into three layers: the original logs, the middle layer, and the data service layer. In the middle is the unified metadata, and at the bottom is the real-time data warehouse.
- For the real-time data warehouse, we use FlinkSQL to run streaming ETL over the original logs, then write the final results to the data service layer through streaming aggregation, while also storing them in various real-time stores such as ES, HBase, Redis, and ClickHouse. These real-time stores serve external data queries and also support further computation on the data. In other words, the real-time data warehouse was built mainly to solve the long turnaround of offline feature generation, while FlinkSQL shortens the relatively long development cycle of streaming jobs. A key point here is metadata management across the offline and real-time data warehouses.
3. Typical application of Flink in data computing platform
1. Streaming machine learning
First, let me introduce several characteristics of streaming machine learning. The biggest one is real-time, which splits into real-time features and real-time models.
- Real-time features give more timely feedback on user behavior and characterize users at a finer granularity.
- Real-time models are trained on online samples so that the model reflects online changes in time.
■ Features of Weibo streaming machine learning:
- The sample scale is large: real-time samples currently reach million-level QPS.
- The model scale is large: in terms of training parameters, the framework supports training at the scale of hundreds of billions.
- The requirements for the stability of the operation are relatively high.
- The real-time requirements of the samples are high.
- The real-time performance of the model is high.
- Many business teams need to be supported on the platform.
■ Streaming machine learning has several difficult problems:
- One is the full link: the end-to-end pipeline is long. A streaming machine learning flow starts with log collection, then feature generation, sample generation, model training, and finally serving. The whole pipeline is very long, and a problem in any link affects the final user experience. We therefore deployed fairly complete full-link monitoring for every stage, with rich monitoring metrics.
- The other is the large data scale, including massive user logs, sample volume, and model size. We surveyed the commonly used real-time computing frameworks and finally chose Flink to solve these problems.
■ Streaming machine learning process:
- The first is offline training. After collecting offline logs and generating samples offline, we read the samples with Flink and run offline training. The resulting parameters are saved in the offline parameter server, and this result serves as the base model for the model service's real-time cold start.
- Then comes the real-time streaming machine learning process. We pull real-time logs, such as Weibo post and interaction logs, generate samples from them with Flink, and run real-time training. The resulting parameters are saved in a real-time parameter server and then periodically synchronized from the real-time parameter server to the parameter server used by the model service.
- The last part is the model service. It pulls the model's parameters from the parameter service and scores user and material features and behaviors through the model. The ranking service then takes the scores, applies some recommendation strategies, selects the material it considers most suitable for the user, and returns it to the user. When the user interacts on the client, new online requests and new logs are generated, so the whole streaming learning process is a closed loop.
In addition,
- With offline samples, model updates are delayed by days or hours, while the streaming approach brings this down to hours or minutes;
- The computational pressure of offline model training is relatively concentrated, while real-time training spreads the pressure out over time.
■ Sample
Here is a brief history of our streaming machine learning samples. In October 2018 we launched the first streaming sample job, built on Storm with Redis as external storage. In May 2019 we adopted Flink as the new real-time computing framework, and implemented the multi-stream join with a union + timer scheme instead of window computation. In October 2019, a xx sample job went live, with a single job reaching hundreds of thousands of QPS. In April of this year, sample generation was moved onto the platform, and by June the platform had iterated to support sample landing, including the sample library, as well as improved sample monitoring metrics.
Sample generation in streaming machine learning is essentially joining multiple data streams on the same key. For example, with three data streams, data cleaning produces <k, v> records, where k is the aggregation key and v is the value needed in the sample. After a union of the streams, a KeyBy aggregation is performed, and the aggregated data is held in a value state in memory. As shown below:
- If k1 does not exist, register a timer and save the record in state.
- If k1 exists, take it out of state, update it, and store it back. When its timer finally fires, the record is emitted and cleared from state.
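The union + timer steps above can be sketched in plain Python. This is a simulation, not Flink code: Flink would manage the keyed ValueState and event-time timers itself, while here state and timers are dicts and the watermark is advanced by hand.

```python
# Pure-Python sketch of the union + KeyBy + ValueState + timer join pattern.
# Records from all unioned streams arrive as (key, value) pairs; a timer set
# on first arrival bounds how long we wait for the other streams.

class SampleJoiner:
    def __init__(self, window_ms: int):
        self.window_ms = window_ms
        self.state = {}    # key -> merged values (stand-in for ValueState)
        self.timers = {}   # key -> expiry timestamp (stand-in for timers)
        self.output = []   # emitted joined samples

    def process(self, key, value: dict, ts: int):
        if key not in self.state:
            # First arrival for this key: register the timer and save state.
            self.state[key] = {}
            self.timers[key] = ts + self.window_ms
        # Key exists: take state out, update it, store it back.
        self.state[key].update(value)

    def advance_watermark(self, watermark: int):
        # Fire expired timers: emit the joined sample and clear the state.
        for key in [k for k, t in self.timers.items() if t <= watermark]:
            self.output.append((key, self.state.pop(key)))
            del self.timers[key]
```

Compared with window joins, this pattern emits exactly one sample per key when the timer fires, regardless of how many streams contributed.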
■ Sample platform
We turned the whole sample-joining process into a platform operation with 5 modules: input, data cleaning, sample joining, sample formatting, and output. With platform-based development, users only need to care about the business logic. Users develop:
- the data cleaning logic for the input data;
- the data formatting logic before sample output.
The rest can be achieved by configuring on the UI, specifically:
- The time window for sample joining.
- The aggregation of fields within the window.
Resources are reviewed and configured by the platform. In addition, the platform provides basic monitoring, including monitoring of input data, sample metrics, job exceptions, and sample output.
■ Sample UI of streaming machine learning project
The picture below shows the sample UI of a streaming machine learning project. On the left is the job configuration for sample generation; on the right is the sample library, which is mainly used for managing and displaying samples, including sample descriptions and permissions, sample sharing status, and so on.
■ Streaming machine learning applications
Finally, the results of our streaming machine learning applications. We currently support real-time sample joining at million-level QPS, and streaming model training with hundreds of models training concurrently and hour- or minute-level model updates. The full streaming learning pipeline has disaster recovery and full-link automatic monitoring. One thing we are working on recently is streaming deep learning, to increase the expressiveness of real-time models; another is reinforcement learning, to explore new application scenarios.
2. Multi-modal content understanding
■ Introduction
Multi-modal content understanding uses machine learning methods to understand information across modalities. At Weibo this mainly covers images, videos, audio, and text.
- Images: object recognition and tagging, OCR, face detection, celebrity recognition, facial attractiveness scoring, and smart cropping.
- Videos: copyright detection and logo recognition.
- Audio: speech-to-text and audio tagging.
- Text: word segmentation, text timeliness, and text classification labels.
For example, when we first built video classification, we only used the frames extracted from the video, that is, images. In a second optimization, we added audio-related signals as well as the post text associated with the video, fusing audio, images, and text so that multiple modalities together produce a more accurate classification label for the video.
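One simple way to combine the modalities described above is late fusion: each modality produces its own label scores, and a weighted sum picks the final label. This is a generic sketch of that idea, not Weibo's actual model; the weights and labels are invented for illustration.

```python
# Illustrative late-fusion sketch for multi-modal video classification.
# scores: modality name -> {label: probability}; weights: modality -> weight.

def fuse_modalities(scores: dict, weights: dict) -> str:
    """Return the label with the highest weighted sum across modalities."""
    fused = {}
    for modality, label_probs in scores.items():
        w = weights.get(modality, 1.0)  # unknown modalities get weight 1
        for label, p in label_probs.items():
            fused[label] = fused.get(label, 0.0) + w * p
    return max(fused, key=fused.get)
```

With frames alone a noisy visual signal can dominate; adding audio and text scores, as in the article's second optimization, lets the fused label correct it.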
■ Platform
The following figure shows the platform architecture of multi-modal content understanding. The middle part is Flink real-time computing, which receives real-time data such as image streams, video streams, and post streams, then calls the underlying basic services and deep learning model services through model plug-ins. The services return content features, which we store in the feature engineering layer and provide to business teams through the data mid-platform. The whole pipeline is monitored end to end with alerting so that abnormal situations get a response as soon as possible, and the platform automatically provides log collection, metric statistics, case tracing, and other functions. ZooKeeper is used for service discovery, to synchronize service state between real-time computing and the deep learning model services; on top of state synchronization there are also load balancing strategies. At the bottom, a data reconciliation system further improves the success rate of data processing.
■ UI
The UI for multi-modal content understanding mainly includes job information, input source information, model information, output information, and resource configuration, improving development efficiency through configuration-based development. Monitoring metrics for model calls, including call success rate and latency, are generated automatically, and after the job is submitted a metrics-statistics job is automatically created as well.
3. Content deduplication service
■ Background
In recommendation scenarios, repeatedly pushing duplicate content to users greatly hurts the user experience. With this in mind, we built a content deduplication service platform combining the Flink real-time stream computing platform, a distributed vector retrieval system, and deep learning model services. It features low latency, high stability, and high recall, and currently supports multiple business teams with 99.9%+ stability.
■ Architecture
The following figure shows the architecture of the content deduplication service. At the bottom is multimedia model training, which is offline: we collect sample data, process it, and store the processed samples in the sample library. For model training, samples are pulled from the sample library, training is run, and the trained results are saved in the model library.
The main models used for content deduplication are vector generation models, covering image vectors, text vectors, and video vectors.
Once a trained model is verified, it is saved in the model library, which records the model's basic information, including its runtime environment and version. Deploying a model online means pulling it from the model library along with the runtime environment it needs.
After the model is deployed, Flink reads materials from the material library in real time and calls the multimedia inference service to generate vectors for them. These vectors are saved in Weiss, a vector recall and retrieval system developed in-house at Weibo. Once stored, a vector recall is performed for each material to retrieve a batch of similar materials. In fine ranking, a strategy is applied over all recalled results to select the most similar one, which is then aggregated with the current material under a single content ID. Finally, the business side deduplicates using the content ID of each material.
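The recall-then-aggregate flow above can be sketched as follows. This is a toy stand-in: brute-force cosine similarity replaces the Weiss vector retrieval system, the recall and fine-ranking steps collapse into one pass, and the similarity threshold and content-ID format are invented for illustration.

```python
# Sketch of assigning a content ID by vector similarity, standing in for the
# recall (Weiss) + fine-ranking + aggregation steps of the dedup pipeline.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def assign_content_id(vector, index: dict, threshold: float = 0.9):
    """index maps content_id -> representative vector. Reuse the most similar
    content ID above the threshold, otherwise mint a new one."""
    best_id, best_sim = None, 0.0
    for cid, vec in index.items():        # recall + fine ranking in one pass
        sim = cosine(vector, vec)
        if sim > best_sim:
            best_id, best_sim = cid, sim
    if best_id is not None and best_sim >= threshold:
        return best_id                    # duplicate: aggregate under same ID
    new_id = f"c{len(index) + 1}"         # illustrative ID scheme
    index[new_id] = vector
    return new_id
```

In production the index lookup would be an approximate nearest-neighbor query rather than a linear scan, which is what makes second-level latency feasible at this scale.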
■ Application
There are three main business scenarios for content deduplication:
- First, video copyright support (pirated video recognition): 99.99% stability and a 99.99% piracy recognition rate.
- Second, site-wide Weibo video deduplication, applied in recommendation scenarios: 99.99% stability and second-level processing latency.
- Third, deduplication of recommendation-feed materials: 99% stability, second-level processing latency, and 90% accuracy.
■ Finally
By combining the Flink real-time stream computing framework with business scenarios, we have done a great deal of platformization and service work, and made many optimizations for development efficiency and stability. We improve development efficiency through modular design and platform-based development. The real-time data computing platform now comes with full-link monitoring, data metric statistics, and a debug case tracing (log review) system. Based on FlinkSQL, we also have some applications of stream-batch integration. These are some of the changes Flink has brought us, and we will continue to explore Flink's broader application space at Weibo.