Abstract: This article is compiled from the speeches of Alibaba senior technical expert Lin Dong and Alibaba technical expert Gao Yun (Yun Qian) at the core technology special session of Flink Forward Asia 2021. The main contents include:
- API for real-time machine learning
- Stream-batch integrated iterative engine
- Building the Flink ML ecosystem
1. API for real-time machine learning
The Flink ML API is the set of interfaces through which users work with the algorithms. Packaging all algorithms behind a unified API gives every user a consistent experience, reduces the cost of learning and understanding the algorithms, and makes the algorithms easier to interoperate and compose.
For example, the Flink ML API provides some basic utility classes with which different operators can be connected and composed into a higher-level operator, which greatly improves the efficiency of algorithm development. At the same time, because the unified Table API is used and all data is passed around as Tables, algorithms developed by different companies become compatible with each other, which reduces the cost of operators being developed over and over again by different companies and improves the efficiency of collaboration on algorithms.
The previous version of the Flink ML API still had many pain points.
The first is expressiveness. The previous API only supported a single Table as input and could not express some common algorithm logic. For example, the input of some training algorithms is a graph whose data is passed in through several different Tables; a single-Table input interface does not fit this case. Another example is data preprocessing logic that needs to fuse data obtained from multiple inputs, which a single-Table API also cannot express well. We therefore planned to extend the algorithm interfaces to support multiple inputs and multiple outputs.
The second is real-time training. The previous API did not natively support real-time machine learning scenarios. In real-time machine learning, we want the training algorithm to generate model data continuously and stream that model data to multiple frontend servers in real time. The existing interfaces, however, only offered one-shot training and one-shot inference and could not express this logic.
Lastly is ease of use. Previously, toJson() and fromJson() were used to export and load model data so that users could store it. However, the model data of some models can be several gigabytes, and storing it as strings is very inefficient or even infeasible. There are, of course, hacky workarounds, such as storing the model data in remote storage and passing the corresponding URL through toJson(), but this hurts ease of use: the algorithm user has to parse the URL and fetch the data remotely.
Due to the above factors, we decided to extend the Flink ML API.
After much discussion and thought, we settled on the following features for the new API, which address all of the problems above.
- First, interfaces for accessing model data are added. For example, getModelData() and setModelData() are added to Model, which helps realize real-time machine learning scenarios;
- Second, the operator interface is extended so that operators support multiple inputs and multiple outputs, and different operators can be composed into a directed acyclic graph and packaged into a higher-level operator;
- Third, the new API is implemented on top of DataStream, so it supports stream-batch unification and can realize training on both bounded and unbounded streams;
- Fourth, the API for accessing algorithm parameters has been improved and is now easier to use;
- Fifth, the model access API has been improved. The new API uses save() and load(); even when the model data is very large, users do not need to deal with the underlying complexity and only need to call save() and load();
- Sixth, an abstract class without model semantics (AlgoOperator) is provided.
The picture above shows the latest API architecture. At the top is the WithParams interface, which provides the API for accessing parameters. We improved this interface so that users no longer need to specify fields such as isOptional. Below it is the Stage interface, which all algorithm modules implement; it provides the APIs for persisting a module, namely save() and load(). save() stores the model data and parameters, and load() reads them back and restores the original Stage instance, so the user does not need to worry about the complexity of parameter access.
A Stage falls into two categories: the Estimator, which expresses training logic, and the AlgoOperator and Model classes, which express inference logic. The Estimator's core API is fit(). Unlike before, it now supports multiple Tables as input, which makes it possible to express input logic that needs several Tables, such as feature splicing. Estimator::fit() outputs a Model, and a Model is an AlgoOperator. An AlgoOperator expresses computation logic and supports multiple Tables as input and output; each Table is a data source, so it can express general computation logic.
Below AlgoOperator is Transformer, which expresses logic that transforms data. It has the same API shape as AlgoOperator, but their abstractions differ: Transformer is data transformation logic with model semantics, whereas computations such as data preprocessing include more general operations that splice and transform data, for example filtering, which do not really fit the general concept of a Transformer. We therefore added the AlgoOperator class to make these cases easier for users to understand and use.
Below Transformer is the Model class, to which we added the setModelData() and getModelData() APIs. These two APIs are designed specifically for real-time machine learning and allow users to export model data to multiple remote destinations in real time for online inference.
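To make this hierarchy concrete, the following is a simplified sketch of the core interfaces along the lines of FLIP-173; the signatures are abridged here and may differ in detail from the released code.

// Simplified sketch of the FLIP-173 stage hierarchy (signatures abridged).
public interface Stage<T extends Stage<T>> extends WithParams<T>, Serializable {
    void save(String path) throws IOException;   // persists parameters and model data
    // A matching static load(tEnv, path) method restores the Stage instance.
}

public interface AlgoOperator<T extends AlgoOperator<T>> extends Stage<T> {
    Table[] transform(Table... inputs);          // general multi-input / multi-output computation
}

public interface Transformer<T extends Transformer<T>> extends AlgoOperator<T> {}

public interface Model<T extends Model<T>> extends Transformer<T> {
    T setModelData(Table... inputs);             // feed streamed model data into the model
    Table[] getModelData();                      // expose the model data as Tables
}

public interface Estimator<E extends Estimator<E, M>, M extends Model<M>> extends Stage<E> {
    M fit(Table... inputs);                      // training consumes one or more input Tables
}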
The image above is a simplified but classic real-time machine learning scenario.
There are two main data sources here: static data from HDFS and dynamic data from Kafka. The data from these two sources is read by an AlgoOperator, which splices them into a single Table that is fed to the Estimator. The Estimator reads the spliced data and produces a Model, and the Table representing the model data can then be obtained through getModelData(). This Table is written to a Kafka topic through a sink. Finally, programs running on multiple frontend servers can directly create a Model instance, read the model data from Kafka into a Table, pass it to the Model through setModelData(), and use the resulting Model for online inference.
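A minimal sketch of this pipeline with the new API might look as follows; MyEstimator, MyModel, featureJoiner, modelKafkaSink and the table names are illustrative placeholders rather than real Flink ML classes.

// Training job: splice the two sources, train, and stream the model data out.
Table hdfsData = …      // static data read from HDFS
Table kafkaData = …     // dynamic data read from Kafka
Table trainData = featureJoiner.transform(hdfsData, kafkaData)[0];
MyModel model = new MyEstimator().fit(trainData);
tEnv.toDataStream(model.getModelData()[0]).sinkTo(modelKafkaSink);   // publish model data to a Kafka topic

// On each frontend server: rebuild the model from the streamed model data and serve it.
Table modelDataFromKafka = tEnv.from("model_topic");                 // hypothetical table backed by that topic
MyModel servingModel = new MyModel();
servingModel.setModelData(modelDataFromKafka);
Table predictions = servingModel.transform(requests)[0];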
After supporting online training and online inference, we further provide a basic component to help users build more complex operators from simple ones: the GraphBuilder proposed in FLIP-175.
Assume the user's input again comes from the same two data sources, and the final output goes to one sink. The user's core computation logic can be split into two parts. The first is data preprocessing, such as feature splicing: the data from the two sources is read in and integrated, and output as a Table to the Estimator, where the second part, the training logic, is executed. We want to run the training operator first to obtain a Model, and then connect the preprocessing operator with that Model to express the online inference logic.
All the user needs to do is connect the steps above through the GraphBuilder API; there is no need to write the connection logic again for online inference. GraphBuilder automatically derives, from the training graph, an inference graph whose operators correspond one to one with those of the training graph: an AlgoOperator in the training graph is carried over directly as an operator in the inference graph, and the Model produced by the Estimator in the training graph becomes the corresponding node in the inference graph. Connecting these nodes yields the final Model (ModelA in the figure), which is then used for online inference.
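As a rough illustration only: the method names below (createTableId, addAlgoOperator, addEstimator, buildEstimator) are placeholders modeled on FLIP-175 and may not match its exact signatures, but they show the idea of describing the training graph once and deriving the inference graph from it.

GraphBuilder builder = new GraphBuilder();             // hypothetical sketch, not the exact FLIP-175 API

TableId hdfsInput = builder.createTableId();           // static input
TableId kafkaInput = builder.createTableId();          // dynamic input

// Preprocessing: splice the two inputs into one training Table.
TableId joined = builder.addAlgoOperator(new MyFeatureJoiner(), hdfsInput, kafkaInput)[0];

// Training: the Estimator's fitted Model becomes the corresponding node of the inference graph.
TableId prediction = builder.addEstimator(new MyTrainer(), joined)[0];

// Package the whole graph as a single Estimator: fit() runs the training graph and the
// returned Model (ModelA) is the automatically derived inference graph.
Estimator<?, ?> graphEstimator =
    builder.buildEstimator(new TableId[] {hdfsInput, kafkaInput}, new TableId[] {prediction});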
2. Stream-batch integrated iterative engine
Flink is a stream-batch unified processing engine whose execution logic is described by a DAG, but in many scenarios, especially machine learning and graph computing applications, users also need the ability to process data iteratively. For example, offline training of some algorithms, online training, and dynamically adjusting model parameters according to feedback after model deployment all require iterative data processing.
Since the actual scenario covers both offline and online processing cases, it is necessary to support both offline and online processing at the iteration layer.
The processing logic of the three scenarios mentioned above has both differences and commonalities.
For offline training, take logistic regression as an example. Inside the iteration, one node can be used to cache the entire model; it sends the latest model to the training nodes. Each training node pre-reads the entire dataset before the iteration starts; every time it receives the latest model, it selects a mini-batch from the dataset, computes a model update, and sends the result back to the model cache node.
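For concreteness, the per-mini-batch update on a training node could look roughly like the following plain gradient-descent step for logistic regression; the data layout and learning rate are illustrative assumptions, not code from Flink ML.

// One mini-batch gradient-descent update for logistic regression.
// Each sample is (features, label) with label in {0, 1}; learningRate is an assumed hyper-parameter.
static double[] miniBatchUpdate(
        double[] model, List<Tuple2<double[], Double>> miniBatch, double learningRate) {
    double[] gradient = new double[model.length];
    for (Tuple2<double[], Double> sample : miniBatch) {
        double[] x = sample.f0;
        double y = sample.f1;
        double dot = 0;
        for (int i = 0; i < x.length; i++) {
            dot += model[i] * x[i];
        }
        double p = 1.0 / (1.0 + Math.exp(-dot));        // sigmoid prediction
        for (int i = 0; i < x.length; i++) {
            gradient[i] += (p - y) * x[i];              // gradient of the log loss
        }
    }
    double[] updated = model.clone();
    for (int i = 0; i < model.length; i++) {
        updated[i] -= learningRate * gradient[i] / miniBatch.size();
    }
    return updated;                                     // sent back to the model cache node
}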
For online training, since the training data arrives continuously from outside, it is impossible to read all of it in advance. The common practice is to dynamically read one mini-batch of data, compute the model update, and send it to the model cache node; the training node then waits for the cache node to send back the updated model before reading the next mini-batch. This requires the training node to read its inputs in a priority-based way so that, in the end, mini-batches are processed one by one.
In both scenarios, training can be synchronous or asynchronous, depending on whether the model cache node has to collect updates from all training nodes before starting the next round. In addition, some models dynamically update their parameters during prediction: after each record is processed, it is evaluated whether a parameter update should be performed immediately, and if so, an update is initiated.
The computation logic in these scenarios shares certain commonalities. First, an iterative structure has to be introduced into the job graph to support cyclic data processing, together with a way to decide whether to terminate the iteration after the data has gone around the loop. Second, during computation an operator also needs to receive a notification after each round of data has been fully received, and trigger a specific computation. For example, in offline training, the next round of computation starts only after the full model has been received.
There are actually two design options here. One is for the iteration layer to directly provide the ability to split the dataset into mini-batches and to iterate over each mini-batch. This could, in principle, cover all types of iteration, but to support both one-by-one mini-batch processing and parallel processing of multiple mini-batches, a new mini-batch-based stream processing interface would have to be introduced, and both processing modes would have to be supported in the implementation.
In addition, the mini-batches of online and offline training are produced in different ways. Offline mini-batches are obtained by pre-reading all the data locally and then selecting a subset in each round, whereas online mini-batches are obtained by reading a specific amount of data from the external source. Being compatible with both cases would further increase the complexity of the interface and the implementation.
Finally, to also be compatible with per-record processing, one would have to either introduce an infinite mini-batch or treat each record as a separate mini-batch. The former further increases the complexity of the interface, while the latter requires notifying the operator once per record, which increases the runtime overhead.
Considering the above, we only provide whole-dataset-level notifications inside the iteration and place the function of dividing mini-batches above the iteration. In offline training, users can efficiently implement mini-batch selection by picking a portion of the data from the entire dataset. In online training, users can group the input with Flink's built-in operators to realize one-by-one mini-batch processing.
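For example, assuming count-based batching is what the job needs, Flink's built-in countWindowAll can group the incoming records into mini-batches before they enter the iteration:

// Group the unbounded training stream into mini-batches of BATCH_SIZE records before
// feeding them into the iteration (count-based batching and BATCH_SIZE are illustrative assumptions).
DataStream<Tuple2<double[], Double>> samples = …
DataStream<List<Tuple2<double[], Double>>> miniBatches =
    samples
        .countWindowAll(BATCH_SIZE)
        .apply(new AllWindowFunction<Tuple2<double[], Double>, List<Tuple2<double[], Double>>, GlobalWindow>() {
            @Override
            public void apply(
                    GlobalWindow window,
                    Iterable<Tuple2<double[], Double>> values,
                    Collector<List<Tuple2<double[], Double>>> out) {
                List<Tuple2<double[], Double>> batch = new ArrayList<>();
                values.forEach(batch::add);
                out.collect(batch);      // one record per mini-batch enters the iteration body
            }
        });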
Based on the above ideas, the design of the whole iteration is shown in the figure above. It consists of four parts: inputs with a back edge, such as the initial model; read-only inputs without a back edge, such as the dataset; the position of the iteration back edges; and the final outputs. The datasets corresponding to the back edges have a one-to-one correspondence with the inputs that have back edges; the data returned through a back edge is unioned with the initial data, which realizes the cyclic processing of data.
The iteration provides users with a notification for the completion of each round of processing of a dataset, that is, the ability to track progress. Based on this, users can perform certain operations after a round of the dataset has been processed. For example, in offline training, an operator can compute the next round of model updates after it has received the model update data for the current round.
In addition, for inputs without a back edge, users can specify whether the data should be replayed in every round. The iteration also allows either creating a new operator instance for every round or processing all rounds of data with a single operator instance. With a single instance across rounds, users can cache data between rounds without rebuilding operator instances; with replayed input data and per-round operator re-creation, users can directly reuse operators written for use outside iterations, such as common operators like Reduce and Join.
At the same time, we provide two kinds of termination logic. One is natural termination: when there is no data left to process anywhere in the iteration, the iteration terminates. The other is that, in the bounded-stream case, users may also designate a particular dataset; if that dataset produces no data in some round, the iteration is terminated early.
DataStream<double[]> initParameters = …
DataStream<Tuple2<double[], Double>> dataset = …

DataStreamList resultStreams =
    Iterations.iterate(
        DataStreamList.of(initParameters),
        ReplayableDataStreamList.notReplay(dataset),
        IterationConfig.newBuilder().setOperatorRoundMode(ALL_ROUND).build(),
        (variableStreams, dataStreams) -> {
            DataStream<double[]> modelUpdate = variableStreams.get(0);
            DataStream<Tuple2<double[], Double>> trainDataset = dataStreams.get(0);
            DataStream<double[]> newModelUpdate = …
            DataStream<double[]> modelOutput = …
            return new IterationBodyResult(
                DataStreamList.of(newModelUpdate),
                DataStreamList.of(modelOutput));
        });

DataStream<double[]> finalModel = resultStreams.get(0);   // the final model output of the iteration
The snippet above is an example of building an iteration with the iteration API. The user needs to specify the list of inputs with and without back edges, whether operators should be re-created in every round, and the computation logic of the iteration body. For the iteration body, the user returns the datasets corresponding to the back edges and the final outputs of the iteration.
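For the early-termination option mentioned above, the iteration body can, as we understand the FLIP-176 design, also return a third, termination-criteria stream; the iteration stops once that stream produces no records in a round. A sketch (the lossDiff stream and this constructor overload are illustrative):

// Inside the iteration body: also return a termination-criteria stream.
// Here lossDiff only emits a record while the loss is still improving, so the
// iteration terminates early once it stays empty for a whole round.
DataStream<Double> lossDiff = …
return new IterationBodyResult(
    DataStreamList.of(newModelUpdate),
    DataStreamList.of(modelOutput),
    lossDiff);

The next snippet shows an operator inside the iteration reacting to the per-round notifications.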
public static class ModelCacheFunction extends ProcessFunction<double[], double[]>
        implements IterationListener<double[]> {

    private final double[] parameters = new double[N_DIM];

    public void processElement(double[] update, Context ctx, Collector<double[]> output) {
        // Suppose we have a util to add the second array to the first.
        ArrayUtils.addWith(parameters, update);
    }

    public void onEpochWatermarkIncremented(int epochWatermark, Context context, Collector<double[]> collector) {
        if (epochWatermark < N_EPOCH * N_BATCH_PER_EPOCH) {
            collector.collect(parameters);
        }
    }

    public void onIterationTerminated(Context context, Collector<double[]> collector) {
        context.output(FINAL_MODEL_OUTPUT_TAG, parameters);
    }
}
For an operator inside the iteration that implements the IterationListener interface, the iteration notifies it whenever all datasets have finished a certain round of processing. If the whole iteration terminates, the operator is further notified through onIterationTerminated. Users can implement the computation logic they need in these two callbacks.
In the implementation of the iteration, for the iterative processing structure that the user creates through the code, the framework adds some extra nodes inside the iteration and wraps all of the user's processing nodes, in order to manage operator life cycles and convert data types. Finally, the back edge is implemented based on colocation and local memory, so that from the scheduler's point of view the whole job is still a DAG and the existing scheduling logic can be reused directly.
The special operators inserted into the iteration mainly include the input, output, head and tail operators, where input and output are responsible for converting data types. When external data enters the iteration, each record is given an iteration header that records which round the record is being processed in; the wrapper around each operator strips this header off and hands the record to the user's original operator for processing.
The head and tail operators are responsible for implementing the back edge and for determining whether a round of iteration has been fully processed. The head operator reads the entire input from the input operator and, at the end, inserts a special EpochWatermark event to mark the end of round zero. Since the head operator may run with multiple parallel instances, the round-0 event can only be sent after all parallel instances of the head operator have finished reading their input.
The head operator synchronizes its parallel instances through the Operator Coordinator, a global component located in the JobManager that can communicate in both directions with all head tasks. After all parallel instances of the head operator have reported that a round is complete, the coordinator sends a global broadcast to all head tasks telling them that this round has completely finished. After the head emits the EpochWatermark, all operators inside the iteration align on this message: once an operator has read the EpochWatermark of a particular round from all of its inputs, it considers that round complete and invokes the end-of-round callback. When the tail node receives data or an EpochWatermark of a certain round, it increments the round by one and sends it back to the head through the back edge, thus realizing cyclic data processing.
Finally, we also support checkpointing inside iterations. Since Flink's default checkpoint mechanism cannot handle computation graphs with cycles, we extended it to implement a Chandy-Lamport-style algorithm with cycles that also caches the messages arriving from the back edge. In addition, when the head operator aligns barriers, besides the barriers from its normal inputs it also waits for a special barrier from the Operator Coordinator. The message announcing the global end of each round is likewise broadcast by the Operator Coordinator, which guarantees that within each checkpoint all operators inside the iteration are in the same round, thereby simplifying later changes to operator parallelism.
There is also an optimization under development: the Operator Coordinator holds back a received barrier until the next global alignment message and only then sends the notification, so that the state of every operator inside the iteration is snapshotted exactly at the point where a round of data has just been fully read. Many synchronization algorithms first cache the data they receive inside the operator and process it all at once after a whole round has been read; this optimization ensures those caches are exactly empty when the snapshot is taken, thereby minimizing the amount of data cached in the checkpoint.
The above is an introduction to Flink's stream-batch integrated iteration engine. It can be used in both online and offline scenarios and supports exactly-once fault tolerance. In the future, we will further support the batch execution mode and provide more high-level development tools.
3. Building the Flink ML ecosystem
Recently we moved the Flink ML code out of the Flink core repository into a separate Flink ML repository. The first purpose is to facilitate rapid iteration of Flink ML; the second is to reduce the complexity of the Flink core repository and keep it from becoming too bloated.
In addition, we established a neutral organization, Flink-extended, on GitHub. It provides a platform where developers across the Flink community can open-source the projects they want to contribute, sharing code without attaching it to a specific company, so that developers from different companies can contribute code and the Flink community can use it and build on it together. We hope this will promote the prosperity of the Flink ecosystem.
There are already some important projects in this neutral organization, for example Deep Learning on Flink, an open source project mainly developed by Alibaba's big data team. Its core function is to combine Flink preprocessing with deep learning training in TensorFlow to form end-to-end training and inference.
We have recently added several common algorithms to Flink ML, and will continue to provide more out-of-the-box algorithms in the future.
The picture above shows the important work currently in progress. The core task is to migrate the existing Alink code base open-sourced by Alibaba so that its algorithms fit the newly designed Flink ML API, and to contribute the migrated algorithms to the Apache project, giving Flink users more out-of-the-box algorithms.
In addition, we are cooperating with 360 on the Clink project. The core goal is to run certain operators in Java during offline computation to obtain training results, while the same operators must also support online inference with very low latency. Low-latency online inference is hard to achieve in Java and usually requires C++. To let developers write an algorithm once and use it in both Java and C++ environments, Clink provides packaged base-class functionality that makes it convenient for algorithm developers, after writing a C++ operator, to wrap it as a Java operator via JNI and use it in Flink.
Finally, we plan to add Python support to Flink ML, including allowing algorithm users to connect and compose Flink ML's Java operators by writing Python programs, which we hope will improve the efficiency and experience of machine learning developers.
The above work has largely entered the open source projects. The design of the algorithm API is in FLIP-173, and the design of the iteration engine is mainly in FLIP-176; FLIP-174 and FLIP-175 cover the algorithm parameter API and the GraphBuilder API respectively. Projects such as Clink and Deep Learning on Flink live under the Flink-extended organization, and you are welcome to use them.