Source | Apache Flink Official Blog

The Apache Flink community is proud to announce the official release of Apache Flink ML 2.1.0! This release focuses on improving Flink ML's infrastructure, such as the Python SDK, memory management, and the performance benchmark framework, to help developers build high-performance, stable, and easy-to-use machine learning algorithm libraries on top of Flink ML.

Based on the improvements delivered in this release and the performance test results we obtained, we believe the Flink ML infrastructure is now ready for community developers to build high-performance machine learning algorithm libraries that support the Python environment.

We encourage you to download this release [1] and share your feedback with the community via the Flink mailing list [2] or JIRA [3]! We hope you enjoy the new version, and we look forward to hearing about your experience.

Important features

1. Operator interface and infrastructure

1.1 Support for operator-level memory management

In previous versions, the internal state of machine learning operators, such as training data that needs to be cached and re-read in every round of iteration, was stored in the state backend, which keeps the data either entirely in memory or entirely on disk. In the former case, a large amount of state data may cause OOM errors and reduce job stability. In the latter case, every round of iteration has to read the full data set from disk and deserialize it, so performance is worse than keeping the data in memory even when the amount of state data is small. This made it difficult for developers to write operators that are both high-performance and stable.

In this release, we improved the Flink ML infrastructure so that a managed-memory quota can be specified for each operator. When the amount of an operator's state data stays below the quota, the state is kept in Flink's managed memory; when it exceeds the quota, the excess is spilled to disk to avoid OOM. Algorithm developers can use this mechanism to provide optimal performance across different input data sizes. Developers can refer to the code of the KMeans operator to learn how to use this mechanism; a conceptual sketch of the underlying spill pattern follows below.
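The following is a minimal, self-contained Java sketch of the spill-to-disk pattern described above, intended only to make the idea concrete. The class and method names are hypothetical and this is not Flink ML's internal API: the real implementation keeps the in-memory portion in Flink's managed memory rather than in a plain on-heap Java list.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical illustration of quota-based caching: records are kept in memory
 * until a configured byte quota is reached, after which further records spill
 * to a temporary file on disk. Not Flink ML's actual internal API.
 */
public class QuotaBackedCache<T extends Serializable> {

    private final long memoryQuotaBytes;       // memory budget assigned to this "operator"
    private final List<T> inMemory = new ArrayList<>();
    private final Path spillFile;
    private long usedBytes = 0;
    private long spilledCount = 0;
    private ObjectOutputStream spillOut;       // opened lazily once the quota is exceeded

    public QuotaBackedCache(long memoryQuotaBytes) throws IOException {
        this.memoryQuotaBytes = memoryQuotaBytes;
        this.spillFile = Files.createTempFile("cache-spill", ".bin");
    }

    /** Adds a record, spilling it to disk once the in-memory quota is exhausted. */
    public void add(T record, long estimatedSizeBytes) throws IOException {
        if (usedBytes + estimatedSizeBytes <= memoryQuotaBytes) {
            inMemory.add(record);
            usedBytes += estimatedSizeBytes;
        } else {
            if (spillOut == null) {
                spillOut = new ObjectOutputStream(Files.newOutputStream(spillFile));
            }
            spillOut.writeObject(record);
            spilledCount++;
        }
    }

    /** Replays all cached records, reading the spilled portion back from disk. */
    public List<T> readAll() throws IOException, ClassNotFoundException {
        List<T> all = new ArrayList<>(inMemory);
        if (spillOut != null) {
            spillOut.flush();
            try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(spillFile))) {
                for (long i = 0; i < spilledCount; i++) {
                    @SuppressWarnings("unchecked")
                    T record = (T) in.readObject();
                    all.add(record);
                }
            }
        }
        return all;
    }
}
```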

1.2 Improved infrastructure for developing online training algorithms

An important goal of Flink ML is to facilitate the development of online training algorithms. The previous version added the setModelData() and getModelData() methods, which allow the model data of an online training algorithm to be transmitted and saved as unbounded data streams, strengthening the Flink ML API's support for online training. This release further improves and verifies the ability of the Flink ML infrastructure to support online training algorithms.

This release adds two online training algorithms, OnlineKMeans and OnlineLogisticRegression, together with unit tests that verify their correctness. The two algorithms introduce concepts such as the global batch size and the model version, and provide metrics and interfaces for setting and reading the corresponding information. Although the prediction accuracy of these two algorithms has not yet been tuned, this work helps us establish best practices for developing online training algorithms, and we hope more community contributors will join us to accomplish this goal together. A hedged usage sketch of OnlineKMeans follows below.
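As an illustration, the sketch below wires an OnlineKMeans estimator to a feature stream in Java. It is modeled on the example programs published on the Flink ML website [5]; method names such as setGlobalBatchSize and setInitialModelData follow those examples as we recall them, so treat the exact signatures as assumptions and consult the website for the authoritative version. A real online job would read unbounded streams (for example from Kafka); bounded in-memory data is used here only to keep the sketch self-contained.

```java
import org.apache.flink.ml.clustering.kmeans.KMeans;
import org.apache.flink.ml.clustering.kmeans.OnlineKMeans;
import org.apache.flink.ml.clustering.kmeans.OnlineKMeansModel;
import org.apache.flink.ml.linalg.DenseVector;
import org.apache.flink.ml.linalg.Vectors;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class OnlineKMeansSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Feature vectors; in production this would be an unbounded stream.
        DataStream<DenseVector> featureStream =
                env.fromElements(Vectors.dense(0.0, 0.0), Vectors.dense(10.0, 10.0));
        Table featureTable = tEnv.fromDataStream(featureStream).as("features");

        // Bootstrap an initial model offline; OnlineKMeans then keeps refining it
        // one global batch at a time as new data arrives.
        Table initialModelData = new KMeans().setK(2).fit(featureTable).getModelData()[0];

        OnlineKMeans onlineKMeans =
                new OnlineKMeans()
                        .setFeaturesCol("features")
                        .setPredictionCol("prediction")
                        .setGlobalBatchSize(64) // records consumed per model update
                        .setInitialModelData(initialModelData);

        OnlineKMeansModel model = onlineKMeans.fit(featureTable);
        Table predictions = model.transform(featureTable)[0];
        predictions.execute().print();
    }
}
```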

1.3 Algorithm performance testing framework

An easy-to-use performance testing framework is crucial for developing and maintaining a high-performance Flink ML algorithm library. This release adds a benchmark framework that supports writing pluggable and reusable data generators, reads its configuration in JSON format, and outputs performance test results in JSON format so that the results can be analyzed and visualized in a customizable way. We also provide out-of-the-box scripts that convert performance test results into charts. Interested readers can read this document [4] to learn how to use the testing framework.

2. Python SDK

This release enhances the infrastructure of the Python SDK so that a Python operator can call the corresponding Java operator to perform training and inference, which lets Python operators deliver the same performance as their Java counterparts. This greatly improves the development efficiency of the Python algorithm library: algorithm developers can provide both Python and Java libraries for the same set of algorithms without re-implementing the core algorithm logic.

3. Algorithm library

This release continues the algorithm library work of previous releases, adding representative algorithms from several machine learning categories to verify the functionality and performance of the Flink ML infrastructure.

The following are the new algorithms added in this release:

  • Feature engineering: MinMaxScaler, StringIndexer, VectorAssembler, StandardScaler, Bucketizer
  • Online learning: OnlineKMeans, OnlineLogisticRegression
  • Regression: LinearRegression
  • Classification: LinearSVC
  • Evaluation: BinaryClassificationEvaluator

To help users learn and use the Flink ML algorithm library, we provide Python and Java example programs for each algorithm on the Apache Flink ML website [5], as well as a performance test configuration file [6] for each algorithm so that users can verify Flink ML's performance. Interested readers can read this document [4] to learn how to run the performance tests of these algorithms. A small end-to-end Java sketch of the shared Estimator/Model usage pattern follows below.
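For readers who have not used Flink ML before, the hedged Java sketch below shows the Estimator/Model pattern shared by the algorithms listed above, using LinearRegression as the example. Class, method, and column names follow the 2.1 example programs on the website [5] as we recall them; treat them as assumptions and refer to those examples for the authoritative version.

```java
import org.apache.flink.ml.linalg.Vectors;
import org.apache.flink.ml.regression.linearregression.LinearRegression;
import org.apache.flink.ml.regression.linearregression.LinearRegressionModel;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class LinearRegressionSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // A tiny training set where label is roughly x1 + 2 * x2.
        DataStream<Row> trainStream =
                env.fromElements(
                        Row.of(Vectors.dense(1.0, 1.0), 3.0),
                        Row.of(Vectors.dense(2.0, 1.0), 4.0),
                        Row.of(Vectors.dense(3.0, 2.0), 7.0),
                        Row.of(Vectors.dense(4.0, 3.0), 10.0));
        Table trainTable = tEnv.fromDataStream(trainStream).as("features", "label");

        // Estimator -> fit() -> Model -> transform(): the pattern used by the
        // algorithms listed above.
        LinearRegression lr =
                new LinearRegression().setFeaturesCol("features").setLabelCol("label");
        LinearRegressionModel model = lr.fit(trainTable);

        Table predictions = model.transform(trainTable)[0];
        predictions.execute().print();
    }
}
```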

Upgrade Instructions

For adjustments and checks that may be needed during the upgrade process, please refer to the official release announcement [7].

Release Notes and Related Resources

Users can view the release notes [8] for a detailed list of modifications and new features.
The source code can be obtained from the downloads page of the Flink website [1], and the latest Flink ML Python release can be obtained from PyPI [9].

Contributor list

The Apache Flink community thanks every contributor who contributed to this release:

Yunfeng Zhou, Zhipeng Zhang, huangxingbo, weibo, Dong Lin, Yun Gao, Jingsong Li and mumuhhh.

Reference links:

[1] https://flink.apache.org/downloads.html

[2] https://flink.apache.org/community.html#mailing-lists

[3] https://issues.apache.org/jira/browse/flink

[4] https://github.com/apache/flink-ml/blob/master/flink-ml-benchmark/README.md

[5] https://nightlies.apache.org/flink/flink-ml-docs-release-2.1/

[6] https://github.com/apache/flink-ml/tree/master/flink-ml-benchmark/src/main/resources

[7] https://flink.apache.org/news/2022/07/12/release-ml-2.1.0.html

[8] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351141

[9] https://pypi.org/project/apache-flink-ml


