1. Background
Flink 1.13 was officially released recently. More than 200 contributors took part in its development, submitting more than 1,000 commits and completing a number of important features. Among them, the PyFlink module gained several important capabilities in this release, such as support for state, custom windows and row-based operations. With these additions, PyFlink has become increasingly complete, and users can now implement most types of Flink jobs in Python. In the following, we describe in detail how to use the state & timer features in the Python DataStream API.
2. State function introduction
As a stream computing engine, state is one of Flink's core features.
- In 1.12, the Python DataStream API did not yet support state, so users could only build simple applications that do not require state;
- In 1.13, the Python DataStream API supports this important feature.
State usage example
The following is a simple example of how to use state in a Python DataStream API job:
from pyflink.common import WatermarkStrategy, Row
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import NumberSequenceSource
from pyflink.datastream.functions import RuntimeContext, MapFunction
from pyflink.datastream.state import ValueStateDescriptor


class MyMapFunction(MapFunction):

    def open(self, runtime_context: RuntimeContext):
        state_desc = ValueStateDescriptor('cnt', Types.LONG())
        # define a value state
        self.cnt_state = runtime_context.get_state(state_desc)

    def map(self, value):
        cnt = self.cnt_state.value()
        if cnt is None:
            cnt = 0
        new_cnt = cnt + 1
        self.cnt_state.update(new_cnt)
        return value[0], new_cnt


def state_access_demo():
    # 1. create a StreamExecutionEnvironment
    env = StreamExecutionEnvironment.get_execution_environment()

    # 2. create the data source
    seq_num_source = NumberSequenceSource(1, 100)
    ds = env.from_source(
        source=seq_num_source,
        watermark_strategy=WatermarkStrategy.for_monotonous_timestamps(),
        source_name='seq_num_source',
        type_info=Types.LONG())

    # 3. define the processing logic
    ds = ds.map(lambda a: Row(a % 4, 1), output_type=Types.ROW([Types.LONG(), Types.LONG()])) \
           .key_by(lambda a: a[0]) \
           .map(MyMapFunction(), output_type=Types.TUPLE([Types.LONG(), Types.LONG()]))

    # 4. print the results
    ds.print()

    # 5. execute the job
    env.execute()


if __name__ == '__main__':
    state_access_demo()
In the above example, we define a MapFunction, in which a ValueState named "cnt_state" records the number of occurrences of each key.
Description:
- In addition to ValueState, the Python DataStream API also supports ListState, MapState, ReducingState and AggregatingState;
- When defining a StateDescriptor, you need to declare the type (TypeInformation) of the data stored in the state. Note that this TypeInformation field is currently not used and pickle is used for serialization by default, so it is recommended to declare the TypeInformation field as Types.PICKLED_BYTE_ARRAY() to match the serializer actually used (see the sketch after this list). That way, backward compatibility is preserved once a later version starts honoring the TypeInformation;
- State can be used not only in KeyedStream's map operation, but also in other operations; in addition, state can also be used in connected streams, for example:
ds1 = ...  # type: DataStream
ds2 = ...  # type: DataStream
ds1.connect(ds2) \
   .key_by(key_selector1=lambda a: a[0], key_selector2=lambda a: a[0]) \
   .map(MyCoMapFunction())  # state can be used in MyCoMapFunction
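To make the two notes above concrete, here is a minimal sketch (the class name and state names are invented for this example, and it assumes the function runs on a keyed stream) that declares the stored types as Types.PICKLED_BYTE_ARRAY() and also creates a MapState:

from pyflink.common.typeinfo import Types
from pyflink.datastream.functions import MapFunction, RuntimeContext
from pyflink.datastream.state import MapStateDescriptor, ValueStateDescriptor


class MyStatefulFunction(MapFunction):

    def open(self, runtime_context: RuntimeContext):
        # declare the stored types as PICKLED_BYTE_ARRAY so that the declared
        # type matches the pickle-based serialization actually used today
        self.cnt_state = runtime_context.get_state(
            ValueStateDescriptor('cnt', Types.PICKLED_BYTE_ARRAY()))
        # a MapStateDescriptor needs both a key type and a value type
        self.mapping_state = runtime_context.get_map_state(
            MapStateDescriptor('mapping',
                               Types.PICKLED_BYTE_ARRAY(),
                               Types.PICKLED_BYTE_ARRAY()))

    def map(self, value):
        cnt = self.cnt_state.value() or 0
        self.cnt_state.update(cnt + 1)
        return value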
The APIs that currently support state are listed below:

Stream type | Operation | Custom function |
---|---|---|
KeyedStream | map | MapFunction |
KeyedStream | flat_map | FlatMapFunction |
KeyedStream | reduce | ReduceFunction |
KeyedStream | filter | FilterFunction |
KeyedStream | process | KeyedProcessFunction |
ConnectedStreams | map | CoMapFunction |
ConnectedStreams | flat_map | CoFlatMapFunction |
ConnectedStreams | process | KeyedCoProcessFunction |
WindowedStream | apply | WindowFunction |
WindowedStream | process | ProcessWindowFunction |
How state works
The figure above shows the architecture of state support in PyFlink. As we can see, the Python user-defined function runs in the Python worker process, while the state backend runs in the JVM process (managed by the Java operator). When the Python user-defined function needs to access state, it does so via remote calls to the state backend.
We know that remote calls are expensive. To improve the performance of state reads and writes, PyFlink applies the following optimizations:
Lazy Read:
For state that contains multiple entries, such as MapState, the state data is not read into the Python worker all at once when the state is traversed; instead, each entry is read from the state backend only when it is actually accessed.
Async Write:
When state is updated, the update is first stored in an LRU cache rather than being written to the remote state backend synchronously. This avoids accessing the remote state backend for every single update; at the same time, multiple updates to the same key can be merged, avoiding unnecessary writes.
LRU cache:
A state read/write cache is maintained in the Python worker process. When a key is read, the cache is checked first to see whether it has already been loaded; when a key is updated, the update is first stored in the write cache. For keys that are read and written frequently, the LRU cache avoids a remote state backend access per operation, which can greatly improve state read/write performance for workloads with hot keys.
Flush on Checkpoint:
To guarantee correct checkpoint semantics, when the Java operator needs to take a checkpoint, all write caches in the Python worker are flushed back to the state backend.
The LRU cache can be subdivided into two levels, as shown in the following figure:
Description:
- The second-level cache is a global cache. Its read cache stores all the raw (not yet deserialized) state data currently cached in the Python worker process; its write cache stores all the state objects that have been created in the current Python worker process.
- The first-level cache lives inside each state object. It caches the state data that the state object has already read from the remote state backend, as well as the state data that is waiting to be written back to the remote state backend.
Workflow:
- When a state object is created in a Python UDF, the worker first checks whether a state object for the current key already exists (by looking it up in the "global write cache" of the second-level cache). If it exists, the corresponding state object is returned; if not, a new state object is created and stored in the "global write cache";
- State read: when a state object is read in a Python UDF, if the requested data is already in the first-level cache (for example, for MapState, the requested map key/map value is already present), it is returned directly; otherwise the second-level cache is consulted, and if the data is not there either, it is read from the remote state backend;
- State write: when a state object is updated in a Python UDF, the update is first written to the write cache inside the state object (first-level cache); when the amount of pending data in the state object exceeds a configured threshold, or when a checkpoint occurs, the pending data is written back to the remote state backend (see the illustrative sketch below).
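To make this workflow easier to follow, here is a toy model of the read/write path of a MapState-like object. It only illustrates the first-level caching idea described above; it is not PyFlink's actual implementation, and all class and method names are invented:

from collections import OrderedDict


class ToyBackend:
    """Stands in for the remote state backend (a remote call in real PyFlink)."""

    def __init__(self):
        self._data = {}

    def get(self, map_key):
        return self._data.get(map_key)

    def put(self, map_key, value):
        self._data[map_key] = value


class ToyMapState:
    """A simplified model of the first-level cache inside a state object."""

    def __init__(self, backend, read_cache_size=1000, write_cache_size=1000):
        self._backend = backend
        self._read_cache = OrderedDict()   # data already read from the backend (LRU)
        self._write_cache = {}             # pending updates not yet written back
        self._read_cache_size = read_cache_size
        self._write_cache_size = write_cache_size

    def get(self, map_key):
        # pending updates and previously read entries are served locally ...
        if map_key in self._write_cache:
            return self._write_cache[map_key]
        if map_key in self._read_cache:
            self._read_cache.move_to_end(map_key)
            return self._read_cache[map_key]
        # ... everything else requires a (remote) backend access
        value = self._backend.get(map_key)
        self._read_cache[map_key] = value
        if len(self._read_cache) > self._read_cache_size:
            self._read_cache.popitem(last=False)  # evict the least recently used entry
        return value

    def put(self, map_key, value):
        # updates go to the write cache first (async write)
        self._write_cache[map_key] = value
        if len(self._write_cache) >= self._write_cache_size:
            self.flush()  # threshold exceeded: write everything back

    def flush(self):
        # also called when a checkpoint is taken ("flush on checkpoint")
        for k, v in self._write_cache.items():
            self._backend.put(k, v)
        self._write_cache.clear()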
State performance tuning
As described in the previous section, PyFlink uses a variety of optimizations to improve state read/write performance. These optimizations can be tuned through the following parameters:
Configuration | Description |
---|---|
python.state.cache-size | The size of the read cache and the write cache in the Python worker (second-level cache). Note that the read cache and the write cache are independent, and it is currently not possible to configure their sizes separately. |
python.map-state.iterate-response-batch-size | The maximum number of entries read from the state backend and returned to the Python worker in each batch when iterating over a MapState. |
python.map-state.read-cache-size | The maximum number of entries allowed in the read cache of a single MapState (first-level cache). When this threshold is exceeded, the least recently accessed entries are evicted from the read cache using the LRU strategy. |
python.map-state.write-cache-size | The maximum number of pending updates allowed in the write cache of a single MapState (first-level cache). When this threshold is exceeded, all pending state data of that MapState is written back to the remote state backend. |
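These are regular Flink configuration options, so they can be set in flink-conf.yaml like any other option; when the job also creates a StreamTableEnvironment (as in the timer example below), one way to set them in code is through its configuration object. A sketch, with purely illustrative values:

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(stream_execution_environment=env)

config = t_env.get_config().get_configuration()
# illustrative values only; the defaults are a reasonable starting point
config.set_integer("python.state.cache-size", 4000)
config.set_integer("python.map-state.iterate-response-batch-size", 1000)
config.set_integer("python.map-state.read-cache-size", 1000)
config.set_integer("python.map-state.write-cache-size", 1000)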
It should be noted that state read/write performance does not depend only on the above parameters; it is also affected by other factors, such as:
Distribution of keys in the input data:
The more scattered the keys of the input data, the lower the probability of a read-cache hit, and the worse the performance.
Number of state reads and writes in the Python UDF:
State reads and writes may involve accessing the remote state backend, so the Python UDF should be implemented to avoid unnecessary state accesses as much as possible.
Checkpoint interval:
To guarantee correct checkpoint semantics, the Python worker writes all cached pending state updates back to the state backend whenever a checkpoint occurs. If the configured checkpoint interval is too small, the write cache cannot effectively reduce the amount of data written back to the state backend.
Bundle size / bundle time:
The Python operator currently sends input data to the Python worker in batches. When a batch has been processed, the pending state updates in the Python worker process are forced to be written back to the state backend. Like the checkpoint interval, this behavior may also affect state write performance. The batch size can be controlled via the python.fn-execution.bundle.size and python.fn-execution.bundle.time parameters, as shown below.
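For example, continuing the configuration sketch above (values are again illustrative):

# process at most 1000 elements, or buffer at most 1000 ms, per bundle
config.set_integer("python.fn-execution.bundle.size", 1000)
config.set_integer("python.fn-execution.bundle.time", 1000)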
3. Timer function introduction
Timer usage example
In addition to state, users can also use timers in the Python DataStream API.
import datetime

from pyflink.common import Row, WatermarkStrategy
from pyflink.common.typeinfo import Types
from pyflink.common.watermark_strategy import TimestampAssigner
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor
from pyflink.table import StreamTableEnvironment


class CountWithTimeoutFunction(KeyedProcessFunction):

    def __init__(self):
        self.state = None

    def open(self, runtime_context: RuntimeContext):
        self.state = runtime_context.get_state(ValueStateDescriptor(
            "my_state", Types.ROW([Types.STRING(), Types.LONG(), Types.LONG()])))

    def process_element(self, value, ctx: 'KeyedProcessFunction.Context'):
        # retrieve the current count
        current = self.state.value()
        if current is None:
            current = Row(value.f1, 0, 0)

        # update the state's count
        current[1] += 1

        # set the state's timestamp to the record's assigned event time timestamp
        current[2] = ctx.timestamp()

        # write the state back
        self.state.update(current)

        # schedule the next timer 60 seconds from the current event time
        ctx.timer_service().register_event_time_timer(current[2] + 60000)

    def on_timer(self, timestamp: int, ctx: 'KeyedProcessFunction.OnTimerContext'):
        # get the state for the key that scheduled the timer
        result = self.state.value()

        # check if this is an outdated timer or the latest timer
        if timestamp == result[2] + 60000:
            # emit the state on timeout
            yield result[0], result[1]


class MyTimestampAssigner(TimestampAssigner):

    def __init__(self):
        self.epoch = datetime.datetime.utcfromtimestamp(0)

    def extract_timestamp(self, value, record_timestamp) -> int:
        return int((value[0] - self.epoch).total_seconds() * 1000)


if __name__ == '__main__':
    env = StreamExecutionEnvironment.get_execution_environment()
    t_env = StreamTableEnvironment.create(stream_execution_environment=env)

    t_env.execute_sql("""
            CREATE TABLE my_source (
              a TIMESTAMP(3),
              b VARCHAR,
              c VARCHAR
            ) WITH (
              'connector' = 'datagen',
              'rows-per-second' = '10'
            )
        """)

    stream = t_env.to_append_stream(
        t_env.from_path('my_source'),
        Types.ROW([Types.SQL_TIMESTAMP(), Types.STRING(), Types.STRING()]))

    watermarked_stream = stream.assign_timestamps_and_watermarks(
        WatermarkStrategy.for_monotonous_timestamps()
                         .with_timestamp_assigner(MyTimestampAssigner()))

    # apply the process function onto a keyed stream
    watermarked_stream.key_by(lambda value: value[1]) \
        .process(CountWithTimeoutFunction()) \
        .print()

    env.execute()
In the above example, we define a KeyedProcessFunction that records the number of occurrences of each key. When a key has not been updated for more than 60 seconds, the key and its number of occurrences are emitted downstream.
In addition to event-time timers, users can also use processing-time timers, for example:
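The following is a minimal sketch of a KeyedProcessFunction that registers a processing-time timer instead (the class is invented for illustration):

from pyflink.datastream.functions import KeyedProcessFunction


class ProcessingTimeTimeoutFunction(KeyedProcessFunction):

    def process_element(self, value, ctx: 'KeyedProcessFunction.Context'):
        # schedule a processing-time timer 60 seconds from "now" (wall clock)
        now = ctx.timer_service().current_processing_time()
        ctx.timer_service().register_processing_time_timer(now + 60 * 1000)

    def on_timer(self, timestamp: int, ctx: 'KeyedProcessFunction.OnTimerContext'):
        # invoked when the processing-time timer fires; timestamp is the firing time
        yield timestamp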
How timers work
The workflow of a timer is as follows:
- Unlike state access, which uses a separate communication channel, timer registration messages are sent to the Java operator through the data channel;
- After the Java operator receives a timer registration message, it first checks the trigger time of the timer to be registered: if that time has already passed, the timer is triggered immediately; otherwise the timer is registered with the timer service of the Java operator;
- When a timer fires, the trigger message is sent to the Python worker through the data channel, and the Python worker calls the on_timer method of the user's Python UDF.
Note: since timer registration and trigger messages are transmitted asynchronously between the Java operator and the Python worker through the data channel, timers may not fire very punctually in some scenarios. For example, when a user registers a processing-time timer, the trigger message still has to travel through the data channel after the timer fires, so the on_timer callback in the Python UDF may only be invoked a few seconds later.
4. Summary
In this article, we introduced how to use state & timer in Python DataStream API jobs, how state & timer work under the hood, and how to tune their performance. Next, we will continue to publish articles in the PyFlink series to help PyFlink users gain a deeper understanding of PyFlink's features, application scenarios and best practices.
In addition, Alibaba Cloud's real-time computing ecosystem team is continuously recruiting outstanding big data talent (both interns and full-time hires). Our work covers:
- Real-time machine learning: supporting real-time feature engineering and collaboration with AI engines in machine-learning scenarios, building a real-time machine-learning standard based on Apache Flink and its ecosystem, and pushing scenarios such as search, recommendation, advertising and risk control toward full real-time processing;
- Big data + AI integration: including programming-language integration (PyFlink-related work), execution-engine integration (TF on Flink), and workflow & management integration (Flink AI Flow).