This article takes Flink 1.12 as an example to introduce how to develop Flink jobs in Python with the PyFlink API.
Apache Flink has become one of the most popular unified stream and batch computing engines, and is widely used in real-time ETL, event processing, data analysis, CEP, real-time machine learning, and other fields. Starting from Flink 1.9, the Apache Flink community began to provide Python support on top of the existing Java, Scala, and SQL APIs. After several releases (Flink 1.9 through 1.12, with 1.13 on the way), the PyFlink API has matured steadily and can meet the needs of Python users in most scenarios. In this article we take Flink 1.12 as an example to introduce how to develop Flink jobs in Python with the PyFlink API. It covers:
- Environment preparation
- Job development
- Job submission
- Troubleshooting
- Summary
GitHub address: https://github.com/apache/flink
Everyone is welcome to like and star the Flink project!
Environment preparation
Step 1: Install Python
PyFlink requires Python 3.5 or later. First confirm whether Python 3.5+ is already installed in your development environment; if not, install it before proceeding.
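For example, you can check the installed version from the command line (a quick sanity check, not part of the job code):

```bash
python3 --version
# Expect output such as "Python 3.7.x" or "Python 3.8.x" (anything >= 3.5)
```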
Step 2: Install JDK
The Flink runtime is implemented in Java, so a JDK is also required to execute Flink jobs. Flink fully supports JDK 8 and JDK 11. Confirm whether one of these JDK versions is installed in your development environment; if not, install a JDK first.
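Similarly, you can verify the JDK version as follows:

```bash
java -version
# Expect a version string for JDK 8 (1.8.x) or JDK 11
```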
Step 3: Install PyFlink
Next, install PyFlink with the following commands:
# Create a Python virtual environment
python3 -m pip install virtualenv
virtualenv -p `which python3` venv
# Activate the Python virtual environment created above
source ./venv/bin/activate
# Install PyFlink 1.12
python3 -m pip install apache-flink==1.12.2
Job development
PyFlink Table API job
We first introduce how to develop a PyFlink Table API job.
■ 1) Create a TableEnvironment object
For a Table API job, the user first needs to create a TableEnvironment object. The following example creates a TableEnvironment whose jobs run in streaming mode and are executed with the blink planner.
env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
t_env = StreamTableEnvironment.create(environment_settings=env_settings)
■ 2) Configure the execution parameters of the job
The execution parameters of a job can be configured as follows. The following example sets the default parallelism of the job to 4.
t_env.get_config().get_configuration().set_string('parallelism.default', '4')
■ 3) Create a data source table
Next, you need to create a data source table for the job. There are many ways to define data source tables in PyFlink.
Method 1: from_elements
PyFlink allows users to create a source table from a Python list. The following example defines a table containing 3 rows of data: [("hello", 1), ("world", 2), ("flink", 3)]. The table has 2 columns, named a and b, of types VARCHAR and BIGINT respectively.
tab = t_env.from_elements([("hello", 1), ("world", 2), ("flink", 3)], ['a', 'b'])
Description:
- This method is usually used in the testing phase; it lets you quickly create a source table to verify the job logic.
- The from_elements method accepts multiple parameters: the first specifies the data list, whose elements must be tuples; the second specifies the schema of the table.
Method 2: DDL
The data can also come from an external system. The following example defines a table named my_source backed by the datagen connector, with two fields of type VARCHAR.
t_env.execute_sql("""
CREATE TABLE my_source (
a VARCHAR,
b VARCHAR
) WITH (
'connector' = 'datagen',
'number-of-rows' = '10'
)
""")
tab = t_env.from_path('my_source')
Description:
- DDL is currently the recommended way to define source tables, and all connectors supported by the Java Table API & SQL can also be used in PyFlink Table API jobs through DDL. Please refer to the Flink official documentation [1] for the full connector list.
- Only some connector implementations, such as FileSystem, DataGen, Print, and BlackHole, are bundled in the official Flink distribution; most, such as Kafka and Elasticsearch, are not. To use a connector that is not bundled in the official distribution in a PyFlink job, you need to explicitly specify the corresponding fat JAR. For Kafka, the JAR package [2] can be used and specified as follows:
# Note: the file:/// prefix must not be omitted
t_env.get_config().get_configuration().set_string("pipeline.jars", "file:///my/jar/path/flink-sql-connector-kafka_2.11-1.12.0.jar")
Method 3: Catalog
hive_catalog = HiveCatalog("hive_catalog")
t_env.register_catalog("hive_catalog", hive_catalog)
t_env.use_catalog("hive_catalog")
# Assume a table named source_table has already been defined in the hive catalog
tab = t_env.from_path('source_table')
This method is similar to DDL, except that the table definition has already been registered in the catalog in advance and does not need to be redefined in the job.
■ 4) Define the calculation logic of the job
Method 1: Through Table API
After obtaining the source table, you can then use the various operations provided in the Table API to define the calculation logic of the job and perform various transformations on the table, such as:
@udf(result_type=DataTypes.STRING())
def sub_string(s: str, begin: int, end: int):
    return s[begin:end]

transformed_tab = tab.select(sub_string(col('a'), 2, 4))
Method 2: Through SQL statement
In addition to the operations provided in the Table API, you can also transform the table directly with SQL statements. For example, the logic above can also be expressed in SQL:
t_env.create_temporary_function("sub_string", sub_string)
transformed_tab = t_env.sql_query("SELECT sub_string(a, 2, 4) FROM %s" % tab)
Description:
- TableEnvironment provides several methods for executing SQL statements, and their uses differ slightly. For example, sql_query evaluates a SELECT query and returns a Table, while execute_sql executes a single statement (DDL, DML, or a query) and returns a TableResult.
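A minimal sketch of the difference (the sink name below is illustrative):

```python
# sql_query: evaluates a SELECT statement and returns a Table for further transformation
tab = t_env.sql_query("SELECT a, b FROM my_source")

# execute_sql: executes a single statement (here a DDL statement) and returns a TableResult
t_env.execute_sql(
    "CREATE TABLE print_sink (a VARCHAR) WITH ('connector' = 'print')")
```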
■ 5) View the execution plan
In the process of developing or debugging a job, the user may need to view the execution plan of the job, which can be done in the following ways.
Method 1: Table.explain
For example, to view the current execution plan of transformed_tab, execute print(transformed_tab.explain()) and you will get the following output:
== Abstract Syntax Tree ==
LogicalProject(EXPR$0=[sub_string($0, 2, 4)])
+- LogicalTableScan(table=[[default_catalog, default_database, Unregistered_TableSource_582508460, source: [PythonInputFormatTableSource(a)]]])
== Optimized Logical Plan ==
PythonCalc(select=[sub_string(a, 2, 4) AS EXPR$0])
+- LegacyTableSourceScan(table=[[default_catalog, default_database, Unregistered_TableSource_582508460, source: [PythonInputFormatTableSource(a)]]], fields=[a])
== Physical Execution Plan ==
Stage 1 : Data Source
content : Source: PythonInputFormatTableSource(a)
Stage 2 : Operator
content : SourceConversion(table=[default_catalog.default_database.Unregistered_TableSource_582508460, source: [PythonInputFormatTableSource(a)]], fields=[a])
ship_strategy : FORWARD
Stage 3 : Operator
content : StreamExecPythonCalc
ship_strategy : FORWARD
Method 2: TableEnvironment.explain_sql
Method 1 is suitable for viewing the execution plan of a particular table. Sometimes there is no ready-made Table object available; in that case you can explain a SQL statement directly, for example:
print(t_env.explain_sql("INSERT INTO my_sink SELECT * FROM %s " % transformed_tab))
The execution plan is as follows:
== Abstract Syntax Tree ==
LogicalSink(table=[default_catalog.default_database.my_sink], fields=[EXPR$0])
+- LogicalProject(EXPR$0=[sub_string($0, 2, 4)])
+- LogicalTableScan(table=[[default_catalog, default_database, Unregistered_TableSource_1143388267, source: [PythonInputFormatTableSource(a)]]])
== Optimized Logical Plan ==
Sink(table=[default_catalog.default_database.my_sink], fields=[EXPR$0])
+- PythonCalc(select=[sub_string(a, 2, 4) AS EXPR$0])
+- LegacyTableSourceScan(table=[[default_catalog, default_database, Unregistered_TableSource_1143388267, source: [PythonInputFormatTableSource(a)]]], fields=[a])
== Physical Execution Plan ==
Stage 1 : Data Source
content : Source: PythonInputFormatTableSource(a)
Stage 2 : Operator
content : SourceConversion(table=[default_catalog.default_database.Unregistered_TableSource_1143388267, source: [PythonInputFormatTableSource(a)]], fields=[a])
ship_strategy : FORWARD
Stage 3 : Operator
content : StreamExecPythonCalc
ship_strategy : FORWARD
Stage 4 : Data Sink
content : Sink: Sink(table=[default_catalog.default_database.my_sink], fields=[EXPR$0])
ship_strategy : FORWARD
■ 6) Write out the result data
Method 1: Through DDL
Similar to creating a data source table, you can also create a result table through DDL.
t_env.execute_sql("""
CREATE TABLE my_sink (
`sum` VARCHAR
) WITH (
'connector' = 'print'
)
""")
table_result = transformed_tab.execute_insert('my_sink')
Description:
- When print is used as the sink, the job results are printed to standard output. If you do not need to inspect the output, blackhole can also be used as the sink, as sketched below.
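A minimal sketch of a blackhole sink table (the table name is illustrative):

```python
t_env.execute_sql("""
    CREATE TABLE my_blackhole_sink (
      `sum` VARCHAR
    ) WITH (
      'connector' = 'blackhole'
    )
""")
```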
Method 2: collect
You can also use the collect method to collect the results of the table to the client and view them one by one.
table_result = transformed_tab.execute()
with table_result.collect() as results:
    for result in results:
        print(result)
Description:
- This method makes it easy to collect the results of the table to the client for inspection.
- Since the data is ultimately collected on the client, it is best to limit the number of rows; for example, transformed_tab.limit(10).execute() collects only 10 rows to the client (see the sketch below).
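A minimal sketch combining limit with collect (the row count is illustrative):

```python
# Collect at most 10 rows to the client and print them one by one
with transformed_tab.limit(10).execute().collect() as results:
    for result in results:
        print(result)
```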
Method 3: to_pandas
You can also use the to_pandas method to convert the result of the table into a pandas.DataFrame and inspect it.
result = transformed_tab.to_pandas()
print(result)
You can see the following output:
_c0
0 32
1 e6
2 8b
3 be
4 4f
5 b4
6 a6
7 49
8 35
9 6b
Description:
- This method is similar to collect: the results are also collected to the client, so it is best to limit the number of result rows, as in the sketch below.
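For example, a sketch that limits the number of rows converted to pandas (the row count is illustrative):

```python
# Convert at most 10 rows of the table into a pandas.DataFrame
print(transformed_tab.limit(10).to_pandas())
```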
■ 7) Summary
The complete job example is as follows:
from pyflink.table import DataTypes, EnvironmentSettings, StreamTableEnvironment
from pyflink.table.expressions import col
from pyflink.table.udf import udf


def table_api_demo():
    env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
    t_env = StreamTableEnvironment.create(environment_settings=env_settings)
    t_env.get_config().get_configuration().set_string('parallelism.default', '4')

    t_env.execute_sql("""
        CREATE TABLE my_source (
          a VARCHAR,
          b VARCHAR
        ) WITH (
          'connector' = 'datagen',
          'number-of-rows' = '10'
        )
    """)

    tab = t_env.from_path('my_source')

    @udf(result_type=DataTypes.STRING())
    def sub_string(s: str, begin: int, end: int):
        return s[begin:end]

    transformed_tab = tab.select(sub_string(col('a'), 2, 4))

    t_env.execute_sql("""
        CREATE TABLE my_sink (
          `sum` VARCHAR
        ) WITH (
          'connector' = 'print'
        )
    """)

    table_result = transformed_tab.execute_insert('my_sink')

    # 1) Wait for the job to finish. Needed for local execution; otherwise the script may exit
    #    before the job finishes, causing the mini-cluster to shut down prematurely.
    # 2) Remove this call when the job is submitted in detached mode to a remote cluster,
    #    e.g. YARN/Standalone/K8s.
    table_result.wait()


if __name__ == '__main__':
    table_api_demo()
The execution results are as follows:
4> +I(a1)
3> +I(b0)
2> +I(b1)
1> +I(37)
3> +I(74)
4> +I(3d)
1> +I(07)
2> +I(f4)
1> +I(7f)
2> +I(da)
PyFlink DataStream API job
■ 1) Create StreamExecutionEnvironment object
For DataStream API jobs, the user first needs to define a StreamExecutionEnvironment object.
env = StreamExecutionEnvironment.get_execution_environment()
■ 2) Configure the execution parameters of the job
The execution parameters of the job can be configured as follows. The following example sets the default parallelism of the job to 4.
env.set_parallelism(4)
■ 3) Create a data source
Next, you need to create a data source for the job. There are many ways to define data sources in PyFlink.
Method 1: from_collection
PyFlink allows users to create a data source from a Python list. The following example defines a data stream containing 3 rows of data: [(1, 'aaa|bb'), (2, 'bb|a'), (3, 'aaa|a')]. Each row has 2 fields, of types INT and STRING respectively.
ds = env.from_collection(
    collection=[(1, 'aaa|bb'), (2, 'bb|a'), (3, 'aaa|a')],
    type_info=Types.ROW([Types.INT(), Types.STRING()]))
Description:
- This method is usually used in the testing phase; it lets you easily create a data source.
- The from_collection method accepts two parameters: the first specifies the data list; the second specifies the type of the data.
Method 2: Use a DataStream connector
You can also use the connectors already supported in the PyFlink DataStream API. Note that only the Kafka connector is supported in 1.12.
deserialization_schema = JsonRowDeserializationSchema.builder() \
    .type_info(type_info=Types.ROW([Types.INT(), Types.STRING()])).build()

kafka_consumer = FlinkKafkaConsumer(
    topics='test_source_topic',
    deserialization_schema=deserialization_schema,
    properties={'bootstrap.servers': 'localhost:9092', 'group.id': 'test_group'})

ds = env.add_source(kafka_consumer)
Description:
- The Kafka connector is currently not bundled in the official Flink distribution. To use it in PyFlink jobs, you need to explicitly specify the corresponding fat JAR [2], which can be done as follows:
# Note: the file:/// prefix must not be omitted
env.add_jars("file:///my/jar/path/flink-sql-connector-kafka_2.11-1.12.0.jar")
- Even for PyFlink DataStream API jobs, it is recommended to use the fat JAR packaged for the Table & SQL connector, to avoid recursive dependency problems.
Method 3: Use a Table & SQL connector
The following example shows how to use a connector supported by Table & SQL as the source of a PyFlink DataStream API job.
t_env = StreamTableEnvironment.create(stream_execution_environment=env)
t_env.execute_sql("""
CREATE TABLE my_source (
a INT,
b VARCHAR
) WITH (
'connector' = 'datagen',
'number-of-rows' = '10'
)
""")
ds = t_env.to_append_stream(
    t_env.from_path('my_source'),
    Types.ROW([Types.INT(), Types.STRING()]))
Description:
- Since the PyFlink DataStream API currently has built-in support for only a small number of connectors, this is the recommended way to create the source tables used in PyFlink DataStream API jobs: every connector that can be used in the PyFlink Table API can then also be used in PyFlink DataStream API jobs.
- Note that the TableEnvironment must be created via StreamTableEnvironment.create(stream_execution_environment=env), so that the PyFlink DataStream API and the PyFlink Table API share the same StreamExecutionEnvironment object.
■ 4) Define calculation logic
After generating the DataStream object corresponding to the data source, you can then use the various operations defined in the PyFlink DataStream API to define the calculation logic and transform the DataStream object, such as:
def split(s):
    splits = s[1].split("|")
    for sp in splits:
        yield s[0], sp

ds = ds.map(lambda i: (i[0] + 1, i[1])) \
       .flat_map(split) \
       .key_by(lambda i: i[1]) \
       .reduce(lambda i, j: (i[0] + j[0], i[1]))
■ 5) Write out the result data
Method 1: print
You can call the print method on the DataStream object to print the result of the DataStream to standard output, such as:
ds.print()
Method 2: Use a DataStream connector
You can directly use the connectors already supported in the PyFlink DataStream API. Note that 1.12 provides sink support for the FileSystem, JDBC, and Kafka connectors. Take Kafka as an example:
serialization_schema = JsonRowSerializationSchema.builder() \
    .with_type_info(type_info=Types.ROW([Types.INT(), Types.STRING()])).build()

kafka_producer = FlinkKafkaProducer(
    topic='test_sink_topic',
    serialization_schema=serialization_schema,
    producer_config={'bootstrap.servers': 'localhost:9092', 'group.id': 'test_group'})

ds.add_sink(kafka_producer)
Description:
- The JDBC and Kafka connectors are currently not bundled in the official Flink distribution. To use them in PyFlink jobs, you need to explicitly specify the corresponding fat JAR; for Kafka, the JAR package [2] can be used and specified as follows:
# Note: the file:/// prefix must not be omitted
env.add_jars("file:///my/jar/path/flink-sql-connector-kafka_2.11-1.12.0.jar")
- As above, it is recommended to use the fat JAR packaged for the Table & SQL connector, to avoid recursive dependency problems.
Method 3: Use a Table & SQL connector
The following example shows how to use a connector supported by Table & SQL as the sink of a PyFlink DataStream API job.
# Variant 1: the element type of ds is Types.ROW
def split(s):
    splits = s[1].split("|")
    for sp in splits:
        yield Row(s[0], sp)

ds = ds.map(lambda i: (i[0] + 1, i[1])) \
       .flat_map(split, Types.ROW([Types.INT(), Types.STRING()])) \
       .key_by(lambda i: i[1]) \
       .reduce(lambda i, j: Row(i[0] + j[0], i[1]))

# Variant 2: the element type of ds is Types.TUPLE
def split(s):
    splits = s[1].split("|")
    for sp in splits:
        yield s[0], sp

ds = ds.map(lambda i: (i[0] + 1, i[1])) \
       .flat_map(split, Types.TUPLE([Types.INT(), Types.STRING()])) \
       .key_by(lambda i: i[1]) \
       .reduce(lambda i, j: (i[0] + j[0], i[1]))

# Write ds out to the sink
t_env.execute_sql("""
    CREATE TABLE my_sink (
      a INT,
      b VARCHAR
    ) WITH (
      'connector' = 'print'
    )
""")

table = t_env.from_data_stream(ds)
table_result = table.execute_insert("my_sink")
Description:
- Note that the result type of the ds object passed to t_env.from_data_stream(ds) must be a composite type, either Types.ROW or Types.TUPLE, which is why the result type of the flat_map operation must be declared explicitly in the job's calculation logic.
- The job must be submitted through the job submission methods provided in the PyFlink Table API (execute_insert in the example above).
- Since the PyFlink DataStream API currently has built-in support for only a small number of connectors, this is the recommended way to define the sinks used in PyFlink DataStream API jobs: every connector that can be used in the PyFlink Table API can then also be used as the sink of a PyFlink DataStream API job.
■ 6) Summary
The complete job example is as follows:
Method 1 (suitable for debugging):
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment


def data_stream_api_demo():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(4)

    ds = env.from_collection(
        collection=[(1, 'aaa|bb'), (2, 'bb|a'), (3, 'aaa|a')],
        type_info=Types.ROW([Types.INT(), Types.STRING()]))

    def split(s):
        splits = s[1].split("|")
        for sp in splits:
            yield s[0], sp

    ds = ds.map(lambda i: (i[0] + 1, i[1])) \
           .flat_map(split) \
           .key_by(lambda i: i[1]) \
           .reduce(lambda i, j: (i[0] + j[0], i[1]))

    ds.print()

    env.execute()


if __name__ == '__main__':
    data_stream_api_demo()
The execution results are as follows:
3> (2, 'aaa')
3> (2, 'bb')
3> (6, 'aaa')
3> (4, 'a')
3> (5, 'bb')
3> (7, 'a')
Method 2 (suitable for production jobs):
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment


def data_stream_api_demo():
    env = StreamExecutionEnvironment.get_execution_environment()
    t_env = StreamTableEnvironment.create(stream_execution_environment=env)
    env.set_parallelism(4)

    t_env.execute_sql("""
        CREATE TABLE my_source (
          a INT,
          b VARCHAR
        ) WITH (
          'connector' = 'datagen',
          'number-of-rows' = '10'
        )
    """)

    ds = t_env.to_append_stream(
        t_env.from_path('my_source'),
        Types.ROW([Types.INT(), Types.STRING()]))

    def split(s):
        splits = s[1].split("|")
        for sp in splits:
            yield s[0], sp

    ds = ds.map(lambda i: (i[0] + 1, i[1])) \
           .flat_map(split, Types.TUPLE([Types.INT(), Types.STRING()])) \
           .key_by(lambda i: i[1]) \
           .reduce(lambda i, j: (i[0] + j[0], i[1]))

    t_env.execute_sql("""
        CREATE TABLE my_sink (
          a INT,
          b VARCHAR
        ) WITH (
          'connector' = 'print'
        )
    """)

    table = t_env.from_data_stream(ds)
    table_result = table.execute_insert("my_sink")

    # 1) Wait for the job to finish. Needed for local execution; otherwise the script may exit
    #    before the job finishes, causing the mini-cluster to shut down prematurely.
    # 2) Remove this call when the job is submitted in detached mode to a remote cluster,
    #    e.g. YARN/Standalone/K8s.
    table_result.wait()


if __name__ == '__main__':
    data_stream_api_demo()
Job submission
Flink provides a variety of job deployment modes, such as local, standalone, YARN, and K8s, and PyFlink supports all of them. Please refer to the Flink official documentation [3] for more details.
local
Note: When using this method to execute a job, a minicluster will be started, and the job will be submitted to the minicluster for execution. This method is suitable for the job development stage.
Example: python3 table_api_demo.py
standalone
Note: When using this method to execute a job, the job will be submitted to a remote standalone cluster.
Example:
./bin/flink run --jobmanager localhost:8081 --python table_api_demo.py
YARN Per-Job
Note: When using this method to execute a job, the job will be submitted to a remote YARN cluster.
Example:
./bin/flink run --target yarn-per-job --python table_api_demo.py
K8s application mode
Note: When using this method to execute a job, the job will be submitted to the K8s cluster and executed in application mode.
Example:
./bin/flink run-application \
    --target kubernetes-application \
    --parallelism 8 \
    -Dkubernetes.cluster-id=<ClusterId> \
    -Dtaskmanager.memory.process.size=4096m \
    -Dkubernetes.taskmanager.cpu=2 \
    -Dtaskmanager.numberOfTaskSlots=4 \
    -Dkubernetes.container.image=<PyFlinkImageName> \
    --pyModule table_api_demo \
    --pyFiles file:///path/to/table_api_demo.py
Parameter Description
In addition to the parameters mentioned above, there are other PyFlink-specific parameters that can be used when submitting a job via flink run:
| Parameter | Description | Example |
| --- | --- | --- |
| -py / --python | Specifies the entry-point Python file of the job | -py file:///path/to/table_api_demo.py |
| -pym / --pyModule | Specifies the entry module of the job. Similar to --python, but usable when the job's Python files are packaged as a zip and cannot be specified via --python; more general than --python | -pym table_api_demo -pyfs file:///path/to/table_api_demo.py |
| -pyfs / --pyFiles | Specifies one or more Python files (.py/.zip, etc., separated by commas). They are added to the PYTHONPATH of the Python process during execution and can be accessed from Python user-defined functions | -pyfs file:///path/to/table_api_demo.py,file:///path/to/deps.zip |
| -pyarch / --pyArchives | Specifies one or more archive files (separated by commas). They are extracted into the working directory of the Python process during execution and can be accessed via relative paths | -pyarch file:///path/to/venv.zip |
| -pyexec / --pyExecutable | Specifies the path of the Python interpreter used during job execution | -pyarch file:///path/to/venv.zip -pyexec venv.zip/venv/bin/python3 |
| -pyreq / --pyRequirements | Specifies a requirements file that defines the job's dependencies | -pyreq requirements.txt |
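For example, a sketch combining several of these options when submitting to a YARN cluster (the paths and file names are illustrative):

```bash
./bin/flink run \
    --target yarn-per-job \
    --python table_api_demo.py \
    --pyFiles file:///path/to/deps.zip \
    --pyArchives file:///path/to/venv.zip \
    --pyExecutable venv.zip/venv/bin/python3
```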
Troubleshooting
When you first start developing PyFlink jobs, you will inevitably run into problems of various kinds, so knowing how to troubleshoot them is important. Below we introduce some common troubleshooting methods.
Exception output on the client side
PyFlink jobs follow the same submission process as other Flink jobs: the job is first compiled into a JobGraph on the client side and then submitted to the Flink cluster for execution. If the job fails to compile, an exception is thrown when the job is submitted on the client side, and you will see output similar to the following:
Traceback (most recent call last):
File "/Users/dianfu/code/src/github/pyflink-usecases/datastream_api_demo.py", line 50, in <module>
data_stream_api_demo()
File "/Users/dianfu/code/src/github/pyflink-usecases/datastream_api_demo.py", line 45, in data_stream_api_demo
table_result = table.execute_insert("my_")
File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/pyflink/table/table.py", line 864, in execute_insert
return TableResult(self._j_table.executeInsert(table_path, overwrite))
File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/py4j/java_gateway.py", line 1285, in __call__
return_value = get_return_value(
File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/pyflink/util/exceptions.py", line 162, in deco
raise java_exception
pyflink.util.exceptions.TableException: Sink `default_catalog`.`default_database`.`my_` does not exists
at org.apache.flink.table.planner.delegation.PlannerBase.translateToRel(PlannerBase.scala:247)
at org.apache.flink.table.planner.delegation.PlannerBase$$anonfun$1.apply(PlannerBase.scala:159)
at org.apache.flink.table.planner.delegation.PlannerBase$$anonfun$1.apply(PlannerBase.scala:159)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.flink.table.planner.delegation.PlannerBase.translate(PlannerBase.scala:159)
at org.apache.flink.table.api.internal.TableEnvironmentImpl.translate(TableEnvironmentImpl.java:1329)
at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:676)
at org.apache.flink.table.api.internal.TableImpl.executeInsert(TableImpl.java:572)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.api.python.shaded.py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at org.apache.flink.api.python.shaded.py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at org.apache.flink.api.python.shaded.py4j.Gateway.invoke(Gateway.java:282)
at org.apache.flink.api.python.shaded.py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at org.apache.flink.api.python.shaded.py4j.commands.CallCommand.execute(CallCommand.java:79)
at org.apache.flink.api.python.shaded.py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Process finished with exit code 1
For example, the error above indicates that the table named "my_" used in the job does not exist.
TaskManager log file
Some errors only occur while the job is running, for example dirty data or bugs in the implementation of a Python user-defined function. For these errors you usually need to check the TaskManager log files. The following error, for example, shows that the opencv library accessed in a Python user-defined function is not installed.
Caused by: java.lang.RuntimeException: Error received from SDK harness for instruction 2: Traceback (most recent call last):
File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 253, in _execute
response = task()
File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 310, in <lambda>
lambda: self.create_worker().do_instruction(request), request)
File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 479, in do_instruction
return getattr(self, request_type)(
File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 515, in process_bundle
bundle_processor.process_bundle(instruction_id))
File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 977, in process_bundle
input_op_by_transform_id[element.transform_id].process_encoded(
File "/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 218, in process_encoded
self.output(decoded_value)
File "apache_beam/runners/worker/operations.py", line 330, in apache_beam.runners.worker.operations.Operation.output
File "apache_beam/runners/worker/operations.py", line 332, in apache_beam.runners.worker.operations.Operation.output
File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
File "pyflink/fn_execution/beam/beam_operations_fast.pyx", line 71, in pyflink.fn_execution.beam.beam_operations_fast.FunctionOperation.process
File "pyflink/fn_execution/beam/beam_operations_fast.pyx", line 85, in pyflink.fn_execution.beam.beam_operations_fast.FunctionOperation.process
File "pyflink/fn_execution/coder_impl_fast.pyx", line 83, in pyflink.fn_execution.coder_impl_fast.DataStreamFlatMapCoderImpl.encode_to_stream
File "/Users/dianfu/code/src/github/pyflink-usecases/datastream_api_demo.py", line 26, in split
import cv2
ModuleNotFoundError: No module named 'cv2'
at org.apache.beam.runners.fnexecution.control.FnApiControlClient$ResponseStreamObserver.onNext(FnApiControlClient.java:177)
at org.apache.beam.runners.fnexecution.control.FnApiControlClient$ResponseStreamObserver.onNext(FnApiControlClient.java:157)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:251)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.Contexts$ContextualizedServerCallListener.onMessage(Contexts.java:76)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:309)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:292)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:782)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Description:
- In local mode, the TaskManager log is located in the PyFlink installation directory under site-packages/pyflink/log/. It can also be located with the following commands:
>>> import pyflink
>>> print(pyflink.__path__)
['/Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/pyflink']
In this case, the log files are located in the /Users/dianfu/venv/pyflink-usecases/lib/python3.8/site-packages/pyflink/log directory.
Custom log
Sometimes the content of the exception log is not enough to locate the problem. In that case, consider printing log messages from within the Python user-defined function. PyFlink supports logging via Python's logging module in user-defined functions, for example:
def split(s):
    import logging
    logging.info("s: " + str(s))
    splits = s[1].split("|")
    for sp in splits:
        yield s[0], sp
With the above, the input parameters of the split function are written to the TaskManager's log file.
Remote debugging
While a PyFlink job is running, an independent Python process is started to execute the Python user-defined functions. Therefore, debugging a Python user-defined function requires remote debugging; refer to [4] to learn how to perform Python remote debugging in PyCharm.
1) Install pydevd-pycharm in the Python environment:
pip install pydevd-pycharm~=203.7717.65
2) Set the remote debugging parameters in the Python custom function:
def split(s):
    import pydevd_pycharm
    pydevd_pycharm.settrace('localhost', port=6789, stdoutToServer=True, stderrToServer=True)
    splits = s[1].split("|")
    for sp in splits:
        yield s[0], sp
3) Follow the remote-debugging steps in PyCharm; you can refer to [4], or to the "Code Debugging" section of the blog post [5].
Note: Python remote debugging is only supported in the Professional edition of PyCharm.
Community user mailing list
If the problem is still not solved after the above steps, you can subscribe to the Flink user mailing list [6] and send your question there. When sending a question to the mailing list, describe the problem as clearly as possible, ideally with code and data that reproduce it; see this email [7] for an example.
Summary
In this article we covered environment preparation, job development, job submission, and troubleshooting for PyFlink API jobs, hoping to help users quickly build Flink jobs in Python. We will continue the PyFlink article series to help PyFlink users learn about its features, application scenarios, and best practices.
We have also launched a questionnaire and hope you will take part to help us better organize PyFlink learning materials. After completing the questionnaire you can join a lucky draw for a Flink-branded polo shirt; the draw takes place at 12:00 noon on April 30.
Reference link
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/connectors/
[4] https://www.jetbrains.com/help/pycharm/remote-debugging-with-product.html#remote-debug-config
[6] https://flink.apache.org/community.html#mailing-lists