Author profile
Sheng Yufan, a development engineer at StreamNative and an Apache Pulsar and Apache Flink contributor. Before joining StreamNative, he worked on Flink development at the Alibaba Big Data Platform and Tencent Cloud, where he was a core committer and the project leader of Tencent Cloud's Project Barad. At StreamNative he is in charge of Pulsar-Flink and Pulsar-Spark development. He and his team contributed the Pulsar Source Connector to the Flink community; it was released in Apache Flink 1.14.0 and will be integrated further in subsequent releases.
Editor: Jipei@StreamNative, Apache Pulsar contributor.
Summary of this article
- Batch-stream integration is the future trend of data computing. Based on Apache Pulsar, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing on Apache Flink.
- StreamNative has contributed the Pulsar Source Connector to Flink 1.14.0. Users can use it to read data from Pulsar and ensure that each piece of data is processed exactly once.
- The latest Pulsar Flink Connector is based on Pulsar 2.8.0 and Flink 1.14, supports Pulsar transactions, and integrates the characteristics of the two systems more deeply.
Background
As data volumes grow day by day, processing data as event streams has become crucial. Apache Flink unifies batch and stream processing in a single computing engine and provides a consistent programming interface. Apache Pulsar (together with Apache BookKeeper) unifies data storage in a stream-oriented manner: in Pulsar, data is stored once and can be accessed both as a stream (through the pub-sub interface) and as segments (for batch processing). Pulsar solves the data-silo problem enterprises face when adopting different storage and messaging technologies.
Flink can read from and write to Pulsar brokers directly in real time as streams. At the same time, Flink can read Pulsar's underlying offline storage in batches, reading and writing BookKeeper contents directly. This combined support for streaming and batch access makes Pulsar and Flink naturally compatible partners. Combined, the two open source technologies can form a unified data architecture and provide the best solution for real-time, data-driven enterprises.
To integrate the capabilities of Pulsar and Flink and give users more powerful development tools, StreamNative developed and open sourced the Pulsar Flink Connector. After many rounds of polishing, the Pulsar Flink Connector has been merged into the Flink code repository and released in Flink 1.14.0!
The Pulsar Flink Connector provides flexible data processing based on Apache Pulsar and Apache Flink, allowing Apache Flink to read and write data in Apache Pulsar. With the Pulsar Flink Connector, companies can focus on business logic without worrying about storage.
The new Pulsar Flink Connector
Before this version, StreamNative had already released Pulsar Flink Connector 2.7. Why throw away the previous code and rebuild batch-stream integration from scratch? What refactorings were made in the new version?
New version changes
Split design
All data consumption is based on splits: a Reader is created per split to consume data. How do we map Pulsar messages onto splits? First, we abstract over the topic and create a TopicPartition instance for each partition. For a partitioned topic, one instance is created per partition; for a non-partitioned topic, there is only one instance, with a partition value of -1.
In Pulsar's exclusive, shared, and failover subscription modes, we wrap a topic partition as a split for Flink to consume. Besides the partition itself, the split carries two special pieces of state: the latest consumed message ID and the ID of the transaction currently being processed, which are used in Pulsar's different subscription modes respectively. In Pulsar's key_shared mode, an additional range layer is added when mapping a topic partition to splits.
The reasons for creating a split for each partition are:
- a Pulsar partition is actually a topic in its own right;
- a topic partition is in fact a sub-topic;
- Consumer.seek() can only be executed on a single topic.
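To make the abstraction concrete, here is a minimal conceptual sketch of the split structure described above; the class and field names are illustrative assumptions, not the connector's actual classes:

import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.transaction.TxnID;

// Conceptual sketch only: one split wraps one topic partition plus its consumption state.
class TopicPartition {
    String topic;    // e.g. "persistent://public/default/my-topic"
    int partition;   // partition index, or -1 for a non-partitioned topic
    // key_shared mode additionally attaches a key hash range here
}

class PulsarPartitionSplit {
    TopicPartition partition;        // the partition this split consumes
    MessageId latestConsumedId;      // last consumed message ID (exclusive/failover modes)
    TxnID uncommittedTransactionId;  // in-flight transaction ID (shared/key_shared modes)
}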
Enumerator design
The enumerator is responsible for split distribution and the subscription interface. Its design is divided into two parts: one is based on a topic list, where the enumerator queries Pulsar for information about a set of topics given by the user; the other is based on a topic pattern, where the enumerator queries the current topics, matches them against the regular expression, and creates splits.
In exclusive, key_shared, and failover modes, each split is assigned to exactly one reader, in a round-robin manner.
In shared mode, every split is assigned to every reader; in this mode, each reader consumes every partition of Pulsar.
Reader design
In exclusive and failover modes, the Reader is designed as follows: suppose a topic has three partitions. At the enumerator layer, three splits are created, one per partition. With a Flink parallelism of 3, three readers (Reader 0, 1, and 2) are generated to consume the splits, forming an exclusive consumption model. Failover mode uses the same consumption model as exclusive mode; both consume in order.
The reader used for this ordered consumption maintains the following state:
SortedMap<Long, Map<TopicPartition, MessageId>> cursorsToCommit
ConcurrentMap<TopicPartition, MessageId> cursorsOfFinishedSplits
ScheduledExecutorService cursorScheduler
In Pulsar's shared and key_shared modes, consumption is unordered. We neither want to consume sequentially nor acknowledge messages one by one, so we introduce transactions here: a transaction is opened, messages are acknowledged within that transaction, and the transaction is committed on checkpoint.
The unordered reader maintains the following state:
TransactionCoordinatorClient coordinatorClient
SortedMap<Long, List<TxnID>> transactionsToCommit
List<TxnID> transactionsOfFinishedSplits
Type system
Like Flink, Pulsar has its own type system.
Flink's type system:
- DeserializationSchema: decodes the raw data;
- TypeInformation: Flink's type descriptor; data is serialized and transmitted between Flink operators based on TypeInformation;
- TypeSerializer: the serializer instance created from TypeInformation.
In Pulsar:
- Schema: in Pulsar, Schema is the interface for client-side data serialization and deserialization;
- SchemaInfo: the SchemaInfo created from the Schema is transmitted to the broker, which uses it to check schema version compatibility and whether the schema can be upgraded. With SchemaInfo, brokers do not need to serialize and deserialize the data themselves;
- SchemaDefinition: used by the client to create a Schema instance.
Pulsar and Flink are therefore bridged at the type-system level, which results in the following two modes:
- Common mode: the Reader consumes data as raw bytes, a Flink DeserializationSchema parses it, and the DeserializationSchema's own TypeInformation is passed downstream. This is the same model Flink uses with other messaging systems.
- Pulsar-specific mode: the Reader consumes data as raw bytes, a Pulsar Schema decodes the data on the Flink side, and a TypeInformation usable in Flink is created automatically.
However, the second mode does not yet use Pulsar's built-in schema compatibility checks and validation; we will make use of this feature in the next version.
Version requirements
Flink currently provides only the Pulsar Source connector. Users can use it to read data from Pulsar and ensure that each piece of data is processed exactly once.
The connector supports Pulsar 2.7.0 and later, but because it uses Pulsar's transaction mechanism, we recommend using it with Pulsar 2.8.0 and later. For more about Pulsar's API compatibility design, read PIP-72.
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-pulsar_2.11</artifactId>
<version>1.14.0</version>
</dependency>
Read the documentation to learn how to add the connector to a Flink cluster.
Using the Pulsar Source Connector in Flink 1.14.0
The new Pulsar Source Connector has been merged into Flink's newly released 1.14.0. If you want to use the older SourceFunction-based implementation, or your Flink version is below 1.14, you can use the pulsar-flink connector maintained separately by StreamNative.
Construct an instance of Pulsar Source Connector
The Pulsar Source Connector provides a builder class to construct Source Connector instances. The code example below uses the builder to create a Source Connector that consumes data from the topic "persistent://public/default/my-topic" in Exclusive subscription mode, with the subscription name "my-subscription", decoding the binary message body as a UTF-8 string.
PulsarSource<String> pulsarSource = PulsarSource.builder()
.setServiceUrl(serviceUrl)
.setAdminUrl(adminUrl)
.setStartCursor(StartCursor.earliest())
.setTopics("my-topic")
.setDeserializationSchema(PulsarDeserializationSchema.flinkSchema(new SimpleStringSchema()))
.setSubscriptionName("my-subscription")
.setSubscriptionType(SubscriptionType.Exclusive)
.build();
env.fromSource(pulsarSource, WatermarkStrategy.noWatermarks(), "Pulsar Source");
If you use the builder class to construct the Pulsar Source Connector, you must provide the following properties:
- The address of Pulsar data consumption is provided by the setServiceUrl(String) method;
- Pulsar HTTP management address, provided by the setAdminUrl(String) method;
- Pulsar subscription name, provided by the setSubscriptionName(String) method;
- the topics or topic partitions to consume, see Specify the Topic/Topic partition for consumption below;
- a deserializer to decode Pulsar messages, see Parse the message (deserializer) below.
Specify the Topic/Topic partition for consumption
The Pulsar Source Connector provides two ways to subscribe to topics or topic partitions:
- Topic list: consume messages from all partitions of the listed topics, for example:
PulsarSource.builder().setTopics("some-topic1", "some-topic2")
// consume from partitions 0 and 2 of topic "topic-a"
PulsarSource.builder().setTopics("topic-a-partition-0", "topic-a-partition-2")
- Topic pattern: the connector uses the given regular expression to match all conforming topics, for example:
PulsarSource.builder().setTopicPattern("topic-*")
Abbreviated topic name
Since Pulsar 2.0, the complete topic name format has been {persistent|non-persistent}://tenant/namespace/topic. The connector does not require the complete topic name, because the topic type, tenant, and namespace all have default values.
Currently supported abbreviations:
- my-topic expands to persistent://public/default/my-topic
- my-tenant/my-namespace/my-topic expands to persistent://my-tenant/my-namespace/my-topic
⚠️ Note: for non-persistent topics, the connector does not support abbreviated names; non-persistent://public/default/my-topic cannot be abbreviated as non-persistent://my-topic.
Partition structure of subscribed topics
For Pulsar, a topic partition is also a kind of topic: Pulsar internally splits a partitioned topic into one non-partitioned topic per partition. For example, if a topic simple-string with 3 partitions is created under the tenant sample and namespace flink, you can see the following topics on Pulsar:
persistent://sample/flink/simple-string
persistent://sample/flink/simple-string-partition-0
persistent://sample/flink/simple-string-partition-1
persistent://sample/flink/simple-string-partition-2
This means users can consume the data in individual partitions directly through these child topics, instead of consuming all partitions through the parent topic. For example, PulsarSource.builder().setTopics("sample/flink/simple-string-partition-1", "sample/flink/simple-string-partition-2") consumes only the messages in partitions 1 and 2 of sample/flink/simple-string.
Configure Topic Regular Expression
As mentioned earlier, Pulsar topics can be persistent or non-persistent. When using a regular expression to consume data, the connector tries to parse the topic type from the expression. For example, PulsarSource.builder().setTopicPattern("non-persistent://my-topic*") parses out the non-persistent topic type. If the abbreviated topic name is used, the connector uses the default topic type, persistent.
If you want a regular expression to consume both persistent and non-persistent topics, you need to use RegexSubscriptionMode to define the topic type, for example: setTopicPattern("topic-*", RegexSubscriptionMode.AllTopics).
Parse the message (deserializer)
The deserializer is used to parse Pulsar messages. The connector uses PulsarDeserializationSchema to define the deserializer, configured via the setDeserializationSchema(PulsarDeserializationSchema) method of the builder class; it parses Pulsar's Message<byte[]> instances.
If you only care about the binary byte stream of the message body and need no other message attributes to parse the data, you can directly use a predefined PulsarDeserializationSchema. The Pulsar connector provides three kinds of predefined deserializers:
- Use Pulsar's Schema to parse the message.
// primitive types
PulsarDeserializationSchema.pulsarSchema(Schema)
// struct types (JSON, Protobuf, Avro, etc.)
PulsarDeserializationSchema.pulsarSchema(Schema, Class)
// key/value pairs
PulsarDeserializationSchema.pulsarSchema(Schema, Class, Class)
- Use Flink's DeserializationSchema to parse the message.
PulsarDeserializationSchema.flinkSchema(DeserializationSchema)
- Use Flink's TypeInformation to parse the message.
PulsarDeserializationSchema.flinkTypeInfo(TypeInformation, ExecutionConfig)
Pulsar's Message<byte[]> contains many additional attributes, such as the message key, message send time, message production time, and user-defined key-value pair attributes. You can access these attributes through the Message<byte[]> interface.
If you need to parse messages based on these additional attributes, you can implement the PulsarDeserializationSchema interface yourself. Make sure the TypeInformation returned by the PulsarDeserializationSchema.getProducedType() method is correct: Flink uses this TypeInformation to serialize the parsed results and pass them to downstream operators.
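For example, a minimal custom deserializer might emit (message key, message body) pairs. This is only a sketch assuming the Flink 1.14 interface shape; the class name is our own:

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.pulsar.source.reader.deserializer.PulsarDeserializationSchema;
import org.apache.flink.util.Collector;
import org.apache.pulsar.client.api.Message;
import java.nio.charset.StandardCharsets;

// Hypothetical example class: emits (message key, message body) pairs.
public class KeyedStringDeserializer implements PulsarDeserializationSchema<Tuple2<String, String>> {
    @Override
    public void deserialize(Message<byte[]> message, Collector<Tuple2<String, String>> out) {
        String key = message.getKey(); // may be null if the producer set no key
        String body = new String(message.getData(), StandardCharsets.UTF_8);
        out.collect(Tuple2.of(key, body));
    }

    @Override
    public TypeInformation<Tuple2<String, String>> getProducedType() {
        // Must describe the emitted records; Flink serializes results with this TypeInformation.
        return Types.TUPLE(Types.STRING, Types.STRING);
    }
}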
Subscription model
Pulsar supports four subscription modes: exclusive, shared, failover, and key_shared. In the current Pulsar connector, exclusive and failover behave the same from Flink's point of view: if one of Flink's readers fails, the connector hands all of its unconsumed data over to the other readers. By default, if no subscription type is specified, the connector uses the shared subscription type (SubscriptionType.Shared).
// 名为 "my-shared" 的共享订阅
PulsarSource.builder().setSubscriptionName("my-shared")
// 名为 "my-exclusive" 的独占订阅
PulsarSource.builder().setSubscriptionName("my-exclusive").setSubscriptionType(SubscriptionType.Exclusive)
If you want to use the key_shared subscription in the Pulsar connector, you need to provide a RangeGenerator instance. The RangeGenerator generates a set of hash ranges over message keys, and the connector consumes the data falling in the given ranges. The Pulsar connector also provides a UniformRangeGenerator, which divides the hash range evenly based on the parallelism of the Flink Source Connector.
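As a sketch under stated assumptions (whether the builder exposes a setRangeGenerator method depends on the connector version, so treat the call below as hypothetical and verify against your version's builder API):

// Hypothetical wiring; check the range-generator setter in your connector version.
PulsarSource.builder()
    .setSubscriptionName("my-key-shared")
    .setSubscriptionType(SubscriptionType.Key_Shared)
    .setRangeGenerator(new UniformRangeGenerator());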
Starting consumption position
The connector uses the setStartCursor(StartCursor) method to specify where consumption starts. The built-in start positions are:
- StartCursor.earliest(): start from the earliest message in the topic.
- StartCursor.latest(): start from the latest message in the topic.
- StartCursor.fromMessageId(MessageId): start from a given message.
- StartCursor.fromMessageId(MessageId, boolean): unlike the former, the boolean controls whether the given message itself is skipped or consumed again.
- StartCursor.fromMessageTime(long): start from a given message time.
Each message has a fixed sequence number, ordered within Pulsar, which contains information such as the ledger, entry, and partition used to locate the message in Pulsar's underlying storage. Pulsar calls this sequence number a MessageId, and users can create one with DefaultImplementation.newMessageId(long ledgerId, long entryId, int partitionIndex).
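For example, resuming from a known position might look like this (the ledger and entry values below are placeholders for IDs recorded earlier):

// Placeholder IDs; in practice they come from a previously stored MessageId.
MessageId position = DefaultImplementation.newMessageId(42L, 7L, -1);
PulsarSource.builder()
    .setStartCursor(StartCursor.fromMessageId(position, false)); // boolean: whether the given message itself is consumed again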
Boundedness
The Pulsar connector supports both streaming and batch consumption. By default, the connector consumes data as a stream and keeps consuming unless the task fails or is cancelled. You can use setBoundedStopCursor(StopCursor) to specify a position where consumption stops; in that case the connector consumes in batch mode, and the Flink task finishes once every topic partition has reached the stop position. A stop position can also be specified in streaming mode in the same way, just use the setUnboundedStopCursor(StopCursor) method instead. The built-in stop positions are:
- StopCursor.never(): never stop.
- StopCursor.latest(): stop at the latest message in the topic at the moment the connector starts.
- StopCursor.atMessageId(MessageId): stop at a given message, which is excluded from the result.
- StopCursor.afterMessageId(MessageId): stop after a given message, which is included in the result.
- StopCursor.atEventTime(long): stop at a given message event timestamp.
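Putting this together, a bounded (batch) read of a topic's current contents might look like the following sketch, reusing the builder settings from the earlier example:

// Sketch: consume from the earliest message up to the latest message present at startup, then finish.
PulsarSource<String> boundedSource = PulsarSource.builder()
    .setServiceUrl(serviceUrl)
    .setAdminUrl(adminUrl)
    .setSubscriptionName("my-batch-subscription")
    .setTopics("my-topic")
    .setDeserializationSchema(PulsarDeserializationSchema.flinkSchema(new SimpleStringSchema()))
    .setStartCursor(StartCursor.earliest())
    .setBoundedStopCursor(StopCursor.latest())
    .build();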
Other configuration items
In addition to the configuration options mentioned above, the connector provides a rich set of options for Pulsar experts. In the builder class, all configuration of the Pulsar client and the connector can be specified via setConfig(ConfigOption<T>, T) and setConfig(Configuration). For details, refer to the other configuration items section of the documentation.
Dynamic partition discovery
To discover partitions scaled out or topics newly created on Pulsar after the Flink task has started, the connector provides a dynamic partition discovery mechanism that does not require restarting the Flink task. To enable it, set a positive integer for the PulsarSourceOptions.PULSAR_PARTITION_DISCOVERY_INTERVAL_MS option as the discovery interval:
// query partition information every 10 seconds
PulsarSource.builder()
    .setConfig(PulsarSourceOptions.PULSAR_PARTITION_DISCOVERY_INTERVAL_MS, 10000L);
By default, the connector enables dynamic partition discovery with a query interval of 30 seconds. You can set a negative number to disable the feature. Partition discovery cannot be enabled when consuming data in batch mode.
Event time and watermark
By default, the connector uses the timestamp carried in Message<byte[]> as the timestamp of the parsed result. You can define a WatermarkStrategy to extract the desired message time yourself and pass the corresponding watermarks downstream:
env.fromSource(pulsarSource, new CustomWatermarkStrategy(), "Pulsar Source With Custom Watermark Strategy")
For how to define a WatermarkStrategy, refer to the documentation.
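For instance, a strategy that tolerates five seconds of out-of-orderness while relying on the timestamps the connector attaches to each record could look like this sketch (the chosen strategy and values are our own, not the only option):

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// Sketch: bounded out-of-orderness of 5 seconds, using the record-attached timestamps.
WatermarkStrategy<String> strategy =
    WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(5));
env.fromSource(pulsarSource, strategy, "Pulsar Source With Custom Watermark Strategy");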
Message confirmation
Once a subscription is created on a topic, messages are retained in Pulsar; even if there are no consumers, messages are not discarded. A message is removed only after the connector acknowledges to Pulsar that it has been consumed. The connector supports four subscription modes, whose acknowledgment mechanisms differ considerably.
Exclusive and failover subscriptions
Under exclusive and failover subscriptions, the connector uses cumulative acknowledgment: acknowledging a message automatically marks all earlier messages as read. The Pulsar connector acknowledges the messages consumed up to a checkpoint when Flink completes that checkpoint, keeping Pulsar's state consistent with Flink's. If checkpointing is not enabled in Flink, the connector can commit the consumption state to Pulsar periodically, at an interval configured with PulsarSourceOptions.PULSAR_AUTO_COMMIT_CURSOR_INTERVAL.
Note that in this scenario the Pulsar connector does not rely on the state committed to Pulsar for fault tolerance; acknowledgment only makes the consumption progress visible on the Pulsar side.
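If checkpointing is off, the commit interval could be configured like this (the 30-second value is just an example):

// Example only: commit the consumption cursor to Pulsar every 30 seconds when checkpointing is disabled.
PulsarSource.builder()
    .setConfig(PulsarSourceOptions.PULSAR_AUTO_COMMIT_CURSOR_INTERVAL, 30000L);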
Shared and key_shared subscriptions
Shared and key_shared subscriptions need to acknowledge each message individually, so the connector performs acknowledgments inside a Pulsar transaction and then commits the transaction to Pulsar. To do this, transactions must first be enabled in broker.conf:
transactionCoordinatorEnabled=true
Transactions created by the connector have a default timeout of 3 hours; make sure this timeout is longer than the Flink checkpoint interval. You can set the transaction timeout with PulsarSourceOptions.PULSAR_TRANSACTION_TIMEOUT_MILLIS.
If you cannot enable Pulsar transactions, or checkpointing is disabled in your project, set the PulsarSourceOptions.PULSAR_ENABLE_AUTO_ACKNOWLEDGE_MESSAGE option to true; messages are then acknowledged immediately after being consumed from Pulsar. The connector cannot guarantee message consistency in this scenario. The connector uses logs on Pulsar to record message acknowledgments under a given transaction, so for better performance, shorten the Flink checkpoint interval.
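Under those stated assumptions, the two options could be set as follows (the one-hour timeout is an illustrative value):

import java.time.Duration;

// Illustrative values: the transaction timeout must stay above the checkpoint interval.
PulsarSource.builder()
    .setConfig(PulsarSourceOptions.PULSAR_TRANSACTION_TIMEOUT_MILLIS, Duration.ofHours(1).toMillis())
    // Fallback when transactions or checkpoints are unavailable; consistency guarantees are lost.
    .setConfig(PulsarSourceOptions.PULSAR_ENABLE_AUTO_ACKNOWLEDGE_MESSAGE, true);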
Upgrade and problem diagnosis
For upgrade steps, refer to Upgrading Applications and Flink Versions. The Pulsar connector does not store consumption state on the Flink side; all consumption information is pushed to Pulsar.
Notice:
- Do not upgrade the version of Pulsar connector and Pulsar server at the same time.
- Use the latest version of the Pulsar client to consume messages.
Flink only uses Pulsar's Java client and admin API. If you encounter problems when using Flink with Pulsar, it may well have nothing to do with Flink. First try upgrading the Pulsar and Pulsar client versions, or adjusting the Pulsar and Pulsar connector configuration, to resolve the issue.
Contact us
You are welcome to use the Pulsar Flink Connector and communicate with us to improve this batch-stream-integrated project together. The community has established a Pulsar Flink Connector SIG (Special Interest Group); scan the bot QR code below and reply "Flink" to join the SIG and communicate with the project developers.
Follow the WeChat public account "Apache Pulsar" for more quality content and news.