Technical exploration: Apache Pulsar's transactional event stream

About Apache Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation and a next-generation, cloud-native distributed messaging and streaming platform. It integrates messaging, storage, and lightweight function-based computing, and uses an architecture that separates compute from storage. It supports multi-tenancy, persistent storage, and data replication across data centers and regions, and offers strong consistency, high throughput, low latency, and high scalability for streaming data storage.
GitHub address: http://github.com/apache/pulsar/

Introduction: This article is an edited transcript of the talk "Technical Exploration: The Transactional Event Stream of Apache Pulsar", given at Pulsar Summit Asia 2020 by Congbo, a development engineer at StreamNative and an Apache Pulsar committer. The talk covers the principles behind Apache Pulsar transactions and the plans for them.

My name is Congbo, and I am a development engineer at StreamNative. My topic today is "Technical Exploration: The Transactional Event Stream of Apache Pulsar".

Message semantics

Every messaging system and streaming data platform defines delivery semantics for its messages. These generally fall into three types: at-most-once, at-least-once, and exactly-once.

  • At-most-once: a message is delivered at most one time. The sender does not check whether the send succeeded and does not need a response, so messages may be lost.
  • At-least-once: a message is delivered one or more times. Duplicates are allowed, but the message must not be lost.
  • Exactly-once: a message is neither lost nor duplicated.

At Most Once

Pulsar implemented at-most-once semantics before version 1.2.0.

At Least Once

Pulsar was designed around at-least-once semantics from the start. Retrying a failed send is the basic way to guarantee at-least-once delivery, but retries can produce duplicate messages. Some use cases require that the producer never writes a duplicate and that the consumer never processes a message twice, which is what exactly-once semantics provide.
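
The duplication caused by retries can be seen in a toy model. The sketch below is plain Java, not Pulsar client code; the class RetryDuplicateDemo, the method sendWithRetry, and the lostReceipts parameter are hypothetical names chosen for illustration. It assumes the broker persists every attempt while some send receipts are lost on the way back to the producer.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not Pulsar code): a topic that persists every message it receives,
// and a producer that retries until it sees a send receipt.
class RetryDuplicateDemo {

    /** Returns the topic contents after sending one payload; the first `lostReceipts` receipts never arrive. */
    static List<String> sendWithRetry(String payload, int lostReceipts) {
        List<String> topic = new ArrayList<>();
        int attempt = 0;
        boolean receiptSeen = false;
        while (!receiptSeen) {
            topic.add(payload);                    // the broker persists this attempt
            receiptSeen = attempt >= lostReceipts; // the receipt only arrives once the lost ones are used up
            attempt++;
        }
        return topic;
    }

    public static void main(String[] args) {
        System.out.println(sendWithRetry("transfer-1", 0)); // [transfer-1]
        System.out.println(sendWithRetry("transfer-1", 2)); // [transfer-1, transfer-1, transfer-1]
    }
}
```

The message is never lost, which is the at-least-once guarantee, but every lost receipt leaves one extra copy in the topic.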

Exactly Once

Achieving exactly-once requires deduplication on both the producing side and the consuming side.

How does Pulsar deduplicate?

  • Producer: idempotent producer;
  • Broker: guaranteed message deduplication (PIP-6);
  • Consumer: Reader + checkpoints (Flink / Spark).

How to enable exactly-once?

Enable deduplication on the topic's namespace with the admin tool, then configure the producer:

  • bin/pulsar-admin namespaces set-deduplication -e tenant/namespace
  • Set the producer name and the sequence ID when creating the producer;
  • Specify an increasing sequence ID for each message produced.
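
Why a producer name and an increasing sequence ID are enough for deduplication can be sketched with a small model. This is not the broker's implementation: DedupBroker and its methods are hypothetical, and the sketch only assumes that the broker remembers the highest sequence ID it has persisted for each producer name and drops anything at or below it. In the Java client, these two values are supplied through producerName(...) on the producer builder and sequenceId(...) on the message builder.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of broker-side deduplication keyed by (producer name, sequence ID).
class DedupBroker {
    private final Map<String, Long> lastSequenceId = new HashMap<>();
    private final List<String> log = new ArrayList<>();

    /** Returns true if the message was persisted, false if it was dropped as a duplicate. */
    boolean publish(String producerName, long sequenceId, String payload) {
        long last = lastSequenceId.getOrDefault(producerName, -1L);
        if (sequenceId <= last) {
            return false; // already persisted: this is a retry
        }
        lastSequenceId.put(producerName, sequenceId);
        log.add(payload);
        return true;
    }

    List<String> log() {
        return log;
    }

    public static void main(String[] args) {
        DedupBroker broker = new DedupBroker();
        System.out.println(broker.publish("transfer-producer", 0, "a")); // true
        System.out.println(broker.publish("transfer-producer", 0, "a")); // false
        System.out.println(broker.publish("transfer-producer", 1, "b")); // true
        System.out.println(broker.log());                                // [a, b]
    }
}
```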

Limitations:

  • Effective only when producing to a single partition;
  • Applies only to producing a single message;
  • No atomicity when producing multiple messages to one partition or to several partitions;
  • The consumer must store the message ID together with its state, and look up the message ID when restoring that state.
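
The last limitation can be illustrated with a minimal sketch of a consumer that checkpoints its state together with the last processed message ID. All names here (CheckpointedSum, checkpoint, restore) are hypothetical, and a plain long stands in for a Pulsar MessageId, which is an assumption made only to keep the example self-contained.

```java
// Toy model of a consumer that stores its state together with the last message ID it processed.
class CheckpointedSum {
    private long lastProcessedId = -1; // stand-in for a Pulsar MessageId
    private long sum = 0;

    void process(long messageId, long value) {
        if (messageId <= lastProcessedId) {
            return; // replayed after a restore: already reflected in the state
        }
        sum += value;
        lastProcessedId = messageId;
    }

    /** Snapshot of {lastProcessedId, sum}; both must be saved together. */
    long[] checkpoint() {
        return new long[] {lastProcessedId, sum};
    }

    static CheckpointedSum restore(long[] snapshot) {
        CheckpointedSum consumer = new CheckpointedSum();
        consumer.lastProcessedId = snapshot[0];
        consumer.sum = snapshot[1];
        return consumer;
    }

    long sum() {
        return sum;
    }

    public static void main(String[] args) {
        CheckpointedSum consumer = new CheckpointedSum();
        consumer.process(0, 5);
        consumer.process(1, 7);
        CheckpointedSum restored = CheckpointedSum.restore(consumer.checkpoint());
        restored.process(1, 7); // replayed message, skipped
        restored.process(2, 3);
        System.out.println(restored.sum()); // 15
    }
}
```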

How Transaction handles events

A money transfer illustrates how transactions handle events in a streaming message system:

There are two people, Alice and Bob, and Alice wants to transfer ten dollars to Bob. How can this be implemented with Pulsar?

  • Transfer Topic: records transfer requests;
  • Cash Transfer Function: processes transfers;
  • BalanceUpdate Topic: records balance update requests.

Alice transfers money to Bob. When the Transfer Function receives the transfer message, it sends two messages to the BalanceUpdate Topic: one that increases Bob's balance by ten dollars and one that decreases Alice's balance by ten dollars. After both sends are confirmed, it acks the transfer message. This works as long as no operation fails. In practice, though, any of these operations can fail.

Figure 1

As shown in Figure 1, if the ack fails, the function consumes the transfer message again. As a result, Alice transfers ten dollars to Bob a second time, twenty dollars in total. If the ack keeps failing, Alice could end up deep in debt while Bob becomes a billionaire.

Figure 2

As shown in Figure 2, the message that increases Bob's balance fails to reach the BalanceUpdate Topic. As a result, Alice's balance decreases but Bob's balance does not increase.
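
Both failure modes can be reproduced with a small simulation. This is a sketch in plain Java, not Pulsar code; NonTransactionalTransfer, failedAcks, and bobUpdateLost are hypothetical names, and balances start at zero purely for illustration.

```java
import java.util.Arrays;

// Toy model of the transfer function without transactions.
class NonTransactionalTransfer {

    /** Returns {aliceBalance, bobBalance} after one transfer request has finally been acked. */
    static long[] run(long amount, int failedAcks, boolean bobUpdateLost) {
        long alice = 0;
        long bob = 0;
        int delivery = 0;
        boolean acked = false;
        while (!acked) {
            alice -= amount;                // "Alice -amount" reaches the BalanceUpdate Topic
            if (!bobUpdateLost) {
                bob += amount;              // "Bob +amount" reaches the BalanceUpdate Topic
            }
            acked = delivery >= failedAcks; // a failed ack causes the request to be redelivered
            delivery++;
        }
        return new long[] {alice, bob};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(10, 0, false))); // [-10, 10]  no failure
        System.out.println(Arrays.toString(run(10, 1, false))); // [-20, 20]  one failed ack (Figure 1)
        System.out.println(Arrays.toString(run(10, 0, true)));  // [-10, 0]   lost update for Bob (Figure 2)
    }
}
```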

Pulsar Transaction

How can Pulsar transactions solve this?

Transaction semantics:

  • Messages are written atomically across multiple partitions;
  • Messages are acknowledged atomically across multiple subscriptions;
  • All operations in a transaction succeed or fail together;
  • Consumers can read only committed messages.

How would the example above be implemented without the Transaction API?

// Receive the input message.
Message<String> message = inputConsumer.receive();

// Send one output message to each topic.
CompletableFuture<MessageId> sendFuture1 =
    producer1.newMessage().value("output-message-1").sendAsync();
CompletableFuture<MessageId> sendFuture2 =
    producer2.newMessage().value("output-message-2").sendAsync();

// Ack the input message. The two sends and the ack are independent operations.
inputConsumer.acknowledgeAsync(message.getMessageId());

As shown in Figure 3:

Figure 3

After the Input Consumer receives a message, Producer1 sends a message to topic1, Producer2 sends a message to topic2, and the received message is then acked.

Pulsar's Transaction API is simple and requires few changes to the original logic:

Message<String> message = inputConsumer.receive();
// Open a transaction with a timeout.
Transaction txn = client.newTransaction().withTransactionTimeout(…).build().get();

// Produce both output messages within the transaction.
CompletableFuture<MessageId> sendFuture1 =
    producer1.newMessage(txn).value("output-message-1").sendAsync();
CompletableFuture<MessageId> sendFuture2 =
    producer2.newMessage(txn).value("output-message-2").sendAsync();

// Ack the input message within the same transaction.
inputConsumer.acknowledgeAsync(message.getMessageId(), txn);

// Commit: the two sends and the ack take effect together.
txn.commit().get();

MessageId msgId1 = sendFuture1.get();
MessageId msgId2 = sendFuture2.get();

Pulsar transactions involve three components:

  • TC (Transaction Coordinator): manages transaction metadata.
  • TB (Transaction Buffer): handles messages produced within a transaction.
  • TP (Transaction Pending Ack): handles ack requests made within a transaction.

Figure 4

As shown in Figure 4, the creation of a transaction is recorded in the TC.

Figure 5

As shown in Figure 5, the Pulsar client has created Txn1 and asks the TC to register that Txn1 will send messages to Topic1 and Topic2. The TC records this metadata and responds to the client, which then sends a message to each of Topic1 and Topic2.

Figure 6

Figure 6 shows essentially the same flow as Figure 5. The only difference is that the operation is an ack rather than a send.

Figure 7
Figure 8

As shown in Figures 7 and 8, once all acks and produces have completed, the Pulsar client commits the transaction. When the TC receives the commit request, it changes the state of Txn1 to Committing and then processes Txn1's data in the TP and TB.

Figure 9

As shown in Figure 9, once the TP and TB have been processed, the TC changes the state of Txn1 to Committed.

The above is the complete life cycle of a Transaction.
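
The life cycle described above can be summarized in a single-process model. This is only a sketch under simplifying assumptions, not Pulsar's implementation: the class TxnLifecycleDemo and all of its fields are hypothetical, the TC, TB, and TP are collapsed into one object, and failures and aborts are left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of one transaction: TC state plus the data held by the TB and TP.
class TxnLifecycleDemo {
    enum State { OPEN, COMMITTING, COMMITTED }

    State state = State.OPEN;                            // tracked by the TC
    final Set<String> registered = new HashSet<>();      // TC metadata: topics and subscriptions in the txn
    final List<String> buffered = new ArrayList<>();     // TB: produced messages, not yet readable
    final List<String> pendingAcks = new ArrayList<>();  // TP: acks, not yet applied
    final List<String> visible = new ArrayList<>();      // messages consumers can read
    final List<String> acked = new ArrayList<>();        // acks that have taken effect

    void produce(String topic, String message) {
        requireOpen();
        registered.add(topic);                           // register with the TC, then send
        buffered.add(topic + ":" + message);
    }

    void ack(String subscription, String messageId) {
        requireOpen();
        registered.add(subscription);
        pendingAcks.add(subscription + ":" + messageId);
    }

    void commit() {
        requireOpen();
        state = State.COMMITTING;                        // TC received the commit request
        visible.addAll(buffered);                        // TB makes the messages readable
        buffered.clear();
        acked.addAll(pendingAcks);                       // TP applies the acks
        pendingAcks.clear();
        state = State.COMMITTED;                         // TB and TP are done
    }

    private void requireOpen() {
        if (state != State.OPEN) {
            throw new IllegalStateException("transaction is " + state);
        }
    }

    public static void main(String[] args) {
        TxnLifecycleDemo txn1 = new TxnLifecycleDemo();
        txn1.produce("Topic1", "m1");
        txn1.produce("Topic2", "m2");
        txn1.ack("sub", "input-1");
        System.out.println(txn1.visible + " " + txn1.state); // [] OPEN
        txn1.commit();
        System.out.println(txn1.visible + " " + txn1.state); // [Topic1:m1, Topic2:m2] COMMITTED
    }
}
```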

Let's look at the transfer example again:

Figure 10

With Pulsar transactions, the operations either all succeed or all fail, which keeps Alice's and Bob's balances correct.
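
The all-or-nothing behavior can be sketched as a simulation in which a failed attempt aborts and leaves no trace. This is plain Java rather than Pulsar code; TransactionalTransfer and failedAttempts are hypothetical names, and the model assumes that an aborted transaction discards its staged updates and that the transfer request is redelivered afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the transfer function when the sends and the ack share one transaction.
class TransactionalTransfer {

    /** Returns {aliceBalance, bobBalance} after the transfer finally commits. */
    static long[] run(long amount, int failedAttempts) {
        long[] balances = {0, 0};                       // index 0 = Alice, index 1 = Bob
        int attempt = 0;
        boolean committed = false;
        while (!committed) {
            List<long[]> staged = new ArrayList<>();    // updates buffered inside the transaction
            staged.add(new long[] {0, -amount});
            staged.add(new long[] {1, amount});
            boolean failed = attempt < failedAttempts;  // some send or ack failed in this attempt
            attempt++;
            if (failed) {
                continue;                               // abort: staged updates are discarded
            }
            for (long[] update : staged) {
                balances[(int) update[0]] += update[1]; // commit: all updates apply together
            }
            committed = true;
        }
        return balances;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(10, 0))); // [-10, 10]
        System.out.println(Arrays.toString(run(10, 3))); // [-10, 10]
    }
}
```

However many attempts fail, the balances change exactly once.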

Future plans for Pulsar transactions

Pulsar transactions are designed to make event streaming systems simpler and more reliable. In many business scenarios, they can reduce the amount of idempotent handling that the application has to implement itself.

The future development plan for Pulsar transactions is as follows:

  • Transaction support in other languages (e.g. C++, Go)
  • Transaction in Pulsar Functions & Pulsar IO
  • Transaction in Kafka-on-Pulsar (KOP)
  • Transaction for Flink / Spark job
  • Transaction for State storage in Pulsar Functions
