Original: Sijie Guo

Translation: Zhai Jia

In the previous article, we explained why Apache Pulsar qualifies as an enterprise-grade streaming and messaging system. Pulsar's enterprise features include persistent message storage, multi-tenancy, geo-replication across multiple data centers, encryption, and security. One question we are often asked is how Apache Pulsar differs from Apache Kafka.

In this series of articles comparing Pulsar and Kafka, we walk through some important concerns in a messaging system, such as robustness, high availability, high bandwidth, and low latency.

When users choose a messaging system, the messaging model is the first thing they consider. The messaging model should cover the following three aspects:

  1. Message consumption: how messages are sent and consumed;
  2. Message acknowledgment (ack): how messages are acknowledged;
  3. Message retention: how long messages are kept, what triggers their deletion, and how they are deleted.

Message consumption model

In real-time streaming architectures, messaging can be divided into two categories: queue and stream.

Queue model

The queue model mainly consumes messages in an unordered or shared manner. With the queue model, users can create multiple consumers that receive messages from a single pipe; when a message is sent from the queue, only one of those consumers (any of them) receives and consumes it. Which consumer actually receives the message depends on the implementation of the messaging system.

The queue model is usually used with stateless applications. Stateless applications do not care about ordering, but they do need to acknowledge (ack) or delete individual messages, and to scale consumption parallelism as far as possible. Typical messaging systems based on the queue model include RabbitMQ and RocketMQ.

Stream model

In contrast, the stream model requires strictly ordered or exclusive message consumption. With the stream model, a pipe always has exactly one consumer. That consumer receives messages in the exact order in which they were written to the pipe.

The stream model is usually associated with stateful applications. Stateful applications care about the order of messages and about their own state. The order in which messages are consumed determines the state of a stateful application, so message order affects the correctness of the application's processing logic.

In a microservice-oriented or event-driven architecture, both the queue model and the stream model are necessary.

Pulsar's message consumption model

Apache Pulsar uses the "subscription" abstraction to provide a unified producer - topic - subscription - consumer consumption model. Pulsar's messaging model supports both the queue model and the stream model.

In Pulsar's message consumption model, a topic is the channel through which messages are sent. Each topic corresponds to a distributed log in Apache BookKeeper. Each message published by a producer is stored only once in the topic; during storage, BookKeeper replicates the message to multiple storage nodes. Each message in the topic can be consumed multiple times, as required by consumers' subscriptions, and each subscription corresponds to a consumer group.

The topic is the source of truth for message consumption. Although each message is stored only once in the topic, users can consume these messages through different subscriptions:

  • Consumers are grouped together to consume messages, and each consumer group is a subscription.
  • Each topic can have different consumer groups.
  • Each group of consumers is a subscription to the topic.
  • Each group of consumers can have its own consumption mode: Exclusive, Failover, or Shared.

Through this model, Pulsar combines the queue model and the stream model behind a unified API. This model does not affect the performance of the messaging system, nor does it add overhead. It also gives users more flexibility, so that an application can use the messaging system in the mode that fits it best.
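The relationship between a topic and its subscriptions can be sketched with a small in-memory model. This is only an illustration of the idea that messages are stored once while each subscription tracks its own position; the `Topic` class and its methods are hypothetical names for this sketch and are not Pulsar's API or implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Toy topic: messages are stored once; each subscription keeps its own position."""
    messages: list = field(default_factory=list)
    positions: dict = field(default_factory=dict)  # subscription name -> next index to read

    def publish(self, message):
        # The message is appended to the single shared log exactly once.
        self.messages.append(message)

    def subscribe(self, name):
        # Assumption of this sketch: a new subscription starts at the beginning of the log.
        self.positions.setdefault(name, 0)

    def receive(self, name):
        """Return the next message for a subscription, or None if it is caught up."""
        index = self.positions[name]
        if index >= len(self.messages):
            return None
        self.positions[name] = index + 1
        return self.messages[index]


topic = Topic()
topic.subscribe("A")
topic.subscribe("B")
for m in ["m0", "m1", "m2"]:
    topic.publish(m)

a_first = topic.receive("A")   # subscription A reads m0
a_second = topic.receive("A")  # subscription A reads m1
b_first = topic.receive("B")   # subscription B independently starts at m0
```

Subscription B still starts from m0 even though A has moved on, while the log itself holds only one copy of each message.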

Exclusive subscription (stream model)

As the name implies, in an exclusive subscription there is exactly one consumer in the consumer group (subscription) consuming the messages in the topic at any time. The figure below is an example of an exclusive subscription. In this example, an active consumer A-0 is attached to subscription A, and messages m0 to m4 are delivered in order and consumed by A-0. If another consumer A-1 tries to attach to subscription A, it is rejected.
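The "one and only one consumer" rule can be pictured with a minimal sketch. `ExclusiveSubscription` is a hypothetical class written for this example, and raising `RuntimeError` is an assumption of the sketch rather than the behavior of any real client library.

```python
class ExclusiveSubscription:
    """Toy exclusive subscription: at most one consumer may be attached."""

    def __init__(self, name):
        self.name = name
        self.consumer = None

    def attach(self, consumer):
        # Reject any second consumer while the first is still attached.
        if self.consumer is not None:
            raise RuntimeError(
                f"subscription {self.name} already has consumer {self.consumer}"
            )
        self.consumer = consumer


sub_a = ExclusiveSubscription("A")
sub_a.attach("A-0")
try:
    sub_a.attach("A-1")
    rejected = False
except RuntimeError:
    rejected = True
```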

Failover subscription (stream model)

With a failover subscription, multiple consumers can attach to the same subscription. However, only one of them is selected as the primary consumer of the subscription. The other consumers are designated as failover consumers.

When the primary consumer disconnects, the partition is reassigned to one of the failover consumers, which becomes the new primary consumer. When this happens, all unacknowledged messages are delivered to the new primary consumer. This is similar to a consumer partition rebalance in Apache Kafka.

The figure below is an example of a failover subscription. Consumers B-0 and B-1 consume messages through subscription B. B-0 is the primary consumer and receives all messages. B-1 is a failover consumer and takes over consumption if B-0 fails.
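The takeover described above can be modeled in a few lines. This is a toy sketch, not Pulsar's implementation: `FailoverSubscription` is a hypothetical class, and treating the first attached consumer as the primary is an assumption made for simplicity.

```python
class FailoverSubscription:
    """Toy failover subscription: only the primary consumer receives messages."""

    def __init__(self):
        self.consumers = []   # assumption: the first attached consumer is the primary
        self.unacked = []     # delivered but not yet acknowledged, in delivery order
        self.delivered = {}   # consumer name -> messages it has received

    def attach(self, name):
        self.consumers.append(name)
        self.delivered.setdefault(name, [])

    @property
    def primary(self):
        return self.consumers[0] if self.consumers else None

    def dispatch(self, message):
        # Every message goes to the current primary and is tracked until acked.
        self.unacked.append(message)
        self.delivered[self.primary].append(message)

    def ack(self, message):
        self.unacked.remove(message)

    def disconnect(self, name):
        was_primary = name == self.primary
        self.consumers.remove(name)
        if was_primary and self.primary is not None:
            # The new primary receives everything the old primary never acknowledged.
            self.delivered[self.primary].extend(self.unacked)


sub = FailoverSubscription()
sub.attach("B-0")
sub.attach("B-1")
for m in ["m0", "m1", "m2"]:
    sub.dispatch(m)
sub.ack("m0")          # B-0 acknowledges only m0
sub.disconnect("B-0")  # B-1 becomes primary and gets m1 and m2 again
sub.dispatch("m3")
```

In this run B-1 ends up with m1, m2, and m3 in order, which is why failover subscriptions still fit the stream model.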

Shared subscription (queue model)

With a shared subscription, users can attach as many consumers to the same subscription as the application requires. Messages in the subscription are distributed to these consumers in round-robin fashion, and each message is delivered to only one consumer.

When a consumer disconnects, all messages that were delivered to it but not acknowledged are rescheduled for delivery to the remaining consumers on the subscription.

The figure below is an example of a shared subscription. Consumers C-1, C-2 and C-3 all consume messages on the same topic. Each consumer receives approximately 1/3 of all messages.

To increase consumption speed, users do not need to increase the number of partitions; they only need to add more consumers to the same subscription.
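Round-robin delivery and redelivery after a disconnect can be sketched as follows. `SharedSubscription` is a hypothetical toy class; a real broker's dispatch also depends on details such as consumer flow control, which this sketch ignores.

```python
from collections import deque


class SharedSubscription:
    """Toy shared subscription: round-robin dispatch, redelivery on disconnect."""

    def __init__(self, consumers):
        self.consumers = deque(consumers)                 # rotation order
        self.pending = {name: [] for name in consumers}   # delivered, not yet acked

    def dispatch(self, message):
        # Deliver to the consumer at the head of the rotation, then rotate.
        name = self.consumers[0]
        self.consumers.rotate(-1)
        self.pending[name].append(message)
        return name

    def ack(self, name, message):
        self.pending[name].remove(message)

    def disconnect(self, name):
        # Assumes at least one consumer remains to take over the orphaned messages.
        self.consumers.remove(name)
        orphaned = self.pending.pop(name)
        return [(m, self.dispatch(m)) for m in orphaned]


shared = SharedSubscription(["C-1", "C-2", "C-3"])
assigned = [shared.dispatch(f"m{i}") for i in range(6)]
shared.ack("C-3", "m2")                  # C-3 acknowledges m2 but not m5
redelivered = shared.disconnect("C-3")   # m5 is handed to a remaining consumer
```

Each of the three consumers receives two of the six messages, and only the unacknowledged m5 is redelivered when C-3 disconnects.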

Choice of three subscription models

Exclusive and failover subscriptions allow only one consumer per topic partition for each subscription. Both modes consume messages in topic partition order. They are best suited to stream use cases that require strict message ordering.

Shared subscriptions allow multiple consumers per topic partition. Each consumer in the same subscription receives only a portion of the messages in the topic partition. Shared subscriptions are best suited to queue use cases that do not require message ordering, and the number of consumers can be scaled as needed.

A subscription in Pulsar is similar to a consumer group in Apache Kafka. Creating a subscription is lightweight and highly scalable, so users can create as many subscriptions as the application needs. Different subscriptions on the same topic can also use different subscription types. For example, a user can create a failover subscription with 3 consumers and a shared subscription with 20 consumers on the same topic, and can add more consumers to the shared subscription without changing the number of partitions. The figure below depicts a topic with 3 subscriptions, A, B, and C, and illustrates how messages flow from producers to consumers.

In addition to the unified messaging API, because Pulsar topic partitions are actually stored in Apache BookKeeper, Pulsar also provides a read API (Reader). It is similar to the consumer API, but a Reader has no cursor management, so users fully control how the messages in a topic are consumed.

Pulsar's message acknowledgement (ACK)

Because of the nature of distributed systems, failures can occur when using a distributed messaging system. For example, while a consumer is consuming messages from a topic, both the consumer and the message broker serving the topic partition may fail. The purpose of message acknowledgment (ack) is to ensure that when such a failure occurs, consumers can resume from where they stopped, neither losing messages nor reprocessing messages that were already acknowledged. In Apache Kafka, the recovery point is usually called the offset, and the process of updating the recovery point is called message acknowledgment or committing the offset.

In Apache Pulsar, each subscription uses a dedicated data structure, the cursor, to track the acknowledgment (ack) state of each message in the subscription. Whenever a consumer acknowledges a message on the topic partition, the cursor is updated. Updating the cursor ensures that the consumer does not receive the message again.

Apache Pulsar provides two ways to acknowledge messages: individual ack and cumulative ack. With cumulative acknowledgment, the consumer only needs to acknowledge the last message it received. All messages in the topic partition up to and including the provided message ID are marked as acknowledged and are not delivered to the consumer again. Cumulative acknowledgment is similar to an offset update in Apache Kafka.

Apache Pulsar also supports individual acknowledgment, that is, selective acknowledgment. Consumers can acknowledge a single message, and an acknowledged message is not redelivered. The following figure illustrates the difference between individual and cumulative acknowledgment (messages in gray boxes are acknowledged and will not be redelivered). The upper part of the figure shows cumulative acknowledgment: messages before M12 are marked as acknowledged. The lower part shows individual acknowledgment: only messages M7 and M12 are acknowledged, so if the consumer fails, all messages except M7 and M12 are redelivered.
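The difference between the two acknowledgment methods can be made concrete with a toy cursor. This is a simplified sketch under stated assumptions: the `Cursor` class, its `mark_delete` field, and integer message positions are illustrative choices for this example, not Pulsar's actual data structure.

```python
class Cursor:
    """Toy cursor: tracks which message positions a subscription has acknowledged."""

    def __init__(self):
        self.mark_delete = -1             # every position <= mark_delete is acknowledged
        self.individually_acked = set()   # acknowledged positions beyond mark_delete

    def ack_cumulative(self, position):
        # Acknowledge everything up to and including the given position.
        if position > self.mark_delete:
            self.mark_delete = position
            self._advance()

    def ack_individual(self, position):
        # Acknowledge a single position, leaving earlier ones untouched.
        if position > self.mark_delete:
            self.individually_acked.add(position)
            self._advance()

    def _advance(self):
        # Fold contiguous individual acks into mark_delete and drop stale entries.
        while self.mark_delete + 1 in self.individually_acked:
            self.mark_delete += 1
        self.individually_acked = {
            p for p in self.individually_acked if p > self.mark_delete
        }

    def is_acked(self, position):
        return position <= self.mark_delete or position in self.individually_acked

    def to_redeliver(self, last_position):
        """Positions 0..last_position that would be delivered again after a failure."""
        return [p for p in range(last_position + 1) if not self.is_acked(p)]


cumulative = Cursor()
cumulative.ack_cumulative(12)
after_cumulative = cumulative.to_redeliver(14)

individual = Cursor()
individual.ack_individual(7)
individual.ack_individual(12)
after_individual = individual.to_redeliver(14)
```

With a cumulative ack at position 12, only positions 13 and 14 remain to be redelivered; with individual acks at 7 and 12, every other position from 0 to 14 is still outstanding.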

Consumers on exclusive or failover subscriptions can acknowledge messages individually or cumulatively; consumers on shared subscriptions can only acknowledge messages individually. Individual acknowledgment provides a better experience when handling consumer failures. For some applications, processing a message may take a long time or be very expensive, so it is important to avoid redelivering messages that have already been acknowledged.

The cursor, a specialized data structure for managing acks, is managed by the broker and stored in BookKeeper ledgers. In a later article, we will cover the cursor in more detail.

Apache Pulsar provides flexible subscription types and acknowledgment methods. Through a simple, unified API, it supports a wide range of messaging and streaming use cases.

Pulsar's message retention (Retention)

After a message is acknowledged, Pulsar's broker updates the corresponding cursor. Once a message in a topic has been acknowledged by all subscriptions, it can be deleted. Pulsar also allows messages to be kept longer by setting a retention period, even after all subscriptions have acknowledged them. The following figure illustrates message retention in a topic with 2 subscriptions. Subscription A has consumed all messages before M6, and subscription B has consumed all messages before M10. This means that all messages before M6 (gray boxes) can be safely deleted. Subscription A has not yet consumed messages M6 to M9, so they cannot be deleted. If the topic is configured with a retention period, messages M0 to M5 are kept for the configured period, even though both A and B have acknowledged them.
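The deletion rule in this example can be written down as a short function. It is a sketch only: `deletable_positions` is a hypothetical helper, integer positions stand in for message IDs, and measuring the retention period from each message's publish time is an assumption of this example.

```python
def deletable_positions(ack_frontiers, publish_times, now, retention_seconds=0):
    """Return positions acknowledged by every subscription whose retention has expired.

    ack_frontiers: subscription name -> first position it has NOT acknowledged
    publish_times: publish timestamp (in seconds) of each message, indexed by position
    """
    # The slowest subscription bounds what may be deleted.
    slowest = min(ack_frontiers.values())
    return [
        p for p in range(slowest)
        if now - publish_times[p] >= retention_seconds
    ]


# Subscription A has consumed everything before M6, subscription B everything before M10.
frontiers = {"A": 6, "B": 10}
publish_times = list(range(12))  # message p was published at t = p seconds

no_retention = deletable_positions(frontiers, publish_times, now=100)
long_retention = deletable_positions(frontiers, publish_times, now=100,
                                     retention_seconds=1000)
```

Without a retention period, positions 0 to 5 are deletable; with a retention period longer than the age of every message, nothing is deleted yet even though both subscriptions have acknowledged those messages.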

As part of its retention policy, Pulsar also supports message time to live (TTL). If a message is not consumed by any consumer within the configured TTL, it is automatically marked as acknowledged. The difference between the retention period and TTL is that the retention period applies to messages that have been acknowledged and are eligible for deletion, while TTL applies to messages that have not been acknowledged. The figure above also illustrates TTL in Pulsar. For example, if subscription B has no active consumers, message M10 is automatically marked as acknowledged once the configured TTL has elapsed, even though no consumer actually read it.
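TTL can be pictured as moving a subscription's acknowledgment position forward on its behalf. The function below is a minimal sketch under assumptions: `expire_by_ttl` is a hypothetical name, positions are plain integers, and message age is measured from publish time.

```python
def expire_by_ttl(frontier, publish_times, now, ttl_seconds):
    """Advance a subscription's first-unacknowledged position past expired messages.

    frontier: first position the subscription has not acknowledged
    publish_times: publish timestamp (in seconds) of each message, indexed by position
    """
    while frontier < len(publish_times) and now - publish_times[frontier] > ttl_seconds:
        # The message is older than the TTL: treat it as acknowledged.
        frontier += 1
    return frontier


publish_times = [0, 10, 20, 30]
# The subscription has acknowledged position 0 only; no consumer is active.
new_frontier = expire_by_ttl(frontier=1, publish_times=publish_times,
                             now=35, ttl_seconds=10)
```

Here the messages at positions 1 and 2 are older than the 10-second TTL and are treated as acknowledged, while the message at position 3 is still within its TTL and stays pending.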

Pulsar vs. Kafka

Based on the aspects above, we summarize the differences between the Pulsar and Kafka messaging models.

Model concept

Kafka: Producer - topic - consumer group - consumer;

Pulsar: Producer - topic - subscription - consumer.

Consumption pattern

Kafka: Focuses mainly on the stream model, with exclusive consumption of each partition; there is no shared (queue) consumption mode.

Pulsar: Provides a unified messaging model and API. Stream model: exclusive and failover subscriptions. Queue model: shared subscriptions.

Message confirmation (Ack)

Kafka: Uses offsets;

Pulsar: Uses dedicated cursor management. Cumulative acknowledgment has the same effect as a Kafka offset commit; individual (selective) acknowledgment is also provided.

Message retention

Kafka: Deletes messages according to the configured retention period. Messages that have not been consumed may be deleted once the period expires. TTL is not supported.

Pulsar: Messages are deleted only after all subscriptions have consumed them, so no data is lost. A retention period can also be set to keep data that has already been consumed. TTL is supported.

Comparison summary:

Apache Pulsar combines high-performance streaming (as pursued by Apache Kafka) and flexible traditional queuing (as pursued by RabbitMQ) into a unified messaging model and API. Through this unified API, Pulsar gives users a single system that supports both streams and queues with the same high performance.

Summary

In this blog post, we introduced Apache Pulsar's messaging model, which unifies queuing and streaming in one API. Applications can use this unified API for high-performance queuing and streaming without maintaining two systems: RabbitMQ for queuing and Kafka for streaming. We hope this article helps you understand the messaging model in Apache Pulsar, how message consumption, deletion, and retention work, and how the Pulsar and Kafka messaging models differ. In a later article, we will cover the details of Apache Pulsar's architecture and the differences between Pulsar and Apache Kafka in data distribution, replication, availability, and durability.


If you are interested in Pulsar, you can participate in the Pulsar community in the following ways:

For general information about the Apache Pulsar project, please visit the official website: http://pulsar.incubator.apache.org/ You can also follow the Twitter account @apache_pulsar.

