Pulsar introduction
Apache Pulsar is a top-level project of the Apache Software Foundation and a next-generation cloud-native distributed messaging and streaming platform. It integrates messaging, storage, and lightweight function computing, and its design separates computing from storage, supporting multi-tenancy, persistent storage, geo-replication, strong consistency, high throughput, low latency, high scalability, and other streaming-data storage features.
Pulsar was born in 2012 at Yahoo. The original goal was to unify the messaging systems used inside Yahoo and build a single logical messaging platform that supported large clusters and cross-regional replication. At the time, existing messaging systems (including Kafka) could not meet Yahoo's needs for large clusters, multi-tenancy, stable and reliable I/O quality of service, millions of topics, cross-regional replication, and so on, so Pulsar came into being.
The key features of Pulsar are as follows:
- A single Pulsar instance natively supports multiple clusters and can seamlessly replicate messages between clusters across data centers.
- Very low publish latency and end-to-end latency.
- Seamless scalability to more than a million topics.
- A simple client API with support for Java, Go, Python and C++.
- Multiple subscription modes for topics (exclusive, shared, and failover).
- Guaranteed message delivery through the persistent message storage provided by Apache BookKeeper.
- Pulsar Functions, a lightweight serverless computing framework, provides stream-native data processing.
- Pulsar IO, a serverless connector framework built on Pulsar Functions, makes it easier to move data in and out of Apache Pulsar.
- Tiered storage offloads data from hot storage to cold/long-term storage (such as S3 or GCS) as it ages.
Community:
Apache Pulsar currently has more than 10K stars on GitHub and more than 470 contributors. The project is updated continuously and the community is reasonably active.
Concepts
Producer
The source of the message, which is also the publisher of the message, is responsible for sending the message to the topic.
Consumer
Consumers of messages are responsible for subscribing to and consuming messages from topics.
Topic
The carrier of message data. A Topic in Pulsar can be divided into multiple partitions; if not configured, it has a single partition by default.
Broker
Broker is a stateless component, which is mainly responsible for receiving messages sent by Producer and delivering them to Consumer.
BookKeeper
A distributed write-ahead log system that provides the storage layer for Pulsar and supports replication across machines and data centers.
Bookie
Bookie is the server of Apache BookKeeper that provides persistence for messages.
Cluster
An Apache Pulsar instance consists of one or more clusters.
Cloud native architecture
Apache Pulsar uses an architecture that separates computing from storage. The computing layer is not coupled to the storage logic, so data storage can be scaled independently and recovered quickly. As cloud-native architectures have matured, this separation of computing and storage has appeared in more and more systems. Pulsar's Broker layer is a stateless computing layer that is mainly responsible for receiving and distributing messages, while the storage layer consists of Bookie nodes and is responsible for storing and reading messages.
This separation allows practically unrestricted horizontal scaling. If the system gains more Producers and Consumers, the Broker computing layer can be expanded directly without being constrained by data consistency. In a coupled architecture, scaling out changes the computing logic and the stored data at the same time, making it easy to run into data-consistency constraints. In addition, the logic of the computing layer is complex and error-prone, while the logic of the storage layer is relatively simple and less likely to fail. With this architecture, if an error occurs in the computing layer, it can be recovered on its own without affecting the storage layer.
Pulsar also supports tiered storage of data, which can move old messages to cheaper storage solutions, and the latest messages can be stored in SSDs. This can save costs and maximize the use of resources.
Cluster architecture
A Pulsar cluster consists of the following components:
- Multiple Broker instances, responsible for receiving and distributing messages
- A ZooKeeper service to coordinate cluster configuration
- BookKeeper server cluster Bookie, used for message persistence
- Message synchronization between clusters through cross-regional replication
Design principle
Pulsar uses the publish-subscribe (pub-sub) design pattern: producers publish messages to topics, and consumers subscribe to those topics, process the messages they receive, and send an acknowledgment (ack) once processing is complete.
Producer
Send mode
The Producer has two modes for sending messages. It can publish messages to the broker in a synchronous (sync) or asynchronous (async) manner.
Sending a message synchronously means that after the Producer sends the message, it waits for the broker's acknowledgment; the message is considered sent successfully only once the acknowledgment arrives, otherwise the send is treated as a failure.
MessageId messageId = producer.send("synchronously sent message".getBytes(StandardCharsets.UTF_8));
Sending a message asynchronously means that the Producer puts the message into an internal blocking queue and returns immediately, without waiting for the broker's acknowledgment.
CompletableFuture<MessageId> messageIdCompletableFuture = producer.sendAsync("asynchronously sent message".getBytes(StandardCharsets.UTF_8));
Access mode
Pulsar provides several different topic access modes for the Producer:
Shared
By default, multiple producers can publish messages to the same topic.
Exclusive
The producer is required to access the topic in exclusive mode. In this mode, if the topic already has a producer, then other producers will fail and report an error when connecting.
"Topic has an existing exclusive producer: standalone-0-12"
WaitForExclusive
If the topic already has a connected producer, creation of the current producer is suspended until it can acquire Exclusive access.
The access mode can be set in the following ways:
Producer<byte[]> producer = pulsarClient.newProducer().accessMode(ProducerAccessMode.Shared).topic("test-topic-1").create();
Compression
Pulsar supports compressing messages sent by the Producer. The following compression types are supported:
LZ4
LZ4 is a lossless compression algorithm that provides compression speeds above 500 MB/s per core and scales with multi-core CPUs. It features an extremely fast decoder, reaching several GB/s per core, typically hitting the RAM speed limit on multi-core systems.
ZLIB
zlib is designed to be a free, general-purpose, legally unencumbered (that is, not covered by any patents) lossless data-compression library that can be used on virtually any computer hardware and operating system. The zlib data format is itself portable across platforms.
ZSTD
Zstandard is a fast compression algorithm that provides high compression ratios. It also provides a special mode for small data called dictionary compression. The reference library provides a very wide range of speed/compression trade-offs and is supported by extremely fast decoders.
Snappy
Snappy is a compression/decompression library. It does not aim for maximum compression, nor is it compatible with any other compression library; on the contrary, its aim is very high speed and reasonable compression.
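The compression type is set on the producer when it is created. Below is a minimal sketch, assuming the pulsarClient and topic name from the earlier examples; the other algorithms are selected the same way through the CompressionType enum:
Producer<byte[]> compressedProducer = pulsarClient.newProducer()
        .topic("test-topic-1")
        .compressionType(CompressionType.LZ4) // or ZLIB, ZSTD, SNAPPY
        .create();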
Batch processing
The Producer supports sending multiple messages in a single batched request. When all of the messages in a batch have been acknowledged by the Consumer, the batch is considered acknowledged. If an unexpected failure occurs, this may cause all messages in the batch to be redelivered, including messages that had already been acknowledged.
To avoid this problem, Pulsar introduced batch index acknowledgment in 2.6.0 (enabled on the broker with acknowledgmentAtBatchIndexLevelEnabled=true). The broker maintains the acknowledgment state of each batch index and avoids dispatching already-acknowledged messages to the Consumer. Once all of the message indexes in a batch have been acknowledged, the batch is deleted.
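For reference, a hedged sketch of how batching is typically configured on the Java producer (the message count and delay values are illustrative assumptions; java.util.concurrent.TimeUnit is required):
Producer<byte[]> batchingProducer = pulsarClient.newProducer()
        .topic("test-topic-1")
        .enableBatching(true) // batching is enabled by default in the Java client
        .batchingMaxMessages(100) // flush after at most 100 messages
        .batchingMaxPublishDelay(10, TimeUnit.MILLISECONDS) // or after 10 ms, whichever comes first
        .create();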
Message chunking
The Producer supports chunking messages, which can be enabled with chunkingEnabled=true. When chunking is enabled, note the following:
- Batching and chunking cannot be enabled at the same time; to enable chunking, batching must be disabled first.
- Only persistent topics support chunking.
- Chunking is only supported for the exclusive and failover subscription types.
When chunking is enabled, if a message sent by the Producer exceeds the maximum payload size, the Producer splits the original message into multiple chunks and sends each chunk to the broker. Chunks are stored in the broker in the same way as normal messages; the difference is on the consuming side, where the Consumer detects that a message is a chunk and caches it. Once all chunks of a message have been collected, they are combined back into the original message and placed in the receiver queue for the client to consume. If the Producer fails to send all chunks of a message, the Consumer has an expiration mechanism for incomplete chunked messages; the default expiration time is one hour.
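A minimal sketch of enabling chunking on the Java producer, assuming the pulsarClient from the earlier examples; note that batching must be disabled explicitly:
Producer<byte[]> chunkingProducer = pulsarClient.newProducer()
        .topic("test-topic-1")
        .enableChunking(true) // split oversized messages into chunks
        .enableBatching(false) // chunking and batching cannot be enabled together
        .create();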
Chunked-message model with one producer and one ordered consumer:
Chunked-message model with multiple producers and one ordered consumer:
Consumer
Consumer is a consumer of messages, and obtains messages from brokers by subscribing to a specified topic.
The Consumer sends a flow permit request to the broker to get messages. There is a queue on the Consumer side to buffer the messages pushed from the broker; its size can be configured with the receiverQueueSize parameter (the default is 1000). Every call to consumer.receive() takes one message from this buffer.
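For illustration, the receiver queue size can be set when building the consumer; this sketch assumes the pulsarClient from the earlier examples, and the topic and subscription names are illustrative:
Consumer<byte[]> consumer = pulsarClient.newConsumer()
        .topic("test-topic-1")
        .subscriptionName("test-subscription-1")
        .receiverQueueSize(1000) // default is 1000; lower it to reduce client-side buffering
        .subscribe();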
Receiving method
Messages can be received from brokers synchronously (sync) or asynchronously (async); they can also be delivered through a MessageListener, whose callback on the user's listener is invoked whenever a message is received.
Synchronously receiving a message blocks until a message is available.
Message<byte[]> message = consumer.receive();
System.out.println("Received message content: " + new String(message.getData()));
consumer.acknowledge(message); // acknowledge the message
Receiving a message asynchronously immediately returns a CompletableFuture that completes once a message is available. receiveAsync() should only be called again after the previous CompletableFuture has completed; otherwise a backlog of receive requests builds up in the application. A pending future can be removed from that backlog before it completes by calling future.cancel(false) (CompletableFuture.cancel(boolean)).
CompletableFuture<Message<byte[]>> messageCompletableFuture = consumer.receiveAsync();
Message<byte[]> message = messageCompletableFuture.get();
System.out.println("Received message content: " + new String(message.getData()));
consumer.acknowledge(message); // acknowledge the message
The client library also provides a listener implementation for consumers. For example, the Java client provides a MessageListener interface; whenever a new message is received, its received method is called.
pulsarClient.newConsumer().topic("test-topic-1").messageListener((MessageListener<byte[]>) (consumer, msg) -> {
    System.out.println("Received message content: " + new String(msg.getData()));
    try {
        consumer.acknowledge(msg); // acknowledge the message
    } catch (PulsarClientException e) {
        consumer.negativeAcknowledge(msg); // message consumption failed
    }
}).subscriptionName("test-subscription-1").subscribe();
Message acknowledgment
Successful acknowledgment:
After the Consumer successfully consumes a message, it needs to send an acknowledgment to the broker. A message is deleted only after it has been acknowledged on all of its subscriptions. If you need to keep messages that have already been acknowledged, you must configure a message retention policy; otherwise Pulsar deletes messages immediately once they have been acknowledged.
Messages can be acknowledged in either of the following two ways (see the sketch after this list):
- Individual acknowledgment. With individual acknowledgment, the consumer acknowledges each message and sends an acknowledgment request to the broker.
- Cumulative acknowledgment. With cumulative acknowledgment, the consumer only acknowledges the last message it received; all messages in the stream up to and including that message will not be redelivered to the consumer.
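A sketch of both acknowledgment styles with the Java consumer (the message variable is assumed to have been received as in the earlier examples):
// individual acknowledgment: acknowledge each message separately
consumer.acknowledge(message);
// cumulative acknowledgment: acknowledge everything up to and including this message
consumer.acknowledgeCumulative(message);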
Negative acknowledgment:
When the Consumer fails to consume a message and wants to consume it again, it sends a negative acknowledgment to the broker, indicating that the message was not consumed successfully; the broker then redelivers it. Messages are negatively acknowledged either individually or cumulatively, depending on the subscription type:
- In the exclusive and failover subscription types, consumers only negatively acknowledge the last message they received.
- In the shared and key_shared subscription types, consumers can negatively acknowledge messages individually.
Acknowledgment timeout:
If a message is not acknowledged within a configured timeout and you want the broker to redeliver it automatically, you can rely on the automatic redelivery mechanism for unacknowledged messages. Negative acknowledgment should generally be preferred, since it gives more precise control over the redelivery of individual messages.
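A hedged sketch of configuring both mechanisms on the Java consumer (the timeout and delay values are illustrative assumptions; java.util.concurrent.TimeUnit is required):
Consumer<byte[]> timeoutConsumer = pulsarClient.newConsumer()
        .topic("test-topic-1")
        .subscriptionName("test-subscription-1")
        .ackTimeout(30, TimeUnit.SECONDS) // unacknowledged messages are redelivered after 30 seconds
        .negativeAckRedeliveryDelay(1, TimeUnit.MINUTES) // delay before redelivering negatively acknowledged messages
        .subscribe();
// on a processing failure, prefer an explicit negative acknowledgment of the received message
timeoutConsumer.negativeAcknowledge(message);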
Dead letter queue
Apache Pulsar has a built-in dead letter queue feature. When message processing fails and a negative acknowledgment is received, Apache Pulsar can automatically retry delivery. Once the maximum number of redeliveries is exceeded, the message is placed in the dead letter queue so that new messages can continue to be processed.
Consumer<byte[]> consumer = pulsarClient.newConsumer(Schema.BYTES)
.topic(topic)
.subscriptionName("my-subscription")
.subscriptionType(SubscriptionType.Shared)
.deadLetterPolicy(DeadLetterPolicy.builder()
.maxRedeliverCount(maxRedeliveryCount)
.build())
.subscribe();
Consumption model
Apache Pulsar unifies the queue model and the stream model: only one copy of the data needs to be stored at the Topic level, the same data can be consumed multiple times, and different subscription models allow it to be processed as a stream, a queue, and so on, which greatly improves flexibility. There are four subscription types in Apache Pulsar: exclusive, shared, failover, and key_shared. These types are shown in the figure below.
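The subscription type is chosen when the consumer subscribes. A minimal sketch, reusing the client and names from the earlier examples:
Consumer<byte[]> sharedConsumer = pulsarClient.newConsumer()
        .topic("test-topic-1")
        .subscriptionName("test-subscription-1")
        .subscriptionType(SubscriptionType.Shared) // or Exclusive, Failover, Key_Shared
        .subscribe();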
Topic
Topic naming
Topics in Pulsar are responsible for delivering messages from producers to consumers, and a topic name is a URL with a well-defined structure:
{persistent|non-persistent}://tenant/namespace/topic
persistent / non-persistent
Indicates the type of the topic. Topics are divided into persistent and non-persistent topics; the default is persistent. Persistent topics save messages to disk, while non-persistent topics do not.
tenant
The tenant of the topic within the Pulsar instance. Tenants are essential to multi-tenancy in Pulsar and are spread across the cluster.
namespace
A namespace groups related topics together and is the basic administrative unit for topics. Each tenant can have one or more namespaces.
topic
The topics in Pulsar are named channels, which are used to transfer messages from producers to consumers.
Topics do not need to be explicitly created in Pulsar. If you try to send or receive a message on a topic that does not exist, the topic is automatically created under the default tenant and namespace.
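For example, the short name test-topic-1 used in the earlier examples expands to the following fully qualified name under the default tenant and namespace (a sketch equivalent to the earlier producer creation):
Producer<byte[]> fullNameProducer = pulsarClient.newProducer()
        .topic("persistent://public/default/test-topic-1") // same topic as the short name "test-topic-1"
        .create();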
Topic partition
An ordinary topic is served by a single broker. A topic can instead be divided into multiple partitions that are stored on and served by different brokers, which greatly improves the topic's throughput.
As shown in the figure above, Topic1 is divided into five partitions: P0, P1, P2, P3, and P4. These five partitions are spread across three brokers (Broker1, Broker2, Broker3). Since there are more partitions than brokers, the first two brokers each handle two partitions while the third handles one; Pulsar distributes the partitions automatically.
When publishing to a partitioned topic, you must specify a routing mode, which determines which partition each message is published to. If no routing mode is specified when the producer is created, round-robin routing is used by default.
There are three types of MessageRoutingMode available:
RoundRobinPartition
If no key is provided, the producer publishes messages across all partitions in round-robin fashion to achieve maximum throughput. Note that round-robin is not applied per individual message; it is aligned with the batching delay boundary to keep batching effective. If a key is specified on a message, the partitioned producer hashes the key and assigns the message to a particular partition.
SinglePartition
If no key is provided, the partitioned producer randomly selects one partition and publishes all messages to it. If a key is provided on the message, the partitioned producer hashes the key and assigns the message to a particular partition.
CustomPartition
A custom MessageRouter implementation is called to determine the partition for each message.
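Below is a hedged sketch of selecting a routing mode on the Java producer and of supplying a custom router for the CustomPartition case; the partitioned topic name is an assumption and the length-based router is purely illustrative:
// built-in routing mode
Producer<byte[]> singlePartitionProducer = pulsarClient.newProducer()
        .topic("persistent://public/default/partitioned-topic")
        .messageRoutingMode(MessageRoutingMode.SinglePartition)
        .create();
// custom routing: implement MessageRouter and pass it to the builder
Producer<byte[]> customRoutedProducer = pulsarClient.newProducer()
        .topic("persistent://public/default/partitioned-topic")
        .messageRouter(new MessageRouter() {
            @Override
            public int choosePartition(Message<?> msg, TopicMetadata metadata) {
                // illustrative only: route by payload length
                return msg.getData().length % metadata.numPartitions();
            }
        })
        .create();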
Multi-tenant
The multi-tenant feature of Apache Pulsar can meet the management needs of enterprises. Tenant and namespace are the two core concepts of Apache Pulsar to support multi-tenancy.
- At the tenant level, Pulsar reserves appropriate storage space, application authorization and authentication mechanisms for specific tenants.
- At the namespace level, Pulsar provides a series of configurable policies, including storage quotas, flow control, message expiration, and isolation between namespaces (see the sketch below).
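As an illustration only (not taken from the text), namespace-level policies such as retention can be set through the Java admin client (pulsar-client-admin); the namespace name and values below are assumptions, and exception handling is omitted:
PulsarAdmin admin = PulsarAdmin.builder()
        .serviceHttpUrl("http://127.0.0.1:8080") // admin endpoint of a local standalone broker
        .build();
// create a namespace under the default "public" tenant
admin.namespaces().createNamespace("public/my-namespace");
// retain acknowledged messages for 7 days or up to 1024 MB, whichever limit is reached first
admin.namespaces().setRetention("public/my-namespace", new RetentionPolicies(7 * 24 * 60, 1024));
admin.close();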
Cross-regional replication
The cross-regional (geo) replication mechanism provides redundancy across large-scale distributed systems and multiple data centers, keeping the service available when a region fails, and it also forms the basis for cross-regional production and cross-regional consumption.
Tiered storage
Pulsar's tiered storage feature allows older backlog data to be moved from BookKeeper to cheaper long-term storage, reducing storage costs while still allowing clients to access the backlog as if nothing had changed. Administrators can configure a namespace-level size threshold policy so that data is automatically offloaded to long-term storage.
Components
Pulsar Schema Registry
The schema registry enables Producers and Consumers to agree on a Topic's data structure through the broker, without any external coordination mechanism, avoiding a range of potential serialization and deserialization problems.
Pulsar Functions
Pulsar Functions is a lightweight computing framework that provides users with a FaaS (Function as a Service) platform that is simple to deploy, operate, and program against. Its goal is to help users easily create processing logic of any complexity without having to deploy and maintain a separate computing system.
Pulsar IO
Pulsar IO allows Apache Pulsar to exchange data with external systems such as databases (for example, Apache Cassandra) and other messaging systems. Users do not need to worry about the implementation details and can get running with a single command.
Source imports data from external systems to Apache Pulsar, and Sink exports data from Apache Pulsar to external systems.
Pulsar SQL
Pulsar SQL is a query layer built on top of Apache Pulsar that lets users dynamically query all historical data streams stored in Pulsar. Users can clean, transform, and query data streams while ingesting data into the same system, which greatly simplifies the data pipeline.
Get started quickly
Binary installation
Here is a brief Pulsar demo using a single standalone Pulsar server. First, download Pulsar with the following command:
wget https://archive.apache.org/dist/pulsar/pulsar-2.8.1/apache-pulsar-2.8.1-bin.tar.gz
After the download completes, extract the apache-pulsar-2.8.1-bin.tar.gz archive with the following command:
tar xvfz apache-pulsar-2.8.1-bin.tar.gz
Then cd to the apache-pulsar-2.8.1 folder, which contains the following directories:
- bin: Pulsar's command-line tools, such as [pulsar](https://pulsar.apache.org/docs/en/reference-cli-tools#pulsar) and [pulsar-admin](https://pulsar.apache.org/tools/pulsar-admin/).
- conf: Pulsar configuration files, including the broker configuration, ZooKeeper configuration, and so on.
- examples: A Java JAR file containing Pulsar Functions examples.
- lib: The JAR files used by Pulsar.
- licenses: License files, in .txt form, for the various components of the Pulsar code base.
Start Pulsar in standalone mode
After you have installed Pulsar locally, you can use the [pulsar](https://pulsar.apache.org/docs/en/reference-cli-tools#pulsar) command stored in the bin directory to start a local cluster, specifying that you want to start Pulsar in standalone mode.
$ bin/pulsar standalone
If Pulsar started successfully, you will see INFO-level log messages like these:
2017-06-01 14:46:29,192 - INFO - [main:WebSocketService@95] - Configuration Store cache started
2017-06-01 14:46:29,192 - INFO - [main:AuthenticationService@61] - Authentication is disabled
2017-06-01 14:46:29,192 - INFO - [main:WebSocketService@108] - Pulsar WebSocket Service started
An example of sending and receiving messages
public class PulsarDemo {

    private static PulsarClient PULSAR_CLIENT = null;

    static {
        try {
            // create the Pulsar client
            PULSAR_CLIENT = PulsarClient.builder().serviceUrl("pulsar://127.0.0.1:6650").build();
        } catch (PulsarClientException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws PulsarClientException {
        // create a producer
        Producer<byte[]> producer = PULSAR_CLIENT.newProducer().topic("test-topic-1").create();
        // send a message synchronously
        MessageId messageId = producer.send("synchronously sent message".getBytes(StandardCharsets.UTF_8));
        System.out.println("Message sent successfully, message id: " + messageId);
        // create a consumer
        Consumer<byte[]> consumer = PULSAR_CLIENT.newConsumer().topic("test-topic-1")
                .subscriptionName("test-subscription-1").subscribe();
        // receive one message
        Message<byte[]> message = consumer.receive();
        System.out.println("Received message content: " + new String(message.getData()));
        // acknowledge the message so that Pulsar can delete it
        consumer.acknowledge(message);
        // close the producer, consumer, and client
        producer.close();
        consumer.close();
        PULSAR_CLIENT.close();
    }
}
Output:
Message sent successfully, message id: 66655:0:-1:0
Received message content: synchronously sent message