About Apache Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation and a next-generation, cloud-native distributed messaging and streaming platform. It integrates messaging, storage, and lightweight function-based compute, and adopts an architecture that separates compute from storage. Pulsar supports multi-tenancy, persistent storage, multi-datacenter and cross-region data replication, and offers streaming data storage characteristics such as strong consistency, high throughput, low latency, and high scalability. Today, many large Internet and traditional-industry companies in China and abroad have adopted Apache Pulsar, with use cases spanning artificial intelligence, finance, telecom operators, live streaming and short video, IoT, retail and e-commerce, online education, and other industries. Adopters include the US cable giant Comcast, Yahoo!, Tencent, China Telecom, China Mobile, BIGO, and VIPKID.

Confluent recently conducted a benchmark comparing the throughput and latency of Kafka, Pulsar, and RabbitMQ. According to the Confluent blog, Kafka achieves the "best throughput" with "low latency", while RabbitMQ achieves "low latency" at "lower throughput". Overall, their benchmark results present Kafka as the clear winner on "speed".

Kafka is a mature and well-established technology, yet many companies today, from multinationals to innovative start-ups, still choose Pulsar. At the recent Splunk summit conf20, Splunk's Chief Product Officer Sendur Sellakumar announced that they had decided to replace Kafka with Pulsar:

"... We have taken Apache Pulsar as the basic stream. We have put the company's future on the long-term architecture of enterprise-level multi-tenant streams."

--Splunk Chief Product Officer Sendur Sellakumar

Many companies are using Pulsar, and Splunk is just one example. These companies choose Pulsar because in modern elastic cloud environments (such as Kubernetes), Pulsar can scale horizontally to handle massive amounts of data in a cost-effective manner, and there is no single point of failure. At the same time, Pulsar has many built-in features, such as automatic data rebalancing, multi-tenancy, cross-regional replication, and persistent tiered storage, which not only simplifies operation and maintenance, but also makes it easier for the team to focus on business goals.

Ultimately, developers choose Pulsar for its distinctive features and performance, which make it a solid cornerstone for streaming data.

With this context in mind, it is worth examining Confluent's benchmark setup and conclusions carefully. We found two issues that are highly questionable. First, Confluent's limited knowledge of Pulsar is the biggest source of inaccurate conclusions: without understanding Pulsar, you cannot use the right criteria to measure its performance.

Second, Confluent's benchmark was based on a narrow set of test parameters. This limits the applicability of the results and fails to give readers an accurate picture across different workloads and real application scenarios.

To provide the community with more accurate results, we decided to address these problems and repeat the test. The key adjustments were:

  1. We adjusted the benchmark settings to cover the durability levels supported by both Pulsar and Kafka, and compared the throughput and latency of the two systems at the same durability level.
  2. We fixed issues in the OpenMessaging Benchmark (OMB) framework, eliminated the variance introduced by using different instances, and corrected configuration errors in the OMB Pulsar driver.
  3. Finally, we measured other performance factors and conditions, such as different numbers of partitions and mixed workloads including write, tailing-read, and catch-up read, to gain a more comprehensive understanding of performance.

After completing this work, we repeated the test. The results show that in scenarios closer to real-world workloads, Pulsar significantly outperforms Kafka, while in the basic scenarios Confluent used, Pulsar's performance is comparable to Kafka's.

The following sections focus on the most important conclusions drawn from this test. The StreamNative benchmark section describes the test setup and results in detail.

Summary of StreamNative benchmark results

1. With the same durability guarantees as Kafka, Pulsar achieves a publish and end-to-end throughput of 605 MB/s (same as Kafka) and a catch-up read throughput of 3.5 GB/s (3.5 times that of Kafka). Pulsar's throughput is not affected by increasing the number of partitions or changing the durability level, whereas Kafka's throughput is severely affected by changes in either.

Table 1: Throughput differences between Pulsar and Kafka under different workloads and durability guarantees

2. Across all test cases (different numbers of subscriptions, different numbers of topics, and different durability guarantees), Pulsar's latency is significantly lower than Kafka's. Pulsar's P99 latency is between 5 and 15 milliseconds, while Kafka's P99 latency can reach several seconds and is heavily affected by the number of topics, the number of subscriptions, and the durability guarantees.

Table 2: End-to-end P99 latency of Pulsar and Kafka with different numbers of subscriptions and different durability guarantees
Table 3: End-to-end P99 latency of Pulsar and Kafka with different numbers of topics and different durability guarantees

3. Pulsar's I/O isolation is significantly better than Kafka's. When consumers catch up on historical data, Pulsar's P99 publish latency remains around 5 milliseconds. In contrast, Kafka's latency is severely affected by catch-up reads: its P99 publish latency can climb from a few milliseconds to several seconds.

Table 4: P99 publish latency of Pulsar and Kafka under catch-up reads

All of our benchmarks are open source (GitHub URL). Interested readers can reproduce the results themselves, or dig further into the results and metrics provided in the repository.

Although our benchmark is more accurate and comprehensive than Confluent's, it does not cover every scenario. Ultimately, no benchmark can replace testing on your own hardware with your own real workloads. We encourage readers to evaluate other variables and scenarios as well, using their own setup and environment.

Dive into the Confluent benchmark

Confluent used the OpenMessaging benchmark (OMB) framework as the basis for its benchmark tests and made some modifications. In this section, we will explain the problems found in the Confluent benchmark test and explain how these problems affect the accuracy of Confluent test results.

Confluent setup issues

The conclusions of the Confluent benchmark are incorrect because the Pulsar parameters were configured unreasonably. We explain these issues in detail in the StreamNative benchmark section. Beyond the Pulsar tuning problems, Confluent also configured different durability guarantees for Pulsar and Kafka. Durability level affects performance, so a comparison is only meaningful when both systems use the same durability settings.

Confluent's engineers used Pulsar's default durability guarantee, which is a higher level of durability than Kafka's. Raising the durability level significantly affects latency and throughput, so the Confluent test placed a higher burden on Pulsar than on Kafka. The Pulsar version Confluent used could not yet lower its durability to the same level as Kafka's, but an upcoming Pulsar release supports this level, and we used it in our test. Had Confluent's engineers used the same durability settings on both systems, the comparison in their results would have been fair. We certainly do not blame them for not using a feature that has not yet been released. However, their write-up does not provide this context and presents the results as if the durability settings were equivalent. This article supplies that additional context.

OMB framework issues

The Confluent benchmark follows the OMB framework guidelines, which recommend using the same instance type across the event streaming systems under test. In our testing, however, we found significant variance between different instances of the same type, especially in disk I/O behavior. To minimize this variance, we used the same set of instances for every Pulsar and Kafka run, which greatly improved the accuracy of the results: small differences in disk I/O performance can make a big difference to overall system performance. We have proposed updating the OMB framework guidelines to adopt this practice.

Problems with Confluent's research methods

The Confluent benchmark tested only a few limited scenarios. Real workloads, for example, include writes, tailing reads, and catch-up reads. A tailing read occurs when a consumer is reading the latest messages near the "tail" of the log; Confluent tested only this scenario. A catch-up read, by contrast, occurs when a consumer has a large backlog of historical messages and must consume from its catch-up position to the end of the log, which is a common and critical task in real systems. Catch-up reads can severely affect the latency of writes and tailing reads, so ignoring them gives an incomplete picture. Because the Confluent benchmark focused only on throughput and end-to-end latency, it fails to show the expected behavior across varied workloads. To bring the results closer to real application scenarios, we also believe it is essential to benchmark different numbers of subscriptions and partitions. Few companies care only about a handful of topics with a few partitions and consumers; they need to accommodate many different consumers across many topics and partitions that map to their business use cases.

We outline the specific issues with Confluent's methodology in the table below.

Table 5: Specific issues with Confluent's methodology

Many of the problems with the Confluent benchmark stem from a limited knowledge of Pulsar. To help you avoid these problems in future benchmarks, we share some technical insights about Pulsar below.

To run an accurate benchmark, it is necessary to understand Pulsar's durability guarantees. Taking this as our entry point, we first give a general overview of durability in distributed systems and then explain how the durability guarantees of Pulsar and Kafka differ.

Overview of durability in distributed systems

Durability is the ability to maintain system consistency and availability in the face of external problems such as hardware or operating system failures. Single-node storage systems, such as an RDBMS, rely on fsync writes to disk for maximum durability: the operating system normally caches writes, which can be lost on failure, but fsync ensures the data reaches physical storage. In a distributed system, durability usually comes from data replication, where multiple copies of the data are distributed across nodes that can fail independently. Local durability (fsyncing data) should not be confused with replication durability, however, as the two serve different purposes. Below we explain why both matter and how they differ.
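To make the fsync distinction concrete, here is a minimal Java sketch of local durability using java.nio's FileChannel.force; the file name and payload are arbitrary placeholders, not anything Pulsar or Kafka uses internally.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import static java.nio.file.StandardOpenOption.*;

public class FsyncExample {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Path.of("journal.dat"), CREATE, WRITE)) {
            ByteBuffer entry = ByteBuffer.wrap("entry-0".getBytes(StandardCharsets.UTF_8));
            channel.write(entry);   // the data may still sit in the OS page cache
            channel.force(true);    // fsync: flush data and metadata to physical storage
            // Only after force() returns is the entry locally durable across a power loss.
        }
    }
}

Whether a system calls the equivalent of force() before or after acknowledging a write is exactly the synchronous/asynchronous distinction discussed below.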

Replication durability and local durability

Distributed systems usually provide both replication durability and local durability. Each type of durability is controlled by a separate mechanism, and these mechanisms can be combined flexibly to set different durability levels as needed.

Replication durability is achieved by an algorithm that creates multiple copies of the data, so the same data is stored in multiple locations, improving availability and accessibility. The number of copies, N, determines the system's fault tolerance. Many systems require a quorum, i.e. N/2 + 1 nodes, to acknowledge a write, while some systems can continue to serve existing data as long as any single copy remains available. This replication mechanism is essential for surviving the total loss of an instance's data, because a new instance can re-replicate the data from the remaining copies. It is also essential for availability and consensus (not discussed in this section).
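As a quick illustration of the N/2 + 1 rule, the following small Java sketch (a hypothetical helper, not part of any Pulsar or Kafka API) computes the majority quorum for N copies and the number of node failures that can be tolerated while writes are still acknowledged.

public class QuorumMath {
    // Majority quorum: the minimum number of acknowledgments required out of N copies.
    static int majorityQuorum(int copies) {
        return copies / 2 + 1;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 7}) {
            int quorum = majorityQuorum(n);
            // With N copies and a majority quorum, up to N - quorum copies can fail
            // and writes can still be acknowledged.
            System.out.printf("N=%d -> quorum=%d, tolerated failures=%d%n", n, quorum, n - quorum);
        }
    }
}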

Local durability, by contrast, determines what an acknowledgment means at the level of each individual node. Local durability requires data to be fsynced to persistent storage so that no data is lost even in the event of a power outage or hardware failure. Fsyncing the data ensures that a node still has all previously acknowledged data after recovering from a transient failure.

Durability modes: synchronous and asynchronous

Different types of systems provide different levels of durability guarantees. Generally, the overall durability of the system is affected by the following factors:

  • Whether the system fsyncs data to the local disk
  • Whether the system replicates data to multiple locations
  • When the system acknowledges replication to its peers
  • When the system acknowledges the write to the client

These choices vary greatly between systems, and not all systems let users control them. Systems that lack some of these mechanisms (for example, non-distributed systems lack replication) offer lower durability.

We can define two durability modes, "synchronous" and "asynchronous", that govern when the system acknowledges writes, both for internal replication and to the client. The two modes work as follows.

  • Synchronous durability: the system returns a write response to its peers/the client only after the data has been successfully fsynced to the local disk (local durability) or replicated to multiple locations (replication durability).
  • Asynchronous durability: the system returns a write response to its peers/the client before the data has been fsynced to the local disk (local durability) or replicated to multiple locations (replication durability).

Durability levels: measuring durability guarantees

Durability guarantees exist in many forms, depending on the following variables:

  • Whether the data is persisted locally, replicated to multiple locations, or both
  • When the write is acknowledged (synchronously or asynchronously)

As with the durability modes, we define four durability levels to distinguish between distributed systems. Table 6 lists these levels from highest to lowest durability.

Table 6: Durability levels of distributed systems

Most distributed relational database management systems (such as NewSQL databases) guarantee the highest level of durability, so they are classified as level 1 systems.

Like these databases, Pulsar is a level 1 system and provides the highest level of durability by default. In addition, Pulsar lets each application customize the durability level it needs. In contrast, most Kafka production deployments are configured at level 2 or level 4. Kafka can reportedly also reach level 1 by setting flush.messages=1 and flush.ms=0, but these two settings severely affect throughput and latency, as we discuss in detail in the benchmark.
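As an illustration only, here is a hedged Java sketch of how a topic with these level 1 settings might be created through Kafka's AdminClient; the bootstrap address, topic name, partition count, and replication settings are assumptions for the example, not values from the benchmark.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class KafkaLevel1Topic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // flush.messages=1 and flush.ms=0 force an fsync for every message (level 1).
            NewTopic topic = new NewTopic("level1-topic", 1, (short) 3)
                    .configs(Map.of(
                            "flush.messages", "1",
                            "flush.ms", "0",
                            "min.insync.replicas", "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}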

Let's explore each system's durability in detail, starting with Pulsar.

Pulsar's durability

Pulsar offers multiple levels of durability guarantees: it can replicate data to multiple locations and fsync data to a local disk. Pulsar supports both durability modes described above (synchronous and asynchronous), and users can apply either mode alone or combine them to suit their scenario.

Pulsar controls replication durability with a quorum-based replication protocol comparable to Raft. The replication durability mode can be adjusted via the ack-quorum-size and write-quorum-size parameters. Table 7 lists these parameter settings, and Table 8 lists the durability levels Pulsar supports. (Pulsar's replication protocol and consensus algorithm are beyond the scope of this article; we will cover them in depth in future blogs.)

Table 7: Pulsar durability configuration settings
Table 8: Pulsar durability levels
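As a rough sketch of tuning Pulsar's replication durability, the example below sets the ensemble, write quorum, and ack quorum on a namespace through the Java admin client; the admin URL, namespace, and quorum values are illustrative assumptions. In the benchmark itself, the equivalent write-quorum-size and ack-quorum-size values are set in the OMB driver configuration.

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.PersistencePolicies;

public class PulsarReplicationDurability {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // assumed admin endpoint
                .build()) {
            // ensemble=3, write quorum=3, ack quorum=2: a write is acknowledged once
            // 2 of the 3 bookies have confirmed it (synchronous replication durability).
            admin.namespaces().setPersistence("public/default",
                    new PersistencePolicies(3, 3, 2, 0.0));
        }
    }
}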

Pulsar controls local durability by writing and/or fsyncing data to a journal disk. It also provides options to adjust the local durability mode through the configuration parameters in Table 9.

Table 9: Pulsar local durability mode parameters

Kafka's durability

Kafka offers three durability levels: level 1, level 2 (the default), and level 4. Kafka provides replication durability at level 2, but at level 4 it offers no durability guarantee because it does not fsync data to disk before acknowledging a write. By setting flush.messages=1 and flush.ms=0, Kafka can reach level 1, but such a configuration is almost never deployed in production.

Kafka's ISR replication protocol controls its replication durability. The replication durability mode can be adjusted through the acks and min.insync.replicas parameters associated with this protocol. Table 10 lists these parameter settings, and Table 11 lists the durability levels Kafka supports. (A detailed description of the Kafka replication protocol is beyond the scope of this article; we will dig deeper into the differences between the Kafka and Pulsar protocols in future blogs.)

Table 10: Kafka durability configuration settings
Table 11: Kafka durability levels
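For the client-side half of Kafka's replication durability, here is a hedged sketch of a Java producer using acks=all; the bootstrap address, topic, and record contents are assumptions for the example, and min.insync.replicas would be set as a topic or broker configuration as shown earlier.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaReplicationDurability {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all waits for the in-sync replicas (per min.insync.replicas) before acknowledging;
        // acks=1 would acknowledge after only the leader has the record.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("level1-topic", "key", "value"));
            producer.flush();
        }
    }
}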

Unlike Pulsar, Kafka does not write data to a separate journal disk. Kafka acknowledges the write first and fsyncs the data to disk later, which minimizes I/O contention between writes and reads and avoids performance degradation.

By setting flush.messages=1 and flush.ms=0, Kafka can fsync every message and greatly reduce the chance of message loss, but this severely affects throughput and latency. As a result, this setup is almost never used in production deployments.

Because Kafka does not journal its data, a machine failure or power outage carries a risk of data loss. This shortcoming is significant and impactful, and it is one of the main reasons Tencent's billing system chose Pulsar.

Differences in durability between Pulsar and Kafka

Pulsar's durability settings are flexible: users can tune them to meet the needs of particular applications, use cases, or hardware configurations.

Kafka is less flexible, so depending on the scenario it may not be possible to configure identical durability settings in both systems, which makes benchmarking harder. To work around this, the OMB framework recommends using the closest available settings.

With this background, let's look at the problems in the Confluent benchmark. Confluent tried to simulate Pulsar's fsync behavior, but in their benchmark they configured Kafka with asynchronous durability and Pulsar with synchronous durability. This inequity leads to inaccurate results and skewed performance conclusions. Our benchmark shows that Pulsar matches or even surpasses Kafka's performance while providing stronger durability guarantees.

StreamNative benchmark

To gauge Pulsar's performance more accurately, we needed to address these problems in the Confluent benchmark. We focused on adjusting Pulsar's configuration, ensuring both systems used the same durability settings, and incorporating other performance factors and conditions, such as different numbers of partitions and mixed workloads, to measure performance across application scenarios. The following sections explain the configuration adjustments made in our test.

StreamNative test setup

Our benchmark setup covers all the durability levels supported by Pulsar and Kafka, so that we can compare throughput and latency at the same durability level. The durability settings we used are described below.

Replication durability settings

Our replication durability settings are identical to Confluent's; no changes were made. For completeness, the settings are listed in Table 12.

Table 12: Replication durability settings

A new Pulsar feature gives applications the option to skip journaling, relaxing local durability guarantees in exchange for avoiding write amplification and improving write throughput. (This feature will ship in the next version of Apache BookKeeper.) We did not make it the default, because it makes message loss possible, so it is not recommended for most scenarios.

To ensure an accurate performance comparison between the two systems, we used this feature in our benchmark. With journaling bypassed, Pulsar provides the same local durability guarantees as Kafka's default fsync settings.

This new feature adds a local durability mode (Async - Bypass journal). We configured Pulsar with this mode to match Kafka's default level of local durability. Table 13 lists the specific settings used in the benchmark.

Table 13: Local durability settings for the StreamNative benchmark

StreamNative framework

We found some problems in Confluent's fork of the OMB framework and fixed several configuration errors in the OMB Pulsar driver. The new benchmark code, including the fixes described below, is available in our open source repository.

Fix OMB framework issues

Confluent followed the OMB framework recommendation and used two separate sets of instances, one for Kafka and one for Pulsar. In our benchmark, we instead used a single set of three instances to improve the reliability of the test: we first ran Pulsar on the three instances, then ran the same test for Kafka on the same set of instances.

We benchmarked the different systems on the same machines and cleared the file system page cache before each run, ensuring that a test was not affected by the previous one.

Fix OMB Pulsar driver configuration problem

We fixed many errors in Confluent's OMB Pulsar driver configuration. The following sections describe the specific adjustments we made to the broker, bookie, producer, and consumer configurations, as well as the Pulsar image.

Adjust Broker configuration

The Pulsar broker uses the managedLedgerNewEntriesCheckDelayInMillis parameter to determine how long (in milliseconds) a catch-up subscription must wait before dispatching messages to its consumers. In the OMB framework this parameter was set to 10, which is the main reason the Confluent benchmark inaccurately concluded that Pulsar's latency is higher than Kafka's. We changed the value to 0 to match Kafka's latency behavior on Pulsar. After this change, Pulsar's latency was significantly lower than Kafka's in all test scenarios.

To optimize performance, we increased the bookkeeperNumberOfChannelsPerBookie parameter from 16 to 64 to prevent any single Netty channel between the broker and a bookie from becoming a bottleneck. Such a bottleneck causes high latency when a large number of messages accumulate in the Netty I/O queue.

We will provide clearer guidelines in the Pulsar documentation to help users optimize end-to-end latency.

Adjust Bookie configuration

We added a new configuration to the bookies to test Pulsar's performance with journaling bypassed, so that Pulsar's and Kafka's durability guarantees were evenly matched.

To test the performance of this feature, we built a custom image based on the official Pulsar 2.6.1 release that includes this change. (See the Pulsar image section for details.)

We manually configured the following settings to bypass journaling in Pulsar:

journalWriteData = false
journalSyncData = false

In addition, we increased the journalPageCacheFlushIntervalMSec parameter from 1 to 1000 and benchmarked asynchronous local durability in Pulsar (journalSyncData = false). With the larger value, Pulsar can simulate Kafka's flushing behavior, as described below.

Kafka ensures local durability by relying on the operating system to flush the file system page cache to disk. The flushing is performed by a set of background kernel threads (pdflush). The pdflush interval is configurable and is typically set to about 5 seconds between flushes. Setting Pulsar's journalPageCacheFlushIntervalMSec parameter to 1000 is equivalent to a 5-second pdflush interval on the Kafka side. With this change, we could benchmark asynchronous local durability, and thus compare Pulsar and Kafka, more accurately.

Adjust producer configuration

Our batching configuration is the same as Confluent's, with one exception: we increased the partition switch interval so that it is longer than the batching interval. Specifically, we changed the batchingPartitionSwitchFrequencyByPublishDelay parameter from 1 to 2. This change ensures that a Pulsar producer focuses on only one partition during each batching period.

Setting the switch interval equal to the batching interval causes Pulsar to switch partitions too frequently, producing many small batches and potentially hurting throughput. Making the switch interval longer than the batching interval minimizes this risk.
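A minimal sketch of how these batching settings might look with the Pulsar Java client follows; the service URL, topic name, and batching delay are assumptions for the example, and to the best of our knowledge the builder method roundRobinRouterBatchingPartitionSwitchFrequency corresponds to the batchingPartitionSwitchFrequencyByPublishDelay setting used in the OMB driver configuration.

import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class ProducerBatchingConfig {
    public static void main(String[] args) throws Exception {
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed broker address
                .build()) {
            Producer<byte[]> producer = client.newProducer()
                    .topic("benchmark-topic")                          // assumed topic name
                    .batchingMaxPublishDelay(1, TimeUnit.MILLISECONDS) // batching interval
                    // Switch partitions only every 2 publish delays so that each batch
                    // stays on a single partition instead of splitting into small batches.
                    .roundRobinRouterBatchingPartitionSwitchFrequency(2)
                    .create();
            producer.send("hello".getBytes());
            producer.close();
        }
    }
}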

Adjust consumer configuration

When an application cannot process incoming messages fast enough, the Pulsar client applies back pressure through the consumer's receiver queue. The size of this queue affects end-to-end latency: larger queues can prefetch and cache more messages than smaller ones.

Two parameters determine the receiver queue size: receiverQueueSize and maxTotalReceiverQueueSizeAcrossPartitions. Pulsar calculates the per-partition receiver queue size as follows:

Math.min(receiverQueueSize, maxTotalReceiverQueueSizeAcrossPartitions / number of partitions)

For example, if maxTotalReceiverQueueSizeAcrossPartitions is set to 50,000 and there are 100 partitions, the Pulsar client sets the consumer's receiver queue size to 500 for each partition.

In our benchmark, we increased maxTotalReceiverQueueSizeAcrossPartitions from 50,000 to 5,000,000. This tuning ensures that consumers never apply back pressure.
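For reference, a hedged sketch of these consumer settings with the Pulsar Java client is shown below; the service URL, topic, and subscription names are assumptions, and the comment restates the per-partition queue formula above.

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class ConsumerQueueConfig {
    public static void main(String[] args) throws Exception {
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed broker address
                .build()) {
            Consumer<byte[]> consumer = client.newConsumer()
                    .topic("benchmark-topic")          // assumed topic name
                    .subscriptionName("benchmark-sub") // assumed subscription name
                    .subscriptionType(SubscriptionType.Failover)
                    .receiverQueueSize(50000)
                    // Per-partition queue = min(receiverQueueSize,
                    //   maxTotalReceiverQueueSizeAcrossPartitions / number of partitions);
                    // raising the cross-partition cap keeps consumers from applying back pressure.
                    .maxTotalReceiverQueueSizeAcrossPartitions(5_000_000)
                    .subscribe();
            consumer.close();
        }
    }
}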

Pulsar image

We built a custom Pulsar version (v2.6.1-sn-16) that includes the Pulsar and BookKeeper fixes described above. Version 2.6.1-sn-16 is based on the official Pulsar 2.6.1 release and can be downloaded from https://github.com/streamnative/pulsar/releases/download/v2.6.1-sn-16/apache-pulsar-2.6.1-sn-16-bin.tar.gz.

StreamNative test method

We adjusted the Confluent benchmark's testing methodology to get a full picture of performance under realistic workloads. The specific adjustments were:

  1. Added catch-up reads to evaluate:
  • The maximum throughput each system can achieve when processing catch-up reads
  • How catch-up reads affect publish and end-to-end latency
  2. Varied the number of partitions to see how each change affects throughput and latency
  3. Varied the number of subscriptions to see how each change affects throughput and latency

Our benchmarking scenarios tested the following types of workloads:

  • Maximum throughput: What is the maximum throughput each system can achieve?
  • Publish and tailing-read latency: What is the lowest publish and end-to-end latency each system can achieve at a given throughput?
  • Catch-up read: What is the maximum throughput each system can achieve when reading messages from a large backlog?
  • Mixed workload: What is the lowest publish and end-to-end latency each system can achieve while consumers are catching up? How do catch-up reads affect publish latency and end-to-end latency?

Testbed

The OMB framework recommends specific testbed definitions for instance types and JVM configurations, and workload driver configurations for producers, consumers, and servers. Our benchmark uses the same testbed definitions as Confluent; they are available in the StreamNative branch of the Confluent OMB repository.

Below we focus on the disk throughput and disk fsync latency we observed; these hardware metrics must be taken into account when interpreting the benchmark results.

Disk throughput

Our benchmark uses the same instance type as Confluent: i3en.2xlarge (8 vCPUs, 64 GB RAM, 2 x 2,500 GB NVMe SSD). We confirmed that an i3en.2xlarge instance can sustain up to 655 MB/s of write throughput across its two disks, as shown by the dd results below.

Disk 1
dd if=/dev/zero of=/mnt/data-1/test bs=1M count=65536 oflag=direct
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 210.08 s, 327 MB/s
Disk 2
dd if=/dev/zero of=/mnt/data-2/test bs=1M count=65536 oflag=direct
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 209.635 s, 328 MB/s

Disk fsync latency

When running latency-related tests, it is important to capture the fsync latency of the NVMe SSDs. We observed that the P99 fsync latency of the 3 instances ranged from 1 millisecond to 6 milliseconds, as shown in the figure below. As mentioned earlier, disks can vary significantly between instances, and this variance shows up mainly in latency, so we used a group of instances with comparable fsync latency.

Figure 1-1: P99 fsync latency of 3 different instances

StreamNative benchmark test results

The following summarizes our benchmark results. The complete benchmark report can be downloaded from the StreamNative website or viewed in the openmessaging-benchmark repository.

Maximum throughput test

The maximum throughput test is designed to determine the maximum throughput that each system can achieve when processing workloads with different durability guarantees (including publishing and tailing-read). We changed the number of topic partitions to see how each change affects the maximum throughput.

We found the following:

  1. With the durability guarantee configured at level 1 (synchronous replication durability, synchronous local durability), Pulsar's maximum throughput is about 300 MB/s, the physical bandwidth limit of its journal disk, while Kafka reaches about 420 MB/s with 100 partitions. Note that at durability level 1, Pulsar is configured to use one disk as a journal disk for writes and the other as a ledger disk for reads, while Kafka uses both disks for reads and writes simultaneously. Although Pulsar's setup provides better I/O isolation, its throughput is also limited by the maximum bandwidth of a single disk (~300 MB/s). Configuring additional disks for Pulsar can make the setup more cost-effective; we will discuss this topic in a follow-up blog.
  2. With durability configured at level 2 (synchronous replication durability, asynchronous local durability), both Pulsar and Kafka reach a maximum throughput of approximately 600 MB/s. Both systems hit the physical limit of disk bandwidth.
  3. Kafka's maximum throughput on a single partition is only half of Pulsar's.
  4. Pulsar's throughput is not affected by changing the number of partitions, but Kafka's throughput is.
  • When the number of partitions increased from 100 to 2000, Pulsar maintained its maximum throughput (about 300 MB/s under the level 1 durability guarantee and about 600 MB/s under level 2).
  • When the number of partitions increased from 100 to 2000, Kafka's throughput dropped by half.

Publish and end-to-end latency tests

The publish and end-to-end latency tests aim to determine the lowest latency each system can achieve when processing workloads (publishing and tailing reads) under different durability guarantees. We varied the number of subscriptions and the number of partitions to understand how each change affects publish and end-to-end latency.

We found the following:

  1. In all test cases, covering various durability guarantees and different numbers of partitions and subscriptions, Pulsar's publish and end-to-end latency is significantly (up to hundreds of times) lower than Kafka's. Even when the number of partitions increases from 100 to 10,000 or the number of subscriptions increases from 1 to 10, Pulsar's P99 publish and end-to-end latency both stay within 10 milliseconds.
  2. Changes in the number of subscriptions and the number of partitions have a huge impact on Kafka's publish and end-to-end latency.
  • When the number of subscriptions increased from 1 to 10, both publish and end-to-end latency increased from about 5 milliseconds to about 13 seconds.
  • When the number of topic partitions increased from 100 to 10,000, both publish and end-to-end latency increased from about 5 milliseconds to about 200 seconds.

Catch-up read test

The catch-up read test determines the maximum throughput each system can achieve when processing workloads consisting only of catch-up reads. At the start of the test, the producer sends messages at a fixed rate of 200K messages per second. After the producer has sent 512 GB of data, consumers start reading the accumulated messages; once they have processed the backlog, they keep pace with the producer, which continues to send new messages at the same rate.

When processing catch-up reads, Pulsar's maximum throughput is 3.5 times Kafka's: Pulsar reaches 3.5 GB/s (3.5 million messages/s), while Kafka reaches only 1 GB/s (1 million messages/s).

Mixed workload testing

The mixed workload test determines the impact of catch-up reads on publishing and tailing reads in a mixed workload. At the start of the test, the producer sends messages at a fixed rate of 200K messages per second while consumers consume them in tailing mode. After the producer has generated 512 GB of messages, a new set of catch-up consumers starts and reads all messages from the beginning. Meanwhile, the producer and the existing tailing-read consumers continue to publish and consume messages at the same rate.

We tested Kafka and Pulsar with different durability settings and found that catch-up reads severely affect Kafka's publish latency but have little impact on Pulsar's. Kafka's P99 publish latency increased from 5 milliseconds to 1-3 seconds, while Pulsar's P99 publish latency stayed between a few milliseconds and a few tens of milliseconds.

Conclusion

Benchmarks usually present only a narrow combination of business logic and configuration options, which may or may not reflect real application scenarios or best practices. This is what makes benchmarking tricky: a benchmark can be biased by problems in its framework, setup, or methodology, and we found all of these issues in Confluent's recent benchmark.

At the community's request, the StreamNative team undertook this benchmark to provide insight into Pulsar's real performance. To make the benchmark more accurate, we fixed the problems in the Confluent benchmark and added new test parameters that let us explore how each technology compares in real use cases.

According to our results, with the same durability guarantees, Pulsar outperforms Kafka on workloads that resemble real application scenarios, and in the limited test cases Confluent used, Pulsar achieves the same end-to-end throughput as Kafka. In addition, in every test case (across different numbers of subscriptions, numbers of topics, and durability guarantees), Pulsar delivers lower latency and better I/O isolation than Kafka.

As mentioned earlier, no benchmark can replace testing with real workloads on your own hardware. We encourage readers to test Pulsar and Kafka with their own settings and workloads to understand how each system performs in their specific production environment. If you have questions about Pulsar best practices, feel free to contact us directly or join the Pulsar Slack.

In the coming months, we will publish a series of blogs to help the community better understand and use Pulsar to meet their business needs. We will cover Pulsar's performance under different workloads and settings, how to choose and size hardware across cloud providers and on-premises environments, and how to build the most cost-effective streaming data platform with Pulsar.

About the Authors:

Sijie Guo, co-founder and CEO of StreamNative. Sijie has worked in the messaging and streaming data industry for more than ten years and is a veteran expert in the field. Before founding StreamNative, he co-founded Streamlio, a company focused on real-time solutions. He was previously the tech lead of Twitter's messaging infrastructure team and co-created DistributedLog and Twitter EventBus. During his time at Yahoo, he led the team that developed BookKeeper and Pulsar. He is Vice President of Apache BookKeeper and a member of the Apache Pulsar PMC.

Penghui Li, StreamNative software engineer and Apache Pulsar committer/PMC member. Penghui previously worked at Zhaopin, where he was the main champion of Apache Pulsar adoption. His career has centered on messaging systems and microservices, and he is now fully devoted to the world of Pulsar.
