
This series of articles focuses on the latency of Pulsar and Kafka. The previous two articles introduced the test method (the green part of the figure below) and the test details (the blue part of the figure below).

This article presents the test results for Pulsar and Kafka (the red part of the figure below). The fsync state is the main variable in the experiment. To compare the two systems more thoroughly, the tester also varied the number of partitions.

Apache Pulsar test results

This section details the latency test results for Apache Pulsar. We first present the results with fsync enabled (Pulsar's default mode), and then the results with message flushing disabled.

For each workload there are two graphs: one shows the p99 publish latency over the course of the test, and the other shows the average end-to-end latency. The graphs are accompanied by a table that summarizes the latency measurements taken during the test as a latency distribution.

The percentile calculations for publish latency are more accurate than those for end-to-end latency, because end-to-end latency relies on the timestamp automatically set in the message header, which has millisecond resolution, whereas publish latency is measured with nanosecond resolution.

All tests use 100-byte messages. Each test runs for 15 minutes and uses only two client servers (one producing, one consuming), with production and consumption rates held constant at 50,000 messages per second. The Apache Pulsar version tested is 2.4.0.
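To illustrate why timestamp resolution matters for percentiles, here is a minimal sketch. The latency samples are hypothetical, not taken from the tests, and the nearest-rank percentile helper is an assumption rather than the method used by the benchmark tool. It shows how truncating nanosecond measurements to whole milliseconds collapses distinct samples into a few buckets.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list (pct in 0..100)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


# Hypothetical latencies in nanoseconds, as a nanosecond-resolution clock records them.
latencies_ns = sorted([2_916_000, 2_950_400, 3_101_700, 3_480_250, 4_095_900])

# The same samples seen through a millisecond-resolution timestamp (truncated).
latencies_ms = [ns // 1_000_000 for ns in latencies_ns]

p50_ns = percentile(latencies_ns, 50)
p50_ms = percentile(latencies_ms, 50)

print(f"p50 at nanosecond resolution:  {p50_ns / 1e6:.4f} ms")
print(f"p50 at millisecond resolution: {p50_ms} ms")
print(f"distinct values: {len(set(latencies_ns))} vs {len(set(latencies_ms))}")
```

With millisecond timestamps, any two samples that fall within the same millisecond become indistinguishable, so percentiles can only land on whole-millisecond values.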
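As a quick sanity check on the size of this workload, the sketch below derives the totals from the figures stated above. It counts message payload only; protocol and header overhead are ignored, which is a simplifying assumption.

```python
MESSAGE_SIZE_BYTES = 100   # payload size stated in the article
RATE_PER_SECOND = 50_000   # constant produce/consume rate
DURATION_MINUTES = 15      # length of each test

total_messages = RATE_PER_SECOND * DURATION_MINUTES * 60
payload_bytes_per_second = RATE_PER_SECOND * MESSAGE_SIZE_BYTES
total_payload_bytes = total_messages * MESSAGE_SIZE_BYTES

print(f"messages per test:  {total_messages:,}")
print(f"payload throughput: {payload_bytes_per_second / 1e6:.1f} MB/s")
print(f"payload per test:   {total_payload_bytes / 1e9:.1f} GB")
```

Each test therefore moves 45 million messages at about 5 MB/s of payload, a moderate load chosen to measure latency rather than maximum throughput.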

Latency test with fsync enabled

Test 1: 1 topic, 1 partition

Test 2: 1 topic, 6 partitions

Test 3: 1 topic, 16 partitions

Discussion

Since partitions are the unit of parallelism in both Pulsar and Kafka, we expected latency to decrease as the number of partitions increased, and the test results bear this out. Overall, as the number of partitions increased, both publish latency and end-to-end latency decreased.

Every test has outliers, but the maximum latency never exceeds 267 milliseconds. Publish latency fluctuates less than end-to-end latency. Across all tests, the p9999 publish latency never exceeded 11.6 milliseconds. The effect of partitioning is most visible in the end-to-end latency of the 16-partition test: its average latency (3 milliseconds) is one third of that of the 1-partition test (9 milliseconds).

Pulsar's publish latency does not change over time. All tests run for 15 minutes, and as the graphs show, the average publish latency fluctuated very little during that period. End-to-end latency does vary over time: at roughly 90-second intervals the average latency rises by about 2 milliseconds, and the size of this variation is almost constant. For example, the average end-to-end latency is 9 milliseconds with 1 partition and 3 milliseconds with 16 partitions, but the variation never exceeds 2 milliseconds (9 milliseconds rises to 11 milliseconds, and 3 milliseconds rises to 5 milliseconds).

Latency test with fsync disabled

The test conditions are the same as before, except that journalSyncData=false was set in the bookkeeper.conf file to disable flushing each message to disk, after which the Pulsar broker and BookKeeper were restarted.
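The configuration change can be sketched as a small helper that rewrites a properties-style file. This is an illustration only: the `set_property` function and the sample file contents are hypothetical, and only the `journalSyncData` key and the `bookkeeper.conf` file name come from the article.

```python
def set_property(conf_text, key, value):
    """Return conf_text with key=value set.

    Replaces an existing uncommented entry for the key, or appends one
    if the key is not present.
    """
    lines = conf_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Skip comments and lines that are not key=value pairs.
        if stripped.startswith("#") or "=" not in stripped:
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Hypothetical excerpt standing in for the real bookkeeper.conf.
sample = "# BookKeeper journal settings\njournalSyncData=true\n"
updated = set_property(sample, "journalSyncData", "false")
print(updated)
```

As the article notes, the change only takes effect after the Pulsar broker and BookKeeper are restarted.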

Test 4: 1 topic, 1 partition

Test 5: 1 topic, 6 partitions

Test 6: 1 topic, 16 partitions

Discussion

As expected, latency is lower with flushing disabled, but not by much. For example, with one partition the p99 publish latency is 4.129 milliseconds with flushing enabled and 3.928 milliseconds with flushing disabled. In the 16-partition test, flushing has almost no effect on latency. The end-to-end latency shows the same periodic variation of about 2 milliseconds, at the same interval, as in the previous tests (the peaks in the graphs).

Disabling flushing sacrifices some durability in exchange for only a marginal latency gain, so with Apache Pulsar it is not worth disabling flushing for latency reasons.
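To put the one-partition numbers in proportion, the sketch below computes the relative improvement from the two p99 publish latencies quoted above. The helper name is illustrative.

```python
def relative_reduction(before_ms, after_ms):
    """Fraction by which latency drops when going from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms


p99_flush_on_ms = 4.129   # 1 partition, flushing enabled
p99_flush_off_ms = 3.928  # 1 partition, flushing disabled

saved_ms = p99_flush_on_ms - p99_flush_off_ms
reduction = relative_reduction(p99_flush_on_ms, p99_flush_off_ms)

print(f"saved {saved_ms:.3f} ms, a reduction of {reduction:.1%}")
```

The gain is about 0.2 milliseconds, or a little under 5 percent, which is small compared with the durability given up.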

Apache Kafka test results

Since Kafka does not flush each message to disk by default, we run that configuration first. As in the Pulsar tests, all tests use 100-byte messages at a rate of 50,000 messages per second, with only two clients. Latencies were recorded during the test and summarized in a table. The Apache Kafka version tested is 2.3.0 (the Scala 2.11 build, 2.11-2.3.0).

Latency test with fsync disabled

Test 7: 1 topic, 1 partition

Test 8: 1 topic, 6 partitions

Test 9: 1 topic, 16 partitions

Discussion

Look first at the publish latency with 1 partition. With per-message flushing disabled, Kafka's latency is lower than Pulsar's (2.969 milliseconds with flushing and 2.72 milliseconds without). The latency distribution, however, reveals the main difference between Pulsar and Kafka.

Pulsar's latency distribution is much tighter: from p50 to p999 the latency rises only from 2.916 milliseconds to 4.095 milliseconds, whereas Kafka's latency reaches 149.616 milliseconds at p999, a very different result. At p99 with 1 partition, Pulsar's latency is 52.958 milliseconds, while Kafka's is almost 4 times that, at 201.701 milliseconds. We are comparing the default modes here, so flushing is enabled for Pulsar and disabled for Kafka. With Pulsar's disk flushing disabled, its p999 latency is reduced to only 4.508 milliseconds.

The p99 publish latency graphs make the cause of the many outliers in the Kafka tests clear: Kafka shows periodic spikes in which publish latency jumps from single digits to more than 100 milliseconds. As the number of partitions increases, these swings in publish latency shrink, but they do not disappear. Pulsar's p99 publish latency, by comparison, is essentially a flat line for the whole test.

Another difference between Pulsar and Kafka is that as the number of partitions increases, Pulsar's publish latency decreases while Kafka's increases. In the 1-partition test Kafka's average publish latency is lower than Pulsar's, but in the 6-partition and 16-partition tests Pulsar's is lower. In the 16-partition test, Pulsar's average publish latency was under 3 milliseconds, while Kafka's was about 8.5 milliseconds.

Looking at average end-to-end latency, Kafka does better in the 1-partition test, but as the number of partitions increases, its end-to-end latency rises just as its publish latency does. In the 16-partition test, Kafka's average end-to-end latency was 11 milliseconds, while Pulsar's was close to 3 milliseconds. The Pulsar tests show a periodic peak of about 2 milliseconds; in the Kafka tests the peaks are more frequent and larger, usually above 5 milliseconds.
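One way to express how concentrated a distribution is, sketched here as an illustration, is the ratio between a tail percentile and a lower one. The `spread` helper is hypothetical; the inputs are the figures quoted above.

```python
def spread(low_ms, high_ms):
    """Ratio of a higher latency figure to a lower one."""
    return high_ms / low_ms


pulsar_p50_ms, pulsar_p999_ms = 2.916, 4.095
kafka_p999_ms = 149.616

pulsar_spread = spread(pulsar_p50_ms, pulsar_p999_ms)
kafka_vs_pulsar_tail = spread(pulsar_p999_ms, kafka_p999_ms)

print(f"Pulsar p999 / p50:         {pulsar_spread:.2f}x")
print(f"Kafka p999 / Pulsar p999:  {kafka_vs_pulsar_tail:.1f}x")
```

Pulsar's p999 is only about 1.4 times its median, while Kafka's p999 is more than 36 times Pulsar's p999.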
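The kind of periodic spike described here can be located in a p99 time series with a simple threshold check. The sketch below uses made-up series, not the measured data, and the 100-millisecond threshold is taken from the text above; it is an illustration, not the analysis used in the tests.

```python
def spike_indices(series_ms, threshold_ms):
    """Indices of samples whose latency exceeds the threshold."""
    return [i for i, value in enumerate(series_ms) if value > threshold_ms]


# Hypothetical p99 publish latencies (ms), one sample per reporting window.
kafka_like = [5.1, 4.8, 120.3, 5.0, 4.9, 131.7, 5.2, 4.7, 118.9]
pulsar_like = [4.1, 4.2, 4.1, 4.0, 4.2, 4.1, 4.1, 4.2, 4.1]

kafka_spikes = spike_indices(kafka_like, 100.0)
pulsar_spikes = spike_indices(pulsar_like, 100.0)

# Distance between consecutive spikes; equal gaps suggest a periodic cause.
gaps = [later - earlier for earlier, later in zip(kafka_spikes, kafka_spikes[1:])]

print("Kafka-like spikes at windows:", kafka_spikes, "gaps:", gaps)
print("Pulsar-like spikes at windows:", pulsar_spikes)
```

Evenly spaced spike indices are what a periodic pattern looks like in the data, while a flat series yields no spikes at all.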

Latency test with fsync enabled

The test conditions are exactly the same as before, except that per-message flushing (fsync) was enabled by configuring each topic in the test with flush.messages=1 and flush.ms=0.
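The article does not say which tool was used to apply these topic settings. One common way is the `kafka-configs.sh` script shipped with Kafka, and the sketch below only assembles such a command line without running it. The topic name and ZooKeeper address are hypothetical, and the exact flags should be checked against the Kafka version in use.

```python
def add_config_value(overrides):
    """Join topic-level overrides into a comma-separated key=value string."""
    return ",".join(f"{key}={value}" for key, value in overrides.items())


# Settings from the article: flush after every message, with no time delay.
flush_overrides = {"flush.messages": 1, "flush.ms": 0}
config_value = add_config_value(flush_overrides)

# Hypothetical topic name and ZooKeeper address, for illustration only.
command = [
    "bin/kafka-configs.sh",
    "--zookeeper", "localhost:2181",
    "--entity-type", "topics",
    "--entity-name", "latency-test",
    "--alter",
    "--add-config", config_value,
]

print(" ".join(command))
```

Setting `flush.messages=1` forces a flush to disk after every message, which is what makes this configuration comparable to Pulsar's default behavior.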

Test 10: 1 topic, 1 partition

Test 11: 1 topic, 6 partitions

Test 12: 1 topic, 16 partitions

Discussion

Pulsar enables flushing by default, and this set of like-for-like tests shows that Pulsar performs better. In the 1-partition test, Kafka performed better when flushing was disabled. With flushing enabled on both, Pulsar's average latency was 2.969 milliseconds, while Kafka's was higher at 6.652 milliseconds.

As the number of partitions increases, Kafka's latency increases. In the 16-partition test, Pulsar's latency is 2.72 milliseconds, while Kafka's is 18.454 milliseconds, more than 6 times Pulsar's.

With flushing enabled, Kafka still shows large spikes in publish latency, but they are less frequent.

In Kafka, enabling flushing increases end-to-end latency. Kafka performed better with 1 partition (7.129 ms vs 9.052 ms), while Pulsar performed better with 6 and 16 partitions. Over the course of the test, Kafka's end-to-end latency shows peaks of up to 5 milliseconds.
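The widening gap can be checked directly from the latencies quoted in this discussion. The sketch below computes the Kafka-to-Pulsar ratio at 1 and 16 partitions, both with flushing enabled; the dictionary layout is just an illustrative way to hold the numbers.

```python
# Latencies in milliseconds quoted in this section (flushing enabled on both).
results_ms = {
    1: {"pulsar": 2.969, "kafka": 6.652},
    16: {"pulsar": 2.72, "kafka": 18.454},
}

ratios = {
    partitions: result["kafka"] / result["pulsar"]
    for partitions, result in results_ms.items()
}

for partitions, ratio in ratios.items():
    print(f"{partitions} partition(s): Kafka / Pulsar = {ratio:.2f}x")
```

The ratio grows from roughly 2.2 with 1 partition to roughly 6.8 with 16 partitions, consistent with the "more than 6 times" figure above.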

Summary

The test results above can be summarized as follows:

  1. Pulsar's latency is more predictable over time. Compared with Kafka, Pulsar's latency-versus-time curve is smoother. The comparison chart (6 partitions, average end-to-end latency, flushing disabled) shows that Kafka's latency is lower than Pulsar's, but Pulsar's latency is more stable.

  2. Pulsar's latency varies little across the distribution. In most of the Kafka tests the p999 latency increases, whereas in the Pulsar tests the p999 latency increases in only a few cases. The comparison chart (6 partitions, p99 publish latency, fsync enabled) shows that Pulsar's latency is more stable than Kafka's.

  3. With a single producer and a single consumer, latency decreases as the number of Pulsar topic partitions increases, whereas latency increases as the number of Kafka topic partitions increases.
  4. When the strongest durability guarantee is required (no messages may be lost), Pulsar's latency is lower than Kafka's.
  5. With fsync disabled, Pulsar's latency is slightly lower, but message durability can no longer be guaranteed.

For latency-sensitive workloads, Pulsar performs better overall: it delivers consistently low latency together with strong durability. Of course, not all workloads are latency-sensitive, and some may accept higher latency in exchange for higher throughput.

