It takes about 10 minutes to read this article.
After LinkedIn created Apache Kafka in 2011, it was for a long time practically the only choice for large-scale messaging systems. Why? Because such systems have to deliver millions of messages every day; the volume is enormous (in 2018 Twitter averaged 5 million tweets per day and roughly 100 million daily users). At the time there was no message-oriented middleware (MOM) that could handle streaming data with huge numbers of subscriptions, so big-name companies such as LinkedIn, Yahoo, Twitter, Netflix and Uber had little choice other than Kafka.
Now, in 2019, the world has changed dramatically: message volumes have grown to billions per day, and the supporting platforms have to scale accordingly to keep up with that continuous growth. A messaging system therefore needs to scale continuously and seamlessly without affecting its users. Kafka has many problems when it comes to scaling, and the system is hard to manage. Kafka fans may take issue with that statement, but it is not personal prejudice; I am a Kafka fan myself. Objectively speaking, as the world develops and innovates, newer tools tend to be more convenient and easier to use than older ones, and the older tools naturally start to feel full of holes and hard to work with. That is simply how things evolve.
Against this backdrop, a new product came into being: Apache Pulsar!
Yahoo created Pulsar in 2013 and donated it to the Apache Foundation in 2016. Pulsar is now an Apache top-level project and has earned worldwide recognition; both Yahoo and Twitter use it. Yahoo publishes more than 100 billion messages per day across more than 2 million topics. That volume sounds incredible, but it is true!
Next, let's look at Kafka's pain points and how Pulsar addresses them.
- Kafka is hard to scale because it persists messages in the brokers themselves. When a topic partition is migrated, all of its data has to be copied to other brokers, which is very time-consuming.
- When partitions have to be changed to obtain more storage space, this conflicts with how messages are indexed and breaks message ordering. So if users need to guarantee message order, Kafka becomes very tricky.
- If partition replicas are not in the ISR (in-sync) state, leader election can go wrong. Normally, when the original partition leader fails, one of the ISR replicas should be promoted, but that is not fully guaranteed. If the settings do not restrict leader election to ISR replicas, an out-of-sync replica may be elected leader, which is worse than having no broker serve the partition at all.
- When using Kafka, you have to plan the number of brokers, topics, partitions, and replicas around your current situation while fully accounting for future growth, to avoid the problems that come with scaling Kafka. That is the ideal; in practice such planning is hard to get right, and the need to scale inevitably arises.
- The partition rebalancing of the Kafka cluster will affect the performance of related producers and consumers.
- When a failure occurs, a Kafka topic cannot guarantee message integrity (especially in the situation described in point 3 above, messages can very easily be lost when the cluster needs to be scaled).
- Using Kafka means dealing with offsets, which is a headache because the broker does not maintain each consumer's consumption state.
- Under heavy usage, old messages must be deleted promptly, or the disks will run out of space.
- As is well known, Kafka's native cross-region replication mechanism (MirrorMaker) is problematic and does not work reliably even with only two data centers. That is why even Uber had to build its own solution, which it calls uReplicator (https://eng.uber.com/ureplicator/).
- If you want to do real-time data analysis, you have to bring in third-party tools such as Apache Storm, Apache Heron, or Apache Spark, and you also have to make sure those tools can keep up with the incoming traffic.
- Kafka has no native multi-tenancy with complete tenant isolation; it has to be approximated with security features such as topic authorization.
Of course, in a production environment architects and engineers have ways to work around the problems above, but from a platform, solution, or site-reliability standpoint they are a headache. It is not as simple as fixing some logic in the code, packaging the binary, and deploying it to production.
Now, let's talk about Pulsar, the leader in this competitive field.
What is Apache Pulsar?
Apache Pulsar is an open-source distributed publish-subscribe messaging system originally created by Yahoo. If you know Kafka, you can think of Pulsar as similar in nature.
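To make the comparison concrete, here is a minimal sketch of a Pulsar producer and consumer using the official Java client. It only shows the basic pub-sub flow; the service URL, topic name, and subscription name are illustrative assumptions, not values from this article.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class QuickStart {
    public static void main(String[] args) throws Exception {
        // Connect to a local standalone broker (URL is an assumption for this sketch).
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Publish a message to a topic.
        Producer<byte[]> producer = client.newProducer()
                .topic("my-topic")
                .create();
        producer.send("Hello Pulsar".getBytes());

        // Subscribe and consume the message; acknowledging it replaces
        // the manual offset bookkeeping familiar from Kafka.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("my-topic")
                .subscriptionName("my-subscription")
                .subscribe();
        Message<byte[]> msg = consumer.receive();
        System.out.println("Received: " + new String(msg.getData()));
        consumer.acknowledge(msg);

        consumer.close();
        producer.close();
        client.close();
    }
}
```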
Pulsar performance
Pulsar's most outstanding trait is its performance: Pulsar is much faster than Kafka. GigaOm (https://gigaom.com/), a technical research and analysis firm, compared the performance of the two and confirmed this.
Compared with Kafka, Pulsar's speed is increased by 2.5 times and latency is reduced by 40%. (Source: https://streaml.io/pdf/Gigaom-Benchmarking-Streaming-Platforms.pdf).
Please note that this performance comparison is for 1 topic in 1 partition, which contains 100-byte messages. Pulsar can send 220,000+ messages per second, as shown below.
Pulsar does a great job here; on this point alone it is arguably worth abandoning Kafka and turning to Pulsar. Next, I will analyze Pulsar's advantages and features in detail.
Advantages and features of Apache Pulsar
Pulsar supports not only pub-sub message-queue semantics but also sequential access (similar to Kafka's offset-based reads), which gives users great flexibility; a sketch of the sequential style follows below.
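For the sequential-access style, Pulsar's Java client offers a Reader interface that starts from an explicit position in the topic, much like seeking to an offset in Kafka. The snippet below is a minimal sketch; the topic name and broker URL are assumptions.

```java
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;

public class SequentialRead {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // A Reader has no subscription and no acknowledgements; it simply
        // reads forward from a chosen position (here: the earliest message).
        Reader<byte[]> reader = client.newReader()
                .topic("my-topic")
                .startMessageId(MessageId.earliest)
                .create();

        while (reader.hasMessageAvailable()) {
            Message<byte[]> msg = reader.readNext();
            System.out.println(new String(msg.getData()));
        }

        reader.close();
        client.close();
    }
}
```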
For data persistence, Pulsar's architecture differs from Kafka's. Kafka uses log files on the local broker, whereas Pulsar stores all topic data in a dedicated storage layer, Apache BookKeeper. In short, BookKeeper is a highly scalable, fault-tolerant, low-latency storage service optimized for real-time, durable workloads, so BookKeeper guarantees the availability of the data. In Kafka, the log files live on individual brokers, so a catastrophic server failure can compromise them and data availability cannot be fully guaranteed. This guaranteed persistence layer gives Pulsar another advantage: the brokers are stateless. That is fundamentally different from Kafka, and it means Pulsar brokers can scale out seamlessly to meet growing demand, because no actual data has to be moved when brokers are added.
What if a Pulsar broker goes down? Its topics are immediately reassigned to another broker; because no topic data lives on the broker's disk, service discovery automatically redirects producers and consumers to the new broker.
Kafka has to purge old data in order to reclaim disk space. Pulsar, by contrast, stores topic data in tiered storage that can be backed by additional disks or Amazon S3, so topic storage can be expanded and offloaded almost indefinitely. Even cooler, Pulsar presents the data to consumers seamlessly, as if it all lived on the same drive. Since old data never has to be purged, these well-organized Pulsar topics can even be used as a "Data Lake", which is a genuinely valuable scenario. And of course, when needed, old data in Pulsar can still be expired through configuration.
Pulsar natively supports multi-tenancy, with data isolation at the topic-namespace level, something Kafka cannot provide. Pulsar also supports fine-grained access control, which makes Pulsar applications more secure and reliable; the sketch below shows how the tenant/namespace hierarchy appears in topic names.
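To illustrate the tenant and namespace hierarchy, here is a hedged sketch using the Java client: every full topic name has the form persistent://tenant/namespace/topic, so data for different tenants lives in clearly separated namespaces. The tenant, namespace, and topic names below are made-up examples, not from this article.

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class MultiTenantTopics {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Full topic names carry the tenant and namespace, so "marketing"
        // and "finance" data never share a namespace (names are illustrative
        // and the tenants/namespaces must already exist).
        Producer<byte[]> marketing = client.newProducer()
                .topic("persistent://marketing/campaigns/clicks")
                .create();
        Producer<byte[]> finance = client.newProducer()
                .topic("persistent://finance/billing/invoices")
                .create();

        marketing.send("click-event".getBytes());
        finance.send("invoice-created".getBytes());

        marketing.close();
        finance.close();
        client.close();
    }
}
```

In a real deployment an administrator would first create the tenants and namespaces, and access-control policies can then be attached at the namespace level.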
Pulsar provides client libraries for Java, Go, Python, and C++, as well as a WebSocket API.
Pulsar natively supports Function-as-a-Service (FaaS). This feature is really cool: much like AWS Lambda, it can analyze, aggregate, or summarize real-time data streams in real time. With Kafka you would additionally need a stream-processing system such as Apache Storm, which adds cost and is troublesome to maintain; on this point Pulsar is far ahead of Kafka. As of now, Pulsar Functions supports Java, Python, and Go, with more languages planned for future versions.
Use cases for Pulsar Functions include content-based routing, aggregation, message formatting, message cleansing, and so on.
The following is an example of word count calculation.
package org.example.functions;

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

import java.util.Arrays;

public class WordCountFunction implements Function<String, Void> {
    // This is invoked every time a message is published to the input topic.
    @Override
    public Void process(String input, Context context) throws Exception {
        Arrays.asList(input.split(" ")).forEach(word -> {
            String counterKey = word.toLowerCase();
            // Increment the running count for this word in Pulsar's state store.
            context.incrCounter(counterKey, 1);
        });
        return null;
    }
}
Pulsar supports multiple data sinks for routing processed messages to mainstream systems, such as Pulsar topics themselves, Cassandra, Kafka, AWS Kinesis, Elasticsearch, Redis, MongoDB, InfluxDB, and more.
In addition, the processed message stream can be persisted to a disk file.
Pulsar offers Pulsar SQL for querying historical messages, using the Presto engine to query the data held in BookKeeper efficiently. Presto is a high-performance distributed SQL query engine for big data that can query multiple data sources within a single query. Below is an example query using Pulsar SQL.
show tables in pulsar."public/default"
Pulsar has powerful built-in geo-replication that can synchronize messages between clusters in different regions almost instantly while preserving message integrity. When a message is produced to a Pulsar topic, it is first persisted in the local cluster and then asynchronously forwarded to the remote clusters. In Pulsar, geo-replication is enabled per tenant: replication can only be enabled between two clusters when a tenant has been created with access to both of them.
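As a rough sketch of how this is wired up with the Java admin client (the cluster names, namespace, and admin URL below are assumptions, and the exact admin API signatures can vary slightly between Pulsar versions):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.pulsar.client.admin.PulsarAdmin;

public class GeoReplicationSetup {
    public static void main(String[] args) throws Exception {
        // Connect to the admin endpoint of one of the clusters (placeholder URL).
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // Assumes a tenant "my-tenant" already exists with access to both clusters.
        String namespace = "my-tenant/replicated-ns";
        admin.namespaces().createNamespace(namespace);

        // Messages produced to topics in this namespace are persisted locally
        // and then asynchronously replicated to the other cluster.
        Set<String> clusters = new HashSet<>(Arrays.asList("us-west", "us-east"));
        admin.namespaces().setNamespaceReplicationClusters(namespace, clusters);

        admin.close();
    }
}
```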
For securing the message-delivery channel, Pulsar natively supports TLS and JWT-token-based authentication and authorization, so you can specify who may publish to or consume from which topics. To go further, Pulsar Encryption lets applications encrypt all messages on the producer side and have them decrypted only when Pulsar delivers them to the consumer. Encryption uses public/private key pairs configured by the application, and only consumers holding a valid key can decrypt the messages. This does carry a performance cost, because every message must be encrypted and decrypted before it can be processed.
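As a rough illustration of these security hooks in the Java client, the sketch below configures TLS plus token authentication; the broker URL, token, and topic are placeholders, and the end-to-end encryption settings are only indicated in comments.

```java
import org.apache.pulsar.client.api.AuthenticationFactory;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class SecureProducer {
    public static void main(String[] args) throws Exception {
        // Connect over TLS and authenticate with a JWT token
        // (URL and token value are placeholders for this sketch).
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar+ssl://broker.example.com:6651")
                .authentication(AuthenticationFactory.token("<jwt-token>"))
                .build();

        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/secure-topic")
                // For end-to-end encryption, an encryption key and a
                // CryptoKeyReader (which loads the key pair) would also be
                // configured here, e.g. .addEncryptionKey("my-app-key")
                // together with .cryptoKeyReader(...).
                .create();

        producer.send("sensitive payload".getBytes());
        producer.close();
        client.close();
    }
}
```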
Users who are currently on Kafka and want to migrate to Pulsar can rest easy: Pulsar natively supports consuming Kafka data directly through a connector, and existing Kafka application data can also be imported into Pulsar. The process is quite straightforward.
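One migration path worth mentioning (this is my own addition based on Pulsar's Kafka compatibility wrapper, pulsar-client-kafka, rather than something detailed in this article) is to keep existing Kafka-API code unchanged, swap the client dependency for the wrapper, and point it at a Pulsar service URL. A hedged sketch:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaStyleProducer {
    public static void main(String[] args) throws Exception {
        // Unchanged Kafka-API code; with the pulsar-client-kafka wrapper on
        // the classpath, the broker address points at Pulsar instead of Kafka.
        Properties props = new Properties();
        props.put("bootstrap.servers", "pulsar://localhost:6650");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();
        producer.close();
    }
}
```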
Summary
This article is not saying that Kafka cannot be used for large-scale message-processing platforms and that Pulsar is the only possible choice. What I want to emphasize is that Pulsar already offers good solutions to Kafka's pain points, which is a good thing for any engineer or architect. Architecturally, Pulsar is also much faster for large messaging solutions, and the fact that Yahoo and Twitter (along with many other companies) run Pulsar in production shows that it is stable enough for any production environment. Switching from Kafka to Pulsar does involve a small learning curve, but the return on investment is considerable!
If you need Pulsar's corporate services and support, please contact StreamNative (info@streamnative.io).
Author: Anuradha Prasanna
Translation: Zhanying
Reviewer: Jennifer + Sijie + Yjshen
Editor: Irene
Apache Pulsar is a next-generation cloud-native distributed streaming data platform. It originated at Yahoo, was open-sourced in December 2016, and officially became an Apache top-level project in September 2018. It has gradually evolved from a pure messaging system into a streaming data platform that integrates messaging, storage, and lightweight function computing. As Apache Pulsar develops rapidly, community members have also taken its evangelism beyond Silicon Valley, beginning a remarkable journey in the Chinese community.
Click on the link to read the original English text.