The author of this article is David Kjerrumgaard, currently a contributor to the Splunk, Apache Pulsar, and Apache NiFi projects. It was translated, with the author's permission, by Sijia@StreamNative. Original link: https://searchdatamanagement.techtarget.com/post/Apache-Pulsar-vs-Kafka-and-other-data-processing-technologies

About Apache Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation and a next-generation cloud-native distributed messaging and streaming platform. It integrates messaging, storage, and lightweight functional computing, and adopts an architecture that separates computing from storage. Pulsar supports multi-tenancy, persistent storage, and multi-datacenter and cross-region data replication, and offers strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/

How does the distributed messaging platform Apache Pulsar store data, compared with data processing middleware such as Apache Kafka? Starting from their architectures, this article compares the strengths and weaknesses of Apache Pulsar and traditional data processing middleware such as Apache Kafka, for your reference.

Scalable storage

Apache Pulsar's multi-layer architecture completely decouples the message serving layer from the storage layer, so each layer can scale independently. Traditional distributed data processing systems (such as Hadoop and Spark) process and store data on the same cluster nodes or instances. That design reduces the amount of data transferred over the network and keeps the architecture simpler, which improves performance, but it comes at the cost of scalability, elasticity, and ease of operations.

Pulsar's layered architecture is unique among cloud-native solutions. Today's greatly increased network bandwidth provides a solid foundation for this architecture and makes the separation of computing and storage practical. Pulsar decouples the serving layer from the storage layer: stateless broker nodes serve data, while bookie nodes store it (Figure 1).

Figure 1. Decoupling the serving layer from the storage layer

Decoupling the serving layer from the storage layer brings many advantages. First, each layer can scale elastically without affecting the other; with the elasticity of cloud and container environments, each layer can automatically scale up and down to adapt to traffic spikes. Second, it significantly reduces the complexity of cluster expansion and upgrades, improving system availability and manageability. Third, the design is container-friendly, making Pulsar well suited for hosting cloud-native streaming systems. Apache Pulsar uses the highly scalable Apache BookKeeper as its storage layer, providing strong durability guarantees, distributed data storage and replication, and native support for cross-region replication.

The multi-layer design also makes tiered storage easy to achieve: infrequently accessed data can be offloaded to low-cost persistent storage (such as AWS S3 or Azure cloud storage). Pulsar supports configuring a storage size threshold or time period, after which data on local disks is automatically offloaded to the cloud storage platform, freeing local disk space while safely backing up the event data.
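As a rough sketch, S3 offload can be configured through broker settings like the following (property names are those of the S3 offloader in recent Pulsar releases; the bucket, region, and threshold values here are placeholder examples — check the broker.conf shipped with your version):

```properties
# broker.conf — offload closed ledger segments to S3 (example values)
managedLedgerOffloadDriver=aws-s3
s3ManagedLedgerOffloadBucket=my-pulsar-offload-bucket
s3ManagedLedgerOffloadRegion=us-west-2
# Automatically offload a topic's backlog once it exceeds ~10 GB on local disks
managedLedgerOffloadAutoTriggerSizeThresholdBytes=10737418240
```

The threshold can also be managed per namespace at runtime with `pulsar-admin namespaces set-offload-threshold`.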

Pulsar vs. Kafka

Apache Pulsar and Apache Kafka share similar messaging concepts: clients interact with both through topics, which are logically divided into partitions. Generally speaking, the unbounded stream of data written to a topic is divided across a fixed number of partitions of roughly equal size, so that the data is distributed evenly across the system and can be consumed by multiple clients at the same time.
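The routing idea can be sketched in a few lines of Python. This is not either client's actual hashing scheme (real clients use murmur-style hashes and configurable routers); it only illustrates how keyed messages from one unbounded stream end up spread evenly across a fixed number of partitions:

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # fixed at topic creation in both systems

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition index via a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Route a stream of keyed messages and observe the spread.
keys = [f"user-{i}" for i in range(10_000)]
counts = Counter(partition_for(k) for k in keys)
print(counts)  # all four partitions receive a similar share
```

Because each partition can be served and consumed independently, this even spread is what lets multiple clients read the topic in parallel.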

The essential difference between Apache Pulsar and Apache Kafka lies in how partitions are stored. Apache Kafka is a partition-centric publish/subscribe system designed to run as a monolithic architecture, with the serving layer and the storage layer on the same nodes.

Figure 2. Kafka partitions

Kafka storage: based on partitions

In Kafka, a partition is stored as a single contiguous body of data on its leader node and replicated to a configurable number of replica nodes for redundancy. This design limits a partition's capacity, and therefore a topic's ability to scale, in two ways. First, since a partition can only be stored on local disks, its maximum size is that of the largest single disk on the host (about 4 TB for a fresh installation). Second, since the data must be replicated, a partition can grow no larger than the smallest amount of disk space available on its replica nodes.
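Those two limits compose into a simple minimum, which the following sketch makes explicit (the disk sizes are hypothetical, matching the scenario in Figure 3):

```python
def max_partition_size_tb(leader_disk_tb: float, replica_disks_tb: list[float]) -> float:
    """Effective partition capacity: bounded by the leader's largest single
    disk AND by the smallest disk among the replicas, whichever is lower."""
    return min(leader_disk_tb, min(replica_disks_tb))

# Figure 3 scenario: a 4 TB leader disk, but two replicas with 1 TB each.
print(max_partition_size_tb(4.0, [1.0, 1.0]))  # -> 1.0
```

Even though the leader has 4 TB available, the topic stalls after 1 TB because the replicas can accept no more.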

Figure 3. Kafka failures and capacity expansion

Suppose the leader node has a 4 TB disk dedicated to partition storage, while the two replica nodes each have only 1 TB. After 1 TB of data has been published to the topic, Kafka detects that the replica nodes can no longer accept data, and no more messages can be published to the topic until space is freed on the replicas (Figure 3). If producers cannot buffer messages during this interruption, data loss may result.

There are two ways to handle this problem. The first is to delete data from the existing replica nodes' disks; but since data from other topics may not have been consumed yet, this can itself cause data loss. The second is to add a new node to the Kafka cluster and "rebalance" the partitions so the new node serves as a replica. However, that requires recopying the entire 1 TB partition, which is time-consuming, error-prone, costly in network bandwidth and disk I/O, and expensive overall. Moreover, for applications with strict SLAs, taking partitions offline to copy them is undesirable.

With Kafka, recopying partition data is not limited to cluster expansion: many other failures, such as replica failures, disk failures, or machine failures, also require the partition data to be re-replicated. This shortcoming of Kafka is easy to overlook until a failure occurs in production.

Figure 4. Pulsar segments

Pulsar storage: based on segments

In a segment-based storage architecture such as the one Apache Pulsar uses, each partition is further divided into segments, which are rolled over according to a pre-configured time or size threshold. The segments are distributed evenly across the bookies of the storage layer, providing data replication and easy capacity expansion.

When a bookie's disk space is used up and no more data can be written to it, Kafka would need to recopy data; how does Pulsar handle this scenario? Because partitions are divided into segments, there is no need to copy a full bookie's contents to a newly added bookie. Until a new bookie is added, Pulsar simply keeps accepting data and writes new segments to the bookies whose storage is not yet full. When a new bookie joins, it immediately starts taking traffic for new segments and new partitions, with no recopying of old data.

Figure 5. Pulsar failures and capacity expansion

As shown in Figure 5, when the fourth bookie can no longer accept new segments, segments 4-7 are routed to the other active bookies. Once a new bookie is added, segments are automatically routed to it as well. Throughout the process, Pulsar keeps running and continues serving producers and consumers. In this respect, Pulsar's storage layer is more flexible and more scalable.
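The placement behavior described above can be simulated in a few lines. This is only a sketch of the idea (BookKeeper's actual ensemble placement is richer, with replication and rack awareness): new segments go round-robin to bookies with free capacity, so a full bookie never blocks the topic and nothing old is recopied:

```python
def place_segments(num_segments: int, capacities: dict[str, int]) -> dict[str, list[int]]:
    """Round-robin segments over bookies, skipping any that are full.
    `capacities` maps bookie name -> how many more segments it can hold."""
    placement: dict[str, list[int]] = {b: [] for b in capacities}
    bookies = list(capacities)
    i = 0
    for seg in range(num_segments):
        # Find the next bookie with spare capacity.
        for _ in range(len(bookies)):
            b = bookies[i % len(bookies)]
            i += 1
            if capacities[b] > 0:
                capacities[b] -= 1
                placement[b].append(seg)
                break
        else:
            raise RuntimeError("all bookies are full")
    return placement

# Figure 5 scenario: bookie-4 is full, so its share of segments flows to the others.
caps = {"bookie-1": 10, "bookie-2": 10, "bookie-3": 10, "bookie-4": 0}
print(place_segments(8, caps))  # bookie-4 stays empty; writes never stall
```

Adding a fresh bookie to `caps` is all it takes for new segments to start landing on it — the existing placement is untouched.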
