This article is translated from "Apache Pulsar: A Unified Queueing and Streaming Platform" by Addison Higham. Original link: https://thenewstack.io/apache-pulsar-a-unified-queueing-and-streaming-platform/
About the Author and Translator

Addison Higham is a lead engineer at StreamNative. This article was translated by Liu Zilin.

It is time to make a bold statement: there are signs that Apache Pulsar, an open source distributed messaging system, is on the cusp of becoming central to modern application architecture and development.

Why do we dare to say this?

As engineering teams face increasingly complex challenges, the technologies and tools needed to solve the ever-increasing range of problems continue to evolve. A common tool for this is messaging.

Messaging is based on message queues. Messages are queued asynchronously between client applications, with a "broker" acting as an intermediary between applications.

In earlier systems, brokers were relatively simple. But as real-world needs changed, messaging systems changed with them. Distributed messaging systems build on the same concept: they offer reliability, scalability, and durability, and multiple brokers share the load.

Most distributed messaging systems support only one type of semantics: streams or queues. Historically, each has been best suited to its own kind of scenario. Apache Pulsar is unique in that it supports both stream and queue semantics.

Before exploring the advantages of a unified messaging and streaming system, let's take a step back and look at message queuing and streaming technologies separately.

Streaming systems are a relatively recent innovation that can move large amounts of data in an orderly, low-latency fashion. They are ideal for moving data such as logs, metrics, or click events and centralizing it in one place, while achieving high concurrency and high throughput.

An example would be collecting click or metric data from 10,000 machines in a large cloud deployment. A streaming system makes this straightforward.

In a way, message queues are similar to streaming systems in that they link multiple systems together. However, message queues are older and are more about point-to-point communication, helping a wide range of applications exchange information.

The access patterns of the two systems are also different: streaming systems focus on messages arriving in order and on processing groups of messages together in batches, often to aggregate or transform data.

In a message queue, by contrast, events are usually processed one at a time. As in a work queue, each message may represent a specific "task". In other words, streams are used to move and process large groups of data, whereas queues tend to be about the fine-grained processing of individual messages to facilitate some work in the system.
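The two access patterns can be contrasted in a small sketch. It uses only the Python standard library and is not tied to any messaging product; the event fields and the function names (`process_as_stream`, `process_as_queue`, `send_welcome_email`) are hypothetical, chosen only for illustration.

```python
from collections import Counter, deque

# Hypothetical click events, in arrival order (names are illustrative only).
events = [
    {"page": "/home", "user": "a"},
    {"page": "/docs", "user": "b"},
    {"page": "/home", "user": "c"},
    {"page": "/docs", "user": "a"},
    {"page": "/home", "user": "b"},
]


def process_as_stream(messages, batch_size):
    """Stream-style access: read messages in order, a batch at a time,
    and aggregate each batch (here: count clicks per page)."""
    results = []
    for start in range(0, len(messages), batch_size):
        batch = messages[start:start + batch_size]
        results.append(dict(Counter(m["page"] for m in batch)))
    return results


def process_as_queue(tasks, handler):
    """Queue-style access: take one message at a time and treat it as a
    separate task; each task leaves the queue once it has been handled."""
    pending = deque(tasks)
    outcomes = []
    while pending:
        task = pending.popleft()
        outcomes.append(handler(task))
    return outcomes


def send_welcome_email(task):
    # Stand-in for real per-message work.
    return f"emailed {task['user']} about {task['page']}"


stream_results = process_as_stream(events, batch_size=3)
queue_results = process_as_queue(events, send_welcome_email)
```

Here the stream side produces one aggregate per batch, while the queue side produces one outcome per message. The same five events serve both patterns; only the way they are consumed differs.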

The most common streaming platforms are Apache Kafka and AWS Kinesis. The most common queuing systems include RabbitMQ and ActiveMQ. In the cloud, there are Google's Pub/Sub and AWS SQS and SNS.

Apache Pulsar: Unified Message Queuing and Streaming

First, a brief look back at history.

Pulsar was originally developed internally at Yahoo around 2010 because of the need to queue very large workloads. Yahoo's services are massive and spread across many different teams and data centers.

At the time, teams were using Java Message Service (JMS), the popular standard in the Java community, so they needed a new system that could meet the requirements of the JMS standard while being distributed and scalable.

Although the API of Pulsar's early prototype initially focused on messaging scenarios, the system's architecture also made it well suited to streaming workloads, allowing the Yahoo team to use it flexibly in a wider range of scenarios.

This service, called "Cloud Messaging Service", was very successful and became widely known within Yahoo. Building on this momentum, Yahoo continued to develop the Cloud Messaging Service internally and open-sourced it in 2016, which is where the Pulsar project came from.

In 2018, the project graduated as a top-level project of the Apache Software Foundation. Since then, Pulsar's presence in global enterprises has grown rapidly. The reason is obvious: many enterprises, such as Yahoo, need more scalable messaging solutions.

A streaming system like Apache Kafka can scale further (although rebalancing data still involves a lot of manual work), but the API it offers is somewhat lacking for messaging use cases. It requires developers to work within the constraints of a pure streaming model and to learn a new way of thinking and designing, which makes messaging scenarios more difficult.

But with Pulsar, it's a different story. Developers can work in familiar ways using familiar APIs, while also gaining greater scalability and streaming capabilities.
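Pulsar makes the choice between stream and queue semantics per subscription: its subscription types include exclusive, failover, shared, and key_shared. The following is a toy in-memory model of that idea, not the Pulsar client API. The classes `ToyTopic` and `ToySubscription`, the subscription names, and the consumer names are all hypothetical, and real shared subscriptions also involve acknowledgements and redelivery, which are left out here.

```python
from itertools import cycle


class ToySubscription:
    """Hypothetical in-memory stand-in for a subscription on a topic."""

    def __init__(self, name, mode, consumers):
        if mode not in ("exclusive", "shared"):
            raise ValueError(f"unknown mode: {mode}")
        if mode == "exclusive" and len(consumers) != 1:
            raise ValueError("an exclusive subscription allows exactly one consumer")
        self.name = name
        self.mode = mode
        self.received = {c: [] for c in consumers}
        self._next_consumer = cycle(consumers)

    def deliver(self, message):
        # Exclusive: the single consumer sees every message, in order (stream).
        # Shared: each message goes to exactly one consumer, round-robin (queue).
        consumer = next(self._next_consumer)
        self.received[consumer].append(message)


class ToyTopic:
    """Hypothetical topic that fans each message out to its subscriptions."""

    def __init__(self):
        self.subscriptions = []

    def subscribe(self, name, mode, consumers):
        sub = ToySubscription(name, mode, consumers)
        self.subscriptions.append(sub)
        return sub

    def publish(self, message):
        # Every subscription gets its own copy of each published message.
        for sub in self.subscriptions:
            sub.deliver(message)


topic = ToyTopic()
analytics = topic.subscribe("analytics", "exclusive", ["aggregator"])
workers = topic.subscribe("workers", "shared", ["worker-1", "worker-2"])

for i in range(6):
    topic.publish(f"event-{i}")
```

In this model, the `analytics` subscription reads the topic as an ordered stream, while the `workers` subscription treats the same messages as a work queue split between two consumers. Both read from one topic, so no second system is needed.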

One of the challenges my team at Instructure faced was meeting the need for both scalable messaging and stream processing at the same time. While looking for a solution, we discovered Pulsar.

Instructure's large-scale business scenarios require a highly scalable messaging system. At first, we tried to re-architect around streaming technology. By chance, we found that Apache Pulsar could give teams the messaging capabilities they needed without the complexity of a stream-based re-architecture.

When the Instructure team started using Pulsar, the benefits were immediate, and large-scale production deployments soon followed. Since Instructure adopted Pulsar, communication between applications has become more convenient, and we can also share data between teams.

Pulsar does more than handle messaging workloads well: the stream processing it supports meets a real need that exists within most enterprises. Pulsar provides a system that is easier to use, operate, and integrate than other streaming systems.

For example, Pulsar is highly scalable. Users do not need to "rebalance" the cluster when it has to grow. It supports multi-tenancy and millions of topics without significantly increasing latency, making it easy for many teams to share a cluster.
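Multi-tenancy shows up directly in how Pulsar names topics: a fully qualified topic name has the form `persistent://tenant/namespace/topic` (or `non-persistent://...`). The sketch below parses such names and groups them by tenant. It handles only this three-part form, and the helper names (`parse_topic_name`, `topics_by_tenant`) and the example tenants are hypothetical, not part of any Pulsar library.

```python
from collections import defaultdict
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic_name(full_name: str) -> TopicName:
    """Split a fully qualified topic name of the form
    domain://tenant/namespace/topic into its parts."""
    domain, sep, rest = full_name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"not a fully qualified topic name: {full_name!r}")
    parts = rest.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {full_name!r}")
    return TopicName(domain, *parts)


def topics_by_tenant(full_names):
    """Group topic names by tenant, as teams sharing one cluster might."""
    grouped = defaultdict(list)
    for name in full_names:
        parsed = parse_topic_name(name)
        grouped[parsed.tenant].append(f"{parsed.namespace}/{parsed.topic}")
    return dict(grouped)


# Hypothetical tenants and topics for two teams sharing one cluster.
names = [
    "persistent://billing/invoices/created",
    "persistent://billing/invoices/paid",
    "persistent://search/indexing/page-updates",
]
grouped = topics_by_tenant(names)
```

Because the tenant and namespace are part of every topic name, two teams can use the same short topic name without colliding.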

This means that companies no longer need to put a lot of effort into developing their own tools. They can thus focus on extracting value from messages and data rather than wasting too much time managing infrastructure.

For Iterable, a well-known cross-channel marketing platform, Pulsar provides the scalability and high availability needed to replace RabbitMQ and, eventually, the other messaging systems in use at Iterable, including Kafka and Amazon SQS. As Greg Methvin of Iterable has pointed out, Apache Pulsar is well suited not only to stream processing scenarios, but also to message queuing needs.

Apache Pulsar Advantages

Those who have already adopted Apache Pulsar may have found that it provides more scalable message queuing than comparable systems such as RabbitMQ or ActiveMQ, as well as more scalable and easier-to-use streaming capabilities, with built-in features including geo-replication and multi-tenancy.

The case for Apache Pulsar also comes down to simple math: a platform that unifies streaming and message queuing means at least one fewer technology to operate than running separate streaming and queuing systems side by side. With just one technology, users can develop products and bring them to market more easily, and use existing data more efficiently and with higher quality.

Beyond the added IT cost and expense of operating two separate technologies, separate streaming and queuing systems do not integrate well, which leads to data silos. With a unified system, you can handle more workloads and connect data and applications, including data queries, data analysis, and stream processing, on the same underlying system.

When using Apache Pulsar, there is no need to export data to another system or team, because the enterprise covers the entire data life cycle with the same platform technology and the same storage service, and users no longer need to handle the different stages of the data life cycle by hand. With Apache Pulsar, the architecture is greatly simplified, and data flow and data governance become easier.

As Greg Methvin, senior engineer at Iterable, sums it up: "Pulsar is unique in that it not only supports streaming and queue scenarios, but also a wider range of capabilities. It is already the best alternative to the other distributed messaging systems in the architecture we are currently using. Plus, Pulsar covers all of our use cases for Kafka, RabbitMQ, and SQS, and we can focus on building expertise and tooling around a single unified system."

With Apache Pulsar, you can have both.
