A brief history of Apache Pulsar milestones: building a unified messaging and streaming platform and ecosystem

About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight function-based computing. Its architecture separates computing from storage, and it supports multi-tenancy, persistent storage, and multi-datacenter, cross-region data replication. As a streaming data store, it offers strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/

Reprint and author information: This article is translated from StreamNative blog, original authors Matteo Merli, Guo Sijie, translator Jipei@StreamNative. Original address: https://streamnative.io/en/blog/release/2021-06-15-apache-pulsar-launches-2-8-unified-messaging-and-streaming-with-transactions

Since its birth and its graduation to a top-level project of the Apache Software Foundation, Apache Pulsar has grown into a mature project with a community of tens of thousands of people around the world. This year, with the close of the Pulsar Summit North America and the release of Pulsar 2.8.0, Pulsar reached a new milestone in both the project and the community. This is a good opportunity to review Pulsar's technical and ecosystem development.

The birth of Apache Pulsar

Apache Pulsar is widely adopted by hundreds of companies around the world, including Splunk, Tencent, Verizon, and Yahoo! JAPAN. It started as a cloud-native distributed messaging system and has grown into a complete messaging and streaming platform for publish-subscribe messaging and for real-time storage and processing of large-scale data streams.

Pulsar was born in 2012. Its original purpose was to consolidate the other messaging systems within Yahoo into one unified messaging platform that supports large clusters and cross-region deployment. At that time, other messaging systems (including Kafka) could not meet Yahoo's needs, such as large multi-tenant clusters, stable and reliable IO quality of service, millions of topics, and cross-region replication, so Pulsar came into being.

At that time, there were usually two types of systems for processing dynamic data: message queues for real-time handling of key business events, and streaming systems for scalable, large-scale data pipelines. Many companies had to limit themselves to one of the two or adopt several different technologies. Adopting several technologies usually leads to data silos: one silo of message queues for building application services, and another of streaming systems for building data services. The resulting infrastructure is usually extremely complex. The following diagram illustrates the architecture.

However, as companies need to process more types of data, including operational data (such as log data and click events), and as more downstream systems need access to combined business and operational data, the system needs to support both message queue and streaming scenarios.

In addition, companies need an infrastructure platform on which they can build all their applications and have those applications handle dynamic data (messages and streaming data) by default. This significantly simplifies the real-time data infrastructure, as shown in the figure below.

With this vision, the Yahoo! team began building a unified messaging and streaming platform for dynamic data. The following are the key milestones since Pulsar's inception.

Milestone 1: Scalable storage for data streams

The birth of Pulsar began with Apache BookKeeper. Apache BookKeeper implements a log-like abstraction for continuous streams and makes it possible to run that abstraction at internet scale behind a simple write and read log API. Logs are a good abstraction for building distributed systems, such as distributed databases and publish-subscribe messaging. The write API appends entries to the log. The read API reads continuously from a starting offset chosen by the reader. BookKeeper's implementation laid the foundation for a scalable, log-based messaging and streaming system.
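The append and read-from-offset API described above can be sketched with a toy in-memory log. This is only an illustration of the abstraction, not BookKeeper's API: the class `AppendOnlyLog` and its methods are hypothetical names, and nothing here is distributed or durable.

```python
from dataclasses import dataclass, field


@dataclass
class AppendOnlyLog:
    """Toy in-memory log (hypothetical): entries are appended, never modified."""
    entries: list = field(default_factory=list)

    def append(self, payload: bytes) -> int:
        """Write API: append one entry and return the offset it was stored at."""
        self.entries.append(payload)
        return len(self.entries) - 1

    def read_from(self, start_offset: int):
        """Read API: yield (offset, payload) pairs in order from start_offset."""
        for offset in range(start_offset, len(self.entries)):
            yield offset, self.entries[offset]


log = AppendOnlyLog()
for event in (b"created", b"paid", b"shipped"):
    log.append(event)

# A reader picks its own starting offset and reads forward from there.
tail = list(log.read_from(1))
```

Because writers only append and readers only move forward from an offset they choose, many independent readers can share one log without coordinating with each other.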

Milestone 2: A multi-layered architecture with separation of storage and computing

A stateless serving layer was introduced on top of the scalable log storage: stateless brokers handle publishing and consuming messages. This multi-layered architecture separates serving/computing from storage, allowing Pulsar to manage the two in different layers.

This architecture guarantees instant scalability and higher availability, making Pulsar ideal for building mission-critical services, such as billing platforms in financial scenarios, transaction processing systems for e-commerce and retailers, and real-time risk control systems for financial institutions.

Milestone 3: Unified messaging model and API

In modern data architecture, real-time scenarios can generally be divided into two categories: queues and streams. Queues are usually used to build core business application services, and streams are usually used to build real-time data services, such as data pipelines.

To provide one platform that serves both application services and data services, a unified messaging model that combines queue and stream semantics is needed. The Pulsar topic becomes the source of truth for consumption. Messages are stored only once on a topic, but they can be consumed in different ways through different subscriptions. This unification greatly reduces the complexity of managing and developing messaging and streaming applications.
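The idea of storing each message once and consuming it in different ways can be sketched with a toy model. The `Topic` class below is hypothetical and is not the Pulsar client API; Pulsar's real subscription types (for example exclusive and shared) have richer delivery and acknowledgment rules than the round-robin used here.

```python
from collections import defaultdict


class Topic:
    """Toy topic (hypothetical): messages stored once, subscriptions hold cursors."""

    def __init__(self):
        self.messages = []
        self.cursors = {}  # subscription name -> index of next message to deliver

    def publish(self, message):
        self.messages.append(message)

    def subscribe(self, name):
        self.cursors.setdefault(name, 0)

    def drain(self, name, consumers):
        """Deliver all pending messages of one subscription.

        One consumer behaves like a stream reader (everything, in order);
        several consumers behave like a work queue (round-robin here).
        """
        delivered = defaultdict(list)
        start = self.cursors[name]
        for i, message in enumerate(self.messages[start:]):
            delivered[consumers[i % len(consumers)]].append(message)
        self.cursors[name] = len(self.messages)
        return dict(delivered)


topic = Topic()
topic.subscribe("analytics")  # stream-style subscription
topic.subscribe("workers")    # queue-style subscription
for n in range(4):
    topic.publish(f"order-{n}")

stream_view = topic.drain("analytics", ["reader"])
queue_view = topic.drain("workers", ["w1", "w2"])
```

Both subscriptions read the same four stored messages; only the cursor and the delivery pattern differ per subscription.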

Milestone 4: Schema API

Next, a new Pulsar schema registry and a new type-safe producer and consumer API were added to Pulsar. The built-in schema registry lets message producers and consumers on a Pulsar topic coordinate the structure of the topic's data through the Pulsar broker itself, without an external coordination mechanism. With schema information, every piece of data transmitted through Pulsar is fully discoverable, and users can build systems that adapt easily to data changes.

In addition, the schema registry tracks data compatibility between schema versions. When a new schema is uploaded, the registry checks that old consumers can still read the new schema version, so that producers cannot break consumers.
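A compatibility check of the kind described above can be sketched as follows. This is a simplified assumption about what such a check looks like, not Pulsar's actual implementation: the function `old_reader_can_read` and the dictionary schema format (field name mapped to a type name and a has-default flag) are invented for illustration.

```python
def old_reader_can_read(reader_schema: dict, writer_schema: dict) -> bool:
    """Return True if data written with writer_schema is readable via reader_schema.

    Each schema maps field name -> (type name, has_default). Hypothetical format.
    """
    for name, (type_name, has_default) in reader_schema.items():
        if name in writer_schema:
            if writer_schema[name][0] != type_name:
                return False  # same field, incompatible type
        elif not has_default:
            return False  # reader needs a field the writer no longer sends
    return True


v1 = {"id": ("string", False), "amount": ("double", False)}
v2_adds_field = {**v1, "currency": ("string", True)}
v2_drops_field = {"id": ("string", False)}

# Adding an optional field keeps old consumers working; dropping a required one does not.
accepted = old_reader_can_read(v1, v2_adds_field)
rejected = old_reader_can_read(v1, v2_drops_field)
```

A registry that runs a check like this at upload time can refuse the second schema before any producer starts writing data that existing consumers cannot decode.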

Milestone 5: Functions and IO API

The next step was to build APIs that make it easy to get data into and out of Pulsar and to process it there. The goal is to make it easy to build event-driven applications and real-time data pipelines with Apache Pulsar, so that users can process events as they arrive, no matter where they come from.

The Pulsar IO API allows users to build real-time streaming data pipelines by plugging in connectors: source connectors bring data from external systems into Pulsar, and sink connectors send data from Pulsar to external systems. Currently, Pulsar provides multiple built-in connectors that users can use out of the box.

In addition, StreamNative maintains StreamNative Hub (a registry of Pulsar connectors), which provides dozens of connectors integrated with popular data systems. While the IO API is for building streaming data pipelines, the Functions API is for building event-driven applications and real-time stream processors.

The concept of serverless functions was adopted for stream processing, and the Functions API was built as a lightweight serverless library. Users can write any event-processing logic in the language of their choice. Engineering teams can write stream processing logic without running and maintaining another cluster.
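The division of labor behind this model can be sketched in a few lines: the author writes only the per-event logic, and a runtime feeds it events and publishes the results. The runner below is a hypothetical stand-in that uses plain lists as topics; it is not the Pulsar Functions runtime or SDK.

```python
def word_lengths(message: str) -> str:
    """Per-event logic: the only code the function author writes."""
    return ",".join(str(len(word)) for word in message.split())


def run_function(fn, input_topic: list, output_topic: list) -> None:
    """Stand-in for the runtime: apply fn to each input event, publish the result."""
    for message in input_topic:
        result = fn(message)
        if result is not None:  # in this sketch, None means "emit nothing"
            output_topic.append(result)


sentences = ["hello pulsar", "unified messaging and streaming"]
lengths = []
run_function(word_lengths, sentences, lengths)
```

Since the function itself knows nothing about topics, brokers, or scaling, the same logic could be deployed, moved, or scaled by the platform without code changes.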

Milestone 6: Provide Pulsar with unlimited storage through tiered storage

As Apache Pulsar grew in popularity and the amount of data stored in Pulsar increased, users eventually ran into a "retention cliff": the point at which storing, managing, and retrieving data in Apache BookKeeper becomes expensive. To work around this, operations engineers and application developers usually used external storage such as AWS S3 as a sink for long-term storage. However, this gives up most of the advantages of Pulsar's immutable stream and ordering semantics, and users end up managing two systems with different access patterns.

Tiered storage allows Pulsar to offload most of its data to remote cloud-native storage. This cheaper form of storage scales easily with the amount of data. More importantly, tiered storage gives Pulsar the batch storage capacity required to integrate with unified batch and stream processors such as Flink. This batch and stream integration enables companies to quickly and easily query real-time streams together with historical context, increasing their competitive advantage.
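The offloading idea can be sketched with a toy segmented log. Everything here is an assumption made for illustration: the `TieredLog` class, the fixed segment size, and the "keep N sealed segments hot" policy are hypothetical, and two dictionaries stand in for BookKeeper and cloud object storage.

```python
class TieredLog:
    """Toy log (hypothetical) split into fixed-size segments across two tiers."""

    def __init__(self, segment_size: int, hot_segments: int):
        self.segment_size = segment_size
        self.hot_segments = hot_segments  # how many sealed segments stay hot
        self.hot = {}   # segment index -> entries (stands in for BookKeeper)
        self.cold = {}  # segment index -> entries (stands in for object storage)
        self.count = 0

    def append(self, entry) -> None:
        index = self.count // self.segment_size
        self.hot.setdefault(index, []).append(entry)
        self.count += 1
        self._offload(current=index)

    def _offload(self, current: int) -> None:
        """Move the oldest sealed segments beyond the hot budget to the cold tier."""
        sealed = sorted(i for i in self.hot if i < current)
        for i in sealed[: max(0, len(sealed) - self.hot_segments)]:
            self.cold[i] = self.hot.pop(i)

    def read_all(self) -> list:
        """Readers see one ordered stream regardless of which tier holds a segment."""
        out = []
        for i in sorted(set(self.cold) | set(self.hot)):
            out.extend(self.cold[i] if i in self.cold else self.hot[i])
        return out


tiered = TieredLog(segment_size=2, hot_segments=1)
for n in range(7):
    tiered.append(n)
```

After seven appends, the two oldest segments sit in the cold tier, yet `read_all()` still returns the entries as a single ordered stream, which is the property that separates tiered storage from exporting data to a separate sink.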

Milestone 7: Plug-in protocols

After the introduction of tiered storage, Pulsar evolved from a pub/sub messaging system into a scalable streaming data system that can ingest, store, and process data streams. However, existing applications written against other messaging protocols (such as Kafka, AMQP, and MQTT) had to be rewritten to adopt Pulsar's messaging protocol.

The plug-in protocol API further reduces the overhead of adopting Pulsar for messaging and streaming. Developers can extend Pulsar to other messaging domains while keeping all the advantages of the Pulsar architecture. To this end, StreamNative cooperated with other industry leaders to develop popular plug-in protocols, including:

Kafka-on-Pulsar (KoP), open sourced by OVHcloud and StreamNative in March 2020;
AMQP-on-Pulsar (AoP), open sourced by China Mobile and StreamNative in June 2020;
MQTT-on-Pulsar (MoP), open sourced by StreamNative in August 2020;
RocketMQ-on-Pulsar (RoP), open sourced by Tencent Cloud and StreamNative in May 2021.

Milestone 8: Transaction API for exactly-once stream processing

Recently, transactions were added to Apache Pulsar to enable exactly-once semantics for stream processing. This fundamental feature provides a strong guarantee for streaming data transformations, making it easy to build scalable, fault-tolerant, stateful messaging and streaming applications that process streaming data.

In addition, the transaction API is not limited to particular client languages. Pulsar's support for transactional messaging and streaming is a protocol-level feature that can be exposed in any language, so it can be used across a wide range of applications.
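The exactly-once guarantee rests on making the output of a processing step and the acknowledgment of its input visible together, or not at all. The sketch below illustrates that idea with a toy `Transaction` class; the class, its methods, and the list and set standing in for an output topic and a subscription's ack state are all hypothetical, not Pulsar's transaction API.

```python
class Transaction:
    """Toy transaction (hypothetical): buffers writes and acks, applies all or none."""

    def __init__(self, output_topic: list, acked: set):
        self._output = output_topic
        self._acked = acked
        self._pending_messages = []
        self._pending_acks = set()

    def produce(self, message) -> None:
        self._pending_messages.append(message)

    def ack(self, message_id: int) -> None:
        self._pending_acks.add(message_id)

    def commit(self) -> None:
        # Output and acknowledgments become visible in the same step.
        self._output.extend(self._pending_messages)
        self._acked.update(self._pending_acks)

    def abort(self) -> None:
        self._pending_messages.clear()
        self._pending_acks.clear()


output, acked = [], set()

txn = Transaction(output, acked)
txn.produce("total=10")
txn.ack(0)
txn.abort()  # simulated failure: no output, input 0 stays unacknowledged
after_abort = (list(output), set(acked))

retry = Transaction(output, acked)
retry.produce("total=10")
retry.ack(0)
retry.commit()
```

Because the aborted attempt left neither a result nor an acknowledgment behind, the retry produces the result exactly once instead of duplicating it or losing the input.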

Building a unified messaging and streaming ecosystem

Alongside the continuous advancement of Pulsar's technology, the community is committed to building a strong surrounding ecosystem. Pulsar supports a rich set of pub-sub client libraries, connectors, functions, plug-in protocols, and integrations with popular engines, enabling Pulsar users to simplify their workflows and apply Pulsar to new scenarios.

