About Apache Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight function-based computing. Its architecture separates computing from storage and supports multi-tenancy, persistent storage, and multi-datacenter, cross-region data replication, with the strong consistency, high throughput, low latency, and high scalability expected of streaming data storage.
GitHub address: http://github.com/apache/pulsar/

When something new appears, people tend to look at it in one of two ways: they either play it down or play it up.

For Apache Pulsar, the view that plays it down usually goes "Apache Pulsar is just a new message queue", or "Apache Pulsar is just a new data pipeline", or "Queuing systems have existed for a long time. Apache Pulsar is simply more scalable and solves problems in a few scenarios; there is no essential difference."

So, is this really the case?

With these questions in mind, InfoQ reporter Zhang Hao sat down at QCon (https://qcon.infoq.cn/2021/beijing/) with Zhai Jia, StreamNative co-founder and Apache Pulsar and Apache BookKeeper PMC member, to talk about the history, characteristics, and future trends of Apache Pulsar.

The interview was recorded on video. For the convenience of readers, the text of the interview follows.

InfoQ: Hello, Mr. Zhai. Welcome to QCon, and thank you for joining us for this interview. Could you introduce yourself first?

Zhai Jia: Hello. My name is Zhai Jia, and I work at StreamNative. I mainly work on the design and development of Apache Pulsar and Apache BookKeeper, and I am a member of the project management committee (PMC) of both open source projects. Before that, I worked on the design and development of a distributed file system at EMC.

InfoQ: Pulsar's official website defines it as a cloud-native distributed messaging and streaming platform. Chinese media coverage describes Pulsar in all sorts of ways, and it is sometimes interpreted as a next-generation Kafka. Many companies use Kafka as their messaging system, so could you talk about how open source messaging systems have evolved over the years?

Zhai Jia: The messaging system you just mentioned was indeed the original purpose of Pulsar. In 2011 and 2012, the problem Yahoo needed to solve internally was unifying its messaging systems. The popular ActiveMQ and RabbitMQ were both in use inside Yahoo at the time, so one scenario that gave birth to Pulsar was this internal unification: replacing messaging systems such as ActiveMQ and RabbitMQ. That is the MQ requirement you just mentioned.

You also mentioned that Pulsar is cloud native, which has to do with the environment at Yahoo when it was born. Yahoo was a major creator of the Hadoop ecosystem at that time, and the underlying storage layer, the BookKeeper project, came first. We call Pulsar cloud native because it uses an architecture that separates storage from computing, and its storage layer is BookKeeper. Inside Yahoo, BookKeeper was mainly used to provide a layer of high availability (HA) for the Hadoop ecosystem, that is, for HDFS, the NameNode, and so on. You can think of it as storing metadata, so it offers strong consistency. It also happens to be a log abstraction that works in append-only mode, which closely matches the MQ scenario, where new messages are constantly appended, so the model carries over naturally to messaging. Pulsar's choice of the cloud-native route is therefore tied to its technical background: first there was a very good storage system that could provide good quality of service, and then came an architecture that separates storage from computing, namely Pulsar.

Thanks to the architecture that separates storage from computing and the Write-Ahead-Log (WAL) abstraction of the storage layer, such a system can support many critical MQ business scenarios, which cannot lose messages and demand very high consistency, while also providing relatively high bandwidth and relatively low latency. This is why many users also use it in scenarios that originally belonged to Kafka. The initial direction may have been MQ, but because of the architecture and the characteristics of its storage layer, it is also used in some Kafka scenarios. That is its relationship with Kafka.
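
To make the idea in this answer concrete, here is a minimal sketch of an append-only log shared by a separate serving layer. It is a toy model for illustration only, not Pulsar's or BookKeeper's actual API: the names AppendOnlyLog, Broker, publish, and fetch are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AppendOnlyLog:
    """Toy stand-in for a storage-layer log: entries are only ever appended."""
    entries: list = field(default_factory=list)

    def append(self, payload: bytes) -> int:
        """Add an entry at the tail and return its position (entry id)."""
        self.entries.append(payload)
        return len(self.entries) - 1

    def read(self, start: int, max_entries: int) -> list:
        """Read entries in the order in which they were written."""
        return self.entries[start:start + max_entries]


class Broker:
    """Toy serving layer: holds no message data, only a reference to the log."""

    def __init__(self, log: AppendOnlyLog):
        self.log = log

    def publish(self, message: str) -> int:
        return self.log.append(message.encode("utf-8"))

    def fetch(self, start: int, max_entries: int = 10) -> list:
        return [e.decode("utf-8") for e in self.log.read(start, max_entries)]


log = AppendOnlyLog()
broker_a = Broker(log)
ids = [broker_a.publish(m) for m in ("order-1", "order-2", "order-3")]

# A different broker can serve the same data because the log lives elsewhere.
broker_b = Broker(log)
replayed = broker_b.fetch(start=1)
```

In this sketch the broker objects keep no messages of their own, which is the point of separating the serving layer from the storage layer: any broker with access to the log can serve reads from any position.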

InfoQ: You just mentioned that Pulsar was born at Yahoo. Could you describe the background of the project in more detail? How has Pulsar developed over the years?

Zhai Jia: As for the birth of Pulsar, we started the project in 2012. For comparison, Kafka was born in 2010 and RocketMQ was born in China in 2011, so Pulsar followed a year or two later.

Choosing a cloud-native architecture that separates storage from computing was very challenging in 2012, because the network card and the disk of a single node at that time may not have been suited to a cloud-native architecture. So before 2012, from the second half of 2010 through 2011, Yahoo built a prototype of Pulsar called Hedwig. Its main purpose was to test whether an architecture that separates storage from computing could serve messaging scenarios. In that sense Pulsar's origins go back even earlier: the Hedwig prototype came first, and then Pulsar was established. Before it was open sourced, Pulsar was codenamed CMS (Cloud Message Service) inside Yahoo, so its original purpose was to be a cloud messaging service.

With the early prototype as a foundation and the cloud-native route decided, 2013 was mainly devoted to feature development. In 2014, Pulsar was deployed at large scale inside Yahoo, where it was tested by production workloads and ran stably. After a year or two of that, it was open sourced in 2016, donated to the Apache Software Foundation in 2017, and graduated as a top-level project in 2018.

After the donation to the Apache Software Foundation, the project entered a period of rapid development. Many Internet companies in China and abroad began to use Pulsar, because it did solve the problems of MQ scenarios and also solved some problems of Kafka scenarios.

Users gained even more confidence in the project after our commercial company, StreamNative, was established in 2019. For example, one very critical business scenario that Tencent took into production early on is the Tencent billing platform, which runs all of its billing services on Pulsar. Abroad, Splunk and Yahoo also use Pulsar in many scenarios. In this way, Pulsar's ecosystem keeps growing along with our communities in China and abroad.

Before Pulsar became a top-level project in 2018, its number of GitHub stars grew well but not dramatically. After it became a top-level project at the end of 2018 and the company (StreamNative) was established in 2019, the number of stars grew even faster. I think the open source environment in China has been very helpful to building our community and to the early growth of the project.

InfoQ: I have seen many people online comparing Kafka and Pulsar, and we touched on that comparison just now. What do you think of these two technologies?

Zhai Jia: As we just mentioned, the Pulsar project started out to meet Yahoo's internal needs. It needed large clusters and multi-tenancy so that data from Yahoo's various business departments could be shared across them. Its original purpose was therefore not to do what Kafka does; the two projects simply come from the same era.

Moreover, Kafka was not as popular in 2012 as it is now. Pulsar started from its own needs, that is, solving Yahoo's internal problems, and the product grew out of that. It later chose the open source route so that, with the community's help, the product could become more stable and gain more features. So it was not designed specifically against Kafka. It was designed to solve its own problems, with richer application scenarios and some of Yahoo's own business needs in mind, which is why it can indeed support some of the business requirements that Kafka addresses. After it was open sourced, it naturally solved some pain points of Kafka users and was chosen by some of them, but that was a natural process, not something aimed at Kafka.

InfoQ: Guo Sijie has said that messaging, streaming, and storage will gradually merge. How do you understand this point of view?

Zhai Jia: You mentioned three directions: messaging, streaming, and storage. I find that messaging and streaming are two closely related systems with very similar patterns. Messages need a carrier that stores and serves them. Messages also arrive in a fixed order determined by time, user data, and so on, so that the system can guarantee ordering and consistency to the outside world.

The same is true for streaming. All data flows in continuously and is then consumed in chronological order. From this point of view, the two are conceptually very close. We split messaging and streaming into two models only because of technical architecture. In many critical business scenarios inside enterprises, various MQs are used to provide messaging services, while in data pipeline scenarios Kafka is used to provide streaming services. In essence they are the same; the separation simply emerged over time.

One advantage of Pulsar is its very good underlying storage layer, BookKeeper, which ensures strong consistency and, through its append-only Write-Ahead-Log (WAL) abstraction, also delivers relatively high throughput. It can technically satisfy the different needs of both scenarios, so from this perspective Pulsar integrates messaging and streaming well.

As for storage, message storage used to be treated a bit too simply. Many users thought of messages only as a way to decouple systems. In MQ scenarios, many messages may be consumed directly from memory, and few scenarios required messages to be persisted to disk, so implementations at that time were very simple. Everyone, Kafka included, put messages directly into the file system, in whatever way was simplest to implement. But for many critical business scenarios, and for many scenarios where data needs to accumulate, relying on a file system alone makes service quality and stability hard to guarantee, because the file system was not designed for this purpose. In an MQ scenario, for example, you need to flush to disk in real time and guarantee the durability of the data, with enough replicas and strong enough consistency, and under those requirements performance may drop.

As users' data scenarios grow richer and their demands on the messaging system increase, the requirements placed on the underlying system change as well. Looking at it now, with the development of big data and the growth in user data volume, Pulsar is truly showing its value. It has a storage layer dedicated to messages, and with such a storage layer it combines well with the messaging and streaming services we just discussed.

Therefore, messaging, streaming, and storage converge on Pulsar in a single, unified architecture. That is our very intuitive impression when we look at these three directions from Pulsar's perspective.
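
The claim in this answer, that messaging and streaming are two ways of consuming the same ordered data, can be sketched with a toy model. The class names Topic, StreamReader, and SharedQueue are hypothetical and do not correspond to Pulsar's client API; the sketch only assumes a single ordered log with two styles of consumer on top of it.

```python
class Topic:
    """Toy topic backed by one ordered, append-only list of messages."""

    def __init__(self):
        self.log = []

    def publish(self, message):
        self.log.append(message)


class StreamReader:
    """Stream-style consumption: one reader walks the log in order."""

    def __init__(self, topic, start=0):
        self.topic = topic
        self.position = start

    def read_all(self):
        batch = self.topic.log[self.position:]
        self.position = len(self.topic.log)
        return batch


class SharedQueue:
    """Queue-style consumption: workers share messages and ack each one."""

    def __init__(self, topic):
        self.topic = topic
        self.next_index = 0      # next message to hand out
        self.acked = set()       # indexes that were processed successfully

    def receive(self):
        """Hand the next undelivered message to whichever worker asks."""
        if self.next_index >= len(self.topic.log):
            return None
        index = self.next_index
        self.next_index += 1
        return index, self.topic.log[index]

    def ack(self, index):
        self.acked.add(index)


topic = Topic()
for n in range(4):
    topic.publish(f"event-{n}")

# Stream view: everything, strictly in publish order.
stream = StreamReader(topic)
in_order = stream.read_all()

# Queue view: two workers each take one message from the same log.
queue = SharedQueue(topic)
worker_1 = queue.receive()
worker_2 = queue.receive()
queue.ack(worker_1[0])
```

Both consumers read the same stored log; only the delivery and acknowledgement rules differ, which is one way to picture why a single storage layer can serve both scenarios.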

InfoQ: If Pulsar dares to call itself a new generation of messaging system, it must be following certain trends. Besides the convergence of messaging, streaming, and storage we just discussed, what other trends have you seen?

Zhai Jia: We think the trend is the one we touched on in the first question, and it is mostly about cloud native, because cloud native is a major trend. As we mentioned, the architecture Pulsar chose was very advanced for its time: in 2012 it chose to separate storage from computing. That gives it the advantages of cloud native. In addition, the nodes in each layer of Pulsar are peers, which greatly helps online operation and maintenance. In the serving layer, all Brokers are peers, and in the storage layer, the BookKeeper nodes are peers as well. As a result, when you scale a large cluster out or in, you do not need to maintain a master/slave state for each node, so maintenance is simpler. At the same time, a very important goal in the cloud is elastic resource scheduling. Pulsar follows this trend too, which is consistent with its original codename, CMS (Cloud Message Service): it aims to be a messaging cloud in which the various cloud resources can be scheduled easily to serve the messaging system. So I think cloud native is a big advantage of Pulsar, and it is also a major trend that we follow.
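
The operational point about peer Brokers can be illustrated with a small sketch. It assumes a simple hash-based policy for deciding which broker serves a topic; that policy, the function names owner_for and assign, and the broker names are hypothetical and are not how Pulsar's own load balancing works. The sketch only shows that when brokers hold no data, scaling out changes topic ownership without touching stored messages.

```python
import hashlib


def owner_for(topic: str, brokers: list) -> str:
    """Pick an owning broker for a topic with a stable hash (toy policy)."""
    digest = hashlib.sha256(topic.encode("utf-8")).hexdigest()
    return sorted(brokers)[int(digest, 16) % len(brokers)]


def assign(topics: list, brokers: list) -> dict:
    """Map every topic to one of the interchangeable brokers."""
    return {t: owner_for(t, brokers) for t in topics}


# The stored messages live in a separate storage layer (a dict stands in here).
storage = {"billing": ["m1", "m2"], "clicks": ["m3"], "logs": ["m4", "m5"]}
topics = list(storage)

before = assign(topics, ["broker-1", "broker-2"])
after = assign(topics, ["broker-1", "broker-2", "broker-3"])

# Adding a broker may change who serves a topic, but no stored data is touched.
moved = [t for t in topics if before[t] != after[t]]
```

Because every broker is a peer, there is no master/slave role to hand over in this model: a topic in the moved list is simply served by a different broker from then on.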

InfoQ: Okay, thank you, Mr. Zhai.

Zhai Jia: Thank you.
