About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation and a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight functional computing. Its architecture separates computing from storage, supporting multi-tenancy, persistent storage, multi-datacenter deployment, and cross-region data replication, with strong consistency, high throughput, low latency, high scalability, and other characteristics of streaming data storage.
GitHub address: http://github.com/apache/pulsar/
Introduction: This article is an edited transcript of TGIP-CN 032, a live session by Wei Bin, solutions engineer at StreamNative. The session walked through setting up a Pulsar test environment and touring the surrounding tools and components, helping everyone get started with Apache Pulsar quickly.
Today I will show how to get started with Apache Pulsar quickly. For those who are new to Pulsar, I will introduce how to set up a Pulsar test environment and get familiar with Pulsar's peripheral tools and related components. I hope that by following the exercises in this sharing, you can quickly get a Pulsar cluster and its surrounding tools running and be ready for the next steps.
The content of this article is mainly divided into the following three parts:
- Introduction to Apache Pulsar
- How to quickly get started with Apache Pulsar
- Using Pulsar's monitoring and operations tools
Introduction to Apache Pulsar
A brief introduction to Apache Pulsar: it is a new-generation cloud-native distributed messaging and streaming platform, and there are several keywords in that description. As for cloud-native, I believe everyone has heard the term a lot; it can be simply understood as Kubernetes-oriented, meaning Pulsar is well suited to running in a Kubernetes container orchestration system. "Messaging and streaming platform" means that Apache Pulsar is a data platform that unifies message queuing and stream processing.
Pulsar is the newest of these systems. As you can see from the picture above, Pulsar was designed in 2012; it was born because the earlier projects did not meet its creators' needs at the time.
Positioning of Apache Pulsar
Pulsar's positioning covers both Streaming (stream-processing consumption mode) and Queuing (queue consumption mode), as shown in the figure above.
So what are the differences between them? Take RabbitMQ as a typical message queue: once a message comes in, it is consumed by only one consumer, that is, it is delivered only once; after consumption it is done and is not retained.
Correspondingly, in a Streaming system such as Kafka, a message can be read by multiple consumers: it comes in once but can be consumed many times. In addition, messages are persisted on the streaming platform, which means historical data can be consumed again later.
Furthermore, in the message-queue scenario there is no strict ordering requirement on consumption: messages do not have to be consumed in the order in which they arrive. On a streaming platform, however, we often have strict ordering requirements and must consume messages in the order they come in.
Different application scenarios of message queues and streams
Message queues are used mostly in asynchronous decoupling scenarios. To give a simple example: on an e-commerce platform, when a user places an order, you may want to send an email or text message as a reminder. Normally this task should not sit in the main processing flow, because sending emails and text messages is relatively slow logic; putting it in the ordering path would make it hard to scale the throughput of the ordering service.
In this case we often introduce a message queue to decouple the order-placing service from the email/SMS service. Stream processing, by contrast, mostly serves real-time computing scenarios in big data. These different application scenarios of message queues and stream processing have led to different products.
The emergence of Pulsar unifies the message queue and the streaming platform in a single product, so users no longer need to draw such a fine line. In practice, the boundary between message queues and stream processing is not that clear anyway: many users treat Kafka as a message queue, and by the definition above they are really using Kafka as an asynchronous decoupling platform.
The core difference between a message queue and stream processing is really the consumption model: whether a message is consumed once or multiple times, and whether ordering is required.
Pulsar supports both queue and stream consumption models, so it can unify the two scenarios. This is what the "unified cloud-native messaging and streaming platform" mentioned earlier refers to.
The figure above shows the subscription (consumption) semantics supported by Pulsar. Following the earlier definition of streaming, the Exclusive and Failover semantics can be understood as streaming-style consumption: messages are consumed in strict order.
The Shared and Key_Shared modes allow out-of-order consumption, with many consumers reading from the same topic. This is similar to the Queuing consumption mode mentioned above.
Since Pulsar unifies these two scenarios, in actual use there is no need to think strictly about whether a given case is a message-queue scenario or a streaming scenario; you can simply apply whichever consumption semantics your use case needs.
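As a quick illustration, the `pulsar-client` CLI shipped with Pulsar lets you pick the subscription type when consuming. A minimal sketch (the topic and subscription names here are made up for the example):

```bash
# Streaming-style: strict order, a single active consumer on the subscription
bin/pulsar-client consume my-topic -s "ordered-sub" -t Exclusive -n 0

# Queue-style: run this in several terminals; messages are spread
# across the consumers of the shared subscription
bin/pulsar-client consume my-topic -s "shared-sub" -t Shared -n 0
```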
Why does Pulsar unify these two platforms? It reduces the mental cost for users: they no longer need to choose among so many products, because one product solves both problems.
Problems to be solved by the birth of Apache Pulsar
Business needs and data scale
- Multi-tenancy
- Millions of topics
- Low latency
- Persistence
- Cross-region replication
Decoupling of storage and computing
- Operations pain points: replacing machines, scaling services out, rebalancing data
Reduced dependency on the file system
- Performance is hard to guarantee: persistence (fsync), consistency (ack: all), a large number of topics
- I/O is not isolated: consumers reading the backlog affect other producers and consumers
Apache Pulsar architecture
Pulsar separates storage from computing. At the top are Producers and Consumers, which produce and consume messages. Below them is the Broker computing layer, and at the bottom is the Bookie storage layer; you can see the many segments there, because the storage layer provides its service at segment granularity. Within each layer the nodes are peers, and the layers scale independently of each other, which makes it very convenient and flexible to expand the cluster or handle failures.
Apache Pulsar features
The figure above shows many of Pulsar's features, including some enterprise-grade ones. If you are interested, you can find out more on the official website, for example:
- Multi-tenancy
- Cross-region replication (geo-replication)
- High availability
- Unified consumption model
- ……
I won't go through them one by one here.
Apache Pulsar core components
The figure above shows the three core components of Pulsar:
- Broker (computing layer)
- BookKeeper / Bookie (storage layer)
- ZooKeeper (coordination component)
Broker
The Broker mainly handles protocol parsing for Producer–Consumer interaction. Pulsar has its own interaction protocol, so when Producers and Consumers talk to Pulsar, the traffic is processed according to the Pulsar protocol.
Similarly, when Kafka exposes its service, it exposes its own protocol: Kafka Producers and Consumers must follow Kafka's produce and consume protocol before they can interact with Kafka.
Because Pulsar separates the computing and storage layers, as long as the computing layer is made compatible with the Kafka protocol, Kafka Producers and Consumers can write data into Pulsar. Different protocol compatibility layers can all be implemented at the computing layer.
Besides Kafka, compatibility with AMQP, MQTT, RocketMQ, and other protocols can be implemented in the same way. Since the computing layer only parses and processes the protocol, it stores no data itself: after processing, all data is stored in the downstream BookKeeper (storage) layer, and when consumers consume, the Broker simply reads it back out.
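For example, Kafka compatibility is provided by the KoP (Kafka-on-Pulsar) protocol handler, which the broker loads through configuration. A rough sketch of the relevant settings (it assumes the KoP `.nar` plugin has already been downloaded into `./protocols`; exact settings depend on the KoP version, so check its documentation):

```bash
# Append protocol-handler settings to the broker configuration (sketch)
cat >> conf/broker.conf <<'EOF'
messagingProtocols=kafka
protocolHandlerDirectory=./protocols
# Port on which the broker will speak the Kafka protocol
kafkaListeners=PLAINTEXT://127.0.0.1:9092
EOF
```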
BookKeeper
Next let's look at the storage layer, which you can understand as a segment-oriented design. BookKeeper provides segment-oriented storage semantics, in contrast to partition semantics; the most common partition semantics are those of Kafka.
Partition
The biggest difference between a Partition and a Segment is magnitude: a Partition is usually large. Once a Kafka partition is bound to a storage node or disk, it stays bound to that disk as it grows with the data written to it. When we expand or migrate, the larger the Partition, the higher the cost of the operation.
Segment
What are the benefits of Segments? We can cap their size. BookKeeper, that is, Pulsar's storage layer, still has the concept of a Partition, but the Partition is broken up into Segments, and different Segments are bound to different underlying storage devices. Since a Segment's size is bounded, the cost of moving one is very small. This is the greatest benefit of the storage layer using segment semantics to provide read and write services.
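You can see this segment (ledger) structure for yourself: `pulsar-admin` can dump a topic's internal storage state, which lists the ledgers the partition has been broken into. A sketch (the topic name is made up):

```bash
# Show the internal storage state of a topic; the "ledgers" list in the
# output is the sequence of segments backing this topic
bin/pulsar-admin topics stats-internal persistent://public/default/my-topic
```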
ZooKeeper
ZooKeeper serves two functions (see the sketch below):
- Metadata management
- Service discovery
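If you are curious about what Pulsar keeps in ZooKeeper, the bundled zookeeper-shell lets you browse the metadata tree. A sketch (znode paths may vary across Pulsar versions):

```bash
# Open an interactive shell against the local ZooKeeper
bin/pulsar zookeeper-shell -server localhost:2181

# Then, inside the shell:
#   ls /admin/clusters       -> clusters registered in the metadata store
#   ls /loadbalance/brokers  -> live brokers (service discovery)
```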
Network flow of Apache Pulsar components
The figure above briefly describes the flow, so that everyone can understand how the components interact. Both the Broker and BookKeeper talk to ZooKeeper. The relationship between Broker and BookKeeper is that the Broker takes read and write requests from Producers or Consumers and writes the data to, or reads it from, BookKeeper; the Broker is effectively BookKeeper's client.
Ports of Apache Pulsar components
Let's look at the ports the Pulsar components expose; when starting the services, we need a clear picture of them. In the figure above, the Broker exposes a TCP port, 6650, and an HTTP port, 8080. Port 6650 is mainly used by clients: produce and consume traffic connects to the 6650 TCP port. Port 8080 exposes monitoring metrics compatible with the Prometheus protocol, and it also exposes the admin API: when you want to do operations and configuration management on Pulsar, or fetch metric data, you can use this port.
BookKeeper exposes TCP port 3181, which is mainly used by the Broker to connect to BookKeeper. It also has an HTTP port 8080, which mainly exposes Prometheus monitoring metrics for the Bookie.
ZooKeeper's 2181 is its usual external TCP service port, used by the Broker and BookKeeper to connect. Pulsar's ZooKeeper also offers an HTTP port 8080, which mainly exposes ZooKeeper metrics. All of these ports can be customized in the configuration; the numbers given here are the defaults.
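With a local standalone instance these defaults are easy to check. A quick sketch (standalone bundles all components into one process, so everything answers on localhost):

```bash
# Clients connect through the broker's TCP port:
#   pulsar://localhost:6650

# Prometheus-compatible metrics from the HTTP port
curl http://localhost:8080/metrics

# The same HTTP port also serves the admin REST API
curl http://localhost:8080/admin/v2/clusters
```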
Of course, every Pulsar component can be deployed in a distributed fashion. In production, at least two Brokers are generally recommended, so that if one goes down the traffic can be cut over to the other; at least three Bookies are recommended, so that if one fails there is no data loss, since the data is also stored on the other two nodes; three ZooKeeper nodes are generally enough. Because each component is distributed, each can be scaled in or out according to its own needs: if the storage layer is short on capacity, add Bookie nodes; if the computing layer is short, add Broker nodes; and if ZooKeeper is under heavy pressure, scale ZooKeeper as well.
Get started with Apache Pulsar
I hope the introduction above has given you a basic picture of Pulsar. Next, let's get hands-on. Pulsar has two download addresses: the official address and a project mirror (a domestic mirror source in China). The official address provides all Pulsar-related components, with the binary distribution being the main Pulsar package; at the getting-started stage you can skip the surrounding ecosystem components listed on the site. The client SDKs can be found under Clients, and Pulsar Manager is the UI dashboard tool we will use later. For users in China, downloading from the mirror source is faster.
Standalone mode
First of all we need a Java environment: install Oracle JDK or OpenJDK 1.8 on the machine. Standalone is the local development mode provided by Pulsar; it is recommended both at the getting-started stage and for local development. Standalone is very simple: just run a command after downloading. There are two ways, shown in the sketch below: `bin/pulsar standalone` runs it in the foreground, and `bin/pulsar-daemon standalone` runs it in the background. For details, please refer to the Standalone documentation. For a demo using Pulsar 2.7.1 Standalone, see 28:40–32:54 in the video.
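A minimal sketch of both ways to run it, plus the matching stop command for the background mode:

```bash
# Run in the foreground (Ctrl+C to stop)
bin/pulsar standalone

# Or run in the background; logs go to the logs/ directory
bin/pulsar-daemon standalone
bin/pulsar-daemon stop standalone   # stop the background service
```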
Examples of commonly used commands
Admin commands:
```bash
# Show the Pulsar clusters
bin/pulsar-admin clusters list
# List the Brokers of the cluster "test"
bin/pulsar-admin brokers list test
# List the topics in the public/default namespace
bin/pulsar-admin topics list public/default
```
Pulsar client commands:
```bash
# Produce a message
bin/pulsar-client produce my-topic --messages "hello-pulsar"
# Consume messages
bin/pulsar-client consume my-topic -s "first-subscription"
```
To learn more about the command-line tools provided by Pulsar, you can refer to the official client documentation and the pulsar-admin documentation.
Cluster mode
In addition to Standalone mode, many people want a cluster mode similar to the production environment when testing locally, that is, multiple Brokers and Bookies. One feasible way is to copy the data directories and configuration files several times locally and run them separately; you can refer to the deploy-bare-metal documentation. Alternatively, for a test-environment cluster it is recommended to run under Docker; refer to the Docker Compose documentation in the GitHub repository.
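Before moving on to the multi-node Docker Compose setup referenced above, you can sanity-check your Docker environment with the single-container standalone image. A sketch (the image tag is just an example; pick the version you need):

```bash
# Map the client port (6650) and HTTP/admin port (8080) to the host
docker run -it \
  -p 6650:6650 -p 8080:8080 \
  apachepulsar/pulsar:2.7.1 \
  bin/pulsar standalone
```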
Monitoring operation and maintenance tools
Once you have a runnable Pulsar, let's look at the surrounding operations tools: how to manage, operate, and observe the cluster efficiently, including the cluster's production and consumption counts, the amount of data per topic, and so on.
Operations tool: Pulsar Manager
Pulsar Manager is used to manage Pulsar clusters, and its structure is relatively simple: the backend connects to the Pulsar Brokers and Bookies, and there is also local persistent storage.
Pulsar Manager architecture diagram
It is recommended to turn on two settings: `pulsar.peek.message=true` in the application.properties configuration file, and `bookie.enable=true` in the bkvm.conf configuration file.
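For a quick local trial, Pulsar Manager is also published as a Docker image. A rough sketch (the tag and ports follow the project's README at the time of writing; check the current docs before relying on them):

```bash
# 9527 is the web UI port, 7750 the backend API port
docker run -it \
  -p 9527:9527 -p 7750:7750 \
  -e SPRING_CONFIGURATION_FILE=/pulsar-manager/pulsar-manager/application.properties \
  apachepulsar/pulsar-manager:v0.2.0
```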
Monitoring tools: Prometheus & Grafana
Pulsar's monitoring uses Prometheus and Grafana, the most common tools in cloud-native scenarios. Because Pulsar was designed with cloud native in mind from the beginning, the Broker, Bookie, and ZooKeeper all expose port 8080 directly, and hitting that port returns Prometheus-compatible metrics. You only need to add the Broker, Bookie, and ZooKeeper endpoints to Prometheus's scrape targets.
StreamNative maintains a repository (apache-pulsar-grafana-dashboard) containing a large number of ready-to-use dashboards.
Using Prometheus
First, configure the monitoring tool to scrape the relevant metrics. In a local environment you only need to scrape port 8080, while in a production environment the Brokers and other components are spread across their own hosts and ports, so each Broker and Bookie must be configured as a separate target.
Next, configure the metrics collection. The apache-pulsar-grafana-dashboard repository mentioned above has a Prometheus folder that provides templates you can refer to when writing your configuration.
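For a single local standalone process, a minimal scrape configuration might look like the sketch below (in production you would list each Broker and Bookie as a separate target, as the repository's templates do):

```bash
# Write a minimal prometheus.yml for a local standalone instance (sketch)
cat > prometheus.yml <<'EOF'
scrape_configs:
  - job_name: "pulsar"
    metrics_path: "/metrics"
    static_configs:
      - targets: ["localhost:8080"]
EOF
./prometheus --config.file=prometheus.yml
```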
Connect to Grafana
- The first step is to create a Data Source connecting to Prometheus. Note that if you run locally, you need to run the command `./scripts/general-dashboards.sh <prometheus-url> <clustername>`.
- Then open Grafana's Manage page, start the import, and upload the generated JSON files.
Finally, the session also covered the pulsar-perf stress test, which I won't repeat here; you can watch the replay video below for the relevant demo.
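For reference, the stress test in the demo uses the pulsar-perf tool bundled with Pulsar. A minimal sketch (the rate and topic name are arbitrary examples):

```bash
# Produce load: 1000 msg/s against a test topic
bin/pulsar-perf produce my-topic -r 1000

# Consume it back and report throughput/latency statistics
bin/pulsar-perf consume my-topic
```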
Video playback address: https://www.bilibili.com/video/BV1oK4y1A76N?from=search&seid=7311952598788771684
Getting started with Apache Pulsar
- TGIP-CN
- Bilibili
- Apache Pulsar practical article collection
- Official documentation
- Sample 1
- Sample 2
Guest introduction
Wei Bin (@rockybean), Solutions Engineer @StreamNative, Elastic Certified Engineer & Analyst, Alibaba Cloud MVP.