About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight functional computing. With an architecture that separates compute from storage, it supports multi-tenancy, persistent storage, and multi-datacenter and cross-region data replication, and offers strong consistency, high throughput, low latency, and high scalability for streaming data.
GitHub address: http://github.com/apache/pulsar/
Device Access Service (IoTDA) is the core service of the Huawei Cloud IoT platform, and it requires reliable messaging middleware. After comparing the capabilities and features of various messaging middleware, we chose Apache Pulsar as the messaging middleware for Huawei Cloud IoT, thanks to its multi-tenant design, compute-storage separation, and support for Key_Shared consumption. This article describes how Pulsar went into production at Huawei Cloud IoT, the problems we encountered along the way, and the corresponding solutions.
Introduction to Huawei Cloud Device Access Service
Device Access Service (IoTDA) provides capabilities such as connecting massive numbers of devices to the cloud, bidirectional messaging between devices and the cloud, data forwarding, batch device management, remote control and monitoring, OTA upgrades, and device linkage rules. The following figure shows the architecture of Huawei Cloud IoT. The upper layer consists of IoT applications, including the Internet of Vehicles, smart cities, and smart parks. The device layer connects to the IoT platform through direct-connection gateways and edge networks. The number of Huawei Cloud IoT connections currently exceeds 300 million, making it one of the most competitive IoT platforms in China.
Data forwarding means that after a user configures rules on the IoT platform, the platform triggers the corresponding rule actions whenever device behavior meets the rule conditions. These actions can connect to other Huawei Cloud services that provide storage, computation, and analysis of device data, such as DIS, Kafka, OBS, and InfluxDB, or connect to the customer's own systems through other communication protocols such as HTTP and AMQP. In these actions, the IoT platform mainly acts as a client or a server.
According to the user category, the usage scenarios can be divided into three categories:
- Larger customers generally choose to push to messaging middleware (such as Pulsar, Kafka) and build their own business systems on the cloud for consumption processing.
- Long-tail customers usually choose to push the data to their own database (such as MySQL) for processing, or receive the data for processing by their own HTTP server.
- Lightweight customers choose to receive data directly through a simple AMQP client connection.
Pain points of the original push module
The original push module was built on Apache Kafka. This mode of operation has several drawbacks: scaling out is complicated and places a burden on the development and operations teams. In addition, the original push module supported client-side and server-side push but did not support AMQP push. The architecture diagram is shown below: the consumer continuously pulls messages from Kafka and stores failed messages in a database to await retry. This approach causes several problems:
- Even if many customers' servers are unreachable, the consumer still has to pull their messages from Kafka (because Kafka has only one topic) and attempt to send them.
- Message retention time and size cannot be configured per user.
- Some customers' servers have limited capacity, and the rate at which messages are pushed to a single customer cannot be controlled.
Topic quantity
In May 2020, to enhance the competitiveness of our products, we planned to let customers receive forwarded data through the AMQP protocol. AMQP client access is more complicated: a customer may integrate the AMQP client on a mobile phone that only comes online for two hours a day. In this case we must ensure that the customer loses no data, so we need the messaging middleware to support at least as many topics as there are rules (some customers have so much data under a single rule that one topic cannot carry it, requiring even more topics). We currently have more than 30,000 rules; we expect to reach 50,000 soon, and the number will keep growing.
At the storage layer, each Kafka topic occupies file handles and shares the OS page cache, so Kafka cannot support a very large number of topics; one of our partners' Kafka clusters supports at most 1,800 topics. If we want as many queues as there are rules, we have to maintain multiple Kafka clusters. The following figure shows our Kafka-based solution.
A Kafka-based implementation would be very complicated. We would have to maintain the life cycle of multiple Kafka clusters as well as the mapping between tenants and Kafka clusters, and because Kafka does not support the Shared consumption model, two layers of relays would be required. Moreover, if the number of topics on a Kafka cluster reached its upper limit while the data volume kept growing, topics would need more partitions, and the original cluster could not be expanded without data migration. The overall scheme is very complex and poses great challenges for development and operations.
Why choose Pulsar
To solve the problems we had with the Kafka solution, we began to investigate the popular messaging middleware on the market and learned about Apache Pulsar. Apache Pulsar is a cloud-native distributed messaging and streaming platform that natively supports many excellent features; its Key_Shared mode and support for millions of topics are exactly the features we urgently need.
- Pulsar supports the Key_Shared subscription mode. Suppose a single Pulsar partition supports 3,000 QPS while a customer's AMQP client can only handle 300 QPS. The best solution is to use Pulsar's shared subscription so that multiple clients can connect, for example 10 clients consuming the data at the same time (see the sketch after this list). With the Failover mode we would instead have to expand to 10 partitions, wasting resources.
- Pulsar can scale to millions of topics. We can map each rule to one Pulsar topic, and when the AMQP client comes online it can resume reading from where it last consumed, ensuring that no messages are lost.
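As a rough illustration, a Key_Shared consumer with the Pulsar Java client looks like the following; the service URL, topic, and subscription name are placeholders, not our production values:

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class KeySharedConsumerExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder address
                .build();

        // Several consumers can attach to the same Key_Shared subscription;
        // messages with the same key are always dispatched to the same consumer.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://tenant-a/iot/rule-0001") // placeholder topic, one per rule
                .subscriptionName("amqp-push")                // placeholder subscription
                .subscriptionType(SubscriptionType.Key_Shared)
                .subscribe();

        Message<byte[]> msg = consumer.receive();
        // ... push the message to the customer's AMQP client here ...
        consumer.acknowledge(msg);

        consumer.close();
        client.close();
    }
}
```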
Pulsar has a cloud-oriented multi-tenant design, while Kafka leans more toward single-tenant, high-throughput, system-to-system integration. Pulsar is designed with Kubernetes-based deployment in mind, so overall deployment is easy; its compute and storage are separated, so scaling out is simple, topic interruption during expansion is short, and retries keep the business uninterrupted; and it supports shared subscription types, which is more flexible. We compared Pulsar and Kafka along different dimensions, with the following results:
Pulsar not only addresses the shortcomings of the Kafka solution; its guarantee against message loss fits our needs perfectly, so we decided to try Pulsar.
First version of the design
In the initial design, we wanted to use the Key_Shared consumption model for both the client type and the server type of push. The following figure shows the design for the client type (using HTTP as an example). Each time a customer configures a data forwarding rule, we create a topic in Pulsar; the consumer consumes the topic and pushes the messages to the customer's HTTP server through the NAT gateway.
The design of server-type push (using AMQP as an example) is shown in the figure below. If the AMQP client is not connected, starting a consumer to pull data is pointless because the next step cannot proceed. Therefore, when a client connects to a consumer microservice instance through the load-balancing component, that instance starts a consumer for the corresponding topic and begins consuming. Each AMQP connection corresponds to one consumer.
The throughput of a single topic partition in the Pulsar cluster is limited. When a single customer's rule data volume exceeds that throughput, for example when the topic's performance specification is around 3,000 QPS and the customer's estimated business volume is 5,000 QPS, we need to add partitions to the topic. To avoid restarting producers/consumers, we set the autoUpdatePartition parameter to true so that producers and consumers dynamically detect partition changes.
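In the Pulsar Java client, this setting corresponds to `autoUpdatePartitions` on the producer and consumer builders (newer clients also expose a refresh interval). A minimal sketch, assuming placeholder addresses and topic names:

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class AutoUpdatePartitionsExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder address
                .build();

        // Producer that periodically re-reads partition metadata, so partitions
        // added later are picked up without restarting the process.
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://tenant-a/iot/busy-rule") // placeholder topic
                .autoUpdatePartitions(true)
                .autoUpdatePartitionsInterval(60, TimeUnit.SECONDS) // refresh interval (default is one minute)
                .create();

        // Consumer configured the same way.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://tenant-a/iot/busy-rule")
                .subscriptionName("push-service") // placeholder subscription
                .subscriptionType(SubscriptionType.Key_Shared)
                .autoUpdatePartitions(true)
                .subscribe();

        producer.close();
        consumer.close();
        client.close();
    }
}
```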
Problems encountered in the testing of the first version of the design
When testing the first version of the design, we found several problems, mainly in the following three aspects:
- With the above design for client-type push, the microservice instances and consumers form a many-to-many mesh. Assuming 10,000 customer rules and 4 microservice instances, there would be 40,000 consumer-subscription relationships, and a single microservice instance would hold 10,000 consumers in memory at the same time. The size of each consumer's receive queue determines both throughput and memory consumption, but it is hard to configure (see the sketch after this list). If it is too large, then in abnormal scenarios where the consumer cannot deliver HTTP messages, a large number of messages pile up in the consumer and may exhaust its memory; for example, with 1,000 consumers and a sudden 5-minute network outage, all messages from those 5 minutes back up in the receive queues. If it is too small, communication between the consumer and the server becomes inefficient, which hurts system performance.
- During rolling upgrades of Pulsar or of the producer/consumer services, frequent requests for topic metadata put a lot of pressure on the cluster (the number of requests is the product of the number of instances and the number of topics).
- autoUpdatePartition has a significant impact on system resources. When autoUpdatePartition is enabled on every topic, with the default setting each topic sends a ZooKeeper request every minute.
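For reference, the receive queue discussed above is configured per consumer in the Pulsar Java client via `receiverQueueSize`; the value below is purely illustrative, not the setting we use in production:

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class ReceiverQueueSizeExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder address
                .build();

        // A smaller receive queue bounds the memory held by each consumer when the
        // downstream HTTP server is unreachable, at the cost of some throughput.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://tenant-a/iot/rule-0001") // placeholder topic
                .subscriptionName("http-push")                // placeholder subscription
                .subscriptionType(SubscriptionType.Key_Shared)
                .receiverQueueSize(100)                       // illustrative value
                .subscribe();

        consumer.close();
        client.close();
    }
}
```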
We reported this problem to the Pulsar community, and the StreamNative team gave us great support and help. They recommended grouping our customers and then setting the autoUpdatePartition parameter per group as needed. With the community's strong support, we decided to make the corresponding improvements and started planning the rollout.
Online plan
Our customers fall roughly into two types. The first is busy users, who push large volumes of data upstream during peak business hours; their characteristic is that a single partition may not meet their demand, and there are few of them. The second is users whose business is relatively stable and whose data volume is small to medium; their characteristic is that one partition is enough, and there are many of them.
We grouped users as suggested: the workloads pushing data for busy users are deployed separately, while users with medium business volume are deployed together. At present, SREs group customers manually in the configuration center according to their business volume; in the future, grouping will be automated based on real-time statistics. Grouping not only greatly reduces the number of topic-consumer combinations, but also reduces the number of metadata requests on restart. In addition, the client parameters of the two groups are not exactly the same: first, autoUpdatePartition is only enabled on the topics of busy users; second, the receive queue sizes of the two sets of workloads are different.
Deployment
We use containerized deployment for the two types of users: brokers are deployed as Kubernetes Deployments, while BookKeeper and ZooKeeper are deployed as StatefulSets. Deployment scenarios include cloud deployment and edge deployment, and the different scenarios have different reliability and performance requirements. The deployment parameters we set are as follows:
During deployment, we found:
- When the Ensemble Size (E) and Write Quorum Size (Qw) parameters are equal, write performance is best.
- The message volume in the cloud is large. With 2 replicas, the loss of one node shifts 100% extra load onto the remaining replica, so 100% capacity redundancy must be reserved; with 3 replicas, each remaining replica only takes on 50% extra load, so only 50% redundancy is needed.
Pulsar tuning solution
The above solution went live successfully more than half a year ago, and we have since tested a scenario with 50,000 topics and 100,000 messages per second in the test environment. During testing we hit some problems and applied tuning measures case by case; for details, see Pulsar 50 thousand topic tuning. This section focuses on latency, ports, and suggestions for improvement.
Reducing production/consumption latency
Using testing tools, we found that the overall end-to-end message latency was relatively high.
To make the problem easier to locate, we developed a single-topic debug feature: in a massive-message scenario, whether in the test or the production environment, it is not practical to enable global debug on the broker, so we added a configuration in the configuration center that prints detailed debug information only for topics on the configured list. With the help of this feature, we quickly found that the largest share of the latency occurred between the producer sending a message and the server receiving it; the likely cause was that the number of Netty threads was configured too low.
However, increasing the number of Netty threads did not completely solve the problem: a single JVM instance still hit a performance bottleneck. As mentioned above, after grouping by data volume, the small-user group still has to serve about 40,000 topics, and starting consumers on that order of magnitude leads to slow startup (and therefore long interruption during upgrades), insufficient memory for the receive queues, and complicated scheduling. We finally decided to hash the small-user group across instances so that each instance is responsible for about 10,000 consumers, which solved the problem of high production and consumption latency (a sketch of the idea follows).
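The sketch below shows the general idea of hashing topics across instances; the helper, instance indices, and topic names are hypothetical and only illustrate the assignment logic, not our actual implementation:

```java
import java.util.List;

public class TopicAssignmentSketch {
    // Hypothetical helper: an instance owns a topic if the topic name hashes
    // into this instance's bucket within the small-user group.
    static boolean ownsTopic(String topic, int instanceIndex, int instanceCount) {
        return Math.floorMod(topic.hashCode(), instanceCount) == instanceIndex;
    }

    public static void main(String[] args) {
        List<String> topics = List.of("rule-0001", "rule-0002", "rule-0003"); // illustrative topics
        int instanceIndex = 0; // index of this push-service instance (illustrative)
        int instanceCount = 4; // instances serving the small-user group (illustrative)

        for (String topic : topics) {
            if (ownsTopic(topic, instanceIndex, instanceCount)) {
                System.out.println("This instance starts a consumer for " + topic);
            }
        }
    }
}
```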
Using port 8080 to connect to the broker
We use port 8080 instead of port 6650 to connect to the broker for two main reasons:
- The logs are detailed: most requests sent to port 8080 are metadata requests, which helps troubleshooting and is easy to monitor. For example, Jetty's requestLog makes it easy to detect events such as topic creation failures and producer creation timeouts.
- Data requests and metadata requests can be isolated, so a busy port 6650 does not affect the creation and deletion of topics.
Compared with port 6650, port 8080 is less efficient. At the scale of 50,000 topics, upgrading a producer/consumer service or a broker requires creating a large number of producers/consumers, which generates a large number of requests to port 8080, such as partition-metadata and lookup requests. We solved this by increasing the number of Jetty threads in Pulsar.
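For illustration, the difference is simply the service URL given to the client: a `pulsar://...:6650` URL does lookups over the binary protocol, while an `http://...:8080` URL sends lookup and partition-metadata requests to the broker's web service (message data still flows over the binary protocol). The host names here are placeholders:

```java
import org.apache.pulsar.client.api.PulsarClient;

public class ServiceUrlExample {
    public static void main(String[] args) throws Exception {
        // Binary protocol service URL: lookups go through port 6650.
        PulsarClient binaryClient = PulsarClient.builder()
                .serviceUrl("pulsar://broker.example.com:6650") // placeholder host
                .build();

        // HTTP service URL: lookup and partition-metadata requests go through the
        // broker's web service on port 8080 and appear in Jetty's request log.
        PulsarClient httpClient = PulsarClient.builder()
                .serviceUrl("http://broker.example.com:8080") // placeholder host
                .build();

        binaryClient.close();
        httpClient.close();
    }
}
```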
Suggestions for Improvement
Pulsar still has room for improvement in operability and troubleshooting. We hope Pulsar can improve in the following aspects:
- Automatically adjust client parameters, such as the send queue size and receive queue size, to make working with tens of thousands of topics smoother.
- When an unrecoverable error occurs (such as a failed ZooKeeper operation), expose an API so that we can easily connect to the cloud alerting platform.
- Support tracking a single topic through logs at runtime, so that operations staff can combine tools such as Kibana with business logs to quickly locate and resolve problems.
- Sample and trace the key nodes of Pulsar's internal production, consumption, and storage paths, and export the data to an APM system (such as SkyWalking), making performance analysis and optimization easier.
- Consider the number of topics in the load balancing strategy.
- BookKeeper monitoring only covers the usage of the data disk, not the journal disk.
Concluding remarks
It took us three to four months from first learning about Pulsar to launching the design. Since going online, Pulsar has been running stably and performing well, helping us achieve our goals. Pulsar greatly simplifies the overall architecture of the Huawei Cloud IoT platform's data access service and supports our new business smoothly and with low latency, allowing us to focus on improving business competitiveness. Thanks to Pulsar's excellent performance, we also use it in our data analysis service, and we hope to use Pulsar Functions in our business to further enhance product competitiveness.
About the Author
Zhang Jian: Senior Engineer at Huawei Cloud IoT, focusing on cloud native, IoT, messaging middleware, and APM.