Case Study | Apache Pulsar Powers Kingsoft Cloud Log Service, Processing 200 TB of Data per Day

About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight function-based computing. Its architecture separates computing from storage and supports multi-tenancy, persistent storage, and data replication across data centers and regions. As a streaming data store, it offers strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/

This article was originally published on InfoQ: https://www.infoq.cn/article/m5NBipdR8bPdCJLu38lV

Kingsoft Cloud Log Service Introduction

Founded in 2012, Kingsoft Cloud is one of the top three Internet cloud service providers in China. It was listed on NASDAQ in the United States in May 2020, and its business covers many countries and regions around the world. In the eight years since its founding, Kingsoft Cloud has adhered to a customer-centric service philosophy, providing safe, reliable, stable, and high-quality cloud computing services.

Kingsoft Cloud operates green, energy-efficient data centers and operations organizations in domestic regions such as Beijing, Shanghai, Guangzhou, Hangzhou, Yangzhou, and Tianjin, and in international regions such as the United States, Russia, and Singapore. Going forward, Kingsoft Cloud will continue to serve both local and international markets and to connect more devices and people by building a global cloud computing network, so that the value of cloud computing benefits the world.

Kingsoft Cloud Log Service is a one-stop system for log data processing. It provides log collection, log storage, log retrieval and analysis, real-time consumption, and log delivery, and it supports log query and monitoring for multiple business lines, improving the operations and maintenance efficiency of each Kingsoft Cloud product line. The service currently handles about 200 TB of data per day.

Features of Log Service

As a one-stop service system for log data processing, Kingsoft Cloud Log Service provides the following features:

Data collection: Custom development based on Logstash and Filebeat to support more forms of data collection.
Data query: Supports SQL and Elasticsearch query string syntax.
Data consumption: External sockets are encapsulated on top of Pulsar. Product lines that want to show scrolling logs in the console can do so through the log service's WebSocket protocol; all log data can also be queried through the exposed REST API (that is, the service can be used as a queue).
Anomaly alerting: After retrieving data in the console, users can save the data and the query as an alert item and configure the overall alerting policy and notification method. When the query detects an anomaly, the backend starts the corresponding task to raise an alert in real time.
Chart display: Queries and their results can be saved from the console as charts (bar charts, line charts, and so on). On returning to the console, users can open the dashboard to see all current and previously saved queries and result data.
Heterogeneous data delivery: Users can choose whether to deliver logs to other cloud product lines, for example delivering certain logs to object storage to enable further processing (such as loading the data into a Hive data warehouse for analysis).

Why choose Pulsar

During our evaluation, we compared RocketMQ, Kafka, and Pulsar in terms of basic functionality and reliability, and summarized the strengths and weaknesses of each (see the table below for the comparison results).

We found that Pulsar is very well suited to log stream processing. At the BookKeeper layer, Pulsar is built on a log stream storage component. Pulsar adopts a cloud-native architecture, and the log stream service likewise adopts a cloud-native, serverless model in which all services run on the cloud. Pulsar has many flexible enterprise-grade features, such as multi-tenancy, tenant storage quotas, data ETL, and cluster-wide load balancing strategies. It also supports the transmission of large volumes of data and provides relatively complete monitoring of message queues. The sections below describe in detail the features that led us to choose Pulsar.

Separation of computing and storage

Both producers and consumers in Pulsar connect to the broker. Because the broker is a stateless service, it can be scaled horizontally without affecting overall data production and consumption. The broker does not store data; data is stored in the layer beneath the broker (the bookies), which separates computing from storage.

Flexible horizontal expansion and contraction

For cloud products, Pulsar can scale brokers out and in without rebalancing data. In contrast, Kafka has to rebalance when it scales, which can drive up cluster load and affect the overall service. In addition, Pulsar topic partitions can be expanded without limit. After expansion, the load balancing strategy automatically rebalances shards and traffic.

Pulsar multi-tenant

Pulsar natively supports multi-tenancy. The log service also has the concept of tenants: each product line (that is, each project) belongs to a tenant, which isolates data between product lines. A Pulsar cluster can support millions of topics (as practiced at Yahoo), and topics are also isolated by tenant. At the tenant level, Pulsar provides features such as storage quotas, message expiry and deletion, and isolation policies, and it supports separate authentication and authorization mechanisms.
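
To make the tenant isolation concrete: fully qualified Pulsar topic names take the form persistent://tenant/namespace/topic, so the tenant is part of every topic's identity. The following is a minimal sketch of building and parsing such names. The product line, namespace, and topic names used in the example are hypothetical, and the TopicName class is an illustration rather than part of any Pulsar client library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicName:
    """A fully qualified Pulsar topic name: <domain>://tenant/namespace/topic."""

    tenant: str
    namespace: str
    topic: str
    persistent: bool = True

    def __str__(self) -> str:
        domain = "persistent" if self.persistent else "non-persistent"
        return f"{domain}://{self.tenant}/{self.namespace}/{self.topic}"

    @classmethod
    def parse(cls, name: str) -> "TopicName":
        """Split a fully qualified name back into its three levels."""
        domain, _, rest = name.partition("://")
        if domain not in ("persistent", "non-persistent"):
            raise ValueError(f"unknown domain in {name!r}")
        parts = rest.split("/")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected tenant/namespace/topic in {name!r}")
        return cls(parts[0], parts[1], parts[2], domain == "persistent")


if __name__ == "__main__":
    # Hypothetical names: one tenant per product line, as described above.
    name = TopicName("product-line-a", "access-logs", "shard-0")
    print(name)  # persistent://product-line-a/access-logs/shard-0
    print(TopicName.parse(str(name)).tenant)  # product-line-a
```

Because the tenant is the first path component, two product lines can use the same namespace and topic names without colliding.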

Load balancing strategy

Pulsar has the concept of a bundle at the namespace level. A bundle manages a group of topic partitions. If the load on the current broker is high, the bundle is split, and the resulting sub-bundles are automatically moved to brokers with lower load. When a topic is created, Pulsar also automatically assigns it to a broker with lower current load.
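
The bundle mechanism can be pictured as ranges over a hash space: each topic hashes to a point, each bundle owns a contiguous range, and splitting a bundle halves its range so that part of its topics can move elsewhere. The sketch below simulates only that idea. The NamespaceBundles class is hypothetical, and CRC32 is used as a stand-in hash (an assumption for illustration, not necessarily the hash Pulsar uses).

```python
import zlib
from bisect import bisect_right
from typing import List, Tuple

FULL_RANGE = 2 ** 32  # bundles partition a 32-bit hash space


class NamespaceBundles:
    """Toy model of the bundles of one namespace."""

    def __init__(self, num_bundles: int = 4) -> None:
        step = FULL_RANGE // num_bundles
        # Sorted boundaries; bundle i covers [bounds[i], bounds[i + 1]).
        self.bounds: List[int] = [i * step for i in range(num_bundles)]
        self.bounds.append(FULL_RANGE)

    def bundle_of(self, topic: str) -> Tuple[int, int]:
        """Return the hash range of the bundle that owns this topic."""
        h = zlib.crc32(topic.encode("utf-8"))  # stand-in hash, 0 <= h < 2**32
        i = bisect_right(self.bounds, h) - 1
        return self.bounds[i], self.bounds[i + 1]

    def split(self, bundle: Tuple[int, int]) -> None:
        """Split one bundle into two halves; other bundles are untouched."""
        lo, hi = bundle
        i = self.bounds.index(lo)
        if self.bounds[i + 1] != hi:
            raise ValueError("not a current bundle")
        self.bounds.insert(i + 1, (lo + hi) // 2)


if __name__ == "__main__":
    ns = NamespaceBundles(num_bundles=4)
    hot = ns.bundle_of("persistent://tenant/ns/topic-0")
    ns.split(hot)  # the two halves can now be assigned to different brokers
    print(len(ns.bounds) - 1, "bundles after split")
```

Only topics in the split bundle can change owner; topics in every other bundle stay where they are, which is why a split is cheaper than a full rebalance.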

Pulsar IO model

On writes, the broker writes data to BookKeeper concurrently. Once the bookies report to the broker that the data has been written successfully, the broker maintains only a single queue internally. For real-time consumption, data can be fetched directly from the broker without querying the bookies, which improves message throughput. For catch-up reads, historical data is read from the bookies. Catch-up reads also work with data offloading, in which data is offloaded to other storage media (such as object storage or HDFS) for cold storage of historical data.
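
The difference between the two read paths can be sketched with a toy model: recent entries are served from a small in-memory structure on the broker, while older entries have to be fetched from a bookie. The Broker and Bookie classes below are hypothetical simplifications (for example, writes to the bookies are done sequentially here rather than concurrently, and the cache size is arbitrary).

```python
from collections import OrderedDict
from typing import List, Tuple


class Bookie:
    """Toy storage node that persists entries by id."""

    def __init__(self) -> None:
        self.entries = {}

    def add(self, entry_id: int, payload: bytes) -> bool:
        self.entries[entry_id] = payload
        return True  # acknowledge the write

    def read(self, entry_id: int) -> bytes:
        return self.entries[entry_id]


class Broker:
    """Toy broker: persists to every bookie, then keeps recent entries in memory."""

    def __init__(self, bookies: List[Bookie], cache_size: int = 3) -> None:
        self.bookies = bookies
        self.cache_size = cache_size
        self.cache: "OrderedDict[int, bytes]" = OrderedDict()
        self.next_id = 0

    def publish(self, payload: bytes) -> int:
        entry_id = self.next_id
        # A real broker writes in parallel; sequential here for simplicity.
        if not all(b.add(entry_id, payload) for b in self.bookies):
            raise IOError("write was not acknowledged by all bookies")
        self.cache[entry_id] = payload
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the oldest cached entry
        self.next_id += 1
        return entry_id

    def read(self, entry_id: int) -> Tuple[bytes, str]:
        """Return the payload and which layer served it."""
        if entry_id in self.cache:
            return self.cache[entry_id], "broker"  # real-time (tailing) read
        return self.bookies[0].read(entry_id), "bookie"  # catch-up read


if __name__ == "__main__":
    broker = Broker([Bookie(), Bookie()], cache_size=3)
    for i in range(5):
        broker.publish(f"log line {i}".encode())
    print(broker.read(4)[1])  # recent entry
    print(broker.read(0)[1])  # historical entry
```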

Topic creation, production and consumption

After a topic is created in the console, the topic and tenant information is recorded in etcd and MySQL. The two types of services on the right side of the figure then watch etcd. One is the producer service, which watches for topic creation and deletion and performs the corresponding internal operations. The other is the consumer service: when it detects that a new topic has been created, it connects to the Pulsar topic and consumes the data on that topic in real time. The producer then starts receiving data and determines which topic to write it to. The consumer consumes the data and, after making its own determination, writes it to ES or dumps it to other storage.

Figure 1. Topic creation, production, and consumption flow

Topic logical abstraction

Pulsar has three levels: topic, namespace, and tenant. Because Pulsar does not currently support regex-based consumption at the namespace level, we needed to raise the whole concept up one level to reduce the number of background Flink tasks and consume data at the level of an entire project. In other words, in the log service, a Pulsar topic corresponds to a logical shard, and a Pulsar namespace corresponds to a logical topic. This change gives us two capabilities: the number of shards can be increased and decreased dynamically, and the Flink tasks started in the background can consume data at the level of a single project.

Figure 2. Pulsar topic logical abstraction
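
As a rough sketch of this abstraction, the functions below treat a project as a Pulsar tenant, a log-service topic as a Pulsar namespace, and each shard as a Pulsar topic inside that namespace. The shard-N naming scheme and both function names are assumptions made for illustration; the article does not state how shards are actually named.

```python
from typing import List, Tuple


def shard_topics(project: str, log_topic: str, shards: int) -> List[str]:
    """List the Pulsar topics that back one log-service topic.

    project   -> Pulsar tenant
    log_topic -> Pulsar namespace
    shard     -> Pulsar topic (hypothetical "shard-N" naming)
    """
    return [
        f"persistent://{project}/{log_topic}/shard-{i}" for i in range(shards)
    ]


def resize(
    project: str, log_topic: str, current: int, target: int
) -> Tuple[List[str], List[str]]:
    """Return (topics to create, topics to delete) when the shard count changes."""
    old = shard_topics(project, log_topic, current)
    new = shard_topics(project, log_topic, target)
    to_create = [t for t in new if t not in set(old)]
    to_delete = [t for t in old if t not in set(new)]
    return to_create, to_delete


if __name__ == "__main__":
    create, delete = resize("project-a", "nginx-access", current=2, target=4)
    print(create)  # the two shards added by scaling out
    print(delete)  # nothing to remove when scaling out
```

With this mapping, changing the shard count of a log-service topic only adds or removes Pulsar topics inside one namespace, and everything belonging to a project shares a single tenant prefix.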

Message subscription model

Pulsar provides four message subscription modes:

Exclusive mode (Exclusive): When multiple consumers subscribe to a topic with the same subscription name, only one consumer can consume data.
Failover mode (Failover): When multiple consumers subscribe to a Pulsar topic with the same subscription name and the active consumer fails or is disconnected, Pulsar automatically switches to the next consumer, so that only a single consumer is consuming at any time.
Shared mode (Shared): A widely used mode. When multiple consumers are started under a single subscription, Pulsar delivers messages to the consumers in turn (round-robin). If a consumer goes down or is disconnected, its messages are delivered to the next consumer. LogHub uses the shared subscription mode. The entire hub runs in containers, and the consumer side can be scaled out and in dynamically according to overall load or other metrics.
Key_Shared mode (Key_Shared): Messages are distributed by hashing the key, so messages with the same key are consumed consistently.
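
The dispatch behavior of these modes can be illustrated with a small simulation. This is a sketch of the ideas only, not of the broker's real dispatcher: the function names are hypothetical, real shared dispatch also depends on consumer availability, and CRC32 modulo the consumer count is a stand-in for the key hashing used by Key_Shared.

```python
import zlib
from itertools import cycle
from typing import Dict, Iterable, List, Sequence, Set, Tuple


def dispatch_failover(
    messages: Iterable[str], consumers: Sequence[str], alive: Set[str]
) -> Dict[str, List[str]]:
    """Failover: the first live consumer receives every message."""
    active = next(c for c in consumers if c in alive)
    return {active: list(messages)}


def dispatch_shared(
    messages: Iterable[str], consumers: Sequence[str]
) -> Dict[str, List[str]]:
    """Shared: messages are handed to consumers in round-robin order."""
    out: Dict[str, List[str]] = {c: [] for c in consumers}
    turn = cycle(consumers)
    for message in messages:
        out[next(turn)].append(message)
    return out


def dispatch_key_shared(
    messages: Iterable[Tuple[str, str]], consumers: Sequence[str]
) -> Dict[str, List[Tuple[str, str]]]:
    """Key_Shared: all messages with the same key go to the same consumer."""
    out: Dict[str, List[Tuple[str, str]]] = {c: [] for c in consumers}
    for key, value in messages:
        index = zlib.crc32(key.encode("utf-8")) % len(consumers)  # stand-in hash
        out[consumers[index]].append((key, value))
    return out


if __name__ == "__main__":
    consumers = ["hub-1", "hub-2", "hub-3"]
    print(dispatch_shared([f"m{i}" for i in range(6)], consumers))
    print(dispatch_key_shared([("a", "1"), ("b", "2"), ("a", "3")], consumers))
```

Shared mode spreads load evenly but gives no per-key ordering, which fits a workload like LogHub's; Key_Shared trades some balance for keeping each key on one consumer.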

Broker failure recovery

Because the broker is stateless, the failure of one broker has no impact on overall production and consumption. A new broker takes over as the owner, obtains the topic metadata from ZooKeeper, and automatically becomes the new owner. The data storage layer does not change, and the data in the topic does not need to be copied, which avoids data redundancy.

Bookie failure recovery

The bookie layer stores data in shards. Because bookies keep multiple replicas, when a bookie fails, the system reads the affected shards from other bookies and rebalances them, so writes are not affected and the availability of the topic is preserved.

Application of Pulsar in Log Service

The bottom layer of the log service system consists of data collection tools, which we built by customizing open source collectors (such as Logstash, Flume, and Beats). The log pool in data storage is a logical concept that corresponds to a topic in Pulsar. The upper layer of the log service system consists of query analysis and business applications. Query analysis refers to searching and analyzing on the log service workbench, or querying with SQL syntax. Business applications refer to customizing dashboards and charts in the console, real-time alerting, and so on. Both query analysis and business applications support data dumping, that is, dumping log data to storage media or low-cost storage, such as KS3-based object storage, an Elasticsearch cluster, or Kafka. The product function overview of Log Service is shown in the figure below.

Figure 3. Product function overview of the log service

Log Service Architecture Design

We designed the layered structure of the log service according to its product functions (as shown in the figure below). The bottom layer is the data collection side, which is responsible for collecting log text data, TP/TCP protocol data, log data in MySQL, and so on. Development of our in-house collector is still in progress. The collected data is sent to the corresponding Pulsar topic through the data entry point of the log service.

We apply the Pulsar cluster in three main areas. The first is multi-dimensional statistical analysis through the Flink-on-Pulsar framework, because some business lines need to run multi-dimensional aggregations over log data to produce metric results, which are then delivered back to the business line. The second is LogHub (microservices), which mainly consumes data from Pulsar topics and writes it directly to ES. The data of the entire log stream can be queried through the console, and it can also be retrieved and analyzed. The third is Pulsar Functions: operators or ETL logic are configured in the console, and the Pulsar Functions module performs the data ETL in the background.

We use the Elasticsearch cluster to store data for retrieval and analysis. KS3, KMR, and KES correspond to some of our internal cloud product lines for storage and computing. The upper-layer data output part is divided into two major modules. One is the Search API module, which provides APIs to the outside world; callers use it to perform actions that are tightly coupled to the logs. The other is the Control API module, which supports management operations in the console, such as creating a topic, adjusting the number of partitions of a topic, and retrieving alerts.

Figure 4. Layered architecture of the log service

Communication Design of Log Service

In terms of product architecture, the entire log service runs statelessly, so the various services (especially the producer and consumer services) share data through etcd. In other words, after a create, update, or delete operation in the console, the producers and consumers detect the action and make the corresponding changes. In addition, because producers and consumers run entirely in containers and the services themselves are stateless, they can be scaled dynamically. The communication design diagram of the log service is as follows.

Figure 5. Communication design of the log service
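
The watch-and-react pattern described above can be sketched without any external dependency. In the example below, TopicRegistry is an in-memory stand-in for the etcd key space, not the etcd client API, and the ProducerService and ConsumerService classes are hypothetical: they only record which topics they would route to or subscribe to.

```python
from typing import Callable, Dict, List, Set

Watcher = Callable[[str, str, dict], None]  # (event, topic, metadata)


class TopicRegistry:
    """In-memory stand-in for the shared store that the services watch."""

    def __init__(self) -> None:
        self._topics: Dict[str, dict] = {}
        self._watchers: List[Watcher] = []

    def watch(self, callback: Watcher) -> None:
        self._watchers.append(callback)

    def put(self, topic: str, meta: dict) -> None:
        """Console created or updated a topic."""
        self._topics[topic] = meta
        self._notify("PUT", topic, meta)

    def delete(self, topic: str) -> None:
        """Console deleted a topic."""
        meta = self._topics.pop(topic, None)
        if meta is not None:
            self._notify("DELETE", topic, meta)

    def _notify(self, event: str, topic: str, meta: dict) -> None:
        for callback in list(self._watchers):
            callback(event, topic, meta)


class ProducerService:
    """Keeps a routing table of topics it may write to."""

    def __init__(self, registry: TopicRegistry) -> None:
        self.routes: Dict[str, dict] = {}
        registry.watch(self._on_event)

    def _on_event(self, event: str, topic: str, meta: dict) -> None:
        if event == "PUT":
            self.routes[topic] = meta
        else:
            self.routes.pop(topic, None)


class ConsumerService:
    """Tracks the topics it should be consuming from."""

    def __init__(self, registry: TopicRegistry) -> None:
        self.subscribed: Set[str] = set()
        registry.watch(self._on_event)

    def _on_event(self, event: str, topic: str, meta: dict) -> None:
        if event == "PUT":
            self.subscribed.add(topic)  # a real service would open a consumer
        else:
            self.subscribed.discard(topic)


if __name__ == "__main__":
    registry = TopicRegistry()
    producer, consumer = ProducerService(registry), ConsumerService(registry)
    registry.put("project-a/nginx-access", {"shards": 2})
    print(sorted(consumer.subscribed))
```

Because all state lives in the shared registry, any number of producer or consumer instances can be started or stopped, and each one rebuilds its view from the events it observes.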

Log stream processing architecture

Based on the requirements of log stream processing, we designed the architecture shown below. On the left is the data collection side. The collector sends data to the data receiver (PutDatas), which forwards it to the corresponding Pulsar topic. We mainly apply Pulsar to three scenarios.

Run Flink programs on Pulsar to implement customized ETL, multi-dimensional analysis, statistics, aggregation, and other operations.
Use Pulsar in LogHub to consume and store data. After consuming data from Pulsar, LogHub writes the collected log data to the Elasticsearch cluster.
Use Pulsar behind WebSocket and the REST API. WebSocket allows real-time scrolling logs to be viewed in the console, and the REST API supports querying data in specific queues. In addition, we implemented some simple ETL processing with Pulsar Functions and deliver the processed data to the storage medium of the business line (for example, a data warehouse, Kafka, or an Elasticsearch cluster).

Figure 6. Log stream processing architecture

Future plans

With Pulsar's support, Kingsoft Cloud Log Service has been running well. We hope the log service can support more features and meet more needs. Our plans for the log service are as follows:

Add ordered consumption (ordered consumption may be required for billing and audit scenarios).
Support merging and splitting partitions.
Achieve fully containerized deployment. The internal services of Log Service are already containerized. Next, we will focus on containerized deployment of all Pulsar clusters.

At present, the log service supports about 15 internal product lines of Kingsoft Cloud (as shown in the figure below), online data volume is about 200 TB per day, and the number of topics has exceeded 30,000. Once the AI business is connected to Pulsar, the overall data volume and the number of topics will increase significantly.

Figure 7. Product lines using the log service

Through testing and production use, we have gained a more comprehensive understanding of Pulsar, and we hope Pulsar will support more features, such as removing the dependency on ZooKeeper. At present, ZooKeeper maintains all of Pulsar's metadata and is under heavy pressure, and the number of Pulsar topics is constrained by the ZooKeeper cluster. If Pulsar metadata were stored in bookies, the number of topics could grow without limit.

Automatic partition scale-out and scale-in. Log data has peaks and troughs. At peak times, the number of partitions of a topic would be expanded automatically to increase overall concurrency; at troughs, the number of partitions would be reduced to relieve pressure on cluster resources.

Regex matching at the namespace level. Background Flink tasks would no longer need to watch data at the namespace level, which would reduce the number of background Flink tasks.

Concluding remarks

As a next-generation, cloud-native distributed messaging and streaming platform, Apache Pulsar has unique advantages and is very well suited to our log stream processing scenarios. The Pulsar community is very active, and questions get answered. Throughout our preliminary evaluation, follow-up testing, and official launch, our friends at StreamNative gave us great support and helped us bring the business online quickly.

Currently, Kingsoft Cloud Log Service has more than 30,000 Pulsar topics, processes about 200 TB of data per day, and supports 15 product lines. Since launch, Pulsar has been running stably, which has greatly reduced our development and operations costs. We look forward to completing the containerized deployment of the Pulsar cluster as soon as possible. We also hope Pulsar will remove the dependency on ZooKeeper and support automatic partition scale-out and scale-in. We are willing to develop new features together with our friends in the Pulsar community to further accelerate the development of Pulsar.

Author profile
Liu Bin, Senior Development Engineer of Kingsoft Cloud Big Data.

