About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight functional computing. It supports multi-datacenter and cross-region data replication, and offers streaming data storage features such as strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/
This article is reposted from "Everyone Learns Big Data"; author: Wang Jialing.
The content is compiled from the talk given at Pulsar Summit Asia 2021 by Wang Jialing, Product Manager of Mobile Cloud Pulsar at the China Mobile Cloud Competence Center.
Pulsar Summit Asia 2021 was held online on January 15-16, 2022. At the summit, 30 speakers shared cutting-edge Apache Pulsar hands-on experience, scenario case studies, technical explorations, and operations stories across 26 topics.
Below we review the practice of applying Apache Pulsar in the mobile cloud intelligent operation and maintenance platform, as shared by engineer Wang Jialing from Mobile Cloud.
Mobile cloud intelligent operation and maintenance platform
The mobile cloud intelligent operation and maintenance platform is an enterprise-level DevOps platform that integrates resource configuration, alarm indicators, performance monitoring, log management, fault handling, and other functions, making operation and maintenance work more convenient.
The platform aims to build an "N+31+X" resource layout. Faced with nearly 50,000 physical machines and more than 9,000 network devices, building such a system raises many practical questions:
- In which equipment room, which cabinet, and which rack is the physical machine located? How many cores, how much memory, and how much storage does a physical machine have?
- How should so much device alarm and performance data be handled? How should it be collected and processed?
- When a device fails, how can the fault be located quickly, and how should troubleshooting personnel be dispatched?
- Most services are distributed and their logs are scattered across different nodes. How can logs be retrieved quickly to locate problems?
These are the problems the mobile cloud intelligent operation and maintenance platform must solve or explore; it shoulders the mission of intelligent, centralized operation and maintenance.
The overall architecture of the mobile cloud intelligent operation and maintenance platform is as follows:
Within it, the operation and maintenance data platform plays a linking role: its lower layer connects to the basic operation and maintenance platform to receive and collect data, while its upper layer connects to the operation and maintenance capability layer to provide data query and analysis interfaces.
The mobile cloud team chose Pulsar as the core technology of the operation and maintenance data platform, using it as the data pipeline to provide data access, data processing, data consumption, and data delivery:
- Pulsar's separation of computing and storage provides a scalable data pipeline.
- A unified operation and maintenance data processing DSL, built on the Pulsar Function computing framework, enables efficient data integration.
- Operation and maintenance data delivery is implemented by adapting Pulsar Sink.
- A unified operation and maintenance SQL layer, built on PrestoDB, enables unified query and analysis over Elasticsearch and ClickHouse, while a DSL field translation service provides operation-and-maintenance-specific query and analysis syntax to improve query efficiency.
In short, Pulsar's role in the operation and maintenance data platform is to receive log data, process it, and deliver it to Elasticsearch and ClickHouse, which in turn provide the upper layers with data query and analysis capabilities.
Compared with the ELK- or Kafka-based data pipeline solutions commonly used in the industry, Pulsar has clear advantages: on the one hand, it offers a better operation and maintenance experience than Kafka, thanks to its compute-storage separation architecture and tenant isolation; on the other hand, using Pulsar Function for data processing is more flexible and performs better than Logstash.
Implementation of a log processing DSL based on Pulsar Function
Pulsar Function
Pulsar Function implements function computing on top of Pulsar. It is essentially an independent Java module that receives messages from an input topic, processes the message content in the Function, and outputs the results to an output topic.
Developers only need to implement the Function interface to provide message processing logic.
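As a minimal sketch, a Java Function that upper-cases each incoming message looks like this (the class name is illustrative; the Function and Context interfaces are Pulsar's standard Java API):

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// Receives a String from the input topic, transforms it, and returns
// the result, which the framework writes to the output topic.
public class UppercaseFunction implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        return input.toUpperCase();
    }
}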
The Function worker is the platform responsible for Function configuration, scheduling, and runtime management; it provides three Function runtimes: thread, process, and Kubernetes (K8s).
The log processing conundrum
Pulsar Function only provides the runtime capability for function computing; using Functions to process logs in real business scenarios still poses many difficulties:
- The log sources are constantly increasing, and the log data varies greatly.
- Log processing logic requirements are becoming more and more complex: structuring, cleaning, desensitization, data distribution, multi-source aggregation, data enrichment, etc.
- How to design a processing Function that can dynamically load different processing logic?
To solve these problems, a DSL, i.e., a special-purpose syntax for log processing, is needed to flexibly control the processing of raw log data. Given the requirements of log processing logic, this DSL must provide rich function operators and scenario-specific custom logic, and for complex requirements it must support composing and orchestrating logic through flow control and conditional branching.
DSL syntax parsing
Implementing a DSL requires a grammar parser. The mobile cloud team chose ANTLR4, which is also used to parse SQL in the Spark computing engine. It parses the DSL according to a pre-defined grammar file, generating a Statement for each statement, so the whole DSL yields a Statement list. A Statement contains two pieces of information: a syntax keyword and a parameter list.
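As a hypothetical sketch (the class and field names are assumptions, not from the original talk), the parsed Statement could be represented as:

import java.util.List;

// Each DSL statement parses into a syntax keyword plus its argument list,
// e.g. e_set("level", "INFO") -> keyword "e_set", args ["level", "INFO"].
public class Statement {
    private final String keyword;
    private final List<String> args;

    public Statement(String keyword, List<String> args) {
        this.keyword = keyword;
        this.args = args;
    }

    public String getKeyword() { return keyword; }
    public List<String> getArgs() { return args; }
}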
Processing functions can then be defined for each syntax keyword. FunctionArgs is the list of processing parameters parsed from the Statement, and Context is the log content, already converted to JSON format.
The general field-setting syntax (e_set in the example below) can then be implemented as a simple operation: find the corresponding field in the Context and replace its value. The return value of a processing function is defined as a list of Contexts, because the log-splitting scenario requires splitting one original log into multiple logs for subsequent processing.
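A minimal sketch of this design, assuming JSON logs are represented as maps (the interface and class names are illustrative, not the team's actual code):

import java.util.List;
import java.util.Map;

// A processing function takes the parsed arguments (FunctionArgs) and the
// log Context (JSON as a map) and returns a list of Contexts, so that one
// log can be split into several.
@FunctionalInterface
interface ProcessFunction {
    List<Map<String, Object>> apply(List<String> functionArgs,
                                    Map<String, Object> context);
}

// e_set(field, value): set the field in the Context to the given value.
class ESetFunction implements ProcessFunction {
    @Override
    public List<Map<String, Object>> apply(List<String> args,
                                           Map<String, Object> context) {
        context.put(args.get(0), args.get(1));
        return List.of(context); // no split: a single resulting log
    }
}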
DAG execution plan
After parsing the DSL, an execution plan must be built and scheduled at runtime. A DAG (directed acyclic graph) is used for modeling here; when Spark processes data, it likewise converts computations into a DAG for scheduling. Borrowing this idea, the mobile cloud team turns the Statement list generated from the DSL into a DAG, so the processing flow can be decomposed into tasks that can be computed in parallel.
DSL statement:
e_set(...);
e_drop_fields(...);
e_if_else(bool,e_set(...),e_set(...));
e_set(...);
e_output(...);
DAG graph (figure in the original talk): each statement becomes a node; e_if_else branches into its two e_set arguments, which rejoin at the following e_set and end at e_output.
Each node in the DAG corresponds to a Statement, i.e., it contains a processing function and its processing parameters. Executing the DSL then becomes traversing the DAG and executing each node's processing function in turn. The acyclic property of the DAG guarantees that traversal from the root node never forms a loop, and reaching a tail node marks the end of processing.
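A minimal sketch of such a DAG node, reusing the hypothetical ProcessFunction above (the structure is an assumption reconstructed from the description):

import java.util.ArrayList;
import java.util.List;

// One DAG node per Statement: the processing function, its parsed
// arguments, and the downstream nodes to visit next.
class DagNode {
    final ProcessFunction fn;
    final List<String> args;
    final List<DagNode> downstream = new ArrayList<>();

    DagNode(ProcessFunction fn, List<String> args) {
        this.fn = fn;
        this.args = args;
    }

    boolean isTail() { return downstream.isEmpty(); }
}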
DAG execution scheduling
After parsing the DSL statements and building the DAG tasks, the next step is to implement scheduling inside the Function. The overall design is to pass the DSL statements into the Function as custom configuration parameters, parse the DSL and build the DAG tasks up front, and then process each message in a data-driven fashion.
onStart and onStop interfaces were added to Function so that DSL parsing and global resource initialization can be performed when the Function starts.
In the implementation of the Function's process interface, only the message content and the corresponding DAG execution plan are put into a cache queue; neither result delivery nor message acknowledgment happens at this point. The two relatively time-consuming steps, message processing and result delivery, run in an asynchronous thread pool, and the message is acknowledged once they complete.
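A hedged sketch of this asynchronous process implementation, reusing the hypothetical DagNode above (the LogEvent class and queue wiring appear in the scheduler sketch below; Jackson is assumed for JSON parsing, and acknowledgment is deferred to the worker threads as described, not handled here):

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.lmax.disruptor.RingBuffer;
import java.util.Map;
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// process() only parses the raw log into JSON and enqueues it together
// with the DAG plan; heavy processing, result delivery, and acknowledgment
// happen later in worker threads.
public class DslLogFunction implements Function<String, Void> {
    private static final ObjectMapper MAPPER = new ObjectMapper();
    private RingBuffer<LogEvent> queue; // initialized at startup (onStart)
    private DagNode root;               // DAG built from the DSL at startup

    @Override
    public Void process(String input, Context context) throws Exception {
        Map<String, Object> log = MAPPER.readValue(
                input, new TypeReference<Map<String, Object>>() {});
        queue.publishEvent((event, seq) -> event.set(log, root));
        return null; // nothing is emitted synchronously
    }
}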
First, the mobile cloud team chose Disruptor, a high-performance lock-free queue, as the cache queue. Disruptor uses a ring array (RingBuffer) as the queue: producers and consumers claim slots in the array via CAS operations and, once a claim succeeds, write or read data at that position, thereby avoiding lock contention. This gives it a large performance advantage over lock-based queues.
Finally, the DAG execution scheduling scheme the team selected treats the processing of a single DAG node as the unit of work for a log processing thread. When a message enters a processing thread, the current node's processing is completed, then a downstream node is selected and, together with the current processing result, put back into the cache queue to await the next round of scheduling. After a tail node is processed, the message result is delivered and the message is acknowledged. This allows the threads in the pool to be reused and improves data processing efficiency.
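A minimal sketch of this node-at-a-time scheduling on top of the LMAX Disruptor (the event class and handler logic are assumptions reconstructed from the description, and thread-pool distribution is simplified to a single handler):

import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;
import java.util.Map;

// Event carried on the ring buffer: the current log content plus the
// DAG node to execute next.
class LogEvent {
    Map<String, Object> log;
    DagNode node;

    void set(Map<String, Object> log, DagNode node) {
        this.log = log;
        this.node = node;
    }
}

class DagScheduler {
    private final Disruptor<LogEvent> disruptor =
        new Disruptor<>(LogEvent::new, 1024, DaemonThreadFactory.INSTANCE);

    RingBuffer<LogEvent> start() {
        // Each handler invocation executes exactly one DAG node, then either
        // re-enqueues the result with a downstream node or, at a tail node,
        // delivers the result and acknowledges the message. Note that
        // re-publishing from inside a handler assumes the ring buffer has
        // headroom; a production version would guard against saturation.
        disruptor.handleEventsWith((event, sequence, endOfBatch) -> {
            DagNode node = event.node;
            for (Map<String, Object> result : node.fn.apply(node.args, event.log)) {
                if (node.isTail()) {
                    deliverAndAck(result);
                } else {
                    for (DagNode next : node.downstream) {
                        disruptor.getRingBuffer()
                                 .publishEvent((e, seq) -> e.set(result, next));
                    }
                }
            }
        });
        disruptor.start();
        return disruptor.getRingBuffer();
    }

    private void deliverAndAck(Map<String, Object> result) {
        // Write to Elasticsearch/ClickHouse and acknowledge the Pulsar
        // message; omitted in this sketch.
    }
}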
The above is the mobile cloud team's design and implementation of DSL-based log processing on Pulsar Function. With this capability, log preprocessing for complex scenarios such as data normalization and structuring can be built easily, preparing data for the subsequent log query and analysis stages while also greatly reducing operation and maintenance costs.
Outlook: building a distributed data storage engine based on BookKeeper with integrated full-text and time-series indexes
At present, data processed in Pulsar Function is delivered to Elasticsearch and ClickHouse, where separate indexes are built for subsequent query and analysis. Meanwhile, Pulsar itself persists data in BookKeeper, so there is considerable redundancy in data storage. In addition, although Pulsar provides Presto-based SQL queries, query performance over massive data sources does not meet the requirements.
BookKeeper is a mature distributed data storage engine. Since the log data written into BookKeeper is preprocessed, structured data that will never be updated, the team hopes to build full-text and time-series indexes on top of it, so that upper-layer applications can directly query and analyze the data stored in BookKeeper, and to leverage BookKeeper's horizontal scalability and sharded storage to create an infinitely scalable distributed data storage engine with integrated full-text and time-series indexes.