Introducing EasyEagle, the intelligent operations and maintenance platform for the big data base platform: Cluster and Queue

Here it comes! The intelligent operations and maintenance platform mentioned at the big data base platform conference has arrived.
As users of the data platform, we have long been plagued by the following questions:

What is the resource level of the cluster, what is the utilization rate, and does it need to be expanded?
Why have so many tasks been pending in the queue recently, and what is the cause?
Which tasks take up most of the queue's resources, is that reasonable, and can they be optimized?
Why is the task running so slowly, and where is the problem?
Can the task be optimized to speed up output?
If a task or service is abnormal, can it be handled automatically?
…

In response to these problems, we developed EasyEagle, a self-service, intelligent monitoring and diagnosis platform. It aims to achieve the following:

Provide real-time resource level monitoring at every level (cluster, queue, task, node), covering both requested and actually used resources;
Help platform administrators and users understand resource usage at their respective levels, so that users can optimize resources and improve utilization;
Give diagnosis results quickly for queue-related problems, reducing the time users spend locating them;
Give diagnosis results and suggestions quickly for task performance problems or abnormalities;
Put forward optimization suggestions by diagnosing tasks, thereby speeding up task output and improving overall resource utilization.

We will introduce EasyEagle in installments from four perspectives: cluster and queue, task, resource management, and full-link diagnosis. Each installment shows the representative problems that users at each level currently face and how to use EasyEagle to discover and solve them. This article covers the cluster perspective and the queue perspective.
1 Cluster perspective
1.1 Cluster Basic Monitoring
The cluster perspective is mainly for platform administrators. Their main concerns are as follows:

What is the level of cluster resources?
How is the resource utilization?
Does the cluster need to be expanded or shrunk?

EasyEagle provides real-time resource monitoring of the cluster and shows the trend of the resource water level over any selected time range. From this, the administrator can clearly see when the cluster is idle and when it is busy, which serves as a reference for scheduling tasks sensibly: for example, moving non-baseline tasks out of the busy baseline periods, or moving resource-hungry tasks to periods when resources are idle.
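Reading idle and busy periods off the water-level trend can be sketched as follows. This is only a minimal illustration, not EasyEagle's implementation: the sample format (hour of day, fraction of cluster resources allocated), the function name, and the 0.8 / 0.3 thresholds are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def classify_hours(samples, busy_threshold=0.8, idle_threshold=0.3):
    """Label each hour of the day as busy, idle, or normal.

    samples: iterable of (hour_of_day, allocated_fraction) pairs, where
    allocated_fraction = allocated resources / total resources (0.0-1.0).
    The thresholds are illustrative assumptions, not EasyEagle defaults.
    """
    by_hour = defaultdict(list)
    for hour, fraction in samples:
        by_hour[hour].append(fraction)

    labels = {}
    for hour in sorted(by_hour):
        level = mean(by_hour[hour])  # average water level for that hour
        if level >= busy_threshold:
            labels[hour] = "busy"
        elif level <= idle_threshold:
            labels[hour] = "idle"
        else:
            labels[hour] = "normal"
    return labels


if __name__ == "__main__":
    demo = [(2, 0.95), (2, 0.90), (10, 0.50), (15, 0.20), (15, 0.10)]
    print(classify_hours(demo))
```

Hours labelled idle are candidates for receiving resource-hungry or non-baseline tasks moved out of the busy hours.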
EasyEagle also summarizes and analyzes the cluster's task volume. Aggregated by day or by month, the numbers show how the task count changes over time and so indicate recent business growth or decline. Combined with the resource water level, they support further analysis of the cluster's scalability and of the water level trend, which provides a data reference for operations such as cluster expansion.
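The daily and monthly aggregation of task volume amounts to a simple group-and-count. A minimal sketch, assuming the only input is a list of task submission timestamps (the function name and input shape are hypothetical):

```python
from collections import Counter
from datetime import datetime


def task_counts(submit_times, by="day"):
    """Count tasks per day or per month from their submission timestamps."""
    if by == "day":
        fmt = "%Y-%m-%d"
    elif by == "month":
        fmt = "%Y-%m"
    else:
        raise ValueError("by must be 'day' or 'month'")
    return Counter(t.strftime(fmt) for t in submit_times)


if __name__ == "__main__":
    times = [
        datetime(2022, 7, 1, 1),
        datetime(2022, 7, 1, 5),
        datetime(2022, 7, 2, 3),
        datetime(2022, 8, 1, 0),
    ]
    print(task_counts(times))              # per-day counts
    print(task_counts(times, by="month"))  # per-month counts
```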
The cluster overview page of EasyEagle displays the indicators described above. As shown in the figure below, it shows the real-time resources and node status of the cluster, a summary of the number of cluster tasks, and the cluster resource water level.

In addition, as one of EasyEagle's highlights, besides the allocated memory and CPU levels of the cluster we also show the cluster's actual resource usage (the green line in the figure). Simply put: of the resources the cluster has allocated to tasks, how much do the tasks actually use? EasyEagle derives the actual resource usage of the whole cluster from the actual load and memory usage of each node. If the cluster's allocatable resources are fully allocated but the actual load is very low, check whether tasks with large resource requests are wasting what they request.
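The comparison between allocated and actually used resources can be illustrated with a short sketch. It assumes per-node memory figures are already available; the `NodeSample` type, the field names, and the 0.9 / 0.3 thresholds are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    total_mem_gb: float      # memory the node offers to the scheduler
    allocated_mem_gb: float  # memory currently allocated to tasks
    used_mem_gb: float       # memory actually in use on the machine


def cluster_memory_levels(nodes):
    """Return (allocated level, actual usage level) for the whole cluster."""
    total = sum(n.total_mem_gb for n in nodes)
    if total <= 0:
        raise ValueError("cluster has no memory capacity")
    allocated = sum(n.allocated_mem_gb for n in nodes)
    used = sum(n.used_mem_gb for n in nodes)
    return allocated / total, used / total


def looks_wasteful(allocated_level, actual_level, full_at=0.9, low_at=0.3):
    """Flag the 'fully allocated but barely used' pattern (assumed thresholds)."""
    return allocated_level >= full_at and actual_level <= low_at


if __name__ == "__main__":
    nodes = [NodeSample(100, 95, 20), NodeSample(100, 90, 25)]
    allocated_level, actual_level = cluster_memory_levels(nodes)
    print(allocated_level, actual_level, looks_wasteful(allocated_level, actual_level))
```

The same calculation applies to CPU by swapping memory figures for virtual cores and load.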
To show actual resource utilization more intuitively, EasyEagle plots the utilization of every node in the cluster as a scatter plot, as shown in the following illustration. Each point is a compute node; the abscissa is the node's memory utilization and the ordinate is its CPU utilization. Ideally a machine's memory and CPU usage are roughly balanced, in which case the points cluster around a line with a slope of 1. In the example shown, most nodes have CPU utilization well above 40% while memory utilization is below 15%. In that case, check whether the node configuration, in particular the ratio of virtual cores to memory, is reasonable.
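The reading of the scatter plot can be expressed numerically: count how many nodes sit well above or well below the slope-1 line. The sketch below is an assumption-laden illustration; the `tolerance` of 0.15 and the function name are invented for the example, and the distance used is simply the difference between the two coordinates.

```python
def imbalance_summary(nodes, tolerance=0.15):
    """Summarize how node utilization is spread around the line cpu = mem.

    nodes: list of (mem_utilization, cpu_utilization) pairs in the range 0.0-1.0.
    A node counts as balanced when the two values differ by at most `tolerance`.
    """
    cpu_heavy = sum(1 for mem, cpu in nodes if cpu - mem > tolerance)
    mem_heavy = sum(1 for mem, cpu in nodes if mem - cpu > tolerance)
    balanced = len(nodes) - cpu_heavy - mem_heavy
    return {"cpu_heavy": cpu_heavy, "mem_heavy": mem_heavy, "balanced": balanced}


if __name__ == "__main__":
    sample = [(0.10, 0.45), (0.12, 0.50), (0.40, 0.42), (0.70, 0.30)]
    print(imbalance_summary(sample))
```

A large `cpu_heavy` count, as in the example described above, suggests the nodes are configured with too much memory per virtual core for the workload.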

From the practical case above, this module clearly tells the administrator the following:

What the cluster resource water level is and when the cluster is busy;
How task scheduling times should be arranged;
What the actual resource utilization is, and whether it can be improved further without purchasing machines;
Whether the CPU-to-memory ratio of the cluster nodes is appropriate, and how it should be set.

In short, EasyEagle starts from the cluster as a whole and displays and analyzes what big data platform operations and maintenance staff care about most: cluster resource water level, resource utilization, task volume trends, and machine node utilization.
1.2 Cluster Queue Pending Diagnosis
For platform administrators, in addition to basic cluster monitoring and analysis, EasyEagle also monitors the running status of each queue in the cluster.
In our practice with the Hadoop platform, we often see the following: a task submitted by a user waits for a long time without being scheduled. This is fairly common, tends to be concentrated in a specific time period, and recurs every day. If the pending tasks are non-core tasks, non-baseline tasks, or offline tasks that are not time-sensitive, the pending in the cluster or queue often goes unnoticed, which gives the impression that the cluster has no such problem. If they are core tasks, however, a large amount of pending affects business output.
Such a phenomenon therefore needs to be noticed and addressed. EasyEagle analyzes pending queues in the cluster as follows.
The reasons for pending fall roughly into two situations:
(1) The queue resources are sufficient, and a large number of tasks are pending in the queue

Insufficient queue AM resources
Insufficient Yarn scheduling performance

(2) The queue resources are insufficient, and a large number of tasks in the queue are pending

The queue itself has insufficient resources
The parent queue has insufficient resources, or a sibling queue has preempted resources
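The two situations and their sub-causes form a small decision tree, which can be sketched as below. This is not EasyEagle's diagnosis logic: the `QueueState` flags are hypothetical inputs that a real system would have to derive from scheduler metrics, and treating scheduling performance as the remaining explanation when nothing else matches is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class QueueState:
    pending_apps: int
    queue_full: bool = False          # queue has reached its own resource limit
    parent_full: bool = False         # parent queue has no resources left
    sibling_over_share: bool = False  # a sibling queue holds more than its share
    am_limit_reached: bool = False    # ApplicationMaster resource limit reached


def diagnose_pending(q):
    """Return a coarse reason for pending, or None if nothing is pending."""
    if q.pending_apps == 0:
        return None
    # Situation (2): the queue cannot obtain more resources.
    if q.queue_full:
        return "queue itself has insufficient resources"
    if q.parent_full:
        if q.sibling_over_share:
            return "sibling queue has preempted resources"
        return "parent queue has insufficient resources"
    # Situation (1): resources are available, yet tasks are still pending.
    if q.am_limit_reached:
        return "insufficient queue AM resources"
    return "insufficient Yarn scheduling performance"  # assumed fallback


if __name__ == "__main__":
    print(diagnose_pending(QueueState(pending_apps=12, am_limit_reached=True)))
```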

After the reason for the pending is analyzed, the following indicators are given:

The pending queue
The actual queue that caused the pending problem
The number of times the queue was pending during the configured time period
The resource usage trend, over 7 consecutive days, of the actual queue that caused the pending problem
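As an illustration of the "number of times pending" indicator, the sketch below counts, per day, the samples inside a configured time-of-day window in which a queue had pending tasks. Defining an occurrence as one sample with a non-zero pending count is an assumption of this sketch, as are the function name and the input format.

```python
from collections import Counter
from datetime import datetime, time


def pending_counts_by_day(samples, window_start, window_end):
    """Count, per calendar day, the samples in which the queue had pending tasks.

    samples: iterable of (timestamp, pending_task_count) pairs.
    window_start / window_end: time-of-day bounds of the configured period
    (start inclusive, end exclusive).
    """
    counts = Counter()
    for ts, pending in samples:
        if pending > 0 and window_start <= ts.time() < window_end:
            counts[ts.date().isoformat()] += 1
    return counts


if __name__ == "__main__":
    demo = [
        (datetime(2022, 7, 1, 1, 0), 5),
        (datetime(2022, 7, 1, 2, 0), 3),
        (datetime(2022, 7, 1, 9, 0), 4),   # outside the window
        (datetime(2022, 7, 2, 1, 30), 0),  # nothing pending
        (datetime(2022, 7, 2, 3, 0), 2),
    ]
    print(pending_counts_by_day(demo, time(0, 0), time(6, 0)))
```

Feeding in seven days of samples gives the per-day counts that a seven-day view would mark on the trend.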

As shown in the figure below, EasyEagle by default displays which queues in the cluster had pending tasks on the previous day, the queue where the problem actually originated, and the reason for the pending.

Click through to the details to see the resource utilization of the queue that actually caused the problem, which shows the problem more intuitively. As shown in the illustration below, you can view the queue's resource utilization over seven consecutive days. The dots marked in the figure indicate periods in which pending occurred.

Based on the above analysis, this module addresses the following problems:

When a large number of tasks are pending in some queues in the cluster, administrators can diagnose promptly and intervene early, reducing the impact on users and the time spent on handling;
Through automatic diagnosis, administrators are told directly the specific reasons for the large number of pending tasks in the queue.

2 Queue Perspective
In a multi-tenant Hadoop cluster, the resources of the whole cluster are divided into sub-queues so that multiple business parties can share the cluster while their resources stay isolated from one another. In this environment, business parties care more about the queue their own business runs in:

What is the resource level of the queue in each time period, and how should task scheduling times be arranged?
What is the actual utilization of queue resources? Is there room for optimization? Do I need to apply to the platform for new resources?
In a certain period, the queue's available resources suddenly dropped sharply. Which task caused it?
The queue runs very slowly during a certain period of time, and some tasks cannot be submitted. What is the problem?

The business side often raises these questions with the underlying development and operations and maintenance staff. The following shows how EasyEagle answers them.
Queue resource monitoring
For any selected time period, EasyEagle provides the trends of the queue's memory and CPU water levels and of its running and pending task counts, as shown in the following figure:

From this queue resource usage view, the business side can clearly see:

In which time periods the queue is busy, and in which time periods task scheduling should be configured

Below this queue resource usage view, EasyEagle also provides a task list that can specify a time period or point in time. The list looks like this:

Each task in the list carries information such as its requested resource level during the specified time period or at the specified point in time, which helps the business side quickly locate:
Which task requested a large amount of resources, and in which time period

As with actual resource usage at the cluster level, EasyEagle also provides the actual resource usage of each queue, aggregated from the actual resource usage of the queue's tasks. If the actual resource utilization of a queue is too low, the tasks in that queue are wasting resources. As shown in the figure below, the blue line represents the resource level already occupied by the queue, and the green line represents the queue's actual resource usage level.
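Aggregating task-level usage up to the queue can be sketched in a few lines. The record layout (`queue`, `allocated_mem_gb`, `used_mem_gb`) is a hypothetical one made up for this example; only the idea of summing per-task figures per queue comes from the text.

```python
from collections import defaultdict


def queue_usage(tasks):
    """Aggregate allocated and actually used memory per queue.

    tasks: iterable of dicts with 'queue', 'allocated_mem_gb', 'used_mem_gb'.
    Returns {queue: {'allocated': ..., 'used': ..., 'utilization': ...}}.
    """
    totals = defaultdict(lambda: {"allocated": 0.0, "used": 0.0})
    for task in tasks:
        totals[task["queue"]]["allocated"] += task["allocated_mem_gb"]
        totals[task["queue"]]["used"] += task["used_mem_gb"]

    result = {}
    for queue, t in totals.items():
        utilization = t["used"] / t["allocated"] if t["allocated"] else 0.0
        result[queue] = {**t, "utilization": utilization}
    return result


if __name__ == "__main__":
    demo = [
        {"queue": "etl", "allocated_mem_gb": 100, "used_mem_gb": 20},
        {"queue": "etl", "allocated_mem_gb": 100, "used_mem_gb": 30},
        {"queue": "adhoc", "allocated_mem_gb": 50, "used_mem_gb": 40},
    ]
    print(queue_usage(demo))
```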

If a problem with queue resources is found, for example the resource water level is high but the actual utilization is very low, the cause must be that many tasks in the queue request resources they do not use. To improve the queue's resource utilization, start from these tasks: raising their utilization raises that of the queue and, in turn, of the entire cluster.
The queue's resource analysis module also lists the tasks in the queue with high resource consumption but low resource utilization. This list is in effect the top tasks with the greatest optimization benefit: optimizing their resource consumption first and raising their utilization yields the greatest gain, which shows up clearly in the queue's resources. The information currently returned by this list is shown in the illustration below.
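One simple way to rank tasks with high consumption but low utilization is by the amount of allocated resource that goes unused. The sketch below uses that measure as an assumption; EasyEagle's actual ranking criteria are not described here, and the field names are hypothetical.

```python
def top_optimization_candidates(tasks, limit=10):
    """Return the tasks with the most allocated-but-unused memory first.

    tasks: list of dicts with 'name', 'allocated_mem_gb', 'used_mem_gb'.
    """
    def wasted(task):
        return task["allocated_mem_gb"] - task["used_mem_gb"]

    return sorted(tasks, key=wasted, reverse=True)[:limit]


if __name__ == "__main__":
    demo = [
        {"name": "a", "allocated_mem_gb": 200, "used_mem_gb": 20},   # 180 unused
        {"name": "b", "allocated_mem_gb": 50, "used_mem_gb": 5},     # 45 unused
        {"name": "c", "allocated_mem_gb": 300, "used_mem_gb": 290},  # 10 unused
    ]
    for task in top_optimization_candidates(demo, limit=2):
        print(task["name"])
```

Ranking by absolute unused amount rather than by utilization ratio keeps small tasks with poor ratios from crowding out the large tasks where optimization pays off most.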

After the list of tasks to optimize is found, the next step is to optimize each task. The optimization methods and strategies for a single task's resources are described in detail in the task resource management function below.
In short, the queue resource monitoring module solves the following problems for the business side:

Obtains the queue's resource water level in each time period, helping the business party arrange task scheduling sensibly;
Obtains the actual usage of queue resources and lists the tasks whose resources should be optimized, helping business parties improve queue resource utilization and reduce costs;
Provides a real-time view of how the large tasks in the queue (by resources or running time) are running;
Provides data support for estimating queue resources.

3 Summary
This article introduced EasyEagle from the cluster perspective and the queue perspective. Both perspectives are aimed mainly at administrators of the data platform, who care about the resource level of the cluster and the running status of each queue, and who adjust task scheduling and resource ratios accordingly.
In the following articles, we will introduce EasyEagle from the perspective of the user, that is, the task submitter.

Author: Netease Shufan Community
Link: https://juejin.cn/post/7123115900571484173/
Source: Juejin (稀土掘金)
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.
