How to quickly resolve cluster exceptions and machine performance fluctuations

This article was first published on the Nebula Graph Community public account

Starting from cluster performance fluctuations

A few days ago, we received feedback from Xiao Zhang, who maintains the Nebula database at his company: "I found that the performance of cluster A fluctuates. The same statement is sometimes fast and sometimes slow. Can you help us find out whether the problem lies in the machines or in the service itself?"

Remembering that Xiao Zhang had previously installed Nebula Dashboard Community Edition, we recommended that he check the monitoring data. After logging in to the platform, he checked the CPU, memory, disk, and network metrics of the machines and found no obvious abnormality compared with before; the machines appeared to be running normally, as shown below:

[Figure: Nebula Dashboard screenshot]

However, a closer look at this graph shows that cluster A's network and CPU usage does spike during certain time periods.

We therefore asked Xiao Zhang to check the cluster's service metrics as well, and found that the number of queries surged during these periods and that the surge was periodic, as shown below:

[Figure: Nebula Dashboard screenshot]

After discovering the periodic pattern, we asked Xiao Zhang how the cluster was used during this time period. It turned out that an nGQL script was scheduled to run against the database every day at this time. When he reviewed the script logic, he found that it contained multi-hop queries with more than 5 hops. Having located the problem, Xiao Zhang suggested that the relevant business colleagues optimize the statements in the script to resolve the resource fluctuation.
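Spotting this kind of recurring surge does not have to be done by eye. The following is a minimal sketch, not part of Nebula Dashboard, which assumes that query counts have been exported as hypothetical `(day, hour, query_count)` samples. The function name `recurring_spike_hours` and the `factor` and `min_days` thresholds are illustrative assumptions.

```python
from collections import defaultdict
from statistics import median


def recurring_spike_hours(samples, factor=3.0, min_days=2):
    """Return the hours of day whose query count spikes on several days.

    samples:  iterable of (day, hour, query_count) tuples (hypothetical export).
    factor:   a sample counts as a spike if it exceeds factor * median count.
    min_days: a spike hour must occur on at least this many distinct days.
    """
    samples = list(samples)
    if not samples:
        return []

    # Use the median of all samples as a simple, outlier-resistant baseline.
    baseline = median(count for _, _, count in samples)

    # Collect the distinct days on which each hour spiked.
    days_by_hour = defaultdict(set)
    for day, hour, count in samples:
        if count > factor * baseline:
            days_by_hour[hour].add(day)

    # Keep only hours that spike repeatedly, which suggests a scheduled job.
    return sorted(h for h, days in days_by_hour.items() if len(days) >= min_days)
```

An hour that this function returns for most days in the sample window is a good candidate for a scheduled job, like the daily nGQL script in this case. A one-off spike on a single day is ignored.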

The troublesome cluster problem

After this problem was solved, Xiao Zhang asked us a new question: "Can I be made aware of abnormal conditions in the cluster's services and machines in time? Can I connect an alerting service and be notified of service exceptions through DingTalk, WeChat, and SMS?"

Coincidentally, Xiao Liu from another company's team also reported a problem: "A cluster cannot be connected to, and I wonder whether the service is down. The entry point for external business traffic has already been closed. How do I troubleshoot this?" Since Nebula Dashboard Community Edition does not provide a management function for viewing cluster status, once we confirmed that the monitored query count of the service was indeed 0, we recommended that Xiao Liu check the machines one by one. He found that he could log in to every machine in the cluster normally, but checking them one by one showed that the graphd and storaged services were not online on their ports: the services had failed. To avoid affecting the business, Xiao Liu had to manually start the failed services machine by machine, and the starting and stopping took him a lot of time. Afterwards, Xiao Liu said that he planned to write a quick-start script for the cluster, because starting and stopping services manually every time is too much trouble.
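A check like the one Xiao Liu did by hand can be scripted. The sketch below is only an illustration under stated assumptions: it treats a service as online if its TCP port accepts a connection, and it assumes the commonly documented default ports (9559 for metad, 9669 for graphd, 9779 for storaged), which a deployment may override. The helper names and the `probe` parameter are hypothetical.

```python
import socket

# Assumed default ports; adjust them to match your own configuration files.
DEFAULT_PORTS = {"metad": 9559, "graphd": 9669, "storaged": 9779}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def check_cluster(hosts, ports=None, probe=is_port_open):
    """Map every (host, service) pair to True (reachable) or False."""
    ports = DEFAULT_PORTS if ports is None else ports
    return {
        (host, service): probe(host, port)
        for host in hosts
        for service, port in ports.items()
    }


def offline_services(status):
    """Return the sorted (host, service) pairs that did not respond."""
    return sorted(key for key, is_up in status.items() if not is_up)
```

Calling `offline_services(check_cluster(hosts))` with your own host list returns the (host, service) pairs that need attention. An open port only shows that something is listening, so this is a first filter rather than a full health check.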

Besides the feedback from these two operations engineers, we have also received reports like this: when traffic is low, the cluster runs normally; once traffic exceeds a certain threshold, the service becomes unreachable, connections time out, a query service goes offline, and so on. Generally speaking, this happens because cluster services are distributed unreasonably or shards are uneven, which keeps the cluster or a particular machine under high load. Take the user who reported this as an example: his business traffic is usually fairly balanced, but recently, during a campaign period, his business took part in several promotions and traffic surged to hundreds of times the usual level. Given this situation, we suggested that he scale the cluster out elastically by adding 5 services to handle the sudden traffic, and then scale back in after the campaign to save costs.

In fact, besides elastic scaling, the following kinds of feedback are common:

1: How do I quickly create a cluster? Is the default 3-node configuration fine?
2: Can I see the operation records of a cluster for a given period of time?
3: Can I delete a cluster and reclaim its resources?
4: When checking yesterday's logs, I found that the storage service storaged2 of cluster B stopped and started once. Can you help find out what caused it? Could it also happen in the production environment in the future?
5: The graph service cannot be found. How do I locate the problem?
6: A service in the cluster can be found, but its status is always exited. How do I quickly start it?

Final solution

Based on user feedback from these scenarios, we set out to build a tool that manages Nebula database clusters more conveniently than Dashboard Community Edition. Yes, it is today's protagonist: Nebula Dashboard Enterprise Edition. The cluster problems and performance fluctuations described above are no longer a problem with the newly released Nebula Dashboard v3.0. Nebula Dashboard Enterprise Edition specializes in the hard-to-cure ailments of clusters.

To make it easier for database operations engineers and DBAs to manage Nebula database clusters, we have extended the community version of Nebula Dashboard with several functional scenarios. Below is a brief introduction to some of them. For more information, please watch for our Dashboard v3.0 Demo Show live broadcast and official introduction next week~

Simplify operations

Simplifying operations starts with support for quickly deploying or importing clusters. On top of that, scaling out and in has been added, so you can add and remove machines without worrying about complicated and cumbersome shell commands. Finally, visual elements have been added so that you can understand the distribution of cluster machines and services more intuitively.

[Figure: Nebula Dashboard screenshot]

Scientific Monitoring & Alerting

In addition, Nebula Dashboard Enterprise Edition optimizes the monitoring overview for more flexible and scientific monitoring of services. It supports custom monitoring rules, and once alert notifications are configured, exceptions are relayed to the designated recipients in a timely manner. By default, notifications are delivered as in-platform prompts and emails; webhooks are also provided for quick integration with third-party communication platforms such as DingTalk and corporate WeChat.
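To make the idea of a monitoring rule plus a webhook notification concrete, here is a rough sketch. It is not how Dashboard Enterprise Edition is implemented: the rule fields, function names, and webhook URL are hypothetical, and the JSON body follows the plain text message shape that DingTalk custom robots are generally documented to accept, which should be checked against the platform's current documentation.

```python
import json
import operator
import urllib.request

# Comparison operators a hypothetical rule may use.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def breached(rule, value):
    """Return True if value violates the rule's threshold."""
    return OPS[rule["op"]](value, rule["threshold"])


def build_text_payload(rule, value, host):
    """Build a plain text message body (assumed DingTalk-style shape)."""
    content = (
        f"[ALERT] {rule['metric']} on {host} is {value} "
        f"(rule: {rule['op']} {rule['threshold']})"
    )
    return {"msgtype": "text", "text": {"content": content}}


def send_webhook(url, payload, timeout=5):
    """POST the payload as JSON to the webhook URL; return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A caller would evaluate `breached(rule, value)` for each sampled metric and, when it returns `True`, pass the result of `build_text_payload` to `send_webhook` together with the webhook URL issued by the chat platform. Keeping the rule check, the message body, and the HTTP call separate makes it easy to swap in another platform's message format.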

[Figure: Nebula Dashboard screenshot]

Visual operation of services/nodes

Nebula Dashboard Enterprise Edition also supports visual management of services and nodes and displays service status in real time. It supports quick start and stop as well, so there is no need to manually start the services on faulty machines one by one. Finally, it combines machine information to display the distribution, type, running path, and other details of cluster services, so that you can operate the cluster better.

[Figure: Nebula Dashboard screenshot]

And More...

In addition, there are many other useful functions:

Cluster diagnosis: Locates problem clusters, and also diagnoses and analyzes sub-healthy clusters so that a better maintenance plan can be drawn up based on the report;
Visual large screen: Projects the data onto a large screen so that the cluster can be followed in real time; especially during stress tests and campaign scenarios, the cluster's operating load can be seen more intuitively;
View cluster metadata: Shows the leader and partition distribution, supports the BALANCE operation, and displays metadata of the other running services;
Cluster configuration update: Inspects the cluster configuration and allows custom modifications;
Account and permission management: Invite multiple people to manage clusters at the same time, with support for multi-cluster permission management;

[Figure: Nebula Dashboard screenshot]

Finally, other functions such as slow query management, one-click data backup/restore and Nebula cluster upgrade, job management, single-process monitoring, and a black box have been included in the iteration plan of Nebula Dashboard Enterprise Edition. If you are interested, keep an eye on upcoming releases of Dashboard Enterprise Edition.

You are welcome to visit https://wj.qq.com/s2/9437467/b3b1/ to apply for a trial of Nebula Dashboard and to send us product feedback and suggestions, so that Nebula Dashboard can better serve everyone's cluster operation and maintenance~ By the way, the Nebula Dashboard trial period has currently been extended to 30 days, so remember to try it out.

Want to discuss graph database technology? To join the Nebula discussion group, please fill in your Nebula business card first, and the Nebula assistant will add you to the group~~
