Abstract: This article mainly introduces the design plan and status quo of the GaussDB (DWS) database intelligent monitoring operation and maintenance service system.

This article is shared from the HUAWEI CLOUD community " " Is it tireless to see and listen to all directions? How does the data warehouse intelligent operation and maintenance service system do it? ", original author: Master Lu.

Background introduction

In the early days, the database system only provided SQL commands to query its internal operating status, resulting in a high threshold for database operation and maintenance and poor ease of use. DBAs once became a highly specialized key position. While enjoying high salaries and enviable eyes, they also served Enterprise data security brings uncertain risks. In addition, the command line operation and maintenance is not intuitive, relying heavily on the experience of the operation and maintenance personnel, unable to quickly discover, locate, and solve problems, resulting in database operation and maintenance problems, which are difficult to find, locate, and solve.

In order to cope with this dilemma, the visualization of database operation status (database monitoring system) came into being. Through visualization means, the key data is displayed to operation and maintenance personnel in graphical means in the form of graphs that are easy for humans to understand, thereby significantly reducing the database. The threshold of operation and maintenance improves the efficiency of database operation and maintenance. At this stage, there are some representative products such as: OEM (Oracle), ViewPoint (Teradata), and so on. However, the scale of user data during this period is not very large, and the database is still deployed in the user's own data center, and it is still the stage of several DBAs operating and maintaining several sets of databases.

With the advent of the cloud era, cloud databases have gradually hosted customers’ data storage services. Cloudization has concentrated all the heavy IT operation and maintenance work in the cloud back-end management, thereby enabling customers to operate from professional, complex, and heavy data centers. Freed from maintenance activities, allowing customers to focus more on their core business. At the same time, as a provider of data storage services, cloud service providers need to intensively work on IT operation and maintenance and database operation and maintenance, give full play to their team stability, high degree of specialization, and master the advantages of massive database operation data; make full use of current machines Research results in the field of learning and artificial intelligence, using technical means to gradually increase the number of databases that each operation and maintenance personnel can manage, so as to realize the "reduction and efficiency increase" of database operation and maintenance. On the other hand, after the database service goes to the cloud, the number of databases that the cloud service provider needs to operate and maintain is very different from before, and the previous tools may no longer meet the needs of the cloud era. How to do a good job in the operation and maintenance of the cloud database will become a huge challenge for cloud service providers.

Database intelligent operation and maintenance system

The database monitoring service in the traditional sense only refers to (1) collecting database operating status; (2) reporting/storing database operating data; (3) graphically displaying database operating status data. However, this is only part of the database intelligent monitoring operation and maintenance system.
image.png

If you compare the entire database intelligent monitoring operation and maintenance system to a person, the database monitoring service in the traditional sense only represents the role of the eye. The service can only find problems, and both identifying and locating problems and solving problems require the intervention of the DBA. Therefore, DBA is the core element in the traditional database monitoring operation and maintenance system, which is one of the reasons why DBA talents are so critical.

The advent of the cloud era and the maturity of technologies such as big data analysis and artificial intelligence have given more imagination to database monitoring, operation and maintenance. I can add predictive analysis and root cause judgment modules on the basis of traditional database monitoring (eyes), establish the phenomenon-root cause-solution mapping relationship (brain), and finally execute the solution (hands) through the database management module, thereby Realize the closed-loop operation and maintenance from discovering problems, locating problems, and solving problems. And machines are different from humans. As long as the computing power allows, they can observe six directions, listen to all directions, tirelessly, and not get bored. They can stare at the various operating data of hundreds of database systems 7x24. Let go of any small potential problems. Therefore, in the intelligentization of database operation and maintenance, the use of rules or algorithms to solidify the DBA's judgment and decision-making experience will be a very important part.

GaussDB (DWS) database intelligent operation and maintenance system

With reference to the construction experience of the database monitoring operation and maintenance system of friends, combined with the characteristics of GaussBD (DWS) data warehouse, we are ready to establish a closed-loop database intelligent monitoring operation and maintenance system from three aspects: eyes, brain and hands.
image.png

GaussDB (DWS) uses DMS to carry the intelligent operation and maintenance system of the database. DMS will string together the three steps of monitoring, analysis, and processing in the database operation and maintenance process, corresponding to the three parts of the eye, brain, and hand in the database intelligent operation and maintenance system mentioned above, and form the operation and maintenance from the conceptual design. The closed loop of the system.

monitoring part: mainly responsible for the collection, storage and visual display of database running status data. This part is basically equivalent to the traditional database monitoring business. In the selection of this part of the functions and indicators, we refer to the suggestions of our friends and the operation and maintenance team, and divide the monitoring indicators into two types, the underlying IT system operation and maintenance indicators and the database system operation and maintenance indicators, which are gradually being supplemented and improved. The monitoring module is the first module for the DMS database operation intelligent monitoring operation and maintenance system to exert its strength and to form a competitive advantage in a short period of time.

analysis part: as the brain of the entire DMS database intelligent operation and maintenance system, this part is the key module that undertakes the analysis and decision-making of operation and maintenance data. Because of its complexity, this part is still in the stage of design and conception. The preliminary plan has three sub-modules, the trend analysis sub-module of time series, which is mainly used for trend forecasting and analysis to predict potential problems; the logical inference sub-module, users analyze the relationship between the problem phenomenon and the actual root cause, Can realize the inference from the problem phenomenon to the trigger cause, preliminary consideration is to use search engine technology to achieve; the knowledge graph sub-module is mainly used to express the mapping relationship between the phenomenon, the root cause and the solution, and it is convenient to find the best from the root cause of the location. The right solution.

processing part: mainly undertaken by the database management function provided by DWS. Currently, it can provide database parameter configuration (fewer configurable parameters and need to be further enriched), workload queue configuration, cluster installation/uninstallation, cluster restart, cluster expansion, cluster data Operation and maintenance capabilities such as redistribution and node warming.

Typical users and requirements of GaussDB (DWS) database intelligent operation and maintenance

In order to further clarify the design ideas of database intelligent operation and maintenance products, we plan to analyze their requirements from the perspective of users, and then derive the function (tool) page design from the requirements, and summarize the required monitoring database indicators from the function (tool) page. By analyzing various usage scenarios of the database monitoring system, we made user role portraits for the users of the database monitoring system, defined three roles in the database operation and maintenance process, and believed that different roles only focus on one aspect of the database operation and maintenance. In the actual database operation and maintenance scenario, the same user may play multiple roles, but here we only define these three roles logically for the convenience of analysis.
image.png

application development: mainly refers to the application development role on the client side, and they are responsible for designing specific business SQL. They care about the correctness and efficiency of business SQL execution. Application development engineers need to use web SQL to debug the query efficiency of their SQL statements; need to use the query monitoring page to view the performance and resource consumption of business SQL in actual execution scenarios; need to use workload queue monitoring to confirm newly developed Whether the business SQL is in the appropriate workload queue, and whether the configured circuit breaker rules are reasonable, and so on.

SRE: refers to the role of database operation and maintenance on the Huawei Cloud side. They usually need to be responsible for the stable operation of hundreds of clusters. They need to be able to quickly identify abnormalities in cluster operation status, cluster resource bottlenecks, and potential clusters. They also need to actively respond to customer requests for help, help customers locate, confirm and solve problems. SRE needs node resource monitoring to identify the resource tilt in the cluster; it needs to identify the baseline change trend of cluster resource consumption, so as to identify the need for expansion and remind users; it needs to pay attention to storage changes to estimate the time point of the next routine maintenance and automatically plan; at the same time Need to respond to user needs and use the problem location tool provided by DMS to assist users in locating existing network problems.

DBA: refers to GaussDB (DWS) database cluster experts, who are familiar with database design methodology, database tuning, and database problem positioning. They need to analyze and locate database faults, and use multiple tools to comprehensively analyze and locate system faults, system stability and potential bottlenecks from the perspective of resources and business; they also need to help users recommend database indexes and distributed column configurations from the perspective of business and database design. According to the user’s business level, it is recommended that users purchase a suitable cluster size, etc.; at the same time, it is also necessary to assist application development engineers in tuning the SQL statements that cause performance degradation; after finding the exact root cause of the failure, recommend a suitable solution to repair the failure.

Generally speaking, in public cloud scenarios, there are generally only two user roles: application development and SRE. The SRE role in public cloud scenarios often covers the role of DBA. The purpose of subdividing the operation and maintenance roles here is actually to show a complete operation and maintenance scene sandbox, and list the customer's operation and maintenance requirements into categories, for the subsequent further function (tool) page design and operation and maintenance scene design Provide the basis.

GaussDB (DWS) database intelligent operation and maintenance indicators

Database monitoring indicators are large in number and complex in form and logic. According to indicator types, they can be divided into two types: logical relationship and physical relationship. Among them, the internal logical relationship of the logical relationship index library, for example, the top level is the database, there are multiple schemas in the database, there are multiple tables in the schema, there are multiple users in the database, and one user can have multiple schemas and tables. The physical relationship refers to the topological relationship of the GaussDB (DWS) cluster. For example, a database cluster is composed of multiple computing nodes, and multiple computing instances are deployed on each computing node. The relationship between these two indicators will affect the collection dimensions and aggregate display dimensions of database indicators.
image.png

Because the dimensional relationship of the indicators has been analyzed above, we will only discuss the specific database indicator types below, and will not expand the dimensionality of the indicators. The database is a software service, and it must run on a host and operating system, so the monitoring indicators can be roughly divided into two categories:

system resource index: this type of index mainly describes the consumption of various resources on the system

database related indicators: this type of indicators mainly describe the business load level related to data performance
image.png

The above figure summarizes the main indexes of the database collected by DMS. The specific index items are arranged in three levels: index categories, atomic indexes and derived indexes. However, the current index map is not fixed. In the future, with the gradual maturity of the GaussDB (DWS) intelligent operation and maintenance system, the index map will be gradually improved and fixed.

Because of the special configuration of the MPP database, the database instance is trial-run on the node as a process. Therefore, our indicator design will actually have its own dimensional attributes, such as disk utilization indicators. The smallest dimension should be a certain DN instance, the upper level is the node level, and the next level is the entire cluster. Therefore, the actual monitoring indicator we provide should be a Cartesian product of the indicator dimension relationship and the cluster indicator map. To describe this situation, we introduce the concepts of atomic indicators, derived indicators, and combined indicators. Taking the above disk usage rate as an example, we use the disk usage rate of the DN instance as the atomic index, and the disk usage rate of other dimensions as the derived index.

  • atomic index: describes the smallest dimension index of a certain characteristic of the database, such as the CPU usage rate of the node, the disk usage rate of the DN instance, and so on.
  • derived indicators: (1) Aggregation results of atomic indicators in different dimensions, such as cluster average CPU usage, cluster disk usage, etc.; (2) New indicators obtained by statistical calculations on atomic indicators, such as CPU tilt Rate, etc.
  • combined indicators: combines multiple atomic indicators or derived indicators to obtain a new indicator that is easier to understand. Such as cluster health, etc.

At present, the construction of DMS indicators is more at the stage of atomic indicators and derived indicators, because we believe that the basic indicators of the database should be supplemented first to form a basic monitoring operation and maintenance capability, and then combined with user habits, in-depth mining of indicators in various dimensions The meaning of operation and maintenance and the meaning of operation and maintenance represented by the combination of multiple indicators.

to sum up

Finally, to summarize, this article mainly introduces the design plan and current status of GaussDB (DWS) database intelligent monitoring operation and maintenance service system. As the first article in a series of DMS articles, this article mainly serves as an overview, giving everyone a general understanding of the GaussDB (DWS) database intelligent monitoring operation and maintenance service system. More details are welcome to look forward to the follow-up articles.

If you want to know more about GuassDB (DWS), welcome to search "GaussDB DWS" on WeChat and follow the WeChat official account, and share with you the latest and most complete PB-level digital warehouse black technology. You can also get a lot of learning materials in the background~

Click to follow, and get to know the fresh technology of


华为云开发者联盟
1.4k 声望1.8k 粉丝

生于云,长于云,让开发者成为决定性力量