Abstract: This article attempts to derive, through conceptual and logical analysis, the likely application scenarios of a database intelligent monitoring system, organized by user role.
This article is shared from the HUAWEI cloud community post "GaussDB (DWS) database intelligent monitoring system application scenario analysis"; original author: Master Lu.
As with Internet products, defining and designing a new product starts with analyzing user needs. We collect and synthesize those needs, then define the product's positioning, functions, business logic, user interface, and so on. Designing a database intelligent monitoring system is no different. We need to analyze the needs of its target users, collect their explicit requests, uncover their latent needs, and build typical user personas. Finally, we design the implementation architecture of the monitoring system and feed the needs of these typical users into the product design pipeline.
Users of the database intelligent monitoring system
In practice, a database monitoring system may serve users in many different roles. Because organizational structures vary from company to company, roles may be split more finely or merged more coarsely. In general, however, they fall into three types:
- Application Development (APP DEV)
- Operation and Maintenance Engineer (SRE)
- Database Administrator (DBA)
Application development engineer role: mainly responsible for developing the business SQL in cloud applications, and for the functionality and performance of cloud services. These engineers must also ensure that the SQL they write is efficient and high-quality and does not impose extra resource or time costs on the cluster. Application development engineers therefore need to monitor newly developed SQL and understand the execution efficiency and resource consumption of each newly added query.
Operation and maintenance engineer role: mainly responsible for the long-term, stable operation of database clusters. They evaluate the database system from two perspectives: resource consumption and system load. They need to configure database alarm scenarios, view real-time or predicted alarm information, and escalate discovered problems to the database administrator for further handling. In general, an operation and maintenance engineer monitors a large number of database clusters. Rather than analyzing each cluster in depth, they act mainly as problem finders.
Database administrator role: mainly responsible for locating the root cause of database problems and providing solutions. The database administrator must be an expert in the database field and familiar with every aspect of the database, able to analyze monitoring data from multiple dimensions, locate faults, and propose fixes.
Note that these three roles do not correspond to job titles in a real production environment; they are archetypes abstracted to make user needs easier to analyze. In production, all three roles may be filled by the same person, or an SRE may cover both the SRE and DBA roles. We separate users into three roles here to simplify the needs analysis and build the corresponding personas, and thereby pin down the tools each role needs. The result is a clear conceptual framework for developing a database monitoring system.
Database intelligent monitoring system tools and application scenarios
This abstraction shows that the three roles in database monitoring and operations have different needs, and different needs call for different tools, or for different focuses within the same tool. The following paragraphs take each role in turn and describe the tools it uses:
Application development role: these engineers care only about whether the SQL they write is efficient, whether it uses the cluster's optimization features, and whether it consumes too many cluster resources. They therefore need a tool for evaluating the execution efficiency of their SQL: WebSQL, which lets users connect to the database and execute SQL statements directly. WebSQL returns the query results and can also return the execution plan, helping developers understand how efficiently their statements run. However, a user's SQL statement is rarely executed in isolation; it runs as part of a larger job stream. Measuring a baseline for its execution time and resource consumption within that job stream therefore becomes very important. We need query monitoring that records the execution time and resource consumption of characteristic SQL statements and computes the maximum, minimum, and average values as a comparison baseline, further helping users evaluate their SQL. At customer sites, resource isolation requirements mean that user jobs are bound to a workload queue for execution, so the resource configuration of each queue and its current load level matter a great deal. Will the newly added SQL statement overload the workload queue? Is the current assignment of jobs to queues reasonable? Developers need an intuitive answer to these questions before a new application goes online.
System operation and maintenance role (SRE): SREs care about the long-term, stable operation of a large number of database systems on the cloud. To meet this need, we plan to provide tools in three areas: a cluster health index, alarms, and AIOps.
The health index is a composite index built mainly on two indicators: the resource consumption index and the database system load index. Each of these is in turn built on lower-level atomic indicators and extended indicators. Computing the cluster health index requires designing a corresponding mathematical model. With such a model, we can quantify the health of each system so that administrators can quickly spot the problematic databases among the hundreds running on the cloud.
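The baseline idea can be illustrated with a minimal sketch. It assumes each historical run of a characteristic SQL statement is recorded as an `(elapsed_ms, peak_mem_mb)` pair; the names `build_baseline` and `compare_to_baseline`, and the 20% `tolerance`, are hypothetical and are not part of WebSQL or DMS.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass
class Baseline:
    """Max/min/average execution time and average memory for one characteristic SQL."""
    max_ms: float
    min_ms: float
    avg_ms: float
    avg_mem_mb: float


def build_baseline(runs: list[tuple[float, float]]) -> Baseline:
    """Build a baseline from historical (elapsed_ms, peak_mem_mb) samples."""
    if not runs:
        raise ValueError("need at least one historical run")
    times = [t for t, _ in runs]
    mems = [m for _, m in runs]
    return Baseline(max(times), min(times), mean(times), mean(mems))


def compare_to_baseline(run: tuple[float, float], base: Baseline,
                        tolerance: float = 0.2) -> list[str]:
    """Return warnings when a new run is slower or heavier than the baseline allows.

    `tolerance` is a hypothetical allowed slack above the average (20% by default).
    """
    elapsed_ms, mem_mb = run
    warnings = []
    if elapsed_ms > base.max_ms:
        warnings.append("slower than any previous run")
    elif elapsed_ms > base.avg_ms * (1 + tolerance):
        warnings.append("slower than average beyond tolerance")
    if mem_mb > base.avg_mem_mb * (1 + tolerance):
        warnings.append("memory above average beyond tolerance")
    return warnings
```

For example, with a history of `[(1200.0, 256.0), (1000.0, 240.0), (1100.0, 260.0)]`, a new run of `(1500.0, 250.0)` is flagged as slower than any previous run. Keeping max, min, and average separate lets the check distinguish a genuine regression from ordinary run-to-run variation.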
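The article leaves the mathematical model open. One simple possibility, offered purely as an assumption, is a weighted average: normalize each atomic indicator to [0, 1] (1 meaning saturated), combine them into a resource consumption index and a load index, and map the combined pressure to a score from 0 to 100. The indicator names and weights below are hypothetical.

```python
from __future__ import annotations


def weighted_index(values: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of atomic indicators, each clamped to [0, 1]."""
    total = sum(weights.values())
    return sum(min(max(values[k], 0.0), 1.0) * w for k, w in weights.items()) / total


# Hypothetical atomic indicators and weights; a real model would be tuned on data.
RESOURCE_WEIGHTS = {"cpu": 0.4, "memory": 0.3, "disk": 0.3}
LOAD_WEIGHTS = {"active_sessions": 0.5, "queue_wait": 0.5}


def health_index(metrics: dict[str, float], resource_share: float = 0.5) -> float:
    """Composite health score from 0 to 100; higher means healthier."""
    resource = weighted_index(metrics, RESOURCE_WEIGHTS)
    load = weighted_index(metrics, LOAD_WEIGHTS)
    pressure = resource_share * resource + (1 - resource_share) * load
    return round(100 * (1 - pressure), 1)


def least_healthy(clusters: dict[str, dict[str, float]], n: int = 3) -> list[tuple[str, float]]:
    """Rank clusters by health score so an SRE can look at the worst ones first."""
    scored = [(name, health_index(m)) for name, m in clusters.items()]
    return sorted(scored, key=lambda item: item[1])[:n]
```

A lightly loaded cluster (`cpu` 0.2, `memory` 0.3, `disk` 0.4, `active_sessions` 0.1, `queue_wait` 0.0) scores about 83, while a saturated one scores about 22, so sorting by score surfaces the problem clusters among hundreds.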
Beyond passive indicators such as the health index, which the system administrator has to check in person, DMS will provide comprehensive alerting at three levels. (1) On the dms-agent side, log analysis examines the dms-agent node, operating system, and database logs in real time; as soon as a threat keyword is found, an alarm is triggered and reported to the alarm platform through the corresponding channel. (2) On the DMS server side, because DMS holds all the monitoring data of the database cluster, we can combine data analysis with database expertise to design alarm rules, check each database cluster periodically, and trigger an alarm directly when a problem is found. (3) All cluster metrics collected by DMS can serve as threshold alarm indicators; they are connected to CES, which raises the threshold alarms. The configuration and display of all three alarm types need to be presented on the DMS front-end pages.
Artificial intelligence and cloud computing are natural partners. When databases move to the cloud, AIOps, the intersection of artificial intelligence and database operations, emerges naturally. Because DMS holds all the monitoring data of the database cluster, it can use historical monitoring data to identify the cluster's workload pattern and recommend optimal configuration parameters, or predict the growth trend of database disk space and notify users in advance of expansion or maintenance needs. Artificial intelligence makes all of this possible.
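The three alarm levels can be sketched as three independent producers of alarm records. This is a minimal illustration under stated assumptions: the keyword list, the rule format, and the `Alarm` record are hypothetical; the real dms-agent reporting channel and the CES interface are not modeled, and layer 3 only imitates what a threshold alarm evaluates.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Alarm:
    level: str    # which layer raised it: "agent", "server", or "threshold"
    source: str   # node or cluster name
    message: str


# Hypothetical threat keywords; a real deployment would tune these per log type.
THREAT_KEYWORDS = ("PANIC", "FATAL", "out of memory", "disk full")


def agent_log_alarms(node: str, lines: Iterable[str]) -> list[Alarm]:
    """Layer 1: the agent scans log lines and alarms on the first matching keyword."""
    alarms = []
    for line in lines:
        lowered = line.lower()
        if any(kw.lower() in lowered for kw in THREAT_KEYWORDS):
            alarms.append(Alarm("agent", node, line.strip()))
    return alarms


def server_rule_alarms(cluster: str, metrics: dict[str, float],
                       rules: dict[str, Callable[[dict[str, float]], bool]]) -> list[Alarm]:
    """Layer 2: the server periodically evaluates expert rules over collected metrics."""
    return [Alarm("server", cluster, name) for name, rule in rules.items() if rule(metrics)]


def threshold_alarms(cluster: str, metrics: dict[str, float],
                     thresholds: dict[str, float]) -> list[Alarm]:
    """Layer 3: plain threshold checks (handled by CES in the article)."""
    return [Alarm("threshold", cluster, f"{k}={metrics[k]} > {limit}")
            for k, limit in thresholds.items() if metrics.get(k, 0.0) > limit]
```

Keeping the layers separate mirrors the division described above: the agent reacts immediately to log events, the server applies domain rules on a schedule, and threshold checks stay simple enough to delegate to a generic service.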
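Disk space prediction can be illustrated with the simplest possible trend model. The article does not specify which AI technique DMS uses; the ordinary least-squares line below is a stand-in, and the 60-day `lead_time_days` policy is a hypothetical example.

```python
from __future__ import annotations

import math


def linear_fit(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("samples must be taken at different times")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days: list[float], used_gb: list[float], capacity_gb: float) -> float:
    """Days after the last sample until the fitted trend reaches capacity.

    Returns math.inf when usage is flat or shrinking.
    """
    slope, intercept = linear_fit(days, used_gb)
    if slope <= 0:
        return math.inf
    full_at = (capacity_gb - intercept) / slope
    return max(full_at - days[-1], 0.0)


def expansion_advice(days_left: float, lead_time_days: float = 60.0) -> str:
    """Hypothetical policy: warn when the disk will fill within the lead time."""
    if days_left <= lead_time_days:
        return f"expand within {days_left:.0f} days"
    return "no expansion needed yet"
```

For example, usage of 500, 510, 520, 530, and 540 GB over five days on a 1000 GB disk fits a growth of 10 GB per day, so the disk is projected to fill 46 days after the last sample, and the advice is to expand within 46 days. A production model would also need to account for seasonality and sudden bulk loads, which a straight line ignores.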
Database administrator role (DBA): database administrators have always been the chief stewards of the database. In traditional data centers, they are responsible for database performance tuning and long-term stable operation, and sometimes even help application developers optimize SQL. In the cloud era, however, the division of labor has become more refined: application developers and system administrators take over part of the database management work, which makes the DBA's role and responsibilities more focused. As the domain expert, the DBA is responsible for locating the root cause of database problems and providing solutions. Together, the system administrator and the database administrator close the loop of discovering, analyzing, and solving problems. This is why, on the cloud, an SRE position often covers the responsibilities of both the SRE and DBA roles.
The DBA is a database expert who uses database tools to locate all kinds of database problems. To find the root cause, the DBA needs two types of tools: fault analysis tools and fault self-healing tools. Fault analysis tools provide a wide range of monitoring data in different visualizations to help the DBA quickly locate the root cause. Fault self-healing tools codify the experience DBAs have gained locating and solving past problems. As we learn more about how DBAs work, the set of self-healing tools will keep growing.
Another important responsibility of the DBA is providing fault solutions, a crucial part of the operations system. However good a fault location tool is, locating a problem without ultimately solving it does not really help the user. We therefore need to build a specialized search engine that maps root causes to solutions, speeding up problem resolution and reducing the workload of front-line customer support staff.
This article is the second in a three-part series introducing the core concepts of designing a database monitoring and operations system on the cloud. It attempts to derive, conceptually and logically, the likely application scenarios of a database intelligent monitoring system based on user roles. With this basic framework in place, the work and tools ahead become clear. We hope these plans soon become reality, making cloud database operations easier and smarter.
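One way to picture "codified DBA experience" is a registry of playbooks, each pairing a detector with a remedy. This is a hypothetical sketch: the `Playbook` structure, metric names, and thresholds are illustrative assumptions, and the remedies only return a description of what a real self-healing tool would do rather than touching a database.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Playbook:
    """One piece of DBA experience: how to recognize a fault and what to do about it."""
    name: str
    detect: Callable[[dict[str, float]], bool]
    heal: Callable[[str], str]  # takes the cluster name, returns an action description


def run_self_healing(cluster: str, metrics: dict[str, float],
                     playbooks: list[Playbook]) -> list[str]:
    """Apply every playbook whose detector fires and return what was done."""
    return [pb.heal(cluster) for pb in playbooks if pb.detect(metrics)]


# Hypothetical playbooks with illustrative thresholds.
PLAYBOOKS = [
    Playbook(
        "idle-session-pileup",
        lambda m: m.get("idle_in_txn_sessions", 0) > 50,
        lambda c: f"{c}: terminate idle-in-transaction sessions older than 30 min",
    ),
    Playbook(
        "stale-statistics",
        lambda m: m.get("tables_without_recent_analyze", 0) > 10,
        lambda c: f"{c}: schedule ANALYZE on tables with stale statistics",
    ),
]
```

Growing the toolset then means appending playbooks as DBAs learn new failure patterns, without changing the runner.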
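A root cause to solution search engine can start as a small inverted index. The sketch below ranks entries by how many distinct query terms their root-cause text shares, a plain bag-of-words overlap rather than any specific ranking model the DMS team may use; the `SolutionIndex` class and the sample entries are hypothetical.

```python
from __future__ import annotations

import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9_]+", text.lower())


class SolutionIndex:
    """Tiny inverted index from root-cause descriptions to known solutions."""

    def __init__(self) -> None:
        self._docs: list[tuple[str, str]] = []  # (root cause, solution)
        self._postings: dict[str, set[int]] = defaultdict(set)

    def add(self, root_cause: str, solution: str) -> None:
        doc_id = len(self._docs)
        self._docs.append((root_cause, solution))
        for token in tokenize(root_cause):
            self._postings[token].add(doc_id)

    def search(self, query: str, top_k: int = 3) -> list[tuple[str, str]]:
        """Rank entries by the number of distinct query terms their root cause shares."""
        scores: dict[int, int] = defaultdict(int)
        for token in set(tokenize(query)):
            for doc_id in self._postings.get(token, ()):
                scores[doc_id] += 1
        # Higher overlap first; ties broken by insertion order.
        ranked = sorted(scores, key=lambda d: (-scores[d], d))
        return [self._docs[d] for d in ranked[:top_k]]
```

For example, after indexing entries for data skew, table bloat, and stale statistics, the query "slow query with skew" returns the data-skew entry first. A production engine would add synonym handling and term weighting such as TF-IDF, but the index-then-rank structure stays the same.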
If you want to learn more about GaussDB (DWS), search for "GaussDB DWS" on WeChat and follow the official account, where we share the latest and most complete PB-scale data warehouse technology. You can also find plenty of learning materials there.