Author: Mo Shan
Senior DBA of an Internet company.
Source of this article: original contribution
*This original content is produced by the Aikesheng open source community and may not be used without authorization. For reprints, please contact the editor and credit the source.
1. Background introduction
Operations work often runs into this pain point: our online machines mostly run multiple instances per host, so a single instance can sometimes drag down the performance of the whole machine. Without process-level monitoring, it is often hard to determine after the fact which instance exhausted the system resources. To solve this pain point, we urgently needed to implement process-level monitoring.
Process-level resource monitoring covers, but is not limited to, CPU, memory, disk IO, and network traffic.
2. Preliminary preparation
We knew of process_exporter, which can provide process monitoring, but actual research and testing revealed some shortcomings:
process_exporter https://github.com/ncabatoff/process-exporter
- The monitored objects must be pre-configured
We may deploy 20 instances on a single machine. We would either put the configuration of all 20 instances into one process_exporter, or run one process_exporter per instance. Either way, deploying process_exporter is somewhat troublesome, and whenever a new object needs to be monitored, the process_exporter configuration has to be maintained again.
What we want is that once a machine is added for monitoring, all active processes on it are discovered automatically.
- Cannot monitor a process's network usage
Testing process_exporter showed that it only exposes IO, memory, CPU and similar usage; we found no network metrics.
Many of our online machines still use gigabit NICs, so there is an even greater need to monitor network usage.
- Additional requirements
Our environment also has some short-lived processes (not resident daemons).
3. Implementing the requirements
1. Monitoring and collection
The initial idea was simple: use some system tools and read their output directly for analysis. However, my leader felt that reading the output of those tools might be too heavyweight and inefficient, and would not allow fine-grained collection, so he asked me to study grabbing the runtime data directly from [/proc/pid/], which should be the most efficient approach. In actual testing, however, implementing process monitoring via [/proc/pid/] turned out to be really difficult, so that plan was shelved for the time being. Still, I want to briefly describe the testing process.
[root process-exporter]# ls /proc/1
attr auxv clear_refs comm cpuset environ fd gid_map limits map_files mem mounts net numa_maps oom_score pagemap personality root schedstat setgroups stack statm syscall timers wchan
autogroup cgroup cmdline coredump_filter cwd exe fdinfo io loginuid maps mountinfo mountstats ns oom_adj oom_score_adj patch_state projid_map sched sessionid smaps stat status task uid_map
[root process-exporter]#
This link introduces the files and directories under [/proc/pid] in detail: https://github.com/NanXiao/gnu-linux-proc-pid-intro
(1) CPU state capture
This attempt fell flat right away: I did not find CPU-related status data under [/proc/pid/].
If anyone knows how, please enlighten me.
(2) MEM state capture
Memory can be captured through the [/proc/pid/status] file.
$ grep "VmRSS:" /proc/3948/status
VmRSS: 19797780 kB
$
(3) IO state capture
Similarly, IO can be captured through the [/proc/pid/io] file.
$ grep "bytes" /proc/3948/io
read_bytes: 7808071458816
write_bytes: 8270093250560
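For reference, here is a minimal Python sketch of reading these two files (the PID is just the one from the example above; error handling, e.g. for kernel threads without a VmRSS line, is omitted):
import re

def read_proc_mem_io(pid):
    """Read resident memory (kB) plus cumulative read/write bytes for a PID."""
    with open("/proc/{}/status".format(pid)) as f:
        vmrss_kb = int(re.search(r"^VmRSS:\s+(\d+)\s+kB", f.read(), re.M).group(1))
    io_stats = {}
    with open("/proc/{}/io".format(pid)) as f:
        for line in f:
            key, value = line.split(":")
            io_stats[key.strip()] = int(value)
    return vmrss_kb, io_stats["read_bytes"], io_stats["write_bytes"]

print(read_proc_mem_io(3948))  # the PID from the example above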
(4) Network status capture
This also fell flat. I did find network-related files under [/proc/pid/], such as [/proc/pid/net/dev] and [/proc/pid/net/netstat].
At first I thought the dev file held process-level transfer data, but it turned out that the traffic recorded in this file is for the whole NIC; in other words, [/proc/pid/net/dev] and [/proc/net/dev] record essentially the same byte counts. The test is as follows:
First, simulate two processes doing network transfers. Since my test machine has an NFS mount, I simply copied files from NFS to local disk to simulate network traffic.
$ ps -ef|grep -- "cp -i -r"|grep -v grep
root 66218 111973 12 17:14 pts/1 00:00:11 cp -i -r Backup_For_TiDB/15101/2022-06-20 /work
root 67099 122467 10 17:14 pts/2 00:00:09 cp -i -r Backup_For_TiDB/15001/2022-06-20 /work
The two PIDs are 66218 and 67099.
Then compare by printing, at the same moment, the [/proc/pid/net/dev] files of the two PIDs together with the system's [/proc/net/dev]:
$ cat /proc/66218/net/dev /proc/67099/net/dev /proc/net/dev |grep eth0 && sleep 1 && echo "------------------------" && cat /proc/66218/net/dev /proc/67099/net/dev /proc/net/dev|grep eth0
eth0: 364616462197417 249383778845 0 0 0 0 0 0 77471452119287 170038153309 0 0 0 0 0 0
eth0: 364616462197417 249383778845 0 0 0 0 0 0 77471452119287 170038153309 0 0 0 0 0 0
eth0: 364616462197417 249383778845 0 0 0 0 0 0 77471452119287 170038153309 0 0 0 0 0 0
------------------------
eth0: 364616675318586 249383924598 0 0 0 0 0 0 77471456448161 170038229547 0 0 0 0 0 0
eth0: 364616675318586 249383924598 0 0 0 0 0 0 77471456449457 170038229571 0 0 0 0 0 0
eth0: 364616675318586 249383924598 0 0 0 0 0 0 77471456449835 170038229578 0 0 0 0 0 0
$
As you can see, the eth0 counters in [/proc/66218/net/dev], [/proc/67099/net/dev], and [/proc/net/dev] are the same. In other words, [/proc/pid/net/dev] actually records the traffic of the whole system, not the traffic of a single process.
Since [/proc/pid/net/dev] does not work, I suspected [/proc/pid/net/netstat] might be what I needed, but it was very frustrating: I could hardly make sense of the information in it, and after finally working the fields out, it turned out not to be the data I needed either.
Details of /proc/pid/net/netstat can be found here: https://github.com/moooofly/MarkSomethingDown/blob/master/Linux/TCP%20%E7%9B%B8%E5%85%B3%E7%BB%9F%E8%AE%A1%E4%BF%A1%E6%81%AF%E8%AF%A6%E8%A7%A3.md
In the end, I compromised and fell back to off-the-shelf tools for collection.
Tools such as top, free, ps, iotop, and iftop.
2. Data analysis
With the collection method settled, the next step is analyzing the data; let's go through the items one by one.
(1) CPU
The following collects whole-machine figures:
$ lscpu|grep 'NUMA node0 CPU(s)'|awk '{print $NF}'|awk -F'-' '{print $2+1}' # number of CPU cores
$ uptime|awk -F'average: ' '{print $2}'|awk -F, '{print int($1)}' # current load average
$ top -b -n 1|grep '%Cpu(s):' |awk '{print int($8)}' #idle
$ top -b -n 1|grep '%Cpu(s):' |awk '{print int($10)}' # iowait
This part is simple; just record the values directly.
The following grabs per-process CPU usage:
$ top -b -n 1|grep -P "^[ 0-9]* "|awk 'NF==12 {
if($9 > 200 || $10 > 10) {
for (i=1;i<=NF;i++)
printf $i"@@@";
print "";
}
}' # per-process CPU% and MEM%; only record processes that are actually using CPU or memory
This part is a little more complicated, and the result is saved to the top_dic dictionary.
The purpose is to record each process's CPU and memory usage, but the top output does not contain the full process command line, so ps is needed as a supplement, as follows:
ps -ef|awk '{printf $2"@@@" ;for(i=8;i<=NF;i++) {printf $i" "}print ""}'
This result is saved to the ps_dic dictionary. Only the pid and the process command line are needed, so each parsed ps record ends up as [pid@@@process_info]; finally top_dic and ps_dic are joined by pid.
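A rough sketch of how these two outputs can be joined by pid (the dictionary names top_dic / ps_dic follow the text; the thresholds and field positions mirror the awk above, and this is only an illustration of the idea, not the project's actual code):
import subprocess

def run(cmd):
    # run a shell command and return its stdout split into lines
    return subprocess.run(cmd, shell=True, stdout=subprocess.PIPE,
                          encoding="utf-8").stdout.splitlines()

top_dic, ps_dic = {}, {}

# per-process CPU% / MEM% from top in batch mode (fields 9 and 10)
for line in run("top -b -n 1"):
    f = line.split()
    if len(f) == 12 and f[0].isdigit() and (float(f[8]) > 200 or float(f[9]) > 10):
        top_dic[f[0]] = {"cpu": f[8], "mem": f[9]}

# pid -> full command line from ps, mirroring the [pid@@@process_info] format
for line in run("ps -ef"):
    f = line.split(None, 7)
    if len(f) == 8 and f[1] in top_dic:
        ps_dic[f[1]] = f[7]

# join the two dictionaries by pid
report = {pid: dict(v, remarks=ps_dic.get(pid, "")) for pid, v in top_dic.items()}
print(report)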
(2) MEM
The following is the whole-machine situation:
$ free |grep '^Mem:'|awk '{print int($2/1024/1024),int($3/1024/1024),int(($2-$3)/1024/1024)}'
The following grabs per-process MEM usage:
# the per-process MEM percentage is already collected in the CPU section
For convenience, memory does not use the runtime data under /proc; traversing every PID and reading /proc felt more troublesome, and it is easier to collect it from top in passing (together with CPU). There is one downside, though: computing the memory actually used by a process needs one extra step, converting the MEM percentage into a concrete number of bytes.
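That conversion is straightforward; a minimal sketch, assuming the total memory is taken from free:
import subprocess

def mem_percent_to_bytes(mem_percent):
    """Convert a %MEM value from top into bytes, using the total memory
    reported by free (second column, in KiB)."""
    out = subprocess.run("free | grep '^Mem:'", shell=True,
                         stdout=subprocess.PIPE, encoding="utf-8").stdout
    total_kib = int(out.split()[1])
    return int(total_kib * 1024 * float(mem_percent) / 100)

print(mem_percent_to_bytes("12.5"))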
(3) Disk
The following is the whole-machine disk usage:
$ df|grep ' " + part + "'|awk '$2 > 1024 * 1024 * 50 && /^\//{print $1,int($2/1024/1024),int($3/1024/1024),int($4/1024/1024)}' # disk usage
The data-disk mount point (the part variable above) should be configured; if it is not, the usage of every mount point on the machine larger than 50 GB is recorded.
The following grabs per-process IO usage:
$ iotop -d 1 -k -o -t -P -qq -b -n 1|awk -F' % ' '
NR>2{
OFS="@@@";
split($1,a," ");
if(a[5] > 10240 || a[7] > 10240 ) {
print a[1],a[2],a[5]a[6],a[7]a[8],$NF;
}
}
NR<3{
print $0;
}'|awk '
{
if(NR==1){
print $1,$2,$6,$13;
} else if(NR==2) {
print $1,$2,$5,$11;
} else {
print $0;
}
}'
This collection is also a bit more complicated; the results are saved into the iotop_dic dictionary and joined with top_dic and ps_dic by pid. Note that during actual testing we found that some processes have very long command lines, so to avoid data redundancy the process information is recorded in a separate table [tb_monitor_process_info], together with the md5 of the string, and the md5 is used as the unique key to avoid wasting space. When displaying, the md5 value is simply used as the join condition.
The results then need to be processed into what we actually need; what I consider useful is [time] [pid] [read io] [write io] [process info], and processes with little IO are filtered out directly.
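A minimal sketch of the md5 deduplication idea mentioned above (tb_monitor_process_info is the table named in the text; the column names and SQL here are illustrative assumptions, not the project's actual schema):
import hashlib

def process_info_key(process_info):
    """md5 hex digest used as the unique key for a (possibly very long)
    process command line, so the full string is stored only once."""
    return hashlib.md5(process_info.encode("utf-8")).hexdigest()

remarks = "/opt/soft/mysql57/bin/mysqld --defaults-file=//work/mysql23736/etc/my23736.cnf"
print(process_info_key(remarks))
# illustrative SQL only: insert the full string once, keyed by its md5
sql = "insert ignore into tb_monitor_process_info (md5, process_info) values (%s, %s)"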
In summary, the collected per-process CPU, memory, and IO usage is reported to the server as follows:
{
"19991":{
"cpu":"50.0",
"mem":"12.5",
"io_r":"145",
"io_w":"14012",
"md5":"2932fb739fbfed7175c196b42021877b",
"remarks":"/opt/soft/mysql57/bin/mysqld --defaults-file=//work/mysql23736/etc/my23736.cnf"
},
"58163":{
"cpu":"38.9",
"mem":"13.1",
"io_r":"16510",
"io_w":"1245",
"md5":"c9e1804bcf8a9a2f7c4d5ef6a2ff1b62",
"remarks":"/opt/soft/mysql57/bin/mysqld --defaults-file=//work/mysql23758/etc/my23758.cnf"
}
}
(4) Network
Network monitoring is a bit awkward: there is no way to break it down by pid, so traffic in both directions can only be analyzed by ip:port.
The following is the whole-machine network usage:
$ iftop -t -n -B -P -s 1 2>/dev/null|grep Total |awk '
NR < 3 {
a = $4;
if ($4 ~ /MB/) {
a = ($4 ~ /MB/) ? 1024 * int($4) "KB" : $4;
} else if ($4 ~ /GB/) {
a = ($4 ~ /GB/) ? 1024 * 1024 * int($4) "KB" : $4;
}
a = (a ~ /KB/) ? int(a) : 0
print $2, a;
}
NR == 3 {
b = $6;
if ($6 ~ /MB/) {
b = ($6 ~ /MB/) ? 1024 * int($6) "KB" : $6;
} else if ($6 ~ /GB/) {
b = ($6 ~ /GB/) ? 1024 * 1024 * int($6) "KB" : $6;
}
b = (b ~ /KB/) ? int(b) : 0
print $1, b;
}'
Below is the network usage at the process level
$ iftop -t -n -B -P -s 2 -L 200 2>/dev/null|grep -P '(<=|=>)'|sed 'N;s/\\n/,/g'|awk 'NF==13{
if($4 ~ /(K|M|G)B/ || $10 ~ /(K|M|G)B/) {
if(($4 ~ /KB/ && int($4) > 10240) ||
($10 ~ /KB/ && int($10) > 10240) ||
($4 ~ /MB/ && int($4) > 10240) ||
($10 ~ /MB/ && int($10) > 10240) ||
($4 ~ /GB/ || $10 ~ /GB/)) {
print $2,$4,$8,$10
}
}
}'
The more troublesome part here is the unit conversion and calculation. The result is saved to iftop_dic.
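A minimal sketch of the unit conversion, assuming the iftop values look like the ones above (e.g. '7.94KB', '307KB'):
def to_kb(value):
    """Convert an iftop traffic string such as '7.94KB', '1.2MB' or '2GB'
    into an integer number of KB; unrecognized values count as 0."""
    units = {"KB": 1, "MB": 1024, "GB": 1024 * 1024}
    for unit, factor in units.items():
        if value.endswith(unit):
            return int(float(value[:-len(unit)]) * factor)
    return 0

print(to_kb("7.94KB"), to_kb("307KB"), to_kb("1.2MB"))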
This collection is also a bit involved, and the results need to be processed into what we actually need; what I consider useful is [local ip:port] [outbound traffic] [remote ip:port] [inbound traffic]. Finally, the collected per-process network usage is reported to the server as follows.
{
"net":{
"speed":"1000",
"send":"7168",
"receive":"8192",
"Total":"16384",
"time":"2022-06-29 20:16:20",
"iftop" : {
"192.168.168.11:55746":[
{
"remote":"192.168.168.13:18059",
"out":"7.94KB",
"in":"307KB"
}
],
"192.168.168.11:60090":[
{
"remote":"192.168.168.13:18053",
"out":"6.73KB",
"in":"307KB"
}
]
}
}
}
At this point, all the monitoring items have been collected; the next step is to store the data.
3. Data storage
Once the monitoring data has been analyzed, it is written out. This project uses MySQL to store the data, which inevitably involves some secondary analysis and a few caveats; I will not cover them here but in the notes section below.
4. Data display
After the data has been analyzed and recorded, the final task is to display it so that operations staff can view and analyze it whenever needed. This project uses Grafana for display. There are also a few caveats in this part, covered in the notes section; you can take a look at the screenshots first.
4. Matters needing attention
The code is implemented in Python 3, and the workarounds for all of the caveats below are likewise given only in Python 3 syntax.
1. ssh environment
Data collection is done over RPC, but the server manages the clients over ssh, so passwordless ssh login from the server to every client must be ensured.
2. Long connections
For communication with MySQL, long (persistent) connections are recommended, especially when the number of monitored machines is large. With short connections, connections to MySQL are constantly created and released, which is unnecessary overhead.
def f_connect_mysql():
    """
    Create a connection
    """
    try:
        db = pymysql.connect(host=monitor_host, user=monitor_user, password=monitor_pass,
                             database=monitor_db, port=monitor_port,
                             read_timeout=2, write_timeout=5)  # connect to MySQL
    except Exception as e:
        f_write_log(log_opt="ERROR", log="[ failed to create connection ] [ " + str(e) + " ]", log_file=log_file)
        db = None
    return db

def f_test_connection(db):
    """
    Test the connection, reconnecting if it has gone away
    """
    try:
        db.ping()
    except Exception:
        db = f_connect_mysql()
    return db

def f_close_connection(db):
    """
    Close the connection
    """
    try:
        db.close()
    except Exception:
        db = None
    return db
Note that in a multi-threaded program it is recommended that each thread maintain its own connection, or that access be protected with a mutex; this avoids some exceptions.
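A minimal sketch of the one-connection-per-thread idea, using threading.local on top of the f_connect_mysql above (just an illustration, not necessarily how the project does it):
import threading

_local = threading.local()

def get_db():
    """Return a MySQL connection owned by the current thread, creating it on
    first use, so threads never share a connection."""
    if getattr(_local, "db", None) is None:
        _local.db = f_connect_mysql()  # the function defined above
    return _local.db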
3. Do subtraction
Because we monitor processes per machine, there will inevitably be many monitored objects (a single machine running thousands of services is not impossible). This did not look like a big problem during testing, but once live we found that without optimizing this part there are far too many metrics and Grafana renders very slowly. Unnecessary records can therefore be filtered out at collection time, which reduces client-to-server network overhead to some extent, saves disk space, and improves Grafana's drawing efficiency.
After this optimization, thresholds are provided as configuration items, and a process is collected only when its resource usage meets the thresholds.
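A minimal sketch of that filtering, with threshold names borrowed from the [RULE] section of conf/config.ini shown later (the record field names are illustrative):
# thresholds as in the [RULE] section of conf/config.ini (see below)
RULES = {"cpu": 200, "mem_gb": 10, "io_kb": 10240, "net_kb": 10240}

def keep_record(rec):
    """Keep a per-process record only if at least one resource meets its
    collection threshold (field names here are illustrative)."""
    return (float(rec.get("cpu", 0)) >= RULES["cpu"]
            or float(rec.get("mem_gb", 0)) >= RULES["mem_gb"]
            or float(rec.get("io_r_kb", 0)) >= RULES["io_kb"]
            or float(rec.get("io_w_kb", 0)) >= RULES["io_kb"])

records = {"19991": {"cpu": "50.0", "mem_gb": "12.5", "io_r_kb": "145", "io_w_kb": "14012"}}
print({pid: rec for pid, rec in records.items() if keep_record(rec)})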
4. Timeout mechanism
(1) Timeouts when operating MySQL
Both for code robustness and for the continuity and stability of the program, it is recommended to add timeout parameters, which prevent reads or writes from hanging in some extreme scenarios.
(2) Timeouts when collecting data
Production environments are complex and anything can happen; even a simple command can get stuck, so a timeout mechanism is needed. Note that some problems were found when adding it. The test is as follows:
Python 3.7.4 (default, Sep 3 2019, 19:29:53)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import datetime,subprocess
>>> s_time = datetime.datetime.now()
>>> res = subprocess.run("echo $(sleep 10)|awk '{print $1}'",shell=True,stdin=subprocess.PIPE,stdout=subprocess.PIPE,stderr=subprocess.PIPE,encoding="utf-8",timeout=2)
... (a long error traceback omitted)
subprocess.TimeoutExpired: Command 'echo $(sleep 10)|awk '{print $1}'' timed out after 2 seconds
>>> e_time = datetime.datetime.now();
>>>
>>> print(s_time)
2022-06-23 13:05:37.886864
>>> print(e_time)
2022-06-23 13:05:48.353889
>>> print(res)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'res' is not defined
>>>
As you can see, [subprocess.run] was given a two-second timeout, yet execution was not actually cut off after two seconds: the difference between s_time and e_time is about 11 seconds, and the call produced no result (res is not defined). In other words, it does not have the expected timeout behavior (the expectation being that execution is terminated once the timeout threshold is reached and an exception is returned).
If the command is a simple one there is no problem; for example, changing [echo $(sleep 10)|awk '{print $1}'] to [sleep 10] makes the timeout mechanism behave normally.
Given that this timeout mechanism can fail, the code uses the system command timeout directly instead.
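A minimal sketch of that approach, wrapping the pipeline with the coreutils timeout command (the wrapper function is an illustration; the project may invoke timeout differently):
import shlex, subprocess

def run_with_timeout(cmd, seconds=2):
    """Run a shell pipeline under the coreutils `timeout` command so it really
    is killed after `seconds`, even when pipes or subshells are involved."""
    wrapped = "timeout {} bash -c {}".format(seconds, shlex.quote(cmd))
    res = subprocess.run(wrapped, shell=True, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE, encoding="utf-8")
    # coreutils timeout exits with status 124 when the time limit is hit
    return res.returncode, res.stdout

print(run_with_timeout("echo $(sleep 10)|awk '{print $1}'"))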
5. Return value
If the command contains something complex such as a pipeline, the return code cannot be trusted. The test is as follows:
>>> res = subprocess.run("echoa|awk '{print $1}'",shell=True,stdin=subprocess.PIPE,stdout=subprocess.PIPE,stderr=subprocess.PIPE,encoding="utf-8",timeout=2)
>>> res.returncode
0
>>>
In [echoa|awk '{print $1}'], the command on the left side of the pipe is wrong, so the overall return code should be non-zero (as expected), but 0 is returned here. The reason is that with pipelines, bash by default only takes the exit status of the last command in the pipeline: for [comm1|comm2|comm3], if comm1 succeeds, comm2 fails, and comm3 succeeds, the pipeline as a whole is reported as successful.
The solution is as follows:
>>> res = subprocess.run("set -o pipefail;echoa|awk '{print $1}'",shell=True,stdin=subprocess.PIPE,stdout=subprocess.PIPE,stderr=subprocess.PIPE,encoding="utf-8",timeout=2)
>>> res.returncode
127
>>>
Add set -o pipefail before executing the command. For an explanation of pipefail, see the following excerpt from the bash manual:
pipefail  If set, the return value of a pipeline is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands in the pipeline exit successfully. This option is disabled by default.
5. Working principle
1. Server
- Thread 1
This thread does three things:
(1) When the server restarts, it reads the [tb_monitor_version] table to check whether its current version number matches the one recorded in MySQL; if not, it updates the version number in MySQL and then resets all nodes with istate=2 in the [tb_monitor_host_config] table to istate=1.
(2) To bring clients online and offline, it reads the [tb_monitor_host_config] table every 30 s and handles the nodes that need to go online or offline: istate=1 means the node needs to go online, so the monitoring script is deployed (the code is updated during upgrades) and the state is set to istate=2; istate=0 means the node needs to go offline, so the client node is taken offline and set to istate=-1.
(3) To track client health, it reads the [tb_monitor_host_config], [tb_monitor_alert_info], and [tb_monitor_host_info] tables every 30 s (joining the three), finds clients that have not reported in the last two minutes and have not been alerted on in the last five minutes, and raises alarms for them.
- Thread 2
This thread does two things:
(1) Waits for clients to report monitoring data, performs secondary analysis, and writes it to MySQL.
(2) Returns the current version number to the client.
2. Client
The client side does three things:
(1) Six threads collect, in parallel, [machine cpu], [machine memory], [machine disk], [machine network], [process network], and [process io / process cpu / process memory]. When collection finishes, the main thread analyzes the data and reports it to the server.
(2) During reporting, if the server is found to be abnormal three times in a row, the client records the server exception in the [tb_monitor_alert_info] table (to avoid simultaneous alarms from multiple clients) and sends an alarm.
(3) After reporting, the client checks whether its version number matches the one returned by the server; if not, it exits and waits for crontab to pull it up again, completing the upgrade.
The server side completes the code update; the new code is synchronized to each client when the server is restarted.
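A minimal sketch of the client-side version check described above (the function name and constant are hypothetical; the real code lives in rpc.py):
import sys

LOCAL_VERSION = "1.1"  # taken from conf/config.ini

def handle_report_response(server_version):
    """After reporting, compare the version returned by the server with the
    local one; on mismatch exit, so crontab restarts the (updated) code."""
    if server_version != LOCAL_VERSION:
        sys.exit(0)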
3. MySQL
MySQL stores the version information, client IP configuration, monitoring data, and alarm state.
4. Grafana
Grafana reads the monitoring data from MySQL and displays it.
5. Alerting
An Enterprise WeChat (WeCom) bot is used as the alarm channel.
6. Usage restrictions
1. System environment
(1) Operating system version and kernel.
$ uname -a
Linux 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)
Other versions have not been tested; I am not sure whether they will work.
(2) System tools
Monitoring data collection depends on operating system tools, and the main dependencies are as follows:
awk,grep,sed,tr,md5sum
top,iftop,iotop
df,free,lscpu,uptime
ip,netstat
rsync,python3
cd,ssh,timeout
Passwordless ssh login from the server to the clients is required.
2. Software environment
Software versions may have compatibility problems; other versions are not guaranteed to work, so please test and debug them yourself.
(1) Python environment
3.7.4
(2) MySQL version
5.7.26
(3) Grafana version
8.3.1. It is recommended that the minor version match. https://dl.grafana.com/enterprise/release/grafana-enterprise-8.3.1.linux-amd64.tar.gz
7. Usage introduction
1. Deploy the server
(1) Clone the project
mkdir -p /opt/soft/git
cd /opt/soft/git
git clone https://gitee.com/mo-shan/rpc_for_process_monitor.git
This depends on a Python 3 environment (3.7.4 recommended); python3 must be in the PATH. The installation process is omitted.
(2) Deploy the server
cp -r /opt/soft/git/rpc_for_process_monitor /opt/soft/rpc_for_monitor # note the directories differ; the idea is to keep the development directory separate from the deployed one to avoid mistakes
cd /opt/soft/rpc_for_monitor
$ tree -L 2
.
├── conf
│ └── config.ini # configuration file
├── img # ignore
│ ├── all-info.png
│ ├── cpu-info.png
│ ├── disk-info.png
│ ├── grafana-data-source-1.png
│ ├── grafana-data-source-2.png
│ ├── grafana-data-source-3.png
│ ├── grafana-data-source-4.png
│ ├── grafana-data-source-5.png
│ ├── grafana-data-source-6.png
│ ├── grafana-data-source-7.png
│ ├── mem-info.png
│ ├── net-info.png
│ └── process-info.png
├── init # initialization files
│ ├── grafana.json # grafana dashboard template
│ ├── init.sql # mysql table creation statements
│ └── requirements.txt # python3 module dependencies
├── lib # library files
│ ├── Config.py # parses config.ini
│ ├── ConnectMySQL.py # connects to and operates MySQL
│ ├── globalVar.py # global variables
│ ├── Public.py # common functions
│ └── __pycache__
├── LICENSE
├── logs # log directory
│ └── info.log # log file
├── py37env # virtual environment; must live under /opt/soft/rpc_for_monitor/py37env to work (the paths in activate etc. are hard-coded)
│ ├── bin
│ ├── include
│ ├── lib
│ └── pip-selfcheck.json
├── README.md # documentation
├── rpc.py # main program
├── start_server.sh # server-side startup script
└── state # ignore
└── state.log
11 directories, 28 files
(3) Configure the server
vim conf/config.ini # edit according to your environment
If you need to change the project directory, you also need to change the [config_file] variable in [lib/Config.py].
[global]
version = 1.1 # version number; this controls the server and client code: if the server finds this value differs from the one stored in the table, it treats the code as changed and pushes the new code to the clients; a client that finds its version differs from the server's restarts itself, which achieves the upgrade
interval_time = 30 # collection granularity in seconds, i.e. once every 30 s (not strictly exact)
retention_day = 30 # monitoring data retention in days, i.e. 30 days
log_file = /opt/soft/rpc_for_monitor/logs/info.log # log file
script_dir = /opt/soft/rpc_for_monitor # script directory; changing it is not recommended
mount_part = /work # data-disk mount point; may be left empty, but the item itself must not be deleted
log_size = 20 # log file size limit (MB); historical logs are deleted once it is exceeded
[RULE]
cpu = 200 # collection threshold: a process is collected only if its CPU usage is >= 200%
mem = 10 # collection threshold: a process is collected only if it uses >= 10 GB of memory
io = 10240 # collection threshold: a process is collected only if its IO (read or write, either counts) is >= 10 MB
net = 10240 # collection threshold: a process is collected only if its network traffic (in or out, either counts) is >= 10 MB
[CLIENT]
path = xxxx # predefine the OS PATH; the client maintains a crontab entry, so PATH must be set to avoid script failures caused by environment-variable issues
python3 = /usr/local/python3 # python3 installation directory
py3env = /opt/soft/rpc_for_monitor/py37env # python3 virtual environment directory; the project ships with one that can be used directly (provided the script directory is unchanged)
[MSM]
wx_url = xxxx # Enterprise WeChat alert URL; users need to adapt and test the alerting themselves (if it is a bot webhook URL + key, it can be used as-is; this example sends alerts through an Enterprise WeChat bot)
[Monitor] # configuration of the MySQL instance that stores the monitoring data
mysql_host = xxxx
mysql_port = xxxx
mysql_user = xxxx
mysql_pass = xxxx
(some settings that should not be changed are omitted)
It is not recommended to modify any of the directories; otherwise there are too many places to change, which is error-prone.
2. Deploy MySQL
MySQL installation is omitted; version 5.7 is recommended.
(1) Create the necessary accounts
Log in as the MySQL administrator and run:
create user 'monitor_ro'@'192.%' identified by 'pass1'; # change the passwords as appropriate
grant select on dbzz_monitor.* to 'monitor_ro'@'192.%';
create user 'monitor_rw'@'192.%' identified by 'pass2';
grant select,insert,update,delete on dbzz_monitor.* to 'monitor_rw'@'192.%';
The monitor_ro user is used by Grafana, and the monitor_rw user is used by the program to write monitoring data (the server writes the data; clients report to the server). Note that monitor_ro should be granted access from the Grafana machine, while monitor_rw should be granted access from all monitored machines. The latter is so that when the server goes down, the first client to notice can write an alarm record to the table and send the alert, avoiding duplicate operations by the other clients.
(2) Initialize MySQL
Log in as the MySQL administrator and run:
cd /opt/soft/rpc_for_monitor
mysql < init/init.sql
All tables live in the dbzz_monitor database:
(dba:3306)@[dbzz_monitor]>show tables;
+----------------------------+
| Tables_in_dbzz_monitor |
+----------------------------+
| tb_monitor_alert_info | # alert table; a record is written when an alert fires, to avoid repeated alerts at the same time
| tb_monitor_disk_info | # disk information table; multiple disks produce multiple records
| tb_monitor_host_config | # client configuration table; machines to be monitored are configured here
| tb_monitor_host_info | # machine-level monitoring records
| tb_monitor_port_net_info | # port-level network monitoring records
| tb_monitor_process_info | # global table of process information
| tb_monitor_process_io_info | # per-process IO monitoring data
| tb_monitor_version | # version number and the time it changed
+----------------------------+
8 rows in set (0.00 sec)
(dba:3306)@[dbzz_monitor]>
Every table has detailed comments; see the comments in the table creation statements.
3. Configure the client
Configuring the client is as simple as writing a record to a MySQL table.
use dbzz_monitor;
insert into tb_monitor_host_config(rshost,istate) select '192.168.168.11',1;
# write one record per machine; a background thread on the server periodically scans tb_monitor_host_config
# clients waiting to be added will be deployed automatically
# to take a monitored node offline, simply set istate to 0
One restriction: the client must already have a Python 3 environment, otherwise an error will be reported.
4. Deploy Grafana
Installation is omitted.
Grafana version: 8.3.1; it is recommended that the minor version match. https://dl.grafana.com/enterprise/release/grafana-enterprise-8.3.1.linux-amd64.tar.gz
This part involves Grafana configuration. Everything has been exported to a JSON file, which users can import directly.
The specific operation is as follows.
(1) Create a new DataSource
Create a new data source.
Select MySQL as the data source type.
The data source must be named [dba_process_monitor]; if it does not match the Grafana dashboard configuration, things may break.
(2) Import json configuration
$ ll init/grafana.json
-rw-r--r-- 1 root root 47875 Jun 23 14:28 init/grafana.json
This configuration was generated under Grafana 8.3.1. Pay attention to the version: different versions may not be compatible, and importing into a mismatched version can break the graphs, in which case the user needs to reconfigure Grafana. For the SQL behind each graph, see the rawSql entries in the Grafana configuration file [grep rawSql init/grafana.json].
- Suggestion 1: set [Legend] to [as table]; otherwise the display gets very messy when many series are shown.
- Suggestion 2: when choosing units, the [Custom unit] attribute can be used for values you do not want converted.
- Suggestion 3: set the [Stacking and null value] attribute to [null as zero].
5. Start the server
Configure the server-side startup script in crontab, which acts as a simple daemon.
echo "*/1 * * * * bash /opt/soft/rpc_for_monitor/start_server.sh" >> /var/spool/cron/root
The client side needs no attention; after the server starts, it manages the clients automatically.
After configuration, wait a minute and check the log [/opt/soft/rpc_for_monitor/logs/info.log]; you should see entries similar to the following.
[ 2022-06-30 15:13:01 ] [ INFO ] [ V1.1 Listening for '0.0.0.0:9300' ]
[ 2022-06-30 15:13:04 ] [ INFO ] [ new monitoring node added successfully ] [ 192.168.168.11 ]
[ 2022-06-30 15:13:11 ] [ INFO ] [ monitoring data reported successfully ] [ 192.168.168.11 ]
The default port is 9300. You can change the listening port by modifying the file [/opt/soft/rpc_for_monitor/start_server.sh].
6. Screenshots
(1) Main page
There are five rows in total: the first four are machine-level monitoring graphs, and the process row contains the process-level monitoring graphs.
(2) CPU page
The CPU usage of the entire machine.
(3) Memory page
The memory usage of the entire machine.
(4) Disk page
The disk usage of the entire machine. If no specific mount point is defined, all mount points will be collected.
(5) Network page
Network usage of the entire machine.
(6) Process page
This shows the system resource usage of specific processes. Note that because of filtering at collection time the monitoring data is not necessarily continuous, so it is recommended to configure Grafana's [null as zero] so that the graph is drawn as continuous lines instead of scattered points. Some metrics may also be empty, which is normal.
8. Matters needing attention
- Both the server and the clients must have a Python 3 environment.
- If there are multiple servers, they must all be specified (comma-separated) when starting the server; otherwise only a single server will be configured when the clients are deployed. The advantage of configuring multiple servers is that if the first one goes down or misbehaves, clients will report to the others.
- Clients can be added while the server is running: just insert a record with istate=1 into the [tb_monitor_host_config] table. Likewise, to take a node offline, update istate to 0; a node in operation has istate=2, and an offline node has istate=-1.
- The tool has an alarm function (if configured). If the server goes down (a client fails to connect three times in a row), the first client to notice records this in MySQL and sends an alarm; if a client goes down, the server detects it (no data reported for more than two minutes) and alarms.
- To upgrade the code, just test the new code and, once it is confirmed correct, copy it into the server's deployment directory, kill the server process, and wait for crontab to pull it up again. Client-side code does not need to be updated manually. Note that the new code must bump the version number in the configuration file, otherwise the server will not detect the version mismatch and will not push tasks to update the clients' code.
- If the deployment directory must be changed, adjust [conf/config.ini] and [lib/Config.py] accordingly; note that the bundled virtual environment will then no longer work. Changing the directory structure or directory names is strongly discouraged.
- Because of MySQL performance and Grafana rendering performance, collection thresholds were added, so some panels may have no data for certain periods (no process met the collection thresholds during that time).
9. Closing remarks
All content in this article is for reference only. Environments differ, so unknown problems may arise when using the code here. If you need to run it against a production environment, please test it thoroughly in a test environment first.