1. Introduction
Use Prometheus, Grafana, and Alertmanager to monitor commonly used components and send alerts.
2. System environment
① Prometheus: 192.168.83.137 39090
② Grafana: 192.168.83.137 33000
③ Alertmanager: 192.168.83.137 39093
④ node_exporter: 192.168.83.137 39100, 192.168.83.138 39100
⑤ mysqld_exporter: 192.168.83.137 39104
⑥ nginx-prometheus-exporter: 192.168.83.137 39113, 192.168.83.138 39113
⑦ process-exporter: 192.168.83.137 39256
⑧ prometheus-webhook-dingtalk: 192.168.83.137 38060
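Once the services are running, it can help to keep this endpoint list in one place and probe it in a single pass. The following is a minimal sketch, not part of the original setup: the function names endpoints and check_endpoints are hypothetical, curl is assumed to be installed, and the check relies on every component above answering plain HTTP. The prometheus-webhook-dingtalk entry assumes the listener 192.168.83.137:38060 that the installation steps configure; adjust it if yours differs.

```shell
# Print one "name host port" line per monitored endpoint from the list above.
endpoints() {
  cat <<'EOF'
prometheus 192.168.83.137 39090
grafana 192.168.83.137 33000
alertmanager 192.168.83.137 39093
node_exporter 192.168.83.137 39100
node_exporter 192.168.83.138 39100
mysqld_exporter 192.168.83.137 39104
nginx-prometheus-exporter 192.168.83.137 39113
nginx-prometheus-exporter 192.168.83.138 39113
process-exporter 192.168.83.137 39256
prometheus-webhook-dingtalk 192.168.83.137 38060
EOF
}

# Probe each endpoint over HTTP and report whether anything answered.
# Any HTTP response (even a 404) counts as reachable.
check_endpoints() {
  endpoints | while read -r name host port; do
    if curl -s -o /dev/null --max-time 3 "http://$host:$port/"; then
      echo "OK   $name $host:$port"
    else
      echo "FAIL $name $host:$port"
    fi
  done
}
```

Running check_endpoints from any host that can reach both machines prints one OK or FAIL line per service.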
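3. Requirements
① Monitor Linux hosts with node_exporter and visualize the metrics in Grafana.
② Monitor the MySQL database with mysqld_exporter and visualize the metrics in Grafana.
③ Monitor the nginx middleware with nginx-prometheus-exporter and visualize the metrics in Grafana.
④ Monitor system processes with process-exporter and visualize the metrics in Grafana.
⑤ Configure alerts on the monitoring data.
4. Downloads: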
① Prometheus:
https://github.com/prometheus/prometheus/releases/download/v3.2.1/prometheus-3.2.1.linux-amd64.tar.gz
② Grafana:
https://dl.grafana.com/enterprise/release/grafana-enterprise-11.5.2.linux-amd64.tar.gz
③ Alertmanager:
https://github.com/prometheus/alertmanager/releases/download/v0.28.1/alertmanager-0.28.1.linux-amd64.tar.gz
④ node_exporter:
https://github.com/prometheus/node_exporter/releases/download/v1.9.0/node_exporter-1.9.0.linux-amd64.tar.gz
⑤ mysqld_exporter:
https://github.com/prometheus/mysqld_exporter/releases/download/v0.17.2/mysqld_exporter-0.17.2.linux-amd64.tar.gz
⑥ process-exporter:
https://github.com/ncabatoff/process-exporter/releases/download/v0.8.7/process-exporter-0.8.7.linux-amd64.tar.gz
⑦ prometheus-webhook-dingtalk:
https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
⑧ nginx-prometheus-exporter:
https://github.com/nginx/nginx-prometheus-exporter/releases/download/v1.4.2/nginx-prometheus-exporter_1.4.2_linux_amd64.tar.gz
5. Installation and configuration:
① node_exporter:
Install:
tar zxvf node_exporter-1.9.0.linux-amd64.tar.gz -C /usr/local/share/applications/
cd /usr/local/share/applications/node_exporter-1.9.0.linux-amd64/
Start:
./node_exporter --web.listen-address=:39100
Verify:
http://192.168.83.137:39100/metrics // the page should load and show the exporter's metrics
Create the service:
vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=Prometheus exporter for node metrics, written in Go with pluggable metric collectors.
Documentation=https://github.com/prometheus/node_exporter/
After=network.target
[Service]
Type=simple
WorkingDirectory=/usr/local/share/applications/node_exporter-1.9.0.linux-amd64/
ExecStart=/usr/local/share/applications/node_exporter-1.9.0.linux-amd64/node_exporter --web.listen-address=:39100
Restart=on-failure
[Install]
WantedBy=multi-user.target
Verify the service:
systemctl daemon-reload
systemctl start node_exporter
systemctl enable node_exporter
systemctl status node_exporter
netstat -lantup | grep 39100
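The browser check above can also be done from the command line. The sketch below is an illustration under assumptions, not part of the original procedure: has_metric is a hypothetical helper name, curl is assumed to be available, and node_load1 is used as a representative node_exporter metric. The same helper applies to the other exporters in this section with a metric such as mysql_up or nginx_up.

```shell
# has_metric NAME: read Prometheus text-format metrics on stdin and succeed
# if at least one sample line (not a "# HELP" / "# TYPE" comment) belongs to NAME.
has_metric() {
  grep -Eq "^$1(\{| )"
}

# Usage:
#   curl -s http://192.168.83.137:39100/metrics | has_metric node_load1 \
#     && echo "node_exporter is serving metrics"
```

The pattern requires a "{" or a space right after the name, so node_load1 does not accidentally match node_load15.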
② mysqld_exporter:
Prerequisites:
Install MySQL (not covered here).
Grant MySQL privileges:
mysql> CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'Prometheus';
mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
mysql> flush privileges;
Install:
tar zxvf mysqld_exporter-0.17.2.linux-amd64.tar.gz -C /usr/local/share/applications/
cd /usr/local/share/applications/mysqld_exporter-0.17.2.linux-amd64
vim mysqld_exporter.cnf
[client]
user=exporter
password=Prometheus
Start:
./mysqld_exporter --config.my-cnf=mysqld_exporter.cnf --web.listen-address=:39104 --mysqld.address="localhost:13306"
Verify:
http://192.168.83.137:39104/metrics // the page should load and show the exporter's metrics
Create the service:
vim /usr/lib/systemd/system/mysqld_exporter.service
[Unit]
Description=Prometheus exporter for mysql metrics, written in Go with pluggable metric collectors.
Documentation=https://github.com/prometheus/mysqld_exporter/
After=network.target
[Service]
Type=simple
WorkingDirectory=/usr/local/share/applications/mysqld_exporter-0.17.2.linux-amd64
ExecStart=/usr/local/share/applications/mysqld_exporter-0.17.2.linux-amd64/mysqld_exporter --config.my-cnf=/usr/local/share/applications/mysqld_exporter-0.17.2.linux-amd64/mysqld_exporter.cnf --web.listen-address=:39104 --mysqld.address="localhost:13306"
Restart=on-failure
[Install]
WantedBy=multi-user.target
Verify the service:
systemctl daemon-reload
systemctl start mysqld_exporter
systemctl enable mysqld_exporter
systemctl status mysqld_exporter
netstat -lantup | grep 39104
③ nginx-prometheus-exporter:
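Prerequisites:
Install nginx (not covered here).
Enable the stub_status module in the nginx server block that listens on 127.0.0.1:8080, the address the exporter scrapes:
location /stub_status {
    stub_status;
}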
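Before starting the exporter, it is worth confirming that nginx really serves the status page. The following is a small sketch under assumptions: stub_status_active is a hypothetical helper name, curl is assumed to be installed on the nginx host, and the page is assumed to be reachable at http://127.0.0.1:8080/stub_status, the URI passed to the exporter.

```shell
# stub_status_active: read nginx stub_status output on stdin and print the
# number that follows "Active connections:".
stub_status_active() {
  awk '/^Active connections:/ { print $3 }'
}

# Usage on the nginx host:
#   curl -s http://127.0.0.1:8080/stub_status | stub_status_active
```

An empty result means the location is not enabled or nginx was not reloaded after the change.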
Install:
mkdir /usr/local/share/applications/nginx-prometheus-exporter_1.4.2
tar zxvf nginx-prometheus-exporter_1.4.2_linux_amd64.tar.gz -C /usr/local/share/applications/nginx-prometheus-exporter_1.4.2
cd /usr/local/share/applications/nginx-prometheus-exporter_1.4.2/
Start:
./nginx-prometheus-exporter --web.listen-address=:39113 --nginx.scrape-uri=http://127.0.0.1:8080/stub_status
Verify:
http://192.168.83.137:39113/metrics // the page should load and show the exporter's metrics
Create the service:
vim /usr/lib/systemd/system/nginx-prometheus-exporter.service
[Unit]
Description=Prometheus exporter for nginx metrics, written in Go with pluggable metric collectors.
Documentation=https://github.com/nginx/nginx-prometheus-exporter/
After=network.target
[Service]
Type=simple
WorkingDirectory=/usr/local/share/applications/nginx-prometheus-exporter_1.4.2/
ExecStart=/usr/local/share/applications/nginx-prometheus-exporter_1.4.2/nginx-prometheus-exporter --web.listen-address=:39113 --nginx.scrape-uri=http://127.0.0.1:8080/stub_status
Restart=on-failure
[Install]
WantedBy=multi-user.target
Verify the service:
systemctl daemon-reload
systemctl start nginx-prometheus-exporter
systemctl enable nginx-prometheus-exporter
systemctl status nginx-prometheus-exporter
netstat -lantup | grep 39113
④ process-exporter:
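Install:
tar zxvf process-exporter-0.8.7.linux-amd64.tar.gz -C /usr/local/share/applications/
cd /usr/local/share/applications/process-exporter-0.8.7.linux-amd64/
Create the configuration file:
vim process-exporter.yaml
process_names:
  # Fixed group names: the cmdline patterns below contain no named capture
  # groups, so "{{.Matches}}" would render as an empty map for both entries.
  - name: "mysql"
    cmdline:
      - 'mysql'
  - name: "nginx"
    cmdline:
      - 'nginx'
  # Everything else, grouped by command name
  - name: "{{.Comm}}"
    cmdline:
      - '.+'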
Start:
./process-exporter -config.path=process-exporter.yaml -web.listen-address=":39256"
Verify:
http://192.168.83.137:39256/metrics // the page should load and show the exporter's metrics
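To see which process groups the configuration actually produced, the group names can be pulled out of the metrics page. This is a hedged sketch: list_process_groups is a hypothetical helper name, curl is assumed to be installed, and the pattern assumes groupname is the only label on namedprocess_namegroup_num_procs.

```shell
# list_process_groups: read process-exporter metrics on stdin and print
# "groupname count" for every namedprocess_namegroup_num_procs sample.
list_process_groups() {
  sed -n 's/^namedprocess_namegroup_num_procs{groupname="\([^"]*\)"} \(.*\)$/\1 \2/p'
}

# Usage:
#   curl -s http://192.168.83.137:39256/metrics | list_process_groups
```

If the expected groups for mysql and nginx are missing from the output, the cmdline patterns in the configuration file did not match any running process.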
Create the service:
vim /usr/lib/systemd/system/process-exporter.service
[Unit]
Description=Prometheus exporter that mines /proc to report on selected processes.
Documentation=https://github.com/ncabatoff/process-exporter
After=network.target
[Service]
Type=simple
WorkingDirectory=/usr/local/share/applications/process-exporter-0.8.7.linux-amd64/
ExecStart=/usr/local/share/applications/process-exporter-0.8.7.linux-amd64/process-exporter -config.path=/usr/local/share/applications/process-exporter-0.8.7.linux-amd64/process-exporter.yaml -web.listen-address=":39256"
Restart=on-failure
[Install]
WantedBy=multi-user.target
Verify the service:
systemctl daemon-reload
systemctl start process-exporter
systemctl enable process-exporter
systemctl status process-exporter
netstat -lantup | grep 39256
⑤ prometheus-webhook-dingtalk:
Install:
tar zxvf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz -C /usr/local/share/applications/
cd /usr/local/share/applications/prometheus-webhook-dingtalk-2.1.0.linux-amd64/
Edit the alert template:
cp contrib/templates/legacy/template.tmpl contrib/templates/legacy/template.tmpl_default
vim contrib/templates/legacy/template.tmpl
{{ define "dingtalk.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{
  "msgtype": "text",
  "text": {
    "content": "Alert notification: FIRING\n\nStatus: FIRING\nAlert name: {{ .CommonLabels.alertname }}\nInstance: {{ .CommonLabels.instance }}\nStarted at: {{ (index .Alerts.Firing 0).StartsAt.Format `2006-01-02 15:04:05` }}\nSummary: {{ .CommonAnnotations.summary }}\nDetails: {{ .CommonAnnotations.description }}",
    "at": {
      "isAtAll": false
    }
  }
}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{
  "msgtype": "text",
  "text": {
    "content": "Recovery notification: RESOLVED\n\nStatus: RESOLVED\nAlert name: {{ .CommonLabels.alertname }}\nInstance: {{ .CommonLabels.instance }}\nResolved at: {{ (index .Alerts.Resolved 0).EndsAt.Format `2006-01-02 15:04:05` }}\nSummary: {{ .CommonAnnotations.summary }}\nDetails: {{ .CommonAnnotations.description }}",
    "at": {
      "isAtAll": false
    }
  }
}
{{- end }}
{{ end }}
Edit the configuration file:
cp config.example.yml config.yml
vim config.yml
## Request timeout
# timeout: 5s
## Uncomment following line in order to write template from scratch (be careful!)
#no_builtin_template: true
## Customizable templates path
templates:
  - contrib/templates/legacy/template.tmpl
## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
#default_message:
#  title: '{{ template "legacy.title" . }}'
#  text: '{{ template "legacy.content" . }}'
## Targets, previously was known as "profiles"
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxx   # replace with the real access_token
    # secret for signature
    secret: xxx   # replace with the real secret
    # Use the custom template defined in template.tmpl
    message:
      title: '{{ template "dingtalk.default.message" . }}'
      text: '{{ template "dingtalk.default.message" . }}'
  #webhook2:
  #  url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
  #webhook_legacy:
  #  url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
  #webhook_mention_all:
  #  url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
  #  mention:
  #    all: true
  #webhook_mention_users:
  #  url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
  #  mention:
  #    mobiles: ['156xxxx8827', '189xxxx8325']
Start:
./prometheus-webhook-dingtalk --web.listen-address=":38060" --config.file=config.yml
Create the service:
vim /usr/lib/systemd/system/prometheus-webhook-dingtalk.service
[Unit]
Description=Generating DingTalk notification from Prometheus AlertManager WebHooks.
Documentation=https://github.com/timonwong/prometheus-webhook-dingtalk
After=network.target
[Service]
Type=simple
WorkingDirectory=/usr/local/share/applications/prometheus-webhook-dingtalk-2.1.0.linux-amd64/
ExecStart=/usr/local/share/applications/prometheus-webhook-dingtalk-2.1.0.linux-amd64/prometheus-webhook-dingtalk --web.listen-address=":38060" --config.file=/usr/local/share/applications/prometheus-webhook-dingtalk-2.1.0.linux-amd64/config.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
Verify the service:
systemctl daemon-reload
systemctl start prometheus-webhook-dingtalk
systemctl enable prometheus-webhook-dingtalk
systemctl status prometheus-webhook-dingtalk
netstat -lantup | grep 38060
⑥ Alertmanager:
Install:
tar zxvf alertmanager-0.28.1.linux-amd64.tar.gz -C /usr/local/share/applications/
cd /usr/local/share/applications/alertmanager-0.28.1.linux-amd64/
Edit the configuration file:
cp alertmanager.yml alertmanager.yml_default
vim alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'team-B-receiver'
  routes:
    - match_re:
        alertname: "(InstanceDown|MySQLDown|NginxDown|ProcessDown)"
      receiver: 'team-A-receiver'
    - receiver: 'team-B-receiver'
receivers:
  - name: 'team-A-receiver'
    webhook_configs:
      - url: 'http://192.168.83.137:38060/dingtalk/webhook1/send'   # prometheus-webhook-dingtalk endpoint for target webhook1
        send_resolved: true
    email_configs:
      - to: 'xxx@163.com'   # replace with the real recipient address
        from: 'xxx@163.com'   # replace with the real sender address
        smarthost: 'smtp.163.com:465'   # implicit TLS (SSL)
        # smarthost: 'smtp.163.com:587'   # use this instead for STARTTLS
        auth_username: 'xxx@163.com'
        auth_password: 'xxxxx'   # replace with the real mailbox password
        require_tls: false   # port 465 is already TLS-wrapped, so STARTTLS must not be required
        send_resolved: true
  - name: 'team-B-receiver'
    #webhook_configs:
    #  - url: 'https://oapi.dingtalk.com/robot/send?access_token=your-dingtalk-token-for-team-B'   # replace with your DingTalk robot webhook URL
    #    send_resolved: true
    email_configs:
      - to: 'xxx@163.com'   # replace with the real recipient address
        from: 'xxx@163.com'   # replace with the real sender address
        smarthost: 'smtp.163.com:465'   # implicit TLS (SSL)
        # smarthost: 'smtp.163.com:587'   # use this instead for STARTTLS
        auth_username: 'xxx@163.com'
        auth_password: 'xxxxx'   # replace with the real mailbox password
        require_tls: false   # port 465 is already TLS-wrapped, so STARTTLS must not be required
        send_resolved: true
Start:
./alertmanager --config.file="alertmanager.yml" --web.listen-address=:39093
Create the service:
vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=The Alertmanager handles alerts sent by client applications such as the Prometheus server.
Documentation=https://github.com/prometheus/alertmanager
After=network.target
[Service]
Type=simple
WorkingDirectory=/usr/local/share/applications/alertmanager-0.28.1.linux-amd64/
ExecStart=/usr/local/share/applications/alertmanager-0.28.1.linux-amd64/alertmanager --config.file="/usr/local/share/applications/alertmanager-0.28.1.linux-amd64/alertmanager.yml" --web.listen-address=:39093
Restart=on-failure
[Install]
WantedBy=multi-user.target
Verify the service:
systemctl daemon-reload
systemctl start alertmanager
systemctl enable alertmanager
systemctl status alertmanager
netstat -lantup | grep 39093
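With Alertmanager listening, the routing and both receivers can be exercised before Prometheus sends anything, by posting a hand-made alert to Alertmanager's v2 API. The sketch below is illustrative only: test_alert_json is a hypothetical helper name, curl is assumed to be installed, and the label values are examples chosen so that the alert matches the team-A-receiver route.

```shell
# test_alert_json ALERTNAME INSTANCE: print a one-element JSON array in the
# shape accepted by Alertmanager's POST /api/v2/alerts endpoint.
test_alert_json() {
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"warning"},"annotations":{"summary":"manual test alert"}}]\n' "$1" "$2"
}

# Usage:
#   test_alert_json NginxDown 192.168.83.137:39113 |
#     curl -s -X POST -H 'Content-Type: application/json' -d @- \
#       http://192.168.83.137:39093/api/v2/alerts
```

Because no end time is set, the alert resolves on its own once resolve_timeout has passed, which also exercises the send_resolved notifications.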
⑦ Prometheus:
Install:
tar zxvf prometheus-3.2.1.linux-amd64.tar.gz -C /usr/local/share/applications/
cd /usr/local/share/applications/prometheus-3.2.1.linux-amd64/
Edit the configuration file:
cp prometheus.yml prometheus.yml_default
vim prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.83.137:39093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/usr/local/share/alertmanager-0.28.1.linux-amd64/AlterRules/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["192.168.83.137:39090"]
  - job_name: 'node'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.83.137:39100','192.168.83.138:39100']
        labels:
          group: 'node'
  - job_name: 'mysql'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.83.137:39104']
        labels:
          group: 'mysql'
  - job_name: 'nginx'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.83.137:39113']
        labels:
          group: 'nginx'
  - job_name: 'process'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.83.137:39256']
        labels:
          group: 'process'
Start:
./prometheus --config.file="prometheus.yml" --web.listen-address=0.0.0.0:39090 --storage.tsdb.path="/data/prometheus"
Verify:
http://192.168.83.137:39090/targets // the page should load and list every scrape target as UP
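The targets page has a JSON counterpart at /api/v1/targets, which is handy for scripted checks. The following is a rough sketch rather than a robust parser: count_down_targets is a hypothetical helper name, curl is assumed to be installed, and the count is a plain text match on the health field, so a JSON-aware tool would be the safer choice for anything beyond a quick look.

```shell
# count_down_targets: read the JSON from Prometheus' /api/v1/targets on
# stdin and print how many targets report "health":"down".
count_down_targets() {
  grep -o '"health":"down"' | wc -l | tr -d ' '
}

# Usage:
#   curl -s http://192.168.83.137:39090/api/v1/targets | count_down_targets
```

A result of 0 means every configured exporter answered its most recent scrape.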
Create the service:
vim /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus, a Cloud Native Computing Foundation project, is a systems and service monitoring system.
Documentation=https://github.com/prometheus/prometheus/
After=network.target
[Service]
Type=simple
WorkingDirectory=/usr/local/share/applications/prometheus-3.2.1.linux-amd64/
ExecStart=/usr/local/share/applications/prometheus-3.2.1.linux-amd64/prometheus --config.file="/usr/local/share/applications/prometheus-3.2.1.linux-amd64/prometheus.yml" --web.listen-address=0.0.0.0:39090 --storage.tsdb.path="/data/prometheus"
Restart=on-failure
[Install]
WantedBy=multi-user.target
Verify the service:
systemctl daemon-reload
systemctl start prometheus
systemctl enable prometheus
systemctl status prometheus
netstat -lantup | grep 39090
⑧ Grafana:
Install:
tar zxvf grafana-enterprise-11.5.2.linux-amd64.tar.gz -C /usr/local/share/applications/
cd /usr/local/share/applications/grafana-v11.5.2/
Edit the configuration file:
cp conf/defaults.ini conf/defaults.ini_default
vim conf/defaults.ini
Change http_port = 3000 to http_port = 33000
Start:
./bin/grafana-server
Verify:
http://192.168.83.137:33000/ // the page should load; log in with the default credentials admin/admin
Create the service:
vim /usr/lib/systemd/system/grafana-server.service
[Unit]
Description=Dashboard anything. Observe everything.
Documentation=https://grafana.com/grafana/download
After=network.target
[Service]
Type=simple
WorkingDirectory=/usr/local/share/applications/grafana-v11.5.2/
ExecStart=/usr/local/share/applications/grafana-v11.5.2/bin/grafana-server
Restart=on-failure
[Install]
WantedBy=multi-user.target
Verify the service:
systemctl daemon-reload
systemctl start grafana-server
systemctl enable grafana-server
systemctl status grafana-server
netstat -lantup | grep 33000
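6. Monitoring
Add the Prometheus data source in Grafana:
Go to Home > Connections > Add new connection, select Prometheus, and click Add new data source. Enter the Prometheus server URL, http://192.168.83.137:39090/, then click Save & test.
Once added, the data source is listed under Home > Connections > Data sources.
① Monitor Linux hosts with node_exporter and visualize the metrics in Grafana
Go to Home > Dashboards, click New, and select Import dashboard. Enter the dashboard ID (8919; dashboards can be searched at https://grafana.com/grafana/dashboards/), select the Prometheus data source added above, and click Import. Grafana then opens the dashboard.
To return to it later, go to Home > Dashboards and open "Node Exporter Dashboard 20240520 TenSunS自动同步版".
② Monitor the MySQL database with mysqld_exporter and visualize the metrics in Grafana
Go to Home > Dashboards, click New, and select Import dashboard. Enter the dashboard ID (7362; dashboards can be searched at https://grafana.com/grafana/dashboards/), select the Prometheus data source added above, and click Import. Grafana then opens the dashboard.
To return to it later, go to Home > Dashboards and open "MySQL Overview".
③ Monitor the nginx middleware with nginx-prometheus-exporter and visualize the metrics in Grafana
Go to Home > Dashboards, click New, and select Import dashboard. Enter the dashboard ID (10393; dashboards can be searched at https://grafana.com/grafana/dashboards/), select the Prometheus data source added above, and click Import. Grafana then opens the dashboard.
To return to it later, go to Home > Dashboards and open "Nginx".
④ Monitor system processes with process-exporter and visualize the metrics in Grafana
Go to Home > Dashboards, click New, and select Import dashboard. Enter the dashboard ID (13882; dashboards can be searched at https://grafana.com/grafana/dashboards/), select the Prometheus data source added above, and click Import. Grafana then opens the dashboard.
To return to it later, go to Home > Dashboards and open "process exporter dashboard with treemap".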
7. Alerting
① node_exporter alert rules
mkdir -p /usr/local/share/alertmanager-0.28.1.linux-amd64/AlterRules/
cd /usr/local/share/alertmanager-0.28.1.linux-amd64/AlterRules/
vim node_rules.yml
groups:
  - name: instance-health
    rules:
      - alert: InstanceDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been unreachable for more than 2 minutes."
  - name: cpu-usage
    rules:
      - alert: HighCpuUsage
        # Overall CPU usage per instance: 100% minus the average idle share across all cores.
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"
  - name: memory-usage
    rules:
      - alert: HighMemoryUsage
        expr: ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current value: {{ $value }}%)"
  - name: disk-usage
    rules:
      - alert: DiskWillFillIn4Hours
        # Extrapolate the free space of each filesystem 4 hours ahead from the last hour of samples.
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[1h], 4*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.device }} will fill in less than 4 hours"
          description: "Disk partition {{ $labels.device }} is expected to fill up within 4 hours."
  - name: filesystem-usage
    rules:
      - alert: HighFsUsage
        expr: (node_filesystem_size_bytes{mountpoint!~"/boot|/var/log"} - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High filesystem usage on {{ $labels.mountpoint }}"
          description: "Filesystem on {{ $labels.mountpoint }} is above 90% (current value: {{ $value }}%)"
② mysqld_exporter alert rules
cd /usr/local/share/alertmanager-0.28.1.linux-amd64/AlterRules/
vim mysqld_rules.yml
groups:
  - name: mysql.rules
    rules:
      - alert: MySQLDown
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL instance is down"
          description: "MySQL instance {{ $labels.instance }} has been down for more than 1 minute."
      - alert: HighConnectionUsage
        # threads_connected is a gauge, so compare its value directly instead of taking rate()
        expr: mysql_global_status_threads_connected > 800 # adjust the threshold to your environment
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of connections on MySQL instance"
          description: "The number of connections to MySQL instance {{ $labels.instance }} has exceeded the threshold of 800 connections."
      - alert: SlowQueriesDetected
        expr: increase(mysql_global_status_slow_queries[5m]) > 10 # adjust the threshold to your environment
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow queries detected on MySQL instance"
          description: "More than 10 slow queries have been detected on MySQL instance {{ $labels.instance }} in the last 5 minutes."
      - alert: InnoDBBufferPoolHitRateLow
        expr: (1 - (rate(mysql_global_status_innodb_buffer_pool_reads[5m]) / rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]))) * 100 < 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "InnoDB buffer pool hit rate low"
          description: "The InnoDB buffer pool hit rate on MySQL instance {{ $labels.instance }} is below 90%."
③ nginx-prometheus-exporter alert rules
cd /usr/local/share/alertmanager-0.28.1.linux-amd64/AlterRules/
vim nginx_rules.yml
groups:
  - name: nginx.rules
    rules:
      # Nginx Down
      - alert: NginxDown
        expr: nginx_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Nginx instance {{ $labels.instance }} is down"
          description: "Nginx instance {{ $labels.instance }} has been down for more than 1 minute."
      # High Request Rate
      - alert: HighRequestRate
        expr: rate(nginx_http_requests_total[5m]) > 10000 # requests per second; adjust the threshold to your environment
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request rate on Nginx instance {{ $labels.instance }}"
          description: "The request rate on Nginx instance {{ $labels.instance }} has exceeded the threshold of 10,000 requests per second."
      # High Error Rate
      # Requires per-status request counters. The stub_status metrics from
      # nginx-prometheus-exporter carry no status label, so this rule stays
      # silent unless another exporter provides them.
      - alert: HighErrorRate
        expr: sum by (instance) (rate(nginx_http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(nginx_http_requests_total[5m])) * 100 > 5 # 5% error rate
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on Nginx instance {{ $labels.instance }}"
          description: "The error rate on Nginx instance {{ $labels.instance }} has exceeded 5%."
      # High Response Time
      # Requires a request-duration histogram, which stub_status does not provide.
      - alert: HighResponseTime
        expr: histogram_quantile(0.99, rate(nginx_http_request_duration_seconds_bucket[5m])) > 1 # 99th percentile response time over 1 second
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time on Nginx instance {{ $labels.instance }}"
          description: "The 99th percentile response time on Nginx instance {{ $labels.instance }} has exceeded 1 second."
④ process-exporter alert rules
cd /usr/local/share/alertmanager-0.28.1.linux-amd64/AlterRules/
vim process.yml
groups:
  - name: process.rules
    rules:
      # process-exporter publishes its metrics as namedprocess_namegroup_*
      # with a groupname label, so the rules below use those names.
      # Process Down
      - alert: ProcessDown
        expr: sum by (groupname) (namedprocess_namegroup_num_procs) == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Process {{ $labels.groupname }} is down"
          description: "The process named '{{ $labels.groupname }}' has been down for more than 1 minute."
      # High CPU Usage
      - alert: HighCPUUsage
        expr: sum by (groupname) (rate(namedprocess_namegroup_cpu_seconds_total[5m])) > 1 # more than one CPU core in use
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage by process {{ $labels.groupname }}"
          description: "The process '{{ $labels.groupname }}' is using more than 1 CPU core."
      # High Memory Usage
      - alert: HighMemoryUsage
        expr: sum by (groupname) (namedprocess_namegroup_memory_bytes{memtype="resident"}) / 1024 / 1024 > 500 # resident memory above 500 MB
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage by process {{ $labels.groupname }}"
          description: "The process '{{ $labels.groupname }}' is using more than 500MB of memory."
      # Too Many Processes
      - alert: TooManyProcesses
        expr: sum by (job) (namedprocess_namegroup_states{state="Running"}) > 100 # adjust the threshold to your environment
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Too many processes running on {{ $labels.job }}"
          description: "There are more than 100 running processes on {{ $labels.job }}."
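The Alertmanager route only sends InstanceDown, MySQLDown, NginxDown, and ProcessDown to team-A-receiver, so it is useful to list the alert names the rule files actually define and compare them with that pattern. A minimal sketch, assuming the rule files use the "- alert: Name" layout shown above; list_alert_names is a hypothetical helper name.

```shell
# list_alert_names [FILE...]: print the name of every alerting rule found in
# the given rule files (or on stdin when no file is given).
list_alert_names() {
  sed -n 's/^[[:space:]]*-[[:space:]]*alert:[[:space:]]*//p' "$@"
}

# Usage:
#   list_alert_names /usr/local/share/alertmanager-0.28.1.linux-amd64/AlterRules/*.yml
```

For a syntax check of the rules themselves, the promtool binary shipped in the Prometheus tarball can be pointed at the same files with promtool check rules.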
Verification
Simulate an nginx outage
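One way to run this test is sketched below, under stated assumptions: nginx is assumed to run as the systemd unit nginx on 192.168.83.137 (the source does not show how nginx was installed), curl is assumed to be installed, and query_has_result is a hypothetical helper that inspects the JSON returned by Prometheus' /api/v1/query. After nginx is stopped, the exporter keeps running and reports nginx_up as 0, and the NginxDown rule fires once its 1m "for" window has elapsed.

```shell
# query_has_result: read a Prometheus /api/v1/query response on stdin and
# succeed only if the query succeeded and returned at least one series.
query_has_result() {
  body=$(cat)
  case $body in
    *'"status":"success"'*) ;;
    *) return 1 ;;
  esac
  case $body in
    *'"result":[]'*) return 1 ;;
  esac
  return 0
}

# Usage (after `systemctl stop nginx` and a wait of about two minutes):
#   curl -s --get http://192.168.83.137:39090/api/v1/query \
#     --data-urlencode 'query=ALERTS{alertname="NginxDown",alertstate="firing"}' |
#     query_has_result && echo "NginxDown is firing"
# Restore the service afterwards with: systemctl start nginx
```

Once the alert is firing, a notification should arrive through the team-A-receiver channels, followed by a resolved notification after nginx is started again.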