Monitoring with Prometheus + Grafana + Alertmanager

1. Introduction

Use Prometheus, Grafana, and Alertmanager to monitor commonly used components and send alerts.

2. System environment

① Prometheus:                  192.168.83.137 39090
② Grafana:                     192.168.83.137 33000
③ Alertmanager:                192.168.83.137 39093
④ node_exporter:               192.168.83.137 39100, 192.168.83.138 39100
⑤ mysqld_exporter:             192.168.83.137 39104
⑥ nginx-prometheus-exporter:   192.168.83.137 39113, 192.168.83.138 39113
⑦ process-exporter:            192.168.83.137 39256
⑧ prometheus-webhook-dingtalk: 192.168.83.137 38060

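3. Requirements

① Monitor the Linux hosts with node_exporter and visualize the metrics in Grafana.
② Monitor the MySQL database with mysqld_exporter and visualize the metrics in Grafana.
③ Monitor the nginx server with nginx-prometheus-exporter and visualize the metrics in Grafana.
④ Monitor system processes with process-exporter and visualize the metrics in Grafana.
⑤ Configure alerts on the monitoring data.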

4. Downloads:

① Prometheus:
    https://github.com/prometheus/prometheus/releases/download/v3.4.0/prometheus-3.4.0.linux-amd64.tar.gz
② Grafana:
    https://dl.grafana.com/enterprise/release/grafana-enterprise-12.0.0.linux-amd64.tar.gz
③ Alertmanager:
    https://github.com/prometheus/alertmanager/releases/download/v0.28.1/alertmanager-0.28.1.linux-amd64.tar.gz
④ node_exporter:
    https://github.com/prometheus/node_exporter/releases/download/v1.9.1/node_exporter-1.9.1.linux-amd64.tar.gz
⑤ mysqld_exporter:
    https://github.com/prometheus/mysqld_exporter/releases/download/v0.17.2/mysqld_exporter-0.17.2.linux-amd64.tar.gz
⑥ process-exporter:
    https://github.com/ncabatoff/process-exporter/releases/download/v0.8.7/process-exporter-0.8.7.linux-amd64.tar.gz
⑦ prometheus-webhook-dingtalk:
    https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
⑧ nginx-prometheus-exporter:
    https://github.com/nginx/nginx-prometheus-exporter/releases/download/v1.4.2/nginx-prometheus-exporter_1.4.2_linux_amd64.tar.gz

5. Installation and configuration:

① node_exporter:

Install:

tar zxvf node_exporter-1.9.1.linux-amd64.tar.gz -C /usr/local/share/applications/
cd /usr/local/share/applications/node_exporter-1.9.1.linux-amd64/

Start:

./node_exporter --web.listen-address=:39100

Verify:

http://192.168.83.137:39100/metrics   // the page loads and shows the exported metrics
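
The browser check can also be scripted. The sketch below defines a hypothetical helper, count_samples (it is not part of node_exporter), that counts the sample lines in a metrics page read from standard input. The usage line assumes curl is installed and the exporter is already running.

```shell
#!/usr/bin/env bash
# count_samples: hypothetical helper, not part of any exporter.
# Reads Prometheus text-format metrics on stdin and prints the number of
# sample lines, skipping comment lines (# HELP / # TYPE) and blank lines.
count_samples() {
  # grep -c exits non-zero when the count is 0, so mask the exit status.
  grep -Evc '^(#|[[:space:]]*$)' || true
}

# Usage against the exporter started above (needs curl and a running exporter):
#   curl -fsS http://192.168.83.137:39100/metrics | count_samples
```

A count of zero, or a curl error, suggests the exporter is not serving metrics on that address.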

Create the service:

vim /usr/lib/systemd/system/node_exporter.service

[Unit]
Description=Prometheus exporter for node metrics, written in Go with pluggable metric collectors.
Documentation=https://github.com/prometheus/node_exporter/
After=network.target
 
[Service]
Type=simple
WorkingDirectory=/usr/local/share/applications/node_exporter-1.9.1.linux-amd64/
ExecStart=/usr/local/share/applications/node_exporter-1.9.1.linux-amd64/node_exporter --web.listen-address=:39100
Restart=on-failure
 
[Install]
WantedBy=multi-user.target

Verify the service:

systemctl daemon-reload
systemctl start node_exporter
systemctl enable node_exporter
systemctl status node_exporter
netstat -lantup | grep 39100

② mysqld_exporter:

Prerequisites:

Install MySQL (not covered here).
Grant privileges in MySQL:
    mysql> CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'Prometheus';
    mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
    mysql> flush privileges;

Install:

tar zxvf mysqld_exporter-0.17.2.linux-amd64.tar.gz -C /usr/local/share/applications/
cd /usr/local/share/applications/mysqld_exporter-0.17.2.linux-amd64
vim mysqld_exporter.cnf

[client]
user=exporter
password=Prometheus
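
Because this file stores a password in plain text, it may be worth creating it with owner-only permissions. The sketch below is one way to do that without an editor; write_exporter_cnf is a hypothetical helper name, and the user and password are the ones created in the MySQL step above.

```shell
#!/usr/bin/env bash
# write_exporter_cnf: hypothetical helper that writes a MySQL client option
# file for mysqld_exporter and restricts it to the owner (mode 600).
# Arguments: <path> <user> <password>
write_exporter_cnf() {
  local path=$1 user=$2 password=$3
  (
    umask 077   # a newly created file gets no group/other permissions
    printf '[client]\nuser=%s\npassword=%s\n' "$user" "$password" > "$path"
  )
  chmod 600 "$path"   # also tighten a file that already existed
}

# Usage from the mysqld_exporter install directory:
#   write_exporter_cnf mysqld_exporter.cnf exporter Prometheus
```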

Start:

./mysqld_exporter --config.my-cnf=mysqld_exporter.cnf --web.listen-address=:39104 --mysqld.address="localhost:13306"

Verify:

http://192.168.83.137:39104/metrics         // the page loads and shows the exported metrics

Create the service:

vim /usr/lib/systemd/system/mysqld_exporter.service

[Unit]
Description=Prometheus exporter for mysql metrics, written in Go with pluggable metric collectors.
Documentation=https://github.com/prometheus/mysqld_exporter/
After=network.target
 
[Service]
Type=simple
WorkingDirectory=/usr/local/share/applications/mysqld_exporter-0.17.2.linux-amd64
ExecStart=/usr/local/share/applications/mysqld_exporter-0.17.2.linux-amd64/mysqld_exporter --config.my-cnf=/usr/local/share/applications/mysqld_exporter-0.17.2.linux-amd64/mysqld_exporter.cnf --web.listen-address=:39104 --mysqld.address="localhost:13306"
Restart=on-failure
 
[Install]
WantedBy=multi-user.target

Verify the service:

systemctl daemon-reload
systemctl start mysqld_exporter
systemctl enable mysqld_exporter
systemctl status mysqld_exporter
netstat -lantup | grep 39104

③ nginx-prometheus-exporter:

Prerequisites:

Install nginx (not covered here).
Enable the stub_status module:

    location /stub_status {
        stub_status;
    }
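
After reloading nginx, it can help to confirm that the status page answers before pointing the exporter at it. The sketch below parses the plain-text stub_status output; active_connections is a hypothetical helper name, and the URL in the usage line assumes nginx serves this location on 127.0.0.1:8080, the address the exporter is given later in this section.

```shell
#!/usr/bin/env bash
# active_connections: hypothetical helper that reads nginx stub_status output
# on stdin and prints the number after "Active connections:".
active_connections() {
  awk '/^Active connections:/ { print $3 }'
}

# Usage on the nginx host (needs curl; the address is an assumption, see above):
#   curl -fsS http://127.0.0.1:8080/stub_status | active_connections
```

An empty result means the response was not stub_status output, for example because the location block sits in a different server block.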

Install:

mkdir /usr/local/share/applications/nginx-prometheus-exporter_1.4.2
tar zxvf nginx-prometheus-exporter_1.4.2_linux_amd64.tar.gz -C /usr/local/share/applications/nginx-prometheus-exporter_1.4.2
cd /usr/local/share/applications/nginx-prometheus-exporter_1.4.2/

Start:

./nginx-prometheus-exporter --web.listen-address=:39113 --nginx.scrape-uri=http://127.0.0.1:8080/stub_status

Verify:

http://192.168.83.137:39113/metrics         // the page loads and shows the exported metrics

Create the service:

vim /usr/lib/systemd/system/nginx-prometheus-exporter.service

[Unit]
Description=Prometheus exporter for nginx metrics, written in Go with pluggable metric collectors.
Documentation=https://github.com/nginx/nginx-prometheus-exporter/
After=network.target
 
[Service]
Type=simple
WorkingDirectory=/usr/local/share/applications/nginx-prometheus-exporter_1.4.2/
ExecStart=/usr/local/share/applications/nginx-prometheus-exporter_1.4.2/nginx-prometheus-exporter --web.listen-address=:39113 --nginx.scrape-uri=http://127.0.0.1:8080/stub_status
Restart=on-failure
 
[Install]
WantedBy=multi-user.target

Verify the service:

systemctl daemon-reload
systemctl start nginx-prometheus-exporter
systemctl enable nginx-prometheus-exporter
systemctl status nginx-prometheus-exporter
netstat -lantup | grep 39113

④ process-exporter:

Install:

tar zxvf process-exporter-0.8.7.linux-amd64.tar.gz -C /usr/local/share/applications/
cd /usr/local/share/applications/process-exporter-0.8.7.linux-amd64/

Create the configuration file:

vim process-exporter.yaml

process_names:
  - name: "{{.Matches}}"  #mysql
    cmdline:
    - 'mysql'

  - name: "{{.Matches}}"  #nginx
    cmdline:
    - 'nginx'

  - name: "{{.Comm}}"     #other
    cmdline:
    - '.+'

Start:

./process-exporter -config.path=process-exporter.yaml -web.listen-address=":39256"

Verify:

http://192.168.83.137:39256/metrics         // the page loads and shows the exported metrics

Create the service:

vim /usr/lib/systemd/system/process-exporter.service

[Unit]
Description=Prometheus exporter that mines /proc to report on selected processes.
Documentation=https://github.com/ncabatoff/process-exporter
After=network.target
 
[Service]
Type=simple
WorkingDirectory=/usr/local/share/applications/process-exporter-0.8.7.linux-amd64/
ExecStart=/usr/local/share/applications/process-exporter-0.8.7.linux-amd64/process-exporter -config.path=/usr/local/share/applications/process-exporter-0.8.7.linux-amd64/process-exporter.yaml -web.listen-address=":39256"
Restart=on-failure
 
[Install]
WantedBy=multi-user.target

Verify the service:

systemctl daemon-reload
systemctl start process-exporter
systemctl enable process-exporter
systemctl status process-exporter
netstat -lantup | grep 39256

⑤ prometheus-webhook-dingtalk:

Install:

tar zxvf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz -C /usr/local/share/applications/
cd /usr/local/share/applications/prometheus-webhook-dingtalk-2.1.0.linux-amd64/

Edit the alert template file:

cp contrib/templates/legacy/template.tmpl contrib/templates/legacy/template.tmpl_default
vim contrib/templates/legacy/template.tmpl

{{ define "dingtalk.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{
  "msgtype": "text",
  "text": {
    "content": "Alert notification: FIRING\n\nStatus: FIRING\nAlert name: {{ .CommonLabels.alertname }}\nInstance: {{ .CommonLabels.instance }}\nStarted at: {{ (index .Alerts.Firing 0).StartsAt.Format `2006-01-02 15:04:05` }}\nSummary: {{ .CommonAnnotations.summary }}\nDetails: {{ .CommonAnnotations.description }}",
    "at": {
      "isAtAll": false
    }
  }
}
{{- end }}

{{- if gt (len .Alerts.Resolved) 0 -}}
{
  "msgtype": "text",
  "text": {
    "content": "Recovery notification: RESOLVED\n\nStatus: RESOLVED\nAlert name: {{ .CommonLabels.alertname }}\nInstance: {{ .CommonLabels.instance }}\nResolved at: {{ (index .Alerts.Resolved 0).EndsAt.Format `2006-01-02 15:04:05` }}\nSummary: {{ .CommonAnnotations.summary }}\nDetails: {{ .CommonAnnotations.description }}",
    "at": {
      "isAtAll": false
    }
  }
}
{{- end }}
{{ end }}

Edit the configuration file:

cp config.example.yml config.yml
vim config.yml

## Request timeout
# timeout: 5s

## Uncomment following line in order to write template from scratch (be careful!)
#no_builtin_template: true

## Customizable templates path
templates:
  - contrib/templates/legacy/template.tmpl

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
#default_message:
#  title: '{{ template "legacy.title" . }}'
#  text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxx      # fill in the real access_token
    # secret for signature
    secret: xxx    # fill in the real secret
  #webhook2:
  #  url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
  #webhook_legacy:
  #  url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
  #  # Customize template content
    message:
      # Use legacy template
      title: '{{ template "dingtalk.default.message" . }}'
      text: '{{ template "dingtalk.default.message" . }}'
  #webhook_mention_all:
  #  url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
  #  mention:
  #    all: true
  #webhook_mention_users:
  #  url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
  #  mention:
  #    mobiles: ['156xxxx8827', '189xxxx8325']

Start:

./prometheus-webhook-dingtalk --web.listen-address=":38060" --config.file=config.yml

Create the service:

vim /usr/lib/systemd/system/prometheus-webhook-dingtalk.service

[Unit]
Description=Generating DingTalk notification from Prometheus AlertManager WebHooks.
Documentation=https://github.com/timonwong/prometheus-webhook-dingtalk
After=network.target
 
[Service]
Type=simple
WorkingDirectory=/usr/local/share/applications/prometheus-webhook-dingtalk-2.1.0.linux-amd64/
ExecStart=/usr/local/share/applications/prometheus-webhook-dingtalk-2.1.0.linux-amd64/prometheus-webhook-dingtalk --web.listen-address=":38060" --config.file=/usr/local/share/applications/prometheus-webhook-dingtalk-2.1.0.linux-amd64/config.yml
Restart=on-failure
 
[Install]
WantedBy=multi-user.target

Verify the service:

systemctl daemon-reload
systemctl start prometheus-webhook-dingtalk
systemctl enable prometheus-webhook-dingtalk
systemctl status prometheus-webhook-dingtalk
netstat -lantup | grep 38060
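
To exercise the relay without waiting for a real alert, a hand-built payload can be posted to it. The sketch below prints a minimal body shaped like an Alertmanager webhook notification (payload version 4); sample_alert_payload is a hypothetical helper, and the receiver name and annotation texts are placeholders chosen for this example. The usage line assumes curl is installed and is run on the host where the relay listens.

```shell
#!/usr/bin/env bash
# sample_alert_payload: hypothetical helper that prints a minimal
# Alertmanager-style webhook body (version 4) containing one firing alert.
# Arguments: <alertname> <instance>
sample_alert_payload() {
  local alertname=$1 instance=$2 now
  now=$(date -u +%Y-%m-%dT%H:%M:%SZ)   # RFC 3339 timestamp in UTC
  cat <<EOF
{
  "version": "4",
  "status": "firing",
  "receiver": "manual-test",
  "groupLabels": {"alertname": "$alertname"},
  "commonLabels": {"alertname": "$alertname", "instance": "$instance"},
  "commonAnnotations": {"summary": "manual test", "description": "sent by hand"},
  "externalURL": "",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "$alertname", "instance": "$instance"},
      "annotations": {"summary": "manual test", "description": "sent by hand"},
      "startsAt": "$now",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": ""
    }
  ]
}
EOF
}

# Usage on the relay host (webhook1 is the target name from config.yml):
#   sample_alert_payload NginxDown 192.168.83.137:39113 |
#     curl -fsS -H 'Content-Type: application/json' --data-binary @- \
#       http://127.0.0.1:38060/dingtalk/webhook1/send
```

If the DingTalk group receives the message, the access token, secret, and template are wired up correctly.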

⑥ Alertmanager:

Install:

tar zxvf alertmanager-0.28.1.linux-amd64.tar.gz -C /usr/local/share/applications/
cd /usr/local/share/applications/alertmanager-0.28.1.linux-amd64/

Edit the configuration file:

cp alertmanager.yml alertmanager.yml_default
vim alertmanager.yml

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'team-B-receiver'
  routes:
    - match_re:
        alertname: "(InstanceDown|MySQLDown|NginxDown|ProcessDown)"
      receiver: 'team-A-receiver'
    - receiver: 'team-B-receiver'

receivers:
- name: 'team-A-receiver'
  webhook_configs:
    - url: 'http://192.168.83.137:38060/dingtalk/webhook1/send' # replace with the URL of your prometheus-webhook-dingtalk target
      send_resolved: true
  email_configs:
    - to: 'xxx@163.com'      # replace with the real email address
      from: 'xxx@163.com'     # replace with the real email address
      smarthost: 'smtp.163.com:465' # when using SSL
      # smarthost: 'smtp.163.com:587' # when using STARTTLS
      auth_username: 'xxx@163.com'
      auth_password: 'xxxxx' # replace with the real mailbox password
      send_resolved: true

- name: 'team-B-receiver'
  #webhook_configs:
  #  - url: 'https://oapi.dingtalk.com/robot/send?access_token=your-dingtalk-token-for-team-B' # replace with your DingTalk robot webhook URL
  #    send_resolved: true
  email_configs:
    - to: 'xxx@163.com'      # replace with the real email address
      from: 'xxx@163.com'     # replace with the real email address
      smarthost: 'smtp.163.com:465' # when using SSL
      # smarthost: 'smtp.163.com:587' # when using STARTTLS
      auth_username: 'xxx@163.com'
      auth_password: 'xxxxx' # replace with the real mailbox password
      send_resolved: true

Start:

./alertmanager --config.file="alertmanager.yml" --web.listen-address=:39093

Create the service:

vim /usr/lib/systemd/system/alertmanager.service

[Unit]
Description=The Alertmanager handles alerts sent by client applications such as the Prometheus server. 
Documentation=https://github.com/prometheus/alertmanager
After=network.target
 
[Service]
Type=simple
WorkingDirectory=/usr/local/share/applications/alertmanager-0.28.1.linux-amd64/
ExecStart=/usr/local/share/applications/alertmanager-0.28.1.linux-amd64/alertmanager --config.file="/usr/local/share/applications/alertmanager-0.28.1.linux-amd64/alertmanager.yml" --web.listen-address=:39093
Restart=on-failure
 
[Install]
WantedBy=multi-user.target

Verify the service:

systemctl daemon-reload
systemctl start alertmanager
systemctl enable alertmanager
systemctl status alertmanager
netstat -lantup | grep 39093

⑦ Prometheus:

Install:

tar zxvf prometheus-3.4.0.linux-amd64.tar.gz -C /usr/local/share/applications/
cd /usr/local/share/applications/prometheus-3.4.0.linux-amd64/

Edit the configuration file:

cp prometheus.yml prometheus.yml_default
vim prometheus.yml

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - 192.168.83.137:39093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/usr/local/share/alertmanager-0.28.1.linux-amd64/AlterRules/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["192.168.83.137:39090"]

  - job_name:       'node'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.83.137:39100','192.168.83.138:39100']
        labels:
          group: 'node'

  - job_name:       'mysql'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.83.137:39104']
        labels:
          group: 'mysql'

  - job_name:       'nginx'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.83.137:39113']
        labels:
          group: 'nginx'

  - job_name:       'process'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.83.137:39256']
        labels:
          group: 'process'
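
Both release tarballs include a validator, promtool for Prometheus and amtool for Alertmanager, so the configuration files can be checked before the services are started. The sketch below wraps such a check in a hypothetical helper, run_check, that prints a one-line verdict after the validator's own output.

```shell
#!/usr/bin/env bash
# run_check: hypothetical helper. Runs the given command, lets its output
# through, then prints PASS or FAIL with a label. Returns 1 on failure.
# Arguments: <label> <command> [args...]
run_check() {
  local label=$1
  shift
  if "$@"; then
    echo "PASS $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Usage from the Prometheus install directory:
#   run_check prometheus.yml ./promtool check config prometheus.yml
# Usage from the Alertmanager install directory:
#   run_check alertmanager.yml ./amtool check-config alertmanager.yml
```

promtool check config also loads the files matched by rule_files, so a syntax error in an alert rule is reported by the same check.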

Start:

./prometheus --config.file="prometheus.yml" --web.listen-address=0.0.0.0:39090 --storage.tsdb.path="/data/prometheus"

Verify:

http://192.168.83.137:39090/targets        // the page loads and lists the configured targets
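
The information on the targets page is also available from the Prometheus HTTP API at /api/v1/targets. The sketch below counts healthy targets with plain grep, so it needs no JSON tooling; count_up_targets is a hypothetical helper, and it relies on the API returning compact JSON in which a healthy target appears as "health":"up".

```shell
#!/usr/bin/env bash
# count_up_targets: hypothetical helper that reads the JSON response of
# /api/v1/targets on stdin and prints how many targets report "health":"up".
count_up_targets() {
  # grep -o emits one line per match; grep -c '' counts those lines.
  grep -o '"health":"up"' | grep -c '' || true
}

# Usage (needs curl and a running Prometheus):
#   curl -fsS http://192.168.83.137:39090/api/v1/targets | count_up_targets
```

With the scrape configuration above, six targets are defined in total: Prometheus itself, two node targets, and one each for mysql, nginx, and process.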

Create the service:

vim /usr/lib/systemd/system/prometheus.service

[Unit]
Description=Prometheus, a Cloud Native Computing Foundation project, is a systems and service monitoring system.  
Documentation=https://github.com/prometheus/prometheus/
After=network.target
 
[Service]
Type=simple
WorkingDirectory=/usr/local/share/applications/prometheus-3.4.0.linux-amd64/
ExecStart=/usr/local/share/applications/prometheus-3.4.0.linux-amd64/prometheus --config.file="/usr/local/share/applications/prometheus-3.4.0.linux-amd64/prometheus.yml" --web.listen-address=0.0.0.0:39090 --storage.tsdb.path="/data/prometheus"
Restart=on-failure
 
[Install]
WantedBy=multi-user.target

Verify the service:

systemctl daemon-reload
systemctl start prometheus
systemctl enable prometheus
systemctl status prometheus
netstat -lantup | grep 39090

⑧ Grafana:

Install:

tar zxvf grafana-enterprise-12.0.0.linux-amd64.tar.gz -C /usr/local/share/applications/
cd /usr/local/share/applications/grafana-v12.0.0/

Edit the configuration file:

cp conf/defaults.ini conf/defaults.ini_default
vim conf/defaults.ini
    change http_port = 3000 to http_port = 33000

Start:

./bin/grafana-server

Verify:

http://192.168.83.137:33000/        // the page loads, and you can log in with username/password admin/admin

Create the service:

vim /usr/lib/systemd/system/grafana-server.service

[Unit]
Description=Dashboard anything. Observe everything.  
Documentation=https://grafana.com/grafana/download
After=network.target
 
[Service]
Type=simple
WorkingDirectory=/usr/local/share/applications/grafana-v12.0.0/
ExecStart=/usr/local/share/applications/grafana-v12.0.0/bin/grafana-server
Restart=on-failure
 
[Install]
WantedBy=multi-user.target

Verify the service:

systemctl daemon-reload
systemctl start grafana-server
systemctl enable grafana-server
systemctl status grafana-server
netstat -lantup | grep 33000
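
With all eight units created, their state can be checked in one pass instead of unit by unit. The sketch below uses systemctl is-active; check_units is a hypothetical helper, and the unit names in the usage line are the service files created in this section. It only sees units on the host where it runs, so it would need to be run on each machine.

```shell
#!/usr/bin/env bash
# check_units: hypothetical helper that prints OK or FAIL for each systemd
# unit given as an argument and returns 1 if any unit is not active.
check_units() {
  local failed=0 unit
  for unit in "$@"; do
    if systemctl is-active --quiet "$unit"; then
      echo "OK   $unit"
    else
      echo "FAIL $unit"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (list only the units installed on the current host):
#   check_units node_exporter mysqld_exporter nginx-prometheus-exporter \
#     process-exporter prometheus-webhook-dingtalk alertmanager prometheus grafana-server
```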

6. Monitoring

Add Prometheus as a Grafana data source:

Go to Home > Connections > Add new connection, select Prometheus, and click Add new data source. Enter the Prometheus server URL, http://192.168.83.137:39090/, then click Save & test.
Once it has been added, the data source is listed under Home > Connections > Data sources.
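
The same data source can also be created through Grafana's HTTP API, which is convenient when the setup has to be repeated. The sketch below only builds the request body; datasource_json is a hypothetical helper, and the usage line assumes curl is installed and that the default admin/admin credentials are still in place.

```shell
#!/usr/bin/env bash
# datasource_json: hypothetical helper that prints the JSON body for creating
# a Prometheus data source through POST /api/datasources.
# Arguments: <name> <prometheus-url>
datasource_json() {
  local name=$1 url=$2
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy","isDefault":true}\n' \
    "$name" "$url"
}

# Usage (needs curl; credentials are an assumption, see above):
#   datasource_json Prometheus http://192.168.83.137:39090 |
#     curl -fsS -u admin:admin -H 'Content-Type: application/json' \
#       --data-binary @- http://192.168.83.137:33000/api/datasources
```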

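① Monitor the Linux hosts with node_exporter and visualize the metrics in Grafana

Go to Home > Dashboards, click New, and select Import dashboard. Enter the dashboard ID (8919; IDs can be looked up at https://grafana.com/grafana/dashboards/), select the Prometheus data source, and click Import. Grafana then opens the dashboard.
To return to it later, go to Home > Dashboards and open "Node Exporter Dashboard 20240520 TenSunS自动同步版".

② Monitor the MySQL database with mysqld_exporter and visualize the metrics in Grafana

Go to Home > Dashboards, click New, and select Import dashboard. Enter the dashboard ID (7362; IDs can be looked up at https://grafana.com/grafana/dashboards/), select the Prometheus data source, and click Import. Grafana then opens the dashboard.
To return to it later, go to Home > Dashboards and open "MySQL Overview".

③ Monitor the nginx server with nginx-prometheus-exporter and visualize the metrics in Grafana

Go to Home > Dashboards, click New, and select Import dashboard. Enter the dashboard ID (10393; IDs can be looked up at https://grafana.com/grafana/dashboards/), select the Prometheus data source, and click Import. Grafana then opens the dashboard.
To return to it later, go to Home > Dashboards and open "Nginx".

④ Monitor system processes with process-exporter and visualize the metrics in Grafana

Go to Home > Dashboards, click New, and select Import dashboard. Enter the dashboard ID (13882; IDs can be looked up at https://grafana.com/grafana/dashboards/), select the Prometheus data source, and click Import. Grafana then opens the dashboard.
To return to it later, go to Home > Dashboards and open "process exporter dashboard with treemap".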

7. Alerting

① node_exporter alert rules

    cd /usr/local/share/alertmanager-0.28.1.linux-amd64/AlterRules/
    vim node_rules.yml

groups:
  - name: instance-health
    rules:
      - alert: InstanceDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been unreachable for more than 2 minutes."

  - name: cpu-usage
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"

  - name: memory-usage
    rules:
      - alert: HighMemoryUsage
        expr: ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current value: {{ $value }}%)"

  - name: disk-usage
    rules:
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[1h], 4*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.device }} will fill in less than 4 hours"
          description: "Disk partition {{ $labels.device }} is expected to fill up within 4 hours."

  - name: filesystem-usage
    rules:
      - alert: HighFsUsage
        expr: (node_filesystem_size_bytes{mountpoint!~"/boot|/var/log"} - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High filesystem usage on {{ $labels.mountpoint }}"
          description: "Filesystem on {{ $labels.mountpoint }} is above 90% (current value: {{ $value }}%)"

② mysqld_exporter alert rules

    cd /usr/local/share/alertmanager-0.28.1.linux-amd64/AlterRules/
    vim mysqld_rules.yml

groups:
- name: mysql.rules
  rules:
  - alert: MySQLDown
    expr: mysql_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "MySQL instance is down"
      description: "MySQL instance {{ $labels.instance }} has been down for more than 1 minute."

  - alert: HighConnectionUsage
    expr: mysql_global_status_threads_connected > 800 # adjust the threshold to your environment
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High number of connections on MySQL instance"
      description: "The number of connections to MySQL instance {{ $labels.instance }} has exceeded the threshold of 800 connections."

  - alert: SlowQueriesDetected
    expr: increase(mysql_global_status_slow_queries[5m]) > 10 # adjust the threshold to your environment
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Slow queries detected on MySQL instance"
      description: "More than 10 slow queries have been detected on MySQL instance {{ $labels.instance }} in the last 5 minutes."

  - alert: InnoDBBufferPoolHitRateLow
    expr: (1 - (rate(mysql_global_status_innodb_buffer_pool_reads[5m]) / rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]))) * 100 < 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "InnoDB buffer pool hit rate low"
      description: "The InnoDB buffer pool hit rate on MySQL instance {{ $labels.instance }} is below 90%."

③ nginx-prometheus-exporter alert rules

    cd /usr/local/share/alertmanager-0.28.1.linux-amd64/AlterRules/
    vim nginx_rules.yml

groups:
- name: nginx.rules
  rules:

  # Nginx Down
  - alert: NginxDown
    expr: nginx_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Nginx instance {{ $labels.instance }} is down"
      description: "Nginx instance {{ $labels.instance }} has been down for more than 1 minute."

  # High Request Rate
  - alert: HighRequestRate
    expr: rate(nginx_http_requests_total[5m]) > 10000 # adjust the threshold to your environment
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High request rate on Nginx instance {{ $labels.instance }}"
      description: "The request rate on Nginx instance {{ $labels.instance }} has exceeded the threshold of 10,000 requests per second (5-minute average)."

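  # High Error Rate
  # Note: the stub_status-based exporter set up in this guide exposes nginx_http_requests_total without a status label,
  # so this rule only works with a metrics source that reports per-status request counters.
  - alert: HighErrorRate
    expr: sum by (instance) (rate(nginx_http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(nginx_http_requests_total[5m])) * 100 > 5 # 5% error rate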
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High error rate on Nginx instance {{ $labels.instance }}"
      description: "The error rate on Nginx instance {{ $labels.instance }} has exceeded 5%."

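  # High Response Time
  # Note: the stub_status-based exporter set up in this guide does not expose nginx_http_request_duration_seconds_bucket,
  # so this rule only works with a metrics source that provides a request-duration histogram.
  - alert: HighResponseTime
    expr: histogram_quantile(0.99, rate(nginx_http_request_duration_seconds_bucket[5m])) > 1 # 99th percentile response time over 1 second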
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High response time on Nginx instance {{ $labels.instance }}"
      description: "The 99th percentile response time on Nginx instance {{ $labels.instance }} has exceeded 1 second."

④ process-exporter alert rules

    cd /usr/local/share/alertmanager-0.28.1.linux-amd64/AlterRules/
    vim process.yml

groups:
- name: process.rules
  rules:

  # Process Down
  # process-exporter publishes its metrics as namedprocess_namegroup_* with a groupname label.
  - alert: ProcessDown
    expr: namedprocess_namegroup_num_procs == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Process {{ $labels.groupname }} is down"
      description: "The process group '{{ $labels.groupname }}' has had no running processes for more than 1 minute."

  # High CPU Usage
  - alert: HighCPUUsage
    expr: sum by (groupname) (rate(namedprocess_namegroup_cpu_seconds_total[5m])) > 1 # more than one CPU core in use
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage by process {{ $labels.groupname }}"
      description: "The process group '{{ $labels.groupname }}' is using more than 1 CPU core."

  # High Memory Usage
  - alert: HighMemoryUsage
    expr: sum by (groupname) (namedprocess_namegroup_memory_bytes{memtype="resident"}) / 1024 / 1024 > 500 # more than 500 MB of resident memory
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage by process {{ $labels.groupname }}"
      description: "The process group '{{ $labels.groupname }}' is using more than 500MB of memory."

  # Too Many Processes
  - alert: TooManyProcesses
    expr: sum by (job) (namedprocess_namegroup_states{state="Running"}) > 100 # adjust the threshold to your environment
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Too many processes running on {{ $labels.job }}"
      description: "There are more than 100 running processes on {{ $labels.job }}."

Verification

Simulate an nginx outage.
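
One way to run this test from a shell is to stop nginx on a monitored host, wait for the rule's for: 1m window plus Alertmanager's group_wait of 30s, and then ask Prometheus which alerts are active. The sketch below lists alert names from the /api/v1/alerts response; alert_names is a hypothetical helper, and the commands in the usage comment assume nginx runs as a systemd unit named nginx and that curl is installed.

```shell
#!/usr/bin/env bash
# alert_names: hypothetical helper that reads the JSON response of
# /api/v1/alerts on stdin and prints each distinct alertname once.
# The endpoint returns pending as well as firing alerts.
alert_names() {
  grep -o '"alertname":"[^"]*"' | cut -d'"' -f4 | sort -u
}

# Usage:
#   systemctl stop nginx        # on the monitored host (unit name is an assumption)
#   sleep 120                   # let the 1m "for" window and group_wait elapse
#   curl -fsS http://192.168.83.137:39090/api/v1/alerts | alert_names
#   systemctl start nginx       # restore the service; a resolved notice should follow
```

NginxDown should appear in the list, and because the Alertmanager route sends it to team-A-receiver, both the DingTalk relay and the email receiver should be notified.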

