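Our monitoring system uses collectd for collection, Graphite for storage, and Grafana for display. The advantages are that newly added services are picked up automatically and that the graphical dashboards are easy to read. The shortcoming is that nothing in the stack sends alert notifications when a metric goes wrong. After trying several alerting systems, we settled on seyren as the alerting component. This article also covers the other components we tried, along with their pros and cons.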

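Requirements for the alerting system

Our requirements for the alerting system come down to the following points:

  1. Read data from Graphite: since the architecture already uses Graphite to store monitoring metrics, we do not want to introduce another storage component.

  2. Centralized configuration: all metric thresholds must be configured and managed in one place so that they are easy to adjust. Components that run the checks and alerting on each individual machine are therefore not candidates.

  3. Automated configuration: when a new service is onboarded or an existing service is scaled out, it should be covered automatically. This requires the alerting system either to support generic configuration or to expose an API for adding alert rules.

  4. Extensible alert channels: we need to integrate with the SMS alerting platform provided by our company, so the channel must be customizable.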

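Evaluation

Grafana alerting

Our first choice was Grafana's alerting feature, since we already use Grafana for graphing and dashboards. Grafana added alerting in 4.x: you can attach an alert condition to a query and choose a notification channel for it. The configuration page looks like this:

[Image: Grafana alert configuration page]

There are quite a few notification channels to choose from, including Slack, Mail, and WebHook. WebHook can serve as the extension point for custom channels: when an alert fires, Grafana sends a POST request to the webhook URL with the alert details in the body. We can implement our own HTTP endpoint to handle the request and bridge to other alerting systems. The POST body looks like this (taken from the Grafana documentation):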

{
  "title": "My alert",
  "ruleId": 1,
  "ruleName": "Load peaking!",
  "ruleUrl": "http://url.to.grafana/db/dashboard/my_dashboard?panelId=2",
  "state": "alerting",
  "imageUrl": "http://s3.image.url",
  "message": "Load is peaking. Make sure the traffic is real and spin up more webfronts",
  "evalMatches": [
    {
      "metric": "requests",
      "tags": {},
      "value": 122
    }
  ]
}
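
As a rough illustration of what a custom endpoint might do with this body, here is a minimal Python sketch that turns the payload into a one-line text message, for example for an SMS gateway. The function name `format_grafana_alert` and the message layout are assumptions made for this example; only the field names come from the payload above.

```python
def format_grafana_alert(payload: dict) -> str:
    """Build a short text message from a Grafana webhook payload.

    Only fields shown in the sample payload are read; missing fields
    fall back to placeholder values instead of raising.
    """
    matches = ", ".join(
        f"{m.get('metric')}={m.get('value')}"
        for m in payload.get("evalMatches", [])
    )
    state = payload.get("state", "unknown")
    rule = payload.get("ruleName", "")
    message = payload.get("message", "")
    return f"[{state}] {rule}: {message} ({matches})"


# Trimmed copy of the sample payload from the Grafana documentation.
sample = {
    "title": "My alert",
    "ruleName": "Load peaking!",
    "state": "alerting",
    "message": "Load is peaking. Make sure the traffic is real and spin up more webfronts",
    "evalMatches": [{"metric": "requests", "tags": {}, "value": 122}],
}

print(format_grafana_alert(sample))
```

The resulting string would then be handed to whatever delivery mechanism the in-house platform exposes, which is outside the scope of this sketch.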

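Once an alert fires, the dashboard shows it in a different color:

[Image: dashboard panel after an alert has fired]

Problems

The first problem is that alert queries do not support Grafana templates. Templating removes most of the work of onboarding a new project: as long as metrics are reported following the agreed rules, no new dashboard has to be created. Because the alerting module lacks template support, every server's alert query has to be defined explicitly, with no template variables. Onboarding a new project therefore takes a lot of manual or semi-automated work to finish the alert configuration.

The second problem is that a query expression cannot keep a separate alert state for each individual metric it matches. For example, define an alert query collectd.*.cpu.percent-idle. With two servers, this query matches two metrics: collectd.host1.cpu.percent-idle and collectd.host2.cpu.percent-idle. When host1's CPU idle reaches the alert threshold, the check's state changes to ALERTING and a notification is sent. If host2 then reaches the threshold as well, no notification is sent, because the check is already in the ALERTING state. The Grafana documentation mentions that support for this is planned, but it is not available yet.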
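
The difference between one state per check and one state per metric can be shown with a small Python simulation. This is a simplified model written for this article, not Grafana's actual implementation; the function names and the `rounds` data are hypothetical.

```python
from typing import Dict, List


def notify_per_check(rounds: List[Dict[str, bool]]) -> List[str]:
    """One state for the whole query: notify only when the check flips."""
    notifications = []
    state = "OK"
    for i, breached_by_series in enumerate(rounds):
        new_state = "ALERTING" if any(breached_by_series.values()) else "OK"
        if new_state != state:
            notifications.append(f"round {i}: check -> {new_state}")
            state = new_state
    return notifications


def notify_per_series(rounds: List[Dict[str, bool]]) -> List[str]:
    """One state per matched metric: notify whenever any metric flips."""
    notifications = []
    states: Dict[str, str] = {}
    for i, breached_by_series in enumerate(rounds):
        for series, breached in sorted(breached_by_series.items()):
            new_state = "ALERTING" if breached else "OK"
            if new_state != states.get(series, "OK"):
                notifications.append(f"round {i}: {series} -> {new_state}")
                states[series] = new_state
    return notifications


# True means the metric has crossed the threshold in that evaluation round.
rounds = [
    {"host1": False, "host2": False},
    {"host1": True, "host2": False},   # host1 crosses the threshold
    {"host1": True, "host2": True},    # host2 crosses it one round later
]

print(notify_per_check(rounds))   # host2's breach produces no notification
print(notify_per_series(rounds))  # both hosts produce a notification
```

With a single state per check, the second breach is hidden behind the first one, which is exactly the behavior described above.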

cabot

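cabot is an alerting system designed mainly for Graphite data sources. Like Grafana, it alerts based on a Graphite metric query and a threshold, and alert delivery can be customized by writing your own plugin. Also like Grafana's alerting, when one query matches multiple metrics it cannot track the alert state of each metric separately.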

seyren

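seyren is also an alerting system built for Graphite data sources. Its advantage is that when a metric query matches multiple metrics, it tracks the alert state of each metric separately.

First we define a check whose metric query is collectd.base.control.jy.*.cpu.percent-idle, where the * matches the IP addresses of all servers. The figure below shows the check configuration page. After defining the query, you set the warn threshold and the error threshold, and the page then shows a graph of the most recent data.

[Image: seyren check configuration page]

After saving the check, its alert status is visible on the dashboard, where every matched metric has its own independently tracked state. This is what makes automated onboarding of servers and services possible: a newly added machine only has to report its monitoring data following the agreed convention to be covered by the check above.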
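
For the requirement of adding alert rules through an interface, a check like the one above could also be created programmatically. The following Python sketch only builds the request and does not send it. The field names mirror the check object seen in seyren's alert payload; the `/api/checks` path, the base URL, and both function names are assumptions that should be verified against the seyren version in use.

```python
import json
import urllib.request


def build_check(name: str, target: str, warn: float, error: float) -> dict:
    """Assemble a check definition using fields seen in seyren's payload."""
    return {
        "name": name,
        "target": target,   # Graphite query, may contain wildcards
        "warn": warn,
        "error": error,
        "enabled": True,
    }


def build_create_request(base_url: str, check: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST request for the check.

    The "/api/checks" path is an assumption, not taken from the article.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/checks",
        data=json.dumps(check).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


check = build_check("cpu", "collectd.base.control.jy.*.cpu.percent-idle", 57, 62)
request = build_create_request("http://localhost:8080/seyren", check)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Because the target contains a wildcard, a single check created this way would also cover machines added later, as long as they report under the same naming convention.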

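[Image: seyren dashboard showing the state of each matched metric]

On the delivery side, seyren also supports quite a few channels, such as Email, Flowdock, HipChat, HTTP, and Hubot. Here we only care about HTTP. The figure below shows the settings for an HTTP channel: all you need is a URL that accepts POST requests, and the alert details are sent as JSON in the POST body.

[Image: HTTP alert channel settings]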
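
A receiver for such an HTTP channel can be very small. The sketch below uses only Python's standard library; the handler name, the in-memory `received` list, and the choice of a 204 response are illustrative assumptions rather than anything seyren requires.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Parsed alert bodies are collected here; a real receiver would forward
# them to the SMS platform instead of keeping them in memory.
received = []


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        try:
            body = json.loads(raw or b"{}")
        except json.JSONDecodeError:
            self.send_response(400)  # reject bodies that are not valid JSON
            self.end_headers()
            return
        received.append(body)
        self.send_response(204)  # accepted, nothing to return
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def make_server(host: str = "127.0.0.1", port: int = 0) -> HTTPServer:
    """Create the server; port 0 lets the OS pick a free port."""
    return HTTPServer((host, port), AlertHandler)

# Usage: make_server(port=8083).serve_forever()
```

The URL configured in the seyren HTTP channel would then point at the host and port this server listens on.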

The key part of the alert POST body is excerpted below:

  "alerts": [
    {
      "checkId": "59327b84e4b0a957ebb25f77",
      "targetHash": "\ufffd\u0006LC\ufffd\ufffd\ufffd\ufffd\u0002\u007f\u0002\ufffd\ufffd\ufffdkE",
      "fromType": "OK",
      "toType": "WARN",
      "warn": 57,
      "timestamp": 1496484513227,
      "error": 62,
      "value": 58.1702216645755,
      "id": "59328aa1e4b0a957ebb26201",
      "target": "collectd.base.control.jy.host1.cpu.percent-idle"
    },
    {
      "checkId": "59327b84e4b0a957ebb25f77",
      "targetHash": "\ufffd\ufffd\ufffd\ufffd\ufffd>G\u001c\ufffd\ufffd\ufffd\u001c\ufffd9\ufffd\ufffd",
      "fromType": "WARN",
      "toType": "OK",
      "warn": 57,
      "timestamp": 1496484513227,
      "error": 62,
      "value": 52.6318006613729,
      "id": "59328aa1e4b0a957ebb26209",
      "target": "collectd.base.control.jy.host2.cpu.percent-idle"
    }
  ],

As you can see, the alerts node carries a separately maintained and reported check state for each host.
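
To make the per-host structure concrete, here is a minimal Python sketch that walks the alerts node and produces one line per state change. The function name `summarize_alerts` and the output format are assumptions for illustration; the field names and sample values are taken from the excerpt above.

```python
def summarize_alerts(body: dict) -> list:
    """Return one human-readable line per entry in the "alerts" node."""
    lines = []
    for alert in body.get("alerts", []):
        lines.append(
            f"{alert['target']}: {alert['fromType']} -> {alert['toType']} "
            f"(value={alert['value']:.2f}, warn={alert['warn']}, error={alert['error']})"
        )
    return lines


# Trimmed copy of the two alerts from the excerpt.
body = {
    "alerts": [
        {
            "fromType": "OK", "toType": "WARN", "warn": 57, "error": 62,
            "value": 58.1702216645755,
            "target": "collectd.base.control.jy.host1.cpu.percent-idle",
        },
        {
            "fromType": "WARN", "toType": "OK", "warn": 57, "error": 62,
            "value": 52.6318006613729,
            "target": "collectd.base.control.jy.host2.cpu.percent-idle",
        },
    ]
}

for line in summarize_alerts(body):
    print(line)
```

Each line corresponds to one host, so the receiver can decide per host whether to send a message, for example by skipping transitions back to OK.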

The full POST body is shown below:

{
  "preview": "<br /><img src=http://192.168.1.1/render/?target=collectd.base.control.jy.*.cpu.percent-idle&from=10:08_20170603&until=09:08_20170603&target=alias(dashed(color(constantLine(57),%22yellow%22)),%22warn%20level%22)&target=alias(dashed(color(constantLine(62),%22red%22)),%22error%20level%22)&width=500&height=225></img>",
  "subscription": {
    "su": true,
    "mo": true,
    "tu": true,
    "we": true,
    "th": true,
    "fr": true,
    "sa": true,
    "ignoreWarn": false,
    "ignoreError": false,
    "ignoreOk": false,
    "fromTime": {
      "chronology": {
        "zone": {
          "fixed": true,
          "id": "UTC"
        }
      },
      "millisOfSecond": 0,
      "millisOfDay": 0,
      "secondOfMinute": 0,
      "hourOfDay": 0,
      "minuteOfHour": 0,
      "fieldTypes": [
        {
          "durationType": {
            "name": "hours"
          },
          "rangeDurationType": {
            "name": "days"
          },
          "name": "hourOfDay"
        },
        {
          "durationType": {
            "name": "minutes"
          },
          "rangeDurationType": {
            "name": "hours"
          },
          "name": "minuteOfHour"
        },
        {
          "durationType": {
            "name": "seconds"
          },
          "rangeDurationType": {
            "name": "minutes"
          },
          "name": "secondOfMinute"
        },
        {
          "durationType": {
            "name": "millis"
          },
          "rangeDurationType": {
            "name": "seconds"
          },
          "name": "millisOfSecond"
        }
      ],
      "values": [
        0,
        0,
        0,
        0
      ],
      "fields": [
        {
          "range": 24,
          "rangeDurationField": {
            "unitMillis": 86400000,
            "precise": true,
            "name": "days",
            "type": {
              "name": "days"
            },
            "supported": true
          },
          "maximumValue": 23,
          "lenient": false,
          "unitMillis": 3600000,
          "durationField": {
            "unitMillis": 3600000,
            "precise": true,
            "name": "hours",
            "type": {
              "name": "hours"
            },
            "supported": true
          },
          "minimumValue": 0,
          "leapDurationField": null,
          "name": "hourOfDay",
          "type": {
            "durationType": {
              "name": "hours"
            },
            "rangeDurationType": {
              "name": "days"
            },
            "name": "hourOfDay"
          },
          "supported": true
        },
        {
          "range": 60,
          "rangeDurationField": {
            "unitMillis": 3600000,
            "precise": true,
            "name": "hours",
            "type": {
              "name": "hours"
            },
            "supported": true
          },
          "maximumValue": 59,
          "lenient": false,
          "unitMillis": 60000,
          "durationField": {
            "unitMillis": 60000,
            "precise": true,
            "name": "minutes",
            "type": {
              "name": "minutes"
            },
            "supported": true
          },
          "minimumValue": 0,
          "leapDurationField": null,
          "name": "minuteOfHour",
          "type": {
            "durationType": {
              "name": "minutes"
            },
            "rangeDurationType": {
              "name": "hours"
            },
            "name": "minuteOfHour"
          },
          "supported": true
        },
        {
          "range": 60,
          "rangeDurationField": {
            "unitMillis": 60000,
            "precise": true,
            "name": "minutes",
            "type": {
              "name": "minutes"
            },
            "supported": true
          },
          "maximumValue": 59,
          "lenient": false,
          "unitMillis": 1000,
          "durationField": {
            "unitMillis": 1000,
            "precise": true,
            "name": "seconds",
            "type": {
              "name": "seconds"
            },
            "supported": true
          },
          "minimumValue": 0,
          "leapDurationField": null,
          "name": "secondOfMinute",
          "type": {
            "durationType": {
              "name": "seconds"
            },
            "rangeDurationType": {
              "name": "minutes"
            },
            "name": "secondOfMinute"
          },
          "supported": true
        },
        {
          "range": 1000,
          "rangeDurationField": {
            "unitMillis": 1000,
            "precise": true,
            "name": "seconds",
            "type": {
              "name": "seconds"
            },
            "supported": true
          },
          "maximumValue": 999,
          "lenient": false,
          "unitMillis": 1,
          "durationField": {
            "unitMillis": 1,
            "precise": true,
            "name": "millis",
            "type": {
              "name": "millis"
            },
            "supported": true
          },
          "minimumValue": 0,
          "leapDurationField": null,
          "name": "millisOfSecond",
          "type": {
            "durationType": {
              "name": "millis"
            },
            "rangeDurationType": {
              "name": "seconds"
            },
            "name": "millisOfSecond"
          },
          "supported": true
        }
      ]
    },
    "toTime": {
      "chronology": {
        "zone": {
          "fixed": true,
          "id": "UTC"
        }
      },
      "millisOfSecond": 0,
      "millisOfDay": 86340000,
      "secondOfMinute": 0,
      "hourOfDay": 23,
      "minuteOfHour": 59,
      "fieldTypes": [
        {
          "durationType": {
            "name": "hours"
          },
          "rangeDurationType": {
            "name": "days"
          },
          "name": "hourOfDay"
        },
        {
          "durationType": {
            "name": "minutes"
          },
          "rangeDurationType": {
            "name": "hours"
          },
          "name": "minuteOfHour"
        },
        {
          "durationType": {
            "name": "seconds"
          },
          "rangeDurationType": {
            "name": "minutes"
          },
          "name": "secondOfMinute"
        },
        {
          "durationType": {
            "name": "millis"
          },
          "rangeDurationType": {
            "name": "seconds"
          },
          "name": "millisOfSecond"
        }
      ],
      "values": [
        23,
        59,
        0,
        0
      ],
      "fields": [
        {
          "range": 24,
          "rangeDurationField": {
            "unitMillis": 86400000,
            "precise": true,
            "name": "days",
            "type": {
              "name": "days"
            },
            "supported": true
          },
          "maximumValue": 23,
          "lenient": false,
          "unitMillis": 3600000,
          "durationField": {
            "unitMillis": 3600000,
            "precise": true,
            "name": "hours",
            "type": {
              "name": "hours"
            },
            "supported": true
          },
          "minimumValue": 0,
          "leapDurationField": null,
          "name": "hourOfDay",
          "type": {
            "durationType": {
              "name": "hours"
            },
            "rangeDurationType": {
              "name": "days"
            },
            "name": "hourOfDay"
          },
          "supported": true
        },
        {
          "range": 60,
          "rangeDurationField": {
            "unitMillis": 3600000,
            "precise": true,
            "name": "hours",
            "type": {
              "name": "hours"
            },
            "supported": true
          },
          "maximumValue": 59,
          "lenient": false,
          "unitMillis": 60000,
          "durationField": {
            "unitMillis": 60000,
            "precise": true,
            "name": "minutes",
            "type": {
              "name": "minutes"
            },
            "supported": true
          },
          "minimumValue": 0,
          "leapDurationField": null,
          "name": "minuteOfHour",
          "type": {
            "durationType": {
              "name": "minutes"
            },
            "rangeDurationType": {
              "name": "hours"
            },
            "name": "minuteOfHour"
          },
          "supported": true
        },
        {
          "range": 60,
          "rangeDurationField": {
            "unitMillis": 60000,
            "precise": true,
            "name": "minutes",
            "type": {
              "name": "minutes"
            },
            "supported": true
          },
          "maximumValue": 59,
          "lenient": false,
          "unitMillis": 1000,
          "durationField": {
            "unitMillis": 1000,
            "precise": true,
            "name": "seconds",
            "type": {
              "name": "seconds"
            },
            "supported": true
          },
          "minimumValue": 0,
          "leapDurationField": null,
          "name": "secondOfMinute",
          "type": {
            "durationType": {
              "name": "seconds"
            },
            "rangeDurationType": {
              "name": "minutes"
            },
            "name": "secondOfMinute"
          },
          "supported": true
        },
        {
          "range": 1000,
          "rangeDurationField": {
            "unitMillis": 1000,
            "precise": true,
            "name": "seconds",
            "type": {
              "name": "seconds"
            },
            "supported": true
          },
          "maximumValue": 999,
          "lenient": false,
          "unitMillis": 1,
          "durationField": {
            "unitMillis": 1,
            "precise": true,
            "name": "millis",
            "type": {
              "name": "millis"
            },
            "supported": true
          },
          "minimumValue": 0,
          "leapDurationField": null,
          "name": "millisOfSecond",
          "type": {
            "durationType": {
              "name": "millis"
            },
            "rangeDurationType": {
              "name": "seconds"
            },
            "name": "millisOfSecond"
          },
          "supported": true
        }
      ]
    },
    "enabled": true,
    "id": "59328a65e4b0a957ebb26200",
    "type": "HTTP",
    "target": "http://10.153.74.117:8083/sonar/1.0/alarm_str"
  },
  "check": {
    "subscriptions": [
      {
        "su": true,
        "mo": true,
        "tu": true,
        "we": true,
        "th": true,
        "fr": true,
        "sa": true,
        "ignoreWarn": false,
        "ignoreError": false,
        "ignoreOk": false,
        "fromTime": {
          "chronology": {
            "zone": {
              "fixed": true,
              "id": "UTC"
            }
          },
          "millisOfSecond": 0,
          "millisOfDay": 0,
          "secondOfMinute": 0,
          "hourOfDay": 0,
          "minuteOfHour": 0,
          "fieldTypes": [
            {
              "durationType": {
                "name": "hours"
              },
              "rangeDurationType": {
                "name": "days"
              },
              "name": "hourOfDay"
            },
            {
              "durationType": {
                "name": "minutes"
              },
              "rangeDurationType": {
                "name": "hours"
              },
              "name": "minuteOfHour"
            },
            {
              "durationType": {
                "name": "seconds"
              },
              "rangeDurationType": {
                "name": "minutes"
              },
              "name": "secondOfMinute"
            },
            {
              "durationType": {
                "name": "millis"
              },
              "rangeDurationType": {
                "name": "seconds"
              },
              "name": "millisOfSecond"
            }
          ],
          "values": [
            0,
            0,
            0,
            0
          ],
          "fields": [
            {
              "range": 24,
              "rangeDurationField": {
                "unitMillis": 86400000,
                "precise": true,
                "name": "days",
                "type": {
                  "name": "days"
                },
                "supported": true
              },
              "maximumValue": 23,
              "lenient": false,
              "unitMillis": 3600000,
              "durationField": {
                "unitMillis": 3600000,
                "precise": true,
                "name": "hours",
                "type": {
                  "name": "hours"
                },
                "supported": true
              },
              "minimumValue": 0,
              "leapDurationField": null,
              "name": "hourOfDay",
              "type": {
                "durationType": {
                  "name": "hours"
                },
                "rangeDurationType": {
                  "name": "days"
                },
                "name": "hourOfDay"
              },
              "supported": true
            },
            {
              "range": 60,
              "rangeDurationField": {
                "unitMillis": 3600000,
                "precise": true,
                "name": "hours",
                "type": {
                  "name": "hours"
                },
                "supported": true
              },
              "maximumValue": 59,
              "lenient": false,
              "unitMillis": 60000,
              "durationField": {
                "unitMillis": 60000,
                "precise": true,
                "name": "minutes",
                "type": {
                  "name": "minutes"
                },
                "supported": true
              },
              "minimumValue": 0,
              "leapDurationField": null,
              "name": "minuteOfHour",
              "type": {
                "durationType": {
                  "name": "minutes"
                },
                "rangeDurationType": {
                  "name": "hours"
                },
                "name": "minuteOfHour"
              },
              "supported": true
            },
            {
              "range": 60,
              "rangeDurationField": {
                "unitMillis": 60000,
                "precise": true,
                "name": "minutes",
                "type": {
                  "name": "minutes"
                },
                "supported": true
              },
              "maximumValue": 59,
              "lenient": false,
              "unitMillis": 1000,
              "durationField": {
                "unitMillis": 1000,
                "precise": true,
                "name": "seconds",
                "type": {
                  "name": "seconds"
                },
                "supported": true
              },
              "minimumValue": 0,
              "leapDurationField": null,
              "name": "secondOfMinute",
              "type": {
                "durationType": {
                  "name": "seconds"
                },
                "rangeDurationType": {
                  "name": "minutes"
                },
                "name": "secondOfMinute"
              },
              "supported": true
            },
            {
              "range": 1000,
              "rangeDurationField": {
                "unitMillis": 1000,
                "precise": true,
                "name": "seconds",
                "type": {
                  "name": "seconds"
                },
                "supported": true
              },
              "maximumValue": 999,
              "lenient": false,
              "unitMillis": 1,
              "durationField": {
                "unitMillis": 1,
                "precise": true,
                "name": "millis",
                "type": {
                  "name": "millis"
                },
                "supported": true
              },
              "minimumValue": 0,
              "leapDurationField": null,
              "name": "millisOfSecond",
              "type": {
                "durationType": {
                  "name": "millis"
                },
                "rangeDurationType": {
                  "name": "seconds"
                },
                "name": "millisOfSecond"
              },
              "supported": true
            }
          ]
        },
        "toTime": {
          "chronology": {
            "zone": {
              "fixed": true,
              "id": "UTC"
            }
          },
          "millisOfSecond": 0,
          "millisOfDay": 86340000,
          "secondOfMinute": 0,
          "hourOfDay": 23,
          "minuteOfHour": 59,
          "fieldTypes": [
            {
              "durationType": {
                "name": "hours"
              },
              "rangeDurationType": {
                "name": "days"
              },
              "name": "hourOfDay"
            },
            {
              "durationType": {
                "name": "minutes"
              },
              "rangeDurationType": {
                "name": "hours"
              },
              "name": "minuteOfHour"
            },
            {
              "durationType": {
                "name": "seconds"
              },
              "rangeDurationType": {
                "name": "minutes"
              },
              "name": "secondOfMinute"
            },
            {
              "durationType": {
                "name": "millis"
              },
              "rangeDurationType": {
                "name": "seconds"
              },
              "name": "millisOfSecond"
            }
          ],
          "values": [
            23,
            59,
            0,
            0
          ],
          "fields": [
            {
              "range": 24,
              "rangeDurationField": {
                "unitMillis": 86400000,
                "precise": true,
                "name": "days",
                "type": {
                  "name": "days"
                },
                "supported": true
              },
              "maximumValue": 23,
              "lenient": false,
              "unitMillis": 3600000,
              "durationField": {
                "unitMillis": 3600000,
                "precise": true,
                "name": "hours",
                "type": {
                  "name": "hours"
                },
                "supported": true
              },
              "minimumValue": 0,
              "leapDurationField": null,
              "name": "hourOfDay",
              "type": {
                "durationType": {
                  "name": "hours"
                },
                "rangeDurationType": {
                  "name": "days"
                },
                "name": "hourOfDay"
              },
              "supported": true
            },
            {
              "range": 60,
              "rangeDurationField": {
                "unitMillis": 3600000,
                "precise": true,
                "name": "hours",
                "type": {
                  "name": "hours"
                },
                "supported": true
              },
              "maximumValue": 59,
              "lenient": false,
              "unitMillis": 60000,
              "durationField": {
                "unitMillis": 60000,
                "precise": true,
                "name": "minutes",
                "type": {
                  "name": "minutes"
                },
                "supported": true
              },
              "minimumValue": 0,
              "leapDurationField": null,
              "name": "minuteOfHour",
              "type": {
                "durationType": {
                  "name": "minutes"
                },
                "rangeDurationType": {
                  "name": "hours"
                },
                "name": "minuteOfHour"
              },
              "supported": true
            },
            {
              "range": 60,
              "rangeDurationField": {
                "unitMillis": 60000,
                "precise": true,
                "name": "minutes",
                "type": {
                  "name": "minutes"
                },
                "supported": true
              },
              "maximumValue": 59,
              "lenient": false,
              "unitMillis": 1000,
              "durationField": {
                "unitMillis": 1000,
                "precise": true,
                "name": "seconds",
                "type": {
                  "name": "seconds"
                },
                "supported": true
              },
              "minimumValue": 0,
              "leapDurationField": null,
              "name": "secondOfMinute",
              "type": {
                "durationType": {
                  "name": "seconds"
                },
                "rangeDurationType": {
                  "name": "minutes"
                },
                "name": "secondOfMinute"
              },
              "supported": true
            },
            {
              "range": 1000,
              "rangeDurationField": {
                "unitMillis": 1000,
                "precise": true,
                "name": "seconds",
                "type": {
                  "name": "seconds"
                },
                "supported": true
              },
              "maximumValue": 999,
              "lenient": false,
              "unitMillis": 1,
              "durationField": {
                "unitMillis": 1,
                "precise": true,
                "name": "millis",
                "type": {
                  "name": "millis"
                },
                "supported": true
              },
              "minimumValue": 0,
              "leapDurationField": null,
              "name": "millisOfSecond",
              "type": {
                "durationType": {
                  "name": "millis"
                },
                "rangeDurationType": {
                  "name": "seconds"
                },
                "name": "millisOfSecond"
              },
              "supported": true
            }
          ]
        },
        "enabled": true,
        "id": "59328a65e4b0a957ebb26200",
        "type": "HTTP",
        "target": "http://192.168.1.1/sonar/1.0/alarm_str"
      }
    ],
    "warn": 57,
    "until": null,
    "from": null,
    "lastCheck": 1496484513253,
    "description": null,
    "enabled": true,
    "error": 62,
    "name": "cpu",
    "id": "59327b84e4b0a957ebb25f77",
    "state": "ERROR",
    "target": "collectd.base.control.jy.*.cpu.percent-idle",
    "live": false
  },
  "alerts": [
    {
      "checkId": "59327b84e4b0a957ebb25f77",
      "targetHash": "\ufffd\u0006LC\ufffd\ufffd\ufffd\ufffd\u0002\u007f\u0002\ufffd\ufffd\ufffdkE",
      "fromType": "OK",
      "toType": "WARN",
      "warn": 57,
      "timestamp": 1496484513227,
      "error": 62,
      "value": 58.1702216645755,
      "id": "59328aa1e4b0a957ebb26201",
      "target": "collectd.base.control.jy.host1.cpu.percent-idle"
    },
    {
      "checkId": "59327b84e4b0a957ebb25f77",
      "targetHash": "\ufffd\ufffd\ufffd\ufffd\ufffd>G\u001c\ufffd\ufffd\ufffd\u001c\ufffd9\ufffd\ufffd",
      "fromType": "WARN",
      "toType": "OK",
      "warn": 57,
      "timestamp": 1496484513227,
      "error": 62,
      "value": 52.6318006613729,
      "id": "59328aa1e4b0a957ebb26209",
      "target": "collectd.base.control.jy.host2.cpu.percent-idle"
    }
  ],
  "seyrenUrl": "http://localhost:8080/seyren"
}

channingbj