Guide:
Delivering projects that run silky smooth, provide a good customer experience, and stay bug-free is probably the dream of everyone who works at a keyboard. So can we fix a bug before the customer notices it, and when a bug does occur, can we quickly detect it, locate it, and repair it in time? To meet this need for rapid alarm detection and problem location, we implemented a precise alarm system based on Prometheus. The system consists of three parts: a log platform, an indicator system, and an alarm system. The solution supports quick message reminders to designated handlers, and each alarm message carries enough indicator information to locate the problem area quickly.
Text | Wu Hua, Senior Java Development Engineer, NetEase Cloud Business
1. Current Situation & Problems
Everyone has probably been troubled by an alarm storm: a flood of alarm messages with no type distinction, leaving the handlers unable to quickly determine where the problem is, and it may take several rounds of investigation to discover the actual cause. Most running projects today have a monitoring system. When an exception occurs, the monitoring system issues a unified alarm message; if the message carries a traceId, you can look up the traceId on the log platform or in ELK to view the specific logs, context information, and so on. Usually the alarm is sent to the project owner or to a large project group, and after the owner sees it, the relevant people start to investigate and analyze. The time spent locating and analyzing is proportional to the complexity of the problem: the more complex it is, the longer it takes, and by then the best window for remediation may already have passed. As time goes on, more and more customers notice the problem and the scope of impact gradually expands. Our original intention is to repair problems quickly, which means shortening both the detection time and the location-and-analysis time as much as possible, completing the fix before customers notice, so that the customer experience is affected as little as possible.
2. Analysis & Solutions
Locating a problem generally requires the following indicator information, all of which are indispensable: time, machine, service, method, traceId, module, exception type, and exception message.
Log entries are usually enriched with additional information to help locate and troubleshoot problems; they contain the data above, but are not limited to it. Log information is recorded whether or not the interface responds normally, so the daily log volume generated by a service system can reach the terabyte level, and processing such a huge amount of data directly is unrealistic. The point of an indicator system is to extract the key information we are interested in, such as service memory, CPU, and GC status, interface responses whose code is not 200, custom system error codes, or other business indicators. An indicator system stores, computes, and displays this lightweight data, presenting the aggregated information we need. Based on the indicator system, we can flexibly configure hotspot data displays, trend graphs, alarms, and so on.
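As a concrete illustration, the indicator pull interface exposes this extracted key information in the Prometheus text exposition format. A minimal sketch, with hypothetical metric and label names:

```
# HELP http_requests_total Total interface requests, labeled by response code (hypothetical metric)
# TYPE http_requests_total counter
http_requests_total{service="order",method="/api/pay",code="200"} 10234
http_requests_total{service="order",method="/api/pay",code="500"} 7
# HELP jvm_gc_pause_seconds_total Cumulative GC pause time (hypothetical metric)
# TYPE jvm_gc_pause_seconds_total counter
jvm_gc_pause_seconds_total{service="order"} 3.75
```

Prometheus scrapes such an endpoint periodically, so only these aggregated counters travel over the wire rather than the raw terabytes of logs.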
For a comparison of the most popular indicator systems in the community (Prometheus, InfluxDB, OpenTSDB, Graphite, and so on), see the official Prometheus comparison page in the references.
Because alarms are aggregated from error or exception information, or the method lacks contextual correlation, the information alone is often not enough for the handlers to make a judgment quickly. Moreover, alarm information that is sent out in one go without being classified and sorted only disturbs the handler; especially during an alarm storm, the flood of alarms easily leads to misjudgment and delays the handling. Therefore, we need to collect, classify, and sort the error information, and when a certain threshold is reached, send a message to remind the corresponding business handler, which can be a business group or a single person. The message includes the time, machine, service, method, trace, module, exception type, and exception message. Alarm information can also be persisted to a database, so that response time, processing time, and similar statistics can be computed.
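Such a threshold maps naturally onto a PromQL rule expression. A minimal sketch, assuming a hypothetical `error_log_total` counter with `module` and `exception_type` labels:

```
# fire when a module produces more than 10 errors within one minute,
# grouped so the alarm carries the module and exception type labels
sum by (module, exception_type) (increase(error_log_total[1m])) > 10
```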
After research, we decided on the open-source Prometheus. Prometheus consists of two parts, the indicator system and the alarm system, and it also provides a set of API interfaces. Indicator storage supports both local and remote modes. Since queries load the relevant data into memory, it is generally recommended to keep no more than 15 days of data in local storage, to avoid running out of disk space on the server or blowing up memory with excessive data. Data can also be written to a remote database, with an indicator database providing the long-term storage.
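Local retention is controlled at startup. A sketch for Prometheus 2.x, with an illustrative storage path (older versions used the `--storage.tsdb.retention` flag instead):

```
./prometheus --config.file=prometheus.yml \
             --storage.tsdb.path=/data/prometheus \
             --storage.tsdb.retention.time=15d
```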
Community remote storage support includes:
- AppOptics: write
- Chronix: write
- Cortex: read and write
- CrateDB: read and write
- Elasticsearch: write
- Gnocchi: write
- Graphite: write
- InfluxDB: read and write
- OpenTSDB: write
- PostgreSQL/TimescaleDB: read and write
- SignalFx: write
- ClickHouse: read and write
The indicator system supports both PULL and PUSH modes. In PULL mode, jobs can be configured flexibly: the REST interface and pull frequency of each target can be configured individually. Prometheus supports hot reloading, which means the configuration can be modified remotely and take effect in real time. The indicator system and the alarm system are naturally integrated: the alarm system supports indicator configuration at different granularities, alarm frequency configuration, labels, and so on, and alarm messages can be pushed to Slack, DingTalk, email, a Webhook interface, and more. Since the availability of online services must be guaranteed, interfaces other than those supporting business functions are generally not opened directly on the services themselves: first, doing so easily pollutes the business code, and second, it avoids other functions affecting the normal operation and performance of the business. System logs are generally collected and stored by other services, such as ELK or a self-developed log platform. At present we use PULL mode to connect to the log platform, developing an indicator pull interface on the log platform. The architecture is designed as shown in the figure.
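Hot reloading can be triggered without restarting the process. A sketch, assuming the lifecycle API was enabled at startup:

```
# start with the lifecycle HTTP API enabled
./prometheus --config.file=prometheus.yml --web.enable-lifecycle

# after changing the configuration, ask the running process to reload it
curl -X POST http://localhost:9090/-/reload
```

Sending SIGHUP to the Prometheus process achieves the same effect.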
Most alarm systems send alarm information by configuring a responsible group at the service level and notifying the person or group by email. However, current work habits in China do not rely entirely on email: awareness of email reminders is low, and group messages without a designated addressee are easily ignored, so the response speed to alarms drops sharply. In addition, when an alarm carries insufficient information, it increases the handler's difficulty and further slows down processing.
Therefore, we adopt the Prometheus alert scheme: alert information is sent to a Webhook interface on the log platform, and the log platform selects the final message routing destination according to the module configuration.
The complete execution chain is as follows:
- The log platform collects logs and provides an indicator pull interface
- Prometheus completes metrics collection
- Prometheus is configured with alarm rules; when a rule matches, alarm information is generated
- The Prometheus alert component sends the alarm information to the alarm interface provided by the log platform
- The log platform calls the Prometheus API to obtain the specific indicator information, based on the module and indicator name carried in the alarm (see the query sketch after this list)
- The log platform selects the responsible person or group to notify, based on the existing configuration, the module in the alarm, and the indicator labels
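The query in the fifth step uses the standard Prometheus HTTP API. A minimal sketch, with an illustrative Prometheus address and the metric selector taken from the alarm's `metric_expr` label:

```
curl -G 'http://localhost:9090/api/v1/query' \
     --data-urlencode 'query=kafka_log{appname="appname"}[1m]'
```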
At this point, the whole alarm link is complete.
3. Practice
The implementation steps are as follows:
- Build the log platform, which collects interface and system logs.
- Expose an indicator pull interface on the log platform.
- Configure Prometheus (prometheus.yml) and start the Prometheus process.
The collection task configuration is as follows:
```yaml
- job_name: 'name'
  # custom collection path
  metrics_path: '/path'
  scrape_interval: 1800s
  static_configs:
    - targets: ['localhost:9527']
```
The alarm configuration is as follows:
```yaml
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Load rules once and periodically evaluate them
rule_files:
  - "rules.yml"
  - "kafka_rules.yml"
```
Port 9093 corresponds to the Prometheus alert service (Alertmanager).
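Alertmanager runs as a separate process and listens on port 9093 by default. A sketch, with an illustrative configuration path:

```
./alertmanager --config.file=alertmanager.yml
```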
The rules file is configured as follows:
```yaml
groups:
  - name: kafkaAlert
    rules:
      - alert: hukeKfkaDelay
        expr: count_over_time(kafka_log{appname='appname'}[1m]) > 0
        labels:
          metric_name: kafka
          module: modulename
          metric_expr: "kafka_log{appname='appname'}[1m]"
        annotations:
          summary: "Kafka message backlog"
          description: "{{ $value }} occurrences"
```
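Rule files can be validated before loading them; `promtool`, which ships with Prometheus, checks the syntax:

```
./promtool check rules rules.yml kafka_rules.yml
```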
- Since the ClickHouse database is used for log storage, we start a prom2click process (see the references) to persist indicator data to ClickHouse for the long term; the remote read/write URLs in the configuration point to prom2click.
```yaml
remote_write:
  - url: "http://localhost:9201/write"
remote_read:
  - url: "http://localhost:9201/read"
```
- Configure Alertmanager and start the alert process.
```yaml
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: '<external alarm message webhook endpoint>'
```
The alarm information received by the alert service comes from Prometheus and includes the configured label information. Based on the frequency and silence configuration, the alert service decides whether to forward the message to the third-party interface.
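What the log platform's Webhook interface receives is the standard Alertmanager webhook payload. A trimmed sketch with illustrative values:

```json
{
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "hukeKfkaDelay",
        "metric_name": "kafka",
        "module": "modulename"
      },
      "annotations": {
        "summary": "Kafka message backlog",
        "description": "3 occurrences"
      },
      "startsAt": "2021-01-01T00:00:00Z"
    }
  ]
}
```

The `module` and `metric_name` labels are what allow the log platform to pick the routing target and to query Prometheus for the underlying indicator data.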
- Alarm display & comparison (screenshots):
  - Business alarm example
  - Kafka alarm example
4. Conclusion
Precise monitoring and alarming based on Prometheus effectively avoids alarm storms, improves the response and handling speed for online problems, and noticeably reduces the difficulty developers face when troubleshooting. Flexible message push to different responsible persons speeds up the right person's awareness of a problem, so it can be handled in time. The indicator collection tasks that come with Prometheus avoid a lot of repetitive collection work and integrate perfectly with the alarm system. The current drawback is that the configuration is somewhat complicated and not very flexible.
Those interested in other features of Prometheus can also go directly to its official website to view relevant information.
References
- https://prometheus.fuckcloudnative.io/di-yi-zhang-jie-shao/overview
- https://yunlzheng.gitbook.io/prometheus-book/introduction
- http://dockone.io/article/10034
- https://prometheus.io/docs/introduction/comparison/
- https://segmentfault.com/a/1190000015710814
- https://github.com/iyacontrol/prom2click
About the Author
Wu Hua, senior Java development engineer at NetEase Cloud Business, is responsible for developing and maintaining the core modules of the cloud business inter-customer system and the Qiyu work-order system.
Related Reading
- On the Design and Implementation of a Communication Middle Platform for Operators
- Technical Insights | Speech De-reverberation Algorithm Practice in Real-Time Communication Services
- In-depth Analysis of the "Circle Group" Message System Design | "Circle Group" Technical Series