Use Prometheus to monitor the running status of eKuiper rules

Prometheus is an open source system monitoring and alerting toolkit hosted by the CNCF, and many companies and organizations have adopted it as their monitoring and alerting tool.

An eKuiper rule is a continuously running stream computing task that processes unbounded data flows. Under normal circumstances, a rule keeps running after it is started and continuously generates running status data until it is stopped manually or hits an unrecoverable error. Each rule provides a status API for retrieving its running metrics. eKuiper also integrates with Prometheus, which makes it easy to monitor the various status indicators.

This tutorial is aimed at users who have a preliminary understanding of eKuiper, and will introduce rule status indicators and how to monitor specific indicators through Prometheus.

Rule Status Indicator

After a rule is created in eKuiper and runs successfully, users can view its running status indicators through the CLI, REST API, or management console. For example, if there is a rule named rule1, you can obtain its running indicators in JSON format through curl -X GET "http://127.0.0.1:9081/rules/rule1/status", as shown below:

{
  "status": "running",
  "source_demo_0_records_in_total": 265,
  "source_demo_0_records_out_total": 265,
  "source_demo_0_process_latency_us": 0,
  "source_demo_0_buffer_length": 0,
  "source_demo_0_last_invocation": "2022-08-22T17:19:10.979128",
  "source_demo_0_exceptions_total": 0,
  "source_demo_0_last_exception": "",
  "source_demo_0_last_exception_time": 0,
  "op_2_project_0_records_in_total": 265,
  "op_2_project_0_records_out_total": 265,
  "op_2_project_0_process_latency_us": 0,
  "op_2_project_0_buffer_length": 0,
  "op_2_project_0_last_invocation": "2022-08-22T17:19:10.979128",
  "op_2_project_0_exceptions_total": 0,
  "op_2_project_0_last_exception": "",
  "op_2_project_0_last_exception_time": 0,
  "sink_mqtt_0_0_records_in_total": 265,
  "sink_mqtt_0_0_records_out_total": 265,
  "sink_mqtt_0_0_process_latency_us": 0,
  "sink_mqtt_0_0_buffer_length": 0,
  "sink_mqtt_0_0_last_invocation": "2022-08-22T17:19:10.979128",
  "sink_mqtt_0_0_exceptions_total": 0,
  "sink_mqtt_0_0_last_exception": "",
  "sink_mqtt_0_0_last_exception_time": 0
}

The running indicators consist of two parts. The first is status, which indicates whether the rule is running normally; its value may be running, stopped manually, and so on. The second is the set of running indicators for each operator of the rule. The operators are generated from the rule's SQL, so they may differ from rule to rule. In this example, the rule uses the simplest SQL, SELECT * FROM demo, with an MQTT action, and three operators are generated: source_demo, op_project, and sink_mqtt. Every operator has the same set of running indicators, and each full indicator name is formed by joining the operator name and the indicator. For example, the records_in_total indicator of the operator source_demo_0 is source_demo_0_records_in_total.
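The naming scheme above can be reversed programmatically. The following is a minimal Python sketch, not part of eKuiper, that splits each key of a status response into an operator name and an indicator name. It assumes the indicator suffixes visible in the sample response above; `group_by_operator` is a hypothetical helper name, and the sample is a shortened copy of that response.

```python
import json

# Indicator suffixes as they appear in the sample status response above.
INDICATORS = (
    "records_in_total",
    "records_out_total",
    "process_latency_us",
    "buffer_length",
    "last_invocation",
    "exceptions_total",
    "last_exception_time",
    "last_exception",
)


def group_by_operator(status: dict) -> dict:
    """Group flat status keys into {operator: {indicator: value}}."""
    grouped = {}
    for key, value in status.items():
        for indicator in INDICATORS:
            suffix = "_" + indicator
            if key.endswith(suffix):
                operator = key[: -len(suffix)]
                grouped.setdefault(operator, {})[indicator] = value
                break
    return grouped


# Shortened copy of the sample response; the top-level "status" key
# matches no indicator suffix and is therefore skipped.
sample = json.loads("""
{
  "status": "running",
  "source_demo_0_records_in_total": 265,
  "source_demo_0_buffer_length": 0,
  "sink_mqtt_0_0_records_out_total": 265,
  "sink_mqtt_0_0_last_exception": ""
}
""")
grouped = group_by_operator(sample)
print(sorted(grouped))
```

Grouping by operator makes it easier to compare the same indicator across the source, the intermediate operators, and the sink of one rule.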

Running indicators

Every operator exposes the same running indicators, mainly the following:

records_in_total: The total number of messages read, indicating how many messages are processed after the rule is started.
records_out_total: The total number of messages output, indicating the number of messages correctly processed by the operator.
process_latency_us: The latency of the most recent processing, in microseconds. This is an instantaneous value that reflects the processing performance of the operator. The latency of the overall rule is generally determined by the operator with the largest latency.
buffer_length: The length of the operator buffer. Because operators compute at different speeds, there are buffer queues between operators. A large buffer length means the operator is processing slowly and cannot keep up with the upstream processing speed.
last_invocation: The last run time of the operator.
exceptions_total: The total number of exceptions. Recoverable errors that occur while the operator is running, such as connection interruptions or data format errors, are counted as exceptions and do not interrupt the rule.

After version 1.6.1, we added two exception-related indicators to facilitate debugging and processing of exceptions.

last_exception: The last exception error message.
last_exception_time: The time when the last exception occurred.
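As a rough illustration of how these indicators can be read together, the Python sketch below flags operators that have recorded exceptions or whose buffer is filling up. The helper name `find_suspect_operators`, the threshold of 100, and the sample values are assumptions chosen for the example, not eKuiper defaults or real measurements.

```python
def find_suspect_operators(status: dict, buffer_threshold: int = 100) -> list:
    """Return (key, value) pairs that hint at a slow or failing operator."""
    suspects = []
    for key, value in status.items():
        if key.endswith("_exceptions_total") and value > 0:
            # The operator hit at least one recoverable error.
            suspects.append((key, value))
        elif key.endswith("_buffer_length") and value >= buffer_threshold:
            # The operator is falling behind its upstream operator.
            suspects.append((key, value))
    return suspects


# Hypothetical status values, made up for this example.
status = {
    "status": "running",
    "source_demo_0_buffer_length": 0,
    "source_demo_0_exceptions_total": 0,
    "op_2_project_0_buffer_length": 250,
    "sink_mqtt_0_0_exceptions_total": 3,
}
for key, value in find_suspect_operators(status):
    print(key, value)
```

A suitable buffer threshold depends on the workload, so the value used here is only a placeholder.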

Numerical indicators in these operational indicators can be monitored using Prometheus. In the next section we will describe how to configure the Prometheus service in eKuiper.

Configure eKuiper's Prometheus service

eKuiper ships with a built-in Prometheus service, but it is disabled by default. Users can enable it by modifying the configuration in etc/kuiper.yaml: prometheus is a boolean value, and setting it to true enables the service; prometheusPort configures the port on which the service is exposed.

basic:
  prometheus: true
  prometheusPort: 20499

If you use Docker to start eKuiper, you can also enable the service by configuring environment variables.

docker run -p 9081:9081 -p 20499:20499 -d --name ekuiper -e MQTT_SOURCE__DEFAULT__SERVER="$MQTT_BROKER_ADDRESS" -e KUIPER__BASIC__PROMETHEUS=true lfedge/ekuiper:$tag

In the startup log, you can see information about the service startup, such as:

time="2022-08-22 17:16:50" level=info msg="Serving prometheus metrics on port http://localhost:20499/metrics" file="server/prome_init.go:60"
Serving prometheus metrics on port http://localhost:20499/metrics

Open the address shown in the log, http://localhost:20499/metrics, to view the raw indicators that eKuiper exposes for Prometheus. Once eKuiper has rules running normally, you can search the page for indicators such as kuiper_sink_records_in_total. Users can then configure Prometheus to scrape eKuiper for richer presentations.
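The metrics page is plain text, so it can also be inspected from a script. The Python sketch below downloads the page and keeps only the samples whose names start with kuiper_. It is a simplified reader rather than a full parser: it assumes each sample line ends with its value and carries no trailing timestamp. The helper names are hypothetical, and the label and help text in the sample page are made up for illustration and may not match what eKuiper actually emits.

```python
from urllib.request import urlopen


def fetch_metrics(url: str = "http://localhost:20499/metrics") -> str:
    """Download the raw metrics page (requires a running eKuiper)."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def kuiper_samples(text: str) -> dict:
    """Map each kuiper_* sample (name plus labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, "# HELP" / "# TYPE" comments, and other metrics.
        if not line or line.startswith("#") or not line.startswith("kuiper_"):
            continue
        # Assumption: the line is "name{labels} value" with no timestamp.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Illustrative page content; the label set and help text are assumptions.
sample_page = """\
# HELP kuiper_sink_records_in_total illustrative help text
# TYPE kuiper_sink_records_in_total counter
kuiper_sink_records_in_total{rule="rule1"} 265
go_goroutines 42
"""
print(kuiper_samples(sample_page))
```

Against a live instance, `kuiper_samples(fetch_metrics())` would apply the same filtering to the real page.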

Check status with Prometheus

The previous section exposed the eKuiper status as Prometheus indicators. Next, we configure Prometheus to scrape these indicators and set up basic monitoring.

Install and configure

Go to the Prometheus official website to download the required system version and unzip it.

Modify the configuration file so that it monitors eKuiper. Open prometheus.yml and modify the scrape_configs section as follows:

global:
  scrape_interval:     15s
  evaluation_interval: 15s

rule_files:
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: ekuiper
    static_configs:
      - targets: ['localhost:20499']

A scrape job named ekuiper is defined here, and targets points to the address of the service started in the previous section. Once configured, start Prometheus.

 ./prometheus --config.file=prometheus.yml

After the startup is successful, open http://localhost:9090/ to enter the management console.

Simple monitoring

As a first example, monitor changes in the number of messages received by the sinks of all rules. Enter the name of the indicator to be monitored, such as kuiper_sink_records_in_total, in the search box as shown in the figure, and click Execute to generate the monitoring table. Select Graph to switch to other display methods such as a line graph.

Click Add Panel to monitor more indicators through the same configuration method.
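The same query can also be issued without the console, through the instant query endpoint /api/v1/query of the Prometheus HTTP API. The Python sketch below builds the request URL and converts an instant-vector response into label and value pairs. The helper names are hypothetical, the address assumes the default local setup used in this article, and the sample payload is hand-written to mirror the response layout rather than captured from a real server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(expr: str, base: str = "http://localhost:9090") -> str:
    """Build the instant-query URL for a PromQL expression."""
    return base + "/api/v1/query?" + urlencode({"query": expr})


def extract_values(payload: dict) -> list:
    """Turn an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error"))
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]


def query(expr: str) -> list:
    """Run the query against a live Prometheus server."""
    with urlopen(build_query_url(expr), timeout=5) as response:
        return extract_values(json.load(response))


# Hand-written payload; the label set and numbers are illustrative only.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "kuiper_sink_records_in_total", "job": "ekuiper"},
                "value": [1661159950.0, "265"],
            }
        ],
    },
}
print(build_query_url("kuiper_sink_records_in_total"))
print(extract_values(sample_payload))
```

With Prometheus running, `query("kuiper_sink_records_in_total")` would return the current value of each matching series.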

Summary

This article describes rule status indicators in eKuiper and how to easily monitor these status indicators using Prometheus. Based on this, users can further explore more advanced functions of Prometheus and better realize the operation and maintenance of eKuiper.

Copyright statement: This article is original by EMQ, please indicate the source when reprinting.
Original link: https://www.emqx.com/zh/blog/use-prometheus-to-monitor-ekuiper-rules-status
