
Hi everyone, this is Zhang Jintao.

Prometheus has become the de facto standard for monitoring in the cloud-native era; it was also the second project to graduate from the CNCF.

Currently, Prometheus can meet the monitoring needs of most scenarios and services. I have written several articles about Prometheus and its ecosystem before. In this article we will focus on the Agent mode released in the latest version of Prometheus; concepts and usages unrelated to this topic will only be mentioned briefly.

Pull mode and push mode

As we all know, Prometheus is a pull-based monitoring system, which differs from traditional push-based monitoring systems.

What is pull mode?

Prometheus Pull model

The service to be monitored exposes a metrics endpoint, either by itself or through an exporter, and Prometheus periodically scrapes it. This is pull mode: the monitoring system actively pulls metrics from the target.
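For example, Prometheus itself exposes its metrics at /metrics, and any HTTP client can "pull" them the same way Prometheus does. The sketch below assumes a local Prometheus instance on the default port 9090:

```shell
# A manual "pull": fetch the target's current metrics over HTTP,
# which is exactly what Prometheus does on every scrape.
curl -s http://localhost:9090/metrics | head -n 5
```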

Its counterpart is push mode.

Monitor Push model

The application proactively reports its own metrics, and the monitoring system processes them accordingly. If you want to monitor certain applications in push mode, for example when implementing a metrics endpoint is not practical, you can use Pushgateway to achieve this.
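As a sketch of the push flow, Pushgateway accepts metrics in the Prometheus exposition format over HTTP, with the job name and any grouping labels encoded in the URL path; Prometheus then pulls from the Pushgateway as an ordinary target. The address (the Pushgateway default port 9091), job name, and metric below are illustrative assumptions:

```shell
# Push a gauge for job "batch_job" to a Pushgateway assumed to be
# running locally on its default port 9091.
cat <<'EOF' | curl --data-binary @- http://127.0.0.1:9091/metrics/job/batch_job/instance/node01
# TYPE batch_duration_seconds gauge
batch_duration_seconds 42.5
EOF
```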

The debate over whether pull mode or push mode is better has been going on for a long time; interested readers can research it on their own.

The discussion so far has mainly concerned the interaction between a single Prometheus instance and application services. In this article we will look, from a higher-level or global perspective, at how Prometheus currently handles HA, persistence, and clustering.

Prometheus HA/persistence/clustering solutions

In large-scale production environments it is rare to run only a single Prometheus instance. Whether for high availability, data persistence, or to give users a more accessible global view, it is common to run multiple Prometheus instances.

At present, Prometheus offers three main approaches to aggregating the data of multiple instances and providing users with a unified global view:

  • Federation: the earliest built-in data aggregation solution in Prometheus. A central Prometheus instance scrapes metrics from the leaf Prometheus instances. Under this scheme the original timestamps of the metrics are retained, and the setup is relatively simple;
  • Prometheus Remote Read: supports reading raw metrics from remote storage (there are many remote-storage options to choose from); after reading, the data can be aggregated and presented to the user;
  • Prometheus Remote Write: supports writing the metrics collected by Prometheus to remote storage; users then read data directly from the remote storage and get a global view.
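As a sketch of the Federation approach, the central Prometheus scrapes the /federate endpoint of each leaf instance and selects which series to pull via match[] parameters. The leaf address below is a placeholder:

```yaml
# Scrape config for the central Prometheus in a federation setup.
scrape_configs:
  - job_name: "federate"
    scrape_interval: 15s
    honor_labels: true          # keep the labels set by the leaf instance
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job="prometheus"}'  # series selector; adjust as needed
    static_configs:
      - targets: ["leaf-prometheus:9090"]  # placeholder leaf address
```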

Prometheus Agent mode

Prometheus Agent is a feature available starting from Prometheus v2.32.0. It uses the Prometheus Remote Write mechanism described above to write the data of an Agent-mode Prometheus instance to remote storage, and relies on the remote storage to provide a global view.

Prerequisites

Because it uses Prometheus Remote Write, we need to prepare a "remote storage" for the centralized storage of metrics. Here we use Thanos to provide this capability; other solutions, such as Cortex or InfluxDB, would also work.

Prepare for remote storage

Here we deploy directly with the latest version of Thanos's container image, using the host network to make testing easier.

After executing the following commands, Thanos Receive will listen on http://127.0.0.1:10908/api/v1/receive for remote-write requests.

➜  cd prometheus
➜  prometheus docker run -d --rm \
    -v $(pwd)/receive-data:/receive/data \
    --net=host \
    --name receive \
    quay.io/thanos/thanos:v0.23.1 \
    receive \
    --tsdb.path "/receive/data" \
    --grpc-address 127.0.0.1:10907 \
    --http-address 127.0.0.1:10909 \
    --label "receive_replica=\"0\"" \
    --label "receive_cluster=\"moelove\"" \
    --remote-write.address 127.0.0.1:10908
59498d43291b705709b3f360d28af81d5a8daba11f5629bb11d6e07532feb8b6
➜  prometheus docker ps -l
CONTAINER ID   IMAGE                           COMMAND                  CREATED          STATUS          PORTS     NAMES
59498d43291b   quay.io/thanos/thanos:v0.23.1   "/bin/thanos receive…"   21 seconds ago   Up 20 seconds             receive

Prepare query components

Next, we start a Thanos Query component and connect it to the Receive component so we can query the written data.

➜  prometheus docker run -d --rm \
--net=host \
--name query \
quay.io/thanos/thanos:v0.23.1 \
query \
--http-address "0.0.0.0:39090" \
--store "127.0.0.1:10907"
10c2b1bf2375837dbda16d09cee43d95787243f6dcbee73f4159a21b12d36019
➜  prometheus docker ps -l
CONTAINER ID   IMAGE                           COMMAND                  CREATED         STATUS         PORTS     NAMES
10c2b1bf2375   quay.io/thanos/thanos:v0.23.1   "/bin/thanos query -…"   4 seconds ago   Up 3 seconds             query

Note: here we configured the --store flag, which points to the Receive component started earlier.

Open http://127.0.0.1:39090/stores in a browser. If everything went well, you should see that the Receive component has been registered as a store.

Deploy Prometheus Agent mode

Here I downloaded the latest v2.32.0 binary directly from Prometheus's Releases page. After decompressing it, you will find that the directory contents are the same as in previous versions.

This is because Agent mode is now built into the Prometheus binary and can be enabled with the --enable-feature=agent flag.

Prepare configuration file

We need to prepare a configuration file for it. Note that remote_write must be configured, and sections such as alerting must not be present:

global:
  scrape_interval: 15s
  external_labels:
    cluster: moelove
    replica: 0

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

remote_write:
- url: 'http://127.0.0.1:10908/api/v1/receive'

Save the configuration file as prometheus.yml.
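The file can be validated before starting with promtool, which ships in the same release archive as the prometheus binary (this sketch assumes you run it from the unpacked release directory):

```shell
# Sanity-check the configuration file; promtool reports "SUCCESS"
# for a valid config and prints the errors otherwise.
./promtool check config prometheus.yml
```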

Start up

We set its log level to debug to make it easier to observe some of the details:

➜  ./prometheus --enable-feature=agent --log.level=debug --config.file="prometheus.yml" 
ts=2021-11-27T19:03:15.861Z caller=main.go:195 level=info msg="Experimental agent mode enabled."
ts=2021-11-27T19:03:15.861Z caller=main.go:515 level=info msg="Starting Prometheus" version="(version=2.32.0-beta.0, branch=HEAD, revision=c32725ba7873dbaa39c223410043430ffa5a26c0)"
ts=2021-11-27T19:03:15.861Z caller=main.go:520 level=info build_context="(go=go1.17.3, user=root@da630543d231, date=20211116-11:23:14)"
ts=2021-11-27T19:03:15.861Z caller=main.go:521 level=info host_details="(Linux 5.14.18-200.fc34.x86_64 #1 SMP Fri Nov 12 16:48:10 UTC 2021 x86_64 moelove (none))"
ts=2021-11-27T19:03:15.861Z caller=main.go:522 level=info fd_limits="(soft=1024, hard=524288)"
ts=2021-11-27T19:03:15.861Z caller=main.go:523 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2021-11-27T19:03:15.862Z caller=web.go:546 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2021-11-27T19:03:15.862Z caller=main.go:980 level=info msg="Starting WAL storage ..."
ts=2021-11-27T19:03:15.863Z caller=tls_config.go:195 level=info component=web msg="TLS is disabled." http2=false
ts=2021-11-27T19:03:15.864Z caller=db.go:306 level=info msg="replaying WAL, this may take a while" dir=data-agent/wal
ts=2021-11-27T19:03:15.864Z caller=db.go:357 level=info msg="WAL segment loaded" segment=0 maxSegment=0
ts=2021-11-27T19:03:15.864Z caller=main.go:1001 level=info fs_type=9123683e
ts=2021-11-27T19:03:15.864Z caller=main.go:1004 level=info msg="Agent WAL storage started"
ts=2021-11-27T19:03:15.864Z caller=main.go:1005 level=debug msg="Agent WAL storage options" WALSegmentSize=0B WALCompression=true StripeSize=0 TruncateFrequency=0s MinWALTime=0s MaxWALTime=0s
ts=2021-11-27T19:03:15.864Z caller=main.go:1129 level=info msg="Loading configuration file" filename=prometheus.yml
ts=2021-11-27T19:03:15.865Z caller=dedupe.go:112 component=remote level=info remote_name=e6fa2a url=http://127.0.0.1:10908/api/v1/receive msg="Starting WAL watcher" queue=e6fa2a
ts=2021-11-27T19:03:15.865Z caller=dedupe.go:112 component=remote level=info remote_name=e6fa2a url=http://127.0.0.1:10908/api/v1/receive msg="Starting scraped metadata watcher"
ts=2021-11-27T19:03:15.865Z caller=dedupe.go:112 component=remote level=info remote_name=e6fa2a url=http://127.0.0.1:10908/api/v1/receive msg="Replaying WAL" queue=e6fa2a
ts=2021-11-27T19:03:15.865Z caller=dedupe.go:112 component=remote level=debug remote_name=e6fa2a url=http://127.0.0.1:10908/api/v1/receive msg="Tailing WAL" lastCheckpoint= checkpointIndex=0 currentSegment=0 lastSegment=0
ts=2021-11-27T19:03:15.865Z caller=dedupe.go:112 component=remote level=debug remote_name=e6fa2a url=http://127.0.0.1:10908/api/v1/receive msg="Processing segment" currentSegment=0
ts=2021-11-27T19:03:15.877Z caller=manager.go:196 level=debug component="discovery manager scrape" msg="Starting provider" provider=static/0 subs=[prometheus]
ts=2021-11-27T19:03:15.877Z caller=main.go:1166 level=info msg="Completed loading of configuration file" filename=prometheus.yml totalDuration=12.433099ms db_storage=361ns remote_storage=323.413µs web_handler=247ns query_engine=157ns scrape=11.609215ms scrape_sd=248.024µs notify=3.216µs notify_sd=6.338µs rules=914ns
ts=2021-11-27T19:03:15.877Z caller=main.go:897 level=info msg="Server is ready to receive web requests."
ts=2021-11-27T19:03:15.877Z caller=manager.go:214 level=debug component="discovery manager scrape" msg="Discoverer channel closed" provider=static/0
ts=2021-11-27T19:03:28.196Z caller=dedupe.go:112 component=remote level=info remote_name=e6fa2a url=http://127.0.0.1:10908/api/v1/receive msg="Done replaying WAL" duration=12.331255772s
ts=2021-11-27T19:03:30.867Z caller=dedupe.go:112 component=remote level=debug remote_name=e6fa2a url=http://127.0.0.1:10908/api/v1/receive msg="runShard timer ticked, sending buffered data" samples=230 exemplars=0 shard=0
ts=2021-11-27T19:03:35.865Z caller=dedupe.go:112 component=remote level=debug remote_name=e6fa2a url=http://127.0.0.1:10908/api/v1/receive msg=QueueManager.calculateDesiredShards dataInRate=23 dataOutRate=23 dataKeptRatio=1 dataPendingRate=0 dataPending=0 dataOutDuration=0.0003201718 timePerSample=1.3920513043478261e-05 desiredShards=0.0003201718 highestSent=1.638039808e+09 highestRecv=1.638039808e+09
ts=2021-11-27T19:03:35.865Z caller=dedupe.go:112 component=remote level=debug remote_name=e6fa2a url=http://127.0.0.1:10908/api/v1/receive msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.0003201718 upperBound=1.3
ts=2021-11-27T19:03:45.866Z caller=dedupe.go:112 component=remote level=debug remote_name=e6fa2a url=http://127.0.0.1:10908/api/v1/receive msg=QueueManager.calculateDesiredShards dataInRate=23.7 dataOutRate=18.4 dataKeptRatio=1 dataPendingRate=5.300000000000001 dataPending=355.5 dataOutDuration=0.00025613744 timePerSample=1.3920513043478263e-05 desiredShards=0.00037940358300000006 highestSent=1.638039808e+09 highestRecv=1.638039823e+09
ts=2021-11-27T19:03:45.866Z caller=dedupe.go:112 component=remote level=debug remote_name=e6fa2a url=http://127.0.0.1:10908/api/v1/receive msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.00037940358300000006 upperBound=1.3
ts=2021-11-27T19:03:45.871Z caller=dedupe.go:112 component=remote level=debug remote_name=e6fa2a url=http://127.0.0.1:10908/api/v1/receive msg="runShard timer ticked, sending buffered data" samples=265 exemplars=0 shard=0

As you can see from the log, it writes to http://127.0.0.1:10908/api/v1/receive, which is the Thanos Receive we deployed at the beginning.

Query data

Open the Thanos Query UI that we deployed at the beginning and enter any metric to query; you should see the expected results.
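Besides the UI, Thanos Query also implements the Prometheus-compatible HTTP query API, so the written data can be checked from the command line as well (assuming the Query component above is listening on port 39090):

```shell
# Query the "up" metric through Thanos Query's Prometheus-compatible API;
# the response is JSON containing the matching series and their values.
curl -s 'http://127.0.0.1:39090/api/v1/query?query=up'
```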

But if we directly access the UI of a Prometheus instance running in Agent mode, an error is reported and no query can be performed. This is because when Prometheus runs in Agent mode, its UI query capabilities, alerting, and local storage are disabled by default.

Summary

This article was a hands-on walkthrough of Prometheus Agent: receiving the metrics reported by a Prometheus Agent via Thanos Receive, and then querying the results through Thanos Query.

Prometheus Agent does not fundamentally change the way Prometheus collects metrics; it still uses pull mode.

Its main use cases are Prometheus HA, data persistence, and clustering. Architecturally it overlaps slightly with some existing solutions, but it has several advantages:

  • Agent mode is a built-in Prometheus feature;
  • A Prometheus instance with Agent mode enabled consumes fewer resources and has a single responsibility, which benefits scaling out in edge scenarios;
  • With Agent mode enabled, a Prometheus instance can almost be regarded as a stateless application, which makes it easier to scale.

The official release will be out soon. Will you give it a try?


Welcome to subscribe to my article public account【MoeLove】

