This article was first published on 泊浮目's Jianshu: https://www.jianshu.com/u/204b8aaab8ba
| Version | Date | Remark |
| --- | --- | --- |
| 1.0 | 2021.8.14 | First published |

0. Background

Recently, I used PushGateway (hereinafter referred to as PGW) to add monitoring to our stream-processing components, and stepped on quite a few pits along the way. Let me share them.

1. Why Push (PGW)

Our previous implementation used pull: each process exposes a service port following the Prometheus (hereinafter referred to as Prom) protocol, and Prom pulls the data from it.

But there is a problem with this: ports need to be allocated. Our team previously tried several cumbersome implementations, such as distributed locks and multiple state stores, but still could not avoid port leakage and waste (the topology's high-availability mechanism may move it between machines, so the port assigned on the previous machine becomes useless). We could also monitor the topology lifecycle, but that is by no means easy: at larger scale, thousands of topologies are normal, and effectively monitoring the lifecycle of thousands of topologies is a hard problem in itself.
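To make the pull model concrete, here is a minimal sketch of what each monitored process has to do: expose a `/metrics` HTTP endpoint in the Prometheus text exposition format on some allocated port. The metric name and values are illustrative, not from our actual system; a real service would use a proper client library rather than a hand-rolled handler.

```python
# Minimal sketch of the pull model: the monitored process serves /metrics
# in the Prometheus text exposition format, and the Prom server scrapes it.
# Metric names and values here are illustrative.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS = {"topology_inflight_records": 42}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        # One "<name> <value>" line per metric, per the exposition format.
        body = "".join(f"{name} {value}\n" for name, value in METRICS.items()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):  # silence per-request logging
        pass


# Port 0 asks the OS for any free port -- exactly the allocation problem
# described above: Prom must somehow learn which port each process got.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
print(f"scrape target: 127.0.0.1:{server.server_address[1]}")
```

The pain point is visible in the last lines: every process gets its own port, and something has to tell Prom where to scrape.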

A colleague told me that k8s might solve this problem; I plan to follow up on introducing that technology stack later.

We just wanted monitoring, without having to care about anything else.

Then comes the old debate: push or pull, which is better? My view is that arguing about this apart from a concrete scenario is meaningless. In our scenario, push is more appropriate.

After all, push requires the monitored service to know the address of the monitoring system, so this information must be configured in the monitored service, which therefore depends on the monitoring service to some extent. Pull, on the other hand, requires the monitoring system to know the addresses of all monitored services, so every time a monitored service is added, the monitoring system must learn about it somehow: for example, Prom supports obtaining targets dynamically from a service-discovery system, and Flink supports locating monitored services through a port range.

As for other comparisons between the push and pull models, see the table below and weigh the dimensions against your own scenario:

| Dimension | Push model | Pull model |
| --- | --- | --- |
| Service discovery | Faster. Agents can send data automatically on startup, so discovery speed is independent of the number of agents. | Slower. New services are discovered by periodically scanning the address space, so discovery speed depends on the number of agents. |
| Scalability | Better. Only agents need to be deployed, and agents are generally stateless. | Worse. The monitoring system's workload grows linearly with the number of agents. |
| Security | Better. Agents do not listen for network connections, so they resist remote attacks. | Worse. Exposed to remote access and denial-of-service attacks. |
| Operational complexity | Better. Agents only need to know the push interval and the collector's address; the firewall only needs to allow one-way measurement traffic from agent to collector. | Worse. The monitoring system must be configured with the list of agents to poll, the security credentials to access them, and the set of metrics to retrieve; the firewall must allow bidirectional communication between poller and agent. |
| Latency | Better. Delivery is timely, and many push protocols (such as sFlow) are implemented on top of UDP, providing non-blocking, low-latency measurement transfer. | Worse. Lower real-time performance. |

The official Prom website also states that PGW is not applicable in most cases, with one exception:

Usually, the only valid use case for the Pushgateway is for capturing the outcome of a service-level batch job. A "service-level" batch job is one which is not semantically related to a specific machine or job instance (for example, a batch job that deletes a number of users for an entire service). Such a job's metrics should not include a machine or instance label to decouple the lifecycle of specific machines or instances from the pushed metrics. This decreases the burden for managing stale metrics in the Pushgateway. See also the best practices for monitoring batch jobs.

In the best practices for monitoring batch jobs, it is also mentioned:

There is a fuzzy line between offline-processing and batch jobs, as offline processing may be done in batch jobs. Batch jobs are distinguished by the fact that they do not run continuously, which makes scraping them difficult.

Its GitHub repository includes this:

The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus.

My business system does have ephemeral jobs (jobs that may not exist long enough to be scraped), which is also an important reason why I chose PGW.
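For reference, here is what such an ephemeral job's push looks like at the protocol level: a PUT of text-exposition-format metrics to the Pushgateway's `/metrics/job/<job_name>` endpoint, as described in the PGW README. The host, job name, and metric are illustrative assumptions; a real job would typically use a Prometheus client library instead of raw HTTP.

```python
# Sketch of the push model via the Pushgateway REST API: an ephemeral
# batch job PUTs its final metrics to /metrics/job/<job_name>.
# Host, job name, and metric below are illustrative assumptions.
import urllib.request

PGW = "http://pushgateway.example.com:9091"  # hypothetical address
job_name = "user_cleanup_batch"

# Body is the Prometheus text exposition format.
payload = (
    "# TYPE users_deleted_total counter\n"
    "users_deleted_total 1234\n"
).encode()

url = f"{PGW}/metrics/job/{job_name}"
req = urllib.request.Request(url, data=payload, method="PUT")
req.add_header("Content-Type", "text/plain; version=0.0.4")

# urllib.request.urlopen(req)  # uncomment to push to a real Pushgateway
print(url)
```

PGW then holds these series and exposes them for Prom to scrape, which is exactly why stale series become a problem, as the next section shows.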

2. What pits did we step on?

Originally, after connecting to PGW, all the expected data was there. Then, during testing, our QA colleagues found that the monitoring data suddenly disappeared. I saw that a large number of topologies had been created for streaming tasks and PGW had exited outright, with the words out of memory in its log.

After two more tests, we found that PGW's memory and CPU consumption grew heavier and heavier as the number of topologies increased.

So I remembered what was mentioned on the official website:

The Pushgateway never forgets series pushed to it and will expose them to Prometheus forever unless those series are manually deleted via the Pushgateway's API.
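The manual deletion the docs refer to goes through the same grouping-key URL scheme as the push: a DELETE on `/metrics/job/<job>[/<label>/<value>...]` removes every metric pushed under that grouping key. A sketch (host and names are illustrative assumptions):

```python
# Sketch of manually deleting a stale metric group via the Pushgateway
# REST API: DELETE /metrics/job/<job>[/<label>/<value>...] removes all
# metrics pushed under that grouping key. Host and names are illustrative.
import urllib.request

PGW = "http://pushgateway.example.com:9091"  # hypothetical address
grouping_key = {"job": "myTopology", "instance": "host-1"}  # "job" must come first

path = "/metrics/" + "/".join(f"{k}/{v}" for k, v in grouping_key.items())
req = urllib.request.Request(PGW + path, method="DELETE")

# urllib.request.urlopen(req)  # uncomment against a real Pushgateway
print(req.full_url)
```

Note that the DELETE must match the grouping key exactly; deleting `job/myTopology` does not remove series pushed under `job/myTopology/instance/host-1`.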

But I had clearly configured deleteOnShutdown. The official explanation is: Specifies whether to delete metrics from the PushGateway on shutdown. Yet when I re-ran PGW, I found that the related metrics had not been deleted!
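For context, deleteOnShutdown is an option of Flink's PrometheusPushGatewayReporter, set in flink-conf.yaml roughly like this (host, port, and job name are illustrative; option names follow the Flink metrics reporter documentation):

```yaml
# Flink PrometheusPushGatewayReporter configuration (illustrative values)
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: pushgateway.example.com
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: myTopology
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: true
```

The catch is that the deletion is best-effort: if the Flink process dies abnormally (or shuts down in a way where the reporter cannot complete the delete), the series stay in PGW forever.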

Our team did some searching and found an Issue: https://issues.apache.org/jira/browse/FLINK-20691

Some people think PGW should implement a TTL, while the PGW maintainers do not consider that a good approach. I think this is something Flink should fix itself; I don't know why it has not been fixed, and the documentation on the official website gives no hint of this pit.

I have submitted a patch to the Flink documentation and hope it will be merged soon: https://github.com/apache/flink/pull/16823 .

3. Summary

This article shared the pits our team stepped on when using PushGateway with Flink, and discussed why we chose PGW in the first place. Going forward, I plan to look at InfluxDB and use it as the push endpoint instead of PGW. I have also noticed that the ecosystem of the new InfluxDB version is quite good, providing dashboards, data visualization, and alerting. It is no longer a simple time-series database; combined with its own ecosystem, it looks more and more like Prom + Grafana.
