This article was first published on 泊浮目's Jianshu page: https://www.jianshu.com/u/204b8aaab8ba
Version  Date       Remark
1.0      2021.10.8  Article first published
1.1      2022.3.9   Fix typo
1.2      2022.7.3   Fix typo

0. Preface

A while ago I did some monitoring-related development work and ran into a few problems along the way, so I read the relevant parts of the Flink code. I found some good designs while reading, and wrote this article to organize them.

The source code in this article is based on Flink 1.13.2.

1. Pluggable extensions

The Flink community documents a number of ready-made Reporters on the official website. If we need a custom Reporter, we can implement our own by following the same specification.

Flink's code provides a reflection mechanism for instantiating a MetricReporter: the implementation class must be public, must not be abstract, and must have a no-argument constructor.
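The three requirements follow from how reflective instantiation works in Java. The following is a minimal sketch in plain Java, not Flink's actual loading code: `Reporter`, `ConsoleReporter`, `AbstractReporter`, and `instantiate` are hypothetical names used only for illustration, and `Reporter` merely stands in for Flink's MetricReporter.

```java
import java.lang.reflect.Modifier;

final class ReflectiveReporterSketch {

    /** Hypothetical stand-in for Flink's MetricReporter interface. */
    public interface Reporter {
        String name();
    }

    /** Satisfies all three rules: public, concrete, no-argument constructor. */
    public static class ConsoleReporter implements Reporter {
        public ConsoleReporter() {}

        @Override
        public String name() {
            return "console";
        }
    }

    /** Violates the "not abstract" rule, so it cannot be instantiated. */
    public abstract static class AbstractReporter implements Reporter {}

    /** Instantiates a reporter by class name, checking the rules up front. */
    static Reporter instantiate(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            int modifiers = clazz.getModifiers();
            if (!Reporter.class.isAssignableFrom(clazz)) {
                throw new IllegalArgumentException(className + " is not a Reporter");
            }
            if (!Modifier.isPublic(modifiers) || Modifier.isAbstract(modifiers)) {
                throw new IllegalArgumentException(className + " must be public and non-abstract");
            }
            // In this sketch, getConstructor() looks up the public no-argument constructor.
            return (Reporter) clazz.getConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Reporter reporter = instantiate(ConsoleReporter.class.getName());
        System.out.println(reporter.name());
    }
}
```

A class that breaks any of the rules fails before or during `newInstance()`, which is why the rules have to be stated as a contract for custom Reporters.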

The core code is ReporterSetup#getAllReporterFactories:

 private static Iterator<MetricReporterFactory> getAllReporterFactories(
            @Nullable PluginManager pluginManager) {
        final Iterator<MetricReporterFactory> factoryIteratorSPI =
                ServiceLoader.load(MetricReporterFactory.class).iterator();
        final Iterator<MetricReporterFactory> factoryIteratorPlugins =
                pluginManager != null
                        ? pluginManager.load(MetricReporterFactory.class)
                        : Collections.emptyIterator();

        return Iterators.concat(factoryIteratorPlugins, factoryIteratorSPI);
    }

The code obtains the MetricReporterFactory implementations through Java's SPI mechanism (and through the plugin manager, if one is present); in essence, the classes are located through a ClassLoader.

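 |-- ReporterSetup
     \-- fromConfiguration // when the cluster starts, reads the monitoring configuration and initializes the related classes
         \-- loadAvailableReporterFactories // loads the available Reporter factories
             \-- getAllReporterFactories // core code: obtains the Reporter factories through SPI and the ClassLoader mechanism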
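The discovery step can be sketched with the JDK alone. This is only an illustration under assumptions: `ReporterFactory` and `discover` are hypothetical names standing in for MetricReporterFactory and the Flink method, the plugin source is simulated by an ordinary iterator, and a plain list replaces the `Iterators.concat` call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.ServiceLoader;

final class FactoryDiscoverySketch {

    /** Hypothetical stand-in for MetricReporterFactory. */
    public interface ReporterFactory {
        String name();
    }

    /** Collects factories from the plugin source first, then from the classpath via SPI. */
    static List<ReporterFactory> discover(Iterator<ReporterFactory> fromPlugins) {
        List<ReporterFactory> all = new ArrayList<>();
        fromPlugins.forEachRemaining(all::add);
        // ServiceLoader looks for META-INF/services/<interface name> files that are
        // visible to the thread context class loader and instantiates the listed classes.
        ServiceLoader.load(ReporterFactory.class).iterator().forEachRemaining(all::add);
        return all;
    }

    public static void main(String[] args) {
        ReporterFactory pluginFactory = () -> "from-plugin";
        Iterator<ReporterFactory> plugins = Collections.singletonList(pluginFactory).iterator();
        discover(plugins).forEach(factory -> System.out.println(factory.name()));
    }
}
```

Without a provider file on the classpath the SPI part contributes nothing, so only the plugin-provided factory is returned. Adding a jar with a matching `META-INF/services` entry would extend the result without any change to `discover`.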

2. Built-in loose coupling

As mentioned above, the community provides some common monitoring Reporters. In the code, this is essentially an implementation of the factory pattern.

 /**
 * {@link MetricReporter} factory.
 *
 * <p>Reporters that can be instantiated with a factory automatically qualify for being loaded as a
 * plugin, so long as the reporter jar is self-contained (excluding Flink dependencies) and contains
 * a {@code META-INF/services/org.apache.flink.metrics.reporter.MetricReporterFactory} file
 * containing the qualified class name of the factory.
 *
 * <p>Reporters that previously relied on reflection for instantiation can use the {@link
 * InstantiateViaFactory} annotation to redirect reflection-base instantiation attempts to the
 * factory instead.
 */
public interface MetricReporterFactory {

    /**
     * Creates a new metric reporter.
     *
     * @param properties configured properties for the reporter
     * @return created metric reporter
     */
    MetricReporter createMetricReporter(final Properties properties);
}

To integrate a monitoring system, it is enough to implement the corresponding factory method. The current implementations are:

  • org.apache.flink.metrics.graphite.GraphiteReporterFactory
  • org.apache.flink.metrics.influxdb.InfluxdbReporterFactory
  • org.apache.flink.metrics.prometheus.PrometheusReporter
  • org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
  • org.apache.flink.metrics.statsd.StatsDReporterFactory
  • org.apache.flink.metrics.datadog.DatadogHttpReporterFactory
  • org.apache.flink.metrics.slf4j.Slf4jReporterFactory

Whenever the community needs to integrate a new Reporter, it only needs to implement MetricReporterFactory. The upper layer only sees MetricReporter and is independent of any specific implementation, which is a typical anti-corruption design.
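The decoupling can be illustrated with a small stand-alone sketch. The interfaces below are simplified, hypothetical stand-ins (`Reporter`, `ReporterFactory`, `PrefixReporterFactory`, and the `prefix` property are not part of Flink); only the shape of the factory method, which takes `Properties` and returns a reporter, mirrors the interface shown above.

```java
import java.util.Properties;

final class ReporterFactorySketch {

    /** Hypothetical, simplified stand-in for MetricReporter. */
    interface Reporter {
        String report(String metric, long value);
    }

    /** Hypothetical stand-in for MetricReporterFactory. */
    interface ReporterFactory {
        Reporter createReporter(Properties properties);
    }

    /** One concrete integration; the upper layer never references this type directly. */
    static final class PrefixReporterFactory implements ReporterFactory {
        @Override
        public Reporter createReporter(Properties properties) {
            String prefix = properties.getProperty("prefix", "flink");
            return (metric, value) -> prefix + "." + metric + "=" + value;
        }
    }

    /** Upper layer: depends only on the two interfaces, not on any implementation. */
    static String reportOnce(ReporterFactory factory, Properties properties) {
        Reporter reporter = factory.createReporter(properties);
        return reporter.report("numRecordsIn", 42L);
    }

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.setProperty("prefix", "job");
        System.out.println(reportOnce(new PrefixReporterFactory(), properties));
    }
}
```

Swapping in another factory changes what `reportOnce` produces without touching `reportOnce` itself, which is the point of keeping the upper layer unaware of concrete Reporters.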

3. Fail safe

In a stream-computing job, monitoring is side-path logic. If it runs into a problem, should that affect the main logic? The answer is no.

In MetricRegistryImpl (as the name implies, all Reporters are registered with this class), the constructor hands the relevant MetricReporters to a thread pool, which has them report data periodically.

 |-- MetricRegistryImpl
  \-- constructor
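The idea of periodic reporting on a separate thread pool can be sketched with the JDK's `ScheduledExecutorService`. This is an illustration under assumptions, not the MetricRegistryImpl code: `failSafe` and `healthyReporterKeepsRunning` are hypothetical helpers, and the 10 ms interval is arbitrary.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class FailSafeReportingSketch {

    /** Wraps a report call so that any error is counted instead of escaping the task. */
    static Runnable failSafe(Runnable report, AtomicInteger failures) {
        return () -> {
            try {
                report.run();
            } catch (Throwable t) {
                // A real system would log here; the schedule itself keeps running.
                failures.incrementAndGet();
            }
        };
    }

    /** Returns true if the healthy reporter reported three times next to a broken one. */
    static boolean healthyReporterKeepsRunning(AtomicInteger failures) {
        CountDownLatch healthyReports = new CountDownLatch(3);
        Runnable broken = () -> {
            throw new IllegalStateException("monitoring backend unreachable");
        };
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            scheduler.scheduleWithFixedDelay(
                    failSafe(broken, failures), 0, 10, TimeUnit.MILLISECONDS);
            scheduler.scheduleWithFixedDelay(
                    failSafe(healthyReports::countDown, failures), 0, 10, TimeUnit.MILLISECONDS);
            return healthyReports.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        AtomicInteger failures = new AtomicInteger();
        boolean ok = healthyReporterKeepsRunning(failures);
        System.out.println("healthy reporter kept running: " + ok
                + ", contained failures: " + failures.get());
    }
}
```

The wrapper matters because a periodic task scheduled on a `ScheduledExecutorService` is not run again once one execution throws. Containing the error inside the task keeps a broken reporter from silently stopping, and the caller's thread is never involved in reporting at all.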

WebMonitorEndpoint also uses thread pools. This class provides a REST API for querying metrics. Requests to other components are sent asynchronously through Akka, and the replies are handled by callbacks that run on a thread pool.

 |-- WebMonitorEndpoint
  \-- start
    \-- initializeHandlers
      \-- new JobConfigHandler
 |-- AbstractExecutionGraphHandler
  \-- handleRequest
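The pattern of handling an asynchronous reply on a dedicated executor can be sketched with `CompletableFuture`. This is a simplified illustration and not the handler code itself: the `handleRequest` below is a hypothetical method, the remote call is simulated by a future passed in by the caller, and the JSON strings are made up for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class AsyncHandlerSketch {

    /**
     * Turns the (simulated) asynchronous reply of another component into a response.
     * The transformation runs on the given executor, and a failed reply becomes an
     * error response instead of an exception on the calling thread.
     */
    static CompletableFuture<String> handleRequest(
            CompletableFuture<String> remoteReply, Executor executor) {
        return remoteReply
                .thenApplyAsync(jobName -> "{\"job\":\"" + jobName + "\"}", executor)
                .exceptionally(error -> "{\"error\":\"unavailable\"}");
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<String> reply = CompletableFuture.supplyAsync(() -> "wordcount");
            System.out.println(handleRequest(reply, executor).join());
        } finally {
            executor.shutdown();
        }
    }
}
```

The calling thread only wires up the callbacks and returns immediately, so a slow or failing query never blocks it.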

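Because reporting and request handling run on their own thread pools, a problem in monitoring stays off the main processing path. This is a typical fail-safe design.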

4. More than just Push

In Flink, monitoring data can not only be pushed but also pulled, and the implementation is very simple.

MetricQueryService implements MetricQueryServiceGateway, which means it can be called remotely.

Tracing the source code shows how metric data reaches it:

 |-- AbstractMetricGroup
  \-- counter
    |-- MetricRegistryImpl
      \-- register
        |-- MetricQueryService
          \-- addMetric
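The pull model in the trace above can be sketched as a service that only keeps references to registered metrics and reads their values when somebody asks. This is a simplified, hypothetical illustration: `QueryService`, `addMetric`, and `queryMetrics` here are plain Java stand-ins, with `AtomicLong` playing the role of a counter, and no remote call is involved.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class PullQuerySketch {

    /** Hypothetical, simplified stand-in for a metric query service. */
    static final class QueryService {
        private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

        /** Called at registration time; the service only stores a reference. */
        AtomicLong addMetric(String name) {
            return counters.computeIfAbsent(name, key -> new AtomicLong());
        }

        /** Called by whoever wants the data (pull); values are read at query time. */
        Map<String, Long> queryMetrics() {
            Map<String, Long> snapshot = new TreeMap<>();
            counters.forEach((name, counter) -> snapshot.put(name, counter.get()));
            return snapshot;
        }
    }

    public static void main(String[] args) {
        QueryService service = new QueryService();
        AtomicLong recordsIn = service.addMetric("numRecordsIn");
        recordsIn.addAndGet(3);
        System.out.println(service.queryMetrics());
    }
}
```

Nothing is sent anywhere until `queryMetrics` is called, which is the difference from the periodic push done by Reporters.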

The WebMonitorEndpoint mentioned above works the same way, except that it is built on a REST API; it also provides a pull strategy.

