Using Google mtail with Prometheus and Grafana for custom log monitoring

Preface

mtail is a log extraction tool developed by Google that is lighter than ELK, EFK, or Grafana Loki. My requirement was only to collect data from production logs, so I chose the simpler mtail and combined it with Prometheus and Grafana to monitor custom log data.

Update history

August 04, 2021 - First draft

Original article: https://wsgzao.github.io/post/mtail/

Common log monitoring solutions

For open source monitoring of business logs, I mainly recommend the following three options.

Note that ELK is gradually being replaced by EFK.

1: ELK. "ELK" is an acronym for three open source projects: Elasticsearch, Logstash, and Kibana.

Elasticsearch is a search and analysis engine.

Logstash is a server-side data processing pipeline that can collect data from multiple sources at the same time, transform the data, and then send the data to a "repository" such as Elasticsearch.

Kibana allows users to use graphs and charts to visualize data in Elasticsearch.

2: Loki. Loki is a recent open source project from the Grafana Labs team: a horizontally scalable, highly available, multi-tenant log aggregation system.

3: mtail. mtail is a log extraction tool developed by Google that extracts metrics from application logs and exports them to a time series database or time series calculator. It reads application logs in real time, analyzes them with scripts you write yourself, and produces time series metrics.

The best tool is the one that fits your needs. EFK and Loki are both full-featured log collection systems, each with its own strengths, whereas mtail only extracts metrics from logs.

I have recorded some related experience on my blog, which you can use as a reference:

Installing and using Scribe: https://wsgzao.github.io/post/scribe/

Building a centralized log analysis platform with ELK (Elasticsearch + Logstash + Kibana): https://wsgzao.github.io/post/elk/

The difference between the open source log management solutions ELK and EFK: https://wsgzao.github.io/post/efk/

Grafana Loki, an open source log aggregation system, as a replacement for ELK or EFK: https://wsgzao.github.io/post/loki/

Introduction to mtail

mtail - extract whitebox monitoring data from application logs for collection into a timeseries database

mtail is a tool for extracting metrics from application logs to be exported into a timeseries database or timeseries calculator for alerting and dashboarding.

It fills a monitoring niche by being the glue between applications that do not export their own internal state (other than via logs) and existing monitoring systems, such that system operators do not need to patch those applications to instrument them or writing custom extraction code for every such application.

The extraction is controlled by mtail programs which define patterns and actions:

# simple line counter
counter lines_total
/$/ {
  lines_total++
}

Metrics are exported for scraping by a collector as JSON or Prometheus format over HTTP, or can be periodically sent to a collectd, StatsD, or Graphite collector socket.

In short, mtail turns log lines into time series metrics for alerting and dashboards: it tails the application log in real time and applies the scripts you write to each line.

https://github.com/google/mtail

mtail installation

Download: https://github.com/google/mtail/releases

# check latest version from github
wget https://github.com/google/mtail/releases/download/v3.0.0-rc47/mtail_3.0.0-rc47_Linux_x86_64.tar.gz

tar xf mtail_3.0.0-rc47_Linux_x86_64.tar.gz
# optionally copy mtail to /usr/local/bin
# cp mtail /usr/local/bin

# check the mtail version
./mtail --version
mtail version 3.0.0-rc47 git revision 5e0099f843e4e4f2b7189c21019de18eb49181bf go version go1.16.5 go arch amd64 go os linux

# start mtail in the background
nohup ./mtail -port 3903 -logtostderr -progs test.mtail -logs test.log &

# the default port is 3903, so -port can be omitted
nohup ./mtail -progs test.mtail -logs test.log &

# check that mtail started successfully
ps -ef | grep mtail

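For a full explanation of the parameters, run mtail -h in a console.

Here are a few commonly used parameters:

Parameter	Description
-address	Host or IP address to bind the HTTP listener to
-alsologtostderr	Log to standard error as well as to files
-emit_metric_timestamp	Emit the recorded timestamp of a metric. If disabled (the default), no explicit timestamp is sent to the collector.
-expired_metrics_gc_interval	Interval between runs of the expired metric garbage collector (default 1h0m0s)
-ignore_filename_regex_pattern	Names of log files to ignore; regular expressions are supported.
-log_dir	Directory for mtail's own log files. If -logtostderr is also set, -log_dir has no effect.
-logs	List of log files to monitor. Separate multiple files with commas, repeat the -logs flag, or specify a directory with the * wildcard; put the path in single quotes when using a wildcard. For example:
	-logs a.log,b.log
	-logs a.log -logs b.log
	-logs '/export/logs/*.log'
-logtostderr	Log to standard error instead of files; compilation problems are also printed there
-override_timezone	Time zone to use in timestamp conversion instead of UTC
-port	HTTP port to listen on, default 3903
-progs	Path to the mtail programs
-trace_sample_period	Sample period for traces. If set to 100, one in every 100 traces is sampled.
-v	Log level for V logs. May be overridden by the -vmodule flag. Default 0.
-version	Print the mtail version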

Once started, mtail listens on port 3903 by default. The status page is available at http://ip:3903 and the metrics at http://ip:3903/metrics
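To check the endpoint from a script rather than a browser, the text returned by /metrics can be read line by line. The following is a minimal Python sketch, not part of mtail itself: the names parse_metrics and fetch_metrics and the sample text are hypothetical, and the parser only handles simple "name{labels} value" lines, not the full Prometheus exposition format.

```python
import re
from urllib.request import urlopen

# One sample line looks like: name{label="x"} 12   (labels are optional)
SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)')


def parse_metrics(text):
    """Return {(name, labels): value} for simple sample lines.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        name, labels, value = match.groups()
        samples[(name, labels or "")] = float(value)
    return samples


def fetch_metrics(host="127.0.0.1", port=3903):
    """Fetch and parse http://host:port/metrics (3903 is mtail's default port)."""
    with urlopen(f"http://{host}:{port}/metrics", timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Hypothetical sample text, so the sketch runs without a live mtail.
EXAMPLE_TEXT = """\
# TYPE lines_total counter
lines_total{prog="test.mtail"} 42
"""

if __name__ == "__main__":
    print(parse_metrics(EXAMPLE_TEXT))
```

Against a running instance, fetch_metrics() with the default arguments would read from port 3903 on the local machine.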

Detailed explanation of mtail parameters

./mtail -h

mtail version 3.0.0-rc47 git revision 5e0099f843e4e4f2b7189c21019de18eb49181bf go version go1.16.5 go arch amd64 go os linux

Usage:
  -address string
        Host or IP address on which to bind HTTP listener
  -alsologtostderr
        log to standard error as well as files
  -block_profile_rate int
        Nanoseconds of block time before goroutine blocking events reported. 0 turns off.  See https://golang.org/pkg/runtime/#SetBlockProfileRate
  -collectd_prefix string
        Prefix to use for collectd metrics.
  -collectd_socketpath string
        Path to collectd unixsock to write metrics to.
  -compile_only
        Compile programs only, do not load the virtual machine.
  -disable_fsnotify
        DEPRECATED: this flag is no longer in use. (default true)
  -dump_ast
        Dump AST of programs after parse (to INFO log).
  -dump_ast_types
        Dump AST of programs with type annotation after typecheck (to INFO log).
  -dump_bytecode
        Dump bytecode of programs (to INFO log).
  -emit_metric_timestamp
        Emit the recorded timestamp of a metric.  If disabled (the default) no explicit timestamp is sent to a collector.
  -emit_prog_label
        Emit the 'prog' label in variable exports. (default true)
  -expired_metrics_gc_interval duration
        interval between expired metric garbage collection runs (default 1h0m0s)
  -graphite_host_port string
        Host:port to graphite carbon server to write metrics to.
  -graphite_prefix string
        Prefix to use for graphite metrics.
  -ignore_filename_regex_pattern string
    
  -jaeger_endpoint string
        If set, collector endpoint URL of jaeger thrift service
  -log_backtrace_at value
        when logging hits line file:N, emit a stack trace
  -log_dir string
        If non-empty, write log files in this directory
  -logs value
        List of log files to monitor, separated by commas.  This flag may be specified multiple times.
  -logtostderr
        log to standard error instead of files
  -max_recursion_depth int
        The maximum length a mtail statement can be, as measured by parsed tokens. Excessively long mtail expressions are likely to cause compilation and runtime performance problems. (default 100)
  -max_regexp_length int
        The maximum length a mtail regexp expression can have. Excessively long patterns are likely to cause compilation and runtime performance problems. (default 1024)
  -metric_push_interval duration
        interval between metric pushes to passive collectors (default 1m0s)
  -metric_push_interval_seconds int
        DEPRECATED: use --metric_push_interval instead
  -metric_push_write_deadline duration
        Time to wait for a push to succeed before exiting with an error. (default 10s)
  -mtailDebug int
        Set parser debug level.
  -mutex_profile_fraction int
        Fraction of mutex contention events reported.  0 turns off.  See http://golang.org/pkg/runtime/#SetMutexProfileFraction
  -one_shot
        Compile the programs, then read the contents of the provided logs from start until EOF, print the values of the metrics store and exit. This is a debugging flag only, not for production use.
  -override_timezone string
        If set, use the provided timezone in timestamp conversion, instead of UTC.
  -poll_interval duration
        Set the interval to poll all log files for data; must be positive, or zero to disable polling.  With polling mode, only the files found at mtail startup will be polled. (default 250ms)
  -port string
        HTTP port to listen on. (default "3903")
  -progs string
        Name of the directory containing mtail programs
  -stale_log_gc_interval duration
        interval between stale log garbage collection runs (default 1h0m0s)
  -statsd_hostport string
        Host:port to statsd server to write metrics to.
  -statsd_prefix string
        Prefix to use for statsd metrics.
  -stderrthreshold value
        logs at or above this threshold go to stderr
  -syslog_use_current_year
        Patch yearless timestamps with the present year. (default true)
  -trace_sample_period int
        Sample period for traces.  If non-zero, every nth trace will be sampled.
  -unix_socket string
        UNIX Socket to listen on
  -v value
        log level for V logs
  -version
        Print mtail version information.
  -vm_logs_runtime_errors
        Enables logging of runtime errors to the standard log.  Set to false to only have the errors printed to the HTTP console. (default true)
  -vmodule value
        comma-separated list of pattern=N settings for file-filtered logging

Parameter	Description
-address	Host or IP address to bind the HTTP listener to
-alsologtostderr	Log to standard error as well as to files
-block_profile_rate	Nanoseconds of block time before goroutine blocking events are reported; 0 turns it off
-collectd_prefix	Prefix for metrics sent to collectd
-collectd_socketpath	Path to the collectd unixsock that metrics are written to
-compile_only	Only compile the mtail programs without running them; useful for testing scripts
-disable_fsnotify	Deprecated and no longer in use in this version. Formerly disabled dynamic file discovery, so that only the files present at startup were monitored.
-dump_ast	Dump the AST of programs after parsing (to the INFO log, /tmp/mtail.INFO by default)
-dump_ast_types	Dump the AST of programs with type annotations after type checking (to the INFO log, /tmp/mtail.INFO by default)
-dump_bytecode	Dump the bytecode of programs (to the INFO log)
-emit_metric_timestamp	Emit the recorded timestamp of a metric. If disabled (the default), no explicit timestamp is sent to the collector.
-emit_prog_label	Emit the 'prog' label in exported variables. Default true.
-expired_metrics_gc_interval	Interval between runs of the expired metric garbage collector (default 1h0m0s)
-graphite_host_port	Address of the graphite carbon server, in Host:port format, that metrics are written to
-graphite_prefix	Prefix for metrics sent to graphite
-ignore_filename_regex_pattern	Names of log files to ignore; regular expressions are supported. Typical use: when -logs specifies a directory, use this parameter to skip some of the files.
-jaeger_endpoint	If set, traces are exported to the Jaeger trace collector at this endpoint URL
-log_backtrace_at	When logging hits line file:N, emit a stack trace
-log_dir	Directory for mtail's own log files. If -logtostderr is also set, -log_dir has no effect.
-logs	List of log files to monitor. Separate multiple files with commas, repeat the -logs flag, or specify a directory with the * wildcard; put the path in single quotes when using a wildcard.
-logtostderr	Log to standard error instead of files; compilation problems are also printed there
-metric_push_interval_seconds	Deprecated; use -metric_push_interval instead. Metric push interval in seconds, default 60.
-metric_push_write_deadline	Time to wait for a push to succeed before exiting with an error (default 10s)
-mtailDebug	Set the parser debug level
-mutex_profile_fraction	Fraction of mutex contention events that are reported; 0 turns reporting off. See Go's runtime.SetMutexProfileFraction.
-one_shot	Compile the mtail programs, read the given log files from the beginning to EOF (not a real-time tail), print all collected metrics, and exit. Use it to verify that a program produces the expected output; it is not meant for production.
-override_timezone	Time zone to use in timestamp conversion instead of UTC
-poll_interval	Interval at which all log files are polled for data; must be positive, or zero to disable polling. In polling mode, only the files found when mtail started are polled.
-port	HTTP port to listen on, default 3903
-progs	Path to the mtail programs
-stale_log_gc_interval	Interval between runs of the stale log garbage collector (default 1h0m0s)
-statsd_hostport	Address of the statsd server, in Host:port format, that metrics are written to
-statsd_prefix	Prefix for metrics sent to statsd
-stderrthreshold	Log messages at or above this severity are written to stderr in addition to the log file. Severity values: INFO = 0, WARNING = 1, ERROR = 2, FATAL = 3; the default is 2.
-syslog_use_current_year	If a timestamp has no year, use the current year. Default true.
-trace_sample_period	Sample period for traces. If set to 100, one in every 100 traces is sampled.
-v	Log level for V logs. May be overridden by the -vmodule flag. Default 0.
-version	Print the mtail version
-vmodule	Set the log level per file or module, for example: -vmodule=mapreduce=2,file=1,gfs*=3

mtail script syntax

Read the programming guide if you want to learn how to write mtail programs.

https://github.com/google/mtail/blob/main/docs/Programming-Guide.md

mtail script standard format

The standard format is:

COND {
  ACTION
}

COND is a conditional expression: either a regular expression or a boolean condition. For example:

/foo/ {
  ACTION1
}

variable > 0 {
  ACTION2
}

/foo/ && variable > 0 {
  ACTION3
}

The following operators can be used in COND expressions:

Relational and logical operators:

< , <= , > , >= , == , != , =~ , !~ , || , && , !

Arithmetic and bitwise operators:

| , & , ^ , + , - , * , / , << , >> , **

The following operators are available for operating on metric variables:

= , += , ++ , --

mtail extracts information from logs and passes it to a monitoring system, so each metric variable must be exported and named. A variable is declared with a metric type such as counter or gauge, and the declaration must come before the COND blocks that use it.
For example, the following script exports a counter metric named lines_total that counts the number of log lines:

# simple line counter
counter lines_total
/$/ {
  lines_total++
}
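To make the COND { ACTION } model concrete, here is a small Python sketch that imitates what the line counter does. It is only an analogy, not how mtail is implemented (mtail compiles programs to bytecode for its own virtual machine), and the names run_program, rules, and errors_total are invented for this illustration.

```python
import re
from collections import Counter


def run_program(rules, lines):
    """Apply every (pattern, action) rule to every log line.

    rules: list of (compiled regex, function(metrics, match)) pairs,
           standing in for mtail's COND { ACTION } blocks.
    """
    metrics = Counter()
    for line in lines:
        for pattern, action in rules:
            match = pattern.search(line)
            if match:
                action(metrics, match)
    return metrics


def count_line(metrics, match):
    # Equivalent of: lines_total++
    metrics["lines_total"] += 1


def count_error(metrics, match):
    # A second, hypothetical counter for lines containing ERROR.
    metrics["errors_total"] += 1


rules = [
    (re.compile(r"$"), count_line),       # /$/ matches every line
    (re.compile(r"ERROR"), count_error),
]

log_lines = ["INFO start", "ERROR disk full", "INFO done"]
metrics = run_program(rules, log_lines)
print(dict(metrics))
```

In this sketch every rule is tried against every line, so a single log line can update several metrics.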

Types supported by mtail

The counter, gauge, and histogram types in mtail have the same meaning as the corresponding Prometheus metric types.

A counter is a monotonically increasing metric: it only goes up and never down. For example, counters can represent the number of service requests, the number of successful tasks, or the number of failed tasks.

A gauge is a metric whose value can change arbitrarily, both up and down. For example, you can take a value captured by a regular expression and assign it to the metric variable directly, or after some calculation.

A histogram groups observations into ranges. To quote a description of histograms from the Prometheus literature:

In most cases people tend to look at the average of a quantitative metric, such as average CPU usage or average page response time. The problem with this approach is obvious. Take the average response time of system API calls as an example: if most API requests complete within 100 ms but a few take 5 s, the average hides the fact that some web pages respond far more slowly than the typical request. This is called the long-tail problem.
To distinguish between requests that are slow on average and requests that are slow in the long tail, the simplest approach is to group requests by latency range, for example counting the requests with a latency of 0~10 ms and those with a latency of 10~20 ms. This makes it quick to analyze why a system is slow. Histogram and Summary both exist to solve this kind of problem: with metrics of these types we can quickly see how the monitored samples are distributed.
A Histogram samples data over a period of time (usually request durations or response sizes) and counts the samples in configurable buckets. You can then filter samples by a given range or count the total number of samples, and the data is usually displayed as a histogram.
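The bucket idea can be illustrated with a few lines of Python. This is a sketch under assumptions: the bucket bounds and latency values are made up, and the cumulative "less than or equal" counting follows the convention Prometheus uses for histogram buckets.

```python
import math


def histogram(samples, bounds):
    """Count samples into cumulative buckets, Prometheus-style.

    Each bucket counts the samples less than or equal to its upper bound;
    a final +Inf bucket counts all samples.
    """
    uppers = sorted(bounds) + [math.inf]
    buckets = {upper: 0 for upper in uppers}
    for value in samples:
        for upper in uppers:
            if value <= upper:
                buckets[upper] += 1
    return {"buckets": buckets, "count": len(samples), "sum": sum(samples)}


# Hypothetical request latencies in milliseconds: mostly fast, one long-tail request.
latencies_ms = [8, 12, 15, 9, 95, 5000]
result = histogram(latencies_ms, bounds=[10, 20, 100])

for upper, count in result["buckets"].items():
    print(f"le={upper}: {count}")
print("mean:", result["sum"] / result["count"])
```

With these made-up values the mean is about 856 ms even though five of the six requests finished within 100 ms, which is the distortion that buckets make visible.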

mtail detailed explanation: https://blog.csdn.net/bluuusea/article/details/105508897

Configure Prometheus data source

Add mtail as a scrape target in the Prometheus configuration file and restart Prometheus. Then add a new panel to a Grafana dashboard and select the Prometheus data source already configured for it.

vim prometheus-config.yml

# global configuration
global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  # scrape the metrics exported by mtail
  - job_name: 'mtail'
    static_configs:
    - targets: ['<internal-ip>:3903']
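Once Prometheus scrapes mtail, the metrics can also be checked programmatically through the Prometheus HTTP API (/api/v1/query). The following is a minimal Python sketch, assuming Prometheus listens on localhost:9090 (its default port) and that a metric named lines_total exists; the helper names and the sample response are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def extract_values(response_body):
    """Return [(labels, value)] from an instant-query JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]


def query(base_url, promql):
    """Run an instant query against a live Prometheus server."""
    with urlopen(query_url(base_url, promql), timeout=5) as response:
        return extract_values(response.read().decode("utf-8"))


# Hypothetical response body, shaped like an instant vector result.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "lines_total", "job": "mtail"},
             "value": [1628064000, "42"]},
        ],
    },
})

if __name__ == "__main__":
    print(query_url("http://localhost:9090", 'lines_total{job="mtail"}'))
    print(extract_values(SAMPLE_RESPONSE))
```

The job label in the example query matches the job_name used in the scrape configuration above.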

References

Google mtail

mtail Programming Guide

prometheus+grafana+mtail+node_exporter realizes machine load and business monitoring

mtail detailed explanation


王奥OX
