Master Prometheus and Grafana from 0 to 1 at the speed of light

Author

Huang Lei is a senior engineer at Tencent Cloud who previously led the construction of the new generation of multi-dimensional business monitoring system for Tencent Cloud Monitoring. He specializes in the design of large-scale distributed monitoring systems and has a deep understanding of Golang back-end architecture design. He later joined the TKE team, where he focuses on Kubernetes operations and maintenance technology and has many years of experience operating and managing federated Kubernetes clusters. He is currently responsible for improving the observability of large-scale cluster federations, and has led the development of Tencent Cloud's monitoring and alerting system for Kubernetes clusters at the ten-thousand scale, as well as its intelligent inspection and risk detection systems.

Summary

If you asked the author which open source components are certain to be used when managing Kubernetes clusters, Prometheus would definitely be one of them. Prometheus offers strong performance, an active ecosystem, convenient deployment, and the flexible PromQL query language, which makes it especially well suited to collecting and aggregating monitoring data at every level of a Kubernetes environment: master, node, and application. Combined with Grafana's eye-catching dashboards (as shown in the figure below), it can fairly be called the best solution for cloud native monitoring.

Powerful as Prometheus and Grafana are, they come with a real learning cost for newcomers and are not easy to pick up. That was certainly the author's experience. A few years ago, before the author was in charge of improving the team's cloud native observability, a colleague who was new to Prometheus would complain all day long: "Why is the Prometheus syntax so complicated?", "This thing is awful, how are you supposed to write this?" At the time the author laughed at him for exaggerating, but once the author started learning Prometheus and building Grafana panels, the same complaints came out, for example about the statement below.

 # Deployments: number of unavailable replicas
 max(
   label_replace(
     label_replace(
       label_replace(
         kube_deployment_status_replicas_unavailable,
         "workload_kind", "Deployment", "", ""),
       "workload_name", "$1", "deployment", "(.*)"),
     "__name__", "k8s_workload_abnormal", "__name__", "(.*)")
 ) by (namespace, workload_name, workload_kind, __name__)
 or on (namespace, workload_name, workload_kind, __name__)
 # DaemonSets: number of unavailable pods
 max(
   label_replace(
     label_replace(
       label_replace(
         kube_daemonset_status_number_unavailable,
         "workload_kind", "DaemonSet", "", ""),
       "workload_name", "$1", "daemonset", "(.*)"),
     "__name__", "k8s_workload_abnormal", "__name__", "(.*)")
 ) by (namespace, workload_name, workload_kind, __name__)
 or on (namespace, workload_name, workload_kind, __name__)
 # StatefulSets: desired replicas minus ready replicas
 max(
   label_replace(
     label_replace(
       label_replace(
         (kube_statefulset_replicas - kube_statefulset_status_replicas_ready),
         "workload_kind", "StatefulSet", "", ""),
       "workload_name", "$1", "statefulset", "(.*)"),
     "__name__", "k8s_workload_abnormal", "__name__", "(.*)")
 ) by (namespace, workload_name, workload_kind, __name__)
 or on (namespace, workload_name, workload_kind, __name__)
 # Jobs: number of failed pods
 max(
   label_replace(
     label_replace(
       label_replace(
         (kube_job_status_failed),
         "workload_kind", "Job", "", ""),
       "workload_name", "$1", "job_name", "(.*)"),
     "__name__", "k8s_workload_abnormal", "__name__", "(.*)")
 ) by (namespace, workload_name, workload_kind, __name__)
 or on (namespace, workload_name, workload_kind, __name__)
 # CronJobs: multiplied by 0, so each CronJob is listed with value 0
 max(
   label_replace(
     label_replace(
       label_replace(
         (kube_cronjob_info * 0),
         "workload_kind", "CronJob", "", ""),
       "workload_name", "$1", "cronjob", "(.*)"),
     "__name__", "k8s_workload_abnormal", "__name__", "(.*)")
 ) by (namespace, workload_name, workload_kind, __name__)
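Much of the apparent complexity in the statement above comes from the three nested label_replace calls that are repeated for every workload type. As a rough aid to reading it, here is a minimal Python sketch that imitates what label_replace does to the label set of a single series. It is only an illustration, not Prometheus code: the sample series and its label values are made up, only numbered groups such as $1 are expanded, and Python's regex engine differs from the RE2 syntax that Prometheus uses.

```python
import re


def label_replace(labels, dst_label, replacement, src_label, regex):
    """Return a copy of `labels` with dst_label rewritten, imitating PromQL label_replace."""
    src_value = labels.get(src_label, "")  # a missing label behaves like an empty value
    match = re.fullmatch(regex, src_value)  # PromQL anchors the regex at both ends
    if match is None:
        return dict(labels)  # no match: the series is returned unchanged
    # Expand $1, $2, ... with the captured groups (simplification: numbered groups only).
    value = re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))) or "", replacement)
    result = dict(labels)
    if value:
        result[dst_label] = value
    else:
        result.pop(dst_label, None)  # an empty value means the label is absent
    return result


# Hypothetical sample series, as kube-state-metrics might expose it.
series = {
    "__name__": "kube_deployment_status_replicas_unavailable",
    "namespace": "default",
    "deployment": "web",
}

# The same three steps as the innermost-to-outermost calls in the query.
step1 = label_replace(series, "workload_kind", "Deployment", "", "")
step2 = label_replace(step1, "workload_name", "$1", "deployment", "(.*)")
step3 = label_replace(step2, "__name__", "k8s_workload_abnormal", "__name__", "(.*)")
print(step3)
```

After these three steps every workload type carries the same metric name and the same workload_kind and workload_name labels, which is what allows the per-type results to be aggregated by those labels and joined with `or`.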

Over the past few years, the author has accumulated a fair amount of practical experience with Prometheus, and has also fallen into plenty of pitfalls along the way.

To help readers who want to learn Prometheus get started quickly, avoid detours, and improve their business monitoring skills in the cloud native era, the author has compiled a tutorial that covers the most fundamental and essential concepts, techniques, and best practices, so that you can spend 20% of the time mastering the most commonly used 80%.

Starting from scratch, the tutorial shows how to expose monitoring metrics for your business, how to configure service discovery correctly, and how to build a practical Grafana panel, guiding readers through getting started with Prometheus + Grafana and the right approach to cloud native monitoring.
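To give a feel for what "exposing monitoring metrics" means in practice, here is a minimal sketch, using only the Python standard library, of an HTTP handler that serves a counter in the Prometheus text exposition format. It is not taken from the tutorial: the metric name app_requests_total, the path label, the sample counts, and port 8000 are all assumptions made for illustration, and real services would normally use an official Prometheus client library rather than formatting the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter: requests handled, keyed by request path.
REQUEST_COUNTS = {"/order": 3, "/pay": 1}


def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(counts):
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Total number of handled requests.",
        "# TYPE app_requests_total counter",
    ]
    for path in sorted(counts):
        lines.append(
            'app_requests_total{path="%s"} %d' % (escape_label_value(path), counts[path])
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current metric values on GET /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics so that Prometheus can scrape it."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


# Show what a scrape would return, without starting the server.
print(render_metrics(REQUEST_COUNTS), end="")
```

Calling serve() and pointing a Prometheus scrape target at the chosen port would make these samples queryable; how the target is discovered depends on the service discovery configuration.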

Send "Prometheus" or "Introduction to Light Speed" as a message to the "Tencent Cloud Native" official account to get the tutorial. Let's learn together!

Tip: The tutorial is currently available as a website version (which must be opened in a browser) and a PDF version, so readers can choose whichever suits them. The website version will continue to be updated, so stay tuned.

At the same time, everyone is welcome to submit issues to the tutorial. This tutorial will be updated, expanded, and revised from time to time based on your feedback!

(The GitHub address of the issue)


The tutorial's table of contents is as follows
