Author
Huang Lei is a senior engineer at Tencent Cloud. He was responsible for building a new generation of multi-dimensional business monitoring system for Tencent Cloud Monitoring, specializes in the design of large-scale distributed monitoring systems, and has a deep understanding of Go back-end architecture design. He later joined the TKE team to work on Kubernetes operations and maintenance technology, and has many years of experience operating and managing Kubernetes cluster federations. Within the team he is currently responsible mainly for improving the observability of large-scale cluster federations. He has led the development of Tencent Cloud's monitoring and alerting system for Kubernetes clusters at the 10,000 scale, as well as its intelligent inspection and risk detection system.
Summary
If you asked me which open source components are certain to be used when managing a Kubernetes cluster, Prometheus would definitely be one of them. Prometheus offers strong performance, an active ecosystem, convenient deployment, and the flexible PromQL query language. This makes it especially well suited to collecting and aggregating monitoring data at every level of a Kubernetes environment, such as master, node, and application. Combined with Grafana's eye-catching panels (as shown in the figure below), it can fairly be called the best solution for cloud native monitoring.
Although Prometheus and Grafana are very powerful, they carry a real learning cost for newcomers and are not easy to pick up. That was certainly true for me. I remember that a few years ago, before I was in charge of improving the team's cloud native observability, a colleague who was new to Prometheus complained to me all day long: "Why is the Prometheus syntax so complicated?" and "This thing is awful, how are you supposed to write this?" At the time I laughed at him for exaggerating, but once I started learning Prometheus and building Grafana panels myself, I made exactly the same complaints, for example about the statement below.
max(
  label_replace(
    label_replace(
      label_replace(
        kube_deployment_status_replicas_unavailable,
        "workload_kind", "Deployment", "", ""),
      "workload_name", "$1", "deployment", "(.*)"),
    "__name__", "k8s_workload_abnormal", "__name__", "(.*)")
) by (namespace, workload_name, workload_kind, __name__)
or on (namespace, workload_name, workload_kind, __name__)
max(
  label_replace(
    label_replace(
      label_replace(
        kube_daemonset_status_number_unavailable,
        "workload_kind", "DaemonSet", "", ""),
      "workload_name", "$1", "daemonset", "(.*)"),
    "__name__", "k8s_workload_abnormal", "__name__", "(.*)")
) by (namespace, workload_name, workload_kind, __name__)
or on (namespace, workload_name, workload_kind, __name__)
max(
  label_replace(
    label_replace(
      label_replace(
        (kube_statefulset_replicas - kube_statefulset_status_replicas_ready),
        "workload_kind", "StatefulSet", "", ""),
      "workload_name", "$1", "statefulset", "(.*)"),
    "__name__", "k8s_workload_abnormal", "__name__", "(.*)")
) by (namespace, workload_name, workload_kind, __name__)
or on (namespace, workload_name, workload_kind, __name__)
max(
  label_replace(
    label_replace(
      label_replace(
        (kube_job_status_failed),
        "workload_kind", "Job", "", ""),
      "workload_name", "$1", "job_name", "(.*)"),
    "__name__", "k8s_workload_abnormal", "__name__", "(.*)")
) by (namespace, workload_name, workload_kind, __name__)
or on (namespace, workload_name, workload_kind, __name__)
max(
  label_replace(
    label_replace(
      label_replace(
        (kube_cronjob_info * 0),
        "workload_kind", "CronJob", "", ""),
      "workload_name", "", "cronjob", "(.*)"),
    "__name__", "k8s_workload_abnormal", "__name__", "(.*)")
) by (namespace, workload_name, workload_kind, __name__)
Over the past few years I have accumulated a fair amount of practical experience with Prometheus, and have fallen into plenty of pitfalls along the way.
To help readers who want to learn Prometheus get started faster, avoid detours, and improve their business monitoring skills in the cloud native era, I have compiled a tutorial that covers the most basic and essential concepts, techniques, and best practices, so that you can spend 20% of the time to master the 80% that is used most often.
Starting from scratch, you will learn how to expose monitoring metrics for your business, how to configure service discovery correctly, and how to build a practical Grafana panel. The tutorial guides readers through getting started with Prometheus + Grafana and mastering the right approach to cloud native monitoring.
Reply "Prometheus" or "Introduction to Light Speed" to the "Tencent Cloud Native" official account to get the tutorial. Let's learn together!
Tip: The tutorial is currently available as a website version (which needs to be opened in a browser) and a PDF version, so readers can choose whichever suits their needs. The website version will continue to be updated, so stay tuned.
You are also welcome to submit issues for the tutorial. It will be updated, expanded, and revised from time to time based on your feedback!
(The GitHub address of the issue)