
Hi everyone, this is Zhang Jintao.

A friend in one of my groups asked me how to measure the time consumed by a rolling upgrade.

This problem can be abstracted into a general requirement that applies to many scenarios; a quick client-side baseline is sketched right after the list below.

  • For example, you are an administrator of a Kubernetes cluster and want to measure how long each step of this process takes in order to find points worth optimizing;
  • For example, you are building CI/CD and want to measure how long the pipeline takes so that you can report and improve it;
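
Here is the crude client-side baseline mentioned above. It is a minimal sketch, assuming a Deployment named redis in the moelove namespace (both names come from the examples later in this article): kubectl rollout status blocks until the rollout finishes, so wrapping it in time gives a single wall-clock number, with no per-step breakdown.

# Trigger a rolling update, then time how long it takes to complete.
# This only measures from the moment the command starts and gives
# no visibility into the individual steps.
kubectl -n moelove rollout restart deployment/redis && \
  time kubectl -n moelove rollout status deployment/redis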

The existing approach

Kubernetes already provides a very convenient way to solve this, which is what I mentioned in my reply: measure it through events.

For example, let's create a Deployment in K8S and look at the events generated in the process:

➜  ~ kubectl create ns moelove
namespace/moelove created
➜  ~ kubectl -n moelove create deployment redis --image=ghcr.io/moelove/redis:alpine
deployment.apps/redis created
➜  ~ kubectl -n moelove get deploy
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
redis   1/1     1            1           16s
➜  ~ kubectl -n moelove get events
LAST SEEN   TYPE     REASON              OBJECT                        MESSAGE
27s         Normal   Scheduled           pod/redis-687967dbc5-gsz5n    Successfully assigned moelove/redis-687967dbc5-gsz5n to kind-control-plane
27s         Normal   Pulled              pod/redis-687967dbc5-gsz5n    Container image "ghcr.io/moelove/redis:alpine" already present on machine
27s         Normal   Created             pod/redis-687967dbc5-gsz5n    Created container redis
27s         Normal   Started             pod/redis-687967dbc5-gsz5n    Started container redis
27s         Normal   SuccessfulCreate    replicaset/redis-687967dbc5   Created pod: redis-687967dbc5-gsz5n
27s         Normal   ScalingReplicaSet   deployment/redis              Scaled up replica set redis-687967dbc5 to 1

As you can see, the events we mainly care about have already been recorded. But checking them with kubectl every time is tedious and wastes time.
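
For a one-off measurement you can still script it. A minimal sketch, assuming the pod name from the run above (adjust the name and namespace to your own), that prints each event's timestamp and reason so you can compute the gaps yourself:

# Print timestamp + reason for every event on a given pod; the pod
# name below is just the one from the run above.
kubectl -n moelove get events \
  --field-selector involvedObject.name=redis-687967dbc5-gsz5n \
  -o jsonpath='{range .items[*]}{.lastTimestamp}{"\t"}{.reason}{"\n"}{end}'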

One approach I used before was to write a program that continuously watches and collects events in the K8S cluster and writes them into a system I developed myself for storage and visualization. But that method requires extra development work and is not general-purpose. Here I will introduce another, better solution.

A more elegant solution

Each of these events in K8S corresponds to one of our operations. For example, creating the Deployment above generated several events: Scheduled, Pulled, Created, and so on. If we abstract this, isn't it similar to the distributed tracing we already do?

Here we will use Jaeger, a CNCF graduated project that I have introduced many times in previous issues of my "K8S Ecosystem Weekly". Jaeger is an open-source, end-to-end distributed tracing system. We will combine it with OpenTelemetry, a CNCF sandbox project that provides an observability framework for cloud-native software. Since the focus of this article is not these two projects, please refer to their documentation and deploy a Jaeger instance quickly; I will skip the details here.
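For a quick test environment, one possible shortcut (my own suggestion, not necessarily the setup used in this article) is Jaeger's all-in-one image, which bundles the collector, query service, and UI in a single pod:

# Run the Jaeger all-in-one image for testing (not for production).
kubectl create deployment jaeger --image=jaegertracing/all-in-one:1.22
# Expose the gRPC collector port (14250) inside the cluster so other
# components can send spans to it.
kubectl expose deployment jaeger --name=jaeger-collector --port=14250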

Next comes the main project used in this article: kspan, an open-source project from Weaveworks. Its main approach is to organize events in K8S as spans in a tracing system.

Deploy kspan

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kspan
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: null
  name: kspan-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: kspan
  namespace: default
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: kspan
  name: kspan
spec:
  containers:
  - image: docker.io/weaveworks/kspan:v0.0
    name: kspan
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  serviceAccountName: kspan

You can use the YAML I provide here directly for deployment and testing, but note that this configuration should not be used in a production environment: the RBAC permissions need to be tightened.
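
As a starting point for tightening, here is a read-only ClusterRole sketch. The resource list is my own educated guess about what kspan needs to watch, not taken from kspan's documentation, so verify it before relying on it:

# Replace the cluster-admin binding above with a read-only role.
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kspan-reader
rules:
# Core resources kspan likely reads to build spans (assumed list).
- apiGroups: [""]
  resources: ["events", "pods", "services", "nodes"]
  verbs: ["get", "list", "watch"]
# Workload owners referenced by events (assumed list).
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets", "daemonsets", "statefulsets"]
  verbs: ["get", "list", "watch"]
EOF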

By default, kspan sends spans through otlp-collector.default:55680, so you need to make sure this Service exists.
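If your OpenTelemetry Collector already runs under a different Service name, one way to satisfy kspan is to add an extra Service in the default namespace. A minimal sketch, assuming the collector pods carry the label app: otel-collector and accept OTLP/gRPC on port 55680 (adjust the selector and ports to match your actual collector deployment):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: otlp-collector
  namespace: default
spec:
  selector:
    app: otel-collector   # assumed label; match your collector pods
  ports:
  - name: otlp-grpc
    port: 55680
    targetPort: 55680     # assumed container port for legacy OTLP/gRPC
EOF

Once everything above is deployed, the cluster looks roughly like this: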

➜  ~ kubectl get all
NAME                                  READY   STATUS    RESTARTS   AGE
pod/jaeger-76c84457fb-89s5v           1/1     Running   0          64m
pod/kspan                             1/1     Running   0          35m
pod/otel-agent-sqlk6                  1/1     Running   0          59m
pod/otel-collector-69985cc444-bjb92   1/1     Running   0          56m

NAME                       TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                          AGE
service/jaeger-collector   ClusterIP   10.96.47.12    <none>        14250/TCP                                        60m
service/kubernetes         ClusterIP   10.96.0.1      <none>        443/TCP                                          39h
service/otel-collector     ClusterIP   10.96.231.43   <none>        4317/TCP,14250/TCP,14268/TCP,9411/TCP,8888/TCP   59m
service/otlp-collector     ClusterIP   10.96.79.181   <none>        55680/TCP                                        52m

NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/otel-agent   1         1         1       1            1           <none>          59m

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/jaeger           1/1     1            1           73m
deployment.apps/otel-collector   1/1     1            1           59m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/jaeger-6f77c67c44           0         0         0       73m
replicaset.apps/jaeger-76c84457fb           1         1         1       64m
replicaset.apps/otel-collector-69985cc444   1         1         1       59m

Hands-on practice

Here we first create a namespace for testing:

➜  ~ kubectl create ns moelove
namespace/moelove created

Create a deployment

➜  ~ kubectl -n moelove create deployment redis --image=ghcr.io/moelove/redis:alpine
deployment.apps/redis created
➜  ~ kubectl -n moelove get pods 
NAME                     READY   STATUS    RESTARTS   AGE
redis-687967dbc5-xj2zs   1/1     Running   0          10s
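
Since the original question was about rolling upgrades, we can also trigger one, as in the baseline sketch earlier, and kspan will collect the resulting events into a fresh trace:

# Trigger a rolling update and block until it finishes; rollout restart
# re-creates the pods with the same image, which is enough to generate
# a new set of events for kspan to pick up.
kubectl -n moelove rollout restart deployment/redis
kubectl -n moelove rollout status deployment/redis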

Check it out on Jaeger:
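
If Jaeger runs inside the cluster, a simple way to reach its UI from your workstation is a port-forward (assuming the all-in-one image, which serves the UI on port 16686):

# Forward the Jaeger UI to localhost, then open http://localhost:16686
kubectl port-forward deployment/jaeger 16686:16686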

[Screenshot: the Jaeger UI showing the trace for the redis Deployment, with span details expanded]

As you can see, the events related to this Deployment are grouped together, and details such as how long each step took can be read off the timeline.

To sum up

This article introduced how to collect events in K8S and view them as traces in Jaeger, so that you can better understand where time is spent across the events in a K8S cluster, find directions for optimization, and measure the results more easily.


Welcome to follow my WeChat public account 【MoeLove】
