Environment information
# Kubernetes version (installed with kubeadm):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-11T13:17:17Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-11T13:09:17Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
# Helm version:
$ helm version
version.BuildInfo{Version:"v3.3.4", GitCommit:"a61ce5633af99708171414353ed49547cf05013d", GitTreeState:"clean", GoVersion:"go1.14.9"}
Deploy the Prometheus Operator with Helm (see the official chart documentation on GitHub); note that Kubernetes 1.16+ and Helm 3+ are required.
# Add the chart repositories
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo add stable https://charts.helm.sh/stable
$ helm repo update
# Check that the repos were added; prometheus-community appearing in the output below means it succeeded
$ helm repo list
NAME URL
prometheus-community https://prometheus-community.github.io/helm-charts
stable https://charts.helm.sh/stable
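Before pulling anything you can optionally confirm that the chart is visible from the newly added repo:
$ helm search repo prometheus-community/kube-prometheus-stack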
Because we want to add PVC-backed persistent storage to the Prometheus Operator, add some extra scrape targets, and change part of the default configuration, we first pull the chart locally, edit values.yaml, and only then run the install command.
# Pull the chart
$ helm pull prometheus-community/kube-prometheus-stack
$ ls -l
-rw-r--r-- 1 root root 326161 Dec 21 10:24 kube-prometheus-stack-12.2.3.tgz
# Extract it
$ tar -xzvf kube-prometheus-stack-12.2.3.tgz
# Edit values.yaml
$ cd kube-prometheus-stack # change into the chart directory
$ vi values.yaml
My environment provides PVs backed by hostPath directories on a node, so before editing values.yaml we first create the corresponding PVs in the cluster.
$ mkdir -p /promethous/{alert,grafana,promethous} # create the hostPath directories backing the PVs
Save the following as prometheus-pv.yaml and run kubectl create -f prometheus-pv.yaml to create the PVs:
---
# storageClass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: prometheus
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
# alertmanager pv
apiVersion: v1
kind: PersistentVolume
metadata:
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    use: alert
  name: alert-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 20Gi
  local:
    path: /promethous/alert
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - 192.168.0.13 # node affinity: this PV is pinned to the node 192.168.0.13
  persistentVolumeReclaimPolicy: Delete
  storageClassName: prometheus
  volumeMode: Filesystem
---
# grafana pv
apiVersion: v1
kind: PersistentVolume
metadata:
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    use: grafana
  name: grafana-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  local:
    path: /promethous/grafana
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - 192.168.0.13
  persistentVolumeReclaimPolicy: Delete
  storageClassName: prometheus
  volumeMode: Filesystem
---
# prometheus pv
apiVersion: v1
kind: PersistentVolume
metadata:
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    use: prometheus
  name: prometheus-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 70Gi
  local:
    path: /promethous/promethous
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - 192.168.0.13
  persistentVolumeReclaimPolicy: Delete
  storageClassName: prometheus
  volumeMode: Filesystem
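Apply the manifest and check that the StorageClass and the three PVs exist; because the StorageClass uses WaitForFirstConsumer, the PVs will stay in the Available state until the consuming pods are scheduled:
$ kubectl create -f prometheus-pv.yaml
$ kubectl get storageclass prometheus
$ kubectl get pv alert-pv grafana-pv prometheus-pv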
values.yaml is long, so only the parts that need changing are shown below, one section at a time. I have uploaded the complete modified values.yaml here (note: parts of it still need to be adapted to your own environment).
Alertmanager configuration:
# the alertmanager section mainly adds the alert templates and receivers, plus an ingress to expose the Alertmanager UI;
# the wiring between alertmanager and prometheus is already provided by the chart defaults
alertmanager:
  enabled: true
  apiVersion: v2
  serviceAccount:
    create: true
    name: ""
    annotations: {}
  podDisruptionBudget:
    enabled: false
    minAvailable: 1
    maxUnavailable: ""
  config:
    global:
      resolve_timeout: 5m
      # e-mail alerting settings
      smtp_hello: 'kubernetes'
      smtp_from: 'example@163.com'
      smtp_smarthost: 'smtp.163.com:25'
      smtp_auth_username: 'test'
      smtp_auth_password: 'USFRGHSFQTCJNDAHQ' # this is the SMTP authorization code, not the mailbox password; adjust for your own mail provider
      # WeChat Work alerting settings, see https://www.cnblogs.com/miaocbin/p/13706164.html
      wechat_api_secret: 'RRTAFGGSS0G_KFSl6FYBVlHyMo'
      wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
      wechat_api_corp_id: 'ssssdghsetyxsg'
    templates:
      - '/etc/alertmanager/config/*.tmpl' # path to the alert templates
    route:
      group_by: ['job'] # group alerts by the job label
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'wechat' # default receiver
      # route alerts whose alertname label is Watchdog to the email receiver
      routes:
      - match:
          alertname: Watchdog
        receiver: 'email'
    # e-mail receiver
    receivers:
    - name: 'email'
      email_configs:
      - to: 'test@example.com'
        html: '{{ template "template_email.tmpl" }}'
    # WeChat Work receiver
    - name: 'wechat'
      wechat_configs:
      - send_resolved: true
        message: '{{ template "template_wechat.tmpl" . }}'
        to_party: '2'
        agent_id: '1000002'
  tplConfig: false
  templateFiles:
    # e-mail alert template
    template_email.tmpl: |-
      {{ define "cluster" }}{{ .ExternalURL | reReplaceAll ".*alertmanager\\.(.*)" "$1" }}{{ end }}
      {{ define "slack.myorg.text" }}
      {{- $root := . -}}
      {{ range .Alerts }}
        *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
        *Cluster:* {{ template "cluster" $root }}
        *Description:* {{ .Annotations.description }}
        *Graph:* <{{ .GeneratorURL }}|:chart_with_upwards_trend:>
        *Runbook:* <{{ .Annotations.runbook }}|:spiral_note_pad:>
        *Details:*
        {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
        {{ end }}
      {{ end }}
      {{ end }}
    # WeChat Work alert template, adapted from https://www.cnblogs.com/miaocbin/p/13706164.html
    template_wechat.tmpl: |-
      {{ define "template_wechat.tmpl" }}
      {{- if gt (len .Alerts.Firing) 0 -}}
      {{- range $index, $alert := .Alerts -}}
      {{- if eq $index 0 }}
      ========= Alert firing =========
      Status: {{ .Status }}
      Severity: {{ .Labels.severity }}
      Alert: {{ $alert.Labels.alertname }}
      Instance: {{ $alert.Labels.instance }}
      Summary: {{ $alert.Annotations.summary }}
      Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }};
      Value: {{ .Annotations.value }}
      Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      ========= = end = =========
      {{- end }}
      {{- end }}
      {{- end }}
      {{- if gt (len .Alerts.Resolved) 0 -}}
      {{- range $index, $alert := .Alerts -}}
      {{- if eq $index 0 }}
      ========= Alert resolved =========
      Alert: {{ .Labels.alertname }}
      Status: {{ .Status }}
      Summary: {{ $alert.Annotations.summary }}
      Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }};
      Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      Resolved at: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      {{- if gt (len $alert.Labels.instance) 0 }}
      Instance: {{ $alert.Labels.instance }}
      {{- end }}
      ========= = end = =========
      {{- end }}
      {{- end }}
      {{- end }}
      {{- end }}
  # expose alertmanager through an ingress (a NodePort service also works if preferred)
  ingress:
    enabled: true # enable the ingress
    ingressClassName: nginx # if the cluster runs several ingress controllers, pick the one to expose this on; here it is ingress-nginx-controller
    annotations: {}
    labels: {}
    hosts:
      - alertmanager.test.com # the host name to expose
    paths: []
    tls: []
  secret:
    annotations: {}
  # persistent storage (in the chart this lives under alertmanager.alertmanagerSpec)
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: prometheus # the StorageClass created earlier
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi
          selector: # bind to a specific PV
            matchLabels:
              use: alert # must match the labels set on the PV above
Grafana configuration:
grafana:
  enabled: true
  defaultDashboardsEnabled: true
  adminPassword: admin@123 # the default admin password
  plugins:
    - grafana-kubernetes-app # optional plugin that ships Kubernetes-related dashboards
  # persistent storage
  persistence:
    type: pvc
    enabled: true
    selector: # bind to the PV carrying this label
      matchLabels:
        use: grafana
    storageClassName: prometheus # the StorageClass created earlier
    accessModes:
      - ReadWriteOnce
    size: 10Gi
    finalizers:
      - kubernetes.io/pvc-protection
  ingress:
    enabled: true # enable the ingress
    annotations:
      kubernetes.io/ingress.class: nginx # select the ingress class
    labels: {}
    hosts:
      - grafana.sz.com # the grafana host name
    path: /
The prometheus section mainly adds the ingress configuration, an extra scrape target for the ingress controller, and persistence:
prometheus:
  enabled: true
  annotations: {}
  serviceAccount:
    create: true
    name: ""
  service:
    annotations: {}
    labels: {}
    clusterIP: ""
    port: 9090
    targetPort: 9090
    externalIPs: []
    nodePort: 30090
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    type: ClusterIP # change to NodePort to expose the service on port 30090 defined above
    sessionAffinity: ""
  ingress:
    enabled: true # enable the ingress
    ingressClassName: nginx
    annotations: {}
    labels: {}
    hosts:
      - prometheus.test.com
    paths: []
    tls: []
  # the settings below live under prometheus.prometheusSpec in the chart's values.yaml
  prometheusSpec:
    image:
      repository: quay.io/prometheus/prometheus
      tag: v2.22.1
      sha: ""
    # persistent storage
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: prometheus
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
          selector:
            matchLabels:
              use: prometheus
    # extra scrape config for the ingress controller; the ingress controller deployment also needs the
    # prometheus.io/scrape: "true" annotation before its metrics are actually scraped
    additionalScrapeConfigs:
    - job_name: 'ingress-nginx-endpoints'
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
          - ingress-nginx
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        action: keep
        regex: metrics
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - source_labels: [__meta_kubernetes_service_name]
        regex: prometheus-server
        action: drop
Edit the ingress controller deployment and add the prometheus.io/scrape: "true" annotation; note that it goes under spec.template.metadata.annotations:
$ kubectl edit deploy -n ingress-nginx ingress-nginx-controller
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
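If you prefer a one-liner over kubectl edit, a strategic-merge patch along these lines should add the same annotation (a sketch; adjust the deployment name and namespace if your ingress controller differs):
$ kubectl patch deployment ingress-nginx-controller -n ingress-nginx \
  -p '{"spec":{"template":{"metadata":{"annotations":{"prometheus.io/scrape":"true"}}}}}'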
In Kubernetes 1.19.4 some control-plane components serve their metrics on different ports, so the ports defined by the default serviceMonitors need to be overridden:
# kubeControllerManager:
kubeControllerManager:
  enabled: true
  service:
    port: 10257 # the metrics port is now 10257 (https)
    targetPort: 10257
  serviceMonitor:
    https: true # metrics are served over https by default
    insecureSkipVerify: true # skip certificate verification
# etcd
kubeEtcd:
  enabled: true
  service:
    port: 2381
    targetPort: 2381
  serviceMonitor:
    scheme: http
    insecureSkipVerify: false
# kubeScheduler:
kubeScheduler:
  enabled: true
  service:
    port: 10259
    targetPort: 10259
  serviceMonitor:
    https: true
    insecureSkipVerify: true
# kubeProxy:
kubeProxy:
  enabled: true
  service:
    port: 10249
    targetPort: 10249
After saving the changes, run the following from the directory containing values.yaml to install the Prometheus Operator; installing into the monitoring namespace matches the checks below:
$ helm install prometheus . -n monitoring --create-namespace
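If you change values.yaml again later, the same chart directory can be rolled out again with helm upgrade instead of reinstalling (assuming the release was installed into the monitoring namespace as above):
$ helm upgrade prometheus . -n monitoring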
Check the installation progress with:
$ kubectl get pod -n monitoring --watch
Once all pods are Running, open the ingress hosts defined above in a browser; if your ingress controller does not listen on port 80, append its port to the URL, e.g. http://prometheus.test.com:<ingress-controller-port>.
By default the etcd, kube-scheduler and kube-controller-manager targets in Prometheus report scrape failures. On a cluster deployed with kubeadm these components bind their metrics ports to 127.0.0.1, so their manifests must be changed to make those ports listen on an address Prometheus can reach.
$ cd /etc/kubernetes/manifests # on the master node, change into this directory
# vi etcd.yaml, add "--listen-metrics-urls=http://0.0.0.0:2381"
spec:
  containers:
  - command:
    - etcd
    - --listen-metrics-urls=http://0.0.0.0:2381
    ...
# vi kube-controller-manager.yaml, set "--bind-address=0.0.0.0"
spec:
  containers:
  - command:
    - kube-controller-manager
    - --bind-address=0.0.0.0
    ...
# vi kube-scheduler.yaml, set "--bind-address=0.0.0.0"
spec:
  containers:
  - command:
    - kube-scheduler
    - --bind-address=0.0.0.0
    ...
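The kubelet recreates these static pods automatically once the manifest files change. As a quick check (run from any machine that can reach the master; <master-ip> is a placeholder for your master node's address), etcd's metrics endpoint serves plain HTTP and should now respond:
$ curl -s http://<master-ip>:2381/metrics | head
The kube-controller-manager and kube-scheduler metrics endpoints require authentication, so verify those on the Prometheus Targets page instead; they should turn UP within a scrape interval or two.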
Configuring the Kubernetes plugin in Grafana
Click the plugins icon in the left sidebar and open the Kubernetes plugin when its icon appears.
Click the link to enter the configuration page.
Enter a name for the cluster and the API server address; since Grafana runs inside the cluster, the in-cluster address can be used here. Turn on the TLS and CA auth toggles and paste the certificates underneath. They can be taken from /root/.kube/config on the master node: certificate-authority-data, client-certificate-data and client-key-data correspond to CA Cert, Client Cert and Client Key respectively. Note that the values in the kubeconfig are base64-encoded and must be decoded with base64 -d before they can be used.
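For reference, something like the following extracts and decodes the three values on the master node (a sketch that assumes each key appears only once in /root/.kube/config):
$ grep 'certificate-authority-data' /root/.kube/config | awk '{print $2}' | base64 -d > ca.crt
$ grep 'client-certificate-data' /root/.kube/config | awk '{print $2}' | base64 -d > client.crt
$ grep 'client-key-data' /root/.kube/config | awk '{print $2}' | base64 -d > client.key
Paste the contents of ca.crt, client.crt and client.key into the CA Cert, Client Cert and Client Key fields respectively.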
Choose prometheus as the data source and click save.
After saving, a new Kubernetes icon appears in the left sidebar; click it to open the corresponding dashboards.
That completes the installation!
Using custom monitoring and alerting:
Custom alerts can be added by creating a PrometheusRule resource:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    # these two labels are required, otherwise the operator will not turn this definition into an alerting rule;
    # they must match the ruleSelector.matchLabels of the prometheus resource, which can be checked with
    # kubectl get prometheus -n monitoring prometheus-kube-prometheus-prometheus -oyaml
    app: kube-prometheus-stack
    release: prometheus
  name: deployment-status
  namespace: monitoring
spec:
  groups:
  - name: deployment-status
    rules:
    - alert: DeploymentUnavailable # the alert name shown in the Prometheus UI
      annotations:
        summary: deployment {{ $labels.deployment }} unavailable # can be referenced in alert templates via {{ $alert.Annotations.summary }}
        description: deployment {{ $labels.deployment }} has {{ $value }} unavailable replicas
      expr: |
        kube_deployment_status_replicas_unavailable > 0 # the trigger condition
      for: 3m # alertmanager receives the alert once the condition has held for 3 minutes
      labels:
        severity: critical # labels defined here are passed on to alertmanager
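Assuming the manifest above is saved as deployment-status-rule.yaml (the file name is arbitrary), apply it and confirm the object exists; the new rule group should then show up on the Prometheus Rules page shortly afterwards:
$ kubectl apply -f deployment-status-rule.yaml
$ kubectl get prometheusrule -n monitoring deployment-status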
Custom scrape targets can be added by creating ServiceMonitor objects, or via the chart's values.yaml as was done above for the ingress controller target; see the official documentation for further details and options. A minimal ServiceMonitor sketch follows below.
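As an illustration only, a minimal ServiceMonitor could look like this; every name, namespace, label and port in it is hypothetical and must match the Service you actually want to scrape, while the release: prometheus label mirrors what this chart's default serviceMonitorSelector expects:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus # picked up by the operator's serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
    - default # namespace of the Service being scraped
  selector:
    matchLabels:
      app: my-app # labels on the Service being scraped
  endpoints:
  - port: metrics # name of the Service port that exposes /metrics
    interval: 30s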
References:
https://github.com/prometheus...
https://www.cnblogs.com/miaocbin/p/13706164.html