Abstract: An ES cluster is a powerful tool for big data storage, analysis, and fast retrieval. This article briefly describes the ES cluster architecture and provides a sample of rapid ES cluster deployment on Kubernetes; it introduces monitoring and O&M tools for ES clusters, shares some troubleshooting experience, and finally summarizes commonly used ES cluster API calls.
This article is shared from the Huawei Cloud community post "Kubernetes ES cluster deployment and operation and maintenance", original author: minucas.
ES cluster architecture:
ES can run in single-node mode or cluster mode. Single-node mode is generally not recommended for production; cluster mode is recommended. Cluster mode is further divided into a deployment in which the Master and Data roles are taken on by the same nodes, and one in which the Master and Data roles run on separate nodes. Deploying Master nodes and Data nodes separately is more reliable. The following figure shows the deployment architecture of the ES cluster:
Use K8s for ES cluster deployment:
1. Deploy with a k8s StatefulSet so that ES nodes can be scaled out and in quickly. This example uses 3 Master nodes + 12 Data nodes.
2. Configure the corresponding domain names and service discovery through a k8s Service to ensure the cluster nodes can find each other and be monitored automatically.
kubectl -s http://ip:port create -f es-master.yaml
kubectl -s http://ip:port create -f es-data.yaml
kubectl -s http://ip:port create -f es-service.yaml
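The headless Service defined below is what gives each node a stable DNS record for discovery. As a rough sketch (the image used here wires this up through the NODE_MASTER/NODE_DATA environment variables, so these exact lines are an assumption for images configured via elasticsearch.yml), the equivalent discovery settings would be:

# elasticsearch.yml (sketch): discover peers via the headless service "es" in namespace "default"
discovery.zen.ping.unicast.hosts: es.default.svc.cluster.local
# quorum for 3 master-eligible nodes: (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2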
es-master.yaml:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: es
    kubernetes.io/cluster-service: "true"
    version: v6.2.5
  name: es-master
  namespace: default
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: es
      version: v6.2.5
  serviceName: es
  template:
    metadata:
      labels:
        k8s-app: es    # must match spec.selector.matchLabels
        kubernetes.io/cluster-service: "true"
        version: v6.2.5
    spec:
      containers:
      - env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: ELASTICSEARCH_SERVICE_NAME
          value: es
        - name: NODE_MASTER
          value: "true"
        - name: NODE_DATA
          value: "false"
        - name: ES_HEAP_SIZE
          value: 4g
        - name: ES_JAVA_OPTS
          value: -Xmx4g -Xms4g
        - name: cluster.name
          value: es
        image: elasticsearch:v6.2.5
        imagePullPolicy: Always
        name: es
        ports:
        - containerPort: 9200
          hostPort: 9200
          name: db
          protocol: TCP
        - containerPort: 9300
          hostPort: 9300
          name: transport
          protocol: TCP
        resources:
          limits:
            cpu: "6"
            memory: 12Gi
          requests:
            cpu: "4"
            memory: 8Gi
        securityContext:
          capabilities:
            add:
            - IPC_LOCK
            - SYS_RESOURCE
        volumeMounts:
        - mountPath: /data
          name: es
      - command:
        - /bin/elasticsearch_exporter
        - -es.uri=http://localhost:9200
        - -es.all=true
        image: elasticsearch_exporter:1.0.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9108
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: es-exporter
        ports:
        - containerPort: 9108
          hostPort: 9108
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9108
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          limits:
            cpu: 100m
            memory: 128Mi
          requests:
            cpu: 25m
            memory: 64Mi
        securityContext:
          capabilities:
            drop:
            - SETPCAP
            - MKNOD
            - AUDIT_WRITE
            - CHOWN
            - NET_RAW
            - DAC_OVERRIDE
            - FOWNER
            - FSETID
            - KILL
            - SETGID
            - SETUID
            - NET_BIND_SERVICE
            - SYS_CHROOT
            - SETFCAP
          readOnlyRootFilesystem: true
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - /sbin/sysctl
        - -w
        - vm.max_map_count=262144
        image: alpine:3.6
        imagePullPolicy: IfNotPresent
        name: elasticsearch-logging-init
        resources: {}
        securityContext:
          privileged: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      volumes:
      - hostPath:
          path: /Data/es
          type: DirectoryOrCreate
        name: es
es-data.yaml:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: es
    kubernetes.io/cluster-service: "true"
    version: v6.2.5
  name: es-data
  namespace: default
spec:
  podManagementPolicy: OrderedReady
  replicas: 12
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: es
      version: v6.2.5
  serviceName: es
  template:
    metadata:
      labels:
        k8s-app: es
        kubernetes.io/cluster-service: "true"
        version: v6.2.5
    spec:
      containers:
      - env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: ELASTICSEARCH_SERVICE_NAME
          value: es
        - name: NODE_MASTER
          value: "false"
        - name: NODE_DATA
          value: "true"
        - name: ES_HEAP_SIZE
          value: 16g
        - name: ES_JAVA_OPTS
          value: -Xmx16g -Xms16g
        - name: cluster.name
          value: es
        image: elasticsearch:v6.2.5
        imagePullPolicy: Always
        name: es
        ports:
        - containerPort: 9200
          hostPort: 9200
          name: db
          protocol: TCP
        - containerPort: 9300
          hostPort: 9300
          name: transport
          protocol: TCP
        resources:
          limits:
            cpu: "8"
            memory: 32Gi
          requests:
            cpu: "7"
            memory: 30Gi
        securityContext:
          capabilities:
            add:
            - IPC_LOCK
            - SYS_RESOURCE
        volumeMounts:
        - mountPath: /data
          name: es
      - command:
        - /bin/elasticsearch_exporter
        - -es.uri=http://localhost:9200
        - -es.all=true
        image: elasticsearch_exporter:1.0.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9108
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: es-exporter
        ports:
        - containerPort: 9108
          hostPort: 9108
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9108
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          limits:
            cpu: 100m
            memory: 128Mi
          requests:
            cpu: 25m
            memory: 64Mi
        securityContext:
          capabilities:
            drop:
            - SETPCAP
            - MKNOD
            - AUDIT_WRITE
            - CHOWN
            - NET_RAW
            - DAC_OVERRIDE
            - FOWNER
            - FSETID
            - KILL
            - SETGID
            - SETUID
            - NET_BIND_SERVICE
            - SYS_CHROOT
            - SETFCAP
          readOnlyRootFilesystem: true
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - /sbin/sysctl
        - -w
        - vm.max_map_count=262144
        image: alpine:3.6
        imagePullPolicy: IfNotPresent
        name: elasticsearch-logging-init
        resources: {}
        securityContext:
          privileged: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      volumes:
      - hostPath:
          path: /Data/es
          type: DirectoryOrCreate
        name: es
es-service.yaml:
apiVersion: v1
kind: Service
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: es
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: Elasticsearch
  name: es
  namespace: default
spec:
  clusterIP: None
  ports:
  - name: es
    port: 9200
    protocol: TCP
    targetPort: 9200
  - name: exporter
    port: 9108
    protocol: TCP
    targetPort: 9108
  selector:
    k8s-app: es
  sessionAffinity: None
  type: ClusterIP
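Once the three files are applied, the deployment can be verified quickly. A minimal sketch (the pod name assumes the StatefulSet above, and that curl is available inside the image):

kubectl -s http://ip:port get pods -l k8s-app=es
kubectl -s http://ip:port exec es-master-0 -- curl -s http://localhost:9200/_cluster/health?pretty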
ES cluster monitoring
As the saying goes, a worker who wants to do his job well must first sharpen his tools: operating and maintaining middleware requires adequate monitoring first. Three tools are commonly used to monitor ES clusters: exporter, ES-head, and kopf (Cerebro). Because the ES cluster is deployed on k8s, many of these capabilities are set up in combination with k8s.
Grafana monitoring
Deploy es-exporter through k8s to expose monitoring metrics, have Prometheus collect the metrics, and display them on a custom Grafana dashboard.
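As a hedged sketch, a Prometheus scrape job that discovers the exporter through the Service defined above (service name es, port name exporter) might look like this; the job name is arbitrary:

scrape_configs:
- job_name: es-exporter
  kubernetes_sd_configs:
  - role: endpoints                # discover all Endpoints objects in the cluster
  relabel_configs:
  # keep only the "exporter" port of the "es" Service defined earlier
  - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    regex: es;exporter
    action: keep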
ES-head components
github address: https://github.com/mobz/elasticsearch-head
The ES-head component can be found and installed through the Chrome Web Store; the Chrome extension can then be used to view the status of the ES cluster.
Cerebro (kopf) components
github address: https://github.com/lmenezes/cerebro
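Cerebro is the successor to kopf. For a quick local trial it can be run from the project's Docker image and then pointed at the cluster's 9200 endpoint, for example:

docker run -d -p 9000:9000 lmenezes/cerebro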
ES cluster problem handling
ES configuration
Resource configuration: pay attention to the ES CPU, memory, heap size, and Xms/Xmx settings. For an 8U32G machine, it is recommended to set the heap (Xms = Xmx) to 50% of memory; the official documentation recommends that a single node's memory not exceed 64G.
Index configuration: since ES retrieval is located by index, and ES loads the relevant index data into memory to speed up retrieval, a reasonable index layout has a great impact on ES performance. We currently create indices by date (indices with small data volumes may not need to be split by date).
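For example, date-based indices can be given uniform settings up front with an index template; a minimal sketch (the template name is illustrative, and the shard/replica counts are taken from the sizing notes below):

PUT _template/ingress-logs
{
  "index_patterns": ["szv-prod-ingress-nginx-*"],
  "settings": {
    "number_of_shards": 24,
    "number_of_replicas": 1
  }
}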
ES load
Focus on nodes with higher CPU usage and load. A likely cause is uneven shard allocation; in that case, the unevenly distributed shards can be manually relocated.
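A hedged example of moving one shard off a hot node with the reroute API (the index and node names are placeholders):

POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "szv-prod-ingress-nginx-2021.05.01",
        "shard": 0,
        "from_node": "node-hot",
        "to_node": "node-idle"
      }
    }
  ]
}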
Shard configuration
The number of shards should preferably be an integer multiple of the number of data nodes. More shards is not necessarily better: the shard count should be chosen according to the indexed data volume, ensuring that each shard does not exceed the heap memory allocated to a single data node. For example, our largest index holds about 150G per day and is divided into 24 shards, giving a single-shard size of about 6-7G.
The recommended number of replicas is 1. Too many replicas easily lead to frequent data relocation and increase cluster load.
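Per-shard sizes can be checked against the heap budget with the _cat API, for example (the index pattern is illustrative):

GET _cat/shards/szv-prod-ingress-nginx-*?v&h=index,shard,prirep,store&s=store:desc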
Delete abnormal index
curl -X DELETE "10.64.xxx.xx:9200/szv-prod-ingress-nginx-2021.05.01"
Index names can be matched with wildcards for batch deletion, e.g.: -2021.05.*
Another reason for high node load
While locating a problem, we found that a node's data shards had been moved away, yet its load did not come down. Logging in to the node and running top showed that the kubelet process's CPU usage was very high, and restarting kubelet did not help; the load was relieved only after restarting the node.
Summary of routine ES cluster operation and maintenance experience (see the official documentation)
View cluster health status
The health status of an ES cluster is one of three values: Green, Yellow, and Red.
- Green: the cluster is healthy;
- Yellow: the cluster is not fully healthy, but it can rebalance and recover automatically within its load tolerance;
- Red: the cluster has a problem, some data is not ready, and at least one primary shard has not been allocated.
The health status of the cluster and unallocated shards can be queried through the API:
GET _cluster/health
{
  "cluster_name": "camp-es",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 15,
  "number_of_data_nodes": 12,
  "active_primary_shards": 2176,
  "active_shards": 4347,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}
View pending tasks:
GET /_cat/pending_tasks
The priority field indicates the priority of the task
Check the reason why the shard is not allocated
GET _cluster/allocation/explain
The reason field indicates why the shard was originally unassigned, and the details field gives the detailed explanation.
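An abridged, illustrative response (the values will differ per cluster):

{
  "index": "xxx",
  "shard": 1,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NODE_LEFT",
    "details": "node_left[...]"
  },
  "can_allocate": "no_valid_shard_copy",
  "allocate_explanation": "cannot allocate because all found copies of the shard are either stale or corrupt"
}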
View all indices whose primary shards are unassigned (red):
GET /_cat/indices?v&health=red
Check which shards are abnormal
curl -s http://ip:port/_cat/shards | grep UNASSIGNED
Reallocate a primary shard:
POST /_cluster/reroute?pretty
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "xxx",
        "shard": 1,
        "node": "12345...",
        "accept_data_loss": true
      }
    }
  ]
}
Here node is the id of the ES cluster node, which can be queried with curl 'ip:port/_nodes/process?pretty'
Reduce the number of copies of the index
PUT /szv_ingress_*/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}
Click Follow to be the first to learn about Huawei Cloud's fresh technologies~