ETCD node failover

Unreachable member

Assume a cluster of etcd containers has been created successfully.
Check the cluster status with the following command.

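# etcdctl --endpoint <endpoint> cluster-health

<endpoint> is the client URL of any reachable member, for example https://10.23.2.108:3379.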

If the cluster is running normally, the output looks like:

member xxx is healthy: got healthy result from https://10.23.2.109:3379
member xxx is healthy: got healthy result from https://10.23.2.108:3379
member xxx is healthy: got healthy result from https://10.23.2.110:3379
cluster is healthy

If one member has failed, the output may look like:

failed to check the health of member xxx on https://10.23.2.109:3379: Get https://10.23.2.109:3379/health: dial tcp 10.23.2.109:3379: connect: connection refused
member xxx is unreachable: [https://10.23.2.109:3379] are all unreachable
member xxx is healthy: got healthy result from https://10.23.2.108:3379
member xxx is healthy: got healthy result from https://10.23.2.110:3379
cluster is healthy
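
To pick the failed endpoints out of a long health report, the output can be filtered. The following is a minimal sketch, not part of the original procedure: the function name unreachable_endpoints is made up for illustration, and it assumes the "is unreachable" line has exactly the shape shown above, with multiple URLs inside the brackets separated by spaces.

```shell
# Print the endpoints that cluster-health reported as unreachable, one per line.
# Reads the cluster-health output on stdin.
unreachable_endpoints() {
  # Lines of interest look like:
  #   member xxx is unreachable: [https://host:port] are all unreachable
  # Keep only the text between the brackets, then put each URL on its own line.
  sed -n 's/^member .* is unreachable: \[\(.*\)] are all unreachable$/\1/p' |
    tr ' ' '\n'
}

# Usage sketch (replace <endpoint> with a reachable client URL):
#   etcdctl --endpoint <endpoint> cluster-health | unreachable_endpoints
```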

The cause is usually one of the following four cases.

Case 1: The whole environment of an etcd container was destroyed.

Solution

Remove the destroyed member with etcdctl.

# etcdctl member remove xxx

Here xxx is the member ID of the unreachable member.

Create a new etcd container, adding the following environment variables to env in its config file.

"ETCD_INITIAL_CLUSTER_STATE": "existing"
"ETCD_INITIAL_CLUSTER": <the cluster peer URLs, including the new etcd container>

In ETCD_INITIAL_CLUSTER, the entries "hostname2=https://10.23.2.108:3380,hostname3=https://10.23.2.110:3380" are the peer URLs of the members that remain after the destroyed member is removed. The new container's own name=peerURL entry is listed alongside them.
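
The value of ETCD_INITIAL_CLUSTER is a comma-separated list of name=peerURL pairs, which is easy to get wrong when typed by hand. The sketch below shows one way to assemble it. The helper name build_initial_cluster is hypothetical, and so is the replacement member: hostname4 and 10.23.2.111 do not appear in the original setup and stand in for whatever name and address the new container actually uses.

```shell
# Join "name=peerURL" arguments into the comma-separated form used by
# ETCD_INITIAL_CLUSTER. The subshell keeps the IFS change local.
build_initial_cluster() {
  ( IFS=,; printf '%s\n' "$*" )
}

# Surviving members from the text, plus a hypothetical replacement member
# (hostname4 / 10.23.2.111 are illustrative values only).
build_initial_cluster \
  "hostname2=https://10.23.2.108:3380" \
  "hostname3=https://10.23.2.110:3380" \
  "hostname4=https://10.23.2.111:3380"
```

The printed string can be pasted as the value of "ETCD_INITIAL_CLUSTER" in the new container's config file.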

Add the new container to the existing cluster.

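# etcdctl --endpoint <endpoint> member add <name> <peerURL>

<endpoint> is the client URL of a healthy member of the existing cluster.

<name> is the hostname set in the new container's config file.

<peerURL> is one of the URLs listed in ETCD_INITIAL_ADVERTISE_PEER_URLS in the new container's config file.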

Case 2: The etcd container doesn't exist.

Solution

Add "ETCD_INITIAL_CLUSTER_STATE": "existing" to the config file used to create the container.
Create the container with the new config file, keeping all other settings the same as before.

Case 3: The etcd container was stopped.

Solution

Start the container.

# docker start <container>

Case 4: The etcd service was stopped in its container.

Solution

Restart the container whose etcd service has stopped.

# docker restart <container>
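
Cases 2 to 4 differ only in the state of the container, so the choice of fix can be sketched as a small lookup. This is an illustration under stated assumptions rather than a tested recovery tool: the function name recovery_action and the container name my-etcd are hypothetical, and the mapping only makes sense for a member that cluster-health has already reported as unreachable.

```shell
# Map a container's observed state to the recovery step from Cases 2 to 4.
#   $1: output of `docker inspect -f '{{.State.Running}}' <container>`,
#       or an empty string if that command failed (no such container).
recovery_action() {
  case "$1" in
    true)  echo "docker restart" ;;  # Case 4: container is up, etcd inside it is down
    false) echo "docker start" ;;    # Case 3: container exists but is stopped
    *)     echo "recreate" ;;        # Case 2: container does not exist
  esac
}

# Usage sketch (my-etcd is a hypothetical container name):
#   state=$(docker inspect -f '{{.State.Running}}' my-etcd 2>/dev/null)
#   recovery_action "$state"
```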

Unhealthy member

If a member is unhealthy, remove its container together with its metadata, then create a new one as described in Case 2 above.

AndyZhang
