Foreword

I recently upgraded the Alibaba Cloud k8s component Cloud Controller Manager from v2.1.0 to v2.3.0. The upgrade did not go smoothly, so I am recording how I solved it in case the same problem comes up again.

Procedure

Click Upgrade. The pre-check fails with an error.

The event center also prints:

 DryRun: Error syncing load balancer [lb-bp1erkwv3fcdyobqd7x3k]: Message: loadbalancer lb-bp1erkwv3fcdyobqd7x3k listener 80 should be updated, VGroupId rsp-bp1up5x12mwt6 should be changed to rsp-bp1tsakxo59ww;

 DryRun: Error syncing load balancer [lb-bp1erkwv3fcdyobqd7x3k]: Message: loadbalancer lb-bp1erkwv3fcdyobqd7x3k listener 443 should be updated, VGroupId rsp-bp1cuciusq2zf should be changed to rsp-bp11d0mmv0cma;
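When there are several listeners, it can be easier to reduce these messages to a short "port, current group, expected group" list. The following is only a sketch that assumes the event text keeps the exact wording shown above; `parse_vgroup_events` is a hypothetical helper name, not an Alibaba Cloud tool.

```shell
#!/bin/sh
# Print "<listener-port> <current-vgroup> <expected-vgroup>" for every
# DryRun event line read from stdin. Lines that do not match are skipped.
# Assumption: the message wording is exactly as in the events above.
parse_vgroup_events() {
  sed -n -E 's/.*listener ([0-9]+) should be updated, VGroupId ([^ ]+) should be changed to ([^ ;]+).*/\1 \2 \3/p'
}

# Usage example with the two events copied from the event center.
parse_vgroup_events <<'EOF'
DryRun: Error syncing load balancer [lb-bp1erkwv3fcdyobqd7x3k]: Message: loadbalancer lb-bp1erkwv3fcdyobqd7x3k listener 80 should be updated, VGroupId rsp-bp1up5x12mwt6 should be changed to rsp-bp1tsakxo59ww;
DryRun: Error syncing load balancer [lb-bp1erkwv3fcdyobqd7x3k]: Message: loadbalancer lb-bp1erkwv3fcdyobqd7x3k listener 443 should be updated, VGroupId rsp-bp1cuciusq2zf should be changed to rsp-bp11d0mmv0cma;
EOF
```

Each output line names one listener and the two group IDs involved, which is the information needed for the manual switch in the SLB console.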

This points to the load balancer, so I checked the SLB. The fix is to switch listener 80 from VGroupId rsp-bp1up5x12mwt6 to rsp-bp1tsakxo59ww, and listener 443 from rsp-bp1cuciusq2zf to rsp-bp11d0mmv0cma.

According to the event center messages, we only need to switch the virtual server groups bound to listeners 80 and 443.

Switching the virtual server group

1. Click to modify the listener configuration for port 80 or 443

2. Click Next

3. Specify the server group

4. Keep clicking Next until the wizard is complete

That's it. Click Upgrade on Cloud Controller Manager again and this time it goes through.

Summary

1. All four virtual server groups above were generated by the system
2. After upgrading k8s, the configuration changed back and I had to repeat the procedure, which was tedious. So I deleted the two leftover groups (marked 1 and 2 in the screenshot) and will keep an eye out for problems.

2022-7-6 Update: The website is intermittently unavailable. The problem has come back

After the recent ACK component upgrade, the service worked only intermittently: available for a while, then unavailable for a while. I opened a ticket, and the Alibaba Cloud engineers and I spent two days troubleshooting. The process is recorded below as a reference for anyone who runs into the same problem.

Problem 1: The elasticsearch pod cannot be deleted
Analysis:
I installed ES on ACK, found that the cluster did not have enough resources, and decided to delete it. One pod, however, stayed stuck in Terminating and would not go away.

 pod/test-elasticsearch-ingest-0   1/1     Terminating   1 (6h41m ago)   6h42m

Solution:
After talking to the Alibaba Cloud engineers, the fix was simply to force-delete the pod:

 kubectl delete pod <your-pod-name> -n <name-space> --force --grace-period=0
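Force deletion removes the pod object from the API server without waiting for the kubelet to confirm that the containers have stopped, so it is worth being sure which pods are actually stuck before running it. The sketch below assumes the default `kubectl get pods` table layout (NAME, READY, STATUS, RESTARTS, AGE); `stuck_pods` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the names of pods whose STATUS column is "Terminating".
# Expects the default `kubectl get pods` table on stdin
# (columns: NAME READY STATUS RESTARTS AGE).
stuck_pods() {
  awk '$3 == "Terminating" { print $1 }'
}

# Usage (the namespace is a placeholder):
#   kubectl get pods -n <name-space> --no-headers | stuck_pods
```

Only the names it prints would then be passed to the force delete command above.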

Problem 2: My website is intermittently unavailable


Analysis:
1. Log in to the node (ECS) and check whether kubelet is healthy - not resolved

 systemctl status kubelet.service

2. Update the virtual server groups. The website was still intermittently unavailable afterwards - not resolved

3. The Alibaba Cloud engineers thought the nginx-ingress-controller component version was too old: the new version had reached 1.21 while mine was still 0.44, so I upgraded it manually - not resolved

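 Procedure for manually upgrading the nginx-ingress-controller version

Risk: this overwrites any changes the customer made to the nginx deployment outside the component management configuration page

Note: the webhook job pods in the kube-system namespace must be cleaned up beforehand

Upgrade API: https://next.api.aliyun.com/api/CS/2015-12-15/UpgradeClusterAddons?lang=JAVA&params={%22body%22:[{%22component_name%22:%22nginx-ingress-controller%22,%22next_version%22:%22v1.2.0-aliyun.1%22}]}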

Two jobs were deleted in the process:

 kubectl get job -n kube-system
 NAME                                    COMPLETIONS   DURATION   AGE
 ingress-nginx-admission-create          1/1           4s         456d
 ingress-nginx-admission-patch           1/1           5s         456d
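To avoid deleting the wrong job in kube-system, the webhook jobs can be picked out by name first. This is a rough sketch that assumes the jobs keep the `ingress-nginx-admission-` prefix seen in the listing above; `admission_jobs` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the names of jobs whose NAME starts with "ingress-nginx-admission-".
# Expects `kubectl get job` table output on stdin; the header line is
# skipped naturally because "NAME" does not match the prefix.
admission_jobs() {
  awk '$1 ~ /^ingress-nginx-admission-/ { print $1 }'
}

# Usage: review the list first, then delete.
#   kubectl get job -n kube-system | admission_jobs
#   kubectl get job -n kube-system | admission_jobs | xargs kubectl delete job -n kube-system
```

Running the first usage line on its own shows exactly what the second one would delete.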

Then the second engineer replied:

I just talked to our back-end R&D colleagues. The console upgrade path for ingress from 0.44 to 1.x is still under development. Once it is released, you will be able to upgrade directly from the component management page, and the changes to the deployment will be shown in detail. If you do not strictly need the ingress upgrade right now, we recommend waiting for that release.


4. I then opened a new ticket (I had thought it was solved!). The Alibaba Cloud engineer asked me to recreate the nginx-ingress-controller pods - not resolved

So I deleted the two nginx-ingress-controller pods. Afterwards two new virtual server groups appeared in the load balancer, and my website could not be opened at all. Before, it had opened about half the time; now it was completely down. Sad as that was, I wondered whether there were other solutions, so I kept asking.

5. The Alibaba Cloud engineer then said: recreate the nginx-ingress-lb service by first changing it to type ClusterIP and then changing it back to the load balancer type - not resolved

6. The Alibaba Cloud engineer said that since this still did not work, the problem was not in the ingress or the SLB but in my own services, and asked me to send the error logs - solved

gateway-server log:

 io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Host is unreachable: /172.20.0.235:8902

space-service log:

 removed ips(1) service: DEFAULT_GROUP@@space-service@@DEFAULT -> [{"instanceId":"172.20.0.247#8902#DEFAULT#DEFAULT_GROUP@@space-service","ip":"172.20.0.247","port":8902,"weight":1.0,"healthy":true,"enabled":true,"ephemeral":true,"clusterName":"DEFAULT","serviceName":"DEFAULT_GROUP@@space-service","metadata":{"preserved.register.source":"SPRING_CLOUD"},"instanceHeartBeatInterval":5000,"instanceHeartBeatTimeOut":15000,"ipDeleteTimeout":30000}]
2022-07-06 10:22:36.240  INFO 1 --- [.naming.updater] com.alibaba.nacos.client.naming          : current ips:(1) service: DEFAULT_GROUP@@space-service@@DEFAULT -> [{"instanceId":"172.20.0.235#8902#DEFAULT#DEFAULT_GROUP@@space-
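The address the gateway cannot reach (172.20.0.235:8902) is the same one that the Nacos client lists under "current ips", so it helps to pull the registered addresses out of the log and compare them with the live pod IPs. The sketch below assumes the `instanceId` format shown in the log above (`ip#port#...`); `nacos_instances` is a hypothetical helper name.

```shell
#!/bin/sh
# Print every "ip:port" pair found in Nacos client log lines on stdin.
# Assumption: each instance appears as "instanceId":"<ip>#<port>#...".
nacos_instances() {
  grep -o '"instanceId":"[0-9.]*#[0-9]*' |
    sed -E 's/"instanceId":"([0-9.]+)#([0-9]+)/\1:\2/'
}

# Usage (the log file name is a placeholder):
#   grep 'current ips' space-service.log | nacos_instances
# Compare the result with the pod IPs reported by:
#   kubectl get pods -n seaurl -o wide
```

If an address printed here does not belong to any running pod, the registration in Nacos is stale, which would match the "Host is unreachable" error in the gateway log.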

I then realized the problem was related to Nacos: in Nacos, the IP registered for space-service kept changing. So I decided to delete the space-service deployment in k8s and recreate it.

 kubectl delete deploy testapi-seaurl-space-service -n seaurl

After the space-service deployment was recreated, Nacos went back to normal and my website was reachable again. Finally solved!


Awbeci