Foreword
I recently upgraded the Alibaba Cloud ACK component Cloud Controller Manager from v2.1.0 to v2.3.0, and the upgrade did not go smoothly. I am recording the troubleshooting process here so the same problem can be handled quickly next time.
Operation
I clicked Upgrade, and the pre-check reported an error, as shown below:
The event center also printed the following:
DryRun: Error syncing load balancer [lb-bp1erkwv3fcdyobqd7x3k]: Message: loadbalancer lb-bp1erkwv3fcdyobqd7x3k listener 80 should be updated, VGroupId rsp-bp1up5x12mwt6 should be changed to rsp-bp1tsakxo59ww;
DryRun: Error syncing load balancer [lb-bp1erkwv3fcdyobqd7x3k]: Message: loadbalancer lb-bp1erkwv3fcdyobqd7x3k listener 443 should be updated, VGroupId rsp-bp1cuciusq2zf should be changed to rsp-bp11d0mmv0cma;
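To know exactly which server groups need switching, the two DryRun messages can be parsed mechanically. A small sketch (the message text is copied from the first event above):

```shell
# Sketch: extract the listener port and the old/new VGroupIds from one of the
# DryRun event messages.
msg='loadbalancer lb-bp1erkwv3fcdyobqd7x3k listener 80 should be updated, VGroupId rsp-bp1up5x12mwt6 should be changed to rsp-bp1tsakxo59ww'
listener=$(echo "$msg" | sed -n 's/.*listener \([0-9]*\) .*/\1/p')
from=$(echo "$msg" | sed -n 's/.*VGroupId \(rsp-[a-z0-9]*\) .*/\1/p')
to=$(echo "$msg" | sed -n 's/.*changed to \(rsp-[a-z0-9]*\).*/\1/p')
echo "listener $listener: $from -> $to"
# prints: listener 80: rsp-bp1up5x12mwt6 -> rsp-bp1tsakxo59ww
```

Running it over both messages gives the full list of listener-to-server-group changes the pre-check expects.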
The errors are clearly related to load balancing, so I checked the SLB instance. Per the messages, VGroupId rsp-bp1up5x12mwt6 needs to be changed to rsp-bp1tsakxo59ww, and rsp-bp1cuciusq2zf to rsp-bp11d0mmv0cma, as follows:
In other words, following the event center's hint, we just need to switch the virtual server groups that the 80 and 443 listeners point to.
Switching a virtual server group
1. Click to modify the listener configuration for port 80 or 443
2. Click Next
3. Select the target server group
4. Keep clicking Next until the change is complete
Once that is done, click Upgrade on the Cloud Controller Manager again and the pre-check passes.
Summary
1. All four of the virtual server groups above were generated by the system
2. After a k8s upgrade the listeners switched back and I had to repeat the whole procedure, which was tedious, so I deleted the two leftover server groups (1 and 2 in the picture above) and will keep an eye on whether any problem shows up
2022-07-06 update: the website is intermittently unavailable. The problem has reappeared
After the recent ACK component upgrade, the service became intermittently unavailable: it would work for a while, then stop working. I opened a work order, and an Alibaba Cloud engineer and I began a two-day troubleshooting journey. The process is written up below as a reference for anyone who hits the same problem.
Problem 1: An elasticsearch pod cannot be deleted
Analysis:
I had installed Elasticsearch on ACK but found there were not enough resources, so I wanted to delete it. One pod, however, refused to go away:
pod/test-elasticsearch-ingest-0 1/1 Terminating 1 (6h41m ago) 6h42m
Solution:
After talking with the Alibaba Cloud engineer, the fix was to force-delete the pod:
kubectl delete pod <your-pod-name> -n <name-space> --force --grace-period=0
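Before force-deleting, it can help to list every pod stuck in Terminating. A sketch of the filter (the `kubectl get pods` output is simulated with a here-doc, and the second pod name is made up, so the pipeline runs standalone; on a real cluster, pipe `kubectl get pods` into the same awk):

```shell
# Sketch: keep only rows whose STATUS column (field 3) is "Terminating".
cat <<'EOF' | awk 'NR > 1 && $3 == "Terminating" {print $1}'
NAME                            READY   STATUS        RESTARTS        AGE
pod/test-elasticsearch-ingest-0 1/1     Terminating   1 (6h41m ago)   6h42m
pod/test-elasticsearch-master-0 1/1     Running       0               6h42m
EOF
# prints: pod/test-elasticsearch-ingest-0
```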
Problem 2: The website is intermittently unavailable
Analysis:
1. Checked whether the kubelet on each node (ECS) was normal by logging in and inspecting it - unresolved
systemctl status kubelet.service
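Instead of logging in to every ECS node, the health that each kubelet reports can be checked in one shot from `kubectl get nodes`. A sketch of the filter (the output is simulated with a here-doc and the node names are made up, so it runs standalone; on a real cluster, pipe `kubectl get nodes` instead):

```shell
# Sketch: print any node whose STATUS column (field 2) is not "Ready".
cat <<'EOF' | awk 'NR > 1 && $2 != "Ready" {print $1}'
NAME                      STATUS     ROLES    AGE    VERSION
cn-hangzhou.192.168.0.1   Ready      <none>   456d   v1.20.11-aliyun.1
cn-hangzhou.192.168.0.2   NotReady   <none>   456d   v1.20.11-aliyun.1
EOF
# prints: cn-hangzhou.192.168.0.2
```

A NotReady node usually means its kubelet is down or unreachable, which is exactly what `systemctl status kubelet.service` on the node would confirm.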
2. Switched the virtual server groups again, but the website was still intermittently unavailable - unresolved
3. The Alibaba Cloud engineer suspected the nginx-ingress-controller component version was too old: the new version had reached 1.21 while mine was still 0.44, so it had to be upgraded manually - unresolved
Plan for manually upgrading the nginx-ingress-controller version
Risk: this overwrites any changes the customer made to the nginx deployment outside the component management configuration page
Note: the webhook job pods in the kube-system namespace must be cleaned up in advance
Upgrade API: https://next.api.aliyun.com/api/CS/2015-12-15/UpgradeClusterAddons?lang=JAVA&params={%22body%22:[{%22component_name%22:%22nginx-ingress-controller%22,%22next_version%22:%22v1.2.0-aliyun.1%22}]}
During the process, the following two jobs were deleted:
kubectl get job -n kube-system
NAME COMPLETIONS DURATION AGE
ingress-nginx-admission-create 1/1 4s 456d
ingress-nginx-admission-patch 1/1 5s 456d
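The cleanup step above can be sketched as a pipeline that turns the job listing into delete commands. The `echo` keeps it a dry run (drop it to actually delete), and the job listing is simulated with a here-doc so the pipeline runs standalone; on a real cluster, pipe `kubectl get job -n kube-system` instead:

```shell
# Sketch: build a `kubectl delete job` command for every job in the listing.
cat <<'EOF' | awk 'NR > 1 {print $1}' | xargs -I{} echo kubectl delete job {} -n kube-system
NAME                             COMPLETIONS   DURATION   AGE
ingress-nginx-admission-create   1/1           4s         456d
ingress-nginx-admission-patch    1/1           5s         456d
EOF
```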
Then came feedback from a second engineer:
I just spoke with our back-end R&D team. The console interface for upgrading ingress 0.44 to 1.x is still under development. Once it is released, you will be able to upgrade directly from the component management page, and any deployment modifications will be shown in detail. If the ingress upgrade is not urgent on your side, we recommend waiting for that feature release.
4. Then I opened a new work order (I thought it was solved!!!). The Alibaba Cloud engineer asked me to recreate the pods of nginx-ingress-controller - unresolved
So I deleted those two pods. Afterwards I found two new virtual server groups in the load balancer, and the website could not be opened at all: before, it had opened maybe 50% of the time, but this time it collapsed completely. Sad as that was, I wondered whether there were other solutions, so I kept asking.
5. The engineer then suggested recreating the nginx-ingress-lb service: first change the service type to ClusterIP, then change it back to LoadBalancer - unresolved
6. Since that still did not work, the engineer concluded the problem had nothing to do with ingress or SLB and must be in my own service, and asked me for the error logs - solved
gateway-server log:
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Host is unreachable: /172.20.0.235:8902
space-service log:
removed ips(1) service: DEFAULT_GROUP@@space-service@@DEFAULT -> [{"instanceId":"172.20.0.247#8902#DEFAULT#DEFAULT_GROUP@@space-service","ip":"172.20.0.247","port":8902,"weight":1.0,"healthy":true,"enabled":true,"ephemeral":true,"clusterName":"DEFAULT","serviceName":"DEFAULT_GROUP@@space-service","metadata":{"preserved.register.source":"SPRING_CLOUD"},"instanceHeartBeatInterval":5000,"instanceHeartBeatTimeOut":15000,"ipDeleteTimeout":30000}]
2022-07-06 10:22:36.240 INFO 1 --- [.naming.updater] com.alibaba.nacos.client.naming : current ips:(1) service: DEFAULT_GROUP@@space-service@@DEFAULT -> [{"instanceId":"172.20.0.235#8902#DEFAULT#DEFAULT_GROUP@@space-
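The registered instance IPs can be pulled out of the Nacos client log lines to make the flapping between 172.20.0.247 and 172.20.0.235 visible. A sketch (the log line below is abridged from the one above):

```shell
# Sketch: extract every "ip":"x.x.x.x" field from a Nacos naming log line.
log='removed ips(1) service: DEFAULT_GROUP@@space-service@@DEFAULT -> [{"instanceId":"172.20.0.247#8902#DEFAULT#DEFAULT_GROUP@@space-service","ip":"172.20.0.247","port":8902}]'
echo "$log" | grep -o '"ip":"[0-9.]*"'
# prints: "ip":"172.20.0.247"
```

Running this over the whole log (e.g. `grep -o '"ip":"[0-9.]*"' space-service.log | sort | uniq -c`) shows how often each instance IP is registered and removed.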
That pointed to Nacos. On the Nacos console I saw that the IP registered for space-service kept changing, so I decided to delete the deployment corresponding to space-service in k8s and recreate it:
kubectl delete deploy testapi-seaurl-space-service -n seaurl
After the space-service deployment was recreated, Nacos was back to normal and my website was reachable again. Finally solved!