Background
As the core component responsible for service registration and discovery, the registry is an indispensable part of a microservice architecture. In terms of the CAP model, a registry can sacrifice some consistency (C): the service addresses seen by different nodes at the same moment are allowed to be briefly inconsistent. But it must guarantee availability (A), because once the registry becomes unavailable, or a service cannot connect to it, the nodes that depend on it cannot obtain service addresses, which can be a catastrophic blow to the entire system.
A Real Case
This article starts from a real case. A customer deployed many of its microservices in a Kubernetes cluster on Alibaba Cloud. The network card of one ECS instance failed; although the NIC recovered quickly, a much bigger problem followed: services in the region remained unavailable for a long time, and the business was damaged.
Let's take a look at how this problem chain formed:
- All Pods of CoreDNS, a core basic component of the Kubernetes cluster, were running on the faulty ECS node, and the low-version Kubernetes cluster lacked the NodeLocal DNSCache feature, so DNS resolution failed across the whole cluster.
- The customer's services used a defective client version (Nacos-client 1.4.1). The defect in this version is DNS-related: once a heartbeat request fails to resolve the domain name, the heartbeat thread stops permanently, and only a restart can recover it.
- This defective version was actually a known issue. Alibaba Cloud announced serious bugs in Nacos-client 1.4.1 back in May, but the customer's R&D team did not receive the notification and later used this version in the production environment.
Risks are interlinked and cannot be separated from each other.
The final result of the failure is that services could not call their downstream dependencies, availability dropped, and the business was damaged. The following figure illustrates the root-cause chain triggered by the client-side defect:
- A DNS exception occurred when the provider client renewed its heartbeat;
- The heartbeat thread failed to handle this DNS exception properly and exited unexpectedly;
- The registry's normal mechanism is: if the heartbeat is not renewed, the instance is automatically taken offline after 30 seconds. Since CoreDNS affected DNS resolution for the entire Kubernetes cluster, every Provider instance hit the same problem, and all instances of the service went offline;
- On the Consumer side, after receiving the pushed empty list, no downstream could be found, so the upstream callers (such as the gateway) also failed.
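The registry-side expiry mechanism in the chain above can be sketched as a toy Python model (an illustration of heartbeat TTL eviction, not the actual Nacos implementation): when a cluster-wide DNS failure stops every provider's heartbeat, the entire service list empties once the TTL elapses.

```python
HEARTBEAT_TTL = 30  # seconds without a heartbeat before an instance is evicted


class Registry:
    """Toy model of heartbeat-based instance expiry (not the real Nacos code)."""

    def __init__(self):
        self.instances = {}  # address -> timestamp of the last heartbeat

    def heartbeat(self, address, now):
        self.instances[address] = now

    def alive_instances(self, now):
        # Evict every instance whose heartbeat was not renewed within the TTL.
        self.instances = {a: t for a, t in self.instances.items()
                          if now - t <= HEARTBEAT_TTL}
        return sorted(self.instances)


registry = Registry()
registry.heartbeat("10.0.0.1:18084", now=0)
registry.heartbeat("10.0.0.2:18084", now=0)

# Both providers are healthy at t=10.
print(registry.alive_instances(now=10))  # ['10.0.0.1:18084', '10.0.0.2:18084']

# A cluster-wide DNS failure stops *all* heartbeats; at t=31 the list is empty,
# and every consumer receives an empty push.
print(registry.alive_instances(now=31))  # []
```

The point of the sketch is that the eviction logic is correct per instance, yet a shared failure mode (here, DNS) empties the whole list at once, which is exactly the situation the consumer must treat with suspicion.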
Looking back at the whole case, each risk in each link seems to have a small probability of occurring, but once it does, the impact is severe. High availability of service discovery is a very important part of a microservice system, yet it is a point we often ignore. It has always been an essential part of Alibaba's internal failure drills.
Design for Failure
Due to network jitter, CoreDNS exceptions, or the registry itself becoming unavailable for some reason, services are often deregistered in batches for a short time, even though the business services are actually fine. If our microservices can recognize this as an abnormal situation (a batch of instances flapping, or the address list suddenly becoming empty), they should adopt a conservative strategy and ignore the push. Otherwise a false push gives every consumer a "no provider" error, leaving all microservices unavailable for a long time and difficult to recover.
From the perspective of microservices, how can we break the problem chain above? The case looks like a problem caused by a low version of Nacos-client, but what if we use a registry such as ZooKeeper or Eureka? Can we pat our chests and say the above problems will not happen? The design-for-failure principle tells us that if the registry goes down, or our service cannot connect to the registry, we still need a way to keep service calls working and the online business running.
This article introduces the high-availability mechanisms in the service discovery process, and considers how to solve the above problems thoroughly at the service-framework level.
Analysis of High Availability Principle in Service Discovery Process
Service Discovery High Availability - Push Null Protection
Design for failure tells us that a service cannot fully trust the addresses pushed by the registry. When the registry pushes an empty address list, every call would fail with a "no provider" error, so we simply ignore that address change.
The Governance Center provides push-empty protection:
- Non-intrusive by default; supports the Spring Cloud and Dubbo framework versions released in the past five years
- Independent of the registry implementation; no client version upgrade required
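As a sketch of what push-empty protection does on the client side (a minimal illustration, not MSE's actual implementation): when protection is enabled and the registry pushes an empty list while a non-empty one is still held, the push is ignored and the last known-good list is kept.

```python
class PushEmptyProtectingListener:
    """Keeps the last known-good address list when the registry pushes an empty one.

    Illustrative sketch only; this class and its names are not the MSE Agent code.
    """

    def __init__(self, protection_enabled=True):
        self.protection_enabled = protection_enabled
        self.addresses = []

    def on_push(self, new_addresses):
        if self.protection_enabled and not new_addresses and self.addresses:
            # Suspicious push: every provider vanished at once. Keep the old
            # list so in-flight traffic still has addresses to try.
            return self.addresses
        self.addresses = list(new_addresses)
        return self.addresses


protected = PushEmptyProtectingListener()
protected.on_push(["10.0.0.1:18084", "10.0.0.2:18084"])
print(protected.on_push([]))  # old list kept: ['10.0.0.1:18084', '10.0.0.2:18084']

unprotected = PushEmptyProtectingListener(protection_enabled=False)
unprotected.on_push(["10.0.0.1:18084"])
print(unprotected.on_push([]))  # [] -> "no provider" errors on every call
```

The conservative choice is deliberate: if the providers really did all go offline, calls to the stale addresses fail anyway, so keeping the old list costs little; if the empty push was a false alarm, keeping it saves the whole service.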
Service Discovery High Availability - Outlier Instance Removal
Heartbeat renewal is the basic way the registry perceives instance availability. But in some cases, a live heartbeat is not equivalent to an available service.
There are still cases where the heartbeat is normal but the service is unavailable, for example:
- The request-handling thread pool is full
- An exception in a dependent RDS connection causes a large amount of slow SQL
- Some machines have high load due to a full disk or host resource contention
In these cases, the service cannot fully trust the addresses pushed by the registry either: some of the pushed addresses may point to providers with poor service quality. Therefore, the client needs to judge the availability and quality of each address from the results of its own calls, and selectively ignore specific bad addresses.
The Governance Center provides outlier instance removal:
- Non-intrusive by default; supports the Spring Cloud and Dubbo framework versions released in the past five years
- Independent of the registry implementation; no client version upgrade required
- Removal strategies based on anomaly detection: network exceptions, and network exceptions plus business exceptions (HTTP 5xx)
- Configurable exception threshold, QPS lower limit, and removal-ratio limit
- Removal event notification and DingTalk group alerts
Outlier instance removal complements heartbeat-based health checking by measuring service availability from the call-level exception characteristics of specific interfaces.
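The strategy described above can be sketched as a small client-side model (a toy illustration with made-up class and parameter names, not MSE's actual algorithm): an instance whose observed error rate crosses a threshold is removed, but only if it has seen enough calls (the QPS lower limit), and never beyond a fixed fraction of the address list (the removal-ratio limit).

```python
class OutlierDetector:
    """Toy client-side outlier removal; names and defaults are illustrative."""

    def __init__(self, error_rate_threshold=0.5, min_calls=10, max_remove_ratio=0.5):
        self.error_rate_threshold = error_rate_threshold
        self.min_calls = min_calls                # too few calls: no verdict
        self.max_remove_ratio = max_remove_ratio  # never remove more than this fraction
        self.stats = {}                           # address -> (total, failed)

    def record(self, address, ok):
        total, failed = self.stats.get(address, (0, 0))
        self.stats[address] = (total + 1, failed + (0 if ok else 1))

    def healthy_instances(self, addresses):
        suspects = []
        for addr in addresses:
            total, failed = self.stats.get(addr, (0, 0))
            if total >= self.min_calls and failed / total >= self.error_rate_threshold:
                suspects.append(addr)
        # Cap removals so a widespread problem cannot empty the whole list.
        max_removals = int(len(addresses) * self.max_remove_ratio)
        removed = set(suspects[:max_removals])
        return [a for a in addresses if a not in removed]


detector = OutlierDetector()
providers = ["10.0.0.1:18084", "10.0.0.2:18084"]
for _ in range(10):
    detector.record("10.0.0.1:18084", ok=False)  # e.g. HTTP 5xx responses
    detector.record("10.0.0.2:18084", ok=True)
print(detector.healthy_instances(providers))  # ['10.0.0.2:18084']
```

Note how the removal-ratio cap plays the same role as push-empty protection: even if every instance looks bad (for example, when the problem is actually on the consumer's side of the network), the client never talks itself into an empty address list.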
Hands-on Practice
Prerequisites
- A Kubernetes cluster has been created. See Creating a Kubernetes Managed Cluster [1].
- MSE Microservice Governance Professional Edition has been activated. See Activating MSE Microservice Governance [2].
Preparation
Enable MSE Microservice Governance
1. Activate the Professional Edition of Microservice Governance:
- Click Activate MSE Microservice Governance [3].
- For the Microservice Governance edition, select Professional Edition, select the service agreement, and then click Activate Now. For the billing details of Microservice Governance, see the price description [4].
2. Install the MSE microservice governance component:
- In the left navigation bar of the Container Service console [5], choose Market > App Catalog.
- On the App Catalog page, enter ack-mse-pilot in the search box, click the search icon, and then click the component.
- On the details page, select the cluster in which to install the component, and then click Create. After installation completes, the application mse-pilot-ack-mse-pilot appears in the mse-pilot namespace, indicating that the installation succeeded.
3. Enable microservice governance for the application:
- Log in to the MSE Governance Center console [6] .
- In the left navigation bar, select Governance Center > Kubernetes Cluster List .
- On the Kubernetes Clusters page, search for the target cluster, click the search icon, and then click Manage in the Actions column of the target cluster.
- In the namespace list on the cluster details page, click Enable Governance in the Actions column of the target namespace.
- In the Enable Microservice Governance dialog box, click OK.
Deploy the Demo application
- In the left navigation bar of the Container Service console [5], click Clusters.
- On the Clusters page, click the name of the target cluster.
- In the left navigation bar of the cluster management page, choose Workloads > Deployments.
- On the Deployments page, select the namespace, and then click Create from YAML.
- Configure the template, and then click Create. The example in this article deploys sc-consumer, sc-consumer-empty, and sc-provider, using open-source Nacos.
Deploy the sample application (Spring Cloud)
YAML:
# sc-consumer with push-empty protection enabled
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sc-consumer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sc-consumer
  template:
    metadata:
      annotations:
        msePilotCreateAppName: sc-consumer
      labels:
        app: sc-consumer
    spec:
      containers:
      - env:
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        - name: spring.cloud.nacos.discovery.server-addr
          value: nacos-server:8848
        - name: profiler.micro.service.registry.empty.push.reject.enable
          value: "true"
        image: registry.cn-hangzhou.aliyuncs.com/mse-demo-hz/demo:sc-consumer-0.1
        imagePullPolicy: Always
        name: sc-consumer
        ports:
        - containerPort: 18091
        livenessProbe:
          tcpSocket:
            port: 18091
          initialDelaySeconds: 10
          periodSeconds: 30
# sc-consumer-empty without push-empty protection
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sc-consumer-empty
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sc-consumer-empty
  template:
    metadata:
      annotations:
        msePilotCreateAppName: sc-consumer-empty
      labels:
        app: sc-consumer-empty
    spec:
      containers:
      - env:
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        - name: spring.cloud.nacos.discovery.server-addr
          value: nacos-server:8848
        image: registry.cn-hangzhou.aliyuncs.com/mse-demo-hz/demo:sc-consumer-0.1
        imagePullPolicy: Always
        name: sc-consumer-empty
        ports:
        - containerPort: 18091
        livenessProbe:
          tcpSocket:
            port: 18091
          initialDelaySeconds: 10
          periodSeconds: 30
# sc-provider
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sc-provider
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sc-provider
  strategy: {}
  template:
    metadata:
      annotations:
        msePilotCreateAppName: sc-provider
      labels:
        app: sc-provider
    spec:
      containers:
      - env:
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        - name: spring.cloud.nacos.discovery.server-addr
          value: nacos-server:8848
        image: registry.cn-hangzhou.aliyuncs.com/mse-demo-hz/demo:sc-provider-0.3
        imagePullPolicy: Always
        name: sc-provider
        ports:
        - containerPort: 18084
        livenessProbe:
          tcpSocket:
            port: 18084
          initialDelaySeconds: 10
          periodSeconds: 30
# Nacos Server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nacos-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nacos-server
  template:
    metadata:
      labels:
        app: nacos-server
    spec:
      containers:
      - env:
        - name: MODE
          value: standalone
        image: nacos/nacos-server:latest
        imagePullPolicy: Always
        name: nacos-server
      dnsPolicy: ClusterFirst
      restartPolicy: Always
# Nacos Server Service configuration
---
apiVersion: v1
kind: Service
metadata:
  name: nacos-server
spec:
  ports:
  - port: 8848
    protocol: TCP
    targetPort: 8848
  selector:
    app: nacos-server
  type: ClusterIP
We only need to add the environment variable profiler.micro.service.registry.empty.push.reject.enable=true to the Consumer to enable the registry's push-empty protection. No registry client upgrade is needed, and it is independent of the registry implementation: MSE-hosted and self-built Nacos, Eureka, and ZooKeeper are all supported.
Add an SLB to each Consumer application for public network access.
Below, {sc-consumer-empty} denotes the public SLB address of the sc-consumer-empty application, and {sc-consumer} denotes the public SLB address of the sc-consumer application.
Application scenarios
Let's practice the following scenarios through the Demo prepared above.
- Write a test script
vi curl.sh

while :
do
  result=`curl $1 -s`
  if [[ "$result" == *"500"* ]]; then
    echo `date +%F-%T` $result
  else
    echo `date +%F-%T` $result
  fi
  sleep 0.1
done
- To test, open two terminals and run the script against each address; the output looks like this:
% sh curl.sh {sc-consumer-empty}:18091/user/rest
2022-01-19-11:58:12 Hello from [18084]10.116.0.142!
2022-01-19-11:58:12 Hello from [18084]10.116.0.142!
2022-01-19-11:58:12 Hello from [18084]10.116.0.142!
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!

% sh curl.sh {sc-consumer}:18091/user/rest
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!
2022-01-19-11:58:14 Hello from [18084]10.116.0.142!
2022-01-19-11:58:14 Hello from [18084]10.116.0.142!
2022-01-19-11:58:14 Hello from [18084]10.116.0.142!
Keep the script running the whole time, and observe the corresponding views in the MSE console.
- Scale the coredns component down to 0 replicas to simulate a DNS resolution failure.
You will see that the instances lose their connection with Nacos and the service list becomes empty.
- Simulate DNS recovery by scaling coredns back up to 2 replicas.
Result Verification
While keeping business traffic flowing throughout the above process, we observed that the sc-consumer-empty service reported a large number of continuous errors:
2022-01-19-12:02:37 {"timestamp":"2022-01-19T04:02:37.597+0000","status":500,"error":"Internal Server Error","message":"com.netflix.client.ClientException: Load balancer does not have available server for client: mse-service-provider","path":"/user/feign"}
2022-01-19-12:02:37 {"timestamp":"2022-01-19T04:02:37.799+0000","status":500,"error":"Internal Server Error","message":"com.netflix.client.ClientException: Load balancer does not have available server for client: mse-service-provider","path":"/user/feign"}
2022-01-19-12:02:37 {"timestamp":"2022-01-19T04:02:37.993+0000","status":500,"error":"Internal Server Error","message":"com.netflix.client.ClientException: Load balancer does not have available server for client: mse-service-provider","path":"/user/feign"}
In contrast, the sc-consumer application reported no errors during the whole process.
- sc-consumer-empty returned to normal only after the Provider was restarted.
Follow-up
After push-empty protection is triggered, we report events and alerts to the DingTalk group. We also recommend using it together with outlier instance removal: push-empty protection may cause the Consumer to hold stale provider addresses, and when a provider address becomes invalid, outlier instance removal can logically isolate it and keep service availability high.
Conclusion
Keeping services on the cloud always on is the goal MSE has been pursuing. Starting from the design-for-failure thinking behind service discovery high availability, this article used MSE's service governance capabilities to quickly build a demonstration of those capabilities, simulated the impact of unexpected service-discovery exceptions in production and how to prevent them, and showed how a simple open-source microservice application can achieve high availability for service discovery.
Related Links
[1] Create a Kubernetes managed cluster
https://help.aliyun.com/document_detail/95108.htm#task-skz-qwk-qfb
[2] Open MSE Microservice Governance
https://help.aliyun.com/document_detail/347625.htm#task-2140253
[3] Open MSE Microservice Governance
https://common-buy.aliyun.com/?commodityCode=mse_basic_public_cn
[4] Price description
https://help.aliyun.com/document_detail/170443.htm#concept-2519524
[5] Container Service Console
https://cs.console.aliyun.com
[6] MSE Governance Center Console
https://mse.console.aliyun.com