Background
As the core component responsible for service registration and discovery, the registry is an indispensable part of a microservice architecture. In terms of the CAP model, a registry can sacrifice some consistency (C): the service addresses seen by different nodes at the same moment are allowed to be briefly inconsistent. But it must guarantee availability (A), because once the registry becomes unavailable, or a service cannot connect to it, the nodes that depend on it cannot obtain service addresses, which can be a catastrophic blow to the entire system.
A Real Case
This article starts from a real case. A customer deployed many of its microservices in a Kubernetes cluster on Alibaba Cloud. The network card of one ECS instance failed; although the NIC recovered quickly, a much bigger problem followed: services in the region remained unavailable for a long time, and the business was damaged.
Let's take a look at how this problem chain formed:
- All Pods of CoreDNS, a core basic component of the Kubernetes cluster, were running on the faulty ECS node, and the low-version Kubernetes cluster lacked the NodeLocal DNSCache feature, so DNS resolution failed across the whole cluster.
- The customer's services used a defective client version (Nacos-client 1.4.1). The defect in this version is DNS-related: once a heartbeat request fails to resolve the domain name, the heartbeat thread stops permanently, and only a restart can recover it.
- This defective version was actually a known issue. Alibaba Cloud announced serious bugs in Nacos-client 1.4.1 back in May, but the customer's R&D team did not receive the notification and later used this version in the production environment.
Risks are interlinked and cannot be separated from each other.
The final result of the failure is that services could not call their downstream dependencies, availability dropped, and the business was damaged. The following figure illustrates the root-cause chain triggered by the client-side defect:
- A DNS exception occurred when the provider client renewed its heartbeat;
- The heartbeat thread failed to handle this DNS exception properly and exited unexpectedly;
- The registry's normal mechanism is: if the heartbeat is not renewed, the instance is automatically taken offline after 30 seconds. Since CoreDNS affected DNS resolution for the entire Kubernetes cluster, every Provider instance hit the same problem, and all instances of the service went offline;
- On the Consumer side, after receiving the pushed empty list, no downstream could be found, so the upstream callers (such as the gateway) also failed.
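The registry-side expiry mechanism in the chain above can be sketched as a toy Python model (an illustration of heartbeat TTL eviction, not the actual Nacos implementation): when a cluster-wide DNS failure stops every provider's heartbeat, the entire service list empties once the TTL elapses.

```python
HEARTBEAT_TTL = 30  # seconds without a heartbeat before an instance is evicted


class Registry:
    """Toy model of heartbeat-based instance expiry (not the real Nacos code)."""

    def __init__(self):
        self.instances = {}  # address -> timestamp of the last heartbeat

    def heartbeat(self, address, now):
        self.instances[address] = now

    def alive_instances(self, now):
        # Evict every instance whose heartbeat was not renewed within the TTL.
        self.instances = {a: t for a, t in self.instances.items()
                          if now - t <= HEARTBEAT_TTL}
        return sorted(self.instances)


registry = Registry()
registry.heartbeat("10.0.0.1:18084", now=0)
registry.heartbeat("10.0.0.2:18084", now=0)

# Both providers are healthy at t=10.
print(registry.alive_instances(now=10))  # ['10.0.0.1:18084', '10.0.0.2:18084']

# A cluster-wide DNS failure stops *all* heartbeats; at t=31 the list is empty,
# and every consumer receives an empty push.
print(registry.alive_instances(now=31))  # []
```

The point of the sketch is that the eviction logic is correct per instance, yet a shared failure mode (here, DNS) empties the whole list at once, which is exactly the situation the consumer must treat with suspicion.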
Looking back at the whole case, each risk in each link seems to have a small probability of occurring, but once it does, the impact is severe. High availability of service discovery is a very important part of a microservice system, yet it is a point we often ignore. It has always been an essential part of Alibaba's internal failure drills.
Design for Failure
Due to network jitter, CoreDNS exceptions, or the registry itself becoming unavailable for some reason, services are often deregistered in batches for a short time, even though the business services are actually fine. If our microservices can recognize this as an abnormal situation (a batch of instances flapping, or the address list suddenly becoming empty), they should adopt a conservative strategy and ignore the push. Otherwise a false push gives every consumer a "no provider" error, leaving all microservices unavailable for a long time and difficult to recover.
From the perspective of microservices, how can we break the problem chain above? The case looks like a problem caused by a low version of Nacos-client, but what if we use a registry such as ZooKeeper or Eureka? Can we pat our chests and say the above problems will not happen? The design-for-failure principle tells us that if the registry goes down, or our service cannot connect to the registry, we still need a way to keep service calls working and the online business running.
This article introduces the high-availability mechanisms in the service discovery process, and considers how to solve the above problems thoroughly at the service-framework level.
Analysis of High Availability Principle in Service Discovery Process
Service Discovery High Availability - Push Null Protection
Design for failure tells us that a service cannot fully trust the addresses pushed by the registry. When the registry pushes an empty address list, every call would fail with a "no provider" error, so we simply ignore that address change.
The Governance Center provides push-empty protection:
- Non-intrusive by default; supports the Spring Cloud and Dubbo framework versions released in the past five years
- Independent of the registry implementation; no client version upgrade required
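As a sketch of what push-empty protection does on the client side (a minimal illustration, not MSE's actual implementation): when protection is enabled and the registry pushes an empty list while a non-empty one is still held, the push is ignored and the last known-good list is kept.

```python
class PushEmptyProtectingListener:
    """Keeps the last known-good address list when the registry pushes an empty one.

    Illustrative sketch only; this class and its names are not the MSE Agent code.
    """

    def __init__(self, protection_enabled=True):
        self.protection_enabled = protection_enabled
        self.addresses = []

    def on_push(self, new_addresses):
        if self.protection_enabled and not new_addresses and self.addresses:
            # Suspicious push: every provider vanished at once. Keep the old
            # list so in-flight traffic still has addresses to try.
            return self.addresses
        self.addresses = list(new_addresses)
        return self.addresses


protected = PushEmptyProtectingListener()
protected.on_push(["10.0.0.1:18084", "10.0.0.2:18084"])
print(protected.on_push([]))  # old list kept: ['10.0.0.1:18084', '10.0.0.2:18084']

unprotected = PushEmptyProtectingListener(protection_enabled=False)
unprotected.on_push(["10.0.0.1:18084"])
print(unprotected.on_push([]))  # [] -> "no provider" errors on every call
```

The conservative choice is deliberate: if the providers really did all go offline, calls to the stale addresses fail anyway, so keeping the old list costs little; if the empty push was a false alarm, keeping it saves the whole service.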
Service Discovery High Availability - Outlier Instance Removal
Heartbeat renewal is the basic way the registry perceives instance availability. But in some cases, a live heartbeat is not equivalent to an available service.
There are still cases where the heartbeat is normal but the service is unavailable, for example:
- The request-handling thread pool is full
- An exception in a dependent RDS connection causes a large amount of slow SQL
- Some machines have high load due to a full disk or host resource contention
In these cases, the service cannot fully trust the addresses pushed by the registry either: some of the pushed addresses may point to providers with poor service quality. Therefore, the client needs to judge the availability and quality of each address from the results of its own calls, and selectively ignore specific bad addresses.
The Governance Center provides outlier instance removal:
- Non-intrusive by default; supports the Spring Cloud and Dubbo framework versions released in the past five years
- Independent of the registry implementation; no client version upgrade required
- Removal strategies based on anomaly detection: network exceptions, and network exceptions plus business exceptions (HTTP 5xx)
- Configurable exception threshold, QPS lower limit, and removal-ratio limit
- Removal event notification and DingTalk group alerts
Outlier instance removal complements heartbeat-based health checking by measuring service availability from the call-level exception characteristics of specific interfaces.
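The strategy described above can be sketched as a small client-side model (a toy illustration with made-up class and parameter names, not MSE's actual algorithm): an instance whose observed error rate crosses a threshold is removed, but only if it has seen enough calls (the QPS lower limit), and never beyond a fixed fraction of the address list (the removal-ratio limit).

```python
class OutlierDetector:
    """Toy client-side outlier removal; names and defaults are illustrative."""

    def __init__(self, error_rate_threshold=0.5, min_calls=10, max_remove_ratio=0.5):
        self.error_rate_threshold = error_rate_threshold
        self.min_calls = min_calls                # too few calls: no verdict
        self.max_remove_ratio = max_remove_ratio  # never remove more than this fraction
        self.stats = {}                           # address -> (total, failed)

    def record(self, address, ok):
        total, failed = self.stats.get(address, (0, 0))
        self.stats[address] = (total + 1, failed + (0 if ok else 1))

    def healthy_instances(self, addresses):
        suspects = []
        for addr in addresses:
            total, failed = self.stats.get(addr, (0, 0))
            if total >= self.min_calls and failed / total >= self.error_rate_threshold:
                suspects.append(addr)
        # Cap removals so a widespread problem cannot empty the whole list.
        max_removals = int(len(addresses) * self.max_remove_ratio)
        removed = set(suspects[:max_removals])
        return [a for a in addresses if a not in removed]


detector = OutlierDetector()
providers = ["10.0.0.1:18084", "10.0.0.2:18084"]
for _ in range(10):
    detector.record("10.0.0.1:18084", ok=False)  # e.g. HTTP 5xx responses
    detector.record("10.0.0.2:18084", ok=True)
print(detector.healthy_instances(providers))  # ['10.0.0.2:18084']
```

Note how the removal-ratio cap plays the same role as push-empty protection: even if every instance looks bad (for example, when the problem is actually on the consumer's side of the network), the client never talks itself into an empty address list.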
Hands-on Practice
Prerequisites
- A Kubernetes cluster has been created. See Creating a Kubernetes Managed Cluster [1].
- MSE Microservice Governance Professional Edition has been activated. See Activating MSE Microservice Governance [2].
Preparation
Enable MSE Microservice Governance
1. Activate the Professional Edition of Microservice Governance:
- Click Activate MSE Microservice Governance [3].
- For the Microservice Governance edition, select Professional Edition, select the service agreement, and then click Activate Now. For the billing details of Microservice Governance, see the price description [4].
2. Install the MSE microservice governance component:
- In the left navigation bar of the Container Service console [5], choose Market > App Catalog.
- On the App Catalog page, enter ack-mse-pilot in the search box, click the search icon, and then click the component.
- On the details page, select the cluster in which to install the component, and then click Create. After installation completes, the application mse-pilot-ack-mse-pilot appears in the mse-pilot namespace, indicating that the installation succeeded.
3. Enable microservice governance for the application:
- Log in to the MSE Governance Center console [6] .
- In the left navigation bar, select Governance Center > Kubernetes Cluster List .
- On the Kubernetes Clusters page, search for the target cluster, click the search icon, and then click Manage in the Actions column of the target cluster.
- In the namespace list on the cluster details page, click Enable Governance in the Actions column of the target namespace.
- In the Enable Microservice Governance dialog box, click OK.
Deploy the Demo application
- In the left navigation bar of the Container Service console [5], click Clusters.
- On the Clusters page, click the name of the target cluster.
- In the left navigation bar of the cluster management page, choose Workloads > Deployments.
- On the Deployments page, select the namespace, and then click Create from YAML.
- Configure the template, and then click Create. The example in this article deploys sc-consumer, sc-consumer-empty, and sc-provider, using open-source Nacos.
Deploy the sample application (Spring Cloud)
YAML:
# sc-consumer with push-empty protection enabled
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sc-consumer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sc-consumer
  template:
    metadata:
      annotations:
        msePilotCreateAppName: sc-consumer
      labels:
        app: sc-consumer
    spec:
      containers:
      - env:
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        - name: spring.cloud.nacos.discovery.server-addr
          value: nacos-server:8848
        - name: profiler.micro.service.registry.empty.push.reject.enable
          value: "true"
        image: registry.cn-hangzhou.aliyuncs.com/mse-demo-hz/demo:sc-consumer-0.1
        imagePullPolicy: Always
        name: sc-consumer
        ports:
        - containerPort: 18091
        livenessProbe:
          tcpSocket:
            port: 18091
          initialDelaySeconds: 10
          periodSeconds: 30
# sc-consumer-empty without push-empty protection
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sc-consumer-empty
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sc-consumer-empty
  template:
    metadata:
      annotations:
        msePilotCreateAppName: sc-consumer-empty
      labels:
        app: sc-consumer-empty
    spec:
      containers:
      - env:
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        - name: spring.cloud.nacos.discovery.server-addr
          value: nacos-server:8848
        image: registry.cn-hangzhou.aliyuncs.com/mse-demo-hz/demo:sc-consumer-0.1
        imagePullPolicy: Always
        name: sc-consumer-empty
        ports:
        - containerPort: 18091
        livenessProbe:
          tcpSocket:
            port: 18091
          initialDelaySeconds: 10
          periodSeconds: 30
# sc-provider
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sc-provider
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sc-provider
  strategy: {}
  template:
    metadata:
      annotations:
        msePilotCreateAppName: sc-provider
      labels:
        app: sc-provider
    spec:
      containers:
      - env:
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        - name: spring.cloud.nacos.discovery.server-addr
          value: nacos-server:8848
        image: registry.cn-hangzhou.aliyuncs.com/mse-demo-hz/demo:sc-provider-0.3
        imagePullPolicy: Always
        name: sc-provider
        ports:
        - containerPort: 18084
        livenessProbe:
          tcpSocket:
            port: 18084
          initialDelaySeconds: 10
          periodSeconds: 30
# Nacos Server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nacos-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nacos-server
  template:
    metadata:
      labels:
        app: nacos-server
    spec:
      containers:
      - env:
        - name: MODE
          value: standalone
        image: nacos/nacos-server:latest
        imagePullPolicy: Always
        name: nacos-server
      dnsPolicy: ClusterFirst
      restartPolicy: Always
# Nacos Server Service configuration
---
apiVersion: v1
kind: Service
metadata:
  name: nacos-server
spec:
  ports:
  - port: 8848
    protocol: TCP
    targetPort: 8848
  selector:
    app: nacos-server
  type: ClusterIP
We only need to add the environment variable profiler.micro.service.registry.empty.push.reject.enable=true to the Consumer to enable the registry's push-empty protection. No registry client upgrade is needed, and it is independent of the registry implementation: MSE-hosted and self-built Nacos, Eureka, and ZooKeeper are all supported.
Add an SLB to each Consumer application for public network access.
Below, {sc-consumer-empty} denotes the public SLB address of the sc-consumer-empty application, and {sc-consumer} denotes the public SLB address of the sc-consumer application.
Application scenarios
Let's practice the following scenarios through the Demo prepared above.
- Write a test script
vi curl.sh

while :
do
  result=`curl $1 -s`
  if [[ "$result" == *"500"* ]]; then
    echo `date +%F-%T` $result
  else
    echo `date +%F-%T` $result
  fi
  sleep 0.1
done
- To test, open two terminals and run the script against each address; the output looks like this:
% sh curl.sh {sc-consumer-empty}:18091/user/rest
2022-01-19-11:58:12 Hello from [18084]10.116.0.142!
2022-01-19-11:58:12 Hello from [18084]10.116.0.142!
2022-01-19-11:58:12 Hello from [18084]10.116.0.142!
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!

% sh curl.sh {sc-consumer}:18091/user/rest
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!
2022-01-19-11:58:14 Hello from [18084]10.116.0.142!
2022-01-19-11:58:14 Hello from [18084]10.116.0.142!
2022-01-19-11:58:14 Hello from [18084]10.116.0.142!
Keep the script running the whole time, and observe the corresponding views in the MSE console.
- Scale the coredns component down to 0 replicas to simulate a DNS resolution failure.
You will see that the instances lose their connection with Nacos and the service list becomes empty.
- Simulate DNS recovery by scaling coredns back up to 2 replicas.
Result Verification
While keeping business traffic flowing throughout the above process, we observed that the sc-consumer-empty service reported a large number of continuous errors:
2022-01-19-12:02:37 {"timestamp":"2022-01-19T04:02:37.597+0000","status":500,"error":"Internal Server Error","message":"com.netflix.client.ClientException: Load balancer does not have available server for client: mse-service-provider","path":"/user/feign"}
2022-01-19-12:02:37 {"timestamp":"2022-01-19T04:02:37.799+0000","status":500,"error":"Internal Server Error","message":"com.netflix.client.ClientException: Load balancer does not have available server for client: mse-service-provider","path":"/user/feign"}
2022-01-19-12:02:37 {"timestamp":"2022-01-19T04:02:37.993+0000","status":500,"error":"Internal Server Error","message":"com.netflix.client.ClientException: Load balancer does not have available server for client: mse-service-provider","path":"/user/feign"}
In contrast, the sc-consumer application reported no errors during the whole process.
- sc-consumer-empty returned to normal only after the Provider was restarted.
Follow-up
After push-empty protection is triggered, we report events and alerts to the DingTalk group. We also recommend using it together with outlier instance removal: push-empty protection may cause the Consumer to hold stale provider addresses, and when a provider address becomes invalid, outlier instance removal can logically isolate it and keep service availability high.
Conclusion
Keeping services on the cloud always on is the goal MSE has been pursuing. Starting from the design-for-failure thinking behind service discovery high availability, this article used MSE's service governance capabilities to quickly build a demonstration of those capabilities, simulated the impact of unexpected service-discovery exceptions in production and how to prevent them, and showed how a simple open-source microservice application can achieve high availability for service discovery.
Related Links
[1] Create a Kubernetes managed cluster
https://help.aliyun.com/document_detail/95108.htm#task-skz-qwk-qfb
[2] Open MSE Microservice Governance
https://help.aliyun.com/document_detail/347625.htm#task-2140253
[3] Open MSE Microservice Governance
https://common-buy.aliyun.com/?commodityCode=mse_basic_public_cn
[4] Price description
https://help.aliyun.com/document_detail/170443.htm#concept-2519524
[5] Container Service Console
https://cs.console.aliyun.com
[6] MSE Governance Center Console
https://mse.console.aliyun.com