MSE infrastructure team, responsible for the high-availability architecture of the MSE engine
*This article is the first in a series on high-availability best practices for microservices. The series is continuously updated; stay tuned.*
Introduction
Before diving into the main content, let me share a real case.
A customer had deployed many of its microservices in a K8s cluster on Alibaba Cloud. One day, the network card of one node failed, and a service eventually became unavailable: it could not call its downstream, and the business was impacted. Let's look at how this chain of problems formed:
- All the Pods of CoreDNS, the core base component of the K8s cluster, were running on the faulty ECS node. They were not spread across nodes, so DNS resolution broke for the whole cluster (see the anti-affinity sketch after this list).
- The customer's services were found to be using a defective client version (nacos-client 1.4.1). This version's defect is DNS-related: once a heartbeat request fails to resolve the domain name, the heartbeat thread stops for good, and only a restart can recover it.
- This defect was actually a known issue. Alibaba Cloud had announced the serious bug in nacos-client 1.4.1 back in May, but the customer's R&D team did not receive the notification and shipped this version to production.
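The first link in this chain is avoidable with standard Kubernetes scheduling constraints. Below is a minimal sketch, assuming the stock CoreDNS addon in kube-system (the Deployment and label names follow the standard addon, and the image tag is illustrative): a hard pod anti-affinity rule keeps CoreDNS replicas on different nodes, so a single node failure cannot take out cluster DNS.

```yaml
# Sketch: spread CoreDNS replicas across nodes.
# Names assume the stock CoreDNS addon; verify against your cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: never co-locate two CoreDNS Pods on one node.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  k8s-app: kube-dns
              topologyKey: kubernetes.io/hostname
      containers:
        - name: coredns
          image: registry.k8s.io/coredns/coredns:v1.9.3  # illustrative tag
```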
These risks were chained together; remove any one link, and the failure would not have happened.
The final failure was that the service could not call its downstream: availability dropped and the business was impacted. The following figure illustrates how the client-side defect led to the problem:
- A DNS exception occurs when the Provider client renews its heartbeat;
- The heartbeat thread fails to handle this DNS exception, causing the thread to exit unexpectedly;
- The registry's normal mechanism is that an instance whose heartbeat is not renewed is automatically taken offline after 30 seconds. Since CoreDNS affected DNS resolution for the entire K8s cluster, all Provider instances hit the same problem, and every instance of the service went offline;
- On the Consumer side, after receiving the pushed empty list, no downstream can be found, so the upstream that calls it (such as the gateway) starts throwing exceptions.
Looking back at the whole case, each risk in each link seems to have a very small probability of occurring, but once they occur together, the impact is severe.
Therefore, this article discusses how to design high-availability solutions in the microservices field, and what concrete solutions exist for service discovery and configuration management.
Microservice High Availability Solution
First, there is a fact that cannot be changed: no system is 100% problem-free, so high-availability architecture is designed in the face of failure (risk).
Risks are ubiquitous; many have a very small probability, yet they can never be completely avoided.
What are the possible risks in a microservice system?
This is only a partial list, but in Alibaba's more than ten years of microservice practice, all of these problems have been encountered, some more than once. Even with so many pitfalls, we can still reliably guarantee the stability of the Double Eleven shopping festival, and that relies on a mature, battle-tested high-availability system.
We can't completely avoid the risk, but we can control it. This is the essence of high availability.
What are the strategies for controlling risk?
The registry and configuration center sit on the core path of a microservice system, and any jitter there can significantly affect the stability of the entire system.
Strategy 1: Narrow the scope of risk impact
Cluster high availability
Multiple replicas: deploy instances on no fewer than three nodes.
Multiple Availability Zones (same-city disaster recovery): place the cluster's nodes in different Availability Zones (AZs). When a node or an AZ fails, only part of the cluster is affected; if traffic can be switched quickly and the faulty node automatically isolated from the cluster, the impact is minimized (a spread-constraint sketch follows).
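As an illustration of AZ-level spreading, here is a hedged sketch using the standard Kubernetes topologySpreadConstraints field; the workload name and image are placeholders, and on managed clusters the nodes carry the standard topology.kubernetes.io/zone label:

```yaml
# Sketch: spread a registry workload's Pods evenly across AZs.
# Field names are standard Kubernetes; the app name is illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: registry-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: registry-demo
  template:
    metadata:
      labels:
        app: registry-demo
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                        # AZs may differ by at most one Pod
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule  # hard requirement
          labelSelector:
            matchLabels:
              app: registry-demo
      containers:
        - name: registry-demo
          image: nginx:1.25                 # placeholder image
```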
Reduce upstream and downstream dependencies
System design should reduce upstream and downstream dependencies as much as possible: the more dependencies, the more likely a problem in a dependent system makes the overall service (usually a functional block) unavailable. Where a dependency is necessary, it must itself have a highly available architecture.
Grayscale changes
New versions are released iteratively. Start with a grayscale of the smallest possible range, stage it by user and region, and gradually expand the scope of the change. If a problem occurs, it only affects the grayscale range, shrinking the blast radius.
Services can be degraded, rate-limited, and circuit-broken
- When the registry is under abnormal load, degrade the heartbeat renewal interval, turn off some non-core functions, and so on;
- Rate-limit abnormal traffic down to within capacity, so that at least part of the traffic stays serviceable;
- On the client side, fall back to the local cache in the event of an exception (push-empty protection is also a degradation scheme), temporarily sacrificing the freshness of the instance list to preserve availability (see the property sketch after this list).
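As a concrete illustration of the client-side cache fallback, a hedged sketch assuming Spring Cloud Alibaba's Nacos discovery starter; the property name follows NacosDiscoveryProperties, so verify it against your client version:

```properties
# Assumed Spring Cloud Alibaba property: load the last persisted instance
# list from the local cache at startup, so the client can still address
# providers while the registry is unreachable.
spring.cloud.nacos.discovery.naming-load-cache-at-start=true
```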
As shown in the figure, the microservice engine MSE adopts a same-city dual-active, three-node architecture: upstream and downstream dependencies are simplified, and each of them is itself highly available. The nodes of a multi-node MSE instance are automatically distributed across different availability zones by the underlying scheduler, forming a multi-replica cluster.
Strategy 2: Shorten the duration of risk occurrence
The core idea: identify problems as early as possible and handle them as quickly as possible.
Identify: observability
For example, build per-instance monitoring and alerting on top of Prometheus.
Going further, build stronger observability at the product level: an overall dashboard, alert convergence/grading (to identify problems), assurance for major customers, and the definition of service levels.
The service level currently offered by the MSE registry and configuration center is 99.95%, and it is moving toward four nines (99.99%).
Handle quickly: emergency response
An emergency response mechanism must be established: quickly and effectively notify exactly the right group of people, execute contingency plans rapidly (be aware of the efficiency gap between console-based and command-line operations), and hold emergency drills regularly.
A contingency plan should be executable with confidence by anyone, whether or not they are familiar with the system. Behind that lies a well-accumulated base of supporting technology.
Strategy 3: Reduce how often risk is touched
Reduce unnecessary releases: for example, improve iteration efficiency and do not release casually, and freeze changes during important events and big promotions.
From a probability standpoint, no matter how low the per-attempt risk is, repeated attempts drive the cumulative probability of the risk occurring toward 1. For example, if each change carries a 1% chance of failure, then after 100 independent changes the chance that at least one fails is 1 - 0.99^100, roughly 63%.
Strategy 4: Reduce the probability of risk occurrence
Architecture upgrades, improved design
Nacos 2.0 not only improves performance, but also upgrades architecture:
- Upgrade the data storage structure: partition fault tolerance is refined from service-level to instance-level granularity (sidestepping service-wide hangs caused by service-level data inconsistency);
- Upgrade the connection model to long-lived connections, reducing dependence on threads, connections, and DNS.
Identify risks early
- This "advance" means that potential risks are exposed as much as possible in the design, development, and testing stages;
- Predict where the capacity risk level is through capacity assessment in advance;
- Through regular failure drills, upstream and downstream environmental risks are discovered in advance, and the system robustness is verified.
As shown in the figure, Alibaba's high-availability program continuously runs stress tests and drills to verify the robustness and elasticity of the system, observes and tracks system problems, and validates the practicality of plans such as rate limiting and degradation.
Service Discovery High Availability Solution
Service discovery includes service consumers (Consumer) and service providers (Provider).
Consumer-side high availability
Disaster recovery on the consumer side is achieved through push-empty protection and service degradation.
Push-empty protection
To deal with cases like the one described at the beginning, a pushed empty instance list is automatically degraded to the locally cached data.
The service consumer (Consumer) subscribes to the instance list of the service provider (Provider) from the registry.
In unexpected situations (for example, an availability zone loses network connectivity so Providers cannot report heartbeats) or when the registry itself hits an unexpected exception (during resizing, restart, or upgrade), subscriptions can go wrong, affecting the availability of service consumers (Consumer).
Without push-empty protection:
- The Provider fails to register (network issues, SDK bugs, etc.)
- The registry judges that the Provider's heartbeat has expired
- The Consumer subscribes to an empty list; the business is interrupted and reports errors
With push-empty protection enabled:
- Same as above;
- The Consumer is pushed the empty list, push-empty protection takes effect and discards the change, keeping the business services available
How to enable it
Enabling it is straightforward: it is supported by the open-source client nacos-client 1.4.2 and above.
Configuration items:
- Spring Cloud Alibaba: add the following Spring configuration item:
  `spring.cloud.nacos.discovery.namingPushEmptyProtection=true`
- Dubbo: add the registryUrl parameter:
  `namingPushEmptyProtection=true`
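For Dubbo, a hedged sketch of where such a registryUrl parameter typically lands (the address is the placeholder endpoint used in the demo below; whether query parameters pass through depends on your Dubbo version, so treat this as an assumption):

```properties
# Placeholder address; the query parameter is assumed to be forwarded to nacos-client.
dubbo.registry.address=nacos://mse-xxx-nacos-ans.mse.aliyuncs.com:8848?namingPushEmptyProtection=true
```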
Push-empty protection relies on the cache, so the cache directory must be persisted to avoid losing it across restarts. The path is: ${user.home}/nacos/naming/${namespaceId}
Service downgrade
The consumer side can decide, strategy by strategy, whether to degrade calls to a particular interface. This protects the business request flow (reserving valuable downstream Provider resources for important business consumers) and the availability of important business.
Specific degradation strategies include returning a null value, returning an exception, returning custom JSON data, and invoking a custom callback.
This high availability capability is available by default in the MSE Microservices Governance Center.
Provider-side high availability
The provider side improves availability through capabilities provided by the registry and service governance, such as disaster recovery protection, outlier instance removal, and lossless offline.
Disaster recovery protection
Disaster recovery protection is mainly used to avoid cluster avalanches under abnormal traffic.
Let's take a look at it in detail:
Without disaster recovery protection (default threshold = 0):
- A burst of requests pushes the capacity level high, and individual Providers fail;
- The registry removes the faulty nodes, and the full traffic falls on the remaining nodes;
- The load on the remaining nodes rises, and they too fail with high probability;
- Eventually all nodes fail, and the service is 100% unavailable.
With disaster recovery protection enabled (threshold = 0.6):
- Same as above;
- When the number of faulty nodes reaches the protection threshold, traffic is distributed evenly across all machines, faulty ones included (see the sketch below);
- In the end, 50% of the nodes are guaranteed to keep providing service.
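For the open-source Nacos, a hedged sketch of setting this threshold through the v1 OpenAPI (the server address and service name are placeholders; on MSE the same knob is exposed in the console):

```bash
# Sketch: raise the protection threshold of service sc-B to 0.6.
curl -X PUT 'http://mse-xxx-nacos-ans.mse.aliyuncs.com:8848/nacos/v1/ns/service' \
  -d 'serviceName=sc-B' \
  -d 'protectThreshold=0.6'
```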
In an emergency, the disaster recovery protection capability keeps service availability above a certain floor; you could call it the bottom line of the overall system.
This scheme has saved many business systems.
Outlier Instance Removal
Heartbeat renewal is the basic way for the registry to perceive instance availability.
But in certain cases, the existence of the heartbeat is not completely equivalent to the availability of the service.
Because there are still cases where the heartbeat is normal, but the service is unavailable, for example:
- The thread pool for Request processing is full
- Dependent RDS connection exception or slow SQL
The Governance Center provides outlier instance removal:
- Removal strategies based on anomaly detection: network anomalies, and network anomalies plus business anomalies (HTTP 5xx)
- Configurable anomaly thresholds, a lower QPS bound, and a limit on the proportion of instances removed
Outlier removal complements heartbeat-based health checks: it measures service availability from the call-anomaly characteristics of specific interfaces.
Lossless offline
Lossless offline, also called graceful offline or smooth offline, all mean the same thing. First, look at what lossy offline is:
While a Provider instance is upgraded, its heartbeat lingers in the registry for a while after the instance goes offline, and the change takes time to propagate. During this window, the subscription list on the Consumer side has not yet dropped the offline instance, so stopping the Provider abruptly loses some traffic.
There are many solutions for lossless offline, but the least intrusive is the default capability of the service governance center: it hooks into the release process transparently and runs automatically, eliminating the maintenance of cumbersome operations scripts (a do-it-yourself sketch follows for contrast).
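For contrast, a hedged sketch of the common do-it-yourself alternative: a Kubernetes preStop hook that deregisters the instance and then waits for Consumers to receive the updated push. The deregistration endpoint is hypothetical; the governance center's built-in capability makes a script like this unnecessary.

```yaml
# Pod spec fragment. Deregister first, then sleep long enough for the
# registry push to reach all Consumers before the process is killed.
lifecycle:
  preStop:
    exec:
      command:
        - sh
        - -c
        # The endpoint below is hypothetical; substitute your app's hook.
        - "curl -s -X POST http://127.0.0.1:8080/actuator/deregister; sleep 30"
```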
Configuration Management High Availability Solution
Configuration management mainly covers two types of operations: configuration subscription and configuration publishing.
What problems does configuration management solve?
Publishing configuration across multiple environments and many machines, and pushing configuration changes dynamically in real time.
High availability of services based on configuration management
How do microservices build high-availability solutions on top of configuration management?
Release Environment Management
Managing hundreds of machines and multiple environments at once: how to push correctly, how to roll back quickly after a misoperation or an online problem, and how to grayscale the publishing process.
Business switch dynamic push
Switches for features, campaign pages, and so on.
Disaster recovery and degradation plan push
Preset plans are activated by push, and flow-control thresholds are adjusted in real time.
The picture above shows the overall configuration-management-based high-availability setup during a big promotion: for example, degrading non-core services, degrading features, degrading logs, and disabling high-risk operations. A sketch of such a pushed configuration follows.
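As an illustration, a hedged sketch of what such a pushed plan configuration might look like; every key and value here is hypothetical:

```yaml
# Hypothetical plan configuration distributed through the config center.
log.level: WARN                  # degrade logging during the peak
feature.recommendation: false    # switch off a non-core feature
flowcontrol.qps.threshold: 2000  # adjust the rate-limit threshold in real time
```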
Client high availability
The configuration management client side also has a disaster recovery solution.
The local directory is split into two levels: the high-priority disaster recovery directory and the low-priority cache directory.
Cache directory: every time the client interacts with the configuration center, the latest configuration content is saved into the local cache directory. When the server is unavailable, the content in the local cache directory is used.
Disaster recovery directory: when the server is unavailable, you can manually update the configuration content in the local disaster recovery directory, and the client loads that content first, simulating the effect of a server-side push.
Simply put, when the configuration center is unavailable, the disaster recovery directory is checked first; otherwise the previously pulled cache is used.
The disaster recovery directory exists because there may be no cached configuration yet, or because the business urgently needs to override the configuration with new content to activate necessary plans (see the sketch below).
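A hedged shell sketch of the idea; the exact failover directory layout varies across nacos-client versions, so the path and file name below are illustrative only:

```bash
# Sketch: drop an emergency configuration into the client's local disaster
# recovery (failover) directory. Check the directories your nacos-client
# actually creates before relying on this path.
FAILOVER_DIR="${HOME}/nacos/config/<server-identifier>_nacos/failover"
mkdir -p "${FAILOVER_DIR}"
echo "flowcontrol.qps.threshold=500" > "${FAILOVER_DIR}/<dataId>"
```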
The overall principle: nothing is allowed to go wrong. In any situation, the client must be able to read a correct configuration, so that microservice availability is guaranteed.
Server-side high availability
On the configuration center server side, the main measure is rate limiting of reads and writes.
Limit connections, limit writes:
- Connection limiting: a per-machine maximum connection limit, and a per-client-IP connection limit
- Write limiting: second-level and minute-level rate limits on publish operations and on specific configurations
Control operational risk
That is, control the risk of humans performing configuration releases.
Configuration release operations are grayscale-able, traceable, and rollback-able:
- Configuration grayscale
- Release history & rollback
- Change comparison
Hands-on Practice
Finally, let's do a hands-on exercise together.
The scenario is taken from one of the high-availability solutions above: when all machines of a service provider hit a registration exception, we observe how a service consumer behaves with push-empty protection turned on.
Experimental Architecture and Ideas
The figure above shows the architecture of this exercise. On the right is a simple call scenario: external traffic enters through a gateway; here we choose the cloud-native gateway from the MSE product matrix, whose built-in observability makes it easy to watch the state of service calls.
Downstream of the gateway are three applications, A, B, and C, which support using configuration management to dynamically adjust their invocation relationships; we will make use of this later.
The basic idea:
- Deploy the services and adjust the call relationship to gateway -> A -> B -> C, then check the gateway's call success rate.
- Simulate a network problem by cutting the heartbeat link between application B and the registry, reproducing a registration exception.
- Check the gateway's call success rate again, expecting the A -> B link to be unaffected by the registration exception.
For comparison, application A is deployed in two versions: one with push-empty protection enabled and one without. The desired result is that, with the push-empty protection switch on, application A can keep addressing application B throughout the exception.
Since the gateway splits traffic evenly between the two versions of application A, the observed interface success rate should settle at exactly 50%.
Start
Let's start. Here I choose the combination of Alibaba Cloud MSE + ACK as the complete solution.
Environment preparation
First, purchase an MSE Registration and Configuration Center Professional Edition instance and an MSE cloud-native gateway; the purchase process is not covered here.
Before deploying the applications, prepare the configuration in advance: initially, configure A's downstream as C, and B's downstream as C as well.
Deploy the applications
Next, we deploy the three applications on ACK. As the configuration below shows, the application A version named spring-cloud-a-b has push-empty protection enabled.
The nacos-client version used in this demo is 1.4.2, because push-empty protection is only supported from that version onward.
Configuration reference (contains placeholders and cannot be used as-is):
```yaml
# Application A: spring-cloud-a-b, the version with push-empty protection enabled
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-a
  name: spring-cloud-a-b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-a
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-a
      labels:
        app: spring-cloud-a
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: spring.cloud.nacos.discovery.server-addr
              value: mse-xxx-nacos-ans.mse.aliyuncs.com:8848
            - name: spring.cloud.nacos.config.server-addr
              value: mse-xxx-nacos-ans.mse.aliyuncs.com:8848
            - name: spring.cloud.nacos.discovery.metadata.version
              value: base
            - name: spring.application.name
              value: sc-A
            - name: spring.cloud.nacos.discovery.namingPushEmptyProtection
              value: "true"
          image: mse-demo/demo:1.4.2
          imagePullPolicy: Always
          name: spring-cloud-a
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
# Application A: base version, without push-empty protection
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-a
  name: spring-cloud-a
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-a
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-a
      labels:
        app: spring-cloud-a
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: spring.cloud.nacos.discovery.server-addr
              value: mse-xxx-nacos-ans.mse.aliyuncs.com:8848
            - name: spring.cloud.nacos.config.server-addr
              value: mse-xxx-nacos-ans.mse.aliyuncs.com:8848
            - name: spring.cloud.nacos.discovery.metadata.version
              value: base
            - name: spring.application.name
              value: sc-A
          image: mse-demo/demo:1.4.2
          imagePullPolicy: Always
          name: spring-cloud-a
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
# Application B: base version
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-b
  name: spring-cloud-b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-b
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-b
      labels:
        app: spring-cloud-b
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: spring.cloud.nacos.discovery.server-addr
              value: mse-xxx-nacos-ans.mse.aliyuncs.com:8848
            - name: spring.cloud.nacos.config.server-addr
              value: mse-xxx-nacos-ans.mse.aliyuncs.com:8848
            - name: spring.application.name
              value: sc-B
          image: mse-demo/demo:1.4.2
          imagePullPolicy: Always
          name: spring-cloud-b
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
# Application C: base version
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-c
  name: spring-cloud-c
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-c
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-c
      labels:
        app: spring-cloud-c
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: spring.cloud.nacos.discovery.server-addr
              value: mse-xxx-nacos-ans.mse.aliyuncs.com:8848
            - name: spring.cloud.nacos.config.server-addr
              value: mse-xxx-nacos-ans.mse.aliyuncs.com:8848
            - name: spring.application.name
              value: sc-C
          image: mse-demo/demo:1.4.2
          imagePullPolicy: Always
          name: spring-cloud-c
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
```
Deploy the applications:
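Assuming the manifests above are saved to a local file (the file name is arbitrary), deployment and a quick check look like this:

```bash
kubectl apply -f mse-demo-apps.yaml
# All eight Pods (2 replicas x 4 Deployments) should reach Running.
kubectl get pods -l 'app in (spring-cloud-a, spring-cloud-b, spring-cloud-c)'
```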
Register the service at the gateway
After the applications are deployed, associate the MSE cloud-native gateway with the MSE registry and register the services there.
By design, the gateway only calls A, so only A needs to be brought in and registered.
Verify and Adjust Links
Verify the link with curl:
```bash
$ curl http://${GATEWAY_IP}/ip
sc-A[192.168.1.194] --> sc-C[192.168.1.195]
```
We can see that A is currently calling C. Next, we change the configuration, switching A's downstream to B in real time.
Checking again, the calling relationship of the three applications is now A -> B -> C, in line with our plan.
```bash
$ curl http://${GATEWAY_IP}/ip
sc-A[192.168.1.194] --> sc-B[192.168.1.191] --> sc-C[192.168.1.180]
```
Next, we run a command that calls the interface continuously, simulating the uninterrupted business traffic of a real scenario.
```bash
$ while true; do sleep .1; curl -so /dev/null http://${GATEWAY_IP}/ip; done
```
Watch the calls
The success rate can be observed on the gateway's monitoring dashboard.
Inject the fault
Everything works fine so far; now we can start injecting the fault.
Here we use the K8s NetworkPolicy mechanism to simulate an egress network exception.
```yaml
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: block-registry-from-b
spec:
  podSelector:
    matchLabels:
      app: spring-cloud-b
  ingress:
    - {}
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 8080
```
Allowing only TCP port 8080 as egress keeps intranet calls to downstream application ports working while blocking all other egress traffic (for example, traffic to the registry on port 8848). Here, B's downstream is C.
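Applying and confirming the policy (the file name is arbitrary):

```bash
kubectl apply -f block-registry-from-b.yaml
# Confirm the policy selects the B Pods and lists the expected rules.
kubectl describe networkpolicy block-registry-from-b
```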
After the network is cut, heartbeats to the registry can no longer be renewed, and a while later (after 30 seconds) all IPs of application B are removed.
Observe again
Looking at the dashboard again, the success rate starts to decline, and by now application B's IPs can no longer be seen on the console.
Back on the overall dashboard, the success rate stabilizes at around 50% and no longer fluctuates.
Summary
Through this exercise, we simulated a real risk scenario, and the client-side high-availability capability (push-empty protection) successfully contained the risk: calls through the protected version of the application kept succeeding.