Achieve lossless online and offline for enterprise-level applications in 15 minutes

Many application systems with a large user base and high concurrency choose to release in the middle of the night, when traffic is low, to avoid losing traffic during the release. This works, but the timing is hard to control, and the R&D and operations costs behind it are no small burden for an enterprise. To address this, Alibaba Cloud's microservice engine MSE gives microservice applications lossless online and offline capabilities during release: adaptive waiting plus active notification when an application goes offline, and readiness checks, alignment with the microservice life cycle, and service warm-up when an application comes online. Together these help enterprises avoid the traffic loss caused by online releases.

Lossless online and offline function design

Common causes of traffic loss include but are not limited to the following:
• The service cannot go offline in time: service consumers learn of changes to the registry's service list with a delay, so for a period after an application instance goes offline they keep calling it, and those requests fail.
• Slow initialization: a newly started application receives online traffic while it is still initializing and loading resources. Under heavy traffic, initialization is slow, many requests time out or block, and resources are exhausted, which can crash the newly started application.
• Registration too early: the service loads resources asynchronously and is registered with the registry before initialization has completed. Calls then arrive before the resources are loaded, so responses are slow and calls time out.
• The release state and the running state are not aligned: when an application is released with a Kubernetes rolling release, the readiness check that gates the rollout usually only checks whether a specific application port is open, and the next batch of instances is released once that check passes. A microservice application, however, can serve calls only after it has completed service registration. In some cases, therefore, the new instances have not yet registered with the registry when the old instances are taken offline, and no service is available.

Lossless offline

The first problem, a service that cannot go offline in time, is shown in Figure 1 below:

Figure 1. Spring Cloud consumers cannot detect in time that a provider has gone offline

For a Spring Cloud application, suppose application A has two instances, A' and A, and instance A goes offline. Because the Spring Cloud framework balances availability against performance, a consumer by default pulls the latest service list from the registry only every 30s, so it cannot detect in real time that instance A has gone offline. If the consumer keeps calling instance A from its local cache during that window, the calls hit an offline instance and traffic is lost.
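
To make the size of this window concrete, the following minimal Python sketch estimates how many requests reach an offline instance before the consumer's next refresh. Only the 30s refresh interval comes from the text above. The request rate, the instance count, and the assumption that calls are spread evenly over all cached instances are illustrative.

```python
def lost_requests(refresh_interval_s, offline_at_s, qps, cached_instances):
    """Estimate requests sent to an offline instance before the next refresh.

    Assumes the consumer refreshes its cached service list at
    t = 0, refresh_interval_s, 2 * refresh_interval_s, ... and spreads calls
    evenly over every instance in the cache (simple round-robin).
    """
    # Time of the first refresh at or after the instance went offline.
    periods_elapsed = -(-offline_at_s // refresh_interval_s)  # ceiling division
    next_refresh_s = periods_elapsed * refresh_interval_s
    stale_window_s = next_refresh_s - offline_at_s
    # Share of traffic still routed to the offline instance during the window.
    return stale_window_s * qps / cached_instances


# Instance goes offline 5s after a refresh; consumer sends 100 QPS to 2 instances.
print(lost_requests(30, 5, 100, 2))  # 1250.0
```

Under these assumptions the stale window lasts 25s, and half of the traffic in that window is sent to an instance that no longer exists.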

To solve this problem, Alibaba Cloud's microservice engine MSE designed and implemented a lossless offline function based on Java Agent bytecode technology, shown in Figure 2 below:
Figure 2. Lossless offline scheme

In this lossless offline solution, the service provider application only needs to connect to MSE. Compared with an ordinary lossy offline, there is an adaptive waiting period before the application goes offline. During this period, the application that is about to go offline actively notifies the service consumers that have sent it requests by sending them an offline event. On receiving the event, a consumer actively pulls the service instance list from the registry, so it learns of the offline in real time. This avoids the traffic loss caused by calling an offline instance.
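
The notification flow described above can be sketched in a few lines of Python. This is an in-memory illustration only: the class and method names (`Registry`, `Provider`, `Consumer`, `on_offline_event`, `go_offline`) are hypothetical and are not the MSE agent's API, and the order of the steps inside `go_offline` is an assumption.

```python
class Registry:
    """In-memory stand-in for a service registry such as Nacos."""

    def __init__(self):
        self.instances = set()

    def register(self, address):
        self.instances.add(address)

    def deregister(self, address):
        self.instances.discard(address)


class Consumer:
    def __init__(self, registry):
        self.registry = registry
        # Local copy of the service list, normally refreshed on a timer.
        self.cache = set(registry.instances)

    def on_offline_event(self, address):
        # Active notification received: pull the latest list immediately
        # instead of waiting for the next scheduled refresh.
        self.cache = set(self.registry.instances)


class Provider:
    def __init__(self, address, registry):
        self.address = address
        self.registry = registry
        self.recent_callers = []  # consumers that have sent requests
        registry.register(address)

    def handle_request(self, consumer):
        if consumer not in self.recent_callers:
            self.recent_callers.append(consumer)
        return "ok"

    def go_offline(self):
        # 1. Deregister so that fresh pulls no longer return this instance.
        self.registry.deregister(self.address)
        # 2. Notify every consumer that has called this instance.
        for consumer in self.recent_callers:
            consumer.on_offline_event(self.address)
        # 3. A real implementation would now wait (adaptively) for in-flight
        #    requests to drain before stopping the process.


registry = Registry()
a1 = Provider("10.0.0.1:20001", registry)
a2 = Provider("10.0.0.2:20001", registry)
consumer = Consumer(registry)
a2.handle_request(consumer)
a2.go_offline()
print(sorted(consumer.cache))  # ['10.0.0.1:20001']
```

The consumer's cache drops the offline instance as soon as the event arrives, rather than at the next timed pull.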

Lossless online

Lazy loading is one of the most common strategies in software framework design. In the Spring Cloud framework, for example, the Ribbon component by default does not pull the service list until the first call to the service. Figure 3 below shows the time taken by the first and second calls to a remote service through RestTemplate in a Spring Cloud application:

Figure 3. Time-consuming comparison between application startup resource initialization and normal operation

The test results show that the first call takes several times longer than normal because of resource initialization. When a newly released application takes heavy traffic immediately, many requests are therefore likely to respond slowly, resources may block, and the instance may go down. To deal with slow resource initialization under heavy traffic, MSE provides a low-traffic warm-up function. It protects the new instance by limiting the traffic allocated to the just-launched application, so that the application handles normal traffic only after it is sufficiently warmed up. The low-traffic warm-up process is shown in Figure 4 below:
Figure 4. Relationship between QPS and time since startup during low-traffic service warm-up
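
One common way to implement this kind of warm-up is to give a new instance a load-balancing weight that grows with its uptime. The Python sketch below uses a linear ramp. The text does not specify the curve MSE actually uses, so the linear shape, the function names, and the default values (a 120-second warm-up and a full weight of 100) are assumptions for illustration.

```python
def warmup_weight(uptime_s, warmup_s=120, full_weight=100):
    """Load-balancing weight of an instance that has been up for uptime_s seconds."""
    if uptime_s >= warmup_s:
        return full_weight
    # Linear ramp; never drop to 0 so the instance still gets a little traffic.
    return max(1, int(full_weight * uptime_s / warmup_s))


def traffic_share(uptime_s, warmed_instances=1, full_weight=100):
    """Fraction of traffic the new instance receives next to fully warmed instances."""
    w = warmup_weight(uptime_s, full_weight=full_weight)
    return w / (w + warmed_instances * full_weight)


print([warmup_weight(t) for t in (0, 30, 60, 120)])  # [1, 25, 50, 100]
print(traffic_share(30))  # 0.2
```

With one warmed instance alongside it, the new instance in this sketch receives 20% of the traffic after 30 seconds and an equal share only once the warm-up period has elapsed.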

Besides the lossy online problem caused by slow initialization on an application's first calls, MSE also provides resource connection pre-establishment, delayed registration, ensuring that service registration is completed before the Kubernetes readiness check passes, ensuring that the Kubernetes readiness check is completed before the service is warmed up, and other measures. This complete set of lossless online capabilities meets the lossless online requirements of various applications. The complete solution is shown in Figure 5:

Figure 5. MSE lossless online solution
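
The difference between a port-only readiness check and one aligned with the microservice life cycle can be illustrated with a small Python state model. The class and method names are hypothetical and do not correspond to MSE or Kubernetes APIs. The sketch only shows the ordering constraints: registration waits for resource loading, and readiness waits for registration.

```python
class AppInstance:
    def __init__(self):
        self.port_open = False
        self.resources_loaded = False
        self.registered = False

    def start(self):
        self.port_open = True

    def finish_initialization(self):
        self.resources_loaded = True

    def register(self):
        # Delayed registration: refuse to register before resources are loaded.
        if not self.resources_loaded:
            raise RuntimeError("resources not loaded yet")
        self.registered = True

    def port_only_ready(self):
        # What a plain TCP-port readiness probe effectively checks.
        return self.port_open

    def aligned_ready(self):
        # Readiness aligned with the microservice life cycle:
        # the instance is ready only once it is registered.
        return self.port_open and self.registered


app = AppInstance()
app.start()
print(app.port_only_ready(), app.aligned_ready())  # True False
app.finish_initialization()
app.register()
print(app.aligned_ready())  # True
```

In the state printed first, a port-only probe would already let the rollout take old instances offline even though the new instance cannot yet be discovered by consumers.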

How to use MSE's lossless online and offline

Next, we demonstrate best practices for the lossless online and offline and service warm-up capabilities that the Alibaba Cloud microservice engine MSE provides during application release. The demo assumes an architecture consisting of a Zuul gateway and back-end microservice application instances (Spring Cloud). The back-end call chain includes shopping cart application A, transaction center application B, and inventory center application C. These applications register and discover services through the Nacos registry.

Prerequisites

Enable MSE Microservice Governance

• A Kubernetes cluster has been created. See Creating a Kubernetes Managed Cluster [1].
• MSE Microservice Governance Professional Edition has been activated. See Activating MSE Microservice Governance [2].

Preparation

Note that the agent used in this practice is still in grayscale release, so the application's agent needs to be upgraded to the grayscale version. See the upgrade document: https://help.aliyun.com/document_detail/392373.html

If the application is deployed in a different region (for now, only domestic regions are supported), use the corresponding agent download address: http://arms-apm-cn-[regionId].oss-cn-[regionId].aliyuncs.com/2.7.1.3-mse-beta/. Replace [regionId] in the address with the Alibaba Cloud RegionId.

For example, the agent address for the Beijing region is: http://arms-apm-cn-beijing.oss-cn-beijing.aliyuncs.com/2.7.1.3-mse-beta/
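
The substitution can be sketched as a small Python helper. The function name is hypothetical. Because the address template already contains `cn-` while the Beijing example host is `cn-beijing`, this sketch assumes the placeholder stands for the part of the RegionId after `cn-`, and it accepts the RegionId with or without that prefix.

```python
def agent_download_url(region_id, version="2.7.1.3-mse-beta"):
    """Build the agent download address for a region such as 'cn-beijing'."""
    # Drop a leading "cn-" so it is not duplicated by the template.
    suffix = region_id[3:] if region_id.startswith("cn-") else region_id
    return f"http://arms-apm-cn-{suffix}.oss-cn-{suffix}.aliyuncs.com/{version}/"


print(agent_download_url("cn-beijing"))
# http://arms-apm-cn-beijing.oss-cn-beijing.aliyuncs.com/2.7.1.3-mse-beta/
```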

Application Deployment Traffic Architecture Diagram

Figure 6. Demo application deployment architecture

Traffic load source

As shown in Figure 6, the spring-cloud-zuul application simultaneously calls the grayscale version and the normal version of spring-cloud-a at a rate of 100 QPS.

Deploy Demo application

Save the following content to a file, for example mse-demo.yaml, and run kubectl apply -f mse-demo.yaml to deploy the applications to the Kubernetes cluster created earlier. Note that the demo contains CronHPA tasks, so first install the ack-kubernetes-cronhpa-controller component in the cluster: search for the component under Container Service-Kubernetes->Market->Application Directory and install it in the test cluster. The manifest deploys Zuul and the three applications A, B, and C. Applications A and B each have a baseline version and a gray version. The baseline version of application B has the lossless offline capability turned off, and the gray version has it enabled. Application C has the service warm-up capability enabled, with a warm-up time of 120 seconds.

# Nacos Server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nacos-server
  name: nacos-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nacos-server
  template:
    metadata:
      labels:
        app: nacos-server
    spec:
      containers:
      - env:
        - name: MODE
          value: standalone
        image: registry.cn-shanghai.aliyuncs.com/yizhan/nacos-server:latest
        imagePullPolicy: Always
        name: nacos-server
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
      dnsPolicy: ClusterFirst
      restartPolicy: Always
# Nacos Server Service configuration
---
apiVersion: v1
kind: Service
metadata:
  name: nacos-server
spec:
  ports:
  - port: 8848
    protocol: TCP
    targetPort: 8848
  selector:
    app: nacos-server
  type: ClusterIP
# Entry application: zuul
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spring-cloud-zuul
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spring-cloud-zuul
  template:
    metadata:
      annotations:
        msePilotAutoEnable: "on"
        msePilotCreateAppName: spring-cloud-zuul
      labels:
        app: spring-cloud-zuul
    spec:
      containers:
        - env:
            - name: JAVA_HOME
              value: /usr/lib/jvm/java-1.8-openjdk/jre
            - name: LANG
              value: C.UTF-8
          image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-zuul:1.0.1
          imagePullPolicy: Always
          name: spring-cloud-zuul
          ports:
            - containerPort: 20000
# Application A, base version; full-link pass-through by machine dimension enabled
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-a
  name: spring-cloud-a
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-a
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-a
        msePilotAutoEnable: "on"
      labels:
        app: spring-cloud-a
    spec:
      containers:
      - env:
        - name: LANG
          value: C.UTF-8
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        - name: profiler.micro.service.tag.trace.enable
          value: "true"
        image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-a:0.1-SNAPSHOT
        imagePullPolicy: Always
        name: spring-cloud-a
        ports:
        - containerPort: 20001
          protocol: TCP
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
        livenessProbe:
          tcpSocket:
            port: 20001
          initialDelaySeconds: 10
          periodSeconds: 30
      
# Application A, gray version; full-link pass-through by machine dimension enabled
---            
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-a-gray
  name: spring-cloud-a-gray
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-a-gray
  strategy:
  template:
    metadata:
      annotations:
        alicloud.service.tag: gray
        msePilotCreateAppName: spring-cloud-a
        msePilotAutoEnable: "on"
      labels:
        app: spring-cloud-a-gray
    spec:
      containers:
      - env:
        - name: LANG
          value: C.UTF-8
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        - name: profiler.micro.service.tag.trace.enable
          value: "true"
        image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-a:0.1-SNAPSHOT
        imagePullPolicy: Always
        name: spring-cloud-a-gray
        ports:
        - containerPort: 20001
          protocol: TCP
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
        livenessProbe:
          tcpSocket:
            port: 20001
          initialDelaySeconds: 10
          periodSeconds: 30
            
# Application B, base version; lossless offline capability disabled
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-b
  name: spring-cloud-b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-b
  strategy:
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-b
        msePilotAutoEnable: "on"
      labels:
        app: spring-cloud-b
    spec:
      containers:
      - env:
        - name: LANG
          value: C.UTF-8
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        - name: micro.service.shutdown.server.enable
          value: "false"
        - name: profiler.micro.service.http.server.enable
          value: "false"
        image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-b:0.1-SNAPSHOT
        imagePullPolicy: Always
        name: spring-cloud-b
        ports:
        - containerPort: 8080
          protocol: TCP
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
        livenessProbe:
          tcpSocket:
            port: 20002
          initialDelaySeconds: 10
          periodSeconds: 30
            
# Application B, gray version; lossless offline function enabled by default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-b-gray
  name: spring-cloud-b-gray
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-b-gray
  template:
    metadata:
      annotations:
        alicloud.service.tag: gray
        msePilotCreateAppName: spring-cloud-b
        msePilotAutoEnable: "on"
      labels:
        app: spring-cloud-b-gray
    spec:
      containers:
      - env:
        - name: LANG
          value: C.UTF-8
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-b:0.1-SNAPSHOT
        imagePullPolicy: Always
        name: spring-cloud-b-gray
        ports:
        - containerPort: 8080
          protocol: TCP
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
        lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - >-
                    wget http://127.0.0.1:54199/offline 2>/tmp/null;sleep
                    30;exit 0
        livenessProbe:
          tcpSocket:
            port: 20002
          initialDelaySeconds: 10
          periodSeconds: 30
            
# Application C, base version
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-c
  name: spring-cloud-c
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-c
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-c
        msePilotAutoEnable: "on"
      labels:
        app: spring-cloud-c
    spec:
      containers:
      - env:
        - name: LANG
          value: C.UTF-8
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-c:0.1-SNAPSHOT
        imagePullPolicy: Always
        name: spring-cloud-c
        ports:
        - containerPort: 8080
          protocol: TCP
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
        livenessProbe:
          tcpSocket:
            port: 20003
          initialDelaySeconds: 10
          periodSeconds: 30
# HPA configuration
---
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: spring-cloud-b
spec:
   scaleTargetRef:
      apiVersion: apps/v1beta2
      kind: Deployment
      name: spring-cloud-b
   jobs:
   - name: "scale-down"
     schedule: "0 0/5 * * * *"
     targetSize: 1
   - name: "scale-up"
     schedule: "10 0/5 * * * *"
     targetSize: 2
---
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: spring-cloud-b-gray
spec:
   scaleTargetRef:
      apiVersion: apps/v1beta2
      kind: Deployment
      name: spring-cloud-b-gray
   jobs:
   - name: "scale-down"
     schedule: "0 0/5 * * * *"
     targetSize: 1
   - name: "scale-up"
     schedule: "10 0/5 * * * *"
     targetSize: 2
---
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: spring-cloud-c
spec:
   scaleTargetRef:
      apiVersion: apps/v1beta2
      kind: Deployment
      name: spring-cloud-c
   jobs:
   - name: "scale-down"
     schedule: "0 2/5 * * * *"
     targetSize: 1 
   - name: "scale-up"
     schedule: "10 2/5 * * * *"
     targetSize: 2
# Expose the zuul gateway's display page through SLB
---     
apiVersion: v1
kind: Service
metadata:
  name: zuul-slb
spec:
  ports:
    - port: 80
      protocol: TCP
      targetPort: 20000
  selector:
    app: spring-cloud-zuul
  type: ClusterIP
# Expose application a as a k8s service
---
apiVersion: v1
kind: Service
metadata:
  name: spring-cloud-a-base
spec:
  ports:
    - name: http
      port: 20001
      protocol: TCP
      targetPort: 20001
  selector:
    app: spring-cloud-a
---
apiVersion: v1
kind: Service
metadata:
  name: spring-cloud-a-gray
spec:
  ports:
    - name: http
      port: 20001
      protocol: TCP
      targetPort: 20001
  selector:
    app: spring-cloud-a-gray
# Nacos Server SLB Service configuration
---
apiVersion: v1
kind: Service
metadata:
  name: nacos-slb
spec:
  ports:
  - port: 8848
    protocol: TCP
    targetPort: 8848
  selector:
    app: nacos-server
  type: LoadBalancer

Result Verification 1: Lossless offline function

Because timed HPA is enabled for both the spring-cloud-b and spring-cloud-b-gray applications, a scale-in and scale-out is simulated every 5 minutes.

Log in to the MSE console and enter the Microservice Governance Center->Application List->spring-cloud-a->Application Details. From the application monitoring curve, we can see the traffic data of the spring-cloud-a application:

During pod scale-in and scale-out, the gray version's traffic shows 0 request errors, so no traffic is lost. In the untagged (baseline) version, where the lossless offline function is disabled, 20 requests sent from spring-cloud-a to spring-cloud-b fail during pod scale-in and scale-out, so request traffic is lost.

Result verification 2: service warm-up function

For the spring-cloud-c application, we enabled timed HPA to simulate the application going online. It scales every 5 minutes: within each cycle it scales in to 1 node at minute 2, second 0, and scales out to 2 nodes at minute 2, second 10.
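
To check how the schedules in the manifest map to these times, the Python sketch below expands the first two fields of a six-field schedule. It assumes the fields are ordered seconds first, then minutes, which matches the timing described above. The helper names are hypothetical, and only the field forms that appear in the demo (`*`, a single number, and `START/STEP`) are handled.

```python
def expand_field(field, upper):
    """Expand a cron field of the form '*', 'N', or 'START/STEP' into values below upper."""
    if field == "*":
        return list(range(upper))
    if "/" in field:
        start, step = field.split("/")
        return list(range(int(start), upper, int(step)))
    return [int(field)]


def fire_times(schedule):
    """Return the (minute, second) pairs within one hour at which the schedule fires."""
    second, minute = schedule.split()[:2]
    return [(m, s) for m in expand_field(minute, 60) for s in expand_field(second, 60)]


print(fire_times("0 2/5 * * * *")[:3])   # [(2, 0), (7, 0), (12, 0)]
print(fire_times("10 2/5 * * * *")[:3])  # [(2, 10), (7, 10), (12, 10)]
```

The scale-down job fires at second 0 and the scale-up job 10 seconds later, repeating every 5 minutes from minute 2.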

Enable the service warm-up function on spring-cloud-b, the consumer side of the warm-up application.

On the service provider side of the warm-up application, spring-cloud-c, enable the service warm-up function and configure a warm-up time of 120 seconds.

Observe the node's traffic: it increases slowly. You can also see the node's warm-up start and end times, as well as related events.

As the node's traffic curve shows, with the warm-up function enabled, an application's traffic increases slowly over time after a restart. In slow-start scenarios where resources such as connection pools and caches must be built during startup, enabling service warm-up effectively protects the application. Cache and other resources are created in an orderly way during startup, the application starts safely, and no traffic is lost when it comes online.

Solution introduction & hands-on practice

For more details of the solution design, watch the replay of the live session on how microservice applications achieve lossless online and offline [3]:
https://yqh.aliyun.com/live/detail/27936
