Many applications with large user bases and high concurrency are released in the middle of the night, when traffic is low, to avoid losing traffic during the release. This works, but it is hard to control and drives up R&D and operations costs, which is no small burden for an enterprise. To address this, Alibaba Cloud's microservice engine MSE gives microservice applications lossless online and offline capabilities during release. When an application goes offline, it uses adaptive waiting plus active notification. When an application comes online, it uses readiness checks, alignment with the microservice life cycle, and service warm-up. Together these help enterprises avoid the traffic loss caused by online releases.
Lossless online and offline function design
Common causes of traffic loss include but are not limited to the following:
• The service is not taken offline in time: service consumers learn of changes to the registry's service list with a delay, so for a period after an instance goes offline they keep calling it, and those requests fail.
• Slow initialization: a newly started application begins initializing and loading resources only when it receives online traffic. Under heavy traffic the initialization is slow, many requests time out or block, and resources are exhausted, which can crash the newly started application.
• Premature registration: the service loads resources asynchronously and registers with the registry before initialization is complete. Calls then arrive before the resources are loaded, so responses are slow and requests time out.
• The release state and the running state are not aligned: when an application is released with a Kubernetes rolling release, the readiness check associated with the rollout typically treats an open application port as the sign that the instance is ready, and then triggers the release of the next batch of instances. In a microservice application, however, an instance can serve calls only after it has completed service registration. In some cases the new instances have not yet registered with the registry while the old instances have already been taken offline, leaving no service available.
Lossless offline
One case of a service not going offline in time is shown in Figure 1 below:
Figure 1. Spring Cloud consumers cannot sense a provider going offline in time
Consider a Spring Cloud application A with two instances, A and A'. When instance A goes offline, the consumer cannot sense it in real time: to balance availability and performance, the Spring Cloud framework has the consumer pull the latest service list from the registry every 30 seconds by default. If the consumer keeps calling A from its local cache during that window, the requests reach an offline instance and traffic is lost.
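The stale window described above can be illustrated with a small simulation. This is only a sketch: `Registry`, `CachingConsumer`, `stale_seconds`, and the one-second simulated clock are hypothetical names invented for illustration, not Spring Cloud or Nacos APIs. Only the 30-second pull interval comes from the text.

```python
from dataclasses import dataclass, field

PULL_INTERVAL_S = 30  # default pull interval described in the text


@dataclass
class Registry:
    """Hypothetical in-memory registry holding the names of live instances."""
    instances: set = field(default_factory=set)


@dataclass
class CachingConsumer:
    """Hypothetical consumer that refreshes its local list on a fixed interval."""
    registry: Registry
    cache: set = field(default_factory=set)
    last_pull: float = float("-inf")

    def pick_targets(self, now: int) -> set:
        # Refresh only when the pull interval has elapsed; otherwise reuse the cache.
        if now - self.last_pull >= PULL_INTERVAL_S:
            self.cache = set(self.registry.instances)
            self.last_pull = now
        return self.cache


def stale_seconds(offline_at: int, horizon: int = 60) -> int:
    """Count the simulated seconds in which the consumer still targets A after A left."""
    registry = Registry({"A", "A'"})
    consumer = CachingConsumer(registry)
    stale = 0
    for now in range(horizon):
        if now == offline_at:
            registry.instances.discard("A")  # instance A deregisters
        targets = consumer.pick_targets(now)
        if now >= offline_at and "A" in targets:
            stale += 1
    return stale
```

In this model the consumer pulls at seconds 0 and 30, so an instance that leaves just after a pull stays in the local cache for almost the whole interval, while one that leaves just before the next pull is dropped almost immediately.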
To solve this problem, Alibaba Cloud's microservice engine MSE implements a lossless offline function based on Java Agent bytecode technology, as shown in Figure 2 below:
Figure 2. Lossless offline scheme
Compared with an ordinary lossy offline process, this solution only requires the service provider application to be connected to MSE. An adaptive waiting period is added before the application goes offline. During this period, the application that is about to go offline actively notifies every service consumer that sends it a request by sending that consumer an offline event. On receiving the event, the consumer actively pulls the service instance list from the registry, so it senses the offline event in real time and avoids the traffic loss caused by calling an offline instance.
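The interaction between the waiting provider and its consumers can be sketched as follows. This is an illustrative model under stated assumptions, not MSE's implementation: the classes and method names (`Provider.begin_offline`, `Consumer.on_offline_event`, and so on) are hypothetical, the event is delivered as a direct method call rather than over the network, and the length of the adaptive wait is not modeled because the text does not describe how it is computed.

```python
class Registry:
    """Hypothetical in-memory registry."""

    def __init__(self):
        self.instances = set()


class Consumer:
    def __init__(self, registry):
        self.registry = registry
        self.cache = set(registry.instances)  # last pulled service list

    def on_offline_event(self, provider_name):
        # Pull the list immediately instead of waiting for the next scheduled pull.
        self.cache = set(self.registry.instances)

    def call(self, provider):
        if provider.name not in self.cache:
            raise LookupError(f"{provider.name} is not in the local service list")
        return provider.handle(self)


class Provider:
    def __init__(self, name, registry):
        self.name = name
        self.registry = registry
        self.waiting_to_stop = False
        registry.instances.add(name)

    def begin_offline(self):
        # Deregister first, then keep serving during the waiting period.
        self.registry.instances.discard(self.name)
        self.waiting_to_stop = True

    def handle(self, consumer):
        if self.waiting_to_stop:
            # Active notification: tell the caller that this instance is leaving.
            consumer.on_offline_event(self.name)
        return f"ok from {self.name}"
```

The request that arrives during the waiting period is still answered, and the same request triggers the consumer's refresh, so the consumer's next call no longer targets the departing instance.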
Lossless online
Lazy loading is a very common strategy in software framework design. In the Spring Cloud framework, for example, the Ribbon component by default does not pull the service list until the service is first invoked. Figure 3 below compares the time taken by the first and second calls to a remote service through RestTemplate in a Spring Cloud application:
Figure 3. Time taken during resource initialization at application startup versus normal operation
The test results show that the first call takes several times as long as a normal call because of resource initialization. When a newly released application takes heavy traffic immediately, many requests are therefore likely to respond slowly, resources may block, and the instance may go down. To address slow resource initialization under heavy traffic, MSE provides a low-traffic warm-up function. It protects a newly started instance by adjusting the traffic allocated to it, so that the instance handles normal traffic only after it has warmed up sufficiently. The low-traffic warm-up process is shown in Figure 4 below:
Figure 4. QPS versus startup time during low-traffic service warm-up
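One way to picture the ramp is a weight that grows with instance uptime and is used by the consumer's load balancer. The sketch below assumes a linear ramp, which is an assumption made for illustration: the text does not state the curve MSE uses, and `warmup_weight` and `traffic_share` are hypothetical helper names. The 120-second default mirrors the warm-up time used in the demo later in this article.

```python
def warmup_weight(uptime_s: float, warmup_s: float = 120.0, full_weight: int = 100) -> int:
    """Return the load-balancing weight of an instance that has been up for uptime_s.

    Assumption: the weight rises linearly from 1 to full_weight over warmup_s seconds.
    """
    if warmup_s <= 0 or uptime_s >= warmup_s:
        return full_weight
    if uptime_s <= 0:
        return 1
    # Never return 0, so the new instance always receives a little traffic.
    return max(1, int(full_weight * uptime_s / warmup_s))


def traffic_share(new_uptime_s: float, warmed_peers: int = 1, warmup_s: float = 120.0) -> float:
    """Fraction of requests sent to the new instance under weighted random selection."""
    new_weight = warmup_weight(new_uptime_s, warmup_s)
    peer_weight = warmup_weight(warmup_s, warmup_s)  # peers are fully warmed up
    return new_weight / (new_weight + warmed_peers * peer_weight)
```

With one fully warmed peer, the new instance in this model receives about 1% of requests at startup, a third of them halfway through the warm-up, and an equal half once the warm-up time has elapsed.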
Beyond the lossy online problem caused by slow initialization on the first call, MSE also provides a complete set of lossless online measures to meet the needs of different applications: pre-establishing resource connections, delayed registration, ensuring that service registration completes before the Kubernetes readiness check passes, and ensuring that the Kubernetes readiness check completes before service warm-up begins. The complete solution is shown in Figure 5:
Figure 5. MSE lossless online solution
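The ordering constraints in this solution can be expressed as a few predicates over an instance's startup state. This is a minimal sketch for reasoning about the order of events, not MSE's implementation: `InstanceState` and the three functions are hypothetical names, and a real readiness check would be an HTTP or TCP probe rather than a Python function.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    """Hypothetical snapshot of a starting instance."""
    port_open: bool = False
    resources_loaded: bool = False
    registered: bool = False


def port_only_ready(state: InstanceState) -> bool:
    # The common rolling-release check: an open port counts as ready.
    return state.port_open


def may_register(state: InstanceState) -> bool:
    # Delayed registration: register only after resources have been loaded.
    return state.resources_loaded


def registration_aware_ready(state: InstanceState) -> bool:
    # Report ready only after the instance has also registered with the registry,
    # so the rollout cannot remove old instances before new ones are discoverable.
    return state.port_open and state.resources_loaded and state.registered
```

An instance whose port is open but which has not yet registered passes `port_only_ready` and fails `registration_aware_ready`, which is exactly the gap that lets a rolling release take old instances offline too early.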
How to use MSE's lossless online and offline
Next, we demonstrate best practices for the lossless online and offline and service warm-up capabilities that Alibaba Cloud's microservice engine MSE provides during application release. Assume that the application architecture consists of a Zuul gateway and back-end microservice application instances (Spring Cloud). The back-end call chain includes shopping cart application A, transaction center application B, and inventory center application C. The services in these applications are registered and discovered through the Nacos registry.
Prerequisites
Enable MSE Microservice Governance
• A Kubernetes cluster has been created. See Create a Kubernetes managed cluster [1].
• MSE Microservice Governance Professional Edition has been activated. See Activate MSE Microservice Governance [2].
Preparation
Note that the agent used in this walkthrough is still in grayscale release, so the application's agent must be upgraded to the grayscale version. See the upgrade document: https://help.aliyun.com/document_detail/392373.html
If the application is deployed in a different Region (currently only Regions in China are supported), use the corresponding Agent download address: http://arms-apm-cn-[regionId].oss-cn-[regionId].aliyuncs.com/2.7.1.3-mse-beta/. Replace [regionId] in the address with the Alibaba Cloud RegionId.
For example, the Agent address for the Beijing Region is: http://arms-apm-cn-beijing.oss-cn-beijing.aliyuncs.com/2.7.1.3-mse-beta/
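The substitution can be sketched as a small helper. `agent_base_url` is a hypothetical function written for illustration. The URL template and version string are the ones given above, and accepting either `beijing` or `cn-beijing` is a convenience assumption, since the template already contains the `cn-` prefix.

```python
AGENT_VERSION = "2.7.1.3-mse-beta"  # version segment from the address above


def agent_base_url(region_id: str, version: str = AGENT_VERSION) -> str:
    """Build the Agent download address for a Region.

    Assumption: a leading "cn-" is stripped, so "beijing" and "cn-beijing"
    both yield the address shown for the Beijing Region.
    """
    region = region_id.strip()
    if region.startswith("cn-"):
        region = region[len("cn-"):]
    if not region:
        raise ValueError("region_id must not be empty")
    return f"http://arms-apm-cn-{region}.oss-cn-{region}.aliyuncs.com/{version}/"
```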
Application Deployment Traffic Architecture Diagram
Figure 6. Demo application deployment architecture
Traffic load source
As shown in Figure 6, the spring-cloud-zuul application calls both the gray version and the normal version of spring-cloud-a simultaneously at a rate of 100 QPS.
Deploy the demo application
Save the following content to a file, for example mse-demo.yaml, and run kubectl apply -f mse-demo.yaml to deploy the applications to the Kubernetes cluster you created earlier. The demo contains CronHPA tasks, so first install the ack-kubernetes-cronhpa-controller component in the cluster: in the console, go to Container Service-Kubernetes->Market->Application Directory, search for the component, and install it in the test cluster. The file deploys the Zuul gateway and three applications, A, B, and C. Applications A and B each have a baseline version and a gray version. The baseline version of application B has the lossless offline capability turned off, and the gray version has it enabled. Application C has the service warm-up capability enabled, with a warm-up time of 120 seconds.
# Nacos Server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nacos-server
  name: nacos-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nacos-server
  template:
    metadata:
      labels:
        app: nacos-server
    spec:
      containers:
        - env:
            - name: MODE
              value: standalone
          image: registry.cn-shanghai.aliyuncs.com/yizhan/nacos-server:latest
          imagePullPolicy: Always
          name: nacos-server
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
      dnsPolicy: ClusterFirst
      restartPolicy: Always
# Nacos Server Service configuration
---
apiVersion: v1
kind: Service
metadata:
  name: nacos-server
spec:
  ports:
    - port: 8848
      protocol: TCP
      targetPort: 8848
  selector:
    app: nacos-server
  type: ClusterIP
# Entry zuul application
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spring-cloud-zuul
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spring-cloud-zuul
  template:
    metadata:
      annotations:
        msePilotAutoEnable: "on"
        msePilotCreateAppName: spring-cloud-zuul
      labels:
        app: spring-cloud-zuul
    spec:
      containers:
        - env:
            - name: JAVA_HOME
              value: /usr/lib/jvm/java-1.8-openjdk/jre
            - name: LANG
              value: C.UTF-8
          image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-zuul:1.0.1
          imagePullPolicy: Always
          name: spring-cloud-zuul
          ports:
            - containerPort: 20000
# Application A, base version; full-link pass-through by machine dimension enabled
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-a
  name: spring-cloud-a
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-a
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-a
        msePilotAutoEnable: "on"
      labels:
        app: spring-cloud-a
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: JAVA_HOME
              value: /usr/lib/jvm/java-1.8-openjdk/jre
            - name: profiler.micro.service.tag.trace.enable
              value: "true"
          image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-a:0.1-SNAPSHOT
          imagePullPolicy: Always
          name: spring-cloud-a
          ports:
            - containerPort: 20001
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
          livenessProbe:
            tcpSocket:
              port: 20001
            initialDelaySeconds: 10
            periodSeconds: 30
# Application A, gray version; full-link pass-through by machine dimension enabled
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-a-gray
  name: spring-cloud-a-gray
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-a-gray
  strategy:
  template:
    metadata:
      annotations:
        alicloud.service.tag: gray
        msePilotCreateAppName: spring-cloud-a
        msePilotAutoEnable: "on"
      labels:
        app: spring-cloud-a-gray
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: JAVA_HOME
              value: /usr/lib/jvm/java-1.8-openjdk/jre
            - name: profiler.micro.service.tag.trace.enable
              value: "true"
          image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-a:0.1-SNAPSHOT
          imagePullPolicy: Always
          name: spring-cloud-a-gray
          ports:
            - containerPort: 20001
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
          livenessProbe:
            tcpSocket:
              port: 20001
            initialDelaySeconds: 10
            periodSeconds: 30
# Application B, base version; lossless offline capability disabled
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-b
  name: spring-cloud-b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-b
  strategy:
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-b
        msePilotAutoEnable: "on"
      labels:
        app: spring-cloud-b
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: JAVA_HOME
              value: /usr/lib/jvm/java-1.8-openjdk/jre
            - name: micro.service.shutdown.server.enable
              value: "false"
            - name: profiler.micro.service.http.server.enable
              value: "false"
          image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-b:0.1-SNAPSHOT
          imagePullPolicy: Always
          name: spring-cloud-b
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
          livenessProbe:
            tcpSocket:
              port: 20002
            initialDelaySeconds: 10
            periodSeconds: 30
# Application B, gray version; lossless offline function enabled by default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-b-gray
  name: spring-cloud-b-gray
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-b-gray
  template:
    metadata:
      annotations:
        alicloud.service.tag: gray
        msePilotCreateAppName: spring-cloud-b
        msePilotAutoEnable: "on"
      labels:
        app: spring-cloud-b-gray
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: JAVA_HOME
              value: /usr/lib/jvm/java-1.8-openjdk/jre
          image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-b:0.1-SNAPSHOT
          imagePullPolicy: Always
          name: spring-cloud-b-gray
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - >-
                    wget http://127.0.0.1:54199/offline 2>/tmp/null;sleep
                    30;exit 0
          livenessProbe:
            tcpSocket:
              port: 20002
            initialDelaySeconds: 10
            periodSeconds: 30
# Application C, base version
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-c
  name: spring-cloud-c
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-c
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-c
        msePilotAutoEnable: "on"
      labels:
        app: spring-cloud-c
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: JAVA_HOME
              value: /usr/lib/jvm/java-1.8-openjdk/jre
          image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-c:0.1-SNAPSHOT
          imagePullPolicy: Always
          name: spring-cloud-c
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
          livenessProbe:
            tcpSocket:
              port: 20003
            initialDelaySeconds: 10
            periodSeconds: 30
# HPA configuration
---
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: spring-cloud-b
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta2
    kind: Deployment
    name: spring-cloud-b
  jobs:
    - name: "scale-down"
      schedule: "0 0/5 * * * *"
      targetSize: 1
    - name: "scale-up"
      schedule: "10 0/5 * * * *"
      targetSize: 2
---
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: spring-cloud-b-gray
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta2
    kind: Deployment
    name: spring-cloud-b-gray
  jobs:
    - name: "scale-down"
      schedule: "0 0/5 * * * *"
      targetSize: 1
    - name: "scale-up"
      schedule: "10 0/5 * * * *"
      targetSize: 2
---
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: spring-cloud-c
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta2
    kind: Deployment
    name: spring-cloud-c
  jobs:
    - name: "scale-down"
      schedule: "0 2/5 * * * *"
      targetSize: 1
    - name: "scale-up"
      schedule: "10 2/5 * * * *"
      targetSize: 2
# Zuul gateway: enable SLB to expose the display page
---
apiVersion: v1
kind: Service
metadata:
  name: zuul-slb
spec:
  ports:
    - port: 80
      protocol: TCP
      targetPort: 20000
  selector:
    app: spring-cloud-zuul
  type: ClusterIP
# Expose application A through Kubernetes Services
---
apiVersion: v1
kind: Service
metadata:
  name: spring-cloud-a-base
spec:
  ports:
    - name: http
      port: 20001
      protocol: TCP
      targetPort: 20001
  selector:
    app: spring-cloud-a
---
apiVersion: v1
kind: Service
metadata:
  name: spring-cloud-a-gray
spec:
  ports:
    - name: http
      port: 20001
      protocol: TCP
      targetPort: 20001
  selector:
    app: spring-cloud-a-gray
# Nacos Server SLB Service configuration
---
apiVersion: v1
kind: Service
metadata:
  name: nacos-slb
spec:
  ports:
    - port: 8848
      protocol: TCP
      targetPort: 8848
  selector:
    app: nacos-server
  type: LoadBalancer
Result Verification 1: Lossless offline function
Because scheduled HPA (CronHPA) is enabled for both the spring-cloud-b and spring-cloud-b-gray applications, a scale-in and scale-out is simulated every 5 minutes.
Log in to the MSE console and go to Microservice Governance Center->Application List->spring-cloud-a->Application Details. The application monitoring curve shows the traffic data of the spring-cloud-a application:
For the gray version, no requests failed while pods were scaled in and out, so no traffic was lost. For the untagged version, where the lossless offline function is disabled, 20 requests sent from spring-cloud-a to spring-cloud-b failed while pods were scaled in and out, so request traffic was lost.
Result Verification 2: Service warm-up function
For the spring-cloud-c application, scheduled HPA simulates the application coming online. The application scales every 5 minutes: within each cycle it scales in to 1 node at minute 2, second 0, and scales out to 2 nodes at minute 2, second 10.
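The timing follows from the six-field schedules in the demo manifest, where the first field is seconds and the second is minutes. The helper below is a sketch for reading those schedules: `fire_times_in_hour` is a hypothetical function that handles only the subset of cron syntax used in the demo (a fixed second, a minute field of the form `start/step`, and `*` everywhere else), not a general cron parser.

```python
def fire_times_in_hour(schedule: str) -> list:
    """Return the (minute, second) pairs at which a demo schedule fires in each hour.

    Supports only schedules shaped like "10 2/5 * * * *".
    """
    fields = schedule.split()
    if len(fields) != 6:
        raise ValueError("expected six fields: second minute hour day month weekday")
    second_field, minute_field = fields[0], fields[1]
    if any(f != "*" for f in fields[2:]):
        raise ValueError("this sketch only supports '*' in the last four fields")
    second = int(second_field)
    # "2/5" means: start at minute 2, then every 5 minutes.
    start, _, step = minute_field.partition("/")
    minutes = range(int(start), 60, int(step)) if step else [int(start)]
    return [(minute, second) for minute in minutes]
```

For spring-cloud-c, `"0 2/5 * * * *"` fires at 02:00, 07:00, 12:00 and so on within each hour, and `"10 2/5 * * * *"` fires ten seconds after each of those, so every cycle removes one instance and brings a fresh one online shortly afterward.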
On the consumer side of the warmed-up application, enable the service warm-up function for spring-cloud-b.
On the provider side of the warmed-up application, enable the service warm-up function for spring-cloud-c and set the warm-up time to 120 seconds.
Observe the traffic of the node: it increases gradually. You can also see the warm-up start and end times of the node, as well as related events.
As the figure above shows, after a restart the traffic of an application with warm-up enabled increases gradually over time. In slow-start scenarios where resources such as connection pools and caches must be built during startup, enabling service warm-up effectively protects the application: the cache resources are created in an orderly manner during startup, the application starts safely, and the traffic of the application coming online is lossless.
Solution introduction and hands-on walkthrough
For more details of the solution design, watch the recording of the live session on how microservice applications achieve lossless online and offline [3]:
https://yqh.aliyun.com/live/detail/27936
Related Links
[1] Create a Kubernetes managed cluster
https://help.aliyun.com/document_detail/95108.htm#task-skz-qwk-qfb
[2] Activate MSE Microservice Governance
https://help.aliyun.com/document_detail/347625.htm#task-2140253
[3] Live session on lossless online and offline
https://yqh.aliyun.com/live/detail/27936