Author: ten sleep
Why is lossless online (lossless service launch) needed when lossless offline already exists? What problems can lossless online solve?
This article answers these questions one by one.
It has to be said that the lossless online function was polished by a customer.
We will start with how a release issue was investigated and resolved.
Background
During the release of an application-center service on Alibaba Cloud, a large number of 5xx timeout exceptions occurred. We initially suspected a lossless offline problem, so the application was quickly connected to the lossless offline function provided by MSE. However, a large number of timeout errors still occurred during subsequent releases. According to feedback from the business team, a large number of requests timed out about 5 seconds after the application started.
Does the lossless offline function not take effect?
So we brought the relevant colleagues together to investigate. The call chain of the application is: gateway -> A -> C.
The application being released is C, and a large number of timeout errors occurred during its release. From the relevant logs and the startup sequence of the application, we sorted out the following clues:
[Server perspective]: observe one machine running application C, xxx.xxx.xxx.60
Stage 1: xxx.xxx.xxx.60 (application C) offline stage
- 20:27:51 The restart begins and the restart script is executed
  - At the same time, the sendReadOnlyEvent action is observed: the server sends a read-only event, after which clients no longer send requests to this server.
  - After sendReadOnlyEvent, the services are deregistered one after another
- 20:27:54 Log out all provider service done
- 20:28:15 The application receives the kill -15 signal
Stage 2: xxx.xxx.xxx.60 (application C) online stage
- 20:28:34 The server restarts
- 20:29:19 The Dubbo registry console shows that xxx.xxx.xxx.60 has been registered
- 20:29:19,257 "start NettyServer" appears in the log
[Client perspective]: observe one machine running application A, XXX.XXX.XXX.142
- 20:27:51 received readOnly event: the read-only event sent by the server is received. From this point on, the client no longer sends requests to XXX.XXX.XXX.60
- 20:27:56 close [xxx.xxx.xxx.142:0 -> /XXX.XXX.XXX.60:20880]: the channel connection is closed
Error messages in the business log
We also searched the error logs related to machine XXX.XXX.XXX.60 of application C and found 237 entries in total:
The earliest time: 2020-07-30 20:29:26,524
The latest time: 2020-07-30 20:29:59,788
Conclusion
These observations lead to the following preliminary conclusions:
- The lossless offline process behaves as expected, and no errors were reported during the offline stage
- The errors occur after the server application has started and registered successfully, which is consistent with what the business side observed
At this point we suspected a problem in the online stage and checked the related server logs. During the error period, the server thread pool was full:
The problem therefore lies in the online process and has nothing to do with lossless offline.
Lossless online practice
Our approach to helping users solve problems: help users find the root cause, identify what is common across cases, solve the problem, and turn the ability to solve the common problem into a product.
We found that the user's Dubbo version is relatively old and lacks the ability to dump thread stacks automatically:
- Use MSE to add an automatic JStack dump when the Dubbo thread pool is full
This problem occurs on every release. Observing the JStack log captured when the thread pool is full helps us locate it.
Blocked on resource preparation such as asynchronous connections
A first look at the JStack logs shows that many threads are blocked on the preparation of asynchronous connection resources such as taril/druid:
Some of our cloud customers have also encountered a full Dubbo thread pool for a period of time after the application starts. The investigation showed that the connections in the Redis connection pool had not been established in advance, so after traffic came in, a large number of threads were blocked on establishing Redis connections.
The connection pool maintains its number of connections through an asynchronous thread. By default, the minimum number of connections is established 30 seconds after the application starts.
1. Solutions
- Establish connections in advance
- Use the delayed service release feature
2. Pre-built connections
Taking JedisPool as an example: establish the connections of pools such as Redis in advance, instead of waiting until traffic arrives, which would leave a large number of business threads waiting for connections to be established.
org.apache.commons.pool2.impl.GenericObjectPool#startEvictor
    protected synchronized void startEvictor(long delay) {
        if(null != _evictor) {
            EvictionTimer.cancel(_evictor);
            _evictor = null;
        }
        if(delay > 0) {
            _evictor = new Evictor();
            EvictionTimer.schedule(_evictor, delay, delay);
        }
    }
JedisPool uses a scheduled task to asynchronously ensure that the minimum number of connections exists, which means the Redis connections are not yet established when the application starts.
Active pre-connection: call the GenericObjectPool#preparePool method to prepare the connections manually before they are used.
During the microservice online process, create min-idle Redis connections while Redis is being initialized, so that the connections are established before the service is published.
Similar problems exist elsewhere. Pre-building database connections and other asynchronously created connections ensures that all such resources are ready before business traffic comes in.
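The idea of pre-building connections can be sketched with a tiny stand-in pool. `PrewarmedPool` and its methods are hypothetical names used only for illustration; it is not JedisPool or commons-pool2, where the corresponding call is `GenericObjectPool#preparePool` as mentioned above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

/** Hypothetical minimal pool; stands in for a real pool such as JedisPool. */
final class PrewarmedPool<T> {
    private final Deque<T> idle = new ArrayDeque<>();
    private final Supplier<T> factory; // creates one (expensive) connection
    private final int minIdle;

    PrewarmedPool(Supplier<T> factory, int minIdle) {
        this.factory = factory;
        this.minIdle = minIdle;
    }

    /** Eagerly create connections until minIdle are idle. Call before service registration. */
    synchronized void preparePool() {
        while (idle.size() < minIdle) {
            idle.addLast(factory.get());
        }
    }

    /** Hand out an idle connection if present; otherwise create one on the caller's thread (slow path). */
    synchronized T borrow() {
        T conn = idle.pollFirst();
        return conn != null ? conn : factory.get();
    }

    synchronized void giveBack(T conn) {
        idle.addLast(conn);
    }

    synchronized int idleCount() {
        return idle.size();
    }
}
```

If `preparePool()` runs during initialization, the first business requests take the fast path in `borrow()` instead of paying the connection cost on a business thread.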
3. Delayed release
Some prerequisite resources are loaded asynchronously, for example caches prepared in advance or resources downloaded asynchronously. For these, the timing of service registration must be controlled. That is, control when traffic enters, so that the prerequisite resources are ready before the service is published. There are two ways to delay the release:
- Configure a delay
By specifying a delay such as 300 s, the Dubbo/Spring Cloud service waits 5 minutes after the Spring container has finished initializing and then executes the service registration logic.
- Go online via the online command
Enable the configuration item that leaves services unregistered by default, then have the release script execute curl 127.0.0.1:54199/online to trigger registration actively. This way the service can be registered through the online command once the prerequisite resources are ready.
You can also register the service through the service online operation on the MSE instance details page.
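The "register only when the prerequisite resources are ready" idea can be sketched as follows. `DeferredRegistration`, `online()` and `onlineWhenReady()` are hypothetical names for illustration; the sketch does not reproduce how MSE or Dubbo implement delayed registration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical gate that keeps a service unregistered until it is told to go online. */
final class DeferredRegistration {
    private final AtomicBoolean registered = new AtomicBoolean(false);
    private final Runnable registerAction; // e.g. the framework's registration call

    DeferredRegistration(Runnable registerAction) {
        this.registerAction = registerAction;
    }

    /** Register at most once; plays the role of the "online" command. */
    boolean online() {
        if (registered.compareAndSet(false, true)) {
            registerAction.run();
            return true;
        }
        return false; // already registered, nothing to do
    }

    boolean isRegistered() {
        return registered.get();
    }

    /** Trigger online() once every asynchronous prerequisite has completed. */
    CompletableFuture<Void> onlineWhenReady(List<CompletableFuture<?>> resources) {
        return CompletableFuture
                .allOf(resources.toArray(new CompletableFuture<?>[0]))
                .thenRun(this::online);
    }
}
```

A release script that calls the online command corresponds to invoking `online()` directly; `onlineWhenReady` covers the case where the application itself knows when its caches and connections are ready.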
Blocked on ASMClassLoader class loader
A large number of threads are blocked while fastjson's ASMClassLoader loads classes. The ClassLoader code shows that class loading is synchronous by default. Under high concurrency, many threads block on class loading, which hurts server performance and causes problems such as a full thread pool.
    private ClassLoader(Void unused, ClassLoader parent) {
        this.parent = parent;
        // Enable parallel class loading
        if (ParallelLoaders.isRegistered(this.getClass())) {
            parallelLockMap = new ConcurrentHashMap<>();
            package2certs = new ConcurrentHashMap<>();
            domains =
                Collections.synchronizedSet(new HashSet<ProtectionDomain>());
            assertionLock = new Object();
        } else {
            // no finer-grained lock; lock on the classloader instance
            parallelLockMap = null;
            package2certs = new Hashtable<>();
            domains = new HashSet<>();
            assertionLock = this;
        }
    }
    protected Class<?> loadClass(String name, boolean resolve)
        throws ClassNotFoundException
    {
        synchronized (getClassLoadingLock(name)) {
            ......
            return c;
        }
    }
    protected Object getClassLoadingLock(String className) {
        Object lock = this;
        // If parallel class loading is enabled, lock on the class being loaded rather than on the class loader
        if (parallelLockMap != null) {
            Object newLock = new Object();
            lock = parallelLockMap.putIfAbsent(className, newLock);
            if (lock == null) {
                lock = newLock;
            }
        }
        return lock;
    }
1. Solutions
- Enable parallel class loading in the class loader
2. Enable parallel class loading in the class loader
On JDK 7 and later, calling the following method enables parallel class loading, and the lock granularity drops from the ClassLoader object itself to the name of the class being loaded. In other words, threads do not contend in loadClass as long as they are not loading the same class.
The Javadoc of ClassLoader.registerAsParallelCapable describes the method:
protected static boolean registerAsParallelCapable()
Registers the caller as parallel capable.
The registration succeeds if and only if all of the following conditions are met:
- no instance of the caller has been created
- all of the super classes (except class Object) of the caller are registered as parallel capable
ClassLoader.registerAsParallelCapable requires that, at registration time, no instance of the registering class loader exists and that every class loader on its inheritance chain has also called registerAsParallelCapable. Older versions of the Tomcat/Jetty WebAppClassLoader and fastjson's ASMClassLoader do not enable parallel class loading. If multiple threads in the application call loadClass at the same time, lock contention becomes severe.
MSE Agent enables parallel class loading for these class loaders non-invasively, before the class loader itself is loaded, so users do not need to upgrade Tomcat/Jetty. Parallel class loading can also be switched on dynamically through configuration.
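The difference in lock granularity can be observed with two small class loaders. This is a minimal sketch: `ParallelCapableLoader`, `SerialLoader` and `lockFor` are hypothetical names, while `registerAsParallelCapable` and `getClassLoadingLock` are the standard `java.lang.ClassLoader` methods discussed above.

```java
/** Registers itself as parallel capable, so loadClass locks per class name. */
class ParallelCapableLoader extends ClassLoader {
    // Must run before any instance exists, hence the static initializer.
    static final boolean REGISTERED = ClassLoader.registerAsParallelCapable();

    ParallelCapableLoader() {
        super(ParallelCapableLoader.class.getClassLoader());
    }

    /** Exposes the protected lock lookup for demonstration purposes. */
    Object lockFor(String className) {
        return getClassLoadingLock(className);
    }
}

/** Does not register: every loadClass call synchronizes on the loader instance. */
class SerialLoader extends ClassLoader {
    SerialLoader() {
        super(SerialLoader.class.getClassLoader());
    }

    Object lockFor(String className) {
        return getClassLoadingLock(className);
    }
}
```

For the parallel-capable loader, `lockFor` returns a distinct lock object per class name; for the serial loader it returns the loader itself, which is the single lock that all loading threads queue on.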
Some other problems
- JVM JIT compilation causes CPU spikes
- Synchronous log printing blocks threads
- Older Jetty versions load classes synchronously
- In the K8s scenario, the life cycles of the microservice and the K8s Service are not aligned
1. Solutions
- Service warm-up
  - Client load balancing
  - Server-side layered service release
- Asynchronous business log
- Provide a microservice readiness interface
2. Asynchronous business log
Logs are printed synchronously. Log printing runs on business threads and involves logic such as serialization and class loading, so under high concurrency the business threads hang and the service framework's thread pool fills up. MSE Agent supports switching to asynchronous log printing dynamically, which moves the log printing work off the service threads and improves their throughput.
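The separation of log printing from business threads can be sketched with a bounded queue and a single writer thread. `AsyncLogger` is a hypothetical class for illustration, not the MSE Agent implementation; dropping messages when the queue is full is one possible policy chosen here so that business threads never block.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical asynchronous logger: callers enqueue, one background thread writes. */
final class AsyncLogger implements AutoCloseable {
    // Unique sentinel object (compared by identity) that tells the writer to stop.
    private static final String POISON = new String("__stop__");

    private final BlockingQueue<String> queue;
    private final Thread writer;

    AsyncLogger(int capacity, Consumer<String> sink) {
        this.queue = new LinkedBlockingQueue<>(capacity);
        this.writer = new Thread(() -> {
            try {
                for (String msg = queue.take(); msg != POISON; msg = queue.take()) {
                    sink.accept(msg); // the slow part (formatting, I/O) runs here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "async-logger");
        writer.setDaemon(true);
        writer.start();
    }

    /** Non-blocking: returns false (message dropped) when the queue is full. */
    boolean log(String msg) {
        return queue.offer(msg);
    }

    /** Let the writer drain queued messages, then stop it. */
    @Override
    public void close() {
        try {
            queue.put(POISON);
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```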
3. Small-traffic warm-up
When a large number of requests arrive right after the application starts, many problems appear, so some microservice capabilities are needed to warm the service up:
- The JVM JIT compilation threads use too much CPU, the CPU/load spikes for a short time, and Dubbo's request-processing performance drops
- The instantaneous request volume is too large, so threads block on class loading, caching and so on, and the Dubbo service thread pool fills up
For small-traffic warm-up, MSE service governance provides the following capabilities through OneAgent without intrusion:
- Client load balancing
By enhancing the client's load balancing, the traffic weight of newly launched nodes that need warming up is adjusted, so that they receive only a small amount of traffic according to the rules the user configures. The user only needs to specify the warm-up rules, and the newly launched nodes are warmed up with small traffic as expected.
The effect on a server-side instance after enabling service warm-up: the application is warmed up and initialized with a small amount of traffic during the warm-up period after it starts. The following figure shows the effect with a warm-up time of 120 seconds and a warm-up curve of 2:
Note that the test demo simulates application startup through scheduled scaling, so in addition to the warm-up process the curve also includes the offline process. The following figure shows the effect with a warm-up time of 120 seconds and a warm-up curve of 5:
As shown in the figure above, compared with a warm-up curve of 2, a curve of 5 keeps the QPS low during the initial period (that is, 17:41:01~17:42:01), which meets the needs of complex applications that require a longer warm-up time.
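A client-side warm-up weight could be computed along the following lines. This is only a sketch under an assumption: that the curve values 2 and 5 mentioned above act as the exponent of the ramp. The formula and the name `WarmupWeight.effectiveWeight` are illustrative and are not taken from MSE.

```java
/** Hypothetical warm-up weight calculation for client-side load balancing. */
final class WarmupWeight {
    private WarmupWeight() {}

    /**
     * @param uptimeMs   time since the provider started
     * @param warmupMs   configured warm-up duration, e.g. 120_000 for 120 seconds
     * @param fullWeight weight of the provider once warm-up is over
     * @param curve      assumed exponent of the ramp; larger values keep traffic low for longer
     */
    static int effectiveWeight(long uptimeMs, long warmupMs, int fullWeight, int curve) {
        if (warmupMs <= 0 || uptimeMs >= warmupMs) {
            return fullWeight; // warm-up disabled or finished
        }
        double progress = Math.max(0, uptimeMs) / (double) warmupMs;
        int weight = (int) (fullWeight * Math.pow(progress, curve));
        return Math.max(1, weight); // keep at least a trickle of traffic
    }
}
```

Under this assumption, halfway through a 120-second warm-up a node with full weight 100 would get weight 25 with curve 2 but only 3 with curve 5, which matches the qualitative behavior described above: a higher curve holds the QPS low for longer.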
- Server-side layered publishing
The service registration logic is modified to monitor indicators such as application load, register services in batches, and roll registration back when necessary. This ensures that while traffic enters service by service during registration, the system load always stays below the threshold, and that all services are registered within the specified time.
Disadvantages: when the application's traffic is spread evenly across services and there is no super-hot interface, layered publishing solves the warm-up problem well. However, if the application has a few super-hot services that account for, say, more than 90% of all traffic, server-side layered publishing has little effect.
Note: for service interfaces that depend on each other, layered publishing may require the business side to sort out the order in which services are released in batches
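The batch-by-batch registration under a load threshold can be sketched as below. `LayeredPublisher.publish` is a hypothetical helper; the load source, threshold and batch size are assumptions, and the rollback and deadline handling mentioned above are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.DoubleSupplier;

/** Hypothetical layered publisher: registers services in batches while load stays low. */
final class LayeredPublisher {
    private LayeredPublisher() {}

    /**
     * Registers services batch by batch, checking the load before each batch.
     * Stops as soon as the load reaches the threshold.
     *
     * @return the services that are still unregistered (to be retried later)
     */
    static List<String> publish(List<String> services, int batchSize,
                                DoubleSupplier load, double threshold,
                                Consumer<String> register) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int next = 0;
        while (next < services.size() && load.getAsDouble() < threshold) {
            int end = Math.min(next + batchSize, services.size());
            for (String service : services.subList(next, end)) {
                register.accept(service);
            }
            next = end;
        }
        return new ArrayList<>(services.subList(next, services.size()));
    }
}
```

A caller would typically invoke `publish` again on the returned list after a pause, until the list is empty.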
4. Align the life cycles of K8s and microservices
K8s provides two health check mechanisms:
- livenessProbe detects unhealthy Pods. If the probe fails, the Pod is restarted.
- readinessProbe detects whether a Pod is ready to receive traffic. If the probe fails, the Pod is removed from the Service's routing.
If readinessProbe is not configured, K8s by default only checks whether the process in the container is up and running, and can hardly take the actual state of the process into account. MSE Agent exposes a readiness interface. The interface returns 200 only when Spring Bean initialization is complete, asynchronous resources are ready, and service registration has begun. This connects service exposure on the microservice side with the K8s Service system, so that the K8s control plane can perceive when the service inside the process is ready and bring the service online correctly.
We need to enable lossless rolling release on the MSE lossless online page:
At the same time, configure the K8s readiness check interface for the application. If your application runs on Alibaba Cloud Container Service ACK, you can select Enable on the right side of Ready Check in the health check area of the corresponding application configuration, and configure the following parameters: Then click Update.
The configuration takes effect the next time the application is restarted.
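The behavior of such a readiness interface can be sketched with the JDK's built-in HTTP server. The path `/readiness`, the class `ReadinessEndpoint` and the 503 status for "not ready" are assumptions made for illustration; the actual path and port of the MSE Agent interface are not given here.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical readiness endpoint: 503 until the service is marked ready, then 200. */
final class ReadinessEndpoint {
    private final AtomicBoolean ready = new AtomicBoolean(false);
    private final HttpServer server;

    ReadinessEndpoint(int port) {
        try {
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        } catch (IOException e) {
            throw new IllegalStateException("cannot bind readiness port", e);
        }
        server.createContext("/readiness", exchange -> {
            int status = ready.get() ? 200 : 503;
            exchange.sendResponseHeaders(status, -1); // -1 means no response body
            exchange.close();
        });
        server.start();
    }

    /** Call after beans are initialized, async resources are ready and registration has begun. */
    void markReady() {
        ready.set(true);
    }

    int port() {
        return server.getAddress().getPort();
    }

    void stop() {
        server.stop(0);
    }

    /** Performs one probe the way a readinessProbe would; returns the HTTP status or -1 on I/O error. */
    static int probe(int port) {
        try {
            HttpURLConnection conn = (HttpURLConnection)
                    URI.create("http://127.0.0.1:" + port + "/readiness").toURL().openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            return -1;
        }
    }
}
```

With an httpGet readinessProbe pointed at such an endpoint, K8s would keep the Pod out of the Service's routing until `markReady()` has been called.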
5. Parallel subscription and registration of services
Through parallel service registration and subscription, the speed of application startup can be greatly improved, and the problem of slow service startup can be solved.
Take parallel service subscription as an example:
As shown in the figure above, the refer process of the service framework is separated from the initialization process of SpringBean through Java Agent, and parallel subscription and registration of services are realized through asynchronous threads.
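The parallel subscription idea can be sketched with a thread pool that runs the refer step of each service concurrently. `ParallelRefer.referAll` and the `refer` function are hypothetical stand-ins for the framework's subscription logic, not the Java Agent's actual mechanism.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical helper that subscribes to several services in parallel. */
final class ParallelRefer {
    private ParallelRefer() {}

    /**
     * Runs refer for every interface on a fixed-size pool and waits for all results.
     * The returned map preserves the order of the input list.
     */
    static <T> Map<String, T> referAll(List<String> interfaces,
                                       Function<String, T> refer, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, CompletableFuture<T>> pending = new LinkedHashMap<>();
            for (String name : interfaces) {
                // Each subscription (registry lookup, connection setup) runs off the caller's thread.
                pending.put(name, CompletableFuture.supplyAsync(() -> refer.apply(name), pool));
            }
            Map<String, T> proxies = new LinkedHashMap<>();
            pending.forEach((name, future) -> proxies.put(name, future.join()));
            return proxies;
        } finally {
            pool.shutdown();
        }
    }
}
```

If each refer call takes roughly the same time, the total subscription time approaches that of a single call multiplied by the number of interfaces divided by the number of threads, instead of the sum over all interfaces.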
Summary
We kept observing the business situation and kept analyzing, thinking and fixing. Once the small-traffic service warm-up capability was enabled, the request loss caused by the full thread pool of the business team's application during the online stage was completely resolved.
- The total number of Exceptions during each release, by release date (the lossless online functions were rolled out to the nodes one after another during this period), is as follows
After the small-traffic service warm-up capability went live on September 15, the number of related Exceptions during a release dropped to 2. (The business team confirmed that these were not caused by the release and can be ignored.)
After the lossless online function went live, the release problem that had troubled the business team's application center for several months finally came to an end. But the lossless online function does much more than that. It has also solved online traffic loss for many cloud customers, and its capabilities and scenarios have been gradually improved and enriched through continuous problem solving.
Lossless launch of MSE
One feature of MSE service governance is that, through the Agent, it supports all Dubbo and Spring Cloud versions of the past five years without intrusion, and the same holds for the lossless online function. The following uses Dubbo as the example, but lossless online supports both Dubbo and Spring Cloud seamlessly.
Let us now introduce the lossless online function of MSE service governance systematically, starting with the online process of an open source Dubbo application:
- Application initialization, Spring Bean container initialization
- After receiving the ContextRefreshedEvent, Dubbo will pull the configuration, metadata, etc. required by the Dubbo application
- exportServices register service
The open source Dubbo online process is complete and rigorous, but some scenarios can still cause problems when a service goes online:
- Once the service information is registered in the registry, the service is callable from the consumer's point of view. However, some asynchronous resources such as databases and caches may not have finished loading yet. Whether this happens depends on whether your system has such components, and when they finish loading depends entirely on your business.
- If a large amount of traffic enters immediately after the service is registered in a high-traffic scenario, a series of problems can follow, leading to thread blocking and loss of business traffic.
  - For example, Redis's JedisPool does not establish connections immediately after it is created. It starts establishing them only after traffic comes in. If a large amount of traffic flows in at the beginning, a large number of threads block on the expensive connection setup in the connection pool.
  - In older versions of FastJson and Jetty/Tomcat, parallel class loading is not enabled for the class loader, so a large number of threads block on class loading
  - JVM JIT compilation causes CPU spikes
  - Threads are blocked on business logging
- In the cloud native scenario, the life cycles of microservices and K8s are not aligned
  - During a rolling release, the restarted Pod has not yet registered with the registry but the readiness check has already passed. As a result, the first Pod is not yet registered when the last Pod goes offline, which causes short-lived NoProvider exceptions on the client
In response to the above problems, MSE service governance provides not only a complete solution but also out-of-the-box console operations, and dynamic configuration takes effect in real time.
MSE service governance also provides complete observability for lossless online and offline scenarios.
The lossless online function can be summarized in the following picture:
Not just lossless online and offline
Lossless online and offline capabilities are an important part of microservice traffic governance. Beyond them, MSE also provides a series of microservice governance capabilities such as full-link grayscale, flow control, degradation and fault tolerance, and database governance. Service governance is an unavoidable step once a microservice transformation reaches a certain stage, and new questions keep arising along the way.
- In addition to lossless online and offline, is there any other capability for service governance?
- Is there a standard definition of service governance capability, and what does service governance capability include?
- In multilingual scenarios, are there any best practices or standards for full links?
- How can heterogeneous microservices be managed uniformly?
While exploring service governance and integrating with other microservices, we found that differing governance systems cause a great deal of trouble, and that connecting two or more governance systems is very costly. For this reason we proposed the OpenSergo project. OpenSergo aims to solve the fragmentation of microservice governance concepts across frameworks and languages and their inability to interoperate.
The OpenSergo community is also working with other communities on further cooperation, and the community will discuss and define a unified service governance standard. The community is currently building the standard together with bilibili, ByteDance and other companies. Interested developers, communities and companies are welcome to join in building the OpenSergo service governance standard. You are welcome to join the OpenSergo community exchange group (DingTalk group) for discussion: 34826335