1. Background
One day we received feedback from an upstream caller that one of the Dubbo interfaces we provide was being circuit-broken for a short period at a fixed time every day; the exception thrown showed that the provider's Dubbo thread pool was exhausted. The interface currently handles 1.8 billion requests per day, with roughly 940,000 failing requests per day. And so the optimization journey began.
2. Quick response
2.1 Quickly locating the problem
First, we checked the routine system metrics (machine, JVM memory, GC, threads). Although there were slight spikes, they stayed within a reasonable range and did not coincide with the time of the errors, so we set them aside for the moment.
Second, we analyzed the traffic and found a surge at a fixed time every day, which coincided with the time of the errors. The preliminary judgment was that the problem was caused by a short burst of heavy traffic.
Traffic trend
Degradation count
Interface P99 latency
3. Finding the performance bottleneck
3.1 Interface process analysis
3.1.1 Flow chart
3.1.2 Process analysis
- After receiving a request, the service calls a downstream interface, wrapped in a Hystrix circuit breaker with a 500 ms timeout.
- Based on the data returned by the downstream interface, the detailed data is assembled: it is read from the local cache first; on a local-cache miss it falls back to Redis; if Redis also misses, the request returns directly and an asynchronous thread reloads the data from the database (see the sketch after this list).
- If the downstream call in the first step fails, fallback data is used instead. The fallback path is the same: local cache first, then Redis, and if Redis misses, return directly while an asynchronous thread reloads the data from the database.
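For illustration, here is a minimal sketch of this local cache → Redis → asynchronous database reload read path. The class and method names (DetailService, loadFromDb) are hypothetical, and a plain ConcurrentHashMap stands in for the real local cache component; the production service uses the company's own caching and cluster clients.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import redis.clients.jedis.JedisCluster;

public class DetailService {

    // A plain map stands in for the real local cache component.
    private final Map<String, String> localCache = new ConcurrentHashMap<>();
    private final JedisCluster jedisCluster;
    private final ExecutorService asyncLoader = Executors.newFixedThreadPool(4);

    public DetailService(JedisCluster jedisCluster) {
        this.jedisCluster = jedisCluster;
    }

    public String getDetail(String key) {
        // 1. Local cache first.
        String value = localCache.get(key);
        if (value != null) {
            return value;
        }
        // 2. On a local-cache miss, fall back to Redis.
        value = jedisCluster.get(key);
        if (value != null) {
            localCache.put(key, value);
            return value;
        }
        // 3. On a Redis miss, return immediately and reload from the DB asynchronously.
        asyncLoader.submit(() -> {
            String fromDb = loadFromDb(key);           // hypothetical DB access
            if (fromDb != null) {
                jedisCluster.setex(key, 3600, fromDb); // write back to Redis
                localCache.put(key, fromDb);
            }
        });
        return null;
    }

    private String loadFromDb(String key) {
        return null; // placeholder for the real database query
    }
}

The point of this design is that a cache miss never blocks the request on the database; the data is repaired asynchronously and served on a later request.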
3.2 Troubleshooting the performance bottleneck
3.2.1 Slow responses from the downstream interface
The call chain showed that although the downstream interface's P99 latency spiked above 1 s at peak traffic, the circuit-breaker settings (a 500 ms timeout, coreSize & maxSize = 50, and an average downstream latency under 10 ms) meant the downstream interface was unlikely to be the key issue. To rule out interference further and fail fast when the downstream service spikes, we reduced the circuit-breaker timeout to 100 ms and the Dubbo timeout to 100 ms.
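A hedged sketch of what such a fail-fast command can look like with Hystrix command properties; the group key, the Supplier wrapper around the Dubbo reference, and the concrete values are illustrative only, not the service's actual wiring.

import java.util.function.Supplier;

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolProperties;

public class DownstreamQueryCommand extends HystrixCommand<String> {

    // Wraps the actual Dubbo reference call; its consumer-side timeout is set to 100 ms as well.
    private final Supplier<String> dubboCall;

    public DownstreamQueryCommand(Supplier<String> dubboCall) {
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("downstreamQuery"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds(100))      // reduced from 500 ms
                .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                        .withCoreSize(50)));
        this.dubboCall = dubboCall;
    }

    @Override
    protected String run() {
        return dubboCall.get();
    }

    @Override
    protected String getFallback() {
        // Timed out or circuit open: the caller switches to the cached fallback path described above.
        return null;
    }
}

The Dubbo consumer-side timeout is configured separately (for example via the reference's timeout attribute), so that both layers give up within roughly 100 ms.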
3.2.2 Fetching details: no data in the local cache, falling back to Redis
With the help of the call-chain platform, the first step was to analyze the Redis request traffic in order to estimate the local cache hit rate. We found that the Redis traffic was twice the interface traffic, which should not happen by design. A code review revealed a logic bug: instead of reading from the local cache first, the code fetched the data directly from Redis. The Redis maximum response time did indeed show unreasonable spikes, and further analysis showed that the Redis response-time spikes basically matched the Dubbo P99 spikes. At that point it felt as if the root cause had been found, and we were quietly pleased.
Redis request traffic
Service interface request traffic
Dubbo P99 latency
Redis maximum response time
3.2.3 Fetching fallback data: no data in the local cache, falling back to Redis
This path behaved normally.
3.2.4 Writing the request result back to Redis
Because the Redis resources had already been isolated and no slow queries showed up in the DB management console, there were many possible reasons for the Redis slowdown at this point. We subjectively set the others aside and focused on the doubled Redis request traffic, so the problem found in 3.2.2 was addressed first.
4. The solution
4.1 Shipping the fix for the issue found in 3.2.2
Redis request volume (before going live)
Redis request volume (after going live)
After the fix went live, the doubled-traffic problem was solved and the Redis maximum response time improved, but it was still not fully resolved, indicating that the heavy query traffic was not the root cause.
Redis maximum response time (before going live)
Redis maximum response time (after going live)
4.2 Scaling out Redis
With the abnormal Redis traffic fixed but the problem still not fully resolved, what remained was to calm down and carefully work through the possible causes of a slow Redis, mainly from the following three angles:
- Slow queries
- A performance bottleneck in the Redis service itself
- Unreasonable client configuration
Based on these ideas, we investigated them one by one. Checking the Redis slow-query log turned up no slow queries.
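In our case the slow log was checked through the DB management console; as a side note, it can also be read from the client (per master node in cluster mode). A minimal Jedis sketch with a hypothetical host:

import java.util.List;

import redis.clients.jedis.Jedis;
import redis.clients.util.Slowlog; // in Jedis 3.x this class lives in redis.clients.jedis.util

public class SlowlogChecker {

    public static void main(String[] args) {
        // "redis-host" is a placeholder; in cluster mode this would be run against each master.
        try (Jedis jedis = new Jedis("redis-host", 6379)) {
            List<Slowlog> entries = jedis.slowlogGet(10);   // most recent 10 slow-log entries
            for (Slowlog entry : entries) {
                System.out.println(entry.getExecutionTime() + " us: " + entry.getArgs());
            }
        }
    }
}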
We then used the call-chain platform to analyze the slow Redis commands in detail. Without the interference of the traffic-induced slow queries, locating the problem went quickly: a large number of the time-consuming requests were setex calls, and the occasional slow read requests also clustered around setex. Given Redis's single-threaded nature, we concluded that setex was the culprit behind the Redis P99 spikes. After finding the specific statement and the business behind it, the first step was to apply for a Redis expansion, from 6 masters to 8 masters.
Redis before expansion
Redis after expansion
Judging from the results, the expansion had essentially no effect, which showed that the Redis service itself was not the performance bottleneck. That left the client-side configuration.
4.3 Client parameter optimization
4.3.1 Connection pool optimization
Since the Redis expansion had no effect, attention turned to the client side, where there were two points of suspicion: first, that the client had a bug in how it manages connections in Redis cluster mode; second, that the connection pool parameters were set unreasonably. Source-code analysis and connection pool parameter tuning proceeded in parallel.
4.3.1.1 Checking the client connection management for bugs
Analysis of the client's connection pool handling in the source code showed no problem: as expected, connection pools are cached per slot, so the first hypothesis was ruled out. The relevant source code is as follows.
1. setex
public String setex(final byte[] key, final int seconds, final byte[] value) {
return new JedisClusterCommand<String>(connectionHandler, maxAttempts) {
@Override
public String execute(Jedis connection) {
return connection.setex(key, seconds, value);
}
}.runBinary(key);
}
2. runBinary
public T runBinary(byte[] key) {
if (key == null) {
throw new JedisClusterException("No way to dispatch this command to Redis Cluster.");
}
return runWithRetries(key, this.maxAttempts, false, false);
}
3. runWithRetries
private T runWithRetries(byte[] key, int attempts, boolean tryRandomNode, boolean asking) {
if (attempts <= 0) {
throw new JedisClusterMaxRedirectionsException("Too many Cluster redirections?");
}
Jedis connection = null;
try {
if (asking) {
// TODO: Pipeline asking with the original command to make it
// faster....
connection = askConnection.get();
connection.asking();
// if asking success, reset asking flag
asking = false;
} else {
if (tryRandomNode) {
connection = connectionHandler.getConnection();
} else {
connection = connectionHandler.getConnectionFromSlot(JedisClusterCRC16.getSlot(key));
}
}
      return execute(connection);
    }
    // Note: the catch blocks that handle redirections and connection errors, and the
    // finally block that releases the connection, are omitted from this excerpt.
  }
4. getConnectionFromSlot
public Jedis getConnectionFromSlot(int slot) {
JedisPool connectionPool = cache.getSlotPool(slot);
if (connectionPool != null) {
// It can't guaranteed to get valid connection because of node
// assignment
return connectionPool.getResource();
} else {
renewSlotCache(); //It's abnormal situation for cluster mode, that we have just nothing for slot, try to rediscover state
connectionPool = cache.getSlotPool(slot);
if (connectionPool != null) {
return connectionPool.getResource();
} else {
//no choice, fallback to new connection to random node
return getConnection();
}
}
}
4.3.1.2 Analyzing the connection pool parameters
After discussing with the middleware team and consulting the official commons-pool2 documentation, the connection pool parameters were adjusted as follows.
After the parameter adjustment, the number of requests taking more than 1 s dropped, but some remained. The degradation count reported by the upstream caller fell from about 900,000 per day to about 60,000 per day (why requests over 200 ms still occur even though maxWaitMillis is set to 200 ms is explained below).
Redis maximum response time after parameter optimization
Interface error count after parameter optimization
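For reference, this is roughly how such parameters are set on a Jedis pool configuration (which extends commons-pool2's GenericObjectPoolConfig). The values below are a hedged sketch, not the actual production values, which are managed through the vivo.cache.depend.common.poolConfig.* keys in the configuration center.

import redis.clients.jedis.JedisPoolConfig;

public class RedisPoolConfigFactory {

    public static JedisPoolConfig tunedConfig() {
        JedisPoolConfig config = new JedisPoolConfig();  // extends commons-pool2 GenericObjectPoolConfig
        config.setMaxTotal(50);             // cap on connections in each node's pool
        config.setMaxIdle(50);              // keep idle connections around instead of closing them
        config.setMinIdle(20);              // number of connections we would like to keep warm
        config.setMaxWaitMillis(200);       // upper bound on blocking while borrowing a connection
        config.setBlockWhenExhausted(true); // wait up to maxWaitMillis rather than fail immediately
        config.setTestOnBorrow(false);      // avoid an extra round trip on every borrow
        return config;
    }
}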
4.3.2 Further optimization
The optimization could not stop there: the goal became bringing every Redis write request under 200 ms. The approach at this point was to keep adjusting the client configuration parameters and to analyze the Jedis source code for obtaining a connection.
Source code for borrowing a connection (commons-pool2 GenericObjectPool#borrowObject, used by Jedis)
public T borrowObject(final long borrowMaxWaitMillis) throws Exception {
    final AbandonedConfig ac = this.abandonedConfig;
if (ac != null && ac.getRemoveAbandonedOnBorrow() &&
(getNumIdle() < 2) &&
(getNumActive() > getMaxTotal() - 3) ) {
removeAbandoned(ac);
}
PooledObject<T> p = null;
// Get local copy of current config so it is consistent for entire
// method execution
final boolean blockWhenExhausted = getBlockWhenExhausted();
boolean create;
final long waitTime = System.currentTimeMillis();
while (p == null) {
create = false;
p = idleObjects.pollFirst();
if (p == null) {
p = create();
if (p != null) {
create = true;
}
}
if (blockWhenExhausted) {
if (p == null) {
if (borrowMaxWaitMillis < 0) {
p = idleObjects.takeFirst();
} else {
p = idleObjects.pollFirst(borrowMaxWaitMillis,
TimeUnit.MILLISECONDS);
}
}
if (p == null) {
throw new NoSuchElementException(
"Timeout waiting for idle object");
}
} else {
if (p == null) {
throw new NoSuchElementException("Pool exhausted");
}
}
if (!p.allocate()) {
p = null;
}
if (p != null) {
try {
factory.activateObject(p);
} catch (final Exception e) {
try {
destroy(p);
} catch (final Exception e1) {
// Ignore - activation failure is more important
}
p = null;
if (create) {
final NoSuchElementException nsee = new NoSuchElementException(
"Unable to activate object");
nsee.initCause(e);
throw nsee;
}
}
if (p != null && (getTestOnBorrow() || create && getTestOnCreate())) {
boolean validate = false;
Throwable validationThrowable = null;
try {
validate = factory.validateObject(p);
} catch (final Throwable t) {
PoolUtils.checkRethrow(t);
validationThrowable = t;
}
if (!validate) {
try {
destroy(p);
destroyedByBorrowValidationCount.incrementAndGet();
} catch (final Exception e) {
// Ignore - validation failure is more important
}
p = null;
if (create) {
final NoSuchElementException nsee = new NoSuchElementException(
"Unable to validate object");
nsee.initCause(validationThrowable);
throw nsee;
}
}
}
}
}
updateStatsBorrow(p, System.currentTimeMillis() - waitTime);
    return p.getObject();
}
The general flow for borrowing a connection is as follows:
- If there is an idle connection, return it directly; otherwise try to create one.
- If creating a connection would exceed the maximum pool size, check whether another thread is already creating one; if not, return directly, and if so, wait up to maxWaitMillis (the other thread's creation may fail). If the maximum is not exceeded, create the connection (at this point the total time to obtain a connection can exceed maxWaitMillis).
- If no connection was obtained, check blockWhenExhausted: if it is false, throw a pool-exhausted exception immediately; if it is true and maxWaitMillis is less than 0, block indefinitely, otherwise block for up to maxWaitMillis.
- Afterwards, depending on the parameters (testOnBorrow/testOnCreate), the connection may also be validated before being returned.
Based on this flow, with maxWaitMillis currently set to 200 ms, the maximum blocking time in the flow above should be 400 ms, and in most cases 200 ms, so there should not be spikes beyond 400 ms.
That pointed to connection creation itself, since creating a connection is time-consuming and its duration is unpredictable. The focus became verifying whether this scenario actually occurred, by monitoring the Redis connection count through the DB management console.
Redis connection count as monitored in the DB management console
Analyzing the chart above, at several points in time (9:00, 12:00, 19:00, ...) the number of Redis connections rises, which basically coincides with the Redis spike times. It seemed (after all the earlier attempts we no longer dared to be too sure) that the problem was finally pinned down: when the traffic surge arrives, the available connections in the pool cannot meet demand, new connections have to be created, and requests end up waiting.
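One client-side way to confirm this is to time the connection borrow around the Redis call. The sketch below uses a plain JedisPool for simplicity (in cluster mode the same borrowing happens inside JedisClusterCommand via the per-slot pools shown earlier); the 200 ms threshold and the logging are illustrative.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class TimedRedisGet {

    private final JedisPool jedisPool;

    public TimedRedisGet(JedisPool jedisPool) {
        this.jedisPool = jedisPool;
    }

    public String get(String key) {
        long start = System.currentTimeMillis();
        // getResource() borrows from the commons-pool2 pool; when no idle connection is
        // available it may have to create one, and creation time is not bounded by
        // maxWaitMillis, which is how waits beyond 200 ms can show up.
        try (Jedis jedis = jedisPool.getResource()) {
            long borrowCost = System.currentTimeMillis() - start;
            if (borrowCost > 200) {
                System.out.println("slow connection borrow: " + borrowCost + " ms");
            }
            return jedis.get(key);
        }
    }
}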
The idea at this point was to build up the connection pool at service startup so as to minimize the creation of new connections. We modified the connection pool parameter vivo.cache.depend.common.poolConfig.minIdle, but it had no effect at all. Why?
Without further ado, back to the source code. Under the hood, Jedis uses commons-pool2 to manage connections, so we checked the relevant part of the commons-pool2-2.6.2.jar source used in the project.
commons-pool2 2.6.2 source code (GenericObjectPool constructor)
public GenericObjectPool(final PooledObjectFactory<T> factory,
final GenericObjectPoolConfig<T> config) {
super(config, ONAME_BASE, config.getJmxNamePrefix());
if (factory == null) {
jmxUnregister(); // tidy up
throw new IllegalArgumentException("factory may not be null");
}
this.factory = factory;
idleObjects = new LinkedBlockingDeque<>(config.getFairness());
setConfig(config);
}
We found that no initial connections were created, so we consulted the middleware team. The source code they provided (commons-pool2-2.4.2.jar) is shown below; at the end of the constructor there is one extra call, to the startEvictor method.
1. Initializing the connection pool
public GenericObjectPool(PooledObjectFactory<T> factory,
GenericObjectPoolConfig config) {
super(config, ONAME_BASE, config.getJmxNamePrefix());
if (factory == null) {
jmxUnregister(); // tidy up
throw new IllegalArgumentException("factory may not be null");
}
this.factory = factory;
idleObjects = new LinkedBlockingDeque<PooledObject<T>>(config.getFairness());
setConfig(config);
startEvictor(getTimeBetweenEvictionRunsMillis());
}
Why the difference? We checked the jar packages: the versions differ. The version the middleware team referenced was 2.4.2, while the project actually used 2.6.2. Analysis of startEvictor showed that one step of its logic handles connection pool warm-up.
Jedis connection pool warm-up
1. startEvictor
final void startEvictor(long delay) {
synchronized (evictionLock) {
if (null != evictor) {
EvictionTimer.cancel(evictor);
evictor = null;
evictionIterator = null;
}
if (delay > 0) {
evictor = new Evictor();
EvictionTimer.schedule(evictor, delay, delay);
}
}
}
2. Evictor
class Evictor extends TimerTask {
/**
* Run pool maintenance. Evict objects qualifying for eviction and then
* ensure that the minimum number of idle instances are available.
* Since the Timer that invokes Evictors is shared for all Pools but
* pools may exist in different class loaders, the Evictor ensures that
* any actions taken are under the class loader of the factory
* associated with the pool.
*/
@Override
public void run() {
ClassLoader savedClassLoader =
Thread.currentThread().getContextClassLoader();
try {
if (factoryClassLoader != null) {
// Set the class loader for the factory
ClassLoader cl = factoryClassLoader.get();
if (cl == null) {
// The pool has been dereferenced and the class loader
// GC'd. Cancel this timer so the pool can be GC'd as
// well.
cancel();
return;
}
Thread.currentThread().setContextClassLoader(cl);
}
// Evict from the pool
try {
evict();
} catch(Exception e) {
swallowException(e);
} catch(OutOfMemoryError oome) {
// Log problem but give evictor thread a chance to continue
// in case error is recoverable
oome.printStackTrace(System.err);
}
// Re-create idle instances.
try {
ensureMinIdle();
} catch (Exception e) {
swallowException(e);
}
} finally {
// Restore the previous CCL
Thread.currentThread().setContextClassLoader(savedClassLoader);
}
}
}
3. ensureMinIdle
void ensureMinIdle() throws Exception {
ensureIdle(getMinIdle(), true);
}
4. ensureIdle
private void ensureIdle(int idleCount, boolean always) throws Exception {
if (idleCount < 1 || isClosed() || (!always && !idleObjects.hasTakeWaiters())) {
return;
}
while (idleObjects.size() < idleCount) {
PooledObject<T> p = create();
if (p == null) {
// Can't create objects, no reason to think another call to
// create will work. Give up.
break;
}
if (getLifo()) {
idleObjects.addFirst(p);
} else {
idleObjects.addLast(p);
}
}
if (isClosed()) {
// Pool closed while object was being added to idle objects.
// Make sure the returned object is destroyed rather than left
// in the idle object pool (which would effectively be a leak)
clear();
}
}
We adjusted the jar version and added two settings in the configuration center: vivo.cache.depend.common.poolConfig.timeBetweenEvictionRunsMillis (the interval at which idle connections in the pool are checked; connections idle for longer than minEvictableIdleTimeMillis are closed, and the pool is then topped back up to minIdle connections) and vivo.cache.depend.common.poolConfig.minEvictableIdleTimeMillis (how long a pooled connection may stay idle, in milliseconds). After restarting the service, the connection pool warmed up correctly, and the problem was finally solved at the Redis level.
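As a sketch, this is roughly how those warm-up-related settings map onto a JedisPoolConfig; the concrete values are illustrative only, since in the project they come from the configuration-center keys above.

import redis.clients.jedis.JedisPoolConfig;

public class WarmedPoolConfigFactory {

    public static JedisPoolConfig warmedConfig() {
        JedisPoolConfig config = new JedisPoolConfig();
        config.setMinIdle(20);                            // connections the evictor keeps ready
        config.setTimeBetweenEvictionRunsMillis(30_000);  // > 0 starts the Evictor task
        config.setMinEvictableIdleTimeMillis(60_000);     // idle time before a connection may be evicted
        return config;
    }
}

With timeBetweenEvictionRunsMillis greater than 0, the Evictor task runs periodically; each run evicts connections idle longer than minEvictableIdleTimeMillis and then calls ensureMinIdle(), which pre-creates connections until at least minIdle are available, so a traffic surge no longer has to pay the connection-creation cost.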
The optimization results are shown below; the performance problem was basically solved.
Redis response time (before optimization)
Redis response time (after optimization)
Interface P99 latency (before optimization)
Interface P99 latency (after optimization)
5. Summary
When an online problem occurs, the first priority is to restore the online business quickly and minimize the impact on it. Rate-limiting, circuit-breaking, and degradation strategies therefore need to be put in place in advance, so that when an online problem does occur a recovery plan can be found quickly. Proficiency with the company's monitoring platforms determines how fast a problem can be located; every developer should treat skilled use of the monitoring platforms (machines, services, interfaces, DBs, and so on) as a basic ability.
When Redis responds slowly, check the Redis cluster servers first (machine load, slow queries on the service side), then the business code (possible bugs), and then the client (whether the connection pool is configured reasonably). Most slow-Redis problems can be tracked down this way.
The Redis connection pool warms itself up at system cold start. Different versions of commons-pool2 have different cold-start strategies, but in either case the timeBetweenEvictionRunsMillis parameter must be configured for the warm-up to take effect; see the official commons-pool2 documentation. Knowing the common parameters well makes it possible to locate problems quickly when they occur.
The default connection pool parameters are somewhat weak for high-traffic services and need to be tuned for those scenarios; if the business traffic is not large, the defaults are fine.
Analyze each problem on its own terms; when a problem cannot be solved, change the angle of attack and try different approaches.
Author: vivo internet server team-Wang Shaodong