
1. Phenomenon

One day I suddenly received an operations alarm message reporting that an interface in production had been slow for a period of time, and a JIRA task for performance optimization was opened. The alert read roughly as follows:

    API Alert - Base Tomcat Accesslog:
    Microservice: xxxapi
    Endpoint: /xxxx/xxx POST
    In the time window [ 2022/xx/xx 10:42:00 ~ 10:43:00 ], RT > threshold (300ms) occurred 141 times.

The company's internal requirement is that API response times stay under 300ms; any interface exceeding 300ms is considered to have a performance problem.

2. Investigation

2.1. Check the monitoring platform: interface response time

First, I checked the company's API monitoring platform and confirmed that this interface's response time did indeed degrade during that period, exceeding 300ms more than 100 times.

However, a few interesting things stood out:

  1. The interface performed normally before and after that window, so the problem was confined to that specific period.
  2. It was not only this interface that exceeded 300ms; several other interfaces crossed the threshold during the same window. This interface simply has higher concurrency, so it was the one caught by the alarm mechanism.

2.2. Check the monitoring platform: full-link trace log

Although the above suggests the problem may not be limited to this one interface, it was still worth carefully analyzing its full-link trace.

The call-trace diagram clearly showed that most of the time was spent in the Redis command execution phase. Searching the same time window for the other affected interfaces showed the same pattern.

The next step was to look at the code of the interface to see whether anything there could hurt performance when the Redis command is executed.

2.3. Analyze the code

Looking at the code of the interface, the only Redis command it uses is HMGET key field1 [field2 ...]. After analysis, the code itself was ruled out as the cause:

  1. The HMGET call in the code is not a time-consuming command (it is not in the O(n) command list below), and the number of fields is enumerable, a few dozen at most, so it should not take much time (see the sketch after this list).
  2. Because the interface was expected to handle high concurrency, it was stress-tested before going live; tens of thousands of QPS were no problem at all, so it should not stumble here.
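
For reference, here is a minimal sketch of the kind of bounded HMGET read described above, assuming Jedis as the client; the key and field names are made up for illustration and are not the actual production code.

    import redis.clients.jedis.Jedis;
    import java.util.List;

    public class HmgetSketch {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // HMGET fetches only the listed fields of a hash, so its cost grows with
                // the number of fields requested, which here is small and enumerable.
                List<String> values = jedis.hmget("user:profile:12345", "name", "role", "tenant");
                System.out.println(values);
            }
        }
    }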

2.4. Check the monitoring platform: Redis connection pool

With code-level problems ruled out for this interface, the slow Redis command execution could only be examined from the operations side. As it happens, the monitoring platform also tracks the MySQL and Redis connection pools, and one look at those graphs was startling.

During the problem period, the number of Redis connections spiked instantly, shooting past the configured maximum number of idle connections, and the timing coincided exactly with the slow responses in production.

It can be inferred that some unusual commands were executed during this period, blocking Redis; other Redis requests then had to create new connections frequently, causing the surge in the connection pool.
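
For context, this pool behavior is governed by settings such as the maximum idle and maximum total connections. Below is a minimal sketch of a Jedis pool configuration; the numbers are illustrative assumptions, not the production values.

    import redis.clients.jedis.JedisPool;
    import redis.clients.jedis.JedisPoolConfig;

    public class PoolConfigSketch {
        public static void main(String[] args) {
            JedisPoolConfig config = new JedisPoolConfig();
            config.setMaxIdle(16);        // connections kept open under normal traffic (assumed value)
            config.setMaxTotal(64);       // hard ceiling on the pool size (assumed value)
            config.setMaxWaitMillis(200); // how long a caller waits for a free connection (assumed value)

            // When Redis itself is blocked, borrowed connections are returned slowly,
            // so the pool grows past maxIdle toward maxTotal - the surge seen on the graphs.
            try (JedisPool pool = new JedisPool(config, "localhost", 6379)) {
                // application code would borrow connections from the pool here
            }
        }
    }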

Working out exactly what Redis went through during that time had to be handed over to the operations colleagues.

2.5. Operations: check Redis slow requests

The Redis slow request log obtained from the operations colleagues confirmed the suspicion: during that window a large number of slow commands were executed, in particular an LRANGE command that took more than 300ms per execution and ran hundreds of times per minute.
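
The slow log can be inspected directly on the Redis side; the commands below are standard Redis commands (run here via redis-cli), though the entry count shown is just an example.

    redis-cli SLOWLOG GET 10                       # the 10 most recent slow-log entries
    redis-cli SLOWLOG LEN                          # number of entries currently recorded
    redis-cli CONFIG GET slowlog-log-slower-than   # logging threshold, in microseconds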

Tracing that Redis command back to the code, it turned out to belong to logic that asynchronously refreshes user permissions via MQ after a user logs in. Because it is consumed asynchronously, it never triggers a response-time alert even when it runs slowly.
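
To make the pattern concrete, here is a hypothetical sketch (not the actual production code) of the kind of asynchronous permission-refresh consumer described above. The key name and method names are invented; the important part is the LRANGE over the whole list, which is O(n) in the list length and was being triggered on every login.

    import redis.clients.jedis.Jedis;
    import java.util.List;

    public class PermissionRefreshConsumerSketch {

        // Called by the MQ consumer after a user logs in (framework wiring omitted).
        public void onUserLogin(String userId) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // LRANGE key 0 -1 reads the entire list: O(n) in the list length.
                // Run hundreds of times per minute during a login burst, this is
                // the command that showed up in the slow log.
                List<String> permissions = jedis.lrange("user:permissions:" + userId, 0, -1);
                refreshLocalCache(userId, permissions);
            }
        }

        private void refreshLocalCache(String userId, List<String> permissions) {
            // placeholder for the actual refresh logic
        }
    }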

Combined with the call logs for this interface in ELK, the complete sequence of events can be reconstructed.

2.6. Reconstructing what happened

The LRANGE command is fine when executed in isolation, but during that window there was a burst of concurrent user logins in production. The resulting slow commands blocked Redis, a large number of Redis requests queued up behind them, and the number of connections in the Redis connection pool kept growing.

As a result, every Redis command issued during that window was delayed by the blocking. The interface that triggered the alert was simply caught in the crossfire: because it is called with high concurrency, its degradation was the most visible, so the alerting system flagged it as the typical example.

Without this step-by-step investigation, it would be hard to believe that an alert on one API could be caused by completely unrelated business logic.

3. Solution

3.1. Fix

LRANGE is a command with O(n) time complexity and, in principle, should not be called frequently. Analyzing the surrounding code showed that the logic only ever stores and reads the List data in full, so the Redis List can be replaced entirely with a String.
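
A sketch of that replacement is below, under the assumption that the list is only ever written and read in full: serialize it into a single String value so the read becomes a plain GET instead of an LRANGE over the whole list. Jackson is used purely as an example serializer, and the key name is made up; it mirrors the hypothetical consumer sketch earlier.

    import com.fasterxml.jackson.core.type.TypeReference;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import redis.clients.jedis.Jedis;
    import java.util.Collections;
    import java.util.List;

    public class ListToStringSketch {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Write: store the whole permission list as one serialized String value.
        public static void savePermissions(Jedis jedis, String userId, List<String> permissions) throws Exception {
            jedis.set("user:permissions:" + userId, MAPPER.writeValueAsString(permissions));
        }

        // Read: a single GET plus in-process deserialization, instead of LRANGE key 0 -1.
        public static List<String> loadPermissions(Jedis jedis, String userId) throws Exception {
            String json = jedis.get("user:permissions:" + userId);
            if (json == null) {
                return Collections.emptyList();
            }
            return MAPPER.readValue(json, new TypeReference<List<String>>() {});
        }
    }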

Extremely expensive Redis commands such as KEYS *, FLUSHDB, and FLUSHALL can simply be treated as forbidden and maintained in a disabled-command list in redis.conf.
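
One common way to do this, for Redis versions that support the directive, is rename-command in redis.conf; renaming a command to an empty string disables it entirely (newer Redis versions can achieve the same with ACLs). For example:

    rename-command KEYS ""
    rename-command FLUSHDB ""
    rename-command FLUSHALL ""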

Commands like LRANGE, however, do have legitimate uses; the application scenario just needs to be examined strictly, and developers need to pay particular attention to them. The following is a summary of common O(n) Redis commands.

3.2. O(n) Commands

  • String: MSET, MSETNX, MGET
  • List: LPUSH, RPUSH, LRANGE, LINDEX, LSET, LINSERT
  • Hash: HDEL, HGETALL, HKEYS/HVALS
  • Set: SADD, SREM, SRANDMEMBER, SPOP, SMEMBERS, SUNION/SUNIONSTORE, SINTER/SINTERSTORE, SDIFF/SDIFFSTORE
  • Sorted Set: ZADD, ZREM, ZRANGE/ZREVRANGE, ZRANGEBYSCORE/ZREVRANGEBYSCORE, ZREMRANGEBYRANK/ZREMRANGEBYSCORE

3.3. Others

Because of a bug in Spring's Lettuce connection pool, the company has asked teams to switch back to the Jedis connection pool. To avoid this kind of problem, the Jedis pool configuration probably also has room for optimization; I will share that after further hands-on practice.

