Remember that a Redis O(n) command slows down the interface

1. Phenomenon

One day, I suddenly received an operation and maintenance alarm message, and it was reported that an interface in the production line was slow for a certain period of time, and the JIRA task for performance optimization was also started. The general content is as follows:

 API告警 - Base Tomcat Accesslog: 
微服务: xxxapi 
接口: /xxxx/xxx POST 
在 [ 2022/xx/xx 10:42:00 ~ 10:43:00 ] 时间窗口,RT > 阈值(300ms) 发生 141 次。

The company's internal response time requirements for APIs are less than 300ms, and the more than 300ms are all interfaces with performance problems.

2. Check

2.1. Check the monitoring platform: interface speed

First, check the company's API monitoring platform to confirm that this interface is indeed in response speed during this period, more than 100 times exceeding 300ms.

However, a few interesting phenomena were found:

  1. The interface was fine around that time, so it should just be what happened during that time.
  2. Not only that one interface exceeded 300ms, but also some other interfaces exceeded the threshold during that time. Just because the concurrency of that interface is relatively high, it is obviously detected by the alarm mechanism.

2.2. Check the monitoring platform: full link log

Although it is found above that the problem may not be just that one interface, it is still necessary to carefully analyze the full link log of that interface.

By calling the link diagram, it can be clearly found that the main time-consuming is in the execution phase of the redis command. After searching for that period of time, the results of several other interfaces are the same.

Next, you should look at the code of the corresponding interface to see if there is any problem affecting the performance when executing the redis command.

2.3. Analyze the code

Find the code of the corresponding interface. The only command that uses redis in the code is HMGET key field1 [field2] , and the problem of the code is eliminated after analysis:

  1. The redis call in the code, HMGET is not a time-consuming command (not included in the O(n) command below), and the number of field is enumerable, There are dozens of them at most, and it won't take much time.
  2. Because the interface was expected to have high concurrency, it was stress tested before going online, and tens of thousands of QPS were not a problem at all, and it would not fall into trouble here.

2.4. Check the monitoring platform: redis connection pool

The problem of this interface at the code level is ruled out, so we can only look at the problem of slow execution of redis commands from the operation and maintenance side. It happens that the monitoring platform has monitoring of mysql and redis connection pools. I don't know if I don't see it, I was startled when I saw it.

During the problem period, the number of redis connections soared instantly, directly exceeding the maximum number of idle connections. At the time node, it completely coincides with the time when the production line has a slow response.

It can be inferred that some strange commands may have been executed during this period, causing redis to block, and other redis requests can only create new connections frequently, resulting in a surge in the number of connections in the redis connection pool.

If you want to speculate what redis has experienced during that time, you can only entrust it to the colleagues of operation and maintenance.

2.5. Operation and maintenance: check redis slow requests

From the operation and maintenance colleagues, the slow request logs executed by redis during that time were as expected. During that time, a large number of slow request commands were executed, especially a lrange command, each execution time exceeded 300ms, and it was executed hundreds of times a minute.

Through the specific redis command, the corresponding code is found backwards. The code corresponds to a logic of asynchronously refreshing user permissions through MQ after the corresponding user logs in. Because it is asynchronous consumption, even if the execution is slow, it will not be alerted because of the response time threshold.

Combined with the call log about this interface in ELK, the complete process of the problem can be restored.

2.6. Scene Restoration

The lrange command is fine to execute alone, but during that time, there were a large number of concurrent user logins in the production line, which caused the slow request command to block redis, and a large number of redis requests were blocked and queued, the redis connection pool The number of connections is also expanding.

Therefore, during that time, all redis request commands will be delayed due to blocking. The interface that is alerted belongs to "laying gun", because it calls high concurrency, so it has a remarkable performance and is caught by the alert program as a typical example.

Without this step-by-step investigation, it would be hard to believe that an API alert was affected by another unrelated API.

3. Solve

3.1. Resolution

lrange is a command with a time complexity of O(n) and should not be called frequently in principle. Analyzing the code context logic is actually just to store and read List type data in full. You can completely replace List with String in redis.

对于redis 中keys *flushdbflushall等超耗时的命令,可以直接作为禁用命令,维护进redis.conf in the command list.

But commands like lrange can be used, but to strictly examine the command of the application scenario, the R&D personnel need to pay special attention. The following is a summary of the common O(n) commands of redis.

3.2. O(n) Commands

  • String: MSET, MSETNX, MGET

3.3. Others

Because of the bug in the Lettuce connection pool in Spring, the company requested to change back to the Jedis connection pool. It feels that in order to avoid this problem, the Jedis connection pool should also have room for optimization. I will share it after the follow-up practice.


502 声望
138 粉丝
0 条评论
前面的文章里面主要讲的是查询的用法,还是延续之前的文章格式,这里讲讲修改。1. 单文档修改1.1. insert其实在数据准备阶段已经有新增的例子了。DSL {代码...} spring {代码...} 1.2. update-(save)新增时,spri...

KerryWu阅读 359

Redis 发布订阅模式:原理拆解并实现一个消息队列
“65 哥,如果你交了个漂亮小姐姐做女朋友,你会通过什么方式将这个消息广而告之给你的微信好友?““那不得拍点女朋友的美照 + 亲密照弄一个九宫格图文消息在朋友圈发布大肆宣传,暴击单身狗。”像这种 65 哥通过朋...

码哥字节5阅读 1.1k

在上一篇的文章《Redis高可用全景一览》中,我们学习了 Redis 的高可用性。高可用性有两方面含义:一是服务少中断,二是数据少丢失。主从库模式和哨兵保证了服务少中断,AOF 日志和 RDB 快照保证了数据少丢失。

杨同学technotes4阅读 1.1k

1.5万字总结 Redis 常见面试题&知识点
Redis 是一个基于 C 语言开发的开源数据库(BSD 许可),与传统数据库不同的是 Redis 的数据是存在内存中的(内存数据库),读写速度非常快,被广泛应用于缓存方向。并且,Redis 存储的是 KV 键值对数据。

JavaGuide3阅读 777

通过 CONFIG SET maxmemory 100mb 或者在 redis.conf 配置文件设置 maxmemory 100mb Redis 内存占用限制。当达到内存最大值值,会触发内存淘汰策略删除数据。

码哥字节2阅读 601


苏凌峰阅读 2.6k

Redis big key problem,实际上不是大Key问题,而是Key对应的value过大,因此严格来说是Big Value问题,Redis value is too large (key value is too large)。

Java架构师2阅读 466


502 声望
138 粉丝