I ran into a strange bug in production this week.
There is a scheduled task in production that queries a table for data within a window of a few days and processes it. It runs every ten minutes until it succeeds.
The logs showed that the task was executed three times between 5:26 am and 5:56 am, and all three executions failed because the SQL query timed out. The strange thing is that the task then succeeded at around 6:00 am.
Every day it fails shortly after five in the morning and succeeds at around six.
The exception log looked like this:
In short, the MySQL query timed out.
I assumed a bug that reproduces this reliably would take three minutes to locate, but it ended up eating half a day.
The investigation
EXPLAIN
Seeing a timed-out SQL, most people's first reaction is that the SQL didn't hit an index, and I was no exception. So I copied the SQL from the log and desensitized it. It was roughly this:
select * from table_a where status_updated_at >= ? and status_updated_at < ?
The SQL has two date parameters. These two dates are the start and end of the tradable window of a certain commodity, and they are usually three to five days apart. I took a conservative value of seven days and ran EXPLAIN on this SQL.
As the figure shows, this SQL does use an index, the one on the status_updated_at field.
It took only 135 milliseconds to execute.
Based on the EXPLAIN result, my inference at the time was: this SQL definitely uses the index. If it didn't, the query at six o'clock would also time out, because the table holds tens of millions of rows.
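For reference, this is roughly the check I ran. The index name idx_status_updated_at is my assumption, since the real name is only visible in the original screenshot, and the dates are placeholders for a realistic few-day window:

explain
select * from table_a
where status_updated_at >= '2021-04-01 00:00:00'
  and status_updated_at <  '2021-04-08 00:00:00';
-- Expected plan (simplified): type=range, key=idx_status_updated_at,
-- i.e. the optimizer picks the index on status_updated_at for a narrow range.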
To verify this inference, I asked the DBA to export the slow SQL for this table from 5 am to 7 am. The DBA told me there was no slow SQL for this table during that period.
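For context, this is roughly how such an export can be pulled by hand when the slow log is written to a table. Whether an instance logs to a file or to mysql.slow_log depends on its log_output setting, and the dates below are placeholders, so treat this as a sketch:

-- How slow queries are being recorded on this instance.
show variables like 'slow_query_log';   -- is the slow log enabled
show variables like 'long_query_time';  -- threshold in seconds
show variables like 'log_output';       -- FILE, TABLE, or both
-- If log_output includes TABLE, slow queries can be filtered by time directly:
select start_time, query_time, rows_examined, sql_text
from mysql.slow_log
where start_time between '2021-04-20 05:00:00' and '2021-04-20 07:00:00';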
That seemed to further confirm my inference: the SQL did use the index, but it timed out after five o'clock for some mysterious reason.
Next, I had to find out what that mysterious reason was.
Based on past experience, I could think of several factors that can cause a query to time out:
- MySQL resource contention
- database backups
- the network
MySQL resource contention
First, I checked how MySQL behaved during that period through the monitoring system. Indicators such as the number of connections and CPU load were all normal, so a timeout caused by MySQL being overloaded could be ruled out first.
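The monitoring system surfaced these numbers for me, but the same signals can be checked by hand; a minimal sketch:

-- Current connections vs. the configured ceiling.
show global status like 'Threads_connected';
show variables like 'max_connections';
-- What is running right now, to spot anything long-running or stuck.
show full processlist;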
Could it be affected by other business operations?
First, our production database runs at the RR (Repeatable Read) isolation level. Thanks to MVCC, ordinary data modifications will not block a plain query, let alone make it time out.
The only way another operation could interfere is if some business updated this table with a condition that didn't use an index: InnoDB would then lock every row it scanned, which in practice amounts to locking the whole table. And that operation would have had to start just after 5 o'clock in the morning and last for half an hour.
Those are very harsh conditions. I checked the relevant code and asked the people in charge, and that was not the case: all updates go through the Id primary key. More importantly, an update whose condition missed the index would itself be a slow SQL and would show up in the slow log file, and it didn't.
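Had there really been a huge unindexed update holding locks at that hour, it would also have shown up in the lock views. A sketch for MySQL 5.7; on 8.0 the lock tables behind the sys view live in performance_schema (data_locks, data_lock_waits) instead:

-- Long-running transactions and how many rows they have locked.
select trx_id, trx_state, trx_started, trx_rows_locked, trx_query
from information_schema.innodb_trx
order by trx_started;
-- Who is blocking whom, via the sys schema helper view.
select * from sys.innodb_lock_waits;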
Backup
Was the database being backed up after 5 o'clock in the morning?
First, a backup would not hold table locks for that long; this task kept failing over a span of half an hour.
Second, we back up from the replica, not from the primary.
Finally, none of our backup jobs run at five in the morning.
So a timeout caused by a backup could be ruled out.
Network
Was it caused by network fluctuations?
I asked the ops colleagues to check the network of the machine that ran the task during that period. It was perfectly smooth, and there was no network jitter in the data center either.
Besides, a network problem would have affected other tasks and services too, and the monitoring system showed nothing unusual about them.
So a timeout caused by network fluctuations could also be ruled out.
The turning point
I had ruled out indexing, the network, backups, and business contention for MySQL resources, gone through countless scenarios in my head, and walked through the entire execution path of the SQL. I couldn't think of any other reason.
Things were getting weird. The DBA advised me to let it go for now: push back the task's execution time, add some monitoring logs, and just observe for a while. After all, the task wasn't completely broken.
Giving up was out of the question. I'm stubborn; if I hit a bug and don't solve it, I can't sleep.
I cleared my head and started over. I asked the DBA for the production slow SQL file from 5 to 6 o'clock and combed through it myself, but I still couldn't find any slow SQL related to this table.
Just as I was about to give up, I suddenly noticed that the timestamps in the slow SQL log were in UTC, while my task runs on Beijing time. UTC is 8 hours behind Beijing time!
Good grief. I was about ready to blame quantum mechanics, and it turned out to be a time zone mismatch?
Could the DBA have been searching the wrong time window when looking for slow SQL?
I told the DBA about this "major discovery", and they re-ran the slow SQL export for me. Sure enough, there was the slow SQL on that table that I had been looking for.
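A quick way to confirm this kind of mismatch is to compare the server's time zone settings with the clock you are searching by, and to convert the window before filtering the log (the dates below are placeholders):

-- Server and session time zones, local time and UTC side by side.
select @@global.time_zone, @@session.time_zone, now(), utc_timestamp();
-- Convert the Beijing-time window (+08:00) to UTC before searching the slow log.
select convert_tz('2021-04-20 05:00:00', '+08:00', '+00:00') as utc_start,
       convert_tz('2021-04-20 07:00:00', '+08:00', '+00:00') as utc_end;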
The log showed that the query's date range ran from September 2020 to April 2021, a span of seven months. During cost estimation MySQL judged the range too wide for the index to be worth it and chose a full table scan instead, ending up scanning 18 million rows without the index.
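Re-running EXPLAIN with the seven-month range the task actually used makes the optimizer's choice visible. Same caveat as before: the index name and exact output are assumptions based on the log:

explain
select * from table_a
where status_updated_at >= '2020-09-01 00:00:00'
  and status_updated_at <  '2021-04-01 00:00:00';
-- With a range this wide the plan flips to type=ALL, key=NULL: the optimizer
-- estimates that walking most of the secondary index plus the lookups back to
-- the primary key costs more than one straight full table scan.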
Wasn't the agreed time interval at most seven days? How did it become seven months?
I quickly traced the code and found that, at the bottom layer, the time interval comes from an RPC interface. This interface was expected to return an interval of only a few days, but it actually returned an interval of seven months. And this logic had been live since 2018.
So I contacted the people who own that RPC interface, and after digging in and verifying together we confirmed it was a problem with the underlying data: it should have returned a window of a few days, but it returned one of several months.
Finally, the bad data was repaired, proper checks and monitoring were added, and the system was re-released. A bug that had existed for two years was finally solved.
This story ends here.
Looking back, there are still a few questions that need to be answered:
If the SQL doesn't use the index, why doesn't it time out after six o'clock?
The reason is that almost no business traffic hits MySQL at 6 o'clock. MySQL had plenty of spare resources at that time, and the machine is also generously configured, so the SQL managed to run to completion. It sounds outrageous, but it's true.
Why had this bug been in production for so long and was only discovered now?
The time-interval data underneath is rarely used, and this timed task is currently its only consumer.
In the early days the business volume was small and the machine was well provisioned, so even a full table scan didn't take long. And since it ran at 5 or 6 in the morning, it had no impact on online services.
Occasional failures of the task were considered normal, because it also depends on other data that isn't always ready on time; that's exactly why it keeps retrying until it succeeds.
Summary
Looking back at the whole process, I analyzed this query-timeout problem from the angles of indexing, the network, backups, business contention and MySQL resources, but overlooked the most important factor: which SQL was actually being executed.
I took it for granted that the SQL being executed was exactly the one I imagined, and I was so sure of it that all the effort that followed was wasted.
It was a simple problem that I made complicated. It shouldn't have gone that way.
This is a typical case: a pitfall buried while business volume is small gets exposed once the business grows quickly. Fortunately it didn't touch the core trading system; if it had, it might have caused a P0 incident.
Although I didn't write this code, the lessons I took away are: respect the production environment, be skeptical of the data you depend on, and stay objective when troubleshooting.
The production environment is extremely complex, full of data left over from past version migrations and business changes. Developers cannot anticipate all of these conditions, and testing can hardly cover them all either. Writing code based on subjective assumptions easily breeds bugs. Some of them draw little attention while business volume is small, but as the business grows, these debts eventually have to be paid.
You can guarantee that your own logic is correct, but you cannot guarantee that the data an upstream service provides meets expectations. Think more about whether your own service would break if the upstream data were abnormal; adding data checks and alarms saves a lot of unnecessary trouble.
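As a concrete illustration, a periodic check of this kind could have flagged the bad intervals long before they hit the query path. The table and column names here are made up for the sketch, and the seven-day threshold mirrors the agreed window from this story:

-- Alert if any upstream record claims a tradable window longer than the agreed few days.
select id, trade_start_at, trade_end_at,
       datediff(trade_end_at, trade_start_at) as window_days
from commodity_trade_window          -- hypothetical table name
where datediff(trade_end_at, trade_start_at) > 7;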
When troubleshooting, stay objective and don't assume. Bugs are born from subjective assumptions in the first place; troubleshooting with more assumptions only makes them harder to find.
Finally, you're welcome to follow my personal WeChat public account, "CoderW".