Author: Ren Kun
Now living in Zhuhai, he has worked as a full-time Oracle and MySQL DBA and is currently mainly responsible for maintaining MySQL, MongoDB, and Redis.
Source: original contribution.
*This original content is produced by the Aikesheng open source community and may not be used without authorization. For reprints, please contact the editor and credit the source.
1. Background
We run a 10-node ClickHouse cluster online, 5 shards × 2 replicas, version 19.7.3.
A developer executed a DDL statement to create a distributed table. 9 of the nodes returned successfully, but one node reported the following error:
```
Code: 159. DB::Exception: Received from 127.0.0.1:9000. DB::Exception: Watching task /clickhouse/task_queue/ddl/query-0003271440 is executing longer than distributed_ddl_task_timeout (=180) seconds. There are 1 unfinished hosts (0 of them are currently active), they are going to execute the query in background.
```
2. Diagnosis
Logging in to the failing node and running SHOW PROCESSLIST showed a distributed DDL that had already been running for more than 100 hours; it was clearly stuck.
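Besides SHOW PROCESSLIST, the same information can be pulled from the system.processes table; the elapsed-time filter below is only an illustrative threshold, not part of the original diagnosis:

```sql
-- Find long-running queries on this node (elapsed is in seconds).
SELECT query_id, user, elapsed, query
FROM system.processes
WHERE elapsed > 3600
ORDER BY elapsed DESC;
```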
ClickHouse executes distributed DDL serially: each task is stored in ZooKeeper under the /clickhouse/task_queue/ddl directory and processed in FIFO order. On each node, later tasks cannot run until the current DDL has finished.
```sql
SELECT * FROM system.zookeeper WHERE path = '/clickhouse/task_queue/ddl' ORDER BY ctime DESC\G
```
The newly initiated distributed CREATE TABLE ranked first, while the stuck OPTIMIZE TABLE seen in SHOW PROCESSLIST ranked 23rd, meaning as many as 22 DDLs were blocked behind it. Development also confirmed that none of their DDL tasks from the previous two days had succeeded on this node.
We tried KILL QUERY, but even after waiting a few hours it had no effect.
Adding the SYNC keyword made the kill statement itself hang.
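The kill attempts looked roughly like this (the query_id here is hypothetical). ASYNC merely schedules the kill and returns, while SYNC waits for the query to actually terminate, which is why it hung in our case:

```sql
-- Schedule the kill and return immediately (had no effect here):
KILL QUERY WHERE query_id = 'xxxx' ASYNC;

-- Wait until the query is actually terminated (this hung):
KILL QUERY WHERE query_id = 'xxxx' SYNC;
```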
Restarting the ClickHouse instance also hung, and in the end we had to kill -9 the process.
After the instance restarted, the task was still there. It ran for more than 10 minutes with no sign of finishing, and KILL remained ineffective.
Since this DDL could not be bypassed and would not finish on its own, the only option left was a workaround: rename the table so that the DDL's target temporarily disappears.
```sql
RENAME TABLE log_catch_data_local TO log_catch_data_local1;
```
The OPTIMIZE TABLE then returned immediately, the 22 queued DDLs executed quickly, and finally we renamed the table back.
At that point, running OPTIMIZE TABLE again returned successfully in only 5 seconds. Problem solved.
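Put together, the workaround was essentially the following sequence (table names are from this incident; "waiting" simply means watching the /clickhouse/task_queue/ddl queue drain):

```sql
-- 1. Rename the table the stuck OPTIMIZE refers to; with its target
--    gone, the stuck DDL returns and the queue starts moving.
RENAME TABLE log_catch_data_local TO log_catch_data_local1;

-- 2. After the queued DDLs have drained, rename the table back.
RENAME TABLE log_catch_data_local1 TO log_catch_data_local;

-- 3. Verify: the same statement now returns within seconds.
OPTIMIZE TABLE log_catch_data_local;
```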
3. Summary
ClickHouse's distributed DDL executes serially, in FIFO order, on each node; any stuck DDL blocks all subsequent tasks.
The stuck DDL in this case was an OPTIMIZE TABLE, which could be skipped by renaming its target table. If the stuck DDL adds or deletes columns, you can rename the table and then manually execute the equivalent local DDL on it.
If none of the above works, you can drop the local table on the problem node, wait for all blocked DDLs to execute, and then recreate an empty table, which will automatically synchronize its data from the other replica.
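As a sketch, that last-resort path would look roughly like the following, assuming a ReplicatedMergeTree table; the column list and ORDER BY are placeholders and must match the schema on the surviving replica exactly:

```sql
-- On the problem node only: drop the local replica of the table.
DROP TABLE log_catch_data_local;

-- After the blocked DDLs have drained, recreate the empty table with
-- the same schema and ZooKeeper path; it then fetches its data from
-- the healthy replica automatically.
CREATE TABLE log_catch_data_local (/* same columns as the other replica */)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/log_catch_data_local', '{replica}')
ORDER BY /* same key as the other replica */;
```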
Finally, the 19.x series is very old; we ran into various problems while using it and need to upgrade to the 20.x series as soon as possible.