Author: Ren Kun
Now living in Zhuhai, he has worked as a full-time Oracle and MySQL DBA and is currently mainly responsible for maintaining MySQL, MongoDB, and Redis.
Source: original contribution.
*This original content is produced by the Aikesheng open source community and may not be used without authorization. For reprints, please contact the editor and credit the source.
1. Background
We run a 10-node ClickHouse cluster online, 5 shards × 2 replicas, version 19.7.3.
A developer executed a DDL statement to create a distributed table. 9 of the nodes returned successfully, but one node reported the following error:
```
Code: 159. DB::Exception: Received from 127.0.0.1:9000. DB::Exception: Watching task /clickhouse/task_queue/ddl/query-0003271440 is executing longer than distributed_ddl_task_timeout (=180) seconds. There are 1 unfinished hosts (0 of them are currently active), they are going to execute the query in background.
```
2. Diagnosis
Logging in to the failing node and running SHOW PROCESSLIST showed a distributed DDL that had already been running for more than 100 hours; it was clearly stuck.
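Besides SHOW PROCESSLIST, the same information can be pulled from the system.processes table; the elapsed-time filter below is only an illustrative threshold, not part of the original diagnosis:

```sql
-- Find long-running queries on this node (elapsed is in seconds).
SELECT query_id, user, elapsed, query
FROM system.processes
WHERE elapsed > 3600
ORDER BY elapsed DESC;
```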
ClickHouse executes distributed DDL serially: each task is stored in ZooKeeper under the /clickhouse/task_queue/ddl directory and processed in FIFO order. On each node, later tasks cannot run until the current DDL has finished.
```sql
SELECT * FROM system.zookeeper WHERE path = '/clickhouse/task_queue/ddl' ORDER BY ctime DESC\G
```
The newly initiated distributed CREATE TABLE ranked first, while the stuck OPTIMIZE TABLE seen in SHOW PROCESSLIST ranked 23rd, meaning as many as 22 DDLs were blocked behind it. Development also confirmed that none of their DDL tasks from the previous two days had succeeded on this node.
We tried KILL QUERY, but even after waiting a few hours it had no effect.
Adding the SYNC keyword made the kill statement itself hang.
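The kill attempts looked roughly like this (the query_id here is hypothetical). ASYNC merely schedules the kill and returns, while SYNC waits for the query to actually terminate, which is why it hung in our case:

```sql
-- Schedule the kill and return immediately (had no effect here):
KILL QUERY WHERE query_id = 'xxxx' ASYNC;

-- Wait until the query is actually terminated (this hung):
KILL QUERY WHERE query_id = 'xxxx' SYNC;
```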
Restarting the ClickHouse instance also hung, and in the end we had to kill -9 the process.
After the instance restarted, the task was still there. It ran for more than 10 minutes with no sign of finishing, and KILL remained ineffective.
Since this DDL could not be bypassed and would not finish on its own, the only option left was a workaround: rename the table so that the DDL's target temporarily disappears.
```sql
RENAME TABLE log_catch_data_local TO log_catch_data_local1;
```
The OPTIMIZE TABLE then returned immediately, the 22 queued DDLs executed quickly, and finally we renamed the table back.
At that point, running OPTIMIZE TABLE again returned successfully in only 5 seconds. Problem solved.
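Put together, the workaround was essentially the following sequence (table names are from this incident; "waiting" simply means watching the /clickhouse/task_queue/ddl queue drain):

```sql
-- 1. Rename the table the stuck OPTIMIZE refers to; with its target
--    gone, the stuck DDL returns and the queue starts moving.
RENAME TABLE log_catch_data_local TO log_catch_data_local1;

-- 2. After the queued DDLs have drained, rename the table back.
RENAME TABLE log_catch_data_local1 TO log_catch_data_local;

-- 3. Verify: the same statement now returns within seconds.
OPTIMIZE TABLE log_catch_data_local;
```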
3. Summary
ClickHouse's distributed DDL executes serially, in FIFO order, on each node; any stuck DDL blocks all subsequent tasks.
The stuck DDL in this case was an OPTIMIZE TABLE, which could be skipped by renaming its target table. If the stuck DDL adds or deletes columns, you can rename the table and then manually execute the equivalent local DDL on it.
If none of the above works, you can drop the local table on the problem node, wait for all blocked DDLs to execute, and then recreate an empty table, which will automatically synchronize its data from the other replica.
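As a sketch, that last-resort path would look roughly like the following, assuming a ReplicatedMergeTree table; the column list and ORDER BY are placeholders and must match the schema on the surviving replica exactly:

```sql
-- On the problem node only: drop the local replica of the table.
DROP TABLE log_catch_data_local;

-- After the blocked DDLs have drained, recreate the empty table with
-- the same schema and ZooKeeper path; it then fetches its data from
-- the healthy replica automatically.
CREATE TABLE log_catch_data_local (/* same columns as the other replica */)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/log_catch_data_local', '{replica}')
ORDER BY /* same key as the other replica */;
```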
Finally, the 19.x series is very old; we ran into various problems while using it and need to upgrade to the 20.x series as soon as possible.