This article was first published on the WeChat public account "Shopee Technical Team".

Summary

Shopee ClickHouse is a highly available, distributed analytical database built on the open-source ClickHouse through secondary development and architecture evolution. This article mainly introduces Shopee ClickHouse's hot and cold separated storage architecture and how it supports the company's business.

Shopee ClickHouse's hot and cold separated storage architecture uses the JuiceFS client to mount remote object storage onto a local machine path, and, by writing a ClickHouse storage policy, uses the remote object storage as if it were an extra storage volume. Because we use the same ClickHouse DB cluster to support the business of multiple teams, and different teams (or even different businesses of the same team) may have different criteria for dividing data into hot and cold, the hot-cold separation strategy has to be applied at the table level in ClickHouse.

In order to achieve table-level hot and cold separation, we modify the storage policy of existing business tables that require it according to the pre-edited storage policy. For new business tables that need hot and cold separation, we specify a storage policy that supports storing data remotely when creating the table, and then use a refined TTL expression to decide whether the data should be stored locally or remotely.

After the hot and cold storage architecture went online, we encountered some problems and challenges, such as juicefs_object_request_errors, abnormal Redis memory growth, suspicious_broken_parts, and so on. This article will address some of these problems, combined with the scenario context, and provide solutions through source code analysis.

In general, the overall design idea of Shopee ClickHouse's hot and cold storage architecture is: local SSDs store and serve queries on hot data, while relatively infrequently accessed data is stored on and queried from remote storage, thereby saving storage costs and supporting more data storage requirements.

1. Overall architecture of the Shopee ClickHouse cluster

ClickHouse is an open-source column-oriented OLAP (online analytical processing) database that implements a vectorized execution engine and has excellent analytical query performance. Shopee ClickHouse is an analytical database that has evolved from ClickHouse through continuous secondary development and product architecture iteration.

The following figure shows the architecture of the Shopee ClickHouse DB cluster:

From top to bottom, user requests go through the SLB, the Proxy layer, and the ClickHouse DB cluster layer, and at the bottom is the remote object storage; here we use the S3 service provided by the Shopee STO team.

Among them, the SLB provides user request routing; the Proxy layer provides query routing: a request is routed to the corresponding cluster according to the cluster name in the user's connection string, and this layer also provides partial write-balancing and query-routing capabilities; the ClickHouse DB cluster layer consists of distributed clusters built from the Shopee ClickHouse database. At present there are compute-oriented distributed clusters and single-node clusters that use SSD disks as the hot data storage medium, and storage-oriented distributed clusters that use SATA disks as the storage medium; the remote object storage at the bottom serves as the cold data storage medium.

2. Cold and hot separation storage architecture solution

Users hope that data can be stored in larger volumes and for longer, while queries stay fast. But generally, the more data is stored, the higher the query latency will be under the same query conditions.

From the perspective of resource utilization, we hope that the data stored on Shopee ClickHouse can be accessed and utilized more, to provide broader support for the business. Therefore, at first we asked business teams to store only their hot business data in the Shopee ClickHouse database.

However, this also brought some problems. For example, users sometimes need to analyze data over a relatively long time range, so the part of the data that is not in ClickHouse has to be imported before analysis and deleted again once the analysis is over. Another example: some businesses that use the log service for aggregation and retrieval analysis also need a relatively long history of log data to help monitor and analyze daily operations.

Based on such needs, we hope on the one hand to maximize resource utilization, and on the other hand to support more data storage without affecting the query speed of users' hot data. Separating cold and hot data is therefore a very good choice.

Generally, the design of a cold and hot separation scheme needs to consider the following issues:

  • How to store cold data?
  • How to use cold storage medium efficiently, stably and simply?
  • How does the hot data sink to the cold storage medium?
  • How can the architecture evolve without affecting existing user services?

The choice of cold data storage media is generally analyzed through the following key points:

  • cost
  • stability
  • Functional completeness (data can still be queried correctly while it is sinking, and data can also be written correctly)
  • performance
  • Scalability

2.1 Selection of cold storage media and JuiceFS

The media that can be used as cold storage generally include S3, Ozone, HDFS, and SATA Disk. Among them, SATA Disk is limited by machine hardware and is not easy to expand, so it can be eliminated first. HDFS, Ozone and S3 are good cold storage media.

At the same time, in order to use cold storage media efficiently and simply, we set our sights on JuiceFS. JuiceFS is an open source POSIX file system based on Redis and cloud object storage, which allows us to access remote object storage more conveniently and efficiently.

JuiceFS uses existing object storage in the public cloud, such as S3, GCS, OSS, etc. With JuiceFS, the data is actually stored remotely, and JuiceFS focuses on managing the metadata of the data files stored remotely. JuiceFS chooses Redis as the engine for storing this metadata, because Redis keeps data in memory, which meets the low latency and high IOPS required for metadata reads and writes, supports optimistic transactions, and satisfies the atomicity of file system operations on metadata [1].

JuiceFS provides an efficient and convenient way to access remote storage: you only need to run the format and mount commands of the JuiceFS client to mount the remote storage onto a local path, and our ClickHouse database can then access the remote storage just as if it were a local path.

After choosing JuiceFS, we turned our attention back to the selection of the cold data storage medium. Since the main backend storage supported by JuiceFS is object storage, the remaining options were S3 and Ozone. We designed a benchmark as follows: using the ClickHouse TPC-H Star Schema Benchmark 1000s dataset (for more information on the benchmark, please refer to the ClickHouse community documentation [2]) as test data, we tested the insert performance of S3 and Ozone respectively, and used the select statements of the Star Schema Benchmark to compare query performance.

The queried data is in the following three storage states:

  • Part of the data in Ozone/S3 and part on the local SSD disk;
  • All in Ozone/S3;
  • All on SSD.

The following are the results of our test sampling:

(1) Insert performance sampling results

Insert Lineorder table data to Ozone:

Insert Lineorder table data to S3:

It can be seen that the insert performance of S3 is slightly better than that of Ozone.

(2) Query performance sampling results

According to ClickHouse Star Schema Benchmark, after importing the Customer, Lineorder, Part, Supplier tables, you need to create a flat wide table based on the data of the four tables.

CREATE TABLE lineorder_flat  
ENGINE = MergeTree  
PARTITION BY toYear(LO_ORDERDATE)  
ORDER BY (LO_ORDERDATE, LO_ORDERKEY)  
AS  
SELECT  
l.LO_ORDERKEY AS LO_ORDERKEY,  
l.LO_LINENUMBER AS LO_LINENUMBER,  
l.LO_CUSTKEY AS LO_CUSTKEY,  
l.LO_PARTKEY AS LO_PARTKEY,  
l.LO_SUPPKEY AS LO_SUPPKEY,  
l.LO_ORDERDATE AS LO_ORDERDATE,  
l.LO_ORDERPRIORITY AS LO_ORDERPRIORITY,  
l.LO_SHIPPRIORITY AS LO_SHIPPRIORITY,  
l.LO_QUANTITY AS LO_QUANTITY,  
l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,  
l.LO_ORDTOTALPRICE AS LO_ORDTOTALPRICE,  
l.LO_DISCOUNT AS LO_DISCOUNT,  
l.LO_REVENUE AS LO_REVENUE,  
l.LO_SUPPLYCOST AS LO_SUPPLYCOST,  
l.LO_TAX AS LO_TAX,  
l.LO_COMMITDATE AS LO_COMMITDATE,  
l.LO_SHIPMODE AS LO_SHIPMODE,  
c.C_NAME AS C_NAME,  
c.C_ADDRESS AS C_ADDRESS,  
c.C_CITY AS C_CITY,  
c.C_NATION AS C_NATION,  
c.C_REGION AS C_REGION,  
c.C_PHONE AS C_PHONE,  
c.C_MKTSEGMENT AS C_MKTSEGMENT,  
s.S_NAME AS S_NAME,  
s.S_ADDRESS AS S_ADDRESS,  
s.S_CITY AS S_CITY,  
s.S_NATION AS S_NATION,  
s.S_REGION AS S_REGION,  
s.S_PHONE AS S_PHONE,  
p.P_NAME AS P_NAME,  
p.P_MFGR AS P_MFGR,  
p.P_CATEGORY AS P_CATEGORY,  
p.P_BRAND AS P_BRAND,  
p.P_COLOR AS P_COLOR,  
p.P_TYPE AS P_TYPE,  
p.P_SIZE AS P_SIZE,  
p.P_CONTAINER AS P_CONTAINER  
FROM lineorder AS l  
INNER JOIN customer AS c ON c.C_CUSTKEY = l.LO_CUSTKEY  
INNER JOIN supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY  
INNER JOIN part AS p ON p.P_PARTKEY = l.LO_PARTKEY

Then we executed this SQL statement. When the data was all on Ozone, the following error occurred:

Code: 246. DB::Exception: Received from localhost:9000. DB::Exception: Bad size of marks file '/mnt/jfs/data/tpch1000s_juice/customer/all_19_24_1/C_CUSTKEY.mrk2': 0, must be: 18480

With part of the selected data on Ozone, and data sinking from the SSD disk to Ozone during the query:

Result: the query hangs and cannot complete.

When doing this test, the Ozone we used was the community version 1.1.0-SNAPSHOT. The test result only shows that Ozone 1.1.0-SNAPSHOT is not very suitable for our usage scenarios.

Since Ozone 1.1.0-SNAPSHOT has functional shortcomings in our usage scenario, the subsequent Star Schema Benchmark performance test report focuses on the performance comparison between SSD and S3 (the detailed query SQL statements can be obtained from the ClickHouse community documentation [2]).

Query No. | Query Latency: Data on JuiceFS | Query Latency: Data on ⅓ JuiceFS + ⅔ SSD | Query Latency: Data on SSD
Q1.1 | 8.884 s | 8.966 s | 1.417 s
Q1.2 | 0.921 s | 0.998 s | 0.313 s
Q1.3 | 0.551 s | 0.611 s | 0.125 s
Q2.1 | 68.148 s | 36.273 s | 5.450 s
Q2.2 | 54.360 s | 20.846 s | 4.557 s
Q2.3 | 55.329 s | 22.152 s | 4.297 s
Q3.1 | 60.796 s | 27.585 s | 7.999 s
Q3.2 | 67.559 s | 29.123 s | 5.928 s
Q3.3 | 45.917 s | 20.682 s | 5.606 s
Q3.4 | 0.675 s | 0.202 s | 0.188 s
Q4.1 | 100.644 s | 41.498 s | 7.019 s
Q4.2 | 32.294 s | 2.952 s | 2.464 s
Q4.3 | 33.667 s | 2.813 s | 2.357 s

In the end, after comparing all aspects, we chose S3 as the cold storage medium.

Therefore, the hot and cold storage separation solution is implemented with JuiceFS + S3. The implementation process is briefly described below.

2.2 Implementation of separation of cold and hot data storage

First, we use the JuiceFS client to mount the S3 bucket to the local storage path /mnt/jfs, and then edit the ClickHouse storage policy configuration file ../config.d/storage.xml. When writing the storage policy configuration, be careful not to affect the storage of historical users (that is, keep the previous storage policy). Here, default is our historical storage policy, and hcs_ck is the storage policy for hot and cold separation.

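For illustration only, a minimal sketch of such a storage configuration is shown below. The disk name jfs is an assumption for this example; its path matches the JuiceFS mount point /mnt/jfs, the volume names v_ssd and v_cold match the TTL expressions used later in this article, and the historical default policy (which only uses the default local disk) remains untouched so that existing tables are unaffected.

<yandex>
    <storage_configuration>
        <disks>
            <!-- the default disk is the local SSD, using the server's data path -->
            <jfs>
                <!-- local path where the JuiceFS client mounts the remote S3 bucket -->
                <path>/mnt/jfs/</path>
            </jfs>
        </disks>
        <policies>
            <hcs_ck>
                <volumes>
                    <v_ssd>
                        <disk>default</disk>
                    </v_ssd>
                    <v_cold>
                        <disk>jfs</disk>
                    </v_cold>
                </volumes>
            </hcs_ck>
        </policies>
    </storage_configuration>
</yandex>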

For services that require hot and cold separated storage, you only need to specify the storage policy hcs_ck in the CREATE TABLE statement, and then use a TTL expression to control the cold data sinking strategy.

The following example illustrates the usage and the data separation process. The table hcs_table_name is a business log table that requires hot and cold separated storage; here is its CREATE TABLE statement:

CREATE TABLE db_name.hcs_table_name  
(  
    .....  
    `log_time` DateTime64(3),  
    `log_level` String,  
    .....  
    `create_time` DateTime DEFAULT now()  
)  
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/db_name.hcs_table_name', '{replica}')  
PARTITION BY toYYYYMMDD(log_time)  
ORDER BY (ugi, ip)  
TTL toDateTime(log_time) TO VOLUME 'v_ssd',  
        toDateTime(log_time) + toIntervalDay(7) TO VOLUME 'v_cold',  
        toDateTime(log_time) + toIntervalDay(14)  
SETTINGS index_granularity = 16384,  
                   storage_policy = 'hcs_ck',   
                    parts_to_throw_insert = 1600

From the TTL expression we can see that the table hcs_table_name keeps the last 7 days of data on the local SSD disk, stores the data from day 8 to day 14 on the remote S3, and deletes data older than 14 days as expired.
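For existing tables that already contain data, the switch to hot and cold storage can be made with ALTER statements along the following lines. This is only a sketch rather than the exact statements we run in production; the table, policy, and volume names follow the example above, and the new storage policy must contain all the disks of the old one.

-- point the table at the hot/cold storage policy
ALTER TABLE db_name.hcs_table_name MODIFY SETTING storage_policy = 'hcs_ck';

-- then set the tiering TTL, mirroring the CREATE TABLE example
ALTER TABLE db_name.hcs_table_name MODIFY TTL
    toDateTime(log_time) TO VOLUME 'v_ssd',
    toDateTime(log_time) + toIntervalDay(7) TO VOLUME 'v_cold',
    toDateTime(log_time) + toIntervalDay(14);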

The general process is shown in the figure below:

The data parts of the table hcs_table_name (ClickHouse uses the data part as the basic unit of data storage and processing) are scheduled by a background task. This task is executed by the BgMoveProcPool thread, which comes from background_move_pool (note that this is not the same pool as background_pool).

std::optional<BackgroundProcessingPool> background_move_pool; /// The thread pool for the background moves performed by the tables.

The background task scheduling determines whether the data parts need to be moved (that is, whether the data needs to be moved to the remote storage) and whether they can be moved.

If a move needs to be executed, the background move pool creates a move task. The core logic of this task is: first select the data parts that need to be moved, and then move these data parts to the destination storage.

In the interface:

MergeTreePartsMover::selectPartsForMove

the ttl_entry is obtained from the TTL expression of each data part; then, based on the ttl_move information in the data parts, the parts that need to be moved are selected, and for each of them a move_entry (containing the IMergeTreeDataPart pointer and the amount of storage space that needs to be reserved) is stored into a vector. Then the following interface is called:

MergeTreeData::moveParts

to perform the move operation. Simply put, the move process clones the data parts on the SSD disk into the detached directory of the hcs_table_name table on the remote storage S3, then moves the data parts out of the detached directory so that they become active there, and finally the corresponding data parts still on the SSD disk are cleaned up in the destructor of IMergeTreeDataPart.

Therefore, during the entire move process, the table remains queryable. Because the move is a clone operation, the data parts being moved are at any moment either active on the SSD disk or active on the remote storage.

The move-related information of a table's data parts can also be queried through the following three fields in the system table system.parts:

move_ttl_info.expression;
move_ttl_info.min; 
move_ttl_info.max; 
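For example, a query along the following lines (reusing the table name from the example above) shows, for each active part, which disk it currently sits on together with its move-TTL boundaries:

SELECT
    name,
    disk_name,
    move_ttl_info.expression,
    move_ttl_info.min,
    move_ttl_info.max
FROM system.parts
WHERE database = 'db_name' AND table = 'hcs_table_name' AND active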

3. Practice sharing

After the Shopee ClickHouse cold and hot data separation storage architecture went online, we summarized some of the problems encountered in practice.

3.1 Redis memory growth is abnormal

The amount of data storage on S3 has not increased much, but Redis memory has continued to grow rapidly.

JuiceFS uses Redis to store the metadata of the data files on S3, so under normal circumstances, the more data files there are on S3, the more Redis memory is used. Generally, this abnormal situation arises because a table has produced many small data parts that sink directly without being merged, and a huge number of small files can easily fill up Redis.

This also introduces another problem: once the Redis memory is full, JuiceFS can no longer write data to S3 successfully. Worse, if the JuiceFS client is unmounted at this point, it cannot be mounted again; the following error is thrown on remounting:

Meta: create session: OOM command not allowed when used memory > 'maxmemory'.

To avoid this kind of problem, the ClickHouse merge status should be monitored first. clickhouse-exporter collects a merge metric, clickhouse_merge, which records the number of merges currently running (obtained by querying the system.metrics table for the metric 'Merge'); each triggered merge combines several data parts of a table. According to our experience, if the average number of merges every three hours is less than 0.5, it is very likely that merging on this machine has a problem.
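For reference, the underlying value can also be checked directly on a node with a query like the one below (the metric name in system.metrics is 'Merge'; exporter label names may differ):

SELECT value FROM system.metrics WHERE metric = 'Merge'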

There may be many reasons for abnormal merging (for example, the HTTPHandler thread or the ZooKeeperRecv thread continuously occupying a lot of CPU), which is not the focus of this article and will not be expanded on here. An alarm rule can therefore be set: if the average number of merges within three hours is less than 0.5, alert the ClickHouse development and operations engineers, so that the accumulation of a large number of small files can be avoided.

What should I do if there are a lot of small files sinking to S3?

First of all, we must stop the data from continuing to sink. There are two ways to find the business tables that are sinking a large number of small files.

The first way: check the ClickHouse error log, find the tables that throw the 'Too many parts' error, and then further determine whether those tables use hot and cold storage.

The second way: query the system.parts table for tables whose number of active parts is obviously too large and whose disk_name equals the alias of the cold storage (see the query sketch after the command below). After locating the table that produces a large number of small files, use the ClickHouse system command SQL:

SYSTEM STOP MOVES [[db.]merge_tree_family_table_name] 

to stop the data from continuing to sink and avoid filling up the Redis memory.
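For the second way described above, a query along these lines can locate the offending tables; the cold-disk alias 'jfs' is illustrative and should match whatever disk name is actually configured in the storage policy:

SELECT
    database,
    table,
    count() AS active_parts_on_cold
FROM system.parts
WHERE active AND disk_name = 'jfs'
GROUP BY database, table
ORDER BY active_parts_on_cold DESC
LIMIT 20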

If the table is relatively small, for example less than 1TB after compression (1TB here is an empirical value: we have used insert into ... select * from ... to import table data, and if the table is larger than 1TB the import takes a long time and has a certain chance of failing midway), then after confirming that merging is back to normal, you can rebuild the table: create a temp table, insert into the temp table with select * from the original table, then drop the original table and rename the temp table to the original table (a sketch follows below).
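A sketch of such a rebuild might look like the following; for a Replicated* table, the ZooKeeper path in the engine clause has to be adjusted so that the temp table does not collide with the replica path of the original table:

-- create an empty table with the same structure (adjust the ZooKeeper path for Replicated engines)
CREATE TABLE db_name.hcs_table_name_temp AS db_name.hcs_table_name;

-- copy the data; normal merges will produce properly sized parts in the new table
INSERT INTO db_name.hcs_table_name_temp SELECT * FROM db_name.hcs_table_name;

-- swap the tables
DROP TABLE db_name.hcs_table_name;
RENAME TABLE db_name.hcs_table_name_temp TO db_name.hcs_table_name;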

If the table is relatively large, after confirming that the merge function returns to normal, try to use the system command SQL:

SYSTEM START MERGES [[db.]merge_tree_family_table_name]

to wake up the merge threads. If merging is still slow, you can query the system.parts table to find the data parts that have already sunk to S3, and then manually execute:

ALTER TABLE table_source MOVE PART/PARTITION partition_expr TO volume 'ssd_volume'

to move the small files that sank to S3 back to the SSD. Because the IOPS of SSD is much higher than that of S3 (even when access is accelerated through JuiceFS), this on the one hand speeds up merging, and on the other hand releases Redis memory because the files are moved off S3.

3.2 JuiceFS fails to read and write S3

Data sinking fails because S3, which is accessed through JuiceFS, cannot be read or written. At this point, if a user query touches the data on S3, the query throws an error saying that a data file under the local mount path of S3 cannot be accessed. If you encounter this problem, check the JuiceFS logs.

On Linux CentOS, JuiceFS logs go to syslog, and cat /var/log/messages | grep 'juicefs' can be used to query them. For the log locations on other operating systems, please refer to the JuiceFS community documentation [3].

The problem we encountered was that the certificate of the S3 host name had expired when sending requests to S3 ('send request to S3 host name certificate expired'). The access problem was later solved by contacting the S3 development and operations team.

So how can this kind of JuiceFS read/write failure against S3 be monitored? It can be monitored through the juicefs_object_request_errors metric provided by JuiceFS: if errors occur, team members are alerted and can check the logs in time to locate the problem.

3.3 clickhouse-server failed to start

When modifying the TTL of a replicated table (a table whose engine carries the Replicated prefix) to add hot and cold separated storage to a historical table, the TTL expression in the local .sql metadata file can become inconsistent with the TTL recorded in ZooKeeper. This is a problem we encountered during testing. If clickhouse-server is restarted before this problem is resolved, it will fail to start because the table structures do not match.

This is because modifying the TTL of a replicated table modifies the TTL in ZooKeeper first, and then modifies the TTL of the table on each machine under the same node. So if the local TTL has not yet been modified successfully after the TTL in ZooKeeper was changed, and clickhouse-server is restarted at that moment, the above problem occurs.
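To check whether the two sides have diverged, the local table definition can be compared with the replica metadata stored in ZooKeeper, for example with queries like the following; the ZooKeeper path must be the fully expanded literal path of the table (the '01-01' segment below is purely illustrative):

SHOW CREATE TABLE db_name.hcs_table_name;

-- the 'metadata' child node records the table structure used by the replicas, including its TTL
SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01-01/db_name.hcs_table_name'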

3.4 suspicious_broken_parts

Restarting clickhouse-server fails, and an Error is thrown:

DB::Exception: Suspiciously many broken parts to remove

This is because ClickHouse will reload the MergeTree table engine data when restarting the service. The main code interface is:

MergeTreeData::loadDataParts(bool skip_sanity_checks) 

In this interface, the data parts of each table are loaded, and each data part folder is checked for the presence of a DELETE_ON_DESTROY_MARKER_PATH file (delete-on-destroy.txt). If it exists, the part is added to broken_parts_to_detach and the suspicious_broken_parts count is increased by 1.

In the hot and cold separated storage scenario, when data parts sink through the TTL, the core move operation involves the following call chain:

MergeTreeData::moveParts->MergeTreePartsMover::swapClonedPart->MergeTreeData::swapActivePart

In the last function, the paths that the active parts point to are swapped. As mentioned above, the data of the data parts remains available throughout the move: a part is either active on SSD or active on S3.

void MergeTreeData::swapActivePart(MergeTreeData::DataPartPtr part_copy)  
{  
    auto lock = lockParts();  
    for (auto original_active_part : getDataPartsStateRange(DataPartState::Committed)) // NOLINT (copy is intended)  
    {  
        if (part_copy->name == original_active_part->name)  
        {  
            .....  
            String marker_path = original_active_part->getFullRelativePath() + DELETE_ON_DESTROY_MARKER_PATH;  
            try  
            {  
                disk->createFile(marker_path);  
            }  
            catch (Poco::Exception & e)  
            ...  
}  

A DELETE_ON_DESTROY_MARKER_PATH file is created in the old active part (the part being replaced) to change its state to DeleteOnDestroy; this state is later used to delete the data part when the IMergeTreeDataPart is destructed.

This is why suspicious_broken_parts appears in our usage scenario; when this value exceeds the default threshold of 10, it prevents the ClickHouse service from starting.

There are two solutions. The first is to delete, on this machine, the .sql metadata file, the stored data, and the ZooKeeper metadata of the table that threw the error; after restarting the service, rebuild the table, and the data will be synchronized from the replica machine. The second is to create a force_restore_data flag file, owned by the user running the clickhouse-server process, under the /flags path, and then restart the service.

As the problems above show, after JuiceFS + S3 is used to implement the hot and cold separated storage architecture, new components (JuiceFS + Redis + S3) are introduced and the database can be used more flexibly; correspondingly, monitoring for all of these components must also be done well. Here are a few of the more important monitoring metrics:

  • JuiceFS: juicefs_object_request_errors : the health of JuiceFS reads and writes to S3.
  • Redis: Memory Usage : Monitor the memory usage of Redis.
  • ClickHouse: clickhouse_merge : Monitor whether the merge status of the machines in the cluster is normal.

4. Summary of benefits of cold and hot storage architecture

After separating cold and hot data storage, we can better support users' data services, increase the data storage capacity of the overall cluster, ease the local storage pressure on each machine, and manage business data more flexibly.

Before the hot and cold data separation architecture went live, the average disk usage of our cluster machines was close to 85%. After it went live, by modifying the TTL of business tables, this figure dropped to 75%, and the overall cluster took on two new data services on top of the original business volume. Without hot and cold separation, the cluster would not have been able to take on new projects before being expanded, due to insufficient disk space. Currently, the amount of data we have sunk to the remote S3 exceeds 90TB (after compression).

In the future, Shopee ClickHouse will continue to develop more useful features and continue to evolve the product architecture. At present, JuiceFS runs very stably in our production environment. We will further use JuiceFS to access HDFS, and then realize a storage-compute separated architecture for Shopee ClickHouse.

The version information of each product component mentioned in this article is as follows:

  • Shopee ClickHouse: currently based on community ClickHouse version 20.8.12.2-LTS
  • JuiceFS:v0.14.2
  • Redis: v6.2.2, sentinel mode, AOF enabled (policy: every second), RDB enabled (policy: one backup per day)
  • S3: Provided by the Shopee STO team
  • Ozone:1.1.0-SNAPSHOT

Related Links

  1. JuiceFS: https://github.com/juicedata/juicefs/blob/main/docs/en/redis_best_practices.md
  2. ClickHouse community documentation: https://clickhouse.tech/docs/en/getting-started/example-datasets/star-schema/
  3. JuiceFS Community Document: https://github.com/juicedata/juicefs/blob/main/docs/zh_cn/fault_diagnosis_and_analysis.md

Author of this article

Teng graduated from the National University of Singapore and works on the Shopee Data Infra team.

