6

At present, there are a lot of relevant information on the Internet to prove that the query response speed of the ClickHouse database is one hundred to several hundred times faster than that of the MySQL database. In fact, ClickHouse and MySQL have different application scenarios and limitations. Recently, when I was researching this ClickHouse , which was intended to be applied to a large amount of data for querying, I stepped on some pits, so I made a summary here for subsequent data generation. Use as a memo when storing and processing, and as a reference when you want to replace part of the MySQL database with ClickHouse .

Use the VersionedCollapsingMergeTree engine to do data that will be modified but not often modified

ClickHouse itself is not suitable for processing frequent data modification and deletion operations, which consume a lot of performance for deletion and modification, especially frequent single data modification. Therefore, we usually see a lot of data that the data is written in batches as much as possible, whether it is 1000, 10000 or as many as a batch, to improve the performance of ClickHouse .

In addition, the data writing, modification and deletion of ClickHouse itself are asynchronous. For the data written, modified and deleted, it needs to be queried in time. It is not suitable to use ClickHouse for storage, and ClickHouse does not support transactions, so ClickHouse is not suitable. It is used for scenarios with high data consistency.

In some scenarios, the data may not need to be modified in general, but some scenarios need to be modified once or several times, and then for the response speed, if you want to switch to ClickHouse , you can use the VersionedCollapsingMergeTree engine for data storage, about this The storage engine, I can briefly introduce it here. For details, please refer to the detailed document of ClickHouse .

The general principle of this storage engine is that by providing a sign and a version tag, when sign is stored, the value is 1 , -1 , and version to store the version number of the data. The rules that apply to the following two scenarios are:

  1. If it needs to be deleted, re-insert the data of this row, version remains unchanged, sign is set to -1 , then there will be two duplicate data in the data table except sign . Then ClickHouse will periodically perform merge and fold, so that the two data are consistent, but the data of 0 when the two sign are added together are deleted. One thing to note here is that it is merged and cleared regularly, so when querying, you need to use group by and then do having(sign)>0) to manually exclude deleted data.
  2. In the case of modification, while doing the above operations, insert new data, mark sign as 1 , and add 1 to version on the basis of the previous one. Of course, when querying, you also need to manually exclude the data of the previous version.

Sparse indexes are not suitable for precise queries

Before saying that this sparse index is not suitable for precise query, let's talk about the exact query scenario I mentioned below:

  1. It is necessary to query the data of the current row according to an indexed condition, such as id or user_id ;
  2. It is necessary to fetch the data list for paging according to an indexed condition, such as id or user_id

ClickHouse uses a sparse index, which is different from the MySQL tree of B+ . (I will not introduce about sparse index and B+ tree index here. If these two things are not introduced for a while, they will be explained clearly. If someone sees my article and does not understand it, you can study it yourself. ), so when doing a precise conditional query, the amount of ClickHouse scanned data will be very large, and the actual response speed will not reach the ideal state.

For the case where ClickHouse uses a sparse index, it is especially suitable to use group by for query. After several times of group by , a large amount of data can be excluded, so usually the most suitable scenario is to process statistical queries. In this case In the case of a large amount of data, the response speed is dozens of times faster than MySQL , and it can be extracted hundreds of times.

Columnar databases are not suitable for querying a large number of columns at once

The other is the characteristics of the columnar database of ClickHouse . Based on the above, in the scenario that requires precise query, when a large number of fields need to be queried at one time, the response speed is not ideal. Even if a lot of group by conditions are added to the formula, in the end, because there are many columns to be scanned, when MySQL is correctly indexed, the response speed of ClickHouse is usually not as fast as that of MySQL .

The query effect of ClickHouse is more stable than that of MySQL

I won't say much about the specific test here, but mainly talk about the scene. Under the condition that both data tables are properly indexed, when doing 200 million data list query, when MySQL is doing paging data query, the first few pages of query will be obviously time-consuming, about 500ms to 800ms It varies, but the subsequent paging query can basically reach 50ms to 80ms . Here, the data preheating of MySQL should play a role. But ClickHouse is basically stable at 230ms to 300ms .

Summarize

Based on the current tests and observations, if statistical queries are required and the data is not frequently modified, ClickHouse is used to store and process data queries. If you need to frequently modify or do large data list queries, the best solution is to use MySQL query and process the data into tables, and the data response performance will be much better than ClickHouse .


kumfo
6.7k 声望4.1k 粉丝

程序生存法则: