At present, there are a lot of relevant information on the Internet to prove that the query response speed of the ClickHouse
database is one hundred to several hundred times faster than that of the MySQL
database. In fact, ClickHouse
and MySQL
have different application scenarios and limitations. Recently, when I was researching this ClickHouse
, which was intended to be applied to a large amount of data for querying, I stepped on some pits, so I made a summary here for subsequent data generation. Use as a memo when storing and processing, and as a reference when you want to replace part of the MySQL
database with ClickHouse
.
Use the VersionedCollapsingMergeTree
engine to do data that will be modified but not often modified
ClickHouse
itself is not suitable for processing frequent data modification and deletion operations, which consume a lot of performance for deletion and modification, especially frequent single data modification. Therefore, we usually see a lot of data that the data is written in batches as much as possible, whether it is 1000, 10000 or as many as a batch, to improve the performance of ClickHouse
.
In addition, the data writing, modification and deletion of ClickHouse
itself are asynchronous. For the data written, modified and deleted, it needs to be queried in time. It is not suitable to use ClickHouse
for storage, and ClickHouse
does not support transactions, so ClickHouse
is not suitable. It is used for scenarios with high data consistency.
In some scenarios, the data may not need to be modified in general, but some scenarios need to be modified once or several times, and then for the response speed, if you want to switch to ClickHouse
, you can use the VersionedCollapsingMergeTree
engine for data storage, about this The storage engine, I can briefly introduce it here. For details, please refer to the detailed document of ClickHouse
.
The general principle of this storage engine is that by providing a sign
and a version
tag, when sign
is stored, the value is 1
, -1
, and version
to store the version number of the data. The rules that apply to the following two scenarios are:
- If it needs to be deleted, re-insert the data of this row,
version
remains unchanged,sign
is set to-1
, then there will be two duplicate data in the data table exceptsign
. ThenClickHouse
will periodically perform merge and fold, so that the two data are consistent, but the data of0
when the twosign
are added together are deleted. One thing to note here is that it is merged and cleared regularly, so when querying, you need to usegroup by
and then dohaving(sign)>0)
to manually exclude deleted data. - In the case of modification, while doing the above operations, insert new data, mark
sign
as1
, and add1
toversion
on the basis of the previous one. Of course, when querying, you also need to manually exclude the data of the previous version.
Sparse indexes are not suitable for precise queries
Before saying that this sparse index is not suitable for precise query, let's talk about the exact query scenario I mentioned below:
- It is necessary to query the data of the current row according to an indexed condition, such as
id
oruser_id
; - It is necessary to fetch the data list for paging according to an indexed condition, such as
id
oruser_id
ClickHouse
uses a sparse index, which is different from the MySQL
tree of B+
. (I will not introduce about sparse index and B+
tree index here. If these two things are not introduced for a while, they will be explained clearly. If someone sees my article and does not understand it, you can study it yourself. ), so when doing a precise conditional query, the amount of ClickHouse
scanned data will be very large, and the actual response speed will not reach the ideal state.
For the case where ClickHouse
uses a sparse index, it is especially suitable to use group by
for query. After several times of group by
, a large amount of data can be excluded, so usually the most suitable scenario is to process statistical queries. In this case In the case of a large amount of data, the response speed is dozens of times faster than MySQL
, and it can be extracted hundreds of times.
Columnar databases are not suitable for querying a large number of columns at once
The other is the characteristics of the columnar database of ClickHouse
. Based on the above, in the scenario that requires precise query, when a large number of fields need to be queried at one time, the response speed is not ideal. Even if a lot of group by
conditions are added to the formula, in the end, because there are many columns to be scanned, when MySQL
is correctly indexed, the response speed of ClickHouse
is usually not as fast as that of MySQL
.
The query effect of ClickHouse
is more stable than that of MySQL
I won't say much about the specific test here, but mainly talk about the scene. Under the condition that both data tables are properly indexed, when doing 200 million data list query, when
MySQL
is doing paging data query, the first few pages of query will be obviously time-consuming, about 500ms
to 800ms
It varies, but the subsequent paging query can basically reach 50ms
to 80ms
. Here, the data preheating of MySQL
should play a role. But ClickHouse
is basically stable at 230ms
to 300ms
.
Summarize
Based on the current tests and observations, if statistical queries are required and the data is not frequently modified, ClickHouse
is used to store and process data queries. If you need to frequently modify or do large data list queries, the best solution is to use MySQL
query and process the data into tables, and the data response performance will be much better than ClickHouse
.
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。