1
头图

foreword

In the working scenario, we will collect factory equipment data for intelligent control, and use InfluxDB for data storage. As the data scale becomes larger and larger, the performance of InfluxDB is getting worse and worse. Therefore, we consider introducing ClickHouse to share the big data analysis of InfluxDB. Pressure, coupled with the fact that we also use MySQL in our business, this article will compare the performance indicators of MySQL, InfluxDB, ClickHouse in terms of write time, aggregation query time, disk usage and other aspects under tens of millions of data volumes.

conclusion first

The final conclusion is that the 6600w data set provided by ClickHouse's official website is directly used for comparative testing. Under the condition that MySQL, InfluxDB and ClickHouse also allocate 4c16g resources, ClickHouse completely crushes MySQL and MySQL in terms of import speed, disk usage and query performance. InfluxDB, the specific comparison indicators are as follows:

MySQL InfluxDB ClickHouse
Import time It takes about 70 minutes It takes about 35 minutes 75 seconds
disk space 12.35G 5.9G 2.66G
full table count 24366ms 11674 ms 100 ms
Full table max/min 27023 ms 26829 ms 186 ms
full table average 24841 ms 12043 ms 123 ms
full table variance 24600 ms OOM 113 ms
complex query 1 30260 ms OOM 385 ms
complex query 2 470 ms 200 ms 8 ms

In order to ensure that the test results are relatively accurate, each sql above is executed at least 5 times, and then the intermediate value is taken. Among them, the performance of InfluxDB is worse than expected, or even worse than MySQL, which may be caused by the fact that the data samples and test cases are not suitable for the InfluxDB scenario. If you have any doubts about the test results, you can git clone [https://github.com/stone0090/clickhouse-test.git](https://github.com/stone0090/clickhouse-test.git) project, complete verification Compare the whole process above.

Introduction to Databases

MySQL

MySQL is a relational database management system, developed by the Swedish MySQL AB company, which is a product of Oracle and is one of the most popular relational database management systems. The SQL language it uses is the most commonly used standardized language for accessing databases. It adopts a dual authorization policy and is divided into a community version and a commercial version. Due to its small size, fast speed, low total cost of ownership, especially open source, MySQL is generally chosen as the website database for the development of small and medium-sized websites. . The core sentence mentioned at the beginning of the book "High Performance MySQL" is "MySQL is not perfect, but it is flexible enough", it is the panacea in the architecture, and there will always be it in complex and non-single projects. Use force.

InfluxDB

InfluxDB is an open source time series database developed by InfluxData. It focuses on high-performance reading, high-performance writing, efficient storage and real-time analysis of massive time series data. It ranks first in the DB-Engines Ranking time series database and is widely used in DevOps monitoring, IoT monitoring, real-time analysis and other scenarios.
Traditional databases usually record the current value of data, while time series databases record all historical data. When processing current time series data, new time series data must be continuously received. At the same time, the query of time series data is always based on time. Focus on solving the following massive data scenarios:

  • Writing of time series data: how to support the writing of tens of millions of data per second;
  • Reading of time series data: how to support aggregation and query of tens of millions of data per second;
  • Cost-sensitive: Mass data storage brings cost issues, how to store these data at a lower cost.

ClickHouse

ClickHouse is a column store-based database open sourced by Yandex (the largest search engine in Russia) for real-time data analysis, which processes data 100-1000 times faster than traditional methods. ClickHouse outperforms comparable column-oriented DBMSs on the market today, processing hundreds of millions to over a billion rows and tens of gigabytes of data per second per server per second. It is a columnar database management system (DBMS) for online analysis (OLAP), briefly introduce OLTP and OLAP.

  • OLTP: It is a traditional relational database. It mainly operates additions, deletions, changes, and searches, emphasizing transaction consistency, such as banking systems and e-commerce systems.
  • OLAP: It is a warehouse-type database, mainly for reading data, doing complex data analysis, focusing on technical decision support, and providing intuitive and simple results.

The applicable scenarios of ClickHouse OLAP are: 1) Read more than write; 2) Large wide table, read a large number of rows but a few columns, the result set is small; 3) Data is written in batches, and the data is not updated or updated less; 4) No need Transaction, low data consistency requirements; 5) flexible and changeable, not suitable for pre-modeling.

Environmental preparation

Buy a 16c64g server in Alibaba Cloud, the operating system is centos 7.8, use sealos to install k8s with one click, use helm to install mysql (5.7), influxdb (1.8), clickhouse (22.3) with one click, and each application allocates 4c16g resources .

 # 下载 sealos
$ wget https://github.com/labring/sealos/releases/download/v4.0.0/sealos_4.0.0_linux_amd64.tar.gz \
&& tar zxvf sealos_4.0.0_linux_amd64.tar.gz sealos && chmod +x sealos && mv sealos /usr/bin

# 初始化一个单节点 Kubernetes
$ sealos run labring/kubernetes:v1.24.0 labring/calico:v3.22.1 --masters [xxx.xxx.xxx.xxx] -p [your-ecs-password]

# 去掉 master 的污点,允许安装应用到 master 和 control-plane
$ kubectl taint nodes --all node-role.kubernetes.io/master-
$ kubectl taint nodes --all node-role.kubernetes.io/control-plane-

# 获取 mysql、influxdb、clickhouse 一键安装 Helm-Charts
$ wget https://github.com/stone0090/clickhouse-test/archive/refs/tags/v1.0.0.tar.gz
$ tar -zxvf v1.0.0.tar.gz

# 安装 Kubernetes 包管理工具 Helm,以及 mysql、influxdb、clickhouse 3大数据库
$ sealos run labring/helm:v3.8.2
$ helm install mysql clickhouse-test-1.0.0/helm-charts/mysql/
$ helm install influxdb clickhouse-test-1.0.0/helm-charts/influxdb/
$ helm install clickhouse clickhouse-test-1.0.0/helm-charts/clickhouse/

data import

Directly use the test data officially provided by ClickHouse https://clickhouse.com/docs/zh/getting-started/example-datasets/opensky The data in this dataset is derived and cleaned from the complete OpenSky dataset to Illustrating the evolution of air traffic during the COVID-19 pandemic. It covers all flights seen by more than 2,500 members of the network since January 1, 2019, with a total data volume of 6,600w.

 # 在服务器 /home/flightlist 目录执行以下命令,该目录会被挂载到 mysql-pod、influxdb-pod、clickhouse-pod 内
$ wget -O- https://zenodo.org/record/5092942 | grep -oP 'https://zenodo.org/record/5092942/files/flightlist_\d+_\d+\.csv\.gz' | xargs wget

# 批量解压 flightlist.gz 数据
$ for file in flightlist_*.csv.gz; do gzip -d "$file"; done

# 将 csv 处理成 influxdb 导入所需的 txt 格式(此过程大概耗时1小时)
$ python clickhouse-test-1.0.0/influxdb_csv2txt.py

MySQL

 # 进入 mysql pod
$ kubectl exec -it [influxdb-podname] -- bash

# 连上 mysql 建库、建表
$ mysql -uroot -p123456
$ use test;
$ CREATE TABLE `opensky` (`callsign` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`number` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`icao24` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`registration` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`typecode` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`origin` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`destination` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`firstseen` datetime DEFAULT NULL,`lastseen` datetime DEFAULT NULL,`day` datetime DEFAULT NULL,`latitude_1` double DEFAULT NULL,`longitude_1` double DEFAULT NULL,`altitude_1` double DEFAULT NULL,`latitude_2` double DEFAULT NULL,`longitude_2` double DEFAULT NULL,`altitude_2` double DEFAULT NULL,KEY `idx_callsign` (`callsign`),KEY `idx_origin` (`origin`),KEY `idx_destination` (`destination`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

# 导入数据(大概耗时70分钟)
$ load data local infile 'flightlist_20190101_20190131.csv' into table opensky character set utf8mb4 fields terminated by ',' lines terminated by '\n' ignore 1 lines;
# 省略其他29条导入命令:load data local infile 'flightlist_*_*.csv' into table opensky character set utf8mb4 fields terminated by ',' lines terminated by '\n' ignore 1 lines;

# 检查数据是否导入成功
$ select count(*) from test.opensky;

InfluxDB

 # 进入 influxdb pod
$ kubectl exec -it [influxdb-podname] -- bash

# 导入数据(大概耗时30分钟)
$ influx -username 'admin' -password 'admin123456' -import -path=/tmp/flightlist/flightlist_20190101_20190131.txt -precision=ns;
# 省略其他29条导入命令:influx -username 'admin' -password 'admin123456' -import -path=/tmp/flightlist/flightlist_*_*.txt -precision=ns;

# 检查数据是否导入成功
$ influx -username 'admin' -password 'admin123456'
$ select count(latitude_1) from test.autogen.opensky;

ClickHouse

 # 进入 clickhouse pod
$ kubectl exec -it [clickhouse-podname] -- bash

# 连上 clickhouse 建库、建表
$ clickhouse-client
$ create database test;
$ use test;
$ CREATE TABLE opensky(callsign String,number String,icao24 String,registration String,typecode String,origin String,destination String,firstseen DateTime,lastseen DateTime,day DateTime,latitude_1 Float64,longitude_1 Float64,altitude_1 Float64,latitude_2 Float64,longitude_2 Float64,altitude_2 Float64) ENGINE = MergeTree ORDER BY (origin, destination, callsign);
$ exit

# 导入数据(大概耗时75秒)
$ cd /tmp/flightlist
$ for file in flightlist_*.csv; do cat "$file" | clickhouse-client --date_time_input_format best_effort --query "INSERT INTO test.opensky FORMAT CSVWithNames"; done

# 检查数据是否导入成功
$ clickhouse-client
$ SELECT count() FROM test.opensky;

testing scenarios

MySQL

 $ mysql -uroot -p123456
$ use test;
-- 开启性能分析
set profiling = 1;
-- 查询磁盘空间
select table_rows as `总行数`, (data_length + index_length)/1024/1024/1024 as `磁盘占用(G)` from information_schema.`TABLES` where table_name = 'opensky';
-- 全表count
select count(latitude_1) from opensky;
-- 全表max/min
select max(longitude_1),min(altitude_1) from opensky;
-- 全表平均值
select avg(latitude_2) from opensky;
-- 全表方差
select var_pop(longitude_2) from opensky;
-- 复杂查询1:全表多个字段聚合查询
select count(latitude_1),max(longitude_1),min(altitude_1),avg(latitude_2) from opensky;
-- 复杂查询2:从莫斯科三个主要机场起飞的航班数量
SELECT origin, count(1) AS c FROM opensky WHERE origin IN ('UUEE', 'UUDD', 'UUWW') GROUP BY origin;
-- 输出分析结果
show profiles;

InfluxDB

 $ influx -username 'admin' -password 'admin123456'
$ use test;
-- 耗时统计,queryReqDurationNs 是累计查询时间,2次任务的时间相减就是耗时
select queryReq,queryReqDurationNs/1000/1000,queryRespBytes from _internal."monitor".httpd order by time desc limit 10;
-- 查询磁盘空间
select sum(diskBytes) / 1024 / 1024 /1024 from _internal."monitor"."shard" where time > now() - 10s group by "database";
-- 全表count
select count(latitude_1) from opensky;
-- 全表max/min
select max(longitude_1),min(altitude_1) from opensky;
-- 全表平均值
select mean(latitude_2) from opensky;
-- 全表方差
select stddev(longitude_2) from opensky;
-- 复杂查询1:全表多个字段聚合查询
select count(latitude_1),max(longitude_1),min(altitude_1),mean(latitude_2) from opensky;
-- 复杂查询2:从莫斯科三个主要机场起飞的航班数量
SELECT count(latitude_1) AS c FROM opensky WHERE origin =~/^UUEE|UUDD|UUWW$/ GROUP BY origin;

ClickHouse

 $ clickhouse-client
$ use test;
-- 耗时统计
select event_time_microseconds,query_duration_ms,read_rows,result_rows,memory_usage,query from system.query_log where query like '%opensky%' and query_duration_ms <> 0 and query not like '%event_time_microseconds%' order by event_time_microseconds desc limit 5;
-- 查询磁盘空间
SELECT formatReadableSize(total_bytes) FROM system.tables WHERE name = 'opensky';
-- 全表count
select count(latitude_1) from opensky;
-- 全表max/min
select max(longitude_1),min(altitude_1) from opensky;
-- 全表平均值
select avg(latitude_2) from opensky;
-- 全表方差
select var_pop(longitude_2) from opensky;
-- 复杂查询1:全表多个字段聚合查询
select count(latitude_1),max(longitude_1),min(altitude_1),avg(latitude_2) from opensky;
-- 复杂查询2:从莫斯科三个主要机场起飞的航班数量
SELECT origin, count() AS c FROM opensky WHERE origin IN ('UUEE', 'UUDD', 'UUWW') GROUP BY origin;

Why ClickHouse is fast

1. Column storage

Data is stored in columns, and data is the index; queries only access the involved columns, reducing system I/O; each column is processed by a thread, making efficient use of CPU resources; it also paved the way for vectorized execution.

2. Data compression

The essence of data compression is to match and scan the data according to a certain step size, and to perform encoding conversion when duplicate data is found.
Because it is a columnar storage, the data characteristics are very similar, so there are many duplicates in the data, the higher the compression rate, the smaller the data volume, the lower the disk I/O pressure, and the faster the network transmission.

3. Vectorized execution engine

SIMD (Single Instruction Multiple Data) means that a single instruction operates multiple pieces of data. It is a way to improve performance through data parallelism. It can be simply understood as parallel processing of data in the program at the register level. Clickhouse can improve computing efficiency. SIMD is widely used in places where SIMD is used. By using SIMD, it can basically bring several times the performance improvement.

4. Multithreading and Distributed

There is a law in the distributed field. Computational movement is more cost-effective than data movement. This is also its core. The calculation of data is directly distributed to the server where the data is located, and the results are processed in parallel on multiple machines. Through thread-level parallelism, the efficiency is further accelerated, and the resources of the server are used to the extreme.

5. Various table engines

The MergeTree storage structure sorts the written data and then stores it in an orderly manner. Ordered storage has two main advantages:

  • When the column-stored file is compressed by block, the column values in the sort key are continuous or repeated, so that the data in the column-stored block can obtain the ultimate compression ratio;
  • The storage order itself can speed up the index structure of the query. According to the equivalence condition or range condition of the column in the sort key, we can quickly find the approximate location range of the target, and this index structure will not generate additional storage overhead.

MergeTree is the core engine in the ClickHouse table engine. Other engines are based on the MergeTree engine and implement different features in the process of data merging, thus forming the MergeTree table engine family.

Advantages and disadvantages of ClickHouse

Advantages: extreme query and analysis performance, low storage cost, high-throughput data writing, diversified table engines, and complete DBMS functions;
Disadvantages: does not support transactions, does not support true deletion/update, and has weak distribution capabilities; does not support high concurrency, the official recommendation is 100 QPS; non-standard SQL, the implementation of join is special, and the performance is not good; frequent small batches Data manipulation will affect query performance;
At present, there is no OLAP engine that can meet the needs of various scenarios. The essential reason is that no system can be perfect in terms of query efficiency, timeliness, and maintainability at the same time. It can only be said that ClickHouse is for the ultimate query performance. Made some tradeoffs.
The advantages and disadvantages of ClickHouse are obvious. Whether or not to adopt it depends on the fit with the actual business scenario. The architecture that suits you is the best architecture.

reference


劼哥stone
1.5k 声望348 粉丝