【ClickHouse Column】Why ClickHouse's performance is so excellent

In the previous article of the "ClickHouse Column", "The Difference and Connection Between Database and Data Warehouse", we introduced what a database is, what a data warehouse is, and how the two differ and relate. ClickHouse is positioned as a data warehouse, so after reading that article you already have a sense of which application scenarios ClickHouse suits and which it does not.
In the following sections, we continue with some of ClickHouse's most important features, to help you better understand its application scenarios and why it is called a "performance monster".

1. Columnar data storage

One of the reasons for ClickHouse's strong performance is its columnar storage design. As an example, suppose we have a student information table named student:

id	name	age
1	little red	7
2	Xiao Ming	8
3	lucy	7

If this table uses row-based storage, the values are laid out on disk record by record:

1, little red, 7 | 2, Xiao Ming, 8 | 3, lucy, 7

If this table uses columnar storage, the values are laid out on disk column by column:

1, 2, 3 | little red, Xiao Ming, lucy | 7, 8, 7

Comparing the two layouts above, we can see the advantages of columnar storage.

For example, to find the maximum age among the students, columnar storage only needs to locate the starting address of the age column and then read its values sequentially to compute the result. With row-based storage, the age values are not contiguous, so the engine must either look up each value through an index or scan the full table to collect all the ages. Therefore, when a query filters or aggregates a single column, columnar storage performs far better than row-based storage.
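The difference between the two access patterns can be sketched in a few lines of Python. This is only a toy in-memory model of the student table, not ClickHouse's actual on-disk format; the names row_store, column_store, and the two helper functions are made up for illustration.

```python
# Toy model of the student table; not ClickHouse's actual on-disk format.
rows = [
    (1, "little red", 7),
    (2, "Xiao Ming", 8),
    (3, "lucy", 7),
]

# Row-based layout: all fields of one record sit next to each other.
row_store = [field for row in rows for field in row]

# Columnar layout: all values of one column sit next to each other.
column_store = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "age": [r[2] for r in rows],
}


def max_age_row_store(store, num_columns=3, age_index=2):
    """Step through every record and pick out its age field."""
    return max(store[i] for i in range(age_index, len(store), num_columns))


def max_age_column_store(store):
    """Read the single contiguous age column and nothing else."""
    return max(store["age"])


print(max_age_row_store(row_store))        # 8
print(max_age_column_store(column_store))  # 8
```

Both functions return the same answer, but the columnar version never touches the id or name values, which is where the saving comes from on wide tables.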
In addition, because the values in a column usually share the same data type, columnar data compresses much better. The compression ratio can be more than 10 times that of row storage, which saves a lot of disk and memory space and can effectively reduce server costs.
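To see why same-typed, adjacent values compress well, here is a minimal sketch using run-length encoding. Run-length encoding is chosen only because it is easy to read; it is a stand-in for illustration and not a codec that ClickHouse is claimed to use here. The ages list is hypothetical sample data.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical age column for ten students, stored contiguously.
ages = [7, 7, 7, 7, 8, 8, 8, 9, 9, 9]

encoded = run_length_encode(ages)
print(encoded)  # [(7, 4), (8, 3), (9, 3)]
```

Ten values shrink to three pairs. In a row-based layout the ages would be interleaved with ids and names, so runs like these would not exist.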

2. SQL support and excellent performance

At present, most open source columnar databases do not support SQL. Even those that claim to often offer only pseudo-SQL with limited capabilities.

However, in the author's experiments, ClickHouse's support for standard SQL is comparable to that of traditional relational databases. Although I recommend using wide tables for data storage in a ClickHouse data warehouse, that does not mean ClickHouse lacks the ability to run multi-table join queries.

You can visit https://clickhouse.com/benchmark/dbms/ to see ClickHouse's official online performance comparison for various statistical SQL queries.

3. Distributed sharded storage cluster

ClickHouse supports not only stand-alone mode but also a cluster mode with distributed, sharded data storage. The data is stored as shards on multiple server nodes, so ClickHouse can use the combined computing power of the cluster to return statistical results quickly. This sharding and distributed storage mechanism lets ClickHouse scale horizontally and analyze and process massive amounts of data.
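The idea of using the whole cluster for one query can be sketched as a scatter-gather computation: each shard aggregates its own slice, and the partial results are merged. The sketch below is a simplified in-process model under assumed data; the shards list and function names are hypothetical, and threads stand in for separate servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each one holds a slice of the age column.
shards = [
    [7, 8, 7],
    [9, 6],
    [8, 8, 10],
]


def partial_aggregate(ages):
    """Work done locally on one shard: return (sum, count)."""
    return sum(ages), len(ages)


def distributed_average(all_shards):
    """Fan the query out to every shard, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_aggregate, all_shards))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


print(distributed_average(shards))  # 7.875
```

Only the small (sum, count) pairs travel between nodes, not the raw rows, which is why adding shards can speed up this kind of statistic.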

There are many ways to shard data: data can be written randomly to different shards, sent to a specified shard, or sharded by hash value. Of course, we can also define a custom sharding rule.
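The sharding strategies listed above can be sketched as small routing functions that map a row to a shard number. This is an illustration only: NUM_SHARDS, the function names, and the age-bracket rule are assumptions made for the example, not ClickHouse configuration.

```python
import random
import zlib

NUM_SHARDS = 3  # assumed cluster size for this sketch


def random_shard(_row):
    """Random placement: spreads load, but a given key can land anywhere."""
    return random.randrange(NUM_SHARDS)


def fixed_shard(_row, shard=0):
    """Explicit placement: the writer names the target shard."""
    return shard


def hash_shard(row, key="id"):
    """Hash placement: the same key always maps to the same shard."""
    digest = zlib.crc32(str(row[key]).encode("utf-8"))
    return digest % NUM_SHARDS


def custom_shard(row):
    """Custom rule (hypothetical): route by age bracket."""
    return 0 if row["age"] < 8 else 1


row = {"id": 2, "name": "Xiao Ming", "age": 8}
print(hash_shard(row) == hash_shard(dict(row)))  # True
print(custom_shard(row))                         # 1
```

Hash placement is usually the one to reach for when later queries group or join on the sharding key, because all rows for one key end up on the same shard.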

Distributed storage spreads data across the servers in the cluster in the form of shards. To keep the data safe, each shard has multiple replicas, and the replicas are also stored on different servers, so that even if some servers go down, the ClickHouse cluster can still remain available.
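The availability argument can be made concrete with a small model: a shard stays readable as long as at least one of its replicas is up. The cluster layout and server names below are hypothetical, and real replica selection involves more than taking the first live server.

```python
# Hypothetical cluster: shard name -> list of (server, is_up) replicas.
cluster = {
    "shard_1": [("server_a", False), ("server_b", True)],
    "shard_2": [("server_c", True), ("server_d", True)],
}


def pick_replica(replicas):
    """Return the first replica that is still up, or None if all are down."""
    for server, is_up in replicas:
        if is_up:
            return server
    return None


def cluster_available(shards):
    """Every shard needs at least one live replica to answer a full query."""
    return all(pick_replica(replicas) is not None for replicas in shards.values())


print(pick_replica(cluster["shard_1"]))  # server_b
print(cluster_available(cluster))        # True
```

Here server_a is down, yet shard_1 is still served by server_b, so the cluster as a whole remains available.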

4. Support sequential storage

Unlike a traditional RDBMS, ClickHouse supports specifying the sorting fields with the ORDER BY clause when creating a table. When data is written to the table, it is sorted first, and the rows are then stored in the order of the sorting fields.
In later queries, filtering, and statistics, data in contiguous blocks can be read quickly, which improves query performance. This sorted storage has a wide range of application scenarios. For example, stock K-line charts are ordered by trading day, so presetting the trading time as the sorting field and storing the data in that order effectively improves statistical performance.
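Why sorted storage helps range queries can be sketched with a binary search over a sorted key column. The trading days and prices below are made-up sample data, and the sketch searches individual rows, whereas a real engine would typically index blocks of rows; treat it as a simplified model rather than ClickHouse's implementation.

```python
from bisect import bisect_left, bisect_right

# Hypothetical daily closing prices, kept sorted by trading day.
trading_days = ["2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07", "2022-01-10"]
closing_prices = [10.2, 10.5, 10.1, 10.8, 11.0]


def price_range(start_day, end_day):
    """Find the contiguous slice for [start_day, end_day] with two binary searches."""
    lo = bisect_left(trading_days, start_day)
    hi = bisect_right(trading_days, end_day)
    return closing_prices[lo:hi]


print(price_range("2022-01-05", "2022-01-07"))  # [10.5, 10.1, 10.8]
```

Because the rows are already in key order, the matching data is one contiguous slice, so the query needs two lookups and one sequential read instead of a scan over every row.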

5. Support data TTL

In databases used for statistical analysis, we usually need a data TTL capability, that is, some data should be deleted automatically after a certain storage period. ClickHouse provides this capability, which reduces the burden on operations and maintenance staff.

ClickHouse supports TTL at the following granularities:

Column-level TTL: set a TTL on a column. When values in the column expire, they are automatically replaced with the column's default value; once all the values of the column in a data part have expired, the column is deleted from that data part.
Row-level TTL: set a TTL on rows. When a row expires, the row is deleted directly.
Partition-level TTL: ClickHouse supports data partitioning with a TTL. When a partition expires, the partition is deleted directly.
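The row-level and column-level rules above can be modelled in a few lines. This is a simulation of the semantics only: the retention periods, the created_at field, and apply_ttl are assumptions for the example, and a real engine applies expiry in the background rather than through a function call like this.

```python
from datetime import datetime, timedelta

ROW_TTL = timedelta(days=30)   # assumed retention for whole rows
NAME_TTL = timedelta(days=7)   # assumed retention for the "name" column
NAME_DEFAULT = ""              # default value used once a name expires


def apply_ttl(rows, now):
    """Drop expired rows and reset expired column values to the default."""
    kept = []
    for row in rows:
        age = now - row["created_at"]
        if age > ROW_TTL:
            continue  # row-level TTL: the whole row is removed
        if age > NAME_TTL:
            row = {**row, "name": NAME_DEFAULT}  # column-level TTL
        kept.append(row)
    return kept


now = datetime(2022, 3, 1)
rows = [
    {"id": 1, "name": "little red", "created_at": datetime(2022, 1, 1)},
    {"id": 2, "name": "Xiao Ming", "created_at": datetime(2022, 2, 15)},
    {"id": 3, "name": "lucy", "created_at": datetime(2022, 2, 27)},
]

result = apply_ttl(rows, now)
print([(r["id"], r["name"]) for r in result])  # [(2, ''), (3, 'lucy')]
```

Row 1 is older than the row TTL and disappears, row 2 survives but its name falls back to the default, and row 3 is untouched.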
