
With the rapid growth of its business, Yandex.Metrica has become the world's third largest web traffic analysis platform, processing more than 20 billion tracking events every day. Handling such a staggering volume would be impossible without ClickHouse, the engine behind it. ClickHouse stores more than 20 trillion rows of data for Yandex.Metrica, returns 90% of queries within 1 second, and runs on a cluster of more than 400 servers. Although ClickHouse was originally developed for Yandex.Metrica, its outstanding performance has led to its adoption in dozens of other products within Yandex.

When I first encountered ClickHouse, my impression was that it seemed to defy the laws of physics: a system without weaknesses, almost too good to be real. ClickHouse appears able to satisfy every demand one could place on a high-performance, highly available OLAP database. This air of mystery aroused my curiosity.

When I first moved from the Hadoop ecosystem to ClickHouse, I felt considerable discomfort, because its technical "character" is completely different from what we were used to. If a database is a car, then ClickHouse is a manual-transmission race car: in many respects it is not as automated as other systems. Some of its concepts also differ from the usual understanding, especially around sharding and replication; sometimes data sharding even has to be done by hand. Only after using ClickHouse further did I gradually understand the purpose of these designs.

Certain designs that seem insufficiently automated in turn bring great flexibility in use. Compared with the databases of the Hadoop ecosystem, ClickHouse is closer to a "traditional" MPP database. It does not use the master-slave architecture common in the Hadoop ecosystem but a multi-master, peer-to-peer structure, and it is a ROLAP solution based on the relational model.

In this article, let us take a look at the core features of ClickHouse.

Core features of ClickHouse

ClickHouse is a columnar database with an MPP architecture, but MPP and columnar storage are hardly "rare" designs; many other database products have similar architectures. Why, then, does only ClickHouse achieve such outstanding performance? ClickHouse has evolved through four stages so far, and each stage kept the essence of the previous one and discarded its dross. It can be said that ClickHouse has absorbed the best of many technologies and pushed every detail to the extreme. Next, I will introduce some of its core features; it is the combined force of these features that makes ClickHouse so excellent.

1. Complete DBMS functionality

ClickHouse has complete management capabilities, so it can be called a DBMS (Database Management System), not merely a database. As a DBMS, it offers the basic functions shown below.

  • DDL (Data Definition Language): databases, tables, and views can be created, modified, or dropped dynamically, without restarting the service.
  • DML (Data Manipulation Language): data can be queried, inserted, modified, or deleted dynamically.
  • Access control: permissions on databases and tables can be set at user granularity to keep data secure.
  • Backup and recovery: data can be exported for backup and imported for recovery, meeting the requirements of production environments.
  • Distributed management: a cluster mode automatically manages multiple database nodes.

Only a few of the most representative functions are listed here, but they are enough to show why ClickHouse qualifies as a DBMS.
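As a quick illustration, here is a minimal sketch of these capabilities in action; the database, table, and field names are hypothetical:

-- DDL: create a database and a table on the fly, no service restart required
CREATE DATABASE IF NOT EXISTS demo;

CREATE TABLE demo.user_events
(
    event_date Date,
    user_id    UInt64,
    action     String
)
ENGINE = MergeTree()
ORDER BY (event_date, user_id);

-- DML: insert and query data dynamically
INSERT INTO demo.user_events VALUES ('2020-01-01', 1, 'click');

SELECT action, count() FROM demo.user_events GROUP BY action;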

2. Columnar storage and data compression

Columnar storage and data compression are essential features of a high-performance database. A popular view holds that if you want queries to run faster, the simplest and most effective way is to reduce the amount of data scanned and the amount of data transferred, and columnar storage and data compression help achieve exactly these two goals. The two are usually mentioned together because, generally speaking, columnar storage is a prerequisite for effective data compression.

Compared with row storage, columnar storage can effectively reduce the amount of data that a query needs to scan. A simple example illustrates this. Suppose a table A has 50 fields A1 to A50 and 100 rows of data, and we need to query the first 5 fields for analysis. This can be done with the following SQL:

SELECT A1,A2,A3,A4,A5 FROM A

If the data is stored by row, the database scans row by row, reading all 50 fields of each row, and then returns the 5 fields A1 to A5 from each row. It is easy to see that although only the first 5 fields are needed, all fields are actually scanned, because the data is organized by row. If the data is stored by column, this problem disappears: since the data is organized by column, the database can read exactly the five columns A1 to A5, avoiding redundant scanning.

Another advantage of columnar storage over row storage is its friendliness to data compression. An example briefly illustrates the essence of compression. Suppose two strings, abcdefghi and bcdefghi, are stored one after another and then compressed, as follows:

Before compression: abcdefghi_bcdefghi
After compression:  abcdefghi_(9,8)

As can be seen, the essence of compression is to scan the data with a certain window and encode a reference whenever a duplicate is found. For example, (9, 8) above means: the 8 bytes following the underscore repeat the 8-byte sequence that starts 9 bytes earlier, namely bcdefghi.

Real compression algorithms are naturally more complicated than this example, but the essence is the same. The more duplicates the data contains, the higher the compression ratio; the higher the compression ratio, the smaller the data; and the smaller the data, the faster it travels over the network and the less pressure it puts on network bandwidth and disk I/O. So which kind of data is most likely to contain duplicates? The answer is data belonging to the same column: since the values share a data type and real-world semantics, the chance of repetition is naturally higher.

ClickHouse is a columnar database: data is organized by column, values belonging to the same column are stored together, and different columns are stored in different files (this applies mainly to the MergeTree family of table engines). Data is compressed with the LZ4 algorithm by default, and in Yandex.Metrica's production environment the overall compression ratio reaches about 8:1 (17 PB uncompressed down to 2 PB compressed). Besides easing the pressure on I/O and storage, columnar storage also paves the way for vectorized execution.
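To observe the compression ratio of your own tables, the system tables expose per-column statistics. A minimal sketch, assuming the hypothetical demo.user_events table from earlier (the exact columns of system.columns may differ slightly across ClickHouse versions):

-- Compare on-disk (compressed) and logical (uncompressed) sizes per column
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = 'demo' AND table = 'user_events';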

3. Vectorized execution engine

There is a saying that "problems money can solve should not cost you time," and the industry has a similar quip: "if upgrading the hardware can solve the problem, don't optimize the program." Sometimes painstaking optimization of program logic brings less improvement than simply upgrading the hardware. Although this is only a joke and should not be taken literally, optimization at the hardware level is indeed one of the most direct and effective means of improvement, and vectorized execution is a typical example: this register-level hardware feature brings a severalfold performance improvement to upper-layer applications.

Vectorized execution can be simply seen as eliminating loops from a program. A vivid analogy: Xiao Hu runs a juice shop. Although the freshly squeezed apple juice is very popular, customers always complain that it is made too slowly. The shop has only one juicer, and each time Xiao Hu takes one apple from the basket, puts it in the juicer, and waits for the juice. If 8 customers each order a glass of apple juice, Xiao Hu has to repeat this process 8 times; at 5 minutes per glass, the whole batch takes 40 minutes. To speed things up, Xiao Hu increases the number of juicers from 1 to 8, so that he can take 8 apples from the basket at once and juice them simultaneously, producing 8 glasses in only 5 minutes. To make n glasses of juice, the non-vectorized approach loops n times with 1 juicer, while the vectorized approach runs once with n juicers.

Vectorized execution relies on the CPU's SIMD instructions. SIMD stands for Single Instruction, Multiple Data: a single instruction operates on multiple pieces of data. In modern computer architecture, it is one way to improve performance through data-level parallelism (the others being instruction-level and thread-level parallelism), and its principle is to perform parallel operations on data at the level of CPU registers.

In computer architecture, the storage system is a hierarchy; the storage hierarchy of a typical server is shown in Figure 1. Practical experience tells us that the closer a storage medium is to the CPU, the faster data can be accessed.

As the figure shows, the farther a medium is from the CPU, the slower the access: reading data from registers is roughly 300 times faster than from memory and 30 million times faster than from disk. Making use of the CPU's vectorized execution is therefore of great significance for program performance.

ClickHouse currently uses the SSE4.2 instruction set to implement vectorized execution.

4. Relational model and SQL query

Compared with NoSQL databases such as HBase and Redis, ClickHouse uses the relational model to describe data and provides the traditional database concepts (databases, tables, views, functions, and so on). Moreover, ClickHouse uses SQL as its query language (supporting most standard constructs such as GROUP BY, ORDER BY, JOIN, and IN), which makes it approachable and easy to learn. Relational databases and SQL are among the most widely used technologies in software, with an enormous user base, and because ClickHouse offers a standard SQL query interface, existing third-party analysis and visualization systems can integrate with it easily. Note that ClickHouse's SQL parsing is case-sensitive, which means that SELECT a and SELECT A have different semantics.
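For example, because identifiers are case-sensitive, the two statements below read two different columns (assuming a hypothetical table t that declares both a lowercase a and an uppercase A column):

SELECT a FROM t;  -- reads column "a"
SELECT A FROM t;  -- reads column "A"; an error if no such column exists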

Compared with models such as documents and key-value pairs, the relational model has stronger descriptive power and expresses the relationships between entities more clearly. More importantly, in the OLAP field a great deal of data modeling is based on the relational model (star schemas, snowflake schemas, and even flat wide tables). Because ClickHouse uses the relational model, the cost of migrating systems built on traditional relational databases or data warehouses to ClickHouse is lower, and prior experience carries over directly.

5. Diverse table engines

Perhaps because Yandex.Metrica was originally built on MySQL, traces of MySQL can be noticed in ClickHouse's design, and the table engine is one of them. Like MySQL, ClickHouse abstracts the storage layer and treats the table engine as an independent interface. As of this writing, ClickHouse has more than 20 table engines in six categories: merge tree, external storage, memory, file, interface, and others. Each table engine has its own characteristics, and users can choose a suitable engine according to the requirements of the actual business scenario.
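The engine is chosen per table in the CREATE TABLE statement. A brief sketch with three common choices (the table definitions are hypothetical):

-- MergeTree family: the workhorse for large analytical tables
CREATE TABLE hits_local (dt Date, url String) ENGINE = MergeTree() ORDER BY dt;

-- Memory: keeps all data in RAM, handy for small temporary datasets
CREATE TABLE hot_lookup (k String, v String) ENGINE = Memory;

-- TinyLog: a simple file engine suitable for small write-once tables
CREATE TABLE raw_dump (line String) ENGINE = TinyLog;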

Generally speaking, a general-purpose system means wider applicability and more scenarios it can adapt to. But generality can also mean mediocrity, because such a system cannot be optimal in every scenario.

In the world of software there is no universal system that fits every scenario; to excel at one thing, trade-offs must be made elsewhere. In fact, everything follows this principle, just like the albatross and the hummingbird: both are birds, yet their characteristics have produced completely different physiques. The albatross excels at long-distance flight, circling the earth in only 1 to 2 months; because it can glide for long stretches, flapping its wings as rarely as once every 5 days, its heart rate stays between 100 and 200 beats per minute. The hummingbird can hover and fly vertically, beating its wings 70 to 100 times per second, with an in-flight heart rate of up to 1,000 beats per minute. If we map these traits to database scenarios, the albatross represents a design that achieves high performance on ordinary hardware by processing data at a coarse granularity, in batches; the hummingbird represents a design that processes data at a fine granularity and requires high-performance hardware.

The benefit of an independent table-engine design is obvious: a specific engine can serve a specific scenario very flexibly. Simple scenarios can use simple engines to reduce cost, while complex scenarios also have suitable options.

6. Multithreading and distributed processing

ClickHouse has almost all the typical traits of a modern high-performance database; it leaves hardly any performance technique unused, and widely adopted techniques such as multithreading and distributed processing are naturally no exception.

If vectorized execution improves performance through data-level parallelism, then multithreading does so through thread-level parallelism. Unlike SIMD, which sits on the underlying hardware, thread-level parallelism is usually controlled at a higher level, by software. Multi-processor architectures are now ubiquitous, so servers on the market have strong multi-core, multi-threaded processing capabilities. Since SIMD is not well suited to branch-heavy scenarios, ClickHouse also makes heavy use of multithreading, which complements vectorized execution.

If one basket cannot hold all the eggs, use more baskets: that is the basic divide-and-conquer idea behind distributed design. Likewise, if one server's capacity is tight, the resources of multiple servers can be used cooperatively. To achieve this, the data itself must first be distributed, because in the distributed field there is a golden rule: moving computation is cheaper than moving data. Transferring data between servers over the network is expensive, so rather than moving the data it is smarter to distribute it to the servers in advance and push the computation to where the data lives. For data access, ClickHouse supports both partitioning (vertical scaling, exploiting multithreading) and sharding (horizontal scaling, exploiting distribution). It can be said that multithreading and distributed processing are applied to the extreme.

7. Multi-master architecture

Distributed systems such as HDFS, Spark, HBase, and Elasticsearch all use a master-slave architecture, with a control node acting as the leader to coordinate the whole. ClickHouse instead uses a multi-master architecture: every node in the cluster plays an equal role, and a client gets the same result no matter which node it accesses. This architecture has many advantages: the peer roles make the system simpler, with no need to distinguish master, data, and compute nodes, since all nodes have the same function. It also naturally avoids single points of failure and suits multi-datacenter, active-active deployments.

8. Online query

ClickHouse is often compared with other analytical databases, such as Vertica, SparkSQL, Hive, and Elasticsearch, and it does have much in common with them: they all support massive-data query scenarios, are distributed in architecture, and offer columnar storage, data sharding, computation push-down, and similar features. This also shows that ClickHouse absorbed every trick available when it was designed. Compared with these systems, however, ClickHouse also has clear advantages: commercial software such as Vertica is expensive; systems such as SparkSQL and Hive cannot return 90% of queries within one second, and complex queries over large data volumes may take minutes to answer; and search engines such as Elasticsearch struggle with aggregation queries over billions of rows.

As ClickHouse's slogan puts it, other open source systems are too slow and commercial systems are too expensive; only ClickHouse strikes a good balance between cost and performance: fast and open source. ClickHouse deservedly embodies the word "online": even for complex query scenarios it responds extremely quickly, without any preprocessing of the data.

9. Data sharding and distributed queries

Data sharding is the horizontal division of data, an effective means of overcoming storage and query bottlenecks in massive-data scenarios and another embodiment of divide and conquer. ClickHouse supports sharding, and sharding depends on the cluster: each cluster consists of one or more shards, and each shard corresponds to one ClickHouse service node. The maximum number of shards therefore depends on the number of nodes (one shard can map to only one service node).

ClickHouse does not offer the highly automated sharding of some other distributed systems. Instead it provides the concepts of the local table and the distributed table. A local table is equivalent to one shard of the data. A distributed table stores no data itself; it is an access proxy for local tables, playing a role similar to database middleware. Through a distributed table, access to multiple data shards can be proxied, which realizes distributed queries.

This design resembles the classic database-and-table sharding approach and is very flexible. For example, when a business system first goes online and the data volume is still small, a single-node local table (a single shard) is enough. As the business grows and data accumulates, new shards are added to split the data, and distributed queries are performed through the distributed table. Like a manual-transmission race car, it puts all the options in the user's hands.
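A minimal sketch of this division of labor, assuming a cluster named ch_cluster is already declared in the server configuration (all object names here are hypothetical):

-- The local table physically stores one shard of the data (created on every node)
CREATE TABLE demo.hits_local
(
    event_date Date,
    user_id    UInt64
)
ENGINE = MergeTree()
ORDER BY event_date;

-- The distributed table stores nothing; it proxies demo.hits_local on every
-- shard of ch_cluster, using rand() as the sharding key for writes
CREATE TABLE demo.hits_all AS demo.hits_local
ENGINE = Distributed(ch_cluster, demo, hits_local, rand());

-- Queries against hits_all fan out to all shards and merge the results
SELECT count() FROM demo.hits_all;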

Architecture design of ClickHouse

At present, public information about ClickHouse is relatively scarce; for example, it is hard to find complete material on its architecture design, and there is not even an overall architecture diagram. I think this is one reason why a piece of open source software can still seem so mysterious. Even so, clues can be found in scattered materials. Next, some concepts from ClickHouse's underlying design are explained; they help us understand how ClickHouse works.

1. Column and Field

Column and Field are the most basic mapping units of ClickHouse data. As a thoroughly columnar database, ClickHouse stores data by column, and a column of data in memory is represented by a Column object. The Column object is split into interface and implementation: the IColumn interface defines methods for relational operations on data, such as insertRangeFrom and insertFrom for inserting data, cut for paging, and filter for filtering. The concrete implementations of these methods are provided by objects for each data type, such as ColumnString, ColumnArray, and ColumnTuple. In most cases ClickHouse operates on whole columns, but there are exceptions: to operate on a single concrete value (one row of one column), a Field object is used, which represents a single value. Unlike the generic design of Column, Field uses an aggregate design pattern: the Field object aggregates 13 data types (such as Null, UInt64, String, and Array) and their corresponding processing logic.

2. DataType

DataType is responsible for serializing and deserializing data. The IDataType interface defines many serialization methods that come in pairs, such as serializeBinary and deserializeBinary, or serializeTextJSON and deserializeTextJSON, covering common formats such as binary, text, JSON, XML, CSV, and Protobuf. IDataType also follows a generic design pattern, with the concrete logic carried by instances for each data type, such as DataTypeString, DataTypeArray, and DataTypeTuple.

Although DataType handles serialization, it does not read data directly; instead it obtains data from Column and Field objects. The implementation classes of DataType aggregate the Column and Field objects of the corresponding data type; for example, DataTypeString references the string-typed ColumnString, DataTypeArray references the array-typed ColumnArray, and so on.

3. Block and Block streams

Data operations inside ClickHouse are oriented around Block objects and take the form of streams. Although Column and Field form the basic mapping units of data, they still lack information needed for actual operations, such as the data type and the column name. So ClickHouse designed the Block object, which can be regarded as a subset of a data table. In essence, a Block is a triple of data object, data type, and column name, that is, (Column, DataType, column name string). Column provides the ability to read data, and DataType knows how to serialize and deserialize it, so Block, by further abstracting and encapsulating these objects, simplifies the whole process: a complete series of data operations can be performed through the Block object alone. In the concrete implementation, Block does not aggregate Column and DataType directly but references them indirectly through ColumnWithTypeAndName objects.

With the Block object in place, the design of Block streams follows naturally. There are two sets of top-level interfaces for stream operations: IBlockInputStream handles data reading and relational operations, while IBlockOutputStream handles writing data to the next stage. Block streams also use a generic design pattern: each kind of operation on the data is ultimately turned into one of these stream implementations. The IBlockInputStream interface defines several read virtual methods for reading data, with the concrete logic filled in by its implementation classes.

There are more than 60 implementation classes of IBlockInputStream, covering every aspect of reading data in ClickHouse. They fall roughly into three categories: the first handles DDL operations, such as DDLQueryStatusInputStream; the second handles relational operations, such as LimitBlockInputStream, JoinBlockInputStream, and AggregatingBlockInputStream; the third corresponds to the table engines, each engine having its own BlockInputStream implementation, such as MergeTreeBaseSelectBlockInputStream (the MergeTree engine), TinyLogBlockInputStream (the TinyLog engine), and KafkaBlockInputStream (the Kafka engine).

IBlockOutputStream is designed in exactly the same spirit as IBlockInputStream; the interface defines several write virtual methods for writing data. It has far fewer implementation classes, only about 20 in total, and most of them serve the table engines, writing data to the next stage or its final destination, for example MergeTreeBlockOutputStream, TinyLogBlockOutputStream, and StorageFileBlockOutputStream.

4. Table

There is no Table object as such in the underlying design; the IStorage interface is used directly to represent a data table. The table engine is a distinguishing feature of ClickHouse, and each engine is implemented by a different subclass, such as IStorageSystemOneBlock (system tables), StorageMergeTree (the MergeTree engine), and StorageTinyLog (the log engine). The IStorage interface defines DDL operations (ALTER, RENAME, OPTIMIZE, DROP, and so on) as well as read and write methods, responsible for data definition, querying, and writing respectively. When data is queried, IStorage returns the raw data of the specified columns according to the AST of the query; all further processing, computation, and filtering are handed over to Interpreter objects. An operation on a Table typically follows this path: receive the AST query, return the data of the specified columns according to the AST, and hand the data to the Interpreter for further processing.

5. Parser and Interpreter

Parser and Interpreter are two very important groups of interfaces: Parser builds the AST object, and Interpreter interprets the AST and builds the query's execution pipeline. Together with IStorage, they string the whole query process together. A Parser parses a SQL statement into an AST syntax tree by recursive descent. Different SQL statements are parsed by different Parser implementations: ParserRenameQuery, ParserDropQuery, and ParserAlterQuery parse DDL statements, ParserInsertQuery parses INSERT statements, and ParserSelectQuery parses SELECT statements.

The Interpreter acts like a service layer, tying the whole query process together and gathering the resources it needs according to its type. First it parses the AST object; then it executes the "business logic" (branching, setting parameters, calling interfaces, and so on); finally it returns an IBlock object, on which the query execution pipeline is established in the form of threads.

6. Functions and Aggregate Functions

ClickHouse provides two kinds of functions: ordinary functions and aggregate functions. Ordinary functions are defined by the IFunction interface, with dozens of implementations such as FunctionFormatDateTime and FunctionSubstring. Besides common functions (arithmetic, date conversion, and so on), there are also very practical ones, such as URL extraction functions and IP address masking functions. Ordinary functions are stateless, and their effect applies to each row of data, although in execution they do not operate row by row but act on whole columns using vectorization.

Aggregate functions are defined by the IAggregateFunction interface. Unlike stateless ordinary functions, aggregate functions are stateful. Take COUNT as an example: the state of its AggregateFunctionCount is recorded as an integer UInt64. The state of an aggregate function can be serialized and deserialized, so it can be transferred between distributed nodes to enable incremental computation.
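This serializable state surfaces directly in SQL through the -State and -Merge combinators. A sketch of incremental computation built on top of it, with hypothetical table and column names:

-- Store the intermediate state of uniq() rather than a final number
CREATE TABLE demo.daily_uniques
(
    day     Date,
    uniques AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree()
ORDER BY day;

-- uniqState() emits the serialized partial state for each day
INSERT INTO demo.daily_uniques
SELECT toDate(event_time) AS day, uniqState(user_id)
FROM demo.events
GROUP BY day;

-- uniqMerge() combines partial states into the final distinct count
SELECT day, uniqMerge(uniques) AS distinct_users
FROM demo.daily_uniques
GROUP BY day;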

7. Cluster and Replication

A ClickHouse cluster consists of shards, and each shard consists of replicas. This layered notion is common in popular distributed systems; in Elasticsearch, for example, an index consists of shards and replicas, and replicas can be regarded as a special kind of shard. If an index has 5 shards with a replica factor of 1, the index will have 10 shards in total (each shard paired with 1 replica).

If you try to understand ClickHouse's sharding with the same mental model, you will soon get stuck. Some of ClickHouse's designs always seem to go their own way, and clusters and shards are one of them. A few distinctive traits stand out.

  • One ClickHouse node can hold only 1 shard, which means that implementing 1 shard with 1 replica requires at least 2 service nodes.
  • The shard is only a logical concept; its physical storage is still carried by replicas.

Code listing 1 shows an example ClickHouse cluster configuration. Read literally, it defines a custom cluster ch_cluster with 1 shard and 1 replica, the replica carried by the service node 10.37.129.6.
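The listing itself did not survive in this repost. Based on the description, it would look roughly like the following fragment of the server configuration (the port and surrounding structure are assumptions):

<remote_servers>
    <ch_cluster>
        <shard>
            <replica>
                <host>10.37.129.6</host>
                <port>9000</port>
            </replica>
        </shard>
    </ch_cluster>
</remote_servers>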

In essence, this 1-shard, 1-replica configuration has only 1 physical copy of the data in ClickHouse, so its correct semantics are 1 shard, 0 replicas. The shard is more like a logical-level grouping; at the physical storage level, it is replicas that represent both shards and replicas. Therefore, a configuration that truly means 1 shard and 1 replica should instead declare 1 shard with 2 replicas, as shown in code listing 2.
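Code listing 2 is also missing from this repost; a configuration with the 1-shard, 2-replica semantics would plausibly look like this (the second host and the ports are assumptions):

<remote_servers>
    <ch_cluster>
        <shard>
            <replica>
                <host>10.37.129.6</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>10.37.129.7</host>
                <port>9000</port>
            </replica>
        </shard>
    </ch_cluster>
</remote_servers>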

Why is ClickHouse so fast

Many users have long had this question: why is ClickHouse so fast? The introduction above offers sound explanations: ClickHouse is fast because it uses columnar storage; it is fast because it uses a vectorized engine. These explanations hold up, yet they do not dispel all doubt, because these techniques are no secret; many databases in the world use them, and still none is as fast as ClickHouse. So let us explore ClickHouse's secret from another angle.

First, a question for the reader: when designing a software architecture, should the design be top-down or bottom-up? Conventionally, and in my own view, it is naturally top-down, and we are usually taught to do top-level design first. ClickHouse's design, however, is bottom-up. Its prototype system was born as early as 2008, and at birth it had no grand plan; on the contrary, its goal was very simple: perform GROUP BY queries and filtering as fast as possible. How did its authors practice bottom-up design?

1. Focus on the hardware: think first, act later

Start designing from the capabilities of the hardware. At the outset, at least the following questions need answering.

  • What level of hardware will we run on? CPU, memory, disk, network, and so on.
  • What performance do we need to achieve on that hardware? Latency, throughput, and so on.
  • What data structures will we use? String, HashTable, Vector, and so on.
  • How will those data structures behave on our hardware?

If you can answer these questions, you can estimate rough performance before implementing a single feature. Hence, to squeeze the most out of the hardware, ClickHouse performs GROUP BY in memory and loads data with a HashTable. The team also pays close attention to the CPU's L3 cache, because one L3 cache miss costs 70 to 100 ns: on a single-core CPU that wastes on the order of 40 million operations per second, and across a 32-thread CPU perhaps 500 million per second. Do not underestimate such details; added together the numbers are striking. It is because of them that ClickHouse can scan some 175 million rows per second in benchmark queries.

2. Algorithms first, abstraction later

People often say that choice matters more than effort, and indeed effort spent on the wrong route is wasted. In ClickHouse's low-level implementation, certain scenarios recur constantly: substring search, array sorting, the use of hash tables. How do you squeeze out maximum performance? The choice of algorithm is the top priority. Take strings: there is an entire book on string search, the "Handbook of Exact String Matching Algorithms," which lists 35 common algorithms. Guess which one ClickHouse uses? None of them. Why? Because none is fast enough. For string search, ClickHouse ultimately chose different algorithms for different scenarios: for constant patterns it uses Volnitsky's algorithm; for non-constant patterns it uses the CPU's vectorization, applying SIMD brute force; and regular-expression matching uses the re2 and hyperscan libraries. Performance is the first criterion for choosing an algorithm.

3. Dare to try new things, and switch what does not work

Beyond strings, other scenarios are treated similarly: ClickHouse uses the most suitable, fastest algorithm. If a new algorithm with strong claims appears, the ClickHouse team immediately brings it in and verifies it: if it works well, it stays; if the performance disappoints, it is discarded.

4. Special optimization for specific scenarios

For the same scenario under different conditions, different implementations are chosen to maximize performance. This idea already appeared in the discussion of string search, where different algorithms are selected for different cases. There are many similar examples; for instance, the approximate distinct-count function uniqCombined chooses different algorithms by data volume: a plain Array when the data is small, a HashSet at medium volume, and the HyperLogLog algorithm when the data is large.
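This adaptivity is invisible to the caller; the same call works at any scale (assuming the hypothetical demo.events table from earlier):

-- uniqCombined switches internally between Array, HashSet, and HyperLogLog
SELECT uniqCombined(user_id) AS approx_distinct_users
FROM demo.events;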

For scenarios with a sufficiently regular data structure, loop unrolling is applied through code-generation techniques to reduce loop counts. And then there is the well-known heavy weapon, vectorized execution: SIMD is used widely in text conversion, data filtering, data decompression, JSON parsing, and other scenarios. Compared with the CPU's scalar path alone, register-level brute force amounts to an overwhelming advantage.

5. Continuous testing and continuous improvement

Working on the above details alone is not enough to build something as powerful as ClickHouse; a mechanism for continuous verification and improvement is also required. Thanks to Yandex's natural advantage, ClickHouse is routinely tested with real data, which keeps test scenarios realistic. ClickHouse is also among the fastest-moving open source projects I have seen, releasing a version almost every month. Without a reliable continuous-integration environment that would be impossible, and it is precisely this release cadence that lets ClickHouse iterate and improve so quickly.

Therefore, ClickHouse's black magic is not any single technique but a bottom-up design philosophy that pursues the utmost performance. That is the secret of its speed.

Summary

In this article we took a quick tour of the core features and logical architecture of ClickHouse, the backbone of Yandex.Metrica, the world's third largest web traffic analysis platform. The walk through its core features revealed why ClickHouse is so powerful: columnar storage, the vectorized execution engine, and the table engines are all among its best traits.

The architecture section further demonstrated some of ClickHouse's design ideas, such as Column, Field, Block, and Cluster; understanding them helps us understand and use ClickHouse better. Finally, we explored from another angle the secret of why ClickHouse is so fast.

Author: Zhu Kai. The original source of this article is unknown; it is shared here for learning purposes, and the copyright belongs to the original author.

