Factors Affecting Database Selection
- Data volume: whether the amount of data is large; a very large single table will stress the database's performance
- Data structure: structured (every record has the same structure) or unstructured (different records can have different structures)
- Table width: does a record have 10 fields, or hundreds or even thousands of fields
- Data attributes: basic data (such as user information), business data (such as user behavior), auxiliary data (such as logs), cached data
- Whether transactions are required: a transaction consists of multiple operations that must either all succeed or all be rolled back; partial success is not allowed (see the sketch after this list)
- Latency requirements: whether there are requirements on write latency or read latency; for example, some services tolerate high write latency but require low read latency
- Query pattern: for example, some businesses query a few columns across a large number of records, while others query all columns of a few records
- Sorting requirements: for example, some businesses operate on time-series data
- Reliability requirements: tolerance for data loss
- Consistency requirements: whether reads must always return the most recently written data
- CRUD requirements: some businesses need fast create/read/update/delete of individual records (such as user information); some need batch import; some never modify or delete individual records (such as logs and user-behavior data); some retrieve only small amounts of data (such as logs); some need to read large amounts of data quickly (such as report display); and some need to read and compute over large amounts of data (such as user-behavior analysis)
- Whether multi-table operations are required
Different businesses have different requirements for databases
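As a minimal illustration of the transaction requirement above, a bank transfer sketched in standard SQL (the accounts table and amounts are hypothetical): both updates commit together or neither takes effect.
-- Hypothetical bank-transfer transaction: both updates succeed or both roll back
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- on any error, issue ROLLBACK instead of COMMIT
COMMIT;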
SQL Database & NoSQL Database
SQL databases are traditional relational databases
- Fixed, predefined table structure
- Structured data
- Requires predefined data types
- Suited to moderate data and query volumes; when the data volume is large, tables must be split (sharding / sub-tables)
- Strong support for data consistency, integrity constraints, transactions, and reliability
- Supports multi-table Join operations
- Supports referential integrity between tables; deleting a record in table A may require first deleting related records in table B
- Powerful SQL for insert, delete, update, and query
- General-purpose and technologically mature
- Insufficient performance for very large data volumes
- Limited performance under high concurrency
- Not suitable for unstructured data
- Difficult to scale out
Common SQL databases such as Oracle, MySQL, PostgreSQL, SQLite
NoSQL generally refers to non-relational databases
- More flexible storage models, such as columnar storage, key-value storage, document storage, and graph storage
- Supports unstructured data
- Some do not need predefined data types, and some do not even need predefined tables
- Supports large data volumes
- Most support distributed deployment
- Good scalability
- Relatively strong basic-query and high-concurrency capability (achieved by being unstructured and distributed, and by sacrificing features such as consistency, integrity, and transactions)
- Weaker data-consistency guarantees
- Transactions are usually not supported, or only in a limited way
- Integrity constraints are usually not supported, so complex business scenarios are poorly supported
- Multi-table Joins are usually not supported, or only in a limited way
- Uses a non-SQL or SQL-like query language with relatively weak functionality; some do not even support modifying or deleting data
- Less general-purpose; the technologies are diverse and the market changes quickly
Common NoSQL databases include:
- Columnar: HBase, Cassandra, ClickHouse
- Key-value: Redis, Memcached
- Document: MongoDB
- Time series: InfluxDB, Prometheus
- Search: Elasticsearch
SQL and NoSQL are a complementary relationship, applied in different scenarios.
OLTP & OLAP
OLTP (On-Line Transaction Processing)
- Mainly handles real-time transaction processing
- Examples: maintaining basic user information, processing orders and contracts, handling bank transfers, enterprise ERP and OA systems
- Frequent, real-time inserts, deletes, and updates on small amounts of data, often a single record
- The database is updated frequently
- Usually requires normalization, real-time response, stability, transactions, consistency, integrity, and so on
- Operations are relatively fixed; an order business, for example, tends to run the same fixed set of operations
- The database schema is mainly modeled in 3NF or BCNF
OLAP (On-Line Analytical Processing)
- Data warehousing: mainly analyzes historical data to support business decision-making
- Examples: analyzing large volumes of user behavior, or analyzing device status, usage, and performance
- Less frequent operations that read, aggregate, compute, and analyze large amounts of data; real-time requirements are low, but throughput requirements are high
- Tables usually have many columns, but each analysis only touches a few of them
- Data is usually bulk-imported
- Data is usually not modified after import; workloads are read-heavy (write little, read a lot)
- Requirements for normalization, transactions, consistency, and integrity are low; even a failed query usually has no real consequence
- Operations are flexible; on a massive user-behavior table, for example, many different analyses can be run from different angles
- The schema is mainly modeled as a star or snowflake schema (see the sketch after this list)
- Before high-performance OLAP engines were available, the traditional approach was to build T+1 offline data through offline batch jobs, which lags behind the source data
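A minimal sketch of a star schema and a typical OLAP query, as mentioned above (table and column names are hypothetical): a fact table references dimension tables, and the query scans many rows but only a few columns before aggregating.
-- Hypothetical star schema: one fact table referencing two dimension tables
CREATE TABLE dim_user    (user_id INT PRIMARY KEY, country VARCHAR(50));
CREATE TABLE dim_product (product_id INT PRIMARY KEY, category VARCHAR(50));
CREATE TABLE fact_sales (
    sale_id    BIGINT PRIMARY KEY,
    user_id    INT,
    product_id INT,
    sale_date  DATE,
    amount     DECIMAL(12, 2)
);
-- Typical OLAP query: aggregate a few columns over many rows
SELECT u.country, p.category, SUM(f.amount) AS total_amount
FROM fact_sales f
JOIN dim_user u    ON f.user_id = u.user_id
JOIN dim_product p ON f.product_id = p.product_id
WHERE f.sale_date >= DATE '2024-01-01'
GROUP BY u.country, p.category;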
OLTP usually uses a traditional relational database. If the data volume is large and the tables need to be sharded, and the requirements for transactions, consistency, and integrity are not high, NoSQL can also be used.
OLAP usually uses NoSQL, and traditional relational databases can also be used if the amount of data is not large.
Relational databases: Oracle, SQL Server, MySQL, PostgreSQL, SQLite
Oracle: a commercial, closed-source database from Oracle; supports all mainstream platforms; good performance, rich functionality, good stability and security; supports large data volumes; relatively complex and expensive.
SQL Server: a commercial database developed by Microsoft; traditionally tied to Windows, though recent versions also run on Linux.
MySQL: an open-source database now owned by Oracle; supports many operating systems; small footprint and comparatively limited functionality; simple operations perform well, while complex operations perform poorly.
PostgreSQL: a fully open-source, free project under a BSD-style license; supports many operating systems; more powerful functionality; works well with a variety of open-source tools.
SQLite: open source, lightweight, serverless, zero-configuration; a database is just a single file; operations run inside the application; uses few resources; suitable for embedded or small applications.
Application scenarios
Oracle is mostly used in demanding fields such as banking; MySQL and PostgreSQL are mostly used in the Internet industry where requirements are less strict; SQLite is used for embedded scenarios or as an in-application database; SQL Server is mostly used on Windows servers.
HBase (wide table, columnar store, key-value store, NoSQL, OLTP)
- Built on Hadoop's HDFS distributed file system
- A distributed database; requires ZooKeeper for coordination between nodes
- Supports wide tables and unstructured data; columns and data types do not need to be predefined
- Columnar storage: each HFile stores data for only one column family, a column family can have multiple HFiles, and HFiles store data in Key-Value format, where the Key is a combination of rowkey, column family, column, and timestamp; rowkeys are stored in sorted order within an HFile, and the value is the cell value of the column
- Supports massive data (tables with hundreds of billions of rows)
- Data is written to memory first and flushed to disk when a threshold is reached, which gives good write performance but uses a lot of memory
- Does not support SQL or Join; has its own command syntax and supports create, read, update, and delete operations
- Automatic partitioning, load balancing, and linear scalability
- Automatic failover
- Strong consistency (each Region is served by only one Region Server, which makes strong consistency easy to achieve)
- CP model (availability is not guaranteed: each Region is served by only one Region Server, so if that server goes down, the Region is temporarily unavailable while it is migrated)
- Transactions and secondary indexes are not supported
Application scenarios
HBase has many components and is relatively heavyweight. It fits well where a Hadoop platform already exists, and suits OLTP scenarios with massive wide-table data that needs to be created, read, updated, and deleted.
Phoenix (HBase-based database engine, relational, OLTP)
- A SQL engine embedded into HBase's Region Servers
- Supports SQL
- Supports Join
- Supports transactions (must be enabled when defining the table)
- Supports secondary indexes
- Supports salted tables (salt buckets) to avoid RowKey hotspots; see the sketch after this list
- Supports JDBC
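A minimal Phoenix-style sketch of the features above (table, column, and index names are hypothetical, and the salt-bucket count is illustrative):
-- Hypothetical table with salt buckets to spread rows across regions
CREATE TABLE IF NOT EXISTS user_event (
    user_id    VARCHAR NOT NULL,
    event_time TIMESTAMP NOT NULL,
    event_type VARCHAR,
    CONSTRAINT pk PRIMARY KEY (user_id, event_time)
) SALT_BUCKETS = 8;
-- Secondary index on event_type
CREATE INDEX idx_event_type ON user_event (event_type);
-- Phoenix uses UPSERT rather than separate INSERT/UPDATE
UPSERT INTO user_event (user_id, event_time, event_type) VALUES ('u001', CURRENT_TIME(), 'login');
SELECT user_id, event_time FROM user_event WHERE event_type = 'login';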
Application scenarios
Used to enhance HBase, mainly for OLTP; it can also serve as OLAP when query-performance requirements are modest. Mostly used on HDP, which ships with Phoenix integrated.
Cassandra (wide table, key-value store, NoSQL, OLTP)
- No single point of failure: Cassandra nodes are arranged in a ring with no central node; each node plays the same role, runs independently, and is interconnected with the others; every node can accept read and write requests; data can be stored as multiple replicas on multiple nodes; nodes exchange state information via the Gossip (P2P) protocol. A few nodes in the cluster are configured as seed nodes, which newly joining nodes use to learn the cluster topology and start gossiping
- Provides CQL, an SQL-like query language
- Suitable for structured and unstructured data
- A table must define a Partition Key, Clustering Keys, and ordinary columns. The Partition Key is used for partitioning and placement: the hash token of the Partition Key decides which node the data goes to, and tokens are stored in order on the node; rows with the same Partition Key are stored together and sorted by the Clustering Key. This is somewhat similar to HBase's RowKey, ColumnFamily, and Column, but HBase groups data by column family, sorts by RowKey within it, and then takes column values (column values themselves are not sorted), whereas Cassandra first orders data by the token of the Partition Key, then sorts within a partition by the Clustering Key, and then takes the values of ordinary columns (column values are not sorted). See the sketch after this list
- Highly scalable: hardware and nodes can be added to increase capacity while maintaining fast response times
- The consistency level can be configured with the CONSISTENCY command, which mainly controls how many replicas must acknowledge an operation before it is reported successful to the client
- Cassandra uses eventual consistency: the AP side of the CAP theorem
- Cassandra does not support Joins or subqueries
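A minimal CQL sketch of the Partition Key / Clustering Key layout described above (keyspace, table, and column names are hypothetical):
-- Hypothetical keyspace with three replicas
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
-- user_id is the partition key; event_time is the clustering key,
-- so rows within a partition are sorted by event_time (descending here)
CREATE TABLE IF NOT EXISTS demo.user_event (
    user_id    text,
    event_time timestamp,
    event_type text,
    PRIMARY KEY ((user_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
INSERT INTO demo.user_event (user_id, event_time, event_type) VALUES ('u001', toTimestamp(now()), 'login');
-- queries should restrict the partition key
SELECT * FROM demo.user_event WHERE user_id = 'u001';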
Application scenarios
It is mainly used for OLTP, and can also be used as OLAP if the requirements are not high. Compared with HBase, it requires fewer components and is easier to maintain.
Redis (in-memory Key-Value based NoSQL database, OLTP)
- Written in C language
- Supports multiple data types such as strings, hashes, lists, sets, sorted sets, bitmaps, hyperloglogs, geospatial, etc.
- Operations are atomic, ensuring that two clients accessing the server concurrently both see the updated value
- Data is stored in memory
- Persistence can be configured: periodically snapshot the data to disk (RDB), append each modification operation to a log file (AOF), or disable persistence entirely and use Redis purely as an efficient network cache
- Supports master-replica synchronization: data can be replicated from a master to any number of replicas, and a replica can itself act as the master of other replicas, so Redis can perform single-layer tree replication; persistence may write data deliberately or incidentally. Because the publish/subscribe mechanism is fully implemented, a replica anywhere in the tree can subscribe to a channel and receive the master's complete record of published messages. Replication greatly helps read scalability and data redundancy
- Supports the message Publish/Subscribe (Pub/Sub) pattern
- Single-threaded model: network IO and data reads/writes are handled by one thread, which guarantees atomicity and stability and keeps the code maintainable. Single-threading does not hurt performance because the data lives in memory and operations are fast. "Single-threaded" here refers only to the main network-IO and read/write loop; other threads exist, for example the one that periodically writes to disk
- Newer versions use multiple threads for network IO (which becomes the bottleneck under high concurrency), but reads and writes are still single-threaded (operating on in-memory data is fast enough to handle high-concurrency scenarios)
Application scenarios
Usually used as a high-performance in-memory database, cache, message middleware, etc.
memcached (in-memory Key-Value NoSQL database, OLTP)
- Open source, high-performance, distributed memory-based Key-Value data storage, similar to Redis
- Stores strings/raw data without defined data structures (unlike Redis, which has hash, list, set, and other structures)
- Data usually consists of key, flags, expire time, bytes, and value
- The server basically only reads and writes data and supports relatively few operations
- Consists of server and client components; there can be multiple servers, but they are independent of each other with no synchronization or broadcast mechanism; the client API decides which server to use
- Data lives only in memory and is never written to disk
- No security mechanism
- Simple and efficient protocol
Application scenarios
Memcached is relatively simple. As a pure Key-Value cache its performance is better than Redis, but its functionality is not as rich as Redis's.
MongoDB (document database, NoSQL, OLTP)
It is called a document database because its data is stored in the form of JSON documents
MongoDB's concepts differ from those of many databases: a collection corresponds to a table, a document to a row, and a field to a column. For example:
db.user.insert(
  {
    "name": "Lin",
    "age": 30,
    "address": {
      "street": "Zhongshan Road",
      "city": "Guangzhou",
      "zip": 510000
    },
    "hobbies": ["surfing", "coding"]
  }
)
This is an insert statement: db refers to the current database, user is the collection (equivalent to a table), the JSON passed to insert is the document (equivalent to a row in other databases), and name, age, street, etc. are the fields (equivalent to columns)
The same document can be inserted multiple times without being overwritten: MongoDB automatically creates an \_id field as the primary key and assigns a different value each time, so there is no conflict. You can also specify \_id when inserting, but if that \_id already exists, an error is thrown
- As you can see, MongoDB stores unstructured data and requires neither a predefined collection nor a predefined data structure
- Provides rich query expressions
- Supports secondary indexes and automatic load balancing; reads are more efficient than writes
- Supports distributed deployment, failure recovery, data redundancy, sharding, and horizontal scaling
- The storage engine is configurable; the default WiredTiger storage engine maps memory to files to improve performance, but memory consumption is high
- Newer versions support Join and transactions
- Supports security authentication
- Provides extensions, such as visualization tools and BI-integration tools
Application scenarios
MongoDB suits scenarios where data is highly unstructured or the source data is JSON, individual records are relatively large, and OLTP is the main workload. It is not suitable for scenarios with strict transaction requirements or complex queries over large amounts of data. Its syntax differs substantially from other databases and has some learning cost.
Hive (HDFS-based database engine, relational, OLAP)
- Hive is a data warehouse tool based on Hadoop
- Data is stored in HDFS. When creating a table, the storage format (TEXTFILE, ORCFILE, PARQUET, etc.) is specified with the STORED AS clause, or HBase storage with the STORED BY clause; you can create a new table or map an existing HBase table (see the sketch after this list)
- Queries are executed as MapReduce, Spark, or similar jobs
- Provides HQL (HiveQL), an SQL-like query language, and supports user-defined functions (UDFs)
- Newer versions support transactions (enabled when creating the table)
- Supports massive data
- Structured data
- Supports CRUD operations
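A minimal HiveQL sketch of the STORED AS usage above (table, column, and partition names are hypothetical):
-- Hypothetical partitioned table stored as ORC files on HDFS
CREATE TABLE IF NOT EXISTS user_behavior (
    user_id    STRING,
    action     STRING,
    event_time TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS ORC;
-- Batch load into a partition, then a typical analytical query
INSERT INTO TABLE user_behavior PARTITION (dt = '2024-01-01')
VALUES ('u001', 'click', '2024-01-01 10:00:00');
SELECT action, COUNT(*) AS cnt
FROM user_behavior
WHERE dt = '2024-01-01'
GROUP BY action;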
Application scenarios
- Not suitable for OLTP, mainly used as OLAP for batch query of big data, requires Hadoop platform
Impala (based on HDFS, HBase, Kudu storage, parallel computing, relational, OLAP)
- A memory-based, distributed, parallel database query engine developed by Cloudera
- Mainly implemented in C++, using JNI for interaction with Hadoop
- Impala uses the same metadata, SQL, ODBC driver, and UI as Hive, which improves the SQL query performance of HDFS and provides a similar user experience.
- Like Hive, you can specify HDFS storage formats such as TEXTFILE, ORCFILE, PARQUET through STORED AS
- Tables created or modified through Hive must be manually synchronized to Impala
- Impala not only shares SQL conventions with Hive; its metadata is actually stored in Hive's metastore
- In addition to HDFS, table data can also be stored in HBase: create the table in HBase, then create a mapping table in Hive via STORED BY. Since Impala and Hive share metadata, once Hive has created the table you only need to run the refresh command INVALIDATE METADATA in Impala to see the corresponding HBase table
- Supports Join, aggregation, and other functions
- Support JDBC, ODBC
- Unlike Hive, Impala does not rely on MapReduce, but runs its own engine on each HDFS DataNode for parallel processing
- Impala's parallel processing engine is mainly composed of state store, catalog service, and multiple impala daemons
- Each impala daemon can receive client requests. A daemon consists of a query planner, a query coordinator, and query executors: the planner receives the client's SQL query and decomposes it into sub-queries; the coordinator distributes the sub-queries to the executors of the various daemons; the daemons fetch HDFS/HBase data and compute; the results go back to the coordinator, which aggregates them and returns the final result to the client
- Impala has a decentralized structure: every daemon can accept connections and queries, and HAProxy can be used to load-balance across multiple daemons
- The state store is used to collect and monitor the status of each daemon
- The catalog service notifies all impala daemons in the cluster of metadata changes made by SQL
- Impala's calculations are performed in memory, and the memory requirements are relatively high
- Impala supports the update operation only from version 2.8, and only on Kudu storage: Kudu must be installed and specified as the table's storage via STORED AS KUDU. Kudu is a columnar storage manager developed by Cloudera for OLAP, designed to balance HDFS and HBase: its random read/write performance is better than HDFS (e.g. Parquet) but worse than HBase, while its large-scan query performance is worse than HDFS (e.g. Parquet) but better than HBase. Kudu is highly integrated with Impala and can also integrate with MapReduce/Spark. By replacing HDFS/HBase with Kudu, Impala can perform updates, covering both OLAP and data-modification needs, so it suits scenarios that are mainly OLAP with some update requirements. Kudu's consistency is configurable, it uses a structured table data model that requires a primary key, and instead of HDFS it has its own components to store and manage data; it is written in C++, so there is no risk of long full-GC pauses. See the sketch below
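A minimal Impala-on-Kudu sketch of the update capability just described (table and column names are hypothetical, and the partition count is illustrative):
-- Hypothetical Kudu-backed table; Kudu requires a primary key
CREATE TABLE device_metrics (
    device_id    STRING,
    metric_name  STRING,
    metric_value DOUBLE,
    PRIMARY KEY (device_id, metric_name)
)
PARTITION BY HASH (device_id) PARTITIONS 4
STORED AS KUDU;
INSERT INTO device_metrics VALUES ('d001', 'temperature', 21.5);
-- UPDATE works because the data lives in Kudu rather than on HDFS
UPDATE device_metrics SET metric_value = 22.0
WHERE device_id = 'd001' AND metric_name = 'temperature';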
Application scenarios
- Impala is not suitable for OLTP, mainly used as OLAP for batch query of large data
- Hadoop platform and Hive required
- Much better performance than Hive
- Better performance as an OLAP than something like Phoenix
- Mainly promoted by Cloudera; CDH ships with Impala integrated
Presto (based on multiple data sources, parallel computing, relational, OLAP)
- A memory-based, distributed, parallel database query engine developed by Facebook
- It consists of a coordinator server, a discovery server (usually embedded in the coordinator, but it can be standalone), and multiple worker servers
- The coordinator interacts with clients, manages workers, parses statements, plans queries, creates a series of stages, and then converts them into a series of tasks distributed to different workers for concurrent execution
- Workers execute tasks and process data: they fetch data through connectors and exchange intermediate data with other workers; the final result is returned to the client by the coordinator
- Connectors are adapters that allow Presto to access different data sources
- The main built-in connector is Hive; many other connectors exist, such as Cassandra, Elasticsearch, Kafka, Kudu, Redis, MySQL, and PostgreSQL
- Catalogs are defined in the configuration files. A catalog holds schemas and points to a data source through a connector, and table addressing in Presto starts from the catalog: for example, hive.test\_data.test refers to the test table in the test\_data schema of the hive catalog. What a schema means depends on the connector: for MySQL, a Presto schema is a MySQL schema; for Cassandra, it is a Cassandra keyspace. Multiple catalogs can use the same kind of connector: if the environment has several Kafka clusters, for example, there can be two catalogs, kafka1 and kafka2
- A statement is the SQL text Presto receives; it is parsed into a query plan, and the query is divided into multiple stages that form a tree, each stage aggregating the results of the stages below it. Each stage is split into one or more tasks, which are distributed to different workers and executed in parallel; each task processes different data splits, and each task runs one or more drivers to process data concurrently
- Presto supports JDBC; the JDBC URL format is jdbc:presto://host:port/catalog/schema, jdbc:presto://host:port/catalog, or jdbc:presto://host:port
- Supports Join queries, including joins across multiple data sources (joining several large tables may hurt performance). When querying across data sources, specify the fully qualified table name, i.e. [catalog].[schema].[table], and connect over JDBC with jdbc:presto://host:port without specifying a catalog or schema (see the sketch after this list)
- Limited support for subqueries
- UPDATE operations are not supported
- Supports security mechanisms
- Supports standard ANSI SQL
- Good scalability
- Can be integrated with Tableau
- Spark support
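A minimal sketch of a cross-source join using the fully qualified [catalog].[schema].[table] naming described above (catalog, schema, and table names are hypothetical):
-- Hypothetical catalogs: "hive" (fact data on HDFS) and "mysql" (dimension data)
SELECT o.order_id, o.amount, u.user_name
FROM hive.sales.orders AS o
JOIN mysql.crm.users AS u ON o.user_id = u.user_id
WHERE o.order_date >= DATE '2024-01-01';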
Application scenarios
- Suitable for OLAP query of large data volume with multiple data sources
- The performance may be similar to Impala, but it supports multiple data sources and does not depend on Hadoop
Greenplum (multiple PostgreSQL, parallel computing, relational, OLAP)
- A distributed, parallel query engine built on multiple PostgreSQL instances
- The embedded PostgreSQL has been modified to support parallel, distributed computing
- Mainly consists of a master, multiple segments, and the interconnect
- The master maintains metadata, accepts and authenticates client connections, receives SQL requests, parses SQL, generates query plans, distributes tasks to the segments, aggregates the results returned by the segments, and returns the final result to the client; a standby master can be configured
- Each segment is an independent PostgreSQL database responsible for storing part of the data and performing the corresponding query computation; segments can also be configured with a backup (mirror) mechanism
- Interconnect is the network layer of Greenplum, responsible for the link between master and segment, as well as the link between each segment
- Connections and SQL syntax are compatible with PostgreSQL; supports JDBC and ODBC
- When creating a table, you can choose column-oriented storage, row-oriented storage, or an external table (the data lives in another system such as HDFS while Greenplum stores only the metadata); see the sketch after this list
- Operating on external data requires installing PXF (Platform Extension Framework), which adds support for Hive, HBase, Parquet, S3, MySQL, Oracle, etc.
- Supports security and permission configuration
- Supports distributed transactions and ACID with strong data consistency; uses MVCC (Multi-Version Concurrency Control) rather than locks to ensure consistency
- shared-nothing architecture
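A minimal Greenplum-style sketch of choosing a column-oriented layout and a distribution key when creating a table (table and column names are hypothetical, and the storage options are illustrative):
-- Hypothetical append-optimized, column-oriented fact table
CREATE TABLE fact_page_view (
    view_id   BIGINT,
    user_id   INT,
    url       TEXT,
    view_time TIMESTAMP
)
WITH (appendonly = true, orientation = column)
DISTRIBUTED BY (user_id);
-- Rows are hashed on user_id and spread across the segments
INSERT INTO fact_page_view VALUES (1, 42, '/home', now());
SELECT user_id, COUNT(*) AS views FROM fact_page_view GROUP BY user_id;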
Application scenarios
Similar to Impala and Presto in providing parallel in-memory computing, though Greenplum's performance may be slightly lower. Greenplum comes in open-source and commercial editions, and some features are only available in the commercial edition.
Kylin (based on Hive, HBase, parallel computing, relational, multi-dimensional, precomputed OLAP)
Traditional OLAP is divided into ROLAP (Relational OLAP) and MOLAP (Multi-Dimensional OLAP) according to how data is stored. ROLAP stores data in a relational model; its advantages are small storage size and flexible querying, and its disadvantage is that every query must aggregate and compute over the data. Kylin belongs to MOLAP.
Kylin precomputes results for the different combinations of dimensions to form a Cube, so queries are fast. The downsides are that the amount of precomputed data is hard to control (N dimensions give 2^N combinations, so a dimension explosion can occur) and that any change to the source data requires recomputation.
For example, with two dimension tables, Phone and Country, and a Sale fact (detail) table, taking phone brand, country, and date as three dimensions gives 8 combinations: (none), (brand), (country), (date), (brand, country), (brand, date), (country, date), (brand, country, date). The summary information (sum, count, etc.) of sales can be computed in advance for each of these 8 group-by combinations; the summary for one dimension combination is called a cuboid, and all cuboids together form the Cube.
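For instance, the query that the (brand, country) cuboid answers directly might look like the following sketch (table and column names are hypothetical):
-- Answered from the precomputed (brand, country) cuboid instead of scanning the detail table
SELECT p.brand, c.country, SUM(s.price) AS total_sales, COUNT(*) AS sale_count
FROM sale s
JOIN phone   p ON s.phone_id   = p.id
JOIN country c ON s.country_id = c.id
GROUP BY p.brand, c.country;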
Kylin's data source can be Hive or Kafka (messages in JSON format, with the keys as column names)
Kylin's precomputed results are stored in HBase: the RowKey is a combination of dimension values, and the corresponding aggregates are stored in columns. A SQL query thus becomes a RowKey scan plus, if necessary, further computation over the column values, which naturally improves query performance and enables sub-second queries
Kylin supports ODBC, JDBC, RESTful API and other interfaces
Kylin can be integrated with BI tools such as Tableau and PowerBI
The use steps are as follows
- Create Project
- Sync Hive table or Kafka table
- Create Data Model
- Create and name the Model
- Select the Fact Table and Lookup Tables (lookup tables mainly hold dimension information) and the Join fields
- Select dimension columns from the Fact Table and Lookup Tables (columns the Cube can group by)
- Select measure columns from the Fact Table (columns the Cube can aggregate)
- Select a column from the Fact Table for date partitioning, or leave it blank if not needed
- Add Filters (used by the Cube as WHERE conditions)
- Create cube
- Create and name the Cube, and select the Data Model to associate
- Add dimension columns (selected from the dimension columns configured in the Data Model)
- Add measure columns (selected from the measure columns configured in the Data Model)
- Eight aggregation operations can be configured for measure columns: SUM, MAX, MIN, COUNT, COUNT\_DISTINCT, TOP\_N, EXTENDED\_COLUMN, and PERCENTILE (AVG is simply SUM divided by COUNT, so there is no need to configure AVG or other operations that can be derived from the precomputed results)
Building the Cube is actually done by MapReduce/Spark jobs; when the job finishes, the results are written to HBase
After a successful build, you can query directly with SQL; under the hood, the RowKey is looked up by the dimension values, the aggregates stored in the columns are retrieved, and further computation is performed if necessary
If the data source has changed, you need to rebuild the cube
Kylin is thus a pure OLAP tool: it improves query performance through precomputation, but cannot reflect data-source changes promptly; precomputation can be time-consuming and space-hungry, and it must be integrated with Hadoop
Other precomputed OLAP query engines include Druid
ClickHouse (column storage, vectorized computing, parallel computing, OLAP)
OLAP database developed by Russian company Yandex
- Column Store Benefits for OLAP
- OLAP typically retrieves only a few of many columns: with row storage this means scanning row by row, with every row being large, whereas column storage only scans the columns being retrieved, reducing IO
- If some records do not contain the column being retrieved, a row store still has to scan those records to find out, while a column store has no such problem: what is not stored is simply not scanned
- Since values in the same column have consistent type and size, columnar data compresses better and more efficiently, further reducing IO
- The reduction in IO also means that more data can be cached in memory
- vectorized computation
- SIMD (Single Instruction, Multiple Data): modern CPUs can apply a single instruction to a whole set of data (a vector) at once at the CPU level, which suits pure, branch-free arithmetic; workloads full of comparisons, jumps, and branches do not benefit
- ClickHouse has a vectorized execution engine that uses SIMD instructions wherever possible to process data in batches and in parallel, greatly increasing processing capacity
- Mainly implemented in C++
- Decentralized structure: the cluster consists of servers, any of which can accept client connections and queries; on receiving a request, a server coordinates with the others to compute in parallel; each server is multi-threaded; coordination and synchronization go through ZooKeeper
- Supports sharding: data can be spread across shards on different nodes; a shard can be a single node, or multiple nodes can form a shard with replica backups, all defined in the configuration file
- Supports partitioning, specified with the PARTITION BY clause when creating a table
- Sharding and partitioning are sometimes confused: sharding means a table's data is distributed across different nodes (a node can hold data of multiple databases and tables), while partitioning means splitting a large table into smaller ones by some column, for example partitioning by a date column into one partition per day; both a single partition and the whole table can be queried
- Supports replica backups and data integrity
- Table Engine
  - When creating a table, you must specify the table engine with the ENGINE clause, which determines how the data is stored
  - The most commonly used are the MergeTree family of engines, such as
    - ENGINE = MergeTree()
    - ENGINE = AggregatingMergeTree()
  - The relatively lightweight Log family of engines, such as
    - ENGINE = Log;
    - ENGINE = TinyLog;
  - Engines that query other data sources, such as
    - ENGINE = ODBC(connection\_settings, external\_database, external\_table)
    - ENGINE = JDBC(dbms\_uri, external\_database, external\_table)
    - ENGINE = MySQL('host:port', 'database', 'table', 'user', 'password')
    - ENGINE = PostgreSQL('host:port', 'database', 'table', 'user', 'password')
    - ENGINE = MongoDB(host:port, database, collection, user, password)
    - ENGINE = HDFS(URI, format)
    - ENGINE = Kafka() SETTINGS kafka\_broker\_list = 'host:port', kafka\_topic\_list = 'topic1'
  - Special engines, such as
    - ENGINE = Memory: the data lives in memory
    - Distributed: a table created on one server is only that server's local table, not a distributed table. To create a distributed table, create a table with the same name on every server and then create a Distributed table on one of them (it is created automatically on all servers); the distributed table is a logical table that stores no data itself but maps to each server's local table and automatically parallelizes computation
      - ENGINE = Distributed(cluster\_name, database, table, [sharding\_key])
      - cluster\_name is configured in the configuration file
- MergeTree is the usual storage engine: data is quickly appended in order, and merge and compression operations run in the background, which gives good insert performance (see the sketch after this list)
- Primary index: data is sorted by the Primary Key
- The sort fields can also be specified with ORDER BY when creating the table
- Supports secondary indexes, also called data-skipping indexes, such as the minmax index, which records the maximum and minimum values of a column (or an expression) within each block of data; at query time this information decides whether the block can be skipped (this feels a bit odd; the performance is presumably worse than building a separate index table)
- Support TTL, including column-level, row-level, partition-level TTL
- Support HTTP, TCP interface
- Support JDBC, ODBC
- There are third-party tools to support importing data from other databases such as PG into ClickHouse
- Third-party tools and some visualization tools such as Grafana, DBeaver, and Tabix can integrate with it
- Third-party tools support integration with Kafka, Flink, Spark, etc.
- Supports SQL, including GROUP BY, ORDER BY, JOIN, some subqueries, and other features
- Supports complex data types such as array, json, tuple, set, etc.
- Supports approximate computation: for example, an average can be computed over only part of the data, improving performance at the cost of accuracy
- Adaptive join algorithms: hash join is preferred, and merge join is used automatically when there are multiple large tables
- Security mechanism, role-based permission control
- Support error recovery, good scalability
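A minimal ClickHouse sketch of the MergeTree usage, partitioning, ordering, and TTL features described above (table and column names are hypothetical):
-- Hypothetical event table using the MergeTree engine
CREATE TABLE IF NOT EXISTS page_events (
    event_date  Date,
    user_id     UInt64,
    url         String,
    duration_ms UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)   -- one partition per month
ORDER BY (user_id, event_date)      -- sort key, which also serves as the primary index
TTL event_date + INTERVAL 90 DAY;   -- drop rows older than 90 days
INSERT INTO page_events VALUES ('2024-01-01', 42, '/home', 120);
SELECT user_id, count() AS views, avg(duration_ms) AS avg_ms
FROM page_events
GROUP BY user_id
ORDER BY views DESC;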
Where it falls short
- Insufficient support for high concurrency
- No mature transaction functionality
- The performance of modifying and deleting data is relatively poor, and only limited support is provided
- The Primary Key uses a sparse index: an index entry points only to a block of data, and the exact rows must then be found by scanning within the block, so retrieving a small amount of data or a single record performs poorly
Does not depend on Hadoop; columnar storage, vectorized execution, parallel and multi-threaded processing, multiple storage engines
Single-table query performance is excellent, much better than Impala, Presto and the like
Multi-table query performance is poor, worse than Impala, Presto and the like
Elasticsearch (inverted index, word segmentation, search engine)
- Elastic Stack is a set of components including Elasticsearch, Logstash, Filebeat, Kibana, etc.
- Elasticsearch is a search engine based on Apache Lucene, mainly developed in Java
- An Elasticsearch cluster mainly consists of master, data, ingest, and coordinating nodes
- Each node can be configured with multiple roles at once, such as master and data, but in large clusters each node usually has only one role
- Coordinating is a capability every node has without extra configuration: both master and data nodes can coordinate, i.e. receive client read/write requests, forward them to the appropriate nodes, combine the results, and return them to the client. To avoid putting extra pressure on master/data nodes in large clusters, dedicated coordinating-only nodes can be configured by setting the role to empty or setting master, data, and ingest to false (depending on the version). Such nodes only receive and respond to client requests and do no other work, acting like a reverse-proxy load balancer, which improves concurrency
- The master manages the whole cluster: creating and deleting indices and other administrative operations, deciding which node each shard goes to, tracking the state of the other nodes, and so on. Multiple master-eligible nodes can be configured for HA; an odd number, at least three, is required, and the system automatically elects one of them as the active master. If the master fails, a new one is elected from the remaining master-eligible nodes. Master nodes can run on lower-spec hardware
- Data nodes store, compute, and process data and have high resource requirements; they can be further configured as dedicated hot, warm, or cold data nodes
- Ingest is an optional node role dedicated to pipeline preprocessing of certain data and operations
- Elasticsearch's data storage is mainly composed of index, type, document
- An index is similar to a database in SQL. An index can be created directly or from an index template; templates are typically used when many indices share the same configuration, for example a new index created every day to store that day's logs
- A type is similar to a table in SQL (the analogy is not entirely accurate and has been officially clarified, because different types are not completely independent of each other). Earlier versions allowed multiple types per index; from 6.0 only one type is allowed; from 7.0 the type concept is deprecated; and from 8.0 types are no longer supported at all
- A document is like a row in SQL: documents are unstructured data represented as JSON, and each JSON field is equivalent to a column
- Each document will have a unique \_id, if not specified, it will be automatically generated by the system
- Each document has a strictly increasing sequence number \_seq\_no representing the order in which documents are written/updated
- Each document has a \_version field, which is incremented by 1 each time this field is changed
- An index can be created in advance, or it will be created automatically when data is first written
- Document fields do not have to be defined in advance: they are created automatically when data is written, and different documents can have different fields; alternatively, fields can be defined in advance so that every document must have the same fields
- Elasticsearch automatically indexes all fields and automatically tokenizes them: a phrase such as "hello world" is split into the two tokens "hello" and "world" for indexing. The untokenized "hello world" is a keyword (with a length limit of 256), while a tokenized term such as "hello" is text
- Elasticsearch uses an inverted index, which is mainly composed of three parts: Term Index (word index), Term Dictionary (word dictionary), and Posting List (index item list)
- The Term Index lives in memory and stores not whole words but word prefixes (such as he, wor, ad, sar), each pointing to where the words starting with that prefix begin in the Term Dictionary. This keeps the Term Index small enough to hold in memory, and it quickly locates the region of the Term Dictionary that needs to be read, greatly reducing disk IO
- The Term Dictionary lives on disk (the number of words is usually very large) and records the mapping from each indexed word to its Posting List, using a B+ tree or hash table for high-performance insertion and lookup
- The Posting List records: the ids of all documents in which the word appears, the number of times the word appears in the document, and the position of the word in the document
- The search engine finds the corresponding document through such an inverted index, and then does further processing
- Since all fields are indexed, the amount of data will be very large
- The data is first written into the memory, and then the data is periodically placed on the disk to form segments. When there are more segments, merges will be performed to form larger segments.
- To avoid losing data that is still only in memory, a translog is written, similar to HBase's WAL. This also costs disk IO and affects performance, but writing the translog is simpler than flushing the data itself to disk
- A segment consists of multiple files recording metadata, field information, forward-index data (looking up a document's data by its id), column-wise field data (doc values), inverted-index data, and so on
- Supports operation through a REST API; many operations are expressed via a rich JSON syntax in the request body
- Support Event Query Language (EQL): for event-based time series data, such as logs, metrics, and traces, specified through the Body of the REST API
- Supports JDBC and ODBC, where a table maps to an index and a column maps to a field; this SQL dialect does not support JOIN
- Transactions are not supported
- Cross-index ("cross-table") queries are not fully supported; defining parent-child documents and join-type fields is relatively complicated
- There is some latency between writing and searching: newly written data cannot be searched immediately and takes at least about one second to become visible
- It is different from traditional databases and requires a certain learning cost
Examples of some REST API operations
curl localhost:9200 # view basic cluster info
curl localhost:9200/_cluster/health?pretty # check cluster health (pretty formats the JSON output)
curl localhost:9200/_cluster/state?pretty # view cluster state
curl localhost:9200/_cluster/stats?pretty # view cluster statistics
curl localhost:9200/_nodes?pretty # view node info
curl localhost:9200/_cat # list the available _cat endpoints (the cat commands return simplified output)
curl localhost:9200/_cat/health
curl localhost:9200/_cat/nodes
curl localhost:9200/_cat/indices
curl -X PUT 'http://localhost:9200/my_index/my_doc/123' -H 'Content-Type: application/json' -d '{
"name": "Lin", "title": "senior designer", "age": 30
}' ## specify document id 123; my_index, my_doc, and the fields are created automatically
curl -X POST 'http://localhost:9200/my_index/my_doc' -H 'Content-Type: application/json' -d '{
"name": "Wang", "title": "senior designer", "age": 35
}' ## the document id is generated automatically by the system
curl -X POST 'http://localhost:9200/my_index/my_doc_2' -H 'Content-Type: application/json' -d '{
"name": "n_1", "type": "t_1", "value": 1
}' ## error: an index is not allowed to have two types
curl -X POST 'http://localhost:9200/my_index/_doc' -H 'Content-Type: application/json' -d '{
"name": "n_1", "type": "t_1", "value": 1
}' ## allowed: _doc is the same as my_doc (you can use _doc from the start instead of a type name)
curl -X POST 'http://localhost:9200/my_index/_doc' -H 'Content-Type: application/json' -d '{
"name": "Li",
"address": {"city": "guangzhou", "district": "tianhe"}
}' ## new fields and nested types are allowed; a list such as "address": ["xxx"] does not seem to be supported
curl localhost:9200/my_index/my_doc/123?pretty ## view the record with id 123
curl localhost:9200/my_index/my_doc/_search?pretty ## view all records
curl localhost:9200/my_index/_search?pretty
curl localhost:9200/_all?pretty ## list the settings and mappings (i.e. the field info) of all indices
curl localhost:9200/my_index?pretty ## view the settings and mapping of my_index
curl localhost:9200/my_index/_mapping?pretty ## view the mapping of my_index
curl -X GET -H "Content-Type: application/json" localhost:9200/my_index/_search?pretty -d '{
"query": {
"term":{
"name":"lin"
}
}
}' ## a simple query; more syntax and features are available
curl -X GET -H "Content-Type: application/json" localhost:9200/my_index/_search?pretty -d '{
"query": {
"term":{
"title.keyword":"senior designer"
}
}
}' ## by default the tokenized field is queried; to match the untokenized value, add .keyword
curl -X GET -H "Content-Type: application/json" localhost:9200/my_index/_search?pretty -d '{
"query": {
"term":{
"address.city":"guangzhou"
}
}
}' ## query a nested field
curl localhost:9200/_search?pretty ## search across all indices
curl -H "Content-Type: application/json" localhost:9200/my_index/_analyze?pretty -d '{
"analyzer" : "standard",
"text" : "This is the apple"
}' ## see how a piece of text is analyzed (tokenized)
curl -X PUT 'http://localhost:9200/my_index_3' ## create an index
curl -X PUT -H "Content-Type: application/json" 'http://localhost:9200/my_index_6' -d '{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1
},
"mappings": {
"properties": {
"name": {
"type": "text"
},
"title": {
"type": "text"
},
"value": {
"type": "long"
}
}
}
}' ## create an index and define its fields; from version 7 onward there is no need to specify a type (such as my_doc)
Application scenarios
Suitable for search-oriented business scenarios; it first gained popularity as part of ELK (Elasticsearch + Logstash + Kibana) for log collection and search.
Spark/Flink
Databases are powerful, but they fall short when complex logic and computation are involved; dedicated computing tools are needed in such cases.
Both Spark and Flink are high-performance parallel computing engines. Flink is more inclined to real-time streaming services, while Spark is more inclined to batch processing services, both of which can be used to process data efficiently.
BI
Data usually needs to be visualized; the more commonly used BI tools are Tableau (commercial) and Power BI.
Overall system architecture
K8S (container deployment) + Spring Cloud (microservices) + Keycloak (authentication) + Kafka (data streaming) + Spark/Flink (data processing) + ELK (log collection) + PG/MySQL (basic data) + NoSQL OLTP (large-volume business data) + OLAP (analytics) + BI (data visualization)