Factors Affecting Database Selection
- Data volume: whether the amount of data is large; a very large single table will stress the database's performance
- Data structure: structured (every record has the same structure) or unstructured (different records can have different structures)
- Table width: does a record have 10 fields, or hundreds or even thousands of fields
- Data attributes: basic data (such as user information), business data (such as user behavior), auxiliary data (such as logs), cached data
- Whether transactions are required: a transaction consists of multiple operations that must either all succeed or all be rolled back; partial success is not allowed (see the sketch after this list)
- Latency requirements: whether there are requirements on write latency or read latency; for example, some services tolerate high write latency but require low read latency
- Query pattern: for example, some businesses query a few columns across a large number of records, while others query all columns of a few records
- Sorting requirements: for example, some businesses operate on time-series data
- Reliability requirements: tolerance for data loss
- Consistency requirements: whether reads must always return the most recently written data
- CRUD requirements: some businesses need fast create/read/update/delete of individual records (such as user information); some need batch import; some never modify or delete individual records (such as logs and user-behavior data); some retrieve only small amounts of data (such as logs); some need to read large amounts of data quickly (such as report display); and some need to read and compute over large amounts of data (such as user-behavior analysis)
- Whether multi-table operations are required
Different businesses have different requirements for databases
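As a minimal illustration of the transaction requirement above, a bank transfer sketched in standard SQL (the accounts table and amounts are hypothetical): both updates commit together or neither takes effect.
-- Hypothetical bank-transfer transaction: both updates succeed or both roll back
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- on any error, issue ROLLBACK instead of COMMIT
COMMIT;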
SQL Database & NoSQL Database
SQL databases are traditional relational databases
- Fixed, predefined table structure
- Structured data
- Requires predefined data types
- Suited to moderate data and query volumes; when the data volume is large, tables must be split (sharding / sub-tables)
- Strong support for data consistency, integrity constraints, transactions, and reliability
- Supports multi-table Join operations
- Supports referential integrity between tables; deleting a record in table A may require first deleting related records in table B
- Powerful SQL for insert, delete, update, and query
- General-purpose and technologically mature
- Insufficient performance for very large data volumes
- Limited performance under high concurrency
- Not suitable for unstructured data
- Difficult to scale out
Common SQL databases such as Oracle, MySQL, PostgreSQL, SQLite
NoSQL generally refers to non-relational databases
- More flexible storage models, such as columnar storage, key-value storage, document storage, and graph storage
- Supports unstructured data
- Some do not need predefined data types, and some do not even need predefined tables
- Supports large data volumes
- Most support distributed deployment
- Good scalability
- Relatively strong basic-query and high-concurrency capability (achieved by being unstructured and distributed, and by sacrificing features such as consistency, integrity, and transactions)
- Weaker data-consistency guarantees
- Transactions are usually not supported, or only in a limited way
- Integrity constraints are usually not supported, so complex business scenarios are poorly supported
- Multi-table Joins are usually not supported, or only in a limited way
- Uses a non-SQL or SQL-like query language with relatively weak functionality; some do not even support modifying or deleting data
- Less general-purpose; the technologies are diverse and the market changes quickly
Common NoSQL databases include:
- Columnar: HBase, Cassandra, ClickHouse
- Key-value: Redis, Memcached
- Document: MongoDB
- Time series: InfluxDB, Prometheus
- Search: Elasticsearch
SQL and NoSQL are a complementary relationship, applied in different scenarios.
OLTP & OLAP
OLTP (On-Line Transaction Processing)
- Mainly handles real-time transaction processing
- Examples: maintaining basic user information, processing orders and contracts, handling bank transfers, enterprise ERP and OA systems
- Frequent, real-time inserts, deletes, and updates on small amounts of data, often a single record
- The database is updated frequently
- Usually requires normalization, real-time response, stability, transactions, consistency, integrity, and so on
- Operations are relatively fixed; an order business, for example, tends to run the same fixed set of operations
- The database schema is mainly modeled in 3NF or BCNF
OLAP (On-Line Analytical Processing)
- Data warehousing: mainly analyzes historical data to support business decision-making
- Examples: analyzing large volumes of user behavior, or analyzing device status, usage, and performance
- Less frequent operations that read, aggregate, compute, and analyze large amounts of data; real-time requirements are low, but throughput requirements are high
- Tables usually have many columns, but each analysis only touches a few of them
- Data is usually bulk-imported
- Data is usually not modified after import; workloads are read-heavy (write little, read a lot)
- Requirements for normalization, transactions, consistency, and integrity are low; even a failed query usually has no real consequence
- Operations are flexible; on a massive user-behavior table, for example, many different analyses can be run from different angles
- The schema is mainly modeled as a star or snowflake schema (see the sketch after this list)
- Before high-performance OLAP engines were available, the traditional approach was to build T+1 offline data through offline batch jobs, which lags behind the source data
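A minimal sketch of a star schema and a typical OLAP query, as mentioned above (table and column names are hypothetical): a fact table references dimension tables, and the query scans many rows but only a few columns before aggregating.
-- Hypothetical star schema: one fact table referencing two dimension tables
CREATE TABLE dim_user    (user_id INT PRIMARY KEY, country VARCHAR(50));
CREATE TABLE dim_product (product_id INT PRIMARY KEY, category VARCHAR(50));
CREATE TABLE fact_sales (
    sale_id    BIGINT PRIMARY KEY,
    user_id    INT,
    product_id INT,
    sale_date  DATE,
    amount     DECIMAL(12, 2)
);
-- Typical OLAP query: aggregate a few columns over many rows
SELECT u.country, p.category, SUM(f.amount) AS total_amount
FROM fact_sales f
JOIN dim_user u    ON f.user_id = u.user_id
JOIN dim_product p ON f.product_id = p.product_id
WHERE f.sale_date >= DATE '2024-01-01'
GROUP BY u.country, p.category;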
OLTP usually uses a traditional relational database. If the data volume is large and the tables need to be sharded, and the requirements for transactions, consistency, and integrity are not high, NoSQL can also be used.
OLAP usually uses NoSQL, and traditional relational databases can also be used if the amount of data is not large.
Relational databases: Oracle, SQL Server, MySQL, PostgreSQL, SQLite
Oracle: a commercial, closed-source database from Oracle; supports all mainstream platforms; good performance, rich functionality, good stability and security; supports large data volumes; relatively complex and expensive.
SQL Server: a commercial database developed by Microsoft; traditionally tied to Windows, though recent versions also run on Linux.
MySQL: an open-source database now owned by Oracle; supports many operating systems; small footprint and comparatively limited functionality; simple operations perform well, while complex operations perform poorly.
PostgreSQL: a fully open-source, free project under a BSD-style license; supports many operating systems; more powerful functionality; works well with a variety of open-source tools.
SQLite: open source, lightweight, serverless, zero-configuration; a database is just a single file; operations run inside the application; uses few resources; suitable for embedded or small applications.
Application scenarios
Oracle is mostly used in demanding fields such as banking; MySQL and PostgreSQL are mostly used in the Internet industry where requirements are less strict; SQLite is used for embedded scenarios or as an in-application database; SQL Server is mostly used on Windows servers.
HBase (wide table, columnar store, key-value store, NoSQL, OLTP)
- Built on Hadoop's HDFS distributed file system
- A distributed database; requires ZooKeeper for coordination between nodes
- Supports wide tables and unstructured data; columns and data types do not need to be predefined
- Columnar storage: each HFile stores data for only one column family, a column family can have multiple HFiles, and HFiles store data in Key-Value format, where the Key is a combination of rowkey, column family, column, and timestamp; rowkeys are stored in sorted order within an HFile, and the value is the cell value of the column
- Supports massive data (tables with hundreds of billions of rows)
- Data is written to memory first and flushed to disk when a threshold is reached, which gives good write performance but uses a lot of memory
- Does not support SQL or Join; has its own command syntax and supports create, read, update, and delete operations
- Automatic partitioning, load balancing, and linear scalability
- Automatic failover
- Strong consistency (each Region is served by only one Region Server, which makes strong consistency easy to achieve)
- CP model (availability is not guaranteed: each Region is served by only one Region Server, so if that server goes down, the Region is temporarily unavailable while it is migrated)
- Transactions and secondary indexes are not supported
Application scenarios
HBase has many components and is relatively heavyweight. It fits well where a Hadoop platform already exists, and suits OLTP scenarios with massive wide-table data that needs to be created, read, updated, and deleted.
Phoenix (HBase-based database engine, relational, OLTP)
- A SQL engine embedded into HBase's Region Servers
- Supports SQL
- Supports Join
- Supports transactions (must be enabled when defining the table)
- Supports secondary indexes
- Supports salted tables (salt buckets) to avoid RowKey hotspots; see the sketch after this list
- Supports JDBC
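A minimal Phoenix-style sketch of the features above (table, column, and index names are hypothetical, and the salt-bucket count is illustrative):
-- Hypothetical table with salt buckets to spread rows across regions
CREATE TABLE IF NOT EXISTS user_event (
    user_id    VARCHAR NOT NULL,
    event_time TIMESTAMP NOT NULL,
    event_type VARCHAR,
    CONSTRAINT pk PRIMARY KEY (user_id, event_time)
) SALT_BUCKETS = 8;
-- Secondary index on event_type
CREATE INDEX idx_event_type ON user_event (event_type);
-- Phoenix uses UPSERT rather than separate INSERT/UPDATE
UPSERT INTO user_event (user_id, event_time, event_type) VALUES ('u001', CURRENT_TIME(), 'login');
SELECT user_id, event_time FROM user_event WHERE event_type = 'login';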
Application scenarios
Used to enhance HBase, mainly for OLTP; it can also serve as OLAP when query-performance requirements are modest. Mostly used on HDP, which ships with Phoenix integrated.
Cassandra (wide table, key-value store, NoSQL, OLTP)
- No single point of failure: Cassandra nodes are arranged in a ring with no central node; each node plays the same role, runs independently, and is interconnected with the others; every node can accept read and write requests; data can be stored as multiple replicas on multiple nodes; nodes exchange state information via the Gossip (P2P) protocol. A few nodes in the cluster are configured as seed nodes, which newly joining nodes use to learn the cluster topology and start gossiping
- Provides CQL, an SQL-like query language
- Suitable for structured and unstructured data
- A table must define a Partition Key, Clustering Keys, and ordinary columns. The Partition Key is used for partitioning and placement: the hash token of the Partition Key decides which node the data goes to, and tokens are stored in order on the node; rows with the same Partition Key are stored together and sorted by the Clustering Key. This is somewhat similar to HBase's RowKey, ColumnFamily, and Column, but HBase groups data by column family, sorts by RowKey within it, and then takes column values (column values themselves are not sorted), whereas Cassandra first orders data by the token of the Partition Key, then sorts within a partition by the Clustering Key, and then takes the values of ordinary columns (column values are not sorted). See the sketch after this list
- Highly scalable: hardware and nodes can be added to increase capacity while maintaining fast response times
- The consistency level can be configured with the CONSISTENCY command, which mainly controls how many replicas must acknowledge an operation before it is reported successful to the client
- Cassandra uses eventual consistency: the AP side of the CAP theorem
- Cassandra does not support Joins or subqueries
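A minimal CQL sketch of the Partition Key / Clustering Key layout described above (keyspace, table, and column names are hypothetical):
-- Hypothetical keyspace with three replicas
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
-- user_id is the partition key; event_time is the clustering key,
-- so rows within a partition are sorted by event_time (descending here)
CREATE TABLE IF NOT EXISTS demo.user_event (
    user_id    text,
    event_time timestamp,
    event_type text,
    PRIMARY KEY ((user_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
INSERT INTO demo.user_event (user_id, event_time, event_type) VALUES ('u001', toTimestamp(now()), 'login');
-- queries should restrict the partition key
SELECT * FROM demo.user_event WHERE user_id = 'u001';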
Application scenarios
It is mainly used for OLTP, and can also be used as OLAP if the requirements are not high. Compared with HBase, it requires fewer components and is easier to maintain.
Redis (in-memory Key-Value based NoSQL database, OLTP)
- Written in C language
- Supports multiple data types such as strings, hashes, lists, sets, sorted sets, bitmaps, hyperloglogs, geospatial, etc.
- Operations are atomic, ensuring that two clients accessing the server concurrently both see the updated value
- Data is stored in memory
- Persistence can be configured: periodically snapshot the data to disk (RDB), append each modification operation to a log file (AOF), or disable persistence entirely and use Redis purely as an efficient network cache
- Supports master-replica synchronization: data can be replicated from a master to any number of replicas, and a replica can itself act as the master of other replicas, so Redis can perform single-layer tree replication; persistence may write data deliberately or incidentally. Because the publish/subscribe mechanism is fully implemented, a replica anywhere in the tree can subscribe to a channel and receive the master's complete record of published messages. Replication greatly helps read scalability and data redundancy
- Supports the message Publish/Subscribe (Pub/Sub) pattern
- Single-threaded model: network IO and data reads/writes are handled by one thread, which guarantees atomicity and stability and keeps the code maintainable. Single-threading does not hurt performance because the data lives in memory and operations are fast. "Single-threaded" here refers only to the main network-IO and read/write loop; other threads exist, for example the one that periodically writes to disk
- Newer versions use multiple threads for network IO (which becomes the bottleneck under high concurrency), but reads and writes are still single-threaded (operating on in-memory data is fast enough to handle high-concurrency scenarios)
Application scenarios
Usually used as a high-performance in-memory database, cache, message middleware, etc.
memcached (in-memory Key-Value NoSQL database, OLTP)
- Open source, high-performance, distributed memory-based Key-Value data storage, similar to Redis
- Stores strings/raw data without defined data structures (unlike Redis, which has hash, list, set, and other structures)
- Data usually consists of key, flags, expire time, bytes, and value
- The server basically only reads and writes data and supports relatively few operations
- Consists of server and client components; there can be multiple servers, but they are independent of each other with no synchronization or broadcast mechanism; the client API decides which server to use
- Data lives only in memory and is never written to disk
- No security mechanism
- Simple and efficient protocol
Application scenarios
Memcached is relatively simple. As a pure Key-Value cache its performance is better than Redis, but its functionality is not as rich as Redis's.
MongoDB (document database, NoSQL, OLTP)
It is called a document database because its data is stored in the form of JSON documents
MongoDB's concepts differ from those of many databases: a collection corresponds to a table, a document to a row, and a field to a column. For example:
db.user.insert(
  {
    "name": "Lin",
    "age": 30,
    "address": {
      "street": "Zhongshan Road",
      "city": "Guangzhou",
      "zip": 510000
    },
    "hobbies": ["surfing", "coding"]
  }
)
This is an insert statement: db refers to the current database, user is the collection (equivalent to a table), the JSON passed to insert is the document (equivalent to a row in other databases), and name, age, street, etc. are the fields (equivalent to columns)
The same document can be inserted multiple times without being overwritten: MongoDB automatically creates an \_id field as the primary key and assigns a different value each time, so there is no conflict. You can also specify \_id when inserting, but if that \_id already exists, an error is thrown
- As you can see, MongoDB stores unstructured data and requires neither a predefined collection nor a predefined data structure
- Provides rich query expressions
- Supports secondary indexes and automatic load balancing; reads are more efficient than writes
- Supports distributed deployment, failure recovery, data redundancy, sharding, and horizontal scaling
- The storage engine is configurable; the default WiredTiger storage engine maps memory to files to improve performance, but memory consumption is high
- Newer versions support Join and transactions
- Supports security authentication
- Provides extensions, such as visualization tools and BI-integration tools
Application scenarios
MongoDB suits scenarios where data is highly unstructured or the source data is JSON, individual records are relatively large, and OLTP is the main workload. It is not suitable for scenarios with strict transaction requirements or complex queries over large amounts of data. Its syntax differs substantially from other databases and has some learning cost.
Hive (HDFS-based database engine, relational, OLAP)
- Hive is a data warehouse tool based on Hadoop
- Data is stored in HDFS. When creating a table, the storage format (TEXTFILE, ORCFILE, PARQUET, etc.) is specified with the STORED AS clause, or HBase storage with the STORED BY clause; you can create a new table or map an existing HBase table (see the sketch after this list)
- Queries are executed as MapReduce, Spark, or similar jobs
- Provides HQL (HiveQL), an SQL-like query language, and supports user-defined functions (UDFs)
- Newer versions support transactions (enabled when creating the table)
- Supports massive data
- Structured data
- Supports CRUD operations
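A minimal HiveQL sketch of the STORED AS usage above (table, column, and partition names are hypothetical):
-- Hypothetical partitioned table stored as ORC files on HDFS
CREATE TABLE IF NOT EXISTS user_behavior (
    user_id    STRING,
    action     STRING,
    event_time TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS ORC;
-- Batch load into a partition, then a typical analytical query
INSERT INTO TABLE user_behavior PARTITION (dt = '2024-01-01')
VALUES ('u001', 'click', '2024-01-01 10:00:00');
SELECT action, COUNT(*) AS cnt
FROM user_behavior
WHERE dt = '2024-01-01'
GROUP BY action;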
Application scenarios
- Not suitable for OLTP, mainly used as OLAP for batch query of big data, requires Hadoop platform
Impala (based on HDFS, HBase, Kudu storage, parallel computing, relational, OLAP)
- A memory-based, distributed, parallel database query engine developed by Cloudera
- Mainly implemented in C++, using JNI for interaction with Hadoop
- Impala uses the same metadata, SQL, ODBC driver, and UI as Hive, which improves the SQL query performance of HDFS and provides a similar user experience.
- Like Hive, you can specify HDFS storage formats such as TEXTFILE, ORCFILE, PARQUET through STORED AS
- Tables created or modified through Hive must be manually synchronized to Impala
- Impala not only shares SQL conventions with Hive; its metadata is actually stored in Hive's metastore
- In addition to HDFS, table data can also be stored in HBase: create the table in HBase, then create a mapping table in Hive via STORED BY. Since Impala and Hive share metadata, once Hive has created the table you only need to run the refresh command INVALIDATE METADATA in Impala to see the corresponding HBase table
- Supports Join, aggregation, and other functions
- Support JDBC, ODBC
- Unlike Hive, Impala does not rely on MapReduce, but runs its own engine on each HDFS DataNode for parallel processing
- Impala's parallel processing engine is mainly composed of state store, catalog service, and multiple impala daemons
- Each impala daemon can receive client requests. A daemon consists of a query planner, a query coordinator, and query executors: the planner receives the client's SQL query and decomposes it into sub-queries; the coordinator distributes the sub-queries to the executors of the various daemons; the daemons fetch HDFS/HBase data and compute; the results go back to the coordinator, which aggregates them and returns the final result to the client
- Impala has a decentralized structure: every daemon can accept connections and queries, and HAProxy can be used to load-balance across multiple daemons
- The state store is used to collect and monitor the status of each daemon
- The catalog service notifies all impala daemons in the cluster of metadata changes made by SQL
- Impala's calculations are performed in memory, and the memory requirements are relatively high
- Impala supports the update operation only from version 2.8, and only on Kudu storage: Kudu must be installed and specified as the table's storage via STORED AS KUDU. Kudu is a columnar storage manager developed by Cloudera for OLAP, designed to balance HDFS and HBase: its random read/write performance is better than HDFS (e.g. Parquet) but worse than HBase, while its large-scan query performance is worse than HDFS (e.g. Parquet) but better than HBase. Kudu is highly integrated with Impala and can also integrate with MapReduce/Spark. By replacing HDFS/HBase with Kudu, Impala can perform updates, covering both OLAP and data-modification needs, so it suits scenarios that are mainly OLAP with some update requirements. Kudu's consistency is configurable, it uses a structured table data model that requires a primary key, and instead of HDFS it has its own components to store and manage data; it is written in C++, so there is no risk of long full-GC pauses. See the sketch below
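A minimal Impala-on-Kudu sketch of the update capability just described (table and column names are hypothetical, and the partition count is illustrative):
-- Hypothetical Kudu-backed table; Kudu requires a primary key
CREATE TABLE device_metrics (
    device_id    STRING,
    metric_name  STRING,
    metric_value DOUBLE,
    PRIMARY KEY (device_id, metric_name)
)
PARTITION BY HASH (device_id) PARTITIONS 4
STORED AS KUDU;
INSERT INTO device_metrics VALUES ('d001', 'temperature', 21.5);
-- UPDATE works because the data lives in Kudu rather than on HDFS
UPDATE device_metrics SET metric_value = 22.0
WHERE device_id = 'd001' AND metric_name = 'temperature';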
Application scenarios
- Impala is not suitable for OLTP, mainly used as OLAP for batch query of large data
- Hadoop platform and Hive required
- Much better performance than Hive
- Better performance as an OLAP than something like Phoenix
- Mainly promoted by Cloudera; CDH ships with Impala integrated
Presto (based on multiple data sources, parallel computing, relational, OLAP)
- A memory-based, distributed, parallel database query engine developed by Facebook
- It consists of a coordinator server, a discovery server (usually embedded in the coordinator, but it can be standalone), and multiple worker servers
- The coordinator interacts with clients, manages workers, parses statements, plans queries, creates a series of stages, and then converts them into a series of tasks distributed to different workers for concurrent execution
- Workers execute tasks and process data: they fetch data through connectors and exchange intermediate data with other workers; the final result is returned to the client by the coordinator
- Connectors are adapters that allow Presto to access different data sources
- The main built-in connector is Hive; many other connectors exist, such as Cassandra, Elasticsearch, Kafka, Kudu, Redis, MySQL, and PostgreSQL
- Catalogs are defined in the configuration files. A catalog holds schemas and points to a data source through a connector, and table addressing in Presto starts from the catalog: for example, hive.test\_data.test refers to the test table in the test\_data schema of the hive catalog. What a schema means depends on the connector: for MySQL, a Presto schema is a MySQL schema; for Cassandra, it is a Cassandra keyspace. Multiple catalogs can use the same kind of connector: if the environment has several Kafka clusters, for example, there can be two catalogs, kafka1 and kafka2
- A statement is the SQL text Presto receives; it is parsed into a query plan, and the query is divided into multiple stages that form a tree, each stage aggregating the results of the stages below it. Each stage is split into one or more tasks, which are distributed to different workers and executed in parallel; each task processes different data splits, and each task runs one or more drivers to process data concurrently
- Presto supports JDBC; the JDBC URL format is jdbc:presto://host:port/catalog/schema, jdbc:presto://host:port/catalog, or jdbc:presto://host:port
- Supports Join queries, including joins across multiple data sources (joining several large tables may hurt performance). When querying across data sources, specify the fully qualified table name, i.e. [catalog].[schema].[table], and connect over JDBC with jdbc:presto://host:port without specifying a catalog or schema (see the sketch after this list)
- Limited support for subqueries
- UPDATE operations are not supported
- Supports security mechanisms
- Supports standard ANSI SQL
- Good scalability
- Can be integrated with Tableau
- Spark support
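A minimal sketch of a cross-source join using the fully qualified [catalog].[schema].[table] naming described above (catalog, schema, and table names are hypothetical):
-- Hypothetical catalogs: "hive" (fact data on HDFS) and "mysql" (dimension data)
SELECT o.order_id, o.amount, u.user_name
FROM hive.sales.orders AS o
JOIN mysql.crm.users AS u ON o.user_id = u.user_id
WHERE o.order_date >= DATE '2024-01-01';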
Application scenarios
- Suitable for OLAP query of large data volume with multiple data sources
- The performance may be similar to Impala, but it supports multiple data sources and does not depend on Hadoop
Greenplum (multiple PostgreSQL, parallel computing, relational, OLAP)
- A distributed, parallel query engine built on multiple PostgreSQL instances
- The embedded PostgreSQL has been modified to support parallel, distributed computing
- Mainly consists of a master, multiple segments, and the interconnect
- The master maintains metadata, accepts and authenticates client connections, receives SQL requests, parses SQL, generates query plans, distributes tasks to the segments, aggregates the results returned by the segments, and returns the final result to the client; a standby master can be configured
- Each segment is an independent PostgreSQL database responsible for storing part of the data and performing the corresponding query computation; segments can also be configured with a backup (mirror) mechanism
- Interconnect is the network layer of Greenplum, responsible for the link between master and segment, as well as the link between each segment
- Connections and SQL syntax are compatible with PostgreSQL; supports JDBC and ODBC
- When creating a table, you can choose column-oriented storage, row-oriented storage, or an external table (the data lives in another system such as HDFS while Greenplum stores only the metadata); see the sketch after this list
- Operating on external data requires installing PXF (Platform Extension Framework), which adds support for Hive, HBase, Parquet, S3, MySQL, Oracle, etc.
- Supports security and permission configuration
- Supports distributed transactions and ACID with strong data consistency; uses MVCC (Multi-Version Concurrency Control) rather than locks to ensure consistency
- shared-nothing architecture
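A minimal Greenplum-style sketch of choosing a column-oriented layout and a distribution key when creating a table (table and column names are hypothetical, and the storage options are illustrative):
-- Hypothetical append-optimized, column-oriented fact table
CREATE TABLE fact_page_view (
    view_id   BIGINT,
    user_id   INT,
    url       TEXT,
    view_time TIMESTAMP
)
WITH (appendonly = true, orientation = column)
DISTRIBUTED BY (user_id);
-- Rows are hashed on user_id and spread across the segments
INSERT INTO fact_page_view VALUES (1, 42, '/home', now());
SELECT user_id, COUNT(*) AS views FROM fact_page_view GROUP BY user_id;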
Application scenarios
Similar to Impala and Presto in providing parallel in-memory computing, though Greenplum's performance may be slightly lower. Greenplum comes in open-source and commercial editions, and some features are only available in the commercial edition.
Kylin (based on Hive, HBase, parallel computing, relational, multi-dimensional, precomputed OLAP)
Traditional OLAP is divided into ROLAP (Relational OLAP) and MOLAP (Multi-Dimensional OLAP) according to how data is stored. ROLAP stores data in a relational model; its advantages are small storage size and flexible querying, and its disadvantage is that every query must aggregate and compute over the data. Kylin belongs to MOLAP.
Kylin precomputes results for the different combinations of dimensions to form a Cube, so queries are fast. The downsides are that the amount of precomputed data is hard to control (N dimensions give 2^N combinations, so a dimension explosion can occur) and that any change to the source data requires recomputation.
For example, with two dimension tables, Phone and Country, and a Sale fact (detail) table, taking phone brand, country, and date as three dimensions gives 8 combinations: (none), (brand), (country), (date), (brand, country), (brand, date), (country, date), (brand, country, date). The summary information (sum, count, etc.) of sales can be computed in advance for each of these 8 group-by combinations; the summary for one dimension combination is called a cuboid, and all cuboids together form the Cube.
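For instance, the query that the (brand, country) cuboid answers directly might look like the following sketch (table and column names are hypothetical):
-- Answered from the precomputed (brand, country) cuboid instead of scanning the detail table
SELECT p.brand, c.country, SUM(s.price) AS total_sales, COUNT(*) AS sale_count
FROM sale s
JOIN phone   p ON s.phone_id   = p.id
JOIN country c ON s.country_id = c.id
GROUP BY p.brand, c.country;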
Kylin's data source can be Hive or Kafka (messages in JSON format, with the keys as column names)
Kylin's precomputed results are stored in HBase: the RowKey is a combination of dimension values, and the corresponding aggregates are stored in columns. A SQL query thus becomes a RowKey scan plus, if necessary, further computation over the column values, which naturally improves query performance and enables sub-second queries
Kylin supports ODBC, JDBC, RESTful API and other interfaces
Kylin can be integrated with BI tools such as Tableau and PowerBI
The use steps are as follows
- Create Project
- Sync Hive table or Kafka table
- Create Data Model
- Create and name the Model
- Select the Fact Table and Lookup Tables (lookup tables mainly hold dimension information) and the Join fields
- Select dimension columns from the Fact Table and Lookup Tables (columns the Cube can group by)
- Select measure columns from the Fact Table (columns the Cube can aggregate)
- Select a column from the Fact Table for date partitioning, or leave it blank if not needed
- Add Filters (used by the Cube as WHERE conditions)
- Create cube
- Create and name the Cube, and select the Data Model to associate
- Add dimension columns (selected from the dimension columns configured in the Data Model)
- Add measure columns (selected from the measure columns configured in the Data Model)
- Eight aggregation operations can be configured for measure columns: SUM, MAX, MIN, COUNT, COUNT\_DISTINCT, TOP\_N, EXTENDED\_COLUMN, and PERCENTILE (AVG is simply SUM divided by COUNT, so there is no need to configure AVG or other operations that can be derived from the precomputed results)
Building the Cube is actually done by MapReduce/Spark jobs; when the job finishes, the results are written to HBase
After a successful build, you can query directly with SQL; under the hood, the RowKey is looked up by the dimension values, the aggregates stored in the columns are retrieved, and further computation is performed if necessary
If the data source has changed, you need to rebuild the cube
Kylin is thus a pure OLAP tool: it improves query performance through precomputation, but cannot reflect data-source changes promptly; precomputation can be time-consuming and space-hungry, and it must be integrated with Hadoop
Other precomputed OLAP query engines include Druid
ClickHouse (column storage, vectorized computing, parallel computing, OLAP)
OLAP database developed by Russian company Yandex
- Column Store Benefits for OLAP
- OLAP typically retrieves only a few of many columns: with row storage this means scanning row by row, with every row being large, whereas column storage only scans the columns being retrieved, reducing IO
- If some records do not contain the column being retrieved, a row store still has to scan those records to find out, while a column store has no such problem: what is not stored is simply not scanned
- Since values in the same column have consistent type and size, columnar data compresses better and more efficiently, further reducing IO
- The reduction in IO also means that more data can be cached in memory
- vectorized computation
- SIMD (Single Instruction, Multiple Data): modern CPUs can apply a single instruction to a whole set of data (a vector) at once at the CPU level, which suits pure, branch-free arithmetic; workloads full of comparisons, jumps, and branches do not benefit
- ClickHouse has a vectorized execution engine that uses SIMD instructions wherever possible to process data in batches and in parallel, greatly increasing processing capacity
- Mainly implemented in C++
- Decentralized structure: the cluster consists of servers, any of which can accept client connections and queries; on receiving a request, a server coordinates with the others to compute in parallel; each server is multi-threaded; coordination and synchronization go through ZooKeeper
- Supports sharding: data can be spread across shards on different nodes; a shard can be a single node, or multiple nodes can form a shard with replica backups, all defined in the configuration file
- Supports partitioning, specified with the PARTITION BY clause when creating a table
- Sharding and partitioning are sometimes confused: sharding means a table's data is distributed across different nodes (a node can hold data of multiple databases and tables), while partitioning means splitting a large table into smaller ones by some column, for example partitioning by a date column into one partition per day; both a single partition and the whole table can be queried
- Supports replica backups and data integrity
- Table Engine
  - When creating a table, you must specify the table engine with the ENGINE clause, which determines how the data is stored
  - The most commonly used are the MergeTree family of engines, such as
    - ENGINE = MergeTree()
    - ENGINE = AggregatingMergeTree()
  - The relatively lightweight Log family of engines, such as
    - ENGINE = Log;
    - ENGINE = TinyLog;
  - Engines that query other data sources, such as
    - ENGINE = ODBC(connection\_settings, external\_database, external\_table)
    - ENGINE = JDBC(dbms\_uri, external\_database, external\_table)
    - ENGINE = MySQL('host:port', 'database', 'table', 'user', 'password')
    - ENGINE = PostgreSQL('host:port', 'database', 'table', 'user', 'password')
    - ENGINE = MongoDB(host:port, database, collection, user, password)
    - ENGINE = HDFS(URI, format)
    - ENGINE = Kafka() SETTINGS kafka\_broker\_list = 'host:port', kafka\_topic\_list = 'topic1'
  - Special engines, such as
    - ENGINE = Memory: the data lives in memory
    - Distributed: a table created on one server is only that server's local table, not a distributed table. To create a distributed table, create a table with the same name on every server and then create a Distributed table on one of them (it is created automatically on all servers); the distributed table is a logical table that stores no data itself but maps to each server's local table and automatically parallelizes computation
      - ENGINE = Distributed(cluster\_name, database, table, [sharding\_key])
      - cluster\_name is configured in the configuration file
- MergeTree is the usual storage engine: data is quickly appended in order, and merge and compression operations run in the background, which gives good insert performance (see the sketch after this list)
- Primary index: data is sorted by the Primary Key
- The sort fields can also be specified with ORDER BY when creating the table
- Supports secondary indexes, also called data-skipping indexes, such as the minmax index, which records the maximum and minimum values of a column (or an expression) within each block of data; at query time this information decides whether the block can be skipped (this feels a bit odd; the performance is presumably worse than building a separate index table)
- Support TTL, including column-level, row-level, partition-level TTL
- Support HTTP, TCP interface
- Support JDBC, ODBC
- There are third-party tools to support importing data from other databases such as PG into ClickHouse
- Third-party tools and some visualization tools such as Grafana, DBeaver, and Tabix can integrate with it
- Third-party tools support integration with Kafka, Flink, Spark, etc.
- Supports SQL, including GROUP BY, ORDER BY, JOIN, some subqueries, and other features
- Supports complex data types such as array, json, tuple, set, etc.
- Supports approximate computation: for example, an average can be computed over only part of the data, improving performance at the cost of accuracy
- Adaptive join algorithms: hash join is preferred, and merge join is used automatically when there are multiple large tables
- Security mechanism, role-based permission control
- Support error recovery, good scalability
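A minimal ClickHouse sketch of the MergeTree usage, partitioning, ordering, and TTL features described above (table and column names are hypothetical):
-- Hypothetical event table using the MergeTree engine
CREATE TABLE IF NOT EXISTS page_events (
    event_date  Date,
    user_id     UInt64,
    url         String,
    duration_ms UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)   -- one partition per month
ORDER BY (user_id, event_date)      -- sort key, which also serves as the primary index
TTL event_date + INTERVAL 90 DAY;   -- drop rows older than 90 days
INSERT INTO page_events VALUES ('2024-01-01', 42, '/home', 120);
SELECT user_id, count() AS views, avg(duration_ms) AS avg_ms
FROM page_events
GROUP BY user_id
ORDER BY views DESC;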
Where it falls short
- Insufficient support for high concurrency
- No mature transaction functionality
- The performance of modifying and deleting data is relatively poor, and only limited support is provided
- The Primary Key uses a sparse index: an index entry points only to a block of data, and the exact rows must then be found by scanning within the block, so retrieving a small amount of data or a single record performs poorly
Does not depend on Hadoop; columnar storage, vectorized execution, parallel and multi-threaded processing, multiple storage engines
Single-table query performance is excellent, much better than Impala, Presto and the like
Multi-table query performance is poor, worse than Impala, Presto and the like
Elasticsearch (inverted index, word segmentation, search engine)
- Elastic Stack is a set of components including Elasticsearch, Logstash, Filebeat, Kibana, etc.
- Elasticsearch is a search engine based on Apache Lucene, mainly developed in Java
- An Elasticsearch cluster mainly consists of master, data, ingest, and coordinating nodes
- Each node can be configured with multiple roles at once, such as master and data, but in large clusters each node usually has only one role
- Coordinating is a capability every node has without extra configuration: both master and data nodes can coordinate, i.e. receive client read/write requests, forward them to the appropriate nodes, combine the results, and return them to the client. To avoid putting extra pressure on master/data nodes in large clusters, dedicated coordinating-only nodes can be configured by setting the role to empty or setting master, data, and ingest to false (depending on the version). Such nodes only receive and respond to client requests and do no other work, acting like a reverse-proxy load balancer, which improves concurrency
- The master manages the whole cluster: creating and deleting indices and other administrative operations, deciding which node each shard goes to, tracking the state of the other nodes, and so on. Multiple master-eligible nodes can be configured for HA; an odd number, at least three, is required, and the system automatically elects one of them as the active master. If the master fails, a new one is elected from the remaining master-eligible nodes. Master nodes can run on lower-spec hardware
- Data nodes store, compute, and process data and have high resource requirements; they can be further configured as dedicated hot, warm, or cold data nodes
- Ingest is an optional node role dedicated to pipeline preprocessing of certain data and operations
- Elasticsearch's data storage is mainly composed of index, type, document
- An index is similar to a database in SQL. An index can be created directly or from an index template; templates are typically used when many indices share the same configuration, for example a new index created every day to store that day's logs
- A type is similar to a table in SQL (the analogy is not entirely accurate and has been officially clarified, because different types are not completely independent of each other). Earlier versions allowed multiple types per index; from 6.0 only one type is allowed; from 7.0 the type concept is deprecated; and from 8.0 types are no longer supported at all
- A document is like a row in SQL: documents are unstructured data represented as JSON, and each JSON field is equivalent to a column
- Each document will have a unique \_id, if not specified, it will be automatically generated by the system
- Each document has a strictly increasing sequence number \_seq\_no representing the order in which documents are written/updated
- Each document has a \_version field, which is incremented by 1 each time this field is changed
- An index can be created in advance, or it will be created automatically when data is first written
- Document fields do not have to be defined in advance: they are created automatically when data is written, and different documents can have different fields; alternatively, fields can be defined in advance so that every document must have the same fields
- Elasticsearch automatically indexes all fields and automatically tokenizes them: a phrase such as "hello world" is split into the two tokens "hello" and "world" for indexing. The untokenized "hello world" is a keyword (with a length limit of 256), while a tokenized term such as "hello" is text
- Elasticsearch uses an inverted index, which is mainly composed of three parts: Term Index (word index), Term Dictionary (word dictionary), and Posting List (index item list)
- The Term Index lives in memory and stores not whole words but word prefixes (such as he, wor, ad, sar), each pointing to where the words starting with that prefix begin in the Term Dictionary. This keeps the Term Index small enough to hold in memory, and it quickly locates the region of the Term Dictionary that needs to be read, greatly reducing disk IO
- The Term Dictionary lives on disk (the number of words is usually very large) and records the mapping from each indexed word to its Posting List, using a B+ tree or hash table for high-performance insertion and lookup
- The Posting List records: the ids of all documents in which the word appears, the number of times the word appears in the document, and the position of the word in the document
- The search engine finds the corresponding document through such an inverted index, and then does further processing
- Since all fields are indexed, the amount of data will be very large
- The data is first written into the memory, and then the data is periodically placed on the disk to form segments. When there are more segments, merges will be performed to form larger segments.
- To avoid losing data that is still only in memory, a translog is written, similar to HBase's WAL. This also costs disk IO and affects performance, but writing the translog is simpler than flushing the data itself to disk
- A segment consists of multiple files recording metadata, field information, forward-index data (looking up a document's data by its id), column-wise field data (doc values), inverted-index data, and so on
- Supports operation through a REST API; many operations are expressed via a rich JSON syntax in the request body
- Support Event Query Language (EQL): for event-based time series data, such as logs, metrics, and traces, specified through the Body of the REST API
- Supports JDBC and ODBC, where a table maps to an index and a column maps to a field; this SQL dialect does not support JOIN
- Transactions are not supported
- Cross-index ("cross-table") queries are not fully supported; defining parent-child documents and join-type fields is relatively complicated
- There is some latency between writing and searching: newly written data cannot be searched immediately and takes at least about one second to become visible
- It is different from traditional databases and requires a certain learning cost
Examples of some REST API operations
curl localhost:9200 # view basic cluster info
curl localhost:9200/_cluster/health?pretty # check cluster health (pretty formats the JSON output)
curl localhost:9200/_cluster/state?pretty # view cluster state
curl localhost:9200/_cluster/stats?pretty # view cluster statistics
curl localhost:9200/_nodes?pretty # view node info
curl localhost:9200/_cat # list the available _cat endpoints (the cat commands return simplified output)
curl localhost:9200/_cat/health
curl localhost:9200/_cat/nodes
curl localhost:9200/_cat/indices
curl -X PUT 'http://localhost:9200/my_index/my_doc/123' -H 'Content-Type: application/json' -d '{
"name": "Lin", "title": "senior designer", "age": 30
}' ## specify document id 123; my_index, my_doc, and the fields are created automatically
curl -X POST 'http://localhost:9200/my_index/my_doc' -H 'Content-Type: application/json' -d '{
"name": "Wang", "title": "senior designer", "age": 35
}' ## the document id is generated automatically by the system
curl -X POST 'http://localhost:9200/my_index/my_doc_2' -H 'Content-Type: application/json' -d '{
"name": "n_1", "type": "t_1", "value": 1
}' ## error: an index is not allowed to have two types
curl -X POST 'http://localhost:9200/my_index/_doc' -H 'Content-Type: application/json' -d '{
"name": "n_1", "type": "t_1", "value": 1
}' ## allowed: _doc is the same as my_doc (you can use _doc from the start instead of a type name)
curl -X POST 'http://localhost:9200/my_index/_doc' -H 'Content-Type: application/json' -d '{
"name": "Li",
"address": {"city": "guangzhou", "district": "tianhe"}
}' ## new fields and nested types are allowed; a list such as "address": ["xxx"] does not seem to be supported
curl localhost:9200/my_index/my_doc/123?pretty ## view the record with id 123
curl localhost:9200/my_index/my_doc/_search?pretty ## view all records
curl localhost:9200/my_index/_search?pretty
curl localhost:9200/_all?pretty ## list the settings and mappings (i.e. the field info) of all indices
curl localhost:9200/my_index?pretty ## view the settings and mapping of my_index
curl localhost:9200/my_index/_mapping?pretty ## view the mapping of my_index
curl -X GET -H "Content-Type: application/json" localhost:9200/my_index/_search?pretty -d '{
"query": {
"term":{
"name":"lin"
}
}
}' ## a simple query; more syntax and features are available
curl -X GET -H "Content-Type: application/json" localhost:9200/my_index/_search?pretty -d '{
"query": {
"term":{
"title.keyword":"senior designer"
}
}
}' ## by default the tokenized field is queried; to match the untokenized value, add .keyword
curl -X GET -H "Content-Type: application/json" localhost:9200/my_index/_search?pretty -d '{
"query": {
"term":{
"address.city":"guangzhou"
}
}
}' ## query a nested field
curl localhost:9200/_search?pretty ## search across all indices
curl -H "Content-Type: application/json" localhost:9200/my_index/_analyze?pretty -d '{
"analyzer" : "standard",
"text" : "This is the apple"
}' ## see how a piece of text is analyzed (tokenized)
curl -X PUT 'http://localhost:9200/my_index_3' ## create an index
curl -X PUT -H "Content-Type: application/json" 'http://localhost:9200/my_index_6' -d '{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1
},
"mappings": {
"properties": {
"name": {
"type": "text"
},
"title": {
"type": "text"
},
"value": {
"type": "long"
}
}
}
}' ## create an index and define its fields; from version 7 onward there is no need to specify a type (such as my_doc)
Application scenarios
Suitable for search-oriented business scenarios; it first gained popularity as part of ELK (Elasticsearch + Logstash + Kibana) for log collection and search.
Spark/Flink
Databases are powerful, but they fall short when complex logic and computation are involved; dedicated computing tools are needed in such cases.
Both Spark and Flink are high-performance parallel computing engines. Flink is more inclined to real-time streaming services, while Spark is more inclined to batch processing services, both of which can be used to process data efficiently.
BI
Data usually needs to be visualized; the more commonly used BI tools are Tableau (commercial) and Power BI.
Overall system architecture
K8S (container deployment) + Spring Cloud (microservices) + Keycloak (authentication) + Kafka (data streaming) + Spark/Flink (data processing) + ELK (log collection) + PG/MySQL (basic data) + NoSQL OLTP (large-volume business data) + OLAP (analytics) + BI (data visualization)