In-depth interpretation of the overall architecture design and implementation of the MRS IoTDB time series database

Abstract: This article will systematically introduce the ins and outs and functional characteristics of MRS IoTDB, focusing on the overall architecture design and implementation of the MRS IoTDB time series database.

This article is shared from the HUAWEI cloud community post "MRS IoTDB timing database overall architecture design and implementation"; original author: cloudsong.

MRS IoTDB is the latest time series database product launched in the Huawei FusionInsight MRS big data suite. Its design has proven increasingly competitive in the time series database field and has been recognized by a growing number of users. To help readers understand MRS IoTDB better, this article systematically introduces its background and functional characteristics, focusing on the overall architecture design and implementation of the MRS IoTDB time series database.

1. What is a time series database

A time series database (TSDB) is a dedicated database system for storing, querying, analyzing and processing time-stamped data, that is, data that changes in time order (time-serialized data). Typically, a time series database records values that change continuously over time (measurement points, events), such as the temperature, humidity, speed, pressure, voltage and current of IoT devices, or the buying and selling prices of securities.

With the continuing development and application of big data technology, two kinds of data, represented by the Internet of Things (IoT) and financial analysis, share one trait: large volumes of sensor values or event data are generated continuously over time. Time series data is a continuous sequence of values ordered along a time axis given by the timestamp of each data point (event). For example, the temperature readings of an IoT device at different times constitute a time series.
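As a minimal illustration of the idea, a time series can be modeled as a list of (timestamp, value) pairs in time order. The readings and the one-minute sampling interval below are invented for the example.

```python
from datetime import datetime, timedelta

# Hypothetical readings from one temperature sensor, sampled once per minute.
start = datetime(2021, 1, 1, 8, 0, 0)
temperature_series = [
    (start + timedelta(minutes=i), value)
    for i, value in enumerate([20.1, 20.3, 20.2, 20.6, 21.0])
]


def is_time_ordered(series):
    """Return True if the timestamps are strictly increasing."""
    return all(a[0] < b[0] for a, b in zip(series, series[1:]))


for ts, value in temperature_series:
    print(ts.isoformat(), value)
```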

Whether it is sensor data generated by machines or social event data generated by human activities, there are some common features:

(1) High acquisition frequency: dozens of times, hundreds of times, one hundred thousand times or even one million times per second;

(2) High acquisition accuracy: at least support millisecond-level acquisition, and some need to support microsecond and nanosecond-level acquisition;

(3) Large collection span: 7*24 hours of continuous collection of data for several years or even decades;

(4) Long storage period: it needs to support the persistent storage of time series data, and even some data needs to be stored permanently for hundreds of years (such as seismic data);

(5) Wide range of query windows: time window queries must be supported at granularities from milliseconds, seconds, minutes and hours to days, months and years; count window queries must also be supported at granularities such as ten thousand, one hundred thousand, one million and ten million points;

(6) Data cleaning is difficult: time series data has complicated situations such as disorder, missing, abnormal, etc., which requires special algorithms for efficient real-time processing;

(7) High real-time requirements: Whether it is sensor data or event data, real-time processing capabilities of milliseconds and seconds are required to ensure real-time response and processing capabilities;

(8) Specialized algorithms: time series data carries many domain-specific analysis requirements in vertical fields such as seismology, finance, electric power and transportation, which call for algorithms such as time series trend forecasting, similar subsequence analysis, periodic forecasting, moving averages, exponential smoothing, autoregressive analysis and LSTM-based time series neural networks.
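The two kinds of query window mentioned in feature (5) can be sketched as follows. This is only an illustration with fixed, non-overlapping windows and an average as the aggregate; the function names and the millisecond timestamps are assumptions, not part of any IoTDB API.

```python
from collections import defaultdict


def time_window_avg(points, window_ms):
    """Average values in fixed, non-overlapping time windows.

    points: iterable of (timestamp_ms, value); window_ms: window length.
    Returns {window_start_ms: mean}.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window_ms].append(value)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}


def count_window_avg(points, size):
    """Average values in consecutive windows of `size` points, in time order."""
    ordered = sorted(points)
    result = []
    for i in range(0, len(ordered), size):
        chunk = ordered[i:i + size]
        result.append(sum(v for _, v in chunk) / len(chunk))
    return result


points = [(0, 1.0), (400, 3.0), (1000, 5.0), (1900, 7.0), (2100, 9.0)]
print(time_window_avg(points, 1000))  # {0: 2.0, 1000: 6.0, 2000: 9.0}
print(count_window_avg(points, 2))    # [2.0, 6.0, 9.0]
```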

These common characteristics show that time series scenarios pose challenges to traditional relational database storage and big data storage. Neither structured storage in a relational database nor storage in a NoSQL database can meet the needs of highly concurrent real-time writing and querying of massive time series data. A database dedicated to storing time series data was therefore urgently needed, and the concept and products of the time series database were born.

Note that a time series database is different from both a temporal database and a real-time database. A temporal database, such as TimeDB, records the history of object changes, that is, it maintains the history of how data has changed. It is a system for fine-grained maintenance of the time state of records in a traditional relational database. A time series database is completely different from a relational database and stores only the measurement point values that correspond to different timestamps. A more detailed comparison between time series databases and temporal databases will be covered in a follow-up article, so it is not repeated here.

Time series databases are also different from real-time databases. Real-time databases were born in traditional industry: as modern manufacturing processes and large-scale industrial automation developed, traditional relational databases struggled to meet the storage and query requirements of industrial data, so real-time databases suited to industrial monitoring appeared in the mid-1980s. Because they were born so early, real-time databases have limitations in scalability, big data ecosystem integration, distributed architecture and data types, but they also have the advantages of complete product support and complete industrial protocol integration. Time series databases were born in the era of the Internet of Things and have advantages in big data ecosystem integration and cloud native support.

The basic comparison information of time series database, temporal database and real-time database is as follows:

2. What is MRS IoTDB time series database

MRS IoTDB is a time series database product in the Huawei FusionInsight MRS big data suite. It is a high-performance enterprise-level time series database product launched on the basis of deep participation in the open source version of the Apache IoTDB community. IoTDB, as the name suggests, is a dedicated time series database software for the IoT field. It is a domestic Apache open source software initiated by Tsinghua University. Since the birth of IoTDB, Huawei has been deeply involved in the architecture design and core code contribution of IoTDB. It has invested a lot of manpower on the stability, high availability and performance optimization of the IoTDB cluster version, and has put forward a lot of improvement suggestions and contributed a lot of code.

When IoTDB was first designed, its authors comprehensively analyzed the time series database products on the market, including mainstream products such as Timescale (based on a traditional relational database), OpenTSDB (based on HBase), KairosDB (based on Cassandra) and InfluxDB (based on a structure dedicated to time series). Drawing on the strengths of these implementations, IoTDB formed its own distinctive technical advantages:

(1) Support high-speed data writing

The unique tLSM algorithm based on two-stage LSM merging effectively guarantees that IoTDB can easily realize the concurrent write capability of tens of millions of measurement points per second on a single machine even in the presence of out-of-order data.

(2) Support high-speed query

Supports millisecond-level queries over TB-level data.

(3) Complete functions

Supports complete data operations such as CRUD (an update is achieved by overwriting the measurement point of the same device at the same timestamp; deletion is achieved by setting a TTL expiration time), supports frequency domain queries, provides rich aggregation functions, and supports specialized time series processing such as similarity matching and frequency domain analysis.

(4) Rich interfaces, easy to use

It supports multiple interfaces, including JDBC, a Thrift API and an SDK. Its SQL-like language extends standard SQL statements with functions commonly used in time series processing, such as sliding time window statistics, which makes the system more efficient to use. The Thrift API supports calls from multiple languages such as Java, C/C++, Python and C#.

(5) Low storage cost

The TsFile time series file storage format, developed independently by IoTDB, is optimized for time series processing. It is based on columnar storage and supports explicit data type declarations, and different data types are automatically matched with different compression algorithms such as SNAPPY, LZ4, GZIP and SDT. A compression ratio of 1:150 or even higher can be achieved (when the data precision is further reduced), which greatly reduces the user's storage cost. For example, one user originally used 9 KairosDB servers to store time series data, whereas IoTDB handles the same workload easily with 1 server of the same configuration.

(6) Multi-modal deployment across cloud, edge and device

IoTDB's lightweight architecture lets it realize "one engine across cloud, edge and device, and one copy of data for all scenarios." In the cloud service center, IoTDB can be deployed as a cluster to take full advantage of cloud cluster processing. At the edge, IoTDB can be deployed on an edge server as a stand-alone instance or as a cluster with a small number of nodes, depending on the edge server configuration. On the device terminal, IoTDB can be embedded directly in the terminal's local storage as a TsFile file, which the device reads and writes directly without starting or running an IoTDB database server; this greatly reduces the processing power required of the terminal device. Because the TsFile file format is open, any terminal language and development platform can read and write the TsFile binary byte stream directly or use the SDK that ships with TsFile, and can even send TsFile files to the edge or the cloud service center through FTP.

(7) Query and analysis integration

One copy of IoTDB data supports both real-time reads and writes and analysis by distributed computing engines. Because TsFile and the IoTDB engine are loosely coupled, IoTDB can use its dedicated time series engine to write and query time series data efficiently, while TsFile can also be read and written by big data components such as Flink, Kafka, Hive, Pulsar, RabbitMQ, RocketMQ, Hadoop, Matlab, Grafana and Zeppelin, which greatly enhances IoTDB's integrated query and analysis capabilities and its ecosystem extensibility.
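Feature (3) above describes an update as an overwrite at the same timestamp and a deletion as TTL expiry. The toy class below sketches those two semantics for a single series in memory. It is not IoTDB code; the class name, the millisecond units and the explicit `expire` call are assumptions made for illustration.

```python
class ToySeries:
    """Toy in-memory series keyed by timestamp (illustrative only)."""

    def __init__(self, ttl_ms=None):
        self.ttl_ms = ttl_ms
        self.points = {}  # timestamp_ms -> value

    def write(self, ts, value):
        # Writing to an existing timestamp replaces the old value (an "update").
        self.points[ts] = value

    def expire(self, now_ms):
        # Drop points older than the TTL (a "delete").
        if self.ttl_ms is None:
            return
        self.points = {
            t: v for t, v in self.points.items() if now_ms - t <= self.ttl_ms
        }

    def scan(self):
        return sorted(self.points.items())


series = ToySeries(ttl_ms=1000)
series.write(0, 1.0)
series.write(500, 2.0)
series.write(500, 2.5)       # overwrites the point at timestamp 500
print(series.scan())         # [(0, 1.0), (500, 2.5)]
series.expire(now_ms=1200)   # the point at timestamp 0 is now older than the TTL
print(series.scan())         # [(500, 2.5)]
```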

3. The overall architecture of MRS IoTDB

Building on the existing architecture of Apache IoTDB, MRS IoTDB integrates the enterprise-level core capabilities of MRS Manager, such as log management, operation and maintenance monitoring, rolling upgrades, security hardening, high availability, disaster recovery, fine-grained permission control, big data ecosystem integration, and optimized resource pool scheduling. It also refactors and optimizes the core architecture of Apache IoTDB, especially the distributed cluster architecture, and adds many system-level enhancements to stability, reliability, availability and performance.

(1) Interface compatibility:

Further improves the northbound and southbound interfaces and supports multiple access interfaces such as JDBC, CLI, API, SDK, MQTT, CoAP and HTTPS; further improves the SQL-like language, which is compatible with most Influx SQL statements; and supports batch import and export.

(2) Distributed peer-to-peer architecture:

Building on the Raft protocol, MRS IoTDB adopts an improved Multi-Raft protocol and optimizes its underlying implementation. It guarantees that there is no single point of failure and uses optimization strategies such as caching the leader to improve data query routing performance. The strong, medium and weak consistency strategies are also tuned at a fine granularity. A virtual node strategy is added to the consistent hash algorithm to avoid data skew, and table lookup is combined with hash partitioning, which further guarantees cluster scheduling performance while improving cluster availability.

(3) Double-level granular metadata management:

Because of the peer-to-peer architecture, metadata is naturally distributed across all nodes in the cluster, but the large volume of metadata consumes a great deal of memory. To balance memory consumption and performance, MRS IoTDB adopts a two-level granularity metadata management architecture. First, storage group metadata is synchronized among all nodes; second, time series metadata is synchronized among the nodes of each partition. When metadata is queried, the metadata tree is first pruned by storage group, which greatly reduces the search space, and the time series metadata query is then performed on the partition nodes that remain.

(4) High-performance access to local disks:

MRS IoTDB adopts the dedicated TsFile file format for time series optimization storage, adopts column storage format for adaptive encoding and compression, and supports pipeline optimization access and high-speed insertion of out-of-order data

(5) HDFS ecological integration:

MRS IoTDB supports HDFS file reading and writing, and implements optimizations such as local caching, short-circuit reads and an HDFS I/O thread pool to comprehensively improve its read and write performance on HDFS. At the same time, MRS IoTDB supports Huawei OBS object storage, with deep optimizations for higher performance.

On the basis of HDFS integration, MRS IoTDB supports efficient read and write of TsFile by MRS components such as Spark, Flink, and Hive.

(6) Multi-level authority control:

Support multi-level authority management and control of storage groups, devices, sensors, etc.
Support multi-level operations such as creation, deletion, and query
Support Kerberos authentication
Support Ranger permission structure

(7) Cloud-edge deployment:

It supports flexible cloud-edge deployment. The edge part can be connected through Huawei's IEF product or deployed directly in Huawei's IES.

The MRS IoTDB cluster version supports dynamic scale-out and scale-in, which gives cloud-edge deployments more flexibility.

4. Stand-alone architecture of MRS IoTDB

4.1 Basic concepts of MRS IoTDB

MRS IoTDB focuses on real-time processing of device sensor measurement points in the IoT field, so its basic architecture takes devices and sensors as core concepts. To make it easier for users to work with, and for IoTDB to manage, time series data, it also adds the concept of a storage group. These concepts are explained below:

Storage Group: A concept proposed by IoTDB to manage time series data, similar to the concept of a database in a relational database. From the user's perspective, it is mainly used to group device data; from the IoTDB database perspective, the storage group is a unit of concurrency control and disk isolation, and different storage groups can be read and written in parallel.

Device: Corresponding to specific physical devices in the real world, such as a manufacturing unit in a power plant, wind generators, automobiles, aircraft engines, seismic wave acquisition instruments, etc. In IoTDB, device is the unit for writing time series data at one time, and one write request is limited to one device.

Sensor: Corresponds to a sensor carried by a specific physical device in the real world, such as the sensors on a wind turbine that collect wind speed, steering angle and power generation. In IoTDB, a sensor is also called a measurement, which refers to the value collected by the sensor at a certain moment and is stored in a column in IoTDB in the form of <time, value>.

The relationship among storage groups, devices, and sensors is as follows:

Time Series: Similar to a table in a relational database, but this table mainly has three main fields: Timestamp, Device ID, and Measurement. In order to facilitate more description of the device information of the time series, IoTDB also adds extended fields such as Tag and Field. Among them, Tag supports indexing, and Field does not support indexing. In some time series databases, it is also called timeline, which means to record the value of a certain sensor value of a device that changes over time, forming a timeline that continuously adds measured point values along the time axis.

Path: IoTDB builds a tree structure with root as the root node, connecting storage groups, devices and sensors in series. A path runs from the root node through the storage group and device to a sensor leaf node. As shown below:
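A minimal sketch of such a path tree, assuming dot-separated paths of the form root.storage_group.device.sensor. The names sg1, d1 and so on are made up, and a single-level storage group is a simplification chosen for the example.

```python
def build_tree(paths):
    """Build a nested dict from dot-separated paths that start with 'root'."""
    tree = {}
    for path in paths:
        node = tree
        for part in path.split("."):
            node = node.setdefault(part, {})
    return tree


# Hypothetical paths: root -> storage group -> device -> sensor
paths = [
    "root.sg1.d1.temperature",
    "root.sg1.d1.humidity",
    "root.sg1.d2.speed",
    "root.sg2.d3.voltage",
]
tree = build_tree(paths)
print(sorted(tree["root"]))               # ['sg1', 'sg2']
print(sorted(tree["root"]["sg1"]["d1"]))  # ['humidity', 'temperature']
```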

Virtual storage group: Because the concept of storage group has the dual function of user grouping of devices and system concurrency control, the excessive coupling of the two will cause the impact of users' different usage methods on system concurrency control. For example: the user puts all irrelevant device data in a storage group, and IoTDB locks the storage group for concurrent control, which limits the concurrent read and write capabilities of the data. In order to realize the relatively loose coupling between storage group and concurrency control, IoTDB designed the concept of virtual storage group, which splits the concurrency control of storage group into the granularity of virtual storage group, thereby reducing the granularity of concurrency control.
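The effect of virtual storage groups on lock granularity can be sketched as below. This is a toy model, not IoTDB code: mapping a device to a virtual storage group by hashing its name, the CRC32 hash and the count of 4 are all assumptions made for illustration.

```python
import threading
import zlib


class ToyStorageGroup:
    """One lock per virtual storage group instead of one lock per storage group."""

    def __init__(self, name, virtual_groups=4):
        self.name = name
        self.locks = [threading.Lock() for _ in range(virtual_groups)]

    def virtual_group_of(self, device):
        # Assumed mapping: stable hash of the device name modulo the group count.
        return zlib.crc32(device.encode("utf-8")) % len(self.locks)

    def write(self, device, action):
        # Only writers that land in the same virtual storage group contend here.
        with self.locks[self.virtual_group_of(device)]:
            return action()


sg = ToyStorageGroup("root.sg1", virtual_groups=4)
print(sg.virtual_group_of("root.sg1.d1"))
print(sg.write("root.sg1.d1", lambda: "written"))  # written
```

Devices that hash to different virtual storage groups take different locks, so unrelated devices placed in one storage group no longer serialize each other's writes.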

4.2 Basic architecture of MRS IoTDB

A stand-alone MRS IoTDB instance consists mainly of storage groups. Each storage group is a unit of concurrency control and resource isolation and includes multiple Time Partitions. Each storage group corresponds to a write-ahead log (WAL) file and a TsFile time series data storage file. The time series data in each Time Partition is first written into the Memtable and into the WAL at the same time, and is periodically and asynchronously flushed to TsFile. The specific implementation mechanism will be introduced in detail later. The basic architecture of stand-alone MRS IoTDB is as follows:
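The write path described above (WAL plus Memtable, then a flush to TsFile) can be sketched with a toy class. It is a simplification under stated assumptions: plain lists stand in for the WAL and TsFile, and the flush is triggered synchronously by a size threshold rather than periodically and asynchronously as in the text.

```python
class ToyPartitionWriter:
    """Toy write path: WAL append + memtable insert, then flush (illustrative)."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.wal = []        # stands in for the write-ahead log file
        self.memtable = {}   # (series, timestamp) -> value
        self.flushed = []    # stands in for TsFile contents, one list per flush

    def write(self, series, ts, value):
        self.wal.append((series, ts, value))  # log first, for crash recovery
        self.memtable[(series, ts)] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        # Sort by series, then time, so out-of-order writes land in time order.
        self.flushed.append(
            sorted((s, t, v) for (s, t), v in self.memtable.items())
        )
        self.memtable.clear()
        self.wal.clear()  # flushed entries no longer need the log


writer = ToyPartitionWriter(flush_threshold=3)
writer.write("root.sg1.d1.temperature", 30, 1.0)
writer.write("root.sg1.d1.temperature", 10, 2.0)  # arrives out of order
writer.write("root.sg1.d1.temperature", 20, 3.0)  # reaches the threshold
print(writer.flushed[0])
```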

5. Cluster architecture of MRS IoTDB

5.1 Multi-Raft-based distributed peer-to-peer architecture
The MRS IoTDB cluster is a completely peer-to-peer distributed architecture. It not only avoids the single point of failure problem based on the Raft protocol, but also avoids the single point of performance problem caused by a single Raft consensus group through the Multi-Raft protocol. The communication, concurrency control and high-availability mechanisms have been further optimized.

First, all nodes in the entire cluster form a MetaGroup, which is only used to maintain the metadata information of the storage group. For example, a 4-node IoTDB cluster shown in the blue-gray box in the figure below, all 4 nodes form a metadata group (MetaGroup);

Secondly, the data group is constructed according to the number of data copies. For example, if the number of copies is 3, a data group (DataGroup) including 3 nodes is constructed. The data group is used to store time series data and the corresponding metadata.

In distributed systems, reliable data storage is usually achieved in multiple copies. Multiple copies of the same data are stored in different nodes and must be consistent. Therefore, it is necessary to use the Raft consensus protocol to ensure data consistency. It splits the consistency problem into several relatively independent sub-problems, namely Leader election, log replication, consistency guarantee, etc. There are the following important concepts in the Raft protocol:

(1) Raft group. There is an elected leader node in the Raft group, and the other nodes are followers. When a write request comes, it must first be submitted to the leader node for processing. The leader node first records the write request in its log, and then distributes the log to the follower nodes.

(2) Raft log. Raft uses logs to ensure that operations will not be lost. A Commit number and an Apply number are maintained in the log. If a log is committed, it means that more than half of the nodes in the current cluster have received and persisted the log. If a log is applied, it means that the current node has executed the log. When some nodes fail and recover again, the node's log will lag behind the leader's log. Before this node catches up with the leader's log, it cannot provide services to the outside world normally.
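The "more than half of the nodes" rule for committed logs can be sketched as a small function. This is a simplified illustration: real Raft additionally requires that the entry belong to the leader's current term before it is committed, and the function name is an assumption.

```python
def commit_index(persisted_index):
    """Highest log index persisted on a majority of nodes.

    persisted_index: last persisted log index of every node in the Raft
    group, leader included.
    """
    ordered = sorted(persisted_index, reverse=True)
    majority = len(ordered) // 2 + 1
    return ordered[majority - 1]


print(commit_index([5, 4, 2]))        # 4: index 4 is on 2 of 3 nodes
print(commit_index([7, 3, 3, 2, 1]))  # 3: index 3 is on 3 of 5 nodes
```

In the first call, the node whose log ends at index 2 lags behind the committed index 4, which is the situation in which a recovering node must catch up before serving requests.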

5.2 Metadata hierarchical management

Metadata management strategy is the key point in the distributed design of MRS IoTDB. When designing a metadata management strategy, we must first consider the use of metadata in the read and write process:

When writing data, metadata is required to check the validity of data types, permissions, etc.
When querying data, metadata is needed for query routing. At the same time, because the volume of metadata in time series scenarios is huge, the memory consumed by metadata also needs to be considered.

Existing metadata management strategies either hand metadata over to dedicated metadata nodes, which reduces read and write performance, or store all metadata on every node in the cluster, which consumes a lot of memory.

In order to solve the above problems, MRS IoTDB has designed a two-layer granular metadata management strategy. Its core idea is to separate metadata into storage groups and time series to manage them separately:

(1) Storage group metadata: The metadata group (MetaGroup) holds the routing information used when querying data. It stores the metadata of the storage groups (Storage Group), which is kept in full on all nodes in the cluster. Storage groups are coarse-grained, and the number of storage groups in a cluster is orders of magnitude smaller than the number of time series, so keeping only storage group metadata on all nodes greatly reduces memory usage.

Each node in a metadata group is called a metadata holder, and the Raft protocol is used to ensure data consistency between each holder and other holders in the same group.

(2) Time series metadata: The time series metadata in the data group (DataGroup) contains the data type, permissions and other information required for data writing, and is stored on the nodes where the data group is located (a subset of the nodes in the cluster). Because time series metadata is finer-grained and far more numerous than storage group metadata, storing it only on the nodes of its data group avoids unnecessary memory usage. It can still be located quickly through the first-level filtering of storage group metadata, and the Raft consistency of the data group also avoids a single point of failure in time series metadata storage.

Each node in the data group is called a data partition holder, and the Raft protocol is used to ensure data consistency between each holder and other holders in the same group.

This method manages metadata in metadata holders and data partition holders at two levels of granularity: storage group and time series. Because time series data and its metadata are synchronized within the data group, a data write does not need a storage group metadata check and synchronization operation; that operation is needed only when time series metadata is modified, which improves system performance. For example, when a time series is created and 500,000 data writes are then performed, the number of metadata check and synchronization operations drops from 500,000 to 1.
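The 500,000-to-1 example can be mimicked with a toy model of the two levels. Everything here is hypothetical (class, method and counter names); it only counts how often the cluster-wide layer is touched and does not model Raft replication.

```python
class ToyCluster:
    """Toy two-level metadata model (illustrative only)."""

    def __init__(self):
        self.storage_groups = {}   # on every node: storage group -> data group
        self.series_meta = {}      # per data group: {series name: data type}
        self.meta_group_syncs = 0  # times the cluster-wide layer is touched

    def create_timeseries(self, storage_group, data_group, series, dtype):
        # Modifying time series metadata touches the cluster-wide layer once.
        self.storage_groups[storage_group] = data_group
        self.series_meta.setdefault(data_group, {})[series] = dtype
        self.meta_group_syncs += 1

    def write(self, storage_group, series, value):
        data_group = self.storage_groups[storage_group]  # local routing lookup
        dtype = self.series_meta[data_group][series]     # check in the data group
        if not isinstance(value, dtype):
            raise TypeError(f"{series} expects {dtype.__name__}")
        return data_group


cluster = ToyCluster()
cluster.create_timeseries("root.sg1", "dg1", "root.sg1.d1.temperature", float)
for i in range(500_000):
    cluster.write("root.sg1", "root.sg1.d1.temperature", float(i))
print(cluster.meta_group_syncs)  # 1
```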

5.3 Metadata distribution

According to the hierarchical management of metadata, metadata is divided into storage group metadata and time series metadata.

The storage group metadata is replicated on all nodes in the entire cluster and belongs to the MetaGroup group.

Time series metadata is only stored on the corresponding DataGroup, storing some time series attributes, field types, field descriptions and other information. The distribution method of time series metadata is the same as the data distribution method, which is generated through slot hash.

5.4 Time series data distribution

In the distributed implementation, time series data is partitioned by storage group using a hash ring and a lookup on the ring. Each node of the cluster is placed on the hash ring according to its hash value. For an incoming time series data point, the hash value of the storage group corresponding to its time series name is calculated and placed on the ring; searching clockwise from that position, the first node found is the node to insert into.

When a hash ring is used for data partitioning, the hash values of two nodes may be very close together, which leaves the data unevenly distributed. Virtual nodes are therefore introduced on top of the consistent hash ring: each physical node is virtualized into several virtual nodes, which are placed on the ring according to their hash values. This largely avoids data skew and distributes the data more evenly.
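A consistent hash ring with virtual nodes can be sketched as follows. This is a generic illustration, not the MRS IoTDB implementation: the MD5-based hash, the 100 virtual nodes per physical node and the node names are assumptions.

```python
import bisect
import hashlib


def stable_hash(key):
    """Stable 32-bit hash of a string (the choice of MD5 is an assumption)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node is placed on the ring `vnodes` times.
        self.ring = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    def lookup(self, key):
        # First virtual node clockwise from the key's hash, wrapping around.
        i = bisect.bisect_left(self.keys, stable_hash(key)) % len(self.ring)
        return self.ring[i][1]


ring = HashRing(["node1", "node2", "node3", "node4"])
counts = {}
for n in range(1000):
    owner = ring.lookup(f"root.sg{n}")
    counts[owner] = counts.get(owner, 0) + 1
print(counts)  # how 1000 hypothetical storage groups spread over the 4 nodes
```

With a single position per physical node, the share of each node depends on where four hash values happen to fall; with many virtual nodes per physical node the shares tend to even out.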

First, the entire cluster is preset with 10,000 slots, which are evenly distributed across the DataGroups. As shown in the figure below, the IoTDB cluster has 4 DataGroups and 10,000 slots, so each DataGroup has 10000/4=2500 slots on average. Since the number of DataGroups equals the number of cluster nodes (4), each node also holds 2,500 slots on average.
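The even split of slots over DataGroups can be sketched as below. Assigning contiguous slot ranges and the names dg1 to dg4 are assumptions for illustration; the text only states that the slots are evenly distributed.

```python
def assign_slots(total_slots, data_groups):
    """Spread slot ids as evenly as possible over the data groups."""
    base, extra = divmod(total_slots, len(data_groups))
    assignment, start = {}, 0
    for i, group in enumerate(data_groups):
        # The first `extra` groups take one more slot when the split is uneven.
        size = base + (1 if i < extra else 0)
        assignment[group] = range(start, start + size)
        start += size
    return assignment


slots = assign_slots(10000, ["dg1", "dg2", "dg3", "dg4"])
print({group: len(r) for group, r in slots.items()})
# {'dg1': 2500, 'dg2': 2500, 'dg3': 2500, 'dg4': 2500}
```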

Secondly, complete the mapping of slot to DataGroup, Time Partition and time series.

The IoTDB cluster is divided into multiple DataGroups according to the Raft protocol. Each DataGroup contains multiple slots, each slot contains multiple time partitions, and each time partition contains multiple time series. The composition relationship is shown in the following figure:

Finally, calculate the value of the slot through Hash to complete the mapping of the input storage group and timestamp to the slot:

1) First partition by time range to facilitate time range query:

TimePartitionNum = floor(TimeStamp / PartitionInterval)

Among them, TimePartitionNum is the ID of the time partition, TimeStamp is the timestamp of the measurement point to be inserted, and PartitionInterval is the time partition interval, which defaults to 7 days. Integer division is used so that all timestamps within the same interval map to the same partition ID.

2) Then partition by storage group, and calculate the slot value through Hash:

Slot = Hash(StorageGroupName + TimePartitionNum) % maxSlotNum

Among them, StorageGroupName is the name of the storage group, TimePartitionNum is the time partition ID calculated in step 1, and maxSlotNum is the maximum number of slots, and the default is 10000.
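The two steps can be sketched together as follows. This is a sketch under assumptions: the time partition ID is taken as the integer quotient of the timestamp by the partition interval, so that all points within one 7-day interval share a partition; timestamps are assumed to be in milliseconds; and CRC32 stands in for the unspecified Hash function.

```python
import zlib

PARTITION_INTERVAL_MS = 7 * 24 * 60 * 60 * 1000  # 7 days, the stated default
MAX_SLOT_NUM = 10000                             # the stated default


def time_partition(timestamp_ms, interval_ms=PARTITION_INTERVAL_MS):
    """Step 1: partition by time range."""
    return timestamp_ms // interval_ms


def slot_of(storage_group, timestamp_ms):
    """Step 2: hash the storage group name plus partition ID into a slot."""
    key = f"{storage_group}{time_partition(timestamp_ms)}"
    return zlib.crc32(key.encode("utf-8")) % MAX_SLOT_NUM


day_ms = 24 * 60 * 60 * 1000
print(time_partition(0), time_partition(6 * day_ms), time_partition(7 * day_ms))
# 0 0 1
print(slot_of("root.sg1", 0) == slot_of("root.sg1", 6 * day_ms))  # True
```

Points of one storage group that fall in the same time partition always map to the same slot, and therefore to the same DataGroup.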

The relationship between Data Group and Storage Group is shown in the following figure, where Data Group 1 on node 3 and node 1 shows the same Data Group distributed on two nodes:
