Abstract: This article analyzes the mainstream TSDBs built on current time series data models and the latest developments of cloud vendors in time series data models.
This article is shared from the Huawei Cloud Community article "[Dry Goods] OpenMetric and Time Series Database Storage Model Analysis (Part 2)" by agile Xiaozhi.
In the previous article, "[Dry Goods] OpenMetric and Time Series Database Storage Model Analysis (Part 1)", we covered the fundamentals of the time series data model. In this part, we analyze the current mainstream TSDBs and the latest developments of cloud vendors in time series data models.
Mainstream TSDB analysis
InfluxDB
InfluxDB [9] is a one-stop time series toolbox that includes everything a time series platform needs: a multi-tenant time series database, UI and dashboard tools, background processing, and monitoring data collectors, as shown in Figure 9 below.
Figure 9
InfluxDB supports a dynamic schema: no schema definition is required before data is written, so users can add measurements, tags, and fields at will, with any number of columns.
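As a concrete illustration of this schema-free model, the following minimal sketch composes InfluxDB line protocol strings by hand; the measurement, tag, and field names are hypothetical, and the format shown (measurement,tag_set field_set timestamp) follows InfluxDB's documented line protocol.

```python
# Minimal sketch: writing points in InfluxDB line protocol with no pre-declared schema.
# The measurement ("cpu") and the tag/field names are hypothetical.
import time

def to_line_protocol(measurement, tags, fields, ts_ns=None):
    """Render one point as: measurement,tag_set field_set timestamp."""
    tag_set = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_set = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in fields.items()
    )
    ts_ns = ts_ns or time.time_ns()  # nanosecond-resolution timestamp
    return f"{measurement},{tag_set} {field_set} {ts_ns}"

print(to_line_protocol("cpu", {"host": "server01", "region": "cn-north"},
                       {"usage_user": 23.5, "usage_system": 7.1}))
# Later points may freely add new tags or fields -- no ALTER TABLE is needed.
print(to_line_protocol("cpu", {"host": "server01", "core": "0"},
                       {"usage_user": 21.0, "temp": 56.0}))
```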
The underlying storage engine of InfluxDB has evolved from LevelDB to BoltDB and then to the self-developed TSM. Starting from v1.3, InfluxDB adopted a self-developed WAL + TSM File + TSI File solution, the so-called TSM (Time-Structured Merge Tree) engine. Its idea is similar to LSM, with some optimizations for the characteristics of time series data. The design goals of TSM were, first, to solve LevelDB's problem of too many open file handles and, second, to solve BoltDB's write performance problem.
Cache: the TSM Cache is similar to LSM's MemTable. It holds data that has been recorded in the WAL but not yet persisted to a TSM File. If the process fails over, the data in the cache is rebuilt from the WAL.
WAL (Write Ahead Log): after time series data enters memory, it is organized by SeriesKey. Data is written to the WAL first, then to the in-memory index and cache, and finally flushed to disk, ensuring data integrity and availability. The basic flow is: splice the series key from the Measurement and TagKV; check whether the series key already exists; if it exists, write the time series data directly to the WAL and the time series write buffer; if it does not exist, write a set of entries to the Index WAL, build an inverted index in memory from the elements of the series key, and then write the time series data to the WAL and cache.
TSM Files: a TSM File is similar to an LSM SSTable. At the file system level, each TSM File belongs to a Shard, with a maximum size of 2 GB per file. A TSM File contains a data area that stores the time series data (i.e., Timestamp + Field value) and an index area that stores the SeriesKey and Field Name information. A B+tree-like in-file index built on SeriesKey + FieldKey makes it possible to quickly locate the data block holding the time series data. Within a TSM File, the index blocks are sorted and organized by SeriesKey + FieldKey.
TSI Files: when users do not query by series key as expected, for example when they specify more complex query conditions, the usual technique is an inverted index to guarantee query performance. As the user's timeline count grows very large, the inverted index consumes too much memory; to address this, InfluxDB introduced TSI Files. The overall storage mechanism of a TSI File is similar to that of a TSM File, and a TSI File is likewise generated per Shard.
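As a rough illustration (not InfluxDB's actual implementation) of how the WAL, Cache, and TSM Files cooperate, the toy write path below appends to a WAL first, buffers points in a cache keyed by series key, and flushes them to an immutable, sorted file; all names are hypothetical.

```python
# Toy TSM-style write path: WAL first, then an in-memory cache keyed by series key,
# periodically flushed to an immutable, sorted file (a stand-in for a TSM file).
import json
import os
from collections import defaultdict

class ToyTSMShard:
    def __init__(self, directory, flush_threshold=4):
        os.makedirs(directory, exist_ok=True)
        self.directory = directory
        self.wal = open(os.path.join(directory, "wal.log"), "a")
        self.cache = defaultdict(list)      # (series key, field) -> [(ts, value), ...]
        self.flush_threshold = flush_threshold
        self.generation = 0

    def write(self, measurement, tags, field, ts, value):
        series_key = measurement + "," + ",".join(
            f"{k}={v}" for k, v in sorted(tags.items()))
        # 1) durability: append to the WAL before touching in-memory state
        self.wal.write(json.dumps(
            {"series": series_key, "field": field, "ts": ts, "value": value}) + "\n")
        self.wal.flush()
        # 2) buffer in the cache (the LSM MemTable analogue)
        self.cache[(series_key, field)].append((ts, value))
        if sum(len(v) for v in self.cache.values()) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Persist the cache as a sorted, immutable file and start a new generation."""
        path = os.path.join(self.directory, f"{self.generation:06d}.tsm.json")
        snapshot = {f"{s}#{f}": sorted(points)
                    for (s, f), points in self.cache.items()}
        with open(path, "w") as fp:
            json.dump(snapshot, fp)
        self.cache.clear()
        self.generation += 1
```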
In InfluxDB, the Database concept corresponds to a database in a traditional RDB. Logically, each Database can contain multiple measurements. In the stand-alone version, each Database maps to a file system directory.
As a time series database, InfluxDB must combine time series data storage with an inverted index over measurement, tag, and field metadata to provide fast multi-dimensional queries. InfluxDB scans the data in a TSM File in the following steps to return query results with high performance (a minimal sketch follows this list):
• Locate the index data block containing the SeriesKey + FieldKey in the index area, according to the timeline (SeriesKey) and FieldKey specified by the user.
• Search that index data block by the user-specified timestamp range to find the index entry or entries the data corresponds to.
• Load the time series data blocks referenced by those index entries into memory and scan them to obtain the result.
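The sketch below mimics these three steps on a hypothetical in-file index (it illustrates the lookup idea, not InfluxDB's internals): locate the index entries for a (SeriesKey, FieldKey), narrow them by the requested time range, and scan only the matching data blocks.

```python
# Illustrative three-step TSM lookup on hypothetical index and block data.
import bisect

# (series key, field key) -> sorted list of (min_ts, max_ts, block_offset) entries
index = {
    ("cpu,host=server01", "usage_user"): [(0, 99, 0), (100, 199, 4096), (200, 299, 8192)],
}
blocks = {0: [(5, 1.0), (42, 2.0)], 4096: [(120, 3.0)], 8192: [(250, 4.0)]}

def query(series_key, field_key, start, end):
    entries = index[(series_key, field_key)]                  # step 1: find the index entries
    lo = bisect.bisect_left([e[1] for e in entries], start)   # step 2: skip blocks ending before start
    result = []
    for min_ts, max_ts, offset in entries[lo:]:
        if min_ts > end:                                      # past the requested range
            break
        result.extend((ts, v) for ts, v in blocks[offset]     # step 3: scan the block
                      if start <= ts <= end)
    return result

print(query("cpu,host=server01", "usage_user", 40, 150))      # [(42, 2.0), (120, 3.0)]
```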
Like Prometheus, the InfluxDB data model uses key-value pairs as labels, called tags. InfluxDB supports timestamps with resolution down to nanoseconds, as well as the float64, int64, bool, and string data types. By comparison, Prometheus supports the float64 data type, has only limited support for strings, and uses millisecond-resolution timestamps. InfluxDB uses a variant of the log-structured merge tree with a write-ahead log, sharded by time, which is better suited to event logging than Prometheus's append-only-file-per-time-series approach.
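To make this data-model contrast tangible, here is the same hypothetical observation expressed in both formats (illustrative values only): an InfluxDB line protocol point can carry integer, float, boolean, and string fields with a nanosecond timestamp, while a Prometheus sample is a single float64 with labels and a millisecond timestamp.

```python
# InfluxDB line protocol: multiple typed fields per point, nanosecond timestamp.
influx_point = (
    'http_requests,host=server01,method=POST '
    'count=1027i,duration=0.23,ok=true,path="/login" '
    '1700000000000000000'
)

# Prometheus exposition format: one float64 sample; strings live in label values,
# and the (optional) timestamp is in milliseconds.
prometheus_sample = (
    'http_requests_total{host="server01",method="POST",path="/login"} 1027 1700000000000'
)
```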
Prometheus
The figure below is the official architecture diagram of Prometheus [10], including some ecosystem components, most of which are optional. At the core is the Prometheus server, which is responsible for scraping and storing time series data and for applying rules to that data, either aggregating it into new time series or generating alerts, which are then recorded and stored.
Figure 10
TSDB is the key kernel engine of Prometheus. The latest V3 engine [7] is essentially an LSM (Log Structured Merge tree) optimized for the TSDB scenario. The core idea of the LSM tree is to trade away part of the read capability in exchange for maximum write capability, on the premise that memory is large enough; in general, LSM trees suit systems where index insertion is more frequent than retrieval. The V3 storage engine also adopts the ideas of the Gorilla paper. It includes the following TSDB components: the Head Block, the write-ahead log (WAL) and its checkpoints, memory-mapped chunk headers on disk, persistent blocks with their indexes, and the query module. Figure 11 below shows the on-disk directory structure of Prometheus.
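Two core ideas from the Gorilla paper are delta-of-delta encoding of timestamps and XOR encoding of float values. The sketch below shows only the timestamp half, in a simplified form (real implementations emit variable-length bit prefixes rather than Python integers):

```python
# Simplified delta-of-delta timestamp encoding in the spirit of the Gorilla paper.
def encode_timestamps(timestamps):
    first = timestamps[0]
    out = [first]
    prev, prev_delta = first, 0
    for ts in timestamps[1:]:
        delta = ts - prev
        out.append(delta - prev_delta)   # delta-of-delta; mostly 0 for regular scrape intervals
        prev, prev_delta = ts, delta
    return out

def decode_timestamps(encoded):
    first, rest = encoded[0], encoded[1:]
    out = [first]
    prev, prev_delta = first, 0
    for dod in rest:
        delta = prev_delta + dod
        out.append(prev + delta)
        prev, prev_delta = prev + delta, delta
    return out

ts = [1700000000, 1700000015, 1700000030, 1700000045, 1700000061]
enc = encode_timestamps(ts)
print(enc)                               # [1700000000, 15, 0, 0, 1] -- mostly zeros compress well
assert decode_timestamps(enc) == ts
```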
Figure 11
In the V3 engine, a block contains many time series. The sample data in the chunks directory is grouped into one or more segment files, each of which does not exceed 512 MB by default. The index indexes the time series in this block's chunk directory by metric name and labels, so that a label can be used to quickly locate both the timeline and the chunk holding its data. meta.json is a simple description file for the block's data and status. chunks_head is responsible for the chunk index: the uint64 index value is composed of the offset within the file (the lower 4 bytes) and the segment sequence number (the upper 4 bytes).
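As a small illustration of the bit layout just described (not Prometheus source code), a 64-bit chunk reference can be packed and unpacked like this, with the segment sequence number in the upper 4 bytes and the in-file offset in the lower 4 bytes:

```python
# Pack / unpack a 64-bit chunk reference: segment sequence (upper 4 bytes) + offset (lower 4 bytes).
def make_chunk_ref(segment_seq: int, offset: int) -> int:
    return (segment_seq << 32) | (offset & 0xFFFFFFFF)

def split_chunk_ref(ref: int):
    return ref >> 32, ref & 0xFFFFFFFF   # (segment sequence, in-file offset)

ref = make_chunk_ref(3, 0x0001A2B0)
print(hex(ref))                          # 0x30001a2b0
assert split_chunk_ref(ref) == (3, 0x0001A2B0)
```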
Prometheus divides data into multiple blocks along the time dimension, and only the most recent block can receive new data. Data written to the latest block first goes into an in-memory structure. To ensure that data is not lost, it is first appended to a write-ahead log (WAL), stored in the directory as segments (128 MB each). WAL segments hold uncompressed raw data, so they are significantly larger than regular block files. Prometheus retains three or more WAL files so as to keep at least two hours of raw data.
The V3 engine uses 2 hours as the default block duration, i.e., blocks are cut into 2-hour spans (an empirical value). Like LSM, V3 uses compaction to merge small blocks into large blocks for query optimization. For the latest block that is still receiving writes, the V3 engine holds all of its indexes in memory as an in-memory structure and persists them to files only when the block is closed. Queries over this hot in-memory data are therefore very efficient.
Prometheus officials have repeatedly emphasized that its local storage is not intended as durable long-term storage; external solutions provide extended retention and data durability. The community has tried a variety of integrations to solve this problem, such as Cassandra and DynamoDB.
Achieving application observability through metrics is the first step in an IT monitoring and operations system. Metrics provide an aggregated view which, combined with the per-request or per-event detail provided by logs, makes it easier to find and diagnose problems.
Prometheus servers operate independently of each other, relying only on local storage for their core functions: scraping, rule processing, and alerting. In other words, Prometheus is not designed for distributed clusters; put differently, its current distributed clustering capability is not strong enough. Community open source projects such as Cortex and Thanos are successful solutions that emerged to address this shortcoming of Prometheus.
Druid
Druid [11] is a well-known real-time OLAP analysis engine. Druid's architecture is relatively simple (Figure 12 below): the cluster has three types of nodes: Master nodes, Query nodes, and Data nodes.
Figure 12
Druid data is stored in a datasource, similar to a table in a traditional RDBMS. Each datasource is partitioned by time (and optionally by other attributes). Each time range is called a "chunk" (for example, one day, if the datasource is partitioned by day). Within a chunk, data is divided into one or more "segments". Each segment is a single file, usually containing up to a few million rows, as shown in Figure 13 below.
Figure 13
The purpose of a segment is to produce a data file that is compact and supports fast queries. Segments are generated on the real-time MiddleManager nodes, where they are still mutable and uncommitted; at this stage the work mainly consists of columnar storage, bitmap indexing, and compression with various algorithms. These segments (hot data) are periodically committed and published, and then written to deep storage (which can be local disk, AWS S3, HUAWEI CLOUD OBS, etc.). Like HBase, Druid uses an LSM-style structure: data is written to memory first and then flushed to data files. Druid's encoding is local, at the file level, which effectively relieves the huge memory pressure of large data sets. These segments are, on the one hand, removed by the MiddleManager node and, on the other hand, loaded by the Historical nodes; at the same time, entries for these segments are written into the metadata store. A segment's self-describing metadata includes its schema, its size, and its location in deep storage, and is used by the Coordinator for query routing.
Druid stores its index in segment files partitioned by time. The recommended segment file size is 300 MB to 700 MB. The internal structure of a segment file is essentially columnar: each column's data is laid out in a separate data structure. By storing each column separately, Druid can reduce query latency by scanning only the columns a query actually needs. There are three basic column types: the timestamp column, dimension columns, and metric columns, as shown in Figure 14 below:
Figure 14
Dimension columns need to support filtering and grouping operations, so each dimension requires the following three data structures (a toy construction follows this list):
1) A dictionary that maps values (always treated as strings) to integer IDs;
2) A list of the column's values, encoded using the dictionary in 1) [used for group-by and TopN queries];
3) For each distinct value in the column, a bitmap indicating which rows contain that value (essentially an inverted index) [used for fast filtering and convenient AND and OR operations].
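The toy construction below builds all three structures for a hypothetical "city" dimension column and shows how a filter becomes a cheap bitwise operation (an illustration of the idea, not Druid's actual segment code):

```python
# Toy build of the three per-dimension structures for a hypothetical "city" column.
from collections import defaultdict

rows = ["beijing", "beijing", "shenzhen", "shanghai", "shenzhen"]

# 1) dictionary: value -> integer id
dictionary = {}
for v in rows:
    dictionary.setdefault(v, len(dictionary))   # {"beijing": 0, "shenzhen": 1, "shanghai": 2}

# 2) dictionary-encoded column values (used for group-by / TopN)
encoded = [dictionary[v] for v in rows]         # [0, 0, 1, 2, 1]

# 3) one bitmap per distinct value: which rows contain it (an inverted index)
bitmaps = defaultdict(int)
for row_id, v in enumerate(rows):
    bitmaps[v] |= 1 << row_id                   # bit i set => row i matches

# Filters become bitwise operations, e.g. city == "beijing" OR city == "shanghai":
matching = bitmaps["beijing"] | bitmaps["shanghai"]
print(bin(matching))                            # 0b1011 -> rows 0, 1 and 3
```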
Each column in Druid consists of two parts: a Jackson-serialized ColumnDescriptor and the remaining binary data of the column. Druid strongly recommends (and defaults to) LZ4 for compressing the value blocks of string, long, float, and double columns, and Roaring for compressing the bitmaps of string columns and numeric null-value bitmaps. In particular, the Roaring compression algorithm is much faster than the CONCISE algorithm for filters that match a large number of values in a high-cardinality column.
It is worth mentioning that Druid supports the Kafka Indexing Service extension for real-time ingestion tasks, so a segment can be queried immediately, even before it has been published. This better satisfies real-time requirements from data generation to queryable, aggregated analysis.
Another important Druid feature is rollup: when data is written, rollup can be enabled to aggregate over the selected dimensions at a minimum time-interval granularity that you specify (such as 1 minute or 5 minutes). This can greatly reduce the volume of data that needs to be stored; the drawback is that the raw rows are discarded, so detailed per-event queries are no longer possible.
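Here is a toy example of what rollup does at ingestion time, assuming a 1-minute granularity and a single hypothetical dimension (domain): raw events are collapsed into per-minute aggregates, and the individual events are no longer stored.

```python
# Toy rollup: aggregate raw events to 1-minute granularity on selected dimensions,
# keeping only sums and counts (the raw rows are not queryable afterwards).
from collections import defaultdict

raw = [
    # (timestamp_sec, domain, bytes)
    (1700000005, "a.com", 120),
    (1700000020, "a.com", 80),
    (1700000007, "b.com", 50),
    (1700000070, "a.com", 30),
]

rolled = defaultdict(lambda: {"bytes_sum": 0, "count": 0})
for ts, domain, nbytes in raw:
    minute = ts - ts % 60                 # truncate to the 1-minute bucket
    rolled[(minute, domain)]["bytes_sum"] += nbytes
    rolled[(minute, domain)]["count"] += 1

for (minute, domain), agg in sorted(rolled.items()):
    print(minute, domain, agg)
# Four raw rows collapse into three stored rows; per-event detail is gone.
```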
To make queries more efficient, Druid makes the following design choices (a simplified pruning sketch follows this list).
• The Broker prunes which segments each query accesses: this is an important way Druid limits the amount of data each query must scan. A query first reaches the Broker, which identifies the segments that may hold data relevant to the query. The Broker then identifies which Historical and MiddleManager processes are serving those segments and sends each of them a rewritten subquery. The Historical/MiddleManager processes execute the subqueries and return their results; the Broker merges them into the final answer and returns it to the original caller.
• Index filtering within segments: the index structures inside each segment allow Druid to determine which rows (if any) match the filter set before touching any row data.
• Reading only the relevant rows and columns in a segment: once Druid knows which rows match a query, it accesses only the columns that query needs. Within those columns, Druid can skip from row to row, avoiding data that does not match the query filter.
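The following simplified sketch shows the first of these ideas, broker-side segment pruning by time interval (illustrative only; real Druid consults a timeline of segment metadata plus the per-segment indexes described above):

```python
# Simplified sketch of broker-side segment pruning by time interval.
segments = [
    # (segment_id, interval_start_hour, interval_end_hour, served_by)
    ("wiki_2024-01-01", 0, 24, "historical-1"),
    ("wiki_2024-01-02", 24, 48, "historical-2"),
    ("wiki_2024-01-03", 48, 72, "middlemanager-1"),   # still real-time, not yet published
]

def prune(query_start, query_end):
    """Return only the segments whose interval overlaps the query interval."""
    return [s for s in segments if s[1] < query_end and s[2] > query_start]

# A query over hours [30, 60) is sent only to historical-2 and middlemanager-1;
# the broker then merges their partial results into the final answer.
for seg_id, _, _, node in prune(30, 60):
    print(f"subquery for {seg_id} -> {node}")
```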
The timestamp is also mandatory in the Druid data model. Although Druid is not a time series database, it is a natural choice for storing time series data, and its data model can hold time series and non-time-series data in the same datasource. Accordingly, Druid does not treat data points as part of a "time series"; it ingests and aggregates each point individually. For example, the time series interpolation supported by orthodox TSDBs is unnecessary in Druid. This brings great convenience to some business scenarios.
IoTDB
Apache IoTDB [12] started at the School of Software, Tsinghua University, and graduated from the Apache Incubator in September 2020. IoTDB is a database for managing large volumes of time series data. It uses columnar storage, data encoding, pre-computation, and indexing techniques, offers a SQL-like interface, supports writing millions of data points per second per node, and can return query results over trillions of data points within seconds. It mainly targets industrial IoT scenarios.
The IoTDB suite consists of several components that together provide data collection, data ingestion, data storage, data query, data visualization, and data analysis, as shown in Figure 15 below:
Figure 15
IoTDB proper refers to the time series database engine. Its design is centered on devices and sensors, and the concept of a storage group is added to make time series data easier to manage and use.
Storage Group: a concept proposed by IoTDB, similar to a Database in a relational database. All the entities in a storage group have their data stored in the same folder, while entities in different storage groups are stored in different folders on disk, achieving physical isolation. Internally, a storage group is IoTDB's unit of concurrency control and disk isolation, and multiple storage groups can be read and written in parallel. For users, it makes grouping, managing, and using device data convenient.
Device: corresponds to a concrete physical device in the real world, such as an aircraft engine. In IoTDB, a device is the unit of a single time series write; one write request is limited to one device.
Sensor: corresponds to a sensor carried by a physical device in the real world, such as the sensors on a wind turbine that collect wind speed, steering angle, and generated power. In IoTDB, a sensor is also called a Measurement.
Measurement (also known as a working condition or field): one or more physical quantities, i.e., values collected by a sensor at a point in time in an actual scene, stored in IoTDB as a column of <time, value> pairs. All data and paths stored in IoTDB are organized in units of measurements. A measurement can also contain multiple components (SubMeasurement); for example, GPS is a multi-component quantity containing longitude, latitude, and altitude. The components are usually collected at the same moment and share the time column.
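To tie these concepts together, here is a small sketch of how the hierarchy maps to paths and to a per-device write request; the storage group, device, and measurement names are hypothetical, and the plain dictionaries stand in for whatever client API is actually used.

```python
# Toy illustration of the IoTDB path model: storage group -> device -> measurement,
# with one write request per device (hypothetical names throughout).
STORAGE_GROUP = "root.wind_farm1"          # unit of isolation and parallelism
device = f"{STORAGE_GROUP}.turbine07"      # a concrete physical device

# One <time, value> per measurement; the measurements share the same timestamp.
samples = {
    "wind_speed": 7.4,
    "yaw_angle": 12.0,
    "power_kw": 1530.5,
}
timestamp_ms = 1700000000000

# A write request targets a single device and carries its measurements together.
write_request = {
    "device": device,
    "timestamp": timestamp_ms,
    "measurements": list(samples.keys()),
    "values": list(samples.values()),
}

# Full time series paths are organized per measurement:
paths = [f"{device}.{m}" for m in samples]  # e.g. root.wind_farm1.turbine07.wind_speed
print(write_request)
print(paths)
```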
IoTDB storage is organized into storage groups, each of which is a unit of concurrency control and resource isolation and contains multiple Time Partitions. Each storage group corresponds to a WAL (write-ahead log) file and TsFile time series data files. The time series data in each Time Partition is first written into a Memtable and recorded in the WAL at the same time, then periodically and asynchronously flushed to a TsFile. This is the so-called tLSM time series processing algorithm.
In terms of ingestion performance: IoTDB has the lowest write latency, and the larger the batch size, the higher its write throughput, which indicates that IoTDB is best suited to batched writes. Under high concurrency, IoTDB can also keep throughput growing steadily (until constrained by the network card and network bandwidth).
In terms of aggregate query performance: for raw data queries, IoTDB's advantages become apparent as the query range grows, because the larger data-block granularity brings out the benefits of columnar storage, with column-based compression and column iterators both speeding up queries. For aggregate queries, IoTDB uses file-level statistics and caches them, so repeated queries only require in-memory computation, and the aggregation performance advantage is obvious.
Data storage comparison
Based on the preceding analysis, the following table compares the characteristics of these time series data processing systems.
Table 3
For time series data processing, the key capabilities are the data model definition, the storage engine, a query engine that works closely with the storage, and an architecture that supports partitioned scale-out. The mainstream TSDBs are basically implemented on LSM, or on LSM trees specially optimized for time series scenarios (including InfluxDB's so-called TSM and IoTDB's tLSM, which are essentially LSM mechanisms). Among them, only IoTDB natively uses a tree schema to model time series data. In pursuit of the ultimate performance at the lowest cost, all of them keep improving and optimizing their storage structure designs, indexing mechanisms, and query efficiency for massive data and their usage scenarios. At the level of individual or key technologies, there is a general trend toward convergence and homogeneity.
The latest developments of cloud vendors
In addition to the open source community, many cloud service vendors at home and abroad have released time series database products or services.
HUAWEI CLOUD
Huawei Cloud's GaussDB(for Influx) [13] service is deeply optimized and re-engineered on the basis of InfluxDB, with technical innovations in architecture, performance, and data compression that have achieved good results. It adopts a storage-compute separation architecture built on Huawei Cloud's self-developed high-performance distributed storage system, which significantly improves the reliability of the time series database; it also enables minute-level scaling of compute nodes and second-level expansion of storage space, while greatly reducing storage costs. It supports hundreds of millions of timelines (the open source version handles tens of millions) with essentially stable write performance, delivers higher performance for highly dispersed aggregation queries, and, compared with native InfluxDB, its compression has been specifically optimized for the Float, String, and Timestamp data types.
The Huawei Cloud MRS service includes IoTDB [14], which is positioned as a time series database for industrial devices and industrial sites. The optimized IoTDB delivers better performance: tens of millions of data points written per second and millisecond-level queries over terabyte-scale data. Its optimized data compression ratio can reach a hundredfold, further saving storage space and cost. Through a peer-to-peer distributed architecture, a double-layer Raft protocol, and synchronous active-active edge and cloud nodes, it achieves 24/7 high availability. In industrial scenarios, it makes it truly possible to have one copy of time series data serving all scenarios, one time series engine spanning cloud, edge, and device, and one framework integrating cloud, edge, and device.
Alibaba Cloud
According to public information [15], the development of Alibaba Cloud's time series and spatio-temporal database TSDB has gone through three stages. The v1.0 stage was based on OpenTSDB, with dual underlying engines: HBase and HiStore. In the v2.0 stage, the OpenTSDB engine was replaced with a self-developed TSDB engine, which added the inverted index that OpenTSDB lacks, encodings specialized for time series scenarios, and distributed stream-computing aggregation functions. Subsequently, cloud-edge integration was realized, and TSQL became compatible with the Prometheus ecosystem. Among these offerings, TSDB for Spatial Temporal supports spatio-temporal data, based on a self-developed S3 spatio-temporal index and a high-performance electronic fence. The latest TSDB is also based on Gorilla, reducing the average storage per data point to 1 to 2 bytes, which cuts storage space usage by 90% while also speeding up writes. Reads of millions of data points respond in under 5 seconds, and writes can reach tens of millions of data points per second. Compared with open source OpenTSDB and InfluxDB, read/write efficiency is several times higher, and it remains compatible with the OpenTSDB data access protocol.
Tencent Cloud
Tencent Cloud has launched the TencentDB for CTSDB [16] service, a distributed, scalable time series database that supports near-real-time data search and analysis and is compatible with the commonly used Elasticsearch APIs and ecosystem. It supports more than 20 core businesses inside Tencent. In terms of performance, it can write tens of millions of data points per second and analyze hundreds of millions of records in seconds. CTSDB also uses the LSM mechanism, writing to memory first and flushing to storage periodically, and then uses an inverted index to accelerate queries on any dimension, so data becomes queryable within seconds. It supports common aggregation functions such as histogram, percentile, and cardinality, and Rollup tasks can be configured to periodically aggregate historical data into a new table, providing downsampling. When a cluster exceeds 30 nodes, a new cluster must be purchased, or the general cluster architecture upgraded to a hybrid-node cluster architecture, to keep a very large multi-node cluster performing stably. Judging from these characteristics, the CTSDB kernel should be a time series database built on deep optimization experience with the Elasticsearch kernel.
Other domestic manufacturers
In addition to the out-of-the-box cloud services provided by cloud vendors, some innovative products have emerged; the better known ones include TDengine [17], ByteDance's TerarkDB [18], and DolphinDB [19]. They are also developing rapidly and are worth continued attention, especially the TSDB products incubated in China.
Summary and outlook
In addition to the community versions of InfluxDB, IoTDB, and OpenTSDB, some cloud vendors also provide native InfluxDB (Alibaba Cloud TSDB for InfluxDB), IoTDB (Huawei Cloud MRS IoTDB), or OpenTSDB (Huawei Cloud MRS OpenTSDB) as managed services for easy use. The more mainstream approach is for each cloud vendor to borrow from, optimize, or even redevelop the time series database kernel based on its own technical accumulation and R&D capabilities, providing stronger clustering, higher write performance, and faster query and aggregation analysis. Abroad, AWS offers Timestream [20], a serverless time series database service; domestically there are Huawei Cloud's GaussDB(for Influx), a new-generation spatio-temporal analysis database built by a team of top database experts; Tencent Cloud's TencentDB for CTSDB, compatible with the ES ecosystem; Alibaba Cloud's HiTSDB; and others. These out-of-the-box, scalable, highly available time series databases are good news for the development and deployment of cloud-native applications: there is no need to manage the underlying infrastructure, only to focus on building the business.
In the development of Prometheus, long-term storage of historical data that must be preserved has been one of its shortcomings. The industry has some compromise integration solutions, such as using Cassandra as Prometheus's persistent storage, or using InfluxDB as its persistent storage. On the one hand, this makes full use of Prometheus's monitoring capabilities and community ecosystem (including Cortex, which supports distributed clusters); on the other hand, it leverages the strengths of the InfluxDB time series database, especially its distributed capability at and beyond the PB scale, to make up for Prometheus's shortcomings in storing massive historical data.
Apache Druid is highly competitive in the field of real-time OLAP analysis and has been adopted by many large companies; the industry's largest cluster has more than 4,000 nodes. Whether the data is time series metrics, business data, or application logs, Druid's Kafka Indexing Service and powerful data preprocessing capabilities can be used to bring it into Druid as time series data. Druid SQL is also developing rapidly, with cross-table joins, subqueries, and a continuously growing set of functions and operators.
Whether it is an orthodox time series database or an OLAP analysis system suited to time series data, and whether it is a popular open source project or a more powerful cloud-native time series database from a cloud vendor, they all provide diverse options for storing, retrieving, and analyzing all kinds of time series data (including metrics and business data). Combined with their own business scenarios, users should be able to find reasonably suitable tools or services to meet their business needs.
References
[9] https://www.influxdata.com/products/influxdb/
[10] https://prometheus.io/docs/introduction/overview/
[11] https://druid.apache.org/docs/latest/design/architecture.html
[12] https://iotdb.apache.org/SystemDesign/Architecture/Architecture.html
[13] https://bbs.huaweicloud.com/blogs/252115
[14] https://bbs.huaweicloud.com/blogs/280948
[15] https://developer.aliyun.com/article/692580
[16] https://cloud.tencent.com/developer/article/1010045
[17] https://www.taosdata.com/cn/
[18] https://github.com/bytedance/terarkdb
[19] https://www.dolphindb.cn/
[20] https://aws.amazon.com/cn/timestream