This article was first published on Jianshu: https://www.jianshu.com/u/204b8aaab8ba
| Version | Date | Remark |
| --- | --- | --- |
| 1.0 | 2021.10.19 | First article |
| 1.0 | 2021.11.21 | Added content from an internal company sharing |

0. Background

The title comes from InfluxDB's write-up on why they built their own storage engine:

The workload of time series data is quite different from normal database workloads. There are a number of factors that conspire to make it very difficult to get it to scale and perform well:
- Billions of individual data points
- High write throughput
- High read throughput
- Large deletes to free up disk space
- Mostly an insert/append workload, very few updates

The first and most obvious problem is one of scale. In DevOps, for instance, you can collect hundreds of millions or billions of unique data points every day.

To prove out the numbers, let’s say we have 200 VMs or servers running, with each server collecting an average of 100 measurements every 10 seconds. Given there are 86,400 seconds in a day, a single measurement will generate 8,640 points in a day, per server. That gives us a total of 200 * 100 * 8,640 = 172,800,000 individual data points per day. We find similar or larger numbers in sensor data use cases.
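The back-of-the-envelope math quoted above can be checked in a few lines:

```python
# Verify the data-point arithmetic: 200 servers, 100 measurements each,
# one sample every 10 seconds.
servers = 200
measurements_per_server = 100
interval_s = 10
seconds_per_day = 86_400

points_per_measurement_per_day = seconds_per_day // interval_s
total_points_per_day = servers * measurements_per_server * points_per_measurement_per_day

print(points_per_measurement_per_day)  # 8640
print(total_points_per_day)            # 172800000
```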

Recently I was responsible for the monitoring part of our product. I assumed a time series database would have demanding RT and IOPS requirements, so I wanted to see how one is implemented internally, and whether it resembles the Kafka and HBase I already know.

Let's start with a quick primer. A time series database stores data that changes over time and is indexed by time (a point or an interval). It was first applied to data collected by real-time monitoring, inspection, and analysis equipment in industrial settings (the power and chemical industries). Typical characteristics of such data: it is generated at high frequency (a single monitoring point can produce multiple data points per second); it is heavily tied to collection time (each data point requires a unique timestamp); and there are many measurement points producing large volumes of information (a conventional real-time monitoring system can have thousands of monitoring points, each generating data every second). The data is a historical record: immutable, unique, and ordered in time. Time series data also tends to have a simple structure and a large volume.

1. Problem

Anyone who has used a time series database knows the pattern: data is usually append-only, rarely deleted (or deletion is not allowed at all), and queries typically scan contiguous ranges. For example:

  • On a monitoring dashboard we usually look at data for a recent window, then zoom into a narrower time period when we need more detail.
  • The time series database pushes the metrics that the alerting system cares about.

1.1 Pitfalls Prometheus ran into

Let's briefly review the data model in Prometheus. It is essentially a key-value pair: the key (usually called a Series) consists of the MetricName, Labels, and Timestamp, and the value is the sample value.
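A minimal sketch of that key-value model (illustrative only, not Prometheus' actual code):

```python
# A Series key is the metric name plus its label set; labels are sorted so
# that the same label set always yields the same key. A sample is then
# (series_key, timestamp) -> value.
def series_key(metric_name: str, labels: dict) -> str:
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric_name}{{{label_str}}}"

store = {}  # (series_key, timestamp) -> value

def append(metric, labels, ts, value):
    store[(series_key(metric, labels), ts)] = value

append("http_requests_total", {"method": "GET", "code": "200"}, 1634600000, 1027.0)
print(series_key("http_requests_total", {"method": "GET", "code": "200"}))
# http_requests_total{code="200",method="GET"}
```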

In the early design, samples of the same Series were grouped together according to certain rules, and files were organized by time. The result is effectively a matrix:

The advantage is that both writes and reads can be parallelized (whether filtering by condition or by time range). But the drawbacks are just as obvious: a query becomes a scan across the matrix, and this layout easily triggers random reads and writes, which is painful on both HDDs and SSDs (interested readers can see section 3.2 below).

So Prometheus shipped an improved storage version: one file per Series, with each Series buffering 1KB of data in memory before flushing to disk.
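A simplified sketch of that "buffer 1KB per Series, then flush" scheme (the `SeriesWriter` name and in-memory "file" are illustrative; the real implementation differs):

```python
# Each Series accumulates samples in a small buffer; once 1KB is reached,
# the whole chunk is written out sequentially instead of issuing many tiny
# random writes.
CHUNK_SIZE = 1024

class SeriesWriter:
    def __init__(self):
        self.buffer = bytearray()
        self.flushed = []  # stands in for data written to the series' file

    def append(self, sample: bytes):
        self.buffer.extend(sample)
        if len(self.buffer) >= CHUNK_SIZE:
            self.flushed.append(bytes(self.buffer))  # one sequential write
            self.buffer.clear()

w = SeriesWriter()
for _ in range(100):
    w.append(b"\x00" * 16)  # 100 samples of 16 bytes = 1600 bytes total

print(len(w.flushed))   # 1  (one full chunk flushed at 1024 bytes)
print(len(w.buffer))    # 576 (remainder still in memory -- lost on a crash)
```

The tail of the buffer illustrates problem 1 below: whatever has not reached 1KB lives only in memory.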

This alleviates the random read/write problem, but introduces new ones:

  1. If the machine crashes while data is still in memory (i.e. the buffer has not reached 1KB), that data is lost
  2. The number of Series can easily explode, leading to high memory usage
  3. Following from the above, when all those buffers flush at once, the disk becomes very busy
  4. Likewise, many files stay open, consuming file descriptors (FDs)
  5. When an application has not reported data for a long time, should its in-memory data be flushed? There is no good way to decide.

1.2 Pitfalls InfluxDB ran into

1.2.1 LevelDB based on LSM Tree

An LSM Tree's write performance is much better than its read performance. However, InfluxDB exposes a delete API, and once a delete happens things get messy: a tombstone record is inserted; queries must merge their result sets against the tombstones; and only later does a compaction run to actually remove the underlying data. On top of that, InfluxDB supports TTLs, which means data is deleted by range.
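A toy illustration of why LSM-style deletes are cheap to issue but expensive to live with (a sketch of the technique, not InfluxDB's actual code):

```python
# A delete only records a tombstone. Reads must filter against tombstones,
# and the underlying data survives until compaction runs.
data = {}         # key -> value (the sorted runs, flattened for simplicity)
tombstones = set()

def put(key, value):
    data[key] = value

def delete(key):
    tombstones.add(key)   # cheap: no stored data is touched yet

def get(key):
    if key in tombstones:
        return None       # every read pays for the merge with tombstones
    return data.get(key)

def compact():
    # Only now is the underlying data actually removed.
    for key in tombstones:
        data.pop(key, None)
    tombstones.clear()

put("cpu,host=a 1634600000", 0.42)
delete("cpu,host=a 1634600000")
print(get("cpu,host=a 1634600000"))  # None -- but the bytes are still on "disk"
compact()                            # data is physically removed here
```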

To avoid these slow deletes, InfluxDB adopted a sharding design: different time ranges go into different LevelDB instances, so expiring data is just a matter of closing a database and deleting its files. However, with large data volumes this leads to too many open file handles.
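The idea can be sketched as follows (each dict stands in for one LevelDB instance; the shard-span constant and function names are illustrative):

```python
# Time-based sharding: each time range maps to its own store, so enforcing
# TTL means dropping whole shards instead of range-deleting individual keys.
SHARD_SPAN = 3600  # one shard per hour

shards = {}  # shard_start_ts -> dict standing in for one LevelDB instance

def write(ts, key, value):
    start = ts - ts % SHARD_SPAN
    shards.setdefault(start, {})[key] = value

def enforce_ttl(now, ttl):
    # Drop every shard whose entire range lies past the TTL cutoff:
    # O(number of shards), no per-key tombstones.
    cutoff = now - ttl
    for start in [s for s in shards if s + SHARD_SPAN <= cutoff]:
        del shards[start]

write(1000, "cpu", 0.1)   # lands in shard [0, 3600)
write(8000, "cpu", 0.2)   # lands in shard [7200, 10800)
enforce_ttl(now=10000, ttl=4000)  # cutoff=6000: the [0, 3600) shard is dropped
print(sorted(shards))             # [7200]
```

The trade-off the text mentions follows directly: every live shard is an open database with its own files, so many shards means many file handles.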

1.2.2 BoltDB based on mmap B+Tree

BoltDB stores everything in a single file, and its mmap-based B+Tree performs reasonably well at runtime. But as write volume grows, things get troublesome: how do you absorb hundreds of thousands of Series being written at once?

To mitigate this, InfluxDB introduced a WAL, which effectively alleviates random writes: adjacent writes are accumulated in a buffer and flushed together, much like MySQL's Buffer Pool. However, this does not solve the decline in write throughput; it only delays the problem.
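The mechanism can be sketched like this (class and field names are hypothetical; the real WAL format and flush policy are more involved):

```python
# Every write is appended to the log (sequential and durable), accumulated
# in memory, then flushed to the main store in batches -- similar in spirit
# to MySQL's Buffer Pool.
import os
import tempfile

class WalBuffer:
    def __init__(self, wal_path, flush_threshold=4):
        self.wal = open(wal_path, "ab")
        self.buffer = []
        self.flush_threshold = flush_threshold
        self.store = {}  # stands in for the B+Tree / main data file

    def write(self, key, value):
        record = f"{key}={value}\n".encode()
        self.wal.write(record)          # 1. sequential append for durability
        self.wal.flush()
        os.fsync(self.wal.fileno())
        self.buffer.append((key, value))
        if len(self.buffer) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # 2. one batched write into the main structure instead of many
        #    scattered random writes
        for k, v in self.buffer:
            self.store[k] = v
        self.buffer.clear()

path = os.path.join(tempfile.mkdtemp(), "wal.log")
db = WalBuffer(path)
for i in range(5):
    db.write(f"series-{i}", i)
print(len(db.store), len(db.buffer))  # 4 1
```

Note that the batching only amortizes the cost of hitting the B+Tree; the total volume of writes is unchanged, which is why the text says the throughput problem is delayed rather than solved.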

2. Solution

If you think about it carefully, the hot data in a time series database is only the most recent data. Moreover, the workload is write-heavy and read-light, with almost no deletes or updates, and data is only appended sequentially. We can therefore adopt very aggressive storage, access, and retention policies (Retention Policies).

2.1 Key data structures

  • A variant of the Log Structured Merge Tree (LSM-Tree) replaces the B+Tree used by traditional relational databases as the storage structure. LSM suits workloads with sequential writes and almost no deletes, and the typical implementation uses time as the key. InfluxDB calls its variant the Time Structured Merge Tree.
  • There is an even more extreme (though not rare) form of time series database, the Round Robin Database (RRD), built on the idea of a ring buffer. It stores only a fixed amount of the most recent data; expired or over-capacity data is overwritten in rotation. The database therefore has a fixed size, yet it can accept an unlimited stream of input.
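
The ring-buffer idea behind an RRD can be sketched in a few lines (a minimal illustration, not any real RRD implementation such as rrdtool):

```python
# Fixed capacity; the newest data overwrites the oldest, so storage never
# grows no matter how many points are written.
class RoundRobinDB:
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.next = 0

    def append(self, value):
        self.slots[self.next % len(self.slots)] = value  # rotate and overwrite
        self.next += 1

    def latest(self):
        # Return the retained values, oldest to newest.
        n = len(self.slots)
        start = max(0, self.next - n)
        return [self.slots[i % n] for i in range(start, self.next)]

rrd = RoundRobinDB(5)
for v in range(8):        # write 8 points into 5 slots
    rrd.append(v)
print(rrd.latest())       # [3, 4, 5, 6, 7] -- the 3 oldest were overwritten
```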

2.2 Key strategies

  • WAL (Write-Ahead Log): as in many data-intensive applications, the WAL guarantees durability and mitigates random writes. In a time series database it also serves queries: when a request arrives, the storage engine merges data from the WAL with data already on disk. The WAL is typically compressed with Snappy, a low-overhead compression algorithm.
  • Aggressive retention policies, such as automatically deleting data past its expiration time (TTL), which both frees space and improves query performance. For an ordinary database, automatically deleting data after a period of time would be unthinkable.
  • Resampling data to save space. For example, data from the last few days may need second-level precision, while cold data from a month ago only needs day-level precision, and data from a year ago only week-level precision. Summarizing data by resampling saves a great deal of storage.
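
The resampling strategy above can be sketched as a simple bucketed average (the bucket sizes per age tier are the policy described above; the function name is illustrative):

```python
# Collapse fine-grained samples into one averaged point per bucket,
# e.g. second-level raw data rolled up into 5-second buckets.
from collections import defaultdict

def downsample(points, bucket_s):
    """points: list of (timestamp, value); returns {bucket_start: average}."""
    buckets = defaultdict(list)
    for ts, v in points:
        buckets[ts - ts % bucket_s].append(v)
    return {b: sum(vs) / len(vs) for b, vs in sorted(buckets.items())}

# 1 sample per second for 10 seconds, downsampled to 5-second buckets:
raw = [(t, float(t)) for t in range(10)]
print(downsample(raw, bucket_s=5))  # {0: 2.0, 5: 7.0}
```

Ten points become two, and the same rollup with day- or week-sized buckets is what shrinks cold data by orders of magnitude.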

3. Summary

Overall, compared with Kafka and HBase, the internals of a time series database are anything but simple, and they are well worth studying.

3.1 Reference links

3.2 Disk random read and write vs sequential read and write

3.2.1 HDD

The fundamental reason HDDs are weak at random I/O lies in their physical structure. When we issue an addressing request to the disk (whether to read data from a region or to locate a region for writing), the first bottleneck is the spindle's rotation speed, followed by the movement of the head arm.

Looking at today's numbers, an HDD's random read and write speeds are roughly 2MB/s and 2.2MB/s, while its sequential read and write speeds are roughly 200MB/s and 220MB/s.

3.2.2 SSD

An SSD looks great on paper: random read and write speeds are typically around 400MB/s and 360MB/s, and sequential read and write speeds around 560MB/s and 550MB/s.

But the real problem lies in its internal structure. The most basic physical unit is the flash cell; multiple cells form a page, and multiple pages form a block.

Writes happen at page granularity (4KB in the figure), which means even writing 1 byte occupies a full 4KB page. And that is not the worst part: deletion happens at the granularity of an entire block (512KB in the figure), so erasing even 1KB of data triggers write amplification, because the rest of the block must be copied out and rewritten.
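A back-of-the-envelope calculation, using the 4KB page and 512KB block sizes from the figure:

```python
# Write amplification arithmetic: writing 1 byte still consumes a whole 4KB
# page, and modifying 1KB inside a 512KB block forces the rest of the block
# to be copied and rewritten after the block erase.
PAGE = 4 * 1024
BLOCK = 512 * 1024

def pages_consumed(write_bytes):
    return -(-write_bytes // PAGE)  # ceiling division

def rewrite_amplification(modified_bytes):
    # Bytes physically rewritten per byte logically modified within one block.
    return BLOCK / modified_bytes

print(pages_consumed(1))            # 1 -> 1 byte logical, 4KB physical
print(rewrite_amplification(1024))  # 512.0x amplification
```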


泊浮目