Author: Zhizhen

In Prometheus-based monitoring practice, especially at large scale, the storage and querying of time series data is a critical and troublesome part. Native Prometheus has never given a satisfactory answer to long-time-range queries over large data volumes. In this regard, ARMS Prometheus recently launched a downsampling feature, a new attempt at solving this problem.

Foreword

Problem background

Prometheus and Kubernetes, the golden pair of the cloud-native era, are standard in many enterprise environments. However, as business scale and microservices grow and evolve, the number of monitored objects keeps increasing; to reflect the state of systems and applications in more detail, metric granularity becomes ever finer and the number of metrics keeps rising; and to reveal longer-term trends, metric data has to be retained longer. All of these changes eventually lead to explosive growth in the volume of monitoring data, which puts enormous pressure on the storage, query, and computation layers of observability products.

We can get a more intuitive feel for the consequences of this data explosion through a simple scenario. Suppose we need to query the change in CPU usage on every node of a cluster over the past month, and the cluster is a small one with 30 physical nodes, each running an average of 50 pods that need to be scraped. With the default scrape interval of 30 seconds, we have 30 × 50 = 1,500 scrape targets in total, each target is scraped 60 × 60 × 24 / 30 = 2,880 times a day, and over a one-month period that is 1,500 × 2,880 × 30 ≈ 130 million scrapes. Taking node_exporter as an example, a single scrape of a bare-metal node returns roughly 500 samples, so this cluster generates about 130 million × 500 = 65 billion sampling points in one month! In real business systems the situation is usually far less ideal, and the actual number of sampling points often exceeds 100 billion.
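The back-of-the-envelope arithmetic above can be written out explicitly. This is just a small sketch using the figures from the example (30 nodes, 50 pods per node, 30-second scrapes, and the rough figure of 500 samples per node_exporter scrape):

package main

import "fmt"

func main() {
	const (
		nodes          = 30  // physical nodes in the example cluster
		podsPerNode    = 50  // scrape targets (pods) per node
		scrapeInterval = 30  // seconds between scrapes
		days           = 30  // one-month retention considered in the example
		samplesPerPull = 500 // rough sample count of one node_exporter scrape
	)

	targets := nodes * podsPerNode                 // 1,500 scrape targets
	scrapesPerDay := 60 * 60 * 24 / scrapeInterval // 2,880 scrapes per target per day
	scrapesPerMonth := targets * scrapesPerDay * days
	samplesPerMonth := int64(scrapesPerMonth) * samplesPerPull

	fmt.Println(targets, scrapesPerDay, scrapesPerMonth, samplesPerMonth)
	// Output: 1500 2880 129600000 64800000000
	// i.e. ~130 million scrapes and ~65 billion samples per month
}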

Faced with this situation, we need technical means to optimize the cost and efficiency of storage, querying, and computation while preserving data accuracy as much as possible. Downsampling is one of the representative approaches.

What is downsampling

Downsampling rests on the premise that the processing of the data is associative: merging the values of multiple sampling points does not affect the final calculation result. Prometheus time series data happens to have exactly this property. In other words, downsampling reduces the resolution of the data. The idea is straightforward: if the data points within a time interval are aggregated into one value (or one group of values) according to certain rules, the number of sampling points shrinks, the data volume shrinks, and the storage, query, and computation pressure all shrink with it. So we need two inputs: the time interval and the aggregation rules.
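As a one-line illustration of the associativity premise, max and sum computed over pre-aggregated sub-intervals give the same result as over the raw points (avg, by contrast, has to be reconstructed from the stored sum and count rather than averaged again):

max(x_1, ..., x_n) = max( max(x_1, ..., x_k), max(x_{k+1}, ..., x_n) )
sum(x_1, ..., x_n) = sum(x_1, ..., x_k) + sum(x_{k+1}, ..., x_n)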

For the downsampling interval, based on empirical analysis we defined two intervals: five minutes and one hour. Together with the raw data, this yields data at three different resolutions, and query requests are routed to the appropriate resolution. As ARMS Prometheus offers longer storage-duration options in the future, we may add further interval options.

For the aggregation rules, an analysis of Prometheus operator functions shows that the various operators ultimately boil down to six types of numeric calculations:

  • max, used to calculate the maximum value in the vector; typical operators include max_over_time;
  • min, used to calculate the minimum value in the vector; typical operators include min_over_time;
  • sum, used to calculate the sum of the values in the vector; typical operators include sum_over_time;
  • count, used to count the number of points in the vector; typical operators include count_over_time;
  • counter, used to calculate the rate of change; typical operators include rate, increase, and so on;
  • avg, used to calculate the average value of the points in the time interval.

It follows that, for a series of sampling points within a time interval, we only need to compute the six aggregate feature values above and return the aggregated values for the corresponding interval at query time. With the default 30-second scrape interval, five-minute downsampling aggregates ten points into one, and one-hour downsampling aggregates 120 points into one, so the number of sampling points involved in a query drops by one to two orders of magnitude; the smaller the scrape interval, the more significant the reduction. Fewer sampling points reduce the read pressure on the TSDB on the one hand and the computational pressure on the query engine on the other, which effectively shortens query time.
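As an illustration of the per-interval aggregation, here is a minimal sketch (not the actual ARMS Prometheus implementation; the type and function names are made up for this example, and counter resets are ignored):

package main

import "fmt"

// Sample is one raw (timestamp, value) pair inside a downsampling window.
type Sample struct {
	Timestamp int64
	Value     float64
}

// Aggregates holds the six feature values kept for one downsampled window.
type Aggregates struct {
	Max, Min, Sum float64
	Count         int
	Counter       float64 // last raw value in the window, read by rate/increase-style queries
	Avg           float64
}

// downsampleWindow collapses the raw samples of one window (e.g. five minutes
// or one hour) into the six aggregate feature values.
func downsampleWindow(samples []Sample) (Aggregates, bool) {
	if len(samples) == 0 {
		return Aggregates{}, false
	}
	agg := Aggregates{Max: samples[0].Value, Min: samples[0].Value}
	for _, s := range samples {
		if s.Value > agg.Max {
			agg.Max = s.Value
		}
		if s.Value < agg.Min {
			agg.Min = s.Value
		}
		agg.Sum += s.Value
		agg.Count++
	}
	agg.Counter = samples[len(samples)-1].Value
	agg.Avg = agg.Sum / float64(agg.Count)
	return agg, true
}

func main() {
	// Ten 30-second samples collapse into a single 5-minute point.
	raw := []Sample{{0, 1}, {30, 2}, {60, 2}, {90, 3}, {120, 5},
		{150, 5}, {180, 6}, {210, 8}, {240, 8}, {270, 9}}
	agg, _ := downsampleWindow(raw)
	fmt.Printf("%+v\n", agg) // {Max:9 Min:1 Sum:49 Count:10 Counter:9 Avg:4.9}
}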

How to implement downsampling

Learning from other systems

Other open source and commercial time series databases have also optimized long-time-range queries with downsampling. Let's take a look at how they do it.

  • Prometheus

The storage capability of open source Prometheus has long been criticized. Open source Prometheus does not provide downsampling directly, but it does provide Recording Rules, which users can employ to implement downsampling themselves. However, this generates new time series, and in high-cardinality scenarios it further exacerbates the storage pressure.

  • Thanos

As a well-known Prometheus high-availability storage solution, Thanos provides a relatively complete downsampling scheme. The component that performs downsampling in Thanos is the compactor, which:

  • Periodically pulls blocks (original Prometheus blocks with a 2-hour time span) from object storage, performs compaction and downsampling, and records the downsampling state in the block metadata.
  • Writes the compacted and downsampled results back to object storage as new blocks.


The aggregate feature values produced by downsampling, including sum/count/max/min/counter, are written to special aggrChunks data blocks. At query time:

  • The original aggregation operators and functions are converted into special AggrFuncs that read data from the aggrChunks blocks.
  • The blocks read are sorted by time, and the block with the largest resolution is read first.

  • M3


The M3 Aggregator performs streaming aggregation of metrics before they are written to M3DB, with the storage duration and the sampling interval of the calculation window specified by a storagePolicy.

M3 supports more flexible data intervals and more eigenvalues, including the histogram quantile function.

  • InfluxDB / VictoriaMetrics / Cortex

VictoriaMetrics offers downsampling only in its commercial version; the open source version does not provide it. The open source version of InfluxDB (before v2.0) implements downsampling by running continuous queries over data that has already been written to storage, in a manner similar to Recording Rules. Cortex does not currently support downsampling.

How do we do it

There are various downsampling schemes on the market. We briefly compared them on the points users care about most, such as their usage cost.


ARMS Prometheus downsamples by processing TSDB storage blocks: the original data blocks are automatically turned into downsampled data blocks in the background. On the one hand this achieves better processing performance; on the other hand, end users do not need to worry about parameter configuration, rule maintenance, and so on, keeping their operational burden as low as possible.

This feature has been launched in some Alibaba Cloud regions, where an invitation-only trial has begun. It will be integrated and enabled by default in the upcoming Advanced Edition of ARMS Prometheus.

The impact of downsampling on queries

Once downsampling at the sampling-point level is done, is the long-term query problem solved? Obviously not: the TSDB only stores the raw material, and the curves users see still have to be computed by the query engine. During that computation we face at least the following two problems:

  • Q1: When is downsampled data read? Once downsampling is in place, can the raw data no longer be used?
  • Q2: After downsampling, data points are less dense and the data is "sparser". Will query results stay consistent with those computed on the raw data? Do users need to tune their PromQL?

For the first question, ARMS Prometheus intelligently selects the appropriate time granularity based on the user's query statement and filter conditions, striking a suitable balance between data detail and query performance.
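The exact routing logic is internal to ARMS Prometheus, but a hypothetical rule of thumb gives the flavor. The thresholds below are invented purely for illustration and do not reflect the product's actual behavior:

package main

import (
	"fmt"
	"time"
)

// pickResolution is a hypothetical routing rule that maps a query's time span
// to one of the three resolutions (raw, 5m, 1h). The thresholds are made up
// for this sketch and are not ARMS Prometheus internals.
func pickResolution(querySpan time.Duration) time.Duration {
	switch {
	case querySpan <= 24*time.Hour:
		return 0 // raw data: short queries keep full detail
	case querySpan <= 7*24*time.Hour:
		return 5 * time.Minute
	default:
		return time.Hour
	}
}

func main() {
	fmt.Println(pickResolution(15 * 24 * time.Hour)) // 1h0m0s
}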

For the second question, the conclusion first: the density of collection points has a significant influence on the computed result, but ARMS Prometheus hides the difference at the query-engine level, so users do not need to adjust their PromQL. The impact shows up in three aspects: the duration of the query statement, the step of the query request, and the operators themselves. We describe each of the three below.

duration and downsampling calculation results

We know that a PromQL range vector query carries a time duration parameter that frames the time range used to compute the result. For example, in the query http_requests_total{job="prometheus"}[2m], the specified duration is two minutes: when computing the result, the queried time series is split into vectors of two minutes each, which are passed to the function and evaluated separately. The duration directly determines how many input samples the function receives, so its influence on the result is obvious.

Under normal circumstances the interval between collection points is 30s or shorter, so as long as the duration is larger than that, every split vector is guaranteed to contain several samples for the calculation. After downsampling, the interval between data points becomes much larger (five minutes or even an hour), and a vector may then contain no value at all, which makes the computed series intermittent. In this case ARMS Prometheus automatically adjusts the duration parameter of the operator so that it is never smaller than the downsampling resolution, guaranteeing that every vector contains sampling points and that the calculation result stays accurate.
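A minimal sketch of that adjustment, as an assumption about the general idea rather than the actual ARMS Prometheus code: the range-vector duration is simply clamped to be at least the resolution of the data being read.

package main

import (
	"fmt"
	"time"
)

// adjustDuration widens a range-vector duration so that every window is
// guaranteed to contain at least one point of the coarser downsampled data.
// Illustrative only.
func adjustDuration(queryDuration, resolution time.Duration) time.Duration {
	if queryDuration < resolution {
		return resolution
	}
	return queryDuration
}

func main() {
	// A [2m] selector evaluated against 5-minute downsampled data is
	// effectively widened to [5m], so every window still has samples.
	fmt.Println(adjustDuration(2*time.Minute, 5*time.Minute)) // 5m0s
}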

step and downsampling calculation results

The duration parameter determines the "length" of each vector in PromQL evaluation, while the step parameter determines how far the window advances. When querying from Grafana, the step is actually computed by Grafana from the panel width and the query time span; on my laptop, for example, the default step is 10 minutes for a 15-day span. For some operators, the lower density of sampling points combined with the step can also cause jumps in the calculation result. Below is a simple analysis using increase as an example.

In the normal case (evenly spaced samples, no counter reset), the calculation of increase can be simplified to (last value - first value) × duration / (last timestamp - first timestamp). In ordinary scenarios the gap between the first/last sample and the start/end of the window does not exceed the scrape interval, so when the duration is much larger than the scrape interval the result is approximately (last value - first value). Now suppose we have a set of downsampled counter points, as follows:

 sample1:    t = 00:00:01   v=1 
sample2:    t = 00:59:31   v=1    
sample3:    t = 01:00:01   v=30  
sample4:    t = 01:59:31   v=31 
sample5:    t = 02:00:01   v=31 
sample6:    t = 02:59:31   v=32
...

Assuming a query duration of two hours and a step of 10 minutes, the split vectors are as follows:

 slice 1:  window 00:00:00 / 02:00:00  [sample1 ...  sample4]
slice 2:  window 00:10:00 / 02:10:00  [sample2 ...  sample5]
slice 3:  window 00:20:00 / 02:20:00  [sample2 ...  sample5]
...

In the raw data, the gap between the first/last sample and the start/end of the window never exceeds the scrape interval, while for downsampled data this gap can be as large as (duration - step). If the sample values change gently, the downsampled result will not differ noticeably from the result on raw data; but if the values change sharply within a slice, the formula above, (last value - first value) × duration / (last timestamp - first timestamp), proportionally magnifies that change and makes the displayed curve fluctuate more violently. We consider this behavior normal, and for fast-moving counters irate is the better fit, which is also consistent with the recommendations of the official documentation.
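Plugging the sample points above into the simplified formula makes the jump visible. This is a rough illustration only; real PromQL increase() additionally extrapolates to the window boundaries and handles counter resets:

package main

import "fmt"

// simplifiedIncrease applies the simplified formula from the text:
// (last value - first value) * duration / (last timestamp - first timestamp).
func simplifiedIncrease(firstV, lastV, firstT, lastT, duration float64) float64 {
	return (lastV - firstV) * duration / (lastT - firstT)
}

func main() {
	const duration = 2 * 60 * 60 // two-hour range vector, in seconds

	// slice 1: sample1 (00:00:01, v=1) ... sample4 (01:59:31, v=31)
	fmt.Printf("slice 1: %.1f\n", simplifiedIncrease(1, 31, 1, 7171, duration))
	// slice 2: sample2 (00:59:31, v=1) ... sample5 (02:00:01, v=31)
	fmt.Printf("slice 2: %.1f\n", simplifiedIncrease(1, 31, 3571, 7201, duration))
	// Output: slice 1: 30.1, slice 2: 59.5 -- the same counter growth,
	// but the computed value roughly doubles between adjacent steps.
}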

Operator and downsampling calculation results

Some operators' results are directly tied to the number of samples. The most typical is count_over_time, which counts the samples in a time interval, while downsampling by design reduces the number of points in that interval. This case therefore needs special handling in the Prometheus engine: when downsampled data is detected, a different calculation logic is used to keep the result correct.
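One plausible way to keep count_over_time correct is to sum the stored per-window count feature values instead of counting the downsampled points themselves. This is an assumption about the general approach, not a description of the actual ARMS Prometheus engine:

package main

import "fmt"

// Window is one downsampled point carrying the count of raw samples it replaced.
type Window struct {
	Count int
}

// countOverTime returns the number of raw samples in the queried interval when
// only downsampled windows are available, by summing the stored counts. Sketch
// only; windows partially covered by the interval are ignored here.
func countOverTime(windows []Window) int {
	total := 0
	for _, w := range windows {
		total += w.Count
	}
	return total
}

func main() {
	// Two 5-minute windows, each built from ten 30-second raw samples.
	fmt.Println(countOverTime([]Window{{Count: 10}, {Count: 10}})) // 20, not 2
}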

Comparison of downsampling effects

For users, what is ultimately felt is the improvement in query speed. To see how big that improvement is, we verified it with two comparative queries.

The test cluster has 55 nodes and more than 6,000 pods in total; about 10 billion sampling points are reported per day, and the data retention period is 15 days.

First round of comparison: query efficiency

Query statement:

sum(irate(node_network_receive_bytes_total{}[5m])*8) by (instance)

That is, query the network traffic received by each node in the cluster, over a query period of 15 days.


Figure 1: Downsampled data query, fifteen-day time span, query took 3.12 seconds


Figure 2: Raw data query, fifteen-day time span, query timed out (the timeout is 30 seconds)

The raw-data query failed to return because the computation timed out under the large data volume; the downsampled query is at least ten times more efficient than the raw query.

Second round of comparison: accuracy of results

Query statement:

max(irate(node_network_receive_bytes_total{}[5m])*8) by (instance)

That is, query the traffic of the network interface that receives the most data on each node.


Figure 3: Downsampled data query, time span of two days


Figure 4: Raw data query, time span of two days

For this round we shortened the query span to two days so that the raw-data query could also return quickly. Comparing the downsampled result (above) with the raw-data result (below), the number of time series and the overall trend are exactly the same, and even the points where the data changes sharply match well, fully satisfying the needs of long-range queries.

Epilogue

Alibaba Cloud officially released the Alibaba Cloud Observability Suite (ACOS) on June 22. Built around the Prometheus service, the Grafana service, and the tracing service, it forms an observable data layer that unifies metric storage and analysis, trace storage and analysis, and heterogeneous data sources, and provides dashboarding, alerting, and data exploration capabilities through standard PromQL and SQL. It brings data value to scenarios such as IT cost management, enterprise risk governance, intelligent operations, and business continuity assurance, so that observable data can truly go beyond observation.

Among them, Alibaba Cloud Prometheus monitoring has gradually rolled out targeted measures such as global aggregation queries, streaming queries, downsampling, and pre-aggregation for extreme scenarios involving many instances, large data volumes, high time-series cardinality, long time spans, and complex queries.

Promotions such as a 15-day free trial and reduced fees for basic container-cluster metrics in Prometheus monitoring are currently available!

Click here to activate the service~

