Text | Liu Jiacai (nickname: Chenxiang), senior development engineer at Ant Group, focusing on time series storage
Proofreading | Feng Jiachun
This article is about 7,035 words, a 10-minute read
One of CeresDB's early design goals was compatibility with open source protocols, and the system currently supports both the OpenTSDB and Prometheus protocols. Compared with OpenTSDB, the Prometheus protocol is much more flexible, playing a role similar to SQL in the time series domain.
As internal usage scenarios grew, query performance and service stability gradually exposed some problems. This article reviews some of the work CeresDB has done to improve its PromQL query engine, in the hope of sparking further ideas; please point out anything we got wrong or missed.
PART. 1 Memory Control
For a query engine, the performance bottleneck is in most cases IO. To mitigate IO, data is generally cached in memory. In CeresDB the caches mainly include the following parts:
- MTSDB: caches data by time range and evicts it by time range
- Column Cache: caches data by timeline; when memory usage reaches a configured threshold, timelines are evicted in LRU order (a toy version is sketched after this list)
- Index Cache: evicted by LRU according to access frequency
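To make the eviction rule concrete, here is a toy sketch of a timeline-keyed cache with a memory threshold; the key type, size accounting, and eviction details are assumptions for illustration, not CeresDB's actual Column Cache:

```rust
use std::collections::VecDeque;
use std::mem::size_of;

/// Toy timeline-keyed column cache: entries are evicted from the least
/// recently used end once memory usage exceeds a threshold.
struct ColumnCache {
    /// Front = least recently used timeline; a real cache would also move
    /// an entry to the back when it is read.
    lru: VecDeque<(u64, Vec<f64>)>, // (timeline id, column data)
    used_bytes: usize,
    threshold_bytes: usize,
}

impl ColumnCache {
    fn insert(&mut self, timeline: u64, data: Vec<f64>) {
        self.used_bytes += data.len() * size_of::<f64>();
        self.lru.push_back((timeline, data));
        // Evict timelines until memory usage drops below the threshold.
        while self.used_bytes > self.threshold_bytes {
            match self.lru.pop_front() {
                Some((_, evicted)) => {
                    self.used_bytes -= evicted.len() * size_of::<f64>();
                }
                None => break,
            }
        }
    }
}
```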
The memory usage of the parts above is relatively stable; what causes the largest memory fluctuations is the intermediate results of queries, and if they are not well controlled the service can easily OOM.
The size of an intermediate result can be measured along two dimensions: horizontally, the number of timelines involved, and vertically, the number of data points per timeline.
The simplest way to control intermediate results is to limit these two dimensions and reject over-sized queries directly when building the query plan, but that hurts user experience. For example, in SLO scenarios metrics need monthly statistics, and the corresponding PromQL is usually something like sum_over_time(success_reqs[30d]); if month-long range queries are not supported, the business layer has to adapt.
Solving this problem requires understanding how data is organized and queried in CeresDB. The data of a single timeline is stored in compressed blocks of thirty minutes, and the query engine uses a vectorized volcano model: when next is called between operators, data is passed in 30-minute batches.
When executing the sum_over_time function above, the 30 days of data are read out block by block, decompressed, and then summed. This makes memory usage grow linearly with the query range; if this linear relationship could be removed, memory usage would barely change even if the query range doubled.
To achieve this, streaming evaluation can be implemented for functions with cumulative properties, such as sum/max/min/count: right after each compressed block is decompressed, the function is evaluated and the partial result is kept in a temporary variable, and the final result is returned once all blocks have been processed. With this approach, intermediate results that used to take gigabytes may end up at only a few kilobytes.
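A minimal sketch of the streaming idea, assuming a hypothetical Block type standing in for a decompressed 30-minute block (not CeresDB's real storage interface):

```rust
/// One decompressed 30-minute block of a single timeline (illustrative type).
struct Block {
    values: Vec<f64>,
}

/// Streaming sum_over_time: fold each block into a running total right after
/// decompression instead of materializing all 30 days of points first, so
/// memory usage stays proportional to one block rather than the whole range.
fn streaming_sum_over_time(blocks: impl Iterator<Item = Block>) -> f64 {
    let mut acc = 0.0;
    for block in blocks {
        // Only the current decompressed block is held in memory here.
        acc += block.values.iter().sum::<f64>();
    }
    acc
}
```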
PART. 2 Function Pushdown
Unlike the single-node Prometheus, CeresDB adopts a share-nothing distributed architecture with three main roles in the cluster:
- datanode: stores the actual metric data and is generally assigned a number of shards; stateful
- proxy: routes writes and queries; stateless
- meta: stores shard, tenant, and other metadata; stateful
The rough execution flow of a PromQL query (sketched in code after this list) is:
1. The proxy parses the PromQL statement into a syntax tree and, using the shard information in meta, determines which datanodes are involved
2. The proxy sends the nodes of the syntax tree that can be pushed down to the datanodes via RPC
3. The proxy receives the results from all datanodes, executes the syntax tree nodes that cannot be pushed down, and returns the final result to the client
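A much-simplified sketch of this proxy-side flow; Plan, SubPlan, and the stub functions below are made-up names for illustration, not CeresDB's actual APIs:

```rust
struct SubPlan {
    shard_id: u32,
    fragment: String, // pushed-down PromQL fragment, e.g. "sum by (cluster) (up)"
}

struct Plan {
    pushed_down: Vec<SubPlan>, // executed on the datanodes
    residual: String,          // executed on the proxy over the merged results
}

fn execute(plan: &Plan) -> Vec<f64> {
    // 1. Fan out the pushed-down fragments to the owning datanodes
    //    (done serially here; a real proxy would issue the RPCs concurrently).
    let partials: Vec<Vec<f64>> = plan
        .pushed_down
        .iter()
        .map(|sub| query_datanode(sub.shard_id, &sub.fragment))
        .collect();

    // 2. Evaluate the remaining, non-pushable nodes over the partial results.
    evaluate_residual(&plan.residual, &partials)
}

// Stubs standing in for the RPC call and the local evaluator.
fn query_datanode(_shard: u32, _fragment: &str) -> Vec<f64> {
    Vec::new()
}
fn evaluate_residual(_expr: &str, _partials: &[Vec<f64>]) -> Vec<f64> {
    Vec::new()
}
```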
The execution plan of sum(rate(write_duration_sum[5m])) / sum(rate(write_duration_count[5m])) is shown in the figure below:
To minimize IO between the proxy and the datanodes, CeresDB tries to push syntax tree nodes down to the datanode layer. For example, for the query sum(rate(http_requests[3m])), the ideal outcome is to push both sum and rate down to the datanodes, so that the data returned to the proxy shrinks dramatically. This is the same idea as selection pushdown in traditional relational databases: reduce the amount of data involved in the computation.
Depending on how many shards a PromQL query touches, pushdown optimizations fall into two categories: single-shard pushdown and multi-shard pushdown.
### Single-shard pushdown
For a single shard, the data lives on one machine, so pushdown only requires implementing the Prometheus functions at the datanode layer. Here we focus on pushdown support for subquery [1], since it differs from ordinary functions; readers unfamiliar with subqueries can refer to Subquery Support [2].
A subquery is similar to the query_range [3] interface (also called a range query): it has three main parameters, start/end/step, which define the query range and the resolution of the data. For an instant query, its single time parameter maps straightforwardly to the subquery's end, but a range query also has start/end/step of its own; how do these correspond to the subquery's parameters?
Suppose there is a range query with a 10s step over a 1h range, and the statement is avg_over_time((a_gauge == bool 2)[1h:10s]). Then each step needs 3600/10 = 360 points to be computed, and over the whole hour that would involve 360 * 360 = 129,600 points. However, because the subquery's step equals the range query's step, some points can be reused, and only 720 points are actually involved, i.e. the 2h worth of subquery data.
As you can see, when the steps are not aligned, the amount of data involved becomes very large. Prometheus made an improvement after version 2.3.0: when the subquery's step does not evenly divide the range query's step, the range query's step is ignored and the subquery's result is reused directly. Let's analyze this with an example:
Suppose a range query starts at t=100 with a step of 3s, and the subquery's range is 20s with a step of 5s. For the range query, normally:
1. The first step needs the five points at t = 80, 85, 90, 95, 100
2. The second step needs the five points at t = 83, 88, 93, 98, 103
Each step needs a staggered set of points. But if the range query's step is ignored, the subquery is computed first and its result is passed up as a range vector; then every step of the range query sees the same points t = 80, 85, 90, 95, 100, 105, ..., which is the same logic as when the steps are aligned. Moreover, after this treatment a subquery is no different from any other function returning a range vector: for pushdown it only needs to be wrapped as a call (function) node, except that this call node performs no actual computation and merely reorganizes the data according to the step.
call: avg_over_time step:3
└─ call: subquery step:5
   └─ binary: ==
      ├─ selector: a_gauge
      └─ literal: 2
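A rough sketch of what this reorganizing call node does, assuming the subquery's samples have already been evaluated and sorted by timestamp; the types and names are illustrative only:

```rust
/// One evaluated subquery sample (illustrative type).
#[derive(Clone, Copy)]
struct Sample {
    timestamp: i64, // seconds
    value: f64,
}

/// For each outer step `t`, select the subquery samples falling in
/// (t - range, t]; no function is evaluated here, the points are only
/// regrouped so the parent node (e.g. avg_over_time) can consume them.
fn reslice_subquery(
    samples: &[Sample],
    start: i64,
    end: i64,
    step: i64,
    range: i64,
) -> Vec<Vec<Sample>> {
    let mut windows = Vec::new();
    let mut t = start;
    while t <= end {
        let window: Vec<Sample> = samples
            .iter()
            .copied()
            .filter(|s| s.timestamp > t - range && s.timestamp <= t)
            .collect();
        windows.push(window);
        t += step;
    }
    windows
}
```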
Before this optimization went live, queries containing a subquery could not be pushed down, which not only took a long time but also produced large intermediate results and big memory fluctuations. After the feature went live, it not only helped control memory but also sped up these queries by roughly 2-5x.
### Multi-shard pushdown
For a distributed system, the real challenge is query performance when multiple shards are involved. In CeresDB, the basic sharding method is by metric name; for very large metrics, routing is done by metric + tags, with the tags specified by the user.
Therefore, for CeresDB, multi-shard queries fall into two types:
1. A single metric is involved, but the metric has multiple shards
2. Multiple metrics, on different shards, are involved
#### Single metric, multiple shards
For a single-metric multi-shard query, if the filter conditions carry the sharding tags, the query naturally maps to a single shard, for example (cluster is the sharding tag):
up{cluster="em14"}
There is also a special case, namely:
sum by (cluster) (up)
In this query, although the filter conditions contain no sharding tags, the by clause of the aggregation does. The query involves multiple shards, but the data on different shards does not need to be combined across shards, so it can also be pushed down.
We can go one step further here: for aggregation operators with cumulative properties, even if neither the filter conditions nor the by clause contains the sharding tags, pushdown is still possible by inserting a node. For example, the following two queries are equivalent:
sum (up)
# equivalent to
sum ( sum by (cluster) (up) )
Since the inner sum groups by the sharding tag, it can be pushed down, which greatly reduces the amount of data transferred; even if the outer sum is not pushed down, it matters little. With this optimization, an aggregation query that previously took 22s drops to 2s.
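A rough sketch of such a rewrite at the syntax-tree level; the tiny Expr model below is a made-up stand-in, not CeresDB's actual AST:

```rust
/// Minimal expression model used only for this illustration.
enum Expr {
    Selector(String),                        // e.g. `up`
    Sum { by: Vec<String>, arg: Box<Expr> }, // e.g. `sum by (cluster) (...)`
}

/// Rewrite `sum (e)` into `sum ( sum by (<sharding tags>) (e) )` so the inner
/// aggregation can run on each shard. This is only valid for aggregations
/// with cumulative properties such as sum/min/max/count.
fn push_down_sum(expr: Expr, shard_tags: &[String]) -> Expr {
    match expr {
        Expr::Sum { by, arg } if by.is_empty() => Expr::Sum {
            by: Vec::new(),
            arg: Box::new(Expr::Sum {
                by: shard_tags.to_vec(),
                arg,
            }),
        },
        other => other,
    }
}
```

Under this model, push_down_sum applied to sum(up) with sharding tag cluster yields sum(sum by (cluster)(up)), matching the equivalence above.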
In addition, some binary operators may involve only one metric, for example:
time() - kube_pod_created > 600
Here time() and 600 can be treated as constants and pushed down to the datanode to be computed together with kube_pod_created.
#### Multi-metric, multi-shard
For the multi-metric scenario, since the data distribution of different metrics is unrelated, there is no need to optimize the sharding rules for it. One direct optimization is to query the metrics concurrently. Another is to borrow the idea of SQL rewriting and adjust the query structure appropriately to achieve pushdown. For example:
sum (http_errors + grpc_errors)
# equivalent to
sum (http_errors) + sum (grpc_errors)
For some combinations of aggregation functions and binary operators, the aggregation function can be moved to the innermost layer by rewriting the syntax tree, which enables pushdown. Note that not all binary operators support such a rewrite; for example, the following rewrite is not equivalent:
sum (http_errors or grpc_errors)
# not equivalent to
sum (http_errors) or sum (grpc_errors)
In addition, common subexpression elimination can be applied here as well. For example, in (total-success)/total, total only needs to be queried once and the result can then be reused.
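As a sketch of how such a rewrite might be guarded, the toy model below distributes sum over + but deliberately leaves or untouched; again, these types are illustrative, not CeresDB's real rewriter:

```rust
/// Tiny illustrative expression model (not CeresDB's real AST).
enum Expr {
    Metric(String),
    Sum(Box<Expr>),
    Binary { op: BinOp, lhs: Box<Expr>, rhs: Box<Expr> },
}

enum BinOp {
    Add, // arithmetic operators like `+` distribute over sum (per the example above)
    Or,  // set operators like `or` must NOT be rewritten this way
}

/// Rewrite `sum (a + b)` into `sum(a) + sum(b)`; any other shape is returned
/// unchanged, so `sum (a or b)` is left alone.
fn distribute_sum(expr: Expr) -> Expr {
    match expr {
        Expr::Sum(inner) => match *inner {
            Expr::Binary { op: BinOp::Add, lhs, rhs } => Expr::Binary {
                op: BinOp::Add,
                lhs: Box::new(Expr::Sum(lhs)),
                rhs: Box::new(Expr::Sum(rhs)),
            },
            other => Expr::Sum(Box::new(other)),
        },
        other => other,
    }
}
```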
PART. 3 Index Matching Optimization
For time series data lookups, the index mainly relies on the tagk -> tagv -> postings structure to speed up queries, as shown in the figure below:
For up{job="app1"}, the corresponding postings (i.e. the list of timeline IDs) can be found directly, but for a negative match such as up{status!="501"} there is no direct lookup. The conventional approach takes two traversals: first find all tagvs that satisfy the condition, then find all their postings and take the union.
Here, however, we can use the algebraic properties of sets [4] to turn a negative match into a positive one. For example, if the query condition is up{job="app1",status!="501"}, then when merging, after looking up the postings for the job we directly look up the postings for status="501" and subtract them from the job's postings; there is no need to traverse the tagvs of status.
# conventional approach: union the postings of every tagv satisfying status != "501", then intersect
{1, 4} ∩ {1, 3} = {1}
# negate, then subtract: look up status = "501" directly and subtract its postings
{1, 4} - {2, 4} = {1}
Following a similar idea, a regex match such as up{job=~"app1|app2"} can be split into two exact matches on job, which also avoids traversing the tagvs.
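A small sketch of the negative-match trick over posting lists, using plain sets for clarity (a real index would more likely use sorted or bitmap-compressed posting lists):

```rust
use std::collections::BTreeSet;

/// Illustrative posting lists (sets of timeline IDs).
type Postings = BTreeSet<u64>;

/// Evaluate `tag != value` by subtracting the postings of the exact value
/// from the already-matched base set, instead of unioning the postings of
/// every other value of that tag.
fn negative_match(base: &Postings, exact_value_postings: &Postings) -> Postings {
    base.difference(exact_value_postings).copied().collect()
}

fn main() {
    // job="app1" -> {1, 4}, status="501" -> {2, 4}
    let job_app1: Postings = [1, 4].into_iter().collect();
    let status_501: Postings = [2, 4].into_iter().collect();

    // up{job="app1", status!="501"} -> {1}
    let expected: Postings = [1].into_iter().collect();
    assert_eq!(negative_match(&job_app1, &status_501), expected);
}
```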
In addition, in cloud-native monitoring scenarios timeline churn is frequent: a single pod being created or destroyed generates a large number of new timelines, so the index needs to be split. A common approach is to split by time, e.g. generate a new index every two days and merge multiple indexes according to the time range at query time. To avoid write/query jitter when switching indexes, pre-write logic was added; the idea is roughly as follows:
When writing, the index is not switched strictly at the time-window boundary. Instead, a pre-write point is specified in advance, and writes after the pre-write point are double-written, i.e. written into both the current index and the next one. The rationale is time locality: these timelines are very likely still alive in the next window. Pre-writing warms up the next index on the one hand and, on the other, reduces the pressure of query fan-out, because the next index already contains the previous one's data since the pre-write point, which matters especially for queries that cross the window boundary.
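A sketch of how a write could pick its target index generations under this scheme; the window length, pre-write offset, and the notion of an index generation are assumptions made for illustration:

```rust
/// Time-partitioned index windows with a pre-write point.
struct IndexWindows {
    epoch: i64,           // where window 0 starts (seconds)
    window_len: i64,      // e.g. 2 days = 172_800 s
    prewrite_offset: i64, // double-writing starts this long before the window ends
}

impl IndexWindows {
    /// Which index generations a timeline written at `ts` should go to:
    /// always the current window, plus the next one once `ts` has passed
    /// the pre-write point, so the next index is warmed up in advance.
    fn write_targets(&self, ts: i64) -> Vec<i64> {
        let current = (ts - self.epoch).div_euclid(self.window_len);
        let offset = (ts - self.epoch).rem_euclid(self.window_len);
        if offset >= self.window_len - self.prewrite_offset {
            vec![current, current + 1] // double-write: current and next index
        } else {
            vec![current]
        }
    }
}
```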
PART. 4 Full-Link Tracing
When doing performance optimization, besides looking at metrics, it is very important to trace the whole query path, from the moment the proxy receives a request to the moment it returns the result; the trace can also be associated with the client's trace ID and used to troubleshoot users' query problems.
Interestingly, the optimization with the biggest performance gain found through tracing was deleting a single line of code. Because native Prometheus may connect to multiple remote endpoints, it sorts each remote endpoint's results by timeline so that, when merging, it can combine the data from n remote endpoints with O(n*m) complexity in a merge-sort fashion (assuming m timelines per endpoint). For CeresDB, however, there is only one remote endpoint, so this sort is unnecessary. After removing it, queries that cannot be pushed down sped up by roughly 2-5x.
PART. 5 Continuous Integration
Although there are mature optimization rules based on relational algebra and SQL rewriting, integration tests are still needed to guarantee the correctness of each development iteration. CeresDB currently uses LINKE's ACI for continuous integration, and the test cases consist of two parts:
- Prometheus' own PromQL test set [5]
- CeresDB test cases written for the optimizations described above
Both parts are run every time an MR is submitted, and merging into the main branch is allowed only after they pass.
PART. 6 PromQL Prettier
While integrating with Sigma cloud-native monitoring, we found that SREs sometimes write particularly complex PromQL that is hard to read at a glance, so we built a formatting tool based on the open source PromQL parser. The effect is as follows:
Original:
topk(5, (sum without(env) (instance_cpu_time_ns{app="lion", proc="web", rev="34d0f99", env="prod", job="cluster-manager"})))
Pretty print:
topk (
  5,
  sum without (env) (
    instance_cpu_time_ns{app="lion", proc="web", rev="34d0f99", env="prod", job="cluster-manager"}
  )
)
See the project README [6] for download and usage instructions.
"Summarize"
This article introduced some of the improvements made to Prometheus on CeresDB as usage scenarios grew. Compared with the Thanos + Prometheus architecture, CeresDB's query performance is now 2-5x better in most scenarios, and queries that hit multiple optimizations can improve by more than 10x. CeresDB already covers most monitoring scenarios on AntMonitor (Ant's internal monitoring system), such as SLO, infrastructure, custom monitoring, and Sigma cloud native.
The optimization points listed in this article are not hard to describe; the difficulty lies in getting the details right. I ran into one serious problem during development: the timelines returned by executors in different next stages of the pipeline could be inconsistent, which, combined with Prometheus's distinctive lookback logic (5 minutes by default), led to data loss in some scenarios; it took a week to track down.
I remember that when reading Why ClickHouse Is So Fast? [7], I strongly agreed with its point of view. Let me close with a quote from that article:
"What really makes ClickHouse stand out is attention to low-level details."
Recruitment
We are the time series storage team within Ant's intelligent monitoring technology department, building a new-generation time series database in Rust with high performance, low cost, and real-time analytics capabilities.
Ant's monitoring risk intelligence team is also continuously hiring. The team is responsible for building the intelligent capabilities and platforms for Ant Group's technical risk domain, providing algorithm support for various technical risk scenarios (emergency response, capacity, change, performance, etc.), covering time series anomaly detection, causal inference and root cause localization, graph learning and event correlation analysis, and log analysis and mining, with the goal of building world-leading AIOps capabilities.
Inquiries are welcome: jiachun.fjc@antgroup.com
"refer to"
· PromQL Subqueries and Alignment
[1] subquery: https://prometheus.io/docs/prometheus/latest/querying/examples/#subquery
[2] Subquery Support: https://prometheus.io/blog/2019/01/28/subquery-support/
[3] query_range: https://prometheus.io/docs/prometheus/latest/querying/api/#range-queries
[4] Complement (set theory): https://zh.wikipedia.org/wiki/%E8%A1%A5%E9%9B%86
[5] PromQL test set: https://github.com/prometheus/prometheus/tree/main/promql/testdata
[6] README: https://github.com/jiacai2050/promql-prettier
[7] Why ClickHouse Is So Fast?: https://clickhouse.com/docs/en/faq/general/why-clickhouse-is-so-fast/