
Author of this article

Yuanli from the Shopee Data Infra OLAP team.

Summary

Apache Druid is a high-performance, open-source time series database suited to low-latency, interactive query and analysis scenarios. This article shares the engineering practice of Apache Druid in supporting OLAP real-time analysis for Shopee's core businesses.

As Shopee's business continues to grow, more and more core businesses depend on the OLAP real-time analysis service backed by our Druid cluster. These increasingly demanding application scenarios exposed various performance bottlenecks in the open-source Apache Druid project. By analyzing and studying the core source code, we optimized the metadata management module and the cache module where those bottlenecks appeared.

At the same time, to meet the customization requirements of the company's internal core businesses, we developed several new features, including exact deduplication operators for integers and a flexible sliding window function.

1. Application of the Druid cluster in Shopee

The current deployment is a single large cluster of 100+ physical-machine nodes. As the downstream of related core business data projects, the Druid cluster ingests data through both batch tasks and streaming tasks, and the related business parties then run OLAP real-time queries and analysis on it.

2. Sharing of technical optimization solutions

2.1 Efficiency optimization of Coordinator load balancing algorithm

2.1.1 Problem background

Through real-time task monitoring and alerts, we found that many real-time tasks failed because the last step, segment publishing (Coordinator Handoff), timed out; users then reported that their real-time data queries were jittery.

The investigation found that as more services were onboarded to the Druid cluster, the number of dataSources kept growing, and with the accumulation of historical data, the total number of segments in the cluster kept increasing. This put ever more pressure on the Coordinator's metadata management, and a performance bottleneck gradually appeared that affected the stability of the overall service.

2.1.2 Problem analysis

Analysis of the Coordinator's series of serial subtasks

First, we needed to determine whether these serial subtasks could be parallelized, but analysis showed that they have logical dependencies on one another, so they must run serially. From the Coordinator's logs we found that one subtask, responsible for balancing segment loading across historical nodes, ran extremely slowly, taking more than 10 minutes. It is this subtask that dragged out the total duration of the whole serial chain, so the subtask responsible for assigning segment loading ran at too long an interval, which caused the aforementioned real-time tasks to fail with a timeout in the publishing phase.

Profiling with JProfiler showed that the reservoir sampling implementation used by the load balancing algorithm has performance issues. Reading the source code, we found that each call samples only a single element from the full list of about 5 million segments, while each balancing cycle needs to move 2,000 segments. In other words, traversing a 5-million-element list 2,000 times is clearly unreasonable.

2.1.3 Optimization scheme

We implemented a batch-sampling version of the reservoir algorithm, which traverses the 5-million-entry segment metadata list only once to sample all 2,000 elements. After the optimization, the subtask responsible for segment load balancing takes only about 300 milliseconds, and the total time spent on the Coordinator's serial subtasks drops significantly.
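For illustration, below is a minimal single-pass reservoir sampling sketch in Java (the classic Algorithm R, not the exact Druid code); it shows how k elements can be sampled uniformly in one traversal instead of running a one-element sample k times.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Single-pass batch reservoir sampling: draws k elements uniformly at random from a stream.
public final class BatchReservoirSampler {
    public static <T> List<T> sample(Iterable<T> segments, int k) {
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0;
        for (T segment : segments) {
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(segment); // fill the reservoir with the first k elements
            } else {
                int j = ThreadLocalRandom.current().nextInt(seen); // uniform index in [0, seen)
                if (j < k) {
                    reservoir.set(j, segment); // keep the new element with probability k / seen
                }
            }
        }
        return reservoir;
    }
}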

Benchmark results

A comparison of benchmark results shows that the batch-sampling reservoir algorithm performs significantly better than the other options.

Community collaboration

We have contributed this optimization to the Apache Druid community; see the PR for details.

2.2 Incremental metadata management optimization

2.2.1 Problem background

In the current Coordinator metadata management, a scheduled thread pulls the full set of segment records from the metadata MySQL DB (every 2 minutes by default) and updates a snapshot of the segment set in the Coordinator process's memory. When the amount of segment metadata in the cluster is very large, each full-pull SQL becomes very slow, and deserializing the large number of metadata records also consumes considerable resources. A series of segment management subtasks in the Coordinator all depend on this snapshot update, so the slow full-pull SQL directly affects the timeliness of data (segment) visibility in the overall cluster.

2.2.2 Problem analysis

We first analyze how segment metadata changes in three different scenarios: metadata addition, deletion, and modification.

Metadata addition

Data writes to a dataSource generate new segment metadata; writes come mainly from batch tasks and Kafka real-time tasks. The Coordinator's segment management subtasks must perceive and manage this newly added segment metadata in a timely manner, which is critical to the visibility of data written to the Druid cluster. Analysis of Druid's built-in metrics showed that the number of segments added per unit time is far smaller than the total of about 5 million records.

Metadata removal

Druid can clean up the segments of a dataSource within a specified time interval by submitting a kill-type task. The kill task first deletes the segment records in the metadata DB and then deletes the segment files in HDFS. For segments already downloaded to the historical nodes' local disks, the Coordinator's segment management subtask is responsible for notifying them to clean up.

Metadata modification

One of the Coordinator's segment management subtasks marks segments with older version numbers as unused according to the segment version number. This changes the flag in the relevant metadata records that indicates whether a segment is valid; for old-version segments already downloaded to the historical nodes' local disks, the Coordinator's segment management subtask likewise notifies them to clean up.

2.2.3 Optimization scheme

From the analysis of these three cases, we found that perceiving and managing newly added metadata in a timely manner is the most important, since it directly affects the timely visibility of newly written data. Metadata deletion and modification mainly affect data cleanup, which has relatively low timeliness requirements.

Based on the above analysis, our optimization idea is to implement an incremental metadata management mode: pull only the segment metadata added in the recent period from the metadata DB, and merge it with the current metadata snapshot to obtain a new snapshot for metadata management. At the same time, to guarantee eventual consistency and make sure the lower-priority data cleanup is still completed, the full metadata is still pulled at regular, longer intervals.

The original full-pull SQL statement:

SELECT payload FROM druid_segments WHERE used=true;

Incremental pull SQL statement:

-- To keep this SQL efficient, create an index on the new filter column in the metadata DB in advance
SELECT payload FROM druid_segments WHERE used=true and created_date > :created_date;
Incremental feature property configuration
# Incrementally pull metadata added in the last 5 minutes
druid.manager.segments.pollLatestPeriod=PT5M
# Pull the full metadata every 15 minutes
druid.manager.segments.fullyPollDuration=PT15M
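Putting the two SQL statements and the properties together, a hypothetical sketch of the polling loop could look like the following (class and method names are illustrative, not Druid's actual metadata manager code): frequent, cheap incremental pulls are merged into the in-memory snapshot, while a periodic full pull guarantees eventual consistency.

import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the incremental polling idea; not the actual Druid implementation.
public class IncrementalSnapshotPoller {
    private final Map<String, String> snapshot = new HashMap<>(); // segmentId -> segment payload
    private Instant lastFullPoll = Instant.EPOCH;

    public void poll(Duration pollLatestPeriod, Duration fullyPollDuration) {
        Instant now = Instant.now();
        if (Duration.between(lastFullPoll, now).compareTo(fullyPollDuration) >= 0) {
            // Full pull for eventual consistency: SELECT payload FROM druid_segments WHERE used=true
            snapshot.clear();
            snapshot.putAll(queryUsedSegments(null));
            lastFullPoll = now;
        } else {
            // Incremental pull: ... AND created_date > :created_date
            snapshot.putAll(queryUsedSegments(now.minus(pollLatestPeriod)));
        }
    }

    // Stand-in for the metadata-DB query; createdAfter == null means a full pull.
    private Map<String, String> queryUsedSegments(Instant createdAfter) {
        return Map.of();
    }
}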
Online performance

Monitoring metrics show that after the incremental management feature was enabled, the time spent pulling and deserializing metadata dropped significantly. The pressure on the metadata DB was also reduced, and the user-reported problem of newly written data becoming visible slowly was resolved.

2.3 Broker result cache optimization

2.3.1 Problem background

During query performance tuning, we found that many query scenarios could not make good use of the caching features Druid provides. Druid currently has two caching mechanisms: result caching and segment-level intermediate result caching. The result cache can only be used on the Broker process, while the segment-level intermediate result cache can be used on the Broker as well as on data nodes. However, both caching features currently have obvious limitations, as shown in the following table.

Cache scheme | Scenario 1: using the group by v2 engine | Scenario 2: scanning only historical segments | Scenario 3: scanning historical and real-time segments together | Scenario 4: efficiently caching results across a large number of segments
Segment-level cache | Not supported | Supported | Supported | Inefficient
Result cache | Not supported | Supported | Not supported | Supported

2.3.2 Problem analysis

Caching is unavailable with the group by v2 engine

The group by v2 engine has long been the default engine for groupBy-type queries across many stable versions, and will remain so for the foreseeable future. groupBy is also one of the most common query types, the other two being topN and timeseries. The problem that the group by v2 engine does not support caching still exists as of version 0.22.0; see the unsupported scenarios in the table above.

By tracking the community's change history, we found that the reason the group by v2 engine does not support caching is that the segment-level intermediate results are not sorted, which may cause the merged query results to be incorrect; for details, see the related community issue.

Here's a brief summary of why the Druid community chose to fix this bug by disabling the feature:

  • If the segment-level intermediate results are sorted before being cached, the load on historical nodes increases when the number of segments is large;
  • If the segment-level intermediate results are cached directly without sorting, the Broker has to re-sort the intermediate results of every segment, which increases the burden on the Broker;
  • If the feature is simply disabled, the historical nodes are unaffected and the bug of incorrect Broker merge results is also fixed. :)

However, the community's fix also inadvertently broke the result cache, so that in the fixed versions the result cache on the Broker is likewise unavailable when the group by v2 engine is used; see the unsupported scenarios in the table above.

Limitations of result caching

The result cache requires that the set of segments scanned by a query be identical across runs, and that all of them be historical segments. In other words, as soon as a query needs to touch the latest real-time data, the result cache becomes unavailable.

For a service like Druid, whose strength is real-time query and analysis, this limitation of the result cache is particularly prominent. The query dashboards of many business scenarios aggregate time series over the latest day/week/month, which includes the latest real-time data, so these queries cannot use the result cache.

Limitations of segment-level intermediate result caching

The segment-level intermediate result cache can be enabled on the Broker and on data nodes at the same time, and is mainly intended for historical nodes.

When the segment-level intermediate result cache is enabled on the Broker and a large number of segments are scanned, the following limitations exist:

  • Deserializing the cached results adds extra overhead on the Broker;
  • Merging the intermediate results falls entirely on the Broker, and the historical nodes can no longer pre-merge part of them.

When the segment-level intermediate result cache is enabled on historical nodes, each historical node looks up the cache for every segment it scans, computes and populates the cache on a miss, and then merges the per-segment results locally before returning them to the Broker.

In practical scenarios, we found that when a segment's cached intermediate result is large, the overhead of serializing and deserializing it cannot be ignored.

2.3.3 Optimization scheme

The above analysis shows that both existing cache features have obvious limitations. To improve cache efficiency, we designed and implemented a new cache on the Broker that stores the merged intermediate result of the historical segments, which compensates well for the shortcomings of the two existing caches.

New cache property configuration
druid.broker.cache.useSegmentMergedResultCache=true
druid.broker.cache.populateSegmentMergedResultCache=true
Applicable scenario comparison
Cache scheme | Scenario 1: using the group by v2 engine | Scenario 2: scanning only historical segments | Scenario 3: scanning historical and real-time segments together | Scenario 4: efficiently caching results across a large number of segments
Segment-level cache | Not supported | Supported | Supported | Inefficient
Result cache | Not supported | Supported | Not supported | Supported
Segment merge intermediate result cache | Supported | Supported | Supported | Supported
Working principle
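As a rough illustration (the class and method names below are hypothetical, not the actual implementation), the Broker can key this cache by the query fingerprint plus the set of historical segments the query covers; the merged result of the historical segments is reused from the cache, and only the real-time part is recomputed and merged on each query.

import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Simplified sketch of a Broker-side segment-merged-result cache; all names are illustrative.
public class SegmentMergedResultCacheSketch {
    private final ConcurrentMap<String, Long> cache = new ConcurrentHashMap<>();

    // Returns the query result: cached merged historical part + freshly computed real-time part.
    public long query(String queryFingerprint, List<String> historicalSegmentIds, List<String> realtimeSegmentIds) {
        // The key changes whenever the historical segment set changes, so stale entries are never reused.
        String key = queryFingerprint + "|" + new TreeSet<>(historicalSegmentIds);
        long historicalMerged = cache.computeIfAbsent(key, k -> scanAndMerge(historicalSegmentIds));
        // Real-time segments are always scanned freshly and merged on top of the cached historical result.
        return historicalMerged + scanAndMerge(realtimeSegmentIds);
    }

    // Stand-in for scanning the given segments and merging their per-segment results (here, a simple count).
    private long scanAndMerge(List<String> segmentIds) {
        return segmentIds.size();
    }
}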

Benchmark results

The benchmark results show that the segment merge intermediate result cache adds no obvious overhead to the first (uncached) query, and its cache efficiency is significantly better than the other cache options.

Online performance

After enabling the new caching feature, the cluster's overall query latency was reduced by about 50%.

Community collaboration

We plan to contribute this new caching feature to the community; the PR is currently awaiting more community feedback.

3. Customized requirements development

3.1 Bitmap-based exact deduplication operator

3.1.1 Problem background

Many key businesses need to count exact order volumes and UVs. Druid ships with several deduplication operators, but they are based on approximate algorithms and show errors in practical use, so the related businesses asked us to provide an exact deduplication implementation.

3.1.2 Requirements analysis

Deduplication field type analysis

Analysis of the collected requirements showed that the order IDs and user IDs in the urgent requirements are all integers or long integers, which allowed us to consider skipping the dictionary encoding step.

3.1.3 Implementation plan

Since the Druid community lacks such an implementation, we built new Aggregators on top of the widely used Roaring Bitmap, with dedicated operators for integers and long integers, both supporting serialization and deserialization so they can be used in roll-up ingestion models. This allowed us to quickly release the first stable version of the feature, which handles small data volumes well.
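As a rough illustration of the core idea (not the actual Bitmap32ExactCount operator code), the sketch below uses the RoaringBitmap library in the same spirit: the build side adds raw integer IDs into a per-segment bitmap, the merge side ORs the per-segment bitmaps together, and the final cardinality is the exact distinct count.

import org.roaringbitmap.RoaringBitmap;
import org.roaringbitmap.longlong.Roaring64NavigableMap;

// Illustrative sketch of bitmap-based exact counting; not the actual operator implementation.
public class BitmapExactCountSketch {
    // 32-bit IDs: build one bitmap per segment, then OR them together and read the cardinality.
    public static long exactCount(int[][] idsPerSegment) {
        RoaringBitmap merged = new RoaringBitmap();
        for (int[] segmentIds : idsPerSegment) {
            merged.or(RoaringBitmap.bitmapOf(segmentIds)); // "build" per segment, "merge" across segments
        }
        return merged.getLongCardinality(); // exact distinct count
    }

    // 64-bit IDs: Roaring64NavigableMap plays the same role for long values.
    public static long exactCountLong(long[] ids) {
        Roaring64NavigableMap bitmap = new Roaring64NavigableMap();
        for (long id : ids) {
            bitmap.addLong(id);
        }
        return bitmap.getLongCardinality();
    }
}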

Operator API
// native JSON API
{
    "type": "Bitmap32ExactCountBuild or Bitmap32ExactCountMerge",
    "name": "exactCountMetric",
    "fieldName": "userId"
}
-- SQL support
SELECT "dim", Bitmap32_EXACT_COUNT("exactCountMetric") FROM "ds_name" WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' DAY GROUP BY key
Limitation analysis and optimization direction

However, this simple implementation exposes performance bottlenecks when facing large amounts of data.

Performance bottleneck caused by oversized intermediate result sets

The new operator's intermediate results take up a lot of memory, so writing them to and reading them from the cache carries obvious overhead; moreover, this kind of operator is mainly used in group by queries, for which the existing caches cannot play their proper role. This further motivated the design and development of the new segment merge intermediate result cache described in the previous section.

By effectively caching the merged intermediate results of historical segments, the serialization and deserialization overhead caused by oversized segment-level intermediate results is greatly reduced. In the future, we will also consider re-encoding the IDs to make the value distribution less scattered and improve the bitmap compression ratio for integer sequences.

The difficulty of memory estimation

Since the Druid query engine mainly processes intermediate computation results in off-heap memory buffers to reduce GC impact, an operator's internal data structures need to support fairly accurate memory estimation. However, operators based on Roaring bitmaps are hard to estimate, and their object instances can only be constructed in heap memory during computation. This makes the memory overhead of such operators hard to control during queries, and extreme queries may even cause OOM.

For this kind of problem, in the short term we mainly mitigate it on the upstream data processing side, for example through re-encoding and reasonable partitioning and sharding.

3.2 Flexible sliding window function

3.2.1 Problem background

The Druid core query engine only supports aggregation over a fixed window and lacks support for flexible sliding windows. Some key business parties want to compute, for each day, the UV over the trailing 7 days, which requires Druid to support sliding window aggregation.

3.2.2 Requirements analysis

Limitations of the community's Moving Average Query extension

Our investigation found that the existing extension Moving Average Query supports sliding-window computation for some basic types, but lacks support for Druid's native operators of other complex (object) types, such as the widely used HLL-based approximate operators. The extension also lacks SQL support.

3.2.3 Implementation plan

By studying the source code, we found that this extension could be made more general and concise. We added an averager implementation of type default, which performs sliding-window aggregation on a field according to that field's type. In other words, through this default averager, all of Druid's native Aggregators can participate in sliding-window aggregation.

At the same time, we added SQL function support for this general operator.

Operator API
// native JSON API
{
    "aggregations": [
        {
            "type": "hyperUnique",
            "name": "deltaDayUniqueUsers",
            "fieldName": "uniq_user"
        }
    ],
    "averagers": [
        {
            "name": "trailing7DayUniqueUsers",
            "fieldName": "deltaDayUniqueUsers",
            "type": "default",
            "buckets": 7
        }
    ]
}
-- SQL support
SELECT TIME_FLOOR(__time, 'PT1H'), dim, MA_TRAILING_AGGREGATE_DEFAULT(DS_HLL(user), 7) FROM ds_name WHERE __time >= '2021-06-27T00:00:00.000Z' AND __time < '2021-06-28T00:00:00.000Z' GROUP BY 1, 2
Community collaboration

We plan to contribute this new feature to the community; the PR is currently awaiting more community feedback.

4. Future Architecture Evolution

To better address stability at the architectural level and to reduce costs while improving efficiency, we have started exploring and implementing a cloud-native deployment solution for Druid. We will share our practical experience in this area in the future, so stay tuned!
