Guide:

Distributed tracing is a key technology for making distributed applications observable. Dewu's distributed tracing platform (Trace2.0 for short) is a new-generation, one-stop full-link observation and diagnosis platform built on the observability standard provided by OpenTelemetry. By collecting traces in full, it helps the business improve the efficiency of fault diagnosis, performance optimization, and architecture governance.

Collecting full trace data (hundreds of terabytes and hundreds of billions of spans added every day) while keeping processing real-time, queries efficient, and costs low places very high requirements on the Trace2.0 backend as a whole. This article introduces in detail the architecture design, tail sampling, and hot/cold storage solutions behind Trace2.0, and how we achieved further cost reduction and efficiency improvement through self-built storage (storage cost reduced by 66%).

1. Overall Architecture Design

[Figure: overall module architecture of Trace2.0]

The overall module architecture of the full-link tracing platform Trace2.0, from the data access side through computation and storage to query, is shown in the figure above. The core capabilities of each component are as follows:

  • Client & Data Collection: integrate and customize the multi-language SDKs (Agents) provided by OpenTelemetry to generate observability data in a unified format.
  • Control Plane: a unified configuration center delivers dynamic configurations to each collector on the data collection side, taking effect in real time. It supports grayscale onboarding of applications by instance count, and provides a dynamic switch for parameter collection, a dynamic switch for performance profiling, dynamic configuration for traffic coloring, client version management, and more.
  • Data collection service OTel Server: the data collector OTel Server is compatible with the OpenTelemetry Protocol (OTLP) and receives the observability data reported by the collectors over either gRPC or HTTP.
  • Analysis, computation & storage OTel Storage: in addition to basic real-time retrieval, the computation side provides scenario-based data analysis and computation, including:

    • Store Trace data: the data is split into two parts. One is the index fields, including TraceID, ServiceName, SpanName, StatusCode, Duration, start/end time and other basic information used for advanced retrieval; the other is the detailed data (the source data, including all spans);
    • Calculate SpanMetrics data: aggregate metrics such as total call count, total duration, maximum duration, minimum duration, and quantiles across dimensions such as Service, SpanName, Host, StatusCode, Env, and Region;
    • Associate business order numbers with traces: in e-commerce scenarios, R&D engineers often use an order number, performance order number, or Huijin order number as the starting point for troubleshooting. We therefore agreed with business R&D on a special tracking rule: add the tag "bizOrderId={actual order number}" to the span, and use this tag as an index field in ClickHouse. This links the business identifier to the full-link trace and forms a complete troubleshooting chain (see the sketch after this list);
    • Redis hotspot statistics: the client side records the input and output parameters of Redis calls in span tags, so that metrics such as Redis hit rate, large keys, high-frequency writes, and slow calls can be derived;
    • MySQL hotspot statistics: count the number of calls, the number of slow SQL executions, and the associated interface names based on SQL fingerprints.
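
As an illustration of the order-number tracking rule above, the snippet below is a minimal sketch using the OpenTelemetry Java API; only the tag key "bizOrderId" comes from the article, while the class and method names are illustrative.

import io.opentelemetry.api.trace.Span;

public final class OrderTraceTagging {
    /** Called by business code (or an instrumentation hook) once the order number is known. */
    static void tagCurrentSpan(String orderNumber) {
        // The tag is later extracted by OTel Storage and written as an index field in ClickHouse.
        Span.current().setAttribute("bizOrderId", orderNumber);
    }
}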

2. Tail Sampling & Hot and Cold Storage

Dewu's earlier full-link tracing solution set the client sampling rate to 1% out of storage-cost considerations. As a result, during troubleshooting, R&D often could not find the exact trace they wanted to see. To solve this problem, Trace2.0 could not simply raise the client sampling rate to 100%; it had to collect the full amount of trace data on the client while keeping trace storage costs under reasonable control. Practical experience shows that the value of trace data is unevenly distributed and decays rapidly over time.

Storing all trace data would not only waste enormous cost, but also significantly affect the performance and stability of the entire data processing pipeline. If we keep only the traces that are valuable and likely to actually be queried by users, we can strike a balance between cost and benefit. So what makes a trace valuable? Based on our day-to-day troubleshooting experience, we found that business R&D mainly cares about the following four types of high-priority scenarios:

  • An ERROR occurred somewhere on the call chain;
  • The call chain contains a database call that took longer than 200ms;
  • The entire call chain took longer than 1s;
  • Call chains of specific business scenarios, such as those associated with an order number.
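
Expressed as code, these rules boil down to a simple per-span predicate evaluated on the server side. The sketch below uses a hypothetical SpanRecord model of a decoded span; the production rules are configuration-driven rather than hard-coded, and the thresholds simply restate the four scenarios above.

public final class TailSamplingRules {

    /** Hypothetical internal view of a decoded span; not the real data model. */
    record SpanRecord(boolean error, boolean isDbCall, boolean isRoot,
                      long durationMillis, String bizOrderId) {}

    static boolean shouldKeep(SpanRecord span) {
        if (span.error()) return true;                                    // exception on the call chain
        if (span.isDbCall() && span.durationMillis() > 200) return true;  // slow database call (> 200ms)
        if (span.isRoot() && span.durationMillis() > 1000) return true;   // whole chain slower than 1s
        return span.bizOrderId() != null;                                 // business scenario (order number)
    }
}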

Against this background, and drawing on industry practice, we designed a tail sampling & hot/cold tiered storage scheme while implementing Trace2.0. The scheme is as follows:

  • Trace data within the last 3 days is stored in full and is defined as hot data.
  • Data selected by tail sampling based on Kafka delayed consumption + Bloom filter (error traces, slow traces, custom sampling rules, plus a default 0.1% regular sample) is retained for 30 days and is defined as cold data.
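
The "Kafka delayed consumption" part can be sketched with the plain Kafka consumer API: a record is processed only once it is older than the configured delay; otherwise the partition is rewound so the record is re-read on a later poll. The topic name, back-off handling, and processing hook below are illustrative, not the production implementation.

import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public final class DelayedSpanConsumer {
    private static final long DELAY_MS = 30 * 60 * 1000L;   // 30-minute delay, as in the article

    static void run(KafkaConsumer<String, byte[]> consumer) throws InterruptedException {
        consumer.subscribe(Collections.singletonList("otel-spans"));   // illustrative topic name
        while (true) {
            ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
            for (TopicPartition tp : records.partitions()) {
                for (ConsumerRecord<String, byte[]> record : records.records(tp)) {
                    if (System.currentTimeMillis() - record.timestamp() < DELAY_MS) {
                        // Too fresh: rewind this partition so the record is re-read later.
                        consumer.seek(tp, record.offset());
                        break;
                    }
                    processIfSampled(record);
                }
            }
            Thread.sleep(1000);   // crude back-off; offset management is omitted in this sketch
        }
    }

    private static void processIfSampled(ConsumerRecord<String, byte[]> record) {
        // Hypothetical hook: keep the span only if its TraceID may be in the merged Bloom filter,
        // then persist it to the ClickHouse cold cluster.
    }
}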

[Figure: tail sampling & hot/cold tiered storage processing flow]

The overall processing flow is as follows:

  • OTel Server data collection & sampling rules: write the full amount of trace data reported by the client collectors into Kafka in real time, and record the TraceIDs of spans that satisfy the sampling rules (the scenarios defined above) into a Bloom filter;
  • OTel Storage persists hot data: consume the data in Kafka in real time and persist all of it to the ClickHouse hot cluster;
  • OTel Storage persists cold data: subscribe to the Bloom filters from the upstream OTel Server, consume the data in Kafka with a delay, and persist the spans whose TraceID may exist in the Bloom filter to the ClickHouse cold cluster; the delay is configured to 30 minutes so that all spans belonging to one trace are kept as complete as possible.
  • TraceID check: Trace2.0 customizes the TraceID generation rule: the hexadecimal encoding of the current timestamp in seconds (occupying 8 bytes) is embedded as part of the TraceID. At query time, the timestamp is decoded from the TraceID to decide whether to query the hot cluster or the cold cluster.
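
A minimal sketch of this TraceID convention is shown below. It assumes the 8 hex characters encoding the epoch seconds sit at the head of a 32-character TraceID; that placement is an assumption for illustration, as the article only states that they are part of the ID.

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ThreadLocalRandom;

public final class TraceIdRouting {
    private static final Duration HOT_RETENTION = Duration.ofDays(3);   // hot data window

    /** Generate a 32-hex-char TraceID whose first 8 chars encode the epoch seconds. */
    static String newTraceId() {
        long seconds = Instant.now().getEpochSecond();
        long randomHigh = ThreadLocalRandom.current().nextLong();
        int randomLow = ThreadLocalRandom.current().nextInt();
        return String.format("%08x%016x%08x", seconds, randomHigh, randomLow);
    }

    /** Decode the timestamp and decide whether a query should hit the hot or the cold cluster. */
    static String clusterFor(String traceId) {
        long seconds = Long.parseLong(traceId.substring(0, 8), 16);
        boolean withinHotWindow = Instant.ofEpochSecond(seconds)
                .isAfter(Instant.now().minus(HOT_RETENTION));
        return withinHotWindow ? "clickhouse-hot" : "clickhouse-cold";   // illustrative cluster names
    }
}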

Next, we introduce the design details of the Bloom filter used in tail sampling, as shown in the following figure:

[Figure: Bloom filter design in tail sampling]

The overall processing flow is as follows:

  • The OTel Server writes the TraceID of every span that meets the sampling rules into the Bloom filter corresponding to the timestamp embedded in that TraceID;
  • Bloom filters are sharded at ten-minute granularity (memory consumption can be calculated and tuned from the actual data volume together with the Bloom filter's false-positive rate and expected number of elements). After its ten-minute window ends, each Bloom filter is serialized and written to ClickHouse;
  • The OTel Storage consumer side pulls the Bloom filter data (note: within the same time window, each OTel Server node generates its own Bloom filter) and merges them, which reduces the memory footprint of the Bloom filters and improves lookup efficiency.
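
The sketch below illustrates the ten-minute sharding, serialization, and merging, using Guava's BloomFilter as a stand-in for the production implementation; the window size, expected insertions, and false-positive rate are illustrative values, not the real configuration.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public final class SampledTraceIdFilter {
    private static final long WINDOW_SECONDS = 600;              // ten-minute shard
    private static final long EXPECTED_TRACE_IDS = 50_000_000L;  // per window, illustrative
    private static final double FPP = 0.001;                     // target false-positive rate

    // One Bloom filter per ten-minute window, keyed by the window start (epoch seconds).
    private final Map<Long, BloomFilter<CharSequence>> shards = new HashMap<>();

    /** OTel Server side: record a TraceID that matched a sampling rule. */
    void markSampled(String traceId, long traceEpochSeconds) {
        long window = traceEpochSeconds / WINDOW_SECONDS * WINDOW_SECONDS;
        shards.computeIfAbsent(window,
                w -> BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8),
                                        EXPECTED_TRACE_IDS, FPP))
              .put(traceId);
    }

    /** Serialize a finished window so it can be persisted (e.g. into ClickHouse). */
    byte[] serialize(long windowStart) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        shards.get(windowStart).writeTo(out);
        return out.toByteArray();
    }

    /** OTel Storage side: merge the per-node filters of one window before doing lookups. */
    static BloomFilter<CharSequence> merge(Iterable<byte[]> serializedFilters) throws IOException {
        BloomFilter<CharSequence> merged = null;
        for (byte[] bytes : serializedFilters) {
            BloomFilter<CharSequence> filter = BloomFilter.readFrom(
                    new ByteArrayInputStream(bytes), Funnels.stringFunnel(StandardCharsets.UTF_8));
            if (merged == null) merged = filter;
            else merged.putAll(filter);   // requires identical expectedInsertions/fpp settings
        }
        return merged;
    }
}

On the cold-data path, calling mightContain(traceId) on the merged filter then decides whether a span is persisted to the cold cluster.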

To sum up, Trace2.0 completes tail sampling and hot/cold tiered storage with relatively few resources. It not only saves cost for the company, but also retains almost all of the "valuable" traces, solving the problem of business R&D being unable to find the traces they want during routine troubleshooting.

3. Self-built storage & cost reduction and efficiency enhancement

3.1 Solutions based on SLS-Trace

In the initial stage of Trace2.0, we adopted the Trace solution [1] that SLS provides for OpenTelemetry, which offers functions such as trace query, call analysis, and topology analysis, as shown in the following figure:

[Figure: SLS-Trace solution]

The main processing flow of SLS-Trace is as follows:

  • Use the OpenTelemetry Collector's alibabacloudlogserviceexporter [2] to write trace data into the SLS-Trace Logstore;
  • SLS-Trace periodically aggregates the trace data through its built-in Scheduled SQL tasks and generates the corresponding span metrics, application- and interface-granularity topology metrics, and other data.

With the full rollout of Trace2.0 within the company, the storage cost pressure from SLS became more and more severe. In response to the company's call to "use technical means to reduce costs and improve efficiency", we decided to build our own storage.

3.2 Solutions based on ClickHouse

Most of the popular open-source full-link tracing projects in the industry (SkyWalking, Pinpoint, Jaeger, etc.) use storage based on ES or HBase. In recent years, emerging open-source tracing projects (Uptrace [3], SigNoz [4], etc.) mostly use storage based on ClickHouse, and the metric data derived from span data is also stored in ClickHouse. Moreover, ClickHouse's materialized views (which are very easy to use) solve the problem of downsampling metric data. After some research, we decided to build a new storage solution based on ClickHouse. The overall architecture is as follows:

[Figure: overall architecture of the ClickHouse-based storage solution]

The overall processing flow is as follows:

  • Trace index & detailed data: OTel Storage writes the index data built from the raw span data into the SpanIndex table, and writes the raw span details into the SpanData table (see Uptrace [5] for the related table design);
  • Calculate & persist SpanMetrics data: OTel Storage aggregates metrics such as total call count, total duration, maximum duration, minimum duration, and quantiles at 30-second granularity, grouped by the span's Service, SpanName, Host, StatusCode and other attributes, and writes them to the SpanMetrics table;

    • Metric DownSampling: ClickHouse materialized views aggregate "second-level" metrics into "minute-level" metrics, and "minute-level" metrics into "hour-level" metrics, yielding multi-precision metrics that serve queries over different time ranges;
 -- span_metrics_10m_mv
CREATE MATERIALIZED VIEW IF NOT EXISTS '{database}'.span_metrics_10m_mv_local
            on cluster '{cluster}'
            TO '{database}'.span_metrics_10m_local
AS
SELECT a.serviceName                     as serviceName,
       a.spanName                        as spanName,
       a.kind                            as kind,
       a.statusCode                      as statusCode,
       toStartOfTenMinutes(a.timeBucket) as timeBucket,
       sum(a.count)                      as count,
       sum(a.timeSum)                    as timeSum,
       max(a.timeMax)                    as timeMax,
       min(a.timeMin)                    as timeMin
FROM '{database}'.span_metrics_30s_local as a
GROUP BY a.serviceName, a.spanName, a.kind, a.statusCode,
    toStartOfTenMinutes(a.timeBucket);
  • Metadata (upstream/downstream topology data): OTel Storage writes the topology dependencies into the graph database Nebula Graph based on the upstream/downstream relationships carried in the span attributes (the relevant attributes need to be instrumented on the client side). A sketch of this step follows.
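
The sketch below shows how one observed "caller -> callee" dependency could be turned into an nGQL statement for Nebula Graph. The graph schema (a "calls" edge type with a "cnt" property) and the execute() helper are hypothetical; the article does not describe the actual graph model or client wiring.

public final class TopologyWriter {

    /** Build an nGQL statement for one caller -> callee dependency observed on a span. */
    static String buildEdgeStatement(String callerService, String calleeService, long callCount) {
        // Hypothetical schema: an edge type "calls" with a single "cnt" property.
        return String.format("INSERT EDGE calls(cnt) VALUES \"%s\" -> \"%s\":(%d);",
                callerService, calleeService, callCount);
    }

    static void writeDependency(String callerService, String calleeService, long callCount) {
        execute(buildEdgeStatement(callerService, calleeService, callCount));
    }

    private static void execute(String ngql) {
        // Hypothetical helper: in practice this would go through a Nebula Graph client session.
    }
}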

ClickHouse Write Details

ClickHouse implements distributed tables with the Distributed engine, which creates a view over all shards (local tables) and supports distributed queries. A Distributed table itself does not store any data; it reads from or writes to the tables on remote nodes. The SpanData table creation statement looks like this:

 -- span_data
CREATE TABLE IF NOT EXISTS '{database}'.span_data_local ON CLUSTER '{cluster}'
(
    traceID                   FixedString(32),
    spanID                    FixedString(16),
    startTime                 DateTime64(6) CODEC (Delta, Default),
    body                      String CODEC (ZSTD(3))
) ENGINE = MergeTree
ORDER BY (traceID,startTime,spanID)
PARTITION BY toStartOfTenMinutes(startTime)
TTL toDate(startTime) + INTERVAL '{TTL}' HOUR;

-- span_data_distributed
CREATE TABLE IF NOT EXISTS '{database}'.span_data_all ON CLUSTER '{cluster}'
as '{database}'.span_data_local
    ENGINE = Distributed('{cluster}', '{database}', span_data_local,
                         xxHash64(concat(traceID,spanID,toString(toDateTime(startTime,6)))));

The overall write process is relatively simple (note: avoid writing through the Distributed table), as follows:

  • Periodically fetch the list of ClickHouse cluster nodes;
  • Select the target ClickHouse node with a hash function, then write to that node's local table in batches (a minimal sketch follows).
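
The sketch below illustrates this direct local-table write path with plain JDBC; node discovery, the actual hash function, connection pooling, and retries are omitted, and the JDBC URL, database name, and SpanRow holder are illustrative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.LocalDateTime;
import java.util.List;

public final class LocalTableWriter {

    /** Hypothetical row holder mirroring the span_data_local columns. */
    record SpanRow(String traceId, String spanId, LocalDateTime startTime, String body) {}

    /** Pick a node for a batch; the production code uses its own hash over the row key. */
    static String pickNode(List<String> nodes, String shardKey) {
        return nodes.get(Math.floorMod(shardKey.hashCode(), nodes.size()));
    }

    /** Batch-insert span rows into the chosen node's local table (not the Distributed table). */
    static void writeBatch(String node, List<SpanRow> rows) throws Exception {
        String url = "jdbc:clickhouse://" + node + ":8123/otel";   // illustrative database name
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO span_data_local (traceID, spanID, startTime, body) VALUES (?, ?, ?, ?)")) {
            for (SpanRow row : rows) {
                ps.setString(1, row.traceId());
                ps.setString(2, row.spanId());
                ps.setObject(3, row.startTime());
                ps.setString(4, row.body());
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}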

[Figure: batched write to ClickHouse local tables]

Online results

Full-link tracing is a typical write-heavy, read-light scenario, so we compress the data with ClickHouse's ZSTD codec; the resulting compression ratio is as high as 12, which is a very good result. Currently, the ClickHouse hot and cold clusters each use dozens of 16C64G ESSD machines, with a single-machine write rate of 250,000 rows/s (rows written into ClickHouse). Compared with the initial Alibaba Cloud SLS-Trace solution, storage cost has dropped by 66%, and query latency has also dropped from 800+ms to 490+ms.

Next steps

At present, Trace2.0 also stores the raw span details in ClickHouse, which makes ClickHouse's disk usage somewhat high. In the future, we will consider writing the span details to block/object storage such as HDFS/OSS and having ClickHouse record only each span's offset in that storage, thereby further reducing ClickHouse's storage cost.

About us:
The Dewu monitoring team provides a one-stop observability platform covering link tracing, the time-series database, and the log system, including custom dashboards, application dashboards, business monitoring, intelligent alerting, AIOps, and other troubleshooting and analysis capabilities.

Students interested in observability, monitoring, alerting, AIOps, and related fields are welcome to join us.

References
[1] SLS-Trace solution: https://developer.aliyun.com/article/785854
[2] alibabacloudlogserviceexporter: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/alibabacloudlogserviceexporter
[3] Uptrace: https://uptrace.dev/
[4] SigNoz: https://signoz.io/
[5] Uptrace schema design: https://github.com/uptrace/uptrace/tree/v0.2.16/pkg/bunapp/migrations

This article is the first in the "Dewu Cloud Native Full-Link Tracing Trace2.0" series; for more, follow the "Dewu Technology" public account. The series includes:
  • Dewu Cloud Native Full-Link Tracing Trace2.0 Architecture Practice
  • Dewu Cloud Native Full-Link Tracing Trace2.0 Product
  • Dewu Cloud Native Full-Link Tracing Trace2.0 Collection
  • Dewu Cloud Native Full-Link Tracing Trace2.0 Data Mining

Text / South Wind

