Authors: Zhai Dongbo, Ye Shujun
Under its microservice architecture, Sohu Smart Media uses Zipkin to collect service link tracing (Tracing) data from instrumented services and stores the collected Trace information in StarRocks. With StarRocks' powerful SQL computing capabilities, the Trace information can be analyzed statistically across multiple dimensions, which lifts microservice monitoring from simple statistical Monitoring to multi-dimensional exploratory analysis, i.e. Observability.
The article is divided into three parts. The first part introduces the monitoring methods commonly used with microservices; among them, link tracing can string together the entire service call chain and obtain key information about the overall service, which makes it especially important for monitoring microservices. The second part introduces how Sohu Smart Media built its link tracing analysis system, including data collection with Zipkin, data storage in StarRocks, and analysis and calculation in StarRocks for specific application scenarios. The third part introduces the practical results Sohu Smart Media has obtained by adopting Zipkin and StarRocks for link tracing analysis.
01 Link Tracking in Microservice Architecture
In recent years, enterprise IT application architecture has gradually evolved toward distributed architectures such as microservices and cloud native. Inside Sohu Smart Media, application services are developed, operated and maintained following the architectural ideas and technical solutions of microservices, Docker, Kubernetes, and Spring Cloud, which has improved the overall engineering efficiency of the department.
While the microservice architecture improves engineering efficiency, it also brings new problems. Microservices form a distributed architecture that divides service units by business. A user request is no longer completed by a single service on its own, but is fulfilled by multiple services working together. These services may be implemented by different teams in different programming languages, and may be deployed on different servers or even in different data centers. When a user request runs into errors or exceptions, the distributed nature of the calls makes the faults hard to locate, so compared with the traditional monolithic architecture, monitoring faces new challenges.
Logging, Metrics, Tracing
There are many ways to monitor microservices. By the type of data being collected, they mainly fall into three areas: Logging, Metrics and Tracing:
Logging
Discrete events recorded explicitly by developers. The recorded information is generally unstructured text, which provides detailed clues when analyzing and diagnosing problems.
Metrics
Aggregatable measurements collected over time, designed to show the running status of an indicator over a certain period, so that specific indicators and trends can be observed.
Tracing
Records the entire life cycle of a request call, including the services invoked, processing times and other information, together with the request context. Each Trace is identified by a globally unique Trace ID that strings the whole call chain together, which makes it very suitable for monitoring a microservice architecture.
Figure 1
The relationship between the three is shown in the figure above, and they overlap. For example, Logging can aggregate related fields to generate Metrics and correlate related fields to generate Tracing; Tracing can aggregate call counts and durations to generate Metrics and record business logs to generate Logging. In general, it is hard to add fields to Metrics or Logging that string together the life cycle of a microservice request, while it is relatively easy to derive Metrics and Logging from Tracing.
In addition, the three have different storage requirements. Metrics is naturally compressed data and consumes the least resources; Logging tends to grow without bound and may even exceed the expected capacity; Tracing's storage footprint generally falls between Metrics and Logging, and can be further controlled through the sampling rate.
From Monitoring to Observability
Monitoring tells you whether the system works. Observability lets you ask why it's not working.
– Baron Schwartz
Microservice monitoring can be simply divided into Monitoring and Observability from the data analysis level.
Monitoring
Tells you whether the system is working: pre-defined calculations for known scenarios, based on assumptions made in advance about possible monitoring problems. This corresponds to the Known Knowns and Known Unknowns in the figure above, i.e. events that can be assumed in advance, whether or not they are already understood.
Observability
Lets you ask why the system is not working: exploratory analysis of unknown scenarios, performed after the fact on any monitoring problem. This corresponds to the Unknown Knowns and Unknown Unknowns in the figure above, i.e. events that occur without being anticipated, whether or not they are understood.
Obviously, monitoring that relies on assuming all possible events in advance can no longer cope with the complex monitoring scenarios of microservices. We need an Observability-style monitoring method that supports exploratory analysis. Among Logging, Metrics and Tracing, Tracing is currently the most effective way to provide multi-dimensional monitoring and analysis capabilities.
Tracing
Link tracing (Tracing Analysis) provides distributed application developers with complete call-link reconstruction, request volume statistics, link topology, application dependency analysis and other capabilities. It helps developers quickly analyze and diagnose performance bottlenecks under a distributed application architecture and improves development and diagnosis efficiency in the microservice era.
Tracing can connect the call links of distributed requests across microservices and plays an important role in the microservice monitoring system. In addition, Tracing sits between Metrics and Logging: it can not only do the work of Monitoring but also support Observability analysis, which improves the efficiency of building the monitoring system.
System Model
A link tracing (Tracing) system needs to record the upstream and downstream service call links that a specific request passes through, as well as the work completed by each service.
In the microservice system shown in the figure below, a user initiates a request to Service A. Service A generates a globally unique Trace ID and uses Messaging calls internally to invoke the related processing modules (for example, cross-thread asynchronous calls); Service A then calls Service B and Service C in parallel via RPC. Service B returns its response immediately, while Service C works serially: it first calls Service D via RPC, then calls Service E via RPC, and only then responds to Service A's request. After both of its internal module calls have completed, Service A responds to the original user request.
The initially generated Trace ID is passed along through this series of intra-service and inter-service calls, stringing the request calls together. The Tracing system also records the Timestamp, service name and other related information of each request call.
Figure 3 (Note: serial calls inside a service hurt system performance, so the parallel call method is generally used; subsequent sections only consider the parallel-call scenario.)
A Tracing system involves two basic concepts: Trace and Span. The figure below shows a Trace composed of Spans.
Figure 4
A Trace refers to the call links of all services that an external request passes through. It can be understood as a tree structure composed of service calls, and each Trace is identified by a globally unique Trace ID.
A Span refers to one call within a service or between services, i.e. a node in the Trace tree. As shown in the figure above, the Spans of a Trace form a tree with parent-child relationships between the Span nodes. A Span mainly contains the Span name, Span ID and parent ID, as well as Timestamp, Duration (which includes the processing time of its child calls), business data and other log information.
Span can be divided into RPC Span and Messaging Span according to the calling method:
RPC Span
Generated by RPC Tracing and divided into Client and Server Spans, recorded respectively by the client node and the server node of an RPC call. The two share the Span ID, Parent Span ID and other information, but note that there is a time deviation between the two records. This deviation is the call overhead between the services, generally caused by network transmission, proxy services, or queuing at the service interface (a SQL sketch after this list illustrates the deviation).
Messaging Span
Generated by Messaging Tracing and generally used for calls inside a service. Unlike RPC Spans, Messaging Spans do not share Span IDs or similar information.
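To make the time deviation concrete, here is a minimal StarRocks SQL sketch; it assumes the zipkin table defined later in this article, and the dt value is only a placeholder. It joins the CLIENT and SERVER Spans of the same RPC call on traceId and id, so the average duration difference approximates the network and queuing overhead between each pair of services.
-- Sketch: assumes the zipkin table from section 02; dt is a placeholder partition
select
    client,
    server,
    AVG(deviation) / 1000 as avg_deviation_ms
from
    (
        select
            c.localEndpoint_serviceName as client,
            s.localEndpoint_serviceName as server,
            (c.duration - s.duration) as deviation
        from
            (select * from zipkin where dt = 20220105 and kind = 'CLIENT') c
            join (select * from zipkin where dt = 20220105 and kind = 'SERVER') s
            on c.traceId = s.traceId and c.id = s.id
    ) tmp
group by
    client,
    server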
Application scenarios
Based on the Tracing system model, various metrics such as service response time can be obtained and used for Alerting, DashBoard queries, etc. The links composed of Spans can also be analyzed for a single service or for the system as a whole, to find performance bottlenecks, network transmission overhead, problems in the design of internal asynchronous calls, and so on. As shown in the figure below, compared with Metrics and Logging, Tracing can cover both the Monitoring and the Observability scenarios of monitoring and occupies an important position in the monitoring system. Specifications and organizations such as OpenTracing, OpenCensus and OpenTelemetry all include support for Tracing.
Figure 5
From the perspective of microservices, the Span information recorded by Tracing can be statistically analyzed in many dimensions. The figure below is an example of a microservice system built on HTTP APIs: a user queries the /1/api interface of Service1; Service1 then requests /2/api of Service2; Service2 internally makes the asynchronous calls msg2.1 and msg2.2; msg2.1 requests Service3's /3/api interface and msg2.2 requests Service4's /4/api interface; Service3 internally calls msg3; and Service4 requests Service5's /5/api. Service5 is not instrumented for tracing, so no information can be collected for it.
Figure 6
For the microservice system shown in the figure above, the following two types of statistical analysis operations can be performed:
Intra-Service Analysis
This focuses on the operation of a single service, such as the performance metrics of its externally exposed interfaces and of the upstream interfaces it queries. The main analysis scenarios include:
1. Upstream service request
For example, /1/api provided by Service1 or /4/api provided by Service4: count metrics such as the number of requests, QPS, latency percentiles, error rate, and timeout rate.
2. Downstream service response
For example, /2/api requested by Service1 or /5/api requested by Service4: count metrics such as the number of queries, QPS, latency percentiles, error rate, and timeout rate.
3. Internal processing of services
An interface exposed by a service may be split into multiple Spans internally. Grouping and aggregating by Span name makes it possible to find the most time-consuming Spans. For example, for Service2's interface /2/api, the internal Spans of the interface include the /2/api Server Span and the Spans corresponding to msg2.1 and msg2.2; the Duration of each of these Spans can be derived from the dependencies between them, and various statistical analyses can then be performed.
Inter-Service Analysis
In the overall analysis of microservices, each service is treated as a black box, and the focus is on dependencies between services, hotspots on the call links, etc. The main analysis scenarios include:
1. Service topology statistics
Based on the Client Spans and Server Spans of inter-service calls, the topology of the entire service system can be obtained, together with statistics such as the number of calls and the Durations between services.
2. Call link performance bottleneck analysis
Analyze the performance bottleneck on the call link of an external request interface. The bottleneck may come from the internal processing overhead of a service, or from the network call overhead between two services.
For a complex request that involves dozens of microservices in a single call, the performance bottleneck may differ from request to request. In that case the results need to be aggregated to rank how often each bottleneck occurs, in order to identify the hotspot services or inter-service calls.
The above are only some of the possible analysis scenarios. The information provided by Tracing can actually support many more metric statistics and exploratory analyses, which this article will not list one by one.
02 Build a link tracking analysis system based on Zipkin and StarRocks
A link tracing system is mainly divided into three parts: data collection, data storage, and analysis and calculation. The most widely used open-source link tracing system is Zipkin, which mainly covers data collection and analysis and calculation, while the underlying storage relies on other storage systems. When Sohu Smart Media built its link tracing system, it initially used Zipkin + ElasticSearch, and later added StarRocks as the underlying storage system and performed analysis and statistics on StarRocks. The overall architecture of the system is shown in the figure below.
Figure 7
Data Collection
Zipkin supports fully automatic instrumentation on the client side. By simply introducing the relevant libraries into the application and adding a small amount of configuration, Span information can be generated automatically and uploaded via HTTP or Kafka. Zipkin currently provides instrumentation libraries for most languages; for example, Spring Cloud in the Java ecosystem provides deep integration between Sleuth and Zipkin, which is essentially transparent to developers. To keep storage under control, a sampling rate of about 1/100 is generally configured; the Dapper paper notes that even a 1/1000 sampling rate provides enough information for common uses of tracing data.
Data Model
Corresponding to Figure 6, the following is a schematic diagram of Zipkin Span collection (Figure 8). The specific process is as follows:
Figure 8
- The Request sent by the user to Service1 contains no Trace or Span information. Service1 creates a Server Span, randomly generating a globally unique TraceID (X in the figure) and SpanID (A in the figure; here X and A take the same value), and records the Timestamp and other information. When Service1 returns the Response to the user, it computes the Server Span's processing Duration and reports the complete Server Span, including the TraceID, SpanID, Timestamp, Duration and other information.
- For the request it sends to Service2, Service1 creates a Client Span, using X as the Trace ID and randomly generating a globally unique SpanID (B in the figure), and records the Timestamp and other information. Service1 passes the Trace ID (X) and SpanID (B) to Service2 (for example, by adding TraceID, SpanID and related fields to the HTTP headers). After Service1 receives Service2's response, it processes the Client Span information and reports this Client Span.
- Service2 receives the Request from Service1, which contains the Trace (X) and Span (B) information. Service2 creates a Server Span with X as the Trace ID and B as the SpanID, internally calls msg2.1 and msg2.2, and passes the Trace ID (X) and SpanID (B) to them. After msg2.1 and msg2.2 return, Service2 processes the Server Span information and reports this Server Span.
- msg2.1 and msg2.2 of Service2 each create a Messaging Span, using X as the Trace ID and randomly generating globally unique SpanIDs (C and F in the figure), record the Timestamp and other information, and send requests to Service3 and Service4 respectively. After msg2.1 and msg2.2 receive their responses, each processes its Messaging Span information and the two Messaging Spans are reported.
- For the requests it sends to Service3 and Service4, Service2 creates a Client Span for each, using X as the Trace ID and randomly generating globally unique SpanIDs (D and G in the figure), and records the Timestamp and other information. Service2 passes the Trace ID (X) and SpanID (D or G) to Service3 and Service4 respectively. After Service2 receives the responses from Service3 and Service4, it processes the Client Span information and reports the two Client Spans.
- Service3 receives the Request from Service2, which contains the Trace (X) and Span (D) information. Service3 creates a Server Span with X as the Trace ID and D as the SpanID, and internally calls msg3. After msg3 returns, Service3 processes the Server Span information and reports this Server Span.
- msg3 of Service3 creates a Messaging Span, using X as the Trace ID and randomly generating a globally unique SpanID (E in the figure), and records the Timestamp and other information. After the call completes, the Messaging Span information is processed and the Messaging Span is reported.
- Service4 receives the Request from Service2, which contains the Trace (X) and Span (G) information. Service4 creates a Server Span with X as the Trace ID and G as the SpanID, and then sends a request to Service5. After receiving Service5's response, Service4 processes the Server Span information and reports this Server Span.
- For the request it sends to Service5, Service4 creates a Client Span, using X as the Trace ID and randomly generating a globally unique SpanID (H in the figure), and records the Timestamp and other information. Service4 passes the Trace ID (X) and SpanID (H) to Service5. After Service4 receives Service5's response, it processes the Client Span information and reports this Client Span.
The Span records generated by the entire Trace X call link are shown in the figure below. Each Span mainly records the Span ID, Parent ID and Kind (CLIENT stands for an RPC client-side Span, SERVER for an RPC server-side Span, and NULL for a Messaging Span), plus SN (Service Name), and also includes the Trace ID, Timestamp, Duration and other information. Service5 has no Zipkin instrumentation, so there are no Span records for Service5.
Figure 9
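Once such Span records are loaded into StarRocks (the zipkin table described in the next section), all of the Spans of one Trace can be pulled back with a simple query and the call tree reconstructed from the id/parentId pairs. The following is a minimal sketch; the traceId value reuses the one from the JSON example in the Data Format section, and dt is a placeholder partition.
-- Sketch: fetch all Spans of one Trace from the zipkin table (defined in section 02)
select
    id,
    parentId,
    kind,
    localEndpoint_serviceName,
    name,
    `timestamp`,
    duration
from
    zipkin
where
    dt = 20220105
    and traceId = '3112dd04c3112036'
order by
    `timestamp`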
Data Format
Once an application service is instrumented with Zipkin, Span information is reported to Kafka in JSON format by default. The reported data has the following characteristics:
- Each application service reports a set of Spans at a time, packed into one JSON array
- A JSON array may contain Spans of different Traces, i.e. the Trace IDs are not necessarily the same
- Interfaces of different kinds (such as HTTP, gRPC, Dubbo) share the same main fields but record some different fields in tags
[
{
"traceId": "3112dd04c3112036",
"id": "3112dd04c3112036",
"kind": "SERVER",
"name": "get /2/api",
"timestamp": 1618480662355011,
"duration": 12769,
"localEndpoint": {
"serviceName": "SERVICE2",
"ipv4": "172.24.132.32"
},
"remoteEndpoint": {
"ipv4": "111.25.140.166",
"port": 50214
},
"tags": {
"http.method": "GET",
"http.path": "/2/api",
"mvc.controller.class": "Controller",
"mvc.controller.method": "get2Api"
}
},
{
"traceId": "3112dd04c3112036",
"parentId": "3112dd04c3112036",
"id": "b4bd9859c690160a",
"name": "msg2.1",
"timestamp": 1618480662357211,
"duration": 11069,
"localEndpoint": {
"serviceName": "SERVICE2"
},
"tags": {
"class": "MSG",
"method": "msg2.1"
}
},
{
"traceId": "3112dd04c3112036",
"parentId": "3112dd04c3112036",
"id": "c31d9859c69a2b21",
"name": "msg2.2",
"timestamp": 1618480662357201,
"duration": 10768,
"localEndpoint": {
"serviceName": "SERVICE2"
},
"tags": {
"class": "MSG",
"method": "msg2.2"
}
},
{
"traceId": "3112dd04c3112036",
"parentId": "b4bd9859c690160a",
"id": "f1659c981c0f4744",
"kind": "CLIENT",
"name": "get /3/api",
"timestamp": 1618480662358201,
"duration": 9206,
"localEndpoint": {
"serviceName": "SERVICE2",
"ipv4": "172.24.132.32"
},
"tags": {
"http.method": "GET",
"http.path": "/3/api"
}
},
{
"traceId": "3112dd04c3112036",
"parentId": "c31d9859c69a2b21",
"id": "73cd1cab1d72a971",
"kind": "CLIENT",
"name": "get /4/api",
"timestamp": 1618480662358211,
"duration": 9349,
"localEndpoint": {
"serviceName": "SERVICE2",
"ipv4": "172.24.132.32"
},
"tags": {
"http.method": "GET",
"http.path": "/4/api"
}
}
]
Figure 10
Data Storage
Zipkin supports three data stores: MySQL, Cassandra and ElasticSearch, each with its own shortcomings:
- MySQL: the collected tracing data easily reaches hundreds of millions or even more than ten billion rows per day, which MySQL cannot support.
- Cassandra: it can support Span analysis of a single Trace, but it does not support aggregation queries and other statistical analysis scenarios.
- ElasticSearch: it can support single-Trace analysis and simple aggregation queries, but not more complex analytical computation, such as calculations involving Joins, window functions, etc. In particular, for the computation of dependencies between services, Zipkin currently cannot do it in real time and has to run offline Spark jobs to calculate the dependency links.
In practice, we also used ElasticSearch first and ran into the problems above; for example, Zipkin's service dependency topology had to be computed offline. So we added StarRocks as the underlying data store. Importing Zipkin's trace data into StarRocks is very convenient and basically takes only two steps: CREATE TABLE + CREATE ROUTINE LOAD.
In addition, in the call-link performance bottleneck analysis scenario, each service should be treated as a black box: only the RPC Spans matter, and the Messaging Spans inside a service should be masked out. We use Flink to trace back the parentId of each service-internal Span until it reaches the RPC Server Span of the same service within the same Trace, replace the parentId of the RPC Client Span with the ID of that RPC Server Span, and finally write the converted data into StarRocks in real time through flink-connector-starrocks.
The data storage architecture and data flow based on StarRocks are shown in the figure below.
Figure 11
CREATE TABLE
Examples of the table creation statements are given below, with the following points to note:
- There are two tables, zipkin and zipkin_trace_perf. The zipkin_trace_perf table is only used for the call-link performance bottleneck analysis scenario; all other statistical analysis uses the zipkin table
- The dt, hr and min time fields are generated from the Timestamp field in the collected data, which is convenient for subsequent statistical analysis
- The DUPLICATE model, Bitmap indexes and other settings are used to speed up queries
- The zipkin table uses id as the bucketing column, so that the service topology query plan can be optimized into a Colocate Join to improve query performance (see the EXPLAIN sketch after this list)
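As a hedged illustration of the last point, EXPLAIN can be used to check whether a query that self-joins the zipkin table on the bucketing column id is actually planned as a Colocate Join. The query shape mirrors the service topology SQL shown later; the exact plan text may differ between StarRocks versions.
-- Sketch: the equi-join columns include the bucketing column id, so the join node
-- in the plan output is expected to be marked as a colocate join
EXPLAIN
select
    c.localEndpoint_serviceName as client,
    s.localEndpoint_serviceName as server
from
    (select * from zipkin where dt = 20220105 and kind = 'CLIENT') c
    join (select * from zipkin where dt = 20220105 and kind = 'SERVER') s
    on c.id = s.id and c.traceId = s.traceId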
Zipkin
CREATE TABLE `zipkin` (
`traceId` varchar(24) NULL COMMENT "",
`id` varchar(24) NULL COMMENT "Span ID",
`localEndpoint_serviceName` varchar(512) NULL COMMENT "",
`dt` int(11) NULL COMMENT "",
`parentId` varchar(24) NULL COMMENT "",
`timestamp` bigint(20) NULL COMMENT "",
`hr` int(11) NULL COMMENT "",
`min` bigint(20) NULL COMMENT "",
`kind` varchar(16) NULL COMMENT "",
`duration` int(11) NULL COMMENT "",
`name` varchar(300) NULL COMMENT "",
`localEndpoint_ipv4` varchar(16) NULL COMMENT "",
`remoteEndpoint_ipv4` varchar(16) NULL COMMENT "",
`remoteEndpoint_port` varchar(16) NULL COMMENT "",
`shared` int(11) NULL COMMENT "",
`tag_error` int(11) NULL DEFAULT "0" COMMENT "",
`error_msg` varchar(1024) NULL COMMENT "",
`tags_http_path` varchar(2048) NULL COMMENT "",
`tags_http_method` varchar(1024) NULL COMMENT "",
`tags_controller_class` varchar(100) NULL COMMENT "",
`tags_controller_method` varchar(1024) NULL COMMENT "",
INDEX service_name_idx (`localEndpoint_serviceName`) USING BITMAP COMMENT ''
) ENGINE=OLAP
DUPLICATE KEY(`traceId`, `parentId`, `id`, `timestamp`, `localEndpoint_serviceName`, `dt`)
COMMENT "OLAP"
PARTITION BY RANGE(`dt`)
(PARTITION p20220104 VALUES [("20220104"), ("20220105")),
PARTITION p20220105 VALUES [("20220105"), ("20220106")))
DISTRIBUTED BY HASH(`id`) BUCKETS 100
PROPERTIES (
"replication_num" = "3",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.time_zone" = "Asia/Shanghai",
"dynamic_partition.start" = "-30",
"dynamic_partition.end" = "2",
"dynamic_partition.prefix" = "p",
"dynamic_partition.buckets" = "100",
"in_memory" = "false",
"storage_format" = "DEFAULT"
);
zipkin_trace_perf
CREATE TABLE `zipkin_trace_perf` (
`traceId` varchar(24) NULL COMMENT "",
`id` varchar(24) NULL COMMENT "",
`dt` int(11) NULL COMMENT "",
`parentId` varchar(24) NULL COMMENT "",
`localEndpoint_serviceName` varchar(512) NULL COMMENT "",
`timestamp` bigint(20) NULL COMMENT "",
`hr` int(11) NULL COMMENT "",
`min` bigint(20) NULL COMMENT "",
`kind` varchar(16) NULL COMMENT "",
`duration` int(11) NULL COMMENT "",
`name` varchar(300) NULL COMMENT "",
`tag_error` int(11) NULL DEFAULT "0" COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`traceId`, `id`, `dt`, `parentId`, `localEndpoint_serviceName`)
COMMENT "OLAP"
PARTITION BY RANGE(`dt`)
(PARTITION p20220104 VALUES [("20220104"), ("20220105")),
PARTITION p20220105 VALUES [("20220105"), ("20220106")))
DISTRIBUTED BY HASH(`traceId`) BUCKETS 32
PROPERTIES (
"replication_num" = "3",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.time_zone" = "Asia/Shanghai",
"dynamic_partition.start" = "-60",
"dynamic_partition.end" = "2",
"dynamic_partition.prefix" = "p",
"dynamic_partition.buckets" = "12",
"in_memory" = "false",
"storage_format" = "DEFAULT"
);
ROUTINE LOAD
An example of a ROUTINE LOAD creation statement is as follows:
CREATE ROUTINE LOAD zipkin_routine_load ON zipkin COLUMNS(
id,
kind,
localEndpoint_serviceName,
traceId,
`name`,
`timestamp`,
`duration`,
`localEndpoint_ipv4`,
`remoteEndpoint_ipv4`,
`remoteEndpoint_port`,
`shared`,
`parentId`,
`tags_http_path`,
`tags_http_method`,
`tags_controller_class`,
`tags_controller_method`,
tmp_tag_error,
tag_error = if(`tmp_tag_error` IS NULL, 0, 1),
error_msg = tmp_tag_error,
dt = from_unixtime(`timestamp` / 1000000, '%Y%m%d'),
hr = from_unixtime(`timestamp` / 1000000, '%H'),
`min` = from_unixtime(`timestamp` / 1000000, '%i')
) PROPERTIES (
"desired_concurrent_number" = "3",
"max_batch_interval" = "50",
"max_batch_rows" = "300000",
"max_batch_size" = "209715200",
"max_error_number" = "1000000",
"strict_mode" = "false",
"format" = "json",
"strip_outer_array" = "true",
"jsonpaths" = "[\"$.id\",\"$.kind\",\"$.localEndpoint.serviceName\",\"$.traceId\",\"$.name\",\"$.timestamp\",\"$.duration\",\"$.localEndpoint.ipv4\",\"$.remoteEndpoint.ipv4\",\"$.remoteEndpoint.port\",\"$.shared\",\"$.parentId\",\"$.tags.\\\"http.path\\\"\",\"$.tags.\\\"http.method\\\"\",\"$.tags.\\\"mvc.controller.class\\\"\",\"$.tags.\\\"mvc.controller.method\\\"\",\"$.tags.error\"]"
)
FROM
KAFKA (
"kafka_broker_list" = "IP1:PORT1,IP2:PORT2,IP3:PORT3",
"kafka_topic" = "XXXXXXXXX"
);
Flink Parent ID Back-Tracing
In the call-link performance bottleneck analysis scenario, Flink is used to trace back the parent ID. A code example is as follows:
env
  // Add the Kafka source
  .addSource(getKafkaSource())
  // Parse the collected JSON string into a JSONArray;
  // the JSONArray is collected from a single service and may contain Span information of multiple Traces
  .map(JSON.parseArray(_))
  // Flatten the JSONArray into JSONObjects; each JSONObject is one Span
  .flatMap(_.asScala.map(_.asInstanceOf[JSONObject]))
  // Convert each Span JSONObject into a Bean object
  .map(jsonToBean(_))
  // Key the stream by traceId + localEndpoint_serviceName to partition the Spans into a keyed stream
  .keyBy(span => keyOfTrace(span))
  // Use a session window to route all Spans of the same Trace on the same service
  // into the same fixed-gap processing-time window.
  // For simplicity a processing-time session window is used here; later we plan to
  // optimize this with a StarRocks UDAF and remove the dependency on Flink
  .window(ProcessingTimeSessionWindows.withGap(Time.seconds(10)))
  // Apply the aggregate window function that performs the parent-ID back-tracing
  .aggregate(new TraceAggregateFunction)
  // Flatten the back-traced Span collection so it can be handed to flink-connector-starrocks
  .flatMap(spans => spans)
  // Use the flink-connector-starrocks sink to write the data into StarRocks
  .addSink(
    StarRocksSink.sink(
      StarRocksSinkOptions.builder().withProperty("XXX", "XXX").build()))
Analysis and Calculation
Taking the microservice system in Figure 6 as an example, the StarRocks SQL statements for each statistical analysis scenario are given below.
Intra-Service Analysis
Upstream service request indicator statistics
The following SQL uses the zipkin table to compute statistics for the requests that Service2 makes to the upstream services Service3 and Service4, with the metrics grouped by hour and interface.
select
hr,
name,
req_count,
timeout / req_count * 100 as timeout_rate,
error_count / req_count * 100 as error_rate,
avg_duration,
tp95,
tp99
from
(
select
hr,
name,
count(1) as req_count,
AVG(duration) / 1000 as avg_duration,
sum(if(duration > 200000, 1, 0)) as timeout,
sum(tag_error) as error_count,
percentile_approx(duration, 0.95) / 1000 AS tp95,
percentile_approx(duration, 0.99) / 1000 AS tp99
from
zipkin
where
localEndpoint_serviceName = 'Service2'
and kind = 'CLIENT'
and dt = 20220105
group by
hr,
name
) tmp
order by
hr
Downstream Service Response Indicator Statistics
The following SQL uses the zipkin table to compute statistics for Service2's responses to the downstream service Service1, with the metrics grouped by hour and interface.
select
hr,
name,
req_count,
timeout / req_count * 100 as timeout_rate,
error_count / req_count * 100 as error_rate,
avg_duration,
tp95,
tp99
from
(
select
hr,
name,
count(1) as req_count,
AVG(duration) / 1000 as avg_duration,
sum(if(duration > 200000, 1, 0)) as timeout,
sum(tag_error) as error_count,
percentile_approx(duration, 0.95) / 1000 AS tp95,
percentile_approx(duration, 0.99) / 1000 AS tp99
from
zipkin
where
localEndpoint_serviceName = 'Service2'
and kind = 'SERVER'
and dt = 20220105
group by
hr,
name
) tmp
order by
hr
Service internal processing analysis
The following SQL uses the zipkin table to analyze the interface /2/api of Service2, grouping statistics such as Duration by Span name.
with
spans as (
select * from zipkin where dt = 20220105 and localEndpoint_serviceName = "Service2"
),
api_spans as (
select
spans.id as id,
spans.parentId as parentId,
spans.name as name,
spans.duration as duration
from
spans
inner JOIN
(select * from spans where kind = "SERVER" and name = "/2/api") tmp
on spans.traceId = tmp.traceId
)
SELECT
name,
AVG(inner_duration) / 1000 as avg_duration,
percentile_approx(inner_duration, 0.95) / 1000 AS tp95,
percentile_approx(inner_duration, 0.99) / 1000 AS tp99
from
(
select
l.name as name,
(l.duration - ifnull(r.duration, 0)) as inner_duration
from
api_spans l
left JOIN
api_spans r
on l.parentId = r.id
) tmp
GROUP BY
name
Inter-Service Analysis
Service topology statistics
The following SQL uses the zipkin table to compute the topological relationships between services, together with statistics such as the Duration of the interfaces called between services.
with tbl as (select * from zipkin where dt = 20220105)
select
client,
server,
name,
AVG(duration) / 1000 as avg_duration,
percentile_approx(duration, 0.95) / 1000 AS tp95,
percentile_approx(duration, 0.99) / 1000 AS tp99
from
(
select
c.localEndpoint_serviceName as client,
s.localEndpoint_serviceName as server,
c.name as name,
c.duration as duration
from
(select * from tbl where kind = "CLIENT") c
left JOIN
(select * from tbl where kind = "SERVER") s
on c.id = s.id and c.traceId = s.traceId
) as tmp
group by
client,
server,
name
Call Link Performance Bottleneck Analysis
The following SQL uses the zipkin_trace_perf table to analyze timed-out requests to a service interface. For each such request it finds the service or inter-service call that takes the longest within the call chain, and then aggregates the results to see whether the performance hotspots are concentrated in a particular service or inter-service call.
select
service,
ROUND(count(1) * 100 / sum(count(1)) over(), 2) as percent
from
(
select
traceId,
service,
duration,
ROW_NUMBER() over(partition by traceId order by duration desc) as rank4
from
(
with tbl as (
SELECT
l.traceId as traceId,
l.id as id,
l.parentId as parentId,
l.kind as kind,
l.duration as duration,
l.localEndpoint_serviceName as localEndpoint_serviceName
FROM
zipkin_trace_perf l
INNER JOIN
zipkin_trace_perf r
on l.traceId = r.traceId
and l.dt = 20220105
and r.dt = 20220105
and r.tag_error = 0 -- filter out traces with errors
and r.localEndpoint_serviceName = "Service1"
and r.name = "/1/api"
and r.kind = "SERVER"
and r.duration > 200000 -- filter out traces that did not time out
)
select
traceId,
id,
service,
duration
from
(
select
traceId,
id,
service,
(c_duration - s_duration) as duration,
ROW_NUMBER() over(partition by traceId order by (c_duration - s_duration) desc) as rank2
from
(
select
c.traceId as traceId,
c.id as id,
concat(c.localEndpoint_serviceName, "=>", ifnull(s.localEndpoint_serviceName, "?")) as service,
c.duration as c_duration,
ifnull(s.duration, 0) as s_duration
from
(select * from tbl where kind = "CLIENT") c
left JOIN
(select * from tbl where kind = "SERVER") s
on c.id = s.id and c.traceId = s.traceId
) tmp1
) tmp2
where
rank2 = 1
union ALL
select
traceId,
id,
service,
duration
from
(
select
traceId,
id,
service,
(s_duration - c_duration) as duration,
ROW_NUMBER() over(partition by traceId order by (s_duration - c_duration) desc) as rank2
from
(
select
s.traceId as traceId,
s.id as id,
s.localEndpoint_serviceName as service,
s.duration as s_duration,
ifnull(c.duration, 0) as c_duration,
ROW_NUMBER() over(partition by s.traceId, s.id order by ifnull(c.duration, 0) desc) as rank
from
(select * from tbl where kind = "SERVER") s
left JOIN
(select * from tbl where kind = "CLIENT") c
on s.id = c.parentId and s.traceId = c.traceId
) tmp1
where
rank = 1
) tmp2
where
rank2 = 1
) tmp3
) tmp4
where
rank4 = 1
GROUP BY
service
order by
percent desc
The result of this SQL query is shown in the figure below: among the timed-out Trace requests, it gives the percentage distribution of the performance-bottleneck services and inter-service calls.
Figure 12
03 Practical Results
At present, Sohu Smart Media has connected 30+ services to Zipkin, covering hundreds of online service instances; with a 1% sampling rate, nearly 1 billion rows of span logs are generated every day.
Querying StarRocks through Zipkin Server, the Trace information obtained is shown in the following figure:
Figure 13
Querying StarRocks through Zipkin Server, the service topology obtained is shown in the following figure:
Figure 14
In the practice of the Zipkin + StarRocks link tracing system, both the microservice monitoring and analysis capabilities and the engineering efficiency have improved significantly:
Monitoring and analysis capabilities
- For monitoring and alerting, StarRocks can be queried for the response latency percentiles, error rate and other indicators of online services at the current moment, and various alerts can be generated from these indicators in time (see the SQL sketch after this list);
- For metric statistics, StarRocks can compute service response latency metrics at day, hour, minute and other granularities, giving a better picture of how the services are running;
- For fault analysis, StarRocks' powerful SQL computing capabilities support exploratory analysis and queries across dimensions such as service, time and interface to locate the root cause of a fault.
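As an illustration of the alerting case, the following is a minimal sketch that computes per-service request count, error rate, timeout rate and P99 latency over the most recent few minutes, which an alerting rule can then evaluate. It reuses the zipkin table and the 200000-microsecond timeout threshold from section 02; the dt value and the 5-minute window are placeholders.
-- Sketch: recent per-service health metrics for alerting; the zipkin table and the
-- 200000 us (200 ms) timeout threshold follow section 02, dt and the window are placeholders
select
    localEndpoint_serviceName as service,
    count(1) as req_count,
    sum(tag_error) / count(1) * 100 as error_rate,
    sum(if(duration > 200000, 1, 0)) / count(1) * 100 as timeout_rate,
    percentile_approx(duration, 0.99) / 1000 as tp99_ms
from
    zipkin
where
    dt = 20220105
    and kind = 'SERVER'
    and `timestamp` >= (unix_timestamp() - 300) * 1000000
group by
    localEndpoint_serviceName
order by
    tp99_ms desc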
Engineering efficiency of microservice monitoring
For Metrics and Logging data collection, users often need to instrument code manually and install various collector agents, and the collected data is then stored in systems such as ElasticSearch. These steps have to be repeated for every business, which is cumbersome, and the resulting resources are scattered and hard to manage.
With the Zipkin + StarRocks approach, you only need to introduce the corresponding SDK library into the code and set a small amount of configuration, such as the Kafka address to report to and the sampling rate. Tracing is then collected automatically, and queries and analysis can be done through the Zipkin Server interface, which is very simple.
04 Summary and Outlook
Building a link tracing system on Zipkin + StarRocks provides both Monitoring and Observability capabilities for microservice monitoring, and improves the analysis capability and engineering efficiency of microservice monitoring.
There are several follow-up optimizations that can further improve the analysis capability and ease of use of the link tracing system:
- Use StarRocks features such as UDAFs and window functions to move the parent-ID back-tracing into StarRocks as a post-ingestion computation, removing the dependency on Flink and further simplifying the overall architecture.
- At present, fields such as tags in the raw logs are not fully collected. StarRocks is implementing a JSON data type, which will better support nested fields such as tags.
- The current Zipkin Server UI is still somewhat rudimentary. We have already enabled Zipkin Server to query StarRocks; going forward we will optimize the Zipkin Server UI and implement more metric queries through StarRocks' powerful computing capabilities to further improve the user experience.
05 Reference documents
- "Cloud native computing reshapes enterprise IT architecture - distributed application architecture":
https://developer.aliyun.com/article/717072 - What is Upstream and Downstream in Software Development?
https://reflectoring.io/upstream-downstream/ - Metrics, tracing, and logging:
https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html - The 3 pillars of system observability:logs, metrics and tracing:
https://iamondemand.com/blog/the-3-pillars-of-system-observability-logs-metrics-and-tracing/ - observability 3 ways: logging, metrics and tracing:
https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing - Dapper, a Large-Scale Distributed Systems Tracing Infrastructure:
https://static.googleusercontent.com/media/research.google.com/en//archive/papers/dapper-2010-1.pdf - Jaeger:www.jaegertracing.io
- Zipkin:https://zipkin.io/
- opentracing.io:
https://opentracing.io/docs/ - opencensus.io:
https://opencensus.io/ - opentelemetry.io:
https://opentelemetry.io/docs/ - Microservice Observability, Part 1: Disambiguating Observability and Monitoring:
https://bravenewgeek.com/microservice-observability-part-1-disambiguating-observability-and-monitoring/ - How to Build Observable Distributed Systems:
https://www.infoq.com/presentations/observable-distributed-ststems/ - Monitoring and Observability:
https://copyconstruct.medium.com/monitoring-and-observability-8417d1952e1c - Monitoring Isn't Observability:
https://orangematter.solarwinds.com/2017/09/14/monitoring-isnt-observability/ - Spring Cloud Sleuth Documentation:
https://docs.spring.io/spring-cloud-sleuth/docs/current-SNAPSHOT/reference/html/getting-started.html#getting-started