Authors: Zhai Dongbo, Ye Shujun

Under its microservice architecture, Sohu Smart Media uses Zipkin to collect service link tracing (Tracing) instrumentation data and stores the collected Trace information in StarRocks. With StarRocks' powerful SQL computing capabilities, the tracing data can be statistically analyzed across many dimensions, upgrading microservice monitoring from simple statistical Monitoring to richer, exploratory Observability.

The article is organized into three parts. The first part introduces the monitoring methods commonly used with microservices; among them, link tracing technology can string together an entire service call chain and capture key information about the service as a whole, which makes it especially important for monitoring microservices. The second part describes how Sohu Smart Media built its link tracing analysis system, covering data collection with Zipkin, data storage in StarRocks, and analysis and computation in StarRocks for specific application scenarios. The third part presents the practical results Sohu Smart Media obtained by introducing Zipkin and StarRocks for link tracing analysis.

01 Link Tracing in a Microservice Architecture

In recent years, enterprise IT application architectures have gradually evolved toward distributed architectures such as microservices and cloud native. Within Sohu Smart Media, application services are developed, operated, and maintained following the architectural ideas and technical solutions of microservices, Docker, Kubernetes, and Spring Cloud, which has improved the department's overall engineering efficiency.

While the microservice architecture improves engineering efficiency, it also introduces new problems. Microservices form a distributed architecture that splits service units along business lines. A user request is no longer handled by a single service but is completed cooperatively by multiple services, which may be built by different teams, written in different programming languages, and deployed on different servers or even in different data centers. When a user request produces errors or exceptions, the distributed nature of the calls makes such faults difficult to locate, so monitoring faces new challenges compared with a traditional monolithic architecture.

Logging, Metrics, Tracing

Microservice monitoring involves many techniques. By the type of data collected, it is mainly divided into three areas: Logging, Metrics, and Tracing:

Logging

Discrete events actively recorded by the application. The recorded information is generally unstructured text, which provides detailed clues when analyzing and diagnosing problems.

Metrics

Collected data with aggregatable attributes, designed to show how a given indicator behaves over a period of time, so that specific indicators and trends can be observed.

Tracing

Records the entire life cycle of a request, including the services invoked, their processing times, and the request context. A globally unique Trace ID identifies the request and strings the whole call chain together, which makes Tracing particularly well suited to monitoring a microservice architecture.

Figure 1

The relationship among the three is shown in the figure above; they overlap. For example, Logging can aggregate related fields to produce Metrics and correlate related fields to produce Tracing; Tracing can aggregate query counts to produce Metrics and record business logs to produce Logging. In general, it is difficult to add fields to Metrics or Logging to reconstruct the life cycle of a microservice request, but it is relatively easy to derive Metrics and Logging from Tracing.

In addition, the three place different demands on storage. Metrics is naturally compressed data and consumes the least space; Logging tends to grow without bound and can exceed the expected capacity; Tracing's storage footprint generally sits between the two, and its capacity can be further controlled through the sampling rate.

From Monitoring to Observability

Monitoring tells you whether the system works. Observability lets you ask why it's not working.

– Baron Schwartz
From the data analysis perspective, microservice monitoring can be roughly divided into Monitoring and Observability.

Figure 2

Monitoring

Tells you whether the system is working: pre-defined computations for known scenarios, based on assumptions made in advance about possible monitoring problems. This corresponds to the Known Knowns and Known Unknowns in the figure above: events that can be anticipated in advance, whether or not they are already understood.

Observability

Lets you ask why the system is not working: exploratory analysis of unknown scenarios and after-the-fact analysis of any monitoring problem. This corresponds to the Unknown Knowns and Unknown Unknowns in the figure above: events that occur without having been anticipated, whether or not they are already understood.

Clearly, monitoring that relies on anticipating every possible event in advance can no longer cover the complex monitoring scenarios of microservices. We need Observability, a monitoring approach that supports exploratory analysis. Among Logging, Metrics, and Tracing, Tracing is currently the most effective way to provide such multi-dimensional monitoring and analysis capability.

Tracing

Link tracing (Tracing) analysis provides distributed application developers with complete call-chain reconstruction, request volume statistics, link topology, application dependency analysis, and other tools. It helps developers quickly analyze and diagnose performance bottlenecks under a distributed application architecture and improves development and troubleshooting efficiency in the microservice era.

Tracing can connect the call chain of a distributed request across microservices and therefore plays an important role in a microservice monitoring system. In addition, Tracing sits between Metrics and Logging: it can do the work of Monitoring as well as support Observability analysis, improving the efficiency of building the monitoring system.

System Model

A link tracing (Tracing) system needs to record the chain of upstream and downstream service calls that a specific request passes through, together with the work each service performs.

In the microservice system shown in the figure below, a user sends a request to Service A. Service A generates a globally unique Trace ID and internally invokes its processing modules through Messaging calls (for example, cross-thread asynchronous calls); these modules then call Service B and Service C in parallel via RPC. Service B returns a response immediately, while Service C works serially: it first calls Service D via RPC, then calls Service E via RPC, and only then responds to Service A's request. Once both internal module calls have completed, Service A responds to the original user request.

The initially generated Trace ID is propagated through this series of intra-service and inter-service calls, stringing the calls together. The Tracing system also records the Timestamp, service name, and other related information for each request call.

Figure 3 (Note: serial calls inside a service hurt system performance, so the parallel calling style is generally used. The following sections only consider the parallel-call scenario.)

A Tracing system involves two basic concepts, Trace and Span. The figure below shows a Trace composed of Spans.

Figure 4

A Trace is the call chain across all services that an external request passes through. It can be understood as a tree structure composed of service calls, and each chain is identified by a globally unique ID.

A Span is a single call within a service or between services, that is, a node in the Trace tree. As shown in the figure below, the Spans that make up a Trace tree have parent-child relationships. A Span mainly records the Span name, Span ID, and parent ID, along with the Timestamp, Duration (which includes the processing time of its child calls), business data, and other log information.

By calling method, Spans can be divided into RPC Spans and Messaging Spans:

RPC Span

Generated by RPC Tracing and divided into Client and Server Spans, recorded respectively by the client node and the server node of an RPC call. The two share the Span ID, Parent Span ID, and other information, but note that there is a time offset between the two records. This offset is the inter-service call overhead, generally caused by network transmission, proxy services, or message queuing at the service interface.

Messaging Span

Generated by Messaging Tracing and generally used for tracing calls inside a service. Unlike RPC Spans, Messaging Spans do not share Span IDs and similar information.

Application Scenarios

From the Tracing system model, various metrics such as service response times can be derived and used for Alerting, dashboard queries, and so on. The chain formed by Spans can also be analyzed to assess a single service or the overall system, and to uncover issues such as service performance bottlenecks, network transmission overhead, and the design of asynchronous calls inside a service. As shown in the figure below, compared with Metrics and Logging, Tracing can cover both the Monitoring and the Observability scenarios of monitoring and therefore occupies an important position in the monitoring system. Specifications and communities such as OpenTracing, OpenCensus, and OpenTelemetry all include support for Tracing.

Figure 5

From the microservice perspective, the Span information recorded by Tracing can be aggregated and analyzed along many dimensions. The figure below shows an example microservice system built on HTTP APIs. A user queries Service1's /1/api interface; Service1 then requests Service2's /2/api; Service2 internally makes asynchronous calls msg2.1 and msg2.2; msg2.1 requests Service3's /3/api interface and msg2.2 requests Service4's /4/api interface; Service3 internally calls msg3, and Service4 requests Service5's /5/api. Service5 is not instrumented for tracing, so no information can be collected about it.

Figure 6

For the microservice system shown in the figure above, the following two types of statistical analysis operations can be performed:

In-Service Analysis

Focuses on the operation of a single service, such as the performance of its externally exposed interfaces and of its queries to upstream interfaces. The main analysis scenarios include:

1. Upstream service request

For example, /2/api requested by Service1, /5/api requested by Service4, etc.: count metrics such as the number of requests, QPS, latency percentiles, error rate, and timeout rate.

2. Downstream service response

For example, /1/api provided by Service1, /4/api provided by Service4, etc.: count metrics such as the number of queries, QPS, latency percentiles, error rate, and timeout rate.

3. Internal processing of services

A service's external interface may be split into multiple Spans internally. Grouping and aggregating by Span name reveals the most time-consuming Spans. For example, Service2's /2/api interface internally consists of the /2/api Server Span plus the Spans corresponding to msg2.1 and msg2.2; the Duration attributable to each of these Spans can be computed from the dependencies between them, and various statistical analyses can then be performed.

Inter-Service Analysis

Inter-service analysis treats each individual service as a black box and focuses on the dependencies between services, the hotspots on the call chain, and so on. The main analysis scenarios include:

1. Service topology statistics

Based on the Client Spans and Server Spans of inter-service calls, the topology of the entire service system can be derived, along with statistics such as the number of calls and the Duration between services.

2. Call link performance bottleneck analysis

Analyzes the performance bottleneck on the call chain of an external request interface. The bottleneck may come from the internal processing overhead of a single service or from the network call overhead between two services.

For a complex request that involves dozens of microservices in a single call, the bottleneck is likely to differ from request to request. In that case the results must be aggregated to rank how frequently each bottleneck occurs, so that the hotspot services or inter-service calls can be identified.

These are only some of the possible analysis scenarios. The information provided by Tracing can actually support many more metric statistics and exploratory analyses, which this article will not enumerate one by one.

02 Building a Link Tracing Analysis System Based on Zipkin and StarRocks

A link tracing system consists of three main parts: data collection, data storage, and analysis/computation. The most widely used open-source link tracing system is Zipkin, which covers data collection and analysis/computation but relies on other systems for the underlying storage. When Sohu Smart Media built its link tracing system, it initially used Zipkin + Elasticsearch, and later added StarRocks as the underlying storage system and moved the analysis and statistics onto StarRocks. The overall architecture of the system is shown in the figure below.

Figure 7

Data Collection

Zipkin supports fully automatic instrumentation on the client side: after importing the relevant library into the application and adding a small amount of configuration, Span information is generated automatically and uploaded via HTTP or Kafka. Zipkin currently provides instrumentation libraries for most languages; for example, Spring Cloud in the Java ecosystem ships Sleuth with deep Zipkin integration that is essentially transparent to developers. To limit storage consumption, a sampling rate of around 1/100 is usually configured; the Dapper paper notes that even a 1/1000 sampling rate provides enough information for typical uses of tracing data.

Data Model

Corresponding to Figure 6, the following is a schematic diagram of Zipkin's Span collection (Figure 8). The process works as follows:

Figure 8

  1. The Request sent by the user to Service1 contains no Trace or Span information. Service1 creates a Server Span, randomly generates a globally unique TraceID (X in the figure) and SpanID (A in the figure; here X and A use the same value), and records the Timestamp and other information. When Service1 returns the Response to the user, it computes the Server Span's processing time (Duration) and reports the complete Server Span, including the TraceID, SpanID, Timestamp, Duration, and other information.
  2. For the request it sends to Service2, Service1 creates a Client Span, uses X as the Trace ID, randomly generates a globally unique SpanID (B in the figure), records the Timestamp and other information, and passes the Trace ID (X) and SpanID (B) to Service2 (for example, by adding TraceID, SpanID, and related fields to the HTTP headers). After Service1 receives Service2's response, it finalizes the Client Span information and reports the Client Span.
  3. Service2 receives the Request from Service1, which carries Trace (X) and Span (B) information. Service2 creates a Server Span with X as the Trace ID and B as the SpanID, internally calls msg2.1 and msg2.2, and passes the Trace ID (X) and SpanID (B) to them. After msg2.1 and msg2.2 return, Service2 finalizes the Server Span information and reports the Server Span.
  4. msg2.1 and msg2.2 of Service2 each create a Messaging Span, use X as the Trace ID, randomly generate globally unique SpanIDs (C and F in the figure), record the Timestamp and other information, and send requests to Service3 and Service4 respectively. After msg2.1 and msg2.2 receive their responses, each finalizes its Messaging Span information and the two Messaging Spans are reported.
  5. For the requests it sends to Service3 and Service4, Service2 creates a Client Span for each, uses X as the Trace ID, randomly generates globally unique SpanIDs (D and G in the figure), records the Timestamp and other information, and passes the Trace ID (X) and SpanID (D or G) to Service3 and Service4. After Service2 receives the responses from Service3 and Service4, it finalizes the Client Span information and reports the two Client Spans.
  6. Service3 receives the Request from Service2, which carries Trace (X) and Span (D) information. Service3 creates a Server Span with X as the Trace ID and D as the SpanID, and internally calls msg3. After msg3 returns, Service3 finalizes this Server Span's information and reports the Server Span.
  7. msg3 of Service3 creates a Messaging Span, uses X as the Trace ID, randomly generates a globally unique SpanID (E in the figure), and records the Timestamp and other information. After msg3 finishes processing, this Messaging Span is reported.
  8. Service4 receives the Request from Service2, which carries Trace (X) and Span (G) information. Service4 creates a Server Span with X as the Trace ID and G as the SpanID, and then sends a request to Service5. After Service4 receives Service5's response, it finalizes the Server Span information and reports the Server Span.
  9. For the request it sends to Service5, Service4 creates a Client Span, uses X as the Trace ID, randomly generates a globally unique SpanID (H in the figure), records the Timestamp and other information, and passes the Trace ID (X) and SpanID (H) to Service5. After Service4 receives Service5's response, it finalizes the Client Span information and reports the Client Span.

The Span records generated by the Trace X call chain above are shown in the figure below. Each Span mainly records the Span ID, Parent ID, Kind (CLIENT for an RPC client-side Span, SERVER for an RPC server-side Span, NULL for a Messaging Span), and SN (Service Name), and also includes the Trace ID, Timestamp, Duration, and other information. Service5 has no Zipkin instrumentation, so there are no Span records for Service5.

Figure 9

Data Format

Once an application service is instrumented with Zipkin, Span information is reported to Kafka in JSON format by default. Note the following about the reported data:

  • Each application service reports a batch of Spans at a time, as a JSON array
  • A JSON array can contain Spans belonging to different Traces, that is, the Trace IDs are not necessarily all the same
  • Different interface types (HTTP, gRPC, Dubbo, etc.) share the same main fields but record some type-specific fields in tags

[
  {
    "traceId": "3112dd04c3112036",
    "id": "3112dd04c3112036",
    "kind": "SERVER",
    "name": "get /2/api",
    "timestamp": 1618480662355011,
    "duration": 12769,
    "localEndpoint": {
      "serviceName": "SERVICE2",
      "ipv4": "172.24.132.32"
    },
    "remoteEndpoint": {
      "ipv4": "111.25.140.166",
      "port": 50214
    },
    "tags": {
      "http.method": "GET",
      "http.path": "/2/api",
      "mvc.controller.class": "Controller",
      "mvc.controller.method": "get2Api"
    }
  },
  {
    "traceId": "3112dd04c3112036",
    "parentId": "3112dd04c3112036",
    "id": "b4bd9859c690160a",
    "name": "msg2.1",
    "timestamp": 1618480662357211,
    "duration": 11069,
    "localEndpoint": {
      "serviceName": "SERVICE2"
    },
    "tags": {
      "class": "MSG",
      "method": "msg2.1"
    }
  },
  {
    "traceId": "3112dd04c3112036",
    "parentId": "3112dd04c3112036",
    "id": "c31d9859c69a2b21",
    "name": "msg2.2",
    "timestamp": 1618480662357201,
    "duration": 10768,
    "localEndpoint": {
      "serviceName": "SERVICE2"
    },
    "tags": {
      "class": "MSG",
      "method": "msg2.2"
    }
  },
  {
    "traceId": "3112dd04c3112036",
    "parentId": "b4bd9859c690160a",
    "id": "f1659c981c0f4744",
    "kind": "CLIENT",
    "name": "get /3/api",
    "timestamp": 1618480662358201,
    "duration": 9206,
    "localEndpoint": {
      "serviceName": "SERVICE2",
      "ipv4": "172.24.132.32"
    },
    "tags": {
      "http.method": "GET",
      "http.path": "/3/api"
    }
  },
  {
    "traceId": "3112dd04c3112036",
    "parentId": "c31d9859c69a2b21",
    "id": "73cd1cab1d72a971",
    "kind": "CLIENT",
    "name": "get /4/api",
    "timestamp": 1618480662358211,
    "duration": 9349,
    "localEndpoint": {
      "serviceName": "SERVICE2",
      "ipv4": "172.24.132.32"
    },
    "tags": {
      "http.method": "GET",
      "http.path": "/4/api"
    }
  }
]

Figure 10

Data Storage

Zipkin supports three storage backends, MySQL, Cassandra, and Elasticsearch, each with its own shortcomings:

  • MySQL: the collected tracing data easily reaches hundreds of millions or even more than ten billion rows per day, a volume MySQL cannot support.
  • Cassandra: supports Span analysis of a single Trace, but not statistical analysis scenarios such as aggregate queries.
  • Elasticsearch: supports single-trace analysis and simple aggregate queries, but not more complex computations such as those involving Joins or window functions. In particular, Zipkin cannot compute service dependencies in real time on Elasticsearch and has to run offline Spark jobs to derive the dependency information between services.

In practice we also started with Elasticsearch and ran into the problems described above; for example, Zipkin's service dependency topology had to be computed offline. We therefore added StarRocks as the underlying data store. Importing Zipkin's trace data into StarRocks is very convenient and takes only two steps: CREATE TABLE + CREATE ROUTINE LOAD.

In addition, for the call-chain performance bottleneck analysis scenario, each service should be treated as a black box: only the RPC Spans matter and the Messaging Spans inside a service should be hidden. We use Flink to trace the parentId of each Span inside a service back to the RPC Server Span with the same Trace ID in the same service, replace the parentId of each RPC Client Span with the ID of that RPC Server Span, and finally write the converted data into StarRocks in real time through flink-connector-starrocks.

The data storage architecture and data flow based on StarRocks are shown in the figure below.

Figure 11

CREATE TABLE

Example table creation statements are shown below; note the following points:

  • There are two tables, zipkin and zipkin_trace_perf. The zipkin_trace_perf table is used only for the call-chain performance bottleneck analysis scenario; all other statistical analysis uses the zipkin table.
  • The dt, hr, and min time fields are derived from the Timestamp field in the collected data, which makes subsequent statistical analysis convenient.
  • The DUPLICATE KEY model, Bitmap indexes, and other settings are used to speed up queries.
  • The zipkin table uses id as the bucketing column, so that service topology queries can be optimized into a Colocate Join, improving query performance.

Zipkin

CREATE TABLE `zipkin` (
  `traceId` varchar(24) NULL COMMENT "",
  `id` varchar(24) NULL COMMENT "Span ID",
  `localEndpoint_serviceName` varchar(512) NULL COMMENT "",
  `dt` int(11) NULL COMMENT "",
  `parentId` varchar(24) NULL COMMENT "",
  `timestamp` bigint(20) NULL COMMENT "",
  `hr` int(11) NULL COMMENT "",
  `min` bigint(20) NULL COMMENT "",
  `kind` varchar(16) NULL COMMENT "",
  `duration` int(11) NULL COMMENT "",
  `name` varchar(300) NULL COMMENT "",
  `localEndpoint_ipv4` varchar(16) NULL COMMENT "",
  `remoteEndpoint_ipv4` varchar(16) NULL COMMENT "",
  `remoteEndpoint_port` varchar(16) NULL COMMENT "",
  `shared` int(11) NULL COMMENT "",
  `tag_error` int(11) NULL DEFAULT "0" COMMENT "",
  `error_msg` varchar(1024) NULL COMMENT "",
  `tags_http_path` varchar(2048) NULL COMMENT "",
  `tags_http_method` varchar(1024) NULL COMMENT "",
  `tags_controller_class` varchar(100) NULL COMMENT "",
  `tags_controller_method` varchar(1024) NULL COMMENT "",
  INDEX service_name_idx (`localEndpoint_serviceName`) USING BITMAP COMMENT ''
) ENGINE=OLAP 
DUPLICATE KEY(`traceId`, `parentId`, `id`, `timestamp`, `localEndpoint_serviceName`, `dt`)
COMMENT "OLAP"
PARTITION BY RANGE(`dt`)
(PARTITION p20220104 VALUES [("20220104"), ("20220105")),
 PARTITION p20220105 VALUES [("20220105"), ("20220106")))
DISTRIBUTED BY HASH(`id`) BUCKETS 100 
PROPERTIES (
"replication_num" = "3",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.time_zone" = "Asia/Shanghai",
"dynamic_partition.start" = "-30",
"dynamic_partition.end" = "2",
"dynamic_partition.prefix" = "p",
"dynamic_partition.buckets" = "100",
"in_memory" = "false",
"storage_format" = "DEFAULT"
);

zipkin_trace_perf

CREATE TABLE `zipkin_trace_perf` (
  `traceId` varchar(24) NULL COMMENT "",
  `id` varchar(24) NULL COMMENT "",
  `dt` int(11) NULL COMMENT "",
  `parentId` varchar(24) NULL COMMENT "",
  `localEndpoint_serviceName` varchar(512) NULL COMMENT "",
  `timestamp` bigint(20) NULL COMMENT "",
  `hr` int(11) NULL COMMENT "",
  `min` bigint(20) NULL COMMENT "",
  `kind` varchar(16) NULL COMMENT "",
  `duration` int(11) NULL COMMENT "",
  `name` varchar(300) NULL COMMENT "",
  `tag_error` int(11) NULL DEFAULT "0" COMMENT ""
) ENGINE=OLAP 
DUPLICATE KEY(`traceId`, `id`, `dt`, `parentId`, `localEndpoint_serviceName`)
COMMENT "OLAP"
PARTITION BY RANGE(`dt`)
(PARTITION p20220104 VALUES [("20220104"), ("20220105")),
 PARTITION p20220105 VALUES [("20220105"), ("20220106")))
DISTRIBUTED BY HASH(`traceId`) BUCKETS 32 
PROPERTIES (
"replication_num" = "3",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.time_zone" = "Asia/Shanghai",
"dynamic_partition.start" = "-60",
"dynamic_partition.end" = "2",
"dynamic_partition.prefix" = "p",
"dynamic_partition.buckets" = "12",
"in_memory" = "false",
"storage_format" = "DEFAULT"
);
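With dynamic partitioning enabled as above, daily partitions are created and dropped automatically. As a minimal check (assuming the two tables above), the partitions can be listed:

SHOW PARTITIONS FROM zipkin;
SHOW PARTITIONS FROM zipkin_trace_perf;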

ROUTINE LOAD

An example of a ROUTINE LOAD creation statement is as follows:

CREATE ROUTINE LOAD zipkin_routine_load ON zipkin COLUMNS(
  id,
  kind,
  localEndpoint_serviceName,
  traceId,
  `name`,
  `timestamp`,
  `duration`,
  `localEndpoint_ipv4`,
  `remoteEndpoint_ipv4`,
  `remoteEndpoint_port`,
  `shared`,
  `parentId`,
  `tags_http_path`,
  `tags_http_method`,
  `tags_controller_class`,
  `tags_controller_method`,
  tmp_tag_error,
  tag_error = if(`tmp_tag_error` IS NULL, 0, 1),
  error_msg = tmp_tag_error,
  dt = from_unixtime(`timestamp` / 1000000, '%Y%m%d'),
  hr = from_unixtime(`timestamp` / 1000000, '%H'),
  `min` = from_unixtime(`timestamp` / 1000000, '%i')
) PROPERTIES (
  "desired_concurrent_number" = "3",
  "max_batch_interval" = "50",
  "max_batch_rows" = "300000",
  "max_batch_size" = "209715200",
  "max_error_number" = "1000000",
  "strict_mode" = "false",
  "format" = "json",
  "strip_outer_array" = "true",
  "jsonpaths" = "[\"$.id\",\"$.kind\",\"$.localEndpoint.serviceName\",\"$.traceId\",\"$.name\",\"$.timestamp\",\"$.duration\",\"$.localEndpoint.ipv4\",\"$.remoteEndpoint.ipv4\",\"$.remoteEndpoint.port\",\"$.shared\",\"$.parentId\",\"$.tags.\\\"http.path\\\"\",\"$.tags.\\\"http.method\\\"\",\"$.tags.\\\"mvc.controller.class\\\"\",\"$.tags.\\\"mvc.controller.method\\\"\",\"$.tags.error\"]"
)
FROM
  KAFKA (
    "kafka_broker_list" = "IP1:PORT1,IP2:PORT2,IP3:PORT3",
    "kafka_topic" = "XXXXXXXXX"
  );
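After the job is created, ingestion runs continuously in the background. The statements below are a minimal operational sketch, assuming the job name zipkin_routine_load above: check the job state, pause or resume it, and pull back one trace to verify that data is arriving (the traceId comes from the Figure 10 example; the dt value is a placeholder for the day the trace was collected).

-- Check the state, progress, and error rows of the routine load job
SHOW ROUTINE LOAD FOR zipkin_routine_load;

-- Pause / resume the job when needed
PAUSE ROUTINE LOAD FOR zipkin_routine_load;
RESUME ROUTINE LOAD FOR zipkin_routine_load;

-- Reconstruct a single trace to verify ingestion
select
  traceId,
  id,
  parentId,
  kind,
  name,
  localEndpoint_serviceName,
  `timestamp`,
  duration
from
  zipkin
where
  dt = 20220105
  and traceId = '3112dd04c3112036'
order by
  `timestamp`;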

Tracing the Parent ID with Flink

For the call-chain performance bottleneck analysis scenario, Flink is used to trace parent IDs back to the RPC Server Span. A code example follows:

env
  // Add the Kafka data source
  .addSource(getKafkaSource())
  // Parse the collected JSON string into a JSONArray;
  // the array is reported by a single service and contains Span information from multiple Traces
  .map(JSON.parseArray(_))
  // Flatten the JSONArray into JSONObjects; each JSONObject is one Span
  .flatMap(_.asScala.map(_.asInstanceOf[JSONObject]))
  // Convert each Span JSONObject into a bean object
  .map(jsonToBean(_))
  // Key the stream by traceId + localEndpoint_serviceName to produce a keyed stream
  .keyBy(span => keyOfTrace(span))
  // Use a session window so that all Spans of the same Trace from the same service
  // are routed into the same fixed-gap processing-time window.
  // A processing-time session window is used here for simplicity; later we plan to
  // replace this with a StarRocks UDAF and remove the dependency on Flink.
  .window(ProcessingTimeSessionWindows.withGap(Time.seconds(10)))
  // Apply the aggregate window function that performs the parent-ID traceback
  .aggregate(new TraceAggregateFunction)
  // Flatten the traced Span collection so it can be passed to flink-connector-starrocks
  .flatMap(spans => spans)
  // Write the data into StarRocks using the flink-connector-starrocks sink
  .addSink(
    StarRocksSink.sink(
      StarRocksSinkOptions.builder().withProperty("XXX", "XXX").build()))

Analysis and Computation

Using Figure 6 as the example microservice system, the StarRocks SQL statement for each statistical analysis scenario is given below.

In-Service Analysis

Upstream service request indicator statistics

The following SQL uses the zipkin table to compute metric statistics for the requests that service Service2 makes to its upstream services Service3 and Service4, grouped by hour and interface.

select
  hr,
  name,
  req_count,
  timeout / req_count * 100 as timeout_rate,
  error_count / req_count * 100 as error_rate,
  avg_duration,
  tp95,
  tp99
from
  (
    select
      hr,
      name,
      count(1) as req_count,
      AVG(duration) / 1000 as avg_duration,
      sum(if(duration > 200000, 1, 0)) as timeout,
      sum(tag_error) as error_count,
      percentile_approx(duration, 0.95) / 1000 AS tp95,
      percentile_approx(duration, 0.99) / 1000 AS tp99
    from
      zipkin
    where
      localEndpoint_serviceName = 'Service2'
      and kind = 'CLIENT'
      and dt = 20220105
    group by
      hr,
      name
  ) tmp
order by
  hr

Downstream service response indicator statistics

The following SQL uses the zipkin table to compute metric statistics for service Service2 responding to its downstream service Service1, grouped by hour and interface.

select
  hr,
  name,
  req_count,
  timeout / req_count * 100 as timeout_rate,
  error_count / req_count * 100 as error_rate,
  avg_duration,
  tp95,
  tp99
from
  (
    select
      hr,
      name,
      count(1) as req_count,
      AVG(duration) / 1000 as avg_duration,
      sum(if(duration > 200000, 1, 0)) as timeout,
      sum(tag_error) as error_count,
      percentile_approx(duration, 0.95) / 1000 AS tp95,
      percentile_approx(duration, 0.99) / 1000 AS tp99
    from
      zipkin
    where
      localEndpoint_serviceName = 'Service2'
      and kind = 'SERVER'
      and dt = 20220105
    group by
      hr, 
      name
  ) tmp
order by
  hr

Service internal processing analysis

The following SQL uses the zipkin table to analyze the /2/api interface of service Service2, grouping Duration and other statistics by Span name.

with 
spans as (
  select * from zipkin where dt = 20220105 and localEndpoint_serviceName = "Service2"
),
api_spans as (
  select
    spans.id as id,
    spans.parentId as parentId,
    spans.name as name,
    spans.duration as duration
  from
    spans
    inner JOIN 
    (select * from spans where kind = "SERVER" and name = "/2/api") tmp 
    on spans.traceId = tmp.traceId
)
SELECT
  name,
  AVG(inner_duration) / 1000 as avg_duration,
  percentile_approx(inner_duration, 0.95) / 1000 AS tp95,
  percentile_approx(inner_duration, 0.99) / 1000 AS tp99
from
  (
    select
      l.name as name,
      (l.duration - ifnull(r.duration, 0)) as inner_duration
    from
      api_spans l
      left JOIN 
      api_spans r 
      on l.parentId = r.id
  ) tmp
GROUP BY
  name

Inter-Service Analysis

Service topology statistics

The following SQL uses the zipkin table to compute the topological relationships between services, together with Duration statistics for the interfaces called between services.

with tbl as (select * from zipkin where dt = 20220105)
select 
  client, 
  server, 
  name,
  AVG(duration) / 1000 as avg_duration,
  percentile_approx(duration, 0.95) / 1000 AS tp95,
  percentile_approx(duration, 0.99) / 1000 AS tp99
from
  (
    select
      c.localEndpoint_serviceName as client,
      s.localEndpoint_serviceName as server,
      c.name as name,
      c.duration as duration
    from
    (select * from tbl where kind = "CLIENT") c
    left JOIN 
    (select * from tbl where kind = "SERVER") s 
    on c.id = s.id and c.traceId = s.traceId
  ) as tmp
group by 
  client,  
  server,
  name

Call Link Performance Bottleneck Analysis

The following SQL uses the zipkin_trace_perf table to analyze timed-out requests to a service interface: for each timed-out request it finds the service (or inter-service call) that took the longest within the call chain, then aggregates across requests to determine whether the performance hotspot lies in a particular service or in a particular inter-service call.

select
  service,
  ROUND(count(1) * 100 / sum(count(1)) over(), 2) as percent
from
  (
    select
      traceId,
      service,
      duration,
      ROW_NUMBER() over(partition by traceId order by duration desc) as rank4
    from
      (
        with tbl as (
          SELECT
            l.traceId as traceId,
            l.id as id,
            l.parentId as parentId,
            l.kind as kind,
            l.duration as duration,
            l.localEndpoint_serviceName as localEndpoint_serviceName
          FROM
            zipkin_trace_perf l
            INNER JOIN 
            zipkin_trace_perf r 
            on l.traceId = r.traceId
              and l.dt = 20220105
              and r.dt = 20220105
              and r.tag_error = 0     -- filter out traces with errors
              and r.localEndpoint_serviceName = "Service1"
              and r.name = "/1/api"
              and r.kind = "SERVER"
              and r.duration > 200000  -- filter out traces that did not time out
        )
        select
          traceId,
          id,
          service,
          duration
        from
          (
            select
              traceId,
              id,
              service,
              (c_duration - s_duration) as duration,
              ROW_NUMBER() over(partition by traceId order by (c_duration - s_duration) desc) as rank2
            from
              (
                select
                  c.traceId as traceId,
                  c.id as id,
                  concat(c.localEndpoint_serviceName, "=>", ifnull(s.localEndpoint_serviceName, "?")) as service,
                  c.duration as c_duration,
                  ifnull(s.duration, 0) as s_duration
                from
                  (select * from tbl where kind = "CLIENT") c
                  left JOIN 
                  (select * from tbl where kind = "SERVER") s 
                  on c.id = s.id and c.traceId = s.traceId
              ) tmp1
          ) tmp2
        where
          rank2 = 1
        union ALL
        select
          traceId,
          id,
          service,
          duration
        from
          (
            select
              traceId,
              id,
              service,
              (s_duration - c_duration) as duration,
              ROW_NUMBER() over(partition by traceId order by (s_duration - c_duration) desc) as rank2
            from
              (
                select
                  s.traceId as traceId,
                  s.id as id,
                  s.localEndpoint_serviceName as service,
                  s.duration as s_duration,
                  ifnull(c.duration, 0) as c_duration,
                  ROW_NUMBER() over(partition by s.traceId, s.id order by ifnull(c.duration, 0) desc) as rank
                from
                  (select * from tbl where kind = "SERVER") s
                  left JOIN 
                  (select * from tbl where kind = "CLIENT") c 
                  on s.id = c.parentId and s.traceId = c.traceId
              ) tmp1
            where
              rank = 1
          ) tmp2
        where
          rank2 = 1
      ) tmp3
  ) tmp4
where
  rank4 = 1
GROUP BY
  service
order by
  percent desc

The result of this SQL query is shown in the figure below: among the timed-out Trace requests, the percentage distribution of the services or inter-service calls that were the performance bottleneck.

Figure 12

03 Practical Results

At present, Sohu Smart Media has instrumented 30+ services with Zipkin, covering hundreds of online service instances; with a 1% sampling rate, nearly 1 billion rows of span logs are generated every day.
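As a simple way to keep an eye on that ingest volume, the per-day and per-hour row counts can be read straight from the zipkin table; the query below is a minimal sketch with a placeholder date, not part of the original pipeline.

select
  dt,
  hr,
  count(1) as span_rows
from
  zipkin
where
  dt = 20220105
group by
  dt,
  hr
order by
  hr;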

Querying StarRocks through the Zipkin Server, the Trace information obtained is shown in the figure below:

Figure 13

Querying StarRocks through the Zipkin Server, the service topology obtained is shown in the figure below:

Figure 14

In practice, the link tracing system based on Zipkin and StarRocks has significantly improved both the monitoring and analysis capability for microservices and the engineering efficiency:

Monitoring and Analysis Capabilities

  • For monitoring and alerting, StarRocks can compute the current response-latency percentiles, error rates, and other indicators of online services, and alerts can be generated from these indicators in time (see the sketch after this list);
  • For metric statistics, StarRocks can aggregate service response-latency indicators at day, hour, minute, and other granularities, giving a clearer picture of how services are running;
  • For fault analysis, the powerful SQL computing capability of StarRocks supports exploratory analysis and queries across dimensions such as service, time, and interface to locate the cause of a fault.
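To illustrate the kind of alerting query mentioned in the first point, the following is a minimal sketch of a per-minute health query that an alert rule might poll against the zipkin table; the service name, date, hour, and the 200 ms timeout threshold are placeholders consistent with the earlier examples.

-- Per-minute request count, error rate, timeout rate, and tp99 for one service in one hour
select
  `min`,
  count(1) as req_count,
  sum(tag_error) * 100 / count(1) as error_rate,
  sum(if(duration > 200000, 1, 0)) * 100 / count(1) as timeout_rate,
  percentile_approx(duration, 0.99) / 1000 as tp99_ms
from
  zipkin
where
  dt = 20220105
  and hr = 10
  and localEndpoint_serviceName = 'Service2'
  and kind = 'SERVER'
group by
  `min`
order by
  `min` desc;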

Microservice Monitoring Engineering Efficiency

For Metrics and Logging data collection, users often need to add instrumentation manually and install various collection agents, and the collected data is then stored in systems such as Elasticsearch. These steps have to be repeated for every new piece of business, which is cumbersome, and the resulting resources are scattered and hard to manage.

With the Zipkin + StarRocks approach, you only need to add the corresponding library SDK to the code and set a small amount of configuration, such as the Kafka address to report to and the sampling rate. Tracing data is then collected automatically and can be queried and analyzed through the Zipkin Server interface, which is very simple.

04 Summary and Outlook

Building a link tracing system on Zipkin + StarRocks provides both Monitoring and Observability capabilities for microservice monitoring and improves its analysis capability and engineering efficiency.
Several follow-up optimizations can further improve the analysis capability and ease of use of the link tracing system:

  1. Use StarRocks' UDAF, window functions, and other features to move the parent-ID traceback into StarRocks as a post-ingestion computation, removing the dependency on Flink and further simplifying the overall system architecture.
  2. At present, fields such as tags in the raw logs are not fully collected. StarRocks is implementing a JSON data type, which will better support nested fields such as tags (a possible interim approach is sketched after this list).
  3. The Zipkin Server interface is still somewhat rudimentary. We have already enabled Zipkin Server to query StarRocks, and we will later optimize the Zipkin Server UI and use StarRocks' computing power to support more metric queries, further improving the user experience.
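For point 2, one interim option before the native JSON type is available would be to land the raw tags object in an extra string column and extract fields at query time with get_json_string. The sketch below is hypothetical: the zipkin table defined above has no tags_json column, and the tag keys are taken from the Figure 10 example.

-- Hypothetical tags_json string column holding a span's raw tags object
select
  get_json_string(tags_json, '$.class') as tag_class,
  get_json_string(tags_json, '$.method') as tag_method,
  count(1) as cnt
from
  zipkin
where
  dt = 20220105
group by
  get_json_string(tags_json, '$.class'),
  get_json_string(tags_json, '$.method');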

05 Reference documents

  1. "Cloud native computing reshapes enterprise IT architecture - distributed application architecture":
    https://developer.aliyun.com/article/717072
  2. What is Upstream and Downstream in Software Development?
    https://reflectoring.io/upstream-downstream/
  3. Metrics, tracing, and logging:
    https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
  4. The 3 pillars of system observability: logs, metrics and tracing:
    https://iamondemand.com/blog/the-3-pillars-of-system-observability-logs-metrics-and-tracing/
  5. Observability 3 ways: logging, metrics and tracing:
    https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing
  6. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure:
    https://static.googleusercontent.com/media/research.google.com/en//archive/papers/dapper-2010-1.pdf
  7. Jaeger: www.jaegertracing.io
  8. Zipkin: https://zipkin.io/
  9. opentracing.io:
    https://opentracing.io/docs/
  10. opencensus.io:
    https://opencensus.io/
  11. opentelemetry.io:
    https://opentelemetry.io/docs/
  12. Microservice Observability, Part 1: Disambiguating Observability and Monitoring:
    https://bravenewgeek.com/microservice-observability-part-1-disambiguating-observability-and-monitoring/
  13. How to Build Observable Distributed Systems:
    https://www.infoq.com/presentations/observable-distributed-ststems/
  14. Monitoring and Observability:
    https://copyconstruct.medium.com/monitoring-and-observability-8417d1952e1c
  15. Monitoring Isn't Observability:
    https://orangematter.solarwinds.com/2017/09/14/monitoring-isnt-observability/
  16. Spring Cloud Sleuth Documentation:
    https://docs.spring.io/spring-cloud-sleuth/docs/current-SNAPSHOT/reference/html/getting-started.html#getting-started
