Recently, when I used Opentracing + Jaeger to monitor my own cloud microservice program, I encountered a memory leak problem that was difficult to troubleshoot. During the performance test, the memory is full of JaegerSpan, JaegerSpanContext, and the strings in it, but I don't know where to start when I look at the code.
Therefore, translate this article The Life of a Span from Juraci Paixao Kroehling, a development engineer at Kiali, to learn about the life cycle of a Span, hoping to bring some inspiration to troubleshooting.
The following is the original translation:
"Span" is an independent unit in distributed link tracking. This article will use an example of a Jaeger application to show the life cycle of spans and how spans reach the background collection and analysis system.
In the world of OpenTracing , and many other distributed link tracings, a "Span" describes a data structure that stores information about individual units of work. In most cases, the programmer will only focus on how to instantiate a tracer and use various tool libraries to record the Span data of interest. As for how the Span reaches the backend collection service such as Jaeger , most programmers do not clear.
Today let's explore the mysteries behind magic.
In the following text, we will focus on how all components work in the default configuration, so that you can clearly compare the differences between the various configuration items, and know that your distributed link tracking is in the background based on your configuration. is how it works.
We will use an application example of Jaeger Java Client to illustrate, and jaeger client libraries in other languages work based on similar principles.
Application configuration
The example used in this blog is very simple and does not register the tracker inGlobalTracer as you usually do. The example just starts an Eclipse Vert.x and then creates a Span for each HTTP request received by the receiver. The code in the example can be found on GitLab, and the most critical parts are excerpted as follows:
package com.example.demo;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import io.jaegertracing.Configuration;
import io.opentracing.Scope;
import io.opentracing.Tracer;
import io.vertx.core.AbstractVerticle;
import io.vertx.core.Vertx;
public class MainVerticle extends AbstractVerticle {
private static final Logger logger = LoggerFactory.getLogger(MainVerticle.class);
public static void main(String[] args) {
Vertx.vertx().deployVerticle(new MainVerticle());
}
@Override
public void start() throws Exception {
String serviceName = System.getenv("JAEGER_SERVICE_NAME");
if (null == serviceName || serviceName.isEmpty()) {
serviceName = "vertx-create-span";
}
System.setProperty("JAEGER_SERVICE_NAME", serviceName);
Tracer tracer = Configuration.fromEnv().getTracer();
vertx.createHttpServer().requestHandler(req -> {
try (Scope ignored = tracer.buildSpan("operation").startActive(true)) {
logger.debug("Request received");
req.response().putHeader("content-type", "text/plain").end("Hello from Vert.x!");
}
}).listen(8080);
logger.info("HTTP server started on port 8080");
}
}
During the startup process, the Jaeger Java Client will create a RemoteReporter instance behind the scenes, thereby starting a daemon thread to continuously flush the Span stored in the cache (see JAEGER_REPORTER_FLUSH_INTERVAL ).
The Reporter will create an instance of UdpSender and use UDP and Thrift protocols to send the captured Span information to the Jaeger Agent running on localhost . An HttpSender can be used instead of the default UdpSender , depending on the tracker configuration.
Create span
Once the tool library, or business code starts a Span, the Jaeger Java Client will use JaegerTracer.SpanBuilder to generate an instance of JaegerS . This instance will contain a reference to the context object ( JaegerSpanContext ). There is a TraceId and a SpanId in the context, which will be consistent with the root node of the Span tree (also called Parent Span).
Span report
After the Span is created, our tool library code will do some specified processing, such as adding a specified HTTP header and writing the reply to the Client. Once this processing is complete, the try-with-resource statement automatically calls the close() method, which in turn calls JaegerSpan#finish() . Then the span is passed to the RemoteReporter by JaegerTracer , and our span looks like this:
JaegerSpan(
tracer=JaegerTracer(
version=Java-0.30.3-SNAPSHOT,
serviceName=vertx-create-span,
reporter=CompositeReporter(
reporters=...
),
sampler=ConstSampler(decision=true, tags={sampler.type=const, sampler.param=true}),
ipv4=-1062731773,
tags={hostname=caju.kroehling.de, jaeger.version=Java-0.30.3-SNAPSHOT, ip=192.168.0.3},
...
),
startTimeMicroseconds=1531757144093000,
startTimeNanoTicks=31909260878670,
computeDurationViaNanoTicks=true,
tags={sampler.type=const, sampler.param=true},
durationMicroseconds=9888,
operationName=operation,
references=[],
context=73abf3be4c32c2b8:73abf3be4c32c2b8:0:1,
logs=null,
finished=true
)
What the RemoteReporter does is simple, it adds the span to a queue and returns control to the caller. So while the real application is being tracked, no IO-blocking will happen, not to mention the extra work the main thread will do for the span.
scour!
Once the spans are queued, the background thread calls UdpSender#append(JaegerSpan) to convert each JaegerSpan into Thrift spans and pass them to the agent. After transformation, our span looks like this:
Span(
traceIdLow:8335023534026965688,
traceIdHigh:0,
spanId:8335023534026965688,
parentSpanId:0,
operationName:operation,
references:[],
flags:1,
startTime:1531757144093000,
duration:9888,
tags:[
Tag(key:sampler.type, vType:STRING, vStr:const),
Tag(key:sampler.param, vType:BOOL, vBool:true)
],
logs:[]
)
This span in Thrift form will be added to a buffer, and the sender will monitor the buffer size. Once the maximum UDP packet size (about 65KB) is reached, or a predetermined time threshold is reached, UdpSender will add a Process object Flush the span queue together, that is, the tracker calls the UdpSender#send(Process, List<JaegerSpan>) method. This function is the trigger for UdpSender to send a Thrift message to the agent. If you're curious about the Thrift message, it looks like this:
\ufffd\ufffd\u0001\temitBatch\u001c\u001c\u0018\u0011vertx-create-span\u0019<\u0018\u0008hostname\u0015\u0000\u0018\u0...
Flash in Agent
Jaeger Agent is a daemon next to the tracked program, its only purpose is to capture the span submitted by the application via UDP, and pass the span to the data collector through a TChannel that keeps the connection.
A batch of Spans received by the Agent is called "batch" and has two main attributes: a Process representing the metadata of the process where Jaeger Tracer is located, and a set of Spans. For this set of spans, their metadata is the same, so representation in a batch may save some overhead. In our example, a batch has only one span, which looks like this before being passed to the collector:
Batch({
Process:Process({
ServiceName:vertx-create-span
Tags:[
Tag({Key:hostname VType:STRING VStr:0xc4203621b0 VDouble:<nil> VBool:<nil> VLong:<nil> VBinary:[]})
Tag({Key:jaeger.version VType:STRING VStr:0xc420362260 VDouble:<nil> VBool:<nil> VLong:<nil> VBinary:[]})
Tag({Key:ip VType:STRING VStr:0xc420362280 VDouble:<nil> VBool:<nil> VLong:<nil> VBinary:[]})
]
})
Spans:[
Span({
TraceIdLow:8335023534026965688
TraceIdHigh:0
SpanId:8335023534026965688
ParentSpanId:0
OperationName:operation
References:[]
Flags:1
StartTime:1531757144093000
Duration:9888
Tags:[
Tag({Key:sampler.type VType:STRING VStr:0xc4203622a0 VDouble:<nil> VBool:<nil> VLong:<nil> VBinary:[]})
Tag({Key:sampler.param VType:BOOL VStr:<nil> VDouble:<nil> VBool:0xc42003ad5b VLong:<nil> VBinary:[]})
]
Logs:[]
})
]
})
reach the collector
Span ends its flash in the agent, and then passes it to Jaeger through the TChannel processor in SpanHandler#SubmitBatches (this processor is only responsible for processing batches in Jaeger format, other formats such as Zipkin will be handled by other processors) to Jaeger collection device.
After the batch is submitted to the collector, the formatted payload looks like this:
[
{
"process": {
"serviceName": "vertx-create-span",
"tags": [
{
"key": "hostname",
"vType": "STRING",
"vStr": "caju.kroehling.de"
},
{
"key": "jaeger.version",
"vType": "STRING",
"vStr": "Java-0.30.3-SNAPSHOT"
},
{
"key": "ip",
"vType": "STRING",
"vStr": "192.168.0.3"
}
]
},
"spans": [
{
"traceIdLow": 8.335023534027e+18,
"traceIdHigh": 0,
"spanId": 8.335023534027e+18,
"parentSpanId": 0,
"operationName": "operation",
"flags": 1,
"startTime": 1.531757144093e+15,
"duration": 9888,
"tags": [
{
"key": "sampler.type",
"vType": "STRING",
"vStr": "const"
},
{
"key": "sampler.param",
"vType": "BOOL",
"vBool": true
}
]
}
]
}
]
The Span processor will process it into Jaeger format, each Span takes a copy of the Process object, and then passes the processed queue to SpanProcessor#ProcessSpans . Our Span now looks like this:
[
{
"trace_id": "AAAAAAAAAABzq\/O+TDLCuA==",
"span_id": "c6vzvkwywrg=",
"operation_name": "operation",
"references": null,
"flags": 1,
"start_time": "2018-07-16T16:05:44.093Z",
"duration": 9888000,
"tags": [
{
"key": "sampler.type",
"v_str": "const"
},
{
"key": "sampler.param",
"v_type": 1,
"v_bool": true
}
],
"logs": null,
"process": {
"service_name": "vertx-create-span",
"tags": [
{
"key": "hostname",
"v_str": "caju.kroehling.de"
},
{
"key": "jaeger.version",
"v_str": "Java-0.30.3-SNAPSHOT"
},
{
"key": "ip",
"v_str": "192.168.0.3"
}
]
}
}
]
At this point, Span may go through a preprocessing process, or it may be filtered out directly. Normally, Spans will arrive at SpanProcessor#saveSpan , if we have more spans in our batch, we will see that this method is called for every span. At this time, a Span Writer will be injected for processing. This writer may be Cassandra, ElastiSearch, Kafka or an in-memory writer to write the Span to the corresponding area.
What the Span looks like in these centralized storage areas is left to the reader to explore. After a Span is converted into a Cassandra or Elasticsearch semantic file, it may contain a lot of detail.
After this, we will not use the "Span" granularity for illustration. In later scenarios, we observe a full-fledged tracking link consisting of many independent spans.
UI and queries
At this stage, our tracking link is lying in the storage area, waiting for the UI system to retrieve it through the query statement.
In the Jager UI, we can search for tracking links based on query terms like service name (vertx-create-span in our case). In the background, when we select the service name and click "Find Traces", the Jaeger query will be organized as follows:
{
"ServiceName": "vertx-create-span",
"OperationName": "",
"Tags": {
},
"StartTimeMin": "2018-07-16T17:06:38.426+02:00",
"StartTimeMax": "2018-07-16T18:06:38.426+02:00",
"DurationMin": 0,
"DurationMax": 0,
"NumTraces": 20
}
The query's APIHandler#Search method parses the query terms and passes them to each storage-specific Span Reader. Based on the service name, our link will be found and added to the result list, which in the background looks like this:
{
"traceID": "73abf3be4c32c2b8",
"spans": [
{
"traceID": "73abf3be4c32c2b8",
"spanID": "73abf3be4c32c2b8",
"flags": 1,
"operationName": "operation",
"references": [
],
"startTime": 1.531757144093e+15,
"duration": 9888,
"tags": [
{
"key": "sampler.type",
"type": "string",
"value": "const"
},
{
"key": "sampler.param",
"type": "bool",
"value": true
}
],
"logs": [
],
"processID": "p1",
"warnings": null
}
],
"processes": {
"p1": {
"serviceName": "vertx-create-span",
"tags": [
{
"key": "hostname",
"type": "string",
"value": "caju.kroehling.de"
},
{
"key": "ip",
"type": "string",
"value": "192.168.0.3"
},
{
"key": "jaeger.version",
"type": "string",
"value": "Java-0.30.3-SNAPSHOT"
}
]
}
},
"warnings": null
}
All messages are organized into a specific format in the background, and then the UI will receive data like this:
{
"data": [
{
"traceID": "73abf3be4c32c2b8",
"spans": [
{
"traceID": "73abf3be4c32c2b8",
"spanID": "73abf3be4c32c2b8",
"flags": 1,
"operationName": "operation",
"references": [
],
"startTime": 1.531757144093e+15,
"duration": 9888,
"tags": [
{
"key": "sampler.type",
"type": "string",
"value": "const"
},
{
"key": "sampler.param",
"type": "bool",
"value": true
}
],
"logs": [
],
"processID": "p1",
"warnings": null
}
],
"processes": {
"p1": {
"serviceName": "vertx-create-span",
"tags": [
{
"key": "hostname",
"type": "string",
"value": "caju.kroehling.de"
},
{
"key": "ip",
"type": "string",
"value": "192.168.0.3"
},
{
"key": "jaeger.version",
"type": "string",
"value": "Java-0.30.3-SNAPSHOT"
}
]
}
},
"warnings": null
}
],
"total": 0,
"limit": 0,
"offset": 0,
"errors": null
}
According to this format, Jaeger UI will iterate over the results and display their information beautifully on the screen:
In most scenarios, clicking on a link displayed on the screen will not trigger a re-query of the UI. But if we open the link and hit "Refresh", the UI will execute an ApiHandler#getTrace query. This query retrieves all related spans from the store for the given ID, and returns a data structure similar to the following:
{
"spans": [
{
"trace_id": "AAAAAAAAAABzq\/O+TDLCuA==",
"span_id": "c6vzvkwywrg=",
"operation_name": "operation",
"references": null,
"flags": 1,
"start_time": "2018-07-16T16:05:44.093Z",
"duration": 9888000,
"tags": [
{
"key": "sampler.type",
"v_str": "const"
},
{
"key": "sampler.param",
"v_type": 1,
"v_bool": true
}
],
"logs": null,
"process": {
"service_name": "vertx-create-span",
"tags": [
{
"key": "hostname",
"v_str": "caju.kroehling.de"
},
{
"key": "ip",
"v_str": "192.168.0.3"
},
{
"key": "jaeger.version",
"v_str": "Java-0.30.3-SNAPSHOT"
}
]
}
}
],
"process_map": null
}
Since we only have one link, and there is only one span in it, the payload obtained by the UI is the same as the previous "search" operation. However, the data is presented differently:
Span afterlife
So far we've covered almost all stages of Span's life, from its origins until it finally reports information about the applications it tracks. Since then, Spans may also appear in some afterlife scenarios, such as becoming a data point in Prometheus , or integrating with other Spans in the Grafana dashboard, and so on. Eventually, the admin might decide to clean up some historical data , and then our span ends the cycle of life and disappears into history.
Question follow up
After that, the colleagues who developed together discovered the problem in our use of Jaeger Span. The main reason is that the statement created by Span was not put into try with resource, which resulted in the related resources in the memory not being recovered, resulting in a memory leak. The interception code is as follows:
Original code:
@Around("@annotation(com.company.department.project.opentracing.Traced)")
public Object aroundTracedAdvice(ProceedingJoinPoint jp) throws Throwable {
Object res = null;
String className = jp.getTarget().getClass().getName();
String methodName = jp.getSignature().getName();
Span span = tracer.buildSpan(className + "." + methodName).withTag("class", className)
.withTag("method", methodName).start();
tracer.activateSpan(span);
try {
res = jp.proceed();
} catch (Throwable t) {
TracingUtils.onError(t, span);
throw t;
} finally {
span.finish();
}
return res;
}
Modified code: Note that the scope is placed in try with resource.
@Around("@annotation(com.company.department.project.opentracing.Traced)")
public Object aroundTracedAdvice(ProceedingJoinPoint jp) throws Throwable {
Object res = null;
String className = jp.getTarget().getClass().getName();
String methodName = jp.getSignature().getName();
Span span = tracer.buildSpan(className + "." + methodName).withTag("class", className)
.withTag("method", methodName).start();
try (Scope scope = tracer.activateSpan(span)) {
res = jp.proceed();
} catch (Throwable t) {
TracingUtils.onError(t, span);
throw t;
} finally {
span.finish();
}
return res;
}
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。