
This article takes Google's Dapper paper as its entry point, extends to the related papers it cites, and follows the historical timeline to show readers how software practitioners explored and practiced distributed tracing technology.

Looking at history with questions

When it comes to distributed tracing, most people think of mature open-source tracing software such as Zipkin, Jaeger, and SkyWalking, as well as open standards such as OpenTelemetry, OpenTracing, and OpenCensus. Although their implementations differ, the tracing systems built from various combinations of this software and these standards share many similarities.

For example, these tracing systems all propagate metadata along the call chain, and their definitions of that metadata are similar: a trace id unique to the whole chain, a parent id pointing to the parent span, and a span id identifying the current span. They all collect tracing information asynchronously and in a decentralized way, then aggregate the traces offline. They all support trace sampling, and so on.

The architectures and models of these tracing systems look so similar that I could not help asking some questions: when developers design a tracing system, do they all start from the same idea? Why does metadata need to be passed along the call chain? Is all of that metadata necessary? Can a system adopt tracing without intrusive code changes? Why report asynchronously and in a decentralized way, then aggregate offline? What is trace sampling for?

Carrying these questions, I found the common source of inspiration for all this tracing software: Google's Dapper paper. Reading the original text and the papers it cites gradually resolved my doubts.

Exploring the black-box mode

In early academic exploration of tracing call state in distributed systems, some researchers held that every application or middleware in a distributed system should be treated as a black box, and that tracing should not intrude into the application. At the time Spring did not yet exist, and inversion of control and aspect-oriented programming were not widely used, so intruding into an application meant modifying its code by hand. For engineers, that extra barrier to adoption was too high, and such a tracing tool would be hard to promote.

If you may not intrude into the application or modify its code, you can only obtain and record trace information from outside the application. Because of the black-box constraint, that information is scattered and cannot be stitched together; how to connect these fragments into chains became the problem to solve.

"Performance Debugging for Distributed Systems of Black Boxes"

This paper, published in 2003, explores call-chain monitoring under the black-box assumption and proposes two algorithms for discovering trace relationships.

The first algorithm, called the "nesting algorithm", first associates the request link and the return link of a cross-service call through a unique id, forming a call pair. It then uses timing order to associate call pairs at the same level or between adjacent levels (see Figure 1).

[Figure 1]

If the application were single-threaded, this algorithm would work fine. But production applications are usually multi-threaded, so this method cannot reliably match the relationships between links. Although the paper proposes a scoreboard-style penalty that can down-weight some incorrectly associated links, the method still has problems with services based on asynchronous RPC calls.

The other algorithm, called the "convolution algorithm", treats each round-trip link as an independent unit, regards each independent call pair as a time signal, and uses signal-processing techniques to find correlations between the signals. The advantage of this algorithm is that it also works for services based on asynchronous RPC calls. However, if the actual call chain contains loops, the convolution algorithm derives extra call chains beyond the real one. For example, for the call chain A -> B -> C -> B -> A, the convolution algorithm yields not only the real chain but also a chain A -> B -> A. If a node appears multiple times on a chain, this algorithm is likely to produce a large number of spurious derived chains.
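The signal-processing idea above can be sketched with a toy cross-correlation: treat the message timestamps observed at two nodes as binary time signals, and look for the lag at which they align best. The function name and the data are invented for illustration; the paper's actual algorithm operates on real message traces.

```python
# Toy sketch of the "convolution algorithm" idea: correlate two message
# timestamp signals to estimate the delay that best aligns them.

def cross_correlation(a, b, max_lag):
    """Score each candidate lag by how well b, shifted by lag, matches a."""
    scores = {}
    for lag in range(max_lag + 1):
        scores[lag] = sum(a[t] * b[t + lag] for t in range(len(a) - lag))
    return scores

# Node A sends messages at t = 0, 5, 10; node B observes them 2 ticks later.
ticks = 16
a = [0] * ticks
b = [0] * ticks
for t in (0, 5, 10):
    a[t] = 1
    b[t + 2] = 1

scores = cross_correlation(a, b, max_lag=4)
best_lag = max(scores, key=scores.get)  # lag with the strongest correlation
print(best_lag)
```

The recovered lag suggests a causal link between the two nodes, which is exactly the kind of probabilistic inference (rather than exact association) that the black-box approach is limited to.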

In the black-box mode, relationships between links can only be inferred through probabilistic statistics, and probabilities are only probabilities: there is no way to determine the correlations between links exactly.

Another way of thinking

How can we derive the relationships between call links exactly? The following paper offers some ideas and practices.

Pinpoint: Problem Determination in Large, Dynamic Internet Services

Note: this Pinpoint is not the pinpoint-apm project on GitHub

The research object of this paper is mainly monolithic applications composed of different components, though the approach can be extended to distributed clusters. The Pinpoint architecture in the paper is divided into three parts (see Figure 2). Tracing and Trace Log form the first part, Client Request Trace, which collects trace logs. Internal F/D, External F/D, and Fault Log form the second part, Failure Detection, which collects fault logs. Statistical Analysis is the third part, Data Clustering Analysis, which analyzes the collected log data to produce failure-detection results.

[Figure 2]

The Pinpoint architecture designs a data format that can be used effectively by data-mining analysis methods. As shown in Figure 3, each call chain serves as one sample, marked with a unique request id; the sample's attributes record the program components the call chain passes through and its failure status.

[Figure 3]

To associate the Trace Logs and Fault Logs of each call, the paper uses a Java application as an example to describe how this association can be implemented in code. The key points of Pinpoint's practice chapter can be summarized as follows:

Generate an id for each component
For each HTTP request, generate a unique request id and pass it along via a thread-local variable (ThreadLocal)
For new threads created during the request, modify the thread-creation class so that the request id continues to be passed
For RPC calls made during the request, modify the caller code to carry the request id in the header, then parse the header on the receiving end and inject the id into its thread-local variable
Each time a component is called, record a Trace Log entry as (request id, component id)
For Java applications, these points are technically simple and highly practicable, and they provide the basic idea behind link stitching and context propagation in today's tracing systems.
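The steps above can be sketched in a few lines. This is a minimal illustration in Python (whose `threading.local` plays the role of Java's ThreadLocal); the function names and the component ids are invented, not from the paper.

```python
import threading
import uuid

# Sketch of the Pinpoint-style practice: one request id per request, kept
# in thread-local storage, with every component call logging a
# (request_id, component_id) pair.

_context = threading.local()
trace_log = []

def start_request():
    """Generate a unique request id at the entry point of a request."""
    _context.request_id = uuid.uuid4().hex

def record(component_id):
    """Log that this request passed through the given component."""
    trace_log.append((_context.request_id, component_id))

def handle_request():
    start_request()
    record("web-frontend")   # hypothetical component ids
    record("db-client")

handle_request()
```

Because every entry carries the same request id, the scattered component logs can later be grouped back into one call chain, which is precisely what the black-box algorithms could only approximate.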

This paper was published in 2002, when the current Java version was 1.4. Java already had thread-local variables (ThreadLocal), so carrying information within a thread was relatively easy. But because aspect-oriented programming was not yet popular (Spring appeared in 2003, and javaagent only became available with Java 1.5 in 2004), such a method could not be applied widely. Looked at the other way around, it may be precisely these programming needs that pushed forward the field of Java aspect-oriented programming.

Rebuilding the call chain

X-Trace: A Pervasive Network Tracing Framework

The main research object of this paper is network links in distributed clusters. The X-Trace paper continues and expands the ideas of the Pinpoint paper, proposing a framework and model that can reconstruct a complete call chain. To achieve this goal, the paper defines three design principles:

1. Carry metadata within the call chain (data passed along the call chain is also called in-band data)
2. Reported trace data does not stay within the call chain; the mechanism that collects trace data must be orthogonal to the application itself (note: trace data kept outside the call chain is also called out-of-band data)
3. The entity that injects metadata should be decoupled from the entity that collects the reports

Principles 1 and 2 are still in use today. Principle 1 extends Pinpoint's thinking: the metadata passed along the chain grows beyond the original request id, and its TaskID, ParentID, and OpID are the predecessors of trace id, parent id, and span id. The word span also appears in the abstract of the X-Trace paper; its use in Dapper may be a tribute by Dapper's authors to the authors of X-Trace.

Let's take a look at the X-Trace metadata definition:

1. Flags

A bit array marking which of TreeInfo, Destination, and Options are in use

2. TaskID

A globally unique id identifying the call chain

3. TreeInfo

ParentID - id of the parent node, unique within the call chain
OpID - id of the current operation, unique within the call chain
EdgeType - NEXT means a sibling relationship, DOWN means a parent-child relationship

4. Destination

Specifies the report address

5. Options

A reserved field for extensions

In addition to the metadata definition, the paper defines two propagation operations, pushDown() and pushNext(). pushDown() copies metadata down to the next layer; pushNext() propagates metadata from the current node to the next node on the same chain.

[Figure 4: pseudocode for pushDown() and pushNext()]

[Figure 5: where pushDown() and pushNext() execute in the call chain]

The structure of the trace data reported by X-Trace follows the second design principle. As shown in Figure 6, X-Trace provides a lightweight client library through which the application forwards trace data to a local daemon. The daemon listens on a UDP port, receives the data sent by the client library, and puts it into a queue. From the other end of the queue, the data is sent onward according to the configuration carried with it: to a database, a data-forwarding service, a data-collection service, or a data-aggregation service.

[Figure 6]

X-Trace's three design principles, its definitions of in-band and out-of-band data, its metadata propagation operations, and its reporting architecture have all been referenced by today's tracing systems. Looking at Zipkin's collector or Jaeger's jaeger-agent, you can see the shadow of X-Trace's reporting architecture to some extent.

Large-scale commercial practice - Dapper

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

Dapper is the system Google uses internally to give developers information about the behavior of complex distributed systems. The Dapper paper introduces Google's experience designing and operating this distributed tracing infrastructure. It was published in 2010, by which point, according to the paper, Dapper had already had two years of practical experience inside Google.

The main purpose of the Dapper system is to provide developers with information about the behavior of complex distributed systems. The paper analyzes what problems such a system must solve, and from them derives two basic design requirements: large-scale deployment and continuous monitoring. Against these requirements, three concrete design goals are proposed:

Low overhead: the tracing system must ensure that its performance impact on online services is negligible. Even a small monitoring cost can have a noticeable effect on some highly optimized services, and may even force the deployment team to shut the tracing system down.
Application-level transparency: developers should not need to be aware of the tracing facilities. A tracing system that depends on application developers' cooperation to work becomes extremely fragile, often broken by bugs or negligence, which violates the requirement of large-scale deployment.
Scalability: the tracing system must be able to keep up with the scale of Google's services and clusters for the next few years.

Although Dapper's design concepts are similar to those of Pinpoint, Magpie, and X-Trace, Dapper has some unique designs of its own. One of them is that, to achieve the low-overhead goal, Dapper samples the request traces it collects. According to Dapper's practical experience at Google, for many common scenarios, sampling even 1/1000 of requests yields sufficient information.

Another unique feature is the very high application-level transparency Dapper achieves. This benefits from the relatively high homogeneity of Google's deployed application clusters: the tracing implementation can be confined to the low-level software without adding extra annotations in applications. For example, if all applications in a cluster use the same HTTP library, message-notification library, thread-pool factory, and RPC library, the tracing facilities can be confined to those code modules.

How to define link information?

The paper first gives a simple call-chain example, as shown in Figure 7. The authors argue that a distributed trace of a request needs to collect the identifier of each message along with the events and timestamps corresponding to the message. If we consider only RPC, the call chain can be understood as a nested tree of RPCs; of course, Google's internal data model is not limited to RPC calls.


[Figure 7]

Figure 8 illustrates the structure of the Dapper trace tree. The nodes of the tree are the basic units, called spans; the edges are the connections between spans. A span is simply a record carrying start and end timestamps, RPC latency, or application-specific annotations. To make the trace tree reconstructible, a span must also include the following information:

span name: a human-readable name, such as Frontend.Request
span id: a 64-bit unique identifier
parent id: the id of the parent span

[Figure 8]

Figure 9 shows the details of an RPC span. It is worth noting that a single span may contain information from multiple hosts; in fact, every RPC span contains annotations for both client-side and server-side processing. Because the client and server timestamps come from different hosts, special attention must be paid to anomalies between these clocks.

[Figure 9]

How to achieve application-level transparency?

By adding instrumentation points to a few common libraries, Dapper achieves distributed tracing that does not interfere with application developers. The main practices are as follows:

When a thread is processing a traced request path, Dapper associates the trace context with thread-local storage. The trace context is a small, easily copyable bundle of span information.
When computation is deferred or asynchronous, most Google developers use a common control-flow library to construct callbacks and schedule them on a thread pool or another executor. Dapper ensures that every callback stores the trace context when it is created, and that the context is associated with the correct thread when the callback is executed.
Almost all of Google's inter-process communication is built on one RPC framework, with implementations in both C++ and Java. Instrumentation added to the framework defines spans for all RPC calls; in a traced RPC, the span and trace ids are passed from client to server. At Google this is an essential instrumentation point.

Conclusion

Dapper's paper presents a data model that is easy to read and helpful for problem location, application-level transparent instrumentation practices, and a low-overhead design, clearing many obstacles to the industrial use of distributed tracing and inspiring many developers. Since the Google Dapper paper came out, a variety of tracing systems have appeared: Twitter open-sourced Zipkin in 2012, Naver open-sourced Pinpoint, Wu Sheng open-sourced SkyWalking in 2015, Uber open-sourced Jaeger, and so on. Since then, distributed tracing has entered an era in which a hundred schools of thought contend.


"Yunjian Big Coffee" is a special column of the Tencent Cloud Plus community, in which invited industry leaders interpret hot technologies of the cloud era and explore new opportunities for industry development.

Author: 腾讯云开发者