https://mp.weixin.qq.com/s/riRAS56VeWE3oHcuzorQyA


The value of full-link tracing

The value of distributed tracing lies in "association." End users, back-end applications, and cloud components (databases, message queues, and so on) together form the topology that a trace passes through. The wider the coverage of this topology, the greater the value of tracing. Full-link tracing is the practice of covering all associated IT systems so that the path and status of each user request across systems are recorded completely.

Complete full-link tracing brings three core benefits to the business: end-to-end problem diagnosis, inter-system dependency mapping, and propagation of custom tags.

• End-to-end problem diagnosis: A VIP customer fails to place an order, or a beta user's request times out. Many end-user experience problems turn out to be caused by faults in back-end applications or cloud components, and full-link tracing is the most effective way to diagnose such end-to-end problems.
• Inter-system dependency mapping: New services launch, old services are retired, data centers are relocated, and architectures are upgraded. The dependencies between IT systems have become too intricate to sort out by hand. Topology discovery based on full-link tracing makes decisions in these scenarios more agile and more reliable.
• Custom tag propagation: Full-link stress testing, user-level canary (grayscale) release, order tracing, and traffic isolation all rely on tiered processing and data association driven by custom tags, and a rich full-link ecosystem has grown up around them. However, once the link is broken and a tag is lost, the consequences for business logic can be unpredictable and severe.

Challenges and solutions for full-link tracing

The value of full-link tracing grows with coverage, and so do its challenges. To keep links as complete as possible, every participant must follow the same tracing specification and exchange data with the others, whether it is a front-end application or a cloud component, written in Java or Go, deployed on a public cloud or in a self-built data center. Unifying the multi-language protocol stack, linking the front end, back end, and cloud components, and integrating data across clouds are the three major challenges, as shown in the following figure:

1. Unifying the multi-language protocol stack

In the cloud-native era, multi-language application architectures are increasingly common, and using each language's strengths to get the best performance and development experience has become a trend. However, tracing support is not equally mature in every language, so full-link tracing cannot offer identical capabilities everywhere. The mainstream industry practice is to first standardize the format at the remote-call protocol layer, and then implement call interception and context propagation inside each language's applications, which guarantees the integrity of the basic trace data.

However, most production problems cannot be located and resolved with basic tracing capabilities alone. Production systems are complex enough that a good tracing product must offer more comprehensive diagnostic capabilities, such as code-level diagnosis, memory analysis, thread pool analysis, and lossless statistics. Making full use of the diagnostic interfaces each language provides, and exposing as much capability as each language allows, is the basis for a tracing product's continued development.

Standardized propagation protocol: Every application on the link must follow the same propagation standard so that the trace context passes intact between applications written in different languages, with no broken links or missing context. Mainstream open-source propagation formats currently include Jaeger, SkyWalking, and Zipkin.
Making the most of each language's capabilities: Beyond the basic call chain, tracing has gradually grown advanced capabilities such as application/service monitoring, method stack tracing, and performance profiling. Differences in language maturity lead to large differences in product capability. For example, Java probes can implement many advanced edge-side diagnostics on top of JVMTI. A good full-link tracing solution exploits the distinctive technical advantages of each language rather than converging on the lowest common denominator. Interested readers can refer to the article "Open source self-built/hosted and commercialized self-developed Trace, how to choose".

2. Front-end, back-end, and cloud (multi-end) linkage

Current open-source tracing implementations concentrate on the back-end business application layer and lack effective instrumentation for user terminals and cloud components (such as cloud databases). The main reason is that the latter two are usually provided by cloud service providers or third-party vendors, so support depends on how well those vendors accommodate open-source standards, and it is hard for the business side to add instrumentation itself.

The direct consequence is that when a front-end page responds slowly, it is hard to pinpoint which back-end application or service is responsible, and no definite root cause can be given. Likewise, a cloud component anomaly is hard to tie directly to a business application anomaly, especially when multiple applications share one database instance. Verification then requires roundabout methods, and troubleshooting is very inefficient.

To solve these problems, cloud service providers first need to support open-source tracing standards better: instrument core methods, and support open-source context propagation and data reporting. For example, Alibaba Cloud ARMS front-end monitoring supports Jaeger-format propagation and method stack tracing.

Second, because ownership boundaries and similar constraints may prevent different systems from adopting one protocol stack across the full link, the tracing system needs to provide a way to connect heterogeneous protocol stacks in order to achieve multi-end linkage.

Connecting heterogeneous protocol stacks

To connect heterogeneous protocol stacks (Jaeger, SkyWalking, Zipkin), the tracing system needs two capabilities. The first is protocol conversion with dynamic configuration. For example, suppose the front end propagates the Jaeger format downstream while a newly connected external downstream system uses the Zipkin B3 format. A Node.js application sitting between the two can accept the Jaeger format and propagate the Zipkin B3 format downstream, which keeps tag propagation intact across the full link. The second is server-side data format conversion: either convert the different reported formats into one unified format for storage, or handle compatibility on the query side. The former has lower maintenance costs; the latter costs more to keep compatible but is more flexible.
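
The first capability, header conversion, can be sketched as follows. This is a minimal Python illustration, not the ARMS implementation or the Node.js application mentioned above. The function name jaeger_to_b3 is hypothetical. It assumes the incoming value follows the Jaeger layout {trace-id}:{span-id}:{parent-span-id}:{flags}, with bit 0x01 of flags meaning "sampled" and bit 0x02 meaning "debug", and that the downstream system expects the multi-header form of B3 (X-B3-TraceId and related headers).

```python
from typing import Dict


def jaeger_to_b3(uber_trace_id: str) -> Dict[str, str]:
    """Map a Jaeger uber-trace-id value onto Zipkin B3 multi-headers."""
    parts = uber_trace_id.split(":")
    if len(parts) != 4:
        raise ValueError("expected {trace-id}:{span-id}:{parent-span-id}:{flags}")
    trace_id, span_id, parent_id, flags = parts
    flag_bits = int(flags, 16)

    # B3 IDs are fixed-width lower-hex; Jaeger values may omit leading zeros.
    trace_width = 16 if len(trace_id) <= 16 else 32
    headers = {
        "X-B3-TraceId": trace_id.lower().zfill(trace_width),
        "X-B3-SpanId": span_id.lower().zfill(16),
    }
    # A parent of 0 marks a root span, so no parent header is emitted.
    if int(parent_id, 16) != 0:
        headers["X-B3-ParentSpanId"] = parent_id.lower().zfill(16)
    if flag_bits & 0x02:
        # Debug implies sampling in B3, so only the flags header is sent.
        headers["X-B3-Flags"] = "1"
    else:
        headers["X-B3-Sampled"] = "1" if flag_bits & 0x01 else "0"
    return headers
```

The sketch only shows the field mapping. A real intermediary would normally also start its own span and forward that span's ID, rather than passing the caller's span ID through unchanged.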

3. Cross-cloud data fusion

Many large enterprises choose multi-cloud deployment for stability or data security reasons. For example, domestic systems are deployed on Alibaba Cloud, overseas systems on AWS, and systems that handle sensitive internal data in self-built data centers. Multi-cloud deployment has become a typical architecture, but network isolation between environments and differences in infrastructure also pose major challenges for operations staff.

Because cloud environments can generally reach each other only over the public network, link integrity under a multi-cloud architecture can be achieved through approaches such as cross-cloud reporting of trace data or cross-cloud querying. Either way, the goal is unified visibility of multi-cloud data, so that problems can be located and analyzed quickly using complete trace data.

Cross-cloud reporting

Cross-cloud reporting of trace data is relatively easy to implement, maintain, and manage, and it is the mainstream approach among cloud vendors today. For example, Alibaba Cloud ARMS achieves multi-cloud data fusion through cross-cloud reporting.

The advantages of cross-cloud reporting are low deployment cost and the simplicity of maintaining a single set of servers. The disadvantage is that cross-cloud transmission consumes public network bandwidth, so public network traffic cost and stability are significant constraints. Cross-cloud reporting suits a one-primary, multiple-secondary architecture, in which most nodes are deployed in one cloud environment and the other clouds or self-built data centers carry only a small share of business traffic. In such a setup, the self-built data center is a good fit for cross-cloud reporting, as shown in the figure below.

Cross-cloud query

Cross-cloud query means the raw trace data stays in the network of the cloud where it was produced. A user query is dispatched to each cloud separately, and the results are aggregated for unified processing, which reduces public network transmission costs.

The advantage of cross-cloud query is that little data crosses networks. In particular, the volume of trace data actually queried is usually less than one ten-thousandth of the raw data volume, which saves a great deal of public network bandwidth. The disadvantages are that multiple data processing nodes must be deployed and that complex computations such as quantiles and global TopN are not supported. It suits a multi-primary architecture and can support simple trace stitching and max/min/avg statistics.
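
Why max/min/avg survive cross-cloud query while quantiles do not can be illustrated with a small sketch. It assumes each cloud answers a query with a compact partial aggregate (count, sum, min, max) computed next to its own raw data. The PartialStats type and the function names are hypothetical and not part of any ARMS API, and every input list is assumed to be non-empty.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PartialStats:
    """Per-cloud summary of span durations, in milliseconds."""
    count: int
    total_ms: float
    min_ms: float
    max_ms: float

    @property
    def avg_ms(self) -> float:
        return self.total_ms / self.count


def summarize(durations_ms: List[float]) -> PartialStats:
    """Runs inside one cloud, next to the raw trace data."""
    return PartialStats(len(durations_ms), sum(durations_ms),
                        min(durations_ms), max(durations_ms))


def merge(parts: Iterable[PartialStats]) -> PartialStats:
    """Runs on the query side; only the small summaries cross the network."""
    parts = list(parts)
    return PartialStats(
        count=sum(p.count for p in parts),
        total_ms=sum(p.total_ms for p in parts),
        min_ms=min(p.min_ms for p in parts),
        max_ms=max(p.max_ms for p in parts),
    )
```

Merging these summaries yields the same count, min, max, and average as a computation over all raw data. A median or P99 has no such property: per-cloud quantiles cannot in general be combined into the global quantile without shipping the raw values or some mergeable approximation.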

There are two ways to implement cross-cloud query. One is to build a centralized set of data processing nodes inside the cloud network and connect to user networks over private dedicated lines, so that one deployment can process the data of multiple users. The other is to build a separate set of data processing nodes inside each user's VPC. The former has lower maintenance costs and more elastic capacity; the latter offers better data isolation.

Other methods

In addition to the two approaches above, a hybrid mode or a propagation-only mode can also be used in practice.

Hybrid mode reports statistical data over the public network for centralized processing (small data volume, high accuracy requirements), while trace data is retrieved through cross-cloud query (large data volume, low query frequency).

Propagation-only mode means that only the trace context is propagated intact between cloud environments, while storage and query of trace data are implemented independently in each. The advantage is extremely low implementation cost: each cloud only needs to follow the same propagation protocol, and everything else can be implemented independently. Traces are then stitched together manually using the same TraceId or application name. This mode suits rapid integration of existing systems at minimal migration cost.
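
The manual stitching step can be sketched as below. The span records are hypothetical dictionaries with trace_id, span_id, and start_us fields, standing in for whatever each cloud's own storage returns. Clock skew between environments is ignored here, although it would matter in practice.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Span = Dict[str, object]  # hypothetical record: trace_id, span_id, start_us


def stitch_by_trace_id(*sources: Iterable[Span]) -> Dict[str, List[Span]]:
    """Group spans fetched from independent stores by their shared TraceId."""
    traces: Dict[str, List[Span]] = defaultdict(list)
    for source in sources:
        for span in source:
            traces[str(span["trace_id"])].append(span)
    for spans in traces.values():
        # Order by start time so the combined call chain reads top to bottom.
        spans.sort(key=lambda s: s["start_us"])
    return dict(traces)
```

Each argument is the query result from one environment, so the function can be called with as many clouds or data centers as the link crosses.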

Full-link tracing access practice

The previous sections described the challenges and solutions for full-link tracing in various scenarios. Next, taking Alibaba Cloud ARMS as an example, this section shows how to build, from scratch, a complete observability system that covers the front end, gateway, server, containers, and cloud components.

Header propagation format: Use the Jaeger format throughout. The key is uber-trace-id, and the value is {trace-id}:{span-id}:{parent-span-id}:{flags}.
Front-end access: Two low-code access methods are available, CDN (script injection) or NPM, covering Web/H5, Weex, and various mini program scenarios.
Back-end access:
- For Java applications, the ARMS Agent is recommended as the first choice. It instruments the application non-intrusively with no code changes and supports advanced features such as edge-side diagnosis, lossless statistics, and precise sampling. User-defined methods can be instrumented explicitly through the OpenTelemetry SDK.
- For non-Java applications, it is recommended to instrument with Jaeger and report data to the ARMS endpoint. ARMS handles trace propagation and display across applications written in different languages.
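
To make the header format concrete, the following sketch parses and re-emits an uber-trace-id value and derives the context for a downstream call. It is illustrative only: the SpanContext class and helper names are hypothetical, new span IDs are assumed to be 64-bit values written as 16 hex characters, and in practice the Jaeger or OpenTelemetry SDK performs this work.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanContext:
    trace_id: str        # hex string shared by every span in the trace
    span_id: str         # hex string identifying this span
    parent_span_id: str  # "0" when the span has no parent
    flags: int           # bit 0x01 = sampled

    def to_header(self) -> str:
        """Render the value carried under the uber-trace-id key."""
        return f"{self.trace_id}:{self.span_id}:{self.parent_span_id}:{self.flags:x}"


def parse_uber_trace_id(value: str) -> SpanContext:
    """Split {trace-id}:{span-id}:{parent-span-id}:{flags} into its fields."""
    parts = value.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got {len(parts)}")
    trace_id, span_id, parent_span_id, flags = parts
    return SpanContext(trace_id, span_id, parent_span_id, int(flags, 16))


def child_context(parent: SpanContext) -> SpanContext:
    """Context for an outgoing call: same trace, new span, parent = caller."""
    return SpanContext(parent.trace_id, secrets.token_hex(8),
                       parent.span_id, parent.flags)
```

The trace ID and flags stay constant along the link while each hop contributes a new span ID, which is what lets the back end reassemble the call chain from independently reported spans.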

The current ARMS full-link tracing solution is based on the Jaeger protocol. Support for the SkyWalking protocol is under development so that users of self-built SkyWalking can migrate losslessly. The call chain across the front end, Java applications, and non-Java applications is shown in the following figure:

1. Front-end access practice

ARMS front-end monitoring supports Web/H5, Weex, Alipay and WeChat mini programs, and more. This article uses a Web application accessing ARMS front-end monitoring through CDN as an example to outline the access process. For detailed guidance, refer to the official ARMS front-end monitoring documentation.

Log in to the ARMS console, click Access Center in the left navigation bar, and select front-end Web/H5 access.
Enter the application name and click Create. In the SDK extension configuration area, select the options you need to generate the BI probe code to insert into the page.
Select asynchronous loading, copy the following code, paste it into the <body> element of the page's HTML, and then restart the application.

<script>
!(function(c,b,d,a){c[a]||(c[a]={});c[a].config={pid:"xxx",imgUrl:"https://arms-retcode.aliyuncs.com/r.png?", 
enableLinkTrace: true, linkType: 'tracing'};
with(b)with(body)with(insertBefore(createElement("script"),firstChild))setAttribute("crossorigin","",src=d)
})(window,document,"https://retcode.alicdn.com/retcode/bl.js","__bl");
</script>

To link the front end and back end, the probe code above must include the following two parameters:

enableLinkTrace: true // Turns on front-end link tracing
linkType: 'tracing' // Generates trace data in the Jaeger format and propagates uber-trace-id in the request header

In addition, if the API is not same-origin with the current application, add the enableApiCors: true parameter. The back-end server must also accept cross-origin requests and the custom header. For details, refer to the front-end link related documents. To verify that front-end to back-end tracing is working, open the browser console and check whether the Request Headers of the corresponding API request contain uber-trace-id.
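
On the back-end side, accepting the custom header typically means answering the browser's CORS preflight request with uber-trace-id listed in Access-Control-Allow-Headers. The sketch below shows only the response headers involved. The function name, the allowed-origin set, and the method list are assumptions for illustration; in practice the web framework or gateway in use would expose this through its own CORS configuration.

```python
from typing import Dict, Set

TRACE_HEADER = "uber-trace-id"


def preflight_headers(origin: str, allowed_origins: Set[str]) -> Dict[str, str]:
    """Response headers for an OPTIONS preflight from a cross-origin page."""
    if origin not in allowed_origins:
        return {}  # no CORS headers: the browser will block the request
    return {
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
        # Without this entry the browser refuses to send the trace header.
        "Access-Control-Allow-Headers": f"Content-Type, {TRACE_HEADER}",
        "Vary": "Origin",
    }
```

If the preflight response omits the trace header from the allow list, the browser blocks the API call itself, so a missing entry shows up as failed requests rather than merely as missing trace data.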

2. Java application access practice

For Java applications, the ARMS Java agent is recommended. The non-intrusive agent works out of the box and requires no changes to business code. For a detailed access guide, refer to the official ARMS application monitoring documentation.

Log in to the ARMS console, click Access Center in the left navigation bar, and select back-end Java access.
Choose manual installation, script installation, or container service installation according to your needs.
Following the operation guide, download the agent and decompress it locally. After correctly configuring the appName, LicenseKey, and javaagent startup parameters, restart the application.

3. Non-Java application access practice

Non-Java applications can report data to the ARMS endpoint through an open-source SDK (such as Jaeger). For detailed guidance, refer to the official ARMS application monitoring documentation.

Log in to the ARMS console, click Access Center in the left navigation bar, and select the back-end Go/C++/.NET/Node.js access method.
Replace the endpoint as described in the operation guide, and restart the application after the configuration is complete.

Full-link tracing is only the beginning, not the end

Since Google published the Dapper paper in 2010, distributed tracing has been developing for more than ten years. Yet books and in-depth articles on tracing remain scarce, and most blog posts only introduce open-source concepts briefly or walk through a QuickStart. How can a large enterprise build a tracing system that is truly usable and easy to use? Which pitfalls must be addressed, and which hazards avoided? A systematic, comprehensive answer is hard to find.

Full-link tracing access is only the starting point. Choosing a solution that fits your business architecture avoids some detours, but tracing is more than viewing call chains and monitoring services. How can it empower the business and extend into business observability to support business decisions? How can it be linked with infrastructure observability to detect resource risks early? Much work remains, and we look forward to more readers joining in and sharing.

Related Links:
1. ARMS front-end monitoring official website document: https://help.aliyun.com/document_detail/106086.html?spm=ata.21736010.0.0.5d3a7f117o1Lty
2. Front-end and back-end link related documents: https://help.aliyun.com/document_detail/91409.html#title-6rx-0lb-p1o
3. ARMS application monitoring official website document: https://help.aliyun.com/document_detail/97924.html
4. ARMS application monitoring official website document: https://help.aliyun.com/document_detail/118912.html
5. ARMS console: https://arms.console.aliyun.com/?spm=ata.21736010.0.0.5d3a7f117o1Lty
6. How to choose open source self-built/hosted and commercial self-developed Trace: https://mp.weixin.qq.com/s?spm=a2c6h.12873639.0.0.2ff66234Viwi2h&__biz=MzIzOTU0NTQ0MA==&mid=2247504737&idx=1&sn=2de2fb0e0656c702fa4d2546a9cdd2e5&scene=21#wechat_redirect

Click the link below to try link tracing now!
https://www.aliyun.com/product/xtrace?spm=5176.8140086.J_8058803260.58.4da02c90QhBtVo

Quickly analyze and diagnose performance bottlenecks under the distributed application architecture, and improve the efficiency of development and diagnosis in the era of microservices.
