
Author | Yahai

With the rise of microservice architecture, server-side call dependencies have become increasingly complex. To quickly locate abnormal components and performance bottlenecks, adopting distributed link tracing (Trace) has become a consensus in the IT operations field. However, what are the differences between open source self-built, open source hosted, and commercial self-developed Trace products, and how should you choose? This is a question many users run into when evaluating Trace solutions, and it is also the area with the most misconceptions.

To clarify this question, we need to start from two angles: first, sort out the core risks and typical problem scenarios of online applications; second, compare the capability differences among the three Trace approaches, namely open source self-built, open source hosted, and commercial self-developed. As the saying goes, "know yourself and know your enemy, and you will never be defeated in a hundred battles." Only by combining this comparison with your actual situation can you choose the most suitable solution.

"Two Types of Risks" and "Ten Typical Problems"

Online application risks fall into two main categories: "errors" and "slowness". "Errors" usually occur because the program does not run as expected, for example the JVM loads the wrong version of a class, the code enters an abnormal branch, or the environment is misconfigured. "Slowness" is usually caused by insufficient resources, such as a traffic burst saturating the CPU, microservice or database thread pools being exhausted, or a memory leak causing continuous FullGC (FGC).

Whether the problem is an "error" or "slowness", users want to quickly locate the root cause, stop the loss in time, and eliminate hidden dangers. However, based on the author's more than five years of experience in Trace development, operations, and Double Eleven preparation, most online problems cannot be effectively located and resolved through the basic capabilities of link tracing alone. The complexity of online systems means that an excellent Trace product must provide more comprehensive and effective diagnostic capabilities, such as code-level diagnosis, memory analysis, and thread pool analysis; at the same time, to improve the usability and stability of the Trace component itself, it also needs capabilities such as dynamic sampling, lossless statistics, and automatic interface name convergence. This is why mainstream Trace products in the industry are gradually evolving toward APM and application observability. For ease of understanding, this article still uses "Trace" to refer to observability at the application layer.

To sum up, in order to guarantee the business stability of online applications, beyond the general basic capabilities of Trace (such as call chains, service monitoring, and link topology), you can refer to the following list of "ten typical problems" (taking Java applications as an example) when selecting a link tracing solution, and comprehensively compare how open source self-built, open source hosted, and commercial self-developed Trace products perform on each of them.

1. [Code-level automatic diagnosis] An interface occasionally times out. The call chain only shows the name of the timed-out interface, not the internal methods, so the root cause cannot be located and the problem is hard to reproduce. What should I do?

Students responsible for stability should be familiar with this scenario: the system occasionally experiences interface timeouts at night or on the hour during a big promotion. By the time the problem is noticed and investigated, the abnormal scene has already been lost and is difficult to reproduce, so it cannot be diagnosed with a manual jstack. Current open source link tracing implementations can only show the timed-out interface through the call chain; the specific cause, and which piece of code triggered it, cannot be located, so in the end nothing can be done. This scenario repeats itself until a failure occurs and huge business losses are suffered.

To solve this problem, we need a precise and lightweight automatic slow-call monitoring capability that can faithfully restore the scene of code execution without any pre-instrumentation and automatically record the complete method stack of slow calls. As shown in the figure below, when an interface call exceeds a certain threshold (for example, 2 seconds), monitoring of the thread handling the slow request is started and stopped no later than the 15th second of the request; the stack snapshots collected within the request's life cycle are retained, and the complete method stack and time consumption are restored.

image.png
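As a rough illustration of the idea (not the actual ARMS implementation), the sketch below watches a request thread once it runs past a 2-second threshold and keeps sampling its stack until a 15-second cutoff; the class name, thresholds, and wiring are all hypothetical.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: call watch(Thread.currentThread(), snapshots) when a request starts
// and cancel the returned future when it ends. Sampling only begins once the request has
// already run for SLOW_THRESHOLD_MS and stops by MAX_WATCH_MS, so fast requests pay no cost
// and the "first scene" of a slow call is preserved without pre-instrumentation.
public class SlowCallWatcher {
    private static final long SLOW_THRESHOLD_MS = 2_000;  // start watching after 2 seconds
    private static final long MAX_WATCH_MS = 15_000;      // stop no later than the 15th second
    private static final long SAMPLE_INTERVAL_MS = 100;   // stack sampling period

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public ScheduledFuture<?> watch(Thread requestThread, List<StackTraceElement[]> snapshots) {
        long startNanos = System.nanoTime();
        return scheduler.scheduleAtFixedRate(() -> {
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
            if (elapsedMs > MAX_WATCH_MS) {
                return; // past the cutoff: keep snapshots already taken, sample no further
            }
            snapshots.add(requestThread.getStackTrace()); // record the full method stack
        }, SLOW_THRESHOLD_MS, SAMPLE_INTERVAL_MS, TimeUnit.MILLISECONDS);
    }
}
```

If the request finishes quickly, the caller cancels the future before the first sample fires; if it was slow, the collected stacks can be attached to the call chain for later analysis.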

2. [Pool monitoring] The microservice/database thread pool is frequently exhausted, causing service timeouts, and troubleshooting is very difficult. How to solve it?

A full microservice/database thread pool causing business request timeouts is a problem that occurs almost every day. Students with rich diagnostic experience will instinctively check the corresponding component logs; Dubbo, for example, outputs related exception records when its thread pool is full. However, if the component does not log thread pool information, or the operations student is not experienced enough, this type of problem becomes very hard to troubleshoot. At present, open source Trace generally only provides an overall JVM overview; it cannot show the status of each thread pool, let alone judge whether a pool is exhausted.

The pool monitoring provided by commercial self-developed Trace directly shows the maximum number of threads, current number of threads, active thread count, and so on for a specified thread pool, so the risk of thread pool exhaustion or a high water mark is visible at a glance. In addition, you can set alarms on thread pool usage, for example sending an SMS notification when the current thread count of the Tomcat thread pool exceeds 80% of the maximum and triggering a phone alarm when it reaches 100%.

image.png
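For readers who want to approximate this on their own pools, the following minimal sketch periodically reads the standard ThreadPoolExecutor metrics (max, current, active, queued) and logs a warning past a high-water mark; it assumes you can reach the pool instance, which an agent would normally obtain via instrumentation.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: sample a ThreadPoolExecutor's key metrics every 15 seconds and
// raise a warning when the active-thread ratio crosses the given high-water mark.
public class ThreadPoolMonitor {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void monitor(String poolName, ThreadPoolExecutor pool, double warnRatio) {
        scheduler.scheduleAtFixedRate(() -> {
            int max = pool.getMaximumPoolSize();
            int current = pool.getPoolSize();
            int active = pool.getActiveCount();
            int queued = pool.getQueue().size();
            System.out.printf("[%s] max=%d current=%d active=%d queued=%d%n",
                    poolName, max, current, active, queued);
            if (max > 0 && (double) active / max >= warnRatio) {
                // In a real agent this would trigger an SMS/phone alert instead of a log line.
                System.err.printf("[%s] thread pool usage above %.0f%%!%n", poolName, warnRatio * 100);
            }
        }, 0, 15, TimeUnit.SECONDS);
    }
}
```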

3. [Thread analysis] After a big-promotion stress test or a release change, the CPU water level is very high. How to analyze the application's performance bottlenecks and optimize them in a targeted way?

When we run a large stress test or release a major version change (containing a lot of new code logic), we may find the CPU water level suddenly shooting up without being able to clearly identify which piece of code is responsible. We can only run jstack repeatedly, compare thread state changes with the naked eye, and keep trying optimizations based on experience, which consumes a lot of energy with mediocre results. So is there a way to quickly analyze application performance bottlenecks? The answer is yes, and there is more than one.

The most common approach is to manually trigger a ThreadDump over a period of time (such as 5 minutes) and then analyze the thread overhead and method stack snapshots during that period. The disadvantages of a manually triggered ThreadDump are that the performance overhead is relatively large, it cannot run continuously, and it cannot automatically preserve the snapshots of a scene that has already passed. For example, if the CPU spikes during a stress test, by the time the test is over and the post-mortem begins, the scene is gone and a manual ThreadDump is too late.

The second approach is a normalized thread analysis feature that automatically records the state, count, CPU time, and internal method stacks of each category of thread pool. For any time period, sort by CPU time to locate the thread category with the highest CPU overhead, then open its method stack to see exactly where the code is stuck. As shown in the figure below, a large number of methods in the BLOCKED state are stuck acquiring a database Connection, which can be optimized by enlarging the database connection pool.

image.png

image.png
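A rough, self-contained approximation of such a "sort threads by CPU time" view can be built on the JDK's ThreadMXBean, as sketched below; a real product would sample continuously and group threads by pool category rather than print a one-off top list.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: collect per-thread CPU time via ThreadMXBean and print the five
// most expensive threads together with a shallow method stack.
public class ThreadCpuTop {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (mx.isThreadCpuTimeSupported() && !mx.isThreadCpuTimeEnabled()) {
            mx.setThreadCpuTimeEnabled(true);
        }
        Map<Long, Long> cpuByThread = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            long cpuNanos = mx.getThreadCpuTime(id);
            if (cpuNanos > 0) {
                cpuByThread.put(id, cpuNanos);
            }
        }
        cpuByThread.entrySet().stream()
                .sorted(Map.Entry.<Long, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(5)
                .forEach(e -> {
                    ThreadInfo info = mx.getThreadInfo(e.getKey(), 10); // top 10 stack frames
                    if (info == null) {
                        return; // thread already terminated
                    }
                    System.out.printf("%s state=%s cpu=%dms%n",
                            info.getThreadName(), info.getThreadState(), e.getValue() / 1_000_000);
                    for (StackTraceElement frame : info.getStackTrace()) {
                        System.out.println("    at " + frame);
                    }
                });
    }
}
```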

4. [Exception diagnosis] After a release or configuration change, an interface reports a large number of errors, but the cause cannot be located immediately, leading to a business failure. What should I do?

The biggest "culprit" affecting online stability is change. Whether it is an application release or a dynamic configuration change, it may cause the program to behave abnormally. So how do we quickly judge the risk of a change, discover problems as early as possible, and stop the loss in time?

Here I will share an exception-based release interception practice from Alibaba's internal release system. One of its most important monitoring indicators is the comparison of Java Exception/Error counts. Whether it is an NPE (NullPointerException) or an OOM (OutOfMemoryError), monitoring and alerting on the total number of exceptions and on specific exception types lets you quickly discover online exceptions, especially when comparing the periods before and after a change on the timeline.

An independent exception analysis and diagnosis page lets you view the trend and stack details of each type of exception, and you can further view the distribution of associated interfaces, as shown in the following figures.

image.png

image.png
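As a simplified illustration of the underlying metric (not the release system itself), the sketch below counts exceptions by type so that totals before and after a change window can be compared; a real agent would collect this via bytecode instrumentation rather than explicit calls, and the class and method names here are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: per-type exception counters that a release pipeline could snapshot
// before and after a change and compare to decide whether to block or roll back.
public class ExceptionStats {
    private static final Map<String, LongAdder> COUNTERS = new ConcurrentHashMap<>();

    /** Called from catch blocks (or by instrumentation) whenever an exception is observed. */
    public static void record(Throwable t) {
        COUNTERS.computeIfAbsent(t.getClass().getName(), k -> new LongAdder()).increment();
    }

    /** Snapshot of current counts, e.g. taken at the start and end of a change window. */
    public static Map<String, Long> snapshot() {
        Map<String, Long> copy = new ConcurrentHashMap<>();
        COUNTERS.forEach((type, adder) -> copy.put(type, adder.sum()));
        return copy;
    }
}
```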


5. [Memory diagnosis] Frequent FullGC occurs and a memory leak is suspected, but the abnormal objects cannot be located. What should I do?

FullGC is one of the most common problems in Java applications. Causes such as overly fast object creation and memory leaks can all lead to FullGC, and the most effective way to troubleshoot it is to execute a HeapDump, which makes the memory usage of each kind of object clearly visible at a glance.

A console-based ("white-screen") memory snapshot feature can perform a one-click HeapDump and analysis on a specified machine, greatly improving the efficiency of troubleshooting memory problems. It also supports automatic dumps that preserve abnormal snapshots in memory leak scenarios, as shown in the following figure:

image.png
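Under the hood, a "one-click HeapDump" ultimately relies on standard JVM facilities; the hedged sketch below triggers a dump with the JDK's HotSpotDiagnosticMXBean (the file path and live-only flag are just example choices).

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

// Hypothetical sketch: trigger a heap dump programmatically via the JDK's
// HotSpotDiagnosticMXBean, roughly what a one-click HeapDump does on the selected machine.
public class HeapDumper {
    public static void dump(String filePath, boolean liveObjectsOnly) throws IOException {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // liveObjectsOnly=true forces a GC first and dumps only reachable objects.
        bean.dumpHeap(filePath, liveObjectsOnly);
    }

    public static void main(String[] args) throws IOException {
        dump("/tmp/app-heap.hprof", true); // target file must not already exist
    }
}
```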

6. [Online debugging] The same code behaves differently when running online than when debugging locally. How to troubleshoot?

Code that passes local debugging throws all kinds of errors as soon as it reaches the production environment. What went wrong? I believe every developer has experienced this nightmare. There are many possible causes, such as Maven dependency version conflicts, inconsistent dynamic configuration parameters across environments, or differences in the components each environment depends on.

To solve the problem of online code not behaving as expected, we need an online debugging and diagnostic tool that can view, in real time, the source code of the currently running program, its input and output parameters, its execution method stack and time consumption, and the values of static objects or dynamic instances, making online debugging as convenient as local debugging, as shown in the figure below:

image.png

7. [Full-link tracing] Users report that the website opens very slowly. How to trace the full call trajectory from the web front end to the server side?

The key to connecting front-end and back-end links is to follow the same transparent transmission (context propagation) protocol. Currently, open source solutions mainly cover back-end application instrumentation and lack front-end instrumentation (such as Web/H5 and mini programs). The front-end and back-end full-link tracing scheme is shown in the following figure:

image.png

  • Header transparent transmission format: uniformly adopt the Jaeger format; the key is uber-trace-id and the value is {trace-id}:{span-id}:{parent-span-id}:{flags} (a minimal build/parse sketch follows this list).
  • Front-end access: two low-code access methods, CDN (script injection) or NPM packages, supporting Web/H5, Weex, and various mini program scenarios.
  • Back-end access:
    • For Java applications, ARMS Agent is recommended: non-intrusive instrumentation requires no code changes and supports advanced features such as edge diagnosis, lossless statistics, and precise sampling. User-defined methods can be instrumented through the OpenTelemetry SDK.
    • For non-Java applications, access through Jaeger and report the data to the ARMS Endpoint; ARMS is fully compatible with link propagation and display across multi-language applications.
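As referenced in the first bullet above, the following minimal sketch builds and parses the uber-trace-id header value; the class and field names are hypothetical, and a production agent would of course delegate this to its tracing SDK.

```java
// Hypothetical sketch: build and parse the Jaeger "uber-trace-id" header value
// ({trace-id}:{span-id}:{parent-span-id}:{flags}) that lets front-end and back-end
// spans join into one trace.
public class UberTraceId {
    public final String traceId;
    public final String spanId;
    public final String parentSpanId;
    public final String flags;

    public UberTraceId(String traceId, String spanId, String parentSpanId, String flags) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.flags = flags;
    }

    /** Serialize to the header value, e.g. "abc123:def456:0:1". */
    public String toHeaderValue() {
        return String.join(":", traceId, spanId, parentSpanId, flags);
    }

    /** Parse an incoming header value; returns null if the format does not match. */
    public static UberTraceId parse(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        String[] parts = headerValue.split(":");
        if (parts.length != 4) {
            return null;
        }
        return new UberTraceId(parts[0], parts[1], parts[2], parts[3]);
    }
}
```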

The current full-link tracing solution of Alibaba Cloud ARMS is based on the Jaeger protocol, and support for the SkyWalking protocol is under development so that self-built SkyWalking users can migrate without loss. The call chain effect of full-link tracing across the front end, Java applications, and non-Java applications is shown in the figure below:

image.png

8. [Lossless statistics] The cost of call chain logs is too high, and after client-side sampling is enabled the monitoring charts become inaccurate. How to solve it?

Call chain log volume is positively correlated with traffic. For consumer-facing (to-C) businesses the traffic is very large, so reporting and storing every call chain is very expensive. However, if client-side sampling is enabled, statistical indicators become inaccurate. For example, with the sampling rate set to 1%, only one hundred out of 10,000 requests are recorded; statistics aggregated from those hundred logs suffer from serious sample skew and cannot accurately reflect the actual service traffic or latency.

To solve this problem, the client Agent needs to support lossless statistics: no matter how many times the same indicator is hit within a period of time (usually 15 seconds), only one piece of aggregated data is reported. In this way the statistical results are always accurate and are not affected by the call chain sampling rate. Users can adjust the sampling rate with confidence, and call chain cost can be reduced by more than 90%; the larger the traffic and cluster size, the more significant the cost optimization.

image.png
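The sketch below illustrates the pre-aggregation idea behind lossless statistics under simplified assumptions (a single 15-second window, console output instead of a real reporter): every request updates an in-memory aggregate, and only one record per interface is flushed per window, regardless of the call-chain sampling rate.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of client-side pre-aggregation. A real agent would rotate windows
// atomically; the simple clear() here is good enough to show the idea.
public class LosslessStats {
    static final class Aggregate {
        final LongAdder count = new LongAdder();
        final LongAdder errorCount = new LongAdder();
        final LongAdder totalMillis = new LongAdder();
    }

    private final Map<String, Aggregate> window = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public LosslessStats() {
        scheduler.scheduleAtFixedRate(this::flush, 15, 15, TimeUnit.SECONDS);
    }

    public void record(String interfaceName, long costMillis, boolean error) {
        Aggregate agg = window.computeIfAbsent(interfaceName, k -> new Aggregate());
        agg.count.increment();
        agg.totalMillis.add(costMillis);
        if (error) {
            agg.errorCount.increment();
        }
    }

    private void flush() {
        window.forEach((name, agg) -> {
            long count = agg.count.sum();
            // One report line per interface per window, regardless of request volume.
            System.out.printf("%s count=%d errors=%d avgMs=%.1f%n",
                    name, count, agg.errorCount.sum(),
                    count == 0 ? 0.0 : (double) agg.totalMillis.sum() / count);
        });
        window.clear();
    }
}
```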

9. [Automatic interface name convergence] The URL names of RESTful interfaces diverge because of parameters such as timestamps and UIDs, and the monitoring charts degrade into meaningless scattered points. How to solve it?

When an interface name contains variable parameters such as a timestamp or UID, interfaces of the same type end up with different names, each occurring only a handful of times. Such names have no monitoring value, and they also create hotspots in storage and computation that affect cluster stability. At this point, we need to classify and aggregate the divergent interfaces to improve both the value of the data and cluster stability.

What is needed is an automatic convergence algorithm for interface names that can recognize variable parameters, aggregate interfaces of the same type, and show trends per category, which better matches users' monitoring needs; at the same time it avoids the data hotspot problem caused by interface divergence and improves overall stability and performance. As shown in the figure below, /safe/getXXXInfo/xxxx is classified into one category; otherwise each request would produce a chart with only a single data point and readability would be very poor.

image.png
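One plausible way to implement such convergence is a rule-based normalizer that collapses variable-looking path segments, as sketched below; real products likely combine rules with learned patterns, so treat the regexes here as illustrative assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of interface-name convergence: replace path segments that look like
// variable parameters (pure digits, UUIDs, long hex/ID-like tokens) with a "*" placeholder,
// so /safe/getXXXInfo/1234567 and /safe/getXXXInfo/7654321 aggregate into one series.
public class UrlConverger {
    private static final Pattern NUMERIC = Pattern.compile("\\d+");
    private static final Pattern UUID_LIKE = Pattern.compile(
            "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}");
    private static final Pattern LONG_TOKEN = Pattern.compile("[0-9a-fA-F]{16,}");

    public static String converge(String path) {
        StringBuilder result = new StringBuilder();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue;
            }
            result.append('/');
            if (NUMERIC.matcher(segment).matches()
                    || UUID_LIKE.matcher(segment).matches()
                    || LONG_TOKEN.matcher(segment).matches()) {
                result.append('*'); // variable parameter: collapse into one category
            } else {
                result.append(segment);
            }
        }
        return result.length() == 0 ? "/" : result.toString();
    }

    public static void main(String[] args) {
        System.out.println(converge("/safe/getXXXInfo/1650001234567")); // -> /safe/getXXXInfo/*
    }
}
```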

10. [Dynamic configuration delivery] A burst of online traffic causes insufficient resources, and non-core functions must be downgraded immediately. How to achieve dynamic downgrade or tuning without restarting the application?

Accidents are always sudden: traffic bursts, external attacks, and data center failures can all leave the system short of resources. To keep the most important core business unaffected, we often need to dynamically downgrade some non-core functions without restarting the application in order to release resources, for example by lowering the client call chain sampling rate or turning off diagnostic modules with high performance overhead. Conversely, we sometimes need to dynamically turn on an expensive deep-diagnosis feature, such as a memory dump, to analyze the current abnormal scene.

Whether for dynamic downgrade or dynamic enablement, configuration must be pushed down dynamically without restarting the application. Open source Trace usually does not have this capability; you would need to build a metadata configuration center and make corresponding code changes. Commercial Trace products not only support dynamic configuration pushdown but can also refine it to per-application configuration. For example, if application A has occasional slow calls, you can turn on the automatic slow call diagnosis switch to monitor it, while application B, which is sensitive to latency and CPU overhead, can keep the switch off; the two applications each take what they need and do not affect each other.

image.png
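A minimal sketch of the agent-side half of dynamic configuration delivery is shown below: tunables live in an AtomicReference and pushed updates take effect without a restart. The config fields and the listener hook are hypothetical; a real system would receive them from a configuration center.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: per-application agent tunables that can be swapped at runtime,
// e.g. lowering the call-chain sampling rate or toggling an expensive diagnostic module.
public class DynamicAgentConfig {
    public static final class Config {
        final double samplingRate;        // e.g. 0.01 = 1% of call chains
        final boolean slowCallDiagnosis;  // expensive module that can be switched per app

        public Config(double samplingRate, boolean slowCallDiagnosis) {
            this.samplingRate = samplingRate;
            this.slowCallDiagnosis = slowCallDiagnosis;
        }
    }

    private final AtomicReference<Config> current =
            new AtomicReference<>(new Config(0.10, true));

    /** Called by the config-center listener when a new config is pushed for this application. */
    public void onConfigPushed(Config newConfig) {
        current.set(newConfig); // takes effect on the next request, no restart needed
    }

    public boolean shouldSampleCallChain() {
        return ThreadLocalRandom.current().nextDouble() < current.get().samplingRate;
    }

    public boolean slowCallDiagnosisEnabled() {
        return current.get().slowCallDiagnosis;
    }
}
```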

Open source self-built vs. open source hosted vs. commercial self-developed


The "top ten typical problems" listed above in the production environment are currently unresolved open source self-built or hosted Trace products. In fact, open source solutions have many excellent features, such as extensive component support, unified multi-language solutions, flexible data/page customization, and so on. However, open source is not a panacea, and the production environment is not a test field. When it comes to the lifeline of online stability, we must carefully evaluate and thoroughly investigate the pros and cons of different solutions. We can't just stop at the comparison of common basic capabilities. This will bring huge hidden dangers to subsequent application promotion.

Due to space limitations, this article only uses ten typical problem scenarios to analyze the shortcomings of open source self-built/hosted solutions and to emphasize that Trace is not simple: if you ignore this, you may be forced to step into the same pits that commercial self-developed products have already worked through. It is like running an e-commerce business on the Internet: opening an online shop is not the end; a series of complex tasks such as product polishing, traffic acquisition, user conversion, and reputation management hide behind it, and rushing in rashly may lead to a miserable loss.

So what are the advantages of open source self-built/hosted solutions? A comprehensive comparison with commercial self-developed Trace products across product features, resource/labor costs, secondary development, multi-cloud deployment, stability, ease of use, and other dimensions will be left for the next article, "A Full Analysis of Open Source Self-built, Open Source Hosted, and Commercial Self-developed Trace Products". Stay tuned.

