The Wonderland of Dynamic Tracing (Part 1 of 7)

This is the first of seven parts of the article "The Wonderland of Dynamic Tracing". I will keep updating this series to reflect the state of the art of the dynamic tracing world.

Dynamic Tracing

It’s my great pleasure to share my thoughts on dynamic tracing, a topic I have a lot of passion and excitement for. Let’s cut to the chase: what is dynamic tracing?

What It Is

As a kind of post-modern advanced debugging technology, dynamic tracing helps software engineers answer tricky questions about software systems, such as the causes of high CPU or memory usage, high disk usage, long latency, or program crashes. The answers can be obtained at low cost and within a short period of time, so problems can be identified and fixed quickly. Dynamic tracing emerged and thrived in a rapidly developing Internet era of cloud computing, service mesh, big data, API computation, etc., which exposed engineers to two major challenges. The first challenge is the scale of computation and deployment. Today, the numbers of users, colocations, and machines are all growing rapidly. The second is complexity. Software engineers face increasingly complicated business logic and software systems with many, many layers. From bottom to top, there are operating system kernels; different kinds of system software such as databases and Web servers; then virtual machines, interpreters, and Just-In-Time (JIT) compilers of various scripting languages or other advanced languages; and finally, at the application level, the abstraction layers of business logic and a great deal of complex code.

These huge challenges have consequences. The most serious is that software engineers today are quickly losing insight into, and control over, their production systems, which have become so enormous and complex that all kinds of bugs are much more likely to arise. Some may be fatal, like 500 error pages, memory leaks, and erroneous return values, just to name a few. Also worth noting is the issue of performance. You may have wondered why software sometimes runs very slowly, either on its own or only on certain machines. Worse, as cloud computing and big data gain popularity, production environments at this massive scale will only see more and more unpredictable problems, and engineers must devote most of their time and energy to them. Two factors make this hard. First, a majority of problems only occur in online environments, making them extremely difficult, if not impossible, to reproduce. Second, some occur only very rarely, say, one time in a hundred, one in a thousand, or even less often. For engineers, it would be ideal to be able to analyze and pinpoint the root cause of a problem and take targeted measures to address it while the system is still running, without having to take the machine offline, edit existing code or configurations, or restart processes or machines.

Too Good to be True?

And this is where dynamic tracing comes in. It can push software engineers toward that vision, greatly unleashing their productivity. I still remember when I worked for Yahoo! China. Sometimes I had to take a taxi to the office at midnight to deal with online problems. I had no choice, but it obviously frustrated me, blurring the line between my work and my life. Later I joined a CDN service provider in the United States. Our clients’ maintenance teams would look through the original logs we provided and report any problems they deemed important. From the service provider’s perspective, some of those problems might occur with a frequency of only one in a hundred or one in a thousand. Even so, we had to identify the real cause and give feedback to the client. The abundance of such subtle problems in the real world has fueled the creation and emergence of new technologies.

The best part of dynamic tracing, in my humble opinion, is its “live process analysis”. In other words, the technology allows software engineers to analyze a single program or the whole software system while it is still running, providing online services and responding to real requests, just like querying a database. That is a very intriguing practice. Many engineers tend to ignore the fact that a running software system, which itself contains the most precious information, serves as a database that changes in real time and is open to direct queries. Of course, this special “database” must be read-only; otherwise the analysis and debugging could affect the system’s own behavior and hamper online services. With the help of the operating system kernel, engineers can issue a series of targeted queries from the outside to obtain invaluable raw data about the running software system. This data guides a multitude of tasks such as problem analysis, security analysis, and performance analysis.
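To make the “read-only query” idea a bit more concrete, here is a minimal Python sketch. It is only an in-process analogy: real dynamic tracing tools query the target from outside the process with the kernel’s help, while this sketch merely has one thread read another thread’s call stack without changing it. The names busy_worker, snapshot_stack, and run_demo are hypothetical and exist only for illustration.

```python
import sys
import threading
import time
import traceback


def busy_worker(started, stop_event):
    """Stand-in for a live service that keeps doing work."""
    started.set()
    while not stop_event.is_set():
        time.sleep(0.01)


def snapshot_stack(thread_id):
    """Read-only "query": list the function names on a live thread's stack."""
    frame = sys._current_frames().get(thread_id)
    if frame is None:
        return []
    return [entry.name for entry in traceback.extract_stack(frame)]


def run_demo():
    started = threading.Event()
    stop_event = threading.Event()
    worker = threading.Thread(target=busy_worker, args=(started, stop_event))
    worker.start()
    try:
        started.wait()  # make sure the worker is inside busy_worker
        names = snapshot_stack(worker.ident)
    finally:
        stop_event.set()
        worker.join()
    return names


if __name__ == "__main__":
    print(run_demo())
```

The worker is never paused, restarted, or edited; the snapshot only reads state that already exists, which is the property the “database” comparison relies on.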

How It Works

Dynamic tracing usually works at the operating system kernel level, where the “supreme being of software” has complete control over the entire software world. With absolute authority, the kernel can ensure that the above-mentioned “queries” targeted at the software system do not disturb its normal operation. That is to say, those queries must be safe enough for wide use on production systems. This raises another question: how is a query made if the software system is regarded as a special “database”? Clearly, the answer is not SQL.

Dynamic tracing generally issues a query through the probe mechanism. Probes are dynamically planted into one or several layers of the software system, and the handlers associated with them are defined by engineers. This procedure is similar to acupuncture in traditional Chinese medicine. Imagine that the software system is a person, and dynamic tracing means pushing some “needles” into particular spots of the body, or acupuncture points. As these needles often carry engineer-defined “sensors”, we can freely collect essential information from those points to perform reliable diagnosis and devise feasible treatment plans. Here, tracing usually involves two dimensions. One is the timeline: as long as the software is running, a continuous series of changes unfolds along it. The other is the spatial dimension, because tracing may involve many different processes, including kernel tasks and threads. Each process often has its own memory space and process space. So, across different layers, and within the memory space of the same layer, engineers can obtain abundant information in space, both vertically and horizontally. Doesn’t this sound like a spider searching for prey on its web?

Spiderman searching on a cobweb
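The probe-and-handler idea can be sketched in a few lines of Python. This is an illustrative analogy only, not how kernel-level tracers are implemented: it assumes the interpreter’s own sys.setprofile hook as the place where the “needle” goes in, and the class FunctionEntryProbe and the target function handle_request are hypothetical names invented for this example.

```python
import sys


class FunctionEntryProbe:
    """Fires a handler whenever a function with the given name is entered."""

    def __init__(self, func_name, handler):
        self.func_name = func_name
        self.handler = handler

    def _dispatch(self, frame, event, arg):
        if event == "call" and frame.f_code.co_name == self.func_name:
            # Hand over a copy of the arguments so the handler cannot
            # alter the state of the function being observed.
            self.handler(dict(frame.f_locals))

    def __enter__(self):
        sys.setprofile(self._dispatch)  # plant the probe
        return self

    def __exit__(self, exc_type, exc, tb):
        sys.setprofile(None)  # pull the probe out again
        return False


def handle_request(path, size):
    """Hypothetical target function; its source is never edited."""
    return f"{path}:{size}"


def trace_demo():
    seen = []
    with FunctionEntryProbe("handle_request", seen.append):
        handle_request("/index.html", 512)
        handle_request("/api/users", 64)
    handle_request("/untraced", 1)  # probe is already detached here
    return seen


if __name__ == "__main__":
    for args in trace_demo():
        print(args)
```

The target function knows nothing about the probe, the handler is whatever the engineer passes in, and the probe only exists for the duration of the with block.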

The information-gathering process goes beyond the operating system kernel to higher levels such as user-mode programs. The information collected can be pieced together along the timeline to form a complete view of the software and to guide some very complex analyses: we can readily find various kinds of performance bottlenecks, the root causes of weird exceptions, errors, and crashes, as well as potential security vulnerabilities. A crucial point here is that dynamic tracing is non-invasive. Again, if we compare the software system to a person, to diagnose a condition we clearly wouldn’t want to cut open the living body or implant wires. Instead, the sensible course would be to take an X-ray or MRI, feel the pulse, or use a stethoscope to listen to the heart and breathing. The same should go for diagnosing a production software system. Because dynamic tracing is non-invasive, the desired information can be acquired firsthand, quickly and accurately, which helps identify the problems under investigation. No modification of the operating system kernel, the application programs, or any configuration is needed.

Most engineers should already be very familiar with the process of constructing software systems. This is a basic skill for software engineers, after all. It usually means creating various abstraction layers to build the software layer by layer, either bottom-up or top-down. Among many other paradigms, software abstraction layers can be created via classes and methods in object-oriented programming, or directly via functions and subroutines. In contrast with software construction, debugging works in a way that can easily “rip apart” existing abstraction layers. Engineers can then freely access any necessary information from any layer, regardless of the concrete modular design, the code encapsulation, and the man-made constraints set for software construction. This is because, during debugging, people usually want to get as much information as possible. After all, bugs may happen at any software layer (or even at the hardware level).

Still Having Doubts?

But will the abstraction layers built when constructing the software hinder the debugging process? The answer is a big no. Dynamic tracing, as mentioned above, is generally based on the operating system kernel, which holds absolute authority as the “supreme being”. So the technology can easily (and legally) penetrate the abstraction layers. In fact, if well designed, those abstraction layers will actually help the debugging process, which I will detail later on. In my own work, I have noticed a common phenomenon. When an online problem arises, some engineers become nervous and are quick to come up with wild guesses about the root cause without any evidence. Even worse, while confirming those guesses by trial and error, they leave the system in a mess that they and their colleagues must painfully clean up afterwards. In the end, they waste valuable debugging time or simply destroy the scene of the incident. All such pains could go away when dynamic tracing plays a part. Troubleshooting could even turn out to be a lot of fun. For experts, a weird online problem would present a rare opportunity to solve a fascinating puzzle. All this, of course, requires powerful tools for collecting and analyzing information, tools that can quickly prove or disprove any assumptions and theories about the culprits.

The Advantages of Dynamic Tracing

Dynamic tracing does not require any cooperation or collaboration from the target application. Back to the example of a human, imagine he is receiving a physical examination while still running on the playground. With dynamic tracing, we can take a real-time X-ray of him directly, and he will not sense it at all. Almost all analytical tools based on dynamic tracing operate in a “hot-plug” or post-mortem manner, allowing us to run the tools at any time, and to begin and end sampling at any time, without restarting or interfering with the target software processes. In reality, most analysis needs arise after the target software system starts running; before that, software engineers are unlikely to be able to predict what problems might arise, let alone all the information that will need to be collected to troubleshoot them. So one advantage of dynamic tracing is that it can collect data anywhere and anytime, on demand. Another strength is its extremely small performance overhead. The impact of a carefully written debugging tool on the overall performance of the system tends to be no more than 5%, minimizing the observable impact on end users. Moreover, this overhead, already minuscule, is incurred only during the actual sampling window, which lasts a few seconds or minutes. Once the debugging tool finishes, the online system automatically returns to its original full speed.

The running little "man" being checked alive
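A quick back-of-the-envelope calculation shows why overhead that is confined to the sampling window matters so little in aggregate. The sketch below takes the 5% upper bound mentioned above and assumes, purely for illustration, a single 60-second sampling window in a 24-hour day; the function name amortized_overhead is hypothetical.

```python
def amortized_overhead(slowdown, window_seconds, period_seconds):
    """Average slowdown over a period when tracing runs only inside the window."""
    if period_seconds <= 0:
        raise ValueError("period must be positive")
    if not 0 <= window_seconds <= period_seconds:
        raise ValueError("window must fit inside the period")
    return slowdown * window_seconds / period_seconds


SECONDS_PER_DAY = 24 * 60 * 60

# Assumed scenario: 5% slowdown during one 60-second sampling window per day.
daily = amortized_overhead(0.05, 60, SECONDS_PER_DAY)
print(f"amortized daily overhead: {daily:.6%}")
```

Under these assumed numbers the slowdown averaged over the day is roughly 0.0035%, far below the 5% seen inside the window itself.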

Conclusion

In this part, we introduced the concept of dynamic tracing at a very high level and briefly covered its advantages. In Part 2 of this series, we will talk about two open source dynamic tracing frameworks, DTrace and SystemTap.

A Word on OpenResty XRay

OpenResty XRay is a commercial dynamic tracing product offered by our company, OpenResty Inc. We use this product in articles like this one to demonstrate implementation details, as well as to provide statistics about real-world applications and open source software. In general, OpenResty XRay can help users gain deep insight into their online and offline software systems without any modifications or any other cooperation from those systems, and efficiently troubleshoot difficult performance, reliability, and security problems. It utilizes advanced dynamic tracing technologies developed by OpenResty Inc. and others.

We welcome you to contact us to try out this product for free.

OpenResty XRay Console Dashboard

About The Author

Yichun Zhang is the creator of the OpenResty® open source project. He is also the founder and CEO of OpenResty Inc. He has contributed a dozen open source Nginx third-party modules and many Nginx and LuaJIT core patches, and he designed the OpenResty XRay platform.

Translations

We provide a Chinese translation of this article on blog.openresty.com.cn. We also welcome interested readers to contribute translations in other languages, as long as the full article is translated without any omissions. We thank in advance anyone willing to do so.