Exploring systemtap (2): the C code generated from probes

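In the previous article I gave a brief overview of systemtap's workflow and covered its first and second passes. Starting with this article we reach the centerpiece of the series: the third pass, which generates C code.

We can have stap write the generated C code to out.c with a command like stap -v test.stp -p3 > out.c. The -p3 option stops stap after pass 3, and the generated C is printed to standard output.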

hello, world

As tradition dictates, let's start with a "hello world" example.

probe begin {
    printf("hello")
}

probe oneshot {
    printf(" wor")
}

probe end {
    printf("ld\n")
}

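For my own amusement, I split a complete hello world into three pieces. By searching for specific strings, we can quickly find the code generated for these three probes in the C output.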

static void probe_3646 (struct context * __restrict__ c) {
  __label__ deref_fault;
  __label__ out;
  struct probe_3646_locals * __restrict__ l = & c->probe_locals.probe_3646;
  (void) l;
  if (c->actionremaining < 1) { c->last_error = "MAXACTION exceeded"; goto out; }
  (void)
  ({
    _stp_print ("hello");
  });
deref_fault: __attribute__((unused));
out:
  _stp_print_flush();
}

The above is the code generated for probe begin.

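As we can see, every probe handler receives a context parameter when it runs. Each context contains a struct probe_id_locals member (struct probe_3646_locals here), which stores the probe's local variables. Our hello world example uses no local variables, so these structs are all empty.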
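To make the relationship between the context and the per-probe locals concrete, here is a minimal sketch. The struct layout, the probe names probe_1 and probe_2, and the count variable are hypothetical; in particular, storing the locals of all probes in a union is an assumption made for illustration, not something shown in the excerpt above.

```c
#include <stddef.h>

/* Hypothetical locals for two probes. probe_1 has no script-level locals;
 * the dummy member only keeps the struct valid in standard C. */
struct probe_1_locals { char unused; };
struct probe_2_locals { long count; };

/* Simplified stand-in for the runtime's struct context. */
struct context {
    int actionremaining;
    const char *last_error;
    union {                      /* assumption: one slot shared by all probes */
        struct probe_1_locals probe_1;
        struct probe_2_locals probe_2;
    } probe_locals;
};

/* A handler reaches its own locals through the context, as probe_3646 does. */
static long probe_2(struct context *restrict c)
{
    struct probe_2_locals *restrict l = &c->probe_locals.probe_2;
    l->count = 0;
    l->count += 42;
    return l->count;
}
```

Because only one handler uses a given context at a time, sharing the storage this way would keep the context small no matter how many probes a script defines.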

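Next comes the MAXACTION exceeded check. According to the systemtap documentation, MAXACTION caps the number of statements a single probe hit may execute. This bounds how long a handler can run and keeps the kernel from becoming unresponsive.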
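The budget idea behind this check can be sketched as follows. This is a simplified model, not the generated code: run_handler and its statements parameter are hypothetical, and the real handler charges the budget in its own way rather than one unit per loop iteration.

```c
#include <stddef.h>

/* Simplified stand-in for the two context fields used by the check. */
struct context {
    int actionremaining;
    const char *last_error;
};

/* Pretend to execute `statements` statements, charging one unit of budget
 * for each. Returns how many statements actually ran. */
static int run_handler(struct context *c, int statements)
{
    int executed = 0;
    for (int i = 0; i < statements; i++) {
        if (c->actionremaining < 1) {
            c->last_error = "MAXACTION exceeded";
            break;               /* corresponds to `goto out` in the handler */
        }
        c->actionremaining -= 1;
        executed++;
    }
    return executed;
}
```

With a budget of 3 and a handler of 5 statements, the sketch stops after the third statement and records the error string in the context.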

Next is

  (void)
  ({
    _stp_print ("hello");
  });

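We can see that the printf statement is compiled into a call to the corresponding built-in function. To keep statements from polluting one another, the compiled result of each statement is also wrapped in a layer of parentheses and braces, which is a GCC statement expression.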
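For readers unfamiliar with the ({ ... }) construct, the sketch below shows its two relevant properties: declarations inside it are scoped to it, and its value is the value of its last expression. The function demo and the variable names are made up for illustration. The construct is a GNU extension, so this needs GCC or Clang.

```c
/* Statement expressions are a GNU C extension (GCC and Clang). */
static int demo(void)
{
    int tmp = 1;

    /* Same shape as the generated code: the value is discarded, and the
     * inner `tmp` is a separate variable that disappears at the `})`. */
    (void)
    ({
        int tmp = 100;
        tmp += 1;
    });

    /* The value of a statement expression is its last expression. */
    int v = ({ int tmp = 20; tmp + 1; });

    return tmp + v;   /* the outer tmp is still 1, v is 21 */
}
```

This is why a code generator can emit temporaries for each statement without worrying about name clashes between statements.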

The other two probes are much the same, except that probe oneshot additionally calls function___global_exit__overload_0, which in turn calls the built-in _stp_exit function.

Each of these probes has a corresponding struct stap_be_probe instance. The generated code shows that the enter_be_probe function runs the probe's handler, specifically on this line:

  (*stp->probe->ph) (c);

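The code before this line is setup. The code after it checks whether an error occurred during execution, records the execution time, and so on. Note that the context passed to the probe function is reused.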

enter_be_probe is in turn called by systemtap_module_init and systemtap_module_exit. Specifically, probe begin and probe oneshot are invoked from systemtap_module_init (the type of their struct stap_be_probe is 0), while probe end is invoked from systemtap_module_exit (type 1). As the names suggest, systemtap_module_init and systemtap_module_exit are called when the session starts and ends, respectively. The file runtime/transport/transport.txt in the systemtap source tree describes exactly how they get called.
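The dispatch described above can be modeled in a few lines of userspace C. This is only a sketch: the struct layouts are reduced to the fields the text mentions, output goes to a buffer instead of _stp_print, and model_module_init, model_module_exit, run_type, and emit are hypothetical names.

```c
#include <string.h>

struct context { char out[64]; };            /* simplified: collects output */

struct probe { void (*ph)(struct context *c); };          /* probe handler */
struct be_probe { const struct probe *probe; int type; }; /* 0 begin, 1 end */

static void emit(struct context *c, const char *s)
{
    strncat(c->out, s, sizeof c->out - strlen(c->out) - 1);
}

static void probe_begin(struct context *c)   { emit(c, "hello"); }
static void probe_oneshot(struct context *c) { emit(c, " wor"); }
static void probe_end(struct context *c)     { emit(c, "ld\n"); }

static const struct probe probes[] = {
    { probe_begin }, { probe_oneshot }, { probe_end },
};
static const struct be_probe be_probes[] = {
    { &probes[0], 0 }, { &probes[1], 0 }, { &probes[2], 1 },
};

/* The real function also does setup and error/time accounting around this. */
static void enter_be_probe(const struct be_probe *stp, struct context *c)
{
    (*stp->probe->ph)(c);
}

static void run_type(int type, struct context *c)
{
    for (size_t i = 0; i < sizeof be_probes / sizeof be_probes[0]; i++)
        if (be_probes[i].type == type)
            enter_be_probe(&be_probes[i], c);
}

static void model_module_init(struct context *c) { run_type(0, c); }
static void model_module_exit(struct context *c) { run_type(1, c); }
```

Calling model_module_init and then model_module_exit on a zeroed context assembles the three fragments in the same order as the real session does.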

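You can think of it this way: the systemtap runtime has a begin phase and an end phase. probe begin and probe oneshot both run in the begin phase. The latter calls _stp_exit, which marks the transition to the end phase. Finally, probe end runs in the end phase.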

So is there an intermediate phase between begin and end? Of course there is. Next, let's look at an example that involves a timer.

timer

Replace probe oneshot with probe timer.ms(149):

probe timer.ms(149) {
    printf(" wor")
    exit()
}

Comparing the generated C code for the probes, it is essentially the same as before. Outside the probe bodies, however, there are two differences.

First, there is no longer a struct stap_be_probe for probe timer.ms(149), because it does not run in the begin or end phase.

Second, there is a new type, struct stap_hrtimer_probe, which is the probe type for probe timer.ms(149). The generated code shows a call to _stp_hrtimer_create in systemtap_module_init. That call registers _stp_hrtimer_notify_function, which is almost a copy of enter_be_probe.

Notably, _stp_hrtimer_notify_function performs one extra check when accounting for execution time:

        if (interval > STP_OVERLOAD_INTERVAL) {
          if (c->cycles_sum > STP_OVERLOAD_THRESHOLD) {
            _stp_error ("probe overhead exceeded threshold");
            atomic_set (session_state(), STAP_SESSION_ERROR);
            atomic_inc (error_count());
          }
          c->cycles_base = cycles_atend;
          c->cycles_sum = 0;
        }

This check exists so that systemtap cannot consume too much time within a given interval, which keeps the kernel from becoming unresponsive.
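The accounting behind this check can be sketched in plain C. The excerpt does not show how interval and cycles_sum are computed, so the two lines that compute them are assumptions, as are the function name account and the small constants, which are chosen only to make the arithmetic easy to follow.

```c
#include <stdint.h>

/* Illustrative values only, not the runtime's defaults. */
#define OVERLOAD_INTERVAL  1000
#define OVERLOAD_THRESHOLD 500

struct context {
    int64_t cycles_base;   /* start of the current accounting window */
    int64_t cycles_sum;    /* cycles spent in handlers in this window */
};

/* Account for one handler run. Returns 1 if the overload limit was hit. */
static int account(struct context *c, int64_t cycles_atstart,
                   int64_t cycles_atend)
{
    int overloaded = 0;
    int64_t interval = cycles_atend - c->cycles_base;   /* assumption */

    c->cycles_sum += cycles_atend - cycles_atstart;     /* assumption */
    if (interval > OVERLOAD_INTERVAL) {
        if (c->cycles_sum > OVERLOAD_THRESHOLD)
            overloaded = 1;        /* the real code raises a session error */
        c->cycles_base = cycles_atend;   /* start a new window */
        c->cycles_sum = 0;
    }
    return overloaded;
}
```

In this model a handler is only penalized when the window has elapsed and more than the threshold share of it was spent inside handlers; a window with little handler time simply resets the counters.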

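In the C code generated from the stp script with a timer, the session does not switch to the end phase via _stp_exit right after the begin phase. Instead, it registers a timer and runs the probe logic in the timer callback. Only afterwards, because the timer handler calls _stp_exit, does it switch to the end phase.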

Next, let's look at an example with a uprobe.

uprobe

probe process("/usr/local/openresty/luajit/bin/luajit").function("lj_str_new") {
    printf(" wor")
    exit()
}

The stp code above attaches to the lj_str_new function of the luajit executable. Note that to run this script you need to make sure luajit's debuginfo is available.

In the generated C code, the type corresponding to this probe is stapiu_consumer.

static struct stapiu_consumer stap_inode_uprobe_consumers[] = {
  { .target=&stap_inode_uprobe_targets[0], .offset=(loff_t)0x6a55ULL, .probe=(&stap_probes[1]), },
};

The curious part is the 0x6a55. That number does not appear anywhere in the stp code, so where does it come from?

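Running readelf -s /usr/local/openresty/luajit/bin/luajit | grep lj_str_new shows that the address of this function is 0x406a55. The actual runtime address would be X + 0x406a55, where X is random. Since 0x400000 is the base address fixed at link time, we can treat the address of lj_str_new as X + 0x400000 + 0x6a55. In other words, using 0x6a55 as the offset is enough to locate lj_str_new. This is also why luajit's debuginfo is required: without it, the address of lj_str_new cannot be determined.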
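The arithmetic can be written down as a small helper. The general ELF rule is that a symbol's file offset equals its virtual address minus the virtual address of the containing loadable segment plus that segment's file offset. The helper name file_offset is hypothetical, and the segment values used in the usage note (virtual address 0x400000, file offset 0) are assumptions about this particular binary that should be checked with readelf -l.

```c
#include <stdint.h>

/* File offset of a symbol inside the loadable segment that contains it:
 * offset = sym_vaddr - seg_vaddr + seg_file_offset */
static uint64_t file_offset(uint64_t sym_vaddr, uint64_t seg_vaddr,
                            uint64_t seg_file_offset)
{
    return sym_vaddr - seg_vaddr + seg_file_offset;
}
```

Under those assumptions, file_offset(0x406a55, 0x400000, 0) yields 0x6a55, the value that appears in the stapiu_consumer entry.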

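stapiu_consumer is executed in stapiu_probe_handler, and the execution flow is the same as for the previous two kinds of probe. systemtap checks all existing and newly created processes. If a process's executable matches a probe, systemtap registers that probe through the kernel API. When the kernel fires the callback, that function runs.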

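It is worth stressing that the probe fires in every matching process. Specifying -x PID only sets the value of target(). If you don't want the probe to be triggered by multiple processes, you have to handle that yourself in the stp code: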

probe process("/usr/local/openresty/luajit/bin/luajit").function("lj_str_new") {
    _target = target();
    if (pid() != _target) {
        next;
    }

    printf(" wor")
    exit()
}

The same goes for -c CMD: that option simply creates a child process and uses the child's PID as the value of target().

uretprobe

Finally, let's look at uretprobe, the counterpart of uprobe.

probe process("/usr/local/openresty/luajit/bin/luajit").function("lj_str_new").return {
    printf(" wor")
    exit()
}

The C code generated from the stp code above is largely the same as in the uprobe case. Only the stapiu_consumer is slightly different:

static struct stapiu_consumer stap_inode_uprobe_consumers[] = {
  { .return_p=1, .target=&stap_inode_uprobe_targets[0], .offset=(loff_t)0x6a55ULL, .probe=(&stap_probes[1]), },
};

There is an extra return_p=1.

Coming up

In the next article we will look at how the various stp types are compiled into C code, and discuss more systemtap implementation details.

spacewander
