OpenResty技术

广州 | OpenResty Inc. | 技术推广 | openresty.com.cn

OpenResty Inc - 软件系统的实时 X 光诊断与优化。我们有顶尖技术团队,背靠快速成长的 OpenResty 开源用户群体和社区,是世界上第一家广泛应用机器编程的公司,我们的愿景是帮助所有开发与运维工程师,透视任意生产环境。

个人动态

OpenResty技术 发布了文章 · 2020-10-08

Lua 级别 CPU 火焰图简介

在 OpenResty 或 Nginx 服务器中运行 Lua 代码如今已经变得越来越常见,因为人们希望他们的非阻塞的 Web 服务器能够兼具超高的性能和很大的灵活性。有些人使用 Lua 完成一些非常简单的任务,比如检查和修改某些请求头和响应体数据,而有些人则利用 Lua 创建非常复杂的 Web 应用、CDN 软件和 API 网关等等。Lua 以简单、内存占用小和运行效率高而著称,尤其是在使用 LuaJIT 这样的即时编译器(JIT)的时候。但有些时候,在 OpenResty 或 Nginx 服务器上运行的 Lua 代码也会消耗过多的 CPU 资源。通常这是由于程序员的编程错误,比如调用了一些昂贵的 C/C++ 库代码,或者其他原因。

要想在一个在线的 OpenResty 或 Nginx 服务器中快速地定位所有的 CPU 性能瓶颈,最好的方法是使用 OpenResty XRay 产品提供的 Lua 语言级别 CPU 火焰图的采样工具。这个工具不需要对 OpenResty 或 Nginx 的目标进程做任何修改,也不会对生产环境中的进程产生任何可觉察的影响。

本文将解释什么是火焰图,以及什么是 Lua 级别的 CPU 火焰图,会穿插使用多个小巧且独立的 Lua 代码实例来做演示。我们将利用 OpenResty XRay 来生成这些示例的火焰图来进行讲解和分析。我们选择小例子的原因是,它们更容易预测和验证各种性能分析的结果。相同的分析方法和工具也适用于那些最复杂的 Lua 应用。过去这几年,我们使用这种技术和可视化方式,成功地帮助了许多拥有繁忙网站或应用的企业客户。

什么是火焰图

火焰图是由 Brendan Gregg 发明的一种可视化方法,用于展示某一种系统资源或性能指标,是如何定量分布在目标软件里所有的代码路径上的。

这里的“系统资源”或指标可以是 CPU 时间、off-CPU 时间、内存使用、硬盘使用、延时,或者任何其他你能想到的资源。

而“代码路径”可以定义为目标软件代码中的调用栈轨迹。调用栈轨迹通常由一组函数调用帧组成,常见于 GDB 命令 bt 的输出,以及 Python 或 Java 程序的异常错误信息当中。比如下面是一个 Lua 调用栈轨迹的样例:

C:ngx_http_lua_ngx_timer_at
at
cache.lua:43
cache.lua:record_timing
router.lua:338
router.lua:route
v2_routing.lua:1214
v2_routing.lua:route
access_by_lua.lua:130

在这个例子中,Lua 栈是从基帧 access_by_lua.lua:130 一路生长到顶帧 C:ngx_http_lua_ngx_timer_at。它清晰地显示了不同的 Lua 或 C 函数之间是如何相互调用的,从而构成了“代码路径”的一种近似表示。

而上文中的“所有代码路径”,实际上是从统计学的角度来看的,并不是真的要去枚举和遍历程序中的每一条代码路径。显然在现实中,由于组合爆炸的问题,后者的开销极其高昂。我们只要确保所有那些开销不太小的代码路径,都有机会出现在我们的图中,并且我们能以足够小的误差去量化它们的开销。

本文会聚焦在一种特定类型的火焰图上面。这种火焰图专用于展示 CPU 时间(或 CPU 资源)是如何定量分布在所有的 Lua 代码路径上的。特别地,我们这里只关注 OpenResty 或 Nginx 目标进程里的 Lua 代码。自然地,这类火焰图被我们命名为“Lua 级别 CPU 火焰图”(Lua-land CPU Flame Graphs)。

本文标题图片是一个火焰图示例,后文将提供更多示例。

为什么需要火焰图

火焰图仅用一张小图,就可以定量展示所有的性能瓶颈的全景图,而不论目标软件有多么复杂。

传统的性能分析工具通常会给用户展示大量的细节信息和数据,而用户很难看到全貌,反而容易去优化那些并不重要的地方,经常浪费大量时间和精力却看不到明显效果。传统分析器的另一个缺点是,它们通常会孤立地显示每个函数调用的延时,但很难看出各个函数调用的上下文,而且用户还须刻意区分当前函数本身运行的时间(exclusive time)和包括了其调用其他函数的时间在内的总时间(inclusive time)。

而相比之下,火焰图可以把大量信息压缩到一个大小相对固定的图片当中(通常一屏就可以显示全)。不怎么重要的代码路径会在图上自然地淡化乃至消失,而真正重要的代码路径则会自然地凸显出来。越重要的,则会显示得越明显。火焰图总是为用户提供最适当的信息量,不多,也不少。

如何解读火焰图

对于新手而言,正确地解读火焰图可能不太容易。但通过一些简单的解释,用户就会发现火焰图其实很直观,很容易理解。火焰图是一张二维图。y 轴显示的是代码(或数据)上下文,比如目标编程语言的调用栈轨迹,而 x 轴则显示的是各个调用栈所占用的系统资源的比例。整个 x 轴通常代表了目标软件所消耗的 100% 的系统资源(比如 CPU 时间)。x 轴上的各个调用栈轨迹的先后顺序通常并不重要,因为这些调用栈只是根据函数帧名的字母顺序来排列。当然,也会有一些例外,例如笔者发明了一种时序火焰图,其中的 x 轴实际上是时间轴,此时调用栈的先后顺序就是时间顺序。本文将专注于讨论经典的火焰图类型,即图中 x 轴上的顺序并不重要。

要学会读懂一张火焰图,最好的方法是尝试解读真实的火焰图样本。下文将提供多个火焰图实例,针对 OpenResty 和 Nginx 服务器上运行的 Lua 应用,并提供详细的解释。

简单的 Lua 样例

本节将列举几个简单的有明显性能特征的 Lua 样例程序,并将使用 OpenResty XRay 分析真实的 nginx 进程,生成 Lua 级别的 CPU 火焰图,并验证图中显示的性能情况。我们将检查不同的案例,例如开启了 JIT 即时编译的 Lua 代码、禁用了 JIT 编译的 Lua 代码(即被解释执行),以及调用外部 C 库代码的 Lua 代码。

JIT 编译过的 Lua 代码

首先,我们来研究一个开启了 JIT 即时编译的 Lua 样本程序(LuaJIT 是默认开启 JIT)。

考虑下面这个独立的 OpenResty 小应用。本节将一直使用这个示例,但会针对不同情形的讨论需求,适时对这个例子进行少许修改。

我们首先准备这个应用的目录布局:

mkdir -p ~/work
cd ~/work
mkdir conf logs lua

然后我们创建如下所示的 conf/nginx.conf 配置文件:

master_process on;
worker_processes 1;

events {
    worker_connections 1024;
}

http {
    lua_package_path "$prefix/lua/?.lua;;";

    server {
        listen 8080;

        location = /t {
            content_by_lua_block {
                require "test".main()
            }
        }
    }
}

在 location /t 的 Lua 处理程序中,我们加载了名为 test 的外部 Lua 模块,并立即调用该模块的 main 函数。我们使用了 lua_package_path 配置指令,来把 lua/ 目录添加到 Lua 模块的搜索路径列表中,因为我们会把刚提及的 test 这个 Lua 模块文件放到 lua/ 目录下。

这个 test Lua 模块定义在 lua/test.lua 文件中:

local _M = {}

local N = 1e7

local function heavy()
    local sum = 0
    for i = 1, N do
        sum = sum + i
    end
    return sum
end

local function foo()
    local a = heavy()
    a = a + heavy()
    return a
end

local function bar()
    return (heavy())
end

function _M.main()
    ngx.say(foo())
    ngx.say(bar())
end

return _M

这里我们定义了一个计算量较大的 Lua 函数 heavy(),计算从 1 到 1000 万(1e7)的数字之和。然后我们在函数 foo() 中调用两次 heavy() 函数,而在 bar() 函数中只调用一次 heavy() 函数。最后,模块的入口函数 _M.main() 先后调用 foo 和 bar 各一次,并通过 ngx.say 向 HTTP 响应体输出它们的返回值。

显然,在这个 Lua 处理程序中,foo() 函数占用的 CPU 时间应当是 bar() 函数的两倍,因为 foo() 函数调用了 heavy() 函数两次,而 bar() 仅调用了一次。通过下文中由 OpenResty XRay 采样生成的 Lua 级别的 CPU 火焰图,我们可以很容易地验证这里的观察结果。

因为在这个示例中,我们并没有触碰 LuaJIT 的 JIT 编译器选项,因此 JIT 编译便使用了默认的开启状态,并且现代的 OpenResty 平台版本则总是只使用 LuaJIT(对标准 Lua 5.1 解释器的支持早已移除)。

现在,我们可以按下面的命令启动这个 OpenResty 应用:

cd ~/work/
/usr/local/openresty/bin/openresty -p $PWD/

假设 OpenResty 安装在当前系统的 /usr/local/openresty/ 目录下(这是默认的安装位置)。

为了使 OpenResty 应用忙碌起来,我们可以使用 ab 或 weighttp 这样的压测工具,向 URI http://localhost:8080/t 施加请求压力,或者使用 OpenResty XRay 产品自带的负载生成器。
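例如,可以使用类似下面的 ab 命令来持续施压(这里的请求总数和并发数只是示意性的取值,可按需调整):

# 用 10 个并发连接发送 20 万个请求,让 /t 接口保持忙碌
ab -k -c 10 -n 200000 http://127.0.0.1:8080/t

无论使用何种方式,当目标 OpenResty 应用的 nginx 工作进程保持活跃时,我们可以在 OpenResty XRay 的 Web 控制台里得到类似下面这张 Lua 级别的 CPU 火焰图: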

启用即时编译的 Lua 代码的 Lua-land CPU 火焰图

我们从图上可以观察到下列现象:

  1. 图中的所有 Lua 调用栈都源自同一个入口点,即 content_by_lua(nginx.conf:24)。这符合预期。
  2. 图中主要显示了两个代码路径,分别是

    content_by_lua -> test.lua:main -> test.lua:bar -> test.lua:heavy -> trace#2:test.lua:8

    以及

    content_by_lua -> test.lua:main -> test.lua:foo -> test.lua:heavy -> trace#2:test.lua:8

    两个代码路径的唯一区别是中间的 foo 函数帧与 bar 函数帧。这也不出所料。

  3. 左侧涉及 bar 函数的代码路径的宽度,是右侧涉及 foo 的代码路径宽度的一半。换言之,这两个代码路径在图中 x 轴上的宽度比为 1:2,即 bar 代码路径占用的 CPU 时间,只有 foo 代码路径的 50%。将鼠标移动到图中的 test.lua:bar 帧(即方框)上,我们可以看到它占据总样本量(即总 CPU 时间)的 33.3%,而 test.lua:foo 所占的比例为 66.7%。显然,与我们之前的预测相比较,这个火焰图提供的比例数字非常精确,尽管它所采取的是采样和统计分析的方法。
  4. 我们在图中没有看到 ngx.say() 等其他代码路径,毕竟它们与那两个调用了 heavy() 的 Lua 代码路径相比,所占用的 CPU 时间微乎其微。在火焰图中,那些微不足道的代码路径本就是小噪音,不会引起我们的关注。我们可以始终专注于那些真正重要的部分,而不会为其他东西分心。
  5. 那两条热代码路径(即调用栈轨迹)的顶部帧是完全相同的,都是 trace#2:test.lua:8。它并不是真正的 Lua 函数调用帧,而是一个“伪函数帧”,用于表示它正在运行一个被 JIT 编译了的 Lua 代码路径。按照 LuaJIT 的术语,该路径被称为“trace”(因为 LuaJIT 是一种 tracing JIT 编译器)。这个“trace”的编号为 2,而对应的被编译的 Lua 代码路径是从 test.lua 文件的第 8 行开始的。而 test.lua:8 所指向的 Lua 代码行是:

    sum = sum + i

我们很高兴地看到,这个非侵入的采样工具,可以从一个没有任何外挂模块、没有被修改过、也没有使用特殊编译选项的标准 OpenResty 二进制程序,得到如此准确的火焰图。这个工具没有使用 LuaJIT 运行时的任何特殊特性或接口,甚至没有使用它的 LUAJIT_USE_PERFTOOLS 特性或者 LuaJIT 内建的性能分析器。相反,该工具使用的是先进的动态追踪 技术,仅读取原始目标进程中原有的信息。我们甚至可以从 JIT 编译过的 Lua 代码中获取足够多的有用信息。

解释执行的 Lua 代码

解释执行的 Lua 代码通常能够得到最完美的调用栈轨迹和火焰图样本。如果我们的采样工具能够正确处理 JIT 即时编译后的 Lua 代码,那么在分析解释执行的 Lua 代码时,效果只会更好。LuaJIT 既有一个 JIT 编译器,又同时有一个解释器。它的解释器的有趣之处在于,几乎完全是用手工编写的汇编代码实现的(当然,LuaJIT 引入了自己的一种汇编语言记法,叫做 DynASM)。

对于我们一直在使用的那个 Lua 样例程序,我们需要在此做少许修改,即在 server {} 配置块中添加下面的 nginx.conf 配置片段:

init_by_lua_block {
    jit.off()
}

然后重新加载(reload)或重启服务器进程,并保持流量负载。

这回我们得到了下面这张 Lua 级别 CPU 火焰图:

解释性 Lua 代码的 Lua-land 火焰图

这张新图与前一张图在以下方面都极其相似:

  1. 我们依旧只看到了两条主要的代码路径,分别是 bar 代码路径和 foo 代码路径。
  2. bar 代码路径依旧占用了总 CPU 时间的三分之一左右,而 foo 占用了余下的所有部分(即大约三分之二)。
  3. 图中显示的所有代码路径的入口都是 content_by_lua 那一帧。

然而,这张图与前图相比仍然有一个重要的区别:代码路径的顶帧不再是 "trace" 伪帧了。这个变化也是预期的,因为这一回没有 JIT 编译过的 Lua 代码路径了,于是代码路径的顶部或顶帧变成了 lj_BC_IFORL 和 lj_BC_ADDVV 等函数帧。而这些被 C: 前缀标记出来的 C 函数帧其实也并非 C 语言函数,而是属于汇编代码帧,对应于实现各个 LuaJIT 字节码的汇编例程,它们被标记成了 lj_BC_IFORL 等符号。自然地,lj_BC_IFORL 用于实现 LuaJIT 字节码指令 IFORL,而 lj_BC_ADDVV 则用于字节码指令 ADDVV。IFORL 用于解释执行 Lua 代码中的 for 循环,而 ADDVV 则用于算术加法。这些字节码的出现,都符合我们的 Lua 函数 heavy() 的实现方式。另外,我们还可以看到一些辅助的汇编例程,例如 lj_meta_arith 和 lj_vm_foldarith。
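如果想自行对照这些字节码,可以用 LuaJIT 自带的字节码列表功能查看 test.lua 编译出的字节码。下面的命令只是一个示意,假设使用的是 OpenResty 自带的 LuaJIT(默认安装位置),并且在 ~/work 目录下运行:

# 列出 lua/test.lua 编译出的 LuaJIT 字节码;
# 在 heavy() 的循环体中可以看到 ADDVV 等算术指令以及 FORL 一类的循环指令
/usr/local/openresty/luajit/bin/luajit -bl lua/test.lua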

通过观察这些函数帧的比例数值,我们还得以一窥 CPU 时间在 LuaJIT 虚拟机和解释器内部的分布情况,为这个虚拟机和解释器本身的优化铺平道路。

调用外部 C/C++ 函数

Lua 代码调用外部 C/C++ 库函数的情况很常见。我们也希望通过 Lua 级别的 CPU 火焰图,了解这些外部的 C 函数所占用的 CPU 时间比例,毕竟这些 C 语言函数调用也是由 Lua 代码发起的。这也是基于动态追踪的性能分析的真正优势所在:这些外部 C 语言函数调用在性能分析中永远不会成为盲点1

我们一直使用的 Lua 样例在这里又需要作少许修改,即需要将 heavy() 这个 Lua 函数修改成下面这个样子:

local ffi = require "ffi"
local C = ffi.C

ffi.cdef[[
    double sqrt(double x);
]]

local function heavy()
    local sum = 0
    for i = 1, N do
        -- sum = sum + i
        sum = sum + C.sqrt(i)
    end
    return sum
end

这里我们使用 LuaJIT 的 FFI API ,先声明了一下标准 C 库函数 sqrt(),并直接在 Lua 函数 heavy()内部调用了这个 C 库函数。它应当会显示在对应的 Lua 级别 CPU 火焰图中。

此次我们得到了下面这张火焰图:

调用 C 语言函数的 Lua 代码的 Lua-land 火焰图

有趣的是,我们果然在那两条主要的 Lua 代码路径的顶部,看到了 C 语言函数帧 C:sqrt。另外值得注意的是,我们在顶部附近依旧看到了 trace#N 这样的伪帧,这说明我们通过 FFI 调用 C 函数的 Lua 代码,也是可以被 JIT 编译的(这回我们从 init_by_lua_block 指令中删除了 jit.off() 语句)。

代码行层面的火焰图

上文展示的火焰图其实都是函数层面的火焰图,因为这些火焰图中所显示的所有调用帧都只有函数名,而没有发起函数调用的源代码行的信息。

幸运的是, OpenResty XRay 的 Lua 级别性能分析工具支持生成代码行层面的火焰图,会在图中添加 Lua 源代码行的文件名和行号,以方便用户在较大的 Lua 函数体中直接定位到某一行 Lua 源代码。下图是我们一直使用的那个 Lua 样例程序所对应的一张 Lua 代码行层面的 CPU 火焰图:

即时编译的 Lua 代码的代码行层面 Lua-land 火焰图

我们可以看到在每一个函数帧上方都多了一个源代码行的伪帧。例如,在函数 main 所在的 test.lua 源文件的第 32 行 Lua 代码,调用了 foo() 函数。而在 foo() 函数所在的 test.lua:22 这一行,则调用了 heavy() 函数。

代码行层面的火焰图对于准确定位最热的 Lua 源代码行和 Lua 语句有非常大的帮助。当对应的 Lua 函数体很大的时候,代码行层面的火焰图可以帮助节约排查代码行位置的大量时间。

多进程

在多核 CPU 的系统上,为单个 OpenResty 或 Nginx 服务器实例配置多个 nginx 工作进程是很常见的做法。OpenResty XRay 的分析工具支持同时对一个指定进程组中的所有进程进行采样。当进来的流量不是很大,并且可能分布在任意一个或几个 nginx 工作进程上的时候,这种全进程组粒度的采样分析是非常实用的。

复杂的 Lua 应用

我们也可以从非常复杂的 OpenResty/Lua 应用中得到 Lua 级别的 CPU 火焰图。例如,下面的 Lua 级别 CPU 火焰图,源自对运行了我们的 OpenResty Edge 产品的“迷你 CDN”服务器的采样。这是一款复杂的 Lua 应用,同时包含了全动态的 CDN 网关、地理敏感的 DNS 权威服务器和一个 Web 应用防火墙(WAF):

我们的迷你-CDN 服务器的 Lua-land CPU 火焰图

从图上可以看到,Web 应用防火墙(WAF)占用的 CPU 时间最多,内置 DNS 服务器也占用了很大一部分 CPU 时间。我们部署在全球范围的“迷你 CDN”网络为我们自己运营的多个网站,比如 openresty.org 和 openresty.com,提供了安全和加速支持。

它还可以分析那些基于 OpenResty 的 API 网关软件,例如 Kong 等等。

采样开销

我们使用的是基于采样的方法,而不是全量埋点,因此为生成 Lua 级别 CPU 火焰图所产生的运行时开销通常可以忽略不计。无论是数据量还是 CPU 损耗都是极小的,所以这类工具非常适合于生产环境和在线环境。

如果我们通过固定速率的请求来访问 nginx 目标进程,并且 Lua 级别 CPU 火焰图工具同时在进行密集采样,则该目标进程的 CPU 使用率随时间的变化曲线如下所示:

采样时的 CPU 使用量曲线

该 CPU 使用率的变化曲线图也是由 OpenResty XRay 自动生成和渲染的。

在我们停止工具采样之后,同一个 nginx 工作进程的 CPU 使用量曲线仍然非常相似:

不采样时进程的 CPU 使用量

我们凭肉眼很难看出前后两条曲线之间有什么差异。所以,工具进行分析和采样的开销确实是非常低的。

而当工具不在采样时,对目标进程的性能影响严格为零,毕竟我们并不需要对目标进程做任何的定制和修改。

安全性

由于使用了动态追踪技术,我们不会改变目标进程的任何状态,甚至不会修改其中哪怕一比特的信息2。这样可以确保目标进程无论是在采样时,还是没有采样时,其行为(几乎)是完全相同的。这就保证了目标进程自身的可靠性(不会有意外的行为变化或进程崩溃),其行为不会因为分析工具的存在而受到任何影响。目标进程的表现完全没有变化,就像是为一只活体动物拍摄 X 光片一样。

传统的应用性能管理(APM)产品可能要求在目标软件中加载特殊的模块或插件,甚至在目标软件的可执行文件或进程空间里强行打上补丁或注入自己的机器代码或字节码,这都可能会严重影响用户系统的稳定性和正确性。

因为这些原因,我们的工具可以安全应用到生产环境中,以分析那些在离线环境中很难复现的问题。

兼容性

OpenResty XRay 产品提供的 Lua 级别 CPU 火焰图的采样工具,同时支持 LuaJIT 的 GC64 模式或非 GC64 模式,也支持任意的 OpenResty 或 Nginx 的二进制程序,包括用户使用任意构建选项自己编译的、优化或未优化的二进制程序。

OpenResty XRay 也可以对在 Docker 或 Kubernetes 容器内运行的 OpenResty 和 Nginx 服务器进程进行透明的分析,并生成完美的 Lua 级别的 CPU 火焰图,不会有任何问题。

我们的工具还可以分析由 resty 或 luajit 命令行工具运行的那些基于控制台的用户 Lua 程序。

我们也支持较老的 Linux 操作系统和内核,比如使用 2.6.32 内核的 CentOS 6 老系统。

其他类型的 Lua 级别火焰图

如前文所述,火焰图可以用于可视化任意一种系统资源或性能指标,而不仅限于 CPU 时间。因此,我们的 OpenResty XRay 产品中也提供了其他类型的 Lua 级别火焰图,比如 off-CPU 火焰图、垃圾回收(GC)对象大小和数据引用路径火焰图、新 GC 对象分配火焰图、Lua 协程弃权(yield)时间火焰图、文件 I/O 延时火焰图等等。

我们的博客网站 将会发文详细介绍这些不同类型的火焰图。

结论

我们在本文中介绍了一种非常实用的可视化方法,火焰图,可以直观地分析任意软件系统的性能。我们深入讲解了其中的一类火焰图,即 Lua 级别 CPU 火焰图。这种火焰图可用于分析在 OpenResty 和 Nginx 服务器上运行的 Lua 应用。我们分析了多个 Lua 样例程序,简单的和复杂的,同时使用 OpenResty XRay 生成的对应的 Lua 级别 CPU 火焰图,展示了动态追踪工具的威力。最后,我们检查了采样分析的性能损耗,以及在线使用时的安全性和可靠性。

关于作者

章亦春是开源项目 OpenResty® 的创始人,同时也是 OpenResty Inc. 公司的创始人和 CEO。他贡献了许多 Nginx 的第三方模块,相当多 Nginx 和 LuaJIT 核心补丁,并且设计了 OpenResty XRay 等产品。

关注我们

如果您觉得本文有价值,非常欢迎关注我们 OpenResty Inc. 公司的博客网站 。也欢迎扫码关注我们的微信公众号:

我们的微信公众号

翻译

我们提供了英文版 原文和中译版(本文)。我们也欢迎读者提供其他语言的翻译版本,只要是全文翻译不带省略,我们都将会考虑采用,非常感谢!


  1. 同样地,虚拟机中的任何原语例程也不会成为分析的盲点。所以,我们也可以同时对虚拟机本身进行性能分析。
  2. Linux 内核的 uprobes 机制,仍然会以一种确保安全的方式,轻微地改变目标进程中少数机器指令的内存状态以实现透明且安全的动态探针,而这种修改对目标进程是完全透明的。

OpenResty技术 发布了文章 · 2020-09-02

Introduction to Lua-Land CPU Flame Graphs

Lua code running inside OpenResty or Nginx servers is very common nowadays since people want both performance and flexibility out of their nonblocking web servers. Some people use Lua for very simple tasks like modifying and checking certain request headers and response bodies while other people use Lua to build very complicated web applications, CDN software, and API gateways. Lua is known for its simplicity, small memory footprint, and efficiency, especially when using Just-in-Time (JIT) compilers like LuaJIT. But still, sometimes the Lua code running atop OpenResty or Nginx servers may consume too many CPU resources due to the programmer's coding mistakes, calls to some expensive C/C++ library code, or some other reasons.

The best way to quickly find all the CPU bottlenecks in an online OpenResty or Nginx instance is the Lua-land CPU flame graph sampling tool as provided by the OpenResty XRay product. It does not require any changes to the target OpenResty or Nginx processes nor have any noticeable impact to the processes in production. This article will introduce the idea of Lua-land CPU flame graphs and use OpenResty XRay to produce real flame graphs for several small and standalone Lua examples. We choose small examples because it is much easier to predict and verify the profiling results. The same idea and tools do apply equally well to those most complex Lua applications in the wild. And we have had many successes in using this technique and visualization to help our customers running very busy sites and applications in the past few years.

What is a Flame Graph

Flame graphs are a visualization method invented by Brendan Gregg for showing how a system resource or a metric is distributed across all the code paths in the target software. The system resource could be CPU time, off-CPU time, memory usage, disk usage, latency, or any other things you can imagine. The code paths here are defined by the backtraces in the target software's code. A backtrace is usually a stack of function call frames as in the output of the bt GDB command and in an exception error message of a Python or Java program. When we say "all code paths", we actually mean it from a statistical perspective instead of literally iterating through every single code path in a program. Obviously the latter would be prohibitively expensive to do in reality. We just make sure all code paths with nontrivial overhead are showing up in our graphs and we can identify their cost quantitatively with good enough confidence.

In this article we just focus on the type of flame graphs which show how the CPU time (or CPU resources) is distributed across all the Lua code paths in a target OpenResty or Nginx process (or processes), hence the name "Lua-Land CPU Flame Graphs".

The header picture of this article shows a sample flame graph and we will see several more in later parts of this post.

Why Flame Graphs

Flame Graphs are a great way to show the "big picture" of bottlenecks quantitatively in a single small graph. Traditional profilers would usually throw a ton of details and numbers in the user's face. The user may then lose sight of the whole picture and go down rabbit holes for things that do not really matter. Another drawback of traditional profilers is that they just give you latencies of all the functions while it's hard to see the contexts of these function calls, not to mention that the user also has to distinguish the exclusive time and inclusive time of a function call. On the other hand, Flame Graphs can squeeze a great deal of information very compactly into a limited-size graph. Stuff which does not matter fades out naturally while stuff which really matters stands out. No more, no less, just the right amount of information for the user.

How to read a Flame Graph

Flame Graphs might be a bit tricky to read for a newcomer. But with a little guidance the user would find them quite intuitive to understand. A Flame Graph is a two-dimensional graph. The y-axis shows the context, i.e., the backtraces of the target programming language, while the x-axis shows what percentage of the resource a particular backtrace takes. The full x-axis usually means 100% of the resources spent on the target software. The order of backtraces along the x-axis usually does not matter since they are simply sorted by function frame names alphabetically. There are exceptions, however: I invented a type of Time-Series Flame Graphs where the x-axis actually means the time axis instead. For this article, we only care about the classic type of flame graphs where the order along the x-axis does not matter at all.

The best way to learn how to read a flame graph is to read sample flame graphs. We will see several examples below with detailed explanation for OpenResty and Nginx's Lua applications.

Simple Lua samples

In this section, we will consider several simple Lua samples with obvious performance characteristics and we will use OpenResty XRay to profile the real nginx processes to show Lua-land CPU Flame Graphs and verify the performance behaviors in the graphs. We will check different cases like with and without JIT compilation enabled for the Lua code, as well as calling into external C library code.

JIT compiled Lua code

We will first investigate a Lua sample program with JIT compilation enabled (which is enabled by default in LuaJIT).

Let's consider the following standalone OpenResty application. We will use this example throughout this section with minor modifications for different cases. We first prepare the applications' directory layout:

mkdir -p ~/work
cd ~/work
mkdir conf logs lua

And then we create the conf/nginx.conf configuration file as follows:

master_process on;
worker_processes 1;

events {
    worker_connections 1024;
}

http {
    lua_package_path "$prefix/lua/?.lua;;";

    server {
        listen 8080;

        location = /t {
            content_by_lua_block {
                require "test".main()
            }
        }
    }
}

Here we load the external Lua module named test and immediately calls its main function in our Lua handler for the location /t. We use the lua_package_path directive to add the lua/ directory into the Lua module search paths since we will shortly put the aforementioned test Lua module into lua/.

The test Lua module is defined in the lua/test.lua file as follows:

local _M = {}

local N = 1e7

local function heavy()
    local sum = 0
    for i = 1, N do
        sum = sum + i
    end
    return sum
end

local function foo()
    local a = heavy()
    a = a + heavy()
    return a
end

local function bar()
    return (heavy())
end

function _M.main()
    ngx.say(foo())
    ngx.say(bar())
end

return _M

Here we define a computation-heavy Lua function, heavy(), which computes the sum of numbers from 1 to 10 million (1e7). Then we call this heavy() function twice in function foo() and just once in function bar(). Finally, the module entry function _M.main() calls foo and bar just once each in turn and prints out their return values respectively to the HTTP response body via ngx.say.

Intuitively, for this Lua handler, the foo() function would take exactly twice the CPU time taken by the bar() function because foo() calls heavy() twice while bar() only calls heavy() once. We can easily verify this observation in the Lua-land CPU flame graphs sampled by OpenResty XRay below.

Because we did not touch the JIT compiler settings in this example, the JIT compilation is turned on by default since modern OpenResty platform versions always use LuaJIT anyway.
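If you want to double-check this on your own server, one hypothetical way (not part of the original example) is to log the JIT status from an init_by_lua_block snippet like the following:

init_by_lua_block {
    -- log the LuaJIT version and whether the JIT compiler is enabled;
    -- the message shows up in the error log when the server starts
    local jit = require "jit"
    ngx.log(ngx.NOTICE, "running ", jit.version,
            ", JIT enabled: ", tostring(jit.status()))
}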

Now we can start this OpenResty web application like this:

cd ~/work/
/usr/local/openresty/bin/openresty -p $PWD/

assuming OpenResty is installed under /usr/local/openresty/ in the current system.

To make this OpenResty application hot, we can use tools like ab and weighttp to load the URI http://localhost:8080/t or the load generator provided by the OpenResty XRay product. Either way, while the target OpenResty application's nginx worker process is busy, we can get the following Lua-land CPU flame graph in OpenResty XRay's web console:

Lua-land CPU flame graph for JITted Lua code

From this graph we can make the following observations:

  1. All Lua backtraces in this graph stem from the same entry point, content_by_lua(nginx.conf:24), which is expected.
  2. There are mainly two code paths shown in the graph, which are

    content_by_lua -> test.lua:main -> test.lua:bar -> test.lua:heavy -> trace#2:test.lua:8

    and also

    content_by_lua -> test.lua:main -> test.lua:foo -> test.lua:heavy -> trace#2:test.lua:8

    The only difference in these two code paths are foo vs bar. This is also expected.

  3. The left-hand-side code path involving bar is just half as wide as the right-hand-side code path involving foo. In other words, their width ratio along the x-axis is 1:2, which means that the bar code path takes only 50% of the time taken by foo. By putting your mouse onto the test.lua:bar frame (or box) in the graph, we can see that it takes 33.3% of the total samples (or total CPU time) while the test.lua:foo frame shows 66.7%. Obviously it is very accurate as compared to our predictions even though it takes a sampling and statistical approach.
  4. We did not see other code paths like ngx.say() in the graph since such code paths simply take too little CPU time as compared to the two dominating Lua code paths involving heavy(). Trivial things are simply noises which won't catch our attention in the flame graph. We always focus on really important things and cannot get distracted.
  5. The top frames for both code paths (or backtraces) are the same, which is trace#2:test.lua:8. This is not really a Lua function call frame, but rather a pseudo frame indicating that it is running a JIT compiled Lua code path which is called a "trace" in LuaJIT's terminology (because LuaJIT is a tracing JIT compiler). And this "trace" has the ID number 2 and the compiled Lua code path starts from the Lua source code line 8 of the test.lua file. test.lua:8 is this Lua code line:

            sum = sum + i

It is wonderful to see our noninvasive sampling tool can get such accurate flame graphs from a standard binary build of OpenResty without any extra modules, modifications, or special build flags. The tool does not use any special features or interfaces of the LuaJIT runtime at all, not even the LUAJIT_USE_PERFTOOLS feature or its built-in profiler. Instead it uses advanced dynamic tracing technologies which simply read the information already available in the target process itself. And we are able to get good enough information even from JIT compiled Lua code.

Interpreted Lua code

Interpreted Lua code can usually result in perfectly accurate backtraces and flame graphs. If the sampling tool can handle JIT-compiled Lua code just fine, then it can only do a better job when dealing with interpreted Lua code. One interesting thing about LuaJIT's interpreter is that the interpreter is written almost completely in hand-crafted assembly code.

For our continuing Lua example, we simply add the following nginx.conf snippet inside the server {} block:

    init_by_lua_block {
        jit.off()
    }

And then reload or restart the server processes and still keep the traffic load.

This time we get the following Lua-land CPU flame graphs:

Lua-land flame graph for interpreted Lua code

This graph is very similar to the previous one in that:

  1. We are still only seeing two main code paths, the bar one and the foo one.
  2. The bar code path still takes approximately one third of the total CPU time and the foo one still takes the remaining two-thirds.
  3. The entry point for all the code paths shown in the graph is the content_by_lua thing.

This graph still has an important difference, however: the tips of the code paths (or backtraces) are no longer "traces". This is expected since no JIT compiled Lua code paths are possible this time. The tips or top frames are now C function frames like lj_BC_IFORL and lj_BC_ADDVV. These C function frames are actually not C functions per se. Instead they are assembly code frames corresponding to LuaJIT byte code interpretation handlers specially annotated by symbols like lj_BC_IFORL. Naturally, lj_BC_IFORL is for the LuaJIT byte code instruction IFORL while lj_BC_ADDVV is for the byte code instruction ADDVV. IFORL handles interpreted Lua for loops while ADDVV handles arithmetic additions. All these are expected according to our Lua function heavy(). There are also some auxiliary assembly routines like lj_meta_arith and lj_vm_foldarith. By looking at the percentage numbers for these frames, we can also understand how the CPU time is distributed inside the LuaJIT virtual machine and interpreter, paving the way to optimize the VM and the interpreter themselves.

Calling external C/C++ functions

It is very common for Lua code to invoke external C/C++ library functions, in which case we also want to see their proportional parts in the Lua-land CPU flame graphs, because such C function calls are initiated from within the Lua code anyway. This is also where dynamic-tracing-based profiling really shines: such external C function calls never become blind spots for the profiler1.

Let's modify the heavy() Lua function in our ongoing example as follows:

local ffi = require "ffi"
local C = ffi.C

ffi.cdef[[
    double sqrt(double x);
]]

local function heavy()
    local sum = 0
    for i = 1, N do
        -- sum = sum + i
        sum = sum + C.sqrt(i)
    end
    return sum
end

Here we use LuaJIT's FFI API to declare and invoke the standard C library function sqrt() directly from within the Lua function heavy(). It should show up in the corresponding Lua-land CPU flame graphs.

This time we got the following flame graph:

Lua-land flame graph for Lua code calling C functions

Interestingly we indeed see the C function frame C:sqrt showing up as the tips of those two main Lua code paths. It's also worth noting that we still see the trace#N frames near the top, which means that our FFI calls into the C function can also be JIT compiled (this time we removed the jit.off() statement from the init_by_lua_block directive).

Line-Level Flame Graphs

The previous flame graphs we have seen are all function-level flame graphs because almost all the function frames shown in the flame graphs have only function names instead of the source lines initiating the calls.

Fortunately OpenResty XRay's Lua-land profiling tools can also generate Lua source line numbers in its line-level flame graphs by which we can easily know exactly what Lua source lines are hot. Below is such an example for our ongoing Lua example:

Line-level Lua-land flame graph for JITted Lua code

We can see that there is now one more source-line frame added above each corresponding function-name frame. For example, inside function main, on line 32, there is the call to the foo() function. And inside the foo() function, on line 22, there is the call to the heavy() function, and so on.

Line-level flame graphs are very useful to pinpoint the hottest source line and Lua statements. This can save a lot of time when the corresponding Lua function body is large.

Multiple processes

It is common to configure multiple nginx worker processes for a single OpenResty or Nginx server instance on a system with multiple CPU cores. OpenResty XRay's profiling tools support sampling all the processes in a specific process group at the same time. This is useful when the incoming traffic is moderate and is spread across arbitrary nginx worker processes.
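For reference, such a multi-worker setup is usually just a one-line change in nginx.conf, for example:

# spawn one nginx worker process per CPU core
worker_processes auto;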

Complicated Lua applications

We can also get Lua-land CPU flame graphs from very complicated OpenResty/Lua applications in the wild. For example, below is a Lua-land CPU flame graph sampled on one of our mini-CDN servers running our OpenResty Edge product, which is a very complex Lua application, including a dynamic CDN gateway, a geo-sensitive DNS server, and a web application firewall (WAF):

Lua-land CPU flame graph for our mini-CDN server

From this graph, we can see that the WAF takes most of the CPU time while the built-in DNS server also takes a good portion. Our global mini-CDN network is also powering our own web sites like openresty.org and openresty.com.

It can surely analyze OpenResty-based API gateway software like Kong as well.

Sampling overhead

Because we use the sampling approach instead of full instrumentation, the overhead involved with the sampling for generating Lua-land CPU flame graphs is usually negligible which makes such tools usable in production or online environments.

If we load the target nginx worker process with requests of a constant rate, the CPU usage curve of the target process while Lua CPU flame graph sampling is frequently performed is like this:

CPU usage curve when sampling

This CPU usage line graph is also generated and rendered by OpenResty XRay automatically.

And then we stop sampling at all and the CPU usage curve of the same nginx worker process is very similar:

CPU usage for processes when not sampling

We cannot really see any differences between these two curves with human eyes. So the profiling and sampling cost is indeed very small.

When the tools are not sampling, the performance impact is strictly zero since we never change anything in the target processes anyway.

Safety

Because we use dynamic tracing technologies, we do not change any state in the target processes, not even a single bit of information2. This makes sure that the target process behaves (almost) exactly the same as the case when no sampling is performed. This guarantees that the target process's reliability and behaviors won't get compromised by the profiling tool. They stay exactly the same as when no one is watching, just like taking an X-ray image against a live animal. Traditional Application Performance Management (APM) products may require loading special modules or plugins into the target software, or even bloodily patching or injecting binary code into the target software's executable or process space, severely compromising the stability and correctness of the user systems.

For these reasons, these tools are safe to use in production environments to analyze those really hard problems which cannot be easily reproduced offline.

Compatibility

The Lua-land CPU flame graph sampling tool provided by the OpenResty XRay product supports any OpenResty or Nginx binaries, including those compiled by the users themselves with arbitrary build options, optimized or unoptimized, using the GC64 mode or non-GC64 mode in the bundled LuaJIT library, etc.

OpenResty and Nginx server processes running inside Docker or Kubernetes containers can also be analyzed transparently by OpenResty XRay and perfect Lua-land CPU flame graphs can be rendered without problems.

Our tool can also analyze console-based user Lua programs run by the resty or luajit command-line utilities.

Other types of Lua-land Flame Graphs

As mentioned earlier in this post, flame graphs can be used for visualizing any system resources or metrics, not just CPU time. Naturally we do have other types of Lua-land flame graphs in our OpenResty XRay product, like off-CPU flame graphs, garbage collectable (GC) object size and data reference path flame graphs, new GC object allocation flame graphs, Lua coroutine yielding time flame graphs, file I/O latency flame graphs, and many more. We will cover these different kinds of flame graphs in future articles on our blog site.

Conclusion

In this article we introduce the very useful visualization, Flame Graphs, for profiling software performance. And we have a deep look at one particular type of flame graphs, Lua-land CPU Flame Graphs, for profiling Lua applications running atop OpenResty and Nginx. We investigate several small Lua sample programs using real flame graphs produced by OpenResty XRay to demonstrate the strength of our sampling tools based on dynamic tracing technologies. Finally we look at the performance overhead of sampling and the safety of online uses.

About The Author

Yichun Zhang is the creator of the OpenResty® open source project. He is also the founder and CEO of the OpenResty Inc. company. He contributed a dozen open source Nginx 3rd-party modules, quite some Nginx and LuaJIT core patches, and designed the OpenResty XRay platform.

Translations

We will provide a Chinese translation for this article on blog.openresty.com.cn. We also welcome interested readers to contribute translations in other natural languages as long as the full article is translated without any omissions. We thank them in advance.

We are hiring

We always welcome talented and enthusiastic engineers to join our team at OpenResty Inc. to explore various open source software's internals and build powerful analyzers and visualizers for real world applications built atop the open source software. If you are interested, please send your resume to talents@openresty.com . Thank you!


  1. Similarly any primitive routines belonging to the VM itself won't be blind spots either. So we can profile the VM itself at the same time just fine.
  2. The Linux kernel's uprobes facility would still change some minor memory state inside the target process in a completely safe way (guaranteed by the kernel) and such modifications are completely transparent to the target processes.

OpenResty技术 发布了文章 · 2020-08-25

Memory Fragmentation in OpenResty and Nginx Shared Memory Zones

Memory fragmentation is a common problem in computer systems though many clever algorithms have emerged to tackle it. Memory fragmentation wastes free memory blocks scattered in a memory region: these free blocks cannot be merged as a whole to serve future requests for large memory blocks, nor can they be returned to the operating system for other use 1. This could lead to a phenomenon of memory leaks since the total memory needed to fulfill more and more memory requests for large blocks would grow indefinitely. This kind of indefinite memory usage growth is usually not considered a memory leak in common perception since unused memory blocks are indeed released and marked free; they just cannot be reused (for larger memory block requests) nor returned to the operating system.

The ngx_lua module's lua_shared_dict zones do support inserting arbitrary user data items of an arbitrary length, from a tiny single number to a huge string. If care is not taken, it could easily lead to severe memory fragmentation inside the shm zone and waste a lot of free memory. This article will present a few small and standalone examples to demonstrate this problem and various detailed behaviors. It will use the OpenResty XRay dynamic tracing platform to observe the memory fragmentation directly using vivid visualizations along the way. We will conclude the discussions by introducing the best practices of mitigating memory fragmentation inside shared memory zones.

As with almost all the technical articles in this blog site, we use our OpenResty XRay dynamic tracing product to analyze and visualize the internals of unmodified OpenResty or Nginx servers and applications. Because OpenResty XRay is a noninvasive analyzing platform, we don't need to change anything in the target OpenResty or Nginx processes -- no code injection needed and no special plugins or modules needed to be loaded into the target processes. This makes sure what we see inside the target processes through OpenResty XRay analyzers is exactly like when there are no observers at all.

If you are not already familiar with the memory allocation and usage inside OpenResty or Nginx's shared memory zones, you are encouraged to refer to our previous blog post, "How OpenResty and Nginx Shared Memory Zones Consume RAM".

An empty zone

Let's start with an empty shared memory zone with no user data at all and check the slabs or memory blocks inside it so that we can understand the "baseline":

master_process on;
worker_processes 1;

events {
    worker_connections 1024;
}

http {
    lua_shared_dict cats 40m;

    server {
        listen 8080;
    }
}

Here we allocated a shared memory zone named cats of the total size of 40 MB. We never touched this cats zone in our configuration so it should be empty. But from our previous blog post "How OpenResty and Nginx Shared Memory Zones Consume RAM" we already know that an empty zone still has 160 pre-allocated slabs to serve as the meta data for the slab allocator itself. And the following graph for slab layout in the virtual memory space indeed confirms it:

Slabs layout for empty shm zone cats

As always, this graph is generated automatically by OpenResty XRay to analyze the real process. We can see there are 3 used slabs and more than a hundred free slabs.

Filling entries of similar sizes

Let's add the following location = /t to our ongoing example nginx.conf:

location = /t {
    content_by_lua_block {
        local cats = ngx.shared.cats
        local i = 0
        while true do
            local ok, err = cats:safe_set(i, i)
            if not ok then
                if err == "no memory" then
                    break
                end
                ngx.say("failed to set: ", err)
                return
            end
            i = i + 1
        end
        ngx.say("inserted ", i, " keys.")
    }
}

Here we define a Lua content handler in our location = /t which inserts small key-value pairs into the cats zone until no free memory is available. Because we insert the numbers as both the key and value and the cats zone is small, the key-value pairs inserted should be of very similar sizes. After starting this Nginx server, we query this /t location like this:

$ curl 'localhost:8080/t'
inserted 255 keys.

We can see that we can insert up to 255 such keys into the zone.

We can check the slabs' layout inside that shm zone again:

Slab layout in the full shm zone 'cats'

If we compare this graph with the previous graph for the empty zone case, we can see all the newly added bigger slabs are in red (or in-use). Interestingly, the free slabs in the middle (in green) cannot be reused for the bigger slabs even though they are adjacent to each other. Apparently these preserved free slabs are not automatically merged to form bigger free slabs.

Let's see the size distribution for all these slabs via OpenResty XRay:

Slab size distribution for the full shm zone 'cats'

We can see that almost all the used slabs are of the 128 byte size.

Deleting odd-numbered keys

Now let's try deleting the odd-numbered keys in the shm zone by adding the following Lua snippet right after our existing Lua content handler code:

for j = 1, i, 2 do
    cats:delete(j)
end

After restarting the server and querying /t again, we get the following new slabs' layout graph for the cats shm zone:

Slab layout in odd-key-deleted shm zone 'cats'

Now we have a lot of non-adjacent free blocks which cannot be merged together to serve bigger memory block requests in the future. Let's try adding the following Lua code to the end of the Lua content handler to attempt adding a much bigger entry:

local ok, err = cats:safe_add("Jimmy", string.rep("a", 200))
if not ok then
    ngx.say("failed to add a big entry: ", err)
end

Then we restart the server and query /t:

$ curl 'localhost:8080/t'
inserted 255 keys.
failed to add a big entry: no memory

As expected, the new big entry has a 200 byte string value, so the corresponding slab must be larger than the largest free slab available in the shm zone (which is 128 bytes as we saw earlier). So it is impossible to fulfill this memory block request without forcibly evicting used entries (like what the set() method would do when running out of memory in the zone).
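As a side note, below is a minimal sketch (not part of the original example) of what would happen if we used the set() method for this big entry instead:

-- unlike safe_add(), set() forcibly evicts least-recently-used entries
-- when the zone is full, so this call would succeed at the cost of
-- removing some older keys; the third return value, `forcible`, is
-- true when such evictions happened
local ok, err, forcible = cats:set("Jimmy", string.rep("a", 200))
ngx.say("set ok: ", ok, ", forcible eviction: ", forcible)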

Deleting the keys in the first half

Now let's try something different. Instead of deleting the odd-number keys in the previous section, we delete the keys in the first half of the shm zone by adding the following Lua code:

for j = 0, i / 2 do
    assert(cats:delete(j))
end

After restarting the server and querying /t, we got the following slabs' layout in the virtual memory:

Slabs' layout for a zone with first half of the keys deleted

We can see that the adjacent free slabs have now been automatically merged into 3 big free slabs near the middle of this shm zone. They are actually 3 free memory pages of 4096 bytes each:

Free slab size distribution

These free pages can further form even larger slabs spanning multiple pages.

Now let's try inserting the big entry which we failed to insert in the previous section:

local ok, err = cats:safe_add("Jimmy", string.rep("a", 200))
if not ok then
    ngx.say("failed to add a big entry: ", err)
else
    ngx.say("succeeded in inserting the big entry.")
end

This time it succeeds because we have plenty of contiguous free space to accommodate this key-value pair:

$ curl 'localhost:8080/t'
inserted 255 keys.
succeeded in inserting the big entry.

Now the new slabs' layout indeed has this new entry:

Slabs in a zone with the big entry inserted

Please note the longest red block in the first half of the graph. That is our "big entry". The size distribution chart for used slabs can make it even clearer:

Size distribution for slabs with a big slab

We can see that our "big entry" is actually a slab of 512 bytes (including the key size, value size, and memory padding and address alignment overhead).

Mitigating Fragmentation

In previous sections, we have already seen that scattered small free slabs can cause fragmentation problems in a shared memory zone, which makes future memory block requests of larger sizes impossible to fulfill even though the total sum of all these free slabs is much bigger. To allow reuse of these free slabs, we would recommend the following two ways:

  1. Always use similarly sized data entries so that there won't be the problem of accommodating future larger memory block requests in the first place.
  2. Make deleted entries adjacent to each other so that they can be merged into larger free slabs.

For 1), we can divide a single monolithic zone into several zones for different entry size groups2. For example, we can have a zone dedicated for data entries of the 0 ~ 128 byte size range only, and another for the 128 ~ 256 byte range.
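As an illustration of 1), a small Lua sketch might look like the following (the zone names and the 128-byte threshold here are made up for this example):

-- assumes nginx.conf declares two zones, e.g.:
--   lua_shared_dict small_entries 10m;
--   lua_shared_dict big_entries   30m;
local small = ngx.shared.small_entries
local big = ngx.shared.big_entries

local function set_entry(key, value, exptime)
    -- route each entry to the zone dedicated to its size class
    if #tostring(value) <= 128 then
        return small:safe_set(key, value, exptime)
    end
    return big:safe_set(key, value, exptime)
end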

For 2), we can group entries by their expiration time. Short-lived entries can live in a dedicated zone while long-lived entries live in another zone. This helps entries expire at a similar pace, increasing the chance of getting expired and eventually deleted at the same time.

Conclusion

Memory fragmentation inside OpenResty or Nginx's shared memory zones can be quite hard to notice or troubleshoot. Fortunately, OpenResty XRay provides powerful observability and visualizations to see and diagnose such problems quickly. We presented several small examples to demonstrate the memory fragmentation issue and ways to work around it, using OpenResty XRay's graphs and data to explain what is happening under the hood. We finally introduce best practices when working with shared memory zones in general configurations and programming based on OpenResty or Nginx.

Further Readings

About The Author

Yichun Zhang is the creator of the OpenResty® open source project. He is also the founder and CEO of the OpenResty Inc. company. He contributed a dozen open source Nginx 3rd-party modules, quite some Nginx and LuaJIT core patches, and designed the OpenResty XRay platform.

Translations

We provide the Chinese translation for this article on blog.openresty.com.cn. We also welcome interested readers to contribute translations in other natural languages as long as the full article is translated without any omissions. We thank them in advance.

We are hiring

We always welcome talented and enthusiastic engineers to join our team at OpenResty Inc. to explore various open source software's internals and build powerful analyzers and visualizers for real world applications built atop the open source software. If you are interested, please send your resume to talents@openresty.com. Thank you!


  1. For OpenResty and Nginx's shared memory zones, the allocated and accessed memory pages can never be returned to the operating system until all the current nginx processes quit. Nevertheless, released memory pages or memory slabs can still be reused for future memory requests inside the shm zone. In OpenResty or Nginx's shared memory zones, memory fragmentation may also happen when the requested memory slabs or blocks are of varying sizes. Many standard Nginx directives involve shared memory zones, like ssl_session_cache, proxy_cache_path, limit_req_zone, limit_conn_zone, and upstream's zone. Nginx's 3rd-party modules may use shared memory zones as well, like one of OpenResty's core components, the ngx_lua module. For those shm zones with equal-sized data entries, there won't be any possibility of memory fragmentation issues.
  2. Interestingly, the Linux kernel's buddy allocator and Memcached's allocator use a similar strategy.

OpenResty技术 发布了文章 · 2020-08-24

OpenResty 与 Nginx 共享内存区的内存碎片问题

内存碎片是计算机系统中的一个常见问题,尽管已经存在许多解决这个问题的巧妙算法。内存碎片会浪费内存区中空闲的内存块。这些空闲的内存块无法合并成更大的内存区以满足应用未来对较大内存块的申请,也无法重新释放到操作系统用于其他用途1。这会导致内存泄露现象,因为对大块的内存申请越来越多,满足这些请求所需要的总内存大小会无限增加。这种内存使用量的无限增加通常不会被视为内存泄漏,因为未使用的内存块实际上被释放并标记为空闲内存,只是这些内存块无法被重新用于满足更大的内存块申请,同时也无法重新被返还给操作系统供其他进程使用。

在 OpenResty 或 Nginx 的共享内存区内,如果被申请的内存 slab 或内存块的大小差别很大,也容易出现内存碎片问题。例如,ngx_lua 模块的 lua_shared_dict 区域支持写入任意长度的任意用户数据项。这些数据项可以小到一个数值,也可以大到很长的字符串。如果不注意的话,用户的 Lua 程序很容易在共享内存区中产生严重的内存碎片问题,浪费大量的空闲内存。本文中将列举几个独立的例子,来详细演示这个问题以及相关行为细节。本文将使用 OpenResty XRay 动态追踪平台产品,直接通过生动的可视化方法来观察内存碎片。我们将在文章结尾中介绍减少共享内存区内存碎片问题的最佳实践。

与本 博客网站 中的几乎所有技术类文章类似,我们使用 OpenResty XRay 这款 动态追踪 产品对未经修改的 OpenResty 或 Nginx 服务器和应用的内部进行深度分析和可视化呈现。因为 OpenResty XRay 是一个非侵入性的分析平台,所以我们不需要对 OpenResty 或 Nginx 的目标进程做任何修改——不需要代码注入,也不需要在目标进程中加载特殊插件或模块。这样可以保证我们通过 OpenResty XRay 分析工具所看到的目标进程内部状态,与没有观察者时的状态是完全一致的。

如果您还不熟悉 OpenResty 或 Nginx 共享内存区内的内存分配和使用,请参阅上一篇博客文章《OpenResty 和 Nginx 的共享内存区是如何消耗物理内存的》

空的共享内存区

首先,我们以一个没有写入用户数据的空的共享内存区为例,通过查看这个内存区内部的 slab 或内存块,了解“基准数据”:

master_process on;
worker_processes 1;

events {
    worker_connections 1024;
}

http {
    lua_shared_dict cats 40m;

    server {
        listen 8080;
    }
}

我们在这里配置了一个 40MB 的共享内存区,名为 cats。我们在配置里完全没有触及这个 cats 区,所以这个区是空的。但从前一篇博客文章 《OpenResty 和 Nginx 的共享内存区是如何消耗物理内存的》 中,我们已经知道空的共享内存区依旧有 160 个预分配的 slab,用于存放 slab 分配器使用的元数据。下面这张虚拟内存空间内 slab 的分布图也证实了这一点:

共享内存区 cats 中的 slab 布局

这个图同样由 OpenResty XRay 自动生成,用于分析实际运行中的进程。从图中可以看到,有 3 个已经使用的 slab,还有 100 多个空闲 slab。

填充类似大小的条目

我们在上面的例子 nginx.conf 中增加下列指令 location = /t

location = /t {
    content_by_lua_block {
        local cats = ngx.shared.cats
        local i = 0
        while true do
            local ok, err = cats:safe_set(i, i)
            if not ok then
                if err == "no memory" then
                    break
                end
                ngx.say("failed to set: ", err)
                return
            end
            i = i + 1
        end
        ngx.say("inserted ", i, " keys.")
    }
}

这里我们在 location = /t 中定义了一个 Lua 内容处理程序,在 cats 区中写入很小的键值对,直到没有空闲的内存区为止。因为我们写入的键和值都是数值(number),而 cats 共享内存区又很小,所以在这个区中写入的各个键值对的大小应该都是彼此很接近的。在启动 Nginx 服务器进程后,我们使用下面的命令查询这个 /t 位置:

$ curl 'localhost:8080/t'
inserted 255 keys.

从响应体结果中可以看到,我们在这个共享内存区中可以写入了 255 个键。

我们可以再次查看这个共享内存区内的 slab 布局:

被填满的共享内存区'cats'中的slab布局

如果将这个图与上文那个空的共享内存区分布图进行对比,我们可以看到新增的那些更大的 slab 都是红色的(即表示已被使用)。有趣的是,中间那些空闲 slab (绿色)无法被更大的 slab 重新使用,尽管两者彼此接近。显然,这些原本预留的空 slab 并不会被自动合并成更大的空闲 slab。

我们再通过 OpenResty XRay 来查看这些 slab 的大小分布情况:

被填满的共享内存区 'cats' 中的slab大小分布

我们看到几乎所有已使用的 slab 的大小都是 128 个字节。

删除奇数键

接下来,我们尝试在现有的 Lua 处理程序后面追增下面这个 Lua 代码片段,以删除共享内存区内的奇数键:

for j = 1, i, 2 do
    cats:delete(j)
end

重启服务器后,再次查询 /t,我们得到了新的 cats 共享内存区 slab 布局图:

删除奇数键后的共享内存区 'cats' 中的 slab 布局

现在,我们有许多不相邻的空闲内存块,而这些内存块无法合并成大块,从而无法满足未来大块内存的申请。我们尝试在 Lua 处理程序之后再添加下面的 Lua 代码,以插入一个更大的键值对条目:

local ok, err = cats:safe_add("Jimmy", string.rep("a", 200))
if not ok then
    ngx.say("failed to add a big entry: ", err)
end

然后我们重启服务器进程,并查询 /t

$ curl 'localhost:8080/t'
inserted 255 keys.
failed to add a big entry: no memory

如我们所预期的,新添加的大条目有一个 200 个字节的字符串值,所以相对应的 slab 必然大于共享内存区中可用的最大的空闲 slab(即我们前面看到的 128 字节)。所以如果不强行驱除某些已使用的键值对条目(例如在内存区中内存不足时,set() 方法自动删除最冷的已使用条目的行为),则无法满足这个内存块请求。

删除前半部分的键

接下来我们来做一个不同的试验。我们不删除上文中的奇数键,而是改为添加下列 Lua 代码以删除共享内存区前半部分里的那些键:

for j = 0, i / 2 do
    assert(cats:delete(j))
end

重启服务器进程并查询 /t 之后,我们得到了下面这个虚拟内存空间里的 slab 布局:

内存区中删除键的前半部分之后的 Slab 布局

可以看到在这个共享内存区的中间位置附近,那些相邻的空闲 slab 被自动合并成了 3 个较大的空闲 slab。实际上,这是 3 个空闲内存页,每个内存页的大小为 4096 个字节:

空闲 slab 大小分布

这些空闲内存页可以进一步形成跨越多个内存页的更大的 slab。

下面,我们再次尝试写入上文中插入失败的大条目:

local ok, err = cats:safe_add("Jimmy", string.rep("a", 200))
if not ok then
    ngx.say("failed to add a big entry: ", err)
else
    ngx.say("succeeded in inserting the big entry.")
end

这一回,我们终于成功插入了,因为我们有很大的连续空闲空间,足以容纳这个键值对:

$ curl 'localhost:8080/t'
inserted 255 keys.
succeeded in inserting the big entry.

现在,新的 slab 分布图里已经明显可以看到这个新的条目了:

内存区中写入大的条目后的 slab

请注意图中前半部分里的那条最狭长的红色方块。那就是我们的“大条目”。
我们从已使用的 slab 的大小分布图中可以看得更清楚些:

有大 slab 的 slab 大小分布

从图中可以看出,“大条目”实际上是一个 512 个字节的 slab(包含了键大小、值大小、内存补齐和地址对齐的开销)。

缓解内存碎片

在上文中我们已经看到,分散在共享内存区内的小空闲 slab 容易产生内存碎片问题,导致未来大块内存的申请无法被满足,即使所有空闲 slab 加起来的总大小还要大得多。我们推荐下面两种方法,可更好地重新利用这些空闲的 slab 空间:

  1. 始终使用大小接近的数据条目,这样就不再存在需要满足大得多的内存块申请的问题了。
  2. 让被删除的数据条目邻近,以方便这些条目被自动合并成更大的空闲 slab 块。

对于方法 1),我们可以把一个统一的共享内存区人为分割成多个针对不同数据条目大小分组的共享内存区2。例如,我们可以有一个共享内存区只存放大小为 0 ~ 128 个字节的条目,而另一个共享内存区只存放大小为 128 ~ 256 个字节的条目。

而对于方法 2)我们可以按照条目的过期时间进行分组。比如过期时间较短的数据条目,可以集中存放在一个专门的共享内存区,而过期时间较长的条目则可以存放在另一个共享内存区。这样我们可以保证同一个共享内存区内的条目都以类似的速度到期,从而提高条目同时到期,以及同时被删除和合并的几率。
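下面是方法 2) 的一个简单示意(这里的区名和大小都只是为了演示而虚构的):

# 仅为示意:按条目过期时间的长短,拆分成两个独立的共享内存区
lua_shared_dict short_lived_items 20m;   # 存放过期时间较短的条目
lua_shared_dict long_lived_items  20m;   # 存放过期时间较长的条目

Lua 代码在写入数据时,可以根据传给 set() 这类方法的过期时间参数,决定写入哪一个区。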

结论

OpenResty 或 Nginx 共享内存区内的内存碎片问题在缺少工具的情况下,还是很难被观察和调试的。幸运的是,OpenResty XRay 提供了强有力的可观察性和可视化呈现,能够迅速地发现和诊断问题。本文中我们通过一系列小例子,使用 OpenResty XRay 自动生成的图表和数据,揭示了背后到底发生了什么,演示了内存碎片问题以及缓解这个问题的方法。最后,我们介绍了基于 OpenResty 或 Nginx 使用一般配置和编程的共享内存区的最佳实践。

延伸阅读

关于作者

章亦春是开源项目 OpenResty® 的创始人,同时也是 OpenResty Inc. 公司的创始人和 CEO。他贡献了许多 Nginx 的第三方模块,相当多 Nginx 和 LuaJIT 核心补丁,并且设计了 OpenResty XRay 等产品。

关注我们

如果您觉得本文有价值,非常欢迎关注我们 OpenResty Inc. 公司的博客网站。也欢迎扫码关注我们的微信公众号:

我们的微信公众号

翻译

我们提供了英文版原文和中译版(本文)。我们也欢迎读者提供其他语言的翻译版本,只要是全文翻译不带省略,我们都将会考虑采用,非常感谢!


  1. 对于 OpenResty 和 Nginx 的共享内存区而言,分配和访问过的内存页在进程退出之前都永远不会再返还给操作系统。当然,释放的内存页和内存 slab 仍然可以在共享内存区内被重新使用。
  2. 有趣的是,Linux 内核的 Buddy 内存分配器以及 Memcached 的分配器也使用了类似的策略。

OpenResty技术 发布了文章 · 2020-08-12

OpenResty 和 Nginx 的共享内存区是如何消耗物理内存的

OpenResty 和 Nginx 服务器通常会配置共享内存区,用于储存在所有工作进程之间共享的数据。例如,Nginx 标准模块 ngx_http_limit_reqngx_http_limit_conn 使用共享内存区储存状态数据,以限制所有工作进程中的用户请求速率和用户请求的并发度。OpenResty 的 ngx_lua 模块通过 lua_shared_dict,向用户 Lua 代码提供基于共享内存的数据字典存储。

本文通过几个简单和独立的例子,探讨这些共享内存区如何使用物理内存资源(或 RAM)。我们还会探讨共享内存的使用率对系统层面的进程内存指标的影响,例如在 ps 等系统工具的结果中的 VSZ 和 RSS 等指标。

与本博客网站 中的几乎所有技术类文章类似,我们使用 OpenResty XRay 这款动态追踪产品对未经修改的 OpenResty 或 Nginx 服务器和应用的内部进行深度分析和可视化呈现。因为 OpenResty XRay 是一个非侵入性的分析平台,所以我们不需要对 OpenResty 或 Nginx 的目标进程做任何修改 -- 不需要代码注入,也不需要在目标进程中加载特殊插件或模块。这样可以保证我们通过 OpenResty XRay 分析工具所看到的目标进程内部状态,与没有观察者时的状态是完全一致的。

我们将在多数示例中使用 ngx_lua 模块的 lua_shared_dict,因为该模块可以使用自定义的 Lua 代码进行编程。我们在这些示例中展示的行为和问题,也同样适用于所有标准 Nginx 模块和第三方模块中的其他共享内存区。

Slab 与内存页

Nginx 及其模块通常使用 Nginx 核心里的 slab 分配器 来管理共享内存区内的空间。这个 slab 分配器专门用于在固定大小的内存区内分配和释放较小的内存块。

在 slab 的基础之上,共享内存区会引入更高层面的数据结构,例如红黑树和链表等等。

slab 可能小至几个字节,也可能大至跨越多个内存页。

操作系统以内存页为单位来管理进程的共享内存(或其他种类的内存)。在 x86_64 Linux 系统中,默认的内存页大小通常是 4 KB,但具体大小取决于体系结构和 Linux 内核的配置。例如,某些 Aarch64 Linux 系统的内存页大小高达 64 KB。
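顺便一提,如果想确认当前系统实际的内存页大小,可以直接运行下面这个命令(getconf 是大多数 Linux 发行版都自带的工具;在典型的 x86_64 系统上,输出通常是 4096):

$ getconf PAGESIZE
4096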

我们将会看到 OpenResty 和 Nginx 进程的共享内存区,分别在内存页层面和 slab 层面上的细节信息。

分配的内存不一定有消耗

与硬盘这样的资源不同,物理内存(或 RAM)总是一种非常宝贵的资源。
大部分现代操作系统都实现了一种优化技术,叫做 按需分页(demand-paging),用于减少用户应用对 RAM 资源的压力。具体来说,就是当你分配大块的内存时,操作系统核心会将 RAM 资源(或物理内存页)的实际分配推迟到内存页里的数据被实际使用的时候。例如,如果用户进程分配了 10 个内存页,但却只使用了 3 个内存页,则操作系统可能只把这 3 个内存页映射到了 RAM 设备。这种行为同样适用于 Nginx 或 OpenResty 应用中分配的共享内存区。用户可以在 nginx.conf 文件中配置庞大的共享内存区,但他可能会注意到在服务器启动之后,几乎没有额外占用多少内存,毕竟通常在刚启动的时候,几乎没有共享内存页被实际使用到。

空的共享内存区

我们以下面这个 nginx.conf 文件为例。该文件分配了一个空的共享内存区,并且从没有使用过它:

master_process on;
worker_processes 2;

events {
    worker_connections 1024;
}

http {
    lua_shared_dict dogs 100m;

    server {
        listen 8080;

        location = /t {
            return 200 "hello world\n";
        }
    }
}

我们通过 lua_shared_dict 指令配置了一个 100 MB 的共享内存区,名为 dogs。并且我们为这个服务器配置了 2 个工作进程。请注意,我们在配置里从没有触及这个 dogs 区,所以这个区是空的。

可以通过下列命令启动这个服务器:

mkdir ~/work/
cd ~/work/
mkdir logs/ conf/
vim conf/nginx.conf  # paste the nginx.conf sample above here
/usr/local/openresty/nginx/sbin/nginx -p $PWD/

然后用下列命令查看 nginx 进程是否已在运行:

$ ps aux|head -n1; ps aux|grep nginx
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
agentzh   9359  0.0  0.0 137508  1576 ?        Ss   09:10   0:00 nginx: master process /usr/local/openresty/nginx/sbin/nginx -p /home/agentzh/work/
agentzh   9360  0.0  0.0 137968  1924 ?        S    09:10   0:00 nginx: worker process
agentzh   9361  0.0  0.0 137968  1920 ?        S    09:10   0:00 nginx: worker process

这两个工作进程占用的内存大小很接近。下面我们重点研究 PID 为 9360 的这个工作进程。在 OpenResty XRay 控制台的 Web 图形界面中,我们可以看到这个进程一共占用了 134.73 MB 的虚拟内存(virtual memory)和 1.88 MB 的常驻内存(resident memory),这与上文中的 ps 命令输出的结果完全相同:

空的共享内存区的虚拟内存使用量明细

正如我们的另一篇文章 《OpenResty 和 Nginx 如何分配和管理内存》中所介绍的,我们最关心的就是常驻内存的使用量。常驻内存将硬件资源实际映射到相应的内存页(如 RAM 1)。所以我们从图中看到,实际映射到硬件资源的内存量很少,总计只有 1.88MB。上文配置的 100 MB 的共享内存区在这个常驻内存当中只占很小的一部分(详情请见后续的讨论)。

当然,共享内存区的这 100 MB 还是全部贡献到了该进程的虚拟内存总量中去了。操作系统会为这个共享内存区预留出虚拟内存的地址空间,不过,这只是一种簿记记录,此时并不占用任何的 RAM 资源或其他硬件资源。

不是空无一物

我们可以通过该进程的“应用层面的内存使用量的分类明细”图,来检查空的共享内存区是否占用了常驻(或物理)内存。

应用层面内存使用量明细

有趣的是,我们在这个图中看到了一个非零的 Nginx Shm Loaded (已加载的 Nginx 共享内存)组分。这部分很小,只有 612 KB,但还是出现了。所以空的共享内存区也并非空无一物。这是因为 Nginx 已经在新初始化的共享内存区域中放置了一些元数据,用于簿记目的。这些元数据为 Nginx 的 slab 分配器所使用。

已加载和未加载内存页

我们可以通过 OpenResty XRay 自动生成的下列图表,查看共享内存区内被实际使用(或加载)的内存页数量。

共享内存区域内已加载和未加载的内存页

我们发现在 dogs 区域中已经加载(或实际使用)的内存大小为 608 KB,同时有一个特殊的 ngx_accept_mutex_ptr 被 Nginx 核心自动分配用于 accept_mutex 功能。

这两部分内存的大小相加为 612 KB,正是上文的饼状图中显示的 Nginx Shm Loaded 的大小。

如前文所述,dogs 区使用的 608 KB 内存实际上是 slab 分配器 使用的元数据。

未加载的内存页只是被保留的虚拟内存地址空间,并没有被使用过。

关于进程的页表

我们没有提及的一种复杂性是,每一个 nginx 工作进程其实都有各自的页表。CPU 硬件或操作系统内核正是通过查询这些页表来查找虚拟内存页所对应的存储。因此每个进程在不同共享内存区内可能有不同的已加载页集合,因为每个进程在运行过程中可能访问过不同的内存页集合。为了简化这里的分析,OpenResty XRay 会显示所有的为任意一个工作进程加载过的内存页,即使当前的目标工作进程从未碰触过这些内存页。也正因为这个原因,已加载内存页的总大小可能(略微)高于目标进程的常驻内存的大小。

空闲的和已使用的 slab

如上文所述,Nginx 通常使用 slabs 而不是内存页来管理共享内存区内的空间。我们可以通过 OpenResty XRay 直接查看某一个共享内存区内已使用的和空闲的(或未使用的)slabs 的统计信息:

dogs区域中空的和已使用的slab

如我们所预期的,我们这个例子里的大部分 slabs 是空闲的未被使用的。注意,这里的内存大小的数字远小于上一节中所示的内存页层面的统计数字。这是因为 slabs 层面的抽象层次更高,并不包含 slab 分配器针对内存页的大小补齐和地址对齐的内存消耗。

我们可以通过OpenResty XRay进一步观察在这个 dogs 区域中各个 slab 的大小分布情况:

空白区域的已使用 slab 大小分布

空的 slab 大小分布

我们可以看到这个空的共享内存区里,仍然有 3 个已使用的 slab 和 157 个空闲的 slab。这些 slab 的总个数为:3 + 157 = 160个。请记住这个数字,我们会在下文中跟写入了一些用户数据的 dogs 区里的情况进行对比。

写入了用户数据的共享内存区

下面我们会修改之前的配置示例,在 Nginx 服务器启动时主动写入一些数据。具体做法是,我们在 nginx.conf 文件的 http {} 配置块中增加下面这条 init_by_lua_block 配置指令:

init_by_lua_block {
    for i = 1, 300000 do
        ngx.shared.dogs:set("key" .. i, i)
    end
}

这里在服务器启动的时候,主动对 dogs 共享内存区进行了初始化,写入了 300,000 个键值对。

然后运行下列的 shell 命令以重新启动服务器进程:

kill -QUIT `cat logs/nginx.pid`
/usr/local/openresty/nginx/sbin/nginx -p $PWD/

新启动的 Nginx 进程如下所示:

$ ps aux|head -n1; ps aux|grep nginx
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
agentzh  29733  0.0  0.0 137508  1420 ?        Ss   13:50   0:00 nginx: master process /usr/local/openresty/nginx/sbin/nginx -p /home/agentzh/work/
agentzh  29734 32.0  0.5 138544 41168 ?        S    13:50   0:00 nginx: worker process
agentzh  29735 32.0  0.5 138544 41044 ?        S    13:50   0:00 nginx: worker process

虚拟内存与常驻内存

针对 Nginx 工作进程 29735,OpenResty XRay 生成了下面这张饼图:

非空白区域的虚拟内存使用量明细

显然,常驻内存的大小远高于之前那个空的共享区的例子,而且在总的虚拟内存大小中所占的比例也更大(29.6%)。

虚拟内存的使用量也略有增加(从 134.73 MB 增加到了 135.30 MB)。因为共享内存区本身的大小没有变化,所以共享内存区对于虚拟内存使用量的增加其实并没有影响。这里略微增大的原因是我们通过 init_by_lua_block 指令新引入了一些 Lua 代码(这部分微小的内存也同时贡献到了常驻内存中去了)。

应用层面的内存使用量明细显示,Nginx 共享内存区域的已加载内存占用了最多常驻内存:

dogs 区域内已加载和未加载的内存页

已加载和未加载内存页

现在在这个 dogs 共享内存区里,已加载的内存页多了很多,而未加载的内存页也有了相应的显著减少:

dogs区域中的已加载和未加载内存页

空的和已使用的 slab

现在 dogs 共享内存区增加了 300,000 个已使用的 slab(除了空的共享内存区中那 3 个总是会预分配的 slab 以外):

dogs非空白区域中的已使用slab

显然,lua_shared_dict 区中的每一个键值对,其实都直接对应一个 slab。

空闲 slab 的数量与先前在空的共享内存区中的数量是完全相同的,即 157 个 slab:

dogs非空白区域的空slab

虚假的内存泄漏

正如我们上面所演示的,共享内存区在应用实际访问其内部的内存页之前,都不会实际耗费物理内存资源。因为这个原因,用户可能会观察到 Nginx 工作进程的常驻内存大小似乎会持续地增长,特别是在进程刚启动之后。这会让用户误以为存在内存泄漏。下面这张图展示了这样的一个例子:

process memory growing

通过查看 OpenResty XRay 生成的应用级别的内存使用明细图,我们可以清楚地看到 Nginx 的共享内存区域其实占用了绝大部分的常驻内存空间:

Memory usage breakdown for huge shm zones

这种内存增长是暂时的,会在共享内存区被填满时停止。但是当用户把共享内存区配置得特别大,大到超出当前系统中可用的物理内存的时候,仍然是有潜在风险的。正因为如此,我们应该注意观察如下所示的内存页级别的内存使用量的柱状图:

Loaded and unloaded memory pages in shared memory zones

图中蓝色的部分可能最终会被进程用尽(即变为红色),而对当前系统产生冲击。

HUP 重新加载

Nginx 支持通过 HUP 信号来重新加载服务器的配置而不用退出它的 master 进程(worker 进程仍然会优雅退出并重启)。通常 Nginx 共享内存区会在 HUP 重新加载(HUP reload)之后自动继承原有的数据。所以原先为已访问过的共享内存页分配的那些物理内存页也会保留下来。于是想通过 HUP 重新加载来释放共享内存区内的常驻内存空间的尝试是会失败的。用户应改用 Nginx 的重启或二进制升级操作。
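作为参考,HUP 重新加载通常可以通过类似下面的命令来触发(与前文用 kill -QUIT 停止服务器的写法类似,假设仍在 ~/work 目录下执行):

# 向 master 进程发送 HUP 信号以重新加载配置;共享内存区中已有的数据会被保留
kill -HUP `cat logs/nginx.pid`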

值得提醒的是,某一个 Nginx 模块还是有权决定是否在 HUP 重新加载后保持原有的数据。所以可能会有例外。

结论

我们在上文中已经解释了 Nginx 的共享内存区所占用的物理内存资源,可能远少于 nginx.conf 文件中配置的大小。这要归功于现代操作系统中的按需分页特性。我们演示了空的共享内存区内依然会使用到一些内存页和 slab,以用于存储 slab 分配器本身需要的元数据。通过 OpenResty XRay 的高级分析器,我们可以实时检查运行中的 nginx 工作进程,查看其中的共享内存区实际使用或加载的内存,包括内存页和 slab 这两个不同层面。

另一方面,按需分页的优化也会产生内存在某段时间内持续增长的现象。这其实并不是内存泄漏,但仍然具有一定的风险。我们也解释了 Nginx 的 HUP 重新加载操作通常并不会清空共享内存区里已有的数据。

我们将在本博客网站后续的文章中,继续探讨共享内存区中使用的高级数据结构,例如红黑树和队列,以及如何分析和缓解共享内存区内的内存碎片的问题。

关于作者

章亦春是开源项目 OpenResty® 的创始人,同时也是 OpenResty Inc. 公司的创始人和 CEO。他贡献了许多 Nginx 的第三方模块,相当多 Nginx 和 LuaJIT 核心补丁,并且设计了 OpenResty XRay 等产品。

关注我们

如果您觉得本文有价值,非常欢迎关注我们 OpenResty Inc. 公司的博客网站 。也欢迎扫码关注我们的微信公众号:

我们的微信公众号

翻译

我们提供了英文版原文和中译版(本文) 。我们也欢迎读者提供其他语言的翻译版本,只要是全文翻译不带省略,我们都将会考虑采用,非常感谢!


  1. 当发生交换(swapping)时,一些常驻内存会被保存和映射到硬盘设备上去。

OpenResty技术 published an article · 2020-08-11

How OpenResty and Nginx Shared Memory Zones Consume RAM

OpenResty and Nginx servers are often configured with shared memory zones which can hold data that is shared among all their worker processes. For example, Nginx's standard modules ngx_http_limit_req and ngx_http_limit_conn use shared memory zones to hold state data to limit the client request rate and the concurrency level of client requests across all the worker processes. OpenResty's ngx_lua module provides lua_shared_dict, which offers shared memory dictionary storage for user Lua code.

In this article, we will explore how these shared memory zones consume physical memory (or RAM) through several minimal and self-contained examples. We will also examine how the shared memory utilization affects system-level process memory metrics like VSZ and RSS as seen in the output of system utilities like ps. And finally, we will discuss the "fake memory leak" issues caused by the on-demand usage nature of the shared memory zones as well as the effect of Nginx's HUP reload operation.

As with almost all the technical articles in this blog site, we use our OpenResty XRay dynamic tracing product to analyze and visualize the internals of unmodified OpenResty or Nginx servers and applications. Because OpenResty XRay is a noninvasive analyzing platform, we don't need to change anything in the target OpenResty or Nginx processes -- no code injection is needed and no special plugins or modules need to be loaded into the target processes. This makes sure that what we see inside the target processes through OpenResty XRay analyzers is exactly what happens when there are no observers at all.

We would like to use the ngx_lua module's lua_shared_dict in most of the examples below since it is programmable with custom Lua code. The behaviors and issues we demonstrate in these examples also apply to any other shared memory zones found in the standard Nginx modules and in third-party ones.

Slabs and pages

Nginx and its modules usually use the slab allocator implemented by the Nginx core to manage the memory storage inside a shared memory zone. The slab allocator is designed specifically for allocating and deallocating small memory chunks inside a fixed-size memory region. On top of the slabs, the shared memory zones may introduce higher-level data structures like red-black trees and linked lists. A slab can be as small as a few bytes or as large as spanning multiple memory pages.

The operating system manages the shared memory (or any other kinds of memory) by pages. On x86_64 Linux, the default page size is usually 4 KB but it can be different depending on the architecture and Linux kernel configurations. For example, some Aarch64 Linux systems have a page size of 64 KB.
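
If you want to check the page size on a particular box from within OpenResty itself, a minimal LuaJIT FFI sketch like the one below works (this assumes LuaJIT, as shipped with OpenResty; getpagesize() is the classic libc call, and running getconf PAGE_SIZE on the command line gives the same answer):

local ffi = require "ffi"

ffi.cdef[[
int getpagesize(void);
]]

-- print the memory page size used by the kernel on this system, in bytes
print("page size: " .. ffi.C.getpagesize() .. " bytes")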

We shall see detailed memory page level and slab level statistics for shared memory zones in real OpenResty and Nginx processes.

What is allocated is not what is paid for

When compared with disks, physical memory (or RAM) is always a very precious resource. Most modern operating systems employ demand paging as an optimization trick to reduce the stress of user applications on the RAM. Basically, when you allocate a large chunk of memory, the operating system kernel defers the actual assignment of the RAM resources (or physical memory pages) to the point where these memory pages' content is actually used. For example, if the user process allocates 10 pages of memory and only ever uses 3 pages, then the operating system may only assign these 3 pages to the RAM device. The same applies to the shared memory zones allocated in an Nginx or OpenResty application. The user may configure huge shared memory zones in the nginx.conf file but she may notice that the server takes almost no extra memory immediately after starting up, because very few of the shared memory pages are actually used.
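
The effect is easy to reproduce outside of Nginx as well. The following standalone LuaJIT sketch is Linux-only (the mmap constants below are Linux values and a 4 KB page size is assumed); it maps 100 MB of anonymous memory and then touches just a few pages, and the resident memory read from /proc/self/statm barely moves after the mapping, growing only by the pages actually written:

local ffi = require "ffi"

ffi.cdef[[
void *mmap(void *addr, size_t length, int prot, int flags, int fd, long offset);
]]

-- Linux-specific constants; values differ on other operating systems
local PROT_READ, PROT_WRITE = 0x1, 0x2
local MAP_PRIVATE, MAP_ANONYMOUS = 0x02, 0x20
local PAGE_SIZE = 4096  -- assuming 4 KB pages

local function resident_kb()
    -- the 2nd field of /proc/self/statm is the resident page count
    local f = assert(io.open("/proc/self/statm", "r"))
    local line = f:read("*l")
    f:close()
    local pages = tonumber(line:match("^%d+%s+(%d+)"))
    return pages * PAGE_SIZE / 1024
end

local size = 100 * 1024 * 1024  -- "allocate" 100 MB of anonymous memory
local p = ffi.cast("char *",
                   ffi.C.mmap(nil, size, PROT_READ + PROT_WRITE,
                              MAP_PRIVATE + MAP_ANONYMOUS, -1, 0))
-- (error checking against MAP_FAILED omitted for brevity)

print("RSS right after mmap:   " .. resident_kb() .. " KB")

-- touch only 3 of the 25,600 pages we just "allocated"
for i = 0, 2 do
    p[i * PAGE_SIZE] = 1
end

print("RSS after 3 page writes: " .. resident_kb() .. " KB")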

Empty zones

Let's consider the following sample nginx.conf file which allocates an empty shared memory zone that is never used:

master_process on;
worker_processes 2;

events {
    worker_connections 1024;
}

http {
    lua_shared_dict dogs 100m;

    server {
        listen 8080;

        location = /t {
            return 200 "hello world\n";
        }
    }
}

Here we configure a 100 MB shared memory zone named dogs via the lua_shared_dict directive. And 2 worker processes are configured for this server. Please note that we never touch this dogs zone in this configuration, so the zone should be empty.

Let's start this server like below:

mkdir ~/work/
cd ~/work/
mkdir logs/ conf/
vim conf/nginx.conf  # paste the nginx.conf sample above here
/usr/local/openresty/nginx/sbin/nginx -p $PWD/

We can check if the nginx processes are already running like this:

$ ps aux|head -n1; ps aux|grep nginx
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
agentzh   9359  0.0  0.0 137508  1576 ?        Ss   09:10   0:00 nginx: master process /usr/local/openresty/nginx/sbin/nginx -p /home/agentzh/work/
agentzh   9360  0.0  0.0 137968  1924 ?        S    09:10   0:00 nginx: worker process
agentzh   9361  0.0  0.0 137968  1920 ?        S    09:10   0:00 nginx: worker process

The worker processes take similarly sized memory. Let's focus on the worker process with the PID 9360 from now on. In OpenResty XRay console's web UI, we can see this process takes a total of 134.73 MB of virtual memory and 1.88 MB of resident memory (matching the output of the ps command shown above):

Virtual memory usage breakdown for an empty zone

As we already discussed in the other article, How OpenResty and Nginx Allocate Memory, what really matters is the resident memory usage, which actually maps hardware resources to the corresponding memory pages (like RAM [1]). Therefore, very little memory is actually assigned hardware resources, just 1.88 MB in total. The 100 MB shared memory zone we configured above makes up only a very small part of this resident memory (as we will see in detail below). It does, however, add its full 100 MB to the virtual memory size of the process. The operating system reserves the virtual memory address space for this shared memory zone, but that is just a bookkeeping record which does not take up any RAM or other hardware resources at all.

Empty is not empty

To check if this empty shared memory zone takes up any resident (or physical) memory at all, we can refer to the "Application-Level Memory Usage Breakdown" chart for this process below:

Application-Level Memory Usage Breakdown

Interestingly, we see a nonzero Nginx Shm Loaded component in this pie chart. It is a tiny portion, just 612 KB. So an empty shared memory zone is not completely empty. This is because Nginx always stores some meta data for book-keeping purposes in any newly initialized shared memory zone. Such meta data is used by Nginx's slab allocator.

Loaded and unloaded pages

We can check out how many memory pages are actually used (or loaded) inside all the shared memory zones by looking at the following chart produced automatically by OpenResty XRay:

Loaded and unloaded memory pages in shared memory zones

We can see that 608 KB of memory is loaded (or actually used) in the dogs zone, while there is also a special ngx_accept_mutex_ptr zone which is automatically allocated by the Nginx core for the accept_mutex feature. When we add these two sizes together, we get 612 KB, which is exactly the Nginx Shm Loaded size shown in the pie chart above. As we mentioned above, the 608 KB of memory used by the dogs zone is actually meta data used by the slab allocator.

The unloaded memory pages are just preserved virtual memory address space that has never been touched (or used).

A word on process page tables

One complication we haven't mentioned yet is that each nginx worker process has its own page table used by the CPU hardware or the operating system kernel when looking up a virtual memory page. For this reason, each process may have a different set of loaded pages for exactly the same shared memory zone, because each process may have touched different sets of memory pages in its own execution history. To simplify the analysis here, OpenResty XRay always shows all the memory pages that are loaded by any of the worker processes, even if the current target worker process has never touched some of those pages. For this reason, the total size of the loaded pages here may (slightly) exceed the corresponding portion of the target process's resident memory size.

Free and used slabs

As we have discussed above, Nginx usually manages the shared memory zone by slabs instead of memory pages. We can directly see the statistics of the used and free (or unused) slabs inside a particular shared memory zone through OpenResty XRay:

Free and used slabs in zone dogs

As expected, most of the slabs are free or unused for our example. Note that the size numbers are actually much smaller than the memory page level statistics shown in the previous section. This is because we are now on a higher abstraction level, the slabs level, excluding most of the slab allocator's own memory overhead and the memory page padding overhead.

We can further observe the size distribution of all the individual slabs in this dogs zone through OpenResty XRay:

Used slab size distribution for an empty zone
Free slab size distribution

We can see that even for this empty zone, there are still 3 used slabs and 157 free slabs, or 3 + 157 = 160 slabs in total. Please keep this number in mind; we will later compare it with the same dogs zone after some user data has been inserted.
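
For a quick cross-check from inside the server itself, recent OpenResty versions (where lua-resty-core is loaded, the default in current releases) expose rough zone-level numbers on the shared dict API. A hedged sketch with a hypothetical /shm-stats location might look like the following; note that free_space() reports whole free pages only, so it does not account for free slots inside partially used pages:

location = /shm-stats {
    content_by_lua_block {
        local dogs = ngx.shared.dogs
        -- capacity(): the configured size of the zone, in bytes
        ngx.say("capacity:   ", dogs:capacity(), " bytes")
        -- free_space(): the total size of the free pages, in bytes
        ngx.say("free space: ", dogs:free_space(), " bytes")
    }
}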

Zones with user data

Now let's modify our previous example by inserting some data upon Nginx server startup. Basically, we just need to add the following init_by_lua_block directive to the nginx.conf file's http {} configuration block:

init_by_lua_block {
    for i = 1, 300000 do
        ngx.shared.dogs:set("key" .. i, i)
    end
}

Here we initialize our dogs shared memory zone by inserting 300,000 key-value pairs into it during the server startup.

Then let's restart the server with the following shell commands:

kill -QUIT `cat logs/nginx.pid`
/usr/local/openresty/nginx/sbin/nginx -p $PWD/

The new Nginx processes now look like this:

$ ps aux|head -n1; ps aux|grep nginx
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
agentzh  29733  0.0  0.0 137508  1420 ?        Ss   13:50   0:00 nginx: master process /usr/local/openresty/nginx/sbin/nginx -p /home/agentzh/work/
agentzh  29734 32.0  0.5 138544 41168 ?        S    13:50   0:00 nginx: worker process
agentzh  29735 32.0  0.5 138544 41044 ?        S    13:50   0:00 nginx: worker process

Virtual memory and resident memory

For the Nginx worker process 29735, OpenResty XRay gives the following pie chart:

Virtual memory usage breakdown for a non-empty zone

Apparently the resident memory is now significantly larger than in the previous empty-zone case, and it also takes a much larger portion of the total virtual memory size (29.6%). The virtual memory size is just slightly larger than before (135.30 MB vs 134.73 MB). Because the shared memory zones' sizes stay the same, they contribute nothing to the increased virtual memory size. The increase is just due to the Lua code newly added via the init_by_lua_block directive (this small addition also contributes to the resident memory size).

The application-level memory usage breakdown shows that the Nginx shared memory zone's loaded memory takes most of the resident memory:

Loaded and unloaded pages in zone dogs

Loaded and unloaded pages

Now we have many more loaded memory pages and far less unloaded ones inside this dogs shared memory zone:

Loaded and unloaded pages for zone dogs

Free and used slabs

This time we have 300,000 more used slabs (in addition to the 3 pre-allocated slabs in an empty zone):

Used slabs for non-empty zone dogs

Apparently each key-value pair in the lua_shared_dict zone corresponds to a single slab.

The number of free slabs is exactly the same as in the empty zone case, i.e., 157 slabs:

Free slabs for a non-empty zone dogs

Fake Memory Leaks

As we demonstrated above, shared memory zones do not consume any RAM resources until their memory pages actually get accessed by the applications. For this reason, it may seem to the user that the resident memory usage of the nginx worker processes keeps growing, especially right after the processes are started. This may give a false alarm of memory leaks. The following chart shows such an example:

process memory growing

By looking at the application-level memory breakdown chart produced by OpenResty XRay, we can clearly see that the Nginx shared memory zones are taking most of the resident memory here:

Memory usage breakdown for huge shm zones

Such memory growth is temporary and will stop once the shared memory zones are all filled up. But this also poses a potential risk when the shared memory zones are configured too large, larger than the physical memory available on the current system. For this reason, it is always a good idea to keep an eye on page-level memory consumption graphs like the one below:

Loaded and unloaded memory pages in shared memory zones

The blue portions may eventually be used up by the process (i.e., turn red) and put real pressure on the current system.
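
To make the growth pattern described above concrete, here is a hedged sketch of a hypothetical /feed location (not part of the earlier example) that inserts fresh keys on every request. While the dogs zone still has untouched pages, each burst of inserts loads new pages and the worker's RSS creeps upward, which can look like a leak; once the zone is full, the dict starts evicting least-recently-used entries instead of touching new pages and the growth levels off:

location = /feed {
    content_by_lua_block {
        local dogs = ngx.shared.dogs
        -- every request writes 1000 brand-new keys into the zone
        for i = 1, 1000 do
            dogs:set("key-" .. ngx.now() .. "-" .. i, i)
        end
        -- free_space() shrinking toward 0 tracks the "growth" phase
        ngx.say("free space now: ", dogs:free_space(), " bytes")
    }
}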

HUP reload

Nginx does support receiving the HUP signal to reload the server configuration without quitting its master process (the worker processes are still gracefully shut down and relaunched, however). Usually the Nginx shared memory zones automatically inherit the existing data after a HUP reload operation, so any physical memory pages previously assigned to the accessed shared memory data will stay. Thus any attempt to use HUP reload to free up the shared memory zones' existing resident memory pages will fail. The user should use a full restart or Nginx's binary upgrade operation instead.

Nevertheless, it is up to the Nginx modules implementing the shared memory zones to decide whether to keep the data during a HUP reload. So there might be exceptions.

Conclusion

We have already explained that Nginx's shared memory zones may take much less physical memory than the size configured in the nginx.conf file, thanks to the demand-paging feature of modern operating systems. We demonstrated that empty shared memory zones may still utilize some memory pages and slabs to store the slab allocator's own meta data. By means of OpenResty XRay analyzers, we can easily examine, in real time, exactly how much memory is actually used or loaded by the shared memory zones inside any running nginx worker process, on both the memory page level and the slab level.

On the other hand, the demand-paging optimization may also produce steady memory usage growth for a period of time, which is not really a memory leak but may still impose risks. And we covered that Nginx's HUP reload operation usually does not clear existing data in shared memory zones.

In future articles on this blog site, we will continue looking at high level data structures used in shared memory zones like red-black trees and queues, and will also analyze and mitigate memory fragmentation issues inside shared memory zones.

About The Author

Yichun Zhang is the creator of the OpenResty® open source project. He is also the founder and CEO of the OpenResty Inc. company. He contributed a dozen open source Nginx 3rd-party modules, quite some Nginx and LuaJIT core patches, and designed the OpenResty XRay platform.

Translations

We provide the Chinese translation for this article on blog.openresty.com.cn. We also welcome interested readers to contribute translations in other natural languages as long as the full article is translated without any omissions. We thank them in advance.

We are hiring

We always welcome talented and enthusiastic engineers to join our team at OpenResty Inc. to explore various open source software's internals and build powerful analyzers and visualizers for real world applications built atop the open source software. If you are interested, please send your resume to talents@openresty.com . Thank you!


  1. When swapping happens, some resident memory pages would be saved and mapped to disk devices.

OpenResty技术 published an article · 2020-07-18

The Wonderland of Dynamic Tracing (Part 4 of 7)

This is Part 4 of the series "The Wonderland of Dynamic Tracing" which consists of 7 parts. I will keep updating this series to reflect the state of the art of the dynamic tracing world.

The previous one, Part 3, introduced various real world use cases of SystemTap in production. This part will take a close look at Flame Graphs which were frequently mentioned in the previous part.

See also Part 1 and Part 2.

Flame Graphs

Flame Graphs have appeared many times in the previous parts of this series. So what are they? Flame Graphs are an amazing kind of visualization invented by Brendan Gregg, whom I have already mentioned several times before.

Flame Graphs function like X-ray images of a running software system. The graph integrates and displays time and spatial information in a very natural and vivid way, revealing a variety of quantitative statistical patterns of system performance.

C-land CPU Flame Graph for Nginx

I shall start with an example. The most classical kind of flame graph looks at the distribution of CPU time among all the code paths of the target running software. The resulting diagram visibly distinguishes the code paths consuming more CPU time from those which consume less. Furthermore, flame graphs can be generated on different software stack levels, say, one on the C/C++ language level of systems software, and another on a higher level such as the dynamic scripting language level, covering Lua or Python code. Different flame graphs often offer different perspectives, reflecting level-specific code hot spots.

In dealing with the mailing lists of OpenResty, my own open-source software community, I often encourage users to proactively provide the flame graphs they sample when reporting a problem. Then the graph will work its magic to quickly reveal all the bottlenecks to everyone who sees it, saving all the trouble of wasting time on endless trials-and-errors. It is a big win for everybody.

It is worth noting that in the case of an unfamiliar program, a flame graph still makes it possible to gain a big picture of any performance issues, without the need of reading any source code of the target software. This capability is really marvelous, thanks to the fact that most programs are made to be reasonable or understandable, at least to some extent, meaning that each program already uses abstraction layers at the time of software construction, for example, through functions or class methods. The names of these functions usually contain semantic information and are directly displayed on the flame graph. Each name serves as a hint of what the corresponding function does, and even a hint for the corresponding code path as well. The bottlenecks in the program can thus be inferred. So it still comes down to the importance of proper function or module naming in the source code. The names are not only crucial for humans reading the source code, but also very helpful when debugging or profiling binary programs. The flame graphs, in turn, also serve as a shortcut to learning unfamiliar software systems. Looking at it the other way, important code paths are almost always those taking up a lot of time, and so they deserve special attention; otherwise something must be very wrong in the way the software is constructed.
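
As a minimal illustration, consider the hypothetical Lua snippet below. In a Lua-land CPU flame graph sampled from a process running it, nearly all samples would pile up under frames carrying the name burn_cpu, so the hot path announces itself through the function name alone, before anyone reads a line of source code:

-- a deliberately expensive function: it would show up as a wide frame
-- (close to 100% of the graph's width) in a Lua-land CPU flame graph
local function burn_cpu(n)
    local total = 0
    for i = 1, n do
        total = total + i % 7
    end
    return total
end

-- the cheap bookkeeping around it would appear as narrow frames, if at all
local function handle_request()
    local header = "x-demo: 1"
    return burn_cpu(1e8), header
end

print(handle_request())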

The most classical flame graphs focus on the distribution of CPU time across all the code paths of the target software system currently running. This is the CPU time dimension. Naturally, flame graphs can also be extended to other dimensions, like off-CPU time, when a process does not run on any CPU cores. Generally speaking, off-CPU time exists because the process is in a sleeping state for some reason. For example, the process could be waiting for certain system-level locks, or for some blocking I/O operations to complete, or it has simply run out of the current CPU time slice assigned by the process scheduler of the operating system. All such circumstances prevent the process from running on any CPU cores, but a lot of wall clock time is still taken. In contrast with the CPU time dimension, the off-CPU time dimension reveals invaluable information for analyzing the overhead of system locking (such as the system call sem_wait), some blocking I/O operations (like open and read), as well as the CPU contention among processes and threads. All of these become very obvious in off-CPU flame graphs, without getting overwhelmed by too many details that do not really matter.

Technically speaking, the off-CPU flame graph was the result of a bold attempt. One day, I was reading Brendan's blog article about off-CPU time, by Lake Tahoe straddling the California-Nevada border. A thought struck me: maybe off-CPU time, like CPU time, can be applied to flame graphs. Later I tried it in my previous employer's production systems, sampling the off-CPU flame graph of the nginx processes using SystemTap. And it worked! I tweeted about the successful story and got a warm response from Brendan Gregg. He told me how he had tried it without the desired results. I guess that he had used the off-CPU graph for multi-threaded programs, like MySQL. Massive thread synchronization operations in such processes fill the off-CPU graph with so much noise that the really interesting parts get obscured. I chose a different use case, single-threaded programs like Nginx and OpenResty. In such processes, the off-CPU flame graphs can often promptly reveal blocking system calls in blocked Nginx event loops, like sem_wait and open, as well as interventions by the process scheduler. This is of great help for analyzing similar performance issues. The only noise is the epoll_wait system call in the Nginx event loop, which is easy to identify and ignore.

off-CPU time

Similarly, we can extend the flame graph idea to other system resource metric dimensions, such as the number of bytes in memory leaks, file I/O latency, network bandwidth, etc. I remember once using the "memory leak flame graph" tool invented by myself to rapidly figure out what was behind a very thorny leak issue in the Nginx core. Conventional tools like Valgrind and AddressSanitizer were unable to capture the leak lurking inside the memory pool of Nginx. In another situation, the "memory leak flame graph" easily located a leak in an Nginx C module written by a European developer. He had been perplexed by the very subtle and slow leak over a long period of time, but I quickly pinpointed the culprit in his own C code without even reading his source code at all. In retrospect, I think that was indeed like magic. I hope now you can appreciate the versatility of the flame graph as a visualization method for a lot of entirely different problems.

Our OpenResty XRay product supports automated sampling of various types of flame graphs, including C/C++-level flame graphs, Lua-level flame graphs, off-CPU flame graphs, CPU flame graphs, dynamic memory allocation flame graphs, GC object reference flame graphs, file I/O flame graphs, and many more!

Conclusion

This part of the series has a close look at Flame Graphs. The next part, Part 5, will cover the methodology commonly used in the troubleshooting process involved with dynamic tracing technologies.

A Word on OpenResty XRay

OpenResty XRay is a commercial dynamic tracing product offered by our OpenResty Inc. company. We use this product in our articles like this one to intuitively demonstrate implementation details, as well as statistics about real world applications and open source software. In general, OpenResty XRay can help users to get deep insight into their online and offline software systems without any modifications or any other collaborations, and efficiently troubleshoot really hard problems for performance, reliability, and security. It utilizes advanced dynamic tracing technologies developed by OpenResty Inc. and others.

You are welcome to contact us to try out this product for free.

OpenResty XRay Console Dashboard

About The Author

Yichun Zhang is the creator of the OpenResty® open source project. He is also the founder and CEO of the OpenResty Inc. company. He contributed a dozen open source Nginx 3rd-party modules, quite some Nginx and LuaJIT core patches, and designed the OpenResty XRay platform.

Translations

We provide a Chinese translation for this article on blog.openresty.com.cn ourselves. We also welcome interested readers to contribute translations in other natural languages as long as the full article is translated without any omissions. We thank them in advance.


OpenResty技术 published an article · 2020-07-17

The Wonderland of Dynamic Tracing (Part 3 of 7)

This is Part 3 of the series "The Wonderland of Dynamic Tracing" which consists of 7 parts. I will keep updating this series to reflect the state of the art of the dynamic tracing world.

The previous one, Part 2, introduced DTrace and SystemTap, two famous dynamic tracing frameworks. This part will continue looking at real world applications of SystemTap.

See also Part 1 and Part 4.

Applications of SystemTap in Production

The huge impact of DTrace today wouldn't be possible without the contributions of the renowned leading expert on DTrace, Brendan Gregg. I already mentioned him in Part 2 before. He previously worked on the Solaris file system in Sun Microsystems, being one of the earliest users of DTrace. He authored several books on DTrace and systems performance optimization, as well as many high quality blog posts concerning dynamic tracing in general.

After leaving Taobao in 2011, I went to Fuzhou and experienced an “idyllic life” there for a whole year. During the last several months there, I dived into Brendan's public blog and systematically studied DTrace and dynamic tracing. Before that, I had never heard of DTrace, until one of my Sina Weibo followers mentioned it very briefly in a comment. I was immediately intrigued and did my own research to learn more about it. Well, I would never have imagined that my exploration would lead me to a totally new world and completely change my views about the entire computing world. So I devoted a lot of time to thoroughly reading every one of Brendan's personal blog posts. Ultimately, my efforts paid off. Fully enlightened, I felt I could finally take in the subtleties of dynamic tracing.

Then in 2012, I ended the “idyllic life” in Fuzhou and left for the US to join the CDN service provider and network security company mentioned before. I immediately started to apply SystemTap and the whole set of dynamic tracing methods I had acquired to the company's global network, to solve those very weird, very strange online problems. I found that my colleagues at the time would always add extra event tracking code into the software system on their own when troubleshooting online problems. They did so by directly editing the source code and adding various counters, or event tracking code to emit log data, primarily to the applications' Lua code, and sometimes even to the code base of systems software like Nginx. In this way, a large number of logs would be collected online in real time, before entering a special database and going through offline analysis. However, their practice clearly brought colossal costs. Not only did it sharply raise the cost of hacking and maintaining the business systems, but it also incurred the online costs of collecting and storing enormous amounts of log data in full volume. Moreover, the following situation is not uncommon: Engineer A adds some event tracking code in the business code and Engineer B does the same later. However, these pieces of code may end up being forgotten and left in the code base, without being noticed again. The final result would only be that these endlessly accumulating tracking points mess up the code base, and the invasive revisions make the corresponding software, whether systems software or business code, more and more difficult to maintain.

Two serious problems exist in the way metrics and event tracking code is added. The first is that “too many” event tracking counters and logging statements are added. Out of a desire to cover everything, we tend to gather some totally useless information, leading to unnecessary collection and storage costs. Even if sampling is already enough to analyze problems in many cases, the habitual response is still to carry out whole-network, full-volume data collection, which is clearly very expensive in the long run. The second is that “too few” counters and logging statements are added. It is often very difficult to plan all the necessary information collection points in the first place, as no one can predict future problems needing troubleshooting like a prophet. Consequently, whenever a new problem emerges, the existing information collected is almost always insufficient. What follows is revising the software system and conducting online operations frequently, causing a much heavier workload for development and maintenance engineers, and a higher risk of more severe online incidents.

Another brute force debugging method some maintenance engineers often use is to drop the servers offline, and then set a series of temporary firewall rules to block or screen user traffic or their own monitoring traffic, before fiddling with the production machine. This cumbersome process has a huge impact. Firstly, as the machine is unable to continue its services, the overall throughput of the entire online system is impaired. Secondly, problems that can reproduce only when real traffic exists will no longer recur. You can imagine how frustrating it will be.

Fortunately, SystemTap dynamic tracing offers an ideal solution to such problems while avoiding those headaches. You don’t have to change the software stack itself, be it systems software or business-level applications. I often write some dedicated tools that place dynamic probes on the "key spots" of the relevant code paths. These probes collect information separately, which will be combined and transmitted by the debugging tools to the terminal. My way of doing things enables me to quickly get key information I need through sampling on one or more machines, and obtain quick answers to some very basic questions to navigate subsequent (deeper) debugging work.

We talked earlier about manually adding metrics and event tracking/logging code into the production systems to record logs and put them into a database. That manual work is far inferior to seeing the whole production system as a directly accessible “database” from which we can obtain the very information needed in a safe and quick manner, without leaving any trace. Following this train of thought, I wrote a number of debugging tools, most of them open-sourced on GitHub. Many were targeted at systems software such as Nginx, LuaJIT and the operating system kernel, and some focused on higher-level Web frameworks like OpenResty. GitHub offers access to the following code repositories: nginx-systemtap-toolkit, perl-systemtap-toolkit and stapxx.

My SystemTap Tool Cloud

These tools helped me identify a lot of online problems, some even by surprise. We will walk through five examples below.

Case #1: Slow Debugging Code Left in Production

The first example is an accidental discovery when I analyzed the online Nginx process using the SystemTap-based Flame Graph. I noticed a big portion of CPU time was spent on a very strange code path. The code path turned out to be some temporary debugging statements left by one of my former colleagues when debugging an ancient problem. It’s like the “event tracking code” mentioned above. Although the problem had long been fixed, the code was left there and was forgotten, both online and in the company’s code repository. The existence of that piece of code came at a high price, that is, ongoing performance overhead unnoticed for a long time. The approach I used was sampling so that the tool can automatically draw a Flame Graph (we will cover it in detail in Part 4), from which I can understand the problem and take measures. You see, this is very efficient.

Case #2: Long request latency outliers

Long delays may be seen only in a very small portion of all the online requests, or “request latency outliers”. Though small in number, they may have delays on the level of seconds. I used to run into such things a lot. For example, one former colleague just took a wild guess that my OpenResty had a bug. Unconvinced, I immediately wrote a SystemTap tool for online sampling to analyze those requests delayed by over one second. The tool can directly test the internal time distribution of problematic requests, including delay of each typical I/O operation and pure CPU computing delay in the course of request handling. It soon found the delay appeared when OpenResty accessed the DNS server written in Go. Then the tool output details about those long-tailed DNS queries, which were all related to CNAME. Well, mystery was solved. Obviously, the delay had nothing to do with OpenResty. And the finding paved the way for further investigation and optimization.

Case #3: From Network Issues to Disk Issues

The third example is very interesting. It's about shifting from network problems to hard disk issues in debugging. My former colleagues and I once noticed machines in a computer room showed a dramatically higher ratio of network timeout errors than the other colocations or data centers, albeit at a mere 1 percent. At first, we naturally paid attention to the network protocol stack. However, a series of dedicated SystemTap tools focusing directly on those outlier requests later led me to a hard disk configuration issue. First-hand data steered us to the correct track very quickly. A presumed network issue turned out to be a hard disk problem.

Case #4: File Handle Cache Tradeoffs

The fourth example turns to the Flame Graphs again. In the CPU flame graph for the online nginx processes, we observed that file opening and closing operations took a significant portion of the total CPU time. Our natural response was to enable Nginx's own file-handle cache, but this did not yield any noticeable optimization. With a new flame graph sampled, however, we found that the “spin lock” used for the cache's meta data now took a lot of CPU time. Everything became clear via the flame graph. Although we had enabled the caching, it had been set to so large a size that its benefits were voided by the overhead of the meta data spin lock. Imagine that if we had had no flame graphs and had just performed black-box benchmarks, we would have reached the wrong conclusion that Nginx's file-handle cache was useless, instead of tuning the cache parameters.

Case #5: Compiled Regex Cache Tuning

Now comes our last example for this section. After one online release operation, I remember, the latest online flame graph revealed that compiling regular expressions consumed a lot of CPU time, even though the caching of compiled regular expressions had already been enabled online. Apparently the number of regular expressions used in our business system had exceeded the maximum cache size. Accordingly, the next thing that came to my mind was simply to increase the cache size for the online regular expressions. As expected, the bottleneck immediately disappeared from our online flame graphs after the cache size was increased.

Wrapping up

These examples demonstrate that new problems will always occur and will vary depending on the data center, the server, and even the time period of the day on the same machine. Whatever the problem, the solution is to analyze its root cause directly and take online samples from the first scene of the events, instead of jumping into trials and errors with wild guesses. With the help of powerful observability tools, troubleshooting can actually yield much more with much less effort.

After we founded OpenResty Inc., we developed OpenResty XRay as a brand new dynamic tracing platform, putting an end to manual uses of open-source solutions like SystemTap.

Conclusion

This part of the series covered various use cases of one of the most mature dynamic tracing frameworks, SystemTap. The next part, Part 4, will talk about a very powerful visualization method to analyze resource usage across all software code paths, Flame Graphs. Stay tuned!

A Word on OpenResty XRay

OpenResty XRay is a commercial dynamic tracing product offered by our OpenResty Inc. company. We use this product in our articles like this one to intuitively demonstrate implementation details, as well as statistics about real world applications and open source software. In general, OpenResty XRay can help users to get deep insight into their online and offline software systems without any modifications or any other collaborations, and efficiently troubleshoot really hard problems for performance, reliability, and security. It utilizes advanced dynamic tracing technologies developed by OpenResty Inc. and others.

You are welcome to contact us to try out this product for free.

OpenResty XRay Console Dashboard

About The Author

Yichun Zhang is the creator of the OpenResty® open source project. He is also the founder and CEO of the OpenResty Inc. company. He contributed a dozen open source Nginx 3rd-party modules, quite some Nginx and LuaJIT core patches, and designed the OpenResty XRay platform.

Translations

We provide a Chinese translation for this article on blog.openresty.com.cn ourselves. We also welcome interested readers to contribute translations in other natural languages as long as the full article is translated without any omissions. We thank them in advance.


OpenResty技术 published an article · 2020-07-16

The Wonderland of Dynamic Tracing (Part 2 of 7)

This is the second part of the 7-part series "The Wonderland of Dynamic Tracing." I will keep updating this series to reflect the state of the art of the dynamic tracing world.

The previous part, Part 1, introduced the basic concepts of dynamic tracing and covered its advantages. This part will continue looking at some of the most popular dynamic tracing frameworks in the open source world.

See also Part 3 and Part 4.

DTrace

We cannot talk about dynamic tracing without mentioning DTrace. DTrace is the earliest modern dynamic tracing framework. Originating from the Solaris operating system at the beginning of this century, it was developed by engineers at Sun Microsystems. Many of you may have heard of the Solaris system and its original developer, Sun.

A story circulates around the creation of DTrace. Once upon a time, several kernel engineers of the Solaris operating system spent several days and nights troubleshooting a very weird online issue. They originally thought it was very complicated and spent great effort addressing it, only to realize it was just a very silly configuration issue. Learning from that painful experience, they created DTrace, a highly sophisticated debugging framework enabling tools which could prevent them from going through similar pains in the future. Indeed, most of the so-called “weird problems” (high CPU or memory usage, high disk usage, long latency, program crashes, etc.) turn out to have causes so embarrassingly simple that pinpointing them is even more depressing.

As a highly general-purpose debugging platform, DTrace provides the D language, a scripting language that looks like C. All DTrace-based debugging tools are written in D. The D language supports special notations to specify “probes”, which usually contain information about code locations in the target software system (either in the OS kernel or in a user-land process). For example, you can put a probe at the entry or exit of a certain kernel function, at the entry or exit of functions in certain user-mode processes, or even on any machine instruction. Writing debugging tools in the D language requires some understanding and knowledge of the target software system. These powerful tools can help us regain insight into complex systems, greatly increasing their observability. Brendan Gregg, a former engineer at Sun, was one of the earliest DTrace users, even before DTrace was open-sourced. Brendan wrote a lot of reusable DTrace-based debugging tools, most of which are in the open-source project called DTrace Toolkit. DTrace is the earliest and one of the most famous dynamic tracing frameworks.

DTrace Pros and Cons

DTrace has an edge in its close integration with the operating system kernel. The implementation of the D language is actually a virtual machine (VM), kinda like the Java virtual machine (JVM). One benefit of the D language is that its runtime is resident in the kernel and is very compact, meaning the startup and exit times of the debugging tools are very short. However, I think DTrace also has some notable weaknesses. The most frustrating one is the lack of looping constructs in the D language, making it very hard to write analytical tools targeting complicated data structures in the target software. The official statement attributed the lack to the purpose of avoiding infinite loops, but clearly DTrace could instead limit the iteration count of each loop on the VM level. The same applies to recursive function calls. Another major flaw is its relatively weak tracing support for user-mode code, as it has no built-in support for utilizing user-mode debug symbols. So the user must declare in their D code the types of the user-mode C language structures used. [1]

DTrace has such a large influence that many engineers have ported it over to other operating systems. For example, Apple has added DTrace support to its Mac OS X (and later macOS) operating system. In fact, every Apple laptop or desktop computer launched in recent years offers a ready-to-use dtrace command-line utility. Those who have an Apple computer can give it a try in the command-line terminal. Alongside the Apple system, DTrace has also made its way into the FreeBSD operating system. Not enabled by default, the DTrace kernel module in FreeBSD must be loaded through extra user commands. Oracle has also tried to introduce DTrace into their own Linux distribution, Oracle Linux, without much progress though. This is because the Linux kernel is not controlled by Oracle, but DTrace needs close integration with the operating system kernel. For similar reasons, the DTrace-to-Linux ports attempted by some engineers have long remained far below production-level quality.

Those DTrace ports lack some advanced features here and there (floating-point number support would be nice to have, and support for many built-in probes is also missing), so they cannot really match the original DTrace implementation in the Solaris operating system.

SystemTap

Another influence of DTrace on the Linux operating system is reflected in the open-source project SystemTap, a relatively independent dynamic tracing framework built by engineers from Red Hat and other companies. SystemTap has its own little language, the SystemTap scripting language, which is not compatible with DTrace's D language (although it also resembles C). Serving a wide range of enterprise-level users, Red Hat naturally relies on engineers who have to cope with a lot of “weird problems” on a daily basis. The real-life demand has inevitably prompted it to develop this technology. In my opinion, SystemTap is one of the most powerful and most usable dynamic tracing frameworks in today's open-source Linux world, and I have been using it in my work for years. The authors of SystemTap, including Frank Ch. Eigler and Josh Stone, are all very smart and enthusiastic engineers. I once raised questions through their IRC channel and their mailing list, and they often answered me very quickly and in great detail. I've been contributing to SystemTap by adding new features and fixing bugs.

SystemTap Pros and Cons

The strengths of SystemTap include its great maturity in automatically loading user-mode debug symbols, complete looping constructs for writing complicated probe handlers, and support for a great number of complex aggregations and statistics. Due to the immature implementations of SystemTap and the Linux kernel in the early days, the Internet is unfortunately still flooded with outdated criticisms of SystemTap. In the past few years we have witnessed significant improvements in it. In 2017, I established OpenResty Inc., which has also been helping improve SystemTap.

Of course, SystemTap is not perfect. Firstly, it's not a part of the Linux kernel, and such lack of close integration with the kernel means SystemTap has to keep track of changes in the mainline kernel all the time. Secondly, SystemTap usually compiles (or "translates") its scripts (written in its own language) into the C source code of a Linux kernel module. It is therefore often necessary to deploy the full C compiler toolchain and the header files of the Linux kernel in online systems [2]. For these reasons, SystemTap scripts start much more slowly than DTrace ones, at a speed similar to JVM startup. Overall, despite these shortcomings [3], SystemTap is still a very mature and outstanding dynamic tracing framework.

SystemTap Internal Workflow

DTrace and SystemTap

Neither DTrace nor SystemTap supports writing complete debugging tools on its own, as both lack convenient primitives for command-line interactions. This is why a slew of real-world tools based on them come with wrappers written in Perl, Python, and even shell script [4]. To use a clean language to write complete debugging tools, I once extended the SystemTap language into a higher-level “macro language” called stap++ [5]. I employed Perl to implement the stap++ interpreter, which is capable of directly interpreting and executing stap++ source code and internally calls the SystemTap command-line tool. Those interested can visit my open-source code repository stapxx on GitHub, where many complete debugging tools backed by my stap++ macro language are available.

Conclusion

This part of the series introduced two famous dynamic tracing frameworks, DTrace and SystemTap and covered their strengths and weaknesses. The next part, Part 3, will talk about applications of SystemTap to solve really hard problems. Stay tuned!

A Word on OpenResty XRay

OpenResty XRay is a commercial dynamic tracing product offered by our OpenResty Inc. company. We use this product in our articles like this one to demonstrate implementation details, as well as provide statistics about real world applications and open source software. In general, OpenResty XRay can help users gain deep insight into their online and offline software systems without any modifications or any other collaborations, and efficiently troubleshoot difficult problems for performance, reliability, and security. It utilizes advanced dynamic tracing technologies developed by OpenResty Inc. and others.

We welcome you to contact us to try out this product for free.

OpenResty XRay Console Dashboard

About The Author

Yichun Zhang is the creator of the OpenResty® open source project. He is also the founder and CEO of the OpenResty Inc. company. He contributed a dozen open source Nginx 3rd-party modules, many Nginx and LuaJIT core patches, and designed the OpenResty XRay platform.

Translations

We provide a Chinese translation for this article on blog.openresty.com.cn. We also welcome interested readers to contribute translations in other languages as long as the full article is translated without any omissions. We thank anyone willing to do so in advance.


  1. Neither SystemTap nor OpenResty XRay has these restrictions.
  2. SystemTap also supports a "translator server" mode which can remotely compile its scripts on dedicated machines. But it is still required to deploy the C compiler toolchain and header files on these "server" machines.
  3. OpenResty XRay's dynamic tracing solution has overcome these shortcomings of SystemTap.
  4. Alas. The newer BCC framework targeting Linux's eBPF technology also suffers from such ugly tool wrappers for its standard usage.
  5. The stap++ project is no longer maintained and has been superseded by the new generation of dynamic tracing framework provided by OpenResty XRay.

OpenResty技术 published an article · 2020-07-15

The Wonderland of Dynamic Tracing (Part 1 of 7)

This is the first part of the article "The Wonderland of Dynamic Tracing" which consists of 7 parts. I will keep updating this series to reflect the state of the art of the dynamic tracing world.

See also Part 2, Part 3, and Part 4.

Dynamic Tracing

It’s my great pleasure to share my thoughts on dynamic tracing —— a topic I have a lot of passion and excitement for. Let’s cut to the chase: what is dynamic tracing?

What It Is

As a kind of post-modern advanced debugging technology, dynamic tracing allows software engineers to answer some tricky problems about software systems, such as high CPU or memory usage, high disk usage, long latency, or program crashes. All this can be detected at a low cost within a short period of time, to quickly identify and rectify the problems. It emerged and thrived in a rapidly developing Internet era of cloud computing, service mesh, big data, API computation etc., which exposed engineers to two major challenges. The first challenge relates to the scale of computation and deployment. Today, the number of users, colocations, and machines are all experiencing rapid growth. The second one is complexity. Software engineers are facing increasingly complicated business logic and software systems. There are many, many layers to them. From bottom to top, there are operating system kernels, different kinds of system software like databases and Web servers, then, virtual machines, interpreters and Just-In-Time (JIT) compilers of various scripting languages or other advanced languages, and finally at the application level, the abstraction layers of various business logic and numerous complex code logic.

These huge challenges have consequences. The most serious is that software engineers today are quickly losing their insight into and control over the whole production system, which has become so enormous and complex that all kinds of bugs are much more likely to arise. Some may be fatal, like 500 error pages, memory leaks, and error return values, just to name a few. Also worth noting is the issue of performance. You may have been confused about why software sometimes runs very slowly, either by itself or on some machines. Worse, as cloud computing and big data gain more popularity, the production environment will only see more and more unpredictable problems on this massive scale. In these situations, engineers must devote most of their time and energy to them. Here, two factors are at play. Firstly, a majority of problems only occur in online environments, making it extremely difficult, if not impossible, to reproduce these problems. Secondly, some have only a very low frequency of occurrence, say, one in a hundred, one in a thousand, or even lower. For engineers, it would be ideal if they were able to analyze and pinpoint the root cause of a problem and take targeted measures to address it while the system is still running, without having to drop the machine offline, edit existing code or configurations, or reboot the processes or machines.

Too Good to be True?

And this is where dynamic tracing comes in. It can push software engineers toward that vision, greatly unleashing their productivity. I still remember when I worked for Yahoo! China. Sometimes I had to take a taxi, you know, at midnight, to the company to deal with online problems. I had no choice, but it obviously frustrated me, blurring the lines between my work and life. Later I came to a CDN service provider in the United States. The maintenance team of our clients always looked through the original logs provided by us, reporting any problems they deemed important. From the perspective of the service provider, some of them may just occur with a frequency of one in hundred or one in a thousand. Even so, we must identify the real cause and give feedback to the client. The abundance of such subtle occurrences in reality has fueled the creation and emergence of new technologies.

The best part of dynamic tracing, in my humble opinion, is its “live process analysis”. In other words, the technology allows software engineers to analyze one program or the whole software system while it is still running, providing online services and responding to real requests. Just like querying a database. That is a very intriguing practice. Many engineers tend to ignore the fact that a running software system, itself containing most precious information, serves as a database that is changing in real time and open to direct queries. Of course, the special “database” must be read-only, otherwise the said analysis and debugging would possibly affect the system’s own behaviors, and hamper online services. With the help of the operating system kernel, engineers can initiate a series of targeted queries from the outside to secure invaluable raw data about the running software system. This data will guide a multitude of tasks like problem analysis, security analysis, and performance analysis.

How it Works

Dynamic tracing usually works based on the operating system kernel level, where the “supreme being of software” has complete control over the entire software world. With absolute authority, the kernel can ensure the above-mentioned “queries” targeted at the software system will not influence the latter’s normal running. That is to say, those queries must be secure enough for wide use on production systems. Then, there arises another question concerning how a query is made if the software system is regarded as a special “database”. Clearly, the answer is not SQL.

Dynamic tracing generally starts a query through the probe mechanism. Probes are dynamically planted into one or several layers of the software system, and the processing handlers associated with them are defined by engineers. This procedure is similar to acupuncture in traditional Chinese medicine. Imagine that the software system is a person, and dynamic tracing means pushing some “needles” into particular spots of his body, or acupuncture points. As these needles often carry some engineer-defined “sensors”, we can freely gather and collect essential information from those points, to perform reliable diagnoses and create feasible treatment schemes. Here, tracing usually involves two dimensions. One dimension is the timeline: as long as the software is running, a course of continuous changes unfolds along the timeline. The other is the spatial dimension, because tracing may be related to various different processes, including kernel tasks and threads. Each process often has its own memory space and process space. So, among different layers, and within the memory space of the same layer, engineers can obtain abundant information in space, both vertically and horizontally. Doesn't this sound like a spider searching for prey on its web?

Spiderman searching on a cobweb

The information-gathering process goes beyond the operating system kernel to higher levels like the user mode program. The information collected can piece together along the timeline to form a complete software view and serve as a useful guide for some very complex analyses —— we can easily find various kinds of performance bottlenecks, root causes of weird exceptions, errors and crashes, as well as potential security vulnerabilities. A crucial point here is that dynamic tracing is non-invasive. Again, if we compare the software system to a person, to help them diagnose a condition, we clearly wouldn’t want to do so by ripping apart the living body or planting wires. Instead, the sensible action would be doing an X-ray or MRI, feeling their pulse, or using a stethoscope to listen to their heart and breathing. The same should go for diagnosis of a production software system. With the non-invasiveness of dynamic tracing comes speediness and high efficiency in accurately acquiring desired information firsthand, which helps identify different problems under investigation. No revision of the operating system kernels, application programs, or any configurations is needed here.

Most engineers should already be very familiar with the process of constructing software systems. This is a basic skill for software engineers after all. It usually means creating various abstraction layers to construct software, layer by layer, either in a bottom-up manner or top-down. Among many other paradigms, software abstraction layers can be created via the classes and methods in object-oriented programming, or directly via functions and subroutines. In contrast with software construction, debugging works in a way that can easily “rip off” existing abstraction layers. Engineers can then have free access to any necessary information from any layer, regardless of the concrete modular design, the code encapsulation, and the man-made constraints set for software construction. This is because during debugging people usually want to get as much information as possible. After all, bugs may happen at any software layer (or even on the hardware level).

Still Having Doubts?

But will the abstraction layers built while constructing the software hinder debugging? The answer is a resounding no. As mentioned above, dynamic tracing is generally based on the operating system kernel, which, as the “supreme being”, claims absolute authority, so the technology can easily (and legitimately) penetrate the abstraction layers. In fact, well-designed abstraction layers actually help the debugging process, as I will detail later on. In my own work, I have noticed a common phenomenon: when an online problem arises, some engineers become nervous and jump to wild guesses about the root cause without any evidence. Even worse, while confirming their guesses by trial and error, they leave the system in a mess that they and their colleagues are then pained to clean up. In the end, they waste valuable debugging time or simply destroy the “scene of the crime”. All such pains could go away when dynamic tracing plays a part. Troubleshooting can even turn out to be a lot of fun: a weird online problem becomes a rare opportunity to solve a fascinating puzzle. All this, of course, requires powerful tools for collecting and analyzing information, tools that can quickly prove or disprove any assumption about the culprit.
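
For instance, suppose someone guesses that a CPU spike is caused by a storm of read() system calls from some process. A hypothetical SystemTap one-liner like the sketch below can confirm or refute that guess within seconds, by counting read() calls per process over a short window; the 5-second window and the top-10 limit are arbitrary illustrative choices.

# A hypothetical one-liner to test the guess "some process is hammering read()":
sudo stap -e 'global c; probe syscall.read { c[execname()]++ }
  probe timer.s(5) { foreach (name in c- limit 10) printf("%8d %s\n", c[name], name); exit() }'

If the top entries show only modest counts, the guess is disproved in seconds and we can move on to the next theory, with no harm done to the running system.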

The Advantages of Dynamic Tracing

Dynamic tracing does not require any cooperation or collaboration from the target application. Back to the human analogy: it is like giving a person a physical examination while he is still running on the playground. With dynamic tracing, we can take a real-time X-ray of him without him sensing it at all. Almost all analytical tools based on dynamic tracing operate in a “hot-plug” or post-mortem manner, allowing us to run the tools at any time and to begin and end sampling at any time, without restarting or interfering with the target software processes. In reality, most analytical requirements arise only after the target system is already running; before that, software engineers are unlikely to predict what problems might occur, let alone what information would need to be collected to troubleshoot them. One advantage of dynamic tracing, then, is the ability to collect data anywhere and anytime, on demand. Another strength is its extremely small performance overhead: the impact of a carefully written debugging tool on the system tends to be no more than 5%, so end users rarely observe any slowdown. Moreover, this already minuscule overhead lasts only for the few seconds or minutes of the actual sampling window. Once the debugging tool finishes running, the online system automatically returns to its original full speed.

The running little "man" being examined while alive
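
As a concrete, purely hypothetical illustration of this hot-plug style of operation, the SystemTap sketch below attaches to an nginx worker process that is already running, counts the HTTP requests it handles during a 10-second sampling window, and then detaches without ever restarting the worker. The binary path, the function name, and the PID 12345 are assumptions chosen for illustration, and the probe assumes the nginx binary still carries its symbols.

# count-reqs.stp -- a hypothetical "hot-plug" sampling sketch.
# Run against a live worker:  sudo stap -x 12345 count-reqs.stp

global hits

probe process("/usr/local/openresty/nginx/sbin/nginx").function("ngx_http_process_request")
{
    if (pid() == target())   # only count the worker we attached to with -x
        hits++
}

probe timer.s(10) {
    printf("requests handled by pid %d in 10 seconds: %d\n", target(), hits)
    exit()   # detach; the worker keeps running at full speed
}

Once the 10 seconds are up, the script exits by itself and the worker continues serving traffic exactly as before.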

Conclusion

In this part, we introduced the concept of dynamic tracing at a very high level and briefly covered its advantages. In Part 2 of this series, we will look at two open source dynamic tracing frameworks, DTrace and SystemTap.

A Word on OpenResty XRay

OpenResty XRay is a commercial dynamic tracing product offered by OpenResty Inc. We use this product in articles like this one to demonstrate implementation details, as well as to provide statistics about real-world applications and open source software. In general, OpenResty XRay helps users gain deep insight into their online and offline software systems, without any modifications to, or any other cooperation from, those systems, and efficiently troubleshoot difficult problems involving performance, reliability, and security. It utilizes advanced dynamic tracing technologies developed by OpenResty Inc. and others.

We welcome you to contact us to try out this product for free.

OpenResty XRay Console Dashboard

About The Author

Yichun Zhang is the creator of the OpenResty® open source project. He is also the founder and CEO of the OpenResty Inc. company. He contributed a dozen open source Nginx 3rd-party modules, many Nginx and LuaJIT core patches, and designed the OpenResty XRay platform.

Translations

We provide a Chinese translation of this article on blog.openresty.com.cn. We also welcome interested readers to contribute translations in other languages, as long as the full article is translated without any omissions. We thank anyone willing to do so in advance.
