Linux Load Averages: Solving the Mystery


$ uptime
16:48:24 up  4:11,  1 user,  load average: 25.25, 23.40, 23.46

top - 16:48:42 up  4:12,  1 user,  load average: 25.25, 23.14, 23.37

$ cat /proc/loadavg
25.72 23.19 23.35 42/3411 43603

• If the averages are 0.0, your system is idle.
• If the 1 minute average is higher than the 5 or 15 minute averages, load is increasing.
• If the 1 minute average is lower than the 5 or 15 minute averages, load is decreasing.
• If the averages are higher than your CPU count, you may have a performance problem. (It depends.)
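These rules of thumb are easy to apply programmatically. Here is a minimal Python sketch (the function name and messages are mine, not from the post) that classifies one /proc/loadavg reading against the CPU count:

```python
import os

def classify(loadavg_line, ncpu):
    """Apply the rules of thumb above to one /proc/loadavg line.
    The first three fields are the 1, 5, and 15 minute averages; the
    remaining fields (running/total tasks, last PID) are ignored."""
    one, five, fifteen = (float(x) for x in loadavg_line.split()[:3])
    if one == five == fifteen == 0.0:
        return "idle"
    if one > max(five, fifteen):
        trend = "increasing"
    elif one < min(five, fifteen):
        trend = "decreasing"
    else:
        trend = "steady"
    if one > ncpu:
        trend += " (above CPU count: possible performance problem)"
    return trend

print(classify("25.72 23.19 23.35 42/3411 43603", os.cpu_count() or 1))
```

On a live Linux system you would feed it `open("/proc/loadavg").read()`; the sample line above is the one shown earlier.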

History

[1] The TENEX load average is a measure of CPU demand.
The load average is an average of the number of runnable processes over a given time period.
For example, an hourly load average of 10 would mean that (for a single CPU system) at any time during that hour one could expect to see 1 process running and 9 others ready to run (i.e., not blocked for I/O) waiting for the CPU.

NRJAVS==3               ;NUMBER OF LOAD AVERAGES WE MAINTAIN
GS RJAV,NRJAVS          ;EXPONENTIAL AVERAGES OF NUMBER OF ACTIVE PROCESSES
[...]
;UPDATE RUNNABLE JOB AVERAGES

DORJAV: MOVEI 2,^D5000
MOVEM 2,RJATIM          ;SET TIME OF NEXT UPDATE
MOVE 4,RJTSUM           ;CURRENT INTEGRAL OF NBPROC+NGPROC
SUBM 4,RJAVS1           ;DIFFERENCE FROM LAST UPDATE
EXCH 4,RJAVS1
FSC 4,233               ;FLOAT IT
FDVR 4,[5000.0]         ;AVERAGE OVER LAST 5000 MS
[...]
;TABLE OF EXP(-T/C) FOR T = 5 SEC.

EXPFF:  EXP 0.920043902 ;C = 1 MIN
EXP 0.983471344 ;C = 5 MIN
EXP 0.994459811 ;C = 15 MIN

#define EXP_1           1884            /* 1/exp(5sec/1min) as fixed-point */
#define EXP_5           2014            /* 1/exp(5sec/5min) */
#define EXP_15          2037            /* 1/exp(5sec/15min) */

Linux also hard-codes the 1, 5, and 15 minute constants.
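To make the fixed-point math concrete, here is a small Python sketch (my own, mirroring the kernel's CALC_LOAD macro with FSHIFT = 11) that derives the three constants above and then runs one minute of 5-second updates:

```python
import math

FSHIFT = 11            # kernel fixed-point shift
FIXED_1 = 1 << FSHIFT  # 1.0 in fixed-point == 2048

# Derive the kernel's constants: 2048 * exp(-5sec/period), rounded.
print([round(FIXED_1 * math.exp(-5.0 / p)) for p in (60, 300, 900)])
# -> [1884, 2014, 2037], matching EXP_1, EXP_5, EXP_15

def calc_load(load, exp, active):
    """One exponentially damped update, as in the kernel's CALC_LOAD
    macro; load and active are both scaled by FIXED_1."""
    return (load * exp + active * (FIXED_1 - exp)) >> FSHIFT

# Two tasks stay runnable for one minute (12 ticks of 5 seconds):
load = 0
for _ in range(12):
    load = calc_load(load, 1884, 2 * FIXED_1)
print(round(load / FIXED_1, 2))
# -> 1.26: after one minute of constant load 2, the "1 minute"
#    average has only reached ~63% of the true value
```

This also shows why the "1 minute" average is really an exponentially damped average with a 1-minute decay constant, not a flat average over the last minute.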

Searching for an Ancient Linux Patch

"Changes to the last official release (p13) are too numerous to mention (or even to remember)..." – Linus

"While working on a system to make these mailing archives scale more effecitvely I accidently destroyed the current set of archives (ah whoops)."

The Origin of "Uninterruptible"

From: Matthias Urlichs <urlichs@smurf.sub.org>
Date: Fri, 29 Oct 1993 11:37:23 +0200

The kernel only counts "runnable" processes when computing the load average.
I don't like that; the problem is that processes which are swapping or
waiting on "fast", i.e. noninterruptible, I/O, also consume resources.

It seems somewhat nonintuitive that the load average goes down when you
replace your fast swap disk with a slow swap disk...

Anyway, the following patch seems to make the load average much more
consistent WRT the subjective speed of the system. And, most important, the
load is still zero when nobody is doing anything. ;-)

--- kernel/sched.c.orig Fri Oct 29 10:31:11 1993
+++ kernel/sched.c  Fri Oct 29 10:32:51 1993
@@ -414,7 +414,9 @@
unsigned long nr = 0;

-       if (*p && (*p)->state == TASK_RUNNING)
+       if (*p && ((*p)->state == TASK_RUNNING) ||
+                  (*p)->state == TASK_UNINTERRUPTIBLE) ||
+                  (*p)->state == TASK_SWAPPING))
nr += FIXED_1;
return nr;
}
--
Matthias Urlichs        \ XLink-POP N|rnberg   | EMail: urlichs@smurf.sub.org
Schleiermacherstra_e 12  \  Unix+Linux+Mac     | Phone: ...please use email.
90491 N|rnberg (Germany)  \   Consulting+Networking+Programming+etc'ing      42

"Uninterruptible" Today

"The point of "load average" is to arrive at a number relating how busy the system is from a human point of view. TASK_UNINTERRUPTIBLE means (meant?) that the process is waiting for something like a disk read which contributes to system load. A heavily disk-bound system might be extremely sluggish but only have a TASK_RUNNING average of 0.1, which doesn't help anybody."

(Getting a reply at all, let alone one this quickly, was thrilling. Thanks!)
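A quick way to see the uninterruptible contribution on a modern system is to count tasks currently in the D state (this one-liner is my own illustration, not from the post):

```shell
# Tasks in uninterruptible sleep (state "D") are counted into the
# Linux load average alongside runnable ("R") tasks.
ps -eo stat= | awk '$1 ~ /^D/ { d++ } END { print d+0 }'
```

On an idle system this usually prints 0; during heavy disk I/O it can be large, which is exactly when the Linux load average diverges from a pure CPU load average.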

Measuring Uninterruptible Tasks

<embed src="http://www.brendangregg.com/blog/images/2017/out.offcputime_unint02.svg" />

# ./bcc/tools/offcputime.py -K --state 2 -f 60 > out.stacks
# awk '{ print $1, $2 / 1000 }' out.stacks | ./FlameGraph/flamegraph.pl --color=io --countname=ms > out.offcpu.svg

<embed src="http://www.brendangregg.com/blog/images/2017/out.offcputime_unint01.svg" />

/* wait to be given the lock */
while (true) {
        set_task_state(tsk, TASK_UNINTERRUPTIBLE);
        if (!waiter.task)
                break;
        schedule();
}

terma$ pidstat -p `pgrep -x tar` 60
Linux 4.9.0-rc5-virtual (bgregg-xenial-bpf-i-0b7296777a2585be1)     08/01/2017  _x86_64_    (8 CPU)

10:15:51 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
10:16:51 PM     0     18468    2.85   29.77    0.00   32.62     3  tar

termb$ iostat -x 60
[...]
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
0.54    0.00    4.03    8.24    0.09   87.10

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdap1            0.00     0.05   30.83    0.18   638.33     0.93    41.22     0.06    1.84    1.83    3.64   0.39   1.21
xvdb            958.18  1333.83 2045.30  499.38 60965.27 63721.67    98.00     3.97    1.56    0.31    6.67   0.24  60.47
xvdc            957.63  1333.78 2054.55  499.38 61018.87 63722.13    97.69     4.21    1.65    0.33    7.08   0.24  61.65
md0               0.00     0.00 4383.73 1991.63 121984.13 127443.80    78.25     0.00    0.00    0.00    0.00   0.00   0.00

termc$ uptime
22:15:50 up 154 days, 23:20,  5 users,  load average: 1.25, 1.19, 1.05
[...]
termc$ uptime
22:17:14 up 154 days, 23:21,  5 users,  load average: 1.19, 1.17, 1.06

<embed src="http://www.brendangregg.com/blog/images/2017/out.offcputime_unint08.svg" />

• 0.33 comes from tar's CPU time (pidstat)
• 0.67 comes from tar's uninterruptible disk reads (the off-CPU flame graph shows 0.69; I suspect the small difference is because that script began collecting slightly later, so the intervals don't line up exactly)
• 0.04 comes from other CPU consumers (iostat user + system, minus tar's CPU time from pidstat)
• 0.11 comes from kernel workers' time in uninterruptible disk I/O, flushing writes to disk (the two towers on the left of the off-CPU flame graph)

(In the comments on the original post, the author explained how these numbers were calculated:)

tar: the off-CPU flame graph has 41,164 ms, and that's a sum over a 60 second trace. Normalizing that to 1 second = 41.164 / 60 = 0.69. The pidstat output has tar taking 32.62% average CPU (not a sum), and I know all its off-CPU time is in uninterruptible (by generating off-CPU graphs for the other states), so I can infer that 67.38% of its time is in uninterruptible. 0.67. I used that number instead, as the pidstat interval closely matched the other tools I was running.
by mpstat I meant iostat sorry (I updated the text), but it's the same CPU summary. It's 0.54 + 4.03% for user + sys. That's 4.57% average across 8 CPUs, 4.57 x 8 = 36.56% in terms of one CPU. pidstat says that tar consumed 32.62%, so the remainder is 36.56% - 32.62% = 3.94% of one CPU, which was used by things that weren't tar (other processes). That's the 0.04 added to load average.
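Restating that arithmetic as a quick sanity check (all numbers are taken directly from the comment above):

```python
# 41,164 ms of uninterruptible off-CPU time over a 60 s trace,
# normalized to one second of one CPU:
print(round(41164 / 1000 / 60, 2))            # 0.69

# pidstat: tar averaged 32.62% CPU, so the rest of its time
# (all uninterruptible, per the per-state off-CPU graphs) is:
print(round((100 - 32.62) / 100, 2))          # 0.67

# iostat CPU summary: 0.54% user + 4.03% system across 8 CPUs,
# minus tar's CPU, leaves the other processes:
print(round((0.54 + 4.03) * 8 - 32.62, 2))    # 3.94, i.e. the 0.04

# The four components roughly reconstruct the observed load average:
print(round(0.33 + 0.67 + 0.04 + 0.11, 2))    # 1.15 (uptime showed ~1.19)
```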

• On Linux, load averages are (or try to be) "system load averages": a measure, for the system as a whole, of the number of threads that are working or waiting to work (on CPU, disk, or uninterruptible locks). Put differently, they count the threads that aren't completely idle. Advantage: they capture demand across different kinds of resources.
• On other operating systems, load averages are "CPU load averages", measuring the number of threads running on or waiting for a CPU. Advantage: they're simple to understand and explain (CPU only).

Better Metrics

• per-CPU utilization: using mpstat -P ALL 1;
• per-process CPU utilization: using top, pidstat 1, etc.;
• per-thread run queue (scheduler) latency: in /proc/PID/schedstats, delaystats, perf sched;
• CPU run queue latency: in /proc/schedstat, perf sched, or my runqlat bcc tool;
• CPU run queue length: using vmstat 1 and the 'r' column, or my runqlen bcc tool.
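The run queue latency sources above can also be read programmatically. Here is a minimal Python sketch (my own, not from the post) that derives the average scheduler wait per timeslice from /proc/schedstat cpu lines; the field positions assume schedstat version 15, so check the `version` line on your kernel:

```python
def rq_latency_us(schedstat_text):
    """Average run-queue (scheduler) wait per timeslice, in
    microseconds, per CPU. Assumes schedstat version 15, where the
    last three fields of each 'cpuN' line are: time running (ns),
    time waiting to run (ns), and number of timeslices."""
    out = {}
    for line in schedstat_text.splitlines():
        parts = line.split()
        if parts and parts[0].startswith("cpu"):
            waiting_ns, slices = int(parts[8]), int(parts[9])
            if slices:
                out[parts[0]] = waiting_ns / slices / 1000.0
    return out

sample = ("version 15\n"
          "timestamp 4302923890\n"
          "cpu0 0 0 0 0 0 0 3000000000 1500000000 30000\n")
print(rq_latency_us(sample))  # {'cpu0': 50.0}
```

On a live system you would feed it `open("/proc/schedstat").read()`, provided the schedstats tunable mentioned below is enabled.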

The schedstats facility was made a kernel tunable in Linux 4.6 and changed to be off by default. cpustat's delay statistics expose the same scheduler latency metric, and I've just suggested adding it to htop as well, which would make it much easier for everyone to use than scraping the wait-time metric from the /proc/sched_debug output.

$ awk 'NF > 7 { if ($1 == "task") { if (h == 0) { print; h=1 } } else { print } }' /proc/sched_debug
task   PID         tree-key  switches  prio     wait-time             sum-exec        sum-sleep
systemd     1      5028.684564    306666   120        43.133899     48840.448980   2106893.162610 0 0 /init.scope
ksoftirqd/0     3 99071232057.573051   1109494   120         5.682347     21846.967164   2096704.183312 0 0 /
kworker/0:0H     5 99062732253.878471         9   100         0.014976         0.037737         0.000000 0 0 /
migration/0     9         0.000000   1995690     0         0.000000     25020.580993         0.000000 0 0 /
lru-add-drain    10        28.548203         2   100         0.000000         0.002620         0.000000 0 0 /
watchdog/0    11         0.000000   3368570     0         0.000000     23989.957382         0.000000 0 0 /
cpuhp/0    12      1216.569504         6   120         0.000000         0.010958         0.000000 0 0 /
xenbus    58  72026342.961752       343   120         0.000000         1.471102         0.000000 0 0 /
khungtaskd    59 99071124375.968195    111514   120         0.048912      5708.875023   2054143.190593 0 0 /
[...]
dockerd 16014    247832.821522   2020884   120        95.016057    131987.990617   2298828.078531 0 0 /system.slice/docker.service
dockerd 16015    106611.777737   2961407   120         0.000000    160704.014444         0.000000 0 0 /system.slice/docker.service
dockerd 16024       101.600644        16   120         0.000000         0.915798         0.000000 0 0 /system.slice/
[...]

Conclusion

* This file contains the magic bits required to compute the global loadavg
* figure. Its a silly number but people think its important. We go through
* great pains to make it work on big machines and tickless kernels.

References

[1] Saltzer, J., and J. Gintell. "The Instrumentation of Multics," CACM, August 1970 (explains the exponentials).
[2] Multics system_performance_graph command reference (mentions the 1 minute load average).
[3] TENEX source code (the load average code is in CHED.MAC).
[4] RFC 546 "TENEX Load Averages for July 1973" (explains measuring CPU demand).
[5] Bobrow, D., et al. "TENEX: A Paged Time Sharing System for the PDP-10," Communications of the ACM, March 1972 (explains the load average triplet).
[6] Gunther, N. "UNIX Load Average Part 1: How It Works" PDF (explains the exponential calculations).
[7] Linus's email about Linux 0.99 patchlevel 14.
[8] The load average change email is on oldlinux.org (in alan-old-funet-lists/kernel.1993.gz, not in the linux directories I searched first).
[9] The Linux kernel/sched.c source before and after the load average change: 0.99.13, 0.99.14.
[10] Tarballs for Linux 0.99 releases are on kernel.org.