
This post is a translation of Brendan Gregg's technical-archaeology article, Linux Load Average: Solving the Mystery. I came to it because, when I used Prometheus to alert on system CPU usage, a system_load metric did not behave as I expected: it kept crossing the alert threshold while there was still plenty of CPU headroom. That led me to study the Linux load average metric.

The translation of the original article follows:

Load average is an industry-critical metric. My company uses it, along with other metrics, to auto-scale millions of cloud instances. But on Linux there has always been some mystery around it: the metric tracks not only runnable tasks, but also tasks in the uninterruptible sleep state (usually waiting for I/O). Why? I had never seen an explanation. In this post I'll solve that mystery and summarize load averages as a reference for anyone trying to interpret them.

The Linux load average, also known as the "system load average", is the average number of threads (tasks) demanding work from the system over a period of time: those running plus those waiting. It measures demand, which can be greater than what the system is currently processing. Most tools show three averages, for 1, 5, and 15 minutes.

 $ uptime
 16:48:24 up  4:11,  1 user,  load average: 25.25, 23.40, 23.46

top - 16:48:42 up  4:12,  1 user,  load average: 25.25, 23.14, 23.37

$ cat /proc/loadavg 
25.72 23.19 23.35 42/3411 43603

Some interpretations:

  • If the averages are 0.0, your system is idle.
  • If the 1 minute average is higher than the 5 or 15 minute averages, load is increasing.
  • If the 1 minute average is lower than the 5 or 15 minute averages, load is decreasing.
  • If these numbers are higher than the number of CPUs, then you may be facing a performance issue. (Of course it depends on the specific situation)

As a set of three, the values show whether load is rising or falling, which is useful for monitoring. As a single value, the metric can also be used to define auto-scaling rules for cloud services. But understanding the numbers in any more detail requires the help of other metrics. A single value of 23 to 25 means nothing by itself, but it may mean something if the CPU count is known and the workload is known to be CPU-bound.
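
The interpretation rules above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the field layout is the one visible in the /proc/loadavg output shown earlier (three averages, runnable/total scheduling entities, most recent PID), and comparing the 1 minute value against the 15 minute value is just one simple way to read the trend.

```python
def parse_loadavg(line):
    """Split one /proc/loadavg line into its five fields."""
    one, five, fifteen, entities, last_pid = line.split()
    running, total = entities.split("/")
    return {
        "load": (float(one), float(five), float(fifteen)),
        "running": int(running),    # currently runnable scheduling entities
        "total": int(total),        # all scheduling entities on the system
        "last_pid": int(last_pid),  # most recently created PID
    }


def trend(load):
    """Classify the trend by comparing the 1 minute and 15 minute values."""
    one, _five, fifteen = load
    if one > fifteen:
        return "increasing"
    if one < fifteen:
        return "decreasing"
    return "steady"


# The sample line is the /proc/loadavg output shown earlier in this post.
sample = "25.72 23.19 23.35 42/3411 43603"
info = parse_loadavg(sample)
print(info["load"], trend(info["load"]))
```

On a live Linux system the sample string could be replaced with the contents of /proc/loadavg.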

Instead of trying to debug load averages, I usually switch to other metrics. I'll discuss these in the "Better Metrics" section near the end.

History

The original load averages showed only CPU demand: the number of processes running plus the number waiting to run. This is well described in RFC 546, titled "TENEX Load Averages", from August 1973:

[1] The TENEX load average is a measure of CPU demand.
The load average is an average of the number of runnable processes over a given time period.
For example, an hourly load average of 10 would mean that (for a single CPU system) at any time during that hour one could expect to see 1 process running and 9 others ready to run (ie, not blocked for I/O) waiting for the CPU.

The RFC also links to a PDF scan of a hand-drawn load average graph from July 1973 (shown below), which shows that this metric has been monitored for decades.

The source code of these old operating systems can still be found online today. The following excerpt is macro assembly from TENEX (early 1970s), SCHED.MAC:

 NRJAVS==3               ;NUMBER OF LOAD AVERAGES WE MAINTAIN
GS RJAV,NRJAVS          ;EXPONENTIAL AVERAGES OF NUMBER OF ACTIVE PROCESSES
[...]
;UPDATE RUNNABLE JOB AVERAGES

DORJAV: MOVEI 2,^D5000
        MOVEM 2,RJATIM          ;SET TIME OF NEXT UPDATE
        MOVE 4,RJTSUM           ;CURRENT INTEGRAL OF NBPROC+NGPROC
        SUBM 4,RJAVS1           ;DIFFERENCE FROM LAST UPDATE
        EXCH 4,RJAVS1
        FSC 4,233               ;FLOAT IT
        FDVR 4,[5000.0]         ;AVERAGE OVER LAST 5000 MS
[...]
;TABLE OF EXP(-T/C) FOR T = 5 SEC.

EXPFF:  EXP 0.920043902 ;C = 1 MIN
        EXP 0.983471344 ;C = 5 MIN
        EXP 0.994459811 ;C = 15 MIN

Here is a snippet from today's Linux source code (include/linux/sched/loadavg.h):

 #define EXP_1           1884            /* 1/exp(5sec/1min) as fixed-point */
#define EXP_5           2014            /* 1/exp(5sec/5min) */
#define EXP_15          2037            /* 1/exp(5sec/15min) */

Linux also hard-codes the 1, 5, and 15 minute constants.
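
To see where these constants come from, the short Python sketch below recomputes them from the formula in the header comments. It assumes an 11-bit fixed-point fraction (so 1.0 is represented as 2048), which is my reading of the same kernel header and is not shown in the excerpt above. The names FSHIFT, FIXED_1, and exp_constant are used here for illustration only.

```python
import math

FSHIFT = 11             # assumed number of fractional bits in the fixed-point format
FIXED_1 = 1 << FSHIFT   # 1.0 in fixed-point, i.e. 2048
SAMPLE_SECONDS = 5      # the load is sampled every 5 seconds


def exp_constant(window_minutes):
    """Return 1/exp(5sec/window) scaled to fixed-point and rounded."""
    decay = math.exp(-SAMPLE_SECONDS / (window_minutes * 60))
    return round(FIXED_1 * decay)


constants = {minutes: exp_constant(minutes) for minutes in (1, 5, 15)}
print(constants)  # {1: 1884, 5: 2014, 15: 2037}
```

The unscaled decay factors (about 0.920, 0.983, and 0.994) are the same values that appear in the TENEX EXPFF table.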

Older systems had similar load average metrics as well, including Multics, which had an exponential scheduling queue average.

The Three Numbers

These three numbers are the 1, 5, and 15 minute load averages. Except they are not really averages, and they do not cover exactly 1, 5, and 15 minutes. As the source code above shows, 1, 5, and 15 minutes are constants in an equation that calculates exponentially-damped moving sums of a five second average. (If this term and formula confuse you as much as they confused me, links to related articles and code appear later in this section.) The resulting 1, 5, and 15 minute load averages therefore reflect load from well beyond 1, 5, and 15 minutes.

If you take an idle machine and start a single-threaded CPU-bound program (such as a single-threaded loop), what should the 1min load average be after 60s? If it is just a simple average, then this value should be 1.0. But the actual experimental results are shown in the following figure:

Load Average

The so-called "1 minute load average" had only reached about 0.62 by the 1 minute mark. For more on the equation and similar experiments, see Dr. Neil Gunther's article How It Works. The Linux source file loadavg.c also has many comments explaining the calculation.
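
The experiment can be approximated with a small simulation. The sketch below is an idealized floating-point version of an exponentially-damped update applied every 5 seconds. It is not the kernel's fixed-point code, and damped_average is a name invented for this example. In this idealized form the value after 60 seconds is 1 - 1/e, about 0.63, in the same range as the 0.62 measured above.

```python
import math


def damped_average(samples, window_seconds, interval_seconds=5):
    """Apply load = load * e + n * (1 - e) once per sample, starting from 0."""
    decay = math.exp(-interval_seconds / window_seconds)
    load = 0.0
    history = []
    for active in samples:
        load = load * decay + active * (1.0 - decay)
        history.append(load)
    return history


# One CPU-bound thread on an otherwise idle machine: 12 samples of 1 in 60 seconds.
history = damped_average([1] * 12, window_seconds=60)
print(round(history[-1], 2))
```

Extending the sample list shows the value creeping toward 1.0 over several minutes rather than reaching it at the one minute mark.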

Linux Uninterruptible Tasks

When load averages first appeared in Linux, they reflected CPU demand, as on other operating systems. Later, Linux changed them to include not only runnable tasks but also tasks in the uninterruptible state (TASK_UNINTERRUPTIBLE or nr_uninterruptible). This state is used by code paths that want to avoid interruption by signals, which includes tasks blocked on disk I/O and some locks. You may have seen this state before: it shows up as "D" in the output of ps and top. The ps(1) man page calls it "uninterruptible sleep (usually IO)".

Adding the uninterruptible state means that Linux load averages can increase because of disk (or NFS) I/O load, not just CPU demand. For anyone familiar with other operating systems and their CPU load averages, including this state is confusing at first.

Why Does Linux Do This?

There are countless articles about load averages pointing out that Linux added nr_uninterruptible, but I haven't seen any that explain why, not even a wild guess as to why. My personal guess is that this is for the metric to represent a broader notion of demand for resources, not just demand for CPU resources.

Searching for an Ancient Linux Patch

Finding out why something changed in Linux is usually easy: you read the git commit history of the file in question and the change description. I checked the history of loadavg.c, but the change that added the uninterruptible state predates that file, which was created with code from an earlier file. I checked that earlier file too, but the trail went cold there as well: the code had hopped between several different files. Hoping for a shortcut, I dumped "git log -p" for the entire Linux GitHub repository, about 4 Gbytes of text, and began reading it backwards to see when the code first appeared. This was also a dead end: the oldest change in the entire Linux repository dates back to 2005, when Linus imported Linux 2.6.12-rc2, and this change predates that.

There are also historical Linux repositories on the web (here and here), but the change description is missing from those too. To at least find out when the change was made, I searched the tarballs on kernel.org and found that it was already in 0.99.15 and not yet in 0.99.13, but 0.99.14 was missing. I found that version elsewhere and confirmed that the change was made in Linux 0.99 patchlevel 14, in November 1993. I hoped that Linus's release description for 0.99.14 would explain the change, but that too was a dead end:

"Changes to the last official release (p13) are too numerous to mention (or even to remember)..." – Linus

He mentions a lot of major changes, but doesn't explain the modification of the load average.

Based on the date, I looked in the kernel mailing list archives for the actual patch, but the oldest email available is from June 1995, when the sysadmin writes:

"While working on a system to make these mailing archives scale more effecitvely I accidently destroyed the current set of archives (ah whoops)."

My search was starting to feel cursed. Fortunately, I found some older linux-devel mailing list archives, recovered from server backups and often stored as tarballs of digests. I searched over 6,000 digests containing over 98,000 emails, 30,000 of which were from 1993. But the email I was looking for was missing from all of them. It really looked as if the original patch description might be lost forever, and the "why" would remain a mystery.

The Origin of "Uninterruptible"

Thankfully, I finally found the change in a compressed mailbox file from 1993 on oldlinux.org. Here it is:

 From: Matthias Urlichs <urlichs@smurf.sub.org>
Subject: Load average broken ?
Date: Fri, 29 Oct 1993 11:37:23 +0200


The kernel only counts "runnable" processes when computing the load average.
I don't like that; the problem is that processes which are swapping or
waiting on "fast", i.e. noninterruptible, I/O, also consume resources.

It seems somewhat nonintuitive that the load average goes down when you
replace your fast swap disk with a slow swap disk...

Anyway, the following patch seems to make the load average much more
consistent WRT the subjective speed of the system. And, most important, the
load is still zero when nobody is doing anything. ;-)

--- kernel/sched.c.orig Fri Oct 29 10:31:11 1993
+++ kernel/sched.c  Fri Oct 29 10:32:51 1993
@@ -414,7 +414,9 @@
    unsigned long nr = 0;

    for(p = &LAST_TASK; p > &FIRST_TASK; --p)
-       if (*p && (*p)->state == TASK_RUNNING)
+       if (*p && ((*p)->state == TASK_RUNNING) ||
+                  (*p)->state == TASK_UNINTERRUPTIBLE) ||
+                  (*p)->state == TASK_SWAPPING))
            nr += FIXED_1;
    return nr;
 }
--
Matthias Urlichs        \ XLink-POP N|rnberg   | EMail: urlichs@smurf.sub.org
Schleiermacherstra_e 12  \  Unix+Linux+Mac     | Phone: ...please use email.
90491 N|rnberg (Germany)  \   Consulting+Networking+Programming+etc'ing      42

It's a wonderful feeling to read the thinking behind a change 24 years ago.

This confirms that the load average change was deliberate, made to reflect demand for other system resources and not just CPUs. Linux changed from "CPU load averages" to what one might call "system load averages".

His example of using a slower swap disk makes sense: by degrading the system's performance, the demand on the system (measured as running plus queued) should increase. Yet the load averages decreased, because they tracked only the CPU running state and not the swapping state. Matthias thought this was nonintuitive, which it is, so he fixed it.

"Uninterruptible" Today

But don't Linux load averages sometimes go too high, more than disk I/O alone can explain? Yes, although my guess is that this is due to newer code paths using TASK_UNINTERRUPTIBLE that did not exist in 1993. In Linux 0.99.14, 13 code paths set TASK_UNINTERRUPTIBLE or TASK_SWAPPING (the swapping state has since been removed from Linux). Today, in Linux 4.12, nearly 400 code paths set TASK_UNINTERRUPTIBLE, including some lock primitives. It is quite possible that some of these should not be included in the load averages. The next time I see load averages that seem too high, I'll check whether that is the case and whether it can be fixed.

I emailed Matthias (for the first time) to ask what he thought about his load average change almost 24 years later. He replied within an hour (as I mentioned on Twitter) and wrote:

"The point of "load average" is to arrive at a number relating how busy the system is from a human point of view. TASK_UNINTERRUPTIBLE means (meant?) that the process is waiting for something like a disk read which contributes to system load. A heavily disk-bound system might be extremely sluggish but only have a TASK_RUNNING average of 0.1, which doesn't help anybody."

(Getting a reply at all, let alone so quickly, really made my day. Thanks!)

So Matthias still thinks this indicator is reasonable, at least given the original meaning of TASK_UNINTERRUPTIBLE.

But TASK_UNINTERRUPTIBLE covers more things today. Should we change the load averages to cover just CPU and disk demand? Scheduler maintainer Peter Zijlstra has already sent me a hack to explore: using task_struct->in_iowait in the load averages instead of TASK_UNINTERRUPTIBLE, so that they more closely match disk I/O. This raises another question: what do we actually want? Do we want to measure demand on the system in terms of threads, or just demand for physical resources? If the former, then tasks waiting on uninterruptible locks should be included, because they are not idle. From this perspective, the load averages may already work the way we want.

To better understand uninterruptible code paths, I'd like a way to measure them in action. Then we can examine different examples, quantify the time spent in them, and see whether the load averages make sense.

Measuring Uninterruptible Tasks

Here is an Off-CPU flame graph from a production server, covering 60 seconds and showing kernel stacks only, filtered to include only tasks in the TASK_UNINTERRUPTIBLE state. It provides many examples of uninterruptible code paths:

<embed src="http://www.brendangregg.com/blog/images/2017/out.offcputime_unint02.svg" />

If you're new to off-CPU flame graphs: each column is the full stack trace of a blocked task, and the stacked towers together form the flame shape. You can click on frames to zoom in and read the full stacks. The x-axis width is proportional to the time spent blocked off-CPU, and the left-to-right ordering has no real meaning. I use blue for off-CPU stacks (and warm colors for on-CPU stacks), with random saturation variance to differentiate frames.

I generated this using my offcputime tool from the bcc project, with the following commands:

# ./bcc/tools/offcputime.py -K --state 2 -f 60 > out.stacks
# awk '{ print $1, $2 / 1000 }' out.stacks | ./FlameGraph/flamegraph.pl --color=io --countname=ms > out.offcpu.svg

The awk command converts the output from microseconds to milliseconds. The --state 2 option matches TASK_UNINTERRUPTIBLE (see sched.h) and is an option I added for this post. Josef Bacik of Facebook first did this with his kernelscope tool, which also uses bcc and flame graphs. My examples show only kernel stacks, but offcputime.py also supports user stacks.

The graph shows that only 926 ms out of 60 seconds was spent in uninterruptible sleep, which adds just 0.015 to our load averages. Most of that time is in cgroup-related code paths, and very little of it is disk I/O.
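
The conversion from blocked time to load average contribution is just the total off-CPU time divided by the length of the trace. A minimal Python sketch follows; load_contribution is a name chosen for this example, not part of any tool mentioned here.

```python
def load_contribution(off_cpu_ms, trace_seconds):
    """Average number of blocked tasks over the trace: blocked time / wall time."""
    return off_cpu_ms / (trace_seconds * 1000.0)


# 926 ms of uninterruptible sleep in a 60 second trace, as in the flame graph above.
contribution = load_contribution(926, 60)
print(round(contribution, 3))  # 0.015
```

A value of 1.0 would mean that, on average, one task was blocked in uninterruptible sleep for the whole trace.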

Here's a more interesting graph, covering only 10s of time:

<embed src="http://www.brendangregg.com/blog/images/2017/out.offcputime_unint01.svg" />

The wider tower on the right shows systemd-journal in proc_pid_cmdline_read() (reading /proc/PID/cmdline) getting blocked, contributing 0.07 to the load average. The wider tower on the left is a page fault, which also ends up in rwsem_down_read_failed(), adding 0.23 to the load average. I have highlighted these functions in magenta using the flame graph search feature. Here is an excerpt from rwsem_down_read_failed():

 /* wait to be given the lock */
    while (true) {
        set_task_state(tsk, TASK_UNINTERRUPTIBLE);
        if (!waiter.task)
            break;
        schedule();
    }

This is lock acquisition code that uses TASK_UNINTERRUPTIBLE. Linux has uninterruptible and interruptible versions of its mutex acquire functions (e.g., mutex_lock() and mutex_lock_interruptible(), and down() and down_interruptible() for semaphores). The interruptible versions allow the task to be interrupted by a signal and then wake up to process it. Time spent in uninterruptible lock sleeps usually does not add much to the load average, but in this case it adds 0.30. If this were much higher, it would be worth analyzing whether lock contention could be reduced to improve performance and lower the load average (for example, I would start digging into systemd-journal and proc_pid_cmdline_read()).

So should these code paths be included in the load average statistics? I think it should. These threads are in the process of execution and then blocked by the lock. They are not idle, they make demands on the system, albeit with software resources rather than hardware resources.

Decomposing Linux Load Averages

Can the Linux load average be fully decomposed into components? Here is an example: on an idle 8 CPU system, I ran tar to archive some uncached files. It took several minutes, mostly blocked on disk reads. Here is the data collected from three terminal windows:

 terma$ pidstat -p `pgrep -x tar` 60
Linux 4.9.0-rc5-virtual (bgregg-xenial-bpf-i-0b7296777a2585be1)     08/01/2017  _x86_64_    (8 CPU)

10:15:51 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
10:16:51 PM     0     18468    2.85   29.77    0.00   32.62     3  tar

termb$ iostat -x 60
[...]
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.54    0.00    4.03    8.24    0.09   87.10

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdap1            0.00     0.05   30.83    0.18   638.33     0.93    41.22     0.06    1.84    1.83    3.64   0.39   1.21
xvdb            958.18  1333.83 2045.30  499.38 60965.27 63721.67    98.00     3.97    1.56    0.31    6.67   0.24  60.47
xvdc            957.63  1333.78 2054.55  499.38 61018.87 63722.13    97.69     4.21    1.65    0.33    7.08   0.24  61.65
md0               0.00     0.00 4383.73 1991.63 121984.13 127443.80    78.25     0.00    0.00    0.00    0.00   0.00   0.00

termc$ uptime
 22:15:50 up 154 days, 23:20,  5 users,  load average: 1.25, 1.19, 1.05
[...]
termc$ uptime
 22:17:14 up 154 days, 23:21,  5 users,  load average: 1.19, 1.17, 1.06

I also collected Off-CPU flame graphs for uninterruptible tasks:

<embed src="http://www.brendangregg.com/blog/images/2017/out.offcputime_unint08.svg" />

The load average for the final minute was 1.19. Let me try to decompose it:

  • 0.33 from tar's CPU time (pidstat)
  • 0.67 from tar's uninterruptible disk reads, inferred (the off-CPU flame graph shows 0.69; I suspect it began collecting a little later and so covers a slightly different time range)
  • 0.04 from other CPU consumers (iostat user + system, minus tar's CPU time from pidstat)
  • 0.11 from kernel workers in uninterruptible disk I/O, flushing writes to disk (off-CPU flame graph, the two towers on the left)

These add up to 1.15, which is 0.04 short of 1.19. Some of the gap may be rounding and measurement interval offset errors, but much of it is probably because the load average is an exponentially-damped moving sum, whereas the other averages used here (pidstat, iostat) are ordinary averages. The 1 minute load average before 1.19 was 1.25, so some of that earlier load is still dragging the value up. How much? According to my earlier graph, at the one minute mark 62% of the metric comes from that minute and the rest is older. So 0.62 x 1.15 + 0.38 x 1.25 = 1.188, which is very close to the reported 1.19.
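
The arithmetic in this decomposition can be checked with a few lines of Python. The component values and the 62% weighting are taken from the text above. The variable names are illustrative, and the single weighted blend is only the rough approximation described here, not the kernel's actual update rule.

```python
# Load average components for the final minute, as listed above.
components = {
    "tar CPU time (pidstat)": 0.33,
    "tar uninterruptible disk reads": 0.67,
    "other CPU consumers": 0.04,
    "kernel workers flushing disk writes": 0.11,
}
measured = sum(components.values())

# Blend this minute's measured load with the previously reported 1 minute value.
previous_load = 1.25
weight_current = 0.62  # share of the metric that comes from the latest minute
estimate = weight_current * measured + (1 - weight_current) * previous_load

print(round(measured, 2), round(estimate, 3))
```

The estimate comes out at about 1.188, against a reported value of 1.19.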

In this example the system has one thread (tar) doing work, plus some time in other threads (including kernel worker threads), so it makes sense for Linux to report a load average of 1.19. If it were measuring a "CPU load average", the system would have reported 0.37 (as inferred from mpstat), which is correct for CPU resources only, but hides the fact that there is more than one thread's worth of work to do.

I hope this example shows that the numbers really do mean something deliberate (CPU + uninterruptible), and that you can decompose them and work out the components.

(In the comments on the original post, the author explains how these values were calculated:)

tar: the off-CPU flame graph has 41,164 ms, and that's a sum over a 60 second trace. Normalizing that to 1 second = 41.164 / 60 = 0.69. The pidstat output has tar taking 32.62% average CPU (not a sum), and I know all its off-CPU time is in uninterruptible (by generating off-CPU graphs for the other states), so I can infer that 67.38% of its time is in uninterruptible. 0.67. I used that number instead, as the pidstat interval closely matched the other tools I was running.
by mpstat I meant iostat sorry (I updated the text), but it's the same CPU summary. It's 0.54 + 4.03% for user + sys. That's 4.57% average across 8 CPUs, 4.57 x 8 = 36.56% in terms of one CPU. pidstat says that tar consumed 32.62%, so the remander is 36.56% - 32.62% = 3.94% of one CPU, which was used by things that weren't tar (other processes). That's the 0.04 added to load average.
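
The derivation in these comments can be reproduced as a short Python sketch. The input numbers come from the pidstat and iostat output shown earlier, and the variable names are invented for this example. As the comment explains, the sketch assumes that all of tar's non-CPU time is spent in uninterruptible sleep.

```python
NUM_CPUS = 8

# From iostat: system-wide CPU percentages, averaged across all 8 CPUs.
iostat_user_pct = 0.54
iostat_system_pct = 4.03

# From pidstat: tar's CPU usage as a percentage of one CPU.
tar_cpu_pct = 32.62

# Convert the system-wide average into "percent of one CPU".
total_cpu_pct = (iostat_user_pct + iostat_system_pct) * NUM_CPUS

tar_cpu_load = tar_cpu_pct / 100.0                       # CPU time used by tar
other_cpu_load = (total_cpu_pct - tar_cpu_pct) / 100.0   # CPU time used by everything else
tar_unint_load = 1.0 - tar_cpu_load                      # assumes the rest of tar's time is uninterruptible

print(round(tar_cpu_load, 2), round(other_cpu_load, 2), round(tar_unint_load, 2))
```

Rounded to two decimals these give 0.33, 0.04, and 0.67, the first three components of the decomposition.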

Understanding Linux Load Averages

I grew up with operating systems where load averages meant CPU load averages, so the Linux version has always bothered me. Perhaps the real problem all along is that the term "load average" is about as ambiguous as "I/O". Which type of I/O? Disk I/O? File system I/O? Network I/O? Likewise, which load average? CPU load average? System load average? Clarifying it this way lets me make sense of it:

  • On Linux, load averages are (or try to be) "system load averages" for the system as a whole, measuring the number of threads that are working or waiting to work (on CPU, disk, or uninterruptible locks). In other words, they measure the number of threads that are not completely idle. Advantage: covers demand for different types of resources.
  • On other operating systems, load averages are "CPU load averages", measuring the number of threads running on a CPU plus the number waiting for one. Advantage: easier to understand and reason about (only CPUs are involved).

Note that there is another possible type: the "physical resource load average", which would include load for physical resources only (CPU + disk).

Maybe one day we'll add different load averages to Linux and let the user choose which one to use: a separate "CPU load average", "disk load average" and "network load average", etc. Or simply list all the different metrics.

What Is a Good or Bad Load Average?

Some people have found values that seem to work for their systems and workloads: they know that when load goes over X, application latency is high and users start complaining. But there are no real rules for finding that value.

With CPU load averages, one can divide the value by the CPU count and say that if the ratio is over 1.0, the system is saturated and may have performance problems. That is somewhat ambiguous too, because a long-term average (at least one minute) can hide variation. One system with a ratio of 1.5 may be performing fine, while another at 1.5 that was bursty within the minute may be performing badly.

Load averages measured in a modern tool

I once administered a two-CPU email server that ran with a CPU load average of between 11 and 16 during the day (a ratio of between 5.5 and 8). Latency was acceptable and no one complained. That is an extreme example: most systems would already be suffering at a ratio of just 2.
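
As a small illustration of the ratio described above, the following Python sketch divides a load average by a CPU count. The function name load_per_cpu is made up for this example, and the ratio is only meaningful as described here, for CPU load averages.

```python
def load_per_cpu(load_average, cpu_count):
    """Return the load average divided by the number of CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_average / cpu_count


# The two-CPU mail server from the text, at both ends of its daytime range.
for load in (11, 16):
    print(load, load_per_cpu(load, 2))
```

On a live system, os.getloadavg() and os.cpu_count() from the Python standard library could supply the inputs.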

For Linux system load averages, things are more complicated and more ambiguous, because they cover several different resource types, so you cannot simply divide by the CPU count. Relative values are more useful: if you know the system runs fine at a load of 20, and it is now at 40, then it is time to dig in with other metrics to see what is going on.

Better Metrics

When Linux load averages increase, you know there is higher demand for resources (CPUs, disks, and some locks), but you cannot be sure which. You can use other metrics to tell them apart. For example, for CPUs:

  • per-CPU utilization: e.g., using mpstat -P ALL 1;
  • per-process CPU utilization: e.g., top, pidstat 1, etc.;
  • per-thread run queue (scheduler) latency: e.g., in /proc/PID/schedstats, delaystats, perf sched;
  • CPU run queue latency: e.g., in /proc/schedstat, perf sched, my runqlat bcc tool;
  • CPU run queue length: e.g., using vmstat 1 and the 'r' column, or my runqlen bcc tool.

The first two metrics evaluate utilization, and the last three are saturation metrics. Utilization metrics are used to describe workloads, while saturation metrics are used to identify performance issues. The best indicator of CPU saturation is run queue (or scheduler) latency: the amount of time a task or thread is in a runnable state but needs to wait to run. These metrics can help you measure the severity of performance problems, such as the percentage of time a task is waiting. Measuring the length of the run queue can also find problems, but it is difficult to measure the severity.

The schedstats facility was made a kernel tunable in Linux 4.6 and changed to be off by default. Delay accounting exposes the same scheduler latency metric; it is shown by cpustat, and I have just suggested adding it to htop as well, since that would make it much easier for people to use. That would be easier than, say, scraping the wait-time (scheduler latency) metric from the /proc/sched_debug output:

 $ awk 'NF > 7 { if ($1 == "task") { if (h == 0) { print; h=1 } } else { print } }' /proc/sched_debug
            task   PID         tree-key  switches  prio     wait-time             sum-exec        sum-sleep
         systemd     1      5028.684564    306666   120        43.133899     48840.448980   2106893.162610 0 0 /init.scope
     ksoftirqd/0     3 99071232057.573051   1109494   120         5.682347     21846.967164   2096704.183312 0 0 /
    kworker/0:0H     5 99062732253.878471         9   100         0.014976         0.037737         0.000000 0 0 /
     migration/0     9         0.000000   1995690     0         0.000000     25020.580993         0.000000 0 0 /
   lru-add-drain    10        28.548203         2   100         0.000000         0.002620         0.000000 0 0 /
      watchdog/0    11         0.000000   3368570     0         0.000000     23989.957382         0.000000 0 0 /
         cpuhp/0    12      1216.569504         6   120         0.000000         0.010958         0.000000 0 0 /
          xenbus    58  72026342.961752       343   120         0.000000         1.471102         0.000000 0 0 /
      khungtaskd    59 99071124375.968195    111514   120         0.048912      5708.875023   2054143.190593 0 0 /
[...]
         dockerd 16014    247832.821522   2020884   120        95.016057    131987.990617   2298828.078531 0 0 /system.slice/docker.service
         dockerd 16015    106611.777737   2961407   120         0.000000    160704.014444         0.000000 0 0 /system.slice/docker.service
         dockerd 16024       101.600644        16   120         0.000000         0.915798         0.000000 0 0 /system.slice/
[...]
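
For the per-thread scheduler latency mentioned in the list above, here is a hedged Python sketch of reading the per-task scheduler statistics. It assumes the three-field layout described in the kernel's scheduler statistics documentation (nanoseconds on CPU, nanoseconds waiting on a run queue, number of timeslices), and it assumes the file is exposed as /proc/PID/schedstat on the system at hand. The function names and the sample values are illustrative only.

```python
def parse_schedstat(text):
    """Parse the three fields of a per-task schedstat line."""
    on_cpu_ns, wait_ns, timeslices = (int(field) for field in text.split())
    return {"on_cpu_ns": on_cpu_ns, "wait_ns": wait_ns, "timeslices": timeslices}


def wait_fraction(stats):
    """Share of runnable time spent waiting for a CPU rather than running."""
    runnable_ns = stats["on_cpu_ns"] + stats["wait_ns"]
    return stats["wait_ns"] / runnable_ns if runnable_ns else 0.0


# Hypothetical sample values, not taken from a real system.
sample = "48840448980 43133899 306666"
stats = parse_schedstat(sample)
print(stats["wait_ns"] / 1e6, "ms waiting;", round(100 * wait_fraction(stats), 3), "% of runnable time")
```

Sampling the file twice and taking the difference in wait_ns gives the scheduler latency accumulated over that interval.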

In addition to CPU metrics, you can also find metrics that measure disk device usage and saturation. I mostly use the metrics from the USE method and will refer to their Linux Checklist .

Having more specific metrics does not mean that load averages are useless. They are used successfully in scale-up policies for cloud computing microservices, along with other metrics. This helps microservices respond to different kinds of load increase, whether CPU or disk I/O. With these policies it is safer to err on the side of scaling up (which costs money) than of not scaling up (which costs customers), so including more signals is desirable. If we scale up too much, we can debug why the next day.

Another reason I keep using load averages is their historical information (the three numbers). If I am asked to find out why an instance in the cloud is performing badly, and I log in and see that the 1 minute average is much lower than the 15 minute average, that is a big clue that I may be too late to see the performance issue live. I can reach that conclusion after a few seconds of looking at the load averages, without having to study other metrics.

Conclusion

In 1993, a Linux engineer found a nonintuitive case with load averages and, with a three-line patch, changed them forever from "CPU load averages" to what one might call "system load averages". His change included tasks in the uninterruptible state, so that load averages reflected demand for disk resources and not just CPUs. These system load averages count the number of threads working and waiting to work, and are summarized as a triplet of exponentially-damped moving sums that use 1, 5, and 15 minutes as constants in an equation. The triplet lets you see whether load is increasing or decreasing, and its greatest value may be for relative comparisons with itself.

The use of the uninterruptible state has since grown in the Linux kernel, and nowadays it includes uninterruptible lock primitives. If the load average is a measure of demand in terms of running and waiting threads (and not strictly threads wanting hardware resources), then it is still working the way we want.

In this post, I dug up the Linux load average patch from 1993, which was surprisingly difficult to find, along with its author's original explanation. I also measured stack traces and time spent in the uninterruptible state using bcc/eBPF on a modern Linux system, and visualized them as off-CPU flame graphs. These graphs provide many examples of uninterruptible sleeps and can be generated whenever needed to explain unusually high load averages. I also proposed other metrics you can use to understand system load in more detail.

I'll end this post with a comment from the Linux source code by scheduler maintainer Peter Zijlstra at the top of kernel/sched/loadavg.c :

 * This file contains the magic bits required to compute the global loadavg
  * figure. Its a silly number but people think its important. We go through
  * great pains to make it work on big machines and tickless kernels.

References

[1] Saltzer, J., and J. Gintell. “The Instrumentation of Multics,” CACM, August 1970 (explains exponentials)
[2] Multics system_performance_graph command reference (mentions the 1 minute load average)
[3] TENEX source code. (The load average code is in SCHED.MAC)
[4] RFC 546 "TENEX Load Averages for July 1973". (explains the measure of CPU demand)
[5] Bobrow, D., et al. “TENEX: A Paged Time Sharing System for the PDP-10,” Communications of the ACM, March 1972. (explains triple load averaging)
[6] Gunther, N. "UNIX Load Average Part 1: How It Works" PDF . (explains the exponential formula)
[7] Linus's email about Linux 0.99 patchlevel 14 .
[8] The load average change email is on oldlinux.org . (in the alan-old-funet-lists/kernel.1993.gz archive, not in the linux directory I searched at first)
[9] The Linux kernel/sched.c source before and after the load average change: 0.99.13 , 0.99.14 .
[10] Tarballs for Linux 0.99 releases are on kernel.org .
[11] The current Linux load average code: loadavg.c , loadavg.h
[12] The bcc analysis tools includes my offcputime , used for tracing TASK_UNINTERRUPTIBLE.
[13] Flame Graphs were used for visualizing uninterruptible paths.

