

  • Introduction
  • Test environment
  • Configuration capacity
  • Usage
  • Potential problems and recommended solutions
  • Outlook
  • Disclaimer
  • Experience

    Introduction

    I believe many people know that in a cloud environment, every service must be resource-constrained, and memory, as an important resource, is no exception. Setting a limit is easy to say; guaranteeing the service's performance targets (SLA) while staying within that limit is both a technical and an artistic piece of work.

    Setting an upper limit on application memory is never easy, because a reasonable limit depends on:

    • The application's own memory usage and reclamation logic, which is usually remarkably complex
    • The complex virtual memory management, physical memory allocation, and reclamation mechanisms of a modern operating system

    And for Java, a third item:

    • The memory management mechanisms of the JVM's many kinds of components

    Each of the three aspects above can be further subdivided, and every subdivision has its own memory mechanism. Miss just one of them, and the application's total memory usage may exceed the limit.

    What is worrying is that when the application's total memory usage exceeds the limit, the operating system ruthlessly kills the application process (OOM, Out Of Memory). Many people are oblivious to this and only notice that the container restarted. And that restart may be the beginning of a chain reaction:

    • If the container OOM was just a one-off accident, that is tolerable. If it was caused by a bug, the same OOM may erupt in every container of the service, one after another, until the whole service is paralyzed.
    • If the service's container group was already short of resources, one container being OOM killed means load balancing shifts its traffic to the other containers, which then hit the same OOM. Eventually the service is paralyzed.

    The JVM is a gentle manager: when it notices that memory is getting tight, it takes the trouble to stop the application threads and run GC. This kind of memory-pressure signal is called "backpressure" in the design world.
    The operating system, by contrast, is a ruthless commander: whenever a process exceeds its limit, it simply shoots it dead with an OOM Kill.

    If you have studied cgroup memory in depth, you know it actually also has a backpressure notification mechanism, but current containers and JVMs ignore it.

    In conclusion, a container process being OOM Killed is something to avoid, but avoiding it requires in-depth investigation.

    On the Internet we can find many real-world cases and lessons learned.

    Java memory management is complicated. The more we understand it, the less likely the application is to be OOM Killed. Below I analyze a case I encountered during testing.

    The analysis report is divided into two parts:

    1. Research the measured indicators, memory consumption, and memory limit configuration of the application
    2. Potential problems and suggestions for improvement

    Test environment

    Host: bare metal (BareMetal)
    CPU: 40 cores, 80 hyper-threads in total
    Linux:
      Kernel: 5.3.18
      glibc: libc-2.26.so
    Java: 1.8.0_261-b12
    Web/Servlet container: Jetty

    Configuration capacity

    POD capacity configuration

        resources:
          limits:
            cpu: "8"
            memory: 4Gi
            # 4Gi = 4 * 1024Mb = 4*1024*1024k = 4194304k = 4294967296 bytes = 4096Mb
          requests:
            cpu: "2"
            memory: 4Gi

    JVM capacity configuration

    Before I start talking about JVM capacity configuration, I assume that you already have a basic impression of JVM memory usage:


    Picture source: https://docs.oracle.com/javase/8/docs/technotes/tools/unix/java.html. On -XX:MaxDirectMemorySize, that document says:

    Sets the maximum total size (in bytes) of the New I/O (the java.nio package) direct-buffer allocations. Append the letter k or K to indicate kilobytes, m or M to indicate megabytes, g or G to indicate gigabytes . By default, the size is set to 0, meaning that the JVM chooses the size for NIO direct-buffer allocations automatically.

    In other words, it says something without really saying anything: it does not tell you what size the JVM actually chooses by default.

    In my test environment, I attached Arthas to the JVM and inspected the relevant internal static variables:

    [arthas@112]$ dashboard
    ognl -c 30367620 '@io.netty.util.internal.PlatformDependent@maxDirectMemory()'
    @Long[3,221,225,472]
    
    ognl '@java.nio.Bits@maxMemory'
    @Long[3,221,225,472]

    3221225472/1024/1024 = 3072.0 Mb
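
    If you prefer standard JDK tooling over Arthas, the effective setting can also be cross-checked with jinfo (a quick sketch; a printed value of 0 means "let the JVM decide", as the documentation quoted above describes):

    jinfo -flag MaxDirectMemorySize $PID
    # -XX:MaxDirectMemorySize=0   <- 0 means the default, i.e. the JVM picks the size itself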

    If you want to go deeper, please refer to:

    maxThreadCount: maximum number of threads

    We used Arthas above; let's continue with it:

    [arthas@112]$ dashboard
       Threads Total: 276

    The application uses Jetty, and its thread pool is configured in jetty-threadpool.xml:

    <Configure>
      <New id="threadPool" class="org.eclipse.jetty.util.thread.QueuedThreadPool">
        <Set name="maxThreads" type="int"><Property name="jetty.threadPool.maxThreads" deprecated="threads.max" default="200"/></Set>
    ...
      </New>
    </Configure>

    The total of 276 threads is higher than the Jetty pool alone because, in addition to Jetty, the JVM and the application create various other threads (GC, JIT compiler, and so on).
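
    As a rough cross-check from the operating-system side, the thread count and its stack cost can be estimated as below (a sketch; it assumes /proc is visible inside the container and the Linux x86_64 default ThreadStackSize of 1MB):

    # Count the JVM's threads as the kernel sees them (should be close to Arthas' 276)
    ls /proc/$(pgrep java)/task | wc -l

    # Worst-case stack reservation at the default 1MB per thread:
    # 276 threads * 1MB = 276MB, which is where the table below gets its number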

    Usage

    Viewing usage from the perspective of Java

    Capacity item                      Configuration                 Effective limit (Mb)           Actual use (Mb)
    Young Heap + Old Heap              -Xmx3G -XX:+AlwaysPreTouch    3072                           3072
    MaxMetaspaceSize                   [default]                     Unlimited                      128
    CompressedClassSpaceSize           [default]                     1024                           15
    MaxDirectMemorySize                [default]                     3072                           270
    ReservedCodeCacheSize              [default]                     240                            82
    ThreadStackSize * maxThreadCount   [default] * 276 threads       276                            276
    Sum                                                              7684 + unlimited Metaspace     3843

    How the actual usage of each item was collected:

    • ReservedCodeCache

    After the application has been warmed up and load tested, attach Arthas:

    [arthas@112]$ dashboard
    code_cache : 82Mb
    • DirectMemory
    [arthas@112]$ 
    ognl '@java.nio.Bits@reservedMemory.get()'
    @Long[1,524,039]
    ognl -c 30367620 '@io.netty.util.internal.PlatformDependent@usedDirectMemory()'
    @Long[268,435,456]
    • Metaspace
    • CompressedClassSpaceSize
    $ jcmd $PID GC.heap_info
    
     garbage-first heap   total 3145728K, used 1079227K [0x0000000700000000, 0x0000000700106000, 0x00000007c0000000)
      region size 1024K, 698 young (714752K), 16 survivors (16384K)
     Metaspace       used 127,323K, capacity 132,290K, committed 132,864K, reserved 1,167,360K
      class space    used 14,890K, capacity 15,785K, committed 15,872K, reserved 1,048,576K
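
    Besides the per-area commands above, the JVM's Native Memory Tracking can give a consolidated report covering heap, metaspace, thread stacks, code cache and GC overhead. It must be enabled at JVM startup and adds a few percent of overhead, so the following is only a sketch of the approach, not output from the test run above:

    # Start the JVM with NMT enabled (summary level)
    java -XX:NativeMemoryTracking=summary ...

    # Later, dump a summary of JVM-internal memory usage
    jcmd $PID VM.native_memory summary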

    Viewing usage from the perspective of the native application

    From the native application's perspective, usage includes the following:

    • *lib.so dynamic library occupancy: 16Mb
    • *.jar file mapping occupancy: 8Mb
    • GC algorithm consumption: Not investigated
    • glibc malloc free space not returned to the OS in time: 158Mb

    Total native application consumption: 16+8+158 = 182Mb

    Summary:
    Usage from the Java perspective: 3843Mb
    Total application usage = 3843 + 158 ~= 4001Mb

    Note that here we did not count the *lib.so dynamic library mappings or the *.jar file mappings. Why? That will be explained below.
    4001Mb is a bit scary, not far from the container's 4096Mb limit. But the number is somewhat inflated. Why? That will also be explained below.

    Below I analyze where each sub-item's number comes from.

    *lib.so dynamic library occupation

    Run the command:

    pmap -X $PID

    Part of the output:

             Address Perm   Offset Device      Inode     Size     Rss     Pss Referenced Anonymous  Mapping
    ...
        7f281b1b1000 r-xp 00000000  08:03 1243611251       48      48       3         48         0  /lib64/libcrypt-2.26.so
        7f281b1bd000 ---p 0000c000  08:03 1243611251     2044       0       0          0         0  /lib64/libcrypt-2.26.so
        7f281b3bc000 r--p 0000b000  08:03 1243611251        4       4       4          4         4  /lib64/libcrypt-2.26.so
        7f281b3bd000 rw-p 0000c000  08:03 1243611251        4       4       4          4         4  /lib64/libcrypt-2.26.so
    ...
        7f28775a5000 r-xp 00000000  08:03 1243611255       92      92       5         92         0  /lib64/libgcc_s.so.1
        7f28775bc000 ---p 00017000  08:03 1243611255     2048       0       0          0         0  /lib64/libgcc_s.so.1
        7f28777bc000 r--p 00017000  08:03 1243611255        4       4       4          4         4  /lib64/libgcc_s.so.1
        7f28777bd000 rw-p 00018000  08:03 1243611255        4       4       4          4         4  /lib64/libgcc_s.so.1
        7f28777be000 r-xp 00000000  08:03 1800445487      224      64       4         64         0  /opt/jdk1.8.0_261/jre/lib/amd64/libsunec.so
        7f28777f6000 ---p 00038000  08:03 1800445487     2044       0       0          0         0  /opt/jdk1.8.0_261/jre/lib/amd64/libsunec.so
        7f28779f5000 r--p 00037000  08:03 1800445487       20      20      20         20        20  /opt/jdk1.8.0_261/jre/lib/amd64/libsunec.so
        7f28779fa000 rw-p 0003c000  08:03 1800445487        8       8       8          8         8  /opt/jdk1.8.0_261/jre/lib/amd64/libsunec.so
    ...
        7f28f43a7000 r-xp 00000000  08:03 1243611284       76      76       3         76         0  /lib64/libresolv-2.26.so
        7f28f43ba000 ---p 00013000  08:03 1243611284     2048       0       0          0         0  /lib64/libresolv-2.26.so
        7f28f45ba000 r--p 00013000  08:03 1243611284        4       4       4          4         4  /lib64/libresolv-2.26.so
        7f28f45bb000 rw-p 00014000  08:03 1243611284        4       4       4          4         4  /lib64/libresolv-2.26.so
        7f28f45bc000 rw-p 00000000  00:00          0        8       0       0          0         0  
        7f28f45be000 r-xp 00000000  08:03 1243611272       20      20       1         20         0  /lib64/libnss_dns-2.26.so
        7f28f45c3000 ---p 00005000  08:03 1243611272     2044       0       0          0         0  /lib64/libnss_dns-2.26.so
        7f28f47c2000 r--p 00004000  08:03 1243611272        4       4       4          4         4  /lib64/libnss_dns-2.26.so
        7f28f47c3000 rw-p 00005000  08:03 1243611272        4       4       4          4         4  /lib64/libnss_dns-2.26.so
        7f28f47c4000 r-xp 00000000  08:03 1243611274       48      48       2         48         0  /lib64/libnss_files-2.26.so
        7f28f47d0000 ---p 0000c000  08:03 1243611274     2044       0       0          0         0  /lib64/libnss_files-2.26.so
        7f28f49cf000 r--p 0000b000  08:03 1243611274        4       4       4          4         4  /lib64/libnss_files-2.26.so
        7f28f49d0000 rw-p 0000c000  08:03 1243611274        4       4       4          4         4  /lib64/libnss_files-2.26.so
        7f28f49d1000 rw-p 00000000  00:00          0     2072    2048    2048       2048      2048  
        7f28f4bd7000 r-xp 00000000  08:03 1800445476       88      88       6         88         0  /opt/jdk1.8.0_261/jre/lib/amd64/libnet.so
        7f28f4bed000 ---p 00016000  08:03 1800445476     2044       0       0          0         0  /opt/jdk1.8.0_261/jre/lib/amd64/libnet.so
        7f28f4dec000 r--p 00015000  08:03 1800445476        4       4       4          4         4  /opt/jdk1.8.0_261/jre/lib/amd64/libnet.so
        7f28f4ded000 rw-p 00016000  08:03 1800445476        4       4       4          4         4  /opt/jdk1.8.0_261/jre/lib/amd64/libnet.so
        7f28f4dee000 r-xp 00000000  08:03 1800445477       68      64       4         64         0  /opt/jdk1.8.0_261/jre/lib/amd64/libnio.so
        7f28f4dff000 ---p 00011000  08:03 1800445477     2044       0       0          0         0  /opt/jdk1.8.0_261/jre/lib/amd64/libnio.so
        7f28f4ffe000 r--p 00010000  08:03 1800445477        4       4       4          4         4  /opt/jdk1.8.0_261/jre/lib/amd64/libnio.so
        7f28f4fff000 rw-p 00011000  08:03 1800445477        4       4       4          4         4  /opt/jdk1.8.0_261/jre/lib/amd64/libnio.so
    💡 If you don't know much about the output of Linux memory map and pmap, I suggest reading: https://www.labcorner.de/cheat-sheet-understanding-the-pmap1-output/ .
    If you are as lazy as I am, here is the short version:

    As we all know, modern operating systems share physical memory between processes in order to save physical memory; if you know COW (Copy on Write), even better. Multiple containers run on one physical machine, and container images are layered. Images of different services built by the same organization are often based on the same base layer, and that base layer includes the Java-related libraries. A so-called layer is nothing but a directory on the host. In other words, different containers may map and read the very same files.

    Back to our topic: memory limits. The container limits memory through cgroups, and the cgroup accounts for every memory allocation of the processes inside the container. Memory backed by shared file mappings obviously needs special treatment, because it crosses processes and containers. The information I could find says that only the first cgroup to read/write a mapped page gets charged for it ( https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt [2.3 Shared Page Accounting]). So this part of the accounting is hard to predict, and generally we simply reserve for the worst case.
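
    To reproduce the 16Mb figure above, one rough approach is to total the Rss of all *.so mappings in the pmap output (a sketch; pmap -X reports sizes in kB, and its exact column set depends on the kernel, so the Rss column is looked up from the header instead of being hard-coded). Changing the pattern to \.jar gives the *.jar figure of the next section.

    pmap -X $PID | awk '
      /Address/ { for (i = 1; i <= NF; i++) if ($i == "Rss") col = i; next }  # find the Rss column in the header
      /\.so/    { sum += $col }                                               # add up the Rss of every *.so mapping
      END       { printf "%.1f Mb\n", sum / 1024 }'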

    *.jar mapping occupied

    pmap -X $PID

    The accounting principle is similar to that of the *.so mappings above. Note, however, that after Java 9 the *.jar files are no longer memory-mapped this way; even on Java 8, only part of the jar's directory structure is mapped.

    In my test, only 8Mb of memory was used.

    glibc malloc consumption

    Java uses glibc malloc in two situations:

    1. NIO Direct Byte Buffer / Netty Direct Byte Buffer
    2. The JVM's own internal native code

    The industry has long complained about the memory waste of glibc malloc, mainly memory that is not returned to the operating system in time. This waste is roughly proportional to the number of CPUs on the host. Refer to:

    Unfortunately, my test environment is bare metal and the container sees all of the host's CPUs; the host has 80 CPUs. So the question is: how do we measure how much is wasted?
    glibc provides a malloc_stats(3) function that prints heap information (both in-use and reserved) to standard output. The next question: how do we call this function? Modify the code and write JNI? Sure. But as a geek, of course you use gdb.

    cat <<"EOF" > ~/.gdbinit
    # The JVM uses some of these signals internally (e.g. SIGSEGV), so tell gdb
    # not to stop on them and to pass them straight through to the process.
    handle SIGSEGV nostop noprint pass
    handle SIGBUS nostop noprint pass
    handle SIGFPE nostop noprint pass
    handle SIGPIPE nostop noprint pass
    handle SIGILL nostop noprint pass
    EOF
    
    export PID=`pgrep java`
    gdb --batch --pid $PID --ex 'call malloc_stats()'

    Output:

    Arena 0:
    system bytes     =     135168
    in use bytes     =      89712
    Arena 1:
    system bytes     =     135168
    in use bytes     =       2224
    Arena 2:
    system bytes     =     319488
    in use bytes     =      24960
    Arena 3:
    system bytes     =     249856
    in use bytes     =       2992
    ...
    Arena 270:
    system bytes     =    1462272
    in use bytes     =     583280
    Arena 271:
    system bytes     =   67661824
    in use bytes     =   61308192
    
    
    Total (incl. mmap):
    system bytes     =  638345216
    in use bytes     =  472750720
    max mmap regions =         45
    max mmap bytes   =  343977984

    So the result is: 638345216-472750720 = 165594496 ~= 158Mb
    That is, about 158Mb was wasted. The load in my test scenario was light; under heavy load and high concurrency, the waste on an 80-CPU host would be far greater.

    One thing to point out: the operating system allocates physical memory lazily, i.e. only when a page is actually read or written. So the 158Mb above may show up as a smaller amount in the operating system's RSS.
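
    As an aside, the same gdb trick can ask glibc to hand its free heap pages back to the kernel immediately, which is a quick way to confirm how much of the RSS was really just unreturned free space (a sketch; calling into a live production JVM like this is not risk-free):

    # malloc_trim(0) releases free memory from the glibc heaps back to the operating system
    gdb --batch --pid $PID --ex 'call (int) malloc_trim(0)'

    # Compare the process RSS before and after
    grep VmRSS /proc/$PID/status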

    GC memory consumption

    not investigated

    tmpfs memory consumption

    not investigated

    Operating System RSS

    RSS(pmap -X $PID) = 3920MB. That is, the operating system thinks that 3920MB of physical memory is used.

    CGroup restrictions

    cgroup limit 4Gi = 4*1024Mb = 4096Mb
    pagecache free space: 4096-3920 = 176Mb

    Let's take a look at the memory.stat file of cgroup

    $ cat memory.stat    # in the container's cgroup memory directory; values summarized in Mb
        rss 3920Mb
        cache 272Mb
        active_anon 3740Mb
        inactive_file 203Mb
        active_file 72Mb  # bytes of file-backed memory on active LRU list

    If you look carefully, you will notice:

    3920 + 272 = 4192 > 4096Mb

    Wait. Why was the container not OOM killed?

    To make a long story short: the page cache is flexible memory. When the application needs anonymous memory, the kernel can automatically reclaim page cache to make room.

    💡 If you are interested, refer to:
    https://engineering.linkedin.com/blog/2016/08/don_t-let-linux-control-groups-uncontrolled
    https://github.com/kubernetes/kubernetes/issues/43916
    https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/memory.html
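
    For reference, the numbers above can be read directly from the cgroup filesystem inside the container (a sketch assuming cgroup v1 and the usual mount point, which may differ depending on your runtime). The last line is the "working set" calculation that kubelet reports and uses for eviction decisions: usage minus the easily reclaimable inactive file cache.

    # cgroup v1 memory controller, as seen from inside the container
    cd /sys/fs/cgroup/memory

    cat memory.limit_in_bytes   # 4294967296 (the 4Gi limit)
    cat memory.usage_in_bytes   # rss + page cache currently charged to this cgroup

    # kubelet-style working set: usage minus total_inactive_file
    echo $(( $(cat memory.usage_in_bytes) - $(grep total_inactive_file memory.stat | awk '{print $2}') ))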

    Potential problems and recommended solutions

    Native Buffer limitations

    Default MaxDirectMemorySize ≈ -Xmx - survivor size ≈ 3G.

    Under high concurrency, if memory is not reclaimed in time, the application can build up a large number of Direct Byte Buffers. So it is recommended to set the limit explicitly:

    java ... -XX:MaxDirectMemorySize=350m

    💡 If you are interested, refer to:

    • The Cassandra client and Redisson are both based on Netty, so both use native buffers. Note that Netty builds on Unsafe.class and has its own internal memory pool.

    glibc malloc arena waste

    In my test environment, the host has 80 CPUs. To reduce lock contention when multiple threads allocate memory, glibc reserves up to 8 memory arenas per CPU. When an arena returns memory to the operating system is unpredictable, and each arena's heap suffers from memory fragmentation, and so on.
    What I observed in the test environment: a total of 271 arenas were created, occupying 608Mb of RSS, while the program actually used only about 450Mb; roughly 158Mb was wasted. The amount of waste is random and related to memory fragmentation. A container cannot be given all the CPUs of the host, so it is reasonable to set an explicit upper limit, and this upper limit should be tied to the container's memory limit and CPU limit.

    The environment variable MALLOC_ARENA_MAX configures this upper limit.

    • Relationship to memory usage:
      In our measurements, a total of about 700Mb of glibc heap memory was used, and each arena is 64Mb. So:

      700 / 64 ≈ 11 arenas
    • Relationship to the container CPU limit:

      8 CPUs * (8 arenas per CPU) = 64 arenas

    We conservatively take the larger of the two:

    export MALLOC_ARENA_MAX=64
    💡 If you are interested, refer to:
    https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html
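
    In a Kubernetes setup, one way to wire this in is as an environment variable in the Pod spec, alongside the resources block shown earlier (a sketch; it only takes effect if the container's entrypoint does not override it):

        env:
          - name: MALLOC_ARENA_MAX
            value: "64"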

    Jetty thread pool

    After investigation, each API call takes about 100 ms, and the existing configuration allows a maximum of 200 threads. So:

    200 thread / 0.1s = 2000 TPS

    In our tests, a single container never reached 1000 TPS, so 100 threads are enough. The benefit of reducing the thread count is that it simultaneously reduces excessive thread context switching, cgroup CPU throttling, thread stack memory, and native buffer memory. Let requests queue up in Jetty's request queue rather than in the kernel's runnable queue.

    <!-- jetty-threadpool.xml -->
    <Configure>
      <New id="threadPool" class="org.eclipse.jetty.util.thread.QueuedThreadPool">
    ...
        <Set name="maxThreads" type="int"><Property name="jetty.threadPool.maxThreads" deprecated="threads.max" default="100"/></Set>
    ...
      </New>
    </Configure>

    The Java code cache grows slowly

    In our tests, the Java code cache keeps growing slowly even after the system has warmed up. The default maximum code cache on Java 8 is 240Mb. If the code cache consumes a lot of memory, it can contribute to the container being OOM killed, so an explicit limit is still worthwhile. From observation of the test environment, 100Mb of space is sufficient.

    java ... -XX:ReservedCodeCacheSize=100M -XX:+UseCodeCacheFlushing
    💡 If you are interested, refer to:
    https://docs.oracle.com/javase/8/embedded/develop-apps-platforms/codecache.htm

    The memory limit of the container

    From the investigation above, we can see that the 3G Java heap + JVM overhead + DirectByteBuffer comes very close to the container's 4Gi memory limit. Under high concurrency, the risk of OOM Killed remains high, and the problem may never appear in the test environment; it has a random element.

    The cgroup records how many times the container has come close to OOM (memory.failcnt), and this number kept growing during the test. When memory is tight, the kernel gives priority to the application's memory needs by dropping page cache. And what does dropping the file cache mean? Slower reads, and more frequent, slower writes to disk. If the application has any IO pressure, such as reading *.jar files or writing logs, slow-IO problems will follow.

    watch cat ./memory.failcnt 
    19369
    💡 If you are interested, refer to:
    https://engineering.linkedin.com/blog/2016/08/don_t-let-linux-control-groups-uncontrolled
    https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
    https://srvaroa.github.io/jvm/kubernetes/memory/docker/oomkiller/2019/05/29/k8s-and-java.html

    For my application, my suggestion is to relax the memory limit:

        resources:
          limits:
            memory: 4.5Gi
          requests:
            memory: 4.5Gi

    Outlook

    Speaking incompletely, from the perspective of a service operator, sizing a service depends on these factors:

    • The SLA of the container

      • The target throughput of the container

    If we feed the factors above into a tool as input, its output should be:

    • How many containers should be deployed
    • How the resources of each container should be configured

      • CPU

        • Container CPU limit
        • Application thread pool limit
      • Memory

        • Container memory limit
        • Application memory limits:

          • Java: on-heap / off-heap
    💡 There is an open source tool for reference:
    https://github.com/cloudfoundry/java-buildpack-memory-calculator

    Disclaimer

    Every coin has two sides, and so it is with application tuning: every tuning method has its environmental prerequisites, otherwise it would not be called tuning, it would have gone straight into a pull request against the open source project's default configuration. As the masters often say: don't just copy someone else's tuning parameters. Consider your own actual situation, and test sufficiently before adopting them.

    Experience

    Starting around 2016, major companies began following the trend and moving suitable applications into containers. Because many old projects and components were never designed to run inside a restricted container, they are, to put it bluntly, not container-aware. The situation has improved over the past few years, but plenty of pitfalls remain. As a qualified architect, besides the PPTs and the big-picture vision, we also need the stomach for details like these.

    The above was a memory analysis of a Java container. If you are interested in the CPU and thread configuration of Java containers, please see my other article on the historical pitfalls of Java containerization (the resource limits chapter).

    Conclude this article with a comic:


    MarkZhu

    Blog: [link]