

  • Introduction
  • Test environment
  • Configuration capacity
  • Usage
  • Potential problems and recommended solutions
  • Outlook
  • Disclaimer
  • Experience

    Introduction

    I believe many people know that in a cloud environment, every service must be resource-constrained, and memory, as an important resource, is no exception. Setting a limit is easy to say; guaranteeing the service's performance targets (SLA) while staying within that limit is both a technical and an artistic piece of work.

    Setting an upper limit on application memory is never easy, because a reasonable limit depends on:

    • The application's own memory usage and reclamation logic, which is usually remarkably complex
    • The complex virtual memory management, physical memory allocation, and reclamation mechanisms of a modern operating system

    And for Java, a third item:

    • The memory management mechanisms of the JVM's many kinds of components

    Each of the three aspects above can be further subdivided, and every subdivision has its own memory mechanism. Miss just one of them, and the application's total memory usage may exceed the limit.

    What is worrying is that when the application's total memory usage exceeds the limit, the operating system ruthlessly kills the application process (OOM, Out Of Memory). Many people are oblivious to this and only notice that the container restarted. And that restart may be the beginning of a chain reaction:

    • If the container OOM was just a one-off accident, that is tolerable. If it was caused by a bug, the same OOM may erupt in every container of the service, one after another, until the whole service is paralyzed.
    • If the service's container group was already short of resources, one container being OOM killed means load balancing shifts its traffic to the other containers, which then hit the same OOM. Eventually the service is paralyzed.

    The JVM is a gentle manager: when it notices that memory is getting tight, it takes the trouble to stop the application threads and run GC. This kind of memory-pressure signal is called "backpressure" in the design world.
    The operating system, by contrast, is a ruthless commander: whenever a process exceeds its limit, it simply shoots it dead with an OOM Kill.

    If you have studied cgroup memory in depth, you know it actually also has a backpressure notification mechanism, but current containers and JVMs ignore it.

    In conclusion, a container process being OOM Killed is something to avoid, but avoiding it requires in-depth investigation.

    On the Internet we can find many real-world cases and lessons learned.

    Java memory management is complicated. The more we understand it, the less likely the application is to be OOM Killed. Below I analyze a case I encountered during testing.

    The analysis report is divided into two parts:

    1. Research the measured indicators, memory consumption, and memory limit configuration of the application
    2. Potential problems and suggestions for improvement

    Test environment

    Host: bare metal (BareMetal)
    CPU: 40 cores, 80 hyper-threads in total
    Linux:
      Kernel: 5.3.18
      glibc: libc-2.26.so
    Java: 1.8.0_261-b12
    Web/Servlet container: Jetty

    Configuration capacity

    POD capacity configuration

        resources:
          limits:
            cpu: "8"
            memory: 4Gi
            # 4Gi = 4 * 1024Mb = 4*1024*1024k = 4194304k = 4294967296 bytes = 4096Mb
          requests:
            cpu: "2"
            memory: 4Gi

    JVM capacity configuration

    Before I start talking about JVM capacity configuration, I assume that you already have a basic impression of JVM memory usage:


    Picture source: https://docs.oracle.com/javase/8/docs/technotes/tools/unix/java.html. On -XX:MaxDirectMemorySize, that document says:

    Sets the maximum total size (in bytes) of the New I/O (the java.nio package) direct-buffer allocations. Append the letter k or K to indicate kilobytes, m or M to indicate megabytes, g or G to indicate gigabytes . By default, the size is set to 0, meaning that the JVM chooses the size for NIO direct-buffer allocations automatically.

    In other words, it says something without really saying anything: it does not tell you what size the JVM actually chooses by default.

    In my test environment, I attached Arthas to the JVM and inspected the relevant internal static variables:

    [arthas@112]$ dashboard
    ognl -c 30367620 '@io.netty.util.internal.PlatformDependent@maxDirectMemory()'
    @Long[3,221,225,472]
    
    ognl '@java.nio.Bits@maxMemory'
    @Long[3,221,225,472]

    3221225472/1024/1024 = 3072.0 Mb
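
    If you prefer standard JDK tooling over Arthas, the effective setting can also be cross-checked with jinfo (a quick sketch; a printed value of 0 means "let the JVM decide", as the documentation quoted above describes):

    jinfo -flag MaxDirectMemorySize $PID
    # -XX:MaxDirectMemorySize=0   <- 0 means the default, i.e. the JVM picks the size itself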

    If you want to go deeper, please refer to:

    maxThreadCount: maximum number of threads

    We used Arthas above; let's continue with it:

    [arthas@112]$ dashboard
       Threads Total: 276

    The application uses Jetty, and its thread pool is configured in jetty-threadpool.xml:

    <Configure>
      <New id="threadPool" class="org.eclipse.jetty.util.thread.QueuedThreadPool">
        <Set name="maxThreads" type="int"><Property name="jetty.threadPool.maxThreads" deprecated="threads.max" default="200"/></Set>
    ...
      </New>
    </Configure>

    The total of 276 threads is higher than the Jetty pool alone because, in addition to Jetty, the JVM and the application create various other threads (GC, JIT compiler, and so on).
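
    As a rough cross-check from the operating-system side, the thread count and its stack cost can be estimated as below (a sketch; it assumes /proc is visible inside the container and the Linux x86_64 default ThreadStackSize of 1MB):

    # Count the JVM's threads as the kernel sees them (should be close to Arthas' 276)
    ls /proc/$(pgrep java)/task | wc -l

    # Worst-case stack reservation at the default 1MB per thread:
    # 276 threads * 1MB = 276MB, which is where the table below gets its number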

    Usage

    Viewing usage from the perspective of Java

    Capacity item                      Configuration                 Effective limit (Mb)           Actual use (Mb)
    Young Heap + Old Heap              -Xmx3G -XX:+AlwaysPreTouch    3072                           3072
    MaxMetaspaceSize                   [default]                     Unlimited                      128
    CompressedClassSpaceSize           [default]                     1024                           15
    MaxDirectMemorySize                [default]                     3072                           270
    ReservedCodeCacheSize              [default]                     240                            82
    ThreadStackSize * maxThreadCount   [default] * 276 threads       276                            276
    Sum                                                              7684 + unlimited Metaspace     3843

    How the actual usage of each item was collected:

    • ReservedCodeCache

    After the application has been warmed up and load tested, attach Arthas:

    [arthas@112]$ dashboard
    code_cache : 82Mb
    • DirectMemory
    [arthas@112]$ 
    ognl '@java.nio.Bits@reservedMemory.get()'
    @Long[1,524,039]
    ognl -c 30367620 '@io.netty.util.internal.PlatformDependent@usedDirectMemory()'
    @Long[268,435,456]
    • Metaspace
    • CompressedClassSpaceSize
    $ jcmd $PID GC.heap_info
    
     garbage-first heap   total 3145728K, used 1079227K [0x0000000700000000, 0x0000000700106000, 0x00000007c0000000)
      region size 1024K, 698 young (714752K), 16 survivors (16384K)
     Metaspace       used 127,323K, capacity 132,290K, committed 132,864K, reserved 1,167,360K
      class space    used 14,890K, capacity 15,785K, committed 15,872K, reserved 1,048,576K
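
    Besides the per-area commands above, the JVM's Native Memory Tracking can give a consolidated report covering heap, metaspace, thread stacks, code cache and GC overhead. It must be enabled at JVM startup and adds a few percent of overhead, so the following is only a sketch of the approach, not output from the test run above:

    # Start the JVM with NMT enabled (summary level)
    java -XX:NativeMemoryTracking=summary ...

    # Later, dump a summary of JVM-internal memory usage
    jcmd $PID VM.native_memory summary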

    Viewing usage from the perspective of the native application

    From the native application's perspective, usage includes the following:

    • *lib.so dynamic library occupancy: 16Mb
    • *.jar file mapping occupancy: 8Mb
    • GC algorithm consumption: Not investigated
    • glibc malloc free space not returned to the OS in time: 158Mb

    Total native application consumption: 16+8+158 = 182Mb

    Summary:
    Usage from the Java perspective: 3843Mb
    Total application usage = 3843 + 158 ~= 4001Mb

    Note that here we did not count the *lib.so dynamic library mappings or the *.jar file mappings. Why? That will be explained below.
    4001Mb is a bit scary, not far from the container's 4096Mb limit. But the number is somewhat inflated. Why? That will also be explained below.

    Below I analyze where each sub-item's number comes from.

    *lib.so dynamic library occupation

    Run the command:

    pmap -X $PID

    Part of the output:

             Address Perm   Offset Device      Inode     Size     Rss     Pss Referenced Anonymous  Mapping
    ...
        7f281b1b1000 r-xp 00000000  08:03 1243611251       48      48       3         48         0  /lib64/libcrypt-2.26.so
        7f281b1bd000 ---p 0000c000  08:03 1243611251     2044       0       0          0         0  /lib64/libcrypt-2.26.so
        7f281b3bc000 r--p 0000b000  08:03 1243611251        4       4       4          4         4  /lib64/libcrypt-2.26.so
        7f281b3bd000 rw-p 0000c000  08:03 1243611251        4       4       4          4         4  /lib64/libcrypt-2.26.so
    ...
        7f28775a5000 r-xp 00000000  08:03 1243611255       92      92       5         92         0  /lib64/libgcc_s.so.1
        7f28775bc000 ---p 00017000  08:03 1243611255     2048       0       0          0         0  /lib64/libgcc_s.so.1
        7f28777bc000 r--p 00017000  08:03 1243611255        4       4       4          4         4  /lib64/libgcc_s.so.1
        7f28777bd000 rw-p 00018000  08:03 1243611255        4       4       4          4         4  /lib64/libgcc_s.so.1
        7f28777be000 r-xp 00000000  08:03 1800445487      224      64       4         64         0  /opt/jdk1.8.0_261/jre/lib/amd64/libsunec.so
        7f28777f6000 ---p 00038000  08:03 1800445487     2044       0       0          0         0  /opt/jdk1.8.0_261/jre/lib/amd64/libsunec.so
        7f28779f5000 r--p 00037000  08:03 1800445487       20      20      20         20        20  /opt/jdk1.8.0_261/jre/lib/amd64/libsunec.so
        7f28779fa000 rw-p 0003c000  08:03 1800445487        8       8       8          8         8  /opt/jdk1.8.0_261/jre/lib/amd64/libsunec.so
    ...
        7f28f43a7000 r-xp 00000000  08:03 1243611284       76      76       3         76         0  /lib64/libresolv-2.26.so
        7f28f43ba000 ---p 00013000  08:03 1243611284     2048       0       0          0         0  /lib64/libresolv-2.26.so
        7f28f45ba000 r--p 00013000  08:03 1243611284        4       4       4          4         4  /lib64/libresolv-2.26.so
        7f28f45bb000 rw-p 00014000  08:03 1243611284        4       4       4          4         4  /lib64/libresolv-2.26.so
        7f28f45bc000 rw-p 00000000  00:00          0        8       0       0          0         0  
        7f28f45be000 r-xp 00000000  08:03 1243611272       20      20       1         20         0  /lib64/libnss_dns-2.26.so
        7f28f45c3000 ---p 00005000  08:03 1243611272     2044       0       0          0         0  /lib64/libnss_dns-2.26.so
        7f28f47c2000 r--p 00004000  08:03 1243611272        4       4       4          4         4  /lib64/libnss_dns-2.26.so
        7f28f47c3000 rw-p 00005000  08:03 1243611272        4       4       4          4         4  /lib64/libnss_dns-2.26.so
        7f28f47c4000 r-xp 00000000  08:03 1243611274       48      48       2         48         0  /lib64/libnss_files-2.26.so
        7f28f47d0000 ---p 0000c000  08:03 1243611274     2044       0       0          0         0  /lib64/libnss_files-2.26.so
        7f28f49cf000 r--p 0000b000  08:03 1243611274        4       4       4          4         4  /lib64/libnss_files-2.26.so
        7f28f49d0000 rw-p 0000c000  08:03 1243611274        4       4       4          4         4  /lib64/libnss_files-2.26.so
        7f28f49d1000 rw-p 00000000  00:00          0     2072    2048    2048       2048      2048  
        7f28f4bd7000 r-xp 00000000  08:03 1800445476       88      88       6         88         0  /opt/jdk1.8.0_261/jre/lib/amd64/libnet.so
        7f28f4bed000 ---p 00016000  08:03 1800445476     2044       0       0          0         0  /opt/jdk1.8.0_261/jre/lib/amd64/libnet.so
        7f28f4dec000 r--p 00015000  08:03 1800445476        4       4       4          4         4  /opt/jdk1.8.0_261/jre/lib/amd64/libnet.so
        7f28f4ded000 rw-p 00016000  08:03 1800445476        4       4       4          4         4  /opt/jdk1.8.0_261/jre/lib/amd64/libnet.so
        7f28f4dee000 r-xp 00000000  08:03 1800445477       68      64       4         64         0  /opt/jdk1.8.0_261/jre/lib/amd64/libnio.so
        7f28f4dff000 ---p 00011000  08:03 1800445477     2044       0       0          0         0  /opt/jdk1.8.0_261/jre/lib/amd64/libnio.so
        7f28f4ffe000 r--p 00010000  08:03 1800445477        4       4       4          4         4  /opt/jdk1.8.0_261/jre/lib/amd64/libnio.so
        7f28f4fff000 rw-p 00011000  08:03 1800445477        4       4       4          4         4  /opt/jdk1.8.0_261/jre/lib/amd64/libnio.so
    💡 If you don't know much about the output of Linux memory map and pmap, I suggest reading: https://www.labcorner.de/cheat-sheet-understanding-the-pmap1-output/ .
    If you are as lazy as I am, here is the short version:

    As we all know, modern operating systems share physical memory between processes in order to save physical memory; if you know COW (Copy on Write), even better. Multiple containers run on one physical machine, and container images are layered. Images of different services built by the same organization are often based on the same base layer, and that base layer includes the Java-related libraries. A so-called layer is nothing but a directory on the host. In other words, different containers may map and read the very same files.

    Back to our topic: memory limits. The container limits memory through cgroups, and the cgroup accounts for every memory allocation of the processes inside the container. Memory backed by shared file mappings obviously needs special treatment, because it crosses processes and containers. The information I could find says that only the first cgroup to read/write a mapped page gets charged for it ( https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt [2.3 Shared Page Accounting]). So this part of the accounting is hard to predict, and generally we simply reserve for the worst case.
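
    To reproduce the 16Mb figure above, one rough approach is to total the Rss of all *.so mappings in the pmap output (a sketch; pmap -X reports sizes in kB, and its exact column set depends on the kernel, so the Rss column is looked up from the header instead of being hard-coded). Changing the pattern to \.jar gives the *.jar figure of the next section.

    pmap -X $PID | awk '
      /Address/ { for (i = 1; i <= NF; i++) if ($i == "Rss") col = i; next }  # find the Rss column in the header
      /\.so/    { sum += $col }                                               # add up the Rss of every *.so mapping
      END       { printf "%.1f Mb\n", sum / 1024 }'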

    *.jar mapping occupied

    pmap -X $PID

    The accounting principle is similar to that of the *.so mappings above. Note, however, that after Java 9 the *.jar files are no longer memory-mapped this way; even on Java 8, only part of the jar's directory structure is mapped.

    In my test, only 8Mb of memory was used.

    glibc malloc consumption

    Java uses glibc malloc in two situations:

    1. NIO Direct Byte Buffer / Netty Direct Byte Buffer
    2. The JVM's own internal native code

    The industry has long complained about the memory waste of glibc malloc, mainly memory that is not returned to the operating system in time. This waste is roughly proportional to the number of CPUs on the host. Refer to:

    Unfortunately, my test environment is bare metal and the container sees all of the host's CPUs; the host has 80 CPUs. So the question is: how do we measure how much is wasted?
    glibc provides a malloc_stats(3) function that prints heap information (both in-use and reserved) to standard output. The next question: how do we call this function? Modify the code and write JNI? Sure. But as a geek, of course you use gdb.

    cat <<"EOF" > ~/.gdbinit
    # The JVM uses some of these signals internally (e.g. SIGSEGV), so tell gdb
    # not to stop on them and to pass them straight through to the process.
    handle SIGSEGV nostop noprint pass
    handle SIGBUS nostop noprint pass
    handle SIGFPE nostop noprint pass
    handle SIGPIPE nostop noprint pass
    handle SIGILL nostop noprint pass
    EOF
    
    export PID=`pgrep java`
    gdb --batch --pid $PID --ex 'call malloc_stats()'

    Output:

    Arena 0:
    system bytes     =     135168
    in use bytes     =      89712
    Arena 1:
    system bytes     =     135168
    in use bytes     =       2224
    Arena 2:
    system bytes     =     319488
    in use bytes     =      24960
    Arena 3:
    system bytes     =     249856
    in use bytes     =       2992
    ...
    Arena 270:
    system bytes     =    1462272
    in use bytes     =     583280
    Arena 271:
    system bytes     =   67661824
    in use bytes     =   61308192
    
    
    Total (incl. mmap):
    system bytes     =  638345216
    in use bytes     =  472750720
    max mmap regions =         45
    max mmap bytes   =  343977984

    So the result is: 638345216-472750720 = 165594496 ~= 158Mb
    That is, about 158Mb was wasted. The load in my test scenario was light; under heavy load and high concurrency, the waste on an 80-CPU host would be far greater.

    One thing to point out: the operating system allocates physical memory lazily, i.e. only when a page is actually read or written. So the 158Mb above may show up as a smaller amount in the operating system's RSS.
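
    As an aside, the same gdb trick can ask glibc to hand its free heap pages back to the kernel immediately, which is a quick way to confirm how much of the RSS was really just unreturned free space (a sketch; calling into a live production JVM like this is not risk-free):

    # malloc_trim(0) releases free memory from the glibc heaps back to the operating system
    gdb --batch --pid $PID --ex 'call (int) malloc_trim(0)'

    # Compare the process RSS before and after
    grep VmRSS /proc/$PID/status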

    GC memory consumption

    not investigated

    tmpfs memory consumption

    not investigated

    Operating System RSS

    RSS(pmap -X $PID) = 3920MB. That is, the operating system thinks that 3920MB of physical memory is used.

    CGroup restrictions

    cgroup limit 4Gi = 4*1024Mb = 4096Mb
    pagecache free space: 4096-3920 = 176Mb

    Let's take a look at the memory.stat file of cgroup

    $ cat memory.stat    # in the container's cgroup memory directory; values summarized in Mb
        rss 3920Mb
        cache 272Mb
        active_anon 3740Mb
        inactive_file 203Mb
        active_file 72Mb  # bytes of file-backed memory on active LRU list

    If you look carefully, you will notice:

    3920 + 272 = 4192 > 4096Mb

    Wait. Why was the container not OOM killed?

    To make a long story short: the page cache is flexible memory. When the application needs anonymous memory, the kernel can automatically reclaim page cache to make room.

    💡 If you are interested, refer to:
    https://engineering.linkedin.com/blog/2016/08/don_t-let-linux-control-groups-uncontrolled
    https://github.com/kubernetes/kubernetes/issues/43916
    https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/memory.html
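
    For reference, the numbers above can be read directly from the cgroup filesystem inside the container (a sketch assuming cgroup v1 and the usual mount point, which may differ depending on your runtime). The last line is the "working set" calculation that kubelet reports and uses for eviction decisions: usage minus the easily reclaimable inactive file cache.

    # cgroup v1 memory controller, as seen from inside the container
    cd /sys/fs/cgroup/memory

    cat memory.limit_in_bytes   # 4294967296 (the 4Gi limit)
    cat memory.usage_in_bytes   # rss + page cache currently charged to this cgroup

    # kubelet-style working set: usage minus total_inactive_file
    echo $(( $(cat memory.usage_in_bytes) - $(grep total_inactive_file memory.stat | awk '{print $2}') ))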

    Potential problems and recommended solutions

    Native Buffer limitations

    Default MaxDirectMemorySize ≈ -Xmx - survivor size ≈ 3G.

    Under high concurrency, if memory is not reclaimed in time, the application can build up a large number of Direct Byte Buffers. So it is recommended to set the limit explicitly:

    java ... -XX:MaxDirectMemorySize=350m

    💡 If you are interested, refer to:

    • The Cassandra client and Redisson are both based on Netty, so both use native buffers. Note that Netty builds on Unsafe.class and has its own internal memory pool.

    glibc malloc arena waste

    In my test environment, the host has 80 CPUs. To reduce lock contention when multiple threads allocate memory, glibc reserves up to 8 memory arenas per CPU. When an arena returns memory to the operating system is unpredictable, and each arena's heap suffers from memory fragmentation, and so on.
    What I observed in the test environment: a total of 271 arenas were created, occupying 608Mb of RSS, while the program actually used only about 450Mb; roughly 158Mb was wasted. The amount of waste is random and related to memory fragmentation. A container cannot be given all the CPUs of the host, so it is reasonable to set an explicit upper limit, and this upper limit should be tied to the container's memory limit and CPU limit.

    The environment variable MALLOC_ARENA_MAX configures this upper limit.

    • Relationship to memory usage:
      In our measurements, a total of about 700Mb of glibc heap memory was used, and each arena is 64Mb. So:

      700 / 64 ≈ 11 arenas
    • Relationship to the container CPU limit:

      8 CPUs * (8 arenas per CPU) = 64 arenas

    We conservatively take the larger of the two:

    export MALLOC_ARENA_MAX=64
    💡 If you are interested, refer to:
    https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html
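
    In a Kubernetes setup, one way to wire this in is as an environment variable in the Pod spec, alongside the resources block shown earlier (a sketch; it only takes effect if the container's entrypoint does not override it):

        env:
          - name: MALLOC_ARENA_MAX
            value: "64"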

    Jetty thread pool

    After investigation, each API call takes about 100 ms, and the existing configuration allows a maximum of 200 threads. So:

    200 thread / 0.1s = 2000 TPS

    In our tests, a single container never reached 1000 TPS, so 100 threads are enough. The benefit of reducing the thread count is that it simultaneously reduces excessive thread context switching, cgroup CPU throttling, thread stack memory, and native buffer memory. Let requests queue up in Jetty's request queue rather than in the kernel's runnable queue.

    <!-- jetty-threadpool.xml -->
    <Configure>
      <New id="threadPool" class="org.eclipse.jetty.util.thread.QueuedThreadPool">
    ...
        <Set name="maxThreads" type="int"><Property name="jetty.threadPool.maxThreads" deprecated="threads.max" default="100"/></Set>
    ...
      </New>
    </Configure>

    The Java code cache grows slowly

    In our tests, the Java code cache keeps growing slowly even after the system has warmed up. The default maximum code cache on Java 8 is 240Mb. If the code cache consumes a lot of memory, it can contribute to the container being OOM killed, so an explicit limit is still worthwhile. From observation of the test environment, 100Mb of space is sufficient.

    java ... -XX:ReservedCodeCacheSize=100M -XX:+UseCodeCacheFlushing
    💡 If you are interested, refer to:
    https://docs.oracle.com/javase/8/embedded/develop-apps-platforms/codecache.htm

    The memory limit of the container

    From the investigation above, we can see that the 3G Java heap + JVM overhead + DirectByteBuffer comes very close to the container's 4Gi memory limit. Under high concurrency, the risk of OOM Killed remains high, and the problem may never appear in the test environment; it has a random element.

    The cgroup records how many times the container has come close to OOM (memory.failcnt), and this number kept growing during the test. When memory is tight, the kernel gives priority to the application's memory needs by dropping page cache. And what does dropping the file cache mean? Slower reads, and more frequent, slower writes to disk. If the application has any IO pressure, such as reading *.jar files or writing logs, slow-IO problems will follow.

    watch cat ./memory.failcnt 
    19369
    💡 If you are interested, refer to:
    https://engineering.linkedin.com/blog/2016/08/don_t-let-linux-control-groups-uncontrolled
    https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
    https://srvaroa.github.io/jvm/kubernetes/memory/docker/oomkiller/2019/05/29/k8s-and-java.html

    For my application, my suggestion is to relax the memory limit:

        resources:
          limits:
            memory: 4.5Gi
          requests:
            memory: 4.5Gi

    Outlook

    Speaking incompletely, from the perspective of a service operator, sizing a service depends on these factors:

    • The SLA of the container

      • The target throughput of the container

    If we feed the factors above into a tool as input, its output should be:

    • How many containers should be deployed
    • How the resources of each container should be configured

      • CPU

        • Container CPU limit
        • Application thread pool limit
      • Memory

        • Container memory limit
        • Application memory limits:

          • Java: on-heap / off-heap
    💡 There is an open source tool for reference:
    https://github.com/cloudfoundry/java-buildpack-memory-calculator

    Disclaimer

    Every coin has two sides, and so it is with application tuning: every tuning method has its environmental prerequisites, otherwise it would not be called tuning, it would have gone straight into a pull request against the open source project's default configuration. As the masters often say: don't just copy someone else's tuning parameters. Consider your own actual situation, and test sufficiently before adopting them.

    Experience

    Starting around 2016, major companies began following the trend and moving suitable applications into containers. Because many old projects and components were never designed to run inside a restricted container, they are, to put it bluntly, not container-aware. The situation has improved over the past few years, but plenty of pitfalls remain. As a qualified architect, besides the PPTs and the big-picture vision, we also need the stomach for details like these.

    The above was a memory analysis of a Java container. If you are interested in the CPU and thread configuration of Java containers, please see my other article on the historical pitfalls of Java containerization (the resource limits chapter).

    Conclude this article with a comic:


    MarkZhu

    Blog: [link]