title: Locating and Troubleshooting Memory Leaks: An Analysis of Heap Profiling Principles
author: ['Zhang Yexiang']
date: 2021-11-17
summary: This article introduces some common implementation principles and usage of Heap Profilers, to help readers understand the related implementation in TiKV more easily, or to better apply this kind of analysis to their own projects.
tags: ['tikv performance optimization']
After a system has been running for a long time, the available memory becomes less and less, and eventually some services even fail: this is a typical memory leak. Such problems are usually hard to predict and hard to locate by statically combing through the code. Heap Profiling exists to help us solve them.
TiKV, as part of a distributed system, already has Heap Profiling capability. This article introduces some common implementation principles and usage of Heap Profilers, to help readers understand the related implementation in TiKV more easily, or to better apply this kind of analysis to their own projects.
What is Heap Profiling
Runtime memory leaks are quite difficult to troubleshoot in many scenarios, because such problems are usually unpredictable and hard to locate by statically combing through the code.
Heap Profiling is to help us solve such problems.
Heap Profiling usually refers to collecting or sampling an application's heap allocations in order to report the program's memory usage, so that we can analyze the cause of memory consumption or locate the source of a memory leak.
How does Heap Profiling work?
As a comparison, let's briefly understand how CPU Profiling works.
When we prepare to do CPU Profiling, we usually select a time window. In this window, the CPU Profiler registers a hook with the target program that is executed periodically (there are multiple ways to do this, such as the SIGPROF signal). Inside this hook, we grab the stack trace of the business threads at that moment.
We control the execution frequency of the hook to a specific value, such as 100hz, so that a call stack sample of the business code is collected every 10ms. When the time window ends, we aggregate all the collected samples to obtain the number of times each function was collected; comparing that against the total number of samples gives each function's relative proportion.
With the help of this model, we can find the functions that account for a relatively high proportion, and then locate the CPU hot spots.
In terms of data structure, Heap Profiling is very similar to CPU Profiling, and both are models of stack trace + statistics. If you have used the pprof provided by Go, you will find that the display format of the two is almost the same:
Go CPU Profile
Go Heap Profile
Unlike CPU Profiling, Heap Profiling's data collection is not simply driven by a timer; it needs to intrude into the memory allocation path so that the amount of memory allocated can be recorded. The usual approach of a Heap Profiler is therefore to integrate itself directly into the memory allocator, grab the current stack trace when the application allocates memory, and finally aggregate all the samples together, so that we know how much memory each function allocated, directly or indirectly.
The stack trace + statistics data model of a Heap Profile is thus consistent with that of a CPU Profile.
Next, we will introduce the use and implementation principles of a variety of Heap Profiler.
Note: GNU gprof and Valgrind do not match our purpose, so this article will not expand on them. Refer to gprof, Valgrind and gperftools - an evaluation of some tools for application level CPU profiling on Linux - Gernot Klingler.
Heap Profiling in Go
Most readers should be more familiar with Go, so we use Go as a starting point and base for our research.
Note: If a concept has been covered in an earlier section, it will not be repeated in later sections, even if they are about different projects. In addition, for completeness, each project comes with a Usage section explaining how to use it; readers who are already familiar with a tool can skip that section.
Usage
Go runtime has a built-in convenient profiler, heap is one of them. We can open a debug port as follows:
import _ "net/http/pprof"
go func() {
log.Print(http.ListenAndServe("0.0.0.0:9999", nil))
}()
Then use the command line to get the current Heap Profiling snapshot during the running of the program:
$ go tool pprof http://127.0.0.1:9999/debug/pprof/heap
Or you can directly get a Heap Profiling snapshot at a specific location in the application code:
import "runtime/pprof"
pprof.WriteHeapProfile(writer)
Here we use a complete demo to show the usage of heap pprof:
package main
import (
"log"
"net/http"
_ "net/http/pprof"
"time"
)
func main() {
go func() {
log.Fatal(http.ListenAndServe(":9999", nil))
}()
var data [][]byte
for {
data = func1(data)
time.Sleep(1 * time.Second)
}
}
func func1(data [][]byte) [][]byte {
data = func2(data)
return append(data, make([]byte, 1024*1024)) // alloc 1mb
}
func func2(data [][]byte) [][]byte {
return append(data, make([]byte, 1024*1024)) // alloc 1mb
}
The code continuously allocates memory in func1 and func2, and allocates a total of 2mb of heap memory per second.
After running the program for a period of time, execute the following command to get a profile snapshot and start a web service to browse:
$ go tool pprof -http=":9998" localhost:9999/debug/pprof/heap
Go Heap Graph
From the figure, we can intuitively see which functions have the largest memory allocation (the box is larger), and we can also intuitively see the function call relationships (through connections). For example, in the above figure, it is obvious that the allocation of func1 and func2 takes up the bulk, and func2 is called by func1.
Note that Heap Profiling is also based on sampling (by default, one sample for every 512k allocated), so the memory size shown here is smaller than the actually allocated memory size. As with CPU Profiling, this value is only used to calculate relative proportions and thereby locate memory allocation hotspots.
Note: In fact, Go runtime has the logic to estimate the original size of the sampled results, but this conclusion is not necessarily accurate.
In addition, 48.88% of 90.24% in the box of func1 means Flat% of Cum%.
What are Flat% and Cum%? Let's change the browsing method first, drop down and click Top in the View column in the upper left corner:
Go Heap Top
- Name column represents the corresponding function name
- Flat column indicates how much memory is allocated by the function itself
- Flat% column represents the proportion of Flat relative to the total allocation size
- Cum column indicates the total memory allocated by the function and all of its sub-functions
- Cum% column represents the proportion of Cum relative to the total allocation size
- Sum% column represents the accumulation of Flat% from top to bottom (you can intuitively determine how much memory is allocated from any given row up to the top)
The above two methods can help us locate specific functions. Go also provides finer-grained, line-level allocation statistics. Drop down and click Source in the View column in the upper left corner:
Go Heap Source
In CPU Profiling, we often find the wide top of the flame graph to quickly and intuitively locate the hotspot function. Of course, due to the homogeneity of the data model, Heap Profiling data can also be displayed through the flame graph. Pull down and click Flame Graph in the View column in the upper left corner:
Go Heap Flamegraph
Through the above methods, we can easily see that the memory allocation mostly happens in func1 and func2. However, in the real world it is never so easy to locate the root of the problem. Since what we get is a snapshot of a single moment, it is not enough for a memory leak problem; what we need is incremental data to judge which memory keeps growing. So we can fetch the Heap Profile again after some interval and diff the two results.
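For example, pprof's -base flag can take the earlier snapshot as a baseline so that only the growth between the two snapshots is shown. A rough sketch (the file names are just placeholders; the second snapshot is fetched after waiting for a while):
$ curl -o heap_old.pb.gz http://127.0.0.1:9999/debug/pprof/heap
$ curl -o heap_new.pb.gz http://127.0.0.1:9999/debug/pprof/heap
$ go tool pprof -http=":9998" -base heap_old.pb.gz heap_new.pb.gz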
Implementation details
In this section, we focus on the implementation principle of Go Heap Profiling.
Recalling the section "How does Heap Profiling work", the usual approach of Heap Profiler is to directly integrate itself into the memory allocator, and get the current stack trace when the application allocates memory, which is exactly what Go does.
Go's memory allocation entry is the mallocgc() function in src/runtime/malloc.go, and a key piece of code is as follows:
func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
// ...
if rate := MemProfileRate; rate > 0 {
// Note cache c only valid while m acquired; see #47302
if rate != 1 && size < c.nextSample {
c.nextSample -= size
} else {
profilealloc(mp, x, size)
}
}
// ...
}
func profilealloc(mp *m, x unsafe.Pointer, size uintptr) {
c := getMCache()
if c == nil {
throw("profilealloc called without a P or outside bootstrapping")
}
c.nextSample = nextSample()
mProf_Malloc(x, size)
}
This means that, on average, roughly every 512k of heap memory allocated through mallocgc() triggers a call to profilealloc() to record a stack trace.
Why do we need to define a sampling granularity? Isn't it more accurate to record the current stack trace every time mallocgc()?
It seems more attractive to obtain the memory allocation of every function with complete accuracy, but the performance overhead cannot be ignored. malloc(), as a library function, is called very frequently by user-mode applications, and optimizing allocation performance is the allocator's responsibility. If every malloc() call were accompanied by a stack traceback, the overhead would be almost unacceptable, especially in scenarios where profiling runs continuously on the server side for a long time. Choosing "sampling" is not because it gives a better result; it is simply a compromise.
Of course, we can also modify the MemProfileRate variable ourselves. Setting it to 1 records a stack trace on every mallocgc() call, and setting it to 0 turns Heap Profiling off completely. Users can weigh performance against accuracy according to their actual scenario.
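For reference, a minimal sketch of adjusting the rate in application code (the Go documentation recommends changing it only once, as early as possible):
package main

import "runtime"

func main() {
	// 0 disables heap profiling entirely, 1 records every allocation
	// (expensive), and the default is 512 * 1024, i.e. roughly one
	// sample per 512k allocated on average.
	runtime.MemProfileRate = 1

	// ... application code ...
}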
Note that even when we set MemProfileRate to a normal sampling granularity, the step is not an exact value; instead, each sampling point is a random value drawn from an exponential distribution whose mean is MemProfileRate.
// nextSample returns the next sampling point for heap profiling. The goal is
// to sample allocations on average every MemProfileRate bytes, but with a
// completely random distribution over the allocation timeline; this
// corresponds to a Poisson process with parameter MemProfileRate. In Poisson
// processes, the distance between two samples follows the exponential
// distribution (exp(MemProfileRate)), so the best return value is a random
// number taken from an exponential distribution whose mean is MemProfileRate.
func nextSample() uintptr
Because memory allocation is often regular, sampling at a fixed granularity may produce results with large errors; it could even happen that every sample lands on a particular type of memory allocation. This is why randomization is chosen here.
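As a purely illustrative sketch (not the runtime's actual code), such a sampling point can be drawn with the standard library's exponential generator, scaled so that its mean is MemProfileRate bytes:
package main

import (
	"fmt"
	"math/rand"
	"runtime"
)

// nextSampleSketch draws the next sampling point from an exponential
// distribution whose mean is runtime.MemProfileRate bytes, mirroring the
// behavior described in the nextSample() comment above.
func nextSampleSketch() uintptr {
	// rand.ExpFloat64 returns an exponentially distributed value with mean 1.
	return uintptr(rand.ExpFloat64() * float64(runtime.MemProfileRate))
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(nextSampleSketch())
	}
}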
Not only Heap Profiling, sampling-based profilers will always have some errors (for example: SafePoint Bias ). When reviewing the sampling-based profiling results, you need to remind yourself not to ignore the possibility of errors.
The mProf_Malloc() function located in src/runtime/mprof.go is responsible for the specific sampling work:
// Called by malloc to record a profiled block.
func mProf_Malloc(p unsafe.Pointer, size uintptr) {
var stk [maxStack]uintptr
nstk := callers(4, stk[:])
lock(&proflock)
b := stkbucket(memProfile, size, stk[:nstk], true)
c := mProf.cycle
mp := b.mp()
mpc := &mp.future[(c+2)%uint32(len(mp.future))]
mpc.allocs++
mpc.alloc_bytes += size
unlock(&proflock)
// Setprofilebucket locks a bunch of other mutexes, so we call it outside of proflock.
// This reduces potential contention and chances of deadlocks.
// Since the object must be alive during call to mProf_Malloc,
// it's fine to do this non-atomically.
systemstack(func() {
setprofilebucket(p, b)
})
}
func callers(skip int, pcbuf []uintptr) int {
sp := getcallersp()
pc := getcallerpc()
gp := getg()
var n int
systemstack(func() {
n = gentraceback(pc, sp, 0, gp, skip, &pcbuf[0], len(pcbuf), nil, nil, 0)
})
return n
}
callers(), and further gentraceback(), is called to obtain the current call stack and save it in the stk array (i.e. an array of PC addresses). This technique is called call stack backtracking and is used in many scenarios (for example, stack unwinding when a program panics).
Note: The term PC refers to Program Counter, which is RIP register when it is specific to x86-64 platform; FP refers to Frame Pointer, which is RBP register when it is specific to x86-64; SP refers to Stack Pointer, which is RSP register when it is specific to x86-64.
A primitive implementation of call stack backtracking is to require in the calling convention that the RBP register (on x86-64) always holds the stack frame base address when a function call occurs, instead of being used as a general-purpose register. The call instruction first pushes RIP (the return address) onto the stack; as long as we ensure that the first thing each function pushes is the current RBP, the frame base addresses of all functions form a linked list of addresses. We then only need to shift each RBP address by one word to obtain the array of RIP values.
Go FramePointer Backtrace (picture from go-profiler-notes )
Note: It is mentioned in the figure that all parameters of Go are passed through the stack. This conclusion is now outdated. Go supports register passing from version 1.17.
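To make the frame-pointer chain described above concrete, here is a purely illustrative Go sketch of the walk; obtaining the initial RBP value in real code requires assembly, so here it is simply passed in as a parameter:
package main

import "unsafe"

// walkFramePointers walks the naive x86-64 frame-pointer chain described
// above: the word at fp holds the caller's saved RBP (the next frame base),
// and the word one slot above it holds the saved return address (RIP).
func walkFramePointers(fp uintptr, max int) []uintptr {
	var pcs []uintptr
	for i := 0; i < max && fp != 0; i++ {
		pc := *(*uintptr)(unsafe.Pointer(fp + 8)) // saved return address
		pcs = append(pcs, pc)
		fp = *(*uintptr)(unsafe.Pointer(fp)) // caller's saved RBP
	}
	return pcs
}

func main() {}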
Since x86-64 treats RBP as a general-purpose register, compilers such as GCC no longer use RBP to hold the frame base address by default unless this is explicitly enabled (for example with -fno-omit-frame-pointer). However, the Go compiler retains this feature, so it is feasible to use RBP for stack backtracking in Go.
But Go did not adopt this simple solution, because it causes problems in some special scenarios. For example, if a function is inlined, the call stack obtained through RBP backtracking is missing that frame. In addition, this solution needs extra instructions inserted around regular function calls and occupies an additional general-purpose register, so it has a certain performance overhead even when we do not need a stack traceback.
Every Go binary contains a section named gopclntab, short for Go Program Counter Line Table, which maintains the mapping from PC to SP and its return address, so that we can complete the PC chain purely by table lookup without relying on FP. It also records whether a PC belongs to an inlined function, so inlined frames are not lost during the stack traceback. In addition, gopclntab maintains a symbol table that stores the code information (function name, line number, etc.) corresponding to each PC, so that we ultimately see a human-readable panic or profiling result instead of a big pile of addresses.
gopclntab
Unlike Go-specific gopclntab, DWARF is a standardized debugging format. The Go compiler also adds DWARF (v4) information to its generated binary, so some non-Go ecological external tools can rely on it to debug Go programs . It is worth mentioning that the information contained in DWARF is a superset of gopclntab.
Back to Heap Profiling: after obtaining the PC array through the stack backtracking technique (the gentraceback() function in the code above), there is no rush to symbolize it. Symbolization is costly, so we can first aggregate by the raw address stack. Aggregation simply means accumulating identical samples in a hashmap, where "identical samples" are those whose PC arrays have exactly the same contents.
The stkbucket() function obtains the corresponding bucket with stk as the key, and the relevant statistics fields in that bucket are then accumulated.
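The following is a simplified sketch of this aggregation idea (not the runtime's actual data structures): samples with identical PC arrays are accumulated into the same bucket, and symbolization is deferred until the profile is dumped.
package main

import "fmt"

type sample struct {
	stk  []uintptr // raw PC addresses from the stack traceback
	size int64
}

type bucket struct {
	allocs     int64
	allocBytes int64
}

// aggregate accumulates samples whose PC arrays are identical into one bucket.
func aggregate(samples []sample) map[string]*bucket {
	buckets := make(map[string]*bucket)
	for _, s := range samples {
		key := fmt.Sprint(s.stk) // identical PC arrays yield identical keys
		b, ok := buckets[key]
		if !ok {
			b = &bucket{}
			buckets[key] = b
		}
		b.allocs++
		b.allocBytes += s.size
	}
	return buckets
}

func main() {}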
In addition, we noticed that memRecord has multiple sets of memRecordCycle statistics:
type memRecord struct {
active memRecordCycle
future [3]memRecordCycle
}
When accumulating, the mProf.cycle global variable is used as a subscript to access a specific group of memRecordCycle. mProf.cycle will be incremented after each round of GC, so that the distribution among the three rounds of GC is recorded. Only after one round of GC is over, the memory allocation and release between the previous round of GC and this round of GC will be incorporated into the final displayed statistics. This design is to avoid getting the Heap Profile before the GC is executed, and to show us a lot of useless temporary memory.
Moreover, we may also see unstable heap memory state at different times in a GC cycle.
Finally, setprofilebucket() is called to record the bucket on the mspan related to the assigned address, and mProf_Free() is called to record the corresponding release in the subsequent GC.
In this way, the bucket collection is always maintained in the Go runtime. When we need to perform Heap Profiling (for example, when calling pprof.WriteHeapProfile()), we will access this bucket collection and convert it to the format required for pprof output.
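For reference, this bucket collection is also exposed programmatically through runtime.MemProfile, which the heap profile output is ultimately built from; a minimal sketch of enumerating it:
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// The first call only asks how many records currently exist.
	n, _ := runtime.MemProfile(nil, true)
	records := make([]runtime.MemProfileRecord, n+50) // leave some headroom
	n, ok := runtime.MemProfile(records, true)
	if !ok {
		return
	}
	for _, r := range records[:n] {
		fmt.Printf("in-use: %d bytes, allocated: %d bytes, stack: %v\n",
			r.InUseBytes(), r.AllocBytes, r.Stack())
	}
}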
This is also a difference between Heap Profiling and CPU Profiling: CPU Profiling only incurs sampling overhead during the profiling time window, whereas Heap Profiling sampling happens all the time; triggering a profile merely dumps a snapshot of the data accumulated so far.
Next, we will enter the world of C/C++/Rust. Fortunately, since most of the implementation principles of Heap Profiler are similar, a lot of the knowledge mentioned in the previous section will correspond to the latter. The most typical, Go Heap Profiling is actually ported from Google tcmalloc, and they have similar implementations.
Heap Profiling with gperftools
gperftools (Google Performance Tools) is a toolkit, including Heap Profiler, Heap Checker, CPU Profiler and other tools. The reason why it is introduced immediately after Go is because it has a deep relationship with Go.
The Google-internal tcmalloc from which the Go runtime was ported, as mentioned above, has split into two community versions: one is tcmalloc, a pure malloc implementation with no extra features; the other is gperftools, a malloc implementation with Heap Profiling capability plus a set of supporting tools.
Among these tools, pprof is the best known. In the early days pprof was a perl script; later it evolved into the powerful tool written in Go, pprof, which has since been merged into the Go mainline. The go tool pprof command we use every day is this pprof package used directly.
Note: The main author of gperftools is Sanjay Ghemawat, a great man who paired programming with Jeff Dean.
Usage
Google has been using Heap Profiler of gperftools to analyze the heap memory allocation of C++ programs. It can do:
- Figuring out what is in the program heap at any given time
- Locating memory leaks
- Finding places that do a lot of allocation
As the ancestor of Go pprof, it offers essentially the same Heap Profiling capability as Go.
Where Go hard-codes the collection logic directly into the runtime's memory allocation functions, gperftools embeds it in the malloc implementation of the libtcmalloc library it provides. The user needs to pass -ltcmalloc at the project's link stage to link the library and replace libc's default malloc implementation.
Of course, we can also rely on Linux's dynamic link mechanism to replace it at runtime:
$ env LD_PRELOAD="/usr/lib/libtcmalloc.so" <binary>
When libtcmalloc.so is specified with LD_PRELOAD, the malloc() that is linked by default in our program is overwritten, and the dynamic linker of Linux ensures that the version specified by LD_PRELOAD is executed first.
Before running the executable file linked to libtcmalloc, if we set the environment variable HEAPPROFILE to a file name, then when the program is executed, Heap Profile data will be written to the file.
By default, whenever our program allocates 1g of memory, or whenever the program's memory usage high-water mark increases by 100mb, a Heap Profile dump will be performed. These parameters can be modified through environment variables.
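As a rough example, a setup might look like the following; the two interval variables shown are the gperftools environment options documented to control these thresholds, with their default values:
$ export HEAPPROFILE=/tmp/demo.hprof
$ export HEAP_PROFILE_ALLOCATION_INTERVAL=1073741824
$ export HEAP_PROFILE_INUSE_INTERVAL=104857600
$ env LD_PRELOAD="/usr/lib/libtcmalloc.so" ./demo
The first variable enables profiling and sets the dump file prefix; the other two control how often a dump happens, per bytes allocated and per high-water-mark growth respectively.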
Use the pprof script that comes with gperftools to analyze the dumped profile files; the usage is basically the same as in Go.
$ pprof --gv gfs_master /tmp/profile.0100.heap
gperftools gv
$ pprof --text gfs_master /tmp/profile.0100.heap
255.6 24.7% 24.7% 255.6 24.7% GFS_MasterChunk::AddServer
184.6 17.8% 42.5% 298.8 28.8% GFS_MasterChunkTable::Create
176.2 17.0% 59.5% 729.9 70.5% GFS_MasterChunkTable::UpdateState
169.8 16.4% 75.9% 169.8 16.4% PendingClone::PendingClone
76.3 7.4% 83.3% 76.3 7.4% __default_alloc_template::_S_chunk_alloc
49.5 4.8% 88.0% 49.5 4.8% hashtable::resize
...
Similarly, from left to right, they are Flat(mb), Flat%, Sum%, Cum(mb), Cum%, Name.
Implementation details
Similarly, tcmalloc adds some sampling logic to malloc() and operator new. When the sampling hook is triggered according to conditions, the following functions are executed:
// Record an allocation in the profile.
static void RecordAlloc(const void* ptr, size_t bytes, int skip_count) {
// Take the stack trace outside the critical section.
void* stack[HeapProfileTable::kMaxStackDepth];
int depth = HeapProfileTable::GetCallerStackTrace(skip_count + 1, stack);
SpinLockHolder l(&heap_lock);
if (is_on) {
heap_profile->RecordAlloc(ptr, bytes, depth, stack);
MaybeDumpProfileLocked();
}
}
void HeapProfileTable::RecordAlloc(
const void* ptr, size_t bytes, int stack_depth,
const void* const call_stack[]) {
Bucket* b = GetBucket(stack_depth, call_stack);
b->allocs++;
b->alloc_size += bytes;
total_.allocs++;
total_.alloc_size += bytes;
AllocValue v;
v.set_bucket(b); // also did set_live(false); set_ignore(false)
v.bytes = bytes;
address_map_->Insert(ptr, v);
}
The execution process is as follows:
- Call GetCallerStackTrace() to get the call stack.
- Call GetBucket() with the call stack as the key of the hashmap to obtain the corresponding bucket.
- Accumulate statistics in the Bucket.
Since there is no GC, the sampling process is much simpler than that of Go. From the point of view of variable naming, the profiling code in Go runtime is indeed transplanted from here.
The sampling rules of gperftools are described in detail in sampler.h. In general, they are also consistent with Go, namely: 512k average sample step.
In free() or operator delete, you also need to add some logic to record the memory release, which is also much simpler than Go with GC:
// Record a deallocation in the profile.
static void RecordFree(const void* ptr) {
SpinLockHolder l(&heap_lock);
if (is_on) {
heap_profile->RecordFree(ptr);
MaybeDumpProfileLocked();
}
}
void HeapProfileTable::RecordFree(const void* ptr) {
AllocValue v;
if (address_map_->FindAndRemove(ptr, &v)) {
Bucket* b = v.bucket();
b->frees++;
b->free_size += v.bytes;
total_.frees++;
total_.free_size += v.bytes;
}
}
Find the corresponding bucket and add up the free related fields.
Modern C/C++/Rust programs usually rely on the libunwind library to obtain the call stack. libunwind's stack backtracking principle is similar to Go's: it does not choose the Frame Pointer backtracking mode either, but instead depends on an unwind table recorded in a specific section of the program. The difference is that Go relies on gopclntab, a specific section created within its own ecosystem, while C/C++/Rust programs rely on the .debug_frame or .eh_frame sections.
Among them, .debug_frame is defined by the DWARF standard. The Go compiler also writes this information, but does not use it itself; it is kept only for third-party tools. GCC only writes debugging information to .debug_frame when the -g flag is passed.
.eh_frame is more modern and is defined in the Linux Standard Base. The principle is to have the compiler insert pseudo-instructions (CFI Directives, Call Frame Information) at the corresponding positions in the assembly code to assist the assembler in generating the .eh_frame section that eventually contains the unwind table.
Take the following code as an example:
// demo.c
int add(int a, int b) {
return a + b;
}
We use cc -S demo.c to generate assembly code (gcc/clang can be used). Note that the -g parameter is not used here.
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 11, 0 sdk_version 11, 3
.globl _add ## -- Begin function add
.p2align 4, 0x90
_add: ## @add
.cfi_startproc
## %bb.0:
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -4(%rbp), %eax
addl -8(%rbp), %eax
popq %rbp
retq
.cfi_endproc
## -- End function
.subsections_via_symbols
From the generated assembly code, you can see many pseudo-instructions prefixed with .cfi_, which are CFI Directives.
Heap Profiling with jemalloc
Next we turn our attention to jemalloc, because TiKV uses jemalloc as its default memory allocator. Whether Heap Profiling can be carried out smoothly on jemalloc is a point worth our attention.
Usage
jemalloc comes with Heap Profiling capability, but it is not turned on by default. You need to specify the --enable-prof parameter when compiling.
./autogen.sh
./configure --prefix=/usr/local/jemalloc-5.1.0 --enable-prof
make
make install
Same as tcmalloc, we can choose to link jemalloc to the program via -ljemalloc, or overwrite libc's malloc() implementation with jemalloc via LD_PRELOAD.
We take the Rust program as an example to show how to perform Heap Profiling through jemalloc.
fn main() {
let mut data = vec![];
loop {
func1(&mut data);
std::thread::sleep(std::time::Duration::from_secs(1));
}
}
fn func1(data: &mut Vec<Box<[u8; 1024*1024]>>) {
data.push(Box::new([0u8; 1024*1024])); // alloc 1mb
func2(data);
}
fn func2(data: &mut Vec<Box<[u8; 1024*1024]>>) {
data.push(Box::new([0u8; 1024*1024])); // alloc 1mb
}
Similar to the demo provided in the Go section, we also allocate 2mb of heap memory per second in Rust, func1 and func2 each allocate 1mb, and func1 calls func2.
Use rustc directly to compile the file without any parameters, and then execute the following command to start the program:
$ export MALLOC_CONF="prof:true,lg_prof_interval:25"
$ export LD_PRELOAD=/usr/lib/libjemalloc.so
$ ./demo
MALLOC_CONF is used to specify jemalloc's parameters: prof:true enables the profiler, and lg_prof_interval:25 dumps a profile file every time 2^25 bytes (32mb) of heap memory has been allocated.
Note: For more MALLOC_CONF options, please refer to document .
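As a sketch of another commonly used combination (option names per the jemalloc documentation; lg_prof_sample:19 is the default, meaning roughly one sample per 2^19 bytes, i.e. 512k):
$ export MALLOC_CONF="prof:true,lg_prof_sample:19,prof_prefix:jeprof.out,prof_final:true"
Here prof_prefix sets the prefix of the dumped file names, and prof_final:true additionally dumps a final profile when the program exits.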
After waiting for a period of time, you can see that some profile files are generated.
jemalloc provides a tool similar to tcmalloc's pprof, called jeprof; it is in fact forked from the pprof perl script. We can use jeprof to review the profile files.
$ jeprof ./demo jeprof.7262.0.i0.heap
You can also generate the same graph as Go/gperftools:
$ jeprof --gv ./demo jeprof.7262.0.i0.heap
jeprof svg
Implementation details
Similar to tcmalloc, jemalloc adds sampling logic to malloc():
JEMALLOC_ALWAYS_INLINE int
imalloc_body(static_opts_t *sopts, dynamic_opts_t *dopts, tsd_t *tsd) {
// ...
// If profiling is on, get our profiling context.
if (config_prof && opt_prof) {
bool prof_active = prof_active_get_unlocked();
bool sample_event = te_prof_sample_event_lookahead(tsd, usize);
prof_tctx_t *tctx = prof_alloc_prep(tsd, prof_active,
sample_event);
emap_alloc_ctx_t alloc_ctx;
if (likely((uintptr_t)tctx == (uintptr_t)1U)) {
alloc_ctx.slab = (usize <= SC_SMALL_MAXCLASS);
allocation = imalloc_no_sample(
sopts, dopts, tsd, usize, usize, ind);
} else if ((uintptr_t)tctx > (uintptr_t)1U) {
allocation = imalloc_sample(
sopts, dopts, tsd, usize, ind);
alloc_ctx.slab = false;
} else {
allocation = NULL;
}
if (unlikely(allocation == NULL)) {
prof_alloc_rollback(tsd, tctx);
goto label_oom;
}
prof_malloc(tsd, allocation, size, usize, &alloc_ctx, tctx);
} else {
assert(!opt_prof);
allocation = imalloc_no_sample(sopts, dopts, tsd, size, usize,
ind);
if (unlikely(allocation == NULL)) {
goto label_oom;
}
}
// ...
}
Call prof_malloc_sample_object() in prof_malloc() to accumulate the corresponding call stack records in the hashmap:
void
prof_malloc_sample_object(tsd_t *tsd, const void *ptr, size_t size,
size_t usize, prof_tctx_t *tctx) {
// ...
malloc_mutex_lock(tsd_tsdn(tsd), tctx->tdata->lock);
size_t shifted_unbiased_cnt = prof_shifted_unbiased_cnt[szind];
size_t unbiased_bytes = prof_unbiased_sz[szind];
tctx->cnts.curobjs++;
tctx->cnts.curobjs_shifted_unbiased += shifted_unbiased_cnt;
tctx->cnts.curbytes += usize;
tctx->cnts.curbytes_unbiased += unbiased_bytes;
// ...
}
The logic injected by jemalloc in free() is also similar to tcmalloc, and jemalloc also relies on libunwind for stack backtracking, so I won’t go into details here.
Heap Profiling with bytehound
Bytehound is a Memory Profiler for the Linux platform, written in Rust. Its distinguishing feature is a relatively rich front end. Our focus is on how it is implemented and whether it can be used in TiKV, so we only briefly introduce its basic usage.
Usage
We can download bytehound's precompiled dynamic library from its Releases page; only the Linux platform is supported.
Then, like tcmalloc or jemalloc, mount its own implementation via LD_PRELOAD. Here we assume that we are running the same Rust program with memory leaks in the section Heap Profiling with jemalloc:
$ LD_PRELOAD=./libbytehound.so ./demo
Next, a memory-profiling_*.dat file will be generated in the program's working directory, which is the output of bytehound's Heap Profiling. Note that, unlike the other Heap Profilers, this file is updated continuously rather than a new file being generated at fixed intervals.
Next, execute the following command to open a web port for real-time analysis of the above files:
$ ./bytehound server memory-profiling_*.dat
Bytehound GUI
The most intuitive way is to click Flamegraph in the upper right corner to view the flame graph:
Bytehound Flamegraph
The memory hotspots of demo::func1 and demo::func2 can be easily seen from the figure.
Bytehound provides a wealth of GUI features, which is one of its highlights. You can refer to the documentation and explore them on your own.
Implementation details
Bytehound also replaces the user's default malloc implementation, but bytehound itself does not implement a memory allocator; instead, it wraps jemalloc.
// Entry point
#[cfg_attr(not(test), no_mangle)]
pub unsafe extern "C" fn malloc( size: size_t ) -> *mut c_void {
allocate( size, AllocationKind::Malloc )
}
#[inline(always)]
unsafe fn allocate( requested_size: usize, kind: AllocationKind ) -> *mut c_void {
// ...
// Call jemalloc to perform the memory allocation
let pointer = match kind {
AllocationKind::Malloc => {
if opt::get().zero_memory {
calloc_real( effective_size as size_t, 1 )
} else {
malloc_real( effective_size as size_t )
}
},
// ...
};
// ...
// Stack backtrace
let backtrace = unwind::grab( &mut thread );
// ...
// Record the sample
on_allocation( id, allocation, backtrace, thread );
pointer
}
// The xxx_real functions are linked to the jemalloc implementation
#[cfg(feature = "jemalloc")]
extern "C" {
#[link_name = "_rjem_mp_malloc"]
fn malloc_real( size: size_t ) -> *mut c_void;
// ...
}
It appears that stack backtracking and recording are performed on every malloc call, with no sampling logic. In the on_allocation hook, the allocation record is sent to a channel and processed asynchronously by a dedicated processing thread.
pub fn on_allocation(
id: InternalAllocationId,
allocation: InternalAllocation,
backtrace: Backtrace,
thread: StrongThreadHandle
) {
// ...
crate::event::send_event_throttled( move || {
InternalEvent::Alloc {
id,
timestamp,
allocation,
backtrace,
}
});
}
#[inline(always)]
pub(crate) fn send_event_throttled< F: FnOnce() -> InternalEvent >( callback: F ) {
EVENT_CHANNEL.chunked_send_with( 64, callback );
}
The implementation of EVENT_CHANNEL is a simple Mutex<Vec<T>>:
pub struct Channel< T > {
queue: Mutex< Vec< T > >,
condvar: Condvar
}
Performance overhead
In this section, let's explore the performance overhead of each Heap Profiler mentioned above. The specific measurement methods vary from scenario to scenario.
All tests are run separately in the following physical machine environment:
Host | Intel NUC11PAHi7 |
---|---|
CPU | Intel Core i7-1165G7 2.8GHz~4.7GHz 4 cores 8 threads |
RAM | Kingston 64G DDR4 3200MHz |
Hard disk | Samsung 980PRO 1T SSD PCIe4. |
Operating system | Arch Linux Kernel-5.14.1 |
Go
In Go, our measurement method is to use TiDB + unistore to deploy a single node, adjust the runtime.MemProfileRate parameter and then use sysbench to measure.
Related software version and pressure test parameter data:
Go Version | 1.17.1 |
---|---|
TiDB Version | v5.3.0-alpha-1156-g7f36a07de |
Commit Hash | 7f36a07de9682b37d46240b16a2107f5c84941ba |
Sysbench Version | 1.0.20 |
Sysbench Tables | 8 |
Sysbench TableSize | 100000 |
Sysbench Threads | 128 |
Sysbench Operation | oltp_read_only |
The resulting data:
MemProfileRate | Result |
---|---|
0: Do not record | Transactions: 1505224 (2508.52 per sec.)<br/>Queries: 24083584 (40136.30 per sec.)<br/>Latency (AVG): 51.02<br/>Latency (P95): 73.13 |
512k: sampling record | Transactions: 1498855 (2497.89 per sec.)<br/>Queries: 23981680 (39966.27 per sec.)<br/>Latency (AVG): 51.24<br/>Latency (P95): 74.46 |
1: Full record | Transactions: 75178 (125.18 per sec.)<br/>Queries: 1202848 (2002.82 per sec.)<br/>Latency (AVG): 1022.04<br/>Latency (P95): 2405.65 |
Compared with "no recording", the impact of 512k sampling on both TPS/QPS and P95 latency is basically within 1%. The performance overhead of "full recording" matches the "will be very high" expectation, yet it is still surprisingly large: TPS/QPS drops by about 20x and P95 latency increases by about 30x.
Since Heap Profiling is a general function, we cannot accurately give the general performance loss under all scenarios, and only the measurement conclusions under specific projects are valuable. TiDB is a relatively computationally intensive application, and the memory allocation frequency may not be as high as that of some memory-intensive applications. Therefore, this conclusion (and all subsequent conclusions) can only be used as a reference, and readers can measure the overhead in their own application scenarios.
tcmalloc/jemalloc
We measure tcmalloc/jemalloc based on TiKV. The method is to deploy one PD process and one TiKV process on the machine and use go-ycsb for stress testing. The key parameters are as follows:
threadcount=200
recordcount=100000
operationcount=1000000
fieldcount=20
Use LD_PRELOAD to inject different malloc hooks before starting TiKV. Among them, tcmalloc uses the default configuration, which is 512k sampling similar to Go; jemalloc uses the default sampling strategy, and dumps a profile file every time 1G of heap memory is allocated.
Finally get the following data:
default | OPS: 119037.2 Avg(us): 4186 99th(us): 14000 |
---|---|
tcmalloc | OPS: 113708.8 Avg(us): 4382 99th(us): 16000 |
jemalloc | OPS: 114639.9 Avg(us): 4346 99th(us): 15000 |
The performance of tcmalloc and jemalloc is almost the same: compared with the default memory allocator, OPS drops by about 4% and P99 latency increases by about 10%.
We have seen that tcmalloc's implementation is essentially the same as Go heap pprof's, yet the data measured here is not consistent with the Go numbers. We speculate that this is because the memory allocation characteristics of TiKV and TiDB differ, which also confirms what was said earlier: "We cannot accurately give the general performance loss under all scenarios, and only the measurement conclusions under specific projects are valuable."
bytehound
The reason we did not put bytehound together with tcmalloc/jemalloc is that when using bytehound on TiKV, we hit a deadlock during the startup phase.
Since we expect bytehound's performance overhead to be very high, and it theoretically cannot be applied in the TiKV production environment, we only need to confirm that conclusion.
Note: We speculate that the performance overhead is high because no sampling logic is found in the bytehound code: the data collected on every allocation is sent to a background thread through a channel, and the channel is simply a wrapper around Mutex + Vec.
We choose a simple mini-redis project to measure the performance overhead of bytehound. Since the goal is only to confirm whether it can meet the requirements of the TiKV production environment, not to accurately measure the data, we can simply count and compare its TPS. The specific driver code snippets are as follows:
var count int32
for n := 0; n < 128; n++ {
go func() {
for {
key := uuid.New()
err := client.Set(key, key, 0).Err()
if err != nil {
panic(err)
}
err = client.Get(key).Err()
if err != nil {
panic(err)
}
atomic.AddInt32(&count, 1)
}
}()
}
We start 128 goroutines to perform read and write operations against the server. One read plus one write counts as one complete operation; only the number of operations is counted, and metrics such as latency are not measured. Finally, the total count is divided by the execution time to obtain the TPS before and after bytehound is enabled. The data is as follows:
default | Count: 11784571 Time: 60s TPS: 196409 |
---|---|
open bytehound | Count: 5660952 Time: 60s TPS: 94349 |
From the results, the TPS loss exceeds 50%.
What can BPF bring
Although the performance overhead of BPF is very low, BPF can, to a large extent, only obtain system-level indicators. In general, Heap Profiling needs to collect statistics on the memory allocation path, but memory allocation tends to be layered.
For example, if our program mallocs a large block of memory in advance as a memory pool and implements its own allocation algorithm on top of it, then all the heap memory needed by the business logic is handed out from that pool by our own code, and the existing Heap Profilers become useless: they only tell us that a large amount of memory was requested during startup and that the amount requested at any other time is 0. In this scenario we need to instrument our own memory allocation code and do there what a Heap Profiler would normally do at the allocator entrance.
The problem with BPF is similar. We could attach a hook to brk/sbrk and record the current stack trace whenever user space actually needs to ask the kernel to grow the heap. However, the memory allocator is a complicated black box, and the user stack that most frequently triggers brk/sbrk is not necessarily the user stack that causes the memory leak. This needs some experiments to verify; if the results turn out to be valuable, then BPF could serve as a low-cost solution for long-running profiling (the permissions BPF requires would also need to be considered).
As for uprobes, they are merely a non-intrusive way of injecting code. For Heap Profiling itself, the same logic still has to run inside the allocator, which brings the same overhead, and we are not particularly sensitive to code intrusiveness anyway.
https://github.com/parca-dev/parca implements BPF-based Continuous Profiling, but the only module that really uses BPF is actually the CPU Profiler. A Python tool is provided in bcc-tools for CPU Profiling ( https://github.com/iovisor/bcc/blob/master/tools/profile.py ), the core principle is the same. For Heap Profiling, there is not much reference for the time being.