
Background: This issue was reproduced on CentOS 7.6, but it exists in many kernel versions. How to observe and control the various caches in Linux has long been a hot topic in cloud computing, but these topics cover niche scenarios that are hard to merge into the mainline Linux kernel. With eBPF gradually stabilizing, there may be new ways to program against and observe a stock Linux kernel. This article shares how we troubleshot and solved this problem.

1. Failure phenomenon

The OPPO cloud kernel team found that snmpd on a cluster machine was consuming an abnormal amount of CPU:
snmpd kept one core busy almost all of the time, and perf showed the following hot spots:

+   92.00%     3.96%  [kernel]    [k]    __d_lookup 
-   48.95%    48.95%  [kernel]    [k] _raw_spin_lock 
     20.95% 0x70692f74656e2f73                       
        __fopen_internal                              
        __GI___libc_open                              
        system_call                                   
        sys_open                                       
        do_sys_open                                    
        do_filp_open                                   
        path_openat                                    
        link_path_walk                                 
      + lookup_fast                                    
-   45.71%    44.58%  [kernel]    [k] proc_sys_compare 
   - 5.48% 0x70692f74656e2f73                          
        __fopen_internal                               
        __GI___libc_open                               
        system_call                                    
        sys_open                                       
        do_sys_open                                    
        do_filp_open                                   
        path_openat                                    
   + 1.13% proc_sys_compare                                                                                                                     

Almost all of the time is spent in the kernel function __d_lookup, and strace shows where the latency comes from:

open("/proc/sys/net/ipv4/neigh/kube-ipvs0/retrans_time_ms", O_RDONLY) = 8 <0.000024>------v4的比较快
open("/proc/sys/net/ipv6/neigh/ens7f0_58/retrans_time_ms", O_RDONLY) = 8 <0.456366>-------v6很慢

Testing by hand confirms that entering the ipv6 path is very slow:

time cd /proc/sys/net

real 0m0.000s
user 0m0.000s
sys 0m0.000s

time cd /proc/sys/net/ipv6

real 0m2.454s
user 0m0.000s
sys 0m0.509s

time cd /proc/sys/net/ipv4

real 0m0.000s
user 0m0.000s
sys 0m0.000s
Entering the ipv6 path takes far longer than entering the ipv4 path.

2. Failure analysis

We need to understand why perf shows proc_sys_compare consuming so much time inside __d_lookup, and how it gets there.
proc_sys_compare has only one call path: it is the d_compare callback, invoked from __d_lookup:

__d_lookup--->if (parent->d_op->d_compare(parent, dentry, tlen, tname, name))
struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
{
.....
    hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {

        if (dentry->d_name.hash != hash)
            continue;

        spin_lock(&dentry->d_lock);
        if (dentry->d_parent != parent)
            goto next;
        if (d_unhashed(dentry))
            goto next;

        /*
         * It is safe to compare names since d_move() cannot
         * change the qstr (protected by d_lock).
         */
        if (parent->d_flags & DCACHE_OP_COMPARE) {
            int tlen = dentry->d_name.len;
            const char *tname = dentry->d_name.name;
            if (parent->d_op->d_compare(parent, dentry, tlen, tname, name))
                goto next;//caq: a return value of 1 means the names do not match
        } else {
            if (dentry->d_name.len != len)
                goto next;
            if (dentry_cmp(dentry, str, len))
                goto next;
        }
        ....
next:
        spin_unlock(&dentry->d_lock);//caq:再次进入链表循环
     }        

.....
}

The snmpd processes and the physical machines in the cluster are all identical, so it is natural to suspect that hlist_bl_for_each_entry_rcu
is looping too many times, making parent->d_op->d_compare walk a long conflict chain over and over: entering the ipv6 path would then
require a huge number of comparisons, and a long chain also means many cache misses during traversal. This hypothesis needs to be
verified, so we count the chain length with the following helper:

/* COUNT_THRES: only report chains longer than this threshold */
static inline long hlist_count(const struct dentry *parent, const struct qstr *name)
{
  long count = 0;
  unsigned int hash = name->hash;
  struct hlist_bl_head *b = d_hash(parent, hash);   /* the same bucket __d_lookup will walk */
  struct hlist_bl_node *node;
  struct dentry *dentry;

  rcu_read_lock();
  hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
    count++;
  }
  rcu_read_unlock();
  if (count > COUNT_THRES) {
     printk("hlist_bl_head=%p,count=%ld,name=%s,hash=%u\n", b, count, name->name, name->hash);
  }
  return count;
}
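This helper was hooked onto __d_lookup via a kprobe. Below is a minimal sketch of how such wiring could look (our assumption, for x86_64 where the first two arguments arrive in %rdi/%rsi; note that d_hash() and dentry_hashtable are not exported to modules, so a real module must resolve them itself, e.g. via kallsyms_lookup_name):

#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/dcache.h>

/* x86_64 calling convention: arg0 (parent) in %rdi, arg1 (name) in %rsi */
static int d_lookup_pre(struct kprobe *p, struct pt_regs *regs)
{
    const struct dentry *parent = (const struct dentry *)regs->di;
    const struct qstr *name = (const struct qstr *)regs->si;

    /* hlist_count() is the helper shown above; it needs d_hash()/dentry_hashtable,
     * which are not exported, so the real module resolves them at load time */
    hlist_count(parent, name);
    return 0;
}

static struct kprobe kp = {
    .symbol_name  = "__d_lookup",
    .pre_handler  = d_lookup_pre,
};

static int __init chain_probe_init(void)
{
    return register_kprobe(&kp);
}

static void __exit chain_probe_exit(void)
{
    unregister_kprobe(&kp);
}

module_init(chain_probe_init);
module_exit(chain_probe_exit);
MODULE_LICENSE("GPL");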

The results of kprobe are as follows:

[20327461.948219] hlist_bl_head=ffffb0d7029ae3b0 count = 799259,name=ipv6/neigh/ens7f1_46/base_reachable_time_ms,hash=913731689
[20327462.190378] hlist_bl_head=ffffb0d7029ae3b0 count = 799259,name=ipv6/neigh/ens7f0_51/retrans_time_ms,hash=913731689
[20327462.432954] hlist_bl_head=ffffb0d7029ae3b0 count = 799259,name=ipv6/conf/ens7f0_51/forwarding,hash=913731689
[20327462.675609] hlist_bl_head=ffffb0d7029ae3b0 count = 799259,name=ipv6/neigh/ens7f0_51/base_reachable_time_ms,hash=913731689

Judging from the chain length, the lookup has indeed landed on a very long conflict chain in the dcache hash table: the chain holds
799,259 dentries, and the lookups all point at ipv6 dentries.
Anyone familiar with the dcache knows that all elements on a conflict chain share the same bucket, and in this kernel the bucket is
derived from the parent dentry pointer combined with the name hash:

static inline struct hlist_bl_head *d_hash(const struct dentry *parent,
                    unsigned int hash)
{
    hash += (unsigned long) parent / L1_CACHE_BYTES;
    hash = hash + (hash >> D_HASHBITS);
    return dentry_hashtable + (hash & D_HASHMASK);
}
In newer kernels it is:
static inline struct hlist_bl_head *d_hash(unsigned int hash)
{
    return dentry_hashtable + (hash >> d_hash_shift);
}

On the surface, newer kernels no longer mix the parent into d_hash(), but in fact the parent pointer is already mixed in as a salt
when the hash stored in dentry->d_name.hash is computed; see the following patch:

commit 8387ff2577eb9ed245df9a39947f66976c6bcd02
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Jun 10 07:51:30 2016 -0700

    vfs: make the string hashes salt the hash
    
    We always mixed in the parent pointer into the dentry name hash, but we
    did it late at lookup time.  It turns out that we can simplify that
    lookup-time action by salting the hash with the parent pointer early
    instead of late.
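Either way, dentries that share the same parent and the same name hash always land in the same bucket, which is exactly why every per-namespace "ipv6" dentry under the shared "net" parent piles up on one conflict chain. A tiny userspace illustration of the 3.10-era formula (the constants and the third hash value are assumptions for the demo; the real D_HASHBITS/D_HASHMASK depend on the boot-time table size):

#include <stdio.h>

#define L1_CACHE_BYTES 64
#define D_HASHBITS     20
#define D_HASHMASK     ((1u << D_HASHBITS) - 1)

static unsigned int d_hash(unsigned long parent, unsigned int hash)
{
    hash += parent / L1_CACHE_BYTES;
    hash = hash + (hash >> D_HASHBITS);
    return hash & D_HASHMASK;
}

int main(void)
{
    unsigned long net_parent = 0xffff8a0a7739fd40UL; /* the shared "net" dentry seen in crash */
    unsigned int ipv6_hash  = 913731689;             /* name hash printed by the kprobe */
    unsigned int other_hash = 1701998435;            /* a different name hash (core/somaxconn) */

    /* every namespace's "ipv6" dentry: same parent, same name hash -> same bucket */
    printf("ipv6 bucket:            %u\n", d_hash(net_parent, ipv6_hash));
    printf("ipv6 bucket again:      %u\n", d_hash(net_parent, ipv6_hash));
    printf("different name bucket:  %u\n", d_hash(net_parent, other_hash));
    return 0;
}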

At this point in the analysis, two questions remain:

  1. Although the conflict chain is long, our dentry might happen to sit near its head;
    it would not necessarily be far down the chain every time.
  2. The dentries under /proc are, logically, a common and fixed set of file names.
    Why is there such a long conflict chain at all?

To answer these questions we need to look more closely at the dentries on the conflict chain.
Starting from the hash bucket head printed by the kprobe above, we can walk the chain in crash:

crash> list dentry.d_hash -H 0xffff8a29269dc608 -s dentry.d_sb
ffff89edf533d080
  d_sb = 0xffff89db7fd3c800
ffff8a276fd1e3c0
  d_sb = 0xffff89db7fd3c800
ffff8a2925bdaa80
  d_sb = 0xffff89db7fd3c800
ffff89edf5382a80
  d_sb = 0xffff89db7fd3c800
.....

Since the list is very long, we dumped the output to a file and found that every dentry on this conflict chain
belongs to the same super_block, 0xffff89db7fd3c800:

crash> list super_block.s_list -H super_blocks -s super_block.s_id,s_nr_dentry_unused >/home/caq/super_block.txt

# grep ffff89db7fd3c800 super_block.txt  -A 2 
ffff89db7fd3c800
  s_id = "proc\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"

0xffff89db7fd3c800 is the proc filesystem. Why has it created so many ipv6 dentries?
Next, look at the d_inode that each dentry points to:

...
ffff89edf5375b00
  d_inode = 0xffff8a291f11cfb0
ffff89edf06cb740
  d_inode = 0xffff89edec668d10
ffff8a29218fa780
  d_inode = 0xffff89edf0f75240
ffff89edf0f955c0
  d_inode = 0xffff89edef9c7b40
ffff8a2769e70780
  d_inode = 0xffff8a291c1c9750
ffff8a2921969080
  d_inode = 0xffff89edf332e1a0
ffff89edf5324b40
  d_inode = 0xffff89edf2934800
...

These dentries all have the same name, "ipv6", yet their inodes are different, so they are not hard links of a single
file under proc; that much is normal.
We continue by analyzing how the ipv6 path is created.
The creation of the /proc/sys/net/ipv6 path can be broken into the following steps:

start_kernel-->proc_root_init()//caq: register the proc fs
Since proc is mounted by default on Linux, look for kern_mount_data:
pid_ns_prepare_proc-->kern_mount_data(&proc_fs_type, ns);//caq: mount the proc fs
proc_sys_init-->proc_mkdir("sys", NULL);//caq: create the sys directory under /proc
net_sysctl_init-->register_sysctl("net", empty);//caq: create net under /proc/sys
For init_net:
ipv6_sysctl_register-->register_net_sysctl(&init_net, "net/ipv6", ipv6_rotable);
For other net_namespaces, creation is usually triggered by a system call:
ipv6_sysctl_net_init-->register_net_sysctl(net, "net/ipv6", ipv6_table);// create ipv6

With this background, we focus on the last step, the per-namespace creation done by the
ipv6_sysctl_net_init function:
ipv6_sysctl_register-->register_pernet_subsys(&ipv6_sysctl_net_ops)-->
register_pernet_operations-->__register_pernet_operations-->
ops_init-->ipv6_sysctl_net_init
A typical call stack looks like this:

 :Fri Mar  5 11:18:24 2021,runc:[1:CHILD],tid=125338.path=net/ipv6
 0xffffffffb9ac66f0 : __register_sysctl_table+0x0/0x620 [kernel]
 0xffffffffb9f4f7d2 : register_net_sysctl+0x12/0x20 [kernel]
 0xffffffffb9f324c3 : ipv6_sysctl_net_init+0xc3/0x150 [kernel]
 0xffffffffb9e2fe14 : ops_init+0x44/0x150 [kernel]
 0xffffffffb9e2ffc3 : setup_net+0xa3/0x160 [kernel]
 0xffffffffb9e30765 : copy_net_ns+0xb5/0x180 [kernel]
 0xffffffffb98c8089 : create_new_namespaces+0xf9/0x180 [kernel]
 0xffffffffb98c82ca : unshare_nsproxy_namespaces+0x5a/0xc0 [kernel]
 0xffffffffb9897d83 : sys_unshare+0x173/0x2e0 [kernel]
 0xffffffffb9f76ddb : system_call_fastpath+0x22/0x27 [kernel]

In the dcache, the dentries under /proc/sys/ from every net_namespace are hashed into the same table.
So how are the dentries of different net_namespaces kept isolated from each other?
Let's look at the corresponding __register_sysctl_table function:

struct ctl_table_header *register_net_sysctl(struct net *net,
    const char *path, struct ctl_table *table)
{
    return __register_sysctl_table(&net->sysctls, path, table);
}

struct ctl_table_header *__register_sysctl_table(
    struct ctl_table_set *set,
    const char *path, struct ctl_table *table)
{
    .....
    for (entry = table; entry->procname; entry++)
        nr_entries++;//caq: first count the entries in this table

    header = kzalloc(sizeof(struct ctl_table_header) +
             sizeof(struct ctl_node)*nr_entries, GFP_KERNEL);
....
    node = (struct ctl_node *)(header + 1);
    init_header(header, root, set, node, table);
....
    /* Find the directory for the ctl_table */
    for (name = path; name; name = nextname) {
....//caq: walk down to the directory matching the path
    }

    spin_lock(&sysctl_lock);
    if (insert_header(dir, header))//caq: insert into the management structures
        goto fail_put_dir_locked;
....
}

We will not expand the code in full; the key point is that each group of sysctl dentries is tied to a ctl_table_set,
which decides whether an entry is visible to the current process. During lookup, the comparison works as follows:

static int proc_sys_compare(const struct dentry *parent, const struct dentry *dentry,
        unsigned int len, const char *str, const struct qstr *name)
{
....
    return !head || !sysctl_is_seen(head);
}

static int sysctl_is_seen(struct ctl_table_header *p)
{
    struct ctl_table_set *set = p->set;// get the ctl_table_set this header belongs to
    int res;
    spin_lock(&sysctl_lock);
    if (p->unregistering)
        res = 0;
    else if (!set->is_seen)
        res = 1;
    else
        res = set->is_seen(set);
    spin_unlock(&sysctl_lock);
    return res;
}

// not the same ctl_table_set, hence not visible
static int is_seen(struct ctl_table_set *set)
{
    return &current->nsproxy->net_ns->sysctls == set;
}

The code above shows that if the ctl_table_set of the process doing the lookup does not match the set the dentry belongs
to, the comparison fails. snmpd belongs to init_net's sysctls, and inspection shows that the vast majority of dentries on
the conflict chain do not belong to init_net, so comparison after comparison fails before the right entry is reached.

So why does the dentry under /proc/sys/net that belongs to init_net end up at the tail of the conflict chain?
Because new entries are always inserted at the head of the list:

static inline void hlist_bl_add_head_rcu(struct hlist_bl_node *n,
                    struct hlist_bl_head *h)
{
    struct hlist_bl_node *first;

    /* don't need hlist_bl_first_rcu because we're under lock */
    first = hlist_bl_first(h);

    n->next = first;//caq: later additions are always inserted at the head of the list
    if (first)
        first->pprev = &n->next;
    n->pprev = &h->first;

    /* need _rcu because we can have concurrent lock free readers */
    hlist_bl_set_first_rcu(h, n);
}

We now know why snmpd has to walk the conflict chain almost to the end. The next thing to understand is why there are so
many dentries in the first place. By tracking the workload we found that if docker keeps creating pause containers and
destroying them, the ipv6 dentries of those net namespaces keep accumulating. They accumulate because dentries are not
destroyed unless memory pressure triggers reclaim (the dcache caches whatever it can), and because nothing limits the
length of a conflict chain.

That raises another question: why don't the ipv4 dentries accumulate? Since the ipv6 and ipv4 dentries share the same
parent, how many child dentries does that parent have?

Looking at the dentries in the hash table, many of them have d_parent pointing to the dentry 0xffff8a0a7739fd40.
crash> dentry.d_subdirs 0xffff8a0a7739fd40 ---- check how many children this parent dentry has
  d_subdirs = {
    next = 0xffff8a07a3c6f710, 
    prev = 0xffff8a0a7739fe90
  }
crash> list 0xffff8a07a3c6f710 |wc -l
1598540---------- nearly 1.6 million children

This parent has about 1.6 million children; subtracting the 799,259 on the conflict chain above still leaves roughly
790,000. Since entering the ipv4 path is fast, some other child of the net directory must also have a huge number of
dentries. Is this a common pattern?

Checking other machines in the cluster shows the same phenomenon; an excerpt of the output:

 count=158505,d_name=net,d_len=3,name=ipv6/conf/all/disable_ipv6,hash=913731689,len=4
hlist_bl_head=ffffbd9d5a7a6cc0,count=158507
 count=158507,d_name=net,d_len=3,name=core/somaxconn,hash=1701998435,len=4
hlist_bl_head=ffffbd9d429a7498,count=158506

The bucket ffffbd9d429a7498 has a conflict chain almost as long as ffffbd9d5a7a6cc0.
Let's analyze the ipv6 chain first (the core chain is analogous). Picking one dentry from the conflict chain:

crash> dentry.d_parent,d_name.name,d_lockref.count,d_inode,d_subdirs ffff9b867904f500
  d_parent = 0xffff9b9377368240
  d_name.name = 0xffff9b867904f538 "ipv6"----- this is an ipv6 dentry
  d_lockref.count = 1
  d_inode = 0xffff9bba4a5e14c0
  d_subdirs = {
    next = 0xffff9b867904f950, 
    prev = 0xffff9b867904f950
  }

d_child is at offset 0x90, so 0xffff9b867904f950 minus 0x90 gives 0xffff9b867904f8c0:
crash> dentry 0xffff9b867904f8c0
struct dentry {
......
  d_parent = 0xffff9b867904f500, 
  d_name = {
    {
      {
        hash = 1718513507, 
        len = 4
      }, 
      hash_len = 18898382691
    }, 
      name = 0xffff9b867904f8f8 "conf"------ the name is conf
  }, 
  d_inode = 0xffff9bba4a5e61a0, 
  d_iname = "conf\000bles_names\000\060\000.2\000\000pvs.(*Han", 
  d_lockref = {
......
        count = 1---------------- refcount is 1, so something still holds a reference
......
  }, 
 ......
  d_subdirs = {
    next = 0xffff9b867904fb90, 
    prev = 0xffff9b867904fb90
  }, 
......
}
Since the refcount is 1, keep digging downwards:
crash> dentry.d_parent,d_lockref.count,d_name.name,d_subdirs 0xffff9b867904fb00
  d_parent = 0xffff9b867904f8c0
  d_lockref.count = 1
  d_name.name = 0xffff9b867904fb38 "all"
  d_subdirs = {
    next = 0xffff9b867904ef90, 
    prev = 0xffff9b867904ef90
  }
  One level further down:
crash> dentry.d_parent,d_lockref.count,d_name.name,d_subdirs,d_flags,d_inode -x 0xffff9b867904ef00
  d_parent = 0xffff9b867904fb00
  d_lockref.count = 0x0----------------------------- keep digging until the refcount reaches 0
  d_name.name = 0xffff9b867904ef38 "disable_ipv6"
  d_subdirs = {
    next = 0xffff9b867904efa0, -------- empty
    prev = 0xffff9b867904efa0
  }
  d_flags = 0x40800ce------------- analyzed in detail below
  d_inode = 0xffff9bba4a5e4fb0

So the full path of this dentry is ipv6/conf/all/disable_ipv6, the same path our kprobe reported.
Decoding d_flags = 0x40800ce against the flag definitions:

#define DCACHE_FILE_TYPE        0x04000000 /* Other file type */

#define DCACHE_LRU_LIST     0x80000-------- this flag means the dentry is on the LRU list

#define DCACHE_REFERENCED   0x0040  /* Recently used, don't discard. */
#define DCACHE_RCUACCESS    0x0080  /* Entry has ever been RCU-visible */

#define DCACHE_OP_COMPARE   0x0002
#define DCACHE_OP_REVALIDATE    0x0004
#define DCACHE_OP_DELETE    0x0008

The disable_ipv6 dentry has a reference count of 0 but carries the DCACHE_LRU_LIST flag, which,
according to the following function, means it sits on the super_block's list of unused dentries:

static void dentry_lru_add(struct dentry *dentry)
{
    if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST))) {
        spin_lock(&dcache_lru_lock);
        dentry->d_flags |= DCACHE_LRU_LIST;// this flag means the dentry is on the LRU
        list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
        dentry->d_sb->s_nr_dentry_unused++;//caq: entries on s_dentry_lru are unused
        dentry_stat.nr_unused++;
        spin_unlock(&dcache_lru_lock);
    }
}

This means the dentries can in principle be reclaimed. Because this is a production machine we did not dare to run
echo 2 >/proc/sys/vm/drop_caches
so we wrote a module to release them selectively. The core of the module, modeled on shrink_slab, is below
(orig_super_blocks/orig_sb_lock are the module's handles to the kernel's unexported super_blocks list and sb_lock,
and shrinker_one is the module's own wrapper around the super_block shrinker):

  spin_lock(orig_sb_lock);
  list_for_each_entry(sb, orig_super_blocks, s_list) {
          /* only care about the proc super_block with enough unused dentries */
          if (memcmp(&(sb->s_id[0]), "proc", strlen("proc")) ||
              memcmp(sb->s_type->name, "proc", strlen("proc")) ||
              hlist_unhashed(&sb->s_instances) ||
              (sb->s_nr_dentry_unused < NR_DENTRY_UNUSED_LEN))
                  continue;
          sb->s_count++;              /* pin the sb before dropping the lock */
          spin_unlock(orig_sb_lock);
          printk("find proc sb=%p\n", sb);
          shrinker = &sb->s_shrink;

          count = shrinker_one(shrinker, &shrink, 1000, 1000);
          printk("shrinker_one count =%lu,sb=%p\n", count, sb);
          spin_lock(orig_sb_lock);    //caq: take the lock again
          if (sb_proc)
                  __put_super(sb_proc);
          sb_proc = sb;
  }
  if (sb_proc) {
          __put_super(sb_proc);
          spin_unlock(orig_sb_lock);
  } else {
          spin_unlock(orig_sb_lock);
          printk("can't find the special sb\n");
  }

After loading the module, both conflict chains were indeed released.
For example, on one node, before the release:

[3435957.357026] hlist_bl_head=ffffbd9d5a7a6cc0,count=34686
[3435957.357029] count=34686,d_name=net,d_len=3,name=core/somaxconn,hash=1701998435,len=4
[3435957.457039] IPVS: Creating netns size=2048 id=873057
[3435957.477742] hlist_bl_head=ffffbd9d429a7498,count=34686
[3435957.477745] count=34686,d_name=net,d_len=3,name=ipv6/conf/all/disable_ipv6,hash=913731689,len=4
[3435957.549173] hlist_bl_head=ffffbd9d5a7a6cc0,count=34687
[3435957.549176] count=34687,d_name=net,d_len=3,name=core/somaxconn,hash=1701998435,len=4
[3435957.667889] hlist_bl_head=ffffbd9d429a7498,count=34687
[3435957.667892] count=34687,d_name=net,d_len=3,name=ipv6/conf/all/disable_ipv6,hash=913731689,len=4
[3435958.720110] find proc sb=ffff9b647fdd4000----------------------- release starts
[3435959.150764] shrinker_one count =259800,sb=ffff9b647fdd4000------ release finished

And after the release on that node:

[3436042.407051] hlist_bl_head=ffffbd9d466aed58,count=101
[3436042.407055] count=101,d_name=net,d_len=3,name=core/somaxconn,hash=1701998435,len=4
[3436042.501220] IPVS: Creating netns size=2048 id=873159
[3436042.591180] hlist_bl_head=ffffbd9d466aed58,count=102
[3436042.591183] count=102,d_name=net,d_len=3,name=core/somaxconn,hash=1701998435,len=4
[3436042.685008] hlist_bl_head=ffffbd9d4e8af728,count=101
[3436042.685011] count=101,d_name=net,d_len=3,name=ipv6/conf/all/disable_ipv6,hash=913731689,len=4
[3436043.957221] IPVS: Creating netns size=2048 id=873160
[3436044.043860] hlist_bl_head=ffffbd9d466aed58,count=103
[3436044.043863] count=103,d_name=net,d_len=3,name=core/somaxconn,hash=1701998435,len=4
[3436044.137400] hlist_bl_head=ffffbd9d4e8af728,count=102
[3436044.137403] count=102,d_name=net,d_len=3,name=ipv6/conf/all/disable_ipv6,hash=913731689,len=4
[3436044.138384] IPVS: Creating netns size=2048 id=873161
[3436044.226954] hlist_bl_head=ffffbd9d466aed58,count=104
[3436044.226956] count=104,d_name=net,d_len=3,name=core/somaxconn,hash=1701998435,len=4
[3436044.321947] hlist_bl_head=ffffbd9d4e8af728,count=103

Two details are visible above:

1. The conflict chain was growing before the release and keeps growing afterwards, because new namespaces are still being created.

2. After the release the net dentry itself was recreated, so the hash bucket that the chain lives in has changed.

In summary, the traversal is slow because the ctl_table_set snmpd is looking for (init_net's) does not match the
ctl_table_set of the other dentries in the dcache, so every comparison along the chain fails; and the chain is long
because, whenever a net_namespace was destroyed, something was still accessing ipv6/conf/all/disable_ipv6 and
core/somaxconn, which left those two dentries parked on the s_dentry_lru of the super_block they belong to.
One last question: which calls were accessing these dentries? The trigger is shown below:

pid=16564,task=exe,par_pid=366883,task=dockerd,count=1958,d_name=net,d_len=3,name=ipv6/conf/all/disable_ipv6,hash=913731689,len=4,hlist_bl_head=ffffbd9d429a7498
hlist_bl_head=ffffbd9d5a7a6cc0,count=1960

pid=16635,task=runc:[2:INIT],par_pid=16587,task=runc,count=1960,d_name=net,d_len=3,name=core/somaxconn,hash=1701998435,len=4,hlist_bl_head=ffffbd9d5a7a6cc0
hlist_bl_head=ffffbd9d429a7498,count=1959

So it is dockerd and runc that trigger the problem. Kubernetes keeps asking docker to create pause containers, but the
CNI network parameters are filled in incorrectly, so each newly created net_namespace is destroyed almost immediately.
unregister_net_sysctl_table is called during destruction, but at the same time runc and exe are still accessing two
dentries inside that net_namespace, which leaves those two dentries cached on the super_block's s_dentry_lru list; and
because memory is plentiful overall, the list keeps growing.
Note that the paths involved are ipv6/conf/all/disable_ipv6 and core/somaxconn; no dentry under the ipv4 path was being
accessed at that moment, so the ipv4 ctl_table entries could be cleaned up in time.
The unlucky snmpd then has to walk the huge chain on every lookup, which drives its CPU usage up. A manual drop_caches
restores performance immediately, but note that drop_caches must not be used on production machines: it drives sys time
up and hurts latency-sensitive services.

3. Failure reproduction

1. With plenty of free memory, slab reclaim is never triggered. Kubernetes calls docker to create pause containers in
different net_namespaces, but because the CNI parameters are wrong, each newly created net_namespace is destroyed
immediately. If you frequently see the following log in dmesg:

IPVS: Creating netns size=2048 id=866615

you need to keep an eye on dentry caching. A rough reproducer sketch follows.
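For reference, here is a minimal userspace sketch that mimics the access pattern (repeatedly create a net namespace, touch the per-namespace ipv6 sysctl, then let the namespace be torn down). Whether a dentry survives each iteration depends on the race with unregister_net_sysctl_table described above; in the real incident the access came from separate processes (runc/exe) racing with the teardown, so treat this only as an approximation, and run it only as root on a test machine:

#define _GNU_SOURCE
#include <sched.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    char buf[4];
    int fd, i;

    for (i = 0; i < 100000; i++) {
        /* new net namespace, similar to a short-lived pause container */
        if (unshare(CLONE_NEWNET) < 0) {
            perror("unshare");
            return 1;
        }
        /* touching the per-namespace sysctl instantiates its dentry in the dcache */
        fd = open("/proc/sys/net/ipv6/conf/all/disable_ipv6", O_RDONLY);
        if (fd >= 0) {
            read(fd, buf, sizeof(buf));
            close(fd);
        }
        /* the previous namespace is released on the next unshare(); its
         * sysctl dentry may stay on the proc super_block's LRU list */
    }
    return 0;
}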

4. Fault avoidance and resolution

The possible solutions are:

1. Use RCU to walk each conflict chain of dentry_hashtable and raise an alarm when a chain grows beyond a threshold
(the hlist_count helper above is a starting point).

2. Limit the number of cached dentries through a proc tunable.

3. Globally, watch /proc/sys/fs/dentry-state (see the sketch after this list).

4. Locally, read s_nr_dentry_unused of each super_block and alarm when it exceeds a threshold;
sample code can follow the shrink_slab implementation.

5. Note the difference from the proposed negative-dentry-limit feature.

6. Hash buckets are used in many places in the kernel. How do we monitor the length of their conflict chains?
Either scan them with a module, or find somewhere to record the chain length.
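For item 3, a minimal userspace sketch is shown below (the alarm threshold and polling interval are arbitrary assumptions); item 4 would instead need a small kernel module similar to the release module above, because s_nr_dentry_unused is only visible inside the kernel:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long nr_dentry, nr_unused, age_limit, want_pages, dummy1, dummy2;

    for (;;) {
        FILE *f = fopen("/proc/sys/fs/dentry-state", "r");
        if (!f)
            return 1;
        /* fields: nr_dentry nr_unused age_limit want_pages dummy dummy */
        if (fscanf(f, "%ld %ld %ld %ld %ld %ld",
                   &nr_dentry, &nr_unused, &age_limit,
                   &want_pages, &dummy1, &dummy2) == 6 &&
            nr_unused > 1000000)  /* arbitrary alarm threshold */
                fprintf(stderr, "dentry-state alarm: nr_unused=%ld nr_dentry=%ld\n",
                        nr_unused, nr_dentry);
        fclose(f);
        sleep(60);
    }
    return 0;
}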

5. About the author

Anqing, Senior Backend Engineer at OPPO.

He currently works on OPPO hybrid cloud, responsible for Linux kernel, container, and virtual machine virtualization.

Get more exciting content: follow the [OPPO Internet Technology] public account

