I get the time and the machine goes down

1. Background

Linux time management, including clocksource, clockevent, timer, tick, timekeeper and other concepts, these concepts organically constitute a complete time code system. Of course, there will be bugs in the code. This article starts with a bug to deepen the understanding of the theory in actual combat. Get time, but it crashed.

2. Malfunction

The OPPO cloud kernel team received a connectivity alarm and found that the machine was reset:

PID: 0      TASK: ffff8d2b3775b0c0  CPU: 1   COMMAND: "swapper/1"
 #0 [ffff8d597f6489f0] machine_kexec at ffffffffa5a63b34
 #1 [ffff8d597f648a50] __crash_kexec at ffffffffa5b1e242
 #2 [ffff8d597f648b20] panic at ffffffffa615d85b
 #3 [ffff8d597f648ba0] nmi_panic at ffffffffa5a9859f
 #4 [ffff8d597f648bb0] watchdog_overflow_callback at ffffffffa5b4a881
 #5 [ffff8d597f648bc8] __perf_event_overflow at ffffffffa5ba26b7
 #6 [ffff8d597f648c00] perf_event_overflow at ffffffffa5babd24
 #7 [ffff8d597f648c10] intel_pmu_handle_irq at ffffffffa5a0a850
 #8 [ffff8d597f648e38] perf_event_nmi_handler at ffffffffa616d031
 #9 [ffff8d597f648e58] nmi_handle at ffffffffa616e91c
#10 [ffff8d597f648eb0] do_nmi at ffffffffa616ebf8
#11 [ffff8d597f648ef0] end_repeat_nmi at ffffffffa616dd89
    [exception RIP: __getnstimeofday64+144]
    RIP: ffffffffa5b03940  RSP: ffff8d597f643c78  RFLAGS: 00000212
    RAX: 15b5c8320b8602cd  RBX: ffff8d597f643cb0  RCX: 000000005f89ee29
    RDX: 00000000ee4479fe  RSI: 0000012b5478f3b2  RDI: 0009709c7629b240
    RBP: ffff8d597f643c90   R8: 00000000007001de   R9: ffff8d596d5c0000
    R10: 000000000000007a  R11: 000000000000000e  R12: ffffffffa662ea80
    R13: 000000003ccbcfb6  R14: ffff8d895de08000  R15: 0000000000000081
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#12 [ffff8d597f643c78] __getnstimeofday64 at ffffffffa5b03940
#13 [ffff8d597f643c98] getnstimeofday64 at ffffffffa5b0398e
#14 [ffff8d597f643ca8] ktime_get_real at ffffffffa5b03a45
#15 [ffff8d597f643cd0] netif_receive_skb_internal at ffffffffa603b936
#16 [ffff8d597f643d00] napi_gro_receive at ffffffffa603c588
#17 [ffff8d597f643d28] mlx5e_handle_rx_cqe_mpwrq at ffffffffc052ef1d [mlx5_core]
#18 [ffff8d597f643db8] mlx5e_poll_rx_cq at ffffffffc052f4b8 [mlx5_core]
#19 [ffff8d597f643e08] mlx5e_napi_poll at ffffffffc05304c6 [mlx5_core]
#20 [ffff8d597f643e78] net_rx_action at ffffffffa603bf1f
#21 [ffff8d597f643ef8] __do_softirq at ffffffffa5aa2155
#22 [ffff8d597f643f68] call_softirq at ffffffffa617a32c
#23 [ffff8d597f643f80] do_softirq at ffffffffa5a2e675

From the stack point of view, our process 0 was in the process of processing the soft interrupt packet, because it obtained a time, it caused a crash. A lot of hardlock analysis has been given before, it is nothing more than a long time to turn off the interrupt. For the specific place to turn off the interrupt, you can see the call_softirq function.

3. Failure analysis

1) Theoretical knowledge

In the process of processing the soft interrupt of the network packet, a time stamp is applied, that is to say, for the oppo cloud machine, the above call stack path is a hot and mature path. It is rare to have problems with the mature path, so it is necessary to share it. When timekeeping is initialized, it is difficult to choose the best clock source, because it is very likely that the best clock source has not yet been initialized. Therefore, the strategy is to use a clock source that must be ready when timekeeping is initialized, such as the clock source based on jiffies.

Generally speaking, the timekeeping module updates the time values of various system clocks when the tick arrives. The ktime_get call is likely to occur between two ticks. At this time, it is not enough to rely solely on the accuracy of the current system clock value. After all, That time value is updated per tick. Therefore, in order to obtain high precision, the ns value is obtained through timekeeping_get_ns. Timekeeping_get_ns is the protagonist of this article. This function obtains the nanosecond value of the real time clock at the current time, and this is the real time through the last tick. The time value of clock (xtime_nsec) is calculated by adding the delta time value between the current time and the last tick. After the system is running, real time clock + wall_to_monotonic is the uptime of the system, and real time clock + wall_to_monotonic + sleep time is the boot time of the system.

2) Actual combat analysis

According to the call stack, in a nutshell, __getnstimeofday64 has only one cycle, which is to read the sequence lock of timekeeper_seq. The code analysis is as follows:

int __getnstimeofday64(struct timespec64 *ts)
{
  struct timekeeper *tk = &timekeeper;
  unsigned long seq;
  s64 nsecs = 0;

  do {
    seq = read_seqcount_begin(&timekeeper_seq);

    ts->tv_sec = tk->xtime_sec;//caq:秒值赋值
    nsecs = timekeeping_get_ns(&tk->tkr_mono);

  } while (read_seqcount_retry(&timekeeper_seq, seq));

  ts->tv_nsec = 0;
  timespec64_add_ns(ts, nsecs);//caq:这里面还有个循环呢，

  /*
   * Do not bail out early, in case there were callers still using
   * the value, even in the face of the WARN_ON.
   */
  if (unlikely(timekeeping_suspended))
    return -EAGAIN;
  return 0;
}

But from the perspective of compilation:

0xffffffffa5b0393b <__getnstimeofday64+139>:    xor    %edx,%edx----清零 u32 ret = 0;
0xffffffffa5b0393d <__getnstimeofday64+141>:    nopl   (%rax)
0xffffffffa5b03940 <__getnstimeofday64+144>:    sub    $0x3b9aca00,%rax---------------------1s就是1000000000 ns，循坏在这，而栈中的rax为 15b5c8320b8602cd
0xffffffffa5b03946 <__getnstimeofday64+150>:    add    $0x1,%edx---------ret++;edx is the lower 32 bit of rdx，rdx为00000000ee4479fe，所以edx为 0xee4479fe,也就是3997465086
0xffffffffa5b03949 <__getnstimeofday64+153>:    cmp    $0x3b9ac9ff,%rax-------------------------------------剩余是否小于1ns
0xffffffffa5b0394f <__getnstimeofday64+159>:    ja     0xffffffffa5b03940 <__getnstimeofday64+144>
 /include/linux/time.h: 215---对应 timespec_add_ns
0xffffffffa5b03951 <__getnstimeofday64+161>:    add    %rcx,%rdx---delta算出的秒值+之前保存的秒值，就是最新的秒值
0xffffffffa5b03954 <__getnstimeofday64+164>:    mov    %rax,0x8(%rbx)----剩余的ns，赋值给a->tv_nsec = ns;
0xffffffffa5b03958 <__getnstimeofday64+168>:    mov    %rdx,(%rbx)---加完delta秒值的最新的秒值，赋值给a->tv_sec
0xffffffffa5b0395b <__getnstimeofday64+171>:    cmpl   $0x1,0xc55702(%rip)        # 0xffffffffa6759064----if(timekeeping_suspended)
 /kernel/time/timekeeping.c: 512
0xffffffffa5b03962 <__getnstimeofday64+178>:    pop    %rbx
0xffffffffa5b03963 <__getnstimeofday64+179>:    pop    %r12
0xffffffffa5b03965 <__getnstimeofday64+181>:    pop    %r13

Seen from the stack, we loop in __getnstimeofday64+144

0xffffffffa5b03940 <__getnstimeofday64+144>:    sub    $0x3b9aca00,%rax---------------------1s就是1000000000 ns，循坏在这，而栈中的rax为 15b5c8320b8602cd

It turns out that we are looping in the timespec64_add_ns function:

static __always_inline void timespec64_add_ns(struct timespec64 *a, u64 ns)
{
  a->tv_sec += __iter_div_u64_rem(a->tv_nsec + ns, NSEC_PER_SEC, &ns);
  a->tv_nsec = ns;
}
__iter_div_u64_rem展开如下：

static __always_inline u32
__iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
{
  u32 ret = 0;

  while (dividend >= divisor) {//这个循环
    /* The following asm() prevents the compiler from
       optimising this loop into a modulo operation.  */
    asm("" : "+rm"(dividend));

    dividend -= divisor;
    ret++;
  }

  *remainder = dividend;

  return ret;
}

Our input divisor is NSEC_PER_SEC, which is 10 to the 9th power, and the hexadecimal is 0x3b9aca00. Since we are looping, our dividend is rax. Please note that the value is:

 RAX: 15b5c8320b8602cd

crash> p 0x15b5c8320b8602cd/0x3b9aca00
$7 = 1564376562

According to this calculation, to complete the calculation, you have to cycle 1564376562 so many times.
With such a large value, I really don't know how to cycle to the year of the monkey.
So how did this value come from? It turns out that this value is calculated by reading the cycle difference of closk_source twice before and after.

static inline s64 timekeeping_get_ns(struct tk_read_base *tkr)
{
  u64 delta;

  delta = timekeeping_get_delta(tkr);//caq:上次读取与本次读取之间的差值
  return timekeeping_delta_to_ns(tkr, delta);//caq:差值转换为ns
}

 delta的来源是：
 static inline u64 timekeeping_get_delta(struct tk_read_base *tkr)
{
  u64 cycle_now, delta;
  struct clocksource *clock;

  /* read clocksource: */
  clock = tkr->clock;
  cycle_now = tkr->clock->read(clock);//当前值是通过读取来的

  /* calculate the delta since the last update_wall_time */
  delta = clocksource_delta(cycle_now, clock->cycle_last, clock->mask);//计算差值

  return delta;
}

It turns out that the acquisition of delta is to read the cycle value of the current clocksource online, and then calculate the corresponding difference through clocksource_delta. According to the above code, first we have to know which current clocksource is:

crash> timekeeper
timekeeper = $1 = {
  tkr_mono = {------------------------------timekeeping_get_ns(&tk->tkr_mono)
    clock = 0xffffffffa662ea80, ------------这个就是 clocksource，这个值当前就是 clocksource_tsc
    cycle_last = 16728674596502256, 
    mult = 7340510, 
    shift = 24, 
    xtime_nsec = 2657092090049088, 这个值并不是ns，而是要 >>tkr->shift
    base = {
      tv64 = 2788453640047242
    }
  }, 
  tkr_raw = {
    clock = 0xffffffffa662ea80, 
    cycle_last = 16728674596502256, 
    mult = 8007931, 
    shift = 24, 
    xtime_nsec = 0, 
    base = {
      tv64 = 2788490058099290
    }
  }, 
  xtime_sec = 1602874921, ------------------当前的秒数

timekeeper selects the clocksource with the highest precision currently to work:

crash> dis -l 0xffffffffa662ea80
0xffffffffa662ea80 <clocksource_tsc>:   addb   $0xa5,-0x5d(%rcx)--------------就是 clocksource_tsc ，tsc就是一个clock_source


crash> clocksource_tsc
clocksource_tsc = $2 = {
  read = 0xffffffffa5a34180, -----------read_tsc
  cycle_last = 16728674596502256, ------上次更新墙上时间的时刻取的cycle值
  mask = 18446744073709551615, 
  mult = 8007931, 
  shift = 24, ----------------------注意位数
  max_idle_ns = 204347035648, 
  maxadj = 880872, 
  archdata = {
    vclock_mode = 1
  }, 
  name = 0xffffffffa647c1cd "tsc", ---名称
  list = {
    next = 0xffffffffa6633ff8, 
    prev = 0xffffffffa665c9b0
  }, 
  rating = 300, --------------优先级，
  enable = 0x0, 
  disable = 0x0, 
  flags = 35, ---没有CLOCK_SOURCE_UNSTABLE标志
  suspend = 0x0, 
  resume = 0x0, 
  owner = 0x0
}

The calculation and analysis of the difference are as follows:

static inline s64 timekeeping_delta_to_ns(struct tk_read_base *tkr,
            u64 delta)
{
  s64 nsec;//注意，这里是带符号数

  nsec = delta * tkr->mult + tkr->xtime_nsec;
  nsec >>= tkr->shift;//换算成ns

  /* If arch requires, add in get_arch_timeoffset() */
  return nsec + arch_gettimeoffset();
}

The return value of timekeeping_delta_to_ns is too large, there are two possibilities:
One is that delta is too large, delta * tkr->mult overflows the value of s64, this is considered a bug. Another possibility is that the delta value read directly before and after is too large, which involves update_wall_time not being called in time to read the cycle of the current clocksource.

4. Fault recurrence

This s64 overflow bug has been fixed in the community.

-static inline s64 timekeeping_delta_to_ns(struct tk_read_base *tkr,
+static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr,
                                          cycle_t delta)
 {
-       s64 nsec;
+       u64 nsec;

And check Red Hat’s changelog, and follow the upstream fix, but I think the risk is still there, because sometimes the update_wall_time is not so timely, and even if it is changed from s64 to u64, it does not solve the overflow problem, because the timekeeping_delta_to_ns function is obvious It can be seen that the 64 bits of u64 are not all used in the cycle difference. I believe that someone in the community should explode this issue eventually.

5. Fault avoidance or resolution

The possible solutions are:
Increase the alarm, and intervene in time for the softlock, which may cause the update_wall_time not to be updated in time.

Author profile

Anqing Senior Backend Engineer

Currently, he is mainly responsible for the virtualization of the Linux kernel, containers, and virtual machines.

Get more exciting content, scan the code and follow the [OPPO Digital Intelligence Technology] public account

I get the time and the machine goes down

1. Background

2. Malfunction

3. Failure analysis

1) Theoretical knowledge

2) Actual combat analysis

4. Fault recurrence

5. Fault avoidance or resolution

OPPO数智技术

引用和评论

OPPO云数据库访问服务技术揭秘

OpenInfra 基金会董事会宣布加入 Linux 基金会意向，增强开源全球影响力

Redis 持久化原理分析和使用建议

MyBatis-Plus结合Spring Boot实现数据权限

70k star，取代Postman！这款轻量级API工具，太香了！

Linux 磁盘挂载教程

大模型时代，后端程序员如何避免被AI卷死？