I get the time and the machine goes down

1. Background

Linux time management, including clocksource, clockevent, timer, tick, timekeeper and other concepts, these concepts organically constitute a complete time code system. Of course, there will be bugs in the code. This article starts with a bug to deepen the understanding of the theory in actual combat. Get time, but it crashed.

2. Malfunction

The OPPO cloud kernel team received a connectivity alarm and found that the machine was reset:

PID: 0      TASK: ffff8d2b3775b0c0  CPU: 1   COMMAND: "swapper/1"
 #0 [ffff8d597f6489f0] machine_kexec at ffffffffa5a63b34
 #1 [ffff8d597f648a50] __crash_kexec at ffffffffa5b1e242
 #2 [ffff8d597f648b20] panic at ffffffffa615d85b
 #3 [ffff8d597f648ba0] nmi_panic at ffffffffa5a9859f
 #4 [ffff8d597f648bb0] watchdog_overflow_callback at ffffffffa5b4a881
 #5 [ffff8d597f648bc8] __perf_event_overflow at ffffffffa5ba26b7
 #6 [ffff8d597f648c00] perf_event_overflow at ffffffffa5babd24
 #7 [ffff8d597f648c10] intel_pmu_handle_irq at ffffffffa5a0a850
 #8 [ffff8d597f648e38] perf_event_nmi_handler at ffffffffa616d031
 #9 [ffff8d597f648e58] nmi_handle at ffffffffa616e91c
#10 [ffff8d597f648eb0] do_nmi at ffffffffa616ebf8
#11 [ffff8d597f648ef0] end_repeat_nmi at ffffffffa616dd89
    [exception RIP: __getnstimeofday64+144]
    RIP: ffffffffa5b03940  RSP: ffff8d597f643c78  RFLAGS: 00000212
    RAX: 15b5c8320b8602cd  RBX: ffff8d597f643cb0  RCX: 000000005f89ee29
    RDX: 00000000ee4479fe  RSI: 0000012b5478f3b2  RDI: 0009709c7629b240
    RBP: ffff8d597f643c90   R8: 00000000007001de   R9: ffff8d596d5c0000
    R10: 000000000000007a  R11: 000000000000000e  R12: ffffffffa662ea80
    R13: 000000003ccbcfb6  R14: ffff8d895de08000  R15: 0000000000000081
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#12 [ffff8d597f643c78] __getnstimeofday64 at ffffffffa5b03940
#13 [ffff8d597f643c98] getnstimeofday64 at ffffffffa5b0398e
#14 [ffff8d597f643ca8] ktime_get_real at ffffffffa5b03a45
#15 [ffff8d597f643cd0] netif_receive_skb_internal at ffffffffa603b936
#16 [ffff8d597f643d00] napi_gro_receive at ffffffffa603c588
#17 [ffff8d597f643d28] mlx5e_handle_rx_cqe_mpwrq at ffffffffc052ef1d [mlx5_core]
#18 [ffff8d597f643db8] mlx5e_poll_rx_cq at ffffffffc052f4b8 [mlx5_core]
#19 [ffff8d597f643e08] mlx5e_napi_poll at ffffffffc05304c6 [mlx5_core]
#20 [ffff8d597f643e78] net_rx_action at ffffffffa603bf1f
#21 [ffff8d597f643ef8] __do_softirq at ffffffffa5aa2155
#22 [ffff8d597f643f68] call_softirq at ffffffffa617a32c
#23 [ffff8d597f643f80] do_softirq at ffffffffa5a2e675

From the stack point of view, our process 0 was in the process of processing the soft interrupt packet, because it obtained a time, it caused a crash. A lot of hardlock analysis has been given before, it is nothing more than a long time to turn off the interrupt. For the specific place to turn off the interrupt, you can see the call_softirq function.

3. Failure analysis

1) Theoretical knowledge

In the process of processing the soft interrupt of the network packet, a time stamp is applied, that is to say, for the oppo cloud machine, the above call stack path is a hot and mature path. It is rare to have problems with the mature path, so it is necessary to share it. When timekeeping is initialized, it is difficult to choose the best clock source, because it is very likely that the best clock source has not yet been initialized. Therefore, the strategy is to use a clock source that must be ready when timekeeping is initialized, such as the clock source based on jiffies.

Generally speaking, the timekeeping module updates the time values of various system clocks when the tick arrives. The ktime_get call is likely to occur between two ticks. At this time, it is not enough to rely solely on the accuracy of the current system clock value. After all, That time value is updated per tick. Therefore, in order to obtain high precision, the ns value is obtained through timekeeping_get_ns. Timekeeping_get_ns is the protagonist of this article. This function obtains the nanosecond value of the real time clock at the current time, and this is the real time through the last tick. The time value of clock (xtime_nsec) is calculated by adding the delta time value between the current time and the last tick. After the system is running, real time clock + wall_to_monotonic is the uptime of the system, and real time clock + wall_to_monotonic + sleep time is the boot time of the system.

2) Actual combat analysis

According to the call stack, in a nutshell, __getnstimeofday64 has only one cycle, which is to read the sequence lock of timekeeper_seq. The code analysis is as follows:

int __getnstimeofday64(struct timespec64 *ts)
{
  struct timekeeper *tk = &timekeeper;
  unsigned long seq;
  s64 nsecs = 0;

  do {
    seq = read_seqcount_begin(&timekeeper_seq);

    ts->tv_sec = tk->xtime_sec;//caq:秒值赋值
    nsecs = timekeeping_get_ns(&tk->tkr_mono);

  } while (read_seqcount_retry(&timekeeper_seq, seq));

  ts->tv_nsec = 0;
  timespec64_add_ns(ts, nsecs);//caq:这里面还有个循环呢,

  /*
   * Do not bail out early, in case there were callers still using
   * the value, even in the face of the WARN_ON.
   */
  if (unlikely(timekeeping_suspended))
    return -EAGAIN;
  return 0;
}

But from the perspective of compilation:

0xffffffffa5b0393b <__getnstimeofday64+139>:    xor    %edx,%edx----清零 u32 ret = 0;
0xffffffffa5b0393d <__getnstimeofday64+141>:    nopl   (%rax)
0xffffffffa5b03940 <__getnstimeofday64+144>:    sub    $0x3b9aca00,%rax---------------------1s就是1000000000 ns,循坏在这,而栈中的rax为 15b5c8320b8602cd
0xffffffffa5b03946 <__getnstimeofday64+150>:    add    $0x1,%edx---------ret++;edx is the lower 32 bit of rdx,rdx为00000000ee4479fe,所以edx为 0xee4479fe,也就是3997465086
0xffffffffa5b03949 <__getnstimeofday64+153>:    cmp    $0x3b9ac9ff,%rax-------------------------------------剩余是否小于1ns
0xffffffffa5b0394f <__getnstimeofday64+159>:    ja     0xffffffffa5b03940 <__getnstimeofday64+144>
 /include/linux/time.h: 215---对应 timespec_add_ns
0xffffffffa5b03951 <__getnstimeofday64+161>:    add    %rcx,%rdx---delta算出的秒值+之前保存的秒值,就是最新的秒值
0xffffffffa5b03954 <__getnstimeofday64+164>:    mov    %rax,0x8(%rbx)----剩余的ns,赋值给a->tv_nsec = ns;
0xffffffffa5b03958 <__getnstimeofday64+168>:    mov    %rdx,(%rbx)---加完delta秒值的最新的秒值,赋值给a->tv_sec
0xffffffffa5b0395b <__getnstimeofday64+171>:    cmpl   $0x1,0xc55702(%rip)        # 0xffffffffa6759064----if(timekeeping_suspended)
 /kernel/time/timekeeping.c: 512
0xffffffffa5b03962 <__getnstimeofday64+178>:    pop    %rbx
0xffffffffa5b03963 <__getnstimeofday64+179>:    pop    %r12
0xffffffffa5b03965 <__getnstimeofday64+181>:    pop    %r13

Seen from the stack, we loop in __getnstimeofday64+144

0xffffffffa5b03940 <__getnstimeofday64+144>:    sub    $0x3b9aca00,%rax---------------------1s就是1000000000 ns,循坏在这,而栈中的rax为 15b5c8320b8602cd

It turns out that we are looping in the timespec64_add_ns function:

static __always_inline void timespec64_add_ns(struct timespec64 *a, u64 ns)
{
  a->tv_sec += __iter_div_u64_rem(a->tv_nsec + ns, NSEC_PER_SEC, &ns);
  a->tv_nsec = ns;
}
__iter_div_u64_rem展开如下:

static __always_inline u32
__iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
{
  u32 ret = 0;

  while (dividend >= divisor) {//这个循环
    /* The following asm() prevents the compiler from
       optimising this loop into a modulo operation.  */
    asm("" : "+rm"(dividend));

    dividend -= divisor;
    ret++;
  }

  *remainder = dividend;

  return ret;
}

Our input divisor is NSEC_PER_SEC, which is 10 to the 9th power, and the hexadecimal is 0x3b9aca00. Since we are looping, our dividend is rax. Please note that the value is:

 RAX: 15b5c8320b8602cd

crash> p 0x15b5c8320b8602cd/0x3b9aca00
$7 = 1564376562

According to this calculation, to complete the calculation, you have to cycle 1564376562 so many times.
With such a large value, I really don't know how to cycle to the year of the monkey.
So how did this value come from? It turns out that this value is calculated by reading the cycle difference of closk_source twice before and after.

static inline s64 timekeeping_get_ns(struct tk_read_base *tkr)
{
  u64 delta;

  delta = timekeeping_get_delta(tkr);//caq:上次读取与本次读取之间的差值
  return timekeeping_delta_to_ns(tkr, delta);//caq:差值转换为ns
}

 delta的来源是:
 static inline u64 timekeeping_get_delta(struct tk_read_base *tkr)
{
  u64 cycle_now, delta;
  struct clocksource *clock;

  /* read clocksource: */
  clock = tkr->clock;
  cycle_now = tkr->clock->read(clock);//当前值是通过读取来的

  /* calculate the delta since the last update_wall_time */
  delta = clocksource_delta(cycle_now, clock->cycle_last, clock->mask);//计算差值

  return delta;
}

It turns out that the acquisition of delta is to read the cycle value of the current clocksource online, and then calculate the corresponding difference through clocksource_delta. According to the above code, first we have to know which current clocksource is:

crash> timekeeper
timekeeper = $1 = {
  tkr_mono = {------------------------------timekeeping_get_ns(&tk->tkr_mono)
    clock = 0xffffffffa662ea80, ------------这个就是 clocksource,这个值当前就是 clocksource_tsc
    cycle_last = 16728674596502256, 
    mult = 7340510, 
    shift = 24, 
    xtime_nsec = 2657092090049088, 这个值并不是ns,而是要 >>tkr->shift
    base = {
      tv64 = 2788453640047242
    }
  }, 
  tkr_raw = {
    clock = 0xffffffffa662ea80, 
    cycle_last = 16728674596502256, 
    mult = 8007931, 
    shift = 24, 
    xtime_nsec = 0, 
    base = {
      tv64 = 2788490058099290
    }
  }, 
  xtime_sec = 1602874921, ------------------当前的秒数

timekeeper selects the clocksource with the highest precision currently to work:

crash> dis -l 0xffffffffa662ea80
0xffffffffa662ea80 <clocksource_tsc>:   addb   $0xa5,-0x5d(%rcx)--------------就是 clocksource_tsc ,tsc就是一个clock_source


crash> clocksource_tsc
clocksource_tsc = $2 = {
  read = 0xffffffffa5a34180, -----------read_tsc
  cycle_last = 16728674596502256, ------上次更新墙上时间的时刻取的cycle值
  mask = 18446744073709551615, 
  mult = 8007931, 
  shift = 24, ----------------------注意位数
  max_idle_ns = 204347035648, 
  maxadj = 880872, 
  archdata = {
    vclock_mode = 1
  }, 
  name = 0xffffffffa647c1cd "tsc", ---名称
  list = {
    next = 0xffffffffa6633ff8, 
    prev = 0xffffffffa665c9b0
  }, 
  rating = 300, --------------优先级,
  enable = 0x0, 
  disable = 0x0, 
  flags = 35, ---没有CLOCK_SOURCE_UNSTABLE标志
  suspend = 0x0, 
  resume = 0x0, 
  owner = 0x0
}

The calculation and analysis of the difference are as follows:

static inline s64 timekeeping_delta_to_ns(struct tk_read_base *tkr,
            u64 delta)
{
  s64 nsec;//注意,这里是带符号数

  nsec = delta * tkr->mult + tkr->xtime_nsec;
  nsec >>= tkr->shift;//换算成ns

  /* If arch requires, add in get_arch_timeoffset() */
  return nsec + arch_gettimeoffset();
}

The return value of timekeeping_delta_to_ns is too large, there are two possibilities:
One is that delta is too large, delta * tkr->mult overflows the value of s64, this is considered a bug. Another possibility is that the delta value read directly before and after is too large, which involves update_wall_time not being called in time to read the cycle of the current clocksource.

4. Fault recurrence

This s64 overflow bug has been fixed in the community.

-static inline s64 timekeeping_delta_to_ns(struct tk_read_base *tkr,
+static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr,
                                          cycle_t delta)
 {
-       s64 nsec;
+       u64 nsec;

And check Red Hat’s changelog, and follow the upstream fix, but I think the risk is still there, because sometimes the update_wall_time is not so timely, and even if it is changed from s64 to u64, it does not solve the overflow problem, because the timekeeping_delta_to_ns function is obvious It can be seen that the 64 bits of u64 are not all used in the cycle difference. I believe that someone in the community should explode this issue eventually.

5. Fault avoidance or resolution

The possible solutions are:
Increase the alarm, and intervene in time for the softlock, which may cause the update_wall_time not to be updated in time.

Author profile

Anqing Senior Backend Engineer

Currently, he is mainly responsible for the virtualization of the Linux kernel, containers, and virtual machines.

Get more exciting content, scan the code and follow the [OPPO Digital Intelligence Technology] public account


OPPO数智技术
OPPO前沿互联网技术及活动分享,公众号:OPPO_tech
604 声望
945 粉丝
0 条评论
推荐阅读
OPPO云数据库访问服务技术揭秘
MySQL是OPPO使用最广泛的关系数据库,不同编程语言的微服务都是通过MySQL官方的SDK直连真实的数据库实例。这种最传统的使用方式,会给业务开发和数据库运维带来一系列影响效率和稳定性的问题。

OPPO数智技术1阅读 1.1k

终于卷完了!Redis 打怪升级进阶成神之路(2023 最新版)!
是一种非关系型数据库服务,它能解决常规数据库的并发能力,比如传统的数据库的IO与性能的瓶颈,同样它是关系型数据库的一个补充,有着比较好的高效率与高性能。专注于key-value查询的redis、memcached、ttserver。

民工哥11阅读 989

封面图
算法可视化:一文弄懂 10 大排序算法
在本文中,我们将通过动图可视化加文字的形式,循序渐进全面介绍不同类型的算法及其用途(包括原理、优缺点及使用场景)并提供 Python 和 JavaScript 两种语言的示例代码。除此之外,每个算法都会附有一些技术说...

破晓L7阅读 945

封面图
硬卷完了!MongoDB 打怪升级进阶成神之路( 2023 最新版 )!
前面我们学习:MySQL 打怪升级进阶成神之路、Redis 打怪升级进阶成神之路,然后我们还在继续 NoSQL 的卷王之路。从第一篇文章开始,我们逐步详细介绍了 MogoDB 基础概念、安装和最基本的CURD操作、索引和聚合、工...

民工哥6阅读 553

封面图
「刷起来」Go必看的进阶面试题详解
逃逸分析是Go语言中的一项重要优化技术,可以帮助程序减少内存分配和垃圾回收的开销,从而提高程序的性能。下面是一道涉及逃逸分析的面试题及其详解。

王中阳Go4阅读 1.9k评论 1

封面图
架构设计-高性能篇
大家好,我是易安!今天我们谈一谈架构设计中的高性能架构涉及到的底层思想。本文分为缓存架构,单服务器高性能模型,集群下的高性能模型三个部分,内容很干,希望你仔细阅读。

架构狂人4阅读 795

Nginx常用配置及和基本功能讲解 | 京东云技术团队
作者:京东物流 殷世杰Nginx已经广泛应用于J-one和Jdos的环境部署上,本文对Nginx的常用的配置和基本功能进行讲解,适合Nginx入门学习。1 核心配置找到Nginx安装目录下的conf目录下nginx.conf文件,Nginx的基本功...

京东云开发者6阅读 606

封面图
604 声望
945 粉丝
宣传栏