Author: vivo Internet Server Team - Hao Chan
With the rapid development of Internet services, the availability of infrastructure has attracted more and more attention from the industry. Memory failures occur frequently and have a large impact, which is unacceptable for upper-layer services.
This article mainly introduces the application of the EDAC (Error Detection And Correction) framework to memory failure prediction. It first introduces the background of EDAC, then its working principle, then describes in detail how EDAC is applied on vivo servers through the installation, configuration, and testing process, and finally summarizes the use of EDAC for memory failure prediction and looks ahead to applying server RAS (Reliability, Availability and Serviceability) to reduce the impact of hardware failures on the system.
1. Background introduction
With the rapid development of Internet services, the availability of infrastructure has attracted more and more attention from the industry. However, hardware failures have always been common, and the losses they cause are often huge. Among server components, memory failure is the second most common type of hardware failure after hard disk failure. In addition, the number of memory modules in a server fleet is large: vivo operates more than 400,000 of them. The most serious consequence of a memory failure is a system crash and server downtime, which are unacceptable for upper-layer services.
Memory errors can be divided into UCE (Uncorrectable Error) and CE (Correctable Error). When the hardware detects an error, it reports it to the CPU in one of two ways, one of which is an interrupt; in the case of a UCE, an uncorrectable error, this can bring the server down immediately. If it is a CE, the error can be corrected, and the hardware uses some resources to repair it. When CEs accumulate on a memory module beyond what the hardware can repair, a UCE will eventually be generated, causing the system to crash and restart. Therefore, we need to find memory modules with excessive CEs as early as possible and replace them in time to avoid major losses.
In the past, most memory faults were found and located by combining MCE (Machine Check Exception) logs with the SEL (System Event Log) recorded by the BMC. The biggest problem with this approach is that memory problems cannot be found in advance; they are usually discovered passively, only after the server has already crashed and restarted. In addition, there are the following problems:
- It is difficult to directly locate the faulty memory slot in MCE logs.
- There is no intuitive CE/UCE error count.
- The health of a memory module cannot be judged from its accumulated CE/UCE counts.
To address the above problems, we needed another solution, and that is when EDAC came into view. It solves all of the problems mentioned above and enables active detection of memory CEs so that memory problems can be found in advance.
This article mainly introduces the principle of EDAC and how failure prediction is realized with it.
2. Introduction to the principle of EDAC
EDAC (Error Detection And Correction) is the Linux framework for error detection and correction. Its purpose is to detect and report hardware errors that occur while the Linux system is running. EDAC consists of a core (edac_core.ko) and multiple memory controller driver modules. Its subsystems include edac_mc, edac_device, and PCI bus scanning, which collect errors reported by memory controllers, other controllers (such as the L3 cache controller), and PCI devices respectively.
This section mainly describes how the edac_mc subsystem collects errors from the memory controller. Memory CEs and UCEs are the main error types handled by edac_mc, which mainly involves the following functions:
- [edac_mc_alloc()] : The structure mem_ctl_info describes a memory controller and is accessed only by the EDAC core; edac_mc_alloc() allocates and fills this structure.
- [edac_device_handle_ce()] : Marks a CE error.
- [edac_device_handle_ue()] : Marks a UCE error.
- [edac_mc_handle_error()] : Reports memory events to user space. Its parameters include the location of the fault in the memory hierarchy, the fault type, and the accumulated UCE/CE error counts.
- [edac_raw_mc_handle_error()] : Reports memory events to user space but does nothing to determine their location; it is called directly only when the hardware error comes from the BIOS, and is otherwise invoked through edac_mc_handle_error().
So how does EDAC control and report device failures? How does it locate and record a fault on the corresponding memory module?
Linux uses the sysfs file system to expose the hierarchy of kernel devices, and EDAC uses it to control and report device failures. EDAC locates a fault on the corresponding memory module through an abstract memory controller model, which mirrors how memory is arranged in the system. Each MC (memory controller) device corresponding to a CPU controls a group of DIMM memory modules, which are organized by chip-select row (Chip-Select Row, csrowX) and channel (Channel, chX); there can be multiple csrows and multiple channels.
Related files can be viewed through the following paths:
# ls /sys/devices/system/edac/mc/mc0/csrow0/
ce_count ch0_ce_count ch0_dimm_label ch1_ce_count ch1_dimm_label dev_type edac_mode mem_type power size_mb subsystem ue_count uevent
The purpose of the main files is as follows:
- ce_count / ue_count : the total number of correctable / uncorrectable errors that have occurred on this csrow
- chX_ce_count : the number of correctable errors on channel X of this csrow
- chX_dimm_label : a user-settable label identifying the physical DIMM in channel X
- size_mb : the amount of memory (in MB) covered by this csrow
- mem_type : the type of memory in use
- dev_type : the DRAM device width (e.g. x4, x8, x16)
- edac_mode : the error detection and correction mode in use
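These counters can be read directly from sysfs. For example (a minimal illustration; mc0/csrow0 are just the paths listed above and will differ from machine to machine):
# Total CE/UE counts accumulated by memory controller 0
# cat /sys/devices/system/edac/mc/mc0/ce_count
# cat /sys/devices/system/edac/mc/mc0/ue_count
# Per-channel CE counts and DIMM labels on csrow0
# grep . /sys/devices/system/edac/mc/mc0/csrow0/ch*_ce_count
# grep . /sys/devices/system/edac/mc/mc0/csrow0/ch*_dimm_label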
If EDAC finds that a hardware controller has reported a UE event and the controller requires the system to halt immediately on a UE, the system will be restarted. CE events detected by the controller can be regarded as predictors of future UCE events; we can reduce the probability of UE events and system downtime through some masking methods or by replacing the memory module.
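For reference, the stock edac_core module exposes this behavior through module parameters. A minimal way to check them (the paths are an assumption based on the upstream edac_core module and may vary across kernel versions):
# 1 = panic (and therefore restart) when a UE is reported, 0 = do not panic
# cat /sys/module/edac_core/parameters/edac_mc_panic_on_ue
# 1 = log CE events to the kernel log, 0 = do not log them
# cat /sys/module/edac_core/parameters/edac_mc_log_ce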
3. Application of EDAC
The application of EDAC in vivo's production environment is mainly divided into the following steps:
(1) EDAC support in Linux system
EDAC has been supported since Linux kernel 2.6.16 and in the corresponding distributions, but the kernel contains many EDAC driver modules, and different system versions support different ones. You can check which driver modules your system provides as follows:
# ls /lib/modules/3.10.0-693.el7.x86_64/kernel/drivers/edac/
amd64_edac_mod.ko.xz edac_core.ko.xz i3000_edac.ko.xz i5000_edac.ko.xz i5400_edac.ko.xz i7core_edac.ko.xz ie31200_edac.ko.xz skx_edac.ko.xz
e752x_edac.ko.xz edac_mce_amd.ko.xz i3200_edac.ko.xz i5100_edac.ko.xz i7300_edac.ko.xz i82975x_edac.ko.xz sb_edac.ko.xz x38_edac.ko.xz
So what is the difference between these driver modules, and how should we choose? Taking sb_edac and skx_edac as examples, let's first look at their descriptions.
# modinfo sb_edac
filename: /lib/modules/3.10.0-693.el7.x86_64/kernel/drivers/edac/sb_edac.ko.xz
description: MC Driver for Intel Sandy Bridge and Ivy Bridge memory controllers - Ver: 1.1.1
...
# modinfo skx_edac
filename: /lib/modules/3.10.0-693.el7.x86_64/kernel/drivers/edac/skx_edac.ko.xz
description: MC Driver for Intel Skylake server processors
...
From the descriptions we can see that the driver module depends on the CPU's product architecture, and installing a mismatched module results in an error such as "edac-util: Error: No memory controller data found". Our tests show that, in general, if a driver module matching the CPU architecture exists, the system will load it by default.
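To confirm which EDAC driver is actually loaded and whether a memory controller has been registered, checks like the following can be used (edac-util is provided by the edac-utils package; exact output varies by platform):
# Show the EDAC driver modules currently loaded
# lsmod | grep edac
# Confirm that EDAC has detected at least one memory controller
# edac-util --status
# List the memory controllers registered in sysfs
# ls /sys/devices/system/edac/mc/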
(2) Configure the correspondence between memory slots and physical slots
Through the sysfs file system, we can see which CPU and which memory controller a DIMM belongs to, which channel it is on, and its CE count. But in day-to-day server operations what we actually work with is the system slot name, so how do the two correspond?
After looking at the source code of edac-utils, we found that it provides the configuration file labels.db to store the correspondence between the system slot names and the physical slots of the server's memory.
# cat /etc/edac/labels.db
# EDAC Motherboard DIMM labels Database file.
#
# $Id: labels.db 102 2008-09-25 15:52:07Z grondo $
#
# Vendor-name and model-name are found from the program 'dmidecode'
# labels are found from the silk screen on the motherboard.
#
#Vendor: <vendor-name>
# Model: <model-name>
# <label>: <mc>.<row>.<channel>
When writing this file, we need to know how the memory modules are physically populated in the server and which system slot name each of them corresponds to. Slot names differ between server models. In general, the population scheme that maximizes memory performance is a symmetric one: channels farther from the CPU are populated first, and within each channel the slot farther from the CPU is populated first.
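Following the template in the file header, an entry looks roughly like the sketch below; the vendor name, model name, and DIMM labels here are purely hypothetical, and the real values come from dmidecode and the motherboard silkscreen:
Vendor: Example Vendor Inc.
  Model: Example-Server-Model
    DIMM_A0: 0.0.0;  DIMM_A1: 0.0.1;
    DIMM_B0: 0.1.0;  DIMM_B1: 0.1.1;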
After the configuration is complete, checking whether it is correct mainly involves two steps (see the example below):
① Use edac-ctl to check whether the SYSFS CONTENTS column matches the configured labels
② Use dmidecode -t memory to check whether the memory slot names are consistent
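A typical check, assuming labels.db has already been placed under /etc/edac/, might look like this:
# Register the labels and print the resulting mapping (LABELS vs. SYSFS CONTENTS)
# edac-ctl --register-labels
# edac-ctl --print-labels
# Compare against the slot names reported by SMBIOS
# dmidecode -t memory | grep Locator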
Here we also ran into a problem with the rpm package: if the manufacturer's motherboard model name has extra spaces before or after it, edac-ctl fails to recognize the model name and labels.db cannot be registered successfully. In the end we modified the source code of the edac-utils package and rebuilt it.
(3) Testing and Verification
Once installation and configuration are complete, it is time to test and verify. How do we verify that EDAC works correctly and that CE errors are recorded against the right memory module? We can use APEI error injection to run some fault injection tests.
APEI error injection relies on APEI (ACPI Platform Error Interface), whose structure contains four tables:
- BERT (Boot Error Record Table) : Mainly used to record errors that occur during startup
- ERST (Error Record Serialization Table) : an abstract interface for persistently storing errors. It stores various hardware- or platform-related errors; the error types include Corrected Error (CE), Uncorrected Recoverable Error (UCR), and Uncorrected Non-Recoverable (Fatal) Error.
- EINJ (Error Injection Table) : its main function is to inject and trigger errors; it is a table used for testing.
- HEST (Hardware Error Source Table) : defines many error sources and error types. The purpose of defining these hardware error sources is to standardize the implementation of the hardware and software error interfaces.
Here, the test injects memory errors through the EINJ table of the kernel's APEI structure via debugfs. Debugfs is a virtual file system used for kernel debugging; in short, it maps kernel data into user space so that users can modify it for debugging purposes.
The method steps are as follows:
# Check whether the EINJ table exists
# ls /sys/firmware/acpi/tables/EINJ
# Check that the following kernel config options are enabled
# grep <the options below> /boot/config-3.10.0-693.el7.x86_64
CONFIG_DEBUG_FS=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_EINJ=m
# Install the einj module
# modprobe einj
# Check the usable memory address ranges. This step is needed because /proc/iomem records how physical addresses are allocated; some addresses are reserved by the system or occupied by other devices and cannot be used for error injection.
# cat /proc/iomem | grep "System RAM"
00001000-000997ff
00100000-69f79fff
6c867000-6c9e6fff
6f345000-6f7fffff
100000000-407fffffff
# Check the memory page size
# getconf PAGESIZE
4096 (i.e. 4 KB)
# Enter the error injection directory under debugfs
# cat /proc/mounts | grep debugfs
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
# cd /sys/kernel/debug/apei/einj/
# Check which error types can be injected
# cat available_error_type
0x00000008 Memory Correctable
0x00000010 Memory Uncorrectable non-fatal
0x00000020 Memory Uncorrectable fatal
# Write the type of error to inject
# echo 0x8 > error_type
# Write the memory address mask
# echo 0xfffffffffffff000 > param2
# Write the memory address
# echo 0x32dec000 > param1
# Write 0x0; if set to 1, the trigger step is skipped
# echo 0x0 > notrigger
# Write any integer to trigger the error injection; this is the last step of the injection
# echo 1 > error_inject
# Check the logs
# tail /var/log/messages
xxxxxx xxxxxxxx kernel: [2258720.203422] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x32dec offset:0x0 grain:32 syndrome:0x0 - err_code:0101:0090 socket:0 imc:0 rank:0 bg:0 ba:3 row:327 col:300)
# Check with edac-util -v: the CE count on the corresponding DIMM has increased
4. Summary and Outlook
- EDAC gives us a clear CE count for every memory module in a server. Based on it we can set thresholds, analyze the CE count curve, and combine MCE logs, SEL, and other sources to evaluate memory health and make failure predictions; a minimal thresholding sketch follows this list. Since EDAC was fully rolled out on vivo's servers, it has detected 450+ memory CE problems in advance, and the number of server downtime incidents has dropped significantly. For servers that meet the repair criteria, the business is migrated and the affected memory module is replaced, avoiding the instability and losses that a sudden server crash would cause.
- EDAC is only a small part of the application of server RAS (Reliability, Availability and Serviceability) to memory. RAS refers to the combination of software and hardware techniques used to guarantee these three capabilities of a server. RAS includes many memory-related optimizations, such as MCA (Machine Check Architecture) recovery. In the future, we will also introduce RAS to mitigate the impact of hardware failures on the system.
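As a simple illustration of the thresholding idea above, here is a minimal sketch (not vivo's production tooling; the threshold value is purely illustrative) that walks the sysfs CE counters directly:
#!/bin/bash
# Print any DIMM channel whose accumulated CE count exceeds an illustrative
# threshold. A real policy would also consider the CE growth rate, MCE logs,
# SEL entries, and so on.
CE_THRESHOLD=10
for f in /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count; do
    [ -e "$f" ] || continue          # skip if the glob did not match anything
    count=$(cat "$f")
    if [ "$count" -gt "$CE_THRESHOLD" ]; then
        echo "CE threshold exceeded ($count): $f"
    fi
done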
References:
- https://www.kernel.org/doc/html/latest/driver-api/edac.html
- https://www.kernel.org/doc/html/latest/admin-guide/ras.html
- https://www.kernel.org/doc/html/latest/firmware-guide/acpi/apei/einj.html
- https://github.com/grondo/edac-utils/
- https://uefi.org/specs/ACPI/6.4/18_ACPI_Platform_Error_Interfaces/ACPI_PLatform_Error_Interfaces.html