Meta's (formerly Facebook's) engineering team recently described a Linux kernel feature called Transparent Memory Offloading (TMO) in a blog post, reporting that it saves 20% to 32% of memory per Linux server. According to the post, TMO has been running on Meta's servers since 2021, and the team has successfully upstreamed TMO's operating system components into the Linux kernel.
Transparent Memory Offloading (TMO) is Meta's solution for heterogeneous data center environments. It introduces a new Linux kernel mechanism that measures, in real time, the work lost to resource shortages across CPU, memory, and I/O. Guided by this information, TMO automatically adjusts the amount of memory to offload to a heterogeneous device such as compressed memory or an SSD, without any prior knowledge of the application.
That is, TMO adapts to the performance characteristics of the device and to the application's sensitivity to slower memory accesses. In addition to application containers, TMO also identifies offloading opportunities in the sidecar containers that provide infrastructure-level functionality.
Offloading opportunities
In recent years, a number of cheaper non-DRAM memory technologies, such as NVMe SSDs, have been successfully deployed in data centers or are under development. In addition, emerging non-DDR memory bus technologies such as Compute Express Link (CXL) provide memory-like access semantics and approach DDR performance. The memory and storage hierarchy chart in Meta's post illustrates how the various technologies stack up against each other. The combination of these trends opens up opportunities for memory tiering that were not possible in the past.
With memory tiering, less frequently accessed data is migrated to slower memory. The application itself, a userspace library, the kernel, or the hypervisor can drive the migration process. Meta's work on TMO focuses on kernel-driven migration, or swapping, which can be applied transparently to many applications without modifying them.
Despite the simplicity of the concept, kernel-driven swapping for latency-sensitive data center applications is challenging at hyperscale. To address this, Meta built TMO, a transparent memory offloading solution for containerized environments.
Solution: Transparent Memory Offload
TMO consists of the following components:
- Pressure Stall Information (PSI), a Linux kernel component that measures, in real time, the work lost to resource shortages across CPU, memory, and I/O. For the first time, this makes it possible to directly measure an application's sensitivity to memory access slowdowns without resorting to fragile low-level metrics such as page promotion rates.
- Senpai, a userspace agent that applies mild, proactive memory pressure to offload memory effectively across diverse workloads and heterogeneous hardware with minimal impact on application performance.
- A swap mechanism that offloads memory at sub-threshold memory pressure levels, with a turnover rate proportional to that of the file cache. This contrasts with the historical behavior of swapping as an emergency overflow under severe memory pressure.
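To make the PSI component more concrete, here is a minimal sketch of reading its output. On Linux kernels with PSI enabled, `/proc/pressure/memory` reports a `some` line and a `full` line, each with `avg10`, `avg60`, `avg300` (percent of time stalled) and `total` (cumulative stall time in microseconds). The function name `parse_psi` and the sample values are illustrative assumptions, not part of TMO.

```python
def parse_psi(text):
    """Parse PSI text into {"some": {"avg10": ..., ...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # "total" is a cumulative microsecond counter; avgN fields are percentages.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


# Made-up sample in the format the kernel uses for /proc/pressure/memory.
SAMPLE = """\
some avg10=0.12 avg60=0.05 avg300=0.01 total=123456
full avg10=0.03 avg60=0.01 avg300=0.00 total=45678
"""

pressure = parse_psi(SAMPLE)
```

On a real machine the same function could be fed `open("/proc/pressure/memory").read()` instead of the sample string.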
The growing cost of DRAM as a share of total server cost is what prompted Meta to work on TMO. A chart in Meta's post shows the relative costs of DRAM, compressed memory, and SSD storage. Meta estimates the cost of compressed memory based on a 3x compression ratio, which represents the average across its production workloads.
Meta expects the cost of DRAM to grow to 33% of its infrastructure spending, while DRAM power consumption follows a similar trend and is expected to reach 38% of server infrastructure power consumption.
In addition to compressed memory, Meta equips all production servers with powerful NVMe SSDs. At the system level, NVMe SSDs account for less than 3 percent of server cost (about 3 times lower than compressed memory in current-generation servers). The chart also shows that, at capacity equal to DRAM, SSD stays under 1% of server cost across generations, which is about 10 times lower than compressed memory in cost per byte.
Although cheaper than DRAM, compressed memory and NVMe SSDs have poorer performance characteristics. Fortunately, typical memory access patterns provide plenty of opportunities for offloading to slower media. A chart in Meta's post shows "cold" application memory, that is, the percentage of pages that have not been accessed in the past 5 minutes. This memory can be offloaded to compressed memory or SSD without affecting application performance.
Overall, cold memory averages around 35% of total memory on Meta's servers. However, it varies widely across applications, ranging from 19% to 62%. This highlights the importance of an offloading method that is robust to diverse application behavior.
In addition to access frequency, an offloading solution also needs to consider what type of memory to offload. Applications access two main categories of memory: anonymous and file-backed. Anonymous memory is allocated directly by the application in the form of heap or stack pages. File-backed memory is allocated by the kernel's page cache to store frequently used file system data on the application's behalf.
TMO Design Overview
TMO consists of multiple parts across user space and the kernel. Senpai, a userspace agent, sits at the heart of the offloading operation: in a control loop around observed memory pressure, it engages the kernel's reclaim algorithm to identify the least-used memory pages and move them out to the offloading backend. During this process, the PSI (Pressure Stall Information) kernel component quantifies and reports memory pressure, and the reclaim algorithm is directed at specific applications through the kernel's cgroup2 memory controller.
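As a rough illustration of how pressure can be observed and reclaim directed at a single container, the sketch below uses two cgroup v2 files: `memory.pressure` (per-cgroup PSI) and `memory.reclaim` (available on recent kernels). The article does not say which control files Senpai actually writes, so treating `memory.reclaim` as the knob is an assumption, and the helper names are hypothetical.

```python
from pathlib import Path


def read_memory_pressure(cgroup_dir):
    """Return the raw PSI text for one cgroup (same format as /proc/pressure/memory)."""
    return (Path(cgroup_dir) / "memory.pressure").read_text()


def request_reclaim(cgroup_dir, nbytes):
    """Ask the kernel to reclaim up to nbytes from this cgroup only.

    Assumption: the kernel exposes cgroup v2 `memory.reclaim`. On a real
    system the write needs suitable privileges and may fail if the kernel
    cannot reclaim the requested amount.
    """
    (Path(cgroup_dir) / "memory.reclaim").write_text(str(nbytes))
```

A caller would pass a cgroup directory such as a path under `/sys/fs/cgroup/`; the exact path depends on how containers are laid out on the host.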
Senpai
Senpai sits on top of the PSI metric and uses pressure as feedback to determine how aggressively to drive kernel memory reclaim. If the pressure measured for a container falls below a given threshold, Senpai increases the reclaim rate; if the pressure rises above it, Senpai eases off. The pressure threshold is calibrated so that paging overhead does not functionally impact workload performance.
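The feedback rule described above can be sketched as a single control step. This is not Senpai's actual algorithm: the function name, the multiplicative step sizes, and the byte limits are all hypothetical values chosen only to show the direction of the adjustment.

```python
def next_reclaim_bytes(current, pressure, threshold,
                       step_up=1.25, back_off=0.5,
                       min_bytes=1 << 20, max_bytes=1 << 30):
    """Return the reclaim amount for the next control interval.

    All tuning values here are illustrative assumptions, not Senpai's.
    """
    if pressure < threshold:
        # The workload is not stalling on memory: probe for more cold pages.
        proposed = current * step_up
    else:
        # Pressure reached the threshold: back off so the workload is not hurt.
        proposed = current * back_off
    # Keep the request inside sane bounds.
    return int(min(max(proposed, min_bytes), max_bytes))
```

For example, with a threshold of 0.1, a current request of 8 MiB grows to 10 MiB when the measured pressure is 0.05 and shrinks to 4 MiB when it is 0.5.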
Swap algorithm
TMO aims to offload memory at pressure levels so low that they do not affect the workload. However, while Linux readily evicts the filesystem cache under pressure, it proved reluctant to move anonymous memory out to swap. Even when cold heap memory is known to exist and the file cache is thrashing beyond the TMO pressure threshold, the configured swap space can sit frustratingly idle.
Therefore, TMO introduces a new swap algorithm that takes advantage of fast SSDs without regressing legacy setups that still use rotating storage media. It does this by tracking the rate at which filesystem cache pages are refaulted in the system and engaging swap in proportion. That is, for each file page that has to be read from the file system repeatedly, the kernel tries to swap out one anonymous page, making room for the thrashing page. If swap-ins occur, reclaim pushes back against the file cache again.
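One way to picture this balancing is a toy model that splits a reclaim batch between anonymous and file pages according to observed thrashing on each side. This is a simplification for illustration and not the kernel's implementation; `split_reclaim` and its counters are hypothetical names.

```python
def split_reclaim(pages, file_refaults, swap_ins):
    """Split a reclaim batch into (anon_pages, file_pages).

    Toy model: file refaults shift reclaim toward anonymous memory,
    while swap-ins push it back toward the file cache.
    """
    total = file_refaults + swap_ins
    if total == 0:
        # No thrashing observed on either side: only trim the file cache.
        return 0, pages
    anon = pages * file_refaults // total
    return anon, pages - anon
```

With 30 file refaults and 10 swap-ins, a batch of 100 pages is split into 75 anonymous pages and 25 file pages; as swap-ins rise, the anonymous share shrinks.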
Currently, Meta manually selects the offloading backend, choosing between compressed memory and SSD-backed swap based on the application's memory compressibility and its sensitivity to memory access slowdowns. While tools could be developed to automate this process, a more fundamental solution would have the kernel manage a hierarchy of offloading backends (for example, automatically using zswap for warmer pages and SSD for colder or less compressible pages, and folding NVM and CXL devices into the memory hierarchy in the future). The kernel reclaim algorithm should balance dynamically across these memory pools, and Meta is actively working on this architecture.
With upcoming bus technologies such as CXL providing memory-like access semantics, memory offloading can help offload not only cold memory but also warm memory. Meta is also actively working on the architecture needed to use CXL devices as memory offloading backends.
Reference link: https://engineering.fb.com/2022/06/20/data-infrastructure/transparent-memory-offloading-more-memory-at-a-fraction-of-the-cost-and-power/