
4Paradigm is deeply engaged in the field of artificial intelligence, with both breadth and depth of expertise across AI algorithms, applications, systems, and underlying architecture design.

With the rapid advance of storage technology in recent years, disruptive storage devices such as non-volatile memory and SSDs have emerged. Heterogeneous memory architectures built on these technologies are reshaping how traditional applications are designed and optimized.

4Paradigm was an early mover on heterogeneous memory architecture and has carried out a number of innovative explorations, R&D efforts, and production deployments, such as a parameter server [4Paradigm launched the industry's first trillion-dimensional online inference system based on persistent memory with millisecond-level recovery: https://www.163.com/tech/article/FGCFSO4N00099A7M.html] and an in-memory database [joint research by Intel and 4Paradigm on a trillion-dimensional feature online inference system optimized with Optane™ persistent memory, accepted at the top-tier conference VLDB: https://newsroom.intel.cn/news-releases/the-joint-research-results-of-intel-and-4paradigm-were-selected-into-the-vldb-international-conference/].

This article introduces the technical background of heterogeneous memory architecture and our practice of applying it to an automatic machine learning system.

Heterogeneous memory architecture

Traditionally, what we call memory refers to dynamic random-access memory, that is, DRAM. In addition, the CPU contains small-capacity, fast storage devices, generally called CPU caches (i.e., the L1/L2 caches). Persistent but slow storage devices, such as disks, constitute external storage. External storage, memory, and CPU caches together form the storage architecture pyramid. However, with the commercialization of revolutionary non-volatile memory technology, the memory tier of this pyramid is no longer composed of DRAM alone, but is a heterogeneous memory architecture composed of DRAM and non-volatile memory.

In addition, the emergence of non-volatile memory has blurred the functional boundary between memory and external storage, making in-memory data persistence possible. Today, non-volatile memory technology is fully mature; the Intel® Optane™ persistent memory (hereafter persistent memory, or PMem) released by Intel in 2019 is a representative product of this technology.

Figure 1. Storage architecture pyramid based on heterogeneous memory

Figure 1 shows the storage architecture pyramid including heterogeneous memory. In essence, persistent memory sits between DRAM and external storage in the pyramid, and its capacity, performance, and cost all fall between the two. It is even functionally a hybrid of the two: it can be used directly as memory (Memory mode) or as a persistence device (App Direct mode, AD mode for short).

In Memory mode, persistent memory is transparent to the operating system, and its capacity is directly reflected in the total available memory; in AD mode, the storage hierarchy is exposed and fully controlled by the developer. Because of this special role of persistent memory, the modern memory architecture not only has more levels but also gains revolutionary new functionality. Developers therefore need to think harder about how to make good use of the heterogeneous memory architecture, for example:

  • Optimization of multi-level storage. Persistent memory provides a memory solution whose performance approaches DRAM at a lower cost, which is very attractive for applications with huge memory consumption. However, the deeper storage hierarchy also raises the bar for performance optimization. High-performance caching matters greatly in performance tuning: on the one hand, real data often has hot spots, and caching can effectively improve access performance for hot data; on the other hand, cache-conscious data structures often require sophisticated designs to squeeze out hardware performance. The arrival of persistent memory makes this storage hierarchy more complex and places higher demands on the design of multi-level caching mechanisms, data structures, and algorithms.
  • Utilization of the persistence mechanism. With persistent memory, external storage is no longer the only option for persisting data. Persistent memory offers far higher persistence performance than traditional external storage devices, though with relatively smaller capacity. How to exploit its combination of high performance and persistence becomes a new question when deploying an application. For example, for online services that must guarantee quality of service around the clock, in-memory data persistence enables fast recovery after going offline; and where disk I/O is the performance bottleneck, persistent memory can serve as the storage medium to improve overall system performance.
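To make the AD-mode programming style concrete, here is a minimal Python sketch of the typical access pattern: memory-map a file and persist stores by flushing. On a real system, PMem in AD mode is exposed through a DAX filesystem (e.g. mounted at a path like /mnt/pmem); an ordinary temporary file stands in for the PMem-backed file here, so this is an illustration of the pattern, not actual PMem code.

```python
import mmap, os, tempfile

# Sketch of AD-mode style persistence: memory-map a file and access it
# with plain loads/stores instead of read()/write() syscalls.
# A temp file stands in for a file on a PMem DAX filesystem.
SIZE = 4096
fd, path = tempfile.mkstemp()
os.ftruncate(fd, SIZE)
buf = mmap.mmap(fd, SIZE)            # direct load/store access to the mapping

record = b"event-42"
buf[0:len(record)] = record          # a plain store into the mapped region
buf.flush()                          # on real PMem this maps to cache-line flushes

data = bytes(buf[0:len(record)])
print(data)                          # b'event-42'

buf.close(); os.close(fd); os.unlink(path)
```

On actual persistent memory, libraries such as PMDK handle the flush and fence details so that stores become durable at cache-line granularity.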

To give a better sense of how a heterogeneous memory architecture delivers value in real scenarios, we next share some ideas and 4Paradigm's practical experience with heterogeneous memory.

Optimizing an automatic machine learning system on heterogeneous memory

Figure 2 shows a typical end-to-end automatic machine learning (AutoML) pipeline in a 4Paradigm product. It consists mainly of offline exploration and online inference. Offline exploration produces feature engineering scripts and models that can go online, through automatic feature engineering and model training. After receiving a user request, the online inference service obtains prediction results through real-time feature extraction and model inference. Meanwhile, the message queue plays a key role in collecting and distributing data throughout the system.

As Table 1 shows, under the heterogeneous memory architecture, persistent memory is used in different ways in different components to achieve different optimization goals. In general, Memory mode enables fast, low-cost memory capacity expansion, while AD mode brings further benefits, including fast recovery and improved data storage performance.

4Paradigm has decoupled the key technology components optimized for heterogeneous memory and contributed them to the open source community. Currently this includes two projects: the high-performance message queue system Pafka (https://github.com/4paradigm/pafka) and the high-performance KV storage engine PmemStore (https://github.com/4paradigm/pmemstore), optimized for AI workloads. The following mainly introduces Pafka.

Pafka: High-performance message queuing system based on heterogeneous memory optimization

Kafka is an open source distributed event streaming/message queue system for processing real-time data streams efficiently and reliably, and it is used very widely in industry. However, because of its persistence logic, its performance (throughput and latency) is often constrained by external storage devices (HDD/SSD). In practice, to increase the overall throughput of a Kafka cluster, enterprises have to scale out the cluster, which raises their total cost.

Persistent memory features high-speed persistence, achieving several times or even dozens of times the persistence performance of traditional hard disks and SSDs. Pafka, a version of Kafka optimized for heterogeneous memory architecture, exploits this high-speed persistence to greatly increase single-node throughput and thereby reduce the total investment in the cluster. Compared with traditional Kafka deployments, Pafka brings the following advantages:

  • Compared with the SATA SSD configuration common in today's data centers, heterogeneous-memory-based Pafka improves single-node throughput and latency by a factor of 20.
  • Because of the significant increase in per-node throughput, Pafka can cut hardware investment by more than 10 times relative to Kafka for the same cluster-level throughput.
  • Pafka is optimized directly on top of Kafka: existing Kafka-based business code requires no modification and can be migrated to Pafka at zero code-transformation cost.

Our optimization of Kafka focuses on the data persistence path that causes the performance bottleneck. In Kafka's original architecture, data is persisted only at the external storage level (disk/SSD); the optimized Pafka is based on a heterogeneous memory architecture and uses both persistent memory and external storage for data persistence.

Persistent memory, with its high persistence performance, serves as the first tier of the persistence hierarchy, while external storage, with larger capacity but lower performance, serves as the second tier; the two are managed through a caching mechanism.

Because message queues follow a producer/consumer usage pattern, data access in most scenarios hits the high-performance persistent memory.
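The tiering idea above can be sketched as follows. This is a simplified Python illustration of the two-tier design (not Pafka's actual Java code): recent segments live in a fixed-capacity fast tier standing in for PMem, and the oldest segment migrates to a slow tier standing in for disk when the fast tier fills up; the names and capacities are invented for the example.

```python
from collections import OrderedDict

# Simplified two-tier persistence sketch: newest segments stay in the
# fast "pmem" tier; when it is full, the oldest segment is migrated to
# the slower "disk" tier. With a producer/consumer workload, reads of
# recent data hit the fast tier.
class TieredSegmentStore:
    def __init__(self, pmem_capacity):
        self.pmem_capacity = pmem_capacity
        self.pmem = OrderedDict()   # segment_id -> data (fast tier)
        self.disk = {}              # segment_id -> data (slow tier)

    def append(self, seg_id, data):
        if len(self.pmem) >= self.pmem_capacity:
            old_id, old_data = self.pmem.popitem(last=False)  # evict oldest
            self.disk[old_id] = old_data                      # migrate to disk tier
        self.pmem[seg_id] = data

    def read(self, seg_id):
        if seg_id in self.pmem:
            return self.pmem[seg_id], "pmem"
        return self.disk[seg_id], "disk"

store = TieredSegmentStore(pmem_capacity=2)
for i in range(3):
    store.append(i, f"segment-{i}")
print(store.read(2))   # ('segment-2', 'pmem')  -- recent segment, fast tier
print(store.read(0))   # ('segment-0', 'disk')  -- oldest segment, migrated
```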

Figure 3. Pafka cluster architecture
As shown in Figure 3, a Kafka server cluster consists of a few to hundreds of brokers. Each broker is internally divided into partitions, which are further divided into segments for message storage. Our changes to Kafka focus on the segment storage data structure. Originally, segments could only be stored on external storage devices such as HDD/SSD; we use PMDK to perform persistence on heterogeneous memory and introduce the concept of MixChannel, so that a segment can be stored either on HDD/SSD external storage or on persistent memory.

Specifically, MixChannel manages the common file interface and the persistent memory interface uniformly, so the underlying storage medium is transparent to upper-level components. To support persistent-memory-based storage, we introduced a data structure called PMemChannel for MixChannel. Its main function is to wrap a persistent memory MemoryBlock object in an interface that satisfies the FileChannel API, so that MixChannel can conveniently choose between the traditional file-based FileChannel and the persistent-memory-based PMemChannel. Here we use the PersistentMemoryBlock of PMDK LLPL, which automatically persists data on every write. To support zero-copy, we also implemented a zero-copy ByteBuffer interface for LLPL's MemoryBlock by mapping the persistent memory address directly into a ByteBuffer, avoiding multiple memory copies and improving performance.
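The MixChannel idea, a single channel interface whose backing medium is hidden from callers, can be sketched in Python as follows. The class and method names here are illustrative analogues, not Pafka's actual Java API; an anonymous mmap region stands in for an LLPL MemoryBlock.

```python
import mmap, tempfile

# Illustrative analogue of MixChannel: one interface, two backends.
# Writes either go through the filesystem (file backend) or are direct
# stores into a memory-mapped region (stand-in for a PMem MemoryBlock).
class MixChannel:
    def __init__(self, backing):
        self.backing = backing          # file object or mmap region
        self.pos = 0

    def write(self, data: bytes) -> int:
        if isinstance(self.backing, mmap.mmap):
            self.backing[self.pos:self.pos + len(data)] = data  # direct store
        else:
            self.backing.seek(self.pos)
            self.backing.write(data)    # goes through the file system
        self.pos += len(data)
        return len(data)

    def read(self, offset: int, length: int) -> bytes:
        if isinstance(self.backing, mmap.mmap):
            return bytes(self.backing[offset:offset + length])
        self.backing.seek(offset)
        return self.backing.read(length)

# File-backed channel (the traditional FileChannel path)
ch_file = MixChannel(tempfile.TemporaryFile())
ch_file.write(b"hello")

# "PMem"-backed channel (the PMemChannel path)
ch_pmem = MixChannel(mmap.mmap(-1, 4096))
ch_pmem.write(b"hello")

print(ch_file.read(0, 5), ch_pmem.read(0, 5))   # b'hello' b'hello'
```

Callers see the same interface either way, which is the point: the upper-level segment code does not need to know which medium it is writing to.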

To maintain the correspondence between each segment and its data in persistent memory, we allocate a persistent-memory MemoryBlock for each segment and maintain the mapping through the ObjectDirectory of PMDK PCJ.

In addition, to avoid the overhead of dynamically allocating MemoryBlocks while Pafka is running, we pre-allocate a fixed proportion of the memory as a pool at initialization, from which MemoryBlocks can be rapidly allocated when data is written.
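The pre-allocation idea can be illustrated with a small Python sketch: carve one large region out of (stand-in) persistent memory at startup, then hand out fixed-size blocks from a free list on the write path without touching the allocator. Block size and pool size here are invented for the example.

```python
import mmap

# Sketch of a pre-allocated block pool: one big region reserved up front,
# allocation is just popping an index off a free list (O(1), no dynamic
# allocation on the hot write path). An anonymous mmap stands in for PMem.
BLOCK_SIZE = 1024
POOL_BLOCKS = 8

class BlockPool:
    def __init__(self):
        self.pool = mmap.mmap(-1, BLOCK_SIZE * POOL_BLOCKS)  # reserved at init
        self.free = list(range(POOL_BLOCKS))                 # free block indices

    def allocate(self):
        idx = self.free.pop()            # fast path: no allocator involved
        return idx * BLOCK_SIZE          # byte offset of the block

    def release(self, offset):
        self.free.append(offset // BLOCK_SIZE)

pool = BlockPool()
off = pool.allocate()
pool.pool[off:off + 7] = b"segment"      # write segment data into the block
print(bytes(pool.pool[off:off + 7]))     # b'segment'
```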

Performance comparison

Figure 4 shows that, compared with Kafka persisting to the SATA SSDs commonly used in data centers, heterogeneous-memory-optimized Pafka achieves a 20-fold improvement in both throughput and latency.

Cost comparison

Suppose our goal is an overall throughput of 20 GB/s. We compared Pafka on heterogeneous memory with Kafka on SATA SSDs. Figure 5 shows that to reach a total throughput of 20 GB/s, 45 SATA-SSD-based servers are needed, versus only 3 heterogeneous-memory-based servers. In terms of hardware cost, traditional Kafka (SATA SSD) requires US$450,000, while the Pafka solution costs only US$40,500, reducing hardware cost to 9% of the traditional Kafka solution.

Figure 5. Cost comparison between the Pafka and Kafka solutions at a total throughput of 20 GB/s

More information

Pafka is an open source project from 4Paradigm. For usage details, technical support, and complete performance reports, see the following channels:
- GitHub repo: https://github.com/4paradigm/pafka
- Slack channel: https://join.slack.com/t/memarkworkspace/shared_invite/zt-o1wa5wqt-euKxFgyrUUrQCqJ4rE0oPw
- MemArk heterogeneous storage technology forum: https://discuss.memark.io/

