
Image credit: https://unsplash.com/photos/RgufvmXe4G4

Author: Yu Feng

1. Background

In the recommendation business, the classic layered pipeline consists of three layers: recall, ranking, and re-ranking. Of these, the online inference service behind the ranking layer is undoubtedly the most complex and challenging part.

After more than three years of continuous iteration, the Cloud Music online prediction system now serves models for multiple scenarios, including music recommendation, search, live streaming, and innovation businesses.

The main logic of the prediction system consists of three stages:

  • Feature query: look up user, scenario, item, and other features from distributed storage according to the business, then parse them in preparation for the feature-extraction stage;
  • Feature extraction: compute each input feature defined by the model from the features returned by the query stage, converting them into the data formats the model requires, such as embedding and hash;
  • Forward inference: feed the extracted data into the machine-learning library; after a series of operations such as matrix multiplication, output the user's score for an item.
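The three-stage flow above can be sketched in Python. Everything here — the in-memory feature store, the hashing scheme, and the "model" weights — is an illustrative stand-in for the production components, not the actual system:

```python
import hashlib
import math

# Illustrative in-memory "distributed storage" (stand-in for the real feature store).
FEATURE_STORE = {
    ("user", "u42"):   {"age_bucket": "25-30", "fav_genre": "rock"},
    ("item", "song7"): {"artist": "artist_x", "genre": "rock"},
}

def feature_query(user_id, item_id):
    """Stage 1: look up raw user/item features (IO-intensive in production)."""
    return {**FEATURE_STORE[("user", user_id)], **FEATURE_STORE[("item", item_id)]}

def feature_extract(raw, dim=8):
    """Stage 2: hash each raw feature into a fixed-size slot index."""
    return [int(hashlib.md5(f"{k}={v}".encode()).hexdigest(), 16) % dim
            for k, v in sorted(raw.items())]

def forward_inference(indices, dim=8):
    """Stage 3: toy "model" — sum illustrative per-slot weights, squash to (0, 1)."""
    weights = [0.1 * i for i in range(dim)]  # stand-in for learned parameters
    score = sum(weights[i] for i in indices)
    return 1.0 / (1.0 + math.exp(-score))

raw = feature_query("u42", "song7")
score = forward_inference(feature_extract(raw))
print(0.0 < score < 1.0)  # a probability-like score for the (user, item) pair
```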

Of the three stages, feature query is IO-intensive while the other two are CPU-intensive. The forward-inference stage in particular must complete a large number of operations such as matrix multiplication, with heavy memory allocation and read/write activity throughout.

At present, the Cloud Music online prediction system is deployed as dozens of clusters on hundreds of physical machines. Xeon E5 56-core machines were the mainstay when the system first launched, and Xeon Gold 96-core machines were purchased last year. The specific configurations of the two machine types are as follows:

Comparing the two configurations, the computing power of the high-end machine is at least twice that of the low-end machine. However, online operating metrics and daily stress-test data show that the actual business-processing capacity of the high-end machines does not scale linearly; there is a substantial performance loss.

The following table compares the business-processing capabilities of two scenario models on the two machine types:

2. NUMA architecture

When the computing power of a single-core processor could no longer meet the growing demand, processor manufacturers began developing in the multi-core direction.

Single-server architecture has evolved from uniform memory access (UMA) to non-uniform memory access (NUMA).

In the UMA architecture, the CPUs are peers with no primary/secondary relationship and share resources such as memory, the bus, and the operating system. Each CPU takes the same amount of time to access any memory address. Because all CPUs share the same memory, single-machine scalability is limited: as the number of CPUs grows, memory-access conflicts also grow sharply.

To further increase the number of CPUs in a single machine while keeping computing resources well utilized, the NUMA architecture emerged, improving machine scalability at the cost of memory-access latency.

In the NUMA architecture, each CPU has its own local memory and can reach the local memory of other CPUs through a high-speed interconnect. The defining feature of this architecture is that a CPU accesses its own local memory very quickly, while accessing a remote CPU's local memory crosses a longer path and incurs correspondingly higher latency.
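A back-of-the-envelope model makes the cost concrete: if local access takes t_local and remote access takes t_remote, expected memory latency grows linearly with the fraction of accesses that cross nodes. The nanosecond figures below are purely illustrative assumptions, not measurements from the machines discussed here:

```python
def avg_latency_ns(remote_fraction, local_ns=80.0, remote_ns=140.0):
    """Expected memory latency when `remote_fraction` of accesses cross nodes.

    local_ns / remote_ns are illustrative assumptions; real values depend on
    the CPU generation and the interconnect.
    """
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns

# Unbound dual-node deployment: roughly half the accesses may land on the remote node.
print(avg_latency_ns(0.5))   # 110.0
# Core-bound deployment: almost all accesses stay local.
print(avg_latency_ns(0.05))
```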

The lscpu command shows that all current online servers use a NUMA architecture with 2 CPU nodes. Next, we test whether the memory-access characteristics of the NUMA architecture are what prevent the high-end machines from achieving the expected (roughly twofold) improvement in single-machine processing capacity.
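On Linux, lscpu reports this topology in lines like the sample below (illustrative output for a 2-node, 96-core machine, not captured from the actual servers). A small helper can pull out each node's CPU list:

```python
# Illustrative lscpu output for a hypothetical 2-node, 96-core machine.
SAMPLE_LSCPU = """\
NUMA node(s):          2
NUMA node0 CPU(s):     0-47
NUMA node1 CPU(s):     48-95
"""

def numa_cpu_lists(lscpu_text):
    """Extract {node_id: cpu_range_string} from lscpu-style output."""
    nodes = {}
    for line in lscpu_text.splitlines():
        # Match "NUMA nodeN CPU(s):" lines, skipping the "NUMA node(s):" count line.
        if line.startswith("NUMA node") and "CPU(s)" in line:
            label, cpus = line.split(":")
            node_id = int(label.split()[1].lstrip("node"))
            nodes[node_id] = cpus.strip()
    return nodes

print(numa_cpu_lists(SAMPLE_LSCPU))  # {0: '0-47', 1: '48-95'}
```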

3. Core-affinity testing

To verify the impact of core affinity under the NUMA architecture on the performance of the prediction system, we tested three single-machine deployment modes: single node, dual node without core binding, and dual node with core binding. Core binding is done with the numactl command on Linux.
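With numactl, a dual-node core-bound deployment would start one server instance per node with something like `numactl --cpunodebind=0 --membind=0 <server>`, binding both the CPUs and the memory allocation of that instance to node 0. As a rough, CPU-only analogue for illustration (the article's deployment uses numactl itself), `os.sched_setaffinity` pins a process to a CPU set on Linux:

```python
import os

def bind_process_to_cpus(cpus):
    """Pin the current process to the given CPU set (Linux only).

    This is only a CPU-side analogue of numactl core binding: numactl also
    binds memory allocation to the node, which this sketch does not.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # non-Linux platform: nothing to do
    os.sched_setaffinity(0, cpus)       # 0 means "the current process"
    return os.sched_getaffinity(0)

# Pin to CPU 0 (always present); a real deployment would pass a node's full CPU list.
print(bind_process_to_cpus({0}))
```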

A complex-scenario model (Model A) and a simple-scenario model (Model B) were selected for testing, and single-machine request-processing capacity was measured at 60% CPU utilization under each of the three deployment modes.

3.1 Low-end machine test results

Model A:

Model B:

Compared with single-node deployment, dual-node core binding increases single-machine request throughput by about 10%–20%, although Model A's average latency under that mode also rises by about 10%. Dual-node deployment without core binding, by contrast, reduces single-machine request throughput by about 10% relative to single-node deployment.

Overall, on low-end machines, core affinity under the NUMA architecture has little effect on system performance.

3.2 High-end machine test results

Model A:

Model B:

For both Model A and Model B, dual-node core binding significantly improves single-machine request-processing capacity over single-node deployment, by about 75% and 49% respectively. Meanwhile, dual-node deployment without core binding performs worst: its thread count is much higher than single-node deployment's, so memory-access contention and thread-switching overhead become more pronounced.

4. Test conclusion

Compared with single-node deployment on the low-end machine, dual-node core-bound deployment on the high-end machine brings performance improvements of 169% on Model A and 112% on Model B, in line with the expectation that doubling the compute resources should bring a corresponding increase in request-processing capacity.

The performance gain from core binding is more pronounced on the high-end machine than on the low-end machine. The three deployment modes rank as follows: dual node with core binding > single node > dual node without core binding.

This article is published by the NetEase Cloud Music technical team. Reprinting in any form without authorization is prohibited. We recruit for a variety of technical positions all year round; if you are ready for a change and happen to like Cloud Music, contact us at staff.musicrecruit@service.netease.com.
