RDNA 4's "Out-of-Order" Memory Accesses

AMD's RDNA 4 enhances the memory subsystem with new queues that allow memory accesses to complete out of order. Before RDNA 4, a false dependency could arise between waves: because RDNA 3 returned data in strict order, one wave could end up waiting on another wave's memory load.
Testing showed that on RDNA 3, one wave's cache misses could prevent another wave from consuming its cache hits. In the test, "wave Y" generated cache misses and "wave X" had to wait behind them. On RDNA 4, the two waves no longer affect each other.
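The effect of strict return ordering can be illustrated with a toy model. This is only a sketch under stated assumptions, not a description of AMD's hardware: the `Request` type, the `delivery_cycles` function, and the latency figures are all made up for illustration. It compares a shared in-order return path with one where each request returns as soon as its data is ready.

```python
from dataclasses import dataclass


@dataclass
class Request:
    wave: str         # which wave issued the request (hypothetical label)
    issue_cycle: int  # cycle at which the request was issued
    latency: int      # assumed cycles until the data is ready


def delivery_cycles(requests, in_order):
    """Return (wave, cycle) pairs: when each request's data reaches its wave."""
    delivered = []
    last = 0  # delivery cycle of the most recent older request
    for req in sorted(requests, key=lambda r: r.issue_cycle):
        ready = req.issue_cycle + req.latency
        if in_order:
            # Strict ordering: data cannot return ahead of any older request,
            # even one that belongs to a different wave.
            ready = max(ready, last)
        last = max(last, ready)
        delivered.append((req.wave, ready))
    return delivered


# Wave Y misses in cache (long latency); wave X hits (short latency).
# The latency numbers are illustrative, not measurements.
requests = [
    Request(wave="Y", issue_cycle=0, latency=500),
    Request(wave="X", issue_cycle=1, latency=50),
]

in_order_result = delivery_cycles(requests, in_order=True)
out_of_order_result = delivery_cycles(requests, in_order=False)
```

In this model, wave X's hit is held back until wave Y's miss returns when ordering is strict, and is delivered at its own latency otherwise.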
AMD's compiler also schedules instructions differently for the two architectures. On RDNA 3, it sends out memory accesses before waiting on them. RDNA 4 gives it more flexibility in scheduling.
Other vendors vary. Intel's Xe-LPG does not have false cross-wave memory dependencies. Nvidia's Pascal behaves differently depending on where the waves are located, while Turing does not have the problem.
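AMD also improved how memory requests are handled within a single wave. RDNA 4 splits vmcnt, the counter a wave uses to track its outstanding vector memory requests, into several counters, which gives a wave more flexibility in which requests it waits on.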
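A rough way to see why separate counters help is the toy model below. It is a sketch under assumptions rather than a model of the real ISA: it assumes a counter only decrements in issue order, that requests issue one per cycle, and the `wait_cycle` function, the category labels, and the latencies are all hypothetical.

```python
def wait_cycle(requests, target, split_counters):
    """Cycle at which a wave can stop waiting on requests[target].

    requests: list of (category, latency) tuples, issued one per cycle in order.
    split_counters: if True, each category has its own counter.
    """
    target_category = requests[target][0]
    cycle = 0
    for issue_cycle, (category, latency) in enumerate(requests[: target + 1]):
        # A single shared counter decrements in order, so every older request
        # must finish first. Per-category counters only track older requests
        # of the same category.
        if split_counters and category != target_category:
            continue
        cycle = max(cycle, issue_cycle + latency)
    return cycle


# One slow request followed by one fast request of a different category.
# Labels and latencies are illustrative only.
requests = [("category_a", 400), ("category_b", 60)]

single_counter = wait_cycle(requests, target=1, split_counters=False)
separate_counters = wait_cycle(requests, target=1, split_counters=True)
```

With one shared counter, the wave cannot use the fast request's data until the older slow request has also returned. With a counter per category, it only waits for the fast request itself.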
By breaking false cross-wave memory dependencies, RDNA 4's memory subsystem enhancements improve performance in raytracing workloads. The scheme is not fundamentally new, though: it builds on GCN and resembles other GPU makers' approaches. AMD's engineers still deserve credit for the improvements.
