Performance debugging with llvm-mca: simulating the CPU!

Published on February 1

Johnny's Software Lab LLC specializes in software performance. This post covers a performance problem that could not be explained by reading the code alone, and shows how llvm-mca was used to debug a simple convolution kernel loop.

Plain-old C version: an outer for loop over the outputs, with an inner loop that performs the calculation.
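As a reference point, the plain C version might look like the sketch below. The function name `convolve_scalar`, the valid-mode output length, and the unflipped kernel indexing are assumptions made for illustration; the original kernel is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar baseline (names and signature are illustrative).
 * out[i] = sum over k of in[i + k] * kernel[k]
 * Produces in_len - kernel_len + 1 outputs. */
void convolve_scalar(const float *in, size_t in_len,
                     const float *kernel, size_t kernel_len,
                     float *out)
{
    assert(kernel_len > 0 && in_len >= kernel_len);
    size_t out_len = in_len - kernel_len + 1;

    for (size_t i = 0; i < out_len; i++) {        /* outer loop: one output element */
        float acc = 0.0f;
        for (size_t k = 0; k < kernel_len; k++) { /* inner loop: the calculation */
            acc += in[i + k] * kernel[k];
        }
        out[i] = acc;
    }
}
```

Every other variant discussed here computes the same values and differs only in how the inputs are brought into registers.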
Vectorized version using outer loop vectorization: runs four instances of the inner loop in parallel, one per vector lane.
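Outer loop vectorization can be illustrated in portable C by keeping four accumulators, one per lane, and advancing the outer loop by four. This is only a sketch of the idea, not NEON code: the name `convolve_outer4` and the scalar tail loop are assumptions, and a real implementation would hold the four accumulators in one vector register.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 4 }; /* assumed vector width: four floats */

/* Hypothetical sketch: four consecutive outputs are computed together,
 * each lane running its own copy of the inner loop. */
void convolve_outer4(const float *in, size_t in_len,
                     const float *kernel, size_t kernel_len,
                     float *out)
{
    assert(kernel_len > 0 && in_len >= kernel_len);
    size_t out_len = in_len - kernel_len + 1;
    size_t i = 0;

    for (; i + LANES <= out_len; i += LANES) {
        float acc[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (size_t k = 0; k < kernel_len; k++) {
            /* one "vector" step: lane l reads in[i + l + k] */
            for (size_t l = 0; l < LANES; l++) {
                acc[l] += in[i + l + k] * kernel[k];
            }
        }
        for (size_t l = 0; l < LANES; l++) {
            out[i + l] = acc[l];
        }
    }

    /* scalar tail for the outputs that do not fill a full group of four */
    for (; i < out_len; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < kernel_len; k++) {
            acc += in[i + k] * kernel[k];
        }
        out[i] = acc;
    }
}
```

Note that for each kernel tap k the four lanes read a contiguous window starting at `in + i + k`, which is why the vectorized code needs one overlapping vector of inputs per tap.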
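After optimization: uses vmlaq_laneq_f32 and unrolls the inner loop, but this leaves five repeated vld1q_f32 loads.
Using vextq_f32 to build the intermediate vectors from the concatenation of two loaded vectors should improve on this by replacing some of the loads. However, the new version turns out to run slower.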
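The two ways of producing the input vectors can be compared with a small portable emulation. This is a sketch under assumptions: `f32x4`, `load4`, and `ext4` are hypothetical stand-ins for `float32x4_t`, `vld1q_f32`, and `vextq_f32`, and the five-tap layout (loads at offsets 0 to 4) is inferred from the five repeated loads rather than taken from the original code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for float32x4_t. */
typedef struct { float lane[4]; } f32x4;

/* Stand-in for vld1q_f32: load four consecutive floats. */
static f32x4 load4(const float *p)
{
    f32x4 v;
    memcpy(v.lane, p, sizeof v.lane);
    return v;
}

/* Same lane selection as vextq_f32(a, b, n):
 * lanes n..3 of a followed by lanes 0..n-1 of b. */
static f32x4 ext4(f32x4 a, f32x4 b, int n)
{
    assert(n >= 0 && n < 4);
    f32x4 r;
    for (int i = 0; i < 4; i++) {
        r.lane[i] = (i + n < 4) ? a.lane[i + n] : b.lane[i + n - 4];
    }
    return r;
}

/* Five independent, overlapping loads. */
void windows_5l(const float *in, f32x4 w[5])
{
    for (int k = 0; k < 5; k++) {
        w[k] = load4(in + k);
    }
}

/* Two loads, then three ext operations that each depend on BOTH loads. */
void windows_2l3e(const float *in, f32x4 w[5])
{
    f32x4 lo = load4(in);
    f32x4 hi = load4(in + 4);
    w[0] = lo;
    w[1] = ext4(lo, hi, 1);
    w[2] = ext4(lo, hi, 2);
    w[3] = ext4(lo, hi, 3);
    w[4] = hi;
}
```

Both functions read `in[0]` through `in[7]` and yield identical vectors. The difference is structural: in the first every vector comes from a load with no producer to wait for, while in the second three of the five vectors cannot be formed until both loads have completed.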
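Investigation with llvm-mca: the 5L version (five loads) has more instructions but fewer micro-operations (uOps) and takes fewer cycles. The 2L3E version (two loads, three ext instructions) has one instruction fewer but more uOps per cycle. 5L completes a block every 5.7 cycles and 2L3E every 6.5 cycles, which is about 14% more cycles per block for 2L3E.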
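One way to feed just the loop body to llvm-mca is to bracket it with the region markers described in the llvm-mca documentation. The sketch below is not taken from the original post: the function `dot_marked` and the region name `kernel` are hypothetical, the CPU name in the command is a placeholder, and the inline asm statements can change how the compiler optimizes the loop, so the generated assembly should be checked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example function; only the marker placement matters. */
float dot_marked(const float *a, const float *b, size_t n)
{
    assert(n > 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* Emits an assembly comment that llvm-mca treats as a region start. */
        __asm volatile("# LLVM-MCA-BEGIN kernel" ::: "memory");
        acc += a[i] * b[i];
        /* Emits an assembly comment that llvm-mca treats as the region end. */
        __asm volatile("# LLVM-MCA-END" ::: "memory");
    }
    return acc;
}

/* Possible invocation (the CPU name is a placeholder to fill in):
 *   clang -O2 -S -o - file.c | llvm-mca -mcpu=<cpu> -timeline -bottleneck-analysis
 */
```

The `-timeline` and `-bottleneck-analysis` options enable the timeline and bottleneck views; the instruction info and resource pressure views are printed by default.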
Instruction Info Table: judged by per-instruction latency and throughput alone, 2L3E comes out ahead, but this view does not account for resource consumption or instruction dependencies.
Resource Consumption View: 5L spreads its work more evenly across resources. 2L3E puts more load on the execution ports, which suggests contention.
Timeline Graph: 5L issues one load instruction per cycle. 2L3E is delayed because the ext instructions must wait for the loads they depend on.
Bottleneck Analysis: 5L has no resource or data dependency bottlenecks. For 2L3E, llvm-mca reports a backend pressure increase of 38.62%, with 36.56% from execution port pressure and 37.59% from data dependencies.
Conclusion: 5L is faster because it uses the CPU's execution units in a more balanced way and its load instructions are independent of one another. llvm-mca is a useful tool, but it has limitations: it detects only backend problems, and it simulates load instructions with a small latency.
