Performance debugging with llvm-mca: simulating the CPU!

Johnny's Software Lab LLC specializes in software performance. The author ran into a performance problem that could not be explained by reading the source code, and used llvm-mca to debug a simple convolution kernel loop.

  • Plain C version: an outer for loop over the outputs with an inner loop that does the calculation.
  • Vectorized version using outer-loop vectorization: runs 4 instances of the inner loop in parallel.
  • After optimization: uses vmlaq_laneq_f32 and unrolls the inner loop, but contains 5 repeated vld1q_f32 loads (the 5L version).
  • Attempted improvement: use vextq_f32 to build the needed vectors by concatenating already loaded data instead of reloading (the 2L3E version). Unexpectedly, this new version runs slower.
  • Investigation with llvm-mca: the 5L version has more instructions but fewer uOps and uses fewer cycles. 2L3E has one instruction fewer but more uOps per cycle. 5L completes a block every 5.7 cycles, 2L3E every 6.5 cycles.
  • Instruction Info Table: judged only by per-instruction latency and throughput, 2L3E looks better, but this table does not account for resource consumption or dependencies between instructions.
  • Resource Consumption View: 5L spreads its work more evenly across resources. 2L3E puts more load on the execution ports, which suggests contention.
  • Timeline Graph: 5L issues one load instruction per cycle. 2L3E stalls because the ext instructions have to wait for the loads they depend on.
  • Bottleneck Analysis: 5L has no resource or data dependency bottlenecks. 2L3E shows increased backend pressure in 38.62% of cycles, with 36.56% attributed to execution port pressure and 37.59% to data dependencies.
    Conclusion: 5L is faster because it uses the CPU's execution units in a more balanced way and its load instructions are independent of each other. llvm-mca is a useful tool, but it has limitations: it only detects backend problems, and it models load instructions with a small, fixed latency.
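
The first two versions in the list above can be sketched in portable C. This is a minimal illustration, not the article's code: the function names conv_scalar and conv_outer4, the parameter names, and the "valid" output length (in_len - k_len + 1) are all assumptions made for the example, and plain C lanes stand in for the NEON intrinsics.

```c
#include <stddef.h>

/* Plain C version (hypothetical name): for each output element,
 * an inner loop accumulates in[i + k] * kernel[k]. */
void conv_scalar(const float *in, size_t in_len,
                 const float *kernel, size_t k_len, float *out)
{
    if (k_len == 0 || in_len < k_len)
        return;
    size_t out_len = in_len - k_len + 1;
    for (size_t i = 0; i < out_len; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < k_len; k++)
            acc += in[i + k] * kernel[k];
        out[i] = acc;
    }
}

/* Outer-loop vectorization shape (hypothetical name): four adjacent
 * outputs are computed together, so each kernel coefficient is
 * applied to four consecutive inputs per inner-loop step. */
void conv_outer4(const float *in, size_t in_len,
                 const float *kernel, size_t k_len, float *out)
{
    if (k_len == 0 || in_len < k_len)
        return;
    size_t out_len = in_len - k_len + 1;
    size_t i = 0;
    for (; i + 4 <= out_len; i += 4) {
        float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (size_t k = 0; k < k_len; k++) {
            float kv = kernel[k];
            /* One 4-wide multiply-accumulate on a SIMD target. */
            for (size_t lane = 0; lane < 4; lane++)
                acc[lane] += in[i + k + lane] * kv;
        }
        for (size_t lane = 0; lane < 4; lane++)
            out[i + lane] = acc[lane];
    }
    /* Scalar tail for the outputs that do not fill a group of four. */
    for (; i < out_len; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < k_len; k++)
            acc += in[i + k] * kernel[k];
        out[i] = acc;
    }
}
```

In conv_outer4, each inner-loop step reads four consecutive floats starting at in[i + k]. With the inner loop unrolled, those reads are the repeated vector loads of the 5L version, and because consecutive reads overlap in three elements, the 2L3E version can assemble them from already loaded registers instead.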