Johnny's Software Lab LLC specializes in software performance. They ran into a performance problem that the source code alone could not explain, and used llvm-mca to debug a simple convolution kernel loop.
- Plain C version: an outer for loop over the outputs, with an inner loop that does the calculation.
- Vectorized version using outer loop vectorization: runs 4 instances of the inner loop in parallel.
- Optimized version (5L): uses vmlaq_laneq_f32 and unrolls the inner loop, but it contains 5 repeated vld1q_f32 loads.
- Attempted improvement (2L3E): keeps 2 of the loads and uses vextq_f32 to build the remaining vectors by combining the loaded registers. The new version nevertheless runs slower.
- Investigation with llvm-mca: the 5L version has more instructions but fewer uOps, and it uses fewer cycles. 2L3E has one fewer instruction but more uOps per cycle. 5L completes a block every 5.7 cycles and 2L3E every 6.5 cycles, which is about 14% more cycles per block.
- Instruction Info Table: 2L3E wins on the latency and throughput metrics, but this view does not show resource consumption or instruction dependencies.
- Resource Consumption View: 5L spreads its work more evenly across resources. 2L3E puts more load on the execution ports, which suggests contention.
- Timeline Graph: 5L issues one load instruction per cycle. 2L3E is delayed because the ext instructions have to wait for the loads they depend on.
- Bottleneck Analysis: 5L has no resource or data dependency bottlenecks. 2L3E shows a 38.62% increase in backend pressure, with 36.56% attributed to execution port pressure and 37.59% to data dependencies.
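
The first two kernel variants in the list above can be sketched in portable C. This is only an illustration under assumptions, not the article's code: the function names, the kernel size of 5 (guessed from the five loads per block), and the scalar lane loop that stands in for a 4-float NEON register are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed kernel size; chosen to match the five loads per block. */
#define KERNEL_SIZE 5

/* Plain C version: one output per outer iteration, with an inner
 * multiply-accumulate loop over the kernel taps.
 * `in` must hold n_out + KERNEL_SIZE - 1 elements. */
void conv_scalar(const float *in, const float *kernel, float *out, size_t n_out)
{
    for (size_t i = 0; i < n_out; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < KERNEL_SIZE; k++) {
            acc += in[i + k] * kernel[k];
        }
        out[i] = acc;
    }
}

/* Outer loop vectorization, emulated in scalar C: four consecutive
 * outputs are computed together. The `lane` loop stands in for one
 * 4-float vector operation. n_out must be a multiple of 4. */
void conv_outer4(const float *in, const float *kernel, float *out, size_t n_out)
{
    assert(n_out % 4 == 0);
    for (size_t i = 0; i < n_out; i += 4) {
        float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (size_t k = 0; k < KERNEL_SIZE; k++) {
            /* With NEON, this lane loop would be one vector load from
             * in + i + k followed by one multiply-accumulate by kernel[k]. */
            for (size_t lane = 0; lane < 4; lane++) {
                acc[lane] += in[i + k + lane] * kernel[k];
            }
        }
        for (size_t lane = 0; lane < 4; lane++) {
            out[i + lane] = acc[lane];
        }
    }
}
```

Both functions accumulate the taps of each output in the same order, so they should produce matching results. In the emulated vector version, each of the five taps reads a window shifted by one element from the previous tap, which is where the five loads per block come from, and also why building the shifted windows from two loaded registers looks attractive at first.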
Conclusion: 5L is faster because it uses the CPU execution units in a more balanced way and its load instructions are independent of each other. llvm-mca is a useful tool, but it has limitations: it only detects backend problems, and it emulates load instructions with a small latency.