Adding IACA Markers to a Pony Program (Part 1)

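IACA (Intel® Architecture Code Analyzer) is a static code analysis tool from Intel. It can analyze a code block's data dependencies, throughput, and latency, which is very helpful for understanding how the CPU executes code and for performance tuning.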

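To analyze a program, you must insert specific markers into the code; iaca locates the marked code and analyzes it statically. This is usually done with the macros in the iacaMarks.h header that Intel provides, used like this: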

while ( condition )
{
    IACA_START
    <loop body>
}
IACA_END

The macros actually expand to inline assembly (or an intrinsic). For example, with MSVC-style inline assembly IACA_START looks like this:

__asm  mov ebx, 111
__asm  _emit 0x64
__asm  _emit 0x67
__asm  _emit 0x90
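
The expansion above amounts to a fixed 8-byte sequence, which matters when no assembler syntax is available and the marker has to be emitted as raw bytes. A minimal Python sketch of the encoding follows. The helper names are hypothetical, and the value 222 for the end marker is assumed from the 0xde byte that appears in the stop sequence later in this post.

```python
import struct

# Trailing bytes of every marker: 0x64 (fs prefix), 0x67 (address-size prefix), 0x90 (nop).
MARKER_TAIL = bytes([0x64, 0x67, 0x90])


def iaca_marker_bytes(imm):
    """Encode `mov ebx, imm` followed by the marker tail (hypothetical helper)."""
    # 0xBB is the opcode for `mov ebx, imm32`; the immediate follows
    # as four little-endian bytes.
    return bytes([0xBB]) + struct.pack("<I", imm) + MARKER_TAIL


def as_byte_directive(raw):
    """Format raw bytes as an assembler .byte directive."""
    return ".byte " + ", ".join(hex(b) for b in raw)


start = iaca_marker_bytes(111)  # start marker: mov ebx, 111
stop = iaca_marker_bytes(222)   # end marker: mov ebx, 222 (assumed, see above)

print(as_byte_directive(start))
print(as_byte_directive(stop))
```

Python prints zero bytes as 0x0 rather than 0, which is the same value to an assembler.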

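Now I want to analyze a piece of Pony code, but Pony does not support inline assembly, and calls into a C library through FFI cannot be inlined either. One solution is to add an intrinsic to the Pony compiler, but that is a fair amount of work, so I took a shortcut instead: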

First, add placeholder marker calls to the code:

while i < size do
  IACA.start()
  p(i)? = p(i)? xor mask_key(i % 4)?
  i = i + 1
end
IACA.stop()

IACA is defined like this:

primitive IACA
  fun start(): None => None

  fun stop(): None => None

Compile to LLVM IR:

ponyc . -r=ir -d

Debug mode (-d) is used because it makes the replacement easier. Open the generated LLVM IR file and find the calls to IACA.start() and IACA.stop():

; <label>:20:                                     ; preds = %40, %16
  %21 = call fastcc %135* @websocket_IACA_val_create_o(%135* @134), !pony.newcall !2
  %22 = call fastcc %137* @websocket_IACA_val_start_o(%135* %21)

...

  %50 = call fastcc %135* @websocket_IACA_val_create_o(%135* @134), !pony.newcall !2
  %51 = call fastcc %137* @websocket_IACA_val_stop_o(%135* %50)

After the start call, add this inline asm:

tail call void asm sideeffect ".byte 0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90",""()

Before the stop call, add this inline asm:

tail call void asm sideeffect ".byte 0xbb, 0xde, 0, 0, 0, 0x64, 0x67, 0x90",""()
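
The two manual edits above can also be scripted. The following is a rough Python sketch, not part of the original workflow: it assumes the mangled function names always end in IACA_val_start_o and IACA_val_stop_o, as in the IR shown here (the websocket_ prefix comes from the package name), and patch_ir is a hypothetical helper.

```python
import re

# Inline asm lines copied from the manual steps above.
START_ASM = '  tail call void asm sideeffect ".byte 0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90",""()'
STOP_ASM = '  tail call void asm sideeffect ".byte 0xbb, 0xde, 0, 0, 0, 0x64, 0x67, 0x90",""()'

# Match only call sites, not the function definitions themselves.
# The name suffixes are an assumption based on the IR in this post.
START_CALL = re.compile(r"\bcall\b.*@\w*IACA_val_start_o\(")
STOP_CALL = re.compile(r"\bcall\b.*@\w*IACA_val_stop_o\(")


def patch_ir(ir_text):
    """Insert the start marker after start() calls and the end marker before stop() calls."""
    out = []
    for line in ir_text.splitlines():
        if STOP_CALL.search(line):
            out.append(STOP_ASM)   # end marker goes before the stop call
        out.append(line)
        if START_CALL.search(line):
            out.append(START_ASM)  # start marker goes after the start call
    return "\n".join(out) + "\n"


SAMPLE_IR = """\
  %21 = call fastcc %135* @websocket_IACA_val_create_o(%135* @134), !pony.newcall !2
  %22 = call fastcc %137* @websocket_IACA_val_start_o(%135* %21)
  %50 = call fastcc %135* @websocket_IACA_val_create_o(%135* @134), !pony.newcall !2
  %51 = call fastcc %137* @websocket_IACA_val_stop_o(%135* %50)
"""

patched = patch_ir(SAMPLE_IR).splitlines()
print("\n".join(patched))
```

Because the inserted calls return void, they do not take a %N slot, so the numbering of the existing unnamed values stays valid.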

Then compile it with clang -O3 -c to get an object file, which iaca can analyze. Below is the result of analyzing the Pony code above:

Throughput Analysis Report
--------------------------
Block Throughput: 2.75 Cycles       Throughput Bottleneck: Dependency chains
Loop Count:  23
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  1.0     0.0  |  1.0  |  2.5     2.5  |  2.5     2.5  |  1.0  |  1.0  |  1.0  |  0.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1*     |             |      |             |             |      |      |      |      | mov esi, edi
|   1      |             |      |             |             |      | 1.0  |      |      | and esi, 0x3
|   2^     |             |      | 0.5     0.5 | 0.5     0.5 |      |      | 1.0  |      | cmp qword ptr [rcx+0x8], rsi
|   0*F    |             |      |             |             |      |      |      |      | jbe 0x2a
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | mov rbx, qword ptr [rcx+0x18]
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movzx ebx, byte ptr [rbx+rsi*1]
|   4      | 1.0         |      | 1.0     1.0 | 1.0     1.0 | 1.0  |      |      |      | xor byte ptr [rdx+rdi*1], bl
|   1      |             | 1.0  |             |             |      |      |      |      | inc rdi
|   1*     |             |      |             |             |      |      |      |      | cmp rdi, rax
|   0*F    |             |      |             |             |      |      |      |      | jb 0xffffffffffffffdc
Total Num Of Uops: 12
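
As a quick cross-check on how to read the per-instruction table, the sketch below sums the first column of each row and compares it with the reported total. It is only an illustration: the rows are copied from the report above, and parse_rows is a hypothetical helper that assumes the first cell starts with the uop count and the last cell holds the instruction.

```python
import re

# Per-instruction rows copied from the report above.
REPORT_ROWS = """\
|   1*     |             |      |             |             |      |      |      |      | mov esi, edi
|   1      |             |      |             |             |      | 1.0  |      |      | and esi, 0x3
|   2^     |             |      | 0.5     0.5 | 0.5     0.5 |      |      | 1.0  |      | cmp qword ptr [rcx+0x8], rsi
|   0*F    |             |      |             |             |      |      |      |      | jbe 0x2a
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | mov rbx, qword ptr [rcx+0x18]
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movzx ebx, byte ptr [rbx+rsi*1]
|   4      | 1.0         |      | 1.0     1.0 | 1.0     1.0 | 1.0  |      |      |      | xor byte ptr [rdx+rdi*1], bl
|   1      |             | 1.0  |             |             |      |      |      |      | inc rdi
|   1*     |             |      |             |             |      |      |      |      | cmp rdi, rax
|   0*F    |             |      |             |             |      |      |      |      | jb 0xffffffffffffffdc
"""


def parse_rows(text):
    """Return (uop_count, instruction) pairs for rows whose first cell starts with a number."""
    rows = []
    for line in text.splitlines():
        cells = line.split("|")
        if len(cells) < 3:
            continue
        # The first cell looks like "1", "2^" or "0*F": a count plus optional flags.
        match = re.match(r"\s*(\d+)", cells[1])
        if match:
            rows.append((int(match.group(1)), cells[-1].strip()))
    return rows


rows = parse_rows(REPORT_ROWS)
total_uops = sum(count for count, _ in rows)
print(total_uops)
```

The two macro-fused branches (flag F) contribute 0 uops each, which is why ten instructions add up to only 12.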


oraoto
