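The previous post described a way to add IACA markers, but it was still cumbersome to use, so I tried modifying the Pony compiler to add IACA support directly. The code currently lives in the iaca branch.
Usage
It has not been submitted upstream as a PR yet, so you have to clone and build it yourself.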
git clone https://github.com/oraoto/ponyc.git
cd ponyc
git checkout iaca
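Then build it by following the official build instructions; usually a single make is enough.
Add IACA.start() and IACA.stop() around the code you want to mark. Taking code from pony-websocket as an example: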
while (i + 4) < size do
  IACA.start()
  p(i)? = p(i)? xor m1
  p(i + 1)? = p(i + 1)? xor m2
  p(i + 2)? = p(i + 2)? xor m3
  p(i + 3)? = p(i + 3)? xor m4
  i = i + 4
end
IACA.stop()
After compiling, the binary can be analyzed with iaca:
$ iaca ./echo-server.exe
F:\build > iaca .\echo-server.exe
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File - .\echo-server.exe
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 6.74 Cycles Throughput Bottleneck: Dependency chains
Loop Count: 22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 2.5 0.0 | 2.5 | 4.0 4.0 | 4.0 4.0 | 4.0 | 2.5 | 2.5 | 0.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1* | | | | | | | | | cmp rax, rbx
| 0*F | | | | | | | | | jbe 0x95
| 4 | | 0.5 | 1.0 1.0 | 1.0 1.0 | 1.0 | | 0.5 | | xor byte ptr [rdi+rbx*1], r8b
| 1 | | 0.5 | | | | 0.5 | | | lea rsi, ptr [rbx+0x1]
| 1* | | | | | | | | | cmp rax, rsi
| 0*F | | | | | | | | | jbe 0x8b
| 4 | 0.5 | | 1.0 1.0 | 1.0 1.0 | 1.0 | | 0.5 | | xor byte ptr [rdi+rbx*1+0x1], r9b
| 1 | 0.5 | | | | | 0.5 | | | add rsi, 0x1
| 1* | | | | | | | | | cmp rax, rsi
| 0*F | | | | | | | | | jbe 0x80
| 4 | | 0.5 | 1.0 1.0 | 1.0 1.0 | 1.0 | | 0.5 | | xor byte ptr [rdi+rbx*1+0x2], r10b
| 1 | 0.5 | | | | | 0.5 | | | add rsi, 0x1
| 1* | | | | | | | | | cmp rax, rsi
| 0*F | | | | | | | | | jbe 0x75
| 4 | | 0.5 | 1.0 1.0 | 1.0 1.0 | 1.0 | | 0.5 | | xor byte ptr [rdi+rbx*1+0x3], r11b
| 1 | | 0.5 | | | | 0.5 | | | lea rdx, ptr [rsi+0x5]
| 1 | 0.5 | | | | | | 0.5 | | add rsi, 0x1
| 1* | | | | | | | | | cmp rdx, rax
| 0*F | | | | | | | | | jb 0xffffffffffffffab
| 1 | 0.5 | | | | | 0.5 | | | add rbx, 0x4
Total Num Of Uops: 27
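The marker placement in the Pony loop above (start at the top of the loop body, stop after the loop) can also be sketched in C. This is only an illustration, not code from pony-websocket: the macro names, the USE_IACA flag, and the xor_mask function are made up here, and the stop marker bytes (0xde, i.e. 222) are an assumption based on Intel's iacaMarks.h, since the post only shows the start bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Markers are emitted only when built with -DUSE_IACA (a hypothetical flag)
   on x86-64 with a GCC-compatible compiler; otherwise they expand to nothing. */
#if defined(USE_IACA) && defined(__x86_64__) && defined(__GNUC__)
#define IACA_START() \
  __asm__ volatile(".byte 0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90" ::: "ebx", "memory")
#define IACA_STOP() \
  __asm__ volatile(".byte 0xbb, 0xde, 0, 0, 0, 0x64, 0x67, 0x90" ::: "ebx", "memory")
#else
#define IACA_START() ((void)0)
#define IACA_STOP() ((void)0)
#endif

/* XOR `size` bytes of `p` in place with the repeating 4-byte `mask`. */
void xor_mask(uint8_t* p, size_t size, const uint8_t mask[4])
{
  assert(p != NULL || size == 0);
  size_t i = 0;

  /* Main loop: four bytes per iteration, same condition as the Pony code. */
  while ((i + 4) < size) {
    IACA_START();            /* start marker at the top of the loop body */
    p[i]     ^= mask[0];
    p[i + 1] ^= mask[1];
    p[i + 2] ^= mask[2];
    p[i + 3] ^= mask[3];
    i += 4;
  }
  IACA_STOP();               /* stop marker right after the loop */

  /* Tail: i is a multiple of 4 here, so i % 4 picks the right mask byte. */
  for (; i < size; i++)
    p[i] ^= mask[i % 4];
}
```

Because the mov into ebx really executes, the sketch lists ebx as clobbered so the compiler does not keep a live value in that register across a marker.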
How it works
Some code in Pony's builtin package looks like this:
fun _apply(i: USize): this->A =>
  compile_intrinsic
fun ref _update(i: USize, value: A!): A^ =>
  compile_intrinsic
fun _offset(n: USize): this->Pointer[A] =>
  compile_intrinsic
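The function body is just compile_intrinsic: these functions are built into the compiler, which generates their code directly. So I simply added the following to the builtin package: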
primitive IACA
  fun start(): None => compile_intrinsic
  fun stop(): None => compile_intrinsic
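At this point compilation fails, because the compiler does not yet know how to compile these two functions, so they have to be "registered" in the compiler. Following the way Platform is handled is enough here.
The function that actually generates the code: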
static void iaca_start(compile_t* c, reach_type_t* t, token_id cap)
{
  FIND_METHOD("start", cap);
  compile_type_t* t_result = (compile_type_t*)m->result->c_type;
  start_function(c, t, m, t_result->use_type, &c_t->use_type, 1);
  LLVMAddFunctionAttr(c_m->func, LLVMAlwaysInlineAttribute);
  LLVMTypeRef void_fn = LLVMFunctionType(c->void_type, NULL, 0, false);
  LLVMValueRef asmstr = LLVMConstInlineAsm(void_fn,
    ".byte 0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90", "", true, false);
  LLVMValueRef call = LLVMBuildCall(c->builder, asmstr, NULL, 0, "");
  LLVMBuildRet(c->builder, t_result->instance);
  codegen_finishfun(c);
}
It simply emits a single inline asm statement in LLVM IR. Without optimization, the generated IR looks like this:
while_body: ; preds = %invoke13, %while_init
%28 = call fastcc %IACA* @IACA_val_create_o(%IACA* @IACA_Inst), !dbg !5828, !pony.newcall !3
%29 = call fastcc %None* @IACA_val_start_o(%IACA* %28), !dbg !5830
; Function Attrs: alwaysinline
define fastcc %None* @IACA_val_start_o(%IACA* noalias readonly dereferenceable(8)) unnamed_addr #7 !pony.abi !3 {
entry:
call void asm sideeffect ".byte 0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90", ""()
ret %None* @None_Inst
}
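That's right, what gets generated is a function call, so we rely on the optimizer to inline this function at the call site. After optimization the result is: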
; <label>:38: ; preds = %35, %67
%39 = phi i64 [ %71, %67 ], [ 0, %35 ]
tail call void asm sideeffect ".byte 0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90", ""() #2
%40 = icmp ugt i64 %4, %39
br i1 %40, label %43, label %41
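This is exactly what we want.
The current shortcoming is that, because code for the function itself is still generated, iaca sometimes analyzes the wrong location and produces output like this: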
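The byte string passed to LLVMConstInlineAsm is the encoding of mov ebx, 111 (opcode 0xbb followed by a 32-bit little-endian immediate) plus the three bytes 0x64, 0x67, 0x90. A small sketch of how that string could be produced for an arbitrary marker id is shown here; the helper name iaca_marker_asm is made up, and the idea that the stop marker uses id 222 is an assumption taken from Intel's iacaMarks.h rather than from the code shown in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: formats the `.byte` directive for an IACA marker.
   Layout: 0xbb (mov ebx, imm32), the id as 4 little-endian bytes,
   then 0x64 0x67 0x90 (a prefixed nop).
   Returns what snprintf returns: the length the full string needs. */
int iaca_marker_asm(char* buf, size_t len, uint32_t id)
{
  assert(buf != NULL);
  /* %#x prints nonzero values with a 0x prefix and zero as a plain 0,
     which reproduces the spelling used in the compiler code above. */
  return snprintf(buf, len,
    ".byte 0xbb, %#x, %#x, %#x, %#x, 0x64, 0x67, 0x90",
    (unsigned)(id & 0xff),
    (unsigned)((id >> 8) & 0xff),
    (unsigned)((id >> 16) & 0xff),
    (unsigned)((id >> 24) & 0xff));
}
```

With id 111 this yields the same string as in iaca_start; a stop function would presumably differ only in the id.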
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | | | | | | 1.0 | | | lea rax, ptr [rip+0x71a8]
| 3^# | | | 1.0 1.0 | | | | 0.1 | | ret
| 1 | 0.4 | | | | | | 0.6 | | mov ebx, 0x6f
| 1 | 0.6 | | | | | | 0.4 | | addr32 nop
| 1 | | 1.0 | | | | | | | lea rax, ptr [rip+0x71a8]
| 3^ | | | | 1.0 1.0 | | | | | ret
Total Num Of Uops: 10
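Is the compilation result really not deterministic? When this happens, the only option for now is to recompile until the correct result shows up.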
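One way to check whether a leftover out-of-line copy of the marker is present in a given build is to count how often the start-marker byte sequence occurs in the executable. The following is a rough sketch under that assumption: count_start_markers is a made-up name, and reading the file into memory is left out. More than one hit would suggest the marker also survives in the standalone body of IACA.start.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The start-marker bytes emitted by the compiler code shown earlier. */
static const unsigned char IACA_START_BYTES[8] =
  {0xbb, 0x6f, 0x00, 0x00, 0x00, 0x64, 0x67, 0x90};

/* Counts occurrences of the start marker in `data` (for example, the
   contents of the executable loaded into memory by the caller). */
size_t count_start_markers(const unsigned char* data, size_t len)
{
  assert(data != NULL || len == 0);
  size_t count = 0;
  for (size_t i = 0; i + sizeof IACA_START_BYTES <= len; i++) {
    if (memcmp(data + i, IACA_START_BYTES, sizeof IACA_START_BYTES) == 0)
      count++;
  }
  return count;
}
```

A plain byte search can also match data that merely looks like a marker, so the count is a hint rather than proof.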