I'm doing micro-optimization on a performance-critical part of my code and came across this sequence of instructions (in AT&T syntax):

    add %rax, %rbx
    mov %rdx, %rax
    mov %rbx, %rdx

I thought I finally had a use case for xchg, which would allow me to shave an instruction and write:

    add %rbx, %rax
    xchg %rax, %rdx

However, to my dismay, I found from Agner Fog's instruction tables that xchg is a 3 micro-op instruction with a 2-cycle latency on Sandy Bridge, Ivy Bridge, Broadwell, Haswell, and Skylake. 3 whole micro-ops and 2 cycles of latency! The 3 micro-ops throw off my 4-1-1-1 cadence, and the 2-cycle latency makes it worse than the original in the best case, since the last 2 instructions in the original might execute in parallel.

Now... I get that the CPU might be breaking the instruction into micro-ops equivalent to:

    mov %rax, %tmp
    mov %rdx, %rax
    mov %tmp, %rdx

where tmp is an anonymous internal register, and I suppose the last two micro-ops could run in parallel, so the latency is 2 cycles.

Given that register renaming occurs on these micro-architectures, though, it doesn't make sense to me that it is done this way. Why wouldn't the register renamer ...
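To make the comparison concrete, here is a minimal sketch in Python that models the registers as a dictionary and replays the three sequences from the question. It only checks the architectural result (which values end up in which registers) and says nothing about micro-op counts or latency. The helper names (`add`, `mov`, `xchg`, `original`, `with_xchg`, `xchg_as_three_movs`) and the `tmp` entry are hypothetical, made up for this illustration.

```python
# Toy register-file model. Each helper mimics one AT&T-syntax instruction
# (source operand first, destination second). Names are illustrative only.

MASK64 = 0xFFFFFFFFFFFFFFFF


def add(regs, src, dst):
    # add %src, %dst  ->  dst = dst + src, with 64-bit wraparound
    regs[dst] = (regs[dst] + regs[src]) & MASK64


def mov(regs, src, dst):
    # mov %src, %dst  ->  dst = src
    regs[dst] = regs[src]


def xchg(regs, a, b):
    # xchg %a, %b  ->  swap the two registers
    regs[a], regs[b] = regs[b], regs[a]


def original(regs):
    r = dict(regs)
    add(r, "rax", "rbx")
    mov(r, "rdx", "rax")
    mov(r, "rbx", "rdx")
    return r


def with_xchg(regs):
    r = dict(regs)
    add(r, "rbx", "rax")
    xchg(r, "rax", "rdx")
    return r


def xchg_as_three_movs(regs):
    # Same as with_xchg, but the swap is spelled out as three moves
    # through a hypothetical internal temporary, as guessed in the question.
    r = dict(regs)
    add(r, "rbx", "rax")
    r["tmp"] = 0
    mov(r, "rax", "tmp")
    mov(r, "rdx", "rax")
    mov(r, "tmp", "rdx")
    del r["tmp"]  # the temporary is not architecturally visible
    return r


start = {"rax": 5, "rbx": 7, "rdx": 11}
a = original(start)
b = with_xchg(start)
c = xchg_as_three_movs(start)
print(a)
print(b)
print(c)
```

In this model both versions leave the sum in rdx and the old rdx in rax. They differ in rbx: the original sequence overwrites rbx with the sum, while the xchg version leaves it unchanged, so the rewrite is only a drop-in replacement if rbx is not read afterwards.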