performance - Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?


I'm doing micro-optimization on a performance-critical part of my code and came across this sequence of instructions (in AT&T syntax):

    add %rax, %rbx
    mov %rdx, %rax
    mov %rbx, %rdx

I thought I finally had a use case for xchg that would allow me to shave an instruction and write:

    add  %rbx, %rax
    xchg %rax, %rdx

However, to my dismay, I found in Agner Fog's instruction tables that xchg is a 3 micro-op instruction with 2 cycles of latency on Sandy Bridge, Ivy Bridge, Broadwell, Haswell and Skylake. 3 whole micro-ops and 2 cycles of latency! The 3 micro-ops throw off my 4-1-1-1 cadence, and the 2-cycle latency makes it worse than the original in the best case, since the last 2 instructions in the original might execute in parallel.

Now... I get that the CPU might be breaking the instruction into micro-ops that are equivalent to:

    mov %rax, %tmp
    mov %rdx, %rax
    mov %tmp, %rdx

where tmp is an anonymous internal register, and I suppose the last 2 micro-ops could run in parallel so the latency is 2 cycles.

Given that register renaming occurs on these micro-architectures, though, it doesn't make sense to me that it's done this way. Why wouldn't the register renamer just swap the labels? In theory, that would have a latency of 1 cycle (possibly 0?) and could be represented as a single micro-op, so it would be much cheaper.

(I still need to proof-read this again, but I wanted to post it instead of leaving it sitting in a browser tab for more days.)


Supporting efficient xchg is non-trivial, and presumably not worth the extra complexity it would require in various parts of the CPU. A real CPU's microarchitecture is much more complicated than the mental model you can use while optimizing software for it. For example, speculative execution makes everything more complicated, because the CPU has to be able to roll back to the point where an exception occurred.

Making fxch efficient was important for x87 performance because the stack nature of x87 makes it (or alternatives like fld st(2)) hard to avoid. Compiler-generated FP code (for targets without SSE support) really does use fxch a significant amount. It seems that fast fxch was done because it was important, not because it's easy. Intel Haswell even dropped support for single-uop fxch. It's still zero-latency, but decodes to 2 uops on HSW and later (up from 1 in P5, and PPro through IvyBridge).

xchg is usually easy to avoid. In most cases, you can just unroll a loop so it's ok that the same value is now in a different register. e.g. Fibonacci with add rax, rdx / add rdx, rax instead of add rax, rdx / xchg rax, rdx. Compilers generally don't use xchg reg,reg, and usually hand-written asm doesn't either. (This chicken/egg problem is pretty similar to loop being slow (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?). loop would have been very useful for adc loops on Core2/Nehalem, where an adc + dec/jnz loop causes partial-flag stalls.)
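For example, a Fibonacci loop unrolled by 2 might look something like this (a minimal sketch in AT&T syntax like the question's code; the register choices and loop structure are mine, not from the question):

    fib_loop:
        add  %rdx, %rax      # rax = F(n)   = F(n-2) + F(n-1)
        add  %rax, %rdx      # rdx = F(n+1) = F(n-1) + F(n)
        dec  %rcx            # two Fibonacci steps per iteration, no xchg needed
        jnz  fib_loop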

Since xchg would still be slow-ish on previous CPUs, compilers wouldn't start using it with -mtune=generic for several years. Unlike fxch or mov-elimination, a design change to support fast xchg wouldn't help the CPU run most existing code faster, and would only enable tiny performance gains over the current design in code that did make use of this peephole optimization.


Integer registers are complicated by partial-register stuff, unlike x87

There are 4 operand sizes of xchg, 3 of which use the same opcode with REX or operand-size prefixes. (xchg r8,r8 is a separate opcode, so it's probably easier to make the decoders decode it differently from the others.) The decoders already have to recognize xchg with a memory operand as special, because of the implicit lock prefix, but it's probably less decoder complexity (transistor count + power) if the reg-reg forms all decode to the same number of uops for the different operand sizes.

Making some r,r forms decode to a single uop would be even more complexity, because single-uop instructions have to be handled by the "simple" decoders as well as the complex decoder. So they would all need to be able to parse xchg and decide whether it was a single-uop or multi-uop form.


AMD and Intel CPUs behave somewhat similarly from a programmer's perspective, but there are many signs that the internal implementations are vastly different. For example, Intel mov-elimination only works some of the time, limited by some kind of microarchitectural resources, but AMD CPUs that do mov-elimination do it 100% of the time (e.g. Bulldozer for the low lane of vector regs).

See Intel's optimization manual, Example 3-25. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions, where they discuss overwriting the zero-latency-movzx result right away to free the internal resource sooner. (I tried the examples on Haswell and Skylake, and found that mov-elimination did in fact work more of the time when doing that, but that it was actually slightly slower in total cycles, instead of faster. The example was intended to show the benefit on IvyBridge, which probably bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the dep chains and don't seem to be bothered by needing an ALU port for more of the movzx instructions.)

I don't know exactly what needs tracking in a limited-size table(?) for mov-elimination. Probably it's related to needing to free register-file entries as soon as possible when they're no longer needed, because physical register file size limits, rather than ROB size, can be the bottleneck for the out-of-order window size. Swapping around indices might make this harder.

xor-zeroing is eliminated 100% of the time on Intel Sandybridge-family; it's assumed that this works by renaming to a physical zero register, and that register never needs to be freed.

If xchg used the same mechanism that mov-elimination does, it could probably only work some of the time. It would need to decode to enough uops to work in the cases where it isn't handled at rename. (Or else the issue/rename stage would have to insert extra uops when an xchg will take more than 1 uop, like it does when un-laminating micro-fused uops with indexed addressing modes that can't stay micro-fused in the ROB, or when inserting merging uops for flags or high-8 partial registers. But that's a significant complication that would only be worth doing if xchg were a common and important instruction.)

Note that xchg r32,r32 has to zero-extend both results to 64 bits, so it can't be a simple swap of RAT (Register Alias Table) entries. It would be more like truncating both registers in-place. And note that Intel CPUs never eliminate mov same,same. They do already need to support mov r32,r32 and movzx r32, r8 with no execution port, so presumably there are some bits that indicate that rax = al or something. (And yes, Intel HSW/SKL do that, not just IvyBridge, despite what Agner's microarch guide says.)

We know P6 and SnB had upper-zeroed bits like this, because xor eax,eax before setz al avoids a partial-register stall when reading eax. HSW/SKL never rename al separately in the first place, only ah. It may not be a coincidence that partial-register renaming (other than ah) seems to have been dropped in the same uarch that introduced mov-elimination (IvyBridge). Still, setting that bit for 2 registers at once would be a special case that required special support.
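For reference, that idiom looks something like this (a sketch of mine, not code from the question):

    xor   %eax, %eax      # zero EAX; the register is also marked upper-zeroed at rename
    cmp   %rdx, %rcx
    setz  %al             # writes only the low byte
    add   %eax, %esi      # reading the full EAX afterwards doesn't stall on P6/SnB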

xchg r64,r64 could maybe swap RAT entries, but decoding that differently from the r32 case is yet another complication. It might still need to trigger a partial-register merge for both inputs, but add r64,r64 needs to do that, too.

Also note that an Intel uop (other than fxch) only ever produces one register result (plus flags). Not touching flags doesn't "free up" an output slot; for example mulx r64,r64,r64 still takes 2 uops to produce 2 integer outputs on HSW/SKL, even though all the "work" is done in the multiply unit on port 1, the same as with mul r64, which does produce a flag result.

Even if it were as simple as "swap the RAT entries", building a RAT that supports writing more than one entry per uop is a complication. What to do when renaming 4 xchg uops in a single issue group? It seems to me that would make the logic significantly more complicated. Remember that this has to be built out of logic gates / transistors. Even if you say "handle that special case with a trap to microcode", you have to build the whole pipeline to support the possibility that that pipeline stage could take that kind of exception.

Single-uop fxch requires support for swapping RAT entries (or some other mechanism) in the FP RAT (fRAT), but that's a separate block of hardware from the integer RAT (iRAT). Leaving out that complication in the iRAT seems reasonable even if you have it in the fRAT (pre-Haswell).

Issue/rename complexity is definitely an issue for power consumption, though. Note that Skylake widened a lot of the front-end (legacy decode and uop-cache fetch), and retirement, but kept the 4-wide issue/rename limit. SKL also added replicated execution units on more ports in the back-end, so issue bandwidth is a bottleneck even more of the time, especially in code with a mix of loads, stores, and ALU.

The RAT (or the integer register file, IDK) may even have limited read ports, since there seem to be some front-end bottlenecks in issuing/renaming many 3-input uops like add rax, [rcx+rdx]. I've posted some microbenchmarks (this and the follow-up post) showing Skylake being faster than Haswell when reading lots of registers, e.g. with micro-fusion of indexed addressing modes. Or maybe the bottleneck there was some other microarchitectural limit.
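To illustrate what "3-input uops" means here (my own example, not the linked microbenchmark): when a load+ALU instruction with an indexed addressing mode stays micro-fused, the single fused-domain uop has to read three integer registers at rename/issue:

    add   (%rcx,%rdx), %rax     # reads rcx (base), rdx (index), and rax (src/dst)
    add   (%rsi,%rdi), %rbx     # another micro-fused uop with 3 register inputs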


But how does 1-uop fxch work?

IDK how it's done in Sandybridge / Ivybridge. In P6-family CPUs, an extra remapping table apparently exists to support fxch. That might only be needed because P6 uses a Retirement Register File with 1 entry per "logical" register, instead of a physical register file (PRF). As you say, you'd expect it to be simpler when even "cold" register values are just a pointer to a PRF entry. (Source: US patent 5,499,352: Floating point register alias table FXCH and retirement floating point register array, which describes Intel's P6 uarch.)

"One main reason the rfRAT array 802 is included within the present invention fRAT logic is a direct result of the manner in which the present invention implements the FXCH instruction."

(Thanks to Andy Glew (@krazyglew); I hadn't thought of looking up patents to find out about CPU internals.) It's pretty heavy going, but may provide some insight into the bookkeeping needed for speculative execution.

Interesting tidbit: the patent describes integer as well, and mentions that there are some "hidden" logical registers which are reserved for use by microcode. (Intel's 3-uop xchg almost certainly uses one of these as a temporary.)


We might be able to get some insight from looking at what AMD does.

Interestingly, AMD has 2-uop xchg r,r in K10, Bulldozer-family, Bobcat/Jaguar, and Ryzen. (But Jaguar's xchg r8,r8 is 3 uops. Maybe that's to support the xchg ah,al corner case without a special uop for swapping the low 16 bits of a single reg.)

Presumably both uops read the old values of the input architectural registers before the first one updates the RAT. IDK exactly how this works, since they aren't necessarily issued/renamed in the same cycle (but they are at least contiguous in the uop flow, so at worst the 2nd uop is the first uop in the next cycle). I have no idea if Haswell's 2-uop fxch works similarly, or if they're doing something else.

Ryzen is a new architecture designed after mov-elimination was "invented", so presumably they take advantage of it wherever possible. (Bulldozer-family renames vector moves (but only for the low 128b lane of YMM vectors); Ryzen is the first AMD architecture that does it for GP regs too.) xchg r32,r32 and r64,r64 are zero-latency (renamed), but still 2 uops each. (The r8 and r16 forms need an execution unit, because they merge with the old value instead of zero-extending or copying the entire reg, but are still only 2 uops.)

Ryzen's fxch is 1 uop. AMD (like Intel) probably isn't spending a lot of transistors on making x87 fast (e.g. fmul is only 1 per clock and on the same port as fadd), so presumably they were able to do this without a lot of extra support. Their micro-coded x87 instructions (like fyl2x) are faster than on recent Intel CPUs, so maybe Intel cares even less (at least about the microcoded x87 instructions).

Maybe AMD could have made xchg r64,r64 a single uop too, more easily than Intel can. Maybe even xchg r32,r32 could be a single uop, since like Intel it needs to support mov r32,r32 zero-extension with no execution port, so maybe it could just set whatever "upper 32 zeroed" bit exists to support that. Ryzen doesn't eliminate movzx r32, r8 at rename, so presumably there's only an upper32-zero bit, not bits for other widths.


What Intel might be able to do cheaply if they wanted to:

It's possible that Intel could support 2-uop xchg r,r the way Ryzen does (zero latency for the r32,r32 and r64,r64 forms, or 1c for the r8,r8 and r16,r16 forms) without too much extra complexity in critical parts of the core, like the issue/rename and retirement stages that manage the Register Alias Table (RAT). But maybe not, if they can't have 2 uops read the "old" value of a register when the first uop writes it.

Stuff like xchg ah,al is definitely an extra complication, since Intel CPUs don't rename partial registers separately anymore, except ah/bh/ch/dh.


Other side-topics from comments and the question

The 3 micro-ops throw off my 4-1-1-1 cadence

Sandybridge-family decoders are different from Core2/Nehalem. They can produce up to 4 uops total, not 7, so the patterns are 1-1-1-1, 2-1-1, 3-1, or 4.

Also beware that if the last uop is one that can macro-fuse, the decoders will hang onto it until the next decode cycle in case the first instruction in the next block is a jcc. (This is a win when code runs multiple times from the uop cache for each time it's decoded. And that's still 3 uops per clock of decode throughput.)

skylake has "simple" decoder can 1-1-1-1-1 4-1 guess, > 4 uops 1 instruction still requires microcode rom. skylake beefed uop cache, too, , can bottleneck on 4 fused-domain uops per clock issue/rename throughput limit if back-end (or branch misses) aren't bottleneck first.

I'm literally searching for ~1% speed bumps, so hand optimization has been working out on the main loop code. Unfortunately that's ~18kB of code so I'm not even trying to consider the uop cache anymore.

That seems kind of crazy, unless you're limiting yourself to asm-level optimization in shorter loops inside your main loop. Any inner loops within the main loop will still run from the uop cache, and that should probably be where you're spending most of your time optimizing. Compilers usually do a good-enough job that it's not practical for a human to beat them over a large scale. Try to write your C or C++ in such a way that the compiler can do a good job with it, of course, but looking for tiny peephole optimizations like this over 18kB of code seems like going down a rabbit hole.

Use perf counters like idq.dsb_uops vs. uops_issued.any to see how many of your total uops came from the uop cache (DSB = Decode Stream Buffer or something). Intel's optimization manual has some suggestions for other perf counters to look at for code that doesn't fit in the uop cache, such as dsb2mite_switches.penalty_cycles. (MITE is the legacy-decode path.) Search the pdf for DSB to find the few places it's mentioned.
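On Linux that could look something like the line below; the event names are the standard Intel ones mentioned above, but check perf list on your machine since availability and spelling vary by CPU and kernel, and ./my_program is just a placeholder:

    perf stat -e idq.dsb_uops,idq.mite_uops,uops_issued.any,dsb2mite_switches.penalty_cycles ./my_program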

Perf counters can help you find spots with potential problems, e.g. regions with higher than average uops_issued.stall_cycles might benefit from finding ways to expose more ILP if there is any, or from solving a front-end problem, or from reducing branch mispredicts.


As discussed in comments, a single uop produces at most 1 register result

As an aside, does mul %rbx give you %rdx and %rax all at once, or does the ROB technically have access to the lower part of the result one cycle earlier than the higher part? Or is it like the "mul" uop goes into the multiplication unit and then the multiplication unit issues two uops straight into the ROB to write the result at the end?

Terminology: the multiply result doesn't go into the ROB. It goes over the forwarding network to whatever other uops read it, and goes into the PRF.

The mul %rbx instruction decodes to 2 uops in the decoders. They don't even have to issue in the same cycle, let alone execute in the same cycle.

However, Agner Fog's instruction tables only list a single latency number. It turns out that 3 cycles is the latency from both inputs to RAX. The minimum latency for RDX is 4c, according to InstLatX64 testing on both Haswell and Skylake-X.

From this, I conclude that the 2nd uop is dependent on the first, and exists to write the high half of the result to an architectural register. The port1 uop produces the full 128b multiply result.
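A sketch of the kind of dependency chains that would expose that latency difference (my own construction in AT&T syntax, not InstLatX64's actual test code):

    # chain through RAX: mul reads RAX and writes RAX (low half),
    # so back-to-back muls run at the ~3c RAX->RAX latency
    rax_chain:
        mul   %rbx            # RDX:RAX = RAX * RBX
        dec   %rcx
        jnz   rax_chain

    # chain through RDX: feed the high half back into RAX each iteration,
    # so the loop runs at the longer input->RDX latency (~4c, plus the mov if it isn't eliminated)
    rdx_chain:
        mov   %rdx, %rax
        mul   %rbx
        dec   %rcx
        jnz   rdx_chain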

Interestingly, according to Agner Fog's instruction tables, on Haswell the 2 uops for mul r64 go to ports 1 and 6. mul r32 is 3 uops, and runs on p1 + p0156. Agner doesn't say whether that's really 2p1 + p0156 or p1 + 2p0156, like he does for some other insns. (However, he says that mulx r32,r32,r32 runs on p1 + 2p056 (note that p056 doesn't include p1).)

Even more strangely, he says that Skylake runs mulx r64,r64,r64 on p1 p5 but mul r64 on p1 p6. If that's accurate and not a typo (which is a possibility), it pretty much rules out the possibility that the extra uop is an upper-half multiplier.

