Do any current C++ compilers ever emit “rep movsb/w/d”?

This question made me wonder whether modern compilers ever emit the REP MOVSB/W/D instruction.
Based on this discussion, it seems that using REP MOVSB/W/D could be beneficial on current CPUs.
But no matter what I tried, I couldn't get any of the current compilers (GCC 8, Clang 7, MSVC 2017 and ICC 18) to emit this instruction.
For this simple code, it could be reasonable to emit REP MOVSB:
void fn(char *dst, const char *src, int l) {
    for (int i=0; i<l; i++) {
        dst[i] = src[i];
    }
}
But compilers emit a non-optimized simple byte-copy loop, or a huge unrolled loop (basically an inlined memmove). Do any of the compilers use this instruction?

For that loop (without __restrict), or ever? gcc has -mmemcpy-strategy=rep_4byte and -minline-all-stringops to override the tuning options. gcc.gnu.org/onlinedocs/gcc/x86-Options.html.
– Peter Cordes
Jul 1 at 9:10
GCC will surprisingly inline rep stosq for zero-init with default tuning, though. (Or even with -march=haswell, where vector stores would be better: Unexpected x64 assembly for __atomic_fetch_or with gcc 7.3 / gcc.gnu.org/bugzilla/show_bug.cgi?id=86352)
– Peter Cordes
Jul 1 at 9:11
@PeterCordes: I mean ever, but without helping the compiler. So not using __builtin* things, just simple code. So the gist of my question is not "how to make the compiler emit rep movsb", but rather "why don't compilers use this instruction in general/more often, as it seems a good alternative in a lot of cases?"
– geza
Jul 1 at 9:23
Right, of course a builtin / intrinsic is uninteresting, but I meant if you were restricting it to -mtune=generic or -mtune=haswell or any other of the preset tunings. (i.e. if you don't allow setting the string-op strategy.)
– Peter Cordes
Jul 1 at 9:27
@PeterCordes: I allow every compiler option :) I just wondered why this instruction isn't used at all, as it seems a good alternative. It is surely better than a byte-copy loop. And it is better than inlining memmove in a lot of cases too (much smaller code, and on current CPUs, it is not much slower), yet it seems that this instruction is not used by default.
– geza
Jul 1 at 9:31
1 Answer
GCC has x86 tuning options to control string-ops strategy and when to inline vs. library call. (See https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). -mmemcpy-strategy=strategy takes alg:max_size:dest_align triplets, but the brute-force way is -mstringop-strategy=rep_byte.
I had to use __restrict to get gcc to recognize the memcpy pattern, instead of just doing normal auto-vectorization after an overlap check / fallback to a dumb byte loop. (Fun fact: gcc -O3 auto-vectorizes even with -mno-sse, using the full width of an integer register. So you only get a dumb byte loop if you compile with -Os (optimize for size) or -O2 (less than full optimization).)
Note that if src and dst overlap with dst > src, the result is not memmove. Instead, you'll get a repeating pattern with length = dst-src. rep movsb has to correctly implement the exact byte-copy semantics even in case of overlap, so it would still be valid (but slow on current CPUs: I think microcode would just fall back to a byte loop).
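For example, a forward byte-by-byte copy with dst - src = 2 replicates the first two bytes. (Quick illustrative sketch; byte_copy just mirrors the loop from the question, and the buffer contents are arbitrary.)

#include <stdio.h>

/* Same semantics as the loop in the question (and as rep movsb). */
static void byte_copy(char *dst, const char *src, int l) {
    for (int i = 0; i < l; i++)
        dst[i] = src[i];
}

int main(void) {
    char buf[16] = "abcdefgh";
    byte_copy(buf + 2, buf, 6);   /* overlapping copy, dst - src = 2 */
    printf("%s\n", buf);          /* prints "abababab": a period-2 pattern, not what memmove would give */
    return 0;
}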
gcc only gets to rep movsb via recognizing a memcpy pattern and then choosing to inline memcpy as rep movsb. It doesn't go directly from a byte-copy loop to rep movsb, and that's why possible aliasing defeats the optimization. (It might be interesting for -Os to consider using rep movs directly, though, when alias analysis can't prove it's a memcpy or memmove, on CPUs with fast rep movsb.)
void fn(char *__restrict dst, const char *__restrict src, int l) {
    for (int i=0; i<l; i++) {
        dst[i] = src[i];
    }
}
This probably shouldn't "count" because I would probably not recommend those tuning options for any use-case other than "make the compiler use rep movs", so it's not that different from an intrinsic. I didn't check all the -mtune=silvermont / -mtune=skylake / -mtune=bdver2 (Bulldozer version 2 = Piledriver) / etc. tuning options, but I doubt any of them enable that. So this is an unrealistic test because nobody using -march=native would get this code-gen.
But the above C compiles with gcc8.1 -xc -O3 -Wall -mstringop-strategy=rep_byte -minline-all-stringops on the Godbolt compiler explorer to this asm for x86-64 System V:
fn:
        test    edx, edx
        jle     .L1            # rep movs treats the counter as unsigned, but the source uses signed
        sub     edx, 1         # what the heck, gcc? mov ecx,edx would be too easy?
        lea     ecx, [rdx+1]
        rep movsb              # dst=rdi and src=rsi, matching the calling convention
.L1:
        ret
Fun fact: the x86-64 SysV calling convention being optimized for inlining rep movs is not a coincidence (Why does Windows64 use a different calling convention from all other OSes on x86-64?). I think gcc favoured that when the calling convention was being designed, so it saved instructions.
rep_8byte does a bunch of setup to handle counts that aren't a multiple of 8, and maybe alignment; I didn't look carefully.
I also didn't check other compilers.
Inlining rep movsb would be a poor choice without an alignment guarantee, so it's good that compilers don't do it by default. (As long as they do something better.) Intel's optimization manual has a section on memcpy and memset with SIMD vectors vs. rep movs. See also http://agner.org/optimize/, and other performance links in the x86 tag wiki.
(I doubt that gcc would do anything differently if you did dst=__builtin_assume_aligned(dst, 64); or any other way of communicating alignment to the compiler, though, e.g. alignas(64) on some arrays.)
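For reference, communicating alignment that way would look something like this (illustrative sketch only; fn_aligned is just a made-up name, using the gcc/clang builtin mentioned above):

/* Hint 64-byte alignment to the compiler; per the text above this
   probably doesn't change whether gcc picks rep movsb. */
void fn_aligned(char *__restrict dst, const char *__restrict src, int l) {
    dst = __builtin_assume_aligned(dst, 64);
    src = __builtin_assume_aligned(src, 64);
    for (int i = 0; i < l; i++)
        dst[i] = src[i];
}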
Intel's IceLake microarchitecture will have a "short rep" feature that presumably reduces startup overhead for rep movs / rep stos, making them much more useful for small counts. (Currently rep string microcode has significant startup overhead: What setup does REP do?)
BTW, glibc's memcpy uses a pretty nice strategy for small inputs that's insensitive to overlap: Two loads -> two stores that potentially overlap, for copies up to 2 registers wide. This means any input from 4..7 bytes branches the same way, for example.
Glibc's asm source has a nice comment describing the strategy: https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#19.
For large inputs, it uses SSE XMM registers, AVX YMM registers, or rep movsb (after checking an internal config variable that's set based on CPU-detection when glibc initializes itself). I'm not sure which CPUs it will actually use rep movsb on, if any, but support is there for using it for large copies.
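To sketch the idea for the 4..7-byte case (my own illustration, not glibc's actual code; copy_4_to_7 is a made-up helper): two fixed-width accesses that overlap in the middle cover every size in that range with the same code path.

#include <stdint.h>
#include <string.h>

/* Copy n bytes for 4 <= n <= 7: two 4-byte loads, then two 4-byte stores.
   For n < 8 the pairs overlap in the middle.  Both loads happen before
   either store, so overlapping src/dst is also handled correctly. */
static void copy_4_to_7(char *dst, const char *src, size_t n) {
    uint32_t head, tail;
    memcpy(&head, src, 4);            /* first 4 bytes of the source */
    memcpy(&tail, src + n - 4, 4);    /* last 4 bytes (may overlap the first 4) */
    memcpy(dst, &head, 4);
    memcpy(dst + n - 4, &tail, 4);
}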
rep movsb might well be a pretty reasonable choice for small code-size and non-terrible scaling with count for a byte loop like this, with safe handling for the unlikely case of overlap.
Microcode startup overhead is a big problem with using it for copies that are usually small, though, on current CPUs.
It's probably better than a byte loop if the average copy size is maybe 8 to 16 bytes on current CPUs, and/or different counts cause branch mispredicts a lot. It's not good, but it's less bad.
Some kind of last-ditch peephole optimization for turning a byte-loop into a rep movsb might be a good idea, if compiling without auto-vectorization. (Or for compilers like MSVC that make a byte loop even at full optimization.)
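To make that concrete, here's roughly what such a peephole could produce for the original (non-__restrict) function, written as GNU C inline asm. (Hypothetical: no compiler emits this today, and fn_rep is a made-up name.)

#include <stddef.h>

/* Same semantics as the byte loop, including the overlap case, because
   rep movsb is a forward byte-at-a-time copy (DF is clear per the ABI). */
void fn_rep(char *dst, const char *src, int l) {
    if (l <= 0)
        return;                /* guard against negative l (and skip the no-op case) */
    size_t n = (size_t)l;
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(n)
                 :
                 : "memory");
}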
It would be neat if compilers knew about it more directly, and considered using it for -Os (optimize for code-size more than speed) when tuning for CPUs with the Enhanced Rep Movs/Stos Byte (ERMSB) feature. (See also Enhanced REP MOVSB for memcpy for lots of good stuff about x86 memory bandwidth single threaded vs. all cores, NT stores that avoid RFO, and rep movs using an RFO-avoiding cache protocol...)
On older CPUs, rep movsb wasn't as good for large copies, so the recommended strategy was rep movsd or movsq with special handling for the last few counts. (Assuming you're going to use rep movs at all, e.g. in kernel code where you can't touch SIMD vector registers.)
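A sketch of that older strategy as GNU C inline asm (my own illustration; copy_rep_movsq is a made-up name, and like an inlined memcpy it assumes the buffers don't overlap):

#include <stddef.h>

/* Copy n/8 qwords with rep movsq, then let rep movsb handle the last
   n % 8 bytes.  The first asm statement leaves rsi/rdi pointing past the
   qwords, so the second one continues where it left off. */
static void copy_rep_movsq(void *dst, const void *src, size_t n) {
    size_t q = n / 8;          /* qword count     */
    size_t r = n % 8;          /* leftover bytes  */
    asm volatile("rep movsq"
                 : "+D"(dst), "+S"(src), "+c"(q)
                 :
                 : "memory");
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(r)
                 :
                 : "memory");
}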
The -mno-sse auto-vectorization using integer registers is much worse than rep movs for medium sized copies that are hot in L1d or L2 cache, so gcc should definitely use rep movsb or rep movsq after checking for overlap, not a qword copy loop, unless it expects small inputs (like 64 bytes) to be common.
The only advantage of a byte loop is small code size; it's pretty much the bottom of the barrel; a smart strategy like glibc's would be much better for small but unknown copy sizes. But that's too much code to inline, and a function call does have some cost (spilling call-clobbered registers and clobbering the red zone, plus the actual cost of the call / ret instructions and dynamic linking indirection).
Especially in a "cold" function that doesn't run often (so you don't want to spend a lot of code size on it, increasing your program's I-cache footprint, TLB locality, pages to be loaded from disk, etc). If writing asm by hand, you'd usually know more about the expected size distribution and be able to inline a fast-path with a fallback to something else.
Remember that compilers will make their decisions on potentially many loops in one program, and most code in most programs is outside of hot loops. It shouldn't bloat them all. This is why gcc defaults to -fno-unroll-loops unless profile-guided optimization is enabled. (Auto-vectorization is enabled at -O3, though, and can create a huge amount of code for some small loops like this one. It's quite silly that gcc spends huge amounts of code-size on loop prologues/epilogues, but tiny amounts on the actual loop; for all it knows the loop will run millions of iterations for each one time the code outside runs.)
Unfortunately it's not like gcc's auto-vectorized code is very efficient or compact. It spends a lot of code size on the loop cleanup code for the 16-byte SSE case (fully unrolling 15 byte-copies). With 32-byte AVX vectors, we get a rolled-up byte loop to handle the leftover elements. (For a 17 byte copy, this is pretty terrible vs. 1 XMM vector + 1 byte or glibc style overlapping 16-byte copies). With gcc7 and earlier, it does the same full unrolling until an alignment boundary as a loop prologue so it's twice as bloated.
IDK if profile-guided optimization would optimize gcc's strategy here, e.g. favouring smaller / simpler code when the count is small on every call, so auto-vectorized code wouldn't be reached. Or change strategy if the code is "cold" and only runs once or not at all per run of the whole program. Or if the count is usually 16 or 24 or something, then scalar for the last n % 32 bytes is terrible, so ideally PGO would get it to special-case smaller counts. (But I'm not too optimistic.)
I might report a GCC missed-optimization bug for this, about detecting memcpy after an overlap check instead of leaving it purely up to the auto-vectorizer. And/or about using rep movs for -Os, maybe with -mtune=icelake if more info becomes available about that uarch.
A lot of software gets compiled with only -O2, so a peephole for rep movs other than the auto-vectorizer could make a difference. (But the question is whether it's a positive or negative difference!)
Thanks for the explanation why REP is not used more frequently, now I understand the reason. But instead of the byte-copy loop, compilers should always use it, shouldn't they?
– geza
Jul 1 at 9:42
@geza: yes, a byte-copy loop is worse unless the loop count is always small (like 1 to maybe 15) and doesn't branch mispredict on the loop-exit. (So the count has to be predictable if not constant.)
– Peter Cordes
Jul 1 at 9:47
@geza: BTW, glibc's memcpy uses a pretty nice strategy for small inputs that's insensitive to overlap: Two loads -> two stores that potentially overlap, for copies up to 2 registers wide. This means any input from 4..7 bytes branches the same way, for example. The asm source has a nice comment describing the strategy: code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/…. The only advantage of a byte loop is small code size; a smart strategy like glibc's would be much better for small but unknown copy sizes.
– Peter Cordes
Jul 1 at 16:47
That's a nice trick, thanks for sharing!
– geza
Jul 1 at 18:25
Peter, I really appreciate the work you put into this answer, and into this site in general, thanks for it! It's always worth checking back even after your answer is accepted, because maybe you edited the answer to contain even more good stuff :)
– geza
Jul 2 at 4:35
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
For that loop (without
__restrict
), or ever? gcc has-mmemcpy-strategy=rep_4byte
and-minline-all-stringops
to override the tuning options. gcc.gnu.org/onlinedocs/gcc/x86-Options.html.– Peter Cordes
Jul 1 at 9:10