What do you do without fast gather and scatter in AVX2 instructions?


I'm writing a program to detect prime numbers. One part is bit-sieving possible candidates out. I've written a fairly fast program, but I thought I'd see if anyone has some better ideas. My program could use fast gather and scatter instructions, but I'm limited to AVX2 hardware for the x86 architecture (I know AVX-512 has these, though I'm not sure how fast they are).


#include <stdint.h>
#include <immintrin.h>
#include <algorithm>   // std::swap in the scalar fallback

#define USE_AVX2

// Sieve the bits in array sieveX for later use
void sieveFactors(uint64_t *sieveX)
{
    const uint64_t totalX = 5000000;
#ifdef USE_AVX2
    alignas(32) uint64_t indx[4], bits[4];

    const __m256i sieveX2 = _mm256_set1_epi64x((uint64_t)(sieveX));
    const __m256i total = _mm256_set1_epi64x(totalX - 1);
    const __m256i mask = _mm256_set1_epi64x(0x3f);

    // Just filling with some typical values (not really constant)
    __m256i ans = _mm256_set_epi64x(58, 52, 154, 1);
    __m256i ans2 = _mm256_set_epi64x(142, 70, 136, 100);

    __m256i sum = _mm256_set_epi64x(201, 213, 219, 237);  // 3x primes
    __m256i sum2 = _mm256_set_epi64x(201, 213, 219, 237); // These aren't always the same

    // The actual algorithm can change these
    __m256i mod1 = _mm256_set1_epi64x(1);
    __m256i mod3 = _mm256_set1_epi64x(1);

    __m256i mod2, mod4, sum3;

    // Sieve until all factors (start under 32-bit threshold) exceed the limit
    do {
        // Sieve until one of the factors exceeds the limit
        do {
            // Compiler does a nice job converting these into extracts
            *(__m256i *)(&indx[0]) = _mm256_add_epi64(_mm256_srli_epi64(_mm256_andnot_si256(mask, ans), 3), sieveX2);
            *(__m256i *)(&bits[0]) = _mm256_sllv_epi64(mod1, _mm256_and_si256(mask, ans));

            ans = _mm256_add_epi64(ans, sum);

            // Early on these locations can overlap
            *(uint64_t *)(indx[0]) |= bits[0];
            *(uint64_t *)(indx[1]) |= bits[1];
            *(uint64_t *)(indx[2]) |= bits[2];
            *(uint64_t *)(indx[3]) |= bits[3];

            mod2 = _mm256_sub_epi64(total, ans);

            *(__m256i *)(&indx[0]) = _mm256_add_epi64(_mm256_srli_epi64(_mm256_andnot_si256(mask, ans2), 3), sieveX2);
            *(__m256i *)(&bits[0]) = _mm256_sllv_epi64(mod3, _mm256_and_si256(mask, ans2));

            ans2 = _mm256_add_epi64(ans2, sum2);

            // Two kinds of candidates are being sieved at once
            *(uint64_t *)(indx[0]) |= bits[0];
            *(uint64_t *)(indx[1]) |= bits[1];
            *(uint64_t *)(indx[2]) |= bits[2];
            *(uint64_t *)(indx[3]) |= bits[3];

            mod4 = _mm256_sub_epi64(total, ans2);
        } while (!_mm256_movemask_pd(_mm256_castsi256_pd(_mm256_or_si256(mod2, mod4))));

        // Remove one factor
        mod2 = _mm256_castpd_si256(_mm256_blendv_pd(_mm256_setzero_pd(), _mm256_castsi256_pd(sum), _mm256_castsi256_pd(mod2)));
        mod4 = _mm256_castpd_si256(_mm256_blendv_pd(_mm256_setzero_pd(), _mm256_castsi256_pd(sum2), _mm256_castsi256_pd(mod4)));
        ans = _mm256_sub_epi64(ans, mod2);
        ans2 = _mm256_sub_epi64(ans2, mod4);
        sum = _mm256_sub_epi64(sum, mod2);
        sum2 = _mm256_sub_epi64(sum2, mod4);
        sum3 = _mm256_or_si256(sum, sum2);
    } while (!_mm256_testz_si256(sum3, sum3));
#else
    // Just some example values (not really constant, or the compiler would optimize the loop away)
    uint64_t cur = 58;
    uint64_t cur2 = 142;
    uint64_t factor = 67;

    if (cur < cur2) {
        std::swap(cur, cur2);
    }
    while (cur < totalX) {
        sieveX[cur >> 6] |= (1ULL << (cur & 0x3f));
        sieveX[cur2 >> 6] |= (1ULL << (cur2 & 0x3f));
        cur += factor;
        cur2 += factor;
    }
    while (cur2 < totalX) {
        sieveX[cur2 >> 6] |= (1ULL << (cur2 & 0x3f));
        cur2 += factor;
    }
#endif
}



Be warned that the locations can overlap at first. After a short while in the loop, this is not the case. I'd be happy to use a different approach if one is possible. Around 82% of the time within this part of the algorithm is spent in this loop. Hopefully this isn't too close to other posted questions.





Also, even with AVX512 scatter, possible overlap (like cur[0] == cur[1]) means you have to check for conflicts so that gather / SIMD OR / scatter has the same semantics as doing them one at a time. If you can rule that out after a certain point, you might go scalar or with vpconflictq / retry until then, and then use a loop that doesn't check for conflicts.
– Peter Cordes
Jul 2 at 1:59





Updated the code for gcc (options "-mavx2 -O3") even though I'm using MSVC 17. Remember I "DO NOT" have access to an AVX-512 compiler and machine. So using AVX-512 won't help me. I would love to be able to use _mm256_conflict_epi64 and other advanced gather/scatter commands.
– ChipK
Jul 2 at 4:38





Code review: Use alignas(32) uint64_t *indx[4] so you don't have to cast. (And don't declare it with array size 0, unless that's some kind of hack that gets more efficient compiler output from skipping stack alignment because the compiler isn't actually storing/reloading. Was uint64_t indx[0] a typo?) I had to add a declaration for __m256i sum3; to get it to compile (godbolt.org/g/zrSCDa), but yeah the SIMD address calc idea from my answer looks good for Haswell/Skylake. Handy (or not a coincidence :P) that you already had the right mask in a vector.
– Peter Cordes
Jul 2 at 22:57








Yup, typos and I removed a needed variable declaration. I only compiled it - not run it.
– ChipK
Jul 3 at 1:40





It's too bad scalar memory-destination bts is not fast. bts [rdi], rax would set that bit in the bit-string, even if that's outside the dword selected by [rdi]. (That kind of crazy-CISC behaviour is why it's not fast, though! like 10 uops on Skylake.)
– Peter Cordes
Jul 3 at 2:51






2 Answers



IDK why you use different parts of the same cur[8] array for indices and values; it made the source harder to understand, and it took a while to figure out that there was only one real array. The other half was just used to bounce vectors to scalars.





It looks like you're only ever going vector -> scalar, not inserting scalars back into a vector. And also that nothing inside the loop depends on any data in sieveX; I'm not familiar with your sieving algorithm but I guess the point of this is to create data in memory for later use.





AVX2 has gathers (not scatters), but they're only fast on Skylake and newer. They're ok on Broadwell, slowish on Haswell, and slow on AMD. (Like one per 12 clocks for Ryzen's vpgatherqq). See http://agner.org/optimize/ and other performance links in the x86 tag wiki.





Intel's optimization manual has a small section on manual gather / scatter (using insert/extract or movhps) vs. hardware instructions, possibly worth reading. In this case where the indices are runtime variables (not a constant stride or something), I think Skylake can benefit from AVX2 gather instructions here.





See Intel's intrinsics guide to look up the intrinsic for asm instructions like movhps. I'm just talking about what you want to get your compiler to emit, because that's what's important and the asm mnemonics are shorter to type and don't need casting. You have to know the asm mnemonic to look them up in Agner Fog's instruction tables, or to read compiler output from auto-vectorization, so I usually think in asm and then translate that to intrinsics.





With AVX, you have 3 main options:



do everything scalar. Register pressure may be a problem, but generating indices as needed (instead of doing all 4 adds or subs to generate curr[4..7] at once) might help. Unless those mask vectors have different values in different elements.





(Using memory sources for scalar constants might not be bad, though, if they don't fit in 32-bit immediates and if you don't bottleneck on 2 memory ops per clock. The memory-destination or instructions would use indexed addressing modes, so the dedicated store-AGU on port 7 on Haswell and later couldn't be used. Thus AGU throughput could be a bottleneck.)


or



Extracting all 4 elements of a vector to scalar is more expensive than 4x scalar add or shift instructions, but you're doing more work than that. Still, with BMI2 for 1-uop variable-count shifts (instead of 3 on Intel), it might not be terrible. I think we can do better with SIMD, though, especially with careful tuning.





extract indices and values to scalar like you're doing now, so the OR into sieveX is pure scalar. Works even when two or more indices are the same.





This costs you about 7 uops per ymm vector -> 4x scalar registers using extract ALU instructions, or 5 uops using store/reload (worth considering for the compiler, maybe for one or two of the 4 vector extracts, because this code probably doesn't manage to bottleneck on load / store port throughput.) If the compiler turns store/reload in the C source into shuffle/extract instructions, though, you can't easily override its strategy except maybe by using volatile. And BTW, you'd want to use alignas(32) cur[8] to make sure actual vector stores don't cross a cache-line boundary.





or [rdi + rax*8], rdx (with an indexed addressing mode preventing full micro-fusion) is 3 uops on modern Intel CPUs (Haswell and later). We could avoid an indexed addressing mode (making it 2 uops for the front-end) by scaling + adding to the array base address using SIMD: e.g. srli by 3 instead of 6, mask off the low 3 bits (vpand), and vpaddq with set1_epi64(sieveX). So this costs 2 extra SIMD instructions to save 4 uops on SnB-family, per vector of indices. (You'd be extracting uint64_t* pointer elements instead of uint64_t indices. Or if sieveX can be a 32-bit absolute address1, you could skip the vpaddq and extract already-scaled indices for the same gain.)
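The per-element math that SIMD address calculation performs can be sketched in scalar C++ (bit_to_ptr is a hypothetical helper, not from the original code): masking and shifting the bit-index yields the byte offset of its qword directly, so the memory-destination or only needs a simple [reg] addressing mode.

```cpp
#include <stdint.h>

// Scalar equivalent of the vector address calc:
// (bitidx >> 3) & ~7  ==  (bitidx >> 6) * 8, the byte offset of the qword
// holding this bit, so adding the array base gives a ready-to-use pointer.
static inline uint64_t *bit_to_ptr(uint64_t *sieveX, uint64_t bitidx)
{
    uintptr_t byteoff = (uintptr_t)((bitidx >> 3) & ~(uint64_t)7);
    return (uint64_t *)((char *)sieveX + byteoff);
}
```

The vector version does the srli / vpand / vpaddq for four elements at once, then extracts four pointers instead of four indices.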





It would also enable the store-address uops to run on port 7 (Haswell and later); the simple AGU on port7 can only handle non-indexed addressing modes. (This makes extracting values to scalar with store+reload more attractive. You want lower latency for extracting indices, because the values aren't needed until after the load part of a memory-dst or completes.) It does mean more unfused-domain uops for the scheduler / execution units, but could well be worth the tradeoff.





This isn't a win on other AVX2 CPUs (Excavator / Ryzen or Xeon Phi); only SnB-family has a front-end cost and execution-port restrictions for indexed addressing modes.



extract indices, manually gather into a vector with vmovq / vmovhps for a SIMD vpor, then scatter back with vmovq / vmovhps.





Just like a HW gather/scatter, correctness requires that all indices are unique, so you'll want to use one of the above options until you get to that point in your algo. (vector conflict detection + fallback would not be worth the cost vs. just always extracting to scalar: Fallback implementation for conflict detection in AVX2).
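As a concrete illustration, here is a minimal SSE2 sketch of the movq / movhps manual gather, SIMD OR, and scatter for one pair of qword elements (or_pair is a hypothetical helper; correctness requires idx0 != idx1). The _pd casts are just how you spell the high-half load/store with intrinsics; compilers emit movhps or movhpd.

```cpp
#include <immintrin.h>
#include <stdint.h>

// Manual 2-element gather / OR / scatter. idx0 and idx1 must differ,
// otherwise the second store clobbers part of the first element's update.
static inline void or_pair(uint64_t *sieveX, uint64_t idx0, uint64_t idx1,
                           __m128i bits)
{
    __m128i g = _mm_loadl_epi64((const __m128i *)&sieveX[idx0]);  // movq load
    g = _mm_castpd_si128(_mm_loadh_pd(_mm_castsi128_pd(g),
                                      (const double *)&sieveX[idx1])); // high-half load
    g = _mm_or_si128(g, bits);                                    // one SIMD OR
    _mm_storel_epi64((__m128i *)&sieveX[idx0], g);                // movq store
    _mm_storeh_pd((double *)&sieveX[idx1], _mm_castsi128_pd(g));  // high-half store
}
```

With 256-bit vectors you'd do two of these per vector of four indices, which is where the two-independent-128-bit-halves point below comes from.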



See selectively xor-ing elements of a list with AVX2 instructions for an intrinsics version. (I knew I'd recently written an answer with a manual gather / scatter, but took me a while to find it!) In that case I only used 128-bit vectors because there wasn't any extra SIMD work to justify the extra vinserti128 / vextracti128.





Actually I think here you'd want to extract the high half of the _mm256_sllv_epi64 result so you have (the data that would be) cur[4..5] and cur[6..7] in two separate __m128i variables. You'd have vextracti128 / 2x vpor xmm instead of vinserti128 / vpor ymm / vextracti128.





The former has less port5 pressure, and has better instruction-level parallelism: The two 128-bit halves are separate dependency chains that don't get coupled to each other, so store/reload bottlenecks (and cache misses) impact fewer dependent uops, allowing out-of-order execution to keep working on more stuff while waiting.



Doing address calculation in a 256b vector and extracting pointers instead of indices would make vmovhps loads cheaper on Intel (indexed loads can't stay micro-fused to vmovhps2). See the previous bullet point. But vmovq loads/stores are always a single uop, and vmovhps indexed stores can stay micro-fused on Haswell and later, so it's break-even for front-end throughput and worse on AMD or KNL. It also means more unfused-domain uops for the scheduler / execution units, which looks like more of a potential bottleneck than port2/3 AGU pressure. The only advantage is that the store-address uops can run on port 7, relieving some pressure.





AVX2 vpgatherqq for the gather (_mm256_i64gather_epi64(sieveX, srli_result, 8)), then extract indices and manually scatter. So it's exactly like the manual gather / manual scatter, except you replace the manual gather with an AVX2 hardware gather. (Two 128-bit gathers cost more than one 256-bit gather, so you would want to take the instruction-level parallelism hit and gather into a single 256-bit register).





Possibly a win on Skylake (where vpgatherqq ymm is 4 uops / 4c throughput, plus 1 uop of setup), but not even Broadwell (9 uops, one per 6c throughput) and definitely not Haswell (22 uops / 9c throughput). You do need the indices in scalar registers anyway, so you're only saving the manual-gather part of the work. That's pretty cheap.





It looks like this won't bottleneck badly on any one port. GP reg->xmm needs port 5, but xmm->int needs port 0 on SnB-family CPUs, so it's less likely to bottleneck on port 5 when mixed with the shuffles needed for extracting. (e.g. vpextrq rax, xmm0, 1 is a 2 uop instruction, one port 5 shuffle uop to grab the high qword, and a port 0 uop to send that data from SIMD to the integer domain.)





So your SIMD calculation where you need to frequently extract a vector to scalar
is less bad than if you needed to frequently insert scalar calculation results into vectors. See also Loading an xmm from GP regs, but that's talking about data that starts in GP regs, not memory.



extract both / scalar OR: Total = 24 uops = 6 cycles of front-end throughput.





Total = 6 uops for p5, just barely fits. Store/reload for a data extract looks sensible, if you could get your compiler to do that. (But compilers don't typically model the pipeline in enough detail to use a mix of strategies in the same loop to balance port pressure.)



manual gather/scatter: 20 uops, 5 cycles of front-end throughput (Haswell / BDW / Skylake). Also good on Ryzen.





2x vmovhps xmm, xmm, [r] non-indexed load (2 front-end uops micro-fused: 2p23 + 2p5)





vextracti128 split the data (p5)


vpor xmm



Port bottlenecks: 4 p0 and 4 p5 fits comfortably in 5 cycles, especially when you mix this with your loop which can run several of its uops on port 1. On Haswell paddq is only p15 (not p015), and shifts are only p0 (not p01). AVX2 _mm256_sllv_epi64 is 1 uop (p01) on Skylake, but on Haswell it's 3 uops = 2p0 + p5. So Haswell might be closer to a p0 or p5 bottleneck for this loop, in which case you might want to look at a store/reload extract strategy for one vector of indices.





Skipping the SIMD address calc is probably good, because AGU pressure doesn't look like a problem unless you use a store/reload extract. And it means fewer instruction / smaller code-size and fewer uops in the uop cache. (un-lamination doesn't happen until after the decoders / uop cache, so you still benefit from micro-fusion in the early parts of the front-end, just not at the issue bottleneck.)



Skylake AVX2 gather / manual scatter: Total = 18 uops, 4.5 cycles of front-end throughput. (Worse on any earlier uarch or on AMD).



2x vpextrq (4 = 2p0 2p5)



vpcmpeqd ymm0,ymm0,ymm0 create an all-ones mask for vpgatherqq (p015)





vpgatherqq ymm1, [rdi + ymm2*8], ymm0 4 uops for some ports.





vpor ymm (p015)





So even with the best throughput choice, we're still only managing 4 loads / 4 stores per 4.5 cycles, and that's without considering the SIMD work in the loop which costs some front-end throughput. So we're not close to bottlenecking on AGU throughput and having to worry about using port 7.



We could maybe think about store/reload for one of the extracts (if we were the compiler), replacing the 7-uop, 5-instruction vextracti128 / 2x vmovq / 2x vpextrq sequence with a 5-uop store / 4x reload.



You say that after a certain point, you don't have conflicts (overlap) between the indices like cur[0] == cur[2].





You definitely want a separate loop that doesn't check for conflicts at all to take advantage of this. Even if you had AVX512, Skylake's vpconflictq is micro-code and not fast. (KNL has single-uop vpconflictq but it's still faster to avoid it entirely).





I'll leave it up to you (or a separate question) how to figure out for sure when you're done with conflicts and can leave the loop that accounts for that possibility.



You probably want the extract indices + data strategy while there can be conflicts. SIMD conflict checking is possible, but it's not cheap, 11 uops for 32-bit elements: Fallback implementation for conflict detection in AVX2. A qword version is obviously much cheaper than dword (fewer shuffles and compares to get all against all), but you probably still only want to do it every 10 iterations or so of your extract loop.



There's not a huge speedup from the best scalar-or version to the best gather version (6 cycles vs. 4.5, and that's without accounting for the other work in the loop, so the ratio is even smaller than that). Exiting the slightly slower version as early as possible is not worth making every iteration a lot slower with conflict checks.



So if you can reliably detect when you're done with conflicts, use something like


int conflictcheck = 10;

do {
    if (--conflictcheck == 0) {
        // vector stuff to check for conflicts
        if (no conflicts now or in the future)
            break;

        conflictcheck = 10;   // reset the down-counter
    }

    // main loop body, extract -> scalar OR strategy

} while(blah);


// then fall into the gather/scatter loop.
do {
    // main loop body, gather + manual scatter strategy
} while();



That should compile to a dec / je which only costs 1 uop in the not-taken case.





Doing 9 extra iterations total of the slightly-slower loop is much better than doing thousands of extra expensive conflict checks.



Footnote 1:



If sieveX is static and you're building non-PIC code on Linux (not MacOS) then its address will fit in a disp32 as part of a [reg+disp32] addressing mode. In that case you can leave out the vpaddq. But getting a compiler to treat a uint64_t as an already-scaled array index (with its low bits cleared) would be ugly. Probably have to cast sieveX to uintptr_t and add, then cast back.





This isn't possible in a PIE executable or shared library (where 32-bit absolute addresses aren't allowed), or on OS X at all (where static addresses are always above 2^32). I'm not sure what Windows allows. Note that [disp32 + reg*8] only has 1 register, but is still an indexed addressing mode so all the SnB-family penalties apply. But if you don't need scaling, reg + disp32 is just base + disp32.





Footnote 2: Fun fact: non-VEX movhps loads can stay micro-fused on Haswell. It won't cause an SSE/AVX stall on Skylake, but you won't get a compiler to emit the non-VEX version in the middle of an AVX2 function.





IACA (Intel's static analysis tool) gets this wrong, though. :( What is IACA and how do I use it?.



This is basically a missed-optimization for -mtune=skylake, but it would stall on Haswell: Why is this SSE code 6 times slower without VZEROUPPER on Skylake?.





The "penalty A" (execute SSE with dirty upper) on Skylake is merely a false dependency on that one register. (And a merging uop for instructions that would otherwise be write-only, but movhps is already a read-modify-write of its destination.) I tested this on Skylake with Linux perf to count uops, with this loop:




    mov     r15d, 100000000
.loop:
    vpaddq  ymm0, ymm1, ymm2        ; dirty the upper part
    vpaddq  ymm3, ymm1, ymm2        ; dirty another register for good measure

    vmovq   xmm0, [rdi+rbx*8]       ; zero the full register, breaking dependencies
    movhps  xmm0, [rdi+rbx*8+8]     ; RMW the low 128 bits
                                    ; fast on Skylake, will stall on Haswell
    dec     r15d
    jnz     .loop



The loop runs at ~1.25 cycles per iteration on Skylake (i7-6700k), maxing out the front-end throughput of 4 uops per clock. 5 total fused-domain uops (uops_issued.any), 6 unfused-domain uops (uops_executed.thread). So micro-fusion was definitely happening for movhps without any SSE/AVX problems.





Changing it to vmovhps xmm0, xmm0, [rdi+rbx*8+8] slowed it down to 1.50 cycles per iteration, now 6 fused-domain, but still the same 6 unfused-domain uops.





There's no extra uop if the upper half of ymm0 is dirty when movhps xmm0, [mem] runs. I tested by commenting out the vmovq. But changing vmovq to movq does result in an extra uop: movq becomes a micro-fused load+merge that replaces the low 64 bits (and still zeros the upper 64 bits of xmm0 so it's not quite movlps).





Also note that pinsrq xmm0, [mem], 1 can't micro fuse even without VEX. But with VEX, you should prefer vmovhps for code-size reasons.





Your compiler may want to "optimize" the intrinsic for movhps on integer data into vpinsrq, though, I didn't check.







You're right. The code needed some cleaning up (which I did to some degree). I added the C code as well for reference. This function is mainly code pulled out of a much larger function. Thanks for spending so much time. Once I have a chance to go through it all I'll probably have more comments.
– ChipK
Jul 2 at 14:35





Updated the code to use fused uop approach. Note: sieveX is a 64-bit address.
– ChipK
Jul 2 at 17:32





@ChipK: yeah, I didn't think it was likely you could take advantage of it being a 32-bit absolute address (i.e. static storage in non-PIC code). That's why I put that in a footnote and wrote the rest of the answer without assuming that. :P
– Peter Cordes
Jul 2 at 18:48





Wow! I guess my next task is to separate out the conflict portion of the loop from the non-conflict part and then try some of the methods you described. It gives me hope that there is some performance to be gained (with "hard"-ish work). Thanks again!
– ChipK
Jul 2 at 22:37





@ChipK: note that with two separate gather/scatters into two 128-bit vectors, you can arrange it so you can use the manual-gather/scatter loop even if idx[0] and idx[1] conflict. i.e. gather idx[0] and idx[2] together, and 1,3 together. Scatter two results before gathering the next two. (Arrange your 256-bit vectors to interleave elements in that order.) If that stride helps, this could let you get into the faster loop sooner. (And reduce the cost of conflict detection to maybe one shuffle / compare / movmsk / test.) ...
– Peter Cordes
Jul 2 at 23:02






I just looked at exactly what you're doing here: For the mod1 = mod3 = _mm256_set1_epi64x(1); case, you're just setting single bits in a bitmap with elements of ans as the index.





And it's unrolled by two, with ans and ans2 running in parallel, using mod1 << ans and mod3 << ans2. Comment your code and explain what's going on in the big picture using English text! This is just a very complicated implementation of the bit-setting loop of a normal Sieve of Eratosthenes. (So it would have been nice if the question had said so in the first place.)





Unrolling with multiple start/strides in parallel is a very good optimization, so you normally set multiple bits in a cache line while it's still hot in L1d. Cache-blocking for fewer different factors at once has similar advantages. Iterate over the same 8kiB or 16kiB chunk of memory repeatedly for multiple factors (strides) before moving on to the next. Unrolling with 4 offsets for each of 2 different strides could be a good way to create more ILP.



The more strides you run in parallel, the slower you go through new cache lines the first time you touch them, though. (Giving cache / TLB prefetch room to avoid even an initial stall). So cache blocking doesn't remove all the benefit of multiple strides.
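The blocking idea above can be sketched in scalar C++ (the names and the 16 KiB window size are illustrative, not from the original program):

```cpp
#include <stdint.h>

// Sweep one L1d-sized window of the bitmap for every stride before
// advancing, so each cache line gets hit by all factors while still hot.
static void sieve_blocked(uint64_t *sieve, uint64_t totalBits,
                          uint64_t *cur, const uint64_t *stride, int nFactors)
{
    const uint64_t blockBits = 16 * 1024 * 8;   // 16 KiB window, in bits
    for (uint64_t base = 0; base < totalBits; base += blockBits) {
        uint64_t lim = base + blockBits < totalBits ? base + blockBits : totalBits;
        for (int f = 0; f < nFactors; f++) {
            uint64_t c = cur[f];
            for (; c < lim; c += stride[f])
                sieve[c >> 6] |= 1ULL << (c & 63);
            cur[f] = c;                          // resume point for the next window
        }
    }
}
```

The inner bit-setting loop is where the extract or gather/scatter strategies above would slot in; the blocking structure is independent of that choice.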



A single 256-bit vector load/VPOR/store can set multiple bits. The trick is creating a vector constant, or set of vector constants, with bits in the right position. The repeating pattern is something like LCM(256, bit_stride) bits long, though. For example stride=3 would repeat in a pattern that's 3 vectors long. This very quickly gets unusable for odd / prime strides unless there's something more clever :(
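A scalar-word version of the same idea for stride 3 (the 64-bit analogue: the pattern repeats every LCM(64, 3) = 192 bits, i.e. 3 qwords; make_pattern3 is a hypothetical helper):

```cpp
#include <stdint.h>

// Precompute the 3-qword repeating pattern of every-3rd-bit-set.
// After this, sieve[i] |= pat[i % 3] sets every third bit of the bitmap
// with whole-word ORs instead of per-bit RMWs (or one 256-bit VPOR per
// 4 words, with suitably rotated vector constants).
static void make_pattern3(uint64_t pat[3])
{
    pat[0] = pat[1] = pat[2] = 0;
    for (unsigned b = 0; b < 192; b += 3)
        pat[b >> 6] |= 1ULL << (b & 63);
}
```

As noted, this only stays practical for small strides; the pattern length grows with the stride.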





64-bit scalar is interesting because bitwise rotate is available to create a sequence of patterns, but variable-count rotate on SnB-family CPUs costs 2 uops.



There might be more you can do with this; maybe unaligned loads could help somehow.



A repeating pattern of bitmasks could be useful even for the large-stride case, e.g. rotating by stride % 8 every time. But that would be more useful if you were JITing a loop that hard-coded the pattern into or byte [mem], imm8, with an unroll factor chosen to be congruent with the repeat length.





You don't have to load/modify/store 64-bit chunks when you're only setting a single bit. The narrower your RMW operations, the closer your bit-indices can be without conflicting.



(But you don't have a long loop-carried dep chain on the same location; you will move on before OoO exec stalls waiting for reloads at the end of a long chain. So if conflicts aren't a correctness problem, they're unlikely to make a big perf difference here. Unlike a bitmap histogram or something, where a long chain of repeated hits on nearby bits could easily happen.)



32-bit elements would be an obvious choice. x86 can efficiently load/store dwords to/from SIMD registers as well as scalar. (scalar byte ops are efficient, too, but byte stores from SIMD regs always require multiple uops with pextrb.)





If you're not gathering into SIMD registers, the SIMD element width for ans / ans2 doesn't have to match the RMW width. 32-bit RMW has advantages over 8-bit if you want to split a bit-index into address / bit-offset in scalar, using shifts or bts that implicitly mask the shift count to 32 bits (or 64 bits for 64-bit shifts). But 8-bit shlx or bts doesn't exist.





The main advantage of using 64-bit SIMD elements is if you're calculating a pointer instead of just an index. If you could restrict your sieveX to 32 bits you'd still be able to do this. e.g. allocate with mmap(..., MAP_32BIT|MAP_ANONYMOUS, ...) on Linux. That's assuming you don't need more than 2^32 bits (512MiB) sieve space, so your bit indices always fit in 32-bit elements. If that's not the case, you could still use 32-bit element vectors up to that point, then use your current loop for the high part.
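A possible Linux allocation along those lines (error handling kept minimal; alloc_sieve_low is a hypothetical helper):

```cpp
#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

// Map the sieve in the low 32 bits of address space so that byte addresses
// (not just bit indices) fit in 32-bit SIMD elements. Linux x86-64 only:
// MAP_32BIT places the mapping in the first 2 GiB.
static uint64_t *alloc_sieve_low(size_t bits)
{
    size_t bytes = (bits + 63) / 64 * 8;
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    return p == MAP_FAILED ? NULL : (uint64_t *)p;
}
```

On Windows you'd need a different mechanism (e.g. reserving a region at a chosen low base address), which is outside the scope of this sketch.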





If you use 32-bit SIMD elements without restricting sieveX to be a 32-bit pointer, you'd have to give up on using SIMD pointer calculations and just extract a bit-index, or still split in SIMD into idx/bit and extract both.





(With 32-bit elements, a SIMD -> scalar strategy based on store/reload looks even more attractive, but in C that's mostly up to your compiler.)



If you were manually gathering into 32-bit elements, you couldn't use movhps anymore. You'd have to use pinsrd / pextrd for the high 3 elements, and those never micro-fuse / always need a port5 uop on SnB-family. (Unlike movhps which is a pure store). But that means vpinsrd is still 2 uops with an indexed addressing mode. You could still use vmovhps for element 2 (then overwrite the top dword of the vector with vpinsrd); unaligned loads are cheap and it's ok to overlap the next element. But you can't do movhps stores, and that's where it was really good.





There are two big performance problems with your current strategy:



Apparently you're sometimes using this with some elements of mod1 or mod3 being 0, resulting in completely useless wasted work, doing [mem] |= 0 for those strides.





I think once an element in ans or ans2 reaches total, you're going to fall out of the inner loop and then do the ans -= sum fixup every time through the outer loop. You don't necessarily want to reset it back to ans = sum (for that element) to redo the ORing (setting bits that were already set), because that memory will be cold in cache. What we really want is to pack the remaining still-in-use elements into known locations and enter other versions of the loop that only do 7, then 6, then 5 total elements. Then we're down to only 1 vector.





That seems really clunky. A better strategy for one element hitting the end might be to finish the remaining three in that vector with scalar, one at a time, then run the remaining single __m256i vector. If the strides are all nearby, you probably get good cache locality.





Splitting the bit-index into a qword index and a bitmask with SIMD and then extracting both separately costs a lot of uops for the scalar-OR case: so many that you're not bottlenecking on 1-per-clock store throughput, even with all the optimizations in my scatter/gather answer. (Cache misses may slow this down sometimes, but fewer front-end uops means a larger out-of-order window to find parallelism and keep more memory ops in flight.)



If we can get the compiler to make good scalar code to split a bit-index, we could consider pure scalar. Or at least extracting only bit-indices and skipping the SIMD shift/mask stuff.



It's too bad scalar memory-destination bts is not fast. bts [rdi], rax would set that bit in the bit-string, even if that's outside the dword selected by [rdi]. (That kind of crazy-CISC behaviour is why it's not fast, though! like 10 uops on Skylake.)





Pure scalar may not be ideal, though. I was playing around with this on Godbolt:


#include <immintrin.h>
#include <stdint.h>
#include <algorithm>

// Sieve the bits in array sieveX for later use
void sieveFactors(uint64_t *sieveX64, unsigned cur1, unsigned cur2, unsigned factor1, unsigned factor2)
{
    const uint64_t totalX = 5000000;
#ifdef USE_AVX2
    //...
#else
    //uint64_t cur = 58;
    //uint64_t cur2 = 142;
    //uint64_t factor = 67;
    uint32_t *sieveX = (uint32_t*)sieveX64;

    if (cur1 > cur2) {
        // TODO: if factors can be different, properly check which will end first
        std::swap(cur1, cur2);
        std::swap(factor1, factor2);
    }
    // factor1 = factor2;  // is this always true?

    while (cur2 < totalX) {
        sieveX[cur1 >> 5] |= (1U << (cur1 & 0x1f));
        sieveX[cur2 >> 5] |= (1U << (cur2 & 0x1f));
        cur1 += factor1;
        cur2 += factor2;
    }
    while (cur1 < totalX) {
        sieveX[cur1 >> 5] |= (1U << (cur1 & 0x1f));
        cur1 += factor1;
    }
#endif
}



Note how I replaced your outer if() (which chose between two loops) with sorting cur1, cur2.



GCC and clang put a 1 in a register outside the loop, and use shlx r9d, ecx, esi inside the loop to do 1U << (cur1 & 0x1f) in a single uop without destroying the 1. (MSVC uses load / BTS / store, but it's clunky with a lot of mov instructions. I don't know how to tell MSVC it's allowed to use BMI2.)





If an indexed addressing mode for or [mem], reg didn't cost an extra uop, this would be great.





The problem is that you need a shr reg, 5 in there somewhere, and that's destructive. Putting 5 in a register and using it to copy-and-shift the bit-index would be an ideal setup for load / BTS / store, but compilers don't seem to know that optimization.





Optimal(?) scalar split and use of a bit-index


    mov     ecx, 5              ; outside the loop

.loop:
    ; ESI is the bit-index.
    ; Could be pure scalar, or could come from an extract of ans directly

    shrx    edx, esi, ecx       ; EDX = ESI>>5 = dword index
    mov     eax, [rdi + rdx*4]
    bts     eax, esi            ; set bit esi % 32 in EAX
    mov     [rdi + rdx*4], eax

    ; ... more unrolled iterations ...

    ; add esi, r10d             ; ans += factor if we're doing scalar

    ...
    cmp/jb  .loop



So given a bit-index in a GP register, that's 4 uops to set the bit in memory. Notice that the load and store are both with mov, so indexed addressing modes have no penalty on Haswell and later.





But the best I could get compilers to make was 5, I think, using shlx / shr / or [mem], reg. (With an indexed addressing mode the or is 3 uops instead of 2.)





I think if you're willing to use hand-written asm, you can go faster with this scalar and ditch SIMD entirely. Conflicts are never a correctness problem for this.



Maybe you can even get a compiler to emit something comparable, but even a single extra uop per unrolled RMW is a big deal.





Thanks, I'll take a look at your research. My whole system is set up to use 64-bit entries and changing it would require too much work IMO. Each bit array (or portion of one) is limited to fit within L2 cache (not fully implemented yet). As you summarized, it is a segmented prime number sieve. My program has come a long way from the initial implementation (about 20-30X speedup). But I'd love to think there is more to be gotten - as you've shown. Right now, my target platform is Windows but I'm hoping to eventually have a Linux version working (shouldn't be a big issue).
– ChipK
Jul 3 at 14:46





I don't mind switching to a scalar approach if it is faster. I mainly used a vector approach to calculate initial starting position (requires several 64-bit modulos) and to prevent segment overruns (as I said before, branch predictor was going crazy with 4 different stopping conditions). I don't think a JIT is in the cards - I used an approach similar to that to sieve small (less than 64) factors. Sorry for the lack of comments and weird variable names - I need to cleanup/rewrite some of the code for clarity.
– ChipK
Jul 3 at 14:52





The only time ORing with zero happens is in segments where the factor (plus initial starting alignment) exceeds the ending value. So this doesn't happen unless it is using some small segments. Therefore, it's not really wasting time doing RMWs on useless elements except in some exceptionally rare conditions.
– ChipK
Jul 3 at 15:29






I tried a scalar method for the indexes and bit locations, but it seems that I'm having register pressure and the compiler cannot hold all the data in registers (I'm sure hand assembly would probably fix that).
– ChipK
Jul 3 at 17:28





I implemented the scalar version and it is indeed faster - about 10%
– ChipK
Jul 4 at 22:32






