From: Mateusz Guzik
> Sent: 30 August 2023 15:03
...
> Hand-rolled mov loops executing in this case are quite pessimal compared
> to rep movsq for bigger sizes. While the upper limit depends on uarch,
> everyone is well south of 1KB AFAICS and sizes bigger than that are
> common.
That unrolled loop is pretty pessimal and very much 1980s.
It should be pretty easy to write a code loop that runs
at one copy (8 bytes) per clock on modern desktop x86.
I think that matches 'rep movsq'.
(It will run slower on Atom-based cpus.)
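For reference, the 'rep movsq' being compared against is just the
following sort of wrapper (untested sketch, function name made up;
assumes len is a multiple of 8):

#include <stddef.h>

/* 'rep movsq' takes dst in %rdi, src in %rsi and the qword count in
 * %rcx implicitly; the tied constraints hand those registers over and
 * the dummy outputs mark them as clobbered. */
static void copy_rep_movsq(void *dst, const void *src, size_t len)
{
	long d0, d1, d2;

	asm volatile("rep movsq"
		     : "=D" (d0), "=S" (d1), "=c" (d2)
		     : "0" (dst), "1" (src), "2" (len / 8)
		     : "memory");
}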
A very simple copy loop needs (using negative offsets
from the end of the buffer):
A memory read
A memory write
An increment
A jnz
Doing all of those every clock is well within the cpu's capabilities.
However I've never managed a 1 clock loop.
So you need to unroll once (and only once) to copy 8 bytes/clock.
So for copies that are multiples of 16 bytes something like:
# dst in %rdi, src in %rsi, len in %rdx (a non-zero multiple of 16)
add %rdx, %rdi		# point both pointers just past the end
add %rdx, %rsi
neg %rdx		# index runs from -len up to zero
1:
mov 0(%rsi, %rdx), %rcx
mov %rcx, 0(%rdi, %rdx)
add $16, %rdx		# sets ZF for the jnz below
mov -8(%rsi, %rdx), %rcx
mov %rcx, -8(%rdi, %rdx)
jnz 1b
That is likely to execute an iteration every two clocks.
The memory reads and writes all get queued up and will happen at
some point - so memory latency doesn't matter at all.
For copies (over 16 bytes) that aren't multiples of
16 it is probably just worth copying the first 16 bytes
and then doing 16-byte copies that align with the end
of the buffer - copying some bytes twice.
(Or even copy the last 16 bytes first and copy aligned
with the start.)
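In C the 'copy some bytes twice' version is roughly this (untested
sketch, helper names made up; assumes len >= 16 and non-overlapping
buffers):

#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Copy 16 bytes with two 8-byte loads and two 8-byte stores. */
static inline void copy16(unsigned char *d, const unsigned char *s)
{
	uint64_t a, b;

	memcpy(&a, s, 8);
	memcpy(&b, s + 8, 8);
	memcpy(d, &a, 8);
	memcpy(d + 8, &b, 8);
}

/* Copy the first 16 bytes, then 16-byte chunks that are aligned with
 * the end of the buffer, so the first chunk may overlap the head copy
 * and some bytes get written twice. */
static void copy_overlap_tail(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t off = len & 15;	/* offset of the first end-aligned chunk */

	copy16(d, s);		/* head, possibly misaligned */
	for (; off + 16 <= len; off += 16)
		copy16(d + off, s + off);
}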
I'm also not at all sure how much it is worth optimising
mis-aligned transfers.
An IP checksum function (which only does reads) is only
just measurably slower for misaligned buffers.
Less than 1 clock per cache line.
I think you can get an idea of what happens from looking
at the PCIe TLPs generated for misaligned transfers and
assuming that memory requests get much the same treatment.
Last time I checked (on an i7-7700) misaligned transfers
were done in 8-byte chunks (SSE/AVX) and accesses that
crossed a cache-line boundary were split into two.
Since the cpu can issue two reads per clock, not all of the
split reads (to cache) will take an extra clock.
(Which sort of matches what we see.)
OTOH misaligned writes that cross a cache-line boundary
probably always take a 1 clock penalty.
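A quick and dirty harness for that sort of measurement (my own
throwaway code, nothing official; rdtsc counts reference cycles so
only the ratio means anything):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>

/* Time the same memcpy() with the source aligned and offset by one
 * byte, keeping the best of many runs. */
#define COPY_SIZE (64 * 1024)

static unsigned char src_buf[COPY_SIZE + 64];
static unsigned char dst_buf[COPY_SIZE + 64];

static uint64_t best_cycles(size_t src_off)
{
	uint64_t best = ~(uint64_t)0;
	int i;

	for (i = 0; i < 1000; i++) {
		uint64_t t0 = __rdtsc();
		memcpy(dst_buf, src_buf + src_off, COPY_SIZE);
		uint64_t t1 = __rdtsc();

		if (t1 - t0 < best)
			best = t1 - t0;
	}
	return best;
}

int main(void)
{
	printf("aligned src:    %llu\n", (unsigned long long)best_cycles(0));
	printf("misaligned src: %llu\n", (unsigned long long)best_cycles(1));
	return 0;
}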
David
On 9/1/23, David Laight <[email protected]> wrote:
> From: Mateusz Guzik
>> Sent: 30 August 2023 15:03
> ...
>> Hand-rolled mov loops executing in this case are quite pessimal compared
>> to rep movsq for bigger sizes. While the upper limit depends on uarch,
>> everyone is well south of 1KB AFAICS and sizes bigger than that are
>> common.
>
> That unrolled loop is pretty pessimal and very much 1980s.
>
> It should be pretty easy to write a code loop that runs
> at one copy (8 bytes) per clock on modern desktop x86.
> I think that matches 'rep movsq'.
> (It will run slower on Atom-based cpus.)
>
> A very simple copy loop needs (using negative offsets
> from the end of the buffer):
> A memory read
> A memory write
> An increment
> A jnz
> Doing all of those every clock is well within the cpu's capabilities.
> However I've never managed a 1 clock loop.
> So you need to unroll once (and only once) to copy 8 bytes/clock.
>
When I was playing with this stuff about 5 years ago I found 32-byte
loops to be optimal for uarchs of the period (Skylake, Broadwell,
Haswell and so on), but only up to a point where rep wins.
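The sort of thing I mean, modulo register choice (untested sketch;
len assumed to be a non-zero multiple of 32):

#include <stddef.h>

/* 32 bytes per iteration: four loads, four stores, one pointer update
 * and one branch, walking a negative index up towards zero as in the
 * loop quoted below. */
static void copy_mov_32(void *dst, const void *src, size_t len)
{
	asm volatile("add	%[len], %[dst]\n\t"
		     "add	%[len], %[src]\n\t"
		     "neg	%[len]\n"
		     "1:\n\t"
		     "mov	 0(%[src], %[len]), %%r8\n\t"
		     "mov	 8(%[src], %[len]), %%r9\n\t"
		     "mov	16(%[src], %[len]), %%r10\n\t"
		     "mov	24(%[src], %[len]), %%r11\n\t"
		     "mov	%%r8,   0(%[dst], %[len])\n\t"
		     "mov	%%r9,   8(%[dst], %[len])\n\t"
		     "mov	%%r10, 16(%[dst], %[len])\n\t"
		     "mov	%%r11, 24(%[dst], %[len])\n\t"
		     "add	$32, %[len]\n\t"
		     "jnz	1b"
		     : [dst] "+r" (dst), [src] "+r" (src), [len] "+r" (len)
		     :
		     : "r8", "r9", "r10", "r11", "memory", "cc");
}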
> So for copies that are multiples of 16 bytes something like:
> # dst in %rdi, src in %rsi, len in %rdx (a non-zero multiple of 16)
> add %rdx, %rdi		# point both pointers just past the end
> add %rdx, %rsi
> neg %rdx		# index runs from -len up to zero
> 1:
> mov 0(%rsi, %rdx), %rcx
> mov %rcx, 0(%rdi, %rdx)
> add $16, %rdx		# sets ZF for the jnz below
> mov -8(%rsi, %rdx), %rcx
> mov %rcx, -8(%rdi, %rdx)
> jnz 1b
>
> That is likely to execute an iteration every two clocks.
> The memory reads and writes all get queued up and will happen at
> some point - so memory latency doesn't matter at all.
>
> For copies (over 16 bytes) that aren't multiples of
> 16 it is probably just worth copying the first 16 bytes
> and then doing 16-byte copies that align with the end
> of the buffer - copying some bytes twice.
> (Or even copy the last 16 bytes first and copy aligned
> with the start.)
>
It would definitely be beneficial to align the target buffer in this
routine (as in, non-FSRM), but it is unclear to me whether you are
suggesting that for mov loops or for rep.
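For the rep side, what I mean by aligning the target is roughly this
(untested sketch, names made up; assumes len >= 16 and non-overlapping
buffers):

#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* One misaligned 8-byte copy at the head, round the destination up to
 * an 8-byte boundary, let 'rep movsq' do the aligned bulk, then one
 * misaligned (possibly overlapping) 8-byte copy at the tail. */
static void copy_align_dst_rep(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t head = (0 - (uintptr_t)d) & 7;	/* bytes to the next 8-byte boundary */
	long d0, d1, d2;

	memcpy(d, s, 8);			/* head, may overlap the rep range */
	asm volatile("rep movsq"
		     : "=D" (d0), "=S" (d1), "=c" (d2)
		     : "0" (d + head), "1" (s + head), "2" ((len - head) / 8)
		     : "memory");
	memcpy(d + len - 8, s + len - 8, 8);	/* tail, may overlap the rep range */
}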
I never tested rep for really big sizes and misaligned targets; for
the sizes where hand-rolled movs used to win against rep, spending
time aligning was more expensive than suffering the misaligned (and
possibly overlapped) stores.
If anything I find it rather surprising how inconsistent the string
ops are -- why does memcpy use overlapping stores while memset does
not? Someone(tm) should transplant it, along with slapping rep past a
certain size on both.
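For illustration, the overlapping-store trick applied to a memset tail
would be something like this (untested sketch, name made up; assumes
len >= 8):

#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Fill in 8-byte chunks from the start and finish with one 8-byte
 * store aligned with the end of the buffer, so the last store may
 * overlap the previous one and no byte loop is needed. */
static void memset_overlap_tail(void *dst, int c, size_t len)
{
	unsigned char *d = dst;
	uint64_t v = 0x0101010101010101ull * (unsigned char)c;
	size_t off;

	for (off = 0; off + 8 <= len; off += 8)
		memcpy(d + off, &v, 8);
	memcpy(d + len - 8, &v, 8);	/* overlapping tail store */
}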
--
Mateusz Guzik <mjguzik gmail.com>
...
> When I was playing with this stuff about 5 years ago I found 32-byte
> loops to be optimal for uarchs of the period (Skylake, Broadwell,
> Haswell and so on), but only up to a point where rep wins.
Does the 'rep movsq' ever actually win?
(Unless you find one of the ERMS (or similar) versions.)
IIRC it only ever does one iteration per clock - and you
should be able to match that with a carefully constructed loop.
Many years ago I got my Athlon-700 to execute a copy loop
as fast as 'rep movs' - but the setup times were longer.
The killer for 'rep movs' setup was always P4-netburst - over 40 clocks.
But I think some of the more recent cpus are still in double figures
(apart from some optimised copies).
So I'm not actually sure you should ever need to switch
to a 'rep movsq' loop - but I've not tried to write it.
I did have to unroll the ip-cksum loop 4 times (as):
+ asm( " bt $4, %[len]\n"
+ " jnc 10f\n"
+ " add (%[buff], %[len]), %[sum_0]\n"
+ " adc 8(%[buff], %[len]), %[sum_1]\n"
+ " lea 16(%[len]), %[len]\n"
+ "10: jecxz 20f\n" // %[len] is %rcx
+ " adc (%[buff], %[len]), %[sum_0]\n"
+ " adc 8(%[buff], %[len]), %[sum_1]\n"
+ " lea 32(%[len]), %[len_tmp]\n"
+ " adc 16(%[buff], %[len]), %[sum_0]\n"
+ " adc 24(%[buff], %[len]), %[sum_1]\n"
+ " mov %[len_tmp], %[len]\n"
+ " jmp 10b\n"
+ "20: adc %[sum_0], %[sum]\n"
+ " adc %[sum_1], %[sum]\n"
+ " adc $0, %[sum]\n"
In order to get one adc every clock.
But only because of the strange loop required to 'loop carry' the
carry flag (the 'loop' instruction is OK on AMD cpus, but not on Intel).
A similar loop using adox and adcx will beat one read/clock
provided it is unrolled again.
(IIRC I got to about 12 bytes/clock.)
David