2021-11-01 05:03:44

by Noah Goldstein

Subject: [PATCH v1] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S

Add check for "short distance movsb" for forwards FSRM usage and
entirely remove backwards 'rep movsq'. Both of these usages hit "slow
modes" that are an order of magnitude slower than usual.

'rep movsb' has some noticeable VERY slow modes that the current
implementation is either 1) not checking for or 2) intentionally
using.

All times are in cycles and measuring the throughput of copying 1024
bytes.

1. For FSRM, when 'dst - src' is in (1, 63] or (4GB, 4GB + 63] it is
an order of magnitude slower than normal and much slower than a 4x
'movq' loop.

FSRM forward (dst - src == 32) -> 1113.156
FSRM forward (dst - src == 64) -> 120.669

ERMS forward (dst - src == 32) -> 209.326
ERMS forward (dst - src == 64) -> 118.22

2. For both FSRM and ERMS backwards 'rep movsb' is always slow. Both
of the times below are with dst % 256 == src % 256 which mirrors
the usage of the previous implementation.

FSRM backward -> 1196.039
ERMS backward -> 1191.873

As a reference, this is how a 4x 'movq' loop performs:

4x Forward (dst - src == 32) -> 128.273
4x Backward -> 130.183

Signed-off-by: Noah Goldstein <[email protected]>
---
arch/x86/lib/memmove_64.S | 28 ++++++++++------------------
1 file changed, 10 insertions(+), 18 deletions(-)

diff --git a/arch/x86/lib/memmove_64.S b/arch/x86/lib/memmove_64.S
index 64801010d312..9d5f3ec4db04 100644
--- a/arch/x86/lib/memmove_64.S
+++ b/arch/x86/lib/memmove_64.S
@@ -39,7 +39,16 @@ SYM_FUNC_START(__memmove)

/* FSRM implies ERMS => no length checks, do the copy directly */
.Lmemmove_begin_forward:
- ALTERNATIVE "cmp $0x20, %rdx; jb 1f", "", X86_FEATURE_FSRM
+ /*
+ * Don't use FSRM 'rep movsb' if 'dst - src' in (0, 63] or (4GB, 4GB +
+ * 63]. It hits a slow case which is an order of magnitude slower.
+ */
+ ALTERNATIVE "cmp $0x20, %rdx;"
+ "jb 1f;"
+ "mov %edi, %ecx;"
+ "sub %esi, %ecx;"
+ "cmp $63, %ecx;"
+ "jb 3f;", "", X86_FEATURE_FSRM
ALTERNATIVE "", "movq %rdx, %rcx; rep movsb; retq", X86_FEATURE_ERMS

/*
@@ -89,23 +98,6 @@ SYM_FUNC_START(__memmove)
jmp 13f
.Lmemmove_end_forward:

- /*
- * Handle data backward by movsq.
- */
- .p2align 4
-7:
- movq %rdx, %rcx
- movq (%rsi), %r11
- movq %rdi, %r10
- leaq -8(%rsi, %rdx), %rsi
- leaq -8(%rdi, %rdx), %rdi
- shrq $3, %rcx
- std
- rep movsq
- cld
- movq %r11, (%r10)
- jmp 13f
-
/*
* Start to prepare for backward copy.
*/
--
2.25.1
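
For reference, the check added in the ALTERNATIVE above does the subtraction on
the 32-bit registers, so a single unsigned compare against 63 covers both slow
ranges named in the commit message. A standalone C illustration of that check
(not part of the patch; the function and the addresses are made up for the
example):

#include <stdint.h>
#include <stdio.h>

/* Return 1 if the forward FSRM 'rep movsb' path should be skipped. */
static int fsrm_short_distance(uint64_t dst, uint64_t src)
{
	/*
	 * Mirror of "mov %edi, %ecx; sub %esi, %ecx; cmp $63, %ecx":
	 * only the low 32 bits of (dst - src) are compared, so any
	 * distance whose low 32 bits land in [0, 63] is caught; that
	 * covers both (0, 63] and (4GB, 4GB + 63].
	 */
	uint32_t diff = (uint32_t)(dst - src);

	return diff <= 63;
}

int main(void)
{
	printf("%d\n", fsrm_short_distance(0x1020, 0x1000));         /* 1: dst 32 bytes ahead, slow case */
	printf("%d\n", fsrm_short_distance(0x100001020ULL, 0x1000)); /* 1: dst 4GB + 32 ahead, also slow */
	printf("%d\n", fsrm_short_distance(0x1040, 0x1000));         /* 0: dst 64 ahead, 'rep movsb' is fine */
	return 0;
}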


2021-11-01 19:05:36

by Noah Goldstein

Subject: [PATCH v2] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S

Add check for "short distance movsb" for forwards FSRM usage and
entirely remove backwards 'rep movsq'. Both of these usages hit "slow
modes" that are an order of magnitude slower than usual.

'rep movsb' has some noticeable VERY slow modes that the current
implementation is either 1) not checking for or 2) intentionally
using.

All times are in cycles and measuring the throughput of copying 1024
bytes.

1. For FSRM, when 'dst - src' is in (1, 63] or (4GB, 4GB + 63] it is
an order of magnitude slower than normal and much slower than a 4x
'movq' loop.

FSRM forward (dst - src == 32) -> 1113.156
FSRM forward (dst - src == 64) -> 120.669

ERMS forward (dst - src == 32) -> 209.326
ERMS forward (dst - src == 64) -> 118.22

2. For both FSRM and ERMS backwards 'rep movsb' is always slow. Both
of the times below are with dst % 256 == src % 256 which mirrors
the usage of the previous implementation.

FSRM backward -> 1196.039
ERMS backward -> 1191.873

As a reference, this is how a 4x 'movq' loop performs:

4x Forward (dst - src == 32) -> 128.273
4x Backward -> 130.183

Signed-off-by: Noah Goldstein <[email protected]>
---
Mistake in v1: I had forgotten to remove the logic jumping to the backwards
'rep movsq'.
arch/x86/lib/memmove_64.S | 34 +++++++++++-----------------------
1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/arch/x86/lib/memmove_64.S b/arch/x86/lib/memmove_64.S
index 64801010d312..90eb2487fde2 100644
--- a/arch/x86/lib/memmove_64.S
+++ b/arch/x86/lib/memmove_64.S
@@ -39,7 +39,16 @@ SYM_FUNC_START(__memmove)

/* FSRM implies ERMS => no length checks, do the copy directly */
.Lmemmove_begin_forward:
- ALTERNATIVE "cmp $0x20, %rdx; jb 1f", "", X86_FEATURE_FSRM
+ /*
+ * Don't use FSRM 'rep movsb' if 'dst - src' in (0, 63] or (4GB, 4GB +
+ * 63]. It hits a slow case which is an order of magnitude slower.
+ */
+ ALTERNATIVE "cmp $0x20, %rdx;"
+ "jb 1f;"
+ "mov %edi, %ecx;"
+ "sub %esi, %ecx;"
+ "cmp $63, %ecx;"
+ "jb 3f;", "", X86_FEATURE_FSRM
ALTERNATIVE "", "movq %rdx, %rcx; rep movsb; retq", X86_FEATURE_ERMS

/*
@@ -89,23 +98,6 @@ SYM_FUNC_START(__memmove)
jmp 13f
.Lmemmove_end_forward:

- /*
- * Handle data backward by movsq.
- */
- .p2align 4
-7:
- movq %rdx, %rcx
- movq (%rsi), %r11
- movq %rdi, %r10
- leaq -8(%rsi, %rdx), %rsi
- leaq -8(%rdi, %rdx), %rdi
- shrq $3, %rcx
- std
- rep movsq
- cld
- movq %r11, (%r10)
- jmp 13f
-
/*
* Start to prepare for backward copy.
*/
@@ -113,11 +105,7 @@ SYM_FUNC_START(__memmove)
2:
cmp $0x20, %rdx
jb 1f
- cmp $680, %rdx
- jb 6f
- cmp %dil, %sil
- je 7b
-6:
+
/*
* Calculate copy position to tail.
*/
--
2.25.1
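
For readers diffing v1 against v2: the extra hunk here removes the heuristic
that used to select the (now deleted) backwards 'rep movsq' path. In rough C
terms the deleted condition was the following (an illustrative paraphrase of
the removed lines, not kernel code):

#include <stdint.h>
#include <stddef.h>

/*
 * Paraphrase of the deleted check:
 *   cmp $680, %rdx; jb 6f   -> only consider it for copies of >= 680 bytes
 *   cmp %dil, %sil; je 7b   -> only take it when the low 8 bits of dst/src match
 * i.e. backwards 'rep movsq' was used only for large copies with
 * dst % 256 == src % 256, which is the case benchmarked in the commit message.
 */
int old_backward_movsq_taken(uint64_t dst, uint64_t src, size_t len)
{
	return len >= 680 && (dst & 0xff) == (src & 0xff);
}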

2021-11-02 23:18:23

by Noah Goldstein

Subject: [PATCH v3] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S

Add check for "short distance movsb" for forwards FSRM usage and
entirely remove backwards 'rep movsq'. Both of these usages hit "slow
modes" that are an order of magnitude slower than usual.

'rep movsb' has some noticeable VERY slow modes that the current
implementation is either 1) not checking for or 2) intentionally
using.

All times are in cycles and measuring the throughput of copying 1024
bytes.

1. For FSRM, when 'dst - src' is in (1, 63] or (4GB, 4GB + 63] it is
an order of magnitude slower than normal and much slower than a 4x
'movq' loop.

FSRM forward (dst - src == 32) -> 1113.156
FSRM forward (dst - src == 64) -> 120.669

ERMS forward (dst - src == 32) -> 209.326
ERMS forward (dst - src == 64) -> 118.22

2. For both FSRM and ERMS backwards 'rep movsb' is always slow. Both
of the times below are with dst % 256 == src % 256 which mirrors
the usage of the previous implementation.

FSRM backward -> 1196.039
ERMS backward -> 1191.873

As a reference, this is how a 4x 'movq' loop performs:

4x Forward (dst - src == 32) -> 128.273
4x Backward -> 130.183

Signed-off-by: Noah Goldstein <[email protected]>
---
It appears the assembly included in the ALTERNATIVE needs to all be on one line.
arch/x86/lib/memmove_64.S | 29 ++++++-----------------------
1 file changed, 6 insertions(+), 23 deletions(-)

diff --git a/arch/x86/lib/memmove_64.S b/arch/x86/lib/memmove_64.S
index 64801010d312..8fb16a7d0ea2 100644
--- a/arch/x86/lib/memmove_64.S
+++ b/arch/x86/lib/memmove_64.S
@@ -39,7 +39,11 @@ SYM_FUNC_START(__memmove)

/* FSRM implies ERMS => no length checks, do the copy directly */
.Lmemmove_begin_forward:
- ALTERNATIVE "cmp $0x20, %rdx; jb 1f", "", X86_FEATURE_FSRM
+ /*
+ * Don't use FSRM 'rep movsb' if 'dst - src' in (0, 63] or (4GB, 4GB +
+ * 63]. It hits a slow case which is an order of magnitude slower.
+ */
+ ALTERNATIVE "cmp $0x20, %rdx; jb 1f; mov %edi, %ecx; sub %esi, %ecx; cmp $63, %ecx; jbe 3f;", "", X86_FEATURE_FSRM
ALTERNATIVE "", "movq %rdx, %rcx; rep movsb; retq", X86_FEATURE_ERMS

/*
@@ -89,23 +93,6 @@ SYM_FUNC_START(__memmove)
jmp 13f
.Lmemmove_end_forward:

- /*
- * Handle data backward by movsq.
- */
- .p2align 4
-7:
- movq %rdx, %rcx
- movq (%rsi), %r11
- movq %rdi, %r10
- leaq -8(%rsi, %rdx), %rsi
- leaq -8(%rdi, %rdx), %rdi
- shrq $3, %rcx
- std
- rep movsq
- cld
- movq %r11, (%r10)
- jmp 13f
-
/*
* Start to prepare for backward copy.
*/
@@ -113,11 +100,7 @@ SYM_FUNC_START(__memmove)
2:
cmp $0x20, %rdx
jb 1f
- cmp $680, %rdx
- jb 6f
- cmp %dil, %sil
- je 7b
-6:
+
/*
* Calculate copy position to tail.
*/
--
2.25.1

2021-11-17 21:02:59

by Noah Goldstein

Subject: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S

Add check for "short distance movsb" for forwards FSRM usage and
entirely remove backwards 'rep movsq'. Both of these usages hit "slow
modes" that are an order of magnitude slower than usual.

'rep movsb' has some noticeable VERY slow modes that the current
implementation is either 1) not checking for or 2) intentionally
using.

All times are in cycles and measuring the throughput of copying 1024
bytes.

1. For FSRM, when 'dst - src' is in (1, 63] or (4GB, 4GB + 63] it is
an order of magnitude slower than normal and much slower than a 4x
'movq' loop.

FSRM forward (dst - src == 32) -> 1113.156
FSRM forward (dst - src == 64) -> 120.669

ERMS forward (dst - src == 32) -> 209.326
ERMS forward (dst - src == 64) -> 118.22

2. For both FSRM and ERMS backwards 'rep movsb' is always slow. Both
of the times below are with dst % 256 == src % 256 which mirrors
the usage of the previous implementation.

FSRM backward -> 1196.039
ERMS backward -> 1191.873

As a reference, this is how a 4x 'movq' loop performs:

4x Forward (dst - src == 32) -> 128.273
4x Backward -> 130.183

Signed-off-by: Noah Goldstein <[email protected]>
---
arch/x86/lib/memmove_64.S | 38 +++++++++++++-------------------------
1 file changed, 13 insertions(+), 25 deletions(-)

diff --git a/arch/x86/lib/memmove_64.S b/arch/x86/lib/memmove_64.S
index 64801010d312..910b963388b1 100644
--- a/arch/x86/lib/memmove_64.S
+++ b/arch/x86/lib/memmove_64.S
@@ -28,6 +28,8 @@ SYM_FUNC_START_WEAK(memmove)
SYM_FUNC_START(__memmove)

mov %rdi, %rax
+ cmp $0x20, %rdx
+ jb 1f

/* Decide forward/backward copy mode */
cmp %rdi, %rsi
@@ -39,7 +41,17 @@ SYM_FUNC_START(__memmove)

/* FSRM implies ERMS => no length checks, do the copy directly */
.Lmemmove_begin_forward:
- ALTERNATIVE "cmp $0x20, %rdx; jb 1f", "", X86_FEATURE_FSRM
+ /*
+ * Don't use FSRM 'rep movsb' if 'dst - src' in (0, 63] or (4GB, 4GB +
+ * 63]. It hits a slow case which is an order of magnitude slower.
+ */
+ ALTERNATIVE " \
+ mov %edi, %ecx; \
+ sub %esi, %ecx; \
+ cmp $63, %ecx; \
+ jbe 3f \
+ ", "", X86_FEATURE_FSRM
+
ALTERNATIVE "", "movq %rdx, %rcx; rep movsb; retq", X86_FEATURE_ERMS

/*
@@ -89,35 +101,11 @@ SYM_FUNC_START(__memmove)
jmp 13f
.Lmemmove_end_forward:

- /*
- * Handle data backward by movsq.
- */
- .p2align 4
-7:
- movq %rdx, %rcx
- movq (%rsi), %r11
- movq %rdi, %r10
- leaq -8(%rsi, %rdx), %rsi
- leaq -8(%rdi, %rdx), %rdi
- shrq $3, %rcx
- std
- rep movsq
- cld
- movq %r11, (%r10)
- jmp 13f
-
/*
* Start to prepare for backward copy.
*/
.p2align 4
2:
- cmp $0x20, %rdx
- jb 1f
- cmp $680, %rdx
- jb 6f
- cmp %dil, %sil
- je 7b
-6:
/*
* Calculate copy position to tail.
*/
--
2.25.1


2021-11-17 22:31:39

by David Laight

Subject: RE: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S

From: Noah Goldstein
> Sent: 17 November 2021 21:03
>
> Add check for "short distance movsb" for forwards FSRM usage and
> entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> modes" that are an order of magnitude slower than usual.
>
> 'rep movsb' has some noticeable VERY slow modes that the current
> implementation is either 1) not checking for or 2) intentionally
> using.

How does this relate to the decision that glibc made a few years
ago to use backwards 'rep movs' for non-overlapping copies?

Did they find a different corner case??

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2021-11-17 22:45:26

by Noah Goldstein

Subject: Re: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S

On Wed, Nov 17, 2021 at 4:31 PM David Laight <[email protected]> wrote:
>
> From: Noah Goldstein
> > Sent: 17 November 2021 21:03
> >
> > Add check for "short distance movsb" for forwards FSRM usage and
> > entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> > modes" that are an order of magnitude slower than usual.
> >
> > 'rep movsb' has some noticeable VERY slow modes that the current
> > implementation is either 1) not checking for or 2) intentionally
> > using.
>
> How does this relate to the decision that glibc made a few years
> ago to use backwards 'rep movs' for non-overlapping copies?

GLIBC doesn't use backwards `rep movs`. Since the regions are
non-overlapping it just uses forward copy. Backwards `rep movs` is
from setting the direction flag (`std`) and is a very slow byte
copy. For overlapping regions where backwards copy is necessary GLIBC
uses 4x vec copy loop.

>
>
> Did they find a different corner case??
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>
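
For context on the "4x vec copy loop" mentioned above: when dst overlaps src
from the right, the copy has to run from the tail toward the head, but that is
done with ordinary descending loads and stores rather than 'std; rep movs'. A
simplified C sketch of a backward block copy in that spirit (8-byte chunks
instead of vector registers; illustrative only, not GLIBC's actual code):

#include <stdint.h>
#include <string.h>

/*
 * Backward copy for the overlapping case (src < dst < src + len):
 * walk from the tail so every byte is read before it is overwritten.
 */
void copy_backward(void *dst, const void *src, size_t len)
{
	char *d = dst;
	const char *s = src;

	/* Whole 8-byte chunks, highest addresses first. */
	while (len >= 8) {
		uint64_t v;

		len -= 8;
		memcpy(&v, s + len, 8);
		memcpy(d + len, &v, 8);
	}
	/* Remaining tail bytes, still back to front. */
	while (len--)
		d[len] = s[len];
}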

2021-11-19 22:31:17

by David Laight

Subject: RE: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S

From: Noah Goldstein
> Sent: 17 November 2021 22:45
>
> On Wed, Nov 17, 2021 at 4:31 PM David Laight <[email protected]> wrote:
> >
> > From: Noah Goldstein
> > > Sent: 17 November 2021 21:03
> > >
> > > Add check for "short distance movsb" for forwards FSRM usage and
> > > entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> > > modes" that are an order of magnitude slower than usual.
> > >
> > > 'rep movsb' has some noticeable VERY slow modes that the current
> > > implementation is either 1) not checking for or 2) intentionally
> > > using.
> >
> > How does this relate to the decision that glibc made a few years
> > ago to use backwards 'rep movs' for non-overlapping copies?
>
> GLIBC doesn't use backwards `rep movs`. Since the regions are
> non-overlapping it just uses forward copy. Backwards `rep movs` is
> from setting the direction flag (`std`) and is a very slow byte
> copy. For overlapping regions where backwards copy is necessary GLIBC
> uses 4x vec copy loop.

Try to find this commit 6fb8cbcb58a29fff73eb2101b34caa19a7f88eba

Or follow links from https://www.win.tue.nl/~aeb/linux/misc/gcc-semibug.html
But I can't find the actual patch.

The claims were a massive performance increase for the reverse copy.

The pdf from http://www.agner.org/optimize may well indicate why some
copies are unexpectedly slow due to cache access aliasing.

I'm pretty sure that Intel cpu (possibly from Ivy bridge onwards)
can be persuaded to copy 8 bytes/clock for in-cache data with
a fairly simple loop that contains 2 reads (maybe misaligned)
and two writes (so 16 bytes per iteration).
Extra unrolling just adds extra code top and bottom.

You might want a loop like:
1: mov 0(%rsi, %rcx),%rax
mov 8(%rsi, %rcx),%rdx
mov %rax, 0(%rdi, %rcx)
mov %rdx, 8(%rdi, %rcx)
add $16, %rcx
jnz 1b

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
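
One note on the loop sketched in the mail above: as written it presumably
assumes %rsi/%rdi have been biased to the end of the buffers and %rcx holds the
negative byte count, so the indexed addresses walk forward while
'add $16, %rcx; jnz 1b' terminates at zero. The same idea in C, for a forward
(non-overlapping) copy, assuming the length is a positive multiple of 16
(illustrative only):

#include <stdint.h>
#include <string.h>

/* 16 bytes per iteration: two loads, then two stores, negative index. */
void copy16_negidx(void *dst, const void *src, size_t len)
{
	/* Bias the pointers to the end and count the index up to zero. */
	char *d_end = (char *)dst + len;
	const char *s_end = (const char *)src + len;
	intptr_t i = -(intptr_t)len;	/* assumes len > 0 and len % 16 == 0 */

	do {
		uint64_t a, b;

		memcpy(&a, s_end + i, 8);	/* mov 0(%rsi, %rcx), %rax */
		memcpy(&b, s_end + i + 8, 8);	/* mov 8(%rsi, %rcx), %rdx */
		memcpy(d_end + i, &a, 8);	/* mov %rax, 0(%rdi, %rcx) */
		memcpy(d_end + i + 8, &b, 8);	/* mov %rdx, 8(%rdi, %rcx) */
		i += 16;
	} while (i != 0);
}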

2021-11-20 00:06:11

by Noah Goldstein

Subject: Re: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S

On Fri, Nov 19, 2021 at 4:31 PM David Laight <[email protected]> wrote:
>
> From: Noah Goldstein
> > Sent: 17 November 2021 22:45
> >
> > On Wed, Nov 17, 2021 at 4:31 PM David Laight <[email protected]> wrote:
> > >
> > > From: Noah Goldstein
> > > > Sent: 17 November 2021 21:03
> > > >
> > > > Add check for "short distance movsb" for forwards FSRM usage and
> > > > entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> > > > modes" that are an order of magnitude slower than usual.
> > > >
> > > > 'rep movsb' has some noticeable VERY slow modes that the current
> > > > implementation is either 1) not checking for or 2) intentionally
> > > > using.
> > >
> > > How does this relate to the decision that glibc made a few years
> > > ago to use backwards 'rep movs' for non-overlapping copies?
> >
> > GLIBC doesn't use backwards `rep movs`. Since the regions are
> > non-overlapping it just uses forward copy. Backwards `rep movs` is
> > from setting the direction flag (`std`) and is a very slow byte
> > copy. For overlapping regions where backwards copy is necessary GLIBC
> > uses 4x vec copy loop.
>
> Try to find this commit 6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
>
> Or follow links from https://www.win.tue.nl/~aeb/linux/misc/gcc-semibug.html
> But I can't find the actual patch.
>
> The claims were a massive performance increase for the reverse copy.
>

I don't think that's referring to optimizations around `rep movs`. It
appears to be referring to fallout from this patch:
https://sourceware.org/git/?p=glibc.git;a=commit;h=6fb8cbcb58a29fff73eb2101b34caa19a7f88eba

which broke programs misusing `memcpy` with overlapping regions,
resulting in this fix:
https://sourceware.org/git/?p=glibc.git;a=commit;h=0354e355014b7bfda32622e0255399d859862fcd

AFAICT support for ERMS was only added around:
https://sourceware.org/git/?p=glibc.git;a=commit;h=13efa86ece61bf84daca50cab30db1b0902fe2db

Either way, GLIBC's memcpy/memmove at the moment most certainly does not
use backwards `rep movs`:
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S;hb=HEAD#l655

as it is very slow.

> The pdf from http://www.agner.org/optimize may well indicate why some
> copies are unexpectedly slow due to cache access aliasing.

Even in the `4k` aliasing case `rep movsb` seems to stay within a
factor of 2 of optimal whereas the `std` backwards `rep movs` loses
by a factor of 10.

Either way, `4k` aliasing detection is mostly a concern of `memcpy` as
the direction of copy for `memmove` is a correctness question, not
an optimization.


>
> I'm pretty sure that Intel cpu (possibly from Ivy bridge onwards)
> can be persuaded to copy 8 bytes/clock for in-cache data with
> a fairly simple loop that contains 2 reads (maybe misaligned)
> and two writes (so 16 bytes per iteration).
> Extra unrolling just adds extra code top and bottom.
>
> You might want a loop like:
> 1: mov 0(%rsi, %rcx),%rax
> mov 8(%rsi, %rcx),%rdx
> mov %rax, 0(%rdi, %rcx)
> mov %rdx, 8(%rdi, %rcx)
> add $16, %rcx
> jnz 1b
>
> David

The backwards loop already has 4x unrolled `movq` loop.

>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

2021-12-10 18:36:08

by Noah Goldstein

[permalink] [raw]
Subject: Re: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S

On Fri, Nov 19, 2021 at 6:05 PM Noah Goldstein <[email protected]> wrote:
>
> On Fri, Nov 19, 2021 at 4:31 PM David Laight <[email protected]> wrote:
> >
> > From: Noah Goldstein
> > > Sent: 17 November 2021 22:45
> > >
> > > On Wed, Nov 17, 2021 at 4:31 PM David Laight <[email protected]> wrote:
> > > >
> > > > From: Noah Goldstein
> > > > > Sent: 17 November 2021 21:03
> > > > >
> > > > > Add check for "short distance movsb" for forwards FSRM usage and
> > > > > entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> > > > > modes" that are an order of magnitude slower than usual.
> > > > >
> > > > > 'rep movsb' has some noticeable VERY slow modes that the current
> > > > > implementation is either 1) not checking for or 2) intentionally
> > > > > using.
> > > >
> > > > How does this relate to the decision that glibc made a few years
> > > > ago to use backwards 'rep movs' for non-overlapping copies?
> > >
> > > GLIBC doesn't use backwards `rep movs`. Since the regions are
> > > non-overlapping it just uses forward copy. Backwards `rep movs` is
> > > from setting the direction flag (`std`) and is a very slow byte
> > > copy. For overlapping regions where backwards copy is necessary GLIBC
> > > uses 4x vec copy loop.
> >
> > Try to find this commit 6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
> >
> > Or follow links from https://www.win.tue.nl/~aeb/linux/misc/gcc-semibug.html
> > But I can't find the actual patch.
> >
> > The claims were a massive performance increase for the reverse copy.
> >
>
> I don't think that's referring to optimizations around `rep movs`. It
> appears to be referring to fallout from this patch:
> https://sourceware.org/git/?p=glibc.git;a=commit;h=6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
>
> which broke programs misusing `memcpy` with overlapping regions,
> resulting in this fix:
> https://sourceware.org/git/?p=glibc.git;a=commit;h=0354e355014b7bfda32622e0255399d859862fcd
>
> AFAICT support for ERMS was only added around:
> https://sourceware.org/git/?p=glibc.git;a=commit;h=13efa86ece61bf84daca50cab30db1b0902fe2db
>
> Either way, GLIBC's memcpy/memmove at the moment most certainly does not
> use backwards `rep movs`:
> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S;hb=HEAD#l655
>
> as it is very slow.
>
> > The pdf from http://www.agner.org/optimize may well indicate why some
> > copies are unexpectedly slow due to cache access aliasing.
>
> Even in the `4k` aliasing case `rep movsb` seems to stay within a
> factor of 2 of optimal whereas the `std` backwards `rep movs` loses
> by a factor of 10.
>
> Either way, `4k` aliasing detection is mostly a concern of `memcpy` as
> the direction of copy for `memmove` is a correctness question, not
> an optimization.
>
>
> >
> > I'm pretty sure that Intel cpu (possibly from Ivy bridge onwards)
> > can be persuaded to copy 8 bytes/clock for in-cache data with
> > a fairly simple loop that contains 2 reads (maybe misaligned)
> > and two writes (so 16 bytes per iteration).
> > Extra unrolling just adds extra code top and bottom.
> >
> > You might want a loop like:
> > 1: mov 0(%rsi, %rcx),%rax
> > mov 8(%rsi, %rcx),%rdx
> > mov %rax, 0(%rdi, %rcx)
> > mov %rdx, 8(%rdi, %rcx)
> > add $16, %rcx
> > jnz 1b
> >
> > David
>
> The backwards loop already has 4x unrolled `movq` loop.
ping.
>
> >
> > -
> > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > Registration No: 1397386 (Wales)

2022-01-12 03:13:23

by Noah Goldstein

Subject: Re: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S

On Fri, Dec 10, 2021 at 12:35 PM Noah Goldstein <[email protected]> wrote:
>
> On Fri, Nov 19, 2021 at 6:05 PM Noah Goldstein <[email protected]> wrote:
> >
> > On Fri, Nov 19, 2021 at 4:31 PM David Laight <[email protected]> wrote:
> > >
> > > From: Noah Goldstein
> > > > Sent: 17 November 2021 22:45
> > > >
> > > > On Wed, Nov 17, 2021 at 4:31 PM David Laight <[email protected]> wrote:
> > > > >
> > > > > From: Noah Goldstein
> > > > > > Sent: 17 November 2021 21:03
> > > > > >
> > > > > > Add check for "short distance movsb" for forwards FSRM usage and
> > > > > > entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> > > > > > modes" that are an order of magnitude slower than usual.
> > > > > >
> > > > > > 'rep movsb' has some noticeable VERY slow modes that the current
> > > > > > implementation is either 1) not checking for or 2) intentionally
> > > > > > using.
> > > > >
> > > > > How does this relate to the decision that glibc made a few years
> > > > > ago to use backwards 'rep movs' for non-overlapping copies?
> > > >
> > > > GLIBC doesn't use backwards `rep movs`. Since the regions are
> > > > non-overlapping it just uses forward copy. Backwards `rep movs` is
> > > > from setting the direction flag (`std`) and is a very slow byte
> > > > copy. For overlapping regions where backwards copy is necessary GLIBC
> > > > uses 4x vec copy loop.
> > >
> > > Try to find this commit 6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
> > >
> > > Or follow links from https://www.win.tue.nl/~aeb/linux/misc/gcc-semibug.html
> > > But I can't find the actual patch.
> > >
> > > The claims were a massive performance increase for the reverse copy.
> > >
> >
> > I don't think that's referring to optimizations around `rep movs`. It
> > appears to be referring to fallout from this patch:
> > https://sourceware.org/git/?p=glibc.git;a=commit;h=6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
> >
> > which broke programs misusing `memcpy` with overlapping regions,
> > resulting in this fix:
> > https://sourceware.org/git/?p=glibc.git;a=commit;h=0354e355014b7bfda32622e0255399d859862fcd
> >
> > AFAICT support for ERMS was only added around:
> > https://sourceware.org/git/?p=glibc.git;a=commit;h=13efa86ece61bf84daca50cab30db1b0902fe2db
> >
> > Either way, GLIBC's memcpy/memmove at the moment most certainly does not
> > use backwards `rep movs`:
> > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S;hb=HEAD#l655
> >
> > as it is very slow.
> >
> > > The pdf from http://www.agner.org/optimize may well indicate why some
> > > copies are unexpectedly slow due to cache access aliasing.
> >
> > Even in the `4k` aliasing case `rep movsb` seems to stay within a
> > factor of 2 of optimal whereas the `std` backwards `rep movs` loses
> > by a factor of 10.
> >
> > Either way, `4k` aliasing detection is mostly a concern of `memcpy` as
> > the direction of copy for `memmove` is a correctness question, not
> > an optimization.
> >
> >
> > >
> > > I'm pretty sure that Intel cpu (possibly from Ivy bridge onwards)
> > > can be persuaded to copy 8 bytes/clock for in-cache data with
> > > a fairly simple loop that contains 2 reads (maybe misaligned)
> > > and two writes (so 16 bytes per iteration).
> > > Extra unrolling just adds extra code top and bottom.
> > >
> > > You might want a loop like:
> > > 1: mov 0(%rsi, %rcx),%rax
> > > mov 8(%rsi, %rcx),%rdx
> > > mov %rax, 0(%rdi, %rcx)
> > > mov %rdx, 8(%rdi, %rcx)
> > > add $16, %rcx
> > > jnz 1b
> > >
> > > David
> >
> > The backwards loop already has 4x unrolled `movq` loop.
> ping.
ping.
> >
> > >
> > > -
> > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > > Registration No: 1397386 (Wales)

2022-02-10 11:10:14

by Noah Goldstein

Subject: Re: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S

On Tue, Jan 11, 2022 at 9:13 PM Noah Goldstein <[email protected]> wrote:
>
> On Fri, Dec 10, 2021 at 12:35 PM Noah Goldstein <[email protected]> wrote:
> >
> > On Fri, Nov 19, 2021 at 6:05 PM Noah Goldstein <[email protected]> wrote:
> > >
> > > On Fri, Nov 19, 2021 at 4:31 PM David Laight <[email protected]> wrote:
> > > >
> > > > From: Noah Goldstein
> > > > > Sent: 17 November 2021 22:45
> > > > >
> > > > > On Wed, Nov 17, 2021 at 4:31 PM David Laight <[email protected]> wrote:
> > > > > >
> > > > > > From: Noah Goldstein
> > > > > > > Sent: 17 November 2021 21:03
> > > > > > >
> > > > > > > Add check for "short distance movsb" for forwards FSRM usage and
> > > > > > > entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> > > > > > > modes" that are an order of magnitude slower than usual.
> > > > > > >
> > > > > > > 'rep movsb' has some noticeable VERY slow modes that the current
> > > > > > > implementation is either 1) not checking for or 2) intentionally
> > > > > > > using.
> > > > > >
> > > > > > How does this relate to the decision that glibc made a few years
> > > > > > ago to use backwards 'rep movs' for non-overlapping copies?
> > > > >
> > > > > GLIBC doesn't use backwards `rep movs`. Since the regions are
> > > > > non-overlapping it just uses forward copy. Backwards `rep movs` is
> > > > > from setting the direction flag (`std`) and is a very slow byte
> > > > > copy. For overlapping regions where backwards copy is necessary GLIBC
> > > > > uses 4x vec copy loop.
> > > >
> > > > Try to find this commit 6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
> > > >
> > > > Or follow links from https://www.win.tue.nl/~aeb/linux/misc/gcc-semibug.html
> > > > But I can't find the actual patch.
> > > >
> > > > The claims were a massive performance increase for the reverse copy.
> > > >
> > >
> > > I don't think that's referring to optimizations around `rep movs`. It
> > > appears to be referring to fallout from this patch:
> > > https://sourceware.org/git/?p=glibc.git;a=commit;h=6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
> > >
> > > which broke programs misusing `memcpy` with overlapping regions,
> > > resulting in this fix:
> > > https://sourceware.org/git/?p=glibc.git;a=commit;h=0354e355014b7bfda32622e0255399d859862fcd
> > >
> > > AFAICT support for ERMS was only added around:
> > > https://sourceware.org/git/?p=glibc.git;a=commit;h=13efa86ece61bf84daca50cab30db1b0902fe2db
> > >
> > > Either way, GLIBC's memcpy/memmove at the moment most certainly does not
> > > use backwards `rep movs`:
> > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S;hb=HEAD#l655
> > >
> > > as it is very slow.
> > >
> > > > The pdf from http://www.agner.org/optimize may well indicate why some
> > > > copies are unexpectedly slow due to cache access aliasing.
> > >
> > > Even in the `4k` aliasing case `rep movsb` seems to stay within a
> > > factor of 2 of optimal whereas the `std` backwards `rep movs` loses
> > > by a factor of 10.
> > >
> > > Either way, `4k` aliasing detection is mostly a concern of `memcpy` as
> > > the direction of copy for `memmove` is a correctness question, not
> > > an optimization.
> > >
> > >
> > > >
> > > > I'm pretty sure that Intel cpu (possibly from Ivy bridge onwards)
> > > > can be persuaded to copy 8 bytes/clock for in-cache data with
> > > > a fairly simple loop that contains 2 reads (maybe misaligned)
> > > > and two writes (so 16 bytes per iteration).
> > > > Extra unrolling just adds extra code top and bottom.
> > > >
> > > > You might want a loop like:
> > > > 1: mov 0(%rsi, %rcx),%rax
> > > > mov 8(%rsi, %rcx),%rdx
> > > > mov %rax, 0(%rdi, %rcx)
> > > > mov %rdx, 8(%rdi, %rcx)
> > > > add $16, %rcx
> > > > jnz 1b
> > > >
> > > > David
> > >
> > > The backwards loop already has 4x unrolled `movq` loop.
> > ping.
> ping.
ping3.
> > >
> > > >
> > > > -
> > > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > > > Registration No: 1397386 (Wales)

2022-03-17 03:43:11

by Noah Goldstein

Subject: Re: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S

On Thu, Feb 10, 2022 at 3:08 AM Noah Goldstein <[email protected]> wrote:
>
> On Tue, Jan 11, 2022 at 9:13 PM Noah Goldstein <[email protected]> wrote:
> >
> > On Fri, Dec 10, 2021 at 12:35 PM Noah Goldstein <[email protected]> wrote:
> > >
> > > On Fri, Nov 19, 2021 at 6:05 PM Noah Goldstein <[email protected]> wrote:
> > > >
> > > > On Fri, Nov 19, 2021 at 4:31 PM David Laight <[email protected]> wrote:
> > > > >
> > > > > From: Noah Goldstein
> > > > > > Sent: 17 November 2021 22:45
> > > > > >
> > > > > > On Wed, Nov 17, 2021 at 4:31 PM David Laight <[email protected]> wrote:
> > > > > > >
> > > > > > > From: Noah Goldstein
> > > > > > > > Sent: 17 November 2021 21:03
> > > > > > > >
> > > > > > > > Add check for "short distance movsb" for forwards FSRM usage and
> > > > > > > > entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> > > > > > > > modes" that are an order of magnitude slower than usual.
> > > > > > > >
> > > > > > > > 'rep movsb' has some noticeable VERY slow modes that the current
> > > > > > > > implementation is either 1) not checking for or 2) intentionally
> > > > > > > > using.
> > > > > > >
> > > > > > > How does this relate to the decision that glibc made a few years
> > > > > > > ago to use backwards 'rep movs' for non-overlapping copies?
> > > > > >
> > > > > > GLIBC doesn't use backwards `rep movs`. Since the regions are
> > > > > > non-overlapping it just uses forward copy. Backwards `rep movs` is
> > > > > > from setting the direction flag (`std`) and is a very slow byte
> > > > > > copy. For overlapping regions where backwards copy is necessary GLIBC
> > > > > > uses 4x vec copy loop.
> > > > >
> > > > > Try to find this commit 6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
> > > > >
> > > > > Or follow links from https://www.win.tue.nl/~aeb/linux/misc/gcc-semibug.html
> > > > > But I can't find the actual patch.
> > > > >
> > > > > The claims were a massive performance increase for the reverse copy.
> > > > >
> > > >
> > > > I don't think that's referring to optimizations around `rep movs`. It
> > > > appears to be referring to fallout from this patch:
> > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
> > > >
> > > > which broke programs misusing `memcpy` with overlapping regions,
> > > > resulting in this fix:
> > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=0354e355014b7bfda32622e0255399d859862fcd
> > > >
> > > > AFAICT support for ERMS was only added around:
> > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=13efa86ece61bf84daca50cab30db1b0902fe2db
> > > >
> > > > Either way, GLIBC's memcpy/memmove at the moment most certainly does not
> > > > use backwards `rep movs`:
> > > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S;hb=HEAD#l655
> > > >
> > > > as it is very slow.
> > > >
> > > > > The pdf from http://www.agner.org/optimize may well indicate why some
> > > > > copies are unexpectedly slow due to cache access aliasing.
> > > >
> > > > Even in the `4k` aliasing case `rep movsb` seems to stay within a
> > > > factor of 2 of optimal whereas the `std` backwards `rep movs` loses
> > > > by a factor of 10.
> > > >
> > > > Either way, `4k` aliasing detection is mostly a concern of `memcpy` as
> > > > the direction of copy for `memmove` is a correctness question, not
> > > > an optimization.
> > > >
> > > >
> > > > >
> > > > > I'm pretty sure that Intel cpu (possibly from Ivy bridge onwards)
> > > > > can be persuaded to copy 8 bytes/clock for in-cache data with
> > > > > a fairly simple loop that contains 2 reads (maybe misaligned)
> > > > > and two writes (so 16 bytes per iteration).
> > > > > Extra unrolling just adds extra code top and bottom.
> > > > >
> > > > > You might want a loop like:
> > > > > 1: mov 0(%rsi, %rcx),%rax
> > > > > mov 8(%rsi, %rcx),%rdx
> > > > > mov %rax, 0(%rdi, %rcx)
> > > > > mov %rdx, 8(%rdi, %rcx)
> > > > > add $16, %rcx
> > > > > jnz 1b
> > > > >
> > > > > David
> > > >
> > > > The backwards loop already has 4x unrolled `movq` loop.
> > > ping.
> > ping.
> ping3.

Hi,

Anything I'm missing to get this looked at?

> > > >
> > > > >
> > > > > -
> > > > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > > > > Registration No: 1397386 (Wales)

2022-03-17 05:54:23

by Borislav Petkov

Subject: Re: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S

On Wed, Nov 17, 2021 at 03:02:45PM -0600, Noah Goldstein wrote:
> Add check for "short distance movsb" for forwards FSRM usage and
> entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> modes" that are an order of magnitude slower than usual.
>
> 'rep movsb' has some noticeable VERY slow modes that the current
> implementation is either 1) not checking for or 2) intentionally
> using.
>
> All times are in cycles and measuring the throughput of copying 1024
> bytes.

All these claims need to be proven:

- by real benchmarks (not a microbenchmark) where it shows that
modifications like that are not "in the noise". Others should be able
to verify those results too.

- on a bunch of CPUs from different vendors, to verify that they don't
cause performance regressions on any.

HTH.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette