2022-09-23 17:04:56

by Nick Desaulniers

Subject: [PATCH] x86, mem: move memmove to out of line assembler

When building ARCH=i386 with CONFIG_LTO_CLANG_FULL=y, it's possible
(depending on additional configs which I have not been able to isolate)
to observe a failure during register allocation:

error: inline assembly requires more registers than available

when memmove is inlined into tcp_v4_fill_cb() or tcp_v6_fill_cb().

memmove is quite large and probably shouldn't be inlined due to size
alone. A noinline function attribute would be the simplest fix, but
there are a few things that stand out with the current definition:

In addition to having complex constraints that can't always be resolved,
the clobber list seems to be missing %bx and %dx, and possibly %cl.
Because the operands are numbered rather than symbolic, the constraints
are quite obnoxious to refactor.
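
For illustration, the difference looks roughly like this (a toy snippet,
not the actual memmove constraints):

    int a = 1, b = 2, sum;

    /* Numbered operands: template and constraint list are coupled by position. */
    asm("addl %2, %0" : "=r" (sum) : "0" (a), "r" (b));

    /* Symbolic operands: the same add, but the names survive refactoring. */
    asm("addl %[addend], %[sum]"
        : [sum] "=r" (sum)
        : "[sum]" (a), [addend] "r" (b));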

A large function that is 99% inline asm is a code smell suggesting that
it should simply be written in stand-alone out-of-line assembler. That
gives the opportunity for other cleanups, like fixing the inconsistent
use of tabs vs. spaces and of instruction suffixes, and the label 3
appearing twice. Symbolic operands and local labels would give this
code a fresh coat of paint.

Moving this to out of line assembler guarantees that the
compiler cannot inline calls to memmove.

Signed-off-by: Nick Desaulniers <[email protected]>
---
An alternative, lower-risk approach:
Rather than do all of the above, we could simply provide a macro in the
vein of noinline_for_stack to document that we don't want LTO to inline
a particular function across translation unit boundaries, then apply
that to memmove. Up to the x86 maintainers' risk tolerances.
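
Something like the sketch below, where the name noinline_for_lto is made
up and the definition is the same trick noinline_for_stack already uses:

    /* Document that LTO must not inline this across translation units. */
    #define noinline_for_lto noinline

    __visible noinline_for_lto void *memmove(void *dest, const void *src, size_t n)
    {
            /* ... existing inline asm body ... */
    }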

Also, this leaves arch/x86/lib/memcpy_32.c pretty sparse/nearly empty.
Perhaps I should clean that up here, too? Perhaps in
arch/x86/include/asm/string_32.h? That would probably allow for inlining
calls to memcpy and memset.


arch/x86/lib/Makefile | 1 +
arch/x86/lib/memcpy_32.c | 187 -----------------------------------
arch/x86/lib/memmove_32.S | 202 ++++++++++++++++++++++++++++++++++++++
3 files changed, 203 insertions(+), 187 deletions(-)
create mode 100644 arch/x86/lib/memmove_32.S

diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index f76747862bd2..9a0b8ed782e2 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -60,6 +60,7 @@ ifeq ($(CONFIG_X86_32),y)
lib-y += checksum_32.o
lib-y += strstr_32.o
lib-y += string_32.o
+ lib-y += memmove_32.o
ifneq ($(CONFIG_X86_CMPXCHG64),y)
lib-y += cmpxchg8b_emu.o atomic64_386_32.o
endif
diff --git a/arch/x86/lib/memcpy_32.c b/arch/x86/lib/memcpy_32.c
index ef3af7ff2c8a..a29b64befb93 100644
--- a/arch/x86/lib/memcpy_32.c
+++ b/arch/x86/lib/memcpy_32.c
@@ -17,190 +17,3 @@ __visible void *memset(void *s, int c, size_t count)
return __memset(s, c, count);
}
EXPORT_SYMBOL(memset);
-
-__visible void *memmove(void *dest, const void *src, size_t n)
-{
- int d0,d1,d2,d3,d4,d5;
- char *ret = dest;
-
- __asm__ __volatile__(
- /* Handle more 16 bytes in loop */
- "cmp $0x10, %0\n\t"
- "jb 1f\n\t"
-
- /* Decide forward/backward copy mode */
- "cmp %2, %1\n\t"
- "jb 2f\n\t"
-
- /*
- * movs instruction have many startup latency
- * so we handle small size by general register.
- */
- "cmp $680, %0\n\t"
- "jb 3f\n\t"
- /*
- * movs instruction is only good for aligned case.
- */
- "mov %1, %3\n\t"
- "xor %2, %3\n\t"
- "and $0xff, %3\n\t"
- "jz 4f\n\t"
- "3:\n\t"
- "sub $0x10, %0\n\t"
-
- /*
- * We gobble 16 bytes forward in each loop.
- */
- "3:\n\t"
- "sub $0x10, %0\n\t"
- "mov 0*4(%1), %3\n\t"
- "mov 1*4(%1), %4\n\t"
- "mov %3, 0*4(%2)\n\t"
- "mov %4, 1*4(%2)\n\t"
- "mov 2*4(%1), %3\n\t"
- "mov 3*4(%1), %4\n\t"
- "mov %3, 2*4(%2)\n\t"
- "mov %4, 3*4(%2)\n\t"
- "lea 0x10(%1), %1\n\t"
- "lea 0x10(%2), %2\n\t"
- "jae 3b\n\t"
- "add $0x10, %0\n\t"
- "jmp 1f\n\t"
-
- /*
- * Handle data forward by movs.
- */
- ".p2align 4\n\t"
- "4:\n\t"
- "mov -4(%1, %0), %3\n\t"
- "lea -4(%2, %0), %4\n\t"
- "shr $2, %0\n\t"
- "rep movsl\n\t"
- "mov %3, (%4)\n\t"
- "jmp 11f\n\t"
- /*
- * Handle data backward by movs.
- */
- ".p2align 4\n\t"
- "6:\n\t"
- "mov (%1), %3\n\t"
- "mov %2, %4\n\t"
- "lea -4(%1, %0), %1\n\t"
- "lea -4(%2, %0), %2\n\t"
- "shr $2, %0\n\t"
- "std\n\t"
- "rep movsl\n\t"
- "mov %3,(%4)\n\t"
- "cld\n\t"
- "jmp 11f\n\t"
-
- /*
- * Start to prepare for backward copy.
- */
- ".p2align 4\n\t"
- "2:\n\t"
- "cmp $680, %0\n\t"
- "jb 5f\n\t"
- "mov %1, %3\n\t"
- "xor %2, %3\n\t"
- "and $0xff, %3\n\t"
- "jz 6b\n\t"
-
- /*
- * Calculate copy position to tail.
- */
- "5:\n\t"
- "add %0, %1\n\t"
- "add %0, %2\n\t"
- "sub $0x10, %0\n\t"
-
- /*
- * We gobble 16 bytes backward in each loop.
- */
- "7:\n\t"
- "sub $0x10, %0\n\t"
-
- "mov -1*4(%1), %3\n\t"
- "mov -2*4(%1), %4\n\t"
- "mov %3, -1*4(%2)\n\t"
- "mov %4, -2*4(%2)\n\t"
- "mov -3*4(%1), %3\n\t"
- "mov -4*4(%1), %4\n\t"
- "mov %3, -3*4(%2)\n\t"
- "mov %4, -4*4(%2)\n\t"
- "lea -0x10(%1), %1\n\t"
- "lea -0x10(%2), %2\n\t"
- "jae 7b\n\t"
- /*
- * Calculate copy position to head.
- */
- "add $0x10, %0\n\t"
- "sub %0, %1\n\t"
- "sub %0, %2\n\t"
-
- /*
- * Move data from 8 bytes to 15 bytes.
- */
- ".p2align 4\n\t"
- "1:\n\t"
- "cmp $8, %0\n\t"
- "jb 8f\n\t"
- "mov 0*4(%1), %3\n\t"
- "mov 1*4(%1), %4\n\t"
- "mov -2*4(%1, %0), %5\n\t"
- "mov -1*4(%1, %0), %1\n\t"
-
- "mov %3, 0*4(%2)\n\t"
- "mov %4, 1*4(%2)\n\t"
- "mov %5, -2*4(%2, %0)\n\t"
- "mov %1, -1*4(%2, %0)\n\t"
- "jmp 11f\n\t"
-
- /*
- * Move data from 4 bytes to 7 bytes.
- */
- ".p2align 4\n\t"
- "8:\n\t"
- "cmp $4, %0\n\t"
- "jb 9f\n\t"
- "mov 0*4(%1), %3\n\t"
- "mov -1*4(%1, %0), %4\n\t"
- "mov %3, 0*4(%2)\n\t"
- "mov %4, -1*4(%2, %0)\n\t"
- "jmp 11f\n\t"
-
- /*
- * Move data from 2 bytes to 3 bytes.
- */
- ".p2align 4\n\t"
- "9:\n\t"
- "cmp $2, %0\n\t"
- "jb 10f\n\t"
- "movw 0*2(%1), %%dx\n\t"
- "movw -1*2(%1, %0), %%bx\n\t"
- "movw %%dx, 0*2(%2)\n\t"
- "movw %%bx, -1*2(%2, %0)\n\t"
- "jmp 11f\n\t"
-
- /*
- * Move data for 1 byte.
- */
- ".p2align 4\n\t"
- "10:\n\t"
- "cmp $1, %0\n\t"
- "jb 11f\n\t"
- "movb (%1), %%cl\n\t"
- "movb %%cl, (%2)\n\t"
- ".p2align 4\n\t"
- "11:"
- : "=&c" (d0), "=&S" (d1), "=&D" (d2),
- "=r" (d3),"=r" (d4), "=r"(d5)
- :"0" (n),
- "1" (src),
- "2" (dest)
- :"memory");
-
- return ret;
-
-}
-EXPORT_SYMBOL(memmove);
diff --git a/arch/x86/lib/memmove_32.S b/arch/x86/lib/memmove_32.S
new file mode 100644
index 000000000000..096feb154f64
--- /dev/null
+++ b/arch/x86/lib/memmove_32.S
@@ -0,0 +1,202 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/linkage.h>
+
+SYM_FUNC_START(memmove)
+/*
+ * void *memmove(void *dest, const void *src, size_t n)
+ * -mregparm=3 passes these in registers:
+ */
+.set dest, %eax
+.set src, %edx
+.set n, %ecx
+
+/* Need 3 scratch registers. These need to be saved+restored. */
+.set tmp0, %edi
+.set tmp1, %ebx
+.set tmp2, %esi
+
+ pushl %ebp
+ movl %esp, %ebp
+
+ pushl dest
+ pushl tmp0
+ pushl tmp1
+ pushl tmp2
+
+ /* Handle more 16 bytes in loop */
+ cmpl $0x10, n
+ jb .L16_byteswap
+
+ /* Decide forward/backward copy mode */
+ cmpl dest, src
+ jb .Lbackwards_header
+
+ /*
+ * movs instruction have many startup latency
+ * so we handle small size by general register.
+ */
+ cmpl $680, n
+ jb .Ltoo_small_forwards
+ /*
+ * movs instruction is only good for aligned case.
+ */
+ movl src, tmp0
+ xorl dest, tmp0
+ andl $0xff, tmp0
+ jz .Lforward_movs
+.Ltoo_small_forwards:
+ subl $0x10, n
+
+ /*
+ * We gobble 16 bytes forward in each loop.
+ */
+.L16_byteswap_forwards_loop:
+ subl $0x10, n
+ movl 0*4(src), tmp0
+ movl 1*4(src), tmp1
+ movl tmp0, 0*4(dest)
+ movl tmp1, 1*4(dest)
+ movl 2*4(src), tmp0
+ movl 3*4(src), tmp1
+ movl tmp0, 2*4(dest)
+ movl tmp1, 3*4(dest)
+ leal 0x10(src), src
+ leal 0x10(dest), dest
+ jae .L16_byteswap_forwards_loop
+ addl $0x10, n
+ jmp .L16_byteswap
+
+ /*
+ * Handle data forward by movs.
+ */
+.p2align 4
+.Lforward_movs:
+ movl -4(src, n), tmp0
+ leal -4(dest, n), tmp1
+ shrl $2, n
+ rep movsl
+ movl tmp0, (tmp1)
+ jmp .Ldone
+ /*
+ * Handle data backward by movs.
+ */
+.p2align 4
+.Lbackwards_movs:
+ movl (src), tmp0
+ movl dest, tmp1
+ leal -4(src, n), src
+ leal -4(dest, n), dest
+ shrl $2, n
+ std
+ rep movsl
+ movl tmp0,(tmp1)
+ cld
+ jmp .Ldone
+
+ /*
+ * Start to prepare for backward copy.
+ */
+.p2align 4
+.Lbackwards_header:
+ cmpl $680, n
+ jb .Ltoo_small_backwards
+ movl src, tmp0
+ xorl dest, tmp0
+ andl $0xff, tmp0
+ jz .Lbackwards_movs
+
+ /*
+ * Calculate copy position to tail.
+ */
+.Ltoo_small_backwards:
+ addl n, src
+ addl n, dest
+ subl $0x10, n
+
+ /*
+ * We gobble 16 bytes backward in each loop.
+ */
+.L16_byteswap_backwards_loop:
+ subl $0x10, n
+
+ movl -1*4(src), tmp0
+ movl -2*4(src), tmp1
+ movl tmp0, -1*4(dest)
+ movl tmp1, -2*4(dest)
+ movl -3*4(src), tmp0
+ movl -4*4(src), tmp1
+ movl tmp0, -3*4(dest)
+ movl tmp1, -4*4(dest)
+ leal -0x10(src), src
+ leal -0x10(dest), dest
+ jae .L16_byteswap_backwards_loop
+ /*
+ * Calculate copy position to head.
+ */
+ addl $0x10, n
+ subl n, src
+ subl n, dest
+
+ /*
+ * Move data from 8 bytes to 15 bytes.
+ */
+.p2align 4
+.L16_byteswap:
+ cmpl $8, n
+ jb .L8_byteswap
+ movl 0*4(src), tmp0
+ movl 1*4(src), tmp1
+ movl -2*4(src, n), tmp2
+ movl -1*4(src, n), src
+
+ movl tmp0, 0*4(dest)
+ movl tmp1, 1*4(dest)
+ movl tmp2, -2*4(dest, n)
+ movl src, -1*4(dest, n)
+ jmp .Ldone
+
+ /*
+ * Move data from 4 bytes to 7 bytes.
+ */
+.p2align 4
+.L8_byteswap:
+ cmpl $4, n
+ jb .L4_byteswap
+ movl 0*4(src), tmp0
+ movl -1*4(src, n), tmp1
+ movl tmp0, 0*4(dest)
+ movl tmp1, -1*4(dest, n)
+ jmp .Ldone
+
+ /*
+ * Move data from 2 bytes to 3 bytes.
+ */
+.p2align 4
+.L4_byteswap:
+ cmpl $2, n
+ jb .Lbyteswap
+ movw 0*2(src), %di
+ movw -1*2(src, n), %bx
+ movw %di, 0*2(dest)
+ movw %bx, -1*2(dest, n)
+ jmp .Ldone
+
+ /*
+ * Move data for 1 byte.
+ */
+.p2align 4
+.Lbyteswap:
+ cmpl $1, n
+ jb .Ldone
+ movb (src), %cl
+ movb %cl, (dest)
+.p2align 4
+.Ldone:
+ popl tmp2
+ popl tmp1
+ popl tmp0
+ popl %eax
+ popl %ebp
+ RET
+SYM_FUNC_END(memmove)
--
2.37.3.998.g577e59143f-goog


2022-09-23 17:59:59

by Linus Torvalds

Subject: Re: [PATCH] x86, mem: move memmove to out of line assembler

On Fri, Sep 23, 2022 at 10:02 AM Nick Desaulniers
<[email protected]> wrote:
>
> memmove is quite large and probably shouldn't be inlined due to size
> alone. A noinline function attribute would be the simplest fix, but
> there are a few things that stand out with the current definition:

I don't think your patch is wrong, and it's not that different from
what the x86-64 code did long ago back in 2011 in commit 9599ec0471de
("x86-64, mem: Convert memmove() to assembly file and fix return value
bug")

But I wonder if we might be better off getting rid of that horrid
memmove thing entirely. The original thing seems to be from 2010,
where commit 3b4b682becdf ("x86, mem: Optimize memmove for small size
and unaligned cases") did this thing for both 32-bit and 64-bit code.

And it's really not likely used all that much - memcpy is the *much*
more important function, and that's the one that has actually been
updated for modern CPU's instead of looking like it's some copy from
glibc or whatever.

To make things even stranger, on the x86-64 side, we actually have
*two* copies of 'memmove()' - it looks like memcpy_orig() is already a
memmove, in that it does that

cmp %dil, %sil
jl .Lcopy_backward

thing that seems to mean that it already handles the overlapping case.
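
In C terms that check is just the usual memmove dispatch, roughly (a
sketch, with copy_forward()/copy_backward() as made-up stand-ins for the
real copy loops):

    void *memmove(void *dest, const void *src, size_t n)
    {
            if (src < dest)
                    copy_backward(dest, src, n);    /* tail first: overlap-safe */
            else
                    copy_forward(dest, src, n);
            return dest;
    }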

Anyway, that 32-bit memmove() implemented (badly) as inline asm does
need to go. As you point out, it seems to get the clobbers wrong too,
so it's actively buggy and just happens to work because there's
nothing else in that function, and it clobbers %bx that is supposed to
be callee-saved, but *presumably* the compiler has to allocate %bx to
one of the other register operands, so it will save and restore %bx anyway.

So that current memmove() seems to be truly horrendously buggy, but in
a "it happens to work" way. Your patch seems an improvement, but I get
the feeling that it could be improved a lot more...

Linus

2022-09-23 18:44:13

by Linus Torvalds

Subject: Re: [PATCH] x86, mem: move memmove to out of line assembler

On Fri, Sep 23, 2022 at 10:55 AM Nick Desaulniers
<[email protected]> wrote:
>
> We could remove __HAVE_ARCH_MEMMOVE from
> arch/x86/include/asm/string_32.h for ARCH=i386 then rip this
> arch-specific definition of memmove out.
>
> Might performance regressions be a concern with that approach?

memmove() isn't particularly common, but it does happen for some paths
that can be hot - the usual case of moving parts of an array around. I
see filesystems and networking paths doing that.

The generic memmove() is a horrendous byte-at-a-time thing and only
good for bring-up of new architectures. That's not an option.
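
From memory, the generic one is basically this byte loop (a rough sketch
of lib/string.c, not a verbatim copy):

    void *memmove(void *dest, const void *src, size_t count)
    {
            char *d = dest;
            const char *s = src;

            if (dest <= src) {
                    while (count--)
                            *d++ = *s++;    /* forwards, one byte at a time */
            } else {
                    d += count;
                    s += count;
                    while (count--)
                            *--d = *--s;    /* backwards, one byte at a time */
            }
            return dest;
    }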

But I'm looking at that x86-64 memcpy_orig, and I think it looks
fairly good as a template for doing the same on x86-32. And we could
get rid of the duplication on the x86-64 side.

That said, your patch looks fine too, as a "minimal changes" thing.

Linus

2022-09-23 18:52:45

by Nick Desaulniers

Subject: Re: [PATCH] x86, mem: move memmove to out of line assembler

Dropping Ma from Cc; the emails are bouncing.

On Fri, Sep 23, 2022 at 10:30 AM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, Sep 23, 2022 at 10:02 AM Nick Desaulniers
> <[email protected]> wrote:
> >
> > memmove is quite large and probably shouldn't be inlined due to size
> > alone. A noinline function attribute would be the simplest fix, but
> > there are a few things that stand out with the current definition:
>
> I don't think your patch is wrong, and it's not that different from
> what the x86-64 code did long ago back in 2011 in commit 9599ec0471de
> ("x86-64, mem: Convert memmove() to assembly file and fix return value
> bug")
>
> But I wonder if we might be better off getting rid of that horrid
> memmove thing entirely.

We could remove __HAVE_ARCH_MEMMOVE from
arch/x86/include/asm/string_32.h for ARCH=i386 then rip this
arch-specific definition of memmove out.
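
Roughly, that's these two lines in string_32.h (paraphrased from memory;
the exact prototype may differ):

    #define __HAVE_ARCH_MEMMOVE
    void *memmove(void *dest, const void *src, size_t n);

With those gone, i386 would fall back to the generic memmove() in
lib/string.c.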

Might performance regressions be a concern with that approach?

I'll write up a patch for that just to have on hand, and leave the
decision up to someone else.

> The original thing seems to be from 2010,
> where commit 3b4b682becdf ("x86, mem: Optimize memmove for small size
> and unaligned cases") did this thing for both 32-bit and 64-bit code.
>
> And it's really not likely used all that much - memcpy is the *much*
> more important function, and that's the one that has actually been
> updated for modern CPU's instead of looking like it's some copy from
> glibc or whatever.

I suspect that's where the duplicate label 3 comes from: likely some
macros were expanded in another codebase's implementation, and the
result was then copied into the kernel sources.

>
> To make things even stranger, on the x86-64 side, we actually have
> *two* copies of 'memmove()' - it looks like memcpy_orig() is already a
> memmove, in that it does that
>
> cmp %dil, %sil
> jl .Lcopy_backward
>
> thing that seems to mean that it already handles the overlapping case.
>
> Anyway, that 32-bit memmove() implemented (badly) as inline asm does
> need to go. As you point out, it seems to get the clobbers wrong too,
> so it's actively buggy and just happens to work because there's
> nothing else in that function, and it clobbers %bx that is supposed to
> be callee-saved, but *presumably* the compiler has to allocate %bx to
> one of the other register operands, so it will save and restore %bx anyway.
>
> So that current memmove() seems to be truly horrendously buggy, but in
> a "it happens to work" way. Your patch seems an improvement, but I get
> the feeling that it could be improved a lot more...
--
Thanks,
~Nick Desaulniers

2022-09-27 18:28:15

by Nick Desaulniers

Subject: Re: [PATCH] x86, mem: move memmove to out of line assembler

Oh, yeah, my patch essentially _is_
commit 9599ec0471de ("x86-64, mem: Convert memmove() to assembly file
and fix return value bug"),
but for 32b (and without the return value bug). I should probably add a
reference to it in the commit message for this patch.

Also, I'm missing an EXPORT_SYMBOL in my v1, so modules that reference
memmove will fail to build during modpost. v2 is required.
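
i.e. the .S file also needs something along these lines (assuming
asm/export.h is still the right header for EXPORT_SYMBOL in assembly):

    #include <asm/export.h>
    ...
    SYM_FUNC_END(memmove)
    EXPORT_SYMBOL(memmove)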

On Fri, Sep 23, 2022 at 11:06 AM Linus Torvalds
<[email protected]> wrote:
>
> But I'm looking at that x86-64 memcpy_orig, and I think it looks
> fairly good as a template for doing the same on x86-32. And we could
> get rid of the duplication on the x86-64 side.

Is the suggestion that 64b memcpy_orig could be replaced with __memmove?

Sorry, I'm not sure I follow either suggestion for code reuse opportunities.

Also, any idea which QEMU machine/CPU models lack ERMS, for testing
these non-ERMS implementations?

>
> That said, your patch looks fine too, as a "minimal changes" thing.
>
> Linus

--
Thanks,
~Nick Desaulniers