LinuxLists.cc - x86 copy performance regression

2023-05-26 15:18:15

Subject: x86 copy performance regression

Hi Linus

While testing unrelated patches using upstream net-next kernels,
I found a big regression in sendmsg()/recvmsg() caused by a series of yours.

Tested platforms :

Intel(R) Xeon(R) Gold 6268L CPU @ 2.80GHz

We can see rep_movs_alternative() using more cycles in kernel profiles
than the previous variant (copy_user_enhanced_fast_string, which was
simply using "rep movsb"), and we can not reach line rate (as we
could before the series)

Patch series:

commit a5624566431de76b17862383d9ae254d9606cba9
Merge: 487c20b016dc48230367a7be017f40313e53e3bd
034ff37d34071ff3f48755f728cd229e42a4f15d
Author: Linus Torvalds <[email protected]>
Date: Mon Apr 24 10:39:27 2023 -0700

Merge branch 'x86-rep-insns': x86 user copy clarifications

Merge my x86 user copy updates branch.

IMO this patch seems to think tcp sendmsg() is using small areas.
(tcp_sendmsg() usually copy 32KB at a time, if order-3 pages
allocations are possible)

commit adfcf4231b8cbc2d9c1e7bfaa965b907e60639eb
Author: Linus Torvalds <[email protected]>
Date: Sat Apr 15 13:14:59 2023 -0700

x86: don't use REP_GOOD or ERMS for user memory copies

The modern target to use is FSRM (Fast Short REP MOVS), and the other
cases should only be used for bigger areas (ie mainly things like page
clearing).

Signed-off-by: Linus Torvalds <[email protected]>

The issue is that (some of) our platforms do have ERMS but not FSRM

Test run on 6.3 (single TCP flow, sending 32 MB of payload to a
zerocopy receiver to make sure the receiver is not the bottleneck).
100Gbit link speed.

# perf stat taskset 02 tcp_mmap -H 2002:a05:6608:295::

Performance counter stats for 'taskset 02 ./tcp_mmap -H 2002:a05:6608:295::':

2,815.79 msec task-clock # 0.936
CPUs utilized
2,370 context-switches # 841.682
/sec
1 cpu-migrations # 0.355
/sec
127 page-faults # 45.103
/sec
10,106,383,352 cycles # 3.589
GHz
6,936,487,168 instructions # 0.69
insn per cycle
1,206,325,691 branches # 428.414
M/sec
10,327,112 branch-misses # 0.86% of
all branches

3.007810265 seconds time elapsed

0.005158000 seconds user
2.406125000 seconds sys

Same test from linux-6.4-rc1

# perf stat taskset 02 tcp_mmap -H 2002:a05:6608:295::

Performance counter stats for 'taskset 02 ./tcp_mmap -H 2002:a05:6608:295::':

4,039.73 msec task-clock # 1.000
CPUs utilized
12 context-switches # 2.970
/sec
1 cpu-migrations # 0.248
/sec
130 page-faults # 32.180
/sec
14,639,828,754 cycles # 3.624
GHz
19,443,379,653 instructions # 1.33
insn per cycle
1,931,003,961 branches # 478.003
M/sec
12,349,476 branch-misses # 0.64% of
all branches

4.040825111 seconds time elapsed

0.012496000 seconds user
3.560336000 seconds sys

Thanks.

2023-05-26 16:36:28

by Linus Torvalds

[permalink] [raw]

Subject: Re: x86 copy performance regression

On Fri, May 26, 2023 at 8:00 AM Eric Dumazet <[email protected]> wrote:
>
> We can see rep_movs_alternative() using more cycles in kernel profiles
> than the previous variant (copy_user_enhanced_fast_string, which was
> simply using "rep movsb"), and we can not reach line rate (as we
> could before the series)

Hmm. I assume the attached patch ends up fixing the regression?

That hack to generate the two-byte 'jae' instruction even for the
alternative is admittedly not pretty, but I just couldn't deal with
the alternative that generated pointlessly bad code.

We could make the constant in the comparison depend on whether it is
for the unrolled or for the erms case too, I guess, but I think erms
is probably "good enough" with 64-byte copies.

I was really hoping we could avoid this, but hey, a regression is a regression.

Can you verify this patch fixes things for you?

Linus

Attachments:

patch.diff (1.17 kB)

2023-05-26 16:56:06

by Eric Dumazet

[permalink] [raw]

Subject: Re: x86 copy performance regression

On Fri, May 26, 2023 at 6:30 PM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, May 26, 2023 at 8:00 AM Eric Dumazet <[email protected]> wrote:
> >
> > We can see rep_movs_alternative() using more cycles in kernel profiles
> > than the previous variant (copy_user_enhanced_fast_string, which was
> > simply using "rep movsb"), and we can not reach line rate (as we
> > could before the series)
>
> Hmm. I assume the attached patch ends up fixing the regression?
>
> That hack to generate the two-byte 'jae' instruction even for the
> alternative is admittedly not pretty, but I just couldn't deal with
> the alternative that generated pointlessly bad code.
>
> We could make the constant in the comparison depend on whether it is
> for the unrolled or for the erms case too, I guess, but I think erms
> is probably "good enough" with 64-byte copies.
>
> I was really hoping we could avoid this, but hey, a regression is a regression.
>
> Can you verify this patch fixes things for you?

Hmm.. my build environment does not like this yet :)

arch/x86/lib/copy_user_64.S:40:30: error: unexpected token in argument list
0: alternative ".byte 0x73," ".Lunrolled" "-0b-2", ".byte 0x73,"
".Llarge" "-0b-2", X86_FEATURE_ERMS
^
make[3]: *** [scripts/Makefile.build:374: arch/x86/lib/copy_user_64.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [scripts/Makefile.build:494: arch/x86/lib] Error 2
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [Makefile:2026: .] Error 2

2023-05-26 17:12:50

by Linus Torvalds

[permalink] [raw]

Subject: Re: x86 copy performance regression

On Fri, May 26, 2023 at 9:37 AM Eric Dumazet <[email protected]> wrote:
>
> Hmm.. my build environment does not like this yet :)

Ahh, I tested it in an allmodconfig build, but only building that one
file, and not trying to link it. And gas was apparent;y perfectly
happy just leaving that undefined feature as a relocation.

I assume that if you just add a

#include <asm/cpufeatures.h>

to the includes, it magically works for you?

Linus

2023-05-26 17:20:24

by Linus Torvalds

[permalink] [raw]

Subject: Re: x86 copy performance regression

On Fri, May 26, 2023 at 9:56 AM Linus Torvalds
<[email protected]> wrote:
>
> Ahh, I tested it in an allmodconfig build, but only building that one
> file, and not trying to link it. And gas was apparent;y perfectly
> happy just leaving that undefined feature as a relocation.

Oh, never mind. Even with that fixed, objtool is very unhappy about my
hack, because it knows about short jumps, and despite me encoding it
as a sequence of bytes, objtool will just decode it anyway and say
"that's not right".

Grr, I tried so hard to get the exact asm I wanted, but our
sanity-checking catches my cleverness and stops me in my tracks.

Let me go look at it some more. I *really* didn't want to make the
code worse for ERSM.

Linus

2023-05-26 17:30:54

by Linus Torvalds

[permalink] [raw]

Subject: Re: x86 copy performance regression

On Fri, May 26, 2023 at 10:00 AM Linus Torvalds
<[email protected]> wrote:
>
> Let me go look at it some more. I *really* didn't want to make the
> code worse for ERMS

Oh well. I'll think about it some more in the hope that I can come up
with something clever that doesn't make objtool hate me, but in the
meantime let me just give you the "not clever" patch.

It generates an annoying six-byte jump when the small 2-byte one would
work just fine, but I guess only my pride is wounded.

Linus

Attachments:

patch.diff (877.00 B)

2023-05-26 17:39:59

by Eric Dumazet

[permalink] [raw]

Subject: Re: x86 copy performance regression

On Fri, May 26, 2023 at 7:17 PM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, May 26, 2023 at 10:00 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > Let me go look at it some more. I *really* didn't want to make the
> > code worse for ERMS
>
> Oh well. I'll think about it some more in the hope that I can come up
> with something clever that doesn't make objtool hate me, but in the
> meantime let me just give you the "not clever" patch.
>
> It generates an annoying six-byte jump when the small 2-byte one would
> work just fine, but I guess only my pride is wounded.

arch/x86/lib/copy_user_64.S:34:2: error: invalid instruction mnemonic
'alternative'
alternative "jae .Lunrolled", "jae .Llarge", ( 9*32+ 9)
^~~~~~~~~~~

I changed alternative to ALTERNATIVE to let it build.

SYM_FUNC_START(rep_movs_alternative)
cmpq $64,%rcx
- jae .Lunrolled
+ ALTERNATIVE "jae .Lunrolled", "jae .Llarge", X86_FEATURE_ERMS

I will report test result soon, thanks !

2023-05-26 17:54:18

by Linus Torvalds

[permalink] [raw]

Subject: Re: x86 copy performance regression

On Fri, May 26, 2023 at 10:25 AM Eric Dumazet <[email protected]> wrote:
>
> arch/x86/lib/copy_user_64.S:34:2: error: invalid instruction mnemonic
> 'alternative'

Ok, that's just odd. For me, assembler mnemonics - very much including
macros - are case-insensitive.

It's actually documented that way, with the example given is for a
macro that is declared as "sum" and then used as "SUM":

https://sourceware.org/binutils/docs/as/Macro.html

And if you want to use macros as pseudo-instructions, that's what you
want, since typically assembler instructions are not case sensitive.

But yeah, your build environment is clearly different, and yes, we
declare the macro with all caps, and other places use it that way too.

Clang?

Linus

2023-05-26 18:08:17

by Eric Dumazet

[permalink] [raw]

Subject: Re: x86 copy performance regression

On Fri, May 26, 2023 at 7:40 PM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, May 26, 2023 at 10:25 AM Eric Dumazet <[email protected]> wrote:
> >
> > arch/x86/lib/copy_user_64.S:34:2: error: invalid instruction mnemonic
> > 'alternative'
>
> Ok, that's just odd. For me, assembler mnemonics - very much including
> macros - are case-insensitive.
>
> It's actually documented that way, with the example given is for a
> macro that is declared as "sum" and then used as "SUM":
>
> https://sourceware.org/binutils/docs/as/Macro.html
>
> And if you want to use macros as pseudo-instructions, that's what you
> want, since typically assembler instructions are not case sensitive.
>
> But yeah, your build environment is clearly different, and yes, we
> declare the macro with all caps, and other places use it that way too.
>
> Clang?

Yes, we use clang here ...

Hmmm

[ 25.532236] RIP: 0010:0xffffffffa5a85134
[ 25.536173] Code: Unable to access opcode bytes at 0xffffffffa5a8510a.
[ 25.542720] RSP: 0000:ffff92f08159bcd8 EFLAGS: 00050206
[ 25.547960] RAX: 00007ffc3b16c318 RBX: 0000000000000000 RCX: 0000000000000170
[ 25.555118] RDX: 0000000000000170 RSI: ffff92f0944d4c28 RDI: 00007ffc3b16c1a8
[ 25.562275] RBP: ffff92f08159bce0 R08: fefefefefefefeff R09: 000000000000002c
[ 25.569432] R10: 000000000000002c R11: ffff92f0944d5bb0 R12: 00007ffc3b16cff2
[ 25.576588] R13: 00007ffc3b16c1a8 R14: 0000000000000001 R15: ffff92f0944d4ac0
[ 25.583746] FS: 0000000000000000(0000) GS:ffff934e404c0000(0000)
knlGS:0000000000000000
[ 25.591862] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 25.597624] CR2: ffffffffa5a8510a CR3: 000000010e33c003 CR4: 00000000003706e0
[ 25.604780] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 25.611936] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 25.619092] Call Trace:
[ 25.621545] <TASK>
[ 25.623648] ? _copy_to_user+0x20/0x30
[ 25.627409] create_elf_tables+0x528/0x5a0
[ 25.631520] load_elf_binary+0x9e7/0xce0
[ 25.635453] bprm_execve+0x2bf/0x5c0
[ 25.639040] kernel_execve+0x2ad/0x2d0
[ 25.642799] run_init_process+0xa9/0xb0
[ 25.646648] ? rest_init+0xc0/0xc0
[ 25.650059] kernel_init+0x82/0x1a0
[ 25.653558] ret_from
_fork+0x1f/0x30
[ 25.657145] </TASK>

2023-05-26 18:42:42

by Linus Torvalds

[permalink] [raw]

Subject: Re: x86 copy performance regression

On Fri, May 26, 2023 at 10:51 AM Eric Dumazet <[email protected]> wrote:
>
> Hmmm
>
> [ 25.532236] RIP: 0010:0xffffffffa5a85134
> [ 25.536173] Code: Unable to access opcode bytes at 0xffffffffa5a8510a.

This was the other reason I really didn't want to use alternatives on
the conditional branch instructions. The relocations are really not
very natural, and we have odd rules for those things. So I suspect our
instruction rewriting simply gets this wrong, because that's such a
nasty pattern.

I really wanted my "just hardcode the instruction bytes" to work. Not
only did it get me the small 2-byte conditional jump, it meant that
there was no relocation on it. But objtool really hates not
understanding what the alternatives code does.

Which is fair enough, but it's frustrating here when it only results
in more problems.

Anyway, I guess *this* avoids all issues. It creates an extra jump to
a jump for the case where the CPU doesn't have ERMS, but I guess we
don't really care about those CPUs anyway.

And it avoids all the "alternative instructions have relocations"
issues. And it creates all small two-byte jumps, and the "rep movsb"
fits exactly on that same 2 bytes too. Which I guess all argues for
this being what I should have started with.

This time it *really* works.

Famous last words.

Linus

Attachments:

patch.diff (877.00 B)

2023-05-26 19:07:23

by Eric Dumazet

[permalink] [raw]

Subject: Re: x86 copy performance regression

On Fri, May 26, 2023 at 8:33 PM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, May 26, 2023 at 10:51 AM Eric Dumazet <[email protected]> wrote:
> >
> > Hmmm
> >
> > [ 25.532236] RIP: 0010:0xffffffffa5a85134
> > [ 25.536173] Code: Unable to access opcode bytes at 0xffffffffa5a8510a.
>
> This was the other reason I really didn't want to use alternatives on
> the conditional branch instructions. The relocations are really not
> very natural, and we have odd rules for those things. So I suspect our
> instruction rewriting simply gets this wrong, because that's such a
> nasty pattern.
>
> I really wanted my "just hardcode the instruction bytes" to work. Not
> only did it get me the small 2-byte conditional jump, it meant that
> there was no relocation on it. But objtool really hates not
> understanding what the alternatives code does.
>
> Which is fair enough, but it's frustrating here when it only results
> in more problems.
>
> Anyway, I guess *this* avoids all issues. It creates an extra jump to
> a jump for the case where the CPU doesn't have ERMS, but I guess we
> don't really care about those CPUs anyway.
>
> And it avoids all the "alternative instructions have relocations"
> issues. And it creates all small two-byte jumps, and the "rep movsb"
> fits exactly on that same 2 bytes too. Which I guess all argues for
> this being what I should have started with.
>
> This time it *really* works.
>

Indeed, this one is working and fixes the issue for me, thanks a lot !

New numbers look similar to 6.3 ones.

Tested-by: Eric Dumazet <[email protected]>

Performance counter stats for 'taskset 02 ./tcp_mmap -H 2002:a05:6608:297::':

2,833.29 msec task-clock # 0.970
CPUs utilized
1,065 context-switches # 375.888
/sec
1 cpu-migrations # 0.353
/sec
128 page-faults # 45.177
/sec
10,297,389,329 cycles # 3.634
GHz
7,213,189,594 instructions # 0.70
insn per cycle
1,220,821,121 branches # 430.884
M/sec
10,430,907 branch-misses # 0.85% of
all branches

2.921180547 seconds time elapsed

0.005304000 seconds user
2.478561000 seconds sys

2023-05-26 21:02:58

by Linus Torvalds

[permalink] [raw]

Subject: Re: x86 copy performance regression

On Fri, May 26, 2023 at 11:33 AM Linus Torvalds
<[email protected]> wrote:
>
> Anyway, I guess *this* avoids all issues. It creates an extra jump to
> a jump for the case where the CPU doesn't have ERMS, but I guess we
> don't really care about those CPUs anyway.

Well, I'm obviously wrong, because my very own CPU (AMD Zen 2) doesn't do ERMS.

But the extra 'jmp' doesn't seem to appreciably matter, so I guess I
don't care. It does show up in profiles, but only barely.

I've committed and pushed out the fix.

Linus