2018-11-21 06:57:49

by Ingo Molnar

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes


[ Cc:-ed a few other gents and lkml. ]

* Jens Axboe <[email protected]> wrote:

> Hi,
>
> So this is a fun one... While I was doing the aio polled work, I noticed
> that the submitting process spent a substantial amount of time copying
> data to/from userspace. For aio, that's iocb and io_event, which are 64
> and 32 bytes respectively. Looking closer at this, it seems that
> ERMS rep movsb is SLOWER for smaller copies, due to a higher startup
> cost.
>
> I came up with this hack to test it out, and lo and behold, we now cut
> the time spent in copying in half. 50% less.
>
> Since these kinds of patches tend to lend themselves to bike shedding, I
> also ran a string of kernel compilations out of RAM. Results are as
> follows:
>
> Patched : 62.86s avg, stddev 0.65s
> Stock : 63.73s avg, stddev 0.67s
>
> which would also seem to indicate that we're faster punting smaller
> (< 128 byte) copies.
>
> CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>
> Interestingly, text size is smaller with the patch as well?!
>
> I'm sure there are smarter ways to do this, but results look fairly
> conclusive. FWIW, the behavioral change was introduced by:
>
> commit 954e482bde20b0e208fd4d34ef26e10afd194600
> Author: Fenghua Yu <[email protected]>
> Date: Thu May 24 18:19:45 2012 -0700
>
> x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
>
> which contains nothing in terms of benchmarking or results, just claims
> that the new hotness is better.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index a9d637bc301d..7dbb78827e64 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -29,16 +29,27 @@ copy_user_generic(void *to, const void *from, unsigned len)
> {
> unsigned ret;
>
> + /*
> + * For smaller copies, don't use ERMS as it's slower.
> + */
> + if (len < 128) {
> + alternative_call(copy_user_generic_unrolled,
> + copy_user_generic_string, X86_FEATURE_REP_GOOD,
> + ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
> + "=d" (len)),
> + "1" (to), "2" (from), "3" (len)
> + : "memory", "rcx", "r8", "r9", "r10", "r11");
> + return ret;
> + }
> +
> /*
> * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
> * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
> * Otherwise, use copy_user_generic_unrolled.
> */
> alternative_call_2(copy_user_generic_unrolled,
> - copy_user_generic_string,
> - X86_FEATURE_REP_GOOD,
> - copy_user_enhanced_fast_string,
> - X86_FEATURE_ERMS,
> + copy_user_generic_string, X86_FEATURE_REP_GOOD,
> + copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
> ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
> "=d" (len)),
> "1" (to), "2" (from), "3" (len)

So I'm inclined to do something like yours, because clearly the changelog
of 954e482bde20 was at least partly false: Intel can say whatever they
want, it's a fact that ERMS has high setup costs for low buffer sizes -
ERMS is optimized for large size, cache-aligned copies mainly.

But the result is counter-intuitive in terms of kernel text footprint,
plus the '128' is pretty arbitrary - we should at least try to come up
with a break-even point where manual copy is about as fast as ERMS - on
at least a single CPU ...
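
( To illustrate the kind of probe I mean: a crude user-space sketch along
  the lines below, comparing 'rep movsb' against a simple word-at-a-time
  loop for a given size, would already give a first-order answer. This is
  an illustration only - the buffer sizes, iteration counts and names are
  made up, and it measures hot-cache behaviour on one CPU, nothing more: )

/* rough user-space probe for the rep-movsb break-even point */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <x86intrin.h>

static void rep_movsb(void *dst, const void *src, size_t len)
{
	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (len)
		     : : "memory");
}

/* crude stand-in for the kernel's unrolled copy */
static void word_copy(void *dst, const void *src, size_t len)
{
	uint64_t *d = dst;
	const uint64_t *s = src;

	for (; len >= 8; len -= 8)
		*d++ = *s++;
	if (len)
		memcpy(d, s, len);
}

int main(int argc, char **argv)
{
	static char src[4096], dst[4096];
	size_t len = argc > 1 ? strtoul(argv[1], NULL, 0) : 64;
	uint64_t t0, t1, t2;
	int i;

	if (len > sizeof(src))
		len = sizeof(src);

	t0 = __rdtsc();
	for (i = 0; i < 1000000; i++) {
		rep_movsb(dst, src, len);
		asm volatile("" : : "r" (dst) : "memory");
	}
	t1 = __rdtsc();
	for (i = 0; i < 1000000; i++) {
		word_copy(dst, src, len);
		asm volatile("" : : "r" (dst) : "memory");
	}
	t2 = __rdtsc();

	printf("len %4zu: rep movsb %5.1f cycles, loop %5.1f cycles\n",
	       len, (t1 - t0) / 1e6, (t2 - t1) / 1e6);
	return 0;
}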

Thanks,

Ingo


2018-11-21 13:37:17

by Jens Axboe

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On 11/20/18 11:36 PM, Ingo Molnar wrote:
>
> [ Cc:-ed a few other gents and lkml. ]
>
> * Jens Axboe <[email protected]> wrote:
>
>> Hi,
>>
>> So this is a fun one... While I was doing the aio polled work, I noticed
>> that the submitting process spent a substantial amount of time copying
>> data to/from userspace. For aio, that's iocb and io_event, which are 64
>> and 32 bytes respectively. Looking closer at this, it seems that
>> ERMS rep movsb is SLOWER for smaller copies, due to a higher startup
>> cost.
>>
>> I came up with this hack to test it out, and lo and behold, we now cut
>> the time spent in copying in half. 50% less.
>>
>> Since these kinds of patches tend to lend themselves to bike shedding, I
>> also ran a string of kernel compilations out of RAM. Results are as
>> follows:
>>
>> Patched : 62.86s avg, stddev 0.65s
>> Stock : 63.73s avg, stddev 0.67s
>>
>> which would also seem to indicate that we're faster punting smaller
>> (< 128 byte) copies.
>>
>> CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>>
>> Interestingly, text size is smaller with the patch as well?!
>>
>> I'm sure there are smarter ways to do this, but results look fairly
>> conclusive. FWIW, the behavioral change was introduced by:
>>
>> commit 954e482bde20b0e208fd4d34ef26e10afd194600
>> Author: Fenghua Yu <[email protected]>
>> Date: Thu May 24 18:19:45 2012 -0700
>>
>> x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
>>
>> which contains nothing in terms of benchmarking or results, just claims
>> that the new hotness is better.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> ---
>>
>> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
>> index a9d637bc301d..7dbb78827e64 100644
>> --- a/arch/x86/include/asm/uaccess_64.h
>> +++ b/arch/x86/include/asm/uaccess_64.h
>> @@ -29,16 +29,27 @@ copy_user_generic(void *to, const void *from, unsigned len)
>> {
>> unsigned ret;
>>
>> + /*
>> + * For smaller copies, don't use ERMS as it's slower.
>> + */
>> + if (len < 128) {
>> + alternative_call(copy_user_generic_unrolled,
>> + copy_user_generic_string, X86_FEATURE_REP_GOOD,
>> + ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>> + "=d" (len)),
>> + "1" (to), "2" (from), "3" (len)
>> + : "memory", "rcx", "r8", "r9", "r10", "r11");
>> + return ret;
>> + }
>> +
>> /*
>> * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
>> * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
>> * Otherwise, use copy_user_generic_unrolled.
>> */
>> alternative_call_2(copy_user_generic_unrolled,
>> - copy_user_generic_string,
>> - X86_FEATURE_REP_GOOD,
>> - copy_user_enhanced_fast_string,
>> - X86_FEATURE_ERMS,
>> + copy_user_generic_string, X86_FEATURE_REP_GOOD,
>> + copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
>> ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>> "=d" (len)),
>> "1" (to), "2" (from), "3" (len)
>
> So I'm inclined to do something like yours, because clearly the changelog
> of 954e482bde20 was at least partly false: Intel can say whatever they
> want, it's a fact that ERMS has high setup costs for low buffer sizes -
> ERMS is optimized for large size, cache-aligned copies mainly.

I'm actually surprised that something like that was accepted, I guess
2012 was a simpler time :-)

> But the result is counter-intuitive in terms of kernel text footprint,
> plus the '128' is pretty arbitrary - we should at least try to come up
> with a break-even point where manual copy is about as fast as ERMS - on
> at least a single CPU ...

I did some more investigation yesterday, and found this:

commit 236222d39347e0e486010f10c1493e83dbbdfba8
Author: Paolo Abeni <[email protected]>
Date: Thu Jun 29 15:55:58 2017 +0200

x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings

which does attempt to rectify it, but only uses ERMS for >= 64 byte copies.
At least for me, it looks like the break even point is higher than that, which
would mean that something like the below would be more appropriate. Adding
Paolo, in case he actually wrote a test app for this. In terms of a test app,
I'm always wary of relying on microbenchmarking alone for this type of thing;
real-world testing (if stable enough) is much more useful.


diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index db4e5aa0858b..21c4d68c5fac 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -175,8 +175,8 @@ EXPORT_SYMBOL(copy_user_generic_string)
*/
ENTRY(copy_user_enhanced_fast_string)
ASM_STAC
- cmpl $64,%edx
- jb .L_copy_short_string /* less then 64 bytes, avoid the costly 'rep' */
+ cmpl $128,%edx
+ jb .L_copy_short_string /* less then 128 bytes, avoid costly 'rep' */
movl %edx,%ecx
1: rep
movsb

--
Jens Axboe


2018-11-21 16:35:54

by Paolo Abeni

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Wed, 2018-11-21 at 06:32 -0700, Jens Axboe wrote:
> I did some more investigation yesterday, and found this:
>
> commit 236222d39347e0e486010f10c1493e83dbbdfba8
> Author: Paolo Abeni <[email protected]>
> Date: Thu Jun 29 15:55:58 2017 +0200
>
> x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings
>
>> which does attempt to rectify it, but only uses ERMS for >= 64 byte copies.
>> At least for me, it looks like the break even point is higher than that, which
>> would mean that something like the below would be more appropriate.

Back then I used a custom kernel module and the tsc to micro-benchmark
the patched function and the original one with different buffer sizes.
I'm sorry, the relevant code has been lost.

In my experiments 64 bytes was the break even point for all the CPUs I
had handy, but I guess that may change with other models.
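
A module doing that kind of measurement can be pretty small - the sketch
below shows the method (timing copy_from_user() from a debugfs write
handler with the TSC), but it is only an illustration, not the original
code; the debugfs hookup and all names are assumed:

/* illustrative only - time copy_from_user() for whatever size user
 * space writes to the debugfs file
 */
#include <linux/module.h>
#include <linux/debugfs.h>
#include <linux/uaccess.h>
#include <asm/msr.h>

#define BENCH_ITERS	1000

static ssize_t bench_write(struct file *file, const char __user *ubuf,
			   size_t len, loff_t *ppos)
{
	char kbuf[256];
	u64 t0, t1;
	int i;

	if (len > sizeof(kbuf))
		return -EINVAL;

	t0 = rdtsc();
	for (i = 0; i < BENCH_ITERS; i++) {
		if (copy_from_user(kbuf, ubuf, len))
			return -EFAULT;
	}
	t1 = rdtsc();

	pr_info("copy_from_user: len=%zu, ~%llu cycles/copy\n",
		len, (t1 - t0) / BENCH_ITERS);
	return len;
}

static const struct file_operations bench_fops = {
	.write = bench_write,
};

static struct dentry *bench_file;

static int __init bench_init(void)
{
	bench_file = debugfs_create_file("copy_bench", 0200, NULL, NULL,
					 &bench_fops);
	return 0;
}

static void __exit bench_exit(void)
{
	debugfs_remove(bench_file);
}

module_init(bench_init);
module_exit(bench_exit);
MODULE_LICENSE("GPL");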

Cheers,

Paolo


2018-11-21 16:36:02

by Denys Vlasenko

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On 11/21/2018 02:32 PM, Jens Axboe wrote:
> On 11/20/18 11:36 PM, Ingo Molnar wrote:
>> * Jens Axboe <[email protected]> wrote:
>>> So this is a fun one... While I was doing the aio polled work, I noticed
>>> that the submitting process spent a substantial amount of time copying
>>> data to/from userspace. For aio, that's iocb and io_event, which are 64
>>> and 32 bytes respectively. Looking closer at this, it seems that
>>> ERMS rep movsb is SLOWER for smaller copies, due to a higher startup
>>> cost.
>>>
>>> I came up with this hack to test it out, and lo and behold, we now cut
>>> the time spent in copying in half. 50% less.
>>>
>>> Since these kinds of patches tend to lend themselves to bike shedding, I
>>> also ran a string of kernel compilations out of RAM. Results are as
>>> follows:
>>>
>>> Patched : 62.86s avg, stddev 0.65s
>>> Stock : 63.73s avg, stddev 0.67s
>>>
>>> which would also seem to indicate that we're faster punting smaller
>>> (< 128 byte) copies.
>>>
>>> CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>>>
>>> Interestingly, text size is smaller with the patch as well?!
>>>
>>> I'm sure there are smarter ways to do this, but results look fairly
>>> conclusive. FWIW, the behavioral change was introduced by:
>>>
>>> commit 954e482bde20b0e208fd4d34ef26e10afd194600
>>> Author: Fenghua Yu <[email protected]>
>>> Date: Thu May 24 18:19:45 2012 -0700
>>>
>>> x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
>>>
>>> which contains nothing in terms of benchmarking or results, just claims
>>> that the new hotness is better.
>>>
>>> Signed-off-by: Jens Axboe <[email protected]>
>>> ---
>>>
>>> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
>>> index a9d637bc301d..7dbb78827e64 100644
>>> --- a/arch/x86/include/asm/uaccess_64.h
>>> +++ b/arch/x86/include/asm/uaccess_64.h
>>> @@ -29,16 +29,27 @@ copy_user_generic(void *to, const void *from, unsigned len)
>>> {
>>> unsigned ret;
>>>
>>> + /*
>>> + * For smaller copies, don't use ERMS as it's slower.
>>> + */
>>> + if (len < 128) {
>>> + alternative_call(copy_user_generic_unrolled,
>>> + copy_user_generic_string, X86_FEATURE_REP_GOOD,
>>> + ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>>> + "=d" (len)),
>>> + "1" (to), "2" (from), "3" (len)
>>> + : "memory", "rcx", "r8", "r9", "r10", "r11");
>>> + return ret;
>>> + }
>>> +
>>> /*
>>> * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
>>> * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
>>> * Otherwise, use copy_user_generic_unrolled.
>>> */
>>> alternative_call_2(copy_user_generic_unrolled,
>>> - copy_user_generic_string,
>>> - X86_FEATURE_REP_GOOD,
>>> - copy_user_enhanced_fast_string,
>>> - X86_FEATURE_ERMS,
>>> + copy_user_generic_string, X86_FEATURE_REP_GOOD,
>>> + copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
>>> ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>>> "=d" (len)),
>>> "1" (to), "2" (from), "3" (len)
>>
>> So I'm inclined to do something like yours, because clearly the changelog
>> of 954e482bde20 was at least partly false: Intel can say whatever they
>> want, it's a fact that ERMS has high setup costs for low buffer sizes -
>> ERMS is optimized for large size, cache-aligned copies mainly.
>
> I'm actually surprised that something like that was accepted, I guess
> 2012 was a simpler time :-)
>
>> But the result is counter-intuitive in terms of kernel text footprint,
>> plus the '128' is pretty arbitrary - we should at least try to come up
>> with a break-even point where manual copy is about as fast as ERMS - on
>> at least a single CPU ...
>
> I did some more investigation yesterday, and found this:
>
> commit 236222d39347e0e486010f10c1493e83dbbdfba8
> Author: Paolo Abeni <[email protected]>
> Date: Thu Jun 29 15:55:58 2017 +0200
>
> x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings
>
> which does attempt to rectify it, but only uses ERMS for >= 64 byte copies.
> At least for me, it looks like the break even point is higher than that, which
> would mean that something like the below would be more appropriate.

I also tested this while working on the string ops code in musl.

I think at least 128 bytes would be the minimum where "REP" insns
are more efficient. In my testing, it's more like 256 bytes...

2018-11-21 19:50:44

by Linus Torvalds

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Wed, Nov 21, 2018 at 5:45 AM Paolo Abeni <[email protected]> wrote:
>
> In my experiments 64 bytes was the break even point for all the CPUs I
> had handy, but I guess that may change with other models.

Note that experiments with memcpy speed are almost invariably broken.
Microbenchmarks don't show the impact of I$, but they also don't show
the impact of _behavior_.

For example, there might be things like "repeat strings do cacheline
optimizations" that end up meaning that cachelines stay in L2, for
example, and are never brought into L1. That can be a really good
thing, but it can also mean that now the result isn't as close to the
CPU, and the subsequent use of the cacheline can be costlier.

I say "go for upping the limit to 128 bytes".

That said, if the aio user copy is _so_ critical that it's this
noticeable, there may be other issues. Sometimes the _real_ cost of small
user copies is the STAC/CLAC, more so than the "rep movs".

It would be interesting to know exactly which copy it is that matters
so much... *inlining* the erms case might show that nicely in
profiles.

Linus

2018-11-21 20:20:02

by Jens Axboe

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On 11/21/18 10:27 AM, Linus Torvalds wrote:
> On Wed, Nov 21, 2018 at 5:45 AM Paolo Abeni <[email protected]> wrote:
>>
>> In my experiments 64 bytes was the break even point for all the CPUs I
>> had handy, but I guess that may change with other models.
>
> Note that experiments with memcpy speed are almost invariably broken.
> Microbenchmarks don't show the impact of I$, but they also don't show
> the impact of _behavior_.
>
> For example, there might be things like "repeat strings do cacheline
> optimizations" that end up meaning that cachelines stay in L2, for
> example, and are never brought into L1. That can be a really good
> thing, but it can also mean that now the result isn't as close to the
> CPU, and the subsequent use of the cacheline can be costlier.

Totally agree, which is why all my testing was NOT microbenchmarking.

> I say "go for upping the limit to 128 bytes".

See below...

> That said, if the aio user copy is _so_ critical that it's this
> noticeable, there may be other issues. Sometimes the _real_ cost of small
> user copies is the STAC/CLAC, more so than the "rep movs".
>
> It would be interesting to know exactly which copy it is that matters
> so much... *inlining* the erms case might show that nicely in
> profiles.

Oh I totally agree, which is why I since went a different route. The
copy that matters is the copy_from_user() of the iocb, which is 64
bytes. Even for 4k IOs, copying 64b per IO is somewhat counter
productive for O_DIRECT.

Playing around with this:

http://git.kernel.dk/cgit/linux-block/commit/?h=aio-poll&id=ed0a0a445c0af4cfd18b0682511981eaf352d483

since we're doing a new sys_io_setup2() for polled aio anyway. This
completely avoids the iocb copy, but that's just for my initial
particular gripe.


diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index db4e5aa0858b..21c4d68c5fac 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -175,8 +175,8 @@ EXPORT_SYMBOL(copy_user_generic_string)
*/
ENTRY(copy_user_enhanced_fast_string)
ASM_STAC
- cmpl $64,%edx
- jb .L_copy_short_string /* less then 64 bytes, avoid the costly 'rep' */
+ cmpl $128,%edx
+ jb .L_copy_short_string /* less then 128 bytes, avoid costly 'rep' */
movl %edx,%ecx
1: rep
movsb

--
Jens Axboe


2018-11-21 20:30:20

by Linus Torvalds

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Wed, Nov 21, 2018 at 9:27 AM Linus Torvalds
<[email protected]> wrote:
>
> It would be interesting to know exactly which copy it is that matters
> so much... *inlining* the erms case might show that nicely in
> profiles.

Side note: the fact that Jens' patch (which I don't like in that form)
allegedly shrunk the resulting kernel binary would seem to indicate
that there are a *lot* of compile-time constant-sized memcpy calls that
we are missing, and that fall back to copy_user_generic().

It might be interesting to just change raw_copy_to/from_user() to
handle a lot more cases (in particular, handle cases where 'size' is
8-byte aligned). The special cases we *do* have may not be the right
ones (the 10-byte case in particular looks odd).

For example, instead of having a "if constant size is 8 bytes, do one
get/put_user()" case, we might have a "if constant size is < 64 just
unroll it into get/put_user()" calls.
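
Roughly this shape, purely as an untested sketch - the helper names and
the 64-byte cutoff here are arbitrary, and the error return is only an
approximation of the "bytes not copied" convention:

/*
 * Sketch: unroll small compile-time-constant copies into put_user()
 * calls instead of dropping into copy_user_generic().
 */
static __always_inline unsigned long
copy_to_user_small_const(void __user *dst, const void *src, unsigned long size)
{
	while (size >= sizeof(u64)) {
		if (put_user(*(const u64 *)src, (u64 __user *)dst))
			return size;
		dst += sizeof(u64);
		src += sizeof(u64);
		size -= sizeof(u64);
	}
	while (size) {
		if (put_user(*(const u8 *)src, (u8 __user *)dst))
			return size;
		dst++;
		src++;
		size--;
	}
	return 0;
}

static __always_inline unsigned long
raw_copy_to_user_sketch(void __user *dst, const void *src, unsigned long size)
{
	/* only take the unrolled path when the compiler knows the size */
	if (__builtin_constant_p(size) && size <= 64)
		return copy_to_user_small_const(dst, src, size);

	return copy_user_generic((__force void *)dst, src, size);
}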

Linus

2018-11-21 20:30:38

by Andy Lutomirski

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes



> On Nov 21, 2018, at 11:04 AM, Jens Axboe <[email protected]> wrote:
>
>> On 11/21/18 10:27 AM, Linus Torvalds wrote:
>>> On Wed, Nov 21, 2018 at 5:45 AM Paolo Abeni <[email protected]> wrote:
>>>
>>> In my experiments 64 bytes was the break even point for all the CPUs I
>>> had handy, but I guess that may change with other models.
>>
>> Note that experiments with memcpy speed are almost invariably broken.
>> Microbenchmarks don't show the impact of I$, but they also don't show
>> the impact of _behavior_.
>>
>> For example, there might be things like "repeat strings do cacheline
>> optimizations" that end up meaning that cachelines stay in L2, for
>> example, and are never brought into L1. That can be a really good
>> thing, but it can also mean that now the result isn't as close to the
>> CPU, and the subsequent use of the cacheline can be costlier.
>
> Totally agree, which is why all my testing was NOT microbenchmarking.
>
>> I say "go for upping the limit to 128 bytes".
>
> See below...
>
>> That said, if the aio user copy is _so_ critical that it's this
>> noticeable, there may be other issues. Sometimes the _real_ cost of small
>> user copies is the STAC/CLAC, more so than the "rep movs".
>>
>> It would be interesting to know exactly which copy it is that matters
>> so much... *inlining* the erms case might show that nicely in
>> profiles.
>
> Oh I totally agree, which is why I since went a different route. The
> copy that matters is the copy_from_user() of the iocb, which is 64
> bytes. Even for 4k IOs, copying 64b per IO is somewhat counter
> productive for O_DIRECT.

Can we maybe use this as an excuse to ask for some reasonable instructions to access user memory? Intel already did all the dirty work of giving something resembling sane semantics for the kernel doing a user-privileged access with WRUSS. How about WRUSER, RDUSER, and maybe even the REP variants? And I suppose LOCK CMPXCHGUSER.

Or Intel could try to make STAC and CLAC be genuinely fast (0 or 1 cycles and no stalls *ought* to be possible if it were handled in the front end, as long as there aren’t any PUSHF or POPF instructions in the pipeline). As it stands, I assume that both instructions prevent any following memory accesses from starting until they retire, and they might even be nastily microcoded to handle the overloading of AC.

2018-11-21 21:23:18

by Linus Torvalds

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Wed, Nov 21, 2018 at 10:26 AM Andy Lutomirski <[email protected]> wrote:
>
> Can we maybe use this as an excuse to ask for some reasonable instructions to access user memory?

I did that long ago. It's why we have CLAC/STAC today. I was told that
what I actually asked for (get an instruction to access user space - I
suggested using a segment override prefix) was not viable.

Linus

2018-11-21 21:55:47

by Linus Torvalds

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Wed, Nov 21, 2018 at 10:16 AM Linus Torvalds
<[email protected]> wrote:
>
> It might be interesting to just change raw_copy_to/from_user() to
> handle a lot more cases (in particular, handle cases where 'size' is
> 8-byte aligned). The special cases we *do* have may not be the right
> ones (the 10-byte case in particular looks odd).
>
> For example, instead of having a "if constant size is 8 bytes, do one
> get/put_user()" case, we might have a "if constant size is < 64 just
> unroll it into get/put_user()" calls.

Actually, x86 doesn't even set INLINE_COPY_TO_USER, so I don't think
the constant size cases ever trigger at all the way they are set up
now.

I do have a random patch that makes "unsafe_put_user()" actually use
"asm goto" for the error case, and that, together with the attached
patch seems to generate fairly nice code, but even then it would
depend on gcc actually unrolling things (which we do *not* want in
general).

But for a 32-byte user copy (cp_old_stat), and that
INLINE_COPY_TO_USER, it generates this:

stac
movl $32, %edx #, size
movq %rsp, %rax #, src
.L201:
movq (%rax), %rcx # MEM[base: src_155, offset: 0B],
MEM[base: src_155, offset: 0B]
1: movq %rcx,0(%rbp) # MEM[base: src_155, offset: 0B],
MEM[(struct __large_struct *)dst_156]
ASM_EXTABLE_HANDLE from=1b to=.L200 handler="ex_handler_uaccess" #

addq $8, %rax #, src
addq $8, %rbp #, statbuf
subq $8, %rdx #, size
jne .L201 #,
clac

which is actually fairly close to "optimal".

Random patch (with my "asm goto" hack included) attached, in case
people want to play with it.

Impressively, it actually removes more lines of code than it adds. But
I didn't actually check whether the end result *works*, so hey..

Linus


Attachments:
patch.diff (13.07 kB)

2018-11-22 09:19:38

by Andy Lutomirski

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Wed, Nov 21, 2018 at 10:44 AM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, Nov 21, 2018 at 10:26 AM Andy Lutomirski <[email protected]> wrote:
> >
> > Can we maybe use this as an excuse to ask for some reasonable instructions to access user memory?
>
> I did that long ago. It's why we have CLAC/STAC today. I was told that
> what I actually asked for (get an instruction to access user space - I
> suggested using a segment override prefix) was not viable.
>

Maybe it wasn't viable *then*, but WRUSS really does write to
userspace with user privilege. Surely the same mechanism with some of
the weirdness removed would do what we want.

2018-11-23 11:27:04

by Ingo Molnar

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes


* Linus Torvalds <[email protected]> wrote:

> On Wed, Nov 21, 2018 at 10:16 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > It might be interesting to just change raw_copy_to/from_user() to
> > handle a lot more cases (in particular, handle cases where 'size' is
> > 8-byte aligned). The special cases we *do* have may not be the right
> > ones (the 10-byte case in particular looks odd).
> >
> > For example, instead of having a "if constant size is 8 bytes, do one
> > get/put_user()" case, we might have a "if constant size is < 64 just
> > unroll it into get/put_user()" calls.
>
> Actually, x86 doesn't even set INLINE_COPY_TO_USER, so I don't think
> the constant size cases ever trigger at all the way they are set up
> now.

Side note, there's one artifact the patch exposes: some of the
__builtin_constant_p() checks are imprecise and don't trigger at the
early stage where GCC checks them, but the length is actually known to
the compiler at later optimization stages.

This means that with Jens's patch some of the length checks go away. I
checked x86-64 defconfig and a distro config, and the numbers were ~7%
and 10%, so not a big effect.

The kernel text size reduction with Jens's patch is small but real:

text data bss dec hex filename
19572694 11516934 19873888 50963516 309a43c vmlinux.before
19572468 11516934 19873888 50963290 309a35a vmlinux.after

But I checked the disassembly, and it's not a real win: the new code is
actually more complex than the old one, as expected, and GCC (7.3.0) does
some particularly stupid things which bloat the generated code.

> I do have a random patch that makes "unsafe_put_user()" actually use
> "asm goto" for the error case, and that, together with the attached
> patch seems to generate fairly nice code, but even then it would
> depend on gcc actually unrolling things (which we do *not* want in
> general).
>
> But for a 32-byte user copy (cp_old_stat), and that
> INLINE_COPY_TO_USER, it generates this:
>
> stac
> movl $32, %edx #, size
> movq %rsp, %rax #, src
> .L201:
> movq (%rax), %rcx # MEM[base: src_155, offset: 0B],
> MEM[base: src_155, offset: 0B]
> 1: movq %rcx,0(%rbp) # MEM[base: src_155, offset: 0B],
> MEM[(struct __large_struct *)dst_156]
> ASM_EXTABLE_HANDLE from=1b to=.L200 handler="ex_handler_uaccess" #
>
> addq $8, %rax #, src
> addq $8, %rbp #, statbuf
> subq $8, %rdx #, size
> jne .L201 #,
> clac
>
> which is actually fairly close to "optimal".

Yeah, that looks pretty sweet!

> Random patch (with my "asm goto" hack included) attached, in case
> people want to play with it.

Doesn't even look all that hacky to me. Any hack in it that I didn't
notice? :-)

The only question is the inlining overhead - will try to measure that.

> Impressively, it actually removes more lines of code than it adds. But
> I didn't actually check whether the end result *works*, so hey..

Most of the linecount reduction appears to come from simplifying the
unroll loop and moving it into C, from a 6-way hard-coded copy
routine:

> - switch (size) {
> - case 1:
> - case 2:
> - case 4:
> - case 8:
> - case 10:
> - case 16:

to a more flexible 4-way loop unrolling:

> + while (size >= sizeof(unsigned long)) {
> + while (size >= sizeof(unsigned int)) {
> + while (size >= sizeof(unsigned short)) {
> + while (size >= sizeof(unsigned char)) {

Which is a nice improvement in itself.
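
( Filled out, the shape of that loop is roughly the following -
  reconstructed here only to show the structure, not lifted from the
  actual patch; it returns the number of bytes not copied: )

static __always_inline unsigned long
unrolled_copy_to_user(void __user *dst, const void *src, unsigned long size)
{
	user_access_begin();
	while (size >= sizeof(unsigned long)) {
		unsafe_put_user(*(const unsigned long *)src,
				(unsigned long __user *)dst, out);
		dst += sizeof(unsigned long);
		src += sizeof(unsigned long);
		size -= sizeof(unsigned long);
	}
	while (size >= sizeof(unsigned int)) {
		unsafe_put_user(*(const unsigned int *)src,
				(unsigned int __user *)dst, out);
		dst += sizeof(unsigned int);
		src += sizeof(unsigned int);
		size -= sizeof(unsigned int);
	}
	while (size >= sizeof(unsigned short)) {
		unsafe_put_user(*(const unsigned short *)src,
				(unsigned short __user *)dst, out);
		dst += sizeof(unsigned short);
		src += sizeof(unsigned short);
		size -= sizeof(unsigned short);
	}
	while (size) {
		unsafe_put_user(*(const unsigned char *)src,
				(unsigned char __user *)dst, out);
		dst++;
		src++;
		size--;
	}
out:
	user_access_end();
	return size;
}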

> + user_access_begin();
> + if (dirent)
> + unsafe_put_user(offset, &dirent->d_off, efault_end);
> dirent = buf->current_dir;
> + unsafe_put_user(d_ino, &dirent->d_ino, efault_end);
> + unsafe_put_user(reclen, &dirent->d_reclen, efault_end);
> + unsafe_put_user(0, dirent->d_name + namlen, efault_end);
> + unsafe_put_user(d_type, (char __user *) dirent + reclen - 1, efault_end);
> + user_access_end();
> +
> if (copy_to_user(dirent->d_name, name, namlen))
> goto efault;
> buf->previous = dirent;
> dirent = (void __user *)dirent + reclen;
> buf->current_dir = dirent;
> buf->count -= reclen;
> return 0;
> +efault_end:
> + user_access_end();
> efault:
> buf->error = -EFAULT;
> return -EFAULT;

In terms of high level APIs, could we perhaps use the opportunity to
introduce unsafe_write_user() instead, which would allow us to write it
as:

unsafe_write_user(&dirent->d_ino, d_ino, efault_end);
unsafe_write_user(&dirent->d_reclen, reclen, efault_end);
unsafe_write_user(dirent->d_name + namlen, 0, efault_end);
unsafe_write_user((char __user *)dirent + reclen - 1, d_type, efault_end);

if (copy_to_user(dirent->d_name, name, namlen))
goto efault;

This gives it the regular 'VAR = VAL;' notation of C assignments, instead
of the weird historical reverse notation that put_user()/get_user() uses.

Note how this newfangled ordering now matches the 'copy_to_user()'
natural C-assignment parameter order that comes straight afterwards and
makes it obvious that the d->name+namelen was writing the delimiter at
the end.

I think we even had bugs from put_user() ordering mixups?

Or is it too late to try to fix this particular mistake?

Thanks,

Ingo

2018-11-23 12:21:45

by Ingo Molnar

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes


* Ingo Molnar <[email protected]> wrote:

> The kernel text size reduction with Jens's patch is small but real:
>
> text data bss dec hex filename
> 19572694 11516934 19873888 50963516 309a43c vmlinux.before
> 19572468 11516934 19873888 50963290 309a35a vmlinux.after
>
> But I checked the disassembly, and it's not a real win: the new code is
> actually more complex than the old one, as expected, and GCC (7.3.0) does
> some particularly stupid things which bloat the generated code.

So I dug into this some more:

1)

Firstly I tracked down GCC bloating the might_fault() checks and the
related out-of-line code exception handling which bloats the full
generated function.

2)

But with even that complication eliminated, there's a size reduction when
Jens's patch is applied, which is puzzling:

19563640 11516790 19882080 50962510 309a04e vmlinux.before
19563274 11516790 19882080 50962144 3099ee0 vmlinux.after

but this is entirely due to the .altinstructions section being counted as
'text' part of the vmlinux - while in reality it's not:

3)

The _real_ part of the vmlinux gets bloated by Jens's patch:

ffffffff81000000 <_stext>:

before: ffffffff81b0e5e0 <__clear_user>
after: ffffffff81b0e670 <__clear_user>:

I.e. we get a e5e0 => e670 bloat, as expected.

In the config I tested a later section of the kernel image first aligns
away the bloat:

before: ffffffff82fa6321 <.altinstr_aux>:
after: ffffffff82fa6321 <.altinstr_aux>:

and then artificially debloats the modified kernel via the
altinstructions section:

before: Disassembly of section .exit.text: ffffffff83160798 <intel_uncore_exit>
after: Disassembly of section .exit.text: ffffffff83160608 <intel_uncore_exit>

Note that there's a third level of obfuscation here: Jens's patch actually
*adds* a new altinstructions statement:

+ /*
+ * For smaller copies, don't use ERMS as it's slower.
+ */
+ if (len < 128) {
+ alternative_call(copy_user_generic_unrolled,
+ copy_user_generic_string, X86_FEATURE_REP_GOOD,
+ ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
+ "=d" (len)),
+ "1" (to), "2" (from), "3" (len)
+ : "memory", "rcx", "r8", "r9", "r10", "r11");
+ return ret;
+ }
+
/*
* If CPU has ERMS feature, use copy_user_enhanced_fast_string.
* Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
* Otherwise, use copy_user_generic_unrolled.
*/
alternative_call_2(copy_user_generic_unrolled,
- copy_user_generic_string,
- X86_FEATURE_REP_GOOD,
- copy_user_enhanced_fast_string,
- X86_FEATURE_ERMS,
+ copy_user_generic_string, X86_FEATURE_REP_GOOD,
+ copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
"=d" (len)),
"1" (to), "2" (from), "3" (len)

So how can this change possibly result in a *small* altinstructions
section?

4)

The reason is GCC's somewhat broken __builtin_constant_p() logic, which
leaves ~10% of the constant call sites actually active, but which are
then optimized by GCC's later stages, and the alternative_call_2() gets
optimized out and replaced with the alternative_call() call.

This is where Jens's patch 'debloats' the vmlinux and confuses the 'size'
utility and gains its code reduction street cred.

Note to self: watch out for patches that change altinstructions and don't
make premature vmlinux size impact assumptions. :-)

Thanks,

Ingo

2018-11-23 12:24:05

by Ingo Molnar

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes


* Ingo Molnar <[email protected]> wrote:

> So I dug into this some more:
>
> 1)
>
> Firstly I tracked down GCC bloating the might_fault() checks and the
> related out-of-line code exception handling which bloats the full
> generated function.

Sorry, I mis-remembered that detail when I wrote the email: it was
CONFIG_HARDENED_USERCOPY=y and its object size checks that distros enable
- and I disabled that option to simplify the size analysis.

(might_fault() doesn't have inline conditionals so shouldn't have any
effect on the generated code.)

Thanks,

Ingo

2018-11-24 06:56:38

by Linus Torvalds

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Thu, Nov 22, 2018 at 2:32 AM Ingo Molnar <[email protected]> wrote:
> * Linus Torvalds <[email protected]> wrote:
> >
> > Random patch (with my "asm goto" hack included) attached, in case
> > people want to play with it.
>
> Doesn't even look all that hacky to me. Any hack in it that I didn't
> notice? :-)

The code to use asm goto sadly doesn't have any fallback at all for
the "no asm goto available".

I guess we're getting close to "we require asm goto support", but I
don't think we're there yet.

Also, while "unsafe_put_user()" has been converted to use asm goto
(and yes, it really does generate much nicer code), the same is not
true of "unsafe_get_user()". Because sadly, gcc does not support asm
goto with output values.

So, realistically, my patch is not _technically_ hacky, but it's
simply not viable as things stand, and it's more of a tech
demonstration than anything else.

Linus

2018-11-24 07:04:14

by Andy Lutomirski

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Thu, Nov 22, 2018 at 8:56 AM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, Nov 22, 2018 at 2:32 AM Ingo Molnar <[email protected]> wrote:
> > * Linus Torvalds <[email protected]> wrote:
> > >
> > > Random patch (with my "asm goto" hack included) attached, in case
> > > people want to play with it.
> >
> > Doesn't even look all that hacky to me. Any hack in it that I didn't
> > notice? :-)
>
> The code to use asm goto sadly doesn't have any fallback at all for
> the "no asm goto available".
>
> I guess we're getting close to "we require asm goto support", but I
> don't think we're there yet.

commit e501ce957a786ecd076ea0cfb10b114e6e4d0f40
Author: Peter Zijlstra <[email protected]>
Date: Wed Jan 17 11:42:07 2018 +0100

x86: Force asm-goto

We want to start using asm-goto to guarantee the absence of dynamic
branches (and thus speculation).

A primary prerequisite for this is of course that the compiler
supports asm-goto. This effectively lifts the minimum GCC version to
build an x86 kernel to gcc-4.5.

This is basically the only good outcome from the speculation crap as
far as I'm concerned :)

So I think your patch is viable. Also, with that patch applied,
put_user_ex() should become worse than worthless -- if gcc is any
good, plain old:

if (unsafe_put_user(...) != 0)
goto err;
if (unsafe_put_user(...) != 0)
goto err;
etc.

will generate *better* code than a series of put_user_ex() calls.

2018-11-24 07:06:16

by Linus Torvalds

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Thu, Nov 22, 2018 at 9:26 AM Andy Lutomirski <[email protected]> wrote:
>
> So I think your patch is viable. Also, with that patch applied,
> put_user_ex() should become worse than worthless

Yes. I hate those special-case _ex variants.

I guess I should just properly forward-port my patch series where the
different steps are separated out (not jumbled up like that patch I
actually posted).

Linus

2018-11-24 07:07:41

by David Laight

Subject: RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Denys Vlasenko
> Sent: 21 November 2018 13:44
...
> I also tested this while working on the string ops code in musl.
>
> I think at least 128 bytes would be the minimum where "REP" insns
> are more efficient. In my testing, it's more like 256 bytes...

What happens for misaligned copies?
I had a feeling that the ERMS 'rep movsb' code used some kind
of barrel shifter in that case.

The other problem with the ERMS copy is that it gets used
for memcpy_to/fromio() - and the 'rep movsb' on uncached
locations has to do byte copies.
Byte reads on PCIe are really horrid.

David


2018-11-24 07:08:12

by Linus Torvalds

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Thu, Nov 22, 2018 at 9:36 AM David Laight <[email protected]> wrote:
>
> The other problem with the ERMS copy is that it gets used
> for memcpy_to/fromio() - and the 'rep movsb' on uncached
> locations has to do byte copies.

Ugh. I thought we changed that *long* ago, because even our non-ERMS
copy is broken for PCI (it does overlapping stores for the small tail
cases).

But looking at "memcpy_{from,to}io()", I don't see x86 overriding it
with anything better.

I suspect nobody uses those functions for anything critical any more.
The fbcon people have their own copy functions, iirc.

But we definitely should fix this. *NONE* of the regular memcpy
functions actually work right for PCI space any more, and haven't for
a long time.

Linus

2018-11-24 07:11:19

by Andy Lutomirski

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Thu, Nov 22, 2018 at 9:53 AM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, Nov 22, 2018 at 9:36 AM David Laight <[email protected]> wrote:
> >
> > The other problem with the ERMS copy is that it gets used
> > for memcpy_to/fromio() - and the 'rep movsb' on uncached
> > locations has to do byte copies.
>
> Ugh. I thought we changed that *long* ago, because even our non-ERMS
> copy is broken for PCI (it does overlapping stores for the small tail
> cases).
>
> But looking at "memcpy_{from,to}io()", I don't see x86 overriding it
> with anything better.
>
> I suspect nobody uses those functions for anything critical any more.
> The fbcon people have their own copy functions, iirc.
>
> But we definitely should fix this. *NONE* of the regular memcpy
> functions actually work right for PCI space any more, and haven't for
> a long time.

I'm not personally volunteering, but I suspect we can do much better
than we do now:

- The new MOVDIRI and MOVDIR64B instructions can do big writes to WC
and UC memory. I assume those would be safe to use in ...toio()
functions, unless there are quirky devices out there that blow up if
their MMIO space is written in 64-byte chunks.

- MOVNTDQA can, I think, do 64-byte loads, but only from WC memory.
For sufficiently large copies, it could plausibly be faster to create
a WC alias and use MOVNTDQA than it is to copy in 8- for 16-byte
chunks. The i915 driver has a copy implementation using MOVNTDQA --
maybe this should get promoted to something in arch/x86 called
memcpy_from_wc().

--Andy

2018-11-24 07:14:27

by Linus Torvalds

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Thu, Nov 22, 2018 at 10:07 AM Andy Lutomirski <[email protected]> wrote:
>
> I'm not personally volunteering, but I suspect we can do much better
> than we do now:
>
> - The new MOVDIRI and MOVDIR64B instructions can do big writes to WC
> and UC memory.
>
> - MOVNTDQA can, I think, do 64-byte loads, but only from WC memory.

No, performance isn't the _primary_ issue. Nobody uses MMIO and
expects high performance from the generic functions (but people may
then tweak individual drivers to do tricks).

And we've historically had various broken hardware that cares deeply
about access size. Trying to be clever and do big accesses could
easily break something.

The fact that nobody has complained about the generic memcpy routines
probably means that the broken hardware isn't in use any more, or it
just works anyway. And nobody has complained about performance either,
so it's clearly not a huge issue. "rep movs" probably works ok on WC
memory writes anyway; it's the UC case that is bad, but I don't think
anybody uses UC and then does the "memcpy_to/fromio()" things. If you
have UC memory, you tend to do the accesses properly.

So I suspect we should just write memcpy_{to,from}io() in terms of writel/readl.

Oh, and I just noticed that on x86 we expressly use our old "safe and
sane" functions: see __inline_memcpy(), and its use in
__memcpy_{from,to}io().

So the "falls back to memcpy" was always a red herring. We don't
actually do that.

Which explains why things work.

Linus

2018-11-24 08:19:24

by David Laight

Subject: RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Linus Torvalds
> Sent: 22 November 2018 18:58
...
> Oh, and I just noticed that on x86 we expressly use our old "safe and
> sane" functions: see __inline_memcpy(), and its use in
> __memcpy_{from,to}io().
>
> So the "falls back to memcpy" was always a red herring. We don't
> actually do that.
>
> Which explains why things work.

It doesn't explain why I've seen single-byte PCIe TLPs generated
by memcpy_to/fromio().

I've had to change code to use readq/writeq loops because the
byte accesses are so slow - even when PIO performance should
be 'good enough'.

It might have been changed since last time I tested it.
But I don't remember seeing a commit go by.
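
The driver-side workaround is essentially a loop of this shape
(simplified sketch, not the actual driver code - it pulls the data out
of BAR space in 64-bit reads and ignores any sub-8-byte tail):

static void pio_read_buf(void *dst, const void __iomem *src, size_t len)
{
	u64 *d = dst;

	while (len >= 8) {
		*d++ = readq(src);
		src += 8;
		len -= 8;
	}
}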

David


2018-11-24 08:24:16

by David Laight

Subject: RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: David Laight
> Sent: 23 November 2018 09:35
> From: Linus Torvalds
> > Sent: 22 November 2018 18:58
> ...
> > Oh, and I just noticed that on x86 we expressly use our old "safe and
> > sane" functions: see __inline_memcpy(), and its use in
> > __memcpy_{from,to}io().
> >
> > So the "falls back to memcpy" was always a red herring. We don't
> > actually do that.
> >
> > Which explains why things work.
>
> It doesn't explain why I've seen single-byte PCIe TLPs generated
> by memcpy_to/fromio().
>
> I've had to change code to use readq/writeq loops because the
> byte accesses are so slow - even when PIO performance should
> be 'good enough'.
>
> It might have been changed since last time I tested it.
> But I don't remember seeing a commit go by.

I've just patched my driver and redone the test on a 4.13 (ubuntu) kernel.
Calling memcpy_fromio(kernel_buffer, PCIe_address, length)
generates a lot of single-byte TLPs.

What the code normally does is 64bit aligned PCIe reads with
multiple writes and shifts to avoid writing beyond the end of
the kernel buffer for 'odd' length transfers.

Most of our PIO copies are actually direct to/from userspace.
While copy_to/from_user() will work on PCIe memory, it is 'rep movsb'.
We also mmap() the PCIe space into process memory - and have to be
careful not to use memcpy() in userspace.

On suitable systems userspace can use the AVX256 instructions to
get wide reads. Much harder and more expensive in the kernel.

In practice most of the bulk data transfers are requested by the PCIe slave.
But there are times when PIO ones are needed, and 64-bit transfers are
8 times faster than 8-bit ones.
This is all made more significant because it takes our fpga about 500ns
to complete a single word PCIe read.

David


2018-11-24 08:43:14

by Josh Poimboeuf

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Thu, Nov 22, 2018 at 12:13:41PM +0100, Ingo Molnar wrote:
> Note to self: watch out for patches that change altinstructions and don't
> make premature vmlinux size impact assumptions. :-)

I noticed a similar problem with ORC data. As it turns out, size's
"text" calculation also includes read-only sections. That includes
.rodata and anything else not writable.

Maybe we need a more sensible "size" script for the kernel. It would be
trivial to implement based on the output of "readelf -S vmlinux".

--
Josh

2018-11-24 08:44:07

by Linus Torvalds

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Fri, Nov 23, 2018 at 2:12 AM David Laight <[email protected]> wrote:
>
> I've just patched my driver and redone the test on a 4.13 (ubuntu) kernel.
> Calling memcpy_fromio(kernel_buffer, PCIe_address, length)
> generates a lot of single-byte TLPs.

I just tested it too - it turns out that the __inline_memcpy() code
never triggers, and "memcpy_toio()" just generates a memcpy.

So that code seems entirely dead.

And, in fact, the codebase I looked at was the historical one, because
I had been going back and looking at the history. The modern tree
*does* have the "__inline_memcpy()" function I pointed at, but it's
not actually hooked up to anything!

This actually has been broken for _ages_. The breakage goes back to
2010, and commit 6175ddf06b61 ("x86: Clean up mem*io functions"), and
it seems nobody really ever noticed - or thought that it was ok.

That commit claims that iomem has no special significance on x86, but
that really really isn't true, exactly because the access size does
matter.

And as mentioned, the generic memory copy routines are not at all
appropriate, and that has nothing to do with ERMS. Our "do it by hand"
memory copy routine does things like this:

.Lless_16bytes:
cmpl $8, %edx
jb .Lless_8bytes
/*
* Move data from 8 bytes to 15 bytes.
*/
movq 0*8(%rsi), %r8
movq -1*8(%rsi, %rdx), %r9
movq %r8, 0*8(%rdi)
movq %r9, -1*8(%rdi, %rdx)
retq

and note how for an 8-byte copy, it will do *two* reads of the same 8
bytes, and *two* writes of the same 8-byte destination. That's
perfectly ok for regular memory, and it means that the code can handle
an arbitrary 8-15 byte copy without any conditionals or loop counts,
but it is *not* ok for iomem.
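
(In C, the trick in that snippet is just the following - fine for RAM,
where touching the overlapping middle bytes twice is harmless, but
obviously wrong for MMIO with read or write side effects:)

/* copy n bytes, 8 <= n <= 15, with two possibly-overlapping
 * 8-byte accesses - no loop, no extra conditionals
 */
static void copy_8_to_15(void *dst, const void *src, size_t n)
{
	u64 head = *(const u64 *)src;
	u64 tail = *(const u64 *)(src + n - 8);

	*(u64 *)dst = head;
	*(u64 *)(dst + n - 8) = tail;
}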

Of course, in practice it all just happens to work in almost all
situations (because a lot of iomem devices simply won't care), and
manual access to iomem is basically extremely rare these days anyway,
but it's definitely entirely and utterly broken.

End result: we *used* to do this right. For the last eight years our
"memcpy_{to,from}io()" has been entirely broken, and apparently even
the people who noticed oddities, like David, never reported it as
breakage but instead just worked around it in drivers.

Ho humm.

Let me write a generic routine in lib/iomap_copy.c (which already does
the "user specifies chunk size" cases), and hook it up for x86.

David, are you using a bus analyzer or something to do your testing?
I'll have a trial patch for you asap.

Linus

2018-11-24 08:46:59

by Linus Torvalds

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Fri, Nov 23, 2018 at 8:36 AM Linus Torvalds
<[email protected]> wrote:
>
> Let me write a generic routine in lib/iomap_copy.c (which already does
> the "user specifies chunk size" cases), and hook it up for x86.

Something like this?

ENTIRELY UNTESTED! It might not compile. Seriously. And if it does
compile, it might not work.

And this doesn't actually do the memset_io() function at all, just the
memcpy ones.

Finally, it's worth noting that on x86, we have this:

/*
* override generic version in lib/iomap_copy.c
*/
ENTRY(__iowrite32_copy)
movl %edx,%ecx
rep movsd
ret
ENDPROC(__iowrite32_copy)

because back in 2006, we did this:

[PATCH] Add faster __iowrite32_copy routine for x86_64

This assembly version is measurably faster than the generic version in
lib/iomap_copy.c.

which actually implies that "rep movsd" is faster than doing
__raw_writel() by hand.

So it is possible that this should all be arch-specific code rather
than that butt-ugly "generic" code I wrote in this patch.

End result: I'm not really all that happy about this patch, but it's
perhaps worth testing, and it's definitely worth discussing. Because
our current memcpy_{to,from}io() is truly broken garbage.
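
( The gist of the approach - as an illustration only, not the attached
  patch itself: byte accesses for the unaligned head and tail,
  __raw_writel() for the aligned middle, roughly like so: )

static void memcpy_toio_chunked(volatile void __iomem *dst,
				const void *src, size_t n)
{
	/* byte-copy until the destination is 4-byte aligned */
	while (n && ((unsigned long)dst & 3)) {
		__raw_writeb(*(const u8 *)src, dst);
		src++; dst++; n--;
	}
	/* aligned 4-byte chunks (unaligned *source* reads are fine on x86) */
	while (n >= 4) {
		__raw_writel(*(const u32 *)src, dst);
		src += 4; dst += 4; n -= 4;
	}
	/* tail */
	while (n) {
		__raw_writeb(*(const u8 *)src, dst);
		src++; dst++; n--;
	}
}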

Linus


Attachments:
patch.diff (5.01 kB)

2018-11-24 08:48:12

by Andy Lutomirski

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes



> On Nov 23, 2018, at 10:42 AM, Linus Torvalds <[email protected]> wrote:
>
> On Fri, Nov 23, 2018 at 8:36 AM Linus Torvalds
> <[email protected]> wrote:
>>
>> Let me write a generic routine in lib/iomap_copy.c (which already does
>> the "user specifies chunk size" cases), and hook it up for x86.
>
> Something like this?
>
> ENTIRELY UNTESTED! It might not compile. Seriously. And if it does
> compile, it might not work.
>
> And this doesn't actually do the memset_io() function at all, just the
> memcpy ones.
>
> Finally, it's worth noting that on x86, we have this:
>
> /*
> * override generic version in lib/iomap_copy.c
> */
> ENTRY(__iowrite32_copy)
> movl %edx,%ecx
> rep movsd
> ret
> ENDPROC(__iowrite32_copy)
>
> because back in 2006, we did this:
>
> [PATCH] Add faster __iowrite32_copy routine for x86_64
>
> This assembly version is measurably faster than the generic version in
> lib/iomap_copy.c.
>
> which actually implies that "rep movsd" is faster than doing
> __raw_writel() by hand.
>
> So it is possible that this should all be arch-specific code rather
> than that butt-ugly "generic" code I wrote in this patch.
>
> End result: I'm not really all that happy about this patch, but it's
> perhaps worth testing, and it's definitely worth discussing. Because
> our current memcpy_{to,from}io() is truly broken garbage.
>
>

What is memcpy_to_io even supposed to do? I’m guessing it’s defined as something like “copy this data to IO space using at most long-sized writes, all aligned, and writing each byte exactly once, in order.” That sounds... dubiously useful. I could see a function that writes to aligned memory in specified-sized chunks. And I can see a use for a function to just write it in whatever size chunks the architecture thinks is fastest, and *that* should probably use MOVDIR64B.

Or is there some subtlety I’m missing?

2018-11-24 08:48:22

by Linus Torvalds

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Fri, Nov 23, 2018 at 10:39 AM Andy Lutomirski <[email protected]> wrote:
>
> What is memcpy_to_io even supposed to do? I’m guessing it’s defined as something like “copy this data to IO space using at most long-sized writes, all aligned, and writing each byte exactly once, in order.” That sounds... dubiously useful.

We've got hundreds of users of it, so it's fairly common..

> I could see a function that writes to aligned memory in specified-sized chunks.

We have that. It's called "__iowrite{32,64}_copy()". It has very few users.

Linus

2018-11-24 08:49:06

by Andy Lutomirski

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes



> On Nov 23, 2018, at 11:44 AM, Linus Torvalds <[email protected]> wrote:
>
>> On Fri, Nov 23, 2018 at 10:39 AM Andy Lutomirski <[email protected]> wrote:
>>
>> What is memcpy_to_io even supposed to do? I’m guessing it’s defined as something like “copy this data to IO space using at most long-sized writes, all aligned, and writing each byte exactly once, in order.” That sounds... dubiously useful.
>
> We've got hundreds of users of it, so it's fairly common..
>

I’m wondering if the “at most long-sizes” restriction matters, especially given that we’re apparently accessing some of the same bytes more than once. I would believe that trying to encourage 16-byte writes (with AVX, ugh) or 64-byte writes (with MOVDIR64B) would be safe and could meaningfully speed up some workloads.

>> I could see a function that writes to aligned memory in specified-sized chunks.
>
> We have that. It's called "__iowrite{32,64}_copy()". It has very few users.
>
> Linus

2018-11-24 09:03:00

by Jens Axboe

Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On 11/21/18 11:16 AM, Linus Torvalds wrote:
> On Wed, Nov 21, 2018 at 9:27 AM Linus Torvalds
> <[email protected]> wrote:
>>
>> It would be interesting to know exactly which copy it is that matters
>> so much... *inlining* the erms case might show that nicely in
>> profiles.
>
> Side note: the fact that Jens' patch (which I don't like in that form)
> allegedly shrunk the resulting kernel binary would seem to indicate
> that there are a *lot* of compile-time constant-sized memcpy calls that
> we are missing, and that fall back to copy_user_generic().

Other kind of side note... This also affects memset(), which does
rep stosb for any size memset if we have ERMS. I noticed this from
sg_init_table(), which does a memset of the table. For my kind of
testing, the entry size is small. The below, too, reduces memset()
overhead by 50% here for me.

diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index 9bc861c71e75..bad0fdb9ddcd 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -60,6 +60,8 @@ EXPORT_SYMBOL(__memset)
* rax original destination
*/
ENTRY(memset_erms)
+ cmpl $128,%edx
+ jb memset_orig
movq %rdi,%r9
movb %sil,%al
movq %rdx,%rcx

--
Jens Axboe


2018-11-26 10:02:43

by David Laight

Subject: RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Linus Torvalds
> Sent: 23 November 2018 16:36
>
> On Fri, Nov 23, 2018 at 2:12 AM David Laight <[email protected]> wrote:
> >
> > I've just patched my driver and redone the test on a 4.13 (ubuntu) kernel.
> > Calling memcpy_fromio(kernel_buffer, PCIe_address, length)
> > generates a lot of single-byte TLPs.
>
> I just tested it too - it turns out that the __inline_memcpy() code
> never triggers, and "memcpy_toio()" just generates a memcpy.
>
> So that code seems entirely dead.
>
> And, in fact, the codebase I looked at was the historical one, because
> I had been going back and looking at the history. The modern tree
> *does* have the "__inline_memcpy()" function I pointed at, but it's
> not actually hooked up to anything!
>
> This actually has been broken for _ages_. The breakage goes back to
> 2010, and commit 6175ddf06b61 ("x86: Clean up mem*io functions"), and
> it seems nobody really ever noticed - or thought that it was ok.

It probably was ok in 2010 - that predates ERMS copy.

> That commit claims that iomem has no special significance on x86, but
> that really really isn't true, exactly because the access size does
> matter.

I suspect that memcpy_to/fromio() should only be used for IO space
that has 'memory-like' semantics.
So it shouldn't really matter what size accesses are done.
OTOH the 'io' side is likely to be slow so you want to limit the
number of io cycles (reads in particular).
With memory-like semantics it is ok to read full words at both ends
of the buffer to avoid extra transfers.
Indeed, on PCIe, the misaligned transfer for the last 8 bytes is
probably optimal.

> And as mentioned, the generic memory copy routines are not at all
> appropriate, and that has nothing to do with ERMS. Our "do it by hand"
> memory copy routine does things like this:
>
> .Lless_16bytes:
> cmpl $8, %edx
> jb .Lless_8bytes
> /*
> * Move data from 8 bytes to 15 bytes.
> */
> movq 0*8(%rsi), %r8
> movq -1*8(%rsi, %rdx), %r9
> movq %r8, 0*8(%rdi)
> movq %r9, -1*8(%rdi, %rdx)
> retq
>
> and note how for an 8-byte copy, it will do *two* reads of the same 8
> bytes, and *two* writes of the same 8-byte destination. That's
> perfectly ok for regular memory, and it means that the code can handle
> an arbitrary 8-15 byte copy without any conditionals or loop counts,
> but it is *not* ok for iomem.

I'd say it is ok for memcpy_to/fromio() since that should only really
be used for targets with memory-like semantics - and could be documented
as such.
But doing the same transfers twice is definitely sub-optimal.
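
For illustration, the overlapping-ends trick in the asm quoted above
can be written in plain C roughly as below - a sketch only, not the
kernel's actual routine:

#include <stdint.h>
#include <string.h>

/* Copy 8..15 bytes with two possibly-overlapping 8-byte loads and
 * stores. For len == 8 both pairs touch exactly the same bytes -
 * harmless for RAM, but wrong for iomem, which is the point above. */
static void copy_8_to_15(void *dst, const void *src, size_t len)
{
	uint64_t head, tail;

	memcpy(&head, src, 8);
	memcpy(&tail, (const char *)src + len - 8, 8);
	memcpy(dst, &head, 8);
	memcpy((char *)dst + len - 8, &tail, 8);
}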

> Of course, in practice it all just happens to work in almost all
> situations (because a lot of iomem devices simply won't care), and
> manual access to iomem is basically extremely rare these days anyway,
> but it's definitely entirely and utterly broken.
>
> End result: we *used* to do this right. For the last eight years our
> "memcpy_{to,from}io()" has been entirely broken, and apparently even
> the people who noticed oddities like David, never reported it as
> breakage but instead just worked around it in drivers.
>
> Ho humm.
>
> Let me write a generic routine in lib/iomap_copy.c (which already does
> the "user specifies chunk size" cases), and hook it up for x86.
>
> David, are you using a bus analyzer or something to do your testing?
> I'll have a trial patch for you asap.

We've a TLP monitor built into our FPGA image, so we can look at the last
few TLPs (IIRC a 32kB buffer).

Testing patches is a bit harder.
The test system isn't one I build kernels on.

Is there a 'sensible' amd64 kernel config that contains most of the drivers
a modern system might need?
It is a PITA trying to build kernels that will load on all my test systems
(since I tend to move the disks between systems).
Rebuilding the ubuntu config just takes too long and generates a ramdisk
that doesn't fit in /boot.

David


2018-11-26 10:13:28

by David Laight

[permalink] [raw]
Subject: RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Andy Lutomirski
> Sent: 23 November 2018 19:11
> > On Nov 23, 2018, at 11:44 AM, Linus Torvalds <[email protected]> wrote:
> >
> >> On Fri, Nov 23, 2018 at 10:39 AM Andy Lutomirski <[email protected]> wrote:
> >>
> >> What is memcpy_to_io even supposed to do? I’m guessing it’s defined as
> >> something like “copy this data to IO space using at most long-sized writes,
> >> all aligned, and writing each byte exactly once, in order.”
> >> That sounds... dubiously useful.
> >
> > We've got hundreds of users of it, so it's fairly common..
>
> I’m wondering if the “at most long-sizes” restriction matters, especially
> given that we’re apparently accessing some of the same bytes more than once.
> I would believe that trying to encourage 16-byte writes (with AVX, ugh) or
> 64-byte writes (with MOVDIR64B) would be safe and could meaningfully speed
> up some workloads.

The real gains come from increasing the width of IO reads, not IO writes.
None of the x86 cpus I've got issue multiple concurrent PCIe reads
(the PCIe completion tag seems to match the core number).
PCIe writes are all 'posted' so there aren't big gaps between them.

> >> I could see a function that writes to aligned memory in specified-sized chunks.
> >
> > We have that. It's called "__iowrite{32,64}_copy()". It has very few users.

For x86 you want separate entry points: one for the 'rep movq' copy
and one using an explicit instruction loop (see the sketch below).
(Perhaps with guidance on the cutover length.)
In most places the driver will know whether the size is above or below
the cutover - which might be 256.
Certainly transfers below 64 bytes are 'short'.
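
The "instruction loop" flavour would be something along the lines of
the sketch below (names invented here for illustration; the existing
kernel helpers are __iowrite32_copy()/__iowrite64_copy()):

#include <linux/io.h>
#include <linux/types.h>

/* One explicit 32-bit MMIO write per iteration: fixed access width,
 * no rep string instruction involved. */
static void iowrite32_loop(u32 __iomem *dst, const u32 *src, size_t count)
{
	while (count--)
		writel(*src++, dst++);
}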

David


2018-11-26 10:27:36

by David Laight

[permalink] [raw]
Subject: RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Linus Torvalds
> Sent: 23 November 2018 16:36
...
> End result: we *used* to do this right. For the last eight years our
> "memcpy_{to,from}io()" has been entirely broken, and apparently even
> the people who noticed oddities like David, never reported it as
> breakage but instead just worked around it in drivers.

I've mentioned it several times...

Probably no one else noticed lots of single byte transfers while
testing a TLP monitor he was writing for an FPGA :-)
They are far too expensive to buy, and would never be connected
to the right system at the right time - so we (I) wrote one.

Unfortunately we don't really get to see what happens when the
link comes up (or rather doesn't come up). We only get the
LTSSM state transitions.

David


2019-01-05 02:43:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

Coming back to this old thread, because I've spent most of the day
resurrecting some of my old core x86 patches, and one of them was for
the issue David Laight complained about: horrible memcpy_toio()
performance.

Yes, I should have done this before the merge window instead of at the
end of it, but I didn't want to delay things for yet another release
just because it fell through some cracks.

Anyway, it would be lovely to hear whether memcpy_toio() now works
reasonably. I just picked our very old legacy function for this, so it
will do things in 32-bit chunks (even on x86-64), and I'm certainly
open to somebody doing something smarter, but considering that nobody
else seemed to show any interest in this at all, I just went
"whatever, good enough".

I tried to make it easy to improve on things if people want to.

The other ancient patch I resurrected was the old "use asm goto for
put_user" one, which I've had in a private branch for almost three
years now.

I've tried to test this (it turns out I had completely screwed up the
32-bit case for put_user, for example), but I only have 64-bit user
space, so for the i386 build the testing amounted to just building it
and looking superficially at the code generation in a couple of places.

More testing and comments appreciated.

Now I have no ancient patches in any branches, or any known pending
issue. Except for all the pull requests that are piling up because I
didn't do them today since I was spending time on my own patches.

Oh well. There's always tomorrow.

Linus

On Mon, Nov 26, 2018 at 2:26 AM David Laight <[email protected]> wrote:
>
> From: Linus Torvalds
> > Sent: 23 November 2018 16:36
> ...
> > End result: we *used* to do this right. For the last eight years our
> > "memcpy_{to,from}io()" has been entirely broken, and apparently even
> > the people who noticed oddities like David, never reported it as
> > breakage but instead just worked around it in drivers.
>
> I've mentioned it several times...
>
> Probably no one else noticed lots of single byte transfers while
> testing a TLP monitor he was writing for an FPGA :-)
> They are far too expensive to buy, and would never be connected
> to the right system at the right time - so we (I) wrote one.
>
> Unfortunately we don't really get to see what happens when the
> link comes up (or rather doesn't come up). We only get the
> LTSSM state transitions.
>
> David
>

2019-01-07 09:57:06

by David Laight

[permalink] [raw]
Subject: RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Linus Torvalds
> Sent: 05 January 2019 02:39
...
> Anyway, it would be lovely to hear whether memcpy_toio() now works
> reasonably. I just picked our very old legacy function for this, so it
> will do things in 32-bit chunks (even on x86-64), and I'm certainly
> open to somebody doing something smarter, but considering that nobody
> else seemed to show any interest in this at all, I just went
> "whatever, good enough".
>
> I tried to make it easy to improve on things if people want to.

I'll do some tests once the merge has had time to settle.

I needed to open-code one part because it wants to do copy_to_user()
from a PCIe address buffer (which has to work).

Using 64-bit chunks for reads is probably worthwhile on x86-64.
I might cook up a patch.
Actually, if the AVX registers are available without an fpu save,
using larger chunks would be worthwhile - especially for io reads.

David


2019-01-07 18:18:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Mon, Jan 7, 2019 at 1:55 AM David Laight <[email protected]> wrote:
>
> I needed to open-code one part because it wants to do copy_to_user()
> from a PCIe address buffer (which has to work).

It will never work for memcpy_fromio(). Any driver that thinks it will
copy from io space to user space absolutely *has* to do it by hand. No
questions, and no exceptions. Some loop like

for (..)
put_user(readl(iomem++), uaddr++);

because neither copy_to_user() nor memcpy_fromio() will *ever* handle
that correctly.

They might randomly happen to work on x86, but absolutely nowhere else.

Linus
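
For reference, a slightly fuller sketch of that hand-rolled loop, with
the error handling spelled out (the function name and the whole-words
assumption are illustrative, not from any existing driver):

#include <linux/io.h>
#include <linux/uaccess.h>

static int copy_mmio_to_user(u32 __user *uaddr, const u32 __iomem *iomem,
			     size_t words)
{
	size_t i;

	for (i = 0; i < words; i++) {
		u32 val = readl(iomem + i);	/* fixed-width MMIO read */

		if (put_user(val, uaddr + i))	/* handles the user-side fault */
			return -EFAULT;
	}
	return 0;
}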

2019-01-08 09:12:48

by David Laight

[permalink] [raw]
Subject: RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Linus Torvalds
> Sent: 07 January 2019 17:44
> On Mon, Jan 7, 2019 at 1:55 AM David Laight <[email protected]> wrote:
> >
> > I needed to open-code one part because it wants to do copy_to_user()
> > from a PCIe address buffer (which has to work).
>
> It will never work for memcpy_fromio(). Any driver that thinks it will
> copy from io space to user space absolutely *has* to do it by hand. No
> questions, and no exceptions. Some loop like
>
> for (..)
> put_user(readl(iomem++), uaddr++);
>
> because neither copy_to_user() nor memcpy_fromio() will *ever* handle
> that correctly.
>
> They might randomly happen to work on x86, but absolutely nowhere else.

Actually they tend to handle it on a lot of systems.
(But I don't do it.)
Probably most of those where vm_iomap_memory() (to map IO memory
space directly into user pages) works.

It might be 'interesting' to build an amd64 kernel where all the IO
memory addresses (eg returned by pci_iomap()) are offset by a
large constant, so that direct accesses all fault, and all the readl()
macros (etc) add it back in.

David


2019-01-08 18:03:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Tue, Jan 8, 2019 at 1:10 AM David Laight <[email protected]> wrote:
> >
> > It will never work for memcpy_fromio(). Any driver that thinks it will
> > copy from io space to user space absolutely *has* to do it by hand. No
> > questions, and no exceptions. Some loop like
> >
> > for (..)
> > put_user(readl(iomem++), uaddr++);
> >
> > because neither copy_to_user() nor memcpy_fromio() will *ever* handle
> > that correctly.
> >
> > They might randomly happen to work on x86, but absolutely nowhere else.
>
> Actually they tend to handle it on a lot of systems.

Not with memcpy_fromio(), at least.

That doesn't work even on x86. Try it. If the user space page is
swapped out (or not mapped), you'd get a kernel page fault.

And if you do "copy_to_user()" from a mmio region, you get what you
get. If somebody complains about it doing a byte-at-a-time copy, I'll
laugh in their face and tell them to fix their broken driver. It might
work on about half the architectures out there, but it's still
complete garbage, and it's not a bug in copy_to_user().

Linus