2017-06-21 11:13:35

by Paolo Abeni

Subject: [PATCH] x86/uaccess: use unrolled string copy for short strings

The 'rep' prefix incurs a significant setup cost; as a result, for
short strings, copies done with unrolled loops are faster than even
the optimized 'rep'-based string copy.

This change updates copy_user_generic() to use the unrolled
version for short copies. The threshold length for short
strings - 64 bytes - has been selected empirically as the
largest value that still ensures a measurable gain.

A micro-benchmark of __copy_from_user() with different lengths shows
the following:

string len   vanilla   patched   delta
(bytes)      (ticks)   (ticks)   ticks(%)

0 58 26 32(55%)
1 49 29 20(40%)
2 49 31 18(36%)
3 49 32 17(34%)
4 50 34 16(32%)
5 49 35 14(28%)
6 49 36 13(26%)
7 49 38 11(22%)
8 50 31 19(38%)
9 51 33 18(35%)
10 52 36 16(30%)
11 52 37 15(28%)
12 52 38 14(26%)
13 52 40 12(23%)
14 52 41 11(21%)
15 52 42 10(19%)
16 51 34 17(33%)
17 51 35 16(31%)
18 52 37 15(28%)
19 51 38 13(25%)
20 52 39 13(25%)
21 52 40 12(23%)
22 51 42 9(17%)
23 51 46 5(9%)
24 52 35 17(32%)
25 52 37 15(28%)
26 52 38 14(26%)
27 52 39 13(25%)
28 52 40 12(23%)
29 53 42 11(20%)
30 52 43 9(17%)
31 52 44 8(15%)
32 51 36 15(29%)
33 51 38 13(25%)
34 51 39 12(23%)
35 51 41 10(19%)
36 52 41 11(21%)
37 52 43 9(17%)
38 51 44 7(13%)
39 52 46 6(11%)
40 51 37 14(27%)
41 50 38 12(24%)
42 50 39 11(22%)
43 50 40 10(20%)
44 50 42 8(16%)
45 50 43 7(14%)
46 50 43 7(14%)
47 50 45 5(10%)
48 50 37 13(26%)
49 49 38 11(22%)
50 50 40 10(20%)
51 50 42 8(16%)
52 50 42 8(16%)
53 49 46 3(6%)
54 50 46 4(8%)
55 49 48 1(2%)
56 50 39 11(22%)
57 50 40 10(20%)
58 49 42 7(14%)
59 50 42 8(16%)
60 50 46 4(8%)
61 50 47 3(6%)
62 50 48 2(4%)
63 50 48 2(4%)
64 51 38 13(25%)

Above 64 bytes the gain fades away.

Very similar values have been collected for __copy_to_user().
UDP receive performance under flood with small packets using recvfrom()
increases by ~5%.

Signed-off-by: Paolo Abeni <[email protected]>
---
arch/x86/include/asm/uaccess_64.h | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index c5504b9..16a8871 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -28,6 +28,9 @@ copy_user_generic(void *to, const void *from, unsigned len)
{
unsigned ret;

+ if (len <= 64)
+ return copy_user_generic_unrolled(to, from, len);
+
/*
* If CPU has ERMS feature, use copy_user_enhanced_fast_string.
* Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
--
2.9.4
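
For reference, the kind of micro-benchmark described above can be
approximated from userspace with a simple rdtsc-based timing loop. The
sketch below is illustrative only: it times memcpy() as a stand-in for
the in-kernel copy_user_generic() path, and the iteration count and
buffer sizes are assumptions rather than the harness actually used for
the numbers in the changelog.

/* Minimal sketch of an rdtsc-based per-length timing loop. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>

#define ITERS 1000000UL

static char src[64], dst[64];

int main(void)
{
	for (unsigned int len = 0; len <= 64; len++) {
		uint64_t start = __rdtsc();

		for (unsigned long i = 0; i < ITERS; i++) {
			memcpy(dst, src, len);
			/* keep the compiler from optimizing the copy away */
			asm volatile("" : : "r" (dst) : "memory");
		}

		printf("%2u bytes: %llu ticks\n", len,
		       (unsigned long long)((__rdtsc() - start) / ITERS));
	}
	return 0;
}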


2017-06-21 17:38:55

by Kees Cook

Subject: Re: [PATCH] x86/uaccess: use unrolled string copy for short strings

On Wed, Jun 21, 2017 at 4:09 AM, Paolo Abeni <[email protected]> wrote:
> The 'rep' prefix incurs a significant setup cost; as a result, for
> short strings, copies done with unrolled loops are faster than even
> the optimized 'rep'-based string copy.
>
> This change updates copy_user_generic() to use the unrolled
> version for short copies. The threshold length for short
> strings - 64 bytes - has been selected empirically as the
> largest value that still ensures a measurable gain.
>
> A micro-benchmark of __copy_from_user() with different lengths shows
> the following:
>
> [benchmark table snipped; see the original patch above]
>
> Above 64 bytes the gain fades away.
>
> Very similar values have been collected for __copy_to_user().
> UDP receive performance under flood with small packets using recvfrom()
> increases by ~5%.
>
> Signed-off-by: Paolo Abeni <[email protected]>

Since there are no regressions here, this seems sensible to me. :)

Reviewed-by: Kees Cook <[email protected]>

-Kees

> [patch snipped; see the original above]



--
Kees Cook
Pixel Security

2017-06-22 08:47:38

by Ingo Molnar

Subject: Re: [PATCH] x86/uaccess: use unrolled string copy for short strings


* Paolo Abeni <[email protected]> wrote:

> The 'rep' prefix incurs a significant setup cost; as a result, for
> short strings, copies done with unrolled loops are faster than even
> the optimized 'rep'-based string copy.
>
> This change updates copy_user_generic() to use the unrolled
> version for short copies. The threshold length for short
> strings - 64 bytes - has been selected empirically as the
> largest value that still ensures a measurable gain.
>
> A micro-benchmark of __copy_from_user() with different lengths shows
> the following:
>
> [benchmark table snipped; see the original patch above]
>
> Above 64 bytes the gain fades away.
>
> Very similar values have been collected for __copy_to_user().
> UDP receive performance under flood with small packets using recvfrom()
> increases by ~5%.

What CPU model(s) were used for the performance testing and was it performance
tested on several different types of CPUs?

Please add a comment here:

+ if (len <= 64)
+ return copy_user_generic_unrolled(to, from, len);
+

... because it's not obvious at all that this is a performance optimization, not a
correctness issue. Also explain that '64' is a number that we got from performance
measurements.

But in general I like the change - as long as it was measured on reasonably modern
x86 CPUs. I.e. it should not regress on modern Intel or AMD CPUs.

Thanks,

Ingo
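
For illustration, the commented check being asked for might look like
the sketch below; the exact wording is an assumption, not something
proposed in the thread:

	/*
	 * Purely a performance optimization, not a correctness requirement:
	 * the 'rep'-based copy variants have a relatively high setup cost,
	 * so short copies are faster with the unrolled code. The 64-byte
	 * threshold comes from performance measurements (see the changelog).
	 */
	if (len <= 64)
		return copy_user_generic_unrolled(to, from, len);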

2017-06-22 14:56:16

by Alan Cox

Subject: Re: [PATCH] x86/uaccess: use unrolled string copy for short strings

> > diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> > index c5504b9..16a8871 100644
> > --- a/arch/x86/include/asm/uaccess_64.h
> > +++ b/arch/x86/include/asm/uaccess_64.h
> > @@ -28,6 +28,9 @@ copy_user_generic(void *to, const void *from, unsigned len)
> > {
> > unsigned ret;
> >
> > + if (len <= 64)
> > + return copy_user_generic_unrolled(to, from, len);
> > +

Given this gets inlined, surely that should be bracketed with an ifdef
against optimizing for space.

Also, you give one set of benchmarks, but how do they look on different
processors - AMD vs Intel, Atom vs Core, Ryzen vs older AMD, etc.?

Alan
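
On the first point, such a guard could be sketched as below, assuming
CONFIG_CC_OPTIMIZE_FOR_SIZE is the relevant "optimizing for space"
switch:

#ifndef CONFIG_CC_OPTIMIZE_FOR_SIZE
	/* Performance-only shortcut; skip it when optimizing for size. */
	if (len <= 64)
		return copy_user_generic_unrolled(to, from, len);
#endif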

2017-06-22 17:02:54

by Paolo Abeni

Subject: Re: [PATCH] x86/uaccess: use unrolled string copy for short strings

On Thu, 2017-06-22 at 10:47 +0200, Ingo Molnar wrote:
> * Paolo Abeni <[email protected]> wrote:
>
> > The 'rep' prefix incurs a significant setup cost; as a result, for
> > short strings, copies done with unrolled loops are faster than even
> > the optimized 'rep'-based string copy.
> >
> > This change updates copy_user_generic() to use the unrolled
> > version for short copies. The threshold length for short
> > strings - 64 bytes - has been selected empirically as the
> > largest value that still ensures a measurable gain.
> >
> > A micro-benchmark of __copy_from_user() with different lengths shows
> > the following:
> >
> > [benchmark table snipped; see the original patch above]
> >
> > Above 64 bytes the gain fades away.
> >
> > Very similar values have been collected for __copy_to_user().
> > UDP receive performance under flood with small packets using recvfrom()
> > increases by ~5%.
>
> What CPU model(s) were used for the performance testing and was it performance
> tested on several different types of CPUs?
>
> Please add a comment here:
>
> + if (len <= 64)
> + return copy_user_generic_unrolled(to, from, len);
> +
>
> ... because it's not obvious at all that this is a performance optimization, not a
> correctness issue. Also explain that '64' is a number that we got from performance
> measurements.
>
> But in general I like the change - as long as it was measured on reasonably modern
> x86 CPUs. I.e. it should not regress on modern Intel or AMD CPUs.

Thank you for reviewing this.

I'll add a hopefully descriptive comment in v2.

The above figures are for an Intel Xeon E5-2690 v4.

I see similar data points with an i7-6500U CPU, while an i7-4810MQ
shows slightly better improvements.

I'm in the process of collecting more figures for AMD processors, which
I don't have at hand - it may take some time.

Thanks,

Paolo


2017-06-22 17:30:09

by Linus Torvalds

Subject: Re: [PATCH] x86/uaccess: use unrolled string copy for short strings

On Wed, Jun 21, 2017 at 4:09 AM, Paolo Abeni <[email protected]> wrote:
>
> + if (len <= 64)
> + return copy_user_generic_unrolled(to, from, len);
> +
> /*
> * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
> * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.

NAK. Please do *not* do this. It puts the check in completely the
wrong place for several reasons:

(a) it puts it in the inlined caller case (which could be ok for
constant sizes, but not in general)

(b) it uses the copy_user_generic_unrolled() function that will then
just test the size *AGAIN* against small cases.

so it's both bigger than necessary, and stupid.

So if you want to do this optimization, I'd argue that you should just
do it inside the copy_user_enhanced_fast_string() function itself, the
same way we already handle the really small case specially in
copy_user_generic_string().

And do *not* use the unrolled code, which isn't used for small copies
anyway - rewrite the "copy_user_generic_unrolled" function in that
same asm file to have the non-unrolled cases (label "17" and forward)
accessible, so that you don't bother re-testing the size.

Linus

2017-06-22 17:54:40

by Paolo Abeni

Subject: Re: [PATCH] x86/uaccess: use unrolled string copy for short strings

On Thu, 2017-06-22 at 10:30 -0700, Linus Torvalds wrote:
> So if you want to do this optimization, I'd argue that you should just
> do it inside the copy_user_enhanced_fast_string() function itself, the
> same way we already handle the really small case specially in
> copy_user_generic_string().
>
> And do *not* use the unrolled code, which isn't used for small copies
> anyway - rewrite the "copy_user_generic_unrolled" function in that
> same asm file to have the non-unrolled cases (label "17" and forward)
> accessible, so that you don't bother re-testing the size.

Thank you for the feedback.

I'm quite new to core x86 land; the 'rep movsb' cost popped up while
messing with the networking code. I'll try to dig into the asm.

Regards,

Paolo

2017-06-29 13:57:52

by Paolo Abeni

Subject: [PATCH] x86/uaccess: optimize copy_user_enhanced_fast_string for short string

According to the Intel datasheet, the 'rep movsb' instruction
exposes a significant setup cost - 50 ticks - which badly affects
short string copy operations.

This change tries to avoid that cost by calling the explicit
loop available in the unrolled code for strings shorter
than 64 bytes. That value has been selected empirically
as the largest one that still ensures a measurable gain.

Micro-benchmarks of the __copy_from_user() function with
lengths in the [0-63] range show this performance gain
(the shorter the string, the larger the gain):

- in the [55%-4%] range on Intel Xeon(R) CPU E5-2690 v4
- in the [72%-9%] range on Intel Core i7-4810MQ

Other tested CPUs - namely Intel Atom S1260 and AMD Opteron
8216 - show no differences, because they do not expose the
ERMS feature bit.

Signed-off-by: Paolo Abeni <[email protected]>
---
arch/x86/lib/copy_user_64.S | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index c595957..020f75c 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -37,7 +37,7 @@ ENTRY(copy_user_generic_unrolled)
movl %edx,%ecx
andl $63,%edx
shrl $6,%ecx
- jz 17f
+ jz .L_copy_short_string
1: movq (%rsi),%r8
2: movq 1*8(%rsi),%r9
3: movq 2*8(%rsi),%r10
@@ -58,7 +58,8 @@ ENTRY(copy_user_generic_unrolled)
leaq 64(%rdi),%rdi
decl %ecx
jnz 1b
-17: movl %edx,%ecx
+.L_copy_short_string:
+ movl %edx,%ecx
andl $7,%edx
shrl $3,%ecx
jz 20f
@@ -174,6 +175,8 @@ EXPORT_SYMBOL(copy_user_generic_string)
*/
ENTRY(copy_user_enhanced_fast_string)
ASM_STAC
+ cmpl $64,%edx
+ jb .L_copy_short_string /* less then 64 bytes, avoid the costly 'rep' */
movl %edx,%ecx
1: rep
movsb
--
2.9.4

2017-06-29 21:40:56

by Linus Torvalds

Subject: Re: [PATCH] x86/uaccess: optimize copy_user_enhanced_fast_string for short string

Looks ok to me, so Ack, fwiw.

I have long hoped that Intel would actually fix their microcode
overhead issue so that we could get rid of all this and just inline
"rep movsb", but that hasn't been happening, so this patch looks fine.

Side note: I suspect copy_user_generic_string() might as well use
the same logic too, rather than having a different cut-off (8 bytes)
into a worse implementation (rep movsb on architectures that don't
advertise the enhanced string feature).

Linus

On Thu, Jun 29, 2017 at 6:55 AM, Paolo Abeni <[email protected]> wrote:
> According to the Intel datasheet, the 'rep movsb' instruction
> exposes a significant setup cost - 50 ticks - which badly affects
> short string copy operations.

Subject: [tip:x86/asm] x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings

Commit-ID: 236222d39347e0e486010f10c1493e83dbbdfba8
Gitweb: http://git.kernel.org/tip/236222d39347e0e486010f10c1493e83dbbdfba8
Author: Paolo Abeni <[email protected]>
AuthorDate: Thu, 29 Jun 2017 15:55:58 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 30 Jun 2017 09:52:51 +0200

x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings

According to the Intel datasheet, the REP MOVSB instruction
exposes a pretty heavy setup cost (50 ticks), which hurts
short string copy operations.

This change tries to avoid this cost by calling the explicit
loop available in the unrolled code for strings shorter
than 64 bytes.

The 64 bytes cutoff value is arbitrary from the code logic
point of view - it has been selected based on measurements,
as the largest value that still ensures a measurable gain.

Micro-benchmarks of the __copy_from_user() function with
lengths in the [0-63] range show this performance gain
(the shorter the string, the larger the gain):

- in the [55%-4%] range on Intel Xeon(R) CPU E5-2690 v4
- in the [72%-9%] range on Intel Core i7-4810MQ

Other tested CPUs - namely Intel Atom S1260 and AMD Opteron
8216 - show no difference, because they do not expose the
ERMS feature bit.

Signed-off-by: Paolo Abeni <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Cc: Alan Cox <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Hannes Frederic Sowa <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/4533a1d101fd460f80e21329a34928fad521c1d4.1498744345.git.pabeni@redhat.com
[ Clarified the changelog. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/lib/copy_user_64.S | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index c595957..020f75c 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -37,7 +37,7 @@ ENTRY(copy_user_generic_unrolled)
movl %edx,%ecx
andl $63,%edx
shrl $6,%ecx
- jz 17f
+ jz .L_copy_short_string
1: movq (%rsi),%r8
2: movq 1*8(%rsi),%r9
3: movq 2*8(%rsi),%r10
@@ -58,7 +58,8 @@ ENTRY(copy_user_generic_unrolled)
leaq 64(%rdi),%rdi
decl %ecx
jnz 1b
-17: movl %edx,%ecx
+.L_copy_short_string:
+ movl %edx,%ecx
andl $7,%edx
shrl $3,%ecx
jz 20f
@@ -174,6 +175,8 @@ EXPORT_SYMBOL(copy_user_generic_string)
*/
ENTRY(copy_user_enhanced_fast_string)
ASM_STAC
+ cmpl $64,%edx
+ jb .L_copy_short_string /* less then 64 bytes, avoid the costly 'rep' */
movl %edx,%ecx
1: rep
movsb