2021-12-13 18:01:08

by David Laight

Subject: [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes.


Add in the trailing bytes first so that there is no need to worry
about the sum exceeding 64 bits.

Signed-off-by: David Laight <[email protected]>
---

This ought to be faster - because of all the removed 'adc $0'.
Guessing how fast x86 code will run is hard!
There are other ways of handling buffers that are shorter than 8 bytes,
but I'd rather hope they don't happen in any hot paths.

Note - I've not even compile tested it.
(But have tested an equivalent change before.)

arch/x86/lib/csum-partial_64.c | 55 ++++++++++++----------------------
1 file changed, 19 insertions(+), 36 deletions(-)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index abf819dd8525..fbcc073fc2b5 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -37,6 +37,24 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
 	u64 temp64 = (__force u64)sum;
 	unsigned result;

+	if (len & 7) {
+		if (unlikely(len < 8)) {
+			/* Avoid falling off the start of the buffer */
+			if (len & 4) {
+				temp64 += *(u32 *)buff;
+				buff += 4;
+			}
+			if (len & 2) {
+				temp64 += *(u16 *)buff;
+				buff += 2;
+			}
+			if (len & 1)
+				temp64 += *(u8 *)buff;
+			goto reduce_to32;
+		}
+		temp64 += *(u64 *)(buff + len - 8) << (8 - (len & 7)) * 8;
+	}
+
 	while (unlikely(len >= 64)) {
 		asm("addq 0*8(%[src]),%[res]\n\t"
 		    "adcq 1*8(%[src]),%[res]\n\t"
@@ -82,43 +100,8 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
: "memory");
 		buff += 8;
 	}
-	if (len & 7) {
-#ifdef CONFIG_DCACHE_WORD_ACCESS
-		unsigned int shift = (8 - (len & 7)) * 8;
-		unsigned long trail;
-
-		trail = (load_unaligned_zeropad(buff) << shift) >> shift;

-		asm("addq %[trail],%[res]\n\t"
-		    "adcq $0,%[res]"
-			: [res] "+r" (temp64)
-			: [trail] "r" (trail));
-#else
-		if (len & 4) {
-			asm("addq %[val],%[res]\n\t"
-			    "adcq $0,%[res]"
-				: [res] "+r" (temp64)
-				: [val] "r" ((u64)*(u32 *)buff)
-				: "memory");
-			buff += 4;
-		}
-		if (len & 2) {
-			asm("addq %[val],%[res]\n\t"
-			    "adcq $0,%[res]"
-				: [res] "+r" (temp64)
-				: [val] "r" ((u64)*(u16 *)buff)
-				: "memory");
-			buff += 2;
-		}
-		if (len & 1) {
-			asm("addq %[val],%[res]\n\t"
-			    "adcq $0,%[res]"
-				: [res] "+r" (temp64)
-				: [val] "r" ((u64)*(u8 *)buff)
-				: "memory");
-		}
-#endif
-	}
+reduce_to32:
 	result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
 	return (__force __wsum)result;
 }
--
2.17.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2021-12-13 18:40:27

by Alexander Duyck

Subject: RE: [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes.

> -----Original Message-----
> From: David Laight <[email protected]>
> Sent: Monday, December 13, 2021 10:01 AM
> To: 'Noah Goldstein' <[email protected]>; 'Eric Dumazet'
> <[email protected]>
> Cc: '[email protected]' <[email protected]>; '[email protected]'
> <[email protected]>; 'Borislav Petkov' <[email protected]>;
> '[email protected]' <[email protected]>; 'X86 ML'
> <[email protected]>; '[email protected]' <[email protected]>;
> '[email protected]' <[email protected]>; Alexander Duyck
> <[email protected]>; 'open list' <[email protected]>;
> 'netdev' <[email protected]>
> Subject: [PATCH] lib/x86: Optimise csum_partial of buffers that are not
> multiples of 8 bytes.
>
>
> Add in the trailing bytes first so that there is no need to worry about the sum
> exceeding 64 bits.
>
> Signed-off-by: David Laight <[email protected]>
> ---
>
> This ought to be faster - because of all the removed 'adc $0'.
> Guessing how fast x86 code will run is hard!
> There are other ways of handling buffers that are shorter than 8 bytes, but I'd
> rather hope they don't happen in any hot paths.
>
> Note - I've not even compile tested it.
> (But have tested an equivalent change before.)
>
> arch/x86/lib/csum-partial_64.c | 55 ++++++++++++----------------------
> 1 file changed, 19 insertions(+), 36 deletions(-)
>
> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> index abf819dd8525..fbcc073fc2b5 100644
> --- a/arch/x86/lib/csum-partial_64.c
> +++ b/arch/x86/lib/csum-partial_64.c
> @@ -37,6 +37,24 @@ __wsum csum_partial(const void *buff, int len,
> __wsum sum)
> u64 temp64 = (__force u64)sum;
> unsigned result;
>
> + if (len & 7) {
> + if (unlikely(len < 8)) {
> + /* Avoid falling off the start of the buffer */
> + if (len & 4) {
> + temp64 += *(u32 *)buff;
> + buff += 4;
> + }
> + if (len & 2) {
> + temp64 += *(u16 *)buff;
> + buff += 2;
> + }
> + if (len & 1)
> + temp64 += *(u8 *)buff;
> + goto reduce_to32;
> + }
> + temp64 += *(u64 *)(buff + len - 8) << (8 - (len & 7)) * 8;
> + }
> +

I don't think your shift is headed in the right direction. If your starting offset is "buff + len - 8", then your remaining bits should be in the upper bytes of the qword, not the lower bytes, shouldn't they? So I would think it should be ">>", not "<<".
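
A minimal user-space sketch of this point, assuming a little-endian machine and a buffer with len >= 8 and (len & 7) != 0. The helper names (load64, trail_from_end, trail_zero_padded, check_trail) are hypothetical and not part of the patch. It compares the ">>" form against a zero-padded copy of just the trailing bytes:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: unaligned 64-bit load done with memcpy. */
static uint64_t load64(const unsigned char *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/*
 * Trailing (len & 7) bytes taken from the last 8 bytes of the buffer.
 * On little-endian the wanted bytes sit in the upper part of the qword,
 * so the already-summed low bytes are discarded with a right shift.
 */
static uint64_t trail_from_end(const unsigned char *buff, int len)
{
	unsigned int shift = (8 - (len & 7)) * 8;

	return load64(buff + len - 8) >> shift;
}

/* Reference: copy only the trailing bytes into a zeroed qword. */
static uint64_t trail_zero_padded(const unsigned char *buff, int len)
{
	uint64_t v = 0;

	memcpy(&v, buff + (len & ~7), len & 7);
	return v;
}

/* Both forms should agree for any len >= 8 that is not a multiple of 8. */
static void check_trail(const unsigned char *buff, int len)
{
	assert(trail_from_end(buff, len) == trail_zero_padded(buff, len));
}
```

With "<<" in place of ">>", the value would instead keep the low bytes of the qword, which the 8-byte loop has already added.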

> while (unlikely(len >= 64)) {
> asm("addq 0*8(%[src]),%[res]\n\t"
> "adcq 1*8(%[src]),%[res]\n\t"
> @@ -82,43 +100,8 @@ __wsum csum_partial(const void *buff, int len,
> __wsum sum)
> : "memory");
> buff += 8;
> }
> - if (len & 7) {
> -#ifdef CONFIG_DCACHE_WORD_ACCESS
> - unsigned int shift = (8 - (len & 7)) * 8;
> - unsigned long trail;
> -
> - trail = (load_unaligned_zeropad(buff) << shift) >> shift;

Your code above should be equivalent to the load_unaligned_zeropad() << shift, so the shift you are performing above is equivalent to the later one.

>
> - asm("addq %[trail],%[res]\n\t"
> - "adcq $0,%[res]"
> - : [res] "+r" (temp64)
> - : [trail] "r" (trail));
> -#else
> - if (len & 4) {
> - asm("addq %[val],%[res]\n\t"
> - "adcq $0,%[res]"
> - : [res] "+r" (temp64)
> - : [val] "r" ((u64)*(u32 *)buff)
> - : "memory");
> - buff += 4;
> - }
> - if (len & 2) {
> - asm("addq %[val],%[res]\n\t"
> - "adcq $0,%[res]"
> - : [res] "+r" (temp64)
> - : [val] "r" ((u64)*(u16 *)buff)
> - : "memory");
> - buff += 2;
> - }
> - if (len & 1) {
> - asm("addq %[val],%[res]\n\t"
> - "adcq $0,%[res]"
> - : [res] "+r" (temp64)
> - : [val] "r" ((u64)*(u8 *)buff)
> - : "memory");
> - }
> -#endif
> - }
> +reduce_to32:
> result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> return (__force __wsum)result;
> }
> --
> 2.17.1

2021-12-13 18:45:29

by Eric Dumazet

Subject: Re: [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes.

On Mon, Dec 13, 2021 at 10:00 AM David Laight <[email protected]> wrote:
>
>
> Add in the trailing bytes first so that there is no need to worry
> about the sum exceeding 64 bits.
>
> Signed-off-by: David Laight <[email protected]>
> ---
>
> This ought to be faster - because of all the removed 'adc $0'.
> Guessing how fast x86 code will run is hard!
> There are other ways of handling buffers that are shorter than 8 bytes,
> but I'd rather hope they don't happen in any hot paths.
>
> Note - I've not even compile tested it.
> (But have tested an equivalent change before.)
>
> arch/x86/lib/csum-partial_64.c | 55 ++++++++++++----------------------
> 1 file changed, 19 insertions(+), 36 deletions(-)
>
> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> index abf819dd8525..fbcc073fc2b5 100644
> --- a/arch/x86/lib/csum-partial_64.c
> +++ b/arch/x86/lib/csum-partial_64.c
> @@ -37,6 +37,24 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
> u64 temp64 = (__force u64)sum;
> unsigned result;
>
> + if (len & 7) {
> + if (unlikely(len < 8)) {
> + /* Avoid falling off the start of the buffer */
> + if (len & 4) {
> + temp64 += *(u32 *)buff;
> + buff += 4;
> + }
> + if (len & 2) {
> + temp64 += *(u16 *)buff;
> + buff += 2;
> + }
> + if (len & 1)
> + temp64 += *(u8 *)buff;
> + goto reduce_to32;
> + }
> + temp64 += *(u64 *)(buff + len - 8) << (8 - (len & 7)) * 8;

This is reading far away (end of buffer).

Maybe instead read the first bytes and adjust @buff, to allow for
better hardware prefetching ?



> + }
> +
> while (unlikely(len >= 64)) {
> asm("addq 0*8(%[src]),%[res]\n\t"
> "adcq 1*8(%[src]),%[res]\n\t"
> @@ -82,43 +100,8 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
> : "memory");
> buff += 8;
> }
> - if (len & 7) {
> -#ifdef CONFIG_DCACHE_WORD_ACCESS
> - unsigned int shift = (8 - (len & 7)) * 8;
> - unsigned long trail;
> -
> - trail = (load_unaligned_zeropad(buff) << shift) >> shift;
>
> - asm("addq %[trail],%[res]\n\t"
> - "adcq $0,%[res]"
> - : [res] "+r" (temp64)
> - : [trail] "r" (trail));
> -#else
> - if (len & 4) {
> - asm("addq %[val],%[res]\n\t"
> - "adcq $0,%[res]"
> - : [res] "+r" (temp64)
> - : [val] "r" ((u64)*(u32 *)buff)
> - : "memory");
> - buff += 4;
> - }
> - if (len & 2) {
> - asm("addq %[val],%[res]\n\t"
> - "adcq $0,%[res]"
> - : [res] "+r" (temp64)
> - : [val] "r" ((u64)*(u16 *)buff)
> - : "memory");
> - buff += 2;
> - }
> - if (len & 1) {
> - asm("addq %[val],%[res]\n\t"
> - "adcq $0,%[res]"
> - : [res] "+r" (temp64)
> - : [val] "r" ((u64)*(u8 *)buff)
> - : "memory");
> - }
> -#endif
> - }
> +reduce_to32:
> result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> return (__force __wsum)result;
> }
> --
> 2.17.1

2021-12-13 19:23:38

by Alexander Duyck

Subject: RE: [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes.



> -----Original Message-----
> From: Eric Dumazet <[email protected]>
> Sent: Monday, December 13, 2021 10:45 AM
> To: David Laight <[email protected]>
> Cc: Noah Goldstein <[email protected]>; [email protected];
> [email protected]; Borislav Petkov <[email protected]>;
> [email protected]; X86 ML <[email protected]>; [email protected];
> [email protected]; Alexander Duyck <[email protected]>; open
> list <[email protected]>; netdev <[email protected]>
> Subject: Re: [PATCH] lib/x86: Optimise csum_partial of buffers that are not
> multiples of 8 bytes.
>
> On Mon, Dec 13, 2021 at 10:00 AM David Laight <[email protected]>
> wrote:
> >
> >
> > Add in the trailing bytes first so that there is no need to worry
> > about the sum exceeding 64 bits.
> >
> > Signed-off-by: David Laight <[email protected]>
> > ---
> >
> > This ought to be faster - because of all the removed 'adc $0'.
> > Guessing how fast x86 code will run is hard!
> > There are other ways of handling buffers that are shorter than 8 bytes,
> > but I'd rather hope they don't happen in any hot paths.
> >
> > Note - I've not even compile tested it.
> > (But have tested an equivalent change before.)
> >
> > arch/x86/lib/csum-partial_64.c | 55
> > ++++++++++++----------------------
> > 1 file changed, 19 insertions(+), 36 deletions(-)
> >
> > diff --git a/arch/x86/lib/csum-partial_64.c
> > b/arch/x86/lib/csum-partial_64.c index abf819dd8525..fbcc073fc2b5
> > 100644
> > --- a/arch/x86/lib/csum-partial_64.c
> > +++ b/arch/x86/lib/csum-partial_64.c
> > @@ -37,6 +37,24 @@ __wsum csum_partial(const void *buff, int len,
> __wsum sum)
> > u64 temp64 = (__force u64)sum;
> > unsigned result;
> >
> > + if (len & 7) {
> > + if (unlikely(len < 8)) {
> > + /* Avoid falling off the start of the buffer */
> > + if (len & 4) {
> > + temp64 += *(u32 *)buff;
> > + buff += 4;
> > + }
> > + if (len & 2) {
> > + temp64 += *(u16 *)buff;
> > + buff += 2;
> > + }
> > + if (len & 1)
> > + temp64 += *(u8 *)buff;
> > + goto reduce_to32;
> > + }
> > + temp64 += *(u64 *)(buff + len - 8) << (8 - (len & 7))
> > + * 8;
>
> This is reading far away (end of buffer).
>
> Maybe instead read the first bytes and adjust @buff, to allow for better
> hardware prefetching ?

That would cause the wsum to be aligned to the length instead of the buff, wouldn't it? So we would need an extra rotation at the end to realign odd-length sections, wouldn't we?

Since our only concern here would be large buffers, would it maybe make sense to just run the loop in the call in reverse if we make this change? That way we would still be accessing the same cache line once we start the loop. For smaller buffers I would imagine the overhead should be minimal, since we would likely still have the end of the buffer in cache; it would only be something like 40B over anyway.
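
The rotation concern can be sketched in user space with a plain 16-bit ones' complement sum. The function names below (fold16, sum16, swab16, sum_head_first) are hypothetical, and the code is only an illustration of the alignment property, not the kernel's csum_partial(). If an odd number of leading bytes is summed first, the sum of the remainder comes out byte-swapped and has to be rotated by 8 bits before it is combined:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator to 16 bits with end-around carry. */
static uint16_t fold16(uint32_t s)
{
	s = (s & 0xffff) + (s >> 16);
	s = (s & 0xffff) + (s >> 16);
	return (uint16_t)s;
}

/*
 * 16-bit ones' complement sum of a byte buffer, little-endian word layout:
 * even offsets land in the low byte, odd offsets in the high byte.
 * Intended for short buffers only (the accumulator is 32 bits).
 */
static uint16_t sum16(const unsigned char *p, size_t len)
{
	uint32_t s = 0;

	for (size_t i = 0; i < len; i++)
		s += (i & 1) ? (uint32_t)p[i] << 8 : p[i];
	return fold16(s);
}

/* Rotate a 16-bit value by 8 bits (swap its two bytes). */
static uint16_t swab16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/*
 * Sum 'head' leading bytes first, then the rest as if it started at
 * offset 0. When 'head' is odd the second sum is misaligned by one
 * byte, so it is rotated before the two parts are added.
 */
static uint16_t sum_head_first(const unsigned char *p, size_t len, size_t head)
{
	uint16_t h = sum16(p, head);
	uint16_t t = sum16(p + head, len - head);

	if (head & 1)
		t = swab16(t);
	return fold16((uint32_t)h + t);
}
```

For an even 'head' no rotation is needed, which is why only odd-length leading sections would need the extra step.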

2021-12-13 22:52:23

by David Laight

Subject: RE: [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes.

From: Alexander Duyck <[email protected]>
> Sent: 13 December 2021 18:40
...
> > Add in the trailing bytes first so that there is no need to worry about the sum
> > exceeding 64 bits.
> >
> > Signed-off-by: David Laight <[email protected]>
> > ---
> >
> > This ought to be faster - because of all the removed 'adc $0'.
> > Guessing how fast x86 code will run is hard!
> > There are other ways of handling buffers that are shorter than 8 bytes, but I'd
> > rather hope they don't happen in any hot paths.
> >
> > Note - I've not even compile tested it.
> > (But have tested an equivalent change before.)
> >
> > arch/x86/lib/csum-partial_64.c | 55 ++++++++++++----------------------
> > 1 file changed, 19 insertions(+), 36 deletions(-)
> >
> > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > index abf819dd8525..fbcc073fc2b5 100644
> > --- a/arch/x86/lib/csum-partial_64.c
> > +++ b/arch/x86/lib/csum-partial_64.c
> > @@ -37,6 +37,24 @@ __wsum csum_partial(const void *buff, int len,
> > __wsum sum)
> > u64 temp64 = (__force u64)sum;
> > unsigned result;
> >
> > + if (len & 7) {
> > + if (unlikely(len < 8)) {
> > + /* Avoid falling off the start of the buffer */
> > + if (len & 4) {
> > + temp64 += *(u32 *)buff;
> > + buff += 4;
> > + }
> > + if (len & 2) {
> > + temp64 += *(u16 *)buff;
> > + buff += 2;
> > + }
> > + if (len & 1)
> > + temp64 += *(u8 *)buff;
> > + goto reduce_to32;
> > + }
> > + temp64 += *(u64 *)(buff + len - 8) << (8 - (len & 7)) * 8;
> > + }
> > +
>
> I don't think your shift is headed in the right direction. If your starting offset is "buff + len - 8"
> then your remaining bits should be in the upper bytes of the qword, not the lower bytes shouldn't
> they? So I would think it should be ">>" not "<<".

Brain-fart :-)
It needs to discard the low bytes - so >> is indeed right.
I did say I hadn't tested it.

Cache line wise I'm not sure whether it matters.
If the data is in the cache it doesn't matter.
If the data isn't in the cache then the only real problem is if the
line gets evicted - only likely for 4k-ish+ buffers.
I'd guess the largest checksum is under 1500 bytes - hardware doing
TSO will be doing hardware checksums. So evictions are unlikely.

Plausibly the *(buff + len - 8) read could be done after the while() loop.
That would need an adc and a saved copy of the length (or a read that would trap)
but would only be loading the 'next' cache line.

So you'd end up with something like:
	while (len >= 64) {
		...
	}
	if (len & 7)
		trail = *(u64 *)(buff + len - 8) >> (8 - (len & 7)) * 8;
	if (len & 32)
		...
	if (len & 16)
		...
	if (len & 8)
		...
	temp64 += trail
	adc $0, temp64
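
A portable user-space sketch of that shape, assuming a little-endian machine. The names (load64, add64_carry, fold64, csum_sketch, csum_ref) are hypothetical, the 64/32/16/8-byte steps are collapsed into a single 8-byte loop, and "addq; adcq $0" is replaced by an add followed by an explicit end-around carry. It is checked against a bytewise reference rather than against the kernel code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: unaligned 64-bit load done with memcpy. */
static uint64_t load64(const unsigned char *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* 64-bit add with end-around carry, standing in for "addq; adcq $0". */
static uint64_t add64_carry(uint64_t a, uint64_t b)
{
	a += b;
	return a + (a < b);
}

/* Fold 64 -> 32 -> 16 bits, adding the carries back in at each step. */
static uint16_t fold64(uint64_t s)
{
	s = (s & 0xffffffff) + (s >> 32);
	s = (s & 0xffffffff) + (s >> 32);
	s = (s & 0xffff) + (s >> 16);
	s = (s & 0xffff) + (s >> 16);
	return (uint16_t)s;
}

/* Whole qwords first, then the trailing bytes read from the buffer end. */
static uint16_t csum_sketch(const unsigned char *buff, int len)
{
	uint64_t sum = 0, trail = 0;
	const unsigned char *p = buff;
	int n = len;

	if (len < 8) {
		/* Short buffer: a zero-padded copy avoids reading before buff. */
		memcpy(&trail, buff, (size_t)len);
		return fold64(trail);
	}
	for (; n >= 8; n -= 8, p += 8)
		sum = add64_carry(sum, load64(p));
	if (len & 7)
		trail = load64(buff + len - 8) >> ((8 - (len & 7)) * 8);
	return fold64(add64_carry(sum, trail));
}

/* Bytewise reference: even offsets are low bytes, odd offsets high bytes. */
static uint16_t csum_ref(const unsigned char *buff, int len)
{
	uint32_t s = 0;

	for (int i = 0; i < len; i++)
		s += (i & 1) ? (uint32_t)buff[i] << 8 : buff[i];
	s = (s & 0xffff) + (s >> 16);
	s = (s & 0xffff) + (s >> 16);
	return (uint16_t)s;
}
```

Because the trailing value is added last here, the final add needs the carry; adding it before the loop, as in the patch, is what lets a plain add suffice.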

David


2021-12-14 12:36:14

by David Laight

Subject: RE: [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes.

From: David Laight <[email protected]>
> Sent: 13 December 2021 18:01
>
> Add in the trailing bytes first so that there is no need to worry
> about the sum exceeding 64 bits.

This is an alternate version that (mostly) compiles to reasonable code.
I've also booted a kernel with it - networking still works!

https://godbolt.org/z/K6vY31Gqs

I changed the while (len >= 64) loop into an
if (len >= 64) do {...} while (len >= 64) one.
But gcc makes a pig's breakfast of compiling it: it optimises
it into while (ptr < lim) but adds a lot of code.
So I've done that by hand.
Then it still makes a meal of it because it refuses to take
'buff' from the final loop iteration.
An assignment to the limit helps.

Then there is the calculation of (8 - (len & 7)) * 8.
gcc prior to 9.2 just negates (len & 7) then uses leal 56(,%rsi,8),%rcx.
But later versions fail to notice.
Even given (64 + 8 * -(len & 7)) clang fails to use leal.

I'm not even sure the code clang generates is right:
(%rsi is (len & 7))
	movq	-8(%rsi,%rax), %rdx
	leal	(,%rsi,8), %ecx
	andb	$56, %cl
	negb	%cl
	shrq	%cl, %rdx

The 'negb' is on the wrong side of the 'andb'.
It might be ok if it is assuming the cpu ignores the high 2 bits of %cl.
But that is a horrid assumption to be making.
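
A small sketch of the arithmetic behind that assumption. On x86-64 a shift of a 64-bit operand by %cl uses only the low 6 bits of the count, so the question is whether the negated byte, taken modulo 64, equals the intended (8 - (len & 7)) * 8. The function names (cl_after_negb, effective_count) are hypothetical and simply model the three instructions that build %cl:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Value left in %cl by the sequence above for r = len & 7:
 *   leal (,%rsi,8), %ecx   ->  cl = 8 * r
 *   andb $56, %cl          ->  cl &= 56
 *   negb %cl               ->  cl = -cl (mod 256)
 */
static uint8_t cl_after_negb(unsigned int r)
{
	uint8_t cl = (uint8_t)(r * 8);

	cl &= 56;
	return (uint8_t)(0u - cl);
}

/* Count a 64-bit shrq actually applies: only the low 6 bits of %cl. */
static unsigned int effective_count(uint8_t cl)
{
	return cl & 63;
}
```

For r in 1..7 the masked count works out to (8 - r) * 8, so the sequence gives the intended shift as long as the count masking is relied upon; r == 0 does not reach this code because of the if (len & 7) test.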

David
