2023-06-28 09:43:30

by Borislav Petkov

Subject: Re: x86/csum: Remove unnecessary odd handling

+ Linus, who was poking at this yesterday.

+ lkml. Please always CC lkml when sending patches.

On Tue, Jun 27, 2023 at 09:06:57PM -0500, Noah Goldstein wrote:
> The special case for odd-aligned buffers is unnecessary and mostly
> just adds overhead. Aligned buffers are the expectation, and even for
> unaligned buffers, the only case that was helped is if the buffer was
> 1 byte from word alignment, which is ~1/7 of the cases. Overall it seems
> highly unlikely to be worth the extra branch.
>
> It was left in the previous perf improvement patch because I was
> erroneously comparing the exact output of `csum_partial(...)`, but
> really we only need `csum_fold(csum_partial(...))` to match, so it's
> safe to remove.
>
> All csum kunit tests pass.
>
> Signed-off-by: Noah Goldstein <[email protected]>
> ---
> arch/x86/lib/csum-partial_64.c | 37 ++--------------------------------
> 1 file changed, 2 insertions(+), 35 deletions(-)
>
> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> index cea25ca8b8cf..d06112e98893 100644
> --- a/arch/x86/lib/csum-partial_64.c
> +++ b/arch/x86/lib/csum-partial_64.c
> @@ -11,28 +11,6 @@
> #include <asm/checksum.h>
> #include <asm/word-at-a-time.h>
>
> -static inline unsigned short from32to16(unsigned a)
> -{
> - unsigned short b = a >> 16;
> - asm("addw %w2,%w0\n\t"
> - "adcw $0,%w0\n"
> - : "=r" (b)
> - : "0" (b), "r" (a));
> - return b;
> -}
> -
> -static inline __wsum csum_tail(u64 temp64, int odd)
> -{
> - unsigned int result;
> -
> - result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> - if (unlikely(odd)) {
> - result = from32to16(result);
> - result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
> - }
> - return (__force __wsum)result;
> -}
> -
> /*
> * Do a checksum on an arbitrary memory area.
> * Returns a 32bit checksum.
> @@ -47,17 +25,6 @@ static inline __wsum csum_tail(u64 temp64, int odd)
> __wsum csum_partial(const void *buff, int len, __wsum sum)
> {
> u64 temp64 = (__force u64)sum;
> - unsigned odd;
> -
> - odd = 1 & (unsigned long) buff;
> - if (unlikely(odd)) {
> - if (unlikely(len == 0))
> - return sum;
> - temp64 = ror32((__force u32)sum, 8);
> - temp64 += (*(unsigned char *)buff << 8);
> - len--;
> - buff++;
> - }
>
> /*
> * len == 40 is the hot case due to IPv6 headers, but annotating it likely()
> @@ -73,7 +40,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
> "adcq $0,%[res]"
> : [res] "+r"(temp64)
> : [src] "r"(buff), "m"(*(const char(*)[40])buff));
> - return csum_tail(temp64, odd);
> + return add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> }
> if (unlikely(len >= 64)) {
> /*
> @@ -143,7 +110,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
> : [res] "+r"(temp64)
> : [trail] "r"(trail));
> }
> - return csum_tail(temp64, odd);
> + return add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> }
> EXPORT_SYMBOL(csum_partial);
>
> --
> 2.34.1
>

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


2023-06-28 15:42:20

by Noah Goldstein

Subject: Re: x86/csum: Remove unnecessary odd handling

On Wed, Jun 28, 2023 at 4:12 AM Borislav Petkov <[email protected]> wrote:
>
> + Linus, who was poking at this yesterday.
>
Linus, if you're planning a patch and want to just integrate the code
here, I'm happy to drop this patch.

> + lkml. Please always CC lkml when sending patches.

Ah, will do so in the future.
>
> On Tue, Jun 27, 2023 at 09:06:57PM -0500, Noah Goldstein wrote:
> > The special case for odd-aligned buffers is unnecessary and mostly
> > just adds overhead. Aligned buffers are the expectation, and even for
> > unaligned buffers, the only case that was helped is if the buffer was
> > 1 byte from word alignment, which is ~1/7 of the cases. Overall it seems
> > highly unlikely to be worth the extra branch.
> >
> > It was left in the previous perf improvement patch because I was
> > erroneously comparing the exact output of `csum_partial(...)`, but
> > really we only need `csum_fold(csum_partial(...))` to match, so it's
> > safe to remove.
> >
> > All csum kunit tests pass.
> >
> > Signed-off-by: Noah Goldstein <[email protected]>
> > ---
> > arch/x86/lib/csum-partial_64.c | 37 ++--------------------------------
> > 1 file changed, 2 insertions(+), 35 deletions(-)
> >
> > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > index cea25ca8b8cf..d06112e98893 100644
> > --- a/arch/x86/lib/csum-partial_64.c
> > +++ b/arch/x86/lib/csum-partial_64.c
> > @@ -11,28 +11,6 @@
> > #include <asm/checksum.h>
> > #include <asm/word-at-a-time.h>
> >
> > -static inline unsigned short from32to16(unsigned a)
> > -{
> > - unsigned short b = a >> 16;
> > - asm("addw %w2,%w0\n\t"
> > - "adcw $0,%w0\n"
> > - : "=r" (b)
> > - : "0" (b), "r" (a));
> > - return b;
> > -}
> > -
> > -static inline __wsum csum_tail(u64 temp64, int odd)
> > -{
> > - unsigned int result;
> > -
> > - result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> > - if (unlikely(odd)) {
> > - result = from32to16(result);
> > - result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
> > - }
> > - return (__force __wsum)result;
> > -}
> > -
> > /*
> > * Do a checksum on an arbitrary memory area.
> > * Returns a 32bit checksum.
> > @@ -47,17 +25,6 @@ static inline __wsum csum_tail(u64 temp64, int odd)
> > __wsum csum_partial(const void *buff, int len, __wsum sum)
> > {
> > u64 temp64 = (__force u64)sum;
> > - unsigned odd;
> > -
> > - odd = 1 & (unsigned long) buff;
> > - if (unlikely(odd)) {
> > - if (unlikely(len == 0))
> > - return sum;
> > - temp64 = ror32((__force u32)sum, 8);
> > - temp64 += (*(unsigned char *)buff << 8);
> > - len--;
> > - buff++;
> > - }
> >
> > /*
> > * len == 40 is the hot case due to IPv6 headers, but annotating it likely()
> > @@ -73,7 +40,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
> > "adcq $0,%[res]"
> > : [res] "+r"(temp64)
> > : [src] "r"(buff), "m"(*(const char(*)[40])buff));
> > - return csum_tail(temp64, odd);
> > + return add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> > }
> > if (unlikely(len >= 64)) {
> > /*
> > @@ -143,7 +110,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
> > : [res] "+r"(temp64)
> > : [trail] "r"(trail));
> > }
> > - return csum_tail(temp64, odd);
> > + return add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> > }
> > EXPORT_SYMBOL(csum_partial);
> >
> > --
> > 2.34.1
> >
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

2023-06-28 17:55:51

by Linus Torvalds

Subject: Re: x86/csum: Remove unnecessary odd handling

On Wed, 28 Jun 2023 at 08:32, Noah Goldstein <[email protected]> wrote:
>
> Linus, if you're planning a patch and want to just integrate the code
> here, I'm happy to drop this patch.

No, that patch looks good to me.

In fact, I wasn't planning on integrating my patch at all. I literally
did it as a "I would have done it this way instead" exercise.

And while I am currently running with my patch in the kernel, I don't
even really know if it works and does the right thing. Maybe my use
doesn't even trigger csum_partial() at all. I did not do any testing
that "yes, I get the same checksum as a result".

So

(a) removing the pointless one-byte alignment looks good to me.

(b) I'd actually hope that somebody who _cares_ about this path and
has put some real work into it (as opposed to my "superficial
dabbling") would look at my patch and either go "yeah, not worth it",
or "looks good, I'll take it".

and I'm including that final patch of mine here again in case there
was any confusion with the earlier versions (there were at least two
known-broken versions I posted).

*If* somebody likes it, and verifies that the checksum result is
correct, feel free to do anything with that patch, including adding my
signed-off-by for it (or taking the credit all for yourself -
Mwahahahahaahaa!)

Linus


Attachments:
0001-Silly-csum-improvement.-Maybe.patch (3.34 kB)

2023-06-28 18:54:20

by Noah Goldstein

Subject: Re: x86/csum: Remove unnecessary odd handling

On Wed, Jun 28, 2023 at 12:44 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 28 Jun 2023 at 08:32, Noah Goldstein <[email protected]> wrote:
> >
> > Linus, if you're planning a patch and want to just integrate the code
> > here, I'm happy to drop this patch.
>
> No, that patch looks good to me.
>
> In fact, I wasn't planning on integrating my patch at all. I literally
> did it as a "I would have done it this way instead" exercise.
>
> And while I am currently running with my patch in the kernel, I don't
> even really know if it works and does the right thing. Maybe my use
> doesn't even trigger csum_partial() at all. I did not do any testing
> that "yes, I get the same checksum as a result".
>

There is a reasonably robust kunit test for csum_partial: lib/checksum_kunit.c,
so if you happened to run the kunit testsuite with your patch, it's
probably correct.
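
For what it's worth, the kind of independent oracle such a test can
compare against is just a plain byte-at-a-time ones' complement sum.
Rough sketch below (untested, names invented, and not what
lib/checksum_kunit.c actually contains); it pairs bytes little-endian,
the way the x86 code loads them:

#include <stdint.h>
#include <stddef.h>

/* Byte-at-a-time reference checksum.  Bytes are paired little-endian
 * (even index = low byte), matching how csum_partial() loads words on
 * x86, so a trailing odd byte lands in the low half of the last word. */
static uint16_t ref_csum16(const uint8_t *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum += (i & 1) ? (uint64_t)buf[i] << 8 : buf[i];

	/* Fold down to 16 bits with end-around carry. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}

The expectation in such a test would then be that this matches
csum_fold(csum_partial(buf, len, 0)) for arbitrary buffers and lengths -
again, just a sketch, not a claim about what the existing test does.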

> So
>
> (a) removing the pointless one-byte alignment looks good to me.
>
> (b) I'd actually hope that somebody who _cares_ about this path and
> has put some real work into it (as opposed to my "superficial
> dabbling") would look at my patch and either go "yeah, not worth it",
> or "looks good, I'll take it".
>
> and I'm including that final patch of mine here again in case there
> was any confusion with the earlier versions (there were at least two
> known-broken versions I posted).
>
> *If* somebody likes it, and verifies that the checksum result is
> correct, feel free to do anything with that patch, including adding my
> signed-off-by for it (or taking the credit all for yourself -
> Mwahahahahaahaa!)
>
> Linus

2023-06-28 20:11:55

by Linus Torvalds

Subject: Re: x86/csum: Remove unnecessary odd handling

On Wed, 28 Jun 2023 at 11:34, Noah Goldstein <[email protected]> wrote:
>
> There is a reasonably robust kunit test for csum_partial: lib/checksum_kunit.c,
> so if you happened to run the kunit testsuite with your patch, it's
> probably correct.

Testing? TESTING?

Why do you think I do open source? I like the programming part. I even
like the debugging part. But testing - that's what users are for. No?

Linus

2023-06-29 14:15:41

by David Laight

Subject: RE: x86/csum: Remove unnecessary odd handling

From: Borislav Petkov
> Sent: 28 June 2023 10:13
>
> + Linus, who was poking at this yesterday.
>
> + lkml. Please always CC lkml when sending patches.
>
> On Tue, Jun 27, 2023 at 09:06:57PM -0500, Noah Goldstein wrote:
> > The special case for odd-aligned buffers is unnecessary and mostly
> > just adds overhead. Aligned buffers are the expectation, and even for
> > unaligned buffers, the only case that was helped is if the buffer was
> > 1 byte from word alignment, which is ~1/7 of the cases. Overall it seems
> > highly unlikely to be worth the extra branch.
> >
> > It was left in the previous perf improvement patch because I was
> > erroneously comparing the exact output of `csum_partial(...)`, but
> > really we only need `csum_fold(csum_partial(...))` to match, so it's
> > safe to remove.

I'm sure I've suggested this before.
The 'odd' check was needed by an earlier implementation.

Misaligned buffers are (just about) measurably slower.
But it is pretty much noise, and the extra code in the
aligned case will cost more.

It is pretty much impossible to find out what the cpu is doing,
but if you do misaligned accesses to a PCIe target you can
(with suitable hardware) look at the generated TLP.

What that shows is misaligned transfers being done in 8-byte
chunks and being split into two TLPs if they cross a 64-byte
(probably cache line) boundary.

It is likely that the same happens for cached accesses.

Given that the cpu can do two memory reads each clock,
it isn't surprising that the checksum loop (which doesn't
even manage a read every clock) is slower by less than
one clock per cache line.

Someone might also want to use the 'arc' C version of csum_fold()
on pretty much every architecture [1].
It is:
return (~sum - ror32(sum, 16)) >> 16;
which is significantly better than the x86 asm (even on more recent
cpus that don't take 2 clocks for an 'adc').

[1] arm can do a bit better because of the barrel shifter.
sparc is slower because it has a carry flag but no rotate.
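
To make the comparison concrete, here is a stand-alone sketch of the
two C variants side by side (untested; the function names and the
little test harness are invented, only the one-line formula above
comes from the arc code):

#include <stdint.h>
#include <stdio.h>

static inline uint32_t ror32(uint32_t v, unsigned int n)
{
	return (v >> n) | (v << (32 - n));
}

/* Straightforward fold: add the two 16-bit halves with end-around
 * carry, then return the 16-bit complement. */
static uint16_t csum_fold_classic(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* The 'arc' style version quoted above: no carry flag needed. */
static uint16_t csum_fold_arc(uint32_t sum)
{
	return (~sum - ror32(sum, 16)) >> 16;
}

int main(void)
{
	/* Spot-check a few values, including the wrap-around cases. */
	uint32_t probes[] = { 0, 1, 0xffff, 0x10000, 0xffff0001,
			      0xfffffffe, 0xffffffff, 0x12345678 };
	unsigned int i;

	for (i = 0; i < sizeof(probes) / sizeof(probes[0]); i++)
		if (csum_fold_classic(probes[i]) != csum_fold_arc(probes[i]))
			printf("mismatch at %#x\n", probes[i]);
	return 0;
}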

David


2023-06-29 14:44:49

by David Laight

Subject: RE: x86/csum: Remove unnecessary odd handling

...
> > All csum kunit tests pass.

Last time I looked I couldn't see where generated IPv6
checksums get changed from 0x0000 (from ~csum_fold() using
adc) to 0xffff - which I think the protocol requires.

The trivial way to do this is to initialise the sum to 1
(instead of 0 or 0xffff) and then add 1 after the invert.
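
To spell that out, a tiny user-space sketch of the biased sum (names
invented and untested; csum16() below is just a plain end-around-carry
reference, not the kernel routine):

#include <stdint.h>
#include <stddef.h>

/* Ones' complement sum of 16-bit words, folded to 16 bits, starting
 * from 'init'. */
static uint16_t csum16(const uint16_t *p, size_t n, uint32_t init)
{
	uint64_t sum = init;

	while (n--)
		sum += *p++;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Usual transmit value: invert the sum, then special-case a 0x0000
 * result to 0xffff as UDP over IPv6 requires. */
static uint16_t csum_tx_branch(const uint16_t *p, size_t n)
{
	uint16_t c = ~csum16(p, n, 0);

	return c ? c : 0xffff;
}

/* The trick above: start the sum at 1, add 1 back after the invert.
 * The 0x0000 -> 0xffff mapping then falls out without a branch. */
static uint16_t csum_tx_biased(const uint16_t *p, size_t n)
{
	return (uint16_t)(~csum16(p, n, 1) + 1);
}

On a quick paper check the two agree for any input, the only change
from a plain invert being that a would-be 0x0000 comes out as 0xffff -
which is exactly the point, but do verify it before relying on it.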

It doesn't matter (much) for IPv4 because 0x0000 is 'no checksum'
rather than 'invalid'.

David
