2022-11-22 13:25:13

by David Laight

[permalink] [raw]
Subject: Optimising csum_fold()

There are currently 20 copies of csum_fold(), some in C some in assembler.
The default C version (in asm-generic/checksum.h) is pretty horrid.
Some of the asm versions (including x86 and x86-64) aren't much better.

There are 3 pretty good C versions:
1: (~sum - rol32(sum, 16)) >> 16
2: ~(sum + rol32(sum, 16)) >> 16
3: (u16)~((sum + rol32(sum, 16)) >> 16)
All three are (usually) 4 arithmetic instructions.

The first two have the advantage that the high bits are zero.
Relevant when the value is being checked rather than set.

The first one can generate better instruction scheduling (the rotate
and invert can be executed in the same clock).

The 3rd one saves an instruction on arm, but may need masking.
(I've not compiled an arm kernel to see how often that happens.)

The only architectures where (I think) the current asm code is better
than the C above are sparc and sparc64.
Sparc doesn't have a rotate instruction, but does have a carry flag.
This makes the current asm version one instruction shorter.

For architectures like mips and risc-v which have neither rotate
instructions nor carry flags the C is as good as the current asm.
The rotate is 3 instructions - the same as the extra cmp+add.

Changing everything to use [1] would improve quite a few architectures
while only adding 1 clock to some paths in arm/arm64 and sparc.

Unfortunately it is all currently a mess.
Most architectures don't include asm-generic/checksum.h at all.

Thoughts?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2022-11-22 17:18:37

by Willy Tarreau

[permalink] [raw]
Subject: Re: Optimising csum_fold()

On Tue, Nov 22, 2022 at 01:08:23PM +0000, David Laight wrote:
> There are currently 20 copies of csum_fold(), some in C some in assembler.
> The default C version (in asm-generic/checksum.h) is pretty horrid.
> Some of the asm versions (including x86 and x86-64) aren't much better.
>
> There are 3 pretty good C versions:
> 1: (~sum - rol32(sum, 16)) >> 16
> 2: ~(sum + rol32(sum, 16)) >> 16
> 3: (u16)~((sum + rol32(sum, 16)) >> 16)
> All three are (usually) 4 arithmetic instructions.
>
> The first two have the advantage that the high bits are zero.
> Relevant when the value is being checked rather than set.
>
> The first one can generate better instruction scheduling (the rotate
> and invert can be executed in the same clock).
>
> The 3rd one saves an instruction on arm, but may need masking.
> (I've not compiled an arm kernel to see how often that happens.)
>
> The only architectures where (I think) the current asm code is better
> than the C above are sparc and sparc64.
> Sparc doesn't have a rotate instruction, but does have a carry flag.
> This makes the current asm version one instruction shorter.
>
> For architectures like mips and risc-v which have neither rotate
> instructions nor carry flags the C is as good as the current asm.
> The rotate is 3 instructions - the same as the extra cmp+add.
>
> Changing everything to use [1] would improve quite a few architectures
> while only adding 1 clock to some paths in arm/arm64 and sparc.
>
> Unfortunately it is all currently a mess.
> Most architectures don't include asm-generic/checksum.h at all.
>
> Thoughts?

Then why not just have one version per arch, the most efficient one,
and use it everywhere ? The simple fact that we're discussing the
tradeoffs means that if we don't want to compromise performance here
(which I assume to be the case), then it needs to be per-arch and
that's all. At least that's the way I understand it.

Regards,
Willy

2022-11-22 17:57:42

by David Laight

[permalink] [raw]
Subject: RE: Optimising csum_fold()

From: Willy Tarreau <[email protected]>
> Sent: 22 November 2022 16:25
>
> On Tue, Nov 22, 2022 at 01:08:23PM +0000, David Laight wrote:
> > There are currently 20 copies of csum_fold(), some in C some in assembler.
> > The default C version (in asm-generic/checksum.h) is pretty horrid.
> > Some of the asm versions (including x86 and x86-64) aren't much better.
> >
> > There are 3 pretty good C versions:
> > 1: (~sum - rol32(sum, 16)) >> 16
> > 2: ~(sum + rol32(sum, 16)) >> 16
> > 3: (u16)~((sum + rol32(sum, 16)) >> 16)
> > All three are (usually) 4 arithmetic instructions.
> >
> > The first two have the advantage that the high bits are zero.
> > Relevant when the value is being checked rather than set.
> >
> > The first one can generate better instruction scheduling (the rotate
> > and invert can be executed in the same clock).
> >
> > The 3rd one saves an instruction on arm, but may need masking.
> > (I've not compiled an arm kernel to see how often that happens.)
> >
> > The only architectures where (I think) the current asm code is better
> > than the C above are sparc and sparc64.
> > Sparc doesn't have a rotate instruction, but does have a carry flag.
> > This makes the current asm version one instruction shorter.
> >
> > For architectures like mips and risc-v which have neither rotate
> > instructions nor carry flags the C is as good as the current asm.
> > The rotate is 3 instructions - the same as the extra cmp+add.
> >
> > Changing everything to use [1] would improve quite a few architectures
> > while only adding 1 clock to some paths in arm/arm64 and sparc.
> >
> > Unfortunately it is all currently a mess.
> > Most architectures don't include asm-generic/checksum.h at all.
> >
> > Thoughts?
>
> Then why not just have one version per arch, the most efficient one,
> and use it everywhere ? The simple fact that we're discussing the
> tradeoffs means that if we don't want to compromise performance here
> (which I assume to be the case), then it needs to be per-arch and
> that's all. At least that's the way I understand it.

At the moment there are a lot of arch-specific ones that are
definitely sub-optimal.

I started doing some patches, my x86-64 kernel in about 4k
smaller with [1].
I was going to post the patches to asm-generic an x86.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2022-11-22 18:14:02

by Willy Tarreau

[permalink] [raw]
Subject: Re: Optimising csum_fold()

On Tue, Nov 22, 2022 at 04:55:27PM +0000, David Laight wrote:
> From: Willy Tarreau <[email protected]>
> > Sent: 22 November 2022 16:25
> >
> > On Tue, Nov 22, 2022 at 01:08:23PM +0000, David Laight wrote:
> > > There are currently 20 copies of csum_fold(), some in C some in assembler.
> > > The default C version (in asm-generic/checksum.h) is pretty horrid.
> > > Some of the asm versions (including x86 and x86-64) aren't much better.
> > >
> > > There are 3 pretty good C versions:
> > > 1: (~sum - rol32(sum, 16)) >> 16
> > > 2: ~(sum + rol32(sum, 16)) >> 16
> > > 3: (u16)~((sum + rol32(sum, 16)) >> 16)
> > > All three are (usually) 4 arithmetic instructions.
> > >
> > > The first two have the advantage that the high bits are zero.
> > > Relevant when the value is being checked rather than set.
> > >
> > > The first one can generate better instruction scheduling (the rotate
> > > and invert can be executed in the same clock).
> > >
> > > The 3rd one saves an instruction on arm, but may need masking.
> > > (I've not compiled an arm kernel to see how often that happens.)
> > >
> > > The only architectures where (I think) the current asm code is better
> > > than the C above are sparc and sparc64.
> > > Sparc doesn't have a rotate instruction, but does have a carry flag.
> > > This makes the current asm version one instruction shorter.
> > >
> > > For architectures like mips and risc-v which have neither rotate
> > > instructions nor carry flags the C is as good as the current asm.
> > > The rotate is 3 instructions - the same as the extra cmp+add.
> > >
> > > Changing everything to use [1] would improve quite a few architectures
> > > while only adding 1 clock to some paths in arm/arm64 and sparc.
> > >
> > > Unfortunately it is all currently a mess.
> > > Most architectures don't include asm-generic/checksum.h at all.
> > >
> > > Thoughts?
> >
> > Then why not just have one version per arch, the most efficient one,
> > and use it everywhere ? The simple fact that we're discussing the
> > tradeoffs means that if we don't want to compromise performance here
> > (which I assume to be the case), then it needs to be per-arch and
> > that's all. At least that's the way I understand it.
>
> At the moment there are a lot of arch-specific ones that are
> definitely sub-optimal.

Yes very likely!

> I started doing some patches, my x86-64 kernel in about 4k
> smaller with [1].
> I was going to post the patches to asm-generic an x86.

I mean, maybe we could have your 3 versions with different names
in asm-generic, and have each asm file define csum_fold() to one
of them. That would limit the spreadth of variants and the auditing
difficulty.

Willy