From: Charlie Jenkins
> Sent: 11 September 2023 23:57
>
> Provide a 32 and 64 bit version of do_csum. When compiled for 32-bit
> will load from the buffer in groups of 32 bits, and when compiled for
> 64-bit will load in groups of 64 bits. Benchmarking by proxy compiling
> csum_ipv6_magic (64-bit version) for an x86 chip as well as running
> the riscv generated code in QEMU, discovered that summing in a
> tree-like structure is about 4% faster than doing 64-bit reads.
>
...
> + sum = saddr->s6_addr32[0];
> + sum += saddr->s6_addr32[1];
> + sum1 = saddr->s6_addr32[2];
> + sum1 += saddr->s6_addr32[3];
> +
> + sum2 = daddr->s6_addr32[0];
> + sum2 += daddr->s6_addr32[1];
> + sum3 = daddr->s6_addr32[2];
> + sum3 += daddr->s6_addr32[3];
> +
> + sum4 = csum;
> + sum4 += ulen;
> + sum4 += uproto;
> +
> + sum += sum1;
> + sum2 += sum3;
> +
> + sum += sum2;
> + sum += sum4;
Have you got gcc to compile that as-is?
Whenever I've tried to get a 'tree add' compiled so that the
early adds can be executed in parallel gcc always pessimises
it to a linear sequence of adds.
But I agree that adding 32bit values to a 64bit register
may be no slower than trying to do an 'add carry' sequence
that is guaranteed to only do one add/clock.
(And on Intel cpu from core-2 until IIRC Haswell adc took 2 clocks!)
IIRC RISCV doesn't have a carry flag, so the adc sequence
is hard - probably takes two extra instructions per value.
Although with parallel execute it may not matter.
Consider:
val = buf[offset];
sum += val;
carry += sum < val;
val = buf[offset1];
sum += val;
...
the compare and 'carry +=' can be executed at the same time
as the following two instructions.
You do then a final sum += carry; sum += sum < carry;
Assuming all instructions are 1 clock and any read delays
get filled with other instructions (by source or hardware
instruction re-ordering) even without parallel execute
that is 4 clocks for 64 bits, which is much the same as the
2 clocks for 32 bits.
Remember that all the 32bit values can summed first as
they won't overflow.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Tue, Sep 12, 2023 at 08:45:38AM +0000, David Laight wrote:
> From: Charlie Jenkins
> > Sent: 11 September 2023 23:57
> >
> > Provide a 32 and 64 bit version of do_csum. When compiled for 32-bit
> > will load from the buffer in groups of 32 bits, and when compiled for
> > 64-bit will load in groups of 64 bits. Benchmarking by proxy compiling
> > csum_ipv6_magic (64-bit version) for an x86 chip as well as running
> > the riscv generated code in QEMU, discovered that summing in a
> > tree-like structure is about 4% faster than doing 64-bit reads.
> >
> ...
> > + sum = saddr->s6_addr32[0];
> > + sum += saddr->s6_addr32[1];
> > + sum1 = saddr->s6_addr32[2];
> > + sum1 += saddr->s6_addr32[3];
> > +
> > + sum2 = daddr->s6_addr32[0];
> > + sum2 += daddr->s6_addr32[1];
> > + sum3 = daddr->s6_addr32[2];
> > + sum3 += daddr->s6_addr32[3];
> > +
> > + sum4 = csum;
> > + sum4 += ulen;
> > + sum4 += uproto;
> > +
> > + sum += sum1;
> > + sum2 += sum3;
> > +
> > + sum += sum2;
> > + sum += sum4;
>
> Have you got gcc to compile that as-is?
>
> Whenever I've tried to get a 'tree add' compiled so that the
> early adds can be executed in parallel gcc always pessimises
> it to a linear sequence of adds.
>
> But I agree that adding 32bit values to a 64bit register
> may be no slower than trying to do an 'add carry' sequence
> that is guaranteed to only do one add/clock.
> (And on Intel cpu from core-2 until IIRC Haswell adc took 2 clocks!)
>
> IIRC RISCV doesn't have a carry flag, so the adc sequence
> is hard - probably takes two extra instructions per value.
> Although with parallel execute it may not matter.
> Consider:
> val = buf[offset];
> sum += val;
> carry += sum < val;
> val = buf[offset1];
> sum += val;
> ...
> the compare and 'carry +=' can be executed at the same time
> as the following two instructions.
> You do then a final sum += carry; sum += sum < carry;
>
> Assuming all instructions are 1 clock and any read delays
> get filled with other instructions (by source or hardware
> instruction re-ordering) even without parallel execute
> that is 4 clocks for 64 bits, which is much the same as the
> 2 clocks for 32 bits.
>
> Remember that all the 32bit values can summed first as
> they won't overflow.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
Yeah it does seem like the tree-add does just do a linear add. All three
of them were pretty much the same on riscv so I used the version that
did best on x86 with the knowledge that my QEMU setup does not
accurately represent real hardware.
I don't quite understand how doing the carry in the middle of each
stage, even though it can be executed at the same time, would be faster
than just doing a single overflow check at the end. I can just revert
back to the non-tree add version since there is no improvement on riscv.
I can also revert back to the default version that uses carry += sum < val
as well.
- Charlie