Modify the 8x loop so that it uses two independent
accumulators. Despite adding more instructions, the latency and
throughput of the loop are improved because the `adc` chains can now
take advantage of multiple execution units.
Make the memory clobbers more precise. 'buff' is read-only and we know
the exact usage range. There is no reason to write-clobber all memory.
Relative performance changes on Tigerlake:
Time Unit: Ref Cycles
Size Unit: Bytes
size, lat old, lat new, tput old, tput new
0, 4.972, 5.054, 4.864, 4.870
100, 14.218, 12.476, 9.429, 9.441
200, 22.115, 16.937, 13.088, 12.852
300, 31.826, 24.640, 19.383, 18.230
400, 39.016, 28.133, 23.223, 21.304
500, 48.815, 36.186, 30.331, 27.104
600, 56.732, 40.120, 35.899, 30.363
700, 66.623, 48.178, 43.044, 36.400
800, 73.259, 51.171, 48.564, 39.173
900, 82.821, 56.635, 58.592, 45.162
1000, 90.780, 63.703, 65.658, 48.718
Signed-off-by: Noah Goldstein <[email protected]>
---
arch/x86/lib/csum-partial_64.c | 36 +++++++++++++++++-----------------
1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index ded842cd1020..76e2f540587e 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -48,18 +48,21 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
}
while (unlikely(len >= 64)) {
- asm("addq 0*8(%[src]),%[res]\n\t"
- "adcq 1*8(%[src]),%[res]\n\t"
- "adcq 2*8(%[src]),%[res]\n\t"
- "adcq 3*8(%[src]),%[res]\n\t"
- "adcq 4*8(%[src]),%[res]\n\t"
+ u64 temp_accum;
+
+ asm("movq 0*8(%[src]),%[res_tmp]\n\t"
+ "addq 1*8(%[src]),%[res_tmp]\n\t"
+ "adcq 2*8(%[src]),%[res_tmp]\n\t"
+ "adcq 3*8(%[src]),%[res_tmp]\n\t"
+ "adcq $0,%[res_tmp]\n\t"
+ "addq 4*8(%[src]),%[res]\n\t"
"adcq 5*8(%[src]),%[res]\n\t"
"adcq 6*8(%[src]),%[res]\n\t"
"adcq 7*8(%[src]),%[res]\n\t"
- "adcq $0,%[res]"
- : [res] "+r" (temp64)
- : [src] "r" (buff)
- : "memory");
+ "adcq %[res_tmp], %[res]\n\t"
+ "adcq $0,%[res]\n\t"
+ : [res] "+r"(temp64), [res_tmp] "=&r"(temp_accum)
+ : [src] "r"(buff), "m"(*(const char(*)[64])buff));
buff += 64;
len -= 64;
}
@@ -70,26 +73,23 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
"adcq 2*8(%[src]),%[res]\n\t"
"adcq 3*8(%[src]),%[res]\n\t"
"adcq $0,%[res]"
- : [res] "+r" (temp64)
- : [src] "r" (buff)
- : "memory");
+ : [res] "+r"(temp64)
+ : [src] "r"(buff), "m"(*(const char(*)[32])buff));
buff += 32;
}
if (len & 16) {
asm("addq 0*8(%[src]),%[res]\n\t"
"adcq 1*8(%[src]),%[res]\n\t"
"adcq $0,%[res]"
- : [res] "+r" (temp64)
- : [src] "r" (buff)
- : "memory");
+ : [res] "+r"(temp64)
+ : [src] "r"(buff), "m"(*(const char(*)[16])buff));
buff += 16;
}
if (len & 8) {
asm("addq 0*8(%[src]),%[res]\n\t"
"adcq $0,%[res]"
- : [res] "+r" (temp64)
- : [src] "r" (buff)
- : "memory");
+ : [res] "+r"(temp64)
+ : [src] "r"(buff), "m"(*(const char(*)[8])buff));
buff += 8;
}
if (len & 7) {
--
2.25.1
On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <[email protected]> wrote:
>
> Modify the 8x loop to that it uses two independent
> accumulators. Despite adding more instructions the latency and
> throughput of the loop is improved because the `adc` chains can now
> take advantage of multiple execution units.
Nice !
Note that I get better results if I do a different split, because the
second chain gets shorter.
First chain adds 5*8 bytes from the buffer, but first bytes are a mere
load, so that is really 4+1 additions.
Second chain adds 3*8 bytes from the buffer, plus the result coming
from the first chain, also 4+1 additions.
asm("movq 0*8(%[src]),%[res_tmp]\n\t"
"addq 1*8(%[src]),%[res_tmp]\n\t"
"adcq 2*8(%[src]),%[res_tmp]\n\t"
"adcq 3*8(%[src]),%[res_tmp]\n\t"
"adcq 4*8(%[src]),%[res_tmp]\n\t"
"adcq $0,%[res_tmp]\n\t"
"addq 5*8(%[src]),%[res]\n\t"
"adcq 6*8(%[src]),%[res]\n\t"
"adcq 7*8(%[src]),%[res]\n\t"
"adcq %[res_tmp],%[res]\n\t"
"adcq $0,%[res]"
: [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum)
: [src] "r" (buff)
: "memory");
>
> Make the memory clobbers more precise. 'buff' is read only and we know
> the exact usage range. There is no reason to write-clobber all memory.
Not sure if that matters in this function? Or do we expect it to be inlined?
Personally, I find the "memory" constraint to be more readable than these casts
"m"(*(const char(*)[64])buff));
>
> Relative performance changes on Tigerlake:
>
> Time Unit: Ref Cycles
> Size Unit: Bytes
>
> size, lat old, lat new, tput old, tput new
> 0, 4.972, 5.054, 4.864, 4.870
Really what matters in modern networking is the case for 40 bytes, and
eventually 8 bytes.
Can you add these two cases in this nice table?
We hardly have to checksum anything with NICs that are not decades old.
Apparently making the 64-byte loop slightly longer incentivizes gcc to
move it away (our intent with the unlikely() hint).
Anyway I am thinking of providing a specialized inline version for
IPv6 header checksums (40 + x*8 bytes, x being 0 pretty much all the
time),
so we will likely not use csum_partial() anymore.
Thanks !
On Thu, Nov 25, 2021 at 7:50 PM Eric Dumazet <[email protected]> wrote:
>
> On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <[email protected]> wrote:
> >
> > Modify the 8x loop to that it uses two independent
> > accumulators. Despite adding more instructions the latency and
> > throughput of the loop is improved because the `adc` chains can now
> > take advantage of multiple execution units.
>
> Nice !
>
> Note that I get better results if I do a different split, because the
> second chain gets shorter.
>
> First chain adds 5*8 bytes from the buffer, but first bytes are a mere
> load, so that is really 4+1 additions.
>
> Second chain adds 3*8 bytes from the buffer, plus the result coming
> from the first chain, also 4+1 additions.
Good call. With that approach I also see an improvement for the 32-byte
case (which is on the hot path) so this change might actually matter :)
>
> asm("movq 0*8(%[src]),%[res_tmp]\n\t"
> "addq 1*8(%[src]),%[res_tmp]\n\t"
> "adcq 2*8(%[src]),%[res_tmp]\n\t"
> "adcq 3*8(%[src]),%[res_tmp]\n\t"
> "adcq 4*8(%[src]),%[res_tmp]\n\t"
> "adcq $0,%[res_tmp]\n\t"
> "addq 5*8(%[src]),%[res]\n\t"
> "adcq 6*8(%[src]),%[res]\n\t"
> "adcq 7*8(%[src]),%[res]\n\t"
> "adcq %[res_tmp],%[res]\n\t"
> "adcq $0,%[res]"
> : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum)
> : [src] "r" (buff)
> : "memory");
>
>
> >
> > Make the memory clobbers more precise. 'buff' is read only and we know
> > the exact usage range. There is no reason to write-clobber all memory.
>
> Not sure if that matters in this function ? Or do we expect it being inlined ?
It may matter for an LTO build. I also think it can matter for the loop
case. I didn't see any difference when playing around with the function
in userland with:
```
gcc -O3 -march=native -mtune=native checksum.c -o checksum
```
but IIRC, with a "memory" clobber, loops containing inline assembly can be
de-optimized if GCC can't prove the iterations don't affect each other.
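To make that concrete, a rough userland sketch (my own toy example; names like
sum_words and bias are made up): with a sized "m" input GCC knows the asm only
reads p[i], so *bias can be loaded once outside the loop, whereas a bare
"memory" clobber forces it to assume *bias might change on every iteration.

```
#include <stdint.h>
#include <stddef.h>

uint64_t sum_words(const uint64_t *p, size_t n, const uint64_t *bias)
{
        uint64_t acc = 0;

        for (size_t i = 0; i < n; i++) {
                /* the asm reads exactly p[i]; no other memory is clobbered */
                asm("addq %[v],%[res]\n\t"
                    "adcq $0,%[res]"
                    : [res] "+r" (acc)
                    : [v] "m" (p[i]));
                acc += *bias;   /* *bias is free to stay hoisted in a register */
        }
        return acc;
}
```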
>
> Personally, I find the "memory" constraint to be more readable than these casts
> "m"(*(const char(*)[64])buff));
>
Hmm, I personally find it more readable if I can tell what memory
transforms happen
just from reading the clobbers, but you're the maintainer.
Do you want it changed in V2?
> >
> > Relative performance changes on Tigerlake:
> >
> > Time Unit: Ref Cycles
> > Size Unit: Bytes
> >
> > size, lat old, lat new, tput old, tput new
> > 0, 4.972, 5.054, 4.864, 4.870
>
> Really what matters in modern networking is the case for 40 bytes, and
> eventually 8 bytes.
>
> Can you add these two cases in this nice table ?
>
Sure, with your suggestion in the 32-byte case there is an improvement
there too.
> We hardly have to checksum anything with NIC that are not decades old.
>
> Apparently making the 64byte loop slightly longer incentives gcc to
> move it away (our intent with the unlikely() hint).
>
> Anyway I am thinking of providing a specialized inline version for
> IPv6 header checksums (40 + x*8 bytes, x being 0 pretty much all the
> time),
> so we will likely not use csum_partial() anymore.
I see. For now is it worth adding a case for 40 in this implementation?
if (likely(len == 40)) {
        // optimized 40 + buff-aligned case
} else {
        // existing code
}
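For concreteness, the body of that branch could look roughly like this (a
sketch only; an IPv6 header is five quadwords, so a single adc chain covers
it, and Eric posts a fuller version of the same idea further down the thread):

```
        /* hypothetical 40-byte fast path, aligned buff assumed */
        temp64 = (__force u64)sum;
        asm("addq 0*8(%[src]),%[res]\n\t"
            "adcq 1*8(%[src]),%[res]\n\t"
            "adcq 2*8(%[src]),%[res]\n\t"
            "adcq 3*8(%[src]),%[res]\n\t"
            "adcq 4*8(%[src]),%[res]\n\t"
            "adcq $0,%[res]"
            : [res] "+r" (temp64)
            : [src] "r" (buff), "m" (*(const char (*)[40])buff));
        return (__force __wsum)add32_with_carry(temp64 >> 32, temp64);
```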
>
> Thanks !
On Thu, Nov 25, 2021 at 8:15 PM Noah Goldstein <[email protected]> wrote:
>
> On Thu, Nov 25, 2021 at 7:50 PM Eric Dumazet <[email protected]> wrote:
> >
> > On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <[email protected]> wrote:
> > >
> > > Modify the 8x loop to that it uses two independent
> > > accumulators. Despite adding more instructions the latency and
> > > throughput of the loop is improved because the `adc` chains can now
> > > take advantage of multiple execution units.
> >
> > Nice !
> >
> > Note that I get better results if I do a different split, because the
> > second chain gets shorter.
> >
> > First chain adds 5*8 bytes from the buffer, but first bytes are a mere
> > load, so that is really 4+1 additions.
> >
> > Second chain adds 3*8 bytes from the buffer, plus the result coming
> > from the first chain, also 4+1 additions.
>
> Good call. With that approach I also see an improvement for the 32 byte
> case (which is on the hot path) so this change might actually matter :)
>
> >
> > asm("movq 0*8(%[src]),%[res_tmp]\n\t"
> > "addq 1*8(%[src]),%[res_tmp]\n\t"
> > "adcq 2*8(%[src]),%[res_tmp]\n\t"
> > "adcq 3*8(%[src]),%[res_tmp]\n\t"
> > "adcq 4*8(%[src]),%[res_tmp]\n\t"
> > "adcq $0,%[res_tmp]\n\t"
> > "addq 5*8(%[src]),%[res]\n\t"
> > "adcq 6*8(%[src]),%[res]\n\t"
> > "adcq 7*8(%[src]),%[res]\n\t"
> > "adcq %[res_tmp],%[res]\n\t"
> > "adcq $0,%[res]"
> > : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum)
> > : [src] "r" (buff)
> > : "memory");
> >
> >
> > >
> > > Make the memory clobbers more precise. 'buff' is read only and we know
> > > the exact usage range. There is no reason to write-clobber all memory.
> >
> > Not sure if that matters in this function ? Or do we expect it being inlined ?
>
> It may matter for LTO build. I also think it can matter for the loop
> case. I didn't see
> any difference when playing around with the function in userland with:
>
> ```
> gcc -O3 -march=native -mtune=native checksum.c -o checksum
> ```
>
> but IIRC if the clobber is loops with inline assembly payloads can be
> de-optimized if GCC can't prove the iterations don't affect each other.
>
>
> >
> > Personally, I find the "memory" constraint to be more readable than these casts
> > "m"(*(const char(*)[64])buff));
> >
>
> Hmm, I personally find it more readable if I can tell what memory
> transforms happen
> just from reading the clobbers, but you're the maintainer.
>
> Do you want it changed in V2?
>
> > >
> > > Relative performance changes on Tigerlake:
> > >
> > > Time Unit: Ref Cycles
> > > Size Unit: Bytes
> > >
> > > size, lat old, lat new, tput old, tput new
> > > 0, 4.972, 5.054, 4.864, 4.870
> >
> > Really what matters in modern networking is the case for 40 bytes, and
> > eventually 8 bytes.
> >
> > Can you add these two cases in this nice table ?
> >
>
> Sure, with your suggestion in the 32 byte cases there is an improvement there
> too.
>
> > We hardly have to checksum anything with NIC that are not decades old.
> >
> > Apparently making the 64byte loop slightly longer incentives gcc to
> > move it away (our intent with the unlikely() hint).
Do you think the 40/48-byte case might be better off in GAS assembly? It's
a bit difficult to get proper control-flow optimization with GCC + inline
assembly even with likely/unlikely (i.e. expanding the 64-byte case moves it
off the hot path, cmov + ror instead of fallthrough for the hot path).
> >
> > Anyway I am thinking of providing a specialized inline version for
> > IPv6 header checksums (40 + x*8 bytes, x being 0 pretty much all the
> > time),
> > so we will likely not use csum_partial() anymore.
>
> I see. For now is it worth adding a case for 40 in this implementation?
>
> if(likely(len == 40)) {
> // optimized 40 + buff aligned case
> }
> else {
> // existing code
> }
>
>
> >
> > Thanks !
On Thu, Nov 25, 2021 at 8:18 PM Noah Goldstein <[email protected]> wrote:
>
> On Thu, Nov 25, 2021 at 8:15 PM Noah Goldstein <[email protected]> wrote:
> >
> > On Thu, Nov 25, 2021 at 7:50 PM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <[email protected]> wrote:
> > > >
> > > > Modify the 8x loop to that it uses two independent
> > > > accumulators. Despite adding more instructions the latency and
> > > > throughput of the loop is improved because the `adc` chains can now
> > > > take advantage of multiple execution units.
> > >
> > > Nice !
> > >
> > > Note that I get better results if I do a different split, because the
> > > second chain gets shorter.
> > >
> > > First chain adds 5*8 bytes from the buffer, but first bytes are a mere
> > > load, so that is really 4+1 additions.
> > >
> > > Second chain adds 3*8 bytes from the buffer, plus the result coming
> > > from the first chain, also 4+1 additions.
> >
> > Good call. With that approach I also see an improvement for the 32 byte
> > case (which is on the hot path) so this change might actually matter :)
> >
> > >
> > > asm("movq 0*8(%[src]),%[res_tmp]\n\t"
> > > "addq 1*8(%[src]),%[res_tmp]\n\t"
> > > "adcq 2*8(%[src]),%[res_tmp]\n\t"
> > > "adcq 3*8(%[src]),%[res_tmp]\n\t"
> > > "adcq 4*8(%[src]),%[res_tmp]\n\t"
> > > "adcq $0,%[res_tmp]\n\t"
> > > "addq 5*8(%[src]),%[res]\n\t"
> > > "adcq 6*8(%[src]),%[res]\n\t"
> > > "adcq 7*8(%[src]),%[res]\n\t"
> > > "adcq %[res_tmp],%[res]\n\t"
> > > "adcq $0,%[res]"
> > > : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum)
> > > : [src] "r" (buff)
> > > : "memory");
> > >
> > >
> > > >
> > > > Make the memory clobbers more precise. 'buff' is read only and we know
> > > > the exact usage range. There is no reason to write-clobber all memory.
> > >
> > > Not sure if that matters in this function ? Or do we expect it being inlined ?
> >
> > It may matter for LTO build. I also think it can matter for the loop
> > case. I didn't see
> > any difference when playing around with the function in userland with:
> >
> > ```
> > gcc -O3 -march=native -mtune=native checksum.c -o checksum
> > ```
> >
> > but IIRC if the clobber is loops with inline assembly payloads can be
> > de-optimized if GCC can't prove the iterations don't affect each other.
> >
> >
> > >
> > > Personally, I find the "memory" constraint to be more readable than these casts
> > > "m"(*(const char(*)[64])buff));
> > >
> >
> > Hmm, I personally find it more readable if I can tell what memory
> > transforms happen
> > just from reading the clobbers, but you're the maintainer.
> >
> > Do you want it changed in V2?
> >
> > > >
> > > > Relative performance changes on Tigerlake:
> > > >
> > > > Time Unit: Ref Cycles
> > > > Size Unit: Bytes
> > > >
> > > > size, lat old, lat new, tput old, tput new
> > > > 0, 4.972, 5.054, 4.864, 4.870
> > >
> > > Really what matters in modern networking is the case for 40 bytes, and
> > > eventually 8 bytes.
> > >
> > > Can you add these two cases in this nice table ?
Regarding the 32-byte case, adding two accumulators helps with the latency
numbers but causes a regression in throughput for the 40/48-byte cases. Which
is the more important metric for the usage of csum_partial()?
Here are the numbers for the smaller sizes:
size, lat old, lat ver2, lat ver1, tput old, tput ver2, tput ver1
0, 4.961, 4.503, 4.901, 4.887, 4.399, 4.951
8, 5.590, 5.594, 5.620, 4.227, 4.110, 4.252
16, 6.182, 6.398, 6.202, 4.233, 4.062, 4.278
24, 7.392, 7.591, 7.380, 4.256, 4.246, 4.279
32, 7.371, 6.366, 7.390, 4.550, 4.900, 4.537
40, 8.621, 7.496, 8.601, 4.862, 5.162, 4.836
48, 9.406, 8.128, 9.374, 5.206, 5.736, 5.234
56, 10.535, 9.189, 10.522, 5.416, 5.772, 5.447
64, 10.000, 7.487, 7.590, 6.946, 6.975, 6.989
72, 11.192, 8.639, 8.763, 7.210, 7.311, 7.277
80, 11.734, 9.179, 9.409, 7.605, 7.620, 7.548
88, 12.933, 10.545, 10.584, 7.878, 7.902, 7.858
96, 12.952, 9.331, 10.625, 8.168, 8.470, 8.206
104, 14.206, 10.424, 11.839, 8.491, 8.785, 8.502
112, 14.763, 11.403, 12.416, 8.798, 9.134, 8.771
120, 15.955, 12.635, 13.651, 9.175, 9.494, 9.130
128, 15.271, 10.599, 10.724, 9.726, 9.672, 9.655
'ver2' uses two accumulators for the 32-byte case and has better latency
numbers but regressions in tput compared to 'old' and 'ver1'. 'ver1' is the
implementation posted, which has essentially the same numbers for tput/lat
as 'old' for sizes [0, 63].
> > >
> >
> > Sure, with your suggestion in the 32 byte cases there is an improvement there
> > too.
> >
> > > We hardly have to checksum anything with NIC that are not decades old.
> > >
> > > Apparently making the 64byte loop slightly longer incentives gcc to
> > > move it away (our intent with the unlikely() hint).
>
> Do you think the 40/48 byte case might be better of in GAS assembly. It's
> a bit difficult to get proper control flow optimization with GCC +
> inline assembly
> even with likely/unlikely (i.e expanding the 64 byte case moves it off hotpath,
> cmov + ror instead of fallthrough for hotpath).
> > >
> > > Anyway I am thinking of providing a specialized inline version for
> > > IPv6 header checksums (40 + x*8 bytes, x being 0 pretty much all the
> > > time),
> > > so we will likely not use csum_partial() anymore.
> >
> > I see. For now is it worth adding a case for 40 in this implementation?
> >
> > if(likely(len == 40)) {
> > // optimized 40 + buff aligned case
> > }
> > else {
> > // existing code
> > }
> >
> >
> > >
> > > Thanks !
On Thu, Nov 25, 2021 at 6:15 PM Noah Goldstein <[email protected]> wrote:
>
> On Thu, Nov 25, 2021 at 7:50 PM Eric Dumazet <[email protected]> wrote:
> >
> > On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <[email protected]> wrote:
> > >
> > > Modify the 8x loop to that it uses two independent
> > > accumulators. Despite adding more instructions the latency and
> > > throughput of the loop is improved because the `adc` chains can now
> > > take advantage of multiple execution units.
> >
> > Nice !
> >
> > Note that I get better results if I do a different split, because the
> > second chain gets shorter.
> >
> > First chain adds 5*8 bytes from the buffer, but first bytes are a mere
> > load, so that is really 4+1 additions.
> >
> > Second chain adds 3*8 bytes from the buffer, plus the result coming
> > from the first chain, also 4+1 additions.
>
> Good call. With that approach I also see an improvement for the 32 byte
> case (which is on the hot path) so this change might actually matter :)
>
> >
> > asm("movq 0*8(%[src]),%[res_tmp]\n\t"
> > "addq 1*8(%[src]),%[res_tmp]\n\t"
> > "adcq 2*8(%[src]),%[res_tmp]\n\t"
> > "adcq 3*8(%[src]),%[res_tmp]\n\t"
> > "adcq 4*8(%[src]),%[res_tmp]\n\t"
> > "adcq $0,%[res_tmp]\n\t"
> > "addq 5*8(%[src]),%[res]\n\t"
> > "adcq 6*8(%[src]),%[res]\n\t"
> > "adcq 7*8(%[src]),%[res]\n\t"
> > "adcq %[res_tmp],%[res]\n\t"
> > "adcq $0,%[res]"
> > : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum)
> > : [src] "r" (buff)
> > : "memory");
> >
> >
> > >
> > > Make the memory clobbers more precise. 'buff' is read only and we know
> > > the exact usage range. There is no reason to write-clobber all memory.
> >
> > Not sure if that matters in this function ? Or do we expect it being inlined ?
>
> It may matter for LTO build. I also think it can matter for the loop
> case. I didn't see
> any difference when playing around with the function in userland with:
>
> ```
> gcc -O3 -march=native -mtune=native checksum.c -o checksum
> ```
>
> but IIRC if the clobber is loops with inline assembly payloads can be
> de-optimized if GCC can't prove the iterations don't affect each other.
>
>
> >
> > Personally, I find the "memory" constraint to be more readable than these casts
> > "m"(*(const char(*)[64])buff));
> >
>
> Hmm, I personally find it more readable if I can tell what memory
> transforms happen
> just from reading the clobbers, but you're the maintainer.
I am not an x86 maintainer at all; I was only giving an opinion.
>
> Do you want it changed in V2?
Let's hear from x86 maintainers :)
>
> > >
> > > Relative performance changes on Tigerlake:
> > >
> > > Time Unit: Ref Cycles
> > > Size Unit: Bytes
> > >
> > > size, lat old, lat new, tput old, tput new
> > > 0, 4.972, 5.054, 4.864, 4.870
> >
> > Really what matters in modern networking is the case for 40 bytes, and
> > eventually 8 bytes.
> >
> > Can you add these two cases in this nice table ?
> >
>
> Sure, with your suggestion in the 32 byte cases there is an improvement there
> too.
>
> > We hardly have to checksum anything with NIC that are not decades old.
> >
> > Apparently making the 64byte loop slightly longer incentives gcc to
> > move it away (our intent with the unlikely() hint).
> >
> > Anyway I am thinking of providing a specialized inline version for
> > IPv6 header checksums (40 + x*8 bytes, x being 0 pretty much all the
> > time),
> > so we will likely not use csum_partial() anymore.
>
> I see. For now is it worth adding a case for 40 in this implementation?
>
> if(likely(len == 40)) {
> // optimized 40 + buff aligned case
> }
> else {
> // existing code
> }
This is what I had first in my tree to demonstrate what gains could be achieved.
But I decided to keep csum_partial() as a generic function, yet optimize it a bit.
My plan is to have a new inline helper that could be a mere csum_partial()
on arches that do not have an optimized version.
Preliminary patch to show the idea (this only compiles on x86; we have
to add the generic definition).
First call point is in the GRO stack.
Note some of us in the US are not working until next Monday.
diff --git a/arch/x86/include/asm/checksum_64.h
b/arch/x86/include/asm/checksum_64.h
index 407beebadaf45a748f91a36b78bd1d023449b132..af93fb53b480ab7102db71c32ab6ca9604c6b5fb
100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -182,4 +182,26 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
(__force unsigned)addend);
}
+static inline __wsum ipv6_csum_partial(const void *buff, int len, __wsum sum)
+{
+ u64 temp64;
+
+ if (unlikely(len == 40))
+ return csum_partial(buff, len, sum);
+
+ temp64 = (__force u64)sum;
+ asm("addq 0*8(%[src]),%[res]\n\t"
+ "adcq 1*8(%[src]),%[res]\n\t"
+ "adcq 2*8(%[src]),%[res]\n\t"
+ "adcq 3*8(%[src]),%[res]\n\t"
+ "adcq 4*8(%[src]),%[res]\n\t"
+ "adcq 5*8(%[src]),%[res]\n\t"
+ "adcq $0,%[res]"
+ : [res] "+r" (temp64)
+ : [src] "r" (buff)
+ : "memory");
+ return (__force __wsum)add32_with_carry(temp64 >> 32, temp64);
+}
+#define ipv6_csum_partial ipv6_csum_partial
+
#endif /* _ASM_X86_CHECKSUM_64_H */
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 1b9827ff8ccf48e61e233e39d671aa67c8fff0ab..50d1d0c92bdce37ac2f48ded8edec02e1ba9096d
100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -274,7 +274,9 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff
*ipv6_gro_receive(struct list_head *head,
NAPI_GRO_CB(skb)->is_atomic = true;
NAPI_GRO_CB(skb)->flush |= flush;
- skb_gro_postpull_rcsum(skb, iph, nlen);
+ if (NAPI_GRO_CB(skb)->csum_valid)
+ NAPI_GRO_CB(skb)->csum = ~ipv6_csum_partial(iph, nlen,
+ ~NAPI_GRO_CB(skb)->csum);
pp = indirect_call_gro_receive_l4(tcp6_gro_receive, udp6_gro_receive,
ops->callbacks.gro_receive, head, skb);
On Fri, Nov 26, 2021 at 10:09 AM Eric Dumazet <[email protected]> wrote:
>
> On Thu, Nov 25, 2021 at 6:15 PM Noah Goldstein <[email protected]> wrote:
> >
> > On Thu, Nov 25, 2021 at 7:50 PM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <[email protected]> wrote:
> > > >
> > > > Modify the 8x loop to that it uses two independent
> > > > accumulators. Despite adding more instructions the latency and
> > > > throughput of the loop is improved because the `adc` chains can now
> > > > take advantage of multiple execution units.
> > >
> > > Nice !
> > >
> > > Note that I get better results if I do a different split, because the
> > > second chain gets shorter.
> > >
> > > First chain adds 5*8 bytes from the buffer, but first bytes are a mere
> > > load, so that is really 4+1 additions.
> > >
> > > Second chain adds 3*8 bytes from the buffer, plus the result coming
> > > from the first chain, also 4+1 additions.
> >
> > Good call. With that approach I also see an improvement for the 32 byte
> > case (which is on the hot path) so this change might actually matter :)
> >
> > >
> > > asm("movq 0*8(%[src]),%[res_tmp]\n\t"
> > > "addq 1*8(%[src]),%[res_tmp]\n\t"
> > > "adcq 2*8(%[src]),%[res_tmp]\n\t"
> > > "adcq 3*8(%[src]),%[res_tmp]\n\t"
> > > "adcq 4*8(%[src]),%[res_tmp]\n\t"
> > > "adcq $0,%[res_tmp]\n\t"
> > > "addq 5*8(%[src]),%[res]\n\t"
> > > "adcq 6*8(%[src]),%[res]\n\t"
> > > "adcq 7*8(%[src]),%[res]\n\t"
> > > "adcq %[res_tmp],%[res]\n\t"
> > > "adcq $0,%[res]"
> > > : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum)
> > > : [src] "r" (buff)
> > > : "memory");
> > >
> > >
> > > >
> > > > Make the memory clobbers more precise. 'buff' is read only and we know
> > > > the exact usage range. There is no reason to write-clobber all memory.
> > >
> > > Not sure if that matters in this function ? Or do we expect it being inlined ?
> >
> > It may matter for LTO build. I also think it can matter for the loop
> > case. I didn't see
> > any difference when playing around with the function in userland with:
> >
> > ```
> > gcc -O3 -march=native -mtune=native checksum.c -o checksum
> > ```
> >
> > but IIRC if the clobber is loops with inline assembly payloads can be
> > de-optimized if GCC can't prove the iterations don't affect each other.
> >
> >
> > >
> > > Personally, I find the "memory" constraint to be more readable than these casts
> > > "m"(*(const char(*)[64])buff));
> > >
> >
> > Hmm, I personally find it more readable if I can tell what memory
> > transforms happen
> > just from reading the clobbers, but you're the maintainer.
>
> I am not x86 maintainer at all, I was only giving an opinion.
>
> >
> > Do you want it changed in V2?
>
> Let's hear from x86 maintainers :)
>
> >
> > > >
> > > > Relative performance changes on Tigerlake:
> > > >
> > > > Time Unit: Ref Cycles
> > > > Size Unit: Bytes
> > > >
> > > > size, lat old, lat new, tput old, tput new
> > > > 0, 4.972, 5.054, 4.864, 4.870
> > >
> > > Really what matters in modern networking is the case for 40 bytes, and
> > > eventually 8 bytes.
> > >
> > > Can you add these two cases in this nice table ?
> > >
> >
> > Sure, with your suggestion in the 32 byte cases there is an improvement there
> > too.
> >
> > > We hardly have to checksum anything with NIC that are not decades old.
> > >
> > > Apparently making the 64byte loop slightly longer incentives gcc to
> > > move it away (our intent with the unlikely() hint).
> > >
> > > Anyway I am thinking of providing a specialized inline version for
> > > IPv6 header checksums (40 + x*8 bytes, x being 0 pretty much all the
> > > time),
> > > so we will likely not use csum_partial() anymore.
> >
> > I see. For now is it worth adding a case for 40 in this implementation?
> >
> > if(likely(len == 40)) {
> > // optimized 40 + buff aligned case
> > }
> > else {
> > // existing code
> > }
>
> This is what I had first in my tree to demonstrate what gains could be achieved.
>
> But decided to keep csum_partial() as a generic function, yet optimize it a bit.
>
> My plan is to have a new inline helper, that could be a mere
> csum_partial() on arches that do not have
> an optimized version.
>
> Preliminary patch to show the idea (this only compiles on x86, we have
> to add the generic definition)
> First call point is in GRO stack.
> Note some of us in the US are not working until next Monday.
>
> diff --git a/arch/x86/include/asm/checksum_64.h
> b/arch/x86/include/asm/checksum_64.h
> index 407beebadaf45a748f91a36b78bd1d023449b132..af93fb53b480ab7102db71c32ab6ca9604c6b5fb
> 100644
> --- a/arch/x86/include/asm/checksum_64.h
> +++ b/arch/x86/include/asm/checksum_64.h
> @@ -182,4 +182,26 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
> (__force unsigned)addend);
> }
>
> +static inline __wsum ipv6_csum_partial(const void *buff, int len, __wsum sum)
> +{
> + u64 temp64;
> +
> + if (unlikely(len == 40))
> + return csum_partial(buff, len, sum);
> +
> + temp64 = (__force u64)sum;
> + asm("addq 0*8(%[src]),%[res]\n\t"
> + "adcq 1*8(%[src]),%[res]\n\t"
> + "adcq 2*8(%[src]),%[res]\n\t"
> + "adcq 3*8(%[src]),%[res]\n\t"
> + "adcq 4*8(%[src]),%[res]\n\t"
> + "adcq 5*8(%[src]),%[res]\n\t"
> + "adcq $0,%[res]"
> + : [res] "+r" (temp64)
> + : [src] "r" (buff)
> + : "memory");
> + return (__force __wsum)add32_with_carry(temp64 >> 32, temp64);
Makes sense. Although if you inline it, I think you will definitely want a
more conservative clobber than just "memory". Also, I think with 40 bytes you
will get some value from two counters.
Did you see the numbers/question I posted about two accumulators for the
32-byte case?
It's a judgement call about latency vs throughput that I don't really have an
answer for.
> +}
> +#define ipv6_csum_partial ipv6_csum_partial
> +
> #endif /* _ASM_X86_CHECKSUM_64_H */
> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
> index 1b9827ff8ccf48e61e233e39d671aa67c8fff0ab..50d1d0c92bdce37ac2f48ded8edec02e1ba9096d
> 100644
> --- a/net/ipv6/ip6_offload.c
> +++ b/net/ipv6/ip6_offload.c
> @@ -274,7 +274,9 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff
> *ipv6_gro_receive(struct list_head *head,
> NAPI_GRO_CB(skb)->is_atomic = true;
> NAPI_GRO_CB(skb)->flush |= flush;
>
> - skb_gro_postpull_rcsum(skb, iph, nlen);
> + if (NAPI_GRO_CB(skb)->csum_valid)
> + NAPI_GRO_CB(skb)->csum = ~ipv6_csum_partial(iph, nlen,
> + ~NAPI_GRO_CB(skb)->csum);
>
> pp = indirect_call_gro_receive_l4(tcp6_gro_receive, udp6_gro_receive,
> ops->callbacks.gro_receive, head, skb);
On Fri, Nov 26, 2021 at 8:08 AM Eric Dumazet <[email protected]> wrote:
> diff --git a/arch/x86/include/asm/checksum_64.h
> b/arch/x86/include/asm/checksum_64.h
> index 407beebadaf45a748f91a36b78bd1d023449b132..af93fb53b480ab7102db71c32ab6ca9604c6b5fb
> 100644
> --- a/arch/x86/include/asm/checksum_64.h
> +++ b/arch/x86/include/asm/checksum_64.h
> @@ -182,4 +182,26 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
> (__force unsigned)addend);
> }
>
> +static inline __wsum ipv6_csum_partial(const void *buff, int len, __wsum sum)
> +{
> + u64 temp64;
> +
> + if (unlikely(len == 40))
this of course needs to be the opposite condition:
if (unlikely(len != sizeof(struct ipv6hdr)))
> + return csum_partial(buff, len, sum);
> +
> + temp64 = (__force u64)sum;
> + asm("addq 0*8(%[src]),%[res]\n\t"
> + "adcq 1*8(%[src]),%[res]\n\t"
> + "adcq 2*8(%[src]),%[res]\n\t"
> + "adcq 3*8(%[src]),%[res]\n\t"
> + "adcq 4*8(%[src]),%[res]\n\t"
> + "adcq 5*8(%[src]),%[res]\n\t"
> + "adcq $0,%[res]"
> + : [res] "+r" (temp64)
> + : [src] "r" (buff)
> + : "memory");
> + return (__force __wsum)add32_with_carry(temp64 >> 32, temp64);
> +}
On Fri, Nov 26, 2021 at 10:17 AM Noah Goldstein <[email protected]> wrote:
>
>
> Makes sense. Although if you inline I think you definitely will want a more
> conservative clobber than just "memory". Also I think with 40 you also will
> get some value from two counters.
>
> Did you see the number/question I posted about two accumulators for 32
> byte case?
> Its a judgement call about latency vs throughput that I don't really have an
> answer for.
>
The thing I do not know is if using more units would slow down the
hyperthread?
Would using ADCX/ADOX be better in this respect?
On Fri, Nov 26, 2021 at 12:27 PM Eric Dumazet <[email protected]> wrote:
>
> On Fri, Nov 26, 2021 at 10:17 AM Noah Goldstein <[email protected]> wrote:
> >
>
> >
> > Makes sense. Although if you inline I think you definitely will want a more
> > conservative clobber than just "memory". Also I think with 40 you also will
> > get some value from two counters.
> >
> > Did you see the number/question I posted about two accumulators for 32
> > byte case?
> > Its a judgement call about latency vs throughput that I don't really have an
> > answer for.
> >
>
> The thing I do not know is if using more units would slow down the
> hyper thread ?
There are more uops in the two-accumulator version, so it could be a concern
if the other hyperthread is bottlenecked on p06 throughput. My general
understanding is that this is not the common case and that the very premise of
hyperthreads is that most bottlenecks are related to memory fetch or resolving
control flow.
>
> Would using ADCX/ADOX would be better in this respect ?
What would code using those instructions look like? I'm having trouble
seeing how to use them here.
On Fri, Nov 26, 2021 at 12:50 PM Noah Goldstein <[email protected]> wrote:
>
> On Fri, Nov 26, 2021 at 12:27 PM Eric Dumazet <[email protected]> wrote:
> >
> > On Fri, Nov 26, 2021 at 10:17 AM Noah Goldstein <[email protected]> wrote:
> > >
> >
> > >
> > > Makes sense. Although if you inline I think you definitely will want a more
> > > conservative clobber than just "memory". Also I think with 40 you also will
> > > get some value from two counters.
> > >
> > > Did you see the number/question I posted about two accumulators for 32
> > > byte case?
> > > Its a judgement call about latency vs throughput that I don't really have an
> > > answer for.
> > >
> >
> > The thing I do not know is if using more units would slow down the
> > hyper thread ?
Did some quick tests with the latency/throughput benchmarks running
in parallel on two hyperthreads on the same processor. The 32-byte case
latency advantage goes to 2 accum and there is still a slight regression
in throughput. The larger cases that hit the loop still have improvements
both in tput and latency with 2 accum.
>
> There are more uops in the two accumulator version so it could be concern
> iff the other hyperthread is bottlenecked on p06 throughput. My general
> understanding is this is not the common case and that the very premise of
> hyperthreads is that most bottlenecks are related to memory fetch or resolving
> control flow.
>
> >
> > Would using ADCX/ADOX would be better in this respect ?
>
> What would code using those instructions look like? Having trouble
> seeing how to use them here.
On Fri, Nov 26, 2021 at 11:14 AM Noah Goldstein <[email protected]> wrote:
>
> On Fri, Nov 26, 2021 at 12:50 PM Noah Goldstein <[email protected]> wrote:
> >
> > On Fri, Nov 26, 2021 at 12:27 PM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Fri, Nov 26, 2021 at 10:17 AM Noah Goldstein <[email protected]> wrote:
> > > >
> > >
> > > >
> > > > Makes sense. Although if you inline I think you definitely will want a more
> > > > conservative clobber than just "memory". Also I think with 40 you also will
> > > > get some value from two counters.
> > > >
> > > > Did you see the number/question I posted about two accumulators for 32
> > > > byte case?
> > > > Its a judgement call about latency vs throughput that I don't really have an
> > > > answer for.
> > > >
> > >
> > > The thing I do not know is if using more units would slow down the
> > > hyper thread ?
>
> Did some quick tests with the latency/throughput benchmarks running
> in parallel on two hyperthreads on the same processors. The 32 byte case
> latency advantage goes with 2 accum and there is still a slight regression
> in throughput. The larger cases that hit the loop still still have improvements
> both in tput and latency with 2 accum.
Great. I also played with the rorx instruction, because it removes one
"adcl $0,..." step
__wsum ipv6_csum_partial(const void *buff, int len, __wsum sum)
{
        u64 temp64;
        u32 res32;

        if (unlikely(len != 40))
                return csum_partial(buff, len, sum);

        temp64 = (__force u64)sum;
        asm("addq 0*8(%[src]),%[temp64]\n\t"
            "adcq 1*8(%[src]),%[temp64]\n\t"
            "adcq 2*8(%[src]),%[temp64]\n\t"
            "adcq 3*8(%[src]),%[temp64]\n\t"
            "adcq 4*8(%[src]),%[temp64]\n\t"
            "mov %k[temp64],%[res32]\n\t"
            "rorx $32,%[temp64],%[temp64]\n\t"
            "adcl %k[temp64],%[res32]\n\t"
            "adcl $0,%[res32]"
            : [temp64] "+r" (temp64), [res32] "=r" (res32)
            : [src] "r" (buff)
            : "memory");
        return (__force __wsum)res32;
}
On Fri, Nov 26, 2021 at 1:21 PM Eric Dumazet <[email protected]> wrote:
>
> On Fri, Nov 26, 2021 at 11:14 AM Noah Goldstein <[email protected]> wrote:
> >
> > On Fri, Nov 26, 2021 at 12:50 PM Noah Goldstein <[email protected]> wrote:
> > >
> > > On Fri, Nov 26, 2021 at 12:27 PM Eric Dumazet <[email protected]> wrote:
> > > >
> > > > On Fri, Nov 26, 2021 at 10:17 AM Noah Goldstein <[email protected]> wrote:
> > > > >
> > > >
> > > > >
> > > > > Makes sense. Although if you inline I think you definitely will want a more
> > > > > conservative clobber than just "memory". Also I think with 40 you also will
> > > > > get some value from two counters.
> > > > >
> > > > > Did you see the number/question I posted about two accumulators for 32
> > > > > byte case?
> > > > > Its a judgement call about latency vs throughput that I don't really have an
> > > > > answer for.
> > > > >
> > > >
> > > > The thing I do not know is if using more units would slow down the
> > > > hyper thread ?
> >
> > Did some quick tests with the latency/throughput benchmarks running
> > in parallel on two hyperthreads on the same processors. The 32 byte case
> > latency advantage goes with 2 accum and there is still a slight regression
> > in throughput. The larger cases that hit the loop still still have improvements
> > both in tput and latency with 2 accum.
>
> Great. I also played with rorx instruction, because it removes one
> "adcl $0,..." step
Bright :) but it will need a BMI support check.
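e.g. something along these lines, where ipv6_csum_header and
ipv6_csum_partial_rorx are made-up names for a wrapper and for the rorx
variant above (rorx is part of BMI2, hence the runtime gate):

```
/* sketch only; both function names here are hypothetical */
static inline __wsum ipv6_csum_header(const void *buff, int len, __wsum sum)
{
        if (static_cpu_has(X86_FEATURE_BMI2))
                return ipv6_csum_partial_rorx(buff, len, sum);
        return csum_partial(buff, len, sum);
}
```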
>
> __wsum ipv6_csum_partial(const void *buff, int len, __wsum sum)
> {
> u64 temp64;
> u64 tmp;
> u32 res32;
>
> if (unlikely(len != 40))
> return csum_partial(buff, len, sum);
>
> temp64 = (__force u64)sum;
> asm("addq 0*8(%[src]),%[temp64]\n\t"
> "adcq 1*8(%[src]),%[temp64]\n\t"
> "adcq 2*8(%[src]),%[temp64]\n\t"
> "adcq 3*8(%[src]),%[temp64]\n\t"
> "adcq 4*8(%[src]),%[temp64]\n\t"
> "mov %k[temp64],%[res32]\n\t"
> "rorx $32,%[temp64],%[temp64]\n\t"
> "adcl %k[temp64],%[res32]\n\t"
> "adcl $0,%[res32]"
> : [temp64] "+r" (temp64), [res32] "=r" (res32)
> : [src] "r" (buff)
> : "memory");
> return (__force __wsum)res32;
> }
I actually get better performance in hyperthread benchmarks with 2 accum:
Used:
        u64 res;
        temp64 = (__force uint64_t)sum;
        asm("movq 0*8(%[src]),%[res]\n\t"
            "addq 1*8(%[src]),%[res]\n\t"
            "adcq 2*8(%[src]),%[res]\n\t"
            "adcq $0, %[res]\n"
            "addq 3*8(%[src]),%[temp64]\n\t"
            "adcq 4*8(%[src]),%[temp64]\n\t"
            "adcq %[res], %[temp64]\n\t"
            "mov %k[temp64],%k[res]\n\t"
            "rorx $32,%[temp64],%[temp64]\n\t"
            "adcl %k[temp64],%k[res]\n\t"
            "adcl $0,%k[res]"
            : [temp64] "+r"(temp64), [res] "=&r"(res)
            : [src] "r"(buff)
            : "memory");
        return (__force __wsum)res;
w/ hyperthread:
size, 2acc lat, 1acc lat, 2acc tput, 1acc tput
40, 6.511, 7.863, 6.177, 6.157
w/o hyperthread:
size, 2acc lat, 1acc lat, 2acc tput, 1acc tput
40, 5.577, 6.764, 3.150, 3.210
On Fri, Nov 26, 2021 at 11:50 AM Noah Goldstein <[email protected]> wrote:
>
> Bright :) but it will need a BMI support check.
Yes, probably not worth the pain.
>
> I actually get better performance in hyperthread benchmarks with 2 accum:
>
> Used:
>
> u64 res;
> temp64 = (__force uint64_t)sum;
> asm("movq 0*8(%[src]),%[res]\n\t"
> "addq 1*8(%[src]),%[res]\n\t"
> "adcq 2*8(%[src]),%[res]\n\t"
> "adcq $0, %[res]\n"
> "addq 3*8(%[src]),%[temp64]\n\t"
> "adcq 4*8(%[src]),%[temp64]\n\t"
> "adcq %[res], %[temp64]\n\t"
> "mov %k[temp64],%k[res]\n\t"
> "rorx $32,%[temp64],%[temp64]\n\t"
> "adcl %k[temp64],%k[res]\n\t"
> "adcl $0,%k[res]"
> : [temp64] "+r"(temp64), [res] "=&r"(res)
> : [src] "r"(buff)
> : "memory");
> return (__force __wsum)res;
>
> w/ hyperthread:
> size, 2acc lat, 1acc lat, 2acc tput, 1acc tput
> 40, 6.511, 7.863, 6.177, 6.157
>
> w/o hyperthread:
> size, 2acc lat, 1acc lat, 2acc tput, 1acc tput
> 40, 5.577, 6.764, 3.150, 3.210
On Fri, Nov 26, 2021 at 2:07 PM Eric Dumazet <[email protected]> wrote:
>
> On Fri, Nov 26, 2021 at 11:50 AM Noah Goldstein <[email protected]> wrote:
> >
> > Bright :) but it will need a BMI support check.
>
> Yes, probably not worth the pain.
Making a V2 of my patch with your optimization for the loop case. Do you think
1 or 2 accum for the 32-byte case?
>
> >
> > I actually get better performance in hyperthread benchmarks with 2 accum:
> >
> > Used:
> >
> > u64 res;
> > temp64 = (__force uint64_t)sum;
> > asm("movq 0*8(%[src]),%[res]\n\t"
> > "addq 1*8(%[src]),%[res]\n\t"
> > "adcq 2*8(%[src]),%[res]\n\t"
> > "adcq $0, %[res]\n"
> > "addq 3*8(%[src]),%[temp64]\n\t"
> > "adcq 4*8(%[src]),%[temp64]\n\t"
> > "adcq %[res], %[temp64]\n\t"
> > "mov %k[temp64],%k[res]\n\t"
> > "rorx $32,%[temp64],%[temp64]\n\t"
> > "adcl %k[temp64],%k[res]\n\t"
> > "adcl $0,%k[res]"
> > : [temp64] "+r"(temp64), [res] "=&r"(res)
> > : [src] "r"(buff)
> > : "memory");
> > return (__force __wsum)res;
> >
> > w/ hyperthread:
> > size, 2acc lat, 1acc lat, 2acc tput, 1acc tput
> > 40, 6.511, 7.863, 6.177, 6.157
> >
> > w/o hyperthread:
> > size, 2acc lat, 1acc lat, 2acc tput, 1acc tput
> > 40, 5.577, 6.764, 3.150, 3.210
On Fri, Nov 26, 2021 at 12:33 PM Noah Goldstein <[email protected]> wrote:
>
> On Fri, Nov 26, 2021 at 2:07 PM Eric Dumazet <[email protected]> wrote:
> >
> > On Fri, Nov 26, 2021 at 11:50 AM Noah Goldstein <[email protected]> wrote:
> > >
> > > Bright :) but it will need a BMI support check.
> >
> > Yes, probably not worth the pain.
>
> Making a V2 for my patch with your optimization for the loop case. Do you think
> 1 or 2 accum for the 32 byte case?
>
I would vote for something simpler, thus one accum, since this 32-byte
block is only run one time?
Thanks !
On Fri, Nov 26, 2021 at 6:15 PM Eric Dumazet <[email protected]> wrote:
>
> On Fri, Nov 26, 2021 at 12:33 PM Noah Goldstein <[email protected]> wrote:
> >
> > On Fri, Nov 26, 2021 at 2:07 PM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Fri, Nov 26, 2021 at 11:50 AM Noah Goldstein <[email protected]> wrote:
> > > >
> > > > Bright :) but it will need a BMI support check.
> > >
> > > Yes, probably not worth the pain.
> >
> > Making a V2 for my patch with your optimization for the loop case. Do you think
> > 1 or 2 accum for the 32 byte case?
> >
>
> I would vote for something simpler, thus one accum, since this 32byte
> block is only run one time ?
If the one-at-a-time performance is what's most important, wouldn't that
argue in favor of 2x accum because it leads to decreased latency? Or are you
saying it's not that important, so simpler code is the priority?
>
> Thanks !
Modify the 8x loop so that it uses two independent
accumulators. Despite adding more instructions, the latency and
throughput of the loop are improved because the `adc` chains can now
take advantage of multiple execution units.
Make the memory clobbers more precise. 'buff' is read-only and we know
the exact usage range. There is no reason to write-clobber all memory.
Relative performance changes on Tigerlake:
Time Unit: Ref Cycles
Size Unit: Bytes
size, lat old, lat new, tput old, tput new
0, 4.961, 4.901, 4.887, 4.951
8, 5.590, 5.620, 4.227, 4.252
16, 6.182, 6.202, 4.233, 4.278
24, 7.392, 7.380, 4.256, 4.279
32, 7.371, 7.390, 4.550, 4.537
40, 8.621, 8.601, 4.862, 4.836
48, 9.406, 9.374, 5.206, 5.234
56, 10.535, 10.522, 5.416, 5.447
64, 10.000, 7.590, 6.946, 6.989
100, 14.218, 12.476, 9.429, 9.441
200, 22.115, 16.937, 13.088, 12.852
300, 31.826, 24.640, 19.383, 18.230
400, 39.016, 28.133, 23.223, 21.304
500, 48.815, 36.186, 30.331, 27.104
600, 56.732, 40.120, 35.899, 30.363
700, 66.623, 48.178, 43.044, 36.400
800, 73.259, 51.171, 48.564, 39.173
900, 82.821, 56.635, 58.592, 45.162
1000, 90.780, 63.703, 65.658, 48.718
Signed-off-by: Noah Goldstein <[email protected]>
tmp
---
arch/x86/lib/csum-partial_64.c | 38 +++++++++++++++++-----------------
1 file changed, 19 insertions(+), 19 deletions(-)
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index ded842cd1020..52540f148ebb 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -48,18 +48,21 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
}
while (unlikely(len >= 64)) {
- asm("addq 0*8(%[src]),%[res]\n\t"
- "adcq 1*8(%[src]),%[res]\n\t"
- "adcq 2*8(%[src]),%[res]\n\t"
- "adcq 3*8(%[src]),%[res]\n\t"
- "adcq 4*8(%[src]),%[res]\n\t"
- "adcq 5*8(%[src]),%[res]\n\t"
+ u64 temp_accum;
+
+ asm("movq 0*8(%[src]),%[res_tmp]\n\t"
+ "addq 1*8(%[src]),%[res_tmp]\n\t"
+ "adcq 2*8(%[src]),%[res_tmp]\n\t"
+ "adcq 3*8(%[src]),%[res_tmp]\n\t"
+ "adcq 4*8(%[src]),%[res_tmp]\n\t"
+ "adcq $0,%[res_tmp]\n\t"
+ "addq 5*8(%[src]),%[res]\n\t"
"adcq 6*8(%[src]),%[res]\n\t"
"adcq 7*8(%[src]),%[res]\n\t"
- "adcq $0,%[res]"
- : [res] "+r" (temp64)
- : [src] "r" (buff)
- : "memory");
+ "adcq %[res_tmp], %[res]\n\t"
+ "adcq $0,%[res]\n\t"
+ : [res] "+r"(temp64), [res_tmp] "=&r"(temp_accum)
+ : [src] "r"(buff), "m"(*(const char(*)[64])buff));
buff += 64;
len -= 64;
}
@@ -70,26 +73,23 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
"adcq 2*8(%[src]),%[res]\n\t"
"adcq 3*8(%[src]),%[res]\n\t"
"adcq $0,%[res]"
- : [res] "+r" (temp64)
- : [src] "r" (buff)
- : "memory");
+ : [res] "+r"(temp64)
+ : [src] "r"(buff), "m"(*(const char(*)[32])buff));
buff += 32;
}
if (len & 16) {
asm("addq 0*8(%[src]),%[res]\n\t"
"adcq 1*8(%[src]),%[res]\n\t"
"adcq $0,%[res]"
- : [res] "+r" (temp64)
- : [src] "r" (buff)
- : "memory");
+ : [res] "+r"(temp64)
+ : [src] "r"(buff), "m"(*(const char(*)[16])buff));
buff += 16;
}
if (len & 8) {
asm("addq 0*8(%[src]),%[res]\n\t"
"adcq $0,%[res]"
- : [res] "+r" (temp64)
- : [src] "r" (buff)
- : "memory");
+ : [res] "+r"(temp64)
+ : [src] "r"(buff), "m"(*(const char(*)[8])buff));
buff += 8;
}
if (len & 7) {
--
2.25.1
On Fri, Nov 26, 2021 at 8:25 PM Noah Goldstein <[email protected]> wrote:
>
> Modify the 8x loop to that it uses two independent
> accumulators. Despite adding more instructions the latency and
> throughput of the loop is improved because the `adc` chains can now
> take advantage of multiple execution units.
>
> Make the memory clobbers more precise. 'buff' is read only and we know
> the exact usage range. There is no reason to write-clobber all memory.
>
> Relative performance changes on Tigerlake:
>
> Time Unit: Ref Cycles
> Size Unit: Bytes
>
> size, lat old, lat new, tput old, tput new
> 0, 4.961, 4.901, 4.887, 4.951
> 8, 5.590, 5.620, 4.227, 4.252
> 16, 6.182, 6.202, 4.233, 4.278
> 24, 7.392, 7.380, 4.256, 4.279
> 32, 7.371, 7.390, 4.550, 4.537
> 40, 8.621, 8.601, 4.862, 4.836
> 48, 9.406, 9.374, 5.206, 5.234
> 56, 10.535, 10.522, 5.416, 5.447
> 64, 10.000, 7.590, 6.946, 6.989
> 100, 14.218, 12.476, 9.429, 9.441
> 200, 22.115, 16.937, 13.088, 12.852
> 300, 31.826, 24.640, 19.383, 18.230
> 400, 39.016, 28.133, 23.223, 21.304
> 500, 48.815, 36.186, 30.331, 27.104
> 600, 56.732, 40.120, 35.899, 30.363
> 700, 66.623, 48.178, 43.044, 36.400
> 800, 73.259, 51.171, 48.564, 39.173
> 900, 82.821, 56.635, 58.592, 45.162
> 1000, 90.780, 63.703, 65.658, 48.718
>
> Signed-off-by: Noah Goldstein <[email protected]>
>
> tmp
SGTM (not sure what this 'tmp' string means here :) )
Reviewed-by: Eric Dumazet <[email protected]>
On Sat, Nov 27, 2021 at 12:03 AM Eric Dumazet <[email protected]> wrote:
>
> On Fri, Nov 26, 2021 at 8:25 PM Noah Goldstein <[email protected]> wrote:
> >
> > Modify the 8x loop to that it uses two independent
> > accumulators. Despite adding more instructions the latency and
> > throughput of the loop is improved because the `adc` chains can now
> > take advantage of multiple execution units.
> >
> > Make the memory clobbers more precise. 'buff' is read only and we know
> > the exact usage range. There is no reason to write-clobber all memory.
> >
> > Relative performance changes on Tigerlake:
> >
> > Time Unit: Ref Cycles
> > Size Unit: Bytes
> >
> > size, lat old, lat new, tput old, tput new
> > 0, 4.961, 4.901, 4.887, 4.951
> > 8, 5.590, 5.620, 4.227, 4.252
> > 16, 6.182, 6.202, 4.233, 4.278
> > 24, 7.392, 7.380, 4.256, 4.279
> > 32, 7.371, 7.390, 4.550, 4.537
> > 40, 8.621, 8.601, 4.862, 4.836
> > 48, 9.406, 9.374, 5.206, 5.234
> > 56, 10.535, 10.522, 5.416, 5.447
> > 64, 10.000, 7.590, 6.946, 6.989
> > 100, 14.218, 12.476, 9.429, 9.441
> > 200, 22.115, 16.937, 13.088, 12.852
> > 300, 31.826, 24.640, 19.383, 18.230
> > 400, 39.016, 28.133, 23.223, 21.304
> > 500, 48.815, 36.186, 30.331, 27.104
> > 600, 56.732, 40.120, 35.899, 30.363
> > 700, 66.623, 48.178, 43.044, 36.400
> > 800, 73.259, 51.171, 48.564, 39.173
> > 900, 82.821, 56.635, 58.592, 45.162
> > 1000, 90.780, 63.703, 65.658, 48.718
> >
> > Signed-off-by: Noah Goldstein <[email protected]>
> >
> > tmp
>
> SGTM (not sure what this 'tmp' string means here :) )
>
> Reviewed-by: Eric Dumazet <[email protected]>
Poor rebasing practices :/
Fixed in V3 (only change).
Modify the 8x loop so that it uses two independent
accumulators. Despite adding more instructions, the latency and
throughput of the loop are improved because the `adc` chains can now
take advantage of multiple execution units.
Make the memory clobbers more precise. 'buff' is read-only and we know
the exact usage range. There is no reason to write-clobber all memory.
Relative performance changes on Tigerlake:
Time Unit: Ref Cycles
Size Unit: Bytes
size, lat old, lat new, tput old, tput new
0, 4.961, 4.901, 4.887, 4.951
8, 5.590, 5.620, 4.227, 4.252
16, 6.182, 6.202, 4.233, 4.278
24, 7.392, 7.380, 4.256, 4.279
32, 7.371, 7.390, 4.550, 4.537
40, 8.621, 8.601, 4.862, 4.836
48, 9.406, 9.374, 5.206, 5.234
56, 10.535, 10.522, 5.416, 5.447
64, 10.000, 7.590, 6.946, 6.989
100, 14.218, 12.476, 9.429, 9.441
200, 22.115, 16.937, 13.088, 12.852
300, 31.826, 24.640, 19.383, 18.230
400, 39.016, 28.133, 23.223, 21.304
500, 48.815, 36.186, 30.331, 27.104
600, 56.732, 40.120, 35.899, 30.363
700, 66.623, 48.178, 43.044, 36.400
800, 73.259, 51.171, 48.564, 39.173
900, 82.821, 56.635, 58.592, 45.162
1000, 90.780, 63.703, 65.658, 48.718
Signed-off-by: Noah Goldstein <[email protected]>
---
arch/x86/lib/csum-partial_64.c | 38 +++++++++++++++++-----------------
1 file changed, 19 insertions(+), 19 deletions(-)
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index ded842cd1020..52540f148ebb 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -48,18 +48,21 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
}
while (unlikely(len >= 64)) {
- asm("addq 0*8(%[src]),%[res]\n\t"
- "adcq 1*8(%[src]),%[res]\n\t"
- "adcq 2*8(%[src]),%[res]\n\t"
- "adcq 3*8(%[src]),%[res]\n\t"
- "adcq 4*8(%[src]),%[res]\n\t"
- "adcq 5*8(%[src]),%[res]\n\t"
+ u64 temp_accum;
+
+ asm("movq 0*8(%[src]),%[res_tmp]\n\t"
+ "addq 1*8(%[src]),%[res_tmp]\n\t"
+ "adcq 2*8(%[src]),%[res_tmp]\n\t"
+ "adcq 3*8(%[src]),%[res_tmp]\n\t"
+ "adcq 4*8(%[src]),%[res_tmp]\n\t"
+ "adcq $0,%[res_tmp]\n\t"
+ "addq 5*8(%[src]),%[res]\n\t"
"adcq 6*8(%[src]),%[res]\n\t"
"adcq 7*8(%[src]),%[res]\n\t"
- "adcq $0,%[res]"
- : [res] "+r" (temp64)
- : [src] "r" (buff)
- : "memory");
+ "adcq %[res_tmp], %[res]\n\t"
+ "adcq $0,%[res]\n\t"
+ : [res] "+r"(temp64), [res_tmp] "=&r"(temp_accum)
+ : [src] "r"(buff), "m"(*(const char(*)[64])buff));
buff += 64;
len -= 64;
}
@@ -70,26 +73,23 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
"adcq 2*8(%[src]),%[res]\n\t"
"adcq 3*8(%[src]),%[res]\n\t"
"adcq $0,%[res]"
- : [res] "+r" (temp64)
- : [src] "r" (buff)
- : "memory");
+ : [res] "+r"(temp64)
+ : [src] "r"(buff), "m"(*(const char(*)[32])buff));
buff += 32;
}
if (len & 16) {
asm("addq 0*8(%[src]),%[res]\n\t"
"adcq 1*8(%[src]),%[res]\n\t"
"adcq $0,%[res]"
- : [res] "+r" (temp64)
- : [src] "r" (buff)
- : "memory");
+ : [res] "+r"(temp64)
+ : [src] "r"(buff), "m"(*(const char(*)[16])buff));
buff += 16;
}
if (len & 8) {
asm("addq 0*8(%[src]),%[res]\n\t"
"adcq $0,%[res]"
- : [res] "+r" (temp64)
- : [src] "r" (buff)
- : "memory");
+ : [res] "+r"(temp64)
+ : [src] "r"(buff), "m"(*(const char(*)[8])buff));
buff += 8;
}
if (len & 7) {
--
2.25.1
On Fri, Nov 26, 2021 at 10:39 PM Noah Goldstein <[email protected]> wrote:
>
> Modify the 8x loop to that it uses two independent
> accumulators. Despite adding more instructions the latency and
> throughput of the loop is improved because the `adc` chains can now
> take advantage of multiple execution units.
Oh well, there was really no need to resend this, especially if you do
not add my ack.
Reviewed-by: Eric Dumazet <[email protected]>
On Sat, Nov 27, 2021 at 12:51 AM Eric Dumazet <[email protected]> wrote:
>
> On Fri, Nov 26, 2021 at 10:39 PM Noah Goldstein <[email protected]> wrote:
> >
> > Modify the 8x loop to that it uses two independent
> > accumulators. Despite adding more instructions the latency and
> > throughput of the loop is improved because the `adc` chains can now
> > take advantage of multiple execution units.
>
> Oh well, there was really no need to resend this, especially if you do
> not add my ack.
>
> Reviewed-by: Eric Dumazet <[email protected]>
Sorry!
Is the protocol for tags to let the maintainer add them or
should I post a V4 with your tag?
...
> Regarding the 32 byte case, adding two accumulators helps with the latency
> numbers but causes a regression in throughput for the 40/48 byte cases. Which
> is the more important metric for the usage of csum_partial()?
>
> Here are the numbers for the smaller sizes:
>
> size, lat old, lat ver2, lat ver1, tput old, tput ver2, tput ver1
> 0, 4.961, 4.503, 4.901, 4.887, 4.399, 4.951
> 8, 5.590, 5.594, 5.620, 4.227, 4.110, 4.252
> 16, 6.182, 6.398, 6.202, 4.233, 4.062, 4.278
> 24, 7.392, 7.591, 7.380, 4.256, 4.246, 4.279
> 32, 7.371, 6.366, 7.390, 4.550, 4.900, 4.537
> 40, 8.621, 7.496, 8.601, 4.862, 5.162, 4.836
> 48, 9.406, 8.128, 9.374, 5.206, 5.736, 5.234
> 56, 10.535, 9.189, 10.522, 5.416, 5.772, 5.447
> 64, 10.000, 7.487, 7.590, 6.946, 6.975, 6.989
> 72, 11.192, 8.639, 8.763, 7.210, 7.311, 7.277
> 80, 11.734, 9.179, 9.409, 7.605, 7.620, 7.548
> 88, 12.933, 10.545, 10.584, 7.878, 7.902, 7.858
> 96, 12.952, 9.331, 10.625, 8.168, 8.470, 8.206
> 104, 14.206, 10.424, 11.839, 8.491, 8.785, 8.502
> 112, 14.763, 11.403, 12.416, 8.798, 9.134, 8.771
> 120, 15.955, 12.635, 13.651, 9.175, 9.494, 9.130
> 128, 15.271, 10.599, 10.724, 9.726, 9.672, 9.655
>
> 'ver2' uses two accumulators for 32 byte case and has better latency numbers
> but regressions in tput compared to 'old' and 'ver1'. 'ver1' is the
> implementation
> posted which has essentially the same numbers for tput/lat as 'old'
> for sizes [0, 63].
Which cpu are you testing on - it will make a big difference?
And what are you measuring throughput in?
And are you testing aligned or mis-aligned 64bit reads?
I think one of the performance counters will give 'cpu clocks'.
I did some tests early last year and got 8 bytes/clock on broadwell/haswell
with code that 'loop carried' the carry flag (a single adc chain).
On the older Intel cpu (Ivy bridge onwards) 'adc' has a latency of 2
for the result, but the carry flag is available earlier.
So alternating the target register in the 'adc' chain will give (nearly)
8 bytes/clock - I think I got to 7.5.
That can all be done with only 4 reads per iteration.
IIRC broadwell/haswell only need 2 reads/iteration.
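For reference, a self-contained sketch of that trick (my reconstruction, with
userland types, one 64-byte block): a single carry chain, but the adc
destination alternates between two registers, so the 2-cycle adc result
latency on those older parts is hidden while CF itself is forwarded every
cycle.

```
#include <stdint.h>

/* sketch of the "alternate the adc destination" idea (illustration only) */
static uint64_t csum64_alt(const void *buff, uint64_t sum)
{
        uint64_t a = sum, b = 0;

        asm("addq 0*8(%[src]),%[a]\n\t"
            "adcq 1*8(%[src]),%[b]\n\t"
            "adcq 2*8(%[src]),%[a]\n\t"
            "adcq 3*8(%[src]),%[b]\n\t"
            "adcq 4*8(%[src]),%[a]\n\t"
            "adcq 5*8(%[src]),%[b]\n\t"
            "adcq 6*8(%[src]),%[a]\n\t"
            "adcq 7*8(%[src]),%[b]\n\t"
            "adcq %[b],%[a]\n\t"        /* fold the second register back in */
            "adcq $0,%[a]"              /* and the final carry */
            : [a] "+r" (a), [b] "+r" (b)
            : [src] "r" (buff), "m" (*(const char (*)[64])buff));
        return a;
}
```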
It is actually likely (certainly worth checking) that haswell/broadwell
can do two misaligned memory reads every clock.
So it may not be worth aligning the reads (which the old code did).
In any case aren't tx packets likely to be aligned, and rx ones
misaligned to some known 4n+2 boundary?
Using adcx/adox together is a right PITA.
I did get (about) 12 bytes/clock for long buffers while loop carrying
both the overflow and carry flags.
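And a per-64-byte sketch of the two-flag variant (my reconstruction, simpler
than the flag-carrying loop described above; it assumes ADX support and would
need a CPU feature check): chain A rides CF via adcx, chain B rides OF via
adox, then the two chains are folded at the end.

```
#include <stdint.h>

/* two independent carry chains via adcx (CF) and adox (OF); sketch only */
static uint64_t csum64_adx(const void *buff, uint64_t sum)
{
        uint64_t a = sum, b = 0, t;

        asm("xorl %k[t],%k[t]\n\t"              /* t = 0, clears CF and OF */
            "adcx 0*8(%[src]),%[a]\n\t"
            "adox 1*8(%[src]),%[b]\n\t"
            "adcx 2*8(%[src]),%[a]\n\t"
            "adox 3*8(%[src]),%[b]\n\t"
            "adcx 4*8(%[src]),%[a]\n\t"
            "adox 5*8(%[src]),%[b]\n\t"
            "adcx 6*8(%[src]),%[a]\n\t"
            "adox 7*8(%[src]),%[b]\n\t"
            "seto %b[t]\n\t"                    /* chain B's pending carry (OF) */
            "adcq %[b],%[a]\n\t"                /* fold chain B, consuming chain A's CF */
            "adcq %[t],%[a]\n\t"                /* add chain B's carry */
            "adcq $0,%[a]"                      /* final fold */
            : [a] "+r" (a), [b] "+r" (b), [t] "=&r" (t)
            : [src] "r" (buff), "m" (*(const char (*)[64])buff));
        return a;
}
```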
Also is there a copy of the patched code anywhere?
I think I've missed some of the patches and they are difficult to follow.
David
On Sun, Nov 28, 2021 at 1:47 PM David Laight <[email protected]> wrote:
>
> ...
> > Regarding the 32 byte case, adding two accumulators helps with the latency
> > numbers but causes a regression in throughput for the 40/48 byte cases. Which
> > is the more important metric for the usage of csum_partial()?
> >
> > Here are the numbers for the smaller sizes:
> >
> > size, lat old, lat ver2, lat ver1, tput old, tput ver2, tput ver1
> > 0, 4.961, 4.503, 4.901, 4.887, 4.399, 4.951
> > 8, 5.590, 5.594, 5.620, 4.227, 4.110, 4.252
> > 16, 6.182, 6.398, 6.202, 4.233, 4.062, 4.278
> > 24, 7.392, 7.591, 7.380, 4.256, 4.246, 4.279
> > 32, 7.371, 6.366, 7.390, 4.550, 4.900, 4.537
> > 40, 8.621, 7.496, 8.601, 4.862, 5.162, 4.836
> > 48, 9.406, 8.128, 9.374, 5.206, 5.736, 5.234
> > 56, 10.535, 9.189, 10.522, 5.416, 5.772, 5.447
> > 64, 10.000, 7.487, 7.590, 6.946, 6.975, 6.989
> > 72, 11.192, 8.639, 8.763, 7.210, 7.311, 7.277
> > 80, 11.734, 9.179, 9.409, 7.605, 7.620, 7.548
> > 88, 12.933, 10.545, 10.584, 7.878, 7.902, 7.858
> > 96, 12.952, 9.331, 10.625, 8.168, 8.470, 8.206
> > 104, 14.206, 10.424, 11.839, 8.491, 8.785, 8.502
> > 112, 14.763, 11.403, 12.416, 8.798, 9.134, 8.771
> > 120, 15.955, 12.635, 13.651, 9.175, 9.494, 9.130
> > 128, 15.271, 10.599, 10.724, 9.726, 9.672, 9.655
> >
> > 'ver2' uses two accumulators for 32 byte case and has better latency numbers
> > but regressions in tput compared to 'old' and 'ver1'. 'ver1' is the
> > implementation
> > posted which has essentially the same numbers for tput/lat as 'old'
> > for sizes [0, 63].
>
> Which cpu are you testing on - it will make a big difference ?
Tigerlake, although assuming `adc` is the bottleneck, the results
should be largely independent.
> And what are you measuring throughput in?
Running back to back iterations with the same input without any
dependency between iterations. The OoO window will include
multiple iterations at once.
> And are you testing aligned or mis-aligned 64bit reads?
Aligned as that is the common case.
>
> I think one of the performance counters will give 'cpu clocks'.
Time is in Ref Cycles using `rdtsc`
>
> I did some tests early last year and got 8 bytes/clock on broadwell/haswell
> with code that 'loop carried' the carry flag (a single adc chain).
> On the older Intel cpu (Ivy bridge onwards) 'adc' has a latency of 2
> for the result, but the carry flag is available earlier.
> So alternating the target register in the 'adc' chain will give (nearly)
> 8 bytes/clock - I think I got to 7.5.
>
> That can all be done with only 4 reads per iteration.
> IIRC broadwell/haswell only need 2 reads/iteration.
>
> It is actually likely (certainly worth checking) that haswell/broadwell
> can do two misaligned memory reads every clock.
> So it may not be worth aligning the reads (which the old code did).
> In any case aren't tx packets likely to be aligned, and rx ones
> misaligned to some known 4n+2 boundary?
>
> Using adxc/adxo together is a right PITA.
I'm a bit hesitant about adxc/adxo because they are extensions so
support will need to be tested.
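If we went that route, the call would presumably need to be gated on the
CPUID bit. A rough sketch of the dispatch (the csum_partial_adx/adc helper
names here are invented; X86_FEATURE_ADX and static_cpu_has() are the
existing kernel facilities for the adcx/adox feature bit):

#include <linux/types.h>
#include <asm/cpufeature.h>

/* Hypothetical helpers, not part of the posted patch. */
__wsum csum_partial_adc(const void *buff, int len, __wsum sum); /* adc chains */
__wsum csum_partial_adx(const void *buff, int len, __wsum sum); /* adcx/adox  */

__wsum csum_partial(const void *buff, int len, __wsum sum)
{
	if (static_cpu_has(X86_FEATURE_ADX))
		return csum_partial_adx(buff, len, sum);
	return csum_partial_adc(buff, len, sum);
}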
> I did get (about) 12 bytes/clock for long buffers while loop carrying
> both the overflow and carry flags.
>
> Also is there a copy of the patched code anywhere?
https://lore.kernel.org/lkml/CANn89iLpFOok_tv=DKsLX1mxZGdHQgATdW4Xs0rc6oaXQEa5Ng@mail.gmail.com/T/
> I think I've missed some of the patches and they are difficult to follow.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
From: Noah Goldstein
> Sent: 28 November 2021 21:00
>
> On Sun, Nov 28, 2021 at 1:47 PM David Laight <[email protected]> wrote:
> >
> > ...
> > > Regarding the 32 byte case, adding two accumulators helps with the latency
> > > numbers but causes a regression in throughput for the 40/48 byte cases. Which
> > > is the more important metric for the usage of csum_partial()?
> > >
> > > Here are the numbers for the smaller sizes:
> > >
> > > size, lat old, lat ver2, lat ver1, tput old, tput ver2, tput ver1
> > > 0, 4.961, 4.503, 4.901, 4.887, 4.399, 4.951
> > > 8, 5.590, 5.594, 5.620, 4.227, 4.110, 4.252
> > > 16, 6.182, 6.398, 6.202, 4.233, 4.062, 4.278
> > > 24, 7.392, 7.591, 7.380, 4.256, 4.246, 4.279
> > > 32, 7.371, 6.366, 7.390, 4.550, 4.900, 4.537
> > > 40, 8.621, 7.496, 8.601, 4.862, 5.162, 4.836
> > > 48, 9.406, 8.128, 9.374, 5.206, 5.736, 5.234
> > > 56, 10.535, 9.189, 10.522, 5.416, 5.772, 5.447
> > > 64, 10.000, 7.487, 7.590, 6.946, 6.975, 6.989
> > > 72, 11.192, 8.639, 8.763, 7.210, 7.311, 7.277
> > > 80, 11.734, 9.179, 9.409, 7.605, 7.620, 7.548
> > > 88, 12.933, 10.545, 10.584, 7.878, 7.902, 7.858
> > > 96, 12.952, 9.331, 10.625, 8.168, 8.470, 8.206
> > > 104, 14.206, 10.424, 11.839, 8.491, 8.785, 8.502
> > > 112, 14.763, 11.403, 12.416, 8.798, 9.134, 8.771
> > > 120, 15.955, 12.635, 13.651, 9.175, 9.494, 9.130
> > > 128, 15.271, 10.599, 10.724, 9.726, 9.672, 9.655
> > >
> > > 'ver2' uses two accumulators for 32 byte case and has better latency numbers
> > > but regressions in tput compared to 'old' and 'ver1'. 'ver1' is the
> > > implementation
> > > posted which has essentially the same numbers for tput/lat as 'old'
> > > for sizes [0, 63].
> >
> > Which cpu are you testing on - it will make a big difference ?
>
> Tigerlake, although assuming `adc` is the bottleneck, the results
> should be largely independent.
The cpu definitely makes a difference.
Although the big changes for mainstream Intel cpus came before Ivy/Sandy bridge
and then with Broadwell/Haswell, which improved the latency of adc from 2 clocks
to 1 clock.
The later cpus also have extra instruction ports - which can help
parallel execution.
That could well make a big difference where you've duplicated the adc chain.
Interestingly, an adc chain can only do one add (8 bytes) per clock.
But you should be able to do two 4 byte loads and add them to two separate
64bit registers every clock.
That is also 8 bytes/clock.
So a C loop:
for (; buf_32 != buf_end; buf_32 += 2) {
	sum_64a += buf_32[0];
	sum_64b += buf_32[1];
}
Might be as fast as the asm one!
It probably needs unrolling once, and may need 'adjusting'
so that the loop test is 'add $d, %reg; jnz 1b'.
The 'add' and 'jnz' should then get 'fused' into a single u-op.
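Fleshed out slightly - just a sketch of the above, assuming len is a
multiple of 16 and ignoring the final 16-bit fold:

static unsigned long csum_c_sketch(const void *buf, unsigned long len)
{
	const unsigned int *buf_32 = buf;
	unsigned long sum_64a = 0, sum_64b = 0;

	/* Two independent accumulators fed by 32-bit loads, unrolled once
	 * so each pass consumes 16 bytes. */
	for (; len >= 16; len -= 16, buf_32 += 4) {
		sum_64a += buf_32[0];
		sum_64b += buf_32[1];
		sum_64a += buf_32[2];
		sum_64b += buf_32[3];
	}
	return sum_64a + sum_64b;
}

Whether gcc really emits the fused add/jnz pair for the loop test would
need checking in the disassembly.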
> > And what are you measuring throughput in?
>
> Running back to back iterations with the same input without any
> dependency between iterations. The OoO window will include
> multiple iterations at once.
>
> > And are you testing aligned or mis-aligned 64bit reads?
>
> Aligned as that is the common case.
I was thinking that the code to align the buffer start may not
be needed - so the test can be removed without affecting performance.
Especially if aligned buffers are 'expected'.
So you just have to support variable lengths, not alignments.
> > I think one of the performance counters will give 'cpu clocks'.
>
> Time is in Ref Cycles using `rdtsc`
Hmmm...
Unless you manage to lock the cpu frequency governor (or whatever it
is called), rdtsc is almost useless for timing instructions.
The cpu hardware will change the clock multiplier on you.
I've used histogram[delta_cycles() >> n]++ (with bound checking) to
get a distribution of the cost of each pass - rather than an average.
I last used that testing writev("/dev/null", ...).
Each run gave a sharp peak - but it differed run to run!
Possibly because of the relative alignment of the stacks and buffers.
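Roughly like this (a from-memory sketch, the names are made up):

#define CYCLE_SHIFT	4		/* bucket width: 16 cycles */
#define NBUCKETS	256

static unsigned int histogram[NBUCKETS];

static void record_delta(unsigned int delta_cycles)
{
	unsigned int bucket = delta_cycles >> CYCLE_SHIFT;

	if (bucket >= NBUCKETS)		/* bound check: clamp outliers */
		bucket = NBUCKETS - 1;
	histogram[bucket]++;
}

Dumping the table afterwards shows the peak - and any tail - rather than
a single averaged number.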
...
> > Using adxc/adxo together is a right PITA.
>
> I'm a bit hesitant about adxc/adxo because they are extensions so
> support will need to be tested.
Indeed.
> https://lore.kernel.org/lkml/CANn89iLpFOok_tv=DKsLX1mxZGdHQgATdW4Xs0rc6oaXQEa5Ng@mail.gmail.com/T/
Ah that is your patch that just changes the asm() block.
I was looking for Eric's changes as well.
I spent too long trying to optimise this code last year.
Mostly because I found a really crap version in one of our applications
that is checksumming UDP/IPv4 RTP packets before sending them on a RAW
socket. These are about 200 bytes each and we are sending 100s every 20ms.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
I've dug out my test program and measured the performance of
various copies of the inner loop - usually 64 bytes/iteration.
Code is below.
It uses the hardware performance counter to get the number of
clocks the inner loop takes.
This is reasonably stable once the branch predictor has settled down.
So the difference in clocks between a 64 byte buffer and a 128 byte
buffer is the number of clocks for 64 bytes.
(Unlike the TSC the pmc count doesn't depend on the cpu frequency.)
What is interesting is that even some of the trivial loops appear
to be doing 16 bytes per clock for short buffers - which is impossible.
Checksum 1k bytes and you get an entirely different answer.
The only loop that really exceeds 8 bytes/clock for long buffers
is the adxc/adoc one.
What is almost certainly happening is that all the memory reads and
the dependent add/adc instructions are all queued up in the 'out of
order' execution unit.
Since 'rdpmc' isn't a serialising instruction they can still be
outstanding when the function returns.
Uncomment the 'rdtsc' and you get much slower values for short buffers.
When testing the full checksum function the queued up memory
reads and adc are probably running in parallel with the logic
that is handling lengths that aren't multiples of 64.
I also found nothing consistently different for misaligned reads.
These were all tested on my i7-7700 cpu.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>
static int init_pmc(void)
{
	static struct perf_event_attr perf_attr = {
		.type = PERF_TYPE_HARDWARE,
		.config = PERF_COUNT_HW_CPU_CYCLES,
		.pinned = 1,
	};
	struct perf_event_mmap_page *pc;
	int perf_fd;

	perf_fd = syscall(__NR_perf_event_open, &perf_attr, 0, -1, -1, 0);
	if (perf_fd < 0) {
		fprintf(stderr, "perf_event_open failed: errno %d\n", errno);
		exit(1);
	}
	pc = mmap(NULL, 4096, PROT_READ, MAP_SHARED, perf_fd, 0);
	if (pc == MAP_FAILED) {
		fprintf(stderr, "perf_event mmap() failed: errno %d\n", errno);
		exit(1);
	}
	return pc->index - 1;
}
static inline unsigned int rdpmc(unsigned int counter)
{
	unsigned int low, high;

	// asm volatile("rdtsc" : "=a" (low), "=d" (high));
	asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));

	// Return low bits; the counter might be 32 or 40 bits wide.
	return low;
}
#ifndef MODE
#define MODE 0
#endif
__attribute__((noinline))
unsigned long csum64(unsigned long csum, const unsigned char *buff, unsigned long len)
{
#if MODE == 0
// Simple 'adc' loop, abs max 8 bytes/clock.
// Will be very slow on old cpu (Pre Ivy bridge) cpu where 'dec'
// has a false dependency against the carry flag.
// Nearly as good if doing 32 bytes/iteration.
#define DFLT_OFFSET 40
long count = len/32;
asm(" clc\n"
"1: adcq (%[buff]), %[csum]\n"
" adcq 8(%[buff]), %[csum]\n"
" adcq 16(%[buff]), %[csum]\n"
" adcq 24(%[buff]), %[csum]\n"
" lea 32(%[buff]), %[buff]\n"
" dec %[count]\n"
" jnz 1b\n"
" adcq $0, %[csum]\n"
: [csum] "+&r" (csum), [len] "+&r" (len), [count] "+&r" (count)
: [buff] "r" (buff)
: "memory");
#endif
#if MODE == 1
// Alternate 'adc' loop that avoids using the carry flag.
#define DFLT_OFFSET 40
buff += len;
len = -len;
asm(" clc\n"
"1: jrcxz 2f\n"
" adcq 0(%[buff], %[len]), %[csum]\n"
" adcq 8(%[buff], %[len]), %[csum]\n"
" adcq 16(%[buff], %[len]), %[csum]\n"
" adcq 24(%[buff], %[len]), %[csum]\n"
" lea 32(%[len]), %[len]\n"
" jmp 1b\n"
"2: adcq $0, %[csum]\n"
: [csum] "+&r" (csum), [len] "+&c" (len)
: [buff] "r" (buff)
: "memory");
#endif
#if MODE == 2
// Existing kernel loop.
// The adc chain limits it to about 8 bytes/clock.
#define DFLT_OFFSET 40
long count = len/64;
asm("1: addq 0(%[buff]), %[csum]\n"
" adcq 8(%[buff]), %[csum]\n"
" adcq 16(%[buff]), %[csum]\n"
" adcq 24(%[buff]), %[csum]\n"
" adcq 32(%[buff]), %[csum]\n"
" adcq 40(%[buff]), %[csum]\n"
" adcq 48(%[buff]), %[csum]\n"
" adcq 56(%[buff]), %[csum]\n"
" adcq $0, %[csum]\n"
" lea 64(%[buff]), %[buff]\n"
" dec %[count]\n"
" jnz 1b"
: [csum] "+&r" (csum), [len] "+&r" (len), [count] "+&r" (count)
: [buff] "r" (buff)
: "memory");
#endif
#if MODE == 3
// Proposed kernel loop.
// This splits the adc chain in two.
// Faster than (2) for short buffers - not sure why.
#define DFLT_OFFSET 38
long count = len/64;
unsigned long csum1;
asm("1: movq 0(%[buff]), %[csum1]\n"
" adcq 8(%[buff]), %[csum1]\n"
" adcq 16(%[buff]), %[csum1]\n"
" adcq 24(%[buff]), %[csum1]\n"
" adcq 32(%[buff]), %[csum1]\n"
" adcq $0, %[csum1]\n"
" lea 64(%[buff]), %[buff]\n"
" addq 40-64(%[buff]), %[csum]\n"
" adcq 48-64(%[buff]), %[csum]\n"
" adcq 56-64(%[buff]), %[csum]\n"
" adcq %[csum1], %[csum]\n"
" adcq $0, %[csum]\n"
" dec %[count]\n"
" jnz 1b"
: [csum] "+&r" (csum), [csum1] "=&r" (csum1), [len] "+&r" (len), [count] "+&r" (count)
: [buff] "r" (buff)
: "memory");
#endif
#if MODE == 4
// Less unrolled version of (3).
// Maybe just as fast!
#define DFLT_OFFSET 43
long count = len;
unsigned long csum1;
asm("1: movq 0(%[buff]), %[csum1]\n"
" adcq 8(%[buff]), %[csum1]\n"
" adcq $0, %[csum1]\n"
" lea 32(%[buff]), %[buff]\n"
" addq 16-32(%[buff]), %[csum]\n"
" adcq 24-32(%[buff]), %[csum]\n"
" adcq %[csum1], %[csum]\n"
" adcq $0, %[csum]\n"
" sub $32, %[count]\n"
" jnz 1b"
: [csum] "+&r" (csum), [csum1] "=&r" (csum1), [len] "+&r" (len), [count] "+&r" (count)
: [buff] "r" (buff)
: "memory");
#endif
#if MODE == 5
// adxc/adxo loop.
// This is the only loop that exceeds 8 bytes/clock for 1k buffers.
#define DFLT_OFFSET 40
unsigned long csum_odd;
buff += len;
len = -len;
asm( " xor %[csum_odd], %[csum_odd]\n" // Also clears carry and overflow
"10: jrcxz 20f\n"
" adcx (%[buff], %[len]), %[csum]\n"
" adox 8(%[buff], %[len]), %[csum_odd]\n"
" adcx 16(%[buff], %[len]), %[csum]\n"
" adox 24(%[buff], %[len]), %[csum_odd]\n"
" adcx 32(%[buff], %[len]), %[csum]\n"
" adox 40(%[buff], %[len]), %[csum_odd]\n"
" adcx 48(%[buff], %[len]), %[csum]\n"
" adox 56(%[buff], %[len]), %[csum_odd]\n"
" lea 64(%[len]), %[len]\n"
" jmp 10b\n"
"20: adox %[len], %[csum_odd]\n" // [len] is zero
" adcx %[csum_odd], %[csum]\n"
" adcx %[len], %[csum]"
: [csum] "+&r" (csum), [len] "+&c" (len), [csum_odd] "=&r" (csum_odd)
: [buff] "r" (buff)
: "memory");
#endif
return csum;
}
unsigned char buff[8192] __attribute__((aligned(4096))) = {
0x46,0x56,0x20,0x04,0x10,0x02,0x50,0x07,0x72,0x4d,0xc6,0x3d,0x31,0x85,0x2d,0xbd,
0xe2,0xe0,0x9d,0x3e,0x3b,0x7a,0x70,0x3d,0xd2,0xfb,0x8c,0xbf,0x95,0x10,0xa9,0xbe,
0xeb,0xfd,0x29,0x40,0xd5,0x7a,0x61,0x40,0xde,0xcd,0x14,0xbf,0x81,0x1b,0xf6,0x3f,
0xbc,0xff,0x17,0x3f,0x67,0x1c,0x6e,0xbe,0xf4,0xc2,0x05,0x40,0x0b,0x13,0x78,0x3f,
0xfe,0x47,0xa7,0xbd,0x59,0xc2,0x15,0x3f,0x07,0xd0,0xea,0xbf,0x97,0xf1,0x3c,0x3f,
0xcc,0xfa,0x6b,0x40,0x72,0x6a,0x4f,0xbe,0x0b,0xe3,0x75,0x3e,0x3c,0x9b,0x0e,0xbf,
0xa9,0xeb,0xb7,0x3f,0xeb,0x4a,0xec,0x3e,0x33,0x8c,0x0c,0x3f,0x6a,0xf2,0xf3,0x3e,
0x2b,0x45,0x86,0x3f,0x83,0xce,0x8a,0x3f,0xf6,0x01,0x16,0x40,0x9c,0x17,0x47,0x3e,
0x44,0x83,0x61,0x40,0x74,0xc7,0x5c,0x3f,0xec,0xe7,0x95,0x3f,0xee,0x19,0xb5,0xbf,
0xb5,0xf0,0x03,0xbf,0xd1,0x02,0x1c,0x3e,0xa3,0x55,0x90,0xbe,0x1e,0x0b,0xa1,0xbf,
0xa4,0xa8,0xb4,0x3f,0xc6,0x68,0x91,0x3f,0xd1,0xc5,0xab,0x3f,0xb9,0x14,0x62,0x3f,
0x7c,0xe0,0xb9,0xbf,0xc0,0xa4,0xb5,0x3d,0x6f,0xd9,0xa7,0x3f,0x8f,0xc4,0xb0,0x3d,
0x48,0x2c,0x7a,0x3e,0x83,0xb2,0x3c,0x40,0x36,0xd3,0x18,0x40,0xb7,0xa9,0x57,0x40,
0xda,0xd3,0x95,0x3f,0x74,0x95,0xc0,0xbe,0xbb,0xce,0x71,0x3e,0x95,0xec,0x18,0xbf,
0x94,0x17,0xdd,0x3f,0x98,0xa5,0x02,0x3f,0xbb,0xfb,0xbb,0x3e,0xd0,0x5a,0x9c,0x3f,
0xd4,0x70,0x9b,0xbf,0x3b,0x9f,0x20,0xc0,0x84,0x5b,0x0f,0x40,0x5e,0x48,0x2c,0xbf,
};
#ifndef PASSES
#define PASSES 10
#endif
#ifndef OFFSET
#ifdef DFLT_OFFSET
#define OFFSET DFLT_OFFSET
#else
#define OFFSET 0
#endif
#endif
int main(int argc, char **argv)
{
	unsigned long csum;
	unsigned int tick;
	unsigned int ticks[PASSES];
	unsigned int len, off;
	unsigned int i;
	unsigned int id = init_pmc();

	len = sizeof buff;
	off = 0;
	if (argv[1]) {
		len = atoi(argv[1]);
		len = (len + 63) & ~63;
		if (argv[2])
			off = atoi(argv[2]) & 7;
	}

	for (i = 0; i < PASSES; i++) {
		tick = rdpmc(id);
		csum = csum64(0, buff + off, len);
		ticks[i] = rdpmc(id) - tick;
	}

	printf(" ticks rate %lx\n", csum);
	for (i = 0; i < PASSES; i++)
		printf(" %7u %7u\n", ticks[i], 100 * len / (ticks[i] - OFFSET));

	return 0;
}
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Thu, Dec 2, 2021 at 6:24 AM David Laight <[email protected]> wrote:
>
> I've dug out my test program and measured the performance of
> various copies of the inner loop - usually 64 bytes/iteration.
> Code is below.
>
> It uses the hardware performance counter to get the number of
> clocks the inner loop takes.
> This is reasonably stable once the branch predictor has settled down.
> So the difference in clocks between a 64 byte buffer and a 128 byte
> buffer is the number of clocks for 64 bytes.
> (Unlike the TSC the pmc count doesn't depend on the cpu frequency.)
>
> What is interesting is that even some of the trivial loops appear
> to be doing 16 bytes per clock for short buffers - which is impossible.
> Checksum 1k bytes and you get an entirely different answer.
> The only loop that really exceeds 8 bytes/clock for long buffers
> is the adxc/adoc one.
>
> What is almost certainly happening is that all the memory reads and
> the dependent add/adc instructions are all queued up in the 'out of
> order' execution unit.
> Since 'rdpmc' isn't a serialising instruction they can still be
> outstanding when the function returns.
> Uncomment the 'rdtsc' and you get much slower values for short buffers.
>
> When testing the full checksum function the queued up memory
> reads and adc are probably running in parallel with the logic
> that is handling lengths that aren't multiples of 64.
>
> I also found nothing consistently different for misaligned reads.
>
> These were all tested on my i7-7700 cpu.
>
I usually do not bother timing each call.
I instead time a loop of 1,000,000,000 calls.
Yes, this includes loop cost, but this is the same cost for all variants.
	for (i = 0; i < 100*1000*1000; i++) {
		res += csum_partial((void *)frame + 14 + 64*0, 40, 0);
		res += csum_partial((void *)frame + 14 + 64*1, 40, 0);
		res += csum_partial((void *)frame + 14 + 64*2, 40, 0);
		res += csum_partial((void *)frame + 14 + 64*3, 40, 0);
		res += csum_partial((void *)frame + 14 + 64*4, 40, 0);
		res += csum_partial((void *)frame + 14 + 64*5, 40, 0);
		res += csum_partial((void *)frame + 14 + 64*6, 40, 0);
		res += csum_partial((void *)frame + 14 + 64*7, 40, 0);
		res += csum_partial((void *)frame + 14 + 64*8, 40, 0);
		res += csum_partial((void *)frame + 14 + 64*9, 40, 0);
	}
Then use " perf stat ./bench" or similar.
On Thu, Dec 2, 2021 at 9:01 AM Eric Dumazet <[email protected]> wrote:
>
> On Thu, Dec 2, 2021 at 6:24 AM David Laight <[email protected]> wrote:
> >
> > I've dug out my test program and measured the performance of
> > various copies of the inner loop - usually 64 bytes/iteration.
> > Code is below.
> >
> > It uses the hardware performance counter to get the number of
> > clocks the inner loop takes.
> > This is reasonably stable once the branch predictor has settled down.
> > So the difference in clocks between a 64 byte buffer and a 128 byte
> > buffer is the number of clocks for 64 bytes.
Intuitively 10 passes is a bit low. Also you might consider aligning
the `csum64` function and possibly the loops.
Is there a reason you put `jrcxz` at the beginning of the loops instead
of at the end?
> > (Unlike the TSC the pmc count doesn't depend on the cpu frequency.)
> >
> > What is interesting is that even some of the trivial loops appear
> > to be doing 16 bytes per clock for short buffers - which is impossible.
> > Checksum 1k bytes and you get an entirely different answer.
> > The only loop that really exceeds 8 bytes/clock for long buffers
> > is the adxc/adoc one.
> >
> > What is almost certainly happening is that all the memory reads and
> > the dependent add/adc instructions are all queued up in the 'out of
> > order' execution unit.
> > Since 'rdpmc' isn't a serialising instruction they can still be
> > outstanding when the function returns.
> > Uncomment the 'rdtsc' and you get much slower values for short buffers.
Maybe add an `lfence` before / after `csum64`
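e.g. something like this around the measured call (sketch only, reusing
the names from your test program):

	for (i = 0; i < PASSES; i++) {
		asm volatile("lfence" ::: "memory");	/* drain older work */
		tick = rdpmc(id);
		csum = csum64(0, buff + off, len);
		asm volatile("lfence" ::: "memory");	/* wait for csum64 */
		ticks[i] = rdpmc(id) - tick;
	}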
> >
> > When testing the full checksum function the queued up memory
> > reads and adc are probably running in parallel with the logic
> > that is handling lengths that aren't multiples of 64.
> >
> > I also found nothing consistently different for misaligned reads.
> >
> > These were all tested on my i7-7700 cpu.
> >
>
> I usually do not bother timing each call.
> I instead time a loop of 1,000,000,000 calls.
> Yes, this includes loop cost, but this is the same cost for all variants.
> for (i = 0; i < 100*1000*1000; i++) {
> res += csum_partial((void *)frame + 14 + 64*0, 40, 0);
> res += csum_partial((void *)frame + 14 + 64*1, 40, 0);
> res += csum_partial((void *)frame + 14 + 64*2, 40, 0);
> res += csum_partial((void *)frame + 14 + 64*3, 40, 0);
> res += csum_partial((void *)frame + 14 + 64*4, 40, 0);
> res += csum_partial((void *)frame + 14 + 64*5, 40, 0);
> res += csum_partial((void *)frame + 14 + 64*6, 40, 0);
> res += csum_partial((void *)frame + 14 + 64*7, 40, 0);
> res += csum_partial((void *)frame + 14 + 64*8, 40, 0);
> res += csum_partial((void *)frame + 14 + 64*9, 40, 0);
> }
+1. You can also feed `res` from the previous iteration into the next
to measure latency cheaply, if that is a better predictor of performance.
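e.g. (a sketch, reusing `frame` and `res` from your loop):

	for (i = 0; i < 100*1000*1000; i++) {
		/* the previous checksum seeds the next call, so the calls
		 * form one dependency chain and the loop measures latency */
		res = csum_partial((void *)frame + 14, 40, res);
	}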
>
> Then use " perf stat ./bench" or similar.
From: Noah Goldstein
> Sent: 02 December 2021 20:19
>
> On Thu, Dec 2, 2021 at 9:01 AM Eric Dumazet <[email protected]> wrote:
> >
> > On Thu, Dec 2, 2021 at 6:24 AM David Laight <[email protected]> wrote:
> > >
> > > I've dug out my test program and measured the performance of
> > > various copies of the inner loop - usually 64 bytes/iteration.
> > > Code is below.
> > >
> > > It uses the hardware performance counter to get the number of
> > > clocks the inner loop takes.
> > > This is reasonably stable once the branch predictor has settled down.
> > > So the difference in clocks between a 64 byte buffer and a 128 byte
> > > buffer is the number of clocks for 64 bytes.
>
> Intuitively 10 passes is a bit low.
I'm doing 10 separate measurements.
The first one is much slower because the cache is cold.
All the ones after (typically) number 5 or 6 tend to give the same answer.
10 is plenty to give you that 'warm fuzzy feeling' that you've got
a consistent answer.
Run the program 5 or 6 times with the same parameters and you sometimes
get a different stable value - probably something to do with stack and
data physical pages.
Was more obvious when I was timing a system call.
> Also you might consider aligning
> the `csum64` function and possibly the loops.
Won't matter here, instruction decode isn't the problem.
Also the uops all come out of the loop uop cache.
> Is there a reason you put `jrcxz` at the beginning of the loops instead
> of at the end?
jrcxz is 'jump if cx zero' - hard to use at the bottom of a loop!
The 'paired' loop end instruction is 'loop' - decrement %cx and jump non-zero.
But that is 7+ cycles on current Intel cpus (ok on amd ones).
I can get a two clock loop with jrcxz and jmp - as in the examples.
But it is more stable taken out to 4 clocks.
You can't do a one clock loop :-(
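Stripped down to just the loop control it looks like this (a skeleton
only - the body is a dummy 8-byte add, and len must be a multiple of 8):

static unsigned long loop_skeleton(const unsigned char *buff, unsigned long len)
{
	unsigned long sum = 0;

	buff += len;
	len = -len;		/* negative index counts up towards zero */
	asm("1:	jrcxz	2f\n"	/* exit test has to sit at the top */
	    "	addq	0(%[buff], %[len]), %[sum]\n"
	    "	lea	8(%[len]), %[len]\n"
	    "	jmp	1b\n"	/* 'jrcxz ; jmp' gives the short loop */
	    "2:"
	    : [sum] "+&r" (sum), [len] "+&c" (len)
	    : [buff] "r" (buff)
	    : "memory");
	return sum;
}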
> > > (Unlike the TSC the pmc count doesn't depend on the cpu frequency.)
> > >
> > > What is interesting is that even some of the trivial loops appear
> > > to be doing 16 bytes per clock for short buffers - which is impossible.
> > > Checksum 1k bytes and you get an entirely different answer.
> > > The only loop that really exceeds 8 bytes/clock for long buffers
> > > is the adxc/adoc one.
> > >
> > > What is almost certainly happening is that all the memory reads and
> > > the dependent add/adc instructions are all queued up in the 'out of
> > > order' execution unit.
> > > Since 'rdpmc' isn't a serialising instruction they can still be
> > > outstanding when the function returns.
> > > Uncomment the 'rdtsc' and you get much slower values for short buffers.
>
> Maybe add an `lfence` before / after `csum64`
That's probably less strong than rdtsc, I might try it.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)