Sébastien Dugué reported to me that devices implementing ipoib (which don't
have checksum offload hardware) were spending a significant amount of time
computing checksums. We found that by splitting the checksum computation into
two separate streams, each skipping successive elements of the buffer being
summed, we could parallelize the checksum operation across multiple ALUs. Since
neither chain is dependent on the result of the other, we get a speedup in
execution (on hardware that has multiple ALUs available, which is almost
ubiquitous on x86), and only a negligible decrease on hardware that has only a
single ALU (an extra addition is introduced). Since addition is commutative,
the result is the same, only faster.
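For illustration, here is the same idea as a minimal C sketch (illustration
only, not part of the patch; the real code below keeps the per-word carries
with adcq): the even and odd 64-bit words of each 64-byte block are accumulated
in two independent chains and the partial sums are merged at the end.
static inline unsigned long csum_64bytes_two_chains(const unsigned long *p,
						    unsigned long prev)
{
	/* 128-bit accumulators stand in for the carry handling that the
	 * real code does with adcq. */
	unsigned __int128 sum1 = prev, sum2 = 0;
	unsigned long lo, hi;
	int i;
	for (i = 0; i < 8; i += 2) {
		sum1 += p[i];		/* chain 1: words 0, 2, 4, 6 */
		sum2 += p[i + 1];	/* chain 2: words 1, 3, 5, 7 */
	}
	sum1 += sum2;			/* merge the two partial sums */
	lo = (unsigned long)sum1;
	hi = (unsigned long)(sum1 >> 64);
	lo += hi;			/* fold the carries back in ... */
	if (lo < hi)
		lo++;			/* ... with an end-around carry */
	return lo;
}
Because addition is commutative and every carry is folded back in, the merged
value folds down to the same 16-bit checksum as the single serial chain.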
Signed-off-by: Neil Horman <[email protected]>
CC: [email protected]
CC: Thomas Gleixner <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: [email protected]
---
arch/x86/lib/csum-partial_64.c | 37 +++++++++++++++++++++++++------------
1 file changed, 25 insertions(+), 12 deletions(-)
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..2c7bc50 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -29,11 +29,12 @@ static inline unsigned short from32to16(unsigned a)
* Things tried and found to not make it faster:
* Manual Prefetching
* Unrolling to an 128 bytes inner loop.
- * Using interleaving with more registers to break the carry chains.
*/
static unsigned do_csum(const unsigned char *buff, unsigned len)
{
unsigned odd, count;
+ unsigned long result1 = 0;
+ unsigned long result2 = 0;
unsigned long result = 0;
if (unlikely(len == 0))
@@ -68,22 +69,34 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
zero = 0;
count64 = count >> 3;
while (count64) {
- asm("addq 0*8(%[src]),%[res]\n\t"
- "adcq 1*8(%[src]),%[res]\n\t"
- "adcq 2*8(%[src]),%[res]\n\t"
- "adcq 3*8(%[src]),%[res]\n\t"
- "adcq 4*8(%[src]),%[res]\n\t"
- "adcq 5*8(%[src]),%[res]\n\t"
- "adcq 6*8(%[src]),%[res]\n\t"
- "adcq 7*8(%[src]),%[res]\n\t"
- "adcq %[zero],%[res]"
- : [res] "=r" (result)
+ asm("addq 0*8(%[src]),%[res1]\n\t"
+ "adcq 2*8(%[src]),%[res1]\n\t"
+ "adcq 4*8(%[src]),%[res1]\n\t"
+ "adcq 6*8(%[src]),%[res1]\n\t"
+ "adcq %[zero],%[res1]\n\t"
+
+ "addq 1*8(%[src]),%[res2]\n\t"
+ "adcq 3*8(%[src]),%[res2]\n\t"
+ "adcq 5*8(%[src]),%[res2]\n\t"
+ "adcq 7*8(%[src]),%[res2]\n\t"
+ "adcq %[zero],%[res2]"
+ : [res1] "=r" (result1),
+ [res2] "=r" (result2)
: [src] "r" (buff), [zero] "r" (zero),
- "[res]" (result));
+ "[res1]" (result1), "[res2]" (result2));
buff += 64;
count64--;
}
+ asm("addq %[res1],%[res]\n\t"
+ "adcq %[res2],%[res]\n\t"
+ "adcq %[zero],%[res]"
+ : [res] "=r" (result)
+ : [res1] "r" (result1),
+ [res2] "r" (result2),
+ [zero] "r" (zero),
+ "0" (result));
+
/* last up to 7 8byte blocks */
count %= 8;
while (count) {
--
1.8.3.1
* Neil Horman <[email protected]> wrote:
> Sébastien Dugué reported to me that devices implementing ipoib (which
> don't have checksum offload hardware) were spending a significant amount
> of time computing checksums. We found that by splitting the checksum
> computation into two separate streams, each skipping successive elements
> of the buffer being summed, we could parallelize the checksum operation
> across multiple ALUs. Since neither chain is dependent on the result of
> the other, we get a speedup in execution (on hardware that has multiple
> ALUs available, which is almost ubiquitous on x86), and only a
> negligible decrease on hardware that has only a single ALU (an extra
> addition is introduced). Since addition is commutative, the result is
> the same, only faster.
This patch should really come with measurement numbers: what performance
increase (and drop) did you get on what CPUs.
Thanks,
Ingo
On 10/11/2013 09:51 AM, Neil Horman wrote:
> Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> checksum offload hardware) were spending a significant amount of time computing
> checksums. We found that by splitting the checksum computation into two
> separate streams, each skipping successive elements of the buffer being summed,
> we could parallelize the checksum operation across multiple ALUs. Since neither
> chain is dependent on the result of the other, we get a speedup in execution (on
> hardware that has multiple ALUs available, which is almost ubiquitous on x86),
> and only a negligible decrease on hardware that has only a single ALU (an extra
> addition is introduced). Since addition is commutative, the result is the same,
> only faster.
On hardware that implements ADCX/ADOX you should also be able to
interleave additional streams, since those instructions allow for
dual carry chains.
-hpa
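For reference, a rough sketch of what such an interleaved block could look
like, assuming a CPU that advertises the ADX extension (untested and
illustrative only; the two sums would still be merged with an add/adc pair as
in the patch above):
static inline void csum_block_adx(const unsigned char *buff,
				  unsigned long *sum1, unsigned long *sum2)
{
	unsigned long s1 = *sum1, s2 = *sum2;
	/* ADCX only uses CF and ADOX only uses OF, so the two chains can be
	 * interleaved without serializing on a shared carry flag.  Gate this
	 * on something like boot_cpu_has(X86_FEATURE_ADX). */
	asm("xorl %%r8d,%%r8d\n\t"		/* zero r8, clear CF and OF */
	    "adcxq 0*8(%[src]),%[res1]\n\t"	/* chain 1: carries in CF */
	    "adoxq 1*8(%[src]),%[res2]\n\t"	/* chain 2: carries in OF */
	    "adcxq 2*8(%[src]),%[res1]\n\t"
	    "adoxq 3*8(%[src]),%[res2]\n\t"
	    "adcxq 4*8(%[src]),%[res1]\n\t"
	    "adoxq 5*8(%[src]),%[res2]\n\t"
	    "adcxq 6*8(%[src]),%[res1]\n\t"
	    "adoxq 7*8(%[src]),%[res2]\n\t"
	    "adcxq %%r8,%[res1]\n\t"		/* fold the trailing CF */
	    "adoxq %%r8,%[res2]"		/* fold the trailing OF */
	    : [res1] "+r" (s1), [res2] "+r" (s2)
	    : [src] "r" (buff)
	    : "r8", "cc", "memory");
	*sum1 = s1;
	*sum2 = s2;
}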
On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > Sébastien Dugué reported to me that devices implementing ipoib (which
> > don't have checksum offload hardware) were spending a significant amount
> > of time computing checksums. We found that by splitting the checksum
> > computation into two separate streams, each skipping successive elements
> > of the buffer being summed, we could parallelize the checksum operation
> > across multiple ALUs. Since neither chain is dependent on the result of
> > the other, we get a speedup in execution (on hardware that has multiple
> > ALUs available, which is almost ubiquitous on x86), and only a
> > negligible decrease on hardware that has only a single ALU (an extra
> > addition is introduced). Since addition is commutative, the result is
> > the same, only faster.
>
> This patch should really come with measurement numbers: what performance
> increase (and drop) did you get on what CPUs.
>
> Thanks,
>
Sure, I can gather some stats for you. I'll post them later this week
Neil
> Ingo
>
On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
> On 10/11/2013 09:51 AM, Neil Horman wrote:
> > Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> > checksum offload hardware) were spending a significant amount of time computing
> > checksums. We found that by splitting the checksum computation into two
> > separate streams, each skipping successive elements of the buffer being summed,
> > we could parallelize the checksum operation across multiple ALUs. Since neither
> > chain is dependent on the result of the other, we get a speedup in execution (on
> > hardware that has multiple ALUs available, which is almost ubiquitous on x86),
> > and only a negligible decrease on hardware that has only a single ALU (an extra
> > addition is introduced). Since addition is commutative, the result is the same,
> > only faster.
>
> On hardware that implements ADCX/ADOX you should also be able to
> interleave additional streams, since those instructions allow for
> dual carry chains.
>
Ok, that's a good idea, I'll look into those instructions this week.
Neil
> -hpa
>
>
>
>
Neil Horman <[email protected]> writes:
> Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> checksum offload hardware) were spending a significant amount of time computing
Must be an odd workload, most TCP/UDP workloads do copy-checksum
anyways. I would rather investigate why that doesn't work.
That said the change looks reasonable, but may not fix the root cause.
-Andi
--
[email protected] -- Speaking for myself only
* Andi Kleen <[email protected]> wrote:
> Neil Horman <[email protected]> writes:
>
> > Sébastien Dugué reported to me that devices implementing ipoib (which
> > don't have checksum offload hardware) were spending a significant
> > amount of time computing
>
> Must be an odd workload, most TCP/UDP workloads do copy-checksum
> anyways. I would rather investigate why that doesn't work.
There's a fair amount of csum_partial()-only workloads, a packet does not
need to hit user-space to be a significant portion of the system's
workload.
That said, it would indeed be nice to hear which particular code path was
hit in this case, if nothing else then for education purposes.
Thanks,
Ingo
On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote:
> Neil Horman <[email protected]> writes:
>
> > Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> > checksum offload hardware) were spending a significant amount of time computing
>
> Must be an odd workload, most TCP/UDP workloads do copy-checksum
> anyways. I would rather investigate why that doesn't work.
>
FWIW, the reporter was reporting this using an IP over Infiniband network.
Neil
> That said the change looks reasonable, but may not fix the root cause.
>
> -Andi
>
> --
> [email protected] -- Speaking for myself only
>
On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > Sébastien Dugué reported to me that devices implementing ipoib (which
> > don't have checksum offload hardware) were spending a significant amount
> > of time computing checksums. We found that by splitting the checksum
> > computation into two separate streams, each skipping successive elements
> > of the buffer being summed, we could parallelize the checksum operation
> > across multiple ALUs. Since neither chain is dependent on the result of
> > the other, we get a speedup in execution (on hardware that has multiple
> > ALUs available, which is almost ubiquitous on x86), and only a
> > negligible decrease on hardware that has only a single ALU (an extra
> > addition is introduced). Since addition is commutative, the result is
> > the same, only faster.
>
> This patch should really come with measurement numbers: what performance
> increase (and drop) did you get on what CPUs.
>
> Thanks,
>
> Ingo
>
So, early testing results today. I wrote a test module that allocated a 4k
buffer, initialized it with random data, and called csum_partial on it 100000
times, recording the time at the start and end of that loop. Results on a 2.4
GHz Intel Xeon processor:
Without patch: Average execute time for csum_partial was 808 ns
With patch: Average execute time for csum_partial was 438 ns
I'm looking into hpa's suggestion to use alternate instructions where available
right now. I'll have more soon
Neil
On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote:
> * Andi Kleen <[email protected]> wrote:
>
> > Neil Horman <[email protected]> writes:
> >
> > > Sébastien Dugué reported to me that devices implementing ipoib (which
> > > don't have checksum offload hardware were spending a significant
> > > amount of time computing
> >
> > Must be an odd workload, most TCP/UDP workloads do copy-checksum
> > anyways. I would rather investigate why that doesn't work.
>
> There's a fair amount of csum_partial()-only workloads, a packet does not
> need to hit user-space to be a significant portion of the system's
> workload.
>
> That said, it would indeed be nice to hear which particular code path was
> hit in this case, if nothing else then for education purposes.
Many NICs do not provide CHECKSUM_COMPLETE information for encapsulated
frames, meaning we have to fall back to software csum to validate
TCP frames once the tunnel header is pulled.
So to reproduce the issue, all you need is to set up a GRE tunnel between
two hosts and use any TCP stream workload.
Then receiver profile looks like :
11.45% [kernel] [k] csum_partial
3.08% [kernel] [k] _raw_spin_lock
3.04% [kernel] [k] intel_idle
2.73% [kernel] [k] ipt_do_table
2.57% [kernel] [k] __netif_receive_skb_core
2.15% [kernel] [k] copy_user_generic_string
2.05% [kernel] [k] __hrtimer_start_range_ns
1.42% [kernel] [k] ip_rcv
1.39% [kernel] [k] kmem_cache_free
1.36% [kernel] [k] _raw_spin_unlock_irqrestore
1.24% [kernel] [k] __schedule
1.13% [bnx2x] [k] bnx2x_rx_int
1.12% [bnx2x] [k] bnx2x_start_xmit
1.11% [kernel] [k] fib_table_lookup
0.99% [ip_tunnel] [k] ip_tunnel_lookup
0.91% [ip_tunnel] [k] ip_tunnel_rcv
0.90% [kernel] [k] check_leaf.isra.7
0.89% [kernel] [k] nf_iterate
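Loosely, the receive path for those frames ends up doing something like the
following (a simplified sketch, not the actual stack code; the real path runs
through __skb_checksum_complete(), which is where the csum_partial() hits in
the profile above come from):
#include <linux/skbuff.h>

static bool tcp_csum_ok(struct sk_buff *skb)
{
	/* Hardware already verified the inner checksum: nothing to do. */
	if (skb->ip_summed == CHECKSUM_UNNECESSARY)
		return true;
	/* Software fallback: walking the payload here is where
	 * csum_partial() burns its cycles in the profile above. */
	return __skb_checksum_complete(skb) == 0;
}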
On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> So, early testing results today. I wrote a test module that allocated a 4k
> buffer, initialized it with random data, and called csum_partial on it 100000
> times, recording the time at the start and end of that loop. Results on a 2.4
> GHz Intel Xeon processor:
>
> Without patch: Average execute time for csum_partial was 808 ns
> With patch: Average execute time for csum_partial was 438 ns
Impressive, but could you try again with data out of cache ?
On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
>
> > So, early testing results today. I wrote a test module that allocated a 4k
> > buffer, initialized it with random data, and called csum_partial on it 100000
> > times, recording the time at the start and end of that loop. Results on a 2.4
> > GHz Intel Xeon processor:
> >
> > Without patch: Average execute time for csum_partial was 808 ns
> > With patch: Average execute time for csum_partial was 438 ns
>
> Impressive, but could you try again with data out of cache ?
So I tried your patch on a GRE tunnel and got the following results on a
single TCP flow (short result: no visible difference).
Using a prefetch of 5*64(%[src]) helps more (see the end of this mail).
cpus : model name : Intel Xeon(R) CPU X5660 @ 2.80GHz
Before patch :
lpq83:~# ./netperf -H 7.7.8.84 -l 20 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 16384 20.00 7651.61 2.51 5.45 0.645 1.399
After patch :
lpq83:~# ./netperf -H 7.7.8.84 -l 20 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 16384 20.00 7239.78 2.09 5.19 0.569 1.408
Profile on receiver
PerfTop: 1358 irqs/sec kernel:98.5% exact: 0.0% [1000Hz cycles], (all, 24 CPUs)
------------------------------------------------------------------------------------------------------------------------------------------------------------
19.99% [kernel] [k] csum_partial
7.04% [kernel] [k] copy_user_generic_string
4.92% [bnx2x] [k] bnx2x_rx_int
3.50% [kernel] [k] ipt_do_table
2.86% [kernel] [k] __netif_receive_skb_core
2.35% [kernel] [k] fib_table_lookup
2.19% [kernel] [k] netif_receive_skb
1.87% [kernel] [k] intel_idle
1.65% [kernel] [k] kmem_cache_alloc
1.64% [kernel] [k] ip_rcv
1.51% [kernel] [k] kmem_cache_free
And attached patch brings much better results
lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 16384 10.00 8043.82 2.32 5.34 0.566 1.304
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..f0e10fc 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
zero = 0;
count64 = count >> 3;
while (count64) {
- asm("addq 0*8(%[src]),%[res]\n\t"
+ asm("prefetch 5*64(%[src])\n\t"
+ "addq 0*8(%[src]),%[res]\n\t"
"adcq 1*8(%[src]),%[res]\n\t"
"adcq 2*8(%[src]),%[res]\n\t"
"adcq 3*8(%[src]),%[res]\n\t"
On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> attached patch brings much better results
>
> lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 16384 10.00 8043.82 2.32 5.34 0.566 1.304
>
> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
[]
> @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> zero = 0;
> count64 = count >> 3;
> while (count64) {
> - asm("addq 0*8(%[src]),%[res]\n\t"
> + asm("prefetch 5*64(%[src])\n\t"
Might the prefetch size be too big here?
0x140 is pretty big and is always multiple cachelines ahead, no?
> + "addq 0*8(%[src]),%[res]\n\t"
> "adcq 1*8(%[src]),%[res]\n\t"
> "adcq 2*8(%[src]),%[res]\n\t"
> "adcq 3*8(%[src]),%[res]\n\t"
On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > attached patch brings much better results
> >
> > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > Recv Send Send Utilization Service Demand
> > Socket Socket Message Elapsed Send Recv Send Recv
> > Size Size Size Time Throughput local remote local remote
> > bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
> >
> > 87380 16384 16384 10.00 8043.82 2.32 5.34 0.566 1.304
> >
> > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> []
> > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > zero = 0;
> > count64 = count >> 3;
> > while (count64) {
> > - asm("addq 0*8(%[src]),%[res]\n\t"
> > + asm("prefetch 5*64(%[src])\n\t"
>
> Might the prefetch size be too big here?
To be effective, you need to prefetch well ahead of time.
5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > attached patch brings much better results
> > >
> > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > Recv Send Send Utilization Service Demand
> > > Socket Socket Message Elapsed Send Recv Send Recv
> > > Size Size Size Time Throughput local remote local remote
> > > bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
> > >
> > > 87380 16384 16384 10.00 8043.82 2.32 5.34 0.566 1.304
> > >
> > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > []
> > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > > zero = 0;
> > > count64 = count >> 3;
> > > while (count64) {
> > > - asm("addq 0*8(%[src]),%[res]\n\t"
> > > + asm("prefetch 5*64(%[src])\n\t"
> >
> > Might the prefetch size be too big here?
>
> To be effective, you need to prefetch well ahead of time.
No doubt.
> 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
5 cachelines for some processors seems like a lot.
Given you've got a test rig, maybe you could experiment
with 2 and increase it until it doesn't get better.
* Neil Horman <[email protected]> wrote:
> On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> >
> > * Neil Horman <[email protected]> wrote:
> >
> > > Sébastien Dugué reported to me that devices implementing ipoib (which
> > > don't have checksum offload hardware) were spending a significant amount
> > > of time computing checksums. We found that by splitting the checksum
> > > computation into two separate streams, each skipping successive elements
> > > of the buffer being summed, we could parallelize the checksum operation
> > > across multiple ALUs. Since neither chain is dependent on the result of
> > > the other, we get a speedup in execution (on hardware that has multiple
> > > ALUs available, which is almost ubiquitous on x86), and only a
> > > negligible decrease on hardware that has only a single ALU (an extra
> > > addition is introduced). Since addition is commutative, the result is
> > > the same, only faster.
> >
> > This patch should really come with measurement numbers: what performance
> > increase (and drop) did you get on what CPUs.
> >
> > Thanks,
> >
> > Ingo
> >
>
>
> So, early testing results today. I wrote a test module that allocated
> a 4k buffer, initialized it with random data, and called csum_partial on
> it 100000 times, recording the time at the start and end of that loop.
It would be nice to stick that testcase into tools/perf/bench/, see how we
are able to benchmark the kernel's memcpy and memset implementations there:
$ perf bench mem memcpy -r help
# Running 'mem/memcpy' benchmark:
Unknown routine:help
Available routines...
default ... Default memcpy() provided by glibc
x86-64-unrolled ... unrolled memcpy() in arch/x86/lib/memcpy_64.S
x86-64-movsq ... movsq-based memcpy() in arch/x86/lib/memcpy_64.S
x86-64-movsb ... movsb-based memcpy() in arch/x86/lib/memcpy_64.S
In a similar fashion we could build the csum_partial() code as well and do
measurements. (We could change arch/x86/ code as well to make such
embedding/including easier, as long as it does not change performance.)
Thanks,
Ingo
* Joe Perches <[email protected]> wrote:
> On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > attached patch brings much better results
> > > >
> > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > > Recv Send Send Utilization Service Demand
> > > > Socket Socket Message Elapsed Send Recv Send Recv
> > > > Size Size Size Time Throughput local remote local remote
> > > > bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
> > > >
> > > > 87380 16384 16384 10.00 8043.82 2.32 5.34 0.566 1.304
> > > >
> > > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > > []
> > > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > > > zero = 0;
> > > > count64 = count >> 3;
> > > > while (count64) {
> > > > - asm("addq 0*8(%[src]),%[res]\n\t"
> > > > + asm("prefetch 5*64(%[src])\n\t"
> > >
> > > Might the prefetch size be too big here?
> >
> > To be effective, you need to prefetch well ahead of time.
>
> No doubt.
So why did you ask then?
> > 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
>
> 5 cachelines for some processors seems like a lot.
What processors would that be?
Most processors have hundreds of cachelines even in their L1 cache.
Thousands in the L2 cache, up to hundreds of thousands.
Thanks,
Ingo
Hi Neil, Andi,
On Mon, 14 Oct 2013 16:25:28 -0400
Neil Horman <[email protected]> wrote:
> On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote:
> > Neil Horman <[email protected]> writes:
> >
> > > Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> > > checksum offload hardware) were spending a significant amount of time computing
> >
> > Must be an odd workload, most TCP/UDP workloads do copy-checksum
> > anyways. I would rather investigate why that doesn't work.
> >
> FWIW, the reporter was reporting this using an IP over Infiniband network.
> Neil
indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
where one cannot benefit from hardware offloads.
For a bit of background on the issue:
It all started nearly 3 years ago when trying to understand why IPoIB BW was
so low in our setups and why ksoftirqd used 100% of one CPU. A kernel profile
trace showed that the CPU spent most of its time in checksum computation (from
the only old trace I managed to unearth):
Function Hit Time Avg
-------- --- ---- ---
schedule 1730 629976998 us 364148.5 us
csum_partial 10813465 20944414 us 1.936 us
mwait_idle_with_hints 1451 9858861 us 6794.529 us
get_page_from_freelist 10110434 8120524 us 0.803 us
alloc_pages_current 10093675 5180650 us 0.513 us
__phys_addr 35554783 4471387 us 0.125 us
zone_statistics 10110434 4360871 us 0.431 us
ipoib_cm_alloc_rx_skb 673899 4343949 us 6.445 us
After recoding the checksum to use 2 ALUs, csum_partial() disappeared
from the tracer's radar. IPoIB BW went from ~12Gb/s to ~20Gb/s and the ksoftirqd
load dropped drastically. Sorry, I could not manage to locate my old traces and
results; those seem to have been lost in the mists of time.
I did some micro-benchmarking (dirty hack code below) of different solutions.
It looks like processing 128-byte blocks in 4 chains gives the best performance,
but there are plenty of other possibilities.
FWIW, this code has been running as-is at our customers' sites for 3 years now.
Sébastien.
>
> > That said the change looks reasonable, but may not fix the root cause.
> >
> > -Andi
> >
> > --
> > [email protected] -- Speaking for myself only
> >
8<----------------------------------------------------------------------
/*
* gcc -Wall -O3 -o csum_test csum_test.c -lrt
*/
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <string.h>
#include <errno.h>
#define __force
#define unlikely(x) (x)
typedef uint32_t u32;
typedef uint16_t u16;
typedef u16 __sum16;
typedef u32 __wsum;
#define NUM_LOOPS 100000
#define BUF_LEN 65536
unsigned char buf[BUF_LEN];
/*
* csum_fold - Fold and invert a 32bit checksum.
* sum: 32bit unfolded sum
*
* Fold a 32bit running checksum to 16bit and invert it. This is usually
* the last step before putting a checksum into a packet.
* Make sure not to mix with 64bit checksums.
*/
static inline __sum16 csum_fold(__wsum sum)
{
asm(" addl %1,%0\n"
" adcl $0xffff,%0"
: "=r" (sum)
: "r" ((__force u32)sum << 16),
"0" ((__force u32)sum & 0xffff0000));
return (__force __sum16)(~(__force u32)sum >> 16);
}
static inline unsigned short from32to16(unsigned a)
{
unsigned short b = a >> 16;
asm("addw %w2,%w0\n\t"
"adcw $0,%w0\n"
: "=r" (b)
: "0" (b), "r" (a));
return b;
}
static inline unsigned add32_with_carry(unsigned a, unsigned b)
{
asm("addl %2,%0\n\t"
"adcl $0,%0"
: "=r" (a)
: "0" (a), "r" (b));
return a;
}
/*
* Do a 64-bit checksum on an arbitrary memory area.
* Returns a 32bit checksum.
*
* This isn't as time critical as it used to be because many NICs
* do hardware checksumming these days.
*
* Things tried and found to not make it faster:
* Manual Prefetching
* Unrolling to an 128 bytes inner loop.
* Using interleaving with more registers to break the carry chains.
*/
static unsigned do_csum(const unsigned char *buff, unsigned len)
{
unsigned odd, count;
unsigned long result = 0;
if (unlikely(len == 0))
return result;
odd = 1 & (unsigned long) buff;
if (unlikely(odd)) {
result = *buff << 8;
len--;
buff++;
}
count = len >> 1; /* nr of 16-bit words.. */
if (count) {
if (2 & (unsigned long) buff) {
result += *(unsigned short *)buff;
count--;
len -= 2;
buff += 2;
}
count >>= 1; /* nr of 32-bit words.. */
if (count) {
unsigned long zero;
unsigned count64;
if (4 & (unsigned long) buff) {
result += *(unsigned int *) buff;
count--;
len -= 4;
buff += 4;
}
count >>= 1; /* nr of 64-bit words.. */
/* main loop using 64byte blocks */
zero = 0;
count64 = count >> 3;
while (count64) {
asm("addq 0*8(%[src]),%[res]\n\t"
"adcq 1*8(%[src]),%[res]\n\t"
"adcq 2*8(%[src]),%[res]\n\t"
"adcq 3*8(%[src]),%[res]\n\t"
"adcq 4*8(%[src]),%[res]\n\t"
"adcq 5*8(%[src]),%[res]\n\t"
"adcq 6*8(%[src]),%[res]\n\t"
"adcq 7*8(%[src]),%[res]\n\t"
"adcq %[zero],%[res]"
: [res] "=r" (result)
: [src] "r" (buff), [zero] "r" (zero),
"[res]" (result));
buff += 64;
count64--;
}
/* printf("csum %lx\n", result); */
/* last upto 7 8byte blocks */
count %= 8;
while (count) {
asm("addq %1,%0\n\t"
"adcq %2,%0\n"
: "=r" (result)
: "m" (*(unsigned long *)buff),
"r" (zero), "0" (result));
--count;
buff += 8;
}
result = add32_with_carry(result>>32,
result&0xffffffff);
if (len & 4) {
result += *(unsigned int *) buff;
buff += 4;
}
}
if (len & 2) {
result += *(unsigned short *) buff;
buff += 2;
}
}
if (len & 1)
result += *buff;
result = add32_with_carry(result>>32, result & 0xffffffff);
if (unlikely(odd)) {
result = from32to16(result);
result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
}
return result;
}
static unsigned do_csum1(const unsigned char *buff, unsigned len)
{
unsigned odd, count;
unsigned long result1 = 0;
unsigned long result2 = 0;
unsigned long result = 0;
if (unlikely(len == 0))
return result;
odd = 1 & (unsigned long) buff;
if (unlikely(odd)) {
result = *buff << 8;
len--;
buff++;
}
count = len >> 1; /* nr of 16-bit words.. */
if (count) {
if (2 & (unsigned long) buff) {
result += *(unsigned short *)buff;
count--;
len -= 2;
buff += 2;
}
count >>= 1; /* nr of 32-bit words.. */
if (count) {
unsigned long zero;
unsigned count64;
if (4 & (unsigned long) buff) {
result += *(unsigned int *) buff;
count--;
len -= 4;
buff += 4;
}
count >>= 1; /* nr of 64-bit words.. */
/* main loop using 64byte blocks */
zero = 0;
count64 = count >> 3;
while (count64) {
asm("addq 0*8(%[src]),%[res1]\n\t"
"adcq 2*8(%[src]),%[res1]\n\t"
"adcq 4*8(%[src]),%[res1]\n\t"
"adcq 6*8(%[src]),%[res1]\n\t"
"adcq %[zero],%[res1]\n\t"
"addq 1*8(%[src]),%[res2]\n\t"
"adcq 3*8(%[src]),%[res2]\n\t"
"adcq 5*8(%[src]),%[res2]\n\t"
"adcq 7*8(%[src]),%[res2]\n\t"
"adcq %[zero],%[res2]"
: [res1] "=r" (result1),
[res2] "=r" (result2)
: [src] "r" (buff), [zero] "r" (zero),
"[res1]" (result1), "[res2]" (result2));
buff += 64;
count64--;
}
asm("addq %[res1],%[res]\n\t"
"adcq %[res2],%[res]\n\t"
"adcq %[zero],%[res]"
: [res] "=r" (result)
: [res1] "r" (result1),
[res2] "r" (result2),
[zero] "r" (zero),
"0" (result));
/* last upto 7 8byte blocks */
count %= 8;
while (count) {
asm("addq %1,%0\n\t"
"adcq %2,%0\n"
: "=r" (result)
: "m" (*(unsigned long *)buff),
"r" (zero), "0" (result));
--count;
buff += 8;
}
result = add32_with_carry(result>>32,
result&0xffffffff);
if (len & 4) {
result += *(unsigned int *) buff;
buff += 4;
}
}
if (len & 2) {
result += *(unsigned short *) buff;
buff += 2;
}
}
if (len & 1)
result += *buff;
result = add32_with_carry(result>>32, result & 0xffffffff);
if (unlikely(odd)) {
result = from32to16(result);
result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
}
return result;
}
static unsigned do_csum2(const unsigned char *buff, unsigned len)
{
unsigned odd, count;
unsigned long result1 = 0;
unsigned long result2 = 0;
unsigned long result3 = 0;
unsigned long result4 = 0;
unsigned long result = 0;
if (unlikely(len == 0))
return result;
odd = 1 & (unsigned long) buff;
if (unlikely(odd)) {
result = *buff << 8;
len--;
buff++;
}
count = len >> 1; /* nr of 16-bit words.. */
if (count) {
if (2 & (unsigned long) buff) {
result += *(unsigned short *)buff;
count--;
len -= 2;
buff += 2;
}
count >>= 1; /* nr of 32-bit words.. */
if (count) {
if (4 & (unsigned long) buff) {
result += *(unsigned int *) buff;
count--;
len -= 4;
buff += 4;
}
count >>= 1; /* nr of 64-bit words.. */
if (count) {
unsigned long zero = 0;
unsigned count128;
if (8 & (unsigned long) buff) {
asm("addq %1,%0\n\t"
"adcq %2,%0\n"
: "=r" (result)
: "m" (*(unsigned long *)buff),
"r" (zero), "0" (result));
count--;
buff += 8;
}
/* main loop using 128 byte blocks */
count128 = count >> 4;
while (count128) {
asm("addq 0*8(%[src]),%[res1]\n\t"
"adcq 4*8(%[src]),%[res1]\n\t"
"adcq 8*8(%[src]),%[res1]\n\t"
"adcq 12*8(%[src]),%[res1]\n\t"
"adcq %[zero],%[res1]\n\t"
"addq 1*8(%[src]),%[res2]\n\t"
"adcq 5*8(%[src]),%[res2]\n\t"
"adcq 9*8(%[src]),%[res2]\n\t"
"adcq 13*8(%[src]),%[res2]\n\t"
"adcq %[zero],%[res2]\n\t"
"addq 2*8(%[src]),%[res3]\n\t"
"adcq 6*8(%[src]),%[res3]\n\t"
"adcq 10*8(%[src]),%[res3]\n\t"
"adcq 14*8(%[src]),%[res3]\n\t"
"adcq %[zero],%[res3]\n\t"
"addq 3*8(%[src]),%[res4]\n\t"
"adcq 7*8(%[src]),%[res4]\n\t"
"adcq 11*8(%[src]),%[res4]\n\t"
"adcq 15*8(%[src]),%[res4]\n\t"
"adcq %[zero],%[res4]"
: [res1] "=r" (result1),
[res2] "=r" (result2),
[res3] "=r" (result3),
[res4] "=r" (result4)
: [src] "r" (buff),
[zero] "r" (zero),
"[res1]" (result1),
"[res2]" (result2),
"[res3]" (result3),
"[res4]" (result4));
buff += 128;
count128--;
}
asm("addq %[res1],%[res]\n\t"
"adcq %[res2],%[res]\n\t"
"adcq %[res3],%[res]\n\t"
"adcq %[res4],%[res]\n\t"
"adcq %[zero],%[res]"
: [res] "=r" (result)
: [res1] "r" (result1),
[res2] "r" (result2),
[res3] "r" (result3),
[res4] "r" (result4),
[zero] "r" (zero),
"0" (result));
/* last upto 15 8byte blocks */
count %= 16;
while (count) {
asm("addq %1,%0\n\t"
"adcq %2,%0\n"
: "=r" (result)
: "m" (*(unsigned long *)buff),
"r" (zero), "0" (result));
--count;
buff += 8;
}
result = add32_with_carry(result>>32,
result&0xffffffff);
if (len & 8) {
asm("addq %1,%0\n\t"
"adcq %2,%0\n"
: "=r" (result)
: "m" (*(unsigned long *)buff),
"r" (zero), "0" (result));
buff += 8;
}
}
if (len & 4) {
result += *(unsigned int *) buff;
buff += 4;
}
}
if (len & 2) {
result += *(unsigned short *) buff;
buff += 2;
}
}
if (len & 1)
result += *buff;
result = add32_with_carry(result>>32, result & 0xffffffff);
if (unlikely(odd)) {
result = from32to16(result);
result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
}
return result;
}
static unsigned do_csum3(const unsigned char *buff, unsigned len)
{
unsigned odd, count;
unsigned long result1 = 0;
unsigned long result2 = 0;
unsigned long result3 = 0;
unsigned long result4 = 0;
unsigned long result = 0;
if (unlikely(len == 0))
return result;
odd = 1 & (unsigned long) buff;
if (unlikely(odd)) {
result = *buff << 8;
len--;
buff++;
}
count = len >> 1; /* nr of 16-bit words.. */
if (count) {
if (2 & (unsigned long) buff) {
result += *(unsigned short *)buff;
count--;
len -= 2;
buff += 2;
}
count >>= 1; /* nr of 32-bit words.. */
if (count) {
unsigned long zero;
unsigned count64;
if (4 & (unsigned long) buff) {
result += *(unsigned int *) buff;
count--;
len -= 4;
buff += 4;
}
count >>= 1; /* nr of 64-bit words.. */
/* main loop using 64byte blocks */
zero = 0;
count64 = count >> 3;
while (count64) {
asm("addq 0*8(%[src]),%[res1]\n\t"
"adcq 4*8(%[src]),%[res1]\n\t"
"adcq %[zero],%[res1]\n\t"
"addq 1*8(%[src]),%[res2]\n\t"
"adcq 5*8(%[src]),%[res2]\n\t"
"adcq %[zero],%[res2]\n\t"
"addq 2*8(%[src]),%[res3]\n\t"
"adcq 6*8(%[src]),%[res3]\n\t"
"adcq %[zero],%[res3]\n\t"
"addq 3*8(%[src]),%[res4]\n\t"
"adcq 7*8(%[src]),%[res4]\n\t"
"adcq %[zero],%[res4]\n\t"
: [res1] "=r" (result1),
[res2] "=r" (result2),
[res3] "=r" (result3),
[res4] "=r" (result4)
: [src] "r" (buff),
[zero] "r" (zero),
"[res1]" (result1),
"[res2]" (result2),
"[res3]" (result3),
"[res4]" (result4));
buff += 64;
count64--;
}
asm("addq %[res1],%[res]\n\t"
"adcq %[res2],%[res]\n\t"
"adcq %[res3],%[res]\n\t"
"adcq %[res4],%[res]\n\t"
"adcq %[zero],%[res]"
: [res] "=r" (result)
: [res1] "r" (result1),
[res2] "r" (result2),
[res3] "r" (result3),
[res4] "r" (result4),
[zero] "r" (zero),
"0" (result));
/* printf("csum1 %lx\n", result); */
/* last upto 7 8byte blocks */
count %= 8;
while (count) {
asm("addq %1,%0\n\t"
"adcq %2,%0\n"
: "=r" (result)
: "m" (*(unsigned long *)buff),
"r" (zero), "0" (result));
--count;
buff += 8;
}
result = add32_with_carry(result>>32,
result&0xffffffff);
if (len & 4) {
result += *(unsigned int *) buff;
buff += 4;
}
}
if (len & 2) {
result += *(unsigned short *) buff;
buff += 2;
}
}
if (len & 1)
result += *buff;
result = add32_with_carry(result>>32, result & 0xffffffff);
if (unlikely(odd)) {
result = from32to16(result);
result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
}
return result;
}
long long delta_ns(struct timespec *t1, struct timespec *t2)
{
long long tt1, tt2, delta;
tt1 = t1->tv_sec * 1000000000 + t1->tv_nsec;
tt2 = t2->tv_sec * 1000000000 + t2->tv_nsec;
delta = tt2 - tt1;
return delta;
}
int main(int argc, char **argv)
{
FILE *f;
unsigned csum1, csum2, csum3, csum4;
struct timespec t1;
struct timespec t2;
double delta;
int i;
unsigned int offset = 0;
unsigned char *ptr;
unsigned int size;
if ((f = fopen("data.bin", "r")) == NULL) {
printf("Failed to open input file data.bin: %s\n",
strerror(errno));
return -1;
}
if (fread(buf, 1, BUF_LEN, f) != BUF_LEN) {
printf("Failed to read data.bin: %s\n",
strerror(errno));
fclose(f);
return -1;
}
fclose(f);
if (argc > 1)
offset = atoi(argv[1]);
printf("Using offset=%d\n", offset);
ptr = &buf[offset];
size = BUF_LEN - offset;
clock_gettime(CLOCK_MONOTONIC, &t1);
for (i = 0; i < NUM_LOOPS; i++)
csum1 = do_csum((const unsigned char *)ptr, size);
clock_gettime(CLOCK_MONOTONIC, &t2);
delta = (double)delta_ns(&t1, &t2)/1000.0;
printf("Original: %.8x %f us\n",
csum1, (double)delta/(double)NUM_LOOPS);
clock_gettime(CLOCK_MONOTONIC, &t1);
for (i = 0; i < NUM_LOOPS; i++)
csum2 = do_csum1((const unsigned char *)ptr, size);
clock_gettime(CLOCK_MONOTONIC, &t2);
delta = (double)delta_ns(&t1, &t2)/1000.0;
printf("64B Split2: %.8x %f us\n",
csum2, (double)delta/(double)NUM_LOOPS);
clock_gettime(CLOCK_MONOTONIC, &t1);
for (i = 0; i < NUM_LOOPS; i++)
csum3 = do_csum2((const unsigned char *)ptr, size);
clock_gettime(CLOCK_MONOTONIC, &t2);
delta = (double)delta_ns(&t1, &t2)/1000.0;
printf("128B Split4: %.8x %f us\n",
csum3, (double)delta/(double)NUM_LOOPS);
clock_gettime(CLOCK_MONOTONIC, &t1);
for (i = 0; i < NUM_LOOPS; i++)
csum4 = do_csum3((const unsigned char *)ptr, size);
clock_gettime(CLOCK_MONOTONIC, &t2);
delta = (double)delta_ns(&t1, &t2)/1000.0;
printf("64B Split4: %.8x %f us\n",
csum4, (double)delta/(double)NUM_LOOPS);
if ((csum1 != csum2) || (csum1 != csum3) || (csum1 != csum4))
printf("Wrong checksum\n");
return 0;
}
On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote:
> Most processors have hundreds of cachelines even in their L1 cache.
> Thousands in the L2 cache, up to hundreds of thousands.
Also, I have this hazy memory of prefetch hints being harmful in some
situations: https://lwn.net/Articles/444344/
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
* Borislav Petkov <[email protected]> wrote:
> On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote:
> > Most processors have hundreds of cachelines even in their L1 cache.
> > Thousands in the L2 cache, up to hundreds of thousands.
>
> Also, I have this hazy memory of prefetch hints being harmful in some
> situations: https://lwn.net/Articles/444344/
Yes, for things like random list walks they tend to be harmful - the
hardware is smarter.
For something like a controlled packet stream they might be helpful.
Thanks,
Ingo
On Tue, Oct 15, 2013 at 09:32:48AM +0200, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> > >
> > > * Neil Horman <[email protected]> wrote:
> > >
> > > > Sébastien Dugué reported to me that devices implementing ipoib (which
> > > > don't have checksum offload hardware) were spending a significant amount
> > > > of time computing checksums. We found that by splitting the checksum
> > > > computation into two separate streams, each skipping successive elements
> > > > of the buffer being summed, we could parallelize the checksum operation
> > > > across multiple ALUs. Since neither chain is dependent on the result of
> > > > the other, we get a speedup in execution (on hardware that has multiple
> > > > ALUs available, which is almost ubiquitous on x86), and only a
> > > > negligible decrease on hardware that has only a single ALU (an extra
> > > > addition is introduced). Since addition is commutative, the result is
> > > > the same, only faster.
> > >
> > > This patch should really come with measurement numbers: what performance
> > > increase (and drop) did you get on what CPUs.
> > >
> > > Thanks,
> > >
> > > Ingo
> > >
> >
> >
> > So, early testing results today. I wrote a test module that, allocated
> > a 4k buffer, initalized it with random data, and called csum_partial on
> > it 100000 times, recording the time at the start and end of that loop.
>
> It would be nice to stick that testcase into tools/perf/bench/, see how we
> are able to benchmark the kernel's memcpy and memset implementations there:
>
Sure, my module is a mess currently. But as soon as I investigate the use of
ADCX/ADOX that Anvin suggested, I'll see about integrating that.
Neil
> $ perf bench mem memcpy -r help
> # Running 'mem/memcpy' benchmark:
> Unknown routine:help
> Available routines...
> default ... Default memcpy() provided by glibc
> x86-64-unrolled ... unrolled memcpy() in arch/x86/lib/memcpy_64.S
> x86-64-movsq ... movsq-based memcpy() in arch/x86/lib/memcpy_64.S
> x86-64-movsb ... movsb-based memcpy() in arch/x86/lib/memcpy_64.S
>
> In a similar fashion we could build the csum_partial() code as well and do
> measurements. (We could change arch/x86/ code as well to make such
> embedding/including easier, as long as it does not change performance.)
>
> Thanks,
>
> Ingo
>
On Mon, Oct 14, 2013 at 02:07:48PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote:
> > * Andi Kleen <[email protected]> wrote:
> >
> > > Neil Horman <[email protected]> writes:
> > >
> > > > Sébastien Dugué reported to me that devices implementing ipoib (which
> > > > don't have checksum offload hardware) were spending a significant
> > > > amount of time computing
> > >
> > > Must be an odd workload, most TCP/UDP workloads do copy-checksum
> > > anyways. I would rather investigate why that doesn't work.
> >
> > There's a fair amount of csum_partial()-only workloads, a packet does not
> > need to hit user-space to be a significant portion of the system's
> > workload.
> >
> > That said, it would indeed be nice to hear which particular code path was
> > hit in this case, if nothing else then for education purposes.
>
> Many NICs do not provide CHECKSUM_COMPLETE information for encapsulated
> frames, meaning we have to fall back to software csum to validate
> TCP frames once the tunnel header is pulled.
>
> So to reproduce the issue, all you need is to setup a GRE tunnel between
> two hosts, and use any tcp stream workload.
>
> Then receiver profile looks like :
>
> 11.45% [kernel] [k] csum_partial
> 3.08% [kernel] [k] _raw_spin_lock
> 3.04% [kernel] [k] intel_idle
> 2.73% [kernel] [k] ipt_do_table
> 2.57% [kernel] [k] __netif_receive_skb_core
> 2.15% [kernel] [k] copy_user_generic_string
> 2.05% [kernel] [k] __hrtimer_start_range_ns
> 1.42% [kernel] [k] ip_rcv
> 1.39% [kernel] [k] kmem_cache_free
> 1.36% [kernel] [k] _raw_spin_unlock_irqrestore
> 1.24% [kernel] [k] __schedule
> 1.13% [bnx2x] [k] bnx2x_rx_int
> 1.12% [bnx2x] [k] bnx2x_start_xmit
> 1.11% [kernel] [k] fib_table_lookup
> 0.99% [ip_tunnel] [k] ip_tunnel_lookup
> 0.91% [ip_tunnel] [k] ip_tunnel_rcv
> 0.90% [kernel] [k] check_leaf.isra.7
> 0.89% [kernel] [k] nf_iterate
>
As I noted previously, the workload that this got reported on was ipoib, which
has a similar profile, since InfiniBand cards tend not to be able to do
checksum offload for IP frames.
Neil
> indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> where one cannot benefit from hardware offloads.
Is this with sendfile?
For normal send() the checksum is done in the user copy, and for receiving it
can also be done during the copy in most cases.
-Andi
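In other words, the copy the data needs anyway can accumulate the sum almost
for free. Conceptually (a plain C sketch; the kernel's real implementation is
csum_partial_copy_generic() in arch/x86/lib/csum-copy_64.S):
static unsigned long copy_and_csum(unsigned long *dst, const unsigned long *src,
				   unsigned int n_words, unsigned long sum)
{
	unsigned int i;
	for (i = 0; i < n_words; i++) {
		unsigned long v = src[i];
		dst[i] = v;		/* the copy we had to do anyway */
		sum += v;		/* checksum accumulated on the side */
		if (sum < v)
			sum++;		/* end-around carry */
	}
	return sum;
}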
On Tue, 15 Oct 2013 15:33:36 +0200
Andi Kleen <[email protected]> wrote:
> > indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> > where one cannot benefit from hardware offloads.
>
> Is this with sendfile?
Tests were done with iperf at the time without any extra funky options, and
looking at the code it looks like it does plain write() / recv() on the socket.
Sébastien.
>
> For normal send() the checksum is done in the user copy and for receiving it
> can be also done during the copy in most cases
>
> -Andi
On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote:
> On Tue, 15 Oct 2013 15:33:36 +0200
> Andi Kleen <[email protected]> wrote:
>
> > > indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> > > where one cannot benefit from hardware offloads.
> >
> > Is this with sendfile?
>
> Tests were done with iperf at the time without any extra funky options, and
> looking at the code it looks like it does plain write() / recv() on the socket.
>
But the csum cost is both for sender and receiver ?
Please post the following :
perf record -g "your iperf session"
perf report | head -n 200
Hi Eric,
On Tue, 15 Oct 2013 07:06:25 -0700
Eric Dumazet <[email protected]> wrote:
> On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote:
> > On Tue, 15 Oct 2013 15:33:36 +0200
> > Andi Kleen <[email protected]> wrote:
> >
> > > > indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> > > > where one cannot benefit from hardware offloads.
> > >
> > > Is this with sendfile?
> >
> > Tests were done with iperf at the time without any extra funky options, and
> > looking at the code it looks like it does plain write() / recv() on the socket.
> >
>
> But the csum cost is both for sender and receiver ?
No, it was only on the receiver side that I noticed it.
>
> Please post the following :
>
> perf record -g "your iperf session"
>
> perf report | head -n 200
Sorry, but this is 3 years old stuff and I do not have the
setup anymore to reproduce.
Sébastien.
On Tue, 2013-10-15 at 16:15 +0200, Sébastien Dugué wrote:
> Hi Eric,
>
> On Tue, 15 Oct 2013 07:06:25 -0700
> Eric Dumazet <[email protected]> wrote:
> > But the csum cost is both for sender and receiver ?
>
> No, it was only on the receiver side that I noticed it.
>
Yes, as Andi said, we do the csum while copying the data for the
sender : (I disabled hardware assist tx checksum using 'ethtool -K eth0
tx off')
17.21% netperf [kernel.kallsyms] [k] csum_partial_copy_generic
|
--- csum_partial_copy_generic
|
|--97.39%-- __libc_send
|
--2.61%-- tcp_sendmsg
inet_sendmsg
sock_sendmsg
_sys_sendto
sys_sendto
system_call_fastpath
__libc_send
> Sorry, but this is 3 years old stuff and I do not have the
> setup anymore to reproduce.
And the receiver should also do the same : (ethtool -K eth0 rx off)
10.55% netserver [kernel.kallsyms] [k] csum_partial_copy_generic
|
--- csum_partial_copy_generic
|
|--98.24%-- __libc_recv
|
--1.76%-- skb_copy_and_csum_datagram
skb_copy_and_csum_datagram_iovec
tcp_rcv_established
tcp_v4_do_rcv
|
|--73.05%-- tcp_prequeue_process
| tcp_recvmsg
| inet_recvmsg
| sock_recvmsg
| SYSC_recvfrom
| SyS_recvfrom
| system_call_fastpath
| __libc_recv
|
So I suspect something is wrong with IPoIB.
On Tue, 2013-10-15 at 07:26 -0700, Eric Dumazet wrote:
> And the receiver should also do the same : (ethtool -K eth0 rx off)
>
> 10.55% netserver [kernel.kallsyms] [k] csum_partial_copy_generic
I get the csum_partial() if disabling prequeue.
echo 1 >/proc/sys/net/ipv4/tcp_low_latency
24.49% swapper [kernel.kallsyms] [k] csum_partial
|
--- csum_partial
skb_checksum
__skb_checksum_complete_head
__skb_checksum_complete
tcp_rcv_established
tcp_v4_do_rcv
tcp_v4_rcv
ip_local_deliver_finish
ip_local_deliver
ip_rcv_finish
ip_rcv
So yes, we can call csum_partial() in receive path in this case.
> I get the csum_partial() if disabling prequeue.
At least in the ipoib case I would consider that a misconfiguration.
"don't do this if it hurts"
There may be more such problems.
-Andi
On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote:
> * Joe Perches <[email protected]> wrote:
>
> > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > > attached patch brings much better results
> > > > >
> > > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > > > Recv Send Send Utilization Service Demand
> > > > > Socket Socket Message Elapsed Send Recv Send Recv
> > > > > Size Size Size Time Throughput local remote local remote
> > > > > bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
> > > > >
> > > > > 87380 16384 16384 10.00 8043.82 2.32 5.34 0.566 1.304
> > > > >
> > > > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > > > []
> > > > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > > > > zero = 0;
> > > > > count64 = count >> 3;
> > > > > while (count64) {
> > > > > - asm("addq 0*8(%[src]),%[res]\n\t"
> > > > > + asm("prefetch 5*64(%[src])\n\t"
> > > >
> > > > Might the prefetch size be too big here?
> > >
> > > To be effective, you need to prefetch well ahead of time.
> >
> > No doubt.
>
> So why did you ask then?
>
> > > 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
> >
> > 5 cachelines for some processors seems like a lot.
>
> What processors would that be?
The ones where conservatism in L1 cache use is good
because there are multiple threads running concurrently.
> Most processors have hundreds of cachelines even in their L1 cache.
And sometimes that many executable processes too.
> Thousands in the L2 cache, up to hundreds of thousands.
Irrelevant because prefetch doesn't apply there.
Ingo, Eric _showed_ that the prefetch is good here.
How about looking at a little optimization to the minimal
prefetch that gives that level of performance.
You could argue that prefetching PAGE_SIZE or larger
would be better still otherwise.
I suspect that using a smaller multiple of
L1_CACHE_BYTES like 2 or 3 would perform the same.
The last time it was looked at for copy_page_64.S was
quite a while ago. It looks like maybe 2003.
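One way to make that experiment cheap is to turn the distance into a knob
instead of hard-coding 5*64 (a sketch; CSUM_PREFETCH_LINES is a made-up name):
#include <linux/cache.h>	/* L1_CACHE_BYTES */
#include <linux/prefetch.h>	/* prefetch() */

#define CSUM_PREFETCH_LINES	5	/* try 2, 3, 4, 5 cachelines ahead */

static inline void csum_prefetch(const unsigned char *buff)
{
	/* 5 * 64 matches the 0x140 offset discussed above */
	prefetch(buff + CSUM_PREFETCH_LINES * L1_CACHE_BYTES);
}
Dropping CSUM_PREFETCH_LINES to 2 or 3 and rerunning the same netperf test
would show whether the shorter distance really performs the same.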
On Tue, 2013-10-15 at 18:02 +0200, Andi Kleen wrote:
> > I get the csum_partial() if disabling prequeue.
>
> At least in the ipoib case i would consider that a misconfiguration.
There is nothing you can do: if the application is not blocked on recv()
but is using poll()/epoll()/select(), the prequeue is not used at all.
In this case, we need to csum_partial() the frame before sending an ACK,
don't you think? ;)
On Tue, 2013-10-15 at 09:21 -0700, Joe Perches wrote:
> Ingo, Eric _showed_ that the prefetch is good here.
> How about looking at a little optimization to the minimal
> prefetch that gives that level of performance.
Wait a minute, my point was to remind everyone that the main cost is the
memory fetching.
It's nice to optimize cpu cycles if we are short of them,
but in the csum_partial() case, the bottleneck is the memory.
Also, I was wondering about the implications of changing the read order,
as it might fool the cpu's predictors.
I do not particularly care about finding the right prefetch stride;
I think the Intel guys know better than me.
* Joe Perches <[email protected]> wrote:
> On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote:
> > * Joe Perches <[email protected]> wrote:
> >
> > > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > > > attached patch brings much better results
> > > > > >
> > > > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > > > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > > > > Recv Send Send Utilization Service Demand
> > > > > > Socket Socket Message Elapsed Send Recv Send Recv
> > > > > > Size Size Size Time Throughput local remote local remote
> > > > > > bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
> > > > > >
> > > > > > 87380 16384 16384 10.00 8043.82 2.32 5.34 0.566 1.304
> > > > > >
> > > > > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > > > > []
> > > > > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > > > > > zero = 0;
> > > > > > count64 = count >> 3;
> > > > > > while (count64) {
> > > > > > - asm("addq 0*8(%[src]),%[res]\n\t"
> > > > > > + asm("prefetch 5*64(%[src])\n\t"
> > > > >
> > > > > Might the prefetch size be too big here?
> > > >
> > > > To be effective, you need to prefetch well ahead of time.
> > >
> > > No doubt.
> >
> > So why did you ask then?
> >
> > > > 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
> > >
> > > 5 cachelines for some processors seems like a lot.
> >
> > What processors would that be?
>
> The ones where conservatism in L1 cache use is good because there are
> multiple threads running concurrently.
What specific processor models would that be?
> > Most processors have hundreds of cachelines even in their L1 cache.
>
> And sometimes that many executable processes too.
Nonsense, this is an unrolled loop running in softirq context most of the
time that does not get preempted.
> > Thousands in the L2 cache, up to hundreds of thousands.
>
> Irrelevant because prefetch doesn't apply there.
What planet are you living on? Prefetch moves memory from L2 to L1
just as much as it moves cachelines from memory to the L2 cache.
Especially in the usecase cited here there will be a second use of the
data (when it's finally copied over into user-space), so the L2 cache size
very much matters.
The prefetches here matter mostly to the packet being processed: the ideal
size of the look-ahead window in csum_partial() is dictated by typical
memory latencies and bandwidth. The amount of parallelism is limited by
the number of carry bits we can maintain independently.
> Ingo, Eric _showed_ that the prefetch is good here. How about looking at
> a little optimization to the minimal prefetch that gives that level of
> performance.
Joe, instead of using a condescending tone in matters you clearly have
very little clue about you might want to start doing some real kernel
hacking in more serious kernel areas, beyond trivial matters such as
printk strings, to gain a bit of experience and respect ...
Every word you uttered in this thread made it more likely for me to
redirect you to /dev/null, permanently.
Thanks,
Ingo
On Wed, 2013-10-16 at 08:25 +0200, Ingo Molnar wrote:
> Prefetch takes memory from L2->L1 memory
> just as much as it moves it cachelines from memory to the L2 cache.
Yup, mea culpa.
I thought the prefetch was still to L1 like the Pentium.
On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> >
> > > So, early testing results today. I wrote a test module that allocated a 4k
> > > buffer, initialized it with random data, and called csum_partial on it 100000
> > > times, recording the time at the start and end of that loop. Results on a 2.4
> > > GHz Intel Xeon processor:
> > >
> > > Without patch: Average execute time for csum_partial was 808 ns
> > > With patch: Average execute time for csum_partial was 438 ns
> >
> > Impressive, but could you try again with data out of cache ?
>
> So I tried your patch on a GRE tunnel and got following results on a
> single TCP flow. (short result : no visible difference)
>
>
So I went to reproduce these results, but was unable to (due to the fact that I
only have a pretty jittery network to do testing across at the moment with
these devices). So instead I figured that I would go back to just doing
measurements with the module that I cobbled together, operating under the
assumption that it would give me accurate, relatively jitter-free results (I've
attached the module code for reference below). My results show slightly
different behavior:
Base results runs:
89417240
85170397
85208407
89422794
91645494
103655144
86063791
75647774
83502921
85847372
AVG = 875 ns
Prefetch only runs:
70962849
77555099
81898170
68249290
72636538
83039294
78561494
83393369
85317556
79570951
AVG = 781 ns
Parallel addition only runs:
42024233
44313064
48304416
64762297
42994259
41811628
55654282
64892958
55125582
42456403
AVG = 510 ns
Both prefetch and parallel addition:
41329930
40689195
61106622
46332422
49398117
52525171
49517101
61311153
43691814
49043084
AVG = 494 ns
For reference, each of the above large numbers is the number of nanoseconds
taken to compute the checksum of a 4kb buffer 100000 times. To get my average
results, I ran the test in a loop 10 times, averaged them, and divided by
100000.
Based on these, prefetching is obviously a good improvement, but not as good
as parallel execution, and the winner by far is doing both.
Thoughts?
Neil
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>
#include <linux/slab.h>		/* kmalloc()/kfree() */
#include <linux/random.h>	/* get_random_bytes() */
static char *buf;
static int __init csum_init_module(void)
{
        int i;
        __wsum sum = 0;
        struct timespec start, end;
        u64 time;

        buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
        if (!buf) {
                printk(KERN_CRIT "UNABLE TO ALLOCATE A BUFFER OF %lu bytes\n", PAGE_SIZE);
                return -ENOMEM;
        }

        printk(KERN_CRIT "INITALIZING BUFFER\n");
        get_random_bytes(buf, PAGE_SIZE);

        preempt_disable();
        printk(KERN_CRIT "STARTING ITERATIONS\n");
        getnstimeofday(&start);
        for (i = 0; i < 100000; i++)
                sum = csum_partial(buf, PAGE_SIZE, sum);
        getnstimeofday(&end);
        preempt_enable();

        if (start.tv_nsec > end.tv_nsec)
                time = (ULLONG_MAX - end.tv_nsec) + start.tv_nsec;
        else
                time = end.tv_nsec - start.tv_nsec;

        printk(KERN_CRIT "COMPLETED 100000 iterations of csum in %llu nanosec\n", time);
        kfree(buf);
        return 0;
}

static void __exit csum_cleanup_module(void)
{
        return;
}
module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");
On Wed, 2013-10-16 at 20:34 -0400, Neil Horman wrote:
> >
>
> So I went to reproduce these results, but was unable to (due to the fact that I
> only have a pretty jittery network to do testing accross at the moment with
> these devices). So instead I figured that I would go back to just doing
> measurements with the module that I cobbled together (operating under the
> assumption that it would give me accurate, relatively jitter free results (I've
> attached the module code for reference below). My results show slightly
> different behavior:
>
> Base results runs:
> 89417240
> 85170397
> 85208407
> 89422794
> 91645494
> 103655144
> 86063791
> 75647774
> 83502921
> 85847372
> AVG = 875 ns
>
> Prefetch only runs:
> 70962849
> 77555099
> 81898170
> 68249290
> 72636538
> 83039294
> 78561494
> 83393369
> 85317556
> 79570951
> AVG = 781 ns
>
> Parallel addition only runs:
> 42024233
> 44313064
> 48304416
> 64762297
> 42994259
> 41811628
> 55654282
> 64892958
> 55125582
> 42456403
> AVG = 510 ns
>
>
> Both prefetch and parallel addition:
> 41329930
> 40689195
> 61106622
> 46332422
> 49398117
> 52525171
> 49517101
> 61311153
> 43691814
> 49043084
> AVG = 494 ns
>
>
> For reference, each of the above large numbers is the number of nanoseconds
> taken to compute the checksum of a 4kb buffer 100000 times. To get my average
> results, I ran the test in a loop 10 times, averaged them, and divided by
> 100000.
>
>
> Based on these, prefetching is obviously a a good improvement, but not as good
> as parallel execution, and the winner by far is doing both.
>
> Thoughts?
>
> Neil
>
Your benchmark uses a single 4K page, so data is _super_ hot in cpu
caches.
(prefetch should give no speedups, I am surprised it makes any
difference)
Try now with 32 huge pages, to get a 64 MByte working set.
Because in reality we never csum_partial() data that is already in cpu cache.
(Unless the NIC preloaded the data into cpu cache before sending the
interrupt)
Really, if Sebastien got a speedup, it means that something fishy was
going on, like:
- A copy of data into some area of memory, prefilling cpu caches
- csum_partial() done while data is hot in cache.
This is exactly a "should not happen" scenario, because the csum in this
case should happen _while_ doing the copy, for 0 ns.
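For reference, the combined copy-and-checksum Eric is referring to is conceptually
along the lines of the sketch below. This is an illustration only: the fold down to
the 16-bit ones'-complement sum is omitted, and the kernel's real implementation of
this path is the asm routine csum_partial_copy_generic().
/*
 * Illustration only: the data is loaded once and feeds both the copy
 * and the running sum, so the checksum adds no extra memory traffic.
 * Folding the accumulator down to the 16-bit ones'-complement result
 * is omitted here.
 */
static unsigned long copy_and_csum_sketch(unsigned long *dst,
                                          const unsigned long *src,
                                          unsigned int nwords)
{
        unsigned long sum = 0;
        unsigned int i;

        for (i = 0; i < nwords; i++) {
                unsigned long v = src[i];       /* one load ...           */
                dst[i] = v;                     /* ... feeds the copy ... */
                sum += v;                       /* ... and the checksum   */
        }
        return sum;
}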
* Neil Horman <[email protected]> wrote:
> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> > >
> > > > So, early testing results today. I wrote a test module that, allocated a 4k
> > > > buffer, initalized it with random data, and called csum_partial on it 100000
> > > > times, recording the time at the start and end of that loop. Results on a 2.4
> > > > GHz Intel Xeon processor:
> > > >
> > > > Without patch: Average execute time for csum_partial was 808 ns
> > > > With patch: Average execute time for csum_partial was 438 ns
> > >
> > > Impressive, but could you try again with data out of cache ?
> >
> > So I tried your patch on a GRE tunnel and got following results on a
> > single TCP flow. (short result : no visible difference)
> >
> >
>
> So I went to reproduce these results, but was unable to (due to the fact that I
> only have a pretty jittery network to do testing accross at the moment with
> these devices). So instead I figured that I would go back to just doing
> measurements with the module that I cobbled together (operating under the
> assumption that it would give me accurate, relatively jitter free results (I've
> attached the module code for reference below). My results show slightly
> different behavior:
>
> Base results runs:
> 89417240
> 85170397
> 85208407
> 89422794
> 91645494
> 103655144
> 86063791
> 75647774
> 83502921
> 85847372
> AVG = 875 ns
>
> Prefetch only runs:
> 70962849
> 77555099
> 81898170
> 68249290
> 72636538
> 83039294
> 78561494
> 83393369
> 85317556
> 79570951
> AVG = 781 ns
>
> Parallel addition only runs:
> 42024233
> 44313064
> 48304416
> 64762297
> 42994259
> 41811628
> 55654282
> 64892958
> 55125582
> 42456403
> AVG = 510 ns
>
>
> Both prefetch and parallel addition:
> 41329930
> 40689195
> 61106622
> 46332422
> 49398117
> 52525171
> 49517101
> 61311153
> 43691814
> 49043084
> AVG = 494 ns
>
>
> For reference, each of the above large numbers is the number of
> nanoseconds taken to compute the checksum of a 4kb buffer 100000 times.
> To get my average results, I ran the test in a loop 10 times, averaged
> them, and divided by 100000.
>
> Based on these, prefetching is obviously a a good improvement, but not
> as good as parallel execution, and the winner by far is doing both.
But in the actual usecase mentioned the packet data was likely cache-cold,
it just arrived in the NIC and an IRQ got sent. Your testcase uses a
super-hot 4K buffer that fits into the L1 cache. So it's apples to
oranges.
To correctly simulate the workload you'd have to:
- allocate a buffer larger than your L2 cache.
- to measure the effects of the prefetches you'd also have to randomize
the individual buffer positions. See how 'perf bench numa' implements a
random walk via --data_rand_walk, in tools/perf/bench/numa.c.
Otherwise the CPU might learn your simplistic stream direction and the
L2 cache might hw-prefetch your data, interfering with any explicit
prefetches the code does. In many real-life usecases packet buffers are
scattered.
Also, it would be nice to see standard deviation noise numbers when two
averages are close to each other, to be able to tell whether differences
are statistically significant or not.
For example 'perf stat --repeat' will output stddev for you:
comet:~/tip> perf stat --repeat 20 --null bash -c 'usleep $((RANDOM*10))'
Performance counter stats for 'bash -c usleep $((RANDOM*10))' (20 runs):
0.189084480 seconds time elapsed ( +- 11.95% )
The last '+-' percentage is the noise of the measurement.
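As a quick aside, for numbers collected by hand like the runs earlier in this
thread, a trivial helper along these lines would give the same kind of figure
(hypothetical userspace code, not part of the benchmark module):
#include <math.h>
#include <stdio.h>

/*
 * Mean and sample standard deviation of a set of run times, using the
 * ten "base results" runs posted above as example input, so that two
 * close averages can be compared for statistical significance.
 */
int main(void)
{
        static const double runs[] = {
                89417240, 85170397, 85208407, 89422794, 91645494,
                103655144, 86063791, 75647774, 83502921, 85847372,
        };
        const int n = sizeof(runs) / sizeof(runs[0]);
        double mean = 0.0, var = 0.0;
        int i;

        for (i = 0; i < n; i++)
                mean += runs[i];
        mean /= n;

        for (i = 0; i < n; i++)
                var += (runs[i] - mean) * (runs[i] - mean);
        var /= n - 1;                           /* sample variance */

        printf("mean %.0f ns, stddev %.0f ns over %d runs of 100000 calls\n",
               mean, sqrt(var), n);
        return 0;
}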
Also note that you can inspect many cache behavior details of your
algorithm via perf stat - the -ddd option will give you a laundry list:
aldebaran:~> perf stat --repeat 20 -ddd perf bench sched messaging
...
Total time: 0.095 [sec]
Performance counter stats for 'perf bench sched messaging' (20 runs):
1519.128721 task-clock (msec) # 12.305 CPUs utilized ( +- 0.34% )
22,882 context-switches # 0.015 M/sec ( +- 2.84% )
3,927 cpu-migrations # 0.003 M/sec ( +- 2.74% )
16,616 page-faults # 0.011 M/sec ( +- 0.17% )
2,327,978,366 cycles # 1.532 GHz ( +- 1.61% ) [36.43%]
1,715,561,189 stalled-cycles-frontend # 73.69% frontend cycles idle ( +- 1.76% ) [38.05%]
715,715,454 stalled-cycles-backend # 30.74% backend cycles idle ( +- 2.25% ) [39.85%]
1,253,106,346 instructions # 0.54 insns per cycle
# 1.37 stalled cycles per insn ( +- 1.71% ) [49.68%]
241,181,126 branches # 158.763 M/sec ( +- 1.43% ) [47.83%]
4,232,053 branch-misses # 1.75% of all branches ( +- 1.23% ) [48.63%]
431,907,354 L1-dcache-loads # 284.313 M/sec ( +- 1.00% ) [48.37%]
20,550,528 L1-dcache-load-misses # 4.76% of all L1-dcache hits ( +- 0.82% ) [47.61%]
7,435,847 LLC-loads # 4.895 M/sec ( +- 0.94% ) [36.11%]
2,419,201 LLC-load-misses # 32.53% of all LL-cache hits ( +- 2.93% ) [ 7.33%]
448,638,547 L1-icache-loads # 295.326 M/sec ( +- 2.43% ) [21.75%]
22,066,490 L1-icache-load-misses # 4.92% of all L1-icache hits ( +- 2.54% ) [30.66%]
475,557,948 dTLB-loads # 313.047 M/sec ( +- 1.96% ) [37.96%]
6,741,523 dTLB-load-misses # 1.42% of all dTLB cache hits ( +- 2.38% ) [37.05%]
1,268,628,660 iTLB-loads # 835.103 M/sec ( +- 1.75% ) [36.45%]
74,192 iTLB-load-misses # 0.01% of all iTLB cache hits ( +- 2.88% ) [36.19%]
4,466,526 L1-dcache-prefetches # 2.940 M/sec ( +- 1.61% ) [36.17%]
2,396,311 L1-dcache-prefetch-misses # 1.577 M/sec ( +- 1.55% ) [35.71%]
0.123459566 seconds time elapsed ( +- 0.58% )
There's also a number of prefetch counters that might be useful:
aldebaran:~> perf list | grep prefetch
L1-dcache-prefetches [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
LLC-prefetches [Hardware cache event]
LLC-prefetch-misses [Hardware cache event]
node-prefetches [Hardware cache event]
node-prefetch-misses [Hardware cache event]
Thanks,
Ingo
On 10/17/2013 01:41 AM, Ingo Molnar wrote:
>
> To correctly simulate the workload you'd have to:
>
> - allocate a buffer larger than your L2 cache.
>
> - to measure the effects of the prefetches you'd also have to randomize
> the individual buffer positions. See how 'perf bench numa' implements a
> random walk via --data_rand_walk, in tools/perf/bench/numa.c.
> Otherwise the CPU might learn your simplistic stream direction and the
> L2 cache might hw-prefetch your data, interfering with any explicit
> prefetches the code does. In many real-life usecases packet buffers are
> scattered.
>
> Also, it would be nice to see standard deviation noise numbers when two
> averages are close to each other, to be able to tell whether differences
> are statistically significant or not.
>
Seriously, though, how much does it matter? All the above seems likely
to do is to drown the signal by adding noise.
If the parallel (threaded) checksumming is faster, which theory says it
should and microbenchmarking confirms, how important are the
macrobenchmarks?
-hpa
On Thu, 2013-10-17 at 11:19 -0700, H. Peter Anvin wrote:
> Seriously, though, how much does it matter? All the above seems likely
> to do is to drown the signal by adding noise.
I don't think so.
>
> If the parallel (threaded) checksumming is faster, which theory says it
> should and microbenchmarking confirms, how important are the
> macrobenchmarks?
Seriously, micro benchmarks are very misleading.
I spent time on this patch, and found no changes on real workloads.
I was excited first, then disappointed.
I hope we will find the real issue, as I really don't care of micro
benchmarks.
* H. Peter Anvin <[email protected]> wrote:
> On 10/17/2013 01:41 AM, Ingo Molnar wrote:
> >
> > To correctly simulate the workload you'd have to:
> >
> > - allocate a buffer larger than your L2 cache.
> >
> > - to measure the effects of the prefetches you'd also have to randomize
> > the individual buffer positions. See how 'perf bench numa' implements a
> > random walk via --data_rand_walk, in tools/perf/bench/numa.c.
> > Otherwise the CPU might learn your simplistic stream direction and the
> > L2 cache might hw-prefetch your data, interfering with any explicit
> > prefetches the code does. In many real-life usecases packet buffers are
> > scattered.
> >
> > Also, it would be nice to see standard deviation noise numbers when two
> > averages are close to each other, to be able to tell whether differences
> > are statistically significant or not.
>
>
> Seriously, though, how much does it matter? All the above seems likely
> to do is to drown the signal by adding noise.
I think it matters a lot and I don't think it 'adds' noise - it measures
something else (cache cold behavior - which is the common case for
first-time csum_partial() use for network packets), which was not measured
before, and that that is by its nature has different noise patterns.
I've done many cache-cold measurements myself and had no trouble achieving
statistically significant results and high precision.
> If the parallel (threaded) checksumming is faster, which theory says it
> should and microbenchmarking confirms, how important are the
> macrobenchmarks?
Microbenchmarks can be totally blind to things like the ideal prefetch
window size. (or whether a prefetch should be done at all: some CPUs will
throw away prefetches if enough regular fetches arrive.)
Also, 'naive' single-threaded algorithms can occasionally be better in the
cache-cold case because a linear, predictable stream of memory accesses
might saturate the memory bus better than a somewhat random looking,
interleaved web of accesses that might not harmonize with buffer depths.
I _think_ if correctly tuned then the parallel algorithm should be better
in the cache cold case, I just don't know with what parameters (and the
algorithm has at least one free parameter: the prefetch window size), and
I don't know how significant the effect is.
Also, more fundamentally, I absolutely detest doing no measurements or
measuring the wrong thing - IMHO there are too many 'blind' optimization
commits in the kernel with little to no observational data attached.
Thanks,
Ingo
On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
> On 10/11/2013 09:51 AM, Neil Horman wrote:
> > Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> > checksum offload hardware were spending a significant amount of time computing
> > checksums. We found that by splitting the checksum computation into two
> > separate streams, each skipping successive elements of the buffer being summed,
> > we could parallelize the checksum operation accros multiple alus. Since neither
> > chain is dependent on the result of the other, we get a speedup in execution (on
> > hardware that has multiple alu's available, which is almost ubiquitous on x86),
> > and only a negligible decrease on hardware that has only a single alu (an extra
> > addition is introduced). Since addition in commutative, the result is the same,
> > only faster
>
> On hardware that implement ADCX/ADOX then you should also be able to
> have additional streams interleaved since those instructions allow for
> dual carry chains.
>
> -hpa
>
I've been looking into this a bit more, and I'm a bit confused. According to
this:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html
by my read, this pair of instructions simply supports 2 carry-bit chains,
allowing for two parallel execution paths through the cpu that won't block on
one another. It's exactly the same as what's being done with the universally
available addq/adcq instructions, so there's no real speedup (that I can see).
Since we'd either have to use the alternatives macro to support adcx/adox here
or fall back to the old instruction set, it seems not worth the effort to
support the extension.
Or am I missing something?
Neil
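For reference, an untested sketch of the kind of dual carry-chain loop being
discussed (purely illustrative, assuming ADX-capable hardware; the final fold
of the two partial sums and their trailing carries is omitted):
/*
 * Illustrative sketch only, not the patch above: adcx accumulates
 * through CF while adox accumulates through OF, so the two chains
 * below do not serialize on a single carry flag.  Requires the ADX
 * extension; final carry folding is omitted.
 */
static inline unsigned long csum_adx_sketch(const unsigned long *src)
{
        unsigned long r1 = 0, r2 = 0;

        asm("xorq %[res1], %[res1]\n\t"         /* clears CF and OF (res1 is already zero) */
            "adcx 0*8(%[src]), %[res1]\n\t"
            "adox 1*8(%[src]), %[res2]\n\t"
            "adcx 2*8(%[src]), %[res1]\n\t"
            "adox 3*8(%[src]), %[res2]\n\t"
            "adcx 4*8(%[src]), %[res1]\n\t"
            "adox 5*8(%[src]), %[res2]\n\t"
            "adcx 6*8(%[src]), %[res1]\n\t"
            "adox 7*8(%[src]), %[res2]"
            : [res1] "+r" (r1), [res2] "+r" (r2)
            : [src] "r" (src)
            : "cc", "memory");

        return r1 + r2;
}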
>
> Your benchmark uses a single 4K page, so data is _super_ hot in cpu
> caches.
> ( prefetch should give no speedups, I am surprised it makes any
> difference)
>
> Try now with 32 huges pages, to get 64 MBytes of working set.
>
> Because in reality we never csum_partial() data in cpu cache.
> (Unless the NIC preloaded the data into cpu cache before sending the
> interrupt)
>
> Really, if Sebastien got a speed up, it means that something fishy was
> going on, like :
>
> - A copy of data into some area of memory, prefilling cpu caches
> - csum_partial() done while data is hot in cache.
>
> This is exactly a "should not happen" scenario, because the csum in this
> case should happen _while_ doing the copy, for 0 ns.
>
>
>
>
So, I took your suggestion, and modified my test module to allocate 32 huge
pages instead of a single 4k page. I've attached the module changes and the
results below. Contrary to your assertion above, results came out the same as
in my first run. See below:
base results:
80381491
85279536
99537729
80398029
121385411
109478429
85369632
99242786
80250395
98170542
AVG=939 ns
prefetch only results:
86803812
101891541
85762713
95866956
102316712
93529111
90473728
79374183
93744053
90075501
AVG=919 ns
parallel only results:
68994797
63503221
64298412
63784256
75350022
66398821
77776050
79158271
91006098
67822318
AVG=718 ns
both prefetch and parallel results:
68852213
77536525
63963560
67255913
76169867
80418081
63485088
62386262
75533808
57731705
AVG=693 ns
So based on these, it seems that your assertion that prefetching is the key to
the speedup here isn't quite correct. Either that, or the testing continues to
be invalid. I'm going to try to do some of Ingo's microbenchmarking just to see
if that provides any further details. But any other thoughts about what might
be going awry are appreciated.
My module code:
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>
static char *buf;
#define BUFSIZ_ORDER 4
#define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
static int __init csum_init_module(void)
{
        int i;
        __wsum sum = 0;
        struct timespec start, end;
        u64 time;
        struct page *page;
        u32 offset = 0;

        page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
        if (!page) {
                printk(KERN_CRIT "NO MEMORY FOR ALLOCATION");
                return -ENOMEM;
        }
        buf = page_address(page);

        printk(KERN_CRIT "INITALIZING BUFFER\n");

        preempt_disable();
        printk(KERN_CRIT "STARTING ITERATIONS\n");
        getnstimeofday(&start);
        for (i = 0; i < 100000; i++) {
                sum = csum_partial(buf + offset, PAGE_SIZE, sum);
                offset = (offset < BUFSIZ - PAGE_SIZE) ? offset + PAGE_SIZE : 0;
        }
        getnstimeofday(&end);
        preempt_enable();

        if ((unsigned long)start.tv_nsec > (unsigned long)end.tv_nsec)
                time = (ULONG_MAX - (unsigned long)end.tv_nsec) + (unsigned long)start.tv_nsec;
        else
                time = (unsigned long)end.tv_nsec - (unsigned long)start.tv_nsec;

        printk(KERN_CRIT "COMPLETED 100000 iterations of csum in %llu nanosec\n", time);
        __free_pages(page, BUFSIZ_ORDER);
        return 0;
}

static void __exit csum_cleanup_module(void)
{
        return;
}
module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");
If implemented properly adcx/adox should give additional speedup... that is the whole reason for their existence.
Neil Horman <[email protected]> wrote:
>On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
>> On 10/11/2013 09:51 AM, Neil Horman wrote:
>> > Sébastien Dugué reported to me that devices implementing ipoib
>(which don't have
>> > checksum offload hardware were spending a significant amount of
>time computing
>> > checksums. We found that by splitting the checksum computation
>into two
>> > separate streams, each skipping successive elements of the buffer
>being summed,
>> > we could parallelize the checksum operation accros multiple alus.
>Since neither
>> > chain is dependent on the result of the other, we get a speedup in
>execution (on
>> > hardware that has multiple alu's available, which is almost
>ubiquitous on x86),
>> > and only a negligible decrease on hardware that has only a single
>alu (an extra
>> > addition is introduced). Since addition in commutative, the result
>is the same,
>> > only faster
>>
>> On hardware that implement ADCX/ADOX then you should also be able to
>> have additional streams interleaved since those instructions allow
>for
>> dual carry chains.
>>
>> -hpa
>>
>I've been looking into this a bit more, and I'm a bit confused.
>According to
>this:
>http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html
>
>by my read, this pair of instructions simply supports 2 carry bit
>chains,
>allowing for two parallel execution paths through the cpu that won't
>block on
>one another. Its exactly the same as whats being done with the
>universally
>available addcq instruction, so theres no real speedup (that I can
>see). Since
>we'd either have to use the alternatives macro to support adcx/adox
>here or the
>old instruction set, it seems not overly worth the effort to support
>the
>extension.
>
>Or am I missing something?
>
>Neil
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote:
> >
> for(i=0;i<100000;i++) {
> sum = csum_partial(buf+offset, PAGE_SIZE, sum);
> offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE : 0;
> }
Please replace this with random accesses, and use the more standard 1500-byte
length.
offset = prandom_u32() % (BUFSIZ - 1500);
offset &= ~1U;
sum = csum_partial(buf + offset, 1500, sum);
You are basically doing sequential accesses, so prefetching should
be done automatically by the cpu itself.
Thanks !
On Fri, Oct 18, 2013 at 10:20:35AM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote:
> > >
>
> > for(i=0;i<100000;i++) {
> > sum = csum_partial(buf+offset, PAGE_SIZE, sum);
> > offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE : 0;
> > }
>
> Please replace this by random accesses, and use the more standard 1500
> length.
>
> offset = prandom_u32() % (BUFSIZ - 1500);
> offset &= ~1U;
>
> sum = csum_partial(buf + offset, 1500, sum);
>
> You are basically doing sequential accesses, so prefetch should
> be automatically done by cpu itself.
>
> Thanks !
>
>
>
Sure, you got it! Results below. They continue to bear out that parallel
execution beats prefetch-only execution, and that doing both is better than
either one alone.
base results:
53156647
59670931
62839770
44842780
39297190
44905905
53300688
53287805
39436951
43021730
AVG=493 ns
prefetch-only results:
40337434
51986404
43509199
53128857
52973171
53520649
53536338
50325466
44864664
47908398
AVG=492 ns
parallel-only results:
52157183
44496511
36180011
38298368
36258099
43263531
45365519
54116344
62529241
63118224
AVG = 475 ns
both prefetch and parallel:
44317078
44526464
45761272
44477906
34868814
44637904
49478309
49718417
58681403
58304972
AVG = 474 ns
Here's the code I was using:
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>
static char *buf;
#define BUFSIZ_ORDER 4
#define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
static int __init csum_init_module(void)
{
        int i;
        __wsum sum = 0;
        struct timespec start, end;
        u64 time;
        struct page *page;
        u32 offset = 0;

        page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
        if (!page) {
                printk(KERN_CRIT "NO MEMORY FOR ALLOCATION");
                return -ENOMEM;
        }
        buf = page_address(page);

        printk(KERN_CRIT "INITALIZING BUFFER\n");

        preempt_disable();
        printk(KERN_CRIT "STARTING ITERATIONS\n");
        getnstimeofday(&start);
        for (i = 0; i < 100000; i++) {
                sum = csum_partial(buf + offset, 1500, sum);
                offset = prandom_u32() % (BUFSIZ - 1500);
                offset &= ~1U;
        }
        getnstimeofday(&end);
        preempt_enable();

        if ((unsigned long)start.tv_nsec > (unsigned long)end.tv_nsec)
                time = (ULONG_MAX - (unsigned long)end.tv_nsec) + (unsigned long)start.tv_nsec;
        else
                time = (unsigned long)end.tv_nsec - (unsigned long)start.tv_nsec;

        printk(KERN_CRIT "COMPLETED 100000 iterations of csum in %llu nanosec\n", time);
        __free_pages(page, BUFSIZ_ORDER);
        return 0;
}

static void __exit csum_cleanup_module(void)
{
        return;
}
module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");
On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> #define BUFSIZ_ORDER 4
> #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> static int __init csum_init_module(void)
> {
> int i;
> __wsum sum = 0;
> struct timespec start, end;
> u64 time;
> struct page *page;
> u32 offset = 0;
>
> page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
Not sure what you are doing here, but its not correct.
You have a lot of variations in your results, I suspect a NUMA affinity
problem.
You can try the following code, and use taskset to make sure you run
this on a cpu on node 0
#define BUFSIZ 2*1024*1024
#define NBPAGES 16
static int __init csum_init_module(void)
{
        int i;
        __wsum sum = 0;
        u64 start, end;
        void *base, *addrs[NBPAGES];
        u32 rnd, offset;

        memset(addrs, 0, sizeof(addrs));
        for (i = 0; i < NBPAGES; i++) {
                addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
                if (!addrs[i])
                        goto out;
        }

        local_bh_disable();
        pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());
        start = ktime_to_ns(ktime_get());

        for (i = 0; i < 100000; i++) {
                rnd = prandom_u32();
                base = addrs[rnd % NBPAGES];
                rnd /= NBPAGES;
                offset = rnd % (BUFSIZ - 1500);
                offset &= ~1U;
                sum = csum_partial_opt(base + offset, 1500, sum);
        }
        end = ktime_to_ns(ktime_get());
        local_bh_enable();

        pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);

out:
        for (i = 0; i < NBPAGES; i++)
                kfree(addrs[i]);

        return 0;
}

static void __exit csum_cleanup_module(void)
{
        return;
}
On Mon, 2013-10-14 at 22:49 -0700, Joe Perches wrote:
> On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
>> On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
>> > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
>> > > attached patch brings much better results
>> > >
>> > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
>> > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
>> > > Recv Send Send Utilization Service Demand
>> > > Socket Socket Message Elapsed Send Recv Send Recv
>> > > Size Size Size Time Throughput local remote local remote
>> > > bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>> > >
>> > > 87380 16384 16384 10.00 8043.82 2.32 5.34 0.566 1.304
>> > >
>> > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
>> > []
>> > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>> > > zero = 0;
>> > > count64 = count >> 3;
>> > > while (count64) {
>> > > - asm("addq 0*8(%[src]),%[res]\n\t"
>> > > + asm("prefetch 5*64(%[src])\n\t"
>> >
>> > Might the prefetch size be too big here?
>>
>> To be effective, you need to prefetch well ahead of time.
>
> No doubt.
>
>> 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
>
> 5 cachelines for some processors seems like a lot.
>
> Given you've got a test rig, maybe you could experiment
> with 2 and increase it until it doesn't get better.
You have a fundamental misunderstanding of the prefetch operation. The 5*64
in the above asm statement does not specify a size; it is an offset from the
base pointer %[src]. So it is saying to go to address %[src] + 5*64 and
prefetch there. The amount prefetched is always one cache line: once the
address is known, whatever cacheline holds that address is the cacheline we
will prefetch. Your size concerns have no meaning.
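To make that concrete, a minimal illustration (using the prefetcht0 variant
here; which hint instruction to use is a separate question from the addressing):
/*
 * Illustration only: the operand is an address (base plus displacement),
 * not a size.  This requests the cache line containing buff + 5*64,
 * i.e. five cache lines ahead of where the loop is currently summing,
 * and always pulls in exactly one cache line.
 */
static inline void prefetch_ahead(const unsigned char *buff)
{
        asm volatile("prefetcht0 5*64(%0)" : : "r" (buff));
}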
On 2013-10-17, Ingo wrote:
> * Neil Horman <[email protected]> wrote:
>
>> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
>> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
>> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
>> > >
>> > > > So, early testing results today. I wrote a test module that, allocated a 4k
>> > > > buffer, initalized it with random data, and called csum_partial on it 100000
>> > > > times, recording the time at the start and end of that loop. Results on a 2.4
>> > > > GHz Intel Xeon processor:
>> > > >
>> > > > Without patch: Average execute time for csum_partial was 808 ns
>> > > > With patch: Average execute time for csum_partial was 438 ns
>> > >
>> > > Impressive, but could you try again with data out of cache ?
>> >
>> > So I tried your patch on a GRE tunnel and got following results on a
>> > single TCP flow. (short result : no visible difference)
[ to Eric ]
You didn't show profile data from before and after the patch, only after. And it
showed csum_partial at 19.9% IIRC. That's much better than what I get on my test
machines (even though this is on a rhel6.5-beta kernel; understand that the entire
IB stack in rhel6.5-beta is up to a 3.10 level, with parts closer to 3.11+):
For IPoIB in connected mode, where there is no rx csum offload:
::::::::::::::
rhel6.5-beta-cm-no-offload-oprofile-run1
::::::::::::::
CPU: Intel Architectural Perfmon, speed 3392.17 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
Samples on CPU 0
Samples on CPU 4
Samples on CPU 6 (edited out as it was only a few samples and ruined
line wrapping)
samples % samples % image name symbol name
98588 59.1431 215 57.9515 vmlinux csum_partial_copy_generic
3003 1.8015 8 2.1563 vmlinux tcp_sendmsg
2219 1.3312 0 0 vmlinux irq_entries_start
2076 1.2454 4 1.0782 vmlinux avc_has_perm_noaudit
1815 1.0888 0 0 mlx4_ib.ko mlx4_ib_poll_cq
So, here anyway, it's 60%. When it shows up at that level, there is a lot more to be
gained from an improvement to that function. And here's the measured performance
from those runs:
[root@rdma-master rhel6.5-beta-client]# more rhel6.5-beta-cm-no-offload-netperf.output
Recv Send Send Utilization
Socket Socket Message Elapsed Send Recv
Size Size Size Time Throughput local remote
bytes bytes bytes secs. MBytes /s % S % S
87380 16384 16384 20.00 2815.29 7.92 12.80
87380 16384 16384 20.00 2798.22 7.88 12.87
87380 16384 16384 20.00 2786.74 7.79 12.84
The test machine has 8 logical CPUs, so 12.5% is 100% of a single CPU. That
said, the receive side is obviously the bottleneck here, and 60% of that
bottleneck is csum_partial.
[ snip a bunch of Neil's measurements ]
>> Based on these, prefetching is obviously a a good improvement, but not
>> as good as parallel execution, and the winner by far is doing both.
OK, this is where I have to chime in that these tests can *not* be used
to say anything about prefetch, and not just for the reasons Ingo lists
in his various emails to this thread. In fact I would argue that Ingo's
methodology on this is wrong as well.
All prefetch operations get sent to an access queue in the memory controller
where they compete with both other reads and writes for the available memory
bandwidth. The optimal prefetch window is not a factor of memory bandwidth
and latency, it's a factor of memory bandwidth, memory latency, current memory
access queue depth at time prefetch is issued, and memory bank switch time *
number of queued memory operations that will require a bank switch. In other
words, it's much more complex and also much more fluid than any static
optimization can pull out. So every time I see someone run a series of micro-
benchmarks like you just did, where the system was only doing the micro-
benchmark and not a real workload, and we draw conclusions about optimal
prefetch distances from that test, I cringe inside and I think I even die...
just a little.
A better test for this, IMO, would be to start a local kernel compile with at
least twice as many gcc instances allowed as you have CPUs, *then* run your
benchmark kernel module and see what prefetch distance works well. This
distance should be far enough out that it can withstand other memory pressure,
yet not so far as to constantly be prefetching, tossing the result out of cache
due to pressure, then fetching/stalling that same memory on load. And it may
not benchmark as well on a quiescent system running only the micro-benchmark,
but it should end up performing better in actual real world usage.
> Also, it would be nice to see standard deviation noise numbers when two
> averages are close to each other, to be able to tell whether differences
> are statistically significant or not.
>
> For example 'perf stat --repeat' will output stddev for you:
>
> comet:~/tip> perf stat --repeat 20 --null bash -c 'usleep $((RANDOM*10))'
>
> Performance counter stats for 'bash -c usleep $((RANDOM*10))' (20 runs):
>
> 0.189084480 seconds time elapsed ( +- 11.95% )
[ snip perf usage tips ]
I ran my original tests with oprofile. I'll rerun the last one plus some new
tests with the various incarnations of this patch using perf and report the
results back here.
However, the machines I ran these tests on were limited by a 40GBit/s line
speed, with a theoretical max of 4GBytes/s due to bit encoding on the wire,
and I think limited even a bit lower by the theoretical limit of useful data
across a PCI-e gen2 x8 bus. So I wouldn't expect the throughput to go
much higher even if this helps; it should mainly reduce CPU usage. I can
try the same tests on a 56GBit/s link and with cards that have PCI-e
gen3 and see how those machines do by comparison (the hosts are identical,
just the cards are different).
* Doug Ledford <[email protected]> wrote:
> >> Based on these, prefetching is obviously a a good improvement, but
> >> not as good as parallel execution, and the winner by far is doing
> >> both.
>
> OK, this is where I have to chime in that these tests can *not* be used
> to say anything about prefetch, and not just for the reasons Ingo lists
> in his various emails to this thread. In fact I would argue that Ingo's
> methodology on this is wrong as well.
Well, I didn't go into as many details as you - but I agree with your full
list obviously:
> All prefetch operations get sent to an access queue in the memory
> controller where they compete with both other reads and writes for the
> available memory bandwidth. The optimal prefetch window is not a factor
> of memory bandwidth and latency, it's a factor of memory bandwidth,
> memory latency, current memory access queue depth at time prefetch is
> issued, and memory bank switch time * number of queued memory operations
> that will require a bank switch. In other words, it's much more complex
> and also much more fluid than any static optimization can pull out.
> [...]
But this is generally true of _any_ static operation - CPUs are complex,
workloads are complex, other threads, CPUs, sockets, devices might
interact, etc.
Yet it does not make it invalid to optimize for the isolated, static
usecase that was offered, because 'dynamism' and parallelism in a real
system will rarely make that optimization completely invalid, it will
typically only diminish its fruits to a certain degree (for example by
causing prefetches to be discarded).
What I was objecting to strongly here was to measure the _wrong_ thing,
i.e. the cache-hot case. The cache-cold case should be measured in a low
noise fashion, so that results are representative. It's closer to the real
usecase than any other microbenchmark. That will give us a usable speedup
figure and will tell us which technique helped how much and which
parameter should be how large.
> [...] So every time I see someone run a series of micro- benchmarks
> like you just did, where the system was only doing the micro- benchmark
> and not a real workload, and we draw conclusions about optimal prefetch
> distances from that test, I cringe inside and I think I even die... just
> a little.
So the thing is, microbenchmarks can indeed be misleading - and as in this
case the cache-hot claims can be outright dangerously misleading.
But yet, if done correctly and interpreted correctly they tell us a little
bit of the truth and are often correlated to real performance.
Do microbenchmarks show us everything that a 'real' workload inhibits? Not
at all, they are way too simple for that. They are a shortcut, an
indicator, which is often helpful as long as not taken as 'the'
performance of the system.
> A better test for this, IMO, would be to start a local kernel compile
> with at least twice as many gcc instances allowed as you have CPUs,
> *then* run your benchmark kernel module and see what prefetch distance
> works well. [...]
I don't agree that this represents our optimization target. It may
represent _one_ optimization target. But many other important usecases,
such as a dedicated file server, or a computation node that is
cache-optimized, would be unlikely to show such high parallel memory pressure
as a GCC compilation.
> [...] This distance should be far enough out that it can withstand
> other memory pressure, yet not so far as to constantly be prefetching,
> tossing the result out of cache due to pressure, then fetching/stalling
> that same memory on load. And it may not benchmark as well on a
> quiescent system running only the micro-benchmark, but it should end up
> performing better in actual real world usage.
The 'fully adversarial' case where all resources are maximally competed
for by all other cores is actually pretty rare in practice. I don't say it
does not happen or that it does not matter, but I do say there are many
other important usecases as well.
More importantly, the 'maximally adversarial' case is very hard to
generate, validate, and it's highly system dependent!
Cache-cold (and cache hot) microbenchmarks on the other hand tend to be
more stable, because they typically reflect current physical (mostly
latency) limits of CPU and system technology, _not_ highly system
dependent resource sizing (mostly bandwidth) limits which are very hard to
optimize for in a generic fashion.
Cache-cold and cache-hot measurements are, in a way, important physical
'eigenvalues' of a complex system. If they both show speedups then it's
likely that a more dynamic, contended for, mixed workload will show
speedups as well. And these 'eigenvalues' are statistically much more
stable across systems, and that's something we care for when we implement
various lowlevel assembly routines in arch/x86/ which cover many different
systems with different bandwidth characteristics.
I hope I managed to explain my views clearly enough on this.
Thanks,
Ingo
On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
>
> > #define BUFSIZ_ORDER 4
> > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > static int __init csum_init_module(void)
> > {
> > int i;
> > __wsum sum = 0;
> > struct timespec start, end;
> > u64 time;
> > struct page *page;
> > u32 offset = 0;
> >
> > page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
>
> Not sure what you are doing here, but its not correct.
>
Why not? You asked for a test with 32 hugepages, so I allocated 32 hugepages.
> You have a lot of variations in your results, I suspect a NUMA affinity
> problem.
>
I do have some variation, you're correct, but I don't think it's a NUMA issue
> You can try the following code, and use taskset to make sure you run
> this on a cpu on node 0
>
I did run this with taskset to do exactly that (hence my comment above). I'll
be glad to run your variant on Monday morning though and provide results.
Best
Neil
> #define BUFSIZ 2*1024*1024
> #define NBPAGES 16
>
> static int __init csum_init_module(void)
> {
> int i;
> __wsum sum = 0;
> u64 start, end;
> void *base, *addrs[NBPAGES];
> u32 rnd, offset;
>
> memset(addrs, 0, sizeof(addrs));
> for (i = 0; i < NBPAGES; i++) {
> addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
> if (!addrs[i])
> goto out;
> }
>
> local_bh_disable();
> pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());
> start = ktime_to_ns(ktime_get());
>
> for (i = 0; i < 100000; i++) {
> rnd = prandom_u32();
> base = addrs[rnd % NBPAGES];
> rnd /= NBPAGES;
> offset = rnd % (BUFSIZ - 1500);
> offset &= ~1U;
> sum = csum_partial_opt(base + offset, 1500, sum);
> }
> end = ktime_to_ns(ktime_get());
> local_bh_enable();
>
> pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);
>
> out:
> for (i = 0; i < NBPAGES; i++)
> kfree(addrs[i]);
>
> return 0;
> }
>
> static void __exit csum_cleanup_module(void)
> {
> return;
> }
>
>
>
>
On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote:
> On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> >
> > > #define BUFSIZ_ORDER 4
> > > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > > static int __init csum_init_module(void)
> > > {
> > > int i;
> > > __wsum sum = 0;
> > > struct timespec start, end;
> > > u64 time;
> > > struct page *page;
> > > u32 offset = 0;
> > >
> > > page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
> >
> > Not sure what you are doing here, but its not correct.
> >
> Why not? You asked for a test with 32 hugepages, so I allocated 32 hugepages.
Not really. We cannot allocate 64 Mbytes in a single alloc_pages() call
on x86 (MAX_ORDER = 11).
You noticed nothing because you did not
write anything to the 64 Mbyte area (and corrupt memory) or
use CONFIG_DEBUG_PAGEALLOC=y.
Your code read data out of bounds and was lucky, that's all...
You in fact allocated a block of (4096 << 4) bytes: the order passed to
alloc_pages() is in units of base pages, so order 4 gives 2^4 = 16 pages
of 4 KB, i.e. 64 KB, not the 64 MB that BUFSIZ implies.
On Mon, Oct 21, 2013 at 10:31:38AM -0700, Eric Dumazet wrote:
> On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote:
> > On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> > > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> > >
> > > > #define BUFSIZ_ORDER 4
> > > > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > > > static int __init csum_init_module(void)
> > > > {
> > > > int i;
> > > > __wsum sum = 0;
> > > > struct timespec start, end;
> > > > u64 time;
> > > > struct page *page;
> > > > u32 offset = 0;
> > > >
> > > > page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
> > >
> > > Not sure what you are doing here, but its not correct.
> > >
> > Why not? You asked for a test with 32 hugepages, so I allocated 32 hugepages.
>
> Not really. We cannot allocate 64 Mbytes in a single alloc_pages() call
> on x86. (MAX_ORDER = 11)
>
> You noticed nothing because you did not
> write anything on the 64Mbytes area (and corrupt memory) or
> use CONFIG_DEBUG_PAGEALLOC=y.
>
> Your code read data out of bounds and was lucky, thats all...
>
> You in fact allocated a page of (4096<<4) bytes
>
Gahh! I see what I did: I treated the order in the alloc_pages call as an order
of huge pages, but it allocates that order in normally sized pages, so I got a
much smaller buffer than intended and then treated it as huge. Stupid of me...
I'll have results on your version of the test case in just a bit here
Neil
>
>
>
On 10/19/2013 04:23 AM, Ingo Molnar wrote:
>
> * Doug Ledford <[email protected]> wrote:
>> All prefetch operations get sent to an access queue in the memory
>> controller where they compete with both other reads and writes for the
>> available memory bandwidth. The optimal prefetch window is not a factor
>> of memory bandwidth and latency, it's a factor of memory bandwidth,
>> memory latency, current memory access queue depth at time prefetch is
>> issued, and memory bank switch time * number of queued memory operations
>> that will require a bank switch. In other words, it's much more complex
>> and also much more fluid than any static optimization can pull out.
>> [...]
>
> But this is generally true of _any_ static operation - CPUs are complex,
> workloads are complex, other threads, CPUs, sockets, devices might
> interact, etc.
>
> Yet it does not make it invalid to optimize for the isolated, static
> usecase that was offered, because 'dynamism' and parallelism in a real
> system will rarely make that optimization completely invalid, it will
> typically only diminish its fruits to a certain degree (for example by
> causing prefetches to be discarded).
So, prefetches are a bit of a special beast in that, if they are done
incorrectly, they can actually make the overall system slower than if we
didn't do anything at all. If you are talking about anything other than
prefetch I would agree with you. With prefetch, as much as possible,
you need to mimic the environment for which you are optimizing. Neil's
test kernel module just called csum_partial on a bunch of memory pages.
The actual usage pattern of csum_partial though is that it will be used
while, most likely, there is ongoing DMA of data packets across the
network interface. It is very unlikely that we care about optimizing
csum_partial for the no-network-activity case, since csum_partial is almost
always invoked as a result of network activity and is unlikely to happen in
isolation. As such, my suggestion about a kernel compile was to create
activity across the PCI bus to hard drives, mimicking network interface DMA
traffic. You could also run a netperf instance instead. I just don't agree
with optimizing it without simultaneous DMA traffic, as that particular case
is likely to be a rarity, not the norm.
> What I was objecting to strongly here was to measure the _wrong_ thing,
> i.e. the cache-hot case. The cache-cold case should be measured in a low
> noise fashion, so that results are representative. It's closer to the real
> usecase than any other microbenchmark. That will give us a usable speedup
> figure and will tell us which technique helped how much and which
> parameter should be how large.
Cold cache, yes. Low noise, yes. But you need DMA traffic at the same
time to be truly representative.
>> [...] So every time I see someone run a series of micro- benchmarks
>> like you just did, where the system was only doing the micro- benchmark
>> and not a real workload, and we draw conclusions about optimal prefetch
>> distances from that test, I cringe inside and I think I even die... just
>> a little.
>
> So the thing is, microbenchmarks can indeed be misleading - and as in this
> case the cache-hot claims can be outright dangerously misleading.
So can non-DMA cases.
> But yet, if done correctly and interpreted correctly they tell us a little
> bit of the truth and are often correlated to real performance.
>
> Do microbenchmarks show us everything that a 'real' workload inhibits? Not
> at all, they are way too simple for that. They are a shortcut, an
> indicator, which is often helpful as long as not taken as 'the'
> performance of the system.
>
>> A better test for this, IMO, would be to start a local kernel compile
>> with at least twice as many gcc instances allowed as you have CPUs,
>> *then* run your benchmark kernel module and see what prefetch distance
>> works well. [...]
>
> I don't agree that this represents our optimization target. It may
> represent _one_ optimization target. But many other important usecases
> such as a dedicated file server, or a computation node that is
> cache-optimized, would unlikely to show such high parallel memory pressure
> as a GCC compilation.
But they will *all* show network DMA load, not quiescent DMA load.
>> [...] This distance should be far enough out that it can withstand
>> other memory pressure, yet not so far as to constantly be prefetching,
>> tossing the result out of cache due to pressure, then fetching/stalling
>> that same memory on load. And it may not benchmark as well on a
>> quiescent system running only the micro-benchmark, but it should end up
>> performing better in actual real world usage.
>
> The 'fully adversarial' case where all resources are maximally competed
> for by all other cores is actually pretty rare in practice. I don't say it
> does not happen or that it does not matter, but I do say there are many
> other important usecases as well.
>
> More importantly, the 'maximally adversarial' case is very hard to
> generate, validate, and it's highly system dependent!
This I agree with 100%, which is why I tend to think we should scrap the
static prefetch optimizations entirely and have a boot-up test that
allows us to find our optimum prefetch distance for our given hardware; a
rough sketch of the idea is below.
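Only a rough sketch of what I mean, with a hypothetical csum_partial_stride()
variant that takes the prefetch distance (in cache lines) as an extra parameter,
and with the cold-cache setup omitted:
/*
 * Rough sketch only: time a fixed amount of checksumming for several
 * candidate prefetch strides and keep the fastest.  csum_partial_stride()
 * is hypothetical; a real version of this would also need a working set
 * larger than the caches so the runs stay cache cold, as discussed
 * elsewhere in this thread.
 */
static unsigned int __init pick_prefetch_stride(const void *buf, int len)
{
        unsigned int stride, best_stride = 0;
        u64 t0, t1, best = ULLONG_MAX;
        int i;

        for (stride = 0; stride <= 8; stride++) {
                t0 = ktime_to_ns(ktime_get());
                for (i = 0; i < 10000; i++)
                        csum_partial_stride(buf, len, 0, stride);
                t1 = ktime_to_ns(ktime_get());
                if (t1 - t0 < best) {
                        best = t1 - t0;
                        best_stride = stride;
                }
        }
        return best_stride;
}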
> Cache-cold (and cache hot) microbenchmarks on the other hand tend to be
> more stable, because they typically reflect current physical (mostly
> latency) limits of CPU and system technology, _not_ highly system
> dependent resource sizing (mostly bandwidth) limits which are very hard to
> optimize for in a generic fashion.
>
> Cache-cold and cache-hot measurements are, in a way, important physical
> 'eigenvalues' of a complex system. If they both show speedups then it's
> likely that a more dynamic, contended for, mixed workload will show
> speedups as well. And these 'eigenvalues' are statistically much more
> stable across systems, and that's something we care for when we implement
> various lowlevel assembly routines in arch/x86/ which cover many different
> systems with different bandwidth characteristics.
>
> I hope I managed to explain my views clearly enough on this.
>
> Thanks,
>
> Ingo
>
--
Doug Ledford <[email protected]>
GPG KeyID: 0E572FDD
http://people.redhat.com/dledford
On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
>
> > #define BUFSIZ_ORDER 4
> > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > static int __init csum_init_module(void)
> > {
> > int i;
> > __wsum sum = 0;
> > struct timespec start, end;
> > u64 time;
> > struct page *page;
> > u32 offset = 0;
> >
> > page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
>
> Not sure what you are doing here, but its not correct.
>
> You have a lot of variations in your results, I suspect a NUMA affinity
> problem.
>
> You can try the following code, and use taskset to make sure you run
> this on a cpu on node 0
>
> #define BUFSIZ 2*1024*1024
> #define NBPAGES 16
>
> static int __init csum_init_module(void)
> {
> int i;
> __wsum sum = 0;
> u64 start, end;
> void *base, *addrs[NBPAGES];
> u32 rnd, offset;
>
> memset(addrs, 0, sizeof(addrs));
> for (i = 0; i < NBPAGES; i++) {
> addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
> if (!addrs[i])
> goto out;
> }
>
> local_bh_disable();
> pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());
> start = ktime_to_ns(ktime_get());
>
> for (i = 0; i < 100000; i++) {
> rnd = prandom_u32();
> base = addrs[rnd % NBPAGES];
> rnd /= NBPAGES;
> offset = rnd % (BUFSIZ - 1500);
> offset &= ~1U;
> sum = csum_partial_opt(base + offset, 1500, sum);
> }
> end = ktime_to_ns(ktime_get());
> local_bh_enable();
>
> pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);
>
> out:
> for (i = 0; i < NBPAGES; i++)
> kfree(addrs[i]);
>
> return 0;
> }
>
> static void __exit csum_cleanup_module(void)
> {
> return;
> }
>
>
>
>
Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
such that no interrupts (save for local ones) would occur on that cpu. Note
that I had to convert csum_partial_opt to csum_partial, as the _opt variant
doesn't exist in my tree, nor do I see it in any upstream tree or in the history
anywhere.
base results:
53569916
43506025
43476542
44048436
45048042
48550429
53925556
53927374
53489708
53003915
AVG = 492 ns
prefetching only:
53279213
45518140
49585388
53176179
44071822
43588822
44086546
47507065
53646812
54469118
AVG = 488 ns
parallel alu's only:
46226844
44458101
46803498
45060002
46187624
37542946
45632866
46275249
45031141
46281204
AVG = 449 ns
both optimizations:
45708837
45631124
45697135
45647011
45036679
39418544
44481577
46820868
44496471
35523928
AVG = 438 ns
We continue to see a small savings in execution time with prefetching (4 ns, or
about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
the best savings with both optimizations (54 ns, or 10.9%).
These results, while they've changed as we've modified the test case slightly,
have remained consistent in their speedup ordering. Prefetching helps, but
not as much as using multiple ALUs, and neither is as good as doing both
together.
Unless you see something else that I'm doing wrong here, it seems like a win to
do both.
Regards
Neil
On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
>
> Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> such that no interrupts (save for local ones), would occur on that cpu. Note
> that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> anywhere.
This csum_partial_opt() was a private implementation of csum_partial()
so that I could load the module without rebooting the kernel ;)
>
> base results:
> 53569916
> 43506025
> 43476542
> 44048436
> 45048042
> 48550429
> 53925556
> 53927374
> 53489708
> 53003915
>
> AVG = 492 ns
>
> prefetching only:
> 53279213
> 45518140
> 49585388
> 53176179
> 44071822
> 43588822
> 44086546
> 47507065
> 53646812
> 54469118
>
> AVG = 488 ns
>
>
> parallel alu's only:
> 46226844
> 44458101
> 46803498
> 45060002
> 46187624
> 37542946
> 45632866
> 46275249
> 45031141
> 46281204
>
> AVG = 449 ns
>
>
> both optimizations:
> 45708837
> 45631124
> 45697135
> 45647011
> 45036679
> 39418544
> 44481577
> 46820868
> 44496471
> 35523928
>
> AVG = 438 ns
>
>
> We continue to see a small savings in execution time with prefetching (4 ns, or
> about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> the best savings with both optimizations (54 ns, or 10.9%).
>
> These results, while they've changed as we've modified the test case slightly
> have remained consistent in their sppedup ordinality. Prefetching helps, but
> not as much as using multiple alu's, and neither is as good as doing both
> together.
>
> Unless you see something else that I'm doing wrong here. It seems like a win to
> do both.
>
Well, I only said (or maybe I forgot to say) that on my machines I got no
improvement at all with the multiple ALUs or the prefetch (I tried
different strides).
Only noise in the results.
It seems it depends on the cpu and/or multiple other factors.
Last machine I used for the tests had :
processor : 23
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
stepping : 2
microcode : 0x13
cpu MHz : 2800.256
cache size : 12288 KB
physical id : 1
siblings : 12
core id : 10
cpu cores : 6
On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
>
> >
> > Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> > such that no interrupts (save for local ones), would occur on that cpu. Note
> > that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> > doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> > anywhere.
>
> This csum_partial_opt() was a private implementation of csum_partial()
> so that I could load the module without rebooting the kernel ;)
>
> >
> > base results:
> > 53569916
> > 43506025
> > 43476542
> > 44048436
> > 45048042
> > 48550429
> > 53925556
> > 53927374
> > 53489708
> > 53003915
> >
> > AVG = 492 ns
> >
> > prefetching only:
> > 53279213
> > 45518140
> > 49585388
> > 53176179
> > 44071822
> > 43588822
> > 44086546
> > 47507065
> > 53646812
> > 54469118
> >
> > AVG = 488 ns
> >
> >
> > parallel alu's only:
> > 46226844
> > 44458101
> > 46803498
> > 45060002
> > 46187624
> > 37542946
> > 45632866
> > 46275249
> > 45031141
> > 46281204
> >
> > AVG = 449 ns
> >
> >
> > both optimizations:
> > 45708837
> > 45631124
> > 45697135
> > 45647011
> > 45036679
> > 39418544
> > 44481577
> > 46820868
> > 44496471
> > 35523928
> >
> > AVG = 438 ns
> >
> >
> > We continue to see a small savings in execution time with prefetching (4 ns, or
> > about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> > the best savings with both optimizations (54 ns, or 10.9%).
> >
> > These results, while they've changed as we've modified the test case slightly
> > have remained consistent in their sppedup ordinality. Prefetching helps, but
> > not as much as using multiple alu's, and neither is as good as doing both
> > together.
> >
> > Unless you see something else that I'm doing wrong here. It seems like a win to
> > do both.
> >
>
> Well, I only said (or maybe I forgot), that on my machines, I got no
> improvements at all with the multiple alu or the prefetch. (I tried
> different strides)
>
> Only noises in the results.
>
I thought you previously said that running netperf gave you a statistically
significant performance boost when you added prefetching:
http://marc.info/?l=linux-kernel&m=138178914124863&w=2
But perhaps I missed a note somewhere.
> It seems it depends on cpus and/or multiple factors.
>
> Last machine I used for the tests had :
>
> processor : 23
> vendor_id : GenuineIntel
> cpu family : 6
> model : 44
> model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
> stepping : 2
> microcode : 0x13
> cpu MHz : 2800.256
> cache size : 12288 KB
> physical id : 1
> siblings : 12
> core id : 10
> cpu cores : 6
>
>
>
>
Thats about what I'm running with:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
stepping : 2
microcode : 0x13
cpu MHz : 1600.000
cache size : 12288 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
I can't imagine what would cause the discrepancy in our results (a 10% savings
in execution time seems significant to me). My only thought is that possibly
the ALUs on your cpu are faster than mine, reducing the speedup obtained by
performing the additions in parallel, though I can't imagine that's the case
with these processors being so closely matched.
Neil
On Fri, Oct 18, 2013 at 10:09:54AM -0700, H. Peter Anvin wrote:
> If implemented properly adcx/adox should give additional speedup... that is the whole reason for their existence.
>
Ok, fair enough. Unfortunately, I'm not going to be able to get my hands on a
stepping of this CPU to test any code using these instructions for some time, so
I'll put their use on the back burner and revisit them later. I'm still working
on the parallel alu/prefetch angle though.
Neil
* Doug Ledford <[email protected]> wrote:
> > What I was objecting to strongly here was to measure the _wrong_
> > thing, i.e. the cache-hot case. The cache-cold case should be
> > measured in a low noise fashion, so that results are
> > representative. It's closer to the real usecase than any other
> > microbenchmark. That will give us a usable speedup figure and
> > will tell us which technique helped how much and which parameter
> > should be how large.
>
> Cold cache, yes. Low noise, yes. But you need DMA traffic at the
> same time to be truly representative.
Well, but in most usecases network DMA traffic is an order of
magnitude smaller than system bus capacity. 100 gigabit network
traffic is possible but not very common.
So I'd say that _if_ prefetching helps in the typical case we should
tune it for that - not for the bus-contended case...
> >> [...] This distance should be far enough out that it can
> >> withstand other memory pressure, yet not so far as to
> >> constantly be prefetching, tossing the result out of cache due
> >> to pressure, then fetching/stalling that same memory on load.
> >> And it may not benchmark as well on a quiescent system running
> >> only the micro-benchmark, but it should end up performing
> >> better in actual real world usage.
> >
> > The 'fully adversarial' case where all resources are maximally
> > competed for by all other cores is actually pretty rare in
> > practice. I don't say it does not happen or that it does not
> > matter, but I do say there are many other important usecases as
> > well.
> >
> > More importantly, the 'maximally adversarial' case is very hard
> > to generate, validate, and it's highly system dependent!
>
> This I agree with 100%, which is why I tend to think we should
> scrap the static prefetch optimizations entirely and have a boot
> up test that allows us to find our optimum prefetch distance for
> our given hardware.
Would be interesting to see.
I'm a bit sceptical - I think 'looking 1-2 cachelines in advance' is
something that might work reasonably well on a wide range of
systems, while trying to find a bus capacity/latency dependent sweet
spot would be difficult.
We had pretty bad experience from boot-time measurements, and it's
not for lack of trying: I implemented the raid algorithm
benchmarking thing and also the scheduler's boot time cache-size
probing, both were problematic and have hurt reproducibility and
debuggability.
Thanks,
Ingo
* Neil Horman <[email protected]> wrote:
> On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
> >
> > >
> > > Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> > > such that no interrupts (save for local ones), would occur on that cpu. Note
> > > that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> > > doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> > > anywhere.
> >
> > This csum_partial_opt() was a private implementation of csum_partial()
> > so that I could load the module without rebooting the kernel ;)
> >
> > >
> > > base results:
> > > 53569916
> > > 43506025
> > > 43476542
> > > 44048436
> > > 45048042
> > > 48550429
> > > 53925556
> > > 53927374
> > > 53489708
> > > 53003915
> > >
> > > AVG = 492 ns
> > >
> > > prefetching only:
> > > 53279213
> > > 45518140
> > > 49585388
> > > 53176179
> > > 44071822
> > > 43588822
> > > 44086546
> > > 47507065
> > > 53646812
> > > 54469118
> > >
> > > AVG = 488 ns
> > >
> > >
> > > parallel alu's only:
> > > 46226844
> > > 44458101
> > > 46803498
> > > 45060002
> > > 46187624
> > > 37542946
> > > 45632866
> > > 46275249
> > > 45031141
> > > 46281204
> > >
> > > AVG = 449 ns
> > >
> > >
> > > both optimizations:
> > > 45708837
> > > 45631124
> > > 45697135
> > > 45647011
> > > 45036679
> > > 39418544
> > > 44481577
> > > 46820868
> > > 44496471
> > > 35523928
> > >
> > > AVG = 438 ns
> > >
> > >
> > > We continue to see a small savings in execution time with prefetching (4 ns, or
> > > about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> > > the best savings with both optimizations (54 ns, or 10.9%).
> > >
> > > These results, while they've changed as we've modified the test case slightly,
> > > have remained consistent in their speedup ordinality.  Prefetching helps, but
> > > not as much as using multiple alu's, and neither is as good as doing both
> > > together.
> > >
> > > Unless you see something else that I'm doing wrong here. It seems like a win to
> > > do both.
> > >
> >
> > Well, I only said (or maybe I forgot), that on my machines, I got no
> > improvements at all with the multiple alu or the prefetch. (I tried
> > different strides)
> >
> > Only noises in the results.
> >
> I thought you previously said that running netperf gave you a statistically
> significant performance boost when you added prefetching:
> http://marc.info/?l=linux-kernel&m=138178914124863&w=2
>
> But perhaps I missed a note somewhere.
>
> > It seems it depends on cpus and/or multiple factors.
> >
> > Last machine I used for the tests had :
> >
> > processor : 23
> > vendor_id : GenuineIntel
> > cpu family : 6
> > model : 44
> > model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
> > stepping : 2
> > microcode : 0x13
> > cpu MHz : 2800.256
> > cache size : 12288 KB
> > physical id : 1
> > siblings : 12
> > core id : 10
> > cpu cores : 6
> >
> >
> >
> >
>
> That's about what I'm running with:
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 44
> model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> stepping : 2
> microcode : 0x13
> cpu MHz : 1600.000
> cache size : 12288 KB
> physical id : 0
> siblings : 8
> core id : 0
> cpu cores : 4
>
>
> I can't imagine what would cause the discrepancy in our results (a
> 10% savings in execution time seems significant to me). My only
> thought would be that possibly the alu's on your cpu are faster
> than mine, and reduce the speedup obtained by performing operations
> in parallel, though I can't imagine that's the case with these
> processors being so closely matched.
You keep ignoring my request to calculate and account for noise of
the measurement.
For example you are talking about a 0.8% prefetch effect while the
noise in the results is obviously much larger than that, with a
min/max distance of around 5%:
> > > 43476542
> > > 53927374
so the noise of 10 measurements would be around 5-10%. (back of the
envelope calculation)
So you might be right in the end, but the posted data does not
support your claims, statistically.
It's your responsibility to come up with convincing measurements and
results, not of those who review your work.
Thanks,
Ingo
On Sat, Oct 26, 2013 at 02:01:08PM +0200, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
> > >
> > > >
> > > > Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> > > > such that no interrupts (save for local ones), would occur on that cpu. Note
> > > > that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> > > > doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> > > > anywhere.
> > >
> > > This csum_partial_opt() was a private implementation of csum_partial()
> > > so that I could load the module without rebooting the kernel ;)
> > >
> > > >
> > > > base results:
> > > > 53569916
> > > > 43506025
> > > > 43476542
> > > > 44048436
> > > > 45048042
> > > > 48550429
> > > > 53925556
> > > > 53927374
> > > > 53489708
> > > > 53003915
> > > >
> > > > AVG = 492 ns
> > > >
> > > > prefetching only:
> > > > 53279213
> > > > 45518140
> > > > 49585388
> > > > 53176179
> > > > 44071822
> > > > 43588822
> > > > 44086546
> > > > 47507065
> > > > 53646812
> > > > 54469118
> > > >
> > > > AVG = 488 ns
> > > >
> > > >
> > > > parallel alu's only:
> > > > 46226844
> > > > 44458101
> > > > 46803498
> > > > 45060002
> > > > 46187624
> > > > 37542946
> > > > 45632866
> > > > 46275249
> > > > 45031141
> > > > 46281204
> > > >
> > > > AVG = 449 ns
> > > >
> > > >
> > > > both optimizations:
> > > > 45708837
> > > > 45631124
> > > > 45697135
> > > > 45647011
> > > > 45036679
> > > > 39418544
> > > > 44481577
> > > > 46820868
> > > > 44496471
> > > > 35523928
> > > >
> > > > AVG = 438 ns
> > > >
> > > >
> > > > We continue to see a small savings in execution time with prefetching (4 ns, or
> > > > about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> > > > the best savings with both optimizations (54 ns, or 10.9%).
> > > >
> > > > These results, while they've changed as we've modified the test case slightly,
> > > > have remained consistent in their speedup ordinality.  Prefetching helps, but
> > > > not as much as using multiple alu's, and neither is as good as doing both
> > > > together.
> > > >
> > > > Unless you see something else that I'm doing wrong here. It seems like a win to
> > > > do both.
> > > >
> > >
> > > Well, I only said (or maybe I forgot), that on my machines, I got no
> > > improvements at all with the multiple alu or the prefetch. (I tried
> > > different strides)
> > >
> > > Only noises in the results.
> > >
> > I thought you previously said that running netperf gave you a statistically
> > significant performance boost when you added prefetching:
> > http://marc.info/?l=linux-kernel&m=138178914124863&w=2
> >
> > But perhaps I missed a note somewhere.
> >
> > > It seems it depends on cpus and/or multiple factors.
> > >
> > > Last machine I used for the tests had :
> > >
> > > processor : 23
> > > vendor_id : GenuineIntel
> > > cpu family : 6
> > > model : 44
> > > model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
> > > stepping : 2
> > > microcode : 0x13
> > > cpu MHz : 2800.256
> > > cache size : 12288 KB
> > > physical id : 1
> > > siblings : 12
> > > core id : 10
> > > cpu cores : 6
> > >
> > >
> > >
> > >
> >
> > That's about what I'm running with:
> > processor : 0
> > vendor_id : GenuineIntel
> > cpu family : 6
> > model : 44
> > model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> > stepping : 2
> > microcode : 0x13
> > cpu MHz : 1600.000
> > cache size : 12288 KB
> > physical id : 0
> > siblings : 8
> > core id : 0
> > cpu cores : 4
> >
> >
> > I can't imagine what would cause the discrepancy in our results (a
> > 10% savings in execution time seems significant to me). My only
> > thought would be that possibly the alu's on your cpu are faster
> > than mine, and reduce the speedup obtained by performing operations
> > in parallel, though I can't imagine that's the case with these
> > processors being so closely matched.
>
> You keep ignoring my request to calculate and account for noise of
> the measurement.
>
Don't confuse "ignoring" with "haven't gotten there yet". Sometimes we all have
to wait, Ingo. I'm working on it now, but I hit a snag on the machine I'm
working with and am trying to figure it out now.
> For example you are talking about a 0.8% prefetch effect while the
> noise in the results is obviously much larger than that, with a
> min/max distance of around 5%:
>
> > > > 43476542
> > > > 53927374
>
> so the noise of 10 measurements would be around 5-10%. (back of the
> envelope calculation)
>
> So you might be right in the end, but the posted data does not
> support your claims, statistically.
>
> It's your responsibility to come up with convincing measurements and
> results, not of those who review your work.
>
Be patient, I'm getting there
Thanks
Neil
* Neil Horman <[email protected]> wrote:
> > You keep ignoring my request to calculate and account for noise of the
> > measurement.
>
> Don't confuse "ignoring" with "haven't gotten there yet". [...]
So, instead of replying to my repeated feedback with a single line mail
that you plan to address it, you repeated the same measurement mistakes
again and again, posting invalid results, and forced me to spend time to
repeat this same argument 2-3 times?
> [...] Sometimes we all have to wait, Ingo. [...]
I'm making bog standard technical requests to which you've not replied in
substance, there's no need for the patronizing tone really.
Anyway, to simplify the workflow I'm NAK-ing it all until it's done
convincingly.
NAKed-by: Ingo Molnar <[email protected]>
I'll lift the NAK once my technical concerns and questions are resolved.
Thanks,
Ingo
On Sun, Oct 27, 2013 at 08:26:32AM +0100, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > > You keep ignoring my request to calculate and account for noise of the
> > > measurement.
> >
> > Don't confuse "ignoring" with "haven't gotten there yet". [...]
>
> So, instead of replying to my repeated feedback with a single line mail
> that you plan to address it, you repeated the same measurement mistakes
> again and again, posting invalid results, and forced me to spend time to
> repeat this same argument 2-3 times?
>
No one forced you to do anything, Ingo.  I was finishing a valid line of
discussion with Eric prior to addressing your questions, while handling several
other unrelated issues (and related issues with my test system), that cropped up
in parallel.
> > [...] Sometimes we all have to wait, Ingo. [...]
>
> I'm making bog standard technical requests to which you've not replied in
> substance, there's no need for the patronizing tone really.
>
No one said they weren't easy to do, Ingo; I said I was getting to your request.
And now I am. I'll be running the tests tomorrow.
> Anyway, to simplify the workflow I'm NAK-ing it all until it's done
> convincingly.
>
> NAKed-by: Ingo Molnar <[email protected]>
>
> I'll lift the NAK once my technical concerns and questions are resolved.
>
Ok, if that helps you wait. I'll have your test results in the next day or so.
Thanks
Neil
> Thanks,
>
> Ingo
>
Ingo, et al.-
Ok, sorry for the delay, here are the test results you've been asking
for.
First, some information about what I did. I attached the module that I ran this
test with at the bottom of this email. You'll note that I started using a
module parameter write path to trigger the csum rather than the module load
path. The latter seemed to be giving me lots of variance in my run times, which
I wanted to eliminate. I attributed it to the module load mechanism itself, and
by using the parameter write path, I was able to get more consistent results.
First, the run time tests:
I ran this command:
for i in `seq 0 1 3`
do
echo $i > /sys/module/csum_test/parameters/module_test_mode
perf stat --repeat 20 --null bash -c "echo 1 > /sys/module/csum_test/parameters/test_fire"
done
The for loop allows me to change the module_test_mode, which is tied to a switch
statement in do_csum that selects which checksumming method we use
(base/prefetch/parallel alu/both); a rough sketch of that dispatch is shown below.
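Roughly, the dispatch looks like the sketch below; the enum values and the
per-variant helper names here are placeholders standing in for the hooks in the
test kernel, not the exact code:

/*
 * Illustrative sketch only: the enum values and the per-variant helpers
 * are placeholder names; the variant bodies live elsewhere.
 */
enum csum_test_mode {
	CSUM_BASE = 0,		/* existing upstream loop */
	CSUM_PREFETCH,		/* base loop plus prefetch */
	CSUM_PARALLEL_ALU,	/* two independent addition chains */
	CSUM_BOTH,		/* prefetch plus parallel ALU */
};

int csum_mode;	/* set by the test module before each run */

unsigned do_csum_base(const unsigned char *buff, unsigned len);
unsigned do_csum_prefetch(const unsigned char *buff, unsigned len);
unsigned do_csum_parallel_alu(const unsigned char *buff, unsigned len);
unsigned do_csum_both(const unsigned char *buff, unsigned len);

static unsigned do_csum(const unsigned char *buff, unsigned len)
{
	switch (csum_mode) {
	case CSUM_PREFETCH:
		return do_csum_prefetch(buff, len);
	case CSUM_PARALLEL_ALU:
		return do_csum_parallel_alu(buff, len);
	case CSUM_BOTH:
		return do_csum_both(buff, len);
	case CSUM_BASE:
	default:
		return do_csum_base(buff, len);
	}
}

With that in place, the run time results are: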
Base:
Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
0.093269042 seconds time elapsed ( +- 2.24% )
Prefetch (5x64):
Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
0.079440009 seconds time elapsed ( +- 2.29% )
Parallel ALU:
Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
0.087666677 seconds time elapsed ( +- 4.01% )
Prefetch + Parallel ALU:
Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
0.080758702 seconds time elapsed ( +- 2.34% )
So we can see here that we get about a 1% speedup between the base and the both
(Prefetch + Parallel ALU) case, with prefetch accounting for most of that
speedup.
Looking at the specific cpu counters we get this:
Base:
Total time: 0.179 [sec]
Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
1571.304618 task-clock # 5.213 CPUs utilized ( +- 0.45% )
14,423 context-switches # 0.009 M/sec ( +- 4.28% )
2,710 cpu-migrations # 0.002 M/sec ( +- 2.83% )
75,402 page-faults # 0.048 M/sec ( +- 0.07% )
1,597,349,326 cycles # 1.017 GHz ( +- 1.74% ) [40.51%]
104,882,858 stalled-cycles-frontend # 6.57% frontend cycles idle ( +- 1.25% ) [40.33%]
1,043,429,984 stalled-cycles-backend # 65.32% backend cycles idle ( +- 1.25% ) [39.73%]
868,372,132 instructions # 0.54 insns per cycle
# 1.20 stalled cycles per insn ( +- 1.43% ) [39.88%]
161,143,820 branches # 102.554 M/sec ( +- 1.49% ) [39.76%]
4,348,075 branch-misses # 2.70% of all branches ( +- 1.43% ) [39.99%]
457,042,576 L1-dcache-loads # 290.868 M/sec ( +- 1.25% ) [40.63%]
8,928,240 L1-dcache-load-misses # 1.95% of all L1-dcache hits ( +- 1.26% ) [41.17%]
15,821,051 LLC-loads # 10.069 M/sec ( +- 1.56% ) [41.20%]
4,902,576 LLC-load-misses # 30.99% of all LL-cache hits ( +- 1.51% ) [41.36%]
235,775,688 L1-icache-loads # 150.051 M/sec ( +- 1.39% ) [41.10%]
3,116,106 L1-icache-load-misses # 1.32% of all L1-icache hits ( +- 3.43% ) [40.96%]
461,315,416 dTLB-loads # 293.588 M/sec ( +- 1.43% ) [41.18%]
140,280 dTLB-load-misses # 0.03% of all dTLB cache hits ( +- 2.30% ) [40.96%]
236,127,031 iTLB-loads # 150.275 M/sec ( +- 1.63% ) [41.43%]
46,173 iTLB-load-misses # 0.02% of all iTLB cache hits ( +- 3.40% ) [41.11%]
0 L1-dcache-prefetches # 0.000 K/sec [40.82%]
0 L1-dcache-prefetch-misses # 0.000 K/sec [40.37%]
0.301414024 seconds time elapsed ( +- 0.47% )
Prefetch (5x64):
Total time: 0.172 [sec]
Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
1565.797128 task-clock # 5.238 CPUs utilized ( +- 0.46% )
13,845 context-switches # 0.009 M/sec ( +- 4.20% )
2,624 cpu-migrations # 0.002 M/sec ( +- 2.72% )
75,452 page-faults # 0.048 M/sec ( +- 0.08% )
1,642,106,355 cycles # 1.049 GHz ( +- 1.33% ) [40.17%]
107,786,666 stalled-cycles-frontend # 6.56% frontend cycles idle ( +- 1.37% ) [39.90%]
1,065,286,880 stalled-cycles-backend # 64.87% backend cycles idle ( +- 1.59% ) [39.14%]
888,815,001 instructions # 0.54 insns per cycle
# 1.20 stalled cycles per insn ( +- 1.29% ) [38.92%]
163,106,907 branches # 104.169 M/sec ( +- 1.32% ) [38.93%]
4,333,456 branch-misses # 2.66% of all branches ( +- 1.94% ) [39.77%]
459,779,806 L1-dcache-loads # 293.639 M/sec ( +- 1.60% ) [40.23%]
8,827,680 L1-dcache-load-misses # 1.92% of all L1-dcache hits ( +- 1.77% ) [41.38%]
15,556,816 LLC-loads # 9.935 M/sec ( +- 1.76% ) [41.16%]
4,885,618 LLC-load-misses # 31.40% of all LL-cache hits ( +- 1.40% ) [40.84%]
236,131,778 L1-icache-loads # 150.806 M/sec ( +- 1.32% ) [40.59%]
3,037,537 L1-icache-load-misses # 1.29% of all L1-icache hits ( +- 2.23% ) [41.13%]
454,835,028 dTLB-loads # 290.481 M/sec ( +- 1.23% ) [41.34%]
139,907 dTLB-load-misses # 0.03% of all dTLB cache hits ( +- 2.18% ) [41.21%]
236,357,655 iTLB-loads # 150.950 M/sec ( +- 1.31% ) [41.29%]
46,633 iTLB-load-misses # 0.02% of all iTLB cache hits ( +- 2.74% ) [40.67%]
0 L1-dcache-prefetches # 0.000 K/sec [40.16%]
0 L1-dcache-prefetch-misses # 0.000 K/sec [40.09%]
0.298948767 seconds time elapsed ( +- 0.36% )
Here it appears everything between the two runs is about the same. We reduced
the number of dcache misses by a small amount (0.03 percentage points), which is
nice, but I'm not sure that would account for the speedup we see in the run time.
Parallel ALU:
Total time: 0.182 [sec]
Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
1553.544876 task-clock # 5.217 CPUs utilized ( +- 0.42% )
14,066 context-switches # 0.009 M/sec ( +- 6.24% )
2,831 cpu-migrations # 0.002 M/sec ( +- 3.33% )
75,432 page-faults # 0.049 M/sec ( +- 0.08% )
1,659,509,743 cycles # 1.068 GHz ( +- 1.27% ) [40.10%]
106,466,680 stalled-cycles-frontend # 6.42% frontend cycles idle ( +- 1.50% ) [39.98%]
1,035,481,957 stalled-cycles-backend # 62.40% backend cycles idle ( +- 1.23% ) [39.38%]
875,104,201 instructions # 0.53 insns per cycle
# 1.18 stalled cycles per insn ( +- 1.30% ) [38.66%]
160,553,275 branches # 103.346 M/sec ( +- 1.32% ) [38.85%]
4,329,119 branch-misses # 2.70% of all branches ( +- 1.39% ) [39.59%]
448,195,116 L1-dcache-loads # 288.498 M/sec ( +- 1.91% ) [41.07%]
8,632,347 L1-dcache-load-misses # 1.93% of all L1-dcache hits ( +- 1.90% ) [41.56%]
15,143,145 LLC-loads # 9.747 M/sec ( +- 1.89% ) [41.05%]
4,698,204 LLC-load-misses # 31.03% of all LL-cache hits ( +- 1.03% ) [41.23%]
224,316,468 L1-icache-loads # 144.390 M/sec ( +- 1.27% ) [41.39%]
2,902,842 L1-icache-load-misses # 1.29% of all L1-icache hits ( +- 2.65% ) [42.60%]
433,914,588 dTLB-loads # 279.306 M/sec ( +- 1.75% ) [43.07%]
132,090 dTLB-load-misses # 0.03% of all dTLB cache hits ( +- 2.15% ) [43.12%]
230,701,361 iTLB-loads # 148.500 M/sec ( +- 1.77% ) [43.47%]
45,562 iTLB-load-misses # 0.02% of all iTLB cache hits ( +- 3.76% ) [42.88%]
0 L1-dcache-prefetches # 0.000 K/sec [42.29%]
0 L1-dcache-prefetch-misses # 0.000 K/sec [41.32%]
0.297758185 seconds time elapsed ( +- 0.40% )
Here it seems the major advantage was backend stall cycles saved (which makes
sense to me). Since we split the instruction path into two units that could run
independently of each other we spent less time waiting for prior instructions to
retire. As a result we dropped two percentage points in our stall number.
Prefetch + Parallel ALU:
Total time: 0.182 [sec]
Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
1549.171283 task-clock # 5.231 CPUs utilized ( +- 0.50% )
13,717 context-switches # 0.009 M/sec ( +- 4.32% )
2,721 cpu-migrations # 0.002 M/sec ( +- 2.47% )
75,432 page-faults # 0.049 M/sec ( +- 0.07% )
1,579,140,244 cycles # 1.019 GHz ( +- 1.71% ) [40.06%]
103,803,034 stalled-cycles-frontend # 6.57% frontend cycles idle ( +- 1.74% ) [39.60%]
1,016,582,613 stalled-cycles-backend # 64.38% backend cycles idle ( +- 1.79% ) [39.57%]
881,036,653 instructions # 0.56 insns per cycle
# 1.15 stalled cycles per insn ( +- 1.61% ) [39.29%]
164,333,010 branches # 106.078 M/sec ( +- 1.51% ) [39.38%]
4,385,459 branch-misses # 2.67% of all branches ( +- 1.62% ) [40.29%]
463,987,526 L1-dcache-loads # 299.507 M/sec ( +- 1.52% ) [40.20%]
8,739,535 L1-dcache-load-misses # 1.88% of all L1-dcache hits ( +- 1.95% ) [40.37%]
15,318,497 LLC-loads # 9.888 M/sec ( +- 1.80% ) [40.43%]
4,846,148 LLC-load-misses # 31.64% of all LL-cache hits ( +- 1.68% ) [40.59%]
231,982,874 L1-icache-loads # 149.746 M/sec ( +- 1.43% ) [41.25%]
3,141,106 L1-icache-load-misses # 1.35% of all L1-icache hits ( +- 2.32% ) [41.76%]
459,688,615 dTLB-loads # 296.732 M/sec ( +- 1.75% ) [41.87%]
138,667 dTLB-load-misses # 0.03% of all dTLB cache hits ( +- 1.97% ) [42.31%]
235,629,204 iTLB-loads # 152.100 M/sec ( +- 1.40% ) [42.04%]
46,038 iTLB-load-misses # 0.02% of all iTLB cache hits ( +- 2.75% ) [41.20%]
0 L1-dcache-prefetches # 0.000 K/sec [40.77%]
0 L1-dcache-prefetch-misses # 0.000 K/sec [40.27%]
0.296173305 seconds time elapsed ( +- 0.44% )
Here, with both optimizations, we've reduced both our backend stall cycles and
our dcache miss rate (though our load misses here are higher than they were when
we were just doing parallel ALU execution).  I wonder if the separation of the
adcx path is leading to multiple load requests before the prefetch completes.
I'll try messing with the stride a bit more to see if I can get some more
insight there.
So there you have it.  I think, looking at this, I can say that it's not as big a
win as my initial measurements were indicating, but still a win.
Thoughts?
Regards
Neil
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>
#include <linux/slab.h>		/* kmalloc_node/kfree */
#include <linux/random.h>	/* prandom_u32 */
#include <net/checksum.h>	/* csum_partial */

#define BUFSIZ (2*1024*1024)
#define NBPAGES 16

/* hook added to do_csum() to select which checksum variant runs */
extern int csum_mode;

int module_test_mode = 0;
int test_fire = 0;

static int __init csum_init_module(void)
{
	return 0;
}

static void __exit csum_cleanup_module(void)
{
	return;
}

/*
 * Writing to the test_fire parameter runs 100000 csum_partial() calls at
 * random even offsets within a set of large buffers, using the variant
 * selected by module_test_mode.
 */
static int set_param_str(const char *val, const struct kernel_param *kp)
{
	int i;
	__wsum sum = 0;
	/*u64 start, end;*/
	void *base, *addrs[NBPAGES];
	u32 rnd, offset;

	memset(addrs, 0, sizeof(addrs));
	for (i = 0; i < NBPAGES; i++) {
		addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
		if (!addrs[i])
			goto out;
	}

	csum_mode = module_test_mode;
	local_bh_disable();
	/*pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());*/
	/*start = ktime_to_ns(ktime_get());*/
	for (i = 0; i < 100000; i++) {
		rnd = prandom_u32();
		base = addrs[rnd % NBPAGES];
		rnd /= NBPAGES;
		offset = rnd % (BUFSIZ - 1500);
		offset &= ~1U;
		sum = csum_partial(base + offset, 1500, sum);
	}
	/*end = ktime_to_ns(ktime_get());*/
	local_bh_enable();
	/*pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);*/
	csum_mode = 0;
out:
	for (i = 0; i < NBPAGES; i++)
		kfree(addrs[i]);
	return 0;
}

static int get_param_str(char *buffer, const struct kernel_param *kp)
{
	return sprintf(buffer, "%d\n", test_fire);
}

static struct kernel_param_ops param_ops_str = {
	.set = set_param_str,
	.get = get_param_str,
};

module_param_named(module_test_mode, module_test_mode, int, 0644);
MODULE_PARM_DESC(module_test_mode, "csum test mode");
module_param_cb(test_fire, &param_ops_str, &test_fire, 0644);

module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");
* Neil Horman <[email protected]> wrote:
> Base:
> 0.093269042 seconds time elapsed ( +- 2.24% )
> Prefetch (5x64):
> 0.079440009 seconds time elapsed ( +- 2.29% )
> Parallel ALU:
> 0.087666677 seconds time elapsed ( +- 4.01% )
> Prefetch + Parallel ALU:
> 0.080758702 seconds time elapsed ( +- 2.34% )
>
> So we can see here that we get about a 1% speedup between the base
> and the both (Prefetch + Parallel ALU) case, with prefetch
> accounting for most of that speedup.
Hm, there's still something strange about these results. So the
range of the results is 790-930 nsecs. The noise of the measurements
is 2%-4%, i.e. 20-40 nsecs.
The prefetch-only result itself is the fastest of all -
statistically equivalent to the prefetch+parallel-ALU result, within
the noise range.
So if prefetch is enabled, turning on parallel-ALU has no measurable
effect - which is counter-intuitive. Do you have a
theory/explanation for that?
Thanks,
Ingo
* Neil Horman <[email protected]> wrote:
> Looking at the specific cpu counters we get this:
>
> Base:
> Total time: 0.179 [sec]
>
> Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
>
> 1571.304618 task-clock # 5.213 CPUs utilized ( +- 0.45% )
> 14,423 context-switches # 0.009 M/sec ( +- 4.28% )
> 2,710 cpu-migrations # 0.002 M/sec ( +- 2.83% )
Hm, for this second round of measurements were you using 'perf stat
-a -C ...'?
The most accurate method of measurement for such single-threaded
workloads is something like:
taskset 0x1 perf stat -a -C 1 --repeat 20 ...
this will bind your workload to CPU#0, and will do PMU measurements
only there - without mixing in other CPUs or workloads.
Thanks,
Ingo
On 10/28/13 10:24 AM, Ingo Molnar wrote:
> The most accurate method of measurement for such single-threaded
> workloads is something like:
>
> taskset 0x1 perf stat -a -C 1 --repeat 20 ...
>
> this will bind your workload to CPU#0, and will do PMU measurements
> only there - without mixing in other CPUs or workloads.
you can drop the -a if you only want a specific CPU (-C arg). And -C in
perf is cpu number starting with 0, so in your example above -C 1 means
cpu1, not cpu0.
David
On 10/26/2013 07:55 AM, Ingo Molnar wrote:
>
> * Doug Ledford <[email protected]> wrote:
>
>>> What I was objecting to strongly here was to measure the _wrong_
>>> thing, i.e. the cache-hot case. The cache-cold case should be
>>> measured in a low noise fashion, so that results are
>>> representative. It's closer to the real usecase than any other
>>> microbenchmark. That will give us a usable speedup figure and
>>> will tell us which technique helped how much and which parameter
>>> should be how large.
>>
>> Cold cache, yes. Low noise, yes. But you need DMA traffic at the
>> same time to be truly representative.
>
> Well, but in most usecases network DMA traffic is an order of
> magnitude smaller than system bus capacity. 100 gigabit network
> traffic is possible but not very common.
That's not necessarily true. For gigabit, that's true. For something
faster, even just 10GigE, it's not. At least not when you consider that
network traffic usually involves hitting the bus at least two times, and
up to four times, depending on how it's processed on receive and whether
it goes cold from cache between accesses (once for the DMA from card to
memory, once for csum_partial so we know if the packet was good, and a
third time in copy_to_user so the user application can do something with
it, and possibly a fourth time if the user space application does
something with it).
> So I'd say that _if_ prefetching helps in the typical case we should
> tune it for that - not for the bus-contended case...
Well, I've been running a lot of tests here on various optimizations.
Some have helped, some not so much. But I haven't been doing
micro-benchmarks like Neil. I've been focused on running netperf over
IPoIB interfaces. That should at least mimic real use somewhat and be
likely more indicative of what the change will do to the system as a
whole than a micro-benchmark will.
I have a number of test systems, and they have a matrix of three
combinations of InfiniBand link speed and PCI-e bus speed that change
the theoretical max for each system.
For the 40GBit/s InfiniBand, the theoretical max throughput is 4GByte/s
(10/8bit wire encoding, not bothering to account for headers and such).
For the 56GBit/s InfiniBand, the theoretical max throughput is ~7GByte/s
(66/64 bit wire encoding).
For the PCI-e gen2 system, the PCI-e theoretical limit is 40GBit/s, for
the PCI-e gen3 systems the PCI-e theoretical limit is 64GBit/s. However,
with a max PCI-e payload of 128 bytes, the PCI-e gen2 bus will
definitely be a bottleneck before the 56GBit/s InfiniBand link. The
PCI-e gen3 busses are probably right on par with a 56GBit/s InfiniBand
link in terms of max possible throughput.
Here are my test systems:
A - 2 Dell PowerEdge R415 AMD based servers, dual quad core processors
at 2.6GHz, 2MB L2, 5MB L3 cache, 32GB DDR3 1333 RAM, 56GBit/s InfiniBand
link speed on a card in a PCI-e Gen2 slot. Results of base performance
bandwidth test:
[root@rdma-dev-00 ~]# qperf -t 15 ib0-dev-01 rc_bw rc_bi_bw
rc_bw:
bw = 2.93 GB/sec
rc_bi_bw:
bw = 5.5 GB/sec
B - 2 HP DL320e Gen8 servers, single Intel quad core Intel(R) Xeon(R)
CPU E3-1240 V2 @ 3.40GHz, 8GB DDR3 1600 RAM, card in PCI-e Gen3 slot
(8GT/s x8 active config). Results of base performance bandwidth test
(40GBit/s link):
[root@rdma-qe-10 ~]# qperf -t 15 ib1-qe-11 rc_bw rc_bi_bw
rc_bw:
bw = 3.55 GB/sec
rc_bi_bw:
bw = 6.75 GB/sec
C - 2 HP DL360p Gen8 servers, dual Intel 8-core Intel(R) Xeon(R) CPU
E5-2660 0 @ 2.20GHz, 32GB DDR3 1333 RAM, card in PCI-e Gen3 slot (8GT/s
x8 active config). Results of base performance bandwidth test (56GBit/s
link):
[root@rdma-perf-00 ~]# qperf -t 15 ib0-perf-01 rc_bw rc_bi_bw
rc_bw:
bw = 5.87 GB/sec
rc_bi_bw:
bw = 12.3 GB/sec
Some of my preliminary results:
1) Regarding the initial claim that changing the code to have two
addition chains, allowing the use of two ALUs, doubling performance: I'm
just not seeing it. I have a number of theories about this, but they
are dependent on point #2 below:
2) Prefetch definitely helped, although how much depends on which of the
test setups I was using above. The biggest gainer was B) the E3-1240 V2
@ 3.40GHz based machines.
So, my theories about #1 are that, with modern CPUs, it's more our
load/store speed that is killing us than the ALU speed. I tried at
least 5 distinctly different ALU algorithms, including one that
eliminated the use of the carry chain entirely, and none of them had a
noticeable effect. On the other hand, prefetch always had a noticeable
effect. I suspect the original patch worked and had a performance
benefit some time ago due to a quirk on some CPU common back then, but
modern CPUs are capable of optimizing the routine well enough that the
benefit of the patch is already in our original csum routine due to CPU
optimizations. Or maybe there is another explanation, but I'm not
really looking too hard for it.
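For illustration, a carry-free accumulation along those lines can be sketched in
plain C as below.  This is only the shape of the idea (sum 32-bit words into a
64-bit accumulator, fold at the end, per RFC 1071), not any of the actual
variants that were benchmarked, and the names are made up:

#include <stdint.h>
#include <stddef.h>

/* Fold a 64-bit accumulator down to the 16-bit ones' complement sum,
 * adding the carries back in at each step. */
static uint16_t csum_fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffffULL) + (sum >> 16);
	sum = (sum & 0xffffULL) + (sum >> 16);
	return (uint16_t)sum;
}

/* Carry-free accumulation: each 32-bit addend lands in a 64-bit
 * accumulator, so the loop itself never needs an add-with-carry.
 * Odd lengths and alignment handling are omitted for brevity. */
static uint16_t csum_nocarry(const uint32_t *buf, size_t nwords)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < nwords; i++)
		sum += buf[i];

	return csum_fold64(sum);
}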
I also tried two different prefetch methods on the theory that memory
access cycles are more important than CPU access cycles, and there
appears to be a minor benefit to wasting CPU cycles to prevent
unnecessary prefetches, even with 65520 as our MTU where a 320 byte
excess prefetch at the end of the operation only caused us to load a few
% points of extra memory. I suspect that if I dropped the MTU down to
9K (to mimic jumbo frames on a device without tx/rx checksum offloads),
the smart version of prefetch would be a much bigger winner. The fact
that there is any apparent difference at all on such a large copy tells
me that prefetch should probably always be smart and never dumb (and
here by smart versus dumb I mean prefetch should check to make sure you
aren't prefetching beyond the end of data you care about before
executing the prefetch instruction).
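A rough sketch of that smart-versus-dumb distinction follows, with made-up names
and the 5x64 stride from the earlier tests; it only illustrates the bounds check
in front of the prefetch, not the real csum loop:

#include <stdint.h>
#include <stddef.h>

#define PREFETCH_STRIDE	(5 * 64)	/* 5 cachelines, matching the 5x64 runs */

/* "Smart" prefetch: only issue the prefetch while its target still lies
 * inside the buffer being summed, so we never pull in memory past the
 * end of the data we care about. */
static uint64_t sum_with_smart_prefetch(const unsigned char *buf, size_t len)
{
	const unsigned char *end = buf + len;
	uint64_t sum = 0;
	size_t i;

	/* main loop over whole cachelines; tail bytes and the final fold
	 * are omitted for brevity */
	while (buf + 64 <= end) {
		if (buf + PREFETCH_STRIDE < end)
			__builtin_prefetch(buf + PREFETCH_STRIDE);

		for (i = 0; i < 64; i += 4)
			sum += *(const uint32_t *)(buf + i);
		buf += 64;
	}
	return sum;
}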
What I've found probably warrants more experimentation on the optimum
prefetch methods. I also have another idea on speeding up the ALU
operations that I want to try. So I'm not ready to send off everything
I have yet (and people wouldn't want that anyway, my collected data set
is megabytes in size). But just to demonstrate some of what I'm seeing
here (notes: Recv CPU% of 12.5% is one CPU core pegged to 100% usage for
the A and B systems, for the C systems 3.125% is 100% usage for one CPU
core. Also, although not so apparent on the AMD CPUs, the odd runs are
all with perf record, the even runs are with perf stat, and perf record
causes the odd runs to generally have a lower throughput (and this
effect is *huge* on the Intel 8 core CPUs, fully cutting throughput in
half on those systems)):
For the A systems:
Stock kernel:
Utilization Service Demand
Send Recv Send Recv
Throughput local remote local remote
MBytes /s % S % S us/KB us/KB
1082.47 3.69 12.55 0.266 0.906
1087.64 3.46 12.52 0.249 0.899
1104.43 3.52 12.53 0.249 0.886
1090.37 3.68 12.51 0.264 0.897
1078.73 3.13 12.56 0.227 0.910
1091.88 3.63 12.52 0.259 0.896
With ALU patch:
Utilization Service Demand
Send Recv Send Recv
Throughput local remote local remote
MBytes /s % S % S us/KB us/KB
1075.01 3.70 12.53 0.269 0.911
1116.90 3.86 12.53 0.270 0.876
1073.40 3.67 12.54 0.267 0.913
1092.79 3.83 12.52 0.274 0.895
1108.69 2.98 12.56 0.210 0.885
1116.76 2.66 12.51 0.186 0.875
With ALU patch + 5*64 smart prefetch:
Utilization Service Demand
Send Recv Send Recv
Throughput local remote local remote
MBytes /s % S % S us/KB us/KB
1243.05 4.63 12.60 0.291 0.792
1194.70 5.80 12.58 0.380 0.822
1149.15 4.09 12.57 0.278 0.854
1207.21 5.69 12.53 0.368 0.811
1204.07 4.27 12.57 0.277 0.816
1191.04 4.78 12.60 0.313 0.826
For the B systems:
Stock kernel:
Utilization Service Demand
Send Recv Send Recv
Throughput local remote local remote
MBytes /s % S % S us/KB us/KB
2778.98 7.75 12.34 0.218 0.347
2819.14 7.31 12.52 0.203 0.347
2721.43 8.43 12.19 0.242 0.350
2832.93 7.38 12.58 0.203 0.347
2770.07 8.01 12.27 0.226 0.346
2829.17 7.27 12.51 0.201 0.345
With ALU patch:
Utilization Service Demand
Send Recv Send Recv
Throughput local remote local remote
MBytes /s % S % S us/KB us/KB
2801.36 8.18 11.97 0.228 0.334
2927.81 7.52 12.51 0.201 0.334
2808.32 8.62 11.98 0.240 0.333
2918.12 7.20 12.54 0.193 0.336
2730.00 8.85 11.60 0.253 0.332
2932.17 7.37 12.51 0.196 0.333
With ALU patch + 5*64 smart prefetch:
Utilization Service Demand
Send Recv Send Recv
Throughput local remote local remote
MBytes /s % S % S us/KB us/KB
3029.53 9.34 10.67 0.241 0.275
3229.36 7.81 11.65 0.189 0.282 <- this is a saturated
40GBit/s InfiniBand link,
and the recv CPU is no longer
pegged at 100%, so the gains
here are higher than just the
throughput gains suggest
3161.14 8.24 11.10 0.204 0.274
3171.78 7.80 11.89 0.192 0.293
3134.01 8.35 10.99 0.208 0.274
3235.50 7.75 11.57 0.187 0.279 <- ditto here
For the C systems:
Stock kernel:
Utilization Service Demand
Send Recv Send Recv
Throughput local remote local remote
MBytes /s % S % S us/KB us/KB
1091.03 1.59 3.14 0.454 0.900
2299.34 2.57 3.07 0.350 0.417
1177.07 1.71 3.15 0.455 0.838
2312.59 2.54 3.02 0.344 0.408
1273.94 2.03 3.15 0.499 0.772
2591.50 2.76 3.19 0.332 0.385
With ALU patch:
Utilization Service Demand
Send Recv Send Recv
Throughput local remote local remote
MBytes /s % S % S us/KB us/KB
Data for this series is missing (these machines were added to
the matrix late and this kernel had already been rebuilt to
something else and was no longer installable...I could recreate
this if people really care).
With ALU patch + 5*64 smart prefetch:
Utilization Service Demand
Send Recv Send Recv
Throughput local remote local remote
MBytes /s % S % S us/KB us/KB
1377.03 2.05 3.13 0.466 0.711
2002.30 2.40 3.04 0.374 0.474
1470.18 2.25 3.13 0.479 0.666
1994.96 2.44 3.08 0.382 0.482
1167.82 1.72 3.14 0.461 0.840
2004.49 2.46 3.06 0.384 0.477
What strikes me as important here is that these 8 core Intel CPUs
actually got *slower* with the ALU patch + prefetch. This warrants more
investigation to find out if it's the prefetch or the ALU patch that did
the damage to the speed. It's also worth noting that these 8 core CPUs
have such high variability that I don't trust these measurements yet.
>>> More importantly, the 'maximally adversarial' case is very hard
>>> to generate, validate, and it's highly system dependent!
>>
>> This I agree with 100%, which is why I tend to think we should
>> scrap the static prefetch optimizations entirely and have a boot
>> up test that allows us to find our optimum prefetch distance for
>> our given hardware.
>
> Would be interesting to see.
>
> I'm a bit sceptical - I think 'looking 1-2 cachelines in advance' is
> something that might work reasonably well on a wide range of
> systems, while trying to find a bus capacity/latency dependent sweet
> spot would be difficult.
I think 1-2 cachelines is probably way too short. Measuring the length
of time that we stall when accessing memory for the first time and then
comparing that to operation cycles for typical instruction chains would
give us more insight I think. That or just tinkering with numbers and
seeing where things work best (but not just on static tests, under a
variety of workloads).
> We had pretty bad experience from boot-time measurements, and it's
> not for lack of trying: I implemented the raid algorithm
> benchmarking thing and also the scheduler's boot time cache-size
> probing, both were problematic and have hurt reproducability and
> debuggability.
OK, that's it from me for now, off to run more tests and try more things...
On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > Looking at the specific cpu counters we get this:
> >
> > Base:
> > Total time: 0.179 [sec]
> >
> > Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> >
> > 1571.304618 task-clock # 5.213 CPUs utilized ( +- 0.45% )
> > 14,423 context-switches # 0.009 M/sec ( +- 4.28% )
> > 2,710 cpu-migrations # 0.002 M/sec ( +- 2.83% )
>
> Hm, for these second round of measurements were you using 'perf stat
> -a -C ...'?
>
> The most accurate method of measurement for such single-threaded
> workloads is something like:
>
> taskset 0x1 perf stat -a -C 1 --repeat 20 ...
>
> this will bind your workload to CPU#0, and will do PMU measurements
> only there - without mixing in other CPUs or workloads.
>
> Thanks,
>
> Ingo
I wasn't, but I will...
Neil
On Mon, Oct 28, 2013 at 05:20:45PM +0100, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > Base:
> > 0.093269042 seconds time elapsed ( +- 2.24% )
> > Prefetch (5x64):
> > 0.079440009 seconds time elapsed ( +- 2.29% )
> > Parallel ALU:
> > 0.087666677 seconds time elapsed ( +- 4.01% )
> > Prefetch + Parallel ALU:
> > 0.080758702 seconds time elapsed ( +- 2.34% )
> >
> > So we can see here that we get about a 1% speedup between the base
> > and the both (Prefetch + Parallel ALU) case, with prefetch
> > accounting for most of that speedup.
>
> Hm, there's still something strange about these results. So the
> range of the results is 790-930 nsecs. The noise of the measurements
> is 2%-4%, i.e. 20-40 nsecs.
>
> The prefetch-only result itself is the fastest of all -
> statistically equivalent to the prefetch+parallel-ALU result, within
> the noise range.
>
> So if prefetch is enabled, turning on parallel-ALU has no measurable
> effect - which is counter-intuitive. Do you have an
> theory/explanation for that?
>
> Thanks,
I mentioned it farther down, loosely theorizing that running with parallel
alu's in conjunction with a prefetch puts more pressure on the load/store unit,
causing stalls while both alu's wait for the L1 cache to fill.  Not sure if that
makes sense, but I did note that in the both (prefetch+alu) case our data cache
hit rate was somewhat degraded, so I was going to play with the prefetch stride
to see if that fixed the situation.  Regardless, I agree, the lack of improvement
in the both case is definitely counter-intuitive.
Neil
>
> Ingo
On Mon, Oct 28, 2013 at 01:46:30PM -0400, Neil Horman wrote:
> On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote:
> >
> > * Neil Horman <[email protected]> wrote:
> >
> > > Looking at the specific cpu counters we get this:
> > >
> > > Base:
> > > Total time: 0.179 [sec]
> > >
> > > Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> > >
> > > 1571.304618 task-clock # 5.213 CPUs utilized ( +- 0.45% )
> > > 14,423 context-switches # 0.009 M/sec ( +- 4.28% )
> > > 2,710 cpu-migrations # 0.002 M/sec ( +- 2.83% )
> >
> > Hm, for these second round of measurements were you using 'perf stat
> > -a -C ...'?
> >
> > The most accurate method of measurement for such single-threaded
> > workloads is something like:
> >
> > taskset 0x1 perf stat -a -C 1 --repeat 20 ...
> >
> > this will bind your workload to CPU#0, and will do PMU measurements
> > only there - without mixing in other CPUs or workloads.
> >
> > Thanks,
> >
> > Ingo
> I wasn't, but I will...
> Neil
>
> > --
Here's my data for running the same test with taskset restricting execution to
only cpu0.  I'm not quite sure what's going on here, but doing so resulted in a
10x slowdown of the runtime of each iteration, which I can't explain.  As before,
however, both the parallel alu run and the prefetch run resulted in speedups,
but the two together were not in any way additive.  I'm going to keep playing
with the prefetch stride, unless you have an alternate theory.
Regards
Neil
Base:
Total time: 1.013 [sec]
Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
1140.286043 task-clock # 1.001 CPUs utilized ( +- 0.65% ) [100.00%]
48,779 context-switches # 0.043 M/sec ( +- 10.08% ) [100.00%]
0 cpu-migrations # 0.000 K/sec [100.00%]
75,398 page-faults # 0.066 M/sec ( +- 0.05% )
2,950,225,491 cycles # 2.587 GHz ( +- 0.65% ) [16.63%]
263,349,439 stalled-cycles-frontend # 8.93% frontend cycles idle ( +- 1.87% ) [16.70%]
1,615,723,017 stalled-cycles-backend # 54.77% backend cycles idle ( +- 0.64% ) [16.76%]
2,168,440,946 instructions # 0.74 insns per cycle
# 0.75 stalled cycles per insn ( +- 0.52% ) [16.76%]
406,885,149 branches # 356.827 M/sec ( +- 0.61% ) [16.74%]
10,099,789 branch-misses # 2.48% of all branches ( +- 0.73% ) [16.73%]
1,138,829,982 L1-dcache-loads # 998.723 M/sec ( +- 0.57% ) [16.71%]
21,341,094 L1-dcache-load-misses # 1.87% of all L1-dcache hits ( +- 1.22% ) [16.69%]
38,453,870 LLC-loads # 33.723 M/sec ( +- 1.46% ) [16.67%]
9,587,987 LLC-load-misses # 24.93% of all LL-cache hits ( +- 0.48% ) [16.66%]
566,241,820 L1-icache-loads # 496.579 M/sec ( +- 0.70% ) [16.65%]
9,061,979 L1-icache-load-misses # 1.60% of all L1-icache hits ( +- 3.39% ) [16.65%]
1,130,620,555 dTLB-loads # 991.524 M/sec ( +- 0.64% ) [16.64%]
423,302 dTLB-load-misses # 0.04% of all dTLB cache hits ( +- 4.89% ) [16.63%]
563,371,089 iTLB-loads # 494.061 M/sec ( +- 0.62% ) [16.62%]
215,406 iTLB-load-misses # 0.04% of all iTLB cache hits ( +- 6.97% ) [16.60%]
0 L1-dcache-prefetches # 0.000 K/sec [16.59%]
0 L1-dcache-prefetch-misses # 0.000 K/sec [16.58%]
1.139598762 seconds time elapsed ( +- 0.65% )
Prefetch:
Total time: 0.981 [sec]
Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
1128.603117 task-clock # 1.001 CPUs utilized ( +- 0.66% ) [100.00%]
45,992 context-switches # 0.041 M/sec ( +- 9.47% ) [100.00%]
0 cpu-migrations # 0.000 K/sec [100.00%]
75,428 page-faults # 0.067 M/sec ( +- 0.06% )
2,920,666,228 cycles # 2.588 GHz ( +- 0.66% ) [16.59%]
255,998,006 stalled-cycles-frontend # 8.77% frontend cycles idle ( +- 1.78% ) [16.67%]
1,601,090,475 stalled-cycles-backend # 54.82% backend cycles idle ( +- 0.69% ) [16.75%]
2,164,301,312 instructions # 0.74 insns per cycle
# 0.74 stalled cycles per insn ( +- 0.59% ) [16.78%]
404,920,928 branches # 358.781 M/sec ( +- 0.54% ) [16.77%]
10,025,146 branch-misses # 2.48% of all branches ( +- 0.66% ) [16.75%]
1,133,764,674 L1-dcache-loads # 1004.573 M/sec ( +- 0.47% ) [16.74%]
21,251,432 L1-dcache-load-misses # 1.87% of all L1-dcache hits ( +- 1.01% ) [16.72%]
38,006,432 LLC-loads # 33.676 M/sec ( +- 1.56% ) [16.70%]
9,625,034 LLC-load-misses # 25.32% of all LL-cache hits ( +- 0.40% ) [16.68%]
565,712,289 L1-icache-loads # 501.250 M/sec ( +- 0.57% ) [16.66%]
8,726,826 L1-icache-load-misses # 1.54% of all L1-icache hits ( +- 3.40% ) [16.64%]
1,130,140,463 dTLB-loads # 1001.362 M/sec ( +- 0.53% ) [16.63%]
419,645 dTLB-load-misses # 0.04% of all dTLB cache hits ( +- 4.44% ) [16.62%]
560,199,307 iTLB-loads # 496.365 M/sec ( +- 0.51% ) [16.61%]
213,413 iTLB-load-misses # 0.04% of all iTLB cache hits ( +- 6.65% ) [16.59%]
0 L1-dcache-prefetches # 0.000 K/sec [16.56%]
0 L1-dcache-prefetch-misses # 0.000 K/sec [16.54%]
1.127934534 seconds time elapsed ( +- 0.66% )
Parallel ALU:
Total time: 0.986 [sec]
Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
1131.914738 task-clock # 1.001 CPUs utilized ( +- 0.49% ) [100.00%]
40,807 context-switches # 0.036 M/sec ( +- 10.72% ) [100.00%]
0 cpu-migrations # 0.000 K/sec ( +-100.00% ) [100.00%]
75,329 page-faults # 0.067 M/sec ( +- 0.04% )
2,929,149,996 cycles # 2.588 GHz ( +- 0.49% ) [16.58%]
250,428,558 stalled-cycles-frontend # 8.55% frontend cycles idle ( +- 1.75% ) [16.66%]
1,621,074,968 stalled-cycles-backend # 55.34% backend cycles idle ( +- 0.46% ) [16.73%]
2,147,405,781 instructions # 0.73 insns per cycle
# 0.75 stalled cycles per insn ( +- 0.56% ) [16.77%]
401,196,771 branches # 354.441 M/sec ( +- 0.58% ) [16.76%]
9,941,701 branch-misses # 2.48% of all branches ( +- 0.67% ) [16.74%]
1,126,651,774 L1-dcache-loads # 995.350 M/sec ( +- 0.50% ) [16.73%]
21,075,294 L1-dcache-load-misses # 1.87% of all L1-dcache hits ( +- 0.96% ) [16.72%]
37,885,850 LLC-loads # 33.471 M/sec ( +- 1.10% ) [16.71%]
9,729,116 LLC-load-misses # 25.68% of all LL-cache hits ( +- 0.62% ) [16.69%]
562,058,495 L1-icache-loads # 496.556 M/sec ( +- 0.54% ) [16.67%]
8,617,450 L1-icache-load-misses # 1.53% of all L1-icache hits ( +- 3.06% ) [16.65%]
1,121,765,737 dTLB-loads # 991.034 M/sec ( +- 0.57% ) [16.63%]
388,875 dTLB-load-misses # 0.03% of all dTLB cache hits ( +- 4.27% ) [16.62%]
556,029,393 iTLB-loads # 491.229 M/sec ( +- 0.64% ) [16.61%]
189,181 iTLB-load-misses # 0.03% of all iTLB cache hits ( +- 6.98% ) [16.60%]
0 L1-dcache-prefetches # 0.000 K/sec [16.58%]
0 L1-dcache-prefetch-misses # 0.000 K/sec [16.56%]
1.131247174 seconds time elapsed ( +- 0.49% )
Both:
Total time: 0.993 [sec]
Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
1130.912197 task-clock # 1.001 CPUs utilized ( +- 0.60% ) [100.00%]
45,859 context-switches # 0.041 M/sec ( +- 9.00% ) [100.00%]
0 cpu-migrations # 0.000 K/sec [100.00%]
75,398 page-faults # 0.067 M/sec ( +- 0.07% )
2,926,527,048 cycles # 2.588 GHz ( +- 0.60% ) [16.60%]
255,482,254 stalled-cycles-frontend # 8.73% frontend cycles idle ( +- 1.62% ) [16.67%]
1,608,247,364 stalled-cycles-backend # 54.95% backend cycles idle ( +- 0.73% ) [16.74%]
2,162,135,903 instructions # 0.74 insns per cycle
# 0.74 stalled cycles per insn ( +- 0.46% ) [16.77%]
403,436,790 branches # 356.736 M/sec ( +- 0.44% ) [16.76%]
10,062,572 branch-misses # 2.49% of all branches ( +- 0.85% ) [16.75%]
1,133,889,264 L1-dcache-loads # 1002.632 M/sec ( +- 0.56% ) [16.74%]
21,460,116 L1-dcache-load-misses # 1.89% of all L1-dcache hits ( +- 1.31% ) [16.73%]
38,070,119 LLC-loads # 33.663 M/sec ( +- 1.63% ) [16.72%]
9,593,162 LLC-load-misses # 25.20% of all LL-cache hits ( +- 0.42% ) [16.71%]
562,867,188 L1-icache-loads # 497.711 M/sec ( +- 0.59% ) [16.68%]
8,472,343 L1-icache-load-misses # 1.51% of all L1-icache hits ( +- 3.02% ) [16.64%]
1,126,997,403 dTLB-loads # 996.538 M/sec ( +- 0.53% ) [16.61%]
414,900 dTLB-load-misses # 0.04% of all dTLB cache hits ( +- 4.12% ) [16.60%]
561,156,032 iTLB-loads # 496.198 M/sec ( +- 0.56% ) [16.59%]
212,482 iTLB-load-misses # 0.04% of all iTLB cache hits ( +- 6.10% ) [16.58%]
0 L1-dcache-prefetches # 0.000 K/sec [16.57%]
0 L1-dcache-prefetch-misses # 0.000 K/sec [16.56%]
1.130242195 seconds time elapsed ( +- 0.60% )
* Neil Horman <[email protected]> wrote:
> Here's my data for running the same test with taskset restricting
> execution to only cpu0.  I'm not quite sure what's going on here,
> but doing so resulted in a 10x slowdown of the runtime of each
> iteration which I can't explain. As before however, both the
> parallel alu run and the prefetch run resulted in speedups, but
> the two together were not in any way additive.  I'm going to keep
> playing with the prefetch stride, unless you have an alternate
> theory.
Could you please cite the exact command-line you used for running
the test?
Thanks,
Ingo
* Doug Ledford <[email protected]> wrote:
> [ Snipped a couple of really nice real-life bandwidth tests. ]
> Some of my preliminary results:
>
> 1) Regarding the initial claim that changing the code to have two
> addition chains, allowing the use of two ALUs, doubling
> performance: I'm just not seeing it. I have a number of theories
> about this, but they are dependent on point #2 below:
>
> 2) Prefetch definitely helped, although how much depends on which
> of the test setups I was using above. The biggest gainer was B)
> the E3-1240 V2 @ 3.40GHz based machines.
>
> So, my theories about #1 are that, with modern CPUs, it's more our
> load/store speed that is killing us than the ALU speed. I tried
> at least 5 distinctly different ALU algorithms, including one that
> eliminated the use of the carry chain entirely, and none of them
> had a noticeable effect. On the other hand, prefetch always had a
> noticeable effect. I suspect the original patch worked and had a
> performance benefit some time ago due to a quirk on some CPU
> common back then, but modern CPUs are capable of optimizing the
> routine well enough that the benefit of the patch is already in
> our original csum routine due to CPU optimizations. [...]
That definitely sounds plausible.
> [...] Or maybe there is another explanation, but I'm not really
> looking too hard for it.
>
> I also tried two different prefetch methods on the theory that
> memory access cycles are more important than CPU access cycles,
> and there appears to be a minor benefit to wasting CPU cycles to
> prevent unnecessary prefetches, even with 65520 as our MTU where a
> 320 byte excess prefetch at the end of the operation only caused
> us to load a few % points of extra memory. I suspect that if I
> dropped the MTU down to 9K (to mimic jumbo frames on a device
> without tx/rx checksum offloads), the smart version of prefetch
> would be a much bigger winner. The fact that there is any
> apparent difference at all on such a large copy tells me that
> prefetch should probably always be smart and never dumb (and here
> by smart versus dumb I mean prefetch should check to make sure you
> aren't prefetching beyond the end of data you care about before
> executing the prefetch instruction).
That looks like an important result and it should matter even more
to ~1.5k MTU sizes where the prefetch window will be even larger
relative to the IP packet size.
> What strikes me as important here is that these 8 core Intel CPUs
> actually got *slower* with the ALU patch + prefetch. This
> warrants more investigation to find out if it's the prefetch or
> the ALU patch that did the damage to the speed. It's also worth
> noting that these 8 core CPUs have such high variability that I
> don't trust these measurements yet.
It might make sense to have a good look at the PMU counts for these
cases to see what's going on.
Also, once the packet is copied to user-space, we might want to do a
CLFLUSH on the originating buffer, to zap the cacheline from the CPU
caches. (This might or might not matter, depending on how good the
CPU is at keeping its true working set in the cache.)
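For reference, a userspace sketch of forcing that kind of cold-cache state
around a measured pass is below; the function name and the 64-byte line size
are assumptions, and in-kernel the analogous step would be something like
clflush_cache_range():

#include <emmintrin.h>	/* _mm_clflush(), _mm_mfence() - SSE2 */
#include <stddef.h>

/* Flush an entire buffer out of the CPU caches so the next pass over it
 * starts cold.  Assumes 64-byte cachelines. */
static void flush_buffer(const void *buf, size_t len)
{
	const char *p = (const char *)buf;
	size_t off;

	for (off = 0; off < len; off += 64)
		_mm_clflush(p + off);
	_mm_mfence();	/* make sure the flushes complete before timing starts */
}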
> > I'm a bit sceptical - I think 'looking 1-2 cachelines in
> > advance' is something that might work reasonably well on a wide
> > range of systems, while trying to find a bus capacity/latency
> > dependent sweet spot would be difficult.
>
> I think 1-2 cachelines is probably way too short. [...]
The 4-5 cachelines result you seem to be converging on looks very
plausible to me too.
What I think we should try to avoid is to make the actual window per
system variable: that would be really hard to tune right.
But the 'don't prefetch past the buffer' "smart prefetch" logic you
mentioned is system-agnostic and might make sense to introduce.
Thanks,
Ingo
On Tue, Oct 29, 2013 at 09:25:42AM +0100, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > Here's my data for running the same test with taskset restricting
> > execution to only cpu0.  I'm not quite sure what's going on here,
> > but doing so resulted in a 10x slowdown of the runtime of each
> > iteration which I can't explain. As before however, both the
> > parallel alu run and the prefetch run resulted in speedups, but
> > the two together were not in any way additive.  I'm going to keep
> > playing with the prefetch stride, unless you have an alternate
> > theory.
>
> Could you please cite the exact command-line you used for running
> the test?
>
> Thanks,
>
> Ingo
>
Sure it was this:
for i in `seq 0 1 3`
do
echo $i > /sys/module/csum_test/parameters/module_test_mode
taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
done >> counters.txt 2>&1
where test.sh is:
#!/bin/sh
echo 1 > /sys/module/csum_test/parameters/test_fire
As before, module_test_mode selects a case in a switch statement I added in
do_csum to test one of the 4 csum variants we've been discussing (base, prefetch,
parallel ALU or both), and test_fire is a callback trigger I use in the test
module to run 100000 iterations of a checksum operation. As you requested, I
ran the above on cpu 0 (-C 0 on perf and -c 0 on taskset), and I removed all irq
affinity to cpu 0.
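Roughly, the two parameters are wired up like this (a sketch only, not the actual test module source; do_csum_base() and friends, test_buf and TEST_LEN are placeholder names for the four variants under test):

/* kernel module context: linux/module.h, linux/moduleparam.h */
extern unsigned do_csum_base(const unsigned char *, unsigned);
extern unsigned do_csum_prefetch(const unsigned char *, unsigned);
extern unsigned do_csum_parallel(const unsigned char *, unsigned);
extern unsigned do_csum_both(const unsigned char *, unsigned);
extern unsigned char test_buf[];
#define TEST_LEN 65536

static int module_test_mode;	/* 0=base, 1=prefetch, 2=parallel ALU, 3=both */
module_param(module_test_mode, int, 0644);

static int test_fire_set(const char *val, const struct kernel_param *kp)
{
	int i;

	/* writing test_fire runs the selected variant many times */
	for (i = 0; i < 100000; i++) {
		switch (module_test_mode) {
		case 0: do_csum_base(test_buf, TEST_LEN);     break;
		case 1: do_csum_prefetch(test_buf, TEST_LEN); break;
		case 2: do_csum_parallel(test_buf, TEST_LEN); break;
		case 3: do_csum_both(test_buf, TEST_LEN);     break;
		}
	}
	return 0;
}

static const struct kernel_param_ops test_fire_ops = {
	.set = test_fire_set,
};
module_param_cb(test_fire, &test_fire_ops, NULL, 0200);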
Regards
Neil
* Neil Horman <[email protected]> wrote:
> Sure it was this:
> for i in `seq 0 1 3`
> do
> echo $i > /sys/module/csum_test/parameters/module_test_mode
> taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> done >> counters.txt 2>&1
>
> where test.sh is:
> #!/bin/sh
> echo 1 > /sys/module/csum_test/parameters/test_fire
What does '-- /root/test.sh' do?
Unless I'm missing something, the line above will run:
perf bench sched messaging -- /root/test.sh
which should be equivalent to:
perf bench sched messaging
i.e. /root/test.sh won't be run.
Thanks,
Ingo
On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > Sure it was this:
> > for i in `seq 0 1 3`
> > do
> > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > done >> counters.txt 2>&1
> >
> > where test.sh is:
> > #!/bin/sh
> > echo 1 > /sys/module/csum_test/parameters/test_fire
>
> What does '-- /root/test.sh' do?
>
> Unless I'm missing something, the line above will run:
>
> perf bench sched messaging -- /root/test.sh
>
> which should be equivalent to:
>
> perf bench sched messaging
>
> i.e. /root/test.sh won't be run.
>
According to the perf man page, I'm supposed to be able to use -- to separate
perf command line parameters from the command I want to run. And it definitely
executed test.sh; I added an echo to stdout in there as a test run and observed
its output get captured in counters.txt.
Neil
> Thanks,
>
> Ingo
>
* Neil Horman <[email protected]> wrote:
> On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> >
> > * Neil Horman <[email protected]> wrote:
> >
> > > Sure it was this:
> > > for i in `seq 0 1 3`
> > > do
> > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > > done >> counters.txt 2>&1
> > >
> > > where test.sh is:
> > > #!/bin/sh
> > > echo 1 > /sys/module/csum_test/parameters/test_fire
> >
> > What does '-- /root/test.sh' do?
> >
> > Unless I'm missing something, the line above will run:
> >
> > perf bench sched messaging -- /root/test.sh
> >
> > which should be equivalent to:
> >
> > perf bench sched messaging
> >
> > i.e. /root/test.sh won't be run.
>
> According to the perf man page, I'm supposed to be able to use --
> to separate perf command line parameters from the command I want
> to run. And it definately executed test.sh, I added an echo to
> stdout in there as a test run and observed them get captured in
> counters.txt
Well, '--' can be used to delineate the command portion for cases
where it's ambiguous.
Here it's unambiguous though. This:
perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
stops parsing a valid option after the -ddd option, so in theory it
should execute 'perf bench sched messaging -- /root/test.sh' where
'-- /root/test.sh' is simply a parameter to 'perf bench' and is thus
ignored.
The message output you provided seems to suggest that to be the
case:
Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
See how the command executed by perf stat was 'perf bench ...'.
Did you want to run:
perf stat --repeat 20 -C 0 -ddd /root/test.sh
?
Thanks,
Ingo
On Tue, Oct 29, 2013 at 01:52:33PM +0100, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> > >
> > > * Neil Horman <[email protected]> wrote:
> > >
> > > > Sure it was this:
> > > > for i in `seq 0 1 3`
> > > > do
> > > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > > > done >> counters.txt 2>&1
> > > >
> > > > where test.sh is:
> > > > #!/bin/sh
> > > > echo 1 > /sys/module/csum_test/parameters/test_fire
> > >
> > > What does '-- /root/test.sh' do?
> > >
> > > Unless I'm missing something, the line above will run:
> > >
> > > perf bench sched messaging -- /root/test.sh
> > >
> > > which should be equivalent to:
> > >
> > > perf bench sched messaging
> > >
> > > i.e. /root/test.sh won't be run.
> >
> > According to the perf man page, I'm supposed to be able to use --
> > to separate perf command line parameters from the command I want
> > to run. And it definately executed test.sh, I added an echo to
> > stdout in there as a test run and observed them get captured in
> > counters.txt
>
> Well, '--' can be used to delineate the command portion for cases
> where it's ambiguous.
>
> Here's it's unambiguous though. This:
>
> perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
>
> stops parsing a valid option after the -ddd option, so in theory it
> should execute 'perf bench sched messaging -- /root/test.sh' where
> '-- /root/test.sh' is simply a parameter to 'perf bench' and is thus
> ignored.
>
> The message output you provided seems to suggest that to be the
> case:
>
> Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
>
> See how the command executed by perf stat was 'perf bench ...'.
>
> Did you want to run:
>
> perf stat --repeat 20 -C 0 -ddd /root/test.sh
>
I'm sure it worked properly on my system here, I specifically checked it, but
I'll gladly run it again. You'll have to give me an hour as I have a meeting to
run to, but I'll have results shortly.
Neil
> ?
>
> Thanks,
>
> Ingo
>
* Neil Horman <[email protected]> wrote:
> I'm sure it worked properly on my system here, I specificially
> checked it, but I'll gladly run it again. You have to give me an
> hour as I have a meeting to run to, but I'll have results shortly.
So what I tried to react to was this observation of yours:
> > > Heres my data for running the same test with taskset
> > > restricting execution to only cpu0. I'm not quite sure whats
> > > going on here, but doing so resulted in a 10x slowdown of the
> > > runtime of each iteration which I can't explain. [...]
A 10x slowdown would be consistent with not running your testcase
but 'perf bench sched messaging' by accident, or so.
But I was really just guessing wildly here.
Thanks,
Ingo
On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > I'm sure it worked properly on my system here, I specificially
> > checked it, but I'll gladly run it again. You have to give me an
> > hour as I have a meeting to run to, but I'll have results shortly.
>
> So what I tried to react to was this observation of yours:
>
> > > > Heres my data for running the same test with taskset
> > > > restricting execution to only cpu0. I'm not quite sure whats
> > > > going on here, but doing so resulted in a 10x slowdown of the
> > > > runtime of each iteration which I can't explain. [...]
>
> A 10x slowdown would be consistent with not running your testcase
> but 'perf bench sched messaging' by accident, or so.
>
> But I was really just guessing wildly here.
>
> Thanks,
>
> Ingo
>
Ok, well, I'll run it again in just a bit here.
Neil
On 10/29/13 6:52 AM, Ingo Molnar wrote:
>> According to the perf man page, I'm supposed to be able to use --
>> to separate perf command line parameters from the command I want
>> to run. And it definately executed test.sh, I added an echo to
>> stdout in there as a test run and observed them get captured in
>> counters.txt
>
> Well, '--' can be used to delineate the command portion for cases
> where it's ambiguous.
>
> Here's it's unambiguous though. This:
>
> perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
>
> stops parsing a valid option after the -ddd option, so in theory it
> should execute 'perf bench sched messaging -- /root/test.sh' where
> '-- /root/test.sh' is simply a parameter to 'perf bench' and is thus
> ignored.
Normally with perf commands a workload can be specified to state how
long to collect perf data. That is not the case for perf-bench.
David
On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > I'm sure it worked properly on my system here, I specificially
> > checked it, but I'll gladly run it again. You have to give me an
> > hour as I have a meeting to run to, but I'll have results shortly.
>
> So what I tried to react to was this observation of yours:
>
> > > > Heres my data for running the same test with taskset
> > > > restricting execution to only cpu0. I'm not quite sure whats
> > > > going on here, but doing so resulted in a 10x slowdown of the
> > > > runtime of each iteration which I can't explain. [...]
>
> A 10x slowdown would be consistent with not running your testcase
> but 'perf bench sched messaging' by accident, or so.
>
> But I was really just guessing wildly here.
>
> Thanks,
>
> Ingo
>
So, I apologize, you were right. I was running the test.sh script but perf was
measuring itself. Using this command line:
for i in `seq 0 1 3`
do
echo $i > /sys/module/csum_test/parameters/module_test_mode; taskset -c 0 perf stat --repeat 20 -C 0 -ddd /root/test.sh
done >> counters.txt 2>&1
with test.sh unchanged I get these results:
Base:
Performance counter stats for '/root/test.sh' (20 runs):
56.069737 task-clock # 1.005 CPUs utilized ( +- 0.13% ) [100.00%]
5 context-switches # 0.091 K/sec ( +- 5.11% ) [100.00%]
0 cpu-migrations # 0.000 K/sec [100.00%]
366 page-faults # 0.007 M/sec ( +- 0.08% )
144,264,737 cycles # 2.573 GHz ( +- 0.23% ) [17.49%]
9,239,760 stalled-cycles-frontend # 6.40% frontend cycles idle ( +- 3.77% ) [19.19%]
110,635,829 stalled-cycles-backend # 76.69% backend cycles idle ( +- 0.14% ) [19.68%]
54,291,496 instructions # 0.38 insns per cycle
# 2.04 stalled cycles per insn ( +- 0.14% ) [18.30%]
5,844,933 branches # 104.244 M/sec ( +- 2.81% ) [16.58%]
301,523 branch-misses # 5.16% of all branches ( +- 0.12% ) [16.09%]
23,645,797 L1-dcache-loads # 421.721 M/sec ( +- 0.05% ) [16.06%]
494,467 L1-dcache-load-misses # 2.09% of all L1-dcache hits ( +- 0.06% ) [16.06%]
2,907,250 LLC-loads # 51.851 M/sec ( +- 0.08% ) [16.06%]
486,329 LLC-load-misses # 16.73% of all LL-cache hits ( +- 0.11% ) [16.06%]
11,113,848 L1-icache-loads # 198.215 M/sec ( +- 0.07% ) [16.06%]
5,378 L1-icache-load-misses # 0.05% of all L1-icache hits ( +- 1.34% ) [16.06%]
23,742,876 dTLB-loads # 423.453 M/sec ( +- 0.06% ) [16.06%]
0 dTLB-load-misses # 0.00% of all dTLB cache hits [16.06%]
11,108,538 iTLB-loads # 198.120 M/sec ( +- 0.06% ) [16.06%]
0 iTLB-load-misses # 0.00% of all iTLB cache hits [16.07%]
0 L1-dcache-prefetches # 0.000 K/sec [16.07%]
0 L1-dcache-prefetch-misses # 0.000 K/sec [16.07%]
0.055817066 seconds time elapsed ( +- 0.10% )
Prefetch(5*64):
Performance counter stats for '/root/test.sh' (20 runs):
47.423853 task-clock # 1.005 CPUs utilized ( +- 0.62% ) [100.00%]
6 context-switches # 0.116 K/sec ( +- 4.27% ) [100.00%]
0 cpu-migrations # 0.000 K/sec [100.00%]
368 page-faults # 0.008 M/sec ( +- 0.07% )
120,423,860 cycles # 2.539 GHz ( +- 0.85% ) [14.23%]
8,555,632 stalled-cycles-frontend # 7.10% frontend cycles idle ( +- 0.56% ) [16.23%]
87,438,794 stalled-cycles-backend # 72.61% backend cycles idle ( +- 1.13% ) [18.33%]
55,039,308 instructions # 0.46 insns per cycle
# 1.59 stalled cycles per insn ( +- 0.05% ) [18.98%]
5,619,298 branches # 118.491 M/sec ( +- 2.32% ) [18.98%]
303,686 branch-misses # 5.40% of all branches ( +- 0.08% ) [18.98%]
26,577,868 L1-dcache-loads # 560.432 M/sec ( +- 0.05% ) [18.98%]
1,323,630 L1-dcache-load-misses # 4.98% of all L1-dcache hits ( +- 0.14% ) [18.98%]
3,426,016 LLC-loads # 72.242 M/sec ( +- 0.05% ) [18.98%]
1,304,201 LLC-load-misses # 38.07% of all LL-cache hits ( +- 0.13% ) [18.98%]
13,190,316 L1-icache-loads # 278.137 M/sec ( +- 0.21% ) [18.98%]
33,881 L1-icache-load-misses # 0.26% of all L1-icache hits ( +- 4.63% ) [17.93%]
25,366,685 dTLB-loads # 534.893 M/sec ( +- 0.24% ) [15.93%]
734 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 8.40% ) [13.94%]
13,314,660 iTLB-loads # 280.759 M/sec ( +- 0.05% ) [12.97%]
0 iTLB-load-misses # 0.00% of all iTLB cache hits [12.98%]
0 L1-dcache-prefetches # 0.000 K/sec [12.98%]
0 L1-dcache-prefetch-misses # 0.000 K/sec [12.87%]
0.047194407 seconds time elapsed ( +- 0.62% )
Parallel ALU:
Performance counter stats for '/root/test.sh' (20 runs):
57.395070 task-clock # 1.004 CPUs utilized ( +- 1.71% ) [100.00%]
5 context-switches # 0.092 K/sec ( +- 3.90% ) [100.00%]
0 cpu-migrations # 0.000 K/sec [100.00%]
367 page-faults # 0.006 M/sec ( +- 0.10% )
143,232,396 cycles # 2.496 GHz ( +- 1.68% ) [16.73%]
7,299,843 stalled-cycles-frontend # 5.10% frontend cycles idle ( +- 2.69% ) [18.47%]
109,485,845 stalled-cycles-backend # 76.44% backend cycles idle ( +- 2.01% ) [19.99%]
56,867,669 instructions # 0.40 insns per cycle
# 1.93 stalled cycles per insn ( +- 0.22% ) [19.49%]
6,646,323 branches # 115.800 M/sec ( +- 2.15% ) [17.75%]
304,671 branch-misses # 4.58% of all branches ( +- 0.37% ) [16.23%]
23,612,428 L1-dcache-loads # 411.402 M/sec ( +- 0.05% ) [15.95%]
518,988 L1-dcache-load-misses # 2.20% of all L1-dcache hits ( +- 0.11% ) [15.95%]
2,934,119 LLC-loads # 51.121 M/sec ( +- 0.06% ) [15.95%]
509,027 LLC-load-misses # 17.35% of all LL-cache hits ( +- 0.15% ) [15.95%]
11,103,819 L1-icache-loads # 193.463 M/sec ( +- 0.08% ) [15.95%]
5,381 L1-icache-load-misses # 0.05% of all L1-icache hits ( +- 2.45% ) [15.95%]
23,727,164 dTLB-loads # 413.401 M/sec ( +- 0.06% ) [15.95%]
0 dTLB-load-misses # 0.00% of all dTLB cache hits [15.95%]
11,104,205 iTLB-loads # 193.470 M/sec ( +- 0.06% ) [15.95%]
0 iTLB-load-misses # 0.00% of all iTLB cache hits [15.95%]
0 L1-dcache-prefetches # 0.000 K/sec [15.95%]
0 L1-dcache-prefetch-misses # 0.000 K/sec [15.96%]
0.057151644 seconds time elapsed ( +- 1.69% )
Both:
Performance counter stats for '/root/test.sh' (20 runs):
48.377833 task-clock # 1.005 CPUs utilized ( +- 0.67% ) [100.00%]
5 context-switches # 0.113 K/sec ( +- 3.88% ) [100.00%]
0 cpu-migrations # 0.001 K/sec ( +-100.00% ) [100.00%]
367 page-faults # 0.008 M/sec ( +- 0.08% )
122,529,490 cycles # 2.533 GHz ( +- 1.05% ) [14.24%]
8,796,729 stalled-cycles-frontend # 7.18% frontend cycles idle ( +- 0.56% ) [16.20%]
88,936,550 stalled-cycles-backend # 72.58% backend cycles idle ( +- 1.48% ) [18.16%]
58,405,660 instructions # 0.48 insns per cycle
# 1.52 stalled cycles per insn ( +- 0.07% ) [18.61%]
5,742,738 branches # 118.706 M/sec ( +- 1.54% ) [18.61%]
303,555 branch-misses # 5.29% of all branches ( +- 0.09% ) [18.61%]
26,321,789 L1-dcache-loads # 544.088 M/sec ( +- 0.07% ) [18.61%]
1,236,101 L1-dcache-load-misses # 4.70% of all L1-dcache hits ( +- 0.08% ) [18.61%]
3,409,768 LLC-loads # 70.482 M/sec ( +- 0.05% ) [18.61%]
1,212,511 LLC-load-misses # 35.56% of all LL-cache hits ( +- 0.08% ) [18.61%]
10,579,372 L1-icache-loads # 218.682 M/sec ( +- 0.05% ) [18.61%]
19,426 L1-icache-load-misses # 0.18% of all L1-icache hits ( +- 14.70% ) [18.61%]
25,329,963 dTLB-loads # 523.586 M/sec ( +- 0.27% ) [17.29%]
802 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 5.43% ) [15.33%]
10,635,524 iTLB-loads # 219.843 M/sec ( +- 0.09% ) [13.38%]
0 iTLB-load-misses # 0.00% of all iTLB cache hits [12.72%]
0 L1-dcache-prefetches # 0.000 K/sec [12.72%]
0 L1-dcache-prefetch-misses # 0.000 K/sec [12.72%]
0.048140073 seconds time elapsed ( +- 0.67% )
Which overall looks a lot more like I expect, save for the parallel ALU cases.
It seems here that the parallel ALU changes actually hurt performance, which
really seems counter-intuitive. I don't yet have any explanation for that. I
do note that we seem to have more stalls in the 'both' case, so perhaps the
parallel chains call for a more aggressive prefetch. Do you have any thoughts?
Regards
Neil
* Neil Horman <[email protected]> wrote:
> So, I apologize, you were right. I was running the test.sh script
> but perf was measuring itself. [...]
Ok, cool - one mystery less!
> Which overall looks alot more like I expect, save for the parallel
> ALU cases. It seems here that the parallel ALU changes actually
> hurt performance, which really seems counter-intuitive. I don't
> yet have any explination for that. I do note that we seem to have
> more stalls in the both case so perhaps the parallel chains call
> for a more agressive prefetch. Do you have any thoughts?
Note that with -ddd you 'overload' the PMU with more counters than
can be run at once, which introduces extra noise. Since you are
running the tests for 0.150 secs or so, the results are not very
representative:
734 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 8.40% ) [13.94%]
13,314,660 iTLB-loads # 280.759 M/sec ( +- 0.05% ) [12.97%]
with such low runtimes those results are very hard to trust.
So -ddd is typically used to pick up the most interesting PMU events
you want to see measured, and then use them like this:
-e dTLB-load-misses -e iTLB-loads
etc. For such short runtimes make sure the last column displays
close to 100%, so that the PMU results become trustable.
A nehalem+ PMU will allow 2-4 events to be measured in parallel,
plus generics like 'cycles', 'instructions' can be added 'for free'
because they get counted in a separate (fixed purpose) PMU register.
The last column tells you what percentage of the runtime that
particular event was actually active. 100% (or empty last column)
means it was active all the time.
Thanks,
Ingo
On Tue, Oct 29, 2013 at 03:27:16PM +0100, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > So, I apologize, you were right. I was running the test.sh script
> > but perf was measuring itself. [...]
>
> Ok, cool - one mystery less!
>
> > Which overall looks alot more like I expect, save for the parallel
> > ALU cases. It seems here that the parallel ALU changes actually
> > hurt performance, which really seems counter-intuitive. I don't
> > yet have any explination for that. I do note that we seem to have
> > more stalls in the both case so perhaps the parallel chains call
> > for a more agressive prefetch. Do you have any thoughts?
>
> Note that with -ddd you 'overload' the PMU with more counters than
> can be run at once, which introduces extra noise. Since you are
> running the tests for 0.150 secs or so, the results are not very
> representative:
>
> 734 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 8.40% ) [13.94%]
> 13,314,660 iTLB-loads # 280.759 M/sec ( +- 0.05% ) [12.97%]
>
> with such low runtimes those results are very hard to trust.
>
> So -ddd is typically used to pick up the most interesting PMU events
> you want to see measured, and then use them like this:
>
> -e dTLB-load-misses -e iTLB-loads
>
> etc. For such short runtimes make sure the last column displays
> close to 100%, so that the PMU results become trustable.
>
> A nehalem+ PMU will allow 2-4 events to be measured in parallel,
> plus generics like 'cycles', 'instructions' can be added 'for free'
> because they get counted in a separate (fixed purpose) PMU register.
>
> The last colum tells you what percentage of the runtime that
> particular event was actually active. 100% (or empty last column)
> means it was active all the time.
>
> Thanks,
>
> Ingo
>
Hmm,
I ran this test:
for i in `seq 0 1 3`
do
echo $i > /sys/module/csum_test/parameters/module_test_mode
taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
done
And I updated the test module to run for a million iterations rather than 100000 to increase the sample size and got this:
Base:
Performance counter stats for './test.sh' (20 runs):
47,305,064 L1-dcache-load-misses # 2.09% of all L1-dcache hits ( +- 0.04% ) [18.74%]
0 L1-dcache-prefetches [18.75%]
13,906,212,348 cycles # 0.000 GHz ( +- 0.05% ) [18.76%]
4,426,395,949 instructions # 0.32 insns per cycle ( +- 0.01% ) [18.77%]
2,261,551,278 L1-dcache-loads ( +- 0.02% ) [18.76%]
47,287,226 L1-dcache-load-misses # 2.09% of all L1-dcache hits ( +- 0.04% ) [18.76%]
276,842,685 LLC-loads ( +- 0.01% ) [18.76%]
46,454,114 LLC-load-misses # 16.78% of all LL-cache hits ( +- 0.05% ) [18.76%]
1,048,894,486 L1-icache-loads ( +- 0.07% ) [18.76%]
472,205 L1-icache-load-misses # 0.05% of all L1-icache hits ( +- 1.19% ) [18.76%]
2,260,639,613 dTLB-loads ( +- 0.01% ) [18.75%]
172 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 35.14% ) [18.74%]
1,048,732,481 iTLB-loads ( +- 0.07% ) [18.74%]
19 iTLB-load-misses # 0.00% of all iTLB cache hits ( +- 39.75% ) [18.73%]
0 L1-dcache-prefetches [18.73%]
0 L1-dcache-prefetch-misses [18.73%]
5.370546698 seconds time elapsed ( +- 0.05% )
Prefetch:
Performance counter stats for './test.sh' (20 runs):
124,885,469 L1-dcache-load-misses # 4.96% of all L1-dcache hits ( +- 0.09% ) [18.74%]
0 L1-dcache-prefetches [18.75%]
11,434,328,889 cycles # 0.000 GHz ( +- 1.11% ) [18.77%]
4,601,831,553 instructions # 0.40 insns per cycle ( +- 0.01% ) [18.77%]
2,515,483,814 L1-dcache-loads ( +- 0.01% ) [18.77%]
124,928,127 L1-dcache-load-misses # 4.97% of all L1-dcache hits ( +- 0.09% ) [18.76%]
323,355,145 LLC-loads ( +- 0.02% ) [18.76%]
123,008,548 LLC-load-misses # 38.04% of all LL-cache hits ( +- 0.10% ) [18.75%]
1,256,391,060 L1-icache-loads ( +- 0.01% ) [18.75%]
374,691 L1-icache-load-misses # 0.03% of all L1-icache hits ( +- 1.41% ) [18.75%]
2,514,984,046 dTLB-loads ( +- 0.01% ) [18.75%]
67 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 51.81% ) [18.74%]
1,256,333,548 iTLB-loads ( +- 0.01% ) [18.74%]
19 iTLB-load-misses # 0.00% of all iTLB cache hits ( +- 39.74% ) [18.74%]
0 L1-dcache-prefetches [18.73%]
0 L1-dcache-prefetch-misses [18.73%]
4.496839773 seconds time elapsed ( +- 0.64% )
Parallel ALU:
Performance counter stats for './test.sh' (20 runs):
49,489,518 L1-dcache-load-misses # 2.19% of all L1-dcache hits ( +- 0.09% ) [18.74%]
0 L1-dcache-prefetches [18.76%]
13,777,501,365 cycles # 0.000 GHz ( +- 1.73% ) [18.78%]
4,707,160,703 instructions # 0.34 insns per cycle ( +- 0.01% ) [18.78%]
2,261,693,074 L1-dcache-loads ( +- 0.02% ) [18.78%]
49,468,878 L1-dcache-load-misses # 2.19% of all L1-dcache hits ( +- 0.09% ) [18.77%]
279,524,254 LLC-loads ( +- 0.01% ) [18.76%]
48,491,934 LLC-load-misses # 17.35% of all LL-cache hits ( +- 0.12% ) [18.75%]
1,057,877,680 L1-icache-loads ( +- 0.02% ) [18.74%]
461,784 L1-icache-load-misses # 0.04% of all L1-icache hits ( +- 1.87% ) [18.74%]
2,260,978,836 dTLB-loads ( +- 0.02% ) [18.74%]
27 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 89.96% ) [18.74%]
1,057,886,632 iTLB-loads ( +- 0.02% ) [18.74%]
4 iTLB-load-misses # 0.00% of all iTLB cache hits ( +-100.00% ) [18.74%]
0 L1-dcache-prefetches [18.73%]
0 L1-dcache-prefetch-misses [18.73%]
5.500417234 seconds time elapsed ( +- 1.60% )
Both:
Performance counter stats for './test.sh' (20 runs):
116,621,570 L1-dcache-load-misses # 4.68% of all L1-dcache hits ( +- 0.04% ) [18.73%]
0 L1-dcache-prefetches [18.75%]
11,597,067,510 cycles # 0.000 GHz ( +- 1.73% ) [18.77%]
4,952,251,361 instructions # 0.43 insns per cycle ( +- 0.01% ) [18.77%]
2,493,003,710 L1-dcache-loads ( +- 0.02% ) [18.77%]
116,640,333 L1-dcache-load-misses # 4.68% of all L1-dcache hits ( +- 0.04% ) [18.77%]
322,246,216 LLC-loads ( +- 0.03% ) [18.76%]
114,528,956 LLC-load-misses # 35.54% of all LL-cache hits ( +- 0.04% ) [18.76%]
999,371,469 L1-icache-loads ( +- 0.02% ) [18.76%]
406,679 L1-icache-load-misses # 0.04% of all L1-icache hits ( +- 1.97% ) [18.75%]
2,492,708,710 dTLB-loads ( +- 0.01% ) [18.75%]
140 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 38.46% ) [18.74%]
999,320,389 iTLB-loads ( +- 0.01% ) [18.74%]
19 iTLB-load-misses # 0.00% of all iTLB cache hits ( +- 39.90% ) [18.73%]
0 L1-dcache-prefetches [18.73%]
0 L1-dcache-prefetch-misses [18.72%]
4.634419247 seconds time elapsed ( +- 1.60% )
I note a few oddities here:
1) We seem to be getting more counter results than I specified; I'm not sure why.
2) The % active column adds up to way more than 100 (which, from my read of
the man page, makes sense, given that multiple counters might increment in
response to a single instruction execution).
3) The run times are proportionally larger, but still indicate that parallel ALU
execution is hurting rather than helping, which is counter-intuitive. I'm
looking into it, but thought you might want to see these results in case
something jumped out at you.
Regards
Neil
* Neil Horman <[email protected]> wrote:
> 3) The run times are proportionally larger, but still indicate that Parallel ALU
> execution is hurting rather than helping, which is counter-intuitive. I'm
> looking into it, but thought you might want to see these results in case
> something jumped out at you
So here's my theory about all of this.
I think that the original observation some years back was a fluke caused by
either a buggy CPU or a CPU design that is no longer used.
The parallel ALU design of this patch seems OK at first glance, but it means
that two parallel operations are both trying to set/clear both the overflow
and carry flags of the EFLAGS register of the *CPU* (not the ALU). So, either
some CPU in the past had a set of overflow/carry flags per ALU and did some
sort of magic to make sure that the last state of those flags across multiple
ALUs that might have been used in parallelizing work were always in the CPU's
logical EFLAGS register, or the CPU has a buggy microcode that allowed two
ALUs to operate on data at the same time in situations where they would
potentially stomp on the carry/overflow flags of the other ALUs operations.
It's my theory that all modern CPUs have this behavior fixed, probably via a
microcode update, and so trying to do parallel ALU operations like this simply
has no effect because the CPU (rightly so) serializes the operations to keep
them from clobbering the overflow/carry flags of the other ALUs operations.
My additional theory, then, is that the reason you see a slowdown from this
patch is that the attempt to parallelize the ALU operation has caused
us to write a series of instructions that, once serialized, are non-optimal
and hinder smooth pipelining of the data (aka going 0*8, 2*8, 4*8, 6*8, 1*8,
3*8, 5*8, and 7*8 in terms of memory accesses is worse than doing them in
order, and since we aren't getting the parallel operation we want, this
is the net result of the patch).
It would explain things anyway.
> The parallel ALU design of this patch seems OK at first glance, but it means
> that two parallel operations are both trying to set/clear both the overflow
> and carry flags of the EFLAGS register of the *CPU* (not the ALU). So, either
> some CPU in the past had a set of overflow/carry flags per ALU and did some
> sort of magic to make sure that the last state of those flags across multiple
> ALUs that might have been used in parallelizing work were always in the CPU's
> logical EFLAGS register, or the CPU has a buggy microcode that allowed two
> ALUs to operate on data at the same time in situations where they would
> potentially stomp on the carry/overflow flags of the other ALUs operations.
IIRC x86 cpus treat the (arithmetic) flags register as a single entity.
So an instruction that only changes some of the flags is dependent
on any previous instruction that changes any flags.
OTOH, if the instruction writes all of the flags then it doesn't
have to wait for the earlier instruction to complete.
This is problematic for the ADC chain in the IP checksum.
I did once try to use the SSE instructions to sum 16bit
fields into multiple 32bit registers.
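The basic shape of that approach, as a userspace-style sketch with SSE2 intrinsics (not the original code; it assumes len is a multiple of 16 and ignores 32-bit lane overflow, which only bites past roughly half a megabyte of data):

#include <emmintrin.h>	/* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

static uint32_t csum_sse2_sketch(const unsigned char *buff, size_t len)
{
	__m128i zero = _mm_setzero_si128();
	__m128i acc  = _mm_setzero_si128();	/* four 32-bit partial sums */
	uint32_t lanes[4];
	uint64_t sum;
	size_t i;

	for (i = 0; i < len; i += 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)(buff + i));

		/* zero-extend the eight 16-bit words into 32-bit lanes
		 * and sum there - no carry flag involved at all */
		acc = _mm_add_epi32(acc, _mm_unpacklo_epi16(v, zero));
		acc = _mm_add_epi32(acc, _mm_unpackhi_epi16(v, zero));
	}

	_mm_storeu_si128((__m128i *)lanes, acc);
	sum = (uint64_t)lanes[0] + lanes[1] + lanes[2] + lanes[3];

	/* fold down to 16 bits, ones-complement style */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint32_t)sum;
}

In kernel context this would also need kernel_fpu_begin()/kernel_fpu_end() around it, which is part of why it may not be a win for short buffers.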
David
On Wed, Oct 30, 2013 at 01:25:39AM -0400, Doug Ledford wrote:
> * Neil Horman <[email protected]> wrote:
> > 3) The run times are proportionally larger, but still indicate that Parallel ALU
> > execution is hurting rather than helping, which is counter-intuitive. I'm
> > looking into it, but thought you might want to see these results in case
> > something jumped out at you
>
> So here's my theory about all of this.
>
> I think that the original observation some years back was a fluke caused by
> either a buggy CPU or a CPU design that is no longer used.
>
> The parallel ALU design of this patch seems OK at first glance, but it means
> that two parallel operations are both trying to set/clear both the overflow
> and carry flags of the EFLAGS register of the *CPU* (not the ALU). So, either
> some CPU in the past had a set of overflow/carry flags per ALU and did some
> sort of magic to make sure that the last state of those flags across multiple
> ALUs that might have been used in parallelizing work were always in the CPU's
> logical EFLAGS register, or the CPU has a buggy microcode that allowed two
> ALUs to operate on data at the same time in situations where they would
> potentially stomp on the carry/overflow flags of the other ALUs operations.
>
> It's my theory that all modern CPUs have this behavior fixed, probably via a
> microcode update, and so trying to do parallel ALU operations like this simply
> has no effect because the CPU (rightly so) serializes the operations to keep
> them from clobbering the overflow/carry flags of the other ALUs operations.
>
> My additional theory then is that the reason you see a slowdown from this
> patch is because the attempt to parallelize the ALU operation has caused
> us to write a series of instructions that, once serialized, are non-optimal
> and hinder smooth pipelining of the data (aka going 0*8, 2*8, 4*8, 6*8, 1*8,
> 3*8, 5*8, and 7*8 in terms of memory accesses is worse than doing them in
> order, and since we aren't getting the parallel operation we want, this
> is the net result of the patch).
>
> It would explain things anyway.
>
That does make sense, but it then raises the question: what's the advantage of
having multiple ALUs at all? If they're just going to serialize on the
updating of the condition register, there doesn't seem to be much benefit in
having them, especially if a common use case (parallelizing an
operation on a large linear dataset) results in lower performance.
/me wonders if rearranging the instructions into this order:
adcq 0*8(src), res1
adcq 1*8(src), res2
adcq 2*8(src), res1
would prevent pipeline stalls. That would be interesting data, and (I think)
support your theory, Doug. I'll give that a try
Neil
> /me wonders if rearranging the instructions into this order:
> adcq 0*8(src), res1
> adcq 1*8(src), res2
> adcq 2*8(src), res1
Those have to be sequenced.
Using a 64bit lea to add 32bit quantities should avoid the
dependencies on the flags register.
However you'd need to get 3 of those active to beat a 64bit adc.
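In plain C the flag-free form of that idea is just this sketch (assumes len is a multiple of 16 and x86-style unaligned 32-bit loads; since no carry is produced, the compiler is free to schedule the adds or use lea independently):

static unsigned long csum_32in64_sketch(const unsigned char *buff, unsigned len)
{
	const unsigned int *p = (const unsigned int *)buff;
	unsigned long s0 = 0, s1 = 0, s2 = 0, s3 = 0;

	while (len >= 16) {
		/* 32-bit values accumulated in 64-bit variables:
		 * plain add/lea, no carry chain to serialize on */
		s0 += p[0];
		s1 += p[1];
		s2 += p[2];
		s3 += p[3];
		p += 4;
		len -= 16;
	}
	/* the partial sums still need the usual 64->32->16 bit folds */
	return s0 + s1 + s2 + s3;
}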
David
On 10/30/2013 08:18 AM, David Laight wrote:
>> /me wonders if rearranging the instructions into this order:
>> adcq 0*8(src), res1
>> adcq 1*8(src), res2
>> adcq 2*8(src), res1
>
> Those have to be sequenced.
>
> Using a 64bit lea to add 32bit quantities should avoid the
> dependencies on the flags register.
> However you'd need to get 3 of those active to beat a 64bit adc.
>
> David
>
>
>
Already done (well, something similar to what you mention above anyway),
doesn't help (although doesn't hurt either, even though it doubles the
number of adds needed to complete the same work). This is the code I
tested:
#define ADDL_64 \
asm("xorq %%r8,%%r8\n\t" \
"xorq %%r9,%%r9\n\t" \
"xorq %%r10,%%r10\n\t" \
"xorq %%r11,%%r11\n\t" \
"movl 0*4(%[src]),%%r8d\n\t" \
"movl 1*4(%[src]),%%r9d\n\t" \
"movl 2*4(%[src]),%%r10d\n\t" \
"movl 3*4(%[src]),%%r11d\n\t" \
"addq %%r8,%[res1]\n\t" \
"addq %%r9,%[res2]\n\t" \
"addq %%r10,%[res3]\n\t" \
"addq %%r11,%[res4]\n\t" \
"movl 4*4(%[src]),%%r8d\n\t" \
"movl 5*4(%[src]),%%r9d\n\t" \
"movl 6*4(%[src]),%%r10d\n\t" \
"movl 7*4(%[src]),%%r11d\n\t" \
"addq %%r8,%[res1]\n\t" \
"addq %%r9,%[res2]\n\t" \
"addq %%r10,%[res3]\n\t" \
"addq %%r11,%[res4]\n\t" \
"movl 8*4(%[src]),%%r8d\n\t" \
"movl 9*4(%[src]),%%r9d\n\t" \
"movl 10*4(%[src]),%%r10d\n\t" \
"movl 11*4(%[src]),%%r11d\n\t" \
"addq %%r8,%[res1]\n\t" \
"addq %%r9,%[res2]\n\t" \
"addq %%r10,%[res3]\n\t" \
"addq %%r11,%[res4]\n\t" \
"movl 12*4(%[src]),%%r8d\n\t" \
"movl 13*4(%[src]),%%r9d\n\t" \
"movl 14*4(%[src]),%%r10d\n\t" \
"movl 15*4(%[src]),%%r11d\n\t" \
"addq %%r8,%[res1]\n\t" \
"addq %%r9,%[res2]\n\t" \
"addq %%r10,%[res3]\n\t" \
"addq %%r11,%[res4]" \
: [res1] "=r" (result1), \
[res2] "=r" (result2), \
[res3] "=r" (result3), \
[res4] "=r" (result4) \
: [src] "r" (buff), \
"[res1]" (result1), "[res2]" (result2), \
"[res3]" (result3), "[res4]" (result4) \
: "r8", "r9", "r10", "r11" )
On 10/30/2013 07:02 AM, Neil Horman wrote:
> That does makes sense, but it then begs the question, whats the advantage of
> having multiple alu's at all?
There's lots of ALU operations that don't operate on the flags or other
entities that can be run in parallel.
> If they're just going to serialize on the
> updating of the condition register, there doesn't seem to be much advantage in
> having multiple alu's at all, especially if a common use case (parallelizing an
> operation on a large linear dataset) resulted in lower performance.
>
> /me wonders if rearranging the instructions into this order:
> adcq 0*8(src), res1
> adcq 1*8(src), res2
> adcq 2*8(src), res1
>
> would prevent pipeline stalls. That would be interesting data, and (I think)
> support your theory, Doug. I'll give that a try
Just to avoid spending too much time on various combinations, here are
the methods I've tried:
Original code
2 chains doing interleaved memory accesses
2 chains doing serial memory accesses (as above)
4 chains doing serial memory accesses
4 chains using 32bit values in 64bit registers so you can always use add
instead of adc and never need the carry flag
And I've done all of the above with simple prefetch and smart prefetch.
In all cases, the result is basically that the add method doesn't matter
much in the grand scheme of things, but the prefetch does, and smart
prefetch always beat simple prefetch.
My simple prefetch was to just go into the main while() loop for the
csum operation and always prefetch 5*64 into the future.
My smart prefetch looks like this:
static inline void prefetch_line(unsigned long *cur_line,
unsigned long *end_line,
size_t size)
{
size_t fetched = 0;
while (*cur_line <= *end_line && fetched < size) {
prefetch((void *)*cur_line);
*cur_line += cache_line_size();
fetched += cache_line_size();
}
}
static unsigned do_csum(const unsigned char *buff, unsigned len)
{
...
unsigned long cur_line = (unsigned long)buff &
~(cache_line_size() - 1);
unsigned long end_line = ((unsigned long)buff + len) &
~(cache_line_size() - 1);
...
/* Don't bother to prefetch the first line, we'll end up stalling on
 * it anyway, but go ahead and start the prefetch on the next 3 */
cur_line += cache_line_size();
prefetch_line(&cur_line, &end_line, cache_line_size() * 3);
odd = 1 & (unsigned long) buff;
if (unlikely(odd)) {
result = *buff << 8;
...
count >>= 1; /* nr of 32-bit words.. */
/* prefetch line #4 ahead of main loop */
prefetch_line(&cur_line, &end_line, cache_line_size());
if (count) {
...
while (count64) {
/* we are now prefetching line #5 ahead of
 * where we are starting, and will stay 5
 * ahead throughout the loop, at least until
 * we get to the end line and then we'll stop
 * prefetching */
prefetch_line(&cur_line, &end_line, 64);
ADDL_64;
buff += 64;
count64--;
}
ADDL_64_FINISH;
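ADDL_64_FINISH isn't quoted in this mail; a plausible counterpart (a guess, not the actual macro) would simply fold the four partial sums into one:

#define ADDL_64_FINISH \
	asm("addq %[res2],%[res1]\n\t"	\
	    "adcq %[res3],%[res1]\n\t"	\
	    "adcq %[res4],%[res1]\n\t"	\
	    "adcq $0,%[res1]"		\
	    : [res1] "=r" (result1)	\
	    : [res2] "r" (result2),	\
	      [res3] "r" (result3),	\
	      [res4] "r" (result4),	\
	      "[res1]" (result1))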
I was going to tinker today and tomorrow with this function once I get a
toolchain that will compile it (I reinstalled all my rhel6 hosts as f20
and I'm hoping that does the trick, if not I need to do more work):
#define ADCXQ_64 \
asm("xorq %[res1],%[res1]\n\t" \
"adcxq 0*8(%[src]),%[res1]\n\t" \
"adoxq 1*8(%[src]),%[res2]\n\t" \
"adcxq 2*8(%[src]),%[res1]\n\t" \
"adoxq 3*8(%[src]),%[res2]\n\t" \
"adcxq 4*8(%[src]),%[res1]\n\t" \
"adoxq 5*8(%[src]),%[res2]\n\t" \
"adcxq 6*8(%[src]),%[res1]\n\t" \
"adoxq 7*8(%[src]),%[res2]\n\t" \
"adcxq %[zero],%[res1]\n\t" \
"adoxq %[zero],%[res2]\n\t" \
: [res1] "=r" (result1), \
[res2] "=r" (result2) \
: [src] "r" (buff), [zero] "r" (zero), \
"[res1]" (result1), "[res2]" (result2))
and then I also wanted to try using both xmm and ymm registers and doing
64bit adds with 32bit numbers across multiple xmm/ymm registers as that
should parallelize nicely. David, you mentioned you've tried this, how did
your experiment turn out and what was your method? I was planning on
doing regular full size loads into one xmm/ymm register, then using
pshufd/vshufd to move the data into two different registers, then
summing into a fourth register, and possibly running two of those
pipelines in parallel.
...
> and then I also wanted to try using both xmm and ymm registers and doing
> 64bit adds with 32bit numbers across multiple xmm/ymm registers as that
> should parallel nicely. David, you mentioned you've tried this, how did
> your experiment turn out and what was your method? I was planning on
> doing regular full size loads into one xmm/ymm register, then using
> pshufd/vshufd to move the data into two different registers, then
> summing into a fourth register, and possible running two of those
> pipelines in parallel.
It was a long time ago, and IIRC the code was just SSE so the
register length just wasn't going to give the required benefit.
I know I wrote the code, but I can't even remember whether I
actually got it working!
With the longer AVX words it might make enough difference.
Of course, this assumes that you have the fpu registers
available. If you have to do a fpu context switch it will
be a lot slower.
About the same time I did manage to get an open coded copy
loop to run as fast as 'rep movs' - and without any unrolling
or any prefetch instructions.
Thinking about AVX you should be able to do (without looking up the
actual mnemonics):
load
add 32bit chunks to sum
compare sum with read value (equiv of carry)
add/subtract compare result (0 or ~0) to a carry-sum register
That is 4 instructions for 256 bits, so you can aim for 4 clocks.
You'd need to check the cpu book to see if any of those can
be scheduled at the same time (if not dependent).
(and also whether there is any result delay - don't think so.)
I'd try running two copies of the above - probably skewed so that
the memory accesses are separated, do the memory read for the
next iteration, and use the 3rd instruction unit for loop control.
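As a rough rendering of that recipe with AVX2 intrinsics (a sketch only, untested; the unsigned 'carry' compare is done by flipping the top bit of both operands before a signed compare, which costs two extra xors over the 4-instruction count, and len is assumed to be a multiple of 32):

#include <immintrin.h>	/* AVX2 intrinsics */
#include <stdint.h>
#include <stddef.h>

static uint64_t csum_avx2_sketch(const unsigned char *buff, size_t len)
{
	const __m256i bias = _mm256_set1_epi32((int)0x80000000u);
	__m256i sum   = _mm256_setzero_si256();
	__m256i carry = _mm256_setzero_si256();	/* per-lane carry counts */
	uint32_t s[8], c[8];
	uint64_t total = 0;
	size_t i;
	int j;

	for (i = 0; i < len; i += 32) {
		__m256i v = _mm256_loadu_si256((const __m256i *)(buff + i));
		__m256i wrapped;

		sum = _mm256_add_epi32(sum, v);
		/* a lane carried iff (unsigned) new sum < value just added */
		wrapped = _mm256_cmpgt_epi32(_mm256_xor_si256(v, bias),
					     _mm256_xor_si256(sum, bias));
		/* wrapped lanes are all-ones (-1); subtracting adds 1 */
		carry = _mm256_sub_epi32(carry, wrapped);
	}

	_mm256_storeu_si256((__m256i *)s, sum);
	_mm256_storeu_si256((__m256i *)c, carry);
	for (j = 0; j < 8; j++)
		total += (uint64_t)s[j] + ((uint64_t)c[j] << 32);

	/* caller still folds 'total' down to 16 bits as usual */
	return total;
}

Whether the compare/xor pair really co-issues with the add is the "4 clocks" question above.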
David
On Wed, Oct 30, 2013 at 09:35:05AM -0400, Doug Ledford wrote:
> On 10/30/2013 07:02 AM, Neil Horman wrote:
>
> >That does makes sense, but it then begs the question, whats the advantage of
> >having multiple alu's at all?
>
> There's lots of ALU operations that don't operate on the flags or
> other entities that can be run in parallel.
>
> >If they're just going to serialize on the
> >updating of the condition register, there doesn't seem to be much advantage in
> >having multiple alu's at all, especially if a common use case (parallelizing an
> >operation on a large linear dataset) resulted in lower performance.
> >
> >/me wonders if rearranging the instructions into this order:
> >adcq 0*8(src), res1
> >adcq 1*8(src), res2
> >adcq 2*8(src), res1
> >
> >would prevent pipeline stalls. That would be interesting data, and (I think)
> >support your theory, Doug. I'll give that a try
>
> Just to avoid spending too much time on various combinations, here
> are the methods I've tried:
>
> Original code
> 2 chains doing interleaved memory accesses
> 2 chains doing serial memory accesses (as above)
> 4 chains doing serial memory accesses
> 4 chains using 32bit values in 64bit registers so you can always use
> add instead of adc and never need the carry flag
>
> And I've done all of the above with simple prefetch and smart prefetch.
>
Yup, I just tried the 2 chains doing interleaved access and came up with the
same results for both prefetch cases.
> In all cases, the result is basically that the add method doesn't
> matter much in the grand scheme of things, but the prefetch does,
> and smart prefetch always beat simple prefetch.
>
> My simple prefetch was to just go into the main while() loop for the
> csum operation and always prefetch 5*64 into the future.
>
> My smart prefetch looks like this:
>
> static inline void prefetch_line(unsigned long *cur_line,
> unsigned long *end_line,
> size_t size)
> {
> size_t fetched = 0;
>
> while (*cur_line <= *end_line && fetched < size) {
> prefetch((void *)*cur_line);
> *cur_line += cache_line_size();
> fetched += cache_line_size();
> }
> }
>
I've done this too, but I've come up with results that are very close to simple
prefetch.
> I was going to tinker today and tomorrow with this function once I
> get a toolchain that will compile it (I reinstalled all my rhel6
> hosts as f20 and I'm hoping that does the trick, if not I need to do
> more work):
>
> #define ADCXQ_64 \
> asm("xorq %[res1],%[res1]\n\t" \
> "adcxq 0*8(%[src]),%[res1]\n\t" \
> "adoxq 1*8(%[src]),%[res2]\n\t" \
> "adcxq 2*8(%[src]),%[res1]\n\t" \
> "adoxq 3*8(%[src]),%[res2]\n\t" \
> "adcxq 4*8(%[src]),%[res1]\n\t" \
> "adoxq 5*8(%[src]),%[res2]\n\t" \
> "adcxq 6*8(%[src]),%[res1]\n\t" \
> "adoxq 7*8(%[src]),%[res2]\n\t" \
> "adcxq %[zero],%[res1]\n\t" \
> "adoxq %[zero],%[res2]\n\t" \
> : [res1] "=r" (result1), \
> [res2] "=r" (result2) \
> : [src] "r" (buff), [zero] "r" (zero), \
> "[res1]" (result1), "[res2]" (result2))
>
I've tried using this method also (HPA suggested it early in the thread), but it's
not going to be useful for a while. The compiler supports it already, but
there's no hardware available with support for these instructions yet (at least
not that I have available).
Neil
* Neil Horman <[email protected]> wrote:
> > etc. For such short runtimes make sure the last column displays
> > close to 100%, so that the PMU results become trustable.
> >
> > A nehalem+ PMU will allow 2-4 events to be measured in parallel,
> > plus generics like 'cycles', 'instructions' can be added 'for free'
> > because they get counted in a separate (fixed purpose) PMU register.
> >
> > The last colum tells you what percentage of the runtime that
> > particular event was actually active. 100% (or empty last column)
> > means it was active all the time.
> >
> > Thanks,
> >
> > Ingo
> >
>
> Hmm,
>
> I ran this test:
>
> for i in `seq 0 1 3`
> do
> echo $i > /sys/module/csum_test/parameters/module_test_mode
> taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> done
You need to remove '-ddd' which is a shortcut for a ton of useful
events, but here you want to use fewer events, to increase the
precision of the measurement.
Thanks,
Ingo
On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
>
> * Neil Horman <[email protected]> wrote:
>
> > > etc. For such short runtimes make sure the last column displays
> > > close to 100%, so that the PMU results become trustable.
> > >
> > > A nehalem+ PMU will allow 2-4 events to be measured in parallel,
> > > plus generics like 'cycles', 'instructions' can be added 'for free'
> > > because they get counted in a separate (fixed purpose) PMU register.
> > >
> > > The last colum tells you what percentage of the runtime that
> > > particular event was actually active. 100% (or empty last column)
> > > means it was active all the time.
> > >
> > > Thanks,
> > >
> > > Ingo
> > >
> >
> > Hmm,
> >
> > I ran this test:
> >
> > for i in `seq 0 1 3`
> > do
> > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> > done
>
> You need to remove '-ddd' which is a shortcut for a ton of useful
> events, but here you want to use fewer events, to increase the
> precision of the measurement.
>
> Thanks,
>
> Ingo
>
Thank you Ingo, that fixed it. I'm trying some other variants of the csum
algorithm that Doug and I discussed last night, but FWIW, the relative
performance of the 4 test cases (base/prefetch/parallel/both) remains unchanged.
I'm starting to feel like, at this point, there's very little point in doing
parallel ALU operations (unless we can find a way to break the dependency on the
carry flag, which is what I'm tinkering with now).
Neil
On Wed, Oct 30, 2013 at 09:35:05AM -0400, Doug Ledford wrote:
> On 10/30/2013 07:02 AM, Neil Horman wrote:
>
> >That does makes sense, but it then begs the question, whats the advantage of
> >having multiple alu's at all?
>
> There's lots of ALU operations that don't operate on the flags or
> other entities that can be run in parallel.
>
> >If they're just going to serialize on the
> >updating of the condition register, there doesn't seem to be much advantage in
> >having multiple alu's at all, especially if a common use case (parallelizing an
> >operation on a large linear dataset) resulted in lower performance.
> >
> >/me wonders if rearranging the instructions into this order:
> >adcq 0*8(src), res1
> >adcq 1*8(src), res2
> >adcq 2*8(src), res1
> >
> >would prevent pipeline stalls. That would be interesting data, and (I think)
> >support your theory, Doug. I'll give that a try
>
> Just to avoid spending too much time on various combinations, here
> are the methods I've tried:
>
> Original code
> 2 chains doing interleaved memory accesses
> 2 chains doing serial memory accesses (as above)
> 4 chains doing serial memory accesses
> 4 chains using 32bit values in 64bit registers so you can always use
> add instead of adc and never need the carry flag
>
> And I've done all of the above with simple prefetch and smart prefetch.
>
>
So, above and beyond this I spent yesterday trying this pattern, something Doug
and I discussed together offline:
asm("prefetch 5*64(%[src])\n\t"
"addq 0*8(%[src]),%[res1]\n\t"
"jo 2f\n\t"
"incq %[cry]\n\t"
"2:addq 1*8(%[src]),%[res2]\n\t"
"jc 3f\n\t"
"incq %[cry]\n\t"
"3:addq 2*8(%[src]),%[res1]\n\t"
...
The hope was that by using the add instruction instead of adc, and alternately
testing the overflow and carry flags, I could break the serialization on the
flags register between subsequent adds and start doing things in parallel
(creating a poor man's adcx/adox in effect). It functions, but unfortunately
the performance lost to the completely broken branch prediction that this
inflicts makes it a non-starter:
Base Performance:
Performance counter stats for './test.sh' (20 runs):
48,143,372 L1-dcache-load-misses ( +- 0.03% ) [74.99%]
0 L1-dcache-prefetches [75.00%]
13,913,339,911 cycles # 0.000 GHz ( +- 0.06% ) [75.01%]
28,878,999 branch-misses ( +- 0.05% ) [75.00%]
5.367618727 seconds time elapsed ( +- 0.06% )
Prefetch and simluated adcx/adox from above:
Performance counter stats for './test.sh' (20 runs):
35,704,331 L1-dcache-load-misses ( +- 0.07% ) [75.00%]
0 L1-dcache-prefetches [75.00%]
19,751,409,264 cycles # 0.000 GHz ( +- 0.59% ) [75.00%]
34,850,056 branch-misses ( +- 1.29% ) [75.00%]
7.768602160 seconds time elapsed ( +- 1.38% )
With the above instruction changes the prefetching lowers our dcache miss rate
significantly, but greatly raises our branch miss rate, and absolutely kills our
cycle count and run time.
At this point I feel like this is dead in the water. I apologize for wasting
everyone's time. The best thing to do here would seem to be:
1) Add in some prefetching (from what I've seen a simple prefetch is as
performant as smart prefetching), so we may as well do it exactly as
csum_copy_from_user does it, and save ourselves the extra while loop (a rough
sketch of this follows below).
2) Revisit this when the AVX extensions, or the adcx/adox instructions, are
available and we can really perform parallel ALU ops here.
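For (1), the change amounts to a single line in the existing do_csum() inner loop, roughly (5*64 being the stride used in the experiments above, not a tuned constant):

	while (count64) {
		prefetch(buff + 5*64);	/* unconditional, fixed stride */
		/* the existing unrolled adcq chain stays exactly as it is */
		buff += 64;
		count64--;
	}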
Does that sound reasonable?
Neil