2009-11-03 03:45:56

by Alex Shi

Subject: UDP-U stream performance regression on 32-rc1 kernel

We found that the UDP-U 1k/4k stream tests of the netperf benchmark show a
performance regression of 10% to 20% on our Tulsa and some NHM
machines. Bisecting shows it is due to the following commit.

commit 840a0653100dbde599ae8ddf83fa214dfa5fd1aa
Author: Ingo Molnar <[email protected]>
Date: Fri Sep 4 11:32:54 2009 +0200

sched: Turn on SD_BALANCE_NEWIDLE

Start the re-tuning of the balancer by turning on newidle.

It improves hackbench performance and parallelism on a 4x4 box.
The "perf stat --repeat 10" measurements give us:

domain0 domain1
.......................................
-SD_BALANCE_NEWIDLE -SD_BALANCE_NEWIDLE:
2041.273208 task-clock-msecs # 9.354 CPUs ( +- 0.363% )

+SD_BALANCE_NEWIDLE -SD_BALANCE_NEWIDLE:
2086.326925 task-clock-msecs # 11.934 CPUs ( +- 0.301% )

+SD_BALANCE_NEWIDLE +SD_BALANCE_NEWIDLE:
2115.289791 task-clock-msecs # 12.158 CPUs ( +- 0.263% )

BRGs
Alex


2009-11-03 04:32:21

by Yanmin Zhang

Subject: Re: UDP-U stream performance regression on 32-rc1 kernel

On Tue, 2009-11-03 at 11:47 +0800, Alex Shi wrote:
> We found the UDP-U 1k/4k stream of netperf benchmark have some
> performance regression from 10% to 20% on our Tulsa and some NHM
> machines.

perf events shows that find_busiest_group consumes about 4.5% of cpu time
with the patch, while it consumes only 0.5% without the patch.

The communication between the netperf client and netserver is very fast.
When netserver receives a message and no new message is available,
it goes to sleep and the scheduler calls idle_balance => load_balance_newidle.
load_balance_newidle takes too long, and a new message arrives
before load_balance_newidle ends.

As the patch changelog says hackbench benefits from it, I tested hackbench
on Nehalem and Core 2 machines. hackbench does benefit from it, by about 6%, on
Nehalem machines, but doesn't benefit on Core 2 machines.

Yanmin

> Bisecting found it is due to the following commitment.
>
> commit 840a0653100dbde599ae8ddf83fa214dfa5fd1aa
> Author: Ingo Molnar <[email protected]>
> Date: Fri Sep 4 11:32:54 2009 +0200

2009-11-03 09:09:27

by Mike Galbraith

Subject: Re: UDP-U stream performance regression on 32-rc1 kernel

On Tue, 2009-11-03 at 12:33 +0800, Zhang, Yanmin wrote:
> On Tue, 2009-11-03 at 11:47 +0800, Alex Shi wrote:
> > We found the UDP-U 1k/4k stream of netperf benchmark have some
> > performance regression from 10% to 20% on our Tulsa and some NHM
> > machines.
> 
> perf events shows function find_busiest_group consumes about 4.5% cpu time
> with the patch while it only consumes 0.5% cpu time without the patch.
>
> The communication between netperf client and netserver is very fast.
> When netserver receives a message and there is no new message available,
> it goes to sleep and scheduler calls idle_balance => load_balance_newidle.
> load_balance_newidle spends too much time and a new message arrives quickly
> before load_balance_newidle ends.

I have a similar problem wrt ramp-up and affinity, so will certainly be
doing battle with the thing here.

It's harming mysql+oltp and pgsql+oltp ramp up, and modest load in
general by pulling at the first micro-sleep. After twiddling
wake_affine() to spread to a shared cache, newidle comes along and
throws a wrench into my plans an eyeblink later.

> As the comments in the patch say hackbench benefits from it, I tested hackbench
> on Nehalem and core2 machines. hackbench does benefit from it, about 6% on
> nehalem machines, but doesn't benefit on core2 machines.

It depends a lot on the load. I have a testcase which spawns threads at
a ~high rate. There, turning it off costs ~42% on my little Q6600 box.
It's also a modest utilization win for a kbuild.

In any case though, it certainly wants a couple fangs pulled.

-Mike

2009-11-03 17:45:36

by Ingo Molnar

Subject: Re: UDP-U stream performance regression on 32-rc1 kernel


* Zhang, Yanmin <[email protected]> wrote:

> On Tue, 2009-11-03 at 11:47 +0800, Alex Shi wrote:
> > We found the UDP-U 1k/4k stream of netperf benchmark have some
> > performance regression from 10% to 20% on our Tulsa and some NHM
> > machines.
>  perf events shows function find_busiest_group consumes about 4.5% cpu
> time with the patch while it only consumes 0.5% cpu time without the
> patch.
>
> The communication between netperf client and netserver is very fast.
> When netserver receives a message and there is no new message
> available, it goes to sleep and scheduler calls idle_balance =>
> load_balance_newidle. load_balance_newidle spends too much time and a
> new message arrives quickly before load_balance_newidle ends.
>
> As the comments in the patch say hackbench benefits from it, I tested
> hackbench on Nehalem and core2 machines. hackbench does benefit from
> it, about 6% on nehalem machines, but doesn't benefit on core2
> machines.

Can you confirm that -tip:

http://people.redhat.com/mingo/tip.git/README

has it fixed (or at least improved)?

Ingo

2009-11-04 01:54:56

by Yanmin Zhang

Subject: Re: UDP-U stream performance regression on 32-rc1 kernel

On Tue, 2009-11-03 at 18:45 +0100, Ingo Molnar wrote:
> * Zhang, Yanmin <[email protected]> wrote:
>
> > On Tue, 2009-11-03 at 11:47 +0800, Alex Shi wrote:
> > > We found the UDP-U 1k/4k stream of netperf benchmark have some
> > > performance regression from 10% to 20% on our Tulsa and some NHM
> > > machines.
> >  perf events shows function find_busiest_group consumes about 4.5% cpu
> > time with the patch while it only consumes 0.5% cpu time without the
> > patch.
> >
> > The communication between netperf client and netserver is very fast.
> > When netserver receives a message and there is no new message
> > available, it goes to sleep and scheduler calls idle_balance =>
> > load_balance_newidle. load_balance_newidle spends too much time and a
> > new message arrives quickly before load_balance_newidle ends.
> >
> > As the comments in the patch say hackbench benefits from it, I tested
> > hackbench on Nehalem and core2 machines. hackbench does benefit from
> > it, about 6% on nehalem machines, but doesn't benefit on core2
> > machines.
>
> Can you confirm that -tip:
>
> http://people.redhat.com/mingo/tip.git/README
>
> has it fixed (or at least improved)?
The latest -tip improves the netperf loopback result, but doesn't fix it
completely. For example, on a Nehalem machine, netperf UDP-U-1k has
about a 25% regression, but with the -tip kernel, the regression becomes
less than 10%.

yanmin

2009-11-04 12:07:41

by Mike Galbraith

Subject: Re: UDP-U stream performance regression on 32-rc1 kernel

On Wed, 2009-11-04 at 09:55 +0800, Zhang, Yanmin wrote:
> On Tue, 2009-11-03 at 18:45 +0100, Ingo Molnar wrote:
> > * Zhang, Yanmin <[email protected]> wrote:
> >
> > > On Tue, 2009-11-03 at 11:47 +0800, Alex Shi wrote:
> > > > We found the UDP-U 1k/4k stream of netperf benchmark have some
> > > > performance regression from 10% to 20% on our Tulsa and some NHM
> > > > machines.
> > >  perf events shows function find_busiest_group consumes about 4.5% cpu
> > > time with the patch while it only consumes 0.5% cpu time without the
> > > patch.
> > >
> > > The communication between netperf client and netserver is very fast.
> > > When netserver receives a message and there is no new message
> > > available, it goes to sleep and scheduler calls idle_balance =>
> > > load_balance_newidle. load_balance_newidle spends too much time and a
> > > new message arrives quickly before load_balance_newidle ends.
> > >
> > > As the comments in the patch say hackbench benefits from it, I tested
> > > hackbench on Nehalem and core2 machines. hackbench does benefit from
> > > it, about 6% on nehalem machines, but doesn't benefit on core2
> > > machines.
> >
> > Can you confirm that -tip:
> >
> > http://people.redhat.com/mingo/tip.git/README
> >
> > has it fixed (or at least improved)?
> The latest tips improves netperf loopback result, but doesn't fix it
> thoroughly. For example, on a Nehalem machine, netperf UDP-U-1k has
> about 25% regression, but with the tips kernel, the regression becomes
> less than 10%.

Can you try the below, and send me your UDP-U-1k args so I can try it?

The below shows promise for stopping newidle from harming cache, though
it needs to be more clever than a holdoff. The fact that it only harms
the _very_ sensitive to idle time x264 testcase by 5% shows some
promise.

tip v2.6.32-rc6-1731-gc5bb4b1
tbench 8   1044.66 MB/sec 8 procs
x264 8      366.58 frames/sec    -start_debit 392.24 fps    -newidle 215.34 fps

tip+ v2.6.32-rc6-1731-gc5bb4b1
tbench 8   1040.08 MB/sec 8 procs    .995
x264 8      350.23 frames/sec    -start_debit 371.76
              .955                              .947

mysql+oltp
clients          1         2         4         8        16        32        64       128       256
tip       10447.14  19734.58  36038.18  35776.85  34662.76  33682.30  32256.22  28770.99  25323.23
          10462.61  19580.14  36050.48  35942.63  35054.84  33988.40  32423.89  29259.65  25892.24
          10501.02  19231.27  36007.03  35985.32  35060.79  33945.47  32400.42  29140.84  25716.16
tip avg   10470.25  19515.33  36031.89  35901.60  34926.13  33872.05  32360.17  29057.16  25643.87

tip+      10594.32  19912.01  36320.45  35904.71  35100.37  34003.38  32453.04  28413.57  23871.22
          10667.96  20000.17  36533.72  36472.19  35371.35  34208.85  32617.80  28893.55  24499.34
          10463.25  19915.69  36657.20  36419.08  35403.15  34041.80  32612.94  28835.82  24323.52
tip+ avg  10575.17  19942.62  36503.79  36265.32  35291.62  34084.67  32561.26  28714.31  24231.36
             1.010     1.021     1.013     1.010     1.010     1.006     1.006      .988      .944


---
kernel/sched.c | 9 +++++++++
1 file changed, 9 insertions(+)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -590,6 +590,7 @@ struct rq {
 
 	u64 rt_avg;
 	u64 age_stamp;
+	u64 newidle_ratelimit;
 #endif
 
 	/* calc_load related fields */
@@ -2383,6 +2384,8 @@ static int try_to_wake_up(struct task_st
 	if (rq != orig_rq)
 		update_rq_clock(rq);
 
+	rq->newidle_ratelimit = rq->clock;
+
 	WARN_ON(p->state != TASK_WAKING);
 	cpu = task_cpu(p);
 
@@ -4427,6 +4430,12 @@ static void idle_balance(int this_cpu, s
 	struct sched_domain *sd;
 	int pulled_task = 0;
 	unsigned long next_balance = jiffies + HZ;
+	u64 delta = this_rq->clock - this_rq->newidle_ratelimit;
+
+	if (delta < sysctl_sched_migration_cost)
+		return;
+
+	this_rq->newidle_ratelimit = this_rq->clock;
 
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;



2009-11-05 02:19:56

by Yanmin Zhang

Subject: Re: UDP-U stream performance regression on 32-rc1 kernel

On Wed, 2009-11-04 at 13:07 +0100, Mike Galbraith wrote:
> On Wed, 2009-11-04 at 09:55 +0800, Zhang, Yanmin wrote:
> > On Tue, 2009-11-03 at 18:45 +0100, Ingo Molnar wrote:
> > > * Zhang, Yanmin <[email protected]> wrote:
> > >
> > > > On Tue, 2009-11-03 at 11:47 +0800, Alex Shi wrote:
> > > > > We found the UDP-U 1k/4k stream of netperf benchmark have some
> > > > > performance regression from 10% to 20% on our Tulsa and some NHM
> > > > > machines.
> > > >  perf events shows function find_busiest_group consumes about 4.5% cpu
> > > > time with the patch while it only consumes 0.5% cpu time without the
> > > > patch.
> > > >
> > > > The communication between netperf client and netserver is very fast.
> > > > When netserver receives a message and there is no new message
> > > > available, it goes to sleep and scheduler calls idle_balance =>
> > > > load_balance_newidle. load_balance_newidle spends too much time and a
> > > > new message arrives quickly before load_balance_newidle ends.
> > > >
> > > > As the comments in the patch say hackbench benefits from it, I tested
> > > > hackbench on Nehalem and core2 machines. hackbench does benefit from
> > > > it, about 6% on nehalem machines, but doesn't benefit on core2
> > > > machines.
> > >
> > > Can you confirm that -tip:
> > >
> > > http://people.redhat.com/mingo/tip.git/README
> > >
> > > has it fixed (or at least improved)?
> > The latest tips improves netperf loopback result, but doesn't fix it
> > thoroughly. For example, on a Nehalem machine, netperf UDP-U-1k has
> > about 25% regression, but with the tips kernel, the regression becomes
> > less than 10%.
>

> Can you try the below, and send me
I tested it on a Nehalem machine against the latest -tip kernel. The netperf loopback
result is good and the regression disappears.

The tbench result shows no improvement.

> your UDP-U-1k args so I can try it?
#taskset -c 0 ./netserver
#taskset -c 15 ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384,12888 -s 32768 -S 32768 -m 4096

Please check /proc/cpuinfo to make sure cpu 0 and cpu 15 are not in the
same physical cpu.

I also ran sysbench(oltp)+mysql testing with thread numbers 14,16,18,20,32,64,128. The average
number is good. If I compare every single result against 2.6.32-rc5's, I find the results for thread
numbers 14,16,18,20,32 are better than 2.6.32-rc5's, but the results for 64 and 128 are worse. 128's is
the worst.

>
>
> The below shows promise for stopping newidle from harming cache, though
> it needs to be more clever than a holdoff. The fact that it only harms
> the _very_ sensitive to idle time x264 testcase by 5% shows some
> promise.
>
> tip v2.6.32-rc6-1731-gc5bb4b1
> tbench 8 1044.66 MB/sec 8 procs
> x264 8 366.58 frames/sec -start_debit 392.24 fps -newidle 215.34 fps
>
> tip+ v2.6.32-rc6-1731-gc5bb4b1
> tbench 8 1040.08 MB/sec 8 procs .995
> x264 8 350.23 frames/sec -start_debit 371.76
> .955 .947
>
> mysql+oltp
> clients 1 2 4 8 16 32 64 128 256
> tip 10447.14 19734.58 36038.18 35776.85 34662.76 33682.30 32256.22 28770.99 25323.23
> 10462.61 19580.14 36050.48 35942.63 35054.84 33988.40 32423.89 29259.65 25892.24
> 10501.02 19231.27 36007.03 35985.32 35060.79 33945.47 32400.42 29140.84 25716.16
> tip avg 10470.25 19515.33 36031.89 35901.60 34926.13 33872.05 32360.17 29057.16 25643.87
>
> tip+ 10594.32 19912.01 36320.45 35904.71 35100.37 34003.38 32453.04 28413.57 23871.22
> 10667.96 20000.17 36533.72 36472.19 35371.35 34208.85 32617.80 28893.55 24499.34
> 10463.25 19915.69 36657.20 36419.08 35403.15 34041.80 32612.94 28835.82 24323.52
> tip+ avg 10575.17 19942.62 36503.79 36265.32 35291.62 34084.67 32561.26 28714.31 24231.36
> 1.010 1.021 1.013 1.010 1.010 1.006 1.006 .988 .944
>
>
> ---
> kernel/sched.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -590,6 +590,7 @@ struct rq {
>
> u64 rt_avg;
> u64 age_stamp;
> + u64 newidle_ratelimit;
> #endif
>
> /* calc_load related fields */
> @@ -2383,6 +2384,8 @@ static int try_to_wake_up(struct task_st
> if (rq != orig_rq)
> update_rq_clock(rq);
>
> + rq->newidle_ratelimit = rq->clock;
> +
> WARN_ON(p->state != TASK_WAKING);
> cpu = task_cpu(p);
>
> @@ -4427,6 +4430,12 @@ static void idle_balance(int this_cpu, s
> struct sched_domain *sd;
> int pulled_task = 0;
> unsigned long next_balance = jiffies + HZ;
> + u64 delta = this_rq->clock - this_rq->newidle_ratelimit;
> +
> + if (delta < sysctl_sched_migration_cost)
> + return;
> +
> + this_rq->newidle_ratelimit = this_rq->clock;
>
> for_each_domain(this_cpu, sd) {
> unsigned long interval;
>
>
>
>

2009-11-05 05:20:18

by Mike Galbraith

Subject: Re: UDP-U stream performance regression on 32-rc1 kernel

On Thu, 2009-11-05 at 10:20 +0800, Zhang, Yanmin wrote:
> On Wed, 2009-11-04 at 13:07 +0100, Mike Galbraith wrote:

> > Can you try the below, and send me
> I tested it on Nehalem machine against the latest tips kernel. netperf loopback
> result is good and regression disappears.

Excellent. Ingo has picked up a version in tip (1b9508f) which has zero
negative effect on my x264 testcase, and is a win for mysql+oltp through
the whole test spectrum. As that may (dunno, Ingo?) now be considered a
regression fix, ie candidate for 32.final, testing that it does no harm
to your big machines would be a good thing. (pretty please?:)

> tbench result has no improvement.

Can you remind me where we stand on tbench?

> > your UDP-U-1k args so I can try it?
> #taskset -c 0 ./netserver
> #taskset -c 15 ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384,12888 -s 32768 -S 32768 -m 4096
>
> Pls. check /proc/cpuinfo to make sure cpu 0 and cpu 15 are not in the
> same physical cpu.

Thanks. My little box doesn't have a 15 (darn) so 0,3 will have to do.

> I also run sysbench(oltp)+mysql testing with thread number 14,16,18,20,32,64,128. The average
> number is good. If I compare every single result against 2.6.32-rc5's, I find thread number
> 14,16,18,20,32's result are better than 2.6.32-rc5's, but 64,128's result are worse. 128's is
> the worst.

Hm. That's disconcerting. However, that patch isn't going anywhere but
to the bitwolf anyway (diagnostic). If 1b9508f regresses, that will be
a problem. With diag, my box also regressed at the tail. Balancing a
bit seems to help mysql once it starts tripping all over itself, it
improves the decay curve markedly. 1b9508f does brief bursts of newidle
balancing when idle time climbs, which translated to a ~6% improvement
at 256 clients on my little quad.

-Mike

2009-11-05 07:03:23

by Mike Galbraith

Subject: Re: UDP-U stream performance regression on 32-rc1 kernel

On Thu, 2009-11-05 at 06:20 +0100, Mike Galbraith wrote:
> On Thu, 2009-11-05 at 10:20 +0800, Zhang, Yanmin wrote:
> > On Wed, 2009-11-04 at 13:07 +0100, Mike Galbraith wrote:
>
> > > Can you try the below, and send me
> > I tested it on Nehalem machine against the latest tips kernel. netperf loopback
> > result is good and regression disappears.
>
> Excellent. Ingo has picked up a version in tip (1b9508f) which has zero
> negative effect on my x264 testcase, and is a win for mysql+oltp through
> the whole test spectrum. As that may (dunno, Ingo?) now be considered a
> regression fix, ie candidate for 32.final, testing that it does no harm
> to your big machines would be a good thing. (pretty please?:)

Egad. Size XXL difference on my cheap Q6600 box

git v2.6.32-rc6-26-g91d3f9b
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

 65536    4096   60.00     7793073      0    4256.06
 65536           60.00     7780487           4249.18

git v2.6.32-rc6-26-g91d3f9b + 1b9508f
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

 65536    4096   60.00    15133547      0    8264.93
 65536           60.00    15131466           8263.80

tip v2.6.32-rc6-1796-gd995f1d
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

 65536    4096   60.00    13998562      0    7645.08
 65536           60.00    13986112           7638.28 (uhoh, tinker time.)

-Mike

2009-11-05 07:43:34

by Yanmin Zhang

Subject: Re: UDP-U stream performance regression on 32-rc1 kernel

On Thu, 2009-11-05 at 06:20 +0100, Mike Galbraith wrote:
> On Thu, 2009-11-05 at 10:20 +0800, Zhang, Yanmin wrote:
> > On Wed, 2009-11-04 at 13:07 +0100, Mike Galbraith wrote:
>
> > > Can you try the below, and send me
> > I tested it on Nehalem machine against the latest tips kernel. netperf loopback
> > result is good and regression disappears.
>
> Excellent. Ingo has picked up a version in tip (1b9508f) which has zero
> negative effect on my x264 testcase, and is a win for mysql+oltp through
> the whole test spectrum. As that may (dunno, Ingo?) now be considered a
> regression fix, ie candidate for 32.final, testing that it does no harm
> to your big machines would be a good thing. (pretty please?:)
I tested the latest -tip kernel, which includes commit 1b9508f.
Compared with 2.6.31, netperf loopback UDP-U-4k has about a 2% regression.

The sysbench(oltp)+mysql result is pretty good, about 2% better than
2.6.31's.



>
> > tbench result has no improvement.
>
> Can you remind me where we stand on tbench?
I run tbench by starting CPU_NUM*2 tbench clients without cpu binding.
Compared with 2.6.31, tbench has about a 6% regression with 2.6.32-rc1 on Nehalem.
Mostly, it's caused by SD_PREFER_LOCAL, and Peter has already disabled the flag for
the MC and cpu domains. Your patch disables it for the node domain.
With the current -tip kernel, tbench has about a 3% regression on one Nehalem, and
less than 1% on another Nehalem.

With the pure 2.6.32-rc6 kernel, the tbench result has about a 3~6% regression on Nehalem
compared with 2.6.32-rc5's. So some patches in -tip haven't been merged
upstream.

>
> > > your UDP-U-1k args so I can try it?
> > #taskset -c 0 ./netserver
> > #taskset -c 15 ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384,12888 -s 32768 -S 32768 -m 4096
> >
> > Pls. check /proc/cpuinfo to make sure cpu 0 and cpu 15 are not in the
> > same physical cpu.
>
> Thanks. My little box doesn't have a 15 (darn) so 0,3 will have to do.
Sorry. I copied it from the output of "ps -ef", so a couple of ',' were lost. The correct netperf command
line is:
#taskset -c 15 ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50,3 -I 99,5 -- -P 12384,12888 -s 32768 -S 32768 -m 4096


>
> > I also run sysbench(oltp)+mysql testing with thread number 14,16,18,20,32,64,128. The average
> > number is good. If I compare every single result against 2.6.32-rc5's, I find thread number
> > 14,16,18,20,32's result are better than 2.6.32-rc5's, but 64,128's result are worse. 128's is
> > the worst.
>
> Hm. That's disconcerting. However, that patch isn't going anywhere but
> to the bitwolf anyway (diagnostic). If 1b9508f regresses, that will be
> a problem. With diag, my box also regressed at the tail. Balancing a
> bit seems to help mysql once it starts tripping all over itself, it
> improves the decay curve markedly. 1b9508f does brief bursts of newidle
> balancing when idle time climbs, which translated to a ~6% improvement
> at 256 clients on my little quad.
>
> -Mike
>

2009-11-05 08:10:57

by Mike Galbraith

Subject: Re: UDP-U stream performance regression on 32-rc1 kernel

On Thu, 2009-11-05 at 15:44 +0800, Zhang, Yanmin wrote:
> On Thu, 2009-11-05 at 06:20 +0100, Mike Galbraith wrote:
> > On Thu, 2009-11-05 at 10:20 +0800, Zhang, Yanmin wrote:
> > > On Wed, 2009-11-04 at 13:07 +0100, Mike Galbraith wrote:
> >
> > > > Can you try the below, and send me
> > > I tested it on Nehalem machine against the latest tips kernel. netperf loopback
> > > result is good and regression disappears.
> >
> > Excellent. Ingo has picked up a version in tip (1b9508f) which has zero
> > negative effect on my x264 testcase, and is a win for mysql+oltp through
> > the whole test spectrum. As that may (dunno, Ingo?) now be considered a
> > regression fix, ie candidate for 32.final, testing that it does no harm
> > to your big machines would be a good thing. (pretty please?:)
> I tested the latest tips kernel which includes commit 1b9508f.
> Comparing with 2.6.31, netperf loopback UDP-U-4k has about 2% regression.

Ok, thanks for testing. That could well be a1f84a3, that needs a bit of
fiddling.

> sysbench(oltp)+mysql result is pretty good, about 2% improvement than
> 2.6.31's.

Cool, a progression for a change :)

> > > tbench result has no improvement.
> >
> > Can you remind me where we stand on tbench?
> I run tbench by starting CPU_NUM*2 tbench clients without cpu binding.
> Comparing with 2.6.31, tbench has about 6% regression with 2.6.31-rc1 on Nehalem.
> Mostly, it's caused by SD_PREFER_LOCAL and Peter already disables the flag for
> MC and cpu domains. Your patch disables it for node domain.
> With the current tips kernel, tbench has about 3% regression on 1 nahalem, and
> less than 1% on another Nehalem.

Ok, we're not looking too bad, but still something there to go after.

> With pure 2.6.32-rc6 kernel, tbench result has about 3~6% regression on Nehalem
> , comparing with 2.6.32-rc5's. So some patches in tips haven't been merged into
> upstream.

> > > > your UDP-U-1k args so I can try it?
> > > #taskset -c 0 ./netserver
> > > #taskset -c 15 ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384,12888 -s 32768 -S 32768 -m 4096
> > >
> > > Pls. check /proc/cpuinfo to make sure cpu 0 and cpu 15 are not in the
> > > same physical cpu.
> >
> > Thanks. My little box doesn't have a 15 (darn) so 0,3 will have to do.
> Sorry. I copy it from the output of "ps -ef", so a couple of ',' are lost. The right netperf command
> line is:
> #taskset -c 15 ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50,3 -I 99,5 -- -P 12384,12888 -s 32768 -S 32768 -m 4096

Thanks. (-i and -I have always given me trouble on my little boxen. I
usually just let it do its thing without them, and repeat a lot;)

-Mike

2009-11-05 08:57:23

by Mike Galbraith

Subject: Re: UDP-U stream performance regression on 32-rc1 kernel

On Thu, 2009-11-05 at 08:03 +0100, Mike Galbraith wrote:
> On Thu, 2009-11-05 at 06:20 +0100, Mike Galbraith wrote:
> > On Thu, 2009-11-05 at 10:20 +0800, Zhang, Yanmin wrote:
> > > On Wed, 2009-11-04 at 13:07 +0100, Mike Galbraith wrote:
> >
> > > > Can you try the below, and send me
> > > I tested it on Nehalem machine against the latest tips kernel. netperf loopback
> > > result is good and regression disappears.
> >
> > Excellent. Ingo has picked up a version in tip (1b9508f) which has zero
> > negative effect on my x264 testcase, and is a win for mysql+oltp through
> > the whole test spectrum. As that may (dunno, Ingo?) now be considered a
> > regression fix, ie candidate for 32.final, testing that it does no harm
> > to your big machines would be a good thing. (pretty please?:)
>
> Egad. Size XXL difference on my cheap Q6600 box

Ingo, ignore my "eek!" reaction (for now).

I'm trying to test the pull lineup with as many benchmarks as I can fit
in, but methinks I'm screwing this one (unfamiliar) up :-/

> git v2.6.32-rc6-26-g91d3f9b
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
>  65536    4096   60.00     7793073      0    4256.06
>  65536           60.00     7780487           4249.18
>
> git v2.6.32-rc6-26-g91d3f9b + 1b9508f
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
>  65536    4096   60.00    15133547      0    8264.93
>  65536           60.00    15131466           8263.80
>
> tip v2.6.32-rc6-1796-gd995f1d
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
>  65536    4096   60.00    13998562      0    7645.08
>  65536           60.00    13986112           7638.28 (uhoh, tinker time.)
>
> -Mike