2008-06-02 05:01:59

by Yanmin Zhang

Subject: Re: [Bug #10638] sysbench+mysql(oltp, readonly) 30% regression with 2.6.26-rc1


On Fri, 2008-05-30 at 11:45 +0200, Ingo Molnar wrote:
> Yanmin,
>
> could you please check whether the performance regressions you noticed
> are now fixed in upstream -git? [make sure merge a7f75d3bed28 is
> included]
>
> i believe most of the regressions to 2.6.25 you found should be
> addressed - if not, please let me know which one is still hurting.
Most regressions are fixed.

I tested the latest git tree on a couple of machines. The results below are compared
against 2.6.25 unless otherwise noted.


1) sysbench+mysql(oltp, readonly) 30% regression with 2.6.26-rc1:
http://bugzilla.kernel.org/show_bug.cgi?id=10638
It's fixed completely.

2) volanoMark regression with kernel 2.6.26-rc1:
http://bugzilla.kernel.org/show_bug.cgi?id=10634
It's fixed completely.

3) hackbench regression with 2.6.26-rc2 on tulsa machine:
http://bugzilla.kernel.org/show_bug.cgi?id=10761
On the 16-thread tulsa machine, the hackbench result becomes 34 seconds. 2.6.26-rc2's
result is 40 seconds and 2.6.26-rc1's is 30 seconds, so there is a big improvement.
On another Montvale machine (which supports multi-threading,
but I don't turn it on in the BIOS), hackbench shows similar behavior.

4) aim7 regression with 2.6.26-rc1:
With Linus's patch, which was accepted into 2.6.26-rc2, most of the aim7 regression disappeared,
but about a 6% regression on the 16-core tigerton still remained. If I just apply Linus's patch
against 2.6.26-rc1, all of the aim7 regression disappears, so something else changed
in 2.6.26-rc2.
I retested aim7 against the latest git tree and all of the aim7 regression has disappeared.

5) Kbuild regression 3%~6% with 2.6.26-rc1:
I run kbuild in a loop of 25 or more iterations. On some machines, the testing script drops the
page cache at the beginning of every iteration; on other machines it doesn't drop caches. The
second method's result is stable, but the first one's isn't. The regression is with the second
method.
I didn't report it because bisect pointed at 2 groups of patches.
With the latest git, I retested kbuild and all of the regression with the second method disappears.
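
For reference, the cache-dropping variant of the loop looks roughly like this (the iteration
count, -j value and tree path are placeholders, not the exact test script):

    cd /path/to/linux-2.6                        # kernel tree under test (placeholder)
    for i in $(seq 1 25); do
        make clean > /dev/null
        sync
        echo 3 > /proc/sys/vm/drop_caches        # drop pagecache, dentries and inodes
        /usr/bin/time -o kbuild.$i.time make -j16 > /dev/null 2>&1
    done

The second method simply omits the sync/drop_caches step, so every iteration after the first
builds from a warm page cache.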



2008-06-04 11:20:27

by Ingo Molnar

Subject: Re: [Bug #10638] sysbench+mysql(oltp, readonly) 30% regression with 2.6.26-rc1


* Zhang, Yanmin <[email protected]> wrote:

>
> On Fri, 2008-05-30 at 11:45 +0200, Ingo Molnar wrote:
> > Yanmin,
> >
> > could you please check whether the performance regressions you
> > noticed are now fixed in upstream -git? [make sure merge
> > a7f75d3bed28 is included]
> >
> > i believe most of the regressions to 2.6.25 you found should be
> > addressed - if not, please let me know which one is still hurting.
>
> Most regressions are fixed.

great - thanks for the exhaustive testing! In fact there should be nice
speedups in most of the categories as well ;-)

out of the 5 issues, only one is inconclusive:

> On the 16-thread tulsa machine, the hackbench result becomes 34 seconds.
> 2.6.26-rc2's result is 40 seconds and 2.6.26-rc1's is 30 seconds, so
> there is a big improvement. On another Montvale machine (which supports
> multi-threading, but I don't turn it on in the BIOS), hackbench shows
> similar behavior.

okay, that's "hackbench 100", which creates a swarm of 2000 runnable
tasks and which is extremely sensitive to wakeup preemption details. It
is a volanomark work-alike, so if volanomark itself works fine (which it
does appear, from your other numbers) and this one regresses a bit, i'm
not sure there's anything fundamental to be worried about.

Quite likely you'll get more stable results if you run it all batched
(which such workload really should):

schedtool -B -e hackbench 100

right?

the 16-thread tulsa machine, how is it laid out physically: 2 sockets, 4
cores per socket, 2 threads per core?

Ingo

2008-06-05 02:39:11

by Yanmin Zhang

Subject: Re: [Bug #10638] sysbench+mysql(oltp, readonly) 30% regression with 2.6.26-rc1


On Wed, 2008-06-04 at 13:19 +0200, Ingo Molnar wrote:
> * Zhang, Yanmin <[email protected]> wrote:
>
> >
> > On Fri, 2008-05-30 at 11:45 +0200, Ingo Molnar wrote:
> > > Yanmin,
> > >
> > > could you please check whether the performance regressions you
> > > noticed are now fixed in upstream -git? [make sure merge
> > > a7f75d3bed28 is included]
> > >
> > > i believe most of the regressions to 2.6.25 you found should be
> > > addressed - if not, please let me know which one is still hurting.
> >
> > Most regressions are fixed.
>
> great - thanks for the exhaustive testing! In fact there should be nice
> speedups in most of the categories as well ;-)
>
> out of the 5 issues, only one is inconclusive:
>
> > On the 16-thread tulsa machine, the hackbench result becomes 34 seconds.
> > 2.6.26-rc2's result is 40 seconds and 2.6.26-rc1's is 30 seconds, so
> > there is a big improvement. On another Montvale machine (which supports
> > multi-threading, but I don't turn it on in the BIOS), hackbench shows
> > similar behavior.
>
> okay, that's "hackbench 100", which creates a swarm of 2000 runnable
> tasks and which is extremely sensitive to wakeup preemption details. It
> is a volanomark work-alike, so if volanomark itself works fine (which it
> does appear, from your other numbers) and this one regresses a bit, i'm
> not sure there's anything fundamental to be worried about.
One difference between volanoMark and hackbench is the CPU context switch rate.
The context switch rate looks stable when I run volanoMark, but doesn't look
stable with hackbench.

The run queue length is another difference. With volanoMark, the run queue is quite stable.
With hackbench, the run queue keeps decreasing, sometimes quickly, sometimes slowly.
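
For reference, both numbers are easy to watch while the benchmark runs:

    vmstat 1    # "r" column = run queue length, "cs" column = context switches per second

The cumulative context switch count is also exported as the "ctxt" line in /proc/stat.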

>
> Quite likely you'll get more stable results if you run it all batched
> (which such workload really should):
>
> schedtool -B -e hackbench 100
I tested it by running "#hackbench process 2000" with and without schedtool.

If I don't kill most background processes (services), the result is still not stable.
If I kill the background processes, the fluctuation is within 0.5 seconds
with or without schedtool. It looks like -B makes the result a little better, but
only by very little, about 1 second.
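
The comparison was of this form (the iteration count is illustrative; schedtool -B runs the
command under SCHED_BATCH):

    # default policy (SCHED_OTHER)
    for i in $(seq 1 10); do hackbench process 2000; done

    # batch policy (SCHED_BATCH)
    for i in $(seq 1 10); do schedtool -B -e hackbench process 2000; done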

>
> right?
>
> the 16-thread tulsa machine, how is it laid out physically: 2 sockets, 4
> cores per socket, 2 threads per core?
4 sockets, 2 cores per socket, 2 threads per core.
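
For completeness, the layout can be double-checked straight from sysfs (a generic sketch, not
specific to this box):

    for c in /sys/devices/system/cpu/cpu[0-9]*; do
        pkg=$(cat $c/topology/physical_package_id)     # socket
        core=$(cat $c/topology/core_id)                # core within the socket
        sib=$(cat $c/topology/thread_siblings_list)    # HT siblings sharing that core
        echo "$c: socket=$pkg core=$core siblings=$sib"
    done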