2004-10-06 00:43:23

by Chen, Kenneth W

Subject: Default cache_hot_time value back to 10ms

Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
> We have experimented with similar thing, via bumping up sd->cache_hot_time to
> a very large number, like 1 sec. What we measured was a equally low throughput.
> But that was because of not enough load balancing.

Since we are talking about load balancing, we decided to measure various
values for the cache_hot_time variable to see how it affects app performance.
We first established a baseline number with the vanilla base kernel (default
at 2.5ms), then swept that variable up to 1000ms. All of the experiments were
done with Ingo's patch posted earlier. Here are the results (test environment:
4-way SMP machine, 32 GB memory, 500 disks, running an industry-standard db
transaction processing workload):

cache_hot_time | workload throughput
--------------------------------------
2.5ms - 100.0 (0% idle)
5ms - 106.0 (0% idle)
10ms - 112.5 (1% idle)
15ms - 111.6 (3% idle)
25ms - 111.1 (5% idle)
250ms - 105.6 (7% idle)
1000ms - 105.4 (7% idle)

Clearly the default value for SMP has the worst application throughput (12%
below peak performance). When set too low, the kernel is too aggressive on
load balancing and we still see cache thrashing despite the perf fix.
However, if set too high, the kernel gets too conservative and does not do
enough load balancing.

This value defaulted to 10ms before the domain scheduler; why did the domain
scheduler need to change it to 2.5ms? And on what basis was that decision
made? We propose changing that number back to 10ms.

Signed-off-by: Ken Chen <[email protected]>

--- linux-2.6.9-rc3/kernel/sched.c.orig 2004-10-05 17:37:21.000000000 -0700
+++ linux-2.6.9-rc3/kernel/sched.c 2004-10-05 17:38:02.000000000 -0700
@@ -387,7 +387,7 @@ struct sched_domain {
.max_interval = 4, \
.busy_factor = 64, \
.imbalance_pct = 125, \
- .cache_hot_time = (5*1000000/2), \
+ .cache_hot_time = (10*1000000), \
.cache_nice_tries = 1, \
.per_cpu_gain = 100, \
.flags = SD_BALANCE_NEWIDLE \
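
For context on what this knob controls: cache_hot_time is in nanoseconds
(so the hunk above changes 2.5ms to 10ms), and the load balancer consults it
when deciding whether a task is still "cache hot" and should be left where
it is. A rough sketch of the 2.6-era check, paraphrased from kernel/sched.c
(the exact form may differ between versions):

/*
 * A task that last ran less than cache_hot_time nanoseconds ago is
 * considered cache-hot and is a bad candidate for migration:
 */
#define task_hot(p, now, sd) \
	((long long)((now) - (p)->timestamp) < (long long)((sd)->cache_hot_time))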



2004-10-06 00:48:14

by Con Kolivas

Subject: Re: Default cache_hot_time value back to 10ms

Chen, Kenneth W writes:

> Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
>> We have experimented with similar thing, via bumping up sd->cache_hot_time to
>> a very large number, like 1 sec. What we measured was a equally low throughput.
>> But that was because of not enough load balancing.
>
> Since we are talking about load balancing, we decided to measure various
> value for cache_hot_time variable to see how it affects app performance. We
> first establish baseline number with vanilla base kernel (default at 2.5ms),
> then sweep that variable up to 1000ms. All of the experiments are done with
> Ingo's patch posted earlier. Here are the result (test environment is 4-way
> SMP machine, 32 GB memory, 500 disks running industry standard db transaction
> processing workload):
>
> cache_hot_time | workload throughput
> --------------------------------------
> 2.5ms - 100.0 (0% idle)
> 5ms - 106.0 (0% idle)
> 10ms - 112.5 (1% idle)
> 15ms - 111.6 (3% idle)
> 25ms - 111.1 (5% idle)
> 250ms - 105.6 (7% idle)
> 1000ms - 105.4 (7% idle)
>
> Clearly the default value for SMP has the worst application throughput (12%
> below peak performance). When set too low, kernel is too aggressive on load
> balancing and we are still seeing cache thrashing despite the perf fix.
> However, If set too high, kernel gets too conservative and not doing enough
> load balance.
>
> This value was default to 10ms before domain scheduler, why does domain
> scheduler need to change it to 2.5ms? And on what bases does that decision
> take place? We are proposing change that number back to 10ms.

Should it not be based on the cache flush time? We already measure that and
set cache_decay_ticks accordingly, so cache_hot_time could be based on it.
What is the cache_decay_ticks value reported in the dmesg of your hardware?

Cheers,
Con

2004-10-06 00:59:15

by Nick Piggin

Subject: Re: Default cache_hot_time value back to 10ms

Chen, Kenneth W wrote:
> Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
>
>>We have experimented with similar thing, via bumping up sd->cache_hot_time to
>>a very large number, like 1 sec. What we measured was a equally low throughput.
>>But that was because of not enough load balancing.
>
>
> Since we are talking about load balancing, we decided to measure various
> value for cache_hot_time variable to see how it affects app performance. We
> first establish baseline number with vanilla base kernel (default at 2.5ms),
> then sweep that variable up to 1000ms. All of the experiments are done with
> Ingo's patch posted earlier. Here are the result (test environment is 4-way
> SMP machine, 32 GB memory, 500 disks running industry standard db transaction
> processing workload):
>
> cache_hot_time | workload throughput
> --------------------------------------
> 2.5ms - 100.0 (0% idle)
> 5ms - 106.0 (0% idle)
> 10ms - 112.5 (1% idle)
> 15ms - 111.6 (3% idle)
> 25ms - 111.1 (5% idle)
> 250ms - 105.6 (7% idle)
> 1000ms - 105.4 (7% idle)
>
> Clearly the default value for SMP has the worst application throughput (12%
> below peak performance). When set too low, kernel is too aggressive on load
> balancing and we are still seeing cache thrashing despite the perf fix.
> However, If set too high, kernel gets too conservative and not doing enough
> load balance.
>

Great testing, thanks.

> This value was default to 10ms before domain scheduler, why does domain
> scheduler need to change it to 2.5ms? And on what bases does that decision
> take place? We are proposing change that number back to 10ms.
>

IIRC Ingo wanted it lower, to more closely match previous values (correct
me if I'm wrong).

I think your patch would be fine though (when timeslicing tasks on
the same CPU, I've typically seen large regressions when going below
a 10ms timeslice, even on a small-cache CPU (512K)).

2004-10-06 01:02:59

by Nick Piggin

Subject: Re: Default cache_hot_time value back to 10ms

Con Kolivas wrote:

> Should it not be based on the cache flush time? We measure that and set
> the cache_decay_ticks and can base it on that. What is the
> cache_decay_ticks value reported in the dmesg of your hardware?
>

It should be, but the cache_decay_ticks calculation is so crude that I
preferred to use a fixed value to reduce the variation between different
setups.
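
For reference, that calculation is a fixed-bandwidth model; the relevant
lines from arch/i386/kernel/smpboot.c (they also appear verbatim in the
patch later in this thread) boil down to:

/* Guess how long flushing the L2 takes, using a hard-coded bandwidth
 * constant (not a measured one), then round up to whole ticks: */
cacheflush_time = (cpu_khz >> 10) * (cachesize << 10) / bandwidth;
cache_decay_ticks = (long)cacheflush_time / cpu_khz + 1;

Since bandwidth is an assumed constant, two machines with the same cache
size get the same decay estimate no matter how fast their memory actually
is.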

I once experimented with attempting to figure out memory bandwidth based
on reading an uncached page. That might be the way to go.

2004-10-06 03:57:15

by Andrew Morton

Subject: Re: Default cache_hot_time value back to 10ms

"Chen, Kenneth W" <[email protected]> wrote:
>
> This value was default to 10ms before domain scheduler, why does domain
> scheduler need to change it to 2.5ms? And on what bases does that decision
> take place? We are proposing change that number back to 10ms.

It sounds like this needs to be runtime tunable?
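
A minimal sketch of what such a knob could look like with the 2.6-era
sysctl interface (illustrative only: the variable name and the ctl_name
number are made up, and something would still have to copy the new value
into each sched_domain's cache_hot_time for a write to take effect):

/* Hypothetical /proc/sys/kernel/cache_hot_time tunable, value in ns. */
static int sysctl_cache_hot_time = 10 * 1000000;

static ctl_table cache_hot_table[] = {
	{
		.ctl_name	= 1234,		/* made-up ctl number */
		.procname	= "cache_hot_time",
		.data		= &sysctl_cache_hot_time,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec,
	},
	{ .ctl_name = 0 }
};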

2004-10-06 04:30:22

by Nick Piggin

Subject: Re: Default cache_hot_time value back to 10ms

Andrew Morton wrote:
> "Chen, Kenneth W" <[email protected]> wrote:
>
>>This value was default to 10ms before domain scheduler, why does domain
>> scheduler need to change it to 2.5ms? And on what bases does that decision
>> take place? We are proposing change that number back to 10ms.
>
>
> It sounds like this needs to be runtime tunable?
>

I'd say it is probably too low level to be a useful tunable (although
for testing I guess so... but then you could have *lots* of parameters
tunable).

I don't think there was a really good reason why this value is 2.5ms.

2004-10-06 04:53:15

by Andrew Morton

Subject: Re: Default cache_hot_time value back to 10ms

Nick Piggin <[email protected]> wrote:
>
> Andrew Morton wrote:
> > "Chen, Kenneth W" <[email protected]> wrote:
> >
> >>This value was default to 10ms before domain scheduler, why does domain
> >> scheduler need to change it to 2.5ms? And on what bases does that decision
> >> take place? We are proposing change that number back to 10ms.
> >
> >
> > It sounds like this needs to be runtime tunable?
> >
>
> I'd say it is probably too low level to be a useful tunable (although
> for testing I guess so... but then you could have *lots* of parameters
> tunable).

This tunable caused an 11% performance difference in (I assume) TPCx.
That's a big deal, and people will want to diddle it.

If one number works optimally for all machines and workloads then fine.

But yes, avoiding a tunable would be nice, but we need a tunable to work
out whether we can avoid making it tunable ;)

Not that I'm soliciting patches or anything. I'll duck this one for now.

2004-10-06 05:00:18

by Nick Piggin

Subject: Re: Default cache_hot_time value back to 10ms

Andrew Morton wrote:

> This tunable caused an 11% performance difference in (I assume) TPCx.
> That's a big deal, and people will want to diddle it.
>

True. But 2.5 I think really is too low (for anyone, except maybe a
CPU with no L2 cache, or a tiny one).

> If one number works optimally for all machines and workloads then fine.
>

Yeah.. 10ms may bring up idle times a bit on other workloads. Judith
had some database tests that were very sensitive to this - if 10ms is
OK there, then I'd say it would be OK for most things.

> But yes, avoiding a tunable would be nice, but we need a tunable to work
> out whether we can avoid making it tunable ;)
>

Heh. I think it would be good to have an automatic thingy to tune it.
A smarter cache_decay_ticks calculation would suit.

> Not that I'm soliciting patches or anything. I'll duck this one for now.
>

OK. Any idea when 2.6.9 will be coming out? :)

2004-10-06 05:11:51

by Andrew Morton

Subject: Re: Default cache_hot_time value back to 10ms

Nick Piggin <[email protected]> wrote:
>
> Any idea when 2.6.9 will be coming out?

Before -mm hits 1000 patches, I hope.

2.6.8 wasn't really super-stable, and our main tool for getting the quality
up is to stretch the release times, giving us time to shake things out. The
release time is largely driven by perceptions of current stability, bug
report rates, etc.

A current guess would be -rc4 later this week, 2.6.9 late next week. We'll
see.

One way of advancing that is to get down and work on bugs in current -linus
tree, yes?

If this still doesn't seem to be working out and if 2.6.9 isn't as good as
we'd like I'll consider shutting down -mm completely once we hit -rc2 so
people have nothing else to do apart from fix bugs in, and test -linus.
We'll see.

2004-10-06 05:21:33

by Nick Piggin

Subject: Re: Default cache_hot_time value back to 10ms

Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>Any idea when 2.6.9 will be coming out?
>
>
> Before -mm hits 1000 patches, I hope.
>
> 2.6.8 wasn't really super-stable and our main tool for getting the quality
> is to stretch the release times, give us time to shake things out. The
> release time is largely driven by perceptions of current stability, bug
> report rates, etc.
>
> A current guess would be -rc4 later this week, 2.6.9 late next week. We'll
> see.
>
> One way of advancing that is to get down and work on bugs in current -linus
> tree, yes?
>
> If this still doesn't seem to be working out and if 2.6.9 isn't as good as
> we'd like I'll consider shutting down -mm completely once we hit -rc2 so
> people have nothing else to do apart from fix bugs in, and test -linus.
> We'll see.
>

OK thanks for the explanation.

Any thoughts about making -rc's into -pre's, and doing real -rc's?
It would have caught the NFS bug that made 2.6.8.1, and probably
the cd burning problems... Or is Linus' patching finger just too
itchy?

2004-10-06 05:35:09

by Andrew Morton

Subject: Re: Default cache_hot_time value back to 10ms

Nick Piggin <[email protected]> wrote:
>
> Any thoughts about making -rc's into -pre's, and doing real -rc's?

I think what we have is OK. The idea is that once 2.6.9 is released we
merge up all the well-tested code which is sitting in various trees and has
been under test for a few weeks. As soon as all that well-tested code is
merged, we go into -rc. So we're pipelining the development of 2.6.10 code
with the stabilisation of 2.6.9.

If someone goes and develops *new* code after the release of, say, 2.6.9
then tough tittie, it's too late for 2.6.9: we don't want new code - we
want old-n-tested code. So your typed-in-after-2.6.9 code goes into
2.6.11.

That's the theory anyway. If it means that it takes a long time to get
code into the kernel.org tree, well, that's a cost. That latency may be
high but the bandwidth is pretty good.

There are exceptions of course. Completely new
drivers/filesystems/architectures can go in any old time because they won't
break existing setups. Although I do tend to hold back on even these in
the (probably overoptimistic) hope that people will then concentrate on
mainline bug fixing and testing.

> It would have caught the NFS bug that made 2.6.8.1, and probably
> the cd burning problems... Or is Linus' patching finger just too
> itchy?

uh, let's say that incident was "proof by counter example".

2004-10-06 05:47:10

by Nick Piggin

Subject: Re: Default cache_hot_time value back to 10ms

Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>Any thoughts about making -rc's into -pre's, and doing real -rc's?
>
>
> I think what we have is OK. The idea is that once 2.6.9 is released we
> merge up all the well-tested code which is sitting in various trees and has
> been under test for a few weeks. As soon as all that well-tested code is
> merged, we go into -rc. So we're pipelining the development of 2.6.10 code
> with the stabilisation of 2.6.9.
>
> If someone goes and develops *new* code after the release of, say, 2.6.9
> then tough tittie, it's too late for 2.6.9: we don't want new code - we
> want old-n-tested code. So your typed-in-after-2.6.9 code goes into
> 2.6.11.
>
> That's the theory anyway. If it means that it takes a long time to get
> code into the kernel.org tree, well, that's a cost. That latency may be
> high but the bandwidth is pretty good.
>
> There are exceptions of course. Completely new
> drivers/filesystems/architectures can go in any old time becasue they won't
> break existing setups. Although I do tend to hold back on even these in
> the (probably overoptimistic) hope that people will then concentrate on
> mainline bug fixing and testing.
>
>
>> It would have caught the NFS bug that made 2.6.8.1, and probably
>> the cd burning problems... Or is Linus' patching finger just too
>> itchy?
>
>
> uh, let's say that incident was "proof by counter example".
>

Heh :)

OK I agree on all these points. And yeah it has worked quite well...

But by real -rc, I mean 2.6.9 is a week after 2.6.9-rcx minus the
extraversion string; nothing more.

The main point (for me, at least) is that if -rc1 comes out, and I'm
still working on some bug or having something else tested then I can
hurry up and/or send you and Linus a polite email saying don't release
yet.

Would probably be a help for people running automated testing and
regression tests, etc. And just generally increase the userbase a
little bit.

Catching the odd paper bag bug would be a fringe benefit.

2004-10-06 05:53:31

by Chen, Kenneth W

Subject: RE: Default cache_hot_time value back to 10ms

Andrew Morton wrote on Tuesday, October 05, 2004 9:51 PM
> > > It sounds like this needs to be runtime tunable?
> > >
> >
> > I'd say it is probably too low level to be a useful tunable (although
> > for testing I guess so... but then you could have *lots* of parameters
> > tunable).
>
> This tunable caused an 11% performance difference in (I assume) TPCx.
> That's a big deal, and people will want to diddle it.
>
> If one number works optimally for all machines and workloads then fine.
>
> But yes, avoiding a tunable would be nice, but we need a tunable to work
> out whether we can avoid making it tunable ;)

Just to throw in some more benchmark numbers: we measured that specjbb
throughput went up by about 0.3% with cache_hot_time set to 10ms compared
to the default 2.5ms. No measurable speedup/regression on volanomark (we
just tried 10 and 2.5ms).

- Ken


2004-10-06 06:22:42

by Jeff Garzik

Subject: new dev model (was Re: Default cache_hot_time value back to 10ms)

Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>Any thoughts about making -rc's into -pre's, and doing real -rc's?
>
>
> I think what we have is OK. The idea is that once 2.6.9 is released we
> merge up all the well-tested code which is sitting in various trees and has
> been under test for a few weeks. As soon as all that well-tested code is
> merged, we go into -rc. So we're pipelining the development of 2.6.10 code
> with the stabilisation of 2.6.9.
>
> If someone goes and develops *new* code after the release of, say, 2.6.9
> then tough tittie, it's too late for 2.6.9: we don't want new code - we
> want old-n-tested code. So your typed-in-after-2.6.9 code goes into
> 2.6.11.
>
> That's the theory anyway. If it means that it takes a long time to get

This is damned frustrating :( Reality is _far_ divorced from what you
just described.

Major developers such as David and Al don't have trees that see wide
testing; their code only sees wide testing once it hits mainline. See
this message from David,
http://marc.theaimsgroup.com/?l=linux-netdev&m=109648930728731&w=2

In particular, I think David's point about -mm being perceived as overly
experimental is fair.

Recent experience seems to directly counter the assertion that only
well-tested code is landing in mainline, and it's not hard to pick
through the -rc changelogs to find non-trivial, non-bugfix modifications
to existing code. My own experience with netdev-2.6 bears this out as
well: I have several personal examples of bugs sitting in netdev (and
thus -mm) for quite a while, only being noticed when the code hits mainline.

Linus's assertion that "calling it -rc means developers should calm
down" (implying we should start concentrating on bug fixing rather than
more-fun stuff) is equally fanciful.

Why is it so hard to say "only bugfixes"?

The _reality_ is that there is _no_ point in time where you and Linus
allow for stabilization of the main tree prior to release. The release
criteria have devolved to a point where we call it done when the stack of
pancakes gets too high.

Ground control to Major Tom?

Jeff


2004-10-06 06:41:59

by Andrew Morton

Subject: Re: new dev model (was Re: Default cache_hot_time value back to 10ms)

Jeff Garzik <[email protected]> wrote:
>
> Andrew Morton wrote:
> > Nick Piggin <[email protected]> wrote:
> >
> >>Any thoughts about making -rc's into -pre's, and doing real -rc's?
> >
> >
> > I think what we have is OK. The idea is that once 2.6.9 is released we
> > merge up all the well-tested code which is sitting in various trees and has
> > been under test for a few weeks. As soon as all that well-tested code is
> > merged, we go into -rc. So we're pipelining the development of 2.6.10 code
> > with the stabilisation of 2.6.9.
> >
> > If someone goes and develops *new* code after the release of, say, 2.6.9
> > then tough tittie, it's too late for 2.6.9: we don't want new code - we
> > want old-n-tested code. So your typed-in-after-2.6.9 code goes into
> > 2.6.11.
> >
> > That's the theory anyway. If it means that it takes a long time to get
>
> This is damned frustrating :( Reality is _far_ divorced from what you
> just described.

s/far/a bit/

> Major developers such as David and Al don't have trees that see wide
> testing, their code only sees wide testing once it hits mainline. See
> this message from David,
> http://marc.theaimsgroup.com/?l=linux-netdev&m=109648930728731&w=2
>

Yes, networking has been an exception. I think this has been acceptable
thus far because historically networking has tended to work better than
other parts of the kernel. Although the fib_hash stuff was a bit of a
fiasco.

> In particular, I think David's point about -mm being perceived as overly
> experimental is fair.

I agree - -mm breaks too often. You wouldn't believe the crap people throw
at me :(. But a lot of problems get fixed this way too.

> Recent experience seems to directly counter the assertion that only
> well-tested code is landing in mainline, and it's not hard to pick
> through the -rc changelogs to find non-trivial, non-bugfix modifications
> to existing code.

Once we hit -rc2 we shouldn't be doing that.

> My own experience with netdev-2.6 bears this out as
> well: I have several personal examples of bugs sitting in netdev (and
> thus -mm) for quite a while, only being noticed when the code hits mainline.

yes, I've had a couple of those. Not too many, fortunately. But having
bugs leak in mainline is OK - we expect that. As long as it wasn't late in
the cycle. If it was late in the cycle then, well,
bad-call-won't-do-that-again.

> Linus's assertion that "calling it -rc means developers should calm
> down" (implying we should start concentrating on bug fixing rather than
> more-fun stuff) is equally fanciful.
>
> Why is it so hard to say "only bugfixes"?

(It's not "only bugfixes". It's "only bugfixes, completely new stuff and
documentation/comment fixes".)

But yes. When you see this please name names and thwap people.

> The _reality_ is that there is _no_ point in time where you and Linus
> allow for stabilization of the main tree prior to relesae. The release
> criteria has devolved to a point where we call it done when the stack of
> pancakes gets too high.

That's simply wrong.

For instance, 2.6.8-rc1-mm1-series had 252 patches. I'm now sitting on 726
patches. That's 500 patches which are either non-bugfixes or minor
bugfixes which are held back. The various bk tree maintainers do the same
thing.

2004-10-06 07:46:44

by Ingo Molnar

Subject: Re: Default cache_hot_time value back to 10ms


* Chen, Kenneth W <[email protected]> wrote:

> Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
> > We have experimented with similar thing, via bumping up sd->cache_hot_time to
> > a very large number, like 1 sec. What we measured was a equally low throughput.
> > But that was because of not enough load balancing.
>
> Since we are talking about load balancing, we decided to measure various
> value for cache_hot_time variable to see how it affects app performance. We
> first establish baseline number with vanilla base kernel (default at 2.5ms),
> then sweep that variable up to 1000ms. All of the experiments are done with
> Ingo's patch posted earlier. Here are the result (test environment is 4-way
> SMP machine, 32 GB memory, 500 disks running industry standard db transaction
> processing workload):
>
> cache_hot_time | workload throughput
> --------------------------------------
> 2.5ms - 100.0 (0% idle)
> 5ms - 106.0 (0% idle)
> 10ms - 112.5 (1% idle)
> 15ms - 111.6 (3% idle)
> 25ms - 111.1 (5% idle)
> 250ms - 105.6 (7% idle)
> 1000ms - 105.4 (7% idle)
>
> Clearly the default value for SMP has the worst application throughput (12%
> below peak performance). When set too low, kernel is too aggressive on load
> balancing and we are still seeing cache thrashing despite the perf fix.
> However, If set too high, kernel gets too conservative and not doing enough
> load balance.

could you please try the test in 1 msec increments around 10 msec? It
would be very nice to find a good formula and the 5 msec steps are too
coarse. I think it would be nice to test 7,9,11,13 msecs first, and then
the remaining 1 msec slots around the new maximum. (assuming the
workload measurement is stable.)

> This value was default to 10ms before domain scheduler, why does domain
> scheduler need to change it to 2.5ms? And on what bases does that decision
> take place? We are proposing change that number back to 10ms.

agreed. What value does cache_decay_ticks have on your box?

>
> Signed-off-by: Ken Chen <[email protected]>

Signed-off-by: Ingo Molnar <[email protected]>

Ingo

2004-10-06 08:57:09

by Paolo Ciarrocchi

Subject: Re: new dev model (was Re: Default cache_hot_time value back to 10ms)

On Tue, 5 Oct 2004 23:39:58 -0700, Andrew Morton <[email protected]> wrote:
> Jeff Garzik <[email protected]> wrote:
> >
> > Andrew Morton wrote:
> > > Nick Piggin <[email protected]> wrote:
> > >
> > >>Any thoughts about making -rc's into -pre's, and doing real -rc's?
> > >
> > >
> > > I think what we have is OK. The idea is that once 2.6.9 is released we
> > > merge up all the well-tested code which is sitting in various trees and has
> > > been under test for a few weeks. As soon as all that well-tested code is
> > > merged, we go into -rc. So we're pipelining the development of 2.6.10 code
> > > with the stabilisation of 2.6.9.
> > >
> > > If someone goes and develops *new* code after the release of, say, 2.6.9
> > > then tough tittie, it's too late for 2.6.9: we don't want new code - we
> > > want old-n-tested code. So your typed-in-after-2.6.9 code goes into
> > > 2.6.11.
> > >
> > > That's the theory anyway. If it means that it takes a long time to get
> >
> > This is damned frustrating :( Reality is _far_ divorced from what you
> > just described.
>
> s/far/a bit/

True, just a bit. But the -pre/-rc thing is pretty confusing.

> > Major developers such as David and Al don't have trees that see wide
> > testing, their code only sees wide testing once it hits mainline. See
> > this message from David,
> > http://marc.theaimsgroup.com/?l=linux-netdev&m=109648930728731&w=2
> >
>
> Yes, networking has been an exception. I think this has been acceptable
> thus far because historically networking has tended to work better than
> other parts of the kernel. Although the fib_hash stuff was a bit of a
> fiasco.
>
> > In particular, I think David's point about -mm being perceived as overly
> > experimental is fair.
>
> I agree - -mm breaks too often. You wouldn't believe the crap people throw
> at me :(. But a lot of problems get fixed this way too.

Again, true.
But it's hard to understand why we have 'exceptions' to the dev model.
I still think that the dev model should be made official and all the
developers should follow such rules.

> > Recent experience seems to directly counter the assertion that only
> > well-tested code is landing in mainline, and it's not hard to pick
> > through the -rc changelogs to find non-trivial, non-bugfix modifications
> > to existing code.
>
> Once we hit -rc2 we shouldn't be doing that.
>
> > My own experience with netdev-2.6 bears this out as
> > well: I have several personal examples of bugs sitting in netdev (and
> > thus -mm) for quite a while, only being noticed when the code hits mainline.
>
> yes, I've had a couple of those. Not too many, fortunately. But having
> bugs leak in mainline is OK - we expect that. As long as it wasn't late in
> the cycle. If it was late in the cycle then, well,
> bad-call-won't-do-that-again.
>
> > Linus's assertion that "calling it -rc means developers should calm
> > down" (implying we should start concentrating on bug fixing rather than
> > more-fun stuff) is equally fanciful.
> >
> > Why is it so hard to say "only bugfixes"?
>
> (It's not "only bugfixes". It's "only bugfixes, completely new stuff and
> documentation/comment fixes).
>
> But yes. When you see this please name names and thwap people.
>
> > The _reality_ is that there is _no_ point in time where you and Linus
> > allow for stabilization of the main tree prior to relesae. The release
> > criteria has devolved to a point where we call it done when the stack of
> > pancakes gets too high.
>
> That's simply wrong.
>
> For instance, 2.6.8-rc1-mm1-series had 252 patches. I'm now sitting on 726
> patches. That's 500 patches which are either non-bugfixes or minor
> bugfixes which are held back. The various bk tree maintainers do the same
> thing.

I really think that:
- Linus should start making -pre releases and then one (or a couple,
if needed) -rc candidates
- all the patches should go in -mm before landing in -pre
- maybe, try to match a few quality "goals"?



--
Paolo
Personal home page: http://www.ciarrocchi.tk
See my photos: http://paolociarrocchi.fotopic.net/
Buy cool stuff here: http://www.cafepress.com/paoloc

2004-10-06 09:24:15

by Ingo Molnar

Subject: Re: new dev model (was Re: Default cache_hot_time value back to 10ms)


On Wed, 6 Oct 2004, Jeff Garzik wrote:

> The _reality_ is that there is _no_ point in time where you and Linus
> allow for stabilization of the main tree prior to relesae. [...]

i dont think this is fair to Andrew - there's hundreds of patches in his
tree that are scheduled for 2.6.10 not 2.6.9.

you are right that -mm is experimental, but the latency of bugfixes is the
lowest i've ever seen in any Linux tree, which is quite amazing
considering the hundreds of patches.

it is also correct that the pile of patches in the -mm tree mask the QA
effects of testing done on -mm, so testing -BK separately is just as
important at this stage.

Maybe it would help perception and awareness-of-release a bit if at this
stage Andrew switched the -mm tree to the -BK tree and truly only kept
those patches that are destined for BK for 2.6.9. [i.e. if the current
patch-series would be cut off at patch #3 or so, but the numbering of
-rc3-mm3 would be kept.] This can only be done if the changes from now to
2.6.9-real are small enough in that they dont impact those 700 patches too
much.

This switching would immediately expose all -mm users to the current state
of affairs of the -BK tree. (yes, people could try the -BK tree just as
much but it seems -mm is used by developers quite often and it would help
if the two trees would be largely equivalent so close to the release.)

Ingo

2004-10-06 09:44:40

by bert hubert

Subject: Re: new dev model (was Re: Default cache_hot_time value back to 10ms)

On Tue, Oct 05, 2004 at 11:39:58PM -0700, Andrew Morton wrote:

> I agree - -mm breaks too often. You wouldn't believe the crap people throw
> at me :(. But a lot of problems get fixed this way too.

Mainline is suffering too - lots of people I know running 2.6 on production
systems have noted a marked increase in problems, crashes, odd things.

I'd bet you get a lot of people who'd vote for a timeout right now to figure
out what's going wrong.

There is the distinct impression that we are going downhill in this series.
My personal feeling is that this trend started almost immediately after OLS.

I can try to gather the general reports I hear from people - it might well
be that we are not reporting the bugs properly. I'm sitting on a couple of
odd things myself that need to be written up.

Thanks.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2004-10-06 09:58:02

by Paolo Ciarrocchi

Subject: Re: new dev model (was Re: Default cache_hot_time value back to 10ms)

On Wed, 6 Oct 2004 05:23:29 -0400 (EDT), Ingo Molnar <[email protected]> wrote:
>
> On Wed, 6 Oct 2004, Jeff Garzik wrote:
>
> > The _reality_ is that there is _no_ point in time where you and Linus
> > allow for stabilization of the main tree prior to relesae. [...]
>
> i dont think this is fair to Andrew - there's hundreds of patches in his
> tree that are scheduled for 2.6.10 not 2.6.9.

Andrew is doing an amazing job. He's really an impressive hacker.

> you are right that -mm is experimental, but the latency of bugfixes is the
> lowest i've ever seen in any Linux tree, which is quite amazing
> considering the hundreds of patches.

Just my humble opinion:
I think that's because Andrew and Linus are working very well together;
I'm not sure it's because of the new dev model.
It seems to me that there is room for improvement.

> it is also correct that the pile of patches in the -mm tree mask the QA
> effects of testing done on -mm, so testing -BK separately is just as
> important at this stage.
>
> Maybe it would help perception and awareness-of-release a bit if at this
> stage Andrew switched the -mm tree to the -BK tree and truly only kept
> those patches that are destined for BK for 2.6.9. [i.e. if the current
> patch-series would be cut off at patch #3 or so, but the numbering of
> -rc3-mm3 would be keept.] This can only be done if the changes from now to
> 2.6.9-real are small enough in that they dont impact those 700 patches too
> much.
>
> This switching would immediately expose all -mm users to the current state
> of affairs of the -BK tree. (yes, people could try the -BK tree just as
> much but it seems -mm is used by developers quite often and it would help
> if the two trees would be largely equivalent so close to the release.)

Good idea.

--
Paolo
Personal home page: http://www.ciarrocchi.tk
See my photos: http://paolociarrocchi.fotopic.net/
Buy cool stuff here: http://www.cafepress.com/paoloc

2004-10-06 13:29:19

by Ingo Molnar

Subject: [patch] sched: auto-tuning task-migration


* Chen, Kenneth W <[email protected]> wrote:

> Since we are talking about load balancing, we decided to measure
> various value for cache_hot_time variable to see how it affects app
> performance. We first establish baseline number with vanilla base
> kernel (default at 2.5ms), then sweep that variable up to 1000ms. All
> of the experiments are done with Ingo's patch posted earlier. Here
> are the result (test environment is 4-way SMP machine, 32 GB memory,
> 500 disks running industry standard db transaction processing
> workload):
>
> cache_hot_time | workload throughput
> --------------------------------------
> 2.5ms - 100.0 (0% idle)
> 5ms - 106.0 (0% idle)
> 10ms - 112.5 (1% idle)
> 15ms - 111.6 (3% idle)
> 25ms - 111.1 (5% idle)
> 250ms - 105.6 (7% idle)
> 1000ms - 105.4 (7% idle)

the following patch adds a new feature to the scheduler: during bootup
it measures migration costs and sets up cache_hot value accordingly.

The measurement is point-to-point, i.e. it can be used to measure the
migration costs in cache hierarchies - e.g. by NUMA setup code. The
patch prints out a matrix of migration costs between CPUs.
(self-migration means pure cache dirtying cost)

Here are a couple of matrices from test systems:

A 2-way Celeron/128K box:

arch cache_decay_nsec: 1000000
migration cost matrix (cache_size: 131072, cpu: 467 MHz):
[00] [01]
[00]: 9.6 12.0
[01]: 12.2 9.8
min_delta: 12586890
using cache_decay nsec: 12586890 (12 msec)

a 2-way/4-way P4/512K HT box:

arch cache_decay_nsec: 2000000
migration cost matrix (cache_size: 524288, cpu: 2379 MHz):
[00] [01] [02] [03]
[00]: 6.1 6.1 5.7 6.1
[01]: 6.7 6.2 6.7 6.2
[02]: 5.9 5.9 6.1 5.0
[03]: 6.7 6.2 6.7 6.2
min_delta: 6053016
using cache_decay nsec: 6053016 (5 msec)

an 8-way P3/2MB Xeon box:

arch cache_decay_nsec: 6000000
migration cost matrix (cache_size: 2097152, cpu: 700 MHz):
[00] [01] [02] [03] [04] [05] [06] [07]
[00]: 92.1 184.8 184.8 184.8 184.9 90.7 90.6 90.7
[01]: 181.3 92.7 88.5 88.6 88.5 181.5 181.3 181.4
[02]: 181.4 88.4 92.5 88.4 88.5 181.4 181.3 181.4
[03]: 181.4 88.4 88.5 92.5 88.4 181.5 181.2 181.4
[04]: 181.4 88.5 88.4 88.4 92.5 181.5 181.3 181.5
[05]: 87.2 181.5 181.4 181.5 181.4 90.0 87.0 87.1
[06]: 87.2 181.5 181.4 181.5 181.4 87.9 90.0 87.1
[07]: 87.2 181.5 181.4 181.5 181.4 87.9 87.0 90.0
min_delta: 91815564
using cache_decay nsec: 91815564 (87 msec)

(btw., this matrix shows nicely the 0,5,6,7/1,2,3,4 grouping of quads in
this semi-NUMA 8-way box.)

could you try this patch on your testbox and send me the bootlog? How
close does this method get us to the 10 msec value you measured to be
close to the best value? The patch is against 2.6.9-rc3 + the last
cache_hot fixpatch you tried.

the patch contains comments that explain how migration costs are
measured.

(NOTE: sched_cache_size is only filled in for x86 at the moment, so if
you have another architecture then please add those two lines to that
architecture's smpboot.c.)

this is only the first release of the patch - obviously we cannot print
such a matrix for 1024 CPUs. But this should be good enough for testing
purposes.

Ingo

--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -388,7 +388,7 @@ struct sched_domain {
.max_interval = 4, \
.busy_factor = 64, \
.imbalance_pct = 125, \
- .cache_hot_time = (5*1000000/2), \
+ .cache_hot_time = cache_decay_nsec, \
.cache_nice_tries = 1, \
.per_cpu_gain = 100, \
.flags = SD_BALANCE_NEWIDLE \
@@ -410,7 +410,7 @@ struct sched_domain {
.max_interval = 32, \
.busy_factor = 32, \
.imbalance_pct = 125, \
- .cache_hot_time = (10*1000000), \
+ .cache_hot_time = cache_decay_nsec, \
.cache_nice_tries = 1, \
.per_cpu_gain = 100, \
.flags = SD_BALANCE_EXEC \
@@ -4420,11 +4420,233 @@ __init static void init_sched_build_grou
last->next = first;
}

-__init static void arch_init_sched_domains(void)
+/*
+ * Task migration cost measurement between source and target CPUs.
+ *
+ * This is done by measuring the worst-case cost. Here are the
+ * steps that are taken:
+ *
+ * 1) the source CPU dirties its L2 cache with a shared buffer
+ * 2) the target CPU dirties its L2 cache with a local buffer
+ * 3) the target CPU dirties the shared buffer
+ *
+ * We measure the time step #3 takes - this is the cost of migrating
+ * a cache-hot task that has a large, dirty dataset in the L2 cache,
+ * to another CPU.
+ */
+
+
+/*
+ * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
+ * is the operation that is timed, so we try to generate unpredictable
+ * cachemisses that still end up filling the L2 cache:
+ */
+static void fill_cache(void *__cache, unsigned long __size)
{
+ unsigned long size = __size/sizeof(long);
+ unsigned long *cache = __cache;
+ unsigned long data = 0xdeadbeef;
int i;
+
+ for (i = 0; i < size/4; i++) {
+ if ((i & 3) == 0)
+ cache[i] = data;
+ if ((i & 3) == 1)
+ cache[size-1-i] = data;
+ if ((i & 3) == 2)
+ cache[size/2-i] = data;
+ if ((i & 3) == 3)
+ cache[size/2+i] = data;
+ }
+}
+
+struct flush_data {
+ unsigned long source, target;
+ void (*fn)(void *, unsigned long);
+ void *cache;
+ void *local_cache;
+ unsigned long size;
+ unsigned long long delta;
+};
+
+/*
+ * Dirty L2 on the source CPU:
+ */
+static void source_handler(void *__data)
+{
+ struct flush_data *data = __data;
+
+ if (smp_processor_id() != data->source)
+ return;
+
+ memset(data->cache, 0, data->size);
+}
+
+/*
+ * Dirty the L2 cache on this CPU and then access the shared
+ * buffer. (which represents the working set of the migrated task.)
+ */
+static void target_handler(void *__data)
+{
+ struct flush_data *data = __data;
+ unsigned long long t0, t1;
+ unsigned long flags;
+
+ if (smp_processor_id() != data->target)
+ return;
+
+ memset(data->local_cache, 0, data->size);
+ local_irq_save(flags);
+ t0 = sched_clock();
+ fill_cache(data->cache, data->size);
+ t1 = sched_clock();
+ local_irq_restore(flags);
+
+ data->delta = t1 - t0;
+}
+
+/*
+ * Measure the cache-cost of one task migration:
+ */
+static unsigned long long measure_one(void *cache, unsigned long size,
+ int source, int target)
+{
+ struct flush_data data;
+ unsigned long flags;
+ void *local_cache;
+
+ local_cache = vmalloc(size);
+ if (!local_cache) {
+ printk("couldnt allocate local cache ...\n");
+ return 0;
+ }
+ memset(local_cache, 0, size);
+
+ local_irq_save(flags);
+ local_irq_enable();
+
+ data.source = source;
+ data.target = target;
+ data.size = size;
+ data.cache = cache;
+ data.local_cache = local_cache;
+
+ if (on_each_cpu(source_handler, &data, 1, 1) != 0) {
+ printk("measure_one: timed out waiting for other CPUs\n");
+ local_irq_restore(flags);
+ return -1;
+ }
+ if (on_each_cpu(target_handler, &data, 1, 1) != 0) {
+ printk("measure_one: timed out waiting for other CPUs\n");
+ local_irq_restore(flags);
+ return -1;
+ }
+
+ vfree(local_cache);
+
+ return data.delta;
+}
+
+unsigned long sched_cache_size;
+
+/*
+ * Measure a series of task migrations and return the maximum
+ * result - the worst-case. Since this code runs early during
+ * bootup the system is 'undisturbed' and the maximum latency
+ * makes sense.
+ *
+ * As the working set we use 1.66 times the L2 cache size, this is
+ * chosen in such a nonsymmetric way so that fill_cache() doesnt
+ * iterate at power-of-2 boundaries (which might hit cache mapping
+ * artifacts and pessimise the results).
+ */
+static __init unsigned long long measure_cacheflush_time(int cpu1, int cpu2)
+{
+ unsigned long size = sched_cache_size*5/3;
+ unsigned long long delta, max = 0;
+ void *cache;
+ int i;
+
+ if (!size) {
+ printk("arch has not set cachesize - using default.\n");
+ return 0;
+ }
+ if (!cpu_online(cpu1) || !cpu_online(cpu2)) {
+ printk("cpu %d and %d not both online!\n", cpu1, cpu2);
+ return 0;
+ }
+ cache = vmalloc(size);
+ if (!cache) {
+ printk("could not vmalloc %ld bytes for cache!\n", size);
+ return 0;
+ }
+ memset(cache, 0, size);
+ for (i = 0; i < 20; i++) {
+ delta = measure_one(cache, size, cpu1, cpu2);
+ if (delta > max)
+ max = delta;
+ }
+
+ vfree(cache);
+
+ /*
+ * A task is considered 'cache cold' if at least 10 times
+ * the cost of migration has passed. I.e. in the rare and
+ * absolutely worst-case we should see a 10% degradation
+ * due to migration. (this limit is only listened to if the
+ * load-balancing situation is 'nice' - if there is a large
+ * imbalance we ignore it for the sake of CPU utilization and
+ * processing fairness.)
+ *
+ * (We use 5/3 times the L2 cachesize in our measurement,
+ * hence factor 6 here: 10 == 6*5/3.)
+ */
+ return max * 6;
+}
+
+static unsigned long long cache_decay_nsec;
+
+__init static void arch_init_sched_domains(void)
+{
+ int i, cpu1 = -1, cpu2 = -1;
+ unsigned long long min_delta = -1ULL;
+
cpumask_t cpu_default_map;

+ printk("arch cache_decay_nsec: %ld\n", cache_decay_ticks*1000000);
+ printk("migration cost matrix (cache_size: %ld, cpu: %ld MHz):\n",
+ sched_cache_size, cpu_khz/1000);
+ printk(" ");
+ for (cpu1 = 0; cpu1 < NR_CPUS; cpu1++) {
+ if (!cpu_online(cpu1))
+ continue;
+ printk(" [%02d]", cpu1);
+ }
+ printk("\n");
+ for (cpu1 = 0; cpu1 < NR_CPUS; cpu1++) {
+ if (!cpu_online(cpu1))
+ continue;
+ printk("[%02d]: ", cpu1);
+ for (cpu2 = 0; cpu2 < NR_CPUS; cpu2++) {
+ unsigned long long delta;
+
+ if (!cpu_online(cpu2))
+ continue;
+ delta = measure_cacheflush_time(cpu1, cpu2);
+
+ printk(" %3Ld.%ld", delta >> 20,
+ (((long)delta >> 10) / 102) % 10);
+ if ((cpu1 != cpu2) && (delta < min_delta))
+ min_delta = delta;
+ }
+ printk("\n");
+ }
+ printk("min_delta: %Ld\n", min_delta);
+ if (min_delta != -1ULL)
+ cache_decay_nsec = min_delta;
+ printk("using cache_decay nsec: %Ld (%Ld msec)\n",
+ cache_decay_nsec, cache_decay_nsec >> 20);
+
/*
* Setup mask for cpus without special case scheduling requirements.
* For now this just excludes isolated cpus, but could be used to
--- linux/arch/i386/kernel/smpboot.c.orig
+++ linux/arch/i386/kernel/smpboot.c
@@ -849,6 +849,8 @@ static int __init do_boot_cpu(int apicid
cycles_t cacheflush_time;
unsigned long cache_decay_ticks;

+extern unsigned long sched_cache_size;
+
static void smp_tune_scheduling (void)
{
unsigned long cachesize; /* kB */
@@ -879,6 +881,7 @@ static void smp_tune_scheduling (void)
}

cacheflush_time = (cpu_khz>>10) * (cachesize<<10) / bandwidth;
+ sched_cache_size = cachesize * 1024;
}

cache_decay_ticks = (long)cacheflush_time/cpu_khz + 1;

2004-10-06 13:51:59

by Nick Piggin

Subject: Re: [patch] sched: auto-tuning task-migration

Ingo Molnar wrote:
> * Chen, Kenneth W <[email protected]> wrote:
>
>
>>Since we are talking about load balancing, we decided to measure
>>various value for cache_hot_time variable to see how it affects app
>>performance. We first establish baseline number with vanilla base
>>kernel (default at 2.5ms), then sweep that variable up to 1000ms. All
>>of the experiments are done with Ingo's patch posted earlier. Here
>>are the result (test environment is 4-way SMP machine, 32 GB memory,
>>500 disks running industry standard db transaction processing
>>workload):
>>
>>cache_hot_time | workload throughput
>>--------------------------------------
>> 2.5ms - 100.0 (0% idle)
>> 5ms - 106.0 (0% idle)
>> 10ms - 112.5 (1% idle)
>> 15ms - 111.6 (3% idle)
>> 25ms - 111.1 (5% idle)
>> 250ms - 105.6 (7% idle)
>> 1000ms - 105.4 (7% idle)
>
>
> the following patch adds a new feature to the scheduler: during bootup
> it measures migration costs and sets up cache_hot value accordingly.
>
> The measurement is point-to-point, i.e. it can be used to measure the
> migration costs in cache hierarchies - e.g. by NUMA setup code. The
> patch prints out a matrix of migration costs between CPUs.
> (self-migration means pure cache dirtying cost)
>
> Here are a couple of matrixes from testsystems:
>
> A 2-way Celeron/128K box:
>
> arch cache_decay_nsec: 1000000
> migration cost matrix (cache_size: 131072, cpu: 467 MHz):
> [00] [01]
> [00]: 9.6 12.0
> [01]: 12.2 9.8
> min_delta: 12586890
> using cache_decay nsec: 12586890 (12 msec)
>
> a 2-way/4-way P4/512K HT box:
>
> arch cache_decay_nsec: 2000000
> migration cost matrix (cache_size: 524288, cpu: 2379 MHz):
> [00] [01] [02] [03]
> [00]: 6.1 6.1 5.7 6.1
> [01]: 6.7 6.2 6.7 6.2
> [02]: 5.9 5.9 6.1 5.0
> [03]: 6.7 6.2 6.7 6.2
> min_delta: 6053016
> using cache_decay nsec: 6053016 (5 msec)
>
> an 8-way P3/2MB Xeon box:
>
> arch cache_decay_nsec: 6000000
> migration cost matrix (cache_size: 2097152, cpu: 700 MHz):
> [00] [01] [02] [03] [04] [05] [06] [07]
> [00]: 92.1 184.8 184.8 184.8 184.9 90.7 90.6 90.7
> [01]: 181.3 92.7 88.5 88.6 88.5 181.5 181.3 181.4
> [02]: 181.4 88.4 92.5 88.4 88.5 181.4 181.3 181.4
> [03]: 181.4 88.4 88.5 92.5 88.4 181.5 181.2 181.4
> [04]: 181.4 88.5 88.4 88.4 92.5 181.5 181.3 181.5
> [05]: 87.2 181.5 181.4 181.5 181.4 90.0 87.0 87.1
> [06]: 87.2 181.5 181.4 181.5 181.4 87.9 90.0 87.1
> [07]: 87.2 181.5 181.4 181.5 181.4 87.9 87.0 90.0
> min_delta: 91815564
> using cache_decay nsec: 91815564 (87 msec)
>

Very cool. I reckon you may want to make the final number
non-linear if possible, because a 2MB cache probably doesn't
need double the cache decay time of a 1MB cache.

And possibly things need to be tuned a bit, e.g. 12ms for the
128K celeron may be a bit large (even though it does have a
slow bus).

But this is a nice starting point.
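
One purely illustrative shape for such a non-linear mapping (not from any
posted patch; the function name is made up, and int_sqrt() is the kernel's
integer square root): anchor at 10ms for a 1MB cache and scale with the
square root of the size ratio, so a 2MB cache comes out near 14ms rather
than 20ms, and a 128K one near 3.5ms rather than 12ms:

/* Illustrative only: sub-linear (sqrt) scaling of the decay time
 * with L2 size, anchored at 10 ms for a 1 MB cache. */
static unsigned long long scaled_cache_hot_nsec(unsigned long cache_size)
{
	return 10000000ULL * int_sqrt(cache_size) / int_sqrt(1 << 20);
}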

2004-10-06 14:01:08

by Andries Brouwer

Subject: Re: new dev model (was Re: Default cache_hot_time value back to 10ms)

On Wed, Oct 06, 2004 at 11:44:37AM +0200, bert hubert wrote:

> Mainline is suffering too - lots of people I know running 2.6 on production
> systems have noted a marked increase in problems, crashes, odd things.
>
> I'd bet you get a lot of people who'd vote for a timeout right now to figure
> out what's going wrong.
>
> There is the distinct impression that we are going down hill in this series.
> My personal feeling is that this trend started almost immediately after OLS.

Well, suppose we eliminate 5% of all bugs each week.
Then after a year only 7% of the original bugs are left.

In a stable series that is a fairly good result.
In a series that is simultaneously "stable" and "development"
new random bugs are being introduced continually.
One never reaches a state with only a few bugs.
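
(The arithmetic checks out if "5% of all bugs" means 5% of the bugs still
remaining each week: 0.95^52 is roughly 0.069, i.e. about 7% of the
original bugs survive a 52-week year.)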

2004-10-06 14:22:45

by Emmanuel Fusté

Subject: Re: [patch] sched: auto-tuning task-migration


>the following patch adds a new feature to the scheduler: during bootup
>it measures migration costs and sets up cache_hot value accordingly.
>
>The measurement is point-to-point, i.e. it can be used to measure the
>migration costs in cache hierarchies - e.g. by NUMA setup code. The
>patch prints out a matrix of migration costs between CPUs.
>(self-migration means pure cache dirtying cost)

Hi Ingo,

Is your auto-tuning patch supposed to work on a shared L2
cache arch like my i586 SMP system?
Just to know.

Thanks.

E.F.





2004-10-06 17:19:43

by Chen, Kenneth W

Subject: RE: Default cache_hot_time value back to 10ms

> Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
> > We have experimented with similar thing, via bumping up sd->cache_hot_time to
> > a very large number, like 1 sec. What we measured was a equally low throughput.
> > But that was because of not enough load balancing.
>
> Since we are talking about load balancing, we decided to measure various
> value for cache_hot_time variable to see how it affects app performance. We
> first establish baseline number with vanilla base kernel (default at 2.5ms),
> then sweep that variable up to 1000ms. All of the experiments are done with
> Ingo's patch posted earlier. Here are the result (test environment is 4-way
> SMP machine, 32 GB memory, 500 disks running industry standard db transaction
> processing workload):
>
> cache_hot_time | workload throughput
> --------------------------------------
> 2.5ms - 100.0 (0% idle)
> 5ms - 106.0 (0% idle)
> 10ms - 112.5 (1% idle)
> 15ms - 111.6 (3% idle)
> 25ms - 111.1 (5% idle)
> 250ms - 105.6 (7% idle)
> 1000ms - 105.4 (7% idle)
>
> Clearly the default value for SMP has the worst application throughput (12%
> below peak performance). When set too low, kernel is too aggressive on load
> balancing and we are still seeing cache thrashing despite the perf fix.
> However, If set too high, kernel gets too conservative and not doing enough
> load balance.

Ingo Molnar wrote on Wednesday, October 06, 2004 12:48 AM
> could you please try the test in 1 msec increments around 10 msec? It
> would be very nice to find a good formula and the 5 msec steps are too
> coarse. I think it would be nice to test 7,9,11,13 msecs first, and then
> the remaining 1 msec slots around the new maximum. (assuming the
> workload measurement is stable.)

I should've posted the whole thing yesterday; we had measurements at 7.5 and
12.5 ms. Here are the results (repeating 5, 10, 15 for easy reading).

5 ms 106.0
7.5 ms 110.3
10 ms 112.5
12.5 ms 112.0
15 ms 111.6


> > This value was default to 10ms before domain scheduler, why does domain
> > scheduler need to change it to 2.5ms? And on what bases does that decision
> > take place? We are proposing change that number back to 10ms.
>
> agreed. What value does cache_decay_ticks have on your box?


I see all the fancy calculations with cache_decay_ticks on x86, but nobody
actually uses it in the domain scheduler. Anyway, my box has that value
hard-coded to 10ms (ia64).

- Ken


2004-10-06 17:49:46

by Chen, Kenneth W

Subject: RE: [patch] sched: auto-tuning task-migration

Ingo Molnar wrote on Wednesday, October 06, 2004 6:30 AM
> the following patch adds a new feature to the scheduler: during bootup
> it measures migration costs and sets up cache_hot value accordingly.
>
> could you try this patch on your testbox and send me the bootlog? How
> close does this method get us to the 10 msec value you measured to be
> close to the best value? The patch is against 2.6.9-rc3 + the last
> cache_hot fixpatch you tried.

Ran it on a similar system. Below is the output. Haven't tried to get a
real benchmark run with 42 ms cache_hot_time. I don't think it will get
peak throughput as we already start tapering off at 12.5 ms.

task migration cache decay timeout: 10 msecs.
CPU 1: base freq=199.458MHz, ITC ratio=15/2, ITC freq=1495.941MHz+/--1ppm
CPU 2: base freq=199.458MHz, ITC ratio=15/2, ITC freq=1495.941MHz+/--1ppm
CPU 3: base freq=199.458MHz, ITC ratio=15/2, ITC freq=1495.941MHz+/--1ppm
Calibrating delay loop... 2232.84 BogoMIPS (lpj=1089536)
Brought up 4 CPUs
Total of 4 processors activated (8939.60 BogoMIPS).
arch cache_decay_nsec: 10000000
migration cost matrix (cache_size: 9437184, cpu: 1500 MHz):
[00] [01] [02] [03]
[00]: 50.2 42.8 42.9 42.8
[01]: 42.9 50.2 42.1 42.9
[02]: 42.9 42.9 50.2 42.8
[03]: 42.9 42.9 42.9 50.2
min_delta: 44785782
using cache_decay nsec: 44785782 (42 msec)
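
(A note on reading these figures: the patch converts nanoseconds to "msec"
by shifting right 20 bits, i.e. dividing by 2^20, roughly 1.05 million, so
the min_delta of 44785782 ns prints as 42 msec rather than 44.8.)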


2004-10-06 19:28:54

by Chen, Kenneth W

Subject: RE: Default cache_hot_time value back to 10ms

Andrew Morton wrote on Tuesday, October 05, 2004 9:51 PM
> > Nick Piggin <[email protected]> wrote:
> > I'd say it is probably too low level to be a useful tunable (although
> > for testing I guess so... but then you could have *lots* of parameters
> > tunable).
>
> This tunable caused an 11% performance difference in (I assume) TPCx.
> That's a big deal, and people will want to diddle it.
>
> If one number works optimally for all machines and workloads then fine.
>
> But yes, avoiding a tunable would be nice, but we need a tunable to work
> out whether we can avoid making it tunable ;)
>
> Not that I'm soliciting patches or anything. I'll duck this one for now.

Andrew, can I safely interpret this response as you are OK with having
cache_hot_time set to 10 ms for now? And you will merge this change for
2.6.9? I think Ingo and Nick are both OK with that change as well. Thanks.

- Ken


2004-10-06 19:36:56

by Jeff Garzik

Subject: Re: new dev model (was Re: Default cache_hot_time value back to 10ms)

Ingo Molnar wrote:
> On Wed, 6 Oct 2004, Jeff Garzik wrote:
>
>
>>The _reality_ is that there is _no_ point in time where you and Linus
>>allow for stabilization of the main tree prior to relesae. [...]
>
>
> i dont think this is fair to Andrew - there's hundreds of patches in his
> tree that are scheduled for 2.6.10 not 2.6.9.
>
> you are right that -mm is experimental, but the latency of bugfixes is the
> lowest i've ever seen in any Linux tree, which is quite amazing
> considering the hundreds of patches.

I said "stabilization of the main tree" for a reason :) Like a
"mini-Andrew", I have over 100 net driver csets waiting for 2.6.10 as well.

The crucial point is establishing a psychology where maintainers only
submit (and only apply) bug fixes in -rc series. As long as random
stuff (like fasync in 2.6.8 release) is getting applied at the last
minute, we are

* destroying the validity of testing done in -rc prior to release, and
* reducing the value of user testing
* discouraging users from treating -rc as anything but a 'devel' release
(as opposed to a 'stable' release)



> it is also correct that the pile of patches in the -mm tree mask the QA
> effects of testing done on -mm, so testing -BK separately is just as
> important at this stage.

The simple fact is that -mm doesn't receive _nearly_ the amount of
testing that a 2.6.x -BK snapshot does, which in turn doesn't receive
_nearly_ the amount of testing that a 2.6.x-rc release gets.

The amount of testing, and the amount of feedback I get for my stuff,
increases by a very large margin from -mm/-bk to -rc/release. For this
reason, one cannot hold up testing in -mm as having nearly the value of
testing in -rc.

But with the diminished signal/noise ratio of current -rc...

Jeff


2004-10-06 19:41:55

by Jeff Garzik

Subject: Re: new dev model (was Re: Default cache_hot_time value back to 10ms)

Andrew Morton wrote:
> Jeff Garzik <[email protected]> wrote:
>
>>Andrew Morton wrote:
>>
>>>Nick Piggin <[email protected]> wrote:
>>>
>>>
>>>>Any thoughts about making -rc's into -pre's, and doing real -rc's?
>>>
>>>
>>>I think what we have is OK. The idea is that once 2.6.9 is released we
>>>merge up all the well-tested code which is sitting in various trees and has
>>>been under test for a few weeks. As soon as all that well-tested code is
>>>merged, we go into -rc. So we're pipelining the development of 2.6.10 code
>>>with the stabilisation of 2.6.9.
>>>
>>>If someone goes and develops *new* code after the release of, say, 2.6.9
>>>then tough tittie, it's too late for 2.6.9: we don't want new code - we
>>>want old-n-tested code. So your typed-in-after-2.6.9 code goes into
>>>2.6.11.
>>>
>>>That's the theory anyway. If it means that it takes a long time to get
>>
>>This is damned frustrating :( Reality is _far_ divorced from what you
>>just described.
>
>
> s/far/a bit/
>
>
>>Major developers such as David and Al don't have trees that see wide
>>testing, their code only sees wide testing once it hits mainline. See
>>this message from David,
>>http://marc.theaimsgroup.com/?l=linux-netdev&m=109648930728731&w=2
>>
>
>
> Yes, networking has been an exception. I think this has been acceptable
> thus far because historically networking has tended to work better than
> other parts of the kernel. Although the fib_hash stuff was a bit of a
> fiasco.

That's a prime example, yes...


>>In particular, I think David's point about -mm being perceived as overly
>>experimental is fair.
>
>
> I agree - -mm breaks too often. You wouldn't believe the crap people throw
> at me :(. But a lot of problems get fixed this way too.
>
>
>>Recent experience seems to directly counter the assertion that only
>>well-tested code is landing in mainline, and it's not hard to pick
>>through the -rc changelogs to find non-trivial, non-bugfix modifications
>>to existing code.
>
>
> Once we hit -rc2 we shouldn't be doing that.

Why does -rc2 have to be a magic number? Does that really make sense to
the users we want to be testing our stuff?

"We picked a magic number, after which, we hope it becomes more stable
even if it doesn't work out like that in practice"


>> My own experience with netdev-2.6 bears this out as
>>well: I have several personal examples of bugs sitting in netdev (and
>>thus -mm) for quite a while, only being noticed when the code hits mainline.
>
>
> yes, I've had a couple of those. Not too many, fortunately. But having
> bugs leak into mainline is OK - we expect that. As long as it wasn't late in
> the cycle. If it was late in the cycle then, well,
> bad-call-won't-do-that-again.
>
>
>>Linus's assertion that "calling it -rc means developers should calm
>>down" (implying we should start concentrating on bug fixing rather than
>>more-fun stuff) is equally fanciful.
>>
>>Why is it so hard to say "only bugfixes"?
>
>
> (It's not "only bugfixes". It's "only bugfixes, completely new stuff and
> documentation/comment fixes).
>
> But yes. When you see this please name names and thwap people.

I thought I just did ;-)


>>The _reality_ is that there is _no_ point in time where you and Linus
>>allow for stabilization of the main tree prior to release. The release
>>criteria have devolved to a point where we call it done when the stack of
>>pancakes gets too high.
>
>
> That's simply wrong.
>
> For instance, 2.6.8-rc1-mm1-series had 252 patches. I'm now sitting on 726
> patches. That's 500 patches which are either non-bugfixes or minor
> bugfixes which are held back. The various bk tree maintainers do the same
> thing.

Sure, I'm sitting on over 100 net driver csets myself. I'm glad, but the
overall point is still that "-rc" -- which stands for Release Candidate
-- is nowhere near release candidate status when -rc1 hits, and fluff
like sparse notations and changes like the fasync API change in 2.6.8
always seem to sneak in at the last minute, further belying the
supposed Release Candidate status.

No matter the effort of maintainers to hold back patches, every
violation of the Release Candidate Bugfixes Only policy serves to
undermine user confidence and invalidate previous Q/A work.

Jeff


2004-10-06 19:48:07

by Andrew Morton

[permalink] [raw]
Subject: Re: Default cache_hot_time value back to 10ms

"Chen, Kenneth W" <[email protected]> wrote:
>
> Andrew Morton wrote on Tuesday, October 05, 2004 9:51 PM
> > > Nick Piggin <[email protected]> wrote:
> > > I'd say it is probably too low level to be a useful tunable (although
> > > for testing I guess so... but then you could have *lots* of parameters
> > > tunable).
> >
> > This tunable caused an 11% performance difference in (I assume) TPCx.
> > That's a big deal, and people will want to diddle it.
> >
> > If one number works optimally for all machines and workloads then fine.
> >
> > But yes, avoiding a tunable would be nice, but we need a tunable to work
> > out whether we can avoid making it tunable ;)
> >
> > Not that I'm soliciting patches or anything. I'll duck this one for now.
>
> Andrew, can I safely interpret this response as you are OK with having
> cache_hot_time set to 10 ms for now?

I have a lot of scheduler changes queued up and I view this change as being
not very high priority. If someone sends a patch to update -mm then we can
run with that, however Ingo's auto-tuning seems a far preferable approach.

> And you will merge this change for 2.6.9?

I was not planning on doing so, but could be persuaded, I guess.

It's very, very late for this and subtle CPU scheduler regressions tend to
take a long time (weeks or months) to be identified.

2004-10-06 19:51:16

by Jeff Garzik

[permalink] [raw]
Subject: Re: new dev model (was Re: Default cache_hot_time value back to 10ms)


So my own suggestions for increasing 2.6.x stability are:

1) Create a release numbering system that is __clear to users__, not
just developers. This is a human, not technical problem. Telling users
"oh, -rc1 doesn't really mean Release Candidate, we start getting
serious around -rc2 or so but some stuff slips in and..." is hardly clear.

2) Really (underscore underscore) only accept bugfixes after the chosen
line of demarcation. No API changes. No new stuff (new stuff may not
break anything, but it's a distraction). Chill out on all the sparse
notations. _Just_ _bug_ _fixes_. The fluff (comments/sparse/new
features) just serves to make reviewing the changes more difficult, as
it vastly increases the noise-to-signal ratio.

With all the noise of comment fixes, new features, etc. during Release
Candidate, you kill the value of reviewing the -rc patch. Developers
and users have to either (a) boot and pray or (b) wade through tons of
non-bugfix changes in an attempt to review the changes.

I know it's an antiquated idea, _reading_ and reviewing a -rc patch, but
still...

Jeff



2004-10-06 19:54:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: Default cache_hot_time value back to 10ms


* Chen, Kenneth W <[email protected]> wrote:

> 5 ms 106.0
> 7.5 ms 110.3
> 10 ms 112.5
> 12.5 ms 112.0
> 15 ms 111.6

ok, great. 9ms and 11ms would still be interesting. My guess would be
that the maximum is at 9. (albeit the numbers, when plotted, indicate
that the measurement might be a bit noisy.)

Ingo

2004-10-06 19:59:39

by Jeff Garzik

[permalink] [raw]
Subject: Re: new dev model (was Re: Default cache_hot_time value back to 10ms)

Jeff Garzik wrote:
>
> So my own suggestions for increasing 2.6.x stability are:

And one more, that I meant to include in the last email,

3) Release early, release often (official -rc releases, not just snapshots)

2004-10-06 20:11:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] sched: auto-tuning task-migration


* Chen, Kenneth W <[email protected]> wrote:

> Ran it on a similar system. Below is the output. Haven't tried to
> get a real benchmark run with 42 ms cache_hot_time. I don't think it
> will get peak throughput as we already start tapering off at 12.5 ms.

> arch cache_decay_nsec: 10000000
> migration cost matrix (cache_size: 9437184, cpu: 1500 MHz):
> [00] [01] [02] [03]
> [00]: 50.2 42.8 42.9 42.8
> [01]: 42.9 50.2 42.1 42.9
> [02]: 42.9 42.9 50.2 42.8
> [03]: 42.9 42.9 42.9 50.2
> min_delta: 44785782
> using cache_decay nsec: 44785782 (42 msec)

could you try the replacement patch below - what results does it give?

Ingo

--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -388,7 +388,7 @@ struct sched_domain {
.max_interval = 4, \
.busy_factor = 64, \
.imbalance_pct = 125, \
- .cache_hot_time = (5*1000000/2), \
+ .cache_hot_time = cache_decay_nsec, \
.cache_nice_tries = 1, \
.per_cpu_gain = 100, \
.flags = SD_BALANCE_NEWIDLE \
@@ -410,7 +410,7 @@ struct sched_domain {
.max_interval = 32, \
.busy_factor = 32, \
.imbalance_pct = 125, \
- .cache_hot_time = (10*1000000), \
+ .cache_hot_time = cache_decay_nsec, \
.cache_nice_tries = 1, \
.per_cpu_gain = 100, \
.flags = SD_BALANCE_EXEC \
@@ -4420,11 +4420,236 @@ __init static void init_sched_build_grou
last->next = first;
}

-__init static void arch_init_sched_domains(void)
+/*
+ * Task migration cost measurement between source and target CPUs.
+ *
+ * This is done by measuring the worst-case cost. Here are the
+ * steps that are taken:
+ *
+ * 1) the source CPU dirties its L2 cache with a shared buffer
+ * 2) the target CPU dirties its L2 cache with a local buffer
+ * 3) the target CPU dirties the shared buffer
+ *
+ * We measure the time step #3 takes - this is the cost of migrating
+ * a cache-hot task that has a large, dirty dataset in the L2 cache,
+ * to another CPU.
+ */
+
+
+/*
+ * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
+ * is the operation that is timed, so we try to generate unpredictable
+ * cachemisses that still end up filling the L2 cache:
+ */
+__init static void fill_cache(void *__cache, unsigned long __size)
{
+ unsigned long size = __size/sizeof(long);
+ unsigned long *cache = __cache;
+ unsigned long data = 0xdeadbeef;
int i;
+
+ for (i = 0; i < size/4; i++) {
+ if ((i & 3) == 0)
+ cache[i] = data;
+ if ((i & 3) == 1)
+ cache[size-1-i] = data;
+ if ((i & 3) == 2)
+ cache[size/2-i] = data;
+ if ((i & 3) == 3)
+ cache[size/2+i] = data;
+ }
+}
+
+struct flush_data {
+ unsigned long source, target;
+ void (*fn)(void *, unsigned long);
+ void *cache;
+ void *local_cache;
+ unsigned long size;
+ unsigned long long delta;
+};
+
+/*
+ * Dirty L2 on the source CPU:
+ */
+__init static void source_handler(void *__data)
+{
+ struct flush_data *data = __data;
+
+ if (smp_processor_id() != data->source)
+ return;
+
+ memset(data->cache, 0, data->size);
+}
+
+/*
+ * Dirty the L2 cache on this CPU and then access the shared
+ * buffer. (which represents the working set of the migrated task.)
+ */
+__init static void target_handler(void *__data)
+{
+ struct flush_data *data = __data;
+ unsigned long long t0, t1;
+ unsigned long flags;
+
+ if (smp_processor_id() != data->target)
+ return;
+
+ memset(data->local_cache, 0, data->size);
+ local_irq_save(flags);
+ t0 = sched_clock();
+ fill_cache(data->cache, data->size);
+ t1 = sched_clock();
+ local_irq_restore(flags);
+
+ data->delta = t1 - t0;
+}
+
+/*
+ * Measure the cache-cost of one task migration:
+ */
+__init static unsigned long long measure_one(void *cache, unsigned long size,
+ int source, int target)
+{
+ struct flush_data data;
+ unsigned long flags;
+ void *local_cache;
+
+ local_cache = vmalloc(size);
+ if (!local_cache) {
+ printk("couldnt allocate local cache ...\n");
+ return 0;
+ }
+ memset(local_cache, 0, size);
+
+ local_irq_save(flags);
+ local_irq_enable();
+
+ data.source = source;
+ data.target = target;
+ data.size = size;
+ data.cache = cache;
+ data.local_cache = local_cache;
+
+ if (on_each_cpu(source_handler, &data, 1, 1) != 0) {
+ printk("measure_one: timed out waiting for other CPUs\n");
+ local_irq_restore(flags);
+ vfree(local_cache);
+ /* return 0, not -1: -1ULL would poison the max search in the caller */
+ return 0;
+ }
+ if (on_each_cpu(target_handler, &data, 1, 1) != 0) {
+ printk("measure_one: timed out waiting for other CPUs\n");
+ local_irq_restore(flags);
+ vfree(local_cache);
+ return 0;
+ }
+
+ local_irq_restore(flags);
+ vfree(local_cache);
+
+ return data.delta;
+}
+
+__initdata unsigned long sched_cache_size;
+
+/*
+ * Measure a series of task migrations and return the maximum
+ * result - the worst-case. Since this code runs early during
+ * bootup the system is 'undisturbed' and the maximum latency
+ * makes sense.
+ *
+ * As the working set we use 2.1 times the L2 cache size, this is
+ * chosen in such a nonsymmetric way so that fill_cache() doesnt
+ * iterate at power-of-2 boundaries (which might hit cache mapping
+ * artifacts and pessimise the results).
+ */
+__init static unsigned long long measure_cacheflush_time(int cpu1, int cpu2)
+{
+ unsigned long size = sched_cache_size*21/10;
+ unsigned long long delta, max = 0;
+ void *cache;
+ int i;
+
+ if (!size) {
+ printk("arch has not set cachesize - using default.\n");
+ return 0;
+ }
+ if (!cpu_online(cpu1) || !cpu_online(cpu2)) {
+ printk("cpu %d and %d not both online!\n", cpu1, cpu2);
+ return 0;
+ }
+ cache = vmalloc(size);
+ if (!cache) {
+ printk("could not vmalloc %ld bytes for cache!\n", size);
+ return 0;
+ }
+ memset(cache, 0, size);
+ for (i = 0; i < 20; i++) {
+ delta = measure_one(cache, size, cpu1, cpu2);
+ if (delta > max)
+ max = delta;
+ }
+
+ vfree(cache);
+
+ /*
+ * A task is considered 'cache cold' if at least 2 times
+ * the worst-case cost of migration has passed.
+ * (this limit is only listened to if the load-balancing
+ * situation is 'nice' - if there is a large imbalance we
+ * ignore it for the sake of CPU utilization and
+ * processing fairness.)
+ *
+ * (We use 2.1 times the L2 cachesize in our measurement,
+ * we keep this factor when returning.)
+ */
+ return max;
+}
+
+__initdata static unsigned long long cache_decay_nsec;
+
+__init static void arch_init_sched_domains(void)
+{
+ int i, cpu1 = -1, cpu2 = -1;
+ unsigned long long min_delta = -1ULL;
+
cpumask_t cpu_default_map;

+ printk("arch cache_decay_nsec: %ld\n", cache_decay_ticks*1000000);
+ printk("migration cost matrix (cache_size: %ld, cpu: %ld MHz):\n",
+ sched_cache_size, cpu_khz/1000);
+ printk(" ");
+ for (cpu1 = 0; cpu1 < NR_CPUS; cpu1++) {
+ if (!cpu_online(cpu1))
+ continue;
+ printk(" [%02d]", cpu1);
+ }
+ printk("\n");
+ for (cpu1 = 0; cpu1 < NR_CPUS; cpu1++) {
+ if (!cpu_online(cpu1))
+ continue;
+ printk("[%02d]: ", cpu1);
+ for (cpu2 = 0; cpu2 < NR_CPUS; cpu2++) {
+ unsigned long long delta;
+
+ if (!cpu_online(cpu2))
+ continue;
+ delta = measure_cacheflush_time(cpu1, cpu2);
+
+ printk(" %3Ld.%ld", delta >> 20,
+ (((long)delta >> 10) / 102) % 10);
+ if ((cpu1 != cpu2) && (delta < min_delta))
+ min_delta = delta;
+ }
+ printk("\n");
+ }
+ printk("min_delta: %Ld\n", min_delta);
+ if (min_delta != -1ULL)
+ cache_decay_nsec = min_delta;
+ printk("using cache_decay nsec: %Ld (%Ld msec)\n",
+ cache_decay_nsec, cache_decay_nsec >> 20);
+
/*
* Setup mask for cpus without special case scheduling requirements.
* For now this just excludes isolated cpus, but could be used to
--- linux/arch/i386/kernel/smpboot.c.orig
+++ linux/arch/i386/kernel/smpboot.c
@@ -849,6 +849,8 @@ static int __init do_boot_cpu(int apicid
cycles_t cacheflush_time;
unsigned long cache_decay_ticks;

+extern unsigned long sched_cache_size;
+
static void smp_tune_scheduling (void)
{
unsigned long cachesize; /* kB */
@@ -879,6 +881,7 @@ static void smp_tune_scheduling (void)
}

cacheflush_time = (cpu_khz>>10) * (cachesize<<10) / bandwidth;
+ sched_cache_size = cachesize * 1024;
}

cache_decay_ticks = (long)cacheflush_time/cpu_khz + 1;
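
(As a units sanity check on that last hunk: the cachesize variable in
smp_tune_scheduling() is in kB per its comment, so sched_cache_size =
cachesize * 1024 ends up in bytes. That matches the "cache_size: 9437184"
in Ken's printout above - 9437184 / 1024 = 9216 kB, i.e. a 9 MB L2. The
ia64 side would have to set sched_cache_size analogously; that hunk is not
shown here.)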

2004-10-06 20:45:47

by Andrew Morton

[permalink] [raw]
Subject: Re: Default cache_hot_time value back to 10ms

"Chen, Kenneth W" <[email protected]> wrote:
>
> Secondly, let me ask the question again from the first mail thread: this value
> *WAS* 10 ms for a long time, before the domain scheduler. What's so special
> about the domain scheduler that all of a sudden this parameter got changed to 2.5?

So why on earth was it switched from 10 to 2.5 in the first place?

Please resend the final patch.

2004-10-06 20:49:36

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: new dev model (was Re: Default cache_hot_time value back to 10ms)

On Wed, 6 Oct 2004, Jeff Garzik wrote:
> Jeff Garzik wrote:
> > So my own suggestions for increasing 2.6.x stability are:
>
> And one more, that I meant to include in the last email,
>
> 3) Release early, release often (official -rc releases, not just snapshots)

I guess you mean official -pre releases as well?

Gr{oetje,eeting}s,

Geert

P.S. I only track `real' (-pre and -rc) releases. I don't have the manpower
(what's in a word) to track daily snapshots (I do `read' bk-commits). If
m68k stuff gets broken in -rc, usually it means it won't get fixed before
2 full releases later. Anyway, things shouldn't become broken in -rc,
IMHO that's what we (should) have -pre for...
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2004-10-06 20:49:38

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Default cache_hot_time value back to 10ms

Andrew Morton wrote on Wednesday, October 06, 2004 12:40 PM
> "Chen, Kenneth W" <[email protected]> wrote:
> > Andrew, can I safely interpret this response as you are OK with having
> > cache_hot_time set to 10 ms for now?
>
> I have a lot of scheduler changes queued up and I view this change as being
> not very high priority. If someone sends a patch to update -mm then we can
> run with that, however Ingo's auto-tuning seems a far preferable approach.
>
> > And you will merge this change for 2.6.9?
>
> I was not planning on doing so, but could be persuaded, I guess.
>
> It's very, very late for this and subtle CPU scheduler regressions tend to
> take a long time (weeks or months) to be identified.


Let me try to persuade ;-). First, it is hard to accept the fact that we are
leaving 11% of performance on the table just due to a poorly chosen parameter.
That large a percentage difference on a db workload is a huge deal. It
basically "unfairly" handicaps the 2.6 kernel behind the competition, and even
handicaps us compared to the 2.4 kernel. We have established from various
workloads that 10 ms works the best, from db to java workloads. What more data
can we provide to swing you in that direction?

Secondly, let me ask the question again from the first mail thread: this value
*WAS* 10 ms for a long time, before the domain scheduler. What's so special
about the domain scheduler that all of a sudden this parameter got changed to
2.5? I'd like to see some justification/prior measurement for such a change
when the domain scheduler kicks in.

- Ken


2004-10-06 20:58:20

by Ingo Molnar

[permalink] [raw]
Subject: RE: Default cache_hot_time value back to 10ms


On Wed, 6 Oct 2004, Chen, Kenneth W wrote:

> Let me try to persuade ;-). First, it is hard to accept the fact that we
> are leaving 11% of performance on the table just due to a poorly chosen
> parameter. That large a percentage difference on a db workload is a huge
> deal. It basically "unfairly" handicaps the 2.6 kernel behind the
> competition, and even handicaps us compared to the 2.4 kernel. We have
> established from various workloads that 10 ms works the best, from db to
> java workloads. What more data can we provide to swing you in that
> direction?

the problem is that 10 msec might be fine for a 9MB L2 cache CPU running a
DB benchmark, but it will surely be too much of a migration cutoff for other
boxes. And too much of a migration cutoff means increased idle time -
resulting in CPU-under-utilization and worse performance.

so i'd prefer to not touch it for 2.6.9 (consider that tree closed from a
scheduler POV), and we can do the auto-tuning in 2.6.10 just fine. It will
need the same weeks-long testcycle that all scheduler balancing patches
need. There are so many different types of workloads ...

> Secondly, let me ask the question again from the first mail thread:
> this value *WAS* 10 ms for a long time, before the domain scheduler.
> What's so special about the domain scheduler that all of a sudden this
> parameter got changed to 2.5? I'd like to see some justification/prior
> measurement for such a change when the domain scheduler kicks in.

iirc it was tweaked as a result of the other bug that you fixed. But, high
sensitivity to this tunable was never truly established, and a 9 MB L2
cache CPU is certainly not typical - and it is certainly the one that
hurts most from migration effects.

anyway, we were running based on cache_decay_ticks for a long time - is
that what was 10 msec on your box? The cache_decay_ticks calculation was
pretty fine too, it scaled up with cachesize.

Ingo

2004-10-06 21:11:56

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Default cache_hot_time value back to 10ms

Ingo Molnar wrote on Wednesday, October 06, 2004 1:51 PM
> On Wed, 6 Oct 2004, Chen, Kenneth W wrote:
> > Let me try to persuade ;-). First, it is hard to accept the fact that we
> > are leaving 11% of performance on the table just due to a poorly chosen
> > parameter. That large a percentage difference on a db workload is a huge
> > deal. It basically "unfairly" handicaps the 2.6 kernel behind the
> > competition, and even handicaps us compared to the 2.4 kernel. We have
> > established from various workloads that 10 ms works the best, from db to
> > java workloads. What more data can we provide to swing you in that
> > direction?
>
> the problem is that 10 msec might be fine for a 9MB L2 cache CPU running a
> DB benchmark, but it will surely be too much of a migration cutoff for other
> boxes. And too much of a migration cutoff means increased idle time -
> resulting in CPU-under-utilization and worse performance.
>
> so i'd prefer to not touch it for 2.6.9 (consider that tree closed from a
> scheduler POV), and we can do the auto-tuning in 2.6.10 just fine. It will
> need the same weeks-long testcycle that all scheduler balancing patches
> need. There are so many different types of workloads ...

I would argue that the testing should be the other way around: having people
argue/provide data why 2.5 is better than 10. Is there any prior measurement
or mailing list posting out there?


> > Secondly, let me ask the question again from the first mail thread:
> > this value *WAS* 10 ms for a long time, before the domain scheduler.
> > What's so special about the domain scheduler that all of a sudden this
> > parameter got changed to 2.5? I'd like to see some justification/prior
> > measurement for such a change when the domain scheduler kicks in.
>
> iirc it was tweaked as a result of the other bug that you fixed.

Is it possible that whatever tweaking was done before was running with
broken load balancing logic, and thus invalidates the 2.5 ms result?


> anyway, we were running based on cache_decay_ticks for a long time - is
> that what was 10 msec on your box? The cache_decay_ticks calculation was
> pretty fine too, it scaled up with cachesize.

Yes, cache_decay_ticks is what I was referring to. I guess I was too
concentrated on ia64. For ia64, it was hard coded to 10ms regardless of
cache size.

cache_decay_ticks isn't used anywhere in 2.6.9-rc3; maybe that should be the
one for cache_hot_time.

- Ken


2004-10-06 21:20:55

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [patch] sched: auto-tuning task-migration

Ingo Molnar wrote on Wednesday, October 06, 2004 1:05 PM
> could you try the replacement patch below - what results does it give?

By the way, I wonder why you chose to round down, but not up.


arch cache_decay_nsec: 10000000
migration cost matrix (cache_size: 9437184, cpu: 1500 MHz):
[00] [01] [02] [03]
[00]: 9.1 8.5 8.5 8.5
[01]: 8.5 9.1 8.5 8.5
[02]: 8.5 8.5 9.1 8.5
[03]: 8.5 8.5 8.5 9.1
min_delta: 8909202
using cache_decay nsec: 8909202 (8 msec)
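
(If the round-down being asked about is the "(8 msec)" printout - Ingo
reads the question as being about the minimum search instead, see his
reply below - it comes from the patch printing nanoseconds as milliseconds
via "cache_decay_nsec >> 20", a shift instead of a divide. Purely
illustrative arithmetic:

	8909202 >> 20 = 8909202 / 1048576 = 8 (truncated)
	8909202 / 10^6 = 8.9

so 8909202 nsec is reported as 8 msec even though it is closer to 9.)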


2004-10-06 22:28:28

by Martin J. Bligh

[permalink] [raw]
Subject: Re: new dev model (was Re: Default cache_hot_time value back to 10ms)

>> it is also correct that the pile of patches in the -mm tree mask the QA
>> effects of testing done on -mm, so testing -BK separately is just as
>> important at this stage.
>
> The simple fact is that -mm doesn't receive _nearly_ the amount of
> testing that a 2.6.x -BK snapshot does, which in turn doesn't receive
> _nearly_ the amount of testing that a 2.6.x-rc release gets.

Not sure that's true. Personally I test all -mm releases, and not -bk
snapshots ... I've heard similar from other people.

If everyone pushed their stuff through -mm, and it sat there for a few
days before going upstream, we'd get a much better opportunity to test.

M.

2004-10-06 22:53:46

by Peter Williams

[permalink] [raw]
Subject: Re: Default cache_hot_time value back to 10ms

Chen, Kenneth W wrote:
>>Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
>>
>>>We have experimented with similar thing, via bumping up sd->cache_hot_time to
>>>a very large number, like 1 sec. What we measured was a equally low throughput.
>>>But that was because of not enough load balancing.
>>
>>Since we are talking about load balancing, we decided to measure various
>>value for cache_hot_time variable to see how it affects app performance. We
>>first establish baseline number with vanilla base kernel (default at 2.5ms),
>>then sweep that variable up to 1000ms. All of the experiments are done with
>>Ingo's patch posted earlier. Here are the result (test environment is 4-way
>>SMP machine, 32 GB memory, 500 disks running industry standard db transaction
>>processing workload):
>>
>>cache_hot_time | workload throughput
>>--------------------------------------
>> 2.5ms - 100.0 (0% idle)
>> 5ms - 106.0 (0% idle)
>> 10ms - 112.5 (1% idle)
>> 15ms - 111.6 (3% idle)
>> 25ms - 111.1 (5% idle)
>> 250ms - 105.6 (7% idle)
>> 1000ms - 105.4 (7% idle)
>>
>>Clearly the default value for SMP has the worst application throughput (12%
>>below peak performance). When set too low, kernel is too aggressive on load
>>balancing and we are still seeing cache thrashing despite the perf fix.
>>However, If set too high, kernel gets too conservative and not doing enough
>>load balance.
>
>
> Ingo Molnar wrote on Wednesday, October 06, 2004 12:48 AM
>
>>could you please try the test in 1 msec increments around 10 msec? It
>>would be very nice to find a good formula and the 5 msec steps are too
>>coarse. I think it would be nice to test 7,9,11,13 msecs first, and then
>>the remaining 1 msec slots around the new maximum. (assuming the
>>workload measurement is stable.)
>
>
> I should've posted the whole thing yesterday; we had measurements of 7.5 and
> 12.5 ms. Here is the result (repeating 5, 10, 15 for easy reading).
>
> 5 ms 106.0
> 7.5 ms 110.3
> 10 ms 112.5
> 12.5 ms 112.0
> 15 ms 111.6
>
>
>
>>>This value was default to 10ms before domain scheduler, why does domain
>>>scheduler need to change it to 2.5ms? And on what bases does that decision
>>>take place? We are proposing change that number back to 10ms.
>>
>>agreed. What value does cache_decay_ticks have on your box?
>
>
>
> I see all the fancy calculation with cache_decay_ticks on x86, but nobody
> actually uses it in the domain scheduler. Anyway, my box has that value
> hard coded to 10ms (ia64).
>

If you fit a quadratic equation to this data, take the first derivative,
and then solve for zero, it will give the cache_hot_time that maximizes
the throughput.
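
For instance, fitting f(x) = ax^2 + bx + c through the three samples around
the peak - (5, 106.0), (10, 112.5), (15, 111.6) - purely as an illustration,
and assuming the curve really is roughly quadratic near the maximum:

	 25a +  5b + c = 106.0
	100a + 10b + c = 112.5
	225a + 15b + c = 111.6

gives a = -0.148 and b = 3.52, so f'(x) = 2ax + b = 0 puts the peak at
x = -b/(2a), about 11.9 ms.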

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2004-10-06 23:19:18

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Default cache_hot_time value back to 10ms

Andrew Morton wrote on Wednesday, October 06, 2004 1:43 PM
> "Chen, Kenneth W" <[email protected]> wrote:
> >
> > Secondly, let me ask the question again from the first mail thread: this value
> > *WAS* 10 ms for a long time, before the domain scheduler. What's so special
> > about the domain scheduler that all of a sudden this parameter got changed to 2.5?
>
> So why on earth was it switched from 10 to 2.5 in the first place?
>
> Please resend the final patch.


Here is a patch that reverts the default cache_hot_time value back to the
equivalent of the pre-domain scheduler behavior, which determined a task's
cache affinity via the architecture-defined variable cache_decay_ticks.

This is a mere request that we go back to what *was* before, *NOT* a new
scheduler tweak (whatever tweak was done for the domain scheduler broke
traditional/industry-recognized workloads).

As a side note, I'd like to get involved in future scheduler tuning
experiments; we have a fair number of benchmark environments where we can
validate things across various kinds of workloads, e.g., db, java, cpu, etc.
Thanks.

Signed-off-by: Ken Chen <[email protected]>

patch against 2.6.9-rc3:

--- linux-2.6.9-rc3/kernel/sched.c.orig 2004-10-06 15:10:56.000000000 -0700
+++ linux-2.6.9-rc3/kernel/sched.c 2004-10-06 15:18:51.000000000 -0700
@@ -387,7 +387,7 @@ struct sched_domain {
.max_interval = 4, \
.busy_factor = 64, \
.imbalance_pct = 125, \
- .cache_hot_time = (5*1000000/2), \
+ .cache_hot_time = cache_decay_ticks*1000000,\
.cache_nice_tries = 1, \
.per_cpu_gain = 100, \
.flags = SD_BALANCE_NEWIDLE \
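
(Assuming HZ=1000, where one tick is 1 msec: on Ken's ia64 box, where the
decay interval is hard coded to the equivalent of 10 ms, the expression
cache_decay_ticks*1000000 works out to exactly the proposed 10 msec in
nanoseconds; on x86 it instead tracks the boot-time cache_decay_ticks
calculation, giving the per-box cutoffs Ingo lists further down the thread.)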


patch against 2.6.9-rc3-mm2 (note that the -mm topology.h initializer keeps
cache_hot_time in microseconds rather than nanoseconds, as the 5*1000/2
context line below shows, hence the smaller scale factor):

--- linux-2.6.9-rc3/include/linux/topology.h.orig 2004-10-06 15:32:48.000000000 -0700
+++ linux-2.6.9-rc3/include/linux/topology.h 2004-10-06 15:33:25.000000000 -0700
@@ -113,7 +113,7 @@ static inline int __next_node_with_cpus(
.max_interval = 4, \
.busy_factor = 64, \
.imbalance_pct = 125, \
- .cache_hot_time = (5*1000/2), \
+ .cache_hot_time = (cache_decay_ticks*1000),\
.cache_nice_tries = 1, \
.per_cpu_gain = 100, \
.flags = SD_LOAD_BALANCE \



2004-10-07 00:08:56

by Matt Mackall

[permalink] [raw]
Subject: Re: new dev model (was Re: Default cache_hot_time value back to 10ms)

On Wed, Oct 06, 2004 at 03:48:01PM -0400, Jeff Garzik wrote:
>
> So my own suggestions for increasing 2.6.x stability are:
>
> 1) Create a release numbering system that is __clear to users__, not
> just developers. This is a human, not technical problem. Telling users
> "oh, -rc1 doesn't really mean Release Candidate, we start getting
> serious around -rc2 or so but some stuff slips in and..." is hardly clear.

The 2.4 system Marcelo used did this nicely. A couple -preX to shove
in new stuff, and a couple -rcX to iron out the bugs. 2.6.x-rc[12]
seem to be similar in content to 2.4.x-pre - little expectation that
they're actually candidates for release.

> 2) Really (underscore underscore) only accept bugfixes after the chosen
> line of demarcation. No API changes. No new stuff (new stuff may not
> break anything, but it's a distraction). Chill out on all the sparse
> notations. _Just_ _bug_ _fixes_. The fluff (comments/sparse/new
> features) just serves to make reviewing the changes more difficult, as
> it vastly increases the noise-to-signal ratio.

Also, please simply rename the last -rcX for the release, as Marcelo
does with 2.4. Slipping in new stuff between the candidate and the
release invalidates the testing done on the candidate, so someone can't
look at 2.6.9 and say "this looks solid from 2 weeks as a release
candidate, I can run with it today".

--
Mathematics is the supreme nostalgia of our time.

2004-10-07 01:08:52

by Jeff Garzik

[permalink] [raw]
Subject: Re: new dev model (was Re: Default cache_hot_time value back to 10ms)

Geert Uytterhoeven wrote:
> P.S. I only track `real' (-pre and -rc) releases. I don't have the manpower
> (what's in a word) to track daily snapshots (I do `read' bk-commits). If
> m68k stuff gets broken in -rc, usually it means it won't get fixed before
> 2 full releases later. Anyway, things shouldn't become broken in -rc,
> IMHO that's what we (should) have -pre for...


I agree completely, but -pre is apparently a dirty word (dirty suffix?:))

Jeff


2004-10-07 02:26:48

by Nick Piggin

[permalink] [raw]
Subject: Re: Default cache_hot_time value back to 10ms

Chen, Kenneth W wrote:
> Andrew Morton wrote on Wednesday, October 06, 2004 1:43 PM
>
>>"Chen, Kenneth W" <[email protected]> wrote:
>>
>>> Secondly, let me ask the question again from the first mail thread: this value
>>> *WAS* 10 ms for a long time, before the domain scheduler. What's so special
>>> about the domain scheduler that all of a sudden this parameter got changed to 2.5?
>>
>>So why on earth was it switched from 10 to 2.5 in the first place?
>>
>>Please resend the final patch.
>
>
>
> Here is a patch that reverts the default cache_hot_time value back to the
> equivalent of the pre-domain scheduler behavior, which determined a task's
> cache affinity via the architecture-defined variable cache_decay_ticks.
>
> This is a mere request that we go back to what *was* before, *NOT* a new
> scheduler tweak (whatever tweak was done for the domain scheduler broke
> traditional/industry-recognized workloads).
>

OK... Well, Andrew, as I said, I'd be happy for this to go in. I'd be *extra*
happy if Judith ran a few of those dbt thingy tests which had been sensitive
to idle time. Can you ask her about that, or should I?

> As a side note, I'd like to get involved in future scheduler tuning
> experiments; we have a fair number of benchmark environments where we can
> validate things across various kinds of workloads, e.g., db, java, cpu,
> etc. Thanks.
>

That would be very welcome indeed. We have a big backlog of scheduler things
to go in after 2.6.9 is released (although not many of them change the runtime
behaviour IIRC). After that, I have some experimental performance work that
could use wider testing. After *that*, the multiprocessor scheduler will be in
a state where 2.6 shouldn't need much more work, so we can concentrate on just
tuning the dials.

2004-10-07 06:08:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] sched: auto-tuning task-migration


* Chen, Kenneth W <[email protected]> wrote:

> > could you try the replacement patch below - what results does it give?
>
> By the way, I wonder why you chose to round down, but not up.

what do you mean - the minimum search within the matrix?

Ingo

2004-10-07 06:27:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: Default cache_hot_time value back to 10ms


* Chen, Kenneth W <[email protected]> wrote:

> Here is a patch that reverts the default cache_hot_time value back to the
> equivalent of the pre-domain scheduler behavior, which determined a task's
> cache affinity via the architecture-defined variable cache_decay_ticks.

i could agree with this oneliner patch for 2.6.9; it only affects the
SMP balancer, and there, for the most common boxes, it likely results in a
similar migration cutoff to the 2.5 msec we currently have. Here are the
changes that occur on a couple of x86 boxes:

2-way celeron, 128K cache: 2.5 msec -> 1.0 msec
2-way/4-way P4 Xeon 1MB cache: 2.5 msec -> 2.0 msec
8-way P3 Xeon 2MB cache: 2.5 msec -> 6.0 msec

each of these changes is sane and not too drastic.

(on ia64 there is no auto-tuning of cache_decay_ticks, there you've got
a decay=<x> boot parameter anyway, to fix things up.)

there was one particular DB test that was quite sensitive to idle time
introduced by too large a migration cutoff: dbt2-pgsql. Could you try that
one too and compare -rc3 performance to -rc3+migration-patches?

> This is a mere request that we go back to what *was* before, *NOT* a
> new scheduler tweak (whatever tweak was done for the domain scheduler
> broke traditional/industry-recognized workloads).
>
> As a side note, I'd like to get involved in future scheduler tuning
> experiments; we have a fair number of benchmark environments where we
> can validate things across various kinds of workloads, e.g., db, java,
> cpu, etc. Thanks.

yeah, it would be nice to test the following 3 kernels:

2.6.9-rc3 vanilla,
2.6.9-rc3 + cache_hot_fix + use-cache_decay_ticks
2.6.9-rc3 + cache_hot_fixes + autotune-patch

using as many different CPU types (and # of CPUs) as possible.

The most important factor in these measurements is statistical stability
of the result - if noise is too high then it's hard to judge. (the
numbers you posted in previous mails are quite stable, but not all
benchmarks are like that.)

> Signed-off-by: Ken Chen <[email protected]>

Signed-off-by: Ingo Molnar <[email protected]>

Ingo

2004-10-07 07:08:47

by Jeff Garzik

[permalink] [raw]
Subject: Re: Default cache_hot_time value back to 10ms

Ingo Molnar wrote:
> * Chen, Kenneth W <[email protected]> wrote:
> >>Signed-off-by: Ken Chen <[email protected]>
>
>
> Signed-off-by: Ingo Molnar <[email protected]>


[tangent] FWIW Andrew has recently been using "Acked-by" as well,
presumably for patches created by person X but reviewed by a wholly
independent person Y (since signed-off-by indicates you have some amount
of legal standing to actually sign off on the patch)

Jeff


2004-10-07 07:25:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: Default cache_hot_time value back to 10ms


* Jeff Garzik <[email protected]> wrote:

> Ingo Molnar wrote:
> >* Chen, Kenneth W <[email protected]> wrote:
> >>>Signed-off-by: Ken Chen <[email protected]>
> >
> >
> >Signed-off-by: Ingo Molnar <[email protected]>
>
>
> [tangent] FWIW Andrew has recently been using "Acked-by" as well,
> presumably for patches created by person X from but reviewed by wholly
> independent person Y (since signed-off-by indicates you have some
> amount of legal standing to actually sign off on the patch)

[even more tangential] even if this werent a oneliner, i might have some
amount of legal standing, i wrote the original cache_decay_ticks code
that this patch reverts to ;) But yeah, Acked-by would be more
informative here.

Ingo

2004-10-07 16:12:50

by Andrew Theurer

[permalink] [raw]
Subject: Re: Default cache_hot_time value back to 10ms

> OK... Well, Andrew, as I said, I'd be happy for this to go in. I'd be *extra*
> happy if Judith ran a few of those dbt thingy tests which had been
> sensitive to idle time. Can you ask her about that, or should I?
>
> > As a side note, I'd like to get involved in future scheduler tuning
> > experiments; we have a fair number of benchmark environments where we
> > can validate things across various kinds of workloads, e.g., db, java,
> > cpu, etc. Thanks.
>
> That would be very welcome indeed. We have a big backlog of scheduler
> things to go in after 2.6.9 is released (although not many of them change
> the runtime behaviour IIRC). After that, I have some experimental
> performance work that could use wider testing. After *that*, the
> multiprocessor scheduler will be in a state where 2.6 shouldn't need much
> more work, so we can concentrate on just tuning the dials.

I'd like to add some comments as well:

1) We are seeing similar problems with that "well known" DB transaction
benchmark, as well as another well known benchmark measuring multi-tier J2EE
server performance. Both problems are with load balancing, though it's not
quite the same situation: we have too much idle time and not enough
throughput. Giving a more aggressive idle balance has helped there. The 3
areas we have changed:

wake_idle() - find the first idle cpu, starting with cpu->sd and moving up the
sd's as needed. Modify SD_NODE_INIT.flags and SD_CPU_INIT.flags to include
SD_WAKE_IDLE (a rough sketch follows below). Now, if there is an idle cpu (and
task->cpu is busy), we move the task to the closest idle cpu.

can_migrate() - put back (again) the aggressive idle condition in
can_migrate(): do not look at task_hot when we have an idle cpu.

idle_balance() / SD_NODE_INIT - add SD_BALANCE_NEWIDLE to SD_NODE_INIT.flags
so a newly idle cpu can try to balance from an appropriate cpu, first a
cpu close to it, then farther out.

(the above changes IMO could also pave the way for removing timer-based -idle-
balances)
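
The actual patches are in the attachments at the end of this mail; as a
rough sketch of just the flags part of the wake_idle() change, assuming
initializers shaped like the -mm topology.h hunk quoted earlier in the
thread (the hunk context here is illustrative, not the real patch):

--- linux/include/linux/topology.h.orig
+++ linux/include/linux/topology.h
@@ SD_CPU_INIT
 .cache_nice_tries = 1, \
 .per_cpu_gain = 100, \
- .flags = SD_LOAD_BALANCE \
+ .flags = SD_LOAD_BALANCE \
+ | SD_WAKE_IDLE \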

IMO, I don't think idle cpus should play by the exact same rules as busy ones
when load balancing. I am not saying the only answer is not looking at cache
warmth at all, but maybe a much more relaxed policy.

Also, finding (at boot time) the best cache_hot_time is a step in the right
direction, but I have to wonder if cache_hot() is really doing the right
thing. It looks like all cache_hot does is decide this task is cache hot
because it ran recently. Who's to say the task got cache warm in the first
place? Shouldn't we be looking at both how long ago it ran and the length of
time it ran? Some of these workloads have very high transaction rates, and
in turn have very high context switch rates. I would be surprised if many of
the tasks got enough continuous run time to build up good cache warmth
anyway. I am all for testing cache warmth, but I think we should start
looking at more than just how long ago the task ran.
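
For concreteness, the check in question today is essentially a one-liner
comparing how long ago the task ran against sd->cache_hot_time. A minimal
sketch of the kind of refinement suggested above - the last_run_ns field is
hypothetical, nothing in the current tree maintains it:

static inline int task_hot(struct task_struct *p, unsigned long long now,
			   struct sched_domain *sd)
{
	/* current rule: cache-hot iff the task ran recently enough */
	if ((long long)(now - p->timestamp) >= (long long)sd->cache_hot_time)
		return 0;
	/*
	 * hypothetical refinement: a task that ran recently but only for
	 * a tiny slice never got cache-warm in the first place;
	 * p->last_run_ns would have to be maintained by schedule().
	 */
	if (p->last_run_ns < sd->cache_hot_time / 2)
		return 0;
	return 1;
}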

-Andrew Theurer




Attachments:
(No filename) (2.90 kB)
100-wake_idle-patch.269-rc3 (2.61 kB)
120-new_idle-patch.269-rc3 (627.00 B)
110-can_migrate-patch.269-rc3 (719.00 B)

2004-10-07 18:53:46

by Albert Cahalan

[permalink] [raw]
Subject: Re: Default cache_hot_time value back to 10ms

Con Kolivas writes:

> Should it not be based on the cache flush time? We measure
> that and set the cache_decay_ticks and can base it on that.

Often one must use the time, but...

If the system goes idle for an hour, the last-run
process is still cache-hot.

Many systems let you measure cache line castouts.
Time is a very crude approximation of this.
Memory traffic is a slightly better approximation.


2004-10-08 09:48:59

by Nick Piggin

[permalink] [raw]
Subject: Re: Default cache_hot_time value back to 10ms

Andrew Theurer wrote:

> I'd like to add some comments as well:
>

OK, thanks Andrew. Can I ask that we revisit these after 2.6.9 comes
out? I don't think the situation should be worse than 2.6.8, and
basically 2.6.9 is in bugfix only mode at this point, so I doubt we
could get anything more in even if we wanted to.

Please be sure to bring this up again after 2.6.9. Thanks.

Nick

2004-10-08 14:12:16

by Andrew Theurer

[permalink] [raw]
Subject: Re: Default cache_hot_time value back to 10ms

On Friday 08 October 2004 04:47, Nick Piggin wrote:
> Andrew Theurer wrote:
> > I'd like to add some comments as well:
>
> OK, thanks Andrew. Can I ask that we revisit these after 2.6.9 comes
> out? I don't think the situation should be worse than 2.6.8, and
> basically 2.6.9 is in bugfix only mode at this point, so I doubt we
> could get anything more in even if we wanted to.
>
> Please be sure to bring this up again after 2.6.9. Thanks.

No problem. I was thinking this would be post 2.6.9 or somewhere in -mm.

Andrew Theurer

2004-11-11 15:09:41

by Andrew Theurer

[permalink] [raw]
Subject: Re: [patch] sched: auto-tuning task-migration

> Ingo Molnar wrote on Wednesday, October 06, 2004 1:05 PM
>
> > could you try the replacement patch below - what results does it give?
>
> By the way, I wonder why you chose to round down, but not up.
>
>
> arch cache_decay_nsec: 10000000
> migration cost matrix (cache_size: 9437184, cpu: 1500 MHz):
> [00] [01] [02] [03]
> [00]: 9.1 8.5 8.5 8.5
> [01]: 8.5 9.1 8.5 8.5
> [02]: 8.5 8.5 9.1 8.5
> [03]: 8.5 8.5 8.5 9.1
> min_delta: 8909202
> using cache_decay nsec: 8909202 (8 msec)

I tried this patch on power5. This is a 2 node system, 2 chips (1 per node),
2 cores per chip, 2 siblings per core. Cores share an L2 & L3 cache.

Hard coding 1920KB for the cache size (L2), I get:

migration cost matrix (cache_size: 1966080, cpu: 1656 MHz):
[00] [01] [02] [03] [04] [05] [06] [07]
[00]: 1.3 1.3 1.3 1.3 2.3 1.4 1.4 1.4
[01]: 1.3 1.4 1.3 1.3 1.4 1.4 1.4 1.4
[02]: 1.3 1.4 1.3 1.3 1.4 1.4 1.4 1.4
[03]: 1.4 1.4 1.4 1.3 1.4 1.4 1.4 1.4
[04]: 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3
[05]: 1.3 1.4 1.3 1.3 1.3 1.4 1.4 1.3
[06]: 1.3 1.4 1.4 1.3 1.3 1.3 1.3 1.3
[07]: 1.3 1.3 1.3 1.3 1.3 1.4 1.3 1.3
min_delta: 1422824
using cache_decay nsec: 1422824 (1 msec)

I ran again for L3, but could not vmalloc the whole amount (cache is 36MB).
I tried 19200KB and got:

migration cost matrix (cache_size: 19660800, cpu: 1656 MHz):
[00] [01] [02] [03] [04] [05] [06] [07]
[00]: 16.9 16.8 16.0 16.0 16.7 16.9 16.7 16.9
[01]: 16.0 17.1 16.0 16.0 16.8 16.9 16.7 16.9
[02]: 17.0 17.1 17.0 16.0 16.7 16.9 16.7 16.9
[03]: 17.0 17.1 16.0 16.0 16.7 16.9 16.8 16.9
[04]: 16.0 16.0 16.0 16.9 17.2 17.1 17.2 17.2
[05]: 16.0 16.0 16.0 16.9 17.2 17.2 17.2 17.2
[06]: 16.0 16.0 16.0 16.9 17.2 17.1 17.2 17.2
[07]: 16.0 16.0 16.0 16.9 17.2 17.1 17.2 17.1
min_delta: 17492688
using cache_decay nsec: 17492688 (16 msec)

First, I am going to assume this test is not designed to show the effects of
shared cache. For power5, since cores on the same chip share L2 & L3, I would
conclude that cache_hot_time for level 0 (siblings in a core) and level 1
(cores in a chip) domains should probably be "0". For level 2 domains (all
chips in a system), I guess it needs to be somewhere above 16ms.

We had someone run that online transaction DB workload with 10ms
cache_hot_time on both level1 & 2 domains and performance regressed. If we
get a chance to run again, I will probably try level0: 0ms level1: 0ms
level2: 10-20ms.
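
In initializer terms that experiment would look something like the fragment
below (values in nanoseconds, as in the mainline sched.c initializers quoted
earlier in the thread; the per-level macro names are assumptions based on the
-mm topology code, and the level 2 number is just the guess above):

 /* level 0 (SD_SIBLING_INIT): siblings in a core share all caches */
 .cache_hot_time = 0, \
 /* level 1 (SD_CPU_INIT): power5 cores on a chip share L2 & L3 */
 .cache_hot_time = 0, \
 /* level 2 (SD_NODE_INIT): cross-chip, above the ~16 msec measured */
 .cache_hot_time = (16*1000000), \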

-Andrew Theurer

2005-02-21 05:10:23

by Paul Jackson

[permalink] [raw]
Subject: Re: [patch] sched: auto-tuning task-migration

A long long time ago (Oct 2004) Ingo wrote:
> the following patch adds a new feature to the scheduler: during bootup
> it measures migration costs and sets up cache_hot value accordingly.

Ingo - what became of this patch?

I made a quick search for it in Linus's bk tree and Andrew's *-mm
patches, but didn't find it. Perhaps I didn't know what to look for.

The metric it exposes looks like something I might want to expose to
userland, so the performance guys can begin optimizing the placement of
tasks on CPUs, depending on whether they would benefit from, or be
harmed by, sharing cache. Would the two halves of a hyper threaded core
show up as particularly close on this metric? I presume so.

It seems to me to be a good complement to the current cpu-memory
distance we have now in node_distance() exposing the ACPI 2.0 SLIT
table distances.

I view the two key numbers here as (1) how fast can a cpu get stuff out
of a memory node (an amalgam of bandwidth and latency), and (2) how much
cache and buses and such do two cpus share (which can be good or bad,
depending on whether the two tasks on those two cpus share much of their
cache footprint).

The SLIT table provides (1) just fine. Your patch seems to compute a
sensible estimate of (2).
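
A sketch of how (1) could be dumped for comparison against the matrix from
(2), using only the existing node_distance() API mentioned above (the loop
just mirrors the migration-cost matrix printout in Ingo's patch; it is
illustrative, not proposed code):

	int i, j;

	for_each_online_node(i) {
		printk("[%02d]: ", i);
		for_each_online_node(j)
			/* SLIT-style distance from the ACPI table */
			printk(" %3d", node_distance(i, j));
		printk("\n");
	}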

I had one worry - was there a potential O(N**2) cost in computing this
at boottime, where N is the number of nodes? Us SGI folks are usually
amongst the first to notice such details, when they blow up on us ;).

I never actually saw the original patch -- perhaps if I had, some of
my questions above would have obvious answers.

Thanks. (and thanks for cpus_allowed -- I've practically made a
profession out of building stuff on top of that one ... ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401