2007-05-13 15:40:14

by Ingo Molnar

Subject: [patch] CFS scheduler, -v12


i'm pleased to announce release -v12 of the CFS scheduler patchset.

The CFS patch against v2.6.22-rc1, v2.6.21.1 or v2.6.20.10 can be
downloaded from the usual place:

http://people.redhat.com/mingo/cfs-scheduler/

-v12 fixes the '3D bug' that caused trivial latencies in 3D games: it
turns out that the problem did not result from any core property of
CFS; it was caused by 3D userspace having grown dependent on the current
inefficiency of the vanilla scheduler's sys_sched_yield()
implementation, and CFS's "make yield work well" changes broke it.

Even a simple 3D app like glxgears does a sys_sched_yield() for every
frame it generates (!) on certain 3D cards, which in essence punishes
any scheduler that implements sys_sched_yield() in a sane manner. This
interaction of CFS's yield implementation with this user-space bug could
be the main reason why some testers reported SD to be handling 3D games
better than CFS. (SD uses a yield implementation similar to the vanilla
scheduler.)
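Illustratively, the problematic userspace pattern boils down to a render
loop that calls sched_yield() after every frame instead of blocking on
the vertical retrace or a completion event - a minimal stand-alone
sketch (draw_frame() is just a stub here, not real driver code):

  #include <sched.h>

  static void draw_frame(void)
  {
          /* stand-in for rendering and the buffer swap */
  }

  int main(void)
  {
          for (;;) {
                  draw_frame();
                  sched_yield();  /* the per-frame yield described above */
          }
  }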

So i've added a yield workaround to -v12, which makes it work similarly
to how the vanilla scheduler and SD do it. (Xorg has been notified and
this bug should be fixed there too. This took some time to debug because
the 3D driver i'm using for testing does not use sys_sched_yield().) The
workaround is activated by default so -v12 should work 'out of the box'.

Mike Galbraith has fixed a bug related to nice levels - the fix should
make negative nice levels more potent again.

Changes since -v10:

- nice level calculation fixes (Mike Galbraith)

- load-balancing improvements (this should fix the SMP performance
problem reported by Michael Gerdau)

- remove the sched_sleep_history_max tunable.

- more debugging fields.

- various cleanups, fixlets and code reorganization

As usual, any sort of feedback, bugreport, fix and suggestion is more
than welcome,

Ingo


2007-05-16 02:04:23

by Peter Williams

Subject: Re: [patch] CFS scheduler, -v12

Ingo Molnar wrote:
> i'm pleased to announce release -v12 of the CFS scheduler patchset.
>
> The CFS patch against v2.6.22-rc1, v2.6.21.1 or v2.6.20.10 can be
> downloaded from the usual place:
>
> http://people.redhat.com/mingo/cfs-scheduler/
>
> -v12 fixes the '3D bug' that caused trivial latencies in 3D games: it
> turns out that the problem was not resulting out of any core quality of
> CFS, it was caused by 3D userspace growing dependent on the current
> inefficiency of the vanilla scheduler's sys_sched_yield()
> implementation, and CFS's "make yield work well" changes broke it.
>
> Even a simple 3D app like glxgears does a sys_sched_yield() for every
> frame it generates (!) on certain 3D cards, which in essence punishes
> any scheduler that implements sys_sched_yield() in a sane manner. This
> interaction of CFS's yield implementation with this user-space bug could
> be the main reason why some testers reported SD to be handling 3D games
> better than CFS. (SD uses a yield implementation similar to the vanilla
> scheduler.)
>
> So i've added a yield workaround to -v12, which makes it work similar to
> how the vanilla scheduler and SD does it. (Xorg has been notified and
> this bug should be fixed there too. This took some time to debug because
> the 3D driver i'm using for testing does not use sys_sched_yield().) The
> workaround is activated by default so -v12 should work 'out of the box'.
>
> Mike Galbraith has fixed a bug related to nice levels - the fix should
> make negative nice levels more potent again.
>
> Changes since -v10:
>
> - nice level calculation fixes (Mike Galbraith)
>
> - load-balancing improvements (this should fix the SMP performance
> problem reported by Michael Gerdau)
>
> - remove the sched_sleep_history_max tunable.
>
> - more debugging fields.
>
> - various cleanups, fixlets and code reorganization
>
> As usual, any sort of feedback, bugreport, fix and suggestion is more
> than welcome,

Load balancing appears to be badly broken in this version. When I
started 4 hard spinners on my 2-CPU machine, one ended up on one CPU and
the other 3 on the other CPU, and they stayed there.
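For reference, a "hard spinner" here is just a CPU-bound busy loop that
never sleeps, along the lines of:

  int main(void)
  {
          for (;;)
                  ;       /* burn CPU, never sleep */
  }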

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-16 08:12:23

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v12


* Peter Williams <[email protected]> wrote:

> >As usual, any sort of feedback, bugreport, fix and suggestion is more
> >than welcome,
>
> Load balancing appears to be badly broken in this version. When I
> started 4 hard spinners on my 2 CPU machine one ended up on one CPU
> and the other 3 on the other CPU and they stayed there.

hm, i cannot reproduce this on 4 different SMP boxen, trying various
combinations of SCHED_SMT/MC and other .config options that might make a
difference to balancing. Could you send me your .config?

Ingo

2007-05-16 23:42:37

by Peter Williams

Subject: Re: [patch] CFS scheduler, -v12

Ingo Molnar wrote:
> * Peter Williams <[email protected]> wrote:
>
>>> As usual, any sort of feedback, bugreport, fix and suggestion is more
>>> than welcome,
>> Load balancing appears to be badly broken in this version. When I
>> started 4 hard spinners on my 2 CPU machine one ended up on one CPU
>> and the other 3 on the other CPU and they stayed there.
>
> hm, i cannot reproduce this on 4 different SMP boxen, trying various
> combinations of SCHED_SMT/MC

You may need to try more than once. Testing load balancing is a pain
because there's always a chance you'll get a good result purely by
luck: you need a run of good results to say it's OK, but only one bad
result to say it's broken.

> and other .config options that might make a
> difference to balancing. Could you send me your .config?

Sent separately.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-17 23:45:23

by Peter Williams

Subject: Re: [patch] CFS scheduler, -v12

Ingo Molnar wrote:
> * Peter Williams <[email protected]> wrote:
>
>> Load balancing appears to be badly broken in this version. When I
>> started 4 hard spinners on my 2 CPU machine one ended up on one CPU
>> and the other 3 on the other CPU and they stayed there.
>
> could you try to debug this a bit more?

I've now done this test on a number of kernels: 2.6.21 and 2.6.22-rc1,
with and without CFS, and the problem is always present. It's not
"nice" related as all four tasks are run at nice == 0.

It's possible that this problem has been in the kernel for a while
without being noticed as, even with totally random allocation of tasks
to CPUs (i.e. without any attempt to balance), there's quite a high
probability of the desirable 2/2 split occurring. So one needs to
repeat the test several times to have reasonable assurance that the
problem is not present. I.e. this has the characteristics of an
intermittent bug, with all the debugging problems that introduces.

The probabilities for the 3 split possibilities for random allocation are:

2/2 (the desired outcome) is 3/8 likely,
1/3 is 4/8 likely, and
0/4 is 1/8 likely.
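
For completeness, these follow from counting the 2^4 = 16 equally
likely ways of assigning 4 tasks to 2 CPUs:

  P(2/2)        = C(4,2) / 16            = 6/16 = 3/8
  P(1/3 or 3/1) = (C(4,1) + C(4,3)) / 16 = 8/16 = 4/8
  P(0/4 or 4/0) = (C(4,0) + C(4,4)) / 16 = 2/16 = 1/8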

I'm pretty sure that this problem wasn't present when smpnice went into
the kernel, which is the last time I did a lot of load balance testing.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-18 00:20:03

by Bill Huey

Subject: Re: [patch] CFS scheduler, -v12

On Sun, May 13, 2007 at 05:38:53PM +0200, Ingo Molnar wrote:
> Even a simple 3D app like glxgears does a sys_sched_yield() for every
> frame it generates (!) on certain 3D cards, which in essence punishes
> any scheduler that implements sys_sched_yield() in a sane manner. This
> interaction of CFS's yield implementation with this user-space bug could
> be the main reason why some testers reported SD to be handling 3D games
> better than CFS. (SD uses a yield implementation similar to the vanilla
> scheduler.)
>
> So i've added a yield workaround to -v12, which makes it work similar to
> how the vanilla scheduler and SD does it. (Xorg has been notified and
> this bug should be fixed there too. This took some time to debug because
> the 3D driver i'm using for testing does not use sys_sched_yield().) The
> workaround is activated by default so -v12 should work 'out of the box'.

This is an incorrect analysis. OpenGL has the ability to "yield" after
every frame specifically for the SGI IRIX (React/Pro) frame scheduler
(driven by the system vertical retrace interrupt) so that it can free up
CPU resources for other tasks to run. The problem here is that the yield
behavior is treated generically instead of being tied to a particular
proportional scheduler policy.

The correct solution is for the app to use a directed yield and a policy
that can directly support it, so that OpenGL can guarantee a frame rate
governed by the CPU bandwidth allocated by the scheduler.

Will is working on such a mechanism now.

bill

2007-05-18 01:02:33

by Bill Huey

Subject: Re: [patch] CFS scheduler, -v12

On Thu, May 17, 2007 at 05:18:41PM -0700, Bill Huey wrote:
> On Sun, May 13, 2007 at 05:38:53PM +0200, Ingo Molnar wrote:
> > Even a simple 3D app like glxgears does a sys_sched_yield() for every
> > frame it generates (!) on certain 3D cards, which in essence punishes
> > any scheduler that implements sys_sched_yield() in a sane manner. This
> > interaction of CFS's yield implementation with this user-space bug could
> > be the main reason why some testers reported SD to be handling 3D games
> > better than CFS. (SD uses a yield implementation similar to the vanilla
> > scheduler.)
> >
> > So i've added a yield workaround to -v12, which makes it work similar to
> > how the vanilla scheduler and SD does it. (Xorg has been notified and
> > this bug should be fixed there too. This took some time to debug because
> > the 3D driver i'm using for testing does not use sys_sched_yield().) The
> > workaround is activated by default so -v12 should work 'out of the box'.
>
> This is an incorrect analysis. OpenGL has the ability to "yield" after
> every frame specifically for SGI IRIX (React/Pro) frame scheduler (driven
> by the system vertical retrace interrupt) so that it can free up CPU
> resources for other tasks to run. The problem here is that the yield
> behavior is treated generally instead of specifically to a particular
> proportion scheduler policy.
>
> The correct solution is for the app to use a directed yield and a policy
> that can directly support it so that OpenGL can guaratee a frame rate
> governed by CPU bandwidth allocated by the scheduler.
>
> Will is working on such a mechanism now.

Follow up:

http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/0650/bks/SGI_Developer/books/REACT_PG/sgi_html/ch04.html

bill

2007-05-18 04:13:00

by William Lee Irwin III

Subject: Re: [patch] CFS scheduler, -v12

On Sun, May 13, 2007 at 05:38:53PM +0200, Ingo Molnar wrote:
>> So i've added a yield workaround to -v12, which makes it work similar to
>> how the vanilla scheduler and SD does it. (Xorg has been notified and
>> this bug should be fixed there too. This took some time to debug because
>> the 3D driver i'm using for testing does not use sys_sched_yield().) The
>> workaround is activated by default so -v12 should work 'out of the box'.

On Thu, May 17, 2007 at 05:18:41PM -0700, Bill Huey wrote:
> This is an incorrect analysis. OpenGL has the ability to "yield" after
> every frame specifically for SGI IRIX (React/Pro) frame scheduler (driven
> by the system vertical retrace interrupt) so that it can free up CPU
> resources for other tasks to run. The problem here is that the yield
> behavior is treated generally instead of specifically to a particular
> proportion scheduler policy.
> The correct solution is for the app to use a directed yield and a policy
> that can directly support it so that OpenGL can guaratee a frame rate
> governed by CPU bandwidth allocated by the scheduler.
> Will is working on such a mechanism now.

What? AFAIK the CFS patches already implement directed yields.


-- wli

2007-05-18 07:33:38

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v12


* Bill Huey <[email protected]> wrote:

> On Sun, May 13, 2007 at 05:38:53PM +0200, Ingo Molnar wrote:
> > Even a simple 3D app like glxgears does a sys_sched_yield() for
> > every frame it generates (!) on certain 3D cards, which in essence
> > punishes any scheduler that implements sys_sched_yield() in a sane
> > manner. This interaction of CFS's yield implementation with this
> > user-space bug could be the main reason why some testers reported SD
> > to be handling 3D games better than CFS. (SD uses a yield
> > implementation similar to the vanilla scheduler.)
> >
> > So i've added a yield workaround to -v12, which makes it work
> > similar to how the vanilla scheduler and SD does it. (Xorg has been
> > notified and this bug should be fixed there too. This took some time
> > to debug because the 3D driver i'm using for testing does not use
> > sys_sched_yield().) The workaround is activated by default so -v12
> > should work 'out of the box'.
>
> This is an incorrect analysis. [...]

i'm puzzled, incorrect in specifically what way?

> [...] OpenGL has the ability to "yield" after every frame specifically
> for SGI IRIX (React/Pro) frame scheduler (driven by the system
> vertical retrace interrupt) so that it can free up CPU resources for
> other tasks to run. [...]

what you say makes no sense to me. The majority of Linux 3D apps are
already driven by the vertical retrace interrupt and properly 'yield the
CPU' if they so wish, but this has nothing to do with sys_sched_yield().

> The correct solution is for the app to use a directed yield and a
> policy that can directly support it so that OpenGL can guaratee a
> frame rate governed by CPU bandwidth allocated by the scheduler.
>
> Will is working on such a mechanism now.

i'm even more puzzled. I've added sched_yield_to() to CFS -v6 and it's
been part of CFS since then. I'm curious, on what mechanism is Will
working and have any patches been sent to lkml for discussion?

Ingo

2007-05-18 13:12:17

by Peter Williams

Subject: Re: [patch] CFS scheduler, -v12

Ingo Molnar wrote:
> * Peter Williams <[email protected]> wrote:
>
>> I've now done this test on a number of kernels: 2.6.21 and 2.6.22-rc1
>> with and without CFS; and the problem is always present. It's not
>> "nice" related as the all four tasks are run at nice == 0.
>
> could you try -v13 and did this behavior get better in any way?

It's still there but I've got a theory about what the problem is, and
it's supported by some other tests I've done.

What I'd forgotten is that I had gkrellm running as well as top (to
observe which CPU the tasks were on) at the same time as the spinners
were running. This meant that between them top, gkrellm and X were
using about 2% of the CPU -- not much, but enough to make it possible
that at least one of them was running when the load balancer was trying
to do its thing.

This raises two possibilities: 1. the system looked balanced and 2. the
system didn't look balanced but one of top, gkrellm or X was moved
instead of one of the spinners.

If it's 1 then there's not much we can do about it except say that it
only happens in these strange circumstances. If it's 2 then we may have
to modify the way move_tasks() selects which tasks to move (if we think
that the circumstances warrant it -- I'm not sure that this is the case).

To examine these possibilities I tried two variations of the test.

a. run the spinners at nice == -10 instead of nice == 0. When I did
this the load balancing was perfect on 10 consecutive runs which,
according to my calculations, makes it 99.9999997% certain that this
didn't happen by chance. This supports theory 2 above.

b. run the tests without gkrellm running but use nice == 0 for the
spinners. When I did this the load balancing was mostly perfect but
quite volatile (switching between a 2/2 and 1/3 allocation of spinners
to CPUs); however, the %CPU allocation was quite good, with the spinners
all getting approximately 49% of a CPU each. This also supports theory 2
above and gives weak support to theory 1.

This leaves the question of what to do about it. Given that most CPU
intensive tasks on a real system probably only run for a few tens of
milliseconds, it probably won't matter much in practice except that
a malicious user could exploit it to disrupt a system.

So my opinion is that we probably do need to do something about it but
that it's not urgent.

One thing that might work is to jitter the load balancing interval a
bit. The reason I say this is that one of the characteristics of top
and gkrellm is that they run at a more or less constant interval (and,
in this case, X would also be following this pattern as it's doing
screen updates for top and gkrellm) and this means that it's possible
for the load balancing interval to synchronize with their intervals
which in turn causes the observed problem. A jittered load balancing
interval should break the synchronization. This would certainly be
simpler than trying to change the move_task() logic for selecting which
tasks to move.
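
As a rough illustration of the idea (a plain userspace sketch, not
against any particular kernel tree), the jitter could be as simple as:

  #include <stdlib.h>

  /*
   * Sketch only: vary a periodic balance interval (in ticks) by up to
   * +/- 12.5% so it cannot stay phase-locked with other periodic
   * activity such as top, gkrellm or X screen updates.
   */
  static unsigned long jittered_interval(unsigned long base)
  {
          unsigned long jitter = base >> 3;       /* 12.5% of base */

          if (!jitter)
                  return base;
          /* pick a value uniformly in [base - jitter, base + jitter] */
          return base - jitter + (unsigned long)rand() % (2 * jitter + 1);
  }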

What do you think?
Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-18 13:27:00

by Peter Williams

Subject: Re: [patch] CFS scheduler, -v12

Peter Williams wrote:
> Ingo Molnar wrote:
>> * Peter Williams <[email protected]> wrote:
>>
>>> I've now done this test on a number of kernels: 2.6.21 and 2.6.22-rc1
>>> with and without CFS; and the problem is always present. It's not
>>> "nice" related as the all four tasks are run at nice == 0.
>>
>> could you try -v13 and did this behavior get better in any way?
>
> It's still there but I've got a theory about what the problems is that
> is supported by some other tests I've done.
>
> What I'd forgotten is that I had gkrellm running as well as top (to
> observe which CPU tasks were on) at the same time as the spinners were
> running. This meant that between them top, gkrellm and X were using
> about 2% of the CPU -- not much but enough to make it possible that at
> least one of them was running when the load balancer was trying to do
> its thing.
>
> This raises two possibilities: 1. the system looked balanced and 2. the
> system didn't look balanced but one of top, gkrellm or X was moved
> instead of one of the spinners.
>
> If it's 1 then there's not much we can do about it except say that it
> only happens in these strange circumstances. If it's 2 then we may have
> to modify the way move_tasks() selects which tasks to move (if we think
> that the circumstances warrant it -- I'm not sure that this is the case).
>
> To examine these possibilities I tried two variations of the test.
>
> a. run the spinners at nice == -10 instead of nice == 0. When I did
> this the load balancing was perfect on 10 consecutive runs which
> according to my calculations makes it 99.9999997 certain that this
> didn't happen by chance. This supports theory 2 above.
>
> b. run the tests without gkrellm running but use nice == 0 for the
> spinners. When I did this the load balancing was mostly perfect but was
> quite volatile (switching between a 2/2 and 1/3 allocation of spinners
> to CPUs) but the %CPU allocation was quite good with the spinners all
> getting approximately 49% of a CPU each. This also supports theory 2
> above and gives weak support to theory 1 above.
>
> This leaves the question of what to do about it. Given that most CPU
> intensive tasks on a real system probably only run for a few tens of
> milliseconds it probably won't matter much on a real system except that
> a malicious user could exploit it to disrupt a system.
>
> So my opinion is that we probably do need to do something about it but
> that it's not urgent.
>
> One thing that might work is to jitter the load balancing interval a
> bit. The reason I say this is that one of the characteristics of top
> and gkrellm is that they run at a more or less constant interval (and,
> in this case, X would also be following this pattern as it's doing
> screen updates for top and gkrellm) and this means that it's possible
> for the load balancing interval to synchronize with their intervals
> which in turn causes the observed problem. A jittered load balancing
> interval should break the synchronization. This would certainly be
> simpler than trying to change the move_task() logic for selecting which
> tasks to move.

I should have added that the reason I think this mooted synchronization
is the cause of the problem is that I can think of no other way that
tasks with such low activity (2% between the 3 of them) could cause the
imbalance of the spinner to CPU allocation to be so persistent.

>
> What do you think?


Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-19 13:28:06

by Dmitry Adamushko

Subject: Re: [patch] CFS scheduler, -v12

On 18/05/07, Peter Williams <[email protected]> wrote:
> [...]
> One thing that might work is to jitter the load balancing interval a
> bit. The reason I say this is that one of the characteristics of top
> and gkrellm is that they run at a more or less constant interval (and,
> in this case, X would also be following this pattern as it's doing
> screen updates for top and gkrellm) and this means that it's possible
> for the load balancing interval to synchronize with their intervals
> which in turn causes the observed problem. A jittered load balancing
> interval should break the synchronization. This would certainly be
> simpler than trying to change the move_task() logic for selecting which
> tasks to move.

Just another (quick) idea. Say the load balancer would consider not
only p->load_weight but also something like Tw(task) =
(time_spent_on_runqueue / total_task's_runtime) * some_scale_constant
as an additional "load" component (OTOH, when a task starts, it takes
some time for this parameter to become meaningful). I guess it could
address the scenarios you have described (but maybe break some others
as well :) ...
Any hints on why it's stupid?
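
A rough sketch of what I mean, with the names and the scale factor
purely illustrative (not from any kernel tree):

  #define TW_SCALE 1024UL

  static unsigned long tw(unsigned long time_on_runqueue,
                          unsigned long total_runtime)
  {
          if (!total_runtime)
                  return 0;       /* too early to be meaningful */
          return time_on_runqueue * TW_SCALE / total_runtime;
  }

  /* candidate balancing load for a task: p->load_weight + tw(...) */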


>
> Peter
> --
> Peter Williams [email protected]

--
Best regards,
Dmitry Adamushko

2007-05-20 01:41:58

by Peter Williams

Subject: Re: [patch] CFS scheduler, -v12

Dmitry Adamushko wrote:
> On 18/05/07, Peter Williams <[email protected]> wrote:
>> [...]
>> One thing that might work is to jitter the load balancing interval a
>> bit. The reason I say this is that one of the characteristics of top
>> and gkrellm is that they run at a more or less constant interval (and,
>> in this case, X would also be following this pattern as it's doing
>> screen updates for top and gkrellm) and this means that it's possible
>> for the load balancing interval to synchronize with their intervals
>> which in turn causes the observed problem. A jittered load balancing
>> interval should break the synchronization. This would certainly be
>> simpler than trying to change the move_task() logic for selecting which
>> tasks to move.
>
> Just an(quick) another idea. Say, the load balancer would consider not
> only p->load_weight but also something like Tw(task) =
> (time_spent_on_runqueue / total_task's_runtime) * some_scale_constant
> as an additional "load" component (OTOH, when a task starts, it takes
> some time for this parameter to become meaningful). I guess, it could
> address the scenarios your have described (but maybe break some others
> as well :) ...
> Any hints on why it's stupid?

Well, that is the kind of thing I was hoping to avoid for reasons of
complexity. I think that the actual implementation would be more
complex than it sounds and possibly require multiple runs down the list
of movable tasks, which would be bad for overhead.

Basically, I don't think that the problem is serious enough to warrant a
complex solution. But I may be wrong about how complex the
implementation would be.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-21 08:29:30

by William Lee Irwin III

Subject: Re: [patch] CFS scheduler, -v12

On Sat, May 19, 2007 at 03:27:54PM +0200, Dmitry Adamushko wrote:
> Just an(quick) another idea. Say, the load balancer would consider not
> only p->load_weight but also something like Tw(task) =
> (time_spent_on_runqueue / total_task's_runtime) * some_scale_constant
> as an additional "load" component (OTOH, when a task starts, it takes
> some time for this parameter to become meaningful). I guess, it could
> address the scenarios your have described (but maybe break some others
> as well :) ...
> Any hints on why it's stupid?

I guess I'll take time out from coding to chime in.

cfs should probably consider aggregate lag as opposed to aggregate
weighted load. Mainline's convergence to proper CPU bandwidth
distributions on SMP (e.g. N+1 tasks of equal nice on N cpus) is
incredibly slow and probably also fragile in the presence of arrivals
and departures partly because of this. Tong Li's DWRR repairs the
deficit in mainline by synchronizing epochs or otherwise bounding epoch
dispersion. This doesn't directly translate to cfs. In cfs each cpu
should probably try to figure out whether its aggregate lag (e.g. via
minimax) is above or below average, and push to or pull from the other
half accordingly.


-- wli

2007-05-21 08:57:31

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v12


* William Lee Irwin III <[email protected]> wrote:

> cfs should probably consider aggregate lag as opposed to aggregate
> weighted load. Mainline's convergence to proper CPU bandwidth
> distributions on SMP (e.g. N+1 tasks of equal nice on N cpus) is
> incredibly slow and probably also fragile in the presence of arrivals
> and departures partly because of this. [...]

hm, have you actually tested CFS before coming to this conclusion?

CFS is fair even on SMP. Consider for example the worst-case
3-tasks-on-2-CPUs workload on a 2-CPU box:

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2658 mingo    20   0  1580  248  200 R   67  0.0  0:56.30  loop
 2656 mingo    20   0  1580  252  200 R   66  0.0  0:55.55  loop
 2657 mingo    20   0  1576  248  200 R   66  0.0  0:55.24  loop

66% of CPU time for each task. The 'TIME+' column shows a 2% spread
between the slowest and the fastest loop after just 1 minute of runtime
(and the spread gets narrower with time). Mainline does a 50% / 50% /
100% split:

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3121 mingo    25   0  1584  252  204 R  100  0.0  0:13.11  loop
 3120 mingo    25   0  1584  256  204 R   50  0.0  0:06.68  loop
 3119 mingo    25   0  1584  252  204 R   50  0.0  0:06.64  loop

and i fixed that in CFS.

or consider a sleepy workload like massive_intr, 3-tasks-on-2-CPUs:

europe:~> head -1 /proc/interrupts
CPU0 CPU1

europe:~> ./massive_intr 3 10
002623 00000722
002621 00000720
002622 00000721

Or a 5-tasks-on-2-CPUs workload:

europe:~> ./massive_intr 5 50
002649 00002519
002653 00002492
002651 00002478
002652 00002510
002650 00002478

that's around 1% of spread.

load-balancing is a performance vs. fairness tradeoff so we won't be
able to make it precisely fair because that's hideously expensive on SMP
(barring someone showing a working patch of course) - but in CFS i got
quite close to having it very fair in practice.

> [...] Tong Li's DWRR repairs the deficit in mainline by synchronizing
> epochs or otherwise bounding epoch dispersion. This doesn't directly
> translate to cfs. In cfs cpu should probably try to figure out if its
> aggregate lag (e.g. via minimax) is above or below average, and push
> to or pull from the other half accordingly.

i'd first like to see a demonstration of a problem to solve, before
thinking about more complex solutions ;-)

Ingo

2007-05-21 12:08:20

by William Lee Irwin III

Subject: Re: [patch] CFS scheduler, -v12

* William Lee Irwin III <[email protected]> wrote:
>> cfs should probably consider aggregate lag as opposed to aggregate
>> weighted load. Mainline's convergence to proper CPU bandwidth
>> distributions on SMP (e.g. N+1 tasks of equal nice on N cpus) is
>> incredibly slow and probably also fragile in the presence of arrivals
>> and departures partly because of this. [...]

On Mon, May 21, 2007 at 10:57:03AM +0200, Ingo Molnar wrote:
> hm, have you actually tested CFS before coming to this conclusion?
> CFS is fair even on SMP. Consider for example the worst-case

No. It's mostly a response to Dmitry's suggestion. I've done all of
the benchmark/testcase-writing on mainline.


On Mon, May 21, 2007 at 10:57:03AM +0200, Ingo Molnar wrote:
> 3-tasks-on-2-CPUs workload on a 2-CPU box:
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 2658 mingo 20 0 1580 248 200 R 67 0.0 0:56.30 loop
> 2656 mingo 20 0 1580 252 200 R 66 0.0 0:55.55 loop
> 2657 mingo 20 0 1576 248 200 R 66 0.0 0:55.24 loop

This looks like you've repaired the slow convergence issue mainline
has by other means.


On Mon, May 21, 2007 at 10:57:03AM +0200, Ingo Molnar wrote:
> 66% of CPU time for each task. The 'TIME+' column shows a 2% spread
> between the slowest and the fastest loop after just 1 minute of runtime
> (and the spread gets narrower with time). Mainline does a 50% / 50% /
> 100% split:
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 3121 mingo 25 0 1584 252 204 R 100 0.0 0:13.11 loop
> 3120 mingo 25 0 1584 256 204 R 50 0.0 0:06.68 loop
> 3119 mingo 25 0 1584 252 204 R 50 0.0 0:06.64 loop
> and i fixed that in CFS.

I found that mainline actually converges to the evenly-split shares of
CPU bandwidth, albeit incredibly slowly. Something like an hour is needed.


On Mon, May 21, 2007 at 10:57:03AM +0200, Ingo Molnar wrote:
> or consider a sleepy workload like massive_intr, 3-tasks-on-2-CPUs:
> europe:~> head -1 /proc/interrupts
> CPU0 CPU1
> europe:~> ./massive_intr 3 10
> 002623 00000722
> 002621 00000720
> 002622 00000721
> Or a 5-tasks-on-2-CPS workload:
> europe:~> ./massive_intr 5 50
> 002649 00002519
> 002653 00002492
> 002651 00002478
> 002652 00002510
> 002650 00002478
> that's around 1% of spread.
> load-balancing is a performance vs. fairness tradeoff so we wont be able
> to make it precisely fair because that's hideously expensive on SMP
> (barring someone showing a working patch of course) - but in CFS i got
> quite close to having it very fair in practice.

This is close enough to Libenzi's load generator to mean this particular
issue is almost certainly fixed.


* William Lee Irwin III <[email protected]> wrote:
>> [...] Tong Li's DWRR repairs the deficit in mainline by synchronizing
>> epochs or otherwise bounding epoch dispersion. This doesn't directly
>> translate to cfs. In cfs cpu should probably try to figure out if its
>> aggregate lag (e.g. via minimax) is above or below average, and push
>> to or pull from the other half accordingly.

On Mon, May 21, 2007 at 10:57:03AM +0200, Ingo Molnar wrote:
> i'd first like to see a demonstration of a problem to solve, before
> thinking about more complex solutions ;-)

I have other, more difficult-to-pass testcases. I'm giving up on ipopt
for the quadratic program associated with the \ell^\infty norm and just
pushing out the least-squares solution, since LAPACK is standard enough
for most people to have or easily obtain.

A quick and dirty approximation is to run one task at each nice level in
a range of nice levels and see if the proportions of CPU bandwidth come
out the same on SMP as UP and how quickly they converge. The testcase
is more comprehensive than that, but it's easy enough of a check to see
if there are any issues in this area.
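
A minimal sketch of that quick check (the nice range and step here are
arbitrary; compare each child's accumulated CPU time via top or
/proc/<pid>/stat on UP vs. SMP and watch how quickly the ratios
converge):

  #include <sys/resource.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* fork one hard spinner per nice level and let them compete */
  int main(void)
  {
          int nice_level;

          for (nice_level = 0; nice_level <= 10; nice_level += 2) {
                  if (fork() == 0) {
                          setpriority(PRIO_PROCESS, 0, nice_level);
                          for (;;)
                                  ;       /* hard spinner */
                  }
          }
          pause();        /* parent idles; kill the process group to stop */
          return 0;
  }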


-- wli

2007-05-21 15:26:05

by Dmitry Adamushko

Subject: Re: [patch] CFS scheduler, -v12

On 18/05/07, Peter Williams <[email protected]> wrote:
[...]
> One thing that might work is to jitter the load balancing interval a
> bit. The reason I say this is that one of the characteristics of top
> and gkrellm is that they run at a more or less constant interval (and,
> in this case, X would also be following this pattern as it's doing
> screen updates for top and gkrellm) and this means that it's possible
> for the load balancing interval to synchronize with their intervals
> which in turn causes the observed problem.

Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..
all 4 spinners "tend" to be on CPU0 (and as I understand each gets
~25% approx.?), so there must be plenty of moments for
*idle_balance()* to be called on CPU1 - as gkrellm, top and X consume
together just a few % of CPU. Hence, we should not be that dependent
on the load balancing interval here..

(unlikely conspiracy theory) - idle_balance() and load_balance() (the
latter is dependent on the load balancing interval, which can be in
sync with top/gkrellm activity as you suggest) always move either
top or gkrellm between themselves.. esp. if X is reniced (so it gets
additional "weight") and happens to be active (on CPU1) when
load_balance() (kicked from scheduler_tick()) runs..

p.s. it's mainly theoretical speculation.. I recently started looking
at the load-balancing code (unfortunately, I don't have an SMP machine
which I can upgrade to the recent kernel) and so far for me it's
mainly about making sure I see things sanely.


--
Best regards,
Dmitry Adamushko

2007-05-21 23:51:49

by Peter Williams

Subject: Re: [patch] CFS scheduler, -v12

Dmitry Adamushko wrote:
> On 18/05/07, Peter Williams <[email protected]> wrote:
> [...]
>> One thing that might work is to jitter the load balancing interval a
>> bit. The reason I say this is that one of the characteristics of top
>> and gkrellm is that they run at a more or less constant interval (and,
>> in this case, X would also be following this pattern as it's doing
>> screen updates for top and gkrellm) and this means that it's possible
>> for the load balancing interval to synchronize with their intervals
>> which in turn causes the observed problem.
>
> Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..

No, and I haven't seen one.

> all 4 spinners "tend" to be on CPU0 (and as I understand each gets
> ~25% approx.?), so there must be plenty of moments for
> *idle_balance()* to be called on CPU1 - as gkrellm, top and X consume
> together just a few % of CPU. Hence, we should not be that dependent
> on the load balancing interval here..

The split that I see is 3/1 and neither CPU seems to be favoured with
respect to getting the majority. However, top, gkrellm and X seem to be
always on the CPU with the single spinner. The CPU% reported by top is
approx. 33%, 33%, 33% and 100% for the spinners.

If I renice the spinners to -10 (so that their load weights dominate the
run queue load calculations) the problem goes away: the spinner to CPU
allocation is 2/2 and top reports them all getting approx. 50% each.

It's also worth noting that I've had tests where the allocation started
out 2/2 and the system changed it to 3/1 where it stabilized. So it's
not just a case of bad luck with the initial CPU allocation when the
tasks start and the load balancing failing to fix it (which was one of
my earlier theories).

>
> (unlikely consiparacy theory)

It's not a conspiracy. It's just dumb luck. :-)

> - idle_balance() and load_balance() (the
> later is dependent on the load balancing interval which can be in
> sync. with top/gkerllm activities as you suggest) move always either
> top or gkerllm between themselves.. esp. if X is reniced (so it gets
> additional "weight") and happens to be active (on CPU1) when
> load_balance() (kicked from scheduler_tick()) runs..
>
> p.s. it's mainly theoretical specualtions.. I recently started looking
> at the load-balancing code (unfortunatelly, don't have an SMP machine
> which I can upgrade to the recent kernel) and so far for me it's
> mainly about getting sure I see things sanely.

I'm playing with some jitter experiments at the moment. The amount of
jitter needs to be small (a few tenths of a second) as the
synchronization (if it's happening) is happening at the seconds level:
the intervals for top and gkrellm will be in the 1 to 5 second range (I
guess -- I haven't checked) and the load balancing is every 60 seconds.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-22 04:47:59

by Peter Williams

Subject: Re: [patch] CFS scheduler, -v12

Peter Williams wrote:
> Dmitry Adamushko wrote:
>> On 18/05/07, Peter Williams <[email protected]> wrote:
>> [...]
>>> One thing that might work is to jitter the load balancing interval a
>>> bit. The reason I say this is that one of the characteristics of top
>>> and gkrellm is that they run at a more or less constant interval (and,
>>> in this case, X would also be following this pattern as it's doing
>>> screen updates for top and gkrellm) and this means that it's possible
>>> for the load balancing interval to synchronize with their intervals
>>> which in turn causes the observed problem.
>>
>> Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..
>
> No, and I haven't seen one.
>
>> all 4 spinners "tend" to be on CPU0 (and as I understand each gets
>> ~25% approx.?), so there must be plenty of moments for
>> *idle_balance()* to be called on CPU1 - as gkrellm, top and X consume
>> together just a few % of CPU. Hence, we should not be that dependent
>> on the load balancing interval here..
>
> The split that I see is 3/1 and neither CPU seems to be favoured with
> respect to getting the majority. However, top, gkrellm and X seem to be
> always on the CPU with the single spinner. The CPU% reported by top is
> approx. 33%, 33%, 33% and 100% for the spinners.
>
> If I renice the spinners to -10 (so that there load weights dominate the
> run queue load calculations) the problem goes away and the spinner to
> CPU allocation is 2/2 and top reports them all getting approx. 50% each.

For no good reason other than curiosity, I tried a variation of this
experiment where I reniced the spinners to 10 instead of -10 and, to my
surprise, they were allocated 2/2 to the CPUs on average. I say on
average because the allocations were a little more volatile and
occasionally 0/4 splits would occur but these would last for less than
one top cycle before the 2/2 was re-established. The quickness of these
recoveries would indicate that it was most likely the idle balance
mechanism that restored the balance.

This may point the finger at the tick based load balance mechanism being
too conservative in deciding whether tasks need to be moved. In the case
where the spinners are at nice == 0, the idle balance mechanism never
comes into play as the 0/4 split is never seen, so only the tick based
mechanism is in force, and this is where the anomalies are seen.

This "tick rebalance mechanism only" situation also holds for the nice
== -10 case, but there the high load weights of the spinners overcome
the tick based load balancing mechanism's conservatism: e.g. the
difference in queue loads for a 1/3 split in this case is equivalent to
the difference that would be generated by an imbalance of about 18
nice == 0 spinners, i.e. too big to be ignored.
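
(Roughly: assuming each nice step changes a task's load weight by about
25%, weight(-10) ~= 1.25^10 * weight(0) ~= 9.3 * weight(0); in a 1/3
split the two queue loads differ by two such tasks, i.e. about
2 * 9.3 ~= 18.6 nice == 0 spinners' worth of weight.)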

The evidence seems to indicate that IF a rebalance operation gets
initiated then the right amount of load will get moved.

This new evidence weakens (but does not totally destroy) my
synchronization (a.k.a. conspiracy) theory.

Peter
PS As the total load weight for 4 nice == 10 tasks is only about 40% of
the load weight of a single nice == 0 task, the occasional 0/4 split in
the spinners at nice == 10 case is not unexpected as it would be the
desirable allocation if there were exactly one other running task at
nice == 0.
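(By the same assumption, weight(10) ~= weight(0) / 1.25^10 ~=
weight(0) / 9.3, so four nice == 10 tasks together weigh about
4/9.3 ~= 43%, i.e. roughly 40%, of a single nice == 0 task.)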
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-22 11:53:06

by Dmitry Adamushko

Subject: Re: [patch] CFS scheduler, -v12

On 22/05/07, Peter Williams <[email protected]> wrote:
> > [...]
> > Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..
>
> No, and I haven't seen one.

Well, I just took one of your calculated probabilities as something
you have really observed - (*) below.

"The probabilities for the 3 split possibilities for random allocation are:

2/2 (the desired outcome) is 3/8 likely,
1/3 is 4/8 likely, and
0/4 is 1/8 likely. <-------------------------- (*)
"

> The split that I see is 3/1 and neither CPU seems to be favoured with
> respect to getting the majority. However, top, gkrellm and X seem to be
> always on the CPU with the single spinner. The CPU% reported by top is
> approx. 33%, 33%, 33% and 100% for the spinners.

Yes. That said, idle_balance() is out of work in this case.

> If I renice the spinners to -10 (so that there load weights dominate the
> run queue load calculations) the problem goes away and the spinner to
> CPU allocation is 2/2 and top reports them all getting approx. 50% each.

I wonder what would happen if X gets reniced to -10 instead (and
spinners are at 0).. I guess, something I described in my previous
mail (and dubbed "unlikely cospiracy" :) could happen, i.e. 0/4 and
then idle_balance() comes into play..

ok, I see. You have probably achieved a similar effect with the
spinners being reniced to 10 (but here both "X" and "top" gain
additional "weight" wrt the load balancing).

> I'm playing with some jitter experiments at the moment. The amount of
> jitter needs to be small (a few tenths of a second) as the
> synchronization (if it's happening) is happening at the seconds level as
> the intervals for top and gkrellm will be in the 1 to 5 second range (I
> guess -- I haven't checked) and the load balancing is every 60 seconds.

Hum.. the "every 60 seconds" part puzzles me quite a bit. Looking at
run_rebalance_domains(), I'd say that it's normally overridden by
the following code

        if (time_after(next_balance, sd->last_balance + interval))
                next_balance = sd->last_balance + interval;

the "interval" seems to be *normally* shorter than "60*HZ" (according
to the default params in topology.h).. moreover, in case of the CFS

        if (interval > HZ*NR_CPUS/10)
                interval = HZ*NR_CPUS/10;

so it can't be more than 0.2*HZ in your case (i.e. an interval of at
most 200 ms with HZ=1000).. am I missing something? TIA


>
> Peter

--
Best regards,
Dmitry Adamushko

2007-05-22 12:03:31

by Peter Williams

Subject: Re: [patch] CFS scheduler, -v12

Peter Williams wrote:
> Peter Williams wrote:
>> Dmitry Adamushko wrote:
>>> On 18/05/07, Peter Williams <[email protected]> wrote:
>>> [...]
>>>> One thing that might work is to jitter the load balancing interval a
>>>> bit. The reason I say this is that one of the characteristics of top
>>>> and gkrellm is that they run at a more or less constant interval (and,
>>>> in this case, X would also be following this pattern as it's doing
>>>> screen updates for top and gkrellm) and this means that it's possible
>>>> for the load balancing interval to synchronize with their intervals
>>>> which in turn causes the observed problem.
>>>
>>> Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..
>>
>> No, and I haven't seen one.
>>
>>> all 4 spinners "tend" to be on CPU0 (and as I understand each gets
>>> ~25% approx.?), so there must be plenty of moments for
>>> *idle_balance()* to be called on CPU1 - as gkrellm, top and X consume
>>> together just a few % of CPU. Hence, we should not be that dependent
>>> on the load balancing interval here..
>>
>> The split that I see is 3/1 and neither CPU seems to be favoured with
>> respect to getting the majority. However, top, gkrellm and X seem to
>> be always on the CPU with the single spinner. The CPU% reported by
>> top is approx. 33%, 33%, 33% and 100% for the spinners.
>>
>> If I renice the spinners to -10 (so that there load weights dominate
>> the run queue load calculations) the problem goes away and the spinner
>> to CPU allocation is 2/2 and top reports them all getting approx. 50%
>> each.
>
> For no good reason other than curiosity, I tried a variation of this
> experiment where I reniced the spinners to 10 instead of -10 and, to my
> surprise, they were allocated 2/2 to the CPUs on average. I say on
> average because the allocations were a little more volatile and
> occasionally 0/4 splits would occur but these would last for less than
> one top cycle before the 2/2 was re-established. The quickness of these
> recoveries would indicate that it was most likely the idle balance
> mechanism that restored the balance.
>
> This may point the finger at the tick based load balance mechanism being
> too conservative

The relevant code, find_busiest_group() and find_busiest_queue(), has a
lot of code that is ifdefed by CONFIG_SCHED_MC and CONFIG_SCHED_SMT and,
as these macros were defined in the kernels I was testing with, I built
a kernel with these macros undefined and reran my tests. The
problems/anomalies were not present in 10 consecutive tests on this new
kernel. Even better, on the few occasions that a 3/1 split did occur it
was quickly corrected to 2/2, and top was reporting approx 49% of CPU for
all spinners throughout each of the ten tests.

So all that is required now is an analysis of the code inside the ifdefs
to see why it is causing a problem.

> in when it decides whether tasks need to be moved. In
> the case where the spinners are at nice == 0, the idle balance mechanism
> never comes into play as the 0/4 split is never seen so only the tick
> based mechanism is in force in this case and this is where the anomalies
> are seen.
>
> This tick rebalance mechanism only situation is also true for the nice
> == -10 case but in this case the high load weights of the spinners
> overcomes the tick based load balancing mechanism's conservatism e.g.
> the difference in queue loads for a 1/3 split in this case is the
> equivalent to the difference that would be generated by an imbalance of
> about 18 nice == 0 spinners i.e. too big to be ignored.
>
> The evidence seems to indicate that IF a rebalance operation gets
> initiated then the right amount of load will get moved.
>
> This new evidence weakens (but does not totally destroy) my
> synchronization (a.k.a. conspiracy) theory.

My synchronization theory is now dead.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-22 16:49:17

by Chris Friesen

Subject: Re: [patch] CFS scheduler, -v12

Ingo Molnar wrote:

> CFS is fair even on SMP. Consider for example the worst-case
> 3-tasks-on-2-CPUs workload on a 2-CPU box:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 2658 mingo 20 0 1580 248 200 R 67 0.0 0:56.30 loop
> 2656 mingo 20 0 1580 252 200 R 66 0.0 0:55.55 loop
> 2657 mingo 20 0 1576 248 200 R 66 0.0 0:55.24 loop
>
> 66% of CPU time for each task. The 'TIME+' column shows a 2% spread
> between the slowest and the fastest loop after just 1 minute of runtime
> (and the spread gets narrower with time).

Is there a way in CFS to tune the amount of time over which the load
balancer is fair? (Of course there would be some overhead involved.)

Chris

2007-05-22 20:15:36

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v12


* Chris Friesen <[email protected]> wrote:

> Ingo Molnar wrote:
>
> >CFS is fair even on SMP. Consider for example the worst-case
> >3-tasks-on-2-CPUs workload on a 2-CPU box:
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 2658 mingo 20 0 1580 248 200 R 67 0.0 0:56.30 loop
> > 2656 mingo 20 0 1580 252 200 R 66 0.0 0:55.55 loop
> > 2657 mingo 20 0 1576 248 200 R 66 0.0 0:55.24 loop
> >
> >66% of CPU time for each task. The 'TIME+' column shows a 2% spread
> >between the slowest and the fastest loop after just 1 minute of runtime
> >(and the spread gets narrower with time).
>
> Is there a way in CFS to tune the amount of time over which the load
> balancer is fair? (Of course there would be some overhead involved.)

it should be fair pretty fast (see the 10-second run of massive_intr) -
so it's not 1 minute (if you were worried about that).

Ingo

2007-05-22 20:49:32

by Chris Friesen

Subject: Re: [patch] CFS scheduler, -v12

Ingo Molnar wrote:
> * Chris Friesen <[email protected]> wrote:

>>Is there a way in CFS to tune the amount of time over which the load
>>balancer is fair? (Of course there would be some overhead involved.)

> it should be fair pretty fast (see the 10 seconds run of massive_intr) -
> so it's not 1 minute (if you were worried about that).

Good to know... that's exactly what I was worried about. I work with
guys who really want predictability above all else, then fairness, and
only then performance -- if we can't guarantee a given level of
performance for 5-9s then it's useless to us.

Chris

2007-05-23 00:10:35

by Peter Williams

Subject: Re: [patch] CFS scheduler, -v12

Dmitry Adamushko wrote:
> On 22/05/07, Peter Williams <[email protected]> wrote:
>> > [...]
>> > Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..
>>
>> No, and I haven't seen one.
>
> Well, I just took one of your calculated probabilities as something
> you have really observed - (*) below.
>
> "The probabilities for the 3 split possibilities for random allocation are:
>
> 2/2 (the desired outcome) is 3/8 likely,
> 1/3 is 4/8 likely, and
> 0/4 is 1/8 likely. <-------------------------- (*)
> "

These are the theoretical probabilities for the outcomes based on the
random allocation of 4 tasks to 2 CPUs. There are, in fact, 16
different ways that 4 tasks can be assigned to 2 CPUs. 6 of these
result in a 2/2 split, 8 in a 1/3 split and 2 in a 0/4 split.

>
>> The split that I see is 3/1 and neither CPU seems to be favoured with
>> respect to getting the majority. However, top, gkrellm and X seem to be
>> always on the CPU with the single spinner. The CPU% reported by top is
>> approx. 33%, 33%, 33% and 100% for the spinners.
>
> Yes. That said, idle_balance() is out of work in this case.

Which is why I reported the problem.

>
>> If I renice the spinners to -10 (so that there load weights dominate the
>> run queue load calculations) the problem goes away and the spinner to
>> CPU allocation is 2/2 and top reports them all getting approx. 50% each.
>
> I wonder what would happen if X gets reniced to -10 instead (and
> spinners are at 0).. I guess, something I described in my previous
> mail (and dubbed "unlikely cospiracy" :) could happen, i.e. 0/4 and
> then idle_balance() comes into play..

Probably the same as I observed but it's easier to renice the spinners.

I see the 0/4 split for brief moments if I renice the spinners to 10
instead of -10 but the idle balancer quickly restores it to 2/2.

>
> ok, I see. You have probably achieved a similar effect with the
> spinners being reniced to 10 (but here both "X" and "top" gain
> additional "weight" wrt the load balancing).
>
>> I'm playing with some jitter experiments at the moment. The amount of
>> jitter needs to be small (a few tenths of a second) as the
>> synchronization (if it's happening) is happening at the seconds level as
>> the intervals for top and gkrellm will be in the 1 to 5 second range (I
>> guess -- I haven't checked) and the load balancing is every 60 seconds.
>
> Hum.. the "every 60 seconds" part puzzles me quite a bit. Looking at
> the run_rebalance_domain(), I'd say that it's normally overwritten by
> the following code
>
> if (time_after(next_balance, sd->last_balance + interval))
> next_balance = sd->last_balance + interval;
>
> the "interval" seems to be *normally* shorter than "60*HZ" (according
> to the default params in topology.h).. moreover, in case of the CFS
>
> if (interval > HZ*NR_CPUS/10)
> interval = HZ*NR_CPUS/10;
>
> so it can't be > 0.2 HZ in your case (== once in 200 ms at max with
> HZ=1000).. am I missing something? TIA

No, I did.

But it's all academic as my synchronization theory is now dead -- see
separate e-mail.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-24 07:44:21

by Peter Williams

Subject: Re: [patch] CFS scheduler, -v12

Peter Williams wrote:
> Peter Williams wrote:
>> Peter Williams wrote:
>>> Dmitry Adamushko wrote:
>>>> On 18/05/07, Peter Williams <[email protected]> wrote:
>>>> [...]
>>>>> One thing that might work is to jitter the load balancing interval a
>>>>> bit. The reason I say this is that one of the characteristics of top
>>>>> and gkrellm is that they run at a more or less constant interval (and,
>>>>> in this case, X would also be following this pattern as it's doing
>>>>> screen updates for top and gkrellm) and this means that it's possible
>>>>> for the load balancing interval to synchronize with their intervals
>>>>> which in turn causes the observed problem.
>>>>
>>>> Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..
>>>
>>> No, and I haven't seen one.
>>>
>>>> all 4 spinners "tend" to be on CPU0 (and as I understand each gets
>>>> ~25% approx.?), so there must be plenty of moments for
>>>> *idle_balance()* to be called on CPU1 - as gkrellm, top and X consume
>>>> together just a few % of CPU. Hence, we should not be that dependent
>>>> on the load balancing interval here..
>>>
>>> The split that I see is 3/1 and neither CPU seems to be favoured with
>>> respect to getting the majority. However, top, gkrellm and X seem to
>>> be always on the CPU with the single spinner. The CPU% reported by
>>> top is approx. 33%, 33%, 33% and 100% for the spinners.
>>>
>>> If I renice the spinners to -10 (so that there load weights dominate
>>> the run queue load calculations) the problem goes away and the
>>> spinner to CPU allocation is 2/2 and top reports them all getting
>>> approx. 50% each.
>>
>> For no good reason other than curiosity, I tried a variation of this
>> experiment where I reniced the spinners to 10 instead of -10 and, to
>> my surprise, they were allocated 2/2 to the CPUs on average. I say on
>> average because the allocations were a little more volatile and
>> occasionally 0/4 splits would occur but these would last for less than
>> one top cycle before the 2/2 was re-established. The quickness of
>> these recoveries would indicate that it was most likely the idle
>> balance mechanism that restored the balance.
>>
>> This may point the finger at the tick based load balance mechanism
>> being too conservative
>
> The relevant code, find_busiest_group() and find_busiest_queue(), has a
> lot of code that is ifdefed by CONFIG_SCHED_MC and CONFIG_SCHED_SMT and,
> as these macros were defined in the kernels I was testing with, I built
> a kernel with these macros undefined and reran my tests. The
> problems/anomalies were not present in 10 consecutive tests on this new
> kernel. Even better on the few occasions that a 3/1 split did occur it
> was quickly corrected to 2/2 and top was reporting approx 49% of CPU for
> all spinners throughout each of the ten tests.
>
> So all that is required now is an analysis of the code inside the ifdefs
> to see why it is causing a problem.

Further testing indicates that CONFIG_SCHED_MC is not implicated and
it's CONFIG_SCHED_SMT that's causing the problem. This rules out the
code in find_busiest_group() as it is common to both macros.

I think this makes the scheduling domain parameter values the most
likely cause of the problem. I'm not very familiar with this code, so
I've added those who've modified this code in the last year or so to the
address list of this e-mail.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-24 16:49:16

by Suresh Siddha

Subject: Re: [patch] CFS scheduler, -v12

On Thu, May 24, 2007 at 12:43:58AM -0700, Peter Williams wrote:
>Peter Williams wrote:
>> The relevant code, find_busiest_group() and find_busiest_queue(), has a
>> lot of code that is ifdefed by CONFIG_SCHED_MC and CONFIG_SCHED_SMT and,
>> as these macros were defined in the kernels I was testing with, I built
>> a kernel with these macros undefined and reran my tests. The
>> problems/anomalies were not present in 10 consecutive tests on this new
>> kernel. Even better on the few occasions that a 3/1 split did occur it
>> was quickly corrected to 2/2 and top was reporting approx 49% of CPU for
>> all spinners throughout each of the ten tests.
>>
>> So all that is required now is an analysis of the code inside the ifdefs
>> to see why it is causing a problem.
>
>
>Further testing indicates that CONFIG_SCHED_MC is not implicated and
>it's CONFIG_SCHED_SMT that's causing the problem. This rules out the
>code in find_busiest_group() as it is common to both macros.
>
>I think this makes the scheduling domain parameter values the most
>likely cause of the problem. I'm not very familiar with this code so
>I've added those who've modified this code in the last year or
>so to the
>address of this e-mail.

What platform is this? I remember you mentioned it's a 2-CPU box. Is it
dual-core, dual-package, or one with HT?

thanks,
suresh

2007-05-24 23:23:37

by Peter Williams

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v12

Siddha, Suresh B wrote:
> On Thu, May 24, 2007 at 12:43:58AM -0700, Peter Williams wrote:
>> Peter Williams wrote:
>>> The relevant code, find_busiest_group() and find_busiest_queue(), has a
>>> lot of code that is ifdefed by CONFIG_SCHED_MC and CONFIG_SCHED_SMT and,
>>> as these macros were defined in the kernels I was testing with, I built
>>> a kernel with these macros undefined and reran my tests. The
>>> problems/anomalies were not present in 10 consecutive tests on this new
>>> kernel. Even better on the few occasions that a 3/1 split did occur it
>>> was quickly corrected to 2/2 and top was reporting approx 49% of CPU for
>>> all spinners throughout each of the ten tests.
>>>
>>> So all that is required now is an analysis of the code inside the ifdefs
>>> to see why it is causing a problem.
>>
>> Further testing indicates that CONFIG_SCHED_MC is not implicated and
>> it's CONFIG_SCHED_SMT that's causing the problem. This rules out the
>> code in find_busiest_group() as it is common to both macros.
>>
>> I think this makes the scheduling domain parameter values the most
>> likely cause of the problem. I'm not very familiar with this code so
>> I've added those who've modified this code in the last year or
>> so to the
>> address of this e-mail.
>
> What platform is this? I remember you mentioned its a 2 cpu box. Is it
> dual core or dual package or one with HT?

It's a single CPU HT box i.e. 2 virtual CPUs. "cat /proc/cpuinfo" produces:

processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 3
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping : 4
cpu MHz : 3201.145
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
constant_tsc pni monitor ds_cpl cid xtpr
bogomips : 6403.97
clflush size : 64

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 3
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping : 4
cpu MHz : 3201.145
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
constant_tsc pni monitor ds_cpl cid xtpr
bogomips : 6400.92
clflush size : 64


Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-29 20:49:26

by Suresh Siddha

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v12

On Thu, May 24, 2007 at 04:23:19PM -0700, Peter Williams wrote:
> Siddha, Suresh B wrote:
> > On Thu, May 24, 2007 at 12:43:58AM -0700, Peter Williams wrote:
> > >
> > > Further testing indicates that CONFIG_SCHED_MC is not implicated and
> > > it's CONFIG_SCHED_SMT that's causing the problem. This rules out the
> > > code in find_busiest_group() as it is common to both macros.
> > >
> > > I think this makes the scheduling domain parameter values the most
> > > likely cause of the problem. I'm not very familiar with this code so
> > > I've added those who've modified this code in the last year or
> > > so to the
> > > address of this e-mail.
> >
> > What platform is this? I remember you mentioned its a 2 cpu box. Is it
> > dual core or dual package or one with HT?
>
> It's a single CPU HT box i.e. 2 virtual CPUs. "cat /proc/cpuinfo"
> produces:

Peter, I tried on a similar box and couldn't reproduce this problem
with an x86_64 2.6.22-rc3 kernel using defconfig (which has SCHED_SMT
turned on). I am using top and just the spinners. I don't have gkrellm
running; is that required to reproduce the issue?

I tried a number of times and also in runlevels 3 and 5 (with top running
in an xterm in the case of runlevel 5).

In runlevel 5, occasionally for one refresh of the top screen, I see three
spinners on one CPU and one spinner on the other (with X or some other app
also on the CPU with the single spinner). But it balances nicely by the
very next refresh of the top screen.

I tried various refresh rates of top too. Do you see the issue
at runlevel 3 as well?

thanks,
suresh

2007-05-29 23:54:45

by Peter Williams

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v12

Siddha, Suresh B wrote:
> On Thu, May 24, 2007 at 04:23:19PM -0700, Peter Williams wrote:
>> Siddha, Suresh B wrote:
>>> On Thu, May 24, 2007 at 12:43:58AM -0700, Peter Williams wrote:
>>>> Further testing indicates that CONFIG_SCHED_MC is not implicated and
>>>> it's CONFIG_SCHED_SMT that's causing the problem. This rules out the
>>>> code in find_busiest_group() as it is common to both macros.
>>>>
>>>> I think this makes the scheduling domain parameter values the most
>>>> likely cause of the problem. I'm not very familiar with this code so
>>>> I've added those who've modified this code in the last year or
>>>> so to the
>>>> address of this e-mail.
>>> What platform is this? I remember you mentioned its a 2 cpu box. Is it
>>> dual core or dual package or one with HT?
>> It's a single CPU HT box i.e. 2 virtual CPUs. "cat /proc/cpuinfo"
>> produces:
>
> Peter, I tried on a similar box and couldn't reproduce this problem
> with x86_64

Mine's a 32-bit machine.

> 2.6.22-rc3 kernel

I haven't tried rc3 yet.

> and using defconfig(has SCHED_SMT turned on).
> I am using top and just the spinners. I don't have gkrellm running, is that
> required to reproduce the issue?

Not necessarily. But you may need to do a number of trials as sheer
chance plays a part.

>
> I tried number of times and also in runlevels 3,5(with top running
> in a xterm incase of runlevel 5).

I've always done it in runlevel 5 using gnome-terminal. I treat 10
consecutive trials without seeing the problem as an indication of its
absence, but will cut that short if I see a 3/1 split which quickly
recovers (see below).

>
> In runlevel 5, occasionally for one refresh screen of top, I see three
> spinners on one cpu and one spinner on other(with X or someother app
> also on the cpu with one spinner). But it balances nicely for the
> immd next refresh of the top screen.

Yes, that (the fact that it recovers quickly) confirms that the problem
isn't present on your system. If load balancing occurs while tasks other
than the spinners are actually running, a 1/3 split for the spinners is a
reasonable outcome, so seeing the occasional 1/3 split is OK; but it
should return to 2/2 as soon as the other tasks sleep.

When I'm doing my tests (for the various combinations of macros) I
always count a case where I see a 3/1 split that quickly recovers as
proof that this problem isn't present for that case and cease testing.

>
> I tried with various refresh rates of top too.. Do you see the issue
> at runlevel 3 too?

I haven't tried that.

Do your spinners ever relinquish the CPU voluntarily?

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-30 00:54:21

by Suresh Siddha

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v12

On Tue, May 29, 2007 at 04:54:29PM -0700, Peter Williams wrote:
> > I tried with various refresh rates of top too.. Do you see the issue
> > at runlevel 3 too?
>
> I haven't tried that.
>
> Do your spinners ever relinquish the CPU voluntarily?

Nope. Simple and plain while(1);'s.
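
(That is, each spinner amounts to the following, a minimal sketch of the
sort of test program being described here, not something actually posted
in this thread:)

/* spinner.c - a plain busy loop, per the "while(1);" description above. */
int main(void)
{
	while (1)
		;	/* burn CPU: never sleep, never yield */
}

Four of these started in the background give the load pattern under
discussion.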

I can try a 32-bit kernel to check.

thanks,
suresh

2007-05-30 02:18:36

by Peter Williams

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v12

Siddha, Suresh B wrote:
> On Tue, May 29, 2007 at 04:54:29PM -0700, Peter Williams wrote:
>>> I tried with various refresh rates of top too.. Do you see the issue
>>> at runlevel 3 too?
>> I haven't tried that.
>>
>> Do your spinners ever relinquish the CPU voluntarily?
>
> Nope. Simple and plain while(1); 's
>
> I can try 32-bit kernel to check.

Don't bother. I just checked 2.6.22-rc3 and the problem is not present,
which means something between rc2 and rc3 has fixed the problem. I hate
it when problems (appear to) fix themselves, as it usually means they're
just hiding.

I didn't see any patches between rc2 and rc3 that were likely to have
fixed this (but that doesn't mean there wasn't one). I'm wondering
whether I should do a git bisect to see if I can find where it got fixed.

Could you see if you can reproduce it on 2.6.22-rc2?

Thanks
Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-30 04:45:51

by Suresh Siddha

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v12

On Tue, May 29, 2007 at 07:18:18PM -0700, Peter Williams wrote:
> Siddha, Suresh B wrote:
> > I can try 32-bit kernel to check.
>
> Don't bother. I just checked 2.6.22-rc3 and the problem is not present
> which means something between rc2 and rc3 has fixed the problem. I hate
> it when problems (appear to) fix themselves as it usually means they're
> just hiding.
>
> I didn't see any patches between rc2 and rc3 that were likely to have
> fixed this (but doesn't mean there wasn't one). I'm wondering whether I
> should do a git bisect to see if I can find where it got fixed?
>
> Could you see if you can reproduce it on 2.6.22-rc2?

No. I just tried the 2.6.22-rc2 64-bit version at runlevel 3 on my remote
system at the office. 15 attempts didn't show the issue.

Sure that nothing changed in your test setup?

More experiments tomorrow morning.

thanks,
suresh

2007-05-30 06:28:38

by Peter Williams

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v12

Siddha, Suresh B wrote:
> On Tue, May 29, 2007 at 07:18:18PM -0700, Peter Williams wrote:
>> Siddha, Suresh B wrote:
>>> I can try 32-bit kernel to check.
>> Don't bother. I just checked 2.6.22-rc3 and the problem is not present
>> which means something between rc2 and rc3 has fixed the problem. I hate
>> it when problems (appear to) fix themselves as it usually means they're
>> just hiding.
>>
>> I didn't see any patches between rc2 and rc3 that were likely to have
>> fixed this (but doesn't mean there wasn't one). I'm wondering whether I
>> should do a git bisect to see if I can find where it got fixed?
>>
>> Could you see if you can reproduce it on 2.6.22-rc2?
>
> No. Just tried 2.6.22-rc2 64-bit version at runlevel 3 on my remote
> system at office. 15 attempts didn't show the issue.
>
> Sure that nothing changed in your test setup?
>

I just rechecked with an old kernel and the problem was still there.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-31 01:50:24

by Peter Williams

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v12

Siddha, Suresh B wrote:
> On Tue, May 29, 2007 at 07:18:18PM -0700, Peter Williams wrote:
>> Siddha, Suresh B wrote:
>>> I can try 32-bit kernel to check.
>> Don't bother. I just checked 2.6.22-rc3 and the problem is not present
>> which means something between rc2 and rc3 has fixed the problem. I hate
>> it when problems (appear to) fix themselves as it usually means they're
>> just hiding.
>>
>> I didn't see any patches between rc2 and rc3 that were likely to have
>> fixed this (but doesn't mean there wasn't one). I'm wondering whether I
>> should do a git bisect to see if I can find where it got fixed?
>>
>> Could you see if you can reproduce it on 2.6.22-rc2?
>
> No. Just tried 2.6.22-rc2 64-bit version at runlevel 3 on my remote
> system at office. 15 attempts didn't show the issue.
>
> Sure that nothing changed in your test setup?
>
> More experiments tomorrow morning..

I've finished bisecting, and the patch at which things appear to improve
is cd5477911fc9f5cc64678e2b95cdd606c59a11b5, which is in the middle of a
bunch of patches reorganizing the link phase of the build. The patch
description is:

kbuild: add "Section mismatch" warning whitelist for powerpc
author Li Yang <[email protected]>
Mon, 14 May 2007 10:04:28 +0000 (18:04 +0800)
committer Sam Ravnborg <[email protected]>
Sat, 19 May 2007 07:11:57 +0000 (09:11 +0200)
commit cd5477911fc9f5cc64678e2b95cdd606c59a11b5
tree d893f07b0040d36dfc60040dc695384e9afcf103
parent f892b7d480eec809a5dfbd6e65742b3f3155e50e
kbuild: add "Section mismatch" warning whitelist for powerpc

This patch fixes the following class of "Section mismatch" warnings when
building powerpc platforms.

WARNING: arch/powerpc/kernel/built-in.o - Section mismatch: reference to
.init.data:.got2 from prom_entry (offset 0x0)
WARNING: arch/powerpc/platforms/built-in.o - Section mismatch: reference
to .init.text:mpc8313_rdb_probe from .machine.desc after
'mach_mpc8313_rdb' (at offset 0x4)
....

Signed-off-by: Li Yang <[email protected]>
Signed-off-by: Sam Ravnborg <[email protected]>

scripts/mod/modpost.c

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce