2007-11-09 22:34:27

by Micah Elizabeth Scott

Subject: High priority tasks break SMP balancer?

I've been investigating a problem recently, in which N runnable
CPU-bound tasks on an N-way machine run on only N-1 CPUs. The
remaining CPU is almost 100% idle. I have seen it occur with both the
CFS and O(1) schedulers.

I've traced this down to what seems to be a quirk in the SMP balancer,
whereby a high-priority thread which spends most of its time sleeping
can artificially inflate the CPU load average calculated for one
processor. Most of the time this CPU is idle (nr_running==0) yet its
CPU load average is much higher than that of any other CPU.

Please find attached a sample program which demonstrates this
behaviour on a 2-way SMP machine. It creates three threads: two are
CPU bound and run at the default priority, the third spends most of
its time sleeping and runs at an elevated priority. It wakes up
frequently (using /dev/rtc) and randomly generates some CPU load.
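
In case the attachment gets mangled, its overall structure is roughly
the following. This is only a minimal sketch, not the attached
priosched.c itself; the function names are made up and the constants
(nice -20, 1024 Hz, ~5% load) are merely representative:

#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/resource.h>
#include <linux/rtc.h>

/* Two of these run at the default priority and just burn CPU. */
static void *busy_loop(void *arg)
{
    volatile unsigned long x = 0;
    for (;;)
        x++;
    return NULL;
}

int main(void)
{
    pthread_t hog[2];
    int i, rtc = open("/dev/rtc", O_RDONLY);

    for (i = 0; i < 2; i++)
        pthread_create(&hog[i], NULL, busy_loop, NULL);

    /* The main thread is the mostly-sleeping, elevated-priority one. */
    setpriority(PRIO_PROCESS, 0, -20);
    ioctl(rtc, RTC_IRQP_SET, 1024);     /* wake at ~1024 Hz */
    ioctl(rtc, RTC_PIE_ON, 0);

    for (;;) {
        unsigned long data;
        volatile unsigned long spin;

        read(rtc, &data, sizeof data);  /* sleep until the next RTC tick */
        if (rand() % 100 < 5)           /* occasionally generate some load */
            for (spin = 0; spin < 500000; spin++)
                ;
    }
    return 0;
}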

On my machine (2-way Opteron with a vanilla 2.6.23.1 kernel) this test
program will reliably put the scheduler into a state where one CPU has
both of the busy-looping processes in its runqueue, and the other CPU
is usually idle. The usually-idle CPU will have a very high cpu_load,
as reported by /proc/sched_debug.

Your mileage may vary. On some machines, this test program will only
enter the "bad" state for a few seconds. Sometimes we bounce back and
forth between good and bad states every few seconds. In all cases,
removing the priority elevation fixes the balancing problem.

Is this a behaviour any of the scheduler developers are aware of? I
would be very grateful if anyone could shed some light on the root
cause behind the inflated cpu_load average. If this turns out to be a
real bug, I would be happy to work on a patch.

Thanks in advance,
Micah Dowty


Attachments:
priosched.c (3.98 kB)

2007-11-09 23:55:37

by Cyrus Massoumi

Subject: Re: High priority tasks break SMP balancer?

Hi Micah

> On my machine (2-way Opteron with a vanilla 2.6.23.1 kernel) this test
> program will reliably put the scheduler into a state where one CPU has
> both of the busy-looping processes in its runqueue, and the other CPU
> is usually idle. The usually-idle CPU will have a very high cpu_load,
> as reported by /proc/sched_debug.

I tried your program on my machine (C2D, 2.6.17, O(1) scheduler).

Both CPUs are 100% busy all the time. Each busy-looping thread is
running on its own CPU. I've been watching top output for 10 minutes,
the spreading is stable and the threads don't bounce at all.


greetings
Cyrus


2007-11-10 00:11:16

by Micah Elizabeth Scott

Subject: Re: High priority tasks break SMP balancer?

On Sat, Nov 10, 2007 at 12:56:07AM +0100, Cyrus Massoumi wrote:
> I tried your program on my machine (C2D, 2.6.17, O(1) scheduler).
>
> Both CPUs are 100% busy all the time. Each busy-looping thread is running
> on its own CPU. I've been watching top output for 10 minutes, the spreading
> is stable and the threads don't bounce at all.

As I said, YMMV. I haven't been able to find a single set of
parameters for the demo program which causes the problem to occur 100%
of the time on all systems.

In general, boosting the MAINTHREAD_PRIORITY even more and increasing
the WAKE_HZ should exaggerate the problem. These parameters reproduce
the problem very reliably on my system:

#define NUM_BUSY_THREADS 2
#define MAINTHREAD_PRIORITY -20
#define MAINTHREAD_WAKE_HZ 1024
#define MAINTHREAD_LOAD_PERCENT 5
#define MAINTHREAD_LOAD_CYCLES 2

It's also possible this problem doesn't occur on 2.6.17. I have only
tested this example on 2.6.23.1 and 2.6.20 so far.

Thanks,
--Micah

2007-11-14 18:39:41

by Micah Elizabeth Scott

Subject: Re: High priority tasks break SMP balancer?

On Fri, Nov 09, 2007 at 04:11:03PM -0800, Micah Dowty wrote:
> It's also possible this problem doesn't occur on 2.6.17. I have only
> tested this example on 2.6.23.1 and 2.6.20 so far.

I tested a handful of kernel versions, and it looks like this is
indeed the case. As far as I can tell, this problem was introduced
between 2.6.19.7 and 2.6.20.

Looking at the diff between those versions, it looks like there were
some noteworthy changes to the SMP balancer. I'm still working on
narrowing this down further.

Thanks,
--Micah

2007-11-15 18:49:18

by Kyle Moffett

Subject: Re: High priority tasks break SMP balancer?

First of all, since Ingo Molnar seems to be one of the head scheduler
gurus, you might CC him on this. Also added a couple other useful
CCs for regression reports.

On Nov 09, 2007, at 19:11:03, Micah Dowty wrote:
> As I said, YMMV. I haven't been able to find a single set of
> parameters for the demo program which cause the problem to occur
> 100% of the time on all systems.
>
> In general, boosting the MAINTHREAD_PRIORITY even more and
> increasing the WAKE_HZ should exaggerate the problem. These
> parameters reproduce the problem very reliably on my system:
>
> #define NUM_BUSY_THREADS 2
> #define MAINTHREAD_PRIORITY -20
> #define MAINTHREAD_WAKE_HZ 1024
> #define MAINTHREAD_LOAD_PERCENT 5
> #define MAINTHREAD_LOAD_CYCLES 2

Well, from these statistics: if you are requesting wakeups that often
then it is probably *not* correct to try to move another thread to
that CPU in the meantime. Essentially the migration cost will
likely far outweigh the advantage of letting it run a little bit of
extra time, and in addition it will dump out cache from the
high-priority thread. As per the description, I think that an
increased priority and an increased WAKE_HZ will certainly cause the
"problem" to occur more, simply because it reduces the time between
wakeups of the high-priority process and makes it less helpful to
migrate another process over to that CPU during the sleep periods.
This will also depend on your hardware and possibly other
configuration parameters.
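
For a rough sense of scale with those parameters (assuming the 5% of
load is spread evenly across the wakeups): at MAINTHREAD_WAKE_HZ =
1024 the thread wakes about every 1/1024 s, i.e. roughly every 0.98
ms, and at MAINTHREAD_LOAD_PERCENT = 5 it only runs on the order of 50
microseconds per wakeup. The idle gap between wakeups is therefore no
bigger than a single scheduler tick (1-10 ms depending on HZ), so
there is very little window in which a migrated CPU hog could run
before being preempted again.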

I'm not really that much of an expert in this particular area,
though, so it's entirely possible that one of the above-mentioned
scheduler head-honchos will poke holes in my argument and give a
better explanation or a possible patch.

Cheers,
Kyle Moffett

2007-11-15 19:14:20

by Micah Elizabeth Scott

Subject: Re: High priority tasks break SMP balancer?

On Thu, Nov 15, 2007 at 01:48:20PM -0500, Kyle Moffett wrote:
>> In general, boosting the MAINTHREAD_PRIORITY even more and increasing the
>> WAKE_HZ should exaggerate the problem. These parameters reproduce the
>> problem very reliably on my system:
>>
>> #define NUM_BUSY_THREADS 2
>> #define MAINTHREAD_PRIORITY -20
>> #define MAINTHREAD_WAKE_HZ 1024
>> #define MAINTHREAD_LOAD_PERCENT 5
>> #define MAINTHREAD_LOAD_CYCLES 2
>
> Well from these statistics; if you are requesting wakeups that often then
> it is probably *not* correct to try to move another thread to that CPU in
> the mean-time. Essentially the migration cost will likely far outweigh the
> advantage of letting it run a little bit of extra time, and in addition it
> will dump out cache from the high-priority thread. As per the description
> I think that an increased a priority and increased WAKE_HZ will certainly
> cause the "problem" to occur more, simply because it reduces the time
> between wakeups of the high-priority process and makes it less helpful to
> migrate another process over to that CPU during the sleep periods. This
> will also depend on your hardware and possibly other configuration
> parameters.

The real problem, though, is that the high priority thread only needs
a total of a few percent worth of CPU time. The behaviour which I'm
reporting as a potential bug is that this process which needs very
little CPU time is effectively getting an entire CPU to itself,
despite the fact that other CPU-bound threads could benefit from
having time on that CPU.

One could argue that a thread with a high enough priority *should* get
a CPU all to itself even if it won't use that CPU, but that isn't the
behaviour I want. If this is in fact the intended effect of having a
high-priority thread wake very frequently, I should start a different
discussion about how to solve my specific problem without the use of
elevated priorities :)

I don't have any reason to believe, though, that this behaviour was
intentional. I just finished my "git bisect" run, and I found the
first commit after which I can reproduce the problem:

c9819f4593e8d052b41a89f47140f5c5e7e30582 is first bad commit
commit c9819f4593e8d052b41a89f47140f5c5e7e30582
Author: Christoph Lameter <[email protected]>
Date: Sun Dec 10 02:20:25 2006 -0800

[PATCH] sched: use softirq for load balancing

Call rebalance_tick (renamed to run_rebalance_domains) from a newly introduced
softirq.

We calculate the earliest time for each layer of sched domains to be rescanned
(this is the rescan time for idle) and use the earliest of those to schedule
the softirq via a new field "next_balance" added to struct rq.

Signed-off-by: Christoph Lameter <[email protected]>
Cc: Peter Williams <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: "Siddha, Suresh B" <[email protected]>
Cc: "Chen, Kenneth W" <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

:040000 040000 2b66e4b500403869cf5925367b1ddcbb63948d01 33a27ddb470473f129a78ce310cfc59a605e173b M include
:040000 040000 939b60deffb2af2689b4aab63e21ff6c98a3b782 dd3bf32eea9556d5a099db129adc048396368adc M kernel

> I'm not really that much of an expert in this particular area, though, so
> it's entirely possible that one of the above-mentioned scheduler
> head-honchos will poke holes in my argument and give a better explanation
> or a possible patch.

Thanks! I've also CC'ed Christoph.

For reference, the exact test I used with git-bisect is attached. The
C program (priosched) starts two busy-looping threads and a
high-priority high-frequency thread which uses relatively little
CPU. The Python program repeatedly starts the C program, runs it for a
half second, and measures the resulting imbalance in CPU usage. On
kernels prior to the above commit, this reports values within about
10% of 1.0. On later kernels, it crashes within a couple iterations
due to a divide-by-zero error :)

Thanks again,
--Micah


Attachments:
test-priosched.py (711.00 B)
priosched.c (3.98 kB)

2007-11-15 20:07:56

by Christoph Lameter

Subject: Re: High priority tasks break SMP balancer?

On Thu, 15 Nov 2007, Micah Dowty wrote:

> For reference, the exact test I used with git-bisect is attached. The
> C program (priosched) starts two busy-looping threads and a
> high-priority high-frequency thread which uses relatively little
> CPU. The Python program repeatedly starts the C program, runs it for a
> half second, and measures the resulting imbalance in CPU usage. On
> kernels prior to the above commit, this reports values within about
> 10% of 1.0. On later kernels, it crashes within a couple iterations
> due to a divide-by-zero error :)

The kernel crashes? Sounds like your application crashes with a divide by
zero?


2007-11-15 20:24:35

by Micah Elizabeth Scott

Subject: Re: High priority tasks break SMP balancer?

On Thu, Nov 15, 2007 at 12:07:47PM -0800, Christoph Lameter wrote:
> On Thu, 15 Nov 2007, Micah Dowty wrote:
>
> > For reference, the exact test I used with git-bisect is attached. The
> > C program (priosched) starts two busy-looping threads and a
> > high-priority high-frequency thread which uses relatively little
> > CPU. The Python program repeatedly starts the C program, runs it for a
> > half second, and measures the resulting imbalance in CPU usage. On
> > kernels prior to the above commit, this reports values within about
> > 10% of 1.0. On later kernels, it crashes within a couple iterations
> > due to a divide-by-zero error :)
>
> The kernel crashes? Sounds like your application crashes with a divide by
> zero?

Yes, the Python test harness crashes, not the kernel. It's just
because on a kernel which exhibits this SMP balancer bug, within a
couple of test iterations I'll hit a case where cpu1 was almost
totally idle and the test harness divides by zero when calculating the
imbalance.

--Micah

2007-11-15 21:29:10

by Christoph Lameter

Subject: Re: High priority tasks break SMP balancer?

On Thu, 15 Nov 2007, Micah Dowty wrote:

> Yes, the Python test harness crashes, not the kernel. It's just
> because on a kernel which exhibits this SMP balancer bug, within a
> couple of test iterations I'll hit a case where cpu1 was almost
> totally idle and the test harness divides by zero when calculating the
> imbalance.

I still have some problems understanding what you are complaining about.
The patch that you bisected had some issues that were fixed later by
Siddha.

2007-11-15 21:35:39

by Micah Elizabeth Scott

Subject: Re: High priority tasks break SMP balancer?

On Thu, Nov 15, 2007 at 01:28:55PM -0800, Christoph Lameter wrote:
> On Thu, 15 Nov 2007, Micah Dowty wrote:
>
> > Yes, the Python test harness crashes, not the kernel. It's just
> > because on a kernel which exhibits this SMP balancer bug, within a
> > couple of test iterations I'll hit a case where cpu1 was almost
> > totally idle and the test harness divides by zero when calculating the
> > imbalance.
>
> I still have some problem understanding what you are complaining about?
> The patch that you bisected had some issues that were fixed later by
> Siddha.

On all kernels I've tested from after your patch was committed, I can
reproduce a problem where a single high-priority thread which wakes up
very frequently can artificially inflate the SMP balancer's load
average for one CPU, causing other tasks to be migrated off that
CPU. The result is that this high-priority thread (which may only use
a few percent CPU) gets an entire CPU to itself. Even if there are
several busy-looping threads running, this CPU will be mostly idle.

In the first email I CC'ed you on, I attached a test program that I've
been able to use to reproduce the issue. Feel free to ignore the
Python program; it's just a harness I was using to test for the
problem quickly during my git-bisect run. The relevant code is all in
priosched.c.

Thanks,
--Micah

2007-11-16 02:32:00

by Christoph Lameter

Subject: Re: High priority tasks break SMP balancer?

On Thu, 15 Nov 2007, Micah Dowty wrote:

> On all kernels I've tested from after your patch was committed, I can
> reproduce a problem where a single high-priority thread which wakes up
> very frequently can artificially inflate the SMP balancer's load
> average for one CPU, causing other tasks to be migrated off that
> CPU. The result is that this high-priority thread (which may only use
> a few percent CPU) gets an entire CPU to itself. Even if there are
> several busy-looping threads running, this CPU will be mostly idle.

I am a bit at a loss as to how this could relate to the patch. This looks
like a load balance logic issue that causes the load calculation to go
wrong?


2007-11-16 02:44:21

by Micah Elizabeth Scott

Subject: Re: High priority tasks break SMP balancer?

On Thu, Nov 15, 2007 at 06:31:49PM -0800, Christoph Lameter wrote:
> On Thu, 15 Nov 2007, Micah Dowty wrote:
>
> > On all kernels I've tested from after your patch was committed, I can
> > reproduce a problem where a single high-priority thread which wakes up
> > very frequently can artificially inflate the SMP balancer's load
> > average for one CPU, causing other tasks to be migrated off that
> > CPU. The result is that this high-priority thread (which may only use
> > a few percent CPU) gets an entire CPU to itself. Even if there are
> > several busy-looping threads running, this CPU will be mostly idle.
>
> I am a bit at a loss as to how this could relate to the patch. This looks
> like a load balance logic issue that causes the load calculation to go
> wrong?

My best guess is that this has something to do with the timing with
which we sample the CPU's instantaneous load when calculating the load
averages.. but I still understand only the basics of the scheduler and
SMP balancer. All I really know for sure at this point regarding your
patch is that git-bisect found it for me.

It almost seems like the load average algorithm is ignoring the CPU's
idle time, and only accounting for the time that CPU spends running
processes. One of the symptoms is that the mostly-idle CPU in my test
has an instantaneous load which is usually zero, but a very high load
average. (9000, 30000, etc.)

I want to help get to the bottom of this issue, but I was hoping that
someone experienced with the Linux scheduler and SMP balancer would
have some insight or some suggestions about what to try next.

Thanks,
--Micah

2007-11-16 06:07:29

by Ingo Molnar

Subject: Re: High priority tasks break SMP balancer?


* Micah Dowty <[email protected]> wrote:

> > I am a bit at a loss as to how this could relate to the patch. This
> > looks like a load balance logic issue that causes the load
> > calculation to go wrong?
>
> My best guess is that this has something to do with the timing with
> which we sample the CPU's instantaneous load when calculating the load
> averages.. but I still understand only the basics of the scheduler and
> SMP balancer. All I really know for sure at this point regarding your
> patch is that git-bisect found it for me.

hm, your code uses timeouts for this, right? The CPU load average that
is used for SMP load balancing is sampled from the scheduler tick - and
has been sampled from the scheduler tick for eons. v2.6.23 defaulted to
a different method but v2.6.24 samples it from the tick again. So my
guess is, your test code behaves similarly on 2.6.22 too, correct?

Ingo

2007-11-16 09:19:26

by Micah Elizabeth Scott

Subject: Re: High priority tasks break SMP balancer?

On Fri, Nov 16, 2007 at 07:07:00AM +0100, Ingo Molnar wrote:
> > My best guess is that this has something to do with the timing with
> > which we sample the CPU's instantaneous load when calculating the load
> > averages.. but I still understand only the basics of the scheduler and
> > SMP balancer. All I really know for sure at this point regarding your
> > patch is that git-bisect found it for me.
>
> hm, your code uses timeouts for this, right? The CPU load average that
> is used for SMP load balancing is sampled from the scheduler tick - and
> has been sampled from the scheduler tick for eons. v2.6.23 defaulted to
> a different method but v2.6.24 samples it from the tick again. So my
> guess is, your testcode behave similarly on 2.6.22 too, correct?

My sample program (and the VMware code which I originally observed
this behaviour with) wakes up frequently using /dev/rtc.

I know the problem occurs on 2.6.20 and 2.6.23.1. I haven't tried
2.6.22 yet, but I'll do so as soon as I can.

Thanks,
--Micah

2007-11-16 10:46:30

by Ingo Molnar

Subject: Re: High priority tasks break SMP balancer?


* Micah Dowty <[email protected]> wrote:

> On Fri, Nov 16, 2007 at 07:07:00AM +0100, Ingo Molnar wrote:
> > > My best guess is that this has something to do with the timing with
> > > which we sample the CPU's instantaneous load when calculating the load
> > > averages.. but I still understand only the basics of the scheduler and
> > > SMP balancer. All I really know for sure at this point regarding your
> > > patch is that git-bisect found it for me.
> >
> > hm, your code uses timeouts for this, right? The CPU load average that
> > is used for SMP load balancing is sampled from the scheduler tick - and
> > has been sampled from the scheduler tick for eons. v2.6.23 defaulted to
> > a different method but v2.6.24 samples it from the tick again. So my
> > guess is, your testcode behave similarly on 2.6.22 too, correct?
>
> My sample program (and the VMware code which I originally observed
> this behaviour with) wakes up frequently using /dev/rtc.
>
> I know the problem occurs on 2.6.20 and 2.6.23.1. I haven't tried
> 2.6.22 yet, but I'll do so as soon as I can.

ok, if you see it on .20 then no need to try .22.

Ingo

2007-11-16 10:48:17

by Micah Elizabeth Scott

Subject: Re: High priority tasks break SMP balancer?

On Fri, Nov 16, 2007 at 07:07:00AM +0100, Ingo Molnar wrote:
>
> * Micah Dowty <[email protected]> wrote:
>
> > > I am a bit at a loss as to how this could relate to the patch. This
> > > looks like a load balance logic issue that causes the load
> > > calculation to go wrong?
> >
> > My best guess is that this has something to do with the timing with
> > which we sample the CPU's instantaneous load when calculating the load
> > averages.. but I still understand only the basics of the scheduler and
> > SMP balancer. All I really know for sure at this point regarding your
> > patch is that git-bisect found it for me.
>
> hm, your code uses timeouts for this, right? The CPU load average that
> is used for SMP load balancing is sampled from the scheduler tick - and
> has been sampled from the scheduler tick for eons. v2.6.23 defaulted to
> a different method but v2.6.24 samples it from the tick again. So my
> guess is, your testcode behave similarly on 2.6.22 too, correct?

Interesting.. here are the kernels I've tested so far, not including
the git-bisect run I did between 2.6.19 and 2.6.20:

2.6.17 -
2.6.19 -
2.6.19.7 -
2.6.20 +
2.6.21 +
2.6.22 -
2.6.23.1 +

Here a "-" means that the problem does not occur (my test program uses
100% of both CPUs) and a "+" means that the test program leaves one
CPU mostly idle.

Unless I've made a mistake, 2.6.22 seems like the outlier rather than
2.6.23. Is this inconsistent with the scheduler tick hypothesis?

Thanks,
--Micah

2007-11-16 10:49:00

by Dmitry Adamushko

Subject: Re: High priority tasks break SMP balancer?

On 16/11/2007, Ingo Molnar <[email protected]> wrote:
>
> * Micah Dowty <[email protected]> wrote:
>
> > > I am a bit at a loss as to how this could relate to the patch. This
> > > looks like a load balance logic issue that causes the load
> > > calculation to go wrong?
> >
> > My best guess is that this has something to do with the timing with
> > which we sample the CPU's instantaneous load when calculating the load
> > averages.. but I still understand only the basics of the scheduler and
> > SMP balancer. All I really know for sure at this point regarding your
> > patch is that git-bisect found it for me.
>
> hm, your code uses timeouts for this, right? The CPU load average that
> is used for SMP load balancing is sampled from the scheduler tick - and
> has been sampled from the scheduler tick for eons. v2.6.23 defaulted to
> a different method but v2.6.24 samples it from the tick again.

yeah... but one of the kernels in question is actually 2.6.23.1, which
does use the 'precise' accounting.

Micah,

could you try to change either :

cat /proc/sys/kernel/sched_stat_granularity

put it to the value equal to a tick on your system

or just remove bit #3 (which is responsible for 8 == 1000) here:

cat /proc/sys/kernel/sched_features

(this one is enabled by default in 2.6.23.1)

anyway, when it comes to calculating rq->cpu_load[], a nice(0) cpu-hog
task (on cpu_0) may generate a similar load (contribute to
rq->cpu_load[]) as e.g. some negatively reniced task (on cpu_1) which
runs only periodically (say, once in a tick for N ms., etc.) [*]

The thing is that the higher a prio of the task, the bigger 'weight'
it has (prio_to_wait[] table in sched.c) ... and roughly, the load it
generates is not only 'proportional' to 'run-time per fixed interval
of time' but also to its 'weight'. That's why the [*] above.

so you may have a situation :

cpu_0 : e.g. a nice(-20) task running periodically every tick and
generating, say ~10% cpu load ;

cpu_1 : 2-3 nice(0) cpu-hog tasks ;

both cpus may be seen with similar rq->cpu_load[]... yeah, one would
argue that one of the cpu hogs could be migrated to cpu_0 and consume
remaining 'time slots' and it would not "disturb" the nice(-20) task
as :
it's able to preempt the lower prio task whenever it wants (provided
fine-grained kernel preemption) and we don't care that much about
thrashing of caches here.

btw., without the precise load balancing, there can be situations when
the nice(-20) (or say, an RT periodic task) may not be seen at all (i.e.
does not contribute to cpu_load[]) on cpu_0...
we do sampling every tick (sched.c :: update_cpu_load()) and consider
this_rq->ls.load.weight at this particular moment (that is the sum of
'weights' for all runnable tasks on this rq)... and it may well be
that the aforementioned high-priority task is just never (or more
likely, rarely) runnable at this particular moment (it runs for short
intervals of time in between ticks).
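
To make the effect concrete, here is a rough userspace simulation (not
kernel code: 88761 and 1024 are the nice(-20) and nice(0) entries of
prio_to_weight[] in 2.6.23, and the per-tick update below only
approximates what update_cpu_load() does):

#include <stdio.h>
#include <stdlib.h>

#define TICKS            10000
#define WEIGHT_NICE_M20  88761   /* prio_to_weight[] entry for nice -20 */
#define WEIGHT_NICE_0     1024   /* prio_to_weight[] entry for nice 0   */

int main(void)
{
    unsigned long cpu_load[5] = { 0 };  /* the mostly-idle CPU's averages */
    double p_runnable = 0.10;  /* -20 task caught runnable at 10% of ticks */
    int t, i;

    for (t = 0; t < TICKS; t++) {
        /* raw load sampled at this tick: the -20 task's weight if it
           happens to be runnable right now, otherwise 0 (CPU looks idle) */
        unsigned long this_load =
            (rand() / (double)RAND_MAX < p_runnable) ? WEIGHT_NICE_M20 : 0;

        for (i = 0; i < 5; i++) {
            unsigned long scale = 1UL << i;
            unsigned long old = cpu_load[i], new_load = this_load;

            if (new_load > old)
                new_load += scale - 1;  /* round up while load is rising */
            cpu_load[i] = (old * (scale - 1) + new_load) >> i;
        }
    }

    printf("cpu0 (mostly idle, runs the -20 waker) cpu_load[]:");
    for (i = 0; i < 5; i++)
        printf(" %lu", cpu_load[i]);
    printf("\ncpu1 (two always-runnable nice(0) hogs): ~%d\n",
           2 * WEIGHT_NICE_0);
    return 0;
}

With p_runnable around 0.1 the slower-decaying slots settle near
0.1 * 88761 ~= 8900, i.e. the kind of multi-thousand cpu_load[] values
seen on the mostly-idle CPU, while a CPU running two nice(0) hogs
reads only about 2048.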


>
> Ingo
>

--
Best regards,
Dmitry Adamushko

2007-11-16 19:13:57

by David Newall

Subject: Re: High priority tasks break SMP balancer?

There are a couple of points I would make about your python test
harness. Your program compares real+system jiffies for both cpus; an
ideal result would be 1.00. The measurement is taken over a relatively
short period of approximately a half-second, and you kill the CPU hogs
before taking final measurements, even wait for them to die first. You
repeat this measurement, starting and killing CPU hogs each time. Why
do you do that?

What happens if you start the hogs and take the baseline outside of the
loop?

from __future__ import division
import os, time

def getCpuTimes():
    # Sum user+system jiffies for cpu0 and cpu1 from /proc/stat
    cpu0 = 0
    cpu1 = 0
    for line in open("/proc/stat"):
        tokens = line.split()
        if tokens[0] == "cpu0":
            cpu0 = int(tokens[1]) + int(tokens[3])
        elif tokens[0] == "cpu1":
            cpu1 = int(tokens[1]) + int(tokens[3])
    return cpu0, cpu1

# Start the CPU hogs once, take a single baseline, then keep sampling
pid = os.spawnl(os.P_NOWAIT, "./priosched", "priosched")
baseline = getCpuTimes()
while True:
    time.sleep(0.5)
    current = getCpuTimes()
    print "%.04f" % ((current[0] - baseline[0]) /
                     (current[1] - baseline[1]))
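
With the hogs started once and the baseline taken outside the loop,
the printed ratio should stay near 1.00 while both CPUs are kept busy;
if one CPU goes nearly idle, the ratio should drift far from 1.00 (or
the division should blow up, as it does in your harness).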

2007-11-16 21:39:14

by Micah Elizabeth Scott

Subject: Re: High priority tasks break SMP balancer?

On Sat, Nov 17, 2007 at 05:43:33AM +1030, David Newall wrote:
> There are a couple of points I would make about your python test harness.
> Your program compares real+system jiffies for both cpus; an ideal result
> would be 1.00. The measurement is taken over a relatively short period of
> approximately a half-second, and you kill the CPU hogs before taking final
> measurements, even wait for them to die first. You repeat this
> measurement, starting and killing CPU hogs each time. Why do you do that?

The Python test harness is fairly artificial, but this is just the
best way I found to reliably reproduce the problem in a short amount
of time. It was just for convenience while running git-bisect. When
running the C program directly, there seems to be a somewhat random
chance that it will start up in the "bad" state. Once the single CPU
is stuck in this mostly-idle mode, it seems to stay that way for a
while.

> What happens if you start the hogs and take the baseline outside of the
> loop?

The problem still occurs then, but killing/restarting the test app
seems to trigger the problem more reliably. As I said in the original
email about this, left to its own devices this problem will occur
seemingly-randomly. In the original VMware code I observed this
problem in, the same process would flip between the "good" and "bad"
states seemingly randomly, every few seconds.

Thanks,
--Micah

2007-11-16 22:12:18

by Christoph Lameter

Subject: Re: High priority tasks break SMP balancer?

On Fri, 16 Nov 2007, Micah Dowty wrote:

> 2.6.17 -
> 2.6.19 -
> 2.6.19.7 -
> 2.6.20 +
> 2.6.21 +
> 2.6.22 -
> 2.6.23.1 +
>
> Here a "-" means that the problem does not occur (my test program uses
> 100% of both CPUs) and a "+" means that the test program leaves one
> CPU mostly idle.
>
> Unless I've made a mistake, 2.6.22 seems like the outlier rather than
> 2.6.23. Is this inconsistent with the scheduler tick hypothesis?

Siddha fixed an issue with the jiffy accounting in for the softirq
approach in.22 (vague recall maybe not exactly that version). This may be
consistent with an issue that was fixed and now surfaces because of
something else.

2007-11-16 22:14:24

by Micah Elizabeth Scott

Subject: Re: High priority tasks break SMP balancer?

On Fri, Nov 16, 2007 at 11:48:50AM +0100, Dmitry Adamushko wrote:
> could you try to change either :
>
> cat /proc/sys/kernel/sched_stat_granularity
>
> put it to the value equal to a tick on your system

This didn't seem to have any effect.

> or just remove bit #3 (which is responsible for 8 == 1000) here:
>
> cat /proc/sys/kernel/sched_features
>
> (this one is enabled by default in 2.6.23.1)

Aha. Turning off bit 3 appears to instantly fix my problem while it's
occurring in an existing process, and I can't reproduce it on any new
processes afterward.

> anyway, when it comes to calculating rq->cpu_load[], a nice(0) cpu-hog
> task (on cpu_0) may generate a similar load (contribute to
> rq->cpu_load[]) as e.g. some negatively reniced task (on cpu_1) which
> runs only periodically (say, once in a tick for N ms., etc.) [*]
>
> The thing is that the higher a prio of the task, the bigger 'weight'
> it has (prio_to_weight[] table in sched.c) ... and roughly, the load it
> generates is not only 'proportional' to 'run-time per fixed interval
> of time' but also to its 'weight'. That's why the [*] above.

Right. I gathered from reading the scheduler source earlier that the
load average is intended to be proportional to the priority of the
task, but I was really confused by the fairly nondeterministic effect
on the cpu_load average that my test process is having.

> so you may have a situation :
>
> cpu_0 : e.g. a nice(-20) task running periodically every tick and
> generating, say ~10% cpu load ;

Part of the problem may be that my high-priority task can run much
more often than every tick. In my test case and in the VMware code
that I originally observed the problem in, the thread can wake up
based on /dev/rtc or on a device IRQ. Either of these can happen much
more frequently than the scheduler tick, if I understand correctly.

> cpu_1 : 2-3 nice(0) cpu-hog tasks ;
>
> both cpus may be seen with similar rq->cpu_load[]...

When I try this, cpu0 has a cpu_load[] of over 10000 and cpu1 has a
load of 2048 or so.

> yeah, one would
> argue that one of the cpu hogs could be migrated to cpu_0 and consume
> remaining 'time slots' and it would not "disturb" the nice(-20) task
> as :
> it's able to preempt the lower prio task whenever it want (provided,
> fine-grained kernel preemption) and we don't care that much of
> trashing of caches here.

Yes, that's the behaviour I expected to see (and what my application
would prefer).

> btw., without the precise load balancing, there can be situations when
> the nice(-20) (or say, a RT periodic task) can be even not seen (i.e.
> don't contribute to cpu_load[]) on cpu_0...
> we do sampling every tick (sched.c :: update_cpu_load()) and consider
> this_rq->ls.load.weight at this particular moment (that is the sum of
> 'weights' for all runnable tasks on this rq)... and it may well be
> that the aforementioned high-priority task is just never (or likely,
> rarely) runnable at this particular moment (it runs for short interval
> of time in between ticks).

Indeed. I think this is the major contributor to the nondeterminism
I'm seeing.

Thanks much,
--Micah

2007-11-16 23:27:40

by Dmitry Adamushko

Subject: Re: High priority tasks break SMP balancer?

On 16/11/2007, Micah Dowty <[email protected]> wrote:
> [ ... ]
> > or just remove bit #3 (which is responsible for 8 == 1000) here:
> >
> > cat /proc/sys/kernel/sched_features
> >
> > (this one is enabled by default in 2.6.23.1)
>
> Aha. Turning off bit 3 appears to instantly fix my problem while it's
> occurring in an existing process, and I can't reproduce it on any new
> processes afterward.

humm... ok, but considering your recent summary for various kernels...
I guess, it doesn't qualify as the primary suspect... it just likely
affects something else.

>
> > cpu_1 : 2-3 nice(0) cpu-hog tasks ;
> >
> > both cpus may be seen with similar rq->cpu_load[]...
>
> When I try this, cpu0 has a cpu_load[] of over 10000 and cpu1 has a
> load of 2048 or so.

yeah, one of the options for 2048 would be presence of 2 nice(0)
cpu-hogs (1024 is the weight for a nice(0) task).


> > yeah, one would
> > argue that one of the cpu hogs could be migrated to cpu_0 and consume
> > remaining 'time slots' and it would not "disturb" the nice(-20) task
> > as :
> > it's able to preempt the lower prio task whenever it want (provided,
> > fine-grained kernel preemption) and we don't care that much of
> > trashing of caches here.
>
> Yes, that's the behaviour I expected to see (and what my application
> would prefer).

yep, that's what load_balance_newidle() is about... so maybe there are
some factors resulting in its inconsistency/behavioral differences on
different kernels.

Let's say we change a pattern for the niced task: e.g. run for 100 ms.
and then sleep for 300 ms. (that's ~25% of cpu load) in the loop. Any
behavioral changes?


>
> Thanks much,
> --Micah
>

--
Best regards,
Dmitry Adamushko

2007-11-17 01:05:27

by Micah Elizabeth Scott

Subject: Re: High priority tasks break SMP balancer?

On Sat, Nov 17, 2007 at 12:26:41AM +0100, Dmitry Adamushko wrote:
> Let's say we change a pattern for the niced task: e.g. run for 100 ms.
> and then sleep for 300 ms. (that's ~25% of cpu load) in the loop. Any
> behavioral changes?

For consistency, I tested this using /dev/rtc. I set the rtc frequency
to 16 Hz, and replaced the main loop of my high (-19) priority thread
with:

while (1) {
    unsigned long data;

    /* Sleep: block for three consecutive RTC ticks (3 * 62.5ms at 16 Hz) */
    for (i = 0; i < 3; i++) {
        if (read(rtc, &data, sizeof data) != sizeof data) {
            perror("read");
            return 1;
        }
    }

    /* Load: spin in non-blocking mode until the next tick arrives */
    fcntl(rtc, F_SETFL, O_NONBLOCK);
    while (read(rtc, &data, sizeof data) < 0);
    fcntl(rtc, F_SETFL, 0);
}

Now it's busy-looping for 62.5ms, and sleeping for three consecutive
62.5ms chunks totalling 187.5ms.

The results aren't quite what I was expecting. I have only observed
this so far in test cases where I have a very high wakeup frequency,
so I wasn't expecting this to work. I did, however, still observe the
problem where occasionally I get into a state where one CPU is mostly
idle.

Qualitatively, this feels a bit different. With the higher clock
frequency it seemed like the CPU would easily get "stuck" in this
state where it's mostly idle, and it would stay there for a long
time. With the low wakeup frequency, I'm seeing it toggle between the
busy and mostly-idle states more quickly.

I tried a similar test using usleep() and gettimeofday() rather than
/dev/rtc:

while (1) {
    usleep(300000);                  /* sleep ~300ms */

    /* busy-loop for ~100ms */
    gettimeofday(&t1, NULL);
    do {
        gettimeofday(&t2, NULL);
    } while (t2.tv_usec - t1.tv_usec +
             (t2.tv_sec - t1.tv_sec) * 1000000 < 100000);
}

With this test program, I haven't yet seen a CPU imbalance that lasts
longer than a fraction of a second.

--Micah

2007-11-17 19:10:48

by Dmitry Adamushko

Subject: Re: High priority tasks break SMP balancer?

Micah,

ok, would it be possible to get "cat /proc/schedstat" output at the
moment when you observe the 'problem'? So we could try to analyze
behavior of the load balancer (yeah, we should have probably started
with this step)

something like this:

(the problem appears)
# cat /proc/schedstat

... wait either a few seconds or until the problem disappears
(whatever comes first)
# cat /proc/schedstat


TIA,

>
> --Micah
>

--
Best regards,
Dmitry Adamushko

2007-11-19 18:51:27

by Micah Elizabeth Scott

Subject: Re: High priority tasks break SMP balancer?

On Sat, Nov 17, 2007 at 08:10:35PM +0100, Dmitry Adamushko wrote:
> Micah,
>
> ok, would it be possible to get "cat /proc/schedstat" output at the
> moment when you observe the 'problem'? So we could try to analyze
> behavior of the load balancer (yeah, we should have probably started
> with this step)

No problem. Here are the two schedstat snapshots. With my priosched.c
test program running with one CPU idle, I did a "cat /proc/schedstat",
waited a couple seconds, then repeated:

micah@micah-64:~$ cat /proc/schedstat
version 14
timestamp 4366722208
cpu0 0 0 0 785868111 0 828020771 12621812 22703872 14096041 9504937730116 23603703739504 815398959
domain0 03 8766051 8760123 95 14544377 6374 0 13 8760110 7369 6771 0 20835105 605 0 1 6770 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7026998 0 24826
cpu1 0 0 0 783505304 0 860700497 26129151 41776419 34749418 10682815341474 1753703225423 834571346
domain0 03 8685595 8677710 224 7037051 8127 0 76 8677634 7500 7320 7 471762 179 0 8 7312 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 8607831 0 36753
micah@micah-64:~$ cat /proc/schedstat
version 14
timestamp 4366723231
cpu0 0 0 0 785868111 0 828021128 12621812 22703965 14096130 9509029340381 23603703795868 815399316
domain0 03 8766051 8760123 95 14544377 6374 0 13 8760110 7374 6776 0 20835105 605 0 1 6775 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7027010 0 24826
cpu1 0 0 0 783505304 0 860708521 26132612 41780521 34753508 10683770853336 1753758309667 834575909
domain0 03 8685718 8677833 224 7037051 8127 0 76 8677757 7500 7320 7 471762 179 0 8 7312 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 8607835 0 36753

I had been experimenting with both schedstat and sched_debug while
trying to identify the original problem I observed with VMware's
code. During the course of that investigation, I wrote a little tool
to graph schedstat output in real-time.

The tool itself is in a public Subversion repository:
http://svn.navi.cx/misc/trunk/rtgraph/schedstat-grapher.py
(Requires the other Python modules in that directory, and PyGTK)

This screenshot was taken just after the priosched.c test program
flipped from the "bad" state to the "good" state. It had been running
on one CPU for a while, and something caused it to rebalance properly
onto both CPUs while this graph was being recorded. You can see that
in the "bad" state, cpu0 was trying to rebalance but it was failing
because there was no busier group (idle_nobusyg):

http://navi.cx/~micah/priosched-graph-1.png

While in the "bad" state, the sched_debug output shows clearly why we
were hitting idle_nobusyg. One CPU has both busy-loop threads in its
runqueue, the other CPU is mostly idle but it has an elevated CPU load
proportional to the high priority thread's priority boost.

micah@micah-64:~$ cat /proc/sched_debug
Sched Debug Version: v0.05-v20, 2.6.23.1 #1
now at 287991360677873 nsecs

cpu#0, 2593.095 MHz
.nr_running : 0
.load : 0
.ls.delta_fair : 27950
.ls.delta_exec : 1958562
.nr_switches : 42562597
.nr_load_updates : 71998668
.nr_uninterruptible : 2731
.jiffies : 4366886827
.next_balance : 4366886830
.curr->pid : 0
.clock : 288012673563036
.idle_clock : 0
.prev_clock_raw : 287377858327690
.clock_warps : 0
.clock_overflows : 50589
.clock_deep_idle_events : 0
.clock_max_delta : 4000250
.cpu_load[0] : 58784
.cpu_load[1] : 64246
.cpu_load[2] : 65333
.cpu_load[3] : 63586
.cpu_load[4] : 61992

cfs_rq
.fair_clock : 9034352848909
.exec_clock : 10184480087500
.wait_runtime : 0
.wait_runtime_overruns : 10356815
.wait_runtime_underruns : 1551195
.sleeper_bonus : 33752273
.wait_runtime_rq_sum : 0

runnable tasks:
task PID tree-key delta waiting switches prio sum-exec sum-wait sum-sleep wait-overrun wait-underrun
------------------------------------------------------------------------------------------------------------------------------------------------------------------

cpu#1, 2593.095 MHz
.nr_running : 3
.load : 3072
.ls.delta_fair : 247507
.ls.delta_exec : 1148874
.nr_switches : 78415958
.nr_load_updates : 71998585
.nr_uninterruptible : -2731
.jiffies : 4366886827
.next_balance : 4366886976
.curr->pid : 16252
.clock : 288012340795477
.idle_clock : 0
.prev_clock_raw : 287377857592876
.clock_warps : 0
.clock_overflows : 18670
.clock_deep_idle_events : 0
.clock_max_delta : 4000250
.cpu_load[0] : 7136
.cpu_load[1] : 4591
.cpu_load[2] : 3319
.cpu_load[3] : 2691
.cpu_load[4] : 2391

cfs_rq
.fair_clock : 10089416808778
.exec_clock : 11099173384388
.wait_runtime : -74868320
.wait_runtime_overruns : 18282511
.wait_runtime_underruns : 1233484
.sleeper_bonus : 3602202
.wait_runtime_rq_sum : -74868320

runnable tasks:
task PID tree-key delta waiting switches prio sum-exec sum-wait sum-sleep wait-overrun wait-underrun
------------------------------------------------------------------------------------------------------------------------------------------------------------------
priosched 16248 10089447384328 30575550 -31397913 172 120 1665860833 -80591108 0 0 17
priosched 16249 10089453683739 36874961 -39698014 148 120 1670430581 -63070307 0 0 13
R cat 16252 10089420581171 3772393 -3772393 1 120 73354 -22393 90430 0 0


Thanks,
--Micah

2007-11-19 22:22:29

by Dmitry Adamushko

Subject: Re: High priority tasks break SMP balancer?

On 19/11/2007, Micah Dowty <[email protected]> wrote:
> [ ... ]
>
> micah@micah-64:~$ cat /proc/schedstat
> version 14
> timestamp 4366722208
> cpu0 0 0 0 785868111 0 828020771 12621812 22703872 14096041 9504937730116 23603703739504 815398959
> domain0 03 8766051 8760123 95 14544377 6374 0 13 8760110 7369 6771 0 20835105 605 0 1 6770 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7026998 0 24826
> cpu1 0 0 0 783505304 0 860700497 26129151 41776419 34749418 10682815341474 1753703225423 834571346
> domain0 03 8685595 8677710 224 7037051 8127 0 76 8677634 7500 7320 7 471762 179 0 8 7312 ===> 0 0 0 0 0 0 0 0 <=== 1 0 1 0 0 0 0 0 0 8607831 0 36753

> micah@micah-64:~$ cat /proc/schedstat
> version 14
> timestamp 4366723231
> cpu0 0 0 0 785868111 0 828021128 12621812 22703965 14096130 9509029340381 23603703795868 815399316
> domain0 03 8766051 8760123 95 14544377 6374 0 13 8760110 7374 6776 0 20835105 605 0 1 6775 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7027010 0 24826
> cpu1 0 0 0 783505304 0 860708521 26132612 41780521 34753508 10683770853336 1753758309667 834575909
> domain0 03 8685718 8677833 224 7037051 8127 0 76 8677757 7500 7320 7 471762 179 0 8 7312 ===> 0 0 0 0 0 0 0 0 <=== 1 0 1 0 0 0 0 0 0 8607835 0 36753

You seem to have a configuration with domains which don't have
SD_BALANCE_NEWIDLE on (CONFIG_NUMA?) as there are no events (all
zeros above) for CPU_NEWLY_IDLE.

this one is being triggered whenever a cpu becomes idle (schedule()
--> idle_balance() --> load_balance_newidle()).

(this flag is a bit #1 == 2)

cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags

?

it can be changed with "echo new_value > ..."

does a change of it make any difference?

moreover, /proc/sys/kernel/sched_domain/cpu1/domain0/newidle_idx seems
to determine the source of the load used for calculating the busiest
group, e.g. with newidle_idx == 0, the current load on the queue is
used instead of cpu_load[].


>
> Thanks,
> --Micah
>

--
Best regards,
Dmitry Adamushko

2007-11-19 23:05:27

by Micah Elizabeth Scott

Subject: Re: High priority tasks break SMP balancer?

On Mon, Nov 19, 2007 at 11:22:06PM +0100, Dmitry Adamushko wrote:
> You seem to have a configuration with domains which don't have
> SD_BALANCE_NEWIDLE on (CONFIG_NUMA?) as there are no events (all
> zeros above) for CPU_NEWLY_IDLE.
>
> this one is being triggered whenever a cpu becomes idle (schedule()
> --> idle_balance() --> load_balance_newidle()).
>
> (this flag is a bit #1 == 2)
>
> cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags

Hmm. I don't have this file on my system:

root@micah-64:/proc/sys/kernel/sched_domain/cpu0/domain0# ls
busy_factor busy_idx forkexec_idx idle_idx imbalance_pct max_interval min_interval newidle_idx wake_idx
root@micah-64:/proc/sys/kernel/sched_domain/cpu0/domain0# uname -a
Linux micah-64 2.6.23.1 #1 SMP Fri Nov 2 12:25:47 PDT 2007 x86_64 GNU/Linux

Is there a config option I'm missing?

Thanks,
--Micah

2007-11-20 05:58:29

by Ingo Molnar

Subject: Re: High priority tasks break SMP balancer?


* Micah Dowty <[email protected]> wrote:

> > this one is being triggered whenever a cpu becomes idle (schedule()
> > --> idle_balance() --> load_balance_newidle()).
> >
> > (this flag is a bit #1 == 2)
> >
> > cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags
>
> Hmm. I don't have this file on my system:
>
> root@micah-64:/proc/sys/kernel/sched_domain/cpu0/domain0# ls
> busy_factor busy_idx forkexec_idx idle_idx imbalance_pct max_interval min_interval newidle_idx wake_idx
> root@micah-64:/proc/sys/kernel/sched_domain/cpu0/domain0# uname -a
> Linux micah-64 2.6.23.1 #1 SMP Fri Nov 2 12:25:47 PDT 2007 x86_64 GNU/Linux
>
> Is there a config option I'm missing?

yes, CONFIG_SCHED_DEBUG.

Ingo

2007-11-20 18:06:54

by Micah Elizabeth Scott

Subject: Re: High priority tasks break SMP balancer?

On Tue, Nov 20, 2007 at 06:57:55AM +0100, Ingo Molnar wrote:
>
> * Micah Dowty <[email protected]> wrote:
>
> > > this one is being triggered whenever a cpu becomes idle (schedule()
> > > --> idle_balance() --> load_balance_newidle()).
> > >
> > > (this flag is a bit #1 == 2)
> > >
> > > cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags
> >
> > Hmm. I don't have this file on my system:
> >
> > root@micah-64:/proc/sys/kernel/sched_domain/cpu0/domain0# ls
> > busy_factor busy_idx forkexec_idx idle_idx imbalance_pct max_interval min_interval newidle_idx wake_idx
> > root@micah-64:/proc/sys/kernel/sched_domain/cpu0/domain0# uname -a
> > Linux micah-64 2.6.23.1 #1 SMP Fri Nov 2 12:25:47 PDT 2007 x86_64 GNU/Linux
> >
> > Is there a config option I'm missing?
>
> yes, CONFIG_SCHED_DEBUG.

I have that one. I even posted the /proc/sched_debug output :)

--Micah

2007-11-20 21:48:22

by Dmitry Adamushko

Subject: Re: High priority tasks break SMP balancer?

On 20/11/2007, Micah Dowty <[email protected]> wrote:
> On Tue, Nov 20, 2007 at 06:57:55AM +0100, Ingo Molnar wrote:
> >
> > * Micah Dowty <[email protected]> wrote:
> >
> > > > this one is being triggered whenever a cpu becomes idle (schedule()
> > > > --> idle_balance() --> load_balance_newidle()).
> > > >
> > > > (this flag is a bit #1 == 2)
> > > >
> > > > cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags
> > >
> > > Hmm. I don't have this file on my system:
> > >
> > > root@micah-64:/proc/sys/kernel/sched_domain/cpu0/domain0# ls
> > > busy_factor busy_idx forkexec_idx idle_idx imbalance_pct max_interval min_interval newidle_idx wake_idx
> > > root@micah-64:/proc/sys/kernel/sched_domain/cpu0/domain0# uname -a
> > > Linux micah-64 2.6.23.1 #1 SMP Fri Nov 2 12:25:47 PDT 2007 x86_64 GNU/Linux
> > >
> > > Is there a config option I'm missing?
> >
> > yes, CONFIG_SCHED_DEBUG.
>
> I have that one. I even posted the /proc/sched_debug output :)

ah ok, try applying this patch on top of 2.6.23.1.

btw., what's your system? If I recall right, SD_BALANCE_NEWIDLE is on
by default for all configs, except for NUMA nodes.

(attached a white-space non-damaged version)

---

--- kernel/sched.c-old 2007-11-20 22:33:22.000000000 +0100
+++ kernel/sched.c 2007-11-20 22:37:07.000000000 +0100
@@ -5306,7 +5306,7 @@ set_table_entry(struct ctl_table *entry,
static struct ctl_table *
sd_alloc_ctl_domain_table(struct sched_domain *sd)
{
- struct ctl_table *table = sd_alloc_ctl_entry(14);
+ struct ctl_table *table = sd_alloc_ctl_entry(12);

set_table_entry(&table[0], "min_interval", &sd->min_interval,
sizeof(long), 0644, proc_doulongvec_minmax);
@@ -5326,10 +5326,10 @@ sd_alloc_ctl_domain_table(struct sched_d
sizeof(int), 0644, proc_dointvec_minmax);
set_table_entry(&table[8], "imbalance_pct", &sd->imbalance_pct,
sizeof(int), 0644, proc_dointvec_minmax);
- set_table_entry(&table[10], "cache_nice_tries",
+ set_table_entry(&table[9], "cache_nice_tries",
&sd->cache_nice_tries,
sizeof(int), 0644, proc_dointvec_minmax);
- set_table_entry(&table[12], "flags", &sd->flags,
+ set_table_entry(&table[10], "flags", &sd->flags,
sizeof(int), 0644, proc_dointvec_minmax);

return table;

---

>
> --Micah
>
>


--
Best regards,
Dmitry Adamushko


Attachments:
sd_alloc.patch (1.01 kB)

2007-11-22 07:47:00

by Micah Elizabeth Scott

Subject: Re: High priority tasks break SMP balancer?

On Tue, Nov 20, 2007 at 10:47:52PM +0100, Dmitry Adamushko wrote:
> btw., what's your system? If I recall right, SD_BALANCE_NEWIDLE is on
> by default for all configs, except for NUMA nodes.

It's a dual AMD64 Opteron.

So, I recompiled my 2.6.23.1 kernel without NUMA support, and with
your patch for scheduling domain flags in /proc. It looks like with
NUMA disabled, my test case no longer shows the CPU imbalance
problem. Cool.

With NUMA disabled (and my test running smoothly), the flags show that
SD_BALANCE_NEWIDLE is set:

root@micah-64:~# cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags
55

Next I turned it off:

root@micah-64:~# echo 53 > /proc/sys/kernel/sched_domain/cpu0/domain0/flags
root@micah-64:~# echo 53 > /proc/sys/kernel/sched_domain/cpu1/domain0/flags

Oddly enough, I still don't observe the CPU imbalance problem.

Now I reboot into a kernel which has NUMA re-enabled but which is
otherwise identical. I verify that now I can reproduce the CPU
imbalance again.

root@micah-64:~# cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags
1101

Now I set cpu[10]/domain0/flags to 1099, and the imbalance immediately
disappears. I can reliably cause the imbalance again by setting it
back to 1101, and remove the imbalance by setting them to 1099.

Do these results make sense? I'm not sure I understand how
SD_BALANCE_NEWIDLE could be the whole story, since my /proc/schedstat
graphs do show that we continuously try to balance on idle, but we
can't successfully do so because the idle CPU has a much higher load
than the non-idle CPU. I don't understand how the problem I'm seeing
could be related to the time at which we run the balancer, rather than
being related to the load average calculation.

Assuming the CPU imbalance I'm seeing is actually related to
SD_BALANCE_NEWIDLE being unset, I have a couple of questions:

- Is this intended/expected behaviour for a machine without
NEWIDLE set? I'm not familiar with the rationale for disabling
this flag on NUMA systems.

- Is there a good way to detect, without any kernel debug flags
set, whether the current machine has any scheduling domains
that are missing the SD_BALANCE_NEWIDLE bit? I'm looking for
a good way to work around the problem I'm seeing with VMware's
code. Right now the best I can do is disable all thread priority
elevation when running on an SMP machine with Linux 2.6.20 or
later.

Thank you again for all your help.
--Micah

2007-11-22 12:53:18

by Dmitry Adamushko

Subject: Re: High priority tasks break SMP balancer?

On 22/11/2007, Micah Dowty <[email protected]> wrote:
> On Tue, Nov 20, 2007 at 10:47:52PM +0100, Dmitry Adamushko wrote:
> > btw., what's your system? If I recall right, SD_BALANCE_NEWIDLE is on
> > by default for all configs, except for NUMA nodes.
>
> It's a dual AMD64 Opteron.
>
> So, I recompiled my 2.6.23.1 kernel without NUMA support, and with
> your patch for scheduling domain flags in /proc. It looks like with
> NUMA disabled, my test case no longer shows the CPU imbalance
> problem. Cool.
>
> With NUMA disabled (and my test running smoothly), the flags show that
> SD_BALANCE_NEWIDLE is set:
>
> root@micah-64:~# cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags
> 55
>
> Next I turned it off:
>
> root@micah-64:~# echo 53 > /proc/sys/kernel/sched_domain/cpu0/domain0/flags
> root@micah-64:~# echo 53 > /proc/sys/kernel/sched_domain/cpu1/domain0/flags
>
> Oddly enough, I still don't observe the CPU imbalance problem.

Look for the 'load_idx' variable in find_busiest_group() and how it's used.

include/linux/topology.h (+ some architectures define additional
node-definition-structures, but that's primarily for NUMA)

There you can find : 'busy_idx', 'idle_idx' and 'newidle_idx' (amongst
others that affect load-balancing).

Those parameters specify the _source_ of load for determining the
busiest group for !CPU_IDLE, CPU_IDLE and CPU_NEWLY_IDLE cases
respectively.

when 'load_idx == 0' (say, when CPU_NEWLY_IDLE and 'newidle_idx == 0'
or when CPU_IDLE and 'idle_idx == 0') the cpu_load[] table is _not_
taken into account by {source,target}_load(), and the raw load on the
rq is considered instead. In case of CPU_NEWLY_IDLE the raw load is
always 0 (no tasks on the queue).

so for NUMA, newidle_idx = 0 and it also seems to be a generic option
for other variants on 2.6.23.1. Hence, it doesn't matter how big the
cpu_load[] values are; they are just not taken into account.
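
Roughly, the selection in 2.6.23 looks like the sketch below
(paraphrased and simplified from sched.c, so treat the exact shape as
approximate; weighted_cpuload() is the raw sum of the weights
currently queued on that CPU):

/* load_idx == 0 short-circuits the decayed cpu_load[] history and
   falls back to the instantaneous runqueue load. */
static unsigned long source_load(int cpu, int type)
{
        struct rq *rq = cpu_rq(cpu);
        unsigned long total = weighted_cpuload(cpu);

        if (type == 0)      /* e.g. CPU_NEWLY_IDLE with newidle_idx == 0 */
                return total;

        return min(rq->cpu_load[type-1], total);
}

static unsigned long target_load(int cpu, int type)
{
        struct rq *rq = cpu_rq(cpu);
        unsigned long total = weighted_cpuload(cpu);

        if (type == 0)
                return total;

        return max(rq->cpu_load[type-1], total);
}

So with newidle_idx == 0 the newly-idle CPU's inflated cpu_load[]
history is simply never consulted, whereas the periodic CPU_IDLE
rebalance on NUMA (idle_idx != 0) keeps comparing against that history
and concludes there is no busier group, which would match the
idle_nobusyg counts in the schedstat graphs.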

>
> Now I reboot into a kernel which has NUMA re-enabled but which is
> otherwise identical. I verify that now I can reproduce the CPU
> imbalance again.

for NUMA, 'idle_idx != 0' (i.e. cpu_load[] is taken into account) for
the CPU_IDLE case
and
'newidle_idx == 0' (just because, SD_BALANCE_NEWIDLE is normally not
used on NUMA) for CPU_NEWLY_IDLE.

>
> root@micah-64:~# cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags
> 1101
>
> Now I set cpu[10]/domain0/flags to 1099, and the imbalance immediately
> disappears. I can reliably cause the imbalance again by setting it
> back to 1101, and remove the imbalance by setting them to 1099.

I guess you meant it to be the other way around -- 1101 when it's switched on.

>
> Do these results make sense? I'm not sure I understand how
> SD_BALANCE_NEWIDLE could be the whole story, since my /proc/schedstat
> graphs do show that we continuously try to balance on idle, but we
> can't successfully do so because the idle CPU has a much higher load
> than the non-idle CPU.

(1) CPU_IDLE and idle_idx != 0 --> cpu_load[] is considered

(2) CPU_NEWLY_IDLE and newidle_idx == 0 --> raw load is considered (which
is 0 in this case).


> - Is this intended/expected behaviour for a machine without
> NEWIDLE set? I'm not familiar with the rationale for disabling
> this flag on NUMA systems.

As you may see, there are various node configs in
include/linux/topology.h that affect load-balancing. Migrations
between NUMA nodes can be more expensive than e.g. between cpus on
'normal' SMP, multi-core, etc. So it should be about being more
conservative when doing load-balancing on some setups.

e.g. in your particular case, what happens is smth like this:

SD_BALANCE_NEWIDLE + newidle_idx == 0

cpu #0: 2 nice(0) cpu-hogs
cpu #1: nice(-15) blocks...
cpu #1: load_balance_newidle() is triggered (*)
cpu #1: newidle_idx == 0 so the busiest group is on cpu #0
cpu #1: nice(0) (which is != current) is pulled over to cpu #1
cpu #1: nice(0) is running until it's preempted by nice(-15)
cpu #1: nice(-15) is running
...
in the meantime, a rebalance_domain event can be triggered on
cpu #0: and due to the huge load on cpu #1, nice(0) is pulled over
(back) to cpu #0
...
this situation may repeat again and again --> nice(0) tasks are being
pulled back and forth.

Other variants:

- the normal rebalance_domain event can be triggered at (*) and,
provided idle_idx == 0, cpu_load[] is not considered again and one of
the nice(0) can be migrated (a CPU_IDLE case).

the difference is that CPU_NEWLY_IDLE is triggered immediately after a
CPU becomes idle, while it may take some time for a normal rebalance
event (with CPU_IDLE in this case) to be triggered.

- moreover, I observed cpu_load[] from /proc/sched_debug while running
your test application and I did see from time to time cases when
cpu_load[] is low (both cpus are comparable, load-wise) on the cpu
where nice(-15) is normally running (I even used cpu_affinity to get
it bound to it)... the time for which nice(-15) is _not_ running can
be enough to degrade some of the slots of cpu_load[] back to 0...
should the rebalance_domain be triggered at this particular moment and
depending on sd->busy_idx, one of the nice(0) tasks can be migrated.

Note, different versions of kernels may have (and have) different
variants of tunings for domains, plus there can be some differences in
generic behavior of the load balancer.

> - Is there a good way to detect, without any kernel debug flags
> set, whether the current machine has any scheduling domains
> that are missing the SD_BALANCE_NEWIDLE bit?

e.g. by reading its config and kernel version. But as a generic way
(sort of API for user-space), I guess, no.

> I'm looking for
> a good way to work around the problem I'm seeing with VMware's
> code. Right now the best I can do is disable all thread priority
> elevation when running on an SMP machine with Linux 2.6.20 or
> later.

why does your application depend on the load-balancer's decisions?
Maybe something is just wrong with its logic instead? :-/


>
> Thank you again for all your help.
> --Micah
>

--
Best regards,
Dmitry Adamushko

2007-11-26 19:44:21

by Micah Elizabeth Scott

[permalink] [raw]
Subject: Re: High priority tasks break SMP balancer?

Dmitry,

Thank you for the detailed explanation of the scheduler behaviour I've
been seeing.

On Thu, Nov 22, 2007 at 01:53:02PM +0100, Dmitry Adamushko wrote:
> > - Is there a good way to detect, without any kernel debug flags
> > set, whether the current machine has any scheduling domains
> > that are missing the SD_BALANCE_NEWIDLE bit?
>
> e.g. by reading its config and kernel version. But as a generic way
> (a sort of user-space API), I guess, no.

Would it be reasonable perhaps to look for the numa_* entries in
/proc/zoneinfo, vmstat, or slabinfo?
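
Something along these lines, maybe (just a sketch of the idea; the
choice of /proc/vmstat and the "numa_" prefix are my assumptions, and
this only detects a NUMA-capable kernel, not an actual multi-node
machine):

#include <stdio.h>
#include <string.h>

/* guess whether the kernel was built with NUMA support by looking
 * for "numa_*" counters in /proc/vmstat */
static int kernel_looks_numa(void)
{
	char line[256];
	FILE *f = fopen("/proc/vmstat", "r");
	int found = 0;

	if (!f)
		return 0;
	while (fgets(line, sizeof line, f)) {
		if (strncmp(line, "numa_", 5) == 0) {
			found = 1;
			break;
		}
	}
	fclose(f);
	return found;
}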

> > I'm looking for
> > a good way to work around the problem I'm seeing with VMware's
> > code. Right now the best I can do is disable all thread priority
> > elevation when running on an SMP machine with Linux 2.6.20 or
> > later.
>
> why does your application depend on the load-balancer's decisions?
> Maybe something is just wrong with its logic instead? :-/

The application doesn't really depend on the load-balancer's decisions
per se, it just happens that this behaviour I'm seeing on NUMA systems
is extremely bad for performance.

In this context, the application is a virtual machine runtime which is
executing either an SMP VM or it's executing a guest which has a
virtual GPU. In either case, there are at least three threads:

- Two virtual CPU/GPU threads, which are nice(0) and often CPU-bound
- A low-latency event handling thread, at nice(-10)

The event handling thread performs periodic tasks like delivering
timer interrupts and completing I/O operations. It isn't expected to
use much CPU, but it does run very frequently. To get the best latency
performance, we've been running it at nice(-10). The intention is that
this event handling thread should usually be able to preempt either of
the CPU-bound threads.
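
For concreteness, the elevation is essentially the following (a sketch,
not our actual code; with NPTL each thread is its own kernel task, so
setpriority() on the thread's tid affects only that thread, and a
negative nice value needs root or CAP_SYS_NICE):

#include <sys/syscall.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/* run the calling (event handling) thread at nice(-10) */
static int boost_event_thread(void)
{
	pid_t tid = syscall(SYS_gettid);

	return setpriority(PRIO_PROCESS, tid, -10);
}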

The behaviour we expect is that on a system which is otherwise lightly
loaded, the two CPU-bound threads each get nearly an entire physical
CPU to themselves. The event handling thread periodically wakes up,
preempts one of the CPU-bound threads briefly, then goes to sleep and
yields the CPU back to one of the CPU-bound threads.

The actual behaviour I see when the load balancer makes this
unfavorable decision is that the two CPU-bound threads share a single
CPU, and the other CPU is mostly idle. Best case, the virtual machine
runs half as fast as it could. Worst case, it runs much slower,
because the CPU-bound threads often need to wait for each other to
catch up if they don't always run at about the same rate. This
involves a lot of extra synchronization overhead.

My current workaround for this is to avoid ever boosting our thread
priority on a kernel in which this problem could occur. Currently this
is any 2.6.20 or later kernel running with at least two CPUs. I'd like
to narrow this test now to only cover kernels with NUMA enabled.
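
Roughly, that check amounts to something like this (a simplified
sketch, not our actual code; the thresholds are just the ones
described above):

#include <stdio.h>
#include <sys/utsname.h>
#include <unistd.h>

/* return nonzero if boosting the event thread's priority looks safe */
static int priority_boost_is_safe(void)
{
	struct utsname u;
	int maj = 0, min = 0, patch = 0;

	if (sysconf(_SC_NPROCESSORS_ONLN) < 2)
		return 1;		/* UP machine: no balancer issue */
	if (uname(&u) != 0)
		return 0;		/* unknown kernel: play it safe */
	sscanf(u.release, "%d.%d.%d", &maj, &min, &patch);
	if (maj > 2 || (maj == 2 && min > 6) ||
	    (maj == 2 && min == 6 && patch >= 20))
		return 0;		/* 2.6.20+ on SMP: don't boost */
	return 1;
}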

Hopefully this explains why the scheduler behaviour I've observed is
undesirable for this particular workload. I would be surprised if I'm
the only one who has a workload that is negatively impacted by it, but
it does seem difficult to reconcile this workload's requirements with
the cache properties expected from a NUMA-aware scheduler.

Is there something I'm doing wrong, or is this just a weakness in the
current scheduler implementation? If it's the latter, are there any
planned improvements?

Thank you again,
--Micah

2007-11-27 09:21:23

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: High priority tasks break SMP balancer?

On 26/11/2007, Micah Dowty <[email protected]> wrote:
>
> The application doesn't really depend on the load-balancer's decisions
> per se, it just happens that this behaviour I'm seeing on NUMA systems
> is extremely bad for performance.
>
> In this context, the application is a virtual machine runtime which is
> executing either an SMP VM or it's executing a guest which has a
> virtual GPU. In either case, there are at least three threads:
>
> - Two virtual CPU/GPU threads, which are nice(0) and often CPU-bound
> - A low-latency event handling thread, at nice(-10)
>
> The event handling thread performs periodic tasks like delivering
> timer interrupts and completing I/O operations.

Are I/O operations initiated by these "virtual CPU/GPU threads"?

If so, would it make sense to have per-CPU event handling threads
(instead of one global one)? They would handle I/O operations initiated
from their respective CPUs to (hopefully) achieve better data locality
(esp. if most of the involved data is per-CPU).

Then let the load balancer evenly distribute the "virtual CPU/GPU
threads", or even (at least as an experiment) pin them to different
CPUs as well?

Sure, the scenario is highly dependent on the nature of those
'events'... and I can only speculate here :-) (but I'd imagine
situations where such a scheme would scale better).
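
e.g. the pinning experiment could be as simple as this (just a sketch,
using glibc's non-portable affinity call):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* pin a virtual CPU thread (and its per-CPU event thread) to one cpu */
static int pin_thread_to_cpu(pthread_t thread, int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return pthread_setaffinity_np(thread, sizeof(set), &set);
}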


>
> Thank you again,
> --Micah
>

--
Best regards,
Dmitry Adamushko

2007-11-27 17:13:39

by Micah Elizabeth Scott

[permalink] [raw]
Subject: Re: High priority tasks break SMP balancer?

On Tue, Nov 27, 2007 at 10:21:12AM +0100, Dmitry Adamushko wrote:
> On 26/11/2007, Micah Dowty <[email protected]> wrote:
> >
> > The application doesn't really depend on the load-balancer's decisions
> > per se, it just happens that this behaviour I'm seeing on NUMA systems
> > is extremely bad for performance.
> >
> > In this context, the application is a virtual machine runtime which is
> > executing either an SMP VM or it's executing a guest which has a
> > virtual GPU. In either case, there are at least three threads:
> >
> > - Two virtual CPU/GPU threads, which are nice(0) and often CPU-bound
> > - A low-latency event handling thread, at nice(-10)
> >
> > The event handling thread performs periodic tasks like delivering
> > timer interrupts and completing I/O operations.
>
> Are I/O operations initiated by these "virtual CPU/GPU threads"?
>
> If so, would it make sense to have per-CPU event handling threads
> (instead of one global one)? They would handle I/O operations initiated
> from their respective CPUs to (hopefully) achieve better data locality
> (esp. if most of the involved data is per-CPU).
>
> Then let the load balancer evenly distribute the "virtual CPU/GPU
> threads", or even (at least as an experiment) pin them to different
> CPUs as well?
>
> Sure, the scenario is highly dependent on the nature of those
> 'events'... and I can only speculate here :-) (but I'd imagine
> situations where such a scheme would scale better).

We already do this when it makes sense to. The high-priority "event"
thread is mostly used for asynchronous events: timers (driven by
/dev/rtc), completion of disk or USB I/O, etc. These are things that
happen asynchronously to the virtual CPU, and which may or may not
need to interrupt the virtual CPU at some point.

Another good example is audio. We have another thread (which ideally
is high-priority) which processes audio events asynchronously from the
virtual CPUs. The CPUs manipulate some shared memory to queue up audio
data, but while the audio is playing there is no direct control path
between the two threads. The virtual CPU reads/writes audio registers
whenever it needs to emulate accesses to the audio device, and the
audio thread wakes up any time the hardware needs more samples to
play. The two threads run asynchronously, with very little
synchronization between them.
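
Conceptually, the shared-memory path is just a single-producer/
single-consumer ring, something like this purely illustrative sketch
(the names and sizes are made up, it is not our code, and memory
barriers are omitted for brevity):

#define RING_SIZE 4096			/* power of two, illustrative only */

struct audio_ring {
	volatile unsigned int head;	/* advanced only by the virtual CPU */
	volatile unsigned int tail;	/* advanced only by the audio thread */
	short samples[RING_SIZE];
};

/* virtual CPU side: queue one sample, if there is room */
static int ring_put(struct audio_ring *r, short s)
{
	unsigned int next = (r->head + 1) & (RING_SIZE - 1);

	if (next == r->tail)
		return 0;		/* full */
	r->samples[r->head] = s;
	r->head = next;
	return 1;
}

/* audio thread side: pull the next sample, if any */
static int ring_get(struct audio_ring *r, short *s)
{
	if (r->tail == r->head)
		return 0;		/* empty */
	*s = r->samples[r->tail];
	r->tail = (r->tail + 1) & (RING_SIZE - 1);
	return 1;
}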

Hopefully that clarifies our situation a bit. Thanks again,
--Micah