2009-03-07 18:10:13

by Balazs Scheidler

Subject: scheduler oddity [bug?]

Hi,

I'm experiencing an odd behaviour from the Linux scheduler. I have an
application that feeds data to another process using a pipe. Both
processes use a fair amount of CPU time apart from writing to/reading
from this pipe.

The machine I'm running on is an Opteron Quad-Core CPU:
model name : Quad-Core AMD Opteron(tm) Processor 2347 HE
stepping : 3

What I see is that only one of the cores is used; the other three are idling
without doing any work. If I explicitly set the CPU affinity of
the processes to use distinct CPUs, the performance goes up
significantly (i.e. it starts to use the other cores and the load
scales linearly).

I've tried to reproduce the problem by writing a small test program,
which you can find attached. The program creates two processes; one
feeds the other using a pipe, and each does a series of memset() calls to
simulate CPU load. I've also added the capability for the program to set its
own CPU affinity. The results (the more, the better):

Without enabling CPU affinity:
$ ./a.out
Check: 0 loops/sec, sum: 1
Check: 12 loops/sec, sum: 13
Check: 41 loops/sec, sum: 54
Check: 41 loops/sec, sum: 95
Check: 41 loops/sec, sum: 136
Check: 41 loops/sec, sum: 177
Check: 41 loops/sec, sum: 218
Check: 40 loops/sec, sum: 258
Check: 41 loops/sec, sum: 299
Check: 41 loops/sec, sum: 340
Check: 41 loops/sec, sum: 381
Check: 41 loops/sec, sum: 422
Check: 41 loops/sec, sum: 463
Check: 41 loops/sec, sum: 504
Check: 41 loops/sec, sum: 545
Check: 40 loops/sec, sum: 585
Check: 41 loops/sec, sum: 626
Check: 41 loops/sec, sum: 667
Check: 41 loops/sec, sum: 708
Check: 41 loops/sec, sum: 749
Check: 41 loops/sec, sum: 790
Check: 41 loops/sec, sum: 831
Final: 39 loops/sec, sum: 831


With CPU affinity:
# ./a.out 1
Check: 0 loops/sec, sum: 1
Check: 41 loops/sec, sum: 42
Check: 49 loops/sec, sum: 91
Check: 49 loops/sec, sum: 140
Check: 49 loops/sec, sum: 189
Check: 49 loops/sec, sum: 238
Check: 49 loops/sec, sum: 287
Check: 50 loops/sec, sum: 337
Check: 49 loops/sec, sum: 386
Check: 49 loops/sec, sum: 435
Check: 49 loops/sec, sum: 484
Check: 49 loops/sec, sum: 533
Check: 49 loops/sec, sum: 582
Check: 49 loops/sec, sum: 631
Check: 49 loops/sec, sum: 680
Check: 49 loops/sec, sum: 729
Check: 49 loops/sec, sum: 778
Check: 49 loops/sec, sum: 827
Check: 49 loops/sec, sum: 876
Check: 49 loops/sec, sum: 925
Check: 50 loops/sec, sum: 975
Check: 49 loops/sec, sum: 1024
Final: 48 loops/sec, sum: 1024

The difference is about 20% (41 vs. 49 loops/sec), which is about the same
amount of work as performed by the slave process. If the two processes race
for the same CPU, this 20% of performance is lost.

I've tested this on 3 computers and each showed the same symptoms:
* quad core Opteron, running Ubuntu kernel 2.6.27-13.29
* Core 2 Duo, running Ubuntu kernel 2.6.27-11.27
* Dual Core Opteron, Debian backports.org kernel 2.6.26-13~bpo40+1

Is this a bug, or a feature?

--
Bazsi


Attachments:
pipetest.c (2.21 kB)
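
[The attachment is not reproduced here. As a rough idea of the kind of test
being described -- a minimal sketch only, with illustrative names, buffer
sizes and loop counts, not the actual pipetest.c -- the producer/consumer
pair looks something like this:]

/*
 * Minimal sketch of the kind of test described above -- NOT the attached
 * pipetest.c; names, buffer sizes and loop counts are illustrative only.
 * The parent feeds a pipe and burns CPU; the child drains the pipe and
 * burns (less) CPU. Any command line argument pins the two processes to
 * different CPUs via sched_setaffinity().
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BUF_SZ  4096
#define WORK_SZ (4 * 1024 * 1024)    /* memset() working set */

static void burn_cpu(char *work, int iters)
{
    int i;

    for (i = 0; i < iters; i++)
        memset(work, i, WORK_SZ);
}

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) < 0)
        perror("sched_setaffinity");
}

int main(int argc, char **argv)
{
    int pfd[2], use_affinity = argc > 1;
    char buf[BUF_SZ], *work = malloc(WORK_SZ);
    time_t start = time(NULL), last = start, now;
    unsigned long loops = 0, prev = 0;

    memset(buf, 0, sizeof(buf));
    if (!work || pipe(pfd) < 0)
        return 1;

    if (fork() == 0) {                  /* child: reader, the lighter "slave" */
        close(pfd[1]);
        if (use_affinity)
            pin_to_cpu(1);
        while (read(pfd[0], buf, sizeof(buf)) > 0)
            burn_cpu(work, 4);          /* roughly 20% of the writer's work */
        return 0;
    }

    close(pfd[0]);                      /* parent: writer, the heavy "master" */
    if (use_affinity)
        pin_to_cpu(0);

    for (;;) {
        burn_cpu(work, 20);
        write(pfd[1], buf, sizeof(buf));
        loops++;

        now = time(NULL);
        if (now != last) {
            printf("Check: %lu loops/sec, sum: %lu\n", loops - prev, loops);
            prev = loops;
            last = now;
        }
        if (now - start >= 20) {
            printf("Final: %lu loops/sec, sum: %lu\n",
                   loops / (unsigned long)(now - start), loops);
            return 0;
        }
    }
}

Run without arguments, the two halves are free to be placed on one CPU; run
with an argument ("./a.out 1"), they are pinned to CPUs 0 and 1, which
corresponds to the two runs shown above.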

2009-03-07 18:47:20

by Balazs Scheidler

Subject: Re: scheduler oddity [bug?]

On Sat, 2009-03-07 at 18:47 +0100, Balazs Scheidler wrote:
> Hi,
>
> I'm experiencing an odd behaviour from the Linux scheduler. I have an
> application that feeds data to another process using a pipe. Both
> processes use a fair amount of CPU time apart from writing to/reading
> from this pipe.
>
> The machine I'm running on is an Opteron Quad-Core CPU:
> model name : Quad-Core AMD Opteron(tm) Processor 2347 HE
> stepping : 3
>
> What I see is that only one of the cores is used, the other three is
> idling without doing any work. If I explicitly set the CPU affinity of
> the processes to use distinct CPUs the performance goes up
> significantly. (e.g. it starts to use the other cores and the load
> scales linearly).
>
> I've tried to reproduce the problem by writing a small test program,
> which you can find attached. The program creates two processes, one
> feeds the other using a pipe and each does a series of memset() calls to
> simulate CPU load. I've also added capability to the program to set its
> own CPU affinity. The results (the more the better):
>
> Without enabling CPU affinity:
> $ ./a.out
> Check: 0 loops/sec, sum: 1
> Check: 12 loops/sec, sum: 13
> Check: 41 loops/sec, sum: 54
> Check: 41 loops/sec, sum: 95
> Check: 41 loops/sec, sum: 136
> Check: 41 loops/sec, sum: 177
> Check: 41 loops/sec, sum: 218
> Check: 40 loops/sec, sum: 258
> Check: 41 loops/sec, sum: 299
> Check: 41 loops/sec, sum: 340
> Check: 41 loops/sec, sum: 381
> Check: 41 loops/sec, sum: 422
> Check: 41 loops/sec, sum: 463
> Check: 41 loops/sec, sum: 504
> Check: 41 loops/sec, sum: 545
> Check: 40 loops/sec, sum: 585
> Check: 41 loops/sec, sum: 626
> Check: 41 loops/sec, sum: 667
> Check: 41 loops/sec, sum: 708
> Check: 41 loops/sec, sum: 749
> Check: 41 loops/sec, sum: 790
> Check: 41 loops/sec, sum: 831
> Final: 39 loops/sec, sum: 831
>
>
> With CPU affinity:
> # ./a.out 1
> Check: 0 loops/sec, sum: 1
> Check: 41 loops/sec, sum: 42
> Check: 49 loops/sec, sum: 91
> Check: 49 loops/sec, sum: 140
> Check: 49 loops/sec, sum: 189
> Check: 49 loops/sec, sum: 238
> Check: 49 loops/sec, sum: 287
> Check: 50 loops/sec, sum: 337
> Check: 49 loops/sec, sum: 386
> Check: 49 loops/sec, sum: 435
> Check: 49 loops/sec, sum: 484
> Check: 49 loops/sec, sum: 533
> Check: 49 loops/sec, sum: 582
> Check: 49 loops/sec, sum: 631
> Check: 49 loops/sec, sum: 680
> Check: 49 loops/sec, sum: 729
> Check: 49 loops/sec, sum: 778
> Check: 49 loops/sec, sum: 827
> Check: 49 loops/sec, sum: 876
> Check: 49 loops/sec, sum: 925
> Check: 50 loops/sec, sum: 975
> Check: 49 loops/sec, sum: 1024
> Final: 48 loops/sec, sum: 1024
>
> The difference is about 20%, which is about the same work performed by
> the slave process. If the two processes race for the same CPU this 20%
> of performance is lost.
>
> I've tested this on 3 computers and each showed the same symptoms:
> * quad core Opteron, running Ubuntu kernel 2.6.27-13.29
> * Core 2 Duo, running Ubuntu kernel 2.6.27-11.27
> * Dual Core Opteron, Debian backports.org kernel 2.6.26-13~bpo40+1
>
> Is this a bug, or a feature?
>

One new, interesting piece of information: I've retested with a 2.6.22 based
kernel, and it still works there: setting the CPU affinity does not
change the performance of the test program, and mpstat nicely shows that
2 cores are working, not just one.

Maybe this is CFS related? That was merged for 2.6.23 IIRC.

Also, I tried changing various scheduler knobs
in /proc/sys/kernel/sched_* but they didn't help. I've tried to change
these:

* sched_migration_cost: changed from the default 500000 to 100000 and
then 10000 but neither helped.
* sched_nr_migrate: increased it to 64, but again nothing

I'm starting to think that this is a regression that may or may not be
related to CFS.

I don't have a box I could bisect on, but the test program makes
the problem quite obvious.


--
Bazsi

2009-03-08 09:42:21

by Mike Galbraith

Subject: Re: scheduler oddity [bug?]

On Sat, 2009-03-07 at 18:47 +0100, Balazs Scheidler wrote:
> Hi,
>
> I'm experiencing an odd behaviour from the Linux scheduler. I have an
> application that feeds data to another process using a pipe. Both
> processes use a fair amount of CPU time apart from writing to/reading
> from this pipe.
>
> The machine I'm running on is an Opteron Quad-Core CPU:
> model name : Quad-Core AMD Opteron(tm) Processor 2347 HE
> stepping : 3
>
> What I see is that only one of the cores is used, the other three is
> idling without doing any work. If I explicitly set the CPU affinity of
> the processes to use distinct CPUs the performance goes up
> significantly. (e.g. it starts to use the other cores and the load
> scales linearly).
>
> I've tried to reproduce the problem by writing a small test program,
> which you can find attached. The program creates two processes, one
> feeds the other using a pipe and each does a series of memset() calls to
> simulate CPU load. I've also added capability to the program to set its
> own CPU affinity. The results (the more the better):
>
> Without enabling CPU affinity:
> $ ./a.out
> Check: 0 loops/sec, sum: 1
> Check: 12 loops/sec, sum: 13
> Check: 41 loops/sec, sum: 54
> Check: 41 loops/sec, sum: 95
> Check: 41 loops/sec, sum: 136
> Check: 41 loops/sec, sum: 177
> Check: 41 loops/sec, sum: 218
> Check: 40 loops/sec, sum: 258
> Check: 41 loops/sec, sum: 299
> Check: 41 loops/sec, sum: 340
> Check: 41 loops/sec, sum: 381
> Check: 41 loops/sec, sum: 422
> Check: 41 loops/sec, sum: 463
> Check: 41 loops/sec, sum: 504
> Check: 41 loops/sec, sum: 545
> Check: 40 loops/sec, sum: 585
> Check: 41 loops/sec, sum: 626
> Check: 41 loops/sec, sum: 667
> Check: 41 loops/sec, sum: 708
> Check: 41 loops/sec, sum: 749
> Check: 41 loops/sec, sum: 790
> Check: 41 loops/sec, sum: 831
> Final: 39 loops/sec, sum: 831
>
>
> With CPU affinity:
> # ./a.out 1
> Check: 0 loops/sec, sum: 1
> Check: 41 loops/sec, sum: 42
> Check: 49 loops/sec, sum: 91
> Check: 49 loops/sec, sum: 140
> Check: 49 loops/sec, sum: 189
> Check: 49 loops/sec, sum: 238
> Check: 49 loops/sec, sum: 287
> Check: 50 loops/sec, sum: 337
> Check: 49 loops/sec, sum: 386
> Check: 49 loops/sec, sum: 435
> Check: 49 loops/sec, sum: 484
> Check: 49 loops/sec, sum: 533
> Check: 49 loops/sec, sum: 582
> Check: 49 loops/sec, sum: 631
> Check: 49 loops/sec, sum: 680
> Check: 49 loops/sec, sum: 729
> Check: 49 loops/sec, sum: 778
> Check: 49 loops/sec, sum: 827
> Check: 49 loops/sec, sum: 876
> Check: 49 loops/sec, sum: 925
> Check: 50 loops/sec, sum: 975
> Check: 49 loops/sec, sum: 1024
> Final: 48 loops/sec, sum: 1024
>
> The difference is about 20%, which is about the same work performed by
> the slave process. If the two processes race for the same CPU this 20%
> of performance is lost.
>
> I've tested this on 3 computers and each showed the same symptoms:
> * quad core Opteron, running Ubuntu kernel 2.6.27-13.29
> * Core 2 Duo, running Ubuntu kernel 2.6.27-11.27
> * Dual Core Opteron, Debian backports.org kernel 2.6.26-13~bpo40+1
>
> Is this a bug, or a feature?

Both. Affine wakeups are cache friendly, and generally a feature, but they
can lead to underutilized CPUs in some cases, thus turning the feature into a
bug, as your testcase demonstrates. The metric we use for the affinity hint
works well, but clearly wants some refinement.
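
[Roughly, the hint boils down to comparing both tasks' avg_overlap against
sysctl_sched_migration_cost -- a paraphrased sketch with an illustrative
helper name, not the exact sched_fair.c code:]

/*
 * Paraphrased sketch of the affinity hint discussed here (illustrative
 * helper name, not the exact sched_fair.c code): a wakeup is treated as
 * synchronous -- and the wakee is kept on the waker's CPU -- when both
 * tasks report an average overlap below the migration cost.
 */
static int wakeup_looks_synchronous(struct task_struct *waker,
                                    struct task_struct *wakee)
{
        return waker->se.avg_overlap < sysctl_sched_migration_cost &&
               wakee->se.avg_overlap < sysctl_sched_migration_cost;
}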

You can turn this scheduler hint off via:
echo NO_SYNC_WAKEUPS > /sys/kernel/debug/sched_features

-Mike

2009-03-08 09:58:44

by Mike Galbraith

Subject: Re: scheduler oddity [bug?]

On Sun, 2009-03-08 at 10:42 +0100, Mike Galbraith wrote:
> On Sat, 2009-03-07 at 18:47 +0100, Balazs Scheidler wrote:
> > Hi,
> >
> > I'm experiencing an odd behaviour from the Linux scheduler. I have an
> > application that feeds data to another process using a pipe. Both
> > processes use a fair amount of CPU time apart from writing to/reading
> > from this pipe.
> >
> > The machine I'm running on is an Opteron Quad-Core CPU:
> > model name : Quad-Core AMD Opteron(tm) Processor 2347 HE
> > stepping : 3
> >
> > What I see is that only one of the cores is used, the other three is
> > idling without doing any work. If I explicitly set the CPU affinity of
> > the processes to use distinct CPUs the performance goes up
> > significantly. (e.g. it starts to use the other cores and the load
> > scales linearly).
> >
> > I've tried to reproduce the problem by writing a small test program,
> > which you can find attached. The program creates two processes, one
> > feeds the other using a pipe and each does a series of memset() calls to
> > simulate CPU load. I've also added capability to the program to set its
> > own CPU affinity. The results (the more the better):
> >
> > Without enabling CPU affinity:
> > $ ./a.out
> > Check: 0 loops/sec, sum: 1
> > Check: 12 loops/sec, sum: 13
> > Check: 41 loops/sec, sum: 54
> > Check: 41 loops/sec, sum: 95
> > Check: 41 loops/sec, sum: 136
> > Check: 41 loops/sec, sum: 177
> > Check: 41 loops/sec, sum: 218
> > Check: 40 loops/sec, sum: 258
> > Check: 41 loops/sec, sum: 299
> > Check: 41 loops/sec, sum: 340
> > Check: 41 loops/sec, sum: 381
> > Check: 41 loops/sec, sum: 422
> > Check: 41 loops/sec, sum: 463
> > Check: 41 loops/sec, sum: 504
> > Check: 41 loops/sec, sum: 545
> > Check: 40 loops/sec, sum: 585
> > Check: 41 loops/sec, sum: 626
> > Check: 41 loops/sec, sum: 667
> > Check: 41 loops/sec, sum: 708
> > Check: 41 loops/sec, sum: 749
> > Check: 41 loops/sec, sum: 790
> > Check: 41 loops/sec, sum: 831
> > Final: 39 loops/sec, sum: 831
> >
> >
> > With CPU affinity:
> > # ./a.out 1
> > Check: 0 loops/sec, sum: 1
> > Check: 41 loops/sec, sum: 42
> > Check: 49 loops/sec, sum: 91
> > Check: 49 loops/sec, sum: 140
> > Check: 49 loops/sec, sum: 189
> > Check: 49 loops/sec, sum: 238
> > Check: 49 loops/sec, sum: 287
> > Check: 50 loops/sec, sum: 337
> > Check: 49 loops/sec, sum: 386
> > Check: 49 loops/sec, sum: 435
> > Check: 49 loops/sec, sum: 484
> > Check: 49 loops/sec, sum: 533
> > Check: 49 loops/sec, sum: 582
> > Check: 49 loops/sec, sum: 631
> > Check: 49 loops/sec, sum: 680
> > Check: 49 loops/sec, sum: 729
> > Check: 49 loops/sec, sum: 778
> > Check: 49 loops/sec, sum: 827
> > Check: 49 loops/sec, sum: 876
> > Check: 49 loops/sec, sum: 925
> > Check: 50 loops/sec, sum: 975
> > Check: 49 loops/sec, sum: 1024
> > Final: 48 loops/sec, sum: 1024
> >
> > The difference is about 20%, which is about the same work performed by
> > the slave process. If the two processes race for the same CPU this 20%
> > of performance is lost.
> >
> > I've tested this on 3 computers and each showed the same symptoms:
> > * quad core Opteron, running Ubuntu kernel 2.6.27-13.29
> > * Core 2 Duo, running Ubuntu kernel 2.6.27-11.27
> > * Dual Core Opteron, Debian backports.org kernel 2.6.26-13~bpo40+1
> >
> > Is this a bug, or a feature?
>
> Both. Affine wakeups are cache friendly, and generally a feature, but
> can lead to underutilized CPUs in some cases, thus turning feature into
> bug as your testcase demonstrates. The metric we for the affinity hint
> works well, but clearly wants some refinement.
>
> You can turn this scheduler hint off via:
> echo NO_SYNC_WAKEUPS > /sys/kernel/debug/sched_features
>

The problem with your particular testcase is that while one half has an
avg_overlap (what we use as the affinity hint for synchronous wakeups) which
triggers the affinity hint, the other half has an avg_overlap of zero, which
it was born with, so despite significant execution overlap, the
scheduler treats them as if they were truly synchronous tasks.
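
[For reference, update_avg() used in the hack below is just a crude running
average; from memory, the sched.c of this era has roughly the following, so
a stale value only moves slowly once new samples do arrive:]

/*
 * Rough sketch (from memory) of update_avg() in kernel/sched.c of this
 * era: each new sample moves the average by 1/8th of the difference.
 */
static void update_avg(u64 *avg, u64 sample)
{
        s64 diff = sample - *avg;

        *avg += diff >> 3;
}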

The below cures it, but is only a demo hack.

diff --git a/kernel/sched.c b/kernel/sched.c
index 8e2558c..85f9ced 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1712,11 +1712,15 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)

static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
{
+ u64 limit = sysctl_sched_migration_cost;
+ u64 runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
+
if (sleep && p->se.last_wakeup) {
update_avg(&p->se.avg_overlap,
p->se.sum_exec_runtime - p->se.last_wakeup);
p->se.last_wakeup = 0;
- }
+ } else if (p->se.avg_overlap < limit && runtime >= limit)
+ update_avg(&p->se.avg_overlap, runtime);

sched_info_dequeued(p);
p->sched_class->dequeue_task(rq, p, sleep);

pipetest (6701, #threads: 1)
---------------------------------------------------------
se.exec_start : 5607096.896687
se.vruntime : 274158.274352
se.sum_exec_runtime : 139434.783417
se.avg_overlap : 6.477067 <== was zero
nr_switches : 2246
nr_voluntary_switches : 1
nr_involuntary_switches : 2245
se.load.weight : 1024
policy : 0
prio : 120
clock-delta : 102

pipetest (6702, #threads: 1)
---------------------------------------------------------
se.exec_start : 5607096.896687
se.vruntime : 274098.273516
se.sum_exec_runtime : 32987.899515
se.avg_overlap : 0.502174
nr_switches : 13631
nr_voluntary_switches : 11639
nr_involuntary_switches : 1992
se.load.weight : 1024
policy : 0
prio : 120
clock-delta : 117


2009-03-08 10:02:19

by Mike Galbraith

Subject: Re: scheduler oddity [bug?]

On Sun, 2009-03-08 at 10:58 +0100, Mike Galbraith wrote:
> On Sun, 2009-03-08 at 10:42 +0100, Mike Galbraith wrote:
> > On Sat, 2009-03-07 at 18:47 +0100, Balazs Scheidler wrote:
> > > Hi,
> > >
> > > I'm experiencing an odd behaviour from the Linux scheduler. I have an
> > > application that feeds data to another process using a pipe. Both
> > > processes use a fair amount of CPU time apart from writing to/reading
> > > from this pipe.
> > >
> > > The machine I'm running on is an Opteron Quad-Core CPU:
> > > model name : Quad-Core AMD Opteron(tm) Processor 2347 HE
> > > stepping : 3
> > >
> > > What I see is that only one of the cores is used, the other three is
> > > idling without doing any work. If I explicitly set the CPU affinity of
> > > the processes to use distinct CPUs the performance goes up
> > > significantly. (e.g. it starts to use the other cores and the load
> > > scales linearly).
> > >
> > > I've tried to reproduce the problem by writing a small test program,
> > > which you can find attached. The program creates two processes, one
> > > feeds the other using a pipe and each does a series of memset() calls to
> > > simulate CPU load. I've also added capability to the program to set its
> > > own CPU affinity. The results (the more the better):
> > >
> > > Without enabling CPU affinity:
> > > $ ./a.out
> > > Check: 0 loops/sec, sum: 1
> > > Check: 12 loops/sec, sum: 13
> > > Check: 41 loops/sec, sum: 54
> > > Check: 41 loops/sec, sum: 95
> > > Check: 41 loops/sec, sum: 136
> > > Check: 41 loops/sec, sum: 177
> > > Check: 41 loops/sec, sum: 218
> > > Check: 40 loops/sec, sum: 258
> > > Check: 41 loops/sec, sum: 299
> > > Check: 41 loops/sec, sum: 340
> > > Check: 41 loops/sec, sum: 381
> > > Check: 41 loops/sec, sum: 422
> > > Check: 41 loops/sec, sum: 463
> > > Check: 41 loops/sec, sum: 504
> > > Check: 41 loops/sec, sum: 545
> > > Check: 40 loops/sec, sum: 585
> > > Check: 41 loops/sec, sum: 626
> > > Check: 41 loops/sec, sum: 667
> > > Check: 41 loops/sec, sum: 708
> > > Check: 41 loops/sec, sum: 749
> > > Check: 41 loops/sec, sum: 790
> > > Check: 41 loops/sec, sum: 831
> > > Final: 39 loops/sec, sum: 831
> > >
> > >
> > > With CPU affinity:
> > > # ./a.out 1
> > > Check: 0 loops/sec, sum: 1
> > > Check: 41 loops/sec, sum: 42
> > > Check: 49 loops/sec, sum: 91
> > > Check: 49 loops/sec, sum: 140
> > > Check: 49 loops/sec, sum: 189
> > > Check: 49 loops/sec, sum: 238
> > > Check: 49 loops/sec, sum: 287
> > > Check: 50 loops/sec, sum: 337
> > > Check: 49 loops/sec, sum: 386
> > > Check: 49 loops/sec, sum: 435
> > > Check: 49 loops/sec, sum: 484
> > > Check: 49 loops/sec, sum: 533
> > > Check: 49 loops/sec, sum: 582
> > > Check: 49 loops/sec, sum: 631
> > > Check: 49 loops/sec, sum: 680
> > > Check: 49 loops/sec, sum: 729
> > > Check: 49 loops/sec, sum: 778
> > > Check: 49 loops/sec, sum: 827
> > > Check: 49 loops/sec, sum: 876
> > > Check: 49 loops/sec, sum: 925
> > > Check: 50 loops/sec, sum: 975
> > > Check: 49 loops/sec, sum: 1024
> > > Final: 48 loops/sec, sum: 1024
> > >
> > > The difference is about 20%, which is about the same work performed by
> > > the slave process. If the two processes race for the same CPU this 20%
> > > of performance is lost.
> > >
> > > I've tested this on 3 computers and each showed the same symptoms:
> > > * quad core Opteron, running Ubuntu kernel 2.6.27-13.29
> > > * Core 2 Duo, running Ubuntu kernel 2.6.27-11.27
> > > * Dual Core Opteron, Debian backports.org kernel 2.6.26-13~bpo40+1
> > >
> > > Is this a bug, or a feature?
> >
> > Both. Affine wakeups are cache friendly, and generally a feature, but
> > can lead to underutilized CPUs in some cases, thus turning feature into
> > bug as your testcase demonstrates. The metric we for the affinity hint
> > works well, but clearly wants some refinement.
> >
> > You can turn this scheduler hint off via:
> > echo NO_SYNC_WAKEUPS > /sys/kernel/debug/sched_features
> >
>

(reply got munged)

> The problem with your particular testcase is that while one half has an
> avg_overlap (what we use as affinity hint for synchronous wakeups) which
> triggers the affinity hint, the other half has avg_overlap of zero, what
> it was born with, so despite significant execution overlap, the
> scheduler treats them as if they were truly synchronous tasks.
>
> The below cures it, but is only a demo hack.

diff --git a/kernel/sched.c b/kernel/sched.c
index 8e2558c..85f9ced 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1712,11 +1712,15 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)

static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
{
+ u64 limit = sysctl_sched_migration_cost;
+ u64 runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
+
if (sleep && p->se.last_wakeup) {
update_avg(&p->se.avg_overlap,
p->se.sum_exec_runtime - p->se.last_wakeup);
p->se.last_wakeup = 0;
- }
+ } else if (p->se.avg_overlap < limit && runtime >= limit)
+ update_avg(&p->se.avg_overlap, runtime);

sched_info_dequeued(p);
p->sched_class->dequeue_task(rq, p, sleep);

pipetest (6701, #threads: 1)
---------------------------------------------------------
se.exec_start : 5607096.896687
se.vruntime : 274158.274352
se.sum_exec_runtime : 139434.783417
se.avg_overlap : 6.477067 <== was zero
nr_switches : 2246
nr_voluntary_switches : 1
nr_involuntary_switches : 2245
se.load.weight : 1024
policy : 0
prio : 120
clock-delta : 102

pipetest (6702, #threads: 1)
---------------------------------------------------------
se.exec_start : 5607096.896687
se.vruntime : 274098.273516
se.sum_exec_runtime : 32987.899515
se.avg_overlap : 0.502174 <== was always < migration cost
nr_switches : 13631
nr_voluntary_switches : 11639
nr_involuntary_switches : 1992
se.load.weight : 1024
policy : 0
prio : 120
clock-delta : 117

2009-03-08 10:20:32

by Peter Zijlstra

Subject: Re: scheduler oddity [bug?]

On Sun, 2009-03-08 at 10:58 +0100, Mike Galbraith wrote:
>
> The below cures it, but is only a demo hack.
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 8e2558c..85f9ced 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1712,11 +1712,15 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
>
> static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
> {
> + u64 limit = sysctl_sched_migration_cost;
> + u64 runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> +
> if (sleep && p->se.last_wakeup) {
> update_avg(&p->se.avg_overlap,
> p->se.sum_exec_runtime - p->se.last_wakeup);
> p->se.last_wakeup = 0;
> - }


Wouldn't something like the below be more suited:

diff --git a/kernel/sched.c b/kernel/sched.c
index 89e2ca0..94c8b02 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2501,7 +2501,7 @@ static void __sched_fork(struct task_struct *p)
p->se.prev_sum_exec_runtime = 0;
p->se.nr_migrations = 0;
p->se.last_wakeup = 0;
- p->se.avg_overlap = 0;
+ p->se.avg_overlap = sysctl_sched_migration_cost;
p->se.start_runtime = 0;
p->se.avg_wakeup = sysctl_sched_wakeup_granularity;


2009-03-08 13:35:46

by Mike Galbraith

Subject: Re: scheduler oddity [bug?]

On Sun, 2009-03-08 at 11:19 +0100, Peter Zijlstra wrote:

> Wouldn't something like the below be more suited:
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 89e2ca0..94c8b02 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2501,7 +2501,7 @@ static void __sched_fork(struct task_struct *p)
> p->se.prev_sum_exec_runtime = 0;
> p->se.nr_migrations = 0;
> p->se.last_wakeup = 0;
> - p->se.avg_overlap = 0;
> + p->se.avg_overlap = sysctl_sched_migration_cost;
> p->se.start_runtime = 0;
> p->se.avg_wakeup = sysctl_sched_wakeup_granularity;
>

Dunno exactly what we need to do here. My hacklet certainly ain't it,
but I suspect this isn't going to work out either. Having tasks begin
life at zero or sysctl_sched_migration_cost won't matter much methinks.

-Mike

2009-03-08 15:40:24

by Ingo Molnar

Subject: Re: scheduler oddity [bug?]


* Mike Galbraith <[email protected]> wrote:

> The problem with your particular testcase is that while one
> half has an avg_overlap (what we use as affinity hint for
> synchronous wakeups) which triggers the affinity hint, the
> other half has avg_overlap of zero, what it was born with, so
> despite significant execution overlap, the scheduler treats
> them as if they were truly synchronous tasks.

hm, why does it stay on zero?

> static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
> {
> + u64 limit = sysctl_sched_migration_cost;
> + u64 runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> +
> if (sleep && p->se.last_wakeup) {
> update_avg(&p->se.avg_overlap,
> p->se.sum_exec_runtime - p->se.last_wakeup);
> p->se.last_wakeup = 0;
> - }
> + } else if (p->se.avg_overlap < limit && runtime >= limit)
> + update_avg(&p->se.avg_overlap, runtime);
>
> sched_info_dequeued(p);
> p->sched_class->dequeue_task(rq, p, sleep);

hm, that's weird. We want to limit avg_overlap maintenance to
true sleeps only.

And this patch only makes a difference in the !sleep case -
which shouldnt be that common in this workload.

What am i missing? A trace would be nice of a few millisecs of
runtime of this workload, with a few more trace_printk()'s added
showing how avg_overlap develops.

Maybe it's some task migration artifact? There we do a
dequeue/enqueue with sleep=0.

Ingo

2009-03-08 16:20:29

by Mike Galbraith

Subject: Re: scheduler oddity [bug?]

On Sun, 2009-03-08 at 16:39 +0100, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > The problem with your particular testcase is that while one
> > half has an avg_overlap (what we use as affinity hint for
> > synchronous wakeups) which triggers the affinity hint, the
> > other half has avg_overlap of zero, what it was born with, so
> > despite significant execution overlap, the scheduler treats
> > them as if they were truly synchronous tasks.
>
> hm, why does it stay on zero?

Wakeup preemption. Presuming here: the heavy task wakes the light task and is
preempted, the light task stuffs data into the pipe, the heavy task doesn't
block, so no avg_overlap is ever computed. The heavy task uses 100% CPU.

Running as SCHED_BATCH (virgin source), it becomes sane.

pipetest (6836, #threads: 1)
---------------------------------------------------------
se.exec_start : 266073.001296
se.vruntime : 173620.953443
se.sum_exec_runtime : 11324.486321
se.avg_overlap : 1.306762
nr_switches : 381
nr_voluntary_switches : 2
nr_involuntary_switches : 379
se.load.weight : 1024
policy : 3
prio : 120
clock-delta : 109

pipetest (6837, #threads: 1)
---------------------------------------------------------
se.exec_start : 266066.098182
se.vruntime : 51893.050177
se.sum_exec_runtime : 2367.077751
se.avg_overlap : 0.077492
nr_switches : 897
nr_voluntary_switches : 828
nr_involuntary_switches : 69
se.load.weight : 1024
policy : 3
prio : 120
clock-delta : 109
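
[For reference, the SCHED_BATCH run was done along these lines -- a sketch
only, the exact invocation is not shown in this thread; policy 3 in the stats
above corresponds to SCHED_BATCH:]

/*
 * Sketch: one way to run the testcase under SCHED_BATCH (the exact
 * invocation used above is not shown in this thread).
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        struct sched_param sp = { .sched_priority = 0 };

        /* 0 == calling process; the policy is inherited across fork() */
        if (sched_setscheduler(0, SCHED_BATCH, &sp) < 0)
                perror("sched_setscheduler");

        /* ... fork/exec the pipe test from here ... */
        return 0;
}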

> > static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
> > {
> > + u64 limit = sysctl_sched_migration_cost;
> > + u64 runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> > +
> > if (sleep && p->se.last_wakeup) {
> > update_avg(&p->se.avg_overlap,
> > p->se.sum_exec_runtime - p->se.last_wakeup);
> > p->se.last_wakeup = 0;
> > - }
> > + } else if (p->se.avg_overlap < limit && runtime >= limit)
> > + update_avg(&p->se.avg_overlap, runtime);
> >
> > sched_info_dequeued(p);
> > p->sched_class->dequeue_task(rq, p, sleep);
>
> hm, that's weird. We want to limit avg_overlap maintenance to
> true sleeps only.

Except that when we stop sleeping, we're left with a stale avg_overlap.

> And this patch only makes a difference in the !sleep case -
> which shouldnt be that common in this workload.

The hack was only there to kill the stale zero. Let's forget the hack ;-)

-Mike

2009-03-08 17:53:19

by Ingo Molnar

Subject: Re: scheduler oddity [bug?]


* Mike Galbraith <[email protected]> wrote:

> On Sun, 2009-03-08 at 16:39 +0100, Ingo Molnar wrote:
> > * Mike Galbraith <[email protected]> wrote:
> >
> > > The problem with your particular testcase is that while one
> > > half has an avg_overlap (what we use as affinity hint for
> > > synchronous wakeups) which triggers the affinity hint, the
> > > other half has avg_overlap of zero, what it was born with, so
> > > despite significant execution overlap, the scheduler treats
> > > them as if they were truly synchronous tasks.
> >
> > hm, why does it stay on zero?
>
> Wakeup preemption. Presuming here: heavy task wakes light
> task, is preempted, light task stuffs data into pipe, heavy
> task doesn't block, so no avg_overlap is ever computed. The
> heavy task uses 100% CPU.
>
> Running as SCHED_BATCH (virgin source), it becomes sane.

ah.

I'd argue then that time spent on the rq preempted _should_
count in avg_overlap statistics. I.e. couldnt we do something
like ... your patch? :)

> > if (sleep && p->se.last_wakeup) {
> > update_avg(&p->se.avg_overlap,
> > p->se.sum_exec_runtime - p->se.last_wakeup);
> > p->se.last_wakeup = 0;
> > - }
> > + } else if (p->se.avg_overlap < limit && runtime >= limit)
> > + update_avg(&p->se.avg_overlap, runtime);

Just done unconditionally, i.e. something like:

if (sleep) {
runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
p->se.last_wakeup = 0;
} else {
runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
}

update_avg(&p->se.avg_overlap, runtime);

?

Ingo

2009-03-08 18:40:01

by Mike Galbraith

Subject: Re: scheduler oddity [bug?]

On Sun, 2009-03-08 at 18:52 +0100, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > On Sun, 2009-03-08 at 16:39 +0100, Ingo Molnar wrote:
> > > * Mike Galbraith <[email protected]> wrote:
> > >
> > > > The problem with your particular testcase is that while one
> > > > half has an avg_overlap (what we use as affinity hint for
> > > > synchronous wakeups) which triggers the affinity hint, the
> > > > other half has avg_overlap of zero, what it was born with, so
> > > > despite significant execution overlap, the scheduler treats
> > > > them as if they were truly synchronous tasks.
> > >
> > > hm, why does it stay on zero?
> >
> > Wakeup preemption. Presuming here: heavy task wakes light
> > task, is preempted, light task stuffs data into pipe, heavy
> > task doesn't block, so no avg_overlap is ever computed. The
> > heavy task uses 100% CPU.
> >
> > Running as SCHED_BATCH (virgin source), it becomes sane.
>
> ah.
>
> I'd argue then that time spent on the rq preempted _should_
> count in avg_overlap statistics. I.e. couldnt we do something
> like ... your patch? :)
>
> > > if (sleep && p->se.last_wakeup) {
> > > update_avg(&p->se.avg_overlap,
> > > p->se.sum_exec_runtime - p->se.last_wakeup);
> > > p->se.last_wakeup = 0;
> > > - }
> > > + } else if (p->se.avg_overlap < limit && runtime >= limit)
> > > + update_avg(&p->se.avg_overlap, runtime);
>
> Just done unconditionally, i.e. something like:
>
> if (sleep) {
> runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
> p->se.last_wakeup = 0;
> } else {
> runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> }
>
> update_avg(&p->se.avg_overlap, runtime);
>
> ?

That'll do it for this load. I'll resume in the a.m., give that some
testing, and try to remember all the things I was paranoid about.
(getting interrupted a _lot_.. i give up on today;)

-Mike

2009-03-08 18:55:42

by Ingo Molnar

Subject: Re: scheduler oddity [bug?]


* Mike Galbraith <[email protected]> wrote:

> On Sun, 2009-03-08 at 18:52 +0100, Ingo Molnar wrote:
> > * Mike Galbraith <[email protected]> wrote:
> >
> > > On Sun, 2009-03-08 at 16:39 +0100, Ingo Molnar wrote:
> > > > * Mike Galbraith <[email protected]> wrote:
> > > >
> > > > > The problem with your particular testcase is that while one
> > > > > half has an avg_overlap (what we use as affinity hint for
> > > > > synchronous wakeups) which triggers the affinity hint, the
> > > > > other half has avg_overlap of zero, what it was born with, so
> > > > > despite significant execution overlap, the scheduler treats
> > > > > them as if they were truly synchronous tasks.
> > > >
> > > > hm, why does it stay on zero?
> > >
> > > Wakeup preemption. Presuming here: heavy task wakes light
> > > task, is preempted, light task stuffs data into pipe, heavy
> > > task doesn't block, so no avg_overlap is ever computed. The
> > > heavy task uses 100% CPU.
> > >
> > > Running as SCHED_BATCH (virgin source), it becomes sane.
> >
> > ah.
> >
> > I'd argue then that time spent on the rq preempted _should_
> > count in avg_overlap statistics. I.e. couldnt we do something
> > like ... your patch? :)
> >
> > > > if (sleep && p->se.last_wakeup) {
> > > > update_avg(&p->se.avg_overlap,
> > > > p->se.sum_exec_runtime - p->se.last_wakeup);
> > > > p->se.last_wakeup = 0;
> > > > - }
> > > > + } else if (p->se.avg_overlap < limit && runtime >= limit)
> > > > + update_avg(&p->se.avg_overlap, runtime);
> >
> > Just done unconditionally, i.e. something like:
> >
> > if (sleep) {
> > runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
> > p->se.last_wakeup = 0;
> > } else {
> > runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> > }
> >
> > update_avg(&p->se.avg_overlap, runtime);
> >
> > ?
>
> That'll do it for this load. I'll resume in the a.m., give
> that some testing, and try to remember all the things I was
> paranoid about.

btw., there's room for a cleanup + micro-optimization here too:
it would be nice to change se.last_wakeup and
se.prev_sum_exec_runtime to be the same variable,
se.prev_timestamp or so.

That way we can do a simple:

update_avg(&p->se.avg_overlap,
p->se.sum_exec_runtime - p->se.prev_timestamp);
p->se.prev_timestamp = 0;

the latter is needed as we rely on the zeroing here:

kernel/sched.c: if (sleep && p->se.last_wakeup) {


Ingo

2009-03-08 20:18:24

by Balazs Scheidler

Subject: Re: scheduler oddity [bug?]

On Sat, 2009-03-07 at 19:47 +0100, Balazs Scheidler wrote:
> On Sat, 2009-03-07 at 18:47 +0100, Balazs Scheidler wrote:
> > Hi,
> >
> > I've tested this on 3 computers and each showed the same symptoms:
> > * quad core Opteron, running Ubuntu kernel 2.6.27-13.29
> > * Core 2 Duo, running Ubuntu kernel 2.6.27-11.27
> > * Dual Core Opteron, Debian backports.org kernel 2.6.26-13~bpo40+1
> >
> > Is this a bug, or a feature?
> >
>
> One new interesting information: I've retested with a 2.6.22 based
> kernel, and it still works there, setting the CPU affinity does not
> change the performance of the test program and mpstat nicely shows that
> 2 cores are working, not just one.
>
> Maybe this is CFS related? That was merged for 2.6.23 IIRC.
>
> Also, I tried changing various scheduler knobs
> in /proc/sys/kernel/sched_* but they didn't help. I've tried to change
> these:
>
> * sched_migration_cost: changed from the default 500000 to 100000 and
> then 10000 but neither helped.
> * sched_nr_migrate: increased it to 64, but again nothing
>
> I'm starting to think that this is a regression that may or may not be
> related to CFS.
>
> I don't have a box where I could bisect on, but the test program makes
> the problem quite obvious.

Some more test results:

The latest tree from Linus seems to work; at least the program runs on both
cores as it should. I bisected to find the patch that changed the behaviour,
and found this:

commit 38736f475071b80b66be28af7b44c854073699cc
Author: Gautham R Shenoy <[email protected]>
Date: Sat Sep 6 14:50:23 2008 +0530

sched: fix __load_balance_iterator() for cfq with only one task

The __load_balance_iterator() returns a NULL when there's only one
sched_entity which is a task. It is caused by the following code-path.

/* Skip over entities that are not tasks */
do {
se = list_entry(next, struct sched_entity, group_node);
next = next->next;
} while (next != &cfs_rq->tasks && !entity_is_task(se));

if (next == &cfs_rq->tasks)
return NULL;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This will return NULL even when se is a task.

As a side-effect, there was a regression in sched_mc behavior since 2.6.25,
since iter_move_one_task() when it calls load_balance_start_fair(),
would not get any tasks to move!

Fix this by checking if the last entity was a task or not.

Signed-off-by: Gautham R Shenoy <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>


This patch was integrated for 2.6.28. With the above patch, my test program uses
two cores as it should. I could only test this in a virtual machine, so I don't
know exact performance metrics, but I'll test 2.6.27 plus this patch on a real
box tomorrow to see if this was the culprit.

I'm not sure if this is related to the avg_overlap discussion (which I honestly
don't really understand :)


--
Bazsi

2009-03-08 22:03:34

by Willy Tarreau

Subject: Re: scheduler oddity [bug?]

Hi Balazs,

On Sun, Mar 08, 2009 at 08:45:24PM +0100, Balazs Scheidler wrote:
> On Sat, 2009-03-07 at 19:47 +0100, Balazs Scheidler wrote:
> > On Sat, 2009-03-07 at 18:47 +0100, Balazs Scheidler wrote:
> > > Hi,
> > >
> > > I've tested this on 3 computers and each showed the same symptoms:
> > > * quad core Opteron, running Ubuntu kernel 2.6.27-13.29
> > > * Core 2 Duo, running Ubuntu kernel 2.6.27-11.27
> > > * Dual Core Opteron, Debian backports.org kernel 2.6.26-13~bpo40+1
> > >
> > > Is this a bug, or a feature?
> > >
> >
> > One new interesting information: I've retested with a 2.6.22 based
> > kernel, and it still works there, setting the CPU affinity does not
> > change the performance of the test program and mpstat nicely shows that
> > 2 cores are working, not just one.
> >
> > Maybe this is CFS related? That was merged for 2.6.23 IIRC.
> >
> > Also, I tried changing various scheduler knobs
> > in /proc/sys/kernel/sched_* but they didn't help. I've tried to change
> > these:
> >
> > * sched_migration_cost: changed from the default 500000 to 100000 and
> > then 10000 but neither helped.
> > * sched_nr_migrate: increased it to 64, but again nothing
> >
> > I'm starting to think that this is a regression that may or may not be
> > related to CFS.
> >
> > I don't have a box where I could bisect on, but the test program makes
> > the problem quite obvious.
>
> Some more test results:
>
> Latest tree from Linus seems to work, at least the program runs on both
> cores as it should. I bisected the patch that changed behaviour, and
> I've found this:
>
> commit 38736f475071b80b66be28af7b44c854073699cc
> Author: Gautham R Shenoy <[email protected]>
> Date: Sat Sep 6 14:50:23 2008 +0530
>
> sched: fix __load_balance_iterator() for cfq with only one task
>
> The __load_balance_iterator() returns a NULL when there's only one
> sched_entity which is a task. It is caused by the following code-path.
>
> /* Skip over entities that are not tasks */
> do {
> se = list_entry(next, struct sched_entity, group_node);
> next = next->next;
> } while (next != &cfs_rq->tasks && !entity_is_task(se));
>
> if (next == &cfs_rq->tasks)
> return NULL;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> This will return NULL even when se is a task.
>
> As a side-effect, there was a regression in sched_mc behavior since 2.6.25,
> since iter_move_one_task() when it calls load_balance_start_fair(),
> would not get any tasks to move!
>
> Fix this by checking if the last entity was a task or not.
>
> Signed-off-by: Gautham R Shenoy <[email protected]>
> Acked-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
>
>
> This patch was integrated for 2.6.28. With the above patch, my test program uses
> two cores as it should. I could only test this in a virtual machine so I don't
> know exact performance metrics, but I'll test 2.6.27 + plus this patch on a real
> box tomorrow to see if this was the culprit.

Just tested right here and I can confirm it is the culprit. I can reliably
reproduce the issue here on my core2 duo, and this patch fixes it. With your
memset() loop at 20k iterations, I saw exactly 50% CPU usage, and a final
sum of 794. With the patch, I see 53% CPU and 909. Changing the loop to 80k
iterations shows 53% CPU usage and 541 loops without the patch, versus
639 loops and 63% CPU usage with the patch.

So there's clearly a big win.

On a related note, I've often noticed that my kernel builds with -j 2
only use one CPU. I'm wondering whether this could be related to the same
issue. Just testing, I don't notice this with the patch. I'll have to retry
without it later.

> I'm not sure if this is related to the avg_overlap discussion (which I honestly
> don't really understand :)

neither do I :-)

Regards,
Willy

2009-03-09 03:40:23

by Mike Galbraith

Subject: Re: scheduler oddity [bug?]

On Sun, 2009-03-08 at 23:03 +0100, Willy Tarreau wrote:
> Hi Balazs,
>
> On Sun, Mar 08, 2009 at 08:45:24PM +0100, Balazs Scheidler wrote:
> > On Sat, 2009-03-07 at 19:47 +0100, Balazs Scheidler wrote:
> > > On Sat, 2009-03-07 at 18:47 +0100, Balazs Scheidler wrote:
> > > > Hi,
> > > >
> > > > I've tested this on 3 computers and each showed the same symptoms:
> > > > * quad core Opteron, running Ubuntu kernel 2.6.27-13.29
> > > > * Core 2 Duo, running Ubuntu kernel 2.6.27-11.27
> > > > * Dual Core Opteron, Debian backports.org kernel 2.6.26-13~bpo40+1
> > > >
> > > > Is this a bug, or a feature?
> > > >
> > >
> > > One new interesting information: I've retested with a 2.6.22 based
> > > kernel, and it still works there, setting the CPU affinity does not
> > > change the performance of the test program and mpstat nicely shows that
> > > 2 cores are working, not just one.
> > >
> > > Maybe this is CFS related? That was merged for 2.6.23 IIRC.
> > >
> > > Also, I tried changing various scheduler knobs
> > > in /proc/sys/kernel/sched_* but they didn't help. I've tried to change
> > > these:
> > >
> > > * sched_migration_cost: changed from the default 500000 to 100000 and
> > > then 10000 but neither helped.
> > > * sched_nr_migrate: increased it to 64, but again nothing
> > >
> > > I'm starting to think that this is a regression that may or may not be
> > > related to CFS.
> > >
> > > I don't have a box where I could bisect on, but the test program makes
> > > the problem quite obvious.
> >
> > Some more test results:
> >
> > Latest tree from Linus seems to work, at least the program runs on both
> > cores as it should. I bisected the patch that changed behaviour, and
> > I've found this:
> >
> > commit 38736f475071b80b66be28af7b44c854073699cc
> > Author: Gautham R Shenoy <[email protected]>
> > Date: Sat Sep 6 14:50:23 2008 +0530
> >
> > sched: fix __load_balance_iterator() for cfq with only one task
> >
> > The __load_balance_iterator() returns a NULL when there's only one
> > sched_entity which is a task. It is caused by the following code-path.
> >
> > /* Skip over entities that are not tasks */
> > do {
> > se = list_entry(next, struct sched_entity, group_node);
> > next = next->next;
> > } while (next != &cfs_rq->tasks && !entity_is_task(se));
> >
> > if (next == &cfs_rq->tasks)
> > return NULL;
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > This will return NULL even when se is a task.
> >
> > As a side-effect, there was a regression in sched_mc behavior since 2.6.25,
> > since iter_move_one_task() when it calls load_balance_start_fair(),
> > would not get any tasks to move!
> >
> > Fix this by checking if the last entity was a task or not.
> >
> > Signed-off-by: Gautham R Shenoy <[email protected]>
> > Acked-by: Peter Zijlstra <[email protected]>
> > Signed-off-by: Ingo Molnar <[email protected]>
> >
> >
> > This patch was integrated for 2.6.28. With the above patch, my test program uses
> > two cores as it should. I could only test this in a virtual machine so I don't
> > know exact performance metrics, but I'll test 2.6.27 + plus this patch on a real
> > box tomorrow to see if this was the culprit.
>
> Just tested right here and I can confirm it is the culprit. I can reliably
> reproduce the issue here on my core2 duo, and this patch fixes it. With your
> memset() loop at 20k iterations, I saw exactly 50% CPU usage, and a final
> sum of 794. With the patch, I see 53% CPU and 909. Changing the loop to 80k
> iterations shows 53% CPU usage and 541 loops without the patch, versus
> 639 loops and 63% CPU usage with the patch.

Interesting. I'm testing in .git (Q6600), and it's only using one CPU
unless I actively intervene. Doing whatever to pry the pair apart takes
loops/sec from 70 to 84.

-Mike

2009-03-09 04:11:10

by Mike Galbraith

Subject: Re: scheduler oddity [bug?]

On Sun, 2009-03-08 at 19:55 +0100, Ingo Molnar wrote:

> btw., there's room for a cleanup + micro-optimization here too:
> it would be nice to change se.last_wakeup and
> se.prev_sum_exec_runtime to be the same variable,
> se.prev_timestamp or so.

I don't see how they can meld without breaking check_preempt_tick() all
to pieces.

-Mike

2009-03-09 06:52:56

by Ingo Molnar

Subject: Re: scheduler oddity [bug?]


* Mike Galbraith <[email protected]> wrote:

> On Sun, 2009-03-08 at 19:55 +0100, Ingo Molnar wrote:
>
> > btw., there's room for a cleanup + micro-optimization here too:
> > it would be nice to change se.last_wakeup and
> > se.prev_sum_exec_runtime to be the same variable,
> > se.prev_timestamp or so.
>
> I don't see how they can meld without breaking
> check_preempt_tick() all to pieces.

hm, indeed.

Ingo

2009-03-09 08:02:43

by Mike Galbraith

Subject: [patch] Re: scheduler oddity [bug?]

On Sun, 2009-03-08 at 18:52 +0100, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > On Sun, 2009-03-08 at 16:39 +0100, Ingo Molnar wrote:
> > > * Mike Galbraith <[email protected]> wrote:
> > >
> > > > The problem with your particular testcase is that while one
> > > > half has an avg_overlap (what we use as affinity hint for
> > > > synchronous wakeups) which triggers the affinity hint, the
> > > > other half has avg_overlap of zero, what it was born with, so
> > > > despite significant execution overlap, the scheduler treats
> > > > them as if they were truly synchronous tasks.
> > >
> > > hm, why does it stay on zero?
> >
> > Wakeup preemption. Presuming here: heavy task wakes light
> > task, is preempted, light task stuffs data into pipe, heavy
> > task doesn't block, so no avg_overlap is ever computed. The
> > heavy task uses 100% CPU.
> >
> > Running as SCHED_BATCH (virgin source), it becomes sane.
>
> ah.
>
> I'd argue then that time spent on the rq preempted _should_
> count in avg_overlap statistics. I.e. couldnt we do something
> like ... your patch? :)
>
> > > if (sleep && p->se.last_wakeup) {
> > > update_avg(&p->se.avg_overlap,
> > > p->se.sum_exec_runtime - p->se.last_wakeup);
> > > p->se.last_wakeup = 0;
> > > - }
> > > + } else if (p->se.avg_overlap < limit && runtime >= limit)
> > > + update_avg(&p->se.avg_overlap, runtime);
>
> Just done unconditionally, i.e. something like:
>
> if (sleep) {
> runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
> p->se.last_wakeup = 0;
> } else {
> runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> }
>
> update_avg(&p->se.avg_overlap, runtime);
>
> ?

OK, I've not seen any problem indications yet, so find patchlet below.

However! Balazs has stated that this problem is _not_ present in .git,
and that..

commit 38736f475071b80b66be28af7b44c854073699cc
Author: Gautham R Shenoy <[email protected]>
Date: Sat Sep 6 14:50:23 2008 +0530

..is what fixed it. Willy Tarreau verified this as being the case on
his HW as well. It is present in .git with my HW.

I see it as a problem, but it's your call. Dunno if I'd apply it or
hold back, given these conflicting reports.

Anyway...

Given a task pair communicating via pipe, if one partner fills/drains such
that the other does not block for extended periods, avg_overlap can be long
stale, and can trigger affine wakeups despite heavy CPU demand. This can, and
does, lead to throughput loss in the testcase posted by the reporter.

Fix this by unconditionally updating avg_overlap at dequeue time instead
of only updating when a task sleeps.

See http://lkml.org/lkml/2009/3/7/79 for details/testcase.

Reported-by: Balazs Scheidler <[email protected]>
Signed-off-by: Mike Galbraith <[email protected]>

kernel/sched.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8e2558c..c670050 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1712,12 +1712,17 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)

static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
{
+ u64 runtime;
+
if (sleep && p->se.last_wakeup) {
- update_avg(&p->se.avg_overlap,
- p->se.sum_exec_runtime - p->se.last_wakeup);
+ runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
p->se.last_wakeup = 0;
+ } else {
+ runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
}

+ update_avg(&p->se.avg_overlap, runtime);
+
sched_info_dequeued(p);
p->sched_class->dequeue_task(rq, p, sleep);
p->se.on_rq = 0;


2009-03-09 08:07:47

by Ingo Molnar

Subject: Re: [patch] Re: scheduler oddity [bug?]


* Mike Galbraith <[email protected]> wrote:

> On Sun, 2009-03-08 at 18:52 +0100, Ingo Molnar wrote:
> > * Mike Galbraith <[email protected]> wrote:
> >
> > > On Sun, 2009-03-08 at 16:39 +0100, Ingo Molnar wrote:
> > > > * Mike Galbraith <[email protected]> wrote:
> > > >
> > > > > The problem with your particular testcase is that while one
> > > > > half has an avg_overlap (what we use as affinity hint for
> > > > > synchronous wakeups) which triggers the affinity hint, the
> > > > > other half has avg_overlap of zero, what it was born with, so
> > > > > despite significant execution overlap, the scheduler treats
> > > > > them as if they were truly synchronous tasks.
> > > >
> > > > hm, why does it stay on zero?
> > >
> > > Wakeup preemption. Presuming here: heavy task wakes light
> > > task, is preempted, light task stuffs data into pipe, heavy
> > > task doesn't block, so no avg_overlap is ever computed. The
> > > heavy task uses 100% CPU.
> > >
> > > Running as SCHED_BATCH (virgin source), it becomes sane.
> >
> > ah.
> >
> > I'd argue then that time spent on the rq preempted _should_
> > count in avg_overlap statistics. I.e. couldnt we do something
> > like ... your patch? :)
> >
> > > > if (sleep && p->se.last_wakeup) {
> > > > update_avg(&p->se.avg_overlap,
> > > > p->se.sum_exec_runtime - p->se.last_wakeup);
> > > > p->se.last_wakeup = 0;
> > > > - }
> > > > + } else if (p->se.avg_overlap < limit && runtime >= limit)
> > > > + update_avg(&p->se.avg_overlap, runtime);
> >
> > Just done unconditionally, i.e. something like:
> >
> > if (sleep) {
> > runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
> > p->se.last_wakeup = 0;
> > } else {
> > runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> > }
> >
> > update_avg(&p->se.avg_overlap, runtime);
> >
> > ?
>
> OK, I've not seen any problem indications yet, so find patchlet below.
>
> However! Balazs has stated that this problem is _not_ present in .git,
> and that..
>
> commit 38736f475071b80b66be28af7b44c854073699cc
> Author: Gautham R Shenoy <[email protected]>
> Date: Sat Sep 6 14:50:23 2008 +0530
>
> ..is what fixed it. Willy Tarreau verified this as being the case on
> his HW as well. It is present in .git with my HW.
>
> I see it as a problem, but it's your call. Dunno if I'd apply it or
> hold back, given these conflicting reports.

I think we still want it - as the purpose of the overlap metric
is to measure reality. If preemption causes overlap in execution
we should not ignore that.

The fact that your hw triggers it currently is enough of a
justification. Gautham's change to load-balancing might have
shifted the preemption and migration characteristics on his box
just enough to not trigger this - but it does not 'fix' the
problem per se.

Peter, what do you think?

Ingo

2009-03-09 10:16:48

by David Newall

Subject: Re: [patch] Re: scheduler oddity [bug?]

Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>


...
>> OK, I've not seen any problem indications yet, so find patchlet below.
>>
>> However! Balazs has stated that this problem is _not_ present in .git,
>> and that..
>>
>> commit 38736f475071b80b66be28af7b44c854073699cc
>> Author: Gautham R Shenoy <[email protected]>
>> Date: Sat Sep 6 14:50:23 2008 +0530
>>
>> ..is what fixed it. Willy Tarreau verified this as being the case on
>> his HW as well. It is present in .git with my HW.
>>
>> I see it as a problem, but it's your call. Dunno if I'd apply it or
>> hold back, given these conflicting reports.
>>
>
> I think we still want it - as the purpose of the overlap metric
> is to measure reality. If preemption causes overlap in execution
> we should not ignore that.
>

I'm sure it's wrong. The only call to dequeue with a non-zero sleep
value is in deactivate_task; all the rest have zero sleep. The section
of code identified by Mike in his patchlet should be moved for the sake
of clarity. It also highlights the symmetry between enqueue_task and
dequeue_task:

--- sched.c 2009-02-21 09:09:34.000000000 +1030
+++ sched.c.dn 2009-03-09 20:13:51.000000000 +1030
@@ -1647,12 +1647,6 @@

static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
{
- if (sleep && p->se.last_wakeup) {
- update_avg(&p->se.avg_overlap,
- p->se.sum_exec_runtime - p->se.last_wakeup);
- p->se.last_wakeup = 0;
- }
-
sched_info_dequeued(p);
p->sched_class->dequeue_task(rq, p, sleep);
p->se.on_rq = 0;
@@ -1724,6 +1718,12 @@
if (task_contributes_to_load(p))
rq->nr_uninterruptible++;

+ if (sleep && p->se.last_wakeup) {
+ update_avg(&p->se.avg_overlap,
+ p->se.sum_exec_runtime - p->se.last_wakeup);
+ p->se.last_wakeup = 0;
+ }
+
dequeue_task(rq, p, sleep);
dec_nr_running(rq);
}



Having done that, it makes sense to entirely remove dequeue_task's
sleep parameter, and replicate all three lines in deactivate_task:

--- sched.c.dn 2009-03-09 20:41:13.000000000 +1030
+++ sched.c.dn2 2009-03-09 20:41:30.000000000 +1030
@@ -1645,10 +1645,10 @@
p->se.on_rq = 1;
}

-static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
+static void dequeue_task(struct rq *rq, struct task_struct *p)
{
sched_info_dequeued(p);
- p->sched_class->dequeue_task(rq, p, sleep);
+ p->sched_class->dequeue_task(rq, p, 0);
p->se.on_rq = 0;
}

@@ -1724,7 +1724,11 @@
p->se.last_wakeup = 0;
}

- dequeue_task(rq, p, sleep);
+ /*dequeue_task(rq, p, sleep);*/
+ sched_info_dequeued(p);
+ p->sched_class->dequeue_task(rq, p, sleep);
+ p->se.on_rq = 0;
+
dec_nr_running(rq);
}

@@ -4848,7 +4852,7 @@
on_rq = p->se.on_rq;
running = task_current(rq, p);
if (on_rq)
- dequeue_task(rq, p, 0);
+ dequeue_task(rq, p);
if (running)
p->sched_class->put_prev_task(rq, p);

@@ -4897,7 +4901,7 @@
}
on_rq = p->se.on_rq;
if (on_rq)
- dequeue_task(rq, p, 0);
+ dequeue_task(rq, p);

p->static_prio = NICE_TO_PRIO(nice);
set_load_weight(p);
@@ -8637,7 +8641,7 @@
on_rq = tsk->se.on_rq;

if (on_rq)
- dequeue_task(rq, tsk, 0);
+ dequeue_task(rq, tsk);
if (unlikely(running))
tsk->sched_class->put_prev_task(rq, tsk);


2009-03-09 11:04:56

by Peter Zijlstra

Subject: Re: [patch] Re: scheduler oddity [bug?]

On Mon, 2009-03-09 at 09:07 +0100, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:

> > I see it as a problem, but it's your call. Dunno if I'd apply it or
> > hold back, given these conflicting reports.
>
> I think we still want it - as the purpose of the overlap metric
> is to measure reality. If preemption causes overlap in execution
> we should not ignore that.
>
> The fact that your hw triggers it currently is enough of a
> justification. Gautham's change to load-balancing might have
> shifted the preemption and migration characteristics on his box
> just enough to not trigger this - but it does not 'fix' the
> problem per se.
>
> Peter, what do you think?

Mostly confusion... trying to reverse engineer wth the patch does, and
why, as the changelog is somewhat silent on the issue, nor are there
comments added to clarify things.

Having something of a cold doesn't really help either..

OK, so staring at this:

---
diff --git a/kernel/sched.c b/kernel/sched.c
index 8e2558c..c670050 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1712,12 +1712,17 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)

static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
{
+ u64 runtime;
+
if (sleep && p->se.last_wakeup) {
- update_avg(&p->se.avg_overlap,
- p->se.sum_exec_runtime - p->se.last_wakeup);
+ runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
p->se.last_wakeup = 0;
+ } else {
+ runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
}

+ update_avg(&p->se.avg_overlap, runtime);
+
sched_info_dequeued(p);
p->sched_class->dequeue_task(rq, p, sleep);
p->se.on_rq = 0;
---

The idea of avg_overlap is to measure the time between waking someone
and going to sleep yourself. If this overlap time is short for both
tasks, we infer a mutual relation and try to keep these tasks on the same
cpu.
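
For reference, the consumer of this metric is the affine-wakeup path:
currently the sync hint gets rejected when either side's avg_overlap
exceeds sysctl_sched_migration_cost. Roughly (a sketch of that test,
not the exact source):

/*
 * Sketch: a large recorded wake-to-sleep overlap on either the waker
 * or the wakee means the pair does real work concurrently, so don't
 * treat the wakeup as synchronous (don't pull the wakee onto this CPU).
 */
if (sync && (curr->se.avg_overlap > sysctl_sched_migration_cost ||
	     p->se.avg_overlap > sysctl_sched_migration_cost))
	sync = 0;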

The above patch changes this definition by adding the full run-time on
!sleep dequeues.

We reset prev_sum_exec_runtime in set_next_entity(), iow every time we
start running a task.

Now !sleep dequeues happen mostly with preemption, but also with things
like migration, nice, etc..

Take migration: that would simply add the last full runtime again, even
though it hasn't run -- that seems most odd.

OK, talked a bit with Ingo; the reason you're doing this is that
avg_overlap can easily grow stale.. I can see that happening, indeed.

So the 'perfect' thing would be a task-runtime decay; barring that, the
preemption thing seems a sane enough heart-beat of a task.

How does the below look to you?

---
kernel/sched.c | 15 ++++++++++++++-
1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 4414926..ec7ffdc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4692,6 +4692,19 @@ static inline void schedule_debug(struct task_struct *prev)
#endif
}

+static void put_prev_task(struct rq *rq, struct task_struct *prev)
+{
+ if (prev->state == TASK_RUNNING) {
+ /*
+ * In order to avoid avg_overlap growing stale when we are
+ * indeed overlapping and hence not getting put to sleep, grow
+ * the avg_overlap on preemption.
+ */
+ update_avg(&prev->se.avg_overlap, sysctl_sched_migration_cost);
+ }
+ prev->sched_class->put_prev_task(rq, prev);
+}
+
/*
* Pick up the highest-prio task:
*/
@@ -4768,7 +4781,7 @@ need_resched_nonpreemptible:
if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);

- prev->sched_class->put_prev_task(rq, prev);
+ put_prev_task(rq, prev);
next = pick_next_task(rq);

if (likely(prev != next)) {

2009-03-09 11:19:40

by David Newall

[permalink] [raw]
Subject: Re: scheduler oddity [bug?]

Balazs Scheidler wrote:
> Some more test results:
>
> Latest tree from Linus seems to work, at least the program runs on both
> cores as it should. I bisected the patch that changed behaviour, and
> I've found this:
>
> commit 38736f475071b80b66be28af7b44c854073699cc
> Author: Gautham R Shenoy <[email protected]>
> Date: Sat Sep 6 14:50:23 2008 +0530
>
> sched: fix __load_balance_iterator() for cfq with only one task
>
> The __load_balance_iterator() returns a NULL when there's only one
> sched_entity which is a task. It is caused by the following code-path.
>
> /* Skip over entities that are not tasks */
> do {
> se = list_entry(next, struct sched_entity, group_node);
> next = next->next;
> } while (next != &cfs_rq->tasks && !entity_is_task(se));
>
> if (next == &cfs_rq->tasks)
> return NULL;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> This will return NULL even when se is a task.
>
> As a side-effect, there was a regression in sched_mc behavior since 2.6.25,
> since iter_move_one_task() when it calls load_balance_start_fair(),
> would not get any tasks to move!
>
> Fix this by checking if the last entity was a task or not.
>
> Signed-off-by: Gautham R Shenoy <[email protected]>
> Acked-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>

Woops! That fails when the task is the last entry on the list. This
fixes that:

--- sched_fair.c 2009-02-21 09:09:34.000000000 +1030
+++ sched_fair.c.dn 2009-03-09 20:48:36.000000000 +1030
@@ -1440,7 +1440,7 @@
__load_balance_iterator(struct cfs_rq *cfs_rq, struct list_head *next)
{
struct task_struct *p = NULL;
- struct sched_entity *se;
+ struct sched_entity *se = NULL;

if (next == &cfs_rq->tasks)
return NULL;
@@ -1451,7 +1451,7 @@
next = next->next;
} while (next != &cfs_rq->tasks && !entity_is_task(se));

- if (next == &cfs_rq->tasks)
+ if (se == NULL || !entity_is_task(se))
return NULL;

cfs_rq->balance_iterator = next;


Really, though, the function could stand a spring-cleaning, for example
either of the following, depending on how much you hate returning from
within a loop:

static struct task_struct *
__load_balance_iterator(struct cfs_rq *cfs_rq, struct list_head *next)
{
do {
struct sched_entity *se = list_entry(next, struct sched_entity, group_node);
next = next->next;
if (entity_is_task(se))
{
cfs_rq->balance_iterator = next;
return task_of(se);
}
} while (next != &cfs_rq->tasks);
return NULL;
}


static struct task_struct *
__load_balance_iterator(struct cfs_rq *cfs_rq, struct list_head *next)
{
struct sched_entity *se;

for ( ; next != &cfs_rq->tasks; next = next->next)
{
se = list_entry(next, struct sched_entity, group_node);
if (entity_is_task(se))
break;
}

if (next == &cfs_rq->tasks)
return NULL;

cfs_rq->balance_iterator = next->next;
return task_of(se);
}


I wonder if it was intended to set balance_iterator to the task's list
entry instead of the following one.

2009-03-09 13:16:21

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

On Mon, 2009-03-09 at 12:04 +0100, Peter Zijlstra wrote:

> OK, talked a bit with Ingo, the reason you're doing is that avg_overlap
> can easily grow stale.. I can see that happen indeed.
>
> So the 'perfect' thing would be a task-runtime decay, barring that the
> preemption thing seems a sane enough hart-beat of a task.
>
> How does the below look to you?

Other than the fact that the test for sync reject is currently
avg_overlap > sysctl_sched_migration_cost, looks fine to me. Having it
capped at the boundary is probably the better way to go.

-Mike

2009-03-09 13:29:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

On Mon, 2009-03-09 at 14:16 +0100, Mike Galbraith wrote:
> On Mon, 2009-03-09 at 12:04 +0100, Peter Zijlstra wrote:
>
> > OK, talked a bit with Ingo, the reason you're doing is that avg_overlap
> > can easily grow stale.. I can see that happen indeed.
> >
> > So the 'perfect' thing would be a task-runtime decay, barring that the
> > preemption thing seems a sane enough hart-beat of a task.
> >
> > How does the below look to you?
>
> Other than the fact that the test for sync reject is currently
> avg_overlap > sysctl_sched_migration_cost, looks fine to me. Having it
> capped at the boundary is probably the better way to go.

Ah, yes, and looking at update_avg() we'll also discard the lower 3
bits, so we'll never actually reach the boundary.

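A tiny userspace toy makes that visible (the constants below are
invented; only the averaging rule mirrors update_avg()):

#include <stdio.h>
#include <stdint.h>

/* Same rule as update_avg(): move 1/8th of the way toward each new
 * sample, with the >>3 truncating the step downwards. */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);
	*avg += diff >> 3;
}

int main(void)
{
	uint64_t avg = 0;
	const uint64_t cost = 500000; /* stand-in for sysctl_sched_migration_cost, ns */
	int i;

	for (i = 0; i < 1000; i++)
		update_avg(&avg, cost);
	/* stalls a few ns short of cost, so an "avg > cost" test never fires */
	printf("feeding cost:   avg = %llu\n", (unsigned long long)avg);

	avg = 0;
	for (i = 0; i < 1000; i++)
		update_avg(&avg, 2 * cost);
	/* feeding 2*cost pushes the average well past cost */
	printf("feeding 2*cost: avg = %llu\n", (unsigned long long)avg);
	return 0;
}
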
So I guess it should read something like:

update_avg(&prev->se.avg_overlap, 2*sysctl_sched_migration_cost);

or somesuch.

Does it actually solve the reported problem? I've only thought about the
issue so far :-)

2009-03-09 13:38:18

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

On Mon, 2009-03-09 at 14:16 +0100, Mike Galbraith wrote:
> On Mon, 2009-03-09 at 12:04 +0100, Peter Zijlstra wrote:
>
> > OK, talked a bit with Ingo, the reason you're doing is that avg_overlap
> > can easily grow stale.. I can see that happen indeed.
> >
> > So the 'perfect' thing would be a task-runtime decay, barring that the
> > preemption thing seems a sane enough hart-beat of a task.
> >
> > How does the below look to you?
>
> Other than the fact that the test for sync reject is currently
> avg_overlap > sysctl_sched_migration_cost, looks fine to me. Having it
> capped at the boundary is probably the better way to go.

Heh, doesn't _quite_ work though. The little bugger now hovers just
under :-/

pipetest (5976, #threads: 1)
---------------------------------------------------------
se.exec_start : 150672.502691
se.vruntime : 94882.186606
se.sum_exec_runtime : 34875.797932
se.avg_overlap : 0.499993
nr_switches : 3680
nr_voluntary_switches : 0
nr_involuntary_switches : 3680
se.load.weight : 1024
policy : 0
prio : 120
clock-delta : 112

pipetest (5977, #threads: 1)
---------------------------------------------------------
se.exec_start : 150665.016951
se.vruntime : 94817.157909
se.sum_exec_runtime : 7069.323019
se.avg_overlap : 0.012718
nr_switches : 2931
nr_voluntary_switches : 2930
nr_involuntary_switches : 1
se.load.weight : 1024
policy : 0
prio : 120
clock-delta : 117

2009-03-09 13:46:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

On Mon, 2009-03-09 at 14:37 +0100, Mike Galbraith wrote:
> On Mon, 2009-03-09 at 14:16 +0100, Mike Galbraith wrote:
> > On Mon, 2009-03-09 at 12:04 +0100, Peter Zijlstra wrote:
> >
> > > OK, talked a bit with Ingo, the reason you're doing is that avg_overlap
> > > can easily grow stale.. I can see that happen indeed.
> > >
> > > So the 'perfect' thing would be a task-runtime decay, barring that the
> > > preemption thing seems a sane enough hart-beat of a task.
> > >
> > > How does the below look to you?
> >
> > Other than the fact that the test for sync reject is currently
> > avg_overlap > sysctl_sched_migration_cost, looks fine to me. Having it
> > capped at the boundary is probably the better way to go.
>
> Heh, doesn't _quite_ work though. The little bugger now hovers just
> under :-/

> se.avg_overlap : 0.499993

Right, update_avg()'s >>3 and the off-by-one you spotted.

I reckon stuff works better with a 2* added? After that I guess it's
praying sysbench still works.. :-)

2009-03-09 13:51:30

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

On Mon, 2009-03-09 at 14:27 +0100, Peter Zijlstra wrote:
> On Mon, 2009-03-09 at 14:16 +0100, Mike Galbraith wrote:
> > On Mon, 2009-03-09 at 12:04 +0100, Peter Zijlstra wrote:
> >
> > > OK, talked a bit with Ingo, the reason you're doing is that avg_overlap
> > > can easily grow stale.. I can see that happen indeed.
> > >
> > > So the 'perfect' thing would be a task-runtime decay, barring that the
> > > preemption thing seems a sane enough hart-beat of a task.
> > >
> > > How does the below look to you?
> >
> > Other than the fact that the test for sync reject is currently
> > avg_overlap > sysctl_sched_migration_cost, looks fine to me. Having it
> > capped at the boundary is probably the better way to go.
>
> Ah, yes, and looking at update_avg() we'll also discard the lower 3
> bits, so we'll never actually reach.
>
> So I guess it should read something like:
>
> update_avg(&prev->se.avg_overlap, 2*sysctl_sched_migration_cost);
>
> or somesuch.
>
> Does it actually solve the reported problem? I've only thought about the
> issue so far :-)

5977 root 20 0 3672 440 352 R 100 0.0 0:28.53 2 pipetest
5978 root 20 0 3668 180 96 S 29 0.0 0:08.27 0 pipetest

Yup, works for me. Ship it :)

-Mike

2009-03-09 13:58:19

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

On Mon, 2009-03-09 at 14:46 +0100, Peter Zijlstra wrote:
> On Mon, 2009-03-09 at 14:37 +0100, Mike Galbraith wrote:
> > On Mon, 2009-03-09 at 14:16 +0100, Mike Galbraith wrote:
> > > On Mon, 2009-03-09 at 12:04 +0100, Peter Zijlstra wrote:
> > >
> > > > OK, talked a bit with Ingo, the reason you're doing is that avg_overlap
> > > > can easily grow stale.. I can see that happen indeed.
> > > >
> > > > So the 'perfect' thing would be a task-runtime decay, barring that the
> > > > preemption thing seems a sane enough hart-beat of a task.
> > > >
> > > > How does the below look to you?
> > >
> > > Other than the fact that the test for sync reject is currently
> > > avg_overlap > sysctl_sched_migration_cost, looks fine to me. Having it
> > > capped at the boundary is probably the better way to go.
> >
> > Heh, doesn't _quite_ work though. The little bugger now hovers just
> > under :-/
>
> > se.avg_overlap : 0.499993
>
> Right, update_avg()'s >>3 and the off-by-one you spotted.

(that was with the off-by-one fixed, but..)

> I recon stuff works better with a 2* added? After that I guess its
> praying sysbench still works.. :-)

Yes 2* worked fine. Mysql+oltp was my worry spot, being a very affinity
sensitive little <bleep>, but my patchlet didn't cause any trouble, so
this one shouldn't either. I'll do some re-test in any case, and squeak
should anything turn up.

-Mike

2009-03-09 14:01:32

by David Newall

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

Peter Zijlstra wrote:
> Does it actually solve the reported problem? I've only thought about the
> issue so far

Wasn't it reported to be caused by an off-by-one bug in commit
38736f475071b80b66be28af7b44c854073699cc?

2009-03-09 14:11:46

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

On Mon, 2009-03-09 at 14:58 +0100, Mike Galbraith wrote:
> On Mon, 2009-03-09 at 14:46 +0100, Peter Zijlstra wrote:
> > On Mon, 2009-03-09 at 14:37 +0100, Mike Galbraith wrote:
> > > On Mon, 2009-03-09 at 14:16 +0100, Mike Galbraith wrote:
> > > > On Mon, 2009-03-09 at 12:04 +0100, Peter Zijlstra wrote:
> > > >
> > > > > OK, talked a bit with Ingo, the reason you're doing is that avg_overlap
> > > > > can easily grow stale.. I can see that happen indeed.
> > > > >
> > > > > So the 'perfect' thing would be a task-runtime decay, barring that the
> > > > > preemption thing seems a sane enough hart-beat of a task.
> > > > >
> > > > > How does the below look to you?
> > > >
> > > > Other than the fact that the test for sync reject is currently
> > > > avg_overlap > sysctl_sched_migration_cost, looks fine to me. Having it
> > > > capped at the boundary is probably the better way to go.
> > >
> > > Heh, doesn't _quite_ work though. The little bugger now hovers just
> > > under :-/
> >
> > > se.avg_overlap : 0.499993
> >
> > Right, update_avg()'s >>3 and the off-by-one you spotted.
>
> (that was with the off-by-one fixed, but..)
>
> > I recon stuff works better with a 2* added? After that I guess its
> > praying sysbench still works.. :-)
>
> Yes 2* worked fine. Mysql+oltp was my worry spot, being a very affinity
> sensitive little <bleep>, but my patchlet didn't cause any trouble, so
> this one shouldn't either. I'll do some re-test in any case, and squeak
> should anything turn up.

Squeak! Didn't even get to mysql+oltp.

marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15888,12384 -s 32768 -S 32768 -m 4096
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec

65536 4096 60.00 5161103 0 2818.65
65536 60.00 5149666 2812.40

6188 root 20 0 1040 544 324 R 100 0.0 0:31.49 0 netperf
6189 root 20 0 1044 260 164 R 48 0.0 0:15.35 3 netserver

Hurt, pain, ouch, vs...

marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -T 0,0 -- -P 15888,12384 -s 32768 -S 32768 -m 4096
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET : cpu bind
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec

65536 4096 60.00 8452028 0 4615.93
65536 60.00 8442945 4610.97

Drat.

-Mike

2009-03-09 14:20:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

On Tue, 2009-03-10 at 00:30 +1030, David Newall wrote:
> Peter Zijlstra wrote:
> > Does it actually solve the reported problem? I've only thought about the
> > issue so far
>
> Wasn't it reported to be caused by an off-by-one bug in commit
> 38736f475071b80b66be28af7b44c854073699cc?

Completely different issue. Unfortunate interference of effects though.

2009-03-09 14:42:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

On Mon, 2009-03-09 at 15:11 +0100, Mike Galbraith wrote:

> > Yes 2* worked fine. Mysql+oltp was my worry spot, being a very affinity
> > sensitive little <bleep>, but my patchlet didn't cause any trouble, so
> > this one shouldn't either. I'll do some re-test in any case, and squeak
> > should anything turn up.
>
> Squeak! Didn't even get to mysql+oltp.
>
> marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15888,12384 -s 32768 -S 32768 -m 4096
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET
> Socket Message Elapsed Messages
> Size Size Time Okay Errors Throughput
> bytes bytes secs # # 10^6bits/sec
>
> 65536 4096 60.00 5161103 0 2818.65
> 65536 60.00 5149666 2812.40
>
> 6188 root 20 0 1040 544 324 R 100 0.0 0:31.49 0 netperf
> 6189 root 20 0 1044 260 164 R 48 0.0 0:15.35 3 netserver
>
> Hurt, pain, ouch, vs...
>
> marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -T 0,0 -- -P 15888,12384 -s 32768 -S 32768 -m 4096
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET : cpu bind
> Socket Message Elapsed Messages
> Size Size Time Okay Errors Throughput
> bytes bytes secs # # 10^6bits/sec
>
> 65536 4096 60.00 8452028 0 4615.93
> 65536 60.00 8442945 4610.97
>
> Drat.

Bugger, so back to the drawing board it is...

2009-03-09 15:31:18

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

On Mon, 2009-03-09 at 15:41 +0100, Peter Zijlstra wrote:
> On Mon, 2009-03-09 at 15:11 +0100, Mike Galbraith wrote:
>
> > > Yes 2* worked fine. Mysql+oltp was my worry spot, being a very affinity
> > > sensitive little <bleep>, but my patchlet didn't cause any trouble, so
> > > this one shouldn't either. I'll do some re-test in any case, and squeak
> > > should anything turn up.
> >
> > Squeak! Didn't even get to mysql+oltp.
> >
> > marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15888,12384 -s 32768 -S 32768 -m 4096
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET
> > Socket Message Elapsed Messages
> > Size Size Time Okay Errors Throughput
> > bytes bytes secs # # 10^6bits/sec
> >
> > 65536 4096 60.00 5161103 0 2818.65
> > 65536 60.00 5149666 2812.40
> >
> > 6188 root 20 0 1040 544 324 R 100 0.0 0:31.49 0 netperf
> > 6189 root 20 0 1044 260 164 R 48 0.0 0:15.35 3 netserver
> >
> > Hurt, pain, ouch, vs...
> >
> > marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -T 0,0 -- -P 15888,12384 -s 32768 -S 32768 -m 4096
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET : cpu bind
> > Socket Message Elapsed Messages
> > Size Size Time Okay Errors Throughput
> > bytes bytes secs # # 10^6bits/sec
> >
> > 65536 4096 60.00 8452028 0 4615.93
> > 65536 60.00 8442945 4610.97
> >
> > Drat.
>
> Bugger, so back to the drawing board it is...

Hm.

CPU-utilization-wise, this test is similar to pipetest. The major
difference is chunk size. Netperf is waking and being preempted (if on
the same CPU) at a very high rate, so the hog component gets CPU in tiny
chunks, vs hefty chunks for pipetest.

Simply doing the below (will look very familiar) made both netperf and
pipetest happy again, because of that preemption rate. Both start life
wanting to be affine, and due to the switch rate, pipetest becomes
non-affine, but netperf remains affine.
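
Back-of-the-envelope with the patch's cap (the chunk sizes here are
invented, only the 2*sysctl_sched_migration_cost cap and the 1/8th
averaging come from the patch): if the hog runs for, say, 10ms between
preemptions, every preemption feeds the capped 1.0ms sample into
avg_overlap, and avg += (sample - avg)/8 climbs past the 0.5ms
sync-reject boundary within about half a dozen preemptions, so pipetest
goes non-affine. If netperf is instead preempted every few tens of
microseconds, the samples stay that small, avg_overlap settles near
them, and the affine hint survives.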

Maybe we should factor in wakeup rate, and whether we're waking many vs
one. Wakeup is tied to data, so there is correlation to potential
cache-miss pain, no?

There is also evidence that your patch did in fact make the right
decision, but that we really REALLY should try to punt to a CPU that
shares a cache if available. Check out the numbers when the netperf
test runs on two CPUs that share cache.

marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -T 0,1 -- -P 15888,12384 -s 32768 -S 32768 -m 4096
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET : cpu bind
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec

65536 4096 60.00 15325632 0 8369.84
65536 60.00 15321176 8367.40

(You can skip the below, nothing new there. Just for completeness;)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8e2558c..0f67b2a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4508,6 +4508,24 @@ static inline void schedule_debug(struct task_struct *prev)
#endif
}

+static void put_prev_task(struct rq *rq, struct task_struct *prev)
+{
+ if (prev->state == TASK_RUNNING) {
+ u64 runtime = prev->se.sum_exec_runtime;
+
+ runtime -= prev->se.prev_sum_exec_runtime;
+ runtime = min_t(u64, runtime, 2*sysctl_sched_migration_cost);
+
+ /*
+ * In order to avoid avg_overlap growing stale when we are
+ * indeed overlapping and hence not getting put to sleep, grow
+ * the avg_overlap on preemption.
+ */
+ update_avg(&prev->se.avg_overlap, runtime);
+ }
+ prev->sched_class->put_prev_task(rq, prev);
+}
+
/*
* Pick up the highest-prio task:
*/
@@ -4586,7 +4604,7 @@ need_resched_nonpreemptible:
if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);

- prev->sched_class->put_prev_task(rq, prev);
+ put_prev_task(rq, prev);
next = pick_next_task(rq, prev);

if (likely(prev != next)) {

2009-03-09 15:57:40

by Balazs Scheidler

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

Hi,

Just an interesting sidenote:

I've ported the quoted patch and
38736f475071b80b66be28af7b44c854073699cc (the one I've found via
bisect) to 2.6.27, but these didn't resolve my scheduling problem: both
my test program and my application still use only one CPU. So probably
the rest of the scheduling patches between 2.6.27..2.6.28 have some
effect too.

On Mon, 2009-03-09 at 09:02 +0100, Mike Galbraith wrote:
> On Sun, 2009-03-08 at 18:52 +0100, Ingo Molnar wrote:
> > * Mike Galbraith <[email protected]> wrote:
> >
> > > On Sun, 2009-03-08 at 16:39 +0100, Ingo Molnar wrote:
> > > > * Mike Galbraith <[email protected]> wrote:
> > > >
> > > > > The problem with your particular testcase is that while one
> > > > > half has an avg_overlap (what we use as affinity hint for
> > > > > synchronous wakeups) which triggers the affinity hint, the
> > > > > other half has avg_overlap of zero, what it was born with, so
> > > > > despite significant execution overlap, the scheduler treats
> > > > > them as if they were truly synchronous tasks.
> > > >
> > > > hm, why does it stay on zero?
> > >
> > > Wakeup preemption. Presuming here: heavy task wakes light
> > > task, is preempted, light task stuffs data into pipe, heavy
> > > task doesn't block, so no avg_overlap is ever computed. The
> > > heavy task uses 100% CPU.
> > >
> > > Running as SCHED_BATCH (virgin source), it becomes sane.
> >
> > ah.
> >
> > I'd argue then that time spent on the rq preempted _should_
> > count in avg_overlap statistics. I.e. couldnt we do something
> > like ... your patch? :)
> >
> > > > if (sleep && p->se.last_wakeup) {
> > > > update_avg(&p->se.avg_overlap,
> > > > p->se.sum_exec_runtime - p->se.last_wakeup);
> > > > p->se.last_wakeup = 0;
> > > > - }
> > > > + } else if (p->se.avg_overlap < limit && runtime >= limit)
> > > > + update_avg(&p->se.avg_overlap, runtime);
> >
> > Just done unconditionally, i.e. something like:
> >
> > if (sleep) {
> > runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
> > p->se.last_wakeup = 0;
> > } else {
> > runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> > }
> >
> > update_avg(&p->se.avg_overlap, runtime);
> >
> > ?
>
> OK, I've not seen any problem indications yet, so find patchlet below.
>
> However! Balazs has stated that this problem is _not_ present in .git,
> and that..
>
> commit 38736f475071b80b66be28af7b44c854073699cc
> Author: Gautham R Shenoy <[email protected]>
> Date: Sat Sep 6 14:50:23 2008 +0530
>
> ..is what fixed it. Willy Tarreau verified this as being the case on
> his HW as well. It is present in .git with my HW.
>
> I see it as a problem, but it's your call. Dunno if I'd apply it or
> hold back, given these conflicting reports.
>
> Anyway...
>
> Given a task pair communicating via pipe, if one partner fills/drains such
> that the other does not block for extended periods, avg_overlap can be long
> stale, and trigger affine wakeups despite heavy CPU demand. This can, and
> does lead to throughput loss in the testcase posted by the reporter.
>
> Fix this by unconditionally updating avg_overlap at dequeue time instead
> of only updating when a task sleeps.
>
> See http://lkml.org/lkml/2009/3/7/79 for details/testcase.
>
> Reported-by: Balazs Scheidler <[email protected]>
> Signed-off-by: Mike Galbraith <[email protected]>
>
> kernel/sched.c | 9 +++++++--
> 1 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 8e2558c..c670050 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1712,12 +1712,17 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
>
> static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
> {
> + u64 runtime;
> +
> if (sleep && p->se.last_wakeup) {
> - update_avg(&p->se.avg_overlap,
> - p->se.sum_exec_runtime - p->se.last_wakeup);
> + runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
> p->se.last_wakeup = 0;
> + } else {
> + runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> }
>
> + update_avg(&p->se.avg_overlap, runtime);
> +
> sched_info_dequeued(p);
> p->sched_class->dequeue_task(rq, p, sleep);
> p->se.on_rq = 0;
>
>
>
>
--
Bazsi

2009-03-09 16:13:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

On Mon, 2009-03-09 at 16:30 +0100, Mike Galbraith wrote:
> +static void put_prev_task(struct rq *rq, struct task_struct *prev)
> +{
> + if (prev->state == TASK_RUNNING) {
> + u64 runtime = prev->se.sum_exec_runtime;
> +
> + runtime -= prev->se.prev_sum_exec_runtime;
> + runtime = min_t(u64, runtime, 2*sysctl_sched_migration_cost);
> +
> + /*
> + * In order to avoid avg_overlap growing stale when we are
> + * indeed overlapping and hence not getting put to sleep, grow
> + * the avg_overlap on preemption.
> + */
> + update_avg(&prev->se.avg_overlap, runtime);
> + }
> + prev->sched_class->put_prev_task(rq, prev);
> +}

Right, so we both found it worked quite well; I'm still slightly puzzled
by it.

If something gets preempted a lot and will therefore have short runtimes,
it will be seen as sync even though it might not be at all.

Then again, if it's preempted that much, it won't be likely to obtain a
large cache footprint either...

hohumm

2009-03-09 17:28:30

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

On Mon, 2009-03-09 at 17:12 +0100, Peter Zijlstra wrote:
> On Mon, 2009-03-09 at 16:30 +0100, Mike Galbraith wrote:
> > +static void put_prev_task(struct rq *rq, struct task_struct *prev)
> > +{
> > + if (prev->state == TASK_RUNNING) {
> > + u64 runtime = prev->se.sum_exec_runtime;
> > +
> > + runtime -= prev->se.prev_sum_exec_runtime;
> > + runtime = min_t(u64, runtime, 2*sysctl_sched_migration_cost);
> > +
> > + /*
> > + * In order to avoid avg_overlap growing stale when we are
> > + * indeed overlapping and hence not getting put to sleep, grow
> > + * the avg_overlap on preemption.
> > + */
> > + update_avg(&prev->se.avg_overlap, runtime);
> > + }
> > + prev->sched_class->put_prev_task(rq, prev);
> > +}
>
> Right, so we both found it worked quite well, I'm still slightly puzzled
> but it.
>
> If something gets preempted a lot and will therefore have short runtimes
> it will be seen as sync even though it might not at all be.

Yes, and the numbers for netperf on 2 CPUs with shared cache show that's
happening. It just so happens that in the non-shared case, netperf's
cache pain far outweighs the benefit of having more CPU available :-/

-Mike

2009-03-10 00:20:45

by David Newall

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

Peter Zijlstra wrote:
> On Tue, 2009-03-10 at 00:30 +1030, David Newall wrote:
>
>> Wasn't it reported to be caused by an off-by-one bug in commit
>> 38736f475071b80b66be28af7b44c854073699cc?
>>
>
> Completely different issue. Unfortunate interference of effects though.


And yet, there *is* a bug in that commit: as I explained, it won't
return a task if the task is the last entry in the list, which seems
quite likely immediately after a fork. Has my patch been tried?

2009-03-10 03:17:21

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

On Mon, 2009-03-09 at 16:57 +0100, Balazs Scheidler wrote:
> Hi,
>
> Just an interesting sidenote:
>
> I've ported the quoted patch and
> 38736f475071b80b66be28af7b44c854073699cc (the one I've found via
> bisect) to 2.6.27 but these didn't resolve my scheduling problem, both
> my test program and my application still uses only one CPU. So probably
> the rest of the scheduling patches between 2.6.27..2.6.28 have some
> effect too.

It's probably the changes to wake_affine(): 2.6.27 doesn't have the
tests that disable sync wakeups when avg_overlap is too large.

-Mike

2009-03-15 13:54:06

by Balazs Scheidler

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

On Mon, 2009-03-09 at 18:28 +0100, Mike Galbraith wrote:
> On Mon, 2009-03-09 at 17:12 +0100, Peter Zijlstra wrote:
> > On Mon, 2009-03-09 at 16:30 +0100, Mike Galbraith wrote:
> > > +static void put_prev_task(struct rq *rq, struct task_struct *prev)
> > > +{
> > > + if (prev->state == TASK_RUNNING) {
> > > + u64 runtime = prev->se.sum_exec_runtime;
> > > +
> > > + runtime -= prev->se.prev_sum_exec_runtime;
> > > + runtime = min_t(u64, runtime, 2*sysctl_sched_migration_cost);
> > > +
> > > + /*
> > > + * In order to avoid avg_overlap growing stale when we are
> > > + * indeed overlapping and hence not getting put to sleep, grow
> > > + * the avg_overlap on preemption.
> > > + */
> > > + update_avg(&prev->se.avg_overlap, runtime);
> > > + }
> > > + prev->sched_class->put_prev_task(rq, prev);
> > > +}
> >
> > Right, so we both found it worked quite well, I'm still slightly puzzled
> > but it.
> >
> > If something gets preempted a lot and will therefore have short runtimes
> > it will be seen as sync even though it might not at all be.
>
> Yes, and the netperf on 2 CPUs with shared cache numbers show that's
> happening. It just so happens that in the non-shared case, netperf's
> cache pain far outweighs the benefit of having more CPU available :-/

Any news on this? I haven't seen a patch that was actually integrated,
or did I just miss something?

--
Bazsi

2009-03-15 17:17:12

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

On Sun, 2009-03-15 at 14:53 +0100, Balazs Scheidler wrote:
> On Mon, 2009-03-09 at 18:28 +0100, Mike Galbraith wrote:
> > On Mon, 2009-03-09 at 17:12 +0100, Peter Zijlstra wrote:
> > > On Mon, 2009-03-09 at 16:30 +0100, Mike Galbraith wrote:
> > > > +static void put_prev_task(struct rq *rq, struct task_struct *prev)
> > > > +{
> > > > + if (prev->state == TASK_RUNNING) {
> > > > + u64 runtime = prev->se.sum_exec_runtime;
> > > > +
> > > > + runtime -= prev->se.prev_sum_exec_runtime;
> > > > + runtime = min_t(u64, runtime, 2*sysctl_sched_migration_cost);
> > > > +
> > > > + /*
> > > > + * In order to avoid avg_overlap growing stale when we are
> > > > + * indeed overlapping and hence not getting put to sleep, grow
> > > > + * the avg_overlap on preemption.
> > > > + */
> > > > + update_avg(&prev->se.avg_overlap, runtime);
> > > > + }
> > > > + prev->sched_class->put_prev_task(rq, prev);
> > > > +}
> > >
> > > Right, so we both found it worked quite well, I'm still slightly puzzled
> > > but it.
> > >
> > > If something gets preempted a lot and will therefore have short runtimes
> > > it will be seen as sync even though it might not at all be.
> >
> > Yes, and the netperf on 2 CPUs with shared cache numbers show that's
> > happening. It just so happens that in the non-shared case, netperf's
> > cache pain far outweighs the benefit of having more CPU available :-/
>
> Any news on this? I haven't seen a patch that was actually integrated,
> or I just missed something?

It's in tip, as df1c99d416500da8d26a4d78777467c53ee7689e.

-Mike

2009-03-15 18:57:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]


* Mike Galbraith <[email protected]> wrote:

> On Sun, 2009-03-15 at 14:53 +0100, Balazs Scheidler wrote:
> > On Mon, 2009-03-09 at 18:28 +0100, Mike Galbraith wrote:
> > > On Mon, 2009-03-09 at 17:12 +0100, Peter Zijlstra wrote:
> > > > On Mon, 2009-03-09 at 16:30 +0100, Mike Galbraith wrote:
> > > > > +static void put_prev_task(struct rq *rq, struct task_struct *prev)
> > > > > +{
> > > > > + if (prev->state == TASK_RUNNING) {
> > > > > + u64 runtime = prev->se.sum_exec_runtime;
> > > > > +
> > > > > + runtime -= prev->se.prev_sum_exec_runtime;
> > > > > + runtime = min_t(u64, runtime, 2*sysctl_sched_migration_cost);
> > > > > +
> > > > > + /*
> > > > > + * In order to avoid avg_overlap growing stale when we are
> > > > > + * indeed overlapping and hence not getting put to sleep, grow
> > > > > + * the avg_overlap on preemption.
> > > > > + */
> > > > > + update_avg(&prev->se.avg_overlap, runtime);
> > > > > + }
> > > > > + prev->sched_class->put_prev_task(rq, prev);
> > > > > +}
> > > >
> > > > Right, so we both found it worked quite well, I'm still slightly puzzled
> > > > but it.
> > > >
> > > > If something gets preempted a lot and will therefore have short runtimes
> > > > it will be seen as sync even though it might not at all be.
> > >
> > > Yes, and the netperf on 2 CPUs with shared cache numbers show that's
> > > happening. It just so happens that in the non-shared case, netperf's
> > > cache pain far outweighs the benefit of having more CPU available :-/
> >
> > Any news on this? I haven't seen a patch that was actually integrated,
> > or I just missed something?
>
> It's in tip, as df1c99d416500da8d26a4d78777467c53ee7689e.

testable via:

http://people.redhat.com/mingo/tip.git/README

Balazs, could you please try it?

Ingo

2009-03-16 11:55:41

by Balazs Scheidler

[permalink] [raw]
Subject: Re: [patch] Re: scheduler oddity [bug?]

On Sun, 2009-03-15 at 19:57 +0100, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > On Sun, 2009-03-15 at 14:53 +0100, Balazs Scheidler wrote:
> > > On Mon, 2009-03-09 at 18:28 +0100, Mike Galbraith wrote:
> > > > On Mon, 2009-03-09 at 17:12 +0100, Peter Zijlstra wrote:
> > > > > On Mon, 2009-03-09 at 16:30 +0100, Mike Galbraith wrote:
> > > > > > +static void put_prev_task(struct rq *rq, struct task_struct *prev)
> > > > > > +{
> > > > > > + if (prev->state == TASK_RUNNING) {
> > > > > > + u64 runtime = prev->se.sum_exec_runtime;
> > > > > > +
> > > > > > + runtime -= prev->se.prev_sum_exec_runtime;
> > > > > > + runtime = min_t(u64, runtime, 2*sysctl_sched_migration_cost);
> > > > > > +
> > > > > > + /*
> > > > > > + * In order to avoid avg_overlap growing stale when we are
> > > > > > + * indeed overlapping and hence not getting put to sleep, grow
> > > > > > + * the avg_overlap on preemption.
> > > > > > + */
> > > > > > + update_avg(&prev->se.avg_overlap, runtime);
> > > > > > + }
> > > > > > + prev->sched_class->put_prev_task(rq, prev);
> > > > > > +}
> > > > >
> > > > > Right, so we both found it worked quite well, I'm still slightly puzzled
> > > > > but it.
> > > > >
> > > > > If something gets preempted a lot and will therefore have short runtimes
> > > > > it will be seen as sync even though it might not at all be.
> > > >
> > > > Yes, and the netperf on 2 CPUs with shared cache numbers show that's
> > > > happening. It just so happens that in the non-shared case, netperf's
> > > > cache pain far outweighs the benefit of having more CPU available :-/
> > >
> > > Any news on this? I haven't seen a patch that was actually integrated,
> > > or I just missed something?
> >
> > It's in tip, as df1c99d416500da8d26a4d78777467c53ee7689e.
>
> testable via:
>
> http://people.redhat.com/mingo/tip.git/README
>
> Balazs, could you please try it?

Sure, I'll try, although not today as my first son was born yesterday
and I have a lot of things to do, but hopefully I can give it a spin
tomorrow.


--
Bazsi