2010-12-25 00:27:19

by Venkatesh Pallipadi

Subject: [PATCH] sched: Buggy comparison in check_preempt_tick

A preempt comparison line in check_preempt_tick has two bugs.
* It compares signed and unsigned quantities, which breaks when signed
quantity happens to be negative
* It compares runtime and vruntime, which breaks when there are niced tasks

The bug was initially found by linsched [1]. The change here fixes both
problems.

On x86-64, the signed/unsigned compare results in tasks running _longer_
than their expected time slice: a false resched_task() gets signalled after
4 ticks (on the first tick after the preceding sysctl_sched_min_granularity
check passes), and the currently running task gets picked again and runs
for another ideal_slice interval.
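
The promotion can be seen in isolation with a minimal userspace sketch
(the types are assumed to mirror s64/unsigned long on x86-64; the delta
and ideal_runtime values are taken from the trace below):

#include <stdio.h>

int main(void)
{
	long long delta = -4897059;		/* s64, negative as in the trace */
	unsigned long ideal_runtime = 11994146;

	/* The usual arithmetic conversions turn delta into a huge
	 * unsigned value here, so the test fires even though the
	 * signed delta is far below ideal_runtime. */
	if (delta > ideal_runtime)
		printf("bogus: %lld \"exceeds\" %lu\n", delta, ideal_runtime);

	/* The fixed comparison stays signed and never fires. */
	if (delta > (long long)ideal_runtime)
		printf("never printed\n");

	return 0;
}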

With 2 busy loops on a single CPU, and trace_printks inside this buggy
check (where it triggers resched_task) and in pick_next_task, the trace
shows this:

[001] 510.524336: pick_next_task_fair: loop (5939)
[001] 510.536326: pick_next_task_fair: loop (5883)
[001] 510.540319: task_tick_fair: delta -4897059, ideal_runtime 11994146
[001] 510.540321: pick_next_task_fair: loop (5883)
[001] 510.544306: task_tick_fair: delta -906540, ideal_runtime 11994146
[001] 510.544309: pick_next_task_fair: loop (5883)
[001] 510.556306: pick_next_task_fair: loop (5939)
[001] 510.560301: task_tick_fair: delta -7105824, ideal_runtime 11994146
[001] 510.560304: pick_next_task_fair: loop (5939)
[001] 510.564298: task_tick_fair: delta -3105461, ideal_runtime 11994146
[001] 510.564300: pick_next_task_fair: loop (5939)
[001] 510.576288: pick_next_task_fair: loop (5883)
[001] 510.580282: task_tick_fair: delta -4897210, ideal_runtime 11994146
[001] 510.580285: pick_next_task_fair: loop (5883)
[001] 510.584278: task_tick_fair: delta -897348, ideal_runtime 11994146
[001] 510.584281: pick_next_task_fair: loop (5883)
[001] 510.596269: pick_next_task_fair: loop (5939)

That is a 20ms slice for each task, with some redundant resched_tasks;
with the fix, ~12ms slices are expected (on a 16-CPU system).
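
For reference, ideal_runtime here comes from sched_slice(), which hands
each task a weight-proportional share of the scheduling period. A rough
sketch of the arithmetic, assuming the ~24ms period implied by the traced
ideal_runtime and two equal-weight (nice 0) hogs:

#include <stdio.h>

int main(void)
{
	/* __sched_period() result implied by the trace above (assumed) */
	unsigned long long period  = 23988292ULL;	/* ~24ms, in ns */
	unsigned long long w_task  = 1024ULL;		/* nice 0 weight */
	unsigned long long w_total = 2048ULL;		/* two nice 0 hogs */

	/* sched_slice()-style share: period * weight / total weight */
	printf("slice = %llu ns\n", period * w_task / w_total);	/* 11994146 */
	return 0;
}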

[1] - http://lwn.net/Articles/409680/

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
kernel/sched_fair.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 00ebd76..fc5ffbd 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -871,8 +871,10 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	if (cfs_rq->nr_running > 1) {
 		struct sched_entity *se = __pick_next_entity(cfs_rq);
 		s64 delta = curr->vruntime - se->vruntime;
+		unsigned long ideal_vruntime;
 
-		if (delta > ideal_runtime)
+		ideal_vruntime = calc_delta_fair(ideal_runtime, curr);
+		if (delta > (s64)ideal_vruntime)
 			resched_task(rq_of(cfs_rq)->curr);
 	}
 }
--
1.7.3.1


2010-12-25 07:50:39

by Mike Galbraith

Subject: Re: [PATCH] sched: Buggy comparison in check_preempt_tick

On Fri, 2010-12-24 at 16:26 -0800, Venkatesh Pallipadi wrote:
> A preempt comparison line in check_preempt_tick has two bugs.
> * It compares signed and unsigned quantities, which breaks when signed
> quantity happens to be negative

Oops, that's a bug.

> * It compares runtime and vruntime, which breaks when there are niced tasks

This imho isn't.

vruntimes advance at different rates for differently weighted tasks, so
they're already weighted, as is ideal_runtime.

For wakeup preemption we use wakeup_gran() as the weighted ruler to
limit spread. This test just uses a different weighted ruler for the
same purpose at tick time, one already computed.

If you're a heavily niced task, you were very likely booted before you
got this far. If you haven't run for considerable wall time, you won't
get further, you keep on running. You only get booted if giving you a
full wall slice will spread vruntimes too much, for which you'll pay if
you keep running, and leftmost will pay now if we don't boot you.

-Mike

2010-12-26 00:06:05

by Venkatesh Pallipadi

Subject: Re: [PATCH] sched: Buggy comparison in check_preempt_tick

On Fri, Dec 24, 2010 at 11:50 PM, Mike Galbraith <[email protected]> wrote:
> On Fri, 2010-12-24 at 16:26 -0800, Venkatesh Pallipadi wrote:
>> A preempt comparison line in check_preempt_tick has two bugs.
>> * It compares signed and unsigned quantities, which breaks when signed
>> quantity happens to be negative
>
> Oops, that's a bug.
>
>> * It compares runtime and vruntime, which breaks when there are niced tasks
>
> This imho isn't.
>
> vruntimes advance at different rates for differently weighted tasks, so
> they're already weighted, as is ideal_runtime.
>
> For wakeup preemption we use wakeup_gran() as the weighted ruler to
> limit spread. This test just uses a different weighted ruler for the
> same purpose at tick time, one already computed.
>
> If you're a heavily niced task, you were very likely booted before you
> got this far. If you haven't run for considerable wall time, you won't
> get further, you keep on running. You only get booted if giving you a
> full wall slice will spread vruntimes too much, for which you'll pay if
> you keep running, and leftmost will pay now if we don't boot you.
>

Maybe I am all confused.
- The se->vruntime update in __update_curr() is based on
delta_exec_weighted (i.e., calc_delta_fair(delta_exec, curr)), with
delta_exec being wall clock time.
- Earlier in check_preempt_tick(), ideal_runtime gets compared with
delta_exec, which is wall clock time.
- delta in check_preempt_tick() is the se->vruntime difference, which
should be weighted time, and comparing that with ideal_runtime (which
was compared with wall clock earlier) seems wrong.

This is debug output with a trace_printk outside this if statement, with
the if based on (s64)ideal_runtime.
Task 17807 is a nice 0 task and task 6363 is nice -2.
Ideal slice for Task 17807 is 8168151
Ideal slice for Task 6363 is 15798460
<...>-17807 [002] 15851.352247: task_tick_fair: delta 3538447 rt 8168151 vrt 10200199
<...>-17807 [002] 15851.353246: task_tick_fair: delta 4799848 rt 8168151 vrt 10200199
<...>-17807 [002] 15851.354245: task_tick_fair: delta 6036141 rt 8168151 vrt 10200199
<...>-17807 [002] 15851.355244: task_tick_fair: delta 7297361 rt 8168151 vrt 10200199
<...>-17807 [002] 15851.356244: task_tick_fair: delta 8546469 rt 8168151 vrt 10200199

Here delta is increasing faster than wall time due to the relatively
lower nice value of this task, and we are comparing this delta with
ideal_slice, which is wall clock time, hence we preempt this task
earlier. No?
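
A back-of-the-envelope sketch of that mismatch (the nice 5 weight of 335
comes from prio_to_weight[] and is used here purely for illustration,
assuming the vruntime delta starts near zero):

#include <stdio.h>

#define NICE_0_LOAD 1024ULL

int main(void)
{
	/* hypothetical positively niced task: nice 5 -> weight 335 */
	unsigned long long weight = 335;
	unsigned long long ideal_runtime = 8168151;	/* wall ns */

	/* Its vruntime advances NICE_0_LOAD/weight times faster than
	 * wall time, so the vruntime delta crosses the wall-clock
	 * ideal_runtime after only this much wall time: */
	unsigned long long wall = ideal_runtime * weight / NICE_0_LOAD;
	printf("preempted after ~%llu ns instead of %llu ns\n",
	       wall, ideal_runtime);
	return 0;
}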

Thanks,
Venki

2010-12-26 07:23:13

by Mike Galbraith

Subject: Re: [PATCH] sched: Buggy comparison in check_preempt_tick

On Sat, 2010-12-25 at 16:05 -0800, Venkatesh Pallipadi wrote:
> On Fri, Dec 24, 2010 at 11:50 PM, Mike Galbraith <[email protected]> wrote:
> > On Fri, 2010-12-24 at 16:26 -0800, Venkatesh Pallipadi wrote:
> >> A preempt comparison line in check_preempt_tick has two bugs.
> >> * It compares signed and unsigned quantities, which breaks when signed
> >> quantity happens to be negative
> >
> > Oops, that's a bug.
> >
> >> * It compares runtime and vruntime, which breaks when there are niced tasks
> >
> > This imho isn't.
> >
> > vruntimes advance at different rates for differently weighted tasks, so
> > they're already weighted, as is ideal_runtime.
> >
> > For wakeup preemption we use wakeup_gran() as the weighted ruler to
> > limit spread. This test just uses a different weighted ruler for the
> > same purpose at tick time, one already computed.
> >
> > If you're a heavily niced task, you were very likely booted before you
> > got this far. If you haven't run for considerable wall time, you won't
> > get further, you keep on running. You only get booted if giving you a
> > full wall slice will spread vruntimes too much, for which you'll pay if
> > you keep running, and leftmost will pay now if we don't boot you.
> >
>
> Maybe I am all confused.
> - The se->vruntime update in __update_curr() is based on
> delta_exec_weighted (i.e., calc_delta_fair(delta_exec, curr)), with
> delta_exec being wall clock time.

Yeah, vruntime is weighted, delta_exec is not.

> - Earlier in check_preempt_tick(), ideal_runtime gets compared with
> delta_exec, which is wall clock time.

Yeah, you can't run uninterrupted longer than your (weighted!) portion
of the latency target.

[To meet that target, you must be using HRTICK, and even a precise tick
(precise? the slice can change many times between timer set/fire, yes?) won't
help your _wakeup_ latency target.]

> - delta in check_preempt_tick() is the se->vruntime difference, which
> should be weighted time, and comparing that with ideal_runtime (which
> was compared with wall clock earlier) seems wrong.

What is sched_wakeup_granularity, and why is it different from
sched_latency? Why is it OK to scale sched_wakeup_granularity and use
it as a caliper, but NOT OK to scale sched_latency (which is what
__sched_period() returns, and sched_slice() scales) to do the same?

> This is debug output with a trace_printk outside this if statement, with
> the if based on (s64)ideal_runtime.
> Task 17807 is a nice 0 task and task 6363 is nice -2.
> Ideal slice for Task 17807 is 8168151
> Ideal slice for Task 6363 is 15798460
> <...>-17807 [002] 15851.352247: task_tick_fair: delta 3538447 rt 8168151 vrt 10200199
> <...>-17807 [002] 15851.353246: task_tick_fair: delta 4799848 rt 8168151 vrt 10200199
> <...>-17807 [002] 15851.354245: task_tick_fair: delta 6036141 rt 8168151 vrt 10200199
> <...>-17807 [002] 15851.355244: task_tick_fair: delta 7297361 rt 8168151 vrt 10200199
> <...>-17807 [002] 15851.356244: task_tick_fair: delta 8546469 rt 8168151 vrt 10200199
>
> Here delta is increasing faster than wall time due to the relatively
> lower nice value of this task, and we are comparing this delta with
> ideal_slice, which is wall clock time, hence we preempt this task
> earlier. No?

Is preempting sooner a bad thing if there's large spread? It'll keep
him from having to wait so long for his next bit of CPU.

What the thing is supposed to help is this: wakeup happens just after a
hog gets the CPU, the wakee doesn't quite have enough lag to preempt, so
has to wait for nearly a full slice to get service. That increases
spread, inducing latency spikes.

What would make more sense to me would be to use the exact same criteria
for wakeup/tick preemption and do away with some knobs: just treat the
tick as nothing but another preemption opportunity. Set your spread
target knob, and that's it.
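
A rough userspace sketch of what that single-ruler scheme might look like
(the names and the 1ms knob are hypothetical; the scaling mirrors
calc_delta_fair()):

#include <stdio.h>

typedef long long s64;
typedef unsigned long long u64;

#define NICE_0_LOAD 1024ULL

/* wall-clock ns -> vruntime ns for a task of the given weight,
 * mirroring calc_delta_fair() */
static u64 scale_to_vruntime(u64 wall_ns, u64 weight)
{
	return wall_ns * NICE_0_LOAD / weight;
}

/* One spread-target knob, applied identically at wakeup and at tick:
 * preempt when the vruntime gap to the leftmost task exceeds the knob
 * scaled into the running task's vruntime units. */
static int should_preempt(s64 vdiff, u64 spread_target_ns, u64 curr_weight)
{
	return vdiff > (s64)scale_to_vruntime(spread_target_ns, curr_weight);
}

int main(void)
{
	u64 spread_target = 1000000;	/* hypothetical 1ms knob */

	/* same check, whether called from the wakeup or the tick path */
	printf("gap 1.5ms, nice 0: %d\n", should_preempt(1500000, spread_target, 1024));
	printf("gap 0.5ms, nice 0: %d\n", should_preempt( 500000, spread_target, 1024));
	return 0;
}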

But anyway..

echo NO_WAKEUP_PREEMPT > sched_features
echo NO_TESTME > sched_features
two hogs running on isolcpu 3, pid 6890 at nice -2

while sleep 1; do grep 'pert.*6890' /proc/sched_debug; done

runnable tasks:
task PID tree-key switches prio
-------------------------------------------------------
R pert 6890 50201.071851 7453 118
R pert 6890 50596.171290 7513 118 +60
R pert 6890 50991.265264 7572 118 +59
R pert 6890 51383.781965 7631 118 +59
pert 6890 51781.463129 7691 118 +60

echo TESTME > sched_features
pert 6890 126881.306733 18977 118
R pert 6890 127273.825719 19036 118 +59
R pert 6890 127668.928218 19095 118 +59
R pert 6890 128064.031372 19154 118 +59
R pert 6890 128459.134339 19213 118 +59

...with a compute load, the thing should be a noop, and appears to be so
(with the busted compare fixed, anyway ;). You have to be well overloaded
for buddies to kick in these days, so it's probably pretty hard to get
enough spread for the thing to fire.

---
kernel/sched_fair.c | 4 ++--
kernel/sched_features.h | 1 +
2 files changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6.37.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.37.git.orig/kernel/sched_fair.c
+++ linux-2.6.37.git/kernel/sched_fair.c
@@ -862,7 +862,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq
 	 * narrow margin doesn't have to wait for a full slice.
 	 * This also mitigates buddy induced latencies under load.
 	 */
-	if (!sched_feat(WAKEUP_PREEMPT))
+	if (!sched_feat(TESTME))
 		return;
 
 	if (delta_exec < sysctl_sched_min_granularity)
@@ -872,7 +872,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq
 		struct sched_entity *se = __pick_next_entity(cfs_rq);
 		s64 delta = curr->vruntime - se->vruntime;
 
-		if (delta > ideal_runtime)
+		if (delta > (s64)ideal_runtime)
 			resched_task(rq_of(cfs_rq)->curr);
 	}
 }
Index: linux-2.6.37.git/kernel/sched_features.h
===================================================================
--- linux-2.6.37.git.orig/kernel/sched_features.h
+++ linux-2.6.37.git/kernel/sched_features.h
@@ -66,3 +66,4 @@ SCHED_FEAT(OWNER_SPIN, 1)
  * Decrement CPU power based on irq activity
  */
 SCHED_FEAT(NONIRQ_POWER, 1)
+SCHED_FEAT(TESTME, 1)

2010-12-28 05:48:28

by Mike Galbraith

Subject: Re: [PATCH] sched: Buggy comparison in check_preempt_tick

On Sun, 2010-12-26 at 08:23 +0100, Mike Galbraith wrote:

> But anyway..
>
> echo NO_WAKEUP_PREEMPT > sched_features
> echo NO_TESTME > sched_features
> two hogs running on isolcpu 3, pid 6890 at nice -2
>
> while sleep 1; do grep 'pert.*6890' /proc/sched_debug; done
>
> runnable tasks:
> task PID tree-key switches prio
> -------------------------------------------------------
> R pert 6890 50201.071851 7453 118
> R pert 6890 50596.171290 7513 118 +60
> R pert 6890 50991.265264 7572 118 +59
> R pert 6890 51383.781965 7631 118 +59
> pert 6890 51781.463129 7691 118 +60
>
> echo TESTME > sched_features
> pert 6890 126881.306733 18977 118
> R pert 6890 127273.825719 19036 118 +59
> R pert 6890 127668.928218 19095 118 +59
> R pert 6890 128064.031372 19154 118 +59
> R pert 6890 128459.134339 19213 118 +59
>
> ...with a compute load, the thing should be a noop, and appears to be so
> (with the busted compare fixed, anyway ;). You have to be well overloaded
> for buddies to kick in these days, so it's probably pretty hard to get
> enough spread for the thing to fire.

I did a bit more testing yesterday with wakeup loads. There's enough
spread for the test to nudge things a few [0..4] times per second/core.

I'd either fix the comparison and let it keep on nudging once in a
while, or whack the whole thing.

-Mike