2004-10-05 02:16:54

by Chen, Kenneth W

[permalink] [raw]
Subject: bug in sched.c:activate_task()

Update p->timestamp to "now" in activate_task() doesn't look right
to me at all. p->timestamp records last time it was running on a
cpu. activate_task shouldn't update that variable when it queues
a task on the runqueue.

This bug (and combined with others) triggers improper load balancing.

Patch against linux-2.6.9-rc3. Didn't diff it against 2.6.9-rc3-mm2
because mm tree has so many change in sched.c.

Signed-off-by: Ken Chen <[email protected]>


--- linux-2.6.9-rc3/kernel/sched.c.orig 2004-10-04 19:11:21.000000000 -0700
+++ linux-2.6.9-rc3/kernel/sched.c 2004-10-04 19:11:35.000000000 -0700
@@ -888,7 +888,6 @@ static void activate_task(task_t *p, run
p->activated = 1;
}
}
- p->timestamp = now;

__activate_task(p, rq);
}



2004-10-05 02:34:41

by Con Kolivas

[permalink] [raw]
Subject: Re: bug in sched.c:activate_task()

Chen, Kenneth W writes:

> Update p->timestamp to "now" in activate_task() doesn't look right
> to me at all. p->timestamp records last time it was running on a
> cpu. activate_task shouldn't update that variable when it queues
> a task on the runqueue.
>
> This bug (and combined with others) triggers improper load balancing.

The updated timestamp was placed there by Ingo to detect on-runqueue time.
If it is being used for load balancing then it is being used in error.

Con

2004-10-05 03:15:32

by Nick Piggin

[permalink] [raw]
Subject: Re: bug in sched.c:activate_task()

Con Kolivas wrote:

> Chen, Kenneth W writes:
>
>> Update p->timestamp to "now" in activate_task() doesn't look right
>> to me at all. p->timestamp records last time it was running on a
>> cpu. activate_task shouldn't update that variable when it queues
>> a task on the runqueue.
>>
>> This bug (and combined with others) triggers improper load balancing.
>
>
> The updated timestamp was placed there by Ingo to detect on-runqueue
> time. If it is being used for load balancing then it is being used in
> error.
>

Load balancing wants to know if a task is considered cache hot.


2004-10-05 04:46:09

by Con Kolivas

[permalink] [raw]
Subject: Re: bug in sched.c:activate_task()

Nick Piggin writes:

> Con Kolivas wrote:
>
>> Chen, Kenneth W writes:
>>
>>> Update p->timestamp to "now" in activate_task() doesn't look right
>>> to me at all. p->timestamp records last time it was running on a
>>> cpu. activate_task shouldn't update that variable when it queues
>>> a task on the runqueue.
>>>
>>> This bug (and combined with others) triggers improper load balancing.
>>
>>
>> The updated timestamp was placed there by Ingo to detect on-runqueue
>> time. If it is being used for load balancing then it is being used in
>> error.
>>
>
> Load balancing wants to know if a task is considered cache hot.

Yes I know. It used to be performed based on jiffies which was adequate
resolution for cache warmth at the time. The timestamp was being used for on
runqueue length measurement before the load balancing was modified to use
that value.

Con

2004-10-05 05:05:38

by Peter Williams

[permalink] [raw]
Subject: Re: bug in sched.c:activate_task()

Con Kolivas wrote:
> Chen, Kenneth W writes:
>
>> Update p->timestamp to "now" in activate_task() doesn't look right
>> to me at all. p->timestamp records last time it was running on a
>> cpu. activate_task shouldn't update that variable when it queues
>> a task on the runqueue.
>>
>> This bug (and combined with others) triggers improper load balancing.
>
>
> The updated timestamp was placed there by Ingo to detect on-runqueue
> time. If it is being used for load balancing then it is being used in
> error.

The ZAPHOD scheduler (being trialled in 2.6.9-rc3-mm2) uses its own time
stamp so as not to interfere with the use of timestamp by load balancing
code so it should be OK to delete this line form activate_task() in
2.6.9-rc3-mm2 without effecting ZAPHOD.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2004-10-05 06:10:48

by Nick Piggin

[permalink] [raw]
Subject: Re: bug in sched.c:activate_task()

Chen, Kenneth W wrote:
> Update p->timestamp to "now" in activate_task() doesn't look right
> to me at all. p->timestamp records last time it was running on a
> cpu. activate_task shouldn't update that variable when it queues
> a task on the runqueue.
>
> This bug (and combined with others) triggers improper load balancing.
>
> Patch against linux-2.6.9-rc3. Didn't diff it against 2.6.9-rc3-mm2
> because mm tree has so many change in sched.c.
>
> Signed-off-by: Ken Chen <[email protected]>
>

Actually, now that I have the code in front of me, I was wrong
and this patch is right.

This timestamp is never used for anything, so the assignment is
pointless.

2004-10-05 06:31:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: bug in sched.c:activate_task()


On Mon, 4 Oct 2004, Chen, Kenneth W wrote:

> Update p->timestamp to "now" in activate_task() doesn't look right to me
> at all. p->timestamp records last time it was running on a cpu.
> activate_task shouldn't update that variable when it queues a task on
> the runqueue.

correct, we are overriding it in schedule():

if (likely(prev != next)) {
next->timestamp = now;
rq->nr_switches++;

the line your patch removes is a remnant of an earlier logic when we
timestamped tasks when they touched the runqueue. (vs. timestamping when
they actually run on a CPU.) So the patch looks good to me. Andrew, please
apply.

Signed-off-by: Ingo Molnar <[email protected]>

Ingo


>
> This bug (and combined with others) triggers improper load balancing.
>
> Patch against linux-2.6.9-rc3. Didn't diff it against 2.6.9-rc3-mm2
> because mm tree has so many change in sched.c.
>
> Signed-off-by: Ken Chen <[email protected]>
>
>
> --- linux-2.6.9-rc3/kernel/sched.c.orig 2004-10-04 19:11:21.000000000 -0700
> +++ linux-2.6.9-rc3/kernel/sched.c 2004-10-04 19:11:35.000000000 -0700
> @@ -888,7 +888,6 @@ static void activate_task(task_t *p, run
> p->activated = 1;
> }
> }
> - p->timestamp = now;
>
> __activate_task(p, rq);
> }
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2004-10-05 06:36:32

by Con Kolivas

[permalink] [raw]
Subject: Re: bug in sched.c:activate_task()

Ingo Molnar writes:

>
> On Mon, 4 Oct 2004, Chen, Kenneth W wrote:
>
>> Update p->timestamp to "now" in activate_task() doesn't look right to me
>> at all. p->timestamp records last time it was running on a cpu.
>> activate_task shouldn't update that variable when it queues a task on
>> the runqueue.
>
> correct, we are overriding it in schedule():
>
> if (likely(prev != next)) {
> next->timestamp = now;
> rq->nr_switches++;
>
> the line your patch removes is a remnant of an earlier logic when we
> timestamped tasks when they touched the runqueue. (vs. timestamping when
> they actually run on a CPU.) So the patch looks good to me. Andrew, please
> apply.

unsigned long long delta = now - next->timestamp;

if (next->activated == 1)
delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;

is in schedule() before we update the timestamp, no?

Con

2004-10-05 06:55:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: bug in sched.c:activate_task()


On Tue, 5 Oct 2004, Con Kolivas wrote:

> unsigned long long delta = now - next->timestamp;
>
> if (next->activated == 1)
> delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
>
> is in schedule() before we update the timestamp, no?

indeed ... so the patch is just random incorrect damage that happened to
distrub the scheduler fixing some balancing problem. Kenneth, what
precisely is the balancing problem you are seeing?

Ingo

2004-10-05 06:58:01

by Nick Piggin

[permalink] [raw]
Subject: Re: bug in sched.c:activate_task()

Con Kolivas wrote:
> Ingo Molnar writes:
>
>>
>> On Mon, 4 Oct 2004, Chen, Kenneth W wrote:
>>
>>> Update p->timestamp to "now" in activate_task() doesn't look right to me
>>> at all. p->timestamp records last time it was running on a cpu.
>>> activate_task shouldn't update that variable when it queues a task on
>>> the runqueue.
>>
>>
>> correct, we are overriding it in schedule():
>>
>> if (likely(prev != next)) {
>> next->timestamp = now;
>> rq->nr_switches++;
>>
>> the line your patch removes is a remnant of an earlier logic when we
>> timestamped tasks when they touched the runqueue. (vs. timestamping when
>> they actually run on a CPU.) So the patch looks good to me. Andrew,
>> please
>> apply.
>
>
> unsigned long long delta = now - next->timestamp;
>
> if (next->activated == 1)
> delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
>
> is in schedule() before we update the timestamp, no?
>

Yeah right, unfortunately.

2004-10-05 07:00:29

by Con Kolivas

[permalink] [raw]
Subject: Re: bug in sched.c:activate_task()

Ingo Molnar writes:

>
> On Tue, 5 Oct 2004, Con Kolivas wrote:
>
>> unsigned long long delta = now - next->timestamp;
>>
>> if (next->activated == 1)
>> delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
>>
>> is in schedule() before we update the timestamp, no?
>
> indeed ... so the patch is just random incorrect damage that happened to
> distrub the scheduler fixing some balancing problem. Kenneth, what
> precisely is the balancing problem you are seeing?

We used to compare jiffy difference in can_migrate_task by comparing it to
cache_decay_ticks. Somewhere in the merging of sched_domains it was changed
to task_hot which uses timestamp.

Con

2004-10-05 07:08:36

by Nick Piggin

[permalink] [raw]
Subject: Re: bug in sched.c:activate_task()

Con Kolivas wrote:
> Ingo Molnar writes:
>
>>
>> On Tue, 5 Oct 2004, Con Kolivas wrote:
>>
>>> unsigned long long delta = now - next->timestamp;
>>>
>>> if (next->activated == 1)
>>> delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
>>>
>>> is in schedule() before we update the timestamp, no?
>>
>>
>> indeed ... so the patch is just random incorrect damage that happened to
>> distrub the scheduler fixing some balancing problem. Kenneth, what
>> precisely is the balancing problem you are seeing?
>
>
> We used to compare jiffy difference in can_migrate_task by comparing it to
> cache_decay_ticks. Somewhere in the merging of sched_domains it was
> changed to task_hot which uses timestamp.
>

It always used ->timestamp though, even when that was in jiffies.
sched domains didn't change anything there (and IIRC it used
task_hot before sched domains).

2004-10-05 07:10:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: bug in sched.c:activate_task()


On Tue, 5 Oct 2004, Con Kolivas wrote:

> We used to compare jiffy difference in can_migrate_task by comparing it
> to cache_decay_ticks. Somewhere in the merging of sched_domains it was
> changed to task_hot which uses timestamp.

yep, that's fishy. Kenneth, could you try the simple patch below? It gets
rid of task_hot() in essence. If this works out we could try it - it gets
rid of some more code from sched.c too. Perhaps SD_WAKE_AFFINE is enough
control.

Ingo

--- kernel/sched.c.orig 2004-10-05 08:28:42.295395160 +0200
+++ kernel/sched.c 2004-10-05 09:07:44.081389576 +0200
@@ -180,7 +180,7 @@ static unsigned int task_timeslice(task_
else
return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
}
-#define task_hot(p, now, sd) ((now) - (p)->timestamp < (sd)->cache_hot_time)
+#define task_hot(p, now, sd) 0

enum idle_type
{

2004-10-05 08:18:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: bug in sched.c:activate_task()


* Chen, Kenneth W <[email protected]> wrote:

> Update p->timestamp to "now" in activate_task() doesn't look right to
> me at all. p->timestamp records last time it was running on a cpu.
> activate_task shouldn't update that variable when it queues a task on
> the runqueue.

it is being used for multiple purposes: measuring on-CPU time, measuring
on-runqueue time (for interactivity) and measuring off-runqueue time
(for interactivity). It is also used to measure cache-hotness, by the
balancing code.

now, this particular update of p->timestamp:

> @@ -888,7 +888,6 @@ static void activate_task(task_t *p, run

> - p->timestamp = now;

is important for the interactivity code that runs in schedule() - it
wants to know how much time we spent on the runqueue without having run.

but you are right that the task_hot() use of p->timestamp is incorrect
for freshly woken up tasks - we want the balancer to be able to re-route
tasks to another CPU, as long as the task has not hit the runqueue yet
(which it hasnt where we consider it in the balancer).

the clean solution is to separate the multiple uses of p->timestamp:
with the patch below p->timestamp is purely used for the interactivity
code, and p->last_ran is for the rebalancer. The patch is ontop of your
task_hot() fix-patch. Does this work for your workload?

Ingo

Signed-off-by: Ingo Molnar <[email protected]>

--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -185,7 +185,8 @@ static unsigned int task_timeslice(task_
else
return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
}
-#define task_hot(p, now, sd) ((now) - (p)->timestamp < (sd)->cache_hot_time)
+#define task_hot(p, now, sd) ((long long) ((now) - (p)->last_ran) \
+ < (long long) (sd)->cache_hot_time)

/*
* These are the runqueue data structures:
@@ -2833,7 +2834,7 @@ switch_tasks:
if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev)))
prev->interactive_credit--;
}
- prev->timestamp = now;
+ prev->timestamp = prev->last_ran = now;

sched_info_switch(prev, next);
if (likely(prev != next)) {
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -527,7 +527,7 @@ struct task_struct {

unsigned long sleep_avg;
long interactive_credit;
- unsigned long long timestamp;
+ unsigned long long timestamp, last_ran;
int activated;

unsigned long policy;

2004-10-05 17:25:57

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: bug in sched.c:activate_task()

Ingo Molnar wrote on Monday, October 04, 2004 11:55 PM
> On Tue, 5 Oct 2004, Con Kolivas wrote:
>
> > unsigned long long delta = now - next->timestamp;
> >
> > if (next->activated == 1)
> > delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
> >
> > is in schedule() before we update the timestamp, no?
>
> indeed ... so the patch is just random incorrect damage that happened to
> distrub the scheduler fixing some balancing problem. Kenneth, what
> precisely is the balancing problem you are seeing?

We are seeing truck load of move_task() which was originated from newly
idle load balance. The problem is load balancing going nuts moving tons
of tasks around and what we are seeing is cache misses for the application
shots up to the roof and suffering from cache threshing.

- Ken


2004-10-05 17:34:56

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: bug in sched.c:activate_task()

On Tue, 5 Oct 2004, Con Kolivas wrote:
> We used to compare jiffy difference in can_migrate_task by comparing it
> to cache_decay_ticks. Somewhere in the merging of sched_domains it was
> changed to task_hot which uses timestamp.


On Tuesday, October 05, 2004 12:10 AM, Ingo Molnar wrote:
> yep, that's fishy. Kenneth, could you try the simple patch below? It gets
> rid of task_hot() in essence. If this works out we could try it - it gets
> rid of some more code from sched.c too. Perhaps SD_WAKE_AFFINE is enough
> control.
>
> --- kernel/sched.c.orig 2004-10-05 08:28:42.295395160 +0200
> +++ kernel/sched.c 2004-10-05 09:07:44.081389576 +0200
> @@ -180,7 +180,7 @@ static unsigned int task_timeslice(task_
> else
> return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
> }
> -#define task_hot(p, now, sd) ((now) - (p)->timestamp < (sd)->cache_hot_time)
> +#define task_hot(p, now, sd) 0
>
> enum idle_type
> {

We have experimented with similar thing, via bumping up sd->cache_hot_time to
a very large number, like 1 sec. What we measured was a equally low throughput.
But that was because of not enough load balancing, we are seeing is large amount
of idle time.

- Ken


2004-10-05 17:45:18

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: bug in sched.c:activate_task()

* Chen, Kenneth W <[email protected]> wrote:
> Update p->timestamp to "now" in activate_task() doesn't look right to
> me at all. p->timestamp records last time it was running on a cpu.
> activate_task shouldn't update that variable when it queues a task on
> the runqueue.


Ingo Molnar wrote on Tuesday, October 05, 2004 1:19 AM
> it is being used for multiple purposes: measuring on-CPU time, measuring
> on-runqueue time (for interactivity) and measuring off-runqueue time
> (for interactivity). It is also used to measure cache-hotness, by the
> balancing code.
>
> now, this particular update of p->timestamp:
>
> > @@ -888,7 +888,6 @@ static void activate_task(task_t *p, run
>
> > - p->timestamp = now;
>
> is important for the interactivity code that runs in schedule() - it
> wants to know how much time we spent on the runqueue without having run.
>
> but you are right that the task_hot() use of p->timestamp is incorrect
> for freshly woken up tasks - we want the balancer to be able to re-route
> tasks to another CPU, as long as the task has not hit the runqueue yet
> (which it hasnt where we consider it in the balancer).
>
> the clean solution is to separate the multiple uses of p->timestamp:
> with the patch below p->timestamp is purely used for the interactivity
> code, and p->last_ran is for the rebalancer. The patch is ontop of your
> task_hot() fix-patch. Does this work for your workload?

I will take this for a spin and report back the result.

- Ken


2004-10-05 22:47:00

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: bug in sched.c:activate_task()

* Chen, Kenneth W <[email protected]> wrote:
> Update p->timestamp to "now" in activate_task() doesn't look right to
> me at all. p->timestamp records last time it was running on a cpu.
> activate_task shouldn't update that variable when it queues a task on
> the runqueue.

Ingo Molnar wrote on Tuesday, October 05, 2004 1:19 AM
> it is being used for multiple purposes: measuring on-CPU time, measuring
> on-runqueue time (for interactivity) and measuring off-runqueue time
> (for interactivity). It is also used to measure cache-hotness, by the
> balancing code.
>
> now, this particular update of p->timestamp:
>
> > @@ -888,7 +888,6 @@ static void activate_task(task_t *p, run
>
> > - p->timestamp = now;
>
> is important for the interactivity code that runs in schedule() - it
> wants to know how much time we spent on the runqueue without having run.
>
> but you are right that the task_hot() use of p->timestamp is incorrect
> for freshly woken up tasks - we want the balancer to be able to re-route
> tasks to another CPU, as long as the task has not hit the runqueue yet
> (which it hasnt where we consider it in the balancer).
>
> the clean solution is to separate the multiple uses of p->timestamp:
> with the patch below p->timestamp is purely used for the interactivity
> code, and p->last_ran is for the rebalancer. The patch is ontop of your
> task_hot() fix-patch. Does this work for your workload?


Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:45 AM
> I will take this for a spin and report back the result.

Confirmed that Ingo's latest patch does the right thing for the db workload.

- Ken