LinuxLists.cc - Re: Regression seen for patch "sched:dont decrease idle sleep avg"

[permalink] [raw]

Subject: RE: Regression seen for patch "sched:dont decrease idle sleep avg"

Tim Chen writes:
> See patch:
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e72ff0bb2c163eb13014ba113701bd42dab382fe

Con Kolivas wrote on Monday, May 08, 2006 5:43 PM
> This patch corrects a bug in the original code which unintentionally dropped
> the priority of tasks that were idle but were already high priority on other
> merits. It doesn't further increase the priority.

This got me to take a non-casual look at that particular git commit. The
first portion of the change log description says perfectly about the intent,
but after studying the code, I have to say that the actual code does not
implement what people say it will do. In recalc_task_prio(), if a task's
sleep_time is more than INTERACTIVE_SLEEP, it will bump up p->sleep_avg all
the way to near maximum (at MAX_SLEEP_AVG - DEF_TIMESLICE), which according
to my calculation, it will have a priority bonus of 4 (out of max 5).

IOW, for a prolonged sleep, a task will immediately get near maximum priority
boost. Is that what the real intent is? Seems to be on the contrary to what
the source code comments as well.

I think in the if (sleep_time > INTERACTIVE_SLEEP) block, p->sleep_avg should
be treated similarly like what the "else" block is doing: scale it proportionally
with past sleep time, perhaps not the immediate previously prolonged sleep
because that would for sure bump up priority too fast. A better method might
be p->sleep_avg *= 2 or something like that.

- Ken

2006-05-13 12:30:40

by Andrew Morton

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

(Catching up on lkml)

On Thu, 11 May 2006 17:04:11 -0700
"Chen, Kenneth W" <[email protected]> wrote:

> Tim Chen writes:
> > See patch:
> > http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e72ff0bb2c163eb13014ba113701bd42dab382fe
>
> Con Kolivas wrote on Monday, May 08, 2006 5:43 PM
> > This patch corrects a bug in the original code which unintentionally dropped
> > the priority of tasks that were idle but were already high priority on other
> > merits. It doesn't further increase the priority.
>
>
> This got me to take a non-casual look at that particular git commit. The
> first portion of the change log description says perfectly about the intent,
> but after studying the code, I have to say that the actual code does not
> implement what people say it will do. In recalc_task_prio(), if a task's
> sleep_time is more than INTERACTIVE_SLEEP, it will bump up p->sleep_avg all
> the way to near maximum (at MAX_SLEEP_AVG - DEF_TIMESLICE), which according
> to my calculation, it will have a priority bonus of 4 (out of max 5).
>
> IOW, for a prolonged sleep, a task will immediately get near maximum priority
> boost. Is that what the real intent is? Seems to be on the contrary to what
> the source code comments as well.
>
> I think in the if (sleep_time > INTERACTIVE_SLEEP) block, p->sleep_avg should
> be treated similarly like what the "else" block is doing: scale it proportionally
> with past sleep time, perhaps not the immediate previously prolonged sleep
> because that would for sure bump up priority too fast. A better method might
> be p->sleep_avg *= 2 or something like that.
>

That seems to be a pretty significant discovery. Is anything happening
with it?

2006-05-13 13:07:08

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Sat, 2006-05-13 at 05:27 -0700, Andrew Morton wrote:
> (Catching up on lkml)
>
> On Thu, 11 May 2006 17:04:11 -0700
> "Chen, Kenneth W" <[email protected]> wrote:
>
> > Tim Chen writes:
> > > See patch:
> > > http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e72ff0bb2c163eb13014ba113701bd42dab382fe
> >
> > Con Kolivas wrote on Monday, May 08, 2006 5:43 PM
> > > This patch corrects a bug in the original code which unintentionally dropped
> > > the priority of tasks that were idle but were already high priority on other
> > > merits. It doesn't further increase the priority.
> >
> >
> > This got me to take a non-casual look at that particular git commit. The
> > first portion of the change log description says perfectly about the intent,
> > but after studying the code, I have to say that the actual code does not
> > implement what people say it will do. In recalc_task_prio(), if a task's
> > sleep_time is more than INTERACTIVE_SLEEP, it will bump up p->sleep_avg all
> > the way to near maximum (at MAX_SLEEP_AVG - DEF_TIMESLICE), which according
> > to my calculation, it will have a priority bonus of 4 (out of max 5).
> >
> > IOW, for a prolonged sleep, a task will immediately get near maximum priority
> > boost. Is that what the real intent is? Seems to be on the contrary to what
> > the source code comments as well.
> >
> > I think in the if (sleep_time > INTERACTIVE_SLEEP) block, p->sleep_avg should
> > be treated similarly like what the "else" block is doing: scale it proportionally
> > with past sleep time, perhaps not the immediate previously prolonged sleep
> > because that would for sure bump up priority too fast. A better method might
> > be p->sleep_avg *= 2 or something like that.
> >
>
> That seems to be a pretty significant discovery. Is anything happening
> with it?

When I tried to fix that, I ran into resistance.

-Mike

2006-05-14 16:05:34

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Friday 12 May 2006 10:04, Chen, Kenneth W wrote:
> Tim Chen writes:
> > See patch:
> > http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=co
> >mmit;h=e72ff0bb2c163eb13014ba113701bd42dab382fe
>
> Con Kolivas wrote on Monday, May 08, 2006 5:43 PM
>
> > This patch corrects a bug in the original code which unintentionally
> > dropped the priority of tasks that were idle but were already high
> > priority on other merits. It doesn't further increase the priority.
>
> This got me to take a non-casual look at that particular git commit. The
> first portion of the change log description says perfectly about the
> intent, but after studying the code, I have to say that the actual code
> does not implement what people say it will do. In recalc_task_prio(), if a
> task's sleep_time is more than INTERACTIVE_SLEEP, it will bump up
> p->sleep_avg all the way to near maximum (at MAX_SLEEP_AVG -
> DEF_TIMESLICE), which according to my calculation, it will have a priority
> bonus of 4 (out of max 5).
>
> IOW, for a prolonged sleep, a task will immediately get near maximum
> priority boost. Is that what the real intent is? Seems to be on the
> contrary to what the source code comments as well.
>
> I think in the if (sleep_time > INTERACTIVE_SLEEP) block, p->sleep_avg
> should be treated similarly like what the "else" block is doing: scale it
> proportionally with past sleep time, perhaps not the immediate previously
> prolonged sleep because that would for sure bump up priority too fast. A
> better method might be p->sleep_avg *= 2 or something like that.

There would be no difference if the priority boost is done lower. The if and
else blocks both end up equating to the same amount of priority boost, with
the former having a ceiling on it, so yes it is the intent. You'll see that
the amount of sleep required to jump from lowest priority to MAX_SLEEP_AVG -
DEF_TIMESLICE is INTERACTIVE_SLEEP.

-ck

2006-05-15 19:04:08

[permalink] [raw]

Subject: RE: Regression seen for patch "sched:dont decrease idle sleep avg"

Con Kolivas wrote on Sunday, May 14, 2006 9:03 AM
> There would be no difference if the priority boost is done lower. The if and
> else blocks both end up equating to the same amount of priority boost, with
> the former having a ceiling on it, so yes it is the intent. You'll see that
> the amount of sleep required to jump from lowest priority to MAX_SLEEP_AVG -
> DEF_TIMESLICE is INTERACTIVE_SLEEP.

I don't think the if and the else block is doing the same thing. In the if
block, the p->sleep_avg is unconditionally boosted to ceiling for all tasks,
though it will not reduce sleep_avg for tasks that already exceed the ceiling.
Bumping up sleep_avg will then translate into priority boost of MAX_BONUS-1,
which potentially can be too high.

But that's fine if it is the intent. At minimum, the comment in the source
code should say so instead of fooling people who don't actually read the code.

[patch] sched: update comments in priority calculation w.r.t. implementation.

Signed-off-by: Ken Chen <[email protected]>

--- ./kernel/sched.c.orig 2006-05-15 12:24:02.000000000 -0700
+++ ./kernel/sched.c 2006-05-15 12:37:16.000000000 -0700
@@ -746,10 +746,12 @@ static int recalc_task_prio(task_t *p, u
if (likely(sleep_time > 0)) {
/*
* User tasks that sleep a long time are categorised as
- * idle. They will only have their sleep_avg increased to a
- * level that makes them just interactive priority to stay
- * active yet prevent them suddenly becoming cpu hogs and
- * starving other processes.
+ * idle. If they sleep longer than INTERACTIVE_SLEEP, it
+ * will have its priority boosted to minimum MAX_BONUS-1.
+ * For short sleep, they will only have their sleep_avg
+ * increased to a level that makes them just interactive
+ * priority to stay active yet prevent them suddenly becoming
+ * cpu hogs and starving other processes.
*/
if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) {
unsigned long ceiling;

2006-05-15 23:45:39

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Tuesday 16 May 2006 05:01, Chen, Kenneth W wrote:
> Con Kolivas wrote on Sunday, May 14, 2006 9:03 AM
>
> > There would be no difference if the priority boost is done lower. The if
> > and else blocks both end up equating to the same amount of priority
> > boost, with the former having a ceiling on it, so yes it is the intent.
> > You'll see that the amount of sleep required to jump from lowest priority
> > to MAX_SLEEP_AVG - DEF_TIMESLICE is INTERACTIVE_SLEEP.
>
> I don't think the if and the else block is doing the same thing. In the if
> block, the p->sleep_avg is unconditionally boosted to ceiling for all
> tasks, though it will not reduce sleep_avg for tasks that already exceed
> the ceiling. Bumping up sleep_avg will then translate into priority boost
> of MAX_BONUS-1, which potentially can be too high.

Yes it's only designed to detect something that has been asleep for an
arbitrary long time and "categorised as idle"; it is not supposed to be a
priority stepping stone for everything, in this case at MAX_BONUS-1. Mike
proposed doing this instead, but it was never my intent. Your comment is not
quite correct as it just happens to be MAX_BONUS-1 at nice 0, and not any
other nice value.

> But that's fine if it is the intent. At minimum, the comment in the source
> code should say so instead of fooling people who don't actually read the
> code.

Feel free to update it to how you understand it now :) I have this feeling
we'll be seeing quite some action here soon...

> [patch] sched: update comments in priority calculation w.r.t.
> implementation.

--
-ck

2006-05-16 01:22:28

[permalink] [raw]

Subject: RE: Regression seen for patch "sched:dont decrease idle sleep avg"

Con Kolivas wrote on Monday, May 15, 2006 4:45 PM
> On Tuesday 16 May 2006 05:01, Chen, Kenneth W wrote:
> > I don't think the if and the else block is doing the same thing. In the if
> > block, the p->sleep_avg is unconditionally boosted to ceiling for all
> > tasks, though it will not reduce sleep_avg for tasks that already exceed
> > the ceiling. Bumping up sleep_avg will then translate into priority boost
> > of MAX_BONUS-1, which potentially can be too high.
>
> Yes it's only designed to detect something that has been asleep for an
> arbitrary long time and "categorised as idle"; it is not supposed to be a
> priority stepping stone for everything, in this case at MAX_BONUS-1. Mike
> proposed doing this instead, but it was never my intent. Your comment is not
> quite correct as it just happens to be MAX_BONUS-1 at nice 0, and not any
> other nice value.

Huh??

sleep_avg is set at constant:
p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG - DEF_TIMESLICE);

The bonus calculation is:

#define CURRENT_BONUS(p) \
(NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / MAX_SLEEP_AVG)

bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;

None of the calculation that I see uses nice value. Did I miss something?

- Ken

2006-05-16 01:44:26

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Tuesday 16 May 2006 11:22, Chen, Kenneth W wrote:
> Con Kolivas wrote on Monday, May 15, 2006 4:45 PM
>
> > On Tuesday 16 May 2006 05:01, Chen, Kenneth W wrote:
> > > I don't think the if and the else block is doing the same thing. In the
> > > if block, the p->sleep_avg is unconditionally boosted to ceiling for
> > > all tasks, though it will not reduce sleep_avg for tasks that already
> > > exceed the ceiling. Bumping up sleep_avg will then translate into
> > > priority boost of MAX_BONUS-1, which potentially can be too high.
> >
> > Yes it's only designed to detect something that has been asleep for an
> > arbitrary long time and "categorised as idle"; it is not supposed to be a
> > priority stepping stone for everything, in this case at MAX_BONUS-1. Mike
> > proposed doing this instead, but it was never my intent. Your comment is
> > not quite correct as it just happens to be MAX_BONUS-1 at nice 0, and not
> > any other nice value.
>
> Huh??
>
> sleep_avg is set at constant:
> p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG - DEF_TIMESLICE);
>
>
> The bonus calculation is:
>
> #define CURRENT_BONUS(p) \
> (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / MAX_SLEEP_AVG)
>
> bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
>
> None of the calculation that I see uses nice value. Did I miss something?

My bad. You're correct, sorry.

--
-ck

2006-05-16 04:06:45

[permalink] [raw]

Subject: RE: Regression seen for patch "sched:dont decrease idle sleep avg"

On Mon, 2006-05-15 at 12:01 -0700, Chen, Kenneth W wrote:

> [patch] sched: update comments in priority calculation w.r.t. implementation.
>
> Signed-off-by: Ken Chen <[email protected]>
>
> --- ./kernel/sched.c.orig 2006-05-15 12:24:02.000000000 -0700
> +++ ./kernel/sched.c 2006-05-15 12:37:16.000000000 -0700
> @@ -746,10 +746,12 @@ static int recalc_task_prio(task_t *p, u
> if (likely(sleep_time > 0)) {
> /*
> * User tasks that sleep a long time are categorised as
> - * idle. They will only have their sleep_avg increased to a
> - * level that makes them just interactive priority to stay
> - * active yet prevent them suddenly becoming cpu hogs and
> - * starving other processes.
> + * idle. If they sleep longer than INTERACTIVE_SLEEP, it
> + * will have its priority boosted to minimum MAX_BONUS-1.
> + * For short sleep, they will only have their sleep_avg
> + * increased to a level that makes them just interactive
> + * priority to stay active yet prevent them suddenly becoming
> + * cpu hogs and starving other processes.
> */
> if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) {
> unsigned long ceiling;

This comment still doesn't reflect what the code does. Please just
delete the last four lines, they are incorrect.

-Mike

2006-05-16 04:10:03

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Tue, 2006-05-16 at 09:45 +1000, Con Kolivas wrote:
> Feel free to update it to how you understand it now :) I have this feeling
> we'll be seeing quite some action here soon...

Really? Mind telling us why?

-Mike

2006-05-17 00:08:43

by Tim Chen

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Tue, 2006-05-16 at 09:45 +1000, Con Kolivas wrote:

>
> Yes it's only designed to detect something that has been asleep for an
> arbitrary long time and "categorised as idle"; it is not supposed to be a
> priority stepping stone for everything, in this case at MAX_BONUS-1. Mike
> proposed doing this instead, but it was never my intent.

It seems like just one sleep longer than INTERACTIVE_SLEEP is needed
kick the priority of a process all the way to MAX_BONUS-1 and boost the
sleep_avg, regardless of what the prior sleep_avg was.

So if there is a cpu hog that has long sleeps occasionally, once it woke
up, its priority will get boosted close to maximum, likely starving out
other processes for a while till its sleep_avg gets reduced. This
behavior seems like something to avoid according to the original code
comment. Are we boosting the priority too quickly?

Tim

2006-05-17 04:24:42

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Tue, 2006-05-16 at 16:32 -0700, Tim Chen wrote:
> On Tue, 2006-05-16 at 09:45 +1000, Con Kolivas wrote:
>
> >
> > Yes it's only designed to detect something that has been asleep for an
> > arbitrary long time and "categorised as idle"; it is not supposed to be a
> > priority stepping stone for everything, in this case at MAX_BONUS-1. Mike
> > proposed doing this instead, but it was never my intent.
>
> It seems like just one sleep longer than INTERACTIVE_SLEEP is needed
> kick the priority of a process all the way to MAX_BONUS-1 and boost the
> sleep_avg, regardless of what the prior sleep_avg was.
>
> So if there is a cpu hog that has long sleeps occasionally, once it woke
> up, its priority will get boosted close to maximum, likely starving out
> other processes for a while till its sleep_avg gets reduced. This
> behavior seems like something to avoid according to the original code
> comment. Are we boosting the priority too quickly?

The answer to that is, sometimes yes, and when it bites, it bites hard.
Happily, most hogs don't sleep much, and we don't generally have lots of
bursty sleepers.

-Mike

2006-05-17 04:45:37

by Peter Williams

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

Mike Galbraith wrote:
> On Tue, 2006-05-16 at 16:32 -0700, Tim Chen wrote:
>> On Tue, 2006-05-16 at 09:45 +1000, Con Kolivas wrote:
>>
>>> Yes it's only designed to detect something that has been asleep for an
>>> arbitrary long time and "categorised as idle"; it is not supposed to be a
>>> priority stepping stone for everything, in this case at MAX_BONUS-1. Mike
>>> proposed doing this instead, but it was never my intent.
>> It seems like just one sleep longer than INTERACTIVE_SLEEP is needed
>> kick the priority of a process all the way to MAX_BONUS-1 and boost the
>> sleep_avg, regardless of what the prior sleep_avg was.
>>
>> So if there is a cpu hog that has long sleeps occasionally, once it woke
>> up, its priority will get boosted close to maximum, likely starving out
>> other processes for a while till its sleep_avg gets reduced. This
>> behavior seems like something to avoid according to the original code
>> comment. Are we boosting the priority too quickly?
>
> The answer to that is, sometimes yes, and when it bites, it bites hard.
> Happily, most hogs don't sleep much, and we don't generally have lots of
> bursty sleepers.
>

But it's easy for a malicious user to exploit. Yes?

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-17 05:24:07

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Wed, 2006-05-17 at 14:45 +1000, Peter Williams wrote:
> Mike Galbraith wrote:
> > On Tue, 2006-05-16 at 16:32 -0700, Tim Chen wrote:
> >> On Tue, 2006-05-16 at 09:45 +1000, Con Kolivas wrote:
> >>
> >>> Yes it's only designed to detect something that has been asleep for an
> >>> arbitrary long time and "categorised as idle"; it is not supposed to be a
> >>> priority stepping stone for everything, in this case at MAX_BONUS-1. Mike
> >>> proposed doing this instead, but it was never my intent.
> >> It seems like just one sleep longer than INTERACTIVE_SLEEP is needed
> >> kick the priority of a process all the way to MAX_BONUS-1 and boost the
> >> sleep_avg, regardless of what the prior sleep_avg was.
> >>
> >> So if there is a cpu hog that has long sleeps occasionally, once it woke
> >> up, its priority will get boosted close to maximum, likely starving out
> >> other processes for a while till its sleep_avg gets reduced. This
> >> behavior seems like something to avoid according to the original code
> >> comment. Are we boosting the priority too quickly?
> >
> > The answer to that is, sometimes yes, and when it bites, it bites hard.
> > Happily, most hogs don't sleep much, and we don't generally have lots of
> > bursty sleepers.
> >
>
> But it's easy for a malicious user to exploit. Yes?

Without limits, sure. Burn malicious idiot's library card :) The real
pain begins when you start examining legitimate loads. It _can_ get
really and truly fugly. Generally doesn't.

-Mike

2006-05-17 08:23:51

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Wednesday 17 May 2006 09:32, Tim Chen wrote:
> On Tue, 2006-05-16 at 09:45 +1000, Con Kolivas wrote:
> > Yes it's only designed to detect something that has been asleep for an
> > arbitrary long time and "categorised as idle"; it is not supposed to be a
> > priority stepping stone for everything, in this case at MAX_BONUS-1. Mike
> > proposed doing this instead, but it was never my intent.
>
> It seems like just one sleep longer than INTERACTIVE_SLEEP is needed
> kick the priority of a process all the way to MAX_BONUS-1 and boost the
> sleep_avg, regardless of what the prior sleep_avg was.
>
> So if there is a cpu hog that has long sleeps occasionally, once it woke
> up, its priority will get boosted close to maximum, likely starving out
> other processes for a while till its sleep_avg gets reduced. This
> behavior seems like something to avoid according to the original code
> comment. Are we boosting the priority too quickly?

Two things strike me here. I'll explain them in the patch below.

How's this look?
---
The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect
and not explicit enough. The sleep boost is not supposed to be any larger
than without this code and the comment is not clear enough about what exactly
it does, just the reason it does it.

There is a ceiling to the priority beyond which tasks that only ever sleep
for very long periods cannot surpass.

Opportunity to micro-optimise and re-use the ceiling variable.

Signed-off-by: Con Kolivas <[email protected]>

---
kernel/sched.c | 28 +++++++++++-----------------
1 files changed, 11 insertions(+), 17 deletions(-)

Index: linux-2.6.17-rc4-mm1/kernel/sched.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/kernel/sched.c 2006-05-17 15:57:49.000000000 +1000
+++ linux-2.6.17-rc4-mm1/kernel/sched.c 2006-05-17 18:19:29.000000000 +1000
@@ -904,20 +904,14 @@ static int recalc_task_prio(task_t *p, u
}

if (likely(sleep_time > 0)) {
- /*
- * User tasks that sleep a long time are categorised as
- * idle. They will only have their sleep_avg increased to a
- * level that makes them just interactive priority to stay
- * active yet prevent them suddenly becoming cpu hogs and
- * starving other processes.
- */
- if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) {
- unsigned long ceiling;
+ unsigned long ceiling = INTERACTIVE_SLEEP(p);

- ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
- DEF_TIMESLICE);
- if (p->sleep_avg < ceiling)
- p->sleep_avg = ceiling;
+ if (p->mm && sleep_time > ceiling && p->sleep_avg < ceiling) {
+ /*
+ * Prevents user tasks from achieving best priority
+ * with one single large enough sleep.
+ */
+ p->sleep_avg = ceiling;
} else {
/*
* Tasks waking from uninterruptible sleep are
@@ -925,12 +919,12 @@ static int recalc_task_prio(task_t *p, u
* are likely to be waiting on I/O
*/
if (p->sleep_type == SLEEP_NONINTERACTIVE && p->mm) {
- if (p->sleep_avg >= INTERACTIVE_SLEEP(p))
+ if (p->sleep_avg >= ceiling)
sleep_time = 0;
else if (p->sleep_avg + sleep_time >=
- INTERACTIVE_SLEEP(p)) {
- p->sleep_avg = INTERACTIVE_SLEEP(p);
- sleep_time = 0;
+ ceiling) {
+ p->sleep_avg = ceiling;
+ sleep_time = 0;
}
}

--
-ck

2006-05-17 09:48:48

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Wed, 2006-05-17 at 18:23 +1000, Con Kolivas wrote:

> There is a ceiling to the priority beyond which tasks that only ever sleep
> for very long periods cannot surpass.

(Hmm. The intent is more clear, ie reserve the top for low latency
tasks,... but that sounds a bit like xmms protection.)

The main problem I see with this ceiling, solely from the interactivity
viewpoint, is that interactive tasks which have started burning cpu
and/or freshly forked interactive tasks land in the same spot. Thud.c
demonstrates this problem quite well. You don't want a few copies of
thud in the same queue with your interactive task, much less above it if
it's used enough cpu to drop a notch or two. Much pain ensues.

-Mike

2006-05-17 10:25:54

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Wednesday 17 May 2006 19:49, Mike Galbraith wrote:
> On Wed, 2006-05-17 at 18:23 +1000, Con Kolivas wrote:
> > There is a ceiling to the priority beyond which tasks that only ever
> > sleep for very long periods cannot surpass.
>
> (Hmm. The intent is more clear, ie reserve the top for low latency
> tasks,... but that sounds a bit like xmms protection.)
>
> The main problem I see with this ceiling, solely from the interactivity
> viewpoint, is that interactive tasks which have started burning cpu
> and/or freshly forked interactive tasks land in the same spot. Thud.c
> demonstrates this problem quite well. You don't want a few copies of
> thud in the same queue with your interactive task, much less above it if
> it's used enough cpu to drop a notch or two. Much pain ensues.

If there are thud processes that have earned their place they deserve it, just
as any other task. If something uses only 1% cpu it is unfair just because it
only ever sleeps for long sleeps to prevent it getting as good priority as
anything else that uses only 1% cpu. I've noticed that 'top' suffers this
fate for example. The problem I've had all along with thud as a test case is
that while it causes a pause on the system for potentially a few seconds, it
does this by raising the load transiently to a load of 50 (or whatever you
set it to). I have no problem with a system having a hiccup when the load is
50, provided the system eventually recovers and isn't starved forever (which
it isn't). There are other means to prevent one user having that many running
tasks if so desired.

--
-ck

2006-05-17 11:42:10

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Wed, 2006-05-17 at 20:25 +1000, Con Kolivas wrote:
> On Wednesday 17 May 2006 19:49, Mike Galbraith wrote:
> > On Wed, 2006-05-17 at 18:23 +1000, Con Kolivas wrote:
> > > There is a ceiling to the priority beyond which tasks that only ever
> > > sleep for very long periods cannot surpass.
> >
> > (Hmm. The intent is more clear, ie reserve the top for low latency
> > tasks,... but that sounds a bit like xmms protection.)
> >
> > The main problem I see with this ceiling, solely from the interactivity
> > viewpoint, is that interactive tasks which have started burning cpu
> > and/or freshly forked interactive tasks land in the same spot. Thud.c
> > demonstrates this problem quite well. You don't want a few copies of
> > thud in the same queue with your interactive task, much less above it if
> > it's used enough cpu to drop a notch or two. Much pain ensues.
>
> If there are thud processes that have earned their place they deserve it, just
> as any other task. If something uses only 1% cpu it is unfair just because it

Fair? I said interactivity wise. But what the heck, if we're talking
fairness, I can say the same thing about I/O bound tasks. Heck, it's
not fair to stop any task from reaching the top, and it's certainly not
fair to let them have (for all practical purposes) all the cpu they want
once they sleep enough. Shoot, the scheduler is unfair even without any
interactivity code. Long term, it splits tasks into two groups... those
that sleep for more than 50% of the time... yack yack yack... zzzzz

Let's stick to the interactivity side :)

> only ever sleeps for long sleeps to prevent it getting as good priority as
> anything else that uses only 1% cpu. I've noticed that 'top' suffers this
> fate for example. The problem I've had all along with thud as a test case is
> that while it causes a pause on the system for potentially a few seconds, it
> does this by raising the load transiently to a load of 50 (or whatever you
> set it to). I have no problem with a system having a hiccup when the load is
> 50, provided the system eventually recovers and isn't starved forever (which
> it isn't). There are other means to prevent one user having that many running
> tasks if so desired.

Three of the little buggers are enough to cause plenty of pain.

-Mike

2006-05-17 12:47:09

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Wednesday 17 May 2006 21:42, Mike Galbraith wrote:
> Fair? I said interactivity wise. But what the heck, if we're talking
> fairness, I can say the same thing about I/O bound tasks. Heck, it's
> not fair to stop any task from reaching the top, and it's certainly not
> fair to let them have (for all practical purposes) all the cpu they want
> once they sleep enough.

Toss out the I/O bound thing and we turn into a steaming dripping pile of dog
doo whenever anything does disk I/O. And damned if there aren't a lot of pcs
that have hard disks...

> Shoot, the scheduler is unfair even without any
> interactivity code. Long term, it splits tasks into two groups... those
> that sleep for more than 50% of the time... yack yack yack... zzzzz
>
> Let's stick to the interactivity side :)

It's a deal.

> > only ever sleeps for long sleeps to prevent it getting as good priority
> > as anything else that uses only 1% cpu. I've noticed that 'top' suffers
> > this fate for example. The problem I've had all along with thud as a test
> > case is that while it causes a pause on the system for potentially a few
> > seconds, it does this by raising the load transiently to a load of 50 (or
> > whatever you set it to). I have no problem with a system having a hiccup
> > when the load is 50, provided the system eventually recovers and isn't
> > starved forever (which it isn't). There are other means to prevent one
> > user having that many running tasks if so desired.
>
> Three of the little buggers are enough to cause plenty of pain.

Spits and stutters are not starvation. Luckily it gets no worse with this
patch.

--
-ck

2006-05-17 13:40:55

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Wed, 2006-05-17 at 22:46 +1000, Con Kolivas wrote:
> On Wednesday 17 May 2006 21:42, Mike Galbraith wrote:
> > Fair? I said interactivity wise. But what the heck, if we're talking
> > fairness, I can say the same thing about I/O bound tasks. Heck, it's
> > not fair to stop any task from reaching the top, and it's certainly not
> > fair to let them have (for all practical purposes) all the cpu they want
> > once they sleep enough.
>
> Toss out the I/O bound thing and we turn into a steaming dripping pile of dog
> doo whenever anything does disk I/O. And damned if there aren't a lot of pcs
> that have hard disks...

(you should have tried my patch set)

> Spits and stutters are not starvation. Luckily it gets no worse with this
> patch.

Ok, I'll accept that. Spits and stutters _are_ interactivity issues
though yes?. Knowing full well that plunking long sleepers into the
queue you are plunking them into causes spits and stutters, why do you
insist on doing so?

Oh well, we're well on the way to agreeing to disagree again, so let's
just get it over with. I hereby agree to disagree.

Cheers,

-Mike

2006-05-17 15:10:47

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Wednesday 17 May 2006 23:41, Mike Galbraith wrote:
> On Wed, 2006-05-17 at 22:46 +1000, Con Kolivas wrote:
> > On Wednesday 17 May 2006 21:42, Mike Galbraith wrote:
> > > Fair? I said interactivity wise. But what the heck, if we're talking
> > > fairness, I can say the same thing about I/O bound tasks. Heck, it's
> > > not fair to stop any task from reaching the top, and it's certainly not
> > > fair to let them have (for all practical purposes) all the cpu they
> > > want once they sleep enough.
> >
> > Toss out the I/O bound thing and we turn into a steaming dripping pile of
> > dog doo whenever anything does disk I/O. And damned if there aren't a lot
> > of pcs that have hard disks...
>
> (you should have tried my patch set)

Last one of yours I tried suffered this.

> > Spits and stutters are not starvation. Luckily it gets no worse with this
> > patch.
>
> Ok, I'll accept that. Spits and stutters _are_ interactivity issues
> though yes?. Knowing full well that plunking long sleepers into the
> queue you are plunking them into causes spits and stutters, why do you
> insist on doing so?

Because I know of no real world workload that thuds us into spits and
stutters.

> Oh well, we're well on the way to agreeing to disagree again, so let's
> just get it over with. I hereby agree to disagree.

Indeed.

--
-ck

2006-05-17 17:21:56

by Ray Lee

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On 5/17/06, Con Kolivas <[email protected]> wrote:
> > Ok, I'll accept that. Spits and stutters _are_ interactivity issues
> > though yes?. Knowing full well that plunking long sleepers into the
> > queue you are plunking them into causes spits and stutters, why do you
> > insist on doing so?
>
> Because I know of no real world workload that thuds us into spits and
> stutters.

`apt-get dist-upgrade` seems to do wonders for making a system
thoroughly unusable. Possibly because it forks a copy of perl and a
few other tasks, all competing for CPU and disk I/O bandwidth -- at
least, competing when they're not sleeping most of the time. Y'all
might add a chroot'd install of Debian as a test case.

Ray

2006-05-17 19:34:08

[permalink] [raw]

Subject: RE: Regression seen for patch "sched:dont decrease idle sleep avg"

Con Kolivas wrote on Wednesday, May 17, 2006 1:23 AM
> On Wednesday 17 May 2006 09:32, Tim Chen wrote:
> > It seems like just one sleep longer than INTERACTIVE_SLEEP is needed
> > kick the priority of a process all the way to MAX_BONUS-1 and boost the
> > sleep_avg, regardless of what the prior sleep_avg was.
> >
> > So if there is a cpu hog that has long sleeps occasionally, once it woke
> > up, its priority will get boosted close to maximum, likely starving out
> > other processes for a while till its sleep_avg gets reduced. This
> > behavior seems like something to avoid according to the original code
> > comment. Are we boosting the priority too quickly?
>
> Two things strike me here. I'll explain them in the patch below.
>
> How's this look?
> ---
> The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect
> and not explicit enough. The sleep boost is not supposed to be any larger
> than without this code and the comment is not clear enough about what exactly
> it does, just the reason it does it.
>
> There is a ceiling to the priority beyond which tasks that only ever sleep
> for very long periods cannot surpass.

It looks bad. I don't like it. The priority boost is even more peculiar
in this patch.

> --- linux-2.6.17-rc4-mm1.orig/kernel/sched.c 2006-05-17 15:57:49.000000000 +1000
> +++ linux-2.6.17-rc4-mm1/kernel/sched.c 2006-05-17 18:19:29.000000000 +1000
> @@ -904,20 +904,14 @@ static int recalc_task_prio(task_t *p, u
> }
>
> if (likely(sleep_time > 0)) {
> - /*
> - * User tasks that sleep a long time are categorised as
> - * idle. They will only have their sleep_avg increased to a
> - * level that makes them just interactive priority to stay
> - * active yet prevent them suddenly becoming cpu hogs and
> - * starving other processes.
> - */
> - if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) {
> - unsigned long ceiling;
> + unsigned long ceiling = INTERACTIVE_SLEEP(p);
>
> - ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
> - DEF_TIMESLICE);
> - if (p->sleep_avg < ceiling)
> - p->sleep_avg = ceiling;
> + if (p->mm && sleep_time > ceiling && p->sleep_avg < ceiling) {
> + /*
> + * Prevents user tasks from achieving best priority
> + * with one single large enough sleep.
> + */
> + p->sleep_avg = ceiling;

The assignment of p->sleep_avg = ceiling doesn't make much logical sense.
Because INTERACTIVE_SLEEP is scaled proportionally with nice value, e.g.
the lower the nice value, the lower the interactive_sleep. However, priority
calculation is inverse of p->sleep_avg, e.g. the smaller the sleep_avg, the
smaller the bonus, thus the higher dynamic priority.

Take one concrete example: for a prolonged sleep, say 1 second, nice(-10)
will have a priority boost of 4 while nice(0) will have a priority boost of
9. The ceiling algorithm looks like is reversed. I would think kernel should
at least enforce same ceiling value independent of nice value.

- Ken

2006-05-18 00:36:00

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Thursday 18 May 2006 05:33, Chen, Kenneth W wrote:
> The assignment of p->sleep_avg = ceiling doesn't make much logical sense.
> Because INTERACTIVE_SLEEP is scaled proportionally with nice value, e.g.
> the lower the nice value, the lower the interactive_sleep. However,
> priority calculation is inverse of p->sleep_avg, e.g. the smaller the
> sleep_avg, the smaller the bonus, thus the higher dynamic priority.
>
> Take one concrete example: for a prolonged sleep, say 1 second, nice(-10)
> will have a priority boost of 4 while nice(0) will have a priority boost of
> 9. The ceiling algorithm looks like is reversed. I would think kernel
> should at least enforce same ceiling value independent of nice value.

I see why you don't like it. However I still don't think you understand why
the ceiling of that magnitude is there. Tasks behave very differently
depending on whether their priority is low enough that they expire at the end
of every timeslice and get put on the expired array, or their priority is
high enough that they get reinserted into the active array. It's an intrinsic
quirk in the "interactive" design of the scheduler that basically means we
have two classes of task - interactive enough to be reinserted into the
active array or not. As the comment already says the ceiling is there to
prevent one sleep from getting them to best priority. I don't want them to
get high priority with one large enough sleep, but I do want them to get into
the reinsertion class which behaves entirely differently. What is missing
from the comment is to say that it is also designed to stop them at the
lowest possible priority that still keeps them in the interactive reinsertion
class. Using a constant ceiling value irrespective of nice will not guarantee
that tasks fall into the active reinsertion class dependant on their nice
value.

--
-ck

2006-05-18 01:10:15

[permalink] [raw]

Subject: RE: Regression seen for patch "sched:dont decrease idle sleep avg"

Con Kolivas wrote on Wednesday, May 17, 2006 5:35 PM
> What is missing
> from the comment is to say that it is also designed to stop them at the
> lowest possible priority that still keeps them in the interactive reinsertion
> class.

> Using a constant ceiling value irrespective of nice will not guarantee
> that tasks fall into the active reinsertion class dependant on their nice
> value.

If I may ask, how does it work right now? Ceiling is set at constant value
irrespective to nice value. Are you saying current code is broken as well?

ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
DEF_TIMESLICE);
if (p->sleep_avg < ceiling)
p->sleep_avg = ceiling;

We maybe also misunderstand each other. I'm not arguing of removing the
ceiling. Having a ceiling is the right thing to do here. What I don't like
is that 2.6.17-rc4 has the ceiling set too high, and your patch also does
an inversion of the ceiling value w.r.t nice value. So it's the detail of
what's the right value for priority boost that I'm uncomfortable with.

- Ken

2006-05-18 01:39:04

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Thursday 18 May 2006 11:10, Chen, Kenneth W wrote:
> Con Kolivas wrote on Wednesday, May 17, 2006 5:35 PM
>
> > What is missing
> > from the comment is to say that it is also designed to stop them at the
> > lowest possible priority that still keeps them in the interactive
> > reinsertion class.
> >
> > Using a constant ceiling value irrespective of nice will not guarantee
> > that tasks fall into the active reinsertion class dependant on their nice
> > value.
>
> If I may ask, how does it work right now? Ceiling is set at constant value
> irrespective to nice value. Are you saying current code is broken as well?

In essence, yes. The approximation was too general and vague as you have
pointed out, and was only close to the correct ceiling or knee for one nice
value. I just want to formalise the relationship between the ceiling, nice
value and INTERACTIVE_SLEEP and make the comment clear enough to be
understood.

> ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
> DEF_TIMESLICE);
> if (p->sleep_avg < ceiling)
> p->sleep_avg = ceiling;
>
>
> We maybe also misunderstand each other. I'm not arguing of removing the
> ceiling. Having a ceiling is the right thing to do here. What I don't like
> is that 2.6.17-rc4 has the ceiling set too high, and your patch also does
> an inversion of the ceiling value w.r.t nice value. So it's the detail of
> what's the right value for priority boost that I'm uncomfortable with.

We're in fuzzy land where there are no absolutes I'm afraid. The common case
is obviously nice 0 and anything else is simply respecting the relationship
between nice and INTERACTIVE_SLEEP. As timeslices get proportionately larger
the lower the nice value, it becomes increasingly easy to avoid getting to
the end of a full timeslice thereby avoiding expiration anyway. It also
doesn't take much sleep from a task to get to best priority from the priority
knee/ceiling no matter how low it is. Finally the relationship between likely
preemption with varying nice levels is sort of vaguely maintained. On balance
it seems satisfactory to me to maintain this relationship.

--
-ck

2006-05-18 05:43:42

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Thu, 2006-05-18 at 11:38 +1000, Con Kolivas wrote:
> I just want to formalise the relationship between the ceiling, nice
> value and INTERACTIVE_SLEEP and make the comment clear enough to be
> understood.

Oh yeah, that reminded me...

INTERACTIVE_SLEEP(p) for nice(>=16) tasks is > NS_MAX_SLEEP_AVG.
CURRENT_BONUS(p) if it took the long sleep path can end up being 11,
which will lead to Ka-[fword]-BOOM in scheduler_tick() for an SMP box.
See TIMESLICE_GRANULARITY(p). (btdt;)

-Mike

2006-05-18 05:52:47

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Thursday 18 May 2006 15:44, Mike Galbraith wrote:
> On Thu, 2006-05-18 at 11:38 +1000, Con Kolivas wrote:
> > I just want to formalise the relationship between the ceiling, nice
> > value and INTERACTIVE_SLEEP and make the comment clear enough to be
> > understood.
>
> Oh yeah, that reminded me...
>
> INTERACTIVE_SLEEP(p) for nice(>=16) tasks is > NS_MAX_SLEEP_AVG.
> CURRENT_BONUS(p) if it took the long sleep path can end up being 11,
> which will lead to Ka-[fword]-BOOM in scheduler_tick() for an SMP box.
> See TIMESLICE_GRANULARITY(p). (btdt;)

Thanks. This updated one fixes that and clears up the remaining mystery of
why the ceiling is the size it is in the comments.

---
The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect
and not explicit enough. The sleep boost is not supposed to be any larger
than without this code and the comment is not clear enough about what exactly
it does, just the reason it does it.

There is a ceiling to the priority beyond which tasks that only ever sleep
for very long periods cannot surpass.

Opportunity to micro-optimise and re-use the ceiling variable.

Signed-off-by: Con Kolivas <[email protected]>

---
kernel/sched.c | 33 ++++++++++++++++-----------------
1 files changed, 16 insertions(+), 17 deletions(-)

Index: linux-2.6.17-rc4-mm1/kernel/sched.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/kernel/sched.c 2006-05-17 15:57:49.000000000 +1000
+++ linux-2.6.17-rc4-mm1/kernel/sched.c 2006-05-18 15:48:47.000000000 +1000
@@ -905,19 +905,18 @@ static int recalc_task_prio(task_t *p, u

if (likely(sleep_time > 0)) {
/*
- * User tasks that sleep a long time are categorised as
- * idle. They will only have their sleep_avg increased to a
- * level that makes them just interactive priority to stay
- * active yet prevent them suddenly becoming cpu hogs and
- * starving other processes.
+ * This ceiling is set to the lowest priority that would allow
+ * a task to be reinserted into the active array on timeslice
+ * completion.
*/
- if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) {
- unsigned long ceiling;
+ unsigned long ceiling = INTERACTIVE_SLEEP(p);

- ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
- DEF_TIMESLICE);
- if (p->sleep_avg < ceiling)
- p->sleep_avg = ceiling;
+ if (p->mm && sleep_time > ceiling && p->sleep_avg < ceiling) {
+ /*
+ * Prevents user tasks from achieving best priority
+ * with one single large enough sleep.
+ */
+ p->sleep_avg = ceiling;
} else {
/*
* Tasks waking from uninterruptible sleep are
@@ -925,12 +924,12 @@ static int recalc_task_prio(task_t *p, u
* are likely to be waiting on I/O
*/
if (p->sleep_type == SLEEP_NONINTERACTIVE && p->mm) {
- if (p->sleep_avg >= INTERACTIVE_SLEEP(p))
+ if (p->sleep_avg >= ceiling)
sleep_time = 0;
else if (p->sleep_avg + sleep_time >=
- INTERACTIVE_SLEEP(p)) {
- p->sleep_avg = INTERACTIVE_SLEEP(p);
- sleep_time = 0;
+ ceiling) {
+ p->sleep_avg = ceiling;
+ sleep_time = 0;
}
}

@@ -944,9 +943,9 @@ static int recalc_task_prio(task_t *p, u
*/
p->sleep_avg += sleep_time;

- if (p->sleep_avg > NS_MAX_SLEEP_AVG)
- p->sleep_avg = NS_MAX_SLEEP_AVG;
}
+ if (p->sleep_avg > NS_MAX_SLEEP_AVG)
+ p->sleep_avg = NS_MAX_SLEEP_AVG;
}

return effective_prio(p);

--
-ck

2006-05-18 07:03:50

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Thu, 2006-05-18 at 15:52 +1000, Con Kolivas wrote:
> On Thursday 18 May 2006 15:44, Mike Galbraith wrote:
> > On Thu, 2006-05-18 at 11:38 +1000, Con Kolivas wrote:
> > > I just want to formalise the relationship between the ceiling, nice
> > > value and INTERACTIVE_SLEEP and make the comment clear enough to be
> > > understood.
> >
> > Oh yeah, that reminded me...
> >
> > INTERACTIVE_SLEEP(p) for nice(>=16) tasks is > NS_MAX_SLEEP_AVG.
> > CURRENT_BONUS(p) if it took the long sleep path can end up being 11,
> > which will lead to Ka-[fword]-BOOM in scheduler_tick() for an SMP box.
> > See TIMESLICE_GRANULARITY(p). (btdt;)
>
> Thanks. This updated one fixes that and clears up the remaining mystery of
> why the ceiling is the size it is in the comments.

OK, after some brief testing, I think this is a step in the right
direct, but there is another problem. In the case where the queue isn't
empty, the stated intent is utterly defeated by the on runqueue bonus.

If you fix this, the thud and irman2 kind of pain mostly goes away for
interactive tasks, and a lot of starvation scenarios go as well.

The best way I've found to fix these kind of boost problems is to just
say no if the task is using more than it's fair share of cpu, and in NO
case, let one wait rocket you to the top... that sets you up for a queue
round-robin nightmare (my interactive feeding frenzy scenario). It
doesn't take much for tasks that nanosleep and round-robin before it
becomes impossible for them to use their sleep_avg. I would say nuke
that code, except that it is the only chance at fairness some tasks
have. Nuking it is definitely easier that getting it right.

-Mike

2006-05-18 12:58:11

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Thu, 2006-05-18 at 09:04 +0200, Mike Galbraith wrote:

> OK, after some brief testing, I think this is a step in the right
> direct, but there is another problem. In the case where the queue isn't
> empty, the stated intent is utterly defeated by the on runqueue bonus.

The overly verbose one liner below could serve as a minimal ~fix.

Prevent the on-runqueue bonus logic from defeating the idle sleep logic.

Signed-off-by: Mike Galbraith <[email protected]>

--- linux-2.6.17-rc4-mm1/kernel/sched.c.org 2006-05-18 08:38:13.000000000 +0200
+++ linux-2.6.17-rc4-mm1/kernel/sched.c 2006-05-18 14:47:09.000000000 +0200
@@ -917,6 +917,16 @@ static int recalc_task_prio(task_t *p, u
* with one single large enough sleep.
*/
p->sleep_avg = ceiling;
+ /*
+ * Using INTERACTIVE_SLEEP() as a ceiling places a
+ * nice(0) task 1ms sleep away from promotion, and
+ * gives it 700ms to round-robin with no chance of
+ * being demoted. This is more than generous, so
+ * mark this sleep as non-interactive to prevent the
+ * on-runqueue bonus logic from intervening should
+ * this task not receive cpu immediately.
+ */
+ p->sleep_type = SLEEP_NONINTERACTIVE;
} else {
/*
* Tasks waking from uninterruptible sleep are

2006-05-18 23:17:24

[permalink] [raw]

Subject: RE: Regression seen for patch "sched:dont decrease idle sleep avg"

Con Kolivas wrote on Wednesday, May 17, 2006 10:52 PM
> The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect
> and not explicit enough. The sleep boost is not supposed to be any larger
> than without this code and the comment is not clear enough about what exactly
> it does, just the reason it does it.
>
> There is a ceiling to the priority beyond which tasks that only ever sleep
> for very long periods cannot surpass.
>
> Opportunity to micro-optimise and re-use the ceiling variable.

More opportunity to micro-optimize: now that you've put the clamp code of
p->sleep_avg in the common path, there is no need to clamp sleep_time at
the beginning of that function. Just let the calculation go through and
clamp the final value of p->sleep_avg to NS_MAX_SLEEP_AVG at the end.

Signed-off-by: Ken Chen <[email protected]>

--- ./kernel/sched.c.orig 2006-05-18 12:51:45.000000000 -0700
+++ ./kernel/sched.c 2006-05-18 14:53:47.000000000 -0700
@@ -731,17 +731,10 @@
static int recalc_task_prio(task_t *p, unsigned long long now)
{
/* Caller must always ensure 'now >= p->timestamp' */
- unsigned long long __sleep_time = now - p->timestamp;
- unsigned long sleep_time;
+ unsigned long sleep_time = now - p->timestamp;

if (batch_task(p))
sleep_time = 0;
- else {
- if (__sleep_time > NS_MAX_SLEEP_AVG)
- sleep_time = NS_MAX_SLEEP_AVG;
- else
- sleep_time = (unsigned long)__sleep_time;
- }

if (likely(sleep_time > 0)) {
/*

2006-05-18 23:34:43

[permalink] [raw]

Subject: RE: Regression seen for patch "sched:dont decrease idle sleep avg"

Con Kolivas wrote on Wednesday, May 17, 2006 10:52 PM
> The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect
> and not explicit enough. The sleep boost is not supposed to be any larger
> than without this code and the comment is not clear enough about what exactly
> it does, just the reason it does it.
>
> There is a ceiling to the priority beyond which tasks that only ever sleep
> for very long periods cannot surpass.
>
> Opportunity to micro-optimise and re-use the ceiling variable.
>
> --- linux-2.6.17-rc4-mm1.orig/kernel/sched.c 2006-05-17 15:57:49.000000000 +1000
> +++ linux-2.6.17-rc4-mm1/kernel/sched.c 2006-05-18 15:48:47.000000000 +1000
> @@ -925,12 +924,12 @@ static int recalc_task_prio(task_t *p, u
> * are likely to be waiting on I/O
> */
> if (p->sleep_type == SLEEP_NONINTERACTIVE && p->mm) {
> - if (p->sleep_avg >= INTERACTIVE_SLEEP(p))
> + if (p->sleep_avg >= ceiling)
> sleep_time = 0;
> else if (p->sleep_avg + sleep_time >=
> - INTERACTIVE_SLEEP(p)) {
> - p->sleep_avg = INTERACTIVE_SLEEP(p);
> - sleep_time = 0;
> + ceiling) {
> + p->sleep_avg = ceiling;
> + sleep_time = 0;

Watch for white space damage, last two lines has one extra tab on the indentation.

By the way, there is all kinds of non-linear behavior with priority boost
adjustment:

if (p->sleep_type == SLEEP_NONINTERACTIVE && p->mm) {
if (p->sleep_avg >= ceiling)
sleep_time = 0;
else if (p->sleep_avg + sleep_time >= ceiling) {
p->sleep_avg = ceiling;
sleep_time = 0;
}
}

For large p->sleep_avg, kernel don't clamp it to ceiling, yet clamp small
incremental sleep. This all seems very fragile.

- Ken

2006-05-19 01:08:08

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Friday 19 May 2006 09:34, Chen, Kenneth W wrote:
> Con Kolivas wrote on Wednesday, May 17, 2006 10:52 PM
>
> > The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect
> > and not explicit enough. The sleep boost is not supposed to be any larger
> > than without this code and the comment is not clear enough about what
> > exactly it does, just the reason it does it.
> >
> > There is a ceiling to the priority beyond which tasks that only ever
> > sleep for very long periods cannot surpass.
> >
> > Opportunity to micro-optimise and re-use the ceiling variable.
> >
> > --- linux-2.6.17-rc4-mm1.orig/kernel/sched.c 2006-05-17
> > 15:57:49.000000000 +1000 +++
> > linux-2.6.17-rc4-mm1/kernel/sched.c 2006-05-18 15:48:47.000000000 +1000
> > @@ -925,12 +924,12 @@ static int recalc_task_prio(task_t *p, u
> > * are likely to be waiting on I/O
> > */
> > if (p->sleep_type == SLEEP_NONINTERACTIVE && p->mm) {
> > - if (p->sleep_avg >= INTERACTIVE_SLEEP(p))
> > + if (p->sleep_avg >= ceiling)
> > sleep_time = 0;
> > else if (p->sleep_avg + sleep_time >=
> > - INTERACTIVE_SLEEP(p)) {
> > - p->sleep_avg = INTERACTIVE_SLEEP(p);
> > - sleep_time = 0;
> > + ceiling) {
> > + p->sleep_avg = ceiling;
> > + sleep_time = 0;
>
> Watch for white space damage, last two lines has one extra tab on the
> indentation.

Hmm a component of the if() is up to that tab so it seemed appropriate to
indent the body further to me.

> By the way, there is all kinds of non-linear behavior with priority boost
> adjustment:
>
> if (p->sleep_type == SLEEP_NONINTERACTIVE && p->mm) {
> if (p->sleep_avg >= ceiling)
> sleep_time = 0;
> else if (p->sleep_avg + sleep_time >= ceiling) {
> p->sleep_avg = ceiling;
> sleep_time = 0;
> }
> }
>
> For large p->sleep_avg, kernel don't clamp it to ceiling, yet clamp small
> incremental sleep. This all seems very fragile.

Yes it is. sleep_avg affecting priority in a linear fashion in the original
design was basically the reason interactivity was not flexible enough for a
wide range of workloads. I don't like it much myself any more either, and
have been maintaining a complete rewrite for some time. However the fact is
that the number of valid bug reports is very low, so tiny tweaks as issues
have come up have sufficed so far.

--
-ck

2006-05-19 01:10:44

[permalink] [raw]

Subject: Re: Regression seen for patch "sched:dont decrease idle sleep avg"

On Thursday 18 May 2006 22:59, Mike Galbraith wrote:
> On Thu, 2006-05-18 at 09:04 +0200, Mike Galbraith wrote:
> > OK, after some brief testing, I think this is a step in the right
> > direct, but there is another problem. In the case where the queue isn't
> > empty, the stated intent is utterly defeated by the on runqueue bonus.
>
> The overly verbose one liner below could serve as a minimal ~fix.
>
> Prevent the on-runqueue bonus logic from defeating the idle sleep logic.
>
> Signed-off-by: Mike Galbraith <[email protected]>
>
> --- linux-2.6.17-rc4-mm1/kernel/sched.c.org 2006-05-18 08:38:13.000000000
> +0200 +++ linux-2.6.17-rc4-mm1/kernel/sched.c 2006-05-18 14:47:09.000000000
> +0200 @@ -917,6 +917,16 @@ static int recalc_task_prio(task_t *p, u
> * with one single large enough sleep.
> */
> p->sleep_avg = ceiling;
> + /*
> + * Using INTERACTIVE_SLEEP() as a ceiling places a
> + * nice(0) task 1ms sleep away from promotion, and
> + * gives it 700ms to round-robin with no chance of
> + * being demoted. This is more than generous, so
> + * mark this sleep as non-interactive to prevent the
> + * on-runqueue bonus logic from intervening should
> + * this task not receive cpu immediately.
> + */
> + p->sleep_type = SLEEP_NONINTERACTIVE;
> } else {
> /*
> * Tasks waking from uninterruptible sleep are

Yes I like it. It's consistent with the intent and lacking from the current
design.

Ok I'll wrap this and Ken's patches all together and ask for them to be pushed
upstream. I'd go so far as to say they are mostly obvious comment and
bugfixes for the code that is already going into 2.6.17 and we should short
circuit them to mainline if it's ok with Ingo and Akpm.

--
-ck

2006-05-19 01:31:28

[permalink] [raw]

Subject: [PATCH] sched: fix interactive ceiling code

Ingo, Andrew, I think these are minor logic fixes and comments that correct
a patch that has already been pushed to 2.6.17- and I would like them short
circuited to mainline if everyone is comfortable with it.

Ken, Mike can I ask you to put a signed off on this patch for your
contributions please?

Martin can I please ask for this to be put on test.kernel.org as a last
minute sanity check? I know URLs are easier so here is the patch for 2.6.17-rc4:
http://ck.kolivas.org/patches/crap/sched-fix_interactive_ceiling.patch

---
The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect
and not explicit enough. The sleep boost is not supposed to be any larger
than without this code and the comment is not clear enough about what exactly
it does, just the reason it does it. Fix it.

There is a ceiling to the priority beyond which tasks that only ever sleep
for very long periods cannot surpass. Fix it.

Prevent the on-runqueue bonus logic from defeating the idle sleep logic.

Opportunity to micro-optimise.

Signed-off-by: Con Kolivas <[email protected]>

---
kernel/sched.c | 52 +++++++++++++++++++++++++++-------------------------
1 files changed, 27 insertions(+), 25 deletions(-)

Index: linux-2.6.17-rc4/kernel/sched.c
===================================================================
--- linux-2.6.17-rc4.orig/kernel/sched.c 2006-05-19 11:25:01.000000000 +1000
+++ linux-2.6.17-rc4/kernel/sched.c 2006-05-19 11:25:14.000000000 +1000
@@ -731,33 +731,35 @@ static inline void __activate_idle_task(
static int recalc_task_prio(task_t *p, unsigned long long now)
{
/* Caller must always ensure 'now >= p->timestamp' */
- unsigned long long __sleep_time = now - p->timestamp;
- unsigned long sleep_time;
+ unsigned long sleep_time = now - p->timestamp;

if (batch_task(p))
sleep_time = 0;
- else {
- if (__sleep_time > NS_MAX_SLEEP_AVG)
- sleep_time = NS_MAX_SLEEP_AVG;
- else
- sleep_time = (unsigned long)__sleep_time;
- }

if (likely(sleep_time > 0)) {
/*
- * User tasks that sleep a long time are categorised as
- * idle. They will only have their sleep_avg increased to a
- * level that makes them just interactive priority to stay
- * active yet prevent them suddenly becoming cpu hogs and
- * starving other processes.
+ * This ceiling is set to the lowest priority that would allow
+ * a task to be reinserted into the active array on timeslice
+ * completion.
*/
- if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) {
- unsigned long ceiling;
+ unsigned long ceiling = INTERACTIVE_SLEEP(p);

- ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
- DEF_TIMESLICE);
- if (p->sleep_avg < ceiling)
- p->sleep_avg = ceiling;
+ if (p->mm && sleep_time > ceiling && p->sleep_avg < ceiling) {
+ /*
+ * Prevents user tasks from achieving best priority
+ * with one single large enough sleep.
+ */
+ p->sleep_avg = ceiling;
+ /*
+ * Using INTERACTIVE_SLEEP() as a ceiling places a
+ * nice(0) task 1ms sleep away from promotion, and
+ * gives it 700ms to round-robin with no chance of
+ * being demoted. This is more than generous, so
+ * mark this sleep as non-interactive to prevent the
+ * on-runqueue bonus logic from intervening should
+ * this task not receive cpu immediately.
+ */
+ p->sleep_type = SLEEP_NONINTERACTIVE;
} else {
/*
* Tasks waking from uninterruptible sleep are
@@ -765,12 +767,12 @@ static int recalc_task_prio(task_t *p, u
* are likely to be waiting on I/O
*/
if (p->sleep_type == SLEEP_NONINTERACTIVE && p->mm) {
- if (p->sleep_avg >= INTERACTIVE_SLEEP(p))
+ if (p->sleep_avg >= ceiling)
sleep_time = 0;
else if (p->sleep_avg + sleep_time >=
- INTERACTIVE_SLEEP(p)) {
- p->sleep_avg = INTERACTIVE_SLEEP(p);
- sleep_time = 0;
+ ceiling) {
+ p->sleep_avg = ceiling;
+ sleep_time = 0;
}
}

@@ -784,9 +786,9 @@ static int recalc_task_prio(task_t *p, u
*/
p->sleep_avg += sleep_time;

- if (p->sleep_avg > NS_MAX_SLEEP_AVG)
- p->sleep_avg = NS_MAX_SLEEP_AVG;
}
+ if (p->sleep_avg > NS_MAX_SLEEP_AVG)
+ p->sleep_avg = NS_MAX_SLEEP_AVG;
}

return effective_prio(p);
--
-ck

2006-05-19 03:08:20

[permalink] [raw]

Subject: Re: [PATCH] sched: fix interactive ceiling code

On Fri, 2006-05-19 at 11:30 +1000, Con Kolivas wrote:
> Ingo, Andrew, I think these are minor logic fixes and comments that correct
> a patch that has already been pushed to 2.6.17- and I would like them short
> circuited to mainline if everyone is comfortable with it.
>
> Ken, Mike can I ask you to put a signed off on this patch for your
> contributions please?

Done.

(I hope nobody ever rolls a patch from 30 contributors;)

> Martin can I please ask for this to be put on test.kernel.org as a last
> minute sanity check? I know URLs are easier so here is the patch for 2.6.17-rc4:
> http://ck.kolivas.org/patches/crap/sched-fix_interactive_ceiling.patch
>
> ---
> The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect
> and not explicit enough. The sleep boost is not supposed to be any larger
> than without this code and the comment is not clear enough about what exactly
> it does, just the reason it does it. Fix it.
>
> There is a ceiling to the priority beyond which tasks that only ever sleep
> for very long periods cannot surpass. Fix it.
>
> Prevent the on-runqueue bonus logic from defeating the idle sleep logic.
>
> Opportunity to micro-optimise.
>
> Signed-off-by: Con Kolivas <[email protected]>

Signed-off-by: Mike Galbraith <[email protected]>

>
> ---
> kernel/sched.c | 52 +++++++++++++++++++++++++++-------------------------
> 1 files changed, 27 insertions(+), 25 deletions(-)
>
> Index: linux-2.6.17-rc4/kernel/sched.c
> ===================================================================
> --- linux-2.6.17-rc4.orig/kernel/sched.c 2006-05-19 11:25:01.000000000 +1000
> +++ linux-2.6.17-rc4/kernel/sched.c 2006-05-19 11:25:14.000000000 +1000
> @@ -731,33 +731,35 @@ static inline void __activate_idle_task(
> static int recalc_task_prio(task_t *p, unsigned long long now)
> {
> /* Caller must always ensure 'now >= p->timestamp' */
> - unsigned long long __sleep_time = now - p->timestamp;
> - unsigned long sleep_time;
> + unsigned long sleep_time = now - p->timestamp;
>
> if (batch_task(p))
> sleep_time = 0;
> - else {
> - if (__sleep_time > NS_MAX_SLEEP_AVG)
> - sleep_time = NS_MAX_SLEEP_AVG;
> - else
> - sleep_time = (unsigned long)__sleep_time;
> - }
>
> if (likely(sleep_time > 0)) {
> /*
> - * User tasks that sleep a long time are categorised as
> - * idle. They will only have their sleep_avg increased to a
> - * level that makes them just interactive priority to stay
> - * active yet prevent them suddenly becoming cpu hogs and
> - * starving other processes.
> + * This ceiling is set to the lowest priority that would allow
> + * a task to be reinserted into the active array on timeslice
> + * completion.
> */
> - if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) {
> - unsigned long ceiling;
> + unsigned long ceiling = INTERACTIVE_SLEEP(p);
>
> - ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
> - DEF_TIMESLICE);
> - if (p->sleep_avg < ceiling)
> - p->sleep_avg = ceiling;
> + if (p->mm && sleep_time > ceiling && p->sleep_avg < ceiling) {
> + /*
> + * Prevents user tasks from achieving best priority
> + * with one single large enough sleep.
> + */
> + p->sleep_avg = ceiling;
> + /*
> + * Using INTERACTIVE_SLEEP() as a ceiling places a
> + * nice(0) task 1ms sleep away from promotion, and
> + * gives it 700ms to round-robin with no chance of
> + * being demoted. This is more than generous, so
> + * mark this sleep as non-interactive to prevent the
> + * on-runqueue bonus logic from intervening should
> + * this task not receive cpu immediately.
> + */
> + p->sleep_type = SLEEP_NONINTERACTIVE;
> } else {
> /*
> * Tasks waking from uninterruptible sleep are
> @@ -765,12 +767,12 @@ static int recalc_task_prio(task_t *p, u
> * are likely to be waiting on I/O
> */
> if (p->sleep_type == SLEEP_NONINTERACTIVE && p->mm) {
> - if (p->sleep_avg >= INTERACTIVE_SLEEP(p))
> + if (p->sleep_avg >= ceiling)
> sleep_time = 0;
> else if (p->sleep_avg + sleep_time >=
> - INTERACTIVE_SLEEP(p)) {
> - p->sleep_avg = INTERACTIVE_SLEEP(p);
> - sleep_time = 0;
> + ceiling) {
> + p->sleep_avg = ceiling;
> + sleep_time = 0;
> }
> }
>
> @@ -784,9 +786,9 @@ static int recalc_task_prio(task_t *p, u
> */
> p->sleep_avg += sleep_time;
>
> - if (p->sleep_avg > NS_MAX_SLEEP_AVG)
> - p->sleep_avg = NS_MAX_SLEEP_AVG;
> }
> + if (p->sleep_avg > NS_MAX_SLEEP_AVG)
> + p->sleep_avg = NS_MAX_SLEEP_AVG;
> }
>
> return effective_prio(p);

2006-05-19 09:40:19

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH] sched: fix interactive ceiling code

* Con Kolivas <[email protected]> wrote:

> Ingo, Andrew, I think these are minor logic fixes and comments that
> correct a patch that has already been pushed to 2.6.17- and I would
> like them short circuited to mainline if everyone is comfortable with
> it.

the patch looks good, and it certainly wont cause stability problems
(this type of code typically doesnt). It may cause interactivity
problems, but then again this affects boundary conditions, so i'm fairly
optimistic. So if we have say more than say 4 days until 2.6.17 i'd
suggest it for immediate inclusion.

> Signed-off-by: Con Kolivas <[email protected]>

Acked-by: Ingo Molnar <[email protected]>

Ingo

2006-05-19 14:37:47