2003-03-29 21:21:14

by Peter Lundkvist

Subject: Bad interactive behaviour in 2.5.65-66 (sched.c)

Hi,

I have seen long delays when starting e.g. xterm from my
window manager (sawfish) either by keyboard-shortcut or by
menu command (by mouse) starting from 2.5.65. Sometimes it
starts immediately, sometimes after up to 2 seconds (idle
system). If I start a new xterm from xterm it always starts
immediately. 2.5.64 always behaved OK.

My first try to solve this problem was to use some
scheduler parameters from 2.5.64:
#define MAX_TIMESLICE (300 * HZ / 1000)
#define CHILD_PENALTY 95
#define MAX_SLEEP_AVG (2*HZ)
#define STARVATION_LIMIT (2*HZ)

but got the same behaviour.

2nd try was to use sched.c, sched.h from 2.5.64 in a
2.5.66 build + one-line patch in fork.c:
- p->last_run = jiffies;
+ p->sleep_timestamp = jiffies;

Now the system behaves as it should!

My system is a P-III 700 (Inspiron 4000),
and Debian (X is running at nice = -10).

Best regards,
Peter Lundkvist


2003-03-29 23:12:06

by Robert Love

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

On Sat, 2003-03-29 at 16:32, Peter Lundkvist wrote:

> I have seen long delays when starting e.g. xterm from my
> window manager (sawfish) either by keyboard-shortcut or by
> menu command (by mouse) starting from 2.5.65. Sometimes it
> starts immediately, sometimes after up to 2 seconds (idle
> system). If I start a new xterm from xterm it always starts
> immediately. 2.5.64 always behaved OK.

You are not alone...

> My first try to solve this problem was to use some
> scheduler parameters from 2.5.64:
> #define MAX_TIMESLICE (300 * HZ / 1000)
> #define CHILD_PENALTY 95
> #define MAX_SLEEP_AVG (2*HZ)
> #define STARVATION_LIMIT (2*HZ)
>
> but got the same behaviour.

Expected.

> 2nd try was to use sched.c, sched.h from 2.5.64 in a
> 2.5.66 build + one line patch in fork.c:
> - p->last_run = jiffies;
> + p->sleep_timestamp = jiffies;
>
> Now the system behaves as it should!

This seems to confirm it was one of the interactivity changes that went
into 2.5.65. I figured as much but it is nice to get confirmation.
Thank you for trying this.

Now to figure out which one...

> My system is a P-III 700 (Inspiron 4000),
> and Debian (X is running at nice = -10).

I wonder if the reniced X is a factor?

Robert Love

2003-03-30 01:09:55

by Felipe Alfaro Solana

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

On Sun, 2003-03-30 at 00:23, Robert Love wrote:
> This seems to confirm it was one of the interactivity changes that went
> into 2.5.65. I figured as much but it is nice to get confirmation.
> Thank you for trying this.
>
> Now to figure out which one...
>
> > My system is a P-III 700 (Inspiron 4000),
> > and Debian (X is running at nice = -10).
>
> I wonder if the reniced X is a factor?

Theoretically, with interactivity enhancements, you'll never need to
renice X. In fact, I'm running X with no renice and it feels pretty
snappy.

>
> ______________________________________________________________________
> Felipe Alfaro Solana
> Linux Registered User #287198
> http://counter.li.org

2003-03-30 01:54:06

by Robert Love

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

On Sat, 2003-03-29 at 20:21, Felipe Alfaro Solana wrote:

> Theoretically, with interactivity enhancements, you'll never need to
> renice X. In fact, I'm running X with no renice and it feels pretty
> snappy.

I know.

I was wondering, since we are working on an actual bug here, whether or
not renicing X is leading to a starvation issue between X and whatever
is starving. I have seen it before.

My system is responsive, too, and I do not renice X. But it might
help. Or it might cause starvation issues. We have a bug somewhere...

Robert Love

2003-03-30 02:22:04

by Con Kolivas

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

On Sun, 30 Mar 2003 12:05, Robert Love wrote:
> On Sat, 2003-03-29 at 20:21, Felipe Alfaro Solana wrote:
> > Theoretically, with interactivity enhancements, you'll never need to
> > renice X. In fact, I'm running X with no renice and it feels pretty
> > snappy.
>
> I know.
>
> I was wondering, since we are working on an actual bug here, whether or
> not renicing X is leading to a starvation issue between X and whatever
> is starving. I have seen it before.
>
> My system is responsive, too, and I do not renice X. But it might
> help. Or it might cause starvation issues. We have a bug somewhere...

Are you sure this should be called a bug? Basically X is an interactive
process. If it now is "interactive for a priority -10 process" then it should
be hogging the CPU time, no? The priority -10 was a workaround for lack of
interactivity estimation on the old scheduler.

Con

2003-03-30 02:34:55

by Robert Love

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

On Sat, 2003-03-29 at 21:33, Con Kolivas wrote:

> Are you sure this should be called a bug? Basically X is an interactive
> process. If it now is "interactive for a priority -10 process" then it should
> be hogging the CPU time, no? The priority -10 was a workaround for lack of
> interactivity estimation on the old scheduler.

Well, I do not necessarily think that renicing X is the problem. Just
an idea.

We do have a problem, though. Nearly indefinite starvation and all sorts
of weird effects like bash not being able to create a new process... it's a
bug.

Renicing X, aside from some weird client-server starvation issues with
stuff like multimedia programs, should not cause any problem. It should
help, in fact. But you are right, it's not needed in the current
scheduler.

Robert Love

2003-03-30 03:49:46

by Tom Sightler

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

On Sat, 2003-03-29 at 21:46, Robert Love wrote:
> Well, I do not necessarily think that renicing X is the problem. Just
> an idea.
>
> We do have a problem, though. Nearly indefinite starvation and all sorts
> of weird effects like bash not being able to create a new process... it's a
> bug.

On my system I get a starvation issue with just about any CPU-intensive
task. For example, if I create a bzip'd tar file from the Linux kernel
source with the command:

tar cvp linux | bzip2 -9 > linux.tar.bz2

During this entire time I can switch between different windows and
everything seems great, but if I try to do something like run 'ps ax' or
login to another virtual terminal, or start almost any other program, it
takes 30-45 seconds or longer.

With 2.5.64 doing the same 'tar | bzip2' command above takes nearly the
same length of time, but I can go about my business of running other
programs without any of the above issues. It basically seems that the
one process is starving out everything else.

Don't know if this info helps, but it's 100% reproducible on my machine
(a Dell C810 laptop with a 1.13 GHz P3) with both 2.5.65 & 66.

Later,
Tom


2003-03-30 05:10:59

by Andrew Morton

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

Tom Sightler <[email protected]> wrote:
>
> On my system I get a starvation issue with just about any CPU-intensive
> task. For example, if I create a bzip'd tar file from the Linux kernel
> source with the command:
>
> tar cvp linux | bzip2 -9 > linux.tar.bz2
>

Ingo has determined that Linus's backboost trick is causing at least some
of these problems. Please test and report upon the below patch.

I have another workload which is showing starvation with or without this
patch - it is the bitkeeper verification step in a `bk clone' on a
uniprocessor kernel. Still poking at that one.






From: Ingo Molnar <[email protected]>

the patch below fixes George's setiathome problems (as expected). It
essentially turns off Linus' improvement, but I don't think it can be fixed
sanely.

the problem with setiathome is that it displays something every now and
then - so it gets a backboost from X, and hovers at a relatively high
priority.



kernel/sched.c | 13 +------------
1 files changed, 1 insertion(+), 12 deletions(-)

diff -puN kernel/sched.c~sched-interactivity-backboost-revert kernel/sched.c
--- 25/kernel/sched.c~sched-interactivity-backboost-revert 2003-03-28 22:30:08.000000000 -0800
+++ 25-akpm/kernel/sched.c 2003-03-28 22:30:08.000000000 -0800
@@ -379,19 +379,8 @@ static inline int activate_task(task_t *
* boosting tasks that are related to maximum-interactive
* tasks.
*/
- if (sleep_avg > MAX_SLEEP_AVG) {
- if (!in_interrupt()) {
- sleep_avg += current->sleep_avg - MAX_SLEEP_AVG;
- if (sleep_avg > MAX_SLEEP_AVG)
- sleep_avg = MAX_SLEEP_AVG;
-
- if (current->sleep_avg != sleep_avg) {
- current->sleep_avg = sleep_avg;
- requeue_waker = 1;
- }
- }
+ if (sleep_avg > MAX_SLEEP_AVG)
sleep_avg = MAX_SLEEP_AVG;
- }
if (p->sleep_avg != sleep_avg) {
p->sleep_avg = sleep_avg;
p->prio = effective_prio(p);

_

2003-03-30 10:05:47

by Felipe Alfaro Solana

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

On Sun, 2003-03-30 at 04:05, Robert Love wrote:
> I was wondering, since we are working on an actual bug here, whether or
> not renicing X is leading to a starvation issue between X and whatever
> is starving. I have seen it before.
>
> My system is responsive, too, and I do not renice X. But it might
> help. Or it might cause starvation issues. We have a bug somewhere...

I'm gonna try renicing X to see how it behaves...

Felipe Alfaro Solana

2003-03-30 11:06:47

by Mika Liljeberg

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

On Sun, 2003-03-30 at 01:23, Robert Love wrote:
> I wonder if the reniced X is a factor?

I had some interactivity problems with X reniced to -10. It seemed to me
that X was pre-empting the clients and flushing changes to screen too
quickly. It was probably losing out on some screen update optimizations.
I took out the renice and now the system behaves much better.

MikaL

2003-03-30 14:03:22

by Jens Axboe

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

On Sat, Mar 29 2003, Robert Love wrote:
> On Sat, 2003-03-29 at 21:33, Con Kolivas wrote:
>
> > Are you sure this should be called a bug? Basically X is an interactive
> > process. If it now is "interactive for a priority -10 process" then it should
> > be hogging the CPU time, no? The priority -10 was a workaround for lack of
> > interactivity estimation on the old scheduler.
>
> Well, I do not necessarily think that renicing X is the problem. Just
> an idea.

I see the exact same behaviour here (system appears fine, CPU-intensive
app running, attempting to start anything _new_ stalls for ages), and I
definitely don't play X renice tricks.

It basically made 2.5 unusable here, waiting minutes for an ls to even
start displaying _anything_ is totally unacceptable.

--
Jens Axboe

2003-03-30 19:15:03

by Tom Sightler

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

On Sun, 2003-03-30 at 00:23, Andrew Morton wrote:
> Tom Sightler <[email protected]> wrote:
> >
> > On my system I get a starvation issue with just about any CPU-intensive
> > task. For example, if I create a bzip'd tar file from the Linux kernel
> > source with the command:
> >
> > tar cvp linux | bzip2 -9 > linux.tar.bz2
> >
>
> Ingo has determined that Linus's backboost trick is causing at least some
> of these problems. Please test and report upon the below patch.

OK, this definitely makes a big difference for my test cases which
include that 'tar' above as well as a run of dvd::rip. Without this
patch, everything else on my system drops to a total crawl, especially
while dvd::rip is running; dvd::rip itself won't even switch tabs. With
this patch everything seems quite normal and snappy.

I'll try a few more test cases but backing this out certainly seems to
restore the system to the same behavior as 2.5.64. BTW, I'm running
this on 2.5.65-mm4. I would have tested on 2.5.66-mm1 but for some
reason my system locks solid after only a few minutes with it. I
haven't tried to track that down yet.

Later,
Tom


2003-03-30 20:55:14

by Con Kolivas

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

On Mon, 31 Mar 2003 00:14, Jens Axboe wrote:
> On Sat, Mar 29 2003, Robert Love wrote:
> > On Sat, 2003-03-29 at 21:33, Con Kolivas wrote:
> > > Are you sure this should be called a bug? Basically X is an interactive
> > > process. If it now is "interactive for a priority -10 process" then it
> > > should be hogging the CPU time, no? The priority -10 was a workaround
> > > for lack of interactivity estimation on the old scheduler.
> >
> > Well, I do not necessarily think that renicing X is the problem. Just
> > an idea.
>
> I see the exact same behaviour here (system appears fine, CPU-intensive
> app running, attempting to start anything _new_ stalls for ages), and I
> definitely don't play X renice tricks.
>
> It basically made 2.5 unusable here, waiting minutes for an ls to even
> start displaying _anything_ is totally unacceptable.

I guess I should have trusted my own benchmark that was showing this was worse
for system responsiveness.

Con

2003-03-31 02:07:50

by Mike Galbraith

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

At 07:06 AM 3/31/2003 +1000, Con Kolivas wrote:
>On Mon, 31 Mar 2003 00:14, Jens Axboe wrote:
> > On Sat, Mar 29 2003, Robert Love wrote:
> > > On Sat, 2003-03-29 at 21:33, Con Kolivas wrote:
> > > > Are you sure this should be called a bug? Basically X is an interactive
> > > > process. If it now is "interactive for a priority -10 process" then it
> > > > should be hogging the CPU time, no? The priority -10 was a workaround
> > > > for lack of interactivity estimation on the old scheduler.
> > >
> > > Well, I do not necessarily think that renicing X is the problem. Just
> > > an idea.
> >
> > I see the exact same behaviour here (system appears fine, CPU-intensive
> > app running, attempting to start anything _new_ stalls for ages), and I
> > definitely don't play X renice tricks.
> >
> > It basically made 2.5 unusable here, waiting minutes for an ls to even
> > start displaying _anything_ is totally unacceptable.
>
>I guess I should have trusted my own benchmark that was showing this was
>worse for system responsiveness.

I don't think it's really bad for system responsiveness. I think the
problem is just that the sample is too small. The proof is that simply
doing sleep_time %= HZ cures most of my woes. WRT contest and its
io_load, applying even the tiniest percentage of a timeslice penalty per
activation and no other limits _dramatically_ affects the benchmark
numbers. (Try it and you'll see. I posted an [ugly but useful for
experimentation] patch which allows you to set these things and/or disable
them from /proc/sys/sched)

I'm trying something right now that I think might work. I set
MAX_SLEEP_AVG to 10*60*HZ, start init out at max, and never allow it to
degrade. Everyone else is subject to boost and degradation, with the
maximum boost being MAX_SLEEP_AVG/20 (which is still a good long sleep, and
the max that one sleep can boost you is one priority). When you start a
cpu hogging task, it should drop in priority just fine, and rapid context
switchers shouldn't gain such an advantage. We'll see. Tricky part is
setting CHILD_PENALTY to the right number such that fork()->fork() kind of
tasks don't drop down too low and have to crawl back up. Contest falls
into this category.

Anyway, I think that inverting the problem might cure most of the symptoms ;-)

-Mike

2003-03-31 06:24:49

by Jens Axboe

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

On Mon, Mar 31 2003, Mike Galbraith wrote:
> At 07:06 AM 3/31/2003 +1000, Con Kolivas wrote:
> >On Mon, 31 Mar 2003 00:14, Jens Axboe wrote:
> >> On Sat, Mar 29 2003, Robert Love wrote:
> >> > On Sat, 2003-03-29 at 21:33, Con Kolivas wrote:
> >> > > Are you sure this should be called a bug? Basically X is an interactive
> >> > > process. If it now is "interactive for a priority -10 process" then it
> >> > > should be hogging the CPU time, no? The priority -10 was a workaround
> >> > > for lack of interactivity estimation on the old scheduler.
> >> >
> >> > Well, I do not necessarily think that renicing X is the problem. Just
> >> > an idea.
> >>
> >> I see the exact same behaviour here (system appears fine, CPU-intensive
> >> app running, attempting to start anything _new_ stalls for ages), and I
> >> definitely don't play X renice tricks.
> >>
> >> It basically made 2.5 unusable here, waiting minutes for an ls to even
> >> start displaying _anything_ is totally unacceptable.
> >
> >I guess I should have trusted my own benchmark that was showing this was
> >worse for system responsiveness.
>
> I don't think it's really bad for system responsiveness. I think the

What drugs are you on? 2.5.65/66 is the worst interactive kernel I've
ever used, it would be _embarrassing_ to release a 2.6-test with such a
rudimentary flaw in it. IOW, a big show stopper.

> problem is just that the sample is too small. The proof is that simply
> doing sleep_time %= HZ cures most of my woes. WRT contest and its

Irk, that sounds like a really ugly bandaid.

I'm wondering why the scheduler guys aren't all over this problem,
getting it fixed.

--
Jens Axboe

2003-03-31 06:49:55

by Mike Galbraith

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

At 08:35 AM 3/31/2003 +0200, Jens Axboe wrote:
>On Mon, Mar 31 2003, Mike Galbraith wrote:
> > At 07:06 AM 3/31/2003 +1000, Con Kolivas wrote:
> > >On Mon, 31 Mar 2003 00:14, Jens Axboe wrote:
> > >> On Sat, Mar 29 2003, Robert Love wrote:
> > >> > On Sat, 2003-03-29 at 21:33, Con Kolivas wrote:
> > >> > > Are you sure this should be called a bug? Basically X is an interactive
> > >> > > process. If it now is "interactive for a priority -10 process" then it
> > >> > > should be hogging the CPU time, no? The priority -10 was a workaround
> > >> > > for lack of interactivity estimation on the old scheduler.
> > >> >
> > >> > Well, I do not necessarily think that renicing X is the problem. Just
> > >> > an idea.
> > >>
> > >> I see the exact same behaviour here (system appears fine, CPU-intensive
> > >> app running, attempting to start anything _new_ stalls for ages), and I
> > >> definitely don't play X renice tricks.
> > >>
> > >> It basically made 2.5 unusable here, waiting minutes for an ls to even
> > >> start displaying _anything_ is totally unacceptable.
> > >
> > >I guess I should have trusted my own benchmark that was showing this was
> > >worse for system responsiveness.
> >
> > I don't think it's really bad for system responsiveness. I think the
>
>What drugs are you on? 2.5.65/66 is the worst interactive kernel I've
>ever used, it would be _embarrassing_ to release a 2.6-test with such a
>rudimentary flaw in it. IOW, a big show stopper.

It's only horrible when you trigger the problems, otherwise it's wonderful.

> > problem is just that the sample is too small. The proof is that simply
> > doing sleep_time %= HZ cures most of my woes. WRT contest and its
>
>Irk, that sounds like a really ugly bandaid.

Nope, it's a really ugly _tourniquet_ ;-)

>I'm wondering why the scheduler guys aren't all over this problem,
>getting it fixed.

I think they are.

-Mike

2003-03-31 08:35:55

by Felipe Alfaro Solana

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

On Mon, 2003-03-31 at 09:05, Mike Galbraith wrote:
> > > I don't think it's really bad for system responsiveness. I think the
> >
> >What drugs are you on? 2.5.65/66 is the worst interactive kernel I've
> >ever used, it would be _embarrassing_ to release a 2.6-test with such a
> >rudimentary flaw in it. IOW, a big show stopper.
>
> It's only horrible when you trigger the problems, otherwise it's wonderful.

With scheduler tunables (in -mm, for example), setting min_timeslice =
max_timeslice = 25 helps a lot with those problems (at least for me) :-)

> > > problem is just that the sample is too small. The proof is that simply
> > > doing sleep_time %= HZ cures most of my woes. WRT contest and its
> >
> >Irk, that sounds like a really ugly bandaid.
>
> Nope, it's a really ugly _tourniquet_ ;-)
>
> >I'm wondering why the scheduler guys aren't all over this problem,
> >getting it fixed.
>
> I think they are.

I hope so ;-)

Felipe Alfaro Solana

2003-03-31 08:44:17

by Nick Piggin

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)



Mike Galbraith wrote:

> At 08:35 AM 3/31/2003 +0200, Jens Axboe wrote:
>
>> What drugs are you on? 2.5.65/66 is the worst interactive kernel I've
>> ever used, it would be _embarrassing_ to release a 2.6-test with such a
>> rudimentary flaw in it. IOW, a big show stopper.
>
>
> It's only horrible when you trigger the problems, otherwise it's
> wonderful.

Heh heh, yeah the anticipatory io scheduler is like that too ;)

2003-04-01 01:30:31

by Jeremy Fitzhardinge

Subject: Re: Bad interactive behaviour in 2.5.65-66 (sched.c)

On Sat, 2003-03-29 at 21:23, Andrew Morton wrote:
> Tom Sightler <[email protected]> wrote:
> >
> > On my system I get a starvation issue with just about any CPU-intensive
> > task. For example, if I create a bzip'd tar file from the Linux kernel
> > source with the command:
> >
> > tar cvp linux | bzip2 -9 > linux.tar.bz2
> >
>
> Ingo has determined that Linus's backboost trick is causing at least some
> of these problems. Please test and report upon the below patch.
>[...]
> From: Ingo Molnar <[email protected]>
>
> the patch below fixes George's setiathome problems (as expected). It
> essentially turns off Linus' improvement, but I don't think it can be fixed
> sanely.
>
> the problem with setiathome is that it displays something every now and
> then - so it gets a backboost from X, and hovers at a relatively high
> priority.

This fixes the starvation I was getting with xmms visualizers, which
have a similar usage pattern: they're mostly CPU-bound, but they talk to
the X server for drawing.

J