2007-08-01 07:12:49

by Ingo Molnar

Subject: Re: CFS review


Roman,

Thanks for the testing and the feedback, it's much appreciated! :-) On
what platform did you do your tests, and what .config did you use (and
could you please send me your .config)?

Please also send me the output of this script:

http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

(if the output is too large send it to me privately, or bzip2 -9 it.)

Could you also please send the source code for the "l.c" and "lt.c" apps
you used for your testing so i can have a look. Thanks!

Ingo


2007-08-01 07:26:48

by Mike Galbraith

Subject: Re: CFS review

On Wed, 2007-08-01 at 09:12 +0200, Ingo Molnar wrote:
> Roman,
>
> Thanks for the testing and the feedback, it's much appreciated! :-) On
> what platform did you do your tests, and what .config did you use (and
> could you please send me your .config)?
>
> Please also send me the output of this script:
>
> http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
>
> (if the output is too large send it to me privately, or bzip2 -9 it.)
>
> Could you also please send the source code for the "l.c" and "lt.c" apps
> you used for your testing so i can have a look. Thanks!

I haven't been able to reproduce this with any combination of features,
nor with massive_intr tweaked to his work/sleep cycle. I notice he's
collecting stats though, and they look funky. Recompiling.

-Mike

2007-08-01 07:30:57

by Ingo Molnar

Subject: Re: CFS review


* Mike Galbraith <[email protected]> wrote:

> > Thanks for the testing and the feedback, it's much appreciated! :-)
> > On what platform did you do your tests, and what .config did you use
> > (and could you please send me your .config)?
> >
> > Please also send me the output of this script:
> >
> > http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
> >
> > (if the output is too large send it to me privately, or bzip2 -9
> > it.)
> >
> > Could you also please send the source code for the "l.c" and "lt.c"
> > apps you used for your testing so i can have a look. Thanks!
>
> I haven't been able to reproduce this with any combination of
> features, and massive_intr tweaked to his work/sleep cycle. I notice
> he's collecting stats though, and they look funky. Recompiling.

yeah, the posted numbers look most weird, but there's a complete lack of
any identification of test environment - so we'll need some more word
from Roman. Perhaps this was run on some really old box that does not
have a high-accuracy sched_clock()? The patch below should simulate that
scenario on 32-bit x86.

Ingo

Index: linux/arch/i386/kernel/tsc.c
===================================================================
--- linux.orig/arch/i386/kernel/tsc.c
+++ linux/arch/i386/kernel/tsc.c
@@ -110,7 +110,7 @@ unsigned long long native_sched_clock(vo
* very important for it to be as fast as the platform
* can achive it. )
*/
- if (unlikely(!tsc_enabled && !tsc_unstable))
+// if (unlikely(!tsc_enabled && !tsc_unstable))
/* No locking but a rare wrong value is not a big deal: */
return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);

2007-08-01 07:36:20

by Mike Galbraith

Subject: Re: CFS review

On Wed, 2007-08-01 at 09:30 +0200, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:

> > I haven't been able to reproduce this with any combination of
> > features, and massive_intr tweaked to his work/sleep cycle. I notice
> > he's collecting stats though, and they look funky. Recompiling.
>
> yeah, the posted numbers look most weird, but there's a complete lack of
> any identification of test environment - so we'll need some more word
> from Roman. Perhaps this was run on some really old box that does not
> have a high-accuracy sched_clock()? The patch below should simulate that
> scenario on 32-bit x86.
>
> Ingo
>
> Index: linux/arch/i386/kernel/tsc.c
> ===================================================================
> --- linux.orig/arch/i386/kernel/tsc.c
> +++ linux/arch/i386/kernel/tsc.c
> @@ -110,7 +110,7 @@ unsigned long long native_sched_clock(vo
> * very important for it to be as fast as the platform
> * can achive it. )
> */
> - if (unlikely(!tsc_enabled && !tsc_unstable))
> +// if (unlikely(!tsc_enabled && !tsc_unstable))
> /* No locking but a rare wrong value is not a big deal: */
> return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
>

Ah, thanks. I noticed that clocksource= went away. I'll test with
stats, with and without jiffies resolution.

-Mike

2007-08-01 08:49:36

by Mike Galbraith

Subject: Re: CFS review

On Wed, 2007-08-01 at 09:36 +0200, Mike Galbraith wrote:
> On Wed, 2007-08-01 at 09:30 +0200, Ingo Molnar wrote:
>
> > yeah, the posted numbers look most weird, but there's a complete lack of
> > any identification of test environment - so we'll need some more word
> > from Roman. Perhaps this was run on some really old box that does not
> > have a high-accuracy sched_clock()? The patch below should simulate that
> > scenario on 32-bit x86.
> >
> > Ingo
> >
> > Index: linux/arch/i386/kernel/tsc.c
> > ===================================================================
> > --- linux.orig/arch/i386/kernel/tsc.c
> > +++ linux/arch/i386/kernel/tsc.c
> > @@ -110,7 +110,7 @@ unsigned long long native_sched_clock(vo
> > * very important for it to be as fast as the platform
> > * can achive it. )
> > */
> > - if (unlikely(!tsc_enabled && !tsc_unstable))
> > +// if (unlikely(!tsc_enabled && !tsc_unstable))
> > /* No locking but a rare wrong value is not a big deal: */
> > return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
> >
>
> Ah, thanks. I noticed that clocksource= went away. I'll test with
> stats, with and without jiffies resolution.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
6465 root 20 0 1432 356 296 R 30 0.0 1:02.55 1 chew
6462 root 20 0 1576 216 140 R 23 0.0 0:50.29 1 massive_intr_x
6463 root 20 0 1576 216 140 R 23 0.0 0:50.23 1 massive_intr_x
6464 root 20 0 1576 216 140 R 23 0.0 0:50.28 1 massive_intr_x

Well, a jiffies-resolution clock did upset fairness a bit with a burn
time right at jiffies resolution, but not nearly as badly as on Roman's
box, and not in favor of the sleepers. With the longer burn time of stock
massive_intr.c (8 ms burn, 1 ms sleep), the lower-resolution clock didn't
upset it.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
6511 root 20 0 1572 220 140 R 25 0.0 1:00.11 1 massive_intr
6512 root 20 0 1572 220 140 R 25 0.0 1:00.14 1 massive_intr
6514 root 20 0 1432 356 296 R 25 0.0 1:00.31 1 chew
6513 root 20 0 1572 220 140 R 24 0.0 1:00.14 1 massive_intr

-Mike

2007-08-01 13:19:42

by Roman Zippel

Subject: Re: CFS review

Hi,

On Wed, 1 Aug 2007, Ingo Molnar wrote:

> Please also send me the output of this script:
>
> http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

Sent privately.

> Could you also please send the source code for the "l.c" and "lt.c" apps
> you used for your testing so i can have a look. Thanks!

l.c is a simple busy loop (well, with the option to start many of them).
This is lt.c; what it does is run for a bit less than a jiffy, so it
needs a low-resolution clock to trigger the problem:

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>
#include <sys/time.h>

#define NSEC 1000000000
#define USEC 1000000

#define PERIOD (NSEC/1000)

int i;

void worker(int sig)
{
	struct timeval tv;
	long long t0, t;

	gettimeofday(&tv, 0);
	//printf("%u,%lu\n", i, tv.tv_usec);
	t0 = (long long)tv.tv_sec * 1000000 + tv.tv_usec + PERIOD / 1000 - 50;
	do {
		gettimeofday(&tv, 0);
		t = (long long)tv.tv_sec * 1000000 + tv.tv_usec;
	} while (t < t0);
}

int main(int ac, char **av)
{
	int cnt;
	timer_t timer;
	struct itimerspec its;
	struct sigaction sa;

	cnt = i = atoi(av[1]);

	sa.sa_handler = worker;
	sa.sa_flags = 0;
	sigemptyset(&sa.sa_mask);

	sigaction(SIGALRM, &sa, 0);

	clock_gettime(CLOCK_MONOTONIC, &its.it_value);
	its.it_interval.tv_sec = 0;
	its.it_interval.tv_nsec = PERIOD * cnt;

	while (--i > 0 && fork() > 0)
		;

	its.it_value.tv_nsec += i * PERIOD;
	if (its.it_value.tv_nsec > NSEC) {
		its.it_value.tv_sec++;
		its.it_value.tv_nsec -= NSEC;
	}

	timer_create(CLOCK_MONOTONIC, 0, &timer);
	timer_settime(timer, TIMER_ABSTIME, &its, 0);

	printf("%u,%lu\n", i, its.it_interval.tv_nsec);

	while (1)
		pause();
	return 0;
}

2007-08-01 15:08:23

by Ingo Molnar

Subject: Re: CFS review


* Roman Zippel <[email protected]> wrote:

> > Please also send me the output of this script:
> >
> > http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
>
> Send privately.

thanks. Just to make sure: while you said that your TSC was off on that
laptop, your bootup log suggests a working TSC:

Time: tsc clocksource has been installed.

and still your lt.c testcase produces the top output that you
reported in your first mail? If so then this could be a regression. Or
did you turn off the TSC manually via notsc? (Or was it with a different
.config, or on a different machine?) Please help us figure this out
exactly; we don't want a real regression to go unnoticed.

If you can reproduce the problem with a working TSC then please
generate a second cfs-debug-info.sh snapshot _while_ your lt+l workload
is running and send that to me (i'll reply back to it publicly). Thanks,

Ingo

2007-08-01 16:15:51

by Andi Kleen

Subject: Re: CFS review

Ingo Molnar <[email protected]> writes:

> thanks. Just to make sure, while you said that your TSC was off on that
> laptop, the bootup log of yours suggests a working TSC:
>
> Time: tsc clocksource has been installed.

Standard kernels often disable the TSC later, after running with it for
a bit (e.g. on any cpufreq change without a p-state invariant TSC).

-Andi

2007-08-01 16:29:11

by Linus Torvalds

Subject: Re: CFS review



On Wed, 1 Aug 2007, Andi Kleen wrote:

> Ingo Molnar <[email protected]> writes:
>
> > thanks. Just to make sure, while you said that your TSC was off on that
> > laptop, the bootup log of yours suggests a working TSC:
> >
> > Time: tsc clocksource has been installed.
>
> Standard kernels often disable the TSC later after running a bit
> with it (e.g. on any cpufreq change without p state invariant TSC)

I assume that what Roman hit was that he had explicitly disabled the TSC
because of TSC instability with the "notsc" kernel command line. Which
disables it *entirely*.

That *used* to be the right thing to do, since the gettimeofday() logic
originally didn't know about TSC instability, and it just resulted in
somewhat flaky timekeeping.

These days, of course, we should notice it on our own, and just switch
away from the TSC as a reliable clock-source, but still allow it to be
used for the cases where absolute accuracy is not a big issue.

So I suspect that Roman - by virtue of being an old-timer - ends up having
a workaround for an old problem that isn't needed, and that in turn ends
up meaning that his scheduler clock also ends up using the really not very
good timer tick..

Linus

2007-08-01 17:48:56

by Andi Kleen

Subject: Re: CFS review

> I assume that what Roman hit was that he had explicitly disabled the TSC
> because of TSC instability with the "notsc" kernel command line. Which
> disabled is *entirely*.

It might just have been cpufreq. That hits nearly everybody with cpufreq
unless you have a p-state invariant TSC, and on older laptops that's
pretty much always the case.

It used to not be that drastic, but since i386 switched to the
generic clock framework it is like that :/

> These days, of course, we should notice it on our own, and just switch
> away from the TSC as a reliable clock-source, but still allow it to be
> used for the cases where absolute accuracy is not a big issue.

The rewritten sched_clock() I still have queued does just that. I planned
to submit it for .23, but then during later in-depth testing on my
machine park I found a showstopper that I couldn't fix in time.
Hopefully for .24.

-Andi

2007-08-01 17:50:29

by Ingo Molnar

Subject: Re: CFS review


* Linus Torvalds <[email protected]> wrote:

> On Wed, 1 Aug 2007, Andi Kleen wrote:
>
> > Ingo Molnar <[email protected]> writes:
> >
> > > thanks. Just to make sure, while you said that your TSC was off on that
> > > laptop, the bootup log of yours suggests a working TSC:
> > >
> > > Time: tsc clocksource has been installed.
> >
> > Standard kernels often disable the TSC later after running a bit
> > with it (e.g. on any cpufreq change without p state invariant TSC)
>
> I assume that what Roman hit was that he had explicitly disabled the
> TSC because of TSC instability with the "notsc" kernel command line.
> Which disabled is *entirely*.

but that does not appear to be the case: the debug info i got from Roman
includes the following boot options:

Kernel command line: auto BOOT_IMAGE=2.6.23-rc1-git9 ro root=306

there's no "notsc" option there.

Andi's theory cannot be true either: Roman's debug info also shows this
/proc/<PID>/sched data:

clock-delta : 95

that means that sched_clock() is in high-res mode, the TSC is alive and
kicking and a sched_clock() call took 95 nanoseconds.

Roman, could you please help us with this mystery?

Ingo

2007-08-01 18:01:59

by Roman Zippel

Subject: Re: CFS review

Hi,

On Wed, 1 Aug 2007, Ingo Molnar wrote:

> Andi's theory cannot be true either, Roman's debug info also shows this
> /proc/<PID>/sched data:
>
> clock-delta : 95
>
> that means that sched_clock() is in high-res mode, the TSC is alive and
> kicking and a sched_clock() call took 95 nanoseconds.
>
> Roman, could you please help us with this mystery?

Actually, Andi is right. What I sent you was generated directly after
boot, as I had to reboot for the right kernel; a little later this
appeared:

Aug 1 14:54:30 spit kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
Aug 1 15:09:56 spit kernel: Clocksource tsc unstable (delta = 656747233 ns)
Aug 1 15:09:56 spit kernel: Time: pit clocksource has been installed.

bye, Roman

2007-08-01 19:06:35

by Ingo Molnar

Subject: Re: CFS review


* Roman Zippel <[email protected]> wrote:

> Hi,
>
> On Wed, 1 Aug 2007, Ingo Molnar wrote:
>
> > Andi's theory cannot be true either, Roman's debug info also shows this
> > /proc/<PID>/sched data:
> >
> > clock-delta : 95
> >
> > that means that sched_clock() is in high-res mode, the TSC is alive and
> > kicking and a sched_clock() call took 95 nanoseconds.
> >
> > Roman, could you please help us with this mystery?
>
> Actually, Andi is right. What I sent you was generated directly after
> boot, as I had to reboot for the right kernel, so a little later
> appeared this:
>
> Aug 1 14:54:30 spit kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
> Aug 1 15:09:56 spit kernel: Clocksource tsc unstable (delta = 656747233 ns)
> Aug 1 15:09:56 spit kernel: Time: pit clocksource has been installed.

just to make sure: how does the 'top' output of the l + "lt 3" testcase
look now on your laptop? Yesterday it was this:

4544 roman 20 0 1796 520 432 S 32.1 0.4 0:21.08 lt
4545 roman 20 0 1796 344 256 R 32.1 0.3 0:21.07 lt
4546 roman 20 0 1796 344 256 R 31.7 0.3 0:21.07 lt
4547 roman 20 0 1532 272 216 R 3.3 0.2 0:01.94 l

and i'm still wondering how that output was possible.

Ingo

2007-08-09 23:13:53

by Roman Zippel

Subject: Re: CFS review

Hi,

On Wed, 1 Aug 2007, Ingo Molnar wrote:

> just to make sure, how does 'top' output of the l + "lt 3" testcase look
> like now on your laptop? Yesterday it was this:
>
> 4544 roman 20 0 1796 520 432 S 32.1 0.4 0:21.08 lt
> 4545 roman 20 0 1796 344 256 R 32.1 0.3 0:21.07 lt
> 4546 roman 20 0 1796 344 256 R 31.7 0.3 0:21.07 lt
> 4547 roman 20 0 1532 272 216 R 3.3 0.2 0:01.94 l
>
> and i'm still wondering how that output was possible.

I disabled the jiffies logic and the result is still the same, so this
problem isn't related to resolution at all.
I traced it a little and what's happening is that the busy loop really
only gets a little time; it only runs in between the timer tasks. When
the timer task is woken up __enqueue_sleeper() updates sleeper_bonus,
and a little later, when the busy loop is preempted, __update_curr() is
called a last time and it's fully hit by the sleeper_bonus. So the timer
tasks use less time than they actually get and thus produce overflows;
the busy loop OTOH is punished and underflows.
So it seems my initial suspicion was right and this logic is dodgy. What
is it actually supposed to do? Why is some random task accounted with the
sleeper_bonus?

bye, Roman

PS: Can I still expect an answer about all the other stuff?

2007-08-10 05:50:10

by Ingo Molnar

Subject: Re: CFS review


* Roman Zippel <[email protected]> wrote:

> > 4544 roman 20 0 1796 520 432 S 32.1 0.4 0:21.08 lt
> > 4545 roman 20 0 1796 344 256 R 32.1 0.3 0:21.07 lt
> > 4546 roman 20 0 1796 344 256 R 31.7 0.3 0:21.07 lt
> > 4547 roman 20 0 1532 272 216 R 3.3 0.2 0:01.94 l
> >
> > and i'm still wondering how that output was possible.
>
> I disabled the jiffies logic and the result is still the same, so this
> problem isn't related to resolution at all.

how did you disable the jiffies logic? Also, could you please send me
the cfs-debug-info.sh:

http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

captured _while_ the above workload is running. This is the third time
i've asked for that :-)

to establish that the basic sched_clock() behavior is sound on that box,
could you please also run this tool:

http://people.redhat.com/mingo/cfs-scheduler/tools/tsc-dump.c

please run it both while the system is idle, and while there's a CPU hog
running:

while :; do :; done &

and send me that output too? (it's 2x 60 lines only) Thanks!

Ingo

2007-08-10 07:23:49

by Mike Galbraith

Subject: Re: CFS review

On Fri, 2007-08-10 at 01:14 +0200, Roman Zippel wrote:
> Hi,

Greetings,

> On Wed, 1 Aug 2007, Ingo Molnar wrote:
>
> > just to make sure, how does 'top' output of the l + "lt 3" testcase look
> > like now on your laptop? Yesterday it was this:
> >
> > 4544 roman 20 0 1796 520 432 S 32.1 0.4 0:21.08 lt
> > 4545 roman 20 0 1796 344 256 R 32.1 0.3 0:21.07 lt
> > 4546 roman 20 0 1796 344 256 R 31.7 0.3 0:21.07 lt
> > 4547 roman 20 0 1532 272 216 R 3.3 0.2 0:01.94 l
> >
> > and i'm still wondering how that output was possible.
>
> I disabled the jiffies logic and the result is still the same, so this
> problem isn't related to resolution at all.
> I traced it a little and what's happing is that the busy loop really only
> gets little time, it only runs inbetween the timer tasks. When the timer
> task is woken up __enqueue_sleeper() updates sleeper_bonus and a little
> later when the busy loop is preempted __update_curr() is called a last
> time and it's fully hit by the sleeper_bonus. So the timer tasks use less
> time than they actually get and thus produce overflows, the busy loop OTOH
> is punished and underflows.

I still can't reproduce this here. Can you please send your .config, so
I can try again with a config as close to yours as possible?

-Mike

2007-08-10 13:52:17

by Roman Zippel

Subject: Re: CFS review

Hi,

On Fri, 10 Aug 2007, Ingo Molnar wrote:

> > I disabled the jiffies logic and the result is still the same, so this
> > problem isn't related to resolution at all.
>
> how did you disable the jiffies logic?

I commented it out.

> Also, could you please send me
> the cfs-debug-info.sh:
>
> http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
>
> captured _while_ the above workload is running. This is the third time
> i've asked for that :-)

Is there any reason to believe my analysis is wrong?
So far you haven't answered a single question about the CFS design...

Anyway, I'll give you something better: the raw trace data for 2 ms:

1186747669.274790012: update_curr 0xc7fb06f0,479587,319708,21288884188,159880,7360532
1186747669.274790375: dequeue_entity 0xc7fb06f0,21280402988,159880
1186747669.274792580: sched 2848,2846,0xc7432cb0,-7520413
1186747669.274820987: update_curr 0xc7432ce0,29302,-130577,21288913490,1,-7680293
1186747669.274821269: dequeue_entity 0xc7432ce0,21296077409,1
1186747669.274821930: enqueue_entity 0xc7432ce0,21296593783,1
1186747669.274826979: update_curr 0xc7432ce0,5707,5707,21288919197,1,-7680294
1186747669.274827724: enqueue_entity 0xc7432180,21280919197,639451
1186747669.274829948: update_curr 0xc7432ce0,1553,-318172,21288920750,319726,-8000000
1186747669.274831878: sched 2846,2847,0xc7432150,8000000
1186747669.275789883: update_curr 0xc7432180,479797,319935,21289400547,159864,7360339
1186747669.275790295: dequeue_entity 0xc7432180,21280919197,159864
1186747669.275792439: sched 2847,2846,0xc7432cb0,-7520203
1186747669.275820819: update_curr 0xc7432ce0,29238,-130625,21289429785,1,-7680067
1186747669.275821109: dequeue_entity 0xc7432ce0,21296593783,1
1186747669.275821763: enqueue_entity 0xc7432ce0,21297109852,1
1186747669.275826887: update_curr 0xc7432ce0,5772,5772,21289435557,1,-7680068
1186747669.275827652: enqueue_entity 0xc7fb0ca0,21281435557,639881
1186747669.275829826: update_curr 0xc7432ce0,1549,-318391,21289437106,319941,-8000000
1186747669.275831584: sched 2846,2849,0xc7fb0c70,8000000

About the values:

update_curr: sched_entity, delta_fair, delta_mine, fair_clock, sleeper_bonus, wait_runtime
  (final values at the end of __update_curr)
{en,de}queue_entity: sched_entity, fair_key, sleeper_bonus
  (at the start of __enqueue_entity/__dequeue_entity)
sched: prev_pid, pid, current, wait_runtime
  (at the end of scheduling; note that current has a small structure
  offset to sched_entity)

It starts with a timer task going to sleep; the busy loop runs for a few
usec until the timer tick, when the next timer task is woken up (and
sleeper_bonus is increased). Before the tasks are switched, the current
task is updated and punished with the sleeper_bonus.

These tests were done without the recent updates, but those don't seem to
change the basic logic. AFAICT the change to __update_curr() only makes
it more unpredictable which task is punished with the sleeper_bonus.

So again, what is this logic _supposed_ to do?

bye, Roman

2007-08-10 14:19:18

by Ingo Molnar

Subject: Re: CFS review


* Roman Zippel <[email protected]> wrote:

> > Also, could you please send me
> > the cfs-debug-info.sh:
> >
> > http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
> >
> > captured _while_ the above workload is running. This is the third time
> > i've asked for that :-)
>
> Is there any reason to believe my analysis is wrong?

please first give me the debug data captured with the script above
(while the workload is running) - so that i can see the full picture
about what's happening. Thanks,

Ingo

2007-08-10 16:47:56

by Mike Galbraith

Subject: Re: CFS review

I guess I'm going to have to give up on trying to reproduce this... my
3GHz P4 is just not getting there from here. Last attempt, compiled UP,
HZ=1000 dynticks, full preempt and highres timers fwiw.

6392 root 20 0 1696 332 248 R 25.5 0.0 3:00.14 0 lt
6393 root 20 0 1696 332 248 R 24.9 0.0 3:00.15 0 lt
6391 root 20 0 1696 488 404 R 24.7 0.0 3:00.20 0 lt
6394 root 20 0 2888 1232 1028 R 24.5 0.1 2:58.58 0 sh

-Mike

2007-08-10 16:54:52

by Michael Chang

Subject: Re: CFS review

On 8/10/07, Roman Zippel <[email protected]> wrote:
> Is there any reason to believe my analysis is wrong?

Not yet, but if you give Ingo what he wants (as opposed to what you're
giving him) it'll be easier for him to answer what's going wrong, and
perhaps "fix" the problem to boot.

(The script gives info about CPU characteristics, interrupts,
modules, etc. -- you know, all those "unknown" variables.)

And perhaps a patch to show what parts you commented out, too, so one
can tell if anything got broken (unintentionally).

--
Michael Chang

Please avoid sending me Word or PowerPoint attachments. Send me ODT,
RTF, or HTML instead.
See http://www.gnu.org/philosophy/no-word-attachments.html
Thank you.

2007-08-10 17:19:19

by Roman Zippel

Subject: Re: CFS review

Hi,

On Fri, 10 Aug 2007, Mike Galbraith wrote:

> I guess I'm going to have to give up on trying to reproduce this... my
> 3GHz P4 is just not getting there from here. Last attempt, compiled UP,
> HZ=1000 dynticks, full preempt and highres timers fwiw.
>
> 6392 root 20 0 1696 332 248 R 25.5 0.0 3:00.14 0 lt
> 6393 root 20 0 1696 332 248 R 24.9 0.0 3:00.15 0 lt
> 6391 root 20 0 1696 488 404 R 24.7 0.0 3:00.20 0 lt
> 6394 root 20 0 2888 1232 1028 R 24.5 0.1 2:58.58 0 sh

Except for UP and HZ=1000, everything else is pretty much turned off.
If you use a very recent kernel, the problem may not be visible like this
anymore.
It may be a bit easier to reproduce if you change the end time t0 in lt.c
a little. Also try starting the busy loop first.

bye, Roman

2007-08-10 17:25:48

by Roman Zippel

Subject: Re: CFS review

Hi,

On Fri, 10 Aug 2007, Michael Chang wrote:

> On 8/10/07, Roman Zippel <[email protected]> wrote:
> > Is there any reason to believe my analysis is wrong?
>
> Not yet, but if you give Ingo what he wants (as opposed to what you're
> giving him) it'll be easier for him to answer what's going wrong, and
> perhaps "fix" the problem to boot.
>
> (The scripts gives info about CPU characteristics, interrupts,
> modules, etc. -- you know, all those "unknown" variables.)

He already has most of this information and the trace shows _exactly_
what's going on. All this information should be more than enough to allow
an initial judgement of whether my analysis is correct.
Also, none of this information is needed to explain the CFS logic a
little more, which I'm still waiting for...

bye, Roman

2007-08-10 19:44:26

by Ingo Molnar

Subject: Re: CFS review


* Roman Zippel <[email protected]> wrote:

> > Not yet, but if you give Ingo what he wants (as opposed to what
> > you're giving him) it'll be easier for him to answer what's going
> > wrong, and perhaps "fix" the problem to boot.
> >
> > (The scripts gives info about CPU characteristics, interrupts,
> > modules, etc. -- you know, all those "unknown" variables.)
>
> He already has most of this information and the trace shows _exactly_
> what's going on. [...]

i'll need the other bits of information too to have a complete picture
about what's going on while your test is running - to maximize the
chances of me being able to fix it. I'm a bit perplexed (and a bit
worried) about this - you've spent _far_ more effort to _not send_ that
script output (captured while the workload is running) than it would
have taken to do it :-/ If you'd like me to fix bugs then please just
send it (in private mail if you want) - or give me an ssh login to that
box - whichever variant you prefer. Thanks,

Ingo

2007-08-10 19:56:03

by Willy Tarreau

Subject: Re: CFS review

On Fri, Aug 10, 2007 at 07:25:57PM +0200, Roman Zippel wrote:
> Hi,
>
> On Fri, 10 Aug 2007, Michael Chang wrote:
>
> > On 8/10/07, Roman Zippel <[email protected]> wrote:
> > > Is there any reason to believe my analysis is wrong?
> >
> > Not yet, but if you give Ingo what he wants (as opposed to what you're
> > giving him) it'll be easier for him to answer what's going wrong, and
> > perhaps "fix" the problem to boot.
> >
> > (The scripts gives info about CPU characteristics, interrupts,
> > modules, etc. -- you know, all those "unknown" variables.)
>
> He already has most of this information and the trace shows _exactly_
> what's going on. All this information should be more than enough to allow
> an initial judgement whether my analysis is correct.
> Also none of this information is needed to explain the CFS logic a little
> more, which I'm still waiting for...

Roman,

fortunately, not all bug reporters are like you. It's amazing how long
you can resist sending a simple bug report to a developer! Maybe you
consider that you need to fix the bug by yourself after you understand
the code, but if you systematically refuse to return the small amount of
information Ingo asks for, we will have to wait for some more cooperative
users to be hit by the same bug when 2.6.23 is released, which is stupid.

I thought you could at least understand that one developer who is used
to reading traces from the same tool every day will be far faster at
decoding a trace from that tool than at trying to figure out what your
self-made dump means.

It's the exact same reason I ask for pcap files when people send me
outputs of tcpdumps without the information I *need*.

If you definitely do not want to cooperate, stop asking for a personal
explanation, and go figure out by yourself how the code works. BTW, in
the trace you "kindly offered" in exchange for the cfs-debug-info dump,
you show several useful variables, but nothing says where they are
captured. And as you can see, they're changing. That's a fantastic
trace for a developer, really...

Please try to be a little bit more transparent if you really want the
bugs fixed, and don't behave as if you wanted this bug to survive
till -final.

Thanks,
Willy

2007-08-10 21:15:46

by Roman Zippel

Subject: Re: CFS review

Hi,

On Fri, 10 Aug 2007, Willy Tarreau wrote:

> fortunately all bug reporters are not like you. It's amazing how long
> you can resist sending a simple bug report to a developer!

I'm more amazed at how long Ingo can resist providing some explanations
(not just about this problem).
It's not like I haven't given him anything: he already has the test
programs, and he already knows the system configuration.
Well, I've sent him the stuff now...

> Maybe you
> consider that you need to fix the bug by yourself after you understand
> the code,

Fixing the bug requires some knowledge what the code is intended to do.

> Please try to be a little bit more transparent if you really want the
> bugs fixed, and don't behave as if you wanted this bug to survive
> till -final.

Could you please ask Ingo the same? I'm simply trying to get some
transparency into the CFS design. Without further information it's
difficult to tell whether something is supposed to work this way or it's
a bug.

In this case it's quite possible that due to a recent change my testcase
doesn't work anymore. Should I consider the problem fixed or did it just
go into hiding? Without more information it's difficult to verify this
independently.

bye, Roman

2007-08-10 21:36:45

by Ingo Molnar

Subject: Re: CFS review


* Roman Zippel <[email protected]> wrote:

> Well, I've sent him the stuff now...

received it - thanks a lot, looking at it!

> It's not like I haven't given him anything, he already has the test
> programs, he already knows the system configuration.

one more small thing: could you please send your exact .config (Mike
asked for that too, and i did too on two prior occasions). Sometimes
unexpected little details in the .config make a difference. We are not
asking you that because we are second-guessing you in any way; the
reason is simple: i frequently boot _the very .config that others use_,
and see surprising reproducibility of bugs that i couldn't trigger
before. It's standard procedure to just pick up the .config of others to
eliminate a whole bunch of degrees of freedom for a bug to hide behind -
and your "it's a pretty standard config" description doesn't really
achieve that. It probably won't make a real difference, but it's really
easy for you to send and it's still very useful when one tries to
eliminate possibilities and wants to concentrate on the remaining
possibilities alone. Thanks again,

Ingo

2007-08-10 22:50:00

by Roman Zippel

Subject: Re: CFS review

Hi,

On Fri, 10 Aug 2007, Ingo Molnar wrote:

> achieve that. It probably won't make a real difference, but it's really
> easy for you to send and it's still very useful when one tries to
> eliminate possibilities and when one wants to concentrate on the
> remaining possibilities alone.

The thing I'm afraid of with CFS is its possible unpredictability, which
would make it hard to reproduce problems; we may end up with users
hitting weird, unexplainable problems. That's the main reason I'm trying
so hard to push for a design discussion.

Just to give an idea, here are two more examples of irregular behaviour,
which are hopefully easier to reproduce.

1. Two simple busy loops, one of them reniced to 15. According to my
calculations the reniced task should get about 3.4% (1/(1.25^15+1)), but
I get this:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4433 roman 20 0 1532 300 244 R 99.2 0.2 5:05.51 l
4434 roman 35 15 1532 72 16 R 0.7 0.1 0:10.62 l

OTOH, up to nice level 12 I get what I expect.

2. If I start 20 busy loops, initially I see in top that every task gets
5% and time increments equally (as it should):

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4492 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4491 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4490 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4489 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4488 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4487 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4486 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4485 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4484 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4483 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4482 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4481 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4480 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4479 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4478 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4477 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4476 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4475 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4474 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4473 roman 20 0 1532 296 244 R 5.0 0.2 0:02.86 l

But if I renice all of them to -15, the time every task gets becomes
rather random:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4492 roman 5 -15 1532 68 16 R 1.0 0.1 0:07.95 l
4491 roman 5 -15 1532 68 16 R 4.3 0.1 0:07.62 l
4490 roman 5 -15 1532 68 16 R 3.3 0.1 0:07.50 l
4489 roman 5 -15 1532 68 16 R 7.6 0.1 0:07.80 l
4488 roman 5 -15 1532 68 16 R 9.6 0.1 0:08.31 l
4487 roman 5 -15 1532 68 16 R 3.3 0.1 0:07.59 l
4486 roman 5 -15 1532 68 16 R 6.6 0.1 0:07.08 l
4485 roman 5 -15 1532 68 16 R 10.0 0.1 0:07.31 l
4484 roman 5 -15 1532 68 16 R 8.0 0.1 0:07.30 l
4483 roman 5 -15 1532 68 16 R 7.0 0.1 0:07.34 l
4482 roman 5 -15 1532 68 16 R 1.0 0.1 0:05.84 l
4481 roman 5 -15 1532 68 16 R 1.0 0.1 0:07.16 l
4480 roman 5 -15 1532 68 16 R 3.3 0.1 0:07.00 l
4479 roman 5 -15 1532 68 16 R 1.0 0.1 0:06.66 l
4478 roman 5 -15 1532 68 16 R 8.6 0.1 0:06.96 l
4477 roman 5 -15 1532 68 16 R 8.6 0.1 0:07.63 l
4476 roman 5 -15 1532 68 16 R 9.6 0.1 0:07.38 l
4475 roman 5 -15 1532 68 16 R 1.3 0.1 0:07.09 l
4474 roman 5 -15 1532 68 16 R 2.3 0.1 0:07.97 l
4473 roman 5 -15 1532 296 244 R 1.0 0.2 0:07.73 l

bye, Roman

2007-08-11 00:31:20

by Ingo Molnar

Subject: Re: CFS review


* Ingo Molnar <[email protected]> wrote:

> * Roman Zippel <[email protected]> wrote:
>
> > Well, I've sent him the stuff now...
>
> received it - thanks a lot, looking at it!

everything looks good in your debug output and the TSC dump data, except
for the wait_runtime values: they are quite out of balance, and that
imbalance cannot be explained by jiffies granularity or by any sort of
sched_clock() artifact. So this clearly looks like a CFS regression that
should be fixed.

the only relevant thing that comes to mind at the moment is that last
week Peter noticed a buggy aspect of sleeper bonuses (we do not
rate-limit their output, hence we 'waste' them instead of redistributing
them), and i've got the small patch below in my queue to fix that -
could you give it a try?

this is just a blind stab in the dark - i couldn't see any real impact
from that patch in various workloads (and it's not upstream yet), so it
might not make a big difference. The trace you did (could you send the
source for that?) seems to implicate sleeper bonuses though.

if this patch doesn't help, could you check the general theory that it's
related to sleeper fairness, by turning it off:

echo 30 > /proc/sys/kernel/sched_features

does the bug go away if you do that? If sleeper bonuses are showing too
many artifacts then we could turn them off for the final .23.

Ingo

--------------------->
Subject: sched: fix sleeper bonus
From: Ingo Molnar <[email protected]>

Peter Zijlstra noticed that the sleeper bonus deduction code was not
properly rate-limited: a task that scheduled more frequently would get a
disproportionately large deduction. So limit the deduction to delta_exec
and limit production to runtime_limit.

Not-Yet-Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched_fair.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -75,7 +75,7 @@ enum {

unsigned int sysctl_sched_features __read_mostly =
SCHED_FEAT_FAIR_SLEEPERS *1 |
- SCHED_FEAT_SLEEPER_AVG *1 |
+ SCHED_FEAT_SLEEPER_AVG *0 |
SCHED_FEAT_SLEEPER_LOAD_AVG *1 |
SCHED_FEAT_PRECISE_CPU_LOAD *1 |
SCHED_FEAT_START_DEBIT *1 |
@@ -304,11 +304,9 @@ __update_curr(struct cfs_rq *cfs_rq, str
delta_mine = calc_delta_mine(delta_exec, curr->load.weight, lw);

if (cfs_rq->sleeper_bonus > sysctl_sched_granularity) {
- delta = calc_delta_mine(cfs_rq->sleeper_bonus,
- curr->load.weight, lw);
- if (unlikely(delta > cfs_rq->sleeper_bonus))
- delta = cfs_rq->sleeper_bonus;
-
+ delta = min(cfs_rq->sleeper_bonus, (u64)delta_exec);
+ delta = calc_delta_mine(delta, curr->load.weight, lw);
+ delta = min((u64)delta, cfs_rq->sleeper_bonus);
cfs_rq->sleeper_bonus -= delta;
delta_mine -= delta;
}
@@ -521,6 +519,8 @@ static void __enqueue_sleeper(struct cfs
* Track the amount of bonus we've given to sleepers:
*/
cfs_rq->sleeper_bonus += delta_fair;
+ if (unlikely(cfs_rq->sleeper_bonus > sysctl_sched_runtime_limit))
+ cfs_rq->sleeper_bonus = sysctl_sched_runtime_limit;

schedstat_add(cfs_rq, wait_runtime, se->wait_runtime);
}
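The arithmetic of the fix can be sketched in isolation. This is illustrative only: calc_delta_mine() is approximated here by a plain weight scaling, and all names are hypothetical stand-ins for the kernel code:

```c
typedef unsigned long long u64;

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Toy stand-in for calc_delta_mine(): scale `delta` by the task's
 * share of the total runqueue weight. */
static u64 scale_by_weight(u64 delta, u64 weight, u64 total_weight)
{
    return delta * weight / total_weight;
}

/* Before the patch the deduction was derived from the full
 * sleeper_bonus, so a task that scheduled more often paid a
 * disproportionate amount.  After the patch the deduction is capped
 * twice: first by how long the task actually ran (delta_exec), then
 * by the remaining bonus pool. */
static u64 bonus_deduction(u64 sleeper_bonus, u64 delta_exec,
                           u64 weight, u64 total_weight)
{
    u64 delta = MIN(sleeper_bonus, delta_exec);
    delta = scale_by_weight(delta, weight, total_weight);
    return MIN(delta, sleeper_bonus);
}
```

For example, with a large pending bonus (10000), a short run (delta_exec = 1000) and half the runqueue weight, the deduction is clamped to 500 instead of being derived from the whole 10000.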

2007-08-11 05:31:03

by Willy Tarreau

Subject: Re: CFS review

On Fri, Aug 10, 2007 at 11:15:55PM +0200, Roman Zippel wrote:
> Hi,
>
> On Fri, 10 Aug 2007, Willy Tarreau wrote:
>
> > fortunately all bug reporters are not like you. It's amazing how long
> > you can resist sending a simple bug report to a developer!
>
> I'm more amazed how long Ingo can resist providing some explanations (not
> just about this problem).

It's a matter of time balance. It takes a short time to send the output
of a script, but a very long time to explain how things work. I often
encounter the same situation with haproxy: people ask me to explain to
them in detail how this or that would apply to their context, and it's
often easier for me to provide them with a 5-line patch adding the
feature they need than to spend half an hour explaining why and how it
would misbehave.

> It's not like I haven't given him anything, he already has the test
> programs, he already knows the system configuration.
> Well, I've sent him the stuff now...

fine, thanks.

> > Maybe you
> > consider that you need to fix the bug by yourself after you understand
> > the code,
>
> Fixing the bug requires some knowledge of what the code is intended to do.
>
> > Please try to be a little bit more transparent if you really want the
> > bugs fixed, and don't behave as if you wanted this bug to survive
> > till -final.
>
> Could you please ask Ingo the same? I'm simply trying to get some
> transparency into the CFS design. Without further information it's
> difficult to tell, whether something is supposed to work this way or it's
> a bug.

I know that Ingo tends to reply to a question with another question. But
as I said, imagine if he had to explain the same things to each person
who asks. I think a more constructive approach would be to point out
what is missing/unclear/inexact in the doc so that he adds some
paragraphs for you and everyone else. If you need this information to
debug, most likely other people will need it too.

> In this case it's quite possible that due to a recent change my testcase
> doesn't work anymore. Should I consider the problem fixed or did it just
> go into hiding? Without more information it's difficult to verify this
> independently.

generally, problems that appear only on one person's machine and then
suddenly disappear are caused either by some random buggy patch left in
the tree (not your case, it seems), or by an obscure bug in the feature
being tested, which will resurface from time to time as long as it's not
identified.

Willy

2007-08-11 05:43:49

by Willy Tarreau

Subject: Re: CFS review

On Sat, Aug 11, 2007 at 12:50:08AM +0200, Roman Zippel wrote:
> Hi,
>
> On Fri, 10 Aug 2007, Ingo Molnar wrote:
>
> > achieve that. It probably won't make a real difference, but it's really
> > easy for you to send and it's still very useful when one tries to
> > eliminate possibilities and when one wants to concentrate on the
> > remaining possibilities alone.
>
> The thing I'm afraid about CFS is its possible unpredictability, which
> would make it hard to reproduce problems and we may end up with users with
> unexplainable weird problems. That's the main reason I'm trying so hard to
> push for a design discussion.

You may be interested in looking at the very early CFS versions. The
design was much more naive and understandable. After that, a lot of
tricks were added to handle many uses and corner cases, which may not
help in understanding it globally.

> Just to give an idea here are two more examples of irregular behaviour,
> which are hopefully easier to reproduce.
>
> 1. Two simple busy loops, one of them is reniced to 15, according to my
> calculations the reniced task should get about 3.4% (1/(1.25^15+1)), but I
> get this:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 4433 roman 20 0 1532 300 244 R 99.2 0.2 5:05.51 l
> 4434 roman 35 15 1532 72 16 R 0.7 0.1 0:10.62 l

Could this be caused by typos in some tables, like the one you found in wmult?

> OTOH, up to nice level 12 I get what I expect.
>
> 2. If I start 20 busy loops, initially I see in top that every task gets
> 5% and time increments equally (as it should):
(...)

> But if I renice all of them to -15, the time every task gets is rather
> random:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 4492 roman 5 -15 1532 68 16 R 1.0 0.1 0:07.95 l
> 4491 roman 5 -15 1532 68 16 R 4.3 0.1 0:07.62 l
> 4490 roman 5 -15 1532 68 16 R 3.3 0.1 0:07.50 l
> 4489 roman 5 -15 1532 68 16 R 7.6 0.1 0:07.80 l
> 4488 roman 5 -15 1532 68 16 R 9.6 0.1 0:08.31 l
> 4487 roman 5 -15 1532 68 16 R 3.3 0.1 0:07.59 l
> 4486 roman 5 -15 1532 68 16 R 6.6 0.1 0:07.08 l
> 4485 roman 5 -15 1532 68 16 R 10.0 0.1 0:07.31 l
> 4484 roman 5 -15 1532 68 16 R 8.0 0.1 0:07.30 l
> 4483 roman 5 -15 1532 68 16 R 7.0 0.1 0:07.34 l
> 4482 roman 5 -15 1532 68 16 R 1.0 0.1 0:05.84 l
> 4481 roman 5 -15 1532 68 16 R 1.0 0.1 0:07.16 l
> 4480 roman 5 -15 1532 68 16 R 3.3 0.1 0:07.00 l
> 4479 roman 5 -15 1532 68 16 R 1.0 0.1 0:06.66 l
> 4478 roman 5 -15 1532 68 16 R 8.6 0.1 0:06.96 l
> 4477 roman 5 -15 1532 68 16 R 8.6 0.1 0:07.63 l
> 4476 roman 5 -15 1532 68 16 R 9.6 0.1 0:07.38 l
> 4475 roman 5 -15 1532 68 16 R 1.3 0.1 0:07.09 l
> 4474 roman 5 -15 1532 68 16 R 2.3 0.1 0:07.97 l
> 4473 roman 5 -15 1532 296 244 R 1.0 0.2 0:07.73 l

Do you see this only at -15, or starting with -15 and below?

Willy

2007-08-12 05:17:55

by Ingo Molnar

Subject: Re: CFS review


* Willy Tarreau <[email protected]> wrote:

> > 1. Two simple busy loops, one of them is reniced to 15, according to
> > my calculations the reniced task should get about 3.4%
> > (1/(1.25^15+1)), but I get this:
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 4433 roman 20 0 1532 300 244 R 99.2 0.2 5:05.51 l
> > 4434 roman 35 15 1532 72 16 R 0.7 0.1 0:10.62 l
>
> Could this be caused by typos in some tables like you have found in
> wmult ?

note that the typo was not in the weight table but in the inverse weight
table, which didn't really affect CPU utilization (that's why we didn't
notice it sooner). Regarding the above problem with nice +15 getting
less CPU than intended, i'd suggest re-testing with a doubled
/proc/sys/kernel/sched_runtime_limit value, or with:

echo 30 > /proc/sys/kernel/sched_features

(which turns off sleeper fairness)

> > 4477 roman 5 -15 1532 68 16 R 8.6 0.1 0:07.63 l
> > 4476 roman 5 -15 1532 68 16 R 9.6 0.1 0:07.38 l
> > 4475 roman 5 -15 1532 68 16 R 1.3 0.1 0:07.09 l
> > 4474 roman 5 -15 1532 68 16 R 2.3 0.1 0:07.97 l
> > 4473 roman 5 -15 1532 296 244 R 1.0 0.2 0:07.73 l
>
> Do you see this only at -15, or starting with -15 and below ?

i think this was scheduling jitter caused by the larger granularity of
negatively reniced tasks. This got improved recently; with the latest
-git i get:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3108 root 5 -15 1576 248 196 R 5.0 0.0 0:07.26 loop_silent
3109 root 5 -15 1576 248 196 R 5.0 0.0 0:07.26 loop_silent
3110 root 5 -15 1576 248 196 R 5.0 0.0 0:07.26 loop_silent
3111 root 5 -15 1576 244 196 R 5.0 0.0 0:07.26 loop_silent
3112 root 5 -15 1576 248 196 R 5.0 0.0 0:07.26 loop_silent
3113 root 5 -15 1576 248 196 R 5.0 0.0 0:07.26 loop_silent

that's picture-perfect CPU time distribution. But, to be fair, i never
ran such an artificial workload of 20x nice -15 infinite loops (!)
before, and boy does interactivity suck (as expected) ;)

Ingo

2007-08-20 22:19:47

by Roman Zippel

Subject: Re: CFS review

Hi,

On Sat, 11 Aug 2007, Ingo Molnar wrote:

> the only relevant thing that comes to mind at the moment is that last
> week Peter noticed a buggy aspect of sleeper bonuses (in that we do not
> rate-limit their output, hence we 'waste' them instead of redistributing
> them), and i've got the small patch below in my queue to fix that -
> could you give it a try?

It doesn't make much of a difference. OTOH if I disable the sleeper code
completely in __update_curr(), I get this:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3139 roman 20 0 1796 344 256 R 21.7 0.3 0:02.68 lt
3138 roman 20 0 1796 344 256 R 21.7 0.3 0:02.68 lt
3137 roman 20 0 1796 520 432 R 21.7 0.4 0:02.68 lt
3136 roman 20 0 1532 268 216 R 34.5 0.2 0:06.82 l

Disabling this code completely via sched_features makes only a minor
difference:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3139 roman 20 0 1796 344 256 R 20.4 0.3 0:09.94 lt
3138 roman 20 0 1796 344 256 R 20.4 0.3 0:09.94 lt
3137 roman 20 0 1796 520 432 R 20.4 0.4 0:09.94 lt
3136 roman 20 0 1532 268 216 R 39.1 0.2 0:19.20 l

> this is just a blind stab into the dark - i couldnt see any real impact
> from that patch in various workloads (and it's not upstream yet), so it
> might not make a big difference.

Can we please skip to the point where you explain the intention a little
more?
If I had to guess, I'd say this is supposed to keep the runtime balance;
in that case it would be better to use wait_runtime to adjust
fair_clock, from where it would be evenly distributed to all tasks (but
this would have to be done during enqueue and dequeue). OTOH this would
then also have consequences for the wait queue, as fair_clock is used to
calculate fair_key.
IMHO the current wait_runtime should have some influence in calculating
the sleep bonus, so that wait_runtime doesn't constantly overflow for
tasks which only run occasionally.

bye, Roman

2007-08-21 07:33:19

by Mike Galbraith

Subject: Re: CFS review

On Tue, 2007-08-21 at 00:19 +0200, Roman Zippel wrote:
> Hi,
>
> On Sat, 11 Aug 2007, Ingo Molnar wrote:
>
> > the only relevant thing that comes to mind at the moment is that last
> > week Peter noticed a buggy aspect of sleeper bonuses (in that we do not
> > rate-limit their output, hence we 'waste' them instead of redistributing
> > them), and i've got the small patch below in my queue to fix that -
> > could you give it a try?
>
> It doesn't make much of a difference.

I thought this was history. With your config, I was finally able to
reproduce the anomaly (only with your proggy though), and Ingo's patch
does indeed fix it here.

Freshly reproduced anomaly and patch verification, running 2.6.23-rc3
with your config, both with Ingo's patch reverted and with it applied:

6561 root 20 0 1696 492 404 S 32.0 0.0 0:30.83 0 lt
6562 root 20 0 1696 336 248 R 32.0 0.0 0:30.79 0 lt
6563 root 20 0 1696 336 248 R 32.0 0.0 0:30.80 0 lt
6564 root 20 0 2888 1236 1028 R 4.6 0.1 0:05.26 0 sh

6507 root 20 0 2888 1236 1028 R 25.8 0.1 0:30.75 0 sh
6504 root 20 0 1696 492 404 R 24.4 0.0 0:29.26 0 lt
6505 root 20 0 1696 336 248 R 24.4 0.0 0:29.26 0 lt
6506 root 20 0 1696 336 248 R 24.4 0.0 0:29.25 0 lt

-Mike

2007-08-21 08:35:27

by Ingo Molnar

Subject: Re: CFS review


* Mike Galbraith <[email protected]> wrote:

> > It doesn't make much of a difference.
>
> I thought this was history. With your config, I was finally able to
> reproduce the anomaly (only with your proggy though), and Ingo's patch
> does indeed fix it here.
>
> Freshly reproduced anomaly and patch verification, running 2.6.23-rc3
> with your config, both with and without Ingo's patch reverted:
>
> 6561 root 20 0 1696 492 404 S 32.0 0.0 0:30.83 0 lt
> 6562 root 20 0 1696 336 248 R 32.0 0.0 0:30.79 0 lt
> 6563 root 20 0 1696 336 248 R 32.0 0.0 0:30.80 0 lt
> 6564 root 20 0 2888 1236 1028 R 4.6 0.1 0:05.26 0 sh
>
> 6507 root 20 0 2888 1236 1028 R 25.8 0.1 0:30.75 0 sh
> 6504 root 20 0 1696 492 404 R 24.4 0.0 0:29.26 0 lt
> 6505 root 20 0 1696 336 248 R 24.4 0.0 0:29.26 0 lt
> 6506 root 20 0 1696 336 248 R 24.4 0.0 0:29.25 0 lt

oh, great! I'm glad we didn't discard this as a pure sched_clock
resolution artifact.

Roman, a quick & easy request: please send the usual cfs-debug-info.sh
output captured while your testcase is running. (Preferably try .23-rc3
or later, as Mike did - it has the most recent scheduler code and
already includes the patch i sent you.) I'll reply to your
sleeper-fairness questions separately, but in any case we need to figure
out what's happening on your box - if you can still reproduce it with
.23-rc3. Thanks,

Ingo

2007-08-21 11:54:46

by Roman Zippel

Subject: Re: CFS review

Hi,

On Tue, 21 Aug 2007, Mike Galbraith wrote:

> I thought this was history. With your config, I was finally able to
> reproduce the anomaly (only with your proggy though), and Ingo's patch
> does indeed fix it here.
>
> Freshly reproduced anomaly and patch verification, running 2.6.23-rc3
> with your config, both with and without Ingo's patch reverted:

I did update to 2.6.23-rc3-git1 first, but then ended up reverting the
patch, as I hadn't noticed it had already been applied. Sorry about
that.
With the patch the underflows are gone, but the overflows are still
there, so the questions from the last mail still remain.

bye, Roman