2007-09-24 21:46:14

by Ingo Molnar

Subject: [git] CFS-devel, latest code


The latest sched-devel.git tree can be pulled from:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

Lots of scheduler updates in the past few days, done by many people.
Most importantly, the SMP latency problems reported and debugged by Mike
Galbraith should be fixed for good now.

I've also included the latest and greatest group-fairness scheduling
patch from Srivatsa Vaddagiri, which can now be used without containers
as well (in a simplified, each-uid-gets-its-fair-share mode). This
feature (CONFIG_FAIR_USER_SCHED) is now default-enabled.

Peter Zijlstra has been busy enhancing the math of the scheduler: we've
got the new 'vslice' forked-task code that should enable snappier shell
commands during load while still keeping kbuild workloads in check.
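
(As a rough illustration of the 'vslice' idea - a hedged sketch, not the
actual patch, and the helper name is an assumption: a freshly forked task
is placed one "virtual slice" ahead of the queue's min_vruntime, so it
gets onto the CPU quickly but cannot starve already-running tasks.)

	static void place_forked_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
	{
		u64 vruntime = cfs_rq->min_vruntime;

		/* debit the child one virtual slice up front */
		vruntime += sched_vslice(cfs_rq, se);

		se->vruntime = vruntime;
	}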

On my test systems this codebase is starting to look like something that could
be merged into v2.6.24, so please give it a good workout and let us know
if there's anything bad going on. (If this works out fine then i'll
propagate these changes back into the CFS backport, for wider testing.)

Ingo

----------------------------------------->
the shortlog relative to 2.6.23-rc7:

Dmitry Adamushko (8):
sched: clean up struct load_stat
sched: clean up schedstat block in dequeue_entity()
sched: sched_setscheduler() fix
sched: add set_curr_task() calls
sched: do not keep current in the tree and get rid of sched_entity::fair_key
sched: optimize task_new_fair()
sched: simplify sched_class::yield_task()
sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()

Ingo Molnar (41):
sched: fix new-task method
sched: resched task in task_new_fair()
sched: small sched_debug cleanup
sched: debug: track maximum 'slice'
sched: uniform tunings
sched: use constants if !CONFIG_SCHED_DEBUG
sched: remove stat_gran
sched: remove precise CPU load
sched: remove precise CPU load calculations #2
sched: track cfs_rq->curr on !group-scheduling too
sched: cleanup: simplify cfs_rq_curr() methods
sched: uninline __enqueue_entity()/__dequeue_entity()
sched: speed up update_load_add/_sub()
sched: clean up calc_weighted()
sched: introduce se->vruntime
sched: move sched_feat() definitions
sched: optimize vruntime based scheduling
sched: simplify check_preempt() methods
sched: wakeup granularity fix
sched: add se->vruntime debugging
sched: add more vruntime statistics
sched: debug: update exec_clock only when SCHED_DEBUG
sched: remove wait_runtime limit
sched: remove wait_runtime fields and features
sched: x86: allow single-depth wchan output
sched: fix delay accounting performance regression
sched: prettify /proc/sched_debug output
sched: enhance debug output
sched: kernel/sched_fair.c whitespace cleanups
sched: fair-group sched, cleanups
sched: enable CONFIG_FAIR_GROUP_SCHED=y by default
sched debug: BKL usage statistics
sched: remove unneeded tunables
sched debug: print settings
sched debug: more width for parameter printouts
sched: entity_key() fix
sched: remove condition from set_task_cpu()
sched: remove last_min_vruntime effect
sched: undo some of the recent changes
sched: fix place_entity()
sched: fix sched_fork()

Matthias Kaehlcke (1):
sched: use list_for_each_entry_safe() in __wake_up_common()

Mike Galbraith (2):
sched: fix SMP migration latencies
sched: fix formatting of /proc/sched_debug

Peter Zijlstra (10):
sched: simplify SCHED_FEAT_* code
sched: new task placement for vruntime
sched: simplify adaptive latency
sched: clean up new task placement
sched: add tree based averages
sched: handle vruntime overflow
sched: better min_vruntime tracking
sched: add vslice
sched debug: check spread
sched: max_vruntime() simplification

Srivatsa Vaddagiri (7):
sched: group-scheduler core
sched: revert recent removal of set_curr_task()
sched: fix minor bug in yield
sched: print nr_running and load in /proc/sched_debug
sched: print &rq->cfs stats
sched: clean up code under CONFIG_FAIR_GROUP_SCHED
sched: add fair-user scheduler

arch/i386/Kconfig | 11
include/linux/sched.h | 45 +--
init/Kconfig | 21 +
kernel/sched.c | 547 +++++++++++++++++++++++++------------
kernel/sched_debug.c | 248 +++++++++++------
kernel/sched_fair.c | 692 +++++++++++++++++-------------------------------
kernel/sched_idletask.c | 5
kernel/sched_rt.c | 12
kernel/sched_stats.h | 4
kernel/sysctl.c | 22 -
kernel/user.c | 43 ++
11 files changed, 906 insertions(+), 744 deletions(-)


2007-09-24 21:56:35

by Andrew Morton

Subject: Re: [git] CFS-devel, latest code

On Mon, 24 Sep 2007 23:45:37 +0200
Ingo Molnar <[email protected]> wrote:

>
> The latest sched-devel.git tree can be pulled from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

I'm pulling linux-2.6-sched.git, and it's oopsing all over the place on
ia64, and Lee's observations about set_leftmost()'s weirdness are
pertinent.

Should I instead be pulling linux-2.6-sched-devel.git?

2007-09-24 21:59:20

by Ingo Molnar

Subject: Re: [git] CFS-devel, latest code


* Andrew Morton <[email protected]> wrote:

> On Mon, 24 Sep 2007 23:45:37 +0200
> Ingo Molnar <[email protected]> wrote:
>
> >
> > The latest sched-devel.git tree can be pulled from:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
>
> I'm pulling linux-2.6-sched.git, and it's oopsing all over the place
> on ia64, and Lee's observations about set_leftmost()'s weirdness are
> pertinent.
>
> Should I instead be pulling linux-2.6-sched-devel.git?

yeah, please pull that one.

linux-2.6-sched.git by mistake contained an older sched-devel tree for
about a day (which is where your scripts picked it up). I've meanwhile
restored it to -rc7. It's only supposed to contain strict fixes for
upstream (none at the moment).

Ingo

2007-09-25 00:13:19

by Daniel Walker

Subject: Re: [git] CFS-devel, latest code

On Mon, 2007-09-24 at 23:45 +0200, Ingo Molnar wrote:
> Lots of scheduler updates in the past few days, done by many people.
> Most importantly, the SMP latency problems reported and debugged by
> Mike
> Galbraith should be fixed for good now.

Does this have anything to do with idle balancing? I noticed some
fairly large latencies in that code in the 2.6.23-rc's ..

Daniel

2007-09-25 06:11:15

by Mike Galbraith

Subject: Re: [git] CFS-devel, latest code

On Mon, 2007-09-24 at 23:45 +0200, Ingo Molnar wrote:

> Mike Galbraith (2):
> sched: fix SMP migration latencies
> sched: fix formatting of /proc/sched_debug

Off-by-one bug in attribution, rocks and sticks (down boy!) don't
count ;-) I just built, and will spend the morning beating on it... no
news is good news.

-Mike

2007-09-25 06:47:15

by Ingo Molnar

Subject: Re: [git] CFS-devel, latest code


* Daniel Walker <[email protected]> wrote:

> On Mon, 2007-09-24 at 23:45 +0200, Ingo Molnar wrote:
> > Lots of scheduler updates in the past few days, done by many people.
> > Most importantly, the SMP latency problems reported and debugged by
> > Mike
> > Galbraith should be fixed for good now.
>
> Does this have anything to do with idle balancing ? I noticed some
> fairly large latencies in that code in 2.6.23-rc's ..

any measurements?

Ingo

2007-09-25 06:52:53

by S.Çağlar Onur

Subject: Re: [git] CFS-devel, latest code

Hi;

On Tuesday 25 September 2007, Ingo Molnar wrote:
>
> The latest sched-devel.git tree can be pulled from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
>
> Lots of scheduler updates in the past few days, done by many people.
> Most importantly, the SMP latency problems reported and debugged by Mike
> Galbraith should be fixed for good now.
>
> I've also included the latest and greatest group-fairness scheduling
> patch from Srivatsa Vaddagiri, which can now be used without containers
> as well (in a simplified, each-uid-gets-its-fair-share mode). This
> feature (CONFIG_FAIR_USER_SCHED) is now default-enabled.
>
> Peter Zijlstra has been busy enhancing the math of the scheduler: we've
> got the new 'vslice' forked-task code that should enable snappier shell
> commands during load while still keeping kbuild workloads in check.
>
> On my testsystems this codebase starts looking like something that could
> be merged into v2.6.24, so please give it a good workout and let us know
> if there's anything bad going on. (If this works out fine then i'll
> propagate these changes back into the CFS backport, for wider testing.)

Seems like the following trivial change is needed to compile without CONFIG_SCHEDSTATS:

caglar@zangetsu linux-2.6 $ LC_ALL=C make
CHK include/linux/version.h
CHK include/linux/utsrelease.h
CALL scripts/checksyscalls.sh
CHK include/linux/compile.h
CC kernel/sched.o
In file included from kernel/sched.c:853:
kernel/sched_debug.c: In function `print_cfs_rq':
kernel/sched_debug.c:139: error: structure has no member named `bkl_cnt'
kernel/sched_debug.c:139: error: structure has no member named `bkl_cnt'
make[1]: *** [kernel/sched.o] Error 1
make: *** [kernel] Error 2

Signed-off-by: S.Çağlar Onur <[email protected]>

diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index b68e593..4659c90 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -136,8 +136,10 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			SPLIT_NS(spread0));
 	SEQ_printf(m, " .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight);
+#ifdef CONFIG_SCHEDSTATS
 	SEQ_printf(m, " .%-30s: %ld\n", "bkl_cnt",
 			rq->bkl_cnt);
+#endif
 	SEQ_printf(m, " .%-30s: %ld\n", "nr_spread_over",
 			cfs_rq->nr_spread_over);
 }


Cheers
--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

2007-09-25 07:36:18

by Mike Galbraith

Subject: Re: [git] CFS-devel, latest code

On Tue, 2007-09-25 at 08:10 +0200, Mike Galbraith wrote:
> no news is good news.

Darn, have news: latency thing isn't dead. Two busy loops, one at nice
0 pinned to CPU0, and one at nice 19 pinned to CPU1 produced the
latencies below for nice -5 Xorg. Didn't kill the box though.

se.wait_max : 10.068169
se.wait_max : 7.465334
se.wait_max : 135.501816
se.wait_max : 0.884483
se.wait_max : 144.218955
se.wait_max : 128.578376
se.wait_max : 93.975768
se.wait_max : 4.965965
se.wait_max : 113.655533
se.wait_max : 4.301075

sched_debug (attached) is.. strange.

-Mike


Attachments:
sched_debug.gz (9.60 kB)

2007-09-25 07:42:16

by Andrew Morton

Subject: Re: [git] CFS-devel, latest code

On Mon, 24 Sep 2007 23:45:37 +0200 Ingo Molnar <[email protected]> wrote:

> The latest sched-devel.git tree can be pulled from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

This doornails the Vaio. After grub handover the screen remains black
and the fan goes whir.

http://userweb.kernel.org/~akpm/config-sony.txt

2007-09-25 08:33:18

by Srivatsa Vaddagiri

Subject: Re: [git] CFS-devel, latest code

On Tue, Sep 25, 2007 at 12:41:20AM -0700, Andrew Morton wrote:
> This doornails the Vaio. After grub handover the screen remains black
> and the fan goes whir.
>
> http://userweb.kernel.org/~akpm/config-sony.txt

This seems to be a UP regression. Sorry about that. I could recreate
the problem very easily with CONFIG_SMP turned off.

Can you check if this patch works? Works for me here.

--

Fix UP breakage.

Signed-off-by: Srivatsa Vaddagiri <[email protected]>


---
kernel/sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1029,8 +1029,8 @@ static inline void __set_task_cpu(struct
 {
 #ifdef CONFIG_SMP
 	task_thread_info(p)->cpu = cpu;
-	set_task_cfs_rq(p);
 #endif
+	set_task_cfs_rq(p);
 }
 
 #ifdef CONFIG_SMP

--
Regards,
vatsa

2007-09-25 08:33:41

by Mike Galbraith

Subject: Re: [git] CFS-devel, latest code

On Tue, 2007-09-25 at 09:35 +0200, Mike Galbraith wrote:

> Darn, have news: latency thing isn't dead. Two busy loops, one at nice
> 0 pinned to CPU0, and one at nice 19 pinned to CPU1 produced the
> latencies below for nice -5 Xorg. Didn't kill the box though.
>
> se.wait_max : 10.068169
> se.wait_max : 7.465334
> se.wait_max : 135.501816
> se.wait_max : 0.884483
> se.wait_max : 144.218955
> se.wait_max : 128.578376
> se.wait_max : 93.975768
> se.wait_max : 4.965965
> se.wait_max : 113.655533
> se.wait_max : 4.301075
>
> sched_debug (attached) is.. strange.

Disabling CONFIG_FAIR_GROUP_SCHED fixed both. Latencies of up to 336ms
hit me during the recompile (make -j3), with nothing else running.
Since reboot, latencies are, so far, very very nice. I'm leaving it
disabled for now.

-Mike

2007-09-25 08:43:35

by Srivatsa Vaddagiri

Subject: Re: [git] CFS-devel, latest code

On Tue, Sep 25, 2007 at 10:33:27AM +0200, Mike Galbraith wrote:
> > Darn, have news: latency thing isn't dead. Two busy loops, one at nice
> > 0 pinned to CPU0, and one at nice 19 pinned to CPU1 produced the
> > latencies below for nice -5 Xorg. Didn't kill the box though.
> >
> > se.wait_max : 10.068169
> > se.wait_max : 7.465334
> > se.wait_max : 135.501816
> > se.wait_max : 0.884483
> > se.wait_max : 144.218955
> > se.wait_max : 128.578376
> > se.wait_max : 93.975768
> > se.wait_max : 4.965965
> > se.wait_max : 113.655533
> > se.wait_max : 4.301075
> >
> > sched_debug (attached) is.. strange.

Mike,
Do you have FAIR_USER_SCHED turned on as well? Can you send me
your .config pls?

Also how do you check se.wait_max?

> Disabling CONFIG_FAIR_GROUP_SCHED fixed both. Latencies of up to 336ms
> hit me during the recompile (make -j3), with nothing else running.
> Since reboot, latencies are, so far, very very nice. I'm leaving it
> disabled for now.

--
Regards,
vatsa

2007-09-25 08:48:35

by Andrew Morton

Subject: Re: [git] CFS-devel, latest code

On Tue, 25 Sep 2007 14:13:27 +0530 Srivatsa Vaddagiri <[email protected]> wrote:

> On Tue, Sep 25, 2007 at 12:41:20AM -0700, Andrew Morton wrote:
> > This doornails the Vaio. After grub handover the screen remains black
> > and the fan goes whir.
> >
> > http://userweb.kernel.org/~akpm/config-sony.txt
>
> This seems to be UP regression. Sorry abt that. I could recreate
> the problem very easily with CONFIG_SMP turned off.
>
> Can you check if this patch works? Works for me here.
>
> --
>
> Fix UP breakage.
>
> Signed-off-by : Srivatsa Vaddagiri <[email protected]>
>
>
> ---
> kernel/sched.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: current/kernel/sched.c
> ===================================================================
> --- current.orig/kernel/sched.c
> +++ current/kernel/sched.c
> @@ -1029,8 +1029,8 @@ static inline void __set_task_cpu(struct
> {
> #ifdef CONFIG_SMP
> task_thread_info(p)->cpu = cpu;
> - set_task_cfs_rq(p);
> #endif
> + set_task_cfs_rq(p);
> }
>
> #ifdef CONFIG_SMP

yup, that's a fix. It was 15 minutes too late for rc8-mm1 though :(

2007-09-25 09:00:55

by Srivatsa Vaddagiri

Subject: Re: [git] CFS-devel, latest code

On Tue, Sep 25, 2007 at 02:23:29PM +0530, Srivatsa Vaddagiri wrote:
> On Tue, Sep 25, 2007 at 10:33:27AM +0200, Mike Galbraith wrote:
> > > Darn, have news: latency thing isn't dead. Two busy loops, one at nice
> > > 0 pinned to CPU0, and one at nice 19 pinned to CPU1 produced the
> > > latencies below for nice -5 Xorg. Didn't kill the box though.

These busy loops - are they spawned by the same user? Is it the root
user? And is this seen in UP mode as well?

Can you also pls check if tuning root user's cpu share helps? Basically,

# echo 4096 > /proc/root_user_share

[or any other higher value]

> Also how do you check se.wait_max?

Ok ..I see that it is in /proc/sched_debug.
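
(For reference: with CONFIG_SCHED_DEBUG, per-task scheduling stats are also
exposed in /proc/<pid>/sched - presumably where the se.wait_max numbers in
this thread come from - so they can be sampled with something like:

	grep wait_max /proc/`pidof Xorg`/sched

The exact field layout depends on the kernel version.)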


--
Regards,
vatsa

2007-09-25 09:13:10

by Mike Galbraith

Subject: Re: [git] CFS-devel, latest code

On Tue, 2007-09-25 at 14:23 +0530, Srivatsa Vaddagiri wrote:

> Mike,
> Do you have FAIR_USER_SCHED turned on as well? Can you send me
> your .config pls?

I did have; gzipped config attached.. this is the current one though, after
disabling groups. I'm still beating on the basic changes (boy does it
ever feel nice [awaits other shoe]).

-Mike


Attachments:
config.gz (12.97 kB)

2007-09-25 09:13:49

by Ingo Molnar

Subject: Re: [git] CFS-devel, latest code


* Mike Galbraith <[email protected]> wrote:

> > sched_debug (attached) is.. strange.
>
> Disabling CONFIG_FAIR_GROUP_SCHED fixed both. [...]

heh. Evil plan to enable the group scheduler by default worked out as
planned! ;-) [guess how many container users would do ... interactivity
tests like you do??]

> [...] Latencies of up to 336ms hit me during the recompile (make -j3),
> with nothing else running. Since reboot, latencies are, so far, very
> very nice. [...]

'very very nice' == 'best ever' ? :-)

> [...] I'm leaving it disabled for now.

ok, i too am seeing some sort of latency weirdness with
CONFIG_FAIR_GROUP_SCHED enabled, _if_ Xorg is involved, which runs
under root uid on my box - and hence gets 50% of all CPU time.

Srivatsa, any ideas? It could either be an accounting buglet (less
likely, seems like the group scheduling bits stick to the 50% splitup
nicely), or a preemption buglet. One potential preemption buglet would
be for the group scheduler to not properly preempt a running task when a
task from another uid is woken?

Ingo

2007-09-25 09:15:30

by Mike Galbraith

Subject: Re: [git] CFS-devel, latest code

On Tue, 2007-09-25 at 14:41 +0530, Srivatsa Vaddagiri wrote:
> On Tue, Sep 25, 2007 at 02:23:29PM +0530, Srivatsa Vaddagiri wrote:
> > On Tue, Sep 25, 2007 at 10:33:27AM +0200, Mike Galbraith wrote:
> > > > Darn, have news: latency thing isn't dead. Two busy loops, one at nice
> > > > 0 pinned to CPU0, and one at nice 19 pinned to CPU1 produced the
> > > > latencies below for nice -5 Xorg. Didn't kill the box though.
>
> These busy loops - are they spawned by the same user? Is it the root
> user? Also is this seen in UP mode also?
>
> Can you also pls check if tuning root user's cpu share helps? Basically,
>
> # echo 4096 > /proc/root_user_share
>
> [or any other higher value]

I'll try these after I beat on the box some more.

-Mike

2007-09-25 09:17:50

by Mike Galbraith

Subject: Re: [git] CFS-devel, latest code

On Tue, 2007-09-25 at 11:13 +0200, Ingo Molnar wrote:

> > [...] Latencies of up to 336ms hit me during the recompile (make -j3),
> > with nothing else running. Since reboot, latencies are, so far, very
> > very nice. [...]
>
> 'very very nice' == 'best ever' ? :-)

Yes. Very VERY nice feel.

-Mike

2007-09-25 09:18:03

by Ingo Molnar

Subject: Re: [git] CFS-devel, latest code


* S.Çağlar Onur <[email protected]> wrote:

> Seems like following trivial change needed to compile without
> CONFIG_SCHEDSTATS
>
> caglar@zangetsu linux-2.6 $ LC_ALL=C make
> CHK include/linux/version.h
> CHK include/linux/utsrelease.h
> CALL scripts/checksyscalls.sh
> CHK include/linux/compile.h
> CC kernel/sched.o
> In file included from kernel/sched.c:853:
> kernel/sched_debug.c: In function `print_cfs_rq':
> kernel/sched_debug.c:139: error: structure has no member named `bkl_cnt'
> kernel/sched_debug.c:139: error: structure has no member named `bkl_cnt'
> make[1]: *** [kernel/sched.o] Error 1
> make: *** [kernel] Error 2
>
> Signed-off-by: S.Çağlar Onur <[email protected]>

thanks, applied!

Ingo

2007-09-25 09:34:29

by Srivatsa Vaddagiri

Subject: Re: [git] CFS-devel, latest code

On Tue, Sep 25, 2007 at 11:13:31AM +0200, Ingo Molnar wrote:
> ok, i'm too seeing some sort of latency weirdness with
> CONFIG_FAIR_GROUP_SCHED enabled, _if_ there's Xorg involved which runs
> under root uid on my box - and hence gets 50% of all CPU time.
>
> Srivatsa, any ideas? It could either be an accounting buglet (less
> likely, seems like the group scheduling bits stick to the 50% splitup
> nicely), or a preemption buglet. One potential preemption buglet would
> be for the group scheduler to not properly preempt a running task when a
> task from another uid is woken?

Yep, I noticed that too.

check_preempt_wakeup()
{
	...

	if (is_same_group(curr, p)) {
	    ^^^^^^^^^^^^^

		resched_task();
	}
}

Will try a fix to check for preemption at higher levels ..

--
Regards,
vatsa

2007-09-25 09:41:39

by Ingo Molnar

Subject: Re: [git] CFS-devel, latest code


* Srivatsa Vaddagiri <[email protected]> wrote:

> On Tue, Sep 25, 2007 at 11:13:31AM +0200, Ingo Molnar wrote:
> > ok, i'm too seeing some sort of latency weirdness with
> > CONFIG_FAIR_GROUP_SCHED enabled, _if_ there's Xorg involved which runs
> > under root uid on my box - and hence gets 50% of all CPU time.
> >
> > Srivatsa, any ideas? It could either be an accounting buglet (less
> > likely, seems like the group scheduling bits stick to the 50% splitup
> > nicely), or a preemption buglet. One potential preemption buglet would
> > be for the group scheduler to not properly preempt a running task when a
> > task from another uid is woken?
>
> Yep, I noticed that too.
>
> check_preempt_wakeup()
> {
> ...
>
> if (is_same_group(curr, p)) {
> ^^^^^^^^^^^^^
>
> resched_task();
> }
>
> }
>
> Will try a fix to check for preemption at higher levels ..

i bet fixing this will increase the precision of group scheduling as
well. Those long latencies can be thought of as noise, and the
fair-scheduling "engine" might not be capable of offsetting all sources
of noise. So generally, while we allow a certain amount of lag in
preemption decisions (wakeup-granularity, etc.), with which the fairness
engine will cope just fine, we do not want to allow unlimited lag.

Ingo

2007-09-25 09:47:30

by Ingo Molnar

Subject: Re: [git] CFS-devel, latest code


* Mike Galbraith <[email protected]> wrote:

> On Tue, 2007-09-25 at 11:13 +0200, Ingo Molnar wrote:
>
> > > [...] Latencies of up to 336ms hit me during the recompile (make -j3),
> > > with nothing else running. Since reboot, latencies are, so far, very
> > > very nice. [...]
> >
> > 'very very nice' == 'best ever' ? :-)
>
> Yes. Very VERY nice feel.

cool :-)

Maybe there's more to come: if we can get CONFIG_FAIR_USER_SCHED to work
properly then your Xorg will have a load-independent 50% of CPU time all
to itself. (Group scheduling is quite impressive already: i can log in
as root without feeling _any_ effect from a perpetual 'hackbench 100'
running as uid mingo. Fork bombs no more.) Will the Amarok gforce plugin
like that CPU time splitup? (or is most of the gforce overhead under
your user uid?)

it could also work out negatively; _sometimes_ X does not like being too
high prio (weird as that might be). So we'll see.

Ingo

2007-09-25 10:02:20

by Mike Galbraith

Subject: Re: [git] CFS-devel, latest code

On Tue, 2007-09-25 at 11:47 +0200, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > On Tue, 2007-09-25 at 11:13 +0200, Ingo Molnar wrote:
> >
> > > > [...] Latencies of up to 336ms hit me during the recompile (make -j3),
> > > > with nothing else running. Since reboot, latencies are, so far, very
> > > > very nice. [...]
> > >
> > > 'very very nice' == 'best ever' ? :-)
> >
> > Yes. Very VERY nice feel.
>
> cool :-)
>
> Maybe there's more to come: if we can get CONFIG_FAIR_USER_SCHED to work
> properly then your Xorg will have a load-independent 50% of CPU time all
> to itself. (Group scheduling is quite impressive already: i can log in
> as root without feeling _any_ effect from a perpetual 'hackbench 100'
> running as uid mingo. Fork bombs no more.) Will the Amarok gforce plugin
> like that CPU time splitup? (or is most of the gforce overhead under
> your user uid?)

I run everything as root (naughty me), so I'd have to change my evil
ways to reap the benefits. (I'll do that to test, but it's unlikely to
ever become a permanent habit here.) Amarok/Gforce will definitely like
the user split as long as latency is low. Visualizations are not only
bandwidth hungry, they're extremely latency sensitive.

-Mike

2007-09-25 10:11:05

by Ingo Molnar

Subject: Re: [git] CFS-devel, latest code


* Ingo Molnar <[email protected]> wrote:

>
> * Srivatsa Vaddagiri <[email protected]> wrote:
>
> > On Tue, Sep 25, 2007 at 11:13:31AM +0200, Ingo Molnar wrote:
> > > ok, i'm too seeing some sort of latency weirdness with
> > > CONFIG_FAIR_GROUP_SCHED enabled, _if_ there's Xorg involved which runs
> > > under root uid on my box - and hence gets 50% of all CPU time.
> > >
> > > Srivatsa, any ideas? It could either be an accounting buglet (less
> > > likely, seems like the group scheduling bits stick to the 50% splitup
> > > nicely), or a preemption buglet. One potential preemption buglet would
> > > be for the group scheduler to not properly preempt a running task when a
> > > task from another uid is woken?
> >
> > Yep, I noticed that too.
> >
> > check_preempt_wakeup()
> > {
> > ...
> >
> > if (is_same_group(curr, p)) {
> > ^^^^^^^^^^^^^
> >
> > resched_task();
> > }
> >
> > }
> >
> > Will try a fix to check for preemption at higher levels ..
>
> i bet fixing this will increase precision of group scheduling as well.
> Those long latencies can be thought of as noise as well, and the
> fair-scheduling "engine" might not be capable to offset all sources of
> noise. So generally, while we allow a certain amount of lag in
> preemption decisions (wakeup-granularity, etc.), with which the
> fairness engine will cope just fine, we do not want to allow unlimited
> lag.

hm, i tried the naive patch. In theory the vruntime of all scheduling
entities should be 'compatible' and comparable (that's the point behind
using vruntime - the fairness engine drives each vruntime forward and
tries to balance them).

So the patch below just removes the is_same_group() condition. But i can
still see bad (and obvious) latencies with Mike's 2-hogs test:

taskset 01 perl -e 'while (1) {}' &
nice -19 taskset 02 perl -e 'while (1) {}' &

So something's amiss.

Ingo

------------------->
Subject: sched: group scheduler wakeup latency fix
From: Ingo Molnar <[email protected]>

group scheduler wakeup latency fix.

Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched_fair.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -785,6 +785,7 @@ static void check_preempt_wakeup(struct
 {
 	struct task_struct *curr = rq->curr;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
+	s64 delta;
 
 	if (unlikely(rt_prio(p->prio))) {
 		update_rq_clock(rq);
@@ -792,12 +793,10 @@ static void check_preempt_wakeup(struct
 		resched_task(curr);
 		return;
 	}
-	if (is_same_group(curr, p)) {
-		s64 delta = curr->se.vruntime - p->se.vruntime;
+	delta = curr->se.vruntime - p->se.vruntime;
 
-		if (delta > (s64)sysctl_sched_wakeup_granularity)
-			resched_task(curr);
-	}
+	if (delta > (s64)sysctl_sched_wakeup_granularity)
+		resched_task(curr);
 }
 
 static struct task_struct *pick_next_task_fair(struct rq *rq)

2007-09-25 10:18:43

by Srivatsa Vaddagiri

Subject: Re: [git] CFS-devel, latest code

On Tue, Sep 25, 2007 at 12:10:44PM +0200, Ingo Molnar wrote:
> So the patch below just removes the is_same_group() condition. But i can
> still see bad (and obvious) latencies with Mike's 2-hogs test:
>
> taskset 01 perl -e 'while (1) {}' &
> nice -19 taskset 02 perl -e 'while (1) {}' &
>
> So something's amiss.

While I try recreating this myself, I wonder if this patch helps?

---
kernel/sched_fair.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)

Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -794,7 +794,8 @@ static void yield_task_fair(struct rq *r
 static void check_preempt_wakeup(struct rq *rq, struct task_struct *p)
 {
 	struct task_struct *curr = rq->curr;
-	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
+	struct cfs_rq *cfs_rq = task_cfs_rq(curr), *pcfs_rq;
+	struct sched_entity *se = &curr->se, *pse = &p->se;
 
 	if (unlikely(rt_prio(p->prio))) {
 		update_rq_clock(rq);
@@ -802,11 +803,19 @@ static void check_preempt_wakeup(struct
 		resched_task(curr);
 		return;
 	}
-	if (is_same_group(curr, p)) {
-		s64 delta = curr->se.vruntime - p->se.vruntime;
 
-		if (delta > (s64)sysctl_sched_wakeup_granularity)
-			resched_task(curr);
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+		pcfs_rq = cfs_rq_of(pse);
+
+		if (cfs_rq == pcfs_rq) {
+			s64 delta = se->vruntime - pse->vruntime;
+
+			if (delta > (s64)sysctl_sched_wakeup_granularity)
+				resched_task(curr);
+			break;
+		}
+		pse = pse->parent;
 	}
 }
}


--
Regards,
vatsa

2007-09-25 10:36:39

by Ingo Molnar

Subject: Re: [git] CFS-devel, latest code


* Srivatsa Vaddagiri <[email protected]> wrote:

> On Tue, Sep 25, 2007 at 12:10:44PM +0200, Ingo Molnar wrote:
> > So the patch below just removes the is_same_group() condition. But i can
> > still see bad (and obvious) latencies with Mike's 2-hogs test:
> >
> > taskset 01 perl -e 'while (1) {}' &
> > nice -19 taskset 02 perl -e 'while (1) {}' &
> >
> > So something's amiss.
>
> While I try recreating this myself, I wonder if this patch helps?

you should be able to recreate this easily by booting with maxcpus=1 and
the commands above - then run a few instances of chew-max (without them
being bound to any particular CPUs) and the latencies should show up.

i have tried your patch and it does not solve the problem - i think
there's a more fundamental bug lurking, besides the wakeup latency
problem.

Find below a /proc/sched_debug output of a really large latency. The
latency is caused by the _huge_ (~450 seconds!) vruntime offset that
'loop_silent' and 'sshd' have:

task PID tree-key switches prio exec-runtime
-------------------------------------------------------------------
loop_silent 2391 55344.211189 203 120 55344.211189
sshd 2440 513334.978030 4 120 513334.978030
R cat 2496 513672.558835 4 120 513672.558835

hm. perhaps this fixup in kernel/sched.c:set_task_cpu():

p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;

needs to become properly group-hierarchy aware?

Ingo

------------------>
Sched Debug Version: v0.05-v20, 2.6.23-rc7 #89
now at 95878.065440 msecs
.sysctl_sched_latency : 20.000000
.sysctl_sched_min_granularity : 2.000000
.sysctl_sched_wakeup_granularity : 2.000000
.sysctl_sched_batch_wakeup_granularity : 25.000000
.sysctl_sched_child_runs_first : 0.000001
.sysctl_sched_features : 3

cpu#0, 1828.868 MHz
.nr_running : 3
.load : 3072
.nr_switches : 32032
.nr_load_updates : 95906
.nr_uninterruptible : 4294967238
.jiffies : 4294763202
.next_balance : 4294.763420
.curr->pid : 2496
.clock : 95893.484495
.idle_clock : 55385.089335
.prev_clock_raw : 84753.749367
.clock_warps : 0
.clock_overflows : 1737
.clock_deep_idle_events : 71815
.clock_max_delta : 0.999843
.cpu_load[0] : 3072
.cpu_load[1] : 2560
.cpu_load[2] : 2304
.cpu_load[3] : 2176
.cpu_load[4] : 2119

cfs_rq
.exec_clock : 38202.223241
.MIN_vruntime : 36334.281860
.min_vruntime : 36334.279140
.max_vruntime : 36334.281860
.spread : 0.000000
.spread0 : 0.000000
.nr_running : 2
.load : 3072
.bkl_cnt : 3934
.nr_spread_over : 37

cfs_rq
.exec_clock : 34769.316246
.MIN_vruntime : 55344.211189
.min_vruntime : 36334.279140
.max_vruntime : 513334.978030
.spread : 457990.766841
.spread0 : 0.000000
.nr_running : 2
.load : 2048
.bkl_cnt : 3934
.nr_spread_over : 10

cfs_rq
.exec_clock : 36.982394
.MIN_vruntime : 0.000001
.min_vruntime : 36334.279140
.max_vruntime : 0.000001
.spread : 0.000000
.spread0 : 0.000000
.nr_running : 0
.load : 0
.bkl_cnt : 3934
.nr_spread_over : 1

cfs_rq
.exec_clock : 20.244893
.MIN_vruntime : 0.000001
.min_vruntime : 36334.279140
.max_vruntime : 0.000001
.spread : 0.000000
.spread0 : 0.000000
.nr_running : 0
.load : 0
.bkl_cnt : 3934
.nr_spread_over : 0

cfs_rq
.exec_clock : 3305.155973
.MIN_vruntime : 0.000001
.min_vruntime : 36334.279140
.max_vruntime : 0.000001
.spread : 0.000000
.spread0 : 0.000000
.nr_running : 1
.load : 1024
.bkl_cnt : 3934
.nr_spread_over : 13

runnable tasks:
task PID tree-key switches prio exec-runtime sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------
loop_silent 2391 55344.211189 203 120 55344.211189 34689.036169 42.855690
sshd 2440 513334.978030 4 120 513334.978030 0.726092 0.000000
R cat 2496 513672.558835 4 120 513672.558835 0.656690 26.294621

cpu#1, 1828.868 MHz
.nr_running : 3
.load : 2063
.nr_switches : 22792
.nr_load_updates : 95625
.nr_uninterruptible : 58
.jiffies : 4294763202
.next_balance : 4294.763333
.curr->pid : 2427
.clock : 95643.855219
.idle_clock : 54735.436719
.prev_clock_raw : 84754.067800
.clock_warps : 0
.clock_overflows : 2521
.clock_deep_idle_events : 120633
.clock_max_delta : 0.999843
.cpu_load[0] : 2063
.cpu_load[1] : 2063
.cpu_load[2] : 2063
.cpu_load[3] : 2063
.cpu_load[4] : 2063

cfs_rq
.exec_clock : 38457.557282
.MIN_vruntime : 0.000001
.min_vruntime : 35360.227495
.max_vruntime : 0.000001
.spread : 0.000000
.spread0 : -974.051645
.nr_running : 1
.load : 1024
.bkl_cnt : 456
.nr_spread_over : 18

cfs_rq
.exec_clock : 32536.311766
.MIN_vruntime : 610653.297636
.min_vruntime : 35360.227495
.max_vruntime : 610702.945490
.spread : 49.647854
.spread0 : -974.051645
.nr_running : 3
.load : 2063
.bkl_cnt : 456
.nr_spread_over : 202

cfs_rq
.exec_clock : 17.392835
.MIN_vruntime : 0.000001
.min_vruntime : 35360.227495
.max_vruntime : 0.000001
.spread : 0.000000
.spread0 : -974.051645
.nr_running : 0
.load : 0
.bkl_cnt : 456
.nr_spread_over : 1

cfs_rq
.exec_clock : 0.428251
.MIN_vruntime : 0.000001
.min_vruntime : 35360.227495
.max_vruntime : 0.000001
.spread : 0.000000
.spread0 : -974.051645
.nr_running : 0
.load : 0
.bkl_cnt : 456
.nr_spread_over : 1

cfs_rq
.exec_clock : 5812.752391
.MIN_vruntime : 0.000001
.min_vruntime : 35360.227495
.max_vruntime : 0.000001
.spread : 0.000000
.spread0 : -974.051645
.nr_running : 0
.load : 0
.bkl_cnt : 456
.nr_spread_over : 11

runnable tasks:
task PID tree-key switches prio exec-runtime sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------
loop_silent 2400 610702.945490 859 139 610702.945490 8622.783425 13.951214
R chew-max 2427 610644.849769 2057 120 610644.849769 13737.197972 10.079232
chew-max 2433 610653.297636 1969 120 610653.297636 9679.778096 10.750704

2007-09-25 11:01:16

by Ingo Molnar

Subject: Re: [git] CFS-devel, latest code


* Srivatsa Vaddagiri <[email protected]> wrote:

> On Tue, Sep 25, 2007 at 12:41:20AM -0700, Andrew Morton wrote:
> > This doornails the Vaio. After grub handover the screen remains black
> > and the fan goes whir.
> >
> > http://userweb.kernel.org/~akpm/config-sony.txt
>
> This seems to be UP regression. Sorry abt that. I could recreate
> the problem very easily with CONFIG_SMP turned off.
>
> Can you check if this patch works? Works for me here.

thanks - i've put this fix into the core group-scheduling patch.

Ingo

2007-09-25 11:34:01

by Ingo Molnar

Subject: Re: [git] CFS-devel, latest code


* Ingo Molnar <[email protected]> wrote:

> hm. perhaps this fixup in kernel/sched.c:set_task_cpu():
>
> p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
>
> needs to become properly group-hierarchy aware?

a quick first stab like the one below does not appear to solve the
problem.

Ingo

------------------->
Subject: sched: group scheduler SMP migration fix
From: Ingo Molnar <[email protected]>

group scheduler SMP migration fix.

Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1039,7 +1039,8 @@ void set_task_cpu(struct task_struct *p,
 {
 	int old_cpu = task_cpu(p);
 	struct rq *old_rq = cpu_rq(old_cpu), *new_rq = cpu_rq(new_cpu);
-	u64 clock_offset;
+	struct sched_entity *se;
+	u64 clock_offset, voffset;
 
 	clock_offset = old_rq->clock - new_rq->clock;
 
@@ -1051,7 +1052,11 @@ void set_task_cpu(struct task_struct *p,
 	if (p->se.block_start)
 		p->se.block_start -= clock_offset;
 #endif
-	p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
+
+	se = &p->se;
+	voffset = old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
+	for_each_sched_entity(se)
+		se->vruntime -= voffset;
 
 	__set_task_cpu(p, new_cpu);
 }

2007-09-25 12:29:08

by Mike Galbraith

Subject: Re: [git] CFS-devel, latest code

On Tue, 2007-09-25 at 15:58 +0530, Srivatsa Vaddagiri wrote:

> While I try recreating this myself, I wonder if this patch helps?

It didn't here, nor did tweaking root's share. Booting with maxcpus=1,
I was unable to produce large latencies, but didn't try very many
things.

-Mike

2007-09-25 12:41:17

by Srivatsa Vaddagiri

Subject: Re: [git] CFS-devel, latest code

On Tue, Sep 25, 2007 at 12:36:17PM +0200, Ingo Molnar wrote:
> hm. perhaps this fixup in kernel/sched.c:set_task_cpu():
>
> p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;

This definitely does need some fixup, even though I am not sure yet if
it will completely solve the latency issue.

I tried the following patch. I *think* I see some improvement wrt the
latency seen when I type on the shell. Before this patch, I noticed
oddities like "kill -9 chew-max-pid" not killing chew-max (it was queued in
the runqueue, waiting for a looong time to run before it could acknowledge
the signal and exit). With this patch, I don't see such oddities .. so I am
hoping it fixes the latency problem you are seeing as well.



Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1039,6 +1039,8 @@ void set_task_cpu(struct task_struct *p,
 {
 	int old_cpu = task_cpu(p);
 	struct rq *old_rq = cpu_rq(old_cpu), *new_rq = cpu_rq(new_cpu);
+	struct cfs_rq *old_cfsrq = task_cfs_rq(p),
+		      *new_cfsrq = cpu_cfs_rq(old_cfsrq, new_cpu);
 	u64 clock_offset;
 
 	clock_offset = old_rq->clock - new_rq->clock;
@@ -1051,7 +1053,8 @@ void set_task_cpu(struct task_struct *p,
 	if (p->se.block_start)
 		p->se.block_start -= clock_offset;
 #endif
-	p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
+	p->se.vruntime -= old_cfsrq->min_vruntime -
+			 new_cfsrq->min_vruntime;
 
 	__set_task_cpu(p, new_cpu);
 }


--
Regards,
vatsa

2007-09-25 12:55:10

by Mike Galbraith

Subject: Re: [git] CFS-devel, latest code

On Tue, 2007-09-25 at 14:28 +0200, Mike Galbraith wrote:
> On Tue, 2007-09-25 at 15:58 +0530, Srivatsa Vaddagiri wrote:
>
> > While I try recreating this myself, I wonder if this patch helps?
>
> It didn't here, nor did tweaking root's share. Booting with maxcpus=1,
> I was unable to produce large latencies, but didn't try very many
> things.

Easy way to make it pretty bad: pin a nice 0 loop to CPU0, pin a nice 19
loop to CPU1, then start an unpinned make.. more Xorg bouncing back and
forth I suppose.
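
(In shell terms - a hedged reconstruction of this test, mirroring the
taskset commands used earlier in the thread, not Mike's exact script:)

	taskset 01 perl -e 'while (1) {}' &
	nice -19 taskset 02 perl -e 'while (1) {}' &
	make -j3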

se.wait_max : 14.105683
se.wait_max : 316.943787
se.wait_max : 692.884324
se.wait_max : 38.165534
se.wait_max : 732.883492
se.wait_max : 127.059784
se.wait_max : 63.403549
se.wait_max : 372.933284

-Mike

2007-09-25 13:35:29

by Mike Galbraith

Subject: Re: [git] CFS-devel, latest code

On Tue, 2007-09-25 at 18:21 +0530, Srivatsa Vaddagiri wrote:
> On Tue, Sep 25, 2007 at 12:36:17PM +0200, Ingo Molnar wrote:
> > hm. perhaps this fixup in kernel/sched.c:set_task_cpu():
> >
> > p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
>
> This definitely does need some fixup, even though I am not sure yet if
> it will solve completely the latency issue.
>
> I tried the following patch. I *think* I see some improvement, wrt
> latency seen when I type on the shell. Before this patch, I noticed
> oddities like "kill -9 chew-max-pid" wont kill chew-max (it is queued in
> runqueue waiting for a looong time to run before it can acknowledge
> signal and exit). With this patch, I don't see such oddities ..So I am hoping
> it fixes the latency problem you are seeing as well.

http://lkml.org/lkml/2007/9/25/117 plus the below seems to be the Silver
Bullet for the latencies I was seeing.

> Index: current/kernel/sched.c
> ===================================================================
> --- current.orig/kernel/sched.c
> +++ current/kernel/sched.c
> @@ -1039,6 +1039,8 @@ void set_task_cpu(struct task_struct *p,
> {
> int old_cpu = task_cpu(p);
> struct rq *old_rq = cpu_rq(old_cpu), *new_rq = cpu_rq(new_cpu);
> + struct cfs_rq *old_cfsrq = task_cfs_rq(p),
> + *new_cfsrq = cpu_cfs_rq(old_cfsrq, new_cpu);
> u64 clock_offset;
>
> clock_offset = old_rq->clock - new_rq->clock;
> @@ -1051,7 +1053,8 @@ void set_task_cpu(struct task_struct *p,
> if (p->se.block_start)
> p->se.block_start -= clock_offset;
> #endif
> - p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
> + p->se.vruntime -= old_cfsrq->min_vruntime -
> + new_cfsrq->min_vruntime;
>
> __set_task_cpu(p, new_cpu);
> }
>
>
> --
> Regards,
> vatsa

2007-09-25 13:56:41

by Srivatsa Vaddagiri

Subject: Re: [git] CFS-devel, latest code

On Tue, Sep 25, 2007 at 03:35:17PM +0200, Mike Galbraith wrote:
> > I tried the following patch. I *think* I see some improvement, wrt
> > latency seen when I type on the shell. Before this patch, I noticed
> > oddities like "kill -9 chew-max-pid" wont kill chew-max (it is queued in
> > runqueue waiting for a looong time to run before it can acknowledge
> > signal and exit). With this patch, I don't see such oddities ..So I am hoping
> > it fixes the latency problem you are seeing as well.
>
> http://lkml.org/lkml/2007/9/25/117 plus the below seems to be the Silver
> Bullet for the latencies I was seeing.

Cool ..Thanks for the quick feedback.

Ingo, do the two patches fix the latency problems you were seeing as
well?

--
Regards,
vatsa

2007-09-25 14:50:18

by Srivatsa Vaddagiri

Subject: Re: [git] CFS-devel, latest code

On Tue, Sep 25, 2007 at 01:33:06PM +0200, Ingo Molnar wrote:
> > hm. perhaps this fixup in kernel/sched.c:set_task_cpu():
> >
> > p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
> >
> > needs to become properly group-hierarchy aware?

You seem to have hit the nerve for this problem. The two patches I sent:

http://lkml.org/lkml/2007/9/25/117
http://lkml.org/lkml/2007/9/25/168

partly help, but we can do better.

> ===================================================================
> --- linux.orig/kernel/sched.c
> +++ linux/kernel/sched.c
> @@ -1039,7 +1039,8 @@ void set_task_cpu(struct task_struct *p,
> {
> int old_cpu = task_cpu(p);
> struct rq *old_rq = cpu_rq(old_cpu), *new_rq = cpu_rq(new_cpu);
> - u64 clock_offset;
> + struct sched_entity *se;
> + u64 clock_offset, voffset;
>
> clock_offset = old_rq->clock - new_rq->clock;
>
> @@ -1051,7 +1052,11 @@ void set_task_cpu(struct task_struct *p,
> if (p->se.block_start)
> p->se.block_start -= clock_offset;
> #endif
> - p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
> +
> + se = &p->se;
> + voffset = old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;

This one feels wrong, although I can't quite articulate why ..

> + for_each_sched_entity(se)
> + se->vruntime -= voffset;

Note that a task's parent entities are per-cpu. So if a task A
belonging to userid guest hops from CPU0 to CPU1, then it gets a new parent
entity as well, one which is different from its parent entity on CPU0.

Before:
taskA->se.parent = guest's tg->se[0]

After:
taskA->se.parent = guest's tg->se[1]

So walking up the entity hierarchy and fixing up (parent)se->vruntime will do
little good after the task has moved to a new cpu.
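
(To illustrate the layout being described - a hedged sketch, with field
names as used by the group-scheduler core patch of this era:)

	struct task_group {
		/* one scheduling entity per cpu: se[cpu] */
		struct sched_entity **se;
		/* one group runqueue per cpu, which those entities represent */
		struct cfs_rq **cfs_rq;
	};

	/*
	 * after task p migrates to new_cpu, its hierarchy parent is
	 * effectively task_group(p)->se[new_cpu], not the old CPU's entity
	 */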

IMO, we need to be doing this:

	- For dequeue of higher-level sched entities, simulate as if
	  they are going to "sleep"
	- For enqueue of higher-level entities, simulate as if they are
	  "waking up". This will cause enqueue_entity() to reset their
	  vruntime (to the existing value of cfs_rq->min_vruntime) when
	  they "wake up".

If we don't do this, then let's say a group had only one task (A) and it
moves from CPU0 to CPU1. Then on CPU1, when the group-level entity for task
A is enqueued, it will have a very low vruntime (since it was never
running) and this will give task A unlimited cpu time, until its group
entity catches up with all the "sleep" time.
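
(A minimal sketch of the "simulate wakeup" placement proposed above - a
hypothetical helper, not the eventual patch: on enqueue, a long-dequeued
group entity's stale vruntime is pulled up to the runqueue's current
min_vruntime, so it cannot cash in unbounded "sleep" credit.)

	static void place_entity_on_wakeup(struct cfs_rq *cfs_rq, struct sched_entity *se)
	{
		/* never move an entity that is already ahead backwards */
		se->vruntime = max_vruntime(se->vruntime, cfs_rq->min_vruntime);
	}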

Let me try a fix for this next ..

--
Regards,
vatsa

2007-09-25 15:23:17

by Daniel Walker

Subject: Re: [git] CFS-devel, latest code

On Tue, 2007-09-25 at 08:45 +0200, Ingo Molnar wrote:
> * Daniel Walker <[email protected]> wrote:
>
> > On Mon, 2007-09-24 at 23:45 +0200, Ingo Molnar wrote:
> > > Lots of scheduler updates in the past few days, done by many people.
> > > Most importantly, the SMP latency problems reported and debugged by
> > > Mike
> > > Galbraith should be fixed for good now.
> >
> > Does this have anything to do with idle balancing ? I noticed some
> > fairly large latencies in that code in 2.6.23-rc's ..
>
> any measurements?

Yes, I made this a while ago,

ftp://source.mvista.com/pub/dwalker/misc/long-cfs-load-balance-trace.txt

This was with PREEMPT_RT on btw, so it's not the most recent kernel. I
was able to reproduce it in all the -rc's I tried.

Daniel

2007-09-26 08:04:50

by Mike Galbraith

Subject: Re: [git] CFS-devel, latest code

On Tue, 2007-09-25 at 11:47 +0200, Ingo Molnar wrote:

> Maybe there's more to come: if we can get CONFIG_FAIR_USER_SCHED to work
> properly then your Xorg will have a load-independent 50% of CPU time all
> to itself. (Group scheduling is quite impressive already: i can log in
> as root without feeling _any_ effect from a perpetual 'hackbench 100'
> running as uid mingo. Fork bombs no more.) Will the Amarok gforce plugin
> like that CPU time splitup? (or is most of the gforce overhead under
> your user uid?)
>
> it could also work out negatively, _sometimes_ X does not like being too
> high prio. (weird as that might be.) So we'll see.

I piddled around with fair users this morning, and it worked well. With
Xorg and Gforce as one user (X and Gforce are synchronous ATM), and a
make -j30 as another, I could barely tell the make was running.
Watching a dvd, I couldn't tell. Latencies were pretty darn good
throughout three hours of testing this and that.

-Mike

2007-09-28 21:41:48

by Bill Davidsen

Subject: Re: [git] CFS-devel, latest code

Ingo Molnar wrote:

> Maybe there's more to come: if we can get CONFIG_FAIR_USER_SCHED to work
> properly then your Xorg will have a load-independent 50% of CPU time all
> to itself.

It seems that a 50% share makes more sense on a single/dual CPU
system than on a more robust one, such as a four-way dual-core Xeon with
HT or some such. With hotplug CPUs, and setups varying across machines,
perhaps some resource limit independent of the available resources would
be useful.

Just throwing out the idea, in case it lands on fertile ground.

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot