2007-06-22 22:02:55

by Ingo Molnar

Subject: [patch] CFS scheduler, -v18


i'm pleased to announce release -v18 of the CFS scheduler patchset.

The rolled-up CFS patch against today's -git kernel, v2.6.22-rc5,
v2.6.22-rc4-mm2, v2.6.21.5 or v2.6.20.14 can be downloaded from the
usual place:

http://people.redhat.com/mingo/cfs-scheduler/

The biggest changes in -v18 are various performance-related improvements.
Thomas Gleixner has eliminated expensive 64-bit divisions by converting
the arithmetic to scaled math (without impacting the quality of
calculations). Srivatsa Vaddagiri and Dmitry Adamushko have continued
the abstraction and cleanup work. Srivatsa Vaddagiri and Christoph
Lameter fixed the NUMA balancing bug reported by Paul McKenney. There
were also a good number of other refinements to the CFS code. (No
reproducible behavioral regressions were reported against -v17 so far,
so the 'behavioral' bits are mostly unchanged.)
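
As a rough illustration of the scaled-math technique (a sketch only, not
the actual CFS code; the names and the 2^32 scale factor are
illustrative assumptions): the expensive division is paid only once per
weight change, and the hot path becomes a multiply plus a shift:

#include <stdint.h>

#define WMULT_SHIFT	32

/* precompute once, whenever 'weight' (non-zero) changes: inv ~= 2^32 / weight */
static inline uint64_t calc_inv_weight(uint32_t weight)
{
	return ((uint64_t)1 << WMULT_SHIFT) / weight;
}

/*
 * hot path: 'delta / weight' becomes a multiply plus a shift;
 * 'delta' fits in 32 bits, so the 64-bit product cannot overflow.
 */
static inline uint64_t scale_delta(uint32_t delta, uint64_t inv_weight)
{
	return ((uint64_t)delta * inv_weight) >> WMULT_SHIFT;
}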

Changes since -v17:

- implement scaled math speedups for CFS. (Thomas Gleixner)

- lots of core code updates, cleanups and streamlining.
(Srivatsa Vaddagiri, Dmitry Adamushko, me.)

- bugfix: fix NUMA balancing. (Srivatsa Vaddagiri, Christoph Lameter,
Paul E. McKenney)

- feature: SCHED_IDLE now also implies block-scheduler (CFQ)
idle-IO-priority. (suggested by Thomas Sattler, picked up from -ck;
see the ionice note after this list)

- build fix for ppc32. (reported, tested and confirmed fixed by
Art Haas)

- ARM fix. (reported and debugged by Thomas Gleixner)

- cleanup: implemented idle_sched_class in kernel/sched_idletask.c as a
way to separate out rq->idle handling out of the core scheduler. This
made a good deal of idle-task related special-cases go away.

- debug: make the sysctls safer by introducing high and low limits.
(see the sysctl sketch after this list)

- cleanup: move some of the debug counters to under CONFIG_SCHEDSTATS.

- speedup: various micro-optimizations

- various other small updates.
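
Regarding the SCHED_IDLE/CFQ item above: the idle IO class can also be
set or inspected from userspace with the ionice tool (from
util-linux/schedutils); a usage sketch, assuming class 3 is CFQ's idle
class and 1234 is some PID:

$ ionice -c3 -p 1234	# put PID 1234 into the idle IO class
$ ionice -p 1234	# query the current IO class of PID 1234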
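
Regarding the sysctl limits item above, a minimal sketch of how such
high/low limits are typically wired up via proc_dointvec_minmax(); the
table entry and the min/max values here are illustrative, not the
actual CFS table:

static int min_sched_granularity_ns = 100000;		/* illustrative */
static int max_sched_granularity_ns = 1000000000;	/* illustrative */

static struct ctl_table sched_table[] = {
	{
		.ctl_name	= CTL_UNNUMBERED,
		.procname	= "sched_granularity_ns",
		.data		= &sysctl_sched_granularity,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec_minmax,
		.extra1		= &min_sched_granularity_ns,
		.extra2		= &max_sched_granularity_ns,
	},
	{ .ctl_name = 0 }
};

Out-of-range writes are then refused by the generic proc handler, so a
typo cannot wedge the scheduler.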

As usual, any sort of feedback, bugreport, fix and suggestion is more
than welcome!

Ingo


2007-06-22 22:09:38

by S.Çağlar Onur

Subject: Re: [patch] CFS scheduler, -v18

Hi Ingo;

On Saturday, 23 June 2007, Ingo Molnar wrote:
> As usual, any sort of feedback, bugreport, fix and suggestion is more
> than welcome!

caglar@zangetsu linux-2.6 $ LC_ALL=C make
CHK include/linux/version.h
CHK include/linux/utsrelease.h
CALL scripts/checksyscalls.sh
CHK include/linux/compile.h
CC kernel/sched.o
kernel/sched.c:745:28: sched_idletask.c: No such file or directory
kernel/sched.c: In function `init_idle_bootup_task':
kernel/sched.c:4659: error: `idle_sched_class' undeclared (first use in this
function)
kernel/sched.c:4659: error: (Each undeclared identifier is reported only once
kernel/sched.c:4659: error: for each function it appears in.)
kernel/sched.c: In function `init_idle':
kernel/sched.c:4698: error: `idle_sched_class' undeclared (first use in this
function)
kernel/sched.c: In function `sched_init':
kernel/sched.c:6196: error: `idle_sched_class' undeclared (first use in this
function)
make[1]: *** [kernel/sched.o] Error 1
make: *** [kernel] Error 2

Cheers
--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!


2007-06-22 22:17:03

by S.Çağlar Onur

Subject: Re: [patch] CFS scheduler, -v18

On Saturday, 23 June 2007, S.Çağlar Onur wrote:
> Hi Ingo;
>
> On Saturday, 23 June 2007, Ingo Molnar wrote:
> > As usual, any sort of feedback, bugreport, fix and suggestion is more
> > than welcome!
>
> caglar@zangetsu linux-2.6 $ LC_ALL=C make
> CHK include/linux/version.h
> CHK include/linux/utsrelease.h
> CALL scripts/checksyscalls.sh
> CHK include/linux/compile.h
> CC kernel/sched.o
> kernel/sched.c:745:28: sched_idletask.c: No such file or directory

Ahh, and this happens with [1]; grabbing sched_idletask.c from the .18
one solves the problem...

Index: linux/kernel/sched_idletask.c
===================================================================
--- /dev/null
+++ linux/kernel/sched_idletask.c
@@ -0,0 +1,68 @@
+/*
+ * idle-task scheduling class.
+ *
+ * (NOTE: these are not related to SCHED_IDLE tasks which are
+ * handled in sched_fair.c)
+ */
+
+/*
+ * Idle tasks are unconditionally rescheduled:
+ */
+static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p)
+{
+ resched_task(rq->idle);
+}
+
+static struct task_struct *pick_next_task_idle(struct rq *rq, u64 now)
+{
+ schedstat_inc(rq, sched_goidle);
+
+ return rq->idle;
+}
+
+/*
+ * It is not legal to sleep in the idle task - print a warning
+ * message if some code attempts to do it:
+ */
+static void
+dequeue_task_idle(struct rq *rq, struct task_struct *p, int sleep, u64 now)
+{
+ spin_unlock_irq(&rq->lock);
+ printk(KERN_ERR "bad: scheduling from the idle thread!\n");
+ dump_stack();
+ spin_lock_irq(&rq->lock);
+}
+
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, u64 now)
+{
+}
+
+static struct task_struct *load_balance_start_idle(struct rq *rq)
+{
+ return NULL;
+}
+
+static void task_tick_idle(struct rq *rq, struct task_struct *curr)
+{
+}
+
+/*
+ * Simple, special scheduling class for the per-CPU idle tasks:
+ */
+struct sched_class idle_sched_class __read_mostly = {
+ /* no enqueue/yield_task for idle tasks */
+
+ /* dequeue is not valid, we print a debug message there: */
+ .dequeue_task = dequeue_task_idle,
+
+ .check_preempt_curr = check_preempt_curr_idle,
+
+ .pick_next_task = pick_next_task_idle,
+ .put_prev_task = put_prev_task_idle,
+
+ .load_balance_start = load_balance_start_idle,
+ /* no .load_balance_next for idle tasks */
+
+ .task_tick = task_tick_idle,
+ /* no .task_new for idle tasks */
+};


[1]
http://people.redhat.com/mingo/cfs-scheduler/sched-cfs-v2.6.22-git-v18.patch

Cheers
--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!


2007-06-22 22:21:36

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v18


* S.Çağlar Onur <[email protected]> wrote:

> > kernel/sched.c:745:28: sched_idletask.c: No such file or directory
>
> Ahh, and this happens with [1]; grabbing sched_idletask.c from the .18
> one solves the problem...

oops, indeed - i've fixed up the -git patch:

http://people.redhat.com/mingo/cfs-scheduler/sched-cfs-v2.6.22-git-v18.patch

Ingo

2007-06-22 23:08:43

by Gene Heskett

Subject: Re: [patch] CFS scheduler, -v18

On Friday 22 June 2007, Ingo Molnar wrote:
>i'm pleased to announce release -v18 of the CFS scheduler patchset.
>
>The rolled-up CFS patch against today's -git kernel, v2.6.22-rc5,
>v2.6.22-rc4-mm2, v2.6.21.5 or v2.6.20.14 can be downloaded from the
>usual place:
>
> http://people.redhat.com/mingo/cfs-scheduler/
>
>The biggest changes in -v18 are various performance-related improvements.
>Thomas Gleixner has eliminated expensive 64-bit divisions by converting
>the arithmetic to scaled math (without impacting the quality of
>calculations). Srivatsa Vaddagiri and Dmitry Adamushko have continued
>the abstraction and cleanup work. Srivatsa Vaddagiri and Christoph
>Lameter fixed the NUMA balancing bug reported by Paul McKenney. There
>were also a good number of other refinements to the CFS code. (No
>reproducible behavioral regressions were reported against -v17 so far,
>so the 'behavioral' bits are mostly unchanged.)
>
>Changes since -v17:
>
> - implement scaled math speedups for CFS. (Thomas Gleixner)
>
> - lots of core code updates, cleanups and streamlining.
> (Srivatsa Vaddagiri, Dmitry Adamushko, me.)
>
> - bugfix: fix NUMA balancing. (Srivatsa Vaddagiri, Christoph Lameter,
> Paul E. McKenney)
>
> - feature: SCHED_IDLE now also implies block-scheduler (CFQ)
> idle-IO-priority. (suggested by Thomas Sattler, picked up from -ck)
>
> - build fix for ppc32. (reported, tested and confirmed fixed by
> Art Haas)
>
> - ARM fix. (reported and debugged by Thomas Gleixner)
>
> - cleanup: implemented idle_sched_class in kernel/sched_idletask.c as a
> way to separate out rq->idle handling out of the core scheduler. This
> made a good deal of idle-task related special-cases go away.
>
> - debug: make the sysctls safer by introducing high and low limits.
>
> - cleanup: move some of the debug counters to under CONFIG_SCHEDSTATS.
>
> - speedup: various micro-optimizations
>
> - various other small updates.
>
>As usual, any sort of feedback, bugreport, fix and suggestion is more
>than welcome!
>
Humm, problem methinks. Applying the patch (after 2.6.22-rc5 had been
applied to 2.6.21), from my script:

now applying patch sched-cfs-v2.6.22-rc5-v18.patch

patching file Documentation/kernel-parameters.txt
patching file Documentation/sched-design-CFS.txt
patching file Makefile
patching file arch/i386/kernel/smpboot.c
patching file arch/i386/kernel/tsc.c
patching file arch/ia64/kernel/setup.c
patching file arch/mips/kernel/smp.c
patching file arch/sparc/kernel/smp.c
patching file arch/sparc64/kernel/smp.c
patching file block/cfq-iosched.c
patching file fs/proc/array.c
patching file fs/proc/base.c
patching file include/asm-generic/bitops/sched.h
patching file include/linux/hardirq.h
patching file include/linux/sched.h
patching file include/linux/topology.h
patching file init/main.c
patching file kernel/delayacct.c
patching file kernel/exit.c
patching file kernel/fork.c
patching file kernel/posix-cpu-timers.c
patching file kernel/sched.c
patching file kernel/sched_debug.c
patching file kernel/sched_fair.c
patching file kernel/sched_idletask.c
patching file kernel/sched_rt.c
patching file kernel/sched_stats.h
patching file kernel/softirq.c
patching file kernel/sysctl.c
The next patch would delete the file l/kernel/sched.c,
which does not exist! Assume -R? [n]

How to proceed?

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Even more amazing was the realization that God has Internet access. I
wonder if He has a full newsfeed?
-- Matt Welsh

2007-06-23 07:12:30

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v18


* Gene Heskett <[email protected]> wrote:

> patching file kernel/softirq.c
> patching file kernel/sysctl.c
> The next patch would delete the file l/kernel/sched.c,
> which does not exist! Assume -R? [n]
>
> How to proceed?

oops - just ignore it, or re-download the patch (i fixed it).

Ingo

2007-06-23 09:55:39

by Gene Heskett

Subject: Re: [patch] CFS scheduler, -v18

On Saturday 23 June 2007, Ingo Molnar wrote:
>* Gene Heskett <[email protected]> wrote:
>> patching file kernel/softirq.c
>> patching file kernel/sysctl.c
>> The next patch would delete the file l/kernel/sched.c,
>> which does not exist! Assume -R? [n]
>>
>> How to proceed?
>
>oops - just ignore it, or re-download the patch (i fixed it).
>
> Ingo

Answering n to all of that, I note the build, at the end of make bzImage,
spits out this:
MODPOST vmlinux
WARNING: arch/i386/kernel/built-in.o(.text+0x845d): Section mismatch:
reference to .init.text:amd_init_mtrr (between 'mtrr_bp_init'
and 'mtrr_save_state')
WARNING: arch/i386/kernel/built-in.o(.text+0x8462): Section mismatch:
reference to .init.text:cyrix_init_mtrr (between 'mtrr_bp_init'
and 'mtrr_save_state')
WARNING: arch/i386/kernel/built-in.o(.text+0x8467): Section mismatch:
reference to .init.text:centaur_init_mtrr (between 'mtrr_bp_init'
and 'mtrr_save_state')
WARNING: arch/i386/kernel/built-in.o(.text+0x9284): Section mismatch:
reference to .init.text: (between 'get_mtrr_state' and 'generic_get_mtrr')
WARNING: arch/i386/kernel/built-in.o(.text+0x9298): Section mismatch:
reference to .init.text: (between 'get_mtrr_state' and 'generic_get_mtrr')
WARNING: arch/i386/kernel/built-in.o(.text+0x92bc): Section mismatch:
reference to .init.text: (between 'get_mtrr_state' and 'generic_get_mtrr')

But it then proceeds with the make modules stage. I believe I've seen
references to this in other threads. Is it serious?

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Debug is human, de-fix divine.

2007-06-23 10:22:19

by Antonino Ingargiola

Subject: Re: [patch] CFS scheduler, -v18

Hi,

2007/6/23, Ingo Molnar <[email protected]>:
>
> i'm pleased to announce release -v18 of the CFS scheduler patchset.

I'm running -v18 on 2.6.22-rc5, no problems so far. How can I change a
task to SCHED_IDLE or SCHED_BATCH priority under CFS?


Thanks,

~ Antonio

2007-06-23 13:31:42

by Willy Tarreau

Subject: Re: [patch] CFS scheduler, -v18

Hi Ingo,

On Sat, Jun 23, 2007 at 12:02:02AM +0200, Ingo Molnar wrote:
>
> i'm pleased to announce release -v18 of the CFS scheduler patchset.
>
> The rolled-up CFS patch against today's -git kernel, v2.6.22-rc5,
> v2.6.22-rc4-mm2, v2.6.21.5 or v2.6.20.14 can be downloaded from the
> usual place:
>
> http://people.redhat.com/mingo/cfs-scheduler/
>
> The biggest changes in -v18 are various performance-related improvements.
> Thomas Gleixner has eliminated expensive 64-bit divisions by converting
> the arithmetic to scaled math (without impacting the quality of
> calculations). Srivatsa Vaddagiri and Dmitry Adamushko have continued
> the abstraction and cleanup work. Srivatsa Vaddagiri and Christoph
> Lameter fixed the NUMA balancing bug reported by Paul McKenney. There
> were also a good number of other refinements to the CFS code. (No
> reproducible behavioral regressions were reported against -v17 so far,
> so the 'behavioral' bits are mostly unchanged.)

Today I had a little time to try CFS again (last time it was -v9!). I ran it
on top of 2.6.20.14, and simply tried ocbench again. You remember? With -v9,
I ran 64 processes which all progressed very smoothly. With -v18, it's not
the case anymore. When I run 64 processes, only 7 of them show smooth rounds,
while all the other ones are only updated once a second. Sometimes they only
progress by one iteration, sometimes by a full round. Some are even updated
once every 2 seconds, because if I drag an xterm above them and quickly remove
it, the xterm leaves a trace there for up to 2 seconds.

Also, only one of my 2 CPUs is used. I see the rq vary between 1 and 5, with
a permanent 50% idle... :

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 0 0 874400 7864 90436 0 0 0 0 279 2204 50 0 50
3 0 0 0 874408 7864 90436 0 0 0 0 273 2122 50 1 50
1 0 0 0 874408 7864 90436 0 0 0 0 253 1660 49 1 50
3 0 0 0 874408 7864 90436 0 0 0 0 252 1977 50 0 50
2 0 0 0 874408 7864 90436 0 0 0 0 253 2274 49 1 50
3 0 0 0 874408 7864 90436 0 0 0 0 252 1846 49 1 50
1 0 0 0 874408 7864 90436 0 0 0 0 339 1782 49 1 50

I have no idea about what version brought that unexpected behaviour, but it's
clearly something which needs to be tracked down.

My scheduler is at 250 Hz, and here are the values I found in /proc/sys/kernel:
root@pcw:kernel# grep '' sched_*
sched_batch_wakeup_granularity_ns:40000000
sched_child_runs_first:1
sched_features:14
sched_granularity_ns:10000000
sched_runtime_limit_ns:40000000
sched_stat_granularity_ns:0
sched_wakeup_granularity_ns:4000000

I have tried to change each of them, with absolutely no effect. It seems
really strange. Unfortunately, I have to leave right now. Maybe you can
try on your side with this simple command:

$ ./ocbench -x 8 -y 8

Sorry for not giving more information right now.

Regards,
Willy

2007-06-23 17:26:16

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v18


* Antonino Ingargiola <[email protected]> wrote:

> 2007/6/23, Ingo Molnar <[email protected]>:
> >
> >i'm pleased to announce release -v18 of the CFS scheduler patchset.
>
> I'm running -v18 on 2.6.22-rc5, no problems so far. How can I change a
> task to SCHED_IDLE or SCHED_BATCH priority under CFS?

pick up schedtool, and these are the choices it gives:

-N for SCHED_NORMAL
-F -p PRIO for SCHED_FIFO only as root
-R -p PRIO for SCHED_RR only as root
-B for SCHED_BATCH
-I -p PRIO for SCHED_ISO
-D for SCHED_IDLEPRIO

then for example to start up something as SCHED_IDLE:

schedtool -D -e ./somecommand.sh
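
the same can be done programmatically; a minimal C sketch using
sched_setscheduler(), with the SCHED_BATCH fallback define assumed from
the 2.6 kernel headers:

#include <sched.h>
#include <stdio.h>

#ifndef SCHED_BATCH
#define SCHED_BATCH 3	/* not in all glibc headers of this era */
#endif

int main(void)
{
	struct sched_param sp = { .sched_priority = 0 };

	/* pid 0 means the calling task */
	if (sched_setscheduler(0, SCHED_BATCH, &sp) == -1)
		perror("sched_setscheduler");
	return 0;
}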

Ingo

2007-06-24 10:02:26

by Antonino Ingargiola

Subject: Re: [patch] CFS scheduler, -v18

2007/6/23, Ingo Molnar <[email protected]>:
>
> * Antonino Ingargiola <[email protected]> wrote:
>
> > 2007/6/23, Ingo Molnar <[email protected]>:
> > >
> > >i'm pleased to announce release -v18 of the CFS scheduler patchset.
> >
> > I'm running -v18 on 2.6.22-rc5, no problems so far. How can I change a
> > task to SCHED_IDLE or SCHED_BATCH priority under CFS?
>
> pick up schedtool, and these are the choices it gives:
>
> -N for SCHED_NORMAL
> -F -p PRIO for SCHED_FIFO only as root
> -R -p PRIO for SCHED_RR only as root
> -B for SCHED_BATCH
> -I -p PRIO for SCHED_ISO
> -D for SCHED_IDLEPRIO
>
> then for example to start up something as SCHED_IDLE:
>
> schedtool -D -e ./somecommand.sh
>

Thank you very much! I thought schedtool was suitable only for -ck.
I've installed schedtool and it works fine.

Anyway, I've discovered with great pleasure that CFS also has the
SCHED_ISO priority. I may have missed something, but I don't remember
reading about this in any of the CFS release notes :). For me this is a
really useful feature. Thanks.


Regards,

~ Antonio

> Ingo
>

2007-06-24 11:07:57

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v18


* Antonino Ingargiola <[email protected]> wrote:

> Anyway, I've discovered with great pleasure that CFS also has the
> SCHED_ISO priority. I may have missed something, but I don't remember
> reading about this in any of the CFS release notes :). For me this is a
> really useful feature. Thanks.

well, it's only a hack and emulated: SCHED_ISO in CFS is recognized as a
policy but it falls back to SCHED_NORMAL. Could you check how well this
(i.e. SCHED_NORMAL) works for your workload, compared to SD's SCHED_ISO?
If you'd like to increase the priority of a task, i'd suggest using
negative nice levels. (use the 'nice' option in
/etc/security/limits.conf with a newer version of PAM to allow
unprivileged users to use negative nice levels.)
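
a one-line sketch of that limits.conf entry (the username and the limit
value here are illustrative):

#<domain>	<type>	<item>	<value>
antonio		-	nice	-10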

Ingo

2007-06-24 15:53:00

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v18


* Willy Tarreau <[email protected]> wrote:

> Today I had a little time to try CFS again (last time it was -v9!). I
> ran it on top of 2.6.20.14, and simply tried ocbench again. You
> remember? With -v9, I ran 64 processes which all progressed very
> smoothly. With -v18, it's not the case anymore. When I run 64
> processes, only 7 of them show smooth rounds, while all the other ones
> are only updated once a second. Sometimes they only progress by one
> iteration, sometimes by a full round. Some are even updated once every
> 2 seconds, because if I drag an xterm above them and quickly remove
> it, the xterm leaves a trace there for up to 2 seconds.
>
> Also, only one of my 2 CPUs is used. I see the rq vary between 1 and
> 5, with a permanent 50% idle... :
>
> procs memory swap io system cpu
> r b w swpd free buff cache si so bi bo in cs us sy id
> 1 0 0 0 874400 7864 90436 0 0 0 0 279 2204 50 0 50
> 3 0 0 0 874408 7864 90436 0 0 0 0 273 2122 50 1 50
> 1 0 0 0 874408 7864 90436 0 0 0 0 253 1660 49 1 50
> 3 0 0 0 874408 7864 90436 0 0 0 0 252 1977 50 0 50
> 2 0 0 0 874408 7864 90436 0 0 0 0 253 2274 49 1 50
> 3 0 0 0 874408 7864 90436 0 0 0 0 252 1846 49 1 50
> 1 0 0 0 874408 7864 90436 0 0 0 0 339 1782 49 1 50
>
> I have no idea about what version brought that unexpected behaviour,
> but it's clearly something which needs to be tracked down.

hm, the two problems might be related. Could you try v17 perhaps? In v18
i have 'unified' all the sched.c's between the various kernel releases,
maybe that brought in something unexpected on 2.6.20.14. (perhaps try
v2.6.21.5 based cfs too?)

> My scheduler is at 250 Hz, and here are the values I found in
> /proc/sys/kernel:

could you send me the file the cfs-debug-info.sh script produced. You
can pick the script up from:

http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

(i'd suggest sending it in private mail, the output is large and detailed.
If your kernel has no /proc/config.gz then please send me your .config
file too.)

Ingo

2007-06-24 17:09:20

by Willy Tarreau

Subject: Re: [patch] CFS scheduler, -v18

Hi Ingo,

On Sun, Jun 24, 2007 at 05:52:14PM +0200, Ingo Molnar wrote:
>
> * Willy Tarreau <[email protected]> wrote:
>
> > Today I had a little time to try CFS again (last time it was -v9!). I
> > ran it on top of 2.6.20.14, and simply tried ocbench again. You
> > remember? With -v9, I ran 64 processes which all progressed very
> > smoothly. With -v18, it's not the case anymore. When I run 64
> > processes, only 7 of them show smooth rounds, while all the other ones
> > are only updated once a second. Sometimes they only progress by one
> > iteration, sometimes by a full round. Some are even updated once every
> > 2 seconds, because if I drag an xterm above them and quickly remove
> > it, the xterm leaves a trace there for up to 2 seconds.
> >
> > Also, only one of my 2 CPUs is used. I see the rq vary between 1 and
> > 5, with a permanent 50% idle... :
> >
> > procs memory swap io system cpu
> > r b w swpd free buff cache si so bi bo in cs us sy id
> > 1 0 0 0 874400 7864 90436 0 0 0 0 279 2204 50 0 50
> > 3 0 0 0 874408 7864 90436 0 0 0 0 273 2122 50 1 50
> > 1 0 0 0 874408 7864 90436 0 0 0 0 253 1660 49 1 50
> > 3 0 0 0 874408 7864 90436 0 0 0 0 252 1977 50 0 50
> > 2 0 0 0 874408 7864 90436 0 0 0 0 253 2274 49 1 50
> > 3 0 0 0 874408 7864 90436 0 0 0 0 252 1846 49 1 50
> > 1 0 0 0 874408 7864 90436 0 0 0 0 339 1782 49 1 50
> >
> > I have no idea about what version brought that unexpected behaviour,
> > but it's clearly something which needs to be tracked down.
>
> hm, the two problems might be related. Could you try v17 perhaps? In v18
> i have 'unified' all the sched.c's between the various kernel releases,
> maybe that brought in something unexpected on 2.6.20.14. (perhaps try
> v2.6.21.5 based cfs too?)

Well, forget this, I'm nuts. I'm sorry, but I did not set any of the -R
and -S parameters on ocbench, which means that all the processes ran at
full speed and did not sleep. The load distribution was not fair, but
since they put a lot of stress on the X server, I think it might be one
of the reasons for the unfairness. I got the same behaviour with -v17,
-v9 and even 2.4! That told me something was wrong on my side ;-)

I've retried with 50%/50% run/sleep, and it now works like a charm. It's
perfectly smooth with both small and long run/sleep times (between 1 and 100
ms). I think that X being saturated might explain why I only had one CPU
running at 100%!
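
(for reference, the rerun would look something like this, assuming
ocbench's -R and -S flags take the run and sleep times in milliseconds:

$ ./ocbench -x 8 -y 8 -R 50 -S 50

i.e. 64 processes, each alternating 50ms of work with 50ms of sleep.)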

Next time, I'll try to take a bit more time for such a test.

> could you send me the file the cfs-debug-info.sh script produced. You
> can pick the script up from:
>
> http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

OK I got it, but I've not run it since the problem was between the keyboard
and the chair. If you want an output anyway, I can give it a run.

Sorry again for the wrong alert.

regards,
willy

2007-06-24 20:32:33

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v18


* Willy Tarreau <[email protected]> wrote:

> > > I have no idea about what version brought that unexpected
> > > behaviour, but it's clearly something which needs to be tracked
> > > down.
> >
> > hm, the two problems might be related. Could you try v17 perhaps? In
> > v18 i have 'unified' all the sched.c's between the various kernel
> > releases, maybe that brought in something unexpected on 2.6.20.14.
> > (perhaps try v2.6.21.5 based cfs too?)
>
> Well, forget this, I'm nuts. I'm sorry, but I did not set any of the
> -R and -S parameters on ocbench, which means that all the processes ran
> at full speed and did not sleep. The load distribution was not fair,
> but since they put a lot of stress on the X server, I think it might
> be one of the reasons for the unfairness. I got the same behaviour
> with -v17, -v9 and even 2.4! That told me something was wrong on my
> side ;-)
>
> I've retried with 50%/50% run/sleep, and it now works like a charm.
> It's perfectly smooth with both small and long run/sleep times
> (between 1 and 100 ms). I think that X being saturated might
> explain why I only had one CPU running at 100%!

ah, great! :-) My testbox needs a 90% / 10% ratio between sleep/run for
an 8x8 matrix of ocbench tasks to not overload the X server. Once the
overload happens, X starts penalizing certain clients it finds abusive (i
think), and that mechanism seems to be wall-clock based and it thus
brings in a lot of non-determinism and skews the clients.

Ingo

2007-06-25 07:27:29

by Antonino Ingargiola

Subject: Re: [patch] CFS scheduler, -v18

2007/6/24, Ingo Molnar <[email protected]>:
>
> * Antonino Ingargiola <[email protected]> wrote:
>
> > Anyway, I've discovered with great pleasure that CFS also has the
> > SCHED_ISO priority. I may have missed something, but I don't remember
> > reading about this in any of the CFS release notes :). For me this is a
> > really useful feature. Thanks.
>
> well, it's only a hack and emulated: SCHED_ISO in CFS is recognized as a
> policy but it falls back to SCHED_NORMAL. Could you check how well this
> (i.e. SCHED_NORMAL) works for your workload, compared to SD's SCHED_ISO?

To be fair, my workload is not really "critical". I've been used to
skip-free audio listening (no matter what) for a long time, running my
audio player with SCHED_ISO. Even in mainline the skips aren't that
frequent, but they are still annoying. I'm using SCHED_ISO for the
confidence it gives in providing skip-free audio.

For my modest needs, CFS SCHED_NORMAL has also been just fine (these
last few days). I'll report back if I can find a more critical workload
that can stress CFS SCHED_NORMAL.


Regards,

~ Antonio

2007-06-26 03:04:26

by Andrew Morton

Subject: Re: [patch] CFS scheduler, -v18

On Sat, 23 Jun 2007 00:20:36 +0200 Ingo Molnar <[email protected]> wrote:

>
> * S.Çağlar Onur <[email protected]> wrote:
>
> > > kernel/sched.c:745:28: sched_idletask.c: No such file or directory
> >
> > Ahh, and this happens with [1]; grabbing sched_idletask.c from the .18
> > one solves the problem...
>
> oops, indeed - i've fixed up the -git patch:
>
> http://people.redhat.com/mingo/cfs-scheduler/sched-cfs-v2.6.22-git-v18.patch
>

So I locally generated the diff to take -mm up to the above version of CFS.


- sys_sched_yield_to() went away? I guess I missed that.

- Curious: the simplification of task_tick_rt() seems to go only
halfway. Could do

	if (p->policy != SCHED_RR)
		return;

	if (--p->time_slice)
		return;

	/* stuff goes here */

- dud macro:

#define is_rt_policy(p) ((p) == SCHED_FIFO || (p) == SCHED_RR)

It evaluates its arg twice and could and should be coded in C.

There are a bunch of other don't-need-to-be-implemented-as-a-macro
macros around there too. Generally, I suggest you review all the
patchset for macros-which-don't-need-to-be-macros.

- Extraneous newline:

enum cpu_idle_type
{

- Style thing:

struct sched_entity {
struct load_weight load; /* for nice- load-balancing purposes */
int on_rq;
struct rb_node run_node;
unsigned long delta_exec;
s64 delta_fair;

u64 wait_start_fair;
u64 wait_start;
u64 exec_start;
u64 sleep_start, sleep_start_fair;
u64 block_start;
u64 sleep_max;
u64 block_max;
u64 exec_max;
u64 wait_max;
u64 last_ran;

s64 wait_runtime;
u64 sum_exec_runtime;
s64 fair_key;
s64 sum_wait_runtime, sum_sleep_runtime;
unsigned long wait_runtime_overruns, wait_runtime_underruns;
};

I think the one-definition-per-line style is better than the `unsigned
long foo,bar,zot,zit;' thing. Easier to read, easier to read subsequent
patches and it leaves more room for a comment describing what the field
does.

- None of these fields have comments describing what they do ;)

- __exit_signal() does apparently-unlocked 64-bit arith. Is there some
implicit locking here or do we not care about the occasional race-induced
inaccuracy?

(ditto, lots of places, I expect)

(Gee, there's shitloads of 64-bit stuff in there. Does it all _really_
need to be 64-bit on 32-bit?)

- weight_s64() (what does this do?) looks too big to inline on 32-bit.

- update_stats_enqueue() looks too big to inline even on 64-bit.

- Overall, this change is tremendously huge for something which is
supposedly ready-to-merge. Looks like a lot of that is the sched_entity
conversion, but afaict there's quite a lot besides.

- Should "4" in

(sysctl_sched_features & 4)

be enumerated?

- Maybe even __check_preempt_curr_fair() is too porky to inline.

- There really is an astonishing amount of 64-bit arith in here...

- Some opportunities for useful comments have been missed ;)

#define NICE_0_LOAD SCHED_LOAD_SCALE
#define NICE_0_SHIFT SCHED_LOAD_SHIFT

<wonders what these mean>

- Should any of those new 64-bit arith functions in sched.c be pulled out
and made generic?

- update_curr_load() is huge, inlined and has several callsites?

- lots more macros-which-dont-need-to-be-macros in sched.c:
load_weight(), PRIO_TO_load_weight(), RTPRIO_TO_load_weight(), maybe
others. People are more inclined to comment functions than they are
macros, for some reason.

- inc_load(), dec_load(), inc_nr_running(), dec_nr_running(): these will
generate plenty of code on 32-bit and they're all inlined with multiple
callsites.

- overall, CFS takes sched.o from 41157 bytes of .text up to 48781 on x86_64,
which at 18% is rather a large bloat. Hopefully a lot of this is the new
debug stuff.

- On i386 sched.o went from 33755 up to 43660 which is 29% growth.
Possibly acceptable, but why did it increase a lot more than the x86_64
version? All that 64-bit arith, I assume?

- style (or the lack thereof):

p->se.sum_wait_runtime = p->se.sum_sleep_runtime = 0;
p->se.sleep_start = p->se.sleep_start_fair = p->se.block_start = 0;
p->se.sleep_max = p->se.block_max = p->se.exec_max = p->se.wait_max = 0;
p->se.wait_runtime_overruns = p->se.wait_runtime_underruns = 0;

bit of an eyesore?

- in sched_init() this looks funny:

rq->ls.load_update_last = sched_clock();
rq->ls.load_update_start = sched_clock();

was it intended that these both get the same value?


2007-06-26 08:39:19

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v18


* Andrew Morton <[email protected]> wrote:

> So I locally generated the diff to take -mm up to the above version of
> CFS.

thx. I released a diff against mm2:

http://people.redhat.com/mingo/cfs-scheduler/sched-cfs-v2.6.22-rc4-mm2-v18.patch

but indeed the -git diff serves you better if you updated -mm to Linus'
latest.

firstly, thanks a ton for your review feedback!

> - sys_sched_yield_to() went away? I guess I missed that.

yep. Nobody tried it and sent any feedback on it; it was causing
patch-logistical complications both in -mm and for packagers that bundle
CFS (the experimental-schedulers site has a CFS repo and Fedora rawhide
started carrying CFS recently as well), and i don't really agree with
adding yet another yield interface anyway. So we can and should do this
independently of CFS.

> - Curious: the simplification of task_tick_rt() seems to go only
> halfway. Could do
>
> 	if (p->policy != SCHED_RR)
> 		return;
>
> 	if (--p->time_slice)
> 		return;
>
> 	/* stuff goes here */

yeah. I have fixed it in my v19 tree to look like you suggest.

> - dud macro:
>
> #define is_rt_policy(p) ((p) == SCHED_FIFO || (p) == SCHED_RR)
>
> It evaluates its arg twice and could and should be coded in C.
>
> There are a bunch of other don't-need-to-be-implemented-as-a-macro
> macros around there too. Generally, I suggest you review all the
> patchset for macros-which-don't-need-to-be-macros.

yep, fixed. (is a historic macro)

> - Extraneous newline:
>
> enum cpu_idle_type
> {

fixed. (is a pre-existing enum)

> - Style thing:
>
> struct sched_entity {

> u64 sleep_start, sleep_start_fair;

fixed.

> - None of these fields have comments describing what they do ;)

one of them has ;-) Will fill this in.

> - __exit_signal() does apparently-unlocked 64-bit arith. Is there
> some implicit locking here or do we not care about the occasional
> race-induced inaccuracy?

do you mean the tsk->se.sum_exec_runtime addition, etc? That runs with
interrupts disabled so sum_sched_runtime is protected.

> (ditto, lots of places, I expect)

which places do you mean?

> (Gee, there's shitloads of 64-bit stuff in there. Does it all
> _really_ need to be 64-bit on 32-bit?)

yes - CFS is fundamentally designed for 64-bit, with still pretty OK
arithmetic performance for 32-bit.

> - weight_s64() (what does this do?) looks too big to inline on 32-bit.

ok, i've uninlined it.

> - update_stats_enqueue() looks too big to inline even on 64-bit.

done.

> - Overall, this change is tremendously huge for something which is
> supposedly ready-to-merge. [...]

hey, that's not fair, your review comments just made it 10K larger ;-)

> [...] Looks like a lot of that is the sched_entity conversion, but
> afaict there's quite a lot besides.
>
> - Should "4" in
>
> (sysctl_sched_features & 4)
>
> be enumerated?

yep, done.

> - Maybe even __check_preempt_curr_fair() is too porky to inline.

yep - undone.

> - There really is an astonishing amount of 64-bit arith in here...
>
> - Some opportunities for useful comments have been missed ;)
>
> #define NICE_0_LOAD SCHED_LOAD_SCALE
> #define NICE_0_SHIFT SCHED_LOAD_SHIFT
>
> <wonders what these mean>

SCHED_LOAD_SCALE is the smpnice stuff. CFS reuses that and also makes it
clear via this define that a nice-0 task has a 'load' contribution to
the CPU as of NICE_0_LOAD. Sometimes, when doing smpnice load-balancing
calculations, we want to use 'SCHED_LOAD_SCALE', and sometimes we want
to stress that it's NICE_0_LOAD.

> - Should any of those new 64-bit arith functions in sched.c be pulled
> out and made generic?

yep, the plan is to put this all into reciprocal_div.h and to convert
existing users of reciprocal_div to the cleaner stuff from Thomas. The
patch won't get any smaller due to that though ;-)
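
for reference, the existing include/linux/reciprocal_div.h API that this
would converge with looks roughly like this (the weight/delta usage is
an illustrative sketch, not actual scheduler code):

#include <linux/reciprocal_div.h>

static u32 example_div(u32 delta, u32 weight)
{
	u32 inv = reciprocal_value(weight);	/* normally precomputed once */

	return reciprocal_divide(delta, inv);	/* delta / weight */
}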

> - update_curr_load() is huge, inlined and has several callsites?

this is a reasonable tradeoff i think - update_curr_load()'s slowpath is
in __update_curr_load(). Anyway, it probably won't get inlined when the
kernel is built with -Os and without forced-inlining.

> - lots more macros-which-dont-need-to-be-macros in sched.c:
> load_weight(), PRIO_TO_load_weight(), RTPRIO_TO_load_weight(), maybe
> others. People are more inclined to comment functions than they are
> macros, for some reason.

these are mostly ancient macros. I fixed up some of them in my current
tree.

> - inc_load(), dec_load(), inc_nr_running(), dec_nr_running(): these will
> generate plenty of code on 32-bit and they're all inlined with
> multiple callsites.

yep - i'll revisit the inlining picture. This is not really a primary
worry i think because it's easy to tweak and people can already express
their inlining preference via CONFIG_CC_OPTIMIZE_FOR_SIZE and
CONFIG_FORCED_INLINING.

> - overall, CFS takes sched.o from 41157 bytes of .text up to 48781 on x86_64,
> which at 18% is rather a large bloat. Hopefully a lot of this is
> the new debug stuff.

> - On i386 sched.o went from 33755 up to 43660 which is 29% growth.
> Possibly acceptable, but why did it increase a lot more than the x86_64
> version? All that 64-bit arith, I assume?

the main reason is the sched debugging stuff:

   text    data     bss     dec     hex filename
  37570    2538      20   40128    9cc0 kernel/sched.o
  30692    2426      20   33138    8172 kernel/sched-no_sched_debug.o

i can make it depend on CONFIG_SCHEDSTATS, although i'd prefer it to be
always on.

> - style (or the lack thereof):
>
> p->se.sum_wait_runtime = p->se.sum_sleep_runtime = 0;
> p->se.sleep_start = p->se.sleep_start_fair = p->se.block_start = 0;
> p->se.sleep_max = p->se.block_max = p->se.exec_max = p->se.wait_max = 0;
> p->se.wait_runtime_overruns = p->se.wait_runtime_underruns = 0;
>
> bit of an eyesore?

fixed. (this heap grew gradually and now is/was an eyesore indeed.)

> - in sched_init() this looks funny:
>
> rq->ls.load_update_last = sched_clock();
> rq->ls.load_update_start = sched_clock();
>
> was it intended that these both get the same value?

it doesn't really matter, i fixed them to be initialized to the same
'now' value.

i've attached my current fixes. (Please don't apply it yet.)

Ingo

Not-Signed-off-by: Ingo Molnar <[email protected]>

---
Makefile | 2
include/linux/sched.h | 120 +++++++++++++++++++++++++++++---------------------
kernel/exit.c | 2
kernel/sched.c | 58 ++++++++++++++++--------
kernel/sched_debug.c | 2
kernel/sched_fair.c | 61 ++++++++++++-------------
kernel/sched_rt.c | 15 +++---
7 files changed, 149 insertions(+), 111 deletions(-)

Index: linux/Makefile
===================================================================
--- linux.orig/Makefile
+++ linux/Makefile
@@ -1,7 +1,7 @@
VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 22
-EXTRAVERSION = -rc6-cfs-v18
+EXTRAVERSION = -rc6-cfs-v19
NAME = Holy Dancing Manatees, Batman!

# *DOCUMENTATION*
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -528,31 +528,6 @@ struct signal_struct {
#define SIGNAL_STOP_CONTINUED 0x00000004 /* SIGCONT since WCONTINUED reap */
#define SIGNAL_GROUP_EXIT 0x00000008 /* group exit in progress */

-
-/*
- * Priority of a process goes from 0..MAX_PRIO-1, valid RT
- * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
- * tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority
- * values are inverted: lower p->prio value means higher priority.
- *
- * The MAX_USER_RT_PRIO value allows the actual maximum
- * RT priority to be separate from the value exported to
- * user-space. This allows kernel threads to set their
- * priority to a value higher than any user task. Note:
- * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
- */
-
-#define MAX_USER_RT_PRIO 100
-#define MAX_RT_PRIO MAX_USER_RT_PRIO
-
-#define MAX_PRIO (MAX_RT_PRIO + 40)
-#define DEFAULT_PRIO (MAX_RT_PRIO + 20)
-
-#define rt_prio(prio) unlikely((prio) < MAX_RT_PRIO)
-#define rt_task(p) rt_prio((p)->prio)
-#define is_rt_policy(p) ((p) == SCHED_FIFO || (p) == SCHED_RR)
-#define has_rt_policy(p) unlikely(is_rt_policy((p)->policy))
-
/*
* Some day this will be a full-fledged user tracking system..
*/
@@ -646,8 +621,7 @@ static inline int sched_info_on(void)
#endif
}

-enum cpu_idle_type
-{
+enum cpu_idle_type {
CPU_IDLE,
CPU_NOT_IDLE,
CPU_NEWLY_IDLE,
@@ -843,30 +817,45 @@ struct load_weight {
unsigned long weight, inv_weight;
};

-/* CFS stats for a schedulable entity (task, task-group etc) */
+/*
+ * CFS stats for a schedulable entity (task, task-group etc)
+ *
+ * Current field usage histogram:
+ *
+ * 4 se->block_start
+ * 4 se->run_node
+ * 4 se->sleep_start
+ * 4 se->sleep_start_fair
+ * 6 se->load.weight
+ * 7 se->delta_fair
+ * 15 se->wait_runtime
+ */
struct sched_entity {
- struct load_weight load; /* for nice- load-balancing purposes */
- int on_rq;
- struct rb_node run_node;
- unsigned long delta_exec;
- s64 delta_fair;
-
- u64 wait_start_fair;
- u64 wait_start;
- u64 exec_start;
- u64 sleep_start, sleep_start_fair;
- u64 block_start;
- u64 sleep_max;
- u64 block_max;
- u64 exec_max;
- u64 wait_max;
- u64 last_ran;
-
- s64 wait_runtime;
- u64 sum_exec_runtime;
- s64 fair_key;
- s64 sum_wait_runtime, sum_sleep_runtime;
- unsigned long wait_runtime_overruns, wait_runtime_underruns;
+ s64 wait_runtime;
+ s64 delta_fair;
+ struct load_weight load; /* for load-balancing */
+ struct rb_node run_node;
+ int on_rq;
+ unsigned long delta_exec;
+
+ u64 wait_start_fair;
+ u64 wait_start;
+ u64 exec_start;
+ u64 sleep_start;
+ u64 sleep_start_fair;
+ u64 block_start;
+ u64 sleep_max;
+ u64 block_max;
+ u64 exec_max;
+ u64 wait_max;
+ u64 last_ran;
+
+ u64 sum_exec_runtime;
+ s64 fair_key;
+ s64 sum_wait_runtime;
+ s64 sum_sleep_runtime;
+ unsigned long wait_runtime_overruns;
+ unsigned long wait_runtime_underruns;
};

struct task_struct {
@@ -1126,6 +1115,37 @@ struct task_struct {
#endif
};

+/*
+ * Priority of a process goes from 0..MAX_PRIO-1, valid RT
+ * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
+ * tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority
+ * values are inverted: lower p->prio value means higher priority.
+ *
+ * The MAX_USER_RT_PRIO value allows the actual maximum
+ * RT priority to be separate from the value exported to
+ * user-space. This allows kernel threads to set their
+ * priority to a value higher than any user task. Note:
+ * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
+ */
+
+#define MAX_USER_RT_PRIO 100
+#define MAX_RT_PRIO MAX_USER_RT_PRIO
+
+#define MAX_PRIO (MAX_RT_PRIO + 40)
+#define DEFAULT_PRIO (MAX_RT_PRIO + 20)
+
+static inline int rt_prio(int prio)
+{
+ if (unlikely(prio < MAX_RT_PRIO))
+ return 1;
+ return 0;
+}
+
+static inline int rt_task(struct task_struct *p)
+{
+ return rt_prio(p->prio);
+}
+
static inline pid_t process_group(struct task_struct *tsk)
{
return tsk->signal->pgrp;
Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c
+++ linux/kernel/exit.c
@@ -290,7 +290,7 @@ static void reparent_to_kthreadd(void)
/* Set the exit signal to SIGCHLD so we signal init on exit */
current->exit_signal = SIGCHLD;

- if (!has_rt_policy(current) && (task_nice(current) < 0))
+ if (task_nice(current) < 0)
set_user_nice(current, 0);
/* cpus_allowed? */
/* rt_priority? */
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -106,6 +106,18 @@ unsigned long long __attribute__((weak))
#define MIN_TIMESLICE max(5 * HZ / 1000, 1)
#define DEF_TIMESLICE (100 * HZ / 1000)

+static inline int rt_policy(int policy)
+{
+ if (unlikely(policy == SCHED_FIFO) || unlikely(policy == SCHED_RR))
+ return 1;
+ return 0;
+}
+
+static inline int task_has_rt_policy(struct task_struct *p)
+{
+ return rt_policy(p->policy);
+}
+
/*
* This is the priority-queue data structure of the RT scheduling class:
*/
@@ -752,7 +764,7 @@ static void set_load_weight(struct task_
task_rq(p)->cfs.wait_runtime -= p->se.wait_runtime;
p->se.wait_runtime = 0;

- if (has_rt_policy(p)) {
+ if (task_has_rt_policy(p)) {
p->se.load.weight = prio_to_weight[0] * 2;
p->se.load.inv_weight = prio_to_wmult[0] >> 1;
return;
@@ -805,7 +817,7 @@ static inline int normal_prio(struct tas
{
int prio;

- if (has_rt_policy(p))
+ if (task_has_rt_policy(p))
prio = MAX_RT_PRIO-1 - p->rt_priority;
else
prio = __normal_prio(p);
@@ -1476,17 +1488,24 @@ int fastcall wake_up_state(struct task_s
*/
static void __sched_fork(struct task_struct *p)
{
- p->se.wait_start_fair = p->se.wait_start = p->se.exec_start = 0;
- p->se.sum_exec_runtime = 0;
- p->se.delta_exec = 0;
- p->se.delta_fair = 0;
-
- p->se.wait_runtime = 0;
-
- p->se.sum_wait_runtime = p->se.sum_sleep_runtime = 0;
- p->se.sleep_start = p->se.sleep_start_fair = p->se.block_start = 0;
- p->se.sleep_max = p->se.block_max = p->se.exec_max = p->se.wait_max = 0;
- p->se.wait_runtime_overruns = p->se.wait_runtime_underruns = 0;
+ p->se.wait_start_fair = 0;
+ p->se.wait_start = 0;
+ p->se.exec_start = 0;
+ p->se.sum_exec_runtime = 0;
+ p->se.delta_exec = 0;
+ p->se.delta_fair = 0;
+ p->se.wait_runtime = 0;
+ p->se.sum_wait_runtime = 0;
+ p->se.sum_sleep_runtime = 0;
+ p->se.sleep_start = 0;
+ p->se.sleep_start_fair = 0;
+ p->se.block_start = 0;
+ p->se.sleep_max = 0;
+ p->se.block_max = 0;
+ p->se.exec_max = 0;
+ p->se.wait_max = 0;
+ p->se.wait_runtime_overruns = 0;
+ p->se.wait_runtime_underruns = 0;

INIT_LIST_HEAD(&p->run_list);
p->se.on_rq = 0;
@@ -1799,7 +1818,7 @@ static void update_cpu_load(struct rq *t
int i, scale;

this_rq->nr_load_updates++;
- if (sysctl_sched_features & 64)
+ if (unlikely(!(sysctl_sched_features & SCHED_FEAT_PRECISE_CPU_LOAD)))
goto do_avg;

/* Update delta_fair/delta_exec fields first */
@@ -3801,7 +3820,7 @@ void set_user_nice(struct task_struct *p
* it wont have any effect on scheduling until the task is
* SCHED_FIFO/SCHED_RR:
*/
- if (has_rt_policy(p)) {
+ if (task_has_rt_policy(p)) {
p->static_prio = NICE_TO_PRIO(nice);
goto out_unlock;
}
@@ -3999,14 +4018,14 @@ recheck:
(p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) ||
(!p->mm && param->sched_priority > MAX_RT_PRIO-1))
return -EINVAL;
- if (is_rt_policy(policy) != (param->sched_priority != 0))
+ if (rt_policy(policy) != (param->sched_priority != 0))
return -EINVAL;

/*
* Allow unprivileged RT tasks to decrease priority:
*/
if (!capable(CAP_SYS_NICE)) {
- if (is_rt_policy(policy)) {
+ if (rt_policy(policy)) {
unsigned long rlim_rtprio;
unsigned long flags;

@@ -6186,6 +6205,7 @@ int in_sched_functions(unsigned long add

void __init sched_init(void)
{
+ u64 now = sched_clock();
int highest_cpu = 0;
int i, j;

@@ -6206,8 +6226,8 @@ void __init sched_init(void)
rq->nr_running = 0;
rq->cfs.tasks_timeline = RB_ROOT;
rq->clock = rq->cfs.fair_clock = 1;
- rq->ls.load_update_last = sched_clock();
- rq->ls.load_update_start = sched_clock();
+ rq->ls.load_update_last = now;
+ rq->ls.load_update_start = now;

for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
rq->cpu_load[j] = 0;
Index: linux/kernel/sched_debug.c
===================================================================
--- linux.orig/kernel/sched_debug.c
+++ linux/kernel/sched_debug.c
@@ -157,7 +157,7 @@ static int sched_debug_show(struct seq_f
u64 now = ktime_to_ns(ktime_get());
int cpu;

- SEQ_printf(m, "Sched Debug Version: v0.03, cfs-v18, %s %.*s\n",
+ SEQ_printf(m, "Sched Debug Version: v0.03, cfs-v19, %s %.*s\n",
init_utsname()->release,
(int)strcspn(init_utsname()->version, " "),
init_utsname()->version);
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -61,8 +61,24 @@ unsigned int sysctl_sched_stat_granulari
*/
unsigned int sysctl_sched_runtime_limit __read_mostly;

+/*
+ * Debugging: various feature bits
+ */
+enum {
+ SCHED_FEAT_IGNORE_PREEMPTED = 1,
+ SCHED_FEAT_DISTRIBUTE = 2,
+ SCHED_FEAT_FAIR_SLEEPERS = 4,
+ SCHED_FEAT_SLEEPER_AVG = 32,
+ SCHED_FEAT_PRECISE_CPU_LOAD = 64,
+ SCHED_FEAT_START_DEBIT = 128,
+ SCHED_FEAT_SKIP_INITIAL = 256,
+};
+
unsigned int sysctl_sched_features __read_mostly =
- 0 | 2 | 4 | 8 | 0 | 0 | 0 | 0;
+ SCHED_FEAT_DISTRIBUTE |
+ SCHED_FEAT_FAIR_SLEEPERS |
+ SCHED_FEAT_SLEEPER_AVG |
+ SCHED_FEAT_PRECISE_CPU_LOAD;

extern struct sched_class fair_sched_class;

@@ -145,7 +161,7 @@ static inline void
__dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
if (cfs_rq->rb_leftmost == &se->run_node)
- cfs_rq->rb_leftmost = NULL;
+ cfs_rq->rb_leftmost = rb_next(&se->run_node);
rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
update_load_sub(&cfs_rq->load, se->load.weight);
cfs_rq->nr_running--;
@@ -258,7 +274,7 @@ __update_curr(struct cfs_rq *cfs_rq, str
* Task already marked for preemption, do not burden
* it with the cost of not having left the CPU yet:
*/
- if (unlikely(sysctl_sched_features & 1))
+ if (unlikely(sysctl_sched_features & SCHED_FEAT_IGNORE_PREEMPTED))
if (unlikely(test_tsk_thread_flag(curtask, TIF_NEED_RESCHED)))
return;

@@ -305,7 +321,7 @@ update_stats_wait_start(struct cfs_rq *c
se->wait_start = now;
}

-static inline s64 weight_s64(s64 calc, unsigned long weight, int shift)
+static s64 weight_s64(s64 calc, unsigned long weight, int shift)
{
if (calc < 0) {
calc = - calc * weight;
@@ -317,7 +333,7 @@ static inline s64 weight_s64(s64 calc, u
/*
* Task is being enqueued - update stats:
*/
-static inline void
+static void
update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
{
s64 key;
@@ -438,7 +454,7 @@ static void distribute_fair_add(struct c
struct sched_entity *curr = cfs_rq_curr(cfs_rq);
s64 delta_fair = 0;

- if (!(sysctl_sched_features & 2))
+ if (!(sysctl_sched_features & SCHED_FEAT_DISTRIBUTE))
return;

if (cfs_rq->nr_running) {
@@ -469,7 +485,7 @@ __enqueue_sleeper(struct cfs_rq *cfs_rq,
* Fix up delta_fair with the effect of us running
* during the whole sleep period:
*/
- if (!(sysctl_sched_features & 32))
+ if (sysctl_sched_features & SCHED_FEAT_SLEEPER_AVG)
delta_fair = div64_s(delta_fair * load, load + se->load.weight);

delta_fair = weight_s64(delta_fair, se->load.weight, NICE_0_SHIFT);
@@ -495,7 +511,7 @@ enqueue_sleeper(struct cfs_rq *cfs_rq, s
s64 delta_fair;

if ((entity_is_task(se) && tsk->policy == SCHED_BATCH) ||
- !(sysctl_sched_features & 4))
+ !(sysctl_sched_features & SCHED_FEAT_FAIR_SLEEPERS))
return;

delta_fair = cfs_rq->fair_clock - se->sleep_start_fair;
@@ -574,7 +590,7 @@ static void dequeue_entity(struct cfs_rq
/*
* Preempt the current task with a newly woken task if needed:
*/
-static inline void
+static void
__check_preempt_curr_fair(struct cfs_rq *cfs_rq, struct sched_entity *se,
struct sched_entity *curr, unsigned long granularity)
{
@@ -612,23 +628,6 @@ put_prev_entity(struct cfs_rq *cfs_rq, s
int updated = 0;

/*
- * If the task is still waiting for the CPU (it just got
- * preempted), update its position within the tree and
- * start the wait period:
- */
- if ((sysctl_sched_features & 16) && entity_is_task(prev)) {
- struct task_struct *prevtask = task_of(prev);
-
- if (prev->on_rq &&
- test_tsk_thread_flag(prevtask, TIF_NEED_RESCHED)) {
-
- dequeue_entity(cfs_rq, prev, 0, now);
- enqueue_entity(cfs_rq, prev, 0, now);
- updated = 1;
- }
- }
-
- /*
* If still on the runqueue then deactivate_task()
* was not called and update_curr() has to be done:
*/
@@ -741,10 +740,8 @@ static void check_preempt_curr_fair(stru
unsigned long gran;

if (unlikely(rt_prio(p->prio))) {
- if (sysctl_sched_features & 8) {
- if (rt_prio(p->prio))
- update_curr(cfs_rq, rq_clock(rq));
- }
+ if (rt_prio(p->prio))
+ update_curr(cfs_rq, rq_clock(rq));
resched_task(curr);
return;
}
@@ -850,14 +847,14 @@ static void task_new_fair(struct rq *rq,
* The first wait is dominated by the child-runs-first logic,
* so do not credit it with that waiting time yet:
*/
- if (sysctl_sched_features & 256)
+ if (sysctl_sched_features & SCHED_FEAT_SKIP_INITIAL)
p->se.wait_start_fair = 0;

/*
* The statistical average of wait_runtime is about
* -granularity/2, so initialize the task with that:
*/
- if (sysctl_sched_features & 128)
+ if (sysctl_sched_features & SCHED_FEAT_START_DEBIT)
p->se.wait_runtime = -(s64)(sysctl_sched_granularity / 2);

__enqueue_entity(cfs_rq, se);
Index: linux/kernel/sched_rt.c
===================================================================
--- linux.orig/kernel/sched_rt.c
+++ linux/kernel/sched_rt.c
@@ -12,7 +12,7 @@ static inline void update_curr_rt(struct
struct task_struct *curr = rq->curr;
u64 delta_exec;

- if (!has_rt_policy(curr))
+ if (!task_has_rt_policy(curr))
return;

delta_exec = now - curr->se.exec_start;
@@ -179,13 +179,14 @@ static void task_tick_rt(struct rq *rq,
if (p->policy != SCHED_RR)
return;

- if (!(--p->time_slice)) {
- p->time_slice = static_prio_timeslice(p->static_prio);
- set_tsk_need_resched(p);
+ if (--p->time_slice)
+ return;

- /* put it at the end of the queue: */
- requeue_task_rt(rq, p);
- }
+ p->time_slice = static_prio_timeslice(p->static_prio);
+ set_tsk_need_resched(p);
+
+ /* put it at the end of the queue: */
+ requeue_task_rt(rq, p);
}

/*

2007-06-26 09:02:46

by Andrew Morton

Subject: Re: [patch] CFS scheduler, -v18

On Tue, 26 Jun 2007 10:38:13 +0200 Ingo Molnar <[email protected]> wrote:

>
> > - __exit_signal() does apparently-unlocked 64-bit arith. Is there
> > some implicit locking here or do we not care about the occasional
> > race-induced inaccuracy?
>
> do you mean the tsk->se.sum_exec_runtime addition, etc? That runs with
> interrupts disabled so sum_sched_runtime is protected.
>
> > (ditto, lots of places, I expect)
>
> which places do you mean?

I forget ;) There seemed to be rather a lot of 64-bit addition with no
obvious locking in sight, that's all.

> > ...
> > (Gee, there's shitloads of 64-bit stuff in there. Does it all
> > _really_ need to be 64-bit on 32-bit?)
>
> yes - CFS is fundamentally designed for 64-bit, with still pretty OK
> arithmetic performance for 32-bit.

It may have been designed for 64-bit, but was that the correct design? The
cost on 32-bit appears to be pretty high. Perhaps a round of uninlining
will help.

> > - overall, CFS takes sched.o from 41157 bytes of .text up to 48781 on x86_64,
> > which at 18% is rather a large bloat. Hopefully a lot of this is
> > the new debug stuff.
>
> > - On i386 sched.o went from 33755 up to 43660 which is 29% growth.
> > Possibly acceptable, but why did it increase a lot more than the x86_64
> > version? All that 64-bit arith, I assume?
>
> the main reason is the sched debugging stuff:

That would serve to explain the 18% growth on x86_64. But why did i386
grow by much more: 29%? I'd be suspecting all the new 64-bit arithmetic.


2007-06-26 09:39:18

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v18


* Andrew Morton <[email protected]> wrote:

> > the main reason is the sched debugging stuff:
>
> That would serve to explain the 18% growth on x86_64. But why did
> i386 grow by much more: 29%? I'd be suspecting all the new 64-bit
> arithmetic.

this is what i see on 32-bit:

   text    data     bss     dec     hex filename
  28732    3905      24   32661    7f95 kernel/sched.o-vanilla
  37986    2538      20   40544    9e60 kernel/sched.o-v18
  31092    2426      20   33538    8302 kernel/sched.o-v18-no_sched_debug

text is larger but data got smaller. While they are not equivalent in
function, the two almost cancel each other out (and that's without any of
the uninlining that is in v19). In fact, there's a 1.5K per-CPU percpu
data size win with CFS, which is not visible in this stat. So on
dual-core the net cost should already be zero.

> The cost on 32-bit appears to be pretty high. Perhaps a round of
> uninlining will help.

agreed, i've done one more round of uninlining.

Ingo

2007-06-26 20:18:47

by Vincent Fortier

Subject: RE: [patch] CFS scheduler, -v18

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On behalf of Ingo Molnar
> Sent: 22 June 2007 18:02
>
> i'm pleased to announce release -v18 of the CFS scheduler patchset.
>
> The rolled-up CFS patch against today's -git kernel,
> v2.6.22-rc5, v2.6.22-rc4-mm2, v2.6.21.5 or v2.6.20.14 can be
> downloaded from the usual place:
>
> http://people.redhat.com/mingo/cfs-scheduler/
>
> ...

Pre-built kernels for Fedora 7 & Fedora Core 6 (yum repository available) and Debian Sarge/Etch using the latest CFS v18 are available at http://linux-dev.qc.ec.gc.ca

>
> As usual, any sort of feedback, bugreport, fix and suggestion
> is more than welcome!
>
> Ingo

- vin

2007-06-27 10:52:17

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v18


* Fortier,Vincent [Montreal] <[email protected]> wrote:

> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On behalf of Ingo Molnar
> > Sent: 22 June 2007 18:02
> >
> > i'm pleased to announce release -v18 of the CFS scheduler patchset.
> >
> > The rolled-up CFS patch against today's -git kernel,
> > v2.6.22-rc5, v2.6.22-rc4-mm2, v2.6.21.5 or v2.6.20.14 can be
> > downloaded from the usual place:
> >
> > http://people.redhat.com/mingo/cfs-scheduler/
> >
> > ...
>
> Pre-built kernels for Fedora 7 & Fedora Core 6 (yum repository
> available) and Debian Sarge/Etch using the latest CFS v18 are available
> at http://linux-dev.qc.ec.gc.ca

thanks Vincent!

Ingo

2007-06-30 21:10:49

by Willy Tarreau

Subject: Re: [patch] CFS scheduler, -v18

Ingo,

I've accidentally discovered a problem with -v18.

Some time ago, I wrote a small program to prevent my laptop from entering
low-power mode, and noticed that after upgrading my laptop's kernel from
2.6.20.9+cfs-v6 to 2.6.20.14+cfs-v18, it completely freezes if I run this
program.

The program is trivial: it just sets its prio to nice +20 and forks a busy
loop. I've added the ability to stop the loop after a user-definable number
of iterations, and I can confirm that it unfreezes when the loop ends. I'm
not even root when I run it.

Everything freezes, including the frame-buffer. It's not 100% reproducible; I
would say 90% only. Sometimes it requires a few seconds before freezing. It
*seems* to me that running another task in parallel (such as vmstat) increases
its chances of freezing. Nicing to +19 does not seem to cause any trouble.

I've tried it on my dual Athlon: with 1 process I see occasional pauses of
1 or 2 seconds, and with 2 processes I see noticeably larger pauses.

Here's the trivial program. Probably you'll find an obvious bug.

Regards,
Willy

---

/*
* cfs-freeze.c
* Fork a busy loop running with idle prio. This often results
* in complete freezes with CFS-v18.
*
* $ gcc -O2 -s -o cfs-freeze cfs-freeze.c
* $ ./cfs-freeze
*/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <sys/time.h>
#include <sys/resource.h>


int main(int argc, char **argv) {
	struct sched_param sch;
	long long i = 0;	/* must be initialized: only set below if an argument is given */

	if (argc > 1)
		i = atoll(argv[1]);

	if (i <= 0)
		i = 4 * 1000 * 1000 * 1000ULL;

	sch.sched_priority = 0;
	sched_setscheduler(0, SCHED_OTHER, &sch);
	setpriority(PRIO_PROCESS, 0, 20);
	if (fork() == 0)
		while (i--);
	return 0;
}

---