2004-06-06 15:41:19

by Con Kolivas

Subject: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

This is an update of the scheduler policy mechanism rewrite using the
infrastructure of the current O(1) scheduler. Slight changes from the
original design warrant the detailed description below. These changes have
eliminated all known corner cases and maintain cpu distribution much
better. It has proven stable in my testing and is ready for more
widespread public testing now.


Aims:
- Interactive by design rather than having interactivity bolted on.
- Good scalability.
- Simple predictable design.
- Maintain appropriate cpu distribution and fairness.
- Low scheduling latency for normal policy tasks.
- Low overhead.
- Make renicing processes actually matter for CPU distribution (nice 0 gets
20 times what nice +19 gets)
- Resistant to priority inversion
- More forgiving of poor locking
- Tunable for a server workload or computational tasks


Description:
- All tasks start at a dynamic priority based on their nice value. They run
for one RR_INTERVAL (nominally set to 10ms) and then drop one stair
(priority). If they sleep before running again they get to run for 2
intervals before being demoted a priority and so on until they get all their
intervals at their best priority: 20 intervals for nice 0; 1 interval for
nice +19.


- The staircase elevation mechanism can be disabled, so that all tasks simply
descend the stairs, using the sysctl:

echo 0 > /proc/sys/kernel/interactive

This has the effect of maintaining cpu distribution much more strictly
according to nice, whereas the default mechanism allows bursts of cpu by
interactive tasks before settling to the appropriate cpu distribution.


- The time tasks remain cpu bound can be increased by using the sysctl:

echo 1 > /proc/sys/kernel/compute

which extends the RR_INTERVAL to 100ms and disables the staircase elevation,
improving conditions for pure computational tasks by optimising cache
benefits and decreasing context switching (gains another 1.5% on kernbench).


Performance:
- All cpu throughput benchmarks show performance equivalent to or better than
mainline. Note that disabling the interactive setting actually _worsens_ some
benchmarks because of their dependence on yield(), so I don't recommend
disabling it unless you do a comparison first.
- Interactivity is approximately equivalent to mainline 2.6, but with faster
application startup and no known behavioural quirks.


Comments and testing welcome.

fs/proc/array.c | 2
include/linux/sched.h | 8
include/linux/sysctl.h | 2
kernel/sched.c | 676 +++++++++++++------------------------------------
kernel/sysctl.c | 16 +
5 files changed, 212 insertions(+), 492 deletions(-)

Can be downloaded here:
http://ck.kolivas.org/patches/2.6/2.6.7-rc2/patch-2.6.7-rc2-s6.3

and below
Con
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFAwzquZUg7+tp6mRURArBHAJ9SIBOKX6MYOdkJzdb+xRNnW82JQgCghLou
wXhrRsBsfY2BIqbLT1tUWcs=
=g1no
-----END PGP SIGNATURE-----


Attachments:
(No filename) (2.99 kB)
patch-2.6.7-rc2-s6.3 (33.39 kB)

2004-06-06 17:03:09

by Jan Killius

Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

On Sunday 06 June 2004 17:39, you wrote:
> This is an update of the scheduler policy mechanism rewrite using the
> infrastructure of the current O(1) scheduler. Slight changes from the
> original design require a detailed description.
> [...]
> Can be downloaded here:
> http://ck.kolivas.org/patches/2.6/2.6.7-rc2/patch-2.6.7-rc2-s6.3
>
> and below
> Con
The same patch, modified to apply cleanly to 2.6.7-rc2-mm2.
--
Jan


Attachments:
(No filename) (2.97 kB)
patch-2.6.7-rc2-mm2-s6.3 (33.43 kB)

2004-06-06 17:40:06

by Felipe Alfaro Solana

Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

On Sun, 2004-06-06 at 17:39, Con Kolivas wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> This is an update of the scheduler policy mechanism rewrite using the
> infrastructure of the current O(1) scheduler. Slight changes from the
> original design require a detailed description. The change to the original
> design has enabled all known corner cases to be abolished and cpu
> distribution to be much better maintained. It has proven to be stable in my
> testing and is ready for more widespread public testing now.

I'm impressed... I'm currently playing with linux-2.6.7-rc2-bk7 plus
staircase plus autoswappiness and my system behaves exceptionally. It
seems pretty responsive even when under heavy load (while true; do a=2;
done). Nice work.

2004-06-06 20:43:09

by Prakash K. Cheemplavam

Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

Hi,

this is the first time I have tried this scheduler, with 2.6.7-rc2-mm2. A
knockout criterion for me: playing ut2004 generates a lot of static
(sound-wise). I consider this a regression compared to O(1); Nick's
scheduler plays nicely there as well. For Nick's I have X reniced to -10,
and your scheduler doesn't like this either: when playing some tune via
xmms and then switching to another (virtual) desktop, I get pops in the
sound for fractions of a second. Putting X back to 0 fixes this. But I
don't know how to "fix" ut2004 with staircase. :-(

Hth,

Prakash

2004-06-06 21:00:17

by Felipe Alfaro Solana

Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

On Sun, 2004-06-06 at 17:39, Con Kolivas wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> This is an update of the scheduler policy mechanism rewrite using the
> infrastructure of the current O(1) scheduler. Slight changes from the
> original design require a detailed description. The change to the original
> design has enabled all known corner cases to be abolished and cpu
> distribution to be much better maintained. It has proven to be stable in my
> testing and is ready for more widespread public testing now.

It seems this staircase scheduler has some strange interactions with
networking and PM suspend. I can't suspend my laptop when any program is
using the network and, what's more, after the suspend mechanism fails,
the program that was using the network stays stuck in D state and can't
be killed.

I can trigger this with evolution 1.4.8 using an IMAP mail account, or
simply by launching Konqueror, pointing it at a URL and then trying to
suspend while Konqueror is retrieving the page from the Internet.

I'm open to suggestions.

2004-06-06 22:55:21

by Con Kolivas

Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

On Mon, 7 Jun 2004 06:43, Prakash K. Cheemplavam wrote:
> Hi,

Hi. Thanks for testing.

> it is the first time I tried this scheduler with 2.6.7-rc2-mm2. A k.o
> criteria for me: Playing ut2004 it generated a lot of statics (sound
> wise). I consider this a regression in contrast to O(1). Nick's
> scheduler plays nice as well. For Nick's I have X reniced to -10. Your
> scheduler doesn't like this, as well: When plaing some tune via xmms and
> then switching to another (virtual) desktop, I get pops in the sound for
> fractions of a second. Putting X back to 0 fixes this. But I don't know
> how to "fix" ut2004 with staircase. :-(

Yes, this is designed for X at nice==0.

Try the interactive = 0 setting for gaming.

Con

2004-06-06 22:59:08

by Con Kolivas

Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

On Mon, 7 Jun 2004 03:40, Felipe Alfaro Solana wrote:
> On Sun, 2004-06-06 at 17:39, Con Kolivas wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > This is an update of the scheduler policy mechanism rewrite using the
> > infrastructure of the current O(1) scheduler. Slight changes from the
> > original design require a detailed description. The change to the
> > original design has enabled all known corner cases to be abolished and
> > cpu distribution to be much better maintained. It has proven to be stable
> > in my testing and is ready for more widespread public testing now.
>
> I'm impressed... I'm currently playing with linux-2.6.7-rc2-bk7 plus
> staircase plus autoswappiness and my system behaves exceptionally. It
> seems pretty responsive even when under heavy load (while true; do a=2;
> done). Nice work.

Sounds good.

Thanks for testing.

Con

2004-06-06 22:59:04

by Con Kolivas

Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

On Mon, 7 Jun 2004 06:59, Felipe Alfaro Solana wrote:
> On Sun, 2004-06-06 at 17:39, Con Kolivas wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > This is an update of the scheduler policy mechanism rewrite using the
> > infrastructure of the current O(1) scheduler. Slight changes from the
> > original design require a detailed description. The change to the
> > original design has enabled all known corner cases to be abolished and
> > cpu distribution to be much better maintained. It has proven to be stable
> > in my testing and is ready for more widespread public testing now.
>
> It seems this staircase scheduler has some strange interactions with
> networking and PM suspend. I can't suspend my laptop when any program is
> using the network and, what's more, after the suspend mechanism fails,
> the program that was using the network stays stuck at D state and can't
> be killed.

Could well be the autoregulated vm swappiness, which triggers the known
failure to suspend when swappiness=0. If not, this staircase scheduler is
quite a large change to the scheduler priority design and there could be some
weird interaction.

Con

2004-06-06 23:47:46

by Con Kolivas

Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

On Mon, 7 Jun 2004 01:39, Con Kolivas wrote:
> This is an update of the scheduler policy mechanism rewrite using the
> infrastructure of the current O(1) scheduler. Slight changes from the
> original design require a detailed description.

Seems this wasn't clear unless you looked at the code: this has only a single
priority queue, with no expired array.

Con

2004-06-07 13:56:44

by William Lee Irwin III

Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

On Mon, Jun 07, 2004 at 01:39:26AM +1000, Con Kolivas wrote:
> This is an update of the scheduler policy mechanism rewrite using the
> infrastructure of the current O(1) scheduler. Slight changes from the
> original design require a detailed description. The change to the original
> design has enabled all known corner cases to be abolished and cpu
> distribution to be much better maintained. It has proven to be stable in my
> testing and is ready for more widespread public testing now.

There is only one "array" per runqueue, so the removed code should be at
most a BUG_ON() or WARN_ON(); I opted to delete it altogether.

Index: kolivas-2.6.7-rc2/kernel/sched.c
===================================================================
--- kolivas-2.6.7-rc2.orig/kernel/sched.c 2004-06-07 06:19:45.938605000 -0700
+++ kolivas-2.6.7-rc2/kernel/sched.c 2004-06-07 06:53:21.555185000 -0700
@@ -1789,13 +1789,7 @@
cpustat->user += user_ticks;
cpustat->system += sys_ticks;

- /* Task might have expired already, but not scheduled off yet */
- if (p->array != &rq->array) {
- set_tsk_need_resched(p);
- goto out;
- }
spin_lock(&rq->lock);
-
// SCHED_FIFO tasks never run out of timeslice.
if (unlikely(p->policy == SCHED_FIFO))
goto out_unlock;

2004-06-07 13:57:50

by William Lee Irwin III

Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

On Mon, Jun 07, 2004 at 06:56:31AM -0700, William Lee Irwin III wrote:
> There is only one "array" per runqueue; the removed code should be at
> most a BUG_ON() or WARN_ON(); I opted to delete it altogether.

JIFFIES_TO_NS() is unused.

Index: kolivas-2.6.7-rc2/kernel/sched.c
===================================================================
--- kolivas-2.6.7-rc2.orig/kernel/sched.c 2004-06-07 06:49:10.104411000 -0700
+++ kolivas-2.6.7-rc2/kernel/sched.c 2004-06-07 06:50:39.987747000 -0700
@@ -74,7 +74,6 @@
* Some helpers for converting nanosecond timing to jiffy resolution
*/
#define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / HZ))
-#define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ))

//This is the time all tasks within the same priority round robin.
#define RR_INTERVAL ((10 * HZ / 1000) ? : 1)

2004-06-07 14:05:05

by William Lee Irwin III

Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

On Mon, Jun 07, 2004 at 06:57:38AM -0700, William Lee Irwin III wrote:
> JIFFIES_TO_NS() is unused.

array->nr_active is only ever modified, never examined.


Index: kolivas-2.6.7-rc2/kernel/sched.c
===================================================================
--- kolivas-2.6.7-rc2.orig/kernel/sched.c 2004-06-07 06:53:26.779390000 -0700
+++ kolivas-2.6.7-rc2/kernel/sched.c 2004-06-07 07:03:00.910109000 -0700
@@ -92,7 +92,6 @@
typedef struct runqueue runqueue_t;

struct prio_array {
- unsigned int nr_active;
unsigned long bitmap[BITMAP_SIZE];
struct list_head queue[MAX_PRIO + 1];
};
@@ -204,7 +203,6 @@
static void dequeue_task(struct task_struct *p, runqueue_t *rq)
{
prio_array_t* array = &rq->array;
- array->nr_active--;
list_del(&p->run_list);
if (list_empty(array->queue + p->prio))
__clear_bit(p->prio, array->bitmap);
@@ -215,7 +213,6 @@
prio_array_t* array = &rq->array;
list_add_tail(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
- array->nr_active++;
p->array = array;
}

@@ -229,7 +226,6 @@
prio_array_t* array = &rq->array;
list_add(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
- array->nr_active++;
p->array = array;
}

@@ -1095,7 +1091,6 @@
p->prio = current->prio;
list_add_tail(&p->run_list, &current->run_list);
p->array = current->array;
- p->array->nr_active++;
rq->nr_running++;
}
} else {

2004-06-07 14:11:47

by William Lee Irwin III

Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

On Mon, Jun 07, 2004 at 07:04:45AM -0700, William Lee Irwin III wrote:
> array->nr_active only ever modified, never examined.

cpu_to_node_mask() is dead.


Index: kolivas-2.6.7-rc2/kernel/sched.c
===================================================================
--- kolivas-2.6.7-rc2.orig/kernel/sched.c 2004-06-07 07:03:00.910109000 -0700
+++ kolivas-2.6.7-rc2/kernel/sched.c 2004-06-07 07:06:28.072616000 -0700
@@ -46,12 +46,6 @@

#include <asm/unistd.h>

-#ifdef CONFIG_NUMA
-#define cpu_to_node_mask(cpu) node_to_cpumask(cpu_to_node(cpu))
-#else
-#define cpu_to_node_mask(cpu) (cpu_online_map)
-#endif
-
/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
* to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],

2004-06-07 15:01:46

by William Lee Irwin III

Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

On Mon, Jun 07, 2004 at 07:04:45AM -0700, William Lee Irwin III wrote:
>> array->nr_active only ever modified, never examined.

On Mon, Jun 07, 2004 at 07:07:14AM -0700, William Lee Irwin III wrote:
> cpu_to_node_mask() is dead.

task->array is nothing more than a boolean flag. Shove it into
task->thread_info->flags, saving sizeof(prio_array_t *) from sizeof(task_t).
Compile-tested on sparc64 only.


Index: kolivas-2.6.7-rc2/include/asm-alpha/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-alpha/thread_info.h 2004-05-29 23:26:43.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-alpha/thread_info.h 2004-06-07 07:13:08.775700000 -0700
@@ -77,6 +77,7 @@
#define TIF_UAC_NOPRINT 6 /* see sysinfo.h */
#define TIF_UAC_NOFIX 7
#define TIF_UAC_SIGBUS 8
+#define TIF_QUEUED 9 /* queued for scheduling */

#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-arm/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-arm/thread_info.h 2004-05-29 23:26:10.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-arm/thread_info.h 2004-06-07 07:14:01.315712000 -0700
@@ -123,6 +123,7 @@
* TIF_NEED_RESCHED - rescheduling necessary
* TIF_USEDFPU - FPU was used by this task this quantum (SMP)
* TIF_POLLING_NRFLAG - true if poll_idle() is polling TIF_NEED_RESCHED
+ * TIF_QUEUED - true if on scheduler runqueue
*/
#define TIF_NOTIFY_RESUME 0
#define TIF_SIGPENDING 1
@@ -130,6 +131,7 @@
#define TIF_SYSCALL_TRACE 8
#define TIF_USED_FPU 16
#define TIF_POLLING_NRFLAG 17
+#define TIF_QUEUED 18

#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
Index: kolivas-2.6.7-rc2/include/asm-arm26/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-arm26/thread_info.h 2004-05-29 23:25:53.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-arm26/thread_info.h 2004-06-07 07:14:30.291307000 -0700
@@ -118,6 +118,7 @@
* TIF_NEED_RESCHED - rescheduling necessary
* TIF_USEDFPU - FPU was used by this task this quantum (SMP)
* TIF_POLLING_NRFLAG - true if poll_idle() is polling TIF_NEED_RESCHED
+ * TIF_QUEUED - true if on scheduler runqueue
*/
#define TIF_NOTIFY_RESUME 0
#define TIF_SIGPENDING 1
@@ -125,6 +126,7 @@
#define TIF_SYSCALL_TRACE 8
#define TIF_USED_FPU 16
#define TIF_POLLING_NRFLAG 17
+#define TIF_QUEUED 18

#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
Index: kolivas-2.6.7-rc2/include/asm-cris/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-cris/thread_info.h 2004-05-29 23:26:48.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-cris/thread_info.h 2004-06-07 07:14:57.512169000 -0700
@@ -85,6 +85,7 @@
#define TIF_SIGPENDING 2 /* signal pending */
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
#define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED 17 /* true if on scheduler runqueue */

#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-h8300/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-h8300/thread_info.h 2004-05-29 23:26:03.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-h8300/thread_info.h 2004-06-07 07:16:00.653570000 -0700
@@ -93,6 +93,7 @@
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
#define TIF_POLLING_NRFLAG 4 /* true if poll_idle() is polling
TIF_NEED_RESCHED */
+#define TIF_QUEUED 17 /* true if on scheduler runqueue */

/* as above, but as bit values */
#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-i386/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-i386/thread_info.h 2004-05-29 23:25:45.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-i386/thread_info.h 2004-06-07 07:16:15.755274000 -0700
@@ -145,6 +145,7 @@
#define TIF_IRET 5 /* return with iret */
#define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */
#define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED 17 /* true if on scheduler runqueue */

#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-ia64/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-ia64/thread_info.h 2004-05-29 23:26:49.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-ia64/thread_info.h 2004-06-07 07:16:53.195583000 -0700
@@ -74,6 +74,7 @@
#define TIF_NEED_RESCHED 2 /* rescheduling necessary */
#define TIF_SYSCALL_TRACE 3 /* syscall trace active */
#define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED 17 /* true if on scheduler runqueue */

#define TIF_WORK_MASK 0x7 /* like TIF_ALLWORK_BITS but sans TIF_SYSCALL_TRACE */
#define TIF_ALLWORK_MASK 0xf /* bits 0..3 are "work to do on user-return" bits */
Index: kolivas-2.6.7-rc2/include/asm-m68k/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-m68k/thread_info.h 2004-05-29 23:26:19.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-m68k/thread_info.h 2004-06-07 07:23:29.772294000 -0700
@@ -11,7 +11,7 @@
__s32 preempt_count; /* 0 => preemptable, <0 => BUG */
__u32 cpu; /* should always be 0 on m68k */
struct restart_block restart_block;
-
+ unsigned char queued;
__u8 supervisor_stack[0];
};

@@ -48,6 +48,7 @@
#define TIF_NOTIFY_RESUME 2 /* resumption notification requested */
#define TIF_SIGPENDING 3 /* signal pending */
#define TIF_NEED_RESCHED 4 /* rescheduling necessary */
+#define TIF_QUEUED 5 /* true if on scheduler runqueue */

extern int thread_flag_fixme(void);

@@ -67,6 +68,9 @@
case TIF_SYSCALL_TRACE: \
tsk->thread.work.syscall_trace = val; \
break; \
+ case TIF_QUEUED: \
+ (tsk)->thread_info.queued = 1; \
+ break; \
default: \
thread_flag_fixme(); \
} \
@@ -84,6 +88,9 @@
case TIF_SYSCALL_TRACE: \
___res = tsk->thread.work.syscall_trace;\
break; \
+ case TIF_QUEUED: \
+ ___res = tsk->thread_info.queued; \
+ break; \
default: \
___res = thread_flag_fixme(); \
} \
Index: kolivas-2.6.7-rc2/include/asm-m68knommu/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-m68knommu/thread_info.h 2004-05-29 23:26:09.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-m68knommu/thread_info.h 2004-06-07 07:24:30.203107000 -0700
@@ -91,6 +91,7 @@
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
#define TIF_POLLING_NRFLAG 4 /* true if poll_idle() is polling
TIF_NEED_RESCHED */
+#define TIF_QUEUED 16 /* true if on scheduler runqueue */

/* as above, but as bit values */
#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-mips/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-mips/thread_info.h 2004-05-29 23:25:40.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-mips/thread_info.h 2004-06-07 07:25:03.044114000 -0700
@@ -113,6 +113,7 @@
#define TIF_SYSCALL_AUDIT 4 /* syscall auditing active */
#define TIF_USEDFPU 16 /* FPU was used by this task this quantum (SMP) */
#define TIF_POLLING_NRFLAG 17 /* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED 18 /* true if on scheduler runqueue */
#define TIF_SYSCALL_TRACE 31 /* syscall trace active */

#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-parisc/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-parisc/thread_info.h 2004-05-29 23:26:03.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-parisc/thread_info.h 2004-06-07 07:25:26.559539000 -0700
@@ -60,6 +60,7 @@
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
#define TIF_POLLING_NRFLAG 4 /* true if poll_idle() is polling TIF_NEED_RESCHED */
#define TIF_32BIT 5 /* 32 bit binary */
+#define TIF_QUEUED 6 /* true if on scheduler runqueue */

#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-ppc/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-ppc/thread_info.h 2004-05-29 23:25:52.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-ppc/thread_info.h 2004-06-07 07:26:16.802901000 -0700
@@ -86,6 +86,7 @@
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
#define TIF_POLLING_NRFLAG 4 /* true if poll_idle() is polling
TIF_NEED_RESCHED */
+#define TIF_QUEUED 5 /* true if on scheduler runqueue */
/* as above, but as bit values */
#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-ppc64/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-ppc64/thread_info.h 2004-05-29 23:25:45.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-ppc64/thread_info.h 2004-06-07 07:26:34.180260000 -0700
@@ -94,6 +94,7 @@
#define TIF_RUN_LIGHT 6 /* iSeries run light */
#define TIF_ABI_PENDING 7 /* 32/64 bit switch needed */
#define TIF_SYSCALL_AUDIT 8 /* syscall auditing active */
+#define TIF_QUEUED 9 /* true if on scheduler runqueue */

/* as above, but as bit values */
#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-s390/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-s390/thread_info.h 2004-05-29 23:26:19.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-s390/thread_info.h 2004-06-07 07:27:50.350680000 -0700
@@ -89,6 +89,7 @@
#define TIF_POLLING_NRFLAG 17 /* true if poll_idle() is polling
TIF_NEED_RESCHED */
#define TIF_31BIT 18 /* 32bit process */
+#define TIF_QUEUED 19 /* true if on scheduler runqueue */

#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-sh/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-sh/thread_info.h 2004-05-29 23:26:26.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-sh/thread_info.h 2004-06-07 07:28:08.307950000 -0700
@@ -93,6 +93,7 @@
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
#define TIF_USEDFPU 16 /* FPU was used by this task this quantum (SMP) */
#define TIF_POLLING_NRFLAG 17 /* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED 18 /* true if on scheduler runqueue */
#define TIF_USERSPACE 31 /* true if FS sets userspace */

#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-sparc/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-sparc/thread_info.h 2004-05-29 23:26:43.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-sparc/thread_info.h 2004-06-07 07:29:22.011745000 -0700
@@ -137,6 +137,7 @@
* this quantum (SMP) */
#define TIF_POLLING_NRFLAG 9 /* true if poll_idle() is polling
* TIF_NEED_RESCHED */
+#define TIF_QUEUED 10 /* true if on scheduler runqueue */

/* as above, but as bit values */
#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-sparc64/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-sparc64/thread_info.h 2004-05-29 23:26:26.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-sparc64/thread_info.h 2004-06-07 07:30:54.506684000 -0700
@@ -228,6 +228,7 @@
* an immediate value in instructions such as andcc.
*/
#define TIF_ABI_PENDING 12
+#define TIF_QUEUED 13 /* true if on scheduler runqueue */

#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-um/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-um/thread_info.h 2004-05-29 23:26:03.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-um/thread_info.h 2004-06-07 07:35:20.439256000 -0700
@@ -65,6 +65,7 @@
#define TIF_POLLING_NRFLAG 3 /* true if poll_idle() is polling
* TIF_NEED_RESCHED
*/
+#define TIF_QUEUED 4 /* true if on scheduler runqueue */

#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
Index: kolivas-2.6.7-rc2/include/asm-v850/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-v850/thread_info.h 2004-05-29 23:26:03.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-v850/thread_info.h 2004-06-07 07:36:38.430400000 -0700
@@ -83,6 +83,7 @@
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
#define TIF_POLLING_NRFLAG 4 /* true if poll_idle() is polling
TIF_NEED_RESCHED */
+#define TIF_QUEUED 5 /* true if on scheduler runqueue */

/* as above, but as bit values */
#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-x86_64/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-x86_64/thread_info.h 2004-05-29 23:26:26.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-x86_64/thread_info.h 2004-06-07 07:37:09.234717000 -0700
@@ -106,6 +106,7 @@
#define TIF_IA32 17 /* 32bit process */
#define TIF_FORK 18 /* ret_from_fork */
#define TIF_ABI_PENDING 19
+#define TIF_QUEUED 20 /* true if on scheduler runqueue */

#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/linux/sched.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/linux/sched.h 2004-06-07 06:19:45.891612000 -0700
+++ kolivas-2.6.7-rc2/include/linux/sched.h 2004-06-07 07:37:49.068661000 -0700
@@ -325,7 +325,6 @@
extern struct user_struct root_user;
#define INIT_USER (&root_user)

-typedef struct prio_array prio_array_t;
struct backing_dev_info;
struct reclaim_state;

@@ -392,8 +391,6 @@

int prio, static_prio;
struct list_head run_list;
- prio_array_t *array;
-
unsigned long long timestamp;
unsigned long runtime, totalrun;
unsigned int deadline;
Index: kolivas-2.6.7-rc2/kernel/sched.c
===================================================================
--- kolivas-2.6.7-rc2.orig/kernel/sched.c 2004-06-07 07:06:28.072616000 -0700
+++ kolivas-2.6.7-rc2/kernel/sched.c 2004-06-07 07:59:22.711997000 -0700
@@ -85,10 +85,10 @@

typedef struct runqueue runqueue_t;

-struct prio_array {
+typedef struct prio_array {
unsigned long bitmap[BITMAP_SIZE];
struct list_head queue[MAX_PRIO + 1];
-};
+} prio_array_t;

/*
* This is the main, per-CPU runqueue data structure.
@@ -191,6 +191,21 @@
spin_unlock_irq(&rq->lock);
}

+static inline int task_queued(task_t *task)
+{
+ return test_ti_thread_flag(task->thread_info, TIF_QUEUED);
+}
+
+static inline void set_task_queued(task_t *task)
+{
+ set_ti_thread_flag(task->thread_info, TIF_QUEUED);
+}
+
+static inline void clear_task_queued(task_t *task)
+{
+ clear_ti_thread_flag(task->thread_info, TIF_QUEUED);
+}
+
/*
* Adding/removing a task to/from a priority array:
*/
@@ -207,7 +222,7 @@
prio_array_t* array = &rq->array;
list_add_tail(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
- p->array = array;
+ set_task_queued(p);
}

/*
@@ -220,7 +235,7 @@
prio_array_t* array = &rq->array;
list_add(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
- p->array = array;
+ set_task_queued(p);
}

/*
@@ -369,7 +384,7 @@
p->deadline--;
}
dequeue_task(p, rq);
- p->array = NULL;
+ clear_task_queued(p);
}

/*
@@ -442,7 +457,7 @@
* If the task is not on a runqueue (and not running), then
* it is sufficient to simply update the task's cpu field.
*/
- if (!p->array && !task_running(rq, p)) {
+ if (!task_queued(p) && !task_running(rq, p)) {
set_task_cpu(p, dest_cpu);
return 0;
}
@@ -473,7 +488,7 @@
repeat:
rq = task_rq_lock(p, &flags);
/* Must be off runqueue entirely, not preempted. */
- if (unlikely(p->array)) {
+ if (unlikely(task_queued(p))) {
/* If it's preempted, we yield. It could be a while. */
preempted = !task_running(rq, p);
task_rq_unlock(rq, &flags);
@@ -603,7 +618,7 @@
if (!(old_state & state))
goto out;

- if (p->array)
+ if (task_queued(p))
goto out_running;

cpu = task_cpu(p);
@@ -663,7 +678,7 @@
old_state = p->state;
if (!(old_state & state))
goto out;
- if (p->array)
+ if (task_queued(p))
goto out_running;

this_cpu = smp_processor_id();
@@ -725,7 +740,7 @@
*/
p->state = TASK_RUNNING;
INIT_LIST_HEAD(&p->run_list);
- p->array = NULL;
+ clear_task_queued(p);
spin_lock_init(&p->switch_lock);
#ifdef CONFIG_PREEMPT
/*
@@ -1079,12 +1094,12 @@
set_task_cpu(p, cpu);

if (cpu == this_cpu) {
- if (unlikely(!current->array))
+ if (unlikely(!task_queued(current)))
__activate_task(p, rq);
else {
p->prio = current->prio;
list_add_tail(&p->run_list, &current->run_list);
- p->array = current->array;
+ set_task_queued(p);
rq->nr_running++;
}
} else {
@@ -2232,9 +2247,8 @@
void set_user_nice(task_t *p, long nice)
{
unsigned long flags;
- prio_array_t* array;
runqueue_t *rq;
- int old_prio, new_prio, delta;
+ int queued, old_prio, new_prio, delta;

if (TASK_NICE(p) == nice || nice < -20 || nice > 19)
return;
@@ -2253,8 +2267,7 @@
p->static_prio = NICE_TO_PRIO(nice);
goto out_unlock;
}
- array = p->array;
- if (array)
+ if ((queued = task_queued(p)))
dequeue_task(p, rq);

old_prio = p->prio;
@@ -2263,7 +2276,7 @@
p->static_prio = NICE_TO_PRIO(nice);
p->prio += delta;

- if (array) {
+ if (queued) {
enqueue_task(p, rq);
/*
* If the task increased its priority or is running and
@@ -2369,7 +2382,7 @@
/* Actually do priority change: must hold rq lock. */
static void __setscheduler(struct task_struct *p, int policy, int prio)
{
- BUG_ON(p->array);
+ BUG_ON(task_queued(p));
p->policy = policy;
p->rt_priority = prio;
if (policy != SCHED_NORMAL)
@@ -2385,8 +2398,7 @@
{
struct sched_param lp;
int retval = -EINVAL;
- int oldprio;
- prio_array_t* array;
+ int queued, oldprio;
unsigned long flags;
runqueue_t *rq;
task_t *p;
@@ -2446,13 +2458,12 @@
if (retval)
goto out_unlock;

- array = p->array;
- if (array)
+ if ((queued = task_queued(p)))
deactivate_task(p, task_rq(p));
retval = 0;
oldprio = p->prio;
__setscheduler(p, policy, lp.sched_priority);
- if (array) {
+ if (queued) {
__activate_task(p, task_rq(p));
/*
* Reschedule if we are currently running on this runqueue and
@@ -2922,7 +2933,7 @@

idle_rq->curr = idle_rq->idle = idle;
deactivate_task(idle, rq);
- idle->array = NULL;
+ clear_task_queued(idle);
idle->prio = MAX_PRIO;
idle->state = TASK_RUNNING;
idle->deadline = 0;
@@ -3034,7 +3045,7 @@
goto out;

set_task_cpu(p, dest_cpu);
- if (p->array) {
+ if (task_queued(p)) {
/*
* Sync timestamp with rq_dest's before activating.
* The same thing could be achieved by doing this step

2004-06-07 15:14:00

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

On Mon, Jun 07, 2004 at 07:07:14AM -0700, William Lee Irwin III wrote:
>> cpu_to_node_mask() is dead.

On Mon, Jun 07, 2004 at 08:00:37AM -0700, William Lee Irwin III wrote:
> task->array is nothing more than a boolean flag. Shove it into
> task->thread_info->flags, saving sizeof(prio_array_t *) from sizeof(task_t).
> Compile-tested on sparc64 only.

prio_array_t is no longer a useful abstraction. Compile-tested on
sparc64 only.
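The flag-bit idea described above can be sketched in plain user-space C. Names here are hypothetical stand-ins: the actual patch uses the per-arch TIF_QUEUED bit numbers defined earlier and the kernel's atomic set_ti_thread_flag()/clear_ti_thread_flag()/test_ti_thread_flag() helpers, which plain bit arithmetic only approximates.

```c
#include <assert.h>

/* Sketch: the old p->array pointer was only ever tested against NULL,
 * so one bit in the already-existing thread_info->flags word can carry
 * the same "am I on a runqueue?" information, saving a pointer per task. */

#define TIF_QUEUED 4 /* bit number; the real value varies per architecture */

struct thread_info_sketch {
	unsigned long flags;
};

/* Nonzero iff the task is on a runqueue. */
static int task_queued(const struct thread_info_sketch *ti)
{
	return !!(ti->flags & (1UL << TIF_QUEUED));
}

static void set_task_queued(struct thread_info_sketch *ti)
{
	ti->flags |= 1UL << TIF_QUEUED;
}

static void clear_task_queued(struct thread_info_sketch *ti)
{
	ti->flags &= ~(1UL << TIF_QUEUED);
}
```

Setting and clearing the bit leaves every other flag in the word untouched, which is why the change is safe alongside TIF_SIGPENDING and friends (in the kernel proper the accessors are atomic, unlike this sketch).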


Index: kolivas-2.6.7-rc2/kernel/sched.c
===================================================================
--- kolivas-2.6.7-rc2.orig/kernel/sched.c 2004-06-07 07:59:22.711997000 -0700
+++ kolivas-2.6.7-rc2/kernel/sched.c 2004-06-07 08:07:09.705003000 -0700
@@ -81,15 +81,8 @@
* These are the runqueue data structures:
*/

-#define BITMAP_SIZE ((((MAX_PRIO+1+7)/8)+sizeof(long)-1)/sizeof(long))
-
typedef struct runqueue runqueue_t;

-typedef struct prio_array {
- unsigned long bitmap[BITMAP_SIZE];
- struct list_head queue[MAX_PRIO + 1];
-} prio_array_t;
-
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -113,7 +106,8 @@
unsigned long long timestamp_last_tick;
task_t *curr, *idle;
struct mm_struct *prev_mm;
- prio_array_t array;
+ unsigned long bitmap[BITS_TO_LONGS(MAX_PRIO+1)];
+ struct list_head queue[MAX_PRIO + 1];
atomic_t nr_iowait;

#ifdef CONFIG_SMP
@@ -207,21 +201,19 @@
}

/*
- * Adding/removing a task to/from a priority array:
+ * Adding/removing a task to/from a runqueue:
*/
static void dequeue_task(struct task_struct *p, runqueue_t *rq)
{
- prio_array_t* array = &rq->array;
list_del(&p->run_list);
- if (list_empty(array->queue + p->prio))
- __clear_bit(p->prio, array->bitmap);
+ if (list_empty(rq->queue + p->prio))
+ __clear_bit(p->prio, rq->bitmap);
}

static void enqueue_task(struct task_struct *p, runqueue_t *rq)
{
- prio_array_t* array = &rq->array;
- list_add_tail(&p->run_list, array->queue + p->prio);
- __set_bit(p->prio, array->bitmap);
+ list_add_tail(&p->run_list, rq->queue + p->prio);
+ __set_bit(p->prio, rq->bitmap);
set_task_queued(p);
}

@@ -232,9 +224,8 @@
*/
static inline void enqueue_task_head(struct task_struct *p, runqueue_t *rq)
{
- prio_array_t* array = &rq->array;
- list_add(&p->run_list, array->queue + p->prio);
- __set_bit(p->prio, array->bitmap);
+ list_add(&p->run_list, rq->queue + p->prio);
+ __set_bit(p->prio, rq->bitmap);
set_task_queued(p);
}

@@ -1257,27 +1248,24 @@
unsigned long max_nr_move, struct sched_domain *sd,
enum idle_type idle)
{
- prio_array_t* array, *dst_array;
struct list_head *head, *curr;
int idx, pulled = 0;
task_t *tmp;

if (max_nr_move <= 0 || busiest->nr_running <= 1)
goto out;
- array = &busiest->array;
- dst_array = &this_rq->array;

/* Start searching at priority 0: */
idx = 0;
skip_bitmap:
if (!idx)
- idx = sched_find_first_bit(array->bitmap);
+ idx = sched_find_first_bit(busiest->bitmap);
else
- idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
+ idx = find_next_bit(busiest->bitmap, MAX_PRIO, idx);
if (idx >= MAX_PRIO)
goto out;

- head = array->queue + idx;
+ head = busiest->queue + idx;
curr = head->prev;
skip_queue:
tmp = list_entry(curr, task_t, run_list);
@@ -1919,7 +1907,6 @@
long *switch_count;
task_t *prev, *next;
runqueue_t *rq;
- prio_array_t* array;
struct list_head *queue;
unsigned long long now;
int cpu, idx;
@@ -1971,9 +1958,8 @@
}
}

- array = &rq->array;
- idx = sched_find_first_bit(array->bitmap);
- queue = array->queue + idx;
+ idx = sched_find_first_bit(rq->bitmap);
+ queue = rq->queue + idx;
next = list_entry(queue->next, task_t, run_list);

if (dependent_sleeper(cpu, rq, next))
@@ -3590,8 +3576,6 @@
#endif

for (i = 0; i < NR_CPUS; i++) {
- prio_array_t* array;
-
rq = cpu_rq(i);
spin_lock_init(&rq->lock);

@@ -3604,14 +3588,11 @@
INIT_LIST_HEAD(&rq->migration_queue);
#endif
atomic_set(&rq->nr_iowait, 0);
- array = &rq->array;
-
- for (j = 0; j <= MAX_PRIO; j++) {
- INIT_LIST_HEAD(array->queue + j);
- __clear_bit(j, array->bitmap);
- }
+ for (j = 0; j <= MAX_PRIO; j++)
+ INIT_LIST_HEAD(&rq->queue[j]);
+ memset(rq->bitmap, 0, BITS_TO_LONGS(MAX_PRIO+1)*sizeof(long));
// delimiter for bitsearch
- __set_bit(MAX_PRIO, array->bitmap);
+ __set_bit(MAX_PRIO, rq->bitmap);
}
/*
* We have to do a little magic to get the first

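With prio_array_t folded away, the runqueue is just a priority bitmap plus an array of per-priority queues, and picking the next task is a find-first-bit over the bitmap. A minimal user-space sketch of that lookup (illustrative names only; the kernel searches the real bitmap with sched_find_first_bit() and keeps struct list_head queues rather than single pids):

```c
#include <assert.h>
#include <string.h>

#define MAX_PRIO 8 /* tiny for the sketch; the kernel uses 140 priority levels */

/* Bit p set in the bitmap means queue[p] is non-empty.  A single pid
 * stands in for the kernel's per-priority list of runnable tasks. */
struct rq_sketch {
	unsigned long bitmap;
	int queue[MAX_PRIO + 1];
};

static void rq_init(struct rq_sketch *rq)
{
	memset(rq, 0, sizeof(*rq));
	rq->bitmap |= 1UL << MAX_PRIO; /* delimiter bit for the search */
}

static void rq_enqueue(struct rq_sketch *rq, int prio, int pid)
{
	rq->queue[prio] = pid;
	rq->bitmap |= 1UL << prio;
}

/* O(1) pick-next: the lowest set bit is the best (numerically lowest)
 * priority that has a runnable task. */
static int rq_pick_next(const struct rq_sketch *rq)
{
	int idx = __builtin_ctzl(rq->bitmap); /* gcc/clang count-trailing-zeros */
	return rq->queue[idx];
}
```

The cost of pick-next is independent of how many tasks are queued, which is the O(1) property both the original prio_array version and this inlined version share; the patch only removes the intermediate struct.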
2004-06-07 20:22:43

by Phy Prabab

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

FYI:

I have had a chance to test this patch. I have a make
system that has presented 2.6 kernels with problems,
so I am using this as the test. Observations show
that turning off interactive is much more
deterministic:

2.6.7-rc2-bk8-s63:
echo 0 > /proc/sys/kernel/interactive

A: 35.57user 38.18system 1:20.28elapsed 91%CPU
B: 35.54user 38.40system 1:19.48elapsed 93%CPU
C: 35.48user 38.28system 1:20.94elapsed 91%CPU

2.6.7-rc2-bk8-s63:
A: 35.32user 38.51system 1:26.47elapsed 85%CPU
B: 35.43user 38.35system 1:20.79elapsed 91%CPU
C: 35.61user 38.23system 1:25.00elapsed 86%CPU

However, something is still slower than the 2.4.x
kernels:

2.4.23:
A: 28.32user 29.51system 1:01.17elapsed 93%CPU
B: 28.54user 29.40system 1:01.48elapsed 92%CPU
B: 28.23user 28.80system 1:00.21elapsed 94%CPU

Nice work, as I can now turn off some functionality
within the kernel that is causing me some slowdown.

Thanks!
Phy


--- Con Kolivas <[email protected]> wrote:
>
>
> Performance:
> - - All cpu throughput benchmarks show equivalent or better performance than
> mainline. Note that disabling the interactive setting actually _worsens_ some
> benchmarks because of their dependence on yield() so I don't recommend
> disabling it unless you do a comparison first.
> - - Interactivity is approximately equivalent to mainline 2.6 but with faster
> application startup and no known behavioural quirks.
>
>
> Comments and testing welcome.
>
> fs/proc/array.c | 2
> include/linux/sched.h | 8
> include/linux/sysctl.h | 2
> kernel/sched.c | 676 +++++++++++++------------------------------
> kernel/sysctl.c | 16 +
> 5 files changed, 212 insertions(+), 492 deletions(-)
>
> Can be downloaded here:
> http://ck.kolivas.org/patches/2.6/2.6.7-rc2/patch-2.6.7-rc2-s6.3
>
> and below
> Con
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.2.4 (GNU/Linux)
>
> iD8DBQFAwzquZUg7+tp6mRURArBHAJ9SIBOKX6MYOdkJzdb+xRNnW82JQgCghLou
> wXhrRsBsfY2BIqbLT1tUWcs=
> =g1no
> -----END PGP SIGNATURE-----
> diff -Naurp --exclude-from=dontdiff linux-2.6.7-rc2-base/fs/proc/array.c linux-2.6.7-rc2-s6.3/fs/proc/array.c
> --- linux-2.6.7-rc2-base/fs/proc/array.c 2004-05-31 21:29:15.000000000 +1000
> +++ linux-2.6.7-rc2-s6.3/fs/proc/array.c 2004-06-07 00:03:56.959133536 +1000
> @@ -155,7 +155,6 @@ static inline char * task_state(struct t
> read_lock(&tasklist_lock);
> buffer += sprintf(buffer,
> "State:\t%s\n"
> - "SleepAVG:\t%lu%%\n"
> "Tgid:\t%d\n"
> "Pid:\t%d\n"
> "PPid:\t%d\n"
> @@ -163,7 +162,6 @@ static inline char * task_state(struct t
> "Uid:\t%d\t%d\t%d\t%d\n"
> "Gid:\t%d\t%d\t%d\t%d\n",
> get_task_state(p),
> - (p->sleep_avg/1024)*100/(1020000000/1024),
> p->tgid,
> p->pid, p->pid ? p->real_parent->pid : 0,
> p->pid && p->ptrace ? p->parent->pid : 0,
> diff -Naurp --exclude-from=dontdiff linux-2.6.7-rc2-base/include/linux/sched.h linux-2.6.7-rc2-s6.3/include/linux/sched.h
> --- linux-2.6.7-rc2-base/include/linux/sched.h 2004-05-31 21:29:21.000000000 +1000
> +++ linux-2.6.7-rc2-s6.3/include/linux/sched.h 2004-06-07 00:10:37.450576503 +1000
> @@ -164,6 +164,7 @@ extern void show_stack(struct task_struc
>
> void io_schedule(void);
> long io_schedule_timeout(long timeout);
> +extern int interactive, compute;
>
> extern void cpu_init (void);
> extern void trap_init(void);
> @@ -394,14 +395,13 @@ struct task_struct {
> struct list_head run_list;
> prio_array_t *array;
>
> - unsigned long sleep_avg;
> - long interactive_credit;
> unsigned long long timestamp;
> - int activated;
> + unsigned long runtime, totalrun;
> + unsigned int deadline;
>
> unsigned long policy;
> cpumask_t cpus_allowed;
> - unsigned int time_slice, first_time_slice;
> + unsigned int slice, time_slice, first_time_slice;
>
> struct list_head tasks;
> struct list_head ptrace_children;
> diff -Naurp --exclude-from=dontdiff linux-2.6.7-rc2-base/include/linux/sysctl.h linux-2.6.7-rc2-s6.3/include/linux/sysctl.h
> --- linux-2.6.7-rc2-base/include/linux/sysctl.h 2004-05-31 21:29:21.000000000 +1000
> +++ linux-2.6.7-rc2-s6.3/include/linux/sysctl.h 2004-06-07 00:06:13.007895851 +1000
> @@ -133,6 +133,8 @@ enum
> KERN_NGROUPS_MAX=63, /* int: NGROUPS_MAX */
> KERN_SPARC_SCONS_PWROFF=64, /* int: serial console power-off halt */
> KERN_HZ_TIMER=65, /* int: hz timer on or off */
> + KERN_INTERACTIVE=66, /* interactive tasks to have cpu bursts */
> + KERN_COMPUTE=67, /* adjust timeslices for a compute server */
> };
>
>
> diff -Naurp --exclude-from=dontdiff linux-2.6.7-rc2-base/kernel/sched.c linux-2.6.7-rc2-s6.3/kernel/sched.c
> --- linux-2.6.7-rc2-base/kernel/sched.c 2004-05-31 21:29:24.000000000 +1000
> +++ linux-2.6.7-rc2-s6.3/kernel/sched.c 2004-06-07 00:36:43.382396246 +1000
> @@ -16,7 +16,10 @@
> * by Davide Libenzi, preemptible kernel bits by Robert Love.
> * 2003-09-03 Interactivity tuning by Con Kolivas.
> * 2004-04-02 Scheduler domains code by Nick Piggin
> - */
> + * 2004-03-19. New staircase scheduling policy by Con Kolivas with help
> + * from Zwane Mwaikambo and useful suggestions by
> + * William Lee Irwin III.
> +*/
>
> #include <linux/mm.h>
> #include <linux/module.h>
> @@ -66,8 +69,6 @@
> #define USER_PRIO(p) ((p)-MAX_RT_PRIO)
> #define TASK_USER_PRIO(p) USER_PRIO((p)->static_prio)
> #define MAX_USER_PRIO (USER_PRIO(MAX_PRIO))
> -#define AVG_TIMESLICE (MIN_TIMESLICE + ((MAX_TIMESLICE - MIN_TIMESLICE) *\
> - (MAX_PRIO-1-NICE_TO_PRIO(0))/(MAX_USER_PRIO - 1)))
>
> /*
> * Some helpers for converting nanosecond timing to jiffy resolution
> @@ -75,110 +76,18 @@
> #define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / HZ))
> #define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ))
>
> -/*
> - * These are the 'tuning knobs' of the scheduler:
> - *
> - * Minimum timeslice is 10 msecs, default timeslice is 100 msecs,
> - * maximum timeslice is 200 msecs. Timeslices get refilled after
> - * they expire.
> - */
> -#define MIN_TIMESLICE ( 10 * HZ / 1000)
> -#define MAX_TIMESLICE (200 * HZ / 1000)
> -#define ON_RUNQUEUE_WEIGHT 30
> -#define CHILD_PENALTY 95
> -#define PARENT_PENALTY 100
> -#define EXIT_WEIGHT 3
> -#define PRIO_BONUS_RATIO 25
> -#define MAX_BONUS (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100)
> -#define INTERACTIVE_DELTA 2
> -#define MAX_SLEEP_AVG (AVG_TIMESLICE * MAX_BONUS)
> -#define STARVATION_LIMIT (MAX_SLEEP_AVG)
> -#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG))
> -#define CREDIT_LIMIT 100
> -
> -/*
> - * If a task is 'interactive' then we reinsert it in the active
> - * array after it has expired its current timeslice. (it will not
> - * continue to run immediately, it will still roundrobin with
> - * other interactive tasks.)
> - *
> - * This part scales the interactivity limit depending on niceness.
> - *
> - * We scale it linearly, offset by the INTERACTIVE_DELTA delta.
> - * Here are a few examples of different nice levels:
> - *
> - * TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
> - * TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
> - * TASK_INTERACTIVE( 0): [1,1,1,1,0,0,0,0,0,0,0]
> - * TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
> - * TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
> - *
> - * (the X axis represents the possible -5 ... 0 ... +5 dynamic
> - * priority range a task can explore, a value of '1' means the
> - * task is rated interactive.)
> - *
> - * Ie. nice +19 tasks can never get 'interactive' enough to be
> - * reinserted into the active array. And only heavily CPU-hog nice -20
> - * tasks will be expired. Default nice 0 tasks are somewhere between,
> - * it takes some effort for them to get interactive, but it's not
> - * too hard.
> - */
> -
> -#define CURRENT_BONUS(p) \
> - (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
> - MAX_SLEEP_AVG)
> -
> -#ifdef CONFIG_SMP
> -#define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \
> - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \
> - num_online_cpus())
> -#else
> -#define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \
> - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)))
> -#endif
> -
> -#define SCALE(v1,v1_max,v2_max) \
> - (v1) * (v2_max) / (v1_max)
>
=== message truncated ===






2004-06-07 21:13:00

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

On Tue, 8 Jun 2004 05:57, Phy Prabab wrote:
> I have had a chance to test this patch.

Thanks

> I have a make
> system that has presented 2.6 kernels with problems,
> so I am using this as the test. Observations show
> that turning off interactive is much more
> deterministic:
>
> 2.6.7-rc2-bk8-s63:
> echo 0 > /proc/sys/kernel/interactive
>
> A: 35.57user 38.18system 1:20.28elapsed 91%CPU
> B: 35.54user 38.40system 1:19.48elapsed 93%CPU
> C: 35.48user 38.28system 1:20.94elapsed 91%CPU
>
> 2.6.7-rc2-bk8-s63:
> A: 35.32user 38.51system 1:26.47elapsed 85%CPU
> B: 35.43user 38.35system 1:20.79elapsed 91%CPU
> C: 35.61user 38.23system 1:25.00elapsed 86%CPU
>
> However, something is still slower than the 2.4.x
> kernels:
>
> 2.4.23:
> A: 28.32user 29.51system 1:01.17elapsed 93%CPU
> B: 28.54user 29.40system 1:01.48elapsed 92%CPU
> B: 28.23user 28.80system 1:00.21elapsed 94%CPU
>
> Nice work, as I can now turn off some functionality
> within the kernel that is causing me some slow down.

Glad to see it does what you require. Turning off "interactive" should still
provide moderately good interactive performance at low to moderate loads, but
is much stricter about cpu usage distribution as you can see.

Cheers,
Con

2004-06-07 21:42:53

by Phy Prabab

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

OOOPPPSSSS....

I need to make a correction to my previous data. I
had inadvertently turned off interactivity and also
increased the compute time to 100. I confirmed that
just setting interactivity off does not solve my
problem:

2.6.7-rc3-s63 (0 @ /proc/sys/kernel/interactive):
A: 37.30user 40.56system 1:42.01elapsed 76%CPU
B: 37.29user 40.35system 1:23.87elapsed 92%CPU
C: 37.30user 40.56system 1:36.01elapsed 81%CPU

2.6.7-rc3-s63 (0 @ /proc/sys/kernel/interactive & 1
/proc/sys/kernel/compute):
A: 37.28user 40.36system 1:25.60elapsed 90%CPU
B: 37.22user 40.35system 1:22.17elapsed 94%CPU
C: 37.27user 40.35system 1:24.71elapsed 91%CPU

The question here, noticing that user and kernel time
are the same: where is the dead time coming from, and
why is it so much more deterministic with compute
time at 100 vs 10? Maybe I am misinterpreting the
data, but this suggests to me that something is going
awry (ping-pong, no settling?) within the kernel.


Also please note the degradation between
2.6.7-rc2-bk8-s63:

A: 35.57user 38.18system 1:20.28elapsed 91%CPU
B: 35.54user 38.40system 1:19.48elapsed 93%CPU
C: 35.48user 38.28system 1:20.94elapsed 91%CPU

Interesting how much more time is spent in both user
and kernel space between the two kernels. Also note
that 2.4.x exhibits an even greater delta:

A: 28.32user 29.51system 1:01.17elapsed 93%CPU
B: 28.54user 29.40system 1:01.48elapsed 92%CPU
B: 28.23user 28.80system 1:00.21elapsed 94%CPU

Could anyone suggest a way to understand the
difference between the 2.6 kernels and the 2.4
kernels?

Thank you for your time.
Phy





--- Con Kolivas <[email protected]> wrote:
> On Tue, 8 Jun 2004 05:57, Phy Prabab wrote:
> > I have had a chance to test this patch.
>
> Thanks
>
> > I have a make
> > system that has presented 2.6 kernels with
> problems,
> > so I am using this as the test. Observations
> show
> > that turning off interactive is much more
> > deterministic:
> >
> > 2.6.7-rc2-bk8-s63:
> > echo 0 > /proc/sys/kernel/interactive
> >
> > A: 35.57user 38.18system 1:20.28elapsed 91%CPU
> > B: 35.54user 38.40system 1:19.48elapsed 93%CPU
> > C: 35.48user 38.28system 1:20.94elapsed 91%CPU
> >
> > 2.6.7-rc2-bk8-s63:
> > A: 35.32user 38.51system 1:26.47elapsed 85%CPU
> > B: 35.43user 38.35system 1:20.79elapsed 91%CPU
> > C: 35.61user 38.23system 1:25.00elapsed 86%CPU
> >
> > However, something is still slower than the 2.4.x
> > kernels:
> >
> > 2.4.23:
> > A: 28.32user 29.51system 1:01.17elapsed 93%CPU
> > B: 28.54user 29.40system 1:01.48elapsed 92%CPU
> > B: 28.23user 28.80system 1:00.21elapsed 94%CPU
> >
> > Nice work, as I can now turn off some
> functionality
> > within the kernel that is causing me some slow
> down.
>
> Glad to see it does what you require. Turning off
> "interactive" should still
> provide moderately good interactive performance at
> low to moderate loads, but
> is much stricter about cpu usage distribution as you
> can see.
>
> Cheers,
> Con






2004-06-07 23:58:27

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

Quoting Phy Prabab <[email protected]>:

> OOOPPPSSSS....
>
> I need to make a correction to my previous data. I
> had inadvertently turned off interactivity and also
> increased the compute time to 100. I confirmed that
> just setting interactivity off does not solve my
> problem:
>
> 2.6.7-rc3-s63 (0 @ /proc/sys/kernel/interactive):
> A: 37.30user 40.56system 1:42.01elapsed 76%CPU
> B: 37.29user 40.35system 1:23.87elapsed 92%CPU
> C: 37.30user 40.56system 1:36.01elapsed 81%CPU
>
> 2.6.7-rc3-s63 (0 @ /proc/sys/kernel/interactive & 1
> /proc/sys/kernel/compute):
> A: 37.28user 40.36system 1:25.60elapsed 90%CPU
> B: 37.22user 40.35system 1:22.17elapsed 94%CPU
> C: 37.27user 40.35system 1:24.71elapsed 91%CPU
>
> The question here, noticing that user and kernel time
> are the same: where is the dead time coming from, and
> why is it so much more deterministic with compute
> time at 100 vs 10? Maybe I am misinterpreting the
> data, but this suggests to me that something is going
> awry (ping-pong, no settling?) within the kernel.
>
>
> Also please note the degradation between
> 2.6.7-rc2-bk8-s63:
>
> A: 35.57user 38.18system 1:20.28elapsed 91%CPU
> B: 35.54user 38.40system 1:19.48elapsed 93%CPU
> C: 35.48user 38.28system 1:20.94elapsed 91%CPU
>
> Interesting how much more time is spent in both user
> and kernel space between the two kernels. Also note
> that 2.4.x exhibits an even greater delta:
>
> A: 28.32user 29.51system 1:01.17elapsed 93%CPU
> B: 28.54user 29.40system 1:01.48elapsed 92%CPU
> B: 28.23user 28.80system 1:00.21elapsed 94%CPU
>
> Could anyone suggest a way to understand the
> difference between the 2.6 kernels and the 2.4
> kernels?
>
> Thank you for your time.
> Phy
>
Hi.

How repeatable are the numbers normally? Some idea of what it is you're
benchmarking may also help in understanding the problem; locking may be an
issue with what you're benchmarking and out-of-order scheduling is not as
forgiving of poor locking. Extending the RR_INTERVAL and turning off
interactivity makes it more in-order and more forgiving of poor locking or
yield().

The compute==1 setting disables interactivity anyway, but that's not really
relevant to your figures since you had set interactive to 0 when you set
compute to 1.

Con

2004-06-08 00:06:53

by Phy Prabab

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2


Just to clarify, setting compute to 1 implies
interactive 0?

These numbers are very reproducible; I have run them
(in a continuous loop) for two hours.

The test is a make of headers for a proprietary
executable. Making headers is rather simple: all it
does is link a bunch of .h files (traversing dirs)
plus some dependency generation (3 files, yacc and
lex). I have moved the source code base to local disk
to discount nfs issues (though the difference is
negligible and nfs performance on 2.6 is generally
faster than 2.4).

I have tried to get a good test case that can be
submitted. Still trying.

Any suggestions to try to diagnose this?

Thanks!
Phy
> >
> Hi.
>
> How repeatable are the numbers normally? Some idea
> of what it is you're
> benchmarking may also help in understanding the
> problem; locking may be an
> issue with what you're benchmarking and out-of-order
> scheduling is not as
> forgiving of poor locking. Extending the RR_INTERVAL
> and turning off
> interactivity makes it more in-order and more
> forgiving of poor locking or
> yield().
>
> The compute==1 setting disables interactivity anyway, but that's not
> really relevant to your figures since you had set interactive to 0 when
> you set compute to 1.
>
> Con
>






2004-06-08 00:32:19

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

Quoting Phy Prabab <[email protected]>:

>
> Just to clarify, setting compute 1 implys interactive
> 0?

Yes.

>
> These numbers are very reproducable nad have done them
> (in a continuous loop) for two hours.

Ok.

>
> The test is a make of headers for a propritary exec.
> Making headers is rather simple is all it does it link
> a bunch of h files (traversing dirs) and some
> dependance generation (3 files, yacc and lex). I have
> moved the source code base to local disk to dicount
> nfs issues (though the difference is neglibible and
> nfs performance on 2.6 is generally faster than 2.4).

Out of curiosity, what happens over nfs? I/O effects can be significant. I
assume you've excluded memory effects?

> I have tried to get a good test case that can be
> submitted. Still trying.

Linking kernel headers with full debugging?

> Any suggestions to try to diagnose this?

Profiling.

> Thanks!
> Phy

Cheers,
Con

2004-06-08 02:51:02

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

Phy Prabab <[email protected]> wrote:
>
> Also please note the degradation between
> 2.6.7-rc2-bk8-s63:
>
> A: 35.57user 38.18system 1:20.28elapsed 91%CPU
> B: 35.54user 38.40system 1:19.48elapsed 93%CPU
> C: 35.48user 38.28system 1:20.94elapsed 91%CPU
>
> Interesting how much more time is spent in both user
> and kernel space between the two kernels. Also note
> that 2.4.x exhibits an even greater delta:
>
> A: 28.32user 29.51system 1:01.17elapsed 93%CPU
> B: 28.54user 29.40system 1:01.48elapsed 92%CPU
> B: 28.23user 28.80system 1:00.21elapsed 94%CPU
>
> Could anyone suggest a way to understand the
> difference between the 2.6 kernels and the 2.4
> kernels?

This is very very bad.

It's a uniprocessor machine, yes?

Could you describe the workload a bit more? Is it something which others
can get their hands on?

It spends a lot of time in the kernel for a build system. I wonder why.

At a guess I'd say either a) you're hitting some path in the kernel which
is going for a giant and bogus romp through memory, trashing CPU caches or
b) your workload really dislikes run-child-first-after-fork or c) the page
allocator is serving up pages which your access pattern dislikes or d)
something else.

It's certainly interesting.

2004-06-08 03:05:40

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

Andrew Morton wrote:

> At a guess I'd say either a) you're hitting some path in the kernel which
> is going for a giant and bogus romp through memory, trashing CPU caches or
> b) your workload really dislikes run-child-first-after-fork or c) the page
> allocator is serving up pages which your access pattern dislikes or d)
> something else.
>

e) it's the staircase scheduler patch?

2004-06-08 03:31:17

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

Quoting Nick Piggin <[email protected]>:

> Andrew Morton wrote:
>
> > At a guess I'd say either a) you're hitting some path in the kernel which
> > is going for a giant and bogus romp through memory, trashing CPU caches or
> > b) your workload really dislikes run-child-first-after-fork or c) the page
> > allocator is serving up pages which your access pattern dislikes or d)
> > something else.
> >
>
> e) it's the staircase scheduler patch?
>

Phy Prabab wrote:
>Could anyone suggest a way to understand the
>difference between the 2.6 kernels and the 2.4
>kernels

I guess Phy had better tell us whether it's unique to staircase; if so,
there's clearly a bug unique to it.

Con

2004-06-08 05:26:00

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

Quoting Andrew Morton <[email protected]>:

> Phy Prabab <[email protected]> wrote:
> >
> > Also please note the degredation between
> > 2.6.7-rc2-bk8-s63:
> >
> > A: 35.57user 38.18system 1:20.28elapsed 91%CPU
> > B: 35.54user 38.40system 1:19.48elapsed 93%CPU
> > C: 35.48user 38.28system 1:20.94elapsed 91%CPU
> >
> > Interesting how much more time is spent in both user
> > and kernel space between the two kernels. Also note
> > that 2.4.x exhibits even greater delta:
> >
> > A: 28.32user 29.51system 1:01.17elapsed 93%CPU
> > B: 28.54user 29.40system 1:01.48elapsed 92%CPU
> > B: 28.23user 28.80system 1:00.21elapsed 94%CPU
> >
> > Could anyone suggest a way to understand why the
> > difference between the 2.6 kernels and the 2.4
> > kernels?
>
> This is very very bad.
>
> It's a uniprocessor machine, yes?
>
> Could you describe the workload a bit more? Is it something which others
> can get their hands on?
>
> It spends a lot of time in the kernel for a build system. I wonder why.
>
> At a guess I'd say either a) you're hitting some path in the kernel which
> is going for a giant and bogus romp through memory, trashing CPU caches or
> b) your workload really dislikes run-child-first-after-fork or c) the page
> allocator is serving up pages which your access pattern dislikes or d)
> something else.
>
> It's certainly interesting.

Ordinary kernbench on 8x OSDL hardware fails to show anything like this:
Summary:
time in seconds for optimal run:
2.6.7-rc2: 129.948
2.6.7-rc2-s6.3: 129.528
2.6.7-rc2-s6.3compute: 127.63

Detailed:
kernel: patch-2.6.7-rc2
plmid: 2987
Host: stp8-002
Average Optimal -j 32 Load Run:
Elapsed Time 129.948
User Time 873.302
System Time 86.318
Percent CPU 738
Context Switches 39853.4
Sleeps 29045.8
Average Maximal -j Load Run:
Elapsed Time 128.362
User Time 866.274
System Time 84.352
Percent CPU 740
Context Switches 25742.6
Sleeps 20615.8
Average Half Load -j 4 Run:
Elapsed Time 231.04
User Time 833.69
System Time 72.94
Percent CPU 392
Context Switches 12672.2
Sleeps 22267.2


kernel: staircase6.2
plmid: 3002
Host: stp8-002
Average Optimal -j 32 Load Run:
Elapsed Time 129.528
User Time 873.866
System Time 85.372
Percent CPU 740.2
Context Switches 38286.6
Sleeps 28331
Average Maximal -j Load Run:
Elapsed Time 128.074
User Time 862.728
System Time 85.842
Percent CPU 740.2
Context Switches 20894
Sleeps 22189.2
Average Half Load -j 4 Run:
Elapsed Time 230.942
User Time 833.03
System Time 71.126
Percent CPU 391
Context Switches 3467
Sleeps 22389.2


kernel: crunchah6.3
plmid: 3008
Host: stp8-002
Average Optimal -j 32 Load Run:
Elapsed Time 127.63
User Time 861.668
System Time 83.078
Percent CPU 739.6
Context Switches 19818.4
Sleeps 28442.6
Average Maximal -j Load Run:
Elapsed Time 127.354
User Time 856.632
System Time 82.952
Percent CPU 737
Context Switches 12389.2
Sleeps 21788.6
Average Half Load -j 4 Run:
Elapsed Time 230.8
User Time 833.378
System Time 71.304
Percent CPU 391.6
Context Switches 3738.4
Sleeps 22288.6


Con

2004-06-08 05:42:05

by Phy Prabab

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

The test case is a build system that links headers (ln
-s) and runs bison and flex on a couple of files.
strace shows no difference between runs on 2.4.23
and on 2.6.7-rc2-bk8-s63.

Unfortunately, I cannot send this out, but I am
trying to get a test case that will demonstrate this.

In the meantime, what can I do to try to understand
this slowdown?

Thank you for your time.
Phy

--- Andrew Morton <[email protected]> wrote:
> Phy Prabab <[email protected]> wrote:
> >
> > Also please note the degredation between
> > 2.6.7-rc2-bk8-s63:
> >
> > A: 35.57user 38.18system 1:20.28elapsed 91%CPU
> > B: 35.54user 38.40system 1:19.48elapsed 93%CPU
> > C: 35.48user 38.28system 1:20.94elapsed 91%CPU
> >
> > Interesting how much more time is spent in both
> user
> > and kernel space between the two kernels. Also
> note
> > that 2.4.x exhibits even greater delta:
> >
> > A: 28.32user 29.51system 1:01.17elapsed 93%CPU
> > B: 28.54user 29.40system 1:01.48elapsed 92%CPU
> > B: 28.23user 28.80system 1:00.21elapsed 94%CPU
> >
> > Could anyone suggest a way to understand why the
> > difference between the 2.6 kernels and the 2.4
> > kernels?
>
> This is very very bad.
>
> It's a uniprocessor machine, yes?
>
> Could you describe the workload a bit more? Is it
> something which others
> can get their hands on?
>
> It spends a lot of time in the kernel for a build
> system. I wonder why.
>
> At a guess I'd say either a) you're hitting some
> path in the kernel which
> is going for a giant and bogus romp through memory,
> trashing CPU caches or
> b) your workload really dislikes
> run-child-first-after-fork or c) the page
> allocator is serving up pages which your access
> pattern dislikes or d)
> something else.
>
> It's certainly interesting.





2004-06-08 05:48:52

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

On Mon, 7 Jun 2004, Phy Prabab wrote:

> The test case is a build system that links headers (ln
> -s) and runs bison and flex on a couple of files.
> strace shows no difference between running on 2.4.23
> compared to running on 2.6.7-rc2bk8s63.
>
> Unfortuneately, I can not send this out, but I am
> trying to get a tet case that will demostrate this.
>
> In the mean time, what can I do to try understand this
> slowdown?

Check whether the regression is also present without the staircase scheduler.
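
To act on that suggestion, one low-effort harness (a sketch only; the workload command below is a placeholder, not Phy's build system) is to run the same workload under each kernel and split out the children's user/system CPU time, mirroring the time(1) figures quoted earlier in the thread:

```python
# Toy A/B timing harness: run a workload and report elapsed, user and
# system time of the children, in the same shape as time(1) output.
import resource
import subprocess
import time

def time_workload(cmd):
    """Run cmd once; return (elapsed, child user, child system) in seconds."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    elapsed = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return (elapsed,
            after.ru_utime - before.ru_utime,
            after.ru_stime - before.ru_stime)

if __name__ == "__main__":
    # Placeholder workload; substitute the real build commands here.
    elapsed, user, sys_t = time_workload(["true"])
    print(f"{user:.2f}user {sys_t:.2f}system {elapsed:.2f}elapsed")
```

Running this for the same workload on a staircase and a vanilla kernel would show directly whether the extra system time follows the patch or the 2.6 base.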

2004-06-08 05:51:49

by Phy Prabab

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

The results I published also included 2.4.23 runs,
which do not have the staircase scheduler.

Also, I have run this on 2.6.1 through 2.6.7-rc<x>;
all of the 2.6 series are slower than the 2.4 series.

Thank you for your time.
Phy

--- Nick Piggin <[email protected]> wrote:
> Andrew Morton wrote:
>
> > At a guess I'd say either a) you're hitting some
> path in the kernel which
> > is going for a giant and bogus romp through
> memory, trashing CPU caches or
> > b) your workload really dislikes
> run-child-first-after-fork or c) the page
> > allocator is serving up pages which your access
> pattern dislikes or d)
> > something else.
> >
>
> e) it's the staircase scheduler patch?





2004-06-29 11:10:37

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

Hi!

> This is an update of the scheduler policy mechanism rewrite using the
> infrastructure of the current O(1) scheduler. Slight changes from the
> original design require a detailed description. The change to the original
> design has enabled all known corner cases to be abolished and cpu
> distribution to be much better maintained. It has proven to be stable in my
> testing and is ready for more widespread public testing now.
>
>
> Aims:
> - Interactive by design rather than have interactivity bolted on.
> - Good scalability.
> - Simple predictable design.
> - Maintain appropriate cpu distribution and fairness.
> - Low scheduling latency for normal policy tasks.
> - Low overhead.
> - Making renicing processes actually matter for CPU distribution (nice 0 gets
> 20 times what nice +20 gets)
> - Resistant to priority inversion

How do you solve priority inversion?

Can you do "true idle threads" now? (I.e. nice +infinity?)
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-06-29 11:15:35

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

Pavel Machek wrote:
| Hi!
|
|
|>This is an update of the scheduler policy mechanism rewrite using the
|>infrastructure of the current O(1) scheduler. Slight changes from the
|>original design require a detailed description. The change to the original
|>design has enabled all known corner cases to be abolished and cpu
|>distribution to be much better maintained. It has proven to be stable in my
|>testing and is ready for more widespread public testing now.
|>
|>
|>Aims:
|> - Interactive by design rather than have interactivity bolted on.
|> - Good scalability.
|> - Simple predictable design.
|> - Maintain appropriate cpu distribution and fairness.
|> - Low scheduling latency for normal policy tasks.
|> - Low overhead.
|> - Making renicing processes actually matter for CPU distribution (nice 0
|>gets 20 times what nice +20 gets)
|> - Resistant to priority inversion
|
|
| How do you solve priority inversion?

I don't solve it. The design is relatively resistant to priority
inversion because a busy task traverses all the priorities below its
own rather than sitting at one priority. It does this more strictly
when interactive is disabled, which I recommend for server and
multiuser setups.
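
The descent Con describes can be shown with a toy model (my sketch with invented priority numbers, not kernel code): a CPU hog drops one stair per interval, so a task stuck at a worse priority is delayed only linearly, never starved:

```python
# Toy model of the staircase descent. Each tick, the best-priority
# runnable task gets one RR interval; the hog then descends one stair,
# while the "low" task sits at a fixed worse static priority.
def first_run_of_low_task(hog_prio=20, low_prio=30):
    schedule = []
    for tick in range(50):
        if hog_prio <= low_prio:      # lower value = better priority
            schedule.append("hog")
            hog_prio += 1             # hog drops one stair per interval
        else:
            schedule.append("low")
    return schedule.index("low")

# The hog descends past the low task after (low - hog + 1) intervals,
# so the low task is delayed but never starved.
print(first_run_of_low_task())  # prints 11
```

This is the sense in which the scheme resists (rather than solves) priority inversion: a lock holder at a worse priority still gets CPU once the busier task has descended past it.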

| Can you do "true idle threads" now? (I.e. nice +infinity?)

Not safely, no. I have a batch (idle) implementation that is relatively
safe but can be abused (it's in my -ck patchset only). At some stage,
if the codebase settles down enough, I'll try to introduce semaphore
tracking for batch processes that will elevate them to normal
scheduling, to prevent DoSing.

Cheers,
Con

2004-06-29 12:05:45

by Felipe Alfaro Solana

[permalink] [raw]
Subject: Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2

On Tue, 2004-06-29 at 13:10 +0200, Pavel Machek wrote:
> Hi!
>
> > This is an update of the scheduler policy mechanism rewrite using the
> > infrastructure of the current O(1) scheduler. Slight changes from the
> > original design require a detailed description. The change to the original
> > design has enabled all known corner cases to be abolished and cpu
> > distribution to be much better maintained. It has proven to be stable in my
> > testing and is ready for more widespread public testing now.
> >
> >
> > Aims:
> > - Interactive by design rather than have interactivity bolted on.
> > - Good scalability.
> > - Simple predictable design.
> > - Maintain appropriate cpu distribution and fairness.
> > - Low scheduling latency for normal policy tasks.
> > - Low overhead.
> > - Making renicing processes actually matter for CPU distribution (nice 0 gets
> > 20 times what nice +20 gets)
> > - Resistant to priority inversion
>
> How do you solve priority inversion?
>
> Can you do "true idle threads" now? (I.e. nice +infinity?)

I think idle threads can be achieved with the batch scheduler policy
present in the -ck patches. It works nicely, and I use it a lot.

PS: At least, if by "idle threads" you mean threads that only run when
there are idle CPU cycles (i.e. when all other processes are sleeping).
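
For context beyond the thread: later mainline kernels grew SCHED_BATCH and SCHED_IDLE policies in the same spirit as the -ck batch class. A sketch of opting a process into the idle class from userspace, using Python's Linux-only os wrappers (an illustration of the concept, not the -ck implementation):

```python
# Demote the calling process to SCHED_IDLE, the mainline policy closest
# to "only run when everything else is sleeping". Linux-only, and
# distinct from the -ck batch patch discussed in the thread.
import os

def go_idle():
    """Switch the calling process to SCHED_IDLE and report the policy."""
    os.sched_setscheduler(0, os.SCHED_IDLE, os.sched_param(0))
    return os.sched_getscheduler(0)

if __name__ == "__main__":
    print(go_idle() == os.SCHED_IDLE)
```

Note that an unprivileged process can lower itself into SCHED_IDLE but may not be able to climb back out, which echoes Con's DoS concern about abusable batch classes.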