2001-11-08 13:32:45

by Ingo Molnar

Subject: [patch] scheduler cache affinity improvement for 2.4 kernels


i've attached a patch that fixes a long-time performance problem in the
Linux scheduler.

it's a fix for a UP and SMP scheduler problem Alan described to me
recently: the 'CPU-intensive process scheduling' problem. The essence of
the problem: if there are multiple CPU-intensive processes running,
intermixed with other scheduling activity such as interactive work or
network-intensive applications, then the Linux scheduler does a poor job
of affinizing processes to processor caches. Such a scheduling workload
is common to a large class of important applications: database servers,
webservers, math-intensive clustered jobs, and others.

If there are CPU-intensive processes A B and C, and a scheduling-intensive
X task, then in the stock 2.4 kernels we end up scheduling in the
following way:

A X A X A ... [timer tick]
B X B X B ... [timer tick]
C X C X C ... [timer tick]

ie. we switch between CPU-intensive (and possibly cache-intensive)
processes every timer tick. The timer tick can be 10 msec or shorter,
depending on the HZ value.

the intended length of the timeslice of such processes is supposed to be
dependent on their priority - for typical CPU-intensive processes it's 100
msecs. But in the above case, the effective timeslice of the
CPU/cache-intensive process is 10 msec or lower, causing potential cache
thrashing if the combined working set of A, B and C is larger than the
cache size of the CPU but each individual process' working set fits into
cache. Repopulating a large processor cache can take many milliseconds (on
a 2MB on-die cache Xeon CPU it takes more than 10 msecs to repopulate a
typical cache), so the effect can be significant.
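The 10 msec figure is consistent with a back-of-envelope estimate; the
~200 MB/s effective refill bandwidth below is my assumption for hardware
of that era, not a number from this mail:

```c
/* repopulation time = cache size / effective refill bandwidth
 * (assumed bandwidth; only the cache size is from the mail) */
static double refill_ms(double cache_mb, double mem_bw_mb_s)
{
    return cache_mb * 1000.0 / mem_bw_mb_s;
}
/* refill_ms(2.0, 200.0) -> 10.0 msec */
```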

The correct behavior would be:

A X A X A ... [10 timer ticks]
B X B X B ... [10 timer ticks]
C X C X C ... [10 timer ticks]

this is in fact what happens if the scheduling activity of process 'X'
does not happen.

solution: i've introduced a new current->timer_ticks field (which is not
in the scheduler 'hot cacheline', nor does it cause any scheduling
overhead), which counts the number of timer ticks registered by any
particular process. If the number of timer ticks reaches the process'
available timeslice ticks, the timer interrupt marks the process for
reschedule and clears ->counter and ->timer_ticks. These 'timer ticks'
have to be correctly administered across fork() and exit(), and some
places that touch ->counter need to deal with timer_ticks too, but
otherwise the patch has low impact.
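A minimal userspace sketch of the tick accounting described above (the
field names are from this mail and from a fragment of the patch quoted
later in the thread; the struct layout and the helper are hypothetical -
the real patch works on task_struct from the timer interrupt):

```c
/* Hypothetical userspace model; not the actual kernel code. */
struct task {
    int counter;      /* timeslice allotment, in ticks */
    int timer_ticks;  /* ticks consumed since the last recharge */
    int need_resched;
};

/* one timer tick charged to the running task: only a fully consumed
 * allotment forces a reschedule, instead of losing the CPU to a
 * same-priority hog at every tick */
static void timer_tick(struct task *t)
{
    t->timer_ticks++;
    if (t->timer_ticks >= t->counter) {
        t->counter = 0;
        t->timer_ticks = 0;
        t->need_resched = 1;
    }
}

/* ticks a task with the given allotment runs before being marked */
static int ticks_until_resched(int allotment)
{
    struct task t = { allotment, 0, 0 };
    int n = 0;
    while (!t.need_resched) {
        timer_tick(&t);
        n++;
    }
    return n;
}
```

With HZ=100 and a 10-tick allotment this corresponds to keeping the full
100 msec timeslice on one CPU, which is the cache-affinity effect the
patch is after.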

scheduling semantics impact: this causes CPU hogs to be more affine to the
CPU they were running on, and will 'batch' them more aggressively - without
giving them more CPU time than under the stock scheduler. The change does
not impact interactive tasks since they grow their ->counter above that of
CPU hogs anyway. It might cause less 'interactivity' in CPU hogs - but
this is the intended effect.

performance impact: this field is never used in the scheduler hotpath.
It's only used by the low frequency timer interrupt, and by the
fork()/exit() path, which can take an extra variable without any visible
impact. Also some fringe cases that touch ->counter needed updating too:
the OOM code and RR RT tasks.

performance results: The cases i've tested appear to work just fine, and
the change has the cache-affinity effect we are looking for. I've measured
'make -j bzImage' execution times on an 8-way, 700 MHz, 2MB cache Xeon
box. (certainly not a box whose caches are easy to thrash.) Here are 6
successive make -j execution times with and without the patch applied. (To
avoid pagecache layout and other effects, the box is running a modified
but functionally equivalent version of the patch which allows runtime
switching between the old and new scheduler behavior.)

stock scheduler:

real 1m1.817s
real 1m1.871s
real 1m1.993s
real 1m2.015s
real 1m2.049s
real 1m2.077s

with the patch applied:

real 1m0.177s
real 1m0.313s
real 1m0.331s
real 1m0.349s
real 1m0.462s
real 1m0.792s

ie. the stock scheduler is doing it in 62.0 seconds, the new scheduler in
60.3 seconds, a ~3% improvement - not bad, considering that compilation
executes 99% in user-space, and that there was no 'interactive' activity
during the compilation job.

- to further measure the effects of the patch i've changed HZ to 1024 on a
single-CPU, 700 MHz, 2MB cache Xeon box, which improved 'make -j' kernel
compilation times by 4%.

- Compiling just drivers/block/floppy.c (which is a cache-intensive
operation) in parallel, with a constant single-process Apache network load
in the background shows a 7% improvement.

This shows the results we expected: with smaller timeslices, the effect of
cache thrashing shows up more visibly.

(NOTE: i used 'make -j' only to create a well-known workload that has a
high cache footprint. It's not to suggest that 'make -j' makes much sense
on a single-CPU box.)

(it would be nice if those people who suspect scalability problems in
their workloads could further test/verify the effects of this patch.)

the patch is against 2.4.15-pre1 and boots/works just fine on both UP and
SMP systems.

please apply,

Ingo


Attachments:
sched-process-affinity-2.4.15-A3 (2.98 kB)

2001-11-08 15:22:00

by M. Edward Borasky

Subject: RE: [patch] scheduler cache affinity improvement for 2.4 kernels

I can think of some circumstances where one would want the *opposite* of
this patch. Consider a "time-sharing" system running both CPU-intensive
"batch" tasks and "interactive" tasks. There is going to be a tradeoff
between efficiency / throughput of the batch tasks and the response times of
interactive ones. From the point of view of the owner of an interactive
task, the CPU-intensive task should be penalized -- forced to load its cache
over and over and over -- rather than being favored by the OS because it's
more "efficient" to reuse the cache. The owner of the CPU-intensive batch
task, on the other hand, will like this patch -- *he* gets his job done
sooner.

The "traditional" way to deal with this issue is to adjust the time slice.
Larger time slices favor the batch jobs and make the system more efficient.
Smaller time slices make the system less efficient and increase context
switching and its associated overhead, but improve interactive response
time. Typically one starts with the default time slice and makes it smaller
gradually until an acceptable interactive response time is obtained.
--
M. Edward (Ed) Borasky


> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]]On Behalf Of Ingo Molnar
> Sent: Thursday, November 08, 2001 6:30 AM
> To: Linus Torvalds
> Cc: [email protected]; Alan Cox
> Subject: [patch] scheduler cache affinity improvement for 2.4 kernels
>
> [...]

2001-11-08 15:36:30

by Ingo Molnar

Subject: RE: [patch] scheduler cache affinity improvement for 2.4 kernels


On Thu, 8 Nov 2001, M. Edward Borasky wrote:

> I can think of some circumstances where one would want the *opposite*
> of this patch. Consider a "time-sharing" system running both
> CPU-intensive "batch" tasks and "interactive" tasks. There is going to
> be a tradeoff between efficiency / throughput of the batch tasks and
> the response times of interactive ones. [...]

this mechanism is already part of the scheduler and is not affected by my
patch. Interactive tasks get their '->counter' value increased gradually
via the recalculation code in the scheduler, which after some time gives
them an effective priority above that of CPU-intensive processes.

To see this mechanism working, just boot into the stock kernel or try a
kernel with the patch applied, start a few CPU-intensive processes, eg.
a couple of subshells doing an infinite loop:

while N=1; do N=1; done &
while N=1; do N=1; done &
while N=1; do N=1; done &
while N=1; do N=1; done &

and see how the interactive shell is still responding instantaneously in
such a mixed workload, despite having the same static priority as the
subshells.
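The recalculation rule in question (it appears verbatim in a diff later
in the thread) can be simulated in a few lines; the helper names here are
mine, and TICKS stands in for NICE_TO_TICKS of a nice-0 task at HZ=100:

```c
#define TICKS 10   /* roughly NICE_TO_TICKS(0) at HZ=100 */

/* counter after n recalculations for a task that slept through all of
 * them: counter = (counter >> 1) + TICKS converges toward 2*TICKS - 1 */
static int recalc_sleeper(int n)
{
    int c = 0;
    while (n--)
        c = (c >> 1) + TICKS;
    return c;
}

/* a CPU hog has burned its counter to 0 by the time the recalc runs,
 * so it restarts from TICKS every time */
static int recalc_hog(void)
{
    return (0 >> 1) + TICKS;
}
```

After a handful of recalculations the mostly-sleeping shell sits at a
counter of 19 against the hogs' 10, so goodness() picks it immediately
when it wakes.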

Ingo

2001-11-08 17:08:03

by Davide Libenzi

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, Ingo Molnar wrote:

> [...]
> scheduling semantics impact: this causes CPU hogs to be more affine to the
> CPU they were running on, and will 'batch' them more agressively - without
> giving them more CPU time than under the stock scheduler. The change does
> not impact interactive tasks since they grow their ->counter above that of
> CPU hogs anyway. It might cause less 'interactivity' in CPU hogs - but
> this is the intended effect.

Maybe you missed this :

http://www.xmailserver.org/linux-patches/mss.html

where the patch that does this is here :

http://www.xmailserver.org/linux-patches/lnxsched.html#CPUHist




- Davide


2001-11-08 17:29:45

by Ingo Molnar

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels


Davide,

On Thu, 8 Nov 2001, Davide Libenzi wrote:

> Maybe you missed this :
>
> http://www.xmailserver.org/linux-patches/mss.html
>
> where the patch that does this is here :
>
> http://www.xmailserver.org/linux-patches/lnxsched.html#CPUHist

i'm not sure what the patch is trying to achieve, but this part of
mcsched-2.4.13-0.4.diff looks incorrect:

+ prev->cpu_jtime += (jiffies - prev->sched_jtime) + jiffies;

(this is "2*jiffies - prev->sched_jtime", which doesn't appear to make
much sense - does it?)

and your patch adds a scheduling advantage for processes with more cache
footprint, which is the complete opposite of what we want.

but in any case, changing the goodness() function was not a goal of my
patch; i change the granularity of how processes lose their 'effective
priority'.

Ingo

2001-11-08 17:56:05

by Davide Libenzi

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, Ingo Molnar wrote:

>
> Davide,
>
> On Thu, 8 Nov 2001, Davide Libenzi wrote:
>
> > Maybe you missed this :
> >
> > http://www.xmailserver.org/linux-patches/mss.html
> >
> > where the patch that does this is here :
> >
> > http://www.xmailserver.org/linux-patches/lnxsched.html#CPUHist
>
> i'm not sure what the patch is trying to achieve, but this part of
> mcsched-2.4.13-0.4.diff looks incorrect:
>
> + prev->cpu_jtime += (jiffies - prev->sched_jtime) + jiffies;
>
> (this is "2*jiffies - prev->sched_jtime" which doesnt appear to make much
> sense - does it?)

The expression is not optimized ( i left it that way to make it clearer
what the operation is meant to do ) but the meaning of the code is ok.
It sets the time ( in jiffies ) at which the process won't have any more
scheduling advantage.
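Read as a plain assignment ( ignoring the accumulating '+=' in the
published diff ), the expression amounts to the following sketch; the
function name is mine:

```c
/* The task ran from sched_jtime until now; per the expression in the
 * patch, (now - sched_jtime) + now == 2*now - sched_jtime, i.e. the
 * cache advantage expires one run-length past the present.  Sketch of
 * the intent only: the real code accumulates into cpu_jtime. */
static unsigned long cache_advantage_deadline(unsigned long now,
                                              unsigned long sched_jtime)
{
    return (now - sched_jtime) + now;
}
```

So a task that just ran for 10 jiffies keeps a goodness advantage for the
next 10 jiffies, then drops back to neutral.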


> and your patch adds a scheduling advantage to processes with more cache
> footprint, which is the completely opposite of what we want.

It is exactly what we want indeed :
<quote>
it's a fix for a UP and SMP scheduler problem Alan described to me
recently, the 'CPU intensive process scheduling' problem. The essence of
the problem: if there are multiple, CPU-intensive processes running,
intermixed with other scheduling activities such as interactive work or
network-intensive applications, then the Linux scheduler does a poor job
of affinizing processes to processor caches. Such scheduler workload is
common for a large percentage of important application workloads: database
server workloads, webserver workloads and math-intensive clustered jobs,
and other applications.
</quote>

and if you take a look at the LatSched sampling, it is achieved very well.


> but in any case, changing the goodness() function was not a goal of my
> patch, i change the granularity of how processes lose their 'effective
> priority'.

I'll test the patch asap with the LatSched sampler and i'll let you know.




- Davide



2001-11-08 18:43:17

by Ingo Molnar

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels


On Thu, 8 Nov 2001, Davide Libenzi wrote:

> It sets the time ( in jiffies ) at which the process won't have any
> more scheduling advantage.

(sorry, it indeed makes sense, since sched_jtime is on the order of
jiffies.)

> > and your patch adds a scheduling advantage to processes with more cache
> > footprint, which is the completely opposite of what we want.
>
> It is exactly what we want indeed :

if this is what is done by your patch, then we do not want to do this.
My patch does not give an advantage to CPU-intensive processes over that
of eg. 'vi'. Perhaps i'm misreading your patch; it's full of branches
that do not make the meaning very clear, and cpu_jtime and sched_jtime
are not explained. Is sched_jtime the timestamp of the last schedule of
this process? And is cpu_jtime the number of jiffies spent on this CPU?
Is cpu_jtime cleared if we switch to another CPU?

Ingo


2001-11-08 19:05:38

by Davide Libenzi

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, Ingo Molnar wrote:

>
> On Thu, 8 Nov 2001, Davide Libenzi wrote:
>
> > It sets the time ( in jiffies ) at which the process won't have any
> > more scheduling advantage.
>
> (sorry, it indeed makes sense, since sched_jtime is on the order of
> jiffies.)
>
> > > and your patch adds a scheduling advantage to processes with more cache
> > > footprint, which is the completely opposite of what we want.
> >
> > It is exactly what we want indeed :
>
> if this is what is done by your patch, then we do not want to do this.
> My patch does not give an advantage of CPU-intensive processes over that
> of eg. 'vi'.

If A & B are CPU hog processes and E is the editor ( low level keventd ),
you do want to avoid priority inversion between A and B when E kicks in.
Truly IO-bound tasks accumulate dynamic priority inside the recalc loop,
and this is sufficient to win over this kind of "advantage" given to CPU
hog tasks.
My approach also makes the preemption goodness to move tasks between
CPUs more "expensive".
I'll take a closer look at your patch anyway.


> Perhaps i'm misreading your patch, it's full of branches that

"full of branches" == 2 ifs + 1 conditional assign

> does not make the meaning very clear, cpu_jtime and sched_jtime are not
> explained. Is sched_jtime the timestamp of the last schedule of this
> process? And is cpu_jtime the number of jiffies spent on this CPU?

sched_jtime = last schedule time in jiffies
cpu_jtime = wall time after which the task will have 0 dynamic priority increase

> Is cpu_jtime cleared if we switch to another CPU?

It's missing from the published patch.
The one that i'm testing has that, plus a lower dynamic priority increase :

weight += (p->cpu_jtime - jiffies) >> 1;

I'm just testing results about this.




- Davide


2001-11-08 23:46:44

by Andrea Arcangeli

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, Nov 08, 2001 at 03:30:11PM +0100, Ingo Molnar wrote:
> current->counter += p->counter;
> - if (current->counter >= MAX_COUNTER)
> + current->timer_ticks += p->timer_ticks;
> + if (current->counter >= MAX_COUNTER) {
> current->counter = MAX_COUNTER;
> + if (current->timer_ticks >= current->counter) {
> + current->counter = 0;
> + current->timer_ticks = 0;
> + current->need_resched = 1;
> + }
> + }
> p->pid = 0;
> free_task_struct(p);

Before changing this area I'd prefer if you would apply this very
longstanding, important bugfix that has been floating in my tree for ages
(I'm not sure if it's related to your patch; I don't understand exactly
the semantics of timer_ticks at the moment).

diff -urN parent-timeslice-ref/include/linux/sched.h parent-timeslice/include/linux/sched.h
--- parent-timeslice-ref/include/linux/sched.h Wed Oct 24 13:18:54 2001
+++ parent-timeslice/include/linux/sched.h Wed Oct 24 13:19:00 2001
@@ -317,6 +317,7 @@
#ifdef CONFIG_NUMA_SCHED
int nid;
#endif
+ int get_child_timeslice;
struct task_struct *next_task, *prev_task;
struct mm_struct *active_mm;
struct rw_sem_recursor mm_recursor;
diff -urN parent-timeslice-ref/kernel/exit.c parent-timeslice/kernel/exit.c
--- parent-timeslice-ref/kernel/exit.c Wed Oct 24 08:04:27 2001
+++ parent-timeslice/kernel/exit.c Wed Oct 24 13:19:35 2001
@@ -61,9 +61,11 @@
* timeslices, because any timeslice recovered here
* was given away by the parent in the first place.)
*/
- current->counter += p->counter;
- if (current->counter >= MAX_COUNTER)
- current->counter = MAX_COUNTER;
+ if (p->get_child_timeslice) {
+ current->counter += p->counter;
+ if (current->counter >= MAX_COUNTER)
+ current->counter = MAX_COUNTER;
+ }
p->pid = 0;
free_task_struct(p);
} else {
@@ -164,6 +166,7 @@
p->exit_signal = SIGCHLD;
p->self_exec_id++;
p->p_opptr = child_reaper;
+ p->get_child_timeslice = 0;
if (p->pdeath_signal) send_sig(p->pdeath_signal, p, 0);
}
}
diff -urN parent-timeslice-ref/kernel/fork.c parent-timeslice/kernel/fork.c
--- parent-timeslice-ref/kernel/fork.c Wed Oct 24 13:18:54 2001
+++ parent-timeslice/kernel/fork.c Wed Oct 24 13:19:00 2001
@@ -682,6 +682,9 @@
if (!current->counter)
current->need_resched = 1;

+ /* Tell the parent if it can get back its timeslice when child exits */
+ p->get_child_timeslice = 1;
+
/*
* Ok, add it to the run-queues and make it
* visible to the rest of the system.
diff -urN parent-timeslice-ref/kernel/sched.c parent-timeslice/kernel/sched.c
--- parent-timeslice-ref/kernel/sched.c Wed Oct 24 13:18:54 2001
+++ parent-timeslice/kernel/sched.c Wed Oct 24 13:19:00 2001
@@ -758,6 +758,7 @@
continue;
#endif
p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
+ p->get_child_timeslice = 0;
}
read_unlock(&tasklist_lock);
spin_lock_irq(&runqueue_lock);
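A userspace sketch of what the fix above changes ( simplified from the
kernel/fork.c and kernel/exit.c hunks; MAX_COUNTER's value and the helper
names are illustrative ):

```c
#define MAX_COUNTER 20   /* illustrative cap, not the kernel's exact value */

struct task {
    int counter;              /* remaining timeslice, in ticks */
    int get_child_timeslice;  /* child still holds the parent's donation */
};

/* fork: parent donates half its counter to the child */
static void do_fork(struct task *parent, struct task *child)
{
    child->counter = (parent->counter + 1) >> 1;
    parent->counter >>= 1;
    child->get_child_timeslice = 1;
}

/* exit: the parent reclaims the leftover only if it really descends
 * from the donation; the recalculation loop clears the flag, which
 * otherwise lets the parent be credited with ticks it never gave away */
static void do_exit(struct task *child, struct task *parent)
{
    if (child->get_child_timeslice) {
        parent->counter += child->counter;
        if (parent->counter >= MAX_COUNTER)
            parent->counter = MAX_COUNTER;
    }
}

/* parent's counter after fork + immediate child exit, optionally with a
 * recalculation clearing the flag in between */
static int parent_counter_after(int start, int recalc_ran)
{
    struct task p = { start, 0 }, c = { 0, 0 };
    do_fork(&p, &c);
    if (recalc_ran)
        c.get_child_timeslice = 0;
    do_exit(&c, &p);
    return p.counter;
}
```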


And also the schedule-child-first logic (I've used it all the time for
ages; the only bug we managed to trigger was a race in /sbin/init that I
fixed a few months ago, and that fix should be merged in the latest
sysvinit package):

diff -urN parent-timeslice/include/linux/sched.h child-first/include/linux/sched.h
--- parent-timeslice/include/linux/sched.h Thu May 3 18:17:56 2001
+++ child-first/include/linux/sched.h Thu May 3 18:19:44 2001
@@ -301,7 +301,7 @@
* all fields in a single cacheline that are needed for
* the goodness() loop in schedule().
*/
- int counter;
+ volatile int counter;
int nice;
unsigned int policy;
struct mm_struct *mm;
diff -urN parent-timeslice/kernel/fork.c child-first/kernel/fork.c
--- parent-timeslice/kernel/fork.c Thu May 3 18:18:31 2001
+++ child-first/kernel/fork.c Thu May 3 18:20:40 2001
@@ -665,15 +665,18 @@
p->pdeath_signal = 0;

/*
- * "share" dynamic priority between parent and child, thus the
- * total amount of dynamic priorities in the system doesnt change,
- * more scheduling fairness. This is only important in the first
- * timeslice, on the long run the scheduling behaviour is unchanged.
+ * Scheduling the child first is especially useful in avoiding a
+ * lot of copy-on-write faults if the child for a fork() just wants
+ * to do a few simple things and then exec().
*/
- p->counter = (current->counter + 1) >> 1;
- current->counter >>= 1;
- if (!current->counter)
+ {
+ int counter = current->counter;
+ p->counter = (counter + 1) >> 1;
+ current->counter = counter >> 1;
+ p->policy &= ~SCHED_YIELD;
+ current->policy |= SCHED_YIELD;
current->need_resched = 1;
+ }

/* Tell the parent if it can get back its timeslice when child exits */
p->get_child_timeslice = 1;


Last thing, I guess slip should be fixed to use sched_yield instead of
zeroing the timeslice to enforce the reschedule.

Andrea

2001-11-08 23:38:14

by Mike Fedyk

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, Nov 08, 2001 at 11:13:30AM -0800, Davide Libenzi wrote:
> [...]
> If A & B are CPU hog processes and E is the editor ( low level keventd )
> you do want to avoid priority inversion between A and B when E kicks in.
> Really IO bound tasks accumulates dynamic priority inside the recalc loop
> and this is sufficent to win over this kind of "advantage" given to CPU
> hog tasks.
> My approach make also more "expensive" the preemption goodness to move
> tasks between CPUs.
> I'll take a closer look at your patch anyway.
>

Conceptually, both patches are compatible.

Whether they are technically compatible is for someone else to say...

Ingo's patch in effect lowers the number of reschedules per second (by
letting each task run for several jiffies at a stretch).

Davide's patch can take the default scheduler (even Ingo's enhanced
scheduler) and make it per-processor, with his extra layer of scheduling
between individual processors.

I think that together they both win: Davide's patch keeps a task from
switching CPUs very often, and Ingo's patch will make each task on each
CPU use the cache to the best extent for that task.

It remains to be proven whether the coarser scheduling approach (Ingo's)
will actually help when looking at cache properties... When each task
takes a longer time slice, the other tasks can be flushed out of the
caches during that time. When the next task comes back in, it will have
to re-populate the cache again. And the same for the next, and so on...

Mike

2001-11-09 00:23:28

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, Ingo Molnar wrote:

[...]

I finally took a look at the patch and i think that it completely changes
dynamic priority assignment.
Dynamic priorities will now only 1) increase or 2) drop to zero.
I'm not sure what the effect of this on the scheduler behavior could be.
The patch, as is, has the advantage over CPU history of not adding an
extra "if () {}" inside the goodness() function but, without doing
something like :

weight = p->counter;

+if (p->processor != this_cpu)
+ weight -= p->timer_ticks;

you're going to make poor SMP task-moving decisions, especially with the
current :

static inline void add_to_runqueue(struct task_struct * p)
{
list_add(&p->run_list, &runqueue_head);
nr_running++;
}

that inserts tasks at the runqueue head ( so it'll be the first to be
picked up ).
With this extra "if () {}" you're going to have the same fast-path cost
as the CpuHistory patch.



- Davide


2001-11-09 00:29:58

by Davide Libenzi

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, Mike Fedyk wrote:

> Ingo's patch in effect lowers the number of jiffies taken per second in the
> scheduler (by making each task use several jiffies).
>
> Davide's patch can take the default scheduler (even Ingo's enhanced
> scheduler) and make it per processor, with his extra layer of scheduling
> between individual processors.

Don't mix things :)
We're talking only about the CpuHistory piece of the scheduler proposed here:

http://www.xmailserver.org/linux-patches/mss.html

That is a bigger ( and not yet complete ) change to the SMP scheduler
behavior, while it keeps the scheduler that runs on each CPU the same.
I'm currently working on different balancing methods to keep the proposed
scheduler fairly well balanced without bouncing tasks around "too much"(tm).




- Davide


2001-11-09 01:08:03

by Mike Fedyk

[permalink] [raw]
Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

[cc trimmed]

On Thu, Nov 08, 2001 at 04:37:46PM -0800, Davide Libenzi wrote:
> On Thu, 8 Nov 2001, Mike Fedyk wrote:
>
> > Ingo's patch in effect lowers the number of jiffies taken per second in the
> > scheduler (by making each task use several jiffies).
> >
> > Davide's patch can take the default scheduler (even Ingo's enhanced
> > scheduler) and make it per processor, with his extra layer of scheduling
> > between individual processors.
>
> Don't mix things :)
> We're talking only about the CpuHistory token of the scheduler proposed here:
>
> http://www.xmailserver.org/linux-patches/mss.html
>
> This is a bigger ( and not yet complete ) change on the SMP scheduler
> behavior, while it keeps the scheduler that runs on each CPU the same.
> I'm currently working on different balancing methods to keep the proposed
> scheduler fair well balanced without spinning tasks "too much"(tm).
>
I've given your patch a try, and so far it looks promising.

Running one niced copy of cpuhog on a 2x366 MHz Celeron box did pretty well.
Instead of switching several times per second, it only switched a few
times per minute.

I was also able to merge it with just about everything else I was testing
(ext3, freeswan, elevator updates, -ac) except for the preempt patch. Well, I
was able to manually merge it, but the CPU affinity broke. (it wouldn't use
the second processor for anything except for interrupt processing...)

I haven't tried any of the other scheduler patches though. MQ looks
interesting... :)

All in all, I think xsched will have much more impact on performance,
simply because it tackles the problem of CPU affinity...

Even comparing Ingo's patch to your CPU History patch isn't fair, because
they attack different problems: yours CPU affinity, Ingo's the time spent
on individual tasks within a single processor.

Mike

2001-11-09 01:21:16

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, Mike Fedyk wrote:

> [cc trimed]
>
> On Thu, Nov 08, 2001 at 04:37:46PM -0800, Davide Libenzi wrote:
> > On Thu, 8 Nov 2001, Mike Fedyk wrote:
> >
> > > Ingo's patch in effect lowers the number of jiffies taken per second in the
> > > scheduler (by making each task use several jiffies).
> > >
> > > Davide's patch can take the default scheduler (even Ingo's enhanced
> > > scheduler) and make it per processor, with his extra layer of scheduling
> > > between individual processors.
> >
> > Don't mix things :)
> > We're talking only about the CpuHistory token of the scheduler proposed here:
> >
> > http://www.xmailserver.org/linux-patches/mss.html
> >
> > This is a bigger ( and not yet complete ) change on the SMP scheduler
> > behavior, while it keeps the scheduler that runs on each CPU the same.
> > I'm currently working on different balancing methods to keep the proposed
> > scheduler fair well balanced without spinning tasks "too much"(tm).
> >
> I've given your patch a try, and so far it looks promising.
>
> Running one niced copy of cpuhog on a 2x366 mhz celeron box did pretty well.
> Instead of switching several times in one second, it only switched a few
> times per minute.
>
> I was also able to merge it with just about everything else I was testing
> (ext3, freeswan, elevator updates, -ac) except for the preempt patch. Well, I
> was able to manually merge it, but the cpu afinity broke. (it wouldn't use
> the second processor for anything except for interrupt processing...)
>
> I haven't tried any of the other scheduler patches though. MQ, looks
> interesting... :)
>
> All in all, I think xsched will have much more impact on performance.
> Simply because it tackles the problem of CPU affinity...
>
> Even comparing Ingo's patch to your CPU History patch isn't fair, because
> they attack different problems. Yours of CPU affinity, Ingo's of time spent
> on individual tasks within a single processor.

xsched is not complete yet; it's a draft ( working draft :) ) that I'm
using to study heavier CPU task isolation on SMP systems.
I think that this is the way to go for a more scalable SMP scheduler.
I'm currently sampling the proposed scheduler with LatSched, which gives a
very good picture of 1) process migration and 2) the _real_ scheduler
latency cost in cycles.
The MQ scheduler has the same roots as the proposed one but has a longer
fast path, due to trying to make global scheduling decisions at every
schedule.
I'm in contact ( close contact, since we're both in Beaverton :) ) with the
IBM guys to have the two schedulers tested on bigger machines if the
proposed scheduler bears fruit.




- Davide


2001-11-09 01:35:27

by Mike Fedyk

[permalink] [raw]
Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, Nov 08, 2001 at 05:29:25PM -0800, Davide Libenzi wrote:
> On Thu, 8 Nov 2001, Mike Fedyk wrote:
>
> > [cc trimed]
> >
> > On Thu, Nov 08, 2001 at 04:37:46PM -0800, Davide Libenzi wrote:
> > > On Thu, 8 Nov 2001, Mike Fedyk wrote:
> > >
> > > > Ingo's patch in effect lowers the number of jiffies taken per second in the
> > > > scheduler (by making each task use several jiffies).
> > > >
> > > > Davide's patch can take the default scheduler (even Ingo's enhanced
> > > > scheduler) and make it per processor, with his extra layer of scheduling
> > > > between individual processors.
> > >
> > > Don't mix things :)
> > > We're talking only about the CpuHistory token of the scheduler proposed here:
> > >
> > > http://www.xmailserver.org/linux-patches/mss.html
> > >
> > > This is a bigger ( and not yet complete ) change on the SMP scheduler
> > > behavior, while it keeps the scheduler that runs on each CPU the same.
> > > I'm currently working on different balancing methods to keep the proposed
> > > scheduler fair well balanced without spinning tasks "too much"(tm).
> > >
> > I've given your patch a try, and so far it looks promising.
> >
> > Running one niced copy of cpuhog on a 2x366 mhz celeron box did pretty well.
> > Instead of switching several times in one second, it only switched a few
> > times per minute.
> >
> > I was also able to merge it with just about everything else I was testing
> > (ext3, freeswan, elevator updates, -ac) except for the preempt patch. Well, I
> > was able to manually merge it, but the cpu afinity broke. (it wouldn't use
> > the second processor for anything except for interrupt processing...)
> >
> > I haven't tried any of the other scheduler patches though. MQ, looks
> > interesting... :)
> >
> > All in all, I think xsched will have much more impact on performance.
> > Simply because it tackles the problem of CPU affinity...
> >
> > Even comparing Ingo's patch to your CPU History patch isn't fair, because
> > they attack different problems. Yours of CPU affinity, Ingo's of time spent
> > on individual tasks within a single processor.

Looking at that again, it could've been "Ingo's of intra-CPU time slice
length"... ;)

> xsched is not complete yet, it's a draft ( working draft :) ) that i'm

A pretty good draft, I'd say!

> using to study a more heavy CPU tasks isolation on SMP systems.
> I think that this is the way to go for a more scalable SMP scheduler.
> I'm currently sampling the proposed scheduler with LatSched that gives a
> very good picture of 1) process migration 2) _real_ scheduler latency
> cycles cost.

:)

> The MQ scheduler has the same roots of the proposed one but has a longest
> fast path due the try to make global scheduling decisions at every
> schedule.

Ahh, so that's why it hasn't been adopted...

> I'm in contact ( close contact coz we're both in Beaverton :) ) with IBM
> guys to have the two scheduler tested on bigger machines if the proposed
> scheduler will give some fruit.
>

>From what I've seen, it probably will...

I hope something like this will go into 2.5...

What do other unixes do in this case? Are there any commercial Unixes that
have loose affinity like linux currently does? What about NT?

2001-11-09 02:01:31

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, Mike Fedyk wrote:

> > The MQ scheduler has the same roots of the proposed one but has a longest
> > fast path due the try to make global scheduling decisions at every
> > schedule.
>
> Ahh, so that's why it hasn't been adopted...

Changing the scheduler is not easy ( not coding the patches, but getting
everyone to agree on the need to change it ) and, as I already said, it's
easier to force my cat to take a bath than to get Linus to change the
scheduler :)


> > I'm in contact ( close contact coz we're both in Beaverton :) ) with IBM
> > guys to have the two scheduler tested on bigger machines if the proposed
> > scheduler will give some fruit.
> >
>
> >From what I've seen, it probably will...
>
> I hope something like this will go into 2.5...
>
> What do other unixes do in this case? Are there any commercial Unixes that
> have loose affinity like linux currently does? What about NT?

I can't say about NT.
I've tried a "cvs checkout" from cvs.microsoft.com but the running server
( Nimda-CVS, running on port 2401 ) told me that, although I have full
read/write access to the repository, it is busy scanning port 80s :)




- Davide


2001-11-09 02:08:51

by Mike Fedyk

[permalink] [raw]
Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, Nov 08, 2001 at 06:09:29PM -0800, Davide Libenzi wrote:
> On Thu, 8 Nov 2001, Mike Fedyk wrote:
>
> > > The MQ scheduler has the same roots of the proposed one but has a longest
> > > fast path due the try to make global scheduling decisions at every
> > > schedule.
> >
> > Ahh, so that's why it hasn't been adopted...
>
> Changing the scheduler is not easy ( not to code patches but to make
> everyone agree on the need of changing it ) and as i already said, it's
> easier to force my cat to have a bath instead of Linus to change the
> scheduler :)
>

Hmm, let's see... You go to the trouble of keeping the code tight and
cache-optimized, even raid5 chooses a slightly slower implementation for
better cache properties, and then you go and kill it all in the scheduler...
Yep, that makes sense. ;)

Mike

2001-11-09 07:30:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels


On Thu, 8 Nov 2001, Mike Fedyk wrote:

> It remains to be proven whether the coarser scheduling approach
> (Ingo's) will actually help when looking at cache properties.... [...]

have you seen the numbers/measurements i posted in my original email? 3%
kernel compile speedup on an 'idle' 8-way system, 7% compilation speedup
with HZ=1024 and background networking load on a 1-way system.

Ingo

2001-11-09 08:06:25

by Mike Fedyk

[permalink] [raw]
Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Fri, Nov 09, 2001 at 09:28:09AM +0100, Ingo Molnar wrote:
>
> On Thu, 8 Nov 2001, Mike Fedyk wrote:
>
> > It remains to be proven whether the coarser scheduling approach
> > (Ingo's) will actually help when looking at cache properties.... [...]
>
> have you seen the numbers/measurements i posted in my original email? 3%
> kernel compile speedup on an 'idle' 8-way system, 7% compilation speedup
> with HZ=1024 and background networking load on a 1-way system.
>

Yep, but it probably won't do as well on my 2x366 MHz Celeron system that I
use every day...

I haven't tested, but it looks like there wouldn't be much chance of the
third CPU hog staying in the caches. With the longer periods of execution,
each task pretty much flushes the entire cache of the previous one.

Now, if the number of consecutive jiffies were scaled to the L1/L2 cache
sizes, that would be interesting...

Also, Ingo, if you're worried about your processes staying in the cache, I'd
work on processor affinity before working on this... But, then again, I'm
not you, and I wouldn't know how myself... :(

Mike

2001-11-11 21:10:37

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Fri, 9 Nov 2001, Ingo Molnar wrote:

>
> On Thu, 8 Nov 2001, Mike Fedyk wrote:
>
> > It remains to be proven whether the coarser scheduling approach
> > (Ingo's) will actually help when looking at cache properties.... [...]
>
> have you seen the numbers/measurements i posted in my original email? 3%
> kernel compile speedup on an 'idle' 8-way system, 7% compilation speedup
> with HZ=1024 and background networking load on a 1-way system.

Ingo, I'm giving the timer_ticks patch a try in my proposed scheduler
because I like the idea of skipping the if inside goodness(), and I can do
this safely because the proposed scheduler doesn't compute any cross-CPU
goodness ( no "if (p->processor != this_cpu) weight -= p->timer_ticks;" ).
I made one change to it, though: adding a watermark to the decay
behavior ( timer.c ).
When counter is above this watermark ( currently 20 ) the counter decays as
usual, while if counter <= watermark, ticks accumulate in timer_ticks.
This solution keeps the same good behavior for CPU-bound tasks while it
gives the current, "human" decay to tasks that have accumulated a lot of
counter inside the recalculation loop ( I/O-bound ).




- Davide


2001-11-11 22:23:13

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Sun, 11 Nov 2001, Davide Libenzi wrote:

> Ingo, i'm giving the timer_ticks patch a try in my proposed scheduler coz
> i like the idea of skipping the if inside goodness(), and i can do this
> safely because inside the proposed scheduler i don't have any cross CPU
> goodnesses ( no "if (p->processor != this_cpu) weight -= p->timer_ticks;" ).
> I made a change to it anyway, that is adding a water-mark in the decay
> behavior ( timer.c ).
> When counter is above this watermark ( currently 20 ) the counter decay as
> usual while if counter <= watermark, ticks accumulates in timer_ticks.
> This solution keeps the same good behavior for CPU bound tasks while it
> gives a "human"/current decay to tasks that has got a lot of counter
> accumulation inside the recalc loop ( I/O bound ).

The watermark is 10 ( DEF_COUNTER ), not 20.




- Davide


2001-11-14 04:56:59

by Mike Kravetz

[permalink] [raw]
Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, Nov 08, 2001 at 03:30:11PM +0100, Ingo Molnar wrote:
>
> i've attached a patch that fixes a long-time performance problem in the
> Linux scheduler.

Just got back from holiday and saw this patch. I like the idea of
slowing down task dynamic priority modifications (the counter
field). My only thought/concern would be the case where a
task with maximum dynamic priority (counter value) decides to
use 'all' of its timeslice. In such a case, the task cannot
be preempted by another task (with the same static priority)
until its entire timeslice has expired. In the current scheduler,
I believe the task can be preempted after 1 timer tick. In
practice, this shouldn't be an issue. However, it is something
we may want to think about. One simple solution would be to
update a task's dynamic priority (counter value) more frequently
if it is above its NICE_TO_TICKS value.

> (it would be nice if those people who suspect scalability problems in
> their workloads, could further test/verify the effects this patch.)

I'll try to run it on my 'CPU intensive' version of the TPC-H
benchmark.

In addition, I have noted that this patch applies with minor
modification to our MultiQueue scheduler, and should be a win
in this environment also.

--
Mike

2001-11-14 17:59:39

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Tue, 13 Nov 2001, Mike Kravetz wrote:

> On Thu, Nov 08, 2001 at 03:30:11PM +0100, Ingo Molnar wrote:
> >
> > i've attached a patch that fixes a long-time performance problem in the
> > Linux scheduler.
>
> Just got back from holiday and saw this patch. I like the idea
> slowing down task dynamic priority modifications (the counter
> field). My only thought/concern would be in the case where a
> task with maximum dynamic priority (counter value) decides to
> use 'all' of its timeslice. In such a case, the task can not
> be preempted by another task (with the same static priority)
> until its entire timeslice is expired. In the current scheduler,
> I believe the task can be preempted after 1 timer tick. In
> practice, this shouldn't be an issue. However, it is something
> we may want to think about. One simple solution would be to
> update a tasks dynamic priority (counter value) more frequently
> it it is above its NICE_TO_TICKS value.

Mike, take a look at my next post in this thread.
I'm using a watermark value of 10 :


	if (p->counter > TICKS_WMARK)
		--p->counter;
	else if (++p->timer_ticks >= p->counter) {
		p->counter = 0;
		p->timer_ticks = 0;
		p->need_resched = 1;
	}





- Davide


2001-11-19 18:36:07

by Bill Davidsen

[permalink] [raw]
Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

In article <[email protected]> [email protected] wrote:

| Running one niced copy of cpuhog on a 2x366 mhz celeron box did pretty well.
| Instead of switching several times in one second, it only switched a few
| times per minute.
|
| I was also able to merge it with just about everything else I was testing
| (ext3, freeswan, elevator updates, -ac) except for the preempt patch. Well, I
| was able to manually merge it, but the cpu afinity broke. (it wouldn't use
| the second processor for anything except for interrupt processing...)

The problem with processor affinity is that, for some *typical* loads,
the things which make one load better make another worse. If you wait
longer for the "right" processor to become available, you increase the
chances that the cache is mostly filled with what the CPU was doing in
other processes, and affinity has done nothing but delay the scheduling of
the process, since the cache is going to be of little use anyway.

The item of interest for making decisions about affinity would be the
number of cache misses (and, obviously, the cache changes made to satisfy
them). This would allow a better estimate of how much of the cache is still
useful if affinity is preserved. Given the number of processor types on
which Linux runs, this is not something likely to be available on all of
them, and it is related to cache size as well. So the things being done in
the scheduler are not really measuring "how much of the cache is still
useful to process X," which would be the best predictor of the value of
affinity. I apologise to those who find this an old thought; not
everyone on this list has noted it, I believe.

I'm not surprised that you find preempt doesn't work; it runs somewhat
counter to the goal of affinity. Given the choice of better performance
for CPU hogs or more responsive performance (and in some cases much higher
network performance), I will take preempt. But it would be nice to have a
choice at runtime or in /proc/sys so that systems could be tuned to
optimize the characteristics most needed.

--
bill davidsen <[email protected]>
His first management concern is not solving the problem, but covering
his ass. If he lived in the middle ages he'd wear his codpiece backward.