LinuxLists.cc - [patch] scheduler cache affinity improvement for 2.4 kernels

[permalink] [raw]

Subject: RE: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, M. Edward Borasky wrote:

> I can think of some circumstances where one would want the *opposite*
> of this patch. Consider a "time-sharing" system running both
> CPU-intensive "batch" tasks and "interactive" tasks. There is going to
> be a tradeoff between efficiency / throughput of the batch tasks and
> the response times of interactive ones. [...]

this mechanizm is already part of the scheduler and is not affected by my
patch. Interactive tasks get their '->counter' value increased gradually
via the recalculate code in the scheduler, which after some time gives
them effective priority above that of CPU-intensive processes.

To see this mechanizm working, just boot into the stock kernel or try a
kernel with the patch applied, start a few CPU-intensive processes, eg.
a couple of subshells doing an infinite loop:

while N=1; do N=1; done &
while N=1; do N=1; done &
while N=1; do N=1; done &
while N=1; do N=1; done &

and see how the interactive shell is still responding instantaneously in
such a mixed workload, despite having the same static priority as the
subshells.

Ingo

2001-11-08 17:08:03

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, Ingo Molnar wrote:

>
> i've attached a patch that fixes a long-time performance problem in the
> Linux scheduler.
>
> it's a fix for a UP and SMP scheduler problem Alan described to me
> recently, the 'CPU intensive process scheduling' problem. The essence of
> the problem: if there are multiple, CPU-intensive processes running,
> intermixed with other scheduling activities such as interactive work or
> network-intensive applications, then the Linux scheduler does a poor job
> of affinizing processes to processor caches. Such scheduler workload is
> common for a large percentage of important application workloads: database
> server workloads, webserver workloads and math-intensive clustered jobs,
> and other applications.
>
> If there are CPU-intensive processes A B and C, and a scheduling-intensive
> X task, then in the stock 2.4 kernels we end up scheduling in the
> following way:
>
> A X A X A ... [timer tick]
> B X B X B ... [timer tick]
> C X C X C ... [timer tick]
>
> ie. we switch between CPU-intensive (and possibly cache-intensive)
> processes every timer tick. The timer tick can be 10 msec or shorter,
> depending on the HZ value.
>
> the intended length of the timeslice of such processes is supposed to be
> dependent on their priority - for typical CPU-intensive processes it's 100
> msecs. But in the above case, the effective timeslice of the
> CPU/cache-intensive process is 10 msec or lower, causing potential cache
> trashing if the working set of A, B and C are larger than the cache size
> of the CPU but the invidivual process' workload fits into cache.
> Repopulating a large processor cache can take many milliseconds (on a 2MB
> on-die cache Xeon CPU it takes more than 10 msecs to repopulate a typical
> cache), so the effect can be significant.
>
> The correct behavior would be:
>
> A X A X A ... [10 timer ticks]
> B X B X B ... [10 timer ticks]
> C X C X C ... [10 timer ticks]
>
> this is in fact what happens if the scheduling acitivity of process 'X'
> does not happen.
>
> solution: i've introduced a new current->timer_ticks field (which is not
> in the scheduler 'hot cacheline', nor does it cause any scheduling
> overhead), which counts the number of timer ticks registered by any
> particular process. If the number of timer ticks reaches the number of
> available timeslices then the timer interrupt marks the process for
> reschedule, clears ->counter and ->timer_ticks. These 'timer ticks' have
> to be correctly administered across fork() and exit(), and some places
> that touch ->counter need to deal with timer_ticks too, but otherwise the
> patch has low impact.
>
> scheduling semantics impact: this causes CPU hogs to be more affine to the
> CPU they were running on, and will 'batch' them more agressively - without
> giving them more CPU time than under the stock scheduler. The change does
> not impact interactive tasks since they grow their ->counter above that of
> CPU hogs anyway. It might cause less 'interactivity' in CPU hogs - but
> this is the intended effect.

Maybe you missed this :

http://www.xmailserver.org/linux-patches/mss.html

where the patch that does this is here :

http://www.xmailserver.org/linux-patches/lnxsched.html#CPUHist

- Davide

2001-11-08 17:29:45

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

Davide,

On Thu, 8 Nov 2001, Davide Libenzi wrote:

> Maybe you missed this :
>
> http://www.xmailserver.org/linux-patches/mss.html
>
> where the patch that does this is here :
>
> http://www.xmailserver.org/linux-patches/lnxsched.html#CPUHist

i'm not sure what the patch is trying to achieve, but this part of
mcsched-2.4.13-0.4.diff looks incorrect:

+ prev->cpu_jtime += (jiffies - prev->sched_jtime) + jiffies;

(this is "2*jiffies - prev->sched_jtime" which doesnt appear to make much
sense - does it?)

and your patch adds a scheduling advantage to processes with more cache
footprint, which is the completely opposite of what we want.

but in any case, changing the goodness() function was not a goal of my
patch, i change the granularity of how processes lose their 'effective
priority'.

Ingo

2001-11-08 17:56:05

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, Ingo Molnar wrote:

>
> Davide,
>
> On Thu, 8 Nov 2001, Davide Libenzi wrote:
>
> > Maybe you missed this :
> >
> > http://www.xmailserver.org/linux-patches/mss.html
> >
> > where the patch that does this is here :
> >
> > http://www.xmailserver.org/linux-patches/lnxsched.html#CPUHist
>
> i'm not sure what the patch is trying to achieve, but this part of
> mcsched-2.4.13-0.4.diff looks incorrect:
>
> + prev->cpu_jtime += (jiffies - prev->sched_jtime) + jiffies;
>
> (this is "2*jiffies - prev->sched_jtime" which doesnt appear to make much
> sense - does it?)

The optimization is not good ( i left it in that way to make it more clear
what that operation is meant ) but the mean of the code is ok.
It sets the time ( in jiffies ) at which the process won't have any more
scheduling advantage.

> and your patch adds a scheduling advantage to processes with more cache
> footprint, which is the completely opposite of what we want.

It is exactly what we want indeed :
<quote>
it's a fix for a UP and SMP scheduler problem Alan described to me
recently, the 'CPU intensive process scheduling' problem. The essence of
the problem: if there are multiple, CPU-intensive processes running,
intermixed with other scheduling activities such as interactive work or
network-intensive applications, then the Linux scheduler does a poor job
of affinizing processes to processor caches. Such scheduler workload is
common for a large percentage of important application workloads: database
server workloads, webserver workloads and math-intensive clustered jobs,
and other applications.
</quote>

and if you take a look at the LatSched sampling it is achived very well.

> but in any case, changing the goodness() function was not a goal of my
> patch, i change the granularity of how processes lose their 'effective
> priority'.

I'll test the patch asap with the LatSched sampler and i'll let you know.

- Davide

2001-11-08 18:43:17

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, Davide Libenzi wrote:

> It sets the time ( in jiffies ) at which the process won't have any
> more scheduling advantage.

(sorry, it indeed makes sense, since sched_jtime is on the order of
jiffies.)

> > and your patch adds a scheduling advantage to processes with more cache
> > footprint, which is the completely opposite of what we want.
>
> It is exactly what we want indeed :

if this is what is done by your patch, then we do not want to do this.
My patch does not give an advantage of CPU-intensive processes over that
of eg. 'vi'. Perhaps i'm misreading your patch, it's full of branches that
does not make the meaning very clear, cpu_jtime and sched_jtime are not
explained. Is sched_jtime the timestamp of the last schedule of this
process? And is cpu_jtime the number of jiffies spent on this CPU? Is
cpu_jtime cleared if we switch to another CPU?

Ingo

2001-11-08 19:05:38

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, Ingo Molnar wrote:

>
> On Thu, 8 Nov 2001, Davide Libenzi wrote:
>
> > It sets the time ( in jiffies ) at which the process won't have any
> > more scheduling advantage.
>
> (sorry, it indeed makes sense, since sched_jtime is on the order of
> jiffies.)
>
> > > and your patch adds a scheduling advantage to processes with more cache
> > > footprint, which is the completely opposite of what we want.
> >
> > It is exactly what we want indeed :
>
> if this is what is done by your patch, then we do not want to do this.
> My patch does not give an advantage of CPU-intensive processes over that
> of eg. 'vi'.

If A & B are CPU hog processes and E is the editor ( low level keventd )
you do want to avoid priority inversion between A and B when E kicks in.
Really IO bound tasks accumulates dynamic priority inside the recalc loop
and this is sufficent to win over this kind of "advantage" given to CPU
hog tasks.
My approach make also more "expensive" the preemption goodness to move
tasks between CPUs.
I'll take a closer look at your patch anyway.

> Perhaps i'm misreading your patch, it's full of branches that

"full of braches" == 2 if + 1 conditional-assign

> does not make the meaning very clear, cpu_jtime and sched_jtime are not
> explained. Is sched_jtime the timestamp of the last schedule of this
> process? And is cpu_jtime the number of jiffies spent on this CPU?

sched_jtime = last schedule time in jiffies
cpu_jtime = wall time after which the task will have 0 dynamic priority increase

> Is cpu_jtime cleared if we switch to another CPU?

It's missing of the published patch.
The one that i'm testing has that + a lower dynamic priority increase :

weight += (p->cpu_jtime - jiffies) >> 1;

I'm just testing results about this.

- Davide

2001-11-08 23:46:44

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, Nov 08, 2001 at 03:30:11PM +0100, Ingo Molnar wrote:
> current->counter += p->counter;
> - if (current->counter >= MAX_COUNTER)
> + current->timer_ticks += p->timer_ticks;
> + if (current->counter >= MAX_COUNTER) {
> current->counter = MAX_COUNTER;
> + if (current->timer_ticks >= current->counter) {
> + current->counter = 0;
> + current->timer_ticks = 0;
> + current->need_resched = 1;
> + }
> + }
> p->pid = 0;
> free_task_struct(p);

Before changing this area I'd prefer if you would apply this very
longstanding important bugfix that is floating in my tree for ages (not
sure if it's related to your patch, I don't understand exactly the
semantics of timer_ticks at the moment).

diff -urN parent-timeslice-ref/include/linux/sched.h parent-timeslice/include/linux/sched.h
--- parent-timeslice-ref/include/linux/sched.h Wed Oct 24 13:18:54 2001
+++ parent-timeslice/include/linux/sched.h Wed Oct 24 13:19:00 2001
@@ -317,6 +317,7 @@
#ifdef CONFIG_NUMA_SCHED
int nid;
#endif
+ int get_child_timeslice;
struct task_struct *next_task, *prev_task;
struct mm_struct *active_mm;
struct rw_sem_recursor mm_recursor;
diff -urN parent-timeslice-ref/kernel/exit.c parent-timeslice/kernel/exit.c
--- parent-timeslice-ref/kernel/exit.c Wed Oct 24 08:04:27 2001
+++ parent-timeslice/kernel/exit.c Wed Oct 24 13:19:35 2001
@@ -61,9 +61,11 @@
* timeslices, because any timeslice recovered here
* was given away by the parent in the first place.)
*/
- current->counter += p->counter;
- if (current->counter >= MAX_COUNTER)
- current->counter = MAX_COUNTER;
+ if (p->get_child_timeslice) {
+ current->counter += p->counter;
+ if (current->counter >= MAX_COUNTER)
+ current->counter = MAX_COUNTER;
+ }
p->pid = 0;
free_task_struct(p);
} else {
@@ -164,6 +166,7 @@
p->exit_signal = SIGCHLD;
p->self_exec_id++;
p->p_opptr = child_reaper;
+ p->get_child_timeslice = 0;
if (p->pdeath_signal) send_sig(p->pdeath_signal, p, 0);
}
}
diff -urN parent-timeslice-ref/kernel/fork.c parent-timeslice/kernel/fork.c
--- parent-timeslice-ref/kernel/fork.c Wed Oct 24 13:18:54 2001
+++ parent-timeslice/kernel/fork.c Wed Oct 24 13:19:00 2001
@@ -682,6 +682,9 @@
if (!current->counter)
current->need_resched = 1;

+ /* Tell the parent if it can get back its timeslice when child exits */
+ p->get_child_timeslice = 1;
+
/*
* Ok, add it to the run-queues and make it
* visible to the rest of the system.
diff -urN parent-timeslice-ref/kernel/sched.c parent-timeslice/kernel/sched.c
--- parent-timeslice-ref/kernel/sched.c Wed Oct 24 13:18:54 2001
+++ parent-timeslice/kernel/sched.c Wed Oct 24 13:19:00 2001
@@ -758,6 +758,7 @@
continue;
#endif
p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
+ p->get_child_timeslice = 0;
}
read_unlock(&tasklist_lock);
spin_lock_irq(&runqueue_lock);

And also the schedule-child-first logic (I use it all the time for ages,
the only bug we managed to trigger was a race in /sbin/init that I fixed
a few months ago and it should be merged in the latest sysvinit
package):

diff -urN parent-timeslice/include/linux/sched.h child-first/include/linux/sched.h
--- parent-timeslice/include/linux/sched.h Thu May 3 18:17:56 2001
+++ child-first/include/linux/sched.h Thu May 3 18:19:44 2001
@@ -301,7 +301,7 @@
* all fields in a single cacheline that are needed for
* the goodness() loop in schedule().
*/
- int counter;
+ volatile int counter;
int nice;
unsigned int policy;
struct mm_struct *mm;
diff -urN parent-timeslice/kernel/fork.c child-first/kernel/fork.c
--- parent-timeslice/kernel/fork.c Thu May 3 18:18:31 2001
+++ child-first/kernel/fork.c Thu May 3 18:20:40 2001
@@ -665,15 +665,18 @@
p->pdeath_signal = 0;

/*
- * "share" dynamic priority between parent and child, thus the
- * total amount of dynamic priorities in the system doesnt change,
- * more scheduling fairness. This is only important in the first
- * timeslice, on the long run the scheduling behaviour is unchanged.
+ * Scheduling the child first is especially useful in avoiding a
+ * lot of copy-on-write faults if the child for a fork() just wants
+ * to do a few simple things and then exec().
*/
- p->counter = (current->counter + 1) >> 1;
- current->counter >>= 1;
- if (!current->counter)
+ {
+ int counter = current->counter;
+ p->counter = (counter + 1) >> 1;
+ current->counter = counter >> 1;
+ p->policy &= ~SCHED_YIELD;
+ current->policy |= SCHED_YIELD;
current->need_resched = 1;
+ }

/* Tell the parent if it can get back its timeslice when child exits */
p->get_child_timeslice = 1;

Last thing, I guess slip should be fixed to use sched_yield instead of
zeroing the timeslice to enforce the reschedule.

Andrea

2001-11-08 23:38:14

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, Nov 08, 2001 at 11:13:30AM -0800, Davide Libenzi wrote:
> On Thu, 8 Nov 2001, Ingo Molnar wrote:
>
> >
> > On Thu, 8 Nov 2001, Davide Libenzi wrote:
> >
> > > It sets the time ( in jiffies ) at which the process won't have any
> > > more scheduling advantage.
> >
> > (sorry, it indeed makes sense, since sched_jtime is on the order of
> > jiffies.)
> >
> > > > and your patch adds a scheduling advantage to processes with more cache
> > > > footprint, which is the completely opposite of what we want.
> > >
> > > It is exactly what we want indeed :
> >
> > if this is what is done by your patch, then we do not want to do this.
> > My patch does not give an advantage of CPU-intensive processes over that
> > of eg. 'vi'.
>
> If A & B are CPU hog processes and E is the editor ( low level keventd )
> you do want to avoid priority inversion between A and B when E kicks in.
> Really IO bound tasks accumulates dynamic priority inside the recalc loop
> and this is sufficent to win over this kind of "advantage" given to CPU
> hog tasks.
> My approach make also more "expensive" the preemption goodness to move
> tasks between CPUs.
> I'll take a closer look at your patch anyway.
>

Conceptually, both patches are compatible.

Whether they are technically is for someone else to say...

Ingo's patch in effect lowers the number of jiffies taken per second in the
scheduler (by making each task use several jiffies).

Davide's patch can take the default scheduler (even Ingo's enhanced
scheduler) and make it per processor, with his extra layer of scheduling
between individual processors.

I think that together, they both win. Davide's patch keeps a task from
switching CPUs very often, and Ingo's patch will make each task on each CPU
use the cache to the best extent for that task.

It remains to be proven whether the coarser scheduling approach (Ingo's) will
actually help when looking at cache properties.... When each task takes a
longer time slice, that allows the other tasks to be flushed out of the
caches during that time. When the next task comes back in, it will have to
re-populate the cache again. And the same for the next and etc...

Mike

2001-11-09 00:23:28

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, Ingo Molnar wrote:

[...]

I finally took a look at the patch and i think that it completely change
dynamic priority assignments.
Dynamic priorities will only 1) increase 2) drop to zero
I'm not sure what could be the effect of this on the scheduler behavior.
The patch, as is, has the advantage over CPU history to not add an extra
"if () {}" inside the goodness() function but, without doing something
like :

weight = p->counter;

+if (p->processor != this_cpu)
+ weight -= p->timer_ticks;

you're going to take poor SMP tasks moving decisions, expecially with the
actual :

static inline void add_to_runqueue(struct task_struct * p)
{
list_add(&p->run_list, &runqueue_head);
nr_running++;
}

that inserts tasks at runqueue head ( it'll be the first that will be
picked up ).
With this extra "if () {}" you're going to have the same fast path cost of
the CpuHistory patch.

- Davide

2001-11-09 00:29:58

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, Mike Fedyk wrote:

> Ingo's patch in effect lowers the number of jiffies taken per second in the
> scheduler (by making each task use several jiffies).
>
> Davide's patch can take the default scheduler (even Ingo's enhanced
> scheduler) and make it per processor, with his extra layer of scheduling
> between individual processors.

Don't mix things :)
We're talking only about the CpuHistory token of the scheduler proposed here:

http://www.xmailserver.org/linux-patches/mss.html

This is a bigger ( and not yet complete ) change on the SMP scheduler
behavior, while it keeps the scheduler that runs on each CPU the same.
I'm currently working on different balancing methods to keep the proposed
scheduler fair well balanced without spinning tasks "too much"(tm).

- Davide

2001-11-09 01:08:03

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

[cc trimed]

On Thu, Nov 08, 2001 at 04:37:46PM -0800, Davide Libenzi wrote:
> On Thu, 8 Nov 2001, Mike Fedyk wrote:
>
> > Ingo's patch in effect lowers the number of jiffies taken per second in the
> > scheduler (by making each task use several jiffies).
> >
> > Davide's patch can take the default scheduler (even Ingo's enhanced
> > scheduler) and make it per processor, with his extra layer of scheduling
> > between individual processors.
>
> Don't mix things :)
> We're talking only about the CpuHistory token of the scheduler proposed here:
>
> http://www.xmailserver.org/linux-patches/mss.html
>
> This is a bigger ( and not yet complete ) change on the SMP scheduler
> behavior, while it keeps the scheduler that runs on each CPU the same.
> I'm currently working on different balancing methods to keep the proposed
> scheduler fair well balanced without spinning tasks "too much"(tm).
>
I've given your patch a try, and so far it looks promising.

Running one niced copy of cpuhog on a 2x366 mhz celeron box did pretty well.
Instead of switching several times in one second, it only switched a few
times per minute.

I was also able to merge it with just about everything else I was testing
(ext3, freeswan, elevator updates, -ac) except for the preempt patch. Well, I
was able to manually merge it, but the cpu afinity broke. (it wouldn't use
the second processor for anything except for interrupt processing...)

I haven't tried any of the other scheduler patches though. MQ, looks
interesting... :)

All in all, I think xsched will have much more impact on performance.
Simply because it tackles the problem of CPU affinity...

Even comparing Ingo's patch to your CPU History patch isn't fair, because
they attack different problems. Yours of CPU affinity, Ingo's of time spent
on individual tasks within a single processor.

Mike

2001-11-09 01:21:16

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, Mike Fedyk wrote:

> [cc trimed]
>
> On Thu, Nov 08, 2001 at 04:37:46PM -0800, Davide Libenzi wrote:
> > On Thu, 8 Nov 2001, Mike Fedyk wrote:
> >
> > > Ingo's patch in effect lowers the number of jiffies taken per second in the
> > > scheduler (by making each task use several jiffies).
> > >
> > > Davide's patch can take the default scheduler (even Ingo's enhanced
> > > scheduler) and make it per processor, with his extra layer of scheduling
> > > between individual processors.
> >
> > Don't mix things :)
> > We're talking only about the CpuHistory token of the scheduler proposed here:
> >
> > http://www.xmailserver.org/linux-patches/mss.html
> >
> > This is a bigger ( and not yet complete ) change on the SMP scheduler
> > behavior, while it keeps the scheduler that runs on each CPU the same.
> > I'm currently working on different balancing methods to keep the proposed
> > scheduler fair well balanced without spinning tasks "too much"(tm).
> >
> I've given your patch a try, and so far it looks promising.
>
> Running one niced copy of cpuhog on a 2x366 mhz celeron box did pretty well.
> Instead of switching several times in one second, it only switched a few
> times per minute.
>
> I was also able to merge it with just about everything else I was testing
> (ext3, freeswan, elevator updates, -ac) except for the preempt patch. Well, I
> was able to manually merge it, but the cpu afinity broke. (it wouldn't use
> the second processor for anything except for interrupt processing...)
>
> I haven't tried any of the other scheduler patches though. MQ, looks
> interesting... :)
>
> All in all, I think xsched will have much more impact on performance.
> Simply because it tackles the problem of CPU affinity...
>
> Even comparing Ingo's patch to your CPU History patch isn't fair, because
> they attack different problems. Yours of CPU affinity, Ingo's of time spent
> on individual tasks within a single processor.

xsched is not complete yet, it's a draft ( working draft :) ) that i'm
using to study a more heavy CPU tasks isolation on SMP systems.
I think that this is the way to go for a more scalable SMP scheduler.
I'm currently sampling the proposed scheduler with LatSched that gives a
very good picture of 1) process migration 2) _real_ scheduler latency
cycles cost.
The MQ scheduler has the same roots of the proposed one but has a longest
fast path due the try to make global scheduling decisions at every
schedule.
I'm in contact ( close contact coz we're both in Beaverton :) ) with IBM
guys to have the two scheduler tested on bigger machines if the proposed
scheduler will give some fruit.

- Davide

2001-11-09 01:35:27

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, Nov 08, 2001 at 05:29:25PM -0800, Davide Libenzi wrote:
> On Thu, 8 Nov 2001, Mike Fedyk wrote:
>
> > [cc trimed]
> >
> > On Thu, Nov 08, 2001 at 04:37:46PM -0800, Davide Libenzi wrote:
> > > On Thu, 8 Nov 2001, Mike Fedyk wrote:
> > >
> > > > Ingo's patch in effect lowers the number of jiffies taken per second in the
> > > > scheduler (by making each task use several jiffies).
> > > >
> > > > Davide's patch can take the default scheduler (even Ingo's enhanced
> > > > scheduler) and make it per processor, with his extra layer of scheduling
> > > > between individual processors.
> > >
> > > Don't mix things :)
> > > We're talking only about the CpuHistory token of the scheduler proposed here:
> > >
> > > http://www.xmailserver.org/linux-patches/mss.html
> > >
> > > This is a bigger ( and not yet complete ) change on the SMP scheduler
> > > behavior, while it keeps the scheduler that runs on each CPU the same.
> > > I'm currently working on different balancing methods to keep the proposed
> > > scheduler fair well balanced without spinning tasks "too much"(tm).
> > >
> > I've given your patch a try, and so far it looks promising.
> >
> > Running one niced copy of cpuhog on a 2x366 mhz celeron box did pretty well.
> > Instead of switching several times in one second, it only switched a few
> > times per minute.
> >
> > I was also able to merge it with just about everything else I was testing
> > (ext3, freeswan, elevator updates, -ac) except for the preempt patch. Well, I
> > was able to manually merge it, but the cpu afinity broke. (it wouldn't use
> > the second processor for anything except for interrupt processing...)
> >
> > I haven't tried any of the other scheduler patches though. MQ, looks
> > interesting... :)
> >
> > All in all, I think xsched will have much more impact on performance.
> > Simply because it tackles the problem of CPU affinity...
> >
> > Even comparing Ingo's patch to your CPU History patch isn't fair, because
> > they attack different problems. Yours of CPU affinity, Ingo's of time spent
> > on individual tasks within a single processor.

Looking at that again, it could've been "Ingo's of intra-CPU time slice
length"... ;)

> xsched is not complete yet, it's a draft ( working draft :) ) that i'm

A pretty good draft, I'd say!

> using to study a more heavy CPU tasks isolation on SMP systems.
> I think that this is the way to go for a more scalable SMP scheduler.
> I'm currently sampling the proposed scheduler with LatSched that gives a
> very good picture of 1) process migration 2) _real_ scheduler latency
> cycles cost.

:)

> The MQ scheduler has the same roots of the proposed one but has a longest
> fast path due the try to make global scheduling decisions at every
> schedule.

Ahh, so that's why it hasn't been adopted...

> I'm in contact ( close contact coz we're both in Beaverton :) ) with IBM
> guys to have the two scheduler tested on bigger machines if the proposed
> scheduler will give some fruit.
>

>From what I've seen, it probably will...

I hope something like this will go into 2.5...

What do other unixes do in this case? Are there any commercial Unixes that
have loose affinity like linux currently does? What about NT?

2001-11-09 02:01:31

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, Mike Fedyk wrote:

> > The MQ scheduler has the same roots of the proposed one but has a longest
> > fast path due the try to make global scheduling decisions at every
> > schedule.
>
> Ahh, so that's why it hasn't been adopted...

Changing the scheduler is not easy ( not to code patches but to make
everyone agree on the need of changing it ) and as i already said, it's
easier to force my cat to have a bath instead of Linus to change the
scheduler :)

> > I'm in contact ( close contact coz we're both in Beaverton :) ) with IBM
> > guys to have the two scheduler tested on bigger machines if the proposed
> > scheduler will give some fruit.
> >
>
> >From what I've seen, it probably will...
>
> I hope something like this will go into 2.5...
>
> What do other unixes do in this case? Are there any commercial Unixes that
> have loose affinity like linux currently does? What about NT?

I can't say about NT.
I've tried a "cvs checkout" from cvs.microsoft.com but the running server
( Nimda-CVS running on port 2401 ) said me that, although I've full
read/write access on the repository, the server is busy scanning port 80s :)

- Davide

2001-11-09 02:08:51

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, Nov 08, 2001 at 06:09:29PM -0800, Davide Libenzi wrote:
> On Thu, 8 Nov 2001, Mike Fedyk wrote:
>
> > > The MQ scheduler has the same roots of the proposed one but has a longest
> > > fast path due the try to make global scheduling decisions at every
> > > schedule.
> >
> > Ahh, so that's why it hasn't been adopted...
>
> Changing the scheduler is not easy ( not to code patches but to make
> everyone agree on the need of changing it ) and as i already said, it's
> easier to force my cat to have a bath instead of Linus to change the
> scheduler :)
>

Hmm, let's see... You go to the trouble to keep to code tight, and cache
optimized, even raid5 is choosing a little slower implementation for better
cache properties, and then you go and kill it all with the scheduler...
Yep, that makes sence. ;)

Mike

2001-11-09 07:30:40

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, 8 Nov 2001, Mike Fedyk wrote:

> It remains to be proven whether the coarser scheduling approach
> (Ingo's) will actually help when looking at cache properties.... [...]

have you seen the numbers/measurements i posted in my original email? 3%
kernel compile speedup on an 'idle' 8-way system, 7% compilation speedup
with HZ=1024 and background networking load on a 1-way system.

Ingo

2001-11-09 08:06:25

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Fri, Nov 09, 2001 at 09:28:09AM +0100, Ingo Molnar wrote:
>
> On Thu, 8 Nov 2001, Mike Fedyk wrote:
>
> > It remains to be proven whether the coarser scheduling approach
> > (Ingo's) will actually help when looking at cache properties.... [...]
>
> have you seen the numbers/measurements i posted in my original email? 3%
> kernel compile speedup on an 'idle' 8-way system, 7% compilation speedup
> with HZ=1024 and background networking load on a 1-way system.
>

Yep, but it probably won't do as well on my 2x366 celeron system that I use
every day...

Haven't tested, but it looks like there wouldn't be much chance of the third
CPU hog to stay in the caches. That's pretty much flushing the entire cache
of the previous task with the longer periods of execution.

Now, if the number of sequential jiffies were modified based on the L1/L2
cache sizes that would be interesting...

Also Ingo, if you're worried about your processes staying in the cache, I'd
work on the processor affinity before working on this... But, then again,
I'm not you, and I don't know how myself... :(

Mike

2001-11-11 21:10:37

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Fri, 9 Nov 2001, Ingo Molnar wrote:

>
> On Thu, 8 Nov 2001, Mike Fedyk wrote:
>
> > It remains to be proven whether the coarser scheduling approach
> > (Ingo's) will actually help when looking at cache properties.... [...]
>
> have you seen the numbers/measurements i posted in my original email? 3%
> kernel compile speedup on an 'idle' 8-way system, 7% compilation speedup
> with HZ=1024 and background networking load on a 1-way system.

Ingo, i'm giving the timer_ticks patch a try in my proposed scheduler coz
i like the idea of skipping the if inside goodness(), and i can do this
safely because inside the proposed scheduler i don't have any cross CPU
goodnesses ( no "if (p->processor != this_cpu) weight -= p->timer_ticks;" ).
I made a change to it anyway, that is adding a water-mark in the decay
behavior ( timer.c ).
When counter is above this watermark ( currently 20 ) the counter decay as
usual while if counter <= watermark, ticks accumulates in timer_ticks.
This solution keeps the same good behavior for CPU bound tasks while it
gives a "human"/current decay to tasks that has got a lot of counter
accumulation inside the recalc loop ( I/O bound ).

- Davide

2001-11-11 22:23:13

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Sun, 11 Nov 2001, Davide Libenzi wrote:

> Ingo, i'm giving the timer_ticks patch a try in my proposed scheduler coz
> i like the idea of skipping the if inside goodness(), and i can do this
> safely because inside the proposed scheduler i don't have any cross CPU
> goodnesses ( no "if (p->processor != this_cpu) weight -= p->timer_ticks;" ).
> I made a change to it anyway, that is adding a water-mark in the decay
> behavior ( timer.c ).
> When counter is above this watermark ( currently 20 ) the counter decay as
> usual while if counter <= watermark, ticks accumulates in timer_ticks.
> This solution keeps the same good behavior for CPU bound tasks while it
> gives a "human"/current decay to tasks that has got a lot of counter
> accumulation inside the recalc loop ( I/O bound ).

The watermark is 10 ( DEF_COUNTER ) not 20.

- Davide

2001-11-14 04:56:59

by Mike Kravetz

[permalink] [raw]

Subject: Re: [patch] scheduler cache affinity improvement for 2.4 kernels

On Thu, Nov 08, 2001 at 03:30:11PM +0100, Ingo Molnar wrote:
>
> i've attached a patch that fixes a long-time performance problem in the
> Linux scheduler.

Just got back from holiday and saw this patch. I like the idea
slowing down task dynamic priority modifications (the counter
field). My only thought/concern would be in the case where a
task with maximum dynamic priority (counter value) decides to
use 'all' of its timeslice. In such a case, the task can not
be preempted by another task (with the same static priority)
until its entire timeslice is expired. In the current scheduler,
I believe the task can be preempted after 1 timer tick. In
practice, this shouldn't be an issue. However, it is something
we may want to think about. One simple solution would be to
update a tasks dynamic priority (counter value) more frequently
it it is above its NICE_TO_TICKS value.

> (it would be nice if those people who suspect scalability problems in
> their workloads, could further test/verify the effects this patch.)

I'll try to run it on my 'CPU intensive' version of the TPC-H
behcnmark.

In addition, I have noted that this patch applies with minor
modification to our MultiQueue scheduler, and should be a win
in this environment also.

--
Mike

2001-11-14 17:59:39