LinuxLists.cc - Context switch times

[permalink] [raw]

Subject: Re: Context switch times

From: [email protected]
Date: Thu, 04 Oct 2001 22:14:13 +0100

> Comments?

2.4.x supports SSE on pentium III/athlons, so the SSE registers need to be
saved/restored on a taskswitch as well.... that's not exactly free.

lat_ctx doesn't execute any FPU ops. So at worst this happens once
on GLIBC program startup, but then never again.

This assumes I understand how lazy i387 restores work in the kernel
:-)

Franks a lot,
David S. Miller
[email protected]

2001-10-04 21:39:34

by Richard Gooch

[permalink] [raw]

Subject: Re: Context switch times

David S. Miller writes:
> From: [email protected]
> Date: Thu, 04 Oct 2001 22:14:13 +0100
>
> > Comments?
>
> 2.4.x supports SSE on pentium III/athlons, so the SSE registers need to be
> saved/restored on a taskswitch as well.... that's not exactly free.
>
> lat_ctx doesn't execute any FPU ops. So at worst this happens once
> on GLIBC program startup, but then never again.

Has something changed? Last I looked, the whole lmbench timing harness
was based on using the FPU. That was the cause of the big argument
Larry and I had some years back: my context switch benchmark didn't
use the FPU, and thus was more sensitive to variations (such as cache
misses due to aliasing).

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-10-04 21:52:55

[permalink] [raw]

Subject: Re: Context switch times

From: Richard Gooch <[email protected]>
Date: Thu, 4 Oct 2001 15:39:05 -0600

David S. Miller writes:
> lat_ctx doesn't execute any FPU ops. So at worst this happens once
> on GLIBC program startup, but then never again.

Has something changed? Last I looked, the whole lmbench timing harness
was based on using the FPU.

Oops, that's entirely possible...

But things are usually layed out like this:

capture_start_time();
context_switch_N_times();
capture_end_time();

So the FPU hit is only before/after the runs, not during each and
every iteration.

Franks a lot,
David S. Miller
[email protected]

2001-10-04 21:55:25

by Benjamin LaHaise

[permalink] [raw]

Subject: Re: Context switch times

On Thu, Oct 04, 2001 at 02:52:39PM -0700, David S. Miller wrote:
> So the FPU hit is only before/after the runs, not during each and
> every iteration.

Right. Plus, the original mail mentioned that it was hitting all 8
CPUs, which is a pretty good example of braindead scheduler behaviour.

-ben

2001-10-04 22:30:51

[permalink] [raw]

Subject: Re: Context switch times

On Thu, 4 Oct 2001, Benjamin LaHaise wrote:

> On Thu, Oct 04, 2001 at 02:52:39PM -0700, David S. Miller wrote:
> > So the FPU hit is only before/after the runs, not during each and
> > every iteration.
>
> Right. Plus, the original mail mentioned that it was hitting all 8
> CPUs, which is a pretty good example of braindead scheduler behaviour.

There was a discussion about process spinning among idle CPUs a couple of
months ago.
Mike, did you code the patch that stick the task to an idle between the
send-IPI and the idle wakeup ?
At that time we simply left the issue unaddressed.

- Davide

2001-10-04 22:43:12

by Linus Torvalds

[permalink] [raw]

Subject: Re: Context switch times

In article <[email protected]>,
Benjamin LaHaise <[email protected]> wrote:
>On Thu, Oct 04, 2001 at 02:52:39PM -0700, David S. Miller wrote:
>> So the FPU hit is only before/after the runs, not during each and
>> every iteration.
>
>Right. Plus, the original mail mentioned that it was hitting all 8
>CPUs, which is a pretty good example of braindead scheduler behaviour.

Careful.

That's not actually true?(the braindead part, that i).

We went through this with Ingo and Larry McVoy, and the sad fact is that
to get the best numbers for lmbench, you simply have to do the wrong
thing.

Could we try to hit just two? Probably, but it doesn't really matter,
though: to make the lmbench scheduler benchmark go at full speed, you
want to limit it to _one_ CPU, which is not sensible in real-life
situations. The amount of concurrency in the context switching
benchmark is pretty small, and does not make up for bouncing the locks
etc between CPU's.

However, that lack of concurrency in lmbench is totally due to the
artificial nature of the benchmark, and the bigger-footprint scheduling
stuff (that isn't reported very much in the summary) are more realistic.

So 2.4.x took the (painful) road of saying that we care less about that
particular benchmark than about some other more realistic loads.

Linus

2001-10-04 22:50:31

by Mike Kravetz

[permalink] [raw]

Subject: Re: Context switch times

On Thu, Oct 04, 2001 at 03:35:35PM -0700, Davide Libenzi wrote:
> > Right. Plus, the original mail mentioned that it was hitting all 8
> > CPUs, which is a pretty good example of braindead scheduler behaviour.
>
> There was a discussion about process spinning among idle CPUs a couple of
> months ago.
> Mike, did you code the patch that stick the task to an idle between the
> send-IPI and the idle wakeup ?
> At that time we simply left the issue unaddressed.

I believe the 'quick and dirty' patches we came up with substantially
increased context switch times for this benchmark (doubled them).
The reason is that you needed to add IPI time to the context switch
time. Therefore, I did not actively pursue getting these accepted. :)
It appears that something in the 2.2 scheduler did a much better
job of handling this situation. I haven't looked at the 2.2 code.
Does anyone know what feature of the 2.2 scheduler was more successful
in keeping tasks on the CPUs on which they previously executed?
Also, why was that code removed from 2.4? I can research, but I
suspect someone here has firsthand knowledge.

--
Mike Kravetz [email protected]
IBM Peace, Love and Linux Technology Center

2001-10-04 22:53:41

by Benjamin LaHaise

[permalink] [raw]

Subject: Re: Context switch times

On Thu, Oct 04, 2001 at 10:42:37PM +0000, Linus Torvalds wrote:
> Could we try to hit just two? Probably, but it doesn't really matter,
> though: to make the lmbench scheduler benchmark go at full speed, you
> want to limit it to _one_ CPU, which is not sensible in real-life
> situations. The amount of concurrency in the context switching
> benchmark is pretty small, and does not make up for bouncing the locks
> etc between CPU's.

I don't quite agree with you that it doesn't matter. A lot of tests
(volanomark, other silly things) show that the current scheduler jumps
processes from CPU to CPU on SMP boxes far too easily, in addition to the
lengthy duration of run queue processing when loaded down. Yes, these
applications are leaving too many runnable processes around, but that's
the way some large app server loads behave. And right now it makes linux
look bad compared to other OSes.

Yes, low latency is good, but jumping around cpus adds more latency in
cache misses across slow busses than is needed when the working set is
already present in the 2MB L2 of your high end server.

-ben

2001-10-04 23:42:22

by Mike Kravetz

[permalink] [raw]

Subject: Re: Context switch times

On Thu, Oct 04, 2001 at 10:42:37PM +0000, Linus Torvalds wrote:
> Could we try to hit just two? Probably, but it doesn't really matter,
> though: to make the lmbench scheduler benchmark go at full speed, you
> want to limit it to _one_ CPU, which is not sensible in real-life
> situations.

Can you clarify? I agree that tuning the system for the best LMbench
performance is not a good thing to do! However, in general on an
8 CPU system with only 2 'active' tasks I would think limiting the
tasks to 2 CPUs would be desirable for cache effects.

I know that running LMbench with 2 active tasks on an 8 CPU system
results in those 2 tasks being 'round-robined' among all 8 CPUs.
Prior analysis leads me to believe the reason for this is due to
IPI latency. reschedule_idle() chooses the 'best/correct' CPU for
a task to run on, but before schedule() runs on that CPU another
CPU runs schedule() and the result is that the task runs on a
?less desirable? CPU. The nature of the LMbench scheduler benchmark
makes this occur frequently. The real question is: how often
does this happen in real-life situations?

--
Mike

2001-10-04 23:50:32

by Linus Torvalds

[permalink] [raw]

Subject: Re: Context switch times

On Thu, 4 Oct 2001, Mike Kravetz wrote:

> On Thu, Oct 04, 2001 at 10:42:37PM +0000, Linus Torvalds wrote:
> > Could we try to hit just two? Probably, but it doesn't really matter,
> > though: to make the lmbench scheduler benchmark go at full speed, you
> > want to limit it to _one_ CPU, which is not sensible in real-life
> > situations.
>
> Can you clarify? I agree that tuning the system for the best LMbench
> performance is not a good thing to do! However, in general on an
> 8 CPU system with only 2 'active' tasks I would think limiting the
> tasks to 2 CPUs would be desirable for cache effects.

Yes, limiting to 2 CPU's probably gets better cache behaviour, and it
might be worth looking into why it doesn't. The CPU affinity _should_
prioritize it down to two, but I haven't thought through your theory about
IPI latency.

However, the reason 2.2.x does so well is that in 2.2.x it will stay on
_once_ CPU if I remember correctly. We basically tuned the scheduler for
lmbench, and not much else.

Linus

2001-10-04 23:52:02

[permalink] [raw]

Subject: Re: Context switch times

On Thu, 4 Oct 2001, Mike Kravetz wrote:

> On Thu, Oct 04, 2001 at 10:42:37PM +0000, Linus Torvalds wrote:
> > Could we try to hit just two? Probably, but it doesn't really matter,
> > though: to make the lmbench scheduler benchmark go at full speed, you
> > want to limit it to _one_ CPU, which is not sensible in real-life
> > situations.
>
> Can you clarify? I agree that tuning the system for the best LMbench
> performance is not a good thing to do! However, in general on an
> 8 CPU system with only 2 'active' tasks I would think limiting the
> tasks to 2 CPUs would be desirable for cache effects.
>
> I know that running LMbench with 2 active tasks on an 8 CPU system
> results in those 2 tasks being 'round-robined' among all 8 CPUs.
> Prior analysis leads me to believe the reason for this is due to
> IPI latency. reschedule_idle() chooses the 'best/correct' CPU for
> a task to run on, but before schedule() runs on that CPU another
> CPU runs schedule() and the result is that the task runs on a
> ?less desirable? CPU. The nature of the LMbench scheduler benchmark
> makes this occur frequently. The real question is: how often
> does this happen in real-life situations?

Well, if you remember the first time this issue was discussed on the
mailing list was due a real life situation not due a bench run.

- Davide

2001-10-05 00:45:38

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: Context switch times

On Thu, Oct 04, 2001 at 04:41:02PM -0700, Mike Kravetz wrote:
> I know that running LMbench with 2 active tasks on an 8 CPU system
> results in those 2 tasks being 'round-robined' among all 8 CPUs.
> Prior analysis leads me to believe the reason for this is due to
> IPI latency. reschedule_idle() chooses the 'best/correct' CPU for
> a task to run on, but before schedule() runs on that CPU another
> CPU runs schedule() and the result is that the task runs on a
> ?less desirable? CPU. The nature of the LMbench scheduler benchmark

doesn't lmbench wakeup only via pipes? Linux uses the sync-wakeup that
avoids reschedule_idle in such case, to serialize the pipe load in the
same cpu.

Andrea

2001-10-05 04:37:26

by Mike Kravetz

[permalink] [raw]

Subject: Re: Context switch times

On Fri, Oct 05, 2001 at 02:45:26AM +0200, Andrea Arcangeli wrote:
> doesn't lmbench wakeup only via pipes? Linux uses the sync-wakeup that
> avoids reschedule_idle in such case, to serialize the pipe load in the
> same cpu.

That's what I thought too. However, kernel profile data of a
lmbench run on 2.4.10 reveals that the pipe routines only call
the non-synchronous form of wake_up. I believe I reached the
same conclusion in the 2.4.7 time frame by instrumenting this
code.

--
Mike

2001-10-05 06:29:03

by Michailidis, Dimitrios

[permalink] [raw]

Subject: Re: Context switch times

> I believe the 'quick and dirty' patches we came up with substantially
> increased context switch times for this benchmark (doubled them).
> The reason is that you needed to add IPI time to the context switch
> time. Therefore, I did not actively pursue getting these accepted. :)
> It appears that something in the 2.2 scheduler did a much better
> job of handling this situation. I haven't looked at the 2.2 code.
> Does anyone know what feature of the 2.2 scheduler was more successful
> in keeping tasks on the CPUs on which they previously executed?
> Also, why was that code removed from 2.4? I can research, but I
> suspect someone here has firsthand knowledge.

The reason 2.2 does better is because under some conditions if a woken up
process's preferred CPU is busy it will refrain from moving it to another
CPU even if there are many idle CPUs, in the hope that the preferred CPU
will become available soon. This can cause situations where processes are
sitting on the run queue while CPUs idle, but works great for lmbench. OTOH
2.4 assigns processes to CPUs as soon as possible. IIRC this change
happened in one of the early 2.3.4x kernels.

---
Dimitris Michailidis [email protected]

2001-10-05 15:08:24

by Alan

[permalink] [raw]

Subject: Re: Context switch times

> I don't quite agree with you that it doesn't matter. A lot of tests
> (volanomark, other silly things) show that the current scheduler jumps
> processes from CPU to CPU on SMP boxes far too easily, in addition to the
> lengthy duration of run queue processing when loaded down. Yes, these
> applications are leaving too many runnable processes around, but that's
> the way some large app server loads behave. And right now it makes linux
> look bad compared to other OSes.

The current scheduler is completely hopeless by the time you hit 3
processors and a bit flaky on two. Its partly the algorithm and partly
implementation

#1 We schedule tasks based on a priority that has no cache
consideration

#2 We have an O(tasks) loop refreshing priority that isnt needed
(Montavista fix covers this)

#3 We have an O(runnable) loop which isnt ideal

#4 On x86 we are horribly cache pessimal. All the task structs are
on the same cache colour. Multiple tasks waiting for the same event
put their variables (like the wait queue) on the same cache line.

The poor cache performance greatly comes from problem 1. Because we prioritize
blindly on CPU usage with a tiny magic constant fudge factor we are
executing totally wrong task orders. Instead we need to schedule for cache
optimal behaviour, and the moderator is the fairness requirement, not the
other way around.

The classic example is two steady cpu loads and an occasionally waking
client (like an editor)

We end up scheduling [A, B are the loads, C is the editor)

A C B C A C B C A C B C A

whenever a key is hit we swap CPU hog because the priority balanced shifted.
Really we should have scheduled something looking more like

A C A C A C B C B C B C B

I would argue there are two needed priorities

1. CPU usage based priority - the current one

2. Task priority.

Task priority being set to maximum when a task sleeps and then bounded by
a function of max(task_priorty, fn(cpupriority)) so that tasks fall down
priority levels as their cpu usage in the set of time slices. This causes
tasks that are CPU hogs to sit at the bottom of the pile with the same low
task_priority meaning they run for long bursts and don't get pre-empted and
switched with other hogs through each cycle of the scheduler.

This isnt idle speculation - I've done some minimal playing with this but
my initial re-implementation didnt handle SMP at all and I am still not 100%
sure how to resolve SMP or how SMP will improve out of the current cunning
plan.

Alan

2001-10-05 15:25:27

by Eric W. Biederman

[permalink] [raw]

Subject: Re: Context switch times

Linus Torvalds <[email protected]> writes:

> On Thu, 4 Oct 2001, Mike Kravetz wrote:
>
> > On Thu, Oct 04, 2001 at 10:42:37PM +0000, Linus Torvalds wrote:
> > > Could we try to hit just two? Probably, but it doesn't really matter,
> > > though: to make the lmbench scheduler benchmark go at full speed, you
> > > want to limit it to _one_ CPU, which is not sensible in real-life
> > > situations.
> >
> > Can you clarify? I agree that tuning the system for the best LMbench
> > performance is not a good thing to do! However, in general on an
> > 8 CPU system with only 2 'active' tasks I would think limiting the
> > tasks to 2 CPUs would be desirable for cache effects.
>
> Yes, limiting to 2 CPU's probably gets better cache behaviour, and it
> might be worth looking into why it doesn't. The CPU affinity _should_
> prioritize it down to two, but I haven't thought through your theory about
> IPI latency.

I don't know what it is but I have seen this excessive cpu switching
in the wild. In particular on a dual processor machine I ran 2 cpu
intensive jobs, and a handful of daemons. And the cpu intensive jobs
would switch cpus every couple of seconds.

I was investigating it because on the Athlon I was running on a
customer was getting a super linear speed up. With one processes it
would take 8 minutes, and with 2 processes one would take 8 minutes
and the other would take 6 minutes. Very strange.

These processes except at their very beginning did no I/O and were
pure cpu hogs until they spit out their results. Very puzzling.
I can't see why we would ever want to take the cache miss penalty of
switching cpus, in this case.

Eric

2001-10-05 17:49:50

[permalink] [raw]

Subject: Re: Context switch times

Alan Cox wrote:
>
> > I don't quite agree with you that it doesn't matter. A lot of tests
> > (volanomark, other silly things) show that the current scheduler jumps
> > processes from CPU to CPU on SMP boxes far too easily, in addition to the
> > lengthy duration of run queue processing when loaded down. Yes, these
> > applications are leaving too many runnable processes around, but that's
> > the way some large app server loads behave. And right now it makes linux
> > look bad compared to other OSes.
>
> The current scheduler is completely hopeless by the time you hit 3
> processors and a bit flaky on two. Its partly the algorithm and partly
> implementation
>
> #1 We schedule tasks based on a priority that has no cache
> consideration
>
> #2 We have an O(tasks) loop refreshing priority that isnt needed
> (Montavista fix covers this)
>
> #3 We have an O(runnable) loop which isnt ideal
>
> #4 On x86 we are horribly cache pessimal. All the task structs are
> on the same cache colour. Multiple tasks waiting for the same event
> put their variables (like the wait queue) on the same cache line.
>
> The poor cache performance greatly comes from problem 1. Because we prioritize
> blindly on CPU usage with a tiny magic constant fudge factor we are
> executing totally wrong task orders. Instead we need to schedule for cache
> optimal behaviour, and the moderator is the fairness requirement, not the
> other way around.
>
> The classic example is two steady cpu loads and an occasionally waking
> client (like an editor)
>
> We end up scheduling [A, B are the loads, C is the editor)
>
> A C B C A C B C A C B C A
>
> whenever a key is hit we swap CPU hog because the priority balanced shifted.
> Really we should have scheduled something looking more like
>
> A C A C A C B C B C B C B
>
> I would argue there are two needed priorities
>
> 1. CPU usage based priority - the current one
>
> 2. Task priority.
>
> Task priority being set to maximum when a task sleeps and then bounded by
> a function of max(task_priorty, fn(cpupriority)) so that tasks fall down
> priority levels as their cpu usage in the set of time slices. This causes
> tasks that are CPU hogs to sit at the bottom of the pile with the same low
> task_priority meaning they run for long bursts and don't get pre-empted and
> switched with other hogs through each cycle of the scheduler.

Let me see if I have this right. Task priority goes to max on any (?)
sleep regardless of how long. And to min if it doesn't sleep for some
period of time. Where does the time slice counter come into this, if at
all?

For what its worth I am currently updating the MontaVista scheduler so,
I am open to ideas.

George
>
> This isnt idle speculation - I've done some minimal playing with this but
> my initial re-implementation didnt handle SMP at all and I am still not 100%
> sure how to resolve SMP or how SMP will improve out of the current cunning
> plan.
>
> Alan
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2001-10-05 22:24:12

by Alan

[permalink] [raw]

Subject: Re: Context switch times

> Let me see if I have this right. Task priority goes to max on any (?)
> sleep regardless of how long. And to min if it doesn't sleep for some
> period of time. Where does the time slice counter come into this, if at
> all?
>
> For what its worth I am currently updating the MontaVista scheduler so,
> I am open to ideas.

The time slice counter is the limit on the amount of time you can execute,
the priority determines who runs first.

So if you used your cpu quota you will get run reluctantly. If you slept
you will get run early and as you use time slice count you will drop
priority bands, but without pre-emption until you cross a band and there
is another task with higher priority.

This damps down task thrashing a bit, and for the cpu hogs it gets the
desired behaviour - which is that the all run their full quantum in the
background one after another instead of thrashing back and forth

2001-10-05 22:51:26

[permalink] [raw]

Subject: Re: Context switch times

On Fri, 5 Oct 2001, Alan Cox wrote:

> > Let me see if I have this right. Task priority goes to max on any (?)
> > sleep regardless of how long. And to min if it doesn't sleep for some
> > period of time. Where does the time slice counter come into this, if at
> > all?
> >
> > For what its worth I am currently updating the MontaVista scheduler so,
> > I am open to ideas.
>
> The time slice counter is the limit on the amount of time you can execute,
> the priority determines who runs first.
>
> So if you used your cpu quota you will get run reluctantly. If you slept
> you will get run early and as you use time slice count you will drop
> priority bands, but without pre-emption until you cross a band and there
> is another task with higher priority.
>
> This damps down task thrashing a bit, and for the cpu hogs it gets the
> desired behaviour - which is that the all run their full quantum in the
> background one after another instead of thrashing back and forth

What if we give to prev a priority boost P=F(T) where T is the time
prev is ran before the current schedule ?

- Davide

2001-10-05 22:59:16

by Alan

[permalink] [raw]

Subject: Re: Context switch times

> > This damps down task thrashing a bit, and for the cpu hogs it gets the
> > desired behaviour - which is that the all run their full quantum in the
> > background one after another instead of thrashing back and forth
>
> What if we give to prev a priority boost P=F(T) where T is the time
> prev is ran before the current schedule ?

That would be the wrong key. You can argue certainly that it is maybe
appropriate to use some function based on remaining scheduler ticks, but
that already occurs as the scheduler ticks is the upper bound for priority
band

2001-10-05 23:10:56

[permalink] [raw]

Subject: Re: Context switch times

On Sat, 6 Oct 2001, Alan Cox wrote:

> > > This damps down task thrashing a bit, and for the cpu hogs it gets the
> > > desired behaviour - which is that the all run their full quantum in the
> > > background one after another instead of thrashing back and forth
> >
> > What if we give to prev a priority boost P=F(T) where T is the time
> > prev is ran before the current schedule ?
>
> That would be the wrong key. You can argue certainly that it is maybe
> appropriate to use some function based on remaining scheduler ticks, but
> that already occurs as the scheduler ticks is the upper bound for priority
> band

No, i mean T = (Tstart - Tend) where :

Tstart = time the current ( prev ) task has been scheduled
Tend = current time ( in schedule() )

Basically it's the total time the current ( prev ) task has had the CPU

- Davide

2001-10-05 23:12:56

by Alan

[permalink] [raw]

Subject: Re: Context switch times

> No, i mean T = (Tstart - Tend) where :
>
> Tstart = time the current ( prev ) task has been scheduled
> Tend = current time ( in schedule() )
>
> Basically it's the total time the current ( prev ) task has had the CPU

Ok let me ask one question - why ?

2001-10-05 23:16:36

[permalink] [raw]

Subject: Re: Context switch times

On Sat, 6 Oct 2001, Alan Cox wrote:

> > No, i mean T = (Tstart - Tend) where :
> >
> > Tstart = time the current ( prev ) task has been scheduled
> > Tend = current time ( in schedule() )
> >
> > Basically it's the total time the current ( prev ) task has had the CPU
>
> Ok let me ask one question - why ?

Because the more the task is ran the higher the cache footprint is and the
higher the cache footprint is the more we'd like to pick up the exiting
task to rerun.

- Davide

2001-10-05 23:49:00

by Roger Larsson

[permalink] [raw]

Subject: Re: Context switch times

On Saturday 06 October 2001 01:04, Alan Cox wrote:
> > > This damps down task thrashing a bit, and for the cpu hogs it gets the
> > > desired behaviour - which is that the all run their full quantum in the
> > > background one after another instead of thrashing back and forth
> >
> > What if we give to prev a priority boost P=F(T) where T is the time
> > prev is ran before the current schedule ?
>
> That would be the wrong key. You can argue certainly that it is maybe
> appropriate to use some function based on remaining scheduler ticks, but
> that already occurs as the scheduler ticks is the upper bound for priority
> band

How about a LIFO list for each processor where preempted (count != 0) tasks
go?

When a preemption occurs the current goes to the LIFO.
When a process has run whole of its time slot - it can be moved to an usedup
queue. (No point in keeping it on the generic run queue, but remember
it when giving out new ticks, could also be on a per CPU basis).
Now select what to do next..
The first on the LIFO can be moved in as "current" before checking the rest
of the run queue...

Pros:
+ The LIFO will be sorted with highest prio on top - no need to search 'em all
+ a process that starts a time slot on one CPU stays the whole slot.

Cons:
- A process might get stuck waiting for the processor in the FIFO while the
other CPU is idle (but that was the point, wasn't it...)

/RogerL

--
Roger Larsson
Skellefte?
Sweden

2001-10-06 02:24:06

by Albert D. Cahalan

[permalink] [raw]

Subject: Re: Context switch times

Alan Cox writes:
> [somebody]

>> I don't quite agree with you that it doesn't matter. A lot of tests
>> (volanomark, other silly things) show that the current scheduler jumps
>> processes from CPU to CPU on SMP boxes far too easily

Be careful that your viewer doesn't disturb the system.
Please don't even consider using "top" for this.

> #4 On x86 we are horribly cache pessimal. All the task structs are
> on the same cache colour. Multiple tasks waiting for the same event
> put their variables (like the wait queue) on the same cache line.

If cache problems are bad enough, maybe this pays for itself:

/* old */
current = stack_ptr & ~0x1fff;

/* new */
hash = (stack_ptr>>8)^(stack_ptr>>12)^(stack_ptr>>16)^(stack_ptr>>20);
current = (stack_ptr & ~0x1fff) | (hash & 0x1e0);

> The classic example is two steady cpu loads and an occasionally waking
> client (like an editor)

Like "top"!

2001-10-06 02:52:45

[permalink] [raw]

Subject: Re: Context switch times

On Fri, 5 Oct 2001, Albert D. Cahalan wrote:

> If cache problems are bad enough, maybe this pays for itself:
>
> /* old */
> current = stack_ptr & ~0x1fff;
>
> /* new */
> hash = (stack_ptr>>8)^(stack_ptr>>12)^(stack_ptr>>16)^(stack_ptr>>20);
> current = (stack_ptr & ~0x1fff) | (hash & 0x1e0);

I suggested something like this ( randomly "moved" struct task_struct * )
a couple of years ago.
Never tried to measure the impact of the system though.

- Davide

2001-10-07 01:21:19

[permalink] [raw]

Subject: Re: Context switch times

Alan Cox wrote:
>
> > Let me see if I have this right. Task priority goes to max on any (?)
> > sleep regardless of how long. And to min if it doesn't sleep for some
> > period of time. Where does the time slice counter come into this, if at
> > all?
> >
> > For what its worth I am currently updating the MontaVista scheduler so,
> > I am open to ideas.
>
> The time slice counter is the limit on the amount of time you can execute,
> the priority determines who runs first.
>
> So if you used your cpu quota you will get run reluctantly. If you slept
> you will get run early and as you use time slice count you will drop
> priority bands, but without pre-emption until you cross a band and there
> is another task with higher priority.
>
> This damps down task thrashing a bit, and for the cpu hogs it gets the
> desired behaviour - which is that the all run their full quantum in the
> background one after another instead of thrashing back and forth

If I understand this, you are decreasing the priority of a task that is
running as it consumes its slice. In the current way of doing things
this happens and is noticed when the scheduler gets called. A task can
loose its place by being preempted by some quick I/O bound task. I.e. A
and B are cpu hogs with A having the cpu, if task Z comes along and runs
for 100 micro seconds, A finds itself replace by B, where as if Z does
not come along, A will complete its slice. It seems to me that A should
be first in line after Z and not bumped just because it has used some of
its slice. This would argue for priority being set at slice renewal
time and left alone until the task blocks or completes its slice.

George

2001-10-07 01:38:11

by Bill Davidsen

[permalink] [raw]

Subject: Re: Context switch times

On Sat, 6 Oct 2001, george anzinger wrote:

> Alan Cox wrote:

> > So if you used your cpu quota you will get run reluctantly. If you slept
> > you will get run early and as you use time slice count you will drop
> > priority bands, but without pre-emption until you cross a band and there
> > is another task with higher priority.
> >
> > This damps down task thrashing a bit, and for the cpu hogs it gets the
> > desired behaviour - which is that the all run their full quantum in the
> > background one after another instead of thrashing back and forth
>
> If I understand this, you are decreasing the priority of a task that is
> running as it consumes its slice. In the current way of doing things
> this happens and is noticed when the scheduler gets called. A task can
> loose its place by being preempted by some quick I/O bound task. I.e. A
> and B are cpu hogs with A having the cpu, if task Z comes along and runs
> for 100 micro seconds, A finds itself replace by B, where as if Z does
> not come along, A will complete its slice. It seems to me that A should
> be first in line after Z and not bumped just because it has used some of
> its slice. This would argue for priority being set at slice renewal
> time and left alone until the task blocks or completes its slice.

It certainly argues for continuing to run the process with the unexpired
timeslice, becausle it is most likely to have current data in caches of
every type.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2001-10-07 09:51:05

by Alan

[permalink] [raw]

Subject: Re: Context switch times

> its slice. This would argue for priority being set at slice renewal
> time and left alone until the task blocks or completes its slice.

That may well be the case. The banding I played with was effectively
doubling the task switch rate for more interactive tasks versus the bottom
feeders

2001-10-07 09:58:08

[permalink] [raw]

Subject: Re: Context switch times

Alan Cox wrote:
> This isnt idle speculation - I've done some minimal playing with this but
> my initial re-implementation didnt handle SMP at all and I am still not 100%
> sure how to resolve SMP or how SMP will improve out of the current cunning
> plan.

Here's some idle speculation on SMP to top it off. :) I tend to think
that the load balancing between CPUs should be a completely separate
algorithim and should not necessarily be run at every schedule(). The
idea is to compeletely decouple the problem of scheduling a single CPU
between tasks and the problem of load balancing between the CPUs, making
each problem simpler to solve.

Consider the following basic rules:

A) When a new task comes along, pick the "least loaded" CPU and lock the
new task onto that.
B) Whenever the load imbalance between least loaded CPU and most loaded
CPU becomes too great, move one or more tasks from most loaded CPU to
the least loaded CPU.

The rules themselves should be self-explanatory: A provides initial load
balancing, while B tries to keep the balance (with a sensible hysteresis
to avoid thrashing). However, there are a few minor details to solve:

1) How to determine the load of a CPU? If we can quantify this clearly,
we can easily set a hysteresis level to trigger load balancing between
two CPUs.
2) When and how often to check for load imbalance?
3) How to select the task(s) that should be moved between two CPUs to
correct an imbalance?

For problems 1 and 2 I propose the following solution: Insert the the
load balancing routine itself as a (fake) task on each CPU and run it
when the CPU gets around to it. The load balancer should behave almost
like a CPU-bound task, scheduled on the lowest priority level with other
runnable tasks. The last bit is important: the load balancer should not
be allowed to starve but should be invoked approximately once every
"full rotation" of the scheduler.

With the above it is easy to estimate the load of a CPU. We can simply
use the elapsed time between two invokations of the load balancer task.
When the load balancer task of a particular CPU gets run, it chalks up
the elapsed time on a score board somewhere, and checks whether there is
a significant imbalance between itself and some other CPU. If there is,
it commences to move some tasks between itself and the other CPU (note
rule B, though, it should be enough to mess with just two CPU queues at
a time to minimize balancing and locking overhead).

Problem 3 is tricky. Basically, there should be a cost/benefit function
F(tasks to move) that should be minimized. Ideally F(task_i), the
cost/benefit of moving a single task, would be calculated as a byproduct
of the CPU scheduler algorithm.

F(task_i) might be function of elapsed time since task_i was last
scheduled and the average time slice used by task_i, to account for the
probable cache hit. This would leave it up to the load balancer to move
as many lowest cost tasks to a new CPU as is needed to correct the
imbalance (average time slices used by each task would be needed in
order to make this decision).

Naturally, some additional rules might be necessary to make a task
eligible for moving, e.g., never move the only/last CPU bound task to
another CPU. In addition, it might actually make sense to move at most
one task at each invocation of the load balancer, to further reduce the
probability of thrashing. The load would still converge fairly quickly
towards a balanced state. It would also scale fairly well with the
number of CPUs.

How does that sound?

MikaL

2001-10-07 13:04:02

by Ingo Oeser

[permalink] [raw]

Subject: Re: Context switch times

On Sun, Oct 07, 2001 at 12:57:29PM +0300, Mika Liljeberg wrote:
> For problems 1 and 2 I propose the following solution: Insert the the
> load balancing routine itself as a (fake) task on each CPU and run it
> when the CPU gets around to it. The load balancer should behave almost
> like a CPU-bound task, scheduled on the lowest priority level with other
> runnable tasks.

The idle-task might be (ab-)used for this, because it has perfect
data for this.

T_SystemElapsed - T_IdleTaskRun = T_CPULoaded

Balancing could be done in schedule() itself, after checking this
value for each CPU.

> The last bit is important: the load balancer should not
> be allowed to starve but should be invoked approximately once every
> "full rotation" of the scheduler.

If a artificial CPU-hog is used for this task, the idle task will
never be run and power savings in the CPU are impossible.

Other parts sound interesting.

Regards

Ingo Oeser
--
.... Our continuing mission: To seek out knowledge of C, to explore
strange UNIX commands, and to boldly code where no one has man page 4.
--- Mike A. Harris <[email protected]>

2001-10-07 13:49:03

[permalink] [raw]

Subject: Re: Context switch times

Ingo Oeser wrote:
>
> On Sun, Oct 07, 2001 at 12:57:29PM +0300, Mika Liljeberg wrote:
> > For problems 1 and 2 I propose the following solution: Insert the the
> > load balancing routine itself as a (fake) task on each CPU and run it
> > when the CPU gets around to it. The load balancer should behave almost
> > like a CPU-bound task, scheduled on the lowest priority level with other
> > runnable tasks.
>
> The idle-task might be (ab-)used for this, because it has perfect
> data for this.
>
> T_SystemElapsed - T_IdleTaskRun = T_CPULoaded

The problem here is that the idle task doesn't get run at all if there
are runnable tasks. With a little bit of tweaking, the idle task context
could probably be used for this, I guess. However, I was thinking more
along the lines that the load balancer would not be a real task but just
a routine that would be executed in schedule() when the fake task comes
up (i.e. when the sceduler has done a "full rotation").

> Balancing could be done in schedule() itself, after checking this
> value for each CPU.

The point about separating the load balancer from the scheduler is that
you don't need to run it at every task switch. You can make the
scheduler very simple if you don't have to scan queues looking for
likely switchover candidates. This saves time in context switches.

One interesting property of the load balancer tasks would be that the
less heavily loaded CPUs would tend to execute the load balancer more
often, actively releaving the more heavily loaded CPUs, while those
would concentrate more on getting the job done. Come to think of it, it
could be coded in such a way that only the least loaded CPU would
execute the load balancing algorithm, while the others would simply
chalk up elapsed times.

> > The last bit is important: the load balancer should not
> > be allowed to starve but should be invoked approximately once every
> > "full rotation" of the scheduler.
>
> If a artificial CPU-hog is used for this task, the idle task will
> never be run and power savings in the CPU are impossible.

Right. Obviously, if a CPU has no runnable tasks and the load balancer
can't acquire any from other CPUs, it should yield to the idle task
rather than hog the CPU.

MikaL

2001-10-07 14:24:30

by Ingo Oeser

[permalink] [raw]

Subject: Re: Context switch times

On Sun, Oct 07, 2001 at 04:48:30PM +0300, Mika Liljeberg wrote:
> Ingo Oeser wrote:
> > The idle-task might be (ab-)used for this, because it has perfect
> > data for this.
> >
> > T_SystemElapsed - T_IdleTaskRun = T_CPULoaded
>
> The problem here is that the idle task doesn't get run at all if there
> are runnable tasks.

There are idle tasks per CPU. If there are runnable tasks and the
idle-task of a CPU is running it, it is not fully loaded at this
time.

No idle task is running, if all CPUs are fully loaded AFAIR.

> One interesting property of the load balancer tasks would be that the
> less heavily loaded CPUs would tend to execute the load balancer more
> often, actively releaving the more heavily loaded CPUs, while those
> would concentrate more on getting the job done. Come to think of it, it
> could be coded in such a way that only the least loaded CPU would
> execute the load balancing algorithm, while the others would simply
> chalk up elapsed times.

So you suggest only one load balancer task jumping from CPU to
CPU. I misunderstood it. _Only_ chalking up could certainly done by
the idle tasks.

Regards

Ingo Oeser

2001-10-07 14:33:51

[permalink] [raw]

Subject: Re: Context switch times

Ingo Oeser wrote:
> There are idle tasks per CPU. If there are runnable tasks and the
> idle-task of a CPU is running it, it is not fully loaded at this
> time.
>
> No idle task is running, if all CPUs are fully loaded AFAIR.

Yes. However, you still want to balance the queues even if all CPUs are
100% utilized. It's a fairness issue. Otherwise you could have 1 task
running on one CPU and 49 tasks on another.

MikaL

2001-10-07 18:00:11

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: Context switch times

On Thu, Oct 04, 2001 at 09:35:07PM -0700, Mike Kravetz wrote:
> [..] the pipe routines only call
> the non-synchronous form of wake_up. [..]

Ok I see, I overlooked that when we don't need to block we wakeup-async.

So first of all it would be interesting to change the token passed
thorugh the pipe so that it always fills in the PAGE_SIZE pipe buffer so
that the task goes to sleep before reading from the next pipe in the
ring.

And if we want to optimize the current benchmark (with the small token that
triggers the async-wakeup and it always goes to sleep in read() rather
than write()) we'd need to invalidate a basic point of the scheduler
that have nothing to do with IPI reschedule, the point to invalidate is
that if the current task (the one that is running wake_up()) has a small
avg_slice we'd better not reschedule the wakenup task on a new idle cpu
but we'd better wait the current task to go away instead. Unless I'm
missing something this should fix lmbench in its current implementation.

Andrea

2001-10-07 18:02:01

[permalink] [raw]

Subject: Re: Context switch times

Mika Liljeberg wrote:
>
> Ingo Oeser wrote:
> > There are idle tasks per CPU. If there are runnable tasks and the
> > idle-task of a CPU is running it, it is not fully loaded at this
> > time.
> >
> > No idle task is running, if all CPUs are fully loaded AFAIR.
>
> Yes. However, you still want to balance the queues even if all CPUs are
> 100% utilized. It's a fairness issue. Otherwise you could have 1 task
> running on one CPU and 49 tasks on another.
>
On the face of it this only seems unfair. I think what we want to do is
to allocate the cpu resources among competing tasks as dictated by their
NICE values. If each task gets its allotted share it should not matter
if one of them is monopolizing one cpu. This is not to say that things
will work out this way, but to say that this is not the measure, nor the
thing to look at.

I suggest that, using the "recalculate tick" as a measure of time (I
know it is not very linear) when a cpu finds itself with nothing to do
(because all its tasks have completed their slices or blocked) and other
cpus have tasks in their queues it is time to "shop" for a new task to
run. This would then do load balancing just before each "recalculate
tick".

Of course, the above assumes that each cpu has its own run queue.

George

2001-10-07 18:34:51

[permalink] [raw]

Subject: Re: Context switch times

On Sun, 7 Oct 2001, Mika Liljeberg wrote:

> Alan Cox wrote:
> > This isnt idle speculation - I've done some minimal playing with this but
> > my initial re-implementation didnt handle SMP at all and I am still not 100%
> > sure how to resolve SMP or how SMP will improve out of the current cunning
> > plan.
>
> Here's some idle speculation on SMP to top it off. :) I tend to think
> that the load balancing between CPUs should be a completely separate
> algorithim and should not necessarily be run at every schedule(). The
> idea is to compeletely decouple the problem of scheduling a single CPU
> between tasks and the problem of load balancing between the CPUs, making
> each problem simpler to solve.
>
> Consider the following basic rules:
>
> A) When a new task comes along, pick the "least loaded" CPU and lock the
> new task onto that.
> B) Whenever the load imbalance between least loaded CPU and most loaded
> CPU becomes too great, move one or more tasks from most loaded CPU to
> the least loaded CPU.
>
> The rules themselves should be self-explanatory: A provides initial load
> balancing, while B tries to keep the balance (with a sensible hysteresis
> to avoid thrashing). However, there are a few minor details to solve:
>
> 1) How to determine the load of a CPU? If we can quantify this clearly,
> we can easily set a hysteresis level to trigger load balancing between
> two CPUs.
> 2) When and how often to check for load imbalance?
> 3) How to select the task(s) that should be moved between two CPUs to
> correct an imbalance?
>
> For problems 1 and 2 I propose the following solution: Insert the the
> load balancing routine itself as a (fake) task on each CPU and run it
> when the CPU gets around to it. The load balancer should behave almost
> like a CPU-bound task, scheduled on the lowest priority level with other
> runnable tasks. The last bit is important: the load balancer should not
> be allowed to starve but should be invoked approximately once every
> "full rotation" of the scheduler.
>
> With the above it is easy to estimate the load of a CPU. We can simply
> use the elapsed time between two invokations of the load balancer task.
> When the load balancer task of a particular CPU gets run, it chalks up
> the elapsed time on a score board somewhere, and checks whether there is
> a significant imbalance between itself and some other CPU. If there is,
> it commences to move some tasks between itself and the other CPU (note
> rule B, though, it should be enough to mess with just two CPU queues at
> a time to minimize balancing and locking overhead).
>
> Problem 3 is tricky. Basically, there should be a cost/benefit function
> F(tasks to move) that should be minimized. Ideally F(task_i), the
> cost/benefit of moving a single task, would be calculated as a byproduct
> of the CPU scheduler algorithm.
>
> F(task_i) might be function of elapsed time since task_i was last
> scheduled and the average time slice used by task_i, to account for the
> probable cache hit. This would leave it up to the load balancer to move
> as many lowest cost tasks to a new CPU as is needed to correct the
> imbalance (average time slices used by each task would be needed in
> order to make this decision).
>
> Naturally, some additional rules might be necessary to make a task
> eligible for moving, e.g., never move the only/last CPU bound task to
> another CPU. In addition, it might actually make sense to move at most
> one task at each invocation of the load balancer, to further reduce the
> probability of thrashing. The load would still converge fairly quickly
> towards a balanced state. It would also scale fairly well with the
> number of CPUs.
>
> How does that sound?

To measure the load of a cpu "nr_running" ( per cpu ) is probably the best
choice. Anyway there's some work already done :

http://lse.sourceforge.net/scheduling/LB/poolMQ.html

- Davide

2001-10-07 20:02:03

[permalink] [raw]

Subject: Re: Context switch times

Andrea Arcangeli wrote:
>
> On Thu, Oct 04, 2001 at 09:35:07PM -0700, Mike Kravetz wrote:
> > [..] the pipe routines only call
> > the non-synchronous form of wake_up. [..]
>
> Ok I see, I overlooked that when we don't need to block we wakeup-async.
>
> So first of all it would be interesting to change the token passed
> thorugh the pipe so that it always fills in the PAGE_SIZE pipe buffer so
> that the task goes to sleep before reading from the next pipe in the
> ring.
>
> And if we want to optimize the current benchmark (with the small token that
> triggers the async-wakeup and it always goes to sleep in read() rather
> than write()) we'd need to invalidate a basic point of the scheduler
> that have nothing to do with IPI reschedule, the point to invalidate is
> that if the current task (the one that is running wake_up()) has a small
> avg_slice we'd better not reschedule the wakenup task on a new idle cpu
> but we'd better wait the current task to go away instead.

A couple of questions:

1.) Do you want to qualify that the wake_up is not from an interrupt?

2.) Having done this AND the task doesn't block THIS time, do we wait
for the slice to end or is some other "dead man" timer needed?

George

Unless I'm
> missing something this should fix lmbench in its current implementation.
>
> Andrea
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2001-10-07 20:25:35

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: Context switch times

On Sun, Oct 07, 2001 at 12:54:08PM -0700, george anzinger wrote:
> Andrea Arcangeli wrote:
> >
> > On Thu, Oct 04, 2001 at 09:35:07PM -0700, Mike Kravetz wrote:
> > > [..] the pipe routines only call
> > > the non-synchronous form of wake_up. [..]
> >
> > Ok I see, I overlooked that when we don't need to block we wakeup-async.
> >
> > So first of all it would be interesting to change the token passed
> > thorugh the pipe so that it always fills in the PAGE_SIZE pipe buffer so
> > that the task goes to sleep before reading from the next pipe in the
> > ring.
> >
> > And if we want to optimize the current benchmark (with the small token that
> > triggers the async-wakeup and it always goes to sleep in read() rather
> > than write()) we'd need to invalidate a basic point of the scheduler
> > that have nothing to do with IPI reschedule, the point to invalidate is
> > that if the current task (the one that is running wake_up()) has a small
> > avg_slice we'd better not reschedule the wakenup task on a new idle cpu
> > but we'd better wait the current task to go away instead.
>
> A couple of questions:
>
> 1.) Do you want to qualify that the wake_up is not from an interrupt?

If the avg_slice is very small we know the task will probabilistically
go away soon regardless where the wakeup cames from. I don't think the
logic has connection to the kind of wakeup, it only has connection to
the kind of "running task".

Infact waiting "bash" or "ksoftirqd" to go away is the right thing to do
to fix the suprious migrations of cpu intensive tasks too, no matter
where the wakeup cames from.

The only risk here is that we never know from the scheduler if the
running task with the small avg_slice will really go away this time too
or not. And if it doesn't go away this time, we may not do the best
global decision by not migrating the wakenup task to the idle cpus, but
the probability information provided by avg_slice should cover this
problem.

Of course this is theory, I didn't attempted to test it.

btw, I remeber Ingo some year ago said on l-k that rescheduling
immediatly every idle cpu isn't the best thing to do.

> 2.) Having done this AND the task doesn't block THIS time, do we wait
> for the slice to end or is some other "dead man" timer needed?

Of course waiting for the slice to end anyways is the natural first
step. As said above the point of the probability information of
avg_slice is to render very unlikely the "task doesn't block THIS time"
case.

Andrea

2001-10-07 22:06:27

[permalink] [raw]

Subject: Re: Context switch times

george anzinger wrote:
> On the face of it this only seems unfair. I think what we want to do is
> to allocate the cpu resources among competing tasks as dictated by their
> NICE values. If each task gets its allotted share it should not matter
> if one of them is monopolizing one cpu. This is not to say that things
> will work out this way, but to say that this is not the measure, nor the
> thing to look at.

I'm not sure I fully understood what you're driving at, but surely "each
task getting its allotted share" implies that the tasks are correctly
balanced across CPUs? I take your point that if you're only interested
in the total completion time of certain set of tasks, it doesn't matter
how they are divided among CPUs as long as each CPU is crunching at
maximum capacity. However, if you're interested in the completion time
of a particular task (or if the task is mostly CPU bound but with
occasional interaction with the user), you need to provide fairness all
the way through the scheduling process. This, IMHO, implies balancing
the tasks across CPUs.

How the balance is determined is another issue, though. I basically
proposed summing the time slices consumed by tasks executing on a single
CPU and using that as the comparison value. Davide Libenzi, on the other
hand, simply proposed using the number of tasks.

I would contend that if you can evenly divide a particular load across
multiple CPUs, you can then run a pre-emptive scheduler on each CPUs run
queue independently, without regard for the other CPUs, and still come
out with high CPU utilization and reasonable completion times for all
tasks. This has the major advantage of limiting spinlock contention to
the load balancer, instead of risking it at every schedule().

> I suggest that, using the "recalculate tick" as a measure of time (I
> know it is not very linear)

I'll take your word for it as I'm not very familiar with the current
scheduler. The main thing is to do the rebalancing periodically and to
have a reliable way of estimating the loading (not utilization) of each
CPU.

> when a cpu finds itself with nothing to do
> (because all its tasks have completed their slices or blocked) and other
> cpus have tasks in their queues it is time to "shop" for a new task to
> run.

This again has the problem that you only look for new tasks if you have
nothing to do. While this would work if you only look at the total
completion time of a set of tasks, it does nothing to ensure fair
scheduling for a particular task.

> This would then do load balancing just before each "recalculate
> tick".
> Of course, the above assumes that each cpu has its own run queue.
>
> George

Regards,

MikaL

2001-10-07 22:26:22

[permalink] [raw]

Subject: Re: Context switch times

On Mon, 8 Oct 2001, Mika Liljeberg wrote:

> How the balance is determined is another issue, though. I basically
> proposed summing the time slices consumed by tasks executing on a single
> CPU and using that as the comparison value. Davide Libenzi, on the other
> hand, simply proposed using the number of tasks.

CPU based number of __running__ tasks.

- Davide

2001-10-07 22:33:45

[permalink] [raw]

Subject: Re: Context switch times

Davide Libenzi wrote:
>
> On Mon, 8 Oct 2001, Mika Liljeberg wrote:
>
> > How the balance is determined is another issue, though. I basically
> > proposed summing the time slices consumed by tasks executing on a single
> > CPU and using that as the comparison value. Davide Libenzi, on the other
> > hand, simply proposed using the number of tasks.
>
> CPU based number of __running__ tasks.

That's what I meant, sorry for not making it clear.

I suppose in practise this would amount to using the number of _CPU
bound_ running tasks per CPU, since the non-CPU bound tasks would likely
be waiting rather than running? This would then be essentially
equivalent to (and simpler than) my proposal, i.e. using the sum of the
time slices.

> - Davide

MikaL

2001-10-07 23:49:48

[permalink] [raw]

Subject: Re: Context switch times

Mika Liljeberg wrote:
>
> george anzinger wrote:
> > On the face of it this only seems unfair. I think what we want to do is
> > to allocate the cpu resources among competing tasks as dictated by their
> > NICE values. If each task gets its allotted share it should not matter
> > if one of them is monopolizing one cpu. This is not to say that things
> > will work out this way, but to say that this is not the measure, nor the
> > thing to look at.
>
> I'm not sure I fully understood what you're driving at, but surely "each
> task getting its allotted share" implies that the tasks are correctly
> balanced across CPUs?

Actually I am trying to recast your requirement (for fairness) into
something that is clean and easy to code :) Keeping track of the number
of tasks a cpu executes over a given period of time, while possible,
does not seem to me to give a very good indication of its load.

> I take your point that if you're only interested
> in the total completion time of certain set of tasks, it doesn't matter
> how they are divided among CPUs as long as each CPU is crunching at
> maximum capacity. However, if you're interested in the completion time
> of a particular task (or if the task is mostly CPU bound but with
> occasional interaction with the user), you need to provide fairness all
> the way through the scheduling process. This, IMHO, implies balancing
> the tasks across CPUs.

I don't dispute balance, I am just trying to suggest a way to do it that
IMHO is relatively easy.
>
> How the balance is determined is another issue, though. I basically
> proposed summing the time slices consumed by tasks executing on a single
> CPU and using that as the comparison value. Davide Libenzi, on the other
> hand, simply proposed using the number of tasks.

In practice, in a given time (say a "recalculate period") a given cpu
with a given set of tasks will end up doing a fixed amount of work
(provided we don't allow idle tasks). This could be stated as a sum of
time slices or CPU jiffies if you like. As long as loads are passed
from cpu to cpu to make them all end a "recalculate period" at the same
time, I would contend we have balance. Counting tasks is problematic as
tasks come and go and present varying loads. The real trick is in
figuring which task to move, or if none should be moved.
>
> I would contend that if you can evenly divide a particular load across
> multiple CPUs, you can then run a pre-emptive scheduler on each CPUs run
> queue independently, without regard for the other CPUs, and still come
> out with high CPU utilization and reasonable completion times for all
> tasks. This has the major advantage of limiting spinlock contention to
> the load balancer, instead of risking it at every schedule().

Exactly.
>
> > I suggest that, using the "recalculate tick" as a measure of time (I
> > know it is not very linear)
>
> I'll take your word for it as I'm not very familiar with the current
> scheduler. The main thing is to do the rebalancing periodically and to
> have a reliable way of estimating the loading (not utilization) of each
> CPU.

The current schedule periodically calculates new time slices for each
task. It does this when no task in the run queue has any slice time
left. The slice size each task gets is related to the NICE value. In
practice then (other things being equal) each task will get a portion of
the available processing time that is related to its NICE value. I
think this is fair. But, of course, other things are not equal. When a
task is blocked and the recalculation happens, that task gets to keep
1/2 of it remaining slice as well as the new slice. Since, in addition
to slice time, the remaining slice value is used as a major component of
the task's priority, this means that the blocked task, if it is blocked
during a recalculate, will get a boosted priority.

I think what Alan is suggesting is that the priority boost is needed for
blocked tasks, but not necessarily the increased slice time.
>
> > when a cpu finds itself with nothing to do
> > (because all its tasks have completed their slices or blocked) and other
> > cpus have tasks in their queues it is time to "shop" for a new task to
> > run.
>
> This again has the problem that you only look for new tasks if you have
> nothing to do. While this would work if you only look at the total
> completion time of a set of tasks, it does nothing to ensure fair
> scheduling for a particular task.

But doesn't the f(NICE) used as a slice do this. The point is that we
will not allow a recalculate until all run able tasks with non zero
slice times have run their slice times to zero. This insures that all
tasks that can use the cpu will get their fair share of time each
"recalculate period".

This is pretty much what is done today across the whole run queue. The
enhancement we are talking about is separating tasks by cpu and pretty
much keeping them on one cpu.
>
> > This would then do load balancing just before each "recalculate
> > tick".
> > Of course, the above assumes that each cpu has its own run queue.
> >
> > George
>
> Regards,
>
> MikaL

2001-10-08 15:19:45

by Bill Davidsen

[permalink] [raw]

Subject: Re: Context switch times

In article <[email protected]> [email protected] wrote:

>Yes. However, you still want to balance the queues even if all CPUs are
>100% utilized. It's a fairness issue. Otherwise you could have 1 task
>running on one CPU and 49 tasks on another.

You say that as if it were a bad thing... I believe that if you have
one long running task and many small tasks in the system CPU affinity
will make that happen now. Obviously not if all CPUs are 100% loaded,
and your 1 vs. 49 is unrealistic, but having a task stay with a CPU
while trivia run on other CPU(s) is generally a good thing under certain
load conditions, which I guess are no less likely than your example;-)

--
bill davidsen <[email protected]>
"If I were a diplomat, in the best case I'd go hungry. In the worst
case, people would die."
-- Robert Lipe

2001-10-08 21:09:38

[permalink] [raw]

Subject: Re: Context switch times

george anzinger wrote:
>
> Mika Liljeberg wrote:
> > I'm not sure I fully understood what you're driving at, but surely "each
> > task getting its allotted share" implies that the tasks are correctly
> > balanced across CPUs?
>
> Actually I am trying to recast your requirement (for fairness) into
> something that is clean and easy to code :)

Fair enough. ;)

> The current schedule periodically calculates new time slices for each
> task. It does this when no task in the run queue has any slice time
> left. The slice size each task gets is related to the NICE value. In
> practice then (other things being equal) each task will get a portion of
> the available processing time that is related to its NICE value. I
> think this is fair. But, of course, other things are not equal. When a
> task is blocked and the recalculation happens, that task gets to keep
> 1/2 of it remaining slice as well as the new slice. Since, in addition
> to slice time, the remaining slice value is used as a major component of
> the task's priority, this means that the blocked task, if it is blocked
> during a recalculate, will get a boosted priority.

Thanks for the concise explanation. I follow you a little better now.

> I think what Alan is suggesting is that the priority boost is needed for
> blocked tasks, but not necessarily the increased slice time.

I have to agree. If the job is I/O bound, it's unlikely to need the
extra CPU time.

> > > when a cpu finds itself with nothing to do
> > > (because all its tasks have completed their slices or blocked) and other
> > > cpus have tasks in their queues it is time to "shop" for a new task to
> > > run.
> >
> > This again has the problem that you only look for new tasks if you have
> > nothing to do. While this would work if you only look at the total
> > completion time of a set of tasks, it does nothing to ensure fair
> > scheduling for a particular task.
>
> But doesn't the f(NICE) used as a slice do this. The point is that we
> will not allow a recalculate until all run able tasks with non zero
> slice times have run their slice times to zero. This insures that all
> tasks that can use the cpu will get their fair share of time each
> "recalculate period".

Ok, I think I get you. I didn't realize that there was already a strict
fairness enforcing mechanism built into the scheduler. So, you're
proposing to do a rebalance when the first CPU runs out of things to do,
which will happen sometime before the recalculate period expires?

Sounds reasonable, but I think you will see a flurry of rebalancing when
the recalculate period is ending and the CPUs are starting to run out of
things to do. This will cause load the balancer to frantically try to
spread the last few tasks with remaining time slices across the CPUs.
This is exactly the kind of thrashing that we would like to avoid.

> This is pretty much what is done today across the whole run queue. The
> enhancement we are talking about is separating tasks by cpu and pretty
> much keeping them on one cpu.

Exactly.

> > > This would then do load balancing just before each "recalculate
> > > tick".
> > > Of course, the above assumes that each cpu has its own run queue.
> > >
> > > George

Cheers,

MikaL

2001-10-08 22:57:35

[permalink] [raw]

Subject: discontig physical memory

Hi guys,

I ran into this problem with a SOC weird design.
The physical memory map looks like this:

0 - 4MB DMA-able embedded RAM;
4MB - 16MB nothing here;
16MB - 32MB external RAM;

Embedded controllers (FB/USB) can see only lowest 4MB
and they need almost all of it for buffers. The kernel
is living at phy address 16Mb.

Any ideas how to make lowest 4MB allocatable throu
kmalloc(size, GFP_DMA) without breaking the kernel?

Petko

PS: I've seen CONFIG_DISCONTIGMEM but it is not yet
implemented for MIPS and i am not sure if it is what
is required in this case.

2001-10-08 23:06:08

[permalink] [raw]

Subject: Re: discontig physical memory

From: Petko Manolov <[email protected]>
Date: Mon, 08 Oct 2001 15:54:30 -0700

Any ideas how to make lowest 4MB allocatable throu
kmalloc(size, GFP_DMA) without breaking the kernel?

Setup ZONE_DMA correctly at boot time.

See the code around the free_area_init() call in
arch/i386/mm/init.c for an example of how to do this.
Intel does it for the low 16MB, you would just do it
for the low 4MB.

Franks a lot,
David S. Miller
[email protected]

2001-10-08 23:21:50

[permalink] [raw]

Subject: Re: discontig physical memory

I already did that. I fixed MAX_DMA_ADDRESS and the
kernel now reports:

zone(0): 1024 pages
zone(1): 7168 pages /* this should be 4096 */
...

Which is not true as we have a gap between 4 - 16MB
phys memory. I also played with add_memory_region(),
but the kernel still think we have 32MB ram instead
of 20MB as is the case.

Also it lock solid in __get_free_pages in mem_init()
Probably this is related to zone(1) wrong assumption
right?

best,
Petko

"David S. Miller" wrote:
>
> From: Petko Manolov <[email protected]>
> Date: Mon, 08 Oct 2001 15:54:30 -0700
>
> Any ideas how to make lowest 4MB allocatable throu
> kmalloc(size, GFP_DMA) without breaking the kernel?
>
> Setup ZONE_DMA correctly at boot time.
>
> See the code around the free_area_init() call in
> arch/i386/mm/init.c for an example of how to do this.
> Intel does it for the low 16MB, you would just do it
> for the low 4MB.
>
> Franks a lot,
> David S. Miller
> [email protected]

2001-10-08 23:30:22

[permalink] [raw]

Subject: Re: discontig physical memory

From: Petko Manolov <[email protected]>
Date: Mon, 08 Oct 2001 16:18:25 -0700

I already did that. I fixed MAX_DMA_ADDRESS and the
kernel now reports:

zone(0): 1024 pages
zone(1): 7168 pages /* this should be 4096 */
...

Which is not true as we have a gap between 4 - 16MB
phys memory. I also played with add_memory_region(),
but the kernel still think we have 32MB ram instead
of 20MB as is the case.

You need to setup bootmem correctly, earlier on, such that
you do not tell the kernel that there are any pages in this
4 - 16MB region.

Do something like this instead of whatever your bootmem
calls are doing now:

bootmap_size = init_bootmem(0, (32 * 1024 * 1024));
free_bootmem((4 * 1024 * 1024),
((16 - 4) * 1024 * 1024));

Franks a lot,
David S. Miller
[email protected]

2001-10-09 00:37:58

[permalink] [raw]

Subject: Re: discontig physical memory

"David S. Miller" wrote:
>
> You need to setup bootmem correctly, earlier on, such that
> you do not tell the kernel that there are any pages in this
> 4 - 16MB region.
>
> Do something like this instead of whatever your bootmem
> calls are doing now:
>
> bootmap_size = init_bootmem(0, (32 * 1024 * 1024));
> free_bootmem((4 * 1024 * 1024),
> ((16 - 4) * 1024 * 1024));

2001-10-09 00:39:38

[permalink] [raw]

Subject: Re: discontig physical memory

2001-10-09 01:37:59

[permalink] [raw]

Subject: Re: discontig physical memory

From: Petko Manolov <[email protected]>
Date: Mon, 08 Oct 2001 17:36:11 -0700

"David S. Miller" wrote:
> Do something like this instead of whatever your bootmem
> calls are doing now:
>
> bootmap_size = init_bootmem(0, (32 * 1024 * 1024));
> free_bootmem((4 * 1024 * 1024),
> ((16 - 4) * 1024 * 1024));

This is suppose to tell the kernel about the gap?

Precisely. How else did you expect to let the kernel know?

Franks a lot,
David S. Miller
[email protected]

2001-10-09 02:46:41

[permalink] [raw]

Subject: Re: discontig physical memory

"David S. Miller" wrote:
>
> > bootmap_size = init_bootmem(0, (32 * 1024 * 1024));
> > free_bootmem((4 * 1024 * 1024),
> > ((16 - 4) * 1024 * 1024));
>
> This is suppose to tell the kernel about the gap?
>
> Precisely. How else did you expect to let the kernel know?

Unfortunately this didn't work. I also tried to declare
16MB hole starting from offset 0 so the kernel is supposed
to boot from phys addr 0x01000000. It crashed when tried to
execute kernel_thread()
Do you think there is a mechanism to tell the kernel that first
16MB is a non DMA-able area?

Petko

2001-10-09 04:55:16

by Richard Gooch

[permalink] [raw]

Subject: Re: Context switch times

David S. Miller writes:
> From: Richard Gooch <[email protected]>
> Date: Thu, 4 Oct 2001 15:39:05 -0600
>
> David S. Miller writes:
> > lat_ctx doesn't execute any FPU ops. So at worst this happens once
> > on GLIBC program startup, but then never again.
>
> Has something changed? Last I looked, the whole lmbench timing harness
> was based on using the FPU.
>
> Oops, that's entirely possible...
>
> But things are usually layed out like this:
>
> capture_start_time();
> context_switch_N_times();
> capture_end_time();
>
> So the FPU hit is only before/after the runs, not during each and
> every iteration.

Hm. Perhaps when I did my tests (where I noticed a penalty), we didn't
have lazy FPU saving. Now we disable the FPU, and restore state when
we trap, right?

I do note this comment in arch/i386/kernel/process.c:
* We fsave/fwait so that an exception goes off at the right time
* (as a call from the fsave or fwait in effect) rather than to
* the wrong process. Lazy FP saving no longer makes any sense
* with modern CPU's, and this simplifies a lot of things (SMP
* and UP become the same).

So what exactly is the difference between our "delayed FPU restore
upon trap" (which I think of as lazy FPU saving), and the "lazy FP"
saving in the comments?

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-10-09 05:00:36

[permalink] [raw]

Subject: Re: Context switch times

From: Richard Gooch <[email protected]>
Date: Mon, 8 Oct 2001 22:55:23 -0600

So what exactly is the difference between our "delayed FPU restore
upon trap" (which I think of as lazy FPU saving), and the "lazy FP"
saving in the comments?

Save always on swithching OUT from a task vs. save only when some
different task asks for the FPU.

Franks a lot,
David S. Miller
[email protected]

2001-10-09 13:49:14

by Bill Davidsen

[permalink] [raw]

Subject: Re: Context switch times

In article <[email protected]> [email protected] asked:

| Hm. Perhaps when I did my tests (where I noticed a penalty), we didn't
| have lazy FPU saving. Now we disable the FPU, and restore state when
| we trap, right?
|
| I do note this comment in arch/i386/kernel/process.c:
| * We fsave/fwait so that an exception goes off at the right time
| * (as a call from the fsave or fwait in effect) rather than to
| * the wrong process. Lazy FP saving no longer makes any sense
| * with modern CPU's, and this simplifies a lot of things (SMP
| * and UP become the same).
|
| So what exactly is the difference between our "delayed FPU restore
| upon trap" (which I think of as lazy FPU saving), and the "lazy FP"
| saving in the comments?

We always save the FPU, but only restore it when/if it is going to be
used. And obviously we don't want to save it if it hasn't been used,
since it wasn't restored...

--
bill davidsen <[email protected]>
"If I were a diplomat, in the best case I'd go hungry. In the worst
case, people would die."
-- Robert Lipe

2001-10-09 22:37:55

by Hubertus Franke

[permalink] [raw]

Subject: Re: Context switch times

* Mika Liljeberg <[email protected]> [20011007 05;57]:"
> Alan Cox wrote:
> > This isnt idle speculation - I've done some minimal playing with this but
> > my initial re-implementation didnt handle SMP at all and I am still not 100%
> > sure how to resolve SMP or how SMP will improve out of the current cunning
> > plan.
>
> Here's some idle speculation on SMP to top it off. :) I tend to think
> that the load balancing between CPUs should be a completely separate
> algorithim and should not necessarily be run at every schedule(). The
> idea is to compeletely decouple the problem of scheduling a single CPU
> between tasks and the problem of load balancing between the CPUs, making
> each problem simpler to solve.
>

This is what we implemented as an extension to our MQ scheduler.
I will present on the results for this during ALS technical session:
"CPU Pooling and Load Balancing in Linux MultiQueue Scheduling".
If interested I can already put an earlier version on lse.sourceforge.net.

It basically does (1) and (2). (3) is done unintelligently right now.

> Consider the following basic rules:
>
> A) When a new task comes along, pick the "least loaded" CPU and lock the
> new task onto that.
> B) Whenever the load imbalance between least loaded CPU and most loaded
> CPU becomes too great, move one or more tasks from most loaded CPU to
> the least loaded CPU.
>
> The rules themselves should be self-explanatory: A provides initial load
> balancing, while B tries to keep the balance (with a sensible hysteresis
> to avoid thrashing). However, there are a few minor details to solve:
>
> 1) How to determine the load of a CPU? If we can quantify this clearly,
> we can easily set a hysteresis level to trigger load balancing between
> two CPUs.
> 2) When and how often to check for load imbalance?
> 3) How to select the task(s) that should be moved between two CPUs to
> correct an imbalance?
>
> For problems 1 and 2 I propose the following solution: Insert the the
> load balancing routine itself as a (fake) task on each CPU and run it
> when the CPU gets around to it. The load balancer should behave almost
> like a CPU-bound task, scheduled on the lowest priority level with other
> runnable tasks. The last bit is important: the load balancer should not
> be allowed to starve but should be invoked approximately once every
> "full rotation" of the scheduler.
>
> With the above it is easy to estimate the load of a CPU. We can simply
> use the elapsed time between two invokations of the load balancer task.
> When the load balancer task of a particular CPU gets run, it chalks up
> the elapsed time on a score board somewhere, and checks whether there is
> a significant imbalance between itself and some other CPU. If there is,
> it commences to move some tasks between itself and the other CPU (note
> rule B, though, it should be enough to mess with just two CPU queues at
> a time to minimize balancing and locking overhead).
>
> Problem 3 is tricky. Basically, there should be a cost/benefit function
> F(tasks to move) that should be minimized. Ideally F(task_i), the
> cost/benefit of moving a single task, would be calculated as a byproduct
> of the CPU scheduler algorithm.
>
> F(task_i) might be function of elapsed time since task_i was last
> scheduled and the average time slice used by task_i, to account for the
> probable cache hit. This would leave it up to the load balancer to move
> as many lowest cost tasks to a new CPU as is needed to correct the
> imbalance (average time slices used by each task would be needed in
> order to make this decision).
>
> Naturally, some additional rules might be necessary to make a task
> eligible for moving, e.g., never move the only/last CPU bound task to
> another CPU. In addition, it might actually make sense to move at most
> one task at each invocation of the load balancer, to further reduce the
> probability of thrashing. The load would still converge fairly quickly
> towards a balanced state. It would also scale fairly well with the
> number of CPUs.
>
> How does that sound?
>
> MikaL
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2001-10-10 00:00:12