2001-12-10 00:31:44

by Davide Libenzi

Subject: [RFC] Scheduler queue implementation ...


I've spent some time reading papers ( some interesting, some not ) about
different scheduler implementations, policies, queues, etc.
A concept that I've found common to all these works, and it's easy to
agree with, is that CPU bound ( batch ) tasks do not need a strict
execution order.
So we can simply have two queues ( per CPU ): one that stores I/O bound (
counter > K, for example ) and RT tasks and is walked entirely searching
for the best task, while the other queue stores CPU bound tasks that are
executed in FIFO order.
In this way we'll get better behavior for I/O bound tasks, due to the
shorter run queue to loop over, while the FIFO policy on CPU bound tasks
does not hurt the system as long as these tasks receive the same amount of
virtual time.
The problem of having a long runqueue is automatically solved by the
assumption that long_runqueue == lot_of_cpubound_tasks: those tasks find a
home inside the FIFO queue, which assures low latency for I/O bound and RT
tasks.
What we lose is the mm ( + 1 ) goodness() bonus, but the eventual cost of
switching mm is absorbed into the long execution time ( 50-60 ms ) typical
of CPU bound tasks ( note that matching MMs does not mean matching cache
image ).
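
Just to make the idea concrete, a minimal sketch of the pick-next logic
described above could look like the following ( all names are invented for
illustration, this is not kernel code ):

/* Minimal sketch of the proposed two-queue ( per CPU ) layout. */
#include <stddef.h>

struct task {
	int counter;			/* dynamic priority ( time slice left ) */
	int rt_prio;			/* > 0 for RT tasks */
	struct task *next;		/* singly linked run list */
};

struct cpu_rq {
	struct task *io_rt;		/* I/O bound + RT: scanned for the best task */
	struct task *batch;		/* CPU bound: plain FIFO, never scanned */
	struct task *batch_tail;
};

/* goodness-like weight: RT always wins, otherwise the higher counter wins */
static int weight(struct task *t)
{
	return t->rt_prio ? 1000 + t->rt_prio : t->counter;
}

/* Pick the next task: walk only the ( short ) I/O bound + RT queue,
   then fall back to the FIFO head of the CPU bound queue. */
static struct task *pick_next(struct cpu_rq *rq)
{
	struct task *t, *best = NULL;
	int w, best_w = -1;

	for (t = rq->io_rt; t; t = t->next) {
		w = weight(t);
		if (w > best_w) {
			best_w = w;
			best = t;
		}
	}
	if (best)
		return best;

	/* no I/O bound / RT task runnable: FIFO pop of a CPU hog */
	t = rq->batch;
	if (t) {
		rq->batch = t->next;
		if (!rq->batch)
			rq->batch_tail = NULL;
	}
	return t;
}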




- Davide




2001-12-10 00:40:34

by Alan

Subject: Re: [RFC] Scheduler queue implementation ...

> What we lose is the mm ( + 1 ) goodness() bonus, but the eventual cost of
> switching mm is absorbed into the long execution time ( 50-60 ms ) typical
> of CPU bound tasks ( note that matching MMs does not mean matching cache
> image ).

The mm matching and keeping an mm on the same cpu is critical for a lot of
applications (your 50ms execution time includes 2ms loading the cache from
the other CPU in some cases). Keeping the same mms together on the processor
is critical for certain platforms (eg ARM) but not for x86 so much.

Biasing an mm towards a given CPU is easy with any per cpu queue system.
I've tried a few variants for sticking the same mm tasks together and one
that might work - especially with your two queue setup is:

/* other live users of this mm */
if (new->mm->runnable && is_cpuhog(new))
	list_add(&new->run_list, &new->mm->livehog->run_list);
else
	...


in other words - if we have the mm in flow then throw ourselves onto the
list of cpu hogs adjacent to the other users of that mm.

2001-12-10 00:43:04

by Alan

Subject: Re: [RFC] Scheduler queue implementation ...

> So we can simply have two queues ( per CPU ): one that stores I/O bound (
> counter > K, for example ) and RT tasks and is walked entirely searching
> for the best task, while the other queue stores CPU bound tasks that are
> executed in FIFO order.

Oh as an aside btw - there are many real world workloads where we have a lot
of non cpu hog processes running. A lot of messaging systems have high task
switch rates but very few cpu hogs. So you still need to handle the non hogs
carefully to avoid degenerating back into Linus scheduler.

2001-12-10 00:55:55

by Davide Libenzi

Subject: Re: [RFC] Scheduler queue implementation ...

On Mon, 10 Dec 2001, Alan Cox wrote:

> > What we lose is the mm ( + 1 ) goodness() bonus, but the eventual cost of
> > switching mm is absorbed into the long execution time ( 50-60 ms ) typical
> > of CPU bound tasks ( note that matching MMs does not mean matching cache
> > image ).
>
> The mm matching and keeping an mm on the same cpu is critical for a lot of
> applications (your 50ms execution time includes 2ms loading the cache from
> the other CPU in some cases). Keeping the same mms together on the processor
> is critical for certain platforms (eg ARM) but not for x86 so much.

Alan, you're mixing switch mm costs with cache image reload ones.
Note that equal mm does not mean matching cache image, at all.
I would say that they're completely unrelated.
Having only two queues keeps the implementation simple and solves
99% of the common/extraordinary cases.
The cost of a tlb flush becomes "meaningful" for I/O bound tasks, where
their short average run time is not sufficient to compensate for the tlb
flush cost, and this is handled correctly/as-usual inside the I/O bound queue.




- Davide




2001-12-10 01:04:35

by Davide Libenzi

Subject: Re: [RFC] Scheduler queue implementation ...

On Mon, 10 Dec 2001, Alan Cox wrote:

> > So we can simply have two queues ( per CPU ): one that stores I/O bound (
> > counter > K, for example ) and RT tasks and is walked entirely searching
> > for the best task, while the other queue stores CPU bound tasks that are
> > executed in FIFO order.
>
> Oh as an aside btw - there are many real world workloads where we have a lot
> of non cpu hog processes running. A lot of messaging systems have high task
> switch rates but very few cpu hogs. So you still need to handle the non hogs
> carefully to avoid degenerating back into Linus scheduler.

In my experience, if you have I/O bound tasks that lead to a long run queue,
it means that you're suffering major kernel latency problems ( other than the
scheduler ).




- Davide


2001-12-10 01:07:05

by Alan

Subject: Re: [RFC] Scheduler queue implementation ...

> Alan, you're mixing switch mm costs with cache image reload ones.
> Note that equal mm does not mean matching cache image, at all.

They are often close to the same thing. Take a look at the constraints
on virtually cached processors like the ARM where they _are_ the same thing.

Equal mm for cpu sucking tasks often means equal cache image. On the other
hand unmatched mm pretty much definitively says "doesn't matter". The cost
of getting the equal mm/shared cache case wrong is too horrific to handwave
it out of existence using academic papers.

> Having only two queues keeps the implementation simple and solves
> 99% of the common/extraordinary cases.
> The cost of a tlb flush becomes "meaningful" for I/O bound tasks, where
> their short average run time is not sufficient to compensate for the tlb
> flush cost, and this is handled correctly/as-usual inside the I/O bound queue.

You don't seem to solve either problem directly.

Without per cpu queues you spend all your time bouncing stuff between
processors which hurts. Without multiple queues for interactive tasks you
walk the interactive task list so you don't scale. Without some sensible
handling of mm/cpu binding you spend all day wasting ram bandwidth with
cache writeback.

The single cpu sucker queue is easy; the cost of merging mm equivalent tasks
in that queue is almost nil. Priority ordering interactive stuff using
several queues is easy and very cheap if you need it (I think you do, hence
I have 7 priority bands and you have 1). In all these cases the hard bits
end up being:

On a wake up which cpu do you give a task ?
When does an idle cpu steal a task, who from and which task ?
How do I define "imbalance" for cpu load balancing ?
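
( As an illustration of the "priority ordering using several queues is easy
and very cheap" point above, a minimal sketch with an invented structure and
an arbitrary band count - not code from any actual tree: )

#include <stddef.h>

#define NR_BANDS 7

struct task {
	struct task *next;
};

struct interactive_rq {
	unsigned int nonempty;		/* bit i set => band i has tasks */
	struct task *band[NR_BANDS];	/* band 0 is the highest priority */
};

static void enqueue(struct interactive_rq *rq, struct task *t, int band)
{
	t->next = rq->band[band];
	rq->band[band] = t;
	rq->nonempty |= 1u << band;
}

/* picking an interactive task is a find-first-bit, not a full queue walk */
static struct task *pick_interactive(struct interactive_rq *rq)
{
	struct task *t;
	int band;

	if (!rq->nonempty)
		return NULL;
	band = __builtin_ctz(rq->nonempty);	/* lowest set bit = best band */
	t = rq->band[band];
	rq->band[band] = t->next;
	if (!rq->band[band])
		rq->nonempty &= ~(1u << band);
	return t;
}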

2001-12-10 01:11:07

by Alan

Subject: Re: [RFC] Scheduler queue implementation ...

> > of non cpu hog processes running. A lot of messaging systems have high task
> > switch rates but very few cpu hogs. So you still need to handle the non hogs
> > carefully to avoid degenerating back into Linus scheduler.
>
> In my experience, if you have I/O bound tasks that lead to a long run queue,
> it means that you're suffering major kernel latency problems ( other than the
> scheduler ).

I don't see any evidence of it in the profiles. It's just that a lot of stuff
gets passed around between multiple daemons. You can see similar things
in something as mundane as a 4-task-long pipe, a user mode file system or
some X11 clients.

2001-12-10 01:37:09

by Davide Libenzi

Subject: Re: [RFC] Scheduler queue implementation ...

On Mon, 10 Dec 2001, Alan Cox wrote:

> > > of non cpu hog processes running. A lot of messaging systems have high task
> > > switch rates but very few cpu hogs. So you still need to handle the non hogs
> > > carefully to avoid degenerating back into Linus scheduler.
> >
> > In my experience, if you have I/O bound tasks that lead to a long run queue,
> > it means that you're suffering major kernel latency problems ( other than the
> > scheduler ).
>
> I don't see any evidence of it in the profiles. It's just that a lot of stuff
> gets passed around between multiple daemons. You can see similar things
> in something as mundane as a 4-task-long pipe, a user mode file system or
> some X11 clients.

If you've I/O bound ( very low user space average run time ) tasks
accumulation, it's very likely that the bottom part of the iceberg (
kernel ) is becoming quite fat with respect of the userspace part.
Coming to the pipe example, let's take Larry's lat_ctx ( lmbench ).
This bounces data through pipes using I/O bound tasks, and running
vmstat together with lat_ctx 32 32 ... ( long list ), you'll see the run
queue length barely reach 3 ( with 32 bouncing tasks ).
It barely reaches 5 with 64 bouncing tasks.




- Davide


2001-12-10 01:43:39

by Alan

Subject: Re: [RFC] Scheduler queue implementation ...

> vmstat together with lat_ctx 32 32 ... ( long list ), you'll see the run
> queue length barely reach 3 ( with 32 bouncing tasks ).
> It barely reaches 5 with 64 bouncing tasks.

Try 250 apache server processes running a mix of mod_perl and static
content; you'll see quite a reasonable queue size then, and it isn't all
cpu hogs.

Interesting question, however. I certainly can't disprove your belief that
the queue will be short enough to be worth using multiqueue for the case
where it's a queue per processor.

Alan

2001-12-10 02:35:00

by Davide Libenzi

Subject: Re: [RFC] Scheduler queue implementation ...

On Mon, 10 Dec 2001, Alan Cox wrote:

> > Alan, you're mixing switch mm costs with cache image reload ones.
> > Note that equal mm does not mean matching cache image, at all.
>
> They are often close to the same thing. Take a look at the constraints
> on virtually cached processors like the ARM where they _are_ the same thing.
>
> Equal mm for cpu sucking tasks often means equal cache image. On the other
> hand unmatched mm pretty much definitively says "doesn't matter". The cost
> of getting the equal mm/shared cache case wrong is too horrific to handwave
> it out of existence using academic papers.

This is very difficult to prove and heavily depends on the application
architecture.
Anyway, I was just thinking that, if we scan the cpu bound queue as usual
in order to correctly sort out mm related tasks, we're going to blend
this time ( about 2.9us with rqlen=32 on a dual PIII 733, std scheduler )
into the cpu bound average run time ( 30-60ms ).
This means ~ 0.005%.
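( as a sanity check: 2.9us / 60ms = 0.0000483, i.e. about 0.005%; even against
the 30ms end of the range it would only be about 0.01% )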


> > Having only two queues keeps the implementation simple and solves
> > 99% of the common/extraordinary cases.
> > The cost of a tlb flush becomes "meaningful" for I/O bound tasks, where
> > their short average run time is not sufficient to compensate for the tlb
> > flush cost, and this is handled correctly/as-usual inside the I/O bound queue.
>
> You don't seem to solve either problem directly.
>
> Without per cpu queues you spend all your time bouncing stuff between
> processors which hurts. Without multiple queues for interactive tasks you
> walk the interactive task list so you don't scale. Without some sensible
> handling of mm/cpu binding you spend all day wasting ram bandwidth with
> cache writeback.

It has two queues ( cpubound + iobound-rttasks ) per CPU, as I wrote in my
previous message.
By splitting into a multi-queue model, plus having a separate queue for
iobound tasks, I think it'll scale pretty well indeed.
Two queues instead of N also means a lower scheduler memory footprint.
The whole point is to understand where further optimizations will not pay
for the greater complexity ( plus D/I cache footprint ).



> The single cpu sucker queue is easy; the cost of merging mm equivalent tasks
> in that queue is almost nil. Priority ordering interactive stuff using
> several queues is easy and very cheap if you need it (I think you do, hence
> I have 7 priority bands and you have 1). In all these cases the hard bits
> end up being:
>
> On a wake up which cpu do you give a task ?

The multi queue scheduler is not like the old one, where the whole set of
running tasks is virtually owned by all cpus.
In a multi queue scheduler, tasks are _local_by_default_.


> When does an idle cpu steal a task, who from and which task ?
> How do I define "imbalance" for cpu load balancing ?

As I wrote in previous messages, the idle task(s) are woken up at every timer
tick and check the balance status.
After N ( tunable ) consecutive ticks in which the idle has found an
unbalanced status, it'll try to steal a task.
From where, you could ask ?
This is a good question and the best answer to good questions is:

I don't know :)

Just kidding ( only in part ).
There are several concepts:

1) cpu load expressed in run queue length
2) cpu load expressed as the sum ( in jiffies ) of the average run time of
the currently running task set
3) distance/metric of the move ( think about NUMA )
4) last mm used by the idle
5) last time ( in jiffies ) the task ran

The trick is to find a function :

F = F( RQL, JLD, DIS, LMM, JLT )

so we can sort out and select the best task to run on the idle cpu.
And more, the cpu selection must be done without locking the remote runqueue.
So it should be a two-phase operation:

1) cpu selection ( no locks held )
2) task selection inside the selected cpu ( remote queue lock held )

The first phase will use 1, 2 and 3 from the above variable set, while the
second will use 4 and 5 for task selection inside the queue.
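
A rough sketch of what the first ( lock-free ) phase of such an F could look
like, with completely made-up weights, just to show the shape:

/* Phase 1 sketch: pick the most loaded cpu to steal from, using only
   per-cpu data that can be read without taking the remote queue lock.
   The weights are invented; only the shape of F matters here. */

struct cpu_stat {
	unsigned long rq_len;		/* RQL: run queue length */
	unsigned long avg_jiffies;	/* JLD: sum of average run times ( jiffies ) */
	unsigned long distance;		/* DIS: NUMA-style distance from the idle cpu */
};

static long balance_score(const struct cpu_stat *c)
{
	/* longer queue and heavier load raise the score, distance lowers it */
	return (long)(c->rq_len * 4 + c->avg_jiffies) - (long)(c->distance * 8);
}

static int pick_victim_cpu(const struct cpu_stat *stat, int ncpus, int self)
{
	int cpu, best = -1;
	long s, best_score = 0;		/* steal only if the score is positive */

	for (cpu = 0; cpu < ncpus; cpu++) {
		if (cpu == self)
			continue;
		s = balance_score(&stat[cpu]);
		if (s > best_score) {
			best_score = s;
			best = cpu;
		}
	}
	return best;			/* -1 means: nothing worth stealing */
}

The second phase, with the victim's queue lock held, would then prefer a task
whose mm matches the last mm used on the idle cpu ( LMM ) and that has not run
for the longest time ( JLT ).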




- Davide



2001-12-10 03:05:09

by Davide Libenzi

Subject: Re: [RFC] Scheduler queue implementation ...

On Mon, 10 Dec 2001, Alan Cox wrote:

> > vmstat together with lat_ctx 32 32 ... ( long list ), you'll see the run
> > queue length barely reach 3 ( with 32 bouncing tasks ).
> > It barely reaches 5 with 64 bouncing tasks.
>
> Try 250 apache server processes running a mix of mod_perl and static
> content; you'll see quite a reasonable queue size then, and it isn't all
> cpu hogs.

For I/O bound, in this case ( since I use the average run time to weigh
the scheduler latency ), I mean tasks that run for a very short time
before being rescheduled.
mod_perl is quite heavy from a cpu point of view, and the static content,
as soon as it gets cached, will result in non-blocking IO.
Since very likely even the HTTP response is going to be eaten by the
network layer w/out preemption, you're going to have a cpu path running
from the receipt of one HTTP request to the fetch of the next, and it is
very likely run w/out preemption.


> Interesting question, however. I certainly can't disprove your belief that
> the queue will be short enough to be worth using multiqueue for the case
> where it's a queue per processor.

My point is, when is the scheduler latency going to have an impact on the
running task ?
When the average run time of the task is low.
But a low run time means a high dynamic priority ( counter ), accumulated
over consecutive recalc loops.
Having a tunable trigger point ( counter ) for task classification
( iobound-rttask or cpubound queue selection ) will make it possible to
tune its value at runtime.
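
In code, the classification could be as simple as the following sketch ( the
names and the default value are invented; the threshold is the runtime
tunable ):

enum rq_class { RQ_IO_RT, RQ_BATCH };

struct task {
	int counter;			/* dynamic priority / time slice left */
	int rt_prio;			/* > 0 for RT tasks */
};

int classify_threshold = 20;		/* the tunable trigger point ( counter ) */

static enum rq_class classify(const struct task *t)
{
	/* a high leftover counter means a short average run time,
	   i.e. an I/O bound task that deserves the low latency queue */
	if (t->rt_prio || t->counter > classify_threshold)
		return RQ_IO_RT;
	return RQ_BATCH;
}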




- Davide




2001-12-10 23:04:55

by Mike Kravetz

Subject: Re: [RFC] Scheduler queue implementation ...

On Sun, Dec 09, 2001 at 05:38:42PM -0800, Davide Libenzi wrote:
> Coming to the pipe example, let's take Larry's lat_ctx ( lmbench ).
> This bounces data through pipes using I/O bound tasks, and running
> vmstat together with lat_ctx 32 32 ... ( long list ), you'll see the run
> queue length barely reach 3 ( with 32 bouncing tasks ).
> It barely reaches 5 with 64 bouncing tasks.

This may show my ignorance, but ... Why would one expect much
more than 2 runnable tasks as a result of running lat_ctx?
This benchmark simply passes a token around a ring of tasks.
One task awakens the next, then goes to sleep. The only time
you have more than one runnable task is during the times when
the token is passed between tasks. In these transition times
I would rarely expect more than 2 tasks on the runqueue no
matter how many bouncing tasks you have.
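
( For reference, the structure being described is roughly the following ring
of pipes - a stripped-down sketch, not the actual lmbench source: )

#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int n = (argc > 1) ? atoi(argv[1]) : 8;	/* ring size */
	int rounds = 10000;
	int (*p)[2] = malloc(sizeof(int[2]) * n);
	char token = 'x';
	int i, j, r;

	for (i = 0; i < n; i++)
		pipe(p[i]);

	for (i = 1; i < n; i++) {
		if (fork() == 0) {		/* child i of the ring */
			for (j = 0; j < n; j++) {	/* keep only the two ends we use */
				if (j != i - 1) close(p[j][0]);
				if (j != i)     close(p[j][1]);
			}
			for (;;) {
				if (read(p[i - 1][0], &token, 1) != 1)
					_exit(0);	/* ring torn down */
				write(p[i][1], &token, 1);	/* pass it on */
			}
		}
	}
	for (i = 0; i < n; i++) {		/* parent keeps only its two ends */
		if (i != n - 1) close(p[i][0]);
		if (i != 0)     close(p[i][1]);
	}

	/* the parent closes the ring: inject the token, wait for it to come
	   back, repeat.  At any instant only the current reader is runnable. */
	for (r = 0; r < rounds; r++) {
		write(p[0][1], &token, 1);
		read(p[n - 1][0], &token, 1);
	}
	return 0;
}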

We created a benchmark similar to lat_ctx that would allow you
to control how many runnable tasks there are in the system.
Look for 'Reflex' benchmark at:
http://lse.sourceforge.net/scheduling/
You can think of this as a controlled way of running multiple
copies of lat_ctx in parallel.

--
Mike

2001-12-10 23:28:57

by Davide Libenzi

Subject: Re: [RFC] Scheduler queue implementation ...

On Mon, 10 Dec 2001, Mike Kravetz wrote:

> On Sun, Dec 09, 2001 at 05:38:42PM -0800, Davide Libenzi wrote:
> > Coming to the pipe example, let's take Larry's lat_ctx ( lmbench ).
> > This bounces data through pipes using I/O bound tasks, and running
> > vmstat together with lat_ctx 32 32 ... ( long list ), you'll see the run
> > queue length barely reach 3 ( with 32 bouncing tasks ).
> > It barely reaches 5 with 64 bouncing tasks.
>
> This may show my ignorance, but ... Why would one expect much
> more than 2 runnable tasks as a result of running lat_ctx?
> This benchmark simply passes a token around a ring of tasks.
> One task awakens the next, then goes to sleep. The only time
> you have more than one runnable task is during the times when
> the token is passed between tasks. In these transition times
> I would rarely expect more than 2 tasks on the runqueue no
> matter how many bouncing tasks you have.
>
> We created a benchmark similar to lat_ctx that would allow you
> to control how many runnable tasks there are in the system.
> Look for 'Reflex' benchmark at:
> http://lse.sourceforge.net/scheduling/
> You can think of this as a controlled way of running multiple
> copies of lat_ctx in parallel.

Mike, this is not to criticize lat_ctx ( I'll do that later :) ), but it
started from an example that Alan made about processes bouncing data through pipes.
Talking about lat_ctx, it's a benchmark that you have to use 1) not to plot
latencies against runqueue length on the x axis, and 2) only on UP systems (
take a look at the process distribution on SMP ).
I usually prefer to have a real measure of the scheduler latency and
a real runqueue length to put on the x axis.
I cannot have this with lat_ctx because:

1) you'll never have a given rq len
2) numbers are bogus

If you have to show how the latency varies with the runqueue length, you
have to have a benchmark that shows that length during the test.
That's why I prefer to use a cycle counter + the cpuhog program.
In that way I do not need to switch like crazy ( in which case you're
actually going to measure warm cache performance ) to get the measurement.
I can simply sample my system while it's switching "naturally", with my
system switch-load plus the cpuhog load on the runqueue.
So lat_ctx is basically a 1) simple, 2) warm cache, 3) UP, 4) not rqlen
dependent benchmark.




- Davide




2001-12-20 21:42:27

by Rik van Riel

Subject: Re: [RFC] Scheduler queue implementation ...

On Thu, 20 Dec 2001, Pavel Machek wrote:

> > > Alan, you're mixing switch mm costs with cache image reload ones.
> > > Note that equal mm does not mean matching cache image, at all.
> >
> > They are often close to the same thing. Take a look at the constraints
> > on virtually cached processors like the ARM where they _are_ the same thing.
> >
> > Equal mm for cpu sucking tasks often means equal cache image. On the
>
> Really?
>
> I'd guess that if cpu-bound software wants to use clone(CLONE_VM) to
> gain some performance, it had better work "mostly" in different
> memory areas on different cpus... But I could be wrong.

I guess you never used xmms or mozilla, then ;)

(where threads seem to be used for different stages of
processing data ... not sure about mozilla)

regards,

Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/

http://www.surriel.com/ http://distro.conectiva.com/

2001-12-20 21:34:57

by Pavel Machek

Subject: Re: [RFC] Scheduler queue implementation ...

Hi!

> > Alan, you're mixing switch mm costs with cache image reload ones.
> > Note that equal mm does not mean matching cache image, at all.
>
> They are often close to the same thing. Take a look at the constraints
> on virtually cached processors like the ARM where they _are_ the same thing.
>
> Equal mm for cpu sucking tasks often means equal cache image. On the

Really?

I'd guess that if cpu-bound software wants to use clone(CLONE_VM) to
gain some performance, it had better work "mostly" in different
memory areas on different cpus... But I could be wrong.

Pavel
--
"I do not steal MS software. It is not worth it."
-- Pavel Kankovsky

2001-12-21 16:51:19

by Alan

Subject: Re: [RFC] Scheduler queue implementation ...

> I'd guess that if cpu-bound software wants to use clone(CLONE_VM) to
> gain some performance, it had better work "mostly" in different
> memory areas on different cpus... But I could be wrong.

Lots of people use shared mm objects and threads for things like UI rather
than just for cpu hogging. If they have multiple cpu hogs doing that then
they want punishing (or better yet sending a copy of a document on caches)

2001-12-22 10:45:52

by Pavel Machek

Subject: Re: [RFC] Scheduler queue implementation ...


Hi!

> > I'd guess that if cpu-bound software wants to use clone(CLONE_VM) to
> > gain some performance, it had better work "mostly" in different
> > memory areas on different cpus... But I could be wrong.
>
> Lots of people use shared mm objects and threads for things like UI
> rather

But ... UI performance should not matter much. And that is *abuse* of
threads.

> than just for cpu hogging. If they have multiple cpu hogs doing that then
> they want punishing (or better yet sending a copy of a document on
> caches)

If I have a raytracer, and want to exploit 8 cpus, how do I do that?
Separate the scene into 8 pieces, make sure no r/w data are shared, and
clone(CLONE_VM). Are there other ways? [I do not know if people are
really doing it like that. *I* would do it that way. Is it bad?]

Pavel
--
"I do not steal MS software. It is not worth it."
-- Pavel Kankovsky

2001-12-22 14:13:51

by Alan

Subject: Re: [RFC] Scheduler queue implementation ...

> But ... UI performance should not matter much. And that is *abuse* of
> threads.

A UI thread often makes sense and it's a real latency/efficiency trade off
given the inconvenient human. That's why I was playing at scheduling all
running processes for an mm on one CPU - your comments make me think that
"for non hogs" might have some relevance.

> If I have a raytracer, and want to exploit 8 cpus, how do I do that?
> Separate the scene into 8 pieces, make sure no r/w data are shared, and
> clone(CLONE_VM). Are there other ways? [I do not know if people are
> really doing it like that. *I* would do it that way. Is it bad?]

That's probably the best SMP way.
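
( For completeness, a minimal sketch of that split - one clone(CLONE_VM) worker
per slice of the scene, each writing only its own band of the output so no r/w
data is shared; the "rendering" is obviously faked here: )

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define NCPUS	8
#define WIDTH	640
#define HEIGHT	480
#define STACK	(64 * 1024)

static unsigned char image[HEIGHT][WIDTH];	/* shared, but write-disjoint */

struct slice { int y0, y1; };

static int render_slice(void *arg)
{
	struct slice *s = arg;
	int x, y;

	for (y = s->y0; y < s->y1; y++)		/* only this worker's rows */
		for (x = 0; x < WIDTH; x++)
			image[y][x] = (unsigned char)((x ^ y) & 0xff);
	return 0;
}

int main(void)
{
	struct slice s[NCPUS];
	pid_t pid[NCPUS];
	int i;

	for (i = 0; i < NCPUS; i++) {
		char *stack = malloc(STACK);
		s[i].y0 = i * HEIGHT / NCPUS;
		s[i].y1 = (i + 1) * HEIGHT / NCPUS;
		/* share the address space, keep the data partitioned by hand */
		pid[i] = clone(render_slice, stack + STACK,
			       CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, &s[i]);
	}
	for (i = 0; i < NCPUS; i++)
		waitpid(pid[i], NULL, 0);
	return 0;
}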