Hello,
I wanted to know whether there is any support for gang scheduling in the
Linux kernel. If so, what is the status of the implementation, and are
there any documents about it?
thanks,
shreeni.
On Tue, 16 Jul 2002, shreenivasa H V wrote:
> I wanted to know whether there is any support for gang scheduling in the
> Linux kernel. If so, what is the status of the implementation, and are
> there any documents about it?
yes - the 'synchronous wakeup' feature is a form of gang scheduling. In
essence it uses real process-communication information to migrate 'related'
tasks to the same CPU. So it's automatic; there is no need to declare
processes to be part of a 'gang' in some formal (and thus fundamentally
imperfect) way.
(another form of 'gang scheduling' can be achieved by binding the 'parent'
process to a single CPU - all children will be bound to that CPU as well.)
Ingo
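
[A minimal sketch of the parent-binding approach Ingo mentions in the
parenthetical above. It assumes the glibc sched_setaffinity() wrapper;
the affinity syscall was brand new in the 2.5 kernels of this period, so
the exact interface shown here is an assumption. Every child forked after
the call inherits the mask, so the whole group stays on the chosen CPU.]

/*
 * Sketch: bind the parent to CPU 0, then fork workers; each child
 * inherits the affinity mask, so the whole "gang" shares one CPU.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);                      /* pin to CPU 0 */
	if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
		perror("sched_setaffinity");
		exit(1);
	}

	for (int i = 0; i < 4; i++) {
		if (fork() == 0) {              /* child inherits the mask */
			/* ... number crunching ... */
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}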
On Thu, Jul 18, 2002 at 06:21:41PM +0200, Ingo Molnar wrote:
> yes - the 'synchronous wakeup' feature is a form of gang scheduling. In
> essence it uses real process-communication information to migrate 'related'
> tasks to the same CPU. So it's automatic; there is no need to declare
> processes to be part of a 'gang' in some formal (and thus fundamentally
> imperfect) way.
> (another form of 'gang scheduling' can be achieved by binding the 'parent'
> process to a single CPU - all children will be bound to that CPU as well.)
Hit #1 on google.com: http://www.sw.nec.co.jp/hpc/sx-e/sx-world/no23/en10.pdf
[SX-5 SERIES TECHNICAL HIGHLIGHTS]
Overview of Gang Scheduling
Koichi Nakanishi (Senior Manager) and Koji Suzuki (Assistant Manager),
4th Development Department, 1st Computers Software Division,
Computers Software Operations Unit, NEC Corporation
SX WORLD, Autumn 1998, No. 23 Special Issue
[...]
The Gang scheduling function has the function of simultaneously
allocating the required number of CPUs when scheduling parallel
programs, and allows you to obtain almost the same performance when
multiple parallel programs are simultaneously executing as if the
programs were running alone.
I have approximately zero interest in this myself, but something seemed
off about the definition of gang scheduling being used in the post.
Cheers,
Bill
you are right in that the Linux scheduler does not enable classic
gang-scheduling: where multiple processes are scheduled 'at once' on
multiple CPUs. Can you point out specific (real-life) workloads where this
would be advantageous? Some testcode would be the best form of expressing
this. Pretty much any job that uses sane (kernel-based or kernel-helped)
synchronization should see good throughput.
Ingo
On Thu, Jul 18, 2002 at 07:40:19PM +0200, Ingo Molnar wrote:
> you are right in that the Linux scheduler does not enable classic
> gang-scheduling: where multiple processes are scheduled 'at once' on
> multiple CPUs. Can you point out specific (real-life) workloads where this
> would be advantageous? Some testcode would be the best form of expressing
> this. Pretty much any job that uses sane (kernel-based or kernel-helped)
> synchronization should see good throughput.
I will not advocate it myself. I only remembered the definition.
Cheers,
Bill
On Thu, Jul 18, 2002 at 07:40:19PM +0200, Ingo Molnar wrote:
>Can you point out specific (real-life) workloads where this
>would be advantageous?
It's mainly used for programs that need lots of processing power
chucked at a specific problem: the problem is first broken down into
several small pieces and each part is sent off to a different
processor. When each piece has been processed, they are all
recombined and the rest of the calculation continues. The problem
with this is that if any one of the pieces is delayed, all the
processors will sit idle waiting for the interrupted piece to be
processed before they can process the next set of pieces.
Sam
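
[A minimal sketch of the fork/join pattern Sam describes, using POSIX
threads; the thread count and the simulated delay are purely
illustrative. The barrier makes every worker wait for the slowest piece
before the next round can start.]

/*
 * Fork/join sketch: N workers each compute a piece, then meet at a
 * barrier before the next round. If one piece is delayed (simulated
 * with sleep()), the others sit idle at the barrier.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <unistd.h>

#define NWORKERS 4
#define ROUNDS   3

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
	long id = (long)arg;

	for (int round = 0; round < ROUNDS; round++) {
		/* ... compute this worker's piece of the problem ... */
		if (id == NWORKERS - 1)
			sleep(1);                 /* simulate a delayed piece */
		pthread_barrier_wait(&barrier);   /* everyone waits here */
	}
	return NULL;
}

int main(void)
{
	pthread_t tids[NWORKERS];

	pthread_barrier_init(&barrier, NULL, NWORKERS);
	for (long i = 0; i < NWORKERS; i++)
		pthread_create(&tids[i], NULL, worker, (void *)i);
	for (int i = 0; i < NWORKERS; i++)
		pthread_join(tids[i], NULL);
	pthread_barrier_destroy(&barrier);
	return 0;
}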
On Wed, 17 Jul 2002, Sam Mason wrote:
> It's mainly used for programs that need lots of processing power
> chucked at a specific problem: the problem is first broken down into
> several small pieces and each part is sent off to a different processor.
> When each piece has been processed, they are all recombined and the rest
> of the calculation continues. The problem with this is that if any
> one of the pieces is delayed, all the processors will sit idle waiting
> for the interrupted piece to be processed before they can process the
> next set of pieces.
well, how does gang scheduling solve this problem? Even gang-scheduled
tasks might be interrupted anytime on any CPU, by higher-priority tasks,
thus causing a delay.
Ingo
On Thu, Jul 18, 2002 at 10:08:13PM +0200, Ingo Molnar wrote:
>On Wed, 17 Jul 2002, Sam Mason wrote:
>> It's mainly used for programs that need lots of processing power
>> chucked at a specific problem: the problem is first broken down into
>> several small pieces and each part is sent off to a different processor.
>> When each piece has been processed, they are all recombined and the rest
>> of the calculation continues. The problem with this is that if any
>> one of the pieces is delayed, all the processors will sit idle waiting
>> for the interrupted piece to be processed before they can process the
>> next set of pieces.
>well, how does gang scheduling solve this problem? Even gang-scheduled
>tasks might be interrupted anytime on any CPU, by higher-priority tasks,
>thus causing a delay.
The important thing to remember is that this isn't a normal scheduling
method; it's used for VERY specialised software which is assumed to
have (almost) complete control of the machine. Gang-scheduled
processes would have the highest priority possible and would get
executed before any other processes. This works because the software
knows what it's doing and assumes that the user only ran one bit of
gang-scheduled software; if all of these are valid assumptions,
everything should work nicely.
Thinking about it, if a process just sets itself to be the highest
priority and constrains itself to appropriate processors, then it
wouldn't surprise me if that was all you need to get gang-scheduled
behaviour.
Sam
On Wed, 17 Jul 2002, Sam Mason wrote:
> The important thing to remember is that this isn't a normal scheduling
> method; it's used for VERY specialised software which is assumed to have
> (almost) complete control of the machine. [...]
so how does this differ from a normal Linux system that is used
exclusively? The specialized tasks will get evenly distributed between
CPUs (as long as the number of tasks is not higher than the number of
CPUs), and nothing should interrupt them.
> [...] Gang-scheduled processes would have the highest priority possible
> and would get executed before any other processes. This works because
> the software knows what it's doing and assumes that the user only ran
> one bit of gang-scheduled software; if all of these are valid
> assumptions, everything should work nicely.
>
> Thinking about it, if a process just sets itself to be the highest
> priority and constrains itself to appropriate processors, then it
> wouldn't surprise me if that was all you need to get gang-scheduled
> behaviour.
yeah. You can schedule processes 'manually' by using affinities - this is
for corner cases that know 100% well what they are doing. But the
default scheduler should get the '8 tasks running on an 8-way system' case
right as well - each CPU will run a single number-cruncher, and there won't
be any bouncing.
Ingo
On Thu, Jul 18, 2002 at 10:32:04PM +0200, Ingo Molnar wrote:
>On Wed, 17 Jul 2002, Sam Mason wrote:
>> The important thing to remember is that this isn't a normal scheduling
>> method; it's used for VERY specialised software which is assumed to have
>> (almost) complete control of the machine. [...]
>so how does this differ from a normal Linux system that is used
>exclusively? The specialized tasks will get evenly distributed between
>CPUs (as long as the number of tasks is not higher than the number of
>CPUs), and nothing should interrupt them.
At the moment I can't think of anything, but I'm sure that someone
with a bit of real life experience will show up and prove me wrong.
>> [...] Gang-scheduled processes would have the highest priority possible
>> and would get executed before any other processes. This works because
>> the software knows what it's doing and assumes that the user only ran
>> one bit of gang-scheduled software; if all of these are valid
>> assumptions, everything should work nicely.
>>
>> Thinking about it, if a process just sets itself to be the highest
>> priority and constrains itself to appropriate processors, then it
>> wouldn't surprise me if that was all you need to get gang-scheduled
>> behaviour.
>
>yeah. You can schedule processes 'manually' by using affinities - this is
>for corner cases that know 100% well what they are doing. But the
>default scheduler should get the '8 tasks running on an 8-way system' case
>right as well - each CPU will run a single number-cruncher, and there won't
>be any bouncing.
I think that would work. I would also assume that the program has
enough knowledge to start only the required number of tasks and
therefore let the scheduler figure out where to put them.
Ingo Molnar <[email protected]> writes:
> On Wed, 17 Jul 2002, Sam Mason wrote:
>
> > The important thing to remember is that this isn't a normal scheduling
> > method; it's used for VERY specialised software which is assumed to have
> > (almost) complete control of the machine. [...]
>
> so how does this differ from a normal Linux system that is used
> exclusively? The specialized tasks will get evenly distributed between
> CPUs (as long as the number of tasks is not higher than the number of
> CPUs), and nothing should interrupt them.
I think as long as there is only one task set, it doesn't matter. But
if you are trying to run several parallel applications at the same
time on the same machine, you want to schedule each task set as a gang
(at the same time on different processors) to give the impression that
each task set is running alone.
And afaik these applications typically use shared resources which are
protected by some kind of synchronization primitive like (user-level)
spinlocks. If one of the processes holds the lock and the kernel
preempts it for some reason, no other process needing access to the
shared resource is able to make progress. So the idea is to try to
schedule them all at the same time on different processors to ensure
that blocking time on the shared resource is really short.
regards,
Jean
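
[A minimal sketch of the effect Jean describes, using C11 atomics as a
user-level spinlock. Both threads are deliberately pinned to one CPU so
that whenever the kernel preempts the lock holder, the waiter spins
uselessly until the holder runs again; the iteration count is
illustrative.]

/*
 * User-level spinlock sketch: both threads are pinned to one CPU, so
 * when the holder of the spinlock is preempted, the waiter burns its
 * timeslice spinning until the holder runs again and releases the lock.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter;

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < 1000000; i++) {
		while (atomic_flag_test_and_set(&lock))
			;                       /* spin in user space */
		counter++;                      /* critical section */
		atomic_flag_clear(&lock);
	}
	return NULL;
}

int main(void)
{
	cpu_set_t mask;
	pthread_t a, b;

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);
	sched_setaffinity(0, sizeof(mask), &mask);  /* force contention */

	pthread_create(&a, NULL, worker, NULL);
	pthread_create(&b, NULL, worker, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("counter = %ld\n", counter);
	return 0;
}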
On Thursday 18 July 2002 01:40 pm, Ingo Molnar wrote:
> you are right in that the Linux scheduler does not enable classic
> gang-scheduling: where multiple processes are scheduled 'at once' on
> multiple CPUs. Can you point out specific (real-life) workloads where this
> would be advantageous? Some testcode would be the best form of expressing
> this. Pretty much any job that uses sane (kernel-based or kernel-helped)
> synchronization should see good throughput.
>
> Ingo
>
Go to any of the national labs.
I was involved in the gang-scheduler implementation for the IBM 340-node SP2
cluster at Lawrence Livermore National Lab.
Implementation aside, one can show that total system utilization can be
raised from ~60% to ~90% when doing gang scheduling rather than FIFO
scheduling, which one would otherwise do to get a dedicated machine.
We have tons of papers on this.
For this it seems sufficient to simply STOP apps at a larger granularity and
have that done through a user-level daemon. The kernel scheduler simply
schedules the runnable threads, which, given the user-level scheduler, would
always amount to a limited number of threads/tasks.
--
-- Hubertus Franke ([email protected])
On Fri, 19 Jul 2002, Hubertus Franke wrote:
> For this it seems sufficient to simply STOP apps at a larger granularity
> and have that done through a user-level daemon. The kernel scheduler
> simply schedules the runnable threads, which, given the user-level
> scheduler, would always amount to a limited number of threads/tasks.
yep, this is my suggestion as well. Any reason to do gang scheduling in
the scheduler and not via a userspace daemon that stops/resumes (and
binds) managed tasks explicitly?
Ingo
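
[A toy sketch of that user-level daemon approach: each gang member is
pinned to its own CPU, and whole gangs are resumed and stopped with
SIGCONT/SIGSTOP on a coarse quantum. The pid lists and quantum are made
up for illustration; a real daemon would get them from the job launcher.]

/*
 * User-level gang scheduler sketch: two gangs time-share the machine.
 * Each member is pinned to its own CPU; the daemon then alternately
 * SIGCONTs one gang and SIGSTOPs the other on a coarse quantum.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

#define GANG_SIZE 4
#define QUANTUM   2          /* seconds per gang - illustrative */

static void signal_gang(const pid_t *gang, int sig)
{
	for (int i = 0; i < GANG_SIZE; i++)
		kill(gang[i], sig);
}

static void bind_gang(const pid_t *gang)
{
	for (int i = 0; i < GANG_SIZE; i++) {
		cpu_set_t mask;
		CPU_ZERO(&mask);
		CPU_SET(i, &mask);              /* one CPU per member */
		sched_setaffinity(gang[i], sizeof(mask), &mask);
	}
}

int main(void)
{
	/* pids of the managed tasks - would come from a job launcher */
	pid_t gang_a[GANG_SIZE] = { 1001, 1002, 1003, 1004 };
	pid_t gang_b[GANG_SIZE] = { 2001, 2002, 2003, 2004 };

	bind_gang(gang_a);
	bind_gang(gang_b);
	signal_gang(gang_b, SIGSTOP);

	for (;;) {
		signal_gang(gang_a, SIGCONT);
		sleep(QUANTUM);
		signal_gang(gang_a, SIGSTOP);

		signal_gang(gang_b, SIGCONT);
		sleep(QUANTUM);
		signal_gang(gang_b, SIGSTOP);
	}
	return 0;
}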
On Saturday 20 July 2002 12:59 pm, Ingo Molnar wrote:
> On Fri, 19 Jul 2002, Hubertus Franke wrote:
> > For this it seems sufficient to simply STOP apps at a larger granularity
> > and have that done through a user-level daemon. The kernel scheduler
> > simply schedules the runnable threads, which, given the user-level
> > scheduler, would always amount to a limited number of threads/tasks.
>
> yep, this is my suggestion as well. Any reason to do gang scheduling in
> the scheduler and not via a userspace daemon that stops/resumes (and
> binds) managed tasks explicitly?
>
> Ingo
Not in our experience with a cluster-based application. Each node was
an SMP in that case.
On a single SMP I could imagine it, for instance, for parallel rendering
or similar tightly integrated parallel programs that need data
synchronization. Most of these apps assume a tightly coupled non-virtual
resource, i.e., scheduling of tasks is aligned.
SGI used to have that stuff in their base kernel. I read a paper about this
some years ago.
Again, at the beginning I'd go with a user-level scheduler approach that
would certainly satisfy the national labs etc. Most of the cluster
schedulers, like PBS and LoadLeveler, already provide that functionality.
--
-- Hubertus Franke ([email protected])
Hubertus Franke writes:
> On a single SMP I could imagine it, for instance, for parallel rendering
> or similar tightly integrated parallel programs that need data
> synchronization. Most of these apps assume a tightly coupled
> non-virtual resource, i.e., scheduling of tasks is aligned.
>
> SGI used to have that stuff in their base kernel. I read a paper about
> this some years ago. Again, at the beginning I'd go with a user-level
> scheduler approach that would certainly satisfy the national labs etc.
> Most of the cluster schedulers, like PBS and LoadLeveler, already
> provide that functionality.
A completely user-level solution may have some disadvantages, though,
such as delays in scheduling on/off (say if some daemon is used to
scan the process list). Perhaps we could add a small hack to the
scheduler such that when a task is about to be scheduled off, a signal
can be sent to a designated pid? Similarly, when a task is scheduled
on, another signal may be sent. An application that wanted to have
gang scheduling could then make use of this to STOP/CONT threads.
Regards,
Richard....
Permanent: [email protected]
Current: [email protected]
On Friday 19 July 2002 06:05 pm, Richard Gooch wrote:
> Hubertus Franke writes:
> > On a single SMP I could imagine it, for instance, for parallel rendering
> > or similar tightly integrated parallel programs that need data
> > synchronization. Most of these apps assume a tightly coupled
> > non-virtual resource, i.e., scheduling of tasks is aligned.
> >
> > SGI used to have that stuff in their base kernel. I read a paper about
> > this some years ago. Again, at the beginning I'd go with a user-level
> > scheduler approach that would certainly satisfy the national labs etc.
> > Most of the cluster schedulers, like PBS and LoadLeveler, already
> > provide that functionality.
>
> A completely user-level solution may have some disadvantages, though,
> such as delays in scheduling on/off (say if some daemon is used to
> scan the process list). Perhaps we could add a small hack to the
> scheduler such that when a task is about to be scheduled off, a signal
> can be sent to a designated pid? Similarly, when a task is scheduled
> on, another signal may be sent. An application that wanted to have
> gang scheduling could then make use of this to STOP/CONT threads.
>
> Regards,
>
> Richard....
> Permanent: [email protected]
> Current: [email protected]
I am glad you brought this up. I'd love to have a generic callback for this.
AIX has (or used to have) a process change handler that is called on
start/exit. In Linux, this idea could be done through a generic hook
settable through a module... that should be sufficient and would allow
other stuff to be handled as well. For instance, in the presence of fast
user-level communication (e.g. user-mapped windows to Myrinet), the current
process could be marked in the communication adapter.
Just thinking out loud.
--
-- Hubertus Franke ([email protected])