2004-10-01 23:39:57

by Andrew Morton

Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement


Paul, I'm having second thoughts regarding a cpusets merge. Having gone
back and re-read the cpusets-vs-CKRM thread from mid-August, I am quite
unconvinced that we should proceed with two orthogonal resource
management/partitioning schemes.

And CKRM is much more general than the cpu/memsets code, and hence it
should be possible to realize your end-users' requirements using an
appropriately modified CKRM, and a suitable controller.

I'd view the difficulty of implementing this as a test of the wisdom of
CKRM's design, actually.

The clearest statement of the end-user cpu and memory partitioning
requirement is this, from Paul:

> Cpusets - Static Isolation:
>
> The essential purpose of cpusets is to support isolating large,
> long-running, multinode compute bound HPC (high performance
> computing) applications or relatively independent service jobs,
> on dedicated sets of processor and memory nodes.
>
> The (unobtainable) ideal of cpusets is to provide perfect
> isolation, for such jobs as:
>
> 1) Massive compute jobs that might run hours or days, on dozens
> or hundreds of processors, consuming gigabytes or terabytes
> of main memory. These jobs are often highly parallel, and
> carefully sized and placed to obtain maximum performance
> on NUMA hardware, where memory placement and bandwidth is
> critical.
>
> 2) Independent services for which dedicated compute resources
> have been purchased or allocated, in units of one or more
> CPUs and Memory Nodes, such as a web server and a DBMS
> sharing a large system, but staying out of each other's way.
>
> The essential new construct of cpusets is the set of dedicated
> compute resources - some processors and memory. These sets have
> names, permissions, an exclusion property, and can be subdivided
> into subsets.
>
> The cpuset file system models a hierarchy of 'virtual computers',
> which hierarchy will be deeper on larger systems.
>
> The average lifespan of a cpuset used for (1) above is probably
> between hours and days, based on the job lifespan, though a couple
> of system cpusets will remain in place as long as the system is
> running. The cpusets in (2) above might have a longer lifespan;
> you'd have to ask Simon Derr of Bull about that.
>

Now, even that is not a very good end-user requirement because it does
prejudge the way in which the requirement's solution should be implemented.
Users don't require that their NUMA machines "model a hierarchy of
'virtual computers'". Users require that their NUMA machines implement
some particular behaviour for their work mix. What is that behaviour?

For example, I am unable to determine from the above whether the users
would be 90% satisfied with some close-enough ruleset which was implemented
with even the existing CKRM cpu and memory governors.

So anyway, I want to reopen this discussion, and throw a huge spanner in
your works, sorry.

I would ask the CKRM team to tell us whether there has been any progress in
this area, whether they feel that they have a good understanding of the end
user requirement, and to sketch out a design with which CKRM could satisfy
that requirement.

Thanks.


2004-10-02 06:10:32

by Paul Jackson

Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

[Adding Erich Focht <[email protected]>]

Are cpusets a special case of CKRM?

Andrew raises (again) the question - can CKRM meet the needs
which cpusets is trying to meet, enabling CKRM to subsume
cpusets?

Step 1 - Why cpusets?
Step 2 - Can CKRM do that?

Basically - cpusets implements dynamic soft partitioning to
provide jobs with sets of isolated CPUs and Memory Nodes.

The following begins Step 1, describing who has or is expected
to use cpusets, and what I understand of their requirements
for cpusets.

Cpuset Users
============

The users of cpusets want to run jobs in relative isolation, by
dividing the system into dynamically adjustable (w/o rebooting)
subsets of compute resources (dedicated CPUs and Memory Nodes),
and running one or sometimes several jobs within a given subset.

Many such users, if they push this model far enough, tend toward
using a batch manager, aka workload manager, such as OpenPBS
or LSF.

So the actual people who scream (gently) at me the most if I
miss something in cpusets for SGI are (or have been, on 2.4
kernels and/or Irix):

1) The PBS and LSF folks porting their workload
managers on top of cpusets, and

2) the SGI support engineers supporting customers
of our biggest configurations running high value
HPC applications.

3) Cpusets are also used by various graphics, storage and
soft-realtime projects to obtain dedicated or precisely
placed compute resources.

The other declared potential users of cpusets, Bull and NEC at
least, seem from what I can tell to have a somewhat different
focus, toward providing a mix of compute services with minimum
interference, from what I'd guess are more departmental size
systems.

Bull (Simon) and NEC (Erich) should also look closely at CKRM,
and then try to describe their requirements, so we can understand
whether CKRM, cpusets or both or neither can meet their needs.

If I've forgotten any other likely users of cpusets who are
lurking out there, I hope they will speak up and describe how
they expect to use cpusets, what they require, and whether
they find that CKRM would also meet their needs, or why not.

I will try to work with the folks in PBS and LSF a bit, to see
if I can get a simple statement of their essential needs that
would be useful to the CKRM folks. I'll begin taking a stab
at it, below.

CKRM folks - what would be the best presentation of CKRM that
I could point the PBS/LSF folks at?

It's usually easier for users to determine if something will
meet their needs if they can see and understand it. Trying to
do requirements analysis to drive design choices with no
feedback loop is crazy.

They'll know it when they see it, not a day sooner ;)

If some essential capability is missing, they might not
articulate that capability at all, until someone tries to
push a "solution" on them that is missing that capability.

Cpuset Requirements
===================

The three primary requirements that the SGI support engineers
on our biggest configurations keep telling me are most important
are:
1) isolation,
2) isolation, and
3) isolation.
A big HPC job running on a dedicated set of CPUs and Memory Nodes
should not lose any CPU cycles or Memory pages to outsiders.

Both the batch managers and the HPC shops need to be able to
guarantee exclusive use of some set of CPUs and Memory to a job.

The batch managers need to be able to efficiently list
the process id's of all tasks currently attached to a set.
By default, set membership should be inherited across fork and
exec, but batch managers need to be able to move tasks between
sets without regard to the process creation hierarchy.
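
For concreteness, here is a minimal sketch of how a batch manager might
do that through the cpuset file system, assuming it is mounted at
/dev/cpuset and exposes a per-cpuset 'tasks' file as in the patch in
Andrew's tree; the cpuset name, paths and error handling are
illustrative only:

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write a pid to the cpuset's 'tasks' file to attach that task to the
 * cpuset, then read the same file back to list all member pids. */
static int attach_and_list(const char *cpuset, pid_t pid)
{
    char file[256];
    char line[64];
    FILE *f;

    snprintf(file, sizeof(file), "%s/tasks", cpuset);

    f = fopen(file, "w");
    if (!f)
        return -1;
    fprintf(f, "%d\n", (int)pid);   /* move the task into the cpuset */
    fclose(f);

    f = fopen(file, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);        /* one pid per line */
    fclose(f);
    return 0;
}

int main(void)
{
    /* "/dev/cpuset/batch1" is a hypothetical, already-created cpuset. */
    return attach_and_list("/dev/cpuset/batch1", getpid()) ? 1 : 0;
}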

A job running in a cpuset should be able to use various configuration,
resource management (CKRM for example), cpu and memory (numa) affinity
tools, performance analysis and thread management facilities within a
set, including pthreads and MPI, independently from what is happening
on the rest of the system.

One should be able to run a stock 3rd party app (Oracle is
the canonical example) on a system side-by-side with a special
customer app, each in their own set, neither interfering with
the other, and the Oracle folks happy that their app is running
in a supported environment.

And of course, a cpuset needs to be able to be set up and torn
down without impacting the rest of the system, and then its
CPU and Memory resources put back in the free pool, to be
reallocated in different configurations for other cpusets.

The batch or workload manager folks want to be able to hibernate and
migrate jobs, so that they can move long-running jobs around to
get higher-priority jobs through, and so that they can sensibly
overcommit without thrashing. And they want to be able to
add and remove CPU and Memory resources to and from an existing cpuset,
which might appear to jobs currently executing within that
cpuset as resources going on and offline.

The HPC apps folks need to control some kernel memory
allocations, swapping, classic Unix daemons and kernel threads
along cpuset lines as well. When the kernel page cache is
many times larger than the memory on a single node, leaving
placement up to willy-nilly kernel decisions can totally blow
out a node's memory, which is deadly to the performance of
the job using that node. Similarly, one job can interfere
with another if it abuses the swapper. Kernel threads that
don't require specific placement, as well as the classic Unix
daemons, both need to be kept off the CPUs and Memory Nodes
used for the main applications, typically by confining them to
their own small cpuset.

The graphics, realtime and storage folks in particular need
to place their cpusets on very specific CPUs and Memory Nodes
near some piece of hardware of interest to them. The pool
of CPUs and Memory Nodes is not homogeneous to these folks.
If not all CPUs are the same speed, or not all Memory Nodes
the same size, then CPUs and Memory Nodes are not homogeneous
to the HPC folks either. And in any case, big numa machines
have complex bus topologies, which the system admins or batch
managers have to take into account when deciding which CPUs
and Memory Nodes to put together into a cpuset.

There must not be any presumption that composition of cpusets
is done on a per-node basis, with all the CPUs and Memory on
a node the unit of allocation. While this is often the case,
sometimes other combinations of CPUs and Memory Nodes are needed,
not along node boundaries.

For the larger configurations, I am beginning to see requests
for hierarchical "soft partitions", typically reflecting the
complex corporate or government organization that purchased
the big system and needs to share it amongst different,
semi-uncooperative groups and subgroups. I anticipate that
SGI will see more of this over the next few years, but I will
(reluctantly) admit that a hierarchy of some fixed depth of
two or three could meet the current needs as I hear them.

Even flat-model (no hierarchy) uses require some way to
name and control access to cpusets, with distinct permissions
for examining, attaching to, and changing them, that can be
used and managed on a system-wide basis.

At least Bull has a requirement to automatically remove a
cpuset when the last user of it exits - which the current
implementation in Andrew's tree provides by calling out to a
user level program on the last release. User level code can
handle the actual removal.

Bull also has a requirement for the kernel to provide
cpuset-relative numbering of CPUs and Memory Nodes to some
applications, so that they can be run oblivious to the fact
that they don't own the entire machine. This requirement is
not satisfied by the current implementation in Andrew's tree -
Simon has a separate patch for that.
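
Purely to illustrate what "cpuset-relative numbering" means (this is
not Simon's patch, and the names below are made up): a physical CPU is
renumbered by its position within the cpuset's mask, so an application
confined to physical CPUs 4-7 sees them as CPUs 0-3.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t cpumask;       /* stand-in for the kernel's cpumask_t */

/* Cpuset-relative number of phys_cpu: its index among the set bits
 * of the cpuset's mask, or -1 if the cpu is not in the cpuset. */
static int relative_cpu(cpumask cpus, int phys_cpu)
{
    int cpu, rel = 0;

    for (cpu = 0; cpu < 64; cpu++) {
        if (!((cpus >> cpu) & 1))
            continue;
        if (cpu == phys_cpu)
            return rel;
        rel++;
    }
    return -1;
}

int main(void)
{
    cpumask cpus = 0xf0;        /* cpuset owns physical CPUs 4-7 */

    printf("phys 4 -> %d, phys 7 -> %d, phys 2 -> %d\n",
           relative_cpu(cpus, 4), relative_cpu(cpus, 7),
           relative_cpu(cpus, 2));
    return 0;
}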

Cpusets needs to be able to interoperate with hotplug, which
can be a bit of a challenge, given the tendency of cpuset code
to stash its own view of the current system CPU/Memory
configuration.

The essential implementation hooks required by cpusets follow from
their essential purpose. Cpusets control on which CPUs a task may
be scheduled, and on which Memory Nodes it may allocate memory.
Therefore hooks are required in the scheduler and allocator, which
constrain scheduling and allocation to only use the allowed CPUs
and Memory Nodes.
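
To make the shape of those hooks concrete, here is a small
self-contained sketch (userspace C with stand-in types; the kernel side
works on cpumask_t/nodemask_t in the task struct, so the names below
are only illustrative): the scheduler-side test rejects CPUs outside a
task's allowed set, and the allocator-side test rejects Memory Nodes
outside its allowed set.

#include <stdint.h>
#include <stdio.h>

/* Stand-in for the kernel's cpumask_t / nodemask_t (illustrative only). */
typedef uint64_t mask_t;
#define MASK_SET(m, bit)   ((m) |= ((mask_t)1 << (bit)))
#define MASK_TEST(m, bit)  (((m) >> (bit)) & 1)

/* Stand-in for the per-task fields a cpuset constrains. */
struct task {
    mask_t cpus_allowed;    /* CPUs this task may be scheduled on */
    mask_t mems_allowed;    /* Memory Nodes it may allocate from  */
};

/* Scheduler-side hook: may this task run on this cpu? */
static int task_allowed_on_cpu(const struct task *t, int cpu)
{
    return MASK_TEST(t->cpus_allowed, cpu);
}

/* Allocator-side hook: may this task take pages from this node? */
static int task_allowed_on_node(const struct task *t, int node)
{
    return MASK_TEST(t->mems_allowed, node);
}

int main(void)
{
    struct task t = { 0, 0 };

    MASK_SET(t.cpus_allowed, 4);    /* cpuset grants CPUs 4 and 5 ... */
    MASK_SET(t.cpus_allowed, 5);
    MASK_SET(t.mems_allowed, 1);    /* ... and Memory Node 1          */

    printf("cpu 3: %d, cpu 4: %d\n",
           task_allowed_on_cpu(&t, 3), task_allowed_on_cpu(&t, 4));
    printf("node 0: %d, node 1: %d\n",
           task_allowed_on_node(&t, 0), task_allowed_on_node(&t, 1));
    return 0;
}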

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-02 14:53:40

by Dipankar Sarma

Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Fri, Oct 01, 2004 at 11:06:44PM -0700, Paul Jackson wrote:
> Cpuset Requirements
> ===================
>
> The three primary requirements that the SGI support engineers
> on our biggest configurations keep telling me are most important
> are:
> 1) isolation,
> 2) isolation, and
> 3) isolation.
> A big HPC job running on a dedicated set of CPUs and Memory Nodes
> should not lose any CPU cycles or Memory pages to outsiders.
>
....

>
> A job running in a cpuset should be able to use various configuration,
> resource management (CKRM for example), cpu and memory (numa) affinity
> tools, performance analysis and thread management facilities within a
> set, including pthreads and MPI, independently from what is happening
> on the rest of the system.
>
> One should be able to run a stock 3rd party app (Oracle is
> the canonical example) on a system side-by-side with a special
> customer app, each in their own set, neither interfering with
> the other, and the Oracle folks happy that their app is running
> in a supported environment.

One of the things we are working on is to provide exactly something
like this. Not just that, within the isolated partitions, we want
to be able to provide a completely different environment. For example,
we need to be able to run one or more realtime processes of an application
in one partition while the other partition runs the database portion
of the application. For this to succeed, they need to be completely
isolated.

It would be nice if someone explained a potential CKRM implementation for
this kind of complete isolation.

Thanks
Dipankar

2004-10-02 15:49:25

by Marc E. Fiuczynski

Subject: RE: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Paul & Andrew,

For PlanetLab (http://www.planet-lab.org) we also care very much about isolation
between different users. Maybe not to the same degree as your users.
Nonetheless, penning in resource hogs is very important to us. We are
giving CKRM a shot. Over the past two weeks I have worked with Hubertus,
Chandra, and Shailabh to iron out various bugs. The controllers appear to be
working to a first approximation. From our perspective, it is not so much the
specific resource controllers as the CKRM framework that is of importance.
I.e., we certainly plan to test and implement other resource controllers for
CPU, disk I/O and memory isolation.

For cpu isolation, would it suffice to use an HTB-based cpu scheduler? This
is essentially what the XEN folks are using to ensure strong isolation
between separate Xen domains. An implementation of such a scheduler exists
as part of the linux-vserver project and the port of that to CKRM should be
straightforward. In fact, I am thinking of doing such a port for PlanetLab
just to have an alternative to the existing CKRM cpu controller. Seems like
an implementation of that scheduler (or a modification to the existing CKRM
controller) + some support for CPU affinity + hotplug CPU support might
approach your cpuset solution. Correct me if I completely missed it.

For memory isolation, I am not sufficiently familiar with NUMA style
machines to comment on this topic. The CKRM memory controller is
interesting, but we have not used it sufficiently to comment.

Finally, in terms of isolation, we have mixed together CKRM with VSERVERs:
using CKRM for performance isolation and Vserver (for lack of a better
name) for "view" isolation. Maybe your users care about the vserver style of
isolation. We have an anon cvs server with our kernel (which is based on
Fedora Core 2 1.521 + vserver 1.9.2 + the latest ckrm e16 framework and
resource controllers that are not even available yet at ckrm.sf.net), which
you are welcome to play with.

Best regards,
Marc

-----------
Marc E. Fiuczynski
PlanetLab Consortium --- OS Taskforce PM
Princeton University --- Research Scholar
http://www.cs.princeton.edu/~mef


2004-10-02 16:26:09

by Hubertus Franke

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement



Marc E. Fiuczynski wrote:

> Paul & Andrew,
>
> For PlanetLab (http://www.planet-lab.org) we also care very much about isolation
> between different users. Maybe not to the same degree as your users.
> Nonetheless, penning in resource hogs is very important to us. We are
> giving CKRM a shot. Over the past two weeks I have worked with Hubertus,
> Chandra, and Shailabh to iron various bugs. The controllers appear to be
> working at first approximation. From our perspective, it is not so much the
> specific resource controllers but the CKRM framework that is of importance.
> I.e., we certainly plan to test and implement other resource controllers for
> CPU, disk I/o and memory isolation.
>
> For cpu isolation, would it suffice to use a HTB-based cpu scheduler. This
> is essentially what the XEN folks are using to ensure strong isolation
> between separate Xen domains. An implementation of such a scheduler exists
> as part of the linux-vserver project and the port of that to CKRM should be
> straightforward. In fact, I am thinking of doing such a port for PlanetLab
> just to have an alternative to the existing CKRM cpu controller. Seems like
> an implementation of that scheduler (or a modification to the existing CKRM
> controller) + some support for CPU affinity + hotplug CPU support might
> approach your cpuset solution. Correct me if I completely missed it.

Marc, cpusets lead to physical isolation.

>
> For memory isolation, I am not sufficiently familiar with NUMA style
> machines to comment on this topic. The CKRM memory controller is
> interesting, but we have not used it sufficiently to comment.
>
> Finally, in terms of isolation, we have mixed together CKRM with VSERVERs.
> Using CKRM for performance isolation and Vserver (for the lack of a better
> name) "view" isolation. Maybe your users care about the vserver style of
> islation. We have an anon cvs server with our kernel (which is based on
> Fedora Core 2 1.521 + vserver 1.9.2 + the latest ckrm e16 framework and
> resource controllers that are not even available yet at ckrm.sf.net), which
> you are welcome to play with.
>
> Best regards,
> Marc
>
> -----------
> Marc E. Fiuczynski
> PlanetLab Consortium --- OS Taskforce PM
> Princeton University --- Research Scholar
> http://www.cs.princeton.edu/~mef
>

2004-10-02 16:24:37

by Hubertus Franke

Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement


OK, let me respond to this (again...) from the perspective of cpus.
This should to some extent also cover Andrew's request as well as
Paul's earlier message.

I see cpumem sets to be orthogonal to CKRM cpu share allocations.
AGAIN.
I see cpumem sets to be orthogonal to CKRM cpu share allocations.

In its essence, "cpumem sets" is a hierarchical mechanism of successively
tighter constraints on the affinity mask of tasks.
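
To put that in code form - a minimal standalone illustration only, with
made-up names, not the cpuset patch itself - each level of the
hierarchy can only narrow, never widen, what its parent allows:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t cpumask;   /* stand-in for the kernel's cpumask_t */

struct cpuset {
    struct cpuset *parent;
    cpumask cpus;           /* CPUs this cpuset is configured with */
};

/* Effective mask: this cpuset's CPUs intersected with every ancestor's. */
static cpumask effective_cpus(const struct cpuset *cs)
{
    cpumask m = cs->cpus;

    for (cs = cs->parent; cs; cs = cs->parent)
        m &= cs->cpus;
    return m;
}

int main(void)
{
    struct cpuset root  = { NULL,   0xffff };   /* CPUs 0-15 */
    struct cpuset batch = { &root,  0x00f0 };   /* CPUs 4-7  */
    struct cpuset job   = { &batch, 0x0030 };   /* CPUs 4-5  */

    printf("job may run on mask 0x%llx\n",
           (unsigned long long)effective_cpus(&job));
    return 0;
}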

The O(1) scheduler today does not know about cpumem sets. It operates
on the level of affinity masks to adhere to the constraints specified
based on cpu masks.

The CKRM cpu scheduler also adheres to affinity mask constraints and
frankly does not care how they are set.

So I do not see what at the scheduler level the problem will be.
If you want system isolation you deploy cpumem sets. If you want overall
share enforcement you choose ckrm classes.
In addition you can use both, with the understanding that cpumem sets
cannot and will not be violated even if that means that shares are not maintained.

Since you want orthogonality, cpumem sets could be implemented as a
different "classtype". They would not belong to the taskclass and thus
are independent from what we consider the task class.



The tricky stuff comes in from the fact that CKRM assumes a system wide
definition of a class and a system wide "calculation" of shares.






Dipankar Sarma wrote:
> On Fri, Oct 01, 2004 at 11:06:44PM -0700, Paul Jackson wrote:
>
>>Cpuset Requirements
>>===================
>>
>>The three primary requirements that the SGI support engineers
>>on our biggest configurations keep telling me are most important
>>are:
>> 1) isolation,
>> 2) isolation, and
>> 3) isolation.
>>A big HPC job running on a dedicated set of CPUs and Memory Nodes
>>should not lose any CPU cycles or Memory pages to outsiders.
>>
>
> ....
>
>
>>A job running in a cpuset should be able to use various configuration,
>>resource management (CKRM for example), cpu and memory (numa) affinity
>>tools, performance analysis and thread management facilities within a
>>set, including pthreads and MPI, independently from what is happening
>>on the rest of the system.
>>
>>One should be able to run a stock 3rd party app (Oracle is
>>the canonical example) on a system side-by-side with a special
>>customer app, each in their own set, neither interfering with
>>the other, and the Oracle folks happy that their app is running
>>in a supported environment.
>
>
> One of the things we are working on is to provide exactly something
> like this. Not just that, within the isolated partitions, we want
> to be able to provide completely different environment. For example,
> we need to be able to run or more realtime processes of an application
> in one partition while the other partition runs the database portion
> of the application. For this to succeed, they need to be completely
> isolated.
>
> It would be nice if someone explains a potential CKRM implementation for
> this kind of complete isolation.

Alternatively to what is described above, if you want to do cpumemsets
purely through the current implementation, I'd approach it as follows:

- Start with the current cpumemset implementation.
- Write the CKRM controller that simply replaces the API of the
cpumemset.
- Now you have the object hierarchy through /rcfs/taskclass
- Change the memsets through the generic attributes (discussed in
earlier emails to extend the static fixed shares notation)
- DO NOT USE CPU shares (always specify DONTCARE).

I am not saying that this is the most elegant solution, but neither
is trying to achieve proportional shares through cpumemsets.
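
Roughly, the shape of such a shim might look like the sketch below. To
be clear, this is a standalone illustration with invented names - it is
not CKRM's actual controller or RCFS interface - it only shows the idea
of forwarding class attributes to cpumemset-style masks:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t mask_t;

/* Hypothetical per-class state: the cpus/mems a class is confined to.
 * This is NOT the real CKRM controller API - only the shape of a shim
 * that forwards class attributes to cpumemset-style masks. */
struct cpumemset_class {
    mask_t cpus;
    mask_t mems;
};

/* Parse a "cpus=<hex>" or "mems=<hex>" attribute written through the
 * class filesystem and update the class; a real controller would then
 * propagate the masks to every task attached to the class. */
static int set_attribute(struct cpumemset_class *cls, const char *attr)
{
    unsigned long long v;

    if (sscanf(attr, "cpus=%llx", &v) == 1)
        cls->cpus = v;
    else if (sscanf(attr, "mems=%llx", &v) == 1)
        cls->mems = v;
    else
        return -1;              /* unknown attribute */
    return 0;
}

int main(void)
{
    struct cpumemset_class cls = { 0, 0 };

    set_attribute(&cls, "cpus=f0");     /* CPUs 4-7      */
    set_attribute(&cls, "mems=2");      /* Memory Node 1 */
    printf("cpus=0x%llx mems=0x%llx\n",
           (unsigned long long)cls.cpus, (unsigned long long)cls.mems);
    return 0;
}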


>
> Thanks
> Dipankar
>

Hope this helps.


2004-10-02 17:50:35

by Paul Jackson

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Marc writes:
>
> For PlanetLab (http://www.planet-lab.org) we also care very much about isolation
> between different users. Maybe not to the same degree as your users.
> Nonetheless, penning in resource hogs is very important to us.

Thank-you for your report, Marc.

Before I look at code, I think we could do with a little more
discussion of usage patterns and requirements.

Despite my joke about "1) isolation, 2) isolation, and 3) isolation"
being the most important requirements on cpusets, there are further
requirements presented by typical cpuset users, which I tried to spell
out in my previous post.

Could you do a couple more things to further help this discussion:

1) I know nothing at this moment of what PlanetLab is or what
they do. Could you describe this a bit - your business, your
customers usage patterns and how these make use of CKRM? Perhaps
a couple of web links will help here. I will also do a Google
search now, in an effort to become more educated on PlanetLab.

I might come away from this thinking one of:

a. Dang - that sounds a lot like what my cpuset users are
doing. If CKRM meets PlanetLab's needs, it might meet
my users needs too. I should put aside my skepticism
and approach Andrew's proposal to have CKRM supplant
cpusets with a more open mind than (I will confess)
I have now.

b. No, no - that's something different. PlanetLab doesn't
have the particular requirements x, y and z that my cpuset
users do. Rather they have other requirements, a, b and
c, that seem to fit my understanding of CKRM well, but
not cpusets.

2) I made some effort to present the usage patterns and
requirements of cpuset users in my post. Could you read
it and comment on the requirements I presented.

I'd be interested to know, for each cpuset requirement I
presented, which of the following multiple choices applies
in your case:

a. Huh - I (Marc) don't understand what you (pj) are
saying here well enough to comment further.

b. Yes - this sounds just like something PlanetLab needs,
perhaps rephrasing the requirement in terms more familiar
to you. And CKRM meets this requirement this way ...

c. No - this is not a big need PlanetLab has of its resource
management technology (perhaps noting in this case,
whether, in your understanding of CKRM, CKRM addresses
this requirement anyway, even though you don't need it).

I encourage you to stay "down to earth" in this, at least initially.
Speak in terms familiar to you, and present the actual, practical
experience you've gained at PlanetLab.

I want to avoid the trap of premature abstraction:

Gee - both CKRM and cpusets deal with resource management, both
have kernel hooks in the allocators and schedulers, both have
hierarchies and both provide isolation of some sort. They must
be two solutions to the same problem (or at least, since CKRM
is obviously bigger, it must be a solution to a superset of
the problems that cpusets addresses), and so we should pick one
(the superset, no doubt) and drop the other to avoid duplication.

Let us begin this discussion with a solid grounding in the actual
experiences we bring to this thread.

Thank-you.

"I'm thinking of a 4 legged, long tailed, warm blooded
creature, commonly associated with milk, that makes a
sound written in my language starting with the letter 'M'.
The name of the animal is a three letter word starting
with the letter 'C'. We had many of them in the barn on
my Dad's dairy farm."

Mooo ? [cow]

No - meow. [cat]

And no, we shouldn't try to catch mice with cows, even
if they are bigger than cats.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-02 17:56:19

by Paul Jackson

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Hubertus wrote:
>
> Marc, cpusets lead to physical isolation.

This is slightly too terse for my dense brain to grok.
Could you elaborate just a little, Hubertus? Thanks.

(Try to quote less - I almost missed your reply in
the middle of all the quoted stuff.)

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-02 18:19:14

by Hubertus Franke

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement



Paul Jackson wrote:
> Hubertus wrote:
>
>>Marc, cpusets lead to physical isolation.
>
>
> This is slightly too terse for my dense brain to grok.
> Could you elaborate just a little, Hubertus? Thanks.
>

A minimal quote from your website :-)

"CpuMemSets provides a new Linux kernel facility that enables system
services and applications to specify on which CPUs they may be
scheduled, and from which nodes they may allocate memory."

Since I have addressed the cpu section, it seems obvious that
in order to ISOLATE different workloads, you associate them with
non-overlapping cpusets; thus technically they are physically isolated
from each other on the chosen CPUs.

Given that cpuset hierarchies translate into cpu-affinity masks,
this desired isolation can result in lost cycles globally.

I believe this to be orthogonal to share settings. To me both
are extremely desirable features.

I also pointed out that if you separate mechanism from API, it
is possible to move the CPU set API under the CKRM framework.
I have not thought about the memory aspect.

-- Hubertus


2004-10-02 18:27:36

by Paul Jackson

Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> I see cpumem sets to be orthogonal to CKRM cpu share allocations.

I agree. Thank-you for stating that, Hubertus.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-02 19:17:52

by Paul Jackson

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Hubertus wrote:
>
> A minimal quote from your website :-)

Ok - now I see what you're saying.

Let me expound a bit on this line, from a different perspective.

While big NUMA boxes provide the largest single system image
boxes currently available, they have their complications. The bus and
cache structures and geometry are complex and multilayered.

For more modest, more homogeneous systems, one can benefit from putting
CKRM controllers (I hope I'm using this term correctly here) on things
like memory pages, cpu cycles, disk i/o, and network i/o in order to
provide a fairly rich degree of control over what share of resources
each application class receives, and obtain both efficient and
controlled balance of resource usage.

But for the big NUMA configuration, running some of our customers' most
performance-critical applications, one cannot achieve the desired
performance by trying to control all the layers of cache and bus, in
complex geometries, with their various interactions.

So instead one ends up using an orthogonal (thanks, Hubertus) and
simpler mechanism - physical isolation(*). These nodes, and all their
associated hardware, are dedicated to the sole use of this critical
application. There is still sometimes non-trivial work done, for a
given application, to tune its performance, but by removing (well, at
least radically reducing) the interactions of other unknown applications
on the same hardware resources, the tuning of the critical application
now becomes a practical, solvable task.

In corporate organizations, this resembles the difference between having
separate divisions with their own P&L statements, kept at arms length
for all but a few common corporate services [cpusets], versus the more
dynamic trade-offs made within a single division, moving limited
resources back and forth in order to meet changing and sometimes
conflicting objectives in accordance with the priorities dictated by
upper management [CKRM].

(*) Well, not physical isolation in the sense of unplugging the
interconnect cables. Rather logical isolation of big chunks
of the physical hardware. And not pure 100% isolation, as
would come from running separate kernel images, but minimal
controlled isolation, with the ability to keep out anything
that causes interference if it doesn't need to be there, on
those particular CPUs and Memory Nodes.

And our customers _do_ want to manage these logically isolated
chunks as named "virtual computers" with system managed permissions
and integrity (such as the system-wide attribute of "Exclusive"
ownership of a CPU or Memory by one cpuset, and a robust ability
to list all tasks currently in a cpuset). This is a genuine user
requirement to my understanding, apparently contrary to Andrew's.

The above is not the only use of cpusets - there's also providing
a base for ports of PBS and LSF workload managers (which if I recall
correctly arose from earlier HPC environments similar to the one
I described above), and there's the work being done by Bull and NEC,
which can better be spoken to by representatives of those companies.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-02 20:45:35

by Andrew Morton

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Hubertus Franke <[email protected]> wrote:
>
> Marc, cpusets lead to physical isolation.

Despite what Paul says, his customers *do not* "require" physical isolation
[*]. That's like an accountant requiring that his spreadsheet be written
in Pascal. He needs slapping.

Isolation is merely the means by which cpusets implements some higher-level
customer requirement.

I want to see a clearer description of what that higher-level requirement is.

Then I'd like to see some thought put into whether CKRM (with probably a new
controller) can provide a good-enough implementation of that requirement.

Coming at this from the other direction: CKRM is being positioned as a
general purpose resource management framework, yes? Isolation is a simple
form of resource management. If the CKRM framework simply cannot provide
this form of isolation then it just failed its first test, did it not?

[*] Except for the case where there is graphics (or other) hardware close
to a particular node. In that case it is obvious that CPU-group pinning is
the only way in which to satisfy the top-level requirement of "make access
to the graphics hardware be efficient".

2004-10-02 23:15:48

by Hubertus Franke

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement



Andrew Morton wrote:
> Hubertus Franke <[email protected]> wrote:
>
>>Marc, cpusets lead to physical isolation.
>
>
> Despite what Paul says, his customers *do not* "require" physical isolation
> [*]. That's like an accountant requiring that his spreadsheet be written
> in Pascal. He needs slapping.
>
> Isolation is merely the means by which cpusets implements some higher-level
> customer requirement.
>
> I want to see a clearer description of what that higher-level requirement is.
>
> Then I'd like to see some thought put into whether CKRM (with probably a new
> controller) can provide a good-enough implementation of that requirement.
>

CKRM could do so. We already provide the name space and the class
hierarchy. If a cpuset is associated with a class, then the class
controller can set the appropriate masks in the system.

The issue that Paul correctly pointed out is that if you associate the
current task classes, i.e. set cpu and i/o shares, then one MIGHT have
conflicting directives to the system.
This can be avoided by not utilizing cpu shares at that point, or by
living with the potential share imbalance that arises from being forced
into the various affinity constraints of the tasks.
But we already have to live with that anyway when resources create
dependencies, such as when too little memory can potentially impact the
obtained cpu share.

Alternatively, cpumem sets could be introduced as a whole new classtype
that, similar to the socket classtype, will have this one controller
associated.

So to me cpumem sets as a concept are useful, so I won't be doing that
whopping, but they can be integrated into CKRM as a classtype/controller
concept. Particularly for NUMA machines it makes sense in the absence of
more sophisticated and (sub)optimal placement by the OS.

> Coming at this from the other direction: CKRM is being positioned as a
> general purpose resource management framework, yes? Isolation is a simple
> form of resource management. If the CKRM framework simply cannot provide
> this form of isolation then it just failed its first test, did it not?
>

That's fair to say. I think it is feasible, by utilizing the guts of the
cpumem set and wrapping the CKRM RCFS and class objects around it.

> [*] Except for the case where there is graphics (or other) hardware close
> to a particular node. In that case it is obvious that CPU-group pinning is
> the only way in which to satisfy the top-level requirement of "make access
> to the graphics hardware be efficient".

Yipp ... but it is also useful if one has limited faith in the system
to always do the right thing. If I have no control over where tasks go, I
can potentially end up introducing heavy bus traffic (over the NUMA link).
There's a good reason why in many HPC deployments, applications try to
bypass the OS ...

Hope this helps.


2004-10-02 23:21:39

by Peter Williams

Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Hubertus Franke wrote:
>
> OK, let me respond to this (again...) from the perspective of cpus.
> This should to some extend also cover Andrew's request as well as
> Paul's earlier message.
>
> I see cpumem sets to be orthogonal to CKRM cpu share allocations.
> AGAIN.
> I see cpumem sets to be orthogonal to CKRM cpu share allocations.
>
> In its essense, "cpumem sets" is a hierarchical mechanism of sucessively
> tighter constraints on the affinity mask of tasks.
>
> The O(1) scheduler today does not know about cpumem sets. It operates
> on the level of affinity masks to adhere to the constraints specified
> based on cpu masks.

This is where I see the need for "CPU sets". I.e. as a
replacement/modification to the CPU affinity mechanism basically adding
an extra level of abstraction to make it easier to use for implementing
the type of isolation that people seem to want. I say this because,
strictly speaking and as you imply, the current affinity mechanism is
sufficient to provide that isolation BUT it would be a huge pain to
implement.

The way I see it you just replace the task's affinity mask with a
pointer to its "CPU set" which contains the affinity mask shared by
tasks belonging to that set (and this is used by try_to_wake_up() and
the load balancing mechanism to do their stuff instead of the per task
affinity mask). Then when you want to do something like take a CPU away
from one group of tasks and give it to another group of tasks it's just
a matter of changing the affinity masks in the sets instead of visiting
every one of the tasks individually and changing their masks. There
should be no need to explicitly move tasks off the "lost" CPU after such
a change as it should/could be done next time that they go through
try_to_wake_up() and/or finish a time slice. Moving a task from one CPU
set to another would be a similar process to the current change of
affinity mask.
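
A minimal sketch of that data structure idea (standalone, with invented
names; the real thing would of course operate on the kernel's cpumask_t
and task_struct): tasks point at a shared set, so retargeting a group
means updating one mask rather than walking every task.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t cpumask;       /* stand-in for the kernel's cpumask_t */

/* One shared object per "CPU set"; every task in the set points here. */
struct cpu_set {
    cpumask cpus_allowed;
};

struct task {
    struct cpu_set *cpuset;     /* replaces the per-task affinity mask */
};

/* What try_to_wake_up() / load balancing would consult for a task. */
static cpumask task_cpus_allowed(const struct task *t)
{
    return t->cpuset->cpus_allowed;
}

int main(void)
{
    struct cpu_set web = { 0x0f };              /* CPUs 0-3 */
    struct task a = { &web }, b = { &web };

    /* Take CPU 3 away from the whole group: one update to the shared
     * mask, not one update per task. */
    web.cpus_allowed &= ~((cpumask)1 << 3);

    printf("a: 0x%llx  b: 0x%llx\n",
           (unsigned long long)task_cpus_allowed(&a),
           (unsigned long long)task_cpus_allowed(&b));
    return 0;
}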

There would, of course, need to be some restriction on the movement of
CPUs from one set to another so that you don't end up with an empty set
with live tasks, etc.

A possible problem is that there may be users whose use of the current
affinity mechanism would be broken by such a change. A compile time
choice between the current mechanism and a set based mechanism would be
a possible solution. Of course, this proposed modification wouldn't
make any sense with fewer than 3 CPUs.

PS Once CPU sets were implemented like this, configurable CPU schedulers
(such as (blatant plug :-)) ZAPHOD) could have "per CPU set"
configurations, CKRM could do its (CPU management stuff) stuff within a
CPU set, etc.

>
> The CKRM cpu scheduler also adheres to affinity mask constraints and
> frankly does not care how they are set.
>
> So I do not see what at the scheduler level the problem will be.
> If you want system isolation you deploy cpumem sets. If you want overall
> share enforcement you choose ckrm classes.
> In addition you can use both with the understanding that cpumem sets can
> and will not be violated even if that means that shares are not maintained.
>
> Since you want orthogonality, cpumem sets could be implemented as a
> different "classtype". They would not belong to the taskclass and thus
> are independent from what we consider the task class.
>
>
>
> The tricky stuff comes in from the fact that CKRM assumes a system wide
> definition of a class and a system wide "calculation" of shares.

Doesn't sound insurmountable or particularly tricky :-).

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2004-10-02 23:29:46

by Peter Williams

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Hubertus Franke wrote:
>
>
> Paul Jackson wrote:
>
>> Hubertus wrote:
>>
>>> Marc, cpusets lead to physical isolation.
>>
>>
>>
>> This is slightly too terse for my dense brain to grok.
>> Could you elaborate just a little, Hubertus? Thanks.
>>
>
> A minimal quote from your website :-)
>
> "CpuMemSets provides a new Linux kernel facility that enables system
> services and applications to specify on which CPUs they may be
> scheduled, and from which nodes they may allocate memory."
>
> Since I have addressed the cpu section it seems obvious that
> in order to ISOLATE different workloads, you associate them onto
> non-overlapping cpusets, thus technically they are physically isolated
> from each other on said chosen CPUs.
>
> Given that cpuset hierarchies translate into cpu-affinity masks,
> this desired isolation can result in lost cycles globally.

This argument, if followed to its logical conclusion, would advocate the
abolition of CPU affinity masks completely.

>
> I believe this to be orthogonal to share settings. To me both
> are extremely desirable features.
>
> I also pointed out that if you separate mechanism from API, it
> is possible to move the CPU set API under the CKRM framework.
> I have not thought about the memory aspect.
>
> -- Hubertus
>
>


--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2004-10-02 23:31:49

by Alan

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Sul, 2004-10-03 at 00:08, Hubertus Franke wrote:
> Andrew Morton wrote:
> > Hubertus Franke <[email protected]> wrote:
> >
> >>Marc, cpusets lead to physical isolation.

Not realistically on x86 unless you start billing memory accesses IMHO

2004-10-02 23:47:17

by Hubertus Franke

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

We are in sync on this... Hopefully, everybody else as well.
>
> This is where I see the need for "CPU sets". I.e. as a
> replacement/modification to the CPU affinity mechanism basically adding
> an extra level of abstraction to make it easier to use for implementing
> the type of isolation that people seem to want. I say this because,
> strictly speaking and as you imply, the current affinity mechanism is
> sufficient to provide that isolation BUT it would be a huge pain to
> implement.

Exactly, you do the movement from cpuset through higher level operations
replacing the per task cpu-affinity with a shared object.
This is what CKRM does at the core level through its class objects.
RCFS provides the high level operations. The controller implements them
wrt the constraints and the details.

>
> The way I see it you just replace the task's affinity mask with a
> pointer to its "CPU set" which contains the affinity mask shared by
> tasks belonging to that set (and this is used by try_to_wake_up() and
> the load balancing mechanism to do their stuff instead of the per task
> affinity mask). Then when you want to do something like take a CPU away
> from one group of tasks and give it to another group of tasks it's just
> a matter of changing the affinity masks in the sets instead of visiting
> every one of the tasks individually and changing their masks.

Exactly ..

> There
> should be no need to explicitly move tasks off the "lost" CPU after such
> a change as it should/could be done next time that they go through
> try_to_wake_up() and/or finish a time slice. Moving a task from one CPU
> set to another would be a similar process to the current change of
> affinity mask.
>
> There would, of course, need to be some restriction on the movement of
> CPUs from one set to another so that you don't end up with an empty set
> with live tasks, etc.
>
> A possible problem is that there may be users whose use of the current
> affinity mechanism would be broken by such a change. A compile time
> choice between the current mechanism and a set based mechanism would be
> a possible solution. Of course, this proposed modification wouldn't
> make any sense with less than 3 CPUs.

Why? It is even useful for 2 cpus.
Currently cpumem sets do not enforce that there are no intersections
between siblings of a hierarchy.

>
> PS Once CPU sets were implemented like this, configurable CPU schedulers
> (such as (blatant plug :-)) ZAPHOD) could have "per CPU set"
> configurations, CKRM could do its (CPU management stuff) stuff within a
> CPU set, etc.

That's one of the sticking points.
That would require that TASKCLASSES and cpumemsets go along the
same hierarchy, with CPUmemsets being the top part of the hierarchy.
In other words the task classes cannot span different cpusets.

There are other possibilities that would restrict the load balancing
along cpuset boundaries. If taskclasses are allowed to span disjoint
cpumemsets, what then is the definition of setting shares?

Today we simply do the system wide share proportionment adhering to the
affinity constraints, which is still valid in this discussion.

>

>>
>> The tricky stuff comes in from the fact that CKRM assumes a system
>> wide definition of a class and a system wide "calculation" of shares.
>
Tricky in that it needs to be decided what the class hierarchy
definitions are, whether to do CKRM cpu scheduling within each cpuset,
and what the exact definition of a share then is.

>
> Doesn't sound insurmountable or particularly tricky :-).

I agree it's not insurmountable but a matter of deciding what the desired
behavior is ...

Regards.



>
> Peter

2004-10-02 23:54:21

by Hubertus Franke

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement



Peter Williams wrote:

> Hubertus Franke wrote:
>
>>
>>
>> Paul Jackson wrote:

>> A minimal quote from your website :-)
>>
>> "CpuMemSets provides a new Linux kernel facility that enables system
>> services and applications to specify on which CPUs they may be
>> scheduled, and from which nodes they may allocate memory."
>>
>> Since I have addressed the cpu section it seems obvious that
>> in order to ISOLATE different workloads, you associate them onto
>> non-overlapping cpusets, thus technically they are physically isolated
>> from each other on said chosen CPUs.
>>
>> Given that cpuset hierarchies translate into cpu-affinity masks,
>> this desired isolation can result in lost cycles globally.
>
>
> This argument if followed to its logical conclusion would advocate the
> abolition of CPU affinity masks completely.
>

No, why is that? One can restrict memory on a task and by doing so waste
cycles in paging. That does not mean we should get rid of memory
restrictions or the like.
Losing cycles is simply an observation of what could happen.

As in any system, over-constraining a given workload (wrt affinity,
cpu limits, rate control) can lead to suboptimal utilization of
resources. That does not mean there is no rationale for the constraints
in the first place, or that they should hence never be allowed.

Cheers ..


2004-10-03 00:00:39

by Peter Williams

Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Hubertus Franke wrote:
>> be a possible solution. Of course, this proposed modification
>> wouldn't make any sense with less than 3 CPUs.
>
>
> Why? It is even useful for 2 cpus.
> Currently cpumem sets do not enforce that there are no intersections
> between siblings of a hierarchy.

There are only 3 non-empty sets, and only one of them can have a CPU
removed from the set without becoming empty. So the pain wouldn't be
worth the gain.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2004-10-03 02:28:14

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Andrew writes:
>
> Despite what Paul says, his customers *do not* "require" physical isolation
> [*]. That's like an accountant requiring that his spreadsheet be written
> in Pascal. He needs slapping.

No - it's like an accountant saying the books for your two sole
proprietor Subchapter S corporations have to be kept separate.

Consider the following use case scenario, which emphasizes this
isolation aspect (and ignores other requirements, such as the need for
system admins to manage cpusets by name [some handle valid across
process contexts], with a system wide imposed permission model and
exclusive use guarantees, and with a well defined system supported
notion of which tasks are "in" which cpuset at any point in time).

===

You're running a 64-way, compute bound application on 64 CPUs of your
256 CPU system. The 64 threads are in lock step, tightly coupled, for
three days straight. You've sized the application and the computer you
bought to run that application to within the last few percent of what
CPU cycles are available on 64 CPUs and how many memory pages are
available on the nodes local to those CPUs. It's an MPI(*) application in
Fortran, using most of the available bandwidth between those nodes for
synchronization on each loop of the computation. If a single thread slows
down 10% for any reason, the entire application slows down that much
(sometimes worse), and you have big money on the table, ensuring that
doesn't happen. You absolutely positively have to complete that
application run on time, in three days (say it's a weather forecast for
four days out). You've varied the resolution to which you compute the
answer or the size of your input data set or whatever else you could, in
order to obtain the most accurate answer you could, in three days, not
an hour longer. If the runtimes jump around by more than 5% or 10%,
some Vice President starts losing sleep. If it's a 20% variation, that
sleep deprived Vice President works for the computer company that sold
you the system. The boss of the boss of my boss ;).

I now know that every one of these 64 threads is pinned for those three
days. It's just as pinned as the graphics application that has to be
near its hardware. Due to both the latency effects of the several
levels of hardware cache (on the CPU chip and off), and the additional
latency effects imposed by the software when it decides on which node to
place a page of memory off a page fault, nothing can move. Not in, not
out, not within. To within a fraction of a percent, nothing else may be
allowed onto those nodes, nothing of those 64 threads may be allowed off
those nodes, and none of the threads may be allowed to move within the
64 CPUs. And not just any random subset of 64 CPUs selected from the
256 available, but a subset that's "close" together, given the complex
geometries of these big systems (minimum number of router hops between
the furthest apart pair of CPUs in the set of 64 CPUs).

(*) Message Passing Interface (MPI) - http://www.mpi-forum.org

===

It's a requirement, I say. It's a requirement. Let the slapping begin ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-03 02:51:41

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Hubertus wrote:
>
> CKRM could do so. We already provide the name space and the class
> hierarchy.

Just because two things have name spaces and hierarchies, doesn't
make them interchangeable. Name spaces and hierarchies are just
implementation mechanisms - many interesting, entirely unrelated,
solutions make use of them.

What are the objects named, and what is the relation underlying
the hierarchy? These must match up.

The objects named in cpusets are subsets of a systems CPUs and Memory
Nodes. The relation underlying the hierarchy is the subset relation on
these sets: if one cpuset node is a descendent of another, then its
CPUs and Memory Nodes are a subset of the others.

What is the corresponding statement for CKRM?

For CKRM to subsume cpusets, there must be an injective map from the
above cpuset objects to CKRM objects, that preserves this subset
relation on cpusets.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-03 03:01:43

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Peter writes:
>
> I say this because,
> strictly speaking and as you imply, the current affinity mechanism is
> sufficient to provide that isolation BUT it would be a huge pain to
> implement.

The effects on any given task - where it gets scheduled and where it
allocates memory - can be duplicated using the current affinity
mechanisms (setaffinity/mbind/mempolicy).

However the system wide naming of cpusets, the control of their access,
use and modification, the exclusive rights to a CPU or Memory and the
robust linkage of tasks to these named cpusets are, in my view, just the
sort of system wide resource synchronization that kernels are born to
do, and these capabilities are not provided by the per-task existing
affinity mechanisms.

However, my point doesn't matter much. Whether it's a huge pain, or an
infinite pain, so long as we agree it's more painful than we can
tolerate, that's enough agreement to continue this discussion along
other more fruitful lines.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-03 03:21:45

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Peter writes:
>
> The way I see it you just replace the task's affinity mask with a
> pointer to its "CPU set" which contains the affinity mask shared by
> tasks belonging to that set ...

I too like this suggestion. The current duplication of cpus_allowed and
mems_allowed between task and cpuset is a fragile design, forced on us
by incremental feature addition and the need to maintain backwards
compatibility.


> A possible problem is that there may be users whose use of the current
> affinity mechanism would be broken by such a change. A compile time
> choice between the current mechanism and a set based mechanism would be
> a possible solution.

Do you mean kernel or application compile time? The current affinity
mechanisms have enough field penetration that the kernel will have to
support or emulate these calls for a long period of deprecation at best.

So I guess you mean application compile time. However, the current user
level support, in glibc and other libraries, for these calls is
sufficiently confused, at least in my view, that rather than have that
same API mean two things, depending on a compile time switch, I'd rather
explore (1) emulating the existing calls, just as they are, (2) adding
new calls that are try these API's again, in line with our kernel
changes, and (3) eventually deprecate and remove the old calls, over a
multi-year period.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-03 03:26:49

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Hubertus wrote:
> So to me cpumem sets as as concept is useful, so I won't be doing that
> whopping, but ...

I couldn't parse the above ... could you rephrase?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-03 03:37:37

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Dipankar wrote:
> For this to succeed, they need to be completely
> isolated.

Do you mean by completely isolated (1) running two separate system
images on separate partitions connected at most by networks and storage,
or do you mean (2) minimal numa interaction between two subsets of
nodes, all running under the same system image?

If (1), then the partitioning project is down the hall ;) But I guess
you knew that. The issues on this thread involve managing resource
interactions on a single system image.

Just checking ... the words you used to describe the degree of
separation were sufficiently strong that I became worried we were at
risk for a miscommunication.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-03 03:41:44

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Peter writes:
> This is where I see the need for "CPU sets". I.e. as a
> replacement/modification to the CPU affinity mechanism

Note that despite the name, cpusets handles both CPU and
Memory affinity.

Which is probably why Hubertus is calling them cpumem sets.

And, indeed, why I have called them cpumemsets on alternate
years myself.

However the rest of your points, except where clearly specific
to the scheduler, apply equally well, so this point is not
critical at this point in the discussion.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-03 03:47:08

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Hubertus writes:
>
> That's one of the sticking points.
> That would require that TASKCLASSES and cpumemsets must go along the
> same hierarchy. With CPUmemsets being the top part of the hierarchy.
> In other words the task classes can not span different cpusets.

Can task classes span an entire cpuset subtree? I can well imagine that
an entire subtree of the cpuset tree should be managed by the same CKRM
policies and shares.

In particular, if we emulate the setaffinity/mbind/mempolicy calls by
forking a child cpuset to represent the new restrictions on the task
affected by those calls, then we'd for sure want to leave that task in
the same CKRM policy realm as it was before.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-03 03:53:19

by Peter Williams

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Paul Jackson wrote:
> Peter writes:
>
>>The way I see it you just replace the task's affinity mask with a
>>pointer to its "CPU set" which contains the affinity mask shared by
>>tasks belonging to that set ...
>
>
> I too like this suggestion. The current duplication of cpus_allowed and
> mems_allowed between task and cpuset is a fragile design, forced on us
> by incremental feature addition and the need to maintain backwards
> compatibility.

OK.

>
>>A possible problem is that there may be users whose use of the current
>>affinity mechanism would be broken by such a change. A compile time
>>choice between the current mechanism and a set based mechanism would be
>>a possible solution.
>
>
> Do you mean kernel or application compile time?

Kernel compile time.

> The current affinity
> mechanisms have enough field penetration that the kernel will have to
> support or emulate these calls for a long period of deprecation at best.

That's unfortunate. Are the (higher level) ways in which they're used
incompatible with CPU sets or would CPU sets be seen as being a better
(easier) way of doing the job?

If the choice is at kernel compile time then those users of the current
mechanism can choose it and new users can choose CPU sets. Of course,
this makes gradual movement from one model to the other difficult to say
the least.

>
> So I guess you mean application compile time. However, the current user
> level support, in glibc and other libraries, for these calls is
> sufficiently confused, at least in my view, that rather than have that
> same API mean two things, depending on a compile time switch, I'd rather
> explore (1) emulating the existing calls, just as they are, (2) adding
> new calls that are try these API's again, in line with our kernel
> changes, and (3) eventually deprecate and remove the old calls, over a
> multi-year period.

I would agree with that. I guess that emulation would not be possible
on top of my suggestion hence the requirement for the "fragile design" etc.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2004-10-03 04:04:31

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Paul wrote:
> (2) adding new calls that are try these API's again
^^^
Drop that word 'are' - don't know how it snuck in there ;)

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-03 04:50:24

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Peter wrote:
>
> Of course, this [kernel compile option] makes gradual movement
> from one model to the other difficult to say the least.

To say the least.

It might be possible to continue to support current affinity calls
(setaffinity/mbind/mempolicy) even while removing the duplication of
affinity masks between tasks and cpusets.

If each call to set a task's affinity resulted in moving that task into
its very own cpuset (unless it was already the only user of its cpuset),
and if the calls to load and store task->{cpus,mems}_allowed in the
implementation of these affinity sys calls were changed to load and
store those affinity masks in the task's cpuset instead.

I'm just brainstorming here ... this scheme could easily have some
fatal flaw that I'm missing at the moment.
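To make the brainstorm concrete, a minimal sketch - the per-task cpuset
pointer and the helpers used here are invented names for illustration,
not existing kernel interfaces:

/*
 * Hypothetical emulation of sched_setaffinity() on top of cpusets.
 * Rather than writing task->cpus_allowed directly, give the task a
 * private child cpuset and store the mask there.
 */
static long cpuset_emulate_setaffinity(struct task_struct *p, cpumask_t newmask)
{
	struct cpuset *cs = p->cpuset;		/* per-task cpuset link */

	/* If the task shares its cpuset, fork a private child cpuset
	 * so the change affects only this one task. */
	if (atomic_read(&cs->count) > 1) {
		cs = cpuset_fork_child(cs);	/* invented helper */
		if (!cs)
			return -ENOMEM;
		attach_task_to_cpuset(p, cs);	/* invented helper */
	}

	/* The affinity now lives in the cpuset, not in the task. */
	return cpuset_set_cpus(cs, newmask);	/* invented helper */
}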

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-03 05:12:45

by Peter Williams

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Paul Jackson wrote:
> Peter wrote:
>
>>Of course, this [kernel compile option] makes gradual movement
>>from one model to the other difficult to say the least.
>
>
> To say the least.
>
> It might be possible to continue to support current affinity calls
> (setaffinity/mbind/mempolicy) even while removing the duplication of
> affinity masks between tasks and cpusets.
>
> If each call to set a task's affinity resulted in moving that task into
> its very own cpuset (unless it was already the only user of its cpuset),
> and if the calls to load and store task->{cpus,mems}_allowed in the
> implementation of these affinity sys calls were changed to load and
> store those affinity masks in the task's cpuset instead.
>
> I'm just brainstorming here ... this scheme could easily have some
> fatal flaw that I'm missing at the moment.

Provided overlapping sets are allowed it should be feasible. However,
I'm not a big fan of overlapping sets as it would make using different
CPU scheduling configurations in each set more difficult (maybe even
inadvisable) but that's a different issue.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2004-10-03 05:42:09

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Peter wrote:
>
> Provided overlapping sets are allowed it should be feasible. However,
> I'm not a big fan of overlapping sets as it would make using different
> CPU scheduling configurations in each set more difficult (maybe even
> inadvisable) but that's a different issue.

One can resolve these apparently conflicting objectives by having the
scheduling configuration apply to an entire subtree of the cpuset
hierarchy. When cpuset "a/b" is created below cpuset "a", by
default cpuset "a/b" should get reference counted links to the same
scheduler and other CKRM policies as "a" had.
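Roughly, in struct terms - the realm structure and the cpuset field are
invented names, just to illustrate the reference counting:

/*
 * Hypothetical: each cpuset carries a reference-counted pointer to the
 * scheduler/CKRM policy realm governing its whole subtree.
 */
struct sched_policy_realm {
	atomic_t refcount;
	/* scheduler and CKRM configuration for this realm ... */
};

static void cpuset_inherit_realm(struct cpuset *parent, struct cpuset *child)
{
	/* By default "a/b" is governed by the same realm as "a". */
	child->realm = parent->realm;
	atomic_inc(&child->realm->refcount);
}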

Then details about what happens further down the cpuset tree, as leaf
nodes come and go, overlapping with their parents, in order to emulate
the old affinity calls, don't confuse the scheduling configuration,
which applies across the same broad swath of CPUs before the affinity
call as after.

You don't need all the cpusets non-overlapping, you just need the
ones that define the realm of a particular scheduling policy to be
non-overlapping (or to tolerate the confusions that result if they
aren't, if that's preferable - I don't know that it is).

Indeed, the simple act of an individual task tweaking its own CPU or
Memory affinity should _not_ give it a different scheduling realm.
Rather such a task must remain stuck in whatever realm it was in before
that affinity call.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-03 12:21:13

by Hubertus Franke

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement



Paul Jackson wrote:
> Hubertus wrote:
>
>>CKRM could do so. We already provide the name space and the class
>>hierarchy.
>
>
> Just because two things have name spaces and hierarchies, doesn't
> make them interchangeable. Name spaces and hierarchies are just
> implementation mechanisms - many interesting, entirely unrelated,
> solutions make use of them.
>
> What are the objects named, and what is the relation underlying
> the hierarchy? These must match up.

Object name relationships are established through the rcfs pathname.

>
> The objects named in cpusets are subsets of a systems CPUs and Memory
> Nodes. The relation underlying the hierarchy is the subset relation on
> these sets: if one cpuset node is a descendent of another, then its
> CPUs and Memory Nodes are a subset of the others.

Exactly, the controller will enforce that in the same way we
enforce other attributes and shares.
For example, we make sure that the sum of the share "guarantees" for
all children does not exceed the total_guarantee (i.e. denominator)
of the parent.
Nothing prohibits the controller from enforcing the set constraints
you describe above and rejecting requests that are not valid.
As I said before, ideally the controller would be the cpumem set
guts and RCFS would be the API to it.
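For instance, a controller could validate the subset constraint much
like it validates guarantees - a sketch with illustrative names, not
the actual CKRM interfaces:

/*
 * Hypothetical controller check: a child's CPUs must be a subset of
 * its parent's, analogous to rejecting children whose guarantees
 * exceed the parent's total_guarantee.
 */
static int validate_child_cpus(cpumask_t parent_cpus, cpumask_t child_cpus)
{
	cpumask_t escape;

	cpus_andnot(escape, child_cpus, parent_cpus);
	if (!cpus_empty(escape))
		return -EINVAL;		/* child escapes the parent set */
	return 0;
}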

That's what Andrew was asking for, should the requirement for
this functionality be established.

>
> What is the corresponding statement for CKRM?
>
> For CKRM to subsume cpusets, there must be an injective map from the
> above cpuset objects to CKRM objects, that preserves this subset
> relation on cpusets.
>

See above.

2004-10-03 14:15:32

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Paul wrote:
> It's a requirement, I say. It's a requirement. Let the slapping begin ;).

Granted, to give Andrew his due (begrudgingly ;), the requirement
to pin processes on CPUs is a requirement of the _implementation_,
which follows, for someone familiar with the art, from the two
items:
1) The requirement of the _user_ that runtimes be repeatable
within perhaps 1% to 5% for a certain class of job, plus
2) The cantankerous properties of big honkin NUMA boxes.

Clearly, Andrew was looking for _user_ requirements, which I
managed somewhat unwittingly to back up in my use case scenario.


I suspect that there is a second use case scenario, with which the Bull
or NEC folks might be more familiar than I, that can seemingly lead
to the same implementation requirement to pin jobs. This scenario would
involve a customer who has paid good money for some compute capacity
(CPU cycles and Memory pages) with a certain guaranteed Quality of
Service, and who would prefer to see this capacity go to waste when
underutilized rather than risk it being unavailable in times of need.

However in this case, as Andrew is likely already chomping at the bit to
tell me, CKRM could provide such guaranteed compute capacities without
pinning.

Whether or not a CKRM class would sell to the customers of Bull and
NEC in lieu of a set of pinned nodes, I have no clue.

Erich, Simon - Can you introduce a note of reality into my
speculations above?


The third use case scenario that commonly leads us to pinning is
support of the batch or workload managers, PBS and LSF, which are fond
of dividing the compute resources up into identifiable subsets of CPUs
and Memory Nodes that are near to each other (in terms of the NUMA
topology) and that have the size (compute capacity as measured in free
cycles and freely available ram) requested by a job, then attaching that
job to that subset and running it.

In this third case, batch or workload managers have a long history with
big honkin SMP and NUMA boxes, and this remains an important market for
them. Consistent runtimes are valued by their customers and are a key
selling point of these products in the HPC market. So this third case
reduces to the first, with its implementation requirement for pinning
the tasks of an active job to specific CPUs and Memory Nodes.

For example from Platform's web site (the vendor of LSF) at:
http://www.platform.com/products/HPC
the benefits for their LSF HPC product include:
* Guaranteed consistent and reliable parallel workload processing with
high performance interconnect support
* Maximized application performance with topology-aware scheduling
* Ensures application runtime consistency by automatically allocating
similar processors

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-03 14:39:34

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

>> The O(1) scheduler today does not know about cpumem sets. It operates
>> on the level of affinity masks to adhere to the constraints specified
>> based on cpu masks.
>
> This is where I see the need for "CPU sets". I.e. as a
> replacement/modification to the CPU affinity mechanism basically adding
> an extra level of abstraction to make it easier to use for implementing
> the type of isolation that people seem to want. I say this because,
> strictly speaking and as you imply, the current affinity mechanism is
> sufficient to provide that isolation BUT it would be a huge pain to
> implement.

The way cpusets uses the current cpus_allowed mechanism is, to me, the most
worrying thing about it. Frankly, the cpus_allowed thing is kind of tacked
onto the existing scheduler, and not at all integrated into it, and doesn't
work well if you use it heavily (eg bind all the processes to a few CPUs,
and watch the rest of the system kill itself).

Matt had proposed having a separate sched_domain tree for each cpuset, which
made a lot of sense, but seemed harder to do in practice because "exclusive"
in cpusets doesn't really mean exclusive at all. Even if we don't have
separate sched_domain trees, cpusets could be the top level in the master
tree, I think.

M.

2004-10-03 15:41:42

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin wrote:
> Matt had proposed having a separate sched_domain tree for each cpuset, which
> made a lot of sense, but seemed harder to do in practice because "exclusive"
> in cpusets doesn't really mean exclusive at all.

See my comments on this from yesterday on this thread.

I suspect we don't want a distinct sched_domain for each cpuset, but
rather a sched_domain for each of several entire subtrees of the cpuset
hierarchy, such that every CPU is in exactly one such sched domain, even
though it be in several cpusets in that sched_domain. Perhaps each
cpuset in such a subtree points to the same reference counted
sched_domain, or perhaps each cpuset except the one at the root of the
subtree has a flag set, telling the scheduler to search up the cpuset
tree to find a sched_domain. Probably the former, for performance
reasons.

As I can see even my own eyes glazing over trying to read what I just
wrote, let me give an example.

Let's say we have a 256 CPU system. At the top level, we divide it into
five non-overlapping cpusets, of sizes 64, 64, 32, 28 and 4. Each of
these five cpusets has its sched_domain, except the third one, of 32 CPUs.
That one is subdivided into 4 cpusets, of 8 CPUs each, non-overlapping,
each of the four with its own sched_domain.

[Aside - granted this is topologically equivalent to the flattened
partitioning into the eight cpusets of sizes 64, 64, 8, 8, 8, 8, 28 and
4. Perhaps the 32 CPUs were farmed out to the Professor of Eccentric
Economics, who has permission to manage his 32 CPUs and divide them
further, but who lacks permission to modify the top layer of the cpuset
hierarchy.]

So we have eight cpusets, non-overlapping and covering the entire
system, each with its own sched_domain. Now within those cpusets,
for various application reasons, further subdivisions occur. But
no more sched_domains are created, and the existing sched_domains
apply to all tasks attached to any cpuset in their cpuset subtree.
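In struct terms, the "probably the former" option above might look
roughly like this - the field names are invented for the sake of
illustration:

/*
 * Hypothetical layout: every cpuset in a subtree shares one
 * reference-counted sched_domain, so each CPU belongs to exactly one
 * domain even though it may appear in several cpusets of that subtree.
 */
struct cpuset {
	cpumask_t		cpus_allowed;
	nodemask_t		mems_allowed;
	struct cpuset		*parent;
	struct sched_domain	*domain;	/* shared across the subtree */
	atomic_t		count;
	/* ... */
};

static struct sched_domain *cpuset_sched_domain(struct cpuset *cs)
{
	return cs->domain;	/* same pointer anywhere in the subtree */
}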

On the other topic you raise, of the meaning (or lack thereof) of
"exclusive". Perhaps "exclusive" should not a property of a node in
this tree, but rather a property of a node under a certain covering or
mapping. You note we need a map from the range of CPUs to the domain
sched_domain's, specifying for each CPU its unique sched_domain. And we
might have some other map on these same CPUs or Memory Nodes for other
purposes. I am afraid I've forgotten too much of my math from long long
ago to state this with exactly the right terms. But I can imagine
adding a little bit more code to cpusets, that kept a small list of such
mappings over the domains of CPUs and Memory Nodes, and that validated,
on each cpuset change, that each mapping preserved whatever properties
of covering and non-overlapping that it was marked for. One of these
mappings could be into the range of sched_domains and be marked for both
covering and non-overlapping.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-03 16:04:51

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin wrote:
> The way cpusets uses the current cpus_allowed mechanism is, to me, the most
> worrying thing about it. Frankly, the cpus_allowed thing is kind of tacked
> onto the existing scheduler, and not at all integrated into it, and doesn't
> work well if you use it heavily (eg bind all the processes to a few CPUs,
> and watch the rest of the system kill itself).

True. One detail of what you say I'm unclear on -- how will the rest of
the system kill itself? Why wouldn't the unemployed CPUs just idle
around, waiting for something to do?

As I recall, Ingo added task->cpus_allowed for the Tux in-kernel web
server a few years back, and I piggy backed the cpuset stuff on that, to
keep my patch size small.

Likely your same concerns apply to the task->mems_allowed field that
I added, in the same fashion, in my cpuset patch of recent.

We need a mechanism that the cpuset apparatus respects that maps each
CPU to a sched_domain, exactly one sched_domain for any given CPU at any
point in time, regardless of which task it is considering running at the
moment. Somewhat like dual-channeled disks, having more than one
sched_domain apply at the same time to a given CPU leads to confusions
best avoided unless desperately needed. Unlike dual-channeled disks, I
don't see the desperate need here for multi-channel sched_domains ;).

And of course, for the vast majority of normal systems in the world
not configured with cpusets, this has to collapse back to something
sensible "just like it is now."

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-03 20:15:53

by Tim Hockin

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Sun, Oct 03, 2004 at 07:36:46AM -0700, Martin J. Bligh wrote:
> > This is where I see the need for "CPU sets". I.e. as a
> > replacement/modification to the CPU affinity mechanism basically adding
> > an extra level of abstraction to make it easier to use for implementing
> > the type of isolation that people seem to want. I say this because,
> > strictly speaking and as you imply, the current affinity mechanism is
> > sufficient to provide that isolation BUT it would be a huge pain to
> > implement.
>
> The way cpusets uses the current cpus_allowed mechanism is, to me, the most
> worrying thing about it. Frankly, the cpus_allowed thing is kind of tacked
> onto the existing scheduler, and not at all integrated into it, and doesn't
> work well if you use it heavily (eg bind all the processes to a few CPUs,
> and watch the rest of the system kill itself).

7 years ago, before cpus_allowed was dreamed up, I proposed a pset patch
and was shot down hard. Now it's back, and we're trying to find a way to
cram it in on top.

Yeah, it does not fit nicely with cpus_allowed.

I have to ask - do we REALLY need cpusets? I mean, even SGI dropped
PSET at some point, because (if I recall) NO ONE USED IT.

What's the problem being solved that *requires* psets?

I have a customer I work with periodically who was using my pset patch up
until they moved to RH8, when the O(1) scheduler and cpus_allowed changed
everything. This was their requirement for pset:

1. Take a processor out of the general execution pool (call it
PROC_RESTRICTED). This processor will not schedule general tasks.
2. Assign a task to the PROC_RESTRICTED cpu. Now that CPU will only
schedule the assigned task (and its children).
3. Repeat for every CPU, with the caveat that one CPU must remain
PROC_ENABLED.

I had an array of enum procstate and a new syscall pair:
sched_{gs}etprocstate(). The scheduler checks the procstate, and if it is
not ENABLED, it checks that (cpus_allowed == 1<<cpu). Simple, but works.
Could be baked a bit more, for general use.
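Roughly, the check described might look like this - a sketch
reconstructed from the description above, not the actual patch:

/* Sketch: a PROC_RESTRICTED cpu only runs tasks whose cpus_allowed
 * is exactly that one cpu. */
enum procstate { PROC_ENABLED, PROC_RESTRICTED };
static enum procstate procstate[NR_CPUS];

static int task_allowed_on_cpu(struct task_struct *p, int cpu)
{
	if (procstate[cpu] == PROC_ENABLED)
		return cpu_isset(cpu, p->cpus_allowed);
	/* restricted cpu: cpus_allowed must equal 1<<cpu */
	return cpus_equal(p->cpus_allowed, cpumask_of_cpu(cpu));
}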

What if I proposed a patch like this, now? It would require cleanup for
2.6, but I'm game if it's useful.

Tim

2004-10-03 20:23:18

by Erich Focht

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> The other declared potential users of cpusets, Bull and NEC at
> least, seem from what I can tell to have a somewhat different
> focus, toward providing a mix of compute services with minimum
> interference, from what I'd guess are more departmental size
> systems.
>
> Bull (Simon) and NEC (Erich) should also look closely at CKRM,
> and then try to describe their requirements, so we can understand
> whether CKRM, cpusets or both or neither can meet their needs.

The requirements I have in mind come from our customers: users,
benchmarkers, administrators and compute center management. They are
used to our kind of big iron, the NEC SX (earth simulator style
hardware) which is running a proprietary Unix and has a few amenities
not present in Linux. Among them: gang scheduling (even across
machines for big parallel jobs), resource groups and tight integration
of these features with the batch resource manager.

Can cpusets help me/us/Linux to get closer to these requirements?

A clear yes. Regard cpusets as a new kind of composite resource built
from memory and CPUs. They can play the role of the resource groups we
need. Disjoint cpusets can run jobs which will almost never interfere
cpu-cycle or memory-wise. This can be easily integrated into PBS/LSF
or whatever batch resource manager comes to your mind. Cpusets
selected with some knowledge of the NUMA characteristics of a machine
always guarantee reproducible and optimal compute performance. If a job
runs alone in a cpuset it will run as if the machine has been reduced
to that piece and is owned exclusively by the job. Also, if the set
contains as many CPUs as MPI processes, the cpuset helps getting some
sort of gang scheduling (i.e. all members of a parallel process get
cycles at the same time; this reduces barrier synchronisation times,
improves performance and makes it more predictable). This is something
one absolutely needs on big machines when dealing with time-critical,
highest-performance applications. Permanently losing 10% because the
CPU placement is poor, or because one has to get some other process out
of the way, is just unacceptable. When you sell machines for several
millions, a 10% performance loss translates to quite a lot of
money.

Can CKRM (as it is now) fulfil the requirements?

I don't think so. CKRM gives me to some extent the confidence that I
will really use the part of the machine for which I paid, say 50%. But
it doesn't care about the structure of the machine. CKRM tries giving
a user as much of the machine as possible, at least the amount he paid
for. For example: When I come in with my job the machine might be
already running another job whose user also paid for 50% but was the
only user and got 100% of the machine (say some Java application with
enough threads...). This job may have filled up most of the memory
and uses all CPUs. CKRM will take care of getting me cycles (maybe
exclusively) on 50% of the CPUs and will treat my job preferentially
when allocating memory, but will not care about the placement of the
CPUs and the memory. Neither will it care whether the previously
running job is still using my memory blocks and reducing my bandwidth
to them. So I get 50% of the cycles and the memory but these will be
BAD CYCLES and BAD MEMORY. My job will run slower than possible and a
second run will be again different. Don't misunderstand me: CKRM in
its current state is great for different things and running it inside
a cpuset sounds like a good thing to do.

What about integration with PBS/LSF and alike?

It makes sense to let an external resource manager (batch or
non-batch) keep track of and manage cpusets resources. It can allocate
them and give them to jobs (exclusively) and delete them. That's
perfect and exactly what we want. CKRM is a resource manager itself
and has an own idea about resources. Certainly PBS/LSF/etc. could
create a CKRM class for each job and run it in this class. The
difficulty is to keep the resource managers from interfering and working
against each other. In such a setup I'd rather expect a batch manager
to be started inside one CKRM class and let it ensure that e.g. the
interactive class isn't starved by the batch class.

Can CKRM be extended to do what cpusets do?

Certainly. Probably easily. But cpusets will have to be reinvented, I
guess. Same hooks, same checks, different user interface...

Erich

2004-10-03 20:53:02

by Andrew Morton

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Erich Focht <[email protected]> wrote:
>
> Can CKRM (as it is now) fulfil the requirements?
>
> I don't think so. CKRM gives me to some extent the confidence that I
> will really use the part of the machine for which I paid, say 50%. But
> it doesn't care about the structure of the machine.

Right. That's a restriction of the currently-implemented CKRM controllers.

> ...
> Can CKRM be extended to do what cpusets do?
>
> Certainly. Probably easily. But cpusets will have to be reinvented, I
> guess. Same hooks, same checks, different user interface...

Well if it is indeed the case that the CKRM *framework* is up to the task
of being used to deliver the cpuset functionality then that's the way we
should go, no? It's more work and requires coordination and will deliver
later, but the eventual implementation will be better.

But I'm still not 100% confident that the CKRM framework is suitable.
Mainly because the CKRM and cpuset teams don't seem to have looked at each
other's stuff enough yet.

2004-10-03 23:48:48

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

--Paul Jackson <[email protected]> wrote (on Sunday, October 03, 2004 09:02:09 -0700):

> Martin wrote:
>> The way cpusets uses the current cpus_allowed mechanism is, to me, the most
>> worrying thing about it. Frankly, the cpus_allowed thing is kind of tacked
>> onto the existing scheduler, and not at all integrated into it, and doesn't
>> work well if you use it heavily (eg bind all the processes to a few CPUs,
>> and watch the rest of the system kill itself).
>
> True. One detail of what you say I'm unclear on -- how will the rest of
> the system kill itself? Why wouldn't the unemployed CPUs just idle
> around, waiting for something to do?

I think last time I looked they just sat there saying:

Rebalance!
Ooooh, CPU 3 over there looks heavily loaded, I'll steal something.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
Humpf. I give up.
Rebalance!
Ooooh, CPU 3 over there looks heavily loaded, I'll steal something.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
Humpf. I give up.
Rebalance!
Ooooh, CPU 3 over there looks heavily loaded, I'll steal something.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
That one. Try to migrate. Oops, no cpus_allowed bars me.
Humpf. I give up.
... ad infinitum.

Desperately boring, and rather ineffective.

> As I recall, Ingo added task->cpus_allowed for the Tux in-kernel web
> server a few years back, and I piggy backed the cpuset stuff on that, to
> keep my patch size small.
>
> Likely your same concerns apply to the task->mems_allowed field that
> I added, in the same fashion, in my cpuset patch of recent.

Mmm, I'm less concerned about that one, or at least I can't specifically
see how it breaks.

> We need a mechanism that the cpuset apparatus respects that maps each
> CPU to a sched_domain, exactly one sched_domain for any given CPU at any
> point in time, regardless of which task it is considering running at the
> moment. Somewhat like dual-channeled disks, having more than one
> sched_domain apply at the same time to a given CPU leads to confusions
> best avoided unless desperately needed.

Agreed. The cpus_allowed mechanism doesn't seem well suited to heavy use
anyway (I think John Hawkes had problems with it too). That's not your
fault ... but I'm not convinced it's a good foundation to be building
further things on either ;-)

M.

2004-10-03 23:54:57

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> Martin wrote:
>> Matt had proposed having a separate sched_domain tree for each cpuset, which
>> made a lot of sense, but seemed harder to do in practice because "exclusive"
>> in cpusets doesn't really mean exclusive at all.
>
> See my comments on this from yesterday on this thread.
>
> I suspect we don't want a distinct sched_domain for each cpuset, but
> rather a sched_domain for each of several entire subtrees of the cpuset
> hierarchy, such that every CPU is in exactly one such sched domain, even
> though it be in several cpusets in that sched_domain.

Mmmm. The fundamental problem I think we ran across (just whilst pondering,
not in code) was that some things (eg ... init) are bound to ALL cpus (or
no cpus, depending how you word it); i.e. they're created before the cpusets
are, and are a member of the grand-top-level-uber-master-thingummy.

How do you service such processes? That's what I meant by the exclusive
domains aren't really exclusive.

Perhaps Matt can recall the problems better. I really liked his idea, aside
from the small problem that it didn't seem to work ;-)

> So we have eight cpusets, non-overlapping and covering the entire
> system, each with its own sched_domain.

But that's the problem ... I think there are *always* cpusets that overlap.
Which is sad (fixable?) because it breaks lots of intelligent things we
could do.

> purposes. I am afraid I've forgotten too much of my math from long long
> ago to state this with exactly the right terms.

That's OK, so have most of the rest of us, so even if you could remember,
it wouldn't help much ;-)

M.

2004-10-04 00:03:49

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

--"Martin J. Bligh" <[email protected]> wrote (on Sunday, October 03, 2004 16:53:40 -0700):

>> Martin wrote:
>>> Matt had proposed having a separate sched_domain tree for each cpuset, which
>>> made a lot of sense, but seemed harder to do in practice because "exclusive"
>>> in cpusets doesn't really mean exclusive at all.
>>
>> See my comments on this from yesterday on this thread.
>>
>> I suspect we don't want a distinct sched_domain for each cpuset, but
>> rather a sched_domain for each of several entire subtrees of the cpuset
>> hierarchy, such that every CPU is in exactly one such sched domain, even
>> though it be in several cpusets in that sched_domain.
>
> Mmmm. The fundamental problem I think we ran across (just whilst pondering,
> not in code) was that some things (eg ... init) are bound to ALL cpus (or
> no cpus, depending how you word it); i.e. they're created before the cpusets
> are, and are a member of the grand-top-level-uber-master-thingummy.
>
> How do you service such processes? That's what I meant by the exclusive
> domains aren't really exclusive.
>
> Perhaps Matt can recall the problems better. I really liked his idea, aside
> from the small problem that it didn't seem to work ;-)
>
>> So we have eight cpusets, non-overlapping and covering the entire
>> system, each with its own sched_domain.
>
> But that's the problem ... I think there are *always* cpusets that overlap.
> Which is sad (fixable?) because it breaks lots of intelligent things we
> could do.

Hmmm. What if when you created a new, exclusive CPUset, the cpus you spec'ed
were *removed* from the parent CPUset (and existing processes forcibly
migrated off)? That'd fix most of it, and would bring us much closer to the
true meaning of "exclusive". Changes your semantics a bit, but still ...

OK, so there is one problem I can see - you couldn't remove the last CPU
from the parent if there were any jobs running in it, but presumably fixable
(eg you have to move them into the created child, or fail the call).

M.

2004-10-04 00:48:20

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin wrote:
>
> Mmmm. The fundamental problem I think we ran across (just whilst pondering,
> not in code) was that some things (eg ... init) are bound to ALL cpus (or
> no cpus, depending how you word it); i.e. they're created before the cpusets
> are, and are a member of the grand-top-level-uber-master-thingummy.
>
> How do you service such processes? That's what I meant by the exclusive
> domains aren't really exclusive.

I move 'em. I have user code that identifies the kernel threads whose
cpus_allowed is a superset of cpus_online_map, and I put them in a nice
little padded cell with init and the classic Unix daemons, called the
'bootcpuset'.

The tasks whose cpus_allowed is a strict _subset_ of cpus_online_map
need to be where they are. These are things like the migration helper
threads, one for each cpu. They get a license to violate cpuset
boundaries.

I will probably end up submitting a patch at some point, that changes
two lines, one in ____call_usermodehelper() and one in kthread(), from
setting the cpus_allowed on certain kernel threads to CPU_MASK_ALL, so
that instead these lines set that cpus_allowed to a new mask, a kernel
global variable that can be read and written via the cpuset api. But
other than that, I don't need any more kernel hooks than I already have,
and even now, I can get everything that's causing me any grief pinned
into the bootcpuset.
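The shape of that two-line change might be roughly as follows - the name
of the global mask is made up here:

/* Hypothetical global mask, settable through the cpuset API, used in
 * place of CPU_MASK_ALL when generic kernel threads are spawned. */
cpumask_t kthread_default_cpus = CPU_MASK_ALL;

static void kthread_bind_default(struct task_struct *k)
{
	/* would replace set_cpus_allowed(k, CPU_MASK_ALL) in
	 * kthread() and ____call_usermodehelper() */
	set_cpus_allowed(k, kthread_default_cpus);
}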


> But that's the problem ... I think there are *always* cpusets that overlap.
> Which is sad (fixable?) because it breaks lots of intelligent things we
> could do.

So with my bootcpuset, the problem is reduced to a few tasks per CPU,
such as the migration threads, which must remain pinned on their one CPU
(or perhaps on just the CPUs local to one Memory Node). These tasks
remain in the root cpuset, which by the scheme we're contemplating,
doesn't get a sched_domain in the fancier configurations.

Yup - you're right - these tasks will also want the scheduler to give
them CPU time when they need it. Hmmm ... logically this violates our
nice schemes, but seems that we are down to such a small exception case
that there must be some primitive way to work around this.

We basically need to keep a list of the 4 or 5 per-cpu kernel threads,
and whenever we repartition the sched_domains, make sure that each such
kernel thread is bound to whatever sched_domain happens to be covering
that cpu. If we just wrote the code, and quit trying to find a grand
unifying theory to explain it consistently with the rest of our design,
it would probably work just fine.
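Something along these lines, say - the per-cpu thread list and the
kthread_list member are invented bookkeeping, purely to sketch the idea:

/*
 * Hypothetical: after repartitioning sched_domains, re-pin each per-cpu
 * kernel thread (migration thread, etc.) to its home cpu, whichever
 * sched_domain now covers that cpu.
 */
static struct list_head percpu_kthreads[NR_CPUS];	/* invented */

static void rebind_percpu_kthreads(void)
{
	struct task_struct *p;
	int cpu;

	for_each_online_cpu(cpu) {
		/* kthread_list is an invented task_struct member */
		list_for_each_entry(p, &percpu_kthreads[cpu], kthread_list)
			set_cpus_allowed(p, cpumask_of_cpu(cpu));
	}
}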

The cpuset code would have to be careful, when it came time to list the
tasks attached to a cpuset (Workload Manager software is fond of this
call) _not_ to list these indigenous (not "migrant" !) worker threads.
And when listing the tasks in the root cpuset, _do_ include them.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-04 00:55:20

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin wrote:
> (and existing processes forcibly migrated off)

No can do. As described in my previous message, everything is happily
moved already, with some user code (and a CPU_MASK_ALL patch to kthread
I haven't submitted yet) _except_ for a few per-CPU threads such as the
migration helpers, which can _not_ be moved off their respective CPUs.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-04 01:58:41

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Tim wrote:
> 7 years ago, before cpus_allowed was dreamed up, I proposed a pset patch

One more thing ... the original message from Simon and Sylvain that I
first saw a year ago announcing their cpuset work, which is the basis
for the current cpuset patch in Andrew's tree, began with the lines:


> From: Simon Derr <[email protected]>
> Date: Wed, 24 Sep 2003 17:59:01 +0200 (DFT)
> To: [email protected], [email protected]
> cc: Sylvain Jeaugey <[email protected]>
>
> We have developped a new feature in the Linux kernel, controlling CPU
> placements, which are useful on large SMP machines, especially NUMA ones.
> We call it CPUSETS, and we would highly appreciate to know about anyone
> who would be interested in such a feature. This has been somewhat inspired
> by the pset or cpumemset patches existing for Linux 2.4.


So I guess Tim, you (pset) and I (cpumemset) can both claim to
have developed anticedents of this current cpuset proposal.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-04 03:35:55

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin wrote:
> Rebalance!
> Ooooh, CPU 3 over there looks heavily loaded, I'll steal something.
> That one. Try to migrate. Oops, no cpus_allowed bars me.
> ...
> Humpf. I give up.
> ... ad infinitum.
>
> Desperately boring, and rather ineffective.

Well ... I don't mind unemployed CPUs being boorish. It's not that they
have much useful work to do. But if they keep beating down the doors of
their neighbors trying to find work, that seems disruptive. Won't CPU 3
in your example waste time and suffer increased lock contention,
responding to its deadbeat neighbor?


> > Likely your same concerns apply to the task->mems_allowed field that
> > I added, in the same fashion, in my cpuset patch of recent.
>
> Mmm, I'm less concerned about that one, or at least I can't specifically
> see how it breaks.

Ray Bryant <[email protected]> is working on this now. There are ways to get
memory allocated that hurt on our big boxes - such as blowing out one
node's memory with a disproportionate share of the system's page cache
pages, due to problems vaguely like the cpus_allowed ones.

The kernel allocator and numa placement policies don't really integrate
mems_allowed into their algorithms, but rather are just whacked upside
the head anytime they ask if they can allocate on a non-allowed node.
They can end up doing suboptimal placement on big boxes.

A common one is that the first node in a multiple-node cpuset gets a
bigger memory load from allocations initiated on nodes upstream of it,
that weren't allowed to roost closer to home (or something like this ...
not sure I said this one just right).

Ray is leaning on me to get some kind of memory policy in each cpuset.
I'm giving him a hard time back over details of what this policy
structure should look like, buying time while I try to make more sense
of this all.

I've added him to the cc list here - hopefully he will find my
characterization of our discussions amusing ;).


> > Somewhat like dual-channeled disks, having more than one
> > sched_domain apply at the same time to a given CPU leads to confusions
> > best avoided unless desperately needed.
>
> Agreed. The cpus_allowed mechanism doesn't seem well suited to heavy use
> anyway (I think John Hawkes had problems with it too).

The problems Hawkes had were various race conditions using the
new (at the time) set_cpus_allowed() that Ingo (I believe) added as part
of the O(1) scheduler. SGI was on the bleeding edge of using the
set_cpus_allowed() call in new and exciting ways, and there were various
race and lock conditions and issues with making sure the per-cpu
migration threads stayed home.

Other than reminding us that this stuff is hard, these problems Hawkes
dealt with don't, to my understanding, shed any light on the new issue
uncovered in this thread, that a simple per-task cpus_allowed mask,
heavily used to affect affinity policy, can interact poorly with
sophisticated schedulers trying to balance an entire system.

===

In sum, I am tending further in the direction of thinking we need to
have scheduler and allocation policies handled on a "per-domain" basis,
where these domains take the form of a partition of the system into
equivalence classes corresponding to subtrees of the cpuset hierarchy.

For example, just to throw out a wild and crazy idea, perhaps instead of
one global set of zonelists (one per node, each containing all nodes,
sorted in various numa friendly orders), rather there should be a set of
zonelists per memory-domain, containing just the nodes therein
(subsetted from the global zonelists, preserving order).
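A rough sketch of that wild and crazy idea - purely illustrative, not
how the allocator's zonelists are actually built today:

/*
 * Illustrative: build a per-domain zonelist by filtering the global node
 * ordering down to the nodes of one memory domain, preserving the
 * existing numa-friendly order.
 */
static void build_domain_zonelist(struct zonelist *dst,
				  const struct zonelist *global,
				  nodemask_t domain_nodes)
{
	int i, j = 0;

	for (i = 0; global->zones[i] != NULL; i++) {
		struct zone *z = global->zones[i];

		if (node_isset(z->zone_pgdat->node_id, domain_nodes))
			dst->zones[j++] = z;
	}
	dst->zones[j] = NULL;
}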

We'll have to be careful here. I suspect that the tolerance of those
running normal sized systems for this kind of crap will be pretty low.

Moreover, the scheduler in particular, and the allocator somewhat as
well, are areas with a long history of intense technical development.
Our impact on these areas has to be simplistic, so that folks doing the
real work here can keep our multi-domain stuff working with almost no
mind to it at all.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-04 03:43:53

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Most helpful response, Erich. Thanks.

> NEC SX (earth simulator style hardware)

Ah yes - another product that has earned my
affectionate term "big honkin iron".

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-04 03:58:06

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> Martin wrote:
>> (and existing processes forcibly migrated off)
>
> No can do. As described in my previous message, everything is happily
> moved already, with some user code (and a CPU_MASK_ALL patch to kthread
> I haven't submitted yet) _except_ for a few per-CPU threads such as the
> migration helpers, which can _not_ be moved off their respective CPUs.

Well, that just means we need to check for things bound to a subset when
we fork it off. ie if we have cpus 1,2,3,4 ... and there is

A bound to 1
B bound to 2
C bound to 3
D bound to 4

Then when I fork off exclusive subset for CPUs 1&2, I have to push A & B
into it. You're right, what I said was broken ... but it doesn't seem
hard to fix.

M.

2004-10-04 04:27:39

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin wrote:
> Then when I fork off exclusive subset for CPUs 1&2, I have to push A & B
> into it.

Tasks A & B must _not_ be considered members of that exclusive cpuset,
even though it seems that A & B must be attended to by the sched_domain
and memory_domain associated with that cpuset.

The workload managers expect to be able to list the tasks in a cpuset,
so it can hibernate, migrate, kill-off, or wait for the finish of these
tasks. I've been through this bug before - it was one that cost Hawkes
a long week to debug - I was moving the per-cpu migration threads off
their home CPU because I didn't have a clear way to distinguish tasks
genuinely in a cpuset, from tasks that just happened to be indigenous to
some of the same CPUs. My essential motivation for adapting a cpuset
implementation that has a task struct pointer to a shared cpuset struct
was to track exactly this relation - which tasks are in which cpuset.

No ... tasks A & B are not allowed in that new exclusive cpuset.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-04 14:04:35

by Hubertus Franke

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement



Erich Focht wrote:


> Can cpusets help me/us/Linux to get closer to these requirements?
>
> A clear yes. Regard cpusets as a new kind of composite resource built
> from memory and CPUs. They can play the role of the resource groups we
> need. Disjunct cpusets can run jobs which will almost never interfere
> cpu-cycle or memory-wise. This can be easilly integrated into PBS/LSF
> or whatever batch resource manager comes to your mind. Cpusets
> selected with some knowledge of the NUMA characteristics of a machine
> guarantee always reproducible and best compute performance. If a job
> runs alone in a cpuset it will run as if the machine has been reduced
> to that piece and is owned exclusively by the job. Also if the set
> contains as many CPUs as MPI processes, the cpuset helps getting some
> sort of gang scheduling (i.e. all members of a parallel process get
> cycles at the same time, this reduces barrier synchronisation times,
> improves performance and makes it more predictable). This is something
> one absolutely needs on big machines when dealing with time critical
> highest performance applications. Permanently losing 10% because the
> CPU placement is poor or because one has to get some other process out
> of the way is just unacceptable. When you sell machines for several
> millions 10% performance loss translates to quite some amount of
> money.
>
> Can CKRM (as it is now) fulfil the requirements?
>
> I don't think so. CKRM gives me to some extent the confidence that I
> will really use the part of the machine for which I paid, say 50%. But
> it doesn't care about the structure of the machine. CKRM tries giving
> a user as much of the machine as possible, at least the amount he paid
> for. For example: When I come in with my job the machine might be
> already running another job whose user also paid for 50% but was the
> only user and got 100% of the machine (say some Java application with
> enough threads...). This job maybe has filled up most of the memory
> and uses all CPUs. CKRM will take care of getting me cycles (maybe
> exclusively on 50% of the CPUs) and will treat my job preferentially
> when allocating memory, but will not care about the placement of the
> CPUs and the memory. Neither will it care whether the previously
> running job is still using my memory blocks and reducing my bandwidth
> to them. So I get 50% of the cycles and the memory but these will be
> BAD CYCLES and BAD MEMORY. My job will run slower than possible and a
> second run will be again different. Don't misunderstand me: CKRM in
> its current state is great for different things and running it inside
> a cpuset sounds like a good thing to do.

You forget that CKRM does NOT violate the constraints set forth by
cpus_allowed masks. So most of your drawbacks described above are simply
not true.
As such it comes back to the question of whether the RCFS
and controller interfaces can be used to set the cpus_allowed masks
in accordance with the current cpuset semantics.
Absolutely we can...

I am certainly not stipulating that cpusets can replace share based
scheduling or vice versa.

What remains to be discussed is how the two can be combined.
In order to allow CKRM scheduling within a cpuset, here are a few
questions to be answered:
(a) is it a guarantee/property that cpusets with the same
parent cpuset do not overlap?
(b) can we enforce that a certain task class is limited to a cpuset
and its subsets?

If we agree or disagree then we can work on a proposal for this.

>
> What about integration with PBS/LSF and alike?
>
> It makes sense to let an external resource manager (batch or
> non-batch) keep track of and manage cpusets resources. It can allocate
> them and give them to jobs (exclusively) and delete them. That's
> perfect and exactly what we want. CKRM is a resource manager itself
> and has an own idea about resources. Certainly PBS/LSF/etc. could
> create a CKRM class for each job and run it in this class. The
> difficulty is to avoid the resource managers to interfere and work
> against each other. In such a setup I'd rather expect a batch manager
> to be started inside one CKRM class and let it ensure that e.g. the
> interactive class isn't starved by the batch class.
>
> Can CKRM be extended to do what cpusets do?

See above, I think it can be. We need to answer (a) and (b) and then
define what a share means.

>
> Certainly. Probably easily. But cpusets will have to be reinvented, I
> guess. Same hooks, same checks, different user interface...
>

-- Hubertus

2004-10-04 14:08:47

by Erich Focht

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Sunday 03 October 2004 22:48, Andrew Morton wrote:
> Erich Focht <[email protected]> wrote:
> > Can CKRM be extended to do what cpusets do?
> >
> > Certainly. Probably easily. But cpusets will have to be reinvented, I
> > guess. Same hooks, same checks, different user interface...
>
> Well if it is indeed the case that the CKRM *framework* is up to the task
> of being used to deliver the cpuset functionality then that's the way we
> should go, no? It's more work and requires coordination and will deliver
> later, but the eventual implementation will be better.
>
> But I'm still not 100% confident that the CKRM framework is suitable.
> Mainly because the CKRM and cpuset teams don't seem to have looked at each
> other's stuff enough yet.

My optimistic assumption that it is easy to build cpusets into CKRM is
only valid for adding a cpuset controller into the CKRM framework and
forgetting about the other controllers. The problems start with the
other controllers... As Hubertus said: CKRM and cpusets are
orthogonal.

Now CKRM consists of a set of more or less independent (orthogonal)
controllers. There is a cpu cycles and memory controller. Their aims
are different from that of cpuset and they cannot fulfil the
requirements of cpusets. But they make sense for themselves.

Adding cpusets as another special resource controller is fine but
breaks the requirement of having independent controllers. With this we
suddenly have two ways of controlling cpu and memory assignment. As
discussed previously in this thread it probably makes more sense to
let the old CKRM controllers manage resources inside each cpuset (at
certain level in the cpusets tree). One could even imagine switching
off the CKRM controllers in particular sets. The old cpucycles and
memory controllers will not be able to influence cycles and memory
distribution outside a cpuset, anyway, because these are hard-limited
by the affinity masks. So adding cpusets into CKRM must lead
to dependent controllers and a hierarchy between them (cpusets being
above the old controllers). This is indeed difficult but Dipankar
mentioned that CKRM people think about such a design (if I interpreted
his email correctly).

If CKRM sticks at the requirement for independent controllers (which
is clean in design and has been demonstrated to work) then it should
maybe first learn to run in an arbitrary cpuset and ignore the rest of
the machine. Having separate CKRM instances running in each partition
of a machine soft-partitioned with cpusets could be a target.

If CKRM wants to be a universal resource controller in the kernel then
a resource dependency tree and hierarchy might need to get somehow
into the CKRM infrastructure. The cpu cycles controller should notice
that there is another controller above it (cpusets) and might ask
that controller which processes it should take into account for its
job. The memory controller might get a different answer... Uhmmm, this
looks like a difficult problem.

Regards,
Erich

2004-10-04 14:17:45

by Simon Derr

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Mon, 4 Oct 2004, Hubertus Franke wrote:

> What remains to be discussed is how the two can be combined.
> In order to allow CKRM scheduling within a cpuset, here are a few questions
> to be answered:
> (a) is it a guarantee/property that cpusets with the same
> parent cpuset do not overlap?

It depends on whether they are 'exclusive' cpusets or not.
In the general case, they may overlap.

2004-10-04 14:20:33

by Erich Focht

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Monday 04 October 2004 15:58, Hubertus Franke wrote:
> Erich Focht wrote:
> > Can CKRM (as it is now) fulfil the requirements?
> >
> > I don't think so. CKRM gives me to some extent the confidence that I
> > will really use the part of the machine for which I paid, say 50%. But
> > it doesn't care about the structure of the machine. CKRM tries giving
> > a user as much of the machine as possible, at least the amount he paid
> > for. For example: When I come in with my job the machine might be
> > already running another job whose user also paid for 50% but was the
> > only user and got 100% of the machine (say some Java application with
> > enough threads...). This job maybe has filled up most of the memory
> > and uses all CPUs. CKRM will take care of getting me cycles (maybe
> > exclusively on 50% of the CPUs) and will treat my job preferentially
> > when allocating memory, but will not care about the placement of the
> > CPUs and the memory. Neither will it care whether the previously
> > running job is still using my memory blocks and reducing my bandwidth
> > to them. So I get 50% of the cycles and the memory but these will be
> > BAD CYCLES and BAD MEMORY. My job will run slower than possible and a
> > second run will be again different. Don't misunderstand me: CKRM in
> > its current state is great for different things and running it inside
> > a cpuset sounds like a good thing to do.
>
> You forget that CKRM does NOT violate the constraints set forward by
> cpu_allowed masks. So most of your drawbacks described above are simply
> not true.

I explicitly assumed that I only use CKRM. This means all processes
have the trivial cpus_allowed mask and are allowed to go wherever they
want. With this assumption (and my understanding of CKRM) the
drawbacks will be there.

Cpusets is my method of choice (for the future) for setting the
cpus_allowed mask (and the memories_allowed). If I use cpusets AND
CKRM together all is fine, of course.

> I am certainly not stipulating that cpusets can replace share based
> scheduling or vice versa.
>
> What remains to be discussed is how the two can be combined.
> In order to allow CKRM scheduling within a cpuset, here are a few
> questions to be answered:
> (a) is it a guarantee/property that cpusets with the same
> parent cpuset do not overlap?

Right now it isn't, AFAIK. Paul, if all cpusets on the same level are
disjoint this certainly simplifies life. Would this be too strong a
limitation for you? We could live with it.

> (b) can we enforce that a certain task class is limited to a cpuset
> and its subsets.

That is intended, yes. A task escaping from its set would be a
security (or denial of service) risk.

Regards,
Erich

2004-10-04 14:39:39

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Erich wrote:
> > Can CKRM (as it is now) fulfil the requirements?
> ...
> [CKRM] doesn't care about the structure of the machine

Hubertus wrote:
> You forget that CKRM does NOT violate ... cpus_allowed ...
> ...
> In order to allow CKRM scheduling within a cpuset ...

I sense a disconnect here.

Seems to me Erich was asking if CKRM could be used _instead_ of cpusets,
and observes that, for now at least, CKRM lacks something.

Seems to me Hubertus is, _in_ _part_, responding to the question of
whether CKRM can be used _within_ cpusets, and claims to be taking a
position opposite to Erich's, protesting that indeed CKRM can be used
within cpusets - CKRM doesn't violate cpus_allowed constraints.

Hubertus - I didn't realize that Erich considered that question, nor did
I realize he took that position.

Unfortunately, the plot thickens. Hubertus goes on, it seems, to consider
other questions, and I start to lose the thread of his thought. Such
questions as:

- can RCFS/controllers set cpus_allowed as do cpusets?
[ beware that there's more to cpusets than setting cpus_allowed ]
- can cpusets replace share based scheduling?
- can share based scheduling replace cpusets?
- can CKRM scheduling be allowed within cpusets?
- are sibling cpusets exclusive?
[ yes - if the exclusive property is set on them ]
- can we enforce that a certain task class is limited to a cpuset subtree?

By now I'm thoroughly confused. Fortunately, Hubertus concludes:

- If we agree or disagree then we can work on a proposal for this.

Well, since I'm pretty sure from my Logic 101 class that we agree or
disagree, this is good news. I'm glad to hear we can work on a proposal
on this [ what was 'this' again ...? ;) ]

One thing I am sure of ... either one of Hubertus or myself needs another
cup of coffee, or both Hubertus and I need to have a beer together.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-04 15:00:12

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> My optimistic assumption that it is easy to build cpusets into CKRM is
> only valid for adding a cpuset controller into the CKRM framework and
> forgetting about the other controllers. The problems start with the
> other controllers... As Hubertus said: CKRM and cpusets are
> orthogonal.
>
> Now CKRM consists of a set of more or less independent (orthogonal)
> controllers. There is a cpu cycles and memory controller. Their aims
> are different from that of cpuset and they cannot fulfil the
> requirements of cpusets. But they make sense for themselves.
...

> If CKRM wants to be a universal resource controller in the kernel then
> a resource dependency tree and hierarchy might need to get somehow
> into the CKRM infrastructure. The cpu cycles controller should notice
> that there is another controller above it (cpusets) and might ask
> that controller which processes it should take into account for its
> job. The memory controller might get a different answer... Uhmmm, this
> looks like a difficult problem.

I see that the two mechanisms could have conflicting requirements. But
surely this is the case whether we merge the two into one integrated
system, or try to run CKRM and cpusets independently at the same time?
I'd think the problems would be easier to tackle if the systems knew
about each other, and talked to each other.

I don't think anyone is suggesting that either system as is could replace
the other ... more that a combined system could be made for both types
of resource control that would be a better overall solution.

M.

2004-10-04 15:04:53

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

--Paul Jackson <[email protected]> wrote (on Sunday, October 03, 2004 21:24:52 -0700):

> Martin wrote:
>> Then when I fork off exclusive subset for CPUs 1&2, I have to push A & B
>> into it.
>
> Tasks A & B must _not_ be considered members of that exclusive cpuset,
> even though it seems that A & B must be attended to by the sched_domain
> and memory_domain associated with that cpuset.
>
> The workload managers expect to be able to list the tasks in a cpuset,
> so they can hibernate, migrate, kill off, or wait for the finish of these
> tasks. I've been through this bug before - it was one that cost Hawkes
> a long week to debug - I was moving the per-cpu migration threads off
> their home CPU because I didn't have a clear way to distinguish tasks
> genuinely in a cpuset, from tasks that just happened to be indigenous to
> some of the same CPUs. My essential motivation for adapting a cpuset
> implementation that has a task struct pointer to a shared cpuset struct
> was to track exactly this relation - which tasks are in which cpuset.
>
> No ... tasks A & B are not allowed in that new exclusive cpuset.

OK, then your "exclusive" cpusets aren't really exclusive at all, since
they have other stuff running in them. The fact that you may institute
the stuff early enough to avoid most things falling into this doesn't
really solve the problems, AFAICS.

Or perhaps we end up with cpusets alpha and beta that you created, and we create
parallel cpusets that operate on the same sched_domain tree to contain the
other random stuff.

Kind of "cpu groups" and "task groups", where you can have multiple task
groups running on the same cpu group (or subset thereof), but not overlapping
different cpu groups. Then we can have one sched domain setup per cpu group,
or at least the top level entry in the main sched domain tree. This way the
scheduler might have a hope of working within this system efficiently ;-)
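
A rough data-structure sketch of that split; the struct names and fields
are invented purely to illustrate the idea, not proposed kernel structures:

#include <linux/sched.h>
#include <linux/cpumask.h>
#include <linux/list.h>

/* a "cpu group": a set of CPUs with its own top-level sched domain */
struct cpu_group {
        cpumask_t               cpus;
        struct sched_domain     *top_domain;
};

/* a "task group": a set of tasks bound to (a subset of) one cpu group */
struct task_group {
        struct cpu_group        *cpus;
        struct list_head        tasks;
};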

M.

2004-10-04 15:26:10

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Erich, responding to Hubertus:
> > (a) is it a guarantee/property that cpusets with the same
> > parent cpuset do not overlap?
>
> Right now it isn't AFAIK. Paul, if all cpusets on the same level are
> disjunct this certainly simplifies life. Would this be a too strong
> limitation for you? We could live with it.

Correct, Erich, it is not a guarantee that sibling cpusets don't
overlap, unless, as Simon noted, they are all marked exclusive.

Yes, it would be a stronger limitation than I would agree to, but that's
ok, because in my humble opinion, CKRM doesn't need it to operate within
cpusets.

I think what's needed for CKRM to operate within cpusets is clear
ownership.

Each instance of CKRM needs (tell me if I'm wrong here):
1) to have a clear and unambiguous answer to the question of
which CPUs, which Memory Nodes, and which Tasks it is
controlling,
2) no overlap of these sets with another instance of CKRM,
3) the CPUs and Memory Nodes on which any of these Tasks are
allowed to run must be a subset of those controlled by
this instance of CKRM, and
4) all Tasks allowed to run on any of the CPUs and Memory
Nodes controlled by this CKRM instance are in the list
of Tasks this CKRM knows it controls.

In short - each CKRM instance needs clear, unambiguous, non-overlapping
ownership of all it surveys.

Requesting that all cpusets be marked exclusive for both CPU and Memory
is an overzealous precondition for the above.

Another way to obtain the above requirements would be to assign each
CKRM instance to a separate cpuset subtree, where the root of the
subtree is marked exclusive for cpu and memory, where that CKRM instance
controls all CPUs and Memory owned by that subtree and all Tasks
attached to any cpuset in that subtree, and where any tasks attached to
ancestors of the root are either (1) not allowed to use any of the CPUs
and Memory assigned to the subtree, or (2) are both [2a] allowed to use
only some subset of the CPUs and Memory assigned to the subtree and [2b]
are included in the list of tasks to be managed by that CKRM instance.

(The last 4.5 lines above are the special case required to handle the
indigenous per-cpu tasks, such as the migration threads - sorry.)

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-04 15:32:56

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin wrote:
> I don't think anyone is suggesting that either system as is could replace
> the other ...

I'm pretty sure Andrew was suggesting this.

He began this thread addressing me with the statement:
>
> And CKRM is much more general than the cpu/memsets code, and hence it
> should be possible to realize your end-users requirements using an
> appropriately modified CKRM, and a suitable controller.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-04 15:43:54

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin wrote:
> I'd think the problems would be easier to tackle if the systems knew
> about each other, and talked to each other.

Clear boundaries should be enough. If each instance of CKRM is assured
that it has control of some subset of a system that's separate and
non-overlapping, with all Memory, CPU, Tasks, and Allowed masks of said
Tasks either wholly owned by that CKRM instance, or entirely outside,
then that should do it, right?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-04 15:44:16

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> Martin wrote:
>> I don't think anyone is suggesting that either system as is could replace
>> the other ...
>
> I'm pretty sure Andrew was suggesting this.
>
> He began this thread addressing me with the statement:
>>
>> And CKRM is much more general than the cpu/memsets code, and hence it
>> should be possible to realize your end-users requirements using an

Note especially the last line:

>> appropriately modified CKRM, and a suitable controller.

So not CKRM as-is ...

M.

2004-10-04 15:57:27

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin writes:
> OK, then your "exclusive" cpusets aren't really exclusive at all, since
> they have other stuff running in them.

What's clear is that 'exclusive' is not a sufficient precondition for
whatever it is that CKRM needs to have sufficient control.

Instead of trying to wrestle 'exclusive' into doing what you want, do me
a favor, if you would. Help me figure out what conditions CKRM _does_
need to operate within a cpuset, and we'll invent a new property that
satisfies those conditions.

See my earlier posts in the last hour for my efforts to figure out what
these conditions might be. I conjecture that it's something along the
lines of:

Assuring each CKRM instance that it has control of some
subset of a system that's separate and non-overlapping,
with all Memory, CPU, Tasks, and Allowed masks of said
Tasks either wholly owned by that CKRM instance, or
entirely outside.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-04 16:04:36

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin, quoting Andrew:
> >> appropriately modified CKRM, and a suitable controller.
>
> So not CKRM as-is ...

Yes - by now we all agree that CKRM as it is doesn't provide some things
that cpusets provides (though of course CKRM provides much that
cpusets doesn't.)

Andrew would ask, if I am channeling him correctly, how about CKRM as it
could be? What would it take to modify CKRM so that it could subsume
(embrace and replace) cpusets, meeting all the requirements that in the
end we agreed were essential for cpusets to meet, rendering cpusets
redundant and no longer needed?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-04 17:07:12

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin wrote:
> I don't think anyone is suggesting that either system as is could replace
> the other ... more that a combined system could be made for both types
> of resource control that would be a better overall solution.

Oops - sorry, Martin. I obviously didn't read your entire sentence
before objecting before.

Now that I do, it makes sense.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-04 18:22:15

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

--On Monday, October 04, 2004 09:02:32 -0700 Paul Jackson <[email protected]> wrote:

> Martin, quoting Andrew:
>> >> appropriately modified CKRM, and a suitable controller.
>>
>> So not CKRM as-is ...
>
> Yes - by now we all agree that CKRM as it is doesn't provide some things
> that cpusets provides (though of course CKRM provides much that
> cpusets doesn't.)
>
> Andrew would ask, if I am channeling him correctly, how about CKRM as it
> could be? What would it take to modify CKRM so that it could subsume
> (embrace and replace) cpusets, meeting all the requirements that in the
> end we agreed were essential for cpusets to meet, rendering cpusets
> redundant and no longer needed?

Well, or just merge the two somehow into one cohesive system, I'd think.
One doesn't need to completely subsume the other ;-)

M.

2004-10-04 18:25:27

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

--On Monday, October 04, 2004 08:53:27 -0700 Paul Jackson <[email protected]> wrote:

> Martin writes:
>> OK, then your "exclusive" cpusets aren't really exclusive at all, since
>> they have other stuff running in them.
>
> What's clear is that 'exclusive' is not a sufficient precondition for
> whatever it is that CKRM needs to have sufficient control.
>
> Instead of trying to wrestle 'exclusive' into doing what you want, do me
> a favor, if you would. Help me figure out what conditions CKRM _does_
> need to operate within a cpuset, and we'll invent a new property that
> satisfies those conditions.

Oh, I'm not even there yet ... just thinking about what cpusets needs
independently to operate efficiently - I don't think cpus_allowed is efficient.

Whatever we call it, the resource management system definitely needs the
ability to isolate a set of resources (CPUs, RAM) totally dedicated to
one class or group of processes. That's what I see as the main feature
of cpusets right now, though there may be other things there as well that
I've missed? At least that's the main feature I personally see a need for ;-)

> See my earlier posts in the last hour for my efforts to figure out what
> these conditions might be. I conjecture that it's something along the
> lines of:
>
> Assuring each CKRM instance that it has control of some
> subset of a system that's separate and non-overlapping,
> with all Memory, CPU, Tasks, and Allowed masks of said
> Tasks either wholly owned by that CKRM instance, or
> entirely outside.

Mmm. Looks like you're trying to do multiple CKRMs, one inside each cpuset,
right? Not sure that's the way I'd go, but maybe it makes sense.

The way I'm looking at it, which is probably wholly insufficient, if not
downright wrong, we have multiple process groups, each of which gets some
set of resources. Those resources may be dedicated to that class (a la
cpusets) or not. One could view this as a set of resource groupings, and
set of process groupings, where one or more process groupings is bound to
a resource grouping.

The resources are cpus & memory, mainly, in my mind (though I guess IO,
etc fit too). The resource sets are more like cpusets, and the process
groups a bit more like CKRM, except they seem to overlap (to me) when
the sets in cpusets are non-exclusive, or when CKRM wants harder performance
guarantees.

Feel free to point out where I'm full of shit / missing the point ;-)

M.

2004-10-04 18:33:49

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin writes:
>
> One doesn't need to completely subsume the other ;-)

Well, close to it.

It's not a marriage of equals in his challenge:
>
> And CKRM is much more general than the cpu/memsets code, and hence it
> should be possible to realize your end-users requirements using an
> appropriately modified CKRM, and a suitable controller.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-04 19:48:19

by Rick Lindsley

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> I move 'em. I have user code that identifies the kernel threads
> whose cpus_allowed is a superset of cpus_online_map, and I put them
> in a nice little padded cell with init and the classic Unix daemons,
> called the 'bootcpuset'.

So the examples you gave before were rather oversimplified, then?
You talked about dividing up a 256 cpu machine but didn't mention that
some portion of that must be reserved for the "bootcpuset". Would this
be enforced by the kernel, or the administrator?

I might suggest a simpler approach. As a matter of policy, at least one
cpu must remain outside of cpusets so that system processes like init,
getty, lpd, etc. have a place to run.

> The tasks whose cpus_allowed is a strict _subset_ of cpus_online_map
> need to be where they are. These are things like the migration
> helper threads, one for each cpu. They get a license to violate
> cpuset boundaries.

Literally, or figuratively? (How do we recognize these tasks?)

> I will probably end up submitting a patch at some point, that changes
> two lines, one in ____call_usermodehelper() and one in kthread(), from
> setting the cpus_allowed on certain kernel threads to CPU_MASK_ALL,
> so that instead these lines set that cpus_allowed to a new mask,
> a kernel global variable that can be read and written via the cpuset
> api. But other than that, I don't need anymore kernel hooks than I
> already have, and even now, I can get everything that's causing me
> any grief pinned into the bootcpuset.

Will cpus in exclusive cpusets be asked to service interrupts?

Martin pointed out the problem with looking at overloaded cpus repeatedly,
only to find (repeatedly) we can't steal any of their processes.
This is a real problem, but exists today outside of any cpuset changes.
A decaying failure rate might provide a hint to the scheduler to alleviate
this problem, or maybe the direct route of just checking more thoroughly
from the beginning is the answer.

> So with my bootcpuset, the problem is reduced, to a few tasks
> per CPU, such as the migration threads, which must remain pinned
> on their one CPU (or perhaps on just the CPUs local to one Memory
> Node). These tasks remain in the root cpuset, which by the scheme
> we're contemplating, doesn't get a sched_domain in the fancier
> configurations.

You just confused me on many different levels:

* what is the root cpuset? Is this the same as the "bootcpuset" you
made mention of?

* so where *do* these tasks go in the "fancier configurations"?

* what does it mean "not to get a sched_domain"? That the tasks in
the root cpuset can't move? Can't run? One solution to the
problem Martin described is to completely split the hierarchy that
sched_domain represents, with a different, disjoint tree for each
group of cpus in a cpuset. But wouldn't changing cpus_allowed
in every process do the same thing? (Isn't that how this would be
implemented at the lowest layer?)

I really haven't heard of anything that couldn't be handled adequately
through cpus_allowed so far other than "kicking everybody off a cpu"
which would need some new code. (Although, probably not, now that I
think of it, with the new hotplug cpu code wanting to do that too.)

> If we just wrote the code, and quit trying to find a grand unifying
> theory to explain it consistently with the rest of our design,
> it would probably work just fine.

I'll assume we're missing a smiley here.

So we want to pin a process to a cpu or set of cpus: set cpus_allowed to
that cpu or that set of cpus.
So we want its children to be subject to the same restriction: children
already inherit the cpus_allowed mask of their parent.
We want to keep out everyone who shouldn't be here: then clear the
bits for the restricted cpus in their cpus_allowed masks when the
restriction is created.

When you "remove a cpuset" you just or in the right bits in everybody's
cpus_allowed fields and they start migrating over.
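
For the user-space half of this view, a small sketch using the standard
sched_setaffinity() call (this illustrates the plain cpus_allowed approach
being described, not the cpuset interface):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t mask;

        /* restrict this process (and any future children) to CPUs 1 and 2 */
        CPU_ZERO(&mask);
        CPU_SET(1, &mask);
        CPU_SET(2, &mask);

        if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
                perror("sched_setaffinity");
                return 1;
        }

        /* "removing" the restriction is the reverse: widen the mask again */
        return 0;
}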

To me, this all works for the cpu-intensive, gotta have it with 1% runtime
variation example you gave. Doesn't it? And it seems to work for the
department-needs-8-cpus-to-do-as-they-please example too, doesn't it?
The scheduler won't try to move a process to someplace it's not allowed.

Rick

2004-10-04 20:28:24

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin wrote:
> Mmm. Looks like you're trying to do multiple CKRMs, one inside each cpuset,
> right? Not sure that's the way I'd go, but maybe it makes sense.

No - I was just reflecting my lack of adequate understanding of CKRM.

You guys were trying to get certain semantics out of cpusets to meet
your needs, putting words in my mouth as to what things like "exclusive"
meant, and I was pushing back, trying to get a fair, implementation
neutral statement of just what it was that CKRM needed out of cpusets,
by in part phrasing things in terms of what I thought you were trying to
have CKRM do with cpusets. Turns out I speak CKRM substantially worse
than you guys speak cpusets. <grin>

So nevermind what I was trying to do, which was, as you guessed:
>
> Looks like you're trying to do multiple CKRMs, one inside each cpuset,

Let me try again to see if I can figure out what you're trying to do.

You write:
>
> The way I'm looking at it, which is probably wholly insufficient, if not
> downright wrong, we have multiple process groups, each of which gets some
> set of resources. Those resources may be dedicated to that class (a la
> cpusets) or not. One could view this as a set of resource groupings, and
> set of process groupings, where one or more process groupings is bound to
> a resource grouping.
>
> The resources are cpus & memory, mainly, in my mind (though I guess IO,
> etc fit too). The resource sets are more like cpusets, and the process
> groups a bit more like CKRM, except they seem to overlap (to me) when
> the sets in cpusets are non-exclusive, or when CKRM wants harder performance
> guarantees.

I can understand it far enough to see groups of processes using groups
of resources (cpus & memory, like cpusets). Both of the phrases
containing "CKRM" in them go right past ... whizz. And I'm a little
fuzzy on what are the sets, invariants, relations, domains, ranges,
operations, pre and post conditions and such that could be modeled in a
more precise manner.

Keep talking ... Perhaps an example, along the lines of my "use case
scenarios", would help. When we start losing each other trying to
generalize too fast, it can help to make up an overly concrete example,
to get things grounded again.


> Whatever we call it, the resource management system definitely needs the
> ability to isolate a set of resources (CPUs, RAM) totally dedicated to
> one class or group of processes.

Not always "totally isolated and dedicated".

Here's a scenario that shows up some uses for "non-exclusive" cpusets.

Let's take my big 256 CPU system, divided into portions of 128, 64 and
64. At this level, these are three, mutually exclusive cpusets, and
interaction between them is minimized. In the first two portions, the
128 and the first 64, a couple of "company jewel" applications run.
These are highly tuned, highly parallel applications that are sucking up
99% of every CPU cycle, bus cycle, cache line and memory page available,
for hours on end, in a closely synchronized dance. They cannot tolerate
anything else interfering in their area. Frankly, they have little use
for CKRM, fancy schedulers or sophisticated allocators. They know
what's there, it's all theirs, and they know exactly what they want to
do with it. Get out of the way and let them do their job. Industrial
strength computing at its finest.

Ok that much is as before.

Now the last portion, the second 64, is more of a general use area. It
is less fully utilized, and its job mix is more varied and less tightly
administered. There's some 64-thread background application that puts a
fairly light load on things, running day and night (maybe the V.P. of
the MIS shop is a fan of SETI).

Since this is a parallel programming shop, people show up at random
hours with smaller parallel jobs, carve off temporary cpusets of the
appropriate size, and run an application in them. Their threads and
memory within their temporary cpuset are carefully placed, relative to
their cpuset, but they are not fully utilizing the nodes on which they
are running and they tolerate other things happening on the same nodes.
Perhaps the other stuff doesn't impact their performance much, or
perhaps they are too poor to pay for dedicated nodes (grad students
still looking for a grant?) ... whatever.

They may well make good use of a batch manager, to which they submit
jobs of a specified size (cpus and memory) so that the batch manager can
smooth out the load and avoid periods of excess idling or thrashing.
The implementation of the batch manager relies heavily on the underlying
cpuset facility to manage various subsets of CPU and Memory Nodes. The
batch manager might own the first 192 CPUs on the system too, but most
users never get to see that part of the system.

Within that last 64 portion the current mechanisms, including the per
task cpus_allowed and mems_allowed, and the current schedulers and
allocators, may well be doing a pretty good job. Sure, there is an
element of chaos and things aren't perfect. It's the "usual" timeshare
environment with a varied load mix.

The enforced placement within the smaller nested non-exclusive cpusets
probably surprises the scheduler and allocator at times, leading to
unfair imbalances. I imagine that if CKRM just had that last 64 portion
to manage, and this was just a 64 CPU system, not a 256, then CKRM could
do a pretty good job of managing the system's resources.

Enough of this story ...

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-04 22:30:06

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

>> The way I'm looking at it, which is probably wholly insufficient, if not
>> downright wrong, we have multiple process groups, each of which gets some
>> set of resources. Those resources may be dedicated to that class (a la
>> cpusets) or not. One could view this as a set of resource groupings, and
>> set of process groupings, where one or more process groupings is bound to
>> a resource grouping.
>>
>> The resources are cpus & memory, mainly, in my mind (though I guess IO,
>> etc fit too). The resource sets are more like cpusets, and the process
>> groups a bit more like CKRM, except they seem to overlap (to me) when
>> the sets in cpusets are non-exclusive, or when CKRM wants harder performance
>> guarantees.
>
> I can understand it far enough to see groups of processes using groups
> of resources (cpus & memory, like cpusets). Both of the phrases
> containing "CKRM" in them go right past ... whizz. And I'm a little
> fuzzy on what are the sets, invariants, relations, domains, ranges,
> operations, pre and post conditions and such that could be modeled in a
> more precise manner.
>
> Keep talking ... Perhaps an example, along the lines of my "use case
> scenarios", would help. When we start losing each other trying to
> generalize too fast, it can help to make up an overly concrete example,
> to get things grounded again.

Let me make one thing clear: I don't work on CKRM ;-) So I'm not either
desperately familiar with it, or partial to it. Nor am I desperately
infatuated enough with my employer to believe that just because they're
involved with it, it must be stunningly brilliant. So I think I'm actually
fairly impartial ... and balanced in ignorance on both sides ;-)

I do think both things are solving perfectly valid problems (that IMO
intersect) ... not sure whether either is doing it the best way though ;-).

>> Whatever we call it, the resource management system definitely needs the
>> ability to isolate a set of resources (CPUs, RAM) totally dedicated to
>> one class or group of processes.
>
> Not always "totally isolated and dedicated".
>
> Here's a scenario that shows up some uses for "non-exclusive" cpusets.
>
> Let's take my big 256 CPU system, divided into portions of 128, 64 and
> 64. At this level, these are three, mutually exclusive cpusets, and
> interaction between them is minimized. In the first two portions, the
> 128 and the first 64, a couple of "company jewel" applications run.
> These are highly tuned, highly parallel applications that are sucking up
> 99% of every CPU cycle, bus cycle, cache line and memory page available,
> for hours on end, in a closely synchronized dance. They cannot tolerate
> anything else interfering in their area. Frankly, they have little use
> for CKRM, fancy schedulers or sophisticated allocators. They know
> what's there, it's all theirs, and they know exactly what they want to
> do with it. Get out of the way and let them do their job. Industrial
> strength computing at its finest.
>
> Ok that much is as before.
>
> Now the last portion, the second 64, is more of a general use area. It
> is less fully utilized, and its job mix is more varied and less tightly
> administered. There's some 64-thread background application that puts a
> fairly light load on things, running day and night (maybe the V.P. of
> the MIS shop is a fan of SETI).
>
> Since this is a parallel programming shop, people show up at random
> hours with smaller parallel jobs, carve off temporary cpusets of the
> appropriate size, and run an application in them. Their threads and
> memory within their temporary cpuset are carefully placed, relative to
> their cpuset, but they are not fully utilizing the nodes on which they
> are running and they tolerate other things happening on the same nodes.
> Perhaps the other stuff doesn't impact their performance much, or
> perhaps they are too poor to pay for dedicated nodes (grad students
> still looking for a grant?) ... whatever.

OK, the dedicated stuff in cpusets makes a lot of sense to me, for the
reasons you describe above. One screaming problem we have at the moment
is we can easily say "I want to bind myself to CPU X" but no way to say
"kick everyone else off it". That seems like a very real problem.

However, the non-dedicated stuff seems much more debateable, and where
the overlap with CKRM stuff seems possible to me. Do the people showing
up at random with smaller parallel jobs REALLY, REALLY care about the
physical layout of the machine? I suspect not, it's not the highly tuned
syncopated rhythm stuff you describe above. The "give me 1.5 CPUs worth
of bandwidth please" model of CKRM makes much more sense to me.

> They may well make good use of a batch manager, to which they submit
> jobs of a specified size (cpus and memory) so that the batch manager can
> smooth out the load and avoid periods of excess idling or thrashing.
> The implementation of the batch manager relies heavily on the underlying
> cpuset facility to manage various subsets of CPU and Memory Nodes. The
> batch manager might own the first 192 CPUs on the system too, but most
> users never get to see that part of the system.
>
> Within that last 64 portion the current mechanisms, including the per
> task cpus_allowed and mems_allowed, and the current schedulers and
> allocators, may well be doing a pretty good job. Sure, there is an
> element of chaos and things aren't perfect. It's the "usual" timeshare
> environment with a varied load mix.
>
> The enforced placement within the smaller nested non-exclusive cpusets
> probably surprises the scheduler and allocator at times, leading to
> unfair imbalances. I imagine that if CKRM just had that last 64 portion
> to manage, and this was just a 64 CPU system, not a 256, then CKRM could
> do a pretty good job of managing the system's resources.

Right - exactly. Sounds like we're actually pretty much on the same page
(by the time I'd finished your email ;-)). So whatever the interface we
have, the underlying mechanisms seem to have two fundamentals: dedicated
and non-dedicated resources. cpusets seems to do a good job of dedicated
and I'd argue the interface of specifying physical resources is a bit
clunky for non-dedicated stuff. CKRM doesn't seem to tackle the dedicated
at all, but seems to have an easier way of doing the non-dedicated.

So personally what I'd like is to have a unified interface (and I care
not a hoot which, or a new one altogether), that can specify dedicated
or non-dedicated resources for groups of processes, and then have a
"cpusets-style" mechanism for the dedicated, and "CKRM-style" mechanism
for the non-dedicated. Not sure if that's exactly what Andrew was hoping
for, or the rest of you either ;-)

The whole discussion about multiple sched-domains, etc, we had earlier
is kind of just an implementation thing, but is a crapload easier to do
something efficient here if the bits caring about that stuff are only
dealing with dedicated resource partitions.

OK, now my email is getting as long as yours, so I'll stop ;-) ;-)

M.

2004-10-04 22:50:23

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Good questions - thanks.

Rick wrote:
> So the examples you gave before were rather oversimplified, then?

Yes - they were. Quite intentionally.

> some portion of that must be reserved for the "bootcpuset". Would this
> be enforced by the kernel, or the administrator?

It's administrative. You don't have to run your system this way. The
kernel threads (both per-cpu and system-wide), as well as init and the
classic Unix daemons, can be left running in the root cpuset (see below
for what that is). The kernel doesn't care.

It was the additional request for a CKRM friendly setup that led me to
point out that system-wide kernel threads could be confined to a
"bootcpuset". Since bootcpuset is user level stuff, I hadn't mentioned
it before, on the kernel mailing list.

The more common reason for confining such kthreads and Unix daemons to a
bootcpuset is to minimize interactions between such tasks and important
applications.

> I might suggest a simpler approach. As a matter of policy, at least one
> cpu must remain outside of cpusets so that system processes like init,
> getty, lpd, etc. have a place to run.

This is the same thing, in different words. In my current cpuset
implementation, _every_ task is attached to a cpuset.

What you call a cpu that "remains outside of cpusets" is the bootcpuset,
in my terms.

> The tasks whose cpus_allowed is a strict _subset_ of cpus_online_map
> need to be where they are. These are things like the migration
> helper threads, one for each cpu. They get a license to violate
> cpuset boundaries.
>
> Literally, or figuratively? (How do we recognize these tasks?)

I stated one critical word too vaguely. Let me restate (s/tasks/kernel
threads/), then translate.


> The kernel threads whose cpus_allowed is a strict _subset_ of cpus_online_map
> need to be where they are. These are things like the migration
> helper threads, one for each cpu. They get a license to violate
> cpuset boundaries.

> Literally, or figuratively? (How do we recognize these tasks?)

Literally. The early (_very_ early) user level code that sets up the
bootcpuset, as requested by a configuration file in /etc, moves the
kthreads with a cpus_allowed >= what's online to the bootcpuset, but
leaves the kthreads with a cpus_allowed < online where they are, in the
root cpuset.

If you do a "ps -efl", look for the tasks early in the list whose
command names in something like "/2" (printf format "/%u"). These
are the kthreads that usually need to be pinned on a CPU.

But you don't need to do that - an early boot user utility does it
as part of setting up the bootcpuset.
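
A hedged sketch of what such an early-boot utility might look like from
user space; the /dev/cpuset/boot/tasks path is an assumption for
illustration, and it assumes online CPUs are numbered 0..N-1:

#define _GNU_SOURCE
#include <sched.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Move every task whose affinity covers all online CPUs into the
 * bootcpuset; tasks pinned to a strict subset (migration threads and
 * friends) are left where they are.  At early boot this sweeps up the
 * unbound kthreads, init and the classic daemons.
 */
int main(void)
{
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        DIR *proc = opendir("/proc");
        struct dirent *de;

        if (!proc)
                return 1;
        while ((de = readdir(proc))) {
                cpu_set_t mask;
                pid_t pid;
                long cpu;
                int covers_all = 1;

                if (!isdigit((unsigned char)de->d_name[0]))
                        continue;
                pid = atoi(de->d_name);
                if (sched_getaffinity(pid, sizeof(mask), &mask) != 0)
                        continue;
                for (cpu = 0; cpu < ncpus; cpu++)
                        if (!CPU_ISSET(cpu, &mask))
                                covers_all = 0;
                if (covers_all) {
                        FILE *f = fopen("/dev/cpuset/boot/tasks", "w");
                        if (f) {
                                fprintf(f, "%d\n", (int)pid);
                                fclose(f);
                        }
                }
        }
        closedir(proc);
        return 0;
}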

> Will cpus in exclusive cpusets be asked to service interrupts?

The current cpuset implementation makes no effort to manage interrupts.
To manage interrupts in relation to cpusets today, you'd have to use
some other means to control or determine where interrupts were going,
and then place your cpusets with that in mind.

> So with my bootcpuset, the problem is reduced, to a few tasks
> per CPU, such as the migration threads, which must remain pinned
> on their one CPU (or perhaps on just the CPUs local to one Memory
> Node). These tasks remain in the root cpuset, which by the scheme
> we're contemplating, doesn't get a sched_domain in the fancier
> configurations.
>
> You just confused me on many different levels:
>
> * what is the root cpuset? Is this the same as the "bootcpuset" you
> made mention of?

Not the same.

The root cpuset is the all encompassing cpuset representing the entire
system, from which all other cpusets are formed. The root cpuset always
contains all CPUs and all Memory Nodes.

The bootcpuset is typically a small cpuset, a direct child of the root
cpuset, containing what would be in your terms the one or a few cpus
that are reserved for the classic Unix system processes like init,
getty, lpd, etc.

> * so where *do* these tasks go in the "fancier configurations"?

Er eh - in the root cpuset ;). Hmmm ... guess that's not your question.

In this fancy configuration, I had the few kthreads that could _not_
be moved to the bootcpuset, because they had to remain pinned on
specific CPUs (e.g. the migration threads), remain in the root cpuset.

When the exclusive child cpusets were formed, and each given their own
special scheduler domain, I rebound the scheduler domain to use for
these per-cpu kthreads to which ever scheduler domain managed the cpu
that thread lived on. The thread remained in the root cpuset, but
hitched a ride on the scheduler that had assumed control of the cpu that
the thread lived on. Everything in this paragraphy is something I
invented in the last two days, in response to various requests from
others for setups that provided a clear boundary of control to
schedulers.

> If we just wrote the code, and quit trying to find a grand unifying
> theory to explain it consistently with the rest of our design,
> it would probably work just fine.
>
> I'll assume we're missing a smiley here.

Not really. The per-cpu kthreads are a wart that doesn't fit the
particular design being discussed here very well. Warts happen.

> When you "remove a cpuset" you just or in the right bits in everybody's
> cpus_allowed fields and they start migrating over.
>
> To me, this all works for the cpu-intensive, gotta have it with 1% runtime
> variation example you gave. Doesn't it? And it seems to work for the
> department-needs-8-cpus-to-do-as-they-please example too, doesn't it?

What you're saying is rather like saying I don't need a file system
on my floppy disk. Well, originally, I didn't. I wrote the bytes
to my tape cassette, I read them back. What's the problem? If I
wanted to name the bytes, I stuck a label on the cassette and wrote
a note on the label.

Yes, that works. As systems get bigger, and as we add batch managers
and such to handle a more complicated set of jobs, we need to be able
to do things like:
* name sets of CPUs/Memory, in a way consistent across the system
* create and destroy a set
* control who can query, modify and attach a set
* change which set a task is attached to
* list which tasks are currently attached to a set
* query, set and change which CPUs and Memory are in a set.
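
A hedged sketch of what those operations look like through a mounted
cpuset pseudo filesystem; the control file names used here (cpus, mems,
tasks) and the cpuset name are assumptions for illustration, modelled on
how the interface later settled, not a statement of the patch's exact API:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        FILE *f;

        /* create a named set (directory permissions control who may touch it) */
        mkdir("/dev/cpuset/batch1", 0755);

        /* say which CPUs and Memory Nodes are in the set */
        if ((f = fopen("/dev/cpuset/batch1/cpus", "w"))) {
                fprintf(f, "4-7\n");
                fclose(f);
        }
        if ((f = fopen("/dev/cpuset/batch1/mems", "w"))) {
                fprintf(f, "1\n");
                fclose(f);
        }

        /* attach the current task to the set */
        if ((f = fopen("/dev/cpuset/batch1/tasks", "w"))) {
                fprintf(f, "%d\n", (int)getpid());
                fclose(f);
        }

        /* destroying the set is rmdir("/dev/cpuset/batch1") once it is empty */
        return 0;
}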

This is like needing a FAT file system for your floppy. Cpusets
join the collection of "first class, kernel managed" objects,
and are no longer just the implied attributes of each task.

Batch managers and sysadmins of more complex, dynamically changing
configurations, sometimes on very large systems that are shared across
several departments or divisions, depend on this ability to treat
cpusets as first class, named, kernel managed objects.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-05 03:15:08

by Matt Helsley

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Sat, 2004-10-02 at 16:44, Hubertus Franke wrote:
<snip>
> along cpuset boundaries. If taskclasses are allowed to span disjoint
> cpumemsets, what is then the definition of setting shares ?
<snip>

I think the clearest interpretation is that the share ratios are the same
but the quantity of "real" resources and the sum of shares allocated are
different depending on the cpuset.

For example, suppose we have taskclass/A that spans cpusets Foo and Bar
-- processes foo and bar are members of taskclass/A but in cpusets Foo
and Bar respectively. Both get up to 50% share of cpu time in their
respective cpusets because they are in taskclass/A. Further suppose that
cpuset Foo has 1 CPU and cpuset Bar has 2 CPUs.

This means process foo could consume up to half a CPU while process bar
could consume up to a whole CPU. In order to enforce cpuset
partitioning, each class would then have to track its share usage on a
per-cpuset basis. [Otherwise share allocation in one partition could
prevent share allocation in another partition. Using the example above,
suppose process foo is using 45% of CPU in cpuset Foo. If the total
share consumption is calculated across cpusets, process bar would only be
able to consume up to 5% of CPU in cpuset Bar.]
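
To make the arithmetic concrete, a tiny sketch using the numbers from the
example above; the variable names are illustrative only and nothing here
is a CKRM API:

#include <stdio.h>

int main(void)
{
        double class_share = 0.50;      /* taskclass/A's share within each cpuset */
        int cpus_in_foo = 1;            /* cpuset Foo */
        int cpus_in_bar = 2;            /* cpuset Bar */

        /* the same 50% ratio buys different amounts of "real" CPU */
        printf("foo may use up to %.2f CPUs\n", class_share * cpus_in_foo);
        printf("bar may use up to %.2f CPUs\n", class_share * cpus_in_bar);
        return 0;
}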

Cheers,
-Matt Helsley

2004-10-05 08:31:57

by Hubertus Franke

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement



Matthew Helsley wrote:

> On Sat, 2004-10-02 at 16:44, Hubertus Franke wrote:
> <snip>
>
>>along cpuset boundaries. If taskclasses are allowed to span disjoint
>>cpumemsets, what is then the definition of setting shares ?
>
> <snip>
>
> I think the clearest interpretation is that the share ratios are the same
> but the quantity of "real" resources and the sum of shares allocated are
> different depending on the cpuset.
>
> For example, suppose we have taskclass/A that spans cpusets Foo and Bar
> -- processes foo and bar are members of taskclass/A but in cpusets Foo
> and Bar respectively. Both get up to 50% share of cpu time in their
> respective cpusets because they are in taskclass/A. Further suppose that
> cpuset Foo has 1 CPU and cpuset Bar has 2 CPUs.

Yes, we ( Shailabh and I ) were talking about exactly that this
afternoon. This would mean that the denominator of the cpu shares for a
given class <cls> is not determined solely by the parents
total_guarantee but by:
total_guarantee * size(cls->parent->cpuset) / size(cls->cpuset)

This is effectively what you describe below.
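
As a minimal illustration of that scaling (size() meaning simply the
number of CPUs in a cpuset; this is a sketch of the formula above, not
actual CKRM code):

/*
 * Effective denominator for a class's cpu share when the class sits in a
 * smaller cpuset than its parent, per the formula quoted above.
 */
static int scaled_total_guarantee(int total_guarantee,
                                  int parent_cpuset_size,
                                  int class_cpuset_size)
{
        return total_guarantee * parent_cpuset_size / class_cpuset_size;
}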

>
> This means process foo could consume up to half a CPU while process bar
> could consume up to a whole CPU. In order to enforce cpuset
> partitioning, each class would then have to track its share usage on a
> per-cpuset basis. [Otherwise share allocation in one partition could
> prevent share allocation in another partition. Using the example above,
> suppose process foo is using 45% of CPU in cpuset Foo. If the total
> share consumption is calculated across cpusets process bar would only be
> able to consume up to 5% of CPU in cpuset Bar.]
>

This would require some changes in the CPU scheduler to teach the
cpu-monitor to deal with the limited scope. It would also require some
mods to the API:
Since classes can span different cpu sets with different shares,
how do we address the cpushare of a class in the particular context
of a cpu-set?
Alternatively, one could require that classes can not span different
cpu-sets, which would significantly reduce the complexity of this.

> Cheers,
> -Matt Helsley

2004-10-05 09:20:47

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin wrote:
> Let me make one thing clear: I don't work on CKRM ;-)

ok ...

Indeed, unless I'm not recognizing someone's expertise properly, there
seems to be a shortage of the CKRM experts on this thread.

Who am I missing ...

> However, the non-dedicated stuff seems much more debateable, and where
> the overlap with CKRM stuff seems possible to me. Do the people showing
> up at random with smaller parallel jobs REALLY, REALLY care about the
> physical layout of the machine? I suspect not, it's not the highly tuned
> syncopated rhythm stuff you describe above. The "give me 1.5 CPUs worth
> of bandwidth please" model of CKRM makes much more sense to me.

It will vary. In shops that are doing a lot of highly parallel work,
such as with OpenMP or MPI, many smaller parallel jobs will also be
placement sensitive. The performance of such jobs is hugely sensitive
to their placement and scheduling on dedicated CPUs and Memory, one per
active thread.

These shops will often use a batch scheduler or workload manager, such
as PBS or LSF to manage their jobs. PBS and LSF make a business of
defining various sized cpusets to fit the queued jobs, and running each
job in a dedicated cpuset. Their value comes from obtaining high
utilization, and optimum repeatable runtimes, on a varied input job
stream, especially of placement sensitive jobs. The feature set of
cpusets was driven as much as anything by what was required to support a
port of PBS or LSF.

> I'd argue the interface of specifying physical resources is a bit
> clunky for non-dedicated stuff.

Likely so - the interface is expected to be wrapped with a user level
'cpuset' library, which converts it to a 'C' friendly model. And that
in turn is expected to be wrapped with a port of LSF or PBS, which
converts placement back to something that the customer finds familiar
and useful for managing their varied job mix.

I don't expect admins at HPC shops to spend much time poking around the
/dev/cpuset file system, though it is a nice way to look around and
figure out how things work.

The /dev/cpuset pseudo file system api was chosen because it was
convenient for small scale work, learning and experimentation, because
it was a natural for the hierarchical name space with permissions that I
required, and because it was convenient to leverage existing vfs
structure in the kernel.
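
For anyone who hasn't poked at it, a minimal user-space sketch of that
interface (assuming per-cpuset control files named 'cpus', 'mems' and
'tasks' under /dev/cpuset; the cpuset name and CPU/node numbers are
arbitrary examples):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write a short string to one of the cpuset control files. */
static void write_str(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0 || write(fd, val, strlen(val)) < 0) {
                perror(path);
                exit(1);
        }
        close(fd);
}

int main(void)
{
        char buf[32];

        /* Create a child cpuset with CPUs 1-3 and memory node 0. */
        if (mkdir("/dev/cpuset/demo", 0755) != 0 && errno != EEXIST) {
                perror("mkdir");
                exit(1);
        }
        write_str("/dev/cpuset/demo/cpus", "1-3");
        write_str("/dev/cpuset/demo/mems", "0");

        /* Move the current task into the new cpuset. */
        snprintf(buf, sizeof(buf), "%d", getpid());
        write_str("/dev/cpuset/demo/tasks", buf);
        return 0;
}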

> So personally what I'd like is to have a unified interface
> ...
> Not sure if that's exactly what Andrew was hoping
> for, or the rest of you either ;-)

Well, not what I'm pushing for, that's for sure.

We really have two different mechanisms here:

1) A placement mechanism, explicitly specifying what CPUs and Memory
Nodes are allowed, and
2) A sharing mechanism, specifying what proportion of fungible
resources such as cpu cycles, page faults, i/o requests a particular
subset (class) of the user population is to receive.

If you look at the very lowest level hooks for cpusets and CKRM, you
will see the essential difference:

1) cpusets hooks the scheduler to prohibit scheduling on a CPU that
is not allowed, and the allocator to prohibit obtaining memory
on a Node that is not allowed.
2) CKRM hooks these and other places to throttle tasks by inserting
small delays, so as to obtain the requested share or percentage,
per class of user, of the rate of usage of fungible resources.
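
A toy user-space sketch of the difference between those two kinds of
decisions (nothing here is actual kernel code from either patch; the
masks, shares and helper names are purely illustrative):

#include <stdio.h>

/* 1) Placement (cpuset-style): a hard yes/no test per CPU or node. */
static int allowed(unsigned long allowed_mask, int cpu)
{
        return (allowed_mask >> cpu) & 1;   /* disallowed CPUs are never used */
}

/* 2) Sharing (CKRM-style): any CPU will do; throttle when over share. */
static int should_throttle(double used, double total, double share)
{
        return used / total > share;        /* insert a small delay if true */
}

int main(void)
{
        printf("cpu 2 allowed:  %d\n", allowed(0x3UL, 2));           /* mask 0x3 = cpus 0-1 */
        printf("throttle class: %d\n", should_throttle(6.0, 10.0, 0.5));
        return 0;
}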

The specific details which must be passed back and forth across the
boundary between the kernel and user-space for these two mechanisms are
simply different. One controls which of a list of enumerable finite
non-substitutable resources may or may not be used, and the other
controls what share of other anonymous, fungible resources may be used.

Looking for a unified interface is a false economy in my view, and I
am suspicious that such a search reflects a failure to recognize the
essential differences between the two mechanisms.

> The whole discussion about multiple sched-domains, etc, we had earlier
> is kind of just an implementation thing, but is a crapload easier to do
> something efficient here if the bits caring about that stuff are only
> dealing with dedicated resource partitions.

Yes - much easier. I suspect that someday I will have to add to cpusets
the ability to provide, for select cpusets, the additional guarantees
(sole and exclusive ownership of all the CPUs, Memory Nodes, Tasks and
affinity masks therein) which a scheduler or allocator that's trying to
be smart requires to avoid going crazy. Not all cpusets need this - but
those cpusets which define the scope of scheduler or allocator domain
would sure like it. Whatever my exclusive flag means now, I'm sure we
all agree that it is too weak to meet this particular requirement.

> OK, now my email is getting as long as yours, so I'll stop ;-) ;-)

That would be tragic indeed. Good thing you stopped.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-05 09:31:49

by Simon Derr

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Mon, 4 Oct 2004, Martin J. Bligh wrote:

> OK, then your "exclusive" cpusets aren't really exclusive at all, since
> they have other stuff running in them. The fact that you may institute
> the stuff early enough to avoid most things falling into this doesn't
> really solve the problems, AFAICS.

I'd like to present at this point the original reasoning behind having
exclusive (called strict, at that point in history) and non-exclusive
cpusets.

The idea was to have a system, and run all jobs on it through a batch
scheduler. Some jobs cared about performance, some didn't.

The ones who cared about performance got an 'exclusive' cpuset, the ones
who didn't got a 'non exclusive' cpuset.

Now there is a possibility that, at a given time, only 'exclusive' jobs
are running, and hence that 'exclusive' cpusets have been created for jobs
on all the CPUs.

Our system (at Bull) is both a big and a small machine:
-big: we have NUMA constraints.
-small: we don't have enough CPUs to spare one, we need to use ALL CPUs
for our jobs.

There are still processes running outside the job cpusets (i.e in the root
cpuset), sshd, the batch scheduler. These tasks use a low amount of CPU,
so it is okay if they happen to run inside even 'exclusive' cpusets. For
us, 'exclusive' only means that no other CPU-hungry job is going to share
our CPU.

Of course, in our case, a valid argument is that 'exclusiveness' should
not be enforced by the kernel but rather by the job scheduler. Probably.

But now I see that the discussion is going towards:
-fully exclusive cpusets, maybe even with no interrupts handling
-maybe only allow exclusive cpusets, since non-exclusive cpusets are
tricky wrt CKRM.

That would be a no-go for us.


Simon.

2004-10-05 10:02:13

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Simon wrote:
> But now I see that the discussion is going towards:
> -fully exclusive cpusets, maybe even with no interrupts handling
> -maybe only allow exclusive cpusets, since non-exclusive cpusets are
> tricky wrt CKRM.
>
> That would be a no-go for us.

I'm with you there, Simon. Not all cpusets should be exclusive.

It is reasonable for domain-capable schedulers, allocators and
resource managers (domain aware CKRM?) to require that any domain
they manage correspond to an exclusive cpuset, for some value
of exclusive stronger than now.

Less exclusive cpusets just wouldn't qualify for their own
scheduler, allocator or resource manager domains.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-05 10:06:10

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> Who am I missing ...

Oops - hi, Hubertus ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-05 14:26:07

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Hubertus writes:
> Since classes can span different cpu sets with different shares
> how do we address the cpushare of a class in the particular context
> of a cpu-set.
> Alternatively, one could require that classes can not span different
> cpu-sets, which would significantly reduce the complexity of this.

It's not just cpusets that set a task's cpus_allowed ...

Let's say we have a 16 thread OpenMP application, running on a cpuset of
16 CPUs on a large system, one thread pinned to each CPU of the 16 using
sched_setaffinity, running exclusively there. Which means that there
are perhaps eight tasks pinned on each of those 16 CPUs, the one OpenMP
thread, and perhaps seven indigenous per-cpu kernel threads:
migration, ksoftirq, events, kblockd, aio, xfslogd and xfsdatad
(using what happens to be on a random 2.6 Altix in front of me).
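
For reference, the per-thread pinning mentioned here is just the usual
sched_setaffinity call; a minimal sketch using today's glibc calling
convention (the CPU number is an arbitrary example; an OpenMP or MPI
runtime would do the equivalent once per worker thread):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        int cpu = argc > 1 ? atoi(argv[1]) : 0;
        cpu_set_t mask;

        /* Restrict the calling task to a single CPU. */
        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
                perror("sched_setaffinity");
                return 1;
        }
        printf("pinned to cpu %d\n", cpu);
        return 0;
}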

Then the class(es) containing the eight tasks on any given one of these
CPUs would be required to not contain any other tasks outside of those
eight, by your reduced complexity alternative, right?

On whom/what would this requirement be imposed? Hopefully some CKRM
classification would figure this out and handle the classification
automatically.

What of the couple of "mother" tasks in this OpenMP application, which
are in this same 16 CPU cpuset, probably pinned to all 16 of the CPUs,
instead of to any individual one of them? What are the requirements on
the classes to which these tasks belong, in relation to the above
classes for the per-cpu kthreads and per-cpu OpenMP threads? And on
what person/software is the job of adapting to these requirements
imposed?

Observe by the way that so long as:
1) the per-cpu OpenMP threads each get to use 99+% of their
respective CPUs,
2) CKRM doesn't impose any constraints or work on anything else

then what CKRM does here doesn't matter.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-05 19:38:01

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> The idea was to have a system, and run all jobs on it through a batch
> scheduler. Some jobs cared about performance, some didn't.
>
> The ones who cared about performance got an 'exclusive' cpuset, the ones
> who didn't got a 'non exclusive' cpuset.

OK, makes sense. Thanks for that.

> Of course, in our case, a valid argument is that 'exclusiveness' should
> not be enforced by the kernel but rather by the job scheduler. Probably.
>
> But now I see that the discussion is going towards:
> -fully exclusive cpusets, maybe even with no interrupts handling
> -maybe only allow exclusive cpusets, since non-exclusive cpusets are
> tricky wrt CKRM.

Nope - personally I see us more headed for the exclusive cpusets, and
handle the non-exclusive stuff via a more CKRM-style mechanism. Which
I still think achieves what you need, though perhaps not in exactly the
fashion you envisioned.

M.

2004-10-05 22:21:19

by Matthew Dobson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Sun, 2004-10-03 at 16:53, Martin J. Bligh wrote:
> > Martin wrote:
> >> Matt had proposed having a separate sched_domain tree for each cpuset, which
> >> made a lot of sense, but seemed harder to do in practice because "exclusive"
> >> in cpusets doesn't really mean exclusive at all.
> >
> > See my comments on this from yesterday on this thread.
> >
> > I suspect we don't want a distinct sched_domain for each cpuset, but
> > rather a sched_domain for each of several entire subtrees of the cpuset
> > hierarchy, such that every CPU is in exactly one such sched domain, even
> > though it be in several cpusets in that sched_domain.
>
> Mmmm. The fundamental problem I think we ran across (just whilst pondering,
> not in code) was that some things (eg ... init) are bound to ALL cpus (or
> no cpus, depending how you word it); i.e. they're created before the cpusets
> are, and are a member of the grand-top-level-uber-master-thingummy.
>
> How do you service such processes? That's what I meant by the exclusive
> domains aren't really exclusive.
>
> Perhaps Matt can recall the problems better. I really liked his idea, aside
> from the small problem that it didn't seem to work ;-)

Well that doesn't seem like a fair statement. It's potentially true,
but it's really hard to say without an implementation! ;)

I think that the idea behind cpusets is really good, essentially
creating isolated areas of CPUs and memory for tasks to run
undisturbed. I feel that the actual implementation, however, is taking
a wrong approach, because it attempts to use the cpus_allowed mask to
override the scheduler in the general case. cpus_allowed, in my
estimation, is meant to be used as the exception, not the rule. If we
wish to change that, we need to make the scheduler more aware of it, so
it can do the right thing(tm) in the presence of numerous tasks with
varying cpus_allowed masks. The other option is to implement cpusets in
a way that doesn't use cpus_allowed. That is the option that I am
pursuing.

My idea is to make sched_domains much more flexible and dynamic. By
adding locking and reference counting, and simplifying the way in which
sched_domains are created, linked, unlinked and eventually destroyed we
can use sched_domains as the implementation of cpusets. IA64 already
allows multiple sched_domains trees without a shared top-level domain.
My proposal is to make this functionality more generally available.
Extending the "isolated domains" concept a little further will buy us
most (all?) of the functionality of "exclusive" cpusets without the need to
use cpus_allowed at all.

I've got some code. I'm in the midst of pushing it forward to rc3-mm2.
I'll post an RFC later today or tomorrow when it's cleaned up.

-Matt

2004-10-05 22:26:29

by Matthew Dobson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Tue, 2004-10-05 at 02:17, Paul Jackson wrote:
> The /dev/cpuset pseudo file system api was chosen because it was
> convenient for small scale work, learning and experimentation, because
> it was a natural for the hierarchical name space with permissions that I
> required, and because it was convenient to leverage existing vfs
> structure in the kernel.

I really like the /dev/cpuset FS. I would like to leverage most of that
code to be the user level interface to creating, linking & destroying
sched_domains at some point. This, of course, is assuming that the
dynamic sched_domains concept meets with something less than catcalls
and jeers... ;)

-Matt

2004-10-05 22:35:34

by Matthew Dobson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Tue, 2004-10-05 at 02:26, Simon Derr wrote:
> I'd like to present at this point the original reasoning behind having
> exclusive (called strict, at that point in history) and non-exclusive
> cpusets.
>
> The idea was to have a system, and run all jobs on it through a batch
> scheduler. Some jobs cared about performance, some didn't.
>
> The ones who cared about performance got an 'exclusive' cpuset, the ones
> who didn't got a 'non exclusive' cpuset.

It sounds to me (and please correct me if I'm wrong) like 'non
exclusive' cpusets are more like a convenient way to group tasks than
any sort of performance or scheduling imperative. It would seem what
we'd really want here is a task grouping functionality, more than a
'cpuset'. A cpuset seems a bit heavy handed if all we want to do is group
tasks for ease of administration.


> There are still processes running outside the job cpusets (i.e in the root
> cpuset), sshd, the batch scheduler. These tasks use a low amount of CPU,
> so it is okay if they happen to run inside even 'exclusive' cpusets. For
> us, 'exclusive' only means that no other CPU-hungry job is going to share
> our CPU.

If that's all 'exclusive' means then 'exclusive' is a poor choice of
terminology. 'Exclusive' sounds like it would exclude all tasks it is
possible to exclude from running there (ie: with the exception of
certain necessary kernel threads).

-Matt

2004-10-06 00:33:03

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin wrote:
> Nope - personally I see us more headed for the exclusive cpusets, and
> handle the non-exclusive stuff via a more CKRM-style mechanism.

As Simon said, no go.

Martin:

1) Are you going to prevent sched_setaffinity calls as well?
What about the per-cpu kernel threads?

See my reply to Hubertus on this thread:

Date: Tue, 5 Oct 2004 07:20:48 -0700
Message-Id: <[email protected]>

2) Do you have agreement from the LSF and PBS folks that they
can port to systems that support "shares" (the old Cray
term roughly equivalent to CKRM), but lacking placement
for jobs using shared resources? I doubt it.

3) Do you understand that OpenMP and MPI applications can really
need placement, in order to get separate threads on separate
CPUs, to allow concurrent execution, even when they aren't using
(or worth providing) 100% of each CPU they are on?

4) Continuing on item (1), I think that CKRM is going to have to
deal with varying, detailed placement constraints, such as is
presently implemented using a variety of settings of cpus_allowed
and mems_allowed. So too will schedulers and allocators. We can
set up a few high level domains that correspond to entire cpuset
subtrees, that have closer to the exclusive properties that
you want (stronger than the current cpuset exclusive flag ensures).
But within any of those domains, we need a mix of exclusive
and non-exclusive placement.

The CKRM controlled shares style of mechanism is appropriate when
one CPU cycle is as good as another, and one just needs to manage
what share of the total capacity a given class of users receives.

There are other applications, such as OpenMP and MPI applications with
closely coupled parallel threads, that require placement, including in
setups where that application doesn't get a totally isolated exclusive
'soft' partition of its own. If an OpenMP or MPI job doesn't have each
separate thread placed on a distinct CPU, it runs like crud. This is
so whether the job has its own dedicated cpuset, or it is sharing CPUs.

And there are important system management products, such as OpenPBS and
LSF, which rely on placement of jobs in named sets of CPUs and Memory
Nodes, both for jobs that are closely coupled parallel and jobs that are
not, both for jobs that have exclusive use of the CPUs and Memory Nodes
assigned to them and not.

CKRM cannot make these other usage patterns and requirements go away,
and even if it could force cpusets to only come in the totally isolated
flavor, CKRM would still have to deal with the placement that occurs
on a thread-by-thread basis that is essential to the performance of
tightly coupled thread applications and essential to the basic function
of certain per-cpu kernel threads.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-06 01:17:33

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> 1) Are you going to prevent sched_setaffinity calls as well?

Outside of the exclusive domain they're bound into, yes.

> What about the per-cpu kernel threads?

Those are set up before the userspace domains, so will fall into
whatever domain they're bound to.

<cut lots of other stuff ...>

I think we're now getting down into really obscure requirements for
particular types of weird MP jobs. Whether Linux wants to support that
or not is open to debate, but personally, given the complexity involved,
I'd be against it.

I agree with the basic partitioning stuff - and see a need for that. The
non-exclusive stuff I think is fairly obscure, and unnecessary complexity
at this point, as 90% of it is covered by CKRM. It's Andrew and Linus's
decision, but that's my input.

We'll never be able to provide every single feature everyone wants without
overloading the kernel with reams of complexity. It's also an evolutionary
process of putting in the most important stuff first, and seeing how it
goes. I see that as the exclusive domain stuff (when we find a better
implementation than cpus_allowed) + the CKRM scheduling resource control.
I know you have other opinions.

M.

2004-10-06 02:12:21

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin writes:
> I agree with the basic partitioning stuff - and see a need for that. The
> non-exclusive stuff I think is fairly obscure, and unnecessary complexity
> at this point, as 90% of it is covered by CKRM. It's Andrew and Linus's
> decision, but that's my input.

Now you're trying to marginalize non-exclusive cpusets as a fringe
requirement. Thanks a bunch ;).

Instead of requiring complete exclusion for all cpusets, and pointing to
the current 'exclusive' flag as the wrong flag at the wrong place at the
wrong time (sorry - my radio is turned to the V.P. debate in the
background) how about we be clear about what sort of exclusion the
schedulers, the allocators and here the resource manager (CKRM) require.

I can envision dividing a machine into a few large, quite separate,
'soft' partitions, where each such partition is represented by a subtree
of the cpuset hierarchy, and where there is no overlap of CPUs, Memory
Nodes or tasks between the 'soft' partitions, even though there is a
possibly richly nested cpuset (cpu and memory affinity) structure within
any given 'soft' partition.

Nothing would cross 'soft' partition boundaries. So far as CPUs, Memory
Nodes, Tasks and their Affinity, the 'soft' partitions would be
separate, isolated, and non-overlapping.

Each such 'soft' partition could host a separate instance (domain) of
the scheduler, allocator, and resource manager. Any such domain would
know what set of CPUs, Memory Nodes and Tasks it was managing, and would
have complete and sole control of the scheduling, allocation or resource
sharing of those entities.

But also within a 'soft' partition, there would be finer grain placement,
finer grain CPU and Memory affinity, whether by the current tasks
cpus_allowed and mems_allowed, or by some improved mechanism that the
schedulers, allocators and resource managers could better deal with.

There _has_ to be. Even if cpusets, sched_setaffinity, mbind, and
set_mempolicy all disappeared tomorrow, you still have the per-cpu
kernel threads that have to be placed to a tighter specification than
the whole of such a 'soft' partition.

Could you or some appropriate CKRM guru please try to tell me what
isolation you actually need for CKRM. Matthew or Peter please do the
same for the schedulers.

In particular, do you need to prohibit any finer grained placement
within a particular domain, or not. I believe not. Is it not the case
that what you really need is that the cpusets that correspond to one of
your domains (my 'soft' partitions, above) be isolated from any other
such 'soft' partition? Is it not the case that further, finer grained
placement within such an isolated 'soft' partition is acceptable? Sure
better be. Indeed, that's pretty much what we have now, with what
amounts to a single domain covering the entire system.

Instead of throwing out half of cpusets on claims that it conflicts
with the requirements of the schedulers, resource managers or (not yet
raised) the allocators, please be more clear as to what the actual
requirements are.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-06 02:47:24

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Matthew wrote:
>
> I feel that the actual implementation, however, is taking
> a wrong approach, because it attempts to use the cpus_allowed mask to
> override the scheduler in the general case. cpus_allowed, in my
> estimation, is meant to be used as the exception, not the rule.

I agree that big chunks of a large system that are marching to the beat
of two distinctly different drummers would better have their schedulers
organized along the domains that you describe, than by brute force abuse
of the cpus_allowed mask.

I look forward to your RFC, Matthew. Though not being a scheduler guru,
I will mostly have to rely on the textual commentary in order to
understand what it means.

Finer grain placement of CPUs (sched_setaffinity) and Memory (mbind,
set_mempolicy) already exists, and is required by the parallel threaded
applications that OpenMP and MPI are commonly used to develop.

The finer grain use of non-exclusive cpusets, in order to support
such workload managers as PBS and LSF in managing this finer grained
placement on a system (domain) wide basis should not be placing any
significantly further load on the schedulers or resource managers.

The top level cpusets must provide additional isolation properties so
that separate scheduler and resource manager domains can work in
relative isolation. I've tried hard to speculate what these additional
isolation properties might be. I look forward to hearing from the CKRM
and scheduler folks on this. I agree that simple unconstrained (ab)use
of the cpus_allowed and mems_allowed masks, at that scale, places an
undue burden on the schedulers, allocators and resource managers.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-06 02:49:52

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Matthew wrote:
>
> By adding locking and reference counting, and simplifying the way in which
> sched_domains are created, linked, unlinked and eventually destroyed we
> can use sched_domains as the implementation of cpusets.

I'd be inclined to turn this sideways from what you say.

Rather, add another couple of properties to cpusets:

1) An isolated flag, that guarantees whatever isolation properties
we agree that schedulers, allocators and resource allocators
require between domains, and

2) For those cpusets which are so isolated, the option to add
links of some form, between that cpuset, and distinct scheduler,
allocator and/or resource domains.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-06 03:04:13

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Matthew writes:
>
> If that's all 'exclusive' means then 'exclusive' is a poor choice of
> terminology. 'Exclusive' sounds like it would exclude all tasks it is
> possible to exclude from running there (ie: with the exception of
> certain necessary kernel threads).

I suspect that my aggressive pushing of mechanism _out_ of the
kernel has obscured what's going on here.

The real 'exclusive' use of some set of CPUs and Memory Nodes
is provided by the workload managers, PBS and LSF. They fabricate
this out of the kernel cpuset 'exclusive' property, plus other
optional user level stuff.

For instance, one doesn't have to follow Simon's example, and leave the
classic Unix daemon load running in a cpuset that shares resources with
all other cpusets. Instead, one can corral this classic Unix load into a
bootcpuset, administratively, at system boot. All the kernel mechanisms
required to support this exist in my current cpuset patch in Andrew's
tree.
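
A rough sketch of what such an administrative boot-time step might look
like, assuming the /dev/cpuset interface discussed in this thread (the
'boot' name, CPU range and control file names are illustrative, and
error handling is minimal):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write a short string to one of the cpuset control files. */
static void put(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0 || write(fd, val, strlen(val)) < 0)
                perror(path);
        if (fd >= 0)
                close(fd);
}

int main(void)
{
        FILE *tasks;
        char pid[32];

        /* A small cpuset for the classic Unix daemon load. */
        mkdir("/dev/cpuset/boot", 0755);
        put("/dev/cpuset/boot/cpus", "0-1");
        put("/dev/cpuset/boot/mems", "0");

        /* Move every task currently in the top cpuset into it. */
        tasks = fopen("/dev/cpuset/tasks", "r");
        if (!tasks)
                return 1;
        while (fgets(pid, sizeof(pid), tasks))
                put("/dev/cpuset/boot/tasks", pid);
        fclose(tasks);
        return 0;
}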

The kernel cpuset 'mems_exclusive' and 'cpus_exclusive' flags are like
vitamin precursors. They are elements out of which the real nutritive
compound is constructed. Occasionally, as in Simon's configuration,
they are actually sufficient in their current state. Usually, more
processing is required. This processing just isn't visible to the
kernel code.

Perhaps these flags should be called:
mems_exclusive_precursor
cpus_exclusive_precursor
;).

And I also agree that there is some other, stronger, set of conditions
that the scheduler, allocator and resource manager domains need in order
to obtain sufficient isolation to stay sane.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-06 08:06:45

by Simon Derr

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Tue, 5 Oct 2004, Matthew Dobson wrote:

> On Sun, 2004-10-03 at 16:53, Martin J. Bligh wrote:
> > > Martin wrote:
> > >> Matt had proposed having a separate sched_domain tree for each cpuset, which
> > >> made a lot of sense, but seemed harder to do in practice because "exclusive"
> > >> in cpusets doesn't really mean exclusive at all.
> > >
> > > See my comments on this from yesterday on this thread.
> > >
> > > I suspect we don't want a distinct sched_domain for each cpuset, but
> > > rather a sched_domain for each of several entire subtrees of the cpuset
> > > hierarchy, such that every CPU is in exactly one such sched domain, even
> > > though it be in several cpusets in that sched_domain.
> >
> > Mmmm. The fundamental problem I think we ran across (just whilst pondering,
> > not in code) was that some things (eg ... init) are bound to ALL cpus (or
> > no cpus, depending how you word it); i.e. they're created before the cpusets
> > are, and are a member of the grand-top-level-uber-master-thingummy.
> >
> > How do you service such processes? That's what I meant by the exclusive
> > domains aren't really exclusive.
> >
> > Perhaps Matt can recall the problems better. I really liked his idea, aside
> > from the small problem that it didn't seem to work ;-)
>
> Well that doesn't seem like a fair statement. It's potentially true,
> but it's really hard to say without an implementation! ;)
>
> I think that the idea behind cpusets is really good, essentially
> creating isolated areas of CPUs and memory for tasks to run
> undisturbed. I feel that the actual implementation, however, is taking
> a wrong approach, because it attempts to use the cpus_allowed mask to
> override the scheduler in the general case. cpus_allowed, in my
> estimation, is meant to be used as the exception, not the rule. If we
> wish to change that, we need to make the scheduler more aware of it, so
> it can do the right thing(tm) in the presence of numerous tasks with
> varying cpus_allowed masks. The other option is to implement cpusets in
> a way that doesn't use cpus_allowed. That is the option that I am
> pursuing.

I like this idea.

The current implementation uses cpus_allowed because it is non-intrusive,
as it does not touch the scheduler at all, and also maybe because it was
easy to do this way since the cpuset development team seems to lack
scheduler gurus.

The 'non intrusive' part was also important as long as the cpusets were
mostly 'on their own', but if now it appears that more cooperation with
other functions such as CKRM is needed, I suppose a deeper impact on the
scheduler code might be OK. Especially if we intend to enforce 'real
exclusive' cpusets or things like that.

So I'm really interested in any design/bits of code that would go in that
direction.

Simon.

2004-10-06 09:46:19

by Simon Derr

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Tue, 5 Oct 2004, Paul Jackson wrote:

> Matthew wrote:
> >
> > By adding locking and reference counting, and simplifying the way in which
> > sched_domains are created, linked, unlinked and eventually destroyed we
> > can use sched_domains as the implementation of cpusets.
>
> I'd be inclined to turn this sideways from what you say.
>
> Rather, add another couple of properties to cpusets:
>
> 1) An isolated flag, that guarantees whatever isolation properties
> we agree that schedulers, allocators and resource allocators
> require between domains, and
>
> 2) For those cpusets which are so isolated, the option to add
> links of some form, between that cpuset, and distinct scheduler,
> allocator and/or resource domains.
>

Just to make sure we speak the same language:

That would lead to three kinds of cpusets:

1-'isolated' cpusets, with maybe a distinct scheduler, allocator and/or
resource domains.

2-'exclusive' cpusets (maybe with a better name?), that just don't overlap
with other cpusets who have the same parent.

3-'non-exclusive, non isolated' cpusets, with no restriction of any kind.

I suppose it would still be possible to create cpusets of type 2 or 3
inside a type-1 cpuset. They would be managed by the scheduler of the
parent 'isolated' cpuset.

I was thinking that the top cpuset is a particular case of type-1, but
actually no.

'isolated' cpusets should probably be at the same level as the top cpuset
(who should lose this name, then).

How should 'isolated' cpusets be created ? Should the top_cpuset be shrunk
to free some CPUs so we have room to create a new 'isolated' cpuset ?

Or should 'isolated' cpusets stay inside the top cpuset, that would have
to schedule its processes outside the 'isolated' cpusets ? Should it then
be forbidden to cover the whole system with 'isolated' cpusets ?

That's a lot of question marks...

2004-10-06 13:32:10

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Simon wrote:
> Just to make sure we speak the same language:

Approximately. We already have two cpuset properties of cpus_exclusive
and mems_exclusive, which if set, assure that cpus and mems respectively
don't overlap with siblings or cousins.

I am imagining adding one more cpuset property: isolated.

Just what it would guarantee if set isn't clear yet: it would have to
provide whatever we agreed the scheduler, allocator and resource manager
folks needed in order to sanely support a separate domain in that
isolated cpuset. I'm currently expecting this to be something along the
lines of the following:
a. mems_exclusive == 1
b. cpus_exclusive == 1
c. no isolated ancestor or descendent
d. no task attached to any ancestor that is not either entirely within,
or entirely without, both the cpus and mems of the isolated cpuset.

Attempts later on to change the cpus or mems allowed of any task so as
to create a violation of [d.] would fail. As would any other action
that would violate the above.
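
To illustrate condition [d.] concretely, here is a toy user-space check
over plain bitmasks (the masks and helper names are made up; this is a
sketch of the proposed rule, not proposed kernel code):

#include <stdio.h>

/*
 * A task attached to an ancestor cpuset must have a cpus_allowed mask
 * that is either entirely inside, or entirely outside, the candidate
 * isolated cpuset's cpus.
 */
static int entirely_within(unsigned long task_cpus, unsigned long iso_cpus)
{
        return (task_cpus & ~iso_cpus) == 0;
}

static int entirely_without(unsigned long task_cpus, unsigned long iso_cpus)
{
        return (task_cpus & iso_cpus) == 0;
}

static int task_ok_for_isolation(unsigned long task_cpus, unsigned long iso_cpus)
{
        return entirely_within(task_cpus, iso_cpus) ||
               entirely_without(task_cpus, iso_cpus);
}

int main(void)
{
        unsigned long iso = 0xf0;       /* candidate isolated cpuset: cpus 4-7 */

        printf("%d\n", task_ok_for_isolation(0x30, iso));  /* cpus 4-5: ok   */
        printf("%d\n", task_ok_for_isolation(0x0f, iso));  /* cpus 0-3: ok   */
        printf("%d\n", task_ok_for_isolation(0x18, iso));  /* cpus 3-4: fail */
        return 0;
}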

I'm still unsure of just what is needed. I'm beginning to suspect that
there is a reasonable meeting point with the scheduler folks, but that
the CKRM folks may want something constructed from unobtainium. The
allocator folks are easy so far, as they haven't formed an organized
resistance ;).

There would be five flavors of cpusets. The four flavors obtained by
each combination of cpus_exclusive 0 or 1, and mems_exclusive 0 or 1,
but with isolated == 0. And the fifth flavor, with each of the
exclusive flags set 1, plus the isolated flag set 1.

The root node would start out isolated, but the first time you went to
mark a direct child of it isolated, if that effort succeeded, then the
root would lose its isolation (isolated goes to '0'), in accordance with
property [c.] You would have to be using a bootcpuset for this to have
any chance of working, with all the tasks having cpus_allowed ==
CPU_MASK_ALL, or mems_allowed == NODE_MASK_ALL, already confined to the
bootcpuset. The top level default scheduler, allocator and resource
manager would have to be able to work in a domain that was not isolated
and with some of its tasks, cpus and memory perhaps being managed by a
scheduler, allocator and/or resource manager in an isolated subordinate
domain.


> 'isolated' cpusets should probably be at the same level as the top cpuset
> (who should lose this name, then).

I don't think so. The top remains the one and only, all encompassing, top.


> Or should 'isolated' cpusets stay inside the top cpuset, that would have
> to schedule its processes outside the 'isolated' cpusets

Yes - isolated cpusets stay beneath the top cpuset. Any given task in
the top cpuset would lie either entirely within, or without, of any
isolated descendent. If within and if that isolated descendent has a
scheduler, it owns the scheduling of that task. Similarly for the
allocator and resource manager.


> Should it then
> be forbidden to cover the whole system with 'isolated' cpusets ?

No need for this that I am aware of, yet anyway.


> That's a lot of question marks...

Yes - lots of question marks.

But the basic objectives are not too much up to question at this point:
1) An isolated cpuset must not overlap any other isolated cpuset, not in
mems, not in cpus, and (the tricky part) not in the affinity masks (or
whatever becomes of cpus_allowed and mems_allowed) of any task in the
system.
2) For any cpus_allowed or mems_allowed of any task or cpuset in the
entire system, it is either entirely contained within some isolated
cpuset, or entirely outside all of them.
3) Necessarily from the above, the isolated cpusets form a partial,
non-overlapping covering of the entire systems cpus, memory nodes,
and (via the per-task affinity bitmaps) tasks.

The final result being that for any scheduler, allocator or resource
manager:
* it knows exactly what is its domain of cpus, memory nodes or tasks
* it is the sole and exclusive owner of all in its domain, and
* it has no bearing on anything outside its domain.

It may well be that task->cpus_allowed and task->mems_allowed remain as
they are now, but that for major top level 'soft' partitionings of the system,
we use these isolated cpusets, and attach additional properties friendly to
the needs of schedulers, allocators and resource managers to such isolated
cpusets. This would put the *_allowed bitmaps back closer to being what they
should be - small scale exceptions rather than large scale abuses.

An isolated cpuset might well not have its own dedicated domains for
all three of schedulers, allocators and resource managers. It might
have say just its own scheduler, but continue to rely on the global
allocator and resource manager.

===

First however - I am still eager to hear what the CKRM folks think of
set_affinity, mbind and set_mempolicy, as well as what they think of the
current existing per-cpu kernel threads. It would seem that, regardless
of their take on cpusets, the CKRM folks might not be too happy with any
of these other means of setting the *_allowed bitmaps to anything other
than CPU_MASK_ALL. My best guess from what I've seen so far is that
they are trying to ignore these other issues with varied *_allowed
bitmap settings as being 'beneath the radar', but trying to use the same
issues to transform cpusets into being pretty much _only_ the flat space
of isolated cpusets from above, minus its hierarchical nesting and
non-exclusive options.

And in any case, I've yet to see that OpenMP and MPI jobs, with their
tight threading, fit well in the CKRM-world. Such jobs require to have
each thread on a separate CPU, or their performance sucks big time. They
can share CPUs and Nodes with other work and not suffer _too_ bad
(especially if something like gang scheduling is available), but they
must be placed one thread per distinct CPU. This is absolutely a
placement matter, not a fair share percentage of overall resources
matter. From all I can see, the CKRM folks just wish such jobs would go
away, or at least they wish that the main Linux kernel would accept a
CKRM patch that is inhospitable to such jobs.

My hope is that CKRM, like the schedulers, is tolerant of smaller scale
exceptions to the allowed placement.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-06 21:58:45

by Peter Williams

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Simon Derr wrote:
> On Tue, 5 Oct 2004, Paul Jackson wrote:
>
>
>>Matthew wrote:
>>
>>>By adding locking and reference counting, and simplifying the way in which
>>>sched_domains are created, linked, unlinked and eventually destroyed we
>>>can use sched_domains as the implementation of cpusets.
>>
>>I'd be inclined to turn this sideways from what you say.
>>
>>Rather, add another couple of properties to cpusets:
>>
>> 1) An isolated flag, that guarantees whatever isolation properties
>> we agree that schedulers, allocators and resource allocators
>> require between domains, and
>>
>> 2) For those cpusets which are so isolated, the option to add
>> links of some form, between that cpuset, and distinct scheduler,
>> allocator and/or resource domains.
>>
>
>
> Just to make sure we speak the same language:
>
> That would lead to three kinds of cpusets:
>
> 1-'isolated' cpusets, with maybe a distinct scheduler, allocator and/or
> resource domains.
>
> 2-'exclusive' cpusets (maybe with a better name?), that just don't overlap
> with other cpusets who have the same parent.
>
> 3-'non-exclusive, non isolated' cpusets, with no restriction of any kind.
>
> I suppose it would still be possible to create cpusets of type 2 or 3
> inside a type-1 cpuset. They would be managed by the scheduler of the
> parent 'isolated' cpuset.
>
> I was thinking that the top cpuset is a particular case of type-1, but
> actually no.
>
> 'isolated' cpusets should probably be at the same level as the top cpuset
> (who should lose this name, then).
>
> How should 'isolated' cpusets be created ? Should the top_cpuset be shrunk
> to free some CPUs so we have room to create a new 'isolated' cpuset ?
>
> Or should 'isolated' cpusets stay inside the top cpuset, that would have
> to schedule its processes outside the 'isolated' cpusets ? Should it then
> be forbidden to cover the whole system with 'isolated' cpusets ?
>
> That's a lot of question marks...
>

I think that this is becoming overly complicated. I think that you need
(at most) two types of cpuset: 1. the top level non overlapping type and
2. possibly overlapping sets within the top level ones. I think that the
term cpuset should be reserved for the top level ones and some other
term be coined for the others. The type 2 ones are really just the
equivalent of the current affinity mask but with the added constraint
that it be a (non empty) proper subset of the containing cpuset.

The three types that you've described are then just examples of
configurations that could be achieved using this model.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2004-10-06 22:54:45

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Peter protests:
> I think that this is becoming overly complicated.

My brainstorming ways to accommodate the isolation that the scheduler,
allocator and resource manager domains require is getting ahead of
itself.

First I need to hear from the CKRM folks what degree of isolation they
really need, the essential minimum, and how they intend to accommodate
not just cpusets, but also the other placement APIs sched_setaffinity,
mbind and set_mempolicy, as well as the per-cpu kernel threads.

Then it makes sense to revisit the implementation.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-06 23:11:20

by Matthew Dobson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Tue, 2004-10-05 at 19:08, Paul Jackson wrote:
> Martin writes:
> > I agree with the basic partitioning stuff - and see a need for that. The
> > non-exclusive stuff I think is fairly obscure, and unnecessary complexity
> > at this point, as 90% of it is covered by CKRM. It's Andrew and Linus's
> > decision, but that's my input.
>
> Now you're trying to marginalize non-exclusive cpusets as a fringe
> requirement. Thanks a bunch ;).
>
> Instead of requiring complete exclusion for all cpusets, and pointing to
> the current 'exclusive' flag as the wrong flag at the wrong place at the
> wrong time (sorry - my radio is turned to the V.P. debate in the
> background) how about we be clear about what sort of exclusion the
> schedulers, the allocators and here the resource manager (CKRM) require.

I think what Martin is trying to say, in his oh so eloquent way, is that
the difference between 'non-exclusive' cpusets and, say, CKRM
taskclasses isn't very clear. It seems to me that non-exclusive cpusets
are little more than a convenient way to group tasks. Now, I'm not
saying that I don't think that is a useful functionality, but I am
saying that cpusets seem like the wrong way to go about it.


> I can envision dividing a machine into a few large, quite separate,
> 'soft' partitions, where each such partition is represented by a subtree
> of the cpuset hierarchy, and where there is no overlap of CPUs, Memory
> Nodes or tasks between the 'soft' partitions, even though there is a
> possibly richly nested cpuset (cpu and memory affinity) structure within
> any given 'soft' partition.
>
> Nothing would cross 'soft' partition boundaries. So far as CPUs, Memory
> Nodes, Tasks and their Affinity, the 'soft' partitions would be
> separate, isolated, and non-overlapping.

Ok. These imaginary 'soft' partitions sound much like what I expected
'exclusive' cpusets to be based on the terminology. They also sound
exactly like what I am trying to implement through my sched_domains
work.


> Each such 'soft' partition could host a separate instance (domain) of
> the scheduler, allocator, and resource manager. Any such domain would
> know what set of CPUs, Memory Nodes and Tasks it was managing, and would
> have complete and sole control of the scheduling, allocation or resource
> sharing of those entities.

I don't know that these partitions would necessarily need their own
scheduler, allocator and resource manager, or if we would just make the
current scheduler, allocator and resource manager aware of these
boundaries. In either case, that is an implementation detail not to be
agonized over now.


> But also within a 'soft' partition, there would be finer grain placement,
> finer grain CPU and Memory affinity, whether by the current tasks
> cpus_allowed and mems_allowed, or by some improved mechanism that the
> schedulers, allocators and resource managers could better deal with.
>
> There _has_ to be. Even if cpusets, sched_setaffinity, mbind, and
> set_mempolicy all disappeared tomorrow, you still have the per-cpu
> kernel threads that have to be placed to a tighter specification than
> the whole of such a 'soft' partition.

Agreed. I'm not proposing that we rip out sched_set/getaffinity, mbind,
etc. What I'm saying is that tasks should not *default* to using these
mechanisms because, at least in their current incarnations, our
scheduler and allocator are written in such a way that these mechanisms
are secondary. The assumption is that the scheduler/allocator can
schedule/allocate wherever they choose. The scheduler does look at
these bindings and if they contradict the decision made we deal with
that after the fact. The allocator has longer code paths and more logic
to deal with if there are bindings in place. So our options are to
either:
1) find a way to not have to rely on these mechanisms for most/all tasks
in the system, or
2) rewrite the scheduler/allocator to deal with these bindings up front,
and take them into consideration early in the scheduling/allocating
process.


> Could you or some appropriate CKRM guru please try to tell me what
> isolation you actually need for CKRM. Matthew or Peter please do the
> same for the schedulers.
>
> In particular, do you need to prohibit any finer grained placement
> within a particular domain, or not. I believe not. Is it not the case
> that what you really need is that the cpusets that correspond to one of
> your domains (my 'soft' partitions, above) be isolated from any other
> such 'soft' partition? Is it not the case that further, finer grained
> placement within such an isolated 'soft' partition is acceptable? Sure
> better be. Indeed, that's pretty much what we have now, with what
> amounts to a single domain covering the entire system.

I must also plead ignorance to the gritty details of CKRM. It would
seem to me, from discussions on this thread, that CKRM could be made to
deal with 'isolated' domains, 'soft' partitions, or 'exclusive' cpusets
without TOO much headache. Basically just telling CKRM that the tasks
in this group are sharing CPU time from a pool of 4 CPUs, rather than
all 16 CPUs in the system. Hubertus? As far as supporting fine grained
binding inside domains, that should definitely be supported in any
solution worthy of acceptance. CKRM, to the best of my knowledge,
currently deals with cpus_allowed, and there's no reason to think that
it wouldn't be able to deal with cpus_allowed in the multiple domain
case.


> Instead of throwing out half of cpusets on claims that it conflicts
> with the requirements of the schedulers, resource managers or (not yet
> raised) the allocators, please be more clear as to what the actual
> requirements are.

That's not really the reason that I was arguing against half of
cpusets. My argument is not related to CKRM's requirements, as I really
don't know what those are! :) My argument is that I don't see what
non-exclusive cpusets buys us. If all we're looking for is basic
task-grouping functionality, I'm quite certain that we can implement
that in a much more light-weight way that doesn't conflict with the
scheduler's decision making process. In fact, for non-exclusive
cpusets, I'd say that we can probably implement that type of
task-grouping in a non-intrusive way that will complement the scheduler
and possibly even improve performance by giving the scheduler a hint
about which tasks should be scheduled together. Using cpus_allowed is
not that way. cpus_allowed should be reserved for what it was
originally meant for: specifying a *strict* subset of CPUs that a task
is restricted to running on.

-Matt

2004-10-06 23:56:30

by Peter Williams

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Matthew Dobson wrote:
> On Tue, 2004-10-05 at 19:08, Paul Jackson wrote:
>
> I don't know that these partitions would necessarily need their own
> scheduler, allocator and resource manager, or if we would just make the
> current scheduler, allocator and resource manager aware of these
> boundaries. In either case, that is an implementation detail not to be
> agonized over now.

It's not so much whether they NEED their own scheduler, etc. as whether
it should be possible for them to have their own scheduler, etc. With a
configurable scheduler (such as ZAPHOD) this could just be a matter of
having separate configuration variables for each cpuset (e.g. if a
cpuset has been created to contain a bunch of servers there's no need
to try and provide good interactive response for its tasks (as none of
them will be interactive) so the interactive response mechanism can be
turned off in that cpuset leading to better server response and throughput).

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2004-10-06 23:19:12

by Matthew Dobson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Tue, 2004-10-05 at 20:01, Paul Jackson wrote:
> Matthew writes:
> >
> > If that's all 'exclusive' means then 'exclusive' is a poor choice of
> > terminology. 'Exclusive' sounds like it would exclude all tasks it is
> > possible to exclude from running there (ie: with the exception of
> > certain necessary kernel threads).
>
> I suspect that my aggressive pushing of mechanism _out_ of the
> kernel has obscured what's going on here.
>
> The real 'exclusive' use of some set of CPUs and Memory Nodes
> is provided by the workload managers, PBS and LSF. They fabricate
> this out of the kernel cpuset 'exclusive' property, plus other
> optional user level stuff.
>
> For instance, one doesn't have to follow Simon's example, and leave the
> classic Unix daemon load running in a cpuset that shares resources with
> all other cpusets. Instead, one can corral this classic Unix load into a
> bootcpuset, administratively, at system boot. All the kernel mechanisms
> required to support this exist in my current cpuset patch in Andrew's
> tree.
>
> The kernel cpuset 'mems_exclusive' and 'cpus_exclusive' flags are like
> vitamin precursors. They are elements out of which the real nutritive
> compound is constructed. Occasionally, as in Simon's configuration,
> they are actually sufficient in their current state. Usually, more
> processing is required. This processing just isn't visible to the
> kernel code.
>
> Perhaps these flags should be called:
> mems_exclusive_precursor
> cpus_exclusive_precursor
> ;).

Ok... So if we could offer the 'real' exclusion that the PBS and LSF
workload managers offer directly, would that suffice? Meaning, could we
make PBS and LSF work on top of in-kernel mechanisms that offer 'real'
exclusion. 'Real' exclusion defined as isolated groups of CPUs and
memory that the kernel can guarantee will not run other processes? That
way we can get the job done without having to rely on these external
workload managers, and be able to offer this dynamic partitioning to all
users. Thoughts?

-Matt

2004-10-07 00:00:44

by Matthew Dobson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Tue, 2004-10-05 at 19:39, Paul Jackson wrote:
> Matthew wrote:
> >
> > I feel that the actual implementation, however, is taking
> > a wrong approach, because it attempts to use the cpus_allowed mask to
> > override the scheduler in the general case. cpus_allowed, in my
> > estimation, is meant to be used as the exception, not the rule.
>
> I agree that big chunks of a large system that are marching to the beat
> of two distinctly different drummers would better have their schedulers
> organized along the domains that you describe, than by brute force abuse
> of the cpus_allowed mask.

Wonderful news! :)


> I look forward to your RFC, Matthew. Though not being a scheduler guru,
> I will mostly have to rely on the textual commentary in order to
> understand what it means.

Wow, building a fan base already. I'll need all the cheerleaders I can
get! ;)


> Finer grain placement of CPUs (sched_setaffinity) and Memory (mbind,
> set_mempolicy) already exists, and is required by the parallel threaded
> applications that OpenMP and MPI are commonly used to develop.

Absolutely. I have no intention of removing or modifying those
mechanisms. My only goal is to see that using them remains the
exceptional case, and not the default behavior of most tasks.


> The finer grain use of non-exclusive cpusets, in order to support
> such workload managers as PBS and LSF in managing this finer grained
> placement on a system (domain) wide basis should not be placing any
> significantly further load on the schedulers or resource managers.
>
> The top level cpusets must provide additional isolation properties so
> that separate scheduler and resource manager domains can work in
> relative isolation. I've tried hard to speculate what these additional
> isolation properties might be. I look forward to hearing from the CKRM
> and scheduler folks on this. I agree that simple unconstrained (ab)use
> of the cpus_allowed and mems_allowed masks, at that scale, places an
> undue burden on the schedulers, allocators and resource managers.

I'm really glad to hear that, Paul. That unconstrained (ab)use was my
only real concern with the cpusets patches. I look forward to massaging
our two approaches into something that will satisfy all interested
parties.

-Matt

2004-10-07 00:20:38

by Rick Lindsley

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

It's not so much whether they NEED their own scheduler, etc. as whether
it should be possible for them to have their own scheduler, etc. With a
configurable scheduler (such as ZAPHOD) this could just be a matter of
having separate configuration variables for each cpuset (e.g. if a
cpuset has been created to contain a bunch of servers there's no need
to try and provide good interactive response for its tasks (as none of
them will be interactive) so the interactive response mechanism can be
turned off in that cpuset leading to better server response and throughput).

Providing configurable schedulers is a feature/bug/argument completely
separate from cpusets. Let's stay focused on cpusets for now.

Two concrete examples for cpusets stick in my mind:

* the department that has been given 16 cpus of a 128 cpu machine,
is free to do what they want with them, and doesn't much care
specifically how they're laid out. Think general timeshare.

* the department that has been given 16 cpus of a 128 cpu machine
to run a finely tuned application which expects and needs everybody
to stay off those cpus. Think compute-intensive.

Correct me if I'm wrong, but CKRM can handle the first, but cannot
currently handle the second. And the mechanism(s) for creating either
situation are suboptimal at best and non-existent at worst.

Rick

2004-10-07 08:57:21

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> I don't see what non-exclusive cpusets buys us.

One can nest them, overlap them, and duplicate them ;)

For example, we could do the following:

* Carve off CPUs 128-255 of a 256 CPU system in which
to run various HPC jobs, requiring various numbers of CPUs.
This is named /dev/cpuset/hpcarena, and it is the really
really exclusive and isolated sort of cpuset which can and
does have its own scheduler domain, for a scheduler configuration
that is tuned for running a mix of HPC jobs. In this hpcarena
also run the per-cpu kernel threads that are pinned on CPUs
128-255 (for _all_ tasks running on an exclusive cpuset
must be in that cpuset or below).

* The testing group gets half of this cpuset each weekend, in
order to run a battery of tests: /dev/cpuset/hpcarena/testing.
In this testing cpuset runs the following batch manager.

* They run a home brew batch manager, which takes an input
stream of test cases, carves off a small cpuset of the
requested size, and runs that test case in that cpuset.
This results in cpusets with names like:
/dev/cpuset/hpcarena/testing/test123. Our test123 is
running in this cpuset.

* Test123 here happens to be a test of the integrity of cpusets,
so it sets up a couple of cpusets to run two independent jobs,
each a 2 CPU MPI job. This results in the cpusets:
/dev/cpuset/hpcarena/testing/test123/a and
/dev/cpuset/hpcarena/testing/test123/b. Our little
MPI jobs 'a' and 'b' are running in these two cpusets.

We now have several nested cpusets, each overlapping its ancestors,
with tasks in each cpuset.

But only the top hpcarena cpuset has the exclusive ownership
with no form of overlap of everything in its subtree that
something like a distinct scheduler domain wants.

Hopefully the above is not what you meant by "little more than a
convenient way to group tasks."
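
For concreteness, here is a minimal sketch (not taken from the patch
itself) of the batch manager step above: carve off a small child cpuset
through the /dev/cpuset interface and attach the test case to it. The
file names (cpus, mems, tasks) and the CPU/node numbers are assumptions
for illustration only; error handling is omitted.

    /* mkcpuset.c -- a sketch of the batch manager step, not the real thing */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void put(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (f) {
            fputs(val, f);
            fclose(f);
        }
    }

    int main(void)
    {
        char pid[16];

        /* create the child cpuset for test123 */
        mkdir("/dev/cpuset/hpcarena/testing/test123", 0755);

        /* give it two CPUs and one memory node (placeholder numbers) */
        put("/dev/cpuset/hpcarena/testing/test123/cpus", "192-193");
        put("/dev/cpuset/hpcarena/testing/test123/mems", "48");

        /* attach ourselves; a subsequent exec would run the test in place */
        snprintf(pid, sizeof(pid), "%d", (int)getpid());
        put("/dev/cpuset/hpcarena/testing/test123/tasks", pid);

        return 0;
    }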


> 2) rewrite the scheduler/allocator to deal with these bindings up front,
> and take them into consideration early in the scheduling/allocating
> process.

The allocator is less stressed here by varied mems_allowed settings
than is the scheduler. For in 99+% of the cases, the allocator is
dealing with a zonelist that has the local (currently executing) node
first, and is dealing with a mems_allowed that allows
allocation on the local node. So the allocator almost always succeeds
the first time it goes to see if the candidate page it has in hand
comes from a node allowed in current->mems_allowed.
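
As a self-contained toy illustration of that check (the struct and
function names here are made up; this is a sketch of the idea, not the
actual patch): the allocator walks the zonelist, local node first, and
accepts the first zone whose node is set in the task's mems_allowed.

    /* toy_alloc_check.c -- sketch only */
    #include <stdio.h>

    struct zone {
        int node;                       /* which node this zone lives on */
    };

    struct task {
        unsigned long mems_allowed;     /* one bit per allowed node */
    };

    static int node_allowed(const struct task *t, int node)
    {
        return (t->mems_allowed >> node) & 1UL;
    }

    /* walk the zonelist (local node first), take the first allowed zone */
    static struct zone *pick_zone(const struct task *t,
                                  struct zone **zonelist, int n)
    {
        for (int i = 0; i < n; i++)
            if (node_allowed(t, zonelist[i]->node))
                return zonelist[i];     /* almost always the first entry */
        return NULL;
    }

    int main(void)
    {
        struct zone local = { 3 }, remote = { 7 };
        struct zone *zonelist[] = { &local, &remote };
        struct task t = { .mems_allowed = 1UL << 3 };   /* only node 3 */

        struct zone *z = pick_zone(&t, zonelist, 2);
        printf("allocated from node %d\n", z ? z->node : -1);
        return 0;
    }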

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-07 09:02:36

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Matthew wrote:
> > Perhaps these flags should be called:
> > mems_exclusive_precursor
> > cpus_exclusive_precursor
> > ;).
>
> Ok... So if we could offer the 'real' exclusion that the PBS and LSF
> workload managers offer directly, would that suffice? Meaning, could we
> make PBS and LSF work on top of in-kernel mechanisms that offer 'real'
> exclusion. 'Real' exclusion defined as isolated groups of CPUs and
> memory that the kernel can guarantee will not run other processes? That
> way we can get the job done without having to rely on these external
> workload managers, and be able to offer this dynamic partitioning to all
> users. Thoughts?


I agree entirely. Before when I was being a penny pincher about
how much went in the kernel, it might have made sense to have
the mems_exclusive and cpus_exclusive precursor flags.

But now that we have demonstrated a bona fide need for a really
really exclusive cpuset, it was silly of me to consider offering:

> > mems_exclusive_precursor
> > cpus_exclusive_precursor
> > really_really_exclusive

These multiple flavors just confuse and annoy.

You're right. Just one flag option, for the really exclusive cpuset,
is required here.

A different scheduler domain (whether same scheduler with awareness of
the boundaries, or something more substantially distinct) may only be
attached to a cpuset if it is exclusive.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-07 09:47:07

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Matt wrote:
> I'm really glad to hear that, Paul. That unconstrained (ab)use was my
> only real concern with the cpusets patches. I look forward to massaging
> our two approaches into something that will satisfy all interested
> parties.

Sounds good.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-07 10:56:46

by Rick Lindsley

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

>> I don't see what non-exclusive cpusets buys us.
>
> One can nest them, overlap them, and duplicate them ;)
>
> For example, we could do the following:

Once you have the exclusive set in your example, wouldn't the existing
functionality of CKRM provide you all the functionality the other
non-exclusive sets require?

Seems to me, we need a way to *restrict use* of certain resources
(exclusive) and a way to *share use* of certain resources (non-exclusive.)
CKRM does the latter right now, I believe, but not the former. (Does
CKRM support sharing hierarchies as in the dept/group/individual example
you used?)

What about this model:

* All exclusive sets exist at the "top level" (non-overlapping,
non-hierarchical) and each is represented by a separate sched_domain
hierarchy suitable for the hardware used to create the cpuset.
I can't imagine anything more than an academic use for nested
exclusive sets.

* All non-exclusive sets are rooted at the "top level" but may
subdivide their range as needed in a tree fashion (multiple levels
if desired). Right now I believe this functionality could be
provided by CKRM.

Observations:

* There is no current mechanism to create exclusive sets; cpus_allowed
alone won't cut it. A combination of Matt's patch plus Paul's
code could probably resolve this.

* There is no clear policy on how to amiably create an exclusive set.
The main problem is what to do with the tasks already there.
I'd suggest they get forcibly moved. If their current cpus_allowed
mask does not allow them to move, then if they are a user process
they are killed. If they are a system process and cannot be
moved, they stay and gain squatter's rights in the newly created
exclusive set.

* Interrupts are not under consideration right now. They land where
they land, and this may affect exclusive sets. If this is a
problem, for now, you simply lay out your hardware and exclusive
sets more intelligently.

* Memory allocation has a tendency and preference, but no hard policy
with regards to where it comes from. A task which starts on one
part of the system but moves to another may have all its memory
allocated relatively far away. In unusual cases, it may acquire
remote memory because that's all that's left. A memory allocation
policy similar to cpus_allowed might be needed. (Martin?)

* If we provide a means for creating exclusive sets, I haven't heard
a good reason why CKRM can't manage this.

Rick

2004-10-07 13:00:46

by Simon Derr

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Thu, 7 Oct 2004, Paul Jackson wrote:

> > I don't see what non-exclusive cpusets buys us.
>
> One can nest them, overlap them, and duplicate them ;)

I would also add, if the decision comes to make 'real exclusive' cpusets,
my previous example, as a use for non-exclusive cpusets:

we are running jobs that need to be 'mostly' isolated on some part of the
system, and run in a specific location. We use cpusets for that. But we
can't afford to dedicate a part of the system for administrative tasks
(daemons, init..). These tasks should not be put inside one of the
'exclusive' cpusets, even temporarily: they do not belong there. They
should just be allowed to steal a few cpu cycles from time to time:
non-exclusive cpusets are the way to go.

2004-10-07 14:44:28

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> * Interrupts are not under consideration right now. They land where
> they land, and this may affect exclusive sets. If this is a
> problem, for now, you simply lay out your hardware and exclusive
> sets more intelligently.

They're easy to fix, just poke the values in /proc appropriately (same
as cpus_allowed, exactly).
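
For reference, a minimal sketch of the /proc poke being referred to:
steer a device interrupt onto CPUs 0-3 by writing a hex CPU mask to its
smp_affinity file. The IRQ number is a placeholder, and root privileges
(plus an interrupt controller that honours the mask) are assumed.

    /* irq_affinity.c -- sketch; the IRQ number is a placeholder */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/irq/42/smp_affinity", "w");

        if (!f)
            return 1;
        fprintf(f, "%x\n", 0xf);        /* hex CPU mask: CPUs 0-3 */
        fclose(f);
        return 0;
    }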

> * Memory allocation has a tendency and preference, but no hard policy
> with regards to where it comes from. A task which starts on one
> part of the system but moves to another may have all its memory
> allocated relatively far away. In unusual cases, it may acquire
> remote memory because that's all that's left. A memory allocation
> policy similar to cpus_allowed might be needed. (Martin?)

The membind API already does this.
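
To make that concrete, a hedged sketch of what the membind API provides:
bind the calling task's future allocations to a single memory node with
set_mempolicy(). The raw syscall is used here to avoid a libnuma
dependency, and the node number is a placeholder.

    /* bind_node.c -- sketch; MPOL_BIND value as in <linux/mempolicy.h> */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef MPOL_BIND
    #define MPOL_BIND 2
    #endif

    int main(void)
    {
        unsigned long nodemask = 1UL << 1;  /* node 1 only (placeholder) */
        long rc = syscall(__NR_set_mempolicy, MPOL_BIND, &nodemask,
                          8 * sizeof(nodemask));

        printf("set_mempolicy: %ld\n", rc);
        return 0;
    }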

M.

2004-10-07 14:53:44

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> On Thu, 7 Oct 2004, Paul Jackson wrote:
>
>> > I don't see what non-exclusive cpusets buys us.
>>
>> One can nest them, overlap them, and duplicate them ;)
>
> I would also add, if the decision comes to make 'real exclusive' cpusets,
> my previous example, as a use for non-exclusive cpusets:
>
> we are running jobs that need to be 'mostly' isolated on some part of the
> system, and run in a specific location. We use cpusets for that. But we
> can't afford to dedicate a part of the system for administrative tasks
> (daemons, init..). These tasks should not be put inside one of the
> 'exclusive' cpusets, even temporarily: they do not belong there. They
> should just be allowed to steal a few cpu cycles from time to time:
> non-exclusive cpusets are the way to go.

That makes no sense to me whatsoever, I'm afraid. Why, if they are allowed
"to steal a few cycles", are they so fervently banned from being in there?
You can keep them out of your userspace management part if you want.

So we have the purely exclusive stuff, which needs kernel support in the form
of sched_domains alterations. The rest of cpusets is just poking and prodding
at cpus_allowed, the membind API, and the irq binding stuff. All of which
you could do from userspace, without any further kernel support, right?
Or am I missing something?

M.


2004-10-07 18:16:22

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin wrote:
>
> So we have the purely exclusive stuff, which needs kernel support in the form
> of sched_domains alterations. The rest of cpusets is just poking and prodding
> at cpus_allowed, the membind API, and the irq binding stuff. All of which
> you could do from userspace, without any further kernel support, right?
> Or am I missing something?

Well ... we're gaining. A couple of days ago you were suggesting
that cpusets could be replaced with some exclusive domains plus
CKRM.

Now it's some exclusive domains plus poking the affinity masks.

Yes - you're still missing something.

But I must keep in mind that I had concluded, perhaps three years ago,
just what you conclude now: that cpusets is just poking some affinity
masks, and that I could do most of it from user land. The result ended
up missing some important capabilities. User level code could not
manage collections of hardware nodes (sets of CPUs and Memory Nodes) in
a co-ordinated and controlled manner.

The users of cpusets need to have system wide names for them, with
permissions for viewing, modifying and attaching to them, and with the
ability to list both what hardware (CPUs and Memory) is in a cpuset, and
what tasks are attached to a cpuset. As is usual in such operating
systems, the kernel manages such system wide synchronized controlled
access views.

As I quote below, I've been saying this repeatedly. Could you
tell me, Martin, whether the disconnect is:
1) that you didn't yet realize that cpusets provided this model (names,
permissions, ...) or
2) you don't think such a model is useful, or
3) you think that such a model can be provided sensibly from user space?

If I knew this, I could focus my response better.

The rest of this message is just quotes from this last week - many
can stop reading here.

===

Date: Fri, 1 Oct 2004 23:06:44 -0700
From: Paul Jackson <[email protected]>

Even uses of the flat model (no hierarchy) require some way to
name and control access to cpusets, with distinct permissions
for examining, attaching to, and changing them, that can be
used and managed on a system wide basis.

===

Date: Sat, 2 Oct 2004 12:14:30 -0700
From: Paul Jackson <[email protected]>

And our customers _do_ want to manage these logically isolated
chunks as named "virtual computers" with system managed permissions
and integrity (such as the system-wide attribute of "Exclusive"
ownership of a CPU or Memory by one cpuset, and a robust ability
to list all tasks currently in a cpuset).

===

Date: Sat, 2 Oct 2004 19:26:03 -0700
From: Paul Jackson <[email protected]>

Consider the following use case scenario, which emphasizes this
isolation aspect (and ignores other requirements, such as the need for
system admins to manage cpusets by name [some handle valid across
process contexts], with a system wide imposed permission model and
exclusive use guarantees, and with a well defined system supported
notion of which tasks are "in" which cpuset at any point in time).

===

Date: Sun, 3 Oct 2004 18:41:24 -0700
From: Paul Jackson <[email protected]>

SGI makes heavy and critical use of the cpuset facilities on both Irix
and Linux that have been developed since pset. These facilities handle
both cpu and memory placement, and provide the essential kernel support
(names and permissions and operations to query, modify and attach) for a
system wide administrative interface for managing the resulting sets of
CPUs and Memory Nodes.

===

Date: Tue, 5 Oct 2004 02:17:36 -0700
From: Paul Jackson <[email protected]>
To: "Martin J. Bligh" <[email protected]>

The /dev/cpuset pseudo file system api was chosen because it was
convenient for small scale work, learning and experimentation, because
it was a natural for the hierarchical name space with permissions that I
required, and because it was convenient to leverage existing vfs
structure in the kernel.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-07 18:19:53

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

>> So we have the purely exclusive stuff, which needs kernel support in the form
>> of sched_domains alterations. The rest of cpusets is just poking and prodding
>> at cpus_allowed, the membind API, and the irq binding stuff. All of which
>> you could do from userspace, without any further kernel support, right?
>> Or am I missing something?
>
> Well ... we're gaining. A couple of days ago you were suggesting
> that cpusets could be replaced with some exclusive domains plus
> CKRM.
>
> Now it's some exclusive domains plus poking the affinity masks.
>
> Yes - you're still missing something.
>
> But I must keep in mind that I had concluded, perhaps three years ago,
> just what you conclude now: that cpusets is just poking some affinity
> masks, and that I could do most of it from user land. The result ended
> up missing some important capabilities. User level code could not
> manage collections of hardware nodes (sets of CPUs and Memory Nodes) in
> a co-ordinated and controlled manner.
>
> The users of cpusets need to have system wide names for them, with
> permissions for viewing, modifying and attaching to them, and with the
> ability to list both what hardware (CPUs and Memory) is in a cpuset, and
> what tasks are attached to a cpuset. As is usual in such operating
> systems, the kernel manages such system wide synchronized controlled
> access views.
>
> As I quote below, I've been saying this repeatedly. Could you
> tell me, Martin, whether the disconnect is:
> 1) that you didn't yet realize that cpusets provided this model (names,
> permissions, ...) or
> 2) you don't think such a model is useful, or
> 3) you think that such a model can be provided sensibly from user space?
>
> If I knew this, I could focus my response better.
>
> The rest of this message is just quotes from this last week - many
> can stop reading here.

My main problem is that I don't think we want lots of overlapping complex
interfaces in the kernel. Plus I think some of the stuff proposed is fairly
klunky as an interface (physical binding where it's mostly not needed, and
yes I sort of see your point about keeping jobs on separate CPUs, though I
still think it's tenuous), and makes heavy use of stuff that doesn't work
well (e.g. cpus_allowed). So I'm searching for various ways to address that.

The purely exclusive parts of cpusets can be implemented in a much nicer
manner inside the kernel, by messing with sched_domains, instead of just
using cpus_allowed as a mechanism ... so that seems like much less of a
problem.

The non-exclusive bits seem to overlap heavily with both CKRM and what
could be done in userspace. I still think the physical stuff is rather
obscure, and binding stuff to specific CPUs is an ugly way to say "I want
these two threads to not run on the same CPU". But if we can find some
other way (eg userspace) to allow you to do that should you utterly insist
on doing so, that'd be a convenient way out.

As for the names and permissions issue, both would be *doable* from
userspace, though maybe not as easily as in-kernel. Names would probably
be less hassle than permissions, but neither would be impossible, it seems.

It all just seems like a lot of complexity for a fairly obscure set of
requirements for a very limited group of users, to be honest. Some bits
(eg partitioning system resources hard in exclusive sets) would seem likely
to be used by a much broader audience, and thus are rather more attractive.
But they could probably be done with a much simpler interface than the whole
cpusets (BTW, did that still sit on top of PAGG as well, or is that long
gone?)

M.

2004-10-07 18:32:37

by Andrew Morton

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Paul Jackson <[email protected]> wrote:
>
> 3) you think that such a model can be provided sensibly from user space?

As you say, it's a matter of coordinated poking at cpus_allowed. I'd be
interested to know why this all cannot be done by a userspace daemon/server
thing.

2004-10-07 18:35:42

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Rick wrote:
>
> Two concrete examples for cpusets stick in my mind:
>
> * the department that has been given 16 cpus of a 128 cpu machine,
> is free to do what they want with them, and doesn't much care
> specifically how they're laid out. Think general timeshare.
>
> * the department that has been given 16 cpus of a 128 cpu machine
> to run a finely tuned application which expects and needs everybody
> to stay off those cpus. Think compute-intensive.
>
> Correct me if I'm wrong, but CKRM can handle the first, but cannot
> currently handle the second.

Even the first scenario is not well handled by CKRM, in my view, for
most workloads. On a 128 cpu system, if you want 16 cpus of compute power, you
are much better off having that power on 16 specific cpus, rather than
getting 12.5% of each of the 128 cpus, unless your workload has very low
cache footprint.

I think of it like this. Long ago, I learned to consider performance
for many of the applications I wrote in terms of how many disk accesses
I needed, for the disk was a thousand times slower than the processor
and dominated performance across a broad scale.

The gap between the speed of interior cpu cycles and external ram
access across a bus or three is approaching the processor to disk
gap of old. A complex hierarchy of caches has grown up, within and
surrounding each processor, in an effort to ameliorate this gap.

The dreaded disk seek of old is now the cache line miss of today.

Look at the advertisements for compute power for hire in the magazines.
I can rent a decent small computer, with web access and offsite backup,
in an air conditioned room with UPS and 24/7 administration for under
$100/month. These advertisements never sell me 12.5% of the cycles on
each of the 128 cpus in a large server. They show pictures of some nice
little rack machine -- that can be all mine, for just $79/month. Sign
up now with our online web server and be using your system in minutes.

[ hmmm ... wonder how many spam filters I hit on that last paragraph ... ]

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-07 19:58:53

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Andrew wrote:
> I'd be interested to know why this all cannot be done by a
> userspace daemon/server thing.

The biggest stumbling block was the binding of task to cpuset, the
task->cpuset pointer. I doubt you would accept a patch to the kernel
that called out to my daemon on every fork and exit, to update this
binding. We require a robust answer to the question of which tasks are
in a cpuset. And the loop to read this back out, which scans each task
to see if it points to a particular cpuset, would be significantly less
atomic than it is now, if it had to be done, one task at a time, from
user space.

A second stumbling block, which perhaps you can recommend some way to
deal with, is permissions. What's the recommended way for this daemon
to verify the authority of the requesting process?

Also the other means to poke the affinity masks, sched_setaffinity,
mbind and set_mempolicy, need to be constrained to respect cpuset
boundaries and honor exclusion. I doubt you want them calling out to a
user daemon either.

And the memory affinity mask, mems_allowed, seems to require updating
within the current task context. Perhaps someone else is smart enough
to see an alternative, but I could not find a safe way to update this
from outside the current context. So it's updated on the path going
into __alloc_pages(). I doubt you want a patch that calls out to my
daemon on each call into __alloc_pages().

We also need to begin correct placement earlier in the boot process
than when a user daemon could start. It's important to get init
and the early shared libraries placed. This part has reasons of
its own to be pre-init. I am able to do this in user space today,
because the kernel has cpuset support, but I'd have to fold at
least this much back into the kernel otherwise.

And of course the hooks I added to __alloc_pages, to only allow
allocations from nodes in the task's mems_allowed, would still be needed,
in some form, just as the scheduler's already existing checks for
cpus_allowed are needed, in some form (perhaps less blunt).

The hook in the sched code to offline a cpu needs to know what else is
allowed in a task's cpuset so it can honor the cpuset boundary, if
possible, when migrating the task off the departing cpu. Would you want
this code calling out to a user daemon to determine what cpu to use
next?

The cpuset file system seems like an excellent way to present a system
wide hierarchical name space. I guess that this could be done as a
mount handled by my user space daemon, but using vfs for this sure
seemed sweet at the time.

There's a Linus quote I'm trying to remember ... something about while
kernels have an important role in providing hardware access, their
biggest job is in providing a coherent view of system wide resources.
Does this ring a bell? I haven't been able to recall enough of the
actual wording to google it.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-07 21:21:18

by Matt Helsley

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Thu, 2004-10-07 at 12:52, Paul Jackson wrote:
<snip>
> Also the other means to poke the affinity masks, sched_setaffinity,
> mbind and set_mempolicy, need to be constrained to respect cpuset
> boundaries and honor exclusion. I doubt you want them calling out to a
> user daemon either.
>
> And the memory affinity mask, mems_allowed, seems to require updating
> within the current task context. Perhaps someone else is smart enough
> to see an alternative, but I could not find a safe way to update this
> from outside the current context. So it's updated on the path going
> into __alloc_pages(). I doubt you want a patch that calls out to my
> daemon on each call into __alloc_pages().
<snip>

Just a thought: could a system-wide ld preload of some form be useful
here? You could use preload to add wrappers around the necessary calls
(you'd probably want to do this in /etc/ld.so.preload). Then have those
wrappers communicate with a daemon or open some /etc config files that
describe the topology you wish to enforce.
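
To make the suggestion concrete, a hedged sketch of such a shim: it
interposes on sched_setaffinity() (prototype as in current glibc) and
could clamp the requested mask to a site policy before forwarding to the
real call. The policy lookup itself is left as a stub, since the config
file or daemon it would consult is hypothetical.

    /* affinity_shim.c -- sketch only
     * build: gcc -shared -fPIC -o affinity_shim.so affinity_shim.c -ldl
     * then list affinity_shim.so in /etc/ld.so.preload */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <sched.h>
    #include <stddef.h>
    #include <sys/types.h>

    int sched_setaffinity(pid_t pid, size_t setsize, const cpu_set_t *mask)
    {
        static int (*real_call)(pid_t, size_t, const cpu_set_t *);

        if (!real_call)
            real_call = (int (*)(pid_t, size_t, const cpu_set_t *))
                        dlsym(RTLD_NEXT, "sched_setaffinity");

        /* a real shim would AND 'mask' with an allowed set read from a
         * config file or daemon here (hypothetical policy step) */
        return real_call(pid, setsize, mask);
    }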

Cheers,
-Matt Helsley

2004-10-07 21:11:19

by Rick Lindsley

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> The users of cpusets need to have system wide names for them, with
> permissions for viewing, modifying and attaching to them, and with the
> ability to list both what hardware (CPUs and Memory) is in a cpuset, and
> what tasks are attached to a cpuset. As is usual in such operating
> systems, the kernel manages such system wide synchronized controlled
> access views.

Well, you are *asserting* the kernel will manage this. But doesn't
CKRM offer this capability? The only thing it *can't* do is assure
exclusivity, today .. correct?

Rick

2004-10-08 09:25:54

by Erich Focht

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Thursday 07 October 2004 20:13, Martin J. Bligh wrote:
> It all just seems like a lot of complexity for a fairly obscure set of
> requirements for a very limited group of users, to be honest. Some bits
> (eg partitioning system resources hard in exclusive sets) would seem likely
> to be used by a much broader audience, and thus are rather more attractive.

May I translate the first sentence to: the requirements and usage
models described by Paul (SGI), Simon (Bull) and myself (NEC) are
"fairly obscure" and the group of users addressed (those mainly
running high performance computing (AKA HPC) applications) is "very
limited"? If this is what you want to say then it's you whose view is
very limited. Maybe I'm wrong with what you really wanted to say but I
remember similar arguing from your side when discussing benchmark
results in the context of the node affine scheduler.

This "very limited group of users" (small part of them listed in
http://www.top500.org) is who drives computer technology, processor design,
network interconnect technology forward since the 1950s. Their
requirements on the operating system are rather limited and that might
be the reason why kernel developers tend to ignore them. All that
counts for HPC is measured in GigaFLOPS or TeraFLOPS, not in elapsed
seconds for a kernel compile, AIM-7, Spec-SDET or Javabench. The way
of using these machines IS different from what YOU experience in day
by day work and Linux is not yet where it should be (though getting
close). Paul's endurance in this thread is certainly influenced by the
perspective of having to support soon a 20x512 CPU NUMA cluster at
NASA...

As a side note: put in the right context, your statement on fairly
obscure requirements for a very limited group of users is a marketing
argument ... against IBM.

Thanks ;-)
Erich

--
Core Technology Group
NEC High Performance Computing Europe GmbH, EHPCTC

2004-10-08 09:53:59

by Andrew Morton

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Erich Focht <[email protected]> wrote:
>
> May I translate the first sentence to: the requirements and usage
> models described by Paul (SGI), Simon (Bull) and myself (NEC) are
> "fairly obscure" and the group of users addressed (those mainly
> running high performance computing (AKA HPC) applications) is "very
> limited"? If this is what you want to say then it's you whose view is
> very limited.

Martin makes a legitimate point. We're talking here about a few tens or
hundreds of machines world-wide, yes? And those machines are very
high-value so it is a relatively small cost for their kernel providers to
add such a highly specialised patch as cpusets.

These are strong arguments for leaving cpusets as an out-of-kernel.org
patch, for those who need it.

On the other hand, the impact is small:

25-akpm/fs/proc/base.c | 19
25-akpm/include/linux/cpuset.h | 63 +
25-akpm/include/linux/sched.h | 7
25-akpm/init/Kconfig | 10
25-akpm/init/main.c | 5
25-akpm/kernel/Makefile | 1
25-akpm/kernel/cpuset.c | 1550 ++++++++++++++++++++++++++++++++++++++
25-akpm/kernel/exit.c | 2
25-akpm/kernel/fork.c | 3
25-akpm/kernel/sched.c | 8
25-akpm/mm/mempolicy.c | 13
25-akpm/mm/page_alloc.c | 13
25-akpm/mm/vmscan.c | 19

So it's a quite cheap patch for the kernel.org people to carry.

So I'm (just) OK with it from that point of view. My main concern is that
the CKRM framework ought to be able to accommodate the cpuset function,
dammit. I don't want to see us growing two orthogonal resource management
systems partly because their respective backers have no incentive to make
their code work together.

I realise there are technical/architectural problems too, but I do fear
that there's a risk of we-don't-have-a-business-case happening here too.

I don't think there are any architectural concerns around cpusets - the
major design question here is "is CKRM up to doing this and if not, why
not?". From what Hubertus has been saying CKRM _is_ up to the task, but
the cpuset team may decide that the amount of rework involved isn't
worthwhile and they're better off carrying an offstream patch.

But we're not there yet - we're still waiting for the design dust to
settle.

2004-10-08 10:00:15

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Erich Focht wrote:
> On Thursday 07 October 2004 20:13, Martin J. Bligh wrote:
>
>>It all just seems like a lot of complexity for a fairly obscure set of
>>requirements for a very limited group of users, to be honest. Some bits
>>(eg partitioning system resources hard in exclusive sets) would seem likely
>>to be used by a much broader audience, and thus are rather more attractive.
>
>
> May I translate the first sentence to: the requirements and usage
> models described by Paul (SGI), Simon (Bull) and myself (NEC) are
> "fairly obscure" and the group of users addressed (those mainly
> running high performance computing (AKA HPC) applications) is "very
> limited"? If this is what you want to say then it's you whose view is
> very limited. Maybe I'm wrong with what you really wanted to say but I
> remember similar arguing from your side when discussing benchmark
> results in the context of the node affine scheduler.
>
> This "very limited group of users" (small part of them listed in
> http://www.top500.org) is who drives computer technology, processor design,
> network interconnect technology forward since the 1950s. Their
> requirements on the operating system are rather limited and that might
> be the reason why kernel developers tend to ignore them. All that
> counts for HPC is measured in GigaFLOPS or TeraFLOPS, not in elapsed
> seconds for a kernel compile, AIM-7, Spec-SDET or Javabench. The way
> of using these machines IS different from what YOU experience in day
> by day work and Linux is not yet where it should be (though getting
> close). Paul's endurance in this thread is certainly influenced by the
> perspective of having to support soon a 20x512 CPU NUMA cluster at
> NASA...
>
> As a side note: put in the right context your statement on fairly
> obscure requirements for a very limited group of users is a marketing
> argument ... against IBM.
>
> Thanks ;-)
> Erich
>

With all due respect, Linux gets driven as much from the bottom up
as it does from the top down I think. Compared to desktop and small
servers, yes you are obscure :)

My view on it is this, we can do *exclusive* dynamic partitioning
today (we're very close to it - it wouldn't add complexity in the
scheduler to support it). You can also hack up a fair bit of other
functionality with cpu affinity masks.

So with any luck, that will hold you over until everyone working on
this can agree and produce a nice implementation that doesn't add
complexity to the normal case (or can be configured out), and then
pull it into the kernel.

2004-10-08 10:42:53

by Erich Focht

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Friday 08 October 2004 11:50, Andrew Morton wrote:
> So it's a quite cheap patch for the kernel.org people to carry.
>
> So I'm (just) OK with it from that point of view. My main concern is that
> the CKRM framework ought to be able to accommodate the cpuset function,
> dammit. I don't want to see us growing two orthogonal resource management
> systems partly because their respective backers have no incentive to make
> their code work together.

I don't think cpusets needs to grow beyond what it contains now. The
discussion started as an API discussion. Cpusets requirements, current
API and usage models were clearly shown. According to Hubertus CKRM
will be able to deal with these and implement them in its own API. It
isn't there today. So why not wait for that? Having two APIs for the same
thing isn't unusual. Whether we switch from affinity to sched_domains
underneath isn't really the question.

> I realise there are technical/architectural problems too, but I do fear
> that there's a risk of we-don't-have-a-business-case happening here too.

ISVs are already using the current cpusets API. I think of resource
management systems like PBS (Altair), LSF (Platform Computing) plus
several providers of industrial simulation codes in the area of CAE
(computer aided engineering). I know examples from static and dynamic
mechanical stress analysis, fluid dynamics and electromagnetics
simulations. Financial simulation software could benefit from such
stuff, too, but I don't know of any example. Anyhow, I'd say we
already have a business case here. And instead of pushing ISVs to
support the SGI way of doing this, the Bull way and the NEC way, it
makes more sense to ask them to support the LINUX way.

> But we're not there yet - we're still waiting for the design dust to
> settle.

:-)

Regards,
Erich


2004-10-08 11:44:36

by Erich Focht

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Friday 08 October 2004 11:53, Nick Piggin wrote:
> Erich Focht wrote:
> > On Thursday 07 October 2004 20:13, Martin J. Bligh wrote:
> >
> >>It all just seems like a lot of complexity for a fairly obscure set of
> >>requirements for a very limited group of users, to be honest. Some bits
> >>(eg partitioning system resources hard in exclusive sets) would seem likely
> >>to be used by a much broader audience, and thus are rather more attractive.
> >
> > May I translate the first sentence to: the requirements and usage
> > models described by Paul (SGI), Simon (Bull) and myself (NEC) are
> > "fairly obscure" and the group of users addressed (those mainly
> > running high performance computing (AKA HPC) applications) is "very
> > limited"? If this is what you want to say then it's you whose view is
> > very limited. Maybe I'm wrong with what you really wanted to say but I
> > remember similar arguing from your side when discussing benchmark
> > results in the context of the node affine scheduler.
> >
> > This "very limited group of users" (small part of them listed in
> > http://www.top500.org) is who drives computer technology, processor design,
> > network interconnect technology forward since the 1950s.

> With all due respect, Linux gets driven as much from the bottom up
> as it does from the top down I think. Compared to desktop and small
> servers, yes you are obscure :)

I wasn't speaking of driving the Linux development, I was speaking of
driving the computer technology development. Just look at where the
DOD, DARPA, DOE money goes. I actually acknowledged that HPC doesn't
really have a foot in the kernel developer community.

> My view on it is this, we can do *exclusive* dynamic partitioning
> today (we're very close to it - it wouldn't add complexity in the
> scheduler to support it).

Right, but that's an implementation question. The question
cpusets {AND, OR, XOR} CKRM ?
was basically a user space API question. I'm sure nobody will object
to changing the guts of cpusets to use sched_domains on exclusive sets
when this possibility is there and ... simple.

> You can also hack up a fair bit of other functionality with cpu
> affinity masks.

I'm doing that for a subset of cpusets functionality in a module
(i.e. without touching the task structure and without hooking on
fork/exec) but that's ugly and in the long term insufficient.

Regards,
Erich

2004-10-08 14:25:57

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> On Thursday 07 October 2004 20:13, Martin J. Bligh wrote:
>> It all just seems like a lot of complexity for a fairly obscure set of
>> requirements for a very limited group of users, to be honest. Some bits
>> (eg partitioning system resources hard in exclusive sets) would seem likely
>> to be used by a much broader audience, and thus are rather more attractive.
>
> May I translate the first sentence to: the requirements and usage
> models described by Paul (SGI), Simon (Bull) and myself (NEC) are
> "fairly obscure" and the group of users addressed (those mainly
> running high performance computing (AKA HPC) applications) is "very
> limited"? If this is what you want to say then it's you whose view is
> very limited. Maybe I'm wrong with what you really wanted to say but I
> remember similar arguing from your side when discussing benchmark
> results in the context of the node affine scheduler.

No, I was talking about the non-exclusive part of cpusets that wouldn't
fit inside another mechanism. The basic partitioning I have no problem
with, and that seemed to cover most of the requirements, AFAICS.

As I've said before, the exclusive stuff makes sense, and is useful to
a wider audience, I think. Having non-exclusive stuff whilst still
requiring physical partitioning is what I think is obscure, won't work
well (cpus_allowed is problematic) and could be done in userspace anyway.

M.

2004-10-08 14:27:11

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> Anyhow, I'd say we
> already have a business case here. And instead of pushing ISVs to
> support the SGI way of doing this, the Bull way and the NEC way, it
> makes more sense to ask them to support the LINUX way.

Right. But we're trying to work out what the Linux way *is* ;-)

M.

2004-10-08 22:40:15

by Erich Focht

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Friday 08 October 2004 16:24, Martin J. Bligh wrote:
> > On Thursday 07 October 2004 20:13, Martin J. Bligh wrote:
> >> It all just seems like a lot of complexity for a fairly obscure set of
> >> requirements for a very limited group of users, to be honest. Some bits
> >> (eg partitioning system resources hard in exclusive sets) would seem likely
> >> to be used by a much broader audience, and thus are rather more attractive.
> >
> > May I translate the first sentence to: the requirements and usage
> > models described by Paul (SGI), Simon (Bull) and myself (NEC) are
> > "fairly obscure" and the group of users addressed (those mainly
> > running high performance computing (AKA HPC) applications) is "very
> > limited"? If this is what you want to say then it's you whose view is
> > very limited. Maybe I'm wrong with what you really wanted to say but I
> > remember similar arguing from your side when discussing benchmark
> > results in the context of the node affine scheduler.
>
> No, I was talking about the non-exclusive part of cpusets that wouldn't
> fit inside another mechanism. The basic partitioning I have no problem
> with, and that seemed to cover most of the requirements, AFAICS.

I was hoping that I did misunderstand you ;-)

> As I've said before, the exclusive stuff makes sense, and is useful to
> a wider audience, I think. Having non-exclusive stuff whilst still
> requiring physical partitioning is what I think is obscure, won't work
> well (cpus_allowed is problematic) and could be done in userspace anyway.

Do you mean non-exclusive or simply overlapping? If you think of the
implementation through sched_domains, you really don't need a 1 to 1
mapping between them and cpusets. IMO one could map sched domains
structure from the toplevel cpuset down only as far as the
non-overlapping sets go. Below you just don't use sched domains any
more and leave it to the affinity masks. The logical setup would
anyhow have a first (uppermost) level soft-partitioning the machine,
overlaps don't make sense to me here. Then sched domains already buy
you something. If soft partition 1 allows overlap in the lower levels
(because we want to overcommit the machine here and fear the OpenMP
jobs which pin themselves blindly in their cpuset), just don't
continue mapping sched domains deeper. In soft-partition 2 you may not
allow overlapping subpartitions, so go ahead and map them to sched
domains. It doesn't really add complexity this way, just some IF
statements.

Regards,
Erich


2004-10-08 23:51:14

by Matthew Dobson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Thu, 2004-10-07 at 01:51, Paul Jackson wrote:
> > I don't see what non-exclusive cpusets buys us.
>
> One can nest them, overlap them, and duplicate them ;)

<snip example>

> We now have several nested cpusets, each overlapping its ancestors,
> with tasks in each cpuset.
>
> But only the top hpcarena cpuset has the exclusive ownership
> with no form of overlap of everything in its subtree that
> something like a distinct scheduler domain wants.
>
> Hopefully the above is not what you meant by "little more than a
> convenient way to group tasks."

I think this example is easily achievable with the sched_domains
modifications I am proposing. You can still create your 128 CPU
exclusive domain, called big_domain (due to my lack of naming
creativity), and further divide big_domain into smaller, non-exclusive
sched_domains. We do this all the time, albeit statically at boot time,
with the current sched_domains code. On IA64, for example, we create a
4-node domain, and underneath it we create 4 1-node domains. We've now
partitioned the system into 4 sched_domains, each containing 4 cpus.
Balancing between these 4 node-level sched_domains is allowed, but can
be disallowed by not setting the SD_LOAD_BALANCE flag. Your example
does show that it can be more than just a convenient way to group tasks,
but your example can be done with what I'm proposing.


> > 2) rewrite the scheduler/allocator to deal with these bindings up front,
> > and take them into consideration early in the scheduling/allocating
> > process.
>
> The allocator is less stressed here by varied mems_allowed settings
> than is the scheduler. For in 99+% of the cases, the allocator is
> dealing with a zonelist that has the local (currently executing)
> first on the zonelist, and is dealing with a mems_allowed that allows
> allocation on the local node. So the allocator almost always succeeds
> the first time it goes to see if the candidate page it has in hand
> comes from a node allowed in current->mems_allowed.

Very true. The allocator and scheduler are very different beasts, just
as memory and CPUs are. The allocator does not struggle to cope with
mems_allowed (at least currently) as much as the scheduler struggles to
cope with cpus_allowed.

-Matt

2004-10-09 00:18:47

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Matthew Dobson wrote:
> On Thu, 2004-10-07 at 01:51, Paul Jackson wrote:
>
>>>I don't see what non-exclusive cpusets buys us.
>>
>>One can nest them, overlap them, and duplicate them ;)
>
>
> <snip example>
>
>>We now have several nested cpusets, each overlapping its ancestors,
>>with tasks in each cpuset.
>>
>>But only the top hpcarena cpuset has the exclusive ownership
>>with no form of overlap of everything in its subtree that
>>something like a distinct scheduler domain wants.
>>
>>Hopefully the above is not what you meant by "little more than a
>>convenient way to group tasks."
>
>
> I think this example is easily achievable with the sched_domains
> modifications I am proposing. You can still create your 128 CPU
> exclusive domain, called big_domain (due to my lack of naming
> creativity), and further divide big_domain into smaller, non-exclusive
> sched_domains. We do this all the time, albeit statically at boot time,
> with the current sched_domains code. When we create a 4-node domain on
> IA64, and underneath it we create 4 1-node domains. We've now
> partitioned the system into 4 sched_domains, each containing 4 cpus.
> Balancing between these 4 node-level sched_domains is allowed, but can
> be disallowed by not setting the SD_LOAD_BALANCE flag. Your example
> does show that it can be more than just a convenient way to group tasks,
> but your example can be done with what I'm proposing.
>

You wouldn't be able to do this just with sched domains, because
it doesn't know anything about individual tasks. As soon as you
have some overlap, all your tasks can escape out of your domain.

I don't think there is a really nice way to do overlapping sets.
Those that want them need to just use cpu affinity for now.

2004-10-10 02:37:18

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> The only thing it *can't* do is assure
> exclusivity, today .. correct?

No. Could you look back to my other posts of this
last week and let us know if I've answered your query
in more detail already? Thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-10 03:24:48

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Andrew wrote:
> As you say, it's a matter of coordinated poking at cpus_allowed.

No - I said I concluded that three years ago. And then later learned
the hard way this wasn't enough.

See further my earlier (like 2.5 days and 2 boxes of Kleenex ago) reply
to this same post.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-10 05:14:37

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> That makes no sense to me whatsoever, I'm afraid. Why if they were allowed
> "to steal a few cycles" are they so fervently banned from being in there?

One substantial advantage of cpusets (as in the kernel patch in *-mm's
tree), over variations that "just poke the affinity masks from user
space" is the task->cpuset pointer. This tracks to what cpuset a task
is attached. The fork and exit code duplicates and nukes this pointer,
managing the cpuset reference counter.

It matters to batch schedulers and the like which cpuset a task is in,
and which tasks are in a cpuset, when it comes time to do things like
suspend or migrate the tasks currently in a cpuset.

Just because it's ok to share a little compute time in a cpuset doesn't
mean you don't care to know who is in it.
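
As a toy illustration of that bookkeeping (the names here are
illustrative, not the actual patch): fork copies the parent's cpuset
pointer and bumps a reference count, exit drops it, so the kernel always
knows which tasks are attached to which cpuset.

    /* toy_cpuset_refcount.c -- sketch only */
    #include <stdio.h>

    struct cpuset {
        int refcount;
        const char *name;
    };

    struct task {
        struct cpuset *cpuset;
    };

    /* fork: the child inherits and pins the parent's cpuset */
    static void cpuset_fork(struct task *child, const struct task *parent)
    {
        child->cpuset = parent->cpuset;
        child->cpuset->refcount++;
    }

    /* exit: drop the reference; the last user could free the cpuset */
    static void cpuset_exit(struct task *tsk)
    {
        tsk->cpuset->refcount--;
        tsk->cpuset = NULL;
    }

    int main(void)
    {
        struct cpuset hpc = { 1, "hpcarena" };
        struct task parent = { &hpc }, child;

        cpuset_fork(&child, &parent);
        printf("%s refcount after fork: %d\n", hpc.name, hpc.refcount);

        cpuset_exit(&child);
        printf("%s refcount after exit: %d\n", hpc.name, hpc.refcount);
        return 0;
    }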

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-11 23:03:16

by Matthew Dobson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Fri, 2004-10-08 at 17:18, Nick Piggin wrote:
> Matthew Dobson wrote:
> > I think this example is easily achievable with the sched_domains
> > modifications I am proposing. You can still create your 128 CPU
> > exclusive domain, called big_domain (due to my lack of naming
> > creativity), and further divide big_domain into smaller, non-exclusive
> > sched_domains. We do this all the time, albeit statically at boot time,
> > with the current sched_domains code. When we create a 4-node domain on
> > IA64, and underneath it we create 4 1-node domains. We've now
> > partitioned the system into 4 sched_domains, each containing 4 cpus.
> > Balancing between these 4 node-level sched_domains is allowed, but can
> > be disallowed by not setting the SD_LOAD_BALANCE flag. Your example
> > does show that it can be more than just a convenient way to group tasks,
> > but your example can be done with what I'm proposing.
>
> You wouldn't be able to do this just with sched domains, because
> it doesn't know anything about individual tasks. As soon as you
> have some overlap, all your tasks can escape out of your domain.
>
> I don't think there is a really nice way to do overlapping sets.
> Those that want them need to just use cpu affinity for now.

Well, the tasks can escape out of the domain iff you have the
SD_LOAD_BALANCE flag set on that domain. If SD_LOAD_BALANCE isn't set,
then when the scheduler tick goes off, and the code looks at the domain,
it will see the lack of the flag and will not attempt to balance the
domain, correct? This is what we currently do with the 'isolated'
domains, right?
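
A self-contained toy of the flag check being discussed: the periodic
balancer simply skips a domain whose flags lack SD_LOAD_BALANCE. The flag
names echo the sched_domains code mentioned in this thread, but the flag
values, structure layout and function here are made up for illustration.

    /* toy_balance.c -- sketch only */
    #include <stdio.h>

    #define SD_LOAD_BALANCE  0x01   /* allow periodic balancing */
    #define SD_BALANCE_EXEC  0x02   /* allow balancing on exec() */
    #define SD_WAKE_BALANCE  0x04   /* allow balancing on wakeup */

    struct sched_domain {
        int flags;
        const char *name;
    };

    static void rebalance_tick(struct sched_domain *sd)
    {
        if (!(sd->flags & SD_LOAD_BALANCE)) {
            printf("%s: SD_LOAD_BALANCE clear, not balanced\n", sd->name);
            return;
        }
        printf("%s: would pull tasks here\n", sd->name);
    }

    int main(void)
    {
        struct sched_domain isolated = { 0, "isolated cpuset" };
        struct sched_domain system   = { SD_LOAD_BALANCE, "system" };

        rebalance_tick(&isolated);
        rebalance_tick(&system);
        return 0;
    }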

You're right that you can get some of the more obscure semantics of the
various flavors of cpusets by leveraging sched_domains AND
cpus_allowed. I don't have any desire to remove that ability, just keep
it as the exception.

-Matt

2004-10-11 23:19:04

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Matthew Dobson wrote:
> On Fri, 2004-10-08 at 17:18, Nick Piggin wrote:
>
>>Matthew Dobson wrote:
>>
>>>I think this example is easily achievable with the sched_domains
>>>modifications I am proposing. You can still create your 128 CPU
>>>exclusive domain, called big_domain (due to my lack of naming
>>>creativity), and further divide big_domain into smaller, non-exclusive
>>>sched_domains. We do this all the time, albeit statically at boot time,
>>>with the current sched_domains code. When we create a 4-node domain on
>>>IA64, and underneath it we create 4 1-node domains. We've now
>>>partitioned the system into 4 sched_domains, each containing 4 cpus.
>>>Balancing between these 4 node-level sched_domains is allowed, but can
>>>be disallowed by not setting the SD_LOAD_BALANCE flag. Your example
>>>does show that it can be more than just a convenient way to group tasks,
>>>but your example can be done with what I'm proposing.
>>
>>You wouldn't be able to do this just with sched domains, because
>>it doesn't know anything about individual tasks. As soon as you
>>have some overlap, all your tasks can escape out of your domain.
>>
>>I don't think there is a really nice way to do overlapping sets.
>>Those that want them need to just use cpu affinity for now.
>
>
> Well, the tasks can escape out of the domain iff you have the
> SD_LOAD_BALANCE flag set on that domain. If SD_LOAD_BALANCE isn't set,
> then when the scheduler tick goes off, and the code looks at the domain,
> it will see the lack of the flag and will not attempt to balance the
> domain, correct? This is what we currently do with the 'isolated'
> domains, right?
>

Yeah that's right. Well you have to remove some of the other SD_
flags as well (eg. SD_BALANCE_EXEC, SD_WAKE_BALANCE).

But I don't think there is much point in overlapping sets which
don't do any balancing. They might as well not exist at all.

> You're right that you can get some of the more obscure semantics of the
> various flavors of cpusets by leveraging sched_domains AND
> cpus_allowed. I don't have any desire to remove that ability, just keep
> it as the exception.
>

I think at this stage, overlapping cpu sets are the exception. It
is pretty logical that they're going to require some per-task info,
because the balancer can't otherwise differentiate between two tasks
on the same runqueue but in different cpu sets.

sched-domains gives you a nice clean way to do exclusive partitioning,
and I can't imagine it would be too common to want to do overlapping
partitioning.

2004-10-14 10:41:34

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

"Martin J. Bligh" <[email protected]> writes:

> My main problem is that I don't think we want lots of overlapping complex
> interfaces in the kernel. Plus I think some of the stuff proposed is fairly
> klunky as an interface (physical binding where it's mostly not needed, and
> yes I sort of see your point about keeping jobs on separate CPUs, though I
> still think it's tenuous), and makes heavy use of stuff that doesn't work
> well (e.g. cpus_allowed). So I'm searching for various ways to address that.

Sorry I spotted this thread late. People seem to be looking at how things
are done on clusters and then applying them to numa machines. Which I agree
looks totally backwards.

The actual application requirement (ignoring the sucky batch schedulers)
is for a group of processes (a magic process group?) to all be
simultaneously runnable. On a cluster that is accomplished by having
an extremely stupid scheduler place one process per machine. On a
NUMA machine you can do better because you can suspend and migrate
processes.

The other difference on these large machines is that these compute jobs
that are cpu hogs will often have priority over all of the other
processes in the system.

A batch scheduler should be able to prevent a machine from being
overloaded by simply not putting too many processes on the machine at
a time. Or, if a higher priority job comes in, suspending all of
the processes of some lower priority job to make room for the
new job. Being able to swap page tables is likely a desirable feature
in that scenario so all of the swapped-out job's resources can be
removed from memory.

> It all just seems like a lot of complexity for a fairly obscure set of
> requirements for a very limited group of users, to be honest.

I think that is correct to some extent. I think the requirements are
much more reasonable when people stop hanging on to the kludges they
have been using because they cannot migrate jobs, or suspend jobs
sufficiently to get out of the way of other jobs.

Martin, does enhancing the scheduler to deal with a group of processes
that all run in lock-step, usually simultaneously computing or
communicating, sound sane? Where preempting one is effectively preempting
all of them.

I have been quite confused by this thread in that I have not seen
any mechanism that looks beyond an individual process at a time,
which seems so completely wrong.


Eric

2004-10-14 11:25:03

by Erich Focht

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Thursday 14 October 2004 12:35, Eric W. Biederman wrote:
> Sorry I spotted this thread late.

The thread was actually d(r)ying out...

> People seem to be looking at how things
> are done on clusters and then apply them to numa machines. Which I agree
> looks totally backwards.
>
> The actual application requirement (ignoring the sucky batch schedulers)
> is for a group of processes (a magic process group?) to all be
> simultaneously runnable. On a cluster that is accomplished by having
> an extremely stupid scheduler place one process per machine. On a
> NUMA machine you can do better because you can suspend and migrate
> processes.

Eric, beyond wanting all processes scheduled at the same time we also
want separation and real isolation (CPU and memory-wise) of processes
belonging to different users. The first emails in the thread describe
the requirements well. They are too complex to be simply handled by
cpus_allowed and mems_allowed masks; basically, a hierarchy is needed
in the cpusets allocation.

> > It all just seems like a lot of complexity for a fairly obscure set of
> > requirements for a very limited group of users, to be honest.
>
> I think that is correct to some extent. I think the requirements are
> much more reasonable when people stop hanging on to the kludges they
> have been using because they cannot migrate jobs, or suspend jobs
> sufficiently to get them out of the way of other jobs.

Cpusets and the like have a long history originating from ccNUMA
machines. It is not simply simulating or replicating cluster
behavior. Batch schedulers may be an inelegant solution, but they are
a reality and have been in use since computers were invented (more or less).

> Martin, does enhancing the scheduler to deal with a group of processes
> that all run in lock-step, usually simultaneously computing or
> communicating, sound sane? Where preempting one is effectively preempting
> all of them.
>
> I have been quite confused by this thread in that I have not seen
> any mechanism that looks beyond an individual process at a time,
> which seems so completely wrong.

You seem to be suggesting a gang scheduler!!! YES!!! I would love
that! But I remember that 2 years ago there were some emails from
major kernel maintainers (I don't exactly remember whom) saying that a
gang scheduler would never go into Linux. So ... here's something which
somewhat simulates that behavior. Anyhow, cpusets makes sense (for
isolation of resources) anyway, no matter whether we have gang
scheduling or not.

> Eric

Regards,
Erich


2004-10-14 11:26:15

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Eric wrote:
> I have been quite confused by this thread in that I have not seen
> any mechanism that looks beyond an individual process at a time,
> which seems so completely wrong.

In the simplest form, we obtain the equivalent of gang scheduling for
the several threads of a tightly coupled job by arranging to have only
one runnable thread per cpu, each such thread pinned on one cpu, and all
threads in a given job simultaneously runnable.

For compute bound jobs, this is often sufficient. Time sharing (at a
coarse granularity of minutes or hours) and overlap of various-sized
jobs are handled using suspension and migration in order to obtain the
above invariants of one runnable thread per cpu at any given time, and
of having all threads in a tightly coupled job pinned to distinct cpus
and runnable simultaneously.

For jobs that are not compute bound, where other delays such as i/o
would allow for running more than one such job at a time (each
intermittently runnable on a finer scale of seconds), one needs
something like gang scheduling in order to keep all the threads in a
tightly coupled job running together, while still obtaining maximum
utilization of cpu/memory hardware from jobs with cpu duty cycles of
less than 50%.

The essential purpose of cpusets is to take the placement of individual
threads by the sched_setaffinity and mbind/set_mempolicy calls, and
extend that to manage placing groups of tasks on administratively
designated and controlled groups of cpus/nodes.
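
(For readers less familiar with those calls, here is a minimal per-task
placement sketch: it pins the calling task to one cpu and binds its
allocations to one memory node. The cpu and node numbers are arbitrary,
and set_mempolicy() is assumed to come from libnuma's <numaif.h>, linked
with -lnuma.)

#define _GNU_SOURCE
#include <sched.h>      /* sched_setaffinity, CPU_* macros */
#include <numaif.h>     /* set_mempolicy, MPOL_BIND; link with -lnuma */
#include <stdio.h>

/* Pin the calling task to a single cpu and a single memory node. */
static int place_self(int cpu, int node)
{
	cpu_set_t cpus;
	unsigned long nodemask = 1UL << node;   /* assumes node < bits-per-long */

	CPU_ZERO(&cpus);
	CPU_SET(cpu, &cpus);
	if (sched_setaffinity(0, sizeof(cpus), &cpus) != 0) {
		perror("sched_setaffinity");
		return -1;
	}
	if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8) != 0) {
		perror("set_mempolicy");
		return -1;
	}
	return 0;
}

int main(void)
{
	return place_self(0, 0) ? 1 : 0;
}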

If you see nothing beyond individual processes, then I think you are
missing that.

However, it is correct that we haven't (so far as I recall) considered
the gang scheduling that you describe. My crystal ball says we might
get to that next year.

Gang scheduling isn't needed for the compute bound jobs, because just
running a single job at a time on a given subset of a system's cpus and
memory obtains the same result.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-14 20:23:17

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Kevin McMahon <[email protected]> pointed out to me a link to an interesting
article on gang scheduling:

http://www.linuxjournal.com/article.php?sid=7690
Issue 127: Improving Application Performance on HPC Systems with Process Synchronization
Posted on Monday, November 01, 2004 by Paul Terry, Amar Shan and Pentti Huttunen

It's amazingly current - won't even be posted for another couple of weeks ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-14 22:42:54

by Hubertus Franke

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Paul, there are also other means for gang scheduling than having
to architect a tightly synchronized global clock into the communication
device.

Particularly, in a batch oriented environment of compute intensive
applications, one does not really need/want to switch frequently.
Often, the communication devices are memory mapped straight into the
application OS involvement with limited available channels.

However, as shown in previous work, gang scheduling and other
scheduling tricks (e.g. backfilling) can provide significantly higher
utilization. So, if a high context switching rate (read: interactivity)
is not required, then a network of user space scheduling daemons can be used.

We have a slew of pubs on this. An example write-up can be found here:

Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. Improving Parallel
Job Scheduling by Combining Gang Scheduling and Backfilling Techniques.
In Proceedings of the International Parallel and Distributed Processing
Symposium (IPDPS), pages 113-142 May 2000.
http://www.cse.psu.edu/~anand/csl/papers/ipdps00.pdf

Or for a final sum up of that research as a journal.

Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. An Integrated
Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling and
Migration. IEEE Transactions on Parallel and Distributed Systems,
14(3):236-247, March 2003.

This was implemented for the IBM SP2 cluster and ASCI machine at
Livermore National Lab in the late 90's.

If you are interested in short scheduling cycles, we also discovered that,
depending on the synchronicity of the applications, gang scheduling is not
necessarily the best.

Y. Zhang, A. Sivasubramaniam, J. Moreira, H. Franke. A Simulation-based
Study of Scheduling Mechanisms for a Dynamic Cluster Environment. In
Proceedings of the ACM International Conference on Supercomputing (ICS),
pages 100-109, May 2000. http://www.cse.psu.edu/~anand/csl/papers/ics00a.pdf

If I remember correctly, this tight gang scheduling based on slots was
already implemented on IRIX in 95/96 (I read a paper on that).

Moral of the story here is that it's unlikely that Linux will support
gang scheduling in its core anytime soon, or will allow network adapters
to drive scheduling strategies. So likely these are out.
And less frequent gang scheduling can be implemented with user level
daemons, so an adequate solution is available for most instances.

-- Hubertus

Paul Jackson wrote:

> Kevin McMahon <[email protected]> pointed out to me a link to an interesting
> article on gang scheduling:
>
> http://www.linuxjournal.com/article.php?sid=7690
> Issue 127: Improving Application Performance on HPC Systems with Process Synchronization
> Posted on Monday, November 01, 2004 by Paul Terry, Amar Shan and Pentti Huttunen
>
> It's amazingly current - won't even be posted for another couple of weeks ;).
>

2004-10-15 01:29:44

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Hubertus wrote:
> Paul, there are also other means for gang scheduling than having
> to architect a tightly synchronized global clock into the communication
> device.

We agree.

My reply to the post of Eric W. Biederman at the start of this
sub-thread began:

> In the simplest form, we obtain the equivalent of gang scheduling for
> the several threads of a tightly coupled job by arranging to have only
> one runnable thread per cpu, each such thread pinned on one cpu, and all
> threads in a given job simultaneously runnable.
>
> For compute bound jobs, this is often sufficient.

Your reply adds substantial detail and excellent references.

Thank-you.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2005-02-08 00:00:03

by Matthew Dobson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Matthew Dobson wrote:
> On Sun, 2004-10-03 at 16:53, Martin J. Bligh wrote:
>
>>>Martin wrote:
>>>
>>>>Matt had proposed having a separate sched_domain tree for each cpuset, which
>>>>made a lot of sense, but seemed harder to do in practice because "exclusive"
>>>>in cpusets doesn't really mean exclusive at all.
>>>
>>>See my comments on this from yesterday on this thread.
>>>
>>>I suspect we don't want a distinct sched_domain for each cpuset, but
>>>rather a sched_domain for each of several entire subtrees of the cpuset
>>>hierarchy, such that every CPU is in exactly one such sched domain, even
>>>though it be in several cpusets in that sched_domain.
>>
>>Mmmm. The fundamental problem I think we ran across (just whilst pondering,
>>not in code) was that some things (eg ... init) are bound to ALL cpus (or
>>no cpus, depending how you word it); i.e. they're created before the cpusets
>>are, and are a member of the grand-top-level-uber-master-thingummy.
>>
>>How do you service such processes? That's what I meant by the exclusive
>>domains aren't really exclusive.
>>
>>Perhaps Matt can recall the problems better. I really liked his idea, aside
>>from the small problem that it didn't seem to work ;-)
>
>
> Well that doesn't seem like a fair statement. It's potentially true,
> but it's really hard to say without an implementation! ;)
>
> I think that the idea behind cpusets is really good, essentially
> creating isolated areas of CPUs and memory for tasks to run
> undisturbed. I feel that the actual implementation, however, is taking
> a wrong approach, because it attempts to use the cpus_allowed mask to
> override the scheduler in the general case. cpus_allowed, in my
> estimation, is meant to be used as the exception, not the rule. If we
> wish to change that, we need to make the scheduler more aware of it, so
> it can do the right thing(tm) in the presence of numerous tasks with
> varying cpus_allowed masks. The other option is to implement cpusets in
> a way that doesn't use cpus_allowed. That is the option that I am
> pursuing.
>
> My idea is to make sched_domains much more flexible and dynamic. By
> adding locking and reference counting, and simplifying the way in which
> sched_domains are created, linked, unlinked and eventually destroyed we
> can use sched_domains as the implementation of cpusets. IA64 already
> allows multiple sched_domains trees without a shared top-level domain.
> My proposal is to make this functionality more generally available.
> Extending the "isolated domains" concept a little further will buy us
> most (all?) the functionality of "exclusive" cpusets without the need to
> use cpus_allowed at all.
>
> I've got some code. I'm in the midst of pushing it forward to rc3-mm2.
> I'll post an RFC later today or tomorrow when it's cleaned up.
>
> -Matt

Sorry to reply to a long quiet thread, but I've been trading emails with Paul
Jackson on this subject recently, and I've been unable to convince either him
or myself that merging CPUSETs and CKRM is as easy as I once believed. I'm
still convinced the CPU side is doable, but I haven't managed as much success
with the memory binding side of CPUSETs. In light of this, I'd like to remove
my previous objections to CPUSETs moving forward. If others still have things
they want discussed before CPUSETs moves into mainline, that's fine, but it
seems to me that CPUSETs offer legitimate functionality and that the code has
certainly "done its time" in -mm to convince me it's stable and usable.

-Matt

2005-02-08 00:16:19

by Andrew Morton

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Matthew Dobson <[email protected]> wrote:
>
> Sorry to reply to a long quiet thread,

Is appreciated, thanks.

> but I've been trading emails with Paul
> Jackson on this subject recently, and I've been unable to convince either him
> or myself that merging CPUSETs and CKRM is as easy as I once believed. I'm
> still convinced the CPU side is doable, but I haven't managed as much success
> with the memory binding side of CPUSETs. In light of this, I'd like to remove
> my previous objections to CPUSETs moving forward. If others still have things
> they want discussed before CPUSETs moves into mainline, that's fine, but it
> seems to me that CPUSETs offer legitimate functionality and that the code has
> certainly "done its time" in -mm to convince me it's stable and usable.

OK, I'll add cpusets to the 2.6.12 queue.

going once, going twice...

2005-02-08 00:35:32

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Andrew wrote:
> OK, I'll add cpusets to the 2.6.12 queue.

I'd like that ;).

Thank-you, Matthew, for the work you put into making sense of this.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-08 09:38:12

by Dinakar Guniguntala

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Mon, Feb 07, 2005 at 03:59:49PM -0800, Matthew Dobson wrote:

> Sorry to reply to a long quiet thread, but I've been trading emails with Paul
> Jackson on this subject recently, and I've been unable to convince either
> him or myself that merging CPUSETs and CKRM is as easy as I once believed.
> I'm still convinced the CPU side is doable, but I haven't managed as much
> success with the memory binding side of CPUSETs. In light of this, I'd
> like to remove my previous objections to CPUSETs moving forward. If others
> still have things they want discussed before CPUSETs moves into mainline,
> that's fine, but it seems to me that CPUSETs offer legitimate functionality
> and that the code has certainly "done its time" in -mm to convince me it's
> stable and usable.
>
> -Matt
>

What about your proposed sched domain changes?
Can't sched domains be used to handle the CPU groupings, and the
existing code in cpusets that handles memory continue as is?
Weren't sched domains supposed to give the scheduler better knowledge
of the CPU groupings after all?

Regards,

Dinakar

2005-02-08 09:50:56

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Dinakar Guniguntala wrote:
> On Mon, Feb 07, 2005 at 03:59:49PM -0800, Matthew Dobson wrote:
>
>
>>Sorry to reply to a long quiet thread, but I've been trading emails with Paul
>>Jackson on this subject recently, and I've been unable to convince either
>>him or myself that merging CPUSETs and CKRM is as easy as I once believed.
>>I'm still convinced the CPU side is doable, but I haven't managed as much
>>success with the memory binding side of CPUSETs. In light of this, I'd
>>like to remove my previous objections to CPUSETs moving forward. If others
>>still have things they want discussed before CPUSETs moves into mainline,
>>that's fine, but it seems to me that CPUSETs offer legitimate functionality
>>and that the code has certainly "done its time" in -mm to convince me it's
>>stable and usable.
>>
>>-Matt
>>
>
>
> What about your proposed sched domain changes?
> Can't sched domains be used to handle the CPU groupings, and the
> existing code in cpusets that handles memory continue as is?
> Weren't sched domains supposed to give the scheduler better knowledge
> of the CPU groupings after all?
>

sched domains can provide non overlapping top level partitions.
It would basically just stop the multiprocessor balancing from
moving tasks between these partitions (they would be manually
moved by setting explicit cpu affinities).

I didn't really follow where that idea went, but I think at least
a few people thought that sort of functionality wasn't nearly
fancy enough! :)

2005-02-08 16:15:20

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

>> What about your proposed sched domain changes?
>> Can't sched domains be used to handle the CPU groupings, and the
>> existing code in cpusets that handles memory continue as is?
>> Weren't sched domains supposed to give the scheduler better knowledge
>> of the CPU groupings after all?
>>
>
> sched domains can provide non overlapping top level partitions.
> It would basically just stop the multiprocessor balancing from
> moving tasks between these partitions (they would be manually
> moved by setting explicit cpu affinities).
>
> I didn't really follow where that idea went, but I think at least
> a few people thought that sort of functionality wasn't nearly
> fancy enough! :)

Not fancy seems like a positive thing to me ;-)

M.

2005-02-08 16:18:20

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> Sorry to reply to a long quiet thread, but I've been trading emails with
> Paul Jackson on this subject recently, and I've been unable to convince
> either him or myself that merging CPUSETs and CKRM is as easy as I once
> believed. I'm still convinced the CPU side is doable, but I haven't
> managed as much success with the memory binding side of CPUSETs.

Can you describe what the difficulty is with the mem binding side?

Thanks,

M.

PS. If you could also make your mailer line-wrap, that'd be splendid ;-)

2005-02-08 19:00:54

by Matthew Dobson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Dinakar Guniguntala wrote:
> On Mon, Feb 07, 2005 at 03:59:49PM -0800, Matthew Dobson wrote:
>
>
>>Sorry to reply to a long quiet thread, but I've been trading emails with Paul
>>Jackson on this subject recently, and I've been unable to convince either
>>him or myself that merging CPUSETs and CKRM is as easy as I once believed.
>>I'm still convinced the CPU side is doable, but I haven't managed as much
>>success with the memory binding side of CPUSETs. In light of this, I'd
>>like to remove my previous objections to CPUSETs moving forward. If others
>>still have things they want discussed before CPUSETs moves into mainline,
>>that's fine, but it seems to me that CPUSETs offer legitimate functionality
>>and that the code has certainly "done its time" in -mm to convince me it's
>>stable and usable.
>>
>>-Matt
>>
>
>
> What about your proposed sched domain changes?
> Can't sched domains be used to handle the CPU groupings, and the
> existing code in cpusets that handles memory continue as is?
> Weren't sched domains supposed to give the scheduler better knowledge
> of the CPU groupings after all?
>
> Regards,
>
> Dinakar

Yes. I still think that there is room for merging on the CPU scheduling side
between CPUSETs and sched domains, and I will continue to work on that aspect.
The reason Paul and I decided that they weren't totally reconcilable is
because of the memory binding side of the CPUSETs code.

-Matt

2005-02-08 19:33:10

by Matthew Dobson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Nick Piggin wrote:
> Dinakar Guniguntala wrote:
>
>> On Mon, Feb 07, 2005 at 03:59:49PM -0800, Matthew Dobson wrote:
>>
>>
>>> Sorry to reply to a long quiet thread, but I've been trading emails with
>>> Paul Jackson on this subject recently, and I've been unable to
>>> convince either him or myself that merging CPUSETs and CKRM is as
>>> easy as I once believed. I'm still convinced the CPU side is doable,
>>> but I haven't managed as much success with the memory binding side of
>>> CPUSETs. In light of this, I'd like to remove my previous objections
>>> to CPUSETs moving forward. If others still have things they want
>>> discussed before CPUSETs moves into mainline, that's fine, but it
>>> seems to me that CPUSETs offer legitimate functionality and that the
>>> code has certainly "done its time" in -mm to convince me it's stable
>>> and usable.
>>>
>>> -Matt
>>>
>>
>>
>> What about your proposed sched domain changes?
>> Can't sched domains be used to handle the CPU groupings, and the
>> existing code in cpusets that handles memory continue as is?
>> Weren't sched domains supposed to give the scheduler better knowledge
>> of the CPU groupings after all?
>>
>
> sched domains can provide non overlapping top level partitions.
> It would basically just stop the multiprocessor balancing from
> moving tasks between these partitions (they would be manually
> moved by setting explicit cpu affinities).

Yep. That's the idea! :)


> I didn't really follow where that idea went, but I think at least
> a few people thought that sort of functionality wasn't nearly
> fancy enough! :)

Well, that's about how far the idea was supposed to go. ;) I think named
hierarchical sched_domains would offer the same functionality (at least for CPU
partitioning) as CPUSETs. I'm not sure who didn't think it was fancy enough,
but if you or anyone else can describe CPUSETs configurations that couldn't be
represented by sched_domains trees, I'd be very curious to hear about them.

-Matt

2005-02-08 20:44:04

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Matthew wrote:
> The reason Paul and I decided that they weren't totally reconcilable is
> because of the memory binding side of the CPUSETs code.

Speak for yourself, Matthew ;).

I agree with you that the scheduler experts (I'm not one, nor do I aspire to
be one) may well find that it makes sense someday to better integrate
scheduler domains and cpusets. It seems a little inefficient on the
surface for scheduler domain code to spend time trying to choose the
best task to run on a CPU, only to find out that the chosen task is not
allowed, because that task's cpus_allowed does not allow execution on the
intended CPU. Since in some systems, cpusets will provide a better
indication of the natural clustering of various cpus_allowed values than
a simple boottime hierarchical partitioning of the system, it makes
sense to me that there might be a way to improve the integration of
cpusets and scheduler domains, at least as an option on systems that are
making heavy use of cpusets. This might have the downside of making
sched domains more dynamic than they are now, which might cost more
performance than it gained. Others will have to evaluate those
tradeoffs.

But when you write the phrase "they weren't totally reconcilable,"
I presume you mean "cpusets and CKRM weren't totally reconcilable."

I would come close to turning this phrasing around, and state that
they were (nearly) totally unreconcilable <grin>.

I found no useful and significant basis for integration of cpusets and
CKRM either involving CPU or Memory Node management.

As best as I can figure out, CKRM is a fair share scheduler with a
gussied up more modular architecture, so that the components to track
usage, control (throttle) tasks, and classify tasks are separate
plugins. I can find no significant and useful overlap on any of these
fronts, either the existing plugins or their infrastructure, with what
cpusets has and needs.

There are claims that CKRM has some generalized resource management
architecture that should be able to handle cpusets needs, but despite my
repeated (albeit not entirely successful) efforts to find documentation
and read source and my pleadings with Matthew and earlier on this
thread, I was never able to figure out what this meant, or find anything
that could profitably integrate with cpusets.

In sum -- I see a potential for useful integration of cpusets and
scheduler domains, I'll have to leave it up to those with expertise in
the scheduler to evaluate and perhaps accomplish this. I do not see any
useful integration of cpusets and CKRM.

I continue to be befuddled as to why, Matthew, you confound potential
cpuset-scheddomain integration with potential cpuset-CKRM integration.
Scheduler domains and CKRM are distinct beasts, in my book, and the
contemplations of cpuset integration with these two beasts are also
distinct efforts.

And cpusets and CKRM are distinct beasts.

But I repeat myself ...

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-08 22:14:32

by Matthew Dobson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Paul Jackson wrote:
> Matthew wrote:
>
>> The reason Paul and I decided that they weren't totally reconcilable is
>>because of the memory binding side of the CPUSETs code.
>
>
> Speak for yourself, Matthew ;).
>
> I agree with you that the scheduler experts (I'm not one, nor do I aspire to
> be one) may well find that it makes sense someday to better integrate
> scheduler domains and cpusets. It seems a little inefficient on the
> surface for scheduler domain code to spend time trying to choose the
> best task to run on a CPU, only to find out that the chosen task is not
> allowed, because that task's cpus_allowed does not allow execution on the
> intended CPU. Since in some systems, cpusets will provide a better
> indication of the natural clustering of various cpus_allowed values than
> a simple boottime hierarchical partitioning of the system, it makes
> sense to me that there might be a way to improve the integration of
> cpusets and scheduler domains, at least as an option on systems that are
> making heavy use of cpusets. This might have the downside of making
> sched domains more dynamic than they are now, which might cost more
> performance than it gained. Others will have to evaluate those
> tradeoffs.

Indeed. There are tradeoffs involved in changing sched_domains from a single
static, boot-time setup to a more dynamic, configurable setup. Most notably
the inevitable locking necessary to ensure a consistent view of the domain
trees. Those tradeoffs, design decisions, etc. are fodder for another thread.


> But when you write the phrase "they weren't totally reconcilable,"
> I presume you mean "cpusets and CKRM weren't totally reconcilable."
>
> I would come close to turning this phrasing around, and state that
> they were (nearly) totally unreconcilable <grin>.
>
> I found no useful and significant basis for integration of cpusets and
> CKRM either involving CPU or Memory Node management.

Yes, I misspoke. I should have been more clear that CKRM and CPUSETs (seem) to
be unreconcilable. Sched_domains and CPUSETs (seem) to have some potential
functionality overlap that leads me to (still) believe there is hope to
integrate these two systems.


> As best as I can figure out, CKRM is a fair share scheduler with a
> gussied up more modular architecture, so that the components to track
> usage, control (throttle) tasks, and classify tasks are separate
> plugins. I can find no significant and useful overlap on any of these
> fronts, either the existing plugins or their infrastructure, with what
> cpusets has and needs.
>
> There are claims that CKRM has some generalized resource management
> architecture that should be able to handle cpusets needs, but despite my
> repeated (albeit not entirely successful) efforts to find documentation
> and read source and my pleadings with Matthew and earlier on this
> thread, I was never able to figure out what this meant, or find anything
> that could profitably integrate with cpusets.
>
> In sum -- I see a potential for useful integration of cpusets and
> scheduler domains, I'll have to leave it up to those with expertise in
> the scheduler to evaluate and perhaps accomplish this. I do not see any
> useful integration of cpusets and CKRM.

I'm not an expert on CKRM, so I'll leave the refuting (or not refuting) of your
claims as to CKRM's usefulness to someone with more background and expertise on
the subject. Anyone want to pipe up and defend the alleged "gussied up"
fair-share scheduler?


> I continue to be befuddled as to why, Matthew, you confound potential
> cpuset-scheddomain integration with potential cpuset-CKRM integration.
> Scheduler domains and CKRM are distinct beasts, in my book, and the
> contemplations of cpuset integration with these two beasts are also
> distinct efforts.
>
> And cpusets and CKRM are distinct beasts.

My clever attempts to befuddle you have obviously succeeded beyond my wildest
dreams, Paul. You are now mired in a web of acronyms with no way out. You may
be eaten by a grue. :p


> But I repeat myself ...

It's the surest way to get someone to hear you, right!? ;)

-Matt

2005-02-08 22:18:44

by Matthew Dobson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin J. Bligh wrote:
>>Sorry to reply to a long quiet thread, but I've been trading emails with
>>Paul Jackson on this subject recently, and I've been unable to convince
>>either him or myself that merging CPUSETs and CKRM is as easy as I once
>>believed. I'm still convinced the CPU side is doable, but I haven't
>>managed as much success with the memory binding side of CPUSETs.
>
>
> Can you describe what the difficulty is with the mem binding side?

Well, basically we need to ensure that when CPUSETs are marked "mems_exclusive"
that no one else in the system is allowed to allocate from those "exclusive"
nodes. This can't be guaranteed without hooks in the allocation code much like
what Paul has already written in his CPUSETs patchset.
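
(To make that concrete: the test such a hook has to make is essentially
"is this node in the requesting task's allowed set?". The fragment below
is only a toy userspace model of that test, with invented names and a
plain unsigned long standing in for a kernel nodemask; it is not the
actual allocator code.)

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for a kernel nodemask (one bit per memory node). */
typedef unsigned long toy_nodemask_t;

struct toy_task {
	toy_nodemask_t mems_allowed;   /* nodes this task's cpuset grants it */
};

/*
 * The check an allocator hook would have to make so that nodes belonging
 * to a mems_exclusive cpuset are never used by outside tasks: allocation
 * from 'node' is permitted only if that node is in the task's allowed set.
 */
static bool toy_node_allowed(const struct toy_task *tsk, int node)
{
	return (tsk->mems_allowed & (1UL << node)) != 0;
}

int main(void)
{
	struct toy_task t = { .mems_allowed = 0x3 };  /* nodes 0 and 1 only */
	printf("node 2 allowed? %d\n", toy_node_allowed(&t, 2));  /* prints 0 */
	return 0;
}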

> Thanks,
>
> M.
>
> PS. If you could also make your mailer line-wrap, that'd be splendid ;-)

I believe my mailer is line-wrapping correctly, but it's hard to be sure
without feedback. I switched to Thunderbird last week, and I think I've
(un)checked all the appropriate boxes. And yes, line wrapping is splendid.
Splendiferous, even.

-Matt

2005-02-08 23:26:49

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Martin J. Bligh wrote:
>>>What about your proposed sched domain changes?
>>>Can't sched domains be used to handle the CPU groupings, and the
>>>existing code in cpusets that handles memory continue as is?
>>>Weren't sched domains supposed to give the scheduler better knowledge
>>>of the CPU groupings after all?
>>>
>>
>>sched domains can provide non overlapping top level partitions.
>>It would basically just stop the multiprocessor balancing from
>>moving tasks between these partitions (they would be manually
>>moved by setting explicit cpu affinities).
>>
>>I didn't really follow where that idea went, but I think at least
>>a few people thought that sort of functionality wasn't nearly
>>fancy enough! :)
>
>
> Not fancy seems like a positive thing to me ;-)
>

Yes :)

I was thinking the sched domains soft-partitioning could be a
useful feature in its own right, considering the runtime impact
would be exactly zero, and the setup code should already be mostly
there.

If anyone was interested, I could try to cook up an implementation
on the scheduler side. The biggest issues may be the userspace
interface and a decent userspace management tool.

2005-02-08 23:58:07

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement


>> As best as I can figure out, CKRM is a fair share scheduler with a
>> gussied up more modular architecture, so that the components to track
>> usage, control (throttle) tasks, and classify tasks are separate
>> plugins.

> I'm not an expert on CKRM, so I'll leave the refuting (or not refuting)
> of your claims as to CKRM's usefulness to someone with more background
> and expertise on the subject. Anyone want to pipe up and defend the
> alleged "gussied up" fair-share scheduler?

Well, I'm not sure I want to minutely examine Paul's choice of words!
I would have thought that two OLS and one KS presentation would suffice
to clarify what CKRM is and isn't, but that doesn't seem to be the case
:-) So here we go again.

CKRM is both a resource management infrastructure AND
a set of controllers. The two are independent.

The infrastructure provides for
a) grouping of kernel objects (currently only tasks & sockets but can be
extended to any others)
b) an external interface for manipulating attributes of the grouping
such as shares, statistics and members
c) an internal interface for controllers to exploit this grouping
information in whatever way they want.

The controllers do whatever they want with the grouping info.
The IBM folks on the project have written ONE set of controllers for
cpu, mem, io, net and numtasks which HAPPENS to be/aspire to be
fair-share. Others are free to write ones which ignore share settings
and be unfair, callous or whatever else they want.

We would love to have people write alternate controllers for the same
resources (cpu,mem,io,net,numtasks) and others. The former will provide
alternatives to our implementation, the latter will validate the
architecture's utility.


>> I can find no significant and useful overlap on any of these
>> fronts, either the existing plugins or their infrastructure, with what
>> cpusets has and needs.
>> There are claims that CKRM has some generalized resource management
>> architecture that should be able to handle cpusets needs, but despite my
>> repeated (albeit not entirely successful) efforts to find documentation
>> and read source and my pleadings with Matthew and earlier on this
>> thread, I was never able to figure out what this meant, or find anything
>> that could profitably integrate with cpusets.

Rereading the earlier posts on the thread, I'd agree. There are some
similarities in our interfaces but not enough to warrant a merger.


-- Shailabh

2005-02-09 00:25:50

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Matthew wrote:
> I should have been more clear that CKRM and CPUSETs (seem) to
> be unreconcilable. Sched_domains and CPUSETs (seem) to have some potential
> functionality overlap that leads me to (still) believe there is hope to
> integrate these two systems.

Aha - now we're getting somewhere.

I was under the illusion these last four months that you were going to
serve as priest at the shotgun wedding that Andrew had requested be
arranged between cpusets and CKRM. All this time, you were hoping to
get cpusets hooked up with sched domains.

My daughter 'cpusets' sure is popular ;).

If cpusets were somehow to be subsumed into CKRM, it would likely have
meant reincarnating cpusets in a new form, varying in some degree, large
or small, from its current form. If that had been in our foreseeable
future, then we would not have wanted to put cpusets in its current form
in the main tree. It's a lot easier to change API's that aren't API's
yet.

I remain certain that cpusets don't fit in CKRM. Not even close.

The merger of cpusets and sched domains is an entirely different affair,
in my view. It's an internal optimization, having next to zero impact
on any API's that the kernel presents to userland. On most systems, it
would be of no particular benefit. But on big honkin numa boxes making
heavy use of cpusets, it might make the scheduler's work more efficient.
Or might not. I will leave that up to others to figure out, when and if
they choose to. I'll be glad to help with such an effort, what little
I can, if it comes about.

If such an integration between cpusets and sched domains is in our
future, we should first get cpusets into the kernel, and then the
appropriate experts can refine the interaction of cpusets with sched
domains. In this case, the sooner cpusets goes in, the better, so that
the integration effort with sched domains can commence, confident that
cpusets are here to stay.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-09 00:28:38

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Shailabh wrote:
> Well, I'm not sure I want to minutely examine Paul's choice of words !

You're a wise man ;).


> Rereading the earlier posts on the thread, I'd agree. There are some
> similarities in our interfaces but not enough to warrant a merger.

As I said ... a wise man !

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-09 02:53:20

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Matthew Dobson wrote:
> Nick Piggin wrote:

>> I didn't really follow where that idea went, but I think at least
>> a few people thought that sort of functionality wasn't nearly
>> fancy enough! :)
>
>
> Well, that's about how far the idea was supposed to go. ;) I think
> named hierarchical sched_domains would offer the same functionality (at
> least for CPU partitioning) as CPUSETs. I'm not sure who didn't think
> it was fancy enough, but if you or anyone else can describe CPUSETs
> configurations that couldn't be represented by sched_domains trees, I'd
> be very curious to hear about them.
>

OK. Someone mentioned wanting to do overlapping sets of CPUs. For
example, 3 groups, first can run on cpus 0 and 1, second 1 and 2,
third 2 and 0. However, this in itself doesn't preclude the use of
sched-domains.

In the (hopefully) common case where there are disjoint partitions
_somewhere_, sched domains can do the job in a much better
way than task cpu affinities (better isolation, multiprocessor
balancing shouldn't break down).

Those users with overlapping CPU sets can then use task affinities
on top of sched domains partitions to get the desired result.
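
(Concretely, Nick's three overlapping groups -- cpus {0,1}, {1,2} and
{2,0} -- need nothing more than per-task affinity masks layered on top of
whatever partitioning sits underneath. A minimal sketch, assuming a
machine with at least three cpus:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Three overlapping 2-cpu groups: {0,1}, {1,2}, {2,0}. */
static const int groups[3][2] = { {0, 1}, {1, 2}, {2, 0} };

int main(void)
{
	for (int i = 0; i < 3; i++) {
		pid_t pid = fork();
		if (pid == 0) {
			cpu_set_t mask;
			CPU_ZERO(&mask);
			CPU_SET(groups[i][0], &mask);
			CPU_SET(groups[i][1], &mask);
			if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
				perror("sched_setaffinity");
			printf("child %d restricted to cpus %d and %d\n",
			       i, groups[i][0], groups[i][1]);
			_exit(0);	/* a real job would exec here */
		}
	}
	for (int i = 0; i < 3; i++)
		wait(NULL);
	return 0;
}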

2005-02-09 04:25:07

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Nick wrote:
> The biggest issues may be the userspace
> interface and a decent userspace management tool.

One possibility, perhaps, would be to have a boolean flag "sched_domain"
on each cpuset, indicating whether it was a sched domain or not. If a
cpuset had its sched_domain flag set, then that cpuset's cpus_allowed
mask would define a sched domain.

Later Nick wrote:
> In the (hopefully) common case where there are disjoint partitions
> _somewhere_, sched domains can do the job in a much better
> way than task cpu affinities (better isolation, multiprocessor
> balancing shouldn't break down).
>
> Those users with overlapping CPU sets can then use task affinities
> on top of sched domains partitions to get the desired result.

Ok - seems it should work with the above cpuset flag marking sched
domains, and a rule that _those_ cpusets so marked can't overlap. Other
cpusets that are not so marked, and any sched_setaffinity calls, can do
whatever they want. Trying to turn on the sched_domain flag on a cpuset
that overlapped with existing such cpuset sched_domains, or trying to
mess with the CPUs (cpus_allowed) in an existing cpuset sched_domain so
as to force it to overlap, would return an error to user space on that
write(2).

If the sysadmin didn't mark any cpusets as sched_domains, then fall back
to something automatic and useful.

Inside the kernel, we'll need some way for the cpuset code to tell the
sched code about sched_domain changes. This might mean something like
the following: have the sched code provide the cpuset code a couple of
routines, one to set up and the other to tear down sched_domains.

Both calls would take a cpumask_t argument, and return void. The setup
call must pass a cpumask that does not overlap any existing sched
domains defined via cpusets. The tear down call must pass a cpumask
value exactly matching a previous, still active, setup call.

So if someone made a single CPU change to an existing sched_domain
defining cpuset, the kernel cpuset code would have to call the kernel
sched code twice, first to tear down the old sched_domain, and then to
set up the new, slightly different, one. The cpuset code would likely be
holding the single global cpuset_sem semaphore across this pair of
calls.
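
(A rough, self-contained sketch of that internal interface, with invented
function names and a toy type standing in for cpumask_t; the real thing
would live in the scheduler code and take a real cpumask_t.)

#include <stdio.h>

/* Toy stand-in for the kernel's cpumask_t. */
typedef unsigned long toy_cpumask_t;

/*
 * The two entry points described above: a setup call (whose mask must not
 * overlap any existing cpuset-defined sched domain) and a tear down call
 * (whose mask must exactly match an earlier, still active, setup call).
 * These are illustrative stubs, not real kernel functions.
 */
static void sched_domain_setup(toy_cpumask_t cpus)
{
	printf("set up sched domain for mask 0x%lx\n", cpus);
}

static void sched_domain_teardown(toy_cpumask_t cpus)
{
	printf("tear down sched domain for mask 0x%lx\n", cpus);
}

/*
 * Changing one CPU in a sched_domain-defining cpuset then becomes a tear
 * down of the old mask followed by a setup of the new one, with the global
 * cpuset semaphore held across both calls.
 */
static void cpuset_change_sched_domain(toy_cpumask_t old_mask, toy_cpumask_t new_mask)
{
	/* down(&cpuset_sem); -- held across the pair in the real thing */
	sched_domain_teardown(old_mask);
	sched_domain_setup(new_mask);
	/* up(&cpuset_sem); */
}

int main(void)
{
	cpuset_change_sched_domain(0x0f, 0x1f);	/* e.g. add cpu 4 to cpus 0-3 */
	return 0;
}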

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2005-02-09 18:01:41

by Chandra Seetharaman

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Tue, Feb 08, 2005 at 12:42:34PM -0800, Paul Jackson wrote:
> Matthew wrote:
>
> I found no useful and significant basis for integration of cpusets and
> CKRM either involving CPU or Memory Node management.
>
> As best as I can figure out, CKRM is a fair share scheduler with a
> gussied up more modular architecture, so that the components to track
> usage, control (throttle) tasks, and classify tasks are separate
> plugins. I can find no significant and useful overlap on any of these
> fronts, either the existing plugins or their infrastructure, with what
> cpusets has and needs.
>
> There are claims that CKRM has some generalized resource management
> architecture that should be able to handle cpusets needs, but despite my
> repeated (albeit not entirely successful) efforts to find documentation
> and read source and my pleadings with Matthew and earlier on this
> thread, I was never able to figure out what this meant, or find anything
> that could profitably integrate with cpusets.

I thought Hubertus did talk about this the last time the thread
was active. Anyway, here is how one could do cpuset/memset under the
ckrm framework (note that I am not pitching for a marriage :) as there are
some small problems, like supporting 128 cpus and changing the parameter names
that ckrm currently uses):

First off, cpuset and memset have to be implemented as two different
controllers; a toy sketch of the cpuset controller bookkeeping follows the
list below.

cpuset controller:
- 'guarantee' parameter to be used for representing the cpuset (bitwise)
- 'limit' parameter to be used for exclusivity and other flags.
- The highest level class (/rcfs/taskclass) will have all cpus in its list.
- Every class will maintain two cpusets: one that can be inherited,
inherit_cpuset (needed when exclusive is set in a child), and the other
for use by the class itself, my_cpuset.
- When a new class is created (under /rcfs/taskclass), it inherits all the
CPUs (from inherit_cpuset).
- The admin can change the cpuset of this class by echoing the new
cpuset (guarantee) into the 'shares' file.
- The admin can set/change the exclusivity (and similar) flags by echoing
the value (limit) to the 'shares' file.
- When the exclusivity flag is set in a class, the cpuset bits in this class
will be cleared in the inherit_cpuset of the parent and of all its other
children.
- At the time of scheduling, my_cpuset in the class of the task will be
consulted.
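
(Here is the toy sketch promised above of the inherit/exclusive
bookkeeping just described. The structure, field and function names are
invented for illustration, and a plain unsigned long stands in for a cpu
bitmask; this is not CKRM's actual API.)

#include <stdio.h>

typedef unsigned long toy_cpumask_t;	/* toy stand-in for a cpu bitmask */

/* Invented structure modelling the class parameters described above. */
struct toy_class {
	toy_cpumask_t my_cpuset;	/* 'guarantee': cpus used by this class */
	toy_cpumask_t inherit_cpuset;	/* what a new child may inherit */
	int exclusive;			/* 'limit': exclusivity flag */
};

/* A new class inherits all the cpus its parent may hand down. */
static void toy_class_init(struct toy_class *c, const struct toy_class *parent)
{
	c->my_cpuset = c->inherit_cpuset = parent->inherit_cpuset;
	c->exclusive = 0;
}

/*
 * Setting the exclusive flag on one child clears that child's cpus from
 * the parent's inherit_cpuset and from the inherit_cpuset of every other
 * child, as in the description above.
 */
static void toy_set_exclusive(struct toy_class *parent,
			      struct toy_class **children, int nchildren,
			      struct toy_class *excl)
{
	excl->exclusive = 1;
	parent->inherit_cpuset &= ~excl->my_cpuset;
	for (int i = 0; i < nchildren; i++)
		if (children[i] != excl)
			children[i]->inherit_cpuset &= ~excl->my_cpuset;
}

int main(void)
{
	struct toy_class root = { .my_cpuset = 0xff, .inherit_cpuset = 0xff };
	struct toy_class a, b;
	struct toy_class *kids[] = { &a, &b };

	toy_class_init(&a, &root);
	toy_class_init(&b, &root);
	a.my_cpuset = 0x0f;			/* admin writes a's 'guarantee' */
	toy_set_exclusive(&root, kids, 2, &a);	/* cpus 0-3 now exclusive to a */
	printf("root inherit=0x%lx b inherit=0x%lx\n",
	       root.inherit_cpuset, b.inherit_cpuset);	/* prints 0xf0 0xf0 */
	return 0;
}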

A memset controller would be similar to this; before pitching it I will talk
with Matt about why he thought that there is a problem.

If I missed some feature of cpuset that shows a bigger problem, please
let me know.
>
> In sum -- I see a potential for useful integration of cpusets and
> scheduler domains, I'll have to leave it up to those with expertise in
> the scheduler to evaluate and perhaps accomplish this. I do not see any
> useful integration of cpusets and CKRM.
>
> I continue to be befuddled as to why, Matthew, you confound potential
> cpuset-scheddomain integration with potential cpuset-CKRM integration.
> Scheduler domains and CKRM are distinct beasts, in my book, and the
> contemplations of cpuset integration with these two beasts are also
> distinct efforts.
>
> And cpusets and CKRM are distinct beasts.
>
> But I repeat myself ...
>
> --
> I won't rest till it's the best ...
> Programmer, Linux Scalability
> Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- [email protected] | .......you may get it.
----------------------------------------------------------------------

2005-02-11 02:48:40

by Chandra Seetharaman

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Wed, Feb 09, 2005 at 09:59:28AM -0800, Chandra Seetharaman wrote:
> On Tue, Feb 08, 2005 at 12:42:34PM -0800, Paul Jackson wrote:
--stuff deleted---
> memset_controller would be similar to this, before pitching it I will talk
> with Matt about why he thought that there is a problem.

I talked to Matt Dobson and explained to him the CKRM architecture and how
cpuset/memset can be implemented as a ckrm controller. He is now convinced
that there is no problem in making memset also a ckrm controller.

As explained in the earlier mail, memset also can be implemented in the
same way as cpuset.

>
> If I missed some feature of cpuset that shows a bigger problem, please
> let me know.
> >
> > In sum -- I see a potential for useful integration of cpusets and
> > scheduler domains, I'll have to leave it up to those with expertise in
> > the scheduler to evaluate and perhaps accomplish this. I do not see any
> > useful integration of cpusets and CKRM.
> >
> > I continue to be befuddled as to why, Matthew, you confound potential
> > cpuset-scheddomain integration with potential cpuset-CKRM integration.
> > Scheduler domains and CKRM are distinct beasts, in my book, and the
> > contemplations of cpuset integration with these two beasts are also
> > distinct efforts.
> >
> > And cpusets and CKRM are distinct beasts.
> >
> > But I repeat myself ...
> >
> > --
> > I won't rest till it's the best ...
> > Programmer, Linux Scalability
> > Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
>
> --
>
> ----------------------------------------------------------------------
> Chandra Seetharaman | Be careful what you choose....
> - [email protected] | .......you may get it.
> ----------------------------------------------------------------------

--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- [email protected] | .......you may get it.
----------------------------------------------------------------------

2005-02-11 09:23:44

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement


[ For those who have already reached a conclusion on this
subject, there is little that is new below. It's just
cast in a different light, as an analysis of how well
the CKRM cpuset/memset task class that Chandra describes
meets the needs of cpusets. The conclusion is: not well.

A pickup truck and a motorcycle both have their uses.
It's just difficult to combine them in a useful fashion.

Feel free to skim or skip the rest of this message. -pj ]


Chandra writes:
> If I missed some feature of cpuset that shows a bigger problem, please
> let me know.

Perhaps it would be better if first you ask yourself what
features your cpuset/memset taskclasses provide beyond
what's available in the basic sched_setaffinity (for cpu)
and mbind/set_mempolicy (for memory) calls. Offhand, I don't
see any.

But, I will grant, with my apologies, that I wrote the above
more in irritation than in a sincere effort to explain.

So, let me come at this through another door.

Since it seems apparent by now that both numa placement and
workload management cause some form of mutually exclusive brain
damage to its practitioners, making it difficult for either to
understand the other, let me:
1) describe the important properties of cpusets,
2) examine how well your proposal provides such, and
3) examine its additional costs compared to cpusets.

1. The important properties of cpusets.
=======================================

Cpusets facilitate integrated processor and memory placement
of jobs on large systems, especially useful on numa systems,
where the co-ordinated placement of jobs on cpus and memory is
important, sometimes critical, to obtaining good performance.

It is becoming increasingly obvious, as Intel, IBM and AMD
push more and more cores into one package at one end, and as
NEC, IBM, Bull, SGI and others push more and more packages into
single image systems at the other end, that complex layered numa
topologies are here to stay, in increasing number and complexity.

Cpusets helps manage numa placement of jobs in a way that
numa folks seem to find makes sense. The names of key
interface elements, and the opening remarks in commentary and
documentation are specific and relevant to the needs of those
doing numa placement.

It does so with a minimal, low cost patch in the main kernel.
Running diffstat on the cpuset* patches in 2.6.11-rc1-mm2 shows
the following summary stats:

19 files changed, 2362 insertions(+), 253 deletions(-)

The runtime costs are nearly zero, consisting in the usual
case on any hot paths of a usage counter increment at fork, a
usage counter decrement at exit, a usually inconsequential
bitmask test in mm/page_alloc.c, and a generation number
check in the mm/mempolicy.c alloc_page_vma() wrapper to
__alloc_pages().
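
(For the curious, that hot-path pattern -- a reference count bumped at
fork and dropped at exit, plus a generation number compared on allocation
-- can be modelled in a few lines of toy userspace C. All names below are
invented; this only illustrates the pattern, it is not the kernel code.)

#include <stdio.h>

/* Toy model of a cpuset object shared by all tasks attached to it. */
struct toy_cpuset {
	int count;		/* how many tasks reference this cpuset */
	int mems_generation;	/* bumped whenever the allowed nodes change */
};

struct toy_task {
	struct toy_cpuset *cpuset;
	int cpuset_mems_generation;	/* generation this task last saw */
};

/* fork(): the child just takes another reference - one increment. */
static void toy_fork(struct toy_task *child, const struct toy_task *parent)
{
	child->cpuset = parent->cpuset;
	child->cpuset_mems_generation = parent->cpuset_mems_generation;
	child->cpuset->count++;
}

/* exit(): drop the reference - one decrement. */
static void toy_exit(struct toy_task *tsk)
{
	tsk->cpuset->count--;
}

/*
 * Allocation path: in the common case the only work is comparing two
 * integers; a refresh of the task's cached placement is needed only when
 * the cpuset has actually been changed from user space.
 */
static int toy_alloc_check(struct toy_task *tsk)
{
	if (tsk->cpuset_mems_generation != tsk->cpuset->mems_generation) {
		tsk->cpuset_mems_generation = tsk->cpuset->mems_generation;
		/* ...re-read the allowed nodes here... */
		return 1;	/* slow path taken */
	}
	return 0;		/* fast path: generation unchanged */
}

int main(void)
{
	struct toy_cpuset cs = { .count = 1, .mems_generation = 0 };
	struct toy_task parent = { .cpuset = &cs, .cpuset_mems_generation = 0 };
	struct toy_task child;

	toy_fork(&child, &parent);
	printf("refs=%d slow_path=%d\n", cs.count, toy_alloc_check(&child));
	toy_exit(&child);
	return 0;
}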

Cpusets handles any number of CPUs and Memory Nodes, with no
practical hard limit imposed by the API or data types.

Cpusets can be used in combination with a workload manager
such as CKRM. You can use cpusets to create "soft partitions"
that are subsets of the entire system, and then in each such
partition, you can run a separate instance of a workload manager
to obtain the desired resource sharing.

Cpusets may provide a practical API to support administrative
refinements of scheduler domains, along more optimal natural
job boundaries, instead of just along automatic, artificial
architecture boundaries. Matthew and Nick both seem to be
making mumblings in this direction, but the jury is still out.
Indeed, we're still investigating. I have not heard of anyone
proposing to integrate CKRM and sched domains in this manner,
nor do I expect to.

There is no reason to artificially limit the depth of the cpuset
hierarchy, which represents subsets of subsets of cpus and nodes.
The rules (invariants) of cpusets have been carefully chosen
so as to never require any global or wide ranging analysis of
the cpuset hierarchy in order to enforce them. Each child must be
a subset of its parent, and exclusive cpusets cannot overlap
their siblings. That's about it. Both rules can be evaluated
locally, using just the nearest relatives of an affected cpuset.
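
(As an illustration of how local those checks are, both rules reduce to a
couple of mask operations against just the parent and the siblings. The
toy fragment below uses invented names and a plain unsigned long as the
cpu mask; it models only the two invariants, nothing more.)

#include <stdbool.h>
#include <stdio.h>

typedef unsigned long toy_cpumask_t;

struct toy_cpuset {
	toy_cpumask_t cpus;
	bool cpu_exclusive;
};

/* Rule 1: a child's cpus must be a subset of its parent's. */
static bool toy_is_subset(const struct toy_cpuset *child,
			  const struct toy_cpuset *parent)
{
	return (child->cpus & ~parent->cpus) == 0;
}

/* Rule 2: an exclusive cpuset may not overlap any of its siblings. */
static bool toy_exclusive_ok(const struct toy_cpuset *cs,
			     const struct toy_cpuset **siblings, int n)
{
	if (!cs->cpu_exclusive)
		return true;
	for (int i = 0; i < n; i++)
		if (siblings[i]->cpus & cs->cpus)
			return false;
	return true;
}

int main(void)
{
	struct toy_cpuset parent = { .cpus = 0xff };
	struct toy_cpuset a = { .cpus = 0x0f, .cpu_exclusive = true };
	struct toy_cpuset b = { .cpus = 0x1e, .cpu_exclusive = true };
	const struct toy_cpuset *sibs[] = { &a };

	/* b is inside its parent, but overlaps its exclusive sibling a */
	printf("b subset of parent: %d, b exclusive ok: %d\n",
	       toy_is_subset(&b, &parent), toy_exclusive_ok(&b, sibs, 1));
	return 0;
}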

An essential feature of the cpuset proposal is its file system
model of the 'nested subsets of cpus and nodes'. This provides
a name space, and permission model, that supports sensible
administration of numa friendly subsets of the compute resources
of large systems in complex administration environments.
A system can be dynamically 'partitioned' and 'sub-partitioned',
with sensible names and permissions for the partitions, while
maintaining the benefits of a single system image. This is
a classic use of a kernel, to manage a system wide resource
with a name space, structure rules, resource attributes, and
a permission/access model.

In sum, cpusets provides substantial benefit past the individual
sched_setaffinity/mbind/set_mempolicy calls for managing the
numa placement of jobs on large systems, at modest cost in
code size, runtime, maintenance and intellectual mastery.


2. How much of the above does your proposal provide?
====================================================

Not much. As best as I can tell, it provides an alternative
to the existing numa cpu and memory calls, at the cost of
considerable code, complexity and obtuseness above and beyond
cpusets. That additional complexity may well be necessary,
for the more difficult job it is trying to accomplish. But it
is not necessary for the simpler task of numa placement of jobs
on named, controlled, subsets of cpus and memory nodes.

Your proposal doesn't provide a distinguished "numa computation
unit" (cpu + memory), but rather tends to lose those two elements
in a longer list of task class elements.

I can't tell if it's just because you didn't take much time to
study cpusets, or if it's due to more essential limitations
of the CKRM implementation, but you got the subsetting and
exclusive rules wrong (or at least different).

The CKRM documentation and the names of key flags and such are
not intuitive to those doing numa work. If one comes at CKRM
from the perspective of someone trying to solve a numa placement
problem, the interfaces, documentation and naming really don't
make sense. Even if your architecture is more general and
powerful, I suspect your presentation is not widely accessible
outside those with a workload focus. Or perhaps I'm just more
dimwitted than most. It's difficult for me to know which.
But certainly both Matthew and I have struggled to make sense
of CKRM from a numa perspective.

You state you'd have a 128 CPU limitation. I don't know why
that would be, but it would be a critical limitation for SGI --
no small problem.

As explained below, with your proposal, one could not readily do
both workload management and numa placement at the same time,
because the task class hierarchy needed for the two is not
the same.

As noted above, while there seems to be a decent chance that
cpusets will provide some benefit to scheduler domains, allowing
the option of organizing sched domains along actual job usage
lines instead of artificial architecture lines, I have seen
no suggestion that CKRM task classes have that potential to
improve sched domains.

Elsewhere I recall you've had to impose fairly modest bounds
on the depth of your class hierarchy, because your resource
balancing rules are expensive to evaluate across deep, large
trees. The cpuset hierarchy has no such restraint.

Your task class hierarchy, if hijacked for numa placement,
might provide the kernel managed naming, structure and
access control of dynamic (soft) numa partitions that cpusets
does. I haven't looked closely at the permission model of
CKRM to see if it matches the needs of cpusets, so I can't
speak to that detail.

In sum, your cpuset/memset CKRM proposal provides few, if any,
of the additional benefits to numa placement work that cpusets
provides over the existing affinity and numa system calls.


3. What are the additional costs of your proposal over cpusets?
===============================================================

Your proposal, while it seems to offer little advantage for
numa placement over what we already have without cpusets, comes
at a substantially greater cost than cpusets.

The CKRM patch is five times the size of the cpuset patch,
with diffstat on the ckrm-e17.2610.patch showing:

65 files changed, 13020 insertions(+), 19 deletions(-)

The CKRM runtime, from what I can tell on the lmbench slide
from OLS 2004, costs several percent of available cycles.

You propose to include the cpu/mem placement hierarchy in the
task class hierarchy. This presents difficulties. Essentially,
they are not the same hierarchies. A jobs placement is
independent of its priority. Both high and low priority jobs
may well require proper numa placement, and both high and low
priority tasks may well run within the same cpuset.

So if your task class hierarchy is hijacked for numa placement,
it will not serve you well for workload management. On a system
that required numa placement using something like cpusets, the
five times larger size of the kernel patch required for CKRM
would be entirely unjustified, as CKRM would only be usable
for its cpuset-like capabilities.

Much of what you have now in CKRM would be useless for cpuset
work. As you observed in your proposal, you would need new
cpuset related rules for the subset and exclusive properties.

Cpusets need no new scheduler hook at all - the existing
cpus_allowed check that Ingo added years ago suffices.
You propose having the scheduler check the appropriate cpu mask
in the task class, which would definitely increase the cache
footprint size of the scheduler.
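
To make that cost concrete, here is a rough userspace C sketch of the
extra indirection (the structure layouts, the res[] array and the
CPUSET index are simplified stand-ins of my own, not the actual kernel
or CKRM definitions):

#include <stdbool.h>

enum { CPUSET = 0 };                       /* hypothetical resource index */

struct ckrm_res  { unsigned long cpus_allowed; };
struct taskclass { struct ckrm_res *res[1]; };

struct task {
        unsigned long     cpus_allowed;    /* per-task mask checked today */
        struct taskclass *taskclass;       /* per-class mask in the proposal */
};

/* Existing check: one field in the task struct, likely already hot in
 * cache when the scheduler looks at the task. */
static bool can_run_today(const struct task *p, int cpu)
{
        return p->cpus_allowed & (1UL << cpu);
}

/* Proposed check: two extra pointer dereferences on the scheduler hot
 * path, each potentially touching another cache line. */
static bool can_run_via_class(const struct task *p, int cpu)
{
        return p->taskclass->res[CPUSET]->cpus_allowed & (1UL << cpu);
}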

The papers for CKRM speak of providing policy driven
classification and differentiated service. The focus is on
managing resource sharing, to allow different classes of tasks
to get controlled allocations of proportions of shared resources.

Cpusets is not about sharing proportions of a common resource,
but rather about dedicating entire resources. Granted,
mathematically, there might be a mapping between these two.
But it is certainly an impediment to those having to understand
something if it is implemented by abusing something far
larger and quite foreign in intention.

This flows through to the names of the specific files in the
directory representing a cpuset or class. The names for CKRM
class directories are necessarily rather generic and abstract,
whereas those for cpusets directly represent the particular
need of placing tasks on cpus and memory nodes. For someone
doing numa placement, the latter are much easier to understand.

And as noted above, since you can't do both at the same time
(both use the CKRM infrastructure for its traditional workload
management and use it for numa placement) it's not like the
administrator of such a system gains anything from the more abstract
names, if they are just using it for cpusets (numa placement).

There is no synergy in the kernel hooks required in the scheduler
and memory allocator. The hooks required by cpusets check
bitmasks in order to allow or prohibit scheduling a task on
a CPU, or allocating a page from a particular node to a task.
These are quite distinct from the hooks required by CKRM when
used as a fair share scheduler and workload manager, which
requires adding delays to tasks in order to obtain the desired
proportion of resource usage between classes. Similarly, the
CKRM memory allocator hooks manage the number of pages in use
by each task class and/or the rate of page faults, while the
cpuset memory allocator hooks manage which memory nodes are
available to satisfy an allocation request.
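
To illustrate the difference (a sketch with made-up types, not the
actual hooks from either patch): the cpuset-style hook is a stateless
yes/no mask test, while a workload-management hook has to account usage
against a class's share and decide whether to throttle.

#include <stdbool.h>

/* Cpuset-style allocator hook: may a page come from this node? */
static bool node_allowed(unsigned long mems_allowed, int node)
{
        return mems_allowed & (1UL << node);
}

/* Workload-manager-style hook: charge the class for the allocation and
 * report whether it has exceeded its share of the resource. */
struct mem_class {
        unsigned long pages_used;
        unsigned long pages_limit;      /* derived from the class's share */
};

static bool class_charge_pages(struct mem_class *cls, unsigned long nr)
{
        if (cls->pages_used + nr > cls->pages_limit)
                return false;           /* over share: delay or reclaim */
        cls->pages_used += nr;
        return true;
}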

The share usage hooks that monitor each resource, and its usage
by each class, are useless for cpusets, which has no dependency
on resource usage. In cpusets, a task can use as much of its
allowed CPUs and Memory Nodes as it likes, without throttling. There is
no feedback loop based on rates of resource usage per class.

Most of the hooks required by the CKRM classification engine to
check for possible changes in a task's class, such as in fork,
exec, setuid, listen, and other points where a kernel object
might change are not needed for cpusets. The cpuset patch only
requires such state change hooks in fork, exit and allocation,
and only needs to increment or decrement a usage count in
fork and exit, and check a generation number in allocation.
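
A rough sketch of what those three hooks amount to (the type and
function names here are mine, purely for illustration, not the actual
patch):

#include <stdatomic.h>

struct cpuset {
        atomic_int    count;            /* tasks currently in this cpuset */
        atomic_int    mems_generation;  /* bumped when mems_allowed changes */
        unsigned long mems_allowed;
};

struct task {
        struct cpuset *cpuset;
        int            mems_generation; /* generation seen at last allocation */
        unsigned long  mems_allowed;    /* cached copy of the cpuset's mask */
};

static void cpuset_fork(struct task *child, const struct task *parent)
{
        child->cpuset = parent->cpuset;              /* inherit the cpuset */
        atomic_fetch_add(&child->cpuset->count, 1);  /* usage count++ */
}

static void cpuset_exit(struct task *p)
{
        atomic_fetch_sub(&p->cpuset->count, 1);      /* usage count-- */
}

/* Allocation fast path: a single generation-number comparison; the
 * cached mask is refreshed only if the cpuset actually changed. */
static void cpuset_update_mems(struct task *p)
{
        int gen = atomic_load(&p->cpuset->mems_generation);

        if (p->mems_generation != gen) {
                p->mems_allowed    = p->cpuset->mems_allowed;
                p->mems_generation = gen;
        }
}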

Cpusets has no use for a kernel classification engine. Outside
of the trivial, automatic propagation of cpusets in fork and
exit, the only changes in cpusets are mandated from user space.

Nor do cpusets have any need for the kernel to support externally
defined policy rules. Cpusets has no use for the classification
engine's callback mechanism. In cpusets, no events that might
affect state, such as fork, exit, reclassifications, changes in
uid, or resource rate usage samples, need to be reported to any
state agent, and there is no state agent, nor any communication
channel thereto.

Cpusets has no use for a facility that lets server tasks tell
some external classifier what phase they are operating in.
Cpusets has no need for some workload manager to be sampling
resource consumption and task state to determine resource
consumption. Cpusets has no need to track, in user space or
kernel, the state of tasks after they exit. Cpusets has no use
for delays nor for tracking them in the task struct.

Cpusets has no need for the hooks at the entry to, and exit from,
memory allocation routines to distinguish delays due to memory
allocation from those due to application i/o. Cpusets has no
need for sampling task state at fixed intervals, and our big
iron scientific customers would without a doubt not tolerate a
scan of the entire set of tasks every second for such resource
and task state data collection. Such a scan does _not_ scale
well on big honkin numa boxes. Whereas CKRM requires something
like relayfs to pass back to user space the constant stream of
such data, cpusets has no such needs and no such data.

Certainly, none of the network hooks that CKRM requires to
provide differentiated service across priority classes would be
of any use in a system (ab)using CKRM to provide cpuset style
numa placement.

It is true that both cpusets and CKRM make good use of the Linux
kernel's virtual file system (vfs). Cpusets uses vfs to model
the hierarchy of 'soft partitions' in the system. CKRM uses vfs
to model a resource priority hierarchy, essentially replacing a
single 'task priority' with hierarchical resource allocations,
managing what proportion, out of what is available, of fungible
resources such as ticks, cycles, bytes or data transfers a
given class of tasks is allowed to use in the aggregate.

Just because two facilities use vfs is certainly not sufficient
basis for deciding that they should be combined into one
facility.

The shares and stats control files in each task_class
directory are not needed by cpusets, but new control files,
for cpus_allowed and mems_allowed, are needed. That, or the
existing names have to be overloaded, at the cost of obfuscating
the interface.
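
As a purely illustrative example of how such control files might be
driven from user space -- assuming a cpuset filesystem mounted at
/dev/cpuset and the cpus_allowed / mems_allowed file names mentioned
above, neither of which may match the actual patch:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static void write_file(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (f) {
                fputs(val, f);
                fclose(f);
        }
}

int main(void)
{
        /* Create a new 'soft partition' and give it some cpus and nodes. */
        mkdir("/dev/cpuset/bigjob", 0755);
        write_file("/dev/cpuset/bigjob/cpus_allowed", "64-127\n");
        write_file("/dev/cpuset/bigjob/mems_allowed", "16-31\n");
        return 0;
}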

The kernel hooks for cpusets are fewer, simpler and more specific
than those for CKRM. Our high performance customers would want
the cpuset hooks compiled in, not the more generic ones for
CKRM (which they could not easily use for any other workload
management purpose anyway, if the task class hierarchy were
hijacked for the needs of cpusets, as noted above).

The development costs of cpusets so far, which are perhaps the
best predictor we have of future costs, have been substantially
lower than they have been for CKRM.

In sum, your proposal costs a lot more than cpusets, by a variety
of metrics.

=================================================

In summary, I find that your cpuset/memset CKRM proposal provides
little or no benefit past the simpler cpu and memory placement
calls already available, while costing substantially more in
a variety of ways than my cpuset proposal, when evaluated for
its usefulness for numa placement.

(Of course, if evaluated for suitability for workload management,
the table is turned, and your CKRM patch provides essential
capability that my cpuset patch could never dream of doing.)

Moreover, the additional workload management benefits that your
CKRM facility provides, and that some of my customers might
want to use in combination with numa placement, would probably
become unavailable to them if we integrated cpusets and CKRM,
because cpusets would have to hijack the task class hierarchy
for its own nefarious purposes.

Such an attempt to integrate cpusets and CKRM would be a major
setback for cpusets, substantially increasing its costs and
reducing its value, probably well past the point of it even being
worth pursuing further, in the mainstream kernel. Adding all
that foreign logic of cpusets to the CKRM patch probably
wouldn't help CKRM much either. The CKRM patch is already one
that requires a bright mind and some careful thought to master.
Adding cpuset numa placement logic, which is typically different
in detail, would add a complexity burden to the CKRM code that
would serve no one well.


> Note that I am not pitching for a marriage

We agree.

I just took more words to say it ').



--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-11 17:14:03

by Jesse Barnes

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Thursday, February 10, 2005 6:46 pm, Chandra Seetharaman wrote:
> On Wed, Feb 09, 2005 at 09:59:28AM -0800, Chandra Seetharaman wrote:
> > On Tue, Feb 08, 2005 at 12:42:34PM -0800, Paul Jackson wrote:
>
> --stuff deleted---
>
> > memset_controller would be similar to this, before pitching it I will
> > talk with Matt about why he thought that there is a problem.
>
> > Talked to Matt Dobson and explained to him the CKRM architecture and how
> cpuset/memset can be implemented as a ckrm controller. He is now convinced
> that there is no problem in making memset also a ckrm controller.
>
> As explained in the earlier mail, memset also can be implemented in the
> same way as cpuset.

Arg! Look, cpusets is *done* (i.e. it works well) and relatively simple and
easy to use. It's also been in -mm for quite some time. It also solves the
problem of being able to deal with large jobs on large systems rather
elegantly. Why oppose its inclusion upstream?

CKRM seems nice, but why is it not in -mm? I've heard it talked about a lot,
but it usually comes up as a response to some other, simpler project, in the
vein of "ckrm can do this, so your project is not needed" and needless to say
that's a bit frustrating. I'm not saying that ckrm isn't useful--indeed it
seems like an idea with a lot of utility (I liked Rik's ideas for using it to
manage desktop boxes and multiuser systems as a sort of per-process rlimits
on steroids), but using it for system partitioning or systemwide accounting
seems a bit foolish to me...

Jesse

2005-02-11 18:46:01

by Chandra Seetharaman

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Fri, Feb 11, 2005 at 08:54:52AM -0800, Jesse Barnes wrote:
> On Thursday, February 10, 2005 6:46 pm, Chandra Seetharaman wrote:
> > On Wed, Feb 09, 2005 at 09:59:28AM -0800, Chandra Seetharaman wrote:
> > > On Tue, Feb 08, 2005 at 12:42:34PM -0800, Paul Jackson wrote:
> >
> > --stuff deleted---
> >
> > > memset_controller would be similar to this, before pitching it I will
> > > talk with Matt about why he thought that there is a problem.
> >
> > Talked to Matt Dobson and explained to him the CKRM architecture and how
> > cpuset/memset can be implemented as a ckrm controller. He is now convinced
> > that there is no problem in making memset also a ckrm controller.
> >
> > As explained in the earlier mail, memset also can be implemented in the
> > same way as cpuset.
>
> Arg! Look, cpusets is *done* (i.e. it works well) and relatively simple and
> easy to use. It's also been in -mm for quite some time. It also solves the
> problem of being able to deal with large jobs on large systems rather
> elegantly. Why oppose its inclusion upstream?

Jesse,

Do note that I did not oppose the cpuset inclusion (that is what I meant by
saying "I am not pitching for a marriage"), and here are the reasons:

1. Even though cpuset can be implemented under CKRM, the current cpu controller
and mem controller (in CKRM) cannot cleanly handle the isolating part of cpusets
and still provide the resource management capabilities CKRM is supposed to
provide. For that reason, one cannot expect both the cpuset and CKRM functionality
in the same kernel.
2. I doubt that users who need cpusets will need the resource management capabilities
CKRM provides.

My email was intended mainly to erase the notion that CKRM cannot handle cpusets.
Also, I wanted to understand whether there are any real issues, which is why I talked
with Matt about why he thought CKRM cannot accommodate memset before sending the
second piece of mail.

>
> CKRM seems nice, but why is it not in -mm? I've heard it talked about a lot,
> but it usually comes up as a response to some other, simpler project, in the

We did post to lkml a while back and got comments on it. We are working on it and
will post the fixed code again in a few weeks with a couple of controllers.

> vein of "ckrm can do this, so your project is not needed" and needless to say
> that's a bit frustrating. I'm not saying that ckrm isn't useful--indeed it
> seems like an idea with a lot of utility (I liked Rik's ideas for using it to
> manage desktop boxes and multiuser systems as a sort of per-process rlimits
> on steroids), but using it for system partitioning or systemwide accounting
> seems a bit foolish to me...
>
> Jesse

--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- [email protected] | .......you may get it.
----------------------------------------------------------------------

2005-02-11 18:51:26

by Jesse Barnes

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Friday, February 11, 2005 10:42 am, Chandra Seetharaman wrote:
> My email was intended mainly to erase the notion that ckrm cannot handle
> cpuset. Also, I wanted to understand whether there are any real issues and that
> is why I talked with Matt about why he thought ckrm cannot accommodate
> memset before sending the second piece of mail.

Great! So cpusets is good to go for the mainline then (i.e. no major
objections to the interface). Note that implementation details that don't
affect the interface are another subject entirely, e.g. the sched domains
approach for scheduling as opposed to cpus_allowed.

> > CKRM seems nice, but why is it not in -mm? I've heard it talked about a
> > lot, but it usually comes up as a response to some other, simpler
> > project, in the
>
> We did post to lkml a while back and got comments on it. We are working on
> it and will post the fixed code again in a few weeks with a couple of
> controllers.

Excellent, I hope that it comes together into a form suitable for the
mainline, I think there are some really nice aspects to it.

Thanks,
Jesse

2005-02-12 01:41:28

by Chandra Seetharaman

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Fri, Feb 11, 2005 at 01:21:12AM -0800, Paul Jackson wrote:
> [ For those who have already reached a conclusion on this
> subject, there is little that is new below. It's just
> cast in a different light, as an analysis of how well
> the CKRM cpuset/memset task class that Chandra describes
> meets the needs of cpusets. The conclusion is: not well.
>
> A pickup truck and a motorcycle both have their uses.
> It's just difficult to combine them in a useful fashion.
>
> Feel free to skim or skip the rest of this message. -pj ]
>
[ As replied in an earlier mail, I am not advocating for cpuset to be
a ckrm controller. In this mail I am just providing clarifications
for some of Paul's comments. -chandra ]

>
> Chandra writes:
> > If I missed some feature of cpuset that shows a bigger problem, please
> > let me know.
>
> Perhaps it would be better if first you ask yourself what
> features your cpuset/memset taskclasses provide beyond

First off, I wasn't pitching for 'our' cpuset/memset taskclass. I was
suggesting that 'your' cpuset can be a ckrm controller.


> what's available in the basic sched_setaffinity (for cpu)
> and mbind/set_mempolicy (for memory) calls. Offhand, I don't
> see any.

And it doesn't have to be the same as what the above functions provide. Cpusets
can function exactly the same way under CKRM as they do otherwise.

>
> But, I will grant, with my apologies, that I wrote the above
> more in irritation than in a sincere effort to explain.
>
> So, let me come at this through another door.
>
> Since it seems apparent by now that both numa placement and
> workload management cause some form of mutually exclusive brain
> damage to its practitioners, making it difficult for either to
> understand the other, let me:
> 1) describe the important properties of cpusets,
> 2) examine how well your proposal provides such, and
> 3) examine its additional costs compared to cpusets.
>
> 1. The important properties of cpusets.
> =======================================
>
> Cpusets facilitate integrated processor and memory placement
> of jobs on large systems, especially useful on numa systems,
> where the co-ordinated placement of jobs on cpus and memory is
> important, sometimes critical, to obtaining good performance.
>
> It is becoming increasingly obvious, as Intel, IBM and AMD
> push more and more cores into one package at one end, and as
> NEC, IBM, Bull, SGI and others push more and more packages into
> single image systems at the other end, that complex layered numa
> topologies are here to stay, in increasing number and complexity.
>
> Cpusets helps manage numa placement of jobs in a way that
> numa folks seem to find makes sense. The names of key
> interface elements, and the opening remarks in commentary and
> documentation are specific and relevant to the needs of those
> doing numa placement.
>
> It does so with a minimal, low cost patch in the main kernel.
> Running diffstat on the cpuset* patches in 2.6.11-rc1-mm2 shows
> the following summary stats:
>
> 19 files changed, 2362 insertions(+), 253 deletions(-)
>
> The runtime costs are nearly zero, consisting in the usual
> case on any hot paths of a usage counter increment at fork, a
> usage counter decrement at exit, a usually inconsequential
> bitmask test in mm/page_alloc.c, and a generation number
> check in the mm/mempolicy.c alloc_page_vma() wrapper to
> __alloc_pages().
>
> Cpusets handles any number of CPUs and Memory Nodes, with no
> practical hard limit imposed by the API or data types.
>
> Cpusets can be used in combination with a workload manager
> such as CKRM. You can use cpusets to create "soft partitions"
> that are subsets of the entire system, and then in each such
> partition, you can run a separate instance of a workload manager
> to obtain the desired resource sharing.

CKRM's controllers currently may not play well with cpusets.
>
> Cpusets may provide a practical API to support administrative
> refinements of scheduler domains, along more optimal natural
> job boundaries, instead of just along automatic, artificial
> architecture boundaries. Matthew and Nick both seem to be
> making mumblings in this direction, but the jury is still out.
> Indeed, we're still investigating. I have not heard of anyone
> proposing to integrate CKRM and sched domains in this manner,
> nor do I expect to.

I haven't looked at sched_domains closely. Maybe I should, and see how we
can form a synergy.

>
> There is no reason to artificially limit the depth of the cpuset
> hierarchy, which represents subsets of subsets of cpus and nodes.
> The rules (invariants) of cpusets have been carefully chosen
> so as to never require any global or wide ranging analysis of
> the cpuset hierarchy in order to enforce. Each child must be
> a subset of its parent, and exclusive cpusets cannot overlap
> their siblings. That's about it. Both rules can be evaluated
> locally, using just the nearest relatives of an affected cpuset.
>
> An essential feature of the cpuset proposal is its file system
> model of the 'nested subsets of cpus and nodes'. This provides
> a name space, and permission model, that supports sensible
> administration of numa friendly subsets of the compute resources
> of large systems in complex administration environments.
> A system can be dynamically 'partitioned' and 'sub-partitioned',
> with sensible names and permissions for the partitions, while
> maintaining the benefits of a single system image. This is
> a classic use of a kernel, to manage a system wide resource
> with a name space, structure rules, resource attributes, and
> a permission/access model.
>
> In sum, cpusets provides substantial benefit past the individual
> sched_setaffinity/mbind/set_mempolicy calls for managing the
> numa placement of jobs on large systems, at modest cost in
> code size, runtime, maintenance and intellectual mastery.
>
>
> 2. How much of the above does your proposal provide?
> ====================================================
>
> Not much. As best as I can tell, it provides an alternative
> to the existing numa cpu and memory calls, at the cost of
> considerable code, complexity and obtuseness above and beyond
> cpusets. That additional complexity may well be necessary,
> for the more difficult job it is trying to accomplish. But it
> is not necessary for the simpler task of numa placement of jobs
> on named, controlled, subsets of cpus and memory nodes.

I was answering a different question: whether CKRM can accommodate
cpusets or not. (I'll talk about the complexity part later.)

>
> Your proposal doesn't provide a distinguished "numa computation
> unit" (cpu + memory), but rather tends to lose those two elements
> in a longer list of task class elements.

It doesn't readily provide it, but the architecture can provide it.

>
> I can't tell if it's just because you didn't take much time to
> study cpusets, or if it's due to more essential limitations
> of the CKRM implementation, but you got the subsetting and
> exclusive rules wrong (or at least different).

My understanding was that, if a class/cpuset has an exclusive flag
set, then those cpus can be used only by this cpuset and its parent,
and no other cpusets in the system.

I did get one thing wrong: I did not realize that you do not allow
setting the exclusive flag on a cpuset if any of its siblings has
any of this cpuset's cpus. (Maybe I still didn't get it right...)

But that doesn't change what I wrote in my earlier mail,
because all these details are controller specific and I do not see
any limitation from CKRM's point of view in this context.

>
> The CKRM documentation and the names of key flags and such are
> not intuitive to those doing numa work. If one comes at CKRM
> from the perspective of someone trying to solve a numa placement
> problem, the interfaces, documentation and naming really don't
> make sense. Even if your architecture is more general and
> powerful, I suspect your presentation is not widely accessible
> outside those with a workload focus. Or perhaps I'm just more
> dimwitted than most. It's difficult for me to know which.
> But certainly both Matthew and I have struggled to make sense
> of CKRM from a numa perspective.

I agree. The filenames are not intuitive for cpuset purposes.

>
> You state you'd have a 128 CPU limitation. I don't know why
> that would be, but it would be a critical limitation for SGI --
> no small problem.

I understand it is critical for SGI. I said it is a small problem
because it can be worked out easily.

>
> As explained below, with your proposal, one could not readily do
> both workload management and numa placement at the same time,
> because the task class hierarchy needed for the two is not
> the same.
>
> As noted above, while there seems to be a decent chance that
> cpusets will provide some benefit to scheduler domains, allowing
> the option of organizing sched domains along actual job usage
> lines instead of artificial architecture lines, I have seen
> no suggestion that CKRM task classes have that potential to
> improve sched domains.
>
> Elsewhere I recall you've had to impose fairly modest bounds
> on the depth of your class hierarchy, because your resource
> balancing rules are expensive to evaluate across deep, large
> trees. The cpuset hierarchy has no such restraint.

We put the limitation in the architecture because of the controllers.
We can open it up to allow a deeper hierarchy and let the controllers
decide how deep a hierarchy they can support.

>
> Your task class hierarchy, if hijacked for numa placement,

I wasn't suggesting that the cpuset controller hijack CKRM's task
hierarchy; I was suggesting that it play within it.

Controllers don't hijack the hierarchy. The hierarchy is only for classes;
controllers have control over only their portion of a class.

> might provide the kernel managed naming, structure and
> access control of dynamic (soft) numa partitions that cpusets
> does. I haven't looked closely at the permission model of
> CKRM to see if it matches the needs of cpusets, so I can't
> speak to that detail.

Are you talking about allowing users to manage their own classes/cpusets?
If so, we do have that.

>
> In sum, your cpuset/memset CKRM proposal provides few, if any,
> of the additional benefits to numa placement work that cpusets
> provides over the existing affinity and numa system calls.
>
>
> 3. What are the additional costs of your proposal over cpusets?
> ===============================================================
>
> Your proposal, while it seems to offer little advantage for
> numa placement to what we already have without cpusets, comes
> at a substantially greater cost than cpusets.
>
> The CKRM patch is five times the size of the cpuset patch,
> with diffstat on the ckrm-e17.2610.patch showing:
>
> 65 files changed, 13020 insertions(+), 19 deletions(-)

ckrm-e17 has the whole stack (core, rcfs, taskclass, socketclass, delay
accounting, rbce, crbce, numtasks controller and listenaq controller).

But for your purposes, or our discussion, one would need only 3 of those
modules (core, rcfs and taskclass). I just compared it with the broken-up
patches we posted on lkml recently. The whole stack has 12227 insertions,
of which only 4554 correspond to the 3 modules listed.

>
> The CKRM runtime, from what I can tell on the lmbench slide
> from OLS 2004, costs several percent of available cycles.

The graph you see in the presentation is with the CPU controller, not
the CKRM core. We don't have to include the CPU controller to get cpusets
working as a controller.

>
> You propose to include the cpu/mem placement hierarchy in the
> task class hierarchy. This presents difficulties. Essentially,
> they are not the same hierarchies. A jobs placement is
> independent of its priority. Both high and low priority jobs
> may well require proper numa placement, and both high and low
> priority tasks may well run within the same cpuset.
>
> So if your task class hierarchy is hijacked for numa placement,
> it will not serve you well for workload management. On a system
> that required numa placement using something like cpusets, the
> five times larger size of the kernel patch required for CKRM

As explained above, it is not 5 times larger.

> would be entirely unjustified, as CKRM would only be usable
> for its cpuset-like capabilities.
>
> Much of what you have now in CKRM would be useless for cpuset
> work. As you observed in your proposal, you would need new
> cpuset related rules for the subset and exclusive properties.

CKRM doesn't need new rules; the subset and exclusive property handling
would be the job of the cpuset controller.
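
For what it's worth, here is a minimal sketch of what such
controller-side validation might look like, using the subset and
exclusive rules as I understand them from this thread (simplified types
of my own, not the actual cpuset code):

#include <stdbool.h>

struct cpuset {
        unsigned long  cpus_allowed;
        bool           cpu_exclusive;
        struct cpuset *parent;
        struct cpuset *children;        /* first child */
        struct cpuset *next;            /* next sibling */
};

/* Both rules are local: only the parent and its other children are
 * examined, never the whole hierarchy. */
static bool cpuset_valid(const struct cpuset *cs)
{
        const struct cpuset *sib;

        if (!cs->parent)
                return true;            /* the root covers everything */

        /* Rule 1: a child's cpus must be a subset of its parent's. */
        if (cs->cpus_allowed & ~cs->parent->cpus_allowed)
                return false;

        /* Rule 2: an exclusive cpuset may not overlap any sibling. */
        for (sib = cs->parent->children; sib; sib = sib->next) {
                if (sib == cs)
                        continue;
                if ((cs->cpu_exclusive || sib->cpu_exclusive) &&
                    (cs->cpus_allowed & sib->cpus_allowed))
                        return false;
        }
        return true;
}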

>
> The cpuset scheduler hook is none - it only needs the
> existing cpus_allowed check that Ingo already added, years ago.
> You propose having the scheduler check the appropriate cpu mask
> in the task class, which would definitely increase the cache
> footprint size of the scheduler.

Agreed, one more level of indirection (instead of task->cpuset->cpus_allowed
it will be task->taskclass->res[CPUSET]->cpus_allowed).

>
> The papers for CKRM speak of providing policy driven
> classification and differentiated service. The focus is on
> managing resource sharing, to allow different classes of tasks
> to get controlled allocations of proportions of shared resources.
>
> Cpusets is not about sharing proportions of a common resource,
> but rather about dedicating entire resources. Granted,
> mathematically, there might be a mapping between these two.
> But it is certainly an impediment to those having to understand
> something, if it is implemented by abusing something quite
> larger and quite foreign in intention.
>
> This flows through to the names of the specific files in the
> directory representing a cpuset or class. The names for CKRM
> class directories are necessarily rather generic and abstract,
> whereas those for cpusets directly represent the particular
> need of placing tasks on cpus and memory nodes. For someone
> doing numa placement, the latter are much easier to understand.
>
> And as noted above, since you can't do both at the same time
> (both use the CKRM infrastructure for its traditional workload
> management and use it for numa placement) it's not like the
> administrator of such a system gains any from the more abstract
> names, if they are just using it for cpusets (numa placement).
>
> There is no synergy in the kernel hooks required in the scheduler
> and memory allocator. The hooks required by cpusets check
> bitmasks in order to allow or prohibit scheduling a task on
> a CPU, or allocating a page from a particular node to a task.
> These are quite distinct from the hooks required by CKRM when
> used as a fair share scheduler and workload manager, which
> requires adding delays to tasks in order to obtain the desired
> proportion of resource usage between classes. Similarly, the
> CKRM memory allocator hooks manage the number of pages in use
> by each task class and/or the rate of page faults, while the
> cpuset memory allocator hooks manage which memory nodes are
> available to satisfy an allocation request.

I think this is where we go tangential. When you say CKRM, you refer to
the whole stack.

When we say CKRM, we mean only the framework (core, rcfs and taskclass or
socketclass). It is the framework that enables the user to define classes
and classify tasks or sockets.

All the other modules are optional and exchangeable.

CKRM has different configurable modules, each with its own defined purpose.
One doesn't have to include a module one doesn't need.

>
> The share usage hooks that monitor each resource, and its usage
> by each class, are useless for cpusets, which has no dependency
> on resource usage. In cpusets, a task can use as much of its
> allowed CPUs and Memory Nodes, without throttling. There is
> no feedback loop based on rates of resource usage per class.
>
> Most of the hooks required by the CKRM classification engine to
> check for possible changes in a task's class, such as in fork,
> exec, setuid, listen, and other points where a kernel object
> might change are not needed for cpusets. The cpuset patch only
> requires such state change hooks in fork, exit and allocation,
> and only needs to increment or decrement a usage count in
> the fork and exit, and check a generation number in allocation.
>
> Cpusets has no use for a kernel classification engine. Outside
> of the trivial, automatic propagation of cpusets in fork and
> exit, the only changes in cpusets are mandated from user space.
>
> Nor do cpusets have any need for the kernel to support externally
> defined policy rules. Cpusets has no use for the classification
> engine's callback mechanism. In cpusets, no events that might
> affect state, such as fork, exit, reclassifications, changes in
> uid, or resource rate usage samples, need to be reported to any
> state agent, and there is no state agent, nor any communication
> channel thereto.
>
> Cpusets has no use for a facility that lets server tasks tell
> some external classifier what phase they are operating in.
> Cpusets has no need for some workload manager to be sampling
> resource consumption and task state to determine resource
> consumption. Cpusets has no need to track, in user space or
> kernel, the state of tasks after they exit. Cpusets has no use
> for delays nor for tracking them in the task struct.
>
> Cpusets has no need for the hooks at the entry to, and exit from,
> memory allocation routines to distinguish delays due to memory
> allocation from those due to application i/o. Cpusets has no
> need for sampling task state at fixed intervals, and our big
> iron scientific customers would without a doubt not tolerate a
> scan of the entire set of tasks every second for such resource
> and task state data collection. Such a scan does _not_ scale
> well on big honkin numa boxes. Whereas CKRM requires something
> like relayfs to pass back to user space the constant stream of
> such data, cpusets has no such needs and no such data.
>
> Certainly, none of the network hooks that CKRM requires to
> provide differentiated service across priority classes would be
> of any use in a system (ab)using CKRM to provide cpuset style
> numa placement.

With the explanations above, I think you would now agree that all
the above comments are invalidated. Basically you don't have to
bring them in if you don't need them.

>
> It is true that both cpusets and CKRM make good use of the Linux
> kernel's virtual file system (vfs). Cpusets uses vfs to model
> the hierarchy of 'soft partitions' in the system. CKRM uses vfs
> to model a resource priority hierarchy, essentially replacing a
> single 'task priority' with hierarchical resource allocations,
> managing what proportion, out of what is available, of fungible
> resources such as ticks, cycles, bytes or data transfers a
> given class of tasks is allowed to use in the aggregate.
>
> Just because two facilities use vfs is certainly not sufficient
> basis for deciding that they should be combined into one
> facility.
>
> The shares and stats control files in each task_class
> directory are not needed by cpusets, but new control files,
> for cpus_allowed and mems_allowed are needed. That, or the
> existing names have to be overloaded, at the cost of obfuscating
> the interface.

The shares file can accommodate these. But for bigger configurations we
would have to use some file-based interface.

>
> The kernel hooks for cpusets are fewer, simpler and more specific
> than those for CKRM. Our high performance customers would want
> the cpuset hooks compiled in, not the more generic ones for
> CKRM (which they could not easily use for any other workload
> management purpose anyway, if the task class hierarchy were
> hijacked for the needs of cpusets, as noted above).
>
> The development costs of cpusets so far, which are perhaps the
> best predictor we have of future costs, have been substantially
> lower than they have been for CKRM.

I think you have to compare the development cost of a resource
controller providing cpuset functionality, not that of CKRM itself.
>
> In sum, your proposal costs a lot more than cpusets, by a variety
> of metrics.
>
> =================================================
>
> In summary, I find that your cpuset/memset CKRM proposal provides
> little or no benefit past the simpler cpu and memory placement
> calls already available, while costing substantially more in
> a variety of ways than my cpuset proposal, when evaluated for
> its usefulness for numa placement.
>
> (Of course, if evaluated for suitability for workload management,
> the table is turned, and your CKRM patch provides essential
> capability that my cpuset patch could never dream of doing.)
>
> Moreover, the additional workload management benefits that your
> CKRM facility provides, and that some of my customers might
> want to use in combination with numa placement, would probably
> become unavailable to them if we integrated cpusets and CKRM,
> because cpusets would have to hijack the task class hierarchy
> for its own nefarious purposes.
>
> Such an attempt to integrate cpusets and CKRM would be a major
> setback for cpusets, substantially increasing its costs and
> reducing its value, probably well past the point of it even being
> worth pursuing further, in the mainstream kernel. Adding all
> that foreign logic of cpusets to the CKRM patch probably
> wouldn't help CKRM much either. The CKRM patch is already one
> that requires a bright mind and some careful thought to master.

If one reads the design and then looks at the broken down patches,
it may not be hard.

> Adding cpuset numa placement logic, which is typically different
> in detail, would add a complexity burden to the CKRM code that
> would serve no one well.
>
>
> > Note that I am not pitching for a marriage
>
> We agree.
>
> I just took more words to say it ').

The reasons we each give are very different, though. I meant that it
won't be a happy, productive marriage.

But I infer that you are suggesting that the two species themselves are
different, which I do not agree with.

chandra
PS to everyone else: Wow, you have lot of patience :)
>
>
>
> --
> I won't rest till it's the best ...
> Programmer, Linux Scalability
> Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
>

2005-02-12 06:16:54

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

I agree with 97% of what you write, Chandra.


> one more level of indirection(instead of task->cpuset->cpus_allowed
> it will be task->taskclass->res[CPUSET]->cpus_allowed).

No -- two more levels of indirection (task->cpus_allowed becomes
task->taskclass->res[CPUSET]->cpus_allowed).


> But, for your purposes or our discussions one would need only 3 modules
> of the above (core, rcfs and taskclass).

Ok. That was not obvious to me until now. If there is a section in
your documentation that explains this, and addresses the needs and
motivations of someone trying to reuse portions of CKRM in such a
manner, I missed it. Whatever ...

In any case, on the issue that matters to me right now, we agree:

> It won't be a happy, productive marriage.

Good. Thanks. Good luck to you.

> PS to everyone else: Wow, you have lot of patience :)

For sure.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401