This is essentially the same as the patch set that I posted last week,
with the following fixes/changes:
- CONFIG_CONTAINERS is no longer a user-selectable option - subsystems
such as cpusets that require it should select it in Kconfig.
- Each container subsystem type now has a name, and a <name>_enabled
file in the top container directory. This file contains 0 or 1 to
indicate whether the container subsystem is enabled, and can only be
modified when there are no subcontainers; disabled container subsystems
don't get new instances created when a subcontainer is created; the
subsystem-specific state is simply inherited from the parent
container.
- include a config option to default subsystems to enabled, for
backwards compatibility (see the Kconfig sketch after this list)
- Documentation tweaks
- builds properly without CONFIG_CONTAINER_CPUACCT configured on
- should build properly with newer gccs. (I've not actually had a
chance to try building it with anything newer than gcc 3.2.2, but I've
fixed all the potential warnings/errors that PaulJ pointed out when
compiling with some unspecified newer gcc).
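Roughly, the Kconfig side of the first and third points looks like
this (only CONTAINERS and CPUSETS are real option names here; the
default-to-enabled option name is made up purely for illustration):

    config CONTAINERS
            bool
            # not user-selectable; clients pull it in via "select"

    config CPUSETS
            bool "Cpuset support"
            select CONTAINERS

    config CONTAINERS_DEFAULT_ENABLED
            bool "Enable container subsystems by default"
            depends on CONTAINERS
            default y
            # preserves existing behaviour for anyone who never
            # touches the <name>_enabled files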
I've also looked at converting ResGroups to be a client of the
container system. This isn't yet complete; my thoughts so far include:
- each resource controller can be implemented as an independent
container subsystem; rather than a single "shares" and "stats" file
in each directory there will be e.g. "numtasks_shares",
"cpurc_shares", etc. (see the layout sketch after this list)
- the ResGroups core will basically end up as a library that provides
the common parsing/displaying for the shares and stats file for each
controller, and the logic for propagating resources up and down the
parent/child tree.
- for some of the resource controllers we will probably require a few
extra callbacks from the container system, e.g. at fork/exit time.
I might make these a config option that the controller must "select"
in Kconfig, to avoid extra locking/overhead for subsystems such as
cpusets that don't require such callbacks.
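To make the per-controller file naming concrete, a container directory
might end up looking something like this (file names beyond
numtasks_shares/cpurc_shares are illustrative, and this assumes the
hierarchy is mounted at /mnt/container):

    /mnt/container/
        cpuset_enabled
        numtasks_enabled
        numtasks_shares
        numtasks_stats
        cpurc_enabled
        cpurc_shares
        cpurc_stats
        web/
            numtasks_shares
            numtasks_stats
            ...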
-------------------------------------
There have recently been various proposals floating around for
resource management/accounting subsystems in the kernel, including
Res Groups, User BeanCounters and others. These all need the basic
abstraction of being able to group together multiple processes in an
aggregate, in order to track/limit the resources permitted to those
processes, and all implement this grouping in different ways.
Already existing in the kernel is the cpuset subsystem; this has a
process grouping mechanism that is mature, tested, and well documented
(particularly with regards to synchronization rules).
This patchset extracts the process grouping code from cpusets into a
generic container system, and makes the cpusets code a client of
the container system.
It also provides a very simple additional container subsystem to do
per-container CPU usage accounting; this is primarily to demonstrate
use of the container subsystem API, but is useful in its own right.
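Conceptually, a client subsystem registers a name plus a small set of
callbacks with the container core; the sketch below only illustrates
that shape (the structure and callback names here are assumptions, not
the actual interface in these patches):

    /* Illustrative only -- not the real API from this patch set. */
    struct container;           /* opaque, provided by the container core */
    struct task_struct;

    struct example_container_subsys {
            const char *name;   /* e.g. "cpuset"; gives the <name>_enabled file */
            /* allocate/free per-container subsystem state */
            int  (*create)(struct container *cont);
            void (*destroy)(struct container *cont);
            /* called when a task is moved into a container */
            void (*attach)(struct container *cont, struct task_struct *tsk);
            /* optional hooks for controllers that need them (cf. the
             * fork/exit callbacks mentioned earlier) */
            void (*fork)(struct task_struct *tsk);
            void (*exit)(struct task_struct *tsk);
    };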
The change is implemented in four stages:
1) extract the process grouping code from cpusets into a standalone system
2) remove the process grouping code from cpusets and hook into the
container system
3) convert the container system to present a generic API, and make
cpusets a client of that API
4) add a simple CPU accounting container subsystem as an example
The intention is that the various resource management efforts can also
become container clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test out e.g. the ResGroups CPU controller in
conjunction with the UBC memory controller
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
Possible TODOs include:
- define a convention for populating the per-container directories so
that different subsystems don't clash with one another
- provide higher-level primitives (e.g. an easy interface to seq_file)
for files registered by subsystems.
- support subsystem deregistering
Signed-off-by: Paul Menage <[email protected]>
---
Hi Paul,
Thanks for doing the exercise of removing the container part of cpuset
to provide some process aggregation.
With this model, I think I agree with you that RG can be split into
individual controllers (need to look at it closely).
I have a few questions/concerns w.r.t. this implementation:
- Since we are re-implementing anyway, why not use configfs instead of
having our own filesystem?
- I am a little nervous about notify_on_release, as RG would want
classes/RGs to be available even when there are no tasks or sub-
classes. (Documentation says that the user-level program can rmdir
the container, which would be a problem). Can the user-level program
be _not_ called when there are other subsystems registered? Also,
shouldn't it be cpuset-specific, instead of global?
- Export of the locks: These locks protect container data structures.
But most of the usages in cpuset.c are to protect the cpuset data
structure itself. Shouldn't the cpuset subsystem have its own locks?
IMO, these locks should be used by a subsystem only when it wants data
integrity in the container data structure itself (like walking through
the sibling list).
- Tight coupling of subsystems: I like your idea (you mentioned it in a
reply to the previous thread) of having an array of containers in the
task structure rather than the current implementation.
regards,
chandra
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- [email protected] | .......you may get it.
----------------------------------------------------------------------
On 10/3/06, Chandra Seetharaman <[email protected]> wrote:
>
> Hi Paul,
>
> Thanks for doing the exercise of removing the container part of cpuset
> to provide some process aggregation.
>
> With this model, I think I agree with you that RG can be split into
> individual controllers (need to look at it closely).
>
> I have few questions/concerns w.r.t this implementation:
>
> - Since we are re-implementing anyways, why not use configfs instead of
> having our own filesystem ?
The filesystem was lifted straight from cpuset.c, and hence isn't a
reimplementation; it's a migration of code already in the tree. Wasn't
there also a problem with the maximum output size of a configfs file,
which would cause problems when e.g. listing the task members in a
container?
> - I am little nervous about notify_on_release, as RG would want
> classes/RGs to be available even when there are no tasks or sub-
> classes. (Documentation says that the user level program can rmdir
> the container, which would be a problem). Can the user level program
> be _not_ called when there are other subsystems registered ? Also,
> shouldn't it be cpuset specific, instead of global ?
This again is taken straight from cpusets. The idea is that if you
don't have some kind of middleware polling the
container/cpuset/res_group directories to see if they're empty, you
can instead ask the kernel to call you back (via
"container_release_agent") at a point when a container is empty and
hence removable. I don't think there's any guarantee that the
container will still be empty by the time the userspace agent runs.
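As a hypothetical flow, assuming the cpuset-style interface carries
over (the agent path and directory names here are illustrative):

    # opt in to release notification for this container
    echo 1 > /mnt/container/batch/notify_on_release

    # ... later, when the last task leaves and the last child container
    # is removed, the kernel calls back via the release agent
    # ("container_release_agent") with the container's path, e.g.:
    #     /sbin/container_release_agent /batch
    # The agent can then rmdir the (currently) empty container, or not.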
> - Export of the locks: These locks protect container data structures.
> But, most of the usages in cpuset.c are to protect the cpuset data
> structure itself. Shouldn't the cpuset subsystem have its own locks ?
> IMO, these locks should be used by subsystem only when they want data
> integrity in the container data structure itself (like walking thru
> the sibling list).
It would certainly be possible to have finer-grained locking. But the
cpuset code seems pretty happy with coarse-grained locking (only one
writer at any one time) and having just the two global locks does make
the whole synchronization an awful lot simpler. There's nothing to
stop you having additional analogues of the callback_mutex to protect
specific data in a particular resource controller's private data.
My inclination would be to find a situation where generic fine-grained
locking is really required before forcing it on all container
subsystems. The locking model in RG is certainly finer-grained than in
cpusets, but don't a lot of the operations end up taking the
root_group->group_lock anyway as their first action?
> - Tight coupling of subsystems: I like your idea (you mentioned in a
> reply to the previous thread) of having an array of containers in task
> structure than the current implementation.
Can you suggest some scenarios that require this?
Paul
Paul M wrote:
> The filesystem was lifted straight from cpuset.c,
The primary author of the cpuset file system code was Simon Derr.
I'd encourage you to include him on the cc list of future posts of
this patch set.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On Tue, 2006-10-03 at 19:34 -0700, Paul Menage wrote:
> On 10/3/06, Chandra Seetharaman <[email protected]> wrote:
> >
> > Hi Paul,
> >
> > Thanks for doing the exercise of removing the container part of cpuset
> > to provide some process aggregation.
> >
> > With this model, I think I agree with you that RG can be split into
> > individual controllers (need to look at it closely).
> >
> > I have few questions/concerns w.r.t this implementation:
> >
> > - Since we are re-implementing anyways, why not use configfs instead of
> > having our own filesystem ?
>
> The filesystem was lifted straight from cpuset.c, and hence isn't a
> reimplementation, it's a migration of code already in the tree. Wasn't
OK, I can't call it re-implementing :). But I guess you get the point.
This is an opportunity to remove the filesystem implementation and use
existing infrastructure, configfs. configfs didn't exist when cpusets
went in; otherwise they might have chosen to use it instead of writing
their own.
> there also a problem with the maximum output size of a configfs file,
> which would cause problems e.g. listing the task members in a
> container?
Yes, Joel is aware of it and is open to making that change:
http://marc.theaimsgroup.com/?l=ckrm-tech&m=115619222129067&w=2. Having
an in-tree user (this infrastructure + cpusets) of that feature will
increase the need for it.
>
> > - I am little nervous about notify_on_release, as RG would want
> > classes/RGs to be available even when there are no tasks or sub-
> > classes. (Documentation says that the user level program can rmdir
> > the container, which would be a problem). Can the user level program
> > be _not_ called when there are other subsystems registered ? Also,
> > shouldn't it be cpuset specific, instead of global ?
>
> This again is taken straight from cpusets. The idea is that if you
> don't have some kind of middleware polling the
> container/cpuset/res_group directories to see if they're empty, you
> can instead ask the kernel to call you back (via
> "container_release_agent") at a point when a container is empty and
I understand the purpose and usage.
> hence removable. I don't think there's any guarantee that the
> container will still be empty by the time the userspace agent runs.
My concern is that the container _will_ be considered empty if there is
no task attached to the container _and_ there is no sub-container.
CKRM/RG would want an empty container to continue to exist.
We could hack around it by artificially incrementing the counter, but
that would defeat the original purpose of this feature.
>
> > - Export of the locks: These locks protect container data structures.
> > But, most of the usages in cpuset.c are to protect the cpuset data
> > structure itself. Shouldn't the cpuset subsystem have its own locks ?
> > IMO, these locks should be used by subsystem only when they want data
> > integrity in the container data structure itself (like walking thru
> > the sibling list).
>
> It would certainly be possible to have finer-grained locking. But the
> cpuset code seems pretty happy with coarse-grained locking (only one
cpusets may be happy today. But they will not be happy when tens of
other container subsystems use the same locks to protect their own
data structures. Such coarse locking will certainly affect
scalability.
> writer at any one time) and having just the two global locks does make
> the whole synchronization an awful lot simpler. There's nothing to
No question about that. But recall the BKL and how much effort has gone
into breaking it up for scalability (I am not saying that these locks
are the same as that). Since we are starting afresh, why not start with
scalability in mind?
> stop you having additional analogues of the callback_mutex to protect
> specific data in a particular resource controller's private data.
>
> My inclination would be to find a situation where generic fine-grained
> locking is really required before forcing it on all container
My thinking was this: cpusets is the first user of this interface, and
any future container subsystem writers will certainly use it as an
example when writing their own subsystems. In effect, they will use the
container-global locks to protect their data structures, which is not
good in the long run.
> subsystems. The locking model in RG is certainly finer-grained than in
> cpusets, but don't a lot of the operations end up taking the
> root_group->group_lock anyway as their first action?
>
Only if they are going to depend on the core data structure being intact
(like list traversal).
> > - Tight coupling of subsystems: I like your idea (you mentioned in a
> > reply to the previous thread) of having an array of containers in task
> > structure than the current implementation.
>
> Can you suggest some scenarios that require this?
Consider a scenario where you have only the system-level cpuset but
multiple RGs. With this model you would be forced to create multiple
cpusets (with the same set of cpus) just to allow multiple RGs. Now
suppose you want to create a cpuset that is a subset of the top-level
cpuset: where in the hierarchy would you create it (at the top level,
or one level below)?
Extend this scenario to multiple subsystems and see how complicated the
interface becomes for the user.
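In other words, with a single hierarchy you end up with something like
this (a hypothetical layout), where every RG carries a redundant copy
of the same cpuset settings:

    /mnt/container/
        rg1/    <- RG shares A, cpuset: all cpus/mems (duplicated)
        rg2/    <- RG shares B, cpuset: all cpus/mems (duplicated)
        rg3/    <- RG shares C, cpuset: all cpus/mems (duplicated)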
If we go with that model, the notify_on_release issue (above) will
disappear too.
>
> Paul
>>It would certainly be possible to have finer-grained locking. But the
>>cpuset code seems pretty happy with coarse-grained locking (only one
>
>
> cpuset may be happy today. But, It will not be happy when there are tens
> of other container subsystems use the same locks to protect their own
> data structures. Using such coarse locking will certainly affect the
> scalability.
All of this (and the rest of the snipped email with suggested
improvements) makes pretty good sense. But would it not be better
to do this in stages?
1) Split the code out from cpusets
2) Move to configfs
3) Work on locking scalability, etc ...
Else it'd seem that we'll never get anywhere, and it'll all be
impossible to review anyway. Incremental improvement would seem to
be a much easier way to fix this stuff, to me.
M.
On Wed, 2006-10-04 at 12:36 -0700, Martin Bligh wrote:
I agree with you, Martin.
> >>It would certainly be possible to have finer-grained locking. But the
> >>cpuset code seems pretty happy with coarse-grained locking (only one
> >
> >
> > cpuset may be happy today. But, It will not be happy when there are tens
> > of other container subsystems use the same locks to protect their own
> > data structures. Using such coarse locking will certainly affect the
> > scalability.
>
> All of this (and the rest of the snipped email with suggested
> improvements) makes pretty good sense. But would it not be better
> to do this in stages?
>
> 1) Split the code out from cpusets
Paul (Menage) is already working on this.
We will work out the rest.
> 2) Move to configfs
> 3) Work on locking scalability, etc ...
>
> Else it'd seem that we'll never get anywhere, and it'll all be
> impossible to review anyway. Incremental improvement would seem to
> be a much easier way to fix this stuff, to me.
>
> M.
On 10/4/06, Chandra Seetharaman <[email protected]> wrote:
> > The filesystem was lifted straight from cpuset.c, and hence isn't a
> > reimplementation, it's a migration of code already in the tree. Wasn't
>
> Ok. I can't call it re-implementing :). But, I guess you get the point.
> This is an oppurtunity to remove the filesystem implementation and use
> existing infrastructure, configfs. configfs didn't exist when cpuset
> went in, otherwise they might have chosen to use it instead of writing
> their own.
I guess I'm mostly agnostic about this issue. But looking at the
configfs interfacing code in rgcs.c versus the filesystem detail code
in container.c, it's about 200 lines vs 300 lines, so not exactly a
huge complexity saving.
> Yes, Joel is aware of it and is open to make that change.
> http://marc.theaimsgroup.com/?l=ckrm-tech&m=115619222129067&w=2. Having
> a in-tree user (this infrastructure + cpuset) for that feature will
> increase the need for it.
Great - when that happens I'd be much happier to move over to configfs.
>
> My concern is that the container _will_ be considered empty if there is
> no task attached with the container _and_ there is no sub-container.
Right, that's the definition of an empty container.
>
> CKRM/RG would want a empty container to exist.
Even if the user wants to be told when it's OK to clean up their RG containers?
I don't see why this is a problem - all this does is allow the user
(*if* they want to) to get a callback when there's no longer anything
alive in the container. It doesn't cause the container to be removed
without additional explicit action from userspace.
I don't see it as a feature that I'd make much use of myself, but it
preserves the existing cpusets API.
>
> cpuset may be happy today. But, It will not be happy when there are tens
> of other container subsystems use the same locks to protect their own
> data structures. Using such coarse locking will certainly affect the
> scalability.
The locks exported by the container system simply guarantee:
- if you hold container_manage_lock(), then no-one else will make any
changes to the core container groupings, and you won't block readers
- if you hold container_lock(), then no changes will be made to the
container groupings.
While there's nothing to stop a container subsystem from also using
them to protect its own data, there's equally nothing to stop a
container subsystem from using its own finer-grained locking, if it
feels that it needs to. E.g. the simple cpu_acct subsystem that I
posted as an example has a per-container spinlock that it uses to
protect the cpu stats for that container. So the callback doesn't need
to take either container_lock() or container_manage_lock(), it just
has to ensure it's in an rcu_read_lock() section and takes its own
spinlock.
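A minimal sketch of that pattern (the field and helper names are
assumptions for illustration, not the code from the posted cpu_acct
patch; the usual kernel headers are assumed):

    struct cpuacct {
            spinlock_t lock;   /* guards this container's stats only */
            u64 used;          /* accumulated CPU time */
    };

    static void cpuacct_charge(struct task_struct *tsk, u64 delta)
    {
            struct cpuacct *ca;

            rcu_read_lock();             /* keeps the container alive */
            ca = task_cpuacct(tsk);      /* hypothetical per-task lookup */
            spin_lock(&ca->lock);        /* no container_lock() needed */
            ca->used += delta;
            spin_unlock(&ca->lock);
            rcu_read_unlock();
    }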
> No questions about that. But, do recall BKL and how much effort has gone
> in to break it to add scalability ( I am not saying that these locks are
> same as that). When we are starting afresh, why not start with
> scalability in mind.
Looking at the group_lock in RG (including memrc and cpurc), the only
places that it appears to get taken are when creating and destroying
groups, or when changing shares. These are things that are going to be
done by the middleware - how many concurrent middleware systems are
you expecting to need to scale to? :-)
Incidentally, I have a few questions about the locking in the example
numtasks controller:
- recalc_and_propagate() does:

    lock rgroup->group_lock
    for each child:
        lock child->group_lock
        recalc_and_propagate(child)
So the first thing that the recursive call does is to take the lock
that the previous level of recursion has already taken. How's that
supposed to work?
E.g.

    cd /mnt/configfs/res_groups
    mkdir foo
    echo -n res=numtasks,min_shares=10,max_shares=10 > shares

causes a lockup.
- in dec_usage_count() when a task exits, am I right that the only
locking done is on the group from which the task exited? The calls can
cascade up to the parent if cnt_borrowed is non-zero. In that case, in
the situation where the parent and two child groups each have a
cnt_borrowed of 1, what's to stop the dec_usage_count() calls racing
with one another in the parent, hence leaving parent->cnt_borrowed at
-1, and both calling dec_usage_count() on the grandparent?
I realise that these issues can be fixed, but they do serve to
illustrate the pitfalls (or at least, reduced clarity) associated with
fine-grained locking ...
>
> My thinking was like this: cpuset was the first user of this interface,
> any future container subsystem writers will certainly use cpuset as an
> example to write their subsystems. In effect, use the container-global
> locks to protect their data structures, which is not good in the long
> run.
Fair point. I'll see about reducing some of the coarse-grained locking
in cpuset.c. But a lot of the time, the cpuset code needs to be able
to prevent both changes to current->cpuset, and changes to its
reference - so it has to take container_lock() anyway, and might as
well use that as the only lock, rather than having to take two locks.
One optimization would be to have callback_mutex be an rwsem, so we
could export container_lock()/container_unlock() as well as
container_read_lock() and container_read_unlock(). (PaulJ, any
thoughts on that?)
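Something like this, perhaps (just a sketch of the idea, not existing
code; assumes <linux/rwsem.h>):

    static DECLARE_RWSEM(callback_sem);

    void container_lock(void)        { down_write(&callback_sem); }
    void container_unlock(void)      { up_write(&callback_sem); }
    void container_read_lock(void)   { down_read(&callback_sem); }
    void container_read_unlock(void) { up_read(&callback_sem); }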
> > > - Tight coupling of subsystems: I like your idea (you mentioned in a
> > > reply to the previous thread) of having an array of containers in task
> > > structure than the current implementation.
> >
> > Can you suggest some scenarios that require this?
>
> Consider a scenario where you have only the system level cpuset and have
> multiple RGs. With this model you would be forced to create multiple
> cpusets (with the same set of cpus) so as to allow multiple RG's.
No - see the intro to this patch set. I added a "<subsys>_enabled"
file for each subsystem, managed by the container.c code; if this is
set to 0, then child containers inherit the same per-subsystem state
for that subsystem as the root container.
So if you do echo 0 > /mnt/container/cpuset_enabled, and create child
containers, you still only have one cpuset.
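E.g. (hypothetical session; the cpuset_enabled file and /mnt/container
path are as described above, the directory names are made up):

    echo 0 > /mnt/container/cpuset_enabled  # only while no subcontainers exist
    mkdir /mnt/container/batch              # no new cpuset created; 'batch'
                                            # uses the root cpuset's state
    mkdir /mnt/container/batch/low          # likewise shares the root cpuset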
> Now,
> consider you want to create a cpuset that is a subset of the high level
> cpuset, where in the hierarchy you would create this cpuset (at top
> level or one level below) ?
For a distinct set of tasks, or for the same set of tasks? If the
former, then it can be a new hierarchy under the top level, no
problem. If it's for the same set of tasks, then you'd need to give
more details about how/why you're trying to split up resources. I
would imagine that in almost all situations like this there would be a
natural way to split things up hierarchically. E.g. you want to pin
each customer's tasks to a particular set of nodes (so use cpuset
parameters at the top level), and within those you want to let the
customer have different groups of tasks with e.g. different memory
limits (so use RG mem controller parameters at the next level).
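For instance (a purely hypothetical layout):

    /mnt/container/
        customer_a/      <- cpuset parameters: cpus 0-3, mems 0-1
            web/         <- RG memory controller: higher limit
            batch/       <- RG memory controller: lower limit
        customer_b/      <- cpuset parameters: cpus 4-7, mems 2-3
            ...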
Paul
On 10/4/06, Chandra Seetharaman <[email protected]> wrote:
> > All of this (and the rest of the snipped email with suggested
> > improvements) makes pretty good sense. But would it not be better
> > to do this in stages?
> >
> > 1) Split the code out from cpusets
>
> Paul (Menage) is already work on this.
The split-out is done - right now I'm tidying up an example of RG
moved over to the container API - basically dumping the group membership
code (since container.c supplies that) and migrating the shares/stats
file model to containerfs rather than configfs. I'll try to send it
out today.
Paul
On 10/4/06, Paul Menage <[email protected]> wrote:
>
> > > > - Tight coupling of subsystems: I like your idea (you mentioned in a
> > > > reply to the previous thread) of having an array of containers in task
> > > > structure than the current implementation.
> > >
...
BTW, that's not to say that having parallel hierarchies of containers
is necessarily a bad thing - I can imagine just mounting multiple
instances of containerfs, each managing one of the container pointers
in task_struct - but I think that could be added on afterwards. Even
if we did have the parallel support, we'd still need to support
multiple subsystems/controllers on the same hierarchy, since I think
that's going to be the much more common case.
Paul