LinuxLists.cc - [PATCH 0/7] containers (V7): Generic Process Containers

2007-02-12 08:52:54

Subject: [PATCH 0/7] containers (V7): Generic Process Containers

--

This is an update to my multi-hierarchy multi-subsystem generic
process containers patch. Changes since V6 (22nd December) include:

- updated to 2.6.20

- added more details about multiple hierarchy support in the
documentation

- reduced the per-task memory overhead to one pointer (previously it
was one pointer for each hierarchy). Now each task has
a pointer to a container_group, which holds the pointers to the
containers (one per active hierarchy) that the task is attached to
and the associated per-subsystem state (one per active subsystem).
This container group is shared (with reference counts) between all
tasks that have the same set of container mappings.

- added API support for binding/unbinding subsystems to/from active
hierarchies, by remounting with -oremount,<new-subsys-list>. Currently
this fails with EBUSY if the hierarchy has a child containers; full
implementation support is left to a later patch.

- added a bind() subsystem callback to indicate when a subsystem is
moved between hierarchies

- added container_clone(subsys, task), which creates a child container
for the hierarchy that the specified subsystem is bound to, and
moves the given task into that container. An example use of this
would be in sys_unshare, which could, if the namespace container
subsystem is active, create a child container when the new namespace
is created.

- temporarily removed the "release agent" support. It's only currently
used by CPUsets, and intrudes somewhat on the per-container
reference counting. If necessary it can be re-added, either as a
generic subsystem feature or a CPUset-specific feature, via a kernel
thread that periodically polls containers that have been designated
as notify_on_release to see if they are releasable

Generic Process Containers
--------------------------

There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
containers, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.

Already existing in the kernel is the cpuset subsystem; this has a
process grouping mechanism that is mature, tested, and well documented
(particularly with regards to synchronization rules).

This patchset extracts the process grouping code from cpusets into a
generic container system, and makes the cpusets code a client of
the container system.

It also provides several example clients of the container system,
including ResGroups, BeanCounters and namespace proxy.

The change is implemented in three stages, plus four example
subsystems that aren't necessarily intended to be merged as part of
this patch set, but demonstrate the applicability of the framework.

1) extract the process grouping code from cpusets into a standalone system

2) remove the process grouping code from cpusets and hook into the
container system

3) convert the container system to present a generic multi-hierarchy
API, and make cpusets a client of that API

4) example of a simple CPU accounting container subsystem

5) example of implementing ResGroups and its numtasks controller over
generic containers

6) example of implementing BeanCounters and its numfiles counter over
generic containers

7) example of integrating the namespace isolation code (sys_unshare()
or various clone flags) with generic containers, allowing virtual
servers to take advantage of other resource control efforts.

The intention is that the various resource management and
virtualization efforts can also become container clients, with the
result that:

- the userspace APIs are (somewhat) normalised

- it's easier to test out e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.

- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel

Signed-off-by: Paul Menage <[email protected]>

2007-02-12 09:20:12

by Paul Jackson

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

> - temporarily removed the "release agent" support.

ouch

> ... it can be re-added ... via a kernel thread that periodically polls containers ...

double ouch.

You'll have a rough time selling me on the idea that some kernel thread
should be waking up every few seconds, grabbing system-wide locks, on a
big honkin NUMA box, for the few times per hour, or less, that a cpuset is
abandoned.

Offhand, that sounds mildly insane to me.

And how would this get the edge-triggered, rather than level-triggered,
release? In other words, if a new cpuset is created, and marked with
the notify_on_release flag, but otherwise not yet used (no child
cpusets and no tasks in it) then it is not to be released (removed.)
Only children and/or tasks are added, then later removed, is it a
candidate for release. I guess you'll need yet another state bit, set
when the cpuset is abandoned (last child removed or last pid
exits/leaves), and cleared when this kernel thread visits the cpuset to
see if it should be removed.

Can you explain to me how this intruded on the reference counting?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2007-02-12 09:33:09

by Paul Menage

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

On 2/12/07, Paul Jackson <[email protected]> wrote:
>
> You'll have a rough time selling me on the idea that some kernel thread
> should be waking up every few seconds, grabbing system-wide locks, on a
> big honkin NUMA box, for the few times per hour, or less, that a cpuset is
> abandoned.

I think it could be made smarter than that, e.g. have a workqueue task
that's only woken when a refcount does actually reach zero. (I think
that waking a workqueue task is something that can be done without too
much worry about locks)

>
> Can you explain to me how this intruded on the reference counting?
>

Essentially, it means that anything that releases a reference count on
a container needs to be able to trigger a call to the release agent.
The reference count is often released at a point when important locks
are held, so you end up having to pass buffers into any function that
might drop a ref count, in order to store a path to a release agent to
be invoked.

In particular, the new container_clone() function can be called during
the task fork path; at which point forking a new release_agent process
would be impossible, or at least nasty. Additionally, if containers
are potentially going to be used for virtual servers, having the
release agent run from a top-level process rather than the process
context that released the refcount sounds like a sane option.

Paul

2007-02-12 09:54:45

by Paul Jackson

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

Paul M, responding to Paul J:
> I think it could be made smarter than that, e.g. have a workqueue task
> that's only woken when a refcount does actually reach zero. (I think
> that waking a workqueue task is something that can be done without too
> much worry about locks)
>
> >
> > Can you explain to me how this intruded on the reference counting?
> >
>
> Essentially, it means that anything that releases a reference count on
> a container needs to be able to trigger a call to the release agent.
> The reference count is often released at a point when important locks
> are held, so you end up having to pass buffers into any function that
> might drop a ref count, in order to store a path to a release agent to
> be invoked.

Ok - now that you put it like that - it's much more persuasive.

Consider me sold on this aspect of your proposal, until and unless I
protest otherwise, which is not likely.

Thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2007-02-12 22:47:30

by Serge E. Hallyn

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

Quoting Sam Vilain ([email protected]):
> [email protected] wrote:
> > Generic Process Containers
> > --------------------------
> >
> > There have recently been various proposals floating around for
> > resource management/accounting and other task grouping subsystems in
> > the kernel, including ResGroups, User BeanCounters, NSProxy
> > containers, and others. These all need the basic abstraction of being
> > able to group together multiple processes in an aggregate, in order to
> > track/limit the resources permitted to those processes, or control
> > other behaviour of the processes, and all implement this grouping in
> > different ways.
> >
>
> I know I'm a bit out of touch, but AIUI the NSProxy *is* the container.

The nsproxy is an implementation detail, actually, and might go away at
any time. But yes we had been talking for quite awhile about sets of
namespaces as containers, so that a vserver would be a container, as
would a lightweight checkpoint/restart job.

> We decided a long time ago that a container was basically just a set of
> namespaces, which includes all of the subsystems you mention.
>
> This would suggesting re-write this patchset, part 2 as a "CPUSet

Well it's an unfortunate conflict, but I don't see where we have any
standing to make Paul change his terminology :)

Lately I've been talking about "Paul's containers", our "namespaces",
and the "namespace container subsystem" in patch 7.

> namespace", part 4 as a "CPU scheduling namespace", parts 5 and 6 as
> "Resource Limits Namespace" (drop this "BeanCounter" brand), and of
> course part 7 falls away.

Part 7 does not fall away, it uses Paul's patchset to start to provide
a management interface for namespaces.

-serge

2007-02-12 23:08:13

by Sam Vilain

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

[email protected] wrote:
> Generic Process Containers
> --------------------------
>
> There have recently been various proposals floating around for
> resource management/accounting and other task grouping subsystems in
> the kernel, including ResGroups, User BeanCounters, NSProxy
> containers, and others. These all need the basic abstraction of being
> able to group together multiple processes in an aggregate, in order to
> track/limit the resources permitted to those processes, or control
> other behaviour of the processes, and all implement this grouping in
> different ways.
>

I know I'm a bit out of touch, but AIUI the NSProxy *is* the container.
We decided a long time ago that a container was basically just a set of
namespaces, which includes all of the subsystems you mention.

This would suggesting re-write this patchset, part 2 as a "CPUSet
namespace", part 4 as a "CPU scheduling namespace", parts 5 and 6 as
"Resource Limits Namespace" (drop this "BeanCounter" brand), and of
course part 7 falls away.

Sam.

2007-02-12 23:16:00

by Paul Menage

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

On 2/12/07, Sam Vilain <[email protected]> wrote:
>
> I know I'm a bit out of touch, but AIUI the NSProxy *is* the container.
> We decided a long time ago that a container was basically just a set of
> namespaces, which includes all of the subsystems you mention.

You may have done that, but the CKRM/ResGroups independently decided a
long time ago that the fundamental unit was the resource class, and
the OpenVZ folks decided that the fundamental unit was the
BeanCounter, and the CPUSet folks decided that the fundamental unit
was the CPUSet, etc ... :-)

But there's a lot of common ground between these different approaches,
and potential for synergy, so the point of this patch set is to
provide a unification point for all of them, and a stepping stone for
other new resource controllers and process control modules.

Paul

2007-02-12 23:18:27

by Paul Menage

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

On 2/12/07, Serge E. Hallyn <[email protected]> wrote:
>
> Well it's an unfortunate conflict, but I don't see where we have any
> standing to make Paul change his terminology :)

I have no huge problem with changing my terminology in the interest of
wider adoption. "Container" seems like an appropriate name for the
abstraction, and possibly more appropriate than for a virtual server,
but if it was decreed that the only thing stopping the container patch
being merged was that it should be called, say, "Process Sets", I
would happily s/container/pset/g across the entire patchset.

Paul

2007-02-13 00:30:15

by Sam Vilain

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

Paul Menage wrote:
>> I know I'm a bit out of touch, but AIUI the NSProxy *is* the container.
>> We decided a long time ago that a container was basically just a set of
>> namespaces, which includes all of the subsystems you mention.
>>
> You may have done that, but the CKRM/ResGroups independently decided a
> long time ago that the fundamental unit was the resource class, and
> the OpenVZ folks decided that the fundamental unit was the
> BeanCounter, and the CPUSet folks decided that the fundamental unit
> was the CPUSet, etc ... :-)
>

There was a consensus that emerged that resulted in the nsproxy being
included in the kernel. That is why I used "we". What was included
varies greatly from the patch I put forward, as I mentioned.

You missed the point of what I was trying to say. Let me try again.

Originally I too thought that in order to begin on the path of
virtualisation merging we would just make a simple
container/vserver/whatever structure and hang everything off that.

I'll now say the same thing that was said before of my patch, that I
don't think that adding the containers structure inside the kernel adds
anything interesting or useful. In fact, I'd go further to say that
very thing you think is a useful abstraction is locking yourself into
inflexibility. This is what Eric recognised early on and eventually
brought me around to the idea of too.

So, we have the current implementation - individual subsystems are
virtualised, and it is the subsystems that you can see and work with.
You can group together your virtualisation decisions and call that a
container, great.

Ask yourself this - what do you need the container structure for so
badly, that virtualising the individual resources does not provide for?
You don't need it to copy the namespaces of another process ("enter")
and you don't need it for checkpoint/migration. What does it mean to
make a new container? You're making a "nothing" namespace for some
as-yet-unspecified grouping of tasks. That's a great idea for a set of
tightly integrated userland utilities to simplify the presentation to
the admin, but I don't see why you need to enshrine this in the kernel.
Certainly not for any of the other patches in your set as far as I can see.

> But there's a lot of common ground between these different approaches,
> and potential for synergy, so the point of this patch set is to
> provide a unification point for all of them, and a stepping stone for
> other new resource controllers and process control modules.
>

That precisely echos my sentiment on my submissions of some 12 months ago.

Sam.

2007-02-13 00:42:37

by Paul Menage

[permalink] [raw]

Subject: Re: [ckrm-tech] [PATCH 0/7] containers (V7): Generic Process Containers

On 2/12/07, Sam Vilain <[email protected]> wrote:
> Ask yourself this - what do you need the container structure for so
> badly, that virtualising the individual resources does not provide for?

Primarily, that otherwise every module that wants to affect/monitor
behaviour of a group of associated processes has to implement its own
process grouping abstraction.

As an example, the CPU accounting patch that in included in my patch
set as an illustration of a simple resource monitoring module is just
250 lines, almost entirely in one file; if it also had to handle
associating tasks together into groups and presenting a filesystem
interface to the user it would be far larger and would have a much
bigger footprint on the kernel.

>From the point of view of the virtual server containers, the advantage
is that you're integrated with a standard filesystem interface for
determining group membership. It does become simpler to combine
virtual servers and resource controllers, although I grant you that
you could juggle that from userspace without the additional kernel
support.

Paul

2007-02-13 01:13:15

by Sam Vilain

[permalink] [raw]

Subject: Re: [ckrm-tech] [PATCH 0/7] containers (V7): Generic Process Containers

Paul Menage wrote:
>> Ask yourself this - what do you need the container structure for so
>> badly, that virtualising the individual resources does not provide for?
>>
> Primarily, that otherwise every module that wants to affect/monitor
> behaviour of a group of associated processes has to implement its own
> process grouping abstraction.
>

Not every module, you just make them on sensible, planned groupings.
The danger is that the "container" group becomes a fallback grouping for
things when people can't be bothered thinking about it properly, and
everything including the kitchen sink gets thrown in. Then later you
find a real use case where you don't want them together, but it's too
late because it's already a part of the official API.

> As an example, the CPU accounting patch that in included in my patch
> set as an illustration of a simple resource monitoring module is just
> 250 lines, almost entirely in one file; if it also had to handle
> associating tasks together into groups and presenting a filesystem
> interface to the user it would be far larger and would have a much
> bigger footprint on the kernel.
>

It's also less flexible. What if I want to do CPU accounting on some
other boundaries than the "virtual server" a process is a part of?

> From the point of view of the virtual server containers, the advantage
> is that you're integrated with a standard filesystem interface for
> determining group membership. It does become simpler to combine
> virtual servers and resource controllers, although I grant you that
> you could juggle that from userspace without the additional kernel
> support.
>

I'm not disagreeing it's a pragmatic shortcut that has been successful
for a number of projects including vserver which I use every day. But
it reduces "synergy" by excluding the people working with virtualisation
in ways that don't fit its model.

Yes, there should be a similarity in the way that you manage namespaces
and it should be easy to develop new namespaces without constantly
re-inventing the wheel. But why does that imply making binding
decisions about the nature of how you can virtualise? IMHO those
decisions should be made on a per-subsystem basis.

Sam.

2007-02-13 01:47:42

by Paul Menage

[permalink] [raw]

Subject: Re: [ckrm-tech] [PATCH 0/7] containers (V7): Generic Process Containers

On 2/12/07, Sam Vilain <[email protected]> wrote:
>
> Not every module, you just make them on sensible, planned groupings.
> The danger is that the "container" group becomes a fallback grouping for
> things when people can't be bothered thinking about it properly, and
> everything including the kitchen sink gets thrown in. Then later you
> find a real use case where you don't want them together, but it's too
> late because it's already a part of the official API.

This paragraph seems to have a contradiction - first you say that
modules should use "planned groupings", then you complain that modules
using generic containers will find that they can't separate from other
modules "because they're part of an official API".

On the contrary - a module that's written over generic containers can
be combined with other such modules in the same groupings, or used in
different groupings, depending on what the user desires. This is
discussed in some detail in Documentation/containers.txt in patch 3 of
my patchset.

My patchset doesn't try to make all modules use the same *grouping*,
it tries to make them use the same *abstraction* (and hence share
code), but allows them to easily use different groupings just by
binding modules to different hierarchies.

>
> > As an example, the CPU accounting patch that in included in my patch
> > set as an illustration of a simple resource monitoring module is just
> > 250 lines, almost entirely in one file; if it also had to handle
> > associating tasks together into groups and presenting a filesystem
> > interface to the user it would be far larger and would have a much
> > bigger footprint on the kernel.
> >
>
> It's also less flexible. What if I want to do CPU accounting on some
> other boundaries than the "virtual server" a process is a part of?

No problem - that's why there are multiple hierarchies available. You
can mount the CPU accounting on one hierarchy, the virtual servers on
a different hierarchy, and they have completely independent mappings
from processes to containers - essentially parallel trees of
containers.

Paul

2007-02-20 17:36:04

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

"Paul Menage" <[email protected]> writes:

> On 2/12/07, Sam Vilain <[email protected]> wrote:
>>
>> I know I'm a bit out of touch, but AIUI the NSProxy *is* the container.
>> We decided a long time ago that a container was basically just a set of
>> namespaces, which includes all of the subsystems you mention.
>
> You may have done that, but the CKRM/ResGroups independently decided a
> long time ago that the fundamental unit was the resource class, and
> the OpenVZ folks decided that the fundamental unit was the
> BeanCounter, and the CPUSet folks decided that the fundamental unit
> was the CPUSet, etc ... :-)

Using the container name is bad and it led to this stupid argument.

The fundamental unit of what we have merged into the kernel is the
namespace. The aggregate of all namespaces and everything is the
container.

Please, please pick a different name so people don't take one look
at your stuff and get confused like Sam did.

> But there's a lot of common ground between these different approaches,
> and potential for synergy, so the point of this patch set is to
> provide a unification point for all of them, and a stepping stone for
> other new resource controllers and process control modules.

For the case of namespaces I don't see how your code makes things
better. I do not see a real problem that you are solving.

I do agree that a common interface to the code and a common set of
infrastructure may help.

Personally until the pid namespace (which includes within it process
groupings) is complete I'm not certain the problem will be concrete
enough to solve. Hopefully we can have a working version of that
shortly.

Eric

2007-02-20 17:55:31

by Paul Menage

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

On 2/20/07, Eric W. Biederman <[email protected]> wrote:
> "Paul Menage" <[email protected]> writes:
>
> > On 2/12/07, Sam Vilain <[email protected]> wrote:
> >>
> >> I know I'm a bit out of touch, but AIUI the NSProxy *is* the container.
> >> We decided a long time ago that a container was basically just a set of
> >> namespaces, which includes all of the subsystems you mention.
> >
> > You may have done that, but the CKRM/ResGroups independently decided a
> > long time ago that the fundamental unit was the resource class, and
> > the OpenVZ folks decided that the fundamental unit was the
> > BeanCounter, and the CPUSet folks decided that the fundamental unit
> > was the CPUSet, etc ... :-)
>
> Using the container name is bad and it led to this stupid argument.
>
> The fundamental unit of what we have merged into the kernel is the
> namespace. The aggregate of all namespaces and everything is the
> container.

What are you defining here as "everything"? If you mean "all things
that could be applied to a segregated group of processes such as a
virtual server", then "container" seems like a good name for my
patches, since it allows you to aggregate namespaces, resource
control, other virtualization, etc.

Sam said "the NSProxy *is* the container". You appear to be planning
to have some namespaces, possibly not aggregated within the nsproxy
(pid namespace?) but are you planning to have some higher-level
"container" object that aggregates the nsproxy and the other
namespaces? If so, what is it? Does it track process membership, etc?
What's the userspace API? In what fundamental ways would it be
different from my generic containers patches with Serge Hallyn's
namespace subsystem. (If these questions are answered already by
designs or code, I'd be happy to be pointed at them).

I guess what it comes down to, is why is an aggregation of namespaces
suitable for the name "container", when an aggregation of namespaces
and other resource controllers isn't?

What do you think might be a better name for the generic process
groups that I'm pushing? As I said, I'm happy to do a simple
search/replace on my code to give a different name if that turned out
to be the gating factor to getting it merged. But I'd be inclined to
leave that decision up to Andrew/Linus.

> > But there's a lot of common ground between these different approaches,
> > and potential for synergy, so the point of this patch set is to
> > provide a unification point for all of them, and a stepping stone for
> > other new resource controllers and process control modules.
>
> For the case of namespaces I don't see how your code makes things
> better. I do not see a real problem that you are solving.

I'm trying to solve the problem that lots of different folks
(including us) are trying to do things that aggregate multiple process
into some kind of constrained group, and are all trying to use
different and incompatible ways of grouping/tracking those processes.

I agree that namespaces fit slightly less well into this model than
some other subsystems like resource management. But by integrating
with it you'd get automatic access to all the various different
resource controller work that's being done.

Paul

2007-02-20 19:32:18

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

"Paul Menage" <[email protected]> writes:

> What are you defining here as "everything"? If you mean "all things
> that could be applied to a segregated group of processes such as a
> virtual server", then "container" seems like a good name for my
> patches, since it allows you to aggregate namespaces, resource
> control, other virtualization, etc.

The aggregation happens at the task struct when you apply different
pieces to a task we don't need a structure to allow aggregation.

> Sam said "the NSProxy *is* the container". You appear to be planning
> to have some namespaces, possibly not aggregated within the nsproxy
> (pid namespace?) but are you planning to have some higher-level
> "container" object that aggregates the nsproxy and the other
> namespaces?

No. A reverse mapping is not needed and is not interesting.
As long as I can walk all processes and ask what namespace are
you in I don't care.

> If so, what is it? Does it track process membership, etc?
> What's the userspace API? In what fundamental ways would it be
> different from my generic containers patches with Serge Hallyn's
> namespace subsystem. (If these questions are answered already by
> designs or code, I'd be happy to be pointed at them).

> I guess what it comes down to, is why is an aggregation of namespaces
> suitable for the name "container", when an aggregation of namespaces
> and other resource controllers isn't?

To me at least the interesting part of your work is not the aggregation
portion. But the infrastructure for building the different process
groups.

You seem to be calling your groups of processes lumped together for
one purpose or another a container.

> What do you think might be a better name for the generic process
> groups that I'm pushing? As I said, I'm happy to do a simple
> search/replace on my code to give a different name if that turned out
> to be the gating factor to getting it merged. But I'd be inclined to
> leave that decision up to Andrew/Linus.

As we have previously been using the term container is like
application. The end product that the user sees built out of
individual kernel provided components, but not something that exists
in the kernel as such.

We need a term for the non-aggregated case for the individual process
groups we build this out of in your infrastructure. Because you
clearly have more than one kind of process group a set of processes
can belong to.

>> > But there's a lot of common ground between these different approaches,
>> > and potential for synergy, so the point of this patch set is to
>> > provide a unification point for all of them, and a stepping stone for
>> > other new resource controllers and process control modules.
>>
>> For the case of namespaces I don't see how your code makes things
>> better. I do not see a real problem that you are solving.
>
> I'm trying to solve the problem that lots of different folks
> (including us) are trying to do things that aggregate multiple process
> into some kind of constrained group, and are all trying to use
> different and incompatible ways of grouping/tracking those processes.
>
> I agree that namespaces fit slightly less well into this model than
> some other subsystems like resource management. But by integrating
> with it you'd get automatic access to all the various different
> resource controller work that's being done.

I don't need to integrate with it to get access to the resource
controller work. The resource controller work applies to a set of
processes. I have a set of processes. I just need to apply the
resource controllers to my set of processes.

At least when we don't talk about the names for these things it
is that simple. Now global names I have issues with because almost
always the global names fall into their own category, but that is
a different story.

Now I have some half backed ideas that might be useful for providing
a better abstraction. But it requires setting down and looking
at the problem in detail, and understanding what people are trying
to accomplish with these things they are building. What subset of
process groups do you find interesting.

All that is necessary to have a group of processes do something
in an unnamed fashion is to hang a pointer off of the task_struct.
That's easy. You are adding a lot more to that, and there is
enough abstraction I haven't been able to easily work my way through
the infrastructure to what is really going on. Admittedly I haven't
looked closely yet.

Eric

2007-02-20 21:58:49

by Sam Vilain

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

Paul Menage wrote:
>> Using the container name is bad and it led to this stupid argument.
>>
>> The fundamental unit of what we have merged into the kernel is the
>> namespace. The aggregate of all namespaces and everything is the
>> container.
>>
> What are you defining here as "everything"? If you mean "all things
> that could be applied to a segregated group of processes such as a
> virtual server",

The term "segregated group of processes" is too vague. Segregated for
what? What is the kernel supposed to do with this information?

> I guess what it comes down to, is why is an aggregation of namespaces
> suitable for the name "container", when an aggregation of namespaces
> and other resource controllers isn't?
>

This argument goes away if you just rename these resource groups to
resource namespaces.

> What do you think might be a better name for the generic process
> groups that I'm pushing? As I said, I'm happy to do a simple
> search/replace on my code to give a different name if that turned out
> to be the gating factor to getting it merged. But I'd be inclined to
> leave that decision up to Andrew/Linus.
>

Did you like the names I came up with in my original reply?

- CPUset namespace for CPU partitioning
- Resource namespaces:
- cpusched namespace for CPU
- ulimit namespace for memory
- quota namespace for disk space
- io namespace for disk activity
- etc

>> For the case of namespaces I don't see how your code makes things
>> better. I do not see a real problem that you are solving.
>>
> I'm trying to solve the problem that lots of different folks
> (including us) are trying to do things that aggregate multiple process
> into some kind of constrained group, and are all trying to use
> different and incompatible ways of grouping/tracking those processes.
>

Maybe what's missing is a set of helper macros/functions that assist
with writing new namespaces. Perhaps you can give some more examples
and we can consider these on a case by case basis.

Sam.

2007-02-20 22:19:57

by Paul Menage

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

On 2/20/07, Sam Vilain <[email protected]> wrote:
>
> The term "segregated group of processes" is too vague. Segregated for
> what? What is the kernel supposed to do with this information?

The generic part of the kernel just keeps track of the fact that
they're segregated (and their children, etc).

It's the clients of this subsystem (virtual servers, resource
controllers) that can decide to give different per-process behaviour
based on the membership of various groups.

> Did you like the names I came up with in my original reply?
>
> - CPUset namespace for CPU partitioning
> - Resource namespaces:
> - cpusched namespace for CPU
> - ulimit namespace for memory
> - quota namespace for disk space
> - io namespace for disk activity
> - etc

This is a strange abuse of the term "namespace".
http://en.wikipedia.org/wiki/Namespace_%28computer_science%29

For the virtual server work that you're doing, namespace is a good term:

pids name processes, hence a pid namespace lets you have multiple
distinct mappings from pids to processes
filenames name files, so a filename (or mount) namespace lets you have
multiple distinct mappings from filenames to files.

For resource QoS control, it doesn't really make sense to talk about
namespaces. We're not virtualizing resources to rename them for
different virtual servers, we're limiting the quality of access to the
resources.

But the semantics of the term "namespace" notwithstanding, you're
equating a virtual server namespace (pid, ipc, uts, mounts, etc) with
a resource controller (memory, I/O, etc) in terms of their place in a
hierarchy, which is something I agree with. All of these subsystems
can be considered to be units that can be associated with groups of
processes; the ultimate grouping of processes is something that we're
both ultimately referring to as a container.

>
> Maybe what's missing is a set of helper macros/functions that assist
> with writing new namespaces.

That's pretty much what my containers patch does - provides a generic
way to associate state (what you're referring to as a namespace) with
groups of processes and a consistent user-space API for manipulating
them.

Paul

2007-02-20 22:47:50

by Paul Menage

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

On 2/20/07, Eric W. Biederman <[email protected]> wrote:
> > Sam said "the NSProxy *is* the container". You appear to be planning
> > to have some namespaces, possibly not aggregated within the nsproxy
> > (pid namespace?) but are you planning to have some higher-level
> > "container" object that aggregates the nsproxy and the other
> > namespaces?
>
> No. A reverse mapping is not needed and is not interesting.

... to you.

> As long as I can walk all processes and ask what namespace are
> you in I don't care.

How do you currently do that?

>
> To me at least the interesting part of your work is not the aggregation
> portion. But the infrastructure for building the different process
> groups.

In that case you can easily use it to just assign one namespace type
to each tree of process groups. The aggregation is something that
other groups find useful, but isn't required for the user to actually
make use of.

>
> You seem to be calling your groups of processes lumped together for
> one purpose or another a container.

Correct.

>
> We need a term for the non-aggregated case for the individual process
> groups we build this out of in your infrastructure. Because you
> clearly have more than one kind of process group a set of processes
> can belong to.

Yes. That's why my system supports multiple unrelated hierarchies of groups.

> > I agree that namespaces fit slightly less well into this model than
> > some other subsystems like resource management. But by integrating
> > with it you'd get automatic access to all the various different
> > resource controller work that's being done.
>
> I don't need to integrate with it to get access to the resource
> controller work.

Right, you could certainly do the extra work of tying your virtual
servers together with resource controllers in userspace. But you'd
still need an API to allow those resource controllers to be associated
with groups of processes and configured with limits/guarantees, which
is one of the main aims of my containers patch.

> Now I have some half backed ideas that might be useful for providing
> a better abstraction. But it requires setting down and looking
> at the problem in detail, and understanding what people are trying
> to accomplish with these things they are building. What subset of
> process groups do you find interesting.

I'm primarily interested in getting something in the kernel that can
be used as a base for interesting subsystems that apply behavioural or
QoS changes to defined groups of processes. Resource controllers and
namespaces seem to be good examples of this, but I can think of useful
subsystems for monitoring, permission control, etc.

>
> All that is necessary to have a group of processes do something
> in an unnamed fashion is to hang a pointer off of the task_struct.
> That's easy.

Right, adding a pointer to task_struct is easy. Configuring how/when
to not directly inherit it from the parent, or to change it for a
running task, or configuring state associated with the thing that the
pointer is pointing to, naming that group, and determining which group
a given process is assocaited with, is something that's effectively
repeated boiler plate for each different subsystem, and which can be
accomplished more generically via an abstraction like my containers
patch.

> You are adding a lot more to that, and there is

The main thing that I'm adding on top of the operations mentioned in
the previous paragraph, which pretty much essentially have to be in
the kernel, is the ability to group multiple different subsystems
together so that they share the same process->container mappings. Yes,
that is something that could potentially be done from userspace
instead, and just provide an independent tree for each subsystem or
namespace. But there seems to be interest from other parties
(including me) in having kernel support for it.

Paul

2007-02-20 22:58:23

by Sam Vilain

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

Paul Menage wrote:
>> The term "segregated group of processes" is too vague. Segregated for
>> what? What is the kernel supposed to do with this information?
>>
> The generic part of the kernel just keeps track of the fact that
> they're segregated (and their children, etc).
>
> It's the clients of this subsystem (virtual servers, resource
> controllers) that can decide to give different per-process behaviour
> based on the membership of various groups.
>

So those clients can use helper functions to use their own namespaces.
If they happen to group things in the same way they they'll all end up
using the same nsproxy.

>> Did you like the names I came up with in my original reply?
>>
>> - CPUset namespace for CPU partitioning
>> - Resource namespaces:
>> - cpusched namespace for CPU
>> - ulimit namespace for memory
>> - quota namespace for disk space
>> - io namespace for disk activity
>> - etc
>>
>
> This is a strange abuse of the term "namespace".
> http://en.wikipedia.org/wiki/Namespace_%28computer_science%29
>
> For the virtual server work that you're doing, namespace is a good term:
>
> pids name processes, hence a pid namespace lets you have multiple
> distinct mappings from pids to processes
> filenames name files, so a filename (or mount) namespace lets you have
> multiple distinct mappings from filenames to files.
>
> For resource QoS control, it doesn't really make sense to talk about
> namespaces. We're not virtualizing resources to rename them for
> different virtual servers, we're limiting the quality of access to the
> resources.
>
> But the semantics of the term "namespace" notwithstanding, you're
> equating a virtual server namespace (pid, ipc, uts, mounts, etc) with
> a resource controller (memory, I/O, etc) in terms of their place in a
> hierarchy, which is something I agree with. All of these subsystems
> can be considered to be units that can be associated with groups of
> processes; the ultimate grouping of processes is something that we're
> both ultimately referring to as a container.
>

I don't necessarily agree with the 'heirarchy' bit. It doesn't have to
be so segregated. But I think we already covered that in this thread.

I agree with the comment on the abuse of the term "namespace", though
consider that it has already been abused with the term IPC namespaces.
We have for some time been using it to refer to groupable entities
within the kernel that are associated with tasks, even if they don't
involve named entities that clash within a particular domain. But there
is always an entity and a domain, and that is the key point I'm trying
to make - the features you are putting forward are no different to the
examples that we made specifically for the purpose of setting the
standard for further features to follow.

We talked about naming a bit before, see
http://lkml.org/lkml/2006/3/21/403 and possibly other threads. So,
anyway, feel free to flog this old dead horse and suggest different
terms. We've all had long enough to think about it since so maybe it's
worth it, but with any new term it should be really darned clear that
they're essentially the same thing as namespaces, or otherwise be really
likable.

Sam.

2007-02-20 23:08:13

by Sam Vilain

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

Paul Menage wrote:
>> No. A reverse mapping is not needed and is not interesting.
>>
> ... to you.
>

You're missing the point of Eric's next sentence. If you can achieve
everything you need to achieve and get all the information you are after
without it, then it is uninteresting.

>> As long as I can walk all processes and ask what namespace are
>> you in I don't care.
>>
>
> How do you currently do that?
>

Take a look at /proc/PID/mounts for example.

>> All that is necessary to have a group of processes do something
>> in an unnamed fashion is to hang a pointer off of the task_struct.
>> That's easy.
>>
> Right, adding a pointer to task_struct is easy. Configuring how/when
> to not directly inherit it from the parent, or to change it for a
> running task, or configuring state associated with the thing that the
> pointer is pointing to, naming that group, and determining which group
> a given process is assocaited with, is something that's effectively
> repeated boiler plate for each different subsystem, and which can be
> accomplished more generically via an abstraction like my containers
> patch.
>

So make helpers. Macros. Anything, just don't introduce model
limitations like the container structure, because we've already got the
structure; the nsproxy.

Sam.

2007-02-20 23:28:20

by Paul Menage

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

On 2/20/07, Sam Vilain <[email protected]> wrote:
>
> I don't necessarily agree with the 'heirarchy' bit. It doesn't have to
> be so segregated. But I think we already covered that in this thread.

OK, but it's much easier to use a hierarchical system as a flat system
(just don't create children) than a flat system as a hierarchical
system. And others do seem to want a hierarchy for their process
groupings.

>
> I agree with the comment on the abuse of the term "namespace", though
> consider that it has already been abused with the term IPC namespaces.

That doesn't seem like an abuse to me at all - you're controlling what
IPC object a given name (shm_id, sem_id or msg_id) refers to for any
given group of processes.

> We have for some time been using it to refer to groupable entities
> within the kernel that are associated with tasks, even if they don't
> involve named entities that clash within a particular domain. But there
> is always an entity and a domain, and that is the key point I'm trying
> to make - the features you are putting forward are no different to the
> examples that we made specifically for the purpose of setting the
> standard for further features to follow.

They're very similar, I agree. An important difference is that things
like pid/mount namespaces are simply ways of controlling the
visibility of existing unix concepts, such as processes or
filesystems. You don't need additional configuration to make them
useful, as unix already has standard ways of manipulating them.

Things like resource controllers typically require additional
configuration to control how much resources each group of processes
can consume, etc. Also, it appears to be much more common to want to
move tasks between different resource controllers than to move them
between different namespaces. (And in order to configure and move, you
need to be able to name)

The configuration, naming and movement are the features that my
container patch provides on top of the features that nsproxy provides
for namespaces.

> anyway, feel free to flog this old dead horse and suggest different
> terms. We've all had long enough to think about it since so maybe it's
> worth it, but with any new term it should be really darned clear that
> they're essentially the same thing as namespaces, or otherwise be really
> likable.

What you're calling a "namespace", I'm calling a "subsystem state".
Essentially they're the same thing. The important thing that generic
containers provide is a standard way to manipulate "subsystem states"
or "namespaces".

Paul

2007-02-20 23:32:27

by Serge E. Hallyn

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

Quoting Paul Menage ([email protected]):
> On 2/20/07, Eric W. Biederman <[email protected]> wrote:
> >All that is necessary to have a group of processes do something
> >in an unnamed fashion is to hang a pointer off of the task_struct.
> >That's easy.
>
> Right, adding a pointer to task_struct is easy. Configuring how/when
> to not directly inherit it from the parent, or to change it for a
> running task, or configuring state associated with the thing that the
> pointer is pointing to, naming that group, and determining which group
> a given process is assocaited with, is something that's effectively
> repeated boiler plate for each different subsystem, and which can be
> accomplished more generically via an abstraction like my containers
> patch.

Eric,

what you gain with this patchset is, one very simple container subsystem
can tie a container to a cpu, another can limit it's RSS, and suddenly
you can

mount -t container -o ns,rss,cpuwhatever ns /container

And each virtual server you create by unsharing can get automatic cpu
and rss controls.

That is worthwhile imo.

-serge

2007-02-20 23:36:36

by Paul Menage

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

On 2/20/07, Sam Vilain <[email protected]> wrote:
> Paul Menage wrote:
> >> No. A reverse mapping is not needed and is not interesting.
> >>
> > ... to you.
> >
>
> You're missing the point of Eric's next sentence. If you can achieve
> everything you need to achieve and get all the information you are after
> without it, then it is uninteresting.

Yes, you can do it with an exhaustive trawl of /proc. That can be very
expensive on busy machines.

>
> >> As long as I can walk all processes and ask what namespace are
> >> you in I don't care.
> >>
> >
> > How do you currently do that?
> >
>
> Take a look at /proc/PID/mounts for example.

That doesn't tell you what mounts namespace a process is in - it tells
you what the process can *view* in the namespace.

>
> So make helpers. Macros. Anything, just don't introduce model
> limitations like the container structure, because we've already got the
> structure; the nsproxy.
>

As I mentioned in another email, nsproxy is fine for things that don't
need explicit configuration or reporting, or which already have
configuration methods (such as fork(), mount(), etc) available, and
which don't need to support movement of processes between different
"namespaces". If it was extended to support the
naming/config/movement, then it would be fine to use it as the
equivalent of a container.

I'd actually be interested in trying to combine my container object
and the nsproxy object into a single concept.

Paul

2007-02-20 23:37:12

by Serge E. Hallyn

[permalink] [raw]

Subject: Re: [PATCH 0/7] containers (V7): Generic Process Containers

Quoting Sam Vilain ([email protected]):
> Paul Menage wrote:

> We talked about naming a bit before, see
> http://lkml.org/lkml/2006/3/21/403 and possibly other threads. So,
> anyway, feel free to flog this old dead horse and suggest different
> terms. We've all had long enough to think about it since so maybe it's

Bah,

Paul, how about s/container/rug/, as in resource usage group?

At least you can be pretty sure that (a) rug won't conflict with any
other computer science therm, and (b) for that exact reason it should be
pretty memorable.

-serge