--
This is an update to my multi-hierarchy multi-subsystem generic
process containers patch. Changes include:
- updated to 2.6.20-rc1
- incorporating some fixes from Srivatsa Vaddagiri
- release agent path is per-hierarchy, and defaults to empty
- dropped the patch splitting cpusets and memsets for now, since it's
a fairly mechanical patch, but a bit painful to maintain in the
presence of any other cpusets changes in mainline.
A couple of important issues have arisen on which feedback would be
appreciated:
1) The need (or otherwise) for multi-subsystem hierarchies
-----------------------------------------------------------
(Based on discussions with Paul Jackson)
The patch as it stands allows you to mount a container hierarchy as a
filesystem, and at that point select a set of subsystems to be bound
to that hierarchy. Subsystems can't be bound or unbound while the
hierarchy is active. Dynamic binding/unbinding in this way is
theoretically possible, but runs into various nasty conditions when
you have to attach or detach a subsystem to a potentially large tree
of containers and tasks. E.g. what do you do if halfway through
unbinding the subsystem state from the containers in a hierarchy, you
run into a container whose subsystem state is busy and hence can't be
freed? Either we'd need to put fairly strict limits on the properties
of subsystems that can be dynamically bound/unbound (i.e. can't refuse
to transfer a task from one container to another, can't use css
refcounting to keep subsystem state alive, etc) or we'd need to be
prepared to do some very nasty error-handling and rollback.
PaulJ pointed out that there's a continuum between:
1) just one controller per container, period, or
2) full dynamic binding and unbinding of any controller from any
one or more containers, with no limitations due to what else
is or ever was bound to what when.
My current patch falls somewhere in the middle. PaulJ wondered whether
it would be cleaner just to aim for case 1 - i.e. rather than having
the concept of hierarchies to tie multiple subsystems together, have
each subsystem be its own hierarchy.
My own feeling is that having each subsystem be its own hierarchy is
certainly possible, but results in a lot more effort in userspace to
manage - you have to manage N hierarchies of containers for N
subsystems, rather than just one.
You also have the issue that if a task is forking just as you start
moving it from one container to another (in each of the N hierarchies)
you could end up with the child in the original container in some
hierarchies, and in the new container in others, which isn't ideal.
>From a performance point of view, one-controller-per-hierarchy has a
little more overhead at fork()/exit() time (since there are more
reference counts to atomically update) but a little less overhead for
accessing the subsystem state for a particular task, since there's one
less level of indirection involved.
2) Dynamic creation/destruction of containers from inside the kernel
--------------------------------------------------------------------
A recent patch from Serge Hallyn on [email protected] proposed
a container filesystem somewhat similar to the one in this patch,
designed to expose the hierarchy of namespaces (specifically, nsproxy
objects). A major difference was that a child container could be
created dynamically from within sys_unshare(), and automatically freed
once it was no longer being referenced.
This could be fitted into my container model with a couple of small changes:
- a container_clone() function that essentially creates a new child
container of the current container (in the specified subsystem's
hierarchy) and moves the current process into the new container - the
equivalent of doing
mkdir $current_container_dir/$some_unique_name
echo $$ > $current_container_dir/$some_unique_name
- an auto_delete option for containers - similar to notify_on_exit,
but rather than invoking a userspace program, simply deletes the
container from within the kernel.
Then the nsproxy container could be implemented easily as a container
subsystem - rather than having a direct nsproxy field in struct task,
it would use the generic container pointer associated with the nsproxy
subsystem; sys_unshare() would call container_clone() to create the
new container.
=================================================================
Generic Process Containers
--------------------------
There have recently been various proposals floating around for
resource management/accounting subsystems in the kernel, including
ResGroups, User BeanCounters, NSProxy containers, and others. These
all need the basic abstraction of being able to group together
multiple processes in an aggregate, in order to track/limit the
resources permitted to those processes, and all implement this
grouping in different ways.
Already existing in the kernel is the cpuset subsystem; this has a
process grouping mechanism that is mature, tested, and well documented
(particularly with regards to synchronization rules).
This patchset extracts the process grouping code from cpusets into a
generic container system, and makes the cpusets code a client of
the container system.
It also provides several example clients of the container system,
including ResGroups and BeanCounters
The change is implemented in three stages, plus three example
subsystems that aren't necessarily intended to be merged as part of
this patch set, but demonstrate the applicability of the framework.
1) extract the process grouping code from cpusets into a standalone system
2) remove the process grouping code from cpusets and hook into the
container system
3) convert the container system to present a generic multi-hierarchy
API, and make cpusets a client of that API
4) example of a simple CPU accounting container subsystem
5) example of implementing ResGroups and its numtasks controller over
generic containers
6) example of implementing BeanCounters and its numfiles counter over
generic containers
The intention is that the various resource management efforts can also
become container clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test out e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
Signed-off-by: Paul Menage <[email protected]>
--
From: Serge E. Hallyn <[email protected]>
Subject: [RFC] [PATCH 1/1] container: define a namespace container subsystem
Here's a stab at a namespace container subsystem based on
Paul Menage's containers patch, just to experiment with
how semantics suit what we want.
A few things we'll want to address:
1. We'll want to be able to hook things like
rmdir, so that we can rm -rf /containers/vserver1
to kill all processes in that container and all
child containers.
2. We need a semantic difference between attaching
to a container, and being the first to join the
container you just created.
3. We will want to be able to give the container
attach function more info, so that we can ask to
attach to just the network namespace, but none of
the others, in the container we're attaching to.
I'm sure there'll be more, but that's a start...
Note that this is far more user-controled than my previous namespace
naming patch. In particular, ns actions - clone/unshare - do not
automatically create new containers. That may be for the best, so
I didn't try to move in that direction this time. However it may be desirable
to at least change the creation semantics such that (after a container create
request) an unshare or clone simultaneously creates the container and joins
the new process as the container's first process.
This version does not point to an nsproxy from the ns_container,
but it probably should, as a definitive way to pick an nsproxy to
attach to if a process wants to enter the container.
Signed-off-by: Serge E. Hallyn <[email protected]>
---
init/Kconfig | 9 ++++++
kernel/Makefile | 1 +
kernel/ns_container.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 84 insertions(+), 0 deletions(-)
diff --git a/init/Kconfig b/init/Kconfig
index ebaec57..c00b19c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -297,6 +297,15 @@ config CONTAINER_CPUACCT
Provides a simple Resource Controller for monitoring the
total CPU consumed by the tasks in a container
+config CONTAINER_NS
+ bool "Namespace container subsystem"
+ select CONTAINERS
+ help
+ Provides a simple namespace container subsystem to
+ provide hierarchical naming of sets of namespaces,
+ for instance virtual servers and checkpoint/restart
+ jobs.
+
config RELAY
bool "Kernel->user space relay support (formerly relayfs)"
help
diff --git a/kernel/Makefile b/kernel/Makefile
index feba860..6c73a5e 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -39,6 +39,7 @@ obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CONTAINERS) += container.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_CONTAINER_CPUACCT) += cpu_acct.o
+obj-$(CONFIG_CONTAINER_NS) += ns_container.o
obj-$(CONFIG_IKCONFIG) += configs.o
obj-$(CONFIG_STOP_MACHINE) += stop_machine.o
obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
diff --git a/kernel/ns_container.c b/kernel/ns_container.c
new file mode 100644
index 0000000..b122bb4
--- /dev/null
+++ b/kernel/ns_container.c
@@ -0,0 +1,74 @@
+/*
+ * ns_container.c - namespace container subsystem
+ *
+ * Copyright IBM, 2006
+ */
+
+#include <linux/module.h>
+#include <linux/container.h>
+#include <linux/fs.h>
+
+struct nscont {
+ struct container_subsys_state css;
+ spinlock_t lock;
+};
+
+static struct container_subsys ns_subsys;
+
+static inline struct nscont *container_nscont(struct container *cont)
+{
+ return container_of(container_subsys_state(cont, &ns_subsys),
+ struct nscont, css);
+}
+
+int ns_can_attach(struct container_subsys *ss,
+ struct container *cont, struct task_struct *tsk)
+{
+ struct container *c;
+
+ if (atomic_read(&cont->count) != 0)
+ return -EPERM;
+
+ c = task_container(tsk, &ns_subsys);
+ if (c && c != cont->parent)
+ return -EPERM;
+ printk(KERN_NOTICE "%s: task %lu in container %s, attaching to %s, parent %s\n",
+ __FUNCTION__, tsk->pid, c ? c->dentry->d_name.name : "(null)",
+ cont->dentry->d_name.name, cont->parent->dentry->d_name.name);
+
+ return 0;
+}
+static int ns_create(struct container_subsys *ss, struct container *cont)
+{
+ struct nscont *ns = kzalloc(sizeof(*ns), GFP_KERNEL);
+ if (!ns) return -ENOMEM;
+ spin_lock_init(&ns->lock);
+ cont->subsys[ns_subsys.subsys_id] = &ns->css;
+ return 0;
+}
+
+static void ns_destroy(struct container_subsys *ss,
+ struct container *cont)
+{
+ struct nscont *ns = container_nscont(cont);
+ kfree(ns);
+}
+
+static struct container_subsys ns_subsys = {
+ .name = "ns_container",
+ .create = ns_create,
+ .destroy = ns_destroy,
+ .can_attach = ns_can_attach,
+ //.attach = ns_attach,
+ //.post_attach = ns_post_attach,
+ //.populate = ns_populate,
+ .subsys_id = -1,
+};
+
+int __init ns_init(void)
+{
+ int id = container_register_subsys(&ns_subsys);
+ return id < 0 ? id : 0;
+}
+
+module_init(ns_init)
--
1.4.1
Hi Serge,
On 1/3/07, Serge E. Hallyn <[email protected]> wrote:
> From: Serge E. Hallyn <[email protected]>
> Subject: [RFC] [PATCH 1/1] container: define a namespace container subsystem
>
> Here's a stab at a namespace container subsystem based on
> Paul Menage's containers patch, just to experiment with
> how semantics suit what we want.
Thanks for looking at this.
What you have here is the basic boilerplate for any generic container
subsystem. I realise that my current containers patch has some
incompatibilities with the way that nsproxy wants to work.
>
> A few things we'll want to address:
>
> 1. We'll want to be able to hook things like
> rmdir, so that we can rm -rf /containers/vserver1
> to kill all processes in that container and all
> child containers.
The current model is that rmdir fails if there are any processes still
in the container; so you'd have to kill processes by looking for pids
in the "tasks" info file. This was behaviour inherited from the
cpusets code; I'd be open to making this more configurable (e.g.
specifying that rmdir should try to kill any remaining tasks).
>
> 2. We need a semantic difference between attaching
> to a container, and being the first to join the
> container you just created.
Right - the way to do this would probably be some kind of
"container_clone()" function that duplicates the properties of the
current container in a child, and immediately moves the current
process into that container.
>
> 3. We will want to be able to give the container
> attach function more info, so that we can ask to
> attach to just the network namespace, but none of
> the others, in the container we're attaching to.
If you want to be able to attach to different namespaces separately,
then possibly they should be separate container subsystems?
Paul
Quoting Paul Menage ([email protected]):
> Hi Serge,
>
> On 1/3/07, Serge E. Hallyn <[email protected]> wrote:
> >From: Serge E. Hallyn <[email protected]>
> >Subject: [RFC] [PATCH 1/1] container: define a namespace container
> >subsystem
> >
> >Here's a stab at a namespace container subsystem based on
> >Paul Menage's containers patch, just to experiment with
> >how semantics suit what we want.
>
> Thanks for looking at this.
>
> What you have here is the basic boilerplate for any generic container
> subsystem. I realise that my current containers patch has some
> incompatibilities with the way that nsproxy wants to work.
In retrospect I don't like the changes in behavior. So my next
version will aim for closer to the original (non-containerfs)
behavior.
> >A few things we'll want to address:
> >
> > 1. We'll want to be able to hook things like
> > rmdir, so that we can rm -rf /containers/vserver1
> > to kill all processes in that container and all
> > child containers.
>
> The current model is that rmdir fails if there are any processes still
> in the container; so you'd have to kill processes by looking for pids
> in the "tasks" info file. This was behaviour inherited from the
> cpusets code; I'd be open to making this more configurable (e.g.
> specifying that rmdir should try to kill any remaining tasks).
Ok - of course I suspect I'll have to just start coding away before
i can guess at what help I might need from your code.
> >
> > 2. We need a semantic difference between attaching
> > to a container, and being the first to join the
> > container you just created.
>
> Right - the way to do this would probably be some kind of
> "container_clone()" function that duplicates the properties of the
> current container in a child, and immediately moves the current
> process into that container.
>
> > 3. We will want to be able to give the container
> > attach function more info, so that we can ask to
> > attach to just the network namespace, but none of
> > the others, in the container we're attaching to.
>
> If you want to be able to attach to different namespaces separately,
> then possibly they should be separate container subsystems?
That's one possibility, but imo somewhat unpalatable.
As I mentioned in the last email, I really like the idea of having
files representing each namespace under each namespace container
directory, creating a new container by linking some of those
namespace files, and entering containers by echoing the pathname
to the new container into /proc/$$/ns_container. (either upon
the echo, or, I think preferably, upon a subsequent exec)
-serge