2006-01-17 14:58:53

by Serge E. Hallyn

Subject: RFC [patch 00/34] PID Virtualization Overview

--
PID Virtualization is based on the concept of a container.
Our ultimate goal is to checkpoint/restart containers. The
containers should also be useful as a basis for the pid
virtualization required, for instance, by vserver.

The mechanism to start a container
is to 'echo "container_name" > /proc/container' which creates a new
container and associates the calling process with it. All subsequently
forked tasks then belong to that container.
There is a separate pid space associated with each container.
Only processes/tasks belonging to the same container "see" each other.
The exception is an implied default system container that has
a global view.
The following patches accomplish 4 things:
1) identify the locations at the user/kernel boundary where pids and
related ids (pgrp, session ids, ...) need to be (de-)virtualized, and
call the appropriate (de-)virtualization functions.
2) provide the virtualization implementation in these functions.
3) implement a container object and a simple /proc interface to create one.
4) provide a per-container /proc/fs.
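
The (de-)virtualization step in items 1) and 2) can be pictured with a small userspace model. This is only a sketch under stated assumptions: struct container, container_add(), pid_to_vpid() and vpid_to_pid() are hypothetical names chosen for illustration, not the functions used in the patches.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 8

/* Hypothetical model: each container maps the pid a task sees
 * (virtual pid) to the pid the kernel uses internally. */
struct container {
    int id;                 /* 0 would be the implied default container */
    size_t ntasks;
    int vpid[MAX_TASKS];    /* pid as seen inside the container */
    int kpid[MAX_TASKS];    /* pid as used by the kernel */
};

/* Register a task; returns 0 on success, -1 if the table is full. */
static int container_add(struct container *c, int vpid, int kpid)
{
    if (c->ntasks >= MAX_TASKS)
        return -1;
    c->vpid[c->ntasks] = vpid;
    c->kpid[c->ntasks] = kpid;
    c->ntasks++;
    return 0;
}

/* Kernel -> user direction: virtualize a kernel pid.
 * Returns -1 if the task is not visible from this container. */
static int pid_to_vpid(const struct container *c, int kpid)
{
    for (size_t i = 0; i < c->ntasks; i++)
        if (c->kpid[i] == kpid)
            return c->vpid[i];
    return -1;
}

/* User -> kernel direction: de-virtualize a pid passed in by a task.
 * Returns -1 if no such pid exists in this container. */
static int vpid_to_pid(const struct container *c, int vpid)
{
    for (size_t i = 0; i < c->ntasks; i++)
        if (c->vpid[i] == vpid)
            return c->kpid[i];
    return -1;
}
```

In this model a restart on another machine would only rewrite the kpid column; the vpid values that the tasks themselves observe stay the same.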

-- Hubertus Franke ([email protected])
-- Cedric Le Goater ([email protected])
-- Serge E Hallyn ([email protected])
-- Dave Hansen ([email protected])


2006-01-17 16:19:25

by Suleiman Souhlal

Subject: Re: RFC [patch 00/34] PID Virtualization Overview

Serge Hallyn wrote:

> The mechanism to start a container
> is to 'echo "container_name" > /proc/container' which creates a new
> container and associates the calling process with it. All subsequently
> forked tasks then belong to that container.
> There is a separate pid space associated with each container.
> Only processes/task belonging to the same container "see" each other.

Why does there need to be a separate pid space for each container?
You don't really need one to make sure that only processes in the same
containers can see each other.

-- Suleiman

2006-01-17 17:08:55

by Dave Hansen

Subject: Re: RFC [patch 00/34] PID Virtualization Overview

On Tue, 2006-01-17 at 08:19 -0800, Suleiman Souhlal wrote:
> Serge Hallyn wrote:
> > The mechanism to start a container
> > is to 'echo "container_name" > /proc/container' which creates a new
> > container and associates the calling process with it. All subsequently
> > forked tasks then belong to that container.
> > There is a separate pid space associated with each container.
> > Only processes/task belonging to the same container "see" each other.
>
> Why does there need a separate pid space for each container?
> You don't really need one to make sure that only processes in the same
> containers can see each other.

One use for containers might be to pick a container from a system, wrap
it up, and transport it to another system where it would continue to
run. We would have to make sure that the pids did not collide with any
containers running on the target system.

-- Dave

2006-01-17 18:10:05

by Suleiman Souhlal

Subject: Re: RFC [patch 00/34] PID Virtualization Overview

Dave Hansen wrote:
> One use for containers might be to pick a container from a system, wrap
> it up, and transport it to another system where it would continue to
> run. We would have to make sure that the pids did not collide with any
> containers running on the target system.

Couldn't you assign new pids when the container is transported to the
other system?

-- Suleiman

2006-01-17 18:12:52

by Dave Hansen

Subject: Re: RFC [patch 00/34] PID Virtualization Overview

On Tue, 2006-01-17 at 10:09 -0800, Suleiman Souhlal wrote:
> Dave Hansen wrote:
> > One use for containers might be to pick a container from a system, wrap
> > it up, and transport it to another system where it would continue to
> > run. We would have to make sure that the pids did not collide with any
> > containers running on the target system.
>
> Couldn't you assign new pids when the container is transported to the
> other system?

You do assign new pids, at least as far as the kernel is concerned.
However, any processes that continue to run would get confused if their
pid changed. You have to make sure that the tasks have a _consistent_
view of which process is which pid.

-- Dave

2006-01-17 18:29:48

by Alan

Subject: Re: RFC [patch 00/34] PID Virtualization Overview

On Tue, 2006-01-17 at 10:12 -0800, Dave Hansen wrote:
> You do assign new pids, at least as far as the kernel is concerned.
> However, any processes that continue to run would get confused if their
> pid changed. You have to make sure that the tasks have a _consistent_
> view of which process is which pid.

Don't reassign the pid at all. Keep task->container and do the job
explicitly. Most task searches for a pid are abstracted already and most
users of ->pid who try and use it for comparing two tasks for equality
or for keeping a task reference are already terminally racey and want
fixing anyway.
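
A minimal sketch of what "keep task->container and do the job explicitly" could look like, written as a userspace model. The struct layout and find_task() are hypothetical and only illustrate a lookup keyed on the (container, pid) pair instead of on the pid alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task record: the pid is never rewritten and is only
 * unique within its container. */
struct task {
    int container;
    int pid;
};

/* Look a task up by the (container, pid) pair.
 * Returns NULL if no task matches. */
static const struct task *find_task(const struct task *tasks, size_t n,
                                    int container, int pid)
{
    for (size_t i = 0; i < n; i++)
        if (tasks[i].container == container && tasks[i].pid == pid)
            return &tasks[i];
    return NULL;
}
```

With this shape, two tasks may both carry pid 114 as long as they live in different containers, and a search without a container argument is simply not expressible.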

It raises a few other minor questions - one is /proc - but if container
0 was the usual one then putting the other containers into a subdir
would break nothing. Alternatively proc could allow multiple mounts and
a container = option to get the fs view right in chroot trees. The
subdirectories would be nice for management views.

You'd also need some process management items for other contexts - kill
etc but most of that can be done just by having a fork_into_container()
ability.

Alan

2006-01-18 19:02:15

by Dave Hansen

Subject: Re: RFC [patch 00/34] PID Virtualization Overview

On Tue, 2006-01-17 at 18:29 +0000, Alan Cox wrote:
> On Maw, 2006-01-17 at 10:12 -0800, Dave Hansen wrote:
> > You do assign new pids, at least as far as the kernel is concerned.
> > However, any processes that continue to run would get confused if their
> > pid changed. You have to make sure that the tasks have a _consistent_
> > view of which process is which pid.
>
> Don't reassign the pid at all. Keep task->container and do the job
> explicitly. Most task searches for a pid are abstracted already and most
> users of ->pid who try and use it for comparing two tasks for equality
> or for keeping a task reference are already terminally racey and want
> fixing anyway.

Other than searches, there appear to be quite a number of drivers and
subsystems that like to print out pids. I can't find any cases yet
where these are integral to functionality, but I wonder what approach we
should take. Should we deprecate printk'ing of pids? Make a special
function or % modifier to turn a task_struct into something printable?

A function would run into issues of having buffers in which to print the
output. But, we'd be able to do things like:

sprintf(buffer, "%d:%d", tsk->container, tsk->pid);

-- Dave

2006-01-18 19:29:05

by Arjan van de Ven

Subject: Re: RFC [patch 00/34] PID Virtualization Overview

On Wed, 2006-01-18 at 11:01 -0800, Dave Hansen wrote:
> On Tue, 2006-01-17 at 18:29 +0000, Alan Cox wrote:
> > On Maw, 2006-01-17 at 10:12 -0800, Dave Hansen wrote:
> > > You do assign new pids, at least as far as the kernel is concerned.
> > > However, any processes that continue to run would get confused if their
> > > pid changed. You have to make sure that the tasks have a _consistent_
> > > view of which process is which pid.
> >
> > Don't reassign the pid at all. Keep task->container and do the job
> > explicitly. Most task searches for a pid are abstracted already and most
> > users of ->pid who try and use it for comparing two tasks for equality
> > or for keeping a task reference are already terminally racey and want
> > fixing anyway.
>
> Other than searches, there appear to be quite a number of drivers an
> subsystems that like to print out pids. I can't find any cases yet
> where these are integral to functionality, but I wonder what approach we
> should take.

those should obviously print out the REAL pid, not the application
pid ... so no changes needed.


2006-01-18 19:38:18

by Dave Hansen

Subject: Re: RFC [patch 00/34] PID Virtualization Overview

On Wed, 2006-01-18 at 20:28 +0100, Arjan van de Ven wrote:
> On Wed, 2006-01-18 at 11:01 -0800, Dave Hansen wrote:
> > Other than searches, there appear to be quite a number of drivers an
> > subsystems that like to print out pids. I can't find any cases yet
> > where these are integral to functionality, but I wonder what approach we
> > should take.
>
> those should obviously print out the REAL pid, not the application
> pid ... so no changes needed.

One suggestion was to make all pid comparisons meaningless without some
kind of "container" context along with it. The thought is that using
pids is inherently racy, and relatively meaningless anyway, so the
kernel shouldn't be dealing with them. (The obvious exception being in
userspace interfaces)

This would let tsk->pid be anything that it likes as long as it has a
unique pid in its container.

But, it seems that many drivers like to print out pids as a unique
identifier for the task. Should we just let them print those
potentially non-unique identifiers, deprecate and kill them, or provide
a replacement with something else which is truly unique?

-- Dave

2006-01-18 19:50:40

by Arjan van de Ven

Subject: Re: RFC [patch 00/34] PID Virtualization Overview

On Wed, 2006-01-18 at 11:38 -0800, Dave Hansen wrote:
> On Wed, 2006-01-18 at 20:28 +0100, Arjan van de Ven wrote:
> > On Wed, 2006-01-18 at 11:01 -0800, Dave Hansen wrote:
> > > Other than searches, there appear to be quite a number of drivers an
> > > subsystems that like to print out pids. I can't find any cases yet
> > > where these are integral to functionality, but I wonder what approach we
> > > should take.
> >
> > those should obviously print out the REAL pid, not the application
> > pid ... so no changes needed.
>
> One suggestion was to make all pid comparisons meaningless without some
> kind of "container" context along with it. The thought is that using
> pids is inherently racy

current->pid sure isn't racey, you yourself KNOW you're not going
away :)





2006-01-18 22:55:41

by Alan

Subject: Re: RFC [patch 00/34] PID Virtualization Overview

On Wed, 2006-01-18 at 11:38 -0800, Dave Hansen wrote:
> But, it seems that many drivers like to print out pids as a unique
> identifier for the task. Should we just let them print those
> potentially non-unique identifiers, deprecate and kill them, or provide
> a replacement with something else which is truly unique?

Pick a format for container number + pid and document/stick with it -
something like container::pid (eg 0::114) or 114[0] whatever so long as
it is consistent


2006-01-19 07:15:14

by Arjan van de Ven

Subject: Re: RFC [patch 00/34] PID Virtualization Overview

On Wed, 2006-01-18 at 22:54 +0000, Alan Cox wrote:
> On Mer, 2006-01-18 at 11:38 -0800, Dave Hansen wrote:
> > But, it seems that many drivers like to print out pids as a unique
> > identifier for the task. Should we just let them print those
> > potentially non-unique identifiers, deprecate and kill them, or provide
> > a replacement with something else which is truly unique?
>
> Pick a format for container number + pid and document/stick with it -
> something like container::pid (eg 0::114) or 114[0] whatever so long as
> it is consistent

having a pid_to_string(<task struct>) or maybe task_to_string() thing
for convenient printing of pids/tasks.. I'm all for that. Means you can
even configure how verbose you want it to be (include ->comm or not,
->state maybe etc)

2006-01-20 05:12:18

by Eric W. Biederman

Subject: Re: RFC [patch 00/34] PID Virtualization Overview

Arjan van de Ven <[email protected]> writes:

> On Wed, 2006-01-18 at 22:54 +0000, Alan Cox wrote:
>> On Mer, 2006-01-18 at 11:38 -0800, Dave Hansen wrote:
>> > But, it seems that many drivers like to print out pids as a unique
>> > identifier for the task. Should we just let them print those
>> > potentially non-unique identifiers, deprecate and kill them, or provide
>> > a replacement with something else which is truly unique?
>>
>> Pick a format for container number + pid and document/stick with it -
>> something like container::pid (eg 0::114) or 114[0] whatever so long as
>> it is consistent
>
> having a pid_to_string(<task struct>) or maybe task_to_string() thing
> for convenient printing of pids/tasks.. I'm all for that. Means you can
> even configure how verbose you want it to be (include ->comm or not,
> ->state maybe etc)

The only way I can see to sanely do this is to pass it the temporary
buffer it writes its contents into.
Something like:
printk(KERN_XXX "%s\n", task_to_string(buf, tsk)); ?
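
As a rough userspace sketch of that calling convention (not the actual patch): the struct fields, TASK_STR_MAX and the output format below are assumptions made for illustration, with the "container::pid" form borrowed from the suggestion earlier in the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TASK_STR_MAX 48   /* hypothetical size the caller must provide */

/* Hypothetical stand-in for the fields of interest in task_struct. */
struct task {
    int container;
    int pid;
    char comm[16];
};

/* Formats tsk as "container::pid (comm)" into buf, which the caller
 * must declare as char buf[TASK_STR_MAX]. Returns buf, so the call can
 * be used directly as the argument for a "%s" conversion. Output that
 * would not fit is truncated by snprintf. */
static char *task_to_string(char *buf, const struct task *tsk)
{
    snprintf(buf, TASK_STR_MAX, "%d::%d (%s)",
             tsk->container, tsk->pid, tsk->comm);
    return buf;
}
```

A caller would then write something like printf("%s\n", task_to_string(buf, &t)), with buf living on the caller's stack.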


Eric

2006-01-20 19:54:29

by Eric W. Biederman

Subject: RFC: Multiple instances of kernel namespaces.


At this point I have to confess I have been working on something
similar to IBM's pid virtualization work. But I have what is, at
least for me, a unifying concept that makes things easier to think
about.

The idea is to think about things in terms of namespaces. Currently
in the kernel we have the fs/mount namespace already implemented.

Partly this helps decide what the interface for creating a new namespace
instance should be, 'clone(CLONE_NEW<NAMESPACE_TYPE>)', and how
it should be managed in the kernel data structures.

Partly thinking of things as namespaces helps me scope the problem.

Does this sound like a sane approach?
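
One way to picture the 'clone(CLONE_NEW<NAMESPACE_TYPE>)' idea is the userspace model below. It is only a sketch: the SK_NEW_* flags, struct ns and task_clone() are hypothetical names, loosely modelled on how the existing fs/mount namespace is requested at clone time, and are not kernel code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-namespace-type flags. */
#define SK_NEW_MOUNT (1u << 0)
#define SK_NEW_PID   (1u << 1)

struct ns {
    int refcount;   /* number of tasks sharing this namespace instance */
};

struct task {
    struct ns *mount_ns;
    struct ns *pid_ns;
};

static struct ns *ns_new(void)
{
    struct ns *n = malloc(sizeof *n);
    if (n)
        n->refcount = 1;
    return n;
}

static struct ns *ns_get(struct ns *n)
{
    n->refcount++;
    return n;
}

static void ns_put(struct ns *n)
{
    if (n && --n->refcount == 0)
        free(n);
}

/* The child shares each namespace with its parent unless the matching
 * flag asks for a fresh instance. Returns 0 on success, -1 if an
 * allocation failed. */
static int task_clone(const struct task *parent, unsigned flags,
                      struct task *child)
{
    struct ns *m = (flags & SK_NEW_MOUNT) ? ns_new() : ns_get(parent->mount_ns);
    struct ns *p = (flags & SK_NEW_PID) ? ns_new() : ns_get(parent->pid_ns);

    if (!m || !p) {
        ns_put(m);
        ns_put(p);
        return -1;
    }
    child->mount_ns = m;
    child->pid_ns = p;
    return 0;
}
```

The point of the model is that each namespace type is an independent, reference-counted object hanging off the task, so new types can be added without changing how the others are created or shared.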

Eric

2006-01-20 20:14:00

by Serge E. Hallyn

Subject: Re: RFC: Multiple instances of kernel namespaces.

Quoting Eric W. Biederman ([email protected]):
>
> At this point I have to confess I have been working on something
> similar, to IBM's pid virtualization work. But I have what is at
> least for me a unifying concept, that makes things easier to think
> about.
>
> The idea is to think about things in terms of namespaces. Currently
> in the kernel we have the fs/mount namespace already implemented.
>
> Partly this helps on what the interface for creating a new namespace
> instance should be. 'clone(CLONE_NEW<NAMESPACE_TYPE>)', and how
> it should be managed from the kernel data structures.
>
> Partly thinking of things as namespaces helps me scope the problem.
>
> Does this sound like a sane approach?

And a bonus of this is that for security and vserver-type applications,
the CLONE_NEWPID and CLONE_NEWNS will often happen at the same time.

How do you (or do you?) address naming namespaces? This would be
necessary for transitioning into an existing namespace, performing
actions on existing namespaces (e.g. checkpoint, migrate to another
machine, enter the namespace and kill pid 521), and would just be
useful for accounting purposes, e.g. how else do you have a
"ps --all-namespaces" specify a process' namespace?

Doubt we want to add an argument to clone(), so do we just add a new
proc, sysfs, or syscall for setting a pid-namespace name?

Do we need a new syscall for transitioning into an existing namespace?

thanks,
-serge

2006-01-20 20:22:47

by Hubertus Franke

Subject: Re: RFC: Multiple instances of kernel namespaces.

Serge E. Hallyn wrote:
> Quoting Eric W. Biederman ([email protected]):
>
>>At this point I have to confess I have been working on something
>>similar, to IBM's pid virtualization work. But I have what is at
>>least for me a unifying concept, that makes things easier to think
>>about.
>>
>>The idea is to think about things in terms of namespaces. Currently
>>in the kernel we have the fs/mount namespace already implemented.
>>
>>Partly this helps on what the interface for creating a new namespace
>>instance should be. 'clone(CLONE_NEW<NAMESPACE_TYPE>)', and how
>>it should be managed from the kernel data structures.
>>
>>Partly thinking of things as namespaces helps me scope the problem.
>>
>>Does this sound like a sane approach?
>
>
> And a bonus of this is that for security and vserver-type applications,
> the CLONE_NEWPID and CLONE_NEWFS will often happen at the same time.
>
> How do you (or do you?) address naming namespaces? This would be
> necessary for transitioning into an existing namespace, performing
> actions on existing namespaces (i.e. checkpoint, migrate to another
> machine, enter the namespace and kill pid 521), and would just be
> useful for accounting purposes, i.e. how else do you have a
> "ps --all-namespaces" specify a process' namespace?
>
> Doubt we want to add an argument to clone(), so do we just add a new
> proc, sysfs, or syscall for setting a pid-namespace name?
>
> Do we need a new syscall for transitioning into an existing namespace?
>
> thanks,
> -serge
>


Just addressed a few of these in my previous reply to the other thread.

However, the question here is whether the container (as we used it) provides
the "binding" object for these clones. One question for me then is
whether cloning of namespaces is always done in tandem.
As you are bringing the migration up, we can only clone fully contained
namespaces! One could make that a condition of the migration or build
it right into the initial structure. Any thoughts on that?

-- Hubertus

2006-01-20 20:23:42

by Serge E. Hallyn

Subject: Re: RFC [patch 00/34] PID Virtualization Overview

Quoting Eric W. Biederman ([email protected]):
> Arjan van de Ven <[email protected]> writes:
>
> > On Wed, 2006-01-18 at 22:54 +0000, Alan Cox wrote:
> >> On Mer, 2006-01-18 at 11:38 -0800, Dave Hansen wrote:
> >> > But, it seems that many drivers like to print out pids as a unique
> >> > identifier for the task. Should we just let them print those
> >> > potentially non-unique identifiers, deprecate and kill them, or provide
> >> > a replacement with something else which is truly unique?
> >>
> >> Pick a format for container number + pid and document/stick with it -
> >> something like container::pid (eg 0::114) or 114[0] whatever so long as
> >> it is consistent
> >
> > having a pid_to_string(<task struct>) or maybe task_to_string() thing
> > for convenient printing of pids/tasks.. I'm all for that. Means you can
> > even configure how verbose you want it to be (include ->comm or not,
> > ->state maybe etc)
>
> The only way I can see to sanely do this is to pass it the temporary
> buffer it writes it's contents into.
> Something like:
> printk(KERN_XXX "%s\n", task_to_string(buf, tsk)); ?

That's kind of neat :)

The only other thing I can think of is to do something like

#define task_str(tsk) tsk->container_id, tsk->pid
or
#define task_str(tsk) tsk->container_id, ":", tsk->pid

and have it be used as

printk(KERN_XXX "%s::%s\n", task_str(tsk));
or
printk(KERN_XXX "%s%s%s\n", task_str(tsk));

The only reason I point it out is that we don't risk memory corruption
if the printk caller forgets to give the extra '%s's, like we do if
the caller forgets they need char buf[PID_CONTAINER_MAXLENGTH] instead
of 'char *buf;' or 'char buf;'.
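
For illustration, here is a userspace variant of the macro idea that compiles. It is a sketch under assumptions: container_id and pid are taken to be ints (hence %d rather than %s), and TASK_FMT is a hypothetical companion macro added so that the format and the argument list cannot drift apart.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the relevant task_struct fields. */
struct task {
    int container_id;
    int pid;
};

/* task_str() expands to two comma-separated arguments, so it is only
 * meaningful inside an argument list; TASK_FMT is the matching format. */
#define TASK_FMT "%d::%d"
#define task_str(tsk) (tsk)->container_id, (tsk)->pid

/* Example caller: no temporary buffer is needed for the task id itself. */
static void describe(char *out, size_t len, const struct task *t)
{
    snprintf(out, len, "task " TASK_FMT, task_str(t));
}
```

The trade-off is that the macro cannot be assigned to a variable or passed around as a single value, which a buffer-based helper can.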

-serge

2006-01-20 20:33:56

by Hubertus Franke

Subject: Re: RFC [patch 00/34] PID Virtualization Overview

Serge E. Hallyn wrote:
> Quoting Eric W. Biederman ([email protected]):
>
>>Arjan van de Ven <[email protected]> writes:
>>
>>
>>>On Wed, 2006-01-18 at 22:54 +0000, Alan Cox wrote:
>>>
>>>>On Mer, 2006-01-18 at 11:38 -0800, Dave Hansen wrote:
>>>>
>>>>>But, it seems that many drivers like to print out pids as a unique
>>>>>identifier for the task. Should we just let them print those
>>>>>potentially non-unique identifiers, deprecate and kill them, or provide
>>>>>a replacement with something else which is truly unique?
>>>>
>>>>Pick a format for container number + pid and document/stick with it -
>>>>something like container::pid (eg 0::114) or 114[0] whatever so long as
>>>>it is consistent
>>>
>>>having a pid_to_string(<task struct>) or maybe task_to_string() thing
>>>for convenient printing of pids/tasks.. I'm all for that. Means you can
>>>even configure how verbose you want it to be (include ->comm or not,
>>>->state maybe etc)
>>
>>The only way I can see to sanely do this is to pass it the temporary
>>buffer it writes it's contents into.
>>Something like:
>>printk(KERN_XXX "%s\n", task_to_string(buf, tsk)); ?
>
>
> That's kind of neat :)
>
> The only other thing I can think of is to do something like
>
> #define task_str(tsk) tsk->container_id, tsk->pid
> or
> #define task_str(tsk) tsk->container_id, ":", tsk->pid
>
> and have it be used as
>
> printk(KERN_XXX "%s::%s\n", task_str(tsk));
> or
> printk(KERN_XXX "%s%s%s\n", task_str(tsk));
>
> The only reason I point it out is that we don't risk memory corruption
> if the printk caller forgets to give the extra '%s's, like we do if
> the caller forgets they need char buf[PID_CONTAINER_MAXLENGTH] instead
> of 'char *buf;' or 'char buf;'.
>
> -serge
>

As odd as this looks .. it does have its benefits, as does anything that
avoids potential problems.

On the other hand you might run into problems with the following.

char *str = task_str(tsk);

Either way .. I don't think these are the big fish to fry now :-)

-- Hubertus

2006-01-20 21:47:26

by Hubertus Franke

Subject: Re: RFC: Multiple instances of kernel namespaces.

Serge E. Hallyn wrote:
> Quoting Hubertus Franke ([email protected]):
>
>>However, question here is whether the container (as we used it) provides
>>the "binding" object for these clones. One question for me then is
>>whether cloning of namespaces is always done in tandem.
>
>
> No.

Thought so..

>
>
>>As you are bringing the migration up, we can only clone fully contained
>
>
> By clone do you actually mean clone(), or did you mean restart from
> checkpoint?

clone_<namespace>, so it's neither ...
Essentially creating a new namespace! That's what Eric was suggesting.

>
> If clone, then I don't understand the problem.
>
> If restart from checkpoint/migrate, then I think the answer has to be
> that that is a special case which we have to handle. Note that to clone
> a fs namespace, you need CAP_SYS_ADMIN. We could add another check in
> there to deny CLONE_NEWNS when CLONE_NEWPID is not specified IF and ONLY
> IF we are already no longer in container_id==0. Or even better, when
> a pid-namespace has been designated as migrateable.
>
> Anything other than that would be too limiting. Note that fs namespaces
> are going to be used for multi-level directories, for instance.

That's a reasonable approach. Give the general capability (since C/R + migration
is an additional capability that might not be utilized by many) and leave it to
the sysadmin to specify what is allowed or not.
>
>>namespaces ! One could make that a condition of the migration or build
>>it right into the initial structure. Any thoughts on that ?
>
> So in other words I'm saying that this is the admin/user's problem to
> keep straight. Dealing with fs-namespaces in this sense could perhaps be
> dealt with later by hand in checkpoint/migrate/restore code by
> a) at checkpoint:
> i) checking the fs-namespace of each process or thread
> ii) storing /proc/mounts for each fs-namespace
> b) at restore, do CLONE_NEWNS for each process which needs it,
> and using the stored /proc/mounts to rebuild the
> namespace.
>
Something like it .. yes...

> Of course /proc mounts is itself relative to a namespace in the
> case of bind mounts, so I'm actually not sure this is feasible.
>
> -serge
>


2006-01-21 10:05:16

by Eric W. Biederman

Subject: Re: RFC: Multiple instances of kernel namespaces.

"Serge E. Hallyn" <[email protected]> writes:

> Quoting Eric W. Biederman ([email protected]):
>>
>> At this point I have to confess I have been working on something
>> similar, to IBM's pid virtualization work. But I have what is at
>> least for me a unifying concept, that makes things easier to think
>> about.
>>
>> The idea is to think about things in terms of namespaces. Currently
>> in the kernel we have the fs/mount namespace already implemented.
>>
>> Partly this helps on what the interface for creating a new namespace
>> instance should be. 'clone(CLONE_NEW<NAMESPACE_TYPE>)', and how
>> it should be managed from the kernel data structures.
>>
>> Partly thinking of things as namespaces helps me scope the problem.
>>
>> Does this sound like a sane approach?
>
> And a bonus of this is that for security and vserver-type applications,
> the CLONE_NEWPID and CLONE_NEWFS will often happen at the same time.
>
> How do you (or do you?) address naming namespaces? This would be
> necessary for transitioning into an existing namespace, performing
> actions on existing namespaces (i.e. checkpoint, migrate to another
> machine, enter the namespace and kill pid 521), and would just be
> useful for accounting purposes, i.e. how else do you have a
> "ps --all-namespaces" specify a process' namespace?

So I address naming indirectly. The last thing I want to have
is to add yet another namespace to the kernel for naming namespaces.
We have enough namespaces already.

In any sane context for a pid-namespace we need a pid that
we can call waitpid on, so we don't break the process tree.
Which means at least the init process has 2 pids, one
that its parent sees, and another (1) that it and its
children see.

So I name pidspaces like we do process groups
and sessions, by the pid of the leader.

So in the simple case I have names like:
1178/1632
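
A small sketch of how such a name could be derived, assuming (this is an interpretation, not something stated above) that "1178/1632" is a path in which each component is the pid of a pidspace leader as seen from the enclosing pidspace. struct pidspace and pidspace_name() are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of nested pidspaces. */
struct pidspace {
    const struct pidspace *parent;  /* NULL for the initial pidspace */
    int leader_pid;                 /* leader's pid as seen from the parent */
};

/* Writes the name of ns into buf (capacity len) as a '/'-separated path
 * of leader pids, outermost first. The initial pidspace gets the empty
 * name. Returns 0 on success, -1 if buf is too small. */
static int pidspace_name(const struct pidspace *ns, char *buf, size_t len)
{
    size_t used;
    int n;

    if (ns->parent == NULL) {
        if (len == 0)
            return -1;
        buf[0] = '\0';
        return 0;
    }
    if (pidspace_name(ns->parent, buf, len) < 0)
        return -1;
    used = strlen(buf);
    n = snprintf(buf + used, len - used, "%s%d",
                 used ? "/" : "", ns->leader_pid);
    return (n < 0 || (size_t)n >= len - used) ? -1 : 0;
}
```

Because every component is an ordinary pid in an existing pidspace, no separate naming namespace is introduced.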

> Doubt we want to add an argument to clone(), so do we just add a new
> proc, sysfs, or syscall for setting a pid-namespace name?

That shouldn't be necessary.

> Do we need a new syscall for transitioning into an existing namespace?

That is a good question. The FS namespaces that we already have
have much the same problem. A completely different solution to
this problem seems to have been implemented, but I don't grasp it
yet.

Inherently, transitioning to an existing namespace is something
that is straightforward to implement, so it is worth thinking
about.

If I want a guest that can keep secrets from the host sysadmin I don't
want transitioning into a guest namespace to come too easily.

Currently I can always just create an extra child of pid 1
that will be my slave. The problem is that this is an extra
process lying around.

Eric

2006-01-21 10:33:50

by Pavel Machek

Subject: Re: RFC [patch 00/34] PID Virtualization Overview

On Wed 18-01-06 11:01:52, Dave Hansen wrote:
> On Tue, 2006-01-17 at 18:29 +0000, Alan Cox wrote:
> > On Maw, 2006-01-17 at 10:12 -0800, Dave Hansen wrote:
> > > You do assign new pids, at least as far as the kernel is concerned.
> > > However, any processes that continue to run would get confused if their
> > > pid changed. You have to make sure that the tasks have a _consistent_
> > > view of which process is which pid.
> >
> > Don't reassign the pid at all. Keep task->container and do the job
> > explicitly. Most task searches for a pid are abstracted already and most
> > users of ->pid who try and use it for comparing two tasks for equality
> > or for keeping a task reference are already terminally racey and want
> > fixing anyway.
>
> Other than searches, there appear to be quite a number of drivers an
> subsystems that like to print out pids. I can't find any cases yet
> where these are integral to functionality, but I wonder what approach we
> should take. Should we deprecate printk'ing of pids? Make a special
> function or % modifier to turn a task_struct into something printable?
>
> A function would run into issues of having buffers in which to print the
> output. But, we'd be able to do things like:
>
> sprintf(buffer, "%d:%d", tsk->container, tsk->pid);

What about first fixing all the drivers to use print_task() or
something like that, where print_task would print the name too (for
example)? That way, we get more useful data *now* and you can fix it
any way you want in future.

char *print_task() doing pretty-printing should be enough.
Pavel



--
Thanks, Sharp!

2006-01-21 10:34:49

by Eric W. Biederman

Subject: Re: RFC [patch 00/34] PID Virtualization Overview

Hubertus Franke <[email protected]> writes:

> As odd as this looks .. it does have the benefits and anything that avoids
> potential problems.
>
> On the other hand you might run into problems with the following.
>
> char *str = task_str(tsk);
>
> Eitherway .. I don't think these are the big fish to fry now :-)

Except there are really no small fish :)

This solves the one really ugly part of my current patch,
that I had simply not thought through.

There is already something similar for paths in the fs
namespace.

char * d_path(struct dentry *dentry, struct vfsmount *vfsmnt,
char *buf, int buflen);
Which does exactly this.

Now frequently it is passed in a page sized buffer so
it's not quite the same but close enough.

Eric

2006-01-26 19:47:57

by Herbert Poetzl

Subject: Re: RFC: Multiple instances of kernel namespaces.

On Sat, Jan 21, 2006 at 03:04:16AM -0700, Eric W. Biederman wrote:
> "Serge E. Hallyn" <[email protected]> writes:
>
> > Quoting Eric W. Biederman ([email protected]):
> >>
> >> At this point I have to confess I have been working on something
> >> similar, to IBM's pid virtualization work. But I have what is at
> >> least for me a unifying concept, that makes things easier to think
> >> about.
> >>
> >> The idea is to think about things in terms of namespaces. Currently
> >> in the kernel we have the fs/mount namespace already implemented.
> >>
> >> Partly this helps on what the interface for creating a new namespace
> >> instance should be. 'clone(CLONE_NEW<NAMESPACE_TYPE>)', and how
> >> it should be managed from the kernel data structures.
> >>
> >> Partly thinking of things as namespaces helps me scope the problem.
> >>
> >> Does this sound like a sane approach?
> >
> > And a bonus of this is that for security and vserver-type applications,
> > the CLONE_NEWPID and CLONE_NEWFS will often happen at the same time.
> >
> > How do you (or do you?) address naming namespaces? This would be
> > necessary for transitioning into an existing namespace, performing
> > actions on existing namespaces (i.e. checkpoint, migrate to another
> > machine, enter the namespace and kill pid 521), and would just be
> > useful for accounting purposes, i.e. how else do you have a
> > "ps --all-namespaces" specify a process' namespace?
>
> So I address naming indirectly. The last thing I want to have
> is to add yet another namespace to the kernel for naming namespaces.
> We have enough namespaces already.
>
> In any sane context for a pid-namespace we need a pid that
> we can call waitpid on, so we don't break the process tree.
> Which means at least the init process has 2 pids, one
> that it's parent sees, and another (1) that it and it's
> children see.
>
> So I name pidspaces like we do sessions of process groups
> and sessions by the pid of the leader.
>
> So in the simple case I have names like:
> 1178/1632

which is a new namespace in itself, but it doesn't matter
as long as it uniquely and persistently identifies the
namespace for the time it exists ... just leaves the
question how to retrieve a list of all namespaces :)

> > Doubt we want to add an argument to clone(), so do we just add a new
> > proc, sysfs, or syscall for setting a pid-namespace name?
>
> That shouldn't be necessary.
>
> > Do we need a new syscall for transitioning into an existing namespace?
>
> That is a good question. The FS namespaces that we already have
> has much the same problem. A completely different solution to
> this problem seems to have been implemented but I don't grasp it
> yet.
>
> Inherently transitioning to an existing namespace is something
> that is straight forward to implement, so it is worth thinking
> about.
>
> If I want a guest that can keep secrets from the host sysadmin I don't
> want transitioning into a guest namespace to come too easily.

which can easily be achieved by 'marking' the namespace
as private and/or applying certain rules/checks to the
'enter' procedure ...

best,
Herbert

> Currently I can always just create an extra child of pid 1
> that I will be my slave. The problem is that this is an extra
> process laying around.
>
> Eric
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2006-01-26 20:14:33

by Eric W. Biederman

Subject: Re: RFC: Multiple instances of kernel namespaces.

Herbert Poetzl <[email protected]> writes:

> On Sat, Jan 21, 2006 at 03:04:16AM -0700, Eric W. Biederman wrote:
>> So in the simple case I have names like:
>> 1178/1632
>
> which is a new namespace in itself, but it doesn't matter
> as long as it uniquely and persistently identifies the
> namespace for the time it exists ... just leaves the
> question how to retrieve a list of all namespaces :)

Yes but the name of the namespace is still in the original pid namespace.
And more importantly to me it isn't a new kind of namespace.

>> If I want a guest that can keep secrets from the host sysadmin I don't
>> want transitioning into a guest namespace to come too easily.
>
> which can easily be achieved by 'marking' the namespace
> as private and/or applying certain rules/checks to the
> 'enter' procedure ...

Right. The trick here is that you must be able to deny
transitioning into a namespace from inside the namespace.
Or else a guest could never trust it. Something one of my
coworkers pointed out to me.

Eric

2006-01-26 20:28:01

by Herbert Poetzl

Subject: Re: RFC: Multiple instances of kernel namespaces.

On Thu, Jan 26, 2006 at 01:13:45PM -0700, Eric W. Biederman wrote:
> Herbert Poetzl <[email protected]> writes:
>
> > On Sat, Jan 21, 2006 at 03:04:16AM -0700, Eric W. Biederman wrote:
> >> So in the simple case I have names like:
> >> 1178/1632
> >
> > which is a new namespace in itself, but it doesn't matter
> > as long as it uniquely and persistently identifies the
> > namespace for the time it exists ... just leaves the
> > question how to retrieve a list of all namespaces :)
>
> Yes but the name of the namespace is still in the original pid namespace.
> And more importantly to me it isn't a new kind of namespace.
>
> >> If I want a guest that can keep secrets from the host sysadmin I don't
> >> want transitioning into a guest namespace to come too easily.
> >
> > which can easily be achieved by 'marking' the namespace
> > as private and/or applying certain rules/checks to the
> > 'enter' procedure ...
>
> Right. The trick here is that you must be able to deny
> transitioning into a namespace from the inside the namespace.
> Or else a guest could never trust it. Something one of my
> coworkers pointed out to me.

not necessarily, for example have a 'private' flag, which
can only be set once (usually from outside), ensuring that
the namespace will not be entered. this flag could be
checked from inside ...
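
a minimal sketch of such a flag, as a userspace model with hypothetical names (struct ns_sketch, ns_mark_private(), ns_is_private(), ns_enter()); the only property it tries to show is that the flag can be set but never cleared, and that the 'enter' path checks it:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical namespace object carrying a one-way 'private' flag. */
struct ns_sketch {
    bool private_flag;   /* once true, nothing in this API clears it */
};

/* Set the flag. There is deliberately no function to unset it. */
static void ns_mark_private(struct ns_sketch *ns)
{
    ns->private_flag = true;
}

/* Readable from inside, so a guest can verify its own namespace. */
static bool ns_is_private(const struct ns_sketch *ns)
{
    return ns->private_flag;
}

/* Returns 0 if entering is allowed, -1 if the namespace is private. */
static int ns_enter(const struct ns_sketch *ns)
{
    return ns->private_flag ? -1 : 0;
}
```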

best,
Herbert

> Eric