2006-01-17 14:57:32

by Serge E. Hallyn

[permalink] [raw]
Subject: RFC [patch 13/34] PID Virtualization Define new task_pid api

Actually define the task_pid() and task_tgid() functions. Also
replace pid with __pid so as to make sure any missed accessors are
caught.

The resulting object code seems to be identical in most cases, and is
actually shorter in cases where current->pid is used twice in a row,
as it does not dereference the task structure twice.
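
For illustration, a converted call site would look something like this
(a hypothetical example, not part of the diff below):

    /* before: reading the field directly */
    printk("%s exiting, pid %d\n", current->comm, current->pid);

    /* after: going through the accessor; a direct ->pid access now
     * fails to compile, since the field is spelled __pid */
    printk("%s exiting, pid %d\n", current->comm, task_pid(current));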

Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Serge Hallyn <[email protected]>
---
sched.h | 16 +++++++++++++---
1 files changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6.15/include/linux/sched.h
===================================================================
--- linux-2.6.15.orig/include/linux/sched.h 2006-01-17 08:18:22.000000000 -0500
+++ linux-2.6.15/include/linux/sched.h 2006-01-17 08:37:03.000000000 -0500
@@ -732,8 +732,8 @@
/* ??? */
unsigned long personality;
unsigned did_exec:1;
- pid_t pid;
- pid_t tgid;
+ pid_t __pid;
+ pid_t __tgid;
/*
* pointers to (original) parent process, youngest child, younger sibling,
* older sibling, respectively. (p->father can be replaced with
@@ -877,6 +877,16 @@
return p->pids[PIDTYPE_PID].nr != 0;
}

+static inline pid_t task_pid(const struct task_struct *p)
+{
+ return p->__pid;
+}
+
+static inline pid_t task_tgid(const struct task_struct *p)
+{
+ return p->__tgid;
+}
+
extern void free_task(struct task_struct *tsk);
extern void __put_task_struct(struct task_struct *tsk);
#define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0)
@@ -1200,7 +1210,7 @@

extern task_t * FASTCALL(next_thread(const task_t *p));

-#define thread_group_leader(p) (p->pid == p->tgid)
+#define thread_group_leader(p) (task_pid(p) == task_tgid(p))

static inline int thread_group_empty(task_t *p)
{

--


2006-01-17 15:32:57

by Arjan van de Ven

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Tue, 2006-01-17 at 08:33 -0600, Serge Hallyn wrote:
> plain text document attachment (BC-define-pid-handlers)
> Actually define the task_pid() and task_tgid() functions. Also
> replace pid with __pid so as to make sure any missed accessors are
> caught.

This question was asked a few times before without satisfactory answer:
*WHY* this abstraction.
There is *NO* point. Really.

(And if the answer is "but we want to play tricks later", just make a
current->realpid or whatever, but leave current->pid as the virtual pid.
Your abstraction helps NOTHING there. Zero Nada Noppes.)


2006-01-17 15:56:04

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Quoting Arjan van de Ven ([email protected]):
> On Tue, 2006-01-17 at 08:33 -0600, Serge Hallyn wrote:
> > plain text document attachment (BC-define-pid-handlers)
> > Actually define the task_pid() and task_tgid() functions. Also
> > replace pid with __pid so as to make sure any missed accessors are
> > caught.
>
> This question was asked a few times before without satisfactory answer:
> *WHY* this abstraction.
> There is *NO* point. Really.
>
> (And if the answer is "but we want to play tricks later", just make a
> current->realpid or whatever, but leave current->pid as the virtual pid.
> Your abstraction helps NOTHING there. Zero Nada Noppes.)

The virtual pid is different depending on who is asking. So simply
storing current->realpid and current->pid isn't helpful, as we would
still need to call a function when a pid crosses user->kernel boundary.
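
A minimal sketch of the shape of that boundary translation (the names
vpid_to_pid(), ->container and do_signal_pid() are invented here for
illustration; they are not what the patch set actually spells them):

    /* Hypothetical sketch: a syscall translating at the user->kernel
     * boundary. The pid the caller passed is relative to the caller's
     * own container, so it must be translated before any global
     * lookup. */
    asmlinkage long sys_kill(pid_t upid, int sig)
    {
            pid_t pid = vpid_to_pid(current->container, upid);

            return do_signal_pid(pid, sig);
    }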

However we could make the patch far less invasive by skipping the
task_pid() macro altogether. Switching current->pid to current->__pid
was to make sure we catch, at compile time, any ->pid accesses we may
have missed.

Is that approach (keeping task->pid as the real pid and dropping the
task_pid() macro) preferred by all?

-serge

2006-01-17 16:02:31

by Arjan van de Ven

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api


> Is that approach (keeping task->pid as the real pid and dropping the
> task_pid() macro) preferred by all?

it sure is what I think is the best approach



2006-01-17 16:05:43

by Alan

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Maw, 2006-01-17 at 09:56 -0600, Serge E. Hallyn wrote:
> The virtual pid is different depending on who is asking. So simply
> storing current->realpid and current->pid isn't helpful, as we would
> still need to call a function when a pid crosses user->kernel boundary.

This is an obscure, weird piece of functionality for some special case
usages most of which are going to be eliminated by Xen. I don't see the
kernel side justification for it at all.

Maybe you should remap it on the other side of the user->kernel boundary?

2006-01-17 17:17:39

by Kyle Moffett

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Jan 17, 2006, at 11:19, Suleiman Souhlal wrote:
> Why does there need a separate pid space for each container? You
> don't really need one to make sure that only processes in the same
> containers can see each other.

On Jan 17, 2006, at 11:03, Alan Cox wrote:
> This is an obscure, weird piece of functionality for some special
> case usages most of which are going to be eliminated by Xen. I
> don't see the kernel side justification for it at all.
>
> Maybe you should remap it on the other side of the user->kernel
> boundary?

To answer both questions at the same time, this is to make it
practical to reliably freeze and restore whole process-trees/jobs,
possibly restarting or migrating to a different computer in-between.
Such freeze/restore code is mostly possible now, except for programs
that store a pid internally to send a signal to another process. The
usage would be something like this:

start:
    create container
    run jobs

freeze:
    for each process/thread in the container:
        send SIGSTOP to halt execution
    for each process in the container:
        store process data, filehandles, vpid, etc.

restore:
    create container
    iterate over the frozen processes in the freeze file:
        clone_with_pid(flags, original_vpid);
        adjust session, connect filehandles, etc.
        remap shared memory, etc.
        block on atomic "resume" variable in mmap-ed file
    set "resume" to 1

The end result is that you could freeze and resume _any_ process tree
in a container, even ones that do weird things with filehandles,
sockets, pids, etc. Personally I would find this useful to migrate
an extensive memory-leak-debugging session in a large application
over from laptop to desktop or vice versa. You could also freeze/
migrate/restore a whole X session (not including the X process
itself, but all the client apps). You could not do this at all for
statically-linked applications without kernel support, and it would
be rather inefficient to do even for dynamically-linked ones.

The one other option would be to allow opening a file
/proc/$PID/control, to which you could write a signal number, and
require freezable programs to use that interface to reliably send
signals. (This also makes signals non-racy if you're reusing all
60000+ pids on a regular basis.) This has the disadvantage of not
working even for existing dynamically-linked programs.

Cheers,
Kyle Moffett

--
There is no way to make Linux robust with unreliable memory
subsystems, sorry. It would be like trying to make a human more
robust with an unreliable O2 supply. Memory just has to work.
-- Andi Kleen


2006-01-17 17:25:30

by Dave Hansen

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Tue, 2006-01-17 at 16:03 +0000, Alan Cox wrote:
> On Maw, 2006-01-17 at 09:56 -0600, Serge E. Hallyn wrote:
> > The virtual pid is different depending on who is asking. So simply
> > storing current->realpid and current->pid isn't helpful, as we would
> > still need to call a function when a pid crosses user->kernel boundary.
>
> This is an obscure, weird piece of functionality for some special case
> usages most of which are going to be eliminated by Xen. I don't see the
> kernel side justification for it at all.

At least OpenVZ and vserver want very similar functionality. They're
both working with out-of-tree patch sets. We each want to do subtly
different things with tsk->pid, and task_pid() seemed to be a decent
place to start. OpenVZ has a very similar concept in its pid_task()
function.

Arjan had a very good point last time we posted these: we should
consider getting rid of as many places in the kernel where pids are used
to uniquely identify tasks, and just stick with task_struct pointers.

> Maybe you should remap it on the other side of the user->kernel boundary?

We would very much like to run unmodified applications and libraries.
Doing it in a patched libc is certainly a possibility, but it wouldn't
make any future customers very happy.

I also wonder if RedHat or SUSE would ever ship and support a special
set of libraries for us. Oh, and there are always statically linked
apps... :)

-- Dave

2006-01-18 04:54:37

by Greg KH

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Tue, Jan 17, 2006 at 09:25:14AM -0800, Dave Hansen wrote:
>
> I also wonder if RedHat or SUSE would ever ship and support a special
> set of libraries for us.

Don't both companies do that today with their s390 releases? :)

2006-01-18 04:55:43

by Greg KH

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Tue, Jan 17, 2006 at 09:25:14AM -0800, Dave Hansen wrote:
>
> Arjan had a very good point last time we posted these: we should
> consider getting rid of as many places in the kernel where pids are used
> to uniquely identify tasks, and just stick with task_struct pointers.

That's a very good idea, why didn't you do that?

thanks,

greg k-h

2006-01-18 16:23:26

by Dave Hansen

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Tue, 2006-01-17 at 20:55 -0800, Greg KH wrote:
> On Tue, Jan 17, 2006 at 09:25:14AM -0800, Dave Hansen wrote:
> >
> > Arjan had a very good point last time we posted these: we should
> > consider getting rid of as many places in the kernel where pids are used
> > to uniquely identify tasks, and just stick with task_struct pointers.
>
> That's a very good idea, why didn't you do that?

Because we were being stupid and shouldn't have posted this massive set
of patches to LKML again before addressing the comments we got last
time, or doing _anything_ new with them.

-- Dave

2006-01-20 17:01:46

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Dave Hansen <[email protected]> writes:

> On Tue, 2006-01-17 at 20:55 -0800, Greg KH wrote:
>> On Tue, Jan 17, 2006 at 09:25:14AM -0800, Dave Hansen wrote:
>> >
>> > Arjan had a very good point last time we posted these: we should
>> > consider getting rid of as many places in the kernel where pids are used
>> > to uniquely identify tasks, and just stick with task_struct pointers.
>>
>> That's a very good idea, why didn't you do that?
>
> Because we were being stupid and shouldn't have posted this massive set
> of patches to LKML again before addressing the comments we got last
> time, or doing _anything_ new with them.

Actually a little progress has been made. I think the patch set
gets to the point of usability this time, or at least is close.

Although it feels like there are still some gaps when I read through
it.

Eric

2006-01-20 20:18:04

by Hubertus Franke

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Eric W. Biederman wrote:
> Dave Hansen <[email protected]> writes:
>
>
>>On Tue, 2006-01-17 at 20:55 -0800, Greg KH wrote:
>>
>>>On Tue, Jan 17, 2006 at 09:25:14AM -0800, Dave Hansen wrote:
>>>
>>>>Arjan had a very good point last time we posted these: we should
>>>>consider getting rid of as many places in the kernel where pids are used
>>>>to uniquely identify tasks, and just stick with task_struct pointers.
>>>
>>>That's a very good idea, why didn't you do that?
>>
>>Because we were being stupid and shouldn't have posted this massive set
>>of patches to LKML again before addressing the comments we got last
>>time, or doing _anything_ new with them.
>
>
> Actually a little progress has been made. I think the patch set
> gets to the point of usability this time, or at least is close.
>
> Although it feels like there are still some gaps when I read through
> it.
>
> Eric
>

Let me just summarize the discussion that has taken place so far
and the consequences I/we seem to be drawing out of it.

We discussed the various approaches that are floating around now; enough
has been said about each, so I leave it at that...
(a) GLIBC intercept LD_PRELOAD
(b) Binary Rewrite of glibc
(c) syscall table intercept (see ZAP)
(d) vpid approach (see "IBM" patches posted)
(e) <pid,container> approach (see below; suggested by Alan?)

There are several issues that came up in the email exchange (Arjan, Alan Cox, ...).
[Please feel free to tell me if we captured these wrong or misinterpreted them.]

1st:
====
Issue: we don't need all the task_pid() etc. functions; just stick with what
it was: task->pid!

Consensus: It seems consensus is forming on that.
Actions: remove patches 1-12/34 and adopt the rest straightforwardly.

2nd:
====
Issue: we don't need pid virtualization; instead simply use a <container,pid> pair.

This requires a bit more thought. Essentially that's what I was doing, but I mangled
them into the same pid, using masking to add/remove the container for internal use.
As pointed out by Alan(?), we can indeed reuse the same pid internally many times
as long as we can distinguish during the pid-to-task_struct lookup. This is easily
done because the caller provides the context, hence the container, for the lookup.

Actions: The vpid_to_pid will disappear and the check for whether we are in the same
container needs to be pushed down into the task lookup. The question remains whether
the context of the task lookup will always remain the caller.

Doing so has an implication, namely that we are moving over to "system containers".
The current implementation requires the vpid/pid only for the boundary condition at the
top of the container (to rewrite pid=1) and its parent, and the fact that we wanted
a global view through container=0.
If said boundary were eliminated and we simply made a container a child of the
initproc (pid=1), this would be unnecessary.

Altogether this would provide private namespaces (as just suggested by Eric).

The feeling would be that large parts of the patch could be reduced by this.

What we need is a new system call (similar to vserver), or maybe we can continue
the /proc approach for now...

sys_exec_container(const char *container_name, pid_t pid, unsigned int flags, const char *argv[], const char *envp[]);

exec_container creates a new container (if indicated in flags) and a new task in it that reports to the parent initproc.
If a non-zero pid is specified we use that pid; otherwise the system will allocate one. Finally
it creates a new session id, chroots, and execs the specified program.
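
As a hypothetical usage sketch (nothing here is an existing interface,
and CONTAINER_CREATE is an invented flag name):

    const char *argv[] = { "/sbin/init", NULL };
    const char *envp[] = { NULL };

    /* Create container "vps1"; pid 0 asks the system to allocate the
     * new task's pid. The new task reports to the parent initproc,
     * gets a fresh session, is chrooted, and execs /sbin/init. */
    long err = sys_exec_container("vps1", 0, CONTAINER_CREATE,
                                  argv, envp);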

What we lose with this is the session and the tty, which Cedric described as an
application container...

The sys_exec_container(...) seems to be similar to what Eric just called clone_namespace()

-- Hubertus

______________________________________________________

2006-01-21 10:25:28

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Hubertus Franke <[email protected]> writes:

> Eric W. Biederman wrote:

> Let me just summarize the discussion that has taken place so far
> and the consequences I/we seem to be drawing out of it.
>
> We discussed the various approaches that are floating around now, enough
> has been said about each, so I leave it at that ...
> (a) GLIBC intercept LD_PRELOAD
> (b) Binary Rewrite of glibc
> (c) syscall table intercept (see ZAP)
> (d) vpid approach (see "IBM" patches posted)
> (e) <pid,container> approach (see below, suggested by Alan?.. )
That seems to have been an observation of the current patchset.
>
> There are several issues that came up in the email exchange (Arjan, Alan Cox, ...).
> [Please feel free to tell me if we captured these wrong or misinterpreted them.]

...
> Actions: The vpid_to_pid will disappear and the check for whether we are in the
> same container needs to be pushed down into the task lookup. The question
> remains whether the context of the task lookup will always remain the caller.

You don't need a same container check. If something is in another container
it becomes invisible to you.

> Doing so has an implication, namely that we are moving over to "system
> containers". The current implementation requires the vpid/pid only for the
> boundary condition at the top of the container (to rewrite pid=1) and its
> parent, and the fact that we wanted a global view through container=0.
> If said boundary were eliminated and we simply made a container a child of
> the initproc (pid=1), this would be unnecessary.
>
> Altogether this would provide private namespaces (as just suggested by Eric).
>
> The feeling would be that large parts of the patch could be reduced by this.

I concur. Except I think the initial impact could still be large.
It may be worth breaking all users of pids just so we audit them.

But that will certainly result in no long-term cost or runtime overhead.

> What we need is a new system call (similar to vserver), or maybe we can continue
> the /proc approach for now...
>
> sys_exec_container(const char *container_name, pid_t pid, unsigned int flags,
> const char *argv[], const char *envp[]);
>
> exec_container creates a new container (if indicated in flags) and a new task in
> it that reports to the parent initproc.
> If a non-zero pid is specified we use that pid; otherwise the system will
> allocate one. Finally it creates a new session id, chroots, and execs the
> specified program.
>
> What we lose with this is the session and the tty, which Cedric described as
> an application container...
>
> The sys_exec_container(...) seems to be similar to what Eric just called
> clone_namespace()

Similar. But I was actually talking about just adding another flag to
sys_clone, the syscall underlying fork(). Basically it is just another
resource to share or not share.
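
A minimal sketch of what that could look like from userspace, via the
glibc clone() wrapper (CLONE_NEWPIDS is an invented flag name; no such
flag exists today):

    #include <sched.h>
    #include <signal.h>

    #define CLONE_NEWPIDS  0x08000000 /* hypothetical: new pid space */

    static int container_init(void *arg)
    {
            /* would run as pid 1 of the fresh pid space */
            return 0;
    }

    static int launch_container(void)
    {
            static char stack[16384];

            /* just another resource to not share at fork time */
            return clone(container_init, stack + sizeof(stack),
                         CLONE_NEWPIDS | SIGCHLD, NULL);
    }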

Eric

2006-01-21 14:43:10

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Hubertus Franke <[email protected]> writes:

> 2nd:
> ====
> Issue: we don't need pid virtualization; instead simply use a <container,pid>
> pair.
>
> This requires a bit more thought. Essentially that's what I was doing, but I
> mangled them into the same pid, using masking to add/remove the
> container for internal use. As pointed out by Alan(?), we can
> indeed reuse the same pid internally many times as long as we can
> distinguish during the pid-to-task_struct lookup. This is easily
> done because the caller provides the context, hence the container, for the
> lookup.
>
> Actions: The vpid_to_pid will disappear and the check for whether we
> are in the same container needs to be pushed down into the task
> lookup. The question remains whether the context of the task lookup
> will always remain the caller.

Any place the kernel saves a pid and then proceeds to signal it later:
at that later point in time it is possible you will be in the wrong
context.

This probably justifies having a kpid_t that has both the process
space id and the pid in it. For when the kernel is storing pids to
use as weak references, for signal purposes etc.

At least tty_io.c and fcntl.c have examples where the caller
may not have the proper context.
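
A sketch of such a type, assuming an upper-bits encoding (the 22-bit
split echoes a figure mentioned later in the thread; it is illustrative
only, not a posted implementation):

    /* Hypothetical kpid_t: pid space id in the high bits, pid in the
     * low bits. */
    typedef unsigned long kpid_t;

    #define KPID_PID_BITS  22
    #define KPID_PID_MASK  ((1UL << KPID_PID_BITS) - 1)

    static inline kpid_t mk_kpid(unsigned long space_id, pid_t pid)
    {
            return (space_id << KPID_PID_BITS) |
                   ((unsigned long)pid & KPID_PID_MASK);
    }

    static inline pid_t kpid_to_pid(kpid_t kpid)
    {
            return (pid_t)(kpid & KPID_PID_MASK);
    }

    static inline unsigned long kpid_space(kpid_t kpid)
    {
            return kpid >> KPID_PID_BITS;
    }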

> Doing so has an implication, namely that we are moving over to "system
> containers". The current implementation requires the vpid/pid only
> for the boundary condition at the top of the container (to rewrite
> pid=1) and its parent, and the fact that we wanted a global view
> through container=0. If said boundary were eliminated and we
> simply made a container a child of the initproc (pid=1), this would
> be unnecessary.
>
> Altogether this would provide private namespaces (as just suggested by Eric).
>
> The feeling would be that large parts of the patch could be reduced by
> this.

Simplified, and made easier to understand. I don't know if the number
of lines affected can or should be reduced.

One of my problems with your current approach is that it doesn't help
identify where you have problems.

I have found a specific example that your current patches get wrong,
because you make assumptions about which context is valid.

From function kernel/kthread.c:
> static void keventd_create_kthread(void *_create)
> {
> struct kthread_create_info *create = _create;
> int pid;
>
> /* We want our own signal handler (we take no signals by default). */
> pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
> if (pid < 0) {
> create->result = ERR_PTR(pid);
> } else {
> wait_for_completion(&create->started);
> create->result = find_task_by_pid(pid);
> }
> complete(&create->done);
> }

kernel_thread() is a light wrapper around do_fork().
do_fork returns a virtual pid.
find_task_by_pid takes a pid with the upper bits holding the process
space id.

Therefore if this function or a cousin of it was ever triggered
by a userspace application in a virtual context, find_task_by_pid
would fail to find the task structure.

The only way I know to make this change safely is to make compilation
of all functions that manipulate pids in possibly dangerous ways fail.
And then to manually and slowly fix them up.

That way, if something is missed, you get a compile error instead
of incorrect execution.

Eric

2006-01-22 06:44:13

by Kyle Moffett

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Jan 21, 2006, at 09:42, Eric W. Biederman wrote:
> Hubertus Franke <[email protected]> writes:
>
>> Actions: The vpid_to_pid will disappear and the check for whether
>> we are in the same container needs to be pushed down into the task
>> lookup. The question remains whether the context of the task lookup
>> will always remain the caller.
>
> Any place the kernel saves a pid and then proceeds to signal it
> later: at that later point in time it is possible you will be in
> the wrong context.
>
> This probably justifies having a kpid_t that has both the process
> space id and the pid in it. For when the kernel is storing pids to
> use as weak references, for signal purposes etc.

The kernel should not be saving a PID. The kernel should be sticking
a pointer to a struct task_struct somewhere (with appropriate
refcounting) and using that.

> The only way I know to make this change safely is to make
> compilation of all functions that manipulate pids in possibly
> dangerous ways fail. And then to manually and slowly fix them up.
>
> That way, if something is missed, you get a compile error instead
> of incorrect execution.

I agree. This is one of the things I really liked about the recent
mutex patch; it added a lot of checks to various codepaths to verify
at both compile time and run time that the code was correct.

My personal opinion is that we need to add a new race-free API, say
open("/proc/fork"), which forks a process and returns an open "process
handle", essentially a filehandle that references a particular
process. (Also, an open("/proc/self/handle") or something to return
a current-process handle) Through some method of signaling the
kernel (syscall, ioctl, some other?) a process can send a signal to
the process referenced by the handle, check its status, etc. A
process handle might be passed to other processes using a UNIX-domain
socket. You would be able to dup() a process handle and then
restrict the set of valid operations on the new process handle, so
that it could be passed to another process without giving that
process access to the full set of operations (check status only, not
able to send a signal, for example).

Obviously we would need to maintain support for the old interface for
some period of time, but I think the new one would make it much
easier to write simple race-free programs.
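
A hypothetical userspace sketch of that interface (neither /proc/fork
nor the ioctl below exists; both are invented to illustrate the idea):

    #include <fcntl.h>
    #include <signal.h>
    #include <sys/ioctl.h>

    #define PROC_SEND_SIGNAL  _IOW('p', 1, int)  /* invented */

    static void signal_child_racelessly(void)
    {
            /* open() forks; the opener gets back a handle that tracks
             * the child process itself, not its (reusable) pid */
            int child = open("/proc/fork", O_RDWR);

            if (child >= 0) {
                    ioctl(child, PROC_SEND_SIGNAL, SIGTERM);
                    close(child);
            }
    }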

Cheers,
Kyle Moffett

--
Simple things should be simple and complex things should be possible
-- Alan Kay



2006-01-22 15:49:53

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Kyle Moffett <[email protected]> writes:

> On Jan 21, 2006, at 09:42, Eric W. Biederman wrote:
>> Hubertus Franke <[email protected]> writes:
>>
>>> Actions: The vpid_to_pid will disappear and the check for whether we are in
>>> the same container needs to be pushed down into the task lookup. The question
>>> remains whether the context of the task lookup will always remain the caller.
>>
>> Any place the kernel saves a pid and then proceeds to signal it later: at
>> that later point in time it is possible you will be in the wrong context.
>>
>> This probably justifies having a kpid_t that has both the process
>> space id and the pid in it. For when the kernel is storing pids to
>> use as weak references, for signal purposes etc.
>
> The kernel should not be saving a PID. The kernel should be sticking a pointer
> to a struct task_struct somewhere (with appropriate refcounting) and using that.

That has all of the wrong semantics, and simply will not work.

>> The only way I know to make this change safely is to make compilation of all
>> functions that manipulate pids in possibly dangerous ways fail. And then to
>> manually and slowly fix them up.
>>
>> That way, if something is missed, you get a compile error instead of
>> incorrect execution.
>
> I agree. This is one of the things I really liked about the recent mutex
> patch; it added a lot of checks to various codepaths to verify at both compile
> time and run time that the code was correct.

And changing how we handle pids is, if anything, even more intrusive.
>
> My personal opinion is that we need to add a new race-free API, say
> open("/proc/fork"); that forks a process and returns an open "process handle",
> essentially a filehandle that references a particular process. (Also, an
> open("/proc/self/handle") or something to return a current-process handle)
> Through some method of signaling the kernel (syscall, ioctl, some other?) a
> process can send a signal to the process referenced by the handle, check its
> status, etc. A process handle might be passed to other processes using a
> UNIX-domain socket. You would be able to dup() a process handle and then
> restrict the set of valid operations on the new process handle, so that it
> could be passed to another process without giving that process access to the
> full set of operations (check status only, not able to send a signal, for
> example).

Ok. There are 2 sides to this: an internal kernel implementation,
and exporting to user space. Until we have something inside
the kernel, exporting it is silly.

A pointer to a task_struct, while it kind of sort of works, is not
a good solution. The problem is that in a lot of cases we register
a pid to get a signal or something similar and then we never unregister
it. So by using a pointer to a task_struct you effectively hold the
process in memory forever.

Then there is the second problem. A pointer to a task_struct is
insufficient. It does not handle the case of process groups, which
are equally important.

Further, a task_struct points at a thread, not at a process, so holding
a pointer to it would not do what you would expect.

Possibly holding a struct pid would be interesting.

> Obviously we would need to maintain support for the old interface for some
> period of time, but I think the new one would make it much easier to write
> simple race-free programs.

Well, since this is the user space interface, we would need to maintain
the old interface for as long as the kernel runs on existing architectures
or there are user space programs using it. Even in plan9 they weren't
creative enough to do away with PIDs.

Eric

2006-01-22 15:56:12

by Arjan van de Ven

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api


>
> A pointer to a task_struct, while it kind of sort of works, is not
> a good solution. The problem is that in a lot of cases we register
> a pid to get a signal or something similar and then we never unregister
> it. So by using a pointer to a task_struct you effectively hold the
> process in memory forever.

this is not right. Both the PID and the task struct have the exact same
lifetime rules, they HAVE to, to guard against pid reuse problems.


2006-01-22 16:24:57

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Arjan van de Ven <[email protected]> writes:

>>
>> A pointer to a task_struct, while it kind of sort of works, is not
>> a good solution. The problem is that in a lot of cases we register
>> a pid to get a signal or something similar and then we never unregister
>> it. So by using a pointer to a task_struct you effectively hold the
>> process in memory forever.
>
> this is not right. Both the PID and the task struct have the exact same
> lifetime rules, they HAVE to, to guard against pid reuse problems.

Yes, PIDs are reserved for the lifetime of the task_struct (barring minor
details). There are actually a few races in /proc where it can
see the task_struct after the pid has been freed (see the pid_alive macro
in sched.h).

However when used as a reference the number can live as long as you
want. The classic example is a pid file that can exist even after
you reboot a machine.

So currently a use of a PID as a reference to processes or process
groups can last forever. An example of this in the kernel is
the result of fcntl(fd, F_SETOWN). The session of a tty is similar.
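
For concreteness, the F_SETOWN case from userspace (this is the real
fcntl API; the staleness of the stored pid is exactly the issue):

    #include <unistd.h>
    #include <fcntl.h>

    static void claim_sigio(int fd)
    {
            /* Stores a bare pid in the file's owner field. Nothing
             * unregisters it when this process exits, so after pid
             * reuse the kernel could signal an unrelated process. */
            fcntl(fd, F_SETOWN, getpid());
    }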

Since the in-kernel references have a lifetime that is completely
different from the lifetime of a process or a PID, it is
not safe to simply replace such references with a direct reference
to a task_struct (besides being technically impossible). Adding
those references could potentially extend the lifespan of a task_struct
to the life of the kernel, depending on the reference.

Eric

2006-01-23 18:38:49

by Hubertus Franke

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Eric W. Biederman wrote:
> Hubertus Franke <[email protected]> writes:
>

> ...
>
>>Actions: The vpid_to_pid will disappear and the check for whether we are in the
>>same container needs to be pushed down into the task lookup. The question
>>remains whether the context of the task lookup will always remain the caller.
>
>
> You don't need a same container check. If something is in another container
> it becomes invisible to you.
>

Eric, agreed... that was implied by me (but poorly worded). What I meant (let's try this
again) is that the context defines/provides the namespace in which the lookup
is performed; hence, as you state, things in different containers
(namespaces) are naturally invisible to you.

>
>>Doing so has an implication, namely that we are moving over to "system
>>containers". The current implementation requires the vpid/pid only for the
>>boundary condition at the top of the container (to rewrite pid=1) and its
>>parent, and the fact that we wanted a global view through container=0.
>>If said boundary were eliminated and we simply made a container a child of
>>the initproc (pid=1), this would be unnecessary.
>>
>>Altogether this would provide private namespaces (as just suggested by Eric).
>>
>>The feeling would be that large parts of the patch could be reduced by this.
>
>
> I concur. Except I think the initial impact could still be large.
> It may be worth breaking all users of pids just so we audit them.
>
> But that will certainly result in no long term cost, or runtime overhead.
>
>
>>What we need is a new system call (similar to vserver), or maybe we can continue
>>the /proc approach for now...
>>
>>sys_exec_container(const char *container_name, pid_t pid, unsigned int flags,
>>const char *argv[], const char *envp[]);
>>
>>exec_container creates a new container (if indicated in flags) and a new task in
>>it that reports to the parent initproc.
>>If a non-zero pid is specified we use that pid; otherwise the system will
>>allocate one. Finally it creates a new session id, chroots, and execs the
>>specified program.
>>
>>What we lose with this is the session and the tty, which Cedric described as
>>an application container...
>>
>>The sys_exec_container(...) seems to be similar to what Eric just called
>>clone_namespace()
>
>
> Similar. But I was actually talking about just adding another flag to
> sys_clone, the syscall underlying fork(). Basically it is just another
> resource to share or not share.
>
> Eric
>

That's a good idea... right now we simply did this through a flag left by the call
to the /proc/container fs (awkward at best, but it didn't break the API).
I have a concern wrt doing it during fork, namely the sharing of resources.
We obviously are looking at some constraints here wrt sharing. We need to
ensure that this isn't a thread etc. that will share resources
across "containers" (which then later aren't migratable due to that sharing).
So doing the fork_exec() atomically would avoid that problem.



2006-01-23 18:49:00

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Hubertus Franke <[email protected]> writes:

> That's a good idea... right now we simply did this through a flag left by the
> call to the /proc/container fs (awkward at best, but it didn't break the API).
> I have a concern wrt doing it during fork, namely the sharing of resources.
> We obviously are looking at some constraints here wrt sharing. We need to
> ensure that this isn't a thread etc. that will share resources
> across "containers" (which then later aren't migratable due to that sharing).
> So doing the fork_exec() atomically would avoid that problem.

Checking that we aren't sharing things to become a thread is fairly
straightforward. do_fork already has similar checks in place.

This sounds like a classic case of "if you don't want that, then don't
do that."

Eric





2006-01-23 18:51:04

by Hubertus Franke

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Eric W. Biederman wrote:
> Hubertus Franke <[email protected]> writes:
>
>
>
> Any place the kernel saves a pid and then proceeds to signal it later:
> at that later point in time it is possible you will be in the wrong
> context.
>

Yes, that's possible. In the current patch that is not a problem, because
the internal pid (aka kpid) == <vpid,containerid> mangled together.
So in those cases, the kernel would have to keep <pid, container_id>.

> This probably justifies having a kpid_t that has both the process
> space id and the pid in it. For when the kernel is storing pids to
> use as weak references, for signal purposes etc.
>

And that's what the current patch does. The only thing is we did not rename
everything to kpid_t!

> At least tty_io.c and fcntl.c have examples where the caller
> may not have the proper context.

Can you point those out directly .. thanks..

>
>
>>Doing so has an implication, namely that we are moving over to "system
>>containers". The current implementation requires the vpid/pid only
>>for the boundary condition at the top of the container (to rewrite
>>pid=1) and its parent, and the fact that we wanted a global view
>>through container=0. If said boundary were eliminated and we
>>simply made a container a child of the initproc (pid=1), this would
>>be unnecessary.
>>
>>Altogether this would provide private namespaces (as just suggested by Eric).
>>
>>The feeling would be that large parts of the patch could be reduced by
>>this.
>
>
> Simplified, and made easier to understand. I don't know if the number
> of lines affected can or should be reduced.

Unless we do it, we won't know for sure...

>
> One of my problems with your current approach is that it doesn't help
> identify where you have problems.
>
> I have found a specific example that your current patches get wrong,
> because you make assumptions about which context is valid.
>
> From function kernel/kthread.c:
>
>>static void keventd_create_kthread(void *_create)
>>{
>> struct kthread_create_info *create = _create;
>> int pid;
>>
>> /* We want our own signal handler (we take no signals by default). */
>> pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
>> if (pid < 0) {
>> create->result = ERR_PTR(pid);
>> } else {
>> wait_for_completion(&create->started);
>> create->result = find_task_by_pid(pid);
>> }
>> complete(&create->done);
>>}
>

So what we were thinking (and have experimented a bit with) is using
SPARSE for this. It didn't go too far, but it has potential.
Another possibility is to really introduce a pid_t (user perspective)
and a kpid_t (kernel pid <spaceid, pid>) type explicitly.
Then this would be solved.

>
> kernel_thread() is a light wrapper around do_fork().
> do_fork returns a virtual pid.
> find_task_by_pid takes a pid with the upper bits holding the process
> space id.
>
> Therefore if this function or a cousin of it was ever triggered
> by a userspace application in a virtual context, find_task_by_pid
> would fail to find the task structure.
>
> The only way I know to make this change safely is to make compilation
> of all functions that manipulate pids in possibly dangerous ways fail.
> And then to manually and slowly fix them up.

See above (SPARSE or type strictness).
And yes, we internally discussed that, but the changes might be huge
to change all the occurrences.
It would be good to know whether going this route will lead us to the
promised land or not.

>
> That way, if something is missed, you get a compile error instead
> of incorrect execution.


>
> Eric
>


2006-01-23 19:28:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Hubertus Franke <[email protected]> writes:

> Eric W. Biederman wrote:
>> Hubertus Franke <[email protected]> writes:
>> Any place the kernel saves a pid and then proceeds to signal it later:
>> at that later point in time it is possible you will be in the wrong
>> context.
>>
>
> Yes, that's possible. In the current patch that is not a problem, because
> the internal pid (aka kpid) == <vpid,containerid> mangled together.
> So in those cases, the kernel would have to keep <pid, container_id>.

Agreed, and for the internal implementation I think having them mangled
together makes sense, so long as we never export that form to userspace.

>> This probably justifies having a kpid_t that has both the process
>> space id and the pid in it. For when the kernel is storing pids to
>> use as weak references, for signal purposes etc.
>>
>
> And that's what the current patch does. The only thing is we did not rename
> everything to kpid_t!

Yep. But because of that you couldn't detect mixing of pid and kpid.

>> At least tty_io.c and fcntl.c have examples where the caller
>> may not have the proper context.
>
> Can you point those out directly .. thanks..

Short version: ttys send signals on hangup, and f_setown can trigger signals
being sent.

Eric



2006-01-23 21:11:22

by Alan

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Llu, 2006-01-23 at 12:28 -0700, Eric W. Biederman wrote:
> > Yes, that's possible. In the current patch that is not a problem, because
> > the internal pid (aka kpid) == <vpid,containerid> mangled together.
> > So in those cases, the kernel would have to keep <pid, container_id>.
>
> Agreed, and for the internal implementation I think having them mangled
> together makes sense, so long as we never export that form to userspace.

You have to refcount the container ids anyway or you may have stale
container references and end up reusing them.

2006-01-23 21:31:41

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Alan Cox <[email protected]> writes:

> On Llu, 2006-01-23 at 12:28 -0700, Eric W. Biederman wrote:
>> > Yes, that's possible. In the current patch that is not a problem, because
>> > the internal pid (aka kpid) == <vpid,containerid> mangled together.
>> > So in those cases, the kernel would have to keep <pid, container_id>.
>>
>> Agreed, and for the internal implementation I think having them mangled
>> together makes sense, so long as we never export that form to userspace.
>
> You have to refcount the container ids anyway or you may have stale
> container references and end up reusing them.

The short observation is that currently we use at most 22 bits of the pid
space, and we don't need a huge number of containers, so combining them
into one integer makes sense for an efficient implementation, and it
is cheaper than comparing pointers.

Additional identifiers are really not necessary to user space and providing
them is one more thing that needs to be virtualized. We can already
talk about them indirectly by referring to processes that use them.

And there will be at least one process id assigned to the pid space
from the outside pid space, unless we choose to break waitpid and friends.

I just don't want a neat implementation trick to cause us maintenance grief.

Eric

2006-01-23 22:15:30

by Hubertus Franke

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Eric W. Biederman wrote:
> Alan Cox <[email protected]> writes:
>
>
>>On Llu, 2006-01-23 at 12:28 -0700, Eric W. Biederman wrote:
>>
>>>>Yes, that's possible. In the current patch that is not a problem, because
>>>>the internal pid (aka kpid) == <vpid,containerid> mangled together.
>>>>So in those cases, the kernel would have to keep <pid, container_id>.
>>>
>>>Agreed, and for the internal implementation I think having them mangled
>>>together makes sense, so long as we never export that form to userspace.
>>
>>You have to refcount the container ids anyway or you may have stale
>>container references and end up reusing them.
>
>
> The short observation is that currently we use at most 22 bits of the pid
> space, and we don't need a huge number of containers, so combining them
> into one integer makes sense for an efficient implementation, and it
> is cheaper than comparing pointers.
> Additional identifiers are really not necessary to user space and providing
> them is one more thing that needs to be virtualized. We can already
> talk about them indirectly by referring to processes that use them.
>
> And there will be at least one process id assigned to the pid space
> from the outside pid space, unless we choose to break waitpid and friends.
>
> I just don't want a neat implementation trick to cause us maintenance grief.
>
> Eric
>

In that case, I think we do require the current vpid_to_pid() translations
in order to translate the external user pid (relative to the namespace)
into one that combines the namespace (aka container_id) with the external pid,
exactly how it is done today.
What will slightly change is the low level implementations of

inline pid_t pid_to_vpid_ctx(pid_t pid, const struct task_struct *ctx);
pid_t __pid_to_vpid_ctx_excp(pid_t pid, int pidspace_id, const struct task_struct *ctx);

and the reverse.
The VPID_2_PID and PID_2_VPID still remain at the same locations.

Did I get your comments correctly, Eric ?..

Thanks as usual
-- Hubertus



2006-01-24 00:22:16

by Alan

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Llu, 2006-01-23 at 14:30 -0700, Eric W. Biederman wrote:
> The short observation is that currently we use at most 22 bits of the pid
> space, and we don't need a huge number of containers, so combining them
> into one integer makes sense for an efficient implementation, and it
> is cheaper than comparing pointers.

Currently. In addition it becomes more costly the moment you have to
start masking them. Remember the point of this was to virtualise the
pid, so you are going to add a ton of masking versus a cheap extra
comparison from the same cache line. And you lose pid space you may well
need in the future for the sake of a quick hack.

> And there will be at least one process id assigned to the pid space
> from the outside pid space, unless we choose to break waitpid and friends.

That comes out in the wash because it is already done by process tree
pointers anyway. It has to be because using ->ppid would be racy.

Alan

2006-01-24 06:57:02

by Arjan van de Ven

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api


>
> In that case, I think we do require the current vpid_to_pid() translations
> in order to translate the external user pid (relative to the namespace)
> into one that combines the namespace (aka container_id) with the external pid,
> exactly how it is done today.
> What will slightly change is the low level implementations of
>
> inline pid_t pid_to_vpid_ctx(pid_t pid, const struct task_struct *ctx);
> pid_t __pid_to_vpid_ctx_excp(pid_t pid, int pidspace_id, const struct task_struct *ctx);
>
> and the reverse.
> The VPID_2_PID and PID_2_VPID still remain at the same locations.
>
> Did I get your comments correctly, Eric ?..

please call it 'userpid' not 'vpid', to make it clearer what the pid is
used for and what domain it is in.


2006-01-24 19:28:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Alan Cox <[email protected]> writes:

> On Llu, 2006-01-23 at 14:30 -0700, Eric W. Biederman wrote:
>> The short observation is that currently we use at most 22 bits of the pid
>> space, and we don't need a huge number of containers, so combining them
>> into one integer makes sense for an efficient implementation, and it
>> is cheaper than comparing pointers.
>
> Currently. In addition it becomes more costly the moment you have to
> start masking them. Remember the point of this was to virtualise the
> pid, so you are going to add a ton of masking versus a cheap extra
> comparison from the same cache line. And you lose pid space you may well
> need in the future for the sake of a quick hack.

I do disagree that, as I am envisioning it, it will get in the way, but I
do agree that putting them in the unsigned long may be overkill.

There is at least NFS lockd that appreciates having a single integer
per process unique identifier. So there is a practical basis for
wanting such a thing.

At least for this first round I think talking about a kpid
as a <container, pid> pair makes a lot of sense for the moment, as
the other implementations just confuse things.

>> And there will be at least one process id assigned to the pid space
>> from the outside pid space, unless we choose to break waitpid and friends.
>
> That comes out in the wash because it is already done by process tree
> pointers anyway. It has to be because using ->ppid would be racy.

Possibly. Again, it is one of the more interesting cases, to get
just right.

However it looks to me that the biggest challenge for development right
now is the size of a patch to change any one of these things.
So it looks to me like the first step is to add wrappers for common idioms
that use pids, like printing the name of a task or testing if it is the
init task or if it is an idle task.

Or can you think of a case where it would be wise to leave
both the type and size of current->pid alone?

Eric

2006-01-24 19:36:36

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Hubertus Franke <[email protected]> writes:

> In that case, I think we do require the current vpid_to_pid() translations
> in order to translate the external user pid (relative to the namespace)
> into one that combines the namespace (aka container_id) with the external pid,
> exactly how it is done today.
> What will slightly change is the low level implementations of
>
> inline pid_t pid_to_vpid_ctx(pid_t pid, const struct task_struct *ctx);
> pid_t __pid_to_vpid_ctx_excp(pid_t pid, int pidspace_id, const struct task_struct *ctx);
>
> and the reverse.
> The VPID_2_PID and PID_2_VPID still remain at the same locations.
>
> Did I get your comments correctly, Eric ?..

Well we may need that. For the moment let's consider putting both
a kpid and a upid in the task_struct, and elsewhere. Basically I don't think
translation is necessary in the common case.

However let's look at a single practical case to see how it would need
to be implemented.

struct fown_struct: every file has one, and you can modify it both on
a socket, with the ioctls FIOSETOWN, SIOCSPGRP, FIOGETOWN, and SIOCGPGRP, and
on a normal file handle, with fcntl F_SETOWN and F_GETOWN.

Since a struct file can be passed between processes in different
pid spaces using unix domain sockets we cannot count on the context
of the signaler to be the same as the context of the setter.

So we need to look at how to handle this case cleanly, and safely.
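
Under a kpid scheme the stored owner would have to carry the pid space
as well; a sketch (abbreviated and hypothetical, not the real struct
fown_struct layout):

    /* Hypothetical: the owner names both the pid space and the pid
     * (or -pgrp), so a signal raised later lands in the right space
     * no matter whose context the signaler runs in. */
    struct fown_sketch {
            kpid_t owner;   /* <pid space, pid or -pgrp> for SIGIO */
            int    signum;  /* signal to deliver on IO */
    };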

Eric

2006-01-24 21:09:38

by Hubertus Franke

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Eric W. Biederman wrote:
> Hubertus Franke <[email protected]> writes:
>
>
>>In that case, I think we do require the current vpid_to_pid() translations
>>in order to translate the external user pid (relative to the namespace)
>>into one that combines the namespace (aka container_id) with the external pid,
>>exactly how it is done today.
>>What will slightly change is the low level implementations of
>>
>>inline pid_t pid_to_vpid_ctx(pid_t pid, const struct task_struct *ctx);
>>pid_t __pid_to_vpid_ctx_excp(pid_t pid, int pidspace_id, const struct task_struct *ctx);
>>
>>and the reverse.
>>The VPID_2_PID and PID_2_VPID still remain at the same locations.
>>
>>Did I get your comments correctly, Eric ?..
>
>
> Well we may need that. For the moment let's consider putting both
> a kpid and a upid in the task_struct, and elsewhere. Basically I don't think
> translation is necessary in the common case.

OK, for discussion purposes, no problem... whatever is best in the end
is the way to go.
Abstractly speaking, mangling the <container,upid> tuple into the same
long int is an implementation detail.

>
> However let's look at a single practical case to see how it would need
> to be implemented.
>
> struct fown_struct: every file has one, and you can modify it both on
> a socket, with the ioctls FIOSETOWN, SIOCSPGRP, FIOGETOWN, and SIOCGPGRP, and
> on a normal file handle, with fcntl F_SETOWN and F_GETOWN.
>
> Since a struct file can be passed between processes in different
> pid spaces using unix domain sockets we cannot count on the context
> of the signaler to be the same as the context of the setter.

If you follow the patch set, we do distinguish the context case (we might have
missed a few here and there, as you already pointed out): going into
the kernel we always take the context of the caller, and coming out of the
kernel (kpid -> upid) we use the appropriate context.

static inline pid_t pid_to_vpid_ctx(pid_t pid, const struct task_struct *ctx)

>
> So we need to look at how to handle this case cleanly, and safely.

-- Hubertus

2006-01-24 21:11:06

by Alan

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Maw, 2006-01-24 at 12:26 -0700, Eric W. Biederman wrote:
> There is at least NFS lockd that appreciates having a single integer
> per process unique identifier. So there is a practical basis for
> wanting such a thing.

Which gets us back to refcounting.

>> At least for this first round I think talking about a kpid
>> as a <container, pid> pair makes a lot of sense for the moment, as
>> the other implementations just confuse things.

As an abstract object, a kpid to me means a single identifier which
uniquely identifies the process and which, in its component parts, be they
pointers or not, uniquely identifies the process in the container and the
container in the system, both correctly refcounted against re-use.

>> However it looks to me that the biggest challenge for development right
>> now is the size of a patch to change any one of these things.

That's where we disagree strongly. Wrappers hide, confuse and obscure. We
want the workings brutally and clearly visible so that people don't make
assumptions and have nasty accidents. It's like typedefs and overuse of
defines.

Alan

2006-01-24 21:15:42

by Arjan van de Ven

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Tue, 2006-01-24 at 21:11 +0000, Alan Cox wrote:
> On Maw, 2006-01-24 at 12:26 -0700, Eric W. Biederman wrote:
> > There is at least NFS lockd that appreciates having a single integer
> > per process unique identifier. So there is a practical basis for
> > wanting such a thing.
>
> Which gets us back to refcounting.
>
> > At least for this first round I think talking about a kpid
> > as a <container, pid> pair makes a lot of sense for the moment, as
> > the other implementations just confuse things.
>
> As an abstract object, a kpid to me means a single identifier which
> uniquely identifies the process and which, in its component parts, be they
> pointers or not, uniquely identifies the process in the container and the
> container in the system, both correctly refcounted against re-use.

then why not just straight use the task struct pointer for this? It's
guaranteed unique.. ;)


2006-01-25 09:14:55

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Alan Cox <[email protected]> writes:

>> However it looks to me that the biggest challenge for development right
>> now is the size of a patch to change any one of these things.
>
> That's where we disagree strongly. Wrappers hide, confuse and obscure. We
> want the workings brutally and clearly visible so that people don't make
> assumptions and have nasty accidents. It's like typedefs and overuse of
> defines.

I totally agree that we want the uses of pids to be clear and not over-abstracted.
However, most places that reference pids are debug statements and are simply
line noise in any patch.

I threw together a casual patch for purposes of discussion earlier.
It was not fully polished, just a rough draft for discussion, and it was so big
most people didn't even receive it. So regardless of other considerations
the sheer size is a problem, even if abstractions can't help.

I think adding a helper for the common debugging idiom of printing out
the current task is generally useful, and is worth considering on its own merits.

In particular, does this look like a sane piece of code to add to the kernel?
This is based on analogy with dev_printk.

Either tsk_printk or something like NIPQUAD is what I am thinking about.

Eric


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0cfcd1c..d5dcbde 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -880,6 +880,19 @@ static inline pid_t process_group(struct
return tsk->signal->pgrp;
}

+/* debugging and troubleshooting/diagnostic helpers. */
+#define tsk_printk(level, tsk, format, arg...) \
+ printk(level "%s(%d): " format , (tsk)->comm , (tsk)->pid , ## arg)
+
+#define tsk_dbg(tsk, format, arg...) \
+ tsk_printk(KERN_DEBUG , tsk , format, ## arg)
+#define tsk_err(tsk, format, arg...) \
+ tsk_printk(KERN_ERR , tsk , format, ## arg)
+#define tsk_info(tsk, format, arg...) \
+ tsk_printk(KERN_INFO , tsk , format, ## arg)
+#define tsk_warn(tsk, format, arg...) \
+ tsk_printk(KERN_WARNING , tsk , format, ## arg)
+
/**
* pid_alive - check that a task structure is not stale
* @p: Task structure to be checked.
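
A call site using these helpers would then read, for example
(hypothetical placement):

    if (unlikely(err))
            tsk_err(current, "request failed with error %d\n", err);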

2006-01-25 09:52:28

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Alan Cox <[email protected]> writes:

> On Maw, 2006-01-24 at 12:26 -0700, Eric W. Biederman wrote:

>> At least for this first round I think talking about a kpid
>> as a <container, pid> pair makes a lot of sense for the moment, as
>> the other implementations just confuse things.
>
> As an abstract object, a kpid to me means a single identifier which
> uniquely identifies the process and which, in its component parts, be they
> pointers or not, uniquely identifies the process in the container and the
> container in the system, both correctly refcounted against re-use.

Correct.

Currently by using pids internally we are not correctly refcounted
against reuse. Nor in the process group case do we even have an
object off of which we can hang a reference count.

In the case of multiple instances of a process space the problem
is much more acute as we must properly ref count the pid space as
well.

Now to further make this fun we have variables like spawnpid in
drivers/char/vt_ioctl.c and drivers/char/keyboard.c that
persist indefinitely, which causes problems for most traditional
reference counting techniques.

Further in cases where the references persist indefinitely we don't
want to pin the task_struct in memory indefinitely even after
the task has exited and its zombie has been reaped.

So how do we solve this problem?

There are two possible approaches I can see to solving this problem.
1) Use a non-pointer based kpid and simply accept identifier
wrap-around problems with kpids just like we currently accept
these problems with pids.

2) Implement weak references for kpids.

Semantically a weak reference is a pointer that becomes NULL when the
object it refers to goes away.

A couple of days ago I conducted an experiment, to see if I could
implement this in the kernel and surprisingly it is fairly
straightforward to do. First you define a weak kpid as a kpid with a
list_head attached, and whenever you set up a weak kpid you
register it with the pid hash table.

Then in detach_pid when the last reference to the pid goes away, you
walk the list of weak kpids and you NULL the appropriate entries.

This seems to solve the reference counting problem neatly and
without needing to disturb the logic of the existing code. Even
outside the context of multiple pid spaces then I think weak
kpids are desirable.

Thoughts?

from kernel/pid.c:
> void fastcall detach_pid(task_t *task, enum pid_type type)
> {
> int tmp, nr;
>
> nr = __detach_pid(task, type);
> if (!nr)
> return;

Walk the list of weak kpids here.

>
> for (tmp = PIDTYPE_MAX; --tmp >= 0; )
> if (tmp != type && find_pid(tmp, nr))
> return;
>
> free_pidmap(nr);
> }
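
As a rough sketch of what a weak kpid and that NULLing walk might look
like (hypothetical names and types, not code from any posted patch):

/* a weak kpid: registered with the pid hash so that detach_pid()
 * can find and invalidate it when the pid finally goes away */
struct weak_kpid {
	struct list_head node;		/* on a per-pid weak list */
	struct task_struct *task;	/* NULLed when the pid dies */
};

/* hypothetical helper, called from detach_pid() as marked above */
static void null_weak_kpids(struct list_head *weak_list)
{
	struct weak_kpid *wk;

	list_for_each_entry(wk, weak_list, node)
		wk->task = NULL;	/* the weak reference becomes NULL */
}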


Eric

2006-01-25 09:59:36

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Arjan van de Ven <[email protected]> writes:

> On Tue, 2006-01-24 at 21:11 +0000, Alan Cox wrote:
>> On Maw, 2006-01-24 at 12:26 -0700, Eric W. Biederman wrote:
>> > There is at least NFS lockd that appreciates having a single integer
>> > per process unique identifier. So there is a practical basis for
>> > wanting such a thing.
>>
>> Which gets us back to refcounting.
>>
>> > At least for this first round I think talking about a kpid
>> > as a container, pid pair makes a lot of sense for the moment, as
>> > the other implementations just confuse things.
>>
>> As an abstract object a kpid to me means a single identifier which
>> uniquely identifies the process and which in its component parts be they
>> pointers or not uniquely identifies the process in the container and the
>> container in the system, both correctly refcounted against re-use.
>
> then why not just straight use the task struct pointer for this? It's
> guaranteed unique.. ;)

Actually I think that is a very sensible solution to this problem.
It does double or triple the length of the string passed to lockd
and is an information leak about which kernel addresses you are
running out of, which may be undesirable from a security perspective,
but I think that will fix the practical problem.

Reference counting in this case is not an issue, as these are
per process locks and should be freed up when everything goes.

I have a weird memory that simply making the string long and using
%p (current) didn't work as well as %d (current->kpid), but that is something
very hard to test, as usually even with multiple pid spaces you don't
get pid reuse and the errors from NFS are not at all clear that pid
reuse is what is causing problems. So I don't have good data on
that situation.

Eric

2006-01-25 15:11:14

by Trond Myklebust

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Wed, 2006-01-25 at 02:58 -0700, Eric W. Biederman wrote:
> >> On Maw, 2006-01-24 at 12:26 -0700, Eric W. Biederman wrote:
> >> > There is at least NFS lockd that appreciates having a single integer
> >> > per process unique identifier. So there is a practical basis for
> >> > wanting such a thing.

The NFS lock manager mainly wants a unique 32-bit identifier that can
follow clone(CLONE_FILES). The reason is that the Linux VFS is forced to
use the pointer to the file table as the "process identifier" for posix
locks (i.e. fcntl() locks).

Cheers,
Trond

2006-01-25 18:03:24

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Trond Myklebust <[email protected]> writes:

> On Wed, 2006-01-25 at 02:58 -0700, Eric W. Biederman wrote:
>> >> On Maw, 2006-01-24 at 12:26 -0700, Eric W. Biederman wrote:
>> >> > There is at least NFS lockd that appreciates having a single integer
>> >> > per process unique identifier. So there is a practical basis for
>> >> > wanting such a thing.
>
> The NFS lock manager mainly wants a unique 32-bit identifier that can
> follow clone(CLONE_FILES). The reason is that the Linux VFS is forced to
> use the pointer to the file table as the "process identifier" for posix
> locks (i.e. fcntl() locks).

Ok. I think I was thinking of a different case, but if I missed one
this could explain the weirdness I was seeing.

Let me list the cases I know of and see if I hit what
you are thinking of.

fs/nfs/nfs3proc.c:nfs3_proc_create()
For O_EXCL we have arg.verifier = current->pid.


fs/lockd/clntproc.c:nlmclnt_setlockargs()
We have: lock->oh.len = sprintf(req->a_owner, "%d@%s",
current->pid, system_utsname.nodename);

I think this is the fcntl() case.
I would suggest fl_pid might have something to do with it
but that is part of the flock based locking.

So I'm not certain I see the part of NFS you are referring to.

Eric

2006-01-25 19:30:47

by Trond Myklebust

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Wed, 2006-01-25 at 11:01 -0700, Eric W. Biederman wrote:
> fs/nfs/nfs3proc.c:nfs3_proc_create()
> For O_EXCL we have arg.verifier = current->pid.

Yes, but that does not result in any permanent state that would be tied
to the pid on the server. The verifier here is used only to ensure
idempotency of the exclusive create RPC call.

> fs/lockd/clntproc.c:nlmclnt_setlockargs()
> We have: lock->oh.len = sprintf(req->a_owner, "%d@%s",
> current->pid, system_utsname.nodename);
>
> I think this is the fcntl() case.
> I would suggest fl_pid might have something to do with it
> but that is part of the flock based locking.

That name is not interpreted by the NLM server. It is, AFAIK, only used
for debugging purposes.
nlm_find_lockowner() is used to define a unique identifier that is
supposed to be sent to the server as the 'pid'.

Cheers,
Trond

2006-01-25 22:00:55

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Trond Myklebust <[email protected]> writes:

> On Wed, 2006-01-25 at 11:01 -0700, Eric W. Biederman wrote:
>> fs/nfs/nfs3proc.c:nfs3_proc_create()
>> For O_EXCL we have arg.verifier = current->pid.
>
> Yes, but that does not result in any permanent state that would be tied
> to the pid on the server. The verifier here is used only to ensure
> idempotency of the exclusive create RPC call.
>
>> fs/lockd/clntproc.c:nlmclnt_setlockargs()
>> We have: lock->oh.len = sprintf(req->a_owner, "%d@%s",
>> current->pid, system_utsname.nodename);
>>
>> I think this is the fcntl() case.
>> I would suggest fl_pid might have something to do with it
>> but that is part of the flock based locking.
>
> That name is not interpreted by the NLM server. It is, AFAIK, only used
> for debugging purposes.

Ok, I thought it might have been compared for equality someplace.

> nlm_find_lockowner() is used to define a unique identifier that is
> supposed to be sent to the server as the 'pid'.

Ok interesting.

All I know for certain is that with 2 pid spaces using the
same nfs mount I was confusing something with regard to locking.


Eric

2006-01-26 20:01:44

by Herbert Poetzl

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Sun, Jan 22, 2006 at 09:24:27AM -0700, Eric W. Biederman wrote:
> Arjan van de Ven <[email protected]> writes:
>
> >>
> >> A pointer to a task_struct, while it kind of sort of works, is
> >> not a good solution. The problem is that in a lot of cases we
> >> register a pid to get a signal or something similar and then we
> >> never unregister it. So by using a pointer to a task_struct you
> >> effectively hold the process in memory forever.

> > this is not right. Both the PID and the task struct have the exact
> > same lifetime rules, they HAVE to, to guard against pid reuse
> > problems.

> Yes PIDs are reserved for the lifetime of the task_struct (barring minor
> details). There are actually a few races in /proc where it can see the
> task_struct after the pid has been freed (see the pid_alive macro in
> sched.h)
>
> However when used as a reference the number can live as long as you
> want. The classic example is a pid file that can exist even after you
> reboot a machine.
>
> So currently a use of a PID as a reference to processes or process
> groups can last forever. An example of this in the kernel is the
> result of fcntl(fd, F_SETOWN). The session of a tty is similar.
>
> Since the in kernel references have a lifetime that is completely
> different than the lifetime of a process or a PID. It is not safe
> to simply replace such references with a direct reference to a
> task_struct (besides being technically impossible). Adding those
> references could potentially increase the lifespan of a task_struct
> to the life of the kernel depending on the reference.

well, yes, but wouldn't that be the RightThing(tm)
anyway? because 'referencing' something via a pid, then
letting the task holding the pid go away and even be
replaced by a new one (with the same pid) which then
will get suddenly signaled from somewhere, just because
the pid matches seems very broken to me ...

best,
Herbert

> Eric
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2006-01-26 20:23:17

by Herbert Poetzl

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Wed, Jan 25, 2006 at 02:51:22AM -0700, Eric W. Biederman wrote:
> Alan Cox <[email protected]> writes:
>
> > On Maw, 2006-01-24 at 12:26 -0700, Eric W. Biederman wrote:
>
> >> At least for this first round I think talking about a kpid
> >> as a container, pid pair makes a lot of sense for the moment, as
> >> the other implementations just confuse things.
> >
> > As an abstract object a kpid to me means a single identifier which
> > uniquely identifies the process and which in its component parts be they
> > pointers or not uniquely identifies the process in the container and the
> > container in the system, both correctly refcounted against re-use.
>
> Correct.
>
> Currently by using pids internally we are not correctly refcounted
> against reuse. Nor in the process group case do we even have an
> object off of which we can hang a reference count.
>
> In the case of multiple instances of a process space the problem
> is much more acute as we must properly ref count the pid space as
> well.
>
> Now to further make this fun we have variables like spawnpid in
> drivers/char/vt_ioctl.c and drivers/char/keyboard.c that
> persist indefinitely, which causes problems for most traditional
> reference counting techniques.
>
> Further in cases where the references persist indefinitely we don't
> want to pin the task_struct in memory indefinitely even after
> the task has exited and its zombie has been reaped.
>
> So how do we solve this problem?
>
> There are two possible approaches I can see to solving this problem.
> 1) Use a non-pointer based kpid and simply accept identifier
> wrap-around problems with kpids just like we currently accept
> these problems with pids.

sounds like a poor approach (well similar to the
current one, except that the issues might get more
complex when processes are signalled or referenced
across pid spaces :) ...

anyway, if that would be the aim, it could be done
much simpler by 'just' adding a v/upid field to the
task struct and use that for everything userspace
related (i.e. locating tasks, sending signals, etc)
no need to change the current *pid entries at all

best,
Herbert

> 2) Implement weak references for kpids.
>
> Semantically a weak reference is a pointer that becomes NULL when the
> object it refers to goes away.
>
> A couple of days ago I conducted an experiment, to see if I could
> implement this in the kernel and surprisingly it is fairly
> straightforward to do. First you define a weak kpid as a kpid with a
> list_head attached, and whenever you set up a weak kpid you
> register it with the pid hash table.
>
> Then in detach_pid when the last reference to the pid goes away, you
> walk the list of weak kpids and you NULL the appropriate entries.
>
> This seems to solve the reference counting problem neatly and
> without needing to disturb the logic of the existing code. Even
> outside the context of multiple pid spaces then I think weak
> kpids are desirable.
>
> Thoughts?
>
> from kernel/pid.c:
> > void fastcall detach_pid(task_t *task, enum pid_type type)
> > {
> > int tmp, nr;
> >
> > nr = __detach_pid(task, type);
> > if (!nr)
> > return;
>
> Walk the list of weak kpids here.
>
> >
> > for (tmp = PIDTYPE_MAX; --tmp >= 0; )
> > if (tmp != type && find_pid(tmp, nr))
> > return;
> >
> > free_pidmap(nr);
> > }
>
>
> Eric
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2006-01-27 08:28:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Herbert Poetzl <[email protected]> writes:

> anyway, if that would be the aim, it could be done
> much simpler by 'just' adding a v/upid field to the
> task struct and use that for everything userspace
> related (i.e. locating tasks, sending signals, etc)
> no need to change the current *pid entries at all

Yes and no. Changing the current pid entries as opposed
to adding the kpid/upid separation is a slightly different
problem.

In particular there are 4 uses we need to change.
- Printing the pid in debug messages.
- Comparing pids (because we need to add a context comparison)
- Sending signals/locating tasks.
- Entering a code path that wants to do one of the above.

Printing the pid in debug messages seems to be confined to
performing the action with reference to a task_struct, and
is a completeness issue not a correctness issue.

For sending signals and locating tasks by pid it is fairly
straightforward to change the interface of all affected
functions, and thus forcing an audit of all callers,
recursively, also works well.

Comparing pids is where I think things get sticky but arguably
that case is rare enough we may be able to catch all of the users
with a code audit.

The only change I would really advocate at the moment beyond
adding the kpid fields to the task struct is to rename
(pid, tgid, pgrp, session) to (upid, utgid, upgrp, usession)
so we catch and break the users. This would flush into
the open all of the users that are doing weird things like comparing
pids and would leave any rare untested and unspotted case broken where
it will not compile.

Arguably it is overkill to break all of the users to catch the
stragglers that we can't easily spot with a code review. Likely I would be
satisfied with not breaking the code until I found a straggler that
affects correctness that made it through a kernel code review.

So far I have yet to see a version of the code that does not miss
important stragglers. Which is why to be correct I suspect we need
to break all users.

Eric

2006-01-27 09:06:22

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Herbert Poetzl <[email protected]> writes:

> well, yes, but wouldn't that be the RightThing(tm)
> anyway? because 'referencing' something via a pid, then
> letting the task holding the pid go away and even be
> replaced by a new one (with the same pid) which then
> will get suddenly signaled from somewhere, just because
> the pid matches seems very broken to me ...

Agreed, but that describes the current state of the kernel.

Using a task_struct for referencing kernel threads where there
is tight collaboration seems sane. However using a task_struct
is impossible when referring to process groups, and it feels
like a bad idea to reference user space processes.

Basically my concern is that by using task structs internally
the kernel will start collecting invisible zombies. And
with a case like struct fown_struct we could force RLIMIT_NOFILE task
structs into memory, per hostile process. Usually this is much more
than RLIMIT_NPROC which limits the total number of live processes
and zombies a single user may create.

So assuming RLIMIT_NPROC == 100 and RLIMIT_NOFILE == 1024

Which means something like 100*1024*sizeof(struct task_struct) bytes.
sizeof(struct task_struct) is somewhere between 512 and 1K bytes
on a 32bit platform.

So 100*1024*512 to 100*1024*1024 = 50 to 100MB.
Being able to pin 100MB with modest ulimits does not sound like an
obvious fix to me.

Given what a hostile user can potentially accomplish I think anything that
approaches using struct task_struct pointers as a replacement for pids
should be approached carefully.

Eric

2006-01-27 12:29:09

by Kyle Moffett

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Jan 27, 2006, at 04:04, Eric W. Biederman wrote:
> Basically my concern is that by using task structs internally the
> kernel will start collecting invisible zombies.

So come up with a task_struct weakref system. Maintain an (RCU?)
linked list of struct task_weakref in the struct task_struct, and
when the task struct is about to go away, run around all of the
weakrefs and change their pointers to NULL. The user of the weakref
should check if the pointer is NULL and handle accordingly. Sure, it
would be tricky to get the locking right, but a couple extra bytes
for a struct task_weakref would be a lot better than a whole pinned
struct task_struct.

Cheers,
Kyle Moffett

--
Someone asked me why I work on this free (http://www.fsf.org/
philosophy/) software stuff and not get a real job. Charles Schulz
had the best answer:

"Why do musicians compose symphonies and poets write poems? They do
it because life wouldn't have any meaning for them if they didn't.
That's why I draw cartoons. It's my life."
-- Charles Schulz


2006-01-27 13:17:27

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Kyle Moffett <[email protected]> writes:

> On Jan 27, 2006, at 04:04, Eric W. Biederman wrote:
>> Basically my concern is that by using task structs internally the kernel will
>> start collecting invisible zombies.
>
> So come up with a task_struct weakref system. Maintain an (RCU?) linked list
> of struct task_weakref in the struct task_struct, and when the task struct is
> about to go away, run around all of the weakrefs and change their pointers to
> NULL. The user of the weakref should check if the pointer is NULL and handle
> accordingly. Sure, it would be tricky to get the locking right, but a couple
> extra bytes for a struct task_weakref would be a lot better than a whole pinned
> struct task_struct.

Right.

I'm working on it.

Somehow it was lodged in my mind that a task_struct cannot represent
a process group, but it can obviously be the first in a linked list.
Because of that, so far I have been approaching the weak reference
problem from the pid hash table side. While using pid hash tables
can come out fairly clean, my best solution so far doubled the size
of the pid hash table.

I have two things that still concern me.
- The size of the linked list in pathological cases.
- Consistently picking a leader for a process group.

But I don't know if either one of them will actually be a problem,
so I think I will walk down that implementation path and see where
it takes me.

Eric

2006-01-31 21:03:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api


(I'm coming in late, it's not been a high priority for me)

On Fri, 20 Jan 2006, Hubertus Franke wrote:
>
> 2nd:
> ==== Issue: we don't need pid virtualization, instead simply use
> <container,pid> pair.
>
> This requires a bit more thought. Essentially that's what I was doing,
> but I mangled them into the same pid and used masking to add/remove the
> container for internal use. As pointed out by Alan(?), we can indeed
> reuse the same pid internally many times as long as we can distinguish
> during the pid-to-task_struct lookup. This is easily done because the
> caller provides the context hence the container for the lookup.

This is my preferred approach BY FAR.

Doing a <container,pid> approach is very natural, and avoids almost all
issues. At most, you might want to have a new system call (most naturally
just the one that is limited to the "init container" - ie the one that we
boot up with) that can specify both container and pid explicitly, and see
all processes and access all processes. But all "normal" system calls
would only ever operate within their container.

The fact is, we want "containers" anyway for any virtualization thing, ie
vserver already adds them. And if we have containers, then it's very easy
("easyish") to split up the current static "pid_hash[]", "pidmap_array[]"
and "pidmap_lock", and make them per-container, and have a pointer to the
container for each "struct task_struct".

After that, there wouldn't even be a lot else to do. The normal system
calls would just use their own container, and the (few) places that save
away pid's for later (ie things that use "kill_proc_info_as_uid()" and
"struct fown_struct" friends) would have to also squirrell away the
container, but then you should be pretty much done.

Of course, you'll have to do the system calls to _create_ the containers
in the first place, but that's at a higher level and involves much more
than just the pid-space (ie a container would normally have more than just
the uid mappings, it would have any network knowledge too etc - hostname,
perhaps list of network devices associated with that context etc etc)
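
A sketch of that per-container split, with hypothetical field and
helper names (only the shape matters):

/* hypothetical per-container pid state, replacing the globals */
struct container {
	struct hlist_head *pid_hash;	/* was the static pid_hash[] */
	void *pidmap_array;		/* was the static pidmap_array[] */
	spinlock_t pidmap_lock;		/* was the static pidmap_lock */
};

/* every task points at its container, so lookups are always
 * relative to the caller's container */
static inline struct task_struct *find_task_by_pid(pid_t nr)
{
	struct container *c = current->container;	/* hypothetical */
	return container_find_task(c, nr);		/* hypothetical */
}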

Linus

2006-02-01 00:01:49

by Hubertus Franke

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Linus Torvalds wrote:
> (I'm coming in late, it's not been a high priority for me)
>
> On Fri, 20 Jan 2006, Hubertus Franke wrote:
>
>>2nd:
>>==== Issue: we don't need pid virtualization, instead simply use
>><container,pid> pair.
>>
>>This requires a bit more thought. Essentially that's what I was doing,
>>but I mangled them into the same pid and used masking to add/remove the
>>container for internal use. As pointed out by Alan(?), we can indeed
>>reuse the same pid internally many times as long as we can distinguish
>>during the pid-to-task_struct lookup. This is easily done because the
>>caller provides the context hence the container for the lookup.
>
>
> This is my preferred approach BY FAR.
>
> Doing a <container,pid> approach is very natural, and avoids almost all
> issues. At most, you might want to have a new system call (most naturally
> just the one that is limited to the "init container" - ie the one that we
> boot up with) that can specify both container and pid explicitly, and see
> all processes and access all processes. But all "normal" system calls
> would only ever operate within their container.

That's what the current patch set does.
One "global container" that sees and accesses all and the rest is limited
to their respective "container".

>
> The fact is, we want "containers" anyway for any virtualization thing, ie
> vserver already adds them. And if we have containers, then it's very easy
> ("easyish") to split up the current static "pid_hash[]", "pidmap_array[]"
> and "pidmap_lock", and make them per-container, and have a pointer to the
> container for each "struct task_struct".

We are very close to that .. the pidmap_array is already organized that way.
This was done so as not to make the container an object that penetrates
everywhere in the code. Now that the discussion is fleshing out, actually
accessing those entities through the container of the context-task would
be the next logical restructuring of the code.

>
> After that, there wouldn't even be a lot else to do. The normal system
> calls would just use their own container, and the (few) places that save
> away pid's for later (ie things that use "kill_proc_info_as_uid()" and
> "struct fown_struct" friends) would have to also squirrell away the
> container, but then you should be pretty much done.

Agreed.

>
> Of course, you'll have to do the system calls to _create_ the containers
> in the first place, but that's at a higher level and involves much more
> than just the pid-space (ie a container would normally have more than just
> the uid mappings, it would have any network knowledge too etc - hostname,
> perhaps list of network devices associated with that context etc etc)

Right now we do it simply through a poor man's /proc/container fs approach
that should be reasonably straightforward to convert to a syscall.

>
> Linus
>

Finally, I presume you followed the discussion on the conversion from task->pid
to access_functions that stirred some criticism. That part would disappear.

-- Hubertus

2006-02-01 04:19:53

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Linus Torvalds <[email protected]> writes:

> (I'm coming in late, it's not been a high priority for me)

Thanks. For taking the time.

> On Fri, 20 Jan 2006, Hubertus Franke wrote:
>>
>> 2nd:
>> ==== Issue: we don't need pid virtualization, instead simply use
>> <container,pid> pair.
>>
>> This requires a bit more thought. Essentially that's what I was doing,
>> but I mangled them into the same pid and used masking to add/remove the
>> container for internal use. As pointed out by Alan(?), we can indeed
>> reuse the same pid internally many times as long as we can distinguish
>> during the pid-to-task_struct lookup. This is easily done because the
>> caller provides the context hence the container for the lookup.
>
> This is my preferred approach BY FAR.
>
> Doing a <container,pid> approach is very natural, and avoids almost all
> issues. At most, you might want to have a new system call (most naturally
> just the one that is limited to the "init container" - ie the one that we
> boot up with) that can specify both container and pid explicitly, and see
> all processes and access all processes. But all "normal" system calls
> would only ever operate within their container.

On this front I have been planning on using sys_clone as it allows
pieces of the virtualization to be incrementally built, it already
supports the FS namespace, and it supports flexibly specifying what
you want to contain.

> The fact is, we want "containers" anyway for any virtualization thing, ie
> vserver already adds them. And if we have containers, then it's very easy
> ("easyish") to split up the current static "pid_hash[]", "pidmap_array[]"
> and "pidmap_lock", and make them per-container, and have a pointer to the
> container for each "struct task_struct".

Well hash tables with their giant allocations are hard to split; it has
been easier to add a container tag.

> After that, there wouldn't even be a lot else to do. The normal system
> calls would just use their own container, and the (few) places that save
> away pid's for later (ie things that use "kill_proc_info_as_uid()" and
> "struct fown_struct" friends) would have to also squirrell away the
> container, but then you should be pretty much done.

Yes. Although there are a few container lifetime problems with that
approach. Do you want your container alive for a long time after every
process using it has exited just because someone has squirrelled away their
pid? While container lifetime issues crop up elsewhere as well, PIDs are
by far the worst, because it is currently safe to store a PID indefinitely
with nothing worse than PID wrap around.

> Of course, you'll have to do the system calls to _create_ the containers
> in the first place, but that's at a higher level and involves much more
> than just the pid-space (ie a container would normally have more than just
> the uid mappings, it would have any network knowledge too etc - hostname,
> perhaps list of network devices associated with that context etc etc)

Yes. A list of network devices seems to work well.

Eric

2006-02-01 04:39:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api



On Tue, 31 Jan 2006, Eric W. Biederman wrote:
>
> Yes. Although there are a few container lifetime problems with that
> approach. Do you want your container alive for a long time after every
> process using it has exited just because someone has squirrelled away their
> pid? While container lifetime issues crop up elsewhere as well, PIDs are
> by far the worst, because it is currently safe to store a PID indefinitely
> with nothing worse than PID wrap around.

Are people really expecting to have a huge turn-over on containers? It
sounds like this shouldn't be a problem in any normal circumstance:
especially if you don't even do the "big hash-table per container"
approach, who really cares if a container lives on after the last process
exited?

I'd have expected that the major user for this would end up being ISP's
and the like, and I would not expect the virtual machines to be brought up
all the time.

If it's a problem, you can do the same thing that the "struct mm_struct"
does: it has life-time issues because a mm_struct actually has to live for
potentially a _long_ time (zombies) but at the same time we want to free
the data structures allocated to the mm_struct as soon as possible,
notably the VMA's and the page tables.

So a mm_struct uses a two-level counter, with the "real" users (who need
the page tables etc) incrementing one ("mm_users"), and the "secondary"
ones (who just need to have an mm_struct pinned, but are ok with an empty
VM being attached) incrementing the other ("mm_count").

The same approach might be valid for "containers": you can destroy most of
the associated container when the actual processes are gone, but keep just
the empty container around until all secondary references are finally also
gone.

It's pretty simple: the secondary reference starts at 1 - with the
"primary" counter being the single ref to the secondary. Then freeing a
primary does:

	if (atomic_dec_and_test(&container->primary_counter)) {
		.. free the core resources here ..

		/* then release the ref from the primary to secondary */
		secondary_free(container);
	}

(for "mm_struct", the primary is dropped "mmput()" and the secondary is
dropped with "mmdrop()", which is absolutely horrid naming. Please name
things better than I did ;)
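
A minimal sketch of the same two-level scheme applied to a container
(all names hypothetical, chosen by analogy with mm_users/mm_count):

struct container {
	atomic_t users;		/* "primary": live processes in the container */
	atomic_t count;		/* "secondary": pinned references, starts at 1 */
	/* ... pid maps, uid maps, network state, etc. ... */
};

/* drop a secondary reference; frees the empty shell on the last one */
static inline void put_container(struct container *c)
{
	if (atomic_dec_and_test(&c->count))
		kfree(c);
}

/* drop a primary reference; tears down the core resources on the last
 * one, then releases the primary's single ref on the secondary */
static inline void container_users_put(struct container *c)
{
	if (atomic_dec_and_test(&c->users)) {
		/* free pid maps, hash tables, etc. here */
		put_container(c);
	}
}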

Linus

2006-02-01 07:16:07

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Linus Torvalds <[email protected]> writes:

> On Tue, 31 Jan 2006, Eric W. Biederman wrote:
>>
>> Yes. Although there are a few container lifetime problems with that
>> approach. Do you want your container alive for a long time after every
>> process using it has exited just because someone has squirrelled away their
>> pid? While container lifetime issues crop up elsewhere as well, PIDs are
>> by far the worst, because it is currently safe to store a PID indefinitely
>> with nothing worse than PID wrap around.
>
> Are people really expecting to have a huge turn-over on containers? It
> sounds like this shouldn't be a problem in any normal circumstance:
> especially if you don't even do the "big hash-table per container"
> approach, who really cares if a container lives on after the last process
> exited?

Turnover rate is a good argument for not worrying about things too much.
I guess it only really becomes a problem if you have large amounts
of resources locked up.

> I'd have expected that the major user for this would end up being ISP's
> and the like, and I would not expect the virtual machines to be brought up
> all the time.

People doing server consolidation are one of the big user bases. The other
and possibly a bigger driving force right now are people dealing with large
high performance clusters. There the interest is in encapsulating applications
so that you can checkpoint or migrate them.

One container per batch job might not be too high, but if it wound up being used
for short jobs as well as long ones you could get as high as one container
every couple of minutes.

The scary part with lifetime issues is that if you aren't careful you can
have lots of system resources used with no obvious source.

> If it's a problem, you can do the same thing that the "struct mm_struct"
> does: it has life-time issues because a mm_struct actually has to live for
> potentially a _long_ time (zombies) but at the same time we want to free
> the data structures allocated to the mm_struct as soon as possible,
> notably the VMA's and the page tables.
>
> So a mm_struct uses a two-level counter, with the "real" users (who need
> the page tables etc) incrementing one ("mm_users"), and the "secondary"
> ones (who just need to have an mm_struct pinned, but are ok with an empty
> VM being attached) incrementing the other ("mm_count").

Neat. I had not realized that was what was going on. Having cleaned up a bunch
of cases there ages ago I was about to feel silly, but I just realized mmdrop
is more recent than my comment explaining the difference between mmput and
mm_release.

One of the suggestions that has been floating around was to replace the
saved pids with references to task structures in places like fown_struct.
If we were to take that approach we would have nasty lifetime issues, because
we would continue to pin processes even after they were no longer zombies,
and we can potentially get a lot of fown_structs.

So I am considering introducing an intermediary (on very similar lines
to what you were suggesting) a struct task_ref that is just:
struct task_ref
{
	atomic_t count;
	enum pid_type type;
	struct task_struct *task;
};

That can be used to track tasks and process groups. I posted fairly
complete patches for review a few days ago. The interesting thing
with this case is that it can solve the pid wrap around issues as well
as container reference issues, by completely removing the need for
them.
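
A sketch of how such an intermediary might be used (the helpers below
are hypothetical, not from the posted patches):

/* take an extra counted reference; this pins only the small
 * task_ref, never the task_struct itself */
static inline struct task_ref *get_task_ref(struct task_ref *ref)
{
	atomic_inc(&ref->count);
	return ref;
}

/* drop a reference; the task_ref is freed on the last put */
static inline void put_task_ref(struct task_ref *ref)
{
	if (atomic_dec_and_test(&ref->count))
		kfree(ref);
}

/* ref->task is NULLed when the task goes away, so users must
 * check before dereferencing */
static void signal_owner(struct task_ref *ref)
{
	if (ref->task)
		send_sig(SIGIO, ref->task, 1);
}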

The other technique that has served me well in my network
virtualization work was to set up a notifier and have everyone who
cared register a notifier and drop their references when the notifier
was called. For a low number of things that care this works very
well.

> (for "mm_struct", the primary is dropped "mmput()" and the secondary is
> dropped with "mmdrop()", which is absolutely horrid naming. Please name
> things better than I did ;)

Well it is a challenge; there aren't that many good names around and
it is hard work to find them. :)

Eric

2006-02-01 16:30:11

by Greg Kurz

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Eric W. Biederman wrote:
>
> On this front I have been planning on using sys_clone as it allows
> pieces of the virtualization to be incrementally built, it already
> supports the FS namespace, and it supports flexibly specifying what
> you want to contain.
>

What would you do to handle the following case:

pid = getpid();
if (sys_clone(CLONE_CONTAINER) == 0) {
	ppid = getppid();
	assert(ppid == pid);
}

Most of the calls involving resource ids will return values that aren't
*consistent* with ids already stored in userland... could possibly break some
piece of code. Perhaps a sys_exec() should also be enforced to reset the process
memory.

-Greg-

2006-02-01 16:42:13

by Dave Hansen

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Tue, 2006-01-31 at 20:39 -0800, Linus Torvalds wrote:
> On Tue, 31 Jan 2006, Eric W. Biederman wrote:
> > Yes. Although there are a few container lifetime problems with that
> > approach. Do you want your container alive for a long time after every
> > process using it has exited just because someone has squirrelled away their
> > pid? While container lifetime issues crop up elsewhere as well, PIDs are
> > by far the worst, because it is currently safe to store a PID indefinitely
> > with nothing worse than PID wrap around.
>
> Are people really expecting to have a huge turn-over on containers? It
> sounds like this shouldn't be a problem in any normal circumstance:
> especially if you don't even do the "big hash-table per container"
> approach, who really cares if a container lives on after the last process
> exited?

Other than testing, I can't imagine a case when we'd need them created
and destroyed very often. In fact, one of the biggest cases for needing
checkpoint/restart on a container is a very long-lived process that is
doing important work.

-- Dave

2006-02-01 16:45:53

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Greg <[email protected]> writes:

> Eric W. Biederman wrote:
>>
>> On this front I have been planning on using sys_clone as it allows
>> pieces of the virtualization to be incrementally built, it already
>> supports the FS namespace, and it supports flexibly specifying what
>> you want to contain.
>>
>
> What would you do to handle the following case:
>
> pid = getpid();
> if (sys_clone(CLONE_CONTAINER) == 0) {
> ppid = getppid();
> assert(ppid == pid);
> }
>
> Most of the calls involving resource ids will return values that aren't
> *consistent* with ids already stored in userland... could possibly break some
> piece of code. Perhaps a sys_exec() should also be enforced to reset the process
> memory.

Well that assertion will fail.
At that point getppid() will return 0, and getpid() will return 1.

Processes getting confused is their own problem.

Now there will be a pid that the parent sees that will not be 0.
And that is what the parent will see in the context of wait.

In my code I introduced a wid (wait id) for that purpose.

This makes it possible to manage a container using the usual unix
process semantics. Which is very important.
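
In other words, under these semantics the child of such a clone would
observe something like this (a sketch; CLONE_CONTAINER is the
hypothetical flag from Greg's example):

pid = getpid();				/* parent's pid, in the outer space */
if (sys_clone(CLONE_CONTAINER) == 0) {
	/* inside the new pid space */
	assert(getpid() == 1);		/* the child is init of the new space */
	assert(getppid() == 0);		/* the parent is outside, seen as 0 */
}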


Eric

2006-02-02 05:14:17

by Herbert Poetzl

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Tue, Jan 31, 2006 at 08:39:19PM -0800, Linus Torvalds wrote:
>
>
> On Tue, 31 Jan 2006, Eric W. Biederman wrote:
> >
> > Yes. Although there are a few container lifetime problems with
> > that approach. Do you want your container alive for a long time
> > after every process using it has exited just because someone has
> > squirrelled away their pid? While container lifetime issues crop up
> > elsewhere as well, PIDs are by far the worst, because it is currently
> > safe to store a PID indefinitely with nothing worse than PID wrap
> > around.
>
> Are people really expecting to have a huge turn-over on containers? It
> sounds like this shouldn't be a problem in any normal circumstance:
> especially if you don't even do the "big hash-table per container"
> approach, who really cares if a container lives on after the last
> process exited?
>
> I'd have expected that the major user for this would end up being
> ISP's and the like, and I would not expect the virtual machines to be
> brought up all the time.

well, really depends, as far as I can tell the
number of guest (container) (re)starts can be as
high as one per second (in extreme cases) while
the entire setup doesn't have more than 50-100
containers at the same time, and usually 'runs'
for more than a few months without reboot ...

but agreed, the typical number of container
creations and deletions will be around one per
hour or day ...

> If it's a problem, you can do the same thing that the "struct
> mm_struct" does: it has life-time issues because a mm_struct actually
> has to live for potentially a _long_ time (zombies) but at the same
> time we want to free the data structures allocated to the mm_struct as
> soon as possible, notably the VMA's and the page tables.
>
> So a mm_struct uses a two-level counter, with the "real" users
> (who need the page tables etc) incrementing one ("mm_users"), and
> the "secondary" ones (who just need to have an mm_struct pinned,
> but are ok with an empty VM being attached) incrementing the other
> ("mm_count").

yes, we already do something very similar in
Linux-VServer, basically differentiating between
'active users' and 'passive references' ...

> The same approach might be valid for "containers": you can destroy most of
> the associated container when the actual processes are gone, but keep just
> the empty container around until all secondary references are finally also
> gone.
>
> It's pretty simple: the secondary reference starts at 1 - with the
> "primary" counter being the single ref to the secondary. Then freeing a
> primary does:
>
> if (atomic_dec_and_test(&container->primary_counter)) {
> .. free the core resources here ..
>
> /* then release the ref from the primary to secondary */
> secondary_free(container);
> }
>
> (for "mm_struct", the primary is dropped "mmput()" and the secondary is
> dropped with "mmdrop()", which is absolutely horrid naming. Please name
> things better than I did ;)

best,
Herbert

> Linus
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2006-02-02 13:51:20

by Greg Kurz

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Eric W. Biederman wrote:
> Greg <[email protected]> writes:
>
>
>>Eric W. Biederman wrote:
>>
>>>On this front I have been planning on using sys_clone as it allows
>>>pieces of the virtualization to be incrementally built, it already
>>>supports the FS namespace, and it supports flexibly specifying what
>>>you want to contain.
>>>
>>
>>What would you do to handle the following case:
>>
>>pid = getpid();
>>if (sys_clone(CLONE_CONTAINER) == 0) {
>> ppid = getppid();
>> assert(ppid == pid);
>>}
>>
>>Most of the calls involving resource ids will return values that aren't
>>*consistent* with ids already stored in userland... could possibly break some
>>piece of code. Perhaps a sys_exec() should also be enforced to reset the process
>>memory.
>
>
> Well that assertion will fail.
> At that point getppid() will return 0, and getpid() will return 1.
>
> Processes getting confused is their own problem.
>

This flavour of clone should be used with great care then since it breaks the
usual unix process semantics. :)

> Now there will be a pid that the parent sees that will not be 0.
> And that is what the parent will see in the context of wait.
>
> In my code I introduced a wid (wait id) for that purpose.
>

Is it possible to see the code ?

> This makes it possible to manage a container using the usual unix
> process semantics. Which is very important.
>
>
> Eric
>

Thanks.

-Greg-

2006-02-02 14:11:28

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Greg <[email protected]> writes:

>> Well that assertion will fail.
>> At that point getppid() will return 0, and getpid() will return 1.
>>
>> Processes getting confused is their own problem.
>>
>
> This flavour of clone should be used with great care then since it breaks the
> usual unix process semantics. :)

Do you know of a flavor of clone (besides fork) that doesn't share that
property?

For the most part I am not breaking the usual process semantics; I work
very hard to preserve them, but simply which pids you see are different.
Which is what would be expected.

>> Now there will be a pid that the parent sees that will not be 0.
>> And that is what the parent will see in the context of wait.
>>
>> In my code I introduced a wid (wait id) for that purpose.
>>
>
> Is it possible to see the code ?

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-ns.git/

The tree is still at the proof of concept level but it does come fairly close
to what has been discussed.

Eric

2006-02-02 14:48:24

by Kirill Korotaev

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Hello all,

>>>The virtual pid is different depending on who is asking. So simply
>>>storing current->realpid and current->pid isn't helpful, as we would
>>>still need to call a function when a pid crosses user->kernel boundary.
>>
>>This is an obscure, weird piece of functionality for some special case
>>usages most of which are going to be eliminated by Xen. I don't see the
>>kernel side justification for it at all.
>
>
> At least OpenVZ and vserver want very similar functionality. They're
> both working with out-of-tree patch sets. We each want to do subtly
> different things with tsk->pid, and task_pid() seemed to be a decent
> place to start. OpenVZ has a very similar concept in its pid_task()
> function.
But our VPID patch is much less intrusive and shorter (thanks to Alexey
Kuznetsov).
I will send broken-out patches today CC-ing you.

BTW, have you tested it somehow (LTP, etc.)?

> Arjan had a very good point last time we posted these: we should
> consider getting rid of as many places in the kernel where pids are used
> to uniquely identify tasks, and just stick with task_struct pointers.

Kirill

2006-02-02 14:48:46

by Kirill Korotaev

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

> This is my preferred approach BY FAR.
>
> Doing a <container,pid> approach is very natural, and avoids almost all
> issues. At most, you might want to have a new system call (most naturally
> just the one that is limited to the "init container" - ie the one that we
> boot up with) that can specify both container and pid explicitly, and see
> all processes and access all processes. But all "normal" system calls
> would only ever operate within their container.
>
> The fact is, we want "containers" anyway for any virtualization thing, ie
> vserver already adds them. And if we have containers, then it's very easy
> ("easyish") to split up the current static "pid_hash[]", "pidmap_array[]"
> and "pidmap_lock", and make them per-container, and have a pointer to the
> container for each "struct task_struct".
In fact this is almost what OpenVZ has done for half a year, both containers
and VPIDs.
But it is very useful to see the process tree from the host system. To be able
to use std tools to manage containers from the host (i.e. ps, kill, top,
etc.). So it is much more convenient to have 2 pids: one globally
unique, and one for the container.

> After that, there wouldn't even be a lot else to do. The normal system
> calls would just use their own container, and the (few) places that save
> away pid's for later (ie things that use "kill_proc_info_as_uid()" and
> "struct fown_struct" friends) would have to also squirrell away the
> container, but then you should be pretty much done.
>
> Of course, you'll have to do the system calls to _create_ the containers
> in the first place, but that's at a higher level and involves much more
> than just the pid-space (ie a container would normally have more than just
> the uid mappings, it would have any network knowledge too etc - hostname,
> perhaps list of network devices associated with that context etc etc)

Kirill

2006-02-02 15:15:28

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Kirill Korotaev <[email protected]> writes:

> In fact this is almost what OpenVZ has done for half a year, both containers and
> VPIDs.
> But it is very useful to see the process tree from the host system. To be able to use
> std tools to manage containers from the host (i.e. ps, kill, top, etc.). So it is
> much more convenient to have 2 pids. One globally unique, and one for the container.

There are two issues here.
1) Monitoring. (ps, top etc)
2) Control (kill).

For monitoring you might need to patch ps/top a little but it is doable without
2 pids.

For kill it is extremely rude to kill processes inside of a nested pid space.
There are other solutions to the problem.

It is undesirable to make it too easy to communicate through the barrier because
then applications may start to take advantage of it and then depend on
it and you will have lost the isolation that the container gives you.

Eric

2006-02-02 15:24:49

by Kirill Korotaev

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

>>In fact this is almost what OpenVZ has done for half a year, both containers and
>>VPIDs.
>>But it is very useful to see the process tree from the host system. To be able to use
>>std tools to manage containers from the host (i.e. ps, kill, top, etc.). So it is
>>much more convenient to have 2 pids. One globally unique, and one for the container.
>
>
> There are two issues here.
> 1) Monitoring. (ps, top etc)
> 2) Control (kill).
>
> For monitoring you might need to patch ps/top a little but it is doable without
> 2 pids.
>
> For kill it is extremely rude to kill processes inside of a nested pid space.
> There are other solutions to the problem.
it is not always a good idea to fix the tools every time new functionality
is involved. why do you think there are no more tools except for
ps, top, kill? will you fix them all?

Another example, when you have problems in your VPS it is very
convenient to attach with strace/gdb/etc from the host. Will you patch
it as well?

OpenVZ's big advantage is this ease of administering compared to the VM
approach and it is not a good idea to forbid this. If you have a broken VM
you have problems; in OpenVZ you have control over VPSs.

> It is undesirable to make it too easy to communicate through the barrier because
> then applications may start to take advantage of it and then depend on
> it and you will have lost the isolation that the container gives you.
in OpenVZ we have VPSs fully isolated from each other.
But the host system has access to some of the VPS resources such as files,
processes, etc. I understand your concern which is related to
checkpointing, yeah?

Kirill

2006-02-02 15:53:22

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Kirill Korotaev <[email protected]> writes:

>>>In fact this is almost what OpenVZ has done for half a year, both containers and
>>>VPIDs.
>>>But it is very useful to see the process tree from the host system. To be able to use
>>>std tools to manage containers from the host (i.e. ps, kill, top, etc.). So it is
>>>much more convenient to have 2 pids. One globally unique, and one for
>>>the container.
>> There are two issues here.
>> 1) Monitoring. (ps, top etc)
>> 2) Control (kill).
>> For monitoring you might need to patch ps/top a little but it is doable
>> without 2 pids.
>> For kill it is extremely rude to kill processes inside of a nested pid space.
>> There are other solutions to the problem.
> it is not always a good idea to fix the tools every time new functionality is
> involved. why do you think there are no more tools except for ps, top, kill? will
> you fix them all?

No. In my implementation the nested pid space has one pid that essentially
looks like a threaded process. So ps and top can see that something is there
but you aren't spammed with all of the pids. In addition for top the cpu utilization
for all of the pids is shown for that one pid (just like the thread group leader
of a threaded process). So there is no confusion.

If you want the detailed information you can either chroot to an environment
where the appropriate version of /proc is available. Or you can modify your
tools.

> Another example, when you have problems in your VPS it is very convenient to
> attach with strace/gdb/etc from the host. Will you patch it as well?
>
> OpenVZ's big advantage is this ease of administering compared to the VM approach and
> it is not a good idea to forbid this. If you have a broken VM you have problems; in
> OpenVZ you have control over VPSs.
>
>> It is undesirable to make it too easy to communicate through the barrier
>> because then applications may start to take advantage of it and then depend on
>> it and you will have lost the isolation that the container gives you.
> in OpenVZ we have VPSs fully isolated from each other.
> But the host system has access to some of the VPS resources such as files, processes,
> etc. I understand your concern which is related to checkpointing, yeah?

Three areas.
1) Checkpointing.
2) Isolation for security purposes.
There may be secrets that the sysadmin should not have access to.
3) Nesting of containers (so they are general purpose and not special hacks).

The vserver way of solving some of these problems is to provide a way
to enter the guest. I would rather have some explicit operation that puts
you into the guest context so there is a single point where we can tackle
the nested security issues, than to have hundreds of places we have to
look at individually.

Eric

2006-02-02 16:03:53

by Kirill Korotaev

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

> Three areas.
> 1) Checkpointing.
> 2) Isolation for security purposes.
> There may be secrets that the sysadmin should not have access to.
I hope you understand that such things do not make anything secure.
The administrator of the node will always have access to /proc/kcore,
devices, KERNEL CODE(!) etc. No security from this point of view.

> 3) Nesting of containers (so they are general purpose and not special hacks).
Why are you interested in nesting? Any applications for this?
Until everything is virtualized in a nesting way (including TCP/IP stack,
routing etc.) I see not much use for it.

> The vserver way of solving some of these problems is to provide a way
> to enter the guest. I would rather have some explicit operation that puts
> you into the guest context so there is a single point where we can tackle
> the nested security issues, than to have hundreds of places we have to
> look at individually.
Huh, it sounds too easy. Just imagine that the VPS owner has deleted ps,
top, kill, bash and other tools. You won't be able to enter. Another
example: when the VPS owner is near its resource limits you won't be able
to do anything after entering the VPS.
Do you need other examples?

Kirill

2006-02-02 16:29:13

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Kirill Korotaev <[email protected]> writes:

>> Three areas.
>> 1) Checkpointing.
>> 2) Isolation for security purposes.
>> There may be secrets that the sysadmin should not have access to.
> I hope you understand that such things do not make anything
> secure. The administrator of the node will always have access to /proc/kcore,
> devices, KERNEL CODE(!) etc. No security from this point of view.

Only if they have CAP_SYS_RAWIO. I admit it takes a lot more
to get there than just that. But having a mechanism that has the
potential to be secured and is much simpler to understand
and to set up for minimal privileges than any of the other unix
addons I have seen is very interesting.


>> 3) Nesting of containers (so they are general purpose and not special hacks).
> Why are you interested in nesting? Any applications for this?
> Until everything is virtualized in a nesting way (including TCP/IP stack, routing
> etc.) I see not much use for it.

For everything except the PID namespace I am just interested in having multiple
separate namespaces. For the PID namespace to keep the traditional unix
model you need a parent process so it is actually nesting.

I am interested because it is easy, and because if it is possible then
the range of applications you can apply containers to is much
larger. At the far end of that spectrum is migrating a server running
on real hardware and bringing it up as a guest on a newer much more
powerful machine. With the appearance that it had only been
unreachable for a few seconds.

>> The vserver way of solving some of these problems is to provide a way
>> to enter the guest. I would rather have some explicit operation that puts
>> you into the guest context so there is a single point where we can tackle
>> the nested security issues, than to have hundreds of places we have to
>> look at individually.
> Huh, it sounds too easy. Just imagine that the VPS owner has deleted ps, top,
> kill, bash and other tools. You won't be able to enter.

Entering is different from execing a process on the inside.
Implementation wise it is changing the context pointer on your task.

> Another example: when the VPS owner
> is near its resource limits, you won't be able to do anything after
> entering the VPS.

For debugging this is a good reason for being inside. What if the
problem is that you are out of resources?

I have no intention of requiring monitoring to work from the inside though.

> Do you need other examples?

No, I need to post patches.

Eric

2006-02-02 21:10:44

by Cédric Le Goater

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Eric W. Biederman wrote:

> If you want the detailed information you can either chroot to an environment
> where the appropriate version of /proc is available. Or you can modify your
> tools.

did you modify /proc to be able to mount it multiple times on the same
system?

C.

2006-02-02 21:26:26

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Cedric Le Goater <[email protected]> writes:

> Eric W. Biederman wrote:
>
>> If you want the detailed information you can either chroot to an environment
>> where the appropriate version of /proc is available. Or you can modify your
>> tools.
>
> did you modify /proc to be able to mount it multiple times on the same
> system?

Yes.

Eric

2006-02-02 21:32:45

by Cédric Le Goater

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Eric W. Biederman wrote:

> For everything except the PID namespace I am just interested in having multiple
> separate namespaces. For the PID namespace to keep the traditional unix
> model you need a parent process so it is actually nesting.
>
> I am interested because it is easy, and because if it is possible then
> the range of applications you can apply containers to is much
> larger. At the far end of that spectrum is migrating a server running
> on real hardware and bringing it up as a guest on a newer, much more
> powerful machine, with the appearance that it had only been
> unreachable for a few seconds.

We gave a name to such containers. We call them 'application' containers
just to distinguish them from the 'system' containers, like vserver or openvz.

'application' containers are very useful in an HPC environment where you
can trigger checkpoint/restart through batch managers. But this model,
really, has some virtualization issues at the edges. The virtualization of
the parent process of such a container is tricky: it can be in multiple
containers at the same time, it can die (difficult to restart ...), it
could belong to other process groups, etc. Plenty of annoying cases to
handle.

'system' containers, like vserver or openvz, are safer because they use a
private PID space.

Now, would it be possible to have an 'application' container using a
private PID space and being friendly to the usual unix process semantics?
We haven't found a solution yet ...

> Entering is different from execing a process on the inside.
> Implementation wise it is changing the context pointer on your task.

Yes, and you could restrict that privilege to some top-level container,
like container 0.

C.

2006-02-02 21:44:04

by Hubertus Franke

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Thu, 2006-02-02 at 22:32 +0100, Cedric Le Goater wrote:
> Eric W. Biederman wrote:
>
> > For everything except the PID namespace I am just interested in having multiple
> > separate namespaces. For the PID namespace to keep the traditional unix
> > model you need a parent process so it is actually nesting.
> >
> > I am interested because it is easy, and because if it is possible then
> > the range of applications you can apply containers to is much
> > larger. At the far end of that spectrum is migrating a server running
> > on real hardware and bringing it up as a guest on a newer, much more
> > powerful machine, with the appearance that it had only been
> > unreachable for a few seconds.
>
> We gave a name to such containers. We call them 'application' containers
> just to distinguish them from the 'system' containers, like vserver or openvz.
>
> 'application' containers are very useful in an HPC environment where you
> can trigger checkpoint/restart through batch managers. But this model,
> really, has some virtualization issues at the edges. The virtualization of
> the parent process of such a container is tricky: it can be in multiple
> containers at the same time, it can die (difficult to restart ...), it
> could belong to other process groups, etc. Plenty of annoying cases to
> handle.

I guess we all experienced that (see everybody's patches).

>
> 'system' containers, like vserver or openvz, are safer because they use a
> private PID space.

Maybe we are talking about different things, but I thought our patch-set
does provide private pid spaces (by putting a <pid,container> tuple into
the same pid). Yes, privacy is achieved only by guarding the edges, but
the next approach that was discussed on the mailing list was indeed to
make containers first-class objects, hang the pid management and pid
lookup off the container, and thus enforce the isolation.
Containers then move up to be children of "init".
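
A minimal sketch of what folding a <pid,container> tuple into one pid value
could look like. The bit layout and all names here are assumptions for
illustration, not the encoding the patch-set actually uses:

/*
 * Illustrative only: reserve the low 20 bits for the in-container pid
 * and the upper bits for the container id. The real patches may encode
 * this differently.
 */
#include <assert.h>
#include <stdio.h>

#define CTR_SHIFT	20			/* low 20 bits: local pid */
#define PID_MASK	((1 << CTR_SHIFT) - 1)

static int mk_pid(int container, int pid)
{
	return (container << CTR_SHIFT) | (pid & PID_MASK);
}

static int pid_container(int p)	{ return p >> CTR_SHIFT; }
static int pid_nr(int p)	{ return p & PID_MASK; }

int main(void)
{
	int p = mk_pid(3, 1234);	/* pid 1234 inside container 3 */

	assert(pid_container(p) == 3);	/* the edge can recover both halves */
	assert(pid_nr(p) == 1234);
	printf("container %d, local pid %d\n", pid_container(p), pid_nr(p));
	return 0;
}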

>
> Now, would it be possible to have an 'application' container using a
> private PID space and being friendly to the usual unix process semantics?
> We haven't found a solution yet ...

Could you spell out what specific UNIX semantics you are referring to?
That would help, and we can discuss whether and how they can be
addressed.

>
> > Entering is different from execing a process on the inside.
> > Implementation wise it is changing the context pointer on your task.
>
> Yes, and you could restrict that privilege to some top-level container,
> like container 0.
>

--
Hubertus Franke <[email protected]>

2006-02-02 21:47:45

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Cedric Le Goater <[email protected]> writes:

> Now, would it be possible to have an 'application' container using a
> private PID space and being friendly to the usual unix process semantics?
> We haven't found a solution yet ...

Well that is what I implemented. So I am pretty certain it is solvable.

Eric

2006-02-03 10:06:23

by Kirill Korotaev

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

>>Now, would it be possible to have an 'application' container using a
>>private PID space and being friendly to the usual unix process semantics?
>>We haven't found a solution yet ...
>
>
> Well that is what I implemented. So I am pretty certain it is solvable.
Exactly. This is what our patch does also. It is solvable. Tested by LTP
and numerous other tests/production systems etc.

Kirill


2006-02-03 10:50:21

by Kirill Korotaev

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

>>I hope you understand that such things do not make anything
>>secure. The administrator of the node will always have access to /proc/kcore,
>>devices, KERNEL CODE(!) etc. There is no security from this point of view.
> Only if they have CAP_SYS_RAWIO. I admit it takes a lot more
> to get there than just that. But having a mechanism that has the
> potential to be secured and is much simpler to understand
> and to set up for minimal privileges than any of the other unix
> add-ons I have seen is very interesting.
OK, I suppose it can be done as an option. If required, access from the
host system can be allowed; if a "secure" environment is requested, it is
fully isolated.

>>>3) Nesting of containers (so they are general purpose and not special hacks).
>>
>>Why are you interested in nesting? Any applications for this?
>>Until everything is virtualized in a nesting way (including TCP/IP stack,
>>routing etc.) I don't see much use for it.
> For everything except the PID namespace I am just interested in having multiple
> separate namespaces. For the PID namespace to keep the traditional unix
> model you need a parent process so it is actually nesting.
Yes, but nesting can be one level deep as in OpenVZ, where the VPS is a
nested namespace inside the host system, or it can be a fully isolated
separate traditional namespace.

By real nesting I mean hierarchical containers, where containers inside
other containers are allowed. This is hard to implement. For example,
for real containers you will need isolated TCP/IP stacks with complex
routing rules etc.

> I am interested because it is easy, and because if it is possible then
> the range of applications you can apply containers to is much
> larger. At the far end of that spectrum is migrating a server running
> on real hardware and bringing it up as a guest on a newer, much more
> powerful machine, with the appearance that it had only been
> unreachable for a few seconds.
You can use fully isolated containers like OpenVZ VPSs for this. They
are naturally suitable for this, because they provide you not only PID
isolation, but also IPC, sockets, etc.

How can you migrate an application which consists of two processes doing
IPC via signals? They are not tied together inside the kernel in any way,
and there is no way to automatically detect that both should be migrated
together. VPSs are what provide you such boundaries of what should be
considered as a whole.

>>>The vserver way of solving some of these problems is to provide a way
>>>to enter the guest. I would rather have some explicit operation that puts
>>>you into the guest context so there is a single point where we can tackle
>>>the nested security issues, than to have hundreds of places we have to
>>>look at individually.
>>
>>Huh, it sounds too easy. Just imagine that the VPS owner has deleted ps, top,
>>kill, bash and other tools. You won't be able to enter.

> Entering is different from execing a process on the inside.
> Implementation wise it is changing the context pointer on your task.
If I understand you correctly, it is a fully insecure way of doing things.
After changing context without applying all the restrictions which
should be imposed by the VPS, your process will be ptrace'able and so on.

>>Another example: when the VPS owner
>>is near its resource limits, you won't be able to do anything after
>>entering the VPS.
> For debugging this is a good reason for being inside. What if the
> problem is that you are out of resources?
Debugging - yes; in production - no.
That is why OpenVZ allows the host system to access VPS resources - for
debugging in production.

> I have no intention of requiring monitoring to work from the inside though.
>>Do you need other examples?
> No, I need to post patches.
Thanks a lot for valuable discussion and your time!

Kirill

2006-02-03 11:12:10

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Kirill Korotaev <[email protected]> writes:

>>>I hope you understand that such things do not make anything
>>>secure. The administrator of the node will always have access to /proc/kcore,
>>>devices, KERNEL CODE(!) etc. There is no security from this point of view.
>> Only if they have CAP_SYS_RAWIO. I admit it takes a lot more
>> to get there than just that. But having a mechanism that has the
>> potential to be secured and is much simpler to understand
>> and to set up for minimal privileges than any of the other unix
>> add-ons I have seen is very interesting.
> OK, I suppose it can be done as an option. If required, access from the host
> system can be allowed; if a "secure" environment is requested, it is fully
> isolated.
>
>>>>3) Nesting of containers (so they are general purpose and not special hacks).
>>>
>>>Why are you interested in nesting? Any applications for this?
>>>Until everything is virtualized in a nesting way (including TCP/IP stack,
>>>routing etc.) I don't see much use for it.
>> For everything except the PID namespace I am just interested in having
>> multiple separate namespaces. For the PID namespace to keep the traditional
>> unix model you need a parent process so it is actually nesting.
> Yes, but nesting can be one level deep as in OpenVZ, where the VPS is a nested
> namespace inside the host system, or it can be a fully isolated separate
> traditional namespace.
>
> By real nesting I mean hierarchical containers, where containers inside other
> containers are allowed. This is hard to implement. For example, for real
> containers you will need isolated TCP/IP stacks with complex routing rules etc.

TCP/IP is a pain because it has so many global static variables, but otherwise
it is easier than PIDs. You just need what look like two instances
of the network stack, and, since you usually don't have enough real
hardware, a two-headed network device that acts like two NICs plugged into
a crossover cable.
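
A toy model of that two-headed device, under the assumption that
transmitting on one head is simply receiving on the other; the names
(struct net_stack, veth_xmit) are invented and bear no relation to any
real driver:

/*
 * Two network-stack instances wired back to back, like two NICs on a
 * crossover cable. Purely a model, not kernel code.
 */
#include <stdio.h>
#include <string.h>

struct net_stack {
	const char *name;
	char rx_buf[64];		/* last frame received */
};

struct veth_pair {
	struct net_stack *a, *b;
};

/* transmitting on one head is receiving on the other */
static void veth_xmit(struct veth_pair *v, struct net_stack *from,
		      const char *frame)
{
	struct net_stack *to = (from == v->a) ? v->b : v->a;

	strncpy(to->rx_buf, frame, sizeof(to->rx_buf) - 1);
	printf("%s -> %s: %s\n", from->name, to->name, to->rx_buf);
}

int main(void)
{
	struct net_stack host  = { .name = "host" };
	struct net_stack guest = { .name = "guest" };
	struct veth_pair pair  = { .a = &host, .b = &guest };

	veth_xmit(&pair, &guest, "hello from the container");
	return 0;
}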

>> I am interested because it is easy, and because if it is possible then
>> the range of applications you can apply containers to is much
>> larger. At the far end of that spectrum is migrating a server running
>> on real hardware and bringing it up as a guest on a newer, much more
>> powerful machine, with the appearance that it had only been
>> unreachable for a few seconds.

> You can use fully isolated containers like OpenVZ VPSs for this. They are
> naturally suitable for this, because they provide you not only PID isolation,
> but also IPC, sockets, etc.

Exactly.

> How can you migrate an application which consists of two processes doing IPC
> via signals? They are not tied together inside the kernel in any way, and
> there is no way to automatically detect that both should be migrated together.
> VPSs are what provide you such boundaries of what should be considered as a
> whole.

I always look at migration in terms of a container/VPS.

However, the entire system can be considered one such container. So at the
extreme you can migrate everything to another machine. It's not
a hard requirement, but it would be nice if the mechanism wasn't so special
that it prevented that.


>>>
>>>Huh, it sounds too easy. Just imagine that the VPS owner has deleted ps, top,
>>>kill, bash and other tools. You won't be able to enter.
>
>> Entering is different from execing a process on the inside.
>> Implementation wise it is changing the context pointer on your task.
> If I understand you correctly, it is a fully insecure way of doing things.
> After changing context without applying all the restrictions which should be
> imposed by the VPS, your process will be ptrace'able and so on.

Not exactly insecure, but something you need to be careful with.

It's an idea I don't like personally. But I like it more than ad-hoc
mechanisms for modifying what a guest gets to look at.


>>>Another example: when the VPS owner
>>>is near its resource limits, you won't be able to do anything after
>>>entering the VPS.
>> For debugging this is a good reason for being inside. What if the
>> problem is that you are out of resources?
> Debugging - yes; in production - no.

I was talking about sysadmin-style debugging.

> That is why OpenVZ allows the host system to access VPS resources - for
> debugging in production.

How this should be accomplished is something I will freely admit is up
in the air. But I don't want an assumption that the host system will
always be able to access guest resources.

> Thanks a lot for valuable discussion and your time!

Welcome.

Eric

2006-02-03 15:45:39

by Dave Hansen

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

On Fri, 2006-02-03 at 13:52 +0300, Kirill Korotaev wrote:
> How can you migrate an application which consists of two processes doing
> IPC via signals? They are not tied together inside the kernel in any way,
> and there is no way to automatically detect that both should be migrated
> together. VPSs are what provide you such boundaries of what should be
> considered as a whole.

Could you explain a little bit _how_ VPSs provide this?

-- Dave

2006-02-03 16:34:56

by Kirill Korotaev

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

>>How can you migrate an application which consists of two processes doing
>>IPC via signals? They are not tied together inside the kernel in any way,
>>and there is no way to automatically detect that both should be migrated
>>together. VPSs are what provide you such boundaries of what should be
>>considered as a whole.
>
>
> Could you explain a little bit _how_ VPSs provide this?
OpenVZ virtualizes kernel resources such as IPC, making them per-VPS.
Processes in a VPS deal with such virtualized resources, so they
effectively must be migrated together. That's it - a whole container
with its resources should be considered as a whole.

Signals are not virtualized in this manner since they are rather a
per-task resource, but OpenVZ introduces strict boundaries in the kernel
which make sure that no third task from another container (VPS) can
participate in the communication.
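
A sketch of that boundary in miniature, assuming an invented
deliver_signal() check (this is not OpenVZ code): delivery is refused
unless sender and target share a VPS.

/*
 * Toy model of the VPS boundary on signal delivery. All names are
 * illustrative assumptions.
 */
#include <stdio.h>

struct vps { int id; };

struct task {
	int pid;
	struct vps *vps;
};

static int deliver_signal(struct task *from, struct task *to, int sig)
{
	if (from->vps != to->vps) {
		printf("pid %d -> pid %d: blocked at VPS boundary\n",
		       from->pid, to->pid);
		return -1;		/* -EPERM in a real kernel */
	}
	printf("pid %d -> pid %d: signal %d delivered\n",
	       from->pid, to->pid, sig);
	return 0;
}

int main(void)
{
	struct vps v1 = { .id = 1 }, v2 = { .id = 2 };
	struct task a = { .pid = 10, .vps = &v1 };
	struct task b = { .pid = 11, .vps = &v1 };
	struct task c = { .pid = 12, .vps = &v2 };

	deliver_signal(&a, &b, 15);	/* same VPS: allowed */
	deliver_signal(&a, &c, 15);	/* crosses the boundary: blocked */
	return 0;
}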

Kirill

2006-02-06 20:15:37

by Pavel Machek

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Hi!

> There are two issues here.
> 1) Monitoring. (ps, top etc)
> 2) Control (kill).
>
> For monitoring you might need to patch ps/top a little but it is doable without
> 2 pids.
>
> For kill it is extremely rude to kill processes inside of a nested pid space.
> There are other solutions to the problem.

Can you elaborate? If I have 10 containers with 1000 processes each,
it would be nice to be able to run top then kill 40 top cpu hogs....
--
Thanks, Sharp!

2006-02-06 20:35:09

by Kirill Korotaev

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

>>There are two issues here.
>>1) Monitoring. (ps, top etc)
>>2) Control (kill).
>>
>>For monitoring you might need to patch ps/top a little but it is doable without
>>2 pids.
>>
>>For kill it is extremely rude to kill processes inside of a nested pid space.
>>There are other solutions to the problem.

> Can you elaborate? If I have 10 containers with 1000 processes each,
> it would be nice to be able to run top then kill 40 top cpu hogs....
This is exactly the reason why we allow the host system to see all the
containers/VPSs/processes.

Otherwise monitoring tools would have to be fixed for it, which doesn't
look good, and top/ps/kill are not the only tools in the world.
Without such functionality you can't tell whether your machine is
underloaded or overloaded.

Kirill

2006-02-06 20:36:38

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Pavel Machek <[email protected]> writes:

> Hi!
>
>> There are two issues here.
>> 1) Monitoring. (ps, top etc)
>> 2) Control (kill).
>>
>> For monitoring you might need to patch ps/top a little but it is doable
>> without 2 pids.
>>
>> For kill it is extremely rude to kill processes inside of a nested pid space.
>> There are other solutions to the problem.
>
> Can you elaborate? If I have 10 containers with 1000 processes each,
> it would be nice to be able to run top then kill 40 top cpu hogs....

So I just posted my patches to lkml if you want to see the details.
But the way I have implemented it, each container has a pid in its parent's
pid namespace.

To top, that pid looks essentially like a thread group leader, and top will
tell you which container contains the cpu hogs.

Currently, from the outside your choice is to kill or not kill the entire
container, which lets you kill the cpu hogs. The control is not broken
down so finely that it is easy to do something the sysadmin of the container
should be doing.
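
A rough model of that view from the parent namespace (all names invented
for illustration): the whole container shows up as one pid, and killing
that pid tears down everything inside.

/*
 * Toy model: a nested pid space represented in its parent namespace by
 * a single pid, killable as a unit. Not code from the posted patches.
 */
#include <stdio.h>

#define MAX_TASKS 4

struct pid_space {
	const char *name;
	int outer_pid;			/* this space's pid in its parent */
	int tasks[MAX_TASKS];		/* pids local to this space */
	int ntasks;
};

/* from outside, killing the container's pid kills everything inside */
static void kill_container(struct pid_space *c)
{
	printf("kill %d: tearing down '%s' (%d inner tasks)\n",
	       c->outer_pid, c->name, c->ntasks);
	c->ntasks = 0;
}

int main(void)
{
	struct pid_space guest = {
		.name = "guest", .outer_pid = 4242,
		.tasks = { 1, 2, 3 }, .ntasks = 3,
	};

	/* top in the parent namespace sees only pid 4242 */
	kill_container(&guest);
	return 0;
}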

Eric

2006-02-06 20:41:44

by Eric W. Biederman

[permalink] [raw]
Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

Kirill Korotaev <[email protected]> writes:

> This is exactly the reason why we allow the host system to see all the
> containers/VPSs/processes.

Which makes for a really hairy and noticeably different logical implementation.
At least that was my impression when glancing at your patches. I haven't
had a chance to look at them in depth yet.

> Otherwise monitoring tools would have to be fixed for it, which doesn't look
> good, and top/ps/kill are not the only tools in the world.
> Without such functionality you can't tell whether your machine is
> underloaded or overloaded.

Please look at my code. I think it occupies a place in the problem domain
you haven't considered.

Except for detailed information, everything is there for existing tools.

Eric