2008-11-19 02:59:42

by Michael Kerrisk

Subject: Documentation for CLONE_NEWPID

Pavel, Kir,

Drawing fairly heavily on your LWN.net article (http://lwn.net/Articles/259217/), plus the kernel source and some experimentation, I created the patch below to document CLONE_NEWPID for the clone(2) manual page. Could you please review and let me know of any improvements or inaccuracies.

Thanks,

Michael

--- a/man2/clone.2
+++ b/man2/clone.2
@@ -266,6 +268,78 @@ in the same
.BR clone ()
call.
.TP
+.BR CLONE_NEWPID " (since Linux 2.6.24)"
+.\" This explanation draws a lot of details from
+.\" http://lwn.net/Articles/259217/
+.\" Authors: Pavel Emelyanov <[email protected]>
+.\" and Kir Kolyshkin <[email protected]>
+.\"
+.\" The primary kernel commit is 30e49c263e36341b60b735cbef5ca37912549264
+.\" Author: Pavel Emelyanov <[email protected]>
+If
+.B CLONE_NEWPID
+is set, then create the process in a new PID namespace.
+If this flag is not set, then (as with
+.BR fork (2)),
+the process is created in the same PID namespace as
+the calling process.
+This flag is intended for the implementation of control groups.
+
+A PID namespace provides an isolated environment for PIDs:
+PIDs in a new namespace start at 1,
+somewhat like a standalone system, and calls to
+.BR fork (2),
+.BR vfork (2),
+or
+.BR clone (2)
+will produce processes whose PIDs within the namespace
+are only guaranteed to be unique within that namespace.
+
+The first process created in a new namespace
+(i.e., the process created using the
+.BR CLONE_NEWPID
+flag) has the PID 1, and is the "init" process for the namespace.
+Children that are orphaned within the namespace will be reparented
+to this process rather than
+.BR init (8).
+Unlike the traditional
+.B init
+process, the "init" process of a PID namespace can terminate,
+and if it does, all of the processes in the namespace are terminated.
+
+PID namespaces form a hierarchy.
+When a new PID namespace is created,
+the PIDs of the processes in that namespace are visible
+in the PID namespace of the process that created the new namespace;
+analogously, if the parent PID namespace is itself
+the child of another PID namespace,
+then PIDs of the child and parent PID namespaces will both be
+visible in the grandparent PID namespace.
+Conversely, the processes in the "child" PID namespace do not see
+the PIDs of the processes in the parent namespace.
+The existence of a namespace hierarchy means that each process
+may now have multiple PIDs:
+one for each namespace in which it is visible.
+(A call to
+.BR getpid (2)
+always returns the PID associated with the namespace in which
+the process was created.)
+
+After creating the new namespace,
+it is useful for the child to change its root directory
+and mount a new procfs instance at
+.I /proc
+so that tools such as
+.BR ps (1)
+work correctly.
+
+Use of this flag requires a kernel configured with the
+.B CONFIG_PID_NS
+option and a privileged process
+.RB ( CAP_SYS_ADMIN ).
+This flag can't be specified in conjunction with
+.BR CLONE_THREAD .
+.TP
.BR CLONE_PARENT " (since Linux 2.3.12)"
If
.B CLONE_PARENT
@@ -627,6 +701,14 @@ were specified in
.IR flags .
.TP
.B EINVAL
+Both
+.BR CLONE_NEWPID
+and
+.BR CLONE_THREAD
+were specified in
+.IR flags .
+.TP
+.B EINVAL
Returned by
.BR clone ()
when a zero value is specified for
@@ -639,6 +721,8 @@ copied.
.TP
.B EPERM
.B CLONE_NEWNS
+or
+.B CLONE_NEWPID
was specified by a non-root process (process without \fBCAP_SYS_ADMIN\fP).
.TP
.B EPERM
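
(Illustration only, not part of the patch.) The CLONE_THREAD restriction described in the EINVAL hunk can be probed with a small C sketch. The helper name errno_for_newpid_with_thread() is made up for this example, and the raw clone system call is used with flags as the first argument, which is an assumption that holds on x86 and ARM but not on every architecture.

```c
#define _GNU_SOURCE            /* for syscall() under strict compiler modes */
#include <errno.h>
#include <linux/sched.h>       /* CLONE_* flag values */
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: attempt the CLONE_NEWPID | CLONE_THREAD combination
 * and return the errno that the kernel reports (0 if the call succeeded).
 * CLONE_SIGHAND and CLONE_VM are added because CLONE_THREAD is invalid
 * without them, which would hide the rule being shown here.
 */
static int errno_for_newpid_with_thread(void)
{
    long flags = CLONE_NEWPID | CLONE_THREAD | CLONE_SIGHAND | CLONE_VM;
    long ret;

    errno = 0;
    /* No stack is supplied: the call is expected to be rejected
       before any new task starts running. */
    ret = syscall(SYS_clone, flags, 0L, 0L, 0L, 0L);
    if (ret == 0)
        syscall(SYS_exit, 0L);  /* defensive only: not expected to be reached */

    return ret == -1 ? errno : 0;
}
```

On a kernel that behaves as the patch describes, the helper should return EINVAL. An unprivileged caller may see EPERM instead, depending on the kernel version and on any sandboxing in place.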


2008-11-23 22:19:54

by Serge E. Hallyn

Subject: Re: Documentation for CLONE_NEWPID

Quoting Michael Kerrisk ([email protected]):
> Pavel, Kir,
>
> Drawing fairly heavily on your LWN.net article
> (http://lwn.net/Articles/259217/), plus the kernel source and some
> experimentation, I created the patch below to document CLONE_NEWPID for the
> clone(2) manual page. Could you please review and let me know of any
> improvements or inaccuracies.


Thanks, Michael, this is something we've definitely been wanting to
get to.

...

> +(i.e., the process created using the
> +.BR CLONE_NEWPID
> +flag) has the PID 1, and is the "init" process for the namespace.
> +Children that are orphaned within the namespace will be reparented
> +to this process rather than
> +.BR init (8).
> +Unlike the traditional
> +.B init
> +process, the "init" process of a PID namespace can terminate,
> +and if it does, all of the processes in the namespace are terminated.
> +
> +PID namespaces form a hierarchy.
> +When a new PID namespace is created,
> +the PIDs of the processes in that namespace are visible

The processes in that namespace are visible, but by different
pids. So saying that the pids are visible in the parent
pidns isn't quite right.

> +in the PID namespace of the process that created the new namespace;
> +analogously, if the parent PID namespace is itself
> +the child of another PID namespace,
> +then PIDs of the child and parent PID namespaces will both be

Again, the processes, not pids, are visible.

> +visible in the grandparent PID namespace.
> +Conversely, the processes in the "child" PID namespace do not see
> +the PIDs of the processes in the parent namespace.
> +The existence of a namespace hierarchy means that each process
> +may now have multiple PIDs:
> +one for each namespace in which it is visible.
> +(A call to
> +.BR getpid (2)
> +always returns the PID associated with the namespace in which
> +the process was created.)
> +
> +After creating the new namespace,
> +it is useful for the child to change its root directory
> +and mount a new procfs instance at
> +.I /proc
> +so that tools such as
> +.BR ps (1)
> +work correctly.

Probably not worth mentioning here, but if it has done
CLONE_NEWNS then it doesn't need to change its root, it
can just mount a new proc instance over /proc.
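
A rough C sketch of that variant, assuming a privileged caller. The helper names are hypothetical, the raw clone call assumes an architecture where flags is the first argument (x86, ARM), and marking / private first is an extra precaution for systems where mounts are shared, not something taken from this thread:

```c
#define _GNU_SOURCE            /* for syscall() under strict compiler modes */
#include <linux/sched.h>       /* CLONE_NEWPID, CLONE_NEWNS */
#include <signal.h>            /* SIGCHLD */
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs in the child: mount a fresh procfs over /proc and return the PID
   that /proc/self resolves to, or -1 on any failure. */
static int remount_proc_and_read_self(void)
{
    char buf[32];
    ssize_t len;

    /* Keep the new mount from propagating back to the parent's
       mount namespace (assumption: / may be a shared mount). */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        return -1;
    /* No chroot needed: the mount lands only in our own mount namespace. */
    if (mount("proc", "/proc", "proc", 0, NULL) == -1)
        return -1;

    len = readlink("/proc/self", buf, sizeof(buf) - 1);
    if (len <= 0)
        return -1;
    buf[len] = '\0';
    return atoi(buf);
}

/* Hypothetical helper: returns what /proc/self shows inside new PID and
   mount namespaces, or -1 if anything fails (e.g. lack of privilege). */
static int proc_self_in_new_namespaces(void)
{
    int pipefd[2];
    int result = -1;
    long child;
    ssize_t n;

    if (pipe(pipefd) == -1)
        return -1;

    /* NULL stack: the child continues on a copy of this stack, as with fork(2). */
    child = syscall(SYS_clone, (long)(CLONE_NEWPID | CLONE_NEWNS | SIGCHLD),
                    0L, 0L, 0L, 0L);
    if (child == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }
    if (child == 0) {
        close(pipefd[0]);
        result = remount_proc_and_read_self();
        n = write(pipefd[1], &result, sizeof(result));
        _exit(n == (ssize_t)sizeof(result) ? 0 : 1);
    }

    close(pipefd[1]);
    n = read(pipefd[0], &result, sizeof(result));
    close(pipefd[0]);
    waitpid((pid_t)child, NULL, 0);
    return n == (ssize_t)sizeof(result) ? result : -1;
}
```

When it works, proc_self_in_new_namespaces() should return 1, since /proc/self in the fresh procfs instance resolves to the child's PID in its own namespace. An unprivileged caller gets -1.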

> +
> +Use of this flag requires a kernel configured with the
> +.B CONFIG_PID_NS
> +option and a privileged process
> +.RB ( CAP_SYS_ADMIN ).
> +This flag can't be specified in conjunction with
> +.BR CLONE_THREAD .
> +.TP
> .BR CLONE_PARENT " (since Linux 2.3.12)"
> If
> .B CLONE_PARENT
> @@ -627,6 +701,14 @@ were specified in
> .IR flags .
> .TP
> .B EINVAL
> +Both
> +.BR CLONE_NEWPID
> +and
> +.BR CLONE_THREAD
> +were specified in
> +.IR flags .
> +.TP
> +.B EINVAL
> Returned by
> .BR clone ()
> when a zero value is specified for
> @@ -639,6 +721,8 @@ copied.
> .TP
> .B EPERM
> .B CLONE_NEWNS
> +or
> +.B CLONE_NEWPID
> was specified by a non-root process (process without \fBCAP_SYS_ADMIN\fP).
> .TP
> .B EPERM

I assume you've considered the pros and cons of mentioning
signal semantics with respect to init tasks of child pid namespaces,
and decided it's not worth mentioning yet as the semantics are not
yet finalized?

The goal is to treat the process as a system-wide init with respect
to signals coming from its own namespace, and treat it as an ordinary
task for signals coming from its ancestor namespaces. But as you've
probably read, the implementation may result in some unfortunate
side-effects regarding blocked signals etc.

thanks,
-serge

2008-11-24 12:47:19

by Pavel Emelyanov

Subject: Re: Documentation for CLONE_NEWPID

Michael Kerrisk wrote:
> Pavel, Kir,
>
> Drawing fairly heavily on your LWN.net article (http://lwn.net/Articles/259217/), plus the kernel
> source and some experimentation, I created the patch below to document CLONE_NEWPID for the clone(2)
> manual page. Could you please review and let me know of any improvements or inaccuracies.

Michael, sorry for the late response - I was on vacation last week and didn't
have a chance to check my mail. Some comments are inline.

> Thanks,
>
> Michael
>
> --- a/man2/clone.2
> +++ b/man2/clone.2
> @@ -266,6 +268,78 @@ in the same
> .BR clone ()
> call.
> .TP
> +.BR CLONE_NEWPID " (since Linux 2.6.24)"
> +.\" This explanation draws a lot of details from
> +.\" http://lwn.net/Articles/259217/
> +.\" Authors: Pavel Emelyanov <[email protected]>
> +.\" and Kir Kolyshkin <[email protected]>
> +.\"
> +.\" The primary kernel commit is 30e49c263e36341b60b735cbef5ca37912549264
> +.\" Author: Pavel Emelyanov <[email protected]>
> +If
> +.B CLONE_NEWPID
> +is set, then create the process in a new PID namespace.
> +If this flag is not set, then (as with
> +.BR fork (2)),
> +the process is created in the same PID namespace as
> +the calling process.
> +This flag is intended for the implementation of control groups.

Well, actually this has nothing to do with control groups. This
flag is intended to be used to facilitate the creation of containers
along with many other clone flags. Control groups is yet another
way to create a container.

> +A PID namespace provides an isolated environment for PIDs:
> +PIDs in a new namespace start at 1,
> +somewhat like a standalone system, and calls to
> +.BR fork (2),
> +.BR vfork (2),
> +or
> +.BR clone (2)
> +will produce processes whose PIDs within the namespace
> +are only guaranteed to be unique within that namespace.

Well, I'm not sure I understood correctly what was meant here, but after
we have a namespace each task has two pids. And _all_ of them are unique
in corresponding namespaces.

> +The first process created in a new namespace
> +(i.e., the process created using the
> +.BR CLONE_NEWPID
> +flag) has the PID 1, and is the "init" process for the namespace.
> +Children that are orphaned within the namespace will be reparented
> +to this process rather than
> +.BR init (8).
> +Unlike the traditional
> +.B init
> +process, the "init" process of a PID namespace can terminate,
> +and if it does, all of the processes in the namespace are terminated.
> +
> +PID namespaces form a hierarchy.
> +When a new PID namespace is created,
> +the PIDs of the processes in that namespace are visible
> +in the PID namespace of the process that created the new namespace;
> +analogously, if the parent PID namespace is itself
> +the child of another PID namespace,
> +then PIDs of the child and parent PID namespaces will both be
> +visible in the grandparent PID namespace.
> +Conversely, the processes in the "child" PID namespace do not see
> +the PIDs of the processes in the parent namespace.
> +The existence of a namespace hierarchy means that each process
> +may now have multiple PIDs:
> +one for each namespace in which it is visible.
> +(A call to
> +.BR getpid (2)
> +always returns the PID associated with the namespace in which
> +the process was created.)

I don't think it's a good example - getpid cannot be called
for any process other than the current one :)

> +
> +After creating the new namespace,
> +it is useful for the child to change its root directory
> +and mount a new procfs instance at
> +.I /proc
> +so that tools such as
> +.BR ps (1)
> +work correctly.
> +
> +Use of this flag requires a kernel configured with the
> +.B CONFIG_PID_NS
> +option and a privileged process
> +.RB ( CAP_SYS_ADMIN ).
> +This flag can't be specified in conjunction with
> +.BR CLONE_THREAD .
> +.TP
> .BR CLONE_PARENT " (since Linux 2.3.12)"
> If
> .B CLONE_PARENT
> @@ -627,6 +701,14 @@ were specified in
> .IR flags .
> .TP
> .B EINVAL
> +Both
> +.BR CLONE_NEWPID
> +and
> +.BR CLONE_THREAD
> +were specified in
> +.IR flags .
> +.TP
> +.B EINVAL
> Returned by
> .BR clone ()
> when a zero value is specified for
> @@ -639,6 +721,8 @@ copied.
> .TP
> .B EPERM
> .B CLONE_NEWNS
> +or
> +.B CLONE_NEWPID
> was specified by a non-root process (process without \fBCAP_SYS_ADMIN\fP).
> .TP
> .B EPERM
>

2008-11-25 15:09:39

by Michael Kerrisk

Subject: Re: Documentation for CLONE_NEWPID

Hi Serge,

[...]

>> +PID namespaces form a hierarchy.
>> +When a new PID namespace is created,
>> +the PIDs of the processes in that namespace are visible
>
> The processes in that namespace are visible, but by different
> pids. So saying that the pids are visible in the parent
> pidns isn't quite right.

Yes, good point. I made that last line:

"... the processes in that namespace are visible..."

>> +in the PID namespace of the process that created the new namespace;
>> +analogously, if the parent PID namespace is itself
>> +the child of another PID namespace,
>> +then PIDs of the child and parent PID namespaces will both be
>
> Again, the processes, not pids, are visible.

Yep. Fixed.

[...]

>> +After creating the new namespace,
>> +it is useful for the child to change its root directory
>> +and mount a new procfs instance at
>> +.I /proc
>> +so that tools such as
>> +.BR ps (1)
>> +work correctly.
>
> Probably not worth mentioning here, but if it has done
> CLONE_NEWNS then it doesn't need to change its root, it
> can just mount a new proc instance over /proc.

Actually, I think it is worth mentioning. When I wrote the text, I
suspected that the point you make here was true, but I wasn't 100%
sure. Therefore, I think it worth adding a sentence like your comment
here, and I've done so.

[...]

> I assume you've considered the pros and cons of mentioning
> signal semantics with respect to init tasks of child pid namespaces,
> and decided it's not worth mentioning yet as the semantics are not
> yet finalized?
>
> The goal is to treat the process as a system-wide init with respect
> to signals coming from its own namespace, and treat it as an ordinary
> task for signals coming from its ancestor namespaces. But as you've
> probably read, the implementation may result in some unfortunate
> side-effects regarding blocked signals etc.

No I haven't considered this at all. Could you provide some pointers
to relevant discussions on this?

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
git://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
man-pages online: http://www.kernel.org/doc/man-pages/online_pages.html
Found a bug? http://www.kernel.org/doc/man-pages/reporting_bugs.html

2008-11-25 15:46:30

by Michael Kerrisk

Subject: Re: Documentation for CLONE_NEWPID

Hi Pavel,

On Mon, Nov 24, 2008 at 7:46 AM, Pavel Emelyanov <[email protected]> wrote:
> Michael Kerrisk wrote:
>> Pavel, Kir,
>>
>> Drawing fairly heavily on your LWN.net article (http://lwn.net/Articles/259217/), plus the kernel
>> source and some experimentation, I created the patch below to document CLONE_NEWPID for the clone(2)
>> manual page. Could you please review and let me know of any improvements or inaccuracies.
>
> Michael, sorry for the late response - I was on vacation last week and didn't
> have a chance to check my mail.

No problem.

> Some comments are inline.

Thanks!

[...]

>> +This flag is intended for the implementation of control groups.
>
> Well, actually this has nothing to do with control groups. This
> flag is intended to be used to facilitate the creation of containers
> along with many other clone flags. Control groups is yet another
> way to create a container.

Yep, after an earlier mail from Eric, I already changed this to "containers".

>> +A PID namespace provides an isolated environment for PIDs:
>> +PIDs in a new namespace start at 1,
>> +somewhat like a standalone system, and calls to
>> +.BR fork (2),
>> +.BR vfork (2),
>> +or
>> +.BR clone (2)
>> +will produce processes whose PIDs within the namespace
>> +are only guaranteed to be unique within that namespace.
>
> Well, I'm not sure I understood correctly what was meant here, but after

I've simplified that sentence somewhat. Now it just reads:

A PID namespace provides an isolated environment for
PIDs: PIDs in a new namespace start at 1, somewhat like
a standalone system, and calls to fork(2), vfork(2), or
clone(2) will produce processes with PIDs that are
unique within the namespace.

> we have a namespace each task has two pids. And _all_ of them are unique
> in corresponding namespaces.

And I already make that point lower down in the text (see ***), but
now I extended the sentence there a little.

[...]

*** Here's where I make the point about each process having multiple PIDs:

>> +The existence of a namespace hierarchy means that each process
>> +may now have multiple PIDs:
>> +one for each namespace in which it is visible.

I added some words here:

"each of these PIDs is unique within the corresponding namespace".

>> +(A call to
>> +.BR getpid (2)
>> +always returns the PID associated with the namespace in which
>> +the process was created.)
>
> I don't think it's a good example - getpid cannot be called
> for any process other than the current one :)

It wasn't meant as an example. The point was, with a process
potentially being a member of multiple namespaces, the reader might
wonder: what does getpid(2) return? This sentence was intended to
clarify that. With that explanation, does this sentence now seem
okay?
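
To make the getpid(2) point concrete, here is a minimal C sketch (an illustration only, not proposed man page text). The helper name is invented for the example, and the raw clone call assumes an architecture where flags is the first argument (x86, ARM). It reports the same child process under both of its PIDs:

```c
#define _GNU_SOURCE            /* for syscall() under strict compiler modes */
#include <errno.h>
#include <linux/sched.h>       /* CLONE_NEWPID */
#include <signal.h>            /* SIGCHLD */
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a child in a new PID namespace and report
 * its two PIDs. *outer is the value the creator gets back from clone;
 * *inner is what getpid yields inside the new namespace.
 * Returns 0 on success, or -1 with errno set (EPERM if unprivileged).
 */
static int pids_in_both_namespaces(pid_t *outer, pid_t *inner)
{
    int pipefd[2];
    long child;
    ssize_t n;
    int saved;

    if (pipe(pipefd) == -1)
        return -1;

    /* NULL stack: the child continues on a copy of this stack, as with fork(2). */
    child = syscall(SYS_clone, (long)(CLONE_NEWPID | SIGCHLD), 0L, 0L, 0L, 0L);
    if (child == -1) {
        saved = errno;
        close(pipefd[0]);
        close(pipefd[1]);
        errno = saved;
        return -1;
    }

    if (child == 0) {
        /* Child: the raw system call sidesteps any PID caching in
           older C libraries. */
        pid_t self = (pid_t)syscall(SYS_getpid);

        close(pipefd[0]);
        n = write(pipefd[1], &self, sizeof(self));
        _exit(n == (ssize_t)sizeof(self) ? 0 : 1);
    }

    /* Parent: collect the child's own view of its PID. */
    close(pipefd[1]);
    n = read(pipefd[0], inner, sizeof(*inner));
    close(pipefd[0]);
    waitpid((pid_t)child, NULL, 0);
    if (n != (ssize_t)sizeof(*inner)) {
        errno = EIO;
        return -1;
    }
    *outer = (pid_t)child;
    return 0;
}
```

With sufficient privilege, *inner should come back as 1 while *outer is the ordinary PID by which the creating process sees the child.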

[...]

Cheers,

Michael


2008-11-25 15:55:27

by Pavel Emelyanov

Subject: Re: Documentation for CLONE_NEWPID

>>> +A PID namespace provides an isolated environment for PIDs:
>>> +PIDs in a new namespace start at 1,
>>> +somewhat like a standalone system, and calls to
>>> +.BR fork (2),
>>> +.BR vfork (2),
>>> +or
>>> +.BR clone (2)
>>> +will produce processes whose PIDs within the namespace
>>> +are only guaranteed to be unique within that namespace.
>> Well, I'm not sure I understood correctly what was meant here, but after
>
> I've simplified that sentence somewhat. Now it just reads:
>
> A PID namespace provides an isolated environment for
> PIDs: PIDs in a new namespace start at 1, somewhat like
> a standalone system, and calls to fork(2), vfork(2), or
> clone(2) will produce processes with PIDs that are
> unique within the namespace.

OK, thanks.

>> we have a namespace each task has two pids. And _all_ of them are unique
>> in corresponding namespaces.
>
> And I already make that point lower down in the text (see ***), but
> now I extended the sentence there a little.
>
> [...]
>
> *** Here's where I make the point about each process having multiple PIDs:
>
>>> +The existence of a namespace hierarchy means that each process
>>> +may now have multiple PIDs:
>>> +one for each namespace in which it is visible.
>
> I added some words here:
>
> "each of these PIDs is unique within the corresponding namespace".

Correct.

>>> +(A call to
>>> +.BR getpid (2)
>>> +always returns the PID associated with the namespace in which
>>> +the process was created.)
>> I don't think it's a good example - getpid cannot be called
>> for any process other than the current one :)
>
> It wasn't meant as an example. The point was, with a process
> potentially being a member of multiple namespaces, the reader might
> wonder: what does getpid(2) return? This sentence was intended to
> clarify that. With that explanation, does this sentence now seem
> okay?

Yes, but I'd change "was created" into "lives in". From my POV this
sounds more clear. I do not insist however :)

> [...]
>
> Cheers,
>
> Michael
>

2008-11-25 16:27:45

by Michael Kerrisk

Subject: Re: Documentation for CLONE_NEWPID

Hi Pavel,

[...]

>> *** Here's where I make the point about each process having multiple PIDs:
>>
>>>> +The existence of a namespace hierarchy means that each process
>>>> +may now have multiple PIDs:
>>>> +one for each namespace in which it is visible.
>>
>> I added some words here:
>>
>> "each of these PIDs is unique within the corresponding namespace".
>
> Correct.

Thanks for the ACK.

>>>> +(A call to
>>>> +.BR getpid (2)
>>>> +always returns the PID associated with the namespace in which
>>>> +the process was created.)
>>> I don't think it's a good example - getpid cannot be called
>>> for any process other than the current one :)
>>
>> It wasn't meant as an example. The point was, with a process
>> potentially being a member of multiple namespaces, the reader might
>> wonder: what does getpid(2) return? This sentence was intended to
>> clarify that. With that explanation, does this sentence now seem
>> okay?
>
> Yes, but I'd change "was created" into "lives in". From my POV this
> sounds more clear. I do not insist however :)

Done.

Thanks,

Michael


2008-11-26 17:08:01

by Serge E. Hallyn

Subject: Re: Documentation for CLONE_NEWPID

Quoting Michael Kerrisk ([email protected]):
> Hi Serge,
> > I assume you've considered the pros and cons of mentioning
> > signal semantics with respect to init tasks of child pid namespaces,
> > and decided it's not worth mentioning yet as the semantics are not
> > yet finalized?
> >
> > The goal is to treat the process as a system-wide init with respect
> > to signals coming from its own namespace, and treat it as an ordinary
> > task for signals coming from its ancestor namespaces. But as you've
> > probably read, the implementation may result in some unfortunate
> > side-effects regarding blocked signals etc.
>
> No I haven't considered this at all. Could you provide some pointers
> to relevant discussions on this?

The discussions are going on right now on lkml (see for instance
http://lkml.org/lkml/2008/11/15/124). But I think Nadia has already
written something up. Nadia, could you send what you have to
Michael?

thanks,
-serge