2013-03-01 05:07:59

by Rob Landley

Subject: Re: For review: pid_namespaces(7) man page

On 02/28/2013 05:24:07 AM, Michael Kerrisk (man-pages) wrote:
> Eric et al,
>
> Eventually, there will be more namespace man pages, but let us start
> now with one for PID namespaces. The attached page aims to provide a
> fairly complete overview of PID namespaces.

Onward!

> PID_NAMESPACES(7) Linux Programmer's Manual PID_NAMESPACES(7)
>
> NAME
> pid_namespaces - overview of Linux PID namespaces
>
> DESCRIPTION
> For an overview of namespaces, see namespaces(7).
>
> PID namespaces isolate the process ID number space, meaning
> that processes in different PID namespaces can have the same
> PID.

Um, perhaps "different processes"? Slightly repetitive, but trying to
avoid the potential misreading that "a process can have the same PID
in different namespaces". (A single process can't be a member of more
than one namespace. This is not about selective visibility.)

> PID namespaces allow containers to migrate to a new host
> while the processes inside the container maintain the same
> PIDs.

I thought suspend/resume a container was the simple case. Migration to
a new host is built on top of that. (On resume in a new container on
the same system, if other stuff is going on in the system so the
available PIDs have shifted.)

> Likewise, a process in an ancestor namespace can—subject to the
> usual permission checks described in kill(2)—send signals to
> the "init" process of a child PID namespace only if the "init"
> process has established a handler for that signal. (Within the
> handler, the siginfo_t si_pid field described in sigaction(2)
> will be zero.) SIGKILL or SIGSTOP are treated exceptionally:
> these signals are forcibly delivered when sent from an ancestor
> PID namespace. Neither of these signals can be caught by the
> "init" process, and so will result in the usual actions associ‐
> ated with those signals (respectively, terminating and stopping
> the process).

If SIGKILL to init is propagated to all the children of init, is
SIGSTOP also propagated to all the children? (I.E. will SIGSTOP to
container's init suspend the whole container, and will SIGCONT resume
the whole container? If the latter, will it only resume processes that
weren't previously stopped? :)

> To put things another way: a process's PID namespace membership
> is determined when the process is created and cannot be changed
> thereafter. Among other things, this means that the parental
> relationship between processes mirrors the parental between PID

mirrors the relationship

> namespaces: the parent of a process is either in the same
> namespace or resides in the immediate parent PID namespace.
>
> Every thread in a process must be in the same PID namespace.
> For this reason, the two following call sequences will fail:
>
> unshare(CLONE_NEWPID);
> clone(..., CLONE_VM, ...); /* Fails */
>
> setns(fd, CLONE_NEWPID);
> clone(..., CLONE_VM, ...); /* Fails */

They fail with -EUNDOCUMENTED

> Because the above unshare(2) and setns(2) calls only change the
> PID namespace for created children, the clone(2) calls neces‐
> sarily put the new thread in a different PID namespace from the
> calling thread.

Um, no they don't. They fail. That's the point. They _would_ put the
new thread in a different PID namespace, which breaks the definition of
threads.

How about:

The above unshare(2) and setns(2) calls change the PID namespace of
children created by subsequent clone(2) calls, which is incompatible
with CLONE_VM.

> Miscellaneous
> After creating a new PID namespace, it is useful for the child
> to change its root directory and mount a new procfs instance at
> /proc so that tools such as ps(1) work correctly. (If a new
> mount namespace is simultaneously created by including
> CLONE_NEWNS in the flags argument of clone(2) or unshare(2)),
> then it isn't necessary to change the root directory: a new
> procfs instance can be mounted directly over /proc.)

Why is the (If) clause in parentheses? And unshare(2)) has a Bruce.
(I.E. unbalanced parens.).

> Calling readlink(2) on the path /proc/self yields the process
> ID of the caller in the PID namespace of the procfs mount
> (i.e., the PID namespace of the process that mounted the
> procfs).

This is per-filesystem rather than using the process's namespace
because...?
(Where /proc/self points is already process-local data, so the races
here can't be too horrible...)

> When a process ID is passed over a UNIX domain socket to a
> process in a different PID namespace (see the description of
> SCM_CREDENTIALS in unix(7)), it is translated into the corre‐
> sponding PID value in the receiving process's PID namespace.

Heh. :)

> CONFORMING TO
> Namespaces are a Linux-specific feature.

And yet the glibc guys insist on #define GNU_GNU_GNU_ALL_HAIL_STALLMAN
in order to access this Linux-specific feature which has nothing
whatsoever to do with the FSF.

The unshare() call originally _didn't_ require this define, but they
retroactively added the requirement in a version "upgrade" to match
your man page. This made me sad. It also made me prototype it myself
rather than expecting the header to provide it.

Rob


2013-03-01 06:58:20

by Eric W. Biederman

Subject: Re: For review: pid_namespaces(7) man page

Rob Landley <[email protected]> writes:

> On 02/28/2013 05:24:07 AM, Michael Kerrisk (man-pages) wrote:
>> Eric et al,
>>
>> Eventually, there will be more namespace man pages, but let us start
>> now with one for PID namespaces. The attached page aims to provide a
>> fairly complete overview of PID namespaces.
>
> Onward!
>
>> PID_NAMESPACES(7) Linux Programmer's Manual PID_NAMESPACES(7)
>>
>> NAME
>> pid_namespaces - overview of Linux PID namespaces
>>
>> DESCRIPTION
>> For an overview of namespaces, see namespaces(7).
>>
>> PID namespaces isolate the process ID number space, meaning
>> that processes in different PID namespaces can have the same
>> PID.
>
> Um, perhaps "different processes"? Slightly repetitive, but trying to
> avoid the potential misreading that "a process can have the same PID
> in different namespaces". (A single process can't be a member of more
> than one namespace. This is not about selective visibility.)

Well, actually a process is visible in, and arguably a member of, all
parent pid namespaces, and a process certainly has a pid value in each
pid namespace up to the root of the pid namespace tree.

>> PID namespaces allow containers to migrate to a new host
>> while the processes inside the container maintain the same
>> PIDs.
>
> I thought suspend/resume a container was the simple case. Migration to
> a new host is built on top of that. (On resume in a new container on
> the same system, if other stuff is going on in the system so the
> available PIDs have shifted.)

I don't know if there is a difference at the implementation level.

>> Likewise, a process in an ancestor namespace can—subject to the
>> usual permission checks described in kill(2)—send signals to
>> the "init" process of a child PID namespace only if the "init"
>> process has established a handler for that signal. (Within the
>> handler, the siginfo_t si_pid field described in sigaction(2)
>> will be zero.) SIGKILL or SIGSTOP are treated exceptionally:
>> these signals are forcibly delivered when sent from an ancestor
>> PID namespace. Neither of these signals can be caught by the
>> "init" process, and so will result in the usual actions associ‐
>> ated with those signals (respectively, terminating and stopping
>> the process).
>
> If SIGKILL to init is propagated to all the children of init, is
> SIGSTOP also propagated to all the children? (I.E. will SIGSTOP to
> container's init suspend the whole container, and will SIGCONT resume
> the whole container? If the latter, will it only resume processes that
> weren't previously stopped? :)

No. A SIGSTOP sent to init stops just init.

It isn't SIGKILL that is propagated; it is the exiting of init that is
propagated by way of SIGKILL. If your init process calls _exit() or
hits a SIGSEGV and dies, all of the other processes in the pid namespace
will be sent a SIGKILL and be forced down.

This is similar to the system panic that occurs if the global init exits.

>> To put things another way: a process's PID namespace membership
>> is determined when the process is created and cannot be changed
>> thereafter. Among other things, this means that the parental
>> relationship between processes mirrors the parental between PID
>
> mirrors the relationship
>
>> namespaces: the parent of a process is either in the same
>> namespace or resides in the immediate parent PID namespace.
>>
>> Every thread in a process must be in the same PID namespace.
>> For this reason, the two following call sequences will fail:
>>
>> unshare(CLONE_NEWPID);
>> clone(..., CLONE_VM, ...); /* Fails */
>>
>> setns(fd, CLONE_NEWPID);
>> clone(..., CLONE_VM, ...); /* Fails */
>
> They fail with -EUNDOCUMENTED
Make that -EINVAL.

>> Because the above unshare(2) and setns(2) calls only change the
>> PID namespace for created children, the clone(2) calls neces‐
>> sarily put the new thread in a different PID namespace from the
>> calling thread.
>
> Um, no they don't. They fail. That's the point. They _would_ put the
> new thread in a different PID namespace, which breaks the definition of
> threads.
>
> How about:
>
> The above unshare(2) and setns(2) calls change the PID namespace of
> children created by subsequent clone(2) calls, which is incompatible
> with CLONE_VM.
>
>> Miscellaneous
>> After creating a new PID namespace, it is useful for the child
>> to change its root directory and mount a new procfs instance at
>> /proc so that tools such as ps(1) work correctly. (If a new
>> mount namespace is simultaneously created by including
>> CLONE_NEWNS in the flags argument of clone(2) or unshare(2)),
>> then it isn't necessary to change the root directory: a new
>> procfs instance can be mounted directly over /proc.)
>
> Why is the (If) clause in parentheses? And unshare(2)) has a Bruce.
> (I.E. unbalanced parens.).
>
>> Calling readlink(2) on the path /proc/self yields the process
>> ID of the caller in the PID namespace of the procfs mount
>> (i.e., the PID namespace of the process that mounted the
>> procfs).
>
> This is per-filesystem rather than using the process's namespace
> because...?

The entire proc filesystem mount is in the pid namespace of the mounting
process, and that applies to every pid that proc reports. /proc/self is
not a special case, but /proc/self can be interesting if you want to find
your pid in that other guy's pid namespace.

> (Where /proc/self points is already process-local data, so the races
> here can't be too horrible...)

It actually is moderately important for /proc/self to do the right thing
here. It means you can run against a /proc that is not for your pid
namespace and all of the /proc/self things that glibc and various other
programs and libraries do continue to work.

>> When a process ID is passed over a UNIX domain socket to a
>> process in a different PID namespace (see the description of
>> SCM_CREDENTIALS in unix(7)), it is translated into the corre‐
>> sponding PID value in the receiving process's PID namespace.
>
> Heh. :)
>
>> CONFORMING TO
>> Namespaces are a Linux-specific feature.
>
> And yet the glibc guys insist on #define GNU_GNU_GNU_ALL_HAIL_STALLMAN
> in order to access this Linux-specific feature which has nothing
> whatsoever to do with the FSF.

As I read it, _GNU_SOURCE just implies libc extensions specific to glibc.
Of course, now that you mention it, _GNU_SOURCE implies that we can
reasonably file a bug against glibc on the HURD or BSD for not
implementing this feature, can't we?

> The unshare() call originally _didn't_ require this define, but they
> retroactively added the requirement in a version "upgrade" to match
> your man page. This made me sad. It also made me prototype it myself
> rather than expecting the header to provide it.

Eric

Subject: Re: For review: pid_namespaces(7) man page

Hi Rob,

On Fri, Mar 1, 2013 at 5:01 AM, Rob Landley <[email protected]> wrote:
> On 02/28/2013 05:24:07 AM, Michael Kerrisk (man-pages) wrote:
[...]

>> DESCRIPTION
>> For an overview of namespaces, see namespaces(7).
>>
>> PID namespaces isolate the process ID number space, meaning
>> that processes in different PID namespaces can have the same
>> PID.
>
>
> Um, perhaps "different processes"? Slightly repetitive, but trying to avoid
> the potential misreading that "a process can have the same PID in
> different namespaces". (A single process can't be a member of more than one
> namespace. This is not about selective visibility.)

I'm not sure this clarifies things...

>> PID namespaces allow containers to migrate to a new host
>> while the processes inside the container maintain the same
>> PIDs.
>
>
> I thought suspend/resume a container was the simple case. Migration to a new
> host is built on top of that. (On resume in a new container on the same
> system, if other stuff is going on in the system so the available PIDs have
> shifted.)

I'll add some words here on suspend/resume.

>> Likewise, a process in an ancestor namespace can—subject to the
>> usual permission checks described in kill(2)—send signals to
>> the "init" process of a child PID namespace only if the "init"
>> process has established a handler for that signal. (Within the
>> handler, the siginfo_t si_pid field described in sigaction(2)
>> will be zero.) SIGKILL or SIGSTOP are treated exceptionally:
>> these signals are forcibly delivered when sent from an ancestor
>> PID namespace. Neither of these signals can be caught by the
>> "init" process, and so will result in the usual actions associ‐
>> ated with those signals (respectively, terminating and stopping
>> the process).
>
>
> If SIGKILL to init is propagated to all the children of init, is SIGSTOP
> also propagated to all the children? (I.E. will SIGSTOP to container's init
> suspend the whole container, and will SIGCONT resume the whole container? If
> the latter, will it only resume processes that weren't previously stopped?
> :)

Covered by Eric.

>> To put things another way: a process's PID namespace membership
>> is determined when the process is created and cannot be changed
>> thereafter. Among other things, this means that the parental
>> relationship between processes mirrors the parental between PID
>
>
> mirrors the relationship

Thanks.

>> namespaces: the parent of a process is either in the same
>> namespace or resides in the immediate parent PID namespace.
>>
>> Every thread in a process must be in the same PID namespace.
>> For this reason, the two following call sequences will fail:
>>
>> unshare(CLONE_NEWPID);
>> clone(..., CLONE_VM, ...); /* Fails */
>>
>> setns(fd, CLONE_NEWPID);
>> clone(..., CLONE_VM, ...); /* Fails */
>
>
> They fail with -EUNDOCUMENTED

Added EINVAL, as per Eric's reply. (Eric, does that error also apply
for the two new cases you added?)

>> Because the above unshare(2) and setns(2) calls only change the
>> PID namespace for created children, the clone(2) calls neces‐
>> sarily put the new thread in a different PID namespace from the
>> calling thread.
>
>
> Um, no they don't. They fail. That's the point.

(Good catch.)

> They _would_ put the new
> thread in a different PID namespace, which breaks the definition of threads.
>
> How about:
>
> The above unshare(2) and setns(2) calls change the PID namespace of
> children created by subsequent clone(2) calls, which is incompatible
> with CLONE_VM.

I decided on:

The point here is that unshare(2) and setns(2) change the PID
namespace for created children but not for the calling process,
while clone(2) CLONE_VM specifies the creation of a new thread
in the same process.

>> Miscellaneous
>> After creating a new PID namespace, it is useful for the child
>> to change its root directory and mount a new procfs instance at
>> /proc so that tools such as ps(1) work correctly. (If a new
>> mount namespace is simultaneously created by including
>> CLONE_NEWNS in the flags argument of clone(2) or unshare(2)),
>> then it isn't necessary to change the root directory: a new
>> procfs instance can be mounted directly over /proc.)
>
>
> Why is the (If) clause in parentheses? And unshare(2)) has a Bruce.
> (I.E. unbalanced parens.).

I'll make some fixes here.

>> Calling readlink(2) on the path /proc/self yields the process
>> ID of the caller in the PID namespace of the procfs mount
>> (i.e., the PID namespace of the process that mounted the
>> procfs).
>
>
> This is per-filesystem rather than using the process's namespace because...?
> (Where /proc/self points is already process-local data, so the races here
> can't be too horrible...)

Explained by Eric.

I'll add:

[[
This can be useful for introspection purposes,
when a process wants to discover its PID in other namespaces.
]]

[...]

>> CONFORMING TO
>> Namespaces are a Linux-specific feature.
>
>
> And yet the glibc guys insist on #define GNU_GNU_GNU_ALL_HAIL_STALLMAN in
> order to access this Linux-specific feature which has nothing whatsoever to
> do with the FSF.

This is a misunderstanding. _GNU_SOURCE is the standard way to expose
Linux-specific functionality from POSIX header files.

> The unshare() call originally _didn't_ require this define, but they
> retroactively added the requirement in a version "upgrade" to match your man
> page. This made me sad. It also made me prototype it myself rather than
> expecting the header to provide it.

Hmmm. I did not notice that change. Ulrich rejected my early (2007)
request for a change
(http://www.sourceware.org/bugzilla/show_bug.cgi?id=4749) and then
quietly made it later (glibc 2.14, 2011).

Thanks for the review, Rob.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

2013-03-01 15:35:52

by Eric W. Biederman

Subject: Re: For review: pid_namespaces(7) man page

"Michael Kerrisk (man-pages)" <[email protected]> writes:

> Hi Rob,
>
> On Fri, Mar 1, 2013 at 5:01 AM, Rob Landley <[email protected]> wrote:
>> On 02/28/2013 05:24:07 AM, Michael Kerrisk (man-pages) wrote:
> [...]
>
>>> DESCRIPTION
>>> For an overview of namespaces, see namespaces(7).
>>>
>>> PID namespaces isolate the process ID number space, meaning
>>> that processes in different PID namespaces can have the same
>>> PID.
>>
>>
>> Um, perhaps "different processes"? Slightly repetitive, but trying to avoid
>> the potential misreading that "a process can have the same PID in
>> different namespaces". (A single process can't be a member of more than one
>> namespace. This is not about selective visibility.)
>
> I'm not sure this clarifies things...
>
>>> PID namespaces allow containers to migrate to a new host
>>> while the processes inside the container maintain the same
>>> PIDs.
>>
>>
>> I thought suspend/resume a container was the simple case. Migration to a new
>> host is built on top of that. (On resume in a new container on the same
>> system, if other stuff is going on in the system so the available PIDs have
>> shifted.)
>
> I'll add some words here on suspend/resume.
>
>>> Likewise, a process in an ancestor namespace can—subject to the
>>> usual permission checks described in kill(2)—send signals to
>>> the "init" process of a child PID namespace only if the "init"
>>> process has established a handler for that signal. (Within the
>>> handler, the siginfo_t si_pid field described in sigaction(2)
>>> will be zero.) SIGKILL or SIGSTOP are treated exceptionally:
>>> these signals are forcibly delivered when sent from an ancestor
>>> PID namespace. Neither of these signals can be caught by the
>>> "init" process, and so will result in the usual actions associ‐
>>> ated with those signals (respectively, terminating and stopping
>>> the process).
>>
>>
>> If SIGKILL to init is propagated to all the children of init, is SIGSTOP
>> also propagated to all the children? (I.E. will SIGSTOP to container's init
>> suspend the whole container, and will SIGCONT resume the whole container? If
>> the latter, will it only resume processes that weren't previously stopped?
>> :)
>
> Covered by Eric.
>
>>> To put things another way: a process's PID namespace membership
>>> is determined when the process is created and cannot be changed
>>> thereafter. Among other things, this means that the parental
>>> relationship between processes mirrors the parental between PID
>>
>>
>> mirrors the relationship
>
> Thanks.
>
>>> namespaces: the parent of a process is either in the same
>>> namespace or resides in the immediate parent PID namespace.
>>>
>>> Every thread in a process must be in the same PID namespace.
>>> For this reason, the two following call sequences will fail:
>>>
>>> unshare(CLONE_NEWPID);
>>> clone(..., CLONE_VM, ...); /* Fails */
>>>
>>> setns(fd, CLONE_NEWPID);
>>> clone(..., CLONE_VM, ...); /* Fails */
>>
>>
>> They fail with -EUNDOCUMENTED
>
> Added EINVAL, as per Eric's reply. (Eric does that error also apply
> for the two new cases you added?).
>
>>> Because the above unshare(2) and setns(2) calls only change the
>>> PID namespace for created children, the clone(2) calls neces‐
>>> sarily put the new thread in a different PID namespace from the
>>> calling thread.
>>
>>
>> Um, no they don't. They fail. That's the point.
>
> (Good catch.)
>
>> They _would_ put the new
>> thread in a different PID namespace, which breaks the definition of threads.
>>
>> How about:
>>
>> The above unshare(2) and setns(2) calls change the PID namespace of
>> children created by subsequent clone(2) calls, which is incompatible
>> with CLONE_VM.
>
> I decided on:
>
> The point here is that unshare(2) and setns(2) change the PID
> namespace for created children but not for the calling process,
> while clone(2) CLONE_VM specifies the creation of a new thread
> in the same process.

Can we make that "for all new tasks created" instead of "created
children"?

Otherwise someone might expect CLONE_THREAD would work, as
CLONE_THREAD creates a thread and not a child...

Eric

2013-03-04 03:50:19

by Rob Landley

Subject: Re: For review: pid_namespaces(7) man page

On 03/01/2013 03:57:40 AM, Michael Kerrisk (man-pages) wrote:
> > And yet the glibc guys insist on #define
> GNU_GNU_GNU_ALL_HAIL_STALLMAN in
> > order to access this Linux-specific feature which has nothing
> whatsoever to
> > do with the FSF.
>
> This is a misunderstanding. _GNU_SOURCE is the standard way to expose
> Linux-specific functionality from POSIX header files.

What standard? The Linux kernel is not, and never was, part of the GNU
project.

Rob-

2013-03-04 04:04:06

by Eric W. Biederman

Subject: Re: For review: pid_namespaces(7) man page

Rob Landley <[email protected]> writes:

> On 03/01/2013 03:57:40 AM, Michael Kerrisk (man-pages) wrote:
>> > And yet the glibc guys insist on #define
>> GNU_GNU_GNU_ALL_HAIL_STALLMAN in
>> > order to access this Linux-specific feature which has nothing
>> whatsoever to
>> > do with the FSF.
>>
>> This is a misunderstanding. _GNU_SOURCE is the standard way to expose
>> Linux-specific functionality from POSIX header files.
>
> What standard? The Linux kernel is not, and never was, part of the GNU
> project.

Is the argument that there should be a _LINUX_SOURCE directive in glibc
for this?

Although come to think of it I can't imagine how <sched.h> is a POSIX
header. Last I looked it only had Linux-specific bits in it. Which
makes needing any kind of #define strange.

Eric

Subject: Re: For review: pid_namespaces(7) man page

On Fri, Mar 1, 2013 at 4:35 PM, Eric W. Biederman <[email protected]> wrote:
> "Michael Kerrisk (man-pages)" <[email protected]> writes:
>
>> Hi Rob,
>>
>> On Fri, Mar 1, 2013 at 5:01 AM, Rob Landley <[email protected]> wrote:
>>> On 02/28/2013 05:24:07 AM, Michael Kerrisk (man-pages) wrote:
>> [...]
>>
>>>> DESCRIPTION
>>>> For an overview of namespaces, see namespaces(7).
>>>>
>>>> PID namespaces isolate the process ID number space, meaning
>>>> that processes in different PID namespaces can have the same
>>>> PID.
>>>
>>>
>>> Um, perhaps "different processes"? Slightly repetitive, but trying to avoid
>>> the potential misreading that "a process can have the same PID in
>>> different namespaces". (A single process can't be a member of more than one
>>> namespace. This is not about selective visibility.)
>>
>> I'm not sure this clarifies things...
>>
>>>> PID namespaces allow containers to migrate to a new host
>>>> while the processes inside the container maintain the same
>>>> PIDs.
>>>
>>>
>>> I thought suspend/resume a container was the simple case. Migration to a new
>>> host is built on top of that. (On resume in a new container on the same
>>> system, if other stuff is going on in the system so the available PIDs have
>>> shifted.)
>>
>> I'll add some words here on suspend/resume.
>>
>>>> Likewise, a process in an ancestor namespace can—subject to the
>>>> usual permission checks described in kill(2)—send signals to
>>>> the "init" process of a child PID namespace only if the "init"
>>>> process has established a handler for that signal. (Within the
>>>> handler, the siginfo_t si_pid field described in sigaction(2)
>>>> will be zero.) SIGKILL or SIGSTOP are treated exceptionally:
>>>> these signals are forcibly delivered when sent from an ancestor
>>>> PID namespace. Neither of these signals can be caught by the
>>>> "init" process, and so will result in the usual actions associ‐
>>>> ated with those signals (respectively, terminating and stopping
>>>> the process).
>>>
>>>
>>> If SIGKILL to init is propagated to all the children of init, is SIGSTOP
>>> also propagated to all the children? (I.E. will SIGSTOP to container's init
>>> suspend the whole container, and will SIGCONT resume the whole container? If
>>> the latter, will it only resume processes that weren't previously stopped?
>>> :)
>>
>> Covered by Eric.
>>
>>>> To put things another way: a process's PID namespace membership
>>>> is determined when the process is created and cannot be changed
>>>> thereafter. Among other things, this means that the parental
>>>> relationship between processes mirrors the parental between PID
>>>
>>>
>>> mirrors the relationship
>>
>> Thanks.
>>
>>>> namespaces: the parent of a process is either in the same
>>>> namespace or resides in the immediate parent PID namespace.
>>>>
>>>> Every thread in a process must be in the same PID namespace.
>>>> For this reason, the two following call sequences will fail:
>>>>
>>>> unshare(CLONE_NEWPID);
>>>> clone(..., CLONE_VM, ...); /* Fails */
>>>>
>>>> setns(fd, CLONE_NEWPID);
>>>> clone(..., CLONE_VM, ...); /* Fails */
>>>
>>>
>>> They fail with -EUNDOCUMENTED
>>
>> Added EINVAL, as per Eric's reply. (Eric does that error also apply
>> for the two new cases you added?).
>>
>>>> Because the above unshare(2) and setns(2) calls only change the
>>>> PID namespace for created children, the clone(2) calls neces‐
>>>> sarily put the new thread in a different PID namespace from the
>>>> calling thread.
>>>
>>>
>>> Um, no they don't. They fail. That's the point.
>>
>> (Good catch.)
>>
>>> They _would_ put the new
>>> thread in a different PID namespace, which breaks the definition of threads.
>>>
>>> How about:
>>>
>>> The above unshare(2) and setns(2) calls change the PID namespace of
>>> children created by subsequent clone(2) calls, which is incompatible
>>> with CLONE_VM.
>>
>> I decided on:
>>
>> The point here is that unshare(2) and setns(2) change the PID
>> namespace for created children but not for the calling process,
>> while clone(2) CLONE_VM specifies the creation of a new thread
>> in the same process.
>
> Can we make that "for all new tasks created" instead of "created
> children"?
>
> Otherwise someone might expect CLONE_THREAD would work, as
> CLONE_THREAD creates a thread and not a child...

The term "task" is kernel-space talk that rarely appears in man pages,
so I am reluctant to use it.

How about this:

The point here is that unshare(2) and setns(2) change the PID
namespace for processes subsequently created by the caller, but
not for the calling process, while clone(2) CLONE_VM specifies
the creation of a new thread in the same process.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

Subject: Re: For review: pid_namespaces(7) man page

On Mon, Mar 4, 2013 at 5:03 AM, Eric W. Biederman <[email protected]> wrote:
> Rob Landley <[email protected]> writes:
>
>> On 03/01/2013 03:57:40 AM, Michael Kerrisk (man-pages) wrote:
>>> > And yet the glibc guys insist on #define
>>> GNU_GNU_GNU_ALL_HAIL_STALLMAN in
>>> > order to access this Linux-specific feature which has nothing
>>> whatsoever to
>>> > do with the FSF.
>>>
>>> This is a misunderstanding. _GNU_SOURCE is the standard way to expose
>>> Linux-specific functionality from POSIX header files.
>>
>> What standard? The Linux kernel is not, and never was, part of the GNU
>> project.
>
> Is the argument that there should be a _LINUX_SOURCE directive in glibc
> for this?
>
> Although come to think of it I can't imagine how <sched.h> is a POSIX
> header. Last I looked it only had Linux-specific bits in it. Which
> makes needing any kind of #define strange.

I think you may be thinking of the wrong sched.h. The glibc
/usr/include/sched.h declares many user-space functions from POSIX.


Cheers,

Michael

Subject: Re: For review: pid_namespaces(7) man page

On Mon, Mar 4, 2013 at 4:50 AM, Rob Landley <[email protected]> wrote:
> On 03/01/2013 03:57:40 AM, Michael Kerrisk (man-pages) wrote:
>>
>> > And yet the glibc guys insist on #define GNU_GNU_GNU_ALL_HAIL_STALLMAN
>> > in
>> > order to access this Linux-specific feature which has nothing whatsoever
>> > to
>> > do with the FSF.
>>
>> This is a misunderstanding. _GNU_SOURCE is the standard way to expose
>> Linux-specific functionality from POSIX header files.
>
>
> What standard? The Linux kernel is not, and never was, part of the GNU
> project.

This is "the standard way that glibc isolates Linux-specific
functionality in POSIX header files".

Thanks,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

2013-03-04 17:52:30

by Eric W. Biederman

Subject: Re: For review: pid_namespaces(7) man page

"Michael Kerrisk (man-pages)" <[email protected]> writes:

> On Fri, Mar 1, 2013 at 4:35 PM, Eric W. Biederman <[email protected]> wrote:
>> "Michael Kerrisk (man-pages)" <[email protected]> writes:
>>
>>> Hi Rob,
>>>
>>> On Fri, Mar 1, 2013 at 5:01 AM, Rob Landley <[email protected]> wrote:
>>>> On 02/28/2013 05:24:07 AM, Michael Kerrisk (man-pages) wrote:
>>> [...]
>>>
>>>>> DESCRIPTION
>>>>> For an overview of namespaces, see namespaces(7).
>>>>>
>>>>> PID namespaces isolate the process ID number space, meaning
>>>>> that processes in different PID namespaces can have the same
>>>>> PID.
>>>>
>>>>
>>>> Um, perhaps "different processes"? Slightly repetitive, but trying to avoid
>>>> the potential misreading that "a process can have the same PID in
>>>> different namespaces". (A single process can't be a member of more than one
>>>> namespace. This is not about selective visibility.)
>>>
>>> I'm not sure this clarifies things...
>>>
>>>>> PID namespaces allow containers to migrate to a new host
>>>>> while the processes inside the container maintain the same
>>>>> PIDs.
>>>>
>>>>
>>>> I thought suspend/resume a container was the simple case. Migration to a new
>>>> host is built on top of that. (On resume in a new container on the same
>>>> system, if other stuff is going on in the system so the available PIDs have
>>>> shifted.)
>>>
>>> I'll add some words here on suspend/resume.
>>>
>>>>> Likewise, a process in an ancestor namespace can—subject to the
>>>>> usual permission checks described in kill(2)—send signals to
>>>>> the "init" process of a child PID namespace only if the "init"
>>>>> process has established a handler for that signal. (Within the
>>>>> handler, the siginfo_t si_pid field described in sigaction(2)
>>>>> will be zero.) SIGKILL or SIGSTOP are treated exceptionally:
>>>>> these signals are forcibly delivered when sent from an ancestor
>>>>> PID namespace. Neither of these signals can be caught by the
>>>>> "init" process, and so will result in the usual actions associ‐
>>>>> ated with those signals (respectively, terminating and stopping
>>>>> the process).
>>>>
>>>>
>>>> If SIGKILL to init is propagated to all the children of init, is SIGSTOP
>>>> also propagated to all the children? (I.E. will SIGSTOP to container's init
>>>> suspend the whole container, and will SIGCONT resume the whole container? If
>>>> the latter, will it only resume processes that weren't previously stopped?
>>>> :)
>>>
>>> Covered by Eric.
>>>
>>>>> To put things another way: a process's PID namespace membership
>>>>> is determined when the process is created and cannot be changed
>>>>> thereafter. Among other things, this means that the parental
>>>>> relationship between processes mirrors the parental between PID
>>>>
>>>>
>>>> mirrors the relationship
>>>
>>> Thanks.
>>>
>>>>> namespaces: the parent of a process is either in the same
>>>>> namespace or resides in the immediate parent PID namespace.
>>>>>
>>>>> Every thread in a process must be in the same PID namespace.
>>>>> For this reason, the two following call sequences will fail:
>>>>>
>>>>> unshare(CLONE_NEWPID);
>>>>> clone(..., CLONE_VM, ...); /* Fails */
>>>>>
>>>>> setns(fd, CLONE_NEWPID);
>>>>> clone(..., CLONE_VM, ...); /* Fails */
>>>>
>>>>
>>>> They fail with -EUNDOCUMENTED
>>>
>>> Added EINVAL, as per Eric's reply. (Eric does that error also apply
>>> for the two new cases you added?).
>>>
>>>>> Because the above unshare(2) and setns(2) calls only change the
>>>>> PID namespace for created children, the clone(2) calls neces‐
>>>>> sarily put the new thread in a different PID namespace from the
>>>>> calling thread.
>>>>
>>>>
>>>> Um, no they don't. They fail. That's the point.
>>>
>>> (Good catch.)
>>>
>>>> They _would_ put the new
>>>> thread in a different PID namespace, which breaks the definition of threads.
>>>>
>>>> How about:
>>>>
>>>> The above unshare(2) and setns(2) calls change the PID namespace of
>>>> children created by subsequent clone(2) calls, which is incompatible
>>>> with CLONE_VM.
>>>
>>> I decided on:
>>>
>>> The point here is that unshare(2) and setns(2) change the PID
>>> namespace for created children but not for the calling process,
>>> while clone(2) CLONE_VM specifies the creation of a new thread
>>> in the same process.
>>
>> Can we make that "for all new tasks created" instead of "created
>> children"?
>>
>> Otherwise someone might expect CLONE_THREAD would work, as
>> CLONE_THREAD creates a thread and not a child...
>
> The term "task" is kernel-space talk that rarely appears in man pages,
> so I am reluctant to use it.

With respect to clone, and in this case in particular, I am not certain
we can properly describe what happens without talking about tasks. But
it is worth a try.


> How about this:
>
> The point here is that unshare(2) and setns(2) change the PID
> namespace for processes subsequently created by the caller, but
> not for the calling process, while clone(2) CLONE_VM specifies
> the creation of a new thread in the same process.

Hmm. How about this:

The point here is that unshare(2) and setns(2) change the PID
namespace that will be used in all subsequent calls to clone(2)
and fork(2) by the caller, but not the PID namespace of the
calling process, and that all threads in a process must share
the same PID namespace. A subsequent clone(2) with CLONE_VM
would therefore specify the creation of a new thread in a
different PID namespace but in the same process, which is
impossible.
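
A minimal sketch of the first failing sequence, for anyone who wants to
check it on their own kernel. The helper name clone_vm_after_unshare()
and the use of a forked helper process are illustrative choices, not
anything from the man page. It assumes glibc's clone() wrapper and needs
CAP_SYS_ADMIN for unshare(CLONE_NEWPID). The result for CLONE_VM may
depend on the kernel version, so the sketch reports whatever errno it
sees rather than assuming EINVAL.

```c
#define _GNU_SOURCE             /* glibc: exposes unshare(), clone(), CLONE_* */
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)
#define SKIPPED    200          /* private exit code: sequence could not be set up */

static int noop_child(void *arg)
{
    (void)arg;
    return 0;
}

/*
 * Runs unshare(CLONE_NEWPID) followed by clone(CLONE_VM) in a forked
 * helper, so the caller's own namespace state is left alone.
 * Returns -1 if the sequence could not be set up (e.g. EPERM),
 * 0 if clone() succeeded, or the errno reported by clone().
 */
int clone_vm_after_unshare(void)
{
    pid_t helper = fork();
    if (helper == -1)
        return -1;

    if (helper == 0) {
        char *stack = malloc(STACK_SIZE);
        if (stack == NULL || unshare(CLONE_NEWPID) == -1)
            _exit(SKIPPED);

        /* The stack grows down on the usual Linux targets: pass its top. */
        pid_t pid = clone(noop_child, stack + STACK_SIZE,
                          CLONE_VM | SIGCHLD, NULL);
        if (pid == -1)
            _exit(errno);
        waitpid(pid, NULL, 0);
        _exit(0);
    }

    int status;
    if (waitpid(helper, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == SKIPPED ? -1 : WEXITSTATUS(status);
}
```

Calling clone_vm_after_unshare() as root and comparing the result with
EINVAL shows whether the running kernel rejects the sequence. A return
of -1 only means the test could not be run (typically missing
privileges).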

Eric

2013-03-04 19:27:55

by Rob Landley

Subject: Re: For review: pid_namespaces(7) man page

On 03/03/2013 10:03:55 PM, Eric W. Biederman wrote:
> Rob Landley <[email protected]> writes:
>
> > On 03/01/2013 03:57:40 AM, Michael Kerrisk (man-pages) wrote:
> >> > And yet the glibc guys insist on #define GNU_GNU_GNU_ALL_HAIL_STALLMAN in
> >> > order to access this Linux-specific feature which has nothing whatsoever to
> >> > do with the FSF.
> >>
> >> This is a misunderstanding. _GNU_SOURCE is the standard way to expose
> >> Linux-specific functionality from POSIX header files.
> >
> > What standard? The Linux kernel is not, and never was, part of the GNU
> > project.
>
> Is the argument that there should be a _LINUX_SOURCE directive in glibc
> for this?

If you don't #define any feature test macros at all, you get a bunch of
macros (_BSD_SOURCE, _SVID_SOURCE, _POSIX_SOURCE,
_POSIX_C_SOURCE=200809L, and so on) defined by default in features.h.
If you start defining macros, several of the default ones _go_away_,
and you start missing things that are defined by posix-2008. Yes,
defining feature test macros makes definitions _vanish_ out of the
headers, which means feature test macros can actually reduce code
portability.

The _GNU_SOURCE macro is glibc's way of saying "switch on everything
glibc offers". (Except it isn't _quite_, but that seems to be what they
intended.) So there are various things that test _specifically_ for
that macro, but the macro also switches on (from features.h):

/* If _GNU_SOURCE was defined by the user, turn on all the other
features. */
#ifdef _GNU_SOURCE
# undef _ISOC95_SOURCE
# define _ISOC95_SOURCE 1
# undef _ISOC99_SOURCE
# define _ISOC99_SOURCE 1
# undef _POSIX_SOURCE
# define _POSIX_SOURCE 1
# undef _POSIX_C_SOURCE
# define _POSIX_C_SOURCE 200809L
# undef _XOPEN_SOURCE
# define _XOPEN_SOURCE 700
# undef _XOPEN_SOURCE_EXTENDED
# define _XOPEN_SOURCE_EXTENDED 1
# undef _LARGEFILE64_SOURCE
# define _LARGEFILE64_SOURCE 1
# undef _BSD_SOURCE
# define _BSD_SOURCE 1
# undef _SVID_SOURCE
# define _SVID_SOURCE 1
# undef _ATFILE_SOURCE
# define _ATFILE_SOURCE 1
#endif

This is not fine-grained control of what libc exports. This is "if you
want to use unshare() then everything we ever implemented gets
simultaneously exported into your namespace". (Which it _mostly_ is if
you never use any feature test macros, but not the Linux-specific
system calls.)
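
As a rough sketch of how to see what a given choice of feature test
macros actually leaves defined, something like the following can be
compiled with and without its first line and the results compared. The
function names are made up for illustration; only the macro names come
from features.h, and the values reported depend on the libc version and
compiler flags.

```c
#define _GNU_SOURCE     /* remove or swap for another macro to compare */
#include <stdio.h>      /* any libc header pulls in the feature-macro setup */

/* Value of _POSIX_C_SOURCE the headers ended up with, or 0 if undefined. */
long posix_c_source_level(void)
{
#ifdef _POSIX_C_SOURCE
    return _POSIX_C_SOURCE + 0L;    /* "+ 0L" tolerates an empty definition */
#else
    return 0;
#endif
}

/* Value of _XOPEN_SOURCE the headers ended up with, or 0 if undefined. */
long xopen_source_level(void)
{
#ifdef _XOPEN_SOURCE
    return _XOPEN_SOURCE + 0L;
#else
    return 0;
#endif
}

/* Format both levels into buf so builds with different macros can be diffed. */
int format_feature_levels(char *buf, size_t len)
{
    return snprintf(buf, len, "_POSIX_C_SOURCE=%ld _XOPEN_SOURCE=%ld",
                    posix_c_source_level(), xopen_source_level());
}
```

Printing the formatted string from a build with no macros, one with only
_POSIX_C_SOURCE, and one with _GNU_SOURCE makes the "defaults go away"
effect visible without reading features.h.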

The new musl-libc.org did an _ALL_SOURCE macro that just enables every
feature test macro they implemented. (That's its definition: it's the
feature test macro that says feature test macros are a bad idea.)

> Although come to think of it I can't imagine how <sched.h> is a POSIX
> header. Last I looked it only had linux specific bits in it. Which
> makes needing any kind of #define strange.

My objection is that Linux system calls are not part of the GNU
project. Requiring that macro to get Linux system calls out of bionic,
uClibc, klibc, musl, olibc, dietlibc, or newlib is _silly_. It's the
"GNU/Linux" prefix imposed on the source level, and it's a fairly
recent development (I've only noticed it since 2008 or so).

Rob-

Subject: Re: For review: pid_namespaces(7) man page

On Mon, Mar 4, 2013 at 8:27 PM, Rob Landley <[email protected]> wrote:
> On 03/03/2013 10:03:55 PM, Eric W. Biederman wrote:
>>
>> Rob Landley <[email protected]> writes:
>>
>> > On 03/01/2013 03:57:40 AM, Michael Kerrisk (man-pages) wrote:
>> >> > And yet the glibc guys insist on #define GNU_GNU_GNU_ALL_HAIL_STALLMAN in
>> >> > order to access this Linux-specific feature which has nothing whatsoever to
>> >> > do with the FSF.
>> >>
>> >> This is a misunderstanding. _GNU_SOURCE is the standard way to expose
>> >> Linux-specific functionality from POSIX header files.
>> >
>> > What standard? The Linux kernel is not, and never was, part of the GNU
>> > project.
>>
>> Is the argument that there should be a _LINUX_SOURCE directive in glibc
>> for this?
>
>
> If you don't #define any feature test macros at all, you get a bunch of
> macros (_BSD_SOURCE, _SVID_SOURCE, _POSIX_SOURCE, _POSIX_C_SOURCE=200809L,
> and so on) defined by default in features.h. If you start defining macros,
> several of the default ones _go_away_, and you start missing things that are
> defined by posix-2008. Yes, defining feature test macros makes definitions
> _vanish_ out of the headers, which means feature test macros can actually
> reduce code portability.

This has nothing to do with reducing portability; have a (careful)
read of feature_test_macros(7).

[...]

> The new musl-libc.org did an _ALL_SOURCE macro that just enables every
> feature test macro they implemented. (That's its definition, it's the
> feature test macro that says feature test macros are a bad idea.)
>
>
>> Although come to think of it I can't imagine how <sched.h> is a POSIX
>> header. Last I looked it only had linux specific bits in it. Which
>> makes needing any kind of #define strange.
>
>
> My objection is that Linux system calls are not part of the GNU project.
> Requiring that macro to get Linux system calls out of bionic, uClibc, klibc,
> musl, olibc, dietlibc, or newlib is _silly_. It's the "GNU/Linux" prefix
> imposed on the source level, and it's a fairly recent development (I've only
> noticed it since 2008 or so).

The macro has been present since at least glibc 2.0 (1997).

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

2013-03-06 01:58:44

by Rob Landley

Subject: Re: For review: pid_namespaces(7) man page

On 03/04/2013 11:52:19 AM, Eric W. Biederman wrote:
> > How about this:
> >
> > The point here is that unshare(2) and setns(2) change the PID
> > namespace for processes subsequently created by the caller, but
> > not for the calling process, while clone(2) CLONE_VM specifies
> > the creation of a new thread in the same process.
>
> Hmm. How about this.
>
> The point here is that unshare(2) and setns(2) change the PID
> namespace that will be used in all subsequent calls to clone
> and fork by the caller, but not for the calling process, and
> that all threads in a process must share the same PID
> namespace. Which makes a subsequent clone(2) CLONE_VM
> specify the creation of a new thread in a different PID
> namespace but in the same process which is impossible.

CLONE_VM and CLONE_NEWPID are incompatible because all threads of the
same process must be in the same PID namespace. Since unshare(2) and
setns(2) change the PID namespace for subsequent calls to clone(2),
those subsequent calls cannot create new threads (unless you setns(2)
back to the original namespace first).

That last bit's a guess. :)

Rob-

2013-03-06 02:24:09

by Eric W. Biederman

Subject: Re: For review: pid_namespaces(7) man page

Rob Landley <[email protected]> writes:

> On 03/04/2013 11:52:19 AM, Eric W. Biederman wrote:
>> > How about this:
>> >
>> > The point here is that unshare(2) and setns(2) change the PID
>> > namespace for processes subsequently created by the caller, but
>> > not for the calling process, while clone(2) CLONE_VM specifies
>> > the creation of a new thread in the same process.
>>
>> Hmm. How about this.
>>
>> The point here is that unshare(2) and setns(2) change the PID
>> namespace that will be used in all subsequent calls to clone
>> and fork by the caller, but not for the calling process, and
>> that all threads in a process must share the same PID
>> namespace. Which makes a subsequent clone(2) CLONE_VM
>> specify the creation of a new thread in a different PID
>> namespace but in the same process which is impossible.
>
> CLONE_VM and CLONE_NEWPID are incompatible because all threads of the
> same process must be in the same PID namespace. Since unshare(2) and
> setns(2) change the PID namespace for subsequent calls to clone(2),
> those subsequent calls cannot create new threads (unless you setns(2)
> back to the original namespace first).
>
> That last bit's a guess. :)

Good wording, thank you, and the last bit is right. You can restore
the PID namespace with setns(2), and that will allow thread and process
creation again.
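
A small sketch of that restore step, under the assumption that
/proc/self/ns/pid names the caller's own PID namespace and that the
caller has CAP_SYS_ADMIN. The function name and the forked helper are
illustrative only. It checks process creation with fork(2), not thread
creation.

```c
#define _GNU_SOURCE             /* glibc: exposes unshare(), setns(), CLONE_NEWPID */
#include <fcntl.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SKIPPED 200             /* private exit code: sequence could not be set up */

/*
 * In a forked helper: remember the current PID namespace, unshare a new
 * one, switch back with setns(2), then fork once more.
 * Returns -1 if the sequence could not be set up (e.g. EPERM),
 * 1 if the last child was created outside the discarded namespace
 * (its PID is not 1), or 0 otherwise.
 */
int fork_after_restoring_pidns(void)
{
    pid_t helper = fork();
    if (helper == -1)
        return -1;

    if (helper == 0) {
        int fd = open("/proc/self/ns/pid", O_RDONLY);
        if (fd == -1 || unshare(CLONE_NEWPID) == -1
                     || setns(fd, CLONE_NEWPID) == -1)
            _exit(SKIPPED);

        pid_t child = fork();
        if (child == -1)
            _exit(SKIPPED);
        if (child == 0)
            _exit(getpid() == 1);   /* 1 would mean "init" of a new namespace */

        int st;
        if (waitpid(child, &st, 0) == -1 || !WIFEXITED(st))
            _exit(SKIPPED);
        _exit(WEXITSTATUS(st) == 0 ? 1 : 0);
    }

    int status;
    if (waitpid(helper, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == SKIPPED ? -1 : WEXITSTATUS(status);
}
```

Without the setns(2) call, the child of the second fork() would be PID 1
in the new namespace, so a return value of 1 indicates the original
namespace was put back for new children.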

Eric