Hi Eric,
On Thu, Feb 28, 2013 at 4:24 PM, Eric W. Biederman
<[email protected]> wrote:
> "Michael Kerrisk (man-pages)" <[email protected]> writes:
[...]
>> ==========
>> PID_NAMESPACES(7) Linux Programmer's Manual PID_NAMESPACES(7)
>>
>> NAME
>> pid_namespaces - overview of Linux PID namespaces
>>
>> DESCRIPTION
[...]
>> The namespace init process
>> The first process created in a new namespace (i.e., the process
>> created using clone(2) with the CLONE_NEWPID flag, or the first
>> child created by a process after a call to unshare(2) using the
>> CLONE_NEWPID flag) has the PID 1, and is the "init" process for
>> the namespace (see init(1)). Children that are orphaned within
>> the namespace will be reparented to this process rather than
>> init(1).
>>
>> If the "init" process of a PID namespace terminates, the kernel
>> terminates all of the processes in the namespace via a SIGKILL
>> signal. This behavior reflects the fact that the "init"
>> process is essential for the correct operation of a PID names‐
>> pace. In this case, a subsequent fork(2) into this PID names‐
>> pace (e.g., from a process that has done a setns(2) into the
>> namespace using an open file descriptor for a
>> /proc/[pid]/ns/pid file corresponding to a process that was in
>> the namespace) will fail with the error ENOMEM; it is not possible
>> to create new processes in a PID namespace whose "init" process
>> has terminated.
>
> It may be useful to mention unshare in the case of fork(2) failing just
> because that is such an easy mistake to make.
>
> unshare(CLONE_NEWPID);
> pid = fork();
> waitpid(pid,...);
> fork() -> ENOMEM

I'm lost. Why does that sequence fail? The child of fork() becomes PID
1 in the new PID namespace.

>> Only signals for which the "init" process has established a
>> signal handler can be sent to the "init" process by other mem‐
>> bers of the PID namespace. This restriction applies even to
>> privileged processes, and prevents other members of the PID
>> namespace from accidentally killing the "init" process.
>>
>> Likewise, a process in an ancestor namespace can—subject to the
>> usual permission checks described in kill(2)—send signals to
>> the "init" process of a child PID namespace only if the "init"
>> process has established a handler for that signal. (Within the
>> handler, the siginfo_t si_pid field described in sigaction(2)
>> will be zero.) SIGKILL and SIGSTOP are treated exceptionally:
>> these signals are forcibly delivered when sent from an ancestor
>> PID namespace. Neither of these signals can be caught by the
>> "init" process, and so will result in the usual actions associ‐
>> ated with those signals (respectively, terminating and stopping
>> the process).
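
A minimal sketch of the first rule above (a signal sent by another member of the namespace is dropped when "init" has no handler for it). The function names init_main() and init_survives_unhandled_sigterm() are made up for illustration, and the sketch assumes the caller has CAP_SYS_ADMIN so that clone(2) with CLONE_NEWPID is permitted; without it the function just reports -1.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

/* Runs as PID 1 ("init") of the new PID namespace (hypothetical helper). */
static int init_main(void *arg)
{
    (void)arg;
    signal(SIGTERM, SIG_DFL);       /* make sure no handler is installed */

    pid_t member = fork();          /* a second member of the namespace */
    if (member == -1)
        return 2;
    if (member == 0) {
        kill(1, SIGTERM);           /* expected to be dropped: no handler */
        _exit(0);
    }
    if (waitpid(member, NULL, 0) == -1)
        return 2;
    return 0;                       /* reached only if init was not killed */
}

/*
 * Returns 1 if "init" survived the SIGTERM, 0 if it was killed by a
 * signal, and -1 if the experiment could not be run (for example,
 * clone() failed because the caller lacks CAP_SYS_ADMIN).
 */
int init_survives_unhandled_sigterm(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward on common architectures: pass its top. */
    pid_t init = clone(init_main, stack + STACK_SIZE,
                       CLONE_NEWPID | SIGCHLD, NULL);
    int result = -1;
    int status;
    if (init != -1 && waitpid(init, &status, 0) == init) {
        if (WIFSIGNALED(status))
            result = 0;
        else if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            result = 1;
    }
    free(stack);
    return result;
}
```

Calling init_survives_unhandled_sigterm() from a privileged process should return 1. Note that the kill(2) call in the member process still reports success even though the signal is discarded.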
>>
>> Nesting PID namespaces
>> PID namespaces can be nested: each PID namespace has a parent,
>> except for the initial ("root") PID namespace. The parent of a
>> PID namespace is the PID namespace of the process that created
>> the namespace using clone(2) or unshare(2). PID namespaces
>> thus form a tree, with all namespaces ultimately tracing their
>> ancestry to the root namespace.
>>
>> A process is visible to other processes in its PID namespace,
>> and to the processes in each direct ancestor PID namespace
>> going back to the root PID namespace. In this context, "visi‐
>> ble" means that one process can be the target of operations by
>> another process using system calls that specify a process ID.
>> Conversely, the processes in a child PID namespace can't see
>> processes in the parent and further removed ancestor namespaces.
>> More succinctly: a process can see (e.g., send signals with
>> kill(2), set nice values with setpriority(2), etc.) only pro‐
>> cesses contained in its own PID namespace and in descendants of
>> that namespace.
>>
>> A process has one process ID in each of the layers of the PID
>> namespace hierarchy in which it is visible, walking back
>> through each direct ancestor namespace to the root PID
>> namespace. System calls that operate on process IDs always
>> operate using the process ID that is visible in the PID names‐
>> pace of the caller. A call to getpid(2) always returns the PID
>> associated with the namespace in which the process was created.
>>
>> Some processes in a PID namespace may have parents that are
>> outside of the namespace. For example, the parent of the ini‐
>> tial process in the namespace (i.e., the init(1) process with
>> PID 1) is necessarily in another namespace. Likewise, the
>> direct children of a process that uses setns(2) to cause its
>> children to join a PID namespace are in a different PID names‐
>> pace from the caller of setns(2). Calls to getppid(2) for such
>> processes return 0.
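
To make the getpid(2)/getppid(2) behavior above concrete, here is a small sketch in which the "init" process of a new PID namespace reports both values back over a pipe. The names struct ids, report_ids() and ids_seen_by_namespace_init() are hypothetical, and the sketch assumes CAP_SYS_ADMIN (otherwise clone(2) fails and the function returns -1).

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

struct ids {
    pid_t pid;      /* getpid() as seen inside the new namespace */
    pid_t ppid;     /* getppid() as seen inside the new namespace */
};

/* Runs as the first process of the new PID namespace. */
static int report_ids(void *arg)
{
    int wfd = *(int *)arg;
    struct ids ids = { getpid(), getppid() };
    ssize_t n = write(wfd, &ids, sizeof(ids));
    return n == (ssize_t)sizeof(ids) ? 0 : 1;
}

/*
 * Fills *out with the IDs reported by the namespace "init".
 * Returns 0 on success, -1 if the experiment could not be run.
 */
int ids_seen_by_namespace_init(struct ids *out)
{
    int pfd[2];
    if (pipe(pfd) == -1)
        return -1;

    char *stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    /* The stack grows downward on common architectures: pass its top. */
    pid_t child = clone(report_ids, stack + STACK_SIZE,
                        CLONE_NEWPID | SIGCHLD, &pfd[1]);
    close(pfd[1]);                  /* so read() sees EOF if the child dies */

    int result = -1;
    if (child != -1) {
        if (read(pfd[0], out, sizeof(*out)) == (ssize_t)sizeof(*out))
            result = 0;
        waitpid(child, NULL, 0);
    }
    close(pfd[0]);
    free(stack);
    return result;
}
```

On success, out->pid should be 1 and out->ppid should be 0, because the parent lives in a different PID namespace. The value returned by clone(2) in the caller is the same child's PID as seen from the caller's namespace.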
>>
>> setns(2) and unshare(2) semantics
>> Calls to setns(2) that specify a PID namespace file descriptor
>> and calls to unshare(2) with the CLONE_NEWPID flag cause chil‐
>> dren subsequently created by the caller to be placed in a dif‐
>> ferent PID namespace from the caller. These calls do not, how‐
>> ever, change the PID namespace of the calling process, because
>> doing so would change the caller's idea of its own PID (as
>> reported by getpid()), which would break many applications and
>> libraries.
>>
>> To put things another way: a process's PID namespace membership
>> is determined when the process is created and cannot be changed
>> thereafter. Among other things, this means that the parental
>> relationship between processes mirrors the parental relationship
>> between PID namespaces: the parent of a process is either in the
>> same namespace or resides in the immediate parent PID namespace.
>
> This is mostly true. With setns it is possible to have a parent
> in a pid namespace several steps up the pid namespace hierarchy.
>
>> Every thread in a process must be in the same PID namespace.
>> For this reason, the two following call sequences will fail:
>>
>> unshare(CLONE_NEWPID);
>> clone(..., CLONE_VM, ...); /* Fails */
>>
>> setns(fd, CLONE_NEWPID);
>> clone(..., CLONE_VM, ...); /* Fails */
>>
>> Because the above unshare(2) and setns(2) calls only change the
>> PID namespace for created children, the clone(2) calls neces‐
>> sarily put the new thread in a different PID namespace from the
>> calling thread.
>
> I don't know if it is interesting, but these sequences also fail. I
> suppose that is obvious? Or at least documented in the clone and
> unshare manpages.
>
> clone(..., CLONE_VM, ...);
> unshare(CLONE_NEWPID); /* Fails */
>
> clone(..., CLONE_VM, ...);
> setns(fd, CLONE_NEWPID); /* Fails */

I added these to the page.

>> Miscellaneous
>> After creating a new PID namespace, it is useful for the child
>> to change its root directory and mount a new procfs instance at
>> /proc so that tools such as ps(1) work correctly. (If a new
>> mount namespace is simultaneously created by including
>> CLONE_NEWNS in the flags argument of clone(2) or unshare(2),
>> then it isn't necessary to change the root directory: a new
>> procfs instance can be mounted directly over /proc.)
>
> Should it be documented somewhere that /proc, when mounted from a pid
> namespace, will use the pids of that pid namespace, and that /proc will
> only show processes visible in the mounting pid namespace, even if that
> mount of proc is accessed by processes in other pid namespaces?
>
> You sort of say it here by saying it is useful to mount a new copy of
> /proc, which it is. I just don't see you coming out straight and saying
> why it is. It just seems to be implied.
You're right. I should be more explicit. I will add some text detailing this.
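
As a rough illustration of the point about /proc, the sketch below creates new PID and mount namespaces together, mounts a fresh procfs over /proc in the child, and checks that /proc/self resolves to "1" there. The function names are hypothetical; the sketch assumes CAP_SYS_ADMIN and that mounting procfs is allowed in the environment, and it marks "/" private first so the new mount is not propagated back to the parent mount namespace.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

/* Runs as PID 1 in new PID and mount namespaces (hypothetical helper). */
static int mount_proc_and_check(void *arg)
{
    (void)arg;
    char self[32];

    /* Keep our mounts from propagating to the parent mount namespace. */
    if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        return 2;
    /* New procfs instance, tied to the PID namespace of this process. */
    if (mount("proc", "/proc", "proc", 0, NULL) == -1)
        return 2;

    ssize_t n = readlink("/proc/self", self, sizeof(self) - 1);
    if (n == -1)
        return 2;
    self[n] = '\0';
    return strcmp(self, "1") == 0 ? 0 : 1;
}

/*
 * Returns 1 if /proc/self pointed at "1" in the new procfs instance,
 * 0 if it pointed elsewhere, and -1 if the experiment could not be run.
 */
int proc_shows_namespace_pids(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward on common architectures: pass its top. */
    pid_t child = clone(mount_proc_and_check, stack + STACK_SIZE,
                        CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
    int result = -1;
    int status;
    if (child != -1 && waitpid(child, &status, 0) == child &&
        WIFEXITED(status)) {
        if (WEXITSTATUS(status) == 0)
            result = 1;
        else if (WEXITSTATUS(status) == 1)
            result = 0;
    }
    free(stack);
    return result;
}
```

Because the mount lives in the child's private mount namespace, it disappears when the child exits and the caller's /proc is left untouched.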
[...]
Thanks for the comments, Eric!
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/
"Michael Kerrisk (man-pages)" <[email protected]> writes:
> Hi Eric,
>
> On Thu, Feb 28, 2013 at 4:24 PM, Eric W. Biederman
> <[email protected]> wrote:
>> "Michael Kerrisk (man-pages)" <[email protected]> writes:
>
> [...]
>
>>> ==========
>>> PID_NAMESPACES(7) Linux Programmer's Manual PID_NAMESPACES(7)
>>>
>>> NAME
>>> pid_namespaces - overview of Linux PID namespaces
>>>
>>> DESCRIPTION
> [...]
>
>>> The namespace init process
>>> The first process created in a new namespace (i.e., the process
>>> created using clone(2) with the CLONE_NEWPID flag, or the first
>>> child created by a process after a call to unshare(2) using the
>>> CLONE_NEWPID flag) has the PID 1, and is the "init" process for
>>> the namespace (see init(1)). Children that are orphaned within
>>> the namespace will be reparented to this process rather than
>>> init(1).
>>>
>>> If the "init" process of a PID namespace terminates, the kernel
>>> terminates all of the processes in the namespace via a SIGKILL
>>> signal. This behavior reflects the fact that the "init"
>>> process is essential for the correct operation of a PID names‐
>>> pace. In this case, a subsequent fork(2) into this PID names‐
>>> pace (e.g., from a process that has done a setns(2) into the
>>> namespace using an open file descriptor for a
>>> /proc/[pid]/ns/pid file corresponding to a process that was in
>>> the namespace) will fail with the error ENOMEM; it is not possible
>>> to create new processes in a PID namespace whose "init" process
>>> has terminated.
>>
>> It may be useful to mention unshare in the case of fork(2) failing just
>> because that is such an easy mistake to make.
>>
>> unshare(CLONE_NEWPID);
>> pid = fork();
>> waitpid(pid,...);
>> fork() -> ENOMEM
>
> I'm lost. Why does that sequence fail? The child of fork() becomes PID
> 1 in the new PID namespace.
Correct.
Then we wait for the child of the fork to exit();
Then we fork again into the new pid namespace.
The second fork fails because init has exited.
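
Spelled out as a function, the sequence looks roughly like the sketch below. The name second_fork_fails_with_enomem() is made up for illustration. The experiment runs in a helper process so that the caller's own ability to fork() is not affected, and it assumes CAP_SYS_ADMIN (without it, unshare() fails and the function returns -1).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the second fork() failed with ENOMEM, 0 if it succeeded,
 * and -1 if the experiment could not be run or failed in another way.
 */
int second_fork_fails_with_enomem(void)
{
    pid_t helper = fork();
    if (helper == -1)
        return -1;

    if (helper == 0) {
        /* Helper exit codes: 0 = second fork worked, 1 = ENOMEM, 2 = other. */
        if (unshare(CLONE_NEWPID) == -1)
            _exit(2);               /* typically EPERM without privilege */

        pid_t pid = fork();         /* child becomes PID 1 in the new namespace */
        if (pid == -1)
            _exit(2);
        if (pid == 0)
            _exit(0);               /* "init" of the new namespace exits at once */
        if (waitpid(pid, NULL, 0) == -1)
            _exit(2);

        pid = fork();               /* the namespace no longer has an init */
        if (pid == -1)
            _exit(errno == ENOMEM ? 1 : 2);
        if (pid == 0)
            _exit(0);
        waitpid(pid, NULL, 0);
        _exit(0);
    }

    int status;
    if (waitpid(helper, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    int code = WEXITSTATUS(status);
    return code == 2 ? -1 : code;
}
```

With sufficient privilege this should return 1. Keeping the first child alive (for example, by having it block in pause()) until no more forks are needed avoids the error.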
Eric
On Fri, Mar 1, 2013 at 10:10 AM, Eric W. Biederman
<[email protected]> wrote:
> "Michael Kerrisk (man-pages)" <[email protected]> writes:
>
>> Hi Eric,
>>
>> On Thu, Feb 28, 2013 at 4:24 PM, Eric W. Biederman
>> <[email protected]> wrote:
>>> "Michael Kerrisk (man-pages)" <[email protected]> writes:
>>
>> [...]
>>
>>>> ==========
>>>> PID_NAMESPACES(7) Linux Programmer's Manual PID_NAMESPACES(7)
>>>>
>>>> NAME
>>>> pid_namespaces - overview of Linux PID namespaces
>>>>
>>>> DESCRIPTION
>> [...]
>>
>>>> The namespace init process
>>>> The first process created in a new namespace (i.e., the process
>>>> created using clone(2) with the CLONE_NEWPID flag, or the first
>>>> child created by a process after a call to unshare(2) using the
>>>> CLONE_NEWPID flag) has the PID 1, and is the "init" process for
>>>> the namespace (see init(1)). Children that are orphaned within
>>>> the namespace will be reparented to this process rather than
>>>> init(1).
>>>>
>>>> If the "init" process of a PID namespace terminates, the kernel
>>>> terminates all of the processes in the namespace via a SIGKILL
>>>> signal. This behavior reflects the fact that the "init"
>>>> process is essential for the correct operation of a PID names‐
>>>> pace. In this case, a subsequent fork(2) into this PID names‐
>>>> pace (e.g., from a process that has done a setns(2) into the
>>>> namespace using an open file descriptor for a
>>>> /proc/[pid]/ns/pid file corresponding to a process that was in
>>>> the namespace) will fail with the error ENOMEM; it is not possible
>>>> to create new processes in a PID namespace whose "init" process
>>>> has terminated.
>>>
>>> It may be useful to mention unshare in the case of fork(2) failing just
>>> because that is such an easy mistake to make.
>>>
>>> unshare(CLONE_NEWPID);
>>> pid = fork();
>>> waitpid(pid,...);
>>> fork() -> ENOMEM
>>
>> I'm lost. Why does that sequence fail? The child of fork() becomes PID
>> 1 in the new PID namespace.
>
> Correct.
> Then we wait for the child of the fork to exit();
> Then we fork again into the new pid namespace.
> The second fork fails because init has exited.
Ahhh -- I misapprehended the scenario you were describing. Got it now.
I'll add that case.
Thanks,
Michael