Hello all,
I'm looking for review input for the pivot_root(2) manual
page, which I have substantially rewritten.
The original page was written 19 years ago, and has seen
little revision since that time. It contains a number of
errors. Even at the time it was first released, the
manual page already had some inaccuracies, since it was
written before the final release of the system call, whose
implementation was subsequently changed, but the manual
page was not updated to reflect those changes.
The revised page is more than 2.5 times the size of the
previous page, and now includes an example program.
As well as fixing a number of errors and adding many
missing details, the page also adds a description of the
pivot_root(".", ".") technique.
I would be happy to receive error corrections and notes
on missing details that should be added to the page.
One area of the page that I'm still not really happy with
is the "vague" wording in the second paragraph and the note
in the third paragraph about the system call possibly
changing. These pieces survive (in somewhat modified form)
from the original page, which was written before the
system call was released, and it seems there was some
question about whether the system call might still change
its behavior with respect to the root directory and current
working directory of other processes. However, after 19
years, nothing has changed, and surely it will not in the
future, since that would constitute an ABI breakage.
I'm considering rewriting these pieces to exactly
describe what the system call does (which I already
do in the third paragraph) and remove the "may or may not"
pieces in the second paragraph. I'd welcome comments
on making that change.
The rendered page is shown below. The page source is at
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/pivot_root.2
in the Git repo at
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
Thanks,
Michael
NAME
pivot_root - change the root filesystem
SYNOPSIS
int pivot_root(const char *new_root, const char *put_old);
Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION
pivot_root() changes the root filesystem in the mount namespace of
the calling process. More precisely, it moves the root filesystem
to the directory put_old and makes new_root the new root filesys‐
tem. The calling process must have the CAP_SYS_ADMIN capability
in the user namespace that owns the caller's mount namespace.
pivot_root() may or may not change the current root and the cur‐
rent working directory of any processes or threads that use the
old root directory and which are in the same mount namespace as
the caller of pivot_root(). The caller of pivot_root() should
ensure that processes with root or current working directory at
the old root operate correctly in either case. An easy way to
ensure this is to change their root and current working directory
to new_root before invoking pivot_root(). Note also that
pivot_root() may or may not affect the calling process's current
working directory. It is therefore recommended to call chdir("/")
immediately after pivot_root().
The paragraph above is intentionally vague because at the time
when pivot_root() was first implemented, it was unclear whether
its effect on other processes' root and current working
directories (and on the caller's current working directory) might
change in the future. However, the behavior has remained
consistent since
this system call was first implemented: pivot_root() changes the
root directory and the current working directory of each process
or thread in the same mount namespace to new_root if they point to
the old root directory. (See also NOTES.) On the other hand,
pivot_root() does not change the caller's current working direc‐
tory (unless it is on the old root directory), and thus it should
be followed by a chdir("/") call.
The following restrictions apply:
- new_root and put_old must be directories.
- new_root and put_old must not be on the same filesystem as the
current root. In particular, new_root can't be "/" (but can be
a bind mounted directory on the current root filesystem).
- put_old must be at or underneath new_root; that is, adding a
nonnegative number of /.. to the string pointed to by put_old
must yield the same directory as new_root.
- new_root must be a mount point. (If it is not otherwise a
mount point, it suffices to bind mount new_root on top of
itself.)
- The propagation type of the parent mount of new_root and the
parent mount of the current root directory must not be
MS_SHARED; similarly, if put_old is an existing mount point,
its propagation type must not be MS_SHARED. These restrictions
ensure that pivot_root() never propagates any changes to
another mount namespace.
- The current root directory must be a mount point.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno
is set appropriately.
ERRORS
pivot_root() may fail with any of the same errors as stat(2).
Additionally, it may fail with the following errors:
EBUSY new_root or put_old is on the current root filesystem.
(This error covers the pathological case where new_root is
"/".)
EINVAL new_root is not a mount point.
EINVAL put_old is not underneath new_root.
EINVAL The current root directory is not a mount point (because of
an earlier chroot(2)).
EINVAL The current root is on the rootfs (initial ramfs) filesys‐
tem; see NOTES.
EINVAL Either the mount point at new_root, or the parent mount of
that mount point, has propagation type MS_SHARED.
EINVAL put_old is a mount point and has the propagation type
MS_SHARED.
ENOTDIR
new_root or put_old is not a directory.
EPERM The calling process does not have the CAP_SYS_ADMIN capa‐
bility.
VERSIONS
pivot_root() was introduced in Linux 2.3.41.
CONFORMING TO
pivot_root() is Linux-specific and hence is not portable.
NOTES
Glibc does not provide a wrapper for this system call; call it
using syscall(2).
A command-line interface for this system call is provided by
pivot_root(8).
pivot_root() allows the caller to switch to a new root filesystem
while at the same time placing the old root mount at a location
under new_root from where it can subsequently be unmounted. (The
fact that it moves all processes that have a root directory or
current working directory on the old root filesystem to the new
root filesystem frees the old root filesystem of users, allowing
it to be unmounted more easily.)
A typical use of pivot_root() is during system startup, when the
system mounts a temporary root filesystem (e.g., an initrd), then
mounts the real root filesystem, and eventually turns the latter
into the current root of all relevant processes or threads. A
modern use is to set up a root filesystem during the creation of a
container.
The fact that pivot_root() modifies process root and current work‐
ing directories in the manner noted in DESCRIPTION is necessary in
order to prevent kernel threads from keeping the old root direc‐
tory busy with their root and current working directory, even if
they never access the filesystem in any way.
new_root and put_old may be the same directory. In particular,
the following sequence allows a pivot-root operation without need‐
ing to create and remove a temporary directory:
chdir(new_root);
mount(NULL, ".", NULL, MS_SLAVE | MS_REC, NULL);
/* Or: MS_PRIVATE | MS_REC */
pivot_root(".", ".");
umount2(".", MNT_DETACH);
This sequence succeeds because the pivot_root() call stacks the
old root mount point (old_root) on top of the new root mount point
at /. At that point, the calling process's root directory and
current working directory refer to the new root mount point
(new_root). During the subsequent umount2() call, resolution of
"." starts with new_root and then moves up the list of mounts
stacked at /, with the result that old_root is unmounted.
The rootfs (initial ramfs) cannot be pivot_root()ed. The recom‐
mended method of changing the root filesystem in this case is to
delete everything in rootfs, overmount rootfs with the new root,
attach stdin/stdout/stderr to the new /dev/console, and exec the
new init(1). Helper programs for this process exist; see
switch_root(8).
EXAMPLE
The program below demonstrates the use of pivot_root() inside a
mount namespace that is created using clone(2). After pivoting to
the root directory named in the program's first command-line argu‐
ment, the child created by clone(2) then executes the program
named in the remaining command-line arguments.
We demonstrate the program by creating a directory that will serve
as the new root filesystem and placing a copy of the (statically
linked) busybox(1) executable in that directory.
$ mkdir /tmp/rootfs
$ ls -id /tmp/rootfs # Show inode number of new root directory
319459 /tmp/rootfs
$ cp $(which busybox) /tmp/rootfs
$ PS1='bbsh$ ' sudo ./pivot_root_demo /tmp/rootfs /busybox sh
bbsh$ PATH=/
bbsh$ busybox ln busybox ln
bbsh$ ln busybox echo
bbsh$ ln busybox ls
bbsh$ ls
busybox echo ln ls
bbsh$ ls -id / # Compare with inode number above
319459 /
bbsh$ echo 'hello world'
hello world
Program source
/* pivot_root_demo.c */

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/syscall.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <limits.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

static int
pivot_root(const char *new_root, const char *put_old)
{
    return syscall(SYS_pivot_root, new_root, put_old);
}

#define STACK_SIZE (1024 * 1024)

static int              /* Startup function for cloned child */
child(void *arg)
{
    char **args = arg;
    char *new_root = args[0];
    const char *put_old = "/oldrootfs";
    char path[PATH_MAX];

    /* Ensure that 'new_root' and its parent mount don't have
       shared propagation (which would cause pivot_root() to
       return an error), and prevent propagation of mount
       events to the initial mount namespace */

    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        errExit("mount-MS_PRIVATE");

    /* Ensure that 'new_root' is a mount point */

    if (mount(new_root, new_root, NULL, MS_BIND, NULL) == -1)
        errExit("mount-MS_BIND");

    /* Create directory to which old root will be pivoted */

    snprintf(path, sizeof(path), "%s/%s", new_root, put_old);
    if (mkdir(path, 0777) == -1)
        errExit("mkdir");

    /* And pivot the root filesystem */

    if (pivot_root(new_root, path) == -1)
        errExit("pivot_root");

    /* Switch the current working directory to "/" */

    if (chdir("/") == -1)
        errExit("chdir");

    /* Unmount old root and remove mount point */

    if (umount2(put_old, MNT_DETACH) == -1)
        perror("umount2");
    if (rmdir(put_old) == -1)
        perror("rmdir");

    /* Execute the command specified in argv[1]... */

    execv(args[1], &args[1]);
    errExit("execv");
}

int
main(int argc, char *argv[])
{
    /* Create a child process in a new mount namespace */

    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        errExit("malloc");

    if (clone(child, stack + STACK_SIZE,
                CLONE_NEWNS | SIGCHLD, &argv[1]) == -1)
        errExit("clone");

    /* Parent falls through to here; wait for child */

    if (wait(NULL) == -1)
        errExit("wait");

    exit(EXIT_SUCCESS);
}
SEE ALSO
chdir(2), chroot(2), mount(2), stat(2), initrd(4), mount_names‐
paces(7), pivot_root(8), switch_root(8)
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Hello Michael,
Am 23.09.19 um 14:04 schrieb Michael Kerrisk (man-pages):
> I'm considering rewriting these pieces to exactly
> describe what the system call does (which I already
> do in the third paragraph) and remove the "may or may not"
> pieces in the second paragraph. I'd welcome comments
> on making that change.
I think that it would make the man page significantly easier to
understand if the vague wording and the meta discussion about it are
removed.
> DESCRIPTION
[...]
> pivot_root() changes the
> root directory and the current working directory of each process
> or thread in the same mount namespace to new_root if they point to
> the old root directory. (See also NOTES.) On the other hand,
> pivot_root() does not change the caller's current working direc‐
> tory (unless it is on the old root directory), and thus it should
> be followed by a chdir("/") call.
There is a contradiction here with the NOTES (cf. below).
> The following restrictions apply:
>
> - new_root and put_old must be directories.
>
> - new_root and put_old must not be on the same filesystem as the
> current root. In particular, new_root can't be "/" (but can be
> a bind mounted directory on the current root filesystem).
Wouldn't "must not be on the same mountpoint" or something similar be
more clear, at least for new_root? The note in parentheses indicates
that new_root can actually be on the same filesystem as the current
root. However, ...
> - put_old must be at or underneath new_root; that is, adding a
> nonnegative number of /.. to the string pointed to by put_old
> must yield the same directory as new_root.
>
> - new_root must be a mount point. (If it is not otherwise a
> mount point, it suffices to bind mount new_root on top of
> itself.)
... this item actually makes the above item almost redundant regarding
new_root (except for the "/") case. So one could replace this item with
something like this:
- new_root must be a mount point different from "/". (If it is not
otherwise a mount point, it suffices to bind mount new_root on top
of itself.)
The above item would then only mention put_old (and maybe use clarified
wording on whether actually a different file system is necessary for
put_old or whether a different mount point is enough).
> NOTES
[...]
> pivot_root() allows the caller to switch to a new root filesystem
> while at the same time placing the old root mount at a location
> under new_root from where it can subsequently be unmounted. (The
> fact that it moves all processes that have a root directory or
> current working directory on the old root filesystem to the new
> root filesystem frees the old root filesystem of users, allowing
> it to be unmounted more easily.)
Here is the contradiction:
The DESCRIPTION says that root and current working dir are only changed
"if they point to the old root directory". Here in the NOTES it says
that any root or working directories on the old root file system (i.e.,
even if somewhere below the root) are changed.
Which is correct?
If it indeed affects all processes with root and/or current working
directory below the old root, the text here does not clearly state what
the new root/current working directory of these processes is.
E.g., if a process is at /foo and we pivot to /bar, will the process be
moved to /bar (i.e., at / after pivot_root), or will the kernel attempt
to move it to some location like /bar/foo? Because the latter might not
even exist, I suspect that everything is just moved to new_root, but
this could be stated explicitly by replacing "to the new root
filesystem" in the above paragraph with "to the new root directory"
(after checking whether this is true).
> EXAMPLE
> The program below demonstrates the use of pivot_root() inside a
> mount namespace that is created using clone(2). After pivoting to
> the root directory named in the program's first command-line argu‐
> ment, the child created by clone(2) then executes the program
> named in the remaining command-line arguments.
Why not use pivot_root(".", ".") in the example program?
It would make the example shorter, and it also works if the process
cannot write to new_root (e.g., in a user namespace).
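For reference, a minimal sketch of what the pivot step could look like with this technique. The helper name pivot_into() is hypothetical. The sketch assumes that the caller is already in its own mount namespace, that mount propagation has already been dealt with, and that new_root is already a mount point; it is not a drop-in replacement for the example program.

```c
#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: pivot into 'new_root' without creating a
   temporary directory for the old root. Returns 0 on success, or -1
   on error with errno set by the failing call.
   Assumptions: 'new_root' is already a mount point, and the caller
   has already made its mounts private or slave. */
static int
pivot_into(const char *new_root)
{
    if (chdir(new_root) == -1)
        return -1;

    /* Stack the old root on top of the new root at "/" */
    if (syscall(SYS_pivot_root, ".", ".") == -1)
        return -1;

    /* "." still refers to the new root; the lazy unmount detaches
       the old root that is now stacked on top of it */
    if (umount2(".", MNT_DETACH) == -1)
        return -1;

    return 0;
}
```

Because chdir() runs first, a bad path fails before any mount state is touched, and no directory has to be created or removed inside new_root.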
Regards,
Philipp
Hello Philipp,
My apologies that it has taken a while to reply. (I had been hoping
and waiting that a few more people might weigh in on this thread.)
On 9/23/19 3:42 PM, Philipp Wendler wrote:
> Hello Michael,
>
> Am 23.09.19 um 14:04 schrieb Michael Kerrisk (man-pages):
>
>> I'm considering rewriting these pieces to exactly
>> describe what the system call does (which I already
>> do in the third paragraph) and remove the "may or may not"
>> pieces in the second paragraph. I'd welcome comments
>> on making that change.
>
> I think that it would make the man page significantly easier to
> understand if the vague wording and the meta discussion about it are
> removed.
It is my inclination to make this change, but I'd love to get more
feedback on this point.
>> DESCRIPTION
> [...]
>> pivot_root() changes the
>> root directory and the current working directory of each process
>> or thread in the same mount namespace to new_root if they point to
>> the old root directory. (See also NOTES.) On the other hand,
>> pivot_root() does not change the caller's current working direc‐
>> tory (unless it is on the old root directory), and thus it should
>> be followed by a chdir("/") call.
>
> There is a contradiction here with the NOTES (cf. below).
See below.
>> The following restrictions apply:
>>
>> - new_root and put_old must be directories.
>>
>> - new_root and put_old must not be on the same filesystem as the
>> current root. In particular, new_root can't be "/" (but can be
>> a bind mounted directory on the current root filesystem).
>
> Wouldn't "must not be on the same mountpoint" or something similar be
> more clear, at least for new_root? The note in parentheses indicates
> that new_root can actually be on the same filesystem as the current
> root. However, ...
For 'put_old', it really is "filesystem".
For 'new_root', see below.
>> - put_old must be at or underneath new_root; that is, adding a
>> nonnegative number of /.. to the string pointed to by put_old
>> must yield the same directory as new_root.
>>
>> - new_root must be a mount point. (If it is not otherwise a
>> mount point, it suffices to bind mount new_root on top of
>> itself.)
>
> ... this item actually makes the above item almost redundant regarding
> new_root (except for the "/") case. So one could replace this item with
> something like this:
>
> - new_root must be a mount point different from "/". (If it is not
> otherwise a mount point, it suffices to bind mount new_root on top
> of itself.)
>
> The above item would then only mention put_old (and maybe use clarified
> wording on whether actually a different file system is necessary for
> put_old or whether a different mount point is enough).
Thanks. That's a good suggestion. I simplified the earlier bullet
point as you suggested, and changed the text here to say:
- new_root must be a mount point, but can't be "/". If it is not
otherwise a mount point, it suffices to bind mount new_root on
top of itself. (new_root can be a bind mounted directory on
the current root filesystem.)
>> NOTES
> [...]
>> pivot_root() allows the caller to switch to a new root filesystem
>> while at the same time placing the old root mount at a location
>> under new_root from where it can subsequently be unmounted. (The
>> fact that it moves all processes that have a root directory or
>> current working directory on the old root filesystem to the new
>> root filesystem frees the old root filesystem of users, allowing
>> it to be unmounted more easily.)
>
> Here is the contradiction:
> The DESCRIPTION says that root and current working dir are only changed
> "if they point to the old root directory". Here in the NOTES it says
> that any root or working directories on the old root file system (i.e.,
> even if somewhere below the root) are changed.
>
> Which is correct?
The first text is correct. I must have accidentally inserted
"filesystem" into the paragraph just here during a global edit.
Thanks for catching that.
> If it indeed affects all processes with root and/or current working
> directory below the old root, the text here does not clearly state what
> the new root/current working directory of these processes is.
> E.g., if a process is at /foo and we pivot to /bar, will the process be
> moved to /bar (i.e., at / after pivot_root), or will the kernel attempt
> to move it to some location like /bar/foo? Because the latter might not
> even exist, I suspect that everything is just moved to new_root, but
> this could be stated explicitly by replacing "to the new root
> filesystem" in the above paragraph with "to the new root directory"
> (after checking whether this is true).
The text here now reads:
pivot_root() allows the caller to switch to a new root filesystem
while at the same time placing the old root mount at a location
under new_root from where it can subsequently be unmounted. (The
fact that it moves all processes that have a root directory or
current working directory on the old root directory to the new
root frees the old root directory of users, allowing the old root
filesystem to be unmounted more easily.)
>> EXAMPLE
>> The program below demonstrates the use of pivot_root() inside a
>> mount namespace that is created using clone(2). After pivoting to
>> the root directory named in the program's first command-line argu‐
>> ment, the child created by clone(2) then executes the program
>> named in the remaining command-line arguments.
>
> Why not use the pivot_root(".", ".") in the example program?
> It would make the example shorter, and also works if the process cannot
> write to new_root (e.g., in a user namespace).
I'm not sure. Some people have a bit of trouble wrapping their heads
around the pivot_root(".", ".") idea. (I possibly am one of them.)
I'd be quite keen to hear other opinions on this. Unfortunately,
few people have commented on this manual page rewrite.
Thanks,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
At 2019-10-09T09:41:34+0200, Michael Kerrisk (man-pages) wrote:
> I'm not sure. Some people have a bit of trouble wrapping their heads
> around the pivot_root(".", ".") idea. (I possibly am one of them.)
> I'd be quite keen to hear other opinions on this. Unfortunately,
> few people have commented on this manual page rewrite.
pivot_root(".", ".") seems as ineffable to me as chdir(".").
Meaning mostly, but not completely.
I have an external drive with a USB cable that's a little dodgy. If it
moves around a bit the external drive gets auto-unmounted, and then
remounted in the same place, so I can experience the otherwise-baffling
shell experience of:
[disconnect/reconnect happens; the device is mounted again now]
$ ls .
Input/output error
$ cd .
$ ls .
[perfectly fine listing]
What's happened is that the meaning of "." has subtly changed in a way
that I suppose would never have been seen back in Version 7 Unix days.
Maybe I've been reading too much historical documentation (I'm currently
enjoying McKusick et al.'s _Design and Implementation of the 4.4BSD
Operating System_), but the way we describe and teach Unixlike systems
in operating systems classes and, more to the point, in our man pages I
think continues to be strongly informed by the invariants we learned in
our youth, and which are slowly but steadily being invalidated.
Concretely, I recommend having pivot_root(".", ".") in the man page as
an example, but perhaps as an alternate. Because it is
counterintuitive (to some minds), it's worth spending some time to
explain it. But I would offer it because it's a valid use of the system
call and because it makes sense to a domain expert (Eric Biederman).
I would try to offer an explanation myself but I lack the understanding.
_If_ I'm following the discussion correctly, which I doubt, then what I
imagine to happen is that a sequence point occurs between the function
parameters, and "." changes its meaning as with my "cd ." example above.
I am probably reasoning by analogy, and perhaps not by a good one.
Also, it is okay if the language of this page continues to evolve over
time. I appreciate your desire to get it "perfect" (or at least to some
local optimum) now since you're most of the way through an overhaul of
it, but it is not just the system that changes with time--the audience
does too.
Maybe in 5 or 10 years, the kids will be au fait with pivot_root(".",
".") and only some graybeards will continue to think of it as a bit
strange.
Regards,
Branden
"Michael Kerrisk (man-pages)" <[email protected]> writes:
> Hello Philipp,
>
> My apologies that it has taken a while to reply. (I had been hoping
> and waiting that a few more people might weigh in on this thread.)
>
> On 9/23/19 3:42 PM, Philipp Wendler wrote:
>> Hello Michael,
>>
>> Am 23.09.19 um 14:04 schrieb Michael Kerrisk (man-pages):
>>
>>> I'm considering rewriting these pieces to exactly
>>> describe what the system call does (which I already
>>> do in the third paragraph) and remove the "may or may not"
>>> pieces in the second paragraph. I'd welcome comments
>>> on making that change.
>>
>> I think that it would make the man page significantly easier to
>> understand if the vague wording and the meta discussion about it are
>> removed.
>
> It is my inclination to make this change, but I'd love to get more
> feedback on this point.
>
>>> DESCRIPTION
>> [...]
>>> pivot_root() changes the
>>> root directory and the current working directory of each process
>>> or thread in the same mount namespace to new_root if they point to
>>> the old root directory. (See also NOTES.) On the other hand,
>>> pivot_root() does not change the caller's current working direc‐
>>> tory (unless it is on the old root directory), and thus it should
>>> be followed by a chdir("/") call.
>>
>> There is a contradiction here with the NOTES (cf. below).
>
> See below.
>
>>> The following restrictions apply:
>>>
>>> - new_root and put_old must be directories.
>>>
>>> - new_root and put_old must not be on the same filesystem as the
>>> current root. In particular, new_root can't be "/" (but can be
>>> a bind mounted directory on the current root filesystem).
>>
>> Wouldn't "must not be on the same mountpoint" or something similar be
>> more clear, at least for new_root? The note in parentheses indicates
>> that new_root can actually be on the same filesystem as the current
>> root. However, ...
>
> For 'put_old', it really is "filesystem".
If we are going to be pedantic, "filesystem" is really the wrong concept
here. The section about bind mount clarifies it, but I wonder if there
is a better term.
I think I would say: "new_root and put_old must not be on the same mount
as the current root."
I think using "mount" instead of "filesystem" keeps the concepts less
confusing.
As I read through this email and see text that is trying to be
precise and clear, hitting the term "filesystem" is a bit jarring.
pivot_root doesn't care a thing for file systems. pivot_root only cares
about mounts.
And by a "mount" I mean the thing that you get when you create a bind
mount or you call mount normally.
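To make the distinction concrete, here is a rough sketch that asks whether a path is a mount in this sense by looking it up in /proc/self/mountinfo (see proc(5)). The helper name is_mount_point() is hypothetical, and the parsing is deliberately simplified: it assumes that mount point paths contain no escaped characters such as spaces, and that /proc is mounted.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return 1 if 'path' is the mount point of some
   mount visible in the caller's mount namespace, 0 if it is not, and
   -1 on error. A bind mount counts, even though it adds no new
   filesystem. */
static int
is_mount_point(const char *path)
{
    char resolved[PATH_MAX], mnt[PATH_MAX], line[4 * PATH_MAX];
    FILE *fp;
    int found = 0;

    if (realpath(path, resolved) == NULL)
        return -1;

    fp = fopen("/proc/self/mountinfo", "r");
    if (fp == NULL)
        return -1;

    /* Fields: mount ID, parent ID, major:minor, root, mount point, ...
       Only the fifth field (the mount point) is of interest here. */
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (sscanf(line, "%*d %*d %*s %*s %4095s", mnt) == 1 &&
                strcmp(mnt, resolved) == 0) {
            found = 1;
            break;
        }
    }

    fclose(fp);
    return found;
}
```

A directory that has been bind mounted onto itself shows up in this list while staying on the same filesystem as before, which is the case the "filesystem" wording obscures.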
Michael, do you have man pages for the new mount API yet?
> For 'new_root', see below.
>
>>> - put_old must be at or underneath new_root; that is, adding a
>>> nonnegative number of /.. to the string pointed to by put_old
>>> must yield the same directory as new_root.
>>>
>>> - new_root must be a mount point. (If it is not otherwise a
>>> mount point, it suffices to bind mount new_root on top of
>>> itself.)
>>
>> ... this item actually makes the above item almost redundant regarding
>> new_root (except for the "/") case. So one could replace this item with
>> something like this:
>>
>> - new_root must be a mount point different from "/". (If it is not
>> otherwise a mount point, it suffices to bind mount new_root on top
>> of itself.)
>>
>> The above item would then only mention put_old (and maybe use clarified
>> wording on whether actually a different file system is necessary for
>> put_old or whether a different mount point is enough).
>
> Thanks. That's a good suggestion. I simplified the earlier bullet
> point as you suggested, and changed the text here to say:
>
> - new_root must be a mount point, but can't be "/". If it is not
> otherwise a mount point, it suffices to bind mount new_root on
> top of itself. (new_root can be a bind mounted directory on
> the current root filesystem.)
How about:
- new_root must be the path to a mount, but can't be "/". Any
path that is not already a mount can be converted into one by
bind mounting the path onto itself.
>>> NOTES
>> [...]
>>> pivot_root() allows the caller to switch to a new root filesystem
>>> while at the same time placing the old root mount at a location
>>> under new_root from where it can subsequently be unmounted. (The
>>> fact that it moves all processes that have a root directory or
>>> current working directory on the old root filesystem to the new
>>> root filesystem frees the old root filesystem of users, allowing
>>> it to be unmounted more easily.)
>>
>> Here is the contradiction:
>> The DESCRIPTION says that root and current working dir are only changed
>> "if they point to the old root directory". Here in the NOTES it says
>> that any root or working directories on the old root file system (i.e.,
>> even if somewhere below the root) are changed.
>>
>> Which is correct?
>
> The first text is correct. I must have accidentally inserted
> "filesystem" into the paragraph just here during a global edit.
> Thanks for catching that.
>
>> If it indeed affects all processes with root and/or current working
>> directory below the old root, the text here does not clearly state what
>> the new root/current working directory of these processes is.
>> E.g., if a process is at /foo and we pivot to /bar, will the process be
>> moved to /bar (i.e., at / after pivot_root), or will the kernel attempt
>> to move it to some location like /bar/foo? Because the latter might not
>> even exist, I suspect that everything is just moved to new_root, but
>> this could be stated explicitly by replacing "to the new root
>> filesystem" in the above paragraph with "to the new root directory"
>> (after checking whether this is true).
>
> The text here now reads:
>
> pivot_root() allows the caller to switch to a new root filesystem
> while at the same time placing the old root mount at a location
> under new_root from where it can subsequently be unmounted. (The
> fact that it moves all processes that have a root directory or
> current working directory on the old root directory to the new
> root frees the old root directory of users, allowing the old root
> filesystem to be unmounted more easily.)
Please use "mount" instead of "filesystem".
>>> EXAMPLE
>>> The program below demonstrates the use of pivot_root() inside a
>>> mount namespace that is created using clone(2). After pivoting to
>>> the root directory named in the program's first command-line argu‐
>>> ment, the child created by clone(2) then executes the program
>>> named in the remaining command-line arguments.
>>
>> Why not use the pivot_root(".", ".") in the example program?
>> It would make the example shorter, and also works if the process cannot
>> write to new_root (e.g., in a user namespace).
>
> I'm not sure. Some people have a bit of trouble wrapping their heads
> around the pivot_root(".", ".") idea. (I possibly am one of them.)
> I'd be quite keen to hear other opinions on this. Unfortunately,
> few people have commented on this manual page rewrite.
I am happy as long as pivot_root(".", ".") is documented
somewhere. There is real code that uses it, so it is not going away.
Plus, pivot_root(".", ".") is really what is desired in a lot of
situations where the caller of pivot_root is an intermediary and
does not control the new root filesystem. At that point, the only
path you can be guaranteed to exist on the new root filesystem is "/".
Eric
Hello Eric,
Thank you. I was hoping you might jump in on this thread.
Please see below.
On 10/9/19 10:46 AM, Eric W. Biederman wrote:
> "Michael Kerrisk (man-pages)" <[email protected]> writes:
>
>> Hello Philipp,
>>
>> My apologies that it has taken a while to reply. (I had been hoping
>> and waiting that a few more people might weigh in on this thread.)
>>
>> On 9/23/19 3:42 PM, Philipp Wendler wrote:
>>> Hello Michael,
>>>
>>> Am 23.09.19 um 14:04 schrieb Michael Kerrisk (man-pages):
>>>
>>>> I'm considering to rewrite these pieces to exactly
>>>> describe what the system call does (which I already
>>>> do in the third paragraph) and remove the "may or may not"
>>>> pieces in the second paragraph. I'd welcome comments
>>>> on making that change.
What did you think about my proposal above? To put it in context,
this was my initial comment in the mail:
[[
One area of the page that I'm still not really happy with
is the "vague" wording in the second paragraph and the note
in the third paragraph about the system call possibly
changing. These pieces survive (in somewhat modified form)
from the original page, which was written before the
system call was released, and it seems there was some
question about whether the system call might still change
its behavior with respect to the root directory and current
working directory of other processes. However, after 19
years, nothing has changed, and surely it will not in the
future, since that would constitute an ABI breakage.
I'm considering to rewrite these pieces to exactly
describe what the system call does (which I already
do in the third paragraph) and remove the "may or may not"
pieces in the second paragraph. I'd welcome comments
on making that change.
]]
And the second and third paragraphs of the manual page currently
read:
[[
pivot_root() may or may not change the current root and the cur‐
rent working directory of any processes or threads that use the
old root directory and which are in the same mount namespace as
the caller of pivot_root(). The caller of pivot_root() should
ensure that processes with root or current working directory at
the old root operate correctly in either case. An easy way to
ensure this is to change their root and current working directory
to new_root before invoking pivot_root(). Note also that
pivot_root() may or may not affect the calling process's current
working directory. It is therefore recommended to call chdir("/")
immediately after pivot_root().
The paragraph above is intentionally vague because at the time
when pivot_root() was first implemented, it was unclear whether
its effect on other processes' root and current working directories
(and the caller's current working directory) might change in
the future. However, the behavior has remained consistent since
this system call was first implemented: pivot_root() changes the
root directory and the current working directory of each process
or thread in the same mount namespace to new_root if they point to
the old root directory. (See also NOTES.) On the other hand,
pivot_root() does not change the caller's current working direc‐
tory (unless it is on the old root directory), and thus it should
be followed by a chdir("/") call.
]]
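To make the chdir("/") recommendation in the quoted text concrete, here is
a rough sketch of the classic two-directory form. The helper name
pivot_then_chdir() and the directory name "oldrootfs" are hypothetical,
and new_root is assumed to be a mount point in the caller's own mount
namespace:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: pivot to new_root, parking the old root mount on
 * new_root/oldrootfs, then fix up the working directory and detach the
 * old root. Returns 0 on success, or -1 on failure. */
int pivot_then_chdir(const char *new_root)
{
    char put_old[PATH_MAX];
    int n;

    assert(new_root != NULL);

    /* put_old must be at or underneath new_root. */
    n = snprintf(put_old, sizeof(put_old), "%s/oldrootfs", new_root);
    if (n < 0 || (size_t) n >= sizeof(put_old))
        return -1;
    if (mkdir(put_old, 0777) == -1)
        return -1;

    /* glibc provides no wrapper for pivot_root(); use syscall(2). */
    if (syscall(SYS_pivot_root, new_root, put_old) == -1)
        return -1;

    /* The caller's working directory is left alone unless it was the
     * old root directory, so switch to the new root explicitly. */
    if (chdir("/") == -1)
        return -1;

    /* The old root mount is now visible at /oldrootfs; detach it and
     * remove the temporary directory. */
    if (umount2("/oldrootfs", MNT_DETACH) == -1)
        return -1;
    return rmdir("/oldrootfs");
}
```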
>>> I think that it would make the man page significantly easier to
>>> understand if the vague wording and the meta discussion about it are
>>> removed.
>>
>> It is my inclination to make this change, but I'd love to get more
>> feedback on this point.
>>
>>>> DESCRIPTION
>>> [...]
>>>> pivot_root() changes the
>>>> root directory and the current working directory of each process
>>>> or thread in the same mount namespace to new_root if they point to
>>>> the old root directory. (See also NOTES.) On the other hand,
>>>> pivot_root() does not change the caller's current working direc‐
>>>> tory (unless it is on the old root directory), and thus it should
>>>> be followed by a chdir("/") call.
>>>
>>> There is a contradiction here with the NOTES (cf. below).
>>
>> See below.
>>
>>>> The following restrictions apply:
>>>>
>>>> - new_root and put_old must be directories.
>>>>
>>>> - new_root and put_old must not be on the same filesystem as the
>>>> current root. In particular, new_root can't be "/" (but can be
>>>> a bind mounted directory on the current root filesystem).
>>>
>>> Wouldn't "must not be on the same mountpoint" or something similar be
>>> more clear, at least for new_root? The note in parentheses indicates
>>> that new_root can actually be on the same filesystem as the current
>>> root. However, ...
>>
>> For 'put_old', it really is "filesystem".
>
> If we are going to be pedantic "filesystem" is really the wrong concept
> here. The section about bind mount clarifies it, but I wonder if there
> is a better term.
Thanks. My aim was to try to distinguish "mount point" from
"a mount somewhere inside the file system associated with a
certain mount point"--in other words, I wanted to make it clear
that 'put_old' (and 'new_root') could not be subdirectories
under the current root mount point (which is correct, right?).
Using "mount" does seem better. (My only concern is that some
people may take it to mean "the mount point", but perhaps that
is just my own confusion.)
> I think I would say: "new_root and put_old must not be on the same mount
> as the current root."
I've made that change.
> I think using "mount" instead of "filesystem" keeps the concepts less
> confusing.
>
> As I am reading through this email and seeing text that is trying to be
> precise and clear then hitting the term "filesystem" is a bit jarring.
> pivot_root doesn't care a thing for file systems. pivot_root only cares
> about mounts.
>
> And by a "mount" I mean the thing that you get when you create a bind
> mount or you call mount normally.
Thanks for the above comments.
Hmm, do I need to make similar changes in the initial paragraph of
the manual page as well? It currently reads:
pivot_root() changes the root filesystem in the mount namespace of
the calling process. More precisely, it moves the root filesystem
to the directory put_old and makes new_root the new root filesys‐
tem. The calling process must have the CAP_SYS_ADMIN capability
in the user namespace that owns the caller's mount namespace.
Furthermore the one line NAME of the man page reads:
pivot_root - change the root filesystem
Is a change needed there also?
> Michael do you have man pages for the new mount api yet?
David Howells wrote pages in mid-2018, well before the syscalls got
merged in the kernel (in mid-2019). I did not merge them because
the code was not yet in the kernel, and lacking time, I never chased
David when the syscalls did get merged to see if the pages were still
up to date. I pinged David just now.
>> For 'new_root', see below.
>>
>>>> - put_old must be at or underneath new_root; that is, adding a
>>>> nonnegative number of /.. to the string pointed to by put_old
>>>> must yield the same directory as new_root.
>>>>
>>>> - new_root must be a mount point. (If it is not otherwise a
>>>> mount point, it suffices to bind mount new_root on top of
>>>> itself.)
>>>
>>> ... this item actually makes the above item almost redundant regarding
>>> new_root (except for the "/") case. So one could replace this item with
>>> something like this:
>>>
>>> - new_root must be a mount point different from "/". (If it is not
>>> otherwise a mount point, it suffices to bind mount new_root on top
>>> of itself.)
>>>
>>> The above item would then only mention put_old (and maybe use clarified
>>> wording on whether actually a different file system is necessary for
>>> put_old or whether a different mount point is enough).
>>
>> Thanks. That's a good suggestion. I simplified the earlier bullet
>> point as you suggested, and changed the text here to say:
>>
>> - new_root must be a mount point, but can't be "/". If it is not
>> otherwise a mount point, it suffices to bind mount new_root on
>> top of itself. (new_root can be a bind mounted directory on
>> the current root filesystem.)
>
> How about:
> - new_root must be the path to a mount, but can't be "/". Any
Surely here it must be "mount point" not "mount"? (See my discussion
above.)
> path that is not already a mount can be converted into one by
> bind mounting the path onto itself.
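As an illustration of the "bind mount the path onto itself" step, a
minimal sketch follows. The helper name make_mount_point() is
hypothetical; the call needs CAP_SYS_ADMIN in the relevant user namespace
to succeed:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>
#include <sys/stat.h>

/* Hypothetical helper: turn an existing directory into a mount point by
 * bind mounting it on top of itself.
 * Returns 0 on success, or -1 with errno set on failure. */
int make_mount_point(const char *path)
{
    struct stat st;

    assert(path != NULL);

    /* new_root must be a directory, so check that first. */
    if (stat(path, &st) == -1)
        return -1;
    if (!S_ISDIR(st.st_mode)) {
        errno = ENOTDIR;
        return -1;
    }

    /* For MS_BIND, the filesystemtype and data arguments are ignored. */
    return mount(path, path, NULL, MS_BIND, NULL);
}
```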
>>>> NOTES
>>> [...]
>>>> pivot_root() allows the caller to switch to a new root filesystem
>>>> while at the same time placing the old root mount at a location
>>>> under new_root from where it can subsequently be unmounted. (The
>>>> fact that it moves all processes that have a root directory or
>>>> current working directory on the old root filesystem to the new
>>>> root filesystem frees the old root filesystem of users, allowing
>>>> it to be unmounted more easily.)
>>>
>>> Here is the contradiction:
>>> The DESCRIPTION says that root and current working dir are only changed
>>> "if they point to the old root directory". Here in the NOTES it says
>>> that any root or working directories on the old root file system (i.e.,
>>> even if somewhere below the root) are changed.
>>>
>>> Which is correct?
>>
>> The first text is correct. I must have accidentally inserted
>> "filesystem" into the paragraph just here during a global edit.
>> Thanks for catching that.
>>
>>> If it indeed affects all processes with root and/or current working
>>> directory below the old root, the text here does not clearly state what
>>> the new root/current working directory of these processes is.
>>> E.g., if a process is at /foo and we pivot to /bar, will the process be
>>> moved to /bar (i.e., at / after pivot_root), or will the kernel attempt
>>> to move it to some location like /bar/foo? Because the latter might not
>>> even exist, I suspect that everything is just moved to new_root, but
>>> this could be stated explicitly by replacing "to the new root
>>> filesystem" in the above paragraph with "to the new root directory"
>>> (after checking whether this is true).
>>
>> The text here now reads:
>>
>> pivot_root() allows the caller to switch to a new root filesystem
>> while at the same time placing the old root mount at a location
>> under new_root from where it can subsequently be unmounted. (The
>> fact that it moves all processes that have a root directory or
>> current working directory on the old root directory to the new
>> root frees the old root directory of users, allowing the old root
>> filesystem to be unmounted more easily.)
>
>
> Please "mount" instead of "filesystem".
Changed.
>>>> EXAMPLE
>>>> The program below demonstrates the use of pivot_root() inside a
>>>> mount namespace that is created using clone(2). After pivoting to
>>>> the root directory named in the program's first command-line argu‐
>>>> ment, the child created by clone(2) then executes the program
>>>> named in the remaining command-line arguments.
>>>
>>> Why not use the pivot_root(".", ".") in the example program?
>>> It would make the example shorter, and also works if the process cannot
>>> write to new_root (e.g., in a user namespace).
>>
>> I'm not sure. Some people have a bit of trouble to wrap their head
>> around the pivot_root(".", ".") idea. (I possibly am one of them.)
>> I'd be quite keen to hear other opinions on this. Unfortunately,
>> few people have commented on this manual page rewrite.
>
> I am happy as long as pivot_root(".", ".") is documented
> somewhere. There is real code that uses it, so it is not going away.
> Plus pivot_root(".", ".") is really what is desired in a lot of
> situations where the caller of pivot_root is an intermediary and
> does not control the new root filesystem, at which point the only
> path you can be guaranteed to exist on the new root filesystem is "/".
Good. There is documentation of pivot_root(".", ".") in the page!
Thanks,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
"Michael Kerrisk (man-pages)" <[email protected]> writes:
> Hello Eric,
>
> Thank you. I was hoping you might jump in on this thread.
>
> Please see below.
>
> On 10/9/19 10:46 AM, Eric W. Biederman wrote:
>> "Michael Kerrisk (man-pages)" <[email protected]> writes:
>>
>>> Hello Philipp,
>>>
>>> My apologies that it has taken a while to reply. (I had been hoping
>>> and waiting that a few more people might weigh in on this thread.)
>>>
>>> On 9/23/19 3:42 PM, Philipp Wendler wrote:
>>>> Hello Michael,
>>>>
>>>> Am 23.09.19 um 14:04 schrieb Michael Kerrisk (man-pages):
>>>>
>>>>> I'm considering to rewrite these pieces to exactly
>>>>> describe what the system call does (which I already
>>>>> do in the third paragraph) and remove the "may or may not"
>>>>> pieces in the second paragraph. I'd welcome comments
>>>>> on making that change.
>
> What did you think about my proposal above? To put it in context,
> this was my initial comment in the mail:
>
> [[
> One area of the page that I'm still not really happy with
> is the "vague" wording in the second paragraph and the note
> in the third paragraph about the system call possibly
> changing. These pieces survive (in somewhat modified form)
> from the original page, which was written before the
> system call was released, and it seems there was some
> question about whether the system call might still change
> its behavior with respect to the root directory and current
> working directory of other processes. However, after 19
> years, nothing has changed, and surely it will not in the
> future, since that would constitute an ABI breakage.
> I'm considering to rewrite these pieces to exactly
> describe what the system call does (which I already
> do in the third paragraph) and remove the "may or may not"
> pieces in the second paragraph. I'd welcome comments
> on making that change.
> ]]
>
> And the second and third paragraphs of the manual page currently
> read:
>
> [[
> pivot_root() may or may not change the current root and the cur‐
> rent working directory of any processes or threads that use the
> old root directory and which are in the same mount namespace as
> the caller of pivot_root(). The caller of pivot_root() should
> ensure that processes with root or current working directory at
> the old root operate correctly in either case. An easy way to
> ensure this is to change their root and current working directory
> to new_root before invoking pivot_root(). Note also that
> pivot_root() may or may not affect the calling process's current
> working directory. It is therefore recommended to call chdir("/")
> immediately after pivot_root().
>
> The paragraph above is intentionally vague because at the time
> when pivot_root() was first implemented, it was unclear whether
> its effect on other processes' root and current working directories
> (and the caller's current working directory) might change in
> the future. However, the behavior has remained consistent since
> this system call was first implemented: pivot_root() changes the
> root directory and the current working directory of each process
> or thread in the same mount namespace to new_root if they point to
> the old root directory. (See also NOTES.) On the other hand,
> pivot_root() does not change the caller's current working direc‐
> tory (unless it is on the old root directory), and thus it should
> be followed by a chdir("/") call.
> ]]
Apologies. I saw that concern, but I didn't realize it was a question.
I think it is very reasonable to remove the warning that the behavior
might change. We have pivot_root(8) in common use, and using it requires
the semantic of changing processes other than the current process,
which means any attempt to noticeably change the behavior of
pivot_root(2) will break userspace.
Now the documented semantics in behavior above are not quite what
pivot_root(2) does. It walks all processes on the system and if the
working directory or the root directory refer to the root mount that is
being replaced, then pivot_root(2) will update them.
In practice the above is limited to a mount namespace. But something as
simple as "cd /proc/<somepid>/root" can allow a process to have a
working directory in a different mount namespace.
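The walk described above can be pictured with a small user-space model.
This is only a toy sketch, not kernel code: the struct names, the integer
IDs, and the function model_pivot_walk() are all made up for
illustration. A location is modeled as a (mount, directory) pair, and
only references that point exactly at the old root are rewritten:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: a location is a (mount ID, directory ID) pair. */
struct loc {
    int mnt;
    int dir;
};

/* Toy model of a task's root directory and current working directory. */
struct task {
    struct loc root;
    struct loc cwd;
};

static int same_loc(struct loc a, struct loc b)
{
    return a.mnt == b.mnt && a.dir == b.dir;
}

/* Visit every task; wherever root or cwd refers exactly to old_root,
 * repoint it at new_root. Returns the number of tasks that changed. */
size_t model_pivot_walk(struct task *tasks, size_t ntasks,
                        struct loc old_root, struct loc new_root)
{
    size_t updated = 0;

    assert(tasks != NULL || ntasks == 0);

    for (size_t i = 0; i < ntasks; i++) {
        int hit = 0;

        if (same_loc(tasks[i].root, old_root)) {
            tasks[i].root = new_root;
            hit = 1;
        }
        if (same_loc(tasks[i].cwd, old_root)) {
            tasks[i].cwd = new_root;
            hit = 1;
        }
        updated += hit;
    }
    return updated;
}
```

In this model, a task whose working directory is a subdirectory of the
old root (same mount, different directory) keeps its working directory,
and the loop touches every task no matter which mount namespace it
belongs to, which is where the cost of a global walk comes from.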
Because ``unprivileged'' users can now use pivot_root(2) we may want to
rethink the implementation at some point to be cheaper than a global
process walk. So far that process walk has not been a problem in
practice.
If we had to write pivot_root(2) from scratch, limiting it to just
changing the root directory of the process that calls pivot_root(2)
would have been the superior semantic. That would have required running
pivot_root(8) like: "exec pivot_root . . -- /bin/bash ..." but it would
not have required walking every thread in the system.
>>>> I think that it would make the man page significantly easier to
>>>> understand if the vague wording and the meta discussion about it are
>>>> removed.
>>>
>>> It is my inclination to make this change, but I'd love to get more
>>> feedback on this point.
>>>
>>>>> DESCRIPTION
>>>> [...]
>>>>> pivot_root() changes the
>>>>> root directory and the current working directory of each process
>>>>> or thread in the same mount namespace to new_root if they point to
>>>>> the old root directory. (See also NOTES.) On the other hand,
>>>>> pivot_root() does not change the caller's current working direc‐
>>>>> tory (unless it is on the old root directory), and thus it should
>>>>> be followed by a chdir("/") call.
>>>>
>>>> There is a contradiction here with the NOTES (cf. below).
>>>
>>> See below.
>>>
>>>>> The following restrictions apply:
>>>>>
>>>>> - new_root and put_old must be directories.
>>>>>
>>>>> - new_root and put_old must not be on the same filesystem as the
>>>>> current root. In particular, new_root can't be "/" (but can be
>>>>> a bind mounted directory on the current root filesystem).
>>>>
>>>> Wouldn't "must not be on the same mountpoint" or something similar be
>>>> more clear, at least for new_root? The note in parentheses indicates
>>>> that new_root can actually be on the same filesystem as the current
>>>> root. However, ...
>>>
>>> For 'put_old', it really is "filesystem".
>>
>> If we are going to be pedantic "filesystem" is really the wrong concept
>> here. The section about bind mount clarifies it, but I wonder if there
>> is a better term.
>
> Thanks. My aim was to try to distinguish "mount point" from
> "a mount somewhere inside the file system associated with a
> certain mount point"--in other words, I wanted to make it clear
> that 'put_old' (and 'new_root') could not be subdirectories
> under the current root mount point (which is correct, right?).
>
> Using "mount" does seem better. (My only concern is that some
> people may take it to mean "the mount point", but perhaps that
> is just my own confusion.)
I am open to better terms. But mount or vfsmount is what we are using
internal to the kernel and is really a distinct concept from filesystem.
And it is starting to leak out in system calls like move_mount.
>> I think I would say: "new_root and put_old must not be on the same mount
>> as the current root."
>
> I've made that change.
>
>> I think using "mount" instead of "filesystem" keeps the concepts less
>> confusing.
>>
>> As I am reading through this email and seeing text that is trying to be
>> precise and clear then hitting the term "filesystem" is a bit jarring.
>> pivot_root doesn't care a thing for file systems. pivot_root only cares
>> about mounts.
>>
>> And by a "mount" I mean the thing that you get when you create a bind
>> mount or you call mount normally.
>
> Thanks for the above comments.
>
> Hmm, do I need to make similar changes in the initial paragraph of
> the manual page as well? It currently reads:
>
> pivot_root() changes the root filesystem in the mount namespace of
> the calling process. More precisely, it moves the root filesystem
> to the directory put_old and makes new_root the new root filesys‐
> tem. The calling process must have the CAP_SYS_ADMIN capability
> in the user namespace that owns the caller's mount namespace.
>
> Furthermore the one line NAME of the man page reads:
>
> pivot_root - change the root filesystem
>
> Is a change needed there also?
Yes please. Both locations.
>> Michael do you have man pages for the new mount api yet?
>
> David Howells wrote pages in mid-2018, well before the syscalls got
> merged in the kernel (in mid-2019). I did not merge them because
> the code was not yet in the kernel, and lacking time, I never chased
> David when the syscalls did get merged to see if the pages were still
> up to date. I pinged David just now.
Good. I was thinking of them because the concept of "mount" matters more
there.
>>>
>>>>> - put_old must be at or underneath new_root; that is, adding a
>>>>> nonnegative number of /.. to the string pointed to by put_old
>>>>> must yield the same directory as new_root.
>>>>>
>>>>> - new_root must be a mount point. (If it is not otherwise a
>>>>> mount point, it suffices to bind mount new_root on top of
>>>>> itself.)
>>>>
>>>> ... this item actually makes the above item almost redundant regarding
>>>> new_root (except for the "/") case. So one could replace this item with
>>>> something like this:
>>>>
>>>> - new_root must be a mount point different from "/". (If it is not
>>>> otherwise a mount point, it suffices to bind mount new_root on top
>>>> of itself.)
>>>>
>>>> The above item would then only mention put_old (and maybe use clarified
>>>> wording on whether actually a different file system is necessary for
>>>> put_old or whether a different mount point is enough).
>>>
>>> Thanks. That's a good suggestion. I simplified the earlier bullet
>>> point as you suggested, and changed the text here to say:
>>>
>>> - new_root must be a mount point, but can't be "/". If it is not
>>> otherwise a mount point, it suffices to bind mount new_root on
>>> top of itself. (new_root can be a bind mounted directory on
>>> the current root filesystem.)
>>
>> How about:
>> - new_root must be the path to a mount, but can't be "/". Any
>
> Surely here it must be "mount point" not "mount"? (See my discussion
> above.)
Sigh. I have had my head in the code too long, where new_root is
used to refer to the mount that is mounted on that mount point as well.
>
>> path that is not already a mount can be converted into one by
>> bind mounting the path onto itself.
>>>>> NOTES
>>>> [...]
>>>>> pivot_root() allows the caller to switch to a new root filesystem
>>>>> while at the same time placing the old root mount at a location
>>>>> under new_root from where it can subsequently be unmounted. (The
>>>>> fact that it moves all processes that have a root directory or
>>>>> current working directory on the old root filesystem to the new
>>>>> root filesystem frees the old root filesystem of users, allowing
>>>>> it to be unmounted more easily.)
>>>>
>>>> Here is the contradiction:
>>>> The DESCRIPTION says that root and current working dir are only changed
>>>> "if they point to the old root directory". Here in the NOTES it says
>>>> that any root or working directories on the old root file system (i.e.,
>>>> even if somewhere below the root) are changed.
>>>>
>>>> Which is correct?
>>>
>>> The first text is correct. I must have accidentally inserted
>>> "filesystem" into the paragraph just here during a global edit.
>>> Thanks for catching that.
>>>
>>>> If it indeed affects all processes with root and/or current working
>>>> directory below the old root, the text here does not clearly state what
>>>> the new root/current working directory of these processes is.
>>>> E.g., if a process is at /foo and we pivot to /bar, will the process be
>>>> moved to /bar (i.e., at / after pivot_root), or will the kernel attempt
>>>> to move it to some location like /bar/foo? Because the latter might not
>>>> even exist, I suspect that everything is just moved to new_root, but
>>>> this could be stated explicitly by replacing "to the new root
>>>> filesystem" in the above paragraph with "to the new root directory"
>>>> (after checking whether this is true).
>>>
>>> The text here now reads:
>>>
>>> pivot_root() allows the caller to switch to a new root filesystem
>>> while at the same time placing the old root mount at a location
>>> under new_root from where it can subsequently be unmounted. (The
>>> fact that it moves all processes that have a root directory or
>>> current working directory on the old root directory to the new
>>> root frees the old root directory of users, allowing the old root
>>> filesystem to be unmounted more easily.)
>>
>>
>> Please "mount" instead of "filesystem".
>
> Changed.
>
>
>>>>> EXAMPLE
>>>>> The program below demonstrates the use of pivot_root() inside a
>>>>> mount namespace that is created using clone(2). After pivoting to
>>>>> the root directory named in the program's first command-line argu‐
>>>>> ment, the child created by clone(2) then executes the program
>>>>> named in the remaining command-line arguments.
>>>>
>>>> Why not use the pivot_root(".", ".") in the example program?
>>>> It would make the example shorter, and also works if the process cannot
>>>> write to new_root (e.g., in a user namespace).
>>>
>>> I'm not sure. Some people have a bit of trouble to wrap their head
>>> around the pivot_root(".", ".") idea. (I possibly am one of them.)
>>> I'd be quite keen to hear other opinions on this. Unfortunately,
>>> few people have commented on this manual page rewrite.
>>
>> I am happy as long as pivot_root(".", ".") is documented
>> somewhere. There is real code that uses it, so it is not going away.
>> Plus pivot_root(".", ".") is really what is desired in a lot of
>> situations where the caller of pivot_root is an intermediary and
>> does not control the new root filesystem, at which point the only
>> path you can be guaranteed to exist on the new root filesystem is "/".
>
> Good. There is documentation of pivot_root(".", ".") in the page!
Yeah!
Eric
Hello Eric,
On 10/9/19 6:00 PM, Eric W. Biederman wrote:
> "Michael Kerrisk (man-pages)" <[email protected]> writes:
>
>> Hello Eric,
>>
>> Thank you. I was hoping you might jump in on this thread.
>>
>> Please see below.
>>
>> On 10/9/19 10:46 AM, Eric W. Biederman wrote:
>>> "Michael Kerrisk (man-pages)" <[email protected]> writes:
>>>
>>>> Hello Philipp,
>>>>
>>>> My apologies that it has taken a while to reply. (I had been hoping
>>>> and waiting that a few more people might weigh in on this thread.)
>>>>
>>>> On 9/23/19 3:42 PM, Philipp Wendler wrote:
>>>>> Hello Michael,
>>>>>
>>>>> Am 23.09.19 um 14:04 schrieb Michael Kerrisk (man-pages):
>>>>>
>>>>>> I'm considering to rewrite these pieces to exactly
>>>>>> describe what the system call does (which I already
>>>>>> do in the third paragraph) and remove the "may or may not"
>>>>>> pieces in the second paragraph. I'd welcome comments
>>>>>> on making that change.
>>
>> What did you think about my proposal above? To put it in context,
>> this was my initial comment in the mail:
>>
>> [[
>> One area of the page that I'm still not really happy with
>> is the "vague" wording in the second paragraph and the note
>> in the third paragraph about the system call possibly
>> changing. These pieces survive (in somewhat modified form)
>> from the original page, which was written before the
>> system call was released, and it seems there was some
>> question about whether the system call might still change
>> its behavior with respect to the root directory and current
>> working directory of other processes. However, after 19
>> years, nothing has changed, and surely it will not in the
>> future, since that would constitute an ABI breakage.
>> I'm considering to rewrite these pieces to exactly
>> describe what the system call does (which I already
>> do in the third paragraph) and remove the "may or may not"
>> pieces in the second paragraph. I'd welcome comments
>> on making that change.
>> ]]
>>
>> And the second and third paragraphs of the manual page currently
>> read:
>>
>> [[
>> pivot_root() may or may not change the current root and the cur‐
>> rent working directory of any processes or threads that use the
>> old root directory and which are in the same mount namespace as
>> the caller of pivot_root(). The caller of pivot_root() should
>> ensure that processes with root or current working directory at
>> the old root operate correctly in either case. An easy way to
>> ensure this is to change their root and current working directory
>> to new_root before invoking pivot_root(). Note also that
>> pivot_root() may or may not affect the calling process's current
>> working directory. It is therefore recommended to call chdir("/")
>> immediately after pivot_root().
>>
>> The paragraph above is intentionally vague because at the time
>> when pivot_root() was first implemented, it was unclear whether
>> its effect on other processes' root and current working directories
>> (and the caller's current working directory) might change in
>> the future. However, the behavior has remained consistent since
>> this system call was first implemented: pivot_root() changes the
>> root directory and the current working directory of each process
>> or thread in the same mount namespace to new_root if they point to
>> the old root directory. (See also NOTES.) On the other hand,
>> pivot_root() does not change the caller's current working direc‐
>> tory (unless it is on the old root directory), and thus it should
>> be followed by a chdir("/") call.
>> ]]
>
> Apologies. I saw that concern, but I didn't realize it was a question.
>
> I think it is very reasonable to remove the warning that the behavior
> might change. We have pivot_root(8) in common use, and using it requires
> the semantic of changing processes other than the current process,
> which means any attempt to noticeably change the behavior of
> pivot_root(2) will break userspace.
Thanks for the confirmation that this change would be okay.
I will make this change soon, unless I hear a counterargument.
> Now the documented semantics in behavior above are not quite what
> pivot_root(2) does. It walks all processes on the system and if the
> working directory or the root directory refer to the root mount that is
> being replaced, then pivot_root(2) will update them.
>
> In practice the above is limited to a mount namespace. But something as
> simple as "cd /proc/<somepid>/root" can allow a process to have a
> working directory in a different mount namespace.
So, I'm not quite clear. Do you mean that something in the existing
manual page text should change? If so, could you describe the
needed change please?
> Because ``unprivileged'' users can now use pivot_root(2) we may want to
> rethink the implementation at some point to be cheaper than a global
> process walk. So far that process walk has not been a problem in
> practice.
>
> If we had to write pivot_root(2) from scratch, limiting it to just
> changing the root directory of the process that calls pivot_root(2)
> would have been the superior semantic. That would have required running
> pivot_root(8) like: "exec pivot_root . . -- /bin/bash ..." but it would
> not have required walking every thread in the system.
Okay.
[...]
>>>>>> DESCRIPTION
>>>>> [...]
>>>>>> pivot_root() changes the
>>>>>> root directory and the current working directory of each process
>>>>>> or thread in the same mount namespace to new_root if they point to
>>>>>> the old root directory. (See also NOTES.) On the other hand,
>>>>>> pivot_root() does not change the caller's current working direc‐
>>>>>> tory (unless it is on the old root directory), and thus it should
>>>>>> be followed by a chdir("/") call.
>>>>>
>>>>> There is a contradiction here with the NOTES (cf. below).
>>>>
>>>> See below.
>>>>
>>>>>> The following restrictions apply:
>>>>>>
>>>>>> - new_root and put_old must be directories.
>>>>>>
>>>>>> - new_root and put_old must not be on the same filesystem as the
>>>>>> current root. In particular, new_root can't be "/" (but can be
>>>>>> a bind mounted directory on the current root filesystem).
>>>>>
>>>>> Wouldn't "must not be on the same mountpoint" or something similar be
>>>>> more clear, at least for new_root? The note in parentheses indicates
>>>>> that new_root can actually be on the same filesystem as the current
>>>>> root. However, ...
>>>>
>>>> For 'put_old', it really is "filesystem".
>>>
>>> If we are going to be pedantic "filesystem" is really the wrong concept
>>> here. The section about bind mount clarifies it, but I wonder if there
>>> is a better term.
>>
>> Thanks. My aim was to try to distinguish "mount point" from
>> "a mount somewhere inside the file system associated with a
>> certain mount point"--in other words, I wanted to make it clear
>> that 'put_old' (and 'new_root') could not be subdirectories
>> under the current root mount point (which is correct, right?).
>>
>> Using "mount" does seem better. (My only concern is that some
>> people may take it to mean "the mount point", but perhaps that
>> is just my own confusion.)
>
> I am open to better terms. But mount or vfsmount is what we are using
> internal to the kernel and is really a distinct concept from filesystem.
> And it is starting to leak out in system calls like move_mount.
I have no better term to propose.
[...]
>> Thanks for the above comments.
>>
>> Hmm, do I need to make similar changes in the initial paragraph of
>> the manual page as well? It currently reads:
>>
>> pivot_root() changes the root filesystem in the mount namespace of
>> the calling process. More precisely, it moves the root filesystem
>> to the directory put_old and makes new_root the new root filesys‐
>> tem. The calling process must have the CAP_SYS_ADMIN capability
>> in the user namespace that owns the caller's mount namespace.
>>
>> Furthermore the one line NAME of the man page reads:
>>
>> pivot_root - change the root filesystem
>>
>> Is a change needed there also?
>
> Yes please. Both locations.
Okay. So would the following be okay:
[[
NAME
pivot_root - change the root mount
...
DESCRIPTION
pivot_root() changes the root mount in the mount namespace of the
calling process. More precisely, it moves the root mount to the
directory put_old and makes new_root the new root mount. The
calling process must have the CAP_SYS_ADMIN capability in the user
namespace that owns the caller's mount namespace.
]]
?
[...]
>>>>>> - new_root must be a mount point. (If it is not otherwise a
>>>>>> mount point, it suffices to bind mount new_root on top of
>>>>>> itself.)
>>>>>
>>>>> ... this item actually makes the above item almost redundant regarding
>>>>> new_root (except for the "/" case). So one could replace this item with
>>>>> something like this:
>>>>>
>>>>> - new_root must be a mount point different from "/". (If it is not
>>>>> otherwise a mount point, it suffices to bind mount new_root on top
>>>>> of itself.)
>>>>>
>>>>> The above item would then only mention put_old (and maybe use clarified
>>>>> wording on whether actually a different file system is necessary for
>>>>> put_old or whether a different mount point is enough).
>>>>
>>>> Thanks. That's a good suggestion. I simplified the earlier bullet
>>>> point as you suggested, and changed the text here to say:
>>>>
>>>> - new_root must be a mount point, but can't be "/". If it is not
>>>> otherwise a mount point, it suffices to bind mount new_root on
>>>> top of itself. (new_root can be a bind mounted directory on
>>>> the current root filesystem.)
>>>
>>> How about:
>>> - new_root must be the path to a mount, but can't be "/". Any
>>
>> Surely here it must be "mount point" not "mount"? (See my discussion
>> above.)
>
> Sigh. I have had my head in the code too long, where new_root is
> used to refer to the mount that is mounted on that mount point as well.
Okay -- so I made the text here:
- new_root must be a path to a mount point, but can't be "/". A
path that is not already a mount point can be converted into
one by bind mounting the path onto itself.
Okay?
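
For concreteness, the bind-mount step in that bullet could be sketched in C
roughly as follows. This is only an illustration under assumptions: the
function names (make_mount_point, pivot_into, sys_pivot_root) are made up
here, the caller is assumed to have CAP_SYS_ADMIN and to be in its own mount
namespace, and the propagation type of the mounts involved is assumed not to
be MS_SHARED (otherwise pivot_root() fails with EINVAL).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: invoke pivot_root(2) through syscall(2). */
static int sys_pivot_root(const char *new_root, const char *put_old)
{
    return (int) syscall(SYS_pivot_root, new_root, put_old);
}

/* Turn 'path' into a mount point by bind mounting it onto itself.
   Returns 0 on success, or -1 with errno set. The path is checked
   first, so a bad path fails before any mount is attempted. */
int make_mount_point(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) == -1)
        return -1;
    if (!S_ISDIR(sb.st_mode)) {
        errno = ENOTDIR;        /* new_root must be a directory */
        return -1;
    }
    return mount(path, path, NULL, MS_BIND, NULL);
}

/* Sketch of the pivot_root(".", ".") technique on top of the above.
   Assumes a private mount namespace and non-shared propagation. */
int pivot_into(const char *new_root)
{
    if (make_mount_point(new_root) == -1)
        return -1;
    if (chdir(new_root) == -1)
        return -1;
    if (sys_pivot_root(".", ".") == -1)
        return -1;
    /* The old root is now stacked at the same location; detach it. */
    if (umount2(".", MNT_DETACH) == -1)
        return -1;
    return chdir("/");
}
```

A caller would typically unshare its mount namespace and adjust propagation
before calling pivot_into(); those steps are left out of the sketch.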
[...]
Thanks, Eric. As always, your input for the man pages is so
valuable. (My only challenge is to keep up with you...)
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Hello Eric,
I think I just understood something. See below.
On 10/9/19 11:01 PM, Michael Kerrisk (man-pages) wrote:
> Hello Eric,
>
> On 10/9/19 6:00 PM, Eric W. Biederman wrote:
>> "Michael Kerrisk (man-pages)" <[email protected]> writes:
>>
>>> Hello Eric,
>>>
>>> Thank you. I was hoping you might jump in on this thread.
>>>
>>> Please see below.
>>>
>>> On 10/9/19 10:46 AM, Eric W. Biederman wrote:
>>>> "Michael Kerrisk (man-pages)" <[email protected]> writes:
>>>>
>>>>> Hello Philipp,
>>>>>
>>>>> My apologies that it has taken a while to reply. (I had been hoping
>>>>> and waiting that a few more people might weigh in on this thread.)
>>>>>
>>>>> On 9/23/19 3:42 PM, Philipp Wendler wrote:
>>>>>> Hello Michael,
>>>>>>
>>>>>> Am 23.09.19 um 14:04 schrieb Michael Kerrisk (man-pages):
>>>>>>
>>>>>>> I'm considering to rewrite these pieces to exactly
>>>>>>> describe what the system call does (which I already
>>>>>>> do in the third paragraph) and remove the "may or may not"
>>>>>>> pieces in the second paragraph. I'd welcome comments
>>>>>>> on making that change.
>>>
>>> What did you think about my proposal above? To put it in context,
>>> this was my initial comment in the mail:
>>>
>>> [[
>>> One area of the page that I'm still not really happy with
>>> is the "vague" wording in the second paragraph and the note
>>> in the third paragraph about the system call possibly
>>> changing. These pieces survive (in somewhat modified form)
>>> from the original page, which was written before the
>>> system call was released, and it seems there was some
>>> question about whether the system call might still change
>>> its behavior with respect to the root directory and current
>>> working directory of other processes. However, after 19
>>> years, nothing has changed, and surely it will not in the
>>> future, since that would constitute an ABI breakage.
>>> I'm considering to rewrite these pieces to exactly
>>> describe what the system call does (which I already
>>> do in the third paragraph) and remove the "may or may not"
>>> pieces in the second paragraph. I'd welcome comments
>>> on making that change.
>>> ]]
>>>
>>> And the second and third paragraphs of the manual page currently
>>> read:
>>>
>>> [[
>>> pivot_root() may or may not change the current root and the cur‐
>>> rent working directory of any processes or threads that use the
>>> old root directory and which are in the same mount namespace as
>>> the caller of pivot_root(). The caller of pivot_root() should
>>> ensure that processes with root or current working directory at
>>> the old root operate correctly in either case. An easy way to
>>> ensure this is to change their root and current working directory
>>> to new_root before invoking pivot_root(). Note also that
>>> pivot_root() may or may not affect the calling process's current
>>> working directory. It is therefore recommended to call chdir("/")
>>> immediately after pivot_root().
>>>
>>> The paragraph above is intentionally vague because at the time
>>> when pivot_root() was first implemented, it was unclear whether
>>> its effect on other processes' root and current working
>>> directories (and on the caller's current working directory) might change in
>>> the future. However, the behavior has remained consistent since
>>> this system call was first implemented: pivot_root() changes the
>>> root directory and the current working directory of each process
>>> or thread in the same mount namespace to new_root if they point to
>>> the old root directory. (See also NOTES.) On the other hand,
>>> pivot_root() does not change the caller's current working direc‐
>>> tory (unless it is on the old root directory), and thus it should
>>> be followed by a chdir("/") call.
>>> ]]
>>
>> Apologies. I saw that concern, but I didn't realize it was a question.
>>
>> I think it is very reasonable to remove the warning that the behavior
>> might change. We have pivot_root(8) in common use, and using it requires
>> the semantic of changing processes other than the current process.
>> That means any attempt to noticeably change the behavior of
>> pivot_root(2) will break userspace.
>
> Thanks for the confirmation that this change would be okay.
> I will make this change soon, unless I hear a counterargument.
>
>> Now the documented semantics in behavior above are not quite what
>> pivot_root(2) does. It walks all processes on the system and if the
>> working directory or the root directory refer to the root mount that is
>> being replaced, then pivot_root(2) will update them.
>>
>> In practice the above is limited to a mount namespace. But something as
>> simple as "cd /proc/<somepid>/root" can allow a process to have a
>> working directory in a different mount namespace.
>
> So, I'm not quite clear. Do you mean that something in the existing
> manual page text should change? If so, could you describe the
> needed change please?
Okay, I had to sleep on this one. I think what you are saying is
that if some process, pidX, in mountns X does a "cd /proc/<pidY>/root"
where pidY is a process in mountns Y, and then some
process in mountns Y does a pivot_root(), then the CWD of pidX will
be changed, even though it is in a different mountns. Right?
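
To make the scenario concrete, the pidX side could be set up with something
like the following C sketch. It only performs the chdir() and reports where
the CWD currently points; it makes no claim about what a later pivot_root()
in the other namespace does to it. The function names are made up for
illustration, and dereferencing /proc/<pid>/root is subject to the usual
ptrace access check on the target process.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Change the caller's CWD to the root directory of process 'pid',
   as "cd /proc/<pid>/root" would. Returns 0 on success, or -1 with
   errno set (for example, if no such process exists). */
int chdir_to_root_of(pid_t pid)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%ld/root", (long) pid);
    return chdir(path);
}

/* Store in 'buf' the target of /proc/self/cwd, that is, where the
   kernel reports the caller's CWD to be. Returns the length of the
   string, or -1 on error. */
ssize_t current_cwd(char *buf, size_t len)
{
    ssize_t n;

    if (len == 0)
        return -1;
    n = readlink("/proc/self/cwd", buf, len - 1);
    if (n != -1)
        buf[n] = '\0';          /* readlink() does not terminate */
    return n;
}
```

Calling current_cwd() before and after the pivot_root() in mountns Y would
show whether the CWD of pidX was updated.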
Thanks,
Michael