Subject: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"

For a long time, this manual page has had a brief discussion of
"locked" mounts, without clearly saying what this concept is, or
why it exists. Expand the discussion with an explanation of what
locked mounts are, why mounts are locked, and some examples of the
effect of locking.

Thanks to Christian Brauner for a lot of help in understanding
these details.

Reported-by: Christian Brauner <[email protected]>
Signed-off-by: Michael Kerrisk <[email protected]>
---

Hello Eric and others,

After some quite helpful info from Christian Brauner, I've expanded
the discussion of locked mounts (a concept I didn't really have a
good grasp on) in the mount_namespaces(7) manual page. I would be
grateful to receive review comments, acks, etc., on the patch below.
Could you take a look please?

Cheers,

Michael

man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)

diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
index e3468bdb7..97427c9ea 100644
--- a/man7/mount_namespaces.7
+++ b/man7/mount_namespaces.7
@@ -107,6 +107,62 @@ operation brings across all of the mounts from the original
mount namespace as a single unit,
and recursive mounts that propagate between
mount namespaces propagate as a single unit.)
+.IP
+In this context, "may not be separated" means that the mounts
+are locked so that they may not be individually unmounted.
+Consider the following example:
+.IP
+.RS
+.in +4n
+.EX
+$ \fBsudo mkdir /mnt/dir\fP
+$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
+$ \fBsudo mount \-\-bind \-o ro /some/path /mnt/dir\fP
+$ \fBls /mnt/dir\fP # Former contents of directory are invisible
+.EE
+.in
+.RE
+.IP
+The above steps, performed in a more privileged user namespace,
+have created a (read-only) bind mount that
+obscures the contents of the directory
+.IR /mnt/dir .
+For security reasons, it should not be possible to unmount
+that mount in a less privileged user namespace,
+since that would reveal the contents of the directory
+.IR /mnt/dir .
+.IP
+Suppose we now create a new mount namespace
+owned by a (new) subordinate user namespace.
+The new mount namespace will inherit copies of all of the mounts
+from the previous mount namespace.
+However, those mounts will be locked because the new mount namespace
+is owned by a less privileged user namespace.
+Consequently, an attempt to unmount the mount fails:
+.IP
+.RS
+.in +4n
+.EX
+$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
+ \fBstrace \-o /tmp/log \e\fP
+ \fBumount /mnt/dir\fP
+umount: /mnt/dir: not mounted.
+$ \fBgrep \(aq^umount\(aq /tmp/log\fP
+umount2("/mnt/dir", 0) = \-1 EINVAL (Invalid argument)
+.EE
+.in
+.RE
+.IP
+The error message from
+.BR umount (8)
+is a little confusing, but the
+.BR strace (1)
+output reveals that the underlying
+.BR umount2 (2)
+system call failed with the error
+.BR EINVAL ,
+which is the error that the kernel returns to indicate that
+the mount is locked.
.IP *
The
.BR mount (2)
@@ -128,6 +184,23 @@ settings become locked
when propagated from a more privileged to
a less privileged mount namespace,
and may not be changed in the less privileged mount namespace.
+.IP
+This point can be illustrated by a continuation of the previous example.
+In that example, the bind mount was marked as read-only.
+For security reasons,
+it should not be possible to make the mount writable in
+a less privileged namespace, and indeed the kernel prevents this,
+as illustrated by the following:
+.IP
+.RS
+.in +4n
+.EX
+$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
+ \fBmount \-o remount,rw /mnt/dir\fP
+mount: /mnt/dir: permission denied.
+.EE
+.in
+.RE
.IP *
.\" (As of 3.18-rc1 (in Al Viro's 2014-08-30 vfs.git#for-next tree))
A file or directory that is a mount point in one namespace that is not
--
2.31.1


2021-08-14 08:11:28

by Christian Brauner

Subject: Re: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"

On Sat, Aug 14, 2021 at 12:01:20AM +0200, Michael Kerrisk wrote:
> For a long time, this manual page has had a brief discussion of
> "locked" mounts, without clearly saying what this concept is, or
> why it exists. Expand the discussion with an explanation of what
> locked mounts are, why mounts are locked, and some examples of the
> effect of locking.
>
> Thanks to Christian Brauner for a lot of help in understanding
> these details.
>
> Link: https://lore.kernel.org/r/[email protected]
> Reported-by: Christian Brauner <[email protected]>
> Signed-off-by: Michael Kerrisk <[email protected]>
> ---

Looks good. Thank you!
Acked-by: Christian Brauner <[email protected]>

2021-08-16 16:08:13

by Eric W. Biederman

Subject: Re: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"

Michael Kerrisk <[email protected]> writes:

> For a long time, this manual page has had a brief discussion of
> "locked" mounts, without clearly saying what this concept is, or
> why it exists. Expand the discussion with an explanation of what
> locked mounts are, why mounts are locked, and some examples of the
> effect of locking.
>
> Thanks to Christian Brauner for a lot of help in understanding
> these details.
>
> Reported-by: Christian Brauner <[email protected]>
> Signed-off-by: Michael Kerrisk <[email protected]>
> ---
>
> Hello Eric and others,
>
> After some quite helpful info from Christian Brauner, I've expanded
> the discussion of locked mounts (a concept I didn't really have a
> good grasp on) in the mount_namespaces(7) manual page. I would be
> grateful to receive review comments, acks, etc., on the patch below.
> Could you take a look please?
>
> Cheers,
>
> Michael
>
> man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 73 insertions(+)
>
> diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
> index e3468bdb7..97427c9ea 100644
> --- a/man7/mount_namespaces.7
> +++ b/man7/mount_namespaces.7
> @@ -107,6 +107,62 @@ operation brings across all of the mounts from the original
> mount namespace as a single unit,
> and recursive mounts that propagate between
> mount namespaces propagate as a single unit.)
> +.IP
> +In this context, "may not be separated" means that the mounts
> +are locked so that they may not be individually unmounted.
> +Consider the following example:
> +.IP
> +.RS
> +.in +4n
> +.EX
> +$ \fBsudo mkdir /mnt/dir\fP
> +$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
> +$ \fBsudo mount \-\-bind -o ro /some/path /mnt/dir\fP
> +$ \fBls /mnt/dir\fP # Former contents of directory are invisible

Do we want a more motivating example such as a /proc/sys?

It has been common to mount over /proc files and directories that can be
written to by the global root so that users in a mount namespace may not
touch them.


> +.EE
> +.in
> +.RE
> +.IP
> +The above steps, performed in a more privileged user namespace,
> +have created a (read-only) bind mount that
> +obscures the contents of the directory
> +.IR /mnt/dir .
> +For security reasons, it should not be possible to unmount
> +that mount in a less privileged user namespace,
> +since that would reveal the contents of the directory
> +.IR /mnt/dir .
> +.IP
> +Suppose we now create a new mount namespace
> +owned by a (new) subordinate user namespace.
> +The new mount namespace will inherit copies of all of the mounts
> +from the previous mount namespace.
> +However, those mounts will be locked because the new mount namespace
> +is owned by a less privileged user namespace.
> +Consequently, an attempt to unmount the mount fails:
> +.IP
> +.RS
> +.in +4n
> +.EX
> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
> + \fBstrace \-o /tmp/log \e\fP
> + \fBumount /mnt/dir\fP
> +umount: /mnt/dir: not mounted.
> +$ \fBgrep \(aq^umount\(aq /tmp/log\fP
> +umount2("/mnt/dir", 0) = \-1 EINVAL (Invalid argument)
> +.EE
> +.in
> +.RE
> +.IP
> +The error message from
> +.BR mount (8)
> +is a little confusing, but the
> +.BR strace (1)
> +output reveals that the underlying
> +.BR umount2 (2)
> +system call failed with the error
> +.BR EINVAL ,
> +which is the error that the kernel returns to indicate that
> +the mount is locked.

Do you want to mention that you can unmount the entire subtree? Either
with pivot_root if it is locked to "/" or with
"umount -l /path/to/propagated/directory".

> .IP *
> The
> .BR mount (2)
> @@ -128,6 +184,23 @@ settings become locked
> when propagated from a more privileged to
> a less privileged mount namespace,
> and may not be changed in the less privileged mount namespace.
> +.IP
> +This point can be illustrated by a continuation of the previous example.
> +In that example, the bind mount was marked as read-only.
> +For security reasons,
> +it should not be possible to make the mount writable in
> +a less privileged namespace, and indeed the kernel prevents this,
> +as illustrated by the following:
> +.IP
> +.RS
> +.in +4n
> +.EX
> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
> + \fBmount \-o remount,rw /mnt/dir\fP
> +mount: /mnt/dir: permission denied.
> +.EE
> +.in
> +.RE
> .IP *
> .\" (As of 3.18-rc1 (in Al Viro's 2014-08-30 vfs.git#for-next tree))
> A file or directory that is a mount point in one namespace that is not

Eric

Subject: Re: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"

Hi Eric,

Thanks for your feedback!

On 8/16/21 6:03 PM, Eric W. Biederman wrote:
> Michael Kerrisk <[email protected]> writes:
>
>> For a long time, this manual page has had a brief discussion of
>> "locked" mounts, without clearly saying what this concept is, or
>> why it exists. Expand the discussion with an explanation of what
>> locked mounts are, why mounts are locked, and some examples of the
>> effect of locking.
>>
>> Thanks to Christian Brauner for a lot of help in understanding
>> these details.
>>
>> Reported-by: Christian Brauner <[email protected]>
>> Signed-off-by: Michael Kerrisk <[email protected]>
>> ---
>>
>> Hello Eric and others,
>>
>> After some quite helpful info from Christian Brauner, I've expanded
>> the discussion of locked mounts (a concept I didn't really have a
>> good grasp on) in the mount_namespaces(7) manual page. I would be
>> grateful to receive review comments, acks, etc., on the patch below.
>> Could you take a look please?
>>
>> Cheers,
>>
>> Michael
>>
>> man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 73 insertions(+)
>>
>> diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
>> index e3468bdb7..97427c9ea 100644
>> --- a/man7/mount_namespaces.7
>> +++ b/man7/mount_namespaces.7
>> @@ -107,6 +107,62 @@ operation brings across all of the mounts from the original
>> mount namespace as a single unit,
>> and recursive mounts that propagate between
>> mount namespaces propagate as a single unit.)
>> +.IP
>> +In this context, "may not be separated" means that the mounts
>> +are locked so that they may not be individually unmounted.
>> +Consider the following example:
>> +.IP
>> +.RS
>> +.in +4n
>> +.EX
>> +$ \fBsudo mkdir /mnt/dir\fP
>> +$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
>> +$ \fBsudo mount \-\-bind -o ro /some/path /mnt/dir\fP
>> +$ \fBls /mnt/dir\fP # Former contents of directory are invisible
>
> Do we want a more motivating example such as a /proc/sys?
>
> It has been common to mount over /proc files and directories that can be
> written to by the global root so that users in a mount namespace may not
> touch them.

Seems reasonable. But I want to check one thing. Can you please
define "global root". I'm pretty sure I know what you mean, but
I'd like to know your definition.

>> +.EE
>> +.in
>> +.RE
>> +.IP
>> +The above steps, performed in a more privileged user namespace,
>> +have created a (read-only) bind mount that
>> +obscures the contents of the directory
>> +.IR /mnt/dir .
>> +For security reasons, it should not be possible to unmount
>> +that mount in a less privileged user namespace,
>> +since that would reveal the contents of the directory
>> +.IR /mnt/dir .
>> +.IP
>> +Suppose we now create a new mount namespace
>> +owned by a (new) subordinate user namespace.
>> +The new mount namespace will inherit copies of all of the mounts
>> +from the previous mount namespace.
>> +However, those mounts will be locked because the new mount namespace
>> +is owned by a less privileged user namespace.
>> +Consequently, an attempt to unmount the mount fails:
>> +.IP
>> +.RS
>> +.in +4n
>> +.EX
>> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
>> + \fBstrace \-o /tmp/log \e\fP
>> + \fBumount /mnt/dir\fP
>> +umount: /mnt/dir: not mounted.
>> +$ \fBgrep \(aq^umount\(aq /tmp/log\fP
>> +umount2("/mnt/dir", 0) = \-1 EINVAL (Invalid argument)
>> +.EE
>> +.in
>> +.RE
>> +.IP
>> +The error message from
>> +.BR mount (8)
>> +is a little confusing, but the
>> +.BR strace (1)
>> +output reveals that the underlying
>> +.BR umount2 (2)
>> +system call failed with the error
>> +.BR EINVAL ,
>> +which is the error that the kernel returns to indicate that
>> +the mount is locked.
>
> Do you want to mention that you can unmount the entire subtree? Either
> with pivot_root if it is locked to "/" or with
> "umount -l /path/to/propagated/directory".

Yes, I wondered about that, but hadn't got round to devising
the scenario. How about this:

[[
* Following on from the previous point, note that it is possible
to unmount an entire tree of mounts that propagated as a unit
into a mount namespace that is owned by a less privileged user
namespace, as illustrated in the following example.

First, we create new user and mount namespaces using
unshare(1). In the new mount namespace, the propagation type
of all mounts is set to private. We then create a shared bind
mount at /mnt, and a small hierarchy of mount points underneath
that mount point.

$ PS1='ns1# ' sudo unshare --user --map-root-user \
--mount --propagation private bash
ns1# echo $$ # We need the PID of this shell later
778501
ns1# mount --make-shared --bind /mnt /mnt
ns1# mkdir /mnt/x
ns1# mount --make-private -t tmpfs none /mnt/x
ns1# mkdir /mnt/x/y
ns1# mount --make-private -t tmpfs none /mnt/x/y
ns1# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
986 83 8:5 /mnt /mnt rw,relatime shared:344
989 986 0:56 / /mnt/x rw,relatime
990 989 0:57 / /mnt/x/y rw,relatime

Continuing in the same shell session, we then create a second
shell in a new mount namespace and a new subordinate (and thus
less privileged) user namespace and check the state of the
propagated mount points rooted at /mnt.

ns1# PS1='ns2# ' unshare --user --map-root-user \
--mount --propagation unchanged bash
ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
1239 1204 8:5 /mnt /mnt rw,relatime master:344
1240 1239 0:56 / /mnt/x rw,relatime
1241 1240 0:57 / /mnt/x/y rw,relatime

Of note in the above output is that the propagation type of the
mount point /mnt has been reduced to slave, as explained near
the start of this subsection. This means that submount events
will propagate from the master /mnt in "ns1", but propagation
will not occur in the opposite direction.

From a separate terminal window, we then use nsenter(1) to
enter the mount and user namespaces corresponding to "ns1". In
that terminal window, we then recursively bind mount /mnt/x at
the location /mnt/ppp.

$ PS1='ns3# ' sudo nsenter -t 778501 --user --mount
ns3# mount --rbind --make-private /mnt/x /mnt/ppp
ns3# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
986 83 8:5 /mnt /mnt rw,relatime shared:344
989 986 0:56 / /mnt/x rw,relatime
990 989 0:57 / /mnt/x/y rw,relatime
1242 986 0:56 / /mnt/ppp rw,relatime
1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518

Because the propagation type of the parent mount, /mnt, was
shared, the recursive bind mount propagated a small tree of
mounts under the slave mount /mnt into "ns2", as can be
verified by executing the following command in that shell
session:

ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
1239 1204 8:5 /mnt /mnt rw,relatime master:344
1240 1239 0:56 / /mnt/x rw,relatime
1241 1240 0:57 / /mnt/x/y rw,relatime
1244 1239 0:56 / /mnt/ppp rw,relatime
1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518

While it is not possible to unmount a part of that propagated
subtree (/mnt/ppp/y), it is possible to unmount the entire
tree, as shown by the following commands:

ns2# umount /mnt/ppp/y
umount: /mnt/ppp/y: not mounted.
ns2# umount -l /mnt/ppp # Succeeds...
ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
1239 1204 8:5 /mnt /mnt rw,relatime master:344
1240 1239 0:56 / /mnt/x rw,relatime
1241 1240 0:57 / /mnt/x/y rw,relatime
]]

?
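
The mountinfo listings in the proposed text can be read mechanically.
What follows is only a rough sketch, not part of the patch: a
hypothetical helper, propagation_types, that reports the propagation
type of each mount from its optional fields. It assumes the field
layout described in proc(5), where field 5 is the mount point and the
optional tags (shared:N, master:N, unbindable) start at field 7 and
end at a lone "-" when that separator is present.

```shell
# Hypothetical helper: classify each mount's propagation type from
# mountinfo-style lines read on standard input.
propagation_types() {
    awk '{
        shared = 0; slave = 0; unbindable = 0
        # Optional fields start at field 7 and stop at the "-" separator
        # (which the sed command used in the examples strips away).
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:/) shared = 1
            if ($i ~ /^master:/) slave = 1
            if ($i == "unbindable") unbindable = 1
        }
        type = "private"
        if (unbindable) type = "unbindable"
        if (slave) type = "slave"
        if (shared) type = "shared"
        if (shared && slave) type = "shared+slave"
        print $5, type
    }'
}

# Sample lines copied from the "ns2" listing in the example above.
propagation_types <<'EOF'
1239 1204 8:5 /mnt /mnt rw,relatime master:344
1244 1239 0:56 / /mnt/ppp rw,relatime
1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518
EOF
```

For the sample lines, this reports /mnt as slave, /mnt/ppp as private,
and /mnt/ppp/y as slave, which matches the reading given in the text.
On a live system the helper could be fed from /proc/self/mountinfo
directly.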

Thanks,

Michael

>
>> .IP *
>> The
>> .BR mount (2)
>> @@ -128,6 +184,23 @@ settings become locked
>> when propagated from a more privileged to
>> a less privileged mount namespace,
>> and may not be changed in the less privileged mount namespace.
>> +.IP
>> +This point can be illustrated by a continuation of the previous example.
>> +In that example, the bind mount was marked as read-only.
>> +For security reasons,
>> +it should not be possible to make the mount writable in
>> +a less privileged namespace, and indeed the kernel prevents this,
>> +as illustrated by the following:
>> +.IP
>> +.RS
>> +.in +4n
>> +.EX
>> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
>> + \fBmount \-o remount,rw /mnt/dir\fP
>> +mount: /mnt/dir: permission denied.
>> +.EE
>> +.in
>> +.RE
>> .IP *
>> .\" (As of 3.18-rc1 (in Al Viro's 2014-08-30 vfs.git#for-next tree))
>> A file or directory that is a mount point in one namespace that is not
>
> Eric
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2021-08-17 14:08:53

by Christian Brauner

Subject: Re: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"

On Tue, Aug 17, 2021 at 05:12:20AM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Eric,
>
> Thanks for your feedback!
>
> On 8/16/21 6:03 PM, Eric W. Biederman wrote:
> > Michael Kerrisk <[email protected]> writes:
> >
> >> For a long time, this manual page has had a brief discussion of
> >> "locked" mounts, without clearly saying what this concept is, or
> >> why it exists. Expand the discussion with an explanation of what
> >> locked mounts are, why mounts are locked, and some examples of the
> >> effect of locking.
> >>
> >> Thanks to Christian Brauner for a lot of help in understanding
> >> these details.
> >>
> >> Reported-by: Christian Brauner <[email protected]>
> >> Signed-off-by: Michael Kerrisk <[email protected]>
> >> ---
> >>
> >> Hello Eric and others,
> >>
> >> After some quite helpful info from Christian Brauner, I've expanded
> >> the discussion of locked mounts (a concept I didn't really have a
> >> good grasp on) in the mount_namespaces(7) manual page. I would be
> >> grateful to receive review comments, acks, etc., on the patch below.
> >> Could you take a look please?
> >>
> >> Cheers,
> >>
> >> Michael
> >>
> >> man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 73 insertions(+)
> >>
> >> diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
> >> index e3468bdb7..97427c9ea 100644
> >> --- a/man7/mount_namespaces.7
> >> +++ b/man7/mount_namespaces.7
> >> @@ -107,6 +107,62 @@ operation brings across all of the mounts from the original
> >> mount namespace as a single unit,
> >> and recursive mounts that propagate between
> >> mount namespaces propagate as a single unit.)
> >> +.IP
> >> +In this context, "may not be separated" means that the mounts
> >> +are locked so that they may not be individually unmounted.
> >> +Consider the following example:
> >> +.IP
> >> +.RS
> >> +.in +4n
> >> +.EX
> >> +$ \fBsudo mkdir /mnt/dir\fP
> >> +$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
> >> +$ \fBsudo mount \-\-bind -o ro /some/path /mnt/dir\fP
> >> +$ \fBls /mnt/dir\fP # Former contents of directory are invisible
> >
> > Do we want a more motivating example such as a /proc/sys?

It could be even better to use an example involving /etc/shadow, e.g.:

sudo mount --bind /etc /mnt
sudo mount --bind /dev/null /mnt/shadow

the procfs example might be a bit awkward (see below).

> >
> > It has been common to mount over /proc files and directories that can be
> > written to by the global root so that users in a mount namespace may not
> > touch them.
>
> Seems reasonable. But I want to check one thing. Can you please
> define "global root". I'm pretty sure I know what you mean, but
> I'd like to know your definition.

(global root == root in the initial user namespace.)

Some application container runtimes have a concept of "masked paths"
where they overmount certain directories they want to hide with an empty
tmpfs and some files they want to hide with /dev/null (see [1]).

But I don't think this is a great example because this overmounting is
mostly needed and done when you're running privileged containers (see [2]).

There's usually no point in overmounting parts of procfs that are
writable by global root. If you're running in an unprivileged container
userns root can't write to any of the files that only global root can.
Otherwise this would be a rather severe security issue.

There might be a use-case for overmounting files that contain global
information that are readable inside user namespaces but then one either
has to question why they are readable in the first place or why this
information needs to be hidden. Examples include /proc/kallsyms and
/proc/keys.

But overall the overmounting of procfs is most sensible when running
privileged containers or when sharing pid namespaces and procfs is
somehow bind-mounted from somewhere. But that means there's no user
namespace in play which means that the mounts aren't locked.

So if the container runtime has e.g. overmounted /proc/kcore with
/dev/null then the privileged container can unmount it. To protect
against this such privileged containers usually drop CAP_SYS_ADMIN.
So the protection here comes from dropping capabilities not from locking
mounts together. All of this makes this a bit of a confusing example.
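
To make the capability point concrete, here is a rough sketch (an
illustration only, not something any runtime is claimed to ship) of
checking whether a process still holds CAP_SYS_ADMIN. The helper name
has_cap_sys_admin is hypothetical. It takes a hexadecimal capability
mask in the form shown on the CapEff line of /proc/PID/status and
tests bit 21, which is the number assigned to CAP_SYS_ADMIN in
<linux/capability.h>.

```shell
# Hypothetical helper: print "yes" if the given hex capability mask
# (without a 0x prefix, as in the CapEff line of /proc/PID/status)
# includes CAP_SYS_ADMIN (capability number 21), otherwise "no".
has_cap_sys_admin() {
    mask=$1
    if [ $(( (0x$mask >> 21) & 1 )) -eq 1 ]; then
        echo yes
    else
        echo no
    fi
}

# Two sample masks, chosen for illustration: the first has bit 21 set,
# the second does not.
has_cap_sys_admin 000001ffffffffff
has_cap_sys_admin 00000000a80425fb
```

A process could check itself by passing the second field of its own
CapEff line, for example the value printed by
awk '/^CapEff:/ {print $2}' /proc/self/status.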

An example where locked mount protection is relied on heavily which I'm
involved in is systemd(-nspawn). All custom mounts a container gets such
as data shared from the host with the container are mounted in a separate
(privileged) mount namespace before the container workload is cloned.
The cloned container then gets a new mount + userns pair and hence, all
the mounts it inherited are now locked.

This way, you can e.g. share /etc with your container and just overmount
/etc/shadow with /dev/null or a custom /etc/shadow (Reason for my
example above.) without dropping capabilities that would prevent the
container from mounting.

So I'd suggest using a simple example. This is not about illustrating
what container runtimes do but what the behavior of a mount namespace
is. There's really no need to overcomplicate this.

[1]: https://github.com/moby/moby/blob/51b06c6795160d8a1ba05d05d6491df7588b2957/oci/defaults.go#L90
[2]: https://github.com/moby/moby/blob/51b06c6795160d8a1ba05d05d6491df7588b2957/oci/defaults.go#L110

>
> >> +.EE
> >> +.in
> >> +.RE
> >> +.IP
> >> +The above steps, performed in a more privileged user namespace,
> >> +have created a (read-only) bind mount that
> >> +obscures the contents of the directory
> >> +.IR /mnt/dir .
> >> +For security reasons, it should not be possible to unmount
> >> +that mount in a less privileged user namespace,
> >> +since that would reveal the contents of the directory
> >> +.IR /mnt/dir .
> >> +.IP
> >> +Suppose we now create a new mount namespace
> >> +owned by a (new) subordinate user namespace.
> >> +The new mount namespace will inherit copies of all of the mounts
> >> +from the previous mount namespace.
> >> +However, those mounts will be locked because the new mount namespace
> >> +is owned by a less privileged user namespace.
> >> +Consequently, an attempt to unmount the mount fails:
> >> +.IP
> >> +.RS
> >> +.in +4n
> >> +.EX
> >> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
> >> + \fBstrace \-o /tmp/log \e\fP
> >> + \fBumount /mnt/dir\fP
> >> +umount: /mnt/dir: not mounted.
> >> +$ \fBgrep \(aq^umount\(aq /tmp/log\fP
> >> +umount2("/mnt/dir", 0) = \-1 EINVAL (Invalid argument)
> >> +.EE
> >> +.in
> >> +.RE
> >> +.IP
> >> +The error message from
> >> +.BR mount (8)
> >> +is a little confusing, but the
> >> +.BR strace (1)
> >> +output reveals that the underlying
> >> +.BR umount2 (2)
> >> +system call failed with the error
> >> +.BR EINVAL ,
> >> +which is the error that the kernel returns to indicate that
> >> +the mount is locked.
> >
> > Do you want to mention that you can unmount the entire subtree? Either
> > with pivot_root if it is locked to "/" or with
> > "umount -l /path/to/propagated/directory".
>
> Yes, I wondered about that, but hadn't got round to devising
> the scenario. How about this:
>
> [[
> * Following on from the previous point, note that it is possible
> to unmount an entire tree of mounts that propagated as a unit
> into a mount namespace that is owned by a less privileged user
> namespace, as illustrated in the following example.
>
> First, we create new user and mount namespaces using
> unshare(1). In the new mount namespace, the propagation type
> of all mounts is set to private. We then create a shared bind
> mount at /mnt, and a small hierarchy of mount points underneath
> that mount point.
>
> $ PS1='ns1# ' sudo unshare --user --map-root-user \
> --mount --propagation private bash
> ns1# echo $$ # We need the PID of this shell later
> 778501
> ns1# mount --make-shared --bind /mnt /mnt
> ns1# mkdir /mnt/x
> ns1# mount --make-private -t tmpfs none /mnt/x
> ns1# mkdir /mnt/x/y
> ns1# mount --make-private -t tmpfs none /mnt/x/y
> ns1# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
> 986 83 8:5 /mnt /mnt rw,relatime shared:344
> 989 986 0:56 / /mnt/x rw,relatime
> 990 989 0:57 / /mnt/x/y rw,relatime
>
> Continuing in the same shell session, we then create a second
> shell in a new mount namespace and a new subordinate (and thus
> less privileged) user namespace and check the state of the
> propagated mount points rooted at /mnt.
>
> ns1# PS1='ns2# ' unshare --user --map-root-user \
> --mount --propagation unchanged bash
> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
> 1239 1204 8:5 /mnt /mnt rw,relatime master:344
> 1240 1239 0:56 / /mnt/x rw,relatime
> 1241 1240 0:57 / /mnt/x/y rw,relatime
>
> Of note in the above output is that the propagation type of the
> mount point /mnt has been reduced to slave, as explained near
> the start of this subsection. This means that submount events
> will propagate from the master /mnt in "ns1", but propagation
> will not occur in the opposite direction.
>
> From a separate terminal window, we then use nsenter(1) to
> enter the mount and user namespaces corresponding to "ns1". In
> that terminal window, we then recursively bind mount /mnt/x at
> the location /mnt/ppp.
>
> $ PS1='ns3# ' sudo nsenter -t 778501 --user --mount
> ns3# mount --rbind --make-private /mnt/x /mnt/ppp
> ns3# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
> 986 83 8:5 /mnt /mnt rw,relatime shared:344
> 989 986 0:56 / /mnt/x rw,relatime
> 990 989 0:57 / /mnt/x/y rw,relatime
> 1242 986 0:56 / /mnt/ppp rw,relatime
> 1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518
>
> Because the propagation type of the parent mount, /mnt, was
> shared, the recursive bind mount propagated a small tree of
> mounts under the slave mount /mnt into "ns2", as can be
> verified by executing the following command in that shell
> session:
>
> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
> 1239 1204 8:5 /mnt /mnt rw,relatime master:344
> 1240 1239 0:56 / /mnt/x rw,relatime
> 1241 1240 0:57 / /mnt/x/y rw,relatime
> 1244 1239 0:56 / /mnt/ppp rw,relatime
> 1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518
>
> While it is not possible to unmount a part of that propagated
> subtree (/mnt/ppp/y), it is possible to unmount the entire
> tree, as shown by the following commands:
>
> ns2# umount /mnt/ppp/y
> umount: /mnt/ppp/y: not mounted.
> ns2# umount -l /mnt/ppp # Succeeds...
> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
> 1239 1204 8:5 /mnt /mnt rw,relatime master:344
> 1240 1239 0:56 / /mnt/x rw,relatime
> 1241 1240 0:57 / /mnt/x/y rw,relatime
> ]]
>
> ?

I'd just add a note that mounts which propagated locked together as a
unit can be unmounted as a unit (which is intuitive but may need to be
spelled out). But I'd leave out this lengthy example, as it makes the
manpage pretty convoluted.

Christian

2021-08-17 15:57:34

by Eric W. Biederman

Subject: Re: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"

"Michael Kerrisk (man-pages)" <[email protected]> writes:

> Hi Eric,
>
> Thanks for your feedback!
>
> On 8/16/21 6:03 PM, Eric W. Biederman wrote:
>> Michael Kerrisk <[email protected]> writes:
>>
>>> For a long time, this manual page has had a brief discussion of
>>> "locked" mounts, without clearly saying what this concept is, or
>>> why it exists. Expand the discussion with an explanation of what
>>> locked mounts are, why mounts are locked, and some examples of the
>>> effect of locking.
>>>
>>> Thanks to Christian Brauner for a lot of help in understanding
>>> these details.
>>>
>>> Reported-by: Christian Brauner <[email protected]>
>>> Signed-off-by: Michael Kerrisk <[email protected]>
>>> ---
>>>
>>> Hello Eric and others,
>>>
>>> After some quite helpful info from Christian Brauner, I've expanded
>>> the discussion of locked mounts (a concept I didn't really have a
>>> good grasp on) in the mount_namespaces(7) manual page. I would be
>>> grateful to receive review comments, acks, etc., on the patch below.
>>> Could you take a look please?
>>>
>>> Cheers,
>>>
>>> Michael
>>>
>>> man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 73 insertions(+)
>>>
>>> diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
>>> index e3468bdb7..97427c9ea 100644
>>> --- a/man7/mount_namespaces.7
>>> +++ b/man7/mount_namespaces.7
>>> @@ -107,6 +107,62 @@ operation brings across all of the mounts from the original
>>> mount namespace as a single unit,
>>> and recursive mounts that propagate between
>>> mount namespaces propagate as a single unit.)
>>> +.IP
>>> +In this context, "may not be separated" means that the mounts
>>> +are locked so that they may not be individually unmounted.
>>> +Consider the following example:
>>> +.IP
>>> +.RS
>>> +.in +4n
>>> +.EX
>>> +$ \fBsudo mkdir /mnt/dir\fP
>>> +$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
>>> +$ \fBsudo mount \-\-bind -o ro /some/path /mnt/dir\fP
>>> +$ \fBls /mnt/dir\fP # Former contents of directory are invisible
>>
>> Do we want a more motivating example such as a /proc/sys?
>>
>> It has been common to mount over /proc files and directories that can be
>> written to by the global root so that users in a mount namespace may not
>> touch them.
>
> Seems reasonable. But I want to check one thing. Can you please
> define "global root". I'm pretty sure I know what you mean, but
> I'd like to know your definition.

I mean uid 0 in the initial user namespace.
This uid owns most of the files in /proc.

Container systems that don't want to use user namespaces frequently
mount over files in proc to prevent using some of the root privileges
that come simply by having uid 0.

Another use is mounting over files on virtual filesystems like proc
to reduce the attack surface.

For reducing what the root user in a container can do, I think using user
namespaces and using a uid other than 0 in the initial user namespace
is the better approach.
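
As a rough illustration of the uid point, the sketch below reads
uid_map-style lines (ID-inside, ID-outside, length, as described in
user_namespaces(7)) and prints the outside uid that uid 0 inside the
namespace maps to. The helper name outer_uid_of_root is hypothetical
and the sample mappings are made up for the example.

```shell
# Hypothetical helper: read uid_map-style lines on standard input and
# print the outside uid corresponding to inside uid 0, or "unmapped"
# if no range covers uid 0.
outer_uid_of_root() {
    awk '$1 <= 0 && 0 < $1 + $3 { print $2 - $1; found = 1; exit }
         END { if (!found) print "unmapped" }'
}

# Identity mapping, as seen in the initial user namespace.
echo "0 0 4294967295" | outer_uid_of_root

# A namespace whose root is uid 1000 outside (illustrative values).
echo "0 1000 1" | outer_uid_of_root
```

Run as outer_uid_of_root < /proc/self/uid_map, the result is expressed
relative to the parent user namespace, so a nonzero answer means that
"root" in the namespace is an unprivileged uid outside it. A result of
0 identifies global root only when the parent is the initial user
namespace.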


>>> +.EE
>>> +.in
>>> +.RE
>>> +.IP
>>> +The above steps, performed in a more privileged user namespace,
>>> +have created a (read-only) bind mount that
>>> +obscures the contents of the directory
>>> +.IR /mnt/dir .
>>> +For security reasons, it should not be possible to unmount
>>> +that mount in a less privileged user namespace,
>>> +since that would reveal the contents of the directory
>>> +.IR /mnt/dir .
>>> +.IP
>>> +Suppose we now create a new mount namespace
>>> +owned by a (new) subordinate user namespace.
>>> +The new mount namespace will inherit copies of all of the mounts
>>> +from the previous mount namespace.
>>> +However, those mounts will be locked because the new mount namespace
>>> +is owned by a less privileged user namespace.
>>> +Consequently, an attempt to unmount the mount fails:
>>> +.IP
>>> +.RS
>>> +.in +4n
>>> +.EX
>>> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
>>> + \fBstrace \-o /tmp/log \e\fP
>>> + \fBumount /mnt/dir\fP
>>> +umount: /mnt/dir: not mounted.
>>> +$ \fBgrep \(aq^umount\(aq /tmp/log\fP
>>> +umount2("/mnt/dir", 0) = \-1 EINVAL (Invalid argument)
>>> +.EE
>>> +.in
>>> +.RE
>>> +.IP
>>> +The error message from
>>> +.BR umount (8)
>>> +is a little confusing, but the
>>> +.BR strace (1)
>>> +output reveals that the underlying
>>> +.BR umount2 (2)
>>> +system call failed with the error
>>> +.BR EINVAL ,
>>> +which is the error that the kernel returns to indicate that
>>> +the mount is locked.
>>
>> Do you want to mention that you can unmount the entire subtree? Either
>> with pivot_root if it is locked to "/" or with
>> "umount -l /path/to/propagated/directory".
>
> Yes, I wondered about that, but hadn't got round to devising
> the scenario. How about this:
>
> [[
> * Following on from the previous point, note that it is possible
> to unmount an entire tree of mounts that propagated as a unit
^^^^^ subtree?
> into a mount namespace that is owned by a less privileged user
> namespace, as illustrated in the following example.

>
> First, we create new user and mount namespaces using
> unshare(1). In the new mount namespace, the propagation type
> of all mounts is set to private. We then create a shared bind
> mount at /mnt, and a small hierarchy of mount points underneath
> that mount point.
>
> $ PS1='ns1# ' sudo unshare --user --map-root-user \
> --mount --propagation private bash
> ns1# echo $$ # We need the PID of this shell later
> 778501
> ns1# mount --make-shared --bind /mnt /mnt
> ns1# mkdir /mnt/x
> ns1# mount --make-private -t tmpfs none /mnt/x
> ns1# mkdir /mnt/x/y
> ns1# mount --make-private -t tmpfs none /mnt/x/y
> ns1# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
> 986 83 8:5 /mnt /mnt rw,relatime shared:344
> 989 986 0:56 / /mnt/x rw,relatime
> 990 989 0:57 / /mnt/x/y rw,relatime
>
> Continuing in the same shell session, we then create a second
> shell in a new mount namespace and a new subordinate (and thus
> less privileged) user namespace and check the state of the
> propagated mount points rooted at /mnt.
>
> ns1# PS1='ns2# ' unshare --user --map-root-user \
> --mount --propagation unchanged bash
> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
> 1239 1204 8:5 /mnt /mnt rw,relatime master:344
> 1240 1239 0:56 / /mnt/x rw,relatime
> 1241 1240 0:57 / /mnt/x/y rw,relatime
>
> Of note in the above output is that the propagation type of the
> mount point /mnt has been reduced to slave, as explained near
> the start of this subsection. This means that submount events
> will propagate from the master /mnt in "ns1", but propagation
> will not occur in the opposite direction.
>
> From a separate terminal window, we then use nsenter(1) to
> enter the mount and user namespaces corresponding to "ns1". In
> that terminal window, we then recursively bind mount /mnt/x at
> the location /mnt/ppp.
>
> $ PS1='ns3# ' sudo nsenter -t 778501 --user --mount
> ns3# mount --rbind --make-private /mnt/x /mnt/ppp
> ns3# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
> 986 83 8:5 /mnt /mnt rw,relatime shared:344
> 989 986 0:56 / /mnt/x rw,relatime
> 990 989 0:57 / /mnt/x/y rw,relatime
> 1242 986 0:56 / /mnt/ppp rw,relatime
> 1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518
>
> Because the propagation type of the parent mount, /mnt, was
> shared, the recursive bind mount propagated a small tree of
> mounts under the slave mount /mnt into "ns2", as can be
> verified by executing the following command in that shell
> session:
>
> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
> 1239 1204 8:5 /mnt /mnt rw,relatime master:344
> 1240 1239 0:56 / /mnt/x rw,relatime
> 1241 1240 0:57 / /mnt/x/y rw,relatime
> 1244 1239 0:56 / /mnt/ppp rw,relatime
> 1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518
>
> While it is not possible to unmount a part of that propagated
> subtree (/mnt/ppp/y), it is possible to unmount the entire
> tree, as shown by the following commands:
>
> ns2# umount /mnt/ppp/y
> umount: /mnt/ppp/y: not mounted.
> ns2# umount -l /mnt/ppp # Succeeds...
> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
> 1239 1204 8:5 /mnt /mnt rw,relatime master:344
> 1240 1239 0:56 / /mnt/x rw,relatime
> 1241 1240 0:57 / /mnt/x/y rw,relatime
> ]]
>
> ?

Yes.

It is worth noting that in ns2 it is also possible to mount on top of
/mnt/ppp/y and umount from /mnt/ppp/y.
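
A minimal sketch of how one might confirm that in the "ns2" shell, assuming
the session from the example above is still running. The helper mounts_at
is hypothetical (it is not part of util-linux); it lists the IDs of the
mounts stacked at a given mount point by matching field 5 of
mountinfo-format input.

```shell
# mounts_at is a hypothetical helper: print the mount ID (field 1) of
# every mountinfo line on stdin whose mount point (field 5) equals $1.
mounts_at() {
    awk -v mp="$1" '$5 == mp { print $1 }'
}

# Assumed usage in the "ns2" shell (needs the setup from the example):
#   mount -t tmpfs none /mnt/ppp/y               # stack a mount on top
#   mounts_at /mnt/ppp/y < /proc/self/mountinfo  # should now list two IDs
#   umount /mnt/ppp/y                            # removes only the top mount
#   mounts_at /mnt/ppp/y < /proc/self/mountinfo  # the propagated mount remains
```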


Eric

Subject: Re: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"

Hello Eric,

Thank you for your response.

On 8/17/21 5:51 PM, Eric W. Biederman wrote:
> "Michael Kerrisk (man-pages)" <[email protected]> writes:
>
>> Hi Eric,
>>
>> Thanks for your feedback!
>>
>> On 8/16/21 6:03 PM, Eric W. Biederman wrote:
>>> Michael Kerrisk <[email protected]> writes:
>>>
>>>> For a long time, this manual page has had a brief discussion of
>>>> "locked" mounts, without clearly saying what this concept is, or
>>>> why it exists. Expand the discussion with an explanation of what
>>>> locked mounts are, why mounts are locked, and some examples of the
>>>> effect of locking.
>>>>
>>>> Thanks to Christian Brauner for a lot of help in understanding
>>>> these details.
>>>>
>>>> Reported-by: Christian Brauner <[email protected]>
>>>> Signed-off-by: Michael Kerrisk <[email protected]>
>>>> ---
>>>>
>>>> Hello Eric and others,
>>>>
>>>> After some quite helpful info from Christian Brauner, I've expanded
>>>> the discussion of locked mounts (a concept I didn't really have a
>>>> good grasp on) in the mount_namespaces(7) manual page. I would be
>>>> grateful to receive review comments, acks, etc., on the patch below.
>>>> Could you take a look please?
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>>
>>>> man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 73 insertions(+)
>>>>
>>>> diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
>>>> index e3468bdb7..97427c9ea 100644
>>>> --- a/man7/mount_namespaces.7
>>>> +++ b/man7/mount_namespaces.7
>>>> @@ -107,6 +107,62 @@ operation brings across all of the mounts from the original
>>>> mount namespace as a single unit,
>>>> and recursive mounts that propagate between
>>>> mount namespaces propagate as a single unit.)
>>>> +.IP
>>>> +In this context, "may not be separated" means that the mounts
>>>> +are locked so that they may not be individually unmounted.
>>>> +Consider the following example:
>>>> +.IP
>>>> +.RS
>>>> +.in +4n
>>>> +.EX
>>>> +$ \fBsudo mkdir /mnt/dir\fP
>>>> +$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
>>>> +$ \fBsudo mount \-\-bind -o ro /some/path /mnt/dir\fP
>>>> +$ \fBls /mnt/dir\fP # Former contents of directory are invisible
>>>
>>> Do we want a more motivating example such as /proc/sys?
>>>
>>> It has been common to mount over /proc files and directories that can be
>>> written to by the global root so that users in a mount namespace may not
>>> touch them.
>>
>> Seems reasonable. But I want to check one thing. Can you please
>> define "global root". I'm pretty sure I know what you mean, but
>> I'd like to know your definition.
>
> I mean uid 0 in the initial user namespace.

(Good. That's what I thought you meant. So far, that term is not
described in the manual pages. I just now added a definition of the
term to user_namespaces(7).)

> This uid owns most of the files in /proc.
>
> Container systems that don't want to use user namespaces frequently
> mount over files in proc to prevent using some of the root privileges
> that come simply by having uid 0.
>
> Another use is mounting over files on virtual filesystems like proc
> to reduce the attack surface.

Thanks for the background. I think for the moment I will go with
Christian's alternative suggestion (an example using /etc/shadow).

> For reducing what the root user in a container can do, I think using user
> namespaces and a uid other than 0 in the initial user namespace is the
> better approach.
>
>
>>>> +.EE
>>>> +.in
>>>> +.RE
>>>> +.IP
>>>> +The above steps, performed in a more privileged user namespace,
>>>> +have created a (read-only) bind mount that
>>>> +obscures the contents of the directory
>>>> +.IR /mnt/dir .
>>>> +For security reasons, it should not be possible to unmount
>>>> +that mount in a less privileged user namespace,
>>>> +since that would reveal the contents of the directory
>>>> +.IR /mnt/dir .
>>>> +.IP
>>>> +Suppose we now create a new mount namespace
>>>> +owned by a (new) subordinate user namespace.
>>>> +The new mount namespace will inherit copies of all of the mounts
>>>> +from the previous mount namespace.
>>>> +However, those mounts will be locked because the new mount namespace
>>>> +is owned by a less privileged user namespace.
>>>> +Consequently, an attempt to unmount the mount fails:
>>>> +.IP
>>>> +.RS
>>>> +.in +4n
>>>> +.EX
>>>> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
>>>> + \fBstrace \-o /tmp/log \e\fP
>>>> + \fBumount /mnt/dir\fP
>>>> +umount: /mnt/dir: not mounted.
>>>> +$ \fBgrep \(aq^umount\(aq /tmp/log\fP
>>>> +umount2("/mnt/dir", 0) = \-1 EINVAL (Invalid argument)
>>>> +.EE
>>>> +.in
>>>> +.RE
>>>> +.IP
>>>> +The error message from
>>>> +.BR umount (8)
>>>> +is a little confusing, but the
>>>> +.BR strace (1)
>>>> +output reveals that the underlying
>>>> +.BR umount2 (2)
>>>> +system call failed with the error
>>>> +.BR EINVAL ,
>>>> +which is the error that the kernel returns to indicate that
>>>> +the mount is locked.
>>>
>>> Do you want to mention that you can unmount the entire subtree? Either
>>> with pivot_root if it is locked to "/" or with
>>> "umount -l /path/to/propagated/directory".
>>
>> Yes, I wondered about that, but hadn't got round to devising
>> the scenario. How about this:
>>
>> [[
>> * Following on from the previous point, note that it is possible
>> to unmount an entire tree of mounts that propagated as a unit
> ^^^^^ subtree?

Yes, probably better, to prevent misunderstandings. Changed (and in a few
other places also).

>> into a mount namespace that is owned by a less privileged user
>> namespace, as illustrated in the following example.
>
>>
>> First, we create new user and mount namespaces using
>> unshare(1). In the new mount namespace, the propagation type
>> of all mounts is set to private. We then create a shared bind
>> mount at /mnt, and a small hierarchy of mount points underneath
>> that mount point.
>>
>> $ PS1='ns1# ' sudo unshare --user --map-root-user \
>> --mount --propagation private bash
>> ns1# echo $$ # We need the PID of this shell later
>> 778501
>> ns1# mount --make-shared --bind /mnt /mnt
>> ns1# mkdir /mnt/x
>> ns1# mount --make-private -t tmpfs none /mnt/x
>> ns1# mkdir /mnt/x/y
>> ns1# mount --make-private -t tmpfs none /mnt/x/y
>> ns1# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>> 986 83 8:5 /mnt /mnt rw,relatime shared:344
>> 989 986 0:56 / /mnt/x rw,relatime
>> 990 989 0:57 / /mnt/x/y rw,relatime
>>
>> Continuing in the same shell session, we then create a second
>> shell in a new mount namespace and a new subordinate (and thus
>> less privileged) user namespace and check the state of the
>> propagated mount points rooted at /mnt.
>>
>> ns1# PS1='ns2# ' unshare --user --map-root-user \
>> --mount --propagation unchanged bash
>> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>> 1239 1204 8:5 /mnt /mnt rw,relatime master:344
>> 1240 1239 0:56 / /mnt/x rw,relatime
>> 1241 1240 0:57 / /mnt/x/y rw,relatime
>>
>> Of note in the above output is that the propagation type of the
>> mount point /mnt has been reduced to slave, as explained near
>> the start of this subsection. This means that submount events
>> will propagate from the master /mnt in "ns1", but propagation
>> will not occur in the opposite direction.
>>
>> From a separate terminal window, we then use nsenter(1) to
>> enter the mount and user namespaces corresponding to "ns1". In
>> that terminal window, we then recursively bind mount /mnt/x at
>> the location /mnt/ppp.
>>
>> $ PS1='ns3# ' sudo nsenter -t 778501 --user --mount
>> ns3# mount --rbind --make-private /mnt/x /mnt/ppp
>> ns3# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>> 986 83 8:5 /mnt /mnt rw,relatime shared:344
>> 989 986 0:56 / /mnt/x rw,relatime
>> 990 989 0:57 / /mnt/x/y rw,relatime
>> 1242 986 0:56 / /mnt/ppp rw,relatime
>> 1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518
>>
>> Because the propagation type of the parent mount, /mnt, was
>> shared, the recursive bind mount propagated a small tree of
>> mounts under the slave mount /mnt into "ns2", as can be
>> verified by executing the following command in that shell
>> session:
>>
>> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>> 1239 1204 8:5 /mnt /mnt rw,relatime master:344
>> 1240 1239 0:56 / /mnt/x rw,relatime
>> 1241 1240 0:57 / /mnt/x/y rw,relatime
>> 1244 1239 0:56 / /mnt/ppp rw,relatime
>> 1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518
>>
>> While it is not possible to unmount a part of that propagated
>> subtree (/mnt/ppp/y), it is possible to unmount the entire
>> tree, as shown by the following commands:
>>
>> ns2# umount /mnt/ppp/y
>> umount: /mnt/ppp/y: not mounted.
>> ns2# umount -l /mnt/ppp # Succeeds...
>> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>> 1239 1204 8:5 /mnt /mnt rw,relatime master:344
>> 1240 1239 0:56 / /mnt/x rw,relatime
>> 1241 1240 0:57 / /mnt/x/y rw,relatime
>> ]]
>>
>> ?
>
> Yes.
>
> It is worth noting that in ns2 it is also possible to mount on top of
> /mnt/ppp/y and umount from /mnt/ppp/y.

Yes, good point. I've added some text, and an example for that case.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"

Hi Christian,

On 8/17/21 4:06 PM, Christian Brauner wrote:
> On Tue, Aug 17, 2021 at 05:12:20AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Eric,
>>
>> Thanks for your feedback!
>>
>> On 8/16/21 6:03 PM, Eric W. Biederman wrote:
>>> Michael Kerrisk <[email protected]> writes:
>>>
>>>> For a long time, this manual page has had a brief discussion of
>>>> "locked" mounts, without clearly saying what this concept is, or
>>>> why it exists. Expand the discussion with an explanation of what
>>>> locked mounts are, why mounts are locked, and some examples of the
>>>> effect of locking.
>>>>
>>>> Thanks to Christian Brauner for a lot of help in understanding
>>>> these details.
>>>>
>>>> Reported-by: Christian Brauner <[email protected]>
>>>> Signed-off-by: Michael Kerrisk <[email protected]>
>>>> ---
>>>>
>>>> Hello Eric and others,
>>>>
>>>> After some quite helpful info from Christian Brauner, I've expanded
>>>> the discussion of locked mounts (a concept I didn't really have a
>>>> good grasp on) in the mount_namespaces(7) manual page. I would be
>>>> grateful to receive review comments, acks, etc., on the patch below.
>>>> Could you take a look please?
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>>
>>>> man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 73 insertions(+)
>>>>
>>>> diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
>>>> index e3468bdb7..97427c9ea 100644
>>>> --- a/man7/mount_namespaces.7
>>>> +++ b/man7/mount_namespaces.7
>>>> @@ -107,6 +107,62 @@ operation brings across all of the mounts from the original
>>>> mount namespace as a single unit,
>>>> and recursive mounts that propagate between
>>>> mount namespaces propagate as a single unit.)
>>>> +.IP
>>>> +In this context, "may not be separated" means that the mounts
>>>> +are locked so that they may not be individually unmounted.
>>>> +Consider the following example:
>>>> +.IP
>>>> +.RS
>>>> +.in +4n
>>>> +.EX
>>>> +$ \fBsudo mkdir /mnt/dir\fP
>>>> +$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
>>>> +$ \fBsudo mount \-\-bind -o ro /some/path /mnt/dir\fP
>>>> +$ \fBls /mnt/dir\fP # Former contents of directory are invisible
>>>
>>> Do we want a more motivating example such as /proc/sys?
>
> Could even be better to use an example involving /etc/shadow, e.g.:
>
> sudo mount --bind /etc /mnt
> sudo mount --bind /dev/null /mnt/shadow

Nice! I've rewritten the example to use /etc/shadow
instead of a bind-mounted directory at /mnt/dir.
Thanks!
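
For reference, a sketch of what such an example might look like, combining
the two bind mounts suggested above with the unshare/strace invocation from
the patch. The function names demo_locked_shadow and umount_errno are
illustrative assumptions, and the demo is only defined, not run, because it
needs root and changes mounts under /mnt.

```shell
# Illustrative only: defined but not invoked, since it requires root.
demo_locked_shadow() {
    sudo mount --bind /etc /mnt
    sudo mount --bind /dev/null /mnt/shadow
    # In a new user+mount namespace the inherited mounts should be locked,
    # so this umount is expected to fail.
    sudo unshare --user --map-root-user --mount \
        strace -o /tmp/log umount /mnt/shadow
    umount_errno < /tmp/log
}

# Extract the errno name from a failed umount2() line in strace output.
umount_errno() {
    sed -n 's/^umount2(.*) = -1 \([A-Z]*\) .*/\1/p'
}
```

If the mount is locked as described in the patch, the errno name reported
would be the same EINVAL shown in the strace output there.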

> the procfs example might be a bit awkward (see below).

Okay.

>>> It has been common to mount over /proc files and directories that can be
>>> written to by the global root so that users in a mount namespace may not
>>> touch them.
>>
>> Seems reasonable. But I want to check one thing. Can you please
>> define "global root". I'm pretty sure I know what you mean, but
>> I'd like to know your definition.
>
> (global root == root in the initial user namespace.)

(As noted in the mail to Eric, I've added this definition to
user_namespaces(7).)

> Some application container runtimes have a concept of "masked paths"
> where they overmount certain directories they want to hide with an empty
> tmpfs and some files they want to hide with /dev/null (see [1]).
>
> But I don't think this is a great example because this overmounting is
> mostly needed and done when you're running privileged containers (see [2]).
>
> There's usually no point in overmounting parts of procfs that are
> writable by global root. If you're running in an unprivileged container
> userns root can't write to any of the files that only global root can.
> Otherwise this would be a rather severe security issue.
>
> There might be a use-case for overmounting files that contain global
> information that are readable inside user namespaces but then one either
> has to question why they are readable in the first place or why this
> information needs to be hidden. Examples include /proc/kallsyms and
> /proc/keys.
>
> But overall the overmounting of procfs is most sensible when running
> privileged containers or when sharing pid namespaces and procfs is
> somehow bind-mounted from somewhere. But that means there's no user
> namespace in play which means that the mounts aren't locked.
>
> So if the container runtime has e.g. overmounted /proc/kcore with
> /dev/null then the privileged container can unmount it. To protect
> against this such privileged containers usually drop CAP_SYS_ADMIN.
> So the protection here comes from dropping capabilities not from locking
> mounts together. All of this makes this a bit of a confusing example.
>
> An example where locked mount protection is relied on heavily which I'm
> involved in is systemd(-nspawn). All custom mounts a container gets such
> as data shared from the host with the container are mounted in a separate
> (privileged) mount namespace before the container workload is cloned.
> The cloned container then gets a new mount + userns pair and hence, all
> the mounts it inherited are now locked.
>
> This way, you can e.g. share /etc with your container and just overmount
> /etc/shadow with /dev/null or a custom /etc/shadow (Reason for my
> example above.) without dropping capabilities that would prevent the
> container from mounting.
>
> So I'd suggest using a simple example. This is not about illustrating
> what container runtimes do but what the behavior of a mount namespace
> is. There's really no need to overcomplicate this.

Thanks for the detailed explanation. As noted above, I've rewritten
the example to use /etc/shadow.

> [1]: https://github.com/moby/moby/blob/51b06c6795160d8a1ba05d05d6491df7588b2957/oci/defaults.go#L90
> [2]: https://github.com/moby/moby/blob/51b06c6795160d8a1ba05d05d6491df7588b2957/oci/defaults.go#L110
>
>>
>>>> +.EE
>>>> +.in
>>>> +.RE
>>>> +.IP
>>>> +The above steps, performed in a more privileged user namespace,
>>>> +have created a (read-only) bind mount that
>>>> +obscures the contents of the directory
>>>> +.IR /mnt/dir .
>>>> +For security reasons, it should not be possible to unmount
>>>> +that mount in a less privileged user namespace,
>>>> +since that would reveal the contents of the directory
>>>> +.IR /mnt/dir .
>>>> +.IP
>>>> +Suppose we now create a new mount namespace
>>>> +owned by a (new) subordinate user namespace.
>>>> +The new mount namespace will inherit copies of all of the mounts
>>>> +from the previous mount namespace.
>>>> +However, those mounts will be locked because the new mount namespace
>>>> +is owned by a less privileged user namespace.
>>>> +Consequently, an attempt to unmount the mount fails:
>>>> +.IP
>>>> +.RS
>>>> +.in +4n
>>>> +.EX
>>>> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
>>>> + \fBstrace \-o /tmp/log \e\fP
>>>> + \fBumount /mnt/dir\fP
>>>> +umount: /mnt/dir: not mounted.
>>>> +$ \fBgrep \(aq^umount\(aq /tmp/log\fP
>>>> +umount2("/mnt/dir", 0) = \-1 EINVAL (Invalid argument)
>>>> +.EE
>>>> +.in
>>>> +.RE
>>>> +.IP
>>>> +The error message from
>>>> +.BR umount (8)
>>>> +is a little confusing, but the
>>>> +.BR strace (1)
>>>> +output reveals that the underlying
>>>> +.BR umount2 (2)
>>>> +system call failed with the error
>>>> +.BR EINVAL ,
>>>> +which is the error that the kernel returns to indicate that
>>>> +the mount is locked.
>>>
>>> Do you want to mention that you can unmount the entire subtree? Either
>>> with pivot_root if it is locked to "/" or with
>>> "umount -l /path/to/propagated/directory".
>>
>> Yes, I wondered about that, but hadn't got round to devising
>> the scenario. How about this:
>>
>> [[
>> * Following on from the previous point, note that it is possible
>> to unmount an entire tree of mounts that propagated as a unit
>> into a mount namespace that is owned by a less privileged user
>> namespace, as illustrated in the following example.
>>
>> First, we create new user and mount namespaces using
>> unshare(1). In the new mount namespace, the propagation type
>> of all mounts is set to private. We then create a shared bind
>> mount at /mnt, and a small hierarchy of mount points underneath
>> that mount point.
>>
>> $ PS1='ns1# ' sudo unshare --user --map-root-user \
>> --mount --propagation private bash
>> ns1# echo $$ # We need the PID of this shell later
>> 778501
>> ns1# mount --make-shared --bind /mnt /mnt
>> ns1# mkdir /mnt/x
>> ns1# mount --make-private -t tmpfs none /mnt/x
>> ns1# mkdir /mnt/x/y
>> ns1# mount --make-private -t tmpfs none /mnt/x/y
>> ns1# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>> 986 83 8:5 /mnt /mnt rw,relatime shared:344
>> 989 986 0:56 / /mnt/x rw,relatime
>> 990 989 0:57 / /mnt/x/y rw,relatime
>>
>> Continuing in the same shell session, we then create a second
>> shell in a new mount namespace and a new subordinate (and thus
>> less privileged) user namespace and check the state of the
>> propagated mount points rooted at /mnt.
>>
>> ns1# PS1='ns2# ' unshare --user --map-root-user \
>> --mount --propagation unchanged bash
>> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>> 1239 1204 8:5 /mnt /mnt rw,relatime master:344
>> 1240 1239 0:56 / /mnt/x rw,relatime
>> 1241 1240 0:57 / /mnt/x/y rw,relatime
>>
>> Of note in the above output is that the propagation type of the
>> mount point /mnt has been reduced to slave, as explained near
>> the start of this subsection. This means that submount events
>> will propagate from the master /mnt in "ns1", but propagation
>> will not occur in the opposite direction.
>>
>> From a separate terminal window, we then use nsenter(1) to
>> enter the mount and user namespaces corresponding to "ns1". In
>> that terminal window, we then recursively bind mount /mnt/x at
>> the location /mnt/ppp.
>>
>> $ PS1='ns3# ' sudo nsenter -t 778501 --user --mount
>> ns3# mount --rbind --make-private /mnt/x /mnt/ppp
>> ns3# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>> 986 83 8:5 /mnt /mnt rw,relatime shared:344
>> 989 986 0:56 / /mnt/x rw,relatime
>> 990 989 0:57 / /mnt/x/y rw,relatime
>> 1242 986 0:56 / /mnt/ppp rw,relatime
>> 1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518
>>
>> Because the propagation type of the parent mount, /mnt, was
>> shared, the recursive bind mount propagated a small tree of
>> mounts under the slave mount /mnt into "ns2", as can be
>> verified by executing the following command in that shell
>> session:
>>
>> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>> 1239 1204 8:5 /mnt /mnt rw,relatime master:344
>> 1240 1239 0:56 / /mnt/x rw,relatime
>> 1241 1240 0:57 / /mnt/x/y rw,relatime
>> 1244 1239 0:56 / /mnt/ppp rw,relatime
>> 1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518
>>
>> While it is not possible to unmount a part of that propagated
>> subtree (/mnt/ppp/y), it is possible to unmount the entire
>> tree, as shown by the following commands:
>>
>> ns2# umount /mnt/ppp/y
>> umount: /mnt/ppp/y: not mounted.
>> ns2# umount -l /mnt/ppp # Succeeds...
>> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>> 1239 1204 8:5 /mnt /mnt rw,relatime master:344
>> 1240 1239 0:56 / /mnt/x rw,relatime
>> 1241 1240 0:57 / /mnt/x/y rw,relatime
>> ]]
>>
>> ?
>
> I'd just add a note about mounts that propagated locked together as a unit
> as being unmountable as a unit (which is intuitive but may need to be
> spelled out). But I'd leave out this lengthy example as it makes the
> manpage pretty convoluted.

Christian, I do sympathize with this point of view, and I hesitated
about adding this much text to the page. But, on the other hand:

* Many of the pages in section 7 are intended to provide "the big
picture" of how things work.
* Mount namespaces are complex and (I think) generally poorly
understood. So let's help people as much as we can.
* I had already relocated this whole subsection to the end of
the page, so it is less obtrusive.

In summary, I'm inclined to keep the text, but thank you for
voicing your (mild) objection.

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/