2014-10-07 12:13:54

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

From: Andrey Vagin <[email protected]>

Currently when we create a new container with a separate root,
we need to clone the current mount namespace with all mounts and then
clean up it by using pivot_root(). A big part of mountpoints are cloned
only to be umounted.

Another problem is that rootfs can't be hidden from a container, because
rootfs can't be moved or umounted.

Here is an example how to get access to rootfs:
fd = open("/proc/self/ns/mnt", O_RDONLY)
umount2("/", MNT_DETACH);
setns(fd, CLONE_NEWNS)

rootfs may contain data, which should not be avaliable in CT-s.

I suggest to add ability to create a mount namespace with specified
mount points. A current task root can be used as a root for the new
mount namespace.

With this patch you can call chroot(ct->rootfs) and
unshare(UNSHARE_NEWNS2) to get a clean mount namespace.

UNSHARE_NEWNS2 can be used only with the unshare() syscall. The clone()
syscall doesn't have unused flags.

Here is an example how it looks like:
$ cat ../../unshare.c

int main(int argc, char **argv)
{
if (unshare(UNSHARE_NEWNS2))
return 1;

execl("/bin/bash", "/bin/bash", NULL);
return 1;
}
$ mount --bind test/ubuntu/ test/ubuntu/
$ cd test/ubuntu/
$ chroot .
$ ./unshare2
$ mount -t proc proc proc
$ cat /proc/self/mountinfo
55 55 252:1 /home/avagin/test/ubuntu / rw,relatime - ext4 /dev/disk/by-uuid/d672b85f-533c-4868-9609-ca80be52d3c6 rw,errors=remount-ro,data=ordered
56 55 0:3 / /proc rw,relatime - proc proc rw

Cc: Alexander Viro <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Rob Landley <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
---
fs/namespace.c | 16 ++++++++++++++--
include/uapi/linux/sched.h | 8 ++++++++
kernel/fork.c | 11 ++++++++---
kernel/nsproxy.c | 2 +-
4 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 730c50e..f50a848 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2569,12 +2569,24 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns,

BUG_ON(!ns);

- if (likely(!(flags & CLONE_NEWNS))) {
+ if (likely(!(flags & (CLONE_NEWNS | UNSHARE_NEWNS2)))) {
get_mnt_ns(ns);
return ns;
}

- old = ns->root;
+ if (flags & CLONE_NEWNS)
+ old = ns->root;
+ else { /* UNSHARE_NEWNS2 */
+ struct path root;
+
+ get_fs_root(current->fs, &root);
+ if (root.mnt->mnt_root != root.dentry) {
+ path_put(&root);
+ return ERR_PTR(-EINVAL); /* not a mountpoint */
+ }
+ old = real_mount(root.mnt);
+ path_put(&root);
+ }

new_ns = alloc_mnt_ns(user_ns);
if (IS_ERR(new_ns))
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..8092e50 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -31,6 +31,14 @@
#define CLONE_IO 0x80000000 /* Clone io context */

/*
+ * Following flags can be used only with unshare(), because
+ * they are intersected with CSIGNAL
+ */
+#define UNSHARE_NEWNS2 0x00000001 /* Clone mnt namespace starting with the current task root. */
+
+#define UNSHARE_FLAGS (UNSHARE_NEWNS2)
+
+/*
* Scheduling policies
*/
#define SCHED_NORMAL 0
diff --git a/kernel/fork.c b/kernel/fork.c
index 0cf9cdb..52f1fc0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1381,7 +1381,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
retval = copy_mm(clone_flags, p);
if (retval)
goto bad_fork_cleanup_signal;
- retval = copy_namespaces(clone_flags, p);
+
+ /*
+ * CSIGNAL and UNSHARE_FLAGS are intersected, but
+ * UNSHARE_FLAGS can't be used with clone().
+ */
+ retval = copy_namespaces(clone_flags & ~UNSHARE_FLAGS, p);
if (retval)
goto bad_fork_cleanup_mm;
retval = copy_io(clone_flags, p);
@@ -1790,7 +1795,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
- CLONE_NEWUSER|CLONE_NEWPID))
+ CLONE_NEWUSER|CLONE_NEWPID|UNSHARE_FLAGS))
return -EINVAL;
/*
* Not implemented, but pretend it works if there is nothing to
@@ -1880,7 +1885,7 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
/*
* If unsharing namespace, must also unshare filesystem information.
*/
- if (unshare_flags & CLONE_NEWNS)
+ if (unshare_flags & (CLONE_NEWNS | UNSHARE_NEWNS2))
unshare_flags |= CLONE_FS;

err = check_unshare_flags(unshare_flags);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ef42d0a..a29e836 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -180,7 +180,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
int err = 0;

if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWNET | CLONE_NEWPID)))
+ CLONE_NEWNET | CLONE_NEWPID | UNSHARE_FLAGS)))
return 0;

user_ns = new_cred ? new_cred->user_ns : current_user_ns();
--
1.9.3


2014-10-07 13:30:44

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
> Another problem is that rootfs can't be hidden from a container, because
> rootfs can't be moved or umounted.

... which is a bug in mntns_install(), AFAICS.

> Here is an example how to get access to rootfs:
> fd = open("/proc/self/ns/mnt", O_RDONLY)
> umount2("/", MNT_DETACH);
> setns(fd, CLONE_NEWNS)
>
> rootfs may contain data, which should not be avaliable in CT-s.

Indeed.

> I suggest to add ability to create a mount namespace with specified
> mount points. A current task root can be used as a root for the new
> mount namespace.

Yecchh... Frankly, you are opening a big can of worms - having rootfs
as absolute root of all namespaces simplifies a lot of things in there.

2014-10-07 13:33:43

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
> > Another problem is that rootfs can't be hidden from a container, because
> > rootfs can't be moved or umounted.
>
> ... which is a bug in mntns_install(), AFAICS.

Ability to get to exposed rootfs, that is.

> > Here is an example how to get access to rootfs:
> > fd = open("/proc/self/ns/mnt", O_RDONLY)
> > umount2("/", MNT_DETACH);
> > setns(fd, CLONE_NEWNS)
> >
> > rootfs may contain data, which should not be avaliable in CT-s.
>
> Indeed.

... and it looks like the above is what your mangled reproducer in previous
patch had been made of -
fd = open("/proc/self/ns/mnt", O_RDONLY)
umount2("/", MNT_DETACH);
setns(fd, CLONE_NEWNS)
umount2("/", MNT_DETACH);

IMO what it shows is setns() bug. This "switch root/cwd, no matter what"
is wrong.

2014-10-07 19:44:50

by Andrew Vagin

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On Tue, Oct 07, 2014 at 02:33:39PM +0100, Al Viro wrote:
> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
> > On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
> > > Another problem is that rootfs can't be hidden from a container, because
> > > rootfs can't be moved or umounted.
> >
> > ... which is a bug in mntns_install(), AFAICS.
>
> Ability to get to exposed rootfs, that is.
>
> > > Here is an example how to get access to rootfs:
> > > fd = open("/proc/self/ns/mnt", O_RDONLY)
> > > umount2("/", MNT_DETACH);
> > > setns(fd, CLONE_NEWNS)
> > >
> > > rootfs may contain data, which should not be avaliable in CT-s.
> >
> > Indeed.
>
> ... and it looks like the above is what your mangled reproducer in previous
> patch had been made of -
> fd = open("/proc/self/ns/mnt", O_RDONLY)
> umount2("/", MNT_DETACH);
> setns(fd, CLONE_NEWNS)
> umount2("/", MNT_DETACH);
>
> IMO what it shows is setns() bug. This "switch root/cwd, no matter what"
> is wrong.

How can we move in another namespace without this "switch"? If the task
root and cwd isn't switched, they points to the old tree. In this case
how can we open something from the new tree?

Do you mean that root should not be changed if it points on a detached
mount or on a mount from another mount namespace? This looks reasonable.

2014-10-07 20:31:38

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

Al Viro <[email protected]> writes:

2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
>> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
>> > Another problem is that rootfs can't be hidden from a container, because
>> > rootfs can't be moved or umounted.
>>
>> ... which is a bug in mntns_install(), AFAICS.
>
> Ability to get to exposed rootfs, that is.

The container side of this argument is pretty bogus. It only applies
if user namespaces are not used for the container.

So it is only root (and not root in a container) who can get to the
exposed rootfs.

I have a vague memory someone actually had a real use in miminal systems
for being able to get back to the rootfs and being able to use rootfs as
the rootfs. There was even a patch at that time that Andrew Morton was
carrying for a time to allow unmounting root and get at rootfs, and to
prevent the oops on rootfs unmount in some way.

So not only do I not think it is a bug to get back too rootfs, I think
it is a feature that some people have expressed at least half-way sane
uses for.

>> > Here is an example how to get access to rootfs:
>> > fd = open("/proc/self/ns/mnt", O_RDONLY)
>> > umount2("/", MNT_DETACH);
>> > setns(fd, CLONE_NEWNS)
>> >
>> > rootfs may contain data, which should not be avaliable in CT-s.
>>
>> Indeed.
>
> ... and it looks like the above is what your mangled reproducer in previous
> patch had been made of -
> fd = open("/proc/self/ns/mnt", O_RDONLY)
> umount2("/", MNT_DETACH);
> setns(fd, CLONE_NEWNS)
> umount2("/", MNT_DETACH);
>
> IMO what it shows is setns() bug. This "switch root/cwd, no matter what"
> is wrong.

IMO the bug is allowing us to unmount things that should never be unmounted.

In a mount namespace created with just user namespace permissions we
can't get at rootfs because MNT_LOCKED is set on the root directory
and thus it can not be mounted.

Further if anyone has permission to call chroot and chdir on any mount
in a mount namespace (that isn't currently covered) they can get at all
of them that are not currently covered. A mount namespace where no one
can get at any uncovered filesystem seems to be the definition of
useless and ridiculous.


Now there is a bug in that MNT_DETACH today does not currently enforce
MNT_LOCKED on submounts of the mount point that is detached. I am
currently looking at how to construct the appropriate permission check
to prevent that. Unfortunately I can not disallow MNT_DETACH with
submounts all together as that breaks too many legitimate uses.

That failure to enforce MNT_LOCKED is my mistake. I had a naive notion
that submounts would remain mounted after a mount detach and I misread
the code when I did the original work. My mistake.

Eric

2014-10-07 20:45:57

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

Andrey Vagin <[email protected]> writes:

> From: Andrey Vagin <[email protected]>
>
> Currently when we create a new container with a separate root,
> we need to clone the current mount namespace with all mounts and then
> clean up it by using pivot_root(). A big part of mountpoints are cloned
> only to be umounted.

Is the motivation performance? Because if that is the motivation we
need numbers.

> Another problem is that rootfs can't be hidden from a container, because
> rootfs can't be moved or umounted.
>
> Here is an example how to get access to rootfs:
> fd = open("/proc/self/ns/mnt", O_RDONLY)
> umount2("/", MNT_DETACH);
> setns(fd, CLONE_NEWNS)
>
> rootfs may contain data, which should not be avaliable in CT-s.

Well don't give those containers CAP_SYS_ADMIN. If you aren't using
user namespaces there is no expectation of safety from those kinds of
problems. Getting at rootfs is perfectly valid for root.

> I suggest to add ability to create a mount namespace with specified
> mount points. A current task root can be used as a root for the new
> mount namespace.

I really don't think you are going to like the result because you will
loose access to /proc and /sys.

> With this patch you can call chroot(ct->rootfs) and
> unshare(UNSHARE_NEWNS2) to get a clean mount namespace.

That is a little bit of an ugly way to smuggle a parameter into the
creation of a mount namespace. Further I am pretty certain this patch
totally breaks the setting of new_fs->root and new_fs->pwd.


In net my opinion is that the code doesn't work and does not provide
sufficient justification for a new system call.

> UNSHARE_NEWNS2 can be used only with the unshare() syscall. The clone()
> syscall doesn't have unused flags.
>
> Here is an example how it looks like:
> $ cat ../../unshare.c
>
> int main(int argc, char **argv)
> {
/* You left out
* mount --bind /some/root/path /some/root/path
* chroot /some/root/path
*/

> if (unshare(UNSHARE_NEWNS2))
> return 1;
>
> execl("/bin/bash", "/bin/bash", NULL);
> return 1;
> }
> $ mount --bind test/ubuntu/ test/ubuntu/
> $ cd test/ubuntu/
> $ chroot .
> $ ./unshare2
> $ mount -t proc proc proc
> $ cat /proc/self/mountinfo
> 55 55 252:1 /home/avagin/test/ubuntu / rw,relatime - ext4 /dev/disk/by-uuid/d672b85f-533c-4868-9609-ca80be52d3c6 rw,errors=remount-ro,data=ordered
> 56 55 0:3 / /proc rw,relatime - proc proc rw
>
> Cc: Alexander Viro <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: Cyrill Gorcunov <[email protected]>
> Cc: Pavel Emelyanov <[email protected]>
> Cc: Serge Hallyn <[email protected]>
> Cc: Rob Landley <[email protected]>
> Signed-off-by: Andrey Vagin <[email protected]>
> ---
> fs/namespace.c | 16 ++++++++++++++--
> include/uapi/linux/sched.h | 8 ++++++++
> kernel/fork.c | 11 ++++++++---
> kernel/nsproxy.c | 2 +-
> 4 files changed, 31 insertions(+), 6 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 730c50e..f50a848 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2569,12 +2569,24 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns,
>
> BUG_ON(!ns);
>
> - if (likely(!(flags & CLONE_NEWNS))) {
> + if (likely(!(flags & (CLONE_NEWNS | UNSHARE_NEWNS2)))) {
> get_mnt_ns(ns);
> return ns;
> }
>
> - old = ns->root;
> + if (flags & CLONE_NEWNS)
> + old = ns->root;
> + else { /* UNSHARE_NEWNS2 */
> + struct path root;
> +
> + get_fs_root(current->fs, &root);
> + if (root.mnt->mnt_root != root.dentry) {
> + path_put(&root);
> + return ERR_PTR(-EINVAL); /* not a mountpoint */
> + }
> + old = real_mount(root.mnt);
> + path_put(&root);
> + }
>
> new_ns = alloc_mnt_ns(user_ns);
> if (IS_ERR(new_ns))
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 34f9d73..8092e50 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -31,6 +31,14 @@
> #define CLONE_IO 0x80000000 /* Clone io context */
>
> /*
> + * Following flags can be used only with unshare(), because
> + * they are intersected with CSIGNAL
> + */
> +#define UNSHARE_NEWNS2 0x00000001 /* Clone mnt namespace starting with the current task root. */
> +
> +#define UNSHARE_FLAGS (UNSHARE_NEWNS2)
> +
> +/*
> * Scheduling policies
> */
> #define SCHED_NORMAL 0
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0cf9cdb..52f1fc0 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1381,7 +1381,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> retval = copy_mm(clone_flags, p);
> if (retval)
> goto bad_fork_cleanup_signal;
> - retval = copy_namespaces(clone_flags, p);
> +
> + /*
> + * CSIGNAL and UNSHARE_FLAGS are intersected, but
> + * UNSHARE_FLAGS can't be used with clone().
> + */
> + retval = copy_namespaces(clone_flags & ~UNSHARE_FLAGS, p);
> if (retval)
> goto bad_fork_cleanup_mm;
> retval = copy_io(clone_flags, p);
> @@ -1790,7 +1795,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
> if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
> CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
> CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
> - CLONE_NEWUSER|CLONE_NEWPID))
> + CLONE_NEWUSER|CLONE_NEWPID|UNSHARE_FLAGS))
It seems confusing to use UNSHARE_FLAGS here.
> return -EINVAL;
> /*
> * Not implemented, but pretend it works if there is nothing to
> @@ -1880,7 +1885,7 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
> /*
> * If unsharing namespace, must also unshare filesystem information.
> */
> - if (unshare_flags & CLONE_NEWNS)
> + if (unshare_flags & (CLONE_NEWNS | UNSHARE_NEWNS2))
> unshare_flags |= CLONE_FS;
>
> err = check_unshare_flags(unshare_flags);
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index ef42d0a..a29e836 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -180,7 +180,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
> int err = 0;
>
> if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> - CLONE_NEWNET | CLONE_NEWPID)))
> + CLONE_NEWNET | CLONE_NEWPID | UNSHARE_FLAGS)))

It is inappropriate to assume that all unshare flags will be namespaces.

> return 0;
>
> user_ns = new_cred ? new_cred->user_ns : current_user_ns();

2014-10-07 20:46:58

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

Quoting Eric W. Biederman ([email protected]):
> Al Viro <[email protected]> writes:
>
> 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
> >> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
> >> > Another problem is that rootfs can't be hidden from a container, because
> >> > rootfs can't be moved or umounted.
> >>
> >> ... which is a bug in mntns_install(), AFAICS.
> >
> > Ability to get to exposed rootfs, that is.
>
> The container side of this argument is pretty bogus. It only applies
> if user namespaces are not used for the container.

User namespaces are still far too restricted for many container use
cases. We can't say "we have user namespaces so now privileged
containers can be ignored". Yes you never should have handed the
keys to a privileged container to an untrusted person, but we do
still try to protect the host from accidental damage due to a
privileged container.

> So it is only root (and not root in a container) who can get to the
> exposed rootfs.
>
> I have a vague memory someone actually had a real use in miminal systems
> for being able to get back to the rootfs and being able to use rootfs as
> the rootfs. There was even a patch at that time that Andrew Morton was
> carrying for a time to allow unmounting root and get at rootfs, and to
> prevent the oops on rootfs unmount in some way.
>
> So not only do I not think it is a bug to get back too rootfs, I think
> it is a feature that some people have expressed at least half-way sane
> uses for.

They can still do that if they want, using chroot :)

> >> > Here is an example how to get access to rootfs:
> >> > fd = open("/proc/self/ns/mnt", O_RDONLY)
> >> > umount2("/", MNT_DETACH);
> >> > setns(fd, CLONE_NEWNS)
> >> >
> >> > rootfs may contain data, which should not be avaliable in CT-s.
> >>
> >> Indeed.
> >
> > ... and it looks like the above is what your mangled reproducer in previous
> > patch had been made of -
> > fd = open("/proc/self/ns/mnt", O_RDONLY)
> > umount2("/", MNT_DETACH);
> > setns(fd, CLONE_NEWNS)
> > umount2("/", MNT_DETACH);
> >
> > IMO what it shows is setns() bug. This "switch root/cwd, no matter what"
> > is wrong.
>
> IMO the bug is allowing us to unmount things that should never be unmounted.
>
> In a mount namespace created with just user namespace permissions we
> can't get at rootfs because MNT_LOCKED is set on the root directory
> and thus it can not be mounted.
>
> Further if anyone has permission to call chroot and chdir on any mount
> in a mount namespace (that isn't currently covered) they can get at all
> of them that are not currently covered. A mount namespace where no one
> can get at any uncovered filesystem seems to be the definition of
> useless and ridiculous.
>
>
> Now there is a bug in that MNT_DETACH today does not currently enforce
> MNT_LOCKED on submounts of the mount point that is detached. I am
> currently looking at how to construct the appropriate permission check
> to prevent that. Unfortunately I can not disallow MNT_DETACH with
> submounts all together as that breaks too many legitimate uses.
>
> That failure to enforce MNT_LOCKED is my mistake. I had a naive notion
> that submounts would remain mounted after a mount detach and I misread
> the code when I did the original work. My mistake.

2014-10-07 20:53:10

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

Serge Hallyn <[email protected]> writes:

> Quoting Eric W. Biederman ([email protected]):
>> Al Viro <[email protected]> writes:
>>
>> 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
>> >> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
>> >> > Another problem is that rootfs can't be hidden from a container, because
>> >> > rootfs can't be moved or umounted.
>> >>
>> >> ... which is a bug in mntns_install(), AFAICS.
>> >
>> > Ability to get to exposed rootfs, that is.
>>
>> The container side of this argument is pretty bogus. It only applies
>> if user namespaces are not used for the container.
>
> User namespaces are still far too restricted for many container use
> cases. We can't say "we have user namespaces so now privileged
> containers can be ignored". Yes you never should have handed the
> keys to a privileged container to an untrusted person, but we do
> still try to protect the host from accidental damage due to a
> privileged container.

What I meant is that it isn't about containers. It is about something
root can do. So this is not a "container" problem.

>> So it is only root (and not root in a container) who can get to the
>> exposed rootfs.
>>
>> I have a vague memory someone actually had a real use in miminal systems
>> for being able to get back to the rootfs and being able to use rootfs as
>> the rootfs. There was even a patch at that time that Andrew Morton was
>> carrying for a time to allow unmounting root and get at rootfs, and to
>> prevent the oops on rootfs unmount in some way.
>>
>> So not only do I not think it is a bug to get back too rootfs, I think
>> it is a feature that some people have expressed at least half-way sane
>> uses for.
>
> They can still do that if they want, using chroot :)

It would take fchdir or fchroot and a directory file descriptor open on
rootfs. Frequently there is no appropriate directory file descriptor.

Eric

2014-10-07 21:03:17

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On Tue, Oct 7, 2014 at 1:30 PM, Eric W. Biederman <[email protected]> wrote:
> Al Viro <[email protected]> writes:
>
> 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
>>> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
>>> > Another problem is that rootfs can't be hidden from a container, because
>>> > rootfs can't be moved or umounted.
>>>
>>> ... which is a bug in mntns_install(), AFAICS.
>>
>> Ability to get to exposed rootfs, that is.
>
> The container side of this argument is pretty bogus. It only applies
> if user namespaces are not used for the container.
>
> So it is only root (and not root in a container) who can get to the
> exposed rootfs.
>
> I have a vague memory someone actually had a real use in miminal systems
> for being able to get back to the rootfs and being able to use rootfs as
> the rootfs. There was even a patch at that time that Andrew Morton was
> carrying for a time to allow unmounting root and get at rootfs, and to
> prevent the oops on rootfs unmount in some way.
>
> So not only do I not think it is a bug to get back too rootfs, I think
> it is a feature that some people have expressed at least half-way sane
> uses for.
>
>>> > Here is an example how to get access to rootfs:
>>> > fd = open("/proc/self/ns/mnt", O_RDONLY)
>>> > umount2("/", MNT_DETACH);
>>> > setns(fd, CLONE_NEWNS)
>>> >
>>> > rootfs may contain data, which should not be avaliable in CT-s.
>>>
>>> Indeed.
>>
>> ... and it looks like the above is what your mangled reproducer in previous
>> patch had been made of -
>> fd = open("/proc/self/ns/mnt", O_RDONLY)
>> umount2("/", MNT_DETACH);
>> setns(fd, CLONE_NEWNS)
>> umount2("/", MNT_DETACH);
>>
>> IMO what it shows is setns() bug. This "switch root/cwd, no matter what"
>> is wrong.
>
> IMO the bug is allowing us to unmount things that should never be unmounted.
>
> In a mount namespace created with just user namespace permissions we
> can't get at rootfs because MNT_LOCKED is set on the root directory
> and thus it can not be mounted.
>
> Further if anyone has permission to call chroot and chdir on any mount
> in a mount namespace (that isn't currently covered) they can get at all
> of them that are not currently covered. A mount namespace where no one
> can get at any uncovered filesystem seems to be the definition of
> useless and ridiculous.
>
>
> Now there is a bug in that MNT_DETACH today does not currently enforce
> MNT_LOCKED on submounts of the mount point that is detached. I am
> currently looking at how to construct the appropriate permission check
> to prevent that. Unfortunately I can not disallow MNT_DETACH with
> submounts all together as that breaks too many legitimate uses.

Why should MNT_LOCKED on submounts be enforced?

Is it because, if you retain a reference to the detached tree, then
you can see under the submounts? If so, let's fix *that*. Because
otherwise the whole model of pivot_root + detach will break.

Also, damn it, we need change_the_ns_root instead of pivot_root. I
doubt that any container programs actually want to keep the old root
attached after pivot_root.

--Andy

2014-10-07 21:26:35

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

Andy Lutomirski <[email protected]> writes:

> Why should MNT_LOCKED on submounts be enforced?
>
> Is it because, if you retain a reference to the detached tree, then
> you can see under the submounts?

Yes. MNT_DETACH is a recursive operation that detaches all of the mount
and all of it's submounts. Which means you can see under the submounts
if you have a reference to a detached mount.

> If so, let's fix *that*. Because
> otherwise the whole model of pivot_root + detach will break.

I am not certain what you are referring to. pivot_root doesn't
manipulate the mount tree so you can see under anything.

What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH)
if there are any referenced mount points being detached that have a
locked submount.

> Also, damn it, we need change_the_ns_root instead of pivot_root. I
> doubt that any container programs actually want to keep the old root
> attached after pivot_root.

Shrug. Except for chroot_fs_refs() pivot_root is a cheap.

I'm not particularly in favor of merging pivot_root and umount2. The
number of weird cases in the current api are high. A merged piece of
code would just make them higher.

I am hoping that one more round of bug fixing will at least get the bugs
for having unprivileged mounts fixed in the current API.

Eric

2014-10-07 21:33:26

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

Quoting Eric W. Biederman ([email protected]):
> Serge Hallyn <[email protected]> writes:
>
> > Quoting Eric W. Biederman ([email protected]):
> >> Al Viro <[email protected]> writes:
> >>
> >> 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
> >> >> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
> >> >> > Another problem is that rootfs can't be hidden from a container, because
> >> >> > rootfs can't be moved or umounted.
> >> >>
> >> >> ... which is a bug in mntns_install(), AFAICS.
> >> >
> >> > Ability to get to exposed rootfs, that is.
> >>
> >> The container side of this argument is pretty bogus. It only applies
> >> if user namespaces are not used for the container.
> >
> > User namespaces are still far too restricted for many container use
> > cases. We can't say "we have user namespaces so now privileged
> > containers can be ignored". Yes you never should have handed the
> > keys to a privileged container to an untrusted person, but we do
> > still try to protect the host from accidental damage due to a
> > privileged container.
>
> What I meant is that it isn't about containers. It is about something
> root can do. So this is not a "container" problem.

Oh, ok.

Sorry, I'm getting the two thread confused anyway. I'm going to bow out
here until I can pay proper attention.

> >> So it is only root (and not root in a container) who can get to the
> >> exposed rootfs.
> >>
> >> I have a vague memory someone actually had a real use in miminal systems
> >> for being able to get back to the rootfs and being able to use rootfs as
> >> the rootfs. There was even a patch at that time that Andrew Morton was
> >> carrying for a time to allow unmounting root and get at rootfs, and to
> >> prevent the oops on rootfs unmount in some way.
> >>
> >> So not only do I not think it is a bug to get back too rootfs, I think
> >> it is a feature that some people have expressed at least half-way sane
> >> uses for.
> >
> > They can still do that if they want, using chroot :)
>
> It would take fchdir or fchroot and a directory file descriptor open on
> rootfs. Frequently there is no appropriate directory file descriptor.

? you can always escape if you're simply chrooted. waterbuffalo :)

2014-10-07 21:34:15

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

Quoting Andy Lutomirski ([email protected]):
> On Tue, Oct 7, 2014 at 1:30 PM, Eric W. Biederman <[email protected]> wrote:
> > Al Viro <[email protected]> writes:
> >
> > 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
> >>> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
> >>> > Another problem is that rootfs can't be hidden from a container, because
> >>> > rootfs can't be moved or umounted.
> >>>
> >>> ... which is a bug in mntns_install(), AFAICS.
> >>
> >> Ability to get to exposed rootfs, that is.
> >
> > The container side of this argument is pretty bogus. It only applies
> > if user namespaces are not used for the container.
> >
> > So it is only root (and not root in a container) who can get to the
> > exposed rootfs.
> >
> > I have a vague memory someone actually had a real use in miminal systems
> > for being able to get back to the rootfs and being able to use rootfs as
> > the rootfs. There was even a patch at that time that Andrew Morton was
> > carrying for a time to allow unmounting root and get at rootfs, and to
> > prevent the oops on rootfs unmount in some way.
> >
> > So not only do I not think it is a bug to get back too rootfs, I think
> > it is a feature that some people have expressed at least half-way sane
> > uses for.
> >
> >>> > Here is an example how to get access to rootfs:
> >>> > fd = open("/proc/self/ns/mnt", O_RDONLY)
> >>> > umount2("/", MNT_DETACH);
> >>> > setns(fd, CLONE_NEWNS)
> >>> >
> >>> > rootfs may contain data, which should not be avaliable in CT-s.
> >>>
> >>> Indeed.
> >>
> >> ... and it looks like the above is what your mangled reproducer in previous
> >> patch had been made of -
> >> fd = open("/proc/self/ns/mnt", O_RDONLY)
> >> umount2("/", MNT_DETACH);
> >> setns(fd, CLONE_NEWNS)
> >> umount2("/", MNT_DETACH);
> >>
> >> IMO what it shows is setns() bug. This "switch root/cwd, no matter what"
> >> is wrong.
> >
> > IMO the bug is allowing us to unmount things that should never be unmounted.
> >
> > In a mount namespace created with just user namespace permissions we
> > can't get at rootfs because MNT_LOCKED is set on the root directory
> > and thus it can not be mounted.
> >
> > Further if anyone has permission to call chroot and chdir on any mount
> > in a mount namespace (that isn't currently covered) they can get at all
> > of them that are not currently covered. A mount namespace where no one
> > can get at any uncovered filesystem seems to be the definition of
> > useless and ridiculous.
> >
> >
> > Now there is a bug in that MNT_DETACH today does not currently enforce
> > MNT_LOCKED on submounts of the mount point that is detached. I am
> > currently looking at how to construct the appropriate permission check
> > to prevent that. Unfortunately I can not disallow MNT_DETACH with
> > submounts all together as that breaks too many legitimate uses.
>
> Why should MNT_LOCKED on submounts be enforced?
>
> Is it because, if you retain a reference to the detached tree, then
> you can see under the submounts? If so, let's fix *that*. Because
> otherwise the whole model of pivot_root + detach will break.
>
> Also, damn it, we need change_the_ns_root instead of pivot_root. I
> doubt that any container programs actually want to keep the old root
> attached after pivot_root.

Right I think that'll fix the problem we were having, and I think
Andrey said the same thing in another list a few days ago.

2014-10-07 21:38:29

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman <[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> Why should MNT_LOCKED on submounts be enforced?
>>
>> Is it because, if you retain a reference to the detached tree, then
>> you can see under the submounts?
>
> Yes. MNT_DETACH is a recursive operation that detaches all of the mount
> and all of it's submounts. Which means you can see under the submounts
> if you have a reference to a detached mount.
>
>> If so, let's fix *that*. Because
>> otherwise the whole model of pivot_root + detach will break.
>
> I am not certain what you are referring to. pivot_root doesn't
> manipulate the mount tree so you can see under anything.
>
> What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH)
> if there are any referenced mount points being detached that have a
> locked submount.

Most of the container-using things do, roughly:

Unshare userns and mountns
Mount some new stuff
pivot_root to the new stuff
MNT_DETACH the old.

That last step will almost always fail if you make this change.

--Andy

2014-10-07 21:42:53

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

Serge Hallyn <[email protected]> writes:

> Quoting Eric W. Biederman ([email protected]):

>> What I meant is that it isn't about containers. It is about something
>> root can do. So this is not a "container" problem.
>
> Oh, ok.
>
> Sorry, I'm getting the two thread confused anyway. I'm going to bow out
> here until I can pay proper attention.

I think there is half a point here that may be legit, when you are using
mount namespaces to jail applications, there may be a problem
with umounting / and making it to the underlying rootfs filesystem.

I am squinting and looking this way and that but while I can imagine
someone more clever than I can think up some unique property of rootfs
that makes it a little more exploitable than just mounting a ramfs,
but since you have to be root to exploit those properties I think the
game is pretty much lost.

>> >> So it is only root (and not root in a container) who can get to the
>> >> exposed rootfs.
>> >>
>> >> I have a vague memory someone actually had a real use in miminal systems
>> >> for being able to get back to the rootfs and being able to use rootfs as
>> >> the rootfs. There was even a patch at that time that Andrew Morton was
>> >> carrying for a time to allow unmounting root and get at rootfs, and to
>> >> prevent the oops on rootfs unmount in some way.
>> >>
>> >> So not only do I not think it is a bug to get back too rootfs, I think
>> >> it is a feature that some people have expressed at least half-way sane
>> >> uses for.
>> >
>> > They can still do that if they want, using chroot :)
>>
>> It would take fchdir or fchroot and a directory file descriptor open on
>> rootfs. Frequently there is no appropriate directory file descriptor.
>
> ? you can always escape if you're simply chrooted. waterbuffalo :)

filesystem type rootfs.

Eric

2014-10-07 21:51:21

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

Andy Lutomirski <[email protected]> writes:

> On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman <[email protected]> wrote:
>> Andy Lutomirski <[email protected]> writes:
>>
>>> Why should MNT_LOCKED on submounts be enforced?
>>>
>>> Is it because, if you retain a reference to the detached tree, then
>>> you can see under the submounts?
>>
>> Yes. MNT_DETACH is a recursive operation that detaches all of the mount
>> and all of it's submounts. Which means you can see under the submounts
>> if you have a reference to a detached mount.
>>
>>> If so, let's fix *that*. Because
>>> otherwise the whole model of pivot_root + detach will break.
>>
>> I am not certain what you are referring to. pivot_root doesn't
>> manipulate the mount tree so you can see under anything.
>>
>> What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH)
>> if there are any referenced mount points being detached that have a
>> locked submount.
>
> Most of the container-using things do, roughly:
>
> Unshare userns and mountns
> Mount some new stuff
> pivot_root to the new stuff
> MNT_DETACH the old.
>
> That last step will almost always fail if you make this change.

I don't think so.

I expect I could add full busy detection of normal umounts and those
applications would not fail.

What I am proposing is a more targeted version of busy detection that
looks at each mount in the set that detach will unmount. For each mount
if it is busy with non-submount references and it has at least one
locked submount fail the detach with -EBUSY.

Do you really think we have userspace references to the one or more of the
mounts under old?

Eric

2014-10-07 21:52:44

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On Tue, Oct 7, 2014 at 2:50 PM, Eric W. Biederman <[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman <[email protected]> wrote:
>>> Andy Lutomirski <[email protected]> writes:
>>>
>>>> Why should MNT_LOCKED on submounts be enforced?
>>>>
>>>> Is it because, if you retain a reference to the detached tree, then
>>>> you can see under the submounts?
>>>
>>> Yes. MNT_DETACH is a recursive operation that detaches all of the mount
>>> and all of it's submounts. Which means you can see under the submounts
>>> if you have a reference to a detached mount.
>>>
>>>> If so, let's fix *that*. Because
>>>> otherwise the whole model of pivot_root + detach will break.
>>>
>>> I am not certain what you are referring to. pivot_root doesn't
>>> manipulate the mount tree so you can see under anything.
>>>
>>> What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH)
>>> if there are any referenced mount points being detached that have a
>>> locked submount.
>>
>> Most of the container-using things do, roughly:
>>
>> Unshare userns and mountns
>> Mount some new stuff
>> pivot_root to the new stuff
>> MNT_DETACH the old.
>>
>> That last step will almost always fail if you make this change.
>
> I don't think so.
>
> I expect I could add full busy detection of normal umounts and those
> applications would not fail.
>
> What I am proposing is a more targeted version of busy detection that
> looks at each mount in the set that detach will unmount. For each mount
> if it is busy with non-submount references and it has at least one
> locked submount fail the detach with -EBUSY.
>
> Do you really think we have userspace references to the one or more of the
> mounts under old?
>

I suspect that we have a userspace reference to old itself.

--Andy

> Eric



--
Andy Lutomirski
AMA Capital Management, LLC

2014-10-07 22:19:36

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman <[email protected]> wrote:
>
> I am squinting and looking this way and that but while I can imagine
> someone more clever than I can think up some unique property of rootfs
> that makes it a little more exploitable than just mounting a ramfs,
> but since you have to be root to exploit those properties I think the
> game is pretty much lost.

Yes. rootfs might not be empty, it might have totally insane
permissions, and it's globally shared, which makes it into a wonderful
channel to pass things around that shouldn't be passed around.

Can non-root do this? You'd need to be in a userns with a "/" that
isn't MNT_LOCKED. Can this happen on any normal setup?

FWIW, I think we should unconditionally MNT_LOCKED the root on userns
unshare, even if it's the only mount.


>
>>> >> So it is only root (and not root in a container) who can get to the
>>> >> exposed rootfs.
>>> >>
>>> >> I have a vague memory someone actually had a real use in miminal systems
>>> >> for being able to get back to the rootfs and being able to use rootfs as
>>> >> the rootfs. There was even a patch at that time that Andrew Morton was
>>> >> carrying for a time to allow unmounting root and get at rootfs, and to
>>> >> prevent the oops on rootfs unmount in some way.
>>> >>
>>> >> So not only do I not think it is a bug to get back too rootfs, I think
>>> >> it is a feature that some people have expressed at least half-way sane
>>> >> uses for.
>>> >
>>> > They can still do that if they want, using chroot :)
>>>
>>> It would take fchdir or fchroot and a directory file descriptor open on
>>> rootfs. Frequently there is no appropriate directory file descriptor.
>>
>> ? you can always escape if you're simply chrooted. waterbuffalo :)
>
> filesystem type rootfs.
>
> Eric
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
Andy Lutomirski
AMA Capital Management, LLC

2014-10-07 22:43:19

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

Andy Lutomirski <[email protected]> writes:

> On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman <[email protected]> wrote:
>>
>> I am squinting and looking this way and that but while I can imagine
>> someone more clever than I can think up some unique property of rootfs
>> that makes it a little more exploitable than just mounting a ramfs,
>> but since you have to be root to exploit those properties I think the
>> game is pretty much lost.
>
> Yes. rootfs might not be empty, it might have totally insane
> permissions, and it's globally shared, which makes it into a wonderful
> channel to pass things around that shouldn't be passed around.

But if only root with proc mounted can reach it... I don't know.
There might be a case for setting MNT_LOCKED when we overmount "/"
as root but I don't yet see it.

> Can non-root do this? You'd need to be in a userns with a "/" that
> isn't MNT_LOCKED. Can this happen on any normal setup?
>
> FWIW, I think we should unconditionally MNT_LOCKED the root on userns
> unshare, even if it's the only mount.

To the best of my knowledge MNT_LOCKED is set uncondintially on userns
unshare.

Eric

2014-10-07 22:44:50

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman <[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman <[email protected]> wrote:
>>>
>>> I am squinting and looking this way and that but while I can imagine
>>> someone more clever than I can think up some unique property of rootfs
>>> that makes it a little more exploitable than just mounting a ramfs,
>>> but since you have to be root to exploit those properties I think the
>>> game is pretty much lost.
>>
>> Yes. rootfs might not be empty, it might have totally insane
>> permissions, and it's globally shared, which makes it into a wonderful
>> channel to pass things around that shouldn't be passed around.
>
> But if only root with proc mounted can reach it... I don't know.

It doesn't have to be global root. It could be userns root.

> There might be a case for setting MNT_LOCKED when we overmount "/"
> as root but I don't yet see it.
>
>> Can non-root do this? You'd need to be in a userns with a "/" that
>> isn't MNT_LOCKED. Can this happen on any normal setup?
>>
>> FWIW, I think we should unconditionally MNT_LOCKED the root on userns
>> unshare, even if it's the only mount.
>
> To the best of my knowledge MNT_LOCKED is set uncondintially on userns
> unshare.

Only if list_empty(&old->mnt_expire), whatever that means, I think.

--Andy

2014-10-07 23:42:45

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

Andy Lutomirski <[email protected]> writes:

> On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman <[email protected]> wrote:
>> Andy Lutomirski <[email protected]> writes:
>>
>>> On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman <[email protected]> wrote:
>>>>
>>>> I am squinting and looking this way and that but while I can imagine
>>>> someone more clever than I can think up some unique property of rootfs
>>>> that makes it a little more exploitable than just mounting a ramfs,
>>>> but since you have to be root to exploit those properties I think the
>>>> game is pretty much lost.
>>>
>>> Yes. rootfs might not be empty, it might have totally insane
>>> permissions, and it's globally shared, which makes it into a wonderful
>>> channel to pass things around that shouldn't be passed around.
>>
>> But if only root with proc mounted can reach it... I don't know.
>
> It doesn't have to be global root. It could be userns root.
>
>> There might be a case for setting MNT_LOCKED when we overmount "/"
>> as root but I don't yet see it.
>>
>>> Can non-root do this? You'd need to be in a userns with a "/" that
>>> isn't MNT_LOCKED. Can this happen on any normal setup?
>>>
>>> FWIW, I think we should unconditionally MNT_LOCKED the root on userns
>>> unshare, even if it's the only mount.
>>
>> To the best of my knowledge MNT_LOCKED is set uncondintially on userns
>> unshare.
>
> Only if list_empty(&old->mnt_expire), whatever that means, I think.

An autofs or nfs automounted mount. Can those ever become root?

Eric

2014-10-07 23:45:11

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On Tue, Oct 7, 2014 at 4:42 PM, Eric W. Biederman <[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman <[email protected]> wrote:
>>> Andy Lutomirski <[email protected]> writes:
>>>
>>>> On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman <[email protected]> wrote:
>>>>>
>>>>> I am squinting and looking this way and that but while I can imagine
>>>>> someone more clever than I can think up some unique property of rootfs
>>>>> that makes it a little more exploitable than just mounting a ramfs,
>>>>> but since you have to be root to exploit those properties I think the
>>>>> game is pretty much lost.
>>>>
>>>> Yes. rootfs might not be empty, it might have totally insane
>>>> permissions, and it's globally shared, which makes it into a wonderful
>>>> channel to pass things around that shouldn't be passed around.
>>>
>>> But if only root with proc mounted can reach it... I don't know.
>>
>> It doesn't have to be global root. It could be userns root.
>>
>>> There might be a case for setting MNT_LOCKED when we overmount "/"
>>> as root but I don't yet see it.
>>>
>>>> Can non-root do this? You'd need to be in a userns with a "/" that
>>>> isn't MNT_LOCKED. Can this happen on any normal setup?
>>>>
>>>> FWIW, I think we should unconditionally MNT_LOCKED the root on userns
>>>> unshare, even if it's the only mount.
>>>
>>> To the best of my knowledge MNT_LOCKED is set uncondintially on userns
>>> unshare.
>>
>> Only if list_empty(&old->mnt_expire), whatever that means, I think.
>
> An autofs or nfs automounted mount. Can those ever become root?

Dunno.

I thought that this code was what set MNT_LOCKED for things that
should be locked:

/* Don't allow unprivileged users to reveal what is under a mount */
if ((flag & CL_UNPRIVILEGED) && list_empty(&old->mnt_expire))
mnt->mnt.mnt_flags |= MNT_LOCKED;

Now I'm confused. Shouldn't that be checking for submounts? Is that
what it's doing?

Anyway, I think that this should unconditionally set MNT_LOCKED on the
thing that mounted on rootfs.

--Andy

2014-10-08 00:21:30

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

Andy Lutomirski <[email protected]> writes:

> On Tue, Oct 7, 2014 at 4:42 PM, Eric W. Biederman <[email protected]> wrote:
>> Andy Lutomirski <[email protected]> writes:
>>
>>> On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman <[email protected]> wrote:
>>>> Andy Lutomirski <[email protected]> writes:
>>>>
>>>>> On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman <[email protected]> wrote:
>>>>>>
>>>>>> I am squinting and looking this way and that but while I can imagine
>>>>>> someone more clever than I can think up some unique property of rootfs
>>>>>> that makes it a little more exploitable than just mounting a ramfs,
>>>>>> but since you have to be root to exploit those properties I think the
>>>>>> game is pretty much lost.
>>>>>
>>>>> Yes. rootfs might not be empty, it might have totally insane
>>>>> permissions, and it's globally shared, which makes it into a wonderful
>>>>> channel to pass things around that shouldn't be passed around.
>>>>
>>>> But if only root with proc mounted can reach it... I don't know.
>>>
>>> It doesn't have to be global root. It could be userns root.
>>>
>>>> There might be a case for setting MNT_LOCKED when we overmount "/"
>>>> as root but I don't yet see it.
>>>>
>>>>> Can non-root do this? You'd need to be in a userns with a "/" that
>>>>> isn't MNT_LOCKED. Can this happen on any normal setup?
>>>>>
>>>>> FWIW, I think we should unconditionally MNT_LOCKED the root on userns
>>>>> unshare, even if it's the only mount.
>>>>
>>>> To the best of my knowledge MNT_LOCKED is set uncondintially on userns
>>>> unshare.
>>>
>>> Only if list_empty(&old->mnt_expire), whatever that means, I think.
>>
>> An autofs or nfs automounted mount. Can those ever become root?
>
> Dunno.
>
> I thought that this code was what set MNT_LOCKED for things that
> should be locked:

It does.

> /* Don't allow unprivileged users to reveal what is under a mount */
> if ((flag & CL_UNPRIVILEGED) && list_empty(&old->mnt_expire))
> mnt->mnt.mnt_flags |= MNT_LOCKED;
>
> Now I'm confused. Shouldn't that be checking for submounts? Is that
> what it's doing?

As it copies each mount (mostly submounts) it sets MNT_LOCKED.

> Anyway, I think that this should unconditionally set MNT_LOCKED on the
> thing that mounted on rootfs.

As I read the code mnt_set_expiry is only for nfs, cifs, and afs
submounts (and perhaps proc should use them). So as they are generated
mnt_expiry should never start on the root of filesystem of the mount tree.

Furthermore mnt_expiry is cleared when a mount is moved, and when
it is bind mounted, or propagated.

pivot_root does look as though it may be missing the proper mnt_expiry
handling list_del_init(&old->mnt_expire), but pivot_root does have
proper MNT_LOCKED handling.

Looking that test was slightly off and it should be:
if ((flag & CL_UNPRIVILEGED) &&
(!(flag & CL_EXPIRE) || list_empty(&old->mnt_expire))
mnt->mnt.mnt_flags |= MNT_LOCKED;

So on that note we might clear CL_EXPIRE when CL_UNPRIVILEGED is set
in copy_tree but I don't see that as being really needed.

Eric

2014-10-08 00:25:59

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On Tue, Oct 7, 2014 at 5:20 PM, Eric W. Biederman <[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> On Tue, Oct 7, 2014 at 4:42 PM, Eric W. Biederman <[email protected]> wrote:
>>> Andy Lutomirski <[email protected]> writes:
>>>
>>>> On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman <[email protected]> wrote:
>>>>> Andy Lutomirski <[email protected]> writes:
>>>>>
>>>>>> On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman <[email protected]> wrote:
>>>>>>>
>>>>>>> I am squinting and looking this way and that but while I can imagine
>>>>>>> someone more clever than I can think up some unique property of rootfs
>>>>>>> that makes it a little more exploitable than just mounting a ramfs,
>>>>>>> but since you have to be root to exploit those properties I think the
>>>>>>> game is pretty much lost.
>>>>>>
>>>>>> Yes. rootfs might not be empty, it might have totally insane
>>>>>> permissions, and it's globally shared, which makes it into a wonderful
>>>>>> channel to pass things around that shouldn't be passed around.
>>>>>
>>>>> But if only root with proc mounted can reach it... I don't know.
>>>>
>>>> It doesn't have to be global root. It could be userns root.
>>>>
>>>>> There might be a case for setting MNT_LOCKED when we overmount "/"
>>>>> as root but I don't yet see it.
>>>>>
>>>>>> Can non-root do this? You'd need to be in a userns with a "/" that
>>>>>> isn't MNT_LOCKED. Can this happen on any normal setup?
>>>>>>
>>>>>> FWIW, I think we should unconditionally MNT_LOCKED the root on userns
>>>>>> unshare, even if it's the only mount.
>>>>>
>>>>> To the best of my knowledge MNT_LOCKED is set uncondintially on userns
>>>>> unshare.
>>>>
>>>> Only if list_empty(&old->mnt_expire), whatever that means, I think.
>>>
>>> An autofs or nfs automounted mount. Can those ever become root?
>>
>> Dunno.
>>
>> I thought that this code was what set MNT_LOCKED for things that
>> should be locked:
>
> It does.
>
>> /* Don't allow unprivileged users to reveal what is under a mount */
>> if ((flag & CL_UNPRIVILEGED) && list_empty(&old->mnt_expire))
>> mnt->mnt.mnt_flags |= MNT_LOCKED;
>>
>> Now I'm confused. Shouldn't that be checking for submounts? Is that
>> what it's doing?
>
> As it copies each mount (mostly submounts) it sets MNT_LOCKED.

Oh. They're *all* MNT_LOCKED. Duh.

>
>> Anyway, I think that this should unconditionally set MNT_LOCKED on the
>> thing that mounted on rootfs.
>
> As I read the code mnt_set_expiry is only for nfs, cifs, and afs
> submounts (and perhaps proc should use them). So as they are generated
> mnt_expiry should never start on the root of filesystem of the mount tree.
>
> Furthermore mnt_expiry is cleared when a mount is moved, and when
> it is bind mounted, or propagated.
>
> pivot_root does look as though it may be missing the proper mnt_expiry
> handling list_del_init(&old->mnt_expire), but pivot_root does have
> proper MNT_LOCKED handling.

pivot_root is quite broken, as noted in my other email. It's just not
broken like this, I think. :)

>
> Looking that test was slightly off and it should be:
> if ((flag & CL_UNPRIVILEGED) &&
> (!(flag & CL_EXPIRE) || list_empty(&old->mnt_expire))
> mnt->mnt.mnt_flags |= MNT_LOCKED;
>
> So on that note we might clear CL_EXPIRE when CL_UNPRIVILEGED is set
> in copy_tree but I don't see that as being really needed.
>
> Eric



--
Andy Lutomirski
AMA Capital Management, LLC

2014-10-08 11:09:00

by Andrew Vagin

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
> Andrey Vagin <[email protected]> writes:
>
> > From: Andrey Vagin <[email protected]>
> >
> > Currently when we create a new container with a separate root,
> > we need to clone the current mount namespace with all mounts and then
> > clean up it by using pivot_root(). A big part of mountpoints are cloned
> > only to be umounted.
>
> Is the motivation performance? Because if that is the motivation we
> need numbers.

The major motivation to create a clean mount namespace which contains
only required mounts.

Now you want to convince us that there is nothing wrong if we use
userns, because all inherited mounts are locked. My point is that all
useless mounts should be umounted. If the current root isn't on rootfs,
pivot_root() allows us to umount all useless points. But pivot_root()
doesn't work, if the current root is on rootfs. How can we umount
useless points in this case?

Maybe we want to say that rootfs should not be used if we are going to
create containers...

Thanks,
Andrew

2014-10-08 15:35:47

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin <[email protected]> wrote:
> On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
>> Andrey Vagin <[email protected]> writes:
>>
>> > From: Andrey Vagin <[email protected]>
>> >
>> > Currently when we create a new container with a separate root,
>> > we need to clone the current mount namespace with all mounts and then
>> > clean up it by using pivot_root(). A big part of mountpoints are cloned
>> > only to be umounted.
>>
>> Is the motivation performance? Because if that is the motivation we
>> need numbers.
>
> The major motivation to create a clean mount namespace which contains
> only required mounts.
>
> Now you want to convince us that there is nothing wrong if we use
> userns, because all inherited mounts are locked. My point is that all
> useless mounts should be umounted. If the current root isn't on rootfs,
> pivot_root() allows us to umount all useless points. But pivot_root()
> doesn't work, if the current root is on rootfs. How can we umount
> useless points in this case?
>
> Maybe we want to say that rootfs should not be used if we are going to
> create containers...
>

Could we have an extra rootfs-like fs that is always completely empty,
doesn't allow any writes, and can sit at the bottom of container
namespace hierarchies? If so, and if we add a new syscall that's like
pivot_root (or unshare) but prunes the hierarchy, then we could switch
to that rootfs then.

> Thanks,
> Andrew
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
Andy Lutomirski
AMA Capital Management, LLC

2014-10-08 19:24:28

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

Andy Lutomirski <[email protected]> writes:

> On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin <[email protected]> wrote:
>> On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
>>> Andrey Vagin <[email protected]> writes:
>>>
>>> > From: Andrey Vagin <[email protected]>
>>> >
>>> > Currently when we create a new container with a separate root,
>>> > we need to clone the current mount namespace with all mounts and then
>>> > clean up it by using pivot_root(). A big part of mountpoints are cloned
>>> > only to be umounted.
>>>
>>> Is the motivation performance? Because if that is the motivation we
>>> need numbers.
>>
>> The major motivation to create a clean mount namespace which contains
>> only required mounts.
>>
>> Now you want to convince us that there is nothing wrong if we use
>> userns, because all inherited mounts are locked. My point is that all
>> useless mounts should be umounted. If the current root isn't on rootfs,
>> pivot_root() allows us to umount all useless points. But pivot_root()
>> doesn't work, if the current root is on rootfs. How can we umount
>> useless points in this case?

One of your justifications for a new system call was so you could do
less. Doing less to get to where you want to go is only justified when
your doing less to get better performance.

It sounds like your actual concern is about sandboxing and security
audits. That is a very legitimate concern. That isn't however the core
concern of containers, so it was not clear that is what you meant.

>> Maybe we want to say that rootfs should not be used if we are going to
>> create containers...

Today it is an assumption of the vfs that rootfs is mounted. With
rootfs mounted and pivot_root at the base of the mount stack you can
make as minimal of a set of mounts as the vfs allows.

Removing rootfs from the vfs requires an audit of everything that
manipulates mounts. It is not remotely a local excercise.

One of the things that needs to be considered is that if you really want
to audit mounts is the code that needs manipulates them needs to be
audited every bit as much as the mounts themselves.

> Could we have an extra rootfs-like fs that is always completely empty,
> doesn't allow any writes, and can sit at the bottom of container
> namespace hierarchies? If so, and if we add a new syscall that's like
> pivot_root (or unshare) but prunes the hierarchy, then we could switch
> to that rootfs then.

Or equally have something that guarantees that rootfs is empty and
read-only at the time the normal root filesystem is mounted. That is
certainly a much more localized change if we want to go there.

I am half tempted to suggest that mount --move /some/path / be updated
to make the old / just go away (perhaps to be replaced with a read-only
empty rootfs). That gets us into figuring out if we break userspace
which is a big challenge.

Eric

2014-10-08 19:32:16

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
<[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin <[email protected]> wrote:
>>> On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
>>>> Andrey Vagin <[email protected]> writes:
>>>>
>>>> > From: Andrey Vagin <[email protected]>
>>>> >
>>>> > Currently when we create a new container with a separate root,
>>>> > we need to clone the current mount namespace with all mounts and then
>>>> > clean up it by using pivot_root(). A big part of mountpoints are cloned
>>>> > only to be umounted.
>>>>
>>>> Is the motivation performance? Because if that is the motivation we
>>>> need numbers.
>>>
>>> The major motivation to create a clean mount namespace which contains
>>> only required mounts.
>>>
>>> Now you want to convince us that there is nothing wrong if we use
>>> userns, because all inherited mounts are locked. My point is that all
>>> useless mounts should be umounted. If the current root isn't on rootfs,
>>> pivot_root() allows us to umount all useless points. But pivot_root()
>>> doesn't work, if the current root is on rootfs. How can we umount
>>> useless points in this case?
>
> One of your justifications for a new system call was so you could do
> less. Doing less to get to where you want to go is only justified when
> your doing less to get better performance.
>
> It sounds like your actual concern is about sandboxing and security
> audits. That is a very legitimate concern. That isn't however the core
> concern of containers, so it was not clear that is what you meant.
>
>>> Maybe we want to say that rootfs should not be used if we are going to
>>> create containers...
>
> Today it is an assumption of the vfs that rootfs is mounted. With
> rootfs mounted and pivot_root at the base of the mount stack you can
> make as minimal of a set of mounts as the vfs allows.
>
> Removing rootfs from the vfs requires an audit of everything that
> manipulates mounts. It is not remotely a local excercise.

Would it be a less invasive audit to allow different mount namespaces
to have different rootfses?

>
> One of the things that needs to be considered is that if you really want
> to audit mounts is the code that needs manipulates them needs to be
> audited every bit as much as the mounts themselves.
>
>> Could we have an extra rootfs-like fs that is always completely empty,
>> doesn't allow any writes, and can sit at the bottom of container
>> namespace hierarchies? If so, and if we add a new syscall that's like
>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
>> to that rootfs then.
>
> Or equally have something that guarantees that rootfs is empty and
> read-only at the time the normal root filesystem is mounted. That is
> certainly a much more localized change if we want to go there.
>
> I am half tempted to suggest that mount --move /some/path / be updated
> to make the old / just go away (perhaps to be replaced with a read-only
> empty rootfs). That gets us into figuring out if we break userspace
> which is a big challenge.

Hence my argument for a new syscall or entirely new operation.
mount(2) and friends are way too multiplexed right now. I just found
yet another security bug due to the insanely complicated semantics of
the vfs syscalls. (Yes, a different one from the one yesterday.)

A new operation kills several birds with one stone. It could look like:

int mntns_change_root(int dfd, const char *path, int flags);

return -EPERM if chrooted. Returns -EINVAL if path (relative to dfd)
isn't a mountmount. Otherwise it disconnects path from the existing
hierarchy, attaches a permanently-empty read-only rootfs under it,
makes it the root of the mntns, and does the root refs fixup. The old
hierarchy gets thrown out.

Benefits:
- Plausibly slightly faster. (Especially if we add the trivial
optimization that, if the caller holds the only reference to the
mntns, then changing root refs can avoid the big loop, but maybe
pivot_root could do that, too.)
- Sidesteps the whole rootfs mess.
- Much easier to use than pivot_root.
- No userspace breakage.

Heck, it could be even simpler: it could just unconditionally skip the
fs struct walk. Just require callers to make sure that everyone else
chroots into the new root.

Systemd could use this, too. If it wants to keep a reference to the
initramfs, it can use a file descriptor.

--Andy

2014-10-08 21:23:24

by Rob Landley

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On 10/08/14 14:23, Eric W. Biederman wrote:
>> Could we have an extra rootfs-like fs that is always completely empty,
>> doesn't allow any writes, and can sit at the bottom of container
>> namespace hierarchies? If so, and if we add a new syscall that's like
>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
>> to that rootfs then.
>
> Or equally have something that guarantees that rootfs is empty and
> read-only at the time the normal root filesystem is mounted. That is
> certainly a much more localized change if we want to go there.

What do you mean "normal" root filesystem? It is entirely possible (and
in fact common in the embedded world) to run from rootfs. I pushed my
old inittmpfs patches (at the request of cray) last year because being
able to take down the system with "cat /dev/zero > /blah" (as rootfs
allows and tmpfs doesn't) is a bad thing.

Rootfs is about as special as PID 1 is. We don't filter out PID 1 from
"ps" to avoid confusing people, but for some reason whoever did
/proc/$PID/mountinfo decided that rootfs shouldn't show up because magic
magic specialness.

We show /run, which is a tmpfs instance. If I mount two different
filesystems on top of each other on /mnt, it shows both. (Overmounts
were not invented by rootfs.) But no, mountinfo filters out rootfs
because magic magic specialness.

It makes me sad that this kind of special-case thinking is allowed in
the kernel.

> I am half tempted to suggest that mount --move /some/path / be updated
> to make the old / just go away (perhaps to be replaced with a read-only
> empty rootfs). That gets us into figuring out if we break userspace
> which is a big challenge.

My concern was that chroot() moving a magic "/" pointer that you can
trivially escape from with x=open("."); chroot("sub"); fdchdir(".");
chdir("../../../../../../../../.."); is having extra code in the kernel
to do it _wrong_.

We have per-process namespaces now. We can actually adjust the mount
tree (inserting a new bind mount if the directory we're changing to is
not already a mount point). If a per-process namespace needs to be
anchored by a tmpfs, fine. But requiring that to be teh SAME instance
globally for the entire system is not what containers is _about_. It's
not true for PID 1 and it shouldn't be true for rootfs.

By all means, if a filesystem is no longer accessable in a namespace,
decrement its reference count. (Keeping in mind that a bind mount should
count as a reference, and rootfs should always have a nonzero reference
count.) But "/" is not special in this regard. If you want to make all
overmounts vanish (which seems like a really bad idea and breaks 40
years of unix semantics), argue for that. Please stop treating rootfs
like it isn't potentialy usable as a full-fledged filesystem.

(Pet peeve of mine.)

> Eric

Rob

2014-10-08 21:36:08

by Rob Landley

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On 10/08/14 14:31, Andy Lutomirski wrote:
> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
> <[email protected]> wrote:
>> Andy Lutomirski <[email protected]> writes:
>>>> Maybe we want to say that rootfs should not be used if we are going to
>>>> create containers...
>>
>> Today it is an assumption of the vfs that rootfs is mounted. With
>> rootfs mounted and pivot_root at the base of the mount stack you can
>> make as minimal of a set of mounts as the vfs allows.
>>
>> Removing rootfs from the vfs requires an audit of everything that
>> manipulates mounts. It is not remotely a local excercise.
>
> Would it be a less invasive audit to allow different mount namespaces
> to have different rootfses?

I.E. The same way different namespaces have different init tasks?

The abstraction containers has implemented here should be logically
consistent.

>>> Could we have an extra rootfs-like fs that is always completely empty,
>>> doesn't allow any writes, and can sit at the bottom of container
>>> namespace hierarchies? If so, and if we add a new syscall that's like
>>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
>>> to that rootfs then.
>>
>> Or equally have something that guarantees that rootfs is empty and
>> read-only at the time the normal root filesystem is mounted. That is
>> certainly a much more localized change if we want to go there.
>>
>> I am half tempted to suggest that mount --move /some/path / be updated
>> to make the old / just go away (perhaps to be replaced with a read-only
>> empty rootfs). That gets us into figuring out if we break userspace
>> which is a big challenge.
>
> Hence my argument for a new syscall or entirely new operation.

I'm still waiting for somebody to explain to my why chroot() shouldn't
be changed to do this instead of adding a new syscall. (At least when
mount namespace support is enabled.)

> mount(2) and friends are way too multiplexed right now. I just found
> yet another security bug due to the insanely complicated semantics of
> the vfs syscalls. (Yes, a different one from the one yesterday.)

As the guy who rewrote busybox mount 3 times, and who just implemented a
brand new one (toybox) from scratch:

It's a bit fiddly, yes.

> A new operation kills several birds with one stone. It could look like:
>
> int mntns_change_root(int dfd, const char *path, int flags);
>
> return -EPERM if chrooted.

Really?

> Returns -EINVAL if path (relative to dfd) isn't a mountmount.

Requiring that chroot() only be called on mountpoints would break
existing semantics, which gets us back to new systemcall instead of
changing behavior of existing one.

If I recall, the first line of pushback against merging the openvz code
as is was "buckets of new syscalls". Pushback against adding a new
system call is understandable. Why can't we fix chroot() now that we
have the tools to do so?

> Otherwise it disconnects path from the existing
> hierarchy, attaches a permanently-empty read-only rootfs under it,
> makes it the root of the mntns, and does the root refs fixup. The old
> hierarchy gets thrown out.

We have a chroot() syscall. We don't use it for containers because it
doesn't do what we want. Does it currently do what _anybody_ wants?

> Systemd could use this, too.

While that's a strong argument against it, I'm willing to overlook it.

Rob

2014-10-08 22:02:00

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley <[email protected]> wrote:
> On 10/08/14 14:31, Andy Lutomirski wrote:
>> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
>> <[email protected]> wrote:
>>> Andy Lutomirski <[email protected]> writes:
>>>>> Maybe we want to say that rootfs should not be used if we are going to
>>>>> create containers...
>>>
>>> Today it is an assumption of the vfs that rootfs is mounted. With
>>> rootfs mounted and pivot_root at the base of the mount stack you can
>>> make as minimal of a set of mounts as the vfs allows.
>>>
>>> Removing rootfs from the vfs requires an audit of everything that
>>> manipulates mounts. It is not remotely a local excercise.
>>
>> Would it be a less invasive audit to allow different mount namespaces
>> to have different rootfses?
>
> I.E. The same way different namespaces have different init tasks?
>
> The abstraction containers has implemented here should be logically
> consistent.
>
>>>> Could we have an extra rootfs-like fs that is always completely empty,
>>>> doesn't allow any writes, and can sit at the bottom of container
>>>> namespace hierarchies? If so, and if we add a new syscall that's like
>>>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
>>>> to that rootfs then.
>>>
>>> Or equally have something that guarantees that rootfs is empty and
>>> read-only at the time the normal root filesystem is mounted. That is
>>> certainly a much more localized change if we want to go there.
>>>
>>> I am half tempted to suggest that mount --move /some/path / be updated
>>> to make the old / just go away (perhaps to be replaced with a read-only
>>> empty rootfs). That gets us into figuring out if we break userspace
>>> which is a big challenge.
>>
>> Hence my argument for a new syscall or entirely new operation.
>
> I'm still waiting for somebody to explain to my why chroot() shouldn't
> be changed to do this instead of adding a new syscall. (At least when
> mount namespace support is enabled.)

Because chroot has no effect on the namespace at all. If you fork and
the child chroots, the parent isn't chrooted. And, more importantly
for my example, is a process has it's cwd as /foo, and then it forks
and the child chroots, then parent's ".." isn't changed as a result of
the chroot.

>
>> mount(2) and friends are way too multiplexed right now. I just found
>> yet another security bug due to the insanely complicated semantics of
>> the vfs syscalls. (Yes, a different one from the one yesterday.)
>
> As the guy who rewrote busybox mount 3 times, and who just implemented a
> brand new one (toybox) from scratch:
>
> It's a bit fiddly, yes.
>
>> A new operation kills several birds with one stone. It could look like:
>>
>> int mntns_change_root(int dfd, const char *path, int flags);
>>
>> return -EPERM if chrooted.
>
> Really?

Now that CVE-2014-7970 is public: what the heck is pivot_root supposed
to do if the caller is chrooted? The current behavior is obviously
incorrect (it leaks memory), but it's not entirely clear to me what
should happen. I think it should either be disallowed or should have
well-defined semantics.

For simplicity, if a new syscall for this is added, then I think that
the caller-is-chrooted case should be disallowed. If someone needs it
and can articulate what the semantics should be, then I have no
problem with allowing it going forward.

>
>> Returns -EINVAL if path (relative to dfd) isn't a mountmount.
>
> Requiring that chroot() only be called on mountpoints would break
> existing semantics, which gets us back to new systemcall instead of
> changing behavior of existing one.

But we're talking about a pivot_root replacement. You can always
create a bindmount. Alternatively, the syscall could create a fresh
bind-mount and reattach all children to it if needed.

>
> If I recall, the first line of pushback against merging the openvz code
> as is was "buckets of new syscalls". Pushback against adding a new
> system call is understandable. Why can't we fix chroot() now that we
> have the tools to do so?

Because chroot already does something else. In particularly, the new
"root"'s parents are *still there*, which is why it's so easy to
escape from a chroot. Sensible container systems (and initramfs
code!) wants to clean up all the mounts that should be unreachable.

>
>> Otherwise it disconnects path from the existing
>> hierarchy, attaches a permanently-empty read-only rootfs under it,
>> makes it the root of the mntns, and does the root refs fixup. The old
>> hierarchy gets thrown out.
>
> We have a chroot() syscall. We don't use it for containers because it
> doesn't do what we want. Does it currently do what _anybody_ wants?

Irrelevant. It's POSIX and we'll break all kinds of things if we change it.

>
>> Systemd could use this, too.
>
> While that's a strong argument against it, I'm willing to overlook it.

:)

Feel free to read that as "an initramfs /init that prepares to hand
off to /sbin/init could use this." busybox's switch_root could do
this, too, and even my virtme tool would indirectly benefit slightly.
(virtme uses an init system that is a few tens of lines of
busybox-compatible shell script that runs in a highly controlled
environment.)

--Andy

2014-10-08 23:39:10

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

Quoting Andy Lutomirski ([email protected]):
> On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley <[email protected]> wrote:
> > On 10/08/14 14:31, Andy Lutomirski wrote:
> >> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
> >> <[email protected]> wrote:
> >>> Andy Lutomirski <[email protected]> writes:
> >>>>> Maybe we want to say that rootfs should not be used if we are going to
> >>>>> create containers...
> >>>
> >>> Today it is an assumption of the vfs that rootfs is mounted. With
> >>> rootfs mounted and pivot_root at the base of the mount stack you can
> >>> make as minimal of a set of mounts as the vfs allows.
> >>>
> >>> Removing rootfs from the vfs requires an audit of everything that
> >>> manipulates mounts. It is not remotely a local excercise.
> >>
> >> Would it be a less invasive audit to allow different mount namespaces
> >> to have different rootfses?
> >
> > I.E. The same way different namespaces have different init tasks?
> >
> > The abstraction containers has implemented here should be logically
> > consistent.
> >
> >>>> Could we have an extra rootfs-like fs that is always completely empty,
> >>>> doesn't allow any writes, and can sit at the bottom of container
> >>>> namespace hierarchies? If so, and if we add a new syscall that's like
> >>>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
> >>>> to that rootfs then.
> >>>
> >>> Or equally have something that guarantees that rootfs is empty and
> >>> read-only at the time the normal root filesystem is mounted. That is
> >>> certainly a much more localized change if we want to go there.
> >>>
> >>> I am half tempted to suggest that mount --move /some/path / be updated
> >>> to make the old / just go away (perhaps to be replaced with a read-only
> >>> empty rootfs). That gets us into figuring out if we break userspace
> >>> which is a big challenge.
> >>
> >> Hence my argument for a new syscall or entirely new operation.
> >
> > I'm still waiting for somebody to explain to my why chroot() shouldn't
> > be changed to do this instead of adding a new syscall. (At least when
> > mount namespace support is enabled.)
>
> Because chroot has no effect on the namespace at all. If you fork and
> the child chroots, the parent isn't chrooted. And, more importantly
> for my example, is a process has it's cwd as /foo, and then it forks
> and the child chroots, then parent's ".." isn't changed as a result of
> the chroot.
>
> >
> >> mount(2) and friends are way too multiplexed right now. I just found
> >> yet another security bug due to the insanely complicated semantics of
> >> the vfs syscalls. (Yes, a different one from the one yesterday.)
> >
> > As the guy who rewrote busybox mount 3 times, and who just implemented a
> > brand new one (toybox) from scratch:
> >
> > It's a bit fiddly, yes.
> >
> >> A new operation kills several birds with one stone. It could look like:
> >>
> >> int mntns_change_root(int dfd, const char *path, int flags);
> >>
> >> return -EPERM if chrooted.
> >
> > Really?
>
> Now that CVE-2014-7970 is public: what the heck is pivot_root supposed
> to do if the caller is chrooted? The current behavior is obviously
> incorrect (it leaks memory), but it's not entirely clear to me what
> should happen. I think it should either be disallowed or should have
> well-defined semantics.
>
> For simplicity, if a new syscall for this is added, then I think that
> the caller-is-chrooted case should be disallowed. If someone needs it
> and can articulate what the semantics should be, then I have no
> problem with allowing it going forward.

It's not that I'd have a need for that, but rather if for some
reason I started out chrooted due to some bogus initramfs, I'd
prefer to not have to feel like a criminial and escape the chroot
first.

2014-10-08 23:41:40

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On Wed, Oct 8, 2014 at 4:38 PM, Serge Hallyn <[email protected]> wrote:
> Quoting Andy Lutomirski ([email protected]):
>> On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley <[email protected]> wrote:
>> > On 10/08/14 14:31, Andy Lutomirski wrote:
>> >> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
>> >> <[email protected]> wrote:
>> >>> Andy Lutomirski <[email protected]> writes:
>> >>>>> Maybe we want to say that rootfs should not be used if we are going to
>> >>>>> create containers...
>> >>>
>> >>> Today it is an assumption of the vfs that rootfs is mounted. With
>> >>> rootfs mounted and pivot_root at the base of the mount stack you can
>> >>> make as minimal of a set of mounts as the vfs allows.
>> >>>
>> >>> Removing rootfs from the vfs requires an audit of everything that
>> >>> manipulates mounts. It is not remotely a local excercise.
>> >>
>> >> Would it be a less invasive audit to allow different mount namespaces
>> >> to have different rootfses?
>> >
>> > I.E. The same way different namespaces have different init tasks?
>> >
>> > The abstraction containers has implemented here should be logically
>> > consistent.
>> >
>> >>>> Could we have an extra rootfs-like fs that is always completely empty,
>> >>>> doesn't allow any writes, and can sit at the bottom of container
>> >>>> namespace hierarchies? If so, and if we add a new syscall that's like
>> >>>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
>> >>>> to that rootfs then.
>> >>>
>> >>> Or equally have something that guarantees that rootfs is empty and
>> >>> read-only at the time the normal root filesystem is mounted. That is
>> >>> certainly a much more localized change if we want to go there.
>> >>>
>> >>> I am half tempted to suggest that mount --move /some/path / be updated
>> >>> to make the old / just go away (perhaps to be replaced with a read-only
>> >>> empty rootfs). That gets us into figuring out if we break userspace
>> >>> which is a big challenge.
>> >>
>> >> Hence my argument for a new syscall or entirely new operation.
>> >
>> > I'm still waiting for somebody to explain to my why chroot() shouldn't
>> > be changed to do this instead of adding a new syscall. (At least when
>> > mount namespace support is enabled.)
>>
>> Because chroot has no effect on the namespace at all. If you fork and
>> the child chroots, the parent isn't chrooted. And, more importantly
>> for my example, is a process has it's cwd as /foo, and then it forks
>> and the child chroots, then parent's ".." isn't changed as a result of
>> the chroot.
>>
>> >
>> >> mount(2) and friends are way too multiplexed right now. I just found
>> >> yet another security bug due to the insanely complicated semantics of
>> >> the vfs syscalls. (Yes, a different one from the one yesterday.)
>> >
>> > As the guy who rewrote busybox mount 3 times, and who just implemented a
>> > brand new one (toybox) from scratch:
>> >
>> > It's a bit fiddly, yes.
>> >
>> >> A new operation kills several birds with one stone. It could look like:
>> >>
>> >> int mntns_change_root(int dfd, const char *path, int flags);
>> >>
>> >> return -EPERM if chrooted.
>> >
>> > Really?
>>
>> Now that CVE-2014-7970 is public: what the heck is pivot_root supposed
>> to do if the caller is chrooted? The current behavior is obviously
>> incorrect (it leaks memory), but it's not entirely clear to me what
>> should happen. I think it should either be disallowed or should have
>> well-defined semantics.
>>
>> For simplicity, if a new syscall for this is added, then I think that
>> the caller-is-chrooted case should be disallowed. If someone needs it
>> and can articulate what the semantics should be, then I have no
>> problem with allowing it going forward.
>
> It's not that I'd have a need for that, but rather if for some
> reason I started out chrooted due to some bogus initramfs, I'd
> prefer to not have to feel like a criminial and escape the chroot
> first.

You already can't create a userns if you're chrooted (even if you have
global privilege).

--Andy

2014-10-09 10:30:00

by Andrew Vagin

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

On Wed, Oct 08, 2014 at 12:23:52PM -0700, Eric W. Biederman wrote:
> Andy Lutomirski <[email protected]> writes:
>
> > On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin <[email protected]> wrote:
> >> On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
> >>> Andrey Vagin <[email protected]> writes:
> >>>
> >>> > From: Andrey Vagin <[email protected]>
> >>> >
> >>> > Currently when we create a new container with a separate root,
> >>> > we need to clone the current mount namespace with all mounts and then
> >>> > clean up it by using pivot_root(). A big part of mountpoints are cloned
> >>> > only to be umounted.
> >>>
> >>> Is the motivation performance? Because if that is the motivation we
> >>> need numbers.
> >>
> >> The major motivation to create a clean mount namespace which contains
> >> only required mounts.
> >>
> >> Now you want to convince us that there is nothing wrong if we use
> >> userns, because all inherited mounts are locked. My point is that all
> >> useless mounts should be umounted. If the current root isn't on rootfs,
> >> pivot_root() allows us to umount all useless points. But pivot_root()
> >> doesn't work, if the current root is on rootfs. How can we umount
> >> useless points in this case?
>
> One of your justifications for a new system call was so you could do
> less. Doing less to get to where you want to go is only justified when
> your doing less to get better performance.
>
> >> Maybe we want to say that rootfs should not be used if we are going to
> >> create containers...
>
> Today it is an assumption of the vfs that rootfs is mounted. With
> rootfs mounted and pivot_root at the base of the mount stack you can
> make as minimal of a set of mounts as the vfs allows.

You have misunderstood me.
For most system /proc/self/mountinfo looks like this:
[root@dhcp-10-30-23-214 ~]# cat /proc/self/mountinfo
17 22 0:3 / /proc rw,relatime - proc proc rw
18 22 0:0 / /sys rw,relatime - sysfs sysfs rw
19 22 0:5 / /dev rw,relatime - devtmpfs devtmpfs rw,size=502324k,nr_inodes=125581,mode=755
20 19 0:11 / /dev/pts rw,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000
21 19 0:17 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw
22 1 253:2 / / rw,relatime - ext4 /dev/vda2 rw,barrier=1,data=ordered
24 22 253:1 / /boot rw,relatime - ext3 /dev/vda1 rw,errors=continue,user_xattr,acl,barrier=1,data=ordered

/ isn't a rootfs mount here and pivot_root() works fine in this case. Here is
no problem for such system.

Now look at the second case:
hell@android:/ $ cat /proc/self/mountinfo
1 1 0:1 / / ro,relatime - rootfs rootfs ro
11 1 0:11 / /dev rw,nosuid,relatime - tmpfs tmpfs rw,mode=755
12 11 0:9 / /dev/pts rw,relatime - devpts devpts rw,mode=600
13 1 0:3 / /proc rw,relatime - proc proc rw
14 1 0:12 / /sys rw,relatime - sysfs sysfs rw

Now / is an rootfs mount. pivot_root() doesn't work in this case and we
need to do some tricks to get a minimal set of mounts.

Thanks,
Andrew

>
> Removing rootfs from the vfs requires an audit of everything that
> manipulates mounts. It is not remotely a local excercise.
>
> One of the things that needs to be considered is that if you really want
> to audit mounts is the code that needs manipulates them needs to be
> audited every bit as much as the mounts themselves.
>
> > Could we have an extra rootfs-like fs that is always completely empty,
> > doesn't allow any writes, and can sit at the bottom of container
> > namespace hierarchies? If so, and if we add a new syscall that's like
> > pivot_root (or unshare) but prunes the hierarchy, then we could switch
> > to that rootfs then.
>
> Or equally have something that guarantees that rootfs is empty and
> read-only at the time the normal root filesystem is mounted. That is
> certainly a much more localized change if we want to go there.
>
> I am half tempted to suggest that mount --move /some/path / be updated
> to make the old / just go away (perhaps to be replaced with a read-only
> empty rootfs). That gets us into figuring out if we break userspace
> which is a big challenge.
>
> Eric