2014-10-07 12:00:35

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH] mnt: don't allow to detach the namespace root

This patch fixes a bug, which is triggered by following code:
while (1) {
if (umount2("/", MNT_DETACH) ||
setns(fd, CLONE_NEWNS))
return break;
}

[ 260.548301] ------------[ cut here ]------------
[ 260.550941] kernel BUG at /build/buildd/linux-3.13.0/fs/pnode.c:372!
[ 260.552068] invalid opcode: 0000 [#1] SMP
[ 260.552068] Modules linked in: xt_CHECKSUM iptable_mangle xt_tcpudp xt_addrtype xt_conntrack ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bridge stp llc dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison iptable_filter ip_tables x_tables crct10dif_pclmul crc32_pclmul ghash_clmulni_intel binfmt_misc nfsd auth_rpcgss nfs_acl aesni_intel nfs lockd aes_x86_64 sunrpc fscache lrw gf128mul glue_helper ablk_helper cryptd serio_raw ppdev parport_pc lp parport btrfs xor raid6_pq libcrc32c psmouse floppy
[ 260.552068] CPU: 0 PID: 1723 Comm: nsenter Not tainted 3.13.0-30-generic #55-Ubuntu
[ 260.552068] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 260.552068] task: ffff8800376097f0 ti: ffff880074824000 task.ti: ffff880074824000
[ 260.552068] RIP: 0010:[<ffffffff811e9483>] [<ffffffff811e9483>] propagate_umount+0x123/0x130
[ 260.552068] RSP: 0018:ffff880074825e98 EFLAGS: 00010246
[ 260.552068] RAX: ffff88007c741140 RBX: 0000000000000002 RCX: ffff88007c741190
[ 260.552068] RDX: ffff88007c741190 RSI: ffff880074825ec0 RDI: ffff880074825ec0
[ 260.552068] RBP: ffff880074825eb0 R08: 00000000000172e0 R09: ffff88007fc172e0
[ 260.552068] R10: ffffffff811cc642 R11: ffffea0001d59000 R12: ffff88007c741140
[ 260.552068] R13: ffff88007c741140 R14: ffff88007c741140 R15: 0000000000000000
[ 260.552068] FS: 00007fd5c7e41740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 260.552068] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 260.552068] CR2: 00007fd5c7968050 CR3: 0000000070124000 CR4: 00000000000406f0
[ 260.552068] Stack:
[ 260.552068] 0000000000000002 0000000000000002 ffff88007c631000 ffff880074825ed8
[ 260.552068] ffffffff811dcfac ffff88007c741140 0000000000000002 ffff88007c741160
[ 260.552068] ffff880074825f38 ffffffff811dd12b ffffffff811cc642 0000000075640000
[ 260.552068] Call Trace:
[ 260.552068] [<ffffffff811dcfac>] umount_tree+0x20c/0x260
[ 260.552068] [<ffffffff811dd12b>] do_umount+0x12b/0x300
[ 260.552068] [<ffffffff811cc642>] ? final_putname+0x22/0x50
[ 260.552068] [<ffffffff811cc849>] ? putname+0x29/0x40
[ 260.552068] [<ffffffff811dd88c>] SyS_umount+0xdc/0x100
[ 260.552068] [<ffffffff8172aeff>] tracesys+0xe1/0xe6
[ 260.552068] Code: 89 50 08 48 8b 50 08 48 89 02 49 89 45 08 e9 72 ff ff ff 0f 1f 44 00 00 4c 89 e6 4c 89 e7 e8 f5 f6 ff ff 48 89 c3 e9 39 ff ff ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 66 66 66 66 90 55 b8 01
[ 260.552068] RIP [<ffffffff811e9483>] propagate_umount+0x123/0x130
[ 260.552068] RSP <ffff880074825e98>
[ 260.611451] ---[ end trace 11c33d85f1d4c652 ]--

Cc: Alexander Viro <[email protected]>
Cc: Serge Hallyn <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
---
fs/namespace.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index f50a848..a63e0a9 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1363,6 +1363,9 @@ static int do_umount(struct mount *mnt, int flags)
return retval;
}

+ if (!mnt_has_parent(mnt))
+ return -EINVAL;
+
namespace_lock();
lock_mount_hash();
event++;
--
1.9.3


2014-10-07 13:24:40

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] mnt: don't allow to detach the namespace root

On Tue, Oct 07, 2014 at 04:00:12PM +0400, Andrey Vagin wrote:
> This patch fixes a bug, which is triggered by following code:
> while (1) {
> if (umount2("/", MNT_DETACH) ||
> setns(fd, CLONE_NEWNS))
> return break;
> }

Excuse me, but that makes no sense whatsoever (not to mention that
reproducer won't compile - return break; alone is enough to make
sure of that).

Could you post the real reproducer?

2014-10-07 13:40:18

by Andrew Vagin

[permalink] [raw]
Subject: Re: [PATCH] mnt: don't allow to detach the namespace root

On Tue, Oct 07, 2014 at 02:24:36PM +0100, Al Viro wrote:
> On Tue, Oct 07, 2014 at 04:00:12PM +0400, Andrey Vagin wrote:
> > This patch fixes a bug, which is triggered by following code:
> > while (1) {
> > if (umount2("/", MNT_DETACH) ||
> > setns(fd, CLONE_NEWNS))
> > return break;
> > }
>
> Excuse me, but that makes no sense whatsoever (not to mention that
> reproducer won't compile - return break; alone is enough to make
> sure of that).
>
> Could you post the real reproducer?

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>
#include <sys/mount.h>

int main(int argc, char **argv)
{
int fd;

fd = open("/proc/self/ns/mnt", O_RDONLY);
if (fd < 0)
return 1;
while (1) {
if (umount2("/", MNT_DETACH) ||
setns(fd, CLONE_NEWNS))
break;
}

return 0;
}

root@ubuntu:/home/avagin# gcc -Wall nsenter.c -o nsenter
root@ubuntu:/home/avagin# strace ./nsenter
execve("./nsenter", ["./nsenter"], [/* 22 vars */]) = 0
...
open("/proc/self/ns/mnt", O_RDONLY) = 3
umount("/", MNT_DETACH) = 0
setns(3, 131072) = 0
umount("/", MNT_DETACH

2014-10-07 19:27:41

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH] umount: Do not allow unmounting rootfs.


Andrew Vagin <[email protected]> writes:

> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sched.h>
> #include <unistd.h>
> #include <sys/mount.h>
>
> int main(int argc, char **argv)
> {
> int fd;
>
> fd = open("/proc/self/ns/mnt", O_RDONLY);
> if (fd < 0)
> return 1;
> while (1) {
> if (umount2("/", MNT_DETACH) ||
> setns(fd, CLONE_NEWNS))
> break;
> }
>
> return 0;
> }
>
> root@ubuntu:/home/avagin# gcc -Wall nsenter.c -o nsenter
> root@ubuntu:/home/avagin# strace ./nsenter
> execve("./nsenter", ["./nsenter"], [/* 22 vars */]) = 0
> ...
> open("/proc/self/ns/mnt", O_RDONLY) = 3
> umount("/", MNT_DETACH) = 0
> setns(3, 131072) = 0
> umount("/", MNT_DETACH
>
causes:

> [ 260.548301] ------------[ cut here ]------------
> [ 260.550941] kernel BUG at /build/buildd/linux-3.13.0/fs/pnode.c:372!
> [ 260.552068] invalid opcode: 0000 [#1] SMP
> [ 260.552068] Modules linked in: xt_CHECKSUM iptable_mangle xt_tcpudp xt_addrtype xt_conntrack ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bridge stp llc dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison iptable_filter ip_tables x_tables crct10dif_pclmul crc32_pclmul ghash_clmulni_intel binfmt_misc nfsd auth_rpcgss nfs_acl aesni_intel nfs lockd aes_x86_64 sunrpc fscache lrw gf128mul glue_helper ablk_helper cryptd serio_raw ppdev parport_pc lp parport btrfs xor raid6_pq libcrc32c psmouse floppy
> [ 260.552068] CPU: 0 PID: 1723 Comm: nsenter Not tainted 3.13.0-30-generic #55-Ubuntu
> [ 260.552068] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 260.552068] task: ffff8800376097f0 ti: ffff880074824000 task.ti: ffff880074824000
> [ 260.552068] RIP: 0010:[<ffffffff811e9483>] [<ffffffff811e9483>] propagate_umount+0x123/0x130
> [ 260.552068] RSP: 0018:ffff880074825e98 EFLAGS: 00010246
> [ 260.552068] RAX: ffff88007c741140 RBX: 0000000000000002 RCX: ffff88007c741190
> [ 260.552068] RDX: ffff88007c741190 RSI: ffff880074825ec0 RDI: ffff880074825ec0
> [ 260.552068] RBP: ffff880074825eb0 R08: 00000000000172e0 R09: ffff88007fc172e0
> [ 260.552068] R10: ffffffff811cc642 R11: ffffea0001d59000 R12: ffff88007c741140
> [ 260.552068] R13: ffff88007c741140 R14: ffff88007c741140 R15: 0000000000000000
> [ 260.552068] FS: 00007fd5c7e41740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
> [ 260.552068] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 260.552068] CR2: 00007fd5c7968050 CR3: 0000000070124000 CR4: 00000000000406f0
> [ 260.552068] Stack:
> [ 260.552068] 0000000000000002 0000000000000002 ffff88007c631000 ffff880074825ed8
> [ 260.552068] ffffffff811dcfac ffff88007c741140 0000000000000002 ffff88007c741160
> [ 260.552068] ffff880074825f38 ffffffff811dd12b ffffffff811cc642 0000000075640000
> [ 260.552068] Call Trace:
> [ 260.552068] [<ffffffff811dcfac>] umount_tree+0x20c/0x260
> [ 260.552068] [<ffffffff811dd12b>] do_umount+0x12b/0x300
> [ 260.552068] [<ffffffff811cc642>] ? final_putname+0x22/0x50
> [ 260.552068] [<ffffffff811cc849>] ? putname+0x29/0x40
> [ 260.552068] [<ffffffff811dd88c>] SyS_umount+0xdc/0x100
> [ 260.552068] [<ffffffff8172aeff>] tracesys+0xe1/0xe6
> [ 260.552068] Code: 89 50 08 48 8b 50 08 48 89 02 49 89 45 08 e9 72 ff ff ff 0f 1f 44 00 00 4c 89 e6 4c 89 e7 e8 f5 f6 ff ff 48 89 c3 e9 39 ff ff ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 66 66 66 66 90 55 b8 01
> [ 260.552068] RIP [<ffffffff811e9483>] propagate_umount+0x123/0x130
> [ 260.552068] RSP <ffff880074825e98>
> [ 260.611451] ---[ end trace 11c33d85f1d4c652 ]--

Which in practice is totally uninteresting. Only the global root user can
do it, and it is just a stupid thing to do.

However that is no excuse to allow a silly way to oops the kernel.

We can avoid this silly problem by setting MNT_LOCKED on the rootfs
mount point and thus avoid needing any special cases in the unmount
code.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/namespace.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 7c5834381a73..9c6ee037bef7 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2887,6 +2887,7 @@ static void __init init_mount_tree(void)

root.mnt = mnt;
root.dentry = mnt->mnt_root;
+ mnt->mnt_flags |= MNT_LOCKED;

set_fs_pwd(current->fs, &root);
set_fs_root(current->fs, &root);
--
1.9.1

2014-10-07 19:53:14

by Andrew Vagin

[permalink] [raw]
Subject: Re: [PATCH] umount: Do not allow unmounting rootfs.

On Tue, Oct 07, 2014 at 12:27:06PM -0700, Eric W. Biederman wrote:
>
> Andrew Vagin <[email protected]> writes:
>
> > #define _GNU_SOURCE
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > #include <sched.h>
> > #include <unistd.h>
> > #include <sys/mount.h>
> >
> > int main(int argc, char **argv)
> > {
> > int fd;
> >
> > fd = open("/proc/self/ns/mnt", O_RDONLY);
> > if (fd < 0)
> > return 1;
> > while (1) {
> > if (umount2("/", MNT_DETACH) ||
> > setns(fd, CLONE_NEWNS))
> > break;
> > }
> >
> > return 0;
> > }
> >
> > root@ubuntu:/home/avagin# gcc -Wall nsenter.c -o nsenter
> > root@ubuntu:/home/avagin# strace ./nsenter
> > execve("./nsenter", ["./nsenter"], [/* 22 vars */]) = 0
> > ...
> > open("/proc/self/ns/mnt", O_RDONLY) = 3
> > umount("/", MNT_DETACH) = 0
> > setns(3, 131072) = 0
> > umount("/", MNT_DETACH
> >
> causes:
>
> > [ 260.548301] ------------[ cut here ]------------
> > [ 260.550941] kernel BUG at /build/buildd/linux-3.13.0/fs/pnode.c:372!
> > [ 260.552068] invalid opcode: 0000 [#1] SMP
> > [ 260.552068] Modules linked in: xt_CHECKSUM iptable_mangle xt_tcpudp xt_addrtype xt_conntrack ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bridge stp llc dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison iptable_filter ip_tables x_tables crct10dif_pclmul crc32_pclmul ghash_clmulni_intel binfmt_misc nfsd auth_rpcgss nfs_acl aesni_intel nfs lockd aes_x86_64 sunrpc fscache lrw gf128mul glue_helper ablk_helper cryptd serio_raw ppdev parport_pc lp parport btrfs xor raid6_pq libcrc32c psmouse floppy
> > [ 260.552068] CPU: 0 PID: 1723 Comm: nsenter Not tainted 3.13.0-30-generic #55-Ubuntu
> > [ 260.552068] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > [ 260.552068] task: ffff8800376097f0 ti: ffff880074824000 task.ti: ffff880074824000
> > [ 260.552068] RIP: 0010:[<ffffffff811e9483>] [<ffffffff811e9483>] propagate_umount+0x123/0x130
> > [ 260.552068] RSP: 0018:ffff880074825e98 EFLAGS: 00010246
> > [ 260.552068] RAX: ffff88007c741140 RBX: 0000000000000002 RCX: ffff88007c741190
> > [ 260.552068] RDX: ffff88007c741190 RSI: ffff880074825ec0 RDI: ffff880074825ec0
> > [ 260.552068] RBP: ffff880074825eb0 R08: 00000000000172e0 R09: ffff88007fc172e0
> > [ 260.552068] R10: ffffffff811cc642 R11: ffffea0001d59000 R12: ffff88007c741140
> > [ 260.552068] R13: ffff88007c741140 R14: ffff88007c741140 R15: 0000000000000000
> > [ 260.552068] FS: 00007fd5c7e41740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
> > [ 260.552068] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 260.552068] CR2: 00007fd5c7968050 CR3: 0000000070124000 CR4: 00000000000406f0
> > [ 260.552068] Stack:
> > [ 260.552068] 0000000000000002 0000000000000002 ffff88007c631000 ffff880074825ed8
> > [ 260.552068] ffffffff811dcfac ffff88007c741140 0000000000000002 ffff88007c741160
> > [ 260.552068] ffff880074825f38 ffffffff811dd12b ffffffff811cc642 0000000075640000
> > [ 260.552068] Call Trace:
> > [ 260.552068] [<ffffffff811dcfac>] umount_tree+0x20c/0x260
> > [ 260.552068] [<ffffffff811dd12b>] do_umount+0x12b/0x300
> > [ 260.552068] [<ffffffff811cc642>] ? final_putname+0x22/0x50
> > [ 260.552068] [<ffffffff811cc849>] ? putname+0x29/0x40
> > [ 260.552068] [<ffffffff811dd88c>] SyS_umount+0xdc/0x100
> > [ 260.552068] [<ffffffff8172aeff>] tracesys+0xe1/0xe6
> > [ 260.552068] Code: 89 50 08 48 8b 50 08 48 89 02 49 89 45 08 e9 72 ff ff ff 0f 1f 44 00 00 4c 89 e6 4c 89 e7 e8 f5 f6 ff ff 48 89 c3 e9 39 ff ff ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 66 66 66 66 90 55 b8 01
> > [ 260.552068] RIP [<ffffffff811e9483>] propagate_umount+0x123/0x130
> > [ 260.552068] RSP <ffff880074825e98>
> > [ 260.611451] ---[ end trace 11c33d85f1d4c652 ]--
>
> Which in practice is totally uninteresting. Only the global root user can
> do it, and it is just a stupid thing to do.
>
> However that is no excuse to allow a silly way to oops the kernel.
>
> We can avoid this silly problem by setting MNT_LOCKED on the rootfs
> mount point and thus avoid needing any special cases in the unmount
> code.

I had this idea too, but it doesn't work.

MNT_LOCKED isn't inherited, if the privileged user creates a new mount
namespace.

So "unshame -m ./nsenter" reproduces the same BUG.

Thanks,
Andrey

>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/namespace.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 7c5834381a73..9c6ee037bef7 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2887,6 +2887,7 @@ static void __init init_mount_tree(void)
>
> root.mnt = mnt;
> root.dentry = mnt->mnt_root;
> + mnt->mnt_flags |= MNT_LOCKED;
>
> set_fs_pwd(current->fs, &root);
> set_fs_root(current->fs, &root);
> --
> 1.9.1
>

2014-10-07 20:58:34

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] umount: Do not allow unmounting rootfs.

Andrew Vagin <[email protected]> writes:

> On Tue, Oct 07, 2014 at 12:27:06PM -0700, Eric W. Biederman wrote:
>>
>> Which in practice is totally uninteresting. Only the global root user can
>> do it, and it is just a stupid thing to do.
>>
>> However that is no excuse to allow a silly way to oops the kernel.
>>
>> We can avoid this silly problem by setting MNT_LOCKED on the rootfs
>> mount point and thus avoid needing any special cases in the unmount
>> code.
>
> I had this idea too, but it doesn't work.
>
> MNT_LOCKED isn't inherited, if the privileged user creates a new mount
> namespace.
>
> So "unshame -m ./nsenter" reproduces the same BUG.

Which broken tree do you have where MNT_LOCKED is not inherited?

That case fails to reproduce the BUG for me.

The semantics of MNT_LOCKED are that you aren't allowed to see what is
beneath. So if you can get under there even by unsharing the mount
namespace it is an implementation bug in MNT_LOCKED.

Eric

2014-10-07 22:00:08

by Andrew Vagin

[permalink] [raw]
Subject: Re: [PATCH] umount: Do not allow unmounting rootfs.

On Tue, Oct 07, 2014 at 01:58:01PM -0700, Eric W. Biederman wrote:
> Andrew Vagin <[email protected]> writes:
>
> > On Tue, Oct 07, 2014 at 12:27:06PM -0700, Eric W. Biederman wrote:
> >>
> >> Which in practice is totally uninteresting. Only the global root user can
> >> do it, and it is just a stupid thing to do.
> >>
> >> However that is no excuse to allow a silly way to oops the kernel.
> >>
> >> We can avoid this silly problem by setting MNT_LOCKED on the rootfs
> >> mount point and thus avoid needing any special cases in the unmount
> >> code.
> >
> > I had this idea too, but it doesn't work.
> >
> > MNT_LOCKED isn't inherited, if the privileged user creates a new mount
> > namespace.
> >
> > So "unshame -m ./nsenter" reproduces the same BUG.
>
> Which broken tree do you have where MNT_LOCKED is not inherited?

It is Linus' tree with your patch.

I commented out one line and the BUG isn't triggered any more.

diff --git a/fs/namespace.c b/fs/namespace.c
index 15676e9..eacfcad 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1494,7 +1494,7 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
if (IS_ERR(q))
return q;

- q->mnt.mnt_flags &= ~MNT_LOCKED;
+// q->mnt.mnt_flags &= ~MNT_LOCKED;
q->mnt_mountpoint = mnt->mnt_mountpoint;

p = mnt;

>
> That case fails to reproduce the BUG for me.
>
> The semantics of MNT_LOCKED are that you aren't allowed to see what is
> beneath. So if you can get under there even by unsharing the mount
> namespace it is an implementation bug in MNT_LOCKED.

I have applied your patch to the Linus' tree. Look at this:

[avagin@localhost linux-cr]$ git log --pretty=oneline | head -n 5
4da63ceb9069993435deb16b017c9419ddbc5ac1 umount: Do not allow unmounting rootfs.
bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9 Linux 3.17
ef0a59924a795ccb4ced0ae1722a337745a1b045 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
7b6ea43d3f90ba1db87883126c2c09777f51d3d6 Merge tag 'tiny/kconfig-for-3.17' of https://git.kernel.org/pub/scm/linux/kernel/git/josh/linux
62b4d2041117f35ab2409c9f5c4b8d3dc8e59d0f init/Kconfig: Fix HAVE_FUTEX_CMPXCHG to not break up the EXPERT menu
[avagin@localhost linux-cr]$ git show 4da63ceb9069993435deb16b017c9419ddbc5ac1 | head -n 4
commit 4da63ceb9069993435deb16b017c9419ddbc5ac1
Author: Eric W. Biederman <[email protected]>
Date: Tue Oct 7 12:27:06 2014 -0700

naavagin@ubuntu:~$ uname -a
Linux ubuntu 3.17.0-00001-g4da63ce #58 SMP Wed Oct 8 01:29:11 MSK 2014 x86_64 x86_64 x86_64 GNU/Linux
avagin@ubuntu:~$ cat nsenter.c
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sched.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mount.h>

int main(int argc, char **argv)
{
int fd;

fd = open("/proc/self/ns/mnt", O_RDONLY);
if (fd < 0)
return 1;
umount2("/", MNT_DETACH);
if (setns(fd, CLONE_NEWNS))
return 1;
umount2("/", MNT_DETACH);

return 0;
}

root@ubuntu:/home/avagin# gcc -Wall -o nsenter nsenter.c
root@ubuntu:/home/avagin# unshare -m ./nsenter

[ 77.723836] ------------[ cut here ]------------
[ 77.724018] kernel BUG at fs/pnode.c:372!
[ 77.724018] invalid opcode: 0000 [#1] SMP
[ 77.724018] Modules linked in: microcode joydev virtio_balloon i2c_piix4 i2c_core nfsd nfs lockd sunrpc fscache fuse virtio_blk virtio_net floppy
[ 77.724018] CPU: 0 PID: 1050 Comm: nsenter Not tainted 3.17.0-00001-g4da63ce #58
[ 77.724018] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 77.724018] task: ffff88003a87c4a0 ti: ffff880035a58000 task.ti: ffff880035a58000
[ 77.724018] RIP: 0010:[<ffffffff81208c03>] [<ffffffff81208c03>] propagate_umount+0x153/0x160
[ 77.724018] RSP: 0018:ffff880035a5be60 EFLAGS: 00010246
[ 77.724018] RAX: ffff88003a9c36e0 RBX: 0000000000000000 RCX: dead000000200200
[ 77.724018] RDX: ffff88003a9c36e0 RSI: ffff880035a5be98 RDI: ffff880035a5be98
[ 77.724018] RBP: ffff880035a5be88 R08: ffff88003a9c36e0 R09: 0000000000000000
[ 77.724018] R10: ffff88003a87c4a0 R11: ffff88003a87cd10 R12: ffff88003a9c3680
[ 77.724018] R13: ffff88003a9c3680 R14: ffff88003a9c3680 R15: 0000000000000000
[ 77.724018] FS: 00007ff0e55f6740(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[ 77.724018] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 77.724018] CR2: 00007ff0e5109e90 CR3: 00000000350b9000 CR4: 00000000000406f0
[ 77.724018] Stack:
[ 77.724018] 0000000000000246 0000000000000000 0000000000000002 ffff88003a9c36e0
[ 77.724018] ffff88003a9c3680 ffff880035a5bec0 ffffffff811fbda1 ffff88003a9c3680
[ 77.724018] 00000000bbc7c04f 0000000000000002 ffff88003a9c36a0 ffff88003e01ef60
[ 77.724018] Call Trace:
[ 77.724018] [<ffffffff811fbda1>] umount_tree+0x251/0x260
[ 77.724018] [<ffffffff811fbf52>] do_umount+0x1a2/0x3a0
[ 77.724018] [<ffffffff811fc7bc>] ? SyS_umount+0xec/0x110
[ 77.724018] [<ffffffff811e4f49>] ? putname+0x29/0x40
[ 77.724018] [<ffffffff811fc7bc>] SyS_umount+0xec/0x110
[ 77.724018] [<ffffffff816ea7e9>] system_call_fastpath+0x16/0x1b
[ 77.724018] Code: 8b 50 08 48 89 02 49 89 45 08 e9 46 ff ff ff 66 0f 1f 84 00 00 00 00 00 4c 89 e6 4c 89 e7 e8 05 f7 ff ff 48 89 c3 e9 09 ff ff ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 66 66 66 66 90 55 b8 01
[ 77.724018] RIP [<ffffffff81208c03>] propagate_umount+0x153/0x160
[ 77.724018] RSP <ffff880035a5be60>
[ 77.761968] ---[ end trace 94fc755aefee9186 ]---

>
> Eric

2014-10-07 23:35:45

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] umount: Do not allow unmounting rootfs.

Andrew Vagin <[email protected]> writes:

> On Tue, Oct 07, 2014 at 01:58:01PM -0700, Eric W. Biederman wrote:
>> Andrew Vagin <[email protected]> writes:
>>
>> > On Tue, Oct 07, 2014 at 12:27:06PM -0700, Eric W. Biederman wrote:
>> >>
>> >> Which in practice is totally uninteresting. Only the global root user can
>> >> do it, and it is just a stupid thing to do.
>> >>
>> >> However that is no excuse to allow a silly way to oops the kernel.
>> >>
>> >> We can avoid this silly problem by setting MNT_LOCKED on the rootfs
>> >> mount point and thus avoid needing any special cases in the unmount
>> >> code.
>> >
>> > I had this idea too, but it doesn't work.
>> >
>> > MNT_LOCKED isn't inherited, if the privileged user creates a new mount
>> > namespace.
>> >
>> > So "unshame -m ./nsenter" reproduces the same BUG.
>>
>> Which broken tree do you have where MNT_LOCKED is not inherited?
>
> It is Linus' tree with your patch.
>
> I commented out one line and the BUG isn't triggered any more.

Ok. That is very very weird. It works for me and not for you.

Doh! I ran your test program first on the primary mount namespace
and didn't have /proc mounted when I tried to run your test program
later.

Thanks for the hint about copy_tree that does seem to be where the logic
goes haywire. In copy_mnt_ns we want an exact copy not a partial copy.
The good news is that this bug could only affect rootfs, so it is not a
security hole with respect to mount namespaces, created with user
namespace permissions.

mount --rbind and mount propogation should not be locked to their parent
mounts so we do need to clear MNT_LOCKED in a few places.

A patch that fixes copy_tree carefully in a moment.

Eric

2014-10-07 23:40:46

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH] mnt: Move the clear of MNT_LOCKED from copy_tree to it's


Clear MNT_LOCKED in the callers of copy_tree except copy_mnt_ns, and
collect_mounts. In copy_mnt_nswe want an exact copy of a mount tree,
so not clearing MNT_LOCKED is important. Similarly collect_mounts
is used to take a snapshot of the mount tree for audit logging purposes
and auditing using a faithful copy of the tree is important.

This becomes particularly significant when we start setting MNT_LOCKED
on rootfs to prevent it from being unmounted.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/namespace.c | 1 -
fs/pnode.c | 1 +
2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9c6ee037bef7..5a11e273541c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1498,7 +1498,6 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
if (IS_ERR(q))
return q;

- q->mnt.mnt_flags &= ~MNT_LOCKED;
q->mnt_mountpoint = mnt->mnt_mountpoint;

p = mnt;
diff --git a/fs/pnode.c b/fs/pnode.c
index aae331a5d03b..260ac8f898a4 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -242,6 +242,7 @@ static int propagate_one(struct mount *m)
child = copy_tree(last_source, last_source->mnt.mnt_root, type);
if (IS_ERR(child))
return PTR_ERR(child);
+ child->mnt.mnt_flags &= ~MNT_LOCKED;
mnt_set_mountpoint(m, mp, child);
last_dest = m;
last_source = child;
--
1.9.1

2014-10-08 10:46:56

by Andrew Vagin

[permalink] [raw]
Subject: Re: [PATCH] mnt: Move the clear of MNT_LOCKED from copy_tree to it's

On Tue, Oct 07, 2014 at 04:40:10PM -0700, Eric W. Biederman wrote:
>
> Clear MNT_LOCKED in the callers of copy_tree except copy_mnt_ns, and
> collect_mounts. In copy_mnt_nswe want an exact copy of a mount tree,
> so not clearing MNT_LOCKED is important. Similarly collect_mounts
> is used to take a snapshot of the mount tree for audit logging purposes
> and auditing using a faithful copy of the tree is important.
>
> This becomes particularly significant when we start setting MNT_LOCKED
> on rootfs to prevent it from being unmounted.
>

Acked-by: Andrew Vagin <[email protected]>

> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/namespace.c | 1 -
> fs/pnode.c | 1 +
> 2 files changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 9c6ee037bef7..5a11e273541c 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1498,7 +1498,6 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
> if (IS_ERR(q))
> return q;
>
> - q->mnt.mnt_flags &= ~MNT_LOCKED;
> q->mnt_mountpoint = mnt->mnt_mountpoint;
>
> p = mnt;
> diff --git a/fs/pnode.c b/fs/pnode.c
> index aae331a5d03b..260ac8f898a4 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -242,6 +242,7 @@ static int propagate_one(struct mount *m)
> child = copy_tree(last_source, last_source->mnt.mnt_root, type);
> if (IS_ERR(child))
> return PTR_ERR(child);
> + child->mnt.mnt_flags &= ~MNT_LOCKED;
> mnt_set_mountpoint(m, mp, child);
> last_dest = m;
> last_source = child;
> --
> 1.9.1
>

2014-10-08 10:47:39

by Andrew Vagin

[permalink] [raw]
Subject: Re: [PATCH] umount: Do not allow unmounting rootfs.

On Tue, Oct 07, 2014 at 12:27:06PM -0700, Eric W. Biederman wrote:
>
> Andrew Vagin <[email protected]> writes:
>
> > #define _GNU_SOURCE
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > #include <sched.h>
> > #include <unistd.h>
> > #include <sys/mount.h>
> >
> > int main(int argc, char **argv)
> > {
> > int fd;
> >
> > fd = open("/proc/self/ns/mnt", O_RDONLY);
> > if (fd < 0)
> > return 1;
> > while (1) {
> > if (umount2("/", MNT_DETACH) ||
> > setns(fd, CLONE_NEWNS))
> > break;
> > }
> >
> > return 0;
> > }
> >
> > root@ubuntu:/home/avagin# gcc -Wall nsenter.c -o nsenter
> > root@ubuntu:/home/avagin# strace ./nsenter
> > execve("./nsenter", ["./nsenter"], [/* 22 vars */]) = 0
> > ...
> > open("/proc/self/ns/mnt", O_RDONLY) = 3
> > umount("/", MNT_DETACH) = 0
> > setns(3, 131072) = 0
> > umount("/", MNT_DETACH
> >
> causes:
>
> > [ 260.548301] ------------[ cut here ]------------
> > [ 260.550941] kernel BUG at /build/buildd/linux-3.13.0/fs/pnode.c:372!
> > [ 260.552068] invalid opcode: 0000 [#1] SMP
> > [ 260.552068] Modules linked in: xt_CHECKSUM iptable_mangle xt_tcpudp xt_addrtype xt_conntrack ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bridge stp llc dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison iptable_filter ip_tables x_tables crct10dif_pclmul crc32_pclmul ghash_clmulni_intel binfmt_misc nfsd auth_rpcgss nfs_acl aesni_intel nfs lockd aes_x86_64 sunrpc fscache lrw gf128mul glue_helper ablk_helper cryptd serio_raw ppdev parport_pc lp parport btrfs xor raid6_pq libcrc32c psmouse floppy
> > [ 260.552068] CPU: 0 PID: 1723 Comm: nsenter Not tainted 3.13.0-30-generic #55-Ubuntu
> > [ 260.552068] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > [ 260.552068] task: ffff8800376097f0 ti: ffff880074824000 task.ti: ffff880074824000
> > [ 260.552068] RIP: 0010:[<ffffffff811e9483>] [<ffffffff811e9483>] propagate_umount+0x123/0x130
> > [ 260.552068] RSP: 0018:ffff880074825e98 EFLAGS: 00010246
> > [ 260.552068] RAX: ffff88007c741140 RBX: 0000000000000002 RCX: ffff88007c741190
> > [ 260.552068] RDX: ffff88007c741190 RSI: ffff880074825ec0 RDI: ffff880074825ec0
> > [ 260.552068] RBP: ffff880074825eb0 R08: 00000000000172e0 R09: ffff88007fc172e0
> > [ 260.552068] R10: ffffffff811cc642 R11: ffffea0001d59000 R12: ffff88007c741140
> > [ 260.552068] R13: ffff88007c741140 R14: ffff88007c741140 R15: 0000000000000000
> > [ 260.552068] FS: 00007fd5c7e41740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
> > [ 260.552068] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 260.552068] CR2: 00007fd5c7968050 CR3: 0000000070124000 CR4: 00000000000406f0
> > [ 260.552068] Stack:
> > [ 260.552068] 0000000000000002 0000000000000002 ffff88007c631000 ffff880074825ed8
> > [ 260.552068] ffffffff811dcfac ffff88007c741140 0000000000000002 ffff88007c741160
> > [ 260.552068] ffff880074825f38 ffffffff811dd12b ffffffff811cc642 0000000075640000
> > [ 260.552068] Call Trace:
> > [ 260.552068] [<ffffffff811dcfac>] umount_tree+0x20c/0x260
> > [ 260.552068] [<ffffffff811dd12b>] do_umount+0x12b/0x300
> > [ 260.552068] [<ffffffff811cc642>] ? final_putname+0x22/0x50
> > [ 260.552068] [<ffffffff811cc849>] ? putname+0x29/0x40
> > [ 260.552068] [<ffffffff811dd88c>] SyS_umount+0xdc/0x100
> > [ 260.552068] [<ffffffff8172aeff>] tracesys+0xe1/0xe6
> > [ 260.552068] Code: 89 50 08 48 8b 50 08 48 89 02 49 89 45 08 e9 72 ff ff ff 0f 1f 44 00 00 4c 89 e6 4c 89 e7 e8 f5 f6 ff ff 48 89 c3 e9 39 ff ff ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 66 66 66 66 90 55 b8 01
> > [ 260.552068] RIP [<ffffffff811e9483>] propagate_umount+0x123/0x130
> > [ 260.552068] RSP <ffff880074825e98>
> > [ 260.611451] ---[ end trace 11c33d85f1d4c652 ]--
>
> Which in practice is totally uninteresting. Only the global root user can
> do it, and it is just a stupid thing to do.
>
> However that is no excuse to allow a silly way to oops the kernel.
>
> We can avoid this silly problem by setting MNT_LOCKED on the rootfs
> mount point and thus avoid needing any special cases in the unmount
> code.
>

Acked-by: Andrew Vagin <[email protected]>

> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/namespace.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 7c5834381a73..9c6ee037bef7 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2887,6 +2887,7 @@ static void __init init_mount_tree(void)
>
> root.mnt = mnt;
> root.dentry = mnt->mnt_root;
> + mnt->mnt_flags |= MNT_LOCKED;
>
> set_fs_pwd(current->fs, &root);
> set_fs_root(current->fs, &root);
> --
> 1.9.1
>