LinuxLists.cc - [PATCH RFC] allow some kernel filesystems to be mounted in a user namespace

2013-07-16 19:29:30

Subject: [PATCH RFC] allow some kernel filesystems to be mounted in a user namespace

All the files will be owned by host root, so there's no security
concern in allowing this.

(These are mounted by default by mountall, and if permission is
denied then by default container boot will hang)

Signed-off-by: Serge Hallyn <[email protected]>
---
 fs/debugfs/inode.c | 1 +
 fs/fuse/control.c  | 1 +
 security/inode.c   | 1 +
 3 files changed, 3 insertions(+)

diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 4888cb3..8632432 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -298,6 +298,7 @@ static struct file_system_type debug_fs_type = {
 	.name = "debugfs",
 	.mount = debug_mount,
 	.kill_sb = kill_litter_super,
+	.fs_flags = FS_USERNS_MOUNT,
 };
 MODULE_ALIAS_FS("debugfs");

diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index a0b0855..4991441 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -340,6 +340,7 @@ static struct file_system_type fuse_ctl_fs_type = {
 	.name = "fusectl",
 	.mount = fuse_ctl_mount,
 	.kill_sb = fuse_ctl_kill_sb,
+	.fs_flags = FS_USERNS_MOUNT,
 };
 MODULE_ALIAS_FS("fusectl");

diff --git a/security/inode.c b/security/inode.c
index 43ce6e1..ec18abd 100644
--- a/security/inode.c
+++ b/security/inode.c
@@ -49,6 +49,7 @@ static struct file_system_type fs_type = {
 	.name = "securityfs",
 	.mount = get_sb,
 	.kill_sb = kill_litter_super,
+	.fs_flags = FS_USERNS_MOUNT,
 };

/**
--
1.8.3.2
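
[Editorial illustration, not part of the original message.] The effect of the one-line change in each hunk can be sketched with a simplified userspace model. This is not kernel code: `struct fs_type_model` and `new_mount_allowed()` are hypothetical names, the numeric value of `FS_USERNS_MOUNT` is assumed from include/linux/fs.h of that era, and the capability checks the kernel also performs are left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumed value of the flag in include/linux/fs.h at the time of the patch. */
#define FS_USERNS_MOUNT 8

/* Hypothetical stand-in for the kernel's struct file_system_type. */
struct fs_type_model {
    const char *name;
    int fs_flags;
};

/*
 * Simplified model of the gate on a fresh mount: outside the initial
 * user namespace, a filesystem type that does not set FS_USERNS_MOUNT
 * is refused with -EPERM. Capability checks are deliberately omitted.
 */
int new_mount_allowed(const struct fs_type_model *type, bool in_init_userns)
{
    if (!in_init_userns && !(type->fs_flags & FS_USERNS_MOUNT))
        return -EPERM;
    return 0;
}
```

In this model, a type without the flag is refused only outside the initial user namespace, which is the behaviour the patch proposes to lift for the three filesystems.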

2013-07-16 19:38:29

by Al Viro

Subject: Re: [PATCH RFC] allow some kernel filesystems to be mounted in a user namespace

On Tue, Jul 16, 2013 at 02:29:20PM -0500, Serge Hallyn wrote:
> All the files will be owned by host root, so there's no security
> concern in allowing this.

Files owned by root != very bad things can't be done by non-root.
Especially for debugfs, which is very much a "don't even think about
mounting that on a production box" thing...

2013-07-16 19:50:05

by Serge E. Hallyn

Subject: Re: [PATCH RFC] allow some kernel filesystems to be mounted in a user namespace

Quoting Al Viro ([email protected]):
> On Tue, Jul 16, 2013 at 02:29:20PM -0500, Serge Hallyn wrote:
> > All the files will be owned by host root, so there's no security
> > concern in allowing this.
>
> Files owned by root != very bad things can't be done by non-root.
> Especially for debugfs, which is very much a "don't even think about
> mounting that on a production box" thing...

I would prefer it not be mounted. But near as I can tell there
should be no regression security-wise whether an unprivileged
user on the host has access to it, or whether a user in a
non-init user ns is allowed to mount it. (Obviously I could very
well be wrong)

-serge

2013-07-16 21:33:28

by Andy Lutomirski

Subject: Re: [PATCH RFC] allow some kernel filesystems to be mounted in a user namespace

On 07/16/2013 12:50 PM, Serge E. Hallyn wrote:
> Quoting Al Viro ([email protected]):
>> On Tue, Jul 16, 2013 at 02:29:20PM -0500, Serge Hallyn wrote:
>>> All the files will be owned by host root, so there's no security
>>> concern in allowing this.
>>
>> Files owned by root != very bad things can't be done by non-root.
>> Especially for debugfs, which is very much a "don't even think about
>> mounting that on a production box" thing...
>
> I would prefer it not be mounted. But near as I can tell there
> should be no regression security-wise whether an unprivileged
> user on the host has access to it, or whether a user in a
> non-init user ns is allowed to mount it. (Obviously I could very
> well be wrong)

I would argue that either (a) debugfs denies everything to non-root, so
mounting it in a (rootless) userns is useless or (b) it doesn't, in
which case it's dangerous.

In neither case does it make sense to me to allow the mount.

--Andy

2013-07-16 21:37:51

by Serge E. Hallyn

Subject: Re: [PATCH RFC] allow some kernel filesystems to be mounted in a user namespace

Quoting Andy Lutomirski ([email protected]):
> On 07/16/2013 12:50 PM, Serge E. Hallyn wrote:
> > Quoting Al Viro ([email protected]):
> >> On Tue, Jul 16, 2013 at 02:29:20PM -0500, Serge Hallyn wrote:
> >>> All the files will be owned by host root, so there's no security
> >>> concern in allowing this.
> >>
> >> Files owned by root != very bad things can't be done by non-root.
> >> Especially for debugfs, which is very much a "don't even think about
> >> mounting that on a production box" thing...
> >
> > I would prefer it not be mounted. But near as I can tell there
> > should be no regression security-wise whether an unprivileged
> > user on the host has access to it, or whether a user in a
> > non-init user ns is allowed to mount it. (Obviously I could very
> > well be wrong)
>
> I would argue that either (a) debugfs denies everything to non-root, so
> mounting it in a (rootless) userns is useless or (b) it doesn't, in
> which case it's dangerous.
>
> In neither case does it make sense to me to allow the mount.

It makes sense from the POV of having sane user-space. I can obviously
work around this by tweaking a stock container rootfs to be different
from a stock host rootfs. It is undesirable.

For debug and fusectl there is another option which I'm happy to
pursue, namely tweaking how mountall handles 'nofail' to ignore these
errors.

But for /sys/kernel/security, the failure of which to mount on a
non-container can be a real problem, that is not good enough. So
at least I'd like securityfs to be mountable in a non-init userns.

-serge

2013-07-16 21:39:04

by Serge E. Hallyn

Subject: Re: [PATCH RFC] allow some kernel filesystems to be mounted in a user namespace

Quoting Serge E. Hallyn ([email protected]):
> Quoting Andy Lutomirski ([email protected]):
> > On 07/16/2013 12:50 PM, Serge E. Hallyn wrote:
> > > Quoting Al Viro ([email protected]):
> > >> On Tue, Jul 16, 2013 at 02:29:20PM -0500, Serge Hallyn wrote:
> > >>> All the files will be owned by host root, so there's no security
> > >>> concern in allowing this.
> > >>
> > >> Files owned by root != very bad things can't be done by non-root.
> > >> Especially for debugfs, which is very much a "don't even think about
> > >> mounting that on a production box" thing...
> > >
> > > I would prefer it not be mounted. But near as I can tell there
> > > should be no regression security-wise whether an unprivileged
> > > user on the host has access to it, or whether a user in a
> > > non-init user ns is allowed to mount it. (Obviously I could very
> > > well be wrong)
> >
> > I would argue that either (a) debugfs denies everything to non-root, so
> > mounting it in a (rootless) userns is useless or (b) it doesn't, in
> > which case it's dangerous.
> >
> > In neither case does it make sense to me to allow the mount.
>
> It makes sense from the POV of having sane user-space. I can obviously
> work around this by tweaking a stock container rootfs to be different
> from a stock host rootfs. It is undesirable.

(s/It/But that/)

2013-07-16 21:44:43

by Andy Lutomirski

Subject: Re: [PATCH RFC] allow some kernel filesystems to be mounted in a user namespace

On Tue, Jul 16, 2013 at 2:37 PM, Serge E. Hallyn <[email protected]> wrote:
> Quoting Andy Lutomirski ([email protected]):
>> On 07/16/2013 12:50 PM, Serge E. Hallyn wrote:
>> > Quoting Al Viro ([email protected]):
>> >> On Tue, Jul 16, 2013 at 02:29:20PM -0500, Serge Hallyn wrote:
>> >>> All the files will be owned by host root, so there's no security
>> >>> concern in allowing this.
>> >>
>> >> Files owned by root != very bad things can't be done by non-root.
>> >> Especially for debugfs, which is very much a "don't even think about
>> >> mounting that on a production box" thing...
>> >
>> > I would prefer it not be mounted. But near as I can tell there
>> > should be no regression security-wise whether an unprivileged
>> > user on the host has access to it, or whether a user in a
>> > non-init user ns is allowed to mount it. (Obviously I could very
>> > well be wrong)
>>
>> I would argue that either (a) debugfs denies everything to non-root, so
>> mounting it in a (rootless) userns is useless or (b) it doesn't, in
>> which case it's dangerous.
>>
>> In neither case does it make sense to me to allow the mount.
>
> It makes sense from the POV of having sane user-space. I can obviously
> work around this by tweaking a stock container rootfs to be different
> from a stock host rootfs. It is undesirable.
>
> For debug and fusectl there is another option which I'm happy to
> pursue, namely tweaking how mountall handles 'nofail' to ignore these
> errors.

I don't know enough about fuse to know whether it should work in a
container, but presumably the fusectl FS needs to be aware of userns
mappings for it to work right. But ISTM it would be better for
containers to be smart enough to keep going if debugfs fails to mount
-- this really seems like a userspace problem that ought to be fixed
in userspace.

>
> But for /sys/kernel/security, the failure of which to mount on a
> non-container can be a real problem, that is not good enough. So
> at least I'd like securityfs to be mountable in a non-init userns.
>

Will the container work if /sys/kernel/security is inaccessible even to "root"?

> -serge

--
Andy Lutomirski
AMA Capital Management, LLC

2013-07-16 22:03:04

by Serge E. Hallyn

Subject: Re: [PATCH RFC] allow some kernel filesystems to be mounted in a user namespace

Quoting Andy Lutomirski ([email protected]):
> On Tue, Jul 16, 2013 at 2:37 PM, Serge E. Hallyn <[email protected]> wrote:
> > Quoting Andy Lutomirski ([email protected]):
> >> On 07/16/2013 12:50 PM, Serge E. Hallyn wrote:
> >> > Quoting Al Viro ([email protected]):
> >> >> On Tue, Jul 16, 2013 at 02:29:20PM -0500, Serge Hallyn wrote:
> >> >>> All the files will be owned by host root, so there's no security
> >> >>> concern in allowing this.
> >> >>
> >> >> Files owned by root != very bad things can't be done by non-root.
> >> >> Especially for debugfs, which is very much a "don't even think about
> >> >> mounting that on a production box" thing...
> >> >
> >> > I would prefer it not be mounted. But near as I can tell there
> >> > should be no regression security-wise whether an unprivileged
> >> > user on the host has access to it, or whether a user in a
> >> > non-init user ns is allowed to mount it. (Obviously I could very
> >> > well be wrong)
> >>
> >> I would argue that either (a) debugfs denies everything to non-root, so
> >> mounting it in a (rootless) userns is useless or (b) it doesn't, in
> >> which case it's dangerous.
> >>
> >> In neither case does it make sense to me to allow the mount.
> >
> > It makes sense from the POV of having sane user-space. I can obviously
> > work around this by tweaking a stock container rootfs to be different
> > from a stock host rootfs. It is undesirable.
> >
> > For debug and fusectl there is another option which I'm happy to
> > pursue, namely tweaking how mountall handles 'nofail' to ignore these
> > errors.
>
> I don't know enough about fuse to know whether it should work in a
> container, but presumably the fusectl FS needs to be aware of userns

Again it's not about working - we actually don't (through LSM) allow
writes under any of them anyway. It's about containers and
non-containers having similar boot sequences when possible.

> mappings for it to work right. But ISTM it would be better for
> containers to be smart enough to keep going if debugfs fails to mount

"smart enough" in this case means finding ways to figure out information
that it wouldn't otherwise need, and the form of which could at some point
change, and generally just increases the future potential fragility.

Well, to be fair that's again really referring to the securityfs one.
Basically solving that would require teaching mountall to parse
/proc/self/uid_map to decide its namespace.

> -- this really seems like a userspace problem that ought to be fixed
> in userspace.

> > But for /sys/kernel/security, the failure of which to mount on a
> > non-container can be a real problem, that is not good enough. So
> > at least I'd like securityfs to be mountable in a non-init userns.
> >
>
> Will the container work if /sys/kernel/security is inaccessible even to "root"?

Yes. As it is they're actually not allowed to write under there (by
LSM). Containers start fine for me with these three mounted this way.

-serge
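
[Editorial illustration, not part of the original message.] A minimal sketch of the /proc/self/uid_map check mentioned above, assuming the initial user namespace shows a single identity mapping over the whole 32-bit range. The function names are hypothetical and this is not mountall code. The check is only a heuristic, since a privileged process can write the same mapping into a child namespace, which is in line with the fragility concern raised in the message.

```c
#include <stdio.h>

/*
 * Returns 1 if the text looks like the initial user namespace's map
 * (one extent: "0 0 4294967295"), 0 if it looks like any other
 * mapping, and -1 if it cannot be parsed.
 */
int uid_map_is_initial(const char *uid_map_text)
{
    unsigned long inside, outside, count;
    char extra;
    int n = sscanf(uid_map_text, "%lu %lu %lu %c",
                   &inside, &outside, &count, &extra);

    if (n < 3)
        return -1;
    /* Anything after the first extent means more than one mapping line. */
    if (n == 4)
        return 0;
    return inside == 0 && outside == 0 && count == 4294967295UL;
}

/* Reads the calling process's own map; returns -1 if it cannot be opened. */
int in_initial_userns(void)
{
    char buf[256];
    size_t len;
    FILE *f = fopen("/proc/self/uid_map", "r");

    if (!f)
        return -1;
    len = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[len] = '\0';
    return uid_map_is_initial(buf);
}
```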

2013-07-16 22:08:09

by Andy Lutomirski

Subject: Re: [PATCH RFC] allow some kernel filesystems to be mounted in a user namespace

On Tue, Jul 16, 2013 at 3:03 PM, Serge E. Hallyn <[email protected]> wrote:
> Quoting Andy Lutomirski ([email protected]):
>> On Tue, Jul 16, 2013 at 2:37 PM, Serge E. Hallyn <[email protected]> wrote:
>> > Quoting Andy Lutomirski ([email protected]):
>> >> On 07/16/2013 12:50 PM, Serge E. Hallyn wrote:
>> >> > Quoting Al Viro ([email protected]):
>> >> >> On Tue, Jul 16, 2013 at 02:29:20PM -0500, Serge Hallyn wrote:
>> >> >>> All the files will be owned by host root, so there's no security
>> >> >>> concern in allowing this.
>> >> >>
>> >> >> Files owned by root != very bad things can't be done by non-root.
>> >> >> Especially for debugfs, which is very much a "don't even think about
>> >> >> mounting that on a production box" thing...
>> >> >
>> >> > I would prefer it not be mounted. But near as I can tell there
>> >> > should be no regression security-wise whether an unprivileged
>> >> > user on the host has access to it, or whether a user in a
>> >> > non-init user ns is allowed to mount it. (Obviously I could very
>> >> > well be wrong)
>> >>
>> >> I would argue that either (a) debugfs denies everything to non-root, so
>> >> mounting it in a (rootless) userns is useless or (b) it doesn't, in
>> >> which case it's dangerous.
>> >>
>> >> In neither case does it make sense to me to allow the mount.
>> >
>> > It makes sense from the POV of having sane user-space. I can obviously
>> > work around this by tweaking a stock container rootfs to be different
>> > from a stock host rootfs. It is undesirable.
>> >
>> > For debug and fusectl there is another option which I'm happy to
>> > pursue, namely tweaking how mountall handles 'nofail' to ignore these
>> > errors.
>>
>> I don't know enough about fuse to know whether it should work in a
>> container, but presumably the fusectl FS needs to be aware of userns
>
> Again it's not about working - we actually don't (through LSM) allow
> writes under any of them anyway. It's about containers and
> non-containers having similar boot sequences when possible.

I, and many other people, run kernel.org kernels with LSM disabled.
userns defaults to on, and that configuration needs to be secure.

>
>> mappings for it to work right. But ISTM it would be better for
>> containers to be smart enough to keep going if debugfs fails to mount
>
> "smart enough" in this case means finding ways to figure out information
> that it wouldn't otherwise need, and the form of which could at some point
> change, and generally just increases the future potential fragility.

Presumably this is as simple as making 'mountall' report success if
nofail is set and mount returns -EPERM.

That being said, it would probably be okay to modify debugfs to detect
that it's in a nonroot userns and show up empty when mounted.

>
> Well, to be fair that's again really referring to the securityfs one.
> Basically solving that would require teaching mountall to parse
> /proc/self/uid_map to decide its namespace.

Huh?

>
>> -- this really seems like a userspace problem that ought to be fixed
>> in userspace.
>
>> > But for /sys/kernel/security, the failure of which to mount on a
>> > non-container can be a real problem, that is not good enough. So
>> > at least I'd like securityfs to be mountable in a non-init userns.
>> >
>>
>> Will the container work if /sys/kernel/security is inaccessible even to "root"?
>
> Yes. As it is they're actually not allowed to write under there (by
> LSM). Containers start fine for me with these three mounted this way.
>

At least for securityfs, relying on LSM is legit.

--Andy
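
[Editorial illustration, not part of the original message.] The nofail suggestion above can be sketched as a small decision function. `mount_failure_is_fatal()` is a hypothetical helper, not mountall's actual code; it assumes the caller passes the errno value left by a failed mount(2), which is where userspace sees the kernel's -EPERM.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Decide whether a mount attempt should stop the boot sequence.
 * err is 0 on success, otherwise the errno from mount(2).
 * nofail reflects the 'nofail' option on the fstab entry.
 */
bool mount_failure_is_fatal(int err, bool nofail)
{
    if (err == 0)
        return false;   /* the mount succeeded */
    if (nofail && err == EPERM)
        return false;   /* permission denied on an optional mount: carry on */
    return true;        /* every other failure is still reported */
}
```

Because the tolerance is tied to an explicit nofail on the entry, an entry without it, such as a host's /sys/kernel/security, would still fail loudly in this sketch.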

2013-07-16 22:23:11

by Serge E. Hallyn

Subject: Re: [PATCH RFC] allow some kernel filesystems to be mounted in a user namespace

Quoting Andy Lutomirski ([email protected]):
> On Tue, Jul 16, 2013 at 3:03 PM, Serge E. Hallyn <[email protected]> wrote:
> > Quoting Andy Lutomirski ([email protected]):
> >> On Tue, Jul 16, 2013 at 2:37 PM, Serge E. Hallyn <[email protected]> wrote:
> >> > Quoting Andy Lutomirski ([email protected]):
> >> >> On 07/16/2013 12:50 PM, Serge E. Hallyn wrote:
> >> >> > Quoting Al Viro ([email protected]):
> >> >> >> On Tue, Jul 16, 2013 at 02:29:20PM -0500, Serge Hallyn wrote:
> >> >> >>> All the files will be owned by host root, so there's no security
> >> >> >>> concern in allowing this.
> >> >> >>
> >> >> >> Files owned by root != very bad things can't be done by non-root.
> >> >> >> Especially for debugfs, which is very much a "don't even think about
> >> >> >> mounting that on a production box" thing...
> >> >> >
> >> >> > I would prefer it not be mounted. But near as I can tell there
> >> >> > should be no regression security-wise whether an unprivileged
> >> >> > user on the host has access to it, or whether a user in a
> >> >> > non-init user ns is allowed to mount it. (Obviously I could very
> >> >> > well be wrong)
> >> >>
> >> >> I would argue that either (a) debugfs denies everything to non-root, so
> >> >> mounting it in a (rootless) userns is useless or (b) it doesn't, in
> >> >> which case it's dangerous.
> >> >>
> >> >> In neither case does it make sense to me to allow the mount.
> >> >
> >> > It makes sense from the POV of having sane user-space. I can obviously
> >> > work around this by tweaking a stock container rootfs to be different
> >> > from a stock host rootfs. It is undesirable.
> >> >
> >> > For debug and fusectl there is another option which I'm happy to
> >> > pursue, namely tweaking how mountall handles 'nofail' to ignore these
> >> > errors.
> >>
> >> I don't know enough about fuse to know whether it should work in a
> >> container, but presumably the fusectl FS needs to be aware of userns
> >
> > Again it's not about working - we actually don't (through LSM) allow
> > writes under any of them anyway. It's about containers and
> > non-containers having similar boot sequences when possible.
>
> I, and many other people, run kernel.org kernels with LSM disabled.
> userns defaults to on, and that configuration needs to be secure.

My point was just that not being able to write under those mounts will
not break the containers. I'm not saying it would be ok to push this
patch if it did require an LSM to be safe.

> >> mappings for it to work right. But ISTM it would be better for
> >> containers to be smart enough to keep going if debugfs fails to mount
> >
> > "smart enough" in this case means finding ways to figure out information
> > that it wouldn't otherwise need, and the form of which could at some point
> > change, and generally just increases the future potential fragility.
>
> Presumably this is as simple as making 'mountall' report success if
> nofail is set and mount returns -EPERM.
>
> That being said, it would probably be okay to modify debugfs to detect
> that it's in a nonroot userns and show up empty when mounted.

That'd obviously work for containers.

> > Well, to be fair that's again really referring to the securityfs one.
> > Basically solving that would require teaching mountall to parse
> > /proc/self/uid_map to decide its namespace.
>
> Huh?

I don't think it's going to be ok to have mountall proceed on
real hosts with /sys/kernel/security not mounted, risking the expected
security policy *quietly* not being set up on hosts.

That's why I consider it better and safer to simply allow the
securityfs mount.

> >> -- this really seems like a userspace problem that ought to be fixed
> >> in userspace.
> >
> >> > But for /sys/kernel/security, the failure of which to mount on a
> >> > non-container can be a real problem, that is not good enough. So
> >> > at least I'd like securityfs to be mountable in a non-init userns.
> >> >
> >>
> >> Will the container work if /sys/kernel/security is inaccessible even to "root"?
> >
> > Yes. As it is they're actually not allowed to write under there (by
> > LSM). Containers start fine for me with these three mounted this way.
> >
>
> At least for securityfs, relying on LSM is legit.

I'm not "relying on LSM" to make these safe. I'm relying on the
uid mappings to make these safe.

Nevertheless I at least have hope of working around the others (in a
distro-acceptable way), so if the others are too scary I'll pursue
the workaround for the others and see where I get. But I really feel
the securityfs one is the best solution.

thanks,
-serge

2013-07-17 05:43:36

by Eric W. Biederman

Subject: Re: [PATCH RFC] allow some kernel filesystems to be mounted in a user namespace

"Serge E. Hallyn" <[email protected]> writes:

> I'm not "relying on LSM" to make these safe. I'm relying on the
> uid mappings to make these safe.
>
> Nevertheless I at least have hope of working around the others (in a
> distro-acceptable way), so if the others are too scary I'll pursue
> the workaround for the others and see where I get. But I really feel
> the securityfs one is the best solution.

Personally I don't trust debugfs enough to compile it into my kernel.

fuse simply isn't ready to have fresh mounts usefully created inside
a user namespace.

Fundamentally with debugfs and securityfs you run into the issue we saw
with sysfs and proc where at some level it is the system administrator's
prerogative whether those filesystems should be mounted.

The rule with filesystems like that is mounting them needs to be no more
dangerous than bind mounting them. At the point in the cycle you are
talking about mounting them you presumably have already thrown away
their original mounts making it impossible to tell if it would have been
safe to mount them or not. Making your patch completely inappropriate.

What you need to do is at container setup time to bind mount those
filesystems if they are already mounted and you want them in the
container. If you are just shuffling around something you can already
see there are no security issues.

Eric
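
[Editorial illustration, not part of the original message.] A minimal sketch of the bind-mount approach described above, assuming the container setup code runs where the host's mounts are still visible and has the privilege to bind mount. The function names and the use of /proc/self/mounts are assumptions for illustration, not taken from any container runtime.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Returns 1 if one /proc/self/mounts line ("source dir fstype opts 0 0")
 * describes fstype mounted at dir, else 0. Paths containing spaces appear
 * escaped (\040) in that file and are not handled by this sketch.
 */
int mounts_line_matches(const char *line, const char *dir, const char *fstype)
{
    char mnt_dir[256], mnt_type[64];

    if (sscanf(line, "%*s %255s %63s", mnt_dir, mnt_type) != 2)
        return 0;
    return strcmp(mnt_dir, dir) == 0 && strcmp(mnt_type, fstype) == 0;
}

/*
 * Bind mounts host_dir onto target_dir only if fstype is already mounted
 * at host_dir. Returns 1 if bound, 0 if there was nothing to bind, and
 * -1 on error (errno is set by fopen or mount).
 */
int bind_if_mounted(const char *host_dir, const char *fstype,
                    const char *target_dir)
{
    char line[1024];
    int found = 0;
    FILE *f = fopen("/proc/self/mounts", "r");

    if (!f)
        return -1;
    while (!found && fgets(line, sizeof(line), f))
        found = mounts_line_matches(line, host_dir, fstype);
    fclose(f);

    if (!found)
        return 0;   /* the admin never mounted it, so leave it out */
    return mount(host_dir, target_dir, NULL, MS_BIND, NULL) == 0 ? 1 : -1;
}
```

A setup step might call `bind_if_mounted("/sys/kernel/security", "securityfs", target)` with `target` being the matching directory under the container's root; the filesystem then appears in the container only when the host administrator chose to mount it.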

2013-07-17 12:41:29

by Serge Hallyn

Subject: Re: [PATCH RFC] allow some kernel filesystems to be mounted in a user namespace

Quoting Eric W. Biederman ([email protected]):
> "Serge E. Hallyn" <[email protected]> writes:
>
> > I'm not "relying on LSM" to make these safe. I'm relying on the
> > uid mappings to make these safe.
> >
> > Nevertheless I at least have hope of working around the others (in a
> > distro-acceptable way), so if the others are too scary I'll pursue
> > the workaround for the others and see where I get. But I really feel
> > the securityfs one is the best solution.
>
> Personally I don't trust debugfs enough to compile it into my kernel.

That, again, seems reasonable, but would also seem to invalidate
objections to this patch :) but,

> fuse simply isn't ready to have fresh mounts usefully created inside
> a user namespace.
>
> Fundamentally with debugfs and securityfs you run into the issue we saw
> with sysfs and proc where at some level it is the system administrator's
> prerogative whether those filesystems should be mounted.
>
> The rule with filesystems like that is mounting them needs to be no more
> dangerous than bind mounting them. At the point in the cycle you are
> talking about mounting them you presumably have already thrown away
> their original mounts making it impossible to tell if it would have been
> safe to mount them or not. Making your patch completely inappropriate.

Right, so the specific problem this patch introduces is: an admin who is
using a distro kernel with these filesystems enabled but not mounted
does not, without this patch, have to worry about unprivileged users
being able to access the fs. With this patch, he does.

Thanks everyone, I withdraw this patch.

> What you need to do is at container setup time to bind mount those
> filesystems if they are already mounted and you want them in the
> container. If you are just shuffling around something you can already
> see there are no security issues.

-serge