LinuxLists.cc - user namespace and fully visible proc and sys mounts

2016-03-06 08:28:30

Subject: user namespace and fully visible proc and sys mounts

Hi,

So we've been over this many times... but unfortunately there is more
breakage to report. Regular privileged and unprivileged containers
work all right for us. But running an unprivileged container inside a
privileged container is blocked.

When creating privileged containers, lxc by default does a few things:
it mounts some fuse.lxcfs files over procfiles, including /proc/meminfo and
/proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro, and it also
moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
(because this container is not in a user namespace) then moves
/proc/sys/net back. Finally it mounts sys ro but bind-mounts
/sys/devices/virtual/net as writeable.

If any of these are left enabled, unprivileged containers can't be
started. If all are disabled, then they can be.
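
A rough sketch of the mount sequence described above, written as a dry
run that only prints the commands. The rootfs path, the lxcfs mount
point and the scratch directory are assumptions chosen for illustration;
this is not lxc's actual code:

```shell
#!/bin/sh
# Dry run: print the mount commands instead of executing them.
# ROOTFS, LXCFS and SCRATCH are hypothetical paths, not taken from lxc.
ROOTFS="${ROOTFS:-/var/lib/lxc/c1/rootfs}"
LXCFS="${LXCFS:-/var/lib/lxcfs}"
SCRATCH="${SCRATCH:-$ROOTFS/mnt}"

run() { echo "$*"; }    # change the body to "$@" to really run it (needs root)

plan_privileged_mounts() {
    # proc read-write, with lxcfs files bind-mounted over a few procfiles
    run mount -t proc proc "$ROOTFS/proc"
    run mount --bind "$LXCFS/proc/meminfo" "$ROOTFS/proc/meminfo"
    run mount --bind "$LXCFS/proc/uptime" "$ROOTFS/proc/uptime"
    # /proc/sysrq-trigger read-only
    run mount --bind "$ROOTFS/proc/sysrq-trigger" "$ROOTFS/proc/sysrq-trigger"
    run mount -o remount,bind,ro "$ROOTFS/proc/sysrq-trigger"
    # park /proc/sys/net, make /proc/sys read-only, put /proc/sys/net back
    run mount --bind "$ROOTFS/proc/sys/net" "$SCRATCH"
    run mount --bind "$ROOTFS/proc/sys" "$ROOTFS/proc/sys"
    run mount -o remount,bind,ro "$ROOTFS/proc/sys"
    run mount --move "$SCRATCH" "$ROOTFS/proc/sys/net"
    # sys read-only, except a writeable /sys/devices/virtual/net
    run mount -t sysfs -o ro sysfs "$ROOTFS/sys"
    run mount --bind "$ROOTFS/sys/devices/virtual/net" "$ROOTFS/sys/devices/virtual/net"
    run mount -o remount,bind,rw "$ROOTFS/sys/devices/virtual/net"
}

plan_privileged_mounts
```

Each of the bind and read-only steps leaves proc or sys partly covered
or restricted, which is what the nested unprivileged container then
trips over.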

Can we find a way to make these not block remounts in child user
namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?

-serge

2016-03-06 22:03:40

by Eric W. Biederman

Subject: Re: user namespace and fully visible proc and sys mounts

"Serge E. Hallyn" <[email protected]> writes:

> Hi,
>
> So we've been over this many times... but unfortunately there is more
> breakage to report. Regular privileged and unprivileged containers
> work all right for us. But running an unprivileged container inside a
> privileged container is blocked.
>
> When creating privileged containers, lxc by default does a few things:
> it mounts some fuse.lxcfs files over procfiles including /proc/meminfo and
> /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
> moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
> (because this container is not in a user namespace) then moves
> /proc/sys/net back. Finally it mounts sys ro but bind-mounts
> /sys/devices/virtual/net as writeable.
>
> If any of these are left enabled, unprivileged containers can't be
> started. If all are disabled, then they can be.
>
> Can we find a way to make these not block remounts in child user
> namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?

Are any of these overmounts done for the purpose of security? It
appears the /proc/sys and /sys mounts being made read-only is for that
purpose.

If none of the mounts are for security, the easy solution that works
today is to also mount /proc and /sys somewhere else in your container
so that the permission check for mounting a new copy passes.
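
A minimal sketch of that workaround, again as a dry run that only prints
commands. The rootfs path and the spare directory are hypothetical, and
the sketch assumes the check is satisfied by any unobscured proc and
sysfs instance in the container's mount namespace:

```shell
#!/bin/sh
# Dry run of the "second copy" workaround: keep an extra, unobscured
# proc and sysfs mounted somewhere inside the container.
# ROOTFS and SPARE are hypothetical paths chosen for illustration.
ROOTFS="${ROOTFS:-/var/lib/lxc/c1/rootfs}"
SPARE="${SPARE:-$ROOTFS/mnt/fullvis}"

run() { echo "$*"; }    # change the body to "$@" to really run it (needs root)

plan_spare_mounts() {
    run mkdir -p "$SPARE/proc" "$SPARE/sys"
    run mount -t proc proc "$SPARE/proc"      # no overmounts, read-write
    run mount -t sysfs sysfs "$SPARE/sys"     # no overmounts, read-write
}

plan_spare_mounts
```

The spare copies are writable, so this only fits when the read-only
mounts are not relied on for security.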

That said, /proc/sys appears to be a show stopper in this scheme. Since
the root of your privileged container can enter your unprivileged
container, it can bypass your read-only /proc/sys by mounting a new copy
of proc if we allow the relaxation you are requesting.

Therefore the only choice on the table (and I don't have a clue how
realistic it is) is to have a variant of proc with just files describing
processes. Call it processfs. That would not need the current
restrictions.

As for sysfs I am drawing a blank about what might be possible.

Eric

2016-03-06 23:38:18

by Serge E. Hallyn

Subject: Re: user namespace and fully visible proc and sys mounts

On Sun, Mar 06, 2016 at 03:53:40PM -0600, Eric W. Biederman wrote:
> "Serge E. Hallyn" <[email protected]> writes:
>
> > Hi,
> >
> > So we've been over this many times... but unfortunately there is more
> > breakage to report. Regular privileged and unprivileged containers
> > work all right for us. But running an unprivileged container inside a
> > privileged container is blocked.
> >
> > When creating privileged containers, lxc by default does a few things:
> > it mounts some fuse.lxcfs files over procfiles including /proc/meminfo and
> > /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
> > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
> > (because this container is not in a user namespace) then moves
> > /proc/sys/net back. Finally it mounts sys ro but bind-mounts
> > /sys/devices/virtual/net as writeable.
> >
> > If any of these are left enabled, unprivileged containers can't be
> > started. If all are disabled, then they can be.
> >
> > Can we find a way to make these not block remounts in child user
> > namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?
>
> Are any of these overmounts done for the purpose of security? It

The fuse.lxcfs ones are not for security.

The others are for security, but only in non-user-namespaced containers.
(We're doing them in unprivileged as well for simplicity but could stop
that). We're not overmounting to hide things, we're mounting readonly
because the procfiles are owned by the same uid that is root in the
container. Now in Ubuntu we do also have precise apparmor profiles
which redundantly prevent writing, and our only real goal is to prevent
accidental host damage, but the defense in depth is still nice to have,
and I don't want to drop that.

> appears the /proc/sys and /sys mounts being made read-only is for that
> purpose.

Right, but we're not hiding anything. In fact maybe that's how we
can detect this - if the dentry over- and under-mount for a directory
is the same, ignore it, because it doesn't fall under your original
threat scenario?
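
The check suggested here would live in the kernel and compare dentries.
A loose userspace approximation of the same idea, assuming the standard
/proc/self/mountinfo field layout (mount ID, parent ID, major:minor,
root, mount point), is sketched below. The function name
classify_overmounts is made up for illustration:

```shell
#!/bin/sh
# Label each mount "same" when it exposes exactly the directory it
# covers (same device, same path within that filesystem as the parent
# mount would show there), and "different" otherwise.
# Reads mountinfo-formatted lines on stdin.
classify_overmounts() {
    awk '
    { id[NR] = $1; parent[$1] = $2; dev[$1] = $3; root[$1] = $4; mp[$1] = $5 }
    END {
        for (i = 1; i <= NR; i++) {
            m = id[i]; p = parent[m]
            if (!(p in mp)) continue        # parent not listed, nothing to compare
            # path of this mount point relative to the parent mount point
            rel = (mp[p] == "/") ? mp[m] : substr(mp[m], length(mp[p]) + 1)
            base = (root[p] == "/") ? "" : root[p]
            expected = base rel             # what the parent shows at that spot
            if (expected == "") expected = "/"
            kind = (dev[m] == dev[p] && root[m] == expected) ? "same" : "different"
            print kind, mp[m]
        }
    }'
}

# Usage on a live system:
#   classify_overmounts < /proc/self/mountinfo
```

With this heuristic a read-only bind of /proc/sys onto itself comes out
as "same", while an lxcfs file over /proc/meminfo comes out as
"different".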

> If none of the mounts are for security the easy solution that works
> today is to also mount /proc and /sys somewhere else in your container
> so that the permission check for mounting a new copy passes.

Yeah, we used to do that, and I actually forgot that we used to do that.
I'll have to look into why it no longer suffices.

(The security aspect wasn't too bad, since we used apparmor to prevent any
writes to the redundant mounts)

> That said /proc/sys appears to be a show stopper in this scheme. As the
> root of your privileged container can enter your unprivileged container
> it can bypass your read-only /proc/sys by mounting a new copy of proc if
> we allow the relaxation you are requesting.

Yeah, will have to think about that.

> Therefore the only choice on the table (and I don't have a clue how
> realistic it is) is to have a variant of proc with just files describing
> processes. Call it processfs. That would not need the current
> restrictions.
>
> As for sysfs I am drawing a blank about what might be possible.
>
> Eric

2016-03-07 02:24:44

by Andy Lutomirski

Subject: Re: user namespace and fully visible proc and sys mounts

On Mar 6, 2016 2:03 PM, "Eric W. Biederman" <[email protected]> wrote:
>
> "Serge E. Hallyn" <[email protected]> writes:
>
> > Hi,
> >
> > So we've been over this many times... but unfortunately there is more
> > breakage to report. Regular privileged and unprivileged containers
> > work all right for us. But running an unprivileged container inside a
> > privileged container is blocked.
> >
> > When creating privileged containers, lxc by default does a few things:
> > it mounts some fuse.lxcfs files over procfiles including /proc/meminfo and
> > /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
> > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
> > (because this container is not in a user namespace) then moves
> > /proc/sys/net back. Finally it mounts sys ro but bind-mounts
> > /sys/devices/virtual/net as writeable.
> >
> > If any of these are left enabled, unprivileged containers can't be
> > started. If all are disabled, then they can be.
> >
> > Can we find a way to make these not block remounts in child user
> > namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?
>
> Are any of these overmounts done for the purpose of security? It
> appears the /proc/sys and /sys mounts being made read-only is for that
> purpose.
>
> If none of the mounts are for security the easy solution that works
> today is to also mount /proc and /sys somewhere else in your container
> so that the permission check for mounting a new copy passes.

Can we use the big hammer approach on /proc/sys? Specifically, what
if we made it so that /proc mounts created in a non-root namespace
*only* see things that are scoped to the active namespaces, and only
those over which the mounter has capabilities? We could have mount
options for this.

/proc/sys utterly sucks for namespaced things. So does the uid_map
and similar crap. The API is simply awful.

On a related note, can we *please* find a way to constrain namespace
creation in a way that might satisfy the RHEL crowd?

>
> That said /proc/sys appears to be a show stopper in this scheme. As the
> root of your privileged container can enter your unprivileged container
> it can bypass your read-only /proc/sys by mounting a new copy of proc if
> we allow the relaxation you are requesting.
>
> Therefore the only choice on the table (and I don't have a clue how
> realistic it is) is to have a variant of proc with just files describing
> processes. Call it processfs. That would not need the current
> restrictions.
>
> As for sysfs I am drawing a blank about what might be possible.

Lovely. Yet another vaguely-namespaced thing in a pseudo-filesystem.

--Andy

2016-03-07 03:45:26

by Serge E. Hallyn

Subject: Re: user namespace and fully visible proc and sys mounts

On Sun, Mar 06, 2016 at 06:24:23PM -0800, Andy Lutomirski wrote:
> On Mar 6, 2016 2:03 PM, "Eric W. Biederman" <[email protected]> wrote:
> >
> > "Serge E. Hallyn" <[email protected]> writes:
> >
> > > Hi,
> > >
> > > So we've been over this many times... but unfortunately there is more
> > > breakage to report. Regular privileged and unprivileged containers
> > > work all right for us. But running an unprivileged container inside a
> > > privileged container is blocked.
> > >
> > > When creating privileged containers, lxc by default does a few things:
> > > it mounts some fuse.lxcfs files over procfiles including /proc/meminfo and
> > > /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
> > > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
> > > (because this container is not in a user namespace) then moves
> > > /proc/sys/net back. Finally it mounts sys ro but bind-mounts
> > > /sys/devices/virtual/net as writeable.
> > >
> > > If any of these are left enabled, unprivileged containers can't be
> > > started. If all are disabled, then they can be.
> > >
> > > Can we find a way to make these not block remounts in child user
> > > namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?
> >
> > Are any of these overmounts done for the purpose of security? It
> > appears the /proc/sys and /sys mounts being made read-only is for that
> > purpose.
> >
> > If none of the mounts are for security the easy solution that works
> > today is to also mount /proc and /sys somewhere else in your container
> > so that the permission check for mounting a new copy passes.
>
> Can we use the big hammer approach on /proc/sys? Specifically, what
> if we made it so that /proc mounts created in a non-root namespace
> *only* see things that are scoped to the active namespaces, and only
> those over which the mounter has capabilities? We could have mount
> options for this.

Of course the problem is precisely non-user-namespaced containers, which
do own and have capabilities over the /proc/sys files. For user-namespaced
containers /proc/sys/ isn't really an issue.

Better namespacing of sysctls and maybe some way to say "I relinquish
the ability to update *those* sysctls for myself and all children" could
help.

> /proc/sys utterly sucks for namespaced things. So does the uid_map
> and similar crap. The API is simply awful.
>
> On a related note, can we *please* find a way to constrain namespace
> creation in a way that might satisfy the RHEL crowd?
>
> >
> > That said /proc/sys appears to be a show stopper in this scheme. As the
> > root of your privileged container can enter your unprivileged container
> > it can bypass your read-only /proc/sys by mounting a new copy of proc if
> > we allow the relaxation you are requesting.
> >
> > Therefore the only choice on the table (and I don't have a clue how
> > realistic it is) is to have a variant of proc with just files describing
> > processes. Call it processfs. That would not need the current
> > restrictions.
> >
> > As for sysfs I am drawing a blank about what might be possible.
>
> Lovely. Yet another vaguely-namespaced thing in a pseudo-filesystem.
>
> --Andy

2016-03-07 03:49:38

by Andy Lutomirski

Subject: Re: user namespace and fully visible proc and sys mounts

On Sun, Mar 6, 2016 at 7:45 PM, Serge E. Hallyn <[email protected]> wrote:
> On Sun, Mar 06, 2016 at 06:24:23PM -0800, Andy Lutomirski wrote:
>> On Mar 6, 2016 2:03 PM, "Eric W. Biederman" <[email protected]> wrote:
>> >
>> > "Serge E. Hallyn" <[email protected]> writes:
>> >
>> > > Hi,
>> > >
>> > > So we've been over this many times... but unfortunately there is more
>> > > breakage to report. Regular privileged and unprivileged containers
>> > > work all right for us. But running an unprivileged container inside a
>> > > privileged container is blocked.
>> > >
>> > > When creating privileged containers, lxc by default does a few things:
>> > > it mounts some fuse.lxcfs files over procfiles including /proc/meminfo and
>> > > /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
>> > > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
>> > > (because this container is not in a user namespace) then moves
>> > > /proc/sys/net back. Finally it mounts sys ro but bind-mounts
>> > > /sys/devices/virtual/net as writeable.
>> > >
>> > > If any of these are left enabled, unprivileged containers can't be
>> > > started. If all are disabled, then they can be.
>> > >
>> > > Can we find a way to make these not block remounts in child user
>> > > namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?
>> >
>> > Are any of these overmounts done for the purpose of security? It
>> > appears the /proc/sys and /sys mounts being made read-only is for that
>> > purpose.
>> >
>> > If none of the mounts are for security the easy solution that works
>> > today is to also mount /proc and /sys somewhere else in your container
>> > so that the permission check for mounting a new copy passes.
>>
>> Can we use the big hammer approach on /proc/sys? Specifically, what
>> if we made it so that /proc mounts created in a non-root namespace
>> *only* see things that are scoped to the active namespaces, and only
>> those over which the mounter has capabilities? We could have mount
>> options for this.
>
> Of course the problem is precisely non-user-namespaced containers which
> do own and have capabilities over the /proc/sys files. For user-namespaced
> containers /proc/sys/ isn't really an issue.

What I mean is:

mount -o nsonly=user,net -t proc none /proc

would show the list of processes and things scoped to the current
userns and netns, would *not* show global sysctls, and would fail
unless the caller has appropriate caps over the userns and netns.
This would work even if the old procfs is not fully visible.

--Andy

2016-03-07 05:03:42

by Serge E. Hallyn

Subject: Re: user namespace and fully visible proc and sys mounts

On Sun, Mar 06, 2016 at 07:49:14PM -0800, Andy Lutomirski wrote:
> On Sun, Mar 6, 2016 at 7:45 PM, Serge E. Hallyn <[email protected]> wrote:
> > On Sun, Mar 06, 2016 at 06:24:23PM -0800, Andy Lutomirski wrote:
> >> On Mar 6, 2016 2:03 PM, "Eric W. Biederman" <[email protected]> wrote:
> >> >
> >> > "Serge E. Hallyn" <[email protected]> writes:
> >> >
> >> > > Hi,
> >> > >
> >> > > So we've been over this many times... but unfortunately there is more
> >> > > breakage to report. Regular privileged and unprivileged containers
> >> > > work all right for us. But running an unprivileged container inside a
> >> > > privileged container is blocked.
> >> > >
> >> > > When creating privileged containers, lxc by default does a few things:
> >> > > it mounts some fuse.lxcfs files over procfiles including /proc/meminfo and
> >> > > /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
> >> > > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
> >> > > (because this container is not in a user namespace) then moves
> >> > > /proc/sys/net back. Finally it mounts sys ro but bind-mounts
> >> > > /sys/devices/virtual/net as writeable.
> >> > >
> >> > > If any of these are left enabled, unprivileged containers can't be
> >> > > started. If all are disabled, then they can be.
> >> > >
> >> > > Can we find a way to make these not block remounts in child user
> >> > > namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?
> >> >
> >> > Are any of these overmounts done for the purpose of security? It
> >> > appears the /proc/sys and /sys mounts being made read-only is for that
> >> > purpose.
> >> >
> >> > If none of the mounts are for security the easy solution that works
> >> > today is to also mount /proc and /sys somewhere else in your container
> >> > so that the permission check for mounting a new copy passes.
> >>
> >> Can we use the big hammer approach on /proc/sys? Specifically, what
> >> if we made it so that /proc mounts created in a non-root namespace
> >> *only* see things that are scoped to the active namespaces, and only
> >> those over which the mounter has capabilities? We could have mount
> >> options for this.
> >
> > Of course the problem is precisely non-user-namespaced containers which
> > > do own and have capabilities over the /proc/sys files. For user-namespaced
> > containers /proc/sys/ isn't really an issue.
>
> What I mean is:
>
> mount -o nsonly=user,net -t proc none /proc
>
> would show the list of processes and things scoped to the current
> userns and netns, would *not* show global sysctls, and would fail
> unless the caller has appropriate caps over the userns and netns.
> This would work even if the old procfs is not fully visible.

Gah, so apparently I'd forgotten the workaround I'd implemented - I
thought things had regressed, but they haven't, I'd just missed a step.

Sorry for the noise. I don't want to make things more complicated or
more brittle when we can make it work as is - thanks.

-serge

2016-03-08 00:17:19

by Eric W. Biederman

Subject: Re: user namespace and fully visible proc and sys mounts

Andy Lutomirski <[email protected]> writes:

> On a related note, can we *please* find a way to constrain namespace
> creation in a way that might satisfy the RHEL crowd?

I am not certain what you are referring to.

As long as folks are willing to work with me I am happy to help design
something that makes things better for everyone. If someone
pushes hard, suggests crappy patches, and does not listen to
constructive feedback I will shoot their patches down (especially when I
am sick and tired as I have been more than I would like this development
cycle).

Eric

2016-03-08 00:25:01

by Andy Lutomirski

Subject: Re: user namespace and fully visible proc and sys mounts

On Mon, Mar 7, 2016 at 4:07 PM, Eric W. Biederman <[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> On a related note, can we *please* find a way to constrain namespace
>> creation in a way that might satisfy the RHEL crowd?
>
> I am not certain what you are referring to.
>
> As long as folks are willing to work with me I am happy to help design
> something that makes things better for everyone. If someone
> pushes hard, suggests crappy patches, and does not listen to
> constructive feedback I will shoot their patches down (especially when I
> am sick and tired as I have been more than I would like this development
> cycle).

I think we should add some mechanism that will allow the right to
create various namespaces to be constrained in a useful and usable
manner. I'll start a new thread.

2016-03-08 04:15:43

by Eric W. Biederman

Subject: Re: user namespace and fully visible proc and sys mounts

Andy Lutomirski <[email protected]> writes:

> On Mon, Mar 7, 2016 at 4:07 PM, Eric W. Biederman <[email protected]> wrote:
>> Andy Lutomirski <[email protected]> writes:
>>
>>> On a related note, can we *please* find a way to constrain namespace
>>> creation in a way that might satisfy the RHEL crowd?
>>
>> I am not certain what you are referring to.
>>
>> As long as folks are willing to work with me I am happy to help design
>> something that makes things better for everyone. If someone
>> pushes hard, suggests crappy patches, and does not listen to
>> constructive feedback I will shoot their patches down (especially when I
>> am sick and tired as I have been more than I would like this development
>> cycle).
>
> I think we should add some mechanism that will allow the right to
> create various namespaces to be constrained in a useful and usable
> manner. I'll start a new thread.

On the general principle that namespaces add attack surface, and that
attack surface reduction is generally good, I agree. I will await your
follow-on thread when you are ready.

Eric