2013-06-06 16:35:59

by Chris Webb

Subject: Building a BSD-jail clone out of namespaces

Prompted by the new userns support merged in the 3.8/3.9 kernels, I've been
playing with namespaces and trying to understand how I could use them to
build containers to replace some of my uses of qemu-kvm virtual machines.

I've successfully created a fakeroot-type container running as an
unprivileged user by unsharing everything including CLONE_NEWUSER, and can
map a block of host UIDs for that environment by writing to
/proc/PID/[ug]id_map from a helper process running as root.

However, what I'm hoping for in practice is to be able to create containers
whose access to their filesystem subtree is untranslated, i.e. uid/gid N in
the container maps to uid/gid N in a subdirectory of the filesystem, but
which is still isolated from the rest of the host filesystem and can't do
externally privileged things. This is pretty much what a BSD jail provides,
for example.

Is this possible to achieve securely using the mechanisms now available?
(I'm assuming that parent directory permissions prevent unprivileged host
users from getting at these container filesystems, exactly as is necessary
to make BSD jails safe.)


As a first step, I naively tried running as root and unsharing everything
with

unshare(CLONE_NEWIPC | CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWPID
| CLONE_NEWUTS | CLONE_NEWUSER);

before execing a shell[1]. From another root process in the host namespace,
I then wrote a pass-through mapping 0 0 4294967295 to /proc/PID/[ug]id_map.
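
As an illustration of the helper step described above, writing a mapping
into /proc/PID/uid_map or /proc/PID/gid_map might look roughly like the
sketch below. The function names format_id_map and write_id_map are
hypothetical, and the code assumes a single-extent map written by a
suitably privileged process in the parent namespace.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Format one mapping line: "<first id inside> <first id outside> <count>\n".
 * Returns the line length, or -1 if buf is too small.
 * (Hypothetical helper name, for illustration only.) */
int format_id_map(char *buf, size_t size, unsigned long inside,
                  unsigned long outside, unsigned long count)
{
    int n = snprintf(buf, size, "%lu %lu %lu\n", inside, outside, count);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Write a single mapping to /proc/PID/uid_map or /proc/PID/gid_map.
 * `file` is "uid_map" or "gid_map". Returns 0 on success, -1 on error. */
int write_id_map(pid_t pid, const char *file, unsigned long inside,
                 unsigned long outside, unsigned long count)
{
    char path[64], line[64];
    int len = format_id_map(line, sizeof line, inside, outside, count);
    if (len < 0)
        return -1;
    snprintf(path, sizeof path, "/proc/%ld/%s", (long)pid, file);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    /* The map can only be set once, so the whole map goes in one write(). */
    ssize_t written = write(fd, line, (size_t)len);
    close(fd);
    return written == len ? 0 : -1;
}
```

With these helpers, the pass-through mapping above would be
write_id_map(pid, "uid_map", 0, 0, 4294967295UL), followed by the same call
for "gid_map".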

The result initially looks plausible, with the PID namespace preventing
signals being sent from one container to another, despite those processes
sharing the same user ID in the top-level user namespace.

However, unfortunately I still have too many privileges with respect to the
host. Whilst (for example) I can't mknod, I can mount a sysfs or procfs and
apparently write to them with host root privileges to reconfigure the host
kernel. I suspect there will be other things I haven't secured by this
recipe too.

I also tried tightening things up by dropping capabilities from my root user
and preventing capability grant on exec by setting and locking SECBIT_NOROOT
on before starting the container. However, I'm not sure this really makes
any difference---does CLONE_NEWUSER drop all capabilities with respect to
the parent namespace?

[1] In this description, I'm ignoring the part where I lock into a new root
filesystem, but presumably the way to do this is by pivot_root into a bind
mount?
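
For reference, the pivot_root-into-a-bind-mount step could be sketched as
below. This is only one possible arrangement, not a tested recipe from this
thread: the name enter_new_root is hypothetical, and it assumes CLONE_NEWNS
has already been unshared and that the caller has CAP_SYS_ADMIN over the new
mount namespace.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Make new_root the root of the current mount namespace.
 * Returns 0 on success, -1 on error. (Hypothetical helper name.) */
int enter_new_root(const char *new_root)
{
    struct stat st;

    /* Refuse anything that is not an existing directory before touching mounts. */
    if (stat(new_root, &st) < 0 || !S_ISDIR(st.st_mode))
        return -1;
    /* Stop mount changes from propagating back to the parent namespace. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
        return -1;
    /* pivot_root needs new_root to be a mount point, so bind it onto itself. */
    if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) < 0)
        return -1;
    if (chdir(new_root) < 0)
        return -1;
    /* glibc provides no pivot_root() wrapper, so use the raw syscall.
     * Passing "." twice stacks the old root on top of the new one. */
    if (syscall(SYS_pivot_root, ".", ".") < 0)
        return -1;
    /* Detach the old root so the host filesystem is no longer reachable. */
    if (umount2(".", MNT_DETACH) < 0)
        return -1;
    return chdir("/");
}
```

The "." "." form avoids needing a put_old directory inside the new root; the
more traditional alternative is to pivot into a subdirectory and unmount it
afterwards.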

Best wishes,

Chris.


2013-06-06 16:46:56

by Chris Webb

Subject: Re: Building a BSD-jail clone out of namespaces

"Eric W. Biederman" <[email protected]> writes:

> That will work, but you really don't want to run with uid == 0 mapped to
> uid == 0. There are too many things in /proc and /sys and similar that
> grant access to uid == 0.

Many thanks for the swift reply. If I map UID zero in the userns to a
non-zero UID outside (say -1), is there any way to use the userns UIDs
instead of host UIDs when accessing the container's root filesystem so I
don't end up with strange file ownerships on disk? This would prevent me
from using the same filesystem on physical hosts or in VMs.

I don't think there's any kernel mechanism that lets me apply a UID
translation layer as part of a bind mount, is there?

Cheers,

Chris.

2013-06-06 16:57:12

by Eric W. Biederman

Subject: Re: Building a BSD-jail clone out of namespaces

Chris Webb <[email protected]> writes:

> "Eric W. Biederman" <[email protected]> writes:
>
>> That will work, but you really don't want to run with uid == 0 mapped to
>> uid == 0. There are too many things in /proc and /sys and similar that
>> grant access to uid == 0.
>
> Many thanks for the swift reply. If I map UID zero in the userns to a
> non-zero UID outside (say -1), is there any way to use the userns UIDs
> instead of host UIDs when accessing the container's root filesystem so I
> don't end up with strange file ownerships on disk? This would prevent me
> from using the same filesystem on physical hosts or in VMs.

Hmm. I guess it depends on how your VM is reading them. If it is
block-based access to the filesystem you have a problem. If the VM
is effectively NFS mounting the filesystem you can do all kinds of
things.

It is possible to just change the user namespace and set up your mapping,
effectively running your VM in the user namespace, and that would allow
the VM to see your mapped uids.

> I don't think there's any kernel mechanism that lets me apply a UID
> translation layer as part of a bind mount, is there?

No. In principle you could mount the filesystem inside of the user
namespace but in practice no filesystems support that yet.

Eric

2013-06-06 16:58:22

by Eric W. Biederman

Subject: Re: Building a BSD-jail clone out of namespaces

Chris Webb <[email protected]> writes:

> Prompted by the new userns support merged in the 3.8/3.9 kernels, I've been
> playing with namespaces and trying to understand how I could use them to
> build containers to replace some of my uses of qemu-kvm virtual machines.
>
> I've successfully created a fakeroot-type container running as an
> unprivileged user by unsharing everything including CLONE_NEWUSER, and can
> map a block of host UIDs for that environment by writing to
> /proc/PID/[ug]id_map from a helper process running as root.
>
> However, what I'm hoping for in practice is to be able to create containers
> whose access to their filesystem subtree is untranslated, i.e. uid/gid N in
> the container maps to uid/gid N in a subdirectory of the filesystem, but
> which is still isolated from the rest of the host filesystem and can't do
> externally privileged things. This is pretty much what a BSD jail provides,
> for example.
>
> Is this possible to achieve securely using the mechanisms now available?
> (I'm assuming that parent directory permissions prevent unprivileged host
> users from getting at these container filesystems, exactly as is necessary
> to make BSD jails safe.)
>
>
> As a first step, I naively tried running as root and unsharing everything
> with
>
> unshare(CLONE_NEWIPC | CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWPID
> | CLONE_NEWUTS | CLONE_NEWUSER);
>
> before execing a shell[1]. From another root process in the host namespace,
> I then wrote a pass-through mapping 0 0 4294967295 to /proc/PID/[ug]id_map.

That will work, but you really don't want to run with uid == 0 mapped to
uid == 0. There are too many things in /proc and /sys and similar that
grant access to uid == 0.

> The result initially looks plausible, with the PID namespace preventing
> signals being sent from one container to another, despite those processes
> sharing the same user ID in the top-level user namespace.
>
> However, unfortunately I still have too many privileges with respect to the
> host. Whilst (for example) I can't mknod, I can mount a sysfs or procfs and
> apparently write to them with host root privileges to reconfigure the host
> kernel. I suspect there will be other things I haven't secured by this
> recipe too.

Yes. I recommend having a dedicated range of uids for your container to
prevent this kind of silliness. Or at the very least a separate mapping
of uid == 0.
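
To make the dedicated-range idea concrete, here is a small sketch of how a
single map extent translates a container id into a host id. The struct and
function names are hypothetical, and the range used in the usage note
(host ids 100000 to 165535) is just an example value, not one suggested in
this thread.

```c
#include <stdint.h>

/* One extent of a uid_map/gid_map: ids [inside, inside + count) in the
 * container correspond to ids [outside, outside + count) on the host.
 * (Hypothetical type, for illustration only.) */
struct id_extent {
    uint32_t inside;
    uint32_t outside;
    uint32_t count;
};

/* Translate a container id to a host id. Returns 0 and stores the result
 * in *host on success, or -1 if the id is not covered by the extent. */
int to_host_id(const struct id_extent *e, uint32_t id, uint32_t *host)
{
    if (id < e->inside || id - e->inside >= e->count)
        return -1;
    *host = e->outside + (id - e->inside);
    return 0;
}
```

With an extent of { 0, 100000, 65536 }, container uid 0 becomes host uid
100000, so nothing in /proc or /sys that checks for uid == 0 will match it.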

> I also tried tightening things up by dropping capabilities from my root user
> and preventing capability grant on exec by setting and locking SECBIT_NOROOT
> on before starting the container. However, I'm not sure this really makes
> any difference---does CLONE_NEWUSER drop all capabilities with respect to
> the parent namespace?

Yes. CLONE_NEWUSER drops all capabilities with respect to the parent
namespace.

> [1] In this description, I'm ignoring the part where I lock into a new root
> filesystem, but presumably the way to do this is by pivot_root into a bind
> mount?

Yes, pivot_root and bind mount work.

Eric

2013-06-06 21:51:55

by Chris Webb

Subject: Re: Building a BSD-jail clone out of namespaces

"Eric W. Biederman" <[email protected]> writes:

> Hmm. I guess it depends on how your VM is reading them. If it is
> block-based access to the filesystem you have a problem. If the VM
> is effectively NFS mounting the filesystem you can do all kinds of
> things.
>
> It is possible to just change the user namespace and set up your mapping,
> effectively running your VM in the user namespace, and that would allow
> the VM to see your mapped uids.

In some cases I was thinking of mounting a filesystem directly from a block
device, but more often it would be directories in a local host filesystem.
I use qemu's built-in virtio 9p-over-pci to pass these in at present.

So in principle, that does mean I could store UIDs translated and wrap
everything else I do at host level in a userns translation layer as well,
but it's quite an intrusive thing to do and I imagine it would preclude
lightweight throwaway containers where I share the host filesystem read-only
into a container.

This is why I was quite keen to avoid mangled ownerships in the host
filesystems at all, but from what you say, it sounds like that goal might
be rather tricky to achieve.

> There are too many things in /proc and /sys and similar that
> grant access to uid == 0.

Ah yes, I can see why this is a thorny one. Is it just the synthetic
filesystems like /proc and /sys that are the problem, or are there loads of
other places in the kernel that assume uid == 0 implies privilege? I.e. is
it 'just' a matter of somehow securing access to procfs and sysfs, or a much
wider issue?

Best wishes,

Chris.

2013-06-07 04:07:22

by Eric W. Biederman

Subject: Re: Building a BSD-jail clone out of namespaces

Chris Webb <[email protected]> writes:

> "Eric W. Biederman" <[email protected]> writes:
>
>> Hmm. I guess it depends on how your VM is reading them. If it is
>> block-based access to the filesystem you have a problem. If the VM
>> is effectively NFS mounting the filesystem you can do all kinds of
>> things.
>>
>> It is possible to just change the user namespace and set up your mapping,
>> effectively running your VM in the user namespace, and that would allow
>> the VM to see your mapped uids.
>
> In some cases I was thinking of mounting a filesystem directly from a block
> device, but more often it would be directories in a local host filesystem.
> I use qemu's built-in virtio 9p-over-pci to pass these in at present.

Interesting. I hadn't seen that feature. That makes 9p much more
interesting than I thought it was.

> So in principle, that does mean I could store UIDs translated and wrap
> everything else I do at host level in a userns translation layer as well,
> but it's quite an intrusive thing to do and I imagine it would preclude
> lightweight throwaway containers where I share the host filesystem read-only
> into a container.

Not being able to share the host filesystem into a container is a
downside of the current implementation. In principle you can have an
overlay style filesystem that munges the uids and removes this
limitation, but that doesn't currently exist.

> This is why I was quite keen to avoid mangled ownerships in the host
> filesystems at all, but from what you say, it sounds like that goal might
> be rather tricky to achieve.

If you don't try to share the host root filesystem you can achieve the
sharing pretty easily by just running qemu in a user namespace. So that
qemu or whatever else serves the 9p protocol sees the filesystem with all
of the uids and gids translated.

>> There are too many things in /proc and /sys and similar that
>> grant access to uid == 0.
>
> Ah yes, I can see why this is a thorny one. Is it just the synthetic
> filesystems like /proc and /sys that are the problem, or are there loads of
> other places in the kernel that assume uid == 0 implies privilege? I.e. is
> it 'just' a matter of somehow securing access to procfs and sysfs, or a much
> wider issue?

It is a wider issue. Capabilities cover most of the places in the kernel
where the kernel tests if you have privilege but there are other
filesystems like devtmpfs, and the occasional silly piece of kernel
code that should be using capabilities but is not. Beyond the kernel
there are files like /etc/shadow that only root is allowed to read.

Which all boils down to the fact that for the inconvenience of using a
separate range of uids a lot of other problems just go away.

Eric

2013-06-07 12:58:55

by Chris Webb

Subject: Re: Building a BSD-jail clone out of namespaces

"Eric W. Biederman" <[email protected]> writes:

> It is a wider issue. Capabilities cover most of the places in the kernel
> where the kernel tests if you have privilege but there are other
> filesystems like devtmpfs, and the occasional silly piece of kernel
> code that should be using capabilities but is not. Beyond the kernel
> there are files like /etc/shadow that only root is allowed to read.
>
> Which all boils down to the fact that for the inconvenience of using a
> separate range of uids a lot of other problems just go away.

Hi. Thanks for the clarifications here, which make a lot of sense.

> Not being able to share the host filesystem into a container is a
> downside of the current implementation. In principle you can have an
> overlay style filesystem that munges the uids and removes this
> limitation, but that doesn't currently exist.

Yes, given the design means I can't just have an identity UID/GID mapping,
this seems like the building block I'm missing to get namespace IDs instead
of host IDs stored on disk. I imagine it might be fairly straightforward for
me to take a simple 'example' stacked filesystem like wrapfs and teach it to
map UIDs and GIDs. I'll have to take a look.

Cheers,

Chris.

2013-06-27 13:43:13

by Chris Webb

Subject: Re: Building a BSD-jail clone out of namespaces

Chris Webb <[email protected]> writes:

> Prompted by the new userns support merged in the 3.8/3.9 kernels, I've been
> playing with namespaces and trying to understand how I could use them to
> build containers to replace some of my uses of qemu-kvm virtual machines.

I now have most things working as I'd want and am just polishing my
userspace container tool before release to make sure it fits well with
common conventions such as those mentioned at

http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/

and parses /etc/subuid and /etc/subgid files in the format you've defined for
them in your shadow patches. I was delighted by how it all nests nicely,
provided I bind mount my /dev nodes from the level above rather than try to
mknod them in the outer container.
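
As a rough sketch of the parsing involved, and assuming the usual
"name:start:count" layout for lines in /etc/subuid and /etc/subgid, one
entry could be read as below. The struct and function names are
hypothetical and this is not the code from the tool mentioned above.

```c
#include <stdio.h>
#include <string.h>

/* One delegated range: `count` ids starting at `start`, owned by `name`.
 * (Hypothetical type, for illustration only.) */
struct subid_range {
    char name[64];
    unsigned long start;
    unsigned long count;
};

/* Parse one "name:start:count" line, with or without a trailing newline.
 * Returns 0 on success, -1 if the line is malformed or the range is empty. */
int parse_subid_line(const char *line, struct subid_range *out)
{
    char trailing;
    /* The final " %c" only matches if something follows the count. */
    int n = sscanf(line, "%63[^:\n]:%lu:%lu %c",
                   out->name, &out->start, &out->count, &trailing);
    return (n == 3 && out->count > 0) ? 0 : -1;
}
```

A line such as "chris:100000:65536" would then yield a range that can be
written to uid_map as "0 100000 65536".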

I'd like to arrange for slightly different behaviour when the tool is run at
the top-level 'host' user namespace, for example warning about attempts to
map the dangerous UID 0.

Is there a canonical way to detect when I'm in the top-level user namespace?
I can clearly try doing something which should be impossible for a
non-top-level root user like opening /proc/kpageflags for reading or
/proc/sys/ctrl-alt-del for writing, but I wondered if there was something
more idiomatic as a test? (Some sort of 'get parent namespace' that might
return null at top-level maybe?)
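
One less intrusive heuristic, offered only as a sketch and not as a
canonical test, is to read /proc/self/uid_map and check whether it is the
full identity mapping "0 0 4294967295". The function names are hypothetical.
Note that this cannot distinguish the top-level namespace from a child
namespace that was given the same pass-through mapping, so it is a hint
rather than proof.

```c
#include <stdio.h>

/* Returns 1 if `map` (the contents of a uid_map file) is exactly the
 * identity mapping of the full 32-bit range, and 0 otherwise.
 * (Hypothetical helper name, for illustration only.) */
int is_identity_map(const char *map)
{
    unsigned long inside, outside, count;
    char extra;
    /* A fourth conversion succeeds only if there is more than one extent. */
    int n = sscanf(map, "%lu %lu %lu %c", &inside, &outside, &count, &extra);
    return n == 3 && inside == 0 && outside == 0 && count == 4294967295UL;
}

/* Heuristic only: 1 if our uid_map looks like the top-level namespace's,
 * 0 if it does not, -1 if the file cannot be read. */
int looks_like_initial_userns(void)
{
    char buf[256];
    FILE *f = fopen("/proc/self/uid_map", "r");
    if (!f)
        return -1;
    size_t len = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[len] = '\0';
    return is_identity_map(buf);
}
```

A tool could use looks_like_initial_userns() to decide whether to warn about
mapping UID 0, falling back to the stricter behaviour when it returns -1.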

Cheers,

Chris.