2019-01-23 18:10:12

by Trond Myklebust

Subject: [LSF/MM TOPIC] Containers and distributed filesystems

Hi,

I'd like to propose an LSF/MM discussion around the topic of containers
and distributed filesystems.

The background is that we have a number of decisions to make around
dealing with namespaces when the filesystem is distributed.

On the one hand, there is the issue of which user namespace we should
be using when putting uids/gids on the wire, or when translating into
alternative identities (user/group name, cifs SIDs,...). There are two
main competing proposals: the first proposal is to select the user
namespace of the process that mounted the distributed filesystem. The
second proposal is to (continue to) use the user namespace pointed to
by init_nsproxy. It seems that whichever choice we make, we probably
want to ensure that all the major distributed filesystems (AFS, CIFS,
NFS) have consistent handling of these situations.

Another issue arises around the question of identifying containers when
they are migrated. At least the NFSv4 client needs to be able to send a
unique identifier that is preserved across container migration. The
uts_namespace is typically insufficient for this purpose, since most
containers don't bother to set a unique hostname.

Finally, there is an issue that may be unique to NFS (in which case I'd
be happy to see it as a hallway discussion or a BoF session) around
preserving file state across container migrations.

Cheers
Trond

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]



2019-01-23 19:21:42

by James Bottomley

Subject: Re: [LSF/MM TOPIC] Containers and distributed filesystems

On Wed, 2019-01-23 at 18:10 +0000, Trond Myklebust wrote:
> Hi,
>
> I'd like to propose an LSF/MM discussion around the topic of
> containers and distributed filesystems.
>
> The background is that we have a number of decisions to make around
> dealing with namespaces when the filesystem is distributed.
>
> On the one hand, there is the issue of which user namespace we should
> be using when putting uids/gids on the wire, or when translating into
> alternative identities (user/group name, cifs SIDs,...). There are
> two main competing proposals: the first proposal is to select the
> user namespace of the process that mounted the distributed
> filesystem. The second proposal is to (continue to) use the user
> namespace pointed to by init_nsproxy. It seems that whichever choice
> we make, we probably want to ensure that all the major distributed
> filesystems (AFS, CIFS, NFS) have consistent handling of these
> situations.

I don't think there's much disagreement among container people: most
would agree the uids on the wire should match the uids in the
container. If you're running your remote fs via fuse in an
unprivileged container, you have no access to the kuid/kgid anyway, so
that's the way you have to run.

I think the latter comes about because most of the container
implementations still have difficulty consuming the user namespace, so
most run without it (where kuid = uid) or mis-implement it, which is
where you might get the mismatch. Is there an actual use case where
you'd want to see the kuid at the remote end, bearing in mind that when
user namespaces are properly set up the kuid is often the product of
internal subuid mapping?

> Another issue arises around the question of identifying containers
> when they are migrated. At least the NFSv4 client needs to be able to
> send a unique identifier that is preserved across container
> migration. The uts_namespace is typically insufficient for this
> purpose, since most containers don't bother to set a unique hostname.

We did have a discussion in plumbers about the container ID, but I'm
not sure it reached a useful conclusion for you (video, I'm afraid):

https://linuxplumbersconf.org/event/2/contributions/215/

> Finally, there is an issue that may be unique to NFS (in which case
> I'd be happy to see it as a hallway discussion or a BoF session)
> around preserving file state across container migrations.

If by file state, you mean the internal kernel struct file state,
doesn't CRIU already do that? Or do you mean some other state?

James


2019-01-23 20:50:47

by Trond Myklebust

Subject: Re: [LSF/MM TOPIC] Containers and distributed filesystems

On Wed, 2019-01-23 at 11:21 -0800, James Bottomley wrote:
> On Wed, 2019-01-23 at 18:10 +0000, Trond Myklebust wrote:
> > Hi,
> >
> > I'd like to propose an LSF/MM discussion around the topic of
> > containers and distributed filesystems.
> >
> > The background is that we have a number of decisions to make around
> > dealing with namespaces when the filesystem is distributed.
> >
> > On the one hand, there is the issue of which user namespace we
> > should be using when putting uids/gids on the wire, or when
> > translating into alternative identities (user/group name, cifs
> > SIDs,...). There are two main competing proposals: the first
> > proposal is to select the user namespace of the process that
> > mounted the distributed filesystem. The second proposal is to
> > (continue to) use the user namespace pointed to by init_nsproxy.
> > It seems that whichever choice we make, we probably want to ensure
> > that all the major distributed filesystems (AFS, CIFS, NFS) have
> > consistent handling of these situations.
>
> I don't think there's much disagreement among container people: most
> would agree the uids on the wire should match the uids in the
> container. If you're running your remote fs via fuse in an
> unprivileged container, you have no access to the kuid/kgid anyway,
> so that's the way you have to run.
>
> I think the latter comes about because most of the container
> implementations still have difficulty consuming the user namespace,
> so most run without it (where kuid = uid) or mis-implement it, which
> is where you might get the mismatch. Is there an actual use case
> where you'd want to see the kuid at the remote end, bearing in mind
> that when user namespaces are properly set up the kuid is often the
> product of internal subuid mapping?

Wouldn't the above basically allow you to spoof root on any existing
mounted NFS client using the unprivileged command 'unshare -U -r'?

Eric Biederman was the one proposing the 'match the namespace of the
process that mounted the filesystem' approach. My main questions about
that approach would be:
1) Are we guaranteed to always have a mapping between an arbitrary
uid/gid from the user namespace in the container, to the user namespace
of the parent orchestrator process that set up the mount?
2) How do we reconcile that approach with the requirement that NFSv4 be
able to convert uids/gids into stringified user/group names (which is
usually solved using an upcall mechanism)?
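For context, the id-to-name translation in question has roughly this shape. This is a toy sketch, not the real upcall path (the actual client goes through the kernel keyring to nfsidmap/rpc.idmapd); the domain and user table here are hypothetical:

```python
# Toy sketch of NFSv4-style id <-> name translation (illustration only;
# the real client upcalls to nfsidmap/rpc.idmapd via the kernel keyring).
# The open question above is *which* user namespace's uid should be
# looked up here when the mount is handed off to a container.

DOMAIN = "example.com"                          # hypothetical idmapping domain
PASSWD = {0: "root", 1: "bin", 1000: "trond"}   # hypothetical user database

def uid_to_name(uid):
    """uid -> 'user@domain' string, as carried in NFSv4 attributes."""
    user = PASSWD.get(uid)
    return f"{user}@{DOMAIN}" if user else None

def name_to_uid(name):
    """Reverse lookup for attributes received from the server."""
    user, _, domain = name.partition("@")
    if domain != DOMAIN:
        return None
    return next((uid for uid, u in PASSWD.items() if u == user), None)

print(uid_to_name(1000))                # 'trond@example.com'
print(name_to_uid("bin@example.com"))   # 1
```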


> > Another issue arises around the question of identifying containers
> > when they are migrated. At least the NFSv4 client needs to be able
> > to send a unique identifier that is preserved across container
> > migration. The uts_namespace is typically insufficient for this
> > purpose, since most containers don't bother to set a unique
> > hostname.
>
> We did have a discussion in plumbers about the container ID, but I'm
> not sure it reached a useful conclusion for you (video, I'm afraid):
>
> https://linuxplumbersconf.org/event/2/contributions/215/

I have a concrete proposal for how we can do this using 'udev', and I'm
looking for a forum in which to discuss it.

> > Finally, there is an issue that may be unique to NFS (in which case
> > I'd be happy to see it as a hallway discussion or a BoF session)
> > around preserving file state across container migrations.
>
> If by file state, you mean the internal kernel struct file state,
> doesn't CRIU already do that? or do you mean some other state?

I thought CRIU was unable to deal with file locking state?

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2019-01-23 22:32:31

by James Bottomley

Subject: Re: [LSF/MM TOPIC] Containers and distributed filesystems

On Wed, 2019-01-23 at 20:50 +0000, Trond Myklebust wrote:
> On Wed, 2019-01-23 at 11:21 -0800, James Bottomley wrote:
> > On Wed, 2019-01-23 at 18:10 +0000, Trond Myklebust wrote:
> > > Hi,
> > >
> > > I'd like to propose an LSF/MM discussion around the topic of
> > > containers and distributed filesystems.
> > >
> > > The background is that we have a number of decisions to make
> > > around dealing with namespaces when the filesystem is
> > > distributed.
> > >
> > > On the one hand, there is the issue of which user namespace we
> > > should be using when putting uids/gids on the wire, or when
> > > translating into alternative identities (user/group name, cifs
> > > SIDs,...). There are two main competing proposals: the first
> > > proposal is to select the user namespace of the process that
> > > mounted the distributed filesystem. The second proposal is to
> > > (continue to) use the user namespace pointed to by init_nsproxy.
> > > It seems that whichever choice we make, we probably want to
> > > ensure that all the major distributed filesystems (AFS, CIFS,
> > > NFS) have consistent handling of these situations.
> >
> > I don't think there's much disagreement among container people:
> > most would agree the uids on the wire should match the uids in the
> > container. If you're running your remote fs via fuse in an
> > unprivileged container, you have no access to the kuid/kgid anyway,
> > so it's the way you have to run.
> >
> > I think the latter comes about because most of the container
> > implementations still have difficulty consuming the user namespace,
> > so most run without it (where kuid = uid) or mis-implement it,
> > which is where you might get the mismatch. Is there an actual use
> > case where you'd want to see the kuid at the remote end, bearing in
> > mind that when user namespaces are properly set up the kuid is
> > often the product of internal subuid mapping?
>
> Wouldn't the above basically allow you to spoof root on any existing
> mounted NFS client using the unprivileged command 'unshare -U -r'?

Yes, but what are you using as security on the remote end? If it's an
assumption of coming from a privileged port, say, then that's not going
to work unprivileged anyway (and is a very 90s way of doing
security). If it's role-based, credential-based security then, surely,
how the client manages ids shouldn't be visible to the server, because
the server has granular credentials for each of its roles.

> Eric Biederman was the one proposing the 'match the namespace of the
> process that mounted the filesystem' approach. My main questions
> about that approach would be:
> 1) Are we guaranteed to always have a mapping between an arbitrary
> uid/gid from the user namespace in the container, to the user
> namespace of the parent orchestrator process that set up the mount?

Yes, user namespace mappings are injective, so a uid inside always maps
to one outside but not necessarily vice versa. Each user namespace you
go through can shrink the pool of external ids it maps to.
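As a rough illustration of that property (a toy model, not the kernel implementation), the `/proc/<pid>/uid_map` translation can be sketched like this, using a hypothetical subuid range:

```python
# Toy model of Linux /proc/<pid>/uid_map translation (illustration only,
# not the kernel code). Each entry maps a contiguous range of ids:
#   (inside_start, outside_start, count)

def to_parent(uid_map, inside_uid):
    """Map a uid inside the namespace to the parent namespace.

    Returns None if the uid has no mapping (it appears as 'nobody').
    """
    for inside, outside, count in uid_map:
        if inside <= inside_uid < inside + count:
            return outside + (inside_uid - inside)
    return None

def to_child(uid_map, outside_uid):
    """Reverse translation: parent-namespace uid to in-namespace uid."""
    for inside, outside, count in uid_map:
        if outside <= outside_uid < outside + count:
            return inside + (outside_uid - outside)
    return None

# A container whose uids 0-65535 are backed by subuids 100000-165535:
uid_map = [(0, 100000, 65536)]

print(to_parent(uid_map, 0))      # container root -> host 100000
print(to_parent(uid_map, 1))      # container bin  -> host 100001
print(to_parent(uid_map, 70000))  # unmapped inside uid -> None
print(to_child(uid_map, 1))       # host uid 1 is invisible inside -> None
```

Every mapped inside uid translates to exactly one outside uid, but most host uids (like uid 1 above) have no image inside the namespace, which is the shrinking pool described above.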

> 2) How do we reconcile that approach with the requirement that NFSv4
> be able to convert uids/gids into stringified user/group names (which
> is usually solved using an upcall mechanism)?

How do you authenticate the stringified ids? If you're relying on
authentication at mount time only, and trusting the client to tell you
the users with no further granular authentication by id, then yes, it's
always going to be a bit unsafe, because anyone possessing the mount
credentials can be any id on the server. So if you want the client to
supervise which id goes to the server, then the client has to run the
mount securely, make sure the handoff to the user namespace of the
container is correct, and, obviously, not allow an unprivileged
container to manage the actual client itself.

So, to give a concrete example: the container has what it thinks of as
root and bin (uid 0 and 1) at exterior uids 1000 and 1001. You want the
handed-off mount to accept a write by container bin (exterior uid 1001)
as uid 1 on the server (real bin), but deny a write by container root
(exterior uid 1000)?
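That scenario can be written out as a sketch. The map entry is the example's "uid 0 -> 1000, uid 1 -> 1001"; the allow-list and policy function are hypothetical, standing in for whatever per-mount supervision the client would actually do:

```python
# Sketch of the scenario above (illustration only): the container maps
# uid 0 -> exterior 1000 and uid 1 -> exterior 1001 (uid_map "0 1000 2").
# A hypothetical per-mount policy decides which exterior uids the
# handed-off mount will accept and translate for the wire.

UID_MAP = {0: 1000, 1: 1001}   # container uid -> exterior (host) uid

# Hypothetical allow-list: writes from container bin (exterior 1001)
# go to the server as uid 1; container root is denied.
ALLOWED_EXTERIOR = {1001}

def wire_uid(container_uid):
    """Server-side uid for a container uid, or None if policy denies it."""
    exterior = UID_MAP.get(container_uid)
    if exterior not in ALLOWED_EXTERIOR:
        return None                                # write denied
    # Translate back to the server-side identity (real bin = uid 1).
    return {v: k for k, v in UID_MAP.items()}[exterior]

print(wire_uid(1))   # container bin  -> server uid 1 (accepted)
print(wire_uid(0))   # container root -> None (denied)
```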

> > > Another issue arises around the question of identifying
> > > containers when they are migrated. At least the NFSv4 client
> > > needs to be able to send a unique identifier that is preserved
> > > across container migration. The uts_namespace is typically
> > > insufficient for this purpose, since most containers don't bother
> > > to set a unique hostname.
> >
> > We did have a discussion in plumbers about the container ID, but
> > I'm not sure it reached a useful conclusion for you (video, I'm
> > afraid):
> >
> > https://linuxplumbersconf.org/event/2/contributions/215/
>
> I have a concrete proposal for how we can do this using 'udev', and
> I'm looking for a forum in which to discuss it.

Cc'ing the container list: [email protected] might
be a good start.

> > > Finally, there is an issue that may be unique to NFS (in which
> > > case I'd be happy to see it as a hallway discussion or a BoF
> > > session) around preserving file state across container
> > > migrations.
> >
> > If by file state, you mean the internal kernel struct file state,
> > doesn't CRIU already do that? or do you mean some other state?
>
> I thought CRIU was unable to deal with file locking state?

Depends what you mean by "deal with". The lock state can be extracted
from the source and transferred to the target, so it works locally
(every transferred process sees the same locking state before and
after). However, I think on the server the locks get dropped on the
transfer and reacquired, so a third party can get in and acquire the
lock in between, if that's the worry. We probably need a CRIU person to
explain this better, and what the current state of play is, since my
knowledge is some years old.

James


2019-02-09 21:49:30

by Steve French

Subject: Re: [LSF/MM TOPIC] Containers and distributed filesystems

Trond's proposal for discussion at LSF/MM (below) makes sense and could
be useful; similar questions come up often with CIFS/SMB3 (and probably
other distributed file systems).

On Wed, Jan 23, 2019 at 12:11 PM Trond Myklebust
<[email protected]> wrote:
> I'd like to propose an LSF/MM discussion around the topic of containers
> and distributed filesystems.
>
> The background is that we have a number of decisions to make around
> dealing with namespaces when the filesystem is distributed.
>
> On the one hand, there is the issue of which user namespace we should
> be using when putting uids/gids on the wire, or when translating into
> alternative identities (user/group name, cifs SIDs,...). There are two
> main competing proposals: the first proposal is to select the user
> namespace of the process that mounted the distributed filesystem. The
> second proposal is to (continue to) use the user namespace pointed to
> by init_nsproxy. It seems that whichever choice we make, we probably
> want to ensure that all the major distributed filesystems (AFS, CIFS,
> NFS) have consistent handling of these situations.
>
> Another issue arises around the question of identifying containers when
> they are migrated. At least the NFSv4 client needs to be able to send a
> unique identifier that is preserved across container migration. The
> uts_namespace is typically insufficient for this purpose, since most
> containers don't bother to set a unique hostname.

Makes sense

> Finally, there is an issue that may be unique to NFS (in which case I'd
> be happy to see it as a hallway discussion or a BoF session) around
> preserving file state across container migrations.

Not unique to NFS


--
Thanks,

Steve