2018-08-15 19:24:22

by David Howells

Subject: Should we split the network filesystem setup into two phases?

Having just re-ported NFS on top of the new mount API stuff, I find that I
don't really like the idea of superblocks being separated by communication
parameters - especially when it might seem reasonable to be able to adjust
those parameters.

Does it make sense to abstract out the remote peer and allow (a) that to be
configured separately from any superblocks using it and (b) that to be used to
create superblocks?

Note that what a 'remote peer' is would be different for different
filesystems:

(*) For NFS, it would probably be a named server, with address(es) attached
to the name. In lieu of actually having a name, the initial IP address
could be used.

(*) For CIFS, it would probably be a named server. I'm not sure if CIFS
allows an abstraction for a share that can move about inside a domain.

(*) For AFS, it would be a cell, I think, where the actual fileserver(s) used
are a matter of direction from the Volume Location server.

(*) For 9P and Ceph, I don't really know.

What could be configured? Well, addresses, ports, timeouts. Maybe protocol
level negotiation - though not being able to explicitly specify, say, the
particular version and minorversion on an NFS share would be problematic for
backward compatibility.

One advantage it could give us is that it might make it easier, if someone
asks for server X, to query userspace in some way for what the default
parameters for X are.

What might this look like in terms of userspace? Well, we could overload the
new mount API:

peer1 = fsopen("nfs", FSOPEN_CREATE_PEER);
fsconfig(peer1, FSCONFIG_SET_NS, "net", NULL, netns_fd);
fsconfig(peer1, FSCONFIG_SET_STRING, "peer_name", "server.home", 0);
fsconfig(peer1, FSCONFIG_SET_STRING, "vers", "4.2", 0);
fsconfig(peer1, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.1", 0);
fsconfig(peer1, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.2", 0);
fsconfig(peer1, FSCONFIG_SET_STRING, "timeo", "122", 0);
fsconfig(peer1, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);

peer2 = fsopen("nfs", FSOPEN_CREATE_PEER);
fsconfig(peer2, FSCONFIG_SET_NS, "net", NULL, netns_fd);
fsconfig(peer2, FSCONFIG_SET_STRING, "peer_name", "server2.home", 0);
fsconfig(peer2, FSCONFIG_SET_STRING, "vers", "3", 0);
fsconfig(peer2, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.3", 0);
fsconfig(peer2, FSCONFIG_SET_STRING, "address", "udp:192.168.1.4+6001", 0);
fsconfig(peer2, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);

fs = fsopen("nfs", 0);
fsconfig(fs, FSCONFIG_SET_PEER, "peer.1", NULL, peer1);
fsconfig(fs, FSCONFIG_SET_PEER, "peer.2", NULL, peer2);
fsconfig(fs, FSCONFIG_SET_STRING, "source", "/home/dhowells", 0);
m = fsmount(fs, 0, 0);
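
Presumably m would then be attached in the usual way for the new API -
something like the following, with the flags being illustrative:

move_mount(m, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);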

[Note that Eric's oft-repeated point about the 'creation' operation altering
established parameters still stands here.]

You could also then reopen it for configuration, maybe by:

peer = fspick(AT_FDCWD, "/mnt", FSPICK_PEER);

or:

peer = fspick(AT_FDCWD, "nfs:server.home", FSPICK_PEER_BY_NAME);

though it might be better to give it its own syscall:

peer = fspeer("nfs", "server.home", O_CLOEXEC);
fsconfig(peer, FSCONFIG_SET_NS, "net", NULL, netns_fd);
...
fsconfig(peer, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);

In terms of alternative interfaces, I'm not sure how easy it would be to make
it like cgroups where you go and create a dir in a special filesystem, say,
"/sys/peers/nfs", because the peers records and names would have to be network
namespaced. Also, it might make it more difficult to use to create a root fs.

On the other hand, being able to adjust the peer configuration by:

echo 71 >/sys/peers/nfs/server.home/timeo

does have a certain appeal.
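
A peer directory might then look something like this (the attribute names
are purely illustrative):

/sys/peers/nfs/server.home/
	address
	timeo
	vers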

Also, netlink might be the right option, but I'm not sure how you'd pin the
resultant object whilst you make use of it.

A further thought: is it worth making this idea more general, encompassing
non-network devices also?  This would run into issues of some logical
sources being visible across namespaces but not others.

David


2018-08-15 19:44:34

by Andy Lutomirski

Subject: Re: Should we split the network filesystem setup into two phases?



> On Aug 15, 2018, at 9:31 AM, David Howells <[email protected]> wrote:
>
> Having just re-ported NFS on top of the new mount API stuff, I find that I
> don't really like the idea of superblocks being separated by communication
> parameters - especially when it might seem reasonable to be able to adjust
> those parameters.
>
> Does it make sense to abstract out the remote peer and allow (a) that to be
> configured separately from any superblocks using it and (b) that to be used to
> create superblocks?
>
> Note that what a 'remote peer' is would be different for different
> filesystems:

...

I think this looks rather nice. But maybe you should generalize the concept
of "peer" so that it works for btrfs too. In the case where you mount two
different subvolumes, you're creating a *something*, and you're then creating
a filesystem that references it. It's almost the same thing.

>
> fs = fsopen("nfs", 0);
> fsconfig(fs, FSCONFIG_SET_PEER, "peer.1", NULL, peer1);

As you mention below, this seems like it might have namespacing issues.

>
> In terms of alternative interfaces, I'm not sure how easy it would be to make
> it like cgroups where you go and create a dir in a special filesystem, say,
> "/sys/peers/nfs", because the peer records and names would have to be network
> namespaced. Also, it might make it more difficult to use to create a root fs.
>
> On the other hand, being able to adjust the peer configuration by:
>
> echo 71 >/sys/peers/nfs/server.home/timeo
>
> does have a certain appeal.
>
> Also, netlink might be the right option, but I'm not sure how you'd pin the
> resultant object whilst you make use of it.
>

My suggestion would be to avoid giving these things names at all. I think
that referring to them by fd should be sufficient, especially if you allow
them to be reopened based on a mount that uses them and allow them to get
bind-mounted somewhere a la namespaces to make them permanent if needed.
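
Much as the ip-netns trick pins a network namespace today - purely
illustratively, assuming the kernel would let a peer fd be bind-mounted
the way nsfs files can be, and reusing peer1 from the example above:

char path[64];
snprintf(path, sizeof(path), "/proc/self/fd/%d", peer1);
mount(path, "/run/fspeers/server.home", NULL, MS_BIND, NULL);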

> A further thought: is it worth making this idea more general, encompassing
> non-network devices also? This would run into issues of some logical
> sources being visible across namespaces but not others.

Indeed :)

It probably pays to rope a btrfs person into this discussion.

2018-08-16 08:02:07

by Eric W. Biederman

Subject: Re: Should we split the network filesystem setup into two phases?

David Howells <[email protected]> writes:

> Having just re-ported NFS on top of the new mount API stuff, I find that I
> don't really like the idea of superblocks being separated by communication
> parameters - especially when it might seem reasonable to be able to adjust
> those parameters.
>
> Does it make sense to abstract out the remote peer and allow (a) that to be
> configured separately from any superblocks using it and (b) that to be used to
> create superblocks?
>
> Note that what a 'remote peer' is would be different for different
> filesystems:
>
> (*) For NFS, it would probably be a named server, with address(es) attached
> to the name. In lieu of actually having a name, the initial IP address
> could be used.
>
> (*) For CIFS, it would probably be a named server. I'm not sure if CIFS
> allows an abstraction for a share that can move about inside a domain.
>
> (*) For AFS, it would be a cell, I think, where the actual fileserver(s) used
> are a matter of direction from the Volume Location server.
>
> (*) For 9P and Ceph, I don't really know.
>
> What could be configured? Well, addresses, ports, timeouts. Maybe protocol
> level negotiation - though not being able to explicitly specify, say, the
> particular version and minorversion on an NFS share would be problematic for
> backward compatibility.
>
> One advantage it could give us is that it might make it easier, if someone
> asks for server X, to query userspace in some way for what the default
> parameters for X are.
>
> What might this look like in terms of userspace? Well, we could overload the
> new mount API:
>
> peer1 = fsopen("nfs", FSOPEN_CREATE_PEER);
> fsconfig(peer1, FSCONFIG_SET_NS, "net", NULL, netns_fd);
> fsconfig(peer1, FSCONFIG_SET_STRING, "peer_name", "server.home", 0);
> fsconfig(peer1, FSCONFIG_SET_STRING, "vers", "4.2", 0);
> fsconfig(peer1, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.1", 0);
> fsconfig(peer1, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.2", 0);
> fsconfig(peer1, FSCONFIG_SET_STRING, "timeo", "122", 0);
> fsconfig(peer1, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);
>
> peer2 = fsopen("nfs", FSOPEN_CREATE_PEER);
> fsconfig(peer2, FSCONFIG_SET_NS, "net", NULL, netns_fd);
> fsconfig(peer2, FSCONFIG_SET_STRING, "peer_name", "server2.home", 0);
> fsconfig(peer2, FSCONFIG_SET_STRING, "vers", "3", 0);
> fsconfig(peer2, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.3", 0);
> fsconfig(peer2, FSCONFIG_SET_STRING, "address", "udp:192.168.1.4+6001", 0);
> fsconfig(peer2, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);
>
> fs = fsopen("nfs", 0);
> fsconfig(fs, FSCONFIG_SET_PEER, "peer.1", NULL, peer1);
> fsconfig(fs, FSCONFIG_SET_PEER, "peer.2", NULL, peer2);
> fsconfig(fs, FSCONFIG_SET_STRING, "source", "/home/dhowells", 0);
> m = fsmount(fs, 0, 0);
>
> [Note that Eric's oft-repeated point about the 'creation' operation altering
> established parameters still stands here.]
>
> You could also then reopen it for configuration, maybe by:
>
> peer = fspick(AT_FDCWD, "/mnt", FSPICK_PEER);
>
> or:
>
> peer = fspick(AT_FDCWD, "nfs:server.home", FSPICK_PEER_BY_NAME);
>
> though it might be better to give it its own syscall:
>
> peer = fspeer("nfs", "server.home", O_CLOEXEC);
> fsconfig(peer, FSCONFIG_SET_NS, "net", NULL, netns_fd);
> ...
> fsconfig(peer, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);
>
> In terms of alternative interfaces, I'm not sure how easy it would be to make
> it like cgroups where you go and create a dir in a special filesystem, say,
> "/sys/peers/nfs", because the peers records and names would have to be network
> namespaced. Also, it might make it more difficult to use to create a root fs.
>
> On the other hand, being able to adjust the peer configuration by:
>
> echo 71 >/sys/peers/nfs/server.home/timeo
>
> does have a certain appeal.
>
> Also, netlink might be the right option, but I'm not sure how you'd pin the
> resultant object whilst you make use of it.
>
> A further thought: is it worth making this idea more general, encompassing
> non-network devices also? This would run into issues of some logical
> sources being visible across namespaces but not others.

Even network filesystems are going to have the challenge of filesystems
being visible in some network namespaces and not others, as some
filesystems will be visible on the internet while others will only be
visible on the appropriate local network. Network namespaces are
sometimes used to deal with the case of local networks with overlapping
ip addresses.

I think you are proposing a model for network filesystems that is
essentially the situation we are in with most block device filesystems
today, where some parameters identify the local filesystem instance and
some parameters identify how the kernel interacts with that filesystem
instance.


For system efficiency there is a strong argument for having the fewest
number of filesystem instances we can. Otherwise we will be caching the
same data twice and wasting space in RAM etc.


So I like the idea.


At least for devpts we always create a new filesystem instance every
time mount(2) is called. NFS seems to have the option to create a new
filesystem instance every time mount(2) is called as well (even if the
filesystem parameters are the same). And depending on the case I can
see the attraction for other filesystems as well.

So I don't think we can completely abandon the option for filesystems
to always create a new filesystem instance when mount(8) is called.



I most definitely support thinking this through and figuring out how it
best makes sense for the new filesystem API to create new filesystem
instances or fail to create new filesystem instances.


Eric

2018-08-16 19:24:37

by Steve French

Subject: Re: Should we split the network filesystem setup into two phases?

On Thu, Aug 16, 2018 at 2:56 AM Eric W. Biederman <[email protected]> wrote:
>
> David Howells <[email protected]> writes:
>
> > Having just re-ported NFS on top of the new mount API stuff, I find that I
> > don't really like the idea of superblocks being separated by communication
> > parameters - especially when it might seem reasonable to be able to adjust
> > those parameters.
> >
> > Does it make sense to abstract out the remote peer and allow (a) that to be
> > configured separately from any superblocks using it and (b) that to be used to
> > create superblocks?
<snip>
> At least for devpts we always create a new filesystem instance every
> time mount(2) is called. NFS seems to have the option to create a new
> filesystem instance every time mount(2) is called as well (even if the
> filesystem parameters are the same). And depending on the case I can
> see the attraction for other filesystems as well.
>
> So I don't think we can completely abandon the option for filesystems
> to always create a new filesystem instance when mount(8) is called.

In cifs we attempt to match new mounts to existing tree connections
(instances of connections to a \\server\share) from other mount(s)
based first on whether security settings match (e.g. are both
Kerberos) and then on whether encryption is on/off and whether this is
a snapshot mount (smb3 previous versions feature). If neither is
mounted with a snapshot and the encryption settings match then
we will use the same tree id to talk with the server as the other
mounts use. Interesting idea to allow mount to force a new
tree id.
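
Roughly, the matching reads like this - an illustrative sketch only, not
the actual cifs.ko types or field names:

struct tcon_params {
	int sec_type;				/* e.g. Kerberos vs NTLMv2 */
	bool encrypt;				/* smb3 encryption on/off */
	unsigned long long snapshot_time;	/* non-zero for snapshot mounts */
};

/* Can a new mount request share an existing tree connection? */
static bool tcon_matches(const struct tcon_params *old,
			 const struct tcon_params *new)
{
	if (old->sec_type != new->sec_type)
		return false;
	if (old->encrypt != new->encrypt)
		return false;
	if (old->snapshot_time || new->snapshot_time)
		return false;
	return true;	/* reuse the existing tree id */
}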

What was the NFS mount option you were talking about?
Looking at the nfs man page the only one that looked similar
was "nosharecache"

> I most definitely support thinking this through and figuring out how it
> best make sense for the new filesystem API to create new filesystem
> instances or fail to create new filesystems instances.

Yes - it is an interesting question.

--
Thanks,

Steve

2018-08-16 20:21:22

by Eric W. Biederman

Subject: Re: Should we split the network filesystem setup into two phases?

Steve French <[email protected]> writes:

> On Thu, Aug 16, 2018 at 2:56 AM Eric W. Biederman <[email protected]> wrote:
>>
>> David Howells <[email protected]> writes:
>>
>> > Having just re-ported NFS on top of the new mount API stuff, I find that I
>> > don't really like the idea of superblocks being separated by communication
>> > parameters - especially when it might seem reasonable to be able to adjust
>> > those parameters.
>> >
>> > Does it make sense to abstract out the remote peer and allow (a) that to be
>> > configured separately from any superblocks using it and (b) that to be used to
>> > create superblocks?
> <snip>
>> At least for devpts we always create a new filesystem instance every
>> time mount(2) is called. NFS seems to have the option to create a new
>> filesystem instance every time mount(2) is called as well (even if the
>> filesystem parameters are the same). And depending on the case I can
>> see the attraction for other filesystems as well.
>>
>> So I don't think we can completely abandon the option for filesystems
>> to always create a new filesystem instance when mount(8) is called.
>
> In cifs we attempt to match new mounts to existing tree connections
> (instances of connections to a \\server\share) from other mount(s)
> based first on whether security settings match (e.g. are both
> Kerberos) and then on whether encryption is on/off and whether this is
> a snapshot mount (smb3 previous versions feature). If neither is
> mounted with a snapshot and the encryption settings match then
> we will use the same tree id to talk with the server as the other
> mounts use. Interesting idea to allow mount to force a new
> tree id.
>
> What was the NFS mount option you were talking about?
> Looking at the nfs man page the only one that looked similar
> was "nosharecache"

I was remembering this from reading the nfs mount code:

static int nfs_compare_super(struct super_block *sb, void *data)
{
	...
	if (!nfs_compare_super_address(old, server))
		return 0;
	/* Note: NFS_MOUNT_UNSHARED == NFS4_MOUNT_UNSHARED */
	if (old->flags & NFS_MOUNT_UNSHARED)
		return 0;
	...
}

If a filesystem has NFS_MOUNT_UNSHARED set it does not serve as a
candidate for new mount requests. Skimming the code it looks like
nosharecache is what sets NFS_MOUNT_UNSHARED.


Another interesting and common case is tmpfs which always creates a new
filesystem instance whenever it is mounted.
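
In the current ->mount() world that is just a matter of which helper the
filesystem type uses - illustratively, with foo_fill_super standing in as
a placeholder:

/* Always allocate a new superblock, tmpfs-style. */
static struct dentry *foo_mount(struct file_system_type *fs_type,
				int flags, const char *dev_name, void *data)
{
	return mount_nodev(fs_type, flags, data, foo_fill_super);
}

whereas a sharing filesystem goes through sget() with a compare callback
such as nfs_compare_super above.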

Eric

2018-08-16 20:23:14

by Aurélien Aptel

Subject: Re: Should we split the network filesystem setup into two phases?

Steve French <[email protected]> writes:
> In cifs we attempt to match new mounts to existing tree connections
> (instances of connections to a \\server\share) from other mount(s)
> based first on whether security settings match (e.g. are both
> Kerberos) and then on whether encryption is on/off and whether this is
> a snapshot mount (smb3 previous versions feature). If neither is
> mounted with a snaphsot and the encryption settings match then
> we will use the same tree id to talk with the server as the other
> mounts use. Interesting idea to allow mount to force a new
> tree id.

We actually already have this mount option in cifs.ko, it's "nosharesock".

> What was the NFS mount option you were talking about?
> Looking at the nfs man page the only one that looked similar
> was "nosharecache"

Cheers,
--
Aurélien Aptel / SUSE Labs Samba Team
GPG: 1839 CB5F 9F5B FB9B AA97 8C99 03C8 A49B 521B D5D3
SUSE Linux GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)

2018-08-16 21:36:30

by Steve French

Subject: Re: Should we split the network filesystem setup into two phases?

On Thu, Aug 16, 2018 at 12:23 PM Aurélien Aptel <[email protected]> wrote:
>
> Steve French <[email protected]> writes:
> > In cifs we attempt to match new mounts to existing tree connections
> > (instances of connections to a \\server\share) from other mount(s)
> > based first on whether security settings match (e.g. are both
> > Kerberos) and then on whether encryption is on/off and whether this is
> > a snapshot mount (smb3 previous versions feature). If neither is
> > mounted with a snapshot and the encryption settings match then
> > we will use the same tree id to talk with the server as the other
> > mounts use. Interesting idea to allow mount to force a new
> > tree id.
>
> We actually already have this mount option in cifs.ko, it's "nosharesock".

Yes - good point. It is very easy to do on cifs. I mainly use that to
simulate multiple clients for testing servers (so each mount to the same
server, whether or not the share matches, looks like a different client,
coming from a different socket and thus with different session ids and
tree ids as well).

It is very useful when trying to simulate multiple clients running to the
same server while using only one client machine (or VM).

> > What was the NFS mount option you were talking about?
> > Looking at the nfs man page the only one that looked similar
> > was "nosharecache"

The nfs man page apparently discourages its use:

"As of kernel 2.6.18, the behavior specified by nosharecache is legacy
caching behavior. This is considered a data risk"


--
Thanks,

Steve

2018-08-18 02:17:13

by Al Viro

Subject: Re: Should we split the network filesystem setup into two phases?

On Thu, Aug 16, 2018 at 12:06:06AM -0500, Eric W. Biederman wrote:

> So I don't think we can completely abandon the option for filesystems
> to always create a new filesystem instance when mount(8) is called.

Huh? If a filesystem wants to create a new instance on each ->mount(),
it can bloody well do so. Quite a few do - if that fs can handle
that, more power to it.

The problem is what to do with filesystems that *can't* do that.
You really, really can't have two ext4 (or xfs, etc.) instances over
the same device at the same time. Cache coherency, locking, etc.
will kill you.

And that's not to mention the joy of defining the semantics of
having the same ext4 mounted with two logs at the same time ;-)

I've seen "reject unless the options are compatible/identical/whatever",
but that ignores the real problem with existing policy. It's *NOT*
"I've mounted this and got an existing instance with non-matching
options". That's a minor annoyance (and back when that decision
had been made, mount(2) was very definitely root-only). The real
problem is different and much worse - it's remount.

I have asked to mount something and it had already been mounted,
with identical options. OK, so what happens if I do mount -o remount
on my instance? *IF* we are operating in the "only sysadmin can
mount new filesystems" world, it's not a big deal - there are already
lots of ways you can shoot yourself in the foot and mount(2) is
certainly a powerful one. But if we get to "Joe R. Luser can do
it in his container", we have a big problem.
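
To spell the hazard out (device and paths purely illustrative):

mount("/dev/sda1", "/mnt/a", "ext4", 0, NULL);	/* creates the instance */
mount("/dev/sda1", "/mnt/b", "ext4", 0, NULL);	/* reuses the superblock */
mount(NULL, "/mnt/b", NULL, MS_REMOUNT | MS_RDONLY, NULL);
/* ... and /mnt/a has just gone read-only as well */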

The decision back then had been mostly for usability reasons - it was
back in 2001 (well before the containermania, userns or anything
of that sort) and it was more about "how many hoops does one have
to jump through to get something mounted, assuming the sanity of the
sysadmin doing that?". If *anything* like userns had been a concern
back then, it probably would've been different. However, it's
17 years too late and if anyone has a functional TARDIS, I can
easily think of better uses for it...