2022-02-07 17:32:02

by Benjamin Coddington

Subject: v4 clientid uniquifiers in containers/namespaces

Hi all,

Is anyone using a udev(-like) implementation with NETLINK_LISTEN_ALL_NSID?
It looks like that is at least necessary to allow the init namespaced udev
to receive notifications on /sys/fs/nfs/net/nfs_client/identifier, which
would be a pre-req to automatically uniquify in containers.

I'm interested since it will inform whether I need to send patches to
systemd's udev, and potentially open the can of worms over there. It's also
not yet clear to me how an init namespaced udev process can write to a netns
sysfs path.

Another option might be to create yet another daemon/tool that would listen
specifically for these notifications. Ugh.
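
For anyone who wants to experiment, here is a rough sketch of the listener
side. It opens a kernel uevent netlink socket, sets NETLINK_LISTEN_ALL_NSID,
and parses the NUL-separated payload. The numeric constants are copied from
<linux/netlink.h>; the helper names (open_uevent_socket, parse_uevent) and
the sample payload in the usage note are made up for illustration. Whether
the kernel actually delivers these particular events to an init-namespace
listener is the open question above, and the sketch does not settle it.

```python
import socket

# Values from the Linux UAPI header <linux/netlink.h>, spelled out because
# Python's socket module may not export all of them.
NETLINK_KOBJECT_UEVENT = 15
SOL_NETLINK = 270
NETLINK_LISTEN_ALL_NSID = 8
UEVENT_KERNEL_GROUP = 1  # multicast group for kernel-originated uevents


def open_uevent_socket(all_nsid=True):
    """Open a uevent socket; needs privileges, so call it explicitly."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW,
                         NETLINK_KOBJECT_UEVENT)
    if all_nsid:
        # Ask for messages from other network namespaces as well.
        sock.setsockopt(SOL_NETLINK, NETLINK_LISTEN_ALL_NSID, 1)
    # Netlink addresses in Python are (pid, groups); pid 0 lets the
    # kernel pick one for us.
    sock.bind((0, UEVENT_KERNEL_GROUP))
    return sock


def parse_uevent(payload: bytes) -> dict:
    """Split a kernel uevent datagram into its header and KEY=VALUE fields."""
    parts = [p.decode("utf-8", "replace") for p in payload.split(b"\0") if p]
    if not parts:
        return {}
    event = {"header": parts[0]}  # e.g. "add@/some/devpath"
    for field in parts[1:]:
        key, sep, value = field.partition("=")
        if sep:
            event[key] = value
    return event
```

Feeding parse_uevent() a hand-written payload such as
b"add@/fs/nfs/net\0ACTION=add\0DEVPATH=/fs/nfs/net\0" gives back a dict with
"ACTION" and "DEVPATH" keys; that payload is a guess at the shape, not a
captured event.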

Ben



2022-02-07 22:04:33

by Chuck Lever III

Subject: Re: v4 clientid uniquifiers in containers/namespaces



> On Feb 7, 2022, at 9:05 AM, Benjamin Coddington <[email protected]> wrote:
>
> On 5 Feb 2022, at 14:50, Benjamin Coddington wrote:
>
>> On 5 Feb 2022, at 13:24, Trond Myklebust wrote:
>>
>>> On Sat, 2022-02-05 at 10:03 -0500, Benjamin Coddington wrote:
>>>> Hi all,
>>>>
>>>> Is anyone using a udev(-like) implementation with
>>>> NETLINK_LISTEN_ALL_NSID?
>>>> It looks like that is at least necessary to allow the init namespaced
>>>> udev
>>>> to receive notifications on /sys/fs/nfs/net/nfs_client/identifier,
>>>> which
>>>> would be a pre-req to automatically uniquify in containers.
>>>>
>>>> I'md interested since it will inform whether I need to send patches
>>>> to
>>>> systemd's udev, and potentially open the can of worms over there.
>>>> Yet its
>>>> not yet clear to me how an init namespaced udev process can write to
>>>> a netns
>>>> sysfs path.
>>>>
>>>> Another option might be to create yet another daemon/tool that would
>>>> listen
>>>> specifically for these notifications. Ugh.
>>>>
>>>> Ben
>>>>
>>>
>>> I don't understand. Why do you need a new daemon/tool?
>
> Because what we've got only works for the init namespace.
>
> Udev won't get kobject notifications because its not using
> NETLINK_LISTEN_ALL_NSIDs.
>
> We need to figure out if we want:
>
> 1) the init namespace udevd to handle all client_id uniquifiers
> 2) we expect network namespaces to run their own udevd
> 3) or both.
>
> I think 2 violates "least surprise", and 3 might not be something anyone
> ever wants. If they do, we can fix it at that point.
>
> So to make 1 work, we can try to change udevd, or maybe just hacking about
> with nfs_netns_object_child_ns_type will be sufficient.

I agree that 1 seems like the preferred approach, though
I don't have a technical suggestion at this point.

Again, thank you for drilling into this.


--
Chuck Lever




2022-02-08 15:46:37

by Chuck Lever III

Subject: Re: v4 clientid uniquifiers in containers/namespaces



> On Feb 7, 2022, at 2:38 PM, Trond Myklebust <[email protected]> wrote:
>
> On Mon, 2022-02-07 at 15:49 +0000, Chuck Lever III wrote:
>>
>>
>>> On Feb 7, 2022, at 9:05 AM, Benjamin Coddington
>>> <[email protected]> wrote:
>>>
>>> On 5 Feb 2022, at 14:50, Benjamin Coddington wrote:
>>>
>>>> On 5 Feb 2022, at 13:24, Trond Myklebust wrote:
>>>>
>>>>> On Sat, 2022-02-05 at 10:03 -0500, Benjamin Coddington wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> Is anyone using a udev(-like) implementation with
>>>>>> NETLINK_LISTEN_ALL_NSID?
>>>>>> It looks like that is at least necessary to allow the init
>>>>>> namespaced
>>>>>> udev
>>>>>> to receive notifications on
>>>>>> /sys/fs/nfs/net/nfs_client/identifier,
>>>>>> which
>>>>>> would be a pre-req to automatically uniquify in containers.
>>>>>>
>>>>>> I'md interested since it will inform whether I need to send
>>>>>> patches
>>>>>> to
>>>>>> systemd's udev, and potentially open the can of worms over
>>>>>> there.
>>>>>> Yet its
>>>>>> not yet clear to me how an init namespaced udev process can
>>>>>> write to
>>>>>> a netns
>>>>>> sysfs path.
>>>>>>
>>>>>> Another option might be to create yet another daemon/tool
>>>>>> that would
>>>>>> listen
>>>>>> specifically for these notifications. Ugh.
>>>>>>
>>>>>> Ben
>>>>>>
>>>>>
>>>>> I don't understand. Why do you need a new daemon/tool?
>>>
>>> Because what we've got only works for the init namespace.
>>>
>>> Udev won't get kobject notifications because its not using
>>> NETLINK_LISTEN_ALL_NSIDs.
>>>
>>> We need to figure out if we want:
>>>
>>> 1) the init namespace udevd to handle all client_id uniquifiers
>>> 2) we expect network namespaces to run their own udevd
>>> 3) or both.
>>>
>>> I think 2 violates "least surprise", and 3 might not be something
>>> anyone
>>> ever wants. If they do, we can fix it at that point.
>>>
>>> So to make 1 work, we can try to change udevd, or maybe just
>>> hacking about
>>> with nfs_netns_object_child_ns_type will be sufficient.
>>
>> I agree that 1 seems like the preferred approach, though
>> I don't have a technical suggestion at this point.
>>
>
> I strongly disagree. (1) requires the init namespace to have intimate
> knowledge of container internals. Why do we need to make that a
> requirement? That violates the expectation that containers are
> stateless by default, and also the expectation that they operate
> independently of the environment.
>
> If you really do want external control over the uuid that is set, then
> it should be pretty trivial to do so by using the standard container
> tools for manipulating the namespace (e.g. to mount a file that is
> under control of the parent as /etc/nfs4-uuid.conf or whatever).
>
> However in most cases that I can think of, if the container is doing
> its own NFS mounting, then it is going to have to be set up with its
> own nfs-utils, etc, so there is no reason why we can't also require
> udev.

What Ben described in 1. is more closely aligned with how I thought
containers work today.

But it could be that 2. gives the ability to migrate the guest
container to another physical host and take its nfs4_unique_id
with it.

I don't have a strong preference between the two. I'm in favor
of doing whichever gets us to "done" faster.


--
Chuck Lever




2022-02-08 16:37:16

by Chuck Lever III

Subject: Re: v4 clientid uniquifiers in containers/namespaces



> On Feb 8, 2022, at 9:29 AM, Benjamin Coddington <[email protected]> wrote:
>
> On 8 Feb 2022, at 8:45, Trond Myklebust wrote:
>
>>> Can't we just uniquify the namespaced NFS client ourselves, while
>>> still
>>> exposing /sys/fs/nfs/net/nfs_client/identifier within the namespace?
>>> That
>>> way if someone want to run udev or use their own method of persistent
>>> id
>>> its available to them within the container so they can. Then we can
>>> move
>>> forward because the problem of distinguishing clients between the
>>> host
>>> and
>>> netns is automagically solved.
>>
>> That could be done.
>
> Ok, I'm eyeballing a sha1 of the init namespace uniquifier and
> peernet2id_alloc(new_net, init_net).. but means the NFS client would grow a
> dependency on CRYPTO and CRYPTO_SHA1.

Or you could use siphash instead of SHA-1.

I don't think we should be adding any more SHA-1 to the kernel --
it's deprecated for good reasons.
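
To make the idea concrete, here is a userspace sketch of deriving a
per-namespace uniquifier by running SipHash-2-4 over the init namespace
uniquifier plus a namespace id. This is plain Python, not the kernel's
siphash API. The derive_uniquifier() helper, its input encoding, the 16-byte
key, and the hex output format are all assumptions for illustration.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def _rotl(x, b):
    """Rotate a 64-bit value left by b bits."""
    return ((x << b) | (x >> (64 - b))) & MASK64


def siphash24(key: bytes, data: bytes) -> int:
    """SipHash-2-4 of data under a 16-byte key, returned as a 64-bit int."""
    if len(key) != 16:
        raise ValueError("SipHash needs a 16-byte key")
    k0, k1 = struct.unpack("<QQ", key)
    v0 = k0 ^ 0x736F6D6570736575
    v1 = k1 ^ 0x646F72616E646F6D
    v2 = k0 ^ 0x6C7967656E657261
    v3 = k1 ^ 0x7465646279746573

    def rounds(n):
        nonlocal v0, v1, v2, v3
        for _ in range(n):
            v0 = (v0 + v1) & MASK64
            v1 = _rotl(v1, 13) ^ v0
            v0 = _rotl(v0, 32)
            v2 = (v2 + v3) & MASK64
            v3 = _rotl(v3, 16) ^ v2
            v0 = (v0 + v3) & MASK64
            v3 = _rotl(v3, 21) ^ v0
            v2 = (v2 + v1) & MASK64
            v1 = _rotl(v1, 17) ^ v2
            v2 = _rotl(v2, 32)

    # Zero-pad so the final 8-byte word ends with the message length byte.
    padded = data + b"\0" * (7 - len(data) % 8) + bytes([len(data) & 0xFF])
    for (m,) in struct.iter_unpack("<Q", padded):
        v3 ^= m
        rounds(2)  # two compression rounds per word
        v0 ^= m
    v2 ^= 0xFF
    rounds(4)      # four finalization rounds
    return v0 ^ v1 ^ v2 ^ v3


def derive_uniquifier(init_uniquifier: str, netns_id: int, key: bytes) -> str:
    """Hypothetical derivation: hash(init uniquifier || netns id) as hex."""
    data = init_uniquifier.encode() + struct.pack("<i", netns_id)
    return f"{siphash24(key, data):016x}"
```

The result is deterministic for a given (uniquifier, id, key) triple, so it
is only as unique as the namespace id fed into it.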


--
Chuck Lever




2022-02-08 18:58:23

by Trond Myklebust

Subject: Re: v4 clientid uniquifiers in containers/namespaces

On Tue, 2022-02-08 at 06:32 -0500, Benjamin Coddington wrote:
> On 7 Feb 2022, at 18:59, Chuck Lever III wrote:
>
> > > On Feb 7, 2022, at 2:38 PM, Trond Myklebust
> > > <[email protected]>
> > > wrote:
> > >
> > > On Mon, 2022-02-07 at 15:49 +0000, Chuck Lever III wrote:
> > > >
> > > >
> > > > > On Feb 7, 2022, at 9:05 AM, Benjamin Coddington
> > > > > <[email protected]> wrote:
> > > > >
> > > > > On 5 Feb 2022, at 14:50, Benjamin Coddington wrote:
> > > > >
> > > > > > On 5 Feb 2022, at 13:24, Trond Myklebust wrote:
> > > > > >
> > > > > > > On Sat, 2022-02-05 at 10:03 -0500, Benjamin Coddington
> > > > > > > wrote:
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > Is anyone using a udev(-like) implementation with
> > > > > > > > NETLINK_LISTEN_ALL_NSID?
> > > > > > > > It looks like that is at least necessary to allow the
> > > > > > > > init
> > > > > > > > namespaced
> > > > > > > > udev
> > > > > > > > to receive notifications on
> > > > > > > > /sys/fs/nfs/net/nfs_client/identifier,
> > > > > > > > which
> > > > > > > > would be a pre-req to automatically uniquify in
> > > > > > > > containers.
> > > > > > > >
> > > > > > > > I'md interested since it will inform whether I need to
> > > > > > > > send
> > > > > > > > patches
> > > > > > > > to
> > > > > > > > systemd's udev, and potentially open the can of worms
> > > > > > > > over
> > > > > > > > there.
> > > > > > > > Yet its
> > > > > > > > not yet clear to me how an init namespaced udev process
> > > > > > > > can
> > > > > > > > write to
> > > > > > > > a netns
> > > > > > > > sysfs path.
> > > > > > > >
> > > > > > > > Another option might be to create yet another
> > > > > > > > daemon/tool
> > > > > > > > that would
> > > > > > > > listen
> > > > > > > > specifically for these notifications.  Ugh.
> > > > > > > >
> > > > > > > > Ben
> > > > > > > >
> > > > > > >
> > > > > > > I don't understand. Why do you need a new daemon/tool?
> > > > >
> > > > > Because what we've got only works for the init namespace.
> > > > >
> > > > > Udev won't get kobject notifications because its not using
> > > > > NETLINK_LISTEN_ALL_NSIDs.
> > > > >
> > > > > We need to figure out if we want:
> > > > >
> > > > > 1) the init namespace udevd to handle all client_id
> > > > > uniquifiers
> > > > > 2) we expect network namespaces to run their own udevd
> > > > > 3) or both.
> > > > >
> > > > > I think 2 violates "least surprise", and 3 might not be
> > > > > something
> > > > > anyone
> > > > > ever wants.  If they do, we can fix it at that point.
> > > > >
> > > > > So to make 1 work, we can try to change udevd, or maybe just
> > > > > hacking about
> > > > > with nfs_netns_object_child_ns_type will be sufficient.
> > > >
> > > > I agree that 1 seems like the preferred approach, though
> > > > I don't have a technical suggestion at this point.
> > > >
> > >
> > > I strongly disagree. (1) requires the init namespace to have
> > > intimate
> > > knowledge of container internals.
>
> Not really, we're just distinguishing NFS clients in containers from
> NFS
> clients on the host.  That doesn't require intimate knowledge, only a
> mechanism to create a unique value per-container.
>
> > > Why do we need to make that a requirement? That violates the
> > > expectation
> > > that containers are stateless by default, and also the
> > > expectation
> > > that
> > > they operate independently of the environment.
>
> I'm not familiar with the expectation that containers are stateless
> by
> default, or that they operate independently of the environment.
>

Put differently: do you expect QEMU/KVM and VMware ESX to have to know
a priori that a VM is going to use NFSv4, and force them to have to
modify the VM state accordingly? No, of course not. So why do you think
this is a good idea for containers?

This is exactly the problem with the keyring upcall mechanism, and why
it is completely useless on a modern system. It relies on the top level
knowing what the containers are doing and how they are configured.
Imagine if you want to nest containers (yes, people do that - just
Google "nested docker containers"). Your top level process would have
to know not just how the first level of containers is configured
(network details, user mappings, ...), but also details about how the
child containers, that it is not directly managing, are configured.
It's just not practical.

> > > If you really do want external control over the uuid that is set,
> > > then
> > > it should be pretty trivial to do so by using the standard
> > > container
> > > tools for manipulating the namespace (e.g. to mount a file that
> > > is
> > > under control of the parent as /etc/nfs4-uuid.conf or whatever).
>
> We're not looking for external control, just automation.  The NFS
> community
> has decided that udev is the way to go here, so as long as we can get
> the
> notifications to /some/ udev process, I feel confident we can make
> all
> of
> this transparent.
>
> The less we have to teach all the container tooling folks, the better
> for us.
>

Agreed. I'm saying that the udev case also allows for top level control if
you think you need it.

> > > However in most cases that I can think of, if the container is
> > > doing
> > > its own NFS mounting, then it is going to have to be set up with
> > > its
> > > own nfs-utils, etc, so there is no reason why we can't also
> > > require
> > > udev.
>
> I'm not as confident about this as you are.  Network namespaces are
> pretty
> useful on their own to create independent network configurations or
> to
> isolate hardware interfaces.  We've had a few surprising cases of
> customers
> using them in creative ways.
>
> There's a bit of a chicken and egg problem with 2, though.  If the
> nfs
> module is loaded, the kernel notification gets sent as soon as you
> create
> the namespace.  Its not going to wait for you to move or exec udev
> into
> that
> network namespace, and the notification is lost.
>
> Can't we just uniquify the namespaced NFS client ourselves, while
> still
> exposing /sys/fs/nfs/net/nfs_client/identifier within the namespace? 
> That
> way if someone want to run udev or use their own method of persistent
> id
> its available to them within the container so they can.  Then we can
> move
> forward because the problem of distinguishing clients between the
> host
> and
> netns is automagically solved.

That could be done.

>
> Where we are today is the host NFS client is uniquified, and all the
> netns
> clients are distinguished from the host, but not eachother.
>
> Ben
>

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-02-08 22:25:37

by Benjamin Coddington

Subject: Re: v4 clientid uniquifiers in containers/namespaces

On 8 Feb 2022, at 9:42, Chuck Lever III wrote:

>> On Feb 8, 2022, at 9:29 AM, Benjamin Coddington <[email protected]>
>> wrote:
>>
>> On 8 Feb 2022, at 8:45, Trond Myklebust wrote:
>>
>>>> Can't we just uniquify the namespaced NFS client ourselves, while
>>>> still
>>>> exposing /sys/fs/nfs/net/nfs_client/identifier within the
>>>> namespace?
>>>> That
>>>> way if someone want to run udev or use their own method of
>>>> persistent
>>>> id
>>>> its available to them within the container so they can. Then we
>>>> can
>>>> move
>>>> forward because the problem of distinguishing clients between the
>>>> host
>>>> and
>>>> netns is automagically solved.
>>>
>>> That could be done.
>>
>> Ok, I'm eyeballing a sha1 of the init namespace uniquifier and
>> peernet2id_alloc(new_net, init_net).. but means the NFS client would
>> grow a
>> dependency on CRYPTO and CRYPTO_SHA1.
>
> Or you could use siphash instead of SHA-1.
>
> I don't think we should be adding any more SHA-1 to the kernel --
> it's deprecated for good reasons.

Thanks! Siphash is nicer too. :)

Ben


2022-02-09 05:33:25

by Benjamin Coddington

Subject: Re: v4 clientid uniquifiers in containers/namespaces

On 7 Feb 2022, at 18:59, Chuck Lever III wrote:

>> On Feb 7, 2022, at 2:38 PM, Trond Myklebust <[email protected]>
>> wrote:
>>
>> On Mon, 2022-02-07 at 15:49 +0000, Chuck Lever III wrote:
>>>
>>>
>>>> On Feb 7, 2022, at 9:05 AM, Benjamin Coddington
>>>> <[email protected]> wrote:
>>>>
>>>> On 5 Feb 2022, at 14:50, Benjamin Coddington wrote:
>>>>
>>>>> On 5 Feb 2022, at 13:24, Trond Myklebust wrote:
>>>>>
>>>>>> On Sat, 2022-02-05 at 10:03 -0500, Benjamin Coddington wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Is anyone using a udev(-like) implementation with
>>>>>>> NETLINK_LISTEN_ALL_NSID?
>>>>>>> It looks like that is at least necessary to allow the init
>>>>>>> namespaced
>>>>>>> udev
>>>>>>> to receive notifications on
>>>>>>> /sys/fs/nfs/net/nfs_client/identifier,
>>>>>>> which
>>>>>>> would be a pre-req to automatically uniquify in containers.
>>>>>>>
>>>>>>> I'md interested since it will inform whether I need to send
>>>>>>> patches
>>>>>>> to
>>>>>>> systemd's udev, and potentially open the can of worms over
>>>>>>> there.
>>>>>>> Yet its
>>>>>>> not yet clear to me how an init namespaced udev process can
>>>>>>> write to
>>>>>>> a netns
>>>>>>> sysfs path.
>>>>>>>
>>>>>>> Another option might be to create yet another daemon/tool
>>>>>>> that would
>>>>>>> listen
>>>>>>> specifically for these notifications. Ugh.
>>>>>>>
>>>>>>> Ben
>>>>>>>
>>>>>>
>>>>>> I don't understand. Why do you need a new daemon/tool?
>>>>
>>>> Because what we've got only works for the init namespace.
>>>>
>>>> Udev won't get kobject notifications because its not using
>>>> NETLINK_LISTEN_ALL_NSIDs.
>>>>
>>>> We need to figure out if we want:
>>>>
>>>> 1) the init namespace udevd to handle all client_id uniquifiers
>>>> 2) we expect network namespaces to run their own udevd
>>>> 3) or both.
>>>>
>>>> I think 2 violates "least surprise", and 3 might not be something
>>>> anyone
>>>> ever wants. If they do, we can fix it at that point.
>>>>
>>>> So to make 1 work, we can try to change udevd, or maybe just
>>>> hacking about
>>>> with nfs_netns_object_child_ns_type will be sufficient.
>>>
>>> I agree that 1 seems like the preferred approach, though
>>> I don't have a technical suggestion at this point.
>>>
>>
>> I strongly disagree. (1) requires the init namespace to have intimate
>> knowledge of container internals.

Not really, we're just distinguishing NFS clients in containers from NFS
clients on the host. That doesn't require intimate knowledge, only a
mechanism to create a unique value per-container.

>> Why do we need to make that a requirement? That violates the
>> expectation
>> that containers are stateless by default, and also the expectation
>> that
>> they operate independently of the environment.

I'm not familiar with the expectation that containers are stateless by
default, or that they operate independently of the environment.

>> If you really do want external control over the uuid that is set,
>> then
>> it should be pretty trivial to do so by using the standard container
>> tools for manipulating the namespace (e.g. to mount a file that is
>> under control of the parent as /etc/nfs4-uuid.conf or whatever).

We're not looking for external control, just automation. The NFS community
has decided that udev is the way to go here, so as long as we can get the
notifications to /some/ udev process, I feel confident we can make all of
this transparent.

The less we have to teach all the container tooling folks, the better
for us.

>> However in most cases that I can think of, if the container is doing
>> its own NFS mounting, then it is going to have to be set up with its
>> own nfs-utils, etc, so there is no reason why we can't also require
>> udev.

I'm not as confident about this as you are. Network namespaces are pretty
useful on their own to create independent network configurations or to
isolate hardware interfaces. We've had a few surprising cases of customers
using them in creative ways.

There's a bit of a chicken and egg problem with 2, though. If the nfs
module is loaded, the kernel notification gets sent as soon as you create
the namespace. It's not going to wait for you to move or exec udev into that
network namespace, and the notification is lost.

Can't we just uniquify the namespaced NFS client ourselves, while still
exposing /sys/fs/nfs/net/nfs_client/identifier within the namespace? That
way if someone wants to run udev or use their own method of persistent id,
it's available to them within the container so they can. Then we can move
forward because the problem of distinguishing clients between the host and
netns is automagically solved.

Where we are today is the host NFS client is uniquified, and all the netns
clients are distinguished from the host, but not from each other.

Ben


2022-02-09 05:36:34

by Benjamin Coddington

Subject: Re: v4 clientid uniquifiers in containers/namespaces

On 7 Feb 2022, at 20:59, NeilBrown wrote:

> On Sun, 06 Feb 2022, Benjamin Coddington wrote:
>> Hi all,
>>
>> Is anyone using a udev(-like) implementation with NETLINK_LISTEN_ALL_NSID?
>> It looks like that is at least necessary to allow the init namespaced udev
>> to receive notifications on /sys/fs/nfs/net/nfs_client/identifier, which
>> would be a pre-req to automatically uniquify in containers.
>
> Could you walk me through the reasoning here - or point me to where it
> has been discussed.

https://lore.kernel.org/linux-nfs/[email protected]/

> It seems to me that mount.nfs is the place to set nfs_client/identifier.
> It can be told (via /etc/nfs.conf or /etc/nfsmount.conf) how to generate
> and where to store the identifier. It can check the current value and
> update if needed. As long as the identifier is set before the first
> mount, there is no rush.
>
> Why does it need to be done in response to a uevent??

I think the assertion was that it was the only sensible way, and it does
seem to be better than exposing yet another knob when all that's needed is a
way to distinguish and persist NFS clients when network namespaces can come
and go at any time, and there can be a lot of them.

Ben


2022-02-09 05:54:58

by Trond Myklebust

Subject: Re: v4 clientid uniquifiers in containers/namespaces

On Tue, 2022-02-08 at 06:32 -0500, Benjamin Coddington wrote:
>
> There's a bit of a chicken and egg problem with 2, though.  If the
> nfs
> module is loaded, the kernel notification gets sent as soon as you
> create
> the namespace.  Its not going to wait for you to move or exec udev
> into
> that
> network namespace, and the notification is lost.


Wait a minute... I missed this comment earlier, but it definitely
points to a misunderstanding.

The notification is _not_ sent by the act of loading a module. It is
sent by the call to kobject_uevent() in nfs_netns_sysfs_setup(). That
again is called as part of nfs_net_init() when the net namespace gets
created.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-02-09 06:18:25

by Benjamin Coddington

Subject: Re: v4 clientid uniquifiers in containers/namespaces

On 8 Feb 2022, at 11:47, Trond Myklebust wrote:

> On Tue, 2022-02-08 at 06:32 -0500, Benjamin Coddington wrote:
>>
>> There's a bit of a chicken and egg problem with 2, though.  If the
>> nfs
>> module is loaded, the kernel notification gets sent as soon as you
>> create
>> the namespace.  Its not going to wait for you to move or exec udev
>> into
>> that
>> network namespace, and the notification is lost.
>
>
> Wait a minute... I missed this comment earlier, but it definitely
> points to a misunderstanding.
>
> The notification is _not_ sent by the act of loading a module. It is
> sent by the call to kobject_uevent() in nfs_netns_sysfs_setup(). That
> again is called as part of nfs_net_init() when the net namespace gets
> created.

My communication was poor. The first notification is sent to udev when the
nfs module is loaded. That is the initial creation of the sysfs object: the
notification in the init namespace.

After that, if a network namespace is created and "the nfs module is
[already] loaded", the notification is immediately sent.

I think we're both understanding it and our understanding matches how it
works.

Ben


2022-02-09 06:20:55

by Trond Myklebust

Subject: Re: v4 clientid uniquifiers in containers/namespaces

On Mon, 2022-02-07 at 15:49 +0000, Chuck Lever III wrote:
>
>
> > On Feb 7, 2022, at 9:05 AM, Benjamin Coddington
> > <[email protected]> wrote:
> >
> > On 5 Feb 2022, at 14:50, Benjamin Coddington wrote:
> >
> > > On 5 Feb 2022, at 13:24, Trond Myklebust wrote:
> > >
> > > > On Sat, 2022-02-05 at 10:03 -0500, Benjamin Coddington wrote:
> > > > > Hi all,
> > > > >
> > > > > Is anyone using a udev(-like) implementation with
> > > > > NETLINK_LISTEN_ALL_NSID?
> > > > > It looks like that is at least necessary to allow the init
> > > > > namespaced
> > > > > udev
> > > > > to receive notifications on
> > > > > /sys/fs/nfs/net/nfs_client/identifier,
> > > > > which
> > > > > would be a pre-req to automatically uniquify in containers.
> > > > >
> > > > > I'md interested since it will inform whether I need to send
> > > > > patches
> > > > > to
> > > > > systemd's udev, and potentially open the can of worms over
> > > > > there.
> > > > > Yet its
> > > > > not yet clear to me how an init namespaced udev process can
> > > > > write to
> > > > > a netns
> > > > > sysfs path.
> > > > >
> > > > > Another option might be to create yet another daemon/tool
> > > > > that would
> > > > > listen
> > > > > specifically for these notifications.  Ugh.
> > > > >
> > > > > Ben
> > > > >
> > > >
> > > > I don't understand. Why do you need a new daemon/tool?
> >
> > Because what we've got only works for the init namespace.
> >
> > Udev won't get kobject notifications because its not using
> > NETLINK_LISTEN_ALL_NSIDs.
> >
> > We need to figure out if we want:
> >
> > 1) the init namespace udevd to handle all client_id uniquifiers
> > 2) we expect network namespaces to run their own udevd
> > 3) or both.
> >
> > I think 2 violates "least surprise", and 3 might not be something
> > anyone
> > ever wants.  If they do, we can fix it at that point.
> >
> > So to make 1 work, we can try to change udevd, or maybe just
> > hacking about
> > with nfs_netns_object_child_ns_type will be sufficient.
>
> I agree that 1 seems like the preferred approach, though
> I don't have a technical suggestion at this point.
>

I strongly disagree. (1) requires the init namespace to have intimate
knowledge of container internals. Why do we need to make that a
requirement? That violates the expectation that containers are
stateless by default, and also the expectation that they operate
independently of the environment.

If you really do want external control over the uuid that is set, then
it should be pretty trivial to do so by using the standard container
tools for manipulating the namespace (e.g. to mount a file that is
under control of the parent as /etc/nfs4-uuid.conf or whatever).

However in most cases that I can think of, if the container is doing
its own NFS mounting, then it is going to have to be set up with its
own nfs-utils, etc, so there is no reason why we can't also require
udev.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-02-09 06:43:32

by Trond Myklebust

Subject: Re: v4 clientid uniquifiers in containers/namespaces

On Wed, 2022-02-09 at 07:56 +1100, NeilBrown wrote:
> On Tue, 08 Feb 2022, Benjamin Coddington wrote:
> > On 7 Feb 2022, at 20:59, NeilBrown wrote:
> >
> > > On Sun, 06 Feb 2022, Benjamin Coddington wrote:
> > > > Hi all,
> > > >
> > > > Is anyone using a udev(-like) implementation with
> > > > NETLINK_LISTEN_ALL_NSID?
> > > > It looks like that is at least necessary to allow the init
> > > > namespaced udev
> > > > to receive notifications on
> > > > /sys/fs/nfs/net/nfs_client/identifier, which
> > > > would be a pre-req to automatically uniquify in containers.
> > >
> > > Could you walk me through the reasoning here - or point me to
> > > where it
> > > has been discussed.
> >
> > https://lore.kernel.org/linux-nfs/[email protected]/
>
> Thanks.  I did remember that discussion though it was helpful to
> refresh
> my memory, and to be sure there is nothing else.
>
> >
> > > It seems to me that mount.nfs is the place to set
> > > nfs_client/identifier.
> > > It can be told (via /etc/nfs.conf or /etc/nfsmount.conf) how to
> > > generate
> > > and where to store the identifier.  It can check the current
> > > value and
> > > update if needed.  As long as the identifier is set before the
> > > first
> > > mount, there is no rush.
> > >
> > > Why does it need to be done in response to a uevent??
> >
> > I think the assertion was that it was the only sensible way, and it
> > does
> > seem to be better than exposing yet another knob when all that's
> > needed is a
> > way to distinguish and persist NFS clients when network namespaces
> > can come
> > and go at any time, and there can be a lot of them.
>
> "assertion" is an apt word.  There wasn't a whole lot of reasoned
> argument, mostly just assertions.
>
> The best argument was that "nfs.conf is not namespace aware", which
> is
> only somewhat true.  Using "ip netns exec" will make
> non-namespace-aware tools work correctly in namespaces providing
> their
> config files are in /etc/netns/NAME - they get bind-mounted over the
> files in /etc.
> And of course /etc/nfs.conf can be MADE namespace aware.
>
> There is also a reasonable argument that auto-editing /etc/nfs.conf
> risks collision with an admin, but that is why we have
> /etc/nfs.conf.d
>
> For me, the weakest part of Steve's case was that he presented it
> as
> "setting module parameters via nfs.conf" rather than "configuring
> client
> identity via nfs.conf".  A number of the early negative responses
> were
> focused on the distraction of a module parameter being involved.
>
> The weakness for the alternative, of course, is the fact that using
> the
> udev mechanism requires running udevd in each network namespace,
> which
> is an unnecessary burden.
>
> So I still STRONGLY think that the identity should be set by
> mount.nfs
> reading (and writing) some file in /etc or /etc/netns/NAME, and I
> weakly think that the file should be in /etc/nfs.conf.d/ so that the
> reading is automagic.
>

No. It's not a per-mount setting, so it has no business being in the
mount protocol.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-02-09 07:40:34

by NeilBrown

Subject: Re: v4 clientid uniquifiers in containers/namespaces

On Sun, 06 Feb 2022, Benjamin Coddington wrote:
> Hi all,
>
> Is anyone using a udev(-like) implementation with NETLINK_LISTEN_ALL_NSID?
> It looks like that is at least necessary to allow the init namespaced udev
> to receive notifications on /sys/fs/nfs/net/nfs_client/identifier, which
> would be a pre-req to automatically uniquify in containers.

Could you walk me through the reasoning here - or point me to where it
has been discussed.
It seems to me that mount.nfs is the place to set nfs_client/identifier.
It can be told (via /etc/nfs.conf or /etc/nfsmount.conf) how to generate
and where to store the identifier. It can check the current value and
update if needed. As long as the identifier is set before the first
mount, there is no rush.
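
Roughly, the check-and-update step I have in mind could look like the sketch
below. It is only an illustration in Python, not mount.nfs code. The
ensure_identifier() name and the store_path argument are hypothetical, and
generating a UUID is just one possible policy.

```python
import os
import uuid

SYSFS_IDENTIFIER = "/sys/fs/nfs/net/nfs_client/identifier"


def ensure_identifier(store_path, sysfs_path=SYSFS_IDENTIFIER):
    """Make the kernel's identifier match the persisted one.

    store_path is a hypothetical per-namespace file holding the identifier;
    it is created with a fresh UUID the first time through.
    """
    wanted = ""
    if os.path.exists(store_path):
        with open(store_path) as f:
            wanted = f.read().strip()
    if not wanted:
        # Nothing stored yet: generate once and persist for later mounts.
        wanted = str(uuid.uuid4())
        with open(store_path, "w") as f:
            f.write(wanted + "\n")

    with open(sysfs_path) as f:
        current = f.read().strip()
    if current != wanted:
        # Only touch sysfs when the value actually needs to change.
        with open(sysfs_path, "w") as f:
            f.write(wanted + "\n")
    return wanted
```

Calling it twice with the same store_path returns the same value, which is
the "persist" half of the requirement; the write is skipped when the kernel
already has the right identifier.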

Why does it need to be done in response to a uevent??

Thanks,
NeilBrown

2022-02-09 08:17:24

by Benjamin Coddington

Subject: Re: v4 clientid uniquifiers in containers/namespaces

On 8 Feb 2022, at 10:47, Trond Myklebust wrote:

>> peernet2id_alloc() is not designed for this. It appears to use
>> idr_alloc(), which means it will reuse values frequently.

I did not think of that.

> Furthermore, that would introduce a dependency on the init namespace
> identifier being unique, which precludes its use for initialising said
> init namespace.

That's what the udev rule will fix! :)

I think I was still on the deterministic bus, but it seems to make the most
sense to simply use a random value as a default, then. And if a container
wants to be the same client it must run udev, or write to sysfs itself.
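
As a toy model of that behaviour (Python, purely illustrative, with a made-up
NetnsClient class standing in for the per-namespace client state): each
namespace starts with a random uniquifier, and anything written explicitly
replaces it.

```python
import uuid


class NetnsClient:
    """Hypothetical stand-in for one network namespace's NFS client state."""

    def __init__(self):
        # Random default: distinct per namespace without any userspace help.
        self.identifier = uuid.uuid4().hex

    def store_identifier(self, value: str):
        # Models a write to the identifier file from udev or the container.
        self.identifier = value.strip()


host = NetnsClient()
container = NetnsClient()
defaults_differ = host.identifier != container.identifier

# A container that wants to stay the same client writes its persisted id.
container.store_identifier("persisted-id-from-previous-run\n")
```

The random default gives uniqueness for free; persistence across restarts
stays opt-in.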

Ben


2022-02-09 09:00:53

by Trond Myklebust

Subject: Re: v4 clientid uniquifiers in containers/namespaces

On Tue, 2022-02-08 at 10:23 -0500, Benjamin Coddington wrote:
> On 8 Feb 2022, at 9:42, Chuck Lever III wrote:
>
> > > On Feb 8, 2022, at 9:29 AM, Benjamin Coddington
> > > <[email protected]>
> > > wrote:
> > >
> > > On 8 Feb 2022, at 8:45, Trond Myklebust wrote:
> > >
> > > > > Can't we just uniquify the namespaced NFS client ourselves,
> > > > > while
> > > > > still
> > > > > exposing /sys/fs/nfs/net/nfs_client/identifier within the
> > > > > namespace?
> > > > > That
> > > > > way if someone want to run udev or use their own method of
> > > > > persistent
> > > > > id
> > > > > its available to them within the container so they can.  Then
> > > > > we
> > > > > can
> > > > > move
> > > > > forward because the problem of distinguishing clients between
> > > > > the
> > > > > host
> > > > > and
> > > > > netns is automagically solved.
> > > >
> > > > That could be done.
> > >
> > > Ok, I'm eyeballing a sha1 of the init namespace uniquifier and
> > > peernet2id_alloc(new_net, init_net).. but means the NFS client
> > > would
> > > grow a
> > > dependency on CRYPTO and CRYPTO_SHA1.
> >
> > Or you could use siphash instead of SHA-1.
> >
> > I don't think we should be adding any more SHA-1 to the kernel --
> > it's deprecated for good reasons.
>
> Thanks! Siphash is nicer too.  :)
>
>

peernet2id_alloc() is not designed for this. It appears to use
idr_alloc(), which means it will reuse values frequently.
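
A toy illustration of the reuse concern, assuming an allocator that always
hands out the smallest free ID (which is how the remark about idr_alloc()
reads). The class and function names are hypothetical, and blake2b is only a
stand-in for the sha1/siphash step discussed above, since Python's standard
library does not expose siphash.

```python
import hashlib


class LowestFreeIdAllocator:
    """Toy allocator that returns the smallest ID not currently in use."""

    def __init__(self):
        self.in_use = set()

    def alloc(self):
        n = 0
        while n in self.in_use:
            n += 1
        self.in_use.add(n)
        return n

    def free(self, n):
        self.in_use.discard(n)


def derived_identifier(init_uniquifier, nsid):
    """Hash the init namespace uniquifier together with a namespace ID."""
    data = f"{init_uniquifier}:{nsid}".encode()
    return hashlib.blake2b(data, digest_size=8).hexdigest()


ids = LowestFreeIdAllocator()
old_ns = ids.alloc()   # first container gets an ID
ids.free(old_ns)       # first container goes away
new_ns = ids.alloc()   # an unrelated container gets the same ID back
same = derived_identifier("host", old_ns) == derived_identifier("host", new_ns)
```

Here `same` ends up true: two unrelated namespaces created one after the
other would derive the same identifier.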

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-02-09 09:50:58

by Benjamin Coddington

Subject: Re: v4 clientid uniquifiers in containers/namespaces

On 8 Feb 2022, at 8:45, Trond Myklebust wrote:

> On Tue, 2022-02-08 at 06:32 -0500, Benjamin Coddington wrote:
>> On 7 Feb 2022, at 18:59, Chuck Lever III wrote:
>>
>>>> On Feb 7, 2022, at 2:38 PM, Trond Myklebust
>>>> <[email protected]>
>>>> wrote:
>>>>
>>>> On Mon, 2022-02-07 at 15:49 +0000, Chuck Lever III wrote:
>>>>>
>>>>>
>>>>>> On Feb 7, 2022, at 9:05 AM, Benjamin Coddington
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> On 5 Feb 2022, at 14:50, Benjamin Coddington wrote:
>>>>>>
>>>>>>> On 5 Feb 2022, at 13:24, Trond Myklebust wrote:
>>>>>>>
>>>>>>>> On Sat, 2022-02-05 at 10:03 -0500, Benjamin Coddington
>>>>>>>> wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Is anyone using a udev(-like) implementation with
>>>>>>>>> NETLINK_LISTEN_ALL_NSID?
>>>>>>>>> It looks like that is at least necessary to allow the
>>>>>>>>> init
>>>>>>>>> namespaced
>>>>>>>>> udev
>>>>>>>>> to receive notifications on
>>>>>>>>> /sys/fs/nfs/net/nfs_client/identifier,
>>>>>>>>> which
>>>>>>>>> would be a pre-req to automatically uniquify in
>>>>>>>>> containers.
>>>>>>>>>
>>>>>>>>> I'md interested since it will inform whether I need to
>>>>>>>>> send
>>>>>>>>> patches
>>>>>>>>> to
>>>>>>>>> systemd's udev, and potentially open the can of worms
>>>>>>>>> over
>>>>>>>>> there.
>>>>>>>>> Yet its
>>>>>>>>> not yet clear to me how an init namespaced udev process
>>>>>>>>> can
>>>>>>>>> write to
>>>>>>>>> a netns
>>>>>>>>> sysfs path.
>>>>>>>>>
>>>>>>>>> Another option might be to create yet another
>>>>>>>>> daemon/tool
>>>>>>>>> that would
>>>>>>>>> listen
>>>>>>>>> specifically for these notifications.  Ugh.
>>>>>>>>>
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't understand. Why do you need a new daemon/tool?
>>>>>>
>>>>>> Because what we've got only works for the init namespace.
>>>>>>
>>>>>> Udev won't get kobject notifications because its not using
>>>>>> NETLINK_LISTEN_ALL_NSIDs.
>>>>>>
>>>>>> We need to figure out if we want:
>>>>>>
>>>>>> 1) the init namespace udevd to handle all client_id
>>>>>> uniquifiers
>>>>>> 2) we expect network namespaces to run their own udevd
>>>>>> 3) or both.
>>>>>>
>>>>>> I think 2 violates "least surprise", and 3 might not be
>>>>>> something
>>>>>> anyone
>>>>>> ever wants.  If they do, we can fix it at that point.
>>>>>>
>>>>>> So to make 1 work, we can try to change udevd, or maybe just
>>>>>> hacking about
>>>>>> with nfs_netns_object_child_ns_type will be sufficient.
>>>>>
>>>>> I agree that 1 seems like the preferred approach, though
>>>>> I don't have a technical suggestion at this point.
>>>>>
>>>>
>>>> I strongly disagree. (1) requires the init namespace to have
>>>> intimate
>>>> knowledge of container internals.
>>
>> Not really, we're just distinguishing NFS clients in containers from
>> NFS
>> clients on the host.  That doesn't require intimate knowledge, only a
>> mechanism to create a unique value per-container.
>>
>>>> Why do we need to make that a requirement? That violates the
>>>> expectation
>>>> that containers are stateless by default, and also the
>>>> expectation
>>>> that
>>>> they operate independently of the environment.
>>
>> I'm not familiar with the expectation that containers are stateless
>> by
>> default, or that they operate independently of the environment.
>>
>
> Put differently: do you expect QEMU/KVM and VMware ESX to have to know
> a priori that a VM is going to use NFSv4, and force them to have to
> modify the VM state accordingly? No, of course not. So why do you think
> this is a good idea for containers?

Well, I don't think /that's/ a good idea, no, but I don't think the
comparison is valid. I wouldn't equate containers with VMs when it comes to
configuration or state because VMs attempt to create a nearly isolated
processing environment, while containers or namespaces are a complete
mish-mash of objects, state, and paradigms. A lot of what happens in a
particular set of namespaces can happen and affect objects in init too.

The immediate example is the very problem we're trying to fix: NFS clients in
a netns can disrupt/reclaim state from the init namespace client.

> This is exactly the problem with the keyring upcall mechanism, and why
> it is completely useless on a modern system. It relies on the top level
> knowing what the containers are doing and how they are configured.

We're actually talking over this problem while working on TLS, and I agree that
keyrings need changes to allow userspace callouts to be "routed", and that
configuration must come from within the containers. And lacking a container
taking responsibility for it, it is up to the host to do something sane.

> Imagine if you want to nest containers (yes, people do that - just
> Google "nested docker containers"). Your top level process would have
> to know not just how the first level of containers is configured
> (network details, user mappings, ...), but also details about how the
> child containers, that it is not directly managing, are configured.
> It's just not practical.

Oh yeah, I know all about it. It's quite a mess, and every subsystem that
has to account for all of this does it a little differently.

>> Can't we just uniquify the namespaced NFS client ourselves, while
>> still
>> exposing /sys/fs/nfs/net/nfs_client/identifier within the namespace? 
>> That
>> way if someone want to run udev or use their own method of persistent
>> id
>> its available to them within the container so they can.  Then we can
>> move
>> forward because the problem of distinguishing clients between the
>> host
>> and
>> netns is automagically solved.
>
> That could be done.

Ok, I'm eyeballing a sha1 of the init namespace uniquifier and
peernet2id_alloc(new_net, init_net)... but that means the NFS client would
grow a dependency on CRYPTO and CRYPTO_SHA1.

hm.

Ben


2022-02-09 10:00:55

by NeilBrown

Subject: Re: v4 clientid uniquifiers in containers/namespaces

On Wed, 09 Feb 2022, Trond Myklebust wrote:
> On Wed, 2022-02-09 at 07:56 +1100, NeilBrown wrote:
> >
> > So I still STRONGLY think that the identity should be set by
> > mount.nfs
> > reading (and writing) some file in /etc or /etc/netns/NAME, and I
> > weakly think that the file should be in /etc/nfs.conf.d/ so that the
> > reading is automagic.
> >
>
> No. It's not a per-mount setting, so it has no business being in the
> mount protocol.

I agree that it is not different for different mounts, but every mount
needs it, and without any mounts it is not needed.

Much like statd really, which is started by mount.nfs when it is
determined that it is needed, but not running.

NeilBrown

2022-02-09 10:25:11

by Trond Myklebust

Subject: Re: v4 clientid uniquifiers in containers/namespaces

On Tue, 2022-02-08 at 15:43 +0000, Trond Myklebust wrote:
> On Tue, 2022-02-08 at 10:23 -0500, Benjamin Coddington wrote:
> > On 8 Feb 2022, at 9:42, Chuck Lever III wrote:
> >
> > > > On Feb 8, 2022, at 9:29 AM, Benjamin Coddington
> > > > <[email protected]>
> > > > wrote:
> > > >
> > > > On 8 Feb 2022, at 8:45, Trond Myklebust wrote:
> > > >
> > > > > > Can't we just uniquify the namespaced NFS client ourselves,
> > > > > > while
> > > > > > still
> > > > > > exposing /sys/fs/nfs/net/nfs_client/identifier within the
> > > > > > namespace?
> > > > > > That
> > > > > > way if someone want to run udev or use their own method of
> > > > > > persistent
> > > > > > id
> > > > > > its available to them within the container so they can. 
> > > > > > Then
> > > > > > we
> > > > > > can
> > > > > > move
> > > > > > forward because the problem of distinguishing clients
> > > > > > between
> > > > > > the
> > > > > > host
> > > > > > and
> > > > > > netns is automagically solved.
> > > > >
> > > > > That could be done.
> > > >
> > > > Ok, I'm eyeballing a sha1 of the init namespace uniquifier and
> > > > peernet2id_alloc(new_net, init_net).. but means the NFS client
> > > > would
> > > > grow a
> > > > dependency on CRYPTO and CRYPTO_SHA1.
> > >
> > > Or you could use siphash instead of SHA-1.
> > >
> > > I don't think we should be adding any more SHA-1 to the kernel --
> > > it's deprecated for good reasons.
> >
> > Thanks! Siphash is nicer too.  :)
> >
> >
>
> peernet2id_alloc() is not designed for this. It appears to use
> idr_alloc(), which means it will reuse values frequently.
>

Furthermore, that would introduce a dependency on the init namespace
identifier being unique, which precludes its use for initialising said
init namespace.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-02-09 11:26:50

by NeilBrown

Subject: Re: v4 clientid uniquifiers in containers/namespaces

On Tue, 08 Feb 2022, Benjamin Coddington wrote:
> On 7 Feb 2022, at 20:59, NeilBrown wrote:
>
> > On Sun, 06 Feb 2022, Benjamin Coddington wrote:
> >> Hi all,
> >>
> >> Is anyone using a udev(-like) implementation with NETLINK_LISTEN_ALL_NSID?
> >> It looks like that is at least necessary to allow the init namespaced udev
> >> to receive notifications on /sys/fs/nfs/net/nfs_client/identifier, which
> >> would be a pre-req to automatically uniquify in containers.
> >
> > Could you walk me through the reasoning here - or point me to where it
> > has been discussed.
>
> https://lore.kernel.org/linux-nfs/[email protected]/

Thanks. I did remember that discussion though it was helpful to refresh
my memory, and to be sure there is nothing else.

>
> > It seems to me that mount.nfs is the place to set nfs_client/identifier.
> > It can be told (via /etc/nfs.conf or /etc/nfsmount.conf) how to generate
> > and where to store the identifier. It can check the current value and
> > update if needed. As long as the identifier is set before the first
> > mount, there is no rush.
> >
> > Why does it need to be done in response to a uevent??
>
> I think the assertion was that it was the only sensible way, and it does
> seem to be better than exposing yet another knob when all that's needed is a
> way to distinguish and persist NFS clients when network namespaces can come
> and go at any time, and there can be a lot of them.

"assertion" is an apt word. There wasn't a whole lot of reasoned
argument, mostly just assertions.

The best argument was that "nfs.conf is not namespace aware", which is
only somewhat true. Using "ip netns exec" will make
non-namespace-aware tools work correctly in namespaces provided their
config files are in /etc/netns/NAME - they get bind-mounted over the
files in /etc.
And of course /etc/nfs.conf can be MADE namespace aware.
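
As a rough sketch of what "namespace aware" could mean for a config reader:
prefer a per-namespace copy under /etc/netns/NAME and fall back to the
plain /etc file. The function name and the `etc_root` parameter are
hypothetical (the latter exists only so the lookup can be tried against a
scratch directory); this is not how nfs-utils reads its configuration today.

```python
import os


def resolve_config(name, netns=None, etc_root="/etc"):
    """Return the per-namespace config path if it exists, else the global one."""
    if netns:
        candidate = os.path.join(etc_root, "netns", netns, name)
        if os.path.exists(candidate):
            return candidate
    return os.path.join(etc_root, name)
```

For example, `resolve_config("nfs.conf", netns="blue")` would pick
/etc/netns/blue/nfs.conf when that file is present and /etc/nfs.conf
otherwise.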

There is also a reasonable argument that auto-editing /etc/nfs.conf
risks collision with an admin, but that is why we have /etc/nfs.conf.d.

For me, the weakest part of Steve's case was that he presented it as
"setting module parameters via nfs.conf" rather than "configuring client
identity via nfs.conf". A number of the early negative responses were
focused on the distraction of a module parameter being involved.

The weakness for the alternative, of course, is the fact that using the
udev mechanism requires running udevd in each network namespace, which
is an unnecessary burden.

So I still STRONGLY think that the identity should be set by mount.nfs
reading (and writing) some file in /etc or /etc/netns/NAME, and I
weakly think that the file should be in /etc/nfs.conf.d/ so that the
reading is automagic.

Thanks,
NeilBrown

2022-02-15 20:10:55

by Benjamin Coddington

Subject: Re: v4 clientid uniquifiers in containers/namespaces

On 8 Feb 2022, at 18:34, Trond Myklebust wrote:

> On Wed, 2022-02-09 at 07:56 +1100, NeilBrown wrote:

>> So I still STRONGLY think that the identity should be set by
>> mount.nfs
>> reading (and writing) some file in /etc or /etc/netns/NAME, and I
>> weakly think that the file should be in /etc/nfs.conf.d/ so that the
>> reading is automagic.
>>
>
> No. It's not a per-mount setting, so it has no business being in the
> mount protocol.

Trond,

We still have the issue that udev handling the event to set the uniquifier
for the init namespace races with the first SETCLIENTID/EXCHANGE_ID.

Now that network namespaces uniquify by default, would you prefer we try to
solve this with the userspace tools setting the module parameter instead of
depending on udev for the init namespace?

Alternatively, we could grow another module parameter:

    nfs4_unique_id_timeout:int    Seconds to wait for a uniquifier

A non-zero default also gives network namespaces the chance to set a
persistent value that differs from the random value the kernel generated.
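
The timeout idea could behave roughly like the sketch below: wait up to N
seconds for userspace to supply a uniquifier, then fall back to a random
one. This is a userspace analogy only; the class and function names are
hypothetical and nothing here reflects actual kernel code.

```python
import threading
import uuid


class UniquifierSlot:
    """Holds a uniquifier that may be set later by another thread."""

    def __init__(self):
        self._value = None
        self._ready = threading.Event()

    def set(self, value):
        self._value = value
        self._ready.set()

    def wait_or_default(self, timeout_seconds, default):
        """Return the supplied value if it arrives in time, else default."""
        if self._ready.wait(timeout_seconds):
            return self._value
        return default


def random_uniquifier():
    """Random fallback, standing in for the kernel-generated default."""
    return uuid.uuid4().hex
```

With a timeout of zero the fallback is used immediately unless a value was
already set, which matches the current race; a non-zero timeout gives udev
or a container tool a window to write a persistent value first.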

Ben