2013-12-01 13:20:21

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs

This series tries to get rid of the damage created by sprinkling the
network namespace cat poo all over rpc_pipefs and users.

Instead of getting lost in a maze of notifiers and infrastructure build
around it a cargo cult manner we revert to a slightly nicer version of
the pre-namespace API.

To do this we just have to create an in-kernel instance of rpc_pipefs
that is mounted at network namespace creation so that all functions can
operated on it.

As-is this has one major downside: because the initial mount already grabs
a reference to the network namespace we'll create a cyclic reference and
will never free the network namespace. To get around this we'd need
some way to only grab it once user mounts show up / disapear in the VFS.

Given that the namespace kraken has infected various internal filesystem
and will get more soon I suspect this problem is or will become generic
and will need a proper solution anyway. Al, any good ideas how to deal
with this? Most straight forward way would be to add a counter of
user vfsmount to the superblock and methods when it goes to 1 and 0,
but that seems a bit ugly.


2013-12-02 16:45:40

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs

On Mon, 2 Dec 2013 11:37:35 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Dec 02, 2013 at 11:33:19AM -0500, Jeff Layton wrote:
> > On Mon, 2 Dec 2013 11:00:50 -0500
> > Trond Myklebust <[email protected]> wrote:
> >
> > >
> > > On Dec 2, 2013, at 10:34, Christoph Hellwig <[email protected]> wrote:
> > >
> > > > On Mon, Dec 02, 2013 at 08:44:25AM -0500, Trond Myklebust wrote:
> > > >> The lifetime of the kernel mount only needs to match that of the rpc_client, since each rpc_client is associated to a single net namespace, and each net namespace is in a 1-1 relationship with an rpc_pipefs super block.
> > > >
> > > > Except for the non-rpc_client users of rpc_pipefs?
> > >
> > > There is the idmapper pipe which is created as part of setting up a NFSv4 mount: that could either call rpc_get_mount(), or just rely on the fact that the nfs_client has an rpc_client. Ditto for the DNS resolver pipe.
> > >
> > > Any more?
> > >
> >
> > There's the nfsdcld pipe stuff, but that was supposed to have been
> > ripped out in 3.10. Bruce wasn't ready to do that since we didn't have
> > a solution to the problem of using a UMH upcall in a container.
> >
> > As far as I'm concerned, we should go ahead and rip that out and worry
> > about the UMH in a container problem later. Bruce, any objection to
> > going ahead and doing that? I can respin/resend the patch to do that if
> > you're ready for it...
>
> I still haven't seen a solution to the UMH problem. Do we even expect
> that there will be one at this point?
>
> I just want to avoid the worst case where we decide that UMH was just
> not designed for this kind of upcall, period, and then need to backtrack
> again....
>

I don't think there's a solution as of yet...

I'd like to avoid that too, but I expect that no one is actually using
that code anyway. If we have to put it back in (ugh) then we'll
probably need to rethink parts of it anyhow. I'm not aware of any
distro that ever shipped with it enabled.

--
Jeff Layton <[email protected]>

2013-12-02 15:34:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs

On Mon, Dec 02, 2013 at 08:44:25AM -0500, Trond Myklebust wrote:
> The lifetime of the kernel mount only needs to match that of the rpc_client, since each rpc_client is associated to a single net namespace, and each net namespace is in a 1-1 relationship with an rpc_pipefs super block.

Except for the non-rpc_client users of rpc_pipefs?


2013-12-02 15:57:27

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs

On Mon, Dec 02, 2013 at 08:44:25AM -0500, Trond Myklebust wrote:

> > I'll have to let the net namespace folks chime in for that, as far as
> > I'm concerned it's a featured better config'ed off. If they can't come
> > up with anything better the procfs hack above would be it.
>
> The lifetime of the kernel mount only needs to match that of the rpc_client, since each rpc_client is associated to a single net namespace, and each net namespace is in a 1-1 relationship with an rpc_pipefs super block.
>
> IOW: move the kernel mount/umount back to the rpc_client create/destroy methods and all should be well.

Hmm... I'm looking through rpc_pipe.c and there are some fun issues in there:
* ->kill_sb() *is* called after rpc_fill_super() failures. It's
not ->put_super() (and even ->put_super() would've been called for failures
past setting ->s_root). With ->s_fs_info set only on success, it means
that rpc_kill_sb() will just oops after such failure exits.
* just what is
if (sn->pipefs_sb != sb) {
mutex_unlock(&sn->pipefs_sb_lock);
goto out;
}
sn->pipefs_sb = NULL;
about? In which scenario is it not equal to ->pipefs_sb? When said
->pipefs_sb is NULL? But that, AFAICS, can happen only on cleanup after
failing rpc_fill_super(), in which case we won't get that sn thing in
the first place (in fact, we'll oops trying to get it).

Trond, am I right interpreting you as "that filesystem has non-empty
contents only when there is at least one rpc_client in that netns;
after rpc_client removal there won't be any accesses to data structures
associated with it (due to rpc_close_pipes() and friends, presumably),
so there won't be any need to keep netns alive after that"?

If so and if rpc_client really can't outlive netns, then yes, having each
of them hold an internal vfsmount reference pinning rpc_pipefs down would
suffice.

2013-12-01 14:36:53

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs

On Sun, 01 Dec 2013 05:14:41 -0800
Christoph Hellwig <[email protected]> wrote:

> This series tries to get rid of the damage created by sprinkling the
> network namespace cat poo all over rpc_pipefs and users.
>
> Instead of getting lost in a maze of notifiers and infrastructure build
> around it a cargo cult manner we revert to a slightly nicer version of
> the pre-namespace API.
>
> To do this we just have to create an in-kernel instance of rpc_pipefs
> that is mounted at network namespace creation so that all functions can
> operated on it.
>
> As-is this has one major downside: because the initial mount already grabs
> a reference to the network namespace we'll create a cyclic reference and
> will never free the network namespace. To get around this we'd need
> some way to only grab it once user mounts show up / disapear in the VFS.
>
> Given that the namespace kraken has infected various internal filesystem
> and will get more soon I suspect this problem is or will become generic
> and will need a proper solution anyway. Al, any good ideas how to deal
> with this? Most straight forward way would be to add a counter of
> user vfsmount to the superblock and methods when it goes to 1 and 0,
> but that seems a bit ugly.

Looks like a good set overall -- I'll plan to review it in more detail
in the next day or two. If we do take it, it'll need to be reconciled
with the set I sent recently that makes some changes in this code:

[PATCH v4 0/3] sunrpc/nfs: more reliable detection of running gssd

Right offhand, I don't see anything fundamentally at odds with it here,
so it's probably a matter of just fixing up the merge of the two. You
may want to base the next version of this on top of Trond's "devel"
branch since it has those patches.

--
Jeff Layton <[email protected]>

2013-12-02 16:00:52

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs


On Dec 2, 2013, at 10:34, Christoph Hellwig <[email protected]> wrote:

> On Mon, Dec 02, 2013 at 08:44:25AM -0500, Trond Myklebust wrote:
>> The lifetime of the kernel mount only needs to match that of the rpc_client, since each rpc_client is associated to a single net namespace, and each net namespace is in a 1-1 relationship with an rpc_pipefs super block.
>
> Except for the non-rpc_client users of rpc_pipefs?

There is the idmapper pipe which is created as part of setting up a NFSv4 mount: that could either call rpc_get_mount(), or just rely on the fact that the nfs_client has an rpc_client. Ditto for the DNS resolver pipe.

Any more?

Cheers
Trond

2013-12-02 16:46:58

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs


On Dec 2, 2013, at 11:27, Christoph Hellwig <[email protected]> wrote:

> On Mon, Dec 02, 2013 at 11:00:50AM -0500, Trond Myklebust wrote:
>>> Except for the non-rpc_client users of rpc_pipefs?
>>
>> There is the idmapper pipe which is created as part of setting up a NFSv4 mount: that could either call rpc_get_mount(), or just rely on the fact that the nfs_client has an rpc_client. Ditto for the DNS resolver pipe.
>
> We got idmapper, dns, pnfs-block upcall.
>
> idmapper seems per-nfs_client, the other two are created on module load
> and removed at unload time.

Do they need to be? AFAICS, we can trivially convert the DNS resolver to be created on mount. The pnfs block upcall looks to be a little more difficult, but we should be able to do it in ld_type->set_layoutdriver()/clear_layoutdriver().

> Similarly the pipe in nfsd works globally as well.

Mount the rpc_pipefs in nfs4_state_create_net() and unmount it in nfs4_state_destroy_net()?

Cheers
Trond

2013-12-02 07:23:11

by Stanislav Kinsbursky

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs

01.12.2013 17:14, Christoph Hellwig ?????:
> As-is this has one major downside: because the initial mount already grabs
> a reference to the network namespace we'll create a cyclic reference and
> will never free the network namespace. To get around this we'd need
> some way to only grab it once user mounts show up / disapear in the VFS.

That's a great cleanup of the whole PipeFS mount/umount logic. And it actually
moves code to the state it was (with minor changes) before net namespace
changes.
But the reason for all this "damage" was exactly to remove this major downside
and do _not_ tie network namespace to mount point..
So, please, solve the problem before reverting all this notifiers and friends.

--
Best regards,
Stanislav Kinsbursky

2013-12-01 15:44:47

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs

On Sun, 01 Dec 2013 05:14:41 -0800
Christoph Hellwig <[email protected]> wrote:

> This series tries to get rid of the damage created by sprinkling the
> network namespace cat poo all over rpc_pipefs and users.
>
> Instead of getting lost in a maze of notifiers and infrastructure build
> around it a cargo cult manner we revert to a slightly nicer version of
> the pre-namespace API.
>
> To do this we just have to create an in-kernel instance of rpc_pipefs
> that is mounted at network namespace creation so that all functions can
> operated on it.
>
> As-is this has one major downside: because the initial mount already grabs
> a reference to the network namespace we'll create a cyclic reference and
> will never free the network namespace. To get around this we'd need
> some way to only grab it once user mounts show up / disapear in the VFS.
>
> Given that the namespace kraken has infected various internal filesystem
> and will get more soon I suspect this problem is or will become generic
> and will need a proper solution anyway. Al, any good ideas how to deal
> with this? Most straight forward way would be to add a counter of
> user vfsmount to the superblock and methods when it goes to 1 and 0,
> but that seems a bit ugly.

I see what you mean here...hrm...

One possibility (though it may have holes that I don't see right
offhand):

Don't call get_ns in rpc_fill_super. Do it in rpc_mount instead, and
only take the reference if MS_KERNMOUNT isn't set?
--
Jeff Layton <[email protected]>

2013-12-02 16:33:50

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs

On Mon, 2 Dec 2013 11:00:50 -0500
Trond Myklebust <[email protected]> wrote:

>
> On Dec 2, 2013, at 10:34, Christoph Hellwig <[email protected]> wrote:
>
> > On Mon, Dec 02, 2013 at 08:44:25AM -0500, Trond Myklebust wrote:
> >> The lifetime of the kernel mount only needs to match that of the rpc_client, since each rpc_client is associated to a single net namespace, and each net namespace is in a 1-1 relationship with an rpc_pipefs super block.
> >
> > Except for the non-rpc_client users of rpc_pipefs?
>
> There is the idmapper pipe which is created as part of setting up a NFSv4 mount: that could either call rpc_get_mount(), or just rely on the fact that the nfs_client has an rpc_client. Ditto for the DNS resolver pipe.
>
> Any more?
>

There's the nfsdcld pipe stuff, but that was supposed to have been
ripped out in 3.10. Bruce wasn't ready to do that since we didn't have
a solution to the problem of using a UMH upcall in a container.

As far as I'm concerned, we should go ahead and rip that out and worry
about the UMH in a container problem later. Bruce, any objection to
going ahead and doing that? I can respin/resend the patch to do that if
you're ready for it...

--
Jeff Layton <[email protected]>

2013-12-01 15:58:12

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs

On Sun, 1 Dec 2013 10:44:22 -0500
Jeff Layton <[email protected]> wrote:

> On Sun, 01 Dec 2013 05:14:41 -0800
> Christoph Hellwig <[email protected]> wrote:
>
> > This series tries to get rid of the damage created by sprinkling the
> > network namespace cat poo all over rpc_pipefs and users.
> >
> > Instead of getting lost in a maze of notifiers and infrastructure build
> > around it a cargo cult manner we revert to a slightly nicer version of
> > the pre-namespace API.
> >
> > To do this we just have to create an in-kernel instance of rpc_pipefs
> > that is mounted at network namespace creation so that all functions can
> > operated on it.
> >
> > As-is this has one major downside: because the initial mount already grabs
> > a reference to the network namespace we'll create a cyclic reference and
> > will never free the network namespace. To get around this we'd need
> > some way to only grab it once user mounts show up / disapear in the VFS.
> >
> > Given that the namespace kraken has infected various internal filesystem
> > and will get more soon I suspect this problem is or will become generic
> > and will need a proper solution anyway. Al, any good ideas how to deal
> > with this? Most straight forward way would be to add a counter of
> > user vfsmount to the superblock and methods when it goes to 1 and 0,
> > but that seems a bit ugly.
>
> I see what you mean here...hrm...
>
> One possibility (though it may have holes that I don't see right
> offhand):
>
> Don't call get_ns in rpc_fill_super. Do it in rpc_mount instead, and
> only take the reference if MS_KERNMOUNT isn't set?

...except that we have no mechanism to put those references on each
umount. Nevermind...

--
Jeff Layton <[email protected]>

2013-12-02 08:12:34

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs

On Sun, Dec 01, 2013 at 06:13:29PM +0000, Al Viro wrote:
> Making the series no-go in that form, obviously.

Looking at the mess it made I'd almost be tempted to say a little leak
for a less used features is better than lots of pain for everyone..

Looking at the mess it made I'm really upset.

> > Given that the namespace kraken has infected various internal filesystem
> > and will get more soon I suspect this problem is or will become generic
> > and will need a proper solution anyway. Al, any good ideas how to deal
> > with this? Most straight forward way would be to add a counter of
> > user vfsmount to the superblock and methods when it goes to 1 and 0,
> > but that seems a bit ugly.
>
> Folks, please, _please_, let's formulate the lifecycle rules first; we
> already had way too much trouble from putting mechanism first only to
> run into questions like the above ("what happens if somebody tries to
> allocate a PID in pid_ns that is already scheduled for shutdown?").
> Remember the (recurring) fun with kobject-related lifetime issues?
> Or rpc_pipefs notifier ugliness, for that matter...

I'll have to let the net namespace folks chime in for that, as far as
I'm concerned it's a featured better config'ed off. If they can't come
up with anything better the procfs hack above would be it.

2013-12-01 18:13:32

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs

On Sun, Dec 01, 2013 at 05:14:41AM -0800, Christoph Hellwig wrote:
> This series tries to get rid of the damage created by sprinkling the
> network namespace cat poo all over rpc_pipefs and users.
>
> Instead of getting lost in a maze of notifiers and infrastructure build
> around it a cargo cult manner we revert to a slightly nicer version of
> the pre-namespace API.
>
> To do this we just have to create an in-kernel instance of rpc_pipefs
> that is mounted at network namespace creation so that all functions can
> operated on it.
>
> As-is this has one major downside: because the initial mount already grabs
> a reference to the network namespace we'll create a cyclic reference and
> will never free the network namespace.

Making the series no-go in that form, obviously.

> To get around this we'd need
> some way to only grab it once user mounts show up / disapear in the VFS.

Hmm... FWIW, there already is something of that sort; it _is_ ugly, but it
might be a food for thought...

AFAICS, pid_ns gets internal procfs instance and it pins the sucker down.
Which would cause exact same problems, obviously. The trick done there
is more or less to introduce a "being shut down" state of pid_ns - from
the moment when we don't have any pids in it to actual destruction.
Entering that state schedules (yes, it is async and yes, it is ugly)
dropping the internal procfs vfsmount.

Additional headache, AFAICS, comes from /proc/self/ns/pid - it can be
opened, passed to somebody in ancestor pidns and then fed by it to
setns(2). After that fork() by that somebody will trigger alloc_pid() in
that pid_ns. What happens if it comes just before the (already scheduled)
pid_ns_release_proc()? AFAICS, nothing good - there's no protection
against leaks, access to freed vfsmount, double-mntput, etc. Eric, am
I missing something subtle and relevant in that code?

> Given that the namespace kraken has infected various internal filesystem
> and will get more soon I suspect this problem is or will become generic
> and will need a proper solution anyway. Al, any good ideas how to deal
> with this? Most straight forward way would be to add a counter of
> user vfsmount to the superblock and methods when it goes to 1 and 0,
> but that seems a bit ugly.

Folks, please, _please_, let's formulate the lifecycle rules first; we
already had way too much trouble from putting mechanism first only to
run into questions like the above ("what happens if somebody tries to
allocate a PID in pid_ns that is already scheduled for shutdown?").
Remember the (recurring) fun with kobject-related lifetime issues?
Or rpc_pipefs notifier ugliness, for that matter...

2013-12-03 02:11:06

by Al Viro

[permalink] [raw]
Subject: [RFC] alloc_pid() breakage

On Sun, Dec 01, 2013 at 06:13:29PM +0000, Al Viro wrote:

> AFAICS, pid_ns gets internal procfs instance and it pins the sucker down.
> Which would cause exact same problems, obviously. The trick done there
> is more or less to introduce a "being shut down" state of pid_ns - from
> the moment when we don't have any pids in it to actual destruction.
> Entering that state schedules (yes, it is async and yes, it is ugly)
> dropping the internal procfs vfsmount.
>
> Additional headache, AFAICS, comes from /proc/self/ns/pid - it can be
> opened, passed to somebody in ancestor pidns and then fed by it to
> setns(2). After that fork() by that somebody will trigger alloc_pid() in
> that pid_ns. What happens if it comes just before the (already scheduled)
> pid_ns_release_proc()? AFAICS, nothing good - there's no protection
> against leaks, access to freed vfsmount, double-mntput, etc. Eric, am
> I missing something subtle and relevant in that code?

Egads... I think I see what's going on, but it's convoluted as hell -
you rely on 1 not getting returned more than once by alloc_pidmap(), even
after having been freed, so this
if (unlikely(is_child_reaper(pid))) {
if (pid_ns_prepare_proc(ns))
goto out_free;
}
is essentially "on the first call of alloc_pid() for given pidns". And
upper bit in ->nr_hashed acts as "it's not in rundown state".

OK, so... what happens if I do unshare(CLONE_NEWPID) and the first fork()
attempt fails (e.g. due to failure to allocate a map page when allocating
a number in parent pidns, or OOM-induced failure to mount procfs, whatever).
Sure, that fork() has failed. No pid had been allocated, thus no free_pid()
calls made. After a while the memory becomes less tight and the same process
tries to fork() again. What happens then? pidns with processes in it,
but no reaper and NULL ->proc_mnt? sysctl(2) called in it won't be happy;
neither will exit(2), actually, since it'll hit proc_flush_task_mnt() and
oops on trying to evaluate ->proc_mnt->mnt_root...

Another question: can free_pid() end up scheduling ->proc_work for anything
other than the last level? After all, reaper in parent pidns couldn't have
gotten through the zap_pid_ns_process() yet, let alone getting to its
free_pid(), right?

2013-12-02 16:38:01

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs

On Mon, Dec 02, 2013 at 11:33:19AM -0500, Jeff Layton wrote:
> On Mon, 2 Dec 2013 11:00:50 -0500
> Trond Myklebust <[email protected]> wrote:
>
> >
> > On Dec 2, 2013, at 10:34, Christoph Hellwig <[email protected]> wrote:
> >
> > > On Mon, Dec 02, 2013 at 08:44:25AM -0500, Trond Myklebust wrote:
> > >> The lifetime of the kernel mount only needs to match that of the rpc_client, since each rpc_client is associated to a single net namespace, and each net namespace is in a 1-1 relationship with an rpc_pipefs super block.
> > >
> > > Except for the non-rpc_client users of rpc_pipefs?
> >
> > There is the idmapper pipe which is created as part of setting up a NFSv4 mount: that could either call rpc_get_mount(), or just rely on the fact that the nfs_client has an rpc_client. Ditto for the DNS resolver pipe.
> >
> > Any more?
> >
>
> There's the nfsdcld pipe stuff, but that was supposed to have been
> ripped out in 3.10. Bruce wasn't ready to do that since we didn't have
> a solution to the problem of using a UMH upcall in a container.
>
> As far as I'm concerned, we should go ahead and rip that out and worry
> about the UMH in a container problem later. Bruce, any objection to
> going ahead and doing that? I can respin/resend the patch to do that if
> you're ready for it...

I still haven't seen a solution to the UMH problem. Do we even expect
that there will be one at this point?

I just want to avoid the worst case where we decide that UMH was just
not designed for this kind of upcall, period, and then need to backtrack
again....

--b.

2013-12-02 15:58:10

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs


On Dec 2, 2013, at 9:24, Stanislav Kinsbursky <[email protected]> wrote:

> 02.12.2013 17:44, Trond Myklebust пишет:
>>
>> On Dec 2, 2013, at 3:12, Christoph Hellwig <[email protected]> wrote:
>>
>>> On Sun, Dec 01, 2013 at 06:13:29PM +0000, Al Viro wrote:
>>>> Making the series no-go in that form, obviously.
>>>
>>> Looking at the mess it made I'd almost be tempted to say a little leak
>>> for a less used features is better than lots of pain for everyone..
>>>
>>> Looking at the mess it made I'm really upset.
>>>
>>>>> Given that the namespace kraken has infected various internal filesystem
>>>>> and will get more soon I suspect this problem is or will become generic
>>>>> and will need a proper solution anyway. Al, any good ideas how to deal
>>>>> with this? Most straight forward way would be to add a counter of
>>>>> user vfsmount to the superblock and methods when it goes to 1 and 0,
>>>>> but that seems a bit ugly.
>>>>
>>>> Folks, please, _please_, let's formulate the lifecycle rules first; we
>>>> already had way too much trouble from putting mechanism first only to
>>>> run into questions like the above ("what happens if somebody tries to
>>>> allocate a PID in pid_ns that is already scheduled for shutdown?").
>>>> Remember the (recurring) fun with kobject-related lifetime issues?
>>>> Or rpc_pipefs notifier ugliness, for that matter...
>>>
>>> I'll have to let the net namespace folks chime in for that, as far as
>>> I'm concerned it's a featured better config'ed off. If they can't come
>>> up with anything better the procfs hack above would be it.
>>
>> The lifetime of the kernel mount only needs to match that of the rpc_client, since each rpc_client is associated to a single net namespace, and each net namespace is in a 1-1 relationship with an rpc_pipefs super block.
>>
>> IOW: move the kernel mount/umount back to the rpc_client create/destroy methods and all should be well.
>>
>
> I'm sorry, guys, if I'm missing the point.
> But there was the reason, why all this notifier infrastructure was introduced:
>
> "RPC pipefs superblock should holds network namespace while active."
>
> And that's why:
>
> "RPC pipefs mount can't be performed in kernel context since new super block
> will holds networks namespace reference and it's impossible to recognize, when
> and how we have to release this mount point."
>
> https://lkml.org/lkml/2011/10/17/123
>
> Circumstances has changed and now all this can be fixed much simplier?

I’m just pointing out that we _do_ know when the rpc_clients no longer needs to access the (per-net namespace) super block. Once we've destroyed the rpc_clients (well, OK, technically once we've destroyed the struct rpc_xprts) that refer to that net namespace, then the kernel no longer needs the super block to be mounted anywhere.

IOW: if we add back a rpc_get_mount(net)/rpc_put_mount(net), then we can call the former when creating the rpc_xprt, and the latter when destroying it. The rpc_pipefs super block is destroyed when both user space and the kernel have umounted it. i.e. when all NFS super blocks from that net name space have been destroyed and the container has unmounted rpc_pipefs.

No?

Cheers,
Trond

2013-12-02 16:04:48

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs


On Dec 2, 2013, at 10:57, Al Viro <[email protected]> wrote:

> Trond, am I right interpreting you as "that filesystem has non-empty
> contents only when there is at least one rpc_client in that netns;
> after rpc_client removal there won't be any accesses to data structures
> associated with it (due to rpc_close_pipes() and friends, presumably),
> so there won't be any need to keep netns alive after that"?
>
> If so and if rpc_client really can't outlive netns, then yes, having each
> of them hold an internal vfsmount reference pinning rpc_pipefs down would
> suffice.

We do have some ?external? users but in practice their lifetimes are bounded by the lifetimes of the rpc_clients (See my reply to Christoph?s mail).

So yes, I believe that it is the case that we need at least one rpc_client in a given nets for the filesystem to be non-empty.

Cheers,
Trond

2013-12-02 13:44:27

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs


On Dec 2, 2013, at 3:12, Christoph Hellwig <[email protected]> wrote:

> On Sun, Dec 01, 2013 at 06:13:29PM +0000, Al Viro wrote:
>> Making the series no-go in that form, obviously.
>
> Looking at the mess it made I'd almost be tempted to say a little leak
> for a less used features is better than lots of pain for everyone..
>
> Looking at the mess it made I'm really upset.
>
>>> Given that the namespace kraken has infected various internal filesystem
>>> and will get more soon I suspect this problem is or will become generic
>>> and will need a proper solution anyway. Al, any good ideas how to deal
>>> with this? Most straight forward way would be to add a counter of
>>> user vfsmount to the superblock and methods when it goes to 1 and 0,
>>> but that seems a bit ugly.
>>
>> Folks, please, _please_, let's formulate the lifecycle rules first; we
>> already had way too much trouble from putting mechanism first only to
>> run into questions like the above ("what happens if somebody tries to
>> allocate a PID in pid_ns that is already scheduled for shutdown?").
>> Remember the (recurring) fun with kobject-related lifetime issues?
>> Or rpc_pipefs notifier ugliness, for that matter...
>
> I'll have to let the net namespace folks chime in for that, as far as
> I'm concerned it's a featured better config'ed off. If they can't come
> up with anything better the procfs hack above would be it.

The lifetime of the kernel mount only needs to match that of the rpc_client, since each rpc_client is associated to a single net namespace, and each net namespace is in a 1-1 relationship with an rpc_pipefs super block.

IOW: move the kernel mount/umount back to the rpc_client create/destroy methods and all should be well.

Cheers
Trond

2013-12-02 16:27:19

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs

On Mon, Dec 02, 2013 at 11:00:50AM -0500, Trond Myklebust wrote:
> > Except for the non-rpc_client users of rpc_pipefs?
>
> There is the idmapper pipe which is created as part of setting up a NFSv4 mount: that could either call rpc_get_mount(), or just rely on the fact that the nfs_client has an rpc_client. Ditto for the DNS resolver pipe.

We got idmapper, dns, pnfs-block upcall.

idmapper seems per-nfs_client, the other two are created on module load
and removed at unload time.

Similarly the pipe in nfsd works globally as well.

2013-12-02 14:24:48

by Stanislav Kinsbursky

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs

02.12.2013 17:44, Trond Myklebust пишет:
>
> On Dec 2, 2013, at 3:12, Christoph Hellwig <[email protected]> wrote:
>
>> On Sun, Dec 01, 2013 at 06:13:29PM +0000, Al Viro wrote:
>>> Making the series no-go in that form, obviously.
>>
>> Looking at the mess it made I'd almost be tempted to say a little leak
>> for a less used features is better than lots of pain for everyone..
>>
>> Looking at the mess it made I'm really upset.
>>
>>>> Given that the namespace kraken has infected various internal filesystem
>>>> and will get more soon I suspect this problem is or will become generic
>>>> and will need a proper solution anyway. Al, any good ideas how to deal
>>>> with this? Most straight forward way would be to add a counter of
>>>> user vfsmount to the superblock and methods when it goes to 1 and 0,
>>>> but that seems a bit ugly.
>>>
>>> Folks, please, _please_, let's formulate the lifecycle rules first; we
>>> already had way too much trouble from putting mechanism first only to
>>> run into questions like the above ("what happens if somebody tries to
>>> allocate a PID in pid_ns that is already scheduled for shutdown?").
>>> Remember the (recurring) fun with kobject-related lifetime issues?
>>> Or rpc_pipefs notifier ugliness, for that matter...
>>
>> I'll have to let the net namespace folks chime in for that, as far as
>> I'm concerned it's a featured better config'ed off. If they can't come
>> up with anything better the procfs hack above would be it.
>
> The lifetime of the kernel mount only needs to match that of the rpc_client, since each rpc_client is associated to a single net namespace, and each net namespace is in a 1-1 relationship with an rpc_pipefs super block.
>
> IOW: move the kernel mount/umount back to the rpc_client create/destroy methods and all should be well.
>

I'm sorry, guys, if I'm missing the point.
But there was the reason, why all this notifier infrastructure was introduced:

"RPC pipefs superblock should holds network namespace while active."

And that's why:

"RPC pipefs mount can't be performed in kernel context since new super block
will holds networks namespace reference and it's impossible to recognize, when
and how we have to release this mount point."

https://lkml.org/lkml/2011/10/17/123

Circumstances has changed and now all this can be fixed much simplier?

> Cheers
> Trond--
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


--
Best regards,
Stanislav Kinsbursky

2013-12-03 07:24:59

by Stanislav Kinsbursky

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs

02.12.2013 19:58, Trond Myklebust пишет:
>
> On Dec 2, 2013, at 9:24, Stanislav Kinsbursky <[email protected]> wrote:
>
>> 02.12.2013 17:44, Trond Myklebust пишет:
>>>
>>> On Dec 2, 2013, at 3:12, Christoph Hellwig <[email protected]> wrote:
>>>
>>>> On Sun, Dec 01, 2013 at 06:13:29PM +0000, Al Viro wrote:
>>>>> Making the series no-go in that form, obviously.
>>>>
>>>> Looking at the mess it made I'd almost be tempted to say a little leak
>>>> for a less used features is better than lots of pain for everyone..
>>>>
>>>> Looking at the mess it made I'm really upset.
>>>>
>>>>>> Given that the namespace kraken has infected various internal filesystem
>>>>>> and will get more soon I suspect this problem is or will become generic
>>>>>> and will need a proper solution anyway. Al, any good ideas how to deal
>>>>>> with this? Most straight forward way would be to add a counter of
>>>>>> user vfsmount to the superblock and methods when it goes to 1 and 0,
>>>>>> but that seems a bit ugly.
>>>>>
>>>>> Folks, please, _please_, let's formulate the lifecycle rules first; we
>>>>> already had way too much trouble from putting mechanism first only to
>>>>> run into questions like the above ("what happens if somebody tries to
>>>>> allocate a PID in pid_ns that is already scheduled for shutdown?").
>>>>> Remember the (recurring) fun with kobject-related lifetime issues?
>>>>> Or rpc_pipefs notifier ugliness, for that matter...
>>>>
>>>> I'll have to let the net namespace folks chime in for that, as far as
>>>> I'm concerned it's a featured better config'ed off. If they can't come
>>>> up with anything better the procfs hack above would be it.
>>>
>>> The lifetime of the kernel mount only needs to match that of the rpc_client, since each rpc_client is associated to a single net namespace, and each net namespace is in a 1-1 relationship with an rpc_pipefs super block.
>>>
>>> IOW: move the kernel mount/umount back to the rpc_client create/destroy methods and all should be well.
>>>
>>
>> I'm sorry, guys, if I'm missing the point.
>> But there was the reason, why all this notifier infrastructure was introduced:
>>
>> "RPC pipefs superblock should holds network namespace while active."
>>
>> And that's why:
>>
>> "RPC pipefs mount can't be performed in kernel context since new super block
>> will holds networks namespace reference and it's impossible to recognize, when
>> and how we have to release this mount point."
>>
>> https://lkml.org/lkml/2011/10/17/123
>>
>> Circumstances has changed and now all this can be fixed much simplier?
>
> I’m just pointing out that we _do_ know when the rpc_clients no longer needs to access the (per-net namespace) super block. Once we've destroyed the rpc_clients (well, OK, technically once we've destroyed the struct rpc_xprts) that refer to that net namespace, then the kernel no longer needs the super block to be mounted anywhere.
>
> IOW: if we add back a rpc_get_mount(net)/rpc_put_mount(net), then we can call the former when creating the rpc_xprt, and the latter when destroying it. The rpc_pipefs super block is destroyed when both user space and the kernel have umounted it. i.e. when all NFS super blocks from that net name space have been destroyed and the container has unmounted rpc_pipefs.
>
> No?
>

So, you are proposing to create/get per-net mount point either on user space action or rpc client creation?
This should work, I suppose... The only thing which looks weird, is layer violation, when network namespace is being hold by mount point.

> Cheers,
> Trond
>


--
Best regards,
Stanislav Kinsbursky

2013-12-02 15:36:58

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 00/11] [RFC] repair net namespace damage to rpc_pipefs

On Mon, Dec 02, 2013 at 06:24:43PM +0400, Stanislav Kinsbursky wrote:
> I'm sorry, guys, if I'm missing the point.
> But there was the reason, why all this notifier infrastructure was introduced:

And the reason is that you're a lazy bastard who let's others clean up
the poo your sprinkle over the tree, and that the maintainers weren't
strict enought to reject that massive pile of junk, which unfortunately
is the common theme in current Linux development.

>
> "RPC pipefs mount can't be performed in kernel context since new super block
> will holds networks namespace reference and it's impossible to recognize, when
> and how we have to release this mount point."
>
> https://lkml.org/lkml/2011/10/17/123
>
> Circumstances has changed and now all this can be fixed much simplier?

As Al pointed our an easy workaround is easily available, and he's also
asked you about exact namespace semantics to be provided so we can
possibly find a less bad option.