---------- Forwarded message ---------
From: Yun Levi <[email protected]>
Date: Tue, Apr 7, 2020 at 9:53 PM
Subject: Re: [PATCH] netns: dangling pointer on netns bind mount point.
To: Al Viro <[email protected]>
Cc: David S. Miller <[email protected]>, Jakub Kicinski
<[email protected]>, Guillaume Nault <[email protected]>, Nicolas
Dichtel <[email protected]>, Eric Dumazet
<[email protected]>, Li RongQing <[email protected]>, Thomas
Gleixner <[email protected]>, Johannes Berg
<[email protected]>, David Howells <[email protected]>,
<[email protected]>, <[email protected]>,
<[email protected]>, <[email protected]>
BTW, It's my question.
>Look: we call ns_get_path(), which calls ns_get_path_cb(), which
>calls the callback passed to it (ns_get_path_task(), in this case),
>which grabs a reference to ns. Then we pass that reference to
>__ns_get_path().
>
>__ns_get_path() looks for dentry stashed in ns. If there is one
>and it's still alive, we grab a reference to that dentry and drop
>the reference to ns - no new inodes had been created, so no new
>namespace references have appeared. Existing inode is pinned
>by dentry and dentry is pinned by _dentry_ reference we've got.
actually ns_get_path is called in unshare(2). and it makes new dentry
and inode in __ns_get_path finally (Cuz it create new network
namespace)
In that case, when I mount with --bind option to this
proc/self/ns/net, it only increase dentry refcount on
do_loopback->clone_mnt finally (not call netns_operation->get)
That means it's not increase previous created network namespace
reference count but only increase reference count of _dentry_
In that situation, If I exit the child process it definitely frees the
net_namespace previous created at the same time it decrease
net_namespace's refcnt in exit_task_namespace().
It means it's possible that bind mount point can hold the dentry with
inode having net_namespace's dangling pointer in another process.
In above situation, parent who know that binded mount point calls
setns(2) then it sets the net_namespace which is refered by the inode
of the dentry increased by do_loopback.
That makes set the net_namespace which was freed already.
The Kernel Panic log is happend NOT in child kill but in Parent killed
after I setns(2) with the net_namespace made by child process.
Thanks you for your reviewing, But Please forgive that I couldn't
share .config right now (I try to make for x86....)
I try to understand your comments But Please forgive my fault because
of my ignorant...
If my explain is wrong, please rebuke me... and Please share your knowledge...
Thank you.
On Tue, Apr 7, 2020 at 1:04 PM Al Viro <[email protected]> wrote:
>
> On Tue, Apr 07, 2020 at 12:29:34PM +0900, Yun Levi wrote:
>
> > sock_prot_inuse_add+0x10/0x20
> > __sock_release+0x44/0xc0
> > sock_close+0x14/0x20
> > __fput+0x8c/0x1b8
> > ____fput+0xc/0x18
> > task_work_run+0xa8/0xd8
> > do_exit+0x2e4/0xa50
> > do_group_exit+0x34/0xc8
> > get_signal+0xd4/0x600
> > do_signal+0x174/0x268
> > do_notify_resume+0xcc/0x110
> > work_pending+0x8/0x10
> > Code: b940c821 f940c000 d538d083 8b010800 (b8606861) ---[ end trace
> > 0b98c9ccbfd9f6fd ]---
> > Date/Time : 02-14-0120 06:35:47 Kernel panic - not syncing: Fatal
> > exception in interrupt SMP: stopping secondary CPUs Kernel Offset:
> > disabled CPU features: 0x0,21806000 Memory Limit: none
> >
> > What I saw is when I try to bind on some mount point to
> > /proc/{pid}/ns/net which made by child process, That's doesn't
> > increase the netns' refcnt.
>
> Why would it? Increase of netns refcount should happen when you
> follow /proc/*/ns/net and stay for as long as nsfs inode is alive,
> not by cloning that mount. And while we are at it, you don't need
> to bind it anywhere in order to call setns() - just open the
> sucker and then feed the resulting descriptor to setns(2). No
> mount --bind involved and if child exiting between open() and
> setns() would free netns, we would have exact same problem.
>
> > And when the child process's going to exit, it frees the netns But
> > Still bind mount point's inode's private data point to netns which was
> > freed by child when it exits.
>
> Look: we call ns_get_path(), which calls ns_get_path_cb(), which
> calls the callback passed to it (ns_get_path_task(), in this case),
> which grabs a reference to ns. Then we pass that reference to
> __ns_get_path().
>
> __ns_get_path() looks for dentry stashed in ns. If there is one
> and it's still alive, we grab a reference to that dentry and drop
> the reference to ns - no new inodes had been created, so no new
> namespace references have appeared. Existing inode is pinned
> by dentry and dentry is pinned by _dentry_ reference we've got.
>
> If dentry is not there or is already in the middle of being destroyed,
> we allocate a new inode, stash our namespace reference into it,
> create a dentry referencing that new inode, stash it into namespace
> and return that dentry. Without dropping namespace reference we'd
> obtained in ns_get_path_task() - it went into new inode.
>
> If inode or dentry creation fails (out of memory), we drop what
> we'd obtained (namespace if inode creation fails, just the inode
> if dentry creation fails - namespace reference that went into
> inode will be dropped by inode destructor in the latter case) and
> return an error.
>
> If somebody else manages to get through the entire thing while
> we'd been allocating stuff and we see _their_ dentry already
> stashed into namespace, we just drop our dentry (its destructor
> will drop inode reference, which will lead to inode destructor,
> which will drop namespace reference) and bugger off with -EAGAIN.
> The caller (ns_get_path_cb()) will retry the entire thing.
>
> The invariant to be preserved here is "each of those inodes
> holds a reference to namespace for as long as the inode exists".
> ns_get_path() increments namespace refcount if and only if it
> has allocated an nsfs inode. If it grabs an existing dentry
> instead, the namespace refcount remains unchanged.
>
> Anyway, I would really like to see .config and C (or shell)
> reproducer. Ideally - .config that could be run in a KVM.