2022-03-19 07:02:10

by Trond Myklebust

Subject: Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify

On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote:
> fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may
> recurse back into nfsd and deadlock. See the stack below.
>
> nfsd            D    0 1591536      2 0x80004080
> Call Trace:
>  __schedule+0x497/0x630
>  schedule+0x67/0x90
>  schedule_preempt_disabled+0xe/0x10
>  __mutex_lock+0x347/0x4b0
>  fsnotify_destroy_mark+0x22/0xa0
>  nfsd_file_free+0x79/0xd0 [nfsd]
>  nfsd_file_put_noref+0x7c/0x90 [nfsd]
>  nfsd_file_lru_dispose+0x6d/0xa0 [nfsd]
>  nfsd_file_lru_scan+0x57/0x80 [nfsd]
>  do_shrink_slab+0x1f2/0x330
>  shrink_slab+0x244/0x2f0
>  shrink_node+0xd7/0x490
>  do_try_to_free_pages+0x12f/0x3b0
>  try_to_free_pages+0x43f/0x540
>  __alloc_pages_slowpath+0x6ab/0x11c0
>  __alloc_pages_nodemask+0x274/0x2c0
>  alloc_slab_page+0x32/0x2e0
>  new_slab+0xa6/0x8b0
>  ___slab_alloc+0x34b/0x520
>  kmem_cache_alloc+0x1c4/0x250
>  fsnotify_add_mark_locked+0x18d/0x4c0
>  fsnotify_add_mark+0x48/0x70
>  nfsd_file_acquire+0x570/0x6f0 [nfsd]
>  nfsd_read+0xa7/0x1c0 [nfsd]
>  nfsd3_proc_read+0xc1/0x110 [nfsd]
>  nfsd_dispatch+0xf7/0x240 [nfsd]
>  svc_process_common+0x2f4/0x610 [sunrpc]
>  svc_process+0xf9/0x110 [sunrpc]
>  nfsd+0x10e/0x180 [nfsd]
>  kthread+0x130/0x140
>  ret_from_fork+0x35/0x40
>
> Signed-off-by: Khazhismel Kumykov <[email protected]>
> ---
>  fs/nfsd/filecache.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> Marking this RFC since I haven't actually had a chance to test this;
> we're seeing this deadlock for some customers.
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index fdf89fcf1a0c..a14760f9b486 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file *nf)
>         struct fsnotify_mark    *mark;
>         struct nfsd_file_mark   *nfm = NULL, *new;
>         struct inode *inode = nf->nf_inode;
> +       unsigned int pflags;
>  
>         do {
>                 mutex_lock(&nfsd_file_fsnotify_group->mark_mutex);
> @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file *nf)
>                 new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
>                 refcount_set(&new->nfm_ref, 1);
>  
> +               /* fsnotify allocates, avoid recursion back into nfsd */
> +               pflags = memalloc_nofs_save();
>                 err = fsnotify_add_inode_mark(&new->nfm_mark, inode, 0);
> +               memalloc_nofs_restore(pflags);
>  
>                 /*
>                  * If the add was successful, then return the object.

Isn't that stack trace showing a slab direct reclaim, and not a
filesystem writeback situation?

Does memalloc_nofs_save()/restore() really fix this problem? It seems
to me that it cannot, particularly since knfsd is not a filesystem, and
so does not ever handle writeback of dirty pages.

Cc: the linux-mm mailing list in search of answers to the above 2
questions.
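
(For reference, the save/restore pair only sets PF_MEMALLOC_NOFS on the
calling task; the page allocator then masks __GFP_FS out of that task's
allocations, roughly like the illustrative paraphrase below of
current_gfp_context() from include/linux/sched/mm.h, with the NOIO/PIN
handling elided:

#include <linux/gfp.h>
#include <linux/sched.h>
#include <linux/sched/mm.h>

static inline gfp_t nofs_gfp_context(gfp_t flags)
{
	if (current->flags & PF_MEMALLOC_NOFS)
		flags &= ~__GFP_FS;	/* GFP_KERNEL behaves as GFP_NOFS */
	return flags;
}

This constrains how the task's own allocations may reclaim, but it does
not, by itself, stop a shrinker that never consults sc->gfp_mask from
being invoked during direct reclaim.)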

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]



2022-03-20 04:46:03

by Khazhismel Kumykov

Subject: Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify

On Fri, Mar 18, 2022 at 5:36 PM Trond Myklebust <[email protected]> wrote:
>
> Isn't that stack trace showing a slab direct reclaim, and not a
> filesystem writeback situation?
>
> Does memalloc_nofs_save()/restore() really fix this problem? It seems
> to me that it cannot, particularly since knfsd is not a filesystem, and
> so does not ever handle writeback of dirty pages.

Ah, you're right. An alternative would be delaying the destroy_mark;
I'm noticing now that, on mainline, the shrinker calls
nfsd_file_dispose_list_delayed, something missing from the 5.4
version I was looking at.

To my reading, 9542e6a643fc6 ("nfsd: Containerise filecache
laundrette") should have the effect of fixing this deadlock (and the
commit message does explicitly call out notifier deadlocks) - maybe
something to send towards stable?
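
For illustration, a minimal sketch of the shape of that delayed
disposal (names and details here are hypothetical simplifications of
what mainline does with nfsd_file_dispose_list_delayed() and its
laundrette workqueue):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

static LIST_HEAD(nf_dispose_list);
static DEFINE_SPINLOCK(nf_dispose_lock);

static void nf_dispose_work_fn(struct work_struct *work)
{
	struct nfsd_file *nf;

	spin_lock(&nf_dispose_lock);
	while ((nf = list_first_entry_or_null(&nf_dispose_list,
					      struct nfsd_file, nf_lru))) {
		list_del_init(&nf->nf_lru);
		spin_unlock(&nf_dispose_lock);
		/* Safe here: worker context, no reclaim on our stack. */
		nfsd_file_free(nf);	/* ends in fsnotify_destroy_mark() */
		spin_lock(&nf_dispose_lock);
	}
	spin_unlock(&nf_dispose_lock);
}
static DECLARE_WORK(nf_dispose_work, nf_dispose_work_fn);

/* Called from the LRU shrinker instead of freeing inline. */
static void nfsd_file_dispose_delayed(struct nfsd_file *nf)
{
	spin_lock(&nf_dispose_lock);
	list_add_tail(&nf->nf_lru, &nf_dispose_list);
	spin_unlock(&nf_dispose_lock);
	schedule_work(&nf_dispose_work);
}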
>
>
> Cc: the linux-mm mailing list in search of answers to the above 2
> questions.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>



2022-03-20 16:28:52

by Amir Goldstein

Subject: Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify

On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <[email protected]> wrote:
>
> On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote:
> > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may
> > recurse back into nfsd and deadlock. See the stack below.
> >
> > nfsd D 0 1591536 2 0x80004080
> > Call Trace:
> > __schedule+0x497/0x630
> > schedule+0x67/0x90
> > schedule_preempt_disabled+0xe/0x10
> > __mutex_lock+0x347/0x4b0
> > fsnotify_destroy_mark+0x22/0xa0
> > nfsd_file_free+0x79/0xd0 [nfsd]
> > nfsd_file_put_noref+0x7c/0x90 [nfsd]
> > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd]
> > nfsd_file_lru_scan+0x57/0x80 [nfsd]
> > do_shrink_slab+0x1f2/0x330
> > shrink_slab+0x244/0x2f0
> > shrink_node+0xd7/0x490
> > do_try_to_free_pages+0x12f/0x3b0
> > try_to_free_pages+0x43f/0x540
> > __alloc_pages_slowpath+0x6ab/0x11c0
> > __alloc_pages_nodemask+0x274/0x2c0
> > alloc_slab_page+0x32/0x2e0
> > new_slab+0xa6/0x8b0
> > ___slab_alloc+0x34b/0x520
> > kmem_cache_alloc+0x1c4/0x250
> > fsnotify_add_mark_locked+0x18d/0x4c0
> > fsnotify_add_mark+0x48/0x70
> > nfsd_file_acquire+0x570/0x6f0 [nfsd]
> > nfsd_read+0xa7/0x1c0 [nfsd]
> > nfsd3_proc_read+0xc1/0x110 [nfsd]
> > nfsd_dispatch+0xf7/0x240 [nfsd]
> > svc_process_common+0x2f4/0x610 [sunrpc]
> > svc_process+0xf9/0x110 [sunrpc]
> > nfsd+0x10e/0x180 [nfsd]
> > kthread+0x130/0x140
> > ret_from_fork+0x35/0x40
> >
> > Signed-off-by: Khazhismel Kumykov <[email protected]>
> > ---
> > fs/nfsd/filecache.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > Marking this RFC since I haven't actually had a chance to test this;
> > we're seeing this deadlock for some customers.
> >
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index fdf89fcf1a0c..a14760f9b486 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file *nf)
> >         struct fsnotify_mark    *mark;
> >         struct nfsd_file_mark   *nfm = NULL, *new;
> >         struct inode *inode = nf->nf_inode;
> > +       unsigned int pflags;
> >  
> >         do {
> >                 mutex_lock(&nfsd_file_fsnotify_group->mark_mutex);
> > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file *nf)
> >                 new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
> >                 refcount_set(&new->nfm_ref, 1);
> >  
> > +               /* fsnotify allocates, avoid recursion back into nfsd */
> > +               pflags = memalloc_nofs_save();
> >                 err = fsnotify_add_inode_mark(&new->nfm_mark, inode, 0);
> > +               memalloc_nofs_restore(pflags);
> >  
> >                 /*
> >                  * If the add was successful, then return the object.
>
> Isn't that stack trace showing a slab direct reclaim, and not a
> filesystem writeback situation?
>
> Does memalloc_nofs_save()/restore() really fix this problem? It seems
> to me that it cannot, particularly since knfsd is not a filesystem, and
> so does not ever handle writeback of dirty pages.
>

Maybe NOFS throttles direct reclaims to the point that the problem is
harder to hit?

This report came in at a good time for me.

It demonstrates an issue I did not predict for "volatile" fanotify marks [1].
As far as I can tell, nfsd filecache is currently the only fsnotify backend that
frees fsnotify marks in a memory shrinker. "volatile" fanotify marks would also
be evictable in that way, so they would expose fanotify to this deadlock.

For the short term, maybe nfsd filecache can avoid the problem by checking
mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and aborting the
shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()?
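
Something like this untested sketch, where mutex_is_locked_by_me() is
the hypothetical helper in question (it is not an existing API, and
__mutex_owner() is private to kernel/locking/, so a real version would
have to live there):

#include <linux/mutex.h>
#include <linux/sched.h>
#include <linux/shrinker.h>

/* Hypothetical helper -- not an existing kernel API. */
static inline bool mutex_is_locked_by_me(struct mutex *lock)
{
	return __mutex_owner(lock) == current;
}

static unsigned long
nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
{
	/*
	 * Freeing an nfsd_file takes mark_mutex via
	 * fsnotify_destroy_mark(); if this task already holds it, the
	 * reclaim came from an allocation inside fsnotify_add_mark()
	 * and walking the LRU here would deadlock.
	 */
	if (mutex_is_locked_by_me(&nfsd_file_fsnotify_group->mark_mutex))
		return SHRINK_STOP;

	/* ... normal LRU walk, freeing nfsd_file objects ... */
	return 0;
}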

Jan,

A relatively simple fix would be to allocate fsnotify_mark_connector in
fsnotify_add_mark() and free it if a connector already exists for the object.
I don't think there is a good reason to optimize away this allocation
for the case of a non-first group to set a mark on an object?
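
A sketch of that shape (signatures simplified from fs/notify/mark.c;
the extra "conn" argument to fsnotify_add_mark_locked() is an
assumption, not the current code):

#include <linux/fsnotify_backend.h>
#include <linux/slab.h>

int fsnotify_add_mark(struct fsnotify_mark *mark, fsnotify_connp_t *connp,
		      unsigned int obj_type, int allow_dups)
{
	struct fsnotify_mark_connector *conn;
	int ret;

	/*
	 * Allocate the connector before taking mark_mutex so the
	 * GFP_KERNEL allocation may recurse into reclaim safely.
	 */
	conn = kmem_cache_alloc(fsnotify_mark_connector_cachep, GFP_KERNEL);
	if (!conn)
		return -ENOMEM;

	mutex_lock(&mark->group->mark_mutex);
	/* Hypothetical: _locked() consumes conn (NULLs it) only if the
	 * object has no connector yet. */
	ret = fsnotify_add_mark_locked(mark, connp, obj_type, allow_dups,
				       &conn);
	mutex_unlock(&mark->group->mark_mutex);

	if (conn)	/* object already had a connector */
		kmem_cache_free(fsnotify_mark_connector_cachep, conn);
	return ret;
}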

Thanks,
Amir.



[1] https://lore.kernel.org/linux-fsdevel/[email protected]/

2022-03-21 18:35:33

by Jan Kara

Subject: Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify

On Sat 19-03-22 11:36:13, Amir Goldstein wrote:
> On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <[email protected]> wrote:
> >
> > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote:
> > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may
> > > recurse back into nfsd and deadlock. See the stack below.
> > >
> > > nfsd D 0 1591536 2 0x80004080
> > > Call Trace:
> > > __schedule+0x497/0x630
> > > schedule+0x67/0x90
> > > schedule_preempt_disabled+0xe/0x10
> > > __mutex_lock+0x347/0x4b0
> > > fsnotify_destroy_mark+0x22/0xa0
> > > nfsd_file_free+0x79/0xd0 [nfsd]
> > > nfsd_file_put_noref+0x7c/0x90 [nfsd]
> > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd]
> > > nfsd_file_lru_scan+0x57/0x80 [nfsd]
> > > do_shrink_slab+0x1f2/0x330
> > > shrink_slab+0x244/0x2f0
> > > shrink_node+0xd7/0x490
> > > do_try_to_free_pages+0x12f/0x3b0
> > > try_to_free_pages+0x43f/0x540
> > > __alloc_pages_slowpath+0x6ab/0x11c0
> > > __alloc_pages_nodemask+0x274/0x2c0
> > > alloc_slab_page+0x32/0x2e0
> > > new_slab+0xa6/0x8b0
> > > ___slab_alloc+0x34b/0x520
> > > kmem_cache_alloc+0x1c4/0x250
> > > fsnotify_add_mark_locked+0x18d/0x4c0
> > > fsnotify_add_mark+0x48/0x70
> > > nfsd_file_acquire+0x570/0x6f0 [nfsd]
> > > nfsd_read+0xa7/0x1c0 [nfsd]
> > > nfsd3_proc_read+0xc1/0x110 [nfsd]
> > > nfsd_dispatch+0xf7/0x240 [nfsd]
> > > svc_process_common+0x2f4/0x610 [sunrpc]
> > > svc_process+0xf9/0x110 [sunrpc]
> > > nfsd+0x10e/0x180 [nfsd]
> > > kthread+0x130/0x140
> > > ret_from_fork+0x35/0x40
> > >
> > > Signed-off-by: Khazhismel Kumykov <[email protected]>
> > > ---
> > > fs/nfsd/filecache.c | 4 ++++
> > > 1 file changed, 4 insertions(+)
> > >
> > > Marking this RFC since I haven't actually had a chance to test this;
> > > we're seeing this deadlock for some customers.
> > >
> > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > index fdf89fcf1a0c..a14760f9b486 100644
> > > --- a/fs/nfsd/filecache.c
> > > +++ b/fs/nfsd/filecache.c
> > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file *nf)
> > >         struct fsnotify_mark    *mark;
> > >         struct nfsd_file_mark   *nfm = NULL, *new;
> > >         struct inode *inode = nf->nf_inode;
> > > +       unsigned int pflags;
> > >  
> > >         do {
> > >                 mutex_lock(&nfsd_file_fsnotify_group->mark_mutex);
> > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file *nf)
> > >                 new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
> > >                 refcount_set(&new->nfm_ref, 1);
> > >  
> > > +               /* fsnotify allocates, avoid recursion back into nfsd */
> > > +               pflags = memalloc_nofs_save();
> > >                 err = fsnotify_add_inode_mark(&new->nfm_mark, inode, 0);
> > > +               memalloc_nofs_restore(pflags);
> > >  
> > >                 /*
> > >                  * If the add was successful, then return the object.
> >
> > Isn't that stack trace showing a slab direct reclaim, and not a
> > filesystem writeback situation?
> >
> > Does memalloc_nofs_save()/restore() really fix this problem? It seems
> > to me that it cannot, particularly since knfsd is not a filesystem, and
> > so does not ever handle writeback of dirty pages.
> >
>
> Maybe NOFS throttles direct reclaims to the point that the problem is
> harder to hit?
>
> This report came in at a good time for me.
>
> It demonstrates an issue I did not predict for "volatile" fanotify marks [1].
> As far as I can tell, nfsd filecache is currently the only fsnotify backend that
> frees fsnotify marks in a memory shrinker. "volatile" fanotify marks would also
> be evictable in that way, so they would expose fanotify to this deadlock.
>
> For the short term, maybe nfsd filecache can avoid the problem by checking
> mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and aborting the
> shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()?
>
> Jan,
>
> A relatively simple fix would be to allocate fsnotify_mark_connector in
> fsnotify_add_mark() and free it if a connector already exists for the object.
> I don't think there is a good reason to optimize away this allocation
> for the case of a non-first group to set a mark on an object?

Indeed, nasty. Volatile marks will add group->mark_mutex into the set of
locks grabbed during inode slab reclaim. So any allocation under
group->mark_mutex has to be GFP_NOFS now. This is not just about connector
allocations but also mark allocations for fanotify. Moving allocations out
from under mark_mutex is also a possible solution, but passing preallocated
memory around is kind of ugly as well. So the cleanest solution I currently
see is to come up with helpers like "fsnotify_lock_group() &
fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do the
memalloc_nofs_save / restore magic.
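
Roughly like this (a sketch only; the FSNOTIFY_GROUP_NOFS flag and the
group->owner_flags field are assumptions, since nothing like them
exists yet):

#include <linux/fsnotify_backend.h>
#include <linux/sched/mm.h>

/* Assumed opt-in flag for groups whose marks can be freed in reclaim. */
#define FSNOTIFY_GROUP_NOFS	0x01

static inline void fsnotify_lock_group(struct fsnotify_group *group)
{
	mutex_lock(&group->mark_mutex);
	if (group->flags & FSNOTIFY_GROUP_NOFS)
		group->owner_flags = memalloc_nofs_save();
}

static inline void fsnotify_unlock_group(struct fsnotify_group *group)
{
	if (group->flags & FSNOTIFY_GROUP_NOFS)
		memalloc_nofs_restore(group->owner_flags);
	mutex_unlock(&group->mark_mutex);
}

Every path that currently takes group->mark_mutex directly would switch
to these helpers, making all allocations under the mutex implicitly
GFP_NOFS.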

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-03-21 21:42:46

by Khazhismel Kumykov

Subject: Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify

On Sat, Mar 19, 2022 at 2:36 AM Amir Goldstein <[email protected]> wrote:
>
> On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <[email protected]> wrote:
> >
> > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote:
> > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may
> > > recurse back into nfsd and deadlock. See the stack below.
> > >
> > > nfsd D 0 1591536 2 0x80004080
> > > Call Trace:
> > > __schedule+0x497/0x630
> > > schedule+0x67/0x90
> > > schedule_preempt_disabled+0xe/0x10
> > > __mutex_lock+0x347/0x4b0
> > > fsnotify_destroy_mark+0x22/0xa0
> > > nfsd_file_free+0x79/0xd0 [nfsd]
> > > nfsd_file_put_noref+0x7c/0x90 [nfsd]
> > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd]
> > > nfsd_file_lru_scan+0x57/0x80 [nfsd]
> > > do_shrink_slab+0x1f2/0x330
> > > shrink_slab+0x244/0x2f0
> > > shrink_node+0xd7/0x490
> > > do_try_to_free_pages+0x12f/0x3b0
> > > try_to_free_pages+0x43f/0x540
> > > __alloc_pages_slowpath+0x6ab/0x11c0
> > > __alloc_pages_nodemask+0x274/0x2c0
> > > alloc_slab_page+0x32/0x2e0
> > > new_slab+0xa6/0x8b0
> > > ___slab_alloc+0x34b/0x520
> > > kmem_cache_alloc+0x1c4/0x250
> > > fsnotify_add_mark_locked+0x18d/0x4c0
> > > fsnotify_add_mark+0x48/0x70
> > > nfsd_file_acquire+0x570/0x6f0 [nfsd]
> > > nfsd_read+0xa7/0x1c0 [nfsd]
> > > nfsd3_proc_read+0xc1/0x110 [nfsd]
> > > nfsd_dispatch+0xf7/0x240 [nfsd]
> > > svc_process_common+0x2f4/0x610 [sunrpc]
> > > svc_process+0xf9/0x110 [sunrpc]
> > > nfsd+0x10e/0x180 [nfsd]
> > > kthread+0x130/0x140
> > > ret_from_fork+0x35/0x40
> > >
> > > Signed-off-by: Khazhismel Kumykov <[email protected]>
> > > ---
> > > fs/nfsd/filecache.c | 4 ++++
> > > 1 file changed, 4 insertions(+)
> > >
> > > Marking this RFC since I haven't actually had a chance to test this;
> > > we're seeing this deadlock for some customers.
> > >
> > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > index fdf89fcf1a0c..a14760f9b486 100644
> > > --- a/fs/nfsd/filecache.c
> > > +++ b/fs/nfsd/filecache.c
> > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file *nf)
> > >         struct fsnotify_mark    *mark;
> > >         struct nfsd_file_mark   *nfm = NULL, *new;
> > >         struct inode *inode = nf->nf_inode;
> > > +       unsigned int pflags;
> > >  
> > >         do {
> > >                 mutex_lock(&nfsd_file_fsnotify_group->mark_mutex);
> > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file *nf)
> > >                 new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
> > >                 refcount_set(&new->nfm_ref, 1);
> > >  
> > > +               /* fsnotify allocates, avoid recursion back into nfsd */
> > > +               pflags = memalloc_nofs_save();
> > >                 err = fsnotify_add_inode_mark(&new->nfm_mark, inode, 0);
> > > +               memalloc_nofs_restore(pflags);
> > >  
> > >                 /*
> > >                  * If the add was successful, then return the object.
> >
> > Isn't that stack trace showing a slab direct reclaim, and not a
> > filesystem writeback situation?
> >
> > Does memalloc_nofs_save()/restore() really fix this problem? It seems
> > to me that it cannot, particularly since knfsd is not a filesystem, and
> > so does not ever handle writeback of dirty pages.
> >
>
> Maybe NOFS throttles direct reclaims to the point that the problem is
> harder to hit?

(I think I simply got confused - I don't see a reason that NOFS would
help with direct reclaim, though it does look like the gfp flags are
passed via a shrink_control struct, so one *could* react to them in the
shrinker - again, not an area I'm super familiar with.)
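
A sketch of what reacting to those flags could look like (illustrative
only, not the nfsd code; the superblock shrinker does something
similar):

#include <linux/shrinker.h>

static unsigned long
example_lru_scan(struct shrinker *s, struct shrink_control *sc)
{
	unsigned long freed = 0;

	/*
	 * If the allocation that triggered reclaim could not take fs
	 * locks (no __GFP_FS), skip freeing objects whose teardown
	 * needs them; kswapd will retry later with a full gfp mask.
	 */
	if (!(sc->gfp_mask & __GFP_FS))
		return SHRINK_STOP;

	/* ... walk the LRU, freeing entries and counting them ... */
	return freed;
}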

>
> This report came in at a good time for me.
>
> It demonstrates an issue I did not predict for "volatile" fanotify marks [1].
> As far as I can tell, nfsd filecache is currently the only fsnotify backend that
> frees fsnotify marks in a memory shrinker. "volatile" fanotify marks would also
> be evictable in that way, so they would expose fanotify to this deadlock.
>
> For the short term, maybe nfsd filecache can avoid the problem by checking
> mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and aborting the
> shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()?

fwiw, it does look like nfsd stopped freeing fsnotify marks during
reclaim around v5.5, in the commit "nfsd: Containerise filecache
laundrette" (I had sent an earlier email about this; not sure where
that's getting caught up, but I do see it on lore...)


>
> Jan,
>
> A relatively simple fix would be to allocate fsnotify_mark_connector in
> fsnotify_add_mark() and free it if a connector already exists for the object.
> I don't think there is a good reason to optimize away this allocation
> for the case of a non-first group to set a mark on an object?
>
> Thanks,
> Amir.
>
>
>
> [1] https://lore.kernel.org/linux-fsdevel/[email protected]/

