2022-10-04 05:22:12

by NeilBrown

[permalink] [raw]
Subject: nfsd: another possible delegation race


Hi,
I have a customer who experienced a crash in nfsd which appears to be
related to delegation return. I cannot completely rule-out
Commit 548ec0805c39 ("nfsd: fix use-after-free due to delegation race")
as the kernel being used didn't have that commit, but the symptoms are
quite different, and while exploring I found, I think, a different
race. This new race doesn't obviously address all the symptoms, but
maybe it does...

The symptoms were:
1/ WARN_ON(!unhash_delegation_locked(dp));
in nfs4_laundromat complained (delegation wasn't hashed!)
2/ refcount_t: saturated; leaking memory
This came from the refcount_inc in revoke_delegation() called from
nfs4_laundromat(), a few lines below the above warning
3/ BUG: kernel NULL pointer dereference, address: 0000000000000028
This is from the destroy_unhashed_deleg() call at the end of
that same revoke_delegation() call, which calls
nfs4_unlock_deleg_lease() and passes fp->fi_deleg_file, which is
NULL (!!!), to vfs_setlease().
These three happened in a 200usec window.

What I imagine might be happening is that the nfsd_break_deleg_cb()
callback is called after destroy_delegation() has unhashed the deleg,
but before destroy_unhashed_delegation() gets called.

If nfsd_break_deleg_cb() is called before the unhash - and particularly
if nfsd_break_one_deleg()->nfsd4_run_cb() is called before, then the
unhash will disconnect the delegation from the recall list, and no
harm can be done.
Once vfs_setlease(F_UNLCK) is called, the callback can no longer be
called, so again no harm is possible.

Between these two is a race window. The delegation can be put on the
recall list, but the deleg will be unhashed and put_deleg_file() will
have set fi_deleg_file to NULL - resulting in first WARNING and the
BUG.

I cannot see how the refcount_t warning can be generated ... so maybe
I've missed something.

My proposed solution is to test delegation_hashed() while holding
fp->fi_lock in nfsd_break_deleg_cb(). If the delegation is locked, we
can safely schedule the recall. If it isn't, then the lease is about
to be unlocked and there is no need to schedule anything.

I don't know this code at all well, so I thought it safest to ask for
comments before posting a proper patch.
I'm particularly curious to know if anyone has ideas about the refcount
overflow. Corruption is unlikely as the deleg looked mostly OK and the
memory has been freed, but not reallocated (though it is possible it
was freed, reallocated, and freed again).
This wasn't a refcount_inc on a zero refcount - that gives a different
error. I don't know what the refcount value was, it has already been
changed to the 'saturated' value - 0xc0000000.

Thanks,
NeilBrown


diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index c5d199d7e6b4..e02d638df6be 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -4822,8 +4822,10 @@ nfsd_break_deleg_cb(struct file_lock *fl)
fl->fl_break_time = 0;

spin_lock(&fp->fi_lock);
- fp->fi_had_conflict = true;
- nfsd_break_one_deleg(dp);
+ if (delegation_hashed(dp)) {
+ fp->fi_had_conflict = true;
+ nfsd_break_one_deleg(dp);
+ }
spin_unlock(&fp->fi_lock);
return ret;
}



2022-10-04 11:22:16

by Jeff Layton

[permalink] [raw]
Subject: Re: nfsd: another possible delegation race

On Tue, 2022-10-04 at 16:14 +1100, NeilBrown wrote:
> Hi,
> I have a customer who experienced a crash in nfsd which appears to be
> related to delegation return. I cannot completely rule-out
> Commit 548ec0805c39 ("nfsd: fix use-after-free due to delegation race")
> as the kernel being used didn't have that commit, but the symptoms are
> quite different, and while exploring I found, I think, a different
> race. This new race doesn't obviously address all the symptoms, but
> maybe it does...
>
> The symptoms were:
> 1/ WARN_ON(!unhash_delegation_locked(dp));
> in nfs4_laundromat complained (delegation wasn't hashed!)
> 2/ refcount_t: saturated; leaking memory
> This came from the refcount_inc in revoke_delegation() called from
> nfs4_laundromat(), a few lines below the above warning

Well, that is odd! Chuck has caught this a couple of times:

https://bugzilla.linux-nfs.org/show_bug.cgi?id=394

...but that's an underflow.

> 3/ BUG: kernel NULL pointer dereference, address: 0000000000000028
> This is from the destroy_unhashed_deleg() call at the end of
> that same revoke_delegation() call, which calls
> nfs4_unlock_deleg_lease() and passes fp->fi_deleg_file, which is
> NULL (!!!), to vfs_setlease().
> These three happened in a 200usec window.
>
> What I imagine might be happening is that the nfsd_break_deleg_cb()
> callback is called after destroy_delegation() has unhashed the deleg,
> but before destroy_unhashed_delegation() gets called.
>

Ok, so a DELEGRETURN is racing with a lease break?

> If nfsd_break_deleg_cb() is called before the unhash - and particularly
> if nfsd_break_one_deleg()->nfsd4_run_cb() is called before, then the
> unhash will disconnect the delegation from the recall list, and no
> harm can be done.
> Once vfs_setlease(F_UNLCK) is called, the callback can no longer be
> called, so again no harm is possible.
>
> Between these two is a race window. The delegation can be put on the
> recall list, but the deleg will be unhashed and put_deleg_file() will
> have set fi_deleg_file to NULL - resulting in first WARNING and the
> BUG.
>
> I cannot see how the refcount_t warning can be generated ... so maybe
> I've missed something.
>
> My proposed solution is to test delegation_hashed() while holding
> fp->fi_lock in nfsd_break_deleg_cb(). If the delegation is locked, we
> can safely schedule the recall. If it isn't, then the lease is about
> to be unlocked and there is no need to schedule anything.
>

I think you mean "If the delegation is hashed, we can safely schedule
the recall."

That sounds like it might be the right approach. Once we've unhashed the
stateid, I think we can safely assume that it's on its way out the door
and that we don't need to issue a recall.

> I don't know this code at all well, so I thought it safest to ask for
> comments before posting a proper patch.
> I'm particularly curious to know if anyone has ideas about the refcount
> overflow. Corruption is unlikely as the deleg looked mostly OK and the
> memory has been freed, but not reallocated (though it is possible it
> was freed, reallocated, and freed again).
> This wasn't a refcount_inc on a zero refcount - that gives a different
> error. I don't know what the refcount value was, it has already been
> changed to the 'saturated' value - 0xc0000000.
>
>

That would be this, I think:

else if (unlikely(old < 0 || old + i < 0))
refcount_warn_saturate(r, REFCOUNT_ADD_OVF);

So the old or new value was < 0?

No idea how you get there though. I would think if we were leaking
delegations to that degree that we'd see leaked memory warnings when
shutting down nfsd.

>
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index c5d199d7e6b4..e02d638df6be 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -4822,8 +4822,10 @@ nfsd_break_deleg_cb(struct file_lock *fl)
> fl->fl_break_time = 0;
>
> spin_lock(&fp->fi_lock);
> - fp->fi_had_conflict = true;
> - nfsd_break_one_deleg(dp);
> + if (delegation_hashed(dp)) {
> + fp->fi_had_conflict = true;
> + nfsd_break_one_deleg(dp);
> + }
> spin_unlock(&fp->fi_lock);
> return ret;
> }
>
>

This looks reasonable to me.
--
Jeff Layton <[email protected]>

2022-10-04 14:24:32

by Chuck Lever III

[permalink] [raw]
Subject: Re: nfsd: another possible delegation race



> On Oct 4, 2022, at 1:14 AM, NeilBrown <[email protected]> wrote:
>
>
> Hi,
> I have a customer who experienced a crash in nfsd which appears to be
> related to delegation return. I cannot completely rule-out
> Commit 548ec0805c39 ("nfsd: fix use-after-free due to delegation race")
> as the kernel being used didn't have that commit, but the symptoms are
> quite different, and while exploring I found, I think, a different
> race. This new race doesn't obviously address all the symptoms, but
> maybe it does...
>
> The symptoms were:
> 1/ WARN_ON(!unhash_delegation_locked(dp));
> in nfs4_laundromat complained (delegation wasn't hashed!)
> 2/ refcount_t: saturated; leaking memory
> This came from the refcount_inc in revoke_delegation() called from
> nfs4_laundromat(), a few lines below the above warning
> 3/ BUG: kernel NULL pointer dereference, address: 0000000000000028
> This is from the destroy_unhashed_deleg() call at the end of
> that same revoke_delegation() call, which calls
> nfs4_unlock_deleg_lease() and passes fp->fi_deleg_file, which is
> NULL (!!!), to vfs_setlease().
> These three happened in a 200usec window.
>
> What I imagine might be happening is that the nfsd_break_deleg_cb()
> callback is called after destroy_delegation() has unhashed the deleg,
> but before destroy_unhashed_delegation() gets called.
>
> If nfsd_break_deleg_cb() is called before the unhash - and particularly
> if nfsd_break_one_deleg()->nfsd4_run_cb() is called before, then the
> unhash will disconnect the delegation from the recall list, and no
> harm can be done.
> Once vfs_setlease(F_UNLCK) is called, the callback can no longer be
> called, so again no harm is possible.
>
> Between these two is a race window. The delegation can be put on the
> recall list, but the deleg will be unhashed and put_deleg_file() will
> have set fi_deleg_file to NULL - resulting in first WARNING and the
> BUG.

That seems plausible. I've been accepting defensive patches like
what you proposed below, so I can queue that up for v6.2 as soon as
you post an official version.

It would help to know the kernel version where you encountered
these symptoms, and to have a rough description of the workload;
I assume you do not have a reliable reproducer. I'm wondering if
there should be a bug report too (bugzilla.linux-nfs.org)?


> I cannot see how the refcount_t warning can be generated ... so maybe
> I've missed something.

stid refcounting does not seem reliable in the current code base.
It's possible that the overflow is a separate issue that simply
appeared at the same time or due to the same conditions that
triggered the BUG.


> My proposed solution is to test delegation_hashed() while holding
> fp->fi_lock in nfsd_break_deleg_cb(). If the delegation is locked, we
> can safely schedule the recall. If it isn't, then the lease is about
> to be unlocked and there is no need to schedule anything.
>
> I don't know this code at all well, so I thought it safest to ask for
> comments before posting a proper patch.
> I'm particularly curious to know if anyone has ideas about the refcount
> overflow. Corruption is unlikely as the deleg looked mostly OK and the
> memory has been freed, but not reallocated (though it is possible it
> was freed, reallocated, and freed again).
> This wasn't a refcount_inc on a zero refcount - that gives a different
> error. I don't know what the refcount value was, it has already been
> changed to the 'saturated' value - 0xc0000000.
>
> Thanks,
> NeilBrown
>
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index c5d199d7e6b4..e02d638df6be 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -4822,8 +4822,10 @@ nfsd_break_deleg_cb(struct file_lock *fl)
> fl->fl_break_time = 0;
>
> spin_lock(&fp->fi_lock);
> - fp->fi_had_conflict = true;
> - nfsd_break_one_deleg(dp);
> + if (delegation_hashed(dp)) {
> + fp->fi_had_conflict = true;
> + nfsd_break_one_deleg(dp);
> + }
> spin_unlock(&fp->fi_lock);
> return ret;
> }
>
>

--
Chuck Lever



2022-10-04 21:51:13

by NeilBrown

[permalink] [raw]
Subject: Re: nfsd: another possible delegation race

On Tue, 04 Oct 2022, Jeff Layton wrote:
> On Tue, 2022-10-04 at 16:14 +1100, NeilBrown wrote:
> > Hi,
> > I have a customer who experienced a crash in nfsd which appears to be
> > related to delegation return. I cannot completely rule-out
> > Commit 548ec0805c39 ("nfsd: fix use-after-free due to delegation race")
> > as the kernel being used didn't have that commit, but the symptoms are
> > quite different, and while exploring I found, I think, a different
> > race. This new race doesn't obviously address all the symptoms, but
> > maybe it does...
> >
> > The symptoms were:
> > 1/ WARN_ON(!unhash_delegation_locked(dp));
> > in nfs4_laundromat complained (delegation wasn't hashed!)
> > 2/ refcount_t: saturated; leaking memory
> > This came from the refcount_inc in revoke_delegation() called from
> > nfs4_laundromat(), a few lines below the above warning
>
> Well, that is odd! Chuck has caught this a couple of times:
>
> https://bugzilla.linux-nfs.org/show_bug.cgi?id=394
>
> ...but that's an underflow.

https://bugzilla.redhat.com/show_bug.cgi?id=2127067
linked in that bugzilla looks like exactly the same problem, though
caught at a different place by KASAN.

I would think there must have been a previous underflow - after that all
further references cause the "saturation" warning. Except that the
dmesg in the crashdump has 123 hours of logs and no other refcount_t
message. Maybe I shouldn't lose sleep over this...

>
> > 3/ BUG: kernel NULL pointer dereference, address: 0000000000000028
> > This is from the destroy_unhashed_deleg() call at the end of
> > that same revoke_delegation() call, which calls
> > nfs4_unlock_deleg_lease() and passes fp->fi_deleg_file, which is
> > NULL (!!!), to vfs_setlease().
> > These three happened in a 200usec window.
> >
> > What I imagine might be happening is that the nfsd_break_deleg_cb()
> > callback is called after destroy_delegation() has unhashed the deleg,
> > but before destroy_unhashed_delegation() gets called.
> >
>
> Ok, so a DELEGRETURN is racing with a lease break?

Exactly my assessment - yes.


>
> > If nfsd_break_deleg_cb() is called before the unhash - and particularly
> > if nfsd_break_one_deleg()->nfsd4_run_cb() is called before, then the
> > unhash will disconnect the delegation from the recall list, and no
> > harm can be done.
> > Once vfs_setlease(F_UNLCK) is called, the callback can no longer be
> > called, so again no harm is possible.
> >
> > Between these two is a race window. The delegation can be put on the
> > recall list, but the deleg will be unhashed and put_deleg_file() will
> > have set fi_deleg_file to NULL - resulting in first WARNING and the
> > BUG.
> >
> > I cannot see how the refcount_t warning can be generated ... so maybe
> > I've missed something.
> >
> > My proposed solution is to test delegation_hashed() while holding
> > fp->fi_lock in nfsd_break_deleg_cb(). If the delegation is locked, we
> > can safely schedule the recall. If it isn't, then the lease is about
> > to be unlocked and there is no need to schedule anything.
> >
>
> I think you mean "If the delegation is hashed, we can safely schedule
> the recall."

Yes s/locked/hashed/ :-)

>
> That sounds like it might be the right approach. Once we've unhashed the
> stateid, I think we can safely assume that it's on its way out the door
> and that we don't need to issue a recall.
>
> > I don't know this code at all well, so I thought it safest to ask for
> > comments before posting a proper patch.
> > I'm particularly curious to know if anyone has ideas about the refcount
> > overflow. Corruption is unlikely as the deleg looked mostly OK and the
> > memory has been freed, but not reallocated (though it is possible it
> > was freed, reallocated, and freed again).
> > This wasn't a refcount_inc on a zero refcount - that gives a different
> > error. I don't know what the refcount value was, it has already been
> > changed to the 'saturated' value - 0xc0000000.
> >
> >
>
> That would be this, I think:
>
> else if (unlikely(old < 0 || old + i < 0))
> refcount_warn_saturate(r, REFCOUNT_ADD_OVF);
>
> So the old or new value was < 0?
>
> No idea how you get there though. I would think if we were leaking
> delegations to that degree that we'd see leaked memory warnings when
> shutting down nfsd.

I'm certain the refcount did get to zero at some point because the
nfsd_deleg has been freed. But the refcount code should never set it
negative without reporting a warning.


>
> >
> >
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index c5d199d7e6b4..e02d638df6be 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -4822,8 +4822,10 @@ nfsd_break_deleg_cb(struct file_lock *fl)
> > fl->fl_break_time = 0;
> >
> > spin_lock(&fp->fi_lock);
> > - fp->fi_had_conflict = true;
> > - nfsd_break_one_deleg(dp);
> > + if (delegation_hashed(dp)) {
> > + fp->fi_had_conflict = true;
> > + nfsd_break_one_deleg(dp);
> > + }
> > spin_unlock(&fp->fi_lock);
> > return ret;
> > }
> >
> >
>
> This looks reasonable to me.

Thanks!
NeilBrown

2022-10-04 22:40:57

by NeilBrown

[permalink] [raw]
Subject: Re: nfsd: another possible delegation race

On Wed, 05 Oct 2022, Chuck Lever III wrote:
>
> > On Oct 4, 2022, at 1:14 AM, NeilBrown <[email protected]> wrote:
> >
> >
> > Hi,
> > I have a customer who experienced a crash in nfsd which appears to be
> > related to delegation return. I cannot completely rule-out
> > Commit 548ec0805c39 ("nfsd: fix use-after-free due to delegation race")
> > as the kernel being used didn't have that commit, but the symptoms are
> > quite different, and while exploring I found, I think, a different
> > race. This new race doesn't obviously address all the symptoms, but
> > maybe it does...
> >
> > The symptoms were:
> > 1/ WARN_ON(!unhash_delegation_locked(dp));
> > in nfs4_laundromat complained (delegation wasn't hashed!)
> > 2/ refcount_t: saturated; leaking memory
> > This came from the refcount_inc in revoke_delegation() called from
> > nfs4_laundromat(), a few lines below the above warning
> > 3/ BUG: kernel NULL pointer dereference, address: 0000000000000028
> > This is from the destroy_unhashed_deleg() call at the end of
> > that same revoke_delegation() call, which calls
> > nfs4_unlock_deleg_lease() and passes fp->fi_deleg_file, which is
> > NULL (!!!), to vfs_setlease().
> > These three happened in a 200usec window.
> >
> > What I imagine might be happening is that the nfsd_break_deleg_cb()
> > callback is called after destroy_delegation() has unhashed the deleg,
> > but before destroy_unhashed_delegation() gets called.
> >
> > If nfsd_break_deleg_cb() is called before the unhash - and particularly
> > if nfsd_break_one_deleg()->nfsd4_run_cb() is called before, then the
> > unhash will disconnect the delegation from the recall list, and no
> > harm can be done.
> > Once vfs_setlease(F_UNLCK) is called, the callback can no longer be
> > called, so again no harm is possible.
> >
> > Between these two is a race window. The delegation can be put on the
> > recall list, but the deleg will be unhashed and put_deleg_file() will
> > have set fi_deleg_file to NULL - resulting in first WARNING and the
> > BUG.
>
> That seems plausible. I've been accepting defensive patches like
> what you proposed below, so I can queue that up for v6.2 as soon as
> you post an official version.
>
> It would help to know the kernel version where you encountered
> these symptoms, and to have a rough description of the workload;
> I assume you do not have a reliable reproducer. I'm wondering if
> there should be a bug report too (bugzilla.linux-nfs.org)?
>
Kernel version 5.3.18 plus various backported patches SLE15-SP3 from
July 2021, so not ancient but not the most recent either.

I don't have any workload information.
bug 394 seem much the same, though details might be different.

>
> > I cannot see how the refcount_t warning can be generated ... so maybe
> > I've missed something.
>
> stid refcounting does not seem reliable in the current code base.
> It's possible that the overflow is a separate issue that simply
> appeared at the same time or due to the same conditions that
> triggered the BUG.

Maybe ... though refcount is quite noisy about anything suspicious....

Aha.. The refcount is at the start of the structure. When slub frees an
allocation, it put is on a freelist with pointers at the start of the
allocation. So freeing an nfsd_deleg will corrupt the refcount.
So when refcount_inc() is called on a freed nfsd_deleg, we get the
"saturation" error, because pointers tend to look like negative numbers.

One would expect to see sc_type corrupt too because it is within the
first 64 bits of the start of the struct, but it is set explicitly in
revoke_delegation() which happens after the struct is freed, and is
where the crash happens.

Oh good - I think I understand it all now. Thanks :-)

NeilBrown

2022-10-05 00:34:46

by NeilBrown

[permalink] [raw]
Subject: Re: nfsd: another possible delegation race

On Tue, 04 Oct 2022, Jeff Layton wrote:
> On Tue, 2022-10-04 at 16:14 +1100, NeilBrown wrote:
> > Hi,
> > I have a customer who experienced a crash in nfsd which appears to be
> > related to delegation return. I cannot completely rule-out
> > Commit 548ec0805c39 ("nfsd: fix use-after-free due to delegation race")
...
>
> Ok, so a DELEGRETURN is racing with a lease break?
>

and that is exactly what the above mentioned commit fixes. I now see
that it is the right fix for my problem as well.

Putting the delegation_hashed() check in nfsd_break_deleg_cb() doesn't
help because the problematic list-add to del_recall_lru doesn't happen
until later in a different thread, and so the nfs4_delegation could
still get unhashed before that thread adds it to the list. The above
commit adds the protection at the right place.

Because being on ->del_recall_lru doesn't imply a reference, it is easy
for the delegation to be freed before the laundromat finds it on the
list.

As mentioned in my reply to Chuck, the refcount_inc() gives a "saturated"
error rather than a "use-after-free" error because when the delegation
was freed, slub (which SUSE uses) puts a pointer at the start of the
memory, which is exactly where the refcount is. So it has a good chance
of becoming negative.

So my symptoms are now completely explained. Thanks to both of you for
your help.

NeilBrown