2013-12-02 20:26:37

by Jeff Layton

Subject: [PATCH] nfsd: when reusing an existing repcache entry, unhash it first

The DRC code will attempt to reuse an existing, expired cache entry in
preference to allocating a new one. It'll then search the cache, and if
it gets a hit it'll then free the cache entry that it was going to
reuse.

The cache code doesn't unhash the entry that it's going to reuse,
however, so it's possible for it to end up designating an entry for
reuse and then subsequently freeing the same entry after it finds it.
This leads to a later use-after-free and usually some list corruption
warnings or an oops.

Fix this by simply unhashing the entry that we intend to reuse. That
will mean that it's not findable via a search and should prevent this
situation from occurring.

Cc: [email protected] # v3.10+
Reported-by: Christoph Hellwig <[email protected]>
Reported-by: g. artim <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 9186c7c..b6af150 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -132,6 +132,13 @@ nfsd_reply_cache_alloc(void)
}

static void
+nfsd_reply_cache_unhash(struct svc_cacherep *rp)
+{
+ hlist_del_init(&rp->c_hash);
+ list_del_init(&rp->c_lru);
+}
+
+static void
nfsd_reply_cache_free_locked(struct svc_cacherep *rp)
{
if (rp->c_type == RC_REPLBUFF && rp->c_replvec.iov_base) {
@@ -417,7 +424,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
if (nfsd_cache_entry_expired(rp) ||
num_drc_entries >= max_drc_entries) {
- lru_put_end(rp);
+ nfsd_reply_cache_unhash(rp);
prune_cache_entries();
goto search_cache;
}
--
1.8.4.2
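
For reference, the pre-patch lookup flow looks roughly like this,
reconstructed in simplified form from the hunk above and from the
search path quoted later in the thread (not the verbatim function).
The entry picked for reuse is left hashed, so the subsequent search
can return that very entry, which is then freed while rp still points
at it:

	spin_lock(&cache_lock);
	if (!list_empty(&lru_head)) {
		rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
		if (nfsd_cache_entry_expired(rp) ||
		    num_drc_entries >= max_drc_entries) {
			lru_put_end(rp);	/* rp stays hashed here */
			prune_cache_entries();
			goto search_cache;
		}
	}
	...

search_cache:
	found = nfsd_cache_search(rqstp, csum);	/* can find rp itself */
	if (found) {
		if (likely(rp))
			nfsd_reply_cache_free_locked(rp);
		rp = found;	/* if found == rp, rp now points at freed memory */
		...
	}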



2013-12-04 14:25:13

by J. Bruce Fields

Subject: Re: [PATCH] nfsd: when reusing an existing repcache entry, unhash it first

On Wed, Dec 04, 2013 at 05:09:44AM -0800, Christoph Hellwig wrote:
> On Wed, Dec 04, 2013 at 07:54:02AM -0500, Jeff Layton wrote:
> > Are you suggesting that we shouldn't try to bound the cache at all and
> > should instead just let it grow and rely on the shrinker to keep it in
> > check?
>
> Exactly.

As long as the lookup times stay reasonable....

--b.

2013-12-04 17:04:02

by Jeff Layton

Subject: Re: [PATCH] nfsd: when reusing an existing repcache entry, unhash it first

On Wed, 4 Dec 2013 09:25:09 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Wed, Dec 04, 2013 at 05:09:44AM -0800, Christoph Hellwig wrote:
> > On Wed, Dec 04, 2013 at 07:54:02AM -0500, Jeff Layton wrote:
> > > Are you suggesting that we shouldn't try to bound the cache at all and
> > > should instead just let it grow and rely on the shrinker to keep it in
> > > check?
> >
> > Exactly.
>
> As long as the lookup times stay reasonable....
>

That is another tricky question. Currently, we size the number of hash
buckets based on the max size of the cache. If we allow it to grow
without bound, then how do we determine how many hash buckets we need?

--
Jeff Layton <[email protected]>
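
For illustration, sizing a hash table from a fixed bound usually looks
something like the sketch below; the helper name and the target chain
length are invented for the example and are not taken from nfscache.c.
Without an upper bound there is no obvious max_entries to feed into a
calculation like this, which is the problem being raised above.

	#include <linux/kernel.h>
	#include <linux/log2.h>

	/*
	 * Hypothetical helper: pick a power-of-two bucket count for a cache
	 * bounded at max_entries, aiming for a short average hash chain.
	 */
	static unsigned int drc_hashsize(unsigned int max_entries)
	{
		unsigned int target_chain_len = 4;	/* assumed load factor */

		return roundup_pow_of_two(max(max_entries / target_chain_len, 64U));
	}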

2013-12-04 18:41:47

by Jeff Layton

Subject: Re: [PATCH] nfsd: when reusing an existing repcache entry, unhash it first

On Wed, 4 Dec 2013 09:06:02 -0800
Christoph Hellwig <[email protected]> wrote:

> On Wed, Dec 04, 2013 at 12:06:16PM -0500, Jeff Layton wrote:
> > That is another tricky question. Currently, we size the number of hash
> > buckets based on the max size of the cache. If we allow it to grow
> > without bound, then how do we determine how many hash buckets we need?
>
> It won't grow without bound as long as it's trimmed back below the
> maximum size by both the periodic workqueue and the shrinker.
>

Right, but we don't know a priori how big the cache can grow, so that
makes determining the number of buckets to allocate a little tricky.

> Of course replacing the hash with an autoscaling data structure would be
> useful, but given that 4.1 obsoletes the DRC I'm not sure how much
> effort we want to put into it.

Agreed.

--
Jeff Layton <[email protected]>

2013-12-04 13:09:48

by Christoph Hellwig

Subject: Re: [PATCH] nfsd: when reusing an existing repcache entry, unhash it first

On Wed, Dec 04, 2013 at 07:54:02AM -0500, Jeff Layton wrote:
> Are you suggesting that we shouldn't try to bound the cache at all and
> should instead just let it grow and rely on the shrinker to keep it in
> check?

Exactly. Btw, what about the comment at the top of the file that the
DRC should be per-client? That also seems like some very low-hanging fruit.


2013-12-03 15:50:31

by J. Bruce Fields

Subject: Re: [PATCH] nfsd: when reusing an existing repcache entry, unhash it first

On Mon, Dec 02, 2013 at 03:26:19PM -0500, Jeff Layton wrote:
> The DRC code will attempt to reuse an existing, expired cache entry in
> preference to allocating a new one. It'll then search the cache, and if
> it gets a hit it'll then free the cache entry that it was going to
> reuse.
>
> The cache code doesn't unhash the entry that it's going to reuse,
> however, so it's possible for it to end up designating an entry for
> reuse and then subsequently freeing the same entry after it finds it.
> This leads to a later use-after-free and usually some list corruption
> warnings or an oops.
>
> Fix this by simply unhashing the entry that we intend to reuse. That
> will mean that it's not findable via a search and should prevent this
> situation from occurring.

And that also makes it simpler to verify that prune_cache_entries()
isn't going to free rp, good.

Thanks, applying!

(But, note: I may not get things pushed out till next week as I don't
have as convenient a testing setup while I'm travelling this week.)

--b.

>
> Cc: [email protected] # v3.10+
> Reported-by: Christoph Hellwig <[email protected]>
> Reported-by: g. artim <[email protected]>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/nfscache.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> index 9186c7c..b6af150 100644
> --- a/fs/nfsd/nfscache.c
> +++ b/fs/nfsd/nfscache.c
> @@ -132,6 +132,13 @@ nfsd_reply_cache_alloc(void)
> }
>
> static void
> +nfsd_reply_cache_unhash(struct svc_cacherep *rp)
> +{
> + hlist_del_init(&rp->c_hash);
> + list_del_init(&rp->c_lru);
> +}
> +
> +static void
> nfsd_reply_cache_free_locked(struct svc_cacherep *rp)
> {
> if (rp->c_type == RC_REPLBUFF && rp->c_replvec.iov_base) {
> @@ -417,7 +424,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
> if (nfsd_cache_entry_expired(rp) ||
> num_drc_entries >= max_drc_entries) {
> - lru_put_end(rp);
> + nfsd_reply_cache_unhash(rp);
> prune_cache_entries();
> goto search_cache;
> }
> --
> 1.8.4.2
>

2013-12-04 13:45:29

by Jeff Layton

Subject: Re: [PATCH] nfsd: when reusing an existing repcache entry, unhash it first

On Wed, 4 Dec 2013 05:40:36 -0800
Christoph Hellwig <[email protected]> wrote:

> On Wed, Dec 04, 2013 at 08:31:01AM -0500, Jeff Layton wrote:
> > > Exactly. Btw, what about the comment at the top of the file that the
> > > DRC should be per-client? That also seems like some very low-hanging fruit.
> > >
> >
> > Ok, I'll see about spinning up a patchset that does that. We'll also
> > probably want to just allocate entries out of the slabcache in that
> > case, so that'll make things quite a bit simpler.
>
> FYI I'm testing a patch at the moment that just rips out the "direct
> reclaim" as an alternative to your patch, without actually redoing
> the whole LRU infrastructure. Unfortunately, with your patch, 075
> still hangs for me, although it at least doesn't crash.
>

Yeah, I've noticed the same hang, but hadn't been able to determine why
it was hanging. I suspect that hang is what tickles the bug that my
patch fixes. With the hang, we see the client doing retransmits and not
getting replies, and that means that we exercise the DRC more...

--
Jeff Layton <[email protected]>

2013-12-04 08:33:37

by Christoph Hellwig

Subject: Re: [PATCH] nfsd: when reusing an existing repcache entry, unhash it first

On Tue, Dec 03, 2013 at 01:21:12PM -0500, Jeff Layton wrote:
> Most of this code is protected by a single spinlock (cache_lock). The
> main benefit to switching to list_lru would be that we could move to a
> per-node lock. But to make that worthwhile, we'd need to
> redesign the locking and break the cache_lock into multiple locks.

No need to redo locks just yet; the biggest benefit is to use a
well-debugged library for LRU list handling, with the second biggest
one being that you get out-of-the-box shrinker support.

> Also, the existing code does take pains to reuse an expired entry off
> the LRU list in preference to allocating a new one. The list_lru code
> doesn't have a mechanism to scrape the first entry off the LRU list,
> though I suppose we could add one or abuse the cb_arg in the walk
> callback as a return pointer.

That's just because it is an old-school cache with a fixed upper bound
of entries and no shrinker support. With proper shrinker integration
this nasty code isn't needed.

2013-12-04 12:54:31

by Jeff Layton

Subject: Re: [PATCH] nfsd: when reusing an existing repcache entry, unhash it first

On Wed, 4 Dec 2013 00:33:36 -0800
Christoph Hellwig <[email protected]> wrote:

> On Tue, Dec 03, 2013 at 01:21:12PM -0500, Jeff Layton wrote:
> > Most of this code is protected by a single spinlock (cache_lock). The
> > main benefit to switching to list_lru would be that we could move to a
> > per-node lock. But to make that worthwhile, we'd need to
> > redesign the locking and break the cache_lock into multiple locks.
>
> No need to redo locks just yet; the biggest benefit is to use a
> well-debugged library for LRU list handling, with the second biggest
> one being that you get out-of-the-box shrinker support.
>
> > Also, the existing code does take pains to reuse an expired entry off
> > the LRU list in preference to allocating a new one. The list_lru code
> > doesn't have a mechanism to scrape the first entry off the LRU list,
> > though I suppose we could add one or abuse the cb_arg in the walk
> > callback as a return pointer.
>
> That's just because it is an old-school cache with a fixed upper bound
> of entries and no shrinker support. With proper shrinker integration
> this nasty code isn't needed.

Well, the old code just had a fixed number of entries (1024 or so?),
which was ridiculously small on a modern server. A busy server could blow
out the cache in less than the retransmit interval, which made it
effectively worthless.

The new code has a much larger upper bound, but it is still bounded by
the amount of low memory on the host. We have shrinker integration
(though its propriety might be in question).

Are you suggesting that we shouldn't try to bound the cache at all and
should instead just let it grow and rely on the shrinker to keep it in
check?

--
Jeff Layton <[email protected]>
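
The shrinker integration mentioned above is, in rough outline, a
count/scan pair in the style of the 3.12+ shrinker API. The sketch
below is simplified rather than copied from nfscache.c, and it assumes
prune_cache_entries() returns the number of entries it freed:

	#include <linux/shrinker.h>

	/* Report how many entries could be freed under pressure. */
	static unsigned long
	nfsd_reply_cache_count(struct shrinker *shrink, struct shrink_control *sc)
	{
		return num_drc_entries;
	}

	/* Free expired entries when the VM asks us to shrink. */
	static unsigned long
	nfsd_reply_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
	{
		unsigned long freed;

		spin_lock(&cache_lock);
		freed = prune_cache_entries();
		spin_unlock(&cache_lock);
		return freed;
	}

	static struct shrinker nfsd_reply_cache_shrinker = {
		.scan_objects	= nfsd_reply_cache_scan,
		.count_objects	= nfsd_reply_cache_count,
		.seeks		= 1,
	};

Registering that with register_shrinker() at cache init time is what
lets memory pressure, rather than a hard cap, keep the cache in check.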

2013-12-04 13:40:38

by Christoph Hellwig

Subject: Re: [PATCH] nfsd: when reusing an existing repcache entry, unhash it first

On Wed, Dec 04, 2013 at 08:31:01AM -0500, Jeff Layton wrote:
> > Exactly. Btw, what about the comment at the top of the file that the
> > DRC should be per-client? That also seems like some very low-hanging fruit.
> >
>
> Ok, I'll see about spinning up a patchset that does that. We'll also
> probably want to just allocate entries out of the slabcache in that
> case, so that'll make things quite a bit simpler.

FYI I'm testing a patch at the moment that just rips out the "direct
reclaim" as an alternative to your patch, without actually redoing
the whole LRU infrastructure. Unfortunately, with your patch, 075
still hangs for me, although it at least doesn't crash.


2013-12-04 13:31:27

by Jeff Layton

Subject: Re: [PATCH] nfsd: when reusing an existing repcache entry, unhash it first

On Wed, 4 Dec 2013 05:09:44 -0800
Christoph Hellwig <[email protected]> wrote:

> On Wed, Dec 04, 2013 at 07:54:02AM -0500, Jeff Layton wrote:
> > Are you suggesting that we shouldn't try to bound the cache at all and
> > should instead just let it grow and rely on the shrinker to keep it in
> > check?
>
> Exactly. Btw, what about the comment at the top of the file that the
> DRC should be per-client? That also seems like some very low-hanging fruit.
>

Ok, I'll see about spinning up a patchset that does that. We'll also
probably want to just allocate entries out of the slabcache in that
case, so that'll make things quite a bit simpler.

As far as making this a per-client cache...sounds great! Go for it!
--
Jeff Layton <[email protected]>

2013-12-03 10:25:19

by Christoph Hellwig

Subject: Re: [PATCH] nfsd: when reusing an existing repcache entry, unhash it first

On Mon, Dec 02, 2013 at 03:26:19PM -0500, Jeff Layton wrote:
> The DRC code will attempt to reuse an existing, expired cache entry in
> preference to allocating a new one. It'll then search the cache, and if
> it gets a hit it'll then free the cache entry that it was going to
> reuse.
>
> The cache code doesn't unhash the entry that it's going to reuse,
> however, so it's possible for it to end up designating an entry for
> reuse and then subsequently freeing the same entry after it finds it.
> This leads to a later use-after-free and usually some list corruption
> warnings or an oops.
>
> Fix this by simply unhashing the entry that we intend to reuse. That
> will mean that it's not findable via a search and should prevent this
> situation from occurring.

The fix looks reasonable to me,

Reviewed-by: Christoph Hellwig <[email protected]>

Btw, it seems like this code would benefit from being converted to
the list_lru structure.

2013-12-04 13:54:58

by Christoph Hellwig

Subject: Re: [PATCH] nfsd: when reusing an existing repcache entry, unhash it first

> Yeah, I've noticed the same hang, but hadn't been able to determine why
> it was hanging. I suspect that hang is what tickles the bug that my
> patch fixes. With the hang, we see the client doing retransmits and not
> getting replies, and that means that we exercise the DRC more...

FYI here is the one that just kills the silly direct reclaim. It also
fixes the oops, but I still see the hang:


diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 9186c7c..dd260a1 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -380,11 +380,8 @@ nfsd_cache_search(struct svc_rqst *rqstp, __wsum csum)
}

/*
- * Try to find an entry matching the current call in the cache. When none
- * is found, we try to grab the oldest expired entry off the LRU list. If
- * a suitable one isn't there, then drop the cache_lock and allocate a
- * new one, then search again in case one got inserted while this thread
- * didn't hold the lock.
+ * Try to find an entry matching the current call in the cache and if none is
+ * found allocate and insert a new one.
*/
int
nfsd_cache_lookup(struct svc_rqst *rqstp)
@@ -409,22 +406,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)

/*
* Since the common case is a cache miss followed by an insert,
- * preallocate an entry. First, try to reuse the first entry on the LRU
- * if it works, then go ahead and prune the LRU list.
+ * preallocate an entry.
*/
- spin_lock(&cache_lock);
- if (!list_empty(&lru_head)) {
- rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
- if (nfsd_cache_entry_expired(rp) ||
- num_drc_entries >= max_drc_entries) {
- lru_put_end(rp);
- prune_cache_entries();
- goto search_cache;
- }
- }
-
- /* No expired ones available, allocate a new one. */
- spin_unlock(&cache_lock);
rp = nfsd_reply_cache_alloc();
spin_lock(&cache_lock);
if (likely(rp)) {
@@ -432,7 +415,6 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
drc_mem_usage += sizeof(*rp);
}

-search_cache:
found = nfsd_cache_search(rqstp, csum);
if (found) {
if (likely(rp))
@@ -446,15 +428,6 @@ search_cache:
goto out;
}

- /*
- * We're keeping the one we just allocated. Are we now over the
- * limit? Prune one off the tip of the LRU in trade for the one we
- * just allocated if so.
- */
- if (num_drc_entries >= max_drc_entries)
- nfsd_reply_cache_free_locked(list_first_entry(&lru_head,
- struct svc_cacherep, c_lru));
-
nfsdstats.rcmisses++;
rqstp->rq_cacherep = rp;
rp->c_state = RC_INPROG;

2013-12-04 14:15:55

by Jeff Layton

Subject: Re: [PATCH] nfsd: when reusing an existing repcache entry, unhash it first

On Wed, 4 Dec 2013 05:54:57 -0800
Christoph Hellwig <[email protected]> wrote:

> > Yeah, I've noticed the same hang, but hadn't been able to determine why
> > it was hanging. I suspect that hang is what tickles the bug that my
> > patch fixes. With the hang, we see the client doing retransmits and not
> > getting replies, and that means that we exercise the DRC more...
>
> FYI here is the one that just kills the silly direct reclaim. It also
> fixes the oops, but I still see the hang:
>
>
> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> index 9186c7c..dd260a1 100644
> --- a/fs/nfsd/nfscache.c
> +++ b/fs/nfsd/nfscache.c
> @@ -380,11 +380,8 @@ nfsd_cache_search(struct svc_rqst *rqstp, __wsum csum)
> }
>
> /*
> - * Try to find an entry matching the current call in the cache. When none
> - * is found, we try to grab the oldest expired entry off the LRU list. If
> - * a suitable one isn't there, then drop the cache_lock and allocate a
> - * new one, then search again in case one got inserted while this thread
> - * didn't hold the lock.
> + * Try to find an entry matching the current call in the cache and if none is
> + * found allocate and insert a new one.
> */
> int
> nfsd_cache_lookup(struct svc_rqst *rqstp)
> @@ -409,22 +406,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
>
> /*
> * Since the common case is a cache miss followed by an insert,
> - * preallocate an entry. First, try to reuse the first entry on the LRU
> - * if it works, then go ahead and prune the LRU list.
> + * preallocate an entry.
> */
> - spin_lock(&cache_lock);
> - if (!list_empty(&lru_head)) {
> - rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
> - if (nfsd_cache_entry_expired(rp) ||
> - num_drc_entries >= max_drc_entries) {
> - lru_put_end(rp);
> - prune_cache_entries();
> - goto search_cache;
> - }
> - }
> -
> - /* No expired ones available, allocate a new one. */
> - spin_unlock(&cache_lock);
> rp = nfsd_reply_cache_alloc();
> spin_lock(&cache_lock);
> if (likely(rp)) {
> @@ -432,7 +415,6 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> drc_mem_usage += sizeof(*rp);
> }
>

It might be good to call prune_cache_entries() at this point.
Otherwise, this looks like it'll be fine...

> -search_cache:
> found = nfsd_cache_search(rqstp, csum);
> if (found) {
> if (likely(rp))
> @@ -446,15 +428,6 @@ search_cache:
> goto out;
> }
>
> - /*
> - * We're keeping the one we just allocated. Are we now over the
> - * limit? Prune one off the tip of the LRU in trade for the one we
> - * just allocated if so.
> - */
> - if (num_drc_entries >= max_drc_entries)
> - nfsd_reply_cache_free_locked(list_first_entry(&lru_head,
> - struct svc_cacherep, c_lru));
> -
> nfsdstats.rcmisses++;
> rqstp->rq_cacherep = rp;
> rp->c_state = RC_INPROG;


--
Jeff Layton <[email protected]>
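
In context, the suggestion amounts to something like the following on
top of the quoted hunk (illustrative and untested; the accounting
inside the if block is elided, as in the quote above):

	rp = nfsd_reply_cache_alloc();
	spin_lock(&cache_lock);
	if (likely(rp)) {
		/* ... entry accounting as in the hunk above ... */
		drc_mem_usage += sizeof(*rp);
	}

	/* suggested addition: trim expired entries while cache_lock is held */
	prune_cache_entries();

	found = nfsd_cache_search(rqstp, csum);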

2013-12-03 18:21:43

by Jeff Layton

Subject: Re: [PATCH] nfsd: when reusing an existing repcache entry, unhash it first

On Tue, 3 Dec 2013 02:25:17 -0800
Christoph Hellwig <[email protected]> wrote:

> On Mon, Dec 02, 2013 at 03:26:19PM -0500, Jeff Layton wrote:
> > The DRC code will attempt to reuse an existing, expired cache entry in
> > preference to allocating a new one. It'll then search the cache, and if
> > it gets a hit it'll then free the cache entry that it was going to
> > reuse.
> >
> > The cache code doesn't unhash the entry that it's going to reuse,
> > however, so it's possible for it to end up designating an entry for
> > reuse and then subsequently freeing the same entry after it finds it.
> > This leads to a later use-after-free and usually some list corruption
> > warnings or an oops.
> >
> > Fix this by simply unhashing the entry that we intend to reuse. That
> > will mean that it's not findable via a search and should prevent this
> > situation from occurring.
>
> The fix looks reasonable to me,
>
> Reviewed-by: Christoph Hellwig <[email protected]>
>

Thanks.

> Btw, it seems like this code would benefit from being converted to
> the list_lru structure.

Maybe.

Most of this code is protected by a single spinlock (cache_lock). The
main benefit to switching to list_lru would be that we could move to a
per-node lock. But to make that worthwhile, we'd need to
redesign the locking and break the cache_lock into multiple locks.

Also, the existing code does take pains to reuse an expired entry off
the LRU list in preference to allocating a new one. The list_lru code
doesn't have a mechanism to scrape the first entry off the LRU list,
though I suppose we could add one or abuse the cb_arg in the walk
callback as a return pointer.

I'm not sure the resulting code would be any clearer however...

Maybe when we get to the point that the cache_lock shows up as a
bottleneck, it'll make more sense?

--
Jeff Layton <[email protected]>
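
For what it's worth, the "abuse the cb_arg as a return pointer" idea
might look roughly like this against the 3.12-era list_lru interface;
nfsd_grab_oldest() and nfsd_reuse_oldest() are invented names for the
sketch, and the returned entry would still need to be unhashed, as in
the patch at the top of the thread:

	#include <linux/list_lru.h>

	/* Isolate callback: detach the entry and hand it back via cb_arg. */
	static enum lru_status
	nfsd_grab_oldest(struct list_head *item, spinlock_t *lock, void *cb_arg)
	{
		struct svc_cacherep **rpp = cb_arg;

		*rpp = list_entry(item, struct svc_cacherep, c_lru);
		list_del_init(item);
		return LRU_REMOVED;
	}

	static struct svc_cacherep *
	nfsd_reuse_oldest(struct list_lru *lru)
	{
		struct svc_cacherep *rp = NULL;

		/* look at no more than one entry off the LRU */
		list_lru_walk(lru, nfsd_grab_oldest, &rp, 1);
		return rp;
	}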

2013-12-04 17:06:05

by Christoph Hellwig

Subject: Re: [PATCH] nfsd: when reusing an existing repcache entry, unhash it first

On Wed, Dec 04, 2013 at 12:06:16PM -0500, Jeff Layton wrote:
> That is another tricky question. Currently, we size the number of hash
> buckets based on the max size of the cache. If we allow it to grow
> without bound, then how do we determine how many hash buckets we need?

It won't grow without bound as long as it's trimmed back below the
maximum size by both the periodic workqueue and the shrinker.

Of course replacing the hash with an autoscaling data structure would be
useful, but given that 4.1 obsoletes the DRC I'm not sure how much
effort we want to put into it.