2013-02-04 13:18:13

by Jeff Layton

Subject: [PATCH v2 0/8] nfsd: duplicate reply cache overhaul

This is the second posting of the remaining unmerged patches in this
set. There are a number of differences from the first set:

- The bug in the checksum patch has been fixed.

- A hard cap on the number of DRC entries is retained, but it's larger
than the original cap, and scales with the amount of low memory in
the machine.

- A shrinker is still registered, but it will now only free entries
that are expired or that put the cache over the max number of entries.

Our QA group has been reporting occasional testing failures, on and off
for the last several years, especially on UDP. When we go to look at
traces, we see a missing reply from the server on a non-idempotent
request. The client then retransmits the request, and the server tries
to redo it instead of just replaying the DRC entry.

With an instrumented kernel on the server and a synthetic reproducer, we
found that it's quite easy to hammer the server so fast that DRC entries
get flushed out long before a retransmit can come in.

This patchset is a first pass at fixing this. Instead of simply keeping
a cache of the last 1024 entries, it allows nfsd to grow and shrink the
DRC dynamically.

While most of us will probably say "so what" when it comes to UDP
failures, it's a potential problem on connected transports as well. I'm
also inclined to try to fix things that screw up the people who are
helping us test our code.

I'd like to see this merged for 3.9 if possible...

Jeff Layton (8):
nfsd: always move DRC entries to the end of LRU list when updating
timestamp
nfsd: track the number of DRC entries in the cache
nfsd: dynamically allocate DRC entries
nfsd: remove the cache_disabled flag
nfsd: when updating an entry with RC_NOCACHE, just free it
nfsd: add recurring workqueue job to clean the cache
nfsd: register a shrinker for DRC cache entries
nfsd: keep a checksum of the first 256 bytes of request

fs/nfsd/cache.h | 5 +
fs/nfsd/nfscache.c | 271 ++++++++++++++++++++++++++++++++++++++++-------------
2 files changed, 209 insertions(+), 67 deletions(-)

--
1.7.11.7



2013-02-04 15:54:21

by J. Bruce Fields

Subject: Re: [PATCH v2 8/8] nfsd: keep a checksum of the first 256 bytes of request

On Mon, Feb 04, 2013 at 08:18:07AM -0500, Jeff Layton wrote:
> Now that we're allowing more DRC entries, it becomes a lot easier to hit
> problems with XID collisions. In order to mitigate those, calculate the
> crc32 of up to the first 256 bytes of each request coming in and store
> that in the cache entry, along with the total length of the request.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/cache.h | 5 +++++
> fs/nfsd/nfscache.c | 44 ++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 45 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
> index 9c7232b..4822db3 100644
> --- a/fs/nfsd/cache.h
> +++ b/fs/nfsd/cache.h
> @@ -29,6 +29,8 @@ struct svc_cacherep {
> u32 c_prot;
> u32 c_proc;
> u32 c_vers;
> + unsigned int c_len;
> + u32 c_crc;
> unsigned long c_timestamp;
> union {
> struct kvec u_vec;
> @@ -73,6 +75,9 @@ enum {
> /* Cache entries expire after this time period */
> #define RC_EXPIRE (120 * HZ)
>
> +/* Checksum this amount of the request */
> +#define RC_CSUMLEN (256U)
> +
> int nfsd_reply_cache_init(void);
> void nfsd_reply_cache_shutdown(void);
> int nfsd_cache_lookup(struct svc_rqst *);
> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> index d16a5d6..cb655f3 100644
> --- a/fs/nfsd/nfscache.c
> +++ b/fs/nfsd/nfscache.c
> @@ -11,6 +11,7 @@
> #include <linux/slab.h>
> #include <linux/sunrpc/clnt.h>
> #include <linux/highmem.h>
> +#include <linux/crc32.h>
>
> #include "nfsd.h"
> #include "cache.h"
> @@ -24,6 +25,7 @@ static struct list_head lru_head;
> static struct kmem_cache *drc_slab;
> static unsigned int num_drc_entries;
> static unsigned int max_drc_entries;
> +static u32 crc_seed;
>
> /*
> * Calculate the hash index from an XID.
> @@ -130,6 +132,9 @@ int nfsd_reply_cache_init(void)
> INIT_LIST_HEAD(&lru_head);
> max_drc_entries = nfsd_cache_size_limit();
> num_drc_entries = 0;
> +
> + /* Is a random seed any better than some well-defined constant? */
> + get_random_bytes(&crc_seed, sizeof(crc_seed));
> return 0;
> out_nomem:
> printk(KERN_ERR "nfsd: failed to allocate reply cache\n");
> @@ -238,12 +243,37 @@ nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
> }
>
> /*
> + * Walk an xdr_buf and get a CRC for at most the first RC_CSUMLEN bytes
> + */
> +static u32
> +nfsd_cache_crc(struct xdr_buf *buf)
> +{
> + u32 crc;
> + const unsigned char *p = buf->head[0].iov_base;
> + size_t csum_len = min_t(size_t, buf->head[0].iov_len + buf->page_len,
> + RC_CSUMLEN);
> + size_t len = min(buf->head[0].iov_len, csum_len);
> +
> + /* rq_arg.head first */
> + crc = crc32(crc_seed, p, len);
> + csum_len -= len;
> +
> + /* Nothing left */
> + if (!csum_len)
> + return crc;
> +
> + /* checksum the rest from the page_array */
> + p = page_address(buf->pages[0]) + buf->page_base;

If buf->page_base is large (close to PAGE_SIZE), then this reads past
the end of the page when it should continue on to the next page.

In practice page_base is always 0 here, and I think it's unlikely that
will change. But it would be worth a comment. (Or maybe even a
WARN_ON_ONCE(buf->page_base).)
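
Something like this (untested) just before the page_address() call
would catch it:

	/* we don't handle a non-zero page_base here (yet) */
	WARN_ON_ONCE(buf->page_base);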

> + return crc32(crc, p, csum_len);
> +}
> +
> +/*
> * Search the request hash for an entry that matches the given rqstp.
> * Must be called with cache_lock held. Returns the found entry or
> * NULL on failure.
> */
> static struct svc_cacherep *
> -nfsd_cache_search(struct svc_rqst *rqstp)
> +nfsd_cache_search(struct svc_rqst *rqstp, u32 crc)
> {
> struct svc_cacherep *rp;
> struct hlist_node *hn;
> @@ -257,6 +287,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)
> hlist_for_each_entry(rp, hn, rh, c_hash) {
> if (xid == rp->c_xid && proc == rp->c_proc &&
> proto == rp->c_prot && vers == rp->c_vers &&
> + rqstp->rq_arg.len == rp->c_len && crc == rp->c_crc &&
> rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
> rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr))
> return rp;
> @@ -276,7 +307,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> __be32 xid = rqstp->rq_xid;
> u32 proto = rqstp->rq_prot,
> vers = rqstp->rq_vers,
> - proc = rqstp->rq_proc;
> + proc = rqstp->rq_proc,
> + crc;
> unsigned long age;
> int type = rqstp->rq_cachetype;
> int rtn;
> @@ -287,10 +319,12 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> return RC_DOIT;
> }
>
> + crc = nfsd_cache_crc(&rqstp->rq_arg);
> +

For a moment I was wondering whether we should delay calculating that
till we need it--but of course we need it in all cases but allocation
failure (either to match an existing entry or populate a new one). OK!

Looks fine.--b.

> spin_lock(&cache_lock);
> rtn = RC_DOIT;
>
> - rp = nfsd_cache_search(rqstp);
> + rp = nfsd_cache_search(rqstp, crc);
> if (rp)
> goto found_entry;
>
> @@ -318,7 +352,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> * Must search again just in case someone inserted one
> * after we dropped the lock above.
> */
> - found = nfsd_cache_search(rqstp);
> + found = nfsd_cache_search(rqstp, crc);
> if (found) {
> nfsd_reply_cache_free_locked(rp);
> rp = found;
> @@ -344,6 +378,8 @@ setup_entry:
> rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
> rp->c_prot = proto;
> rp->c_vers = vers;
> + rp->c_len = rqstp->rq_arg.len;
> + rp->c_crc = crc;
>
> hash_refile(rp);
> lru_put_end(rp);
> --
> 1.7.11.7
>

2013-02-04 20:20:47

by J. Bruce Fields

Subject: Re: [PATCH v2 8/8] nfsd: keep a checksum of the first 256 bytes of request

On Mon, Feb 04, 2013 at 08:18:07AM -0500, Jeff Layton wrote:
> @@ -238,12 +243,37 @@ nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
> }
>
> /*
> + * Walk an xdr_buf and get a CRC for at most the first RC_CSUMLEN bytes
> + */
> +static u32
> +nfsd_cache_crc(struct xdr_buf *buf)
> +{
> + u32 crc;
> + const unsigned char *p = buf->head[0].iov_base;
> + size_t csum_len = min_t(size_t, buf->head[0].iov_len + buf->page_len,
> + RC_CSUMLEN);
> + size_t len = min(buf->head[0].iov_len, csum_len);
> +
> + /* rq_arg.head first */
> + crc = crc32(crc_seed, p, len);
> + csum_len -= len;

I'm getting a RPLY14 failure from pynfs --security=krb5i.

I suspect what's happening here is that the data you're checksumming
over includes the gss sequence number and the krb5i integrity checksum.
Both those change, even on resends, to prevent an attacker from doing
something nefarious by resending an old rpc.

I think we really want to checksum just over the nfs-level data. Our
checks for xid, program number, etc., already cover most of the rpc
header anyway.

--b.

> +
> + /* Nothing left */
> + if (!csum_len)
> + return crc;
> +
> + /* checksum the rest from the page_array */
> + p = page_address(buf->pages[0]) + buf->page_base;
> + return crc32(crc, p, csum_len);
> +}
> +
> +/*
> * Search the request hash for an entry that matches the given rqstp.
> * Must be called with cache_lock held. Returns the found entry or
> * NULL on failure.
> */
> static struct svc_cacherep *
> -nfsd_cache_search(struct svc_rqst *rqstp)
> +nfsd_cache_search(struct svc_rqst *rqstp, u32 crc)
> {
> struct svc_cacherep *rp;
> struct hlist_node *hn;
> @@ -257,6 +287,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)
> hlist_for_each_entry(rp, hn, rh, c_hash) {
> if (xid == rp->c_xid && proc == rp->c_proc &&
> proto == rp->c_prot && vers == rp->c_vers &&
> + rqstp->rq_arg.len == rp->c_len && crc == rp->c_crc &&
> rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
> rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr))
> return rp;
> @@ -276,7 +307,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> __be32 xid = rqstp->rq_xid;
> u32 proto = rqstp->rq_prot,
> vers = rqstp->rq_vers,
> - proc = rqstp->rq_proc;
> + proc = rqstp->rq_proc,
> + crc;
> unsigned long age;
> int type = rqstp->rq_cachetype;
> int rtn;
> @@ -287,10 +319,12 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> return RC_DOIT;
> }
>
> + crc = nfsd_cache_crc(&rqstp->rq_arg);
> +
> spin_lock(&cache_lock);
> rtn = RC_DOIT;
>
> - rp = nfsd_cache_search(rqstp);
> + rp = nfsd_cache_search(rqstp, crc);
> if (rp)
> goto found_entry;
>
> @@ -318,7 +352,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> * Must search again just in case someone inserted one
> * after we dropped the lock above.
> */
> - found = nfsd_cache_search(rqstp);
> + found = nfsd_cache_search(rqstp, crc);
> if (found) {
> nfsd_reply_cache_free_locked(rp);
> rp = found;
> @@ -344,6 +378,8 @@ setup_entry:
> rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
> rp->c_prot = proto;
> rp->c_vers = vers;
> + rp->c_len = rqstp->rq_arg.len;
> + rp->c_crc = crc;
>
> hash_refile(rp);
> lru_put_end(rp);
> --
> 1.7.11.7
>

2013-02-04 16:16:24

by Jeff Layton

Subject: Re: [PATCH v2 8/8] nfsd: keep a checksum of the first 256 bytes of request

On Mon, 4 Feb 2013 10:54:20 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Feb 04, 2013 at 08:18:07AM -0500, Jeff Layton wrote:
> > Now that we're allowing more DRC entries, it becomes a lot easier to hit
> > problems with XID collisions. In order to mitigate those, calculate the
> > crc32 of up to the first 256 bytes of each request coming in and store
> > that in the cache entry, along with the total length of the request.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/nfsd/cache.h | 5 +++++
> > fs/nfsd/nfscache.c | 44 ++++++++++++++++++++++++++++++++++++++++----
> > 2 files changed, 45 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
> > index 9c7232b..4822db3 100644
> > --- a/fs/nfsd/cache.h
> > +++ b/fs/nfsd/cache.h
> > @@ -29,6 +29,8 @@ struct svc_cacherep {
> > u32 c_prot;
> > u32 c_proc;
> > u32 c_vers;
> > + unsigned int c_len;
> > + u32 c_crc;
> > unsigned long c_timestamp;
> > union {
> > struct kvec u_vec;
> > @@ -73,6 +75,9 @@ enum {
> > /* Cache entries expire after this time period */
> > #define RC_EXPIRE (120 * HZ)
> >
> > +/* Checksum this amount of the request */
> > +#define RC_CSUMLEN (256U)
> > +
> > int nfsd_reply_cache_init(void);
> > void nfsd_reply_cache_shutdown(void);
> > int nfsd_cache_lookup(struct svc_rqst *);
> > diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> > index d16a5d6..cb655f3 100644
> > --- a/fs/nfsd/nfscache.c
> > +++ b/fs/nfsd/nfscache.c
> > @@ -11,6 +11,7 @@
> > #include <linux/slab.h>
> > #include <linux/sunrpc/clnt.h>
> > #include <linux/highmem.h>
> > +#include <linux/crc32.h>
> >
> > #include "nfsd.h"
> > #include "cache.h"
> > @@ -24,6 +25,7 @@ static struct list_head lru_head;
> > static struct kmem_cache *drc_slab;
> > static unsigned int num_drc_entries;
> > static unsigned int max_drc_entries;
> > +static u32 crc_seed;
> >
> > /*
> > * Calculate the hash index from an XID.
> > @@ -130,6 +132,9 @@ int nfsd_reply_cache_init(void)
> > INIT_LIST_HEAD(&lru_head);
> > max_drc_entries = nfsd_cache_size_limit();
> > num_drc_entries = 0;
> > +
> > + /* Is a random seed any better than some well-defined constant? */
> > + get_random_bytes(&crc_seed, sizeof(crc_seed));
> > return 0;
> > out_nomem:
> > printk(KERN_ERR "nfsd: failed to allocate reply cache\n");
> > @@ -238,12 +243,37 @@ nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
> > }
> >
> > /*
> > + * Walk an xdr_buf and get a CRC for at most the first RC_CSUMLEN bytes
> > + */
> > +static u32
> > +nfsd_cache_crc(struct xdr_buf *buf)
> > +{
> > + u32 crc;
> > + const unsigned char *p = buf->head[0].iov_base;
> > + size_t csum_len = min_t(size_t, buf->head[0].iov_len + buf->page_len,
> > + RC_CSUMLEN);
> > + size_t len = min(buf->head[0].iov_len, csum_len);
> > +
> > + /* rq_arg.head first */
> > + crc = crc32(crc_seed, p, len);
> > + csum_len -= len;
> > +
> > + /* Nothing left */
> > + if (!csum_len)
> > + return crc;
> > +
> > + /* checksum the rest from the page_array */
> > + p = page_address(buf->pages[0]) + buf->page_base;
>
> If buf->page_base is large (close to PAGE_SIZE), then this reads past
> the end of the page when it should continue on to the next page.
>
> In practice page_base is always 0 here, and I think it's unlikely that
> will change. But it would be worth a comment. (Or maybe even a
> WARN_ON_ONCE(buf->page_base).)
>

When I looked at the rpc_rqst definition, it said:

struct page ** pages; /* Array of contiguous pages */

...but now that I look at svc_alloc_arg, I see that they aren't
necessarily contiguous. I'd probably feel more comfortable fixing this
up to be generally correct in the event that page_base is ever non-zero.

Perhaps I can just respin this patch to account for that possibility?

> > + return crc32(crc, p, csum_len);
> > +}
> > +
> > +/*
> > * Search the request hash for an entry that matches the given rqstp.
> > * Must be called with cache_lock held. Returns the found entry or
> > * NULL on failure.
> > */
> > static struct svc_cacherep *
> > -nfsd_cache_search(struct svc_rqst *rqstp)
> > +nfsd_cache_search(struct svc_rqst *rqstp, u32 crc)
> > {
> > struct svc_cacherep *rp;
> > struct hlist_node *hn;
> > @@ -257,6 +287,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)
> > hlist_for_each_entry(rp, hn, rh, c_hash) {
> > if (xid == rp->c_xid && proc == rp->c_proc &&
> > proto == rp->c_prot && vers == rp->c_vers &&
> > + rqstp->rq_arg.len == rp->c_len && crc == rp->c_crc &&
> > rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
> > rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr))
> > return rp;
> > @@ -276,7 +307,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> > __be32 xid = rqstp->rq_xid;
> > u32 proto = rqstp->rq_prot,
> > vers = rqstp->rq_vers,
> > - proc = rqstp->rq_proc;
> > + proc = rqstp->rq_proc,
> > + crc;
> > unsigned long age;
> > int type = rqstp->rq_cachetype;
> > int rtn;
> > @@ -287,10 +319,12 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> > return RC_DOIT;
> > }
> >
> > + crc = nfsd_cache_crc(&rqstp->rq_arg);
> > +
>
> For a moment I was wondering whether we should delay calculating that
> till we need it--but of course we need it in all cases but allocation
> failure (either to match an existing entry or populate a new one). OK!
>
> Looks fine.--b.
>

Correct, and by doing it early, we can keep that outside the spinlock.

> > spin_lock(&cache_lock);
> > rtn = RC_DOIT;
> >
> > - rp = nfsd_cache_search(rqstp);
> > + rp = nfsd_cache_search(rqstp, crc);
> > if (rp)
> > goto found_entry;
> >
> > @@ -318,7 +352,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> > * Must search again just in case someone inserted one
> > * after we dropped the lock above.
> > */
> > - found = nfsd_cache_search(rqstp);
> > + found = nfsd_cache_search(rqstp, crc);
> > if (found) {
> > nfsd_reply_cache_free_locked(rp);
> > rp = found;
> > @@ -344,6 +378,8 @@ setup_entry:
> > rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
> > rp->c_prot = proto;
> > rp->c_vers = vers;
> > + rp->c_len = rqstp->rq_arg.len;
> > + rp->c_crc = crc;
> >
> > hash_refile(rp);
> > lru_put_end(rp);
> > --
> > 1.7.11.7
> >


--
Jeff Layton <[email protected]>

2013-02-05 15:58:10

by Jeff Layton

Subject: Re: [PATCH v2 0/8] nfsd: duplicate reply cache overhaul

On Tue, 5 Feb 2013 10:15:05 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Feb 04, 2013 at 08:17:59AM -0500, Jeff Layton wrote:
> > This patchset is a first pass at fixing this. Instead of simply keeping
> > a cache of the last 1024 entries, it allows nfsd to grow and shrink the
> > DRC dynamically.
>
> One other thing I think we should try is to organize the cache per
> client address and evict cache entries from clients with more entries
> first, on the theory that:
>
> - There are diminishing returns from keeping huge numbers of
> entries from a single client: a client that has sent us lots
> of new requests is likely to have processed replies to the
> older ones.
>

Ok, but you'd need to be careful here. You don't want to drop any entry
before the client has had a chance to retransmit the request, but we
don't have a way to know what timeo= value the client is using.

> - We should still try to hang on to a few of the most recent
> entries from a client that hasn't made any requests recently,
> because such a client may have just temporarily lost contact,
> in which case it's particularly likely to need to retry a
> request, and we don't want its entries pushed out by clients
> that remain active.
>

Sounds tricky to get the heuristics right, but interesting. Maybe we
can try to hash them out while we're at cthon?

--
Jeff Layton <[email protected]>

2013-02-04 18:18:54

by J. Bruce Fields

Subject: Re: [PATCH 9/8] nfsd: handle arbitrary page array layouts in nfsd_cache_crc

On Mon, Feb 04, 2013 at 01:07:38PM -0500, Jeff Layton wrote:
> The original code didn't handle the case where the page_base was close
> to the end of the page. In practice, page_base is always 0 on receive,
> so it's not a problem today, but let's future-proof this in the event
> that it ever is.

OK--b.

>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/nfscache.c | 21 ++++++++++++++-------
> 1 file changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> index cb655f3..e705fea 100644
> --- a/fs/nfsd/nfscache.c
> +++ b/fs/nfsd/nfscache.c
> @@ -248,6 +248,8 @@ nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
> static u32
> nfsd_cache_crc(struct xdr_buf *buf)
> {
> + int idx;
> + unsigned int base;
> u32 crc;
> const unsigned char *p = buf->head[0].iov_base;
> size_t csum_len = min_t(size_t, buf->head[0].iov_len + buf->page_len,
> @@ -258,13 +260,18 @@ nfsd_cache_crc(struct xdr_buf *buf)
> crc = crc32(crc_seed, p, len);
> csum_len -= len;
>
> - /* Nothing left */
> - if (!csum_len)
> - return crc;
> -
> - /* checksum the rest from the page_array */
> - p = page_address(buf->pages[0]) + buf->page_base;
> - return crc32(crc, p, csum_len);
> + /* Continue into page array */
> + idx = buf->page_base / PAGE_SIZE;
> + base = buf->page_base & ~PAGE_MASK;
> + while (csum_len) {
> + p = page_address(buf->pages[idx]) + base;
> + len = min(PAGE_SIZE - base, csum_len);
> + crc = crc32(crc, p, len);
> + csum_len -= len;
> + base = 0;
> + ++idx;
> + }
> + return crc;
> }
>
> /*
> --
> 1.7.11.7
>

2013-02-04 18:07:45

by Jeff Layton

Subject: [PATCH 9/8] nfsd: handle arbitrary page array layouts in nfsd_cache_crc

The original code didn't handle the case where the page_base was close
to the end of the page. In practice, page_base is always 0 on receive,
so it's not a problem today, but let's future-proof this in the event
that it ever is.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 21 ++++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index cb655f3..e705fea 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -248,6 +248,8 @@ nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
static u32
nfsd_cache_crc(struct xdr_buf *buf)
{
+ int idx;
+ unsigned int base;
u32 crc;
const unsigned char *p = buf->head[0].iov_base;
size_t csum_len = min_t(size_t, buf->head[0].iov_len + buf->page_len,
@@ -258,13 +260,18 @@ nfsd_cache_crc(struct xdr_buf *buf)
crc = crc32(crc_seed, p, len);
csum_len -= len;

- /* Nothing left */
- if (!csum_len)
- return crc;
-
- /* checksum the rest from the page_array */
- p = page_address(buf->pages[0]) + buf->page_base;
- return crc32(crc, p, csum_len);
+ /* Continue into page array */
+ idx = buf->page_base / PAGE_SIZE;
+ base = buf->page_base & ~PAGE_MASK;
+ while (csum_len) {
+ p = page_address(buf->pages[idx]) + base;
+ len = min(PAGE_SIZE - base, csum_len);
+ crc = crc32(crc, p, len);
+ csum_len -= len;
+ base = 0;
+ ++idx;
+ }
+ return crc;
}

/*
--
1.7.11.7


2013-02-05 15:51:47

by Jeff Layton

Subject: Re: [PATCH v2 8/8] nfsd: keep a checksum of the first 256 bytes of request

On Tue, 5 Feb 2013 09:55:47 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Feb 04, 2013 at 03:20:46PM -0500, J. Bruce Fields wrote:
> > On Mon, Feb 04, 2013 at 08:18:07AM -0500, Jeff Layton wrote:
> > > @@ -238,12 +243,37 @@ nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
> > > }
> > >
> > > /*
> > > + * Walk an xdr_buf and get a CRC for at most the first RC_CSUMLEN bytes
> > > + */
> > > +static u32
> > > +nfsd_cache_crc(struct xdr_buf *buf)
> > > +{
> > > + u32 crc;
> > > + const unsigned char *p = buf->head[0].iov_base;
> > > + size_t csum_len = min_t(size_t, buf->head[0].iov_len + buf->page_len,
> > > + RC_CSUMLEN);
> > > + size_t len = min(buf->head[0].iov_len, csum_len);
> > > +
> > > + /* rq_arg.head first */
> > > + crc = crc32(crc_seed, p, len);
> > > + csum_len -= len;
> >
> > I'm getting a RPLY14 failure from pynfs --security=krb5i.
> >
> > I suspect what's happening here is that the data you're checksumming
> > over includes the gss sequence number and the krb5i integrity checksum.
> > Both those change, even on resends, to prevent an attacker from doing
> > something nefarious by resending an old rpc.
> >
> > I think we really want to checksum just over the nfs-level data. Our
> > checks for xid, program number, etc., already cover most of the rpc
> > header anyway.
>
> I've dropped this for now, but applied the previous patches.
>

Thanks -- this is a thorny problem to solve it seems...

The problem seems to be the checksum at the end of the NFS data in the
krb5i case. There's some similar stuff at the end of a decrypted krb5p
request too, but I'm not yet clear on what that is...

It would be nice if the RPC layer would somehow inform us of the length
of the "real" nfs data, but I don't think it does that currently.

--
Jeff Layton <[email protected]>

2013-02-04 13:18:28

by Jeff Layton

Subject: [PATCH v2 7/8] nfsd: register a shrinker for DRC cache entries

Since we dynamically allocate them now, allow the system to call us up
to release them if it gets low on memory. Since these entries aren't
replaceable, only free ones that are expired or that are over the cap.
The seeks value is set to '1', however, to indicate that freeing these
entries is low-cost.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index d7b088b..d16a5d6 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -37,6 +37,13 @@ static inline u32 request_hash(u32 xid)

static int nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *vec);
static void cache_cleaner_func(struct work_struct *unused);
+static int nfsd_reply_cache_shrink(struct shrinker *shrink,
+ struct shrink_control *sc);
+
+struct shrinker nfsd_reply_cache_shrinker = {
+ .shrink = nfsd_reply_cache_shrink,
+ .seeks = 1,
+};

/*
* locking for the reply cache:
@@ -110,6 +117,7 @@ nfsd_reply_cache_free(struct svc_cacherep *rp)

int nfsd_reply_cache_init(void)
{
+ register_shrinker(&nfsd_reply_cache_shrinker);
drc_slab = kmem_cache_create("nfsd_drc", sizeof(struct svc_cacherep),
0, 0, NULL);
if (!drc_slab)
@@ -133,6 +141,7 @@ void nfsd_reply_cache_shutdown(void)
{
struct svc_cacherep *rp;

+ unregister_shrinker(&nfsd_reply_cache_shrinker);
cancel_delayed_work_sync(&cache_cleaner);

while (!list_empty(&lru_head)) {
@@ -214,6 +223,20 @@ cache_cleaner_func(struct work_struct *unused)
spin_unlock(&cache_lock);
}

+static int
+nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
+{
+ unsigned int num;
+
+ spin_lock(&cache_lock);
+ if (sc->nr_to_scan)
+ prune_cache_entries();
+ num = num_drc_entries;
+ spin_unlock(&cache_lock);
+
+ return num;
+}
+
/*
* Search the request hash for an entry that matches the given rqstp.
* Must be called with cache_lock held. Returns the found entry or
--
1.7.11.7


2013-02-05 15:15:06

by J. Bruce Fields

Subject: Re: [PATCH v2 0/8] nfsd: duplicate reply cache overhaul

On Mon, Feb 04, 2013 at 08:17:59AM -0500, Jeff Layton wrote:
> This patchset is a first pass at fixing this. Instead of simply keeping
> a cache of the last 1024 entries, it allows nfsd to grow and shrink the
> DRC dynamically.

One other thing I think we should try is to organize the cache per
client address and evict cache entries from clients with more entries
first, on the theory that:

- There are diminishing returns from keeping huge numbers of
entries from a single client: a client that has sent us lots
of new requests is likely to have processed replies to the
older ones.

- We should still try to hang on to a few of the most recent
entries from a client that hasn't made any requests recently,
because such a client may have just temporarily lost contact,
in which case it's particularly likely to need to retry a
request, and we don't want its entries pushed out by clients
that remain active.
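
I haven't thought hard about the data structures yet, but as a rough
sketch (all names invented), each entry could hang off a per-client
bucket that counts how many entries that client currently has:

	struct drc_client {
		struct sockaddr_storage	cl_addr;     /* client address */
		unsigned int		cl_nentries; /* entries for this client */
		struct list_head	cl_lru;      /* this client's entries */
		struct list_head	cl_all;      /* link in global client list */
	};

The pruner would then take entries from the tail of the cl_lru of
whichever client has the largest cl_nentries, while always leaving each
client its few most recent entries.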

--b.

2013-02-05 14:55:50

by J. Bruce Fields

Subject: Re: [PATCH v2 8/8] nfsd: keep a checksum of the first 256 bytes of request

On Mon, Feb 04, 2013 at 03:20:46PM -0500, J. Bruce Fields wrote:
> On Mon, Feb 04, 2013 at 08:18:07AM -0500, Jeff Layton wrote:
> > @@ -238,12 +243,37 @@ nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
> > }
> >
> > /*
> > + * Walk an xdr_buf and get a CRC for at most the first RC_CSUMLEN bytes
> > + */
> > +static u32
> > +nfsd_cache_crc(struct xdr_buf *buf)
> > +{
> > + u32 crc;
> > + const unsigned char *p = buf->head[0].iov_base;
> > + size_t csum_len = min_t(size_t, buf->head[0].iov_len + buf->page_len,
> > + RC_CSUMLEN);
> > + size_t len = min(buf->head[0].iov_len, csum_len);
> > +
> > + /* rq_arg.head first */
> > + crc = crc32(crc_seed, p, len);
> > + csum_len -= len;
>
> I'm getting a RPLY14 failure from pynfs --security=krb5i.
>
> I suspect what's happening here is that the data you're checksumming
> over includes the gss sequence number and the krb5i integrity checksum.
> Both those change, even on resends, to prevent an attacker from doing
> something nefarious by resending an old rpc.
>
> I think we really want to checksum just over the nfs-level data. Our
> checks for xid, program number, etc., already cover most of the rpc
> header anyway.

I've dropped this for now, but applied the previous patches.

--b.

2013-02-04 13:18:16

by Jeff Layton

Subject: [PATCH v2 2/8] nfsd: track the number of DRC entries in the cache

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 9d80dfa..c0c5847 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -27,6 +27,7 @@ static struct hlist_head * cache_hash;
static struct list_head lru_head;
static int cache_disabled = 1;
static struct kmem_cache *drc_slab;
+static unsigned int num_drc_entries;

/*
* Calculate the hash index from an XID.
@@ -68,6 +69,7 @@ nfsd_reply_cache_free_locked(struct svc_cacherep *rp)
if (rp->c_type == RC_REPLBUFF)
kfree(rp->c_replvec.iov_base);
list_del(&rp->c_lru);
+ --num_drc_entries;
kmem_cache_free(drc_slab, rp);
}

@@ -83,10 +85,12 @@ int nfsd_reply_cache_init(void)

INIT_LIST_HEAD(&lru_head);
i = CACHESIZE;
+ num_drc_entries = 0;
while (i) {
rp = nfsd_reply_cache_alloc();
if (!rp)
goto out_nomem;
+ ++num_drc_entries;
list_add(&rp->c_lru, &lru_head);
i--;
}
--
1.7.11.7


2013-02-04 13:18:21

by Jeff Layton

Subject: [PATCH v2 4/8] nfsd: remove the cache_disabled flag

With the change to dynamically allocate entries, the cache is never
disabled on the fly. Remove this flag.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index d213e6e..69d29d4 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -21,7 +21,6 @@

static struct hlist_head * cache_hash;
static struct list_head lru_head;
-static int cache_disabled = 1;
static struct kmem_cache *drc_slab;
static unsigned int num_drc_entries;
static unsigned int max_drc_entries;
@@ -113,7 +112,6 @@ int nfsd_reply_cache_init(void)
INIT_LIST_HEAD(&lru_head);
max_drc_entries = nfsd_cache_size_limit();
num_drc_entries = 0;
- cache_disabled = 0;
return 0;
out_nomem:
printk(KERN_ERR "nfsd: failed to allocate reply cache\n");
@@ -130,8 +128,6 @@ void nfsd_reply_cache_shutdown(void)
nfsd_reply_cache_free_locked(rp);
}

- cache_disabled = 1;
-
kfree (cache_hash);
cache_hash = NULL;

@@ -215,7 +211,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
int rtn;

rqstp->rq_cacherep = NULL;
- if (cache_disabled || type == RC_NOCACHE) {
+ if (type == RC_NOCACHE) {
nfsdstats.rcnocache++;
return RC_DOIT;
}
@@ -345,11 +341,11 @@ found_entry:
void
nfsd_cache_update(struct svc_rqst *rqstp, int cachetype, __be32 *statp)
{
- struct svc_cacherep *rp;
+ struct svc_cacherep *rp = rqstp->rq_cacherep;
struct kvec *resv = &rqstp->rq_res.head[0], *cachv;
int len;

- if (!(rp = rqstp->rq_cacherep) || cache_disabled)
+ if (!rp)
return;

len = resv->iov_len - ((char*)statp - (char*)resv->iov_base);
--
1.7.11.7


2013-02-04 13:18:23

by Jeff Layton

Subject: [PATCH v2 5/8] nfsd: when updating an entry with RC_NOCACHE, just free it

There's no need to keep entries around that we're declaring RC_NOCACHE.
Ditto if there's a problem with the entry.

With this change too, there's no need to test for RC_UNUSED in the
search function. If the entry's in the hash table then it's either
INPROG or DONE.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 69d29d4..e8ea785 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -98,6 +98,14 @@ nfsd_reply_cache_free_locked(struct svc_cacherep *rp)
kmem_cache_free(drc_slab, rp);
}

+static void
+nfsd_reply_cache_free(struct svc_cacherep *rp)
+{
+ spin_lock(&cache_lock);
+ nfsd_reply_cache_free_locked(rp);
+ spin_unlock(&cache_lock);
+}
+
int nfsd_reply_cache_init(void)
{
drc_slab = kmem_cache_create("nfsd_drc", sizeof(struct svc_cacherep),
@@ -182,8 +190,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)

rh = &cache_hash[request_hash(xid)];
hlist_for_each_entry(rp, hn, rh, c_hash) {
- if (rp->c_state != RC_UNUSED &&
- xid == rp->c_xid && proc == rp->c_proc &&
+ if (xid == rp->c_xid && proc == rp->c_proc &&
proto == rp->c_prot && vers == rp->c_vers &&
!nfsd_cache_entry_expired(rp) &&
rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
@@ -353,7 +360,7 @@ nfsd_cache_update(struct svc_rqst *rqstp, int cachetype, __be32 *statp)

/* Don't cache excessive amounts of data and XDR failures */
if (!statp || len > (256 >> 2)) {
- rp->c_state = RC_UNUSED;
+ nfsd_reply_cache_free(rp);
return;
}

@@ -367,12 +374,15 @@ nfsd_cache_update(struct svc_rqst *rqstp, int cachetype, __be32 *statp)
cachv = &rp->c_replvec;
cachv->iov_base = kmalloc(len << 2, GFP_KERNEL);
if (!cachv->iov_base) {
- rp->c_state = RC_UNUSED;
+ nfsd_reply_cache_free(rp);
return;
}
cachv->iov_len = len << 2;
memcpy(cachv->iov_base, statp, len << 2);
break;
+ case RC_NOCACHE:
+ nfsd_reply_cache_free(rp);
+ return;
}
spin_lock(&cache_lock);
lru_put_end(rp);
--
1.7.11.7


2013-02-04 13:18:29

by Jeff Layton

Subject: [PATCH v2 8/8] nfsd: keep a checksum of the first 256 bytes of request

Now that we're allowing more DRC entries, it becomes a lot easier to hit
problems with XID collisions. In order to mitigate those, calculate the
crc32 of up to the first 256 bytes of each request coming in and store
that in the cache entry, along with the total length of the request.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/cache.h | 5 +++++
fs/nfsd/nfscache.c | 44 ++++++++++++++++++++++++++++++++++++++++----
2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
index 9c7232b..4822db3 100644
--- a/fs/nfsd/cache.h
+++ b/fs/nfsd/cache.h
@@ -29,6 +29,8 @@ struct svc_cacherep {
u32 c_prot;
u32 c_proc;
u32 c_vers;
+ unsigned int c_len;
+ u32 c_crc;
unsigned long c_timestamp;
union {
struct kvec u_vec;
@@ -73,6 +75,9 @@ enum {
/* Cache entries expire after this time period */
#define RC_EXPIRE (120 * HZ)

+/* Checksum this amount of the request */
+#define RC_CSUMLEN (256U)
+
int nfsd_reply_cache_init(void);
void nfsd_reply_cache_shutdown(void);
int nfsd_cache_lookup(struct svc_rqst *);
diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index d16a5d6..cb655f3 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -11,6 +11,7 @@
#include <linux/slab.h>
#include <linux/sunrpc/clnt.h>
#include <linux/highmem.h>
+#include <linux/crc32.h>

#include "nfsd.h"
#include "cache.h"
@@ -24,6 +25,7 @@ static struct list_head lru_head;
static struct kmem_cache *drc_slab;
static unsigned int num_drc_entries;
static unsigned int max_drc_entries;
+static u32 crc_seed;

/*
* Calculate the hash index from an XID.
@@ -130,6 +132,9 @@ int nfsd_reply_cache_init(void)
INIT_LIST_HEAD(&lru_head);
max_drc_entries = nfsd_cache_size_limit();
num_drc_entries = 0;
+
+ /* Is a random seed any better than some well-defined constant? */
+ get_random_bytes(&crc_seed, sizeof(crc_seed));
return 0;
out_nomem:
printk(KERN_ERR "nfsd: failed to allocate reply cache\n");
@@ -238,12 +243,37 @@ nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
}

/*
+ * Walk an xdr_buf and get a CRC for at most the first RC_CSUMLEN bytes
+ */
+static u32
+nfsd_cache_crc(struct xdr_buf *buf)
+{
+ u32 crc;
+ const unsigned char *p = buf->head[0].iov_base;
+ size_t csum_len = min_t(size_t, buf->head[0].iov_len + buf->page_len,
+ RC_CSUMLEN);
+ size_t len = min(buf->head[0].iov_len, csum_len);
+
+ /* rq_arg.head first */
+ crc = crc32(crc_seed, p, len);
+ csum_len -= len;
+
+ /* Nothing left */
+ if (!csum_len)
+ return crc;
+
+ /* checksum the rest from the page_array */
+ p = page_address(buf->pages[0]) + buf->page_base;
+ return crc32(crc, p, csum_len);
+}
+
+/*
* Search the request hash for an entry that matches the given rqstp.
* Must be called with cache_lock held. Returns the found entry or
* NULL on failure.
*/
static struct svc_cacherep *
-nfsd_cache_search(struct svc_rqst *rqstp)
+nfsd_cache_search(struct svc_rqst *rqstp, u32 crc)
{
struct svc_cacherep *rp;
struct hlist_node *hn;
@@ -257,6 +287,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)
hlist_for_each_entry(rp, hn, rh, c_hash) {
if (xid == rp->c_xid && proc == rp->c_proc &&
proto == rp->c_prot && vers == rp->c_vers &&
+ rqstp->rq_arg.len == rp->c_len && crc == rp->c_crc &&
rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr))
return rp;
@@ -276,7 +307,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
__be32 xid = rqstp->rq_xid;
u32 proto = rqstp->rq_prot,
vers = rqstp->rq_vers,
- proc = rqstp->rq_proc;
+ proc = rqstp->rq_proc,
+ crc;
unsigned long age;
int type = rqstp->rq_cachetype;
int rtn;
@@ -287,10 +319,12 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
return RC_DOIT;
}

+ crc = nfsd_cache_crc(&rqstp->rq_arg);
+
spin_lock(&cache_lock);
rtn = RC_DOIT;

- rp = nfsd_cache_search(rqstp);
+ rp = nfsd_cache_search(rqstp, crc);
if (rp)
goto found_entry;

@@ -318,7 +352,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
* Must search again just in case someone inserted one
* after we dropped the lock above.
*/
- found = nfsd_cache_search(rqstp);
+ found = nfsd_cache_search(rqstp, crc);
if (found) {
nfsd_reply_cache_free_locked(rp);
rp = found;
@@ -344,6 +378,8 @@ setup_entry:
rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
rp->c_prot = proto;
rp->c_vers = vers;
+ rp->c_len = rqstp->rq_arg.len;
+ rp->c_crc = crc;

hash_refile(rp);
lru_put_end(rp);
--
1.7.11.7


2013-02-04 13:18:19

by Jeff Layton

Subject: [PATCH v2 3/8] nfsd: dynamically allocate DRC entries

The existing code keeps a fixed-size cache of 1024 entries. This is much
too small for a busy server, and wastes memory on an idle one. This
patch changes the code to dynamically allocate and free these cache
entries.

A cap on the number of entries is retained, but it's much larger than
the existing value and now scales with the amount of low memory in the
machine.
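
As a worked example (assuming 4k pages): with 1GB of low memory,
low_pages = 262144, so the cap is (16 * int_sqrt(262144)) <<
(PAGE_SHIFT - 10) = (16 * 512) << 2 = 32768 entries, matching the 1GB
row of the table in the comment above nfsd_cache_size_limit().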

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 125 ++++++++++++++++++++++++++++++++---------------------
1 file changed, 75 insertions(+), 50 deletions(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index c0c5847..d213e6e 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -10,17 +10,13 @@

#include <linux/slab.h>
#include <linux/sunrpc/clnt.h>
+#include <linux/highmem.h>

#include "nfsd.h"
#include "cache.h"

-/* Size of reply cache. Common values are:
- * 4.3BSD: 128
- * 4.4BSD: 256
- * Solaris2: 1024
- * DEC Unix: 512-4096
- */
-#define CACHESIZE 1024
+#define NFSDDBG_FACILITY NFSDDBG_REPCACHE
+
#define HASHSIZE 64

static struct hlist_head * cache_hash;
@@ -28,6 +24,7 @@ static struct list_head lru_head;
static int cache_disabled = 1;
static struct kmem_cache *drc_slab;
static unsigned int num_drc_entries;
+static unsigned int max_drc_entries;

/*
* Calculate the hash index from an XID.
@@ -48,6 +45,34 @@ static int nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *vec);
*/
static DEFINE_SPINLOCK(cache_lock);

+/*
+ * Put a cap on the size of the DRC based on the amount of available
+ * low memory in the machine.
+ *
+ * 64MB: 8192
+ * 128MB: 11585
+ * 256MB: 16384
+ * 512MB: 23170
+ * 1GB: 32768
+ * 2GB: 46340
+ * 4GB: 65536
+ * 8GB: 92681
+ * 16GB: 131072
+ *
+ * ...with a hard cap of 256k entries. In the worst case, each entry will be
+ * ~1k, so the above numbers should give a rough max of the amount of memory
+ * used in k.
+ */
+static unsigned int
+nfsd_cache_size_limit(void)
+{
+ unsigned int limit;
+ unsigned long low_pages = totalram_pages - totalhigh_pages;
+
+ limit = (16 * int_sqrt(low_pages)) << (PAGE_SHIFT-10);
+ return min_t(unsigned int, limit, 256*1024);
+}
+
static struct svc_cacherep *
nfsd_reply_cache_alloc(void)
{
@@ -68,6 +93,7 @@ nfsd_reply_cache_free_locked(struct svc_cacherep *rp)
{
if (rp->c_type == RC_REPLBUFF)
kfree(rp->c_replvec.iov_base);
+ hlist_del(&rp->c_hash);
list_del(&rp->c_lru);
--num_drc_entries;
kmem_cache_free(drc_slab, rp);
@@ -75,30 +101,18 @@ nfsd_reply_cache_free_locked(struct svc_cacherep *rp)

int nfsd_reply_cache_init(void)
{
- int i;
- struct svc_cacherep *rp;
-
drc_slab = kmem_cache_create("nfsd_drc", sizeof(struct svc_cacherep),
0, 0, NULL);
if (!drc_slab)
goto out_nomem;

- INIT_LIST_HEAD(&lru_head);
- i = CACHESIZE;
- num_drc_entries = 0;
- while (i) {
- rp = nfsd_reply_cache_alloc();
- if (!rp)
- goto out_nomem;
- ++num_drc_entries;
- list_add(&rp->c_lru, &lru_head);
- i--;
- }
-
- cache_hash = kcalloc (HASHSIZE, sizeof(struct hlist_head), GFP_KERNEL);
+ cache_hash = kcalloc(HASHSIZE, sizeof(struct hlist_head), GFP_KERNEL);
if (!cache_hash)
goto out_nomem;

+ INIT_LIST_HEAD(&lru_head);
+ max_drc_entries = nfsd_cache_size_limit();
+ num_drc_entries = 0;
cache_disabled = 0;
return 0;
out_nomem:
@@ -191,7 +205,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)
int
nfsd_cache_lookup(struct svc_rqst *rqstp)
{
- struct svc_cacherep *rp;
+ struct svc_cacherep *rp, *found;
__be32 xid = rqstp->rq_xid;
u32 proto = rqstp->rq_prot,
vers = rqstp->rq_vers,
@@ -210,38 +224,48 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
rtn = RC_DOIT;

rp = nfsd_cache_search(rqstp);
- if (rp) {
- nfsdstats.rchits++;
+ if (rp)
goto found_entry;
+
+ /* Try to use the first entry on the LRU */
+ if (!list_empty(&lru_head)) {
+ rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
+ if (nfsd_cache_entry_expired(rp) ||
+ num_drc_entries >= max_drc_entries)
+ goto setup_entry;
}
- nfsdstats.rcmisses++;

- /* This loop shouldn't take more than a few iterations normally */
- {
- int safe = 0;
- list_for_each_entry(rp, &lru_head, c_lru) {
- if (rp->c_state != RC_INPROG)
- break;
- if (safe++ > CACHESIZE) {
- printk("nfsd: loop in repcache LRU list\n");
- cache_disabled = 1;
- goto out;
- }
+ spin_unlock(&cache_lock);
+ rp = nfsd_reply_cache_alloc();
+ if (!rp) {
+ dprintk("nfsd: unable to allocate DRC entry!\n");
+ return RC_DOIT;
}
+ spin_lock(&cache_lock);
+ ++num_drc_entries;
+
+ /*
+ * Must search again just in case someone inserted one
+ * after we dropped the lock above.
+ */
+ found = nfsd_cache_search(rqstp);
+ if (found) {
+ nfsd_reply_cache_free_locked(rp);
+ rp = found;
+ goto found_entry;
}

- /* All entries on the LRU are in-progress. This should not happen */
- if (&rp->c_lru == &lru_head) {
- static int complaints;
-
- printk(KERN_WARNING "nfsd: all repcache entries locked!\n");
- if (++complaints > 5) {
- printk(KERN_WARNING "nfsd: disabling repcache.\n");
- cache_disabled = 1;
- }
- goto out;
- }
+ /*
+ * We're keeping the one we just allocated. Are we now over the
+ * limit? Prune one off the tip of the LRU in trade for the one we
+ * just allocated if so.
+ */
+ if (num_drc_entries >= max_drc_entries)
+ nfsd_reply_cache_free_locked(list_first_entry(&lru_head,
+ struct svc_cacherep, c_lru));

+setup_entry:
+ nfsdstats.rcmisses++;
rqstp->rq_cacherep = rp;
rp->c_state = RC_INPROG;
rp->c_xid = xid;
@@ -265,6 +289,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
return rtn;

found_entry:
+ nfsdstats.rchits++;
/* We found a matching entry which is either in progress or done. */
age = jiffies - rp->c_timestamp;
lru_put_end(rp);
@@ -295,7 +320,7 @@ found_entry:
break;
default:
printk(KERN_WARNING "nfsd: bad repcache type %d\n", rp->c_type);
- rp->c_state = RC_UNUSED;
+ nfsd_reply_cache_free_locked(rp);
}

goto out;
--
1.7.11.7


2013-02-04 15:56:11

by J. Bruce Fields

Subject: Re: [PATCH v2 0/8] nfsd: duplicate reply cache overhaul

On Mon, Feb 04, 2013 at 08:17:59AM -0500, Jeff Layton wrote:
> This is the second posting of the remaining unmerged patches in this
> set. There are a number of differences from the first set:
>
> - The bug in the checksum patch has been fixed.
>
> - A hard cap on the number of DRC entries is retained, but it's larger
> than the original cap, and scales with the amount of low memory in
> the machine.
>
> - A shrinker is still registered, but it will now only free entries
> that are expired or that put the cache over the max number of entries.
>
> Our QA group has been reporting occasional testing failures, on and off
> for the last several years, especially on UDP. When we go to look at
> traces, we see a missing reply from the server on a non-idempotent
> request. The client then retransmits the request, and the server tries
> to redo it instead of just replaying the DRC entry.
>
> With an instrumented kernel on the server and a synthetic reproducer, we
> found that it's quite easy to hammer the server so fast that DRC entries
> get flushed out long before a retransmit can come in.
>
> This patchset is a first pass at fixing this. Instead of simply keeping
> a cache of the last 1024 entries, it allows nfsd to grow and shrink the
> DRC dynamically.
>
> While most of us will probably say "so what" when it comes to UDP
> failures, it's a potential problem on connected transports as well. I'm
> also inclined to try to fix things that screw up the people who are
> helping us test our code.
>
> I'd like to see this merged for 3.9 if possible...

These look fine, thanks. Applying pending some testing.

--b.

>
> Jeff Layton (8):
> nfsd: always move DRC entries to the end of LRU list when updating
> timestamp
> nfsd: track the number of DRC entries in the cache
> nfsd: dynamically allocate DRC entries
> nfsd: remove the cache_disabled flag
> nfsd: when updating an entry with RC_NOCACHE, just free it
> nfsd: add recurring workqueue job to clean the cache
> nfsd: register a shrinker for DRC cache entries
> nfsd: keep a checksum of the first 256 bytes of request
>
> fs/nfsd/cache.h | 5 +
> fs/nfsd/nfscache.c | 271 ++++++++++++++++++++++++++++++++++++++++-------------
> 2 files changed, 209 insertions(+), 67 deletions(-)
>
> --
> 1.7.11.7
>

2013-02-04 13:18:25

by Jeff Layton

Subject: [PATCH v2 6/8] nfsd: add recurring workqueue job to clean the cache

It's not sufficient to only clean the cache when requests come in. What
if we have a flurry of activity and then the server goes idle? Add a
workqueue job that will clean the cache every RC_EXPIRE period.

Care is taken to only run this when we expect to have entries expiring.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 47 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index e8ea785..d7b088b 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -36,6 +36,7 @@ static inline u32 request_hash(u32 xid)
}

static int nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *vec);
+static void cache_cleaner_func(struct work_struct *unused);

/*
* locking for the reply cache:
@@ -43,6 +44,7 @@ static int nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *vec);
* Otherwise, it when accessing _prev or _next, the lock must be held.
*/
static DEFINE_SPINLOCK(cache_lock);
+static DECLARE_DELAYED_WORK(cache_cleaner, cache_cleaner_func);

/*
* Put a cap on the size of the DRC based on the amount of available
@@ -131,6 +133,8 @@ void nfsd_reply_cache_shutdown(void)
{
struct svc_cacherep *rp;

+ cancel_delayed_work_sync(&cache_cleaner);
+
while (!list_empty(&lru_head)) {
rp = list_entry(lru_head.next, struct svc_cacherep, c_lru);
nfsd_reply_cache_free_locked(rp);
@@ -146,13 +150,15 @@ void nfsd_reply_cache_shutdown(void)
}

/*
- * Move cache entry to end of LRU list
+ * Move cache entry to end of LRU list, and queue the cleaner to run if it's
+ * not already scheduled.
*/
static void
lru_put_end(struct svc_cacherep *rp)
{
rp->c_timestamp = jiffies;
list_move_tail(&rp->c_lru, &lru_head);
+ schedule_delayed_work(&cache_cleaner, RC_EXPIRE);
}

/*
@@ -173,6 +179,42 @@ nfsd_cache_entry_expired(struct svc_cacherep *rp)
}

/*
+ * Walk the LRU list and prune off entries that are older than RC_EXPIRE.
+ * Also prune the oldest ones when the total exceeds the max number of entries.
+ */
+static void
+prune_cache_entries(void)
+{
+ struct svc_cacherep *rp, *tmp;
+
+ list_for_each_entry_safe(rp, tmp, &lru_head, c_lru) {
+ if (!nfsd_cache_entry_expired(rp) &&
+ num_drc_entries <= max_drc_entries)
+ break;
+ nfsd_reply_cache_free_locked(rp);
+ }
+
+ /*
+ * Conditionally rearm the job. If we cleaned out the list, then
+ * cancel any pending run (since there won't be any work to do).
+ * Otherwise, we rearm the job or modify the existing one to run in
+ * RC_EXPIRE since we just ran the pruner.
+ */
+ if (list_empty(&lru_head))
+ cancel_delayed_work(&cache_cleaner);
+ else
+ mod_delayed_work(system_wq, &cache_cleaner, RC_EXPIRE);
+}
+
+static void
+cache_cleaner_func(struct work_struct *unused)
+{
+ spin_lock(&cache_lock);
+ prune_cache_entries();
+ spin_unlock(&cache_lock);
+}
+
+/*
* Search the request hash for an entry that matches the given rqstp.
* Must be called with cache_lock held. Returns the found entry or
* NULL on failure.
@@ -192,7 +234,6 @@ nfsd_cache_search(struct svc_rqst *rqstp)
hlist_for_each_entry(rp, hn, rh, c_hash) {
if (xid == rp->c_xid && proc == rp->c_proc &&
proto == rp->c_prot && vers == rp->c_vers &&
- !nfsd_cache_entry_expired(rp) &&
rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr))
return rp;
@@ -234,8 +275,11 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
if (!list_empty(&lru_head)) {
rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
if (nfsd_cache_entry_expired(rp) ||
- num_drc_entries >= max_drc_entries)
+ num_drc_entries >= max_drc_entries) {
+ lru_put_end(rp);
+ prune_cache_entries();
goto setup_entry;
+ }
}

spin_unlock(&cache_lock);
--
1.7.11.7


2013-02-04 13:18:15

by Jeff Layton

Subject: [PATCH v2 1/8] nfsd: always move DRC entries to the end of LRU list when updating timestamp

...otherwise, we end up with the list ordering wrong. Currently, it's
not a problem since we skip RC_INPROG entries, but keeping the ordering
strict will be necessary for a later patch that adds a cache cleaner.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index b89e7c8..9d80dfa 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -129,6 +129,7 @@ void nfsd_reply_cache_shutdown(void)
static void
lru_put_end(struct svc_cacherep *rp)
{
+ rp->c_timestamp = jiffies;
list_move_tail(&rp->c_lru, &lru_head);
}

@@ -245,9 +246,9 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
rp->c_prot = proto;
rp->c_vers = vers;
- rp->c_timestamp = jiffies;

hash_refile(rp);
+ lru_put_end(rp);

/* release any buffer */
if (rp->c_type == RC_REPLBUFF) {
@@ -262,7 +263,6 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
found_entry:
/* We found a matching entry which is either in progress or done. */
age = jiffies - rp->c_timestamp;
- rp->c_timestamp = jiffies;
lru_put_end(rp);

rtn = RC_DROPIT;
@@ -354,7 +354,6 @@ nfsd_cache_update(struct svc_rqst *rqstp, int cachetype, __be32 *statp)
rp->c_secure = rqstp->rq_secure;
rp->c_type = cachetype;
rp->c_state = RC_DONE;
- rp->c_timestamp = jiffies;
spin_unlock(&cache_lock);
return;
}
--
1.7.11.7