2013-01-28 19:48:09

by Jeff Layton

Subject: [PATCH v1 00/16] nfsd: duplicate reply cache overhaul

Our QA group has been reporting on and off for the last several years
about occasional failures in testing, especially on UDP. When we go to
look at traces, we see a missing reply from a server on a non-idempotent
request. The client then retransmits the request and the server tries to
redo it instead of just sending the cached reply from the DRC entry (so,
for example, a retransmitted RENAME that already succeeded comes back
with ENOENT).

With an instrumented kernel on the server and a synthetic reproducer, we
found that it's quite easy to hammer the server so fast the DRC entries
get flushed out long before a retransmit can come in.

This patchset is a first pass at fixing this. Instead of simply keeping
a cache of the last 1024 entries, it allows nfsd to grow and shrink the
DRC dynamically.

The first patch is a bugfix for IPv6 support. The next several are
cleanups and reorganizations of the existing code. The tenth patch makes
them dynamically allocated, and the ones following that add various
mechanisms to help keep the cache to a manageable size. The final patch
adds the ability to checksum the first part of the request, intended as a
way to mitigate the effects of an XID collision.

While most of us will probably say "so what" when it comes to UDP
failures, it's a potential problem on connected transports as well. I'm
also inclined to try to fix things that screw up the people who are
helping us test our code.

I'd like to see this merged for 3.9 if possible...

Jeff Layton (16):
nfsd: fix IPv6 address handling in the DRC
nfsd: remove unneeded spinlock in nfsd_cache_update
nfsd: get rid of RC_INTR
nfsd: create a dedicated slabcache for DRC entries
nfsd: add alloc and free functions for DRC entries
nfsd: remove redundant test from nfsd_reply_cache_free
nfsd: clean up and clarify the cache expiration code
nfsd: break out hashtable search into separate function
nfsd: always move DRC entries to the end of LRU list when updating timestamp
nfsd: dynamically allocate DRC entries
nfsd: remove the cache_disabled flag
nfsd: when updating an entry with RC_NOCACHE, just free it
nfsd: add recurring workqueue job to clean the cache
nfsd: track the number of DRC entries in the cache
nfsd: register a shrinker for DRC cache entries
nfsd: keep a checksum of the first 256 bytes of request

fs/nfsd/cache.h | 17 ++-
fs/nfsd/nfscache.c | 337 ++++++++++++++++++++++++++++++++++----------
fs/nfsd/nfssvc.c | 1 -
include/linux/sunrpc/clnt.h | 4 +-
4 files changed, 278 insertions(+), 81 deletions(-)

--
1.7.11.7



2013-01-28 20:26:49

by Jeff Layton

Subject: Re: [PATCH v1 01/16] nfsd: fix IPv6 address handling in the DRC

On Mon, 28 Jan 2013 12:16:17 -0800 (PST)
Chuck Lever <[email protected]> wrote:

>
> On Jan 28, 2013, at 3:05 PM, Jeff Layton <[email protected]> wrote:
>
> > On Mon, 28 Jan 2013 11:51:13 -0800 (PST)
> > Chuck Lever <[email protected]> wrote:
> >
> >>
> >> On Jan 28, 2013, at 2:41 PM, Jeff Layton <[email protected]> wrote:
> >>
> >>> Currently, it only stores the first 16 bytes of any address. struct
> >>> sockaddr_in6 is 28 bytes however, so we're currently ignoring the last
> >>> 12 bytes of the address.
> >>>
> >>> Expand the c_addr field to a sockaddr_in6,
> >>
> >> What do you do about link-local addresses?
> >>
> >
> > I use rpc_cmp_addr() which should handle the scope correctly for
> > link-local addrs. Now that you mention it though, I think we may have a
> > bug in __rpc_copy_addr6. It doesn't touch the scope_id at all --
> > shouldn't we be setting the scopeid in that function as well?
>
> It looks like rpc_copy_addr() is invoked to copy only source addresses, so the scope ID probably doesn't apply. The comment over rpc_copy_addr() says it explicitly ignores port and scope ID.
>

Well, it copies the source address of the client, but all of the
callers are in nfsd code. So I think it probably does matter:

Consider a server that has 2 interfaces, and 2 clients with identical
link-local addresses hitting the link local address of each interface.
The only way to distinguish them at that point is to look at their
scope_ids.
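To illustrate with a quick userspace-flavored sketch (hypothetical, just
to show the shape of the problem): the raw address bytes of the two
clients compare equal, and only sin6_scope_id tells them apart:

#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>

int main(void)
{
	/* two clients, both fe80::1, seen via different server interfaces */
	struct sockaddr_in6 a = { .sin6_family = AF_INET6, .sin6_scope_id = 2 };
	struct sockaddr_in6 b = { .sin6_family = AF_INET6, .sin6_scope_id = 3 };

	inet_pton(AF_INET6, "fe80::1", &a.sin6_addr);
	inet_pton(AF_INET6, "fe80::1", &b.sin6_addr);

	/* the 16 address bytes are identical... */
	assert(memcmp(&a.sin6_addr, &b.sin6_addr, sizeof(a.sin6_addr)) == 0);
	/* ...so the scope ID is the only thing distinguishing the clients */
	assert(a.sin6_scope_id != b.sin6_scope_id);
	return 0;
}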

In the event that we're not using link-local addresses, the scopeid
should just be ignored. So I think we probably need something like this
patch too:

----------------------[snip]---------------------

sunrpc: copy scope ID in __rpc_copy_addr6

When copying an address, we should also copy the scopeid in the event
that this is a link-local address and the scope matters.

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/sunrpc/clnt.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 34206b8..4824286 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -242,6 +242,7 @@ static inline bool __rpc_copy_addr6(struct sockaddr *dst,

dsin6->sin6_family = ssin6->sin6_family;
dsin6->sin6_addr = ssin6->sin6_addr;
+ dsin6->sin6_scope_id = ssin6->sin6_scope_id;
return true;
}
#else /* !(IS_ENABLED(CONFIG_IPV6) */
--
1.7.11.7


--
Jeff Layton <[email protected]>

2013-01-30 15:37:59

by Jeff Layton

Subject: Re: [PATCH v1 15/16] nfsd: register a shrinker for DRC cache entries

On Wed, 30 Jan 2013 10:24:49 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Jan 28, 2013 at 02:41:21PM -0500, Jeff Layton wrote:
> > Since we dynamically allocate them now, allow the system to call us
> > up to release them if we get low on memory.
> >
> > We set the "seeks" value very high, to discourage it and release
> > them in LRU order.
>
> I don't understand how the mm uses seeks. I guess I'll do some reading.
>

Me neither. It's part of a heuristic to make it prefer not to purge
certain cache entries vs. others, but the logic escapes me...
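
For reference, the relevant piece of shrink_slab() in kernels of this
vintage looks roughly like the following (paraphrased from mm/vmscan.c,
so treat it as a sketch): a larger ->seeks scales down how many objects
the cache is asked to scan on each pass.

	/* paraphrased from shrink_slab(): compute how many objects to ask
	 * this shrinker to scan; max_pass is the object count the shrinker
	 * reported when called with nr_to_scan == 0 */
	delta = (4 * nr_pages_scanned) / shrinker->seeks;
	delta *= max_pass;
	do_div(delta, lru_pages + 1);
	total_scan += delta;	/* eventually handed back via sc->nr_to_scan */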

> This looks straightforward enough, but I'm not sure how to judge whether
> this is the correct policy.
>

Agreed. At the very least, it probably makes sense to purge anything
already expired when we get a shrinker call. Beyond that, I'm not sure
whether it makes sense or not.

> --b.
>
> > We may end up releasing some prematurely if the
> > box is really low on memory.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/nfsd/nfscache.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 53 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> > index bd21230..27edd47 100644
> > --- a/fs/nfsd/nfscache.c
> > +++ b/fs/nfsd/nfscache.c
> > @@ -35,6 +35,18 @@ static inline u32 request_hash(u32 xid)
> >
> > static int nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *vec);
> > static void cache_cleaner_func(struct work_struct *unused);
> > +static int nfsd_reply_cache_shrink(struct shrinker *shrink,
> > + struct shrink_control *sc);
> > +
> > +/*
> > + * We set "seeks" to a large value here since we really can't recreate these
> > + * entries. We're just gambling that they won't actually be needed if and
> > + * when the box comes under memory pressure.
> > + */
> > +struct shrinker nfsd_reply_cache_shrinker = {
> > + .shrink = nfsd_reply_cache_shrink,
> > + .seeks = 64,
> > +};
> >
> > /*
> > * locking for the reply cache:
> > @@ -80,6 +92,7 @@ nfsd_reply_cache_free(struct svc_cacherep *rp)
> >
> > int nfsd_reply_cache_init(void)
> > {
> > + register_shrinker(&nfsd_reply_cache_shrinker);
> > drc_slab = kmem_cache_create("nfsd_drc", sizeof(struct svc_cacherep),
> > 0, 0, NULL);
> > if (!drc_slab)
> > @@ -102,6 +115,7 @@ void nfsd_reply_cache_shutdown(void)
> > {
> > struct svc_cacherep *rp;
> >
> > + unregister_shrinker(&nfsd_reply_cache_shrinker);
> > cancel_delayed_work_sync(&cache_cleaner);
> >
> > while (!list_empty(&lru_head)) {
> > @@ -152,15 +166,17 @@ nfsd_cache_entry_expired(struct svc_cacherep *rp)
> > * if we hit one that isn't old enough, then we can stop the walking. Must
> > * be called with cache_lock held.
> > */
> > -static void
> > +static int
> > prune_old_cache_entries(void)
> > {
> > + int freed = 0;
> > struct svc_cacherep *rp, *tmp;
> >
> > list_for_each_entry_safe(rp, tmp, &lru_head, c_lru) {
> > if (!nfsd_cache_entry_expired(rp))
> > break;
> > nfsd_reply_cache_free_locked(rp);
> > + ++freed;
> > }
> >
> > /*
> > @@ -173,6 +189,8 @@ prune_old_cache_entries(void)
> > cancel_delayed_work(&cache_cleaner);
> > else
> > mod_delayed_work(system_wq, &cache_cleaner, RC_EXPIRE);
> > +
> > + return freed;
> > }
> >
> > static void
> > @@ -183,6 +201,40 @@ cache_cleaner_func(struct work_struct *unused)
> > spin_unlock(&cache_lock);
> > }
> >
> > +static int
> > +nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
> > +{
> > + int nr_to_scan = sc->nr_to_scan;
> > + unsigned int num;
> > +
> > + spin_lock(&cache_lock);
> > + if (!nr_to_scan)
> > + goto out;
> > +
> > + nr_to_scan -= prune_old_cache_entries();
> > +
> > + /*
> > + * If that didn't free enough, then keep going. It's better to free
> > + * some prematurely than to have the box keel over due to lack of
> > + * memory.
> > + */
> > + while (nr_to_scan > 0) {
> > + struct svc_cacherep *rp;
> > +
> > + if (list_empty(&lru_head))
> > + break;
> > + rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
> > + nfsd_reply_cache_free_locked(rp);
> > + --nr_to_scan;
> > + }
> > +
> > +out:
> > + num = num_drc_entries;
> > + spin_unlock(&cache_lock);
> > +
> > + return num;
> > +}
> > +
> > /*
> > * Search the request hash for an entry that matches the given rqstp.
> > * Must be called with cache_lock held. Returns the found entry or
> > --
> > 1.7.11.7
> >


--
Jeff Layton <[email protected]>

2013-01-29 18:20:55

by Jeff Layton

Subject: Re: [PATCH v1 16/16] nfsd: keep a checksum of the first 256 bytes of request

On Mon, 28 Jan 2013 14:41:22 -0500
Jeff Layton <[email protected]> wrote:

> Now that we're dynamically allocating these entries, it becomes a lot
> easier to hit problems with XID collisions. In order to mitigate those,
> checksum up to the first 256 bytes of each request coming in and store
> those in the cache entry, along with the total length of the request.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/cache.h | 5 +++++
> fs/nfsd/nfscache.c | 44 ++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 45 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
> index 9c7232b..4822db3 100644
> --- a/fs/nfsd/cache.h
> +++ b/fs/nfsd/cache.h
> @@ -29,6 +29,8 @@ struct svc_cacherep {
> u32 c_prot;
> u32 c_proc;
> u32 c_vers;
> + unsigned int c_len;
> + u32 c_crc;
> unsigned long c_timestamp;
> union {
> struct kvec u_vec;
> @@ -73,6 +75,9 @@ enum {
> /* Cache entries expire after this time period */
> #define RC_EXPIRE (120 * HZ)
>
> +/* Checksum this amount of the request */
> +#define RC_CSUMLEN (256U)
> +
> int nfsd_reply_cache_init(void);
> void nfsd_reply_cache_shutdown(void);
> int nfsd_cache_lookup(struct svc_rqst *);
> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> index 27edd47..abbf956 100644
> --- a/fs/nfsd/nfscache.c
> +++ b/fs/nfsd/nfscache.c
> @@ -10,6 +10,7 @@
>
> #include <linux/slab.h>
> #include <linux/sunrpc/clnt.h>
> +#include <linux/crc32.h>
>
> #include "nfsd.h"
> #include "cache.h"
> @@ -22,6 +23,7 @@ static struct hlist_head * cache_hash;
> static struct list_head lru_head;
> static struct kmem_cache *drc_slab;
> static unsigned int num_drc_entries;
> +static u32 crc_seed;
>
> /*
> * Calculate the hash index from an XID.
> @@ -103,6 +105,9 @@ int nfsd_reply_cache_init(void)
> goto out_nomem;
>
> INIT_LIST_HEAD(&lru_head);
> +
> + /* Is a random seed any better than some well-defined constant? */
> + get_random_bytes(&crc_seed, sizeof(crc_seed));
> num_drc_entries = 0;
> return 0;
> out_nomem:
> @@ -236,12 +241,37 @@ out:
> }
>
> /*
> + * Walk an xdr_buf and get a CRC for at most the first RC_CSUMLEN bytes
> + */
> +static u32
> +nfsd_cache_crc(struct xdr_buf *buf)
> +{
> + u32 crc;
> + const unsigned char *p = buf->head[0].iov_base;
> + size_t total_len = min(buf->len, RC_CSUMLEN);
> + size_t len = min(buf->head[0].iov_len, total_len);
> +
> + /* rq_arg.head first */
> + crc = crc32(crc_seed, p, len);
> + total_len -= len;
> +
> + /* Nothing left */
> + if (!total_len)
> + return crc;
> +
> + /* checksum the rest from the page_array */
> + p = page_address(buf->pages[0]) + buf->page_base;
> + len = min(buf->len - len, total_len);
> + return crc32(crc, p, len);
> +}
> +

My apologies...the above code is wrong and I was seeing test failures
with pynfs from it.

buf->len is set to the size of the received RPC + NFS frames in
svc_recv. svc_process then advances head[0] as it scrapes out the RPC
fields. So, when we get to vs_dispatch, buf->len is no longer valid
(it's off by the size of the RPC header). I've got a new patch that
avoids using buf->len, and with that the tests pass.
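
For the curious, the replacement works roughly like the following (an
untested sketch along those lines, not the actual respin): derive the
length to checksum from head[0].iov_len plus page_len rather than
trusting buf->len, and walk the page array properly:

/*
 * Sketch of a checksum helper that doesn't rely on buf->len. By the
 * time we're called, head[0] has already been advanced past the RPC
 * header, so compute the remaining length from the iovec and page_len.
 */
static u32
nfsd_cache_crc(struct xdr_buf *buf)
{
	u32 crc;
	int idx;
	unsigned int base;
	const unsigned char *p = buf->head[0].iov_base;
	size_t csum_len = min(buf->head[0].iov_len + buf->page_len,
			      (size_t)RC_CSUMLEN);
	size_t len = min(buf->head[0].iov_len, csum_len);

	/* rq_arg.head first */
	crc = crc32(crc_seed, p, len);
	csum_len -= len;

	/* then walk the page array for anything left over */
	idx = buf->page_base / PAGE_SIZE;
	base = buf->page_base & ~PAGE_MASK;
	while (csum_len) {
		p = page_address(buf->pages[idx]) + base;
		len = min_t(size_t, PAGE_SIZE - base, csum_len);
		crc = crc32(crc, p, len);
		csum_len -= len;
		base = 0;
		idx++;
	}
	return crc;
}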

I'll plan to send a respin of this whole set in a few days, once I've
given enough chance for anyone interested to comment on it.

> +/*
> * Search the request hash for an entry that matches the given rqstp.
> * Must be called with cache_lock held. Returns the found entry or
> * NULL on failure.
> */
> static struct svc_cacherep *
> -nfsd_cache_search(struct svc_rqst *rqstp)
> +nfsd_cache_search(struct svc_rqst *rqstp, u32 crc)
> {
> struct svc_cacherep *rp;
> struct hlist_node *hn;
> @@ -255,6 +285,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)
> hlist_for_each_entry(rp, hn, rh, c_hash) {
> if (xid == rp->c_xid && proc == rp->c_proc &&
> proto == rp->c_prot && vers == rp->c_vers &&
> + rqstp->rq_arg.len == rp->c_len && crc == rp->c_crc &&
> rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
> rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr))
> return rp;
> @@ -274,7 +305,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> __be32 xid = rqstp->rq_xid;
> u32 proto = rqstp->rq_prot,
> vers = rqstp->rq_vers,
> - proc = rqstp->rq_proc;
> + proc = rqstp->rq_proc,
> + crc;
> unsigned long age;
> int type = rqstp->rq_cachetype;
> int rtn;
> @@ -285,10 +317,12 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> return RC_DOIT;
> }
>
> + crc = nfsd_cache_crc(&rqstp->rq_arg);
> +
> spin_lock(&cache_lock);
> rtn = RC_DOIT;
>
> - rp = nfsd_cache_search(rqstp);
> + rp = nfsd_cache_search(rqstp, crc);
> if (rp)
> goto found_entry;
>
> @@ -318,7 +352,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> * Must search again just in case someone inserted one
> * after we dropped the lock above.
> */
> - found = nfsd_cache_search(rqstp);
> + found = nfsd_cache_search(rqstp, crc);
> if (found) {
> nfsd_reply_cache_free_locked(rp);
> rp = found;
> @@ -335,6 +369,8 @@ setup_entry:
> rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
> rp->c_prot = proto;
> rp->c_vers = vers;
> + rp->c_len = rqstp->rq_arg.len;
> + rp->c_crc = crc;
>
> hash_refile(rp);
> lru_put_end(rp);


--
Jeff Layton <[email protected]>

2013-01-28 19:41:38

by Jeff Layton

Subject: [PATCH v1 02/16] nfsd: remove unneeded spinlock in nfsd_cache_update

The locking rules for cache entries say that locking the cache_lock
isn't needed if you're just touching the current entry. Earlier
in this function we set rp->c_state to RC_UNUSED without any locking,
so I believe it's ok to do the same here.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 5dd9ec2..972c14a 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -286,9 +286,7 @@ nfsd_cache_update(struct svc_rqst *rqstp, int cachetype, __be32 *statp)
cachv = &rp->c_replvec;
cachv->iov_base = kmalloc(len << 2, GFP_KERNEL);
if (!cachv->iov_base) {
- spin_lock(&cache_lock);
rp->c_state = RC_UNUSED;
- spin_unlock(&cache_lock);
return;
}
cachv->iov_len = len << 2;
--
1.7.11.7


2013-01-31 21:23:03

by J. Bruce Fields

Subject: Re: [PATCH v1 02/16] nfsd: remove unneeded spinlock in nfsd_cache_update

On Mon, Jan 28, 2013 at 02:41:08PM -0500, Jeff Layton wrote:
> The locking rules for cache entries say that locking the cache_lock
> isn't needed if you're just touching the current entry. Earlier
> in this function we set rp->c_state to RC_UNUSED without any locking,
> so I believe it's ok to do the same here.

Also, the previous assignment to cachv->iov_base is a write to the same
entry without a lock.

It's still a little odd to be writing to this entry without holding any
lock. I have to stare at it for a while to convince myself it's OK, but
I guess it is: this entry is transitioning from RC_INPROG to RC_UNUSED,
but it's not modifying the entry otherwise (that cachv->iov_base
assignment was actually a no-op in this case since it had to be NULL
already). So it shouldn't matter that there's no write barrier or
anything. At worst another thread may be a little late to notice this
entry is free.

OK, applying. Could use a comment or something though.
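
Something along these lines, maybe (just a sketch of the sort of comment
I mean):

	if (!cachv->iov_base) {
		/*
		 * No lock needed: we own this entry while it's RC_INPROG,
		 * and flipping it back to RC_UNUSED only tells other
		 * threads that it's free again. At worst they notice a
		 * little late; nothing else in the entry has changed.
		 */
		rp->c_state = RC_UNUSED;
		return;
	}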

--b.

>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/nfscache.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> index 5dd9ec2..972c14a 100644
> --- a/fs/nfsd/nfscache.c
> +++ b/fs/nfsd/nfscache.c
> @@ -286,9 +286,7 @@ nfsd_cache_update(struct svc_rqst *rqstp, int cachetype, __be32 *statp)
> cachv = &rp->c_replvec;
> cachv->iov_base = kmalloc(len << 2, GFP_KERNEL);
> if (!cachv->iov_base) {
> - spin_lock(&cache_lock);
> rp->c_state = RC_UNUSED;
> - spin_unlock(&cache_lock);
> return;
> }
> cachv->iov_len = len << 2;
> --
> 1.7.11.7
>

2013-01-31 20:59:03

by J. Bruce Fields

Subject: Re: [PATCH v1 01/16] nfsd: fix IPv6 address handling in the DRC

On Mon, Jan 28, 2013 at 02:41:07PM -0500, Jeff Layton wrote:
> Currently, it only stores the first 16 bytes of any address. struct
> sockaddr_in6 is 28 bytes however, so we're currently ignoring the last
> 12 bytes of the address.
>
> Expand the c_addr field to a sockaddr_in6, and cast it to a sockaddr_in
as necessary. Also fix the comparator to use the existing RPC
> helpers for this.

Applying. (Assuming the fix for link-local case will be separate.)

--b.

>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/cache.h | 6 +++++-
> fs/nfsd/nfscache.c | 7 +++++--
> include/linux/sunrpc/clnt.h | 4 +++-
> 3 files changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
> index 93cc9d3..2cac76c 100644
> --- a/fs/nfsd/cache.h
> +++ b/fs/nfsd/cache.h
> @@ -12,6 +12,10 @@
>
> /*
> * Representation of a reply cache entry.
> + *
> + * Note that we use a sockaddr_in6 to hold the address instead of the more
> + * typical sockaddr_storage. This is for space reasons, since sockaddr_storage
> + * is much larger than a sockaddr_in6.
> */
> struct svc_cacherep {
> struct hlist_node c_hash;
> @@ -20,7 +24,7 @@ struct svc_cacherep {
> unsigned char c_state, /* unused, inprog, done */
> c_type, /* status, buffer */
> c_secure : 1; /* req came from port < 1024 */
> - struct sockaddr_in c_addr;
> + struct sockaddr_in6 c_addr;
> __be32 c_xid;
> u32 c_prot;
> u32 c_proc;
> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> index 2cbac34..5dd9ec2 100644
> --- a/fs/nfsd/nfscache.c
> +++ b/fs/nfsd/nfscache.c
> @@ -9,6 +9,7 @@
> */
>
> #include <linux/slab.h>
> +#include <linux/sunrpc/clnt.h>
>
> #include "nfsd.h"
> #include "cache.h"
> @@ -146,7 +147,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> xid == rp->c_xid && proc == rp->c_proc &&
> proto == rp->c_prot && vers == rp->c_vers &&
> time_before(jiffies, rp->c_timestamp + 120*HZ) &&
> - memcmp((char*)&rqstp->rq_addr, (char*)&rp->c_addr, sizeof(rp->c_addr))==0) {
> + rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
> + rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr)) {
> nfsdstats.rchits++;
> goto found_entry;
> }
> @@ -183,7 +185,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> rp->c_state = RC_INPROG;
> rp->c_xid = xid;
> rp->c_proc = proc;
> - memcpy(&rp->c_addr, svc_addr_in(rqstp), sizeof(rp->c_addr));
> + rpc_copy_addr((struct sockaddr *)&rp->c_addr, svc_addr(rqstp));
> + rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
> rp->c_prot = proto;
> rp->c_vers = vers;
> rp->c_timestamp = jiffies;
> diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
> index 34206b8..47354a2 100644
> --- a/include/linux/sunrpc/clnt.h
> +++ b/include/linux/sunrpc/clnt.h
> @@ -263,7 +263,9 @@ static inline bool __rpc_copy_addr6(struct sockaddr *dst,
> * @sap1: first sockaddr
> * @sap2: second sockaddr
> *
> - * Just compares the family and address portion. Ignores port, scope, etc.
> + * Just compares the family and address portion. Ignores port, but
> + * compares the scope if it's a link-local address.
> + *
> * Returns true if the addrs are equal, false if they aren't.
> */
> static inline bool rpc_cmp_addr(const struct sockaddr *sap1,
> --
> 1.7.11.7
>

2013-01-28 19:41:44

by Jeff Layton

Subject: [PATCH v1 04/16] nfsd: create a dedicated slabcache for DRC entries

Currently we use kmalloc() which wastes a little bit of memory on each
allocation since it's a power-of-2 allocator. Since we're allocating
1024 of these now, and may need even more later, let's create a new
slabcache for them.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 972c14a..4aad9e4 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -26,6 +26,7 @@
static struct hlist_head * cache_hash;
static struct list_head lru_head;
static int cache_disabled = 1;
+static struct kmem_cache *drc_slab;

/*
* Calculate the hash index from an XID.
@@ -51,10 +52,15 @@ int nfsd_reply_cache_init(void)
struct svc_cacherep *rp;
int i;

+ drc_slab = kmem_cache_create("nfsd_drc", sizeof(struct svc_cacherep),
+ 0, 0, NULL);
+ if (!drc_slab)
+ goto out_nomem;
+
INIT_LIST_HEAD(&lru_head);
i = CACHESIZE;
while (i) {
- rp = kmalloc(sizeof(*rp), GFP_KERNEL);
+ rp = kmem_cache_alloc(drc_slab, GFP_KERNEL);
if (!rp)
goto out_nomem;
list_add(&rp->c_lru, &lru_head);
@@ -85,13 +91,18 @@ void nfsd_reply_cache_shutdown(void)
if (rp->c_state == RC_DONE && rp->c_type == RC_REPLBUFF)
kfree(rp->c_replvec.iov_base);
list_del(&rp->c_lru);
- kfree(rp);
+ kmem_cache_free(drc_slab, rp);
}

cache_disabled = 1;

kfree (cache_hash);
cache_hash = NULL;
+
+ if (drc_slab) {
+ kmem_cache_destroy(drc_slab);
+ drc_slab = NULL;
+ }
}

/*
--
1.7.11.7


2013-01-30 15:24:50

by J. Bruce Fields

Subject: Re: [PATCH v1 15/16] nfsd: register a shrinker for DRC cache entries

On Mon, Jan 28, 2013 at 02:41:21PM -0500, Jeff Layton wrote:
> Since we dynamically allocate them now, allow the system to call us
> up to release them if we get low on memory.
>
> We set the "seeks" value very high, to discourage it and release
> them in LRU order.

I don't understand how the mm uses seeks. I guess I'll do some reading.

This looks straightforward enough, but I'm not sure how to judge whether
this is the correct policy.

--b.

> We may end up releasing some prematurely if the
> box is really low on memory.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/nfscache.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 53 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> index bd21230..27edd47 100644
> --- a/fs/nfsd/nfscache.c
> +++ b/fs/nfsd/nfscache.c
> @@ -35,6 +35,18 @@ static inline u32 request_hash(u32 xid)
>
> static int nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *vec);
> static void cache_cleaner_func(struct work_struct *unused);
> +static int nfsd_reply_cache_shrink(struct shrinker *shrink,
> + struct shrink_control *sc);
> +
> +/*
> + * We set "seeks" to a large value here since we really can't recreate these
> + * entries. We're just gambling that they won't actually be needed if and
> + * when the box comes under memory pressure.
> + */
> +struct shrinker nfsd_reply_cache_shrinker = {
> + .shrink = nfsd_reply_cache_shrink,
> + .seeks = 64,
> +};
>
> /*
> * locking for the reply cache:
> @@ -80,6 +92,7 @@ nfsd_reply_cache_free(struct svc_cacherep *rp)
>
> int nfsd_reply_cache_init(void)
> {
> + register_shrinker(&nfsd_reply_cache_shrinker);
> drc_slab = kmem_cache_create("nfsd_drc", sizeof(struct svc_cacherep),
> 0, 0, NULL);
> if (!drc_slab)
> @@ -102,6 +115,7 @@ void nfsd_reply_cache_shutdown(void)
> {
> struct svc_cacherep *rp;
>
> + unregister_shrinker(&nfsd_reply_cache_shrinker);
> cancel_delayed_work_sync(&cache_cleaner);
>
> while (!list_empty(&lru_head)) {
> @@ -152,15 +166,17 @@ nfsd_cache_entry_expired(struct svc_cacherep *rp)
> * if we hit one that isn't old enough, then we can stop the walking. Must
> * be called with cache_lock held.
> */
> -static void
> +static int
> prune_old_cache_entries(void)
> {
> + int freed = 0;
> struct svc_cacherep *rp, *tmp;
>
> list_for_each_entry_safe(rp, tmp, &lru_head, c_lru) {
> if (!nfsd_cache_entry_expired(rp))
> break;
> nfsd_reply_cache_free_locked(rp);
> + ++freed;
> }
>
> /*
> @@ -173,6 +189,8 @@ prune_old_cache_entries(void)
> cancel_delayed_work(&cache_cleaner);
> else
> mod_delayed_work(system_wq, &cache_cleaner, RC_EXPIRE);
> +
> + return freed;
> }
>
> static void
> @@ -183,6 +201,40 @@ cache_cleaner_func(struct work_struct *unused)
> spin_unlock(&cache_lock);
> }
>
> +static int
> +nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
> +{
> + int nr_to_scan = sc->nr_to_scan;
> + unsigned int num;
> +
> + spin_lock(&cache_lock);
> + if (!nr_to_scan)
> + goto out;
> +
> + nr_to_scan -= prune_old_cache_entries();
> +
> + /*
> + * If that didn't free enough, then keep going. It's better to free
> + * some prematurely than to have the box keel over due to lack of
> + * memory.
> + */
> + while (nr_to_scan > 0) {
> + struct svc_cacherep *rp;
> +
> + if (list_empty(&lru_head))
> + break;
> + rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
> + nfsd_reply_cache_free_locked(rp);
> + --nr_to_scan;
> + }
> +
> +out:
> + num = num_drc_entries;
> + spin_unlock(&cache_lock);
> +
> + return num;
> +}
> +
> /*
> * Search the request hash for an entry that matches the given rqstp.
> * Must be called with cache_lock held. Returns the found entry or
> --
> 1.7.11.7
>

2013-01-28 19:51:52

by Chuck Lever III

Subject: Re: [PATCH v1 01/16] nfsd: fix IPv6 address handling in the DRC


On Jan 28, 2013, at 2:41 PM, Jeff Layton <[email protected]> wrote:

> Currently, it only stores the first 16 bytes of any address. struct
> sockaddr_in6 is 28 bytes however, so we're currently ignoring the last
> 12 bytes of the address.
>
> Expand the c_addr field to a sockaddr_in6,

What do you do about link-local addresses?

> and cast it to a sockaddr_in
> as necessary. Also fix the comparator to use the existing RPC
> helpers for this.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/cache.h | 6 +++++-
> fs/nfsd/nfscache.c | 7 +++++--
> include/linux/sunrpc/clnt.h | 4 +++-
> 3 files changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
> index 93cc9d3..2cac76c 100644
> --- a/fs/nfsd/cache.h
> +++ b/fs/nfsd/cache.h
> @@ -12,6 +12,10 @@
>
> /*
> * Representation of a reply cache entry.
> + *
> + * Note that we use a sockaddr_in6 to hold the address instead of the more
> + * typical sockaddr_storage. This is for space reasons, since sockaddr_storage
> + * is much larger than a sockaddr_in6.
> */
> struct svc_cacherep {
> struct hlist_node c_hash;
> @@ -20,7 +24,7 @@ struct svc_cacherep {
> unsigned char c_state, /* unused, inprog, done */
> c_type, /* status, buffer */
> c_secure : 1; /* req came from port < 1024 */
> - struct sockaddr_in c_addr;
> + struct sockaddr_in6 c_addr;
> __be32 c_xid;
> u32 c_prot;
> u32 c_proc;
> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> index 2cbac34..5dd9ec2 100644
> --- a/fs/nfsd/nfscache.c
> +++ b/fs/nfsd/nfscache.c
> @@ -9,6 +9,7 @@
> */
>
> #include <linux/slab.h>
> +#include <linux/sunrpc/clnt.h>
>
> #include "nfsd.h"
> #include "cache.h"
> @@ -146,7 +147,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> xid == rp->c_xid && proc == rp->c_proc &&
> proto == rp->c_prot && vers == rp->c_vers &&
> time_before(jiffies, rp->c_timestamp + 120*HZ) &&
> - memcmp((char*)&rqstp->rq_addr, (char*)&rp->c_addr, sizeof(rp->c_addr))==0) {
> + rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
> + rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr)) {
> nfsdstats.rchits++;
> goto found_entry;
> }
> @@ -183,7 +185,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> rp->c_state = RC_INPROG;
> rp->c_xid = xid;
> rp->c_proc = proc;
> - memcpy(&rp->c_addr, svc_addr_in(rqstp), sizeof(rp->c_addr));
> + rpc_copy_addr((struct sockaddr *)&rp->c_addr, svc_addr(rqstp));
> + rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
> rp->c_prot = proto;
> rp->c_vers = vers;
> rp->c_timestamp = jiffies;
> diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
> index 34206b8..47354a2 100644
> --- a/include/linux/sunrpc/clnt.h
> +++ b/include/linux/sunrpc/clnt.h
> @@ -263,7 +263,9 @@ static inline bool __rpc_copy_addr6(struct sockaddr *dst,
> * @sap1: first sockaddr
> * @sap2: second sockaddr
> *
> - * Just compares the family and address portion. Ignores port, scope, etc.
> + * Just compares the family and address portion. Ignores port, but
> + * compares the scope if it's a link-local address.
> + *
> * Returns true if the addrs are equal, false if they aren't.
> */
> static inline bool rpc_cmp_addr(const struct sockaddr *sap1,
> --
> 1.7.11.7
>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-01-30 15:16:13

by J. Bruce Fields

Subject: Re: [PATCH v1 10/16] nfsd: dynamically allocate DRC entries

On Mon, Jan 28, 2013 at 02:41:16PM -0500, Jeff Layton wrote:
> The existing code keeps a fixed-size cache of 1024 entries. This is
> much too small for a busy server, and wastes memory on an idle one.
>
> This patch changes the code to dynamically allocate and free these
> cache entries. For now, the cache growth is potentially boundless.

I'd prefer to be strict about not introducing regressions, even if
they're fixed a few patches later.

Could we make sure some limit is still enforced in this patch?

--b.

> We'll add some mechanisms to help control that in later patches.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/nfscache.c | 83 +++++++++++++++++++++++-------------------------------
> 1 file changed, 35 insertions(+), 48 deletions(-)
>
> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> index 9d80dfa..ad96f1d 100644
> --- a/fs/nfsd/nfscache.c
> +++ b/fs/nfsd/nfscache.c
> @@ -14,13 +14,8 @@
> #include "nfsd.h"
> #include "cache.h"
>
> -/* Size of reply cache. Common values are:
> - * 4.3BSD: 128
> - * 4.4BSD: 256
> - * Solaris2: 1024
> - * DEC Unix: 512-4096
> - */
> -#define CACHESIZE 1024
> +#define NFSDDBG_FACILITY NFSDDBG_REPCACHE
> +
> #define HASHSIZE 64
>
> static struct hlist_head * cache_hash;
> @@ -67,34 +62,23 @@ nfsd_reply_cache_free_locked(struct svc_cacherep *rp)
> {
> if (rp->c_type == RC_REPLBUFF)
> kfree(rp->c_replvec.iov_base);
> + hlist_del(&rp->c_hash);
> list_del(&rp->c_lru);
> kmem_cache_free(drc_slab, rp);
> }
>
> int nfsd_reply_cache_init(void)
> {
> - int i;
> - struct svc_cacherep *rp;
> -
> drc_slab = kmem_cache_create("nfsd_drc", sizeof(struct svc_cacherep),
> 0, 0, NULL);
> if (!drc_slab)
> goto out_nomem;
>
> - INIT_LIST_HEAD(&lru_head);
> - i = CACHESIZE;
> - while (i) {
> - rp = nfsd_reply_cache_alloc();
> - if (!rp)
> - goto out_nomem;
> - list_add(&rp->c_lru, &lru_head);
> - i--;
> - }
> -
> - cache_hash = kcalloc (HASHSIZE, sizeof(struct hlist_head), GFP_KERNEL);
> + cache_hash = kcalloc(HASHSIZE, sizeof(struct hlist_head), GFP_KERNEL);
> if (!cache_hash)
> goto out_nomem;
>
> + INIT_LIST_HEAD(&lru_head);
> cache_disabled = 0;
> return 0;
> out_nomem:
> @@ -187,7 +171,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)
> int
> nfsd_cache_lookup(struct svc_rqst *rqstp)
> {
> - struct svc_cacherep *rp;
> + struct svc_cacherep *rp, *found;
> __be32 xid = rqstp->rq_xid;
> u32 proto = rqstp->rq_prot,
> vers = rqstp->rq_vers,
> @@ -206,38 +190,40 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> rtn = RC_DOIT;
>
> rp = nfsd_cache_search(rqstp);
> - if (rp) {
> - nfsdstats.rchits++;
> + if (rp)
> goto found_entry;
> - }
> - nfsdstats.rcmisses++;
>
> - /* This loop shouldn't take more than a few iterations normally */
> - {
> - int safe = 0;
> - list_for_each_entry(rp, &lru_head, c_lru) {
> - if (rp->c_state != RC_INPROG)
> - break;
> - if (safe++ > CACHESIZE) {
> - printk("nfsd: loop in repcache LRU list\n");
> - cache_disabled = 1;
> - goto out;
> - }
> - }
> + /*
> + * Try to use the first entry on the LRU, if it's still good then we
> + * know that the later ones will also be.
> + */
> + if (!list_empty(&lru_head)) {
> + rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
> + if (nfsd_cache_entry_expired(rp))
> + goto setup_entry;
> }
>
> - /* All entries on the LRU are in-progress. This should not happen */
> - if (&rp->c_lru == &lru_head) {
> - static int complaints;
> + spin_unlock(&cache_lock);
> + rp = nfsd_reply_cache_alloc();
> + if (!rp) {
> + dprintk("nfsd: unable to allocate DRC entry!\n");
> + return RC_DOIT;
> + }
> + spin_lock(&cache_lock);
>
> - printk(KERN_WARNING "nfsd: all repcache entries locked!\n");
> - if (++complaints > 5) {
> - printk(KERN_WARNING "nfsd: disabling repcache.\n");
> - cache_disabled = 1;
> - }
> - goto out;
> + /*
> + * Must search again just in case someone inserted one
> + * after we dropped the lock above.
> + */
> + found = nfsd_cache_search(rqstp);
> + if (found) {
> + nfsd_reply_cache_free_locked(rp);
> + rp = found;
> + goto found_entry;
> }
>
> +setup_entry:
> + nfsdstats.rcmisses++;
> rqstp->rq_cacherep = rp;
> rp->c_state = RC_INPROG;
> rp->c_xid = xid;
> @@ -261,6 +247,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> return rtn;
>
> found_entry:
> + nfsdstats.rchits++;
> /* We found a matching entry which is either in progress or done. */
> age = jiffies - rp->c_timestamp;
> lru_put_end(rp);
> @@ -291,7 +278,7 @@ found_entry:
> break;
> default:
> printk(KERN_WARNING "nfsd: bad repcache type %d\n", rp->c_type);
> - rp->c_state = RC_UNUSED;
> + nfsd_reply_cache_free_locked(rp);
> }
>
> goto out;
> --
> 1.7.11.7
>

2013-01-28 19:48:02

by Jeff Layton

Subject: [PATCH v1 03/16] nfsd: get rid of RC_INTR

The reply cache code never returns this status.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/cache.h | 3 +--
fs/nfsd/nfssvc.c | 1 -
2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
index 2cac76c..f8c6df8 100644
--- a/fs/nfsd/cache.h
+++ b/fs/nfsd/cache.h
@@ -50,8 +50,7 @@ enum {
enum {
RC_DROPIT,
RC_REPLY,
- RC_DOIT,
- RC_INTR
+ RC_DOIT
};

/*
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index cee62ab..40cb1cb 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -652,7 +652,6 @@ nfsd_dispatch(struct svc_rqst *rqstp, __be32 *statp)

/* Check whether we have this call in the cache. */
switch (nfsd_cache_lookup(rqstp)) {
- case RC_INTR:
case RC_DROPIT:
return 0;
case RC_REPLY:
--
1.7.11.7


2013-01-29 13:34:08

by Jeff Layton

Subject: Re: [PATCH v1 01/16] nfsd: fix IPv6 address handling in the DRC

On Mon, 28 Jan 2013 16:32:55 -0500
Chuck Lever <[email protected]> wrote:

>
> On Jan 28, 2013, at 4:06 PM, Jeff Layton <[email protected]> wrote:
>
> > On Mon, 28 Jan 2013 15:54:26 -0500
> > Chuck Lever <[email protected]> wrote:
> >
> >> Since we don't know for sure how this works, we should have someone create unit tests to explore this, fix any bugs that are discovered, and ensure that it continues to work.
> >>
> >
> > I'm fairly sure I understand how it works (though I did need to go back
> > and refresh my memory here). I'm not sure how you'd go about writing
> > such a test for this, but let me know if you do and I'll be happy to
> > take a look.
>
> I guess I'm abusing the term "unit test." I mean simply that someone has a setup that uses link-local addresses and tries all the typical NFSv4 tests with it, and especially the ones that involve the DRC.
>

Well, there's the rub isn't it...

In order to test that properly, you need a multi-homed server and
clients on each end that have the *same* link-local address, so that the
sockaddr_in6's only differ by scopeid.

I'm not sure how to set something like that up. Maybe with
virtualization somehow? Even if you can though, it's not going to be
easy to "bottle" such a test so you can trivially run it at will to
check for regressions.

--
Jeff Layton <[email protected]>

2013-01-28 21:33:07

by Chuck Lever III

Subject: Re: [PATCH v1 01/16] nfsd: fix IPv6 address handling in the DRC


On Jan 28, 2013, at 4:06 PM, Jeff Layton <[email protected]> wrote:

> On Mon, 28 Jan 2013 15:54:26 -0500
> Chuck Lever <[email protected]> wrote:
>
>> Since we don't know for sure how this works, we should have someone create unit tests to explore this, fix any bugs that are discovered, and ensure that it continues to work.
>>
>
> I'm fairly sure I understand how it works (though I did need to go back
> and refresh my memory here). I'm not sure how you'd go about writing
> such a test for this, but let me know if you do and I'll be happy to
> take a look.

I guess I'm abusing the term "unit test." I mean simply that someone has a setup that uses link-local addresses and tries all the typical NFSv4 tests with it, and especially the ones that involve the DRC.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-01-28 19:48:35

by Jeff Layton

Subject: [PATCH v1 08/16] nfsd: break out hashtable search into separate function

Later, we'll need more than one call site for this, so break it out
into a new function.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 46 +++++++++++++++++++++++++++++++++-------------
1 file changed, 33 insertions(+), 13 deletions(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 634b856..b89e7c8 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -150,6 +150,35 @@ nfsd_cache_entry_expired(struct svc_cacherep *rp)
}

/*
+ * Search the request hash for an entry that matches the given rqstp.
+ * Must be called with cache_lock held. Returns the found entry or
+ * NULL on failure.
+ */
+static struct svc_cacherep *
+nfsd_cache_search(struct svc_rqst *rqstp)
+{
+ struct svc_cacherep *rp;
+ struct hlist_node *hn;
+ struct hlist_head *rh;
+ __be32 xid = rqstp->rq_xid;
+ u32 proto = rqstp->rq_prot,
+ vers = rqstp->rq_vers,
+ proc = rqstp->rq_proc;
+
+ rh = &cache_hash[request_hash(xid)];
+ hlist_for_each_entry(rp, hn, rh, c_hash) {
+ if (rp->c_state != RC_UNUSED &&
+ xid == rp->c_xid && proc == rp->c_proc &&
+ proto == rp->c_prot && vers == rp->c_vers &&
+ !nfsd_cache_entry_expired(rp) &&
+ rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
+ rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr))
+ return rp;
+ }
+ return NULL;
+}
+
+/*
* Try to find an entry matching the current call in the cache. When none
* is found, we grab the oldest unlocked entry off the LRU list.
* Note that no operation within the loop may sleep.
@@ -157,8 +186,6 @@ nfsd_cache_entry_expired(struct svc_cacherep *rp)
int
nfsd_cache_lookup(struct svc_rqst *rqstp)
{
- struct hlist_node *hn;
- struct hlist_head *rh;
struct svc_cacherep *rp;
__be32 xid = rqstp->rq_xid;
u32 proto = rqstp->rq_prot,
@@ -177,17 +204,10 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
spin_lock(&cache_lock);
rtn = RC_DOIT;

- rh = &cache_hash[request_hash(xid)];
- hlist_for_each_entry(rp, hn, rh, c_hash) {
- if (rp->c_state != RC_UNUSED &&
- xid == rp->c_xid && proc == rp->c_proc &&
- proto == rp->c_prot && vers == rp->c_vers &&
- !nfsd_cache_entry_expired(rp) &&
- rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
- rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr)) {
- nfsdstats.rchits++;
- goto found_entry;
- }
+ rp = nfsd_cache_search(rqstp);
+ if (rp) {
+ nfsdstats.rchits++;
+ goto found_entry;
}
nfsdstats.rcmisses++;

--
1.7.11.7


2013-01-28 19:48:36

by Jeff Layton

Subject: [PATCH v1 14/16] nfsd: track the number of DRC entries in the cache

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 896e1c0..bd21230 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -21,6 +21,7 @@
static struct hlist_head * cache_hash;
static struct list_head lru_head;
static struct kmem_cache *drc_slab;
+static unsigned int num_drc_entries;

/*
* Calculate the hash index from an XID.
@@ -65,6 +66,7 @@ nfsd_reply_cache_free_locked(struct svc_cacherep *rp)
kfree(rp->c_replvec.iov_base);
hlist_del(&rp->c_hash);
list_del(&rp->c_lru);
+ --num_drc_entries;
kmem_cache_free(drc_slab, rp);
}

@@ -88,6 +90,7 @@ int nfsd_reply_cache_init(void)
goto out_nomem;

INIT_LIST_HEAD(&lru_head);
+ num_drc_entries = 0;
return 0;
out_nomem:
printk(KERN_ERR "nfsd: failed to allocate reply cache\n");
@@ -257,6 +260,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
return RC_DOIT;
}
spin_lock(&cache_lock);
+ ++num_drc_entries;

/*
* Must search again just in case someone inserted one
--
1.7.11.7


2013-01-28 20:05:15

by Jeff Layton

Subject: Re: [PATCH v1 01/16] nfsd: fix IPv6 address handling in the DRC

On Mon, 28 Jan 2013 11:51:13 -0800 (PST)
Chuck Lever <[email protected]> wrote:

>
> On Jan 28, 2013, at 2:41 PM, Jeff Layton <[email protected]> wrote:
>
> > Currently, it only stores the first 16 bytes of any address. struct
> > sockaddr_in6 is 28 bytes however, so we're currently ignoring the last
> > 12 bytes of the address.
> >
> > Expand the c_addr field to a sockaddr_in6,
>
> What do you do about link-local addresses?
>

I use rpc_cmp_addr() which should handle the scope correctly for
link-local addrs. Now that you mention it though, I think we may have a
bug in __rpc_copy_addr6. It doesn't touch the scope_id at all --
shouldn't we be setting the scopeid in that function as well?


> > and cast it to a sockaddr_in
> > as necessary. Also fix the comparator to use the existing RPC
> > helpers for this.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/nfsd/cache.h | 6 +++++-
> > fs/nfsd/nfscache.c | 7 +++++--
> > include/linux/sunrpc/clnt.h | 4 +++-
> > 3 files changed, 13 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
> > index 93cc9d3..2cac76c 100644
> > --- a/fs/nfsd/cache.h
> > +++ b/fs/nfsd/cache.h
> > @@ -12,6 +12,10 @@
> >
> > /*
> > * Representation of a reply cache entry.
> > + *
> > + * Note that we use a sockaddr_in6 to hold the address instead of the more
> > + * typical sockaddr_storage. This is for space reasons, since sockaddr_storage
> > + * is much larger than a sockaddr_in6.
> > */
> > struct svc_cacherep {
> > struct hlist_node c_hash;
> > @@ -20,7 +24,7 @@ struct svc_cacherep {
> > unsigned char c_state, /* unused, inprog, done */
> > c_type, /* status, buffer */
> > c_secure : 1; /* req came from port < 1024 */
> > - struct sockaddr_in c_addr;
> > + struct sockaddr_in6 c_addr;
> > __be32 c_xid;
> > u32 c_prot;
> > u32 c_proc;
> > diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> > index 2cbac34..5dd9ec2 100644
> > --- a/fs/nfsd/nfscache.c
> > +++ b/fs/nfsd/nfscache.c
> > @@ -9,6 +9,7 @@
> > */
> >
> > #include <linux/slab.h>
> > +#include <linux/sunrpc/clnt.h>
> >
> > #include "nfsd.h"
> > #include "cache.h"
> > @@ -146,7 +147,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> > xid == rp->c_xid && proc == rp->c_proc &&
> > proto == rp->c_prot && vers == rp->c_vers &&
> > time_before(jiffies, rp->c_timestamp + 120*HZ) &&
> > - memcmp((char*)&rqstp->rq_addr, (char*)&rp->c_addr, sizeof(rp->c_addr))==0) {
> > + rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
> > + rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr)) {
> > nfsdstats.rchits++;
> > goto found_entry;
> > }
> > @@ -183,7 +185,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> > rp->c_state = RC_INPROG;
> > rp->c_xid = xid;
> > rp->c_proc = proc;
> > - memcpy(&rp->c_addr, svc_addr_in(rqstp), sizeof(rp->c_addr));
> > + rpc_copy_addr((struct sockaddr *)&rp->c_addr, svc_addr(rqstp));
> > + rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
> > rp->c_prot = proto;
> > rp->c_vers = vers;
> > rp->c_timestamp = jiffies;
> > diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
> > index 34206b8..47354a2 100644
> > --- a/include/linux/sunrpc/clnt.h
> > +++ b/include/linux/sunrpc/clnt.h
> > @@ -263,7 +263,9 @@ static inline bool __rpc_copy_addr6(struct sockaddr *dst,
> > * @sap1: first sockaddr
> > * @sap2: second sockaddr
> > *
> > - * Just compares the family and address portion. Ignores port, scope, etc.
> > + * Just compares the family and address portion. Ignores port, but
> > + * compares the scope if it's a link-local address.
> > + *
> > * Returns true if the addrs are equal, false if they aren't.
> > */
> > static inline bool rpc_cmp_addr(const struct sockaddr *sap1,
> > --
> > 1.7.11.7
> >
>


--
Jeff Layton <[email protected]>

2013-01-30 15:14:13

by J. Bruce Fields

Subject: Re: [PATCH v1 16/16] nfsd: keep a checksum of the first 256 bytes of request

On Mon, Jan 28, 2013 at 02:41:22PM -0500, Jeff Layton wrote:
> Now that we're dynamically allocating these entries, it becomes a lot
> easier to hit problems with XID collisions.

I guess I'd write that "Now that we're allowing more entries".

--b.

> In order to mitigate those,
> checksum up to the first 256 bytes of each request coming in and store
> those in the cache entry, along with the total length of the request.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/cache.h | 5 +++++
> fs/nfsd/nfscache.c | 44 ++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 45 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
> index 9c7232b..4822db3 100644
> --- a/fs/nfsd/cache.h
> +++ b/fs/nfsd/cache.h
> @@ -29,6 +29,8 @@ struct svc_cacherep {
> u32 c_prot;
> u32 c_proc;
> u32 c_vers;
> + unsigned int c_len;
> + u32 c_crc;
> unsigned long c_timestamp;
> union {
> struct kvec u_vec;
> @@ -73,6 +75,9 @@ enum {
> /* Cache entries expire after this time period */
> #define RC_EXPIRE (120 * HZ)
>
> +/* Checksum this amount of the request */
> +#define RC_CSUMLEN (256U)
> +
> int nfsd_reply_cache_init(void);
> void nfsd_reply_cache_shutdown(void);
> int nfsd_cache_lookup(struct svc_rqst *);
> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> index 27edd47..abbf956 100644
> --- a/fs/nfsd/nfscache.c
> +++ b/fs/nfsd/nfscache.c
> @@ -10,6 +10,7 @@
>
> #include <linux/slab.h>
> #include <linux/sunrpc/clnt.h>
> +#include <linux/crc32.h>
>
> #include "nfsd.h"
> #include "cache.h"
> @@ -22,6 +23,7 @@ static struct hlist_head * cache_hash;
> static struct list_head lru_head;
> static struct kmem_cache *drc_slab;
> static unsigned int num_drc_entries;
> +static u32 crc_seed;
>
> /*
> * Calculate the hash index from an XID.
> @@ -103,6 +105,9 @@ int nfsd_reply_cache_init(void)
> goto out_nomem;
>
> INIT_LIST_HEAD(&lru_head);
> +
> + /* Is a random seed any better than some well-defined constant? */
> + get_random_bytes(&crc_seed, sizeof(crc_seed));
> num_drc_entries = 0;
> return 0;
> out_nomem:
> @@ -236,12 +241,37 @@ out:
> }
>
> /*
> + * Walk an xdr_buf and get a CRC for at most the first RC_CSUMLEN bytes
> + */
> +static u32
> +nfsd_cache_crc(struct xdr_buf *buf)
> +{
> + u32 crc;
> + const unsigned char *p = buf->head[0].iov_base;
> + size_t total_len = min(buf->len, RC_CSUMLEN);
> + size_t len = min(buf->head[0].iov_len, total_len);
> +
> + /* rq_arg.head first */
> + crc = crc32(crc_seed, p, len);
> + total_len -= len;
> +
> + /* Nothing left */
> + if (!total_len)
> + return crc;
> +
> + /* checksum the rest from the page_array */
> + p = page_address(buf->pages[0]) + buf->page_base;
> + len = min(buf->len - len, total_len);
> + return crc32(crc, p, len);
> +}
> +
> +/*
> * Search the request hash for an entry that matches the given rqstp.
> * Must be called with cache_lock held. Returns the found entry or
> * NULL on failure.
> */
> static struct svc_cacherep *
> -nfsd_cache_search(struct svc_rqst *rqstp)
> +nfsd_cache_search(struct svc_rqst *rqstp, u32 crc)
> {
> struct svc_cacherep *rp;
> struct hlist_node *hn;
> @@ -255,6 +285,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)
> hlist_for_each_entry(rp, hn, rh, c_hash) {
> if (xid == rp->c_xid && proc == rp->c_proc &&
> proto == rp->c_prot && vers == rp->c_vers &&
> + rqstp->rq_arg.len == rp->c_len && crc == rp->c_crc &&
> rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
> rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr))
> return rp;
> @@ -274,7 +305,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> __be32 xid = rqstp->rq_xid;
> u32 proto = rqstp->rq_prot,
> vers = rqstp->rq_vers,
> - proc = rqstp->rq_proc;
> + proc = rqstp->rq_proc,
> + crc;
> unsigned long age;
> int type = rqstp->rq_cachetype;
> int rtn;
> @@ -285,10 +317,12 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> return RC_DOIT;
> }
>
> + crc = nfsd_cache_crc(&rqstp->rq_arg);
> +
> spin_lock(&cache_lock);
> rtn = RC_DOIT;
>
> - rp = nfsd_cache_search(rqstp);
> + rp = nfsd_cache_search(rqstp, crc);
> if (rp)
> goto found_entry;
>
> @@ -318,7 +352,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> * Must search again just in case someone inserted one
> * after we dropped the lock above.
> */
> - found = nfsd_cache_search(rqstp);
> + found = nfsd_cache_search(rqstp, crc);
> if (found) {
> nfsd_reply_cache_free_locked(rp);
> rp = found;
> @@ -335,6 +369,8 @@ setup_entry:
> rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
> rp->c_prot = proto;
> rp->c_vers = vers;
> + rp->c_len = rqstp->rq_arg.len;
> + rp->c_crc = crc;
>
> hash_refile(rp);
> lru_put_end(rp);
> --
> 1.7.11.7
>

2013-01-28 19:41:58

by Jeff Layton

Subject: [PATCH v1 09/16] nfsd: always move DRC entries to the end of LRU list when updating timestamp

...otherwise, we end up with the list ordering wrong. Currently, it's
not a problem since we skip RC_INPROG entries, but keeping the ordering
strict will be necessary for a later patch that adds a cache cleaner.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index b89e7c8..9d80dfa 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -129,6 +129,7 @@ void nfsd_reply_cache_shutdown(void)
static void
lru_put_end(struct svc_cacherep *rp)
{
+ rp->c_timestamp = jiffies;
list_move_tail(&rp->c_lru, &lru_head);
}

@@ -245,9 +246,9 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
rp->c_prot = proto;
rp->c_vers = vers;
- rp->c_timestamp = jiffies;

hash_refile(rp);
+ lru_put_end(rp);

/* release any buffer */
if (rp->c_type == RC_REPLBUFF) {
@@ -262,7 +263,6 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
found_entry:
/* We found a matching entry which is either in progress or done. */
age = jiffies - rp->c_timestamp;
- rp->c_timestamp = jiffies;
lru_put_end(rp);

rtn = RC_DROPIT;
@@ -354,7 +354,6 @@ nfsd_cache_update(struct svc_rqst *rqstp, int cachetype, __be32 *statp)
rp->c_secure = rqstp->rq_secure;
rp->c_type = cachetype;
rp->c_state = RC_DONE;
- rp->c_timestamp = jiffies;
spin_unlock(&cache_lock);
return;
}
--
1.7.11.7


2013-01-28 19:42:04

by Jeff Layton

Subject: [PATCH v1 11/16] nfsd: remove the cache_disabled flag

Now that we dynamically allocate entries, we never disable the cache
on the fly. Remove this flag.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index ad96f1d..5a39b82 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -20,7 +20,6 @@

static struct hlist_head * cache_hash;
static struct list_head lru_head;
-static int cache_disabled = 1;
static struct kmem_cache *drc_slab;

/*
@@ -79,7 +78,6 @@ int nfsd_reply_cache_init(void)
goto out_nomem;

INIT_LIST_HEAD(&lru_head);
- cache_disabled = 0;
return 0;
out_nomem:
printk(KERN_ERR "nfsd: failed to allocate reply cache\n");
@@ -96,8 +94,6 @@ void nfsd_reply_cache_shutdown(void)
nfsd_reply_cache_free_locked(rp);
}

- cache_disabled = 1;
-
kfree (cache_hash);
cache_hash = NULL;

@@ -181,7 +177,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
int rtn;

rqstp->rq_cacherep = NULL;
- if (cache_disabled || type == RC_NOCACHE) {
+ if (type == RC_NOCACHE) {
nfsdstats.rcnocache++;
return RC_DOIT;
}
@@ -303,11 +299,11 @@ found_entry:
void
nfsd_cache_update(struct svc_rqst *rqstp, int cachetype, __be32 *statp)
{
- struct svc_cacherep *rp;
+ struct svc_cacherep *rp = rqstp->rq_cacherep;
struct kvec *resv = &rqstp->rq_res.head[0], *cachv;
int len;

- if (!(rp = rqstp->rq_cacherep) || cache_disabled)
+ if (!rp)
return;

len = resv->iov_len - ((char*)statp - (char*)resv->iov_base);
--
1.7.11.7


2013-01-30 15:20:38

by Jeff Layton

Subject: Re: [PATCH v1 10/16] nfsd: dynamically allocate DRC entries

On Wed, 30 Jan 2013 10:16:07 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Jan 28, 2013 at 02:41:16PM -0500, Jeff Layton wrote:
> > The existing code keeps a fixed-size cache of 1024 entries. This is
> > much too small for a busy server, and wastes memory on an idle one.
> >
> > This patch changes the code to dynamically allocate and free these
> > cache entries. For now, the cache growth is potentially boundless.
>
> I'd prefer to be strict about not introducing regressions, even if
> they're fixed a few patches later.
>
> Could we make sure some limit is still enforced in this patch?
>
> --b.
>

Sure...I'm pondering some sort of global limit that's sized relative to
the memory on the box anyway. I'll see if I can roll that into this.
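
Something loosely like this, maybe (a sketch only -- the scaling
constants are placeholders, not a tested policy):

/*
 * Sketch: cap the number of DRC entries based on the amount of low
 * memory in the box, in the same spirit as the hash-table sizing
 * heuristics used elsewhere in the kernel.
 */
static unsigned int
nfsd_cache_size_limit(void)
{
	unsigned int limit;
	unsigned long low_pages = totalram_pages - totalhigh_pages;

	limit = (16 * int_sqrt(low_pages)) << (PAGE_SHIFT - 10);
	return min_t(unsigned int, limit, 256 * 1024);
}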

> > We'll add some mechanisms to help control that in later patches.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/nfsd/nfscache.c | 83 +++++++++++++++++++++++-------------------------------
> > 1 file changed, 35 insertions(+), 48 deletions(-)
> >
> > diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> > index 9d80dfa..ad96f1d 100644
> > --- a/fs/nfsd/nfscache.c
> > +++ b/fs/nfsd/nfscache.c
> > @@ -14,13 +14,8 @@
> > #include "nfsd.h"
> > #include "cache.h"
> >
> > -/* Size of reply cache. Common values are:
> > - * 4.3BSD: 128
> > - * 4.4BSD: 256
> > - * Solaris2: 1024
> > - * DEC Unix: 512-4096
> > - */
> > -#define CACHESIZE 1024
> > +#define NFSDDBG_FACILITY NFSDDBG_REPCACHE
> > +
> > #define HASHSIZE 64
> >
> > static struct hlist_head * cache_hash;
> > @@ -67,34 +62,23 @@ nfsd_reply_cache_free_locked(struct svc_cacherep *rp)
> > {
> > if (rp->c_type == RC_REPLBUFF)
> > kfree(rp->c_replvec.iov_base);
> > + hlist_del(&rp->c_hash);
> > list_del(&rp->c_lru);
> > kmem_cache_free(drc_slab, rp);
> > }
> >
> > int nfsd_reply_cache_init(void)
> > {
> > - int i;
> > - struct svc_cacherep *rp;
> > -
> > drc_slab = kmem_cache_create("nfsd_drc", sizeof(struct svc_cacherep),
> > 0, 0, NULL);
> > if (!drc_slab)
> > goto out_nomem;
> >
> > - INIT_LIST_HEAD(&lru_head);
> > - i = CACHESIZE;
> > - while (i) {
> > - rp = nfsd_reply_cache_alloc();
> > - if (!rp)
> > - goto out_nomem;
> > - list_add(&rp->c_lru, &lru_head);
> > - i--;
> > - }
> > -
> > - cache_hash = kcalloc (HASHSIZE, sizeof(struct hlist_head), GFP_KERNEL);
> > + cache_hash = kcalloc(HASHSIZE, sizeof(struct hlist_head), GFP_KERNEL);
> > if (!cache_hash)
> > goto out_nomem;
> >
> > + INIT_LIST_HEAD(&lru_head);
> > cache_disabled = 0;
> > return 0;
> > out_nomem:
> > @@ -187,7 +171,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)
> > int
> > nfsd_cache_lookup(struct svc_rqst *rqstp)
> > {
> > - struct svc_cacherep *rp;
> > + struct svc_cacherep *rp, *found;
> > __be32 xid = rqstp->rq_xid;
> > u32 proto = rqstp->rq_prot,
> > vers = rqstp->rq_vers,
> > @@ -206,38 +190,40 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> > rtn = RC_DOIT;
> >
> > rp = nfsd_cache_search(rqstp);
> > - if (rp) {
> > - nfsdstats.rchits++;
> > + if (rp)
> > goto found_entry;
> > - }
> > - nfsdstats.rcmisses++;
> >
> > - /* This loop shouldn't take more than a few iterations normally */
> > - {
> > - int safe = 0;
> > - list_for_each_entry(rp, &lru_head, c_lru) {
> > - if (rp->c_state != RC_INPROG)
> > - break;
> > - if (safe++ > CACHESIZE) {
> > - printk("nfsd: loop in repcache LRU list\n");
> > - cache_disabled = 1;
> > - goto out;
> > - }
> > - }
> > + /*
> > + * Try to use the first entry on the LRU, if it's still good then we
> > + * know that the later ones will also be.
> > + */
> > + if (!list_empty(&lru_head)) {
> > + rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
> > + if (nfsd_cache_entry_expired(rp))
> > + goto setup_entry;
> > }
> >
> > - /* All entries on the LRU are in-progress. This should not happen */
> > - if (&rp->c_lru == &lru_head) {
> > - static int complaints;
> > + spin_unlock(&cache_lock);
> > + rp = nfsd_reply_cache_alloc();
> > + if (!rp) {
> > + dprintk("nfsd: unable to allocate DRC entry!\n");
> > + return RC_DOIT;
> > + }
> > + spin_lock(&cache_lock);
> >
> > - printk(KERN_WARNING "nfsd: all repcache entries locked!\n");
> > - if (++complaints > 5) {
> > - printk(KERN_WARNING "nfsd: disabling repcache.\n");
> > - cache_disabled = 1;
> > - }
> > - goto out;
> > + /*
> > + * Must search again just in case someone inserted one
> > + * after we dropped the lock above.
> > + */
> > + found = nfsd_cache_search(rqstp);
> > + if (found) {
> > + nfsd_reply_cache_free_locked(rp);
> > + rp = found;
> > + goto found_entry;
> > }
> >
> > +setup_entry:
> > + nfsdstats.rcmisses++;
> > rqstp->rq_cacherep = rp;
> > rp->c_state = RC_INPROG;
> > rp->c_xid = xid;
> > @@ -261,6 +247,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> > return rtn;
> >
> > found_entry:
> > + nfsdstats.rchits++;
> > /* We found a matching entry which is either in progress or done. */
> > age = jiffies - rp->c_timestamp;
> > lru_put_end(rp);
> > @@ -291,7 +278,7 @@ found_entry:
> > break;
> > default:
> > printk(KERN_WARNING "nfsd: bad repcache type %d\n", rp->c_type);
> > - rp->c_state = RC_UNUSED;
> > + nfsd_reply_cache_free_locked(rp);
> > }
> >
> > goto out;
> > --
> > 1.7.11.7
> >


--
Jeff Layton <[email protected]>

2013-01-28 19:42:14

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v1 16/16] nfsd: keep a checksum of the first 256 bytes of request

Now that we're dynamically allocating these entries, it becomes a lot
easier to hit problems with XID collisions. In order to mitigate those,
checksum up to the first 256 bytes of each incoming request and store
that checksum in the cache entry, along with the total length of the
request.
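
As a toy illustration of the false hit this guards against (a
userspace sketch using zlib's crc32(), not the code in the patch
itself):

#include <stdio.h>
#include <zlib.h>

#define RC_CSUMLEN 256U

/* checksum at most the first RC_CSUMLEN bytes, like the patch does */
static unsigned long req_csum(const unsigned char *buf, size_t len)
{
	if (len > RC_CSUMLEN)
		len = RC_CSUMLEN;
	return crc32(0L, buf, (uInt)len);
}

int main(void)
{
	/* two different requests that happen to share an XID */
	const unsigned char a[] = "WRITE fh=0x01 off=0 len=4096";
	const unsigned char b[] = "WRITE fh=0x02 off=0 len=4096";

	/*
	 * An XID-only match would wrongly treat b as a retransmission
	 * of a; comparing the lengths and checksums tells them apart.
	 */
	printf("csum(a)=%lx csum(b)=%lx\n",
	       req_csum(a, sizeof(a)), req_csum(b, sizeof(b)));
	return 0;
}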

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/cache.h | 5 +++++
fs/nfsd/nfscache.c | 44 ++++++++++++++++++++++++++++++++++++++++----
2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
index 9c7232b..4822db3 100644
--- a/fs/nfsd/cache.h
+++ b/fs/nfsd/cache.h
@@ -29,6 +29,8 @@ struct svc_cacherep {
u32 c_prot;
u32 c_proc;
u32 c_vers;
+ unsigned int c_len;
+ u32 c_crc;
unsigned long c_timestamp;
union {
struct kvec u_vec;
@@ -73,6 +75,9 @@ enum {
/* Cache entries expire after this time period */
#define RC_EXPIRE (120 * HZ)

+/* Checksum this amount of the request */
+#define RC_CSUMLEN (256U)
+
int nfsd_reply_cache_init(void);
void nfsd_reply_cache_shutdown(void);
int nfsd_cache_lookup(struct svc_rqst *);
diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 27edd47..abbf956 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -10,6 +10,7 @@

#include <linux/slab.h>
#include <linux/sunrpc/clnt.h>
+#include <linux/crc32.h>

#include "nfsd.h"
#include "cache.h"
@@ -22,6 +23,7 @@ static struct hlist_head * cache_hash;
static struct list_head lru_head;
static struct kmem_cache *drc_slab;
static unsigned int num_drc_entries;
+static u32 crc_seed;

/*
* Calculate the hash index from an XID.
@@ -103,6 +105,9 @@ int nfsd_reply_cache_init(void)
goto out_nomem;

INIT_LIST_HEAD(&lru_head);
+
+ /* Is a random seed any better than some well-defined constant? */
+ get_random_bytes(&crc_seed, sizeof(crc_seed));
num_drc_entries = 0;
return 0;
out_nomem:
@@ -236,12 +241,37 @@ out:
}

/*
+ * Walk an xdr_buf and get a CRC for at most the first RC_CSUMLEN bytes
+ */
+static u32
+nfsd_cache_crc(struct xdr_buf *buf)
+{
+ u32 crc;
+ const unsigned char *p = buf->head[0].iov_base;
+ size_t total_len = min(buf->len, RC_CSUMLEN);
+ size_t len = min(buf->head[0].iov_len, total_len);
+
+ /* rq_arg.head first */
+ crc = crc32(crc_seed, p, len);
+ total_len -= len;
+
+ /* Nothing left */
+ if (!total_len)
+ return crc;
+
+ /* checksum the rest from the page_array */
+ p = page_address(buf->pages[0]) + buf->page_base;
+ len = min(buf->len - len, total_len);
+ return crc32(crc, p, len);
+}
+
+/*
* Search the request hash for an entry that matches the given rqstp.
* Must be called with cache_lock held. Returns the found entry or
* NULL on failure.
*/
static struct svc_cacherep *
-nfsd_cache_search(struct svc_rqst *rqstp)
+nfsd_cache_search(struct svc_rqst *rqstp, u32 crc)
{
struct svc_cacherep *rp;
struct hlist_node *hn;
@@ -255,6 +285,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)
hlist_for_each_entry(rp, hn, rh, c_hash) {
if (xid == rp->c_xid && proc == rp->c_proc &&
proto == rp->c_prot && vers == rp->c_vers &&
+ rqstp->rq_arg.len == rp->c_len && crc == rp->c_crc &&
rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr))
return rp;
@@ -274,7 +305,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
__be32 xid = rqstp->rq_xid;
u32 proto = rqstp->rq_prot,
vers = rqstp->rq_vers,
- proc = rqstp->rq_proc;
+ proc = rqstp->rq_proc,
+ crc;
unsigned long age;
int type = rqstp->rq_cachetype;
int rtn;
@@ -285,10 +317,12 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
return RC_DOIT;
}

+ crc = nfsd_cache_crc(&rqstp->rq_arg);
+
spin_lock(&cache_lock);
rtn = RC_DOIT;

- rp = nfsd_cache_search(rqstp);
+ rp = nfsd_cache_search(rqstp, crc);
if (rp)
goto found_entry;

@@ -318,7 +352,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
* Must search again just in case someone inserted one
* after we dropped the lock above.
*/
- found = nfsd_cache_search(rqstp);
+ found = nfsd_cache_search(rqstp, crc);
if (found) {
nfsd_reply_cache_free_locked(rp);
rp = found;
@@ -335,6 +369,8 @@ setup_entry:
rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
rp->c_prot = proto;
rp->c_vers = vers;
+ rp->c_len = rqstp->rq_arg.len;
+ rp->c_crc = crc;

hash_refile(rp);
lru_put_end(rp);
--
1.7.11.7


2013-01-28 19:42:05

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v1 12/16] nfsd: when updating an entry with RC_NOCACHE, just free it

There's no need to keep entries around that we're declaring RC_NOCACHE.
Ditto if there's a problem with the entry.

With this change too, there's no need to test for RC_UNUSED in the
search function. If the entry's in the hash table then it's either
INPROG or DONE.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 5a39b82..6b4693c 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -66,6 +66,14 @@ nfsd_reply_cache_free_locked(struct svc_cacherep *rp)
kmem_cache_free(drc_slab, rp);
}

+static void
+nfsd_reply_cache_free(struct svc_cacherep *rp)
+{
+ spin_lock(&cache_lock);
+ nfsd_reply_cache_free_locked(rp);
+ spin_unlock(&cache_lock);
+}
+
int nfsd_reply_cache_init(void)
{
drc_slab = kmem_cache_create("nfsd_drc", sizeof(struct svc_cacherep),
@@ -148,8 +156,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)

rh = &cache_hash[request_hash(xid)];
hlist_for_each_entry(rp, hn, rh, c_hash) {
- if (rp->c_state != RC_UNUSED &&
- xid == rp->c_xid && proc == rp->c_proc &&
+ if (xid == rp->c_xid && proc == rp->c_proc &&
proto == rp->c_prot && vers == rp->c_vers &&
!nfsd_cache_entry_expired(rp) &&
rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
@@ -311,7 +318,7 @@ nfsd_cache_update(struct svc_rqst *rqstp, int cachetype, __be32 *statp)

/* Don't cache excessive amounts of data and XDR failures */
if (!statp || len > (256 >> 2)) {
- rp->c_state = RC_UNUSED;
+ nfsd_reply_cache_free(rp);
return;
}

@@ -325,12 +332,15 @@ nfsd_cache_update(struct svc_rqst *rqstp, int cachetype, __be32 *statp)
cachv = &rp->c_replvec;
cachv->iov_base = kmalloc(len << 2, GFP_KERNEL);
if (!cachv->iov_base) {
- rp->c_state = RC_UNUSED;
+ nfsd_reply_cache_free(rp);
return;
}
cachv->iov_len = len << 2;
memcpy(cachv->iov_base, statp, len << 2);
break;
+ case RC_NOCACHE:
+ nfsd_reply_cache_free(rp);
+ return;
}
spin_lock(&cache_lock);
lru_put_end(rp);
--
1.7.11.7


2013-01-28 19:49:05

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v1 13/16] nfsd: add recurring workqueue job to clean the cache

It's not sufficient to only clean the cache when requests come in. What
if we have a flurry of activity and then the server goes idle? Add a
workqueue job that will clean the cache every RC_EXPIRE period.

Care is taken to only run this when we expect to have entries expiring.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 47 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 6b4693c..896e1c0 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -33,6 +33,7 @@ static inline u32 request_hash(u32 xid)
}

static int nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *vec);
+static void cache_cleaner_func(struct work_struct *unused);

/*
* locking for the reply cache:
@@ -40,6 +41,7 @@ static int nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *vec);
 * Otherwise, when accessing _prev or _next, the lock must be held.
*/
static DEFINE_SPINLOCK(cache_lock);
+static DECLARE_DELAYED_WORK(cache_cleaner, cache_cleaner_func);

static struct svc_cacherep *
nfsd_reply_cache_alloc(void)
@@ -97,6 +99,8 @@ void nfsd_reply_cache_shutdown(void)
{
struct svc_cacherep *rp;

+ cancel_delayed_work_sync(&cache_cleaner);
+
while (!list_empty(&lru_head)) {
rp = list_entry(lru_head.next, struct svc_cacherep, c_lru);
nfsd_reply_cache_free_locked(rp);
@@ -112,13 +116,15 @@ void nfsd_reply_cache_shutdown(void)
}

/*
- * Move cache entry to end of LRU list
+ * Move cache entry to end of LRU list, and queue the cleaner to run if it's
+ * not already scheduled.
*/
static void
lru_put_end(struct svc_cacherep *rp)
{
rp->c_timestamp = jiffies;
list_move_tail(&rp->c_lru, &lru_head);
+ schedule_delayed_work(&cache_cleaner, RC_EXPIRE);
}

/*
@@ -139,6 +145,42 @@ nfsd_cache_entry_expired(struct svc_cacherep *rp)
}

/*
+ * Walk the LRU list and prune off entries that are older than RC_EXPIRE.
+ * If we hit one that isn't old enough, then we can stop walking. Must
+ * be called with cache_lock held.
+ */
+static void
+prune_old_cache_entries(void)
+{
+ struct svc_cacherep *rp, *tmp;
+
+ list_for_each_entry_safe(rp, tmp, &lru_head, c_lru) {
+ if (!nfsd_cache_entry_expired(rp))
+ break;
+ nfsd_reply_cache_free_locked(rp);
+ }
+
+ /*
+ * Conditionally rearm the job. If we cleaned out the list, then
+ * cancel any pending run (since there won't be any work to do).
+ * Otherwise, we rearm the job or modify the existing one to run in
+ * RC_EXPIRE since we know that nothing will expire until then.
+ */
+ if (list_empty(&lru_head))
+ cancel_delayed_work(&cache_cleaner);
+ else
+ mod_delayed_work(system_wq, &cache_cleaner, RC_EXPIRE);
+}
+
+static void
+cache_cleaner_func(struct work_struct *unused)
+{
+ spin_lock(&cache_lock);
+ prune_old_cache_entries();
+ spin_unlock(&cache_lock);
+}
+
+/*
* Search the request hash for an entry that matches the given rqstp.
* Must be called with cache_lock held. Returns the found entry or
* NULL on failure.
@@ -158,7 +200,6 @@ nfsd_cache_search(struct svc_rqst *rqstp)
hlist_for_each_entry(rp, hn, rh, c_hash) {
if (xid == rp->c_xid && proc == rp->c_proc &&
proto == rp->c_prot && vers == rp->c_vers &&
- !nfsd_cache_entry_expired(rp) &&
rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr))
return rp;
@@ -202,8 +243,11 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
*/
if (!list_empty(&lru_head)) {
rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
- if (nfsd_cache_entry_expired(rp))
+ if (nfsd_cache_entry_expired(rp)) {
+ lru_put_end(rp);
+ prune_old_cache_entries();
goto setup_entry;
+ }
}

spin_unlock(&cache_lock);
--
1.7.11.7


2013-01-31 21:06:55

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 01/16] nfsd: fix IPv6 address handling in the DRC

On Thu, 31 Jan 2013 15:59:01 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Jan 28, 2013 at 02:41:07PM -0500, Jeff Layton wrote:
> > Currently, it only stores the first 16 bytes of any address. struct
> > sockaddr_in6 is 28 bytes however, so we're currently ignoring the last
> > 12 bytes of the address.
> >
> > Expand the c_addr field to a sockaddr_in6, and cast it to a sockaddr_in
> > as necessary. Also fix the comparator to use the existing RPC
> > helpers for this.
>
> Applying. (Assuming the fix for link-local case will be separate.)
>
> --b.
>

Thanks. Yep, I'll do that as a separate patch.

--
Jeff Layton <[email protected]>

2013-01-28 19:49:27

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v1 10/16] nfsd: dynamically allocate DRC entries

The existing code keeps a fixed-size cache of 1024 entries. This is
much too small for a busy server, and wastes memory on an idle one.

This patch changes the code to dynamically allocate and free these
cache entries. For now, the cache growth is potentially boundless.
We'll add some mechanisms to help control that in later patches.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 83 +++++++++++++++++++++++-------------------------------
1 file changed, 35 insertions(+), 48 deletions(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 9d80dfa..ad96f1d 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -14,13 +14,8 @@
#include "nfsd.h"
#include "cache.h"

-/* Size of reply cache. Common values are:
- * 4.3BSD: 128
- * 4.4BSD: 256
- * Solaris2: 1024
- * DEC Unix: 512-4096
- */
-#define CACHESIZE 1024
+#define NFSDDBG_FACILITY NFSDDBG_REPCACHE
+
#define HASHSIZE 64

static struct hlist_head * cache_hash;
@@ -67,34 +62,23 @@ nfsd_reply_cache_free_locked(struct svc_cacherep *rp)
{
if (rp->c_type == RC_REPLBUFF)
kfree(rp->c_replvec.iov_base);
+ hlist_del(&rp->c_hash);
list_del(&rp->c_lru);
kmem_cache_free(drc_slab, rp);
}

int nfsd_reply_cache_init(void)
{
- int i;
- struct svc_cacherep *rp;
-
drc_slab = kmem_cache_create("nfsd_drc", sizeof(struct svc_cacherep),
0, 0, NULL);
if (!drc_slab)
goto out_nomem;

- INIT_LIST_HEAD(&lru_head);
- i = CACHESIZE;
- while (i) {
- rp = nfsd_reply_cache_alloc();
- if (!rp)
- goto out_nomem;
- list_add(&rp->c_lru, &lru_head);
- i--;
- }
-
- cache_hash = kcalloc (HASHSIZE, sizeof(struct hlist_head), GFP_KERNEL);
+ cache_hash = kcalloc(HASHSIZE, sizeof(struct hlist_head), GFP_KERNEL);
if (!cache_hash)
goto out_nomem;

+ INIT_LIST_HEAD(&lru_head);
cache_disabled = 0;
return 0;
out_nomem:
@@ -187,7 +171,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)
int
nfsd_cache_lookup(struct svc_rqst *rqstp)
{
- struct svc_cacherep *rp;
+ struct svc_cacherep *rp, *found;
__be32 xid = rqstp->rq_xid;
u32 proto = rqstp->rq_prot,
vers = rqstp->rq_vers,
@@ -206,38 +190,40 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
rtn = RC_DOIT;

rp = nfsd_cache_search(rqstp);
- if (rp) {
- nfsdstats.rchits++;
+ if (rp)
goto found_entry;
- }
- nfsdstats.rcmisses++;

- /* This loop shouldn't take more than a few iterations normally */
- {
- int safe = 0;
- list_for_each_entry(rp, &lru_head, c_lru) {
- if (rp->c_state != RC_INPROG)
- break;
- if (safe++ > CACHESIZE) {
- printk("nfsd: loop in repcache LRU list\n");
- cache_disabled = 1;
- goto out;
- }
- }
+ /*
+ * Try to use the first entry on the LRU, if it's still good then we
+ * know that the later ones will also be.
+ */
+ if (!list_empty(&lru_head)) {
+ rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
+ if (nfsd_cache_entry_expired(rp))
+ goto setup_entry;
}

- /* All entries on the LRU are in-progress. This should not happen */
- if (&rp->c_lru == &lru_head) {
- static int complaints;
+ spin_unlock(&cache_lock);
+ rp = nfsd_reply_cache_alloc();
+ if (!rp) {
+ dprintk("nfsd: unable to allocate DRC entry!\n");
+ return RC_DOIT;
+ }
+ spin_lock(&cache_lock);

- printk(KERN_WARNING "nfsd: all repcache entries locked!\n");
- if (++complaints > 5) {
- printk(KERN_WARNING "nfsd: disabling repcache.\n");
- cache_disabled = 1;
- }
- goto out;
+ /*
+ * Must search again just in case someone inserted one
+ * after we dropped the lock above.
+ */
+ found = nfsd_cache_search(rqstp);
+ if (found) {
+ nfsd_reply_cache_free_locked(rp);
+ rp = found;
+ goto found_entry;
}

+setup_entry:
+ nfsdstats.rcmisses++;
rqstp->rq_cacherep = rp;
rp->c_state = RC_INPROG;
rp->c_xid = xid;
@@ -261,6 +247,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
return rtn;

found_entry:
+ nfsdstats.rchits++;
/* We found a matching entry which is either in progress or done. */
age = jiffies - rp->c_timestamp;
lru_put_end(rp);
@@ -291,7 +278,7 @@ found_entry:
break;
default:
printk(KERN_WARNING "nfsd: bad repcache type %d\n", rp->c_type);
- rp->c_state = RC_UNUSED;
+ nfsd_reply_cache_free_locked(rp);
}

goto out;
--
1.7.11.7


2013-01-28 19:47:07

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v1 06/16] nfsd: remove redundant test from nfsd_reply_cache_free

Entries can only have a c_type of RC_REPLBUFF if they are also
RC_DONE. Therefore the test for RC_DONE isn't necessary here.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 363bc61..2cdc4be 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -65,7 +65,7 @@ nfsd_reply_cache_alloc(void)
static void
nfsd_reply_cache_free_locked(struct svc_cacherep *rp)
{
- if (rp->c_state == RC_DONE && rp->c_type == RC_REPLBUFF)
+ if (rp->c_type == RC_REPLBUFF)
kfree(rp->c_replvec.iov_base);
list_del(&rp->c_lru);
kmem_cache_free(drc_slab, rp);
--
1.7.11.7


2013-01-28 20:54:37

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v1 01/16] nfsd: fix IPv6 address handling in the DRC


On Jan 28, 2013, at 3:45 PM, Jeff Layton <[email protected]> wrote:

> On Mon, 28 Jan 2013 15:34:49 -0500
> Chuck Lever <[email protected]> wrote:
>
>>
>> On Jan 28, 2013, at 3:26 PM, Jeff Layton <[email protected]> wrote:
>>
>>> On Mon, 28 Jan 2013 12:16:17 -0800 (PST)
>>> Chuck Lever <[email protected]> wrote:
>>>
>>>>
>>>> On Jan 28, 2013, at 3:05 PM, Jeff Layton <[email protected]> wrote:
>>>>
>>>>> On Mon, 28 Jan 2013 11:51:13 -0800 (PST)
>>>>> Chuck Lever <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> On Jan 28, 2013, at 2:41 PM, Jeff Layton <[email protected]> wrote:
>>>>>>
>>>>>>> Currently, it only stores the first 16 bytes of any address. struct
>>>>>>> sockaddr_in6 is 28 bytes however, so we're currently ignoring the last
>>>>>>> 12 bytes of the address.
>>>>>>>
>>>>>>> Expand the c_addr field to a sockaddr_in6,
>>>>>>
>>>>>> What do you do about link-local addresses?
>>>>>>
>>>>>
>>>>> I use rpc_cmp_addr() which should handle the scope correctly for
>>>>> link-local addrs. Now that you mention it though, I think we may have a
>>>>> bug in __rpc_copy_addr6. It doesn't touch the scope_id at all --
>>>>> shouldn't we be setting the scopeid in that function as well?
>>>>
>>>> It looks like rpc_copy_addr() is invoked to copy only source addresses, so the scope ID probably doesn't apply. The comment over rpc_copy_addr() says it explicitly ignores port and scope ID.
>>>>
>>>
>>> Well, it copies the source address of the client, but all of the
>>> callers are in nfsd code. So I think it probably does matter:
>>>
>>> Consider a server that has 2 interfaces, and 2 clients with identical
>>> link-local addresses hitting the link local address of each interface.
>>> The only way to distinguish them at that point is to look at their
>>> scope_ids.
>>
>> Yes, but we want to use the server's scope IDs in this case, don't we? Copying the client's scope IDs may not be the right thing to do. That's why __rpc_copy_addr6() doesn't copy the scope ID.
>>
>
> Is there any real distinction between the two? Whenever we start
> discussing scopeids I mentally substitute "interface index", since it's
> the same thing on linux.
>
> AIUI, scopeids only have a meaning on the host itself. When a
> link-local address gets used, the sockaddr_in6 in question gets the
> sin6_scope_id set to the index of the interface to which it is
> associated.
>
> So if we have a client and server "pair" using link-local addresses,
> then they should have equivalent scope IDs since they are associated
> with the same interface on the local machine.

I don't think we can assume that the interface indices are the same on both the client and server. There is no co-ordination of that number space between separate hosts.

What may be the case is that the server's network layer copies in the correct scope ID for the server when it exposes source addresses to its upper layers. In that case, the ID's probably safe to use when comparing addresses locally on the server.

Since we don't know for sure how this works, we should have someone create unit tests to explore this, fix any bugs that are discovered, and ensure that it continues to work.
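
For instance, such a test might start with something like this
(userspace sketch, illustration only): accept a connection on an IPv6
socket and see whether the kernel fills in sin6_scope_id when the peer
comes in over a link-local address. Error handling is omitted, and you
would drive it with a client connecting from fe80::/10.

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in6 addr, peer;
	socklen_t plen = sizeof(peer);
	int s = socket(AF_INET6, SOCK_STREAM, 0);

	memset(&addr, 0, sizeof(addr));
	addr.sin6_family = AF_INET6;
	addr.sin6_addr = in6addr_any;
	addr.sin6_port = htons(9999);
	bind(s, (struct sockaddr *)&addr, sizeof(addr));
	listen(s, 1);

	/* for a link-local peer, scope_id should be nonzero */
	if (accept(s, (struct sockaddr *)&peer, &plen) >= 0)
		printf("peer scope_id = %u\n", peer.sin6_scope_id);
	return 0;
}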

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-01-28 20:35:03

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v1 01/16] nfsd: fix IPv6 address handling in the DRC


On Jan 28, 2013, at 3:26 PM, Jeff Layton <[email protected]> wrote:

> On Mon, 28 Jan 2013 12:16:17 -0800 (PST)
> Chuck Lever <[email protected]> wrote:
>
>>
>> On Jan 28, 2013, at 3:05 PM, Jeff Layton <[email protected]> wrote:
>>
>>> On Mon, 28 Jan 2013 11:51:13 -0800 (PST)
>>> Chuck Lever <[email protected]> wrote:
>>>
>>>>
>>>> On Jan 28, 2013, at 2:41 PM, Jeff Layton <[email protected]> wrote:
>>>>
>>>>> Currently, it only stores the first 16 bytes of any address. struct
>>>>> sockaddr_in6 is 28 bytes however, so we're currently ignoring the last
>>>>> 12 bytes of the address.
>>>>>
>>>>> Expand the c_addr field to a sockaddr_in6,
>>>>
>>>> What do you do about link-local addresses?
>>>>
>>>
>>> I use rpc_cmp_addr() which should handle the scope correctly for
>>> link-local addrs. Now that you mention it though, I think we may have a
>>> bug in __rpc_copy_addr6. It doesn't touch the scope_id at all --
>>> shouldn't we be setting the scopeid in that function as well?
>>
>> It looks like rpc_copy_addr() is invoked to copy only source addresses, so the scope ID probably doesn't apply. The comment over rpc_copy_addr() says it explicitly ignores port and scope ID.
>>
>
> Well, it copies the source address of the client, but all of the
> callers are in nfsd code. So I think it probably does matter:
>
> Consider a server that has 2 interfaces, and 2 clients with identical
> link-local addresses hitting the link local address of each interface.
> The only way to distinguish them at that point is to look at their
> scope_ids.

Yes, but we want to use the server's scope IDs in this case, don't we? Copying the client's scope IDs may not be the right thing to do. That's why __rpc_copy_addr6() doesn't copy the scope ID.

> In the event that we're not using link-local addresses, the scopeid
> should just be ignored. So I think we probably need something like this
> patch too:
>
> ----------------------[snip]---------------------
>
> sunrpc: copy scope ID in __rpc_copy_addr6
>
> When copying an address, we should also copy the scopeid in the event
> that this is a link-local address and the scope matters.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> include/linux/sunrpc/clnt.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
> index 34206b8..4824286 100644
> --- a/include/linux/sunrpc/clnt.h
> +++ b/include/linux/sunrpc/clnt.h
> @@ -242,6 +242,7 @@ static inline bool __rpc_copy_addr6(struct sockaddr *dst,
>
> dsin6->sin6_family = ssin6->sin6_family;
> dsin6->sin6_addr = ssin6->sin6_addr;
> + dsin6->sin6_scope_id = ssin6->sin6_scope_id;

If you find this is the correct thing to do, you should fix the documenting comment before rpc_copy_addr().


> return true;
> }
> #else /* !(IS_ENABLED(CONFIG_IPV6) */
> --
> 1.7.11.7
>
>
> --
> Jeff Layton <[email protected]>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-01-28 19:49:23

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v1 05/16] nfsd: add alloc and free functions for DRC entries

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 36 +++++++++++++++++++++++++++---------
1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 4aad9e4..363bc61 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -47,10 +47,34 @@ static int nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *vec);
*/
static DEFINE_SPINLOCK(cache_lock);

-int nfsd_reply_cache_init(void)
+static struct svc_cacherep *
+nfsd_reply_cache_alloc(void)
{
struct svc_cacherep *rp;
+
+ rp = kmem_cache_alloc(drc_slab, GFP_KERNEL);
+ if (rp) {
+ rp->c_state = RC_UNUSED;
+ rp->c_type = RC_NOCACHE;
+ INIT_LIST_HEAD(&rp->c_lru);
+ INIT_HLIST_NODE(&rp->c_hash);
+ }
+ return rp;
+}
+
+static void
+nfsd_reply_cache_free_locked(struct svc_cacherep *rp)
+{
+ if (rp->c_state == RC_DONE && rp->c_type == RC_REPLBUFF)
+ kfree(rp->c_replvec.iov_base);
+ list_del(&rp->c_lru);
+ kmem_cache_free(drc_slab, rp);
+}
+
+int nfsd_reply_cache_init(void)
+{
int i;
+ struct svc_cacherep *rp;

drc_slab = kmem_cache_create("nfsd_drc", sizeof(struct svc_cacherep),
0, 0, NULL);
@@ -60,13 +84,10 @@ int nfsd_reply_cache_init(void)
INIT_LIST_HEAD(&lru_head);
i = CACHESIZE;
while (i) {
- rp = kmem_cache_alloc(drc_slab, GFP_KERNEL);
+ rp = nfsd_reply_cache_alloc();
if (!rp)
goto out_nomem;
list_add(&rp->c_lru, &lru_head);
- rp->c_state = RC_UNUSED;
- rp->c_type = RC_NOCACHE;
- INIT_HLIST_NODE(&rp->c_hash);
i--;
}

@@ -88,10 +109,7 @@ void nfsd_reply_cache_shutdown(void)

while (!list_empty(&lru_head)) {
rp = list_entry(lru_head.next, struct svc_cacherep, c_lru);
- if (rp->c_state == RC_DONE && rp->c_type == RC_REPLBUFF)
- kfree(rp->c_replvec.iov_base);
- list_del(&rp->c_lru);
- kmem_cache_free(drc_slab, rp);
+ nfsd_reply_cache_free_locked(rp);
}

cache_disabled = 1;
--
1.7.11.7


2013-01-28 19:41:52

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v1 07/16] nfsd: clean up and clarify the cache expiration code

Add a preprocessor constant for the expiry time of cache entries, and
move the test for an expired entry into a function. Note that the current
code does not test for RC_INPROG. It just assumes that it won't take more
than 2 minutes to fill out an in-progress entry.

I'm not sure how valid that assumption is though, so let's just ensure
that we never consider an RC_INPROG entry to be expired.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/cache.h | 3 +++
fs/nfsd/nfscache.c | 9 ++++++++-
2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
index f8c6df8..9c7232b 100644
--- a/fs/nfsd/cache.h
+++ b/fs/nfsd/cache.h
@@ -70,6 +70,9 @@ enum {
*/
#define RC_DELAY (HZ/5)

+/* Cache entries expire after this time period */
+#define RC_EXPIRE (120 * HZ)
+
int nfsd_reply_cache_init(void);
void nfsd_reply_cache_shutdown(void);
int nfsd_cache_lookup(struct svc_rqst *);
diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 2cdc4be..634b856 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -142,6 +142,13 @@ hash_refile(struct svc_cacherep *rp)
hlist_add_head(&rp->c_hash, cache_hash + request_hash(rp->c_xid));
}

+static inline bool
+nfsd_cache_entry_expired(struct svc_cacherep *rp)
+{
+ return rp->c_state != RC_INPROG &&
+ time_after(jiffies, rp->c_timestamp + RC_EXPIRE);
+}
+
/*
* Try to find an entry matching the current call in the cache. When none
* is found, we grab the oldest unlocked entry off the LRU list.
@@ -175,7 +182,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
if (rp->c_state != RC_UNUSED &&
xid == rp->c_xid && proc == rp->c_proc &&
proto == rp->c_prot && vers == rp->c_vers &&
- time_before(jiffies, rp->c_timestamp + 120*HZ) &&
+ !nfsd_cache_entry_expired(rp) &&
rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr)) {
nfsdstats.rchits++;
--
1.7.11.7


2013-01-28 19:41:35

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v1 01/16] nfsd: fix IPv6 address handling in the DRC

Currently, it only stores the first 16 bytes of any address. struct
sockaddr_in6 is 28 bytes however, so we're currently ignoring the last
12 bytes of the address.

Expand the c_addr field to a sockaddr_in6, and cast it to a sockaddr_in
as necessary. Also fix the comparator to use the existing RPC
helpers for this.
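
For reference, a quick userspace check of the sizes in question
(typical values for Linux/glibc; illustration only, not part of the
patch):

#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
	printf("sockaddr_in:      %zu\n", sizeof(struct sockaddr_in));      /* 16 */
	printf("sockaddr_in6:     %zu\n", sizeof(struct sockaddr_in6));     /* 28 */
	printf("sockaddr_storage: %zu\n", sizeof(struct sockaddr_storage)); /* 128 */
	return 0;
}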

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/cache.h | 6 +++++-
fs/nfsd/nfscache.c | 7 +++++--
include/linux/sunrpc/clnt.h | 4 +++-
3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
index 93cc9d3..2cac76c 100644
--- a/fs/nfsd/cache.h
+++ b/fs/nfsd/cache.h
@@ -12,6 +12,10 @@

/*
* Representation of a reply cache entry.
+ *
+ * Note that we use a sockaddr_in6 to hold the address instead of the more
+ * typical sockaddr_storage. This is for space reasons, since sockaddr_storage
+ * is much larger than a sockaddr_in6.
*/
struct svc_cacherep {
struct hlist_node c_hash;
@@ -20,7 +24,7 @@ struct svc_cacherep {
unsigned char c_state, /* unused, inprog, done */
c_type, /* status, buffer */
c_secure : 1; /* req came from port < 1024 */
- struct sockaddr_in c_addr;
+ struct sockaddr_in6 c_addr;
__be32 c_xid;
u32 c_prot;
u32 c_proc;
diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 2cbac34..5dd9ec2 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -9,6 +9,7 @@
*/

#include <linux/slab.h>
+#include <linux/sunrpc/clnt.h>

#include "nfsd.h"
#include "cache.h"
@@ -146,7 +147,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
xid == rp->c_xid && proc == rp->c_proc &&
proto == rp->c_prot && vers == rp->c_vers &&
time_before(jiffies, rp->c_timestamp + 120*HZ) &&
- memcmp((char*)&rqstp->rq_addr, (char*)&rp->c_addr, sizeof(rp->c_addr))==0) {
+ rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
+ rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr)) {
nfsdstats.rchits++;
goto found_entry;
}
@@ -183,7 +185,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
rp->c_state = RC_INPROG;
rp->c_xid = xid;
rp->c_proc = proc;
- memcpy(&rp->c_addr, svc_addr_in(rqstp), sizeof(rp->c_addr));
+ rpc_copy_addr((struct sockaddr *)&rp->c_addr, svc_addr(rqstp));
+ rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
rp->c_prot = proto;
rp->c_vers = vers;
rp->c_timestamp = jiffies;
diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 34206b8..47354a2 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -263,7 +263,9 @@ static inline bool __rpc_copy_addr6(struct sockaddr *dst,
* @sap1: first sockaddr
* @sap2: second sockaddr
*
- * Just compares the family and address portion. Ignores port, scope, etc.
+ * Just compares the family and address portion. Ignores port, but
+ * compares the scope if it's a link-local address.
+ *
* Returns true if the addrs are equal, false if they aren't.
*/
static inline bool rpc_cmp_addr(const struct sockaddr *sap1,
--
1.7.11.7


2013-01-28 20:46:03

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 01/16] nfsd: fix IPv6 address handling in the DRC

On Mon, 28 Jan 2013 15:34:49 -0500
Chuck Lever <[email protected]> wrote:

>
> On Jan 28, 2013, at 3:26 PM, Jeff Layton <[email protected]> wrote:
>
> > On Mon, 28 Jan 2013 12:16:17 -0800 (PST)
> > Chuck Lever <[email protected]> wrote:
> >
> >>
> >> On Jan 28, 2013, at 3:05 PM, Jeff Layton <[email protected]> wrote:
> >>
> >>> On Mon, 28 Jan 2013 11:51:13 -0800 (PST)
> >>> Chuck Lever <[email protected]> wrote:
> >>>
> >>>>
> >>>> On Jan 28, 2013, at 2:41 PM, Jeff Layton <[email protected]> wrote:
> >>>>
> >>>>> Currently, it only stores the first 16 bytes of any address. struct
> >>>>> sockaddr_in6 is 28 bytes however, so we're currently ignoring the last
> >>>>> 12 bytes of the address.
> >>>>>
> >>>>> Expand the c_addr field to a sockaddr_in6,
> >>>>
> >>>> What do you do about link-local addresses?
> >>>>
> >>>
> >>> I use rpc_cmp_addr() which should handle the scope correctly for
> >>> link-local addrs. Now that you mention it though, I think we may have a
> >>> bug in __rpc_copy_addr6. It doesn't touch the scope_id at all --
> >>> shouldn't we be setting the scopeid in that function as well?
> >>
> >> It looks like rpc_copy_addr() is invoked to copy only source addresses, so the scope ID probably doesn't apply. The comment over rpc_copy_addr() says it explicitly ignores port and scope ID.
> >>
> >
> > Well, it copies the source address of the client, but all of the
> > callers are in nfsd code. So I think it probably does matter:
> >
> > Consider a server that has 2 interfaces, and 2 clients with identical
> > link-local addresses hitting the link local address of each interface.
> > The only way to distinguish them at that point is to look at their
> > scope_ids.
>
> Yes, but we want to use the server's scope IDs in this case, don't we? Copying the client's scope IDs may not be the right thing to do. That's why __rpc_copy_addr6() doesn't copy the scope ID.
>

Is there any real distinction between the two? Whenever we start
discussing scopeids I mentally substitute "interface index", since it's
the same thing on linux.

AIUI, scopeids only have a meaning on the host itself. When a
link-local address gets used, the sockaddr_in6 in question gets the
sin6_scope_id set to the index of the interface to which it is
associated.

So if we have a client and server "pair" using link-local addresses,
then they should have equivalent scope IDs since they are associated
with the same interface on the local machine.

--
Jeff Layton <[email protected]>

2013-01-28 20:18:06

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v1 01/16] nfsd: fix IPv6 address handling in the DRC


On Jan 28, 2013, at 3:05 PM, Jeff Layton <[email protected]> wrote:

> On Mon, 28 Jan 2013 11:51:13 -0800 (PST)
> Chuck Lever <[email protected]> wrote:
>
>>
>> On Jan 28, 2013, at 2:41 PM, Jeff Layton <[email protected]> wrote:
>>
>>> Currently, it only stores the first 16 bytes of any address. struct
>>> sockaddr_in6 is 28 bytes however, so we're currently ignoring the last
>>> 12 bytes of the address.
>>>
>>> Expand the c_addr field to a sockaddr_in6,
>>
>> What do you do about link-local addresses?
>>
>
> I use rpc_cmp_addr() which should handle the scope correctly for
> link-local addrs. Now that you mention it though, I think we may have a
> bug in __rpc_copy_addr6. It doesn't touch the scope_id at all --
> shouldn't we be setting the scopeid in that function as well?

It looks like rpc_copy_addr() is invoked to copy only source addresses, so the scope ID probably doesn't apply. The comment over rpc_copy_addr() says it explicitly ignores port and scope ID.

Sounds like we need another class of unit tests.

>
>
>>> and cast it to a sockaddr_in
>>> as necessary. Also fix the comparator to use the existing RPC
>>> helpers for this.
>>>
>>> Signed-off-by: Jeff Layton <[email protected]>
>>> ---
>>> fs/nfsd/cache.h | 6 +++++-
>>> fs/nfsd/nfscache.c | 7 +++++--
>>> include/linux/sunrpc/clnt.h | 4 +++-
>>> 3 files changed, 13 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
>>> index 93cc9d3..2cac76c 100644
>>> --- a/fs/nfsd/cache.h
>>> +++ b/fs/nfsd/cache.h
>>> @@ -12,6 +12,10 @@
>>>
>>> /*
>>> * Representation of a reply cache entry.
>>> + *
>>> + * Note that we use a sockaddr_in6 to hold the address instead of the more
>>> + * typical sockaddr_storage. This is for space reasons, since sockaddr_storage
>>> + * is much larger than a sockaddr_in6.
>>> */
>>> struct svc_cacherep {
>>> struct hlist_node c_hash;
>>> @@ -20,7 +24,7 @@ struct svc_cacherep {
>>> unsigned char c_state, /* unused, inprog, done */
>>> c_type, /* status, buffer */
>>> c_secure : 1; /* req came from port < 1024 */
>>> - struct sockaddr_in c_addr;
>>> + struct sockaddr_in6 c_addr;
>>> __be32 c_xid;
>>> u32 c_prot;
>>> u32 c_proc;
>>> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
>>> index 2cbac34..5dd9ec2 100644
>>> --- a/fs/nfsd/nfscache.c
>>> +++ b/fs/nfsd/nfscache.c
>>> @@ -9,6 +9,7 @@
>>> */
>>>
>>> #include <linux/slab.h>
>>> +#include <linux/sunrpc/clnt.h>
>>>
>>> #include "nfsd.h"
>>> #include "cache.h"
>>> @@ -146,7 +147,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
>>> xid == rp->c_xid && proc == rp->c_proc &&
>>> proto == rp->c_prot && vers == rp->c_vers &&
>>> time_before(jiffies, rp->c_timestamp + 120*HZ) &&
>>> - memcmp((char*)&rqstp->rq_addr, (char*)&rp->c_addr, sizeof(rp->c_addr))==0) {
>>> + rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
>>> + rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr)) {
>>> nfsdstats.rchits++;
>>> goto found_entry;
>>> }
>>> @@ -183,7 +185,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
>>> rp->c_state = RC_INPROG;
>>> rp->c_xid = xid;
>>> rp->c_proc = proc;
>>> - memcpy(&rp->c_addr, svc_addr_in(rqstp), sizeof(rp->c_addr));
>>> + rpc_copy_addr((struct sockaddr *)&rp->c_addr, svc_addr(rqstp));
>>> + rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
>>> rp->c_prot = proto;
>>> rp->c_vers = vers;
>>> rp->c_timestamp = jiffies;
>>> diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
>>> index 34206b8..47354a2 100644
>>> --- a/include/linux/sunrpc/clnt.h
>>> +++ b/include/linux/sunrpc/clnt.h
>>> @@ -263,7 +263,9 @@ static inline bool __rpc_copy_addr6(struct sockaddr *dst,
>>> * @sap1: first sockaddr
>>> * @sap2: second sockaddr
>>> *
>>> - * Just compares the family and address portion. Ignores port, scope, etc.
>>> + * Just compares the family and address portion. Ignores port, but
>>> + * compares the scope if it's a link-local address.
>>> + *
>>> * Returns true if the addrs are equal, false if they aren't.
>>> */
>>> static inline bool rpc_cmp_addr(const struct sockaddr *sap1,
>>> --
>>> 1.7.11.7
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
>
> --
> Jeff Layton <[email protected]>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-01-28 21:06:20

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 01/16] nfsd: fix IPv6 address handling in the DRC

On Mon, 28 Jan 2013 15:54:26 -0500
Chuck Lever <[email protected]> wrote:

>
> On Jan 28, 2013, at 3:45 PM, Jeff Layton <[email protected]> wrote:
>
> > On Mon, 28 Jan 2013 15:34:49 -0500
> > Chuck Lever <[email protected]> wrote:
> >
> >>
> >> On Jan 28, 2013, at 3:26 PM, Jeff Layton <[email protected]> wrote:
> >>
> >>> On Mon, 28 Jan 2013 12:16:17 -0800 (PST)
> >>> Chuck Lever <[email protected]> wrote:
> >>>
> >>>>
> >>>> On Jan 28, 2013, at 3:05 PM, Jeff Layton <[email protected]> wrote:
> >>>>
> >>>>> On Mon, 28 Jan 2013 11:51:13 -0800 (PST)
> >>>>> Chuck Lever <[email protected]> wrote:
> >>>>>
> >>>>>>
> >>>>>> On Jan 28, 2013, at 2:41 PM, Jeff Layton <[email protected]> wrote:
> >>>>>>
> >>>>>>> Currently, it only stores the first 16 bytes of any address. struct
> >>>>>>> sockaddr_in6 is 28 bytes however, so we're currently ignoring the last
> >>>>>>> 12 bytes of the address.
> >>>>>>>
> >>>>>>> Expand the c_addr field to a sockaddr_in6,
> >>>>>>
> >>>>>> What do you do about link-local addresses?
> >>>>>>
> >>>>>
> >>>>> I use rpc_cmp_addr() which should handle the scope correctly for
> >>>>> link-local addrs. Now that you mention it though, I think we may have a
> >>>>> bug in __rpc_copy_addr6. It doesn't touch the scope_id at all --
> >>>>> shouldn't we be setting the scopeid in that function as well?
> >>>>
> >>>> It looks like rpc_copy_addr() is invoked to copy only source addresses, so the scope ID probably doesn't apply. The comment over rpc_copy_addr() says it explicitly ignores port and scope ID.
> >>>>
> >>>
> >>> Well, it copies the source address of the client, but all of the
> >>> callers are in nfsd code. So I think it probably does matter:
> >>>
> >>> Consider a server that has 2 interfaces, and 2 clients with identical
> >>> link-local addresses hitting the link local address of each interface.
> >>> The only way to distinguish them at that point is to look at their
> >>> scope_ids.
> >>
> >> Yes, but we want to use the server's scope IDs in this case, don't we? Copying the client's scope IDs may not be the right thing to do. That's why __rpc_copy_addr6() doesn't copy the scope ID.
> >>
> >
> > Is there any real distinction between the two? Whenever we start
> > discussing scopeids I mentally substitute "interface index", since it's
> > the same thing on linux.
> >
> > AIUI, scopeids only have a meaning on the host itself. When a
> > link-local address gets used, the sockaddr_in6 in question gets the
> > sin6_scope_id set to the index of the interface to which it is
> > associated.
> >
> > So if we have a client and server "pair" using link-local addresses,
> > then they should have equivalent scope IDs since they are associated
> > with the same interface on the local machine.
>
> I don't think we can assume that the interface indices are the same on both the client and server. There is no co-ordination of that number space between separate hosts.
>

> What may be the case is that the server's network layer copies in the correct scope ID for the server when it exposes source addresses to its upper layers. In that case, the ID's probably safe to use when comparing addresses locally on the server.
>

I wasn't suggesting they were the same on client and server. They are
most definitely *not*. I look at it this way...

A link-local address must have a scope ID (aka interface index)
attached, since without that the address has no real meaning. At the
same time scopeIDs are never passed around on the wire -- they don't
have any meaning outside of a single host.

Ultimately scopeids are always generated by the local host, so it's OK
to copy them between client and server addresses within the host's
kernel. When I have a socket with link-local addresses, the scopeids of
both sockaddrs must be the same since they are associated with the same
interface.
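
To make that concrete, here's a tiny userspace sketch (illustration
only; "eth0" is just an example interface name):

#include <stdio.h>
#include <string.h>
#include <net/if.h>
#include <arpa/inet.h>
#include <netinet/in.h>

int main(void)
{
	struct sockaddr_in6 sin6;

	memset(&sin6, 0, sizeof(sin6));
	sin6.sin6_family = AF_INET6;
	inet_pton(AF_INET6, "fe80::1", &sin6.sin6_addr);

	/* without this, "fe80::1" is ambiguous on a multi-homed host */
	sin6.sin6_scope_id = if_nametoindex("eth0");

	printf("fe80::1 scope_id=%u\n", sin6.sin6_scope_id);
	return 0;
}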

> Since we don't know for sure how this works, we should have someone create unit tests to explore this, fix any bugs that are discovered, and ensure that it continues to work.
>

I'm fairly sure I understand how it works (though I did need to go back
and refresh my memory here). I'm not sure how you'd go about writing
such a test for this, but let me know if you do and I'll be happy to
take a look.

In the meantime, I'll take a look at where we're copying and comparing
ipv6 addrs and see whether we're missing scopeids in any of these
places.

--
Jeff Layton <[email protected]>

2013-01-28 19:50:11

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v1 15/16] nfsd: register a shrinker for DRC cache entries

Since we dynamically allocate them now, allow the system to call on us
to release them if we get low on memory.

We set the "seeks" value very high to discourage reclaim, and release
entries in LRU order. We may end up releasing some prematurely if the
box is really low on memory.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index bd21230..27edd47 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -35,6 +35,18 @@ static inline u32 request_hash(u32 xid)

static int nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *vec);
static void cache_cleaner_func(struct work_struct *unused);
+static int nfsd_reply_cache_shrink(struct shrinker *shrink,
+ struct shrink_control *sc);
+
+/*
+ * We set "seeks" to a large value here since we really can't recreate these
+ * entries. We're just gambling that they won't actually be needed if and
+ * when the box comes under memory pressure.
+ */
+struct shrinker nfsd_reply_cache_shrinker = {
+ .shrink = nfsd_reply_cache_shrink,
+ .seeks = 64,
+};

/*
* locking for the reply cache:
@@ -80,6 +92,7 @@ nfsd_reply_cache_free(struct svc_cacherep *rp)

int nfsd_reply_cache_init(void)
{
+ register_shrinker(&nfsd_reply_cache_shrinker);
drc_slab = kmem_cache_create("nfsd_drc", sizeof(struct svc_cacherep),
0, 0, NULL);
if (!drc_slab)
@@ -102,6 +115,7 @@ void nfsd_reply_cache_shutdown(void)
{
struct svc_cacherep *rp;

+ unregister_shrinker(&nfsd_reply_cache_shrinker);
cancel_delayed_work_sync(&cache_cleaner);

while (!list_empty(&lru_head)) {
@@ -152,15 +166,17 @@ nfsd_cache_entry_expired(struct svc_cacherep *rp)
* if we hit one that isn't old enough, then we can stop the walking. Must
* be called with cache_lock held.
*/
-static void
+static int
prune_old_cache_entries(void)
{
+ int freed = 0;
struct svc_cacherep *rp, *tmp;

list_for_each_entry_safe(rp, tmp, &lru_head, c_lru) {
if (!nfsd_cache_entry_expired(rp))
break;
nfsd_reply_cache_free_locked(rp);
+ ++freed;
}

/*
@@ -173,6 +189,8 @@ prune_old_cache_entries(void)
cancel_delayed_work(&cache_cleaner);
else
mod_delayed_work(system_wq, &cache_cleaner, RC_EXPIRE);
+
+ return freed;
}

static void
@@ -183,6 +201,40 @@ cache_cleaner_func(struct work_struct *unused)
spin_unlock(&cache_lock);
}

+static int
+nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
+{
+ int nr_to_scan = sc->nr_to_scan;
+ unsigned int num;
+
+ spin_lock(&cache_lock);
+ if (!nr_to_scan)
+ goto out;
+
+ nr_to_scan -= prune_old_cache_entries();
+
+ /*
+ * If that didn't free enough, then keep going. It's better to free
+ * some prematurely than to have the box keel over due to lack of
+ * memory.
+ */
+ while (nr_to_scan > 0) {
+ struct svc_cacherep *rp;
+
+ if (list_empty(&lru_head))
+ break;
+ rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
+ nfsd_reply_cache_free_locked(rp);
+ --nr_to_scan;
+ }
+
+out:
+ num = num_drc_entries;
+ spin_unlock(&cache_lock);
+
+ return num;
+}
+
/*
* Search the request hash for an entry that matches the given rqstp.
* Must be called with cache_lock held. Returns the found entry or
--
1.7.11.7


2013-02-01 15:51:32

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v1 08/16] nfsd: break out hashtable search into separate function

On Fri, Feb 01, 2013 at 10:26:40AM -0500, Jeff Layton wrote:
> On Fri, 1 Feb 2013 08:14:09 -0500
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Mon, Jan 28, 2013 at 02:41:14PM -0500, Jeff Layton wrote:
> > > Later, we'll need more than one call site for this, so break it out
> > > into a new function.
> >
> > I'm applying these first 8 to the for-3.9 branch at:
> >
> > git://linux-nfs.org/~bfields/linux.git for-3.9
> >
> > --b.
> >
>
> Thanks Bruce. I rebased the remaining patches on top of these, and now
> I'm seeing this when I try to test against it. I don't think it's due
> to the DRC patches, but I haven't started tracking it down yet:

Probably Stanislav's latest patches to the cache upcall code. I'm
surprised I didn't hit it.

--b.

>
> [10607.327785] BUG: unable to handle kernel NULL pointer dereference at (null)
> [10607.328033] IP: [<ffffffffa0164f95>] qword_add+0x5/0xd0 [sunrpc]
> [10607.328033] PGD 0
> [10607.328033] Oops: 0000 [#1] SMP
> [10607.328033] Modules linked in: nfsd(OF) auth_rpcgss nfs_acl lockd sunrpc kvm_amd kvm microcode virtio_balloon virtio_net i2c_piix4 cirrus drm_kms_helper ttm drm virtio_blk i2c_core
> [10607.328033] CPU 0
> [10607.328033] Pid: 5209, comm: nfsd Tainted: GF O 3.8.0-0.rc5.git2.1.fc19.x86_64 #1 Bochs Bochs
> [10607.328033] RIP: 0010:[<ffffffffa0164f95>] [<ffffffffa0164f95>] qword_add+0x5/0xd0 [sunrpc]
> [10607.328033] RSP: 0018:ffff88000307db58 EFLAGS: 00010286
> [10607.328033] RAX: ffff8800644e3a08 RBX: 000000000a5a51ab RCX: 0000000000000000
> [10607.328033] RDX: ffff88004367ee70 RSI: 0000000000000000 RDI: 000000000a5a51ab
> [10607.328033] RBP: ffff88000307db80 R08: 0000000000000000 R09: 0000000000000000
> [10607.328033] R10: ffff8800436c0000 R11: 0000000000000000 R12: ffff8800644bf730
> [10607.328033] R13: 0000000000000000 R14: ffff88000307db98 R15: ffff880002f5dd40
> [10607.328033] FS: 00007fb85bea8740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
> [10607.328033] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [10607.328033] CR2: 0000000000000000 CR3: 00000000030ba000 CR4: 00000000000006f0
> [10607.328033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [10607.328033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [10607.328033] Process nfsd (pid: 5209, threadinfo ffff88000307c000, task ffff8800436c0000)
> [10607.328033] Stack:
> [10607.328033] ffffffffa01b56de ffff88000307db78 ffff8800644bf730 00000000fffffff5
> [10607.328033] ffff880005e409d8 ffff88000307dbd0 ffffffffa0166ee7 0000000000000078
> [10607.328033] 00000000510bb440 000000000a5a51ab ffff880005e43138 ffff8800644e3a08
> [10607.328033] Call Trace:
> [10607.328033] [<ffffffffa01b56de>] ? expkey_request+0x2e/0x90 [nfsd]
> [10607.328033] [<ffffffffa0166ee7>] cache_check+0xe7/0x350 [sunrpc]
> [10607.328033] [<ffffffffa01b613e>] exp_find+0x17e/0x350 [nfsd]
> [10607.328033] [<ffffffffa01b5fc5>] ? exp_find+0x5/0x350 [nfsd]
> [10607.328033] [<ffffffffa01b76d5>] ? rqst_exp_find+0x5/0x300 [nfsd]
> [10607.328033] [<ffffffffa01b77ee>] rqst_exp_find+0x11e/0x300 [nfsd]
> [10607.328033] [<ffffffffa01b76d5>] ? rqst_exp_find+0x5/0x300 [nfsd]
> [10607.328033] [<ffffffffa01b7a82>] rqst_find_fsidzero_export+0x22/0x30 [nfsd]
> [10607.328033] [<ffffffffa01b7aaa>] exp_pseudoroot+0x1a/0xb0 [nfsd]
> [10607.328033] [<ffffffffa01bf565>] nfsd4_putrootfh+0x25/0x30 [nfsd]
> [10607.328033] [<ffffffffa01c00f3>] nfsd4_proc_compound+0x573/0x790 [nfsd]
> [10607.328033] [<ffffffffa01abedb>] nfsd_dispatch+0xbb/0x200 [nfsd]
> [10607.328033] [<ffffffffa015aadd>] svc_process_common+0x46d/0x6e0 [sunrpc]
> [10607.328033] [<ffffffffa015ae57>] svc_process+0x107/0x170 [sunrpc]
> [10607.328033] [<ffffffffa01ab28f>] nfsd+0xbf/0x130 [nfsd]
> [10607.328033] [<ffffffffa01ab1d0>] ? nfsd_destroy+0x220/0x220 [nfsd]
> [10607.328033] [<ffffffff81090cea>] kthread+0xea/0xf0
> [10607.328033] [<ffffffff81090c00>] ? insert_kthread_work+0x80/0x80
> [10607.328033] [<ffffffff816e086c>] ret_from_fork+0x7c/0xb0
> [10607.328033] [<ffffffff81090c00>] ? insert_kthread_work+0x80/0x80
> [10607.328033] Code: 4c c3 81 e8 9e cd f6 e0 85 c0 0f 85 0d fe ff ff e9 5a ff ff ff 0f 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 66 66 66 66 90 <8b> 0e 55 4c 8b 07 48 89 e5 85 c9 78 56 66 0f 1f 44 00 00 48 83
> [10607.328033] RIP [<ffffffffa0164f95>] qword_add+0x5/0xd0 [sunrpc]
> [10607.328033] RSP <ffff88000307db58>
> [10607.328033] CR2: 0000000000000000
> [10607.961841] ---[ end trace b5a4adf462b23e9a ]---
>
> Does it look familiar at all?
>
> --
> Jeff Layton <[email protected]>

2013-02-01 15:49:04

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 08/16] nfsd: break out hashtable search into separate function

On Fri, 1 Feb 2013 10:26:40 -0500
Jeff Layton <[email protected]> wrote:

> On Fri, 1 Feb 2013 08:14:09 -0500
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Mon, Jan 28, 2013 at 02:41:14PM -0500, Jeff Layton wrote:
> > > Later, we'll need more than one call site for this, so break it out
> > > into a new function.
> >
> > I'm applying these first 8 to the for-3.9 branch at:
> >
> > git://linux-nfs.org/~bfields/linux.git for-3.9
> >
> > --b.
> >
>
> Thanks Bruce. I rebased the remaining patches on top of these, and now
> I'm seeing this when I try to test against it. I don't think it's due
> to the DRC patches, but I haven't started tracking it down yet:
>
> [50-line oops trace trimmed; it appears in full in the 2013-02-01 15:26:47 message below]
>
> Does it look familiar at all?
>

A bisect narrowed it down to this:

[jlayton@salusa linux]$ git bisect bad
517eb5600b455b608fb4cf38e403909a82c76100 is the first bad commit
commit 517eb5600b455b608fb4cf38e403909a82c76100
Author: Stanislav Kinsbursky <[email protected]>
Date: Tue Jan 15 11:09:31 2013 +0300

SUNRPC: introduce cache_detail->cache_request callback

This callback will allow to simplify upcalls in further patches in this
series.

Signed-off-by: Stanislav Kinsbursky <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

:040000 040000 c0474db96110ab75ee7625454f840ccb34c2710b 97f6aeeee53d8125b49d6e0ccabd7ee721c1248e M fs
:040000 040000 9a0af12174857da2a384c7b290c878233319d347 99aed4dc4cea3fb02eafeb0df565bf5039f01561 M include
:040000 040000 0e14940767d1ab0d2d5f9e7006cf845246fc224f cb3e8be2d456debb7433120b49caa54f8cb0c751 M net

...the original oops only triggered when I ran pynfs against the server
after starting it up. At this commit, the machine oopsed when starting
the server:

[ 261.469548] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[ 261.470014] IP: [<ffffffffa02067c1>] svc_export_request+0x21/0x70 [nfsd]
[ 261.470014] PGD 0
[ 261.470014] Oops: 0000 [#1] SMP
[ 261.470014] Modules linked in: nfsd(OF) auth_rpcgss nfs_acl lockd sunrpc kvm_amd kvm microcode virtio_balloon virtio_net i2c_piix4 cirrus drm_kms_helper ttm drm virtio_blk i2c_core [last unloaded: nfsd]
[ 261.470014] CPU 0
[ 261.470014] Pid: 1740, comm: exportfs Tainted: GF O 3.8.0-0.rc5.git2.1.fc19.x86_64 #1 Bochs Bochs
[ 261.470014] RIP: 0010:[<ffffffffa02067c1>] [<ffffffffa02067c1>] svc_export_request+0x21/0x70 [nfsd]
[ 261.470014] RSP: 0018:ffff880036d2be10 EFLAGS: 00010246
[ 261.470014] RAX: 0000000000000000 RBX: 0000000000000025 RCX: 0000000000000000
[ 261.470014] RDX: 0000000000000025 RSI: 0000000000000000 RDI: 0000000000000025
[ 261.470014] RBP: ffff880036d2be28 R08: 0000000000000000 R09: 0000000000000000
[ 261.470014] R10: ffff880036bc2150 R11: 0000000000000000 R12: 0000000000000000
[ 261.470014] R13: ffff88005c2eb000 R14: 00007fffc97b9180 R15: ffff880036577b58
[ 261.470014] FS: 00007fc93ca35740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 261.470014] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 261.470014] CR2: 0000000000000018 CR3: 000000005c2ec000 CR4: 00000000000006f0
[ 261.470014] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 261.470014] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 261.470014] Process exportfs (pid: 1740, threadinfo ffff880036d2a000, task ffff880036bc2150)
[ 261.470014] Stack:
[ 261.470014] 0000000000000025 ffff88005c2eb000 ffff880077e66df0 ffff880036d2be58
[ 261.470014] ffffffffa0166207 0000000000000025 00007fffc97b9180 ffff880077e66df0
[ 261.470014] ffffea000170bac0 ffff880036d2be88 ffffffffa016629e ffff880077e66df0
[ 261.470014] Call Trace:
[ 261.470014] [<ffffffffa0166207>] cache_do_downcall+0x57/0x70 [sunrpc]
[ 261.470014] [<ffffffffa016629e>] cache_downcall+0x7e/0x100 [sunrpc]
[ 261.470014] [<ffffffffa0166378>] cache_write_procfs+0x58/0x90 [sunrpc]
[ 261.470014] [<ffffffffa0166320>] ? cache_downcall+0x100/0x100 [sunrpc]
[ 261.470014] [<ffffffff8123b0e5>] proc_reg_write+0x75/0xb0
[ 261.470014] [<ffffffff811ccecf>] vfs_write+0x9f/0x170
[ 261.470014] [<ffffffff811cd089>] sys_write+0x49/0xa0
[ 261.470014] [<ffffffff816e0919>] system_call_fastpath+0x16/0x1b
[ 261.470014] Code: 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 55 49 89 f5 41 54 49 89 cc 53 48 8b 46 28 48 89 d3 48 89 ce 48 89 df <48> 8b 50 18 e8 c6 e7 f5 ff 41 8b 14 24 48 8b 33 49 8d 7d 38 e8
[ 261.470014] RIP [<ffffffffa02067c1>] svc_export_request+0x21/0x70 [nfsd]
[ 261.470014] RSP <ffff880036d2be10>
[ 261.470014] CR2: 0000000000000018
[ 261.511932] ---[ end trace 04a5087ac298f72e ]---

I think you may want to have a hard look at that patch...
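For anyone digging into it, the gist of that commit is a new per-cache
callback hanging off struct cache_detail. A rough sketch of the shape
(names recalled from the 3.8-era sunrpc code, so treat the details as
approximate rather than the literal diff):

/*
 * Each cache (svc_export, expkey, ...) supplies a "request" routine
 * that formats the upcall sent to its userspace daemon. The bisected
 * commit moves that routine into cache_detail itself so the generic
 * cache code can invoke it directly.
 */
struct cache_detail {
	/* ... existing fields ... */
	void	(*cache_request)(struct cache_detail *cd,
				 struct cache_head *h,
				 char **bpp, int *blen);
	/* ... */
};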
--
Jeff Layton <[email protected]>

2013-02-01 16:05:03

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v1 08/16] nfsd: break out hashtable search into separate function

On Fri, Feb 01, 2013 at 11:01:18AM -0500, Jeff Layton wrote:
> On Fri, 1 Feb 2013 10:51:31 -0500
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Fri, Feb 01, 2013 at 10:26:40AM -0500, Jeff Layton wrote:
> > > On Fri, 1 Feb 2013 08:14:09 -0500
> > > "J. Bruce Fields" <[email protected]> wrote:
> > >
> > > > On Mon, Jan 28, 2013 at 02:41:14PM -0500, Jeff Layton wrote:
> > > > > Later, we'll need more than one call site for this, so break it out
> > > > > into a new function.
> > > >
> > > > I'm applying these first 8 to the for-3.9 branch at:
> > > >
> > > > git://linux-nfs.org/~bfields/linux.git for-3.9
> > > >
> > > > --b.
> > > >
> > >
> > > Thanks Bruce. I rebased the remaining patches on top of these, and now
> > > I'm seeing this when I try to test against it. I don't think it's due
> > > to the DRC patches, but I haven't started tracking it down yet:
> >
> > Probably Stanislav's latest patches to the cache upcall code. I'm
> > surprised I didn't hit it.
> >
> > --b.
> >
>
> My apologies... it was a kABI issue ;)
>
> I had built a nfsd.ko module to test my changes, but didn't build
> sunrpc.ko to make sure that I got Stanislav's changes. When I did that,
> the oops went away...

Ah, ok, good.

--b.

2013-02-01 15:26:47

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 08/16] nfsd: break out hashtable search into separate function

On Fri, 1 Feb 2013 08:14:09 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Jan 28, 2013 at 02:41:14PM -0500, Jeff Layton wrote:
> > Later, we'll need more than one call site for this, so break it out
> > into a new function.
>
> I'm applying these first 8 to the for-3.9 branch at:
>
> git://linux-nfs.org/~bfields/linux.git for-3.9
>
> --b.
>

Thanks Bruce. I rebased the remaining patches on top of these, and now
I'm seeing this when I try to test against it. I don't think it's due
to the DRC patches, but I haven't started tracking it down yet:

[10607.327785] BUG: unable to handle kernel NULL pointer dereference at (null)
[10607.328033] IP: [<ffffffffa0164f95>] qword_add+0x5/0xd0 [sunrpc]
[10607.328033] PGD 0
[10607.328033] Oops: 0000 [#1] SMP
[10607.328033] Modules linked in: nfsd(OF) auth_rpcgss nfs_acl lockd sunrpc kvm_amd kvm microcode virtio_balloon virtio_net i2c_piix4 cirrus drm_kms_helper ttm drm virtio_blk i2c_core
[10607.328033] CPU 0
[10607.328033] Pid: 5209, comm: nfsd Tainted: GF O 3.8.0-0.rc5.git2.1.fc19.x86_64 #1 Bochs Bochs
[10607.328033] RIP: 0010:[<ffffffffa0164f95>] [<ffffffffa0164f95>] qword_add+0x5/0xd0 [sunrpc]
[10607.328033] RSP: 0018:ffff88000307db58 EFLAGS: 00010286
[10607.328033] RAX: ffff8800644e3a08 RBX: 000000000a5a51ab RCX: 0000000000000000
[10607.328033] RDX: ffff88004367ee70 RSI: 0000000000000000 RDI: 000000000a5a51ab
[10607.328033] RBP: ffff88000307db80 R08: 0000000000000000 R09: 0000000000000000
[10607.328033] R10: ffff8800436c0000 R11: 0000000000000000 R12: ffff8800644bf730
[10607.328033] R13: 0000000000000000 R14: ffff88000307db98 R15: ffff880002f5dd40
[10607.328033] FS: 00007fb85bea8740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[10607.328033] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[10607.328033] CR2: 0000000000000000 CR3: 00000000030ba000 CR4: 00000000000006f0
[10607.328033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10607.328033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[10607.328033] Process nfsd (pid: 5209, threadinfo ffff88000307c000, task ffff8800436c0000)
[10607.328033] Stack:
[10607.328033] ffffffffa01b56de ffff88000307db78 ffff8800644bf730 00000000fffffff5
[10607.328033] ffff880005e409d8 ffff88000307dbd0 ffffffffa0166ee7 0000000000000078
[10607.328033] 00000000510bb440 000000000a5a51ab ffff880005e43138 ffff8800644e3a08
[10607.328033] Call Trace:
[10607.328033] [<ffffffffa01b56de>] ? expkey_request+0x2e/0x90 [nfsd]
[10607.328033] [<ffffffffa0166ee7>] cache_check+0xe7/0x350 [sunrpc]
[10607.328033] [<ffffffffa01b613e>] exp_find+0x17e/0x350 [nfsd]
[10607.328033] [<ffffffffa01b5fc5>] ? exp_find+0x5/0x350 [nfsd]
[10607.328033] [<ffffffffa01b76d5>] ? rqst_exp_find+0x5/0x300 [nfsd]
[10607.328033] [<ffffffffa01b77ee>] rqst_exp_find+0x11e/0x300 [nfsd]
[10607.328033] [<ffffffffa01b76d5>] ? rqst_exp_find+0x5/0x300 [nfsd]
[10607.328033] [<ffffffffa01b7a82>] rqst_find_fsidzero_export+0x22/0x30 [nfsd]
[10607.328033] [<ffffffffa01b7aaa>] exp_pseudoroot+0x1a/0xb0 [nfsd]
[10607.328033] [<ffffffffa01bf565>] nfsd4_putrootfh+0x25/0x30 [nfsd]
[10607.328033] [<ffffffffa01c00f3>] nfsd4_proc_compound+0x573/0x790 [nfsd]
[10607.328033] [<ffffffffa01abedb>] nfsd_dispatch+0xbb/0x200 [nfsd]
[10607.328033] [<ffffffffa015aadd>] svc_process_common+0x46d/0x6e0 [sunrpc]
[10607.328033] [<ffffffffa015ae57>] svc_process+0x107/0x170 [sunrpc]
[10607.328033] [<ffffffffa01ab28f>] nfsd+0xbf/0x130 [nfsd]
[10607.328033] [<ffffffffa01ab1d0>] ? nfsd_destroy+0x220/0x220 [nfsd]
[10607.328033] [<ffffffff81090cea>] kthread+0xea/0xf0
[10607.328033] [<ffffffff81090c00>] ? insert_kthread_work+0x80/0x80
[10607.328033] [<ffffffff816e086c>] ret_from_fork+0x7c/0xb0
[10607.328033] [<ffffffff81090c00>] ? insert_kthread_work+0x80/0x80
[10607.328033] Code: 4c c3 81 e8 9e cd f6 e0 85 c0 0f 85 0d fe ff ff e9 5a ff ff ff 0f 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 66 66 66 66 90 <8b> 0e 55 4c 8b 07 48 89 e5 85 c9 78 56 66 0f 1f 44 00 00 48 83
[10607.328033] RIP [<ffffffffa0164f95>] qword_add+0x5/0xd0 [sunrpc]
[10607.328033] RSP <ffff88000307db58>
[10607.328033] CR2: 0000000000000000
[10607.961841] ---[ end trace b5a4adf462b23e9a ]---

Does it look familiar at all?
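One detail that can be read straight off the dump: the faulting bytes
in the Code: line, <8b> 0e, disassemble to "mov (%rsi),%ecx", and the
register dump shows RSI == 0, consistent with CR2 == 0. Assuming the
3.8-era prototype (an assumption on my part, it isn't shown in the
trace):

/* Assumed prototype, from include/linux/sunrpc/cache.h of that era: */
void qword_add(char **bpp, int *bp, char *str);
/*
 * On x86_64, "bp" arrives in %rsi. The faulting instruction at
 * qword_add+0x5, mov (%rsi),%ecx, is a read of *bp, so a NULL length
 * pointer produces exactly the (null) dereference reported above.
 */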

--
Jeff Layton <[email protected]>

2013-02-01 13:14:10

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v1 08/16] nfsd: break out hashtable search into separate function

On Mon, Jan 28, 2013 at 02:41:14PM -0500, Jeff Layton wrote:
> Later, we'll need more than one call site for this, so break it out
> into a new function.

I'm applying these first 8 to the for-3.9 branch at:

git://linux-nfs.org/~bfields/linux.git for-3.9

--b.

>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/nfscache.c | 46 +++++++++++++++++++++++++++++++++-------------
> 1 file changed, 33 insertions(+), 13 deletions(-)
>
> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> index 634b856..b89e7c8 100644
> --- a/fs/nfsd/nfscache.c
> +++ b/fs/nfsd/nfscache.c
> @@ -150,6 +150,35 @@ nfsd_cache_entry_expired(struct svc_cacherep *rp)
> }
>
> /*
> + * Search the request hash for an entry that matches the given rqstp.
> + * Must be called with cache_lock held. Returns the found entry or
> + * NULL on failure.
> + */
> +static struct svc_cacherep *
> +nfsd_cache_search(struct svc_rqst *rqstp)
> +{
> + struct svc_cacherep *rp;
> + struct hlist_node *hn;
> + struct hlist_head *rh;
> + __be32 xid = rqstp->rq_xid;
> + u32 proto = rqstp->rq_prot,
> + vers = rqstp->rq_vers,
> + proc = rqstp->rq_proc;
> +
> + rh = &cache_hash[request_hash(xid)];
> + hlist_for_each_entry(rp, hn, rh, c_hash) {
> + if (rp->c_state != RC_UNUSED &&
> + xid == rp->c_xid && proc == rp->c_proc &&
> + proto == rp->c_prot && vers == rp->c_vers &&
> + !nfsd_cache_entry_expired(rp) &&
> + rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
> + rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr))
> + return rp;
> + }
> + return NULL;
> +}
> +
> +/*
> * Try to find an entry matching the current call in the cache. When none
> * is found, we grab the oldest unlocked entry off the LRU list.
> * Note that no operation within the loop may sleep.
> @@ -157,8 +186,6 @@ nfsd_cache_entry_expired(struct svc_cacherep *rp)
> int
> nfsd_cache_lookup(struct svc_rqst *rqstp)
> {
> - struct hlist_node *hn;
> - struct hlist_head *rh;
> struct svc_cacherep *rp;
> __be32 xid = rqstp->rq_xid;
> u32 proto = rqstp->rq_prot,
> @@ -177,17 +204,10 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> spin_lock(&cache_lock);
> rtn = RC_DOIT;
>
> - rh = &cache_hash[request_hash(xid)];
> - hlist_for_each_entry(rp, hn, rh, c_hash) {
> - if (rp->c_state != RC_UNUSED &&
> - xid == rp->c_xid && proc == rp->c_proc &&
> - proto == rp->c_prot && vers == rp->c_vers &&
> - !nfsd_cache_entry_expired(rp) &&
> - rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
> - rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr)) {
> - nfsdstats.rchits++;
> - goto found_entry;
> - }
> + rp = nfsd_cache_search(rqstp);
> + if (rp) {
> + nfsdstats.rchits++;
> + goto found_entry;
> }
> nfsdstats.rcmisses++;
>
> --
> 1.7.11.7
>
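The "more than one call site" the changelog anticipates would look
something like the sketch below -- a hypothetical caller, not a patch
from this series, just to show why factoring the search out pays off:

/*
 * Hypothetical second caller. nfsd_cache_search() must be called
 * with cache_lock held, so any new path takes the lock, searches,
 * and drops it around whatever it needs to do with the entry.
 */
static bool
nfsd_cache_entry_exists(struct svc_rqst *rqstp)
{
	struct svc_cacherep *rp;
	bool ret;

	spin_lock(&cache_lock);
	rp = nfsd_cache_search(rqstp);
	ret = (rp != NULL);
	spin_unlock(&cache_lock);
	return ret;
}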

2013-02-01 16:01:33

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 08/16] nfsd: break out hashtable search into separate function

On Fri, 1 Feb 2013 10:51:31 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Fri, Feb 01, 2013 at 10:26:40AM -0500, Jeff Layton wrote:
> > On Fri, 1 Feb 2013 08:14:09 -0500
> > "J. Bruce Fields" <[email protected]> wrote:
> >
> > > On Mon, Jan 28, 2013 at 02:41:14PM -0500, Jeff Layton wrote:
> > > > Later, we'll need more than one call site for this, so break it out
> > > > into a new function.
> > >
> > > I'm applying these first 8 to the for-3.9 branch at:
> > >
> > > git://linux-nfs.org/~bfields/linux.git for-3.9
> > >
> > > --b.
> > >
> >
> > Thanks Bruce. I rebased the remaining patches on top of these, and now
> > I'm seeing this when I try to test against it. I don't think it's due
> > to the DRC patches, but I haven't started tracking it down yet:
>
> Probably Stanislav's latest patches to the cache upcall code. I'm
> surprised I didn't hit it.
>
> --b.
>

My apologies... it was a kABI issue ;)

I had built a nfsd.ko module to test my changes, but didn't build
sunrpc.ko to make sure that I got Stanislav's changes. When I did that,
the oops went away...
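That failure mode is worth spelling out, since the resulting oops looks
just like a genuine bug. When two modules are built against different
versions of a shared structure, a member can silently move, and the
stale module then loads garbage (or NULL) from the old offset. A
contrived illustration -- hypothetical structs, nothing to do with the
real sunrpc layout:

/*
 * Hypothetical before/after layouts. A caller compiled against
 * ops_old, handed an object laid out as ops_new, reads the wrong
 * slot: on x86_64 ->upcall moves from offset 8 to offset 16, so the
 * stale caller picks up ->request instead -- or NULL, if that slot
 * was never initialized.
 */
struct ops_old {
	int	flags;
	void	(*upcall)(void *priv);		/* offset 8 */
};

struct ops_new {
	int	flags;
	void	(*request)(void *priv);		/* inserted member */
	void	(*upcall)(void *priv);		/* now offset 16 */
};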

Sorry for the noise!
--
Jeff Layton <[email protected]>