2013-02-15 17:23:19

by Jeff Layton

Subject: [PATCH v2 0/3] nfsd: better stats gathering for DRC

This is a respin of the series I posted yesterday. The main differences are:

1/ I dropped the first patch that fixes the comments, since Bruce said
that he was applying it.

2/ I've changed the csum_misses counter to only increment when the
checksum comparison fails (instead of when either the checksum or the
length comparison fails). I also cleaned up the comparator function a
bit.

3/ I added a new patch to track memory utilization by the DRC, figuring
that might be handy if someone runs into memory problems later.

Further comments and suggestions welcome. I'd like to see this in 3.9
if possible.

Jeff Layton (3):
nfsd: break out comparator into separate function
nfsd: track memory utilization in the DRC
nfsd: add new reply_cache_stats file in nfsdfs

fs/nfsd/cache.h | 1 +
fs/nfsd/nfscache.c | 77 +++++++++++++++++++++++++++++++++++++++++++++---------
fs/nfsd/nfsctl.c | 9 +++++++
3 files changed, 74 insertions(+), 13 deletions(-)

--
1.7.11.7



2013-02-16 17:18:29

by Chuck Lever III

Subject: Re: [PATCH RFC] nfsd: report length of the largest hash chain in reply cache stats


On Feb 16, 2013, at 8:39 AM, J. Bruce Fields <[email protected]> wrote:

> On Fri, Feb 15, 2013 at 05:20:58PM -0500, Jeff Layton wrote:
>> An excellent question, and not an easy one to answer. Clearly 1024
>> entries was not enough. We now cap the size as a function of the
>> available low memory, which I think is a reasonable way to keep it from
>> ballooning so large that the box falls over. We also have a shrinker
>> and periodic cache cleaner to prune off entries that have expired.
>>
>> Of course one thing I haven't really considered enough is the
>> performance implications of walking the potentially much longer hash
>> chains here.
>>
>> If that is a problem, then one way to counter that without moving to a
>> different structure altogether might be to alter the hash function
>> based on the max size of the cache. IOW, grow the number of hash buckets
>> as the max cache size grows?

The trouble with a hash table is that once you've allocated it, it's a heavy lift to increase the table size. That sort of logic adds complexity and additional locking, and is often difficult to test.

> Another reason to organize the cache per client address?


In theory, an active single client could evict all entries for other clients, but do we know this happens in practice?

> With a per-client maximum number of entries, sizing the hash tables
> should be easier.


When a server has only one client, should that client be allowed to maximize the use of a server's resources (eg, use all of the DRC resource the server has available)? How about when a server has one active client and multiple quiescent clients?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-02-16 13:39:35

by J. Bruce Fields

Subject: Re: [PATCH RFC] nfsd: report length of the largest hash chain in reply cache stats

On Fri, Feb 15, 2013 at 05:20:58PM -0500, Jeff Layton wrote:
> An excellent question, and not an easy one to answer. Clearly 1024
> entries was not enough. We now cap the size as a function of the
> available low memory, which I think is a reasonable way to keep it from
> ballooning so large that the box falls over. We also have a shrinker
> and periodic cache cleaner to prune off entries that have expired.
>
> Of course one thing I haven't really considered enough is the
> performance implications of walking the potentially much longer hash
> chains here.
>
> If that is a problem, then one way to counter that without moving to a
> different structure altogether might be to alter the hash function
> based on the max size of the cache. IOW, grow the number of hash buckets
> as the max cache size grows?

Another reason to organize the cache per client address?

Two levels of hash tables might be good enough: one global hash table
for the client address, one per-client for the rest.

With a per-client maximum number of entries, sizing the hash tables
should be easier.

If we wanted to be fancy in theory the address lookup could probably be
lockless in the typical case.

--b.

2013-02-15 20:04:42

by Jeff Layton

Subject: [PATCH RFC] nfsd: report length of the largest hash chain in reply cache stats

So we can get a feel for how effective the hashing function is.

As Chuck Lever pointed out to me, it's generally acceptable to do
"expensive" stuff when reading the stats since that's a relatively
rare activity.

Cc: Chuck Lever <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index a5ac9ab..172e211 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -551,6 +551,25 @@ nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *data)
return 1;
}

+/* Get stats on the hashtable itself */
+static unsigned int
+nfsd_repcache_max_chain_len(void)
+{
+ int i;
+ struct hlist_node *pos;
+ unsigned int max = 0;
+
+ for (i = 0; i < HASHSIZE; ++i) {
+ unsigned int cur = 0;
+
+ hlist_for_each(pos, &cache_hash[i])
+ ++cur;
+ max = max(cur, max);
+ }
+
+ return max;
+}
+
/*
* Note that fields may be added, removed or reordered in the future. Programs
* scraping this file for info should test the labels to ensure they're
@@ -566,6 +585,7 @@ static int nfsd_reply_cache_stats_show(struct seq_file *m, void *v)
seq_printf(m, "cache misses: %u\n", nfsdstats.rcmisses);
seq_printf(m, "not cached: %u\n", nfsdstats.rcnocache);
seq_printf(m, "checksum misses: %u\n", csum_misses);
+ seq_printf(m, "max chain len: %u\n", nfsd_repcache_max_chain_len());
spin_unlock(&cache_lock);
return 0;
}
--
1.7.11.7


2013-02-15 17:23:20

by Jeff Layton

Subject: [PATCH v2 2/3] nfsd: track memory utilization in the DRC

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index c00c5bc..ad1d069 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -25,6 +25,7 @@ static struct list_head lru_head;
static struct kmem_cache *drc_slab;
static unsigned int num_drc_entries;
static unsigned int max_drc_entries;
+static unsigned int drc_mem_usage;
static unsigned int csum_misses;

/*
@@ -101,11 +102,14 @@ nfsd_reply_cache_alloc(void)
static void
nfsd_reply_cache_free_locked(struct svc_cacherep *rp)
{
- if (rp->c_type == RC_REPLBUFF)
+ if (rp->c_type == RC_REPLBUFF && rp->c_replvec.iov_base) {
+ drc_mem_usage -= ksize(rp->c_replvec.iov_base);
kfree(rp->c_replvec.iov_base);
+ }
hlist_del(&rp->c_hash);
list_del(&rp->c_lru);
--num_drc_entries;
+ drc_mem_usage -= ksize(rp);
kmem_cache_free(drc_slab, rp);
}

@@ -367,6 +371,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
return RC_DOIT;
}
spin_lock(&cache_lock);
+ drc_mem_usage += ksize(rp);
++num_drc_entries;

/*
@@ -407,6 +412,7 @@ setup_entry:

/* release any buffer */
if (rp->c_type == RC_REPLBUFF) {
+ drc_mem_usage -= ksize(rp->c_replvec.iov_base);
kfree(rp->c_replvec.iov_base);
rp->c_replvec.iov_base = NULL;
}
@@ -475,6 +481,7 @@ nfsd_cache_update(struct svc_rqst *rqstp, int cachetype, __be32 *statp)
struct svc_cacherep *rp = rqstp->rq_cacherep;
struct kvec *resv = &rqstp->rq_res.head[0], *cachv;
int len;
+ size_t replbufsize = 0;

if (!rp)
return;
@@ -501,6 +508,7 @@ nfsd_cache_update(struct svc_rqst *rqstp, int cachetype, __be32 *statp)
nfsd_reply_cache_free(rp);
return;
}
+ replbufsize = ksize(cachv->iov_base);
cachv->iov_len = len << 2;
memcpy(cachv->iov_base, statp, len << 2);
break;
@@ -509,6 +517,7 @@ nfsd_cache_update(struct svc_rqst *rqstp, int cachetype, __be32 *statp)
return;
}
spin_lock(&cache_lock);
+ drc_mem_usage += replbufsize;
lru_put_end(rp);
rp->c_secure = rqstp->rq_secure;
rp->c_type = cachetype;
--
1.7.11.7


2013-02-17 19:58:30

by Chuck Lever III

Subject: Re: [PATCH RFC] nfsd: report length of the largest hash chain in reply cache stats


On Feb 17, 2013, at 11:00 AM, J. Bruce Fields <[email protected]> wrote:

> On Sat, Feb 16, 2013 at 12:18:18PM -0500, Chuck Lever wrote:
>>
>> On Feb 16, 2013, at 8:39 AM, J. Bruce Fields <[email protected]> wrote:
>>> With a per-client maximum number of entries, sizing the hash tables
>>> should be easier.
>> When a server has only one client, should that client be allowed to
>> maximize the use of a server's resources (eg, use all of the DRC
>> resource the server has available)?
>
> I've been assuming there's rapidly diminishing returns to caching a lot
> of replies to a single client. But that might not be true--I guess a
> busy UDP client with a long retry timeout might benefit from a large
> cache?

I was thinking that the possible value of holding a reply diminishes the _longer_ you hold it. The number of replies held depends then on how busy each client is, and of course, how much memory is available to hold cached replies.

Jeff's metrics may help us sort this out.

>
>> How about when a server has one active client and multiple quiescent
>> clients?
>
> I think there's a chance in that case that one of the quiescent clients
> is experiencing a temporary network problem, in which case we may want
> to preserve a few entries for them even if the active client's activity
> would normally evict them.

OK, sure. So, we might express that as a lower bound on the number of replies cached per client? Or would we express it as an upper bound on the age of each cached reply?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-02-15 18:34:10

by Jeff Layton

Subject: Re: [PATCH v2 3/3] nfsd: add new reply_cache_stats file in nfsdfs

On Fri, 15 Feb 2013 12:29:08 -0500
Chuck Lever <[email protected]> wrote:

>
> On Feb 15, 2013, at 12:23 PM, Jeff Layton <[email protected]> wrote:
>
> > For presenting statistics relating to duplicate reply cache.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/nfsd/cache.h | 1 +
> > fs/nfsd/nfscache.c | 37 +++++++++++++++++++++++++++++++++----
> > fs/nfsd/nfsctl.c | 9 +++++++++
> > 3 files changed, 43 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
> > index 87fd141..d5c5b3e 100644
> > --- a/fs/nfsd/cache.h
> > +++ b/fs/nfsd/cache.h
> > @@ -82,6 +82,7 @@ int nfsd_reply_cache_init(void);
> > void nfsd_reply_cache_shutdown(void);
> > int nfsd_cache_lookup(struct svc_rqst *);
> > void nfsd_cache_update(struct svc_rqst *, int, __be32 *);
> > +int nfsd_reply_cache_stats_open(struct inode *, struct file *);
> >
> > #ifdef CONFIG_NFSD_V4
> > void nfsd4_set_statp(struct svc_rqst *rqstp, __be32 *statp);
> > diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> > index ad1d069..b155afa 100644
> > --- a/fs/nfsd/nfscache.c
> > +++ b/fs/nfsd/nfscache.c
> > @@ -23,12 +23,17 @@
> > static struct hlist_head * cache_hash;
> > static struct list_head lru_head;
> > static struct kmem_cache *drc_slab;
> > -static unsigned int num_drc_entries;
> > -static unsigned int max_drc_entries;
> > -static unsigned int drc_mem_usage;
> > -static unsigned int csum_misses;
> >
> > /*
> > + * Stats on the duplicate reply cache code. All of these and the "rc" fields
> > + * in nfsdstats are protected by the cache_lock
> > + */
> > +static unsigned int num_drc_entries; /* number of entries */
> > +static unsigned int max_drc_entries; /* max number of entries */
> > +static unsigned int drc_mem_usage; /* memory used by cache */
> > +static unsigned int csum_misses; /* cache misses due only to
> > + csum comparison failures */
> > +/*
> > * Calculate the hash index from an XID.
> > */
> > static inline u32 request_hash(u32 xid)
> > @@ -545,3 +550,27 @@ nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *data)
> > vec->iov_len += data->iov_len;
> > return 1;
> > }
> > +
> > +/*
> > + * Note that fields may be added, removed or reordered in the future. Programs
> > + * scraping this file for info should test the labels to ensure they're
> > + * getting the correct field.
> > + */
> > +static int nfsd_reply_cache_stats_show(struct seq_file *m, void *v)
> > +{
> > + spin_lock(&cache_lock);
> > + seq_printf(m, "max entries: %u\n", max_drc_entries);
> > + seq_printf(m, "num entries: %u\n", num_drc_entries);
> > + seq_printf(m, "mem usage: %u\n", drc_mem_usage);
> > + seq_printf(m, "cache hits: %u\n", nfsdstats.rchits);
> > + seq_printf(m, "cache misses: %u\n", nfsdstats.rcmisses);
> > + seq_printf(m, "not cached: %u\n", nfsdstats.rcnocache);
> > + seq_printf(m, "checksum misses: %u\n", csum_misses);
> > + spin_unlock(&cache_lock);
> > + return 0;
> > +}
>
> Would it be hard to add a statistic that quantified the distribution of entries in the cache hash table? Something like "maximum length of cache hash chain" ?
>

Hard? Not very, but it wouldn't be free...

I guess we'd need to keep an array of HASHSIZE (64) uints to track the
length of each hash bucket. Also, currently we don't need to know which
hash bucket an entry sits in when freeing it, so we'd need to recompute
that when freeing the entry (which is fairly trivial).

Also, what sort of info do you want here? Longest hash chain that the
cache has ever had, or longest hash chain currently in the cache? It
may be better to just present the length of each chain instead and let
userland sort out the max.

I'd probably suggest that we do this in a later patch on top of this
series once we determine what info we're looking for here. It should
be doable for 3.9 though...

--
Jeff Layton <[email protected]>

2013-02-18 14:39:54

by Jeff Layton

Subject: Re: [PATCH RFC] nfsd: report length of the largest hash chain in reply cache stats

On Sat, 16 Feb 2013 12:18:18 -0500
Chuck Lever <[email protected]> wrote:

>
> On Feb 16, 2013, at 8:39 AM, J. Bruce Fields <[email protected]> wrote:
>
> > On Fri, Feb 15, 2013 at 05:20:58PM -0500, Jeff Layton wrote:
> >> An excellent question, and not an easy one to answer. Clearly 1024
> >> entries was not enough. We now cap the size as a function of the
> >> available low memory, which I think is a reasonable way to keep it from
> >> ballooning so large that the box falls over. We also have a shrinker
> >> and periodic cache cleaner to prune off entries that have expired.
> >>
> >> Of course one thing I haven't really considered enough is the
> >> performance implications of walking the potentially much longer hash
> >> chains here.
> >>
> >> If that is a problem, then one way to counter that without moving to a
> >> different structure altogether might be to alter the hash function
> >> based on the max size of the cache. IOW, grow the number of hash buckets
> >> as the max cache size grows?
>
> The trouble with a hash table is that once you've allocated it, it's a heavy lift to increase the table size. That sort of logic adds complexity and additional locking, and is often difficult to test.
>

I wasn't suggesting that we resize/rebalance the table on the fly. We
determine the max allowable number of cache entries when the server
starts. We could also determine the number of buckets at the same time,
and alter the hashing function to take that number into account.

Of course, more buckets may not help if the hash function just sucks.

> > Another reason to organize the cache per client address?
>
>
> In theory, an active single client could evict all entries for other clients, but do we know this happens in practice?
>

I'm pretty sure that's what's been happening to our QA group. They have
some shared NFS servers set up in the test lab for client testing. When
things get busy, the DRC just plain doesn't appear to work. It's
hard to know for sure though since the problem only crops up very
rarely.

My hope is that the massive increase in the size of the DRC should
prevent that from occurring now.

> > With a per-client maximum number of entries, sizing the hash tables
> > should be easier.
>
>
> When a server has only one client, should that client be allowed to maximize the use of a server's resources (eg, use all of the DRC resource the server has available)? How about when a server has one active client and multiple quiescent clients?
>

Again, we have two "problems" to solve, and we need to take care not to
conflate them too much.

1) how do we best organize the cache for efficient lookups?

2) what cache eviction policy should we use?

Organizing the cache based on client address might help both of those,
but we'll need to determine whether it's worth the extra complexity.
--
Jeff Layton <[email protected]>

2013-02-15 17:23:20

by Jeff Layton

Subject: [PATCH v2 3/3] nfsd: add new reply_cache_stats file in nfsdfs

For presenting statistics relating to duplicate reply cache.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/cache.h | 1 +
fs/nfsd/nfscache.c | 37 +++++++++++++++++++++++++++++++++----
fs/nfsd/nfsctl.c | 9 +++++++++
3 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
index 87fd141..d5c5b3e 100644
--- a/fs/nfsd/cache.h
+++ b/fs/nfsd/cache.h
@@ -82,6 +82,7 @@ int nfsd_reply_cache_init(void);
void nfsd_reply_cache_shutdown(void);
int nfsd_cache_lookup(struct svc_rqst *);
void nfsd_cache_update(struct svc_rqst *, int, __be32 *);
+int nfsd_reply_cache_stats_open(struct inode *, struct file *);

#ifdef CONFIG_NFSD_V4
void nfsd4_set_statp(struct svc_rqst *rqstp, __be32 *statp);
diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index ad1d069..b155afa 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -23,12 +23,17 @@
static struct hlist_head * cache_hash;
static struct list_head lru_head;
static struct kmem_cache *drc_slab;
-static unsigned int num_drc_entries;
-static unsigned int max_drc_entries;
-static unsigned int drc_mem_usage;
-static unsigned int csum_misses;

/*
+ * Stats on the duplicate reply cache code. All of these and the "rc" fields
+ * in nfsdstats are protected by the cache_lock
+ */
+static unsigned int num_drc_entries; /* number of entries */
+static unsigned int max_drc_entries; /* max number of entries */
+static unsigned int drc_mem_usage; /* memory used by cache */
+static unsigned int csum_misses; /* cache misses due only to
+ csum comparison failures */
+/*
* Calculate the hash index from an XID.
*/
static inline u32 request_hash(u32 xid)
@@ -545,3 +550,27 @@ nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *data)
vec->iov_len += data->iov_len;
return 1;
}
+
+/*
+ * Note that fields may be added, removed or reordered in the future. Programs
+ * scraping this file for info should test the labels to ensure they're
+ * getting the correct field.
+ */
+static int nfsd_reply_cache_stats_show(struct seq_file *m, void *v)
+{
+ spin_lock(&cache_lock);
+ seq_printf(m, "max entries: %u\n", max_drc_entries);
+ seq_printf(m, "num entries: %u\n", num_drc_entries);
+ seq_printf(m, "mem usage: %u\n", drc_mem_usage);
+ seq_printf(m, "cache hits: %u\n", nfsdstats.rchits);
+ seq_printf(m, "cache misses: %u\n", nfsdstats.rcmisses);
+ seq_printf(m, "not cached: %u\n", nfsdstats.rcnocache);
+ seq_printf(m, "checksum misses: %u\n", csum_misses);
+ spin_unlock(&cache_lock);
+ return 0;
+}
+
+int nfsd_reply_cache_stats_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, nfsd_reply_cache_stats_show, NULL);
+}
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 29c3f0d..5dea5af 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -35,6 +35,7 @@ enum {
NFSD_Threads,
NFSD_Pool_Threads,
NFSD_Pool_Stats,
+ NFSD_Reply_Cache_Stats,
NFSD_Versions,
NFSD_Ports,
NFSD_MaxBlkSize,
@@ -194,6 +195,13 @@ static const struct file_operations pool_stats_operations = {
.owner = THIS_MODULE,
};

+static struct file_operations reply_cache_stats_operations = {
+ .open = nfsd_reply_cache_stats_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
/*----------------------------------------------------------------------------*/
/*
* payload - write methods
@@ -1024,6 +1032,7 @@ static int nfsd_fill_super(struct super_block * sb, void * data, int silent)
[NFSD_Threads] = {"threads", &transaction_ops, S_IWUSR|S_IRUSR},
[NFSD_Pool_Threads] = {"pool_threads", &transaction_ops, S_IWUSR|S_IRUSR},
[NFSD_Pool_Stats] = {"pool_stats", &pool_stats_operations, S_IRUGO},
+ [NFSD_Reply_Cache_Stats] = {"reply_cache_stats", &reply_cache_stats_operations, S_IRUGO},
[NFSD_Versions] = {"versions", &transaction_ops, S_IWUSR|S_IRUSR},
[NFSD_Ports] = {"portlist", &transaction_ops, S_IWUSR|S_IRUGO},
[NFSD_MaxBlkSize] = {"max_block_size", &transaction_ops, S_IWUSR|S_IRUGO},
--
1.7.11.7


2013-02-15 17:23:19

by Jeff Layton

Subject: [PATCH v2 1/3] nfsd: break out comparator into separate function

Break out the code that compares the rqstp and checksum against a reply
cache entry into a separate function. While we're at it, track the
efficacy of the checksum over the NFS data by tracking the cases where
we would have incorrectly matched a DRC entry if we had not compared
the checksum or the length.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfscache.c | 33 +++++++++++++++++++++++----------
1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 2f9c2d2..c00c5bc 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -25,6 +25,7 @@ static struct list_head lru_head;
static struct kmem_cache *drc_slab;
static unsigned int num_drc_entries;
static unsigned int max_drc_entries;
+static unsigned int csum_misses;

/*
* Calculate the hash index from an XID.
@@ -272,6 +273,26 @@ nfsd_cache_csum(struct svc_rqst *rqstp)
return csum;
}

+static bool
+nfsd_cache_match(struct svc_rqst *rqstp, __wsum csum, struct svc_cacherep *rp)
+{
+ /* Check RPC header info first */
+ if (rqstp->rq_xid != rp->c_xid || rqstp->rq_proc != rp->c_proc ||
+ rqstp->rq_prot != rp->c_prot || rqstp->rq_vers != rp->c_vers ||
+ rqstp->rq_arg.len != rp->c_len ||
+ !rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) ||
+ rpc_get_port(svc_addr(rqstp)) != rpc_get_port((struct sockaddr *)&rp->c_addr))
+ return false;
+
+ /* compare checksum of NFS data */
+ if (csum != rp->c_csum) {
+ ++csum_misses;
+ return false;
+ }
+
+ return true;
+}
+
/*
* Search the request hash for an entry that matches the given rqstp.
* Must be called with cache_lock held. Returns the found entry or
@@ -283,18 +304,10 @@ nfsd_cache_search(struct svc_rqst *rqstp, __wsum csum)
struct svc_cacherep *rp;
struct hlist_node *hn;
struct hlist_head *rh;
- __be32 xid = rqstp->rq_xid;
- u32 proto = rqstp->rq_prot,
- vers = rqstp->rq_vers,
- proc = rqstp->rq_proc;

- rh = &cache_hash[request_hash(xid)];
+ rh = &cache_hash[request_hash(rqstp->rq_xid)];
hlist_for_each_entry(rp, hn, rh, c_hash) {
- if (xid == rp->c_xid && proc == rp->c_proc &&
- proto == rp->c_prot && vers == rp->c_vers &&
- rqstp->rq_arg.len == rp->c_len && csum == rp->c_csum &&
- rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
- rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr))
+ if (nfsd_cache_match(rqstp, csum, rp))
return rp;
}
return NULL;
--
1.7.11.7


2013-02-15 21:15:12

by Chuck Lever III

Subject: Re: [PATCH RFC] nfsd: report length of the largest hash chain in reply cache stats


On Feb 15, 2013, at 3:04 PM, Jeff Layton <[email protected]> wrote:

> So we can get a feel for how effective the hashing function is.
>
> As Chuck Lever pointed out to me, it's generally acceptable to do
> "expensive" stuff when reading the stats since that's a relatively
> rare activity.

A good measure of the efficacy of a hash function is the ratio of the maximum chain length to the optimal chain length (which can be computed by dividing the total number of cache entries by the number of hash chains).

If we plan to stick with a hash table for this cache, there should be some indication when the hash function falls over. This will matter because the DRC can now grow much larger, which is turning out to be the real fundamental change with this work.

A philosophical question though is "How can we know when the DRC is large enough?"


> Cc: Chuck Lever <[email protected]>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/nfscache.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> index a5ac9ab..172e211 100644
> --- a/fs/nfsd/nfscache.c
> +++ b/fs/nfsd/nfscache.c
> @@ -551,6 +551,25 @@ nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *data)
> return 1;
> }
>
> +/* Get stats on the hashtable itself */
> +static unsigned int
> +nfsd_repcache_max_chain_len(void)
> +{
> + int i;
> + struct hlist_node *pos;
> + unsigned int max = 0;
> +
> + for (i = 0; i < HASHSIZE; ++i) {
> + unsigned int cur = 0;
> +
> + hlist_for_each(pos, &cache_hash[i])
> + ++cur;
> + max = max(cur, max);
> + }
> +
> + return max;
> +}
> +
> /*
> * Note that fields may be added, removed or reordered in the future. Programs
> * scraping this file for info should test the labels to ensure they're
> @@ -566,6 +585,7 @@ static int nfsd_reply_cache_stats_show(struct seq_file *m, void *v)
> seq_printf(m, "cache misses: %u\n", nfsdstats.rcmisses);
> seq_printf(m, "not cached: %u\n", nfsdstats.rcnocache);
> seq_printf(m, "checksum misses: %u\n", csum_misses);
> + seq_printf(m, "max chain len: %u\n", nfsd_repcache_max_chain_len());
> spin_unlock(&cache_lock);
> return 0;
> }
> --
> 1.7.11.7
>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-02-18 14:21:39

by Jeff Layton

Subject: Re: [PATCH RFC] nfsd: report length of the largest hash chain in reply cache stats

On Sun, 17 Feb 2013 11:00:56 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Sat, Feb 16, 2013 at 12:18:18PM -0500, Chuck Lever wrote:
> >
> > On Feb 16, 2013, at 8:39 AM, J. Bruce Fields <[email protected]> wrote:
> > > With a per-client maximum number of entries, sizing the hash tables
> > > should be easier.
> > When a server has only one client, should that client be allowed to
> > maximize the use of a server's resources (eg, use all of the DRC
> > resource the server has available)?
>
> I've been assuming there's rapidly diminishing returns to caching a lot
> of replies to a single client. But that might not be true--I guess a
> busy UDP client with a long retry timeout might benefit from a large
> cache?
>

Yes, or one with a massively parallel workload and poor slot-table
implementation that just sprays tons of requests at the server?

> > How about when a server has one active client and multiple quiescent
> > clients?
>
> I think there's a chance in that case that one of the quiescent clients
> is experiencing a temporary network problem, in which case we may want
> to preserve a few entries for them even if the active client's activity
> would normally evict them.
>

Cache eviction policy is really orthogonal to how we organize it for
efficient lookups. Currently, cache entries sit in a hash table for
lookups, and on a LRU list for eviction.

We can certainly change either or both. I think the trick here is to
nail down what semantics you want for the cache and then look at how
best to organize it to achieve that.

OTOH, maybe we should see whether this is really a problem first before
we go and try to fix anything.

--
Jeff Layton <[email protected]>

2013-02-17 16:00:58

by J. Bruce Fields

Subject: Re: [PATCH RFC] nfsd: report length of the largest hash chain in reply cache stats

On Sat, Feb 16, 2013 at 12:18:18PM -0500, Chuck Lever wrote:
>
> On Feb 16, 2013, at 8:39 AM, J. Bruce Fields <[email protected]> wrote:
> > With a per-client maximum number of entries, sizing the hash tables
> > should be easier.
> When a server has only one client, should that client be allowed to
> maximize the use of a server's resources (eg, use all of the DRC
> resource the server has available)?

I've been assuming there's rapidly diminishing returns to caching a lot
of replies to a single client. But that might not be true--I guess a
busy UDP client with a long retry timeout might benefit from a large
cache?

> How about when a server has one active client and multiple quiescent
> clients?

I think there's a chance in that case that one of the quiescent clients
is experiencing a temporary network problem, in which case we may want
to preserve a few entries for them even if the active client's activity
would normally evict them.

--b.

2013-02-18 16:18:22

by Chuck Lever III

Subject: Re: [PATCH RFC] nfsd: report length of the largest hash chain in reply cache stats


On Feb 18, 2013, at 9:39 AM, Jeff Layton <[email protected]> wrote:

> On Sat, 16 Feb 2013 12:18:18 -0500
> Chuck Lever <[email protected]> wrote:
>
>>
>> On Feb 16, 2013, at 8:39 AM, J. Bruce Fields <[email protected]> wrote:
>>
>>> On Fri, Feb 15, 2013 at 05:20:58PM -0500, Jeff Layton wrote:
>>>> An excellent question, and not an easy one to answer. Clearly 1024
>>>> entries was not enough. We now cap the size as a function of the
>>>> available low memory, which I think is a reasonable way to keep it from
>>>> ballooning so large that the box falls over. We also have a shrinker
>>>> and periodic cache cleaner to prune off entries that have expired.
>>>>
>>>> Of course one thing I haven't really considered enough is the
>>>> performance implications of walking the potentially much longer hash
>>>> chains here.
>>>>
>>>> If that is a problem, then one way to counter that without moving to a
>>>> different structure altogether might be to alter the hash function
>>>> based on the max size of the cache. IOW, grow the number of hash buckets
>>>> as the max cache size grows?
>>
>> The trouble with a hash table is that once you've allocated it, it's a heavy lift to increase the table size. That sort of logic adds complexity and additional locking, and is often difficult to test.
>>
>
> I wasn't suggesting that we resize/rebalance the table on the fly. We
> determine the max allowable number of cache entries when the server
> starts. We could also determine the number of buckets at the same time,
> and alter the hashing function to take that number into account.
>
> Of course, more buckets may not help if the hash function just sucks.

My point was that using a data structure like an rbtree, as you suggested, has the advantage of not requiring any extra logic (whether done statically or dynamically). An rbtree implementation is much more likely to exercise exactly the same code paths no matter how many entries are in the cache. It will scale deterministically as DRC capacity continues to increase over time. And it is not dependent on efficient hash-ability of client XIDs.


>
>>> Another reason to organize the cache per client address?
>>
>>
>> In theory, an active single client could evict all entries for other clients, but do we know this happens in practice?
>>
>
> I'm pretty sure that's what's been happening to our QA group. They have
> some shared NFS servers set up in the test lab for client testing. When
> things get busy, the DRC just plain doesn't appear to work. It's
> hard to know for sure though since the problem only crops up very
> rarely.
>
> My hope is that the massive increase in the size of the DRC should
> prevent that from occurring now.

Well, right. If the DRC is too small already, any client can cause entries to be evicted prematurely, even its own entries. Now we have an opportunity to ask whether a larger cache makes it much less likely that a single obnoxious client will affect its neighbors. In other words, maybe per-client sensitivity doesn't buy us much if the DRC can grow larger.


>
>>> With a per-client maximum number of entries, sizing the hash tables
>>> should be easier.
>>
>>
>> When a server has only one client, should that client be allowed to maximize the use of a server's resources (eg, use all of the DRC resource the server has available)? How about when a server has one active client and multiple quiescent clients?
>>
>
> Again, we have two "problems" to solve, and we need to take care not to
> conflate them too much.
>
> 1) how do we best organize the cache for efficient lookups?
>
> 2) what cache eviction policy should we use?

These are related, I feel. Because memory capacity is increasing faster than memory bandwidth, I think lookup time efficiency is relatively more important than cache space efficiency. Also, above we were arguing that a larger DRC is more effective.

What's more, 2) is a much harder problem to solve well. And if lookups are efficient anyway, then 2) should matter much less, as it will only be needed to deal with memory pressure on the server, and not to keep lookup time efficient.

I was proposing to handle per-client caching through 2) (choosing which entry to evict based on how many entries there are for each client). But since 2) looks less critical than 1), perhaps per-client caching doesn't provide a lot of benefit.


> Organizing the cache based on client address might help both of those,
> but we'll need to determine whether it's worth the extra complexity.

Agree. Data, data, data.


--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-02-18 14:30:06

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH RFC] nfsd: report length of the largest hash chain in reply cache stats

On Mon, Feb 18, 2013 at 09:21:34AM -0500, Jeff Layton wrote:
> On Sun, 17 Feb 2013 11:00:56 -0500
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Sat, Feb 16, 2013 at 12:18:18PM -0500, Chuck Lever wrote:
> > >
> > > On Feb 16, 2013, at 8:39 AM, J. Bruce Fields <[email protected]> wrote:
> > > > With a per-client maximum number of entries, sizing the hash tables
> > > > should be easier.
> > > When a server has only one client, should that client be allowed to
> > > maximize the use of a server's resources (eg, use all of the DRC
> > > resource the server has available)?
> >
> > I've been assuming there's rapidly diminishing returns to caching a lot
> > of replies to a single client. But that might not be true--I guess a
> > busy UDP client with a long retry timeout might benefit from a large
> > cache?
> >
>
> Yes, or one with a massively parallel workload and poor slot-table
> implementation that just sprays tons of requests at the server?

I believe we pin cache entries while the request is in progress and
timestamp and insert into the LRU when we send the reply--so the
parallelism doesn't matter so much, in the sense that we're not going to
for example get a big slow write operation evicted by a bunch of quick
concurrent getattrs.

What matters is: from the time we send a reply, to the time the client
retries, how many more requests does the client send?

> > > How about when a server has one active client and multiple quiescent
> > > clients?
> >
> > I think there's a chance in that case that one of the quiescent clients
> > is experiencing a temporary network problem, in which case we may want
> > to preserve a few entries for them even if the active client's activity
> > would normally evict them.
> >
>
> Cache eviction policy is really orthogonal to how we organize it for
> efficient lookups. Currently, cache entries sit in a hash table for
> lookups, and on a LRU list for eviction.
>
> We can certainly change either or both. I think the trick here is to
> nail down what semantics you want for the cache and then look at how
> best to organize it to achieve that.
>
> OTOH, maybe we should see whether this is really a problem first before
> we go and try to fix anything.

Yeah.

--b.

2013-02-15 17:29:20

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] nfsd: add new reply_cache_stats file in nfsdfs


On Feb 15, 2013, at 12:23 PM, Jeff Layton <[email protected]> wrote:

> For presenting statistics relating to the duplicate reply cache.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/cache.h | 1 +
> fs/nfsd/nfscache.c | 37 +++++++++++++++++++++++++++++++++----
> fs/nfsd/nfsctl.c | 9 +++++++++
> 3 files changed, 43 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
> index 87fd141..d5c5b3e 100644
> --- a/fs/nfsd/cache.h
> +++ b/fs/nfsd/cache.h
> @@ -82,6 +82,7 @@ int nfsd_reply_cache_init(void);
> void nfsd_reply_cache_shutdown(void);
> int nfsd_cache_lookup(struct svc_rqst *);
> void nfsd_cache_update(struct svc_rqst *, int, __be32 *);
> +int nfsd_reply_cache_stats_open(struct inode *, struct file *);
>
> #ifdef CONFIG_NFSD_V4
> void nfsd4_set_statp(struct svc_rqst *rqstp, __be32 *statp);
> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> index ad1d069..b155afa 100644
> --- a/fs/nfsd/nfscache.c
> +++ b/fs/nfsd/nfscache.c
> @@ -23,12 +23,17 @@
> static struct hlist_head * cache_hash;
> static struct list_head lru_head;
> static struct kmem_cache *drc_slab;
> -static unsigned int num_drc_entries;
> -static unsigned int max_drc_entries;
> -static unsigned int drc_mem_usage;
> -static unsigned int csum_misses;
>
> /*
> + * Stats on the duplicate reply cache code. All of these and the "rc" fields
> + * in nfsdstats are protected by the cache_lock
> + */
> +static unsigned int num_drc_entries; /* number of entries */
> +static unsigned int max_drc_entries; /* max number of entries */
> +static unsigned int drc_mem_usage; /* memory used by cache */
> +static unsigned int csum_misses; /* cache misses due only to
> + csum comparison failures */
> +/*
> * Calculate the hash index from an XID.
> */
> static inline u32 request_hash(u32 xid)
> @@ -545,3 +550,27 @@ nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *data)
> vec->iov_len += data->iov_len;
> return 1;
> }
> +
> +/*
> + * Note that fields may be added, removed or reordered in the future. Programs
> + * scraping this file for info should test the labels to ensure they're
> + * getting the correct field.
> + */
> +static int nfsd_reply_cache_stats_show(struct seq_file *m, void *v)
> +{
> + spin_lock(&cache_lock);
> + seq_printf(m, "max entries: %u\n", max_drc_entries);
> + seq_printf(m, "num entries: %u\n", num_drc_entries);
> + seq_printf(m, "mem usage: %u\n", drc_mem_usage);
> + seq_printf(m, "cache hits: %u\n", nfsdstats.rchits);
> + seq_printf(m, "cache misses: %u\n", nfsdstats.rcmisses);
> + seq_printf(m, "not cached: %u\n", nfsdstats.rcnocache);
> + seq_printf(m, "checksum misses: %u\n", csum_misses);
> + spin_unlock(&cache_lock);
> + return 0;
> +}

Would it be hard to add a statistic that quantified the distribution of entries in the cache hash table? Something like "maximum length of cache hash chain" ?


> +
> +int nfsd_reply_cache_stats_open(struct inode *inode, struct file *file)
> +{
> + return single_open(file, nfsd_reply_cache_stats_show, NULL);
> +}
> diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
> index 29c3f0d..5dea5af 100644
> --- a/fs/nfsd/nfsctl.c
> +++ b/fs/nfsd/nfsctl.c
> @@ -35,6 +35,7 @@ enum {
> NFSD_Threads,
> NFSD_Pool_Threads,
> NFSD_Pool_Stats,
> + NFSD_Reply_Cache_Stats,
> NFSD_Versions,
> NFSD_Ports,
> NFSD_MaxBlkSize,
> @@ -194,6 +195,13 @@ static const struct file_operations pool_stats_operations = {
> .owner = THIS_MODULE,
> };
>
> +static struct file_operations reply_cache_stats_operations = {
> + .open = nfsd_reply_cache_stats_open,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = single_release,
> +};
> +
> /*----------------------------------------------------------------------------*/
> /*
> * payload - write methods
> @@ -1024,6 +1032,7 @@ static int nfsd_fill_super(struct super_block * sb, void * data, int silent)
> [NFSD_Threads] = {"threads", &transaction_ops, S_IWUSR|S_IRUSR},
> [NFSD_Pool_Threads] = {"pool_threads", &transaction_ops, S_IWUSR|S_IRUSR},
> [NFSD_Pool_Stats] = {"pool_stats", &pool_stats_operations, S_IRUGO},
> + [NFSD_Reply_Cache_Stats] = {"reply_cache_stats", &reply_cache_stats_operations, S_IRUGO},
> [NFSD_Versions] = {"versions", &transaction_ops, S_IWUSR|S_IRUSR},
> [NFSD_Ports] = {"portlist", &transaction_ops, S_IWUSR|S_IRUGO},
> [NFSD_MaxBlkSize] = {"max_block_size", &transaction_ops, S_IWUSR|S_IRUGO},
> --
> 1.7.11.7
>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-02-15 22:21:05

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH RFC] nfsd: report length of the largest hash chain in reply cache stats

On Fri, 15 Feb 2013 16:14:56 -0500
Chuck Lever <[email protected]> wrote:

>
> On Feb 15, 2013, at 3:04 PM, Jeff Layton <[email protected]> wrote:
>
> > So we can get a feel for how effective the hashing function is.
> >
> > As Chuck Lever pointed out to me, it's generally acceptable to do
> > "expensive" stuff when reading the stats since that's a relatively
> > rare activity.
>
> A good measure of the efficacy of a hash function is the ratio of the maximum chain length to the optimal chain length (which can be computed by dividing the total number of cache entries by the number of hash chains).
>

Right, the number of chains is always 64 for now (maybe we should print
that out in the stats too), so you can compute that from the values
provided here.

> If we plan to stick with a hash table for this cache, there should be some indication when the hash function falls over. This will matter because the DRC can now grow much larger, which is turning out to be the real fundamental change with this work.
>

That's the kicker. With the patch below, computing the max chain length
on the fly is somewhat expensive since you have to walk every entry.
It's certainly possible though (even likely) that the real max length
will occur at some point when we're not looking at this file. So how to
best gauge that?

Maybe we should just punt and move it all to an rbtree or something. A
self-balancing structure is nice and simple to deal with, even if the
insertion penalty is a bit higher...

> A philosophical question though is "How can we know when the DRC is large enough?"
>

An excellent question, and not an easy one to answer. Clearly 1024
entries was not enough. We now cap the size as a function of the
available low memory, which I think is a reasonable way to keep it from
ballooning so large that the box falls over. We also have a shrinker
and periodic cache cleaner to prune off entries that have expired.

Of course one thing I haven't really considered enough is the
performance implications of walking the potentially much longer hash
chains here.

If that is a problem, then one way to counter that without moving to a
different structure altogether might be to alter the hash function
based on the max size of the cache. IOW, grow the number of hash buckets
as the max cache size grows?

--
Jeff Layton <[email protected]>