2008-08-14 02:03:28

by Jeff Layton

Subject: [PATCH] knfsd: allocate readahead cache in individual chunks

I had a report from someone building a large NFS server that they were
unable to start more than 585 nfsd threads. It was reported against an
older kernel using the slab allocator, and I tracked it down to the
large allocation in nfsd_racache_init failing.

It appears that the slub allocator handles large allocations better,
but large contiguous allocations can often be problematic. There
doesn't seem to be any reason that the racache has to be allocated as a
single large chunk. This patch breaks this up so that the racache is
built up from separate allocations.

Thoughts?

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/vfs.c | 49 ++++++++++++++++++++++++++++++++-----------------
1 files changed, 32 insertions(+), 17 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 1319e80..8bdf46f 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -83,11 +83,11 @@ struct raparm_hbucket {
         spinlock_t              pb_lock;
 } ____cacheline_aligned_in_smp;
 
-static struct raparms *         raparml;
 #define RAPARM_HASH_BITS        4
 #define RAPARM_HASH_SIZE        (1<<RAPARM_HASH_BITS)
 #define RAPARM_HASH_MASK        (RAPARM_HASH_SIZE-1)
 static struct raparm_hbucket    raparm_hash[RAPARM_HASH_SIZE];
+static unsigned int             raparm_hash_buckets;
 
 /*
  * Called from nfsd_lookup and encode_dirent. Check if we have crossed
@@ -1966,11 +1966,20 @@ nfsd_permission(struct svc_rqst *rqstp, struct svc_export *exp,
 void
 nfsd_racache_shutdown(void)
 {
-        if (!raparml)
-                return;
+        struct raparms *raparm, *last_raparm;
+        unsigned int i;
+
         dprintk("nfsd: freeing readahead buffers.\n");
-        kfree(raparml);
-        raparml = NULL;
+
+        for (i = 0; i < raparm_hash_buckets; i++) {
+                raparm = raparm_hash[i].pb_head;
+                while (raparm) {
+                        last_raparm = raparm;
+                        raparm = raparm->p_next;
+                        kfree(last_raparm);
+                }
+                raparm_hash[i].pb_head = NULL;
+        }
 }
 /*
  * Initialize readahead param cache
@@ -1981,35 +1990,41 @@ nfsd_racache_init(int cache_size)
         int     i;
         int     j = 0;
         int     nperbucket;
+        struct raparms *raparm, *last_raparm = NULL;
 
 
-        if (raparml)
+        if (raparm_hash[0].pb_head)
                 return 0;
         if (cache_size < 2*RAPARM_HASH_SIZE)
                 cache_size = 2*RAPARM_HASH_SIZE;
-        raparml = kcalloc(cache_size, sizeof(struct raparms), GFP_KERNEL);
-
-        if (!raparml) {
-                printk(KERN_WARNING
-                       "nfsd: Could not allocate memory read-ahead cache.\n");
-                return -ENOMEM;
-        }
 
         dprintk("nfsd: allocating %d readahead buffers.\n", cache_size);
         for (i = 0 ; i < RAPARM_HASH_SIZE ; i++) {
                 raparm_hash[i].pb_head = NULL;
                 spin_lock_init(&raparm_hash[i].pb_lock);
         }
+
         nperbucket = DIV_ROUND_UP(cache_size, RAPARM_HASH_SIZE);
-        for (i = 0; i < cache_size - 1; i++) {
+        for (i = 0; i < cache_size; i++) {
+                raparm = kzalloc(sizeof(*raparm), GFP_KERNEL);
+                if (!raparm)
+                        goto out_nomem;
+
                 if (i % nperbucket == 0)
-                        raparm_hash[j++].pb_head = raparml + i;
-                if (i % nperbucket < nperbucket-1)
-                        raparml[i].p_next = raparml + i + 1;
+                        raparm_hash[j++].pb_head = raparm;
+                else
+                        last_raparm->p_next = raparm;
+                last_raparm = raparm;
         }
 
         nfsdstats.ra_size = cache_size;
+        raparm_hash_buckets = j;
         return 0;
+
+out_nomem:
+        dprintk("nfsd: kmalloc failed, freeing readahead buffers\n");
+        nfsd_racache_shutdown();
+        return -ENOMEM;
 }
 
 #if defined(CONFIG_NFSD_V2_ACL) || defined(CONFIG_NFSD_V3_ACL)
--
1.5.5.1



2008-08-14 18:06:37

by J. Bruce Fields

Subject: Re: [PATCH] knfsd: allocate readahead cache in individual chunks

On Wed, Aug 13, 2008 at 10:03:27PM -0400, Jeff Layton wrote:
> I had a report from someone building a large NFS server that they were
> unable to start more than 585 nfsd threads. It was reported against an
> older kernel using the slab allocator, and I tracked it down to the
> large allocation in nfsd_racache_init failing.
>
> It appears that the slub allocator handles large allocations better,
> but large contiguous allocations can often be problematic. There
> doesn't seem to be any reason that the racache has to be allocated as a
> single large chunk. This patch breaks this up so that the racache is
> built up from separate allocations.

So by default nfsd_racache_init gets called with size 2*nrservs, so 1170
in this case. Looks like struct raparms is about 50 bytes. So that's
about a 60k allocation?

And RAPARM_HASH_SIZE is 16. So you could use an array for each hash
bucket and each array would be under a page even in this rather extreme
case, if you wanted to avoid lots of little allocations. I don't know
whether that really matters, though.
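
Untested sketch of what I mean, reusing the names from your patch; the
shutdown side would then just kfree() each bucket's pb_head instead of
walking the whole chain:

        /*
         * One kcalloc() per hash chain: the largest single allocation
         * is nperbucket * sizeof(struct raparms), which stays under a
         * page in the 1170-buffer case above.
         */
        nperbucket = DIV_ROUND_UP(cache_size, RAPARM_HASH_SIZE);
        for (i = 0; i < RAPARM_HASH_SIZE; i++) {
                struct raparms *array;

                array = kcalloc(nperbucket, sizeof(*array), GFP_KERNEL);
                if (!array)
                        goto out_nomem;
                raparm_hash[i].pb_head = array;
                /* kcalloc() zeroes, so the final p_next is already NULL */
                for (j = 0; j < nperbucket - 1; j++)
                        array[j].p_next = &array[j + 1];
        }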

> +static unsigned int raparm_hash_buckets;

It looks like this is here just to tell you how many bucket heads are
non-null when freeing? But I think the

        if (cache_size < 2*RAPARM_HASH_SIZE)
                cache_size = 2*RAPARM_HASH_SIZE;

in nfsd_racache_init ensures no bucket is null, and anyway you check for
null when freeing, so it doesn't seem like it would matter much.

>         nperbucket = DIV_ROUND_UP(cache_size, RAPARM_HASH_SIZE);
> -        for (i = 0; i < cache_size - 1; i++) {
> +        for (i = 0; i < cache_size; i++) {
> +                raparm = kzalloc(sizeof(*raparm), GFP_KERNEL);
> +                if (!raparm)
> +                        goto out_nomem;
> +
>                 if (i % nperbucket == 0)
> -                        raparm_hash[j++].pb_head = raparml + i;
> -                if (i % nperbucket < nperbucket-1)
> -                        raparml[i].p_next = raparml + i + 1;
> +                        raparm_hash[j++].pb_head = raparm;
> +                else
> +                        last_raparm->p_next = raparm;
> +                last_raparm = raparm;

The modular arithmetic here always struck me as a bit convoluted. Why
not just nested for loops, and not be as fussy about the round-off
error? E.g.

        for (i = 0; i < RAPARM_HASH_SIZE; i++) {
                spin_lock_init(&raparm_hash[i].pb_lock);
                raparm = &raparm_hash[i].pb_head;

                for (j = 0; j < nperbucket; j++) {
                        *raparm = kzalloc(sizeof(**raparm), GFP_KERNEL);
                        if (!*raparm)
                                goto out_nomem;
                        raparm = &(*raparm)->p_next;
                }
                *raparm = NULL;
        }

Lightly-tested patch (to apply on top of yours) follows.

--b.

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 1acbf13..72783d9 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -87,7 +87,6 @@ struct raparm_hbucket {
 #define RAPARM_HASH_SIZE        (1<<RAPARM_HASH_BITS)
 #define RAPARM_HASH_MASK        (RAPARM_HASH_SIZE-1)
 static struct raparm_hbucket    raparm_hash[RAPARM_HASH_SIZE];
-static unsigned int             raparm_hash_buckets;
 
 /*
  * Called from nfsd_lookup and encode_dirent. Check if we have crossed
@@ -1977,7 +1976,7 @@ nfsd_racache_shutdown(void)
 
         dprintk("nfsd: freeing readahead buffers.\n");
 
-        for (i = 0; i < raparm_hash_buckets; i++) {
+        for (i = 0; i < RAPARM_HASH_SIZE; i++) {
                 raparm = raparm_hash[i].pb_head;
                 while (raparm) {
                         last_raparm = raparm;
@@ -1996,35 +1995,32 @@ nfsd_racache_init(int cache_size)
         int     i;
         int     j = 0;
         int     nperbucket;
-        struct raparms *raparm, *last_raparm = NULL;
+        struct raparms **raparm = NULL;
 
 
         if (raparm_hash[0].pb_head)
                 return 0;
-        if (cache_size < 2*RAPARM_HASH_SIZE)
-                cache_size = 2*RAPARM_HASH_SIZE;
+        nperbucket = DIV_ROUND_UP(cache_size, RAPARM_HASH_SIZE);
+        if (nperbucket < 2)
+                nperbucket = 2;
+        cache_size = nperbucket * RAPARM_HASH_SIZE;
 
         dprintk("nfsd: allocating %d readahead buffers.\n", cache_size);
-        for (i = 0 ; i < RAPARM_HASH_SIZE ; i++) {
-                raparm_hash[i].pb_head = NULL;
-                spin_lock_init(&raparm_hash[i].pb_lock);
-        }
 
-        nperbucket = DIV_ROUND_UP(cache_size, RAPARM_HASH_SIZE);
-        for (i = 0; i < cache_size; i++) {
-                raparm = kzalloc(sizeof(*raparm), GFP_KERNEL);
-                if (!raparm)
-                        goto out_nomem;
+        for (i = 0; i < RAPARM_HASH_SIZE; i++) {
+                spin_lock_init(&raparm_hash[i].pb_lock);
+
-                if (i % nperbucket == 0)
-                        raparm_hash[j++].pb_head = raparm;
-                else
-                        last_raparm->p_next = raparm;
-                last_raparm = raparm;
+                raparm = &raparm_hash[i].pb_head;
+                for (j = 0; j < nperbucket; j++) {
+                        *raparm = kzalloc(sizeof(**raparm), GFP_KERNEL);
+                        if (!*raparm)
+                                goto out_nomem;
+                        raparm = &(*raparm)->p_next;
+                }
+                *raparm = NULL;
         }
 
         nfsdstats.ra_size = cache_size;
-        raparm_hash_buckets = j;
         return 0;
 
 out_nomem:

2008-08-14 18:52:31

by Jeff Layton

Subject: Re: [PATCH] knfsd: allocate readahead cache in individual chunks

On Thu, 14 Aug 2008 14:06:35 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Wed, Aug 13, 2008 at 10:03:27PM -0400, Jeff Layton wrote:
> > I had a report from someone building a large NFS server that they were
> > unable to start more than 585 nfsd threads. It was reported against an
> > older kernel using the slab allocator, and I tracked it down to the
> > large allocation in nfsd_racache_init failing.
> >
> > It appears that the slub allocator handles large allocations better,
> > but large contiguous allocations can often be problematic. There
> > doesn't seem to be any reason that the racache has to be allocated as a
> > single large chunk. This patch breaks this up so that the racache is
> > built up from separate allocations.
>
> So by default nfsd_racache_init gets called with size 2*nrservs, so 1170
> in this case. Looks like struct raprms is about 50 bytes. So that's
> about a 60k allocation?
>

On my x86_64 box, sizeof(struct raparms) is 72 bytes for recent
kernels. For 2.6.18-ish it's 112:

2*585*112 = 131040

...which is just under a 128k allocation, and that seems to be as large
a kmalloc as you can do with slab. This is just the breakover point,
though; the customer in question was trying to start 2048 nfsds. That
might be considered insane, but if it is, we should probably reduce
NFSD_MAXSERVS, which is set at 8192.

> And RAPARM_HASH_SIZE is 16. So you could use an array for each hash
> bucket and each array would be under a page even in this rather extreme
> case, if you wanted to avoid lots of little allocations. I don't know
> whether that really matters, though.
>

I thought about that after I sent this patch.

I suppose there are 3 routes we can go with this:

1) do what I did here (individual kmalloc for each raparm)

2) do an allocation of a raparms array for each hash bucket, but when
we get to larger thread counts, these will still require multi-page
allocations:

8192 threads * 2 / 16 buckets * 72 bytes = 73728 byte allocation

3) declare a separate slabcache for this and allocate each raparm
individually from that

...#3 might be reasonable, but we could end up wasting even more memory
that way when we only have a few threads. Tough call...
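
For reference, #3 would look something like the untested sketch below
-- "raparm_cache" is just a made-up name here, not anything in the
tree:

        static struct kmem_cache *raparm_cache;

        /* at init time, before allocating anything: */
        raparm_cache = kmem_cache_create("nfsd_raparms",
                                         sizeof(struct raparms), 0, 0, NULL);
        if (!raparm_cache)
                return -ENOMEM;

        /* ...each raparm in the allocation loop then becomes: */
        raparm = kmem_cache_zalloc(raparm_cache, GFP_KERNEL);
        if (!raparm)
                goto out_nomem;

        /* ...and shutdown frees back into the cache: */
        kmem_cache_free(raparm_cache, last_raparm);

...plus a kmem_cache_destroy() once everything has been freed.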

> > +static unsigned int raparm_hash_buckets;
>
> It looks like this is here just to tell you how many bucket heads are
> non-null when freeing? But I think the
>
>         if (cache_size < 2*RAPARM_HASH_SIZE)
>                 cache_size = 2*RAPARM_HASH_SIZE;
>
> in nfsd_racache_init ensures no bucket is null, and anyway you check for
> null when freeing, so it doesn't seem like it would matter much.
>

Very good point. Your way is better...

> >         nperbucket = DIV_ROUND_UP(cache_size, RAPARM_HASH_SIZE);
> > -        for (i = 0; i < cache_size - 1; i++) {
> > +        for (i = 0; i < cache_size; i++) {
> > +                raparm = kzalloc(sizeof(*raparm), GFP_KERNEL);
> > +                if (!raparm)
> > +                        goto out_nomem;
> > +
> >                 if (i % nperbucket == 0)
> > -                        raparm_hash[j++].pb_head = raparml + i;
> > -                if (i % nperbucket < nperbucket-1)
> > -                        raparml[i].p_next = raparml + i + 1;
> > +                        raparm_hash[j++].pb_head = raparm;
> > +                else
> > +                        last_raparm->p_next = raparm;
> > +                last_raparm = raparm;
>
> The modular arithmetic here always struck me as a bit convoluted. Why
> not just nested for loops, and not be as fussy about the round-off
> error? E.g.
>
>         for (i = 0; i < RAPARM_HASH_SIZE; i++) {
>                 spin_lock_init(&raparm_hash[i].pb_lock);
>                 raparm = &raparm_hash[i].pb_head;
>
>                 for (j = 0; j < nperbucket; j++) {
>                         *raparm = kzalloc(sizeof(**raparm), GFP_KERNEL);
>                         if (!*raparm)
>                                 goto out_nomem;
>                         raparm = &(*raparm)->p_next;
>                 }
>                 *raparm = NULL;
>         }
>
> Lightly-tested patch (to apply on top of yours) follows.
>
> --b.
>

The combined patch looks good to me.

Thanks!
--
Jeff Layton <[email protected]>

2008-08-14 19:11:34

by Trond Myklebust

Subject: Re: [PATCH] knfsd: allocate readahead cache in individual chunks

On Thu, 2008-08-14 at 14:52 -0400, Jeff Layton wrote:
> 1) do what I did here (individual kmalloc for each raparm)
>
> 2) do an allocation of a raparms array for each hash bucket, but when
> we get to larger thread counts, these will still require multi-page
> allocations:
>
> 8192 threads * 2 / 16 buckets * 72 bytes = 73728 byte allocation
>
> 3) declare a separate slabcache for this and allocate each raparm
> individually from that
>
> ...#3 might be reasonable, but we could end up wasting even more memory
> that way when we only have a few threads. Tough call...

How about
4) Set an upper cap on the size of the raparm array

Readahead is only useful in moderation. You don't want to reach a
situation where the various threads' readahead activities start causing
thrashing...
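
Just to illustrate, the existing clamp in nfsd_racache_init() could
grow an upper bound -- RAPARM_MAX here is a number pulled out of the
air, not a tuned value:

#define RAPARM_MAX        1024        /* placeholder -- wants real tuning */

        if (cache_size < 2 * RAPARM_HASH_SIZE)
                cache_size = 2 * RAPARM_HASH_SIZE;
        else if (cache_size > RAPARM_MAX)
                cache_size = RAPARM_MAX;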

Cheers
Trond


2008-08-14 19:14:36

by Jeff Layton

Subject: Re: [PATCH] knfsd: allocate readahead cache in individual chunks

On Thu, 14 Aug 2008 15:11:26 -0400
Trond Myklebust <[email protected]> wrote:

> On Thu, 2008-08-14 at 14:52 -0400, Jeff Layton wrote:
> > 1) do what I did here (individual kmalloc for each raparm)
> >
> > 2) do an allocation of a raparms array for each hash bucket, but when
> > we get to larger thread counts, these will still require multi-page
> > allocations:
> >
> > 8192 threads * 2 / 16 buckets * 72 bytes = 73728 byte allocation
> >
> > 3) declare a separate slabcache for this and allocate each raparm
> > individually from that
> >
> > ...#3 might be reasonable, but we could end up wasting even more memory
> > that way when we only have a few threads. Tough call...
>
> How about
> 4) Set an upper cap on the size of the raparm array
>
> Readahead is only useful in moderation. You don't want to reach a
> situation where the various threads' readahead activities start causing
> thrashing...
>

That sounds like it would be a good idea in conjunction with this
patch. What would be a reasonable upper limit?

--
Jeff Layton <[email protected]>