2013-10-31 06:49:03

by Shyam Kaushik

Subject: Need help with NFS Server SUNRPC performance issue

Hi Folks,

I am chasing an NFS server performance issue on an Ubuntu
3.8.13-030813-generic kernel. We set up 32 NFSD threads on our NFS
server.

The issue is:
# I am using fio to generate 4K random writes (over a sync-mounted NFS
server filesystem) with 64 outstanding IOs per job for 10 jobs. The fio
direct flag is set.
# When doing fio randwrite 4K IOs, I realized that we cannot exceed 2.5K
IOPs on the NFS server from a single client.
# With multiple clients we can do more IOPs (like 3x more IOPs with 3 clients).
# Chasing the issue further, I realized that at any point in time only
8 NFSD threads are active doing vfs_write(). The remaining 24 threads are
sleeping within svc_recv()/svc_get_next_xprt().
# First I thought it was TCP socket contention/sleeping at the wrong
time. I introduced a one-second sleep around vfs_write() within NFSD
using msleep() (a rough sketch of this instrumentation follows below).
With this I can clearly see that only 8 NFSD threads are active doing
the write+sleep loop while all the other threads are sleeping.
# I enabled rpcdebug/nfs debug on the NFS client side + used tcpdump on
the NFS server side to confirm that the client is queuing all the
outstanding IOs concurrently & it's not an NFS client side problem.
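
For reference, the instrumentation was along these lines (an illustrative
sketch only, not the actual fs/nfsd code; traced_vfs_write() and the
counter are made-up names for the example):

#include <linux/atomic.h>
#include <linux/delay.h>
#include <linux/fs.h>
#include <linux/printk.h>

/* Hypothetical debug-only wrapper: count how many nfsd threads are in the
 * write path at once and stretch the window with a sleep so the
 * concurrency ceiling becomes obvious in the logs. */
static atomic_t nfsd_writers_active = ATOMIC_INIT(0);

static ssize_t traced_vfs_write(struct file *file, const char __user *buf,
				size_t count, loff_t *pos)
{
	ssize_t ret;

	pr_info("nfsd writers active: %d\n",
		atomic_inc_return(&nfsd_writers_active));

	ret = vfs_write(file, buf, count, pos);
	msleep(1000);			/* exaggerate the busy window */

	atomic_dec(&nfsd_writers_active);
	return ret;
}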

Now the question is: what is limiting the sunrpc layer to only 8
outstanding IOs? Is there some TCP-level buffer size limitation or
similar that is causing this issue? I also added counters around which
nfsd threads get to process the SVC xprt & I always see only the
first 10 threads being used all the time. The rest of the NFSD
threads never receive a packet to handle at all.

I already set the RPC slots tuneable to 128 on both server &
client before the mount, so this is not the issue.

Are there other tuneables that control this behaviour? I think if
I can get past the 8 concurrent IOs per client<>server, I will be able
to get >2.5K IOPs.

I also confirmed that each NFS multi-step (compound) operation that comes
from the client has an OP_PUTFH/OP_WRITE/OP_GETATTR. I don't see any other
unnecessary NFS packets in the flow.

Any help/inputs on this topic would be greatly appreciated.

Thanks.

--Shyam


2013-10-31 15:14:43

by J. Bruce Fields

Subject: Re: Need help with NFS Server SUNRPC performance issue

On Thu, Oct 31, 2013 at 10:45:46AM -0400, Michael Richardson wrote:
>
> >> I am chasing a NFS server performance issue on Ubuntu
> >> 3.8.13-030813-generic kernel. We setup 32 NFSD threads on our NFS
> >> server.
>
> I have been also trying to figure out NFS performance issues at my home
> office. Server is ubuntu precise (3.2.0-55, old, true) kernel, and clients
> are mostly a mix of Debian versions (mostly virtualized XEN).
> GbE over a VLAN is setup just for storage, and mostly IPv6 connections.
>
> J. Bruce Fields <[email protected]> wrote:
> > Would you be willing to test an upstream kernel and/or some patches?
> > Sounds like you're using only NFSv4?
>
> I'm also willing to; my preference would be to build a generic 3.10 or 3.11
> kernel with NFS as a module, and then update the NFS code, but I
> haven't gotten around to scheduling some time to reboot a bunch.
>
> What I observe is huge TCP send queues on the server and what appears to be
> head of queue blocking on the client. This looks like a client issue to me,
> and for at least one client (my mpd/shoutcast server), I'm happy to reboot
> it regularly... I notice the NFS delays because the music stops :-)
>
> There are some potential instabilities in frequency of IPv6 Router
> Advertisements due to a bug in the CeroWRT, which initially I was blaming,
> but I'm no longer convinced, since it happens over IPv4 on the storage VLAN
> too.
>
> Shyam, please share with me your testing strategy.

Your problem sounds different; among other things, it's with reads
rather than writes.

Yes, testing with a recent upstream kernel would be a good start. 3.11
or more recent would be ideal as there was a fix for read deadlocks
there (which I doubt you're hitting, but would be nice to rule it out).

--b.

2013-10-31 14:15:39

by J. Bruce Fields

Subject: Re: Need help with NFS Server SUNRPC performance issue

On Thu, Oct 31, 2013 at 12:19:01PM +0530, Shyam Kaushik wrote:
> Hi Folks,
>
> I am chasing a NFS server performance issue on Ubuntu
> 3.8.13-030813-generic kernel. We setup 32 NFSD threads on our NFS
> server.
>
> The issue is:
> # I am using fio to generate 4K random writes (over a sync mounted NFS
> server filesystem) with 64 outstanding IOs per job for 10 jobs. fio
> direct flag is set.
> # When doing fio randwrite 4K IOs, realized that we cannot exceed 2.5K
> IOPs on the NFS server from a single client.
> # With multiple clients we can do more IOPs (like 3x more IOPs with 3 clients)
> # Further chasing the issue, I realized that at any point in time only
> 8 NFSD threads are active doing vfs_wrte(). Remaining 24 threads are
> sleeping within svc_recv()/svc_get_next_xprt().
> # First I thought its TCP socket contention/sleeping at the wrong
> time. I introduced a one-sec sleep around vfs_write() within NFSD
> using msleep(). With this I can clearly see that only 8 NFSD threads
> are active doing the write+sleep loop while all other threads are
> sleeping.
> # I enabled rpcdebug/nfs debug on NFS client side + used tcpdump on
> NFS server side to confirm that client is queuing all the outstanding
> IOs concurrently & its not a NFS client side problem.
>
> Now the question is what is holding up the sunrpc layer to do only 8
> outstanding IOs? Is there some TCP level buffer size limitation or so
> that is causing this issue? I also added counters around which all
> nfsd threads get to process the SVC xport & I see always only the
> first 10 threads being used up all the time. The rest of the NFSD
> threads never receive a packet at all to handle.
>
> I already setup number of RPC slots tuneable to 128 on both server &
> client before the mount, so this is not the issue.
>
> Are there some other tuneables that control this behaviour? I think if
> I cross the 8 concurrent IOs per client<>server, I will be able to get
> >2.5K IOPs.
>
> I also confirmed that each NFS multi-step operation that comes from
> client has an OP_PUTFH/OP_WRITE/OP_GETATTR. I dont see any other
> unnecessary NFS packets in the flow.
>
> Any help/inputs on this topic greatly appreciated.

There's some logic in the rpc layer that tries not to accept requests
unless there's adequate send buffer space for the worst-case reply. It
could be that logic interfering... I'm not sure how to test that
quickly.
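
Roughly, the check in question looks like this (paraphrased from memory
from the 3.x net/sunrpc socket code, so treat names and details as
approximate rather than exact):

/* Simplified sketch of the per-transport "do we have write space?" test.
 * Every request already handed to a thread has reserved its worst-case
 * reply size in xpt_reserved; once that total plus one more maximum-size
 * message no longer fits in the socket send buffer, the transport is not
 * enqueued and the remaining nfsd threads stay asleep. */
static int svc_tcp_has_wspace(struct svc_xprt *xprt)
{
	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
	struct svc_serv *serv = xprt->xpt_server;
	int required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;

	return sk_stream_wspace(svsk->sk_sk) >= required;
}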

Would you be willing to test an upstream kernel and/or some patches?

Sounds like you're using only NFSv4?

--b.

2013-11-05 19:58:11

by J. Bruce Fields

Subject: Re: Need help with NFS Server SUNRPC performance issue

On Tue, Nov 05, 2013 at 07:14:50PM +0530, Shyam Kaushik wrote:
> Hi Bruce,
>
> You are spot on this issue. I did a quicker option of just fixing
>
> fs/nfsd/nfs4proc.c
>
> nfsd_procedures4[]
>
> NFSPROC4_COMPOUND
> instead of
> .pc_xdrressize = NFSD_BUFSIZE/4
>
> I made it by /8 & I got double the IOPs. I moved it /16 & now I see
> that 30 NFSD threads out of 32 that I have configured are doing the
> nfsd_write() job. So yes this is the exact problematic area.

Yes, that looks like good evidence we're on the right track, thanks very
much for the testing.

> Now for a permanent fixture for this issue, what do you suggest? Is it
> that before processing the compound we adjust svc_reserve()?

I think decode_compound() needs to do some estimate of the maximum total
reply size and call svc_reserve() with that new estimate.

And for the current code I think it really could be as simple as
checking whether the compound includes a READ op.

That's because that's all the current xdr encoding handles. We need to
fix that: people need to be able to fetch ACLs larger than 4k, and
READDIR would be faster if it could return more than 4k of data at a go.

After we do that, we'll need to know more than just the list of ops;
we'll need to know, e.g., exactly which attributes a GETATTR requested.
And we don't have any automatic way to figure that out, so it'll all be a
lot of manual arithmetic. On the other hand, the good news is we only
need a rough upper bound, so this may well be doable.
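
Just to illustrate the sort of manual arithmetic I mean, a tighter GETATTR
estimate might look something like this (a purely hypothetical helper, not
code from the tree; the bounds are assumptions for the example):

/* Hypothetical per-op estimate: most GETATTR attributes are a handful of
 * XDR words each, but an ACL (and a few other variable-length attrs) can
 * be much larger, so any estimate has to look at the requested bitmap. */
static u32 getattr_reply_estimate(const u32 *bmval)
{
	u32 estimate = 512;		/* fixed-size attrs, change info, etc. */

	if (bmval[0] & FATTR4_WORD0_ACL)
		estimate += 64 * 1024;	/* assumed cap on ACL size */
	if (bmval[0] & FATTR4_WORD0_FS_LOCATIONS)
		estimate += 4 * 1024;	/* assumed cap on the locations list */
	return estimate;
}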

Beyond that it would also be good to think about whether using
worst-case reply sizes to decide when to accept requests is really
right.

Anyway here's the slightly improved hack--totally untested except to fix
some compile errors.

--b.

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index d9454fe..947f268 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1617,6 +1617,7 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
struct nfsd4_op *op;
struct nfsd4_minorversion_ops *ops;
bool cachethis = false;
+ bool foundread = false;
int i;

READ_BUF(4);
@@ -1667,10 +1668,15 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
* op in the compound wants to be cached:
*/
cachethis |= nfsd4_cache_this_op(op);
+
+ foundread |= op->opnum == OP_READ;
}
/* Sessions make the DRC unnecessary: */
if (argp->minorversion)
cachethis = false;
+ if (!foundread)
+ /* XXX: use tighter estimates, and svc_reserve_auth: */
+ svc_reserve(argp->rqstp, PAGE_SIZE);
argp->rqstp->rq_cachetype = cachethis ? RC_REPLBUFF : RC_NOCACHE;

DECODE_TAIL;

2013-11-13 22:00:37

by J. Bruce Fields

Subject: Re: Need help with NFS Server SUNRPC performance issue

On Wed, Nov 13, 2013 at 11:24:44AM -0500, J. Bruce Fields wrote:
> On Wed, Nov 06, 2013 at 12:57:38PM +0530, Shyam Kaushik wrote:
> > Hi Bruce,
> >
> > This hack works great. All 32 of configured NFSD threads end up doing
> > nfsd_write() which is great & I get higher IOPs/bandwidth from NFS
> > client side.
> >
> > What do you think if we vary the signature of
> > typedef __be32(*nfsd4_dec)(struct nfsd4_compoundargs *argp, void *);
> >
> > to include an additional return argument of the size estimate. This
> > way we get size estimate from the decoders (like nfsd4_decode_read
> > could return this estimate as rd_length + overhead) & in the worst
> > case if decoder says cant estimate (like a special return code -1) we
> > dont do svc_reserve() & leave it like it is. So when we run through
> > the compound we have a sum of size estimate & just do svc_reserve() at
> > the end of nfsd4_decode_compound() like your hack has.
> >
> > Does this sound reasonable to you? If not, perhaps can we just use the
> > hack for now & worry about readdir etc when they support >4K buffer?
>
> Yep. Actually looking at it again I think it needs a couple more
> special cases (for readlink, readdir), but that should be good enough
> for now.

So I'm planning to commit the following.

But eventually I agree we'd rather do the calculation in the decoder.
(Which would make it easier for example to take into account whether a
getattr op includes a request for an ACL.)

--b.

commit 6ff40decff0ef35a5d755ec60182d7f803356dfb
Author: J. Bruce Fields <[email protected]>
Date: Tue Nov 5 15:07:16 2013 -0500

nfsd4: improve write performance with better sendspace reservations

Currently the rpc code conservatively refuses to accept rpc's from a
client if the sum of its worst-case estimates of the replies it owes
that client exceed the send buffer space.

Unfortunately our estimate of the worst-case reply for an NFSv4 compound
is always the maximum read size. This can unnecessarily limit the
number of operations we handle concurrently, for example in the case
most operations are writes (which have small replies).

We can do a little better if we check which ops the compound contains.

This is still a rough estimate, we'll need to improve on it some day.

Reported-by: Shyam Kaushik <[email protected]>
Tested-by: Shyam Kaushik <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index d9d7fa9..9d76ee3 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1597,12 +1597,39 @@ nfsd4_opnum_in_range(struct nfsd4_compoundargs *argp, struct nfsd4_op *op)
return true;
}

+/*
+ * Return a rough estimate of the maximum possible reply size. Note the
+ * estimate includes rpc headers so is meant to be passed to
+ * svc_reserve, not svc_reserve_auth.
+ *
+ * Also note the current compound encoding permits only one operation to
+ * use pages beyond the first one, so the maximum possible length is the
+ * maximum over these values, not the sum.
+ */
+static int nfsd4_max_reply(u32 opnum)
+{
+ switch (opnum) {
+ case OP_READLINK:
+ case OP_READDIR:
+ /*
+ * Both of these ops take a single page for data and put
+ * the head and tail in another page:
+ */
+ return 2 * PAGE_SIZE;
+ case OP_READ:
+ return INT_MAX;
+ default:
+ return PAGE_SIZE;
+ }
+}
+
static __be32
nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
{
DECODE_HEAD;
struct nfsd4_op *op;
bool cachethis = false;
+ int max_reply = PAGE_SIZE;
int i;

READ_BUF(4);
@@ -1652,10 +1679,14 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
* op in the compound wants to be cached:
*/
cachethis |= nfsd4_cache_this_op(op);
+
+ max_reply = max(max_reply, nfsd4_max_reply(op->opnum));
}
/* Sessions make the DRC unnecessary: */
if (argp->minorversion)
cachethis = false;
+ if (max_reply != INT_MAX)
+ svc_reserve(argp->rqstp, max_reply);
argp->rqstp->rq_cachetype = cachethis ? RC_REPLBUFF : RC_NOCACHE;

DECODE_TAIL;

2013-11-04 23:03:17

by J. Bruce Fields

Subject: Re: Need help with NFS Server SUNRPC performance issue

On Fri, Nov 01, 2013 at 03:09:03PM -0400, Michael Richardson wrote:
>
> J. Bruce Fields <[email protected]> wrote:
> > > There are some potential instabilities in frequency of IPv6 Router
> > > Advertisements due to a bug in the CeroWRT, which initially I was blaming,
> > > but I'm no longer convinced, since it happens over IPv4 on the storage VLAN
> > > too.
> > >
> > > Shyam, please share with me your testing strategy.
> >
> > Your problem sounds different; among other things, it's with reads
> > rather than writes.
>
> Well.... I sure have problems with writes too.
> sshfs is way faster across a LAN, which is just wrong :-)

That sounds suspect, yes, but we'd need some more details (exactly
what's your test, and what results do you get?).

--b.

2013-11-01 04:38:20

by Shyam Kaushik

Subject: Re: Need help with NFS Server SUNRPC performance issue

Hi Bruce,

Yes I am using NFSv4. I am willing to test any kernel/patches that you
suggest. Please let me know where we can start. Also I have
sunrpc/nfsd/lockd etc compiled as modules & can readily debug it as
needed.

I dug into this a bit further & I think you are spot on that the issue is
with the rpc layer + buffer space. From tcpdump I see that the initial
requests come from client to server according to the number of
outstanding IOs that fio initiates, but then there are multiple back &
forth packets (RPC continuations & acks) that slow things down. I
initially thought waking up the NFSD threads that are sleeping within
svc_get_next_xprt() was the issue & tried schedule_timeout() with a
smaller timeout, but then all the threads woke up, saw there was no work
enqueued & went back to sleep again. So from the sunrpc server standpoint
the enqueue is not happening as it should.

In the meantime, on the NFS client side I see a single rpc thread that's
working all the time.

Thanks.

--Shyam



On Thu, Oct 31, 2013 at 7:45 PM, J. Bruce Fields <[email protected]> wrote:
> On Thu, Oct 31, 2013 at 12:19:01PM +0530, Shyam Kaushik wrote:
>> Hi Folks,
>>
>> I am chasing a NFS server performance issue on Ubuntu
>> 3.8.13-030813-generic kernel. We setup 32 NFSD threads on our NFS
>> server.
>>
>> The issue is:
>> # I am using fio to generate 4K random writes (over a sync mounted NFS
>> server filesystem) with 64 outstanding IOs per job for 10 jobs. fio
>> direct flag is set.
>> # When doing fio randwrite 4K IOs, realized that we cannot exceed 2.5K
>> IOPs on the NFS server from a single client.
>> # With multiple clients we can do more IOPs (like 3x more IOPs with 3 clients)
>> # Further chasing the issue, I realized that at any point in time only
>> 8 NFSD threads are active doing vfs_wrte(). Remaining 24 threads are
>> sleeping within svc_recv()/svc_get_next_xprt().
>> # First I thought its TCP socket contention/sleeping at the wrong
>> time. I introduced a one-sec sleep around vfs_write() within NFSD
>> using msleep(). With this I can clearly see that only 8 NFSD threads
>> are active doing the write+sleep loop while all other threads are
>> sleeping.
>> # I enabled rpcdebug/nfs debug on NFS client side + used tcpdump on
>> NFS server side to confirm that client is queuing all the outstanding
>> IOs concurrently & its not a NFS client side problem.
>>
>> Now the question is what is holding up the sunrpc layer to do only 8
>> outstanding IOs? Is there some TCP level buffer size limitation or so
>> that is causing this issue? I also added counters around which all
>> nfsd threads get to process the SVC xport & I see always only the
>> first 10 threads being used up all the time. The rest of the NFSD
>> threads never receive a packet at all to handle.
>>
>> I already setup number of RPC slots tuneable to 128 on both server &
>> client before the mount, so this is not the issue.
>>
>> Are there some other tuneables that control this behaviour? I think if
>> I cross the 8 concurrent IOs per client<>server, I will be able to get
>> >2.5K IOPs.
>>
>> I also confirmed that each NFS multi-step operation that comes from
>> client has an OP_PUTFH/OP_WRITE/OP_GETATTR. I dont see any other
>> unnecessary NFS packets in the flow.
>>
>> Any help/inputs on this topic greatly appreciated.
>
> There's some logic in the rpc layer that tries not to accept requests
> unless there's adequate send buffer space for the worst case reply. It
> could be that logic interfering..... I'm not sure how to test that
> quickly.
>
> Would you be willing to test an upstream kernel and/or some patches?
>
> Sounds like you're using only NFSv4?
>
> --b.

2013-11-13 04:07:21

by Shyam Kaushik

Subject: Re: Need help with NFS Server SUNRPC performance issue

Hi Bruce,

Can you please suggest how we can formalize this hack into a proper fix?

Thanks.

--Shyam

On Thu, Oct 31, 2013 at 8:15 PM, Michael Richardson <[email protected]> wrote:
>
> >> I am chasing a NFS server performance issue on Ubuntu
> >> 3.8.13-030813-generic kernel. We setup 32 NFSD threads on our NFS
> >> server.
>
> I have been also trying to figure out NFS performance issues at my home
> office. Server is ubuntu precise (3.2.0-55, old, true) kernel, and clients
> are mostly a mix of Debian versions (mostly virtualized XEN).
> GbE over a VLAN is setup just for storage, and mostly IPv6 connections.
>
> J. Bruce Fields <[email protected]> wrote:
> > Would you be willing to test an upstream kernel and/or some patches?
> > Sounds like you're using only NFSv4?
>
> I'm also willing to; my preference would be to build a generic 3.10 or 3.11
> kernel with NFS as a module, and then update the NFS code, but I
> haven't gotten around to scheduling some time to reboot a bunch.
>
> What I observe is huge TCP send queues on the server and what appears to be
> head of queue blocking on the client. This looks like a client issue to me,
> and for at least one client (my mpd/shoutcast server), I'm happy to reboot
> it regularly... I notice the NFS delays because the music stops :-)
>
> There are some potential instabilities in frequency of IPv6 Router
> Advertisements due to a bug in the CeroWRT, which initially I was blaming,
> but I'm no longer convinced, since it happens over IPv4 on the storage VLAN
> too.
>
> Shyam, please share with me your testing strategy.
>
> --
> ] Never tell me the odds! | ipv6 mesh networks [
> ] Michael Richardson, Sandelman Software Works | network architect [
> ] [email protected] http://www.sandelman.ca/ | ruby on rails [
>
>
>
>
>

2013-11-04 23:02:45

by J. Bruce Fields

Subject: Re: Need help with NFS Server SUNRPC performance issue

On Fri, Nov 01, 2013 at 10:08:18AM +0530, Shyam Kaushik wrote:
> Hi Bruce,
>
> Yes I am using NFSv4. I am willing to test any kernel/patches that you
> suggest. Please let me know where we can start. Also I have
> sunrpc/nfsd/lockd etc compiled as modules & can readily debug it as
> needed.

OK, thanks. It would be worth trying to implement the comment at the
top of fs/nfsd/nfs4xdr.c:

* TODO: Neil Brown made the following observation: We
* currently initially reserve NFSD_BUFSIZE space on the
* transmit queue and never release any of that until the
* request is complete. It would be good to calculate a new
* maximum response size while decoding the COMPOUND, and call
* svc_reserve with this number at the end of
* nfs4svc_decode_compoundargs.

I think it shouldn't be too difficult--we just need to work out some
upper bounds on the reply size per operation.

A first approximation, just to test the idea, might be to call
svc_reserve(., 4096) on any compound not containing a read.
--b.

>
> I digged this a bit further & I think you are on dot that the issue is
> with rcp layer + buffer space. From tcpdump I see that the initial
> request comes from client to server according to the number of
> outstanding IOs that fio initiates, but then there are multiple back &
> forth packets (RPC continuation & acks) that is slowing up things. I
> thought waking up the NFSD threads that are sleeping within
> svc_get_next_xprt() was an issue initially & made the
> schedule_timeout() with a smaller timeout, but then all the threads
> wakeup & saw there was no work enqueued & went back to sleep again. So
> from sunrpc server standpoint enqueue() is not happening as it should
> be.
>
> In the meantime from NFS client side I see a single rpc thread thats
> working all the time.
>
> Thanks.
>
> --Shyam
>
>
>
> On Thu, Oct 31, 2013 at 7:45 PM, J. Bruce Fields <[email protected]> wrote:
> > On Thu, Oct 31, 2013 at 12:19:01PM +0530, Shyam Kaushik wrote:
> >> Hi Folks,
> >>
> >> I am chasing a NFS server performance issue on Ubuntu
> >> 3.8.13-030813-generic kernel. We setup 32 NFSD threads on our NFS
> >> server.
> >>
> >> The issue is:
> >> # I am using fio to generate 4K random writes (over a sync mounted NFS
> >> server filesystem) with 64 outstanding IOs per job for 10 jobs. fio
> >> direct flag is set.
> >> # When doing fio randwrite 4K IOs, realized that we cannot exceed 2.5K
> >> IOPs on the NFS server from a single client.
> >> # With multiple clients we can do more IOPs (like 3x more IOPs with 3 clients)
> >> # Further chasing the issue, I realized that at any point in time only
> >> 8 NFSD threads are active doing vfs_wrte(). Remaining 24 threads are
> >> sleeping within svc_recv()/svc_get_next_xprt().
> >> # First I thought its TCP socket contention/sleeping at the wrong
> >> time. I introduced a one-sec sleep around vfs_write() within NFSD
> >> using msleep(). With this I can clearly see that only 8 NFSD threads
> >> are active doing the write+sleep loop while all other threads are
> >> sleeping.
> >> # I enabled rpcdebug/nfs debug on NFS client side + used tcpdump on
> >> NFS server side to confirm that client is queuing all the outstanding
> >> IOs concurrently & its not a NFS client side problem.
> >>
> >> Now the question is what is holding up the sunrpc layer to do only 8
> >> outstanding IOs? Is there some TCP level buffer size limitation or so
> >> that is causing this issue? I also added counters around which all
> >> nfsd threads get to process the SVC xport & I see always only the
> >> first 10 threads being used up all the time. The rest of the NFSD
> >> threads never receive a packet at all to handle.
> >>
> >> I already setup number of RPC slots tuneable to 128 on both server &
> >> client before the mount, so this is not the issue.
> >>
> >> Are there some other tuneables that control this behaviour? I think if
> >> I cross the 8 concurrent IOs per client<>server, I will be able to get
> >> >2.5K IOPs.
> >>
> >> I also confirmed that each NFS multi-step operation that comes from
> >> client has an OP_PUTFH/OP_WRITE/OP_GETATTR. I dont see any other
> >> unnecessary NFS packets in the flow.
> >>
> >> Any help/inputs on this topic greatly appreciated.
> >
> > There's some logic in the rpc layer that tries not to accept requests
> > unless there's adequate send buffer space for the worst case reply. It
> > could be that logic interfering..... I'm not sure how to test that
> > quickly.
> >
> > Would you be willing to test an upstream kernel and/or some patches?
> >
> > Sounds like you're using only NFSv4?
> >
> > --b.

2013-11-13 16:24:45

by J. Bruce Fields

Subject: Re: Need help with NFS Server SUNRPC performance issue

On Wed, Nov 06, 2013 at 12:57:38PM +0530, Shyam Kaushik wrote:
> Hi Bruce,
>
> This hack works great. All 32 of configured NFSD threads end up doing
> nfsd_write() which is great & I get higher IOPs/bandwidth from NFS
> client side.
>
> What do you think if we vary the signature of
> typedef __be32(*nfsd4_dec)(struct nfsd4_compoundargs *argp, void *);
>
> to include an additional return argument of the size estimate. This
> way we get size estimate from the decoders (like nfsd4_decode_read
> could return this estimate as rd_length + overhead) & in the worst
> case if decoder says cant estimate (like a special return code -1) we
> dont do svc_reserve() & leave it like it is. So when we run through
> the compound we have a sum of size estimate & just do svc_reserve() at
> the end of nfsd4_decode_compound() like your hack has.
>
> Does this sound reasonable to you? If not, perhaps can we just use the
> hack for now & worry about readdir etc when they support >4K buffer?

Yep. Actually looking at it again I think it needs a couple more
special cases (for readlink, readdir), but that should be good enough
for now.

For the future... I'd rather not add an extra argument to every decoder,
but maybe that is the simplest thing to do.

--b.

2013-11-14 04:23:03

by Shyam Kaushik

Subject: Re: Need help with NFS Server SUNRPC performance issue

Thanks a lot, Bruce, for taking care of this!

I will apply this patch manually to the 3.8 NFSD version we use. Thanks.

--Shyam

On Thu, Nov 14, 2013 at 3:30 AM, J. Bruce Fields <[email protected]> wrote:
> On Wed, Nov 13, 2013 at 11:24:44AM -0500, J. Bruce Fields wrote:
>> On Wed, Nov 06, 2013 at 12:57:38PM +0530, Shyam Kaushik wrote:
>> > Hi Bruce,
>> >
>> > This hack works great. All 32 of configured NFSD threads end up doing
>> > nfsd_write() which is great & I get higher IOPs/bandwidth from NFS
>> > client side.
>> >
>> > What do you think if we vary the signature of
>> > typedef __be32(*nfsd4_dec)(struct nfsd4_compoundargs *argp, void *);
>> >
>> > to include an additional return argument of the size estimate. This
>> > way we get size estimate from the decoders (like nfsd4_decode_read
>> > could return this estimate as rd_length + overhead) & in the worst
>> > case if decoder says cant estimate (like a special return code -1) we
>> > dont do svc_reserve() & leave it like it is. So when we run through
>> > the compound we have a sum of size estimate & just do svc_reserve() at
>> > the end of nfsd4_decode_compound() like your hack has.
>> >
>> > Does this sound reasonable to you? If not, perhaps can we just use the
>> > hack for now & worry about readdir etc when they support >4K buffer?
>>
>> Yep. Actually looking at it again I think it needs a couple more
>> special cases (for readlink, readdir), but that should be good enough
>> for now.
>
> So I'm planning to commit the following.
>
> But eventually I agree we'd rather do the calculation in the decoder.
> (Which would make it easier for example to take into account whether a
> getattr op includes a request for an ACL.)
>
> --b.
>
> commit 6ff40decff0ef35a5d755ec60182d7f803356dfb
> Author: J. Bruce Fields <[email protected]>
> Date: Tue Nov 5 15:07:16 2013 -0500
>
> nfsd4: improve write performance with better sendspace reservations
>
> Currently the rpc code conservatively refuses to accept rpc's from a
> client if the sum of its worst-case estimates of the replies it owes
> that client exceed the send buffer space.
>
> Unfortunately our estimate of the worst-case reply for an NFSv4 compound
> is always the maximum read size. This can unnecessarily limit the
> number of operations we handle concurrently, for example in the case
> most operations are writes (which have small replies).
>
> We can do a little better if we check which ops the compound contains.
>
> This is still a rough estimate, we'll need to improve on it some day.
>
> Reported-by: Shyam Kaushik <[email protected]>
> Tested-by: Shyam Kaushik <[email protected]>
> Signed-off-by: J. Bruce Fields <[email protected]>
>
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index d9d7fa9..9d76ee3 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1597,12 +1597,39 @@ nfsd4_opnum_in_range(struct nfsd4_compoundargs *argp, struct nfsd4_op *op)
> return true;
> }
>
> +/*
> + * Return a rough estimate of the maximum possible reply size. Note the
> + * estimate includes rpc headers so is meant to be passed to
> + * svc_reserve, not svc_reserve_auth.
> + *
> + * Also note the current compound encoding permits only one operation to
> + * use pages beyond the first one, so the maximum possible length is the
> + * maximum over these values, not the sum.
> + */
> +static int nfsd4_max_reply(u32 opnum)
> +{
> + switch (opnum) {
> + case OP_READLINK:
> + case OP_READDIR:
> + /*
> + * Both of these ops take a single page for data and put
> + * the head and tail in another page:
> + */
> + return 2 * PAGE_SIZE;
> + case OP_READ:
> + return INT_MAX;
> + default:
> + return PAGE_SIZE;
> + }
> +}
> +
> static __be32
> nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
> {
> DECODE_HEAD;
> struct nfsd4_op *op;
> bool cachethis = false;
> + int max_reply = PAGE_SIZE;
> int i;
>
> READ_BUF(4);
> @@ -1652,10 +1679,14 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
> * op in the compound wants to be cached:
> */
> cachethis |= nfsd4_cache_this_op(op);
> +
> + max_reply = max(max_reply, nfsd4_max_reply(op->opnum));
> }
> /* Sessions make the DRC unnecessary: */
> if (argp->minorversion)
> cachethis = false;
> + if (max_reply != INT_MAX)
> + svc_reserve(argp->rqstp, max_reply);
> argp->rqstp->rq_cachetype = cachethis ? RC_REPLBUFF : RC_NOCACHE;
>
> DECODE_TAIL;

2013-11-05 13:44:52

by Shyam Kaushik

Subject: Re: Need help with NFS Server SUNRPC performance issue

Hi Bruce,

You are spot on about this issue. I did a quicker experiment of just
changing, in fs/nfsd/nfs4proc.c, the NFSPROC4_COMPOUND entry of
nfsd_procedures4[]: instead of

.pc_xdrressize = NFSD_BUFSIZE/4

I made it /8 & I got double the IOPs. I moved it to /16 & now I see
that 30 NFSD threads out of the 32 that I have configured are doing the
nfsd_write() job. So yes, this is the exact problematic area.
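
For reference, the entry I changed looks roughly like this (other fields
omitted; layout recalled from the 3.8 sources, so treat it as a sketch):

/* fs/nfsd/nfs4proc.c -- pc_xdrressize is the worst-case reply size, in
 * 32-bit XDR words, that the rpc layer reserves for every accepted
 * compound, whether or not it actually contains a READ. */
[NFSPROC4_COMPOUND] = {
	.pc_func = (svc_procfunc) nfsd4_proc_compound,
	/* ... other fields unchanged ... */
	.pc_xdrressize = NFSD_BUFSIZE/8,	/* was NFSD_BUFSIZE/4 */
},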

Now, for a permanent fix for this issue, what do you suggest? Is it
that before processing the compound we adjust svc_reserve()? Is it
possible that you can deliver a fix for this issue?

Thanks a lot.

--Shyam

On Tue, Nov 5, 2013 at 4:32 AM, J. Bruce Fields <[email protected]> wrote:
> On Fri, Nov 01, 2013 at 10:08:18AM +0530, Shyam Kaushik wrote:
>> Hi Bruce,
>>
>> Yes I am using NFSv4. I am willing to test any kernel/patches that you
>> suggest. Please let me know where we can start. Also I have
>> sunrpc/nfsd/lockd etc compiled as modules & can readily debug it as
>> needed.
>
> OK, thanks. It would be worth trying to implement the comment at the
> top of fs/nfsd/nfs4xdr.c:
>
> * TODO: Neil Brown made the following observation: We
> * currently initially reserve NFSD_BUFSIZE space on the
> * transmit queue and never release any of that until the
> * request is complete. It would be good to calculate a new
> * maximum response size while decoding the COMPOUND, and call
> * svc_reserve with this number at the end of
> * nfs4svc_decode_compoundargs.
>
> I think it shouldn't be too difficult--we just need to work out some
> upper bounds on the reply size per operation.
>
> A first idea approximation just to test the idea might be just to call
> svc_reserve(., 4096) on any compound not containing a read.
>
> --b.
>
>>
>> I digged this a bit further & I think you are on dot that the issue is
>> with rcp layer + buffer space. From tcpdump I see that the initial
>> request comes from client to server according to the number of
>> outstanding IOs that fio initiates, but then there are multiple back &
>> forth packets (RPC continuation & acks) that is slowing up things. I
>> thought waking up the NFSD threads that are sleeping within
>> svc_get_next_xprt() was an issue initially & made the
>> schedule_timeout() with a smaller timeout, but then all the threads
>> wakeup & saw there was no work enqueued & went back to sleep again. So
>> from sunrpc server standpoint enqueue() is not happening as it should
>> be.
>>
>> In the meantime from NFS client side I see a single rpc thread thats
>> working all the time.
>>
>> Thanks.
>>
>> --Shyam
>>
>>
>>
>> On Thu, Oct 31, 2013 at 7:45 PM, J. Bruce Fields <[email protected]> wrote:
>> > On Thu, Oct 31, 2013 at 12:19:01PM +0530, Shyam Kaushik wrote:
>> >> Hi Folks,
>> >>
>> >> I am chasing a NFS server performance issue on Ubuntu
>> >> 3.8.13-030813-generic kernel. We setup 32 NFSD threads on our NFS
>> >> server.
>> >>
>> >> The issue is:
>> >> # I am using fio to generate 4K random writes (over a sync mounted NFS
>> >> server filesystem) with 64 outstanding IOs per job for 10 jobs. fio
>> >> direct flag is set.
>> >> # When doing fio randwrite 4K IOs, realized that we cannot exceed 2.5K
>> >> IOPs on the NFS server from a single client.
>> >> # With multiple clients we can do more IOPs (like 3x more IOPs with 3 clients)
>> >> # Further chasing the issue, I realized that at any point in time only
>> >> 8 NFSD threads are active doing vfs_wrte(). Remaining 24 threads are
>> >> sleeping within svc_recv()/svc_get_next_xprt().
>> >> # First I thought its TCP socket contention/sleeping at the wrong
>> >> time. I introduced a one-sec sleep around vfs_write() within NFSD
>> >> using msleep(). With this I can clearly see that only 8 NFSD threads
>> >> are active doing the write+sleep loop while all other threads are
>> >> sleeping.
>> >> # I enabled rpcdebug/nfs debug on NFS client side + used tcpdump on
>> >> NFS server side to confirm that client is queuing all the outstanding
>> >> IOs concurrently & its not a NFS client side problem.
>> >>
>> >> Now the question is what is holding up the sunrpc layer to do only 8
>> >> outstanding IOs? Is there some TCP level buffer size limitation or so
>> >> that is causing this issue? I also added counters around which all
>> >> nfsd threads get to process the SVC xport & I see always only the
>> >> first 10 threads being used up all the time. The rest of the NFSD
>> >> threads never receive a packet at all to handle.
>> >>
>> >> I already setup number of RPC slots tuneable to 128 on both server &
>> >> client before the mount, so this is not the issue.
>> >>
>> >> Are there some other tuneables that control this behaviour? I think if
>> >> I cross the 8 concurrent IOs per client<>server, I will be able to get
>> >> >2.5K IOPs.
>> >>
>> >> I also confirmed that each NFS multi-step operation that comes from
>> >> client has an OP_PUTFH/OP_WRITE/OP_GETATTR. I dont see any other
>> >> unnecessary NFS packets in the flow.
>> >>
>> >> Any help/inputs on this topic greatly appreciated.
>> >
>> > There's some logic in the rpc layer that tries not to accept requests
>> > unless there's adequate send buffer space for the worst case reply. It
>> > could be that logic interfering..... I'm not sure how to test that
>> > quickly.
>> >
>> > Would you be willing to test an upstream kernel and/or some patches?
>> >
>> > Sounds like you're using only NFSv4?
>> >
>> > --b.

2013-11-01 23:43:01

by Michael Richardson

Subject: Re: Need help with NFS Server SUNRPC performance issue


J. Bruce Fields <[email protected]> wrote:
> > There are some potential instabilities in frequency of IPv6 Router
> > Advertisements due to a bug in the CeroWRT, which initially I was blaming,
> > but I'm no longer convinced, since it happens over IPv4 on the storage VLAN
> > too.
> >
> > Shyam, please share with me your testing strategy.
>
> Your problem sounds different; among other things, it's with reads
> rather than writes.

Well.... I sure have problems with writes too.
sshfs is way faster across a LAN, which is just wrong :-)

Michael Richardson
-on the road-



2013-11-01 04:43:17

by Shyam Kaushik

Subject: Re: Need help with NFS Server SUNRPC performance issue

Hi Michael,

For your testing strategy question:

Here is the fio config file that I use:

[global]
ioengine=libaio
iodepth=64
rw=randwrite
bs=4k
direct=1
size=4gb
loops=1
overwrite=0
norandommap=1
numjobs=10
blockalign=4k

[job1]
directory=/mnt/NFS_data_2/

If you have a well-performing backend on the NFS server, you will see
that you are limited to 2.5K IOPs.

--Shyam

On Thu, Oct 31, 2013 at 8:15 PM, Michael Richardson <[email protected]> wrote:
>
> >> I am chasing a NFS server performance issue on Ubuntu
> >> 3.8.13-030813-generic kernel. We setup 32 NFSD threads on our NFS
> >> server.
>
> I have been also trying to figure out NFS performance issues at my home
> office. Server is ubuntu precise (3.2.0-55, old, true) kernel, and clients
> are mostly a mix of Debian versions (mostly virtualized XEN).
> GbE over a VLAN is setup just for storage, and mostly IPv6 connections.
>
> J. Bruce Fields <[email protected]> wrote:
> > Would you be willing to test an upstream kernel and/or some patches?
> > Sounds like you're using only NFSv4?
>
> I'm also willing to; my preference would be to build a generic 3.10 or 3.11
> kernel with NFS as a module, and then update the NFS code, but I
> haven't gotten around to scheduling some time to reboot a bunch.
>
> What I observe is huge TCP send queues on the server and what appears to be
> head of queue blocking on the client. This looks like a client issue to me,
> and for at least one client (my mpd/shoutcast server), I'm happy to reboot
> it regularly... I notice the NFS delays because the music stops :-)
>
> There are some potential instabilities in frequency of IPv6 Router
> Advertisements due to a bug in the CeroWRT, which initially I was blaming,
> but I'm no longer convinced, since it happens over IPv4 on the storage VLAN
> too.
>
> Shyam, please share with me your testing strategy.
>
> --
> ] Never tell me the odds! | ipv6 mesh networks [
> ] Michael Richardson, Sandelman Software Works | network architect [
> ] [email protected] http://www.sandelman.ca/ | ruby on rails [
>
>
>
>
>

2013-11-13 16:18:34

by J. Bruce Fields

Subject: Re: Need help with NFS Server SUNRPC performance issue

On Wed, Nov 13, 2013 at 09:37:20AM +0530, Shyam Kaushik wrote:
> Hi Bruce,
>
> Can you pls suggest how we can formalize this hack into a proper fix?

Apologies, I delayed responding while trying to decide if an xdr rewrite
was ready for this merge window--it looks definitely not, so I'm just
going to go with the "PAGE_SIZE for all but read" hack in
nfsd4_decode_compound for now, with just slightly better documentation.

--b.

>
> Thanks.
>
> --Shyam
>
> On Thu, Oct 31, 2013 at 8:15 PM, Michael Richardson <[email protected]> wrote:
> >
> > >> I am chasing a NFS server performance issue on Ubuntu
> > >> 3.8.13-030813-generic kernel. We setup 32 NFSD threads on our NFS
> > >> server.
> >
> > I have been also trying to figure out NFS performance issues at my home
> > office. Server is ubuntu precise (3.2.0-55, old, true) kernel, and clients
> > are mostly a mix of Debian versions (mostly virtualized XEN).
> > GbE over a VLAN is setup just for storage, and mostly IPv6 connections.
> >
> > J. Bruce Fields <[email protected]> wrote:
> > > Would you be willing to test an upstream kernel and/or some patches?
> > > Sounds like you're using only NFSv4?
> >
> > I'm also willing to; my preference would be to build a generic 3.10 or 3.11
> > kernel with NFS as a module, and then update the NFS code, but I
> > haven't gotten around to scheduling some time to reboot a bunch.
> >
> > What I observe is huge TCP send queues on the server and what appears to be
> > head of queue blocking on the client. This looks like a client issue to me,
> > and for at least one client (my mpd/shoutcast server), I'm happy to reboot
> > it regularly... I notice the NFS delays because the music stops :-)
> >
> > There are some potential instabilities in frequency of IPv6 Router
> > Advertisements due to a bug in the CeroWRT, which initially I was blaming,
> > but I'm no longer convinced, since it happens over IPv4 on the storage VLAN
> > too.
> >
> > Shyam, please share with me your testing strategy.
> >
> > --
> > ] Never tell me the odds! | ipv6 mesh networks [
> > ] Michael Richardson, Sandelman Software Works | network architect [
> > ] [email protected] http://www.sandelman.ca/ | ruby on rails [
> >
> >
> >
> >
> >

2013-11-06 07:27:39

by Shyam Kaushik

Subject: Re: Need help with NFS Server SUNRPC performance issue

Hi Bruce,

This hack works great. All 32 of the configured NFSD threads end up doing
nfsd_write(), which is great, & I get higher IOPs/bandwidth from the NFS
client side.

What do you think of varying the signature of

typedef __be32(*nfsd4_dec)(struct nfsd4_compoundargs *argp, void *);

to include an additional return argument for the size estimate? This
way we get the size estimate from the decoders (e.g. nfsd4_decode_read
could return this estimate as rd_length + overhead), & in the worst
case, if a decoder says it can't estimate (via a special return code
like -1), we don't do svc_reserve() & leave it as is. So when we run
through the compound we have a sum of size estimates & just do
svc_reserve() at the end of nfsd4_decode_compound() like your hack does.
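
Concretely, the proposed signature change would be something like this
(just an illustration of the idea, not an implemented patch):

/* Each per-op decoder would additionally report its worst-case reply-size
 * estimate; a negative value would mean "cannot estimate, keep the
 * conservative default".  nfsd4_decode_compound() would then sum the
 * estimates and call svc_reserve() once at the end. */
typedef __be32 (*nfsd4_dec)(struct nfsd4_compoundargs *argp, void *p,
			    int *reply_size_estimate);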

Does this sound reasonable to you? If not, perhaps we can just use the
hack for now & worry about readdir etc. when they support a >4K buffer?

--Shyam

On Wed, Nov 6, 2013 at 1:28 AM, J. Bruce Fields <[email protected]> wrote:
> On Tue, Nov 05, 2013 at 07:14:50PM +0530, Shyam Kaushik wrote:
>> Hi Bruce,
>>
>> You are spot on this issue. I did a quicker option of just fixing
>>
>> fs/nfsd/nfs4proc.c
>>
>> nfsd_procedures4[]
>>
>> NFSPROC4_COMPOUND
>> instead of
>> .pc_xdrressize = NFSD_BUFSIZE/4
>>
>> I made it by /8 & I got double the IOPs. I moved it /16 & now I see
>> that 30 NFSD threads out of 32 that I have configured are doing the
>> nfsd_write() job. So yes this is the exact problematic area.
>
> Yes, that looks like good evidence we're on the right track, thanks very
> much for the testing.
>
>> Now for a permanent fixture for this issue, what do you suggest? Is it
>> that before processing the compound we adjust svc_reserve()?
>
> I think decode_compound() needs to do some estimate of the maximum total
> reply size and call svc_reserve() with that new estimate.
>
> And for the current code I think it really could be as simple as
> checking whether the compound includes a READ op.
>
> That's because that's all the current xdr encoding handles. We need to
> fix that: people need to be able to fetch ACLs larger than 4k, and
> READDIR would be faster if it could return more than 4k of data at a go.
>
> After we do that, we'll need to know more than just the list of ops,
> we'll need to e.g. know which attributes exactly a GETATTR requested.
> And we don't have any automatic way to figure that out so it'll all be a
> lot of manual arithmetic. On the other hand the good news is we only
> need a rough upper bound, so this will may be doable.
>
> Beyond that it would also be good to think about whether using
> worst-case reply sizes to decide when to accept requests is really
> right.
>
> Anyway here's the slightly improved hack--totally untested except to fix
> some compile errors.
>
> --b.
>
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index d9454fe..947f268 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1617,6 +1617,7 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
> struct nfsd4_op *op;
> struct nfsd4_minorversion_ops *ops;
> bool cachethis = false;
> + bool foundread = false;
> int i;
>
> READ_BUF(4);
> @@ -1667,10 +1668,15 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
> * op in the compound wants to be cached:
> */
> cachethis |= nfsd4_cache_this_op(op);
> +
> + foundread |= op->opnum == OP_READ;
> }
> /* Sessions make the DRC unnecessary: */
> if (argp->minorversion)
> cachethis = false;
> + if (!foundread)
> + /* XXX: use tighter estimates, and svc_reserve_auth: */
> + svc_reserve(argp->rqstp, PAGE_SIZE);
> argp->rqstp->rq_cachetype = cachethis ? RC_REPLBUFF : RC_NOCACHE;
>
> DECODE_TAIL;