2011-12-22 18:31:03

by Paul Anderson

Subject: NFSv4 empty RPC thrashing?

Issue: an extremely high rate of packets like these (tcpdump):

16:31:09.308678 IP (tos 0x0, ttl 64, id 41517, offset 0, flags [DF],
proto TCP (6), length 144)
r20.xxx.edu.1362383749 > nfsb.xxx.edu.nfs: 88 null
16:31:09.308895 IP (tos 0x0, ttl 64, id 22578, offset 0, flags [DF],
proto TCP (6), length 100)
nfsb.xxx.edu.nfs > r20.xxx.edu.1362383749: reply ok 44 null
16:31:09.308926 IP (tos 0x0, ttl 64, id 41518, offset 0, flags [DF],
proto TCP (6), length 192)
r20.xxx.edu.1379160965 > nfsb.xxx.edu.nfs: 136 null
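
(That output is from something along the lines of the following; the
interface name here is a guess:)

tcpdump -i eth0 -v port 2049   # -v adds the tos/ttl/id detail shown above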

All Linux kernels are from kernel.org, version 2.6.38.5 with the
addition of Mosix. All userland is Ubuntu 10.04 LTS.

Scenario: the compute cluster is composed of 50-60 compute nodes and
10 or so head nodes that act as compute/login and high-rate NFS
servers, for the purpose of largely sequential processing of
high-volume genetic sequencing data (one recent job was 50-70 TiB in,
50 TiB out). We see this problem regularly (two servers are being hit
this way right now), and it is apparently cleared only by rebooting
the server.

Something in our use of the cluster appears to be triggering what
looks like a race condition in the NFSv4 client/server communication.
This issue prevents us from using NFS reliably in our cluster.
Although we do very high I/O at times, that alone does not appear to
be the trigger. It may be related to SLURM starting 200-300 jobs at
once, where each job hits a common NFS fileserver for the program
binaries, for example. In our cluster testing, this reliably causes
about half the jobs to fail while loading the program itself - they
hang in D state indefinitely, but are killable.
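
For reference, the load pattern is roughly equivalent to the sketch
below (hypothetical binary path, node names and flag; SLURM's
bookkeeping omitted):

# exec a few hundred near-simultaneous copies of one NFS-hosted binary
BIN=/net/nfsb/opt/pipeline/bin/some_tool   # hypothetical path on the NFS server
for n in $(seq 1 300); do
  node="r$(( (n % 60) + 1 ))"              # hypothetical node names r1..r60
  ssh "$node" "$BIN --version" &
done
wait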

Looking at dozens of clients, we can run tcpdump and see messages
similar to the above being sent at a high rate from the
gigabit-connected compute nodes - the main indication being a context
switch rate of 20-30K per second. The 10-gigabit-connected server is
functioning, but is seeing context switch rates of 200-300K per
second - an exceptional rate that appears to slow down NFS service
for all other users. I have not done any extensive packet capture to
determine actual traffic rates, but am pretty sure it is limited by
wire speed and CPU.
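
(The numbers above are from casual observation; something like this
would pin them down - the interface name is a guess:)

# context switch rate: watch the "cs" column
vmstat 1

# rough NFS packet rate on the server's NFS-facing interface
# (divide the count by 10 for packets per second)
timeout 10 tcpdump -i eth2 -nn port 2049 2>/dev/null | wc -l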

The client nodes in this scenario are not actively being used - some
show zero processes in D state, others show dozens of jobs stuck in D
state (these jobs are unkillable) - while the NFSv4 server shows its
nfsd threads running flat out.

Mount commands look like this:
for h in $servers; do
  mount -t nfs4 -o rw,soft,intr,nodev,nosuid,async ${h}:/ /net/$h
done
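
(The options the clients actually ended up with can be double-checked
per mount with:)

# list each NFS mount and the effective mount options
nfsstat -m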

The NFSv4 servers all run the stock Ubuntu 10.04 setup - no tuning
has been done.
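
(For what that's worth, the only obvious knob in the stock setup is
the nfsd thread count, which on Ubuntu is, I believe, set via
/etc/default/nfs-kernel-server:)

# /etc/default/nfs-kernel-server - stock value
RPCNFSDCOUNT=8

# the running thread count can be read from the nfsd filesystem
cat /proc/fs/nfsd/threads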

We can trivially get packet captures with more packets, but they are
all similar - 15-20 client nodes all pounding one NFS server node.

Any guesses as to what we can try?

Thanks,

Paul Anderson


2011-12-23 17:15:23

by Paul Anderson

Subject: Re: NFSv4 empty RPC thrashing?

Our new plan is to upgrade all client/server kernels to 3.0.12 and
retry in the mid-January time frame. We should be able to reproduce
this and one other possible problem quickly then.

Thanks, all!

Paul

On Thu, Dec 22, 2011 at 2:48 PM, Chuck Lever <[email protected]> wrote:
>
> On Dec 22, 2011, at 2:37 PM, Paul Anderson wrote:
>
>> See way below for inline comments...
>>
>> On Thu, Dec 22, 2011 at 2:25 PM, Chuck Lever <[email protected]> wrote:
>>>
>>> On Dec 22, 2011, at 1:36 PM, Chuck Lever wrote:
>>>
>>>> Hi Paul, long time!
>>>>
>>>> On Dec 22, 2011, at 1:31 PM, Paul Anderson wrote:
>>>>
>>>>> Issue: extremely high rate packets like so (tcpdump):
>>>>>
>>>>> 16:31:09.308678 IP (tos 0x0, ttl 64, id 41517, offset 0, flags [DF],
>>>>> proto TCP (6), length 144)
>>>>>   r20.xxx.edu.1362383749 > nfsb.xxx.edu.nfs: 88 null
>>>>> 16:31:09.308895 IP (tos 0x0, ttl 64, id 22578, offset 0, flags [DF],
>>>>> proto TCP (6), length 100)
>>>>>   nfsb.xxx.edu.nfs > r20.xxx.edu.1362383749: reply ok 44 null
>>>>> 16:31:09.308926 IP (tos 0x0, ttl 64, id 41518, offset 0, flags [DF],
>>>>> proto TCP (6), length 192)
>>>>>   r20.xxx.edu.1379160965 > nfsb.xxx.edu.nfs: 136 null
>>>>>
>>>>> All linux kernels are from kernel.org, version 2.6.38.5 with the
>>>>> addition of Mosix. All userland is Ubuntu 10.04LTS.
>>>>>
>>>>> Scenario: the compute cluster is composed of 50-60 compute nodes, and
>>>>> 10 or so head nodes that act as compute/login and high rate NFS
>>>>> serving, for the purpose of largely sequential processing of high
>>>>> volume genetic sequencing data (one recent job was 50-70TiBytes in, 50
>>>>> TiBytes out). We see this problem regularly (two servers are
>>>>> currently being hit this way right now), and is apparently cleared
>>>>> only on reboot of the server.
>>>>>
>>>>> Something in our use of the cluster appears to be triggering what
>>>>> looks like a race condition in the NFSv4 client/server communications.
>>>>> This issue prevents us from using NFS reliably in our cluster.
>>>>> Although we do very high I/O at times, this alone does not appear to
>>>>> be the trigger. It is possibly related to a problem of having SLURM
>>>>> starting 200-300 jobs at once, where each job hits a common NFS
>>>>> fileserver for the program binaries, for example. In our cluster
>>>>> testing, this appears to reliably cause about half the jobs to fail
>>>>> while loading the program itself - they hang in D state indefinitely,
>>>>> but are killable.
>>>>>
>>>>> Looking at dozens of clients, we can do tcpdump, and see messages
>>>>> similar to the above being sent at a high rate from the gigabit
>>>>> connected compute nodes - the main indication being a context switch
>>>>> rate of 20-30K per second. The 10 gigabit connected server is
>>>>> functioning, but seeing context switch rates of 200-300K per second -
>>>>> an exceptional rate that appears to slow down NFS services for all
>>>>> other users. I have not done any extensive packet capture to
>>>>> determine actual traffic rates, but am pretty sure it is limited by
>>>>> wire speed and CPU.
>>>>>
>>>>> The client nodes in this scenario are not actively being used - some
>>>>> show zero processes in D state, others show dozens of jobs stuck in D
>>>>> state (these jobs are unkillable) - the NFSv4 server shows nfsd jobs
>>>>> running flat out.
>>>>>
>>>>> Mount commands look like this:
>>>>> for h in $servers; do
>>>>>   mount -t nfs4 -o rw,soft,intr,nodev,nosuid,async ${h}:/ /net/$h
>>>>> done
>>>>>
>>>>> The NFSv4 servers all start using stock Ubuntu 10.04 setup - no tuning
>>>>> has been done.
>>>>>
>>>>> We can trivially get packet captures with more packets, but they are
>>>>> all similar - 15-20 client nodes all pounding one NFS server node.
>>>>
>>>> We'd need to see full-frame raw captures. "tcpdump -s0 -w /tmp/raw" Let's see a few megabytes.
>>>>
>>>> On the face of it, it looks like it could be a state reclaim loop, but I can't say until I see a full network capture.
>>>
>>> Paul sent me a pair of pcap traces off-list.
>>>
>>> This is yet another state reclaim loop. The server is returning NFS4ERR_EXPIRED to a READ request, but the client's subsequent RENEW gets NFS4_OK, so it doesn't do any state recovery. It simply keeps retrying the READ.
>>>
>>> Paul, have you tried upgrading your clients to the latest kernel.org release? Was there a recent network partition or server reboot that triggers these events?
>>
>> We are evaluating a move to 3.0.x, and hope to be able to do that
>> sooner rather than later. The impetus there is a set of improvements
>> in XFS dirty page flushing that we hope will reduce filesystem
>> corruption on normal reboots.
>
> I thought all those issues were cleared up by 2.6.38. Good luck...
>
>> The two NFS servers in question were recently rebooted (one yesterday
>> shows this problem, the one I gave Chuck pcaps from was 8 days ago).
>> It would typically be the case on reboot that numerous processes in
>> the compute cluster would be attempting to access files, mostly
>> reading, but some writing. I have not noticed if the thrashing
>> problem appears immediately on reboot - I thought it was not
>> instantaneous, but I could be wrong.
>
> For extra credit, you could try to capture network traffic on the clients while the server is rebooting to see if we can watch the putative state reclaim failure as it happens. I'm guessing you'd have to wait until just after the server's grace period expires (90 seconds?) before things start going ka-fluey.
>
> Unless this is unbearable, I'd wait, though, until you have 3.0 deployed on your clients. Either that will fix your problem, or it will be easier for us to patch if it doesn't.
>
>>> One reason this might happen is if the client is using an expired state ID for the READ, and then uses a valid client ID during the RENEW request. This can happen if the client failed, some time earlier, to completely reclaim state after a server outage.
>>>
>> Since we rarely, if ever reboot the entire cluster at once, this is plausible.
>>
>>
>> Paul
>>
>>
>>>
>>> Bruce, is there any way to look at the state ID and client ID tokens to verify they are for the same lease?
>>>
>>> --
>>> Chuck Lever
>>> chuck[dot]lever[at]oracle[dot]com
>>>
>>>
>>>
>>>
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>
>

2011-12-22 19:25:11

by Chuck Lever III

Subject: Re: NFSv4 empty RPC thrashing?


On Dec 22, 2011, at 1:36 PM, Chuck Lever wrote:

> Hi Paul, long time!
>
> On Dec 22, 2011, at 1:31 PM, Paul Anderson wrote:
>
>> Issue: extremely high rate packets like so (tcpdump):
>>
>> 16:31:09.308678 IP (tos 0x0, ttl 64, id 41517, offset 0, flags [DF],
>> proto TCP (6), length 144)
>> r20.xxx.edu.1362383749 > nfsb.xxx.edu.nfs: 88 null
>> 16:31:09.308895 IP (tos 0x0, ttl 64, id 22578, offset 0, flags [DF],
>> proto TCP (6), length 100)
>> nfsb.xxx.edu.nfs > r20.xxx.edu.1362383749: reply ok 44 null
>> 16:31:09.308926 IP (tos 0x0, ttl 64, id 41518, offset 0, flags [DF],
>> proto TCP (6), length 192)
>> r20.xxx.edu.1379160965 > nfsb.xxx.edu.nfs: 136 null
>>
>> All linux kernels are from kernel.org, version 2.6.38.5 with the
>> addition of Mosix. All userland is Ubuntu 10.04LTS.
>>
>> Scenario: the compute cluster is composed of 50-60 compute nodes, and
>> 10 or so head nodes that act as compute/login and high rate NFS
>> serving, for the purpose of largely sequential processing of high
>> volume genetic sequencing data (one recent job was 50-70TiBytes in, 50
>> TiBytes out). We see this problem regularly (two servers are
>> currently being hit this way right now), and is apparently cleared
>> only on reboot of the server.
>>
>> Something in our use of the cluster appears to be triggering what
>> looks like a race condition in the NFSv4 client/server communications.
>> This issue prevents us from using NFS reliably in our cluster.
>> Although we do very high I/O at times, this alone does not appear to
>> be the trigger. It is possibly a related to a problem of having SLURM
>> starting 200-300 jobs at once, where each job hits a common NFS
>> fileserver for the program binaries, for example. In our cluster
>> testing, this appears to reliably cause about half the jobs to fail
>> while loading the program itself - they hang in D state indefinitely,
>> but are killable.
>>
>> Looking at dozens of clients, we can do tcpdump, and see messages
>> similar to the above being sent at a high rate from the gigabit
>> connected compute nodes - the main indication being a context switch
>> rate of 20-30K per second. The 10 gigabit connected server is
>> functioning, but seeing context switch rates of 200-300K per second -
>> an exceptional rate that appears to slow down NFS services for all
>> other users. I have not done any extensive packet capture to
>> determine actual traffic rates, but am pretty sure it is limited by
>> wire speed and CPU.
>>
>> The client nodes in this scenario are not actively being used - some
>> show zero processes in D state, others show dozens of jobs stuck in D
>> state (these jobs are unkillable) - the NFSv4 server shows nfsd jobs
>> running flat out.
>>
>> Mount commands look like this:
>> for h in $servers; do
>>   mount -t nfs4 -o rw,soft,intr,nodev,nosuid,async ${h}:/ /net/$h
>> done
>>
>> The NFSv4 servers all start using stock Ubuntu 10.04 setup - no tuning
>> has been done.
>>
>> We can trivially get packet captures with more packets, but they are
>> all similar - 15-20 client nodes all pounding one NFS server node.
>
> We'd need to see full-frame raw captures. "tcpdump -s0 -w /tmp/raw" Let's see a few megabytes.
>
> On the face of it, it looks like it could be a state reclaim loop, but I can't say until I see a full network capture.

Paul sent me a pair of pcap traces off-list.

This is yet another state reclaim loop. The server is returning NFS4ERR_EXPIRED to a READ request, but the client's subsequent RENEW gets NFS4_OK, so it doesn't do any state recovery. It simply keeps retrying the READ.

Paul, have you tried upgrading your clients to the latest kernel.org release? Was there a recent network partition or server reboot that triggers these events?

One reason this might happen is if the client is using an expired state ID for the READ, and then uses a valid client ID during the RENEW request. This can happen if the client failed, some time earlier, to completely reclaim state after a server outage.

Bruce, is there any way to look at the state ID and client ID tokens to verify they are for the same lease?
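
(For what it's worth, picking the failing replies out of a raw capture
can be done with tshark; the display filter field below is from the
Wireshark builds I've used and may be spelled nfs.status in other
versions:)

# NFS4ERR_EXPIRED is status 10011 per RFC 3530
tshark -r /tmp/raw -R 'nfs.nfsstat4 == 10011'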

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2011-12-23 18:56:46

by Andy Adamson

Subject: Re: NFSv4 empty RPC thrashing?

Commit 0ced63d1, introduced in 3.0-rc1, fixes this condition.

NFSv4: Handle expired stateids when the lease is still valid

Currently, if the server returns NFS4ERR_EXPIRED in reply to a READ or
WRITE, but the RENEW test determines that the lease is still active, we
fail to recover and end up looping forever in a READ/WRITE + RENEW death
spiral.

Signed-off-by: Trond Myklebust <[email protected]>
Cc: [email protected]

-->Andy

On Thu, Dec 22, 2011 at 2:25 PM, Chuck Lever <[email protected]> wrote:
>
> On Dec 22, 2011, at 1:36 PM, Chuck Lever wrote:
>
>> Hi Paul, long time!
>>
>> On Dec 22, 2011, at 1:31 PM, Paul Anderson wrote:
>>
>>> Issue: extremely high rate packets like so (tcpdump):
>>>
>>> 16:31:09.308678 IP (tos 0x0, ttl 64, id 41517, offset 0, flags [DF],
>>> proto TCP (6), length 144)
>>>   r20.xxx.edu.1362383749 > nfsb.xxx.edu.nfs: 88 null
>>> 16:31:09.308895 IP (tos 0x0, ttl 64, id 22578, offset 0, flags [DF],
>>> proto TCP (6), length 100)
>>>   nfsb.xxx.edu.nfs > r20.xxx.edu.1362383749: reply ok 44 null
>>> 16:31:09.308926 IP (tos 0x0, ttl 64, id 41518, offset 0, flags [DF],
>>> proto TCP (6), length 192)
>>>   r20.xxx.edu.1379160965 > nfsb.xxx.edu.nfs: 136 null
>>>
>>> All linux kernels are from kernel.org, version 2.6.38.5 with the
>>> addition of Mosix. All userland is Ubuntu 10.04LTS.
>>>
>>> Scenario: the compute cluster is composed of 50-60 compute nodes, and
>>> 10 or so head nodes that act as compute/login and high rate NFS
>>> serving, for the purpose of largely sequential processing of high
>>> volume genetic sequencing data (one recent job was 50-70TiBytes in, 50
>>> TiBytes out). We see this problem regularly (two servers are
>>> currently being hit this way right now), and is apparently cleared
>>> only on reboot of the server.
>>>
>>> Something in our use of the cluster appears to be triggering what
>>> looks like a race condition in the NFSv4 client/server communications.
>>> This issue prevents us from using NFS reliably in our cluster.
>>> Although we do very high I/O at times, this alone does not appear to
>>> be the trigger. It is possibly related to a problem of having SLURM
>>> starting 200-300 jobs at once, where each job hits a common NFS
>>> fileserver for the program binaries, for example. In our cluster
>>> testing, this appears to reliably cause about half the jobs to fail
>>> while loading the program itself - they hang in D state indefinitely,
>>> but are killable.
>>>
>>> Looking at dozens of clients, we can do tcpdump, and see messages
>>> similar to the above being sent at a high rate from the gigabit
>>> connected compute nodes - the main indication being a context switch
>>> rate of 20-30K per second. The 10 gigabit connected server is
>>> functioning, but seeing context switch rates of 200-300K per second -
>>> an exceptional rate that appears to slow down NFS services for all
>>> other users. I have not done any extensive packet capture to
>>> determine actual traffic rates, but am pretty sure it is limited by
>>> wire speed and CPU.
>>>
>>> The client nodes in this scenario are not actively being used - some
>>> show zero processes in D state, others show dozens of jobs stuck in D
>>> state (these jobs are unkillable) - the NFSv4 server shows nfsd jobs
>>> running flat out.
>>>
>>> Mount commands look like this:
>>> for h in $servers; do
>>>   mount -t nfs4 -o rw,soft,intr,nodev,nosuid,async ${h}:/ /net/$h
>>> done
>>>
>>> The NFSv4 servers all start using stock Ubuntu 10.04 setup - no tuning
>>> has been done.
>>>
>>> We can trivially get packet captures with more packets, but they are
>>> all similar - 15-20 client nodes all pounding one NFS server node.
>>
>> We'd need to see full-frame raw captures. "tcpdump -s0 -w /tmp/raw" Let's see a few megabytes.
>>
>> On the face of it, it looks like it could be a state reclaim loop, but I can't say until I see a full network capture.
>
> Paul sent me a pair of pcap traces off-list.
>
> This is yet another state reclaim loop. The server is returning NFS4ERR_EXPIRED to a READ request, but the client's subsequent RENEW gets NFS4_OK, so it doesn't do any state recovery. It simply keeps retrying the READ.
>
> Paul, have you tried upgrading your clients to the latest kernel.org release? Was there a recent network partition or server reboot that triggers these events?
>
> One reason this might happen is if the client is using an expired state ID for the READ, and then uses a valid client ID during the RENEW request. This can happen if the client failed, some time earlier, to completely reclaim state after a server outage.
>
> Bruce, is there any way to look at the state ID and client ID tokens to verify they are for the same lease?
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>
>

2011-12-22 18:36:42

by Chuck Lever III

Subject: Re: NFSv4 empty RPC thrashing?

Hi Paul, long time!

On Dec 22, 2011, at 1:31 PM, Paul Anderson wrote:

> Issue: extremely high rate packets like so (tcpdump):
>
> 16:31:09.308678 IP (tos 0x0, ttl 64, id 41517, offset 0, flags [DF],
> proto TCP (6), length 144)
> r20.xxx.edu.1362383749 > nfsb.xxx.edu.nfs: 88 null
> 16:31:09.308895 IP (tos 0x0, ttl 64, id 22578, offset 0, flags [DF],
> proto TCP (6), length 100)
> nfsb.xxx.edu.nfs > r20.xxx.edu.1362383749: reply ok 44 null
> 16:31:09.308926 IP (tos 0x0, ttl 64, id 41518, offset 0, flags [DF],
> proto TCP (6), length 192)
> r20.xxx.edu.1379160965 > nfsb.xxx.edu.nfs: 136 null
>
> All linux kernels are from kernel.org, version 2.6.38.5 with the
> addition of Mosix. All userland is Ubuntu 10.04LTS.
>
> Scenario: the compute cluster is composed of 50-60 compute nodes, and
> 10 or so head nodes that act as compute/login and high rate NFS
> serving, for the purpose of largely sequential processing of high
> volume genetic sequencing data (one recent job was 50-70TiBytes in, 50
> TiBytes out). We see this problem regularly (two servers are
> currently being hit this way right now), and is apparently cleared
> only on reboot of the server.
>
> Something in our use of the cluster appears to be triggering what
> looks like a race condition in the NFSv4 client/server communications.
> This issue prevents us from using NFS reliably in our cluster.
> Although we do very high I/O at times, this alone does not appear to
> be the trigger. It is possibly a related to a problem of having SLURM
> starting 200-300 jobs at once, where each job hits a common NFS
> fileserver for the program binaries, for example. In our cluster
> testing, this appears to reliably cause about half the jobs to fail
> while loading the program itself - they hang in D state indefinitely,
> but are killable.
>
> Looking at dozens of clients, we can do tcpdump, and see messages
> similar to the above being sent at a high rate from the gigabit
> connected compute nodes - the main indication being a context switch
> rate of 20-30K per second. The 10 gigabit connected server is
> functioning, but seeing context switch rates of 200-300K per second -
> an exceptional rate that appears to slow down NFS services for all
> other users. I have not done any extensive packet capture to
> determine actual traffic rates, but am pretty sure it is limited by
> wire speed and CPU.
>
> The client nodes in this scenario are not actively being used - some
> show zero processes in D state, others show dozens of jobs stuck in D
> state (these jobs are unkillable) - the NFSv4 server shows nfsd jobs
> running flat out.
>
> Mount commands look like this:
> for h in $servers; do
>   mount -t nfs4 -o rw,soft,intr,nodev,nosuid,async ${h}:/ /net/$h
> done
>
> The NFSv4 servers all start using stock Ubuntu 10.04 setup - no tuning
> has been done.
>
> We can trivially get packet captures with more packets, but they are
> all similar - 15-20 client nodes all pounding one NFS server node.

We'd need to see full-frame raw captures ("tcpdump -s0 -w /tmp/raw"). Let's see a few megabytes.

On the face of it, it looks like it could be a state reclaim loop, but I can't say until I see a full network capture.
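
Something like this on one of the looping clients should be enough
(the interface name is a guess; -C/-W just keep the files bounded):

# full frames, NFS traffic only, rotating 100MB capture files
tcpdump -i eth0 -s0 -nn -w /tmp/nfs-raw.pcap -C 100 -W 5 port 2049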

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2011-12-22 19:48:27

by Chuck Lever III

Subject: Re: NFSv4 empty RPC thrashing?


On Dec 22, 2011, at 2:37 PM, Paul Anderson wrote:

> See way below for inline comments...
>
> On Thu, Dec 22, 2011 at 2:25 PM, Chuck Lever <[email protected]> wrote:
>>
>> On Dec 22, 2011, at 1:36 PM, Chuck Lever wrote:
>>
>>> Hi Paul, long time!
>>>
>>> On Dec 22, 2011, at 1:31 PM, Paul Anderson wrote:
>>>
>>>> Issue: extremely high rate packets like so (tcpdump):
>>>>
>>>> 16:31:09.308678 IP (tos 0x0, ttl 64, id 41517, offset 0, flags [DF],
>>>> proto TCP (6), length 144)
>>>> r20.xxx.edu.1362383749 > nfsb.xxx.edu.nfs: 88 null
>>>> 16:31:09.308895 IP (tos 0x0, ttl 64, id 22578, offset 0, flags [DF],
>>>> proto TCP (6), length 100)
>>>> nfsb.xxx.edu.nfs > r20.xxx.edu.1362383749: reply ok 44 null
>>>> 16:31:09.308926 IP (tos 0x0, ttl 64, id 41518, offset 0, flags [DF],
>>>> proto TCP (6), length 192)
>>>> r20.xxx.edu.1379160965 > nfsb.xxx.edu.nfs: 136 null
>>>>
>>>> All linux kernels are from kernel.org, version 2.6.38.5 with the
>>>> addition of Mosix. All userland is Ubuntu 10.04LTS.
>>>>
>>>> Scenario: the compute cluster is composed of 50-60 compute nodes, and
>>>> 10 or so head nodes that act as compute/login and high rate NFS
>>>> serving, for the purpose of largely sequential processing of high
>>>> volume genetic sequencing data (one recent job was 50-70TiBytes in, 50
>>>> TiBytes out). We see this problem regularly (two servers are
>>>> currently being hit this way right now), and is apparently cleared
>>>> only on reboot of the server.
>>>>
>>>> Something in our use of the cluster appears to be triggering what
>>>> looks like a race condition in the NFSv4 client/server communications.
>>>> This issue prevents us from using NFS reliably in our cluster.
>>>> Although we do very high I/O at times, this alone does not appear to
>>>> be the trigger. It is possibly a related to a problem of having SLURM
>>>> starting 200-300 jobs at once, where each job hits a common NFS
>>>> fileserver for the program binaries, for example. In our cluster
>>>> testing, this appears to reliably cause about half the jobs to fail
>>>> while loading the program itself - they hang in D state indefinitely,
>>>> but are killable.
>>>>
>>>> Looking at dozens of clients, we can do tcpdump, and see messages
>>>> similar to the above being sent at a high rate from the gigabit
>>>> connected compute nodes - the main indication being a context switch
>>>> rate of 20-30K per second. The 10 gigabit connected server is
>>>> functioning, but seeing context switch rates of 200-300K per second -
>>>> an exceptional rate that appears to slow down NFS services for all
>>>> other users. I have not done any extensive packet capture to
>>>> determine actual traffic rates, but am pretty sure it is limited by
>>>> wire speed and CPU.
>>>>
>>>> The client nodes in this scenario are not actively being used - some
>>>> show zero processes in D state, others show dozens of jobs stuck in D
>>>> state (these jobs are unkillable) - the NFSv4 server shows nfsd jobs
>>>> running flat out.
>>>>
>>>> Mount commands look like this:
>>>> for h in $servers; do
>>>>   mount -t nfs4 -o rw,soft,intr,nodev,nosuid,async ${h}:/ /net/$h
>>>> done
>>>>
>>>> The NFSv4 servers all start using stock Ubuntu 10.04 setup - no tuning
>>>> has been done.
>>>>
>>>> We can trivially get packet captures with more packets, but they are
>>>> all similar - 15-20 client nodes all pounding one NFS server node.
>>>
>>> We'd need to see full-frame raw captures. "tcpdump -s0 -w /tmp/raw" Let's see a few megabytes.
>>>
>>> On the face of it, it looks like it could be a state reclaim loop, but I can't say until I see a full network capture.
>>
>> Paul sent me a pair of pcap traces off-list.
>>
>> This is yet another state reclaim loop. The server is returning NFS4ERR_EXPIRED to a READ request, but the client's subsequent RENEW gets NFS4_OK, so it doesn't do any state recovery. It simply keeps retrying the READ.
>>
>> Paul, have you tried upgrading your clients to the latest kernel.org release? Was there a recent network partition or server reboot that triggers these events?
>
> We are evaluating a move to 3.0.x, and hope to be able to do that
> sooner rather than later. The impetus there is a set of improvements
> in XFS dirty page flushing that we hope will reduce filesystem
> corruption on normal reboots.

I thought all those issues were cleared up by 2.6.38. Good luck...

> The two NFS servers in question were recently rebooted (one yesterday
> shows this problem, the one I gave Chuck pcaps from was 8 days ago).
> It would typically be the case on reboot that numerous processes in
> the compute cluster would be attempting to access files, mostly
> reading, but some writing. I have not noticed if the thrashing
> problem appears immediately on reboot - I thought it was not
> instantaneous, but I could be wrong.

For extra credit, you could try to capture network traffic on the clients while the server is rebooting to see if we can watch the putative state reclaim failure as it happens. I'm guessing you'd have to wait until just after the server's grace period expires (90 seconds?) before things start going ka-fluey.
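
(The lease time that drives the grace period can be checked on the
server, assuming the nfsd filesystem is mounted under /proc/fs/nfsd;
the default is 90 seconds:)

cat /proc/fs/nfsd/nfsv4leasetime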

Unless this is unbearable, I'd wait, though, until you have 3.0 deployed on your clients. Either that will fix your problem, or it will be easier for us to patch if it doesn't.

>> One reason this might happen is if the client is using an expired state ID for the READ, and then uses a valid client ID during the RENEW request. This can happen if the client failed, some time earlier, to completely reclaim state after a server outage.
>>
> Since we rarely, if ever reboot the entire cluster at once, this is plausible.
>
>
> Paul
>
>
>>
>> Bruce, is there any way to look at the state ID and client ID tokens to verify they are for the same lease?
>>
>> --
>> Chuck Lever
>> chuck[dot]lever[at]oracle[dot]com
>>
>>
>>
>>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2011-12-22 19:37:13

by Paul Anderson

Subject: Re: NFSv4 empty RPC thrashing?

See way below for inline comments...

On Thu, Dec 22, 2011 at 2:25 PM, Chuck Lever <[email protected]> wrote:
>
> On Dec 22, 2011, at 1:36 PM, Chuck Lever wrote:
>
>> Hi Paul, long time!
>>
>> On Dec 22, 2011, at 1:31 PM, Paul Anderson wrote:
>>
>>> Issue: extremely high rate packets like so (tcpdump):
>>>
>>> 16:31:09.308678 IP (tos 0x0, ttl 64, id 41517, offset 0, flags [DF],
>>> proto TCP (6), length 144)
>>>   r20.xxx.edu.1362383749 > nfsb.xxx.edu.nfs: 88 null
>>> 16:31:09.308895 IP (tos 0x0, ttl 64, id 22578, offset 0, flags [DF],
>>> proto TCP (6), length 100)
>>>   nfsb.xxx.edu.nfs > r20.xxx.edu.1362383749: reply ok 44 null
>>> 16:31:09.308926 IP (tos 0x0, ttl 64, id 41518, offset 0, flags [DF],
>>> proto TCP (6), length 192)
>>>   r20.xxx.edu.1379160965 > nfsb.xxx.edu.nfs: 136 null
>>>
>>> All linux kernels are from kernel.org, version 2.6.38.5 with the
>>> addition of Mosix. All userland is Ubuntu 10.04LTS.
>>>
>>> Scenario: the compute cluster is composed of 50-60 compute nodes, and
>>> 10 or so head nodes that act as compute/login and high rate NFS
>>> serving, for the purpose of largely sequential processing of high
>>> volume genetic sequencing data (one recent job was 50-70TiBytes in, 50
>>> TiBytes out). We see this problem regularly (two servers are
>>> currently being hit this way right now), and is apparently cleared
>>> only on reboot of the server.
>>>
>>> Something in our use of the cluster appears to be triggering what
>>> looks like a race condition in the NFSv4 client/server communications.
>>> This issue prevents us from using NFS reliably in our cluster.
>>> Although we do very high I/O at times, this alone does not appear to
>>> be the trigger. It is possibly related to a problem of having SLURM
>>> starting 200-300 jobs at once, where each job hits a common NFS
>>> fileserver for the program binaries, for example. In our cluster
>>> testing, this appears to reliably cause about half the jobs to fail
>>> while loading the program itself - they hang in D state indefinitely,
>>> but are killable.
>>>
>>> Looking at dozens of clients, we can do tcpdump, and see messages
>>> similar to the above being sent at a high rate from the gigabit
>>> connected compute nodes - the main indication being a context switch
>>> rate of 20-30K per second. The 10 gigabit connected server is
>>> functioning, but seeing context switch rates of 200-300K per second -
>>> an exceptional rate that appears to slow down NFS services for all
>>> other users. I have not done any extensive packet capture to
>>> determine actual traffic rates, but am pretty sure it is limited by
>>> wire speed and CPU.
>>>
>>> The client nodes in this scenario are not actively being used - some
>>> show zero processes in D state, others show dozens of jobs stuck in D
>>> state (these jobs are unkillable) - the NFSv4 server shows nfsd jobs
>>> running flat out.
>>>
>>> Mount commands look like this:
>>> for h in $servers; do
>>>   mount -t nfs4 -o rw,soft,intr,nodev,nosuid,async ${h}:/ /net/$h
>>> done
>>>
>>> The NFSv4 servers all start using stock Ubuntu 10.04 setup - no tuning
>>> has been done.
>>>
>>> We can trivially get packet captures with more packets, but they are
>>> all similar - 15-20 client nodes all pounding one NFS server node.
>>
>> We'd need to see full-frame raw captures. "tcpdump -s0 -w /tmp/raw" Let's see a few megabytes.
>>
>> On the face of it, it looks like it could be a state reclaim loop, but I can't say until I see a full network capture.
>
> Paul sent me a pair of pcap traces off-list.
>
> This is yet another state reclaim loop. The server is returning NFS4ERR_EXPIRED to a READ request, but the client's subsequent RENEW gets NFS4_OK, so it doesn't do any state recovery. It simply keeps retrying the READ.
>
> Paul, have you tried upgrading your clients to the latest kernel.org release? Was there a recent network partition or server reboot that triggers these events?

We are evaluating a move to 3.0.x, and hope to be able to do that
sooner rather than later. The impetus there is a set of improvements
in XFS dirty page flushing that we hope will reduce filesystem
corruption on normal reboots.

The two NFS servers in question were recently rebooted (the one
rebooted yesterday shows this problem; the one I gave Chuck pcaps
from was rebooted 8 days ago). It would typically be the case on
reboot that numerous processes in the compute cluster would be
attempting to access files, mostly reading, but some writing. I have
not noticed whether the thrashing problem appears immediately on
reboot - I thought it was not instantaneous, but I could be wrong.


>
> One reason this might happen is if the client is using an expired state ID for the READ, and then uses a valid client ID during the RENEW request. This can happen if the client failed, some time earlier, to completely reclaim state after a server outage.

Since we rarely, if ever, reboot the entire cluster at once, this is plausible.


Paul


>
> Bruce, is there any way to look at the state ID and client ID tokens to verify they are for the same lease?
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>
>