2004-09-30 13:39:49

by Douglas Furlong

Subject: NFS stops responding

Good morning all.

Considering the exceedingly speedy response I got yesterday regarding
my problem accessing edirectory.co.uk, I thought I would try my luck
with an NFS problem.

All our Unix systems at work have their home directories mounted via NFS
to allow hot seating (not that they ever use it!).

I have just recently upgraded to Fedora Core 2, running the most recent
kernel.

All the workstations are running Fedora Core 2 with the second-to-last
kernel (due to CIFS/SMB problems in the latest one).

Unfortunately, there are two users whose connection to the NFS server is
dropped and does not seem to want to reconnect. To date I have:

1) Replaced both of their PCs
2) Replaced the switch
3) Planned to replace the network cables tomorrow
4) Tried numerous versions of the kernel, including the testing
kernel from rawhide
5) Tried variations in the timeo=x value to see if that would help
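
When comparing timeo settings between a client that hangs and one that does not, it can help to read the options the kernel is actually using rather than what was passed to mount. The following is a minimal sketch, assuming a Linux client that exposes /proc/mounts; the function name nfs_mount_options is made up for illustration and is not part of any NFS tooling. Note that timeo is expressed in tenths of a second.

```python
from pathlib import Path


def nfs_mount_options(mounts_text):
    """Return {mount_point: {option: value_or_True}} for NFS entries.

    mounts_text is expected to look like the contents of /proc/mounts.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = {}
        for opt in fields[3].split(","):
            key, sep, value = opt.partition("=")
            # Flags such as "hard" carry no value, so store True for them.
            opts[key] = value if sep else True
        result[fields[1]] = opts
    return result


if __name__ == "__main__":
    mounts = Path("/proc/mounts")
    if mounts.exists():
        for mount_point, opts in nfs_mount_options(mounts.read_text()).items():
            # Print only the options relevant to timeouts and retries.
            shown = {k: opts.get(k) for k in ("timeo", "retrans", "proto")}
            print(mount_point, shown)
```

Running this on both an affected and an unaffected workstation gives a quick way to confirm that the timeo change really took effect on the mount in question.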

The time before a lockup varies between 30 minutes and 5 hours. Other
network connections are not affected by the lockup; I am able to ssh on
to the box (that's how I collected the tcpdump data).

I also have two Windows PCs on this switch and things appear to be
fine.

I have 7 or 8 other systems running Linux on the network, and their NFS
communication is not affected.

I have increased the number of nfsd threads on the NFS server from 8 to
16 by editing /etc/init.d/nfs (I don't think this has helped).

I took some tcpdump info on both the client and the server to try and
see if I can work out what is going on. Initially it is not providing me
with much information (but loads of data).

I have attached two files, one from the client and one from the server.
The main reason for attaching them is the length of the data. I had
wanted to include them as plain text to simplify access, but at 100k
that is a bit too large. I didn't want to cut them down too much in case
I removed some pertinent information :(
--
Douglas Furlong
Systems Administrator
Firebox.com
T: 0870 420 4475 F: 0870 220 2178


Attachments:
tcpdump_output_nfs_client_server.txt.tar.gz (14.17 kB)

2004-09-30 16:06:22

by Jason Holmes

Subject: Re: NFS stops responding

I have had similar problems with NFS recently and have yet to figure out
a pattern. They started around the 2.4.27 time frame, but that could
just be coincidental. I have 8 NFS servers and several hundred clients.
Every few days, one of the clients will start hanging connections to
one of its mounts (all of the processes accessing that mount go into D
state and never return - the machine has to be forcefully rebooted to
get rid of them). While one of the client machines is hanging on a
mount, the other client machines are fine. Access to the other mounts
is fine on the hanging machine. The server is fine when this happens
and I see no odd messages in the logs.
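
For anyone wanting to spot the hang as soon as it starts, here is a minimal sketch that lists processes currently in D (uninterruptible sleep) state. It assumes a Linux /proc filesystem; the helper names parse_stat and uninterruptible_processes are invented for this example.

```python
import os


def parse_stat(stat_text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the last ")" rather than on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid, _, comm = head.partition(" (")
    return int(pid), comm, tail.split()[0]


def uninterruptible_processes(proc_root="/proc"):
    """Return a sorted list of (pid, command) for tasks in D state."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or the entry was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        for pid, comm in uninterruptible_processes():
            print(pid, comm)
```

Processes pass through D state briefly during normal I/O, so a single hit means little; the same PIDs showing up on repeated runs is what would match the hang described above.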

The servers were originally running RedHat Enterprise 3 kernels - I have
also tried 2.6.8.1 and have had the same problem. Clients have been
2.4.27, 2.6.8.1, and the latest RedHat kernels. The network is a simple
private one and there is no packet loss. I've tried both UDP and TCP v3
hard mounts. Exports are synchronous.

I'm currently hoping that one of my machines with sysrq enabled will
hang to see if I can possibly get some information out of that that will
shed some light on the situation. I'd be happy to entertain any other
debugging suggestions on this. Unfortunately, I haven't been able to
figure out how to force the problem to happen, so I'm at the mercy of
waiting for it to just pop up.
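
As a rough illustration of the sysrq approach, the sketch below checks whether sysrq is switched on and, separately, requests the task dump that the 't' key produces. It assumes a kernel that provides /proc/sys/kernel/sysrq and /proc/sysrq-trigger and that the script runs as root; the function names are hypothetical. The dump goes to the kernel log, not to the caller.

```python
def sysrq_enabled(value_text):
    """Interpret the contents of /proc/sys/kernel/sysrq (0 means disabled)."""
    return int(value_text.strip()) != 0


def request_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Ask the kernel to log every task's state and stack (sysrq 't')."""
    # Assumption: the running kernel exposes this file; needs root.
    with open(trigger_path, "w") as fh:
        fh.write("t")


if __name__ == "__main__":
    try:
        with open("/proc/sys/kernel/sysrq") as fh:
            print("sysrq enabled:", sysrq_enabled(fh.read()))
    except OSError:
        print("sysrq setting not available on this system")
    # request_task_dump() is deliberately not called here, because it
    # writes a large task listing to the kernel log.
```

Because ssh access reportedly still works during a hang, calling request_task_dump() from a remote shell may be an alternative to being at the console with the keyboard.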

Thanks,

--
Jason Holmes

-------------------------------------------------------
This SF.net email is sponsored by: IT Product Guide on ITManagersJournal
Use IT products in your business? Tell us what you think of them. Give us
Your Opinions, Get Free ThinkGeek Gift Certificates! Click to find out more
http://productguide.itmanagersjournal.com/guidepromo.tmpl
_______________________________________________
NFS maillist - nfs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

2004-09-30 19:10:31

by Jason Holmes

Subject: Re: NFS stops responding

Here's a 'sysrq-T' listing for a few hung processes. Unfortunately,
this was on a 2.4.21-20.ELsmp RedHat kernel and not a vanilla kernel
(I'll send one of those along as soon as I can get one):

xauth D 00000100e2d30370 1312 9600 9599 (NOTLB)

Call Trace: [<ffffffff80120d8a>]{io_schedule+42}
[<ffffffff801420ed>]{___wait_on_page+285}
[<ffffffff8014316a>]{do_generic_file_read+1258}
[<ffffffff80143770>]{file_read_actor+0}
[<ffffffff801438c5>]{generic_file_new_read+165}
[<ffffffffa02ec3a9>]{:nfs:nfs_file_read+217}
[<ffffffff8015dfd2>]{sys_read+178}
[<ffffffff80110177>]{system_call+119}

bash D 00000100e2bef130 824 9614 1 9666 9583 (NOTLB)

Call Trace: [<ffffffff80120d8a>]{io_schedule+42}
[<ffffffff801420ed>]{___wait_on_page+285}
[<ffffffff8014316a>]{do_generic_file_read+1258}
[<ffffffff80143770>]{file_read_actor+0}
[<ffffffff801438c5>]{generic_file_new_read+165}
[<ffffffffa02ec3a9>]{:nfs:nfs_file_read+217}
[<ffffffff8015dfd2>]{sys_read+178}
[<ffffffff80110177>]{system_call+119}

bash D 00000100db051e28 0 9666 1 9718 9614 (NOTLB)

Call Trace: [<ffffffff80120d8a>]{io_schedule+42}
[<ffffffff80142466>]{__lock_page+294}
[<ffffffff801430ca>]{do_generic_file_read+1098}
[<ffffffff80143770>]{file_read_actor+0}
[<ffffffff801438c5>]{generic_file_new_read+165}
[<ffffffffa02ec3a9>]{:nfs:nfs_file_read+217}
[<ffffffff8015dfd2>]{sys_read+178}
[<ffffffff80110177>]{system_call+119}
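
When many such listings need comparing, the frames can be pulled out mechanically. This is a minimal sketch that assumes the exact "[<address>]{symbol+offset}" layout shown above, with an optional ":module:" prefix; the names FRAME, parse_frames and blocked_in are made up for the example, and the "kernel" label for frames without a module prefix is just a placeholder.

```python
import re

# One frame looks like [<ffffffffa02ec3a9>]{:nfs:nfs_file_read+217}
FRAME = re.compile(r"\[<([0-9a-f]+)>\]\{(?::(\w+):)?(\w+)\+(\d+)\}")


def parse_frames(trace_text):
    """Return (module, symbol, offset) tuples in the order they appear."""
    frames = []
    for _addr, module, symbol, offset in FRAME.findall(trace_text):
        # Frames without a ":module:" prefix are labelled "kernel" here.
        frames.append((module or "kernel", symbol, offset))
    return frames


def blocked_in(trace_text):
    """Return the first symbol after io_schedule, or None if there is none."""
    for _module, symbol, _offset in parse_frames(trace_text):
        if symbol != "io_schedule":
            return symbol
    return None


if __name__ == "__main__":
    sample = """Call Trace: [<ffffffff80120d8a>]{io_schedule+42}
    [<ffffffff80142466>]{__lock_page+294}
    [<ffffffffa02ec3a9>]{:nfs:nfs_file_read+217}"""
    print(blocked_in(sample))
    print(parse_frames(sample))
```

Applied to the three traces above, this shows two tasks waiting in ___wait_on_page and one in __lock_page, all reached through nfs_file_read.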

Thanks,

--
Jason Holmes


2004-10-13 15:08:30

by Jason Holmes

Subject: Re: NFS stops responding

Douglas Furlong wrote:
> On Fri, 2004-10-01 at 11:40 -0400, Jason Holmes wrote:
>
>>FYI, I'm beginning to suspect that this is a problem more with the newer
>>RedHat kernels than anything else. I've only had one vanilla kernel NFS
>>lockup since I moved the NFS servers to 2.6.8.1 (3 days) and that
>>happened right after I did the move, so it could be coincidental. Back
>>when the servers ran RedHat kernels, the RedHat kernel clients never
>>locked up whereas the vanilla clients did. Yesterday I had 4 NFS
>>lockups on the same machine running the RedHat 2.4.21-20.ELsmp kernel
>>(the one that generated the trace below), but it hasn't locked up since
>>I moved it to 2.6.8.1. I guess I'll know for sure if my lockups don't
>>come back for a week or so.
>>
>>Thanks,
>
> <snip>
>
> Have you had any more lockups?
>
> Would you feel happy entering a bugzilla entry for this problem over at
> Red Hat? You seem to have got a lot more information than I have from my
> troubleshooting. I will then add whatever else I have, along with my
> local setup.

I'm sorry to say that I had my first lockup this morning after a week
and a half of no problems. I was unfortunately unable to get any
debugging information from it at the time. This was a 2.4.27 client
against a 2.6.8.1 server.

Thanks,

--
Jason Holmes



2004-10-01 15:40:24

by Jason Holmes

Subject: Re: NFS stops responding

FYI, I'm beginning to suspect that this is a problem more with the newer
RedHat kernels than anything else. I've only had one vanilla kernel NFS
lockup since I moved the NFS servers to 2.6.8.1 (3 days) and that
happened right after I did the move, so it could be coincidental. Back
when the servers ran RedHat kernels, the RedHat kernel clients never
locked up whereas the vanilla clients did. Yesterday I had 4 NFS
lockups on the same machine running the RedHat 2.4.21-20.ELsmp kernel
(the one that generated the trace below), but it hasn't locked up since
I moved it to 2.6.8.1. I guess I'll know for sure if my lockups don't
come back for a week or so.

Thanks,

--
Jason Holmes

Jason Holmes wrote:
> Here's a 'sysrq-T' listing for a few hung processes. Unfortunately,
> this was on a 2.4.21-20.ELsmp RedHat kernel and not a vanilla kernel
> (I'll send one of those along as soon as I can get one):
>
> xauth D 00000100e2d30370 1312 9600 9599 (NOTLB)
>
> Call Trace: [<ffffffff80120d8a>]{io_schedule+42}
> [<ffffffff801420ed>]{___wait_on_page+285}
> [<ffffffff8014316a>]{do_generic_file_read+1258}
> [<ffffffff80143770>]{file_read_actor+0}
> [<ffffffff801438c5>]{generic_file_new_read+165}
> [<ffffffffa02ec3a9>]{:nfs:nfs_file_read+217}
> [<ffffffff8015dfd2>]{sys_read+178}
> [<ffffffff80110177>]{system_call+119}
>
> bash D 00000100e2bef130 824 9614 1 9666 9583 (NOTLB)
>
> Call Trace: [<ffffffff80120d8a>]{io_schedule+42}
> [<ffffffff801420ed>]{___wait_on_page+285}
> [<ffffffff8014316a>]{do_generic_file_read+1258}
> [<ffffffff80143770>]{file_read_actor+0}
> [<ffffffff801438c5>]{generic_file_new_read+165}
> [<ffffffffa02ec3a9>]{:nfs:nfs_file_read+217}
> [<ffffffff8015dfd2>]{sys_read+178}
> [<ffffffff80110177>]{system_call+119}
>
> bash D 00000100db051e28 0 9666 1 9718 9614 (NOTLB)
>
> Call Trace: [<ffffffff80120d8a>]{io_schedule+42}
> [<ffffffff80142466>]{__lock_page+294}
> [<ffffffff801430ca>]{do_generic_file_read+1098}
> [<ffffffff80143770>]{file_read_actor+0}
> [<ffffffff801438c5>]{generic_file_new_read+165}
> [<ffffffffa02ec3a9>]{:nfs:nfs_file_read+217}
> [<ffffffff8015dfd2>]{sys_read+178}
> [<ffffffff80110177>]{system_call+119}

2004-10-07 10:57:09

by Douglas Furlong

Subject: Re: NFS stops responding

On Fri, 2004-10-01 at 11:40 -0400, Jason Holmes wrote:
> FYI, I'm beginning to suspect that this is a problem more with the newer
> RedHat kernels than anything else. I've only had one vanilla kernel NFS
> lockup since I moved the NFS servers to 2.6.8.1 (3 days) and that
> happened right after I did the move, so it could be coincidental. Back
> when the servers ran RedHat kernels, the RedHat kernel clients never
> locked up whereas the vanilla clients did. Yesterday I had 4 NFS
> lockups on the same machine running the RedHat 2.4.21-20.ELsmp kernel
> (the one that generated the trace below), but it hasn't locked up since
> I moved it to 2.6.8.1. I guess I'll know for sure if my lockups don't
> come back for a week or so.
>
> Thanks,
<snip>

Have you had any more lockups?

Would you feel happy entering a bugzilla entry for this problem over at
Red Hat? You seem to have got a lot more information than I have from my
troubleshooting. I will then add whatever else I have, along with my
local setup.

Doug


