2005-12-14 20:42:51

by Joshua Baker-LePain

Subject: Error 512

I'm still seeing a moderate number of errors like this on clients (both
centos-3 and centos-4) connected to my centos-4 servers:

RPC: error 512 connecting to server $HOSTNAME
nfs_statfs: statfs error = 512

associated with mounted FSs becoming unavailable for periods of time (not
sure exactly how long, but on the order of 10-15 minutes). What exactly
is error 512, and where does it point me to for the purposes of fixing
this?

Thanks.

--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University


-------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
for problems? Stop! Download the new AJAX search engine that makes
searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2005-12-14 23:13:42

by Trond Myklebust

Subject: Re: Error 512

On Wed, 2005-12-14 at 15:42 -0500, Joshua Baker-LePain wrote:
> I'm still seeing a moderate number of errors like this on clients (both
> centos-3 and centos-4) connected to my centos-4 servers:
>
> RPC: error 512 connecting to server $HOSTNAME
> nfs_statfs: statfs error = 512
>
> associated with mounted FSs becoming unavailable for periods of time (not
> sure exactly how long, but on the order of 10-15 minutes). What exactly
> is error 512, and where does it point me to for the purposes of fixing
> this?

Error 512 is ERESTARTSYS. It means you interrupted the RPC call...

There is a patch going in to 2.6.16 that will remove that error message.

Cheers,
Trond
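
For reference, 512 sits in a small block of kernel-internal codes defined in
include/linux/errno.h in the kernel source. They are meant to stay inside the
kernel and are not part of the userspace errno set, which is why they have no
strerror() text. A minimal shell sketch of that mapping, where the helper name
kerr_name is invented for illustration:

```shell
# Map a kernel-internal error number to its symbolic name.
# Values are taken from include/linux/errno.h in the kernel source;
# kerr_name is a hypothetical helper, not an existing tool.
kerr_name() {
    case "$1" in
        512) echo ERESTARTSYS ;;
        513) echo ERESTARTNOINTR ;;
        514) echo ERESTARTNOHAND ;;
        515) echo ENOIOCTLCMD ;;
        516) echo ERESTART_RESTARTBLOCK ;;
        *)   echo "unknown ($1)" ;;
    esac
}

kerr_name 512
```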




2005-12-15 01:54:15

by Joshua Baker-LePain

Subject: Re: Error 512

On Wed, 14 Dec 2005 at 6:13pm, Trond Myklebust wrote

> On Wed, 2005-12-14 at 15:42 -0500, Joshua Baker-LePain wrote:
>> I'm still seeing a moderate number of errors like this on clients (both
>> centos-3 and centos-4) connected to my centos-4 servers:
>>
>> RPC: error 512 connecting to server $HOSTNAME
>> nfs_statfs: statfs error = 512
>>
>> associated with mounted FSs becoming unavailable for periods of time (not
>> sure exactly how long, but on the order of 10-15 minutes). What exactly
>> is error 512, and where does it point me to for the purposes of fixing
>> this?
>
> Error 512 is ERESTARTSYS. It means you interrupted the RPC call...
>
> There is a patch going in to 2.6.16 that will remove that error message.

Ah, OK. So you're saying I need to find out why the server is waiting
10+ minutes to respond to the client's request(s). *sigh*

Thanks.

--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University



2005-12-15 02:06:04

by Trond Myklebust

Subject: Re: Error 512

On Wed, 2005-12-14 at 20:54 -0500, Joshua Baker-LePain wrote:
> On Wed, 14 Dec 2005 at 6:13pm, Trond Myklebust wrote
>
> > On Wed, 2005-12-14 at 15:42 -0500, Joshua Baker-LePain wrote:
> >> I'm still seeing a moderate number of errors like this on clients (both
> >> centos-3 and centos-4) connected to my centos-4 servers:
> >>
> >> RPC: error 512 connecting to server $HOSTNAME
> >> nfs_statfs: statfs error = 512
> >>
> >> associated with mounted FSs becoming unavailable for periods of time (not
> >> sure exactly how long, but on the order of 10-15 minutes). What exactly
> >> is error 512, and where does it point me to for the purposes of fixing
> >> this?
> >
> > Error 512 is ERESTARTSYS. It means you interrupted the RPC call...
> >
> > There is a patch going in to 2.6.16 that will remove that error message.
>
> Ah, OK. So you're saying I need to find out why the server is waiting
> 10+ minutes to respond to the client's request(s). *sigh*

Yep. If you can send us a tcpdump, then that might help...

Cheers,
Trond




2005-12-15 02:21:47

by Joshua Baker-LePain

Subject: Re: Error 512

On Wed, 14 Dec 2005 at 9:05pm, Trond Myklebust wrote

> On Wed, 2005-12-14 at 20:54 -0500, Joshua Baker-LePain wrote:
>>
>> Ah, OK. So you're saying I need to find out why the server is waiting
>> 10+ minutes to respond to the client's request(s). *sigh*
>
> Yep. If you can send us a tcpdump, then that might help...
>
Will do first chance I get. The real fun bit is that I haven't yet worked
out a way to reproduce this at will.

--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University



2005-12-15 19:05:34

by Joshua Baker-LePain

Subject: Re: Error 512

On Wed, 14 Dec 2005 at 9:05pm, Trond Myklebust wrote

> On Wed, 2005-12-14 at 20:54 -0500, Joshua Baker-LePain wrote:

>> Ah, OK. So you're saying I need to find out why the server is waiting
>> 10+ minutes to respond to the client's request(s). *sigh*
>
> Yep. If you can send us a tcpdump, then that might help...

http://biomechanics.bme.duke.edu/~jlb/nfshang

This is a *long* hang that eventually did resolve, captured via

tcpdump -w nfshang -s 192 host $SERVER


--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University
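
In the capture command above, -w writes raw packets to the file nfshang,
-s 192 keeps only the first 192 bytes of each packet, and "host $SERVER"
limits the capture to traffic to or from the server. When reading such a file
back, a filter on the SYN flag narrows the output to connection setup. A
sketch, assuming a hypothetical helper syn_filter and the placeholder address
192.0.2.10:

```shell
# Build a pcap filter expression that matches only packets with the SYN
# flag set (the client's SYN and the server's SYN-ACK) to or from one host.
# syn_filter is a hypothetical helper; 192.0.2.10 is a placeholder address.
syn_filter() {
    printf '%s\n' "host $1 and tcp[tcpflags] & (tcp-syn) != 0"
}

syn_filter 192.0.2.10

# Reading a saved capture back with that filter (not run here):
#   tcpdump -nn -r nfshang "$(syn_filter 192.0.2.10)"
```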



2005-12-15 19:16:53

by Joshua Baker-LePain

Subject: Re: Error 512

On Thu, 15 Dec 2005 at 2:05pm, Joshua Baker-LePain wrote

> On Wed, 14 Dec 2005 at 9:05pm, Trond Myklebust wrote
>
>> On Wed, 2005-12-14 at 20:54 -0500, Joshua Baker-LePain wrote:
>
>>> Ah, OK. So you're saying I need to find out why the server is waiting
>>> 10+ minutes to respond to the client's request(s). *sigh*
>>
>> Yep. If you can send us a tcpdump, then that might help...
>
> http://biomechanics.bme.duke.edu/~jlb/nfshang
>
> This is a *long* hang that eventually did resolve captured via

*sigh* No, it didn't resolve. Let me know if you need a dump from when
it's unhung.

--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University



2005-12-15 19:37:24

by Trond Myklebust

Subject: Re: Error 512

On Thu, 2005-12-15 at 14:05 -0500, Joshua Baker-LePain wrote:
> On Wed, 14 Dec 2005 at 9:05pm, Trond Myklebust wrote
>
> > On Wed, 2005-12-14 at 20:54 -0500, Joshua Baker-LePain wrote:
>
> >> Ah, OK. So you're saying I need to find out why the server is waiting
> >> 10+ minutes to respond to the client's request(s). *sigh*
> >
> > Yep. If you can send us a tcpdump, then that might help...
>
> http://biomechanics.bme.duke.edu/~jlb/nfshang
>

It looks like the client is looping while trying to connect to the
server. The server appears to be ACKing the connection attempt, but that
is where it stops.

I assume this tcpdump was taken on the server? If so, it might be
interesting to look at what the client is seeing.

Cheers,
Trond




2005-12-15 19:40:25

by Joshua Baker-LePain

Subject: Re: Error 512

On Thu, 15 Dec 2005 at 2:37pm, Trond Myklebust wrote

> On Thu, 2005-12-15 at 14:05 -0500, Joshua Baker-LePain wrote:
>> On Wed, 14 Dec 2005 at 9:05pm, Trond Myklebust wrote
>>
>>> On Wed, 2005-12-14 at 20:54 -0500, Joshua Baker-LePain wrote:
>>
>>>> Ah, OK. So you're saying I need to find out why the server is waiting
>>>> 10+ minutes to respond to the client's request(s). *sigh*
>>>
>>> Yep. If you can send us a tcpdump, then that might help...
>>
>> http://biomechanics.bme.duke.edu/~jlb/nfshang
>>
>
> It looks like the client is looping while trying to connect to the
> server. The server appears to be ACKing the connection attempt, but that
> is where it stops.
>
> I assume this tcpdump was taken on the server? If so, it might be
> interesting to look at what the client is seeing.

Actually, that was on the client. Would it be helpful to look at it from
the server side?

--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University



2005-12-15 20:05:28

by Trond Myklebust

Subject: Re: Error 512

On Thu, 2005-12-15 at 14:40 -0500, Joshua Baker-LePain wrote:

> Actually, that was on the client. Would it be helpful to look at it from
> the server side?

Not yet...

Something else interesting about that trace is the fact that the server
keeps resending a bunch of old ACKs. It looks as if it thinks it is
still connected to the client.

What versions of Linux are you running on these machines?

Cheers,
Trond




2005-12-15 20:22:28

by Joshua Baker-LePain

Subject: Re: Error 512

On Thu, 15 Dec 2005 at 3:04pm, Trond Myklebust wrote

> On Thu, 2005-12-15 at 14:40 -0500, Joshua Baker-LePain wrote:
>
>> Actually, that was on the client. Would it be helpful to look at it from
>> the server side?
>
> Not yet...
>
> Something else interesting about that trace is the fact that the server
> keeps resending a bunch of old ACKs. It looks as if it thinks it is
> still connected to the client.
>
> What versions of Linux are you running on these machines?

Everything is fully up-to-date centos-4 (kernel 2.6.9-22.0.1.ELsmp).

--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University



2005-12-15 20:39:33

by Trond Myklebust

Subject: Re: Error 512

On Thu, 2005-12-15 at 15:22 -0500, Joshua Baker-LePain wrote:
> > What versions of Linux are you running on these machines?
>
> Everything is fully up-to-date centos-4 (kernel 2.6.9-22.0.1.ELsmp).

As far as I can see, the problem is that every time the server responds
to the client's request for a connection, the client fails to ACK that
response. Unless something is really broken deep down in the TCP layer,
then the most likely explanation is a firewall setup that is blocking
that traffic.

Any iptables/firewall enabled that might prevent the TCP 3-way handshake
from completing? Or any weird filtering at the switch?

Cheers,
Trond
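
The diagnosis above rests on the order of TCP's three-way handshake: client
SYN, server SYN-ACK, client ACK. If the SYN-ACK is dropped before it reaches
the client's TCP stack (for example by a packet filter), the client never
sends the final ACK and just retries the SYN. A toy shell model of that
reasoning follows; the function name and the flag notation (S, SA, A) are
invented for illustration and are not tcpdump output.

```shell
# Toy model: walk an ordered list of TCP flags seen for one connection
# attempt and report how far the handshake got.
#   S = SYN from client, SA = SYN-ACK from server, A = final ACK from client
# handshake_state is a hypothetical helper for illustration only.
handshake_state() {
    state=closed
    for f in "$@"; do
        case "$state:$f" in
            closed:S)      state=syn_sent ;;
            syn_sent:SA)   state=synack_seen ;;
            synack_seen:A) state=established ;;
        esac
    done
    echo "$state"
}

# Client keeps retrying, server keeps answering, final ACK never appears:
handshake_state S SA S SA S SA
```

A healthy connection would be "handshake_state S SA A", which ends in
established; the looping pattern in the trace corresponds to the sequence
shown, which never leaves synack_seen.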




2005-12-15 20:58:25

by Joshua Baker-LePain

Subject: Re: Error 512

On Thu, 15 Dec 2005 at 3:39pm, Trond Myklebust wrote

> On Thu, 2005-12-15 at 15:22 -0500, Joshua Baker-LePain wrote:
>>> What versions of Linux are you running on these machines?
>>
>> Everything is fully up-to-date centos-4 (kernel 2.6.9-22.0.1.ELsmp).
>
> As far as I can see, the problem is that every time the server responds
> to the client's request for a connection, the client fails to ACK that
> response. Unless something is really broken deep down in the TCP layer,
> then the most likely explanation is a firewall setup that is blocking
> that traffic.
>
> Any iptables/firewall enabled that might prevent the TCP 3-way handshake
> from completing? Or any weird filtering at the switch?

I think -- I'm hopeful -- that you just nailed it. At least, I disabled
iptables on the client, and the mounts all came back shortly thereafter.

For several OS revisions, I've used a rule like the following on all
client machines:

-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT

In retrospect, it's only recently, in moving towards NFS over TCP, that
I've begun having this randomly-hanging-mounts problem. I'm thinking
it's possible that that rule was timing out. I'll add rules to all
clients explicitly allowing traffic from the servers (the servers already
have such rules), and hopefully this'll be the last you hear from me on this
topic. I *really* appreciate all your help in tracking this down.

--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University
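
The explicit per-server rules mentioned above might look like the following,
written in the same "-A INPUT ..." format as the rule quoted in the message.
This is only a sketch: the addresses are placeholders from the documentation
range 192.0.2.0/24, the helper name nfs_allow_rules is invented, and the
rules would need to sit ahead of any REJECT or DROP rule in the INPUT chain.

```shell
# Print one ACCEPT rule per NFS server, in the same "-A INPUT ..." format
# as the RELATED,ESTABLISHED rule quoted above.
# nfs_allow_rules is a hypothetical helper; the addresses are placeholders.
nfs_allow_rules() {
    for server in "$@"; do
        printf '%s\n' "-A INPUT -s $server -j ACCEPT"
    done
}

nfs_allow_rules 192.0.2.10 192.0.2.11
```

On a CentOS client the printed lines would typically be merged into
/etc/sysconfig/iptables by hand and the iptables service restarted.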

