2007-10-18 06:30:16

by NeilBrown

Subject: Re: nfsd closes port 2049

On Tuesday October 16, [email protected] wrote:
> At 05:57 PM 10/15/2007, Andrea Righi wrote:
> >Neil Brown wrote:
> >>> The weird thing is that at a certain point the socket opened on port
> >>> 2049 on the NFS server is being closed for unknown reasons (or better
> >>> for unknown reasons for me!).
> >>
> >> This is fixed in any release based on 2.6.16.31 or later.
> >> The relevant mainline patch is
> >> 1a047060a99f274a7c52cfea8159e4142a14b8a7
> >> as below.
> >> So update your kernel package.
> >
> >Thanks Neil, looking at the source and in my logs this seems to explain
> >perfectly my problem. I'll try the patch ASAP.
>
> BTW, the nfsd_acceptable() issue is different from this one, and the
> no_subtree_check I suggested may still be needed (right Neil?). I'm
> interested in what you find - keep us posted.

I don't know exactly what you mean by "the nfsd_acceptable() issue",
but whatever it is, it would be completely separate from tcp
connections.

If a filesystem got unexported, or a "chmod -x" made some directories
inaccessible, it would not close any TCP connection. It would simply
return an error status for every request, leaving the TCP connection
active.

NeilBrown

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2007-10-18 11:59:45

by Talpey, Thomas

Subject: Re: nfsd closes port 2049

At 02:30 AM 10/18/2007, Neil Brown wrote:
>On Tuesday October 16, [email protected] wrote:
>> BTW, the nfsd_acceptable() issue is different from this one, and the
>> no_subtree_check I suggested may still be needed (right Neil?). I'm
>> interested in what you find - keep us posted.
>
>I don't know exactly what you mean by "the nfsd_acceptable() issue",
>but whatever it is, it would be completely separate from tcp
>connections.

It was the messages in the logs Andrea sent, here's one:

>>Oct 13 05:20:56 node0101 kernel: nfsd_acceptable failed at ffff8100c7873700

>If a filesystem got unexported, or a "chmod -x" made some directories
>inaccessible, it would not close any TCP connection. It would simply
>return an error status for every request, leaving the TCP connection
>active.

Fair enough - the clients wouldn't automatically close the connections
due to this, either. So the race condition at the server is the probable
cause of Andrea's observed error.

I still think adding no_subtree_check will help the situation. These
export failures are coming from some failed check at the server, and
they're rare enough to make me think there's a GPFS or other server
issue at work from time to time.
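
For reference, a minimal sketch of what the option looks like in
/etc/exports. The path and client network below are placeholders, not
taken from Andrea's setup:

```
# /etc/exports (hypothetical path and client range)
/export/data  192.168.1.0/24(rw,sync,no_subtree_check)
```

After editing the file, the export table is normally re-read with
"exportfs -ra".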

Tom.


2007-10-18 12:42:49

by Andrea Righi

Subject: Re: nfsd closes port 2049

Talpey, Thomas wrote:
> At 02:30 AM 10/18/2007, Neil Brown wrote:
>> On Tuesday October 16, [email protected] wrote:
>>> BTW, the nfsd_acceptable() issue is different from this one, and the
>>> no_subtree_check I suggested may still be needed (right Neil?). I'm
>>> interested in what you find - keep us posted.
>> I don't know exactly what you mean by "the nfsd_acceptable() issue",
>> but whatever it is, it would be completely separate from tcp
>> connections.
>
> It was the messages in the logs Andrea sent, here's one:
>
>>> Oct 13 05:20:56 node0101 kernel: nfsd_acceptable failed at ffff8100c7873700
>
>> If a filesystem got unexported, or a "chmod -x" made some directories
>> inaccessible, it would not close any TCP connection. It would simply
>> return an error status for every request, leaving the TCP connection
>> active.
>
> Fair enough - the clients wouldn't automatically close the connections
> due to this, either. So the race condition at the server is the probable
> cause of Andrea's observed error.
>
> I still think adding no_subtree_check will help the situation. These
> export failures are coming from some failed check at the server, and
> they're rare enough to make me think there's a GPFS or other server
> issue at work from time to time.

The NFS server is working fine for the 2nd day using the fix (with kernel
2.6.16.53-0.8-smp), but I'm not yet using the no_subtree_check option. I
stressed the filesystem with concurrent accesses from all 256 clients,
alternating with periods of inactivity. This was a good "pattern" for
reproducing the problem, and since it no longer happens I'm considering the
issue resolved.

Anyway, I'm quite lucky :-) and I've another cluster with another identical NFS
server: same hardware, same distro, same number of clients, etc, so I can try
the no_subtree_check there.

-Andrea


2007-10-18 13:45:03

by Talpey, Thomas

Subject: Re: nfsd closes port 2049

At 08:42 AM 10/18/2007, Andrea Righi wrote:
>The NFS server is working fine for the 2nd day using the fix (with kernel
>2.6.16.53-0.8-smp), but I'm not yet using the no_subtree_check option. I tried
>to stress the filesystem with multiple accesses from all the clients (256),
>alternating with periods of inactivity. This was a good "pattern" to
>reproduce the problem and since it didn't happen anymore I'm considering
>the issue resolved.

Great! Do you have any "nfsd_acceptable" failures in the server's syslog?
If none, then the second issue didn't occur either, a good thing in itself. :-)

Tom.


2007-10-18 14:35:14

by Andrea Righi

Subject: Re: nfsd closes port 2049

Talpey, Thomas wrote:
> At 08:42 AM 10/18/2007, Andrea Righi wrote:
>> The NFS server is working fine for the 2nd day using the fix (with kernel
>> 2.6.16.53-0.8-smp), but I'm not yet using the no_subtree_check option. I tried
>> to stress the filesystem with multiple accesses from all the clients (256),
>> alternating with periods of inactivity. This was a good "pattern" to
>> reproduce the problem and since it didn't happen anymore I'm considering
>> the issue resolved.
>
> Great! Do you have any "nfsd_acceptable" failures in the server's syslog?
> If none, then the second issue didn't occur either, a good thing in itself. :-)
>
> Tom.
>

I rebooted the NFS server with the new kernel on Oct 17 10:45:24, and:

node0101:~ # (bzcat /var/log/messages-2007101[78].bz2; cat /var/log/messages) | grep nfsd_acceptable
Oct 16 23:04:41 node0101 kernel: nfsd_acceptable failed at ffff810155e88150
Oct 17 07:52:38 node0101 kernel: nfsd_acceptable failed at ffff81015da98970
Oct 17 10:22:33 node0101 kernel: nfsd_acceptable failed at ffff8100d8cffa40

So, no nfsd_acceptable failures in the logs since the reboot.
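
A minimal sketch of narrowing the grep above to entries logged after the
reboot, so the comparison does not have to be done by eye. The function name
failures_since is hypothetical, and the sketch assumes classic syslog
timestamps and that all lines fall within the same month:

```shell
# Hypothetical helper: print nfsd_acceptable failures logged at or after a cutoff.
# Assumes classic syslog timestamps ("Oct 17 10:45:24") and that all lines
# fall within the same month, since only day and time are compared.
failures_since() {
    # $1 = day of month, $2 = HH:MM:SS; log lines are read from stdin
    awk -v day="$1" -v cutoff="$2" '
        /nfsd_acceptable failed/ {
            d = $2 + 0
            # Zero-padded HH:MM:SS strings sort correctly as plain strings.
            if (d > day + 0 || (d == day + 0 && ($3 "") >= (cutoff ""))) print
        }'
}

# Usage (same log files as in the command above):
# (bzcat /var/log/messages-2007101[78].bz2; cat /var/log/messages) | failures_since 17 10:45:24
```

Fed the three lines shown above, it prints nothing, since the latest one
(10:22:33) predates the reboot at 10:45:24.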

-Andrea
