2007-10-15 20:24:27

by NeilBrown

Subject: Re: [NFS] nfsd closes port 2049

On Monday October 15, [email protected] wrote:
> Hi all,
>
> I'm trying to debug a weird problem with nfsd on a 2.6.16.27-0.6-smp
> kernel.
>
> 1 server: SuSE SLES 10 x86_64, config attached
> 256 clients: RHEL4 Update 4 2.6.9-42.ELsmp x86_64
>
> Using nfs v3.
>
> The clients have been happily talking to the server for several days
> without incident.
>
> The weird thing is that at a certain point the socket opened on port
> 2049 on the NFS server is being closed for unknown reasons (or better
> for unknown reasons for me!).

This is fixed in any release based on 2.6.16.31 or later.
The relevant mainline patch is
1a047060a99f274a7c52cfea8159e4142a14b8a7
as below.
So update your kernel package.

NeilBrown


commit 1a047060a99f274a7c52cfea8159e4142a14b8a7
Author: NeilBrown <[email protected]>
Date: Thu Oct 19 23:29:13 2006 -0700

[PATCH] knfsd: fix race that can disable NFS server

This patch is suitable for just about any 2.6 kernel. It should go in
2.6.19 and 2.6.18.2 and possibly even the .17 and .16 stable series.

This is a long standing bug that seems to have only recently become
apparent, presumably due to increasing use of NFS over TCP - many
distros seem to be making it the default.

The SK_CONN bit gets set when a listening socket may be ready
for an accept, just as SK_DATA is set when data may be available.

It is entirely possible for svc_tcp_accept to be called with neither
of these set. It doesn't happen often but there is a small race in
svc_sock_enqueue as SK_CONN and SK_DATA are tested outside the
spin_lock. They could be cleared immediately after the test and
before the lock is gained.

This normally shouldn't be a problem. The sockets are non-blocking so
trying to read() or accept() when there is nothing to do is not a problem.

However: svc_tcp_recvfrom makes the decision "Should I accept() or
should I read()" based on whether SK_CONN is set or not. This usually
works but is not safe. The decision should be based on whether it is
a TCP_LISTEN socket or a TCP_CONNECTED socket.

Signed-off-by: Neil Brown <[email protected]>
Cc: Adrian Bunk <[email protected]>
Cc: <[email protected]>
Cc: Trond Myklebust <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 61e307c..96521f1 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -973,7 +973,7 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp)
 		return 0;
 	}
 
-	if (test_bit(SK_CONN, &svsk->sk_flags)) {
+	if (svsk->sk_sk->sk_state == TCP_LISTEN) {
 		svc_tcp_accept(svsk);
 		svc_sock_received(svsk);
 		return 0;
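
To illustrate the reasoning in the commit message, here is a minimal userspace sketch, not kernel code. It assumes a simplified socket with one fixed state and one hint bit standing in for SK_CONN. All names in it (`model_sock`, `choose_by_flag`, `choose_by_state`, `replay_race`) are hypothetical and do not exist in net/sunrpc/svcsock.c. The race is replayed sequentially rather than with real threads.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical userspace model, not kernel code. Every name below is
 * invented for illustration; only SK_CONN and TCP_LISTEN, referred to
 * in the comments, come from the commit message.
 */
enum model_state { MODEL_LISTEN, MODEL_CONNECTED };
enum model_action { MODEL_ACCEPT, MODEL_READ };

struct model_sock {
	enum model_state state; /* what kind of socket this is; fixed here */
	bool conn_pending;      /* stands in for SK_CONN: a hint that may be cleared */
};

/* Pre-patch style decision: trust the hint bit. */
static enum model_action choose_by_flag(const struct model_sock *s)
{
	return s->conn_pending ? MODEL_ACCEPT : MODEL_READ;
}

/* Post-patch style decision: look at the socket state (cf. TCP_LISTEN). */
static enum model_action choose_by_state(const struct model_sock *s)
{
	return s->state == MODEL_LISTEN ? MODEL_ACCEPT : MODEL_READ;
}

/*
 * Replays the race sequentially: the hint is seen set at enqueue time,
 * is cleared before the handler runs, and the handler then chooses.
 */
static void replay_race(struct model_sock *s)
{
	bool queued = s->conn_pending; /* test made outside the lock */

	s->conn_pending = false;       /* another thread consumes the event */
	if (queued)
		printf("by flag: %s, by state: %s\n",
		       choose_by_flag(s) == MODEL_ACCEPT ? "accept" : "read",
		       choose_by_state(s) == MODEL_ACCEPT ? "accept" : "read");
}
```

Calling `replay_race` on a `model_sock` initialised as `{ MODEL_LISTEN, true }` prints `by flag: read, by state: accept`. In this model the flag-based choice treats the listening socket as a data socket once the hint is cleared, while the state-based choice is unaffected by the hint.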


2007-10-15 21:57:54

by Andrea Righi

Subject: Re: nfsd closes port 2049

Neil Brown wrote:
> On Monday October 15, [email protected] wrote:
>> Hi all,
>>
>> I'm trying to debug a weird problem with nfsd on a 2.6.16.27-0.6-smp
>> kernel.
>>
>> 1 server: SuSE SLES 10 x86_64, config attached
>> 256 clients: RHEL4 Update 4 2.6.9-42.ELsmp x86_64
>>
>> Using nfs v3.
>>
>> The clients have been happily talking to the server for several days
>> without incident.
>>
>> The weird thing is that at a certain point the socket opened on port
>> 2049 on the NFS server is being closed for unknown reasons (or better
>> for unknown reasons for me!).
>
> This is fixed in any release based on 2.6.16.31 or later.
> The relevant mainline patch is
> 1a047060a99f274a7c52cfea8159e4142a14b8a7
> as below.
> So update your kernel package.

Thanks Neil, looking at the source and in my logs this seems to explain
perfectly my problem. I'll try the patch ASAP.

-Andrea

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2007-10-16 12:11:25

by Talpey, Thomas

Subject: Re: nfsd closes port 2049

At 05:57 PM 10/15/2007, Andrea Righi wrote:
>Neil Brown wrote:
>>> The weird thing is that at a certain point the socket opened on port
>>> 2049 on the NFS server is being closed for unknown reasons (or better
>>> for unknown reasons for me!).
>>
>> This is fixed in any release based on 2.6.16.31 or later.
>> The relevant mainline patch is
>> 1a047060a99f274a7c52cfea8159e4142a14b8a7
>> as below.
>> So update your kernel package.
>
>Thanks Neil, looking at the source and in my logs this seems to explain
>perfectly my problem. I'll try the patch ASAP.

BTW, the nfsd_acceptable() issue is different from this one, and the
no_subtree_check I suggested may still be needed (right Neil?). I'm
interested in what you find - keep us posted.

Tom.

2007-10-16 17:50:29

by Andrea Righi

Subject: Re: nfsd closes port 2049

Talpey, Thomas wrote:
> At 05:57 PM 10/15/2007, Andrea Righi wrote:
>> Neil Brown wrote:
>>>> The weird thing is that at a certain point the socket opened on port
>>>> 2049 on the NFS server is being closed for unknown reasons (or better
>>>> for unknown reasons for me!).
>>> This is fixed in any release based on 2.6.16.31 or later.
>>> The relevant mainline patch is
>>> 1a047060a99f274a7c52cfea8159e4142a14b8a7
>>> as below.
>>> So update your kernel package.
>> Thanks Neil, looking at the source and in my logs this seems to explain
>> perfectly my problem. I'll try the patch ASAP.
>
> BTW, the nfsd_acceptable() issue is different from this one, and the
> no_subtree_check I suggested may still be needed (right Neil?). I'm
> interested in what you find - keep us posted.
>
> Tom.
>

I've just finished updating the kernel to 2.6.16.53-0.8-smp (still SLES10
x86_64, of course), which includes the fix reported by Neil. I've exported the fs
without the no_subtree_check option for now; let's see what happens over the
next few days.

-Andrea
