Recently, on both Red Hat 6.1 and 7.0 using kernel 2.4.1-ac3, I
ran into this problem:
Stopping NFS says the following in the kernel logs:
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
svc: server socket destroy delayed
And restarting NFS has the following error message:
root:~> /etc/rc.d/init.d/nfs start
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS mountd: [ OK ]
Starting NFS daemon: nfssvc: Address already in use
[FAILED]
From that moment forward, the NFS server is completely broken until the system
is rebooted, and other machines respond during a 'mount' by saying,
nfs: server xxx not responding, still trying
When I tried this, the remote computer had unmounted this NFS-served partition
prior to shutting NFS down with '/etc/rc.d/init.d/nfs stop'. I was wondering if
this could be related to that datagram shutdown bug, and maybe if there's a
quick solution in the meantime to kill the socket so that I can restart NFS
without rebooting.
Thanks,
Byron
--
Byron Stanoszek Ph: (330) 644-3059
Systems Programmer Fax: (330) 644-8110
Commercial Timesharing Inc. Email: [email protected]
> Recently, on both Red Hat 6.1 and 7.0 using kernel 2.4.1-ac3, I
> ran into this problem:
OK, I have seen this in older 2.2 kernels, but not in 2.4.
> nfsd: terminating on signal 9
> svc: server socket destroy delayed
>
> And restarting NFS has the following error message:
> Starting NFS mountd: [ OK ]
> Starting NFS daemon: nfssvc: Address already in use
> [FAILED]
A socket got stuck, and that's what is preventing you from restarting it. The bug
is whatever leak caused the 'svc: server socket destroy delayed' case.
Just for reference what network card ?
On Mon, 5 Feb 2001, Alan Cox wrote:
> > Recently, on both Red Hat 6.1 and 7.0 using kernel 2.4.1-ac3, I
> > ran into this problem:
>
> OK, I have seen this in older 2.2 kernels, but not in 2.4.
>
> > nfsd: terminating on signal 9
> > svc: server socket destroy delayed
> >
> > And restarting NFS has the following error message:
> > Starting NFS mountd: [ OK ]
> > Starting NFS daemon: nfssvc: Address already in use
> > [FAILED]
>
> A socket got stuck, and that's what is preventing you from restarting it. The bug
> is whatever leak caused the 'svc: server socket destroy delayed' case.
>
> Just for reference what network card ?
Both machines had a 3c905b-tx-nm card in them.
3c59x.c:LK1.1.12 06 Jan 2000 Donald Becker and others.
http://www.scyld.com/network/vortex.html $Revision: 1.102.2.46 $
See Documentation/networking/vortex.txt
eth0: 3Com PCI 3c905B Cyclone 100baseTx at 0x6100, 00:50:da:cd:c8:b9, IRQ 11
product code 'XC' rev 00.13 date 12-29-99
8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface.
MII transceiver found at address 24, status 786d.
Enabling bus-master transmits and whole-frame receives.
-Byron
On Monday February 5, [email protected] wrote:
> Recently, on both Red Hat 6.1 and 7.0 using kernel 2.4.1-ac3, I
> ran into this problem:
>
> Stopping NFS says the following in the kernel logs:
>
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> svc: server socket destroy delayed
>
> And restarting NFS has the following error message:
>
> root:~> /etc/rc.d/init.d/nfs start
> Starting NFS services: [ OK ]
> Starting NFS quotas: [ OK ]
> Starting NFS mountd: [ OK ]
> Starting NFS daemon: nfssvc: Address already in use
> [FAILED]
How repeatable is this? Is the server SMP?
There does seem to be a possible problem with sk_inuse not being
updated atomically, so a race between an increment and a decrement
could lose one of them.
svc_sock_release seems to often be called with no more protection than
the BKL, and it decrements sk_inuse.
svc_sock_enqueue, on the other hand increments sk_inuse, and is
protected by sv_lock, but not, I think, by the BKL, as it is called by
a networking layer callback. So there might be a possibility for a
race here.
The attached patch might fix it, so if you are having reproducible
problems, it might be worth applying this patch.
Trond: any comments?
NeilBrown
[ a better fix would be to make sk_inuse atomic_t ]
--- net/sunrpc/svcsock.c	2001/02/05 23:45:54	1.1
+++ net/sunrpc/svcsock.c	2001/02/05 23:48:12
@@ -211,16 +211,23 @@
 svc_sock_release(struct svc_rqst *rqstp)
 {
 	struct svc_sock	*svsk = rqstp->rq_sock;
+	struct svc_serv	*serv;
 
 	if (!svsk)
 		return;
+	serv = svsk->sk_server;
 	svc_release_skb(rqstp);
 	rqstp->rq_sock = NULL;
+
+	spin_lock_bh(&serv->sv_lock);
 	if (!--(svsk->sk_inuse) && svsk->sk_dead) {
+		spin_unlock_bh(&serv->sv_lock);
 		dprintk("svc: releasing dead socket\n");
 		sock_release(svsk->sk_sock);
 		kfree(svsk);
 	}
+	else
+		spin_unlock_bh(&serv->sv_lock);
 }
 
 /*
/*
On Tue, 6 Feb 2001, Neil Brown wrote:
> How repeatable is this? Is the server SMP?
I've tested this on two UP Athlons and two SMP Pentium 3s, and the same problem
occurred. I have not tested it more than once on the same system (I left the
NFS servers untouched after the reboot).
The Athlon systems running NFS were 2.4.1-ac3 and the Pentiums were running
2.2.19-pre7. All computers exporting the FS had one directory mounted at least
once.
In one case, only 1 directory was mounted once and then unmounted before
shutting off the NFS server. When I realized I forgot to copy a directory over,
I went to restart NFS on the server and found out I was unable to. Probably
irrelevant, but this had been after transferring 7 gigs of data over 100 Mbps.
I still have the 'broken' server running, so if you would like me to run a
command or two on it I can show you the results.
> The attached patch might fix it, so if you are having reproducible
> problems, it might be worth applying this patch.
I can try it tomorrow and see if it fixes the problem, but since this problem
also occurred on a UP, using spin locks probably will not correct it. Perhaps
it's something else.
> [patch snipped]
-Byron
On Monday February 5, [email protected] wrote:
> On Tue, 6 Feb 2001, Neil Brown wrote:
>
> > How repeatable is this? Is the server SMP?
>
> I've tested this on two UP Athlons and two SMP Pentium 3s, and the same problem
> occurred. I have not tested it more than once on the same system (I left the
> NFS servers untouched after the reboot).
>
> The Athlon systems running NFS were 2.4.1-ac3 and the Pentiums were running
> 2.2.19-pre7. All computers exporting the FS had one directory mounted at least
> once.
>
> In one case, only 1 directory was mounted once and then unmounted before
> shutting off the NFS server. When I realized I forgot to copy a directory over,
> I went to restart NFS on the server and found out I was unable to. Probably
> irrelevant, but this had been after transferring 7 gigs of data over 100 Mbps.
>
> I still have the 'broken' server running, so if you would like me to run a
> command or two on it I can show you the results.
I don't think that there is much useful that I could look at, thanks.
>
> > The attached patch might fix it, so if you are having reproducible
> > problems, it might be worth applying this patch.
>
> I can try it tomorrow and see if it fixes the problem, but since this problem
> also occurred on a UP, using spin locks probably will not correct it. Perhaps
> it's something else.
On second thoughts, this doesn't need to be SMP related. I don't know
much about "bottom halves" but I gather that they get run after an
interrupt has been handled and interrupts have been re-enabled, but
before the original process is rescheduled. If this is the case, then
the "_bh" part of the "spin_lock_bh" (which does a local_bh_disable)
could be the bit that is important on a UP system.
NeilBrown
>>>>> " " == Neil Brown <[email protected]> writes:
> The attached patch might fix it, so if you are having
> reproducible problems, it might be worth applying this patch.
> Trond: any comments?
> +
> + spin_lock_bh(&serv->sv_lock);
> if (!--(svsk->sk_inuse) && svsk->sk_dead) {
> + spin_unlock_bh(&serv->sv_lock);
> dprintk("svc: releasing dead socket\n");
> sock_release(svsk->sk_sock);
> kfree(svsk);
> }
> + else
> + spin_unlock_bh(&serv->sv_lock);
> }
Looks correct, but there's a similar problem in svc_delete_socket()
(see the setting of sk_dead and the subsequent test of sk_inuse).
Cheers,
Trond
> There does seem to be a possible problem with sk_inuse not being
> updated atomically, so a race between an increment and a decrement
> could lose one of them.
> svc_sock_release seems to often be called with no more protection than
> the BKL, and it decrements sk_inuse.
>
> svc_sock_enqueue, on the other hand increments sk_inuse, and is
> protected by sv_lock, but not, I think, by the BKL, as it is called by
> a networking layer callback. So there might be a possibility for a
> race here.
>
> The attached patch might fix it, so if you are having reproducible
> problems, it might be worth applying this patch.
>
> NeilBrown
I applied the patch and the problem seems to have gone away, where it was
fairly reproducible beforehand. It waits a little longer (about 4 seconds)
during the NFS daemon shutdown before [ OK ] pops up, but that could be my
imagination, because I was doing it on the 166 and I was used to the 866s.
But what matters is that I can stop and restart NFS just fine now whereas
before I couldn't. Thanks for the patch.
-Byron