Recently, on both Red Hat 6.1 and 7.0 using kernel 2.4.1-ac3, I
ran into this problem:
Stopping NFS says the following in the kernel logs:
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
svc: server socket destroy delayed
And restarting NFS has the following error message:
root:~> /etc/rc.d/init.d/nfs start
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS mountd: [ OK ]
Starting NFS daemon: nfssvc: Address already in use
[FAILED]
From that moment forward, the NFS server is completely broken until the system
is rebooted, and other machines respond during a 'mount' by saying,
nfs: server xxx not responding, still trying
When I tried this, the remote computer had unmounted this NFS-served partition
prior to shutting NFS down with '/etc/rc.d/init.d/nfs stop'. I was wondering if
this could be related to that datagram shutdown bug, and maybe if there's a
quick solution in the meantime to kill the socket so that I can restart NFS
without rebooting.
Thanks,
Byron
--
Byron Stanoszek Ph: (330) 644-3059
Systems Programmer Fax: (330) 644-8110
Commercial Timesharing Inc. Email: [email protected]
> Recently, on both Red Hat 6.1 and 7.0 using kernel 2.4.1-ac3, I
> ran into this problem:
OK, I have seen this in older 2.2 kernels, but not in 2.4.
> nfsd: terminating on signal 9
> svc: server socket destroy delayed
>
> And restarting NFS has the following error message:
> Starting NFS mountd: [ OK ]
> Starting NFS daemon: nfssvc: Address already in use
> [FAILED]
A socket got stuck, and that's what is preventing you from restarting it. The bug
is whatever leak caused the 'svc: server socket destroy delayed' case.
Just for reference what network card ?
On Mon, 5 Feb 2001, Alan Cox wrote:
> > Recently, on both Red Hat 6.1 and 7.0 using kernel 2.4.1-ac3, I
> > ran into this problem:
>
> OK, I have seen this in older 2.2 kernels, but not in 2.4.
>
> > nfsd: terminating on signal 9
> > svc: server socket destroy delayed
> >
> > And restarting NFS has the following error message:
> > Starting NFS mountd: [ OK ]
> > Starting NFS daemon: nfssvc: Address already in use
> > [FAILED]
>
> A socket got stuck, and that's what is preventing you from restarting it. The bug
> is whatever leak caused the 'svc: server socket destroy delayed' case.
>
> Just for reference what network card ?
Both machines had a 3c905b-tx-nm card in them.
3c59x.c:LK1.1.12 06 Jan 2000 Donald Becker and others.
http://www.scyld.com/network/vortex.html $Revision: 1.102.2.46 $
See Documentation/networking/vortex.txt
eth0: 3Com PCI 3c905B Cyclone 100baseTx at 0x6100, 00:50:da:cd:c8:b9, IRQ 11
product code 'XC' rev 00.13 date 12-29-99
8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface.
MII transceiver found at address 24, status 786d.
Enabling bus-master transmits and whole-frame receives.
-Byron
On Monday February 5, [email protected] wrote:
> Recently, on both Red Hat 6.1 and 7.0 using kernel 2.4.1-ac3, I
> ran into this problem:
>
> Stopping NFS says the following in the kernel logs:
>
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> svc: server socket destroy delayed
>
> And restarting NFS has the following error message:
>
> root:~> /etc/rc.d/init.d/nfs start
> Starting NFS services: [ OK ]
> Starting NFS quotas: [ OK ]
> Starting NFS mountd: [ OK ]
> Starting NFS daemon: nfssvc: Address already in use
> [FAILED]
How repeatable is this? Is the server SMP?
There does seem to be a possible problem with sk_inuse not being
updated atomically, so a race between an increment and a decrement
could lose one of them.
svc_sock_release seems to often be called with no more protection than
the BKL, and it decrements sk_inuse.
svc_sock_enqueue, on the other hand increments sk_inuse, and is
protected by sv_lock, but not, I think, by the BKL, as it is called by
a networking layer callback. So there might be a possibility for a
race here.
The attached patch might fix it, so if you are having reproducible
problems, it might be worth applying this patch.
Trond: any comments?
NeilBrown
[ a better fix would be to make sk_inuse atomic_t ]
--- net/sunrpc/svcsock.c	2001/02/05 23:45:54	1.1
+++ net/sunrpc/svcsock.c	2001/02/05 23:48:12
@@ -211,16 +211,23 @@
 svc_sock_release(struct svc_rqst *rqstp)
 {
 	struct svc_sock	*svsk = rqstp->rq_sock;
+	struct svc_serv	*serv;
 
 	if (!svsk)
 		return;
+	serv = svsk->sk_server;
 	svc_release_skb(rqstp);
 	rqstp->rq_sock = NULL;
+
+	spin_lock_bh(&serv->sv_lock);
 	if (!--(svsk->sk_inuse) && svsk->sk_dead) {
+		spin_unlock_bh(&serv->sv_lock);
 		dprintk("svc: releasing dead socket\n");
 		sock_release(svsk->sk_sock);
 		kfree(svsk);
 	}
+	else
+		spin_unlock_bh(&serv->sv_lock);
 }
 
 /*
/*
On Tue, 6 Feb 2001, Neil Brown wrote:
> How repeatable is this? Is the server SMP?
I've tested this on two UP Athlons and two SMP Pentium 3s, and the same problem
occurred. I have not tested it more than once on the same system (I left the
NFS servers untouched after the reboot).
The Athlon systems running NFS were 2.4.1-ac3 and the Pentiums were running
2.2.19-pre7. All computers exporting the FS had one directory mounted at least
once.
In one case, only 1 directory was mounted once and then unmounted before
shutting off the NFS server. When I realized I forgot to copy a directory over,
I went to restart NFS on the server and found out I was unable to. Probably
irrelevant, but this had been after transferring 7 gigs of data over 100 Mbps.
I still have the 'broken' server running, so if you would like me to run a
command or two on it I can show you the results.
> The attached patch might fix it, so if you are having reproducible
> problems, it might be worth applying this patch.
I can try it tomorrow and see if it fixes the problem, but since this problem
also occurred on a UP, using spin locks probably will not correct it. Perhaps
it's something else.
> [patch snipped]
-Byron
On Monday February 5, [email protected] wrote:
> On Tue, 6 Feb 2001, Neil Brown wrote:
>
> > How repeatable is this? Is the server SMP?
>
> I've tested this on two UP Athlons and two SMP Pentium 3s, and the same problem
> occurred. I have not tested it more than once on the same system (I left the
> NFS servers untouched after the reboot).
>
> The Athlon systems running NFS were 2.4.1-ac3 and the Pentiums were running
> 2.2.19-pre7. All computers exporting the FS had one directory mounted at least
> once.
>
> In one case, only 1 directory was mounted once and then unmounted before
> shutting off the NFS server. When I realized I forgot to copy a directory over,
> I went to restart NFS on the server and found out I was unable to. Probably
> irrelevant, but this had been after transferring 7 gigs of data over 100 Mbps.
>
> I still have the 'broken' server running, so if you would like me to run a
> command or two on it I can show you the results.
I don't think that there is much useful that I could look at, thanks.
>
> > The attached patch might fix it, so if you are having reproducible
> > problems, it might be worth applying this patch.
>
> I can try it tomorrow and see if it fixes the problem, but since this problem
> also occurred on a UP, using spin locks probably will not correct it. Perhaps
> it's something else.
On second thoughts, this doesn't need to be SMP related. I don't know
much about "bottom halves" but I gather that they get run after an
interrupt has been handled and interrupts have been re-enabled, but
before the original process is rescheduled. If this is the case, then
the "_bh" part of the "spin_lock_bh" (which does a local_bh_disable)
could be the bit that is important on a UP system.
NeilBrown
>>>>> " " == Neil Brown <[email protected]> writes:
> The attached patch might fix it, so if you are having
> reproducible problems, it might be worth applying this patch.
> Trond: any comments?
> +
> + spin_lock_bh(&serv->sv_lock);
> if (!--(svsk->sk_inuse) && svsk->sk_dead) {
> + spin_unlock_bh(&serv->sv_lock);
> dprintk("svc: releasing dead socket\n");
> sock_release(svsk->sk_sock);
> kfree(svsk);
> }
> + else
> + spin_unlock_bh(&serv->sv_lock);
> }
Looks correct, but there's a similar problem in svc_delete_socket()
(see the setting of sk_dead and the subsequent test of sk_inuse).
Cheers,
Trond
> There does seem to be a possible problem with sk_inuse not being
> updated atomically, so a race between an increment and a decrement
> could lose one of them.
> svc_sock_release seems to often be called with no more protection than
> the BKL, and it decrements sk_inuse.
>
> svc_sock_enqueue, on the other hand increments sk_inuse, and is
> protected by sv_lock, but not, I think, by the BKL, as it is called by
> a networking layer callback. So there might be a possibility for a
> race here.
>
> The attached patch might fix it, so if you are having reproducible
> problems, it might be worth applying this patch.
>
> NeilBrown
I applied the patch and the problem seems to have gone away, where it was
fairly reproducible beforehand. It waits a little longer (about 4 seconds)
during the NFS daemon shutdown before [ OK ] pops up, but that could be my
imagination, because I was doing it on the 166 and I was used to the 866s.
But what matters is that I can stop and restart NFS just fine now whereas
before I couldn't. Thanks for the patch.
-Byron