2002-09-17 17:16:54

by Magnus Naeslund(f)

Subject: App hanging at NFS failure

If i have a program that uses sendfile to send a file to the
network from an NFS server (mount options:
rw,noatime,hard,intr,rsize=8192,wsize=8192) and reboot the
NFS server the program is still hung when the NFS server is
online again.

When i try to strace it, it starts to run again, and works as
usual, the same if i debug it with gdb.

Can i affect this by using soft mount or hard without intr, or
what can i do?

Magnus
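
For readers following the thread, here is a minimal sketch of the kind of sendfile loop the message describes. It is not the poster's program: the names send_whole_file and demo, the 8192-byte chunk size, and the local socket pair standing in for a network client are all assumptions made for illustration. It only shows where such a process would sit blocked if the file lived on a hard-mounted NFS share whose server had gone away.

```python
import os
import socket
import tempfile


def send_whole_file(sock, path, chunk=8192):
    """Send the file at `path` over `sock` using os.sendfile; return bytes sent."""
    sent_total = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent_total < size:
            # os.sendfile may transfer fewer bytes than requested, so loop.
            # With the file on a hard NFS mount, this is the call that would
            # block while the server is unreachable.
            sent = os.sendfile(sock.fileno(), f.fileno(), sent_total, chunk)
            if sent == 0:  # file shrank underneath us; stop rather than spin
                break
            sent_total += sent
    return sent_total


def demo():
    """Round-trip a small temporary file through a local socket pair."""
    payload = b"x" * 20000
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    a, b = socket.socketpair()
    try:
        send_whole_file(a, tmp.name)
        a.close()  # signal end of stream to the reader
        chunks = []
        while True:
            data = b.recv(65536)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks)
    finally:
        a.close()
        b.close()
        os.unlink(tmp.name)
```

The demo uses a local temporary file, so it never actually hangs. Reproducing the reported behaviour would need a real NFS mount and a server reboot.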



-------------------------------------------------------
This SF.NET email is sponsored by: AMD - Your access to the experts
on Hammer Technology! Open Source & Linux Developers, register now
for the AMD Developer Symposium. Code: EX8664
http://www.developwithamd.com/developerlab
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2002-09-17 17:45:14

by Lever, Charles

Subject: RE: App hanging at NFS failure

> If i have a program that uses sendfile to send a file to the
> network from an NFS server (mount options:
> rw,noatime,hard,intr,rsize=8192,wsize=8192) and reboot the
> NFS server the program is still hung when the NFS server is
> online again.

how long does it take to restart your NFS server? how long do
you wait before deciding your application is hung? oh, and
of course, which distribution and kernel version?

> When i try to strace it, it starts to run again, and works as
> usual, the same if i debug it with gdb.

try tracing it with "echo 3 > /proc/sys/sunrpc/rpc_debug"
this produces a ton of kernel log output, but may show why
the NFS client has stopped servicing application requests.
use "echo 0 > /proc/sys/sunrpc/rpc_debug" to turn it off again.
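
The two echo commands above can be wrapped so that debugging is always switched back off, even if the reproduction is interrupted. This is a minimal sketch under some assumptions: the helper name rpc_debug and its path parameter are made up for illustration, writing to the real file needs root, and the file only exists when the sunrpc code is loaded.

```python
import contextlib

RPC_DEBUG = "/proc/sys/sunrpc/rpc_debug"


@contextlib.contextmanager
def rpc_debug(level=3, path=RPC_DEBUG):
    """Write `level` to the rpc_debug file on entry and 0 on exit."""
    with open(path, "w") as f:
        f.write(f"{level}\n")
    try:
        yield
    finally:
        # Always clear the debug flags, even if the body raised.
        with open(path, "w") as f:
            f.write("0\n")
```

Usage would look like "with rpc_debug(): input('reproduce the hang, then press Enter')", with the kernel log collected afterwards.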



2002-09-17 18:05:35

by Magnus Naeslund(f)

Subject: Re: App hanging at NFS failure

Lever, Charles <[email protected]> wrote:
>> If i have a program that uses sendfile to send a file to the
>> network from an NFS server (mount options:
>> rw,noatime,hard,intr,rsize=8192,wsize=8192) and reboot the
>> NFS server the program is still hung when the NFS server is
>> online again.
>
> how long does it take to restart your NFS server? how long do
> you wait before deciding your application is hung? oh, and
> of course, which distribution and kernel version?
>

Oh, i forgot, the server is "fet1a.our.net 2.4.18-10custom #2 SMP Tue
Aug 27 03:27:31 CEST 2002 i686 unknown", a dual Athlon running RedHat
7.3, no patches, RedHat-supplied kernel.

The downtime is the time it takes to reboot the nfs server machine:

Sep 16 13:45:06 gserver1 kernel: nfs: server fet1a.our.net is not responding
Sep 16 13:45:21 gserver1 kernel: nfs: server fet1a.our.net OK
Sep 16 13:46:22 gserver1 kernel: nfs: server fet1a.our.net is not responding
Sep 16 13:47:00 gserver1 kernel: nfs: server fet1a.our.net still not responding
Sep 16 13:47:49 gserver1 kernel: nfs: server fet1a.our.net still not responding
Sep 16 13:48:31 gserver1 kernel: nfs: server fet1a.our.net still not responding
Sep 16 13:48:46 gserver1 kernel: nfs: server fet1a.our.net OK
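
As a rough way to read the outage lengths off these log lines, the sketch below pairs each first "not responding" message with the next "OK" message and reports the gap in seconds. The function name outages and the year default are assumptions for illustration (syslog timestamps carry no year). The result reflects when the client noticed the outage, which is not necessarily the server's actual downtime.

```python
from datetime import datetime

LOG = """\
Sep 16 13:45:06 gserver1 kernel: nfs: server fet1a.our.net is not responding
Sep 16 13:45:21 gserver1 kernel: nfs: server fet1a.our.net OK
Sep 16 13:46:22 gserver1 kernel: nfs: server fet1a.our.net is not responding
Sep 16 13:47:00 gserver1 kernel: nfs: server fet1a.our.net still not responding
Sep 16 13:47:49 gserver1 kernel: nfs: server fet1a.our.net still not responding
Sep 16 13:48:31 gserver1 kernel: nfs: server fet1a.our.net still not responding
Sep 16 13:48:46 gserver1 kernel: nfs: server fet1a.our.net OK
"""


def outages(log, year=2002):
    """Return seconds from each first 'not responding' line to the next 'OK'."""
    result = []
    down_since = None
    for line in log.splitlines():
        line = line.strip()
        if "nfs: server" not in line:
            continue
        # The first 15 characters are the syslog timestamp, e.g. "Sep 16 13:45:06".
        stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        if line.endswith("not responding"):
            if down_since is None:  # ignore the "still not responding" repeats
                down_since = stamp
        elif line.endswith("OK") and down_since is not None:
            result.append(int((stamp - down_since).total_seconds()))
            down_since = None
    return result


print(outages(LOG))  # [15, 144]
```

By this reading the client saw a short 15-second blip followed by an outage of about two and a half minutes.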


>> When i try to strace it, it starts to run again, and works as
>> usual, the same if i debug it with gdb.
>
> try tracing it with "echo 3 > /proc/sys/sunrpc/rpc_debug"
> this produces a ton of kernel log output, but may show why
> the NFS client has stopped servicing application requests.
> use "echo 0 > /proc/sys/sunrpc/rpc_debug"

As soon as i can get a test machine up and running i'll try that.

The thing is that only the processes that were sendfiling from the NFS
partition at the time of the reboot hung. After the server is up again,
the other processes work just fine (the process is started in an inetd
fashion, using djb's daemontools).

So everything is fine except that the "old" processes are hung, and
need me to kill -CONT them or something to snap out of it.

The NFS client's own TCP clients time out during the downtime, if that
could have anything to do with it? Kind of like the TCP socket goes crazy
because it's stuck in NFS during the disconnect or something?

Magnus


