2005-09-12 15:39:52

by Bernd Schubert

[permalink] [raw]
Subject: 2.613: network write socket problems

Hello,

on last Friday we switched on our server to 2.6.13 and today we are
experiencing problems with our nfs clients.
In particular I'm talking about the unfs3 daemon, not the kernel nfs daemon.
Both are running on the server but on different ports, of course. Both are
also serving to the same clients, but different directories.

Today it already several times happend that the unfs3 daemon stalled. Ethereal
showed no network packages on the unfs3 daemon port during this time.
A strace to the proc-id of the daemon clearly shows that *some* writes to some
network sockets will take ages to finish

write(37, "\200\0\0x\203\326(\5\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 124) =
124

This kind of writes can take between seconds and minutes, while it usually
happens much faster than I can count. After the write() to the network
socket, other operations happen rather fast, until the next write to a
network socket. (I identified the troublesome filedescriptors by looking
to /proc/procid/fd).
After restarting the unfs3 daemon everything goes smooth for some time
(approximately 20min to 2h), until the next write to a filedescriptor stalls.

Any idea whats going on? Until today this never happend before, neither with
2.6.x nor 2.4.x. As I wrote, on Friday we replaced 2.6.11.12 by 2.6.13, the
configuration should be similar, only changes should be HZ set to 250 and
additionally the skge driver.
We already switched back from skge to sk98lin, but the problem seems to
remain.


Thanks,
Bernd



--
Bernd Schubert
Physikalisch Chemisches Institut / Theoretische Chemie
Universit?t Heidelberg
INF 229
69120 Heidelberg
e-mail: [email protected]


2005-09-13 09:23:04

by Bernd Schubert

[permalink] [raw]
Subject: Re: 2.613: network write socket problems

On Monday 12 September 2005 17:39, Bernd Schubert wrote:
> Hello,
>
> on last Friday we switched on our server to 2.6.13 and today we are
> experiencing problems with our nfs clients.
> In particular I'm talking about the unfs3 daemon, not the kernel nfs
> daemon. Both are running on the server but on different ports, of course.
> Both are also serving to the same clients, but different directories.
>
> Today it already several times happend that the unfs3 daemon stalled.
> Ethereal showed no network packages on the unfs3 daemon port during this
> time. A strace to the proc-id of the daemon clearly shows that *some*
> writes to some network sockets will take ages to finish
>
> write(37, "\200\0\0x\203\326(\5\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 124)
> = 124

Sorry for the noise, its not a kernel problem. Switching back to 2.6.11 didn't
help, so we investigated further. It turned out, that one of our clients was
in a kind of a zombie state and asking for filehandles, but not answering
request from the server. Since unfs3 is only single threaded, all other
clients had to wait for timeouts.

Bernd