2004-09-08 07:32:58

by Frank Steiner

Subject: server freeze with 2.6.8.1

Hi,

I know that the following description is not helpful for reproducing
the freeze, but I'm hoping that someone might have encountered something
similar or has an idea...

Our NFS server serves about 60 clients, 25 of them with root-over-nfs,
the rest just get some stuff like /home etc. We recently switched from
2.4.21 to 2.6.8.1 and use self-compiled kernel rpms (based on the SuSE
rpms). We installed some kernel upgrades in the last weeks, all 2.6.8.1 based,
but enhanced by some security fixes or patches like packet writing etc.
So we got versions 2.6.8.1-1 to -3, which differed only in the security
fixes nfsd-xdr-patch and reiserfs-xattr-acl.patch, as well as the
cdwriting-patch. So these are basically the same kernels.

During these updates and reboots we encountered some mysterious server
freezes. They are not 100% reproducible, but when they happened we
had always installed a new kernel rpm on the server (in parallel to the
old ones, so that diskless clients keep the needed /lib/modules/ until they
reboot) and then either

- rebooted several clients in parallel to the new kernel while the
server was still running the older version,

- or had the server (and some clients) run the new version and then
rebooted some clients which were still running the old version.

We also had one situation where a user logged in to a client with KDE and
the server froze. The first time this happened, the client had just
booted into the new version that the server was already running. The freeze
then happened 4 times in a row when the user logged in, until we cleaned
up all KDE-related files and directories in that user's home (hosted on
the NFS server).

When the freeze occurs, the NFS server does not give any message on
/dev/tty10: no kernel oops or anything similar. Sometimes, when I'm quick
enough, I can still switch between consoles, e.g., from tty10 to tty1 and
back, but trying, e.g., an emergency sync will then freeze the server
completely, so that not even alt+sysrq+b works. The last messages
I see in /var/log/messages are always the messages that a client has
mounted the NFS directories.

We are using NFS v3 with the mount options tcp,hard,intr,lock.

We already had the same problem when running the official SuSE kernel
2.4.21-xxx (never before with 2.4.19 and 2.4.20), where the NFS server
would freeze the same way. That happened on NFS servers running on
i386 and on IBM pSeries (SLES 8), but not as often.

Now that I have turned on /proc/sys/sunrpc/rpc_debug and
/proc/sys/sunrpc/nfsd_debug, I could (of course) not reproduce the freeze
again by booting back and forth with the different versions.
Maybe it happens only after the server has run for some days (kind
of a pollution effect...?).

I can try to keep /proc/sys/sunrpc/rpc_debug and nfsd_debug enabled,
although they cause so many messages that the disk performance
of the server really goes down. Would those debugging messages help
at all if the freeze occurred again?
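
One way to limit the log volume might be to switch the debug flags on only
around the client reboots. A minimal Python sketch, assuming that both /proc
files accept an integer bitmask and that writing 0 turns logging off again;
the helper names and the mask value used below are illustrative only, not
something taken from the setup described above:

```python
from pathlib import Path

# The two debug knobs mentioned above; both live under /proc/sys/sunrpc.
DEBUG_FILES = ("rpc_debug", "nfsd_debug")


def set_sunrpc_debug(mask, base="/proc/sys/sunrpc"):
    """Write the same integer bitmask to each debug file (assumed: 0 = off)."""
    for name in DEBUG_FILES:
        (Path(base) / name).write_text(f"{mask}\n")


def get_sunrpc_debug(base="/proc/sys/sunrpc"):
    """Return the current raw content of each debug file, whitespace stripped."""
    return {name: (Path(base) / name).read_text().strip() for name in DEBUG_FILES}
```

Run as root, set_sunrpc_debug(65535) before rebooting the clients and
set_sunrpc_debug(0) afterwards would keep the flood of messages confined to
the interesting time window. The value 65535 is an assumption meaning "all
flags"; the base argument exists only so the helper can be tried against an
ordinary directory first.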

Or is there something else I can try? Has anyone seen something similar?

cu,
Frank

--
Dipl.-Inform. Frank Steiner Web: http://www.bio.ifi.lmu.de/~steiner/
Lehrstuhl f. Bioinformatik Mail: http://www.bio.ifi.lmu.de/~steiner/m/
LMU, Amalienstr. 17 Phone: +49 89 2180-4049
80333 Muenchen, Germany Fax: +49 89 2180-99-4049
* You can only understand recursion once you have understood recursion. *


-------------------------------------------------------
This SF.Net email is sponsored by BEA Weblogic Workshop
FREE Java Enterprise J2EE developer tools!
Get your free copy of BEA WebLogic Workshop 8.1 today.
http://ads.osdn.com/?ad_id=5047&alloc_id=10808&op=click
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2004-09-08 11:25:48

by Frank Steiner

Subject: We got a logfile!

Hi,

while the proc-debugging was still on, we rebooted a few more clients
from the old to the new kernel version and the server froze again.
Maybe the logs (if the server was able to write them far enough before
freezing) allow someone to debug the problem?

I've put them here:
http://www.bio.ifi.lmu.de/~steiner/nfsserver-freeze-log-messages.txt.bz2
http://www.bio.ifi.lmu.de/~steiner/nfsserver-freeze-procdebug.txt.bz2

The first of the clients booted at 10:39; the last one had been turned off
overnight (previously running an old kernel) and came up at 10:49. The file
nfsserver-freeze-log-messages.txt contains the lines from /var/log/messages
between 10:39 and 10:50 (the crash time); nfsserver-freeze-procdebug.txt
contains the kernel messages from the same time span.
The file with the kernel messages has 1.5 million lines *urgh*, but
maybe some expert just needs to take a look at the last few entries
to see the problem...?
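
For anyone who only wants those last entries without unpacking all
1.5 million lines, a minimal Python sketch follows. It assumes the file has
been downloaded locally as nfsserver-freeze-procdebug.txt.bz2; the function
name tail_bz2 and the default of 200 lines are arbitrary choices for
illustration:

```python
import bz2
from collections import deque


def tail_bz2(path, n=200):
    """Return the last n lines of a bzip2-compressed text file.

    The file is streamed line by line; the bounded deque keeps only the
    most recent n lines in memory, so the full log is never held at once.
    """
    with bz2.open(path, "rt", errors="replace") as fh:
        return list(deque(fh, maxlen=n))
```

Something like print("".join(tail_bz2("nfsserver-freeze-procdebug.txt.bz2")))
would then show what the server managed to log just before the freeze.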

cu,
Frank





2004-09-08 07:54:05

by Frank Steiner

Subject: Re: server freeze with 2.6.8.1

Just to add: it would already help if someone could tell me
which debugging options I should turn on on the server, so that the next
freeze gives enough output for someone here to analyze
the error. Are the /proc settings I enabled sufficient?

cu,
Frank


