From: "Roger Heflin"
Subject: NFS crash on Suse, any ideas on which normal nfs patch could be a cause/fix?
Date: Thu, 9 Jun 2005 13:50:13 -0500
List-Id: Discussion of NFS under Linux development, interoperability, and testing.

Hello,

I have an NFS-related crash on Suse. I do have a support call in with Suse; I really only want to know whether any patch was added or removed to fix an issue similar to this one. We are using 2.6.5-7.139, which is of course a Suse-patched kernel, and I do not know exactly what it was patched with. We are running nfsv3 with udp and a 32k block size.

NFS did survive several large bonnie runs to the same nfs servers, with no signs similar to these. Also, the server seems fine when the event happens; it seems to be a client-only problem, since other clients can get to the server being used just fine.

Below is the description. I have been looking through the recent nfs patches and have come up with the following one that looks possible, but I have looked at the code it patches and don't see signs of the code it deletes being in the Suse kernel, so I don't think that is it.

linux-2.6.7-01-write_hang.dif:

NFS: remove the WRITEPAGE_ACTIVATE hack. It causes crashes.

Customer untars an 8GB file located on an NFS share back onto the same NFS share. They have done this on at least 4 separate machines (slightly different configurations), with slightly different results on each machine. On 1 of the 4 machines it succeeded; on the other 3 it failed. Customer has also experienced similar issues doing nfs reads/writes with applications other than tar. Of the 3 machines that failed, 1 locked up and required power removal, 1 gave us a kernel panic, and 1 put the tar process into a permanent unkillable "D" state. Similar results were obtained with a user test program and a 3rd party application.

The kernel crash message was:

CPU 0: Machine Check Exception: 4 Bank 4: f60da00100000813
  RIP !INEXACT! 10:<ffffffff8022fae2> {copy_user_generic_c0x8/0x26}
  TSC 3d101af6bf756 ADDR 4304010
  Kernel panic: Machine check

We have so far only got this on one of the machines.

Any ideas on what is happening? All machines are Opteron machines; some are dual-processor machines with one motherboard, and the others are quad-processor models with a different motherboard. So it does look like a software issue, since there is some variation in hardware. It also does not look like broken hardware, as the machines don't fail on anything but these nfs tests.

The same commands appear to work on local filesystems.
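
In case it helps anyone reading the crash message, here is a rough sketch of how the status word after "Bank 4:" could be unpacked. It assumes the usual x86 machine-check status layout, where bits 63 down to 57 are flag bits and the low 16 bits hold an error code. The function name decode_mce_status is made up for illustration, and what bank 4 and the error code mean on a given processor is not decoded here; that needs the processor vendor's documentation.

```python
# Flag bits in the top byte of an x86 machine-check status register.
# Assumption: the value printed after "Bank 4:" follows this layout.
STATUS_FLAGS = {
    63: "VAL (register holds a valid error)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (error was not corrected by hardware)",
    60: "EN (reporting was enabled for this error)",
    59: "MISCV (MISC register is valid)",
    58: "ADDRV (ADDR register is valid)",
    57: "PCC (processor context may be corrupt)",
}


def decode_mce_status(status):
    """Hypothetical helper: split a 64-bit status word into set flags and error code."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {"flags": flags, "error_code": status & 0xFFFF}


# Status value taken from the crash message above.
decoded = decode_mce_status(0xF60DA00100000813)
for flag in decoded["flags"]:
    print(flag)
print("error code: 0x%04x" % decoded["error_code"])
```

The ADDRV flag, if set, would mean the ADDR value in the crash message is meaningful for that event.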
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs