2005-06-09 18:49:15

by Roger Heflin

Subject: NFS crash on Suse, any ideas on which normal nfs patch could be a cause/fix?

Hello,

I have an NFS-related crash on Suse. I do have a support call in with Suse;
I really only want to know whether any patch was added or removed to fix
an issue similar to this. We are using 2.6.5-7.139, which is of course
a Suse-patched kernel, and I do not know exactly what it was patched
with. We are running NFSv3 over UDP with a 32k block size.
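
For anyone wanting to confirm the same client settings, here is a minimal
sketch that reads the NFS mount options back out of /proc/mounts text. The
function name nfs_mount_options and the sample line are illustrative only
(they are not taken from the affected machines), and the sketch assumes the
usual /proc/mounts layout of device, mount point, filesystem type, options.

```python
def nfs_mount_options(mounts_text):
    """Return {mount_point: {option: value}} for NFS entries in /proc/mounts text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected layout: device mountpoint fstype options dump pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = {}
        for item in fields[3].split(","):
            key, _, value = item.partition("=")
            # Flags without "=value" (such as "udp") are stored as True.
            opts[key] = value or True
        result[fields[1]] = opts
    return result


# Hypothetical sample line resembling the setup described (NFSv3, UDP, 32k).
sample = "server:/export /mnt/data nfs rw,v3,rsize=32768,wsize=32768,hard,udp 0 0\n"
print(nfs_mount_options(sample))
```

On a live client the same function can be fed open("/proc/mounts").read()
instead of the sample string.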

NFS did survive several large bonnie runs against the same NFS servers, with
no symptoms similar to these. The server also seems fine when the event
happens; it appears to be a client-only problem, since other clients can
reach the server being used just fine.

The description is below. I have been looking through the recent NFS
patches and have come up with the following one that looks possible, but
I have looked at the code that it patches and don't see signs of the
code it deletes being in the Suse kernel, so I don't think that is it.



<http://client.linux-nfs.org/Linux-2.6.x/2.6.7-rc3/linux-2.6.7-01-write_hang.dif>
linux-2.6.7-01-write_hang.dif:


NFS: remove the WRITEPAGE_ACTIVATE hack. It causes crashes.





Customer untars an 8GB file located on an NFS share back onto the same
NFS share. They have done this on at least 4 separate machines (with slightly
different configurations), with slightly different results on each machine.
On 1 of the 4 machines it succeeded; on the other 3 it failed. Of those 3,
1 machine locked up and required power removal, 1 machine gave us a kernel
panic, and 1 machine put the tar process into a permanent unkillable "D"
state. The customer has also experienced similar issues doing NFS
reads/writes with applications other than tar: similar results were
obtained with a user test program and with 3rd-party application software.
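
A rough sketch of how stuck processes like that tar could be spotted from a
script, assuming a Linux /proc filesystem. The helper names state_from_stat
and uninterruptible_pids are made up for illustration; the only thing relied
on is that the process state is the first field after the parenthesised
command name in /proc/<pid>/stat.

```python
import os


def state_from_stat(stat_line):
    """Extract the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the last closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def uninterruptible_pids(proc_root="/proc"):
    """Return the PIDs currently in state D (uninterruptible sleep)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                if state_from_stat(handle.read()) == "D":
                    pids.append(int(entry))
        except OSError:
            # The process exited between listdir() and open().
            continue
    return pids
```

Calling uninterruptible_pids() a few times some seconds apart separates
processes that are permanently stuck from ones that are only briefly waiting
on I/O.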

The kernel crash message was:

CPU 0: Machine Check Exception: 4 Bank 4: f60da00100000813
RIP !INEXACT! 10:<ffffffff8022fae2> {copy_user_generic_c+0x8/0x26}
TSC 3d101af6bf756 ADDR 4304010
Kernel panic: Machine check

So far we have only seen this on one of the machines.
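
As a reading aid for the status word in the panic message, here is a minimal
sketch that splits it into the generic x86 machine-check status flags and the
low 16-bit MCA error code. It covers only the architectural bits; what bank 4
is, and what the model-specific bits in between mean, depends on the processor
and is not decoded here. The names STATUS_FLAGS and decode_mci_status are
illustrative.

```python
# Architectural flag bits in the upper part of an x86 MCi_STATUS register.
STATUS_FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MCi_MISC holds extra information
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flag bits and the MCA error code."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    decoded["mca_error_code"] = status & 0xFFFF  # bits 15:0
    return decoded


status = 0xF60DA00100000813  # value from the panic message above
for field, value in decode_mci_status(status).items():
    print(field, hex(value) if field == "mca_error_code" else value)
```

For this value the UC, ADDRV and PCC bits come out set and the error code is
0x813; interpreting that code further needs the processor vendor's
documentation.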

Any ideas on what is happening? All machines are Opteron machines; some
are dual-processor machines with one motherboard, and the others
are quad-processor models with a different motherboard. So it does look
like a software issue, since there is some variation in hardware. It also
does not look like broken hardware, as the machines don't fail on anything
but these NFS tests.

The same commands appear to work on local filesystems.



2005-06-10 07:56:43

by Olaf Kirch

Subject: Re: NFS crash on Suse, any ideas on which normal nfs patch could be a cause/fix?

On Thu, Jun 09, 2005 at 01:50:13PM -0500, Roger Heflin wrote:
> The kernel crash message was:
>
> CPU 0: Machine Check Exception: 4 Bank 4: f60da00100000813
> RIP !INEXACT! 10:<ffffffff8022fae2> {copy_user_generic_c+0x8/0x26}
> TSC 3d101af6bf756 ADDR 4304010
> Kernel panic: Machine check

This is not an NFS bug, it's a machine check exception. Your machines
have bad RAM it seems.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax


-------------------------------------------------------
This SF.Net email is sponsored by: NEC IT Guy Games. How far can you shotput
a projector? How fast can you ride your desk chair down the office luge track?
If you want to score the big prize, get to know the little guy.
Play to win an NEC 61" plasma display: http://www.necitguy.com/?r=20
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2005-06-10 13:14:08

by Roger Heflin

Subject: RE: NFS crash on Suse, any ideas on which normal nfs patch could be a cause/fix?


The other three don't do this, so there may just be something
else wrong with this machine, and this error may
not come from this problem, though it is odd given that the machine
does not exhibit this message with any other tests.

We know what bad RAM/bad CPU looks like. The machines all have ECC
and are getting no ECC errors, and they pass several days of running
an intensive memory/CPU program with no ECC errors and no crashes outside
of using NFS.

We also have never seen an MCE with an actual kernel routine listed
off to the side like that, and we have seen 100+ MCEs of
other sorts and have fixed them by replacing RAM or CPUs.

Roger


2005-06-10 20:52:08

by Roger Heflin

Subject: RE: NFS crash on Suse, any ideas on which normal nfs patch could be a cause/fix?

We have looked at this more carefully, and the customer has
given me more information and details. It looks like NFS somehow
exercises a network bug and causes the network driver
to crash. The machine/OS has survived a full kickstart
(a several-hundred-MB install run by 100-plus identical
machines, without exercising the bug) and quite a
bit of light NFS usage, but heavier NFS usage, including
bonnie, will make the machines fall over fairly quickly.

We are checking what sort of issues there could be with
the network driver/network firmware/BMC card.

For information, this is a Quad Tyan 4882 with 32GB of RAM
(16 of them have 64GB of RAM), and the customer has over 100
of the Quads.

There are also a large number of duals, but after further
examination we don't believe those have the issue.

Roger


2005-06-14 11:06:27

by Olaf Kirch

Subject: Re: NFS crash on Suse, any ideas on which normal nfs patch could be a cause/fix?

Andi Kleen mentioned that this may be a BIOS or driver problem.
Drivers accessing incorrect memory mappings may cause a watchdog to
trigger, which will elicit an MCE.

The address is just over 2GB; the BIOS may have created the
memory hole there. Can you disable memory remapping in the
BIOS?

Olaf



--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

