2004-05-26 21:13:32

by Garrick Staples

Subject: 2.6.6 lockup

Hi all again,
After fixing up the failover issues, I got my pair of Itaniums into
production with 2.6.5, and as soon as the real-world load went up, the machines
started freezing. No net response, no console; only the sysrq keys work.

I updated to 2.6.6 and it doesn't freeze up as often, but it's still really
bad, at least a few times a day. Unfortunately, I can't seem to figure out how
to get a decent kernel trace. Apparently sysrq-Crash doesn't work on ia64.
And the NMI watchdog doesn't work on ia64. And I can't find any info on a hardware
watchdog on the mobo!

I do have some other info from sysrq on the cpu regs, memory, and processes if
anyone would find that interesting.

The machines freeze under heavy streaming writes from about 100 processes
on at least 60 clients. A combined load of about 80GB/hour is
enough to freeze up the machine pretty regularly.

I tried Trond's 2.6.6 patches at his website, but those broke things
considerably. Since I don't have any actual Oops messages, anyone have any
experimental deadlock-fixing patches they want me to test? :)

--
Garrick Staples, Linux/HPCC Administrator
University of Southern California



2004-05-26 21:37:37

by J. Bruce Fields

Subject: Re: 2.6.6 lockup

On Wed, May 26, 2004 at 02:11:16PM -0700, Garrick Staples wrote:
> I tried Trond's 2.6.6 patches at his website, but those broke things
> considerably. Since I don't have any actual Oops messages, anyone have any
> experimental deadlock-fixing patches they want me to test? :)

Note that Trond's patches are client-side only, so shouldn't affect your
servers one way or another (unless I'm misunderstanding your
setup).

--b.


-------------------------------------------------------
This SF.Net email is sponsored by: Oracle 10g
Get certified on the hottest thing ever to hit the market... Oracle 10g.
Take an Oracle 10g class now, and we'll give you the exam FREE.
http://ads.osdn.com/?ad_id=3149&alloc_id=8166&op=click
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2004-05-26 21:44:36

by Garrick Staples

Subject: Re: 2.6.6 lockup

On Wed, May 26, 2004 at 05:37:32PM -0400, J. Bruce Fields alleged:
> On Wed, May 26, 2004 at 02:11:16PM -0700, Garrick Staples wrote:
> > I tried Trond's 2.6.6 patches at his website, but those broke things
> > considerably. Since I don't have any actual Oops messages, anyone have any
> > experimental deadlock-fixing patches they want me to test? :)
>
> Note that Trond's patches are client-side only, so shouldn't affect your
> servers one way or another (unless I'm misunderstanding your
> setup). --b.

You didn't misunderstand... but I was at a complete loss with production
machines suddenly dropping like flies. I'm now at the stage where I just try
random things =P


--
Garrick Staples, Linux/HPCC Administrator
University of Southern California



2004-05-26 21:56:00

by J. Bruce Fields

Subject: Re: 2.6.6 lockup

On Wed, May 26, 2004 at 02:44:36PM -0700, Garrick Staples wrote:
> You didn't misunderstand... but I was at a complete loss with production
> machines suddenly dropping like flies. I'm now at the stage where I just try
> random things =P

OK. May as well send us what information you have on the lockups, and
maybe your .config while you're at it....

(Also, you mention 2.6.6 in your subject line but it sounds like this
happens to you under earlier kernels as well? Are there any kernels
that are OK?)

--Bruce Fields



2004-05-26 22:28:01

by Garrick Staples

Subject: Re: 2.6.6 lockup

On Wed, May 26, 2004 at 05:55:56PM -0400, J. Bruce Fields alleged:
> On Wed, May 26, 2004 at 02:44:36PM -0700, Garrick Staples wrote:
> > You didn't misunderstand... but I was at a complete loss with production
> > machines suddenly dropping like flies. I'm now at the stage where I just try
> > random things =P
>
> OK. May as well send us what information you have on the lockups, and
> maybe your .config while you're at it....

It's more info than I feel like posting to the list, but I've got it at:
http://www-rds.usc.edu/~garrick/nfsprobs/


> (Also, you mention 2.6.6 in your subject line but it sounds like this
> happens to you under earlier kernels as well? Are there any kernels
> that are OK?)

2.6.6 locks up less often than 2.6.5. I also had 2.6.3, but it had other SCSI
driver issues. 2.6.3 never saw this kind of load, but I can try it if you
want.

--
Garrick Staples, Linux/HPCC Administrator
University of Southern California



2004-05-27 04:39:01

by Garrick Staples

Subject: Re: 2.6.6 lockup

On Wed, May 26, 2004 at 05:55:56PM -0400, J. Bruce Fields alleged:
> On Wed, May 26, 2004 at 02:44:36PM -0700, Garrick Staples wrote:
> > You didn't misunderstand... but I was at a complete loss with production
> > machines suddenly dropping like flies. I'm now at the stage where I just try
> > random things =P
>
> OK. May as well send us what information you have on the lockups, and
> maybe your .config while you're at it....

Filesystem corruption from the repeated lockups finally showed up today. So I
managed to get those two machines out of production for now. I'll be able to
figure out how to trigger the problem and hopefully get you some better info
tomorrow.

Btw, the failover capabilities of 2.6 have been very well tested the last few
days :) Nearly 15TB of data was swapped back and forth during heavy writes.
Good job guys on that!

Btw, reiserfs+lvm2 is very resilient!

(Does anyone have advice on triggering a kernel trace on ia64?)

--
Garrick Staples, Linux/HPCC Administrator
University of Southern California

