2002-02-12 15:57:21

by Charlie Wilkinson

Subject: Hard lock-ups on RH7.2 install - Via Chipset?

Greetings fellow bit jockeys,
This has been driving me nuts for over a week now. All discussion
found and solutions tried so far have proven fruitless. If someone
could point me at a fix or offer any insights, I would be most thrilled.
I've read about some Athlon/Via related problems, so I'm hoping it fits
in with that somehow.

The box is an AMD 1.3GHz Athlon with a "bcm Advanced Research" BC133KT-100
motherboard (Via KT133/VT8363/686B), two Promise Ultra100Tx2 cards, and
an IBM 75GB drive on each IDE channel (four drives in all). The graphics
card is an Nvidia TNT2 AGP, but I'm thinking that doesn't matter too
much as the problem occurs just fine in character mode with no activity
on the screen. I've yanked out network cards, disabled unused ports,
picked conservative BIOS settings, but to no avail.

The problem first occurred when I tried to do a RH7.2 install. I set
each drive up identically, creating a software RAID5 container across all
four drives. The box consistently freezes solid either while creating
the ext3 filesystem on RAID5, or in the early phases of the .rpm march.
(Note that means concurrent load on all four drives...)

Numerous things tried... Finally booted into rescue mode (starting with
the latest RH7.2 updated boot image, FWIW) and tried running concurrent
dd's out to the drives in various combinations, as in:

(dd if=/dev/zero of=/dev/hde2 &) ; (dd if=/dev/zero of=/dev/hdg2 &) ; etc...

What I found was that writing out to any two drives was fine. Writing to
all four will consistently lock up the machine after about 5-10 seconds.
So it seems load related. (No, I didn't try three drives.)
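
A minimal sketch of the concurrent-write test described above, with a size
cap so it can be tried on scratch files first. The helper name
concurrent_write and the megabyte count are assumptions for illustration,
not part of the original test, which ran uncapped dd's straight to the
partitions.

```shell
#!/bin/sh
# Sketch: write zeros to several targets at once, then wait for every writer.
# WARNING: pointing this at a raw partition destroys its contents.
# concurrent_write is a hypothetical helper; the first argument is an
# assumed cap in megabytes (the original dd runs had no cap).
concurrent_write() {
    count_mb=$1
    shift
    pids=""
    for target in "$@"; do
        # One background dd per target, so all targets are loaded concurrently.
        dd if=/dev/zero of="$target" bs=1048576 count="$count_mb" 2>/dev/null &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        # Report failure if any single writer failed.
        wait "$pid" || status=1
    done
    return $status
}

# Example (destructive; partition names as given in the post):
#   concurrent_write 1024 /dev/hde2 /dev/hdg2
```

Passing two targets versus four reproduces the "any two drives is fine, all
four locks up" comparison; on the failing box the function would simply never
return.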

Any clues? Any fixes? Pretty please? :)

-cw-


2002-02-21 15:58:14

by Charlie Wilkinson

Subject: Re: Hard lock-ups on RH7.2 install - Via Chipset?

On Tue, Feb 12, 2002 at 05:27:39PM +0000, Alan Cox waxed eloquent:
>
> > What I found was that writing out to any two drives was fine. Writing to
> > all four will consistently lock up the machine after about 5-10 seconds.
> > So it seems load related. (No, I didn't try three drives.)
[...]
> Its just another identical report of VIA + high PCI load hanging. It might
> be the promise drivers it might be the chipset. However people have run the
> same set up on intel boards without seeing this kind of problem so its
> not clear.
>
> 2.4.18pre9-ac has the newer ide layer, but Im dubious that will help

I can confirm that it still locks up. :/ What can I do to help?
Anyone I should beat on, or send beer and pizza to?

-cw-

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Charlie Wilkinson - [email protected] - N3HAZ
Parental Unit, UNIX Admin, Homebrewer, Cat Lover, Spam Fighter, HAM, SWLer...
Visit the Radio For Peace International Website: http://www.rfpi.org/
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CLOBBER INTERNET SPAM: See!! <http://spam.abuse.net/>
Join!! <http://www.cauce.org/>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
QOTD:
The 50-50-90 rule: Anytime you have a 50-50 chance of getting something
right, there's a 90% probability you'll get it wrong.

2002-02-21 16:19:36

by Alan

Subject: Re: Hard lock-ups on RH7.2 install - Via Chipset?

> I can confirm that it still locks up. :/ What can I do to help?

I'm assuming its a hardware issue. It works on non VIA for multiple people
it fails on VIA for multiple people

2002-02-21 17:50:26

by Charlie Wilkinson

Subject: Re: Hard lock-ups on RH7.2 install - Via Chipset?

On Thu, Feb 21, 2002 at 11:07:11AM -0500, Mark Hahn waxed eloquent:
[...]
> > Hi Mark,
> > Yeah, Alan suggested his latest pre 2.4.18 kernel *might* work. Tried it,
> > still no joy. :/
>
> but HOW?

How did I try it, or how no joy? ;)

On the former, I didn't run any really exhaustive tests, and Alan
didn't suggest using or avoiding certain options. I built a relatively
conservative kernel and then beat on all four drives with concurrent dd's.
I also did an hdparm -tT. hdparm killed the box in a matter of a second
or two. dd took about 30 seconds. It seems safe to assume that hdparm
is able to create a higher load.

On the latter, the box just freezes up solid. No magic SysRq, no nothing.
A very frustrating state to try and troubleshoot. Any suggestions?

-cw-

2002-02-21 18:03:17

by Charlie Wilkinson

Subject: Re: Hard lock-ups on RH7.2 install - Via Chipset?

On Thu, Feb 21, 2002 at 04:33:23PM +0000, Alan Cox waxed eloquent:
>
> > I can confirm that it still locks up. :/ What can I do to help?
>
> I'm assuming its a hardware issue. It works on non VIA for multiple people
> it fails on VIA for multiple people

Yeah, it appears to be specific to the KT133 (not 233), as Mark indicated.
If I recall, I've seen reports that this has happened with the KT133A
as well, though I wonder if they are all related specifically to PCI
load, or might it be the old Athlon optimization problem in some cases?
(That was a separate and distinct issue, generally unrelated to this
problem, yes?)

And then there was something about incorrect chipset register settings
from the BIOS...

I'm getting confused.... 8-o

One thing I noticed, and it may mean nothing: during my load tests
(concurrent dd's from /dev/zero to each drive) the drive access lights
were not always on solid. Occasionally the lights went out for all drives
for a small fraction of a second. I was wondering if this might possibly
work back to some kind of timeout issue. And more importantly, is it
possible to crank up debugging messages in the kernel and watch for that
sort of thing? Is there any point?

Thanks for any tips.

-cw-

2002-02-22 20:15:14

by Michael B Allen

Subject: Re: Hard lock-ups on RH7.2 install - Via Chipset?

> On Thu, Feb 21, 2002 at 11:07:11AM -0500, Mark Hahn waxed eloquent:
> [...]
> > > Hi Mark,
> > > Yeah, Alan suggested his latest pre 2.4.18 kernel *might* work. Tried it,
> > > still no joy. :/
> >
> > but HOW?
>
> How did I try it, or how no joy? ;)
>
> On the former, I didn't run any really exhaustive tests, and Alan
> didn't suggest using or avoiding certain options. I built a relatively
> conservative kernel and then beat on all four drives with concurrent dd's.
> I also did an hdparm -tT. hdparm killed the box in a matter of a second
> or two. dd took about 30 seconds. It seems safe to assume that hdparm
> is able to create a higher load.

I have a VIA KT133 board, no RAID, and I'm happily running RH 7.2 with
their stock kernel. I previously had issues that turned out to be a bad
30G IBM DeathStar, but I upgraded the BIOS and did a little torture testing
out of paranoia (copying ~150MB files across two drives; not the bad IBM
one, they replaced it) and I have never had a problem since.

[root@nano root]# hdparm -tT /dev/sda
Timing buffer-cache reads: 128 MB in 1.12 seconds =114.29 MB/sec
Timing buffered disk reads: 64 MB in 5.11 seconds = 12.52 MB/sec

processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 4
model name : AMD Athlon(tm) processor
stepping : 2
cpu MHz : 900.051
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips : 1795.68

--
May The Source be with you.

2002-04-18 20:30:11

by Charlie Wilkinson

Subject: Re: Hard lock-ups on RH7.2 install - Via Chipset?

On Thu, Feb 21, 2002 at 04:33:23PM +0000, Alan Cox waxed eloquent:
>
> > I can confirm that it still locks up. :/ What can I do to help?
>
> I'm assuming its a hardware issue. It works on non VIA for multiple people
> it fails on VIA for multiple people

I think I found a solution. At the very least, I've found something
that drastically affects reliability of this hardware combo.

The combo in question is a KT133 chipset (Phoenix BIOS), Athlon 1.3GHz,
2 Promise Ultra100 IDE controllers with an IBM 75GB drive on each channel
(4 drives). Doing anything that beat on all 4 drives sufficiently
(such as software RAID5) would hang the system hard.

The magic settings that had a drastic impact on reliability were the PCI
device latency timers. The early settings I tried just changed how long
the system would run before it crashed (in some cases making things *much*
worse). Then after more of something one could loosely term "research", I
hit on some settings that seem to have resulted in a fully stable system!

Forthwith and to wit:

setpci -v -d *:* latency_timer=b0
setpci -v -d 105a:* latency_timer=ff

Yes, that's a baseline setting of 176 for everything, then max settings
for the two Promise cards. Rather drastic? Perhaps, but it works.
More research and tweaking is probably in order.
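
For reference, setpci takes the latency_timer value in hex, and the register
counts PCI bus clocks. A small sketch of the conversion, where hex_to_clocks
is a hypothetical helper name; the read-back commands in the comments are
assumed to be run as root on the affected box:

```shell
#!/bin/sh
# Sketch: convert a hex latency_timer value (as passed to setpci) into
# decimal PCI bus clocks. hex_to_clocks is a hypothetical helper name.
hex_to_clocks() {
    printf '%d\n' "0x$1"
}

hex_to_clocks b0   # baseline for all devices: 176
hex_to_clocks ff   # maximum, used for the Promise cards (vendor 105a): 255

# To read the timers back after setting them (root needed, not run here):
#   setpci -v -d '105a:*' latency_timer
#   lspci -v | grep -i latency
# Quoting the '*' patterns keeps the shell from expanding them as globs.
```

Some devices may not implement every bit of the register, so a value read
back could differ slightly from the one written.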

I wanted to get the news out -- even if a bit premature -- in hopes that
it might relieve someone else's grief. It's really sucked having all
this hardware for three months and not being able to put it to good use
(unless crash testing counts...)

-cw-