2005-12-02 04:58:54

by Aaron Lehmann

Subject: Promise SATA oops

I'm running 2.6.14.2 on an x86_64 (Athlon X2, i.e. SMP) with a Promise
TX4 SATAII 150 controller. The night I set up the machine, I got a
Promise-related oops (null pointer dereference IIRC), but was foolish
enough not to write it down. Since then, the machine has been
unstable, and I've suspected the same thing is recurring, but since I
use X it's very difficult to actually get at the oops. I ended up
setting up a ramdisk with a static busybox that I could use to poke
around if anything interesting happened. Just now everything using the
filesystem went into D-state, so I checked dmesg and saw uncorrectable
errors being reported on /dev/sdd. The system froze completely within
a minute. When I rebooted, I got the oops at the end of this message.
I was only able to copy the portion that fit on the screen. A second
reboot was successful. My RAID5 arrays are resyncing now, and I expect
that to complete normally because I've had to go through a lot of
resyncs since I set this system up and they were all successful. Once
that's done, I guess I'll run badblocks on sdd and see if anything
turns up. It would be a shame if that drive is bad, considering that
my 4 hard drives are brand new ones to replace a failed array I had
lots of problems with.
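
One way to see which tasks are stuck in D-state (uninterruptible sleep) is to read the state field of each /proc/<pid>/stat file. The following is a minimal sketch, not something used in this thread: it assumes Python is available, which was not the case on the busybox ramdisk described above, and the helper names parse_stat and d_state_tasks are made up for illustration.

```python
import glob


def parse_stat(line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' rather than on whitespace.
    head, _, tail = line.rpartition(")")
    pid, _, comm = head.partition(" (")
    return int(pid), comm, tail.split()[0]


def d_state_tasks():
    """List (pid, comm) for every task currently in state 'D'."""
    tasks = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the task exited (or was unreadable) while scanning
        if state == "D":
            tasks.append((pid, comm))
    return sorted(tasks)


if __name__ == "__main__":
    for pid, comm in d_state_tasks():
        print(pid, comm)
```

If reads from /proc themselves block, as the later "Bus error" from dmesg suggests can happen once storage is wedged, a scan like this may hang as well.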

Process scsi_eh_3 (pid: 25, threadinfo fff81001fbc0000, task ffff81001fbbcf40)
Stack: ffffffff80274291 ffff81001fc0f800 ffff81001fb2a340 ffff81001fe78000
ffffffff8026d524 ffff81001fe78948 ffff81001fb2a340 ffff81001fe78428
ffffffff80280006 ffff81001fe78428
Call Trace:<ffffffff80274291>{scsi_device_unbusy+33} <ffffffff8026d524>{scsi_finish_command+36}
       <ffffffff80280006>{ata_scsi_qc_complete+54} <ffffffff8027c32e>{ata_qc_complete+366}
       <ffffffff80281764>{pdc_eng_timeout+212} <ffffffff802716f0>{scsi_error_handler+0}
       <ffffffff8027fae5>{ata_scsi_error+21} <ffffffff80271790>{scsi_error_handler+160}
       <ffffffff80149430>{keventd_create_kthread+0} <ffffffff802716f0>{scsi_error_handler+0}
       <ffffffff80149430>{keventd_create_kthread+0} <ffffffff801496a9>{kthread+217}
       <ffffffff80130260>{schedule_tail+64} <ffffffff8010e746>{child_rip+8}
       <ffffffff80149430>{keventd_create_kthread+0} <ffffffff801495d0>{kthread+0}
       <ffffffff8010e73e>{child_rip+0}

Code: 80 3f 00 7e f9 e9 2e fe ff ff f3 90 80 3f 00 7e f9 e9 30 fe
console shuts up ...
<0>Kernel panic - not syncing: Aiee, killing interrupt handler!
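
A trace copied by hand from an 80-column console tends to have symbol names split across lines, which makes the call chain hard to read. The sketch below is illustrative only: the regular expression and the parse_frames helper are assumptions based on the <address>{symbol+offset} format visible in the trace, and the offset is assumed to be decimal. It rejoins the wrapped lines and lists the frames in order, using the first frames of the trace as sample input.

```python
import re

# One frame looks like <ffffffff80274291>{scsi_device_unbusy+33}.
FRAME_RE = re.compile(r"<([0-9a-f]{16})>\{([A-Za-z_][A-Za-z0-9_]*)\+(\d+)\}")

# First frames of the trace, wrapped the way an 80-column console wraps them.
SAMPLE = """\
Call Trace:<ffffffff80274291>{scsi_device_unbusy+33} <ffffffff8026d524>{scsi_fin
ish_command+36}
<ffffffff80280006>{ata_scsi_qc_complete+54} <ffffffff8027c32e>{ata_qc_com
plete+366}
<ffffffff80281764>{pdc_eng_timeout+212} <ffffffff802716f0>{scsi_error_han
dler+0}
"""


def parse_frames(trace):
    """Return a list of (symbol, offset) pairs in the order they appear."""
    # Join lines without a separator so names split by wrapping are rejoined.
    joined = "".join(line.strip() for line in trace.splitlines())
    # Offsets are assumed to be decimal in this oops format.
    return [(sym, int(off)) for _addr, sym, off in FRAME_RE.findall(joined)]


if __name__ == "__main__":
    for sym, off in parse_frames(SAMPLE):
        print(f"{sym}+{off}")
```

Read innermost first, these frames suggest the fault was reached from the Promise timeout handler: pdc_eng_timeout called ata_qc_complete, which went through ata_scsi_qc_complete and scsi_finish_command into scsi_device_unbusy. Traces of this kind are built by scanning the stack, so individual entries can be stale.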


2005-12-02 05:29:12

by Jeff Garzik

Subject: Re: Promise SATA oops

Aaron Lehmann wrote:
> I'm running 2.6.14.2 on an x86_64 (Athlon X2, i.e. SMP) with a Promise
> TX4 SATAII 150 controller. The night I set up the machine, I got a
> Promise-related oops (null pointer dereference IIRC), but was foolish
> enough not to write it down. Since then, the machine has been
> unstable, and I've suspected the same thing is recurring, but since I
> use X it's very difficult to actually get at the oops. I ended up
> setting up a ramdisk with a static busybox that I could use to poke
> around if anything interesting happened. Just now everything using the
> filesystem went into D-state, so I checked dmesg and saw uncorrectable
> errors being reported on /dev/sdd. The system froze completely within
> a minute. When I rebooted, I got the oops at the end of this message.
> I was only able to copy the portion that fit on the screen. A second
> reboot was successful. My RAID5 arrays are resyncing now, and I expect
> that to complete normally because I've had to go through a lot of
> resyncs since I set this system up and they were all successful. Once
> that's done, I guess I'll run badblocks on sdd and see if anything
> turns up. It would be a shame if that drive is bad, considering that
> my 4 hard drives are brand new ones to replace a failed array I had
> lots of problems with.

This should be fixed in 2.6.15-rcX...

Jeff



2005-12-02 19:51:12

by Aaron Lehmann

Subject: Re: Promise SATA oops

On Fri, Dec 02, 2005 at 12:29:01AM -0500, Jeff Garzik wrote:
> This should be fixed in 2.6.15-rcX...

Still isn't stable. It froze within hours after announcing in all
terminals that it was disabling a certain IRQ. Now the RAID is so
degraded that root can't even be mounted. Was the Promise controller a
bad choice for a reliable setup?

I may not have time to look at this further until late next week, but
I'll follow up with whatever I learn.

2005-12-03 10:10:04

by Erik Slagter

Subject: Re: Promise SATA oops

On Fri, 2005-12-02 at 11:51 -0800, Aaron Lehmann wrote:
> On Fri, Dec 02, 2005 at 12:29:01AM -0500, Jeff Garzik wrote:
> > This should be fixed in 2.6.15-rcX...
>
> Still isn't stable. It froze within hours after announcing in all
> terminals that it was disabling a certain IRQ. Now the RAID is so
> degraded that root can't even be mounted. Was the Promise controller a
> bad choice for a reliable setup?
>
> I may not have time to look at this further until late next week, but
> I'll follow up with whatever I learn.

This looks very similar to the problem I had. It vanished completely when
I exchanged the power supply for a high-end one.



2005-12-20 20:17:22

by Aaron Lehmann

Subject: Re: Promise SATA oops

On Fri, Dec 02, 2005 at 11:51:09AM -0800, Aaron Lehmann wrote:
> Still isn't stable. It froze within hours after announcing in all
> terminals that it was disabling a certain IRQ. Now the RAID is so
> degraded that root can't even be mounted. Was the Promise controller a
> bad choice for a reliable setup?
>
> I may not have time to look at this further until late next week, but
> I'll follow up with whatever I learn.

Argh, died again!! It had been stable for over 12 days. Same error
message, and the root md is degraded and dirty just like last time.
This is a very severe state with high risk of data loss. When things
went sour, terminals and most applications still kept working, but
anything that touched the filesystem froze up. I had a shell open in a
chroot on a ramdisk, but dmesg just hung for a few minutes and then
exited with a "Bus error". I had no other way of examining the kernel
log since the machine runs X.

This was running 2.6.15-rc4. Crashes seem to happen less frequently
with it than with 2.6.14.x, but when they happen they leave the RAID
in a severe state. I also don't think 2.6.14.2 said anything about
disabling the IRQ.

I'm very desperate now. About every week I experience a crash that
damages my RAID array to the point where it can't boot, as if the
instability wasn't bad enough. Do I need to buy a hardware RAID card?

2006-02-21 04:21:31

by Aaron Lehmann

Subject: Re: Promise SATA oops

This crash kept happening for months, across all versions of the
kernel I tried (up through early 2.6.16 git snapshots). I ended up
buying a different SATA card, and this seems to have fixed the
problem. At about the same frequency as the nasty hangs used to occur,
I'm now seeing this error message instead:

ata1: command 0xea timeout, stat 0x51 host_stat 0x0
ata1: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x04 { DriveStatusError }

...but the system continues running fine. This leads me to believe
that there's a bug in the Promise SATA driver that prevents it from
gracefully handling this error condition, whatever it is. The hard
drives are model WDC WD3200JD-60K, and I couldn't find any bad blocks
on them.
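
For readers decoding the log lines above, here is a minimal sketch of how stat 0x51 and err 0x04 break down, assuming the standard ATA status and error register bit layout. The tables and the decode helper are illustrative and are not libata's code. For context, opcode 0xea is FLUSH CACHE EXT in the ATA command set, and SCSI sense key 0xb is ABORTED COMMAND.

```python
# ATA status register bits (standard names; the kernel log spells some out,
# e.g. DRDY as "DriveReady" and DSC as "SeekComplete").
ATA_STATUS_BITS = {
    0x80: "BSY",
    0x40: "DRDY",
    0x20: "DF",
    0x10: "DSC",
    0x08: "DRQ",
    0x04: "CORR",
    0x02: "IDX",
    0x01: "ERR",
}

# ATA error register bits (ABRT is what the log prints as "DriveStatusError").
ATA_ERROR_BITS = {
    0x80: "ICRC",
    0x40: "UNC",
    0x20: "MC",
    0x10: "IDNF",
    0x08: "MCR",
    0x04: "ABRT",
    0x02: "TK0NF",
    0x01: "AMNF",
}


def decode(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if value & bit]


if __name__ == "__main__":
    print("status 0x51:", decode(0x51, ATA_STATUS_BITS))
    print("error  0x04:", decode(0x04, ATA_ERROR_BITS))
```

Under that reading, the drive reported itself ready but aborted a cache-flush command that had already timed out on the host side. Whether the drive, the cabling, or the power supply is responsible cannot be told from these four lines alone.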


On Tue, Dec 20, 2005 at 12:17:19PM -0800, Aaron Lehmann wrote:
> On Fri, Dec 02, 2005 at 11:51:09AM -0800, Aaron Lehmann wrote:
> > Still isn't stable. It froze within hours after announcing in all
> > terminals that it was disabling a certain IRQ. Now the RAID is so
> > degraded that root can't even be mounted. Was the Promise controller a
> > bad choice for a reliable setup?
> >
> > I may not have time to look at this further until late next week, but
> > I'll follow up with whatever I learn.
>
> Argh, died again!! It had been stable for over 12 days. Same error
> message, and the root md is degraded and dirty just like last time.
> This is a very severe state with high risk of data loss. When things
> went sour, terminals and most applications still kept working, but
> anything that touched the filesystem froze up. I had a shell open in a
> chroot on a ramdisk, but dmesg just hung for a few minutes and then
> exited with a "Bus error". I had no other way of examining the kernel
> log since the machine runs X.
>
> This was running 2.6.15-rc4. Crashes seem to happen less frequently
> with it than with 2.6.14.x, but when they happen they leave the RAID
> in a severe state. I also don't think 2.6.14.2 said anything about
> disabling the IRQ.
>
> I'm very desperate now. About every week I experience a crash that
> damages my RAID array to the point where it can't boot, as if the
> instability wasn't bad enough. Do I need to buy a hardware RAID card?