2005-03-21 00:18:53

by Neil Whelchel

Subject: SATA Promise TX4 Crash

Hello,
I have two Promise SATA TX4 cards connected to a total of 6 Maxtor 250 GB
drives (7Y250M0) configured into a RAID 5. All works well with small
disk load, but when a large number of requests are issued, it causes a crash
similar to the attached, except that the errors before the crash are on a
different drive nearly every time. I have tried several different
motherboards with both Nvidia and Via chipsets, with Athlon and K6
CPUs, and the crash remains the same. I have also seen the same crash
with both a preemptable and a non-preemptable kernel, with kernel
versions 2.6.9, 2.6.10, 2.6.11, and 2.6.11.2 (this one).
Any suggestions, or is this a bug?


ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x40 { UncorrectableError }
ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x40 { UncorrectableError }
ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x40 { UncorrectableError }
ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x40 { UncorrectableError }
ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x40 { UncorrectableError }
ata3: command timeout
Assertion failed! qc->flags & ATA_QCFLAG_ACTIVE,drivers/scsi/libata-core.c,ata_qc_complete,line=2807
ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: called with no error (51)!
------------[ cut here ]------------
kernel BUG at drivers/scsi/scsi.c:299!
invalid operand: 0000 [#1]
PREEMPT
Modules linked in:
CPU: 0
EIP: 0060:[<c02a9ddb>] Not tainted VLI
EFLAGS: 00010046 (2.6.11.2)
EIP is at scsi_put_command+0xbb/0x100
eax: 00000001 ebx: c2f5e390 ecx: 00000001 edx: c2f5e390
esi: c2f5e380 edi: 00000246 ebp: c7c4beb4 esp: c7c4be9c
ds: 007b es: 007b ss: 0068
Process scsi_eh_2 (pid: 821, threadinfo=c7c4a000 task=c7c315b0)
Stack: 00000296 c7c30000 c7c26400 c7c23030 c2f5e380 00000246 c7c4bec4 c02ae9b3
c2f5e380 c44a1740 c7c4bee0 c02aeabc c2f5e380 c7c23030 c2f5e380 08000002
c44a1740 c7c4bf28 c02aedd2 c2f5e380 00000001 00000000 00000000 00000000
Call Trace:
[<c0102c12>] show_stack+0x72/0xa0
[<c0102d64>] show_registers+0x104/0x180
[<c0102f53>] die+0xd3/0x180
[<c0103330>] do_invalid_op+0x90/0xa0
[<c010282b>] error_code+0x2b/0x30
[<c02ae9b3>] scsi_next_command+0x13/0x20
[<c02aeabc>] scsi_end_request+0xbc/0xe0
[<c02aedd2>] scsi_io_completion+0x132/0x3c0
[<c02ba698>] sd_rw_intr+0xb8/0x2c0
[<c02b8420>] ata_scsi_qc_complete+0x20/0x40
[<c02b658c>] ata_qc_complete+0x2c/0xa0
[<c02b9473>] pdc_eng_timeout+0x93/0x120
[<c02b7ef4>] ata_scsi_error+0x14/0x40
[<c02add5b>] scsi_error_handler+0x5b/0xc0
[<c0100811>] kernel_thread_helper+0x5/0x14
Code: ec 8b 42 08 ff 30 e8 e5 cd e8 ff 59 5b 8b 45 f0 05 84 01 00 00 89 45
08 8d 65 f4 5b 5e 5f c9 e9 ac 41 fc ff e8 47 6c 13 00 eb ce <0f> 0b 2b 01
d6 e8 40 c0 e9 6c ff ff ff e8 33 6c 13 00 eb 8b 89
<6>note: scsi_eh_2[821] exited with preempt_count 1
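
For readers decoding the log above: the status= and error= bytes are bit
fields from the ATA Status and Error registers. The following is a minimal
sketch, not part of the original report, that decodes them and counts error
lines per port, which may help when the failing port changes from crash to
crash. The bit names are chosen to match the log text where it appears
(DriveReady, SeekComplete, Error, UncorrectableError); the remaining names
and the helper names decode() and errors_per_port() are assumptions made
for illustration.

```python
import re
from collections import Counter

# ATA Status register bits. Names not seen in the log are illustrative.
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

# ATA Error register bits. Only UncorrectableError appears in the log.
ERROR_BITS = {
    0x80: "BadCRC",
    0x40: "UncorrectableError",
    0x20: "MediaChanged",
    0x10: "SectorIdNotFound",
    0x08: "MediaChangeRequested",
    0x04: "DriveStatusError",
    0x02: "TrackZeroNotFound",
    0x01: "AddrMarkNotFound",
}

# Matches lines such as "ata3: error=0x40 { UncorrectableError }".
LINE_RE = re.compile(r"^(ata\d+): (status|error)=0x([0-9a-fA-F]{2})")


def decode(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if value & bit]


def errors_per_port(log_text):
    """Count 'error=' lines per ataN port in a kernel log excerpt."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group(2) == "error":
            counts[match.group(1)] += 1
    return counts


print(decode(0x51, STATUS_BITS))  # ['DriveReady', 'SeekComplete', 'Error']
print(decode(0x40, ERROR_BITS))   # ['UncorrectableError']
```

Feeding several saved logs through errors_per_port() gives a quick tally of
which ports reported errors before each crash.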


-Neil Whelchel-
First Light Internet Services
760 366-0145
- We don't do Window$, that's what the janitor is for -

Bubble Memory, n.:
A derogatory term, usually referring to a person's
intelligence. See also "vacuum tube".


2005-03-21 06:59:57

by Brad Campbell

Subject: Re: SATA Promise TX4 Crash

Neil Whelchel wrote:
> Hello,
> I have two Promise SATA TX4 cards connected to a total of 6 Maxtor 250 GB
> drives (7Y250M0) configured into a RAID 5. All works well with small
> disk load, but when a large number of requests are issued, it causes a crash
> similar to the attached, except that the errors before the crash are on a

> EFLAGS: 00010046 (2.6.11.2)
> EIP is at scsi_put_command+0xbb/0x100

Oooh Oooh Oooh, pick me Mr Kotter!
I have seen this repeatedly, fought it and "apparently" beat it by upgrading my PSU.
I could reliably reproduce it by running a raid resync and issuing SMART queries
to the drives, but after a PSU upgrade it has gone away.
I have tried hard to reproduce it recently but I just can't get it to crash anymore.
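
The reproduction described above (SMART queries issued while a RAID resync
is running) could be scripted roughly as follows. This is a sketch under
stated assumptions, not the procedure actually used: the device names are
hypothetical, the "-d ata" device type is an assumption about what the
installed smartctl needs for libata-attached disks, and the function names
are made up for illustration. How the resync is started is left out.

```python
import re
import subprocess
import time


def resync_running(mdstat_text):
    """Return True if /proc/mdstat text shows a resync with a percentage."""
    return re.search(r"resync\s*=\s*[\d.]+%", mdstat_text) is not None


def smart_commands(devices, device_type="ata"):
    """Build one 'smartctl -a' command line per device (not executed here)."""
    return [["smartctl", "-a", "-d", device_type, dev] for dev in devices]


def query_during_resync(devices, mdstat_path="/proc/mdstat", pause=1.0):
    """Issue SMART queries to every device until the resync has finished."""
    while True:
        with open(mdstat_path) as mdstat:
            if not resync_running(mdstat.read()):
                break
        for cmd in smart_commands(devices):
            # Output is discarded; the point is only to generate the load.
            subprocess.run(cmd, stdout=subprocess.DEVNULL, check=False)
        time.sleep(pause)


# Example call (hypothetical device names, needs root and smartmontools):
# query_during_resync(["/dev/sda", "/dev/sdb", "/dev/sdc"])
```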

I have a similar setup: 4x SATA-TX4 cards and 15x 7Y250M0 drives. I thought it was actually
a bug, but as I can't reproduce it anymore it's a bit hard to track down.

Not much help, sorry.

Brad
--
"Human beings, who are almost unique in having the ability
to learn from the experience of others, are also remarkable
for their apparent disinclination to do so." -- Douglas Adams

2005-03-21 09:44:30

by Raphael Jacquot

Subject: Re: SATA Promise TX4 Crash

Brad Campbell wrote:
> Neil Whelchel wrote:
>
>> Hello,
>> I have two Promise SATA TX4 cards connected to a total of 6 Maxtor 250 GB
>> drives (7Y250M0) configured into a RAID 5. All works well with small
>> disk load, but when a large number of requests are issued, it causes a
>> crash
>> similar to the attached, except that the errors before the crash are on a
>
>
>> EFLAGS: 00010046 (2.6.11.2)
>> EIP is at scsi_put_command+0xbb/0x100
>
>
> Oooh Oooh Oooh, pick me Mr Kotter!
> I have seen this repeatedly, fought it and "apparently" beat it by
> upgrading my PSU.
> I could reliably reproduce it by running a raid resync and issuing SMART
> queries to the drives, but after a PSU upgrade it has gone away.
> I have tried hard to reproduce it recently but I just can't get it to
> crash anymore.
>
> I have a similar setup: 4x SATA-TX4 cards and 15x 7Y250M0 drives. I
> thought it was actually a bug, but as I can't reproduce it anymore it's
> a bit hard to track down.
>
> Not much help, sorry.
>
> Brad

I have similar crashes with a (netbooted) EPIA and four 250 GB Seagate 7200.8
PATA drives.
Removing the kernel preempt stuff and realtime scheduling alleviates the
issue a bit, but it occurred again yesterday.

A quirk in the EPIA forces me to reboot the box by power cycling it.

2005-03-27 22:17:47

by Neil Whelchel

Subject: Re: SATA Promise TX4 Crash

On Sat, 26 Mar 2005, quasiben wrote:

> Dear Neil Whelchel,
> I have been having very similar problems. However, my setup is
> somewhat different. I have a LVM logical volume that spans two disks
> (one PATA and one SATA). Did you upgrade your PSU as one person
> suggested? If so, did it work?
>
> --Benji
>
> Neil Whelchel wrote:
> > Hello,
> > I have two Promise SATA TX4 cards connected to a total of 6 Maxtor
> > 250 GB drives (7Y250M0) configured into a RAID 5. All works well with
> > small disk load, but when a large number of requests are issued, it
> > causes a crash similar to the attached, except that the errors before
> > the crash are on a
> > different drive nearly every time. I have tried several different
> > motherboards with both Nvidia and Via chipsets, with Athlon and K6
> > CPUs, and the crash remains the same. I have also seen the same crash
> > with both a preemptable and a non-preemptable kernel, with kernel
> > versions 2.6.9, 2.6.10, 2.6.11, and 2.6.11.2 (this one).
> > Any suggestions, or is this a bug?
>
Hello,
Yes, I did replace the PSU about 6 times. I had the same problem with 4
similar machines (all the same), and in one of them I tried two other
power supplies, so a total of three completely different supplies were
tested. All of them were 450 W, except for one 500 W unit that I tested.
While my 'feelings' tell me that the PSU is the issue, I have been
looking more at grounding and the SATA cables than anything else. But
there is one HUGE however here: if there is a communication failure, it
should not cause a crash in the kernel. This should be fixed!

-Neil Whelchel-