Kind people,
I believe this bug has been reported previously, but I will risk
repetition. (cf. http://lkml.org/lkml/2004/5/22/124 and
http://www.uwsg.iu.edu/hypermail/linux/kernel/0404.0/0736.html)
Partial hardware list:
Athlon 950
00:00.0 Host bridge: Silicon Integrated Systems [SiS] 730 Host (rev 02)
00:0e.0 RAID bus controller: CMD Technology Inc: Unknown device 3112 (rev 02)
00:0f.0 Multimedia video controller: Brooktree Corporation Bt878 Video Capture (rev 11)
00:0f.1 Multimedia controller: Brooktree Corporation Bt878 Audio Capture (rev 11)
I mention the bttv card, since it generates a lot of PCI traffic and
may tax the system a bit.
Under kernels 2.6.6-mm5 and 2.6.7, I get the following errors after a
few hours or days (no obvious trigger or pattern), which hang most of
the machine (Alt-SysRq still works, machine still pings, but shells
and sshd2 freeze):
----------------------------
ata1: DMA timeout, stat 0x1
ATA: abnormal status 0x58 on port 0xCF819087
scsi0: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 00 03 ca 47 00 00 00 00
Current sda: sense key Medium Error
Additional sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sda, sector 248391
ATA: abnormal status 0x58 on port 0xCF819087
ATA: abnormal status 0x58 on port 0xCF819087
ATA: abnormal status 0x58 on port 0xCF819087
----------------------------
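As a cross-check on the log above: the bytes printed after "Read (10)" look like the remainder of a 10-byte SCSI READ(10) CDB, in which bytes 2-5 carry the big-endian LBA. A minimal sketch, assuming that layout and that the log leaves out the 0x28 opcode byte (the helper name read10_lba is made up for illustration):

```c
#include <stdint.h>

/* Extract the 32-bit big-endian LBA from a 10-byte READ(10) CDB.
 * cdb[0] is the opcode (0x28), cdb[1] the flags, cdb[2..5] the LBA. */
static uint32_t read10_lba(const uint8_t cdb[10])
{
    return ((uint32_t)cdb[2] << 24) | ((uint32_t)cdb[3] << 16) |
           ((uint32_t)cdb[4] << 8)  |  (uint32_t)cdb[5];
}
```

Fed the bytes 28 00 00 03 ca 47 00 00 00 00, this gives 0x3ca47 = 248391, which matches the sector in the end_request line.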
Reiserfsck and badblocks both report that everything is just ducky
(though reiserfs replays some transactions on reboot, of course). The
freeze has happened about half a dozen times so far.
I am willing to test patches and report back. I will try moving the
SATA card to another slot in the meantime (as it was sharing an
interrupt with an (idle) audio card).
Eric Buddington
On Thu, 17 Jun 2004, Eric Buddington wrote:
>----------------------------
>ata1: DMA timeout, stat 0x1
>ATA: abnormal status 0x58 on port 0xCF819087
>scsi0: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 00 03 ca 47 00 00 00 00
>Current sda: sense key Medium Error
>Additional sense: Unrecovered read error - auto reallocate failed
>end_request: I/O error, dev sda, sector 248391
>ATA: abnormal status 0x58 on port 0xCF819087
>ATA: abnormal status 0x58 on port 0xCF819087
>ATA: abnormal status 0x58 on port 0xCF819087
>----------------------------
I'm seeing the same thing on a 3114. (And even uber flaky behavior on a
3ware 8506-4LP... rebuilds succeed, OS can pattern fill the array without
error, but a media scan/parity check fails within minutes.) There's
likely nothing wrong with your drives. Something about that driver and
the hardware aren't playing nice.
I have 4xST3160023AS's in RAID0 (mirroring what the BIOS does to those drives.
See other posts.) I can zero the array @ 35MB/s via O_DIRECT writes without
any issues. I can copy gigs of stuff into the array if the fs is mounted
O_SYNC. Yet, without O_SYNC, bursts of writes eventually result in a DMA
timeout. I tried setting SIL_QUIRK_MOD15WRITE for that drive model, but it
makes no difference.
--Ricky
Hi, Ricky Beam wrote:
>>Current sda: sense key Medium Error
>There's likely nothing wrong with your drives. Something about that
>driver and the hardware aren't playing nice.
What does the drive's SMART error log report?
I would consider swapping the power supply. Last year I had *four* 120 GB
drives fail on me before I changed the thing. Zero problems since.
--
Matthias Urlichs
On Fri, 18 Jun 2004, Matthias Urlichs wrote:
>>>Current sda: sense key Medium Error
>
>>There's likely nothing wrong with your drives. Something about that
>>driver and the hardware aren't playing nice.
>
>What does the drive's SMART error log report?
No errors.
>I would consider swapping the power supply. Last year I had *four* 120 GB
>drives fail on me before I changed the thing. Zero problems since.
Moral: stop buying crappy power supplies. :-)
My power supply is fine. If the PS were at fault, the same thing would
be happening all the time, not just when linux tries to throw 200 sectors
at a time at the drives. Windows has been stressing these drives far more
than linux and there's been zero indication of any problems. As I said,
writing in O_DIRECT mode to the array @ _35MB/s_ never reports a DMA
timeout -- I'll start increasing the buffer size to see where the cracking
point is.
--Ricky
On Friday 18 of June 2004 18:28, Ricky Beam wrote:
> On Fri, 18 Jun 2004, Matthias Urlichs wrote:
> >>>Current sda: sense key Medium Error
> >>
> >>There's likely nothing wrong with your drives. Something about that
> >>driver and the hardware aren't playing nice.
> >
> >What does the drive's SMART error log report?
>
> No errors.
>
> >I would consider swapping the power supply. Last year I had *four* 120 GB
> >drives fail on me before I changed the thing. Zero problems since.
>
> Moral: stop buying crappy power supplies. :-)
>
> My power supply is fine. If the PS were at fault, the same thing would
> be happening all the time, not just when linux tries to throw 200 sectors
> at a time at the drives. Windows has been stressing these drives far more
> than linux and there's been zero indication of any problems. As I said,
> writing in O_DIRECT mode to the array @ _35MB/s_ never reports a DMA
> timeout -- I'll start increasing the buffer size to see where the cracking
> point is.
Are your drives out of Seagate, maybe? If not, what make are they?
rjw
--
Rafael J. Wysocki,
SiSK
[tel. (+48) 605 053 693]
----------------------------
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.
-- Richard P. Feynman
On Sat, 19 Jun 2004, R. J. Wysocki wrote:
>Are your drives out of Seagate, maybe? If not, what make are they?
(As I said in a previous email...) 4 x Seagate ST3160023AS's RAID0'd
together in a BIOS "raid" mode compatible manner.
kernel: ata3: DMA timeout, stat 0x1
kernel: ATA: abnormal status 0xD8 on port 0xFFFFFF000004FE87
As I understand it, the "0x1" indicates ATA_DMA_ACTIVE, and "0xD8" is
ATA_BUSY | ATA_DRQ + two more bits (ata.h doesn't make it clear which
are command bits vs. status bits.)
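To spell out the bit arithmetic, here is a small decoder for the ATA status byte. It is only a sketch: the macro names follow the ATA spec's register description rather than ata.h, and the obsolete CORR/IDX bits (0x04, 0x02) are left out.

```c
#include <stddef.h>
#include <stdio.h>

/* ATA status register bits, named after the ATA spec (not copied from ata.h). */
#define ST_BSY  0x80  /* device busy */
#define ST_DRDY 0x40  /* device ready */
#define ST_DF   0x20  /* device fault */
#define ST_DSC  0x10  /* seek complete / command dependent */
#define ST_DRQ  0x08  /* data request */
#define ST_ERR  0x01  /* error */

/* Write the names of the set bits into buf, space separated.
 * len must be at least 1; output is truncated if buf is too small. */
static void decode_status(unsigned char st, char *buf, size_t len)
{
    static const struct { unsigned char bit; const char *name; } bits[] = {
        { ST_BSY, "BSY" }, { ST_DRDY, "DRDY" }, { ST_DF, "DF" },
        { ST_DSC, "DSC" }, { ST_DRQ, "DRQ" }, { ST_ERR, "ERR" },
    };
    size_t used = 0;

    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
        if (!(st & bits[i].bit))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? " " : "", bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;  /* out of room: stop with what fits */
        used += (size_t)n;
    }
}
```

Under this reading 0xD8 comes out as BSY DRDY DSC DRQ, and the 0x58 from the original report as DRDY DSC DRQ, so the "two more bits" would be DRDY and bit 4.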
FWIW...
Linux version [email protected] ([email protected]) (gcc version 3.3.3 20040412 (Red Hat Linux 3.3.3-7)) #15 SMP BK[20040618194307] Fri Jun 18 15:56:38 EDT 2004
Kernel command line: ro console=ttyS0,115200 console=tty0 debug numa=off md=d0,0,2,0,/dev/sda,/dev/sdb,/dev/sdc,/dev/sdd root=/dev/hdd1
ata1: dev 0 cfg 49:2f00 82:346b 83:7d01 84:4003 85:3469 86:3c01 87:4003 88:207f
ata1: dev 0 ATA, max UDMA/133, 312581808 sectors: lba48
ata1: dev 0 configured for UDMA/133
scsi0 : sata_sil
ata2: dev 0 cfg 49:2f00 82:346b 83:7d01 84:4003 85:3469 86:3c01 87:4003 88:207f
ata2: dev 0 ATA, max UDMA/133, 312581808 sectors: lba48
ata2: dev 0 configured for UDMA/133
scsi1 : sata_sil
ata3: dev 0 cfg 49:2f00 82:346b 83:7d01 84:4003 85:3469 86:3c01 87:4003 88:207f
ata3: dev 0 ATA, max UDMA/133, 312581808 sectors: lba48
ata3: dev 0 configured for UDMA/133
scsi2 : sata_sil
ata4: dev 0 cfg 49:2f00 82:346b 83:7d01 84:4003 85:3469 86:3c01 87:4003 88:207f
ata4: dev 0 ATA, max UDMA/133, 312581808 sectors: lba48
ata4: dev 0 configured for UDMA/133
scsi3 : sata_sil
Vendor: ATA Model: ST3160023AS Rev: 3.18
Type: Direct-Access ANSI SCSI revision: 05
Vendor: ATA Model: ST3160023AS Rev: 3.18
Type: Direct-Access ANSI SCSI revision: 05
Vendor: ATA Model: ST3160023AS Rev: 3.18
Type: Direct-Access ANSI SCSI revision: 05
Vendor: ATA Model: ST3160023AS Rev: 3.05
Type: Direct-Access ANSI SCSI revision: 05
...
md_d0: p1 p2 p3
I am currently "zero"ing md/d0p2 in a loop. (code is @
http://sweetums.bluetronic.net/~jfbeam/zero.c
) With the buffer size set to 49 stripes, it'll eventually fail. At 64,
it fails quickly. If O_DIRECT is enabled, it never fails.
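For anyone who cannot fetch zero.c, a rough sketch of this kind of test follows. It is not the actual zero.c: the function name, its parameters and the 4096-byte buffer alignment are assumptions. The only point it makes is that the O_DIRECT and buffered paths differ by one open() flag plus an aligned buffer.

```c
#define _GNU_SOURCE          /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write `count` zero-filled chunks of `chunk` bytes to an existing `path`.
 * With use_direct set, the file is opened with O_DIRECT (page cache
 * bypassed); chunk should then be a multiple of the logical block size.
 * A short write is treated as an error. Returns 0 on success, -1 on error. */
static int zero_fill(const char *path, size_t chunk, size_t count, int use_direct)
{
    void *buf;
    int flags = O_WRONLY | (use_direct ? O_DIRECT : 0);
    int fd, rc = 0;

    if (posix_memalign(&buf, 4096, chunk) != 0)   /* 4096: assumed alignment */
        return -1;
    memset(buf, 0, chunk);

    fd = open(path, flags);
    if (fd < 0) {
        free(buf);
        return -1;
    }
    for (size_t i = 0; i < count && rc == 0; i++) {
        if (write(fd, buf, chunk) != (ssize_t)chunk)
            rc = -1;
    }
    if (close(fd) != 0)
        rc = -1;
    free(buf);
    return rc;
}
```

Varying chunk with use_direct set to 0 would correspond to the buffer-size experiment described above.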
--Ricky
On Saturday 19 of June 2004 01:06, Ricky Beam wrote:
> On Sat, 19 Jun 2004, R. J. Wysocki wrote:
> >Are your drives out of Seagate, maybe? If not, what make are they?
>
> (As I said in a previous email...) 4 x Seagate ST3160023AS's RAID0'd
> together in a BIOS "raid" mode compatible manner.
Sorry, I should have noticed.
Anyway, it looks like a pattern is forming which smells bad to me.
Apparently, we have:
1) A serious error condition that occurs on Seagate SATA drives connected to
Silicon Image controllers.
2) As of today we can say that it only occurs on Seagate drives (Ricky, do I
remember correctly that you see faulty behavior of such drives with a 3ware
RAID?).
3) The error is reported by the kernel like this:
ata1: DMA timeout, stat 0x1
ATA: abnormal status 0x58 on port 0xCF819087
scsi0: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 00 03 ca 47 00 00 00 00
Current sda: sense key Medium Error
Additional sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sda, sector 248391
Afterwards, the drive blocks its SATA bus in a "busy" mode and cannot be
accessed by any means (ie. hardware reset is necessary).
4) The most "reliable" way to trigger this condition is to copy a lot of data
(eg. 2 GB) to the drive in one shot.
Do we agree on that?
rjw
R. J. Wysocki wrote:
> On Saturday 19 of June 2004 01:06, Ricky Beam wrote:
>
>>On Sat, 19 Jun 2004, R. J. Wysocki wrote:
>>
>>>Are your drives out of Seagate, maybe? If not, what make are they?
>>
>>(As I said in a previous email...) 4 x Seagate ST3160023AS's RAID0'd
>>together in a BIOS "raid" mode compatible manner.
>
>
> Sorry, I should have noticed.
>
> Anyway, it looks like a pattern is forming which smells bad to me.
>
> Apparently, we have:
> 1) A serious error condition that occurs on Seagate SATA drives connected to
> Silicon Image controllers.
> 2) As of today we can say that it only occurs on Seagate drives (Ricky, do I
> remember correctly that you see faulty behavior of such drives with a 3ware
> RAID?).
> 3) The error is reported by the kernel like that:
I wonder if it helps to add the Seagate drive to the sata_sil blacklist?
/* TODO firmware versions should be added - eric */
struct sil_drivelist {
const char * product;
unsigned int quirk;
} sil_blacklist [] = {
{ "ST320012AS", SIL_QUIRK_MOD15WRITE },
{ "ST330013AS", SIL_QUIRK_MOD15WRITE },
{ "ST340017AS", SIL_QUIRK_MOD15WRITE },
{ "ST360015AS", SIL_QUIRK_MOD15WRITE },
{ "ST380023AS", SIL_QUIRK_MOD15WRITE },
{ "ST3120023AS", SIL_QUIRK_MOD15WRITE },
[...]
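To make the suggestion concrete, here is a user-space sketch of how a table like this can be searched by model string. It is not the sata_sil code: the flag value, the NULL end-of-table entry, the lookup_quirk helper and the added ST3160023AS entry are all assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

#define SIL_QUIRK_MOD15WRITE (1u << 0)  /* value assumed for this sketch */

struct sil_drivelist {
    const char *product;
    unsigned int quirk;
};

static const struct sil_drivelist sil_blacklist[] = {
    { "ST3120023AS", SIL_QUIRK_MOD15WRITE },
    { "ST3160023AS", SIL_QUIRK_MOD15WRITE },  /* hypothetical addition */
    { NULL, 0 }                               /* end-of-table marker */
};

/* Return the quirk flags for a model string, or 0 if it is not listed. */
static unsigned int lookup_quirk(const char *model)
{
    for (size_t i = 0; sil_blacklist[i].product; i++)
        if (strcmp(sil_blacklist[i].product, model) == 0)
            return sil_blacklist[i].quirk;
    return 0;
}
```

An exact-match lookup like this only helps if the model string the drive reports matches the table entry character for character.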
On Sat, 19 Jun 2004, Jeff Garzik wrote:
>I wonder if it helps to add the Seagate drive to the sata_sil blacklist?
As I said, I tried that with no success. Btw, there is no such list in
the SI published driver (or they did it in a manner that is not immediately
obvious.) However, this:
Maxtor 4D060H3:DAK05GK0:MaxMode=udma-5
does show up practically without looking :-)
By way of the FreeBSD mailing lists, it appears the dropped DMA thing is common
to the sil hardware. However, there must be an errata/work-around as
SI's driver doesn't exhibit the same problems -- no stalls, no reported
DMA errors.
Of note is that the drive that most often stalls is of older firmware...
3.05 vs. the 3.18 of the other drives...
Drive information:
/dev/sga ATA ST3160023AS 3.18 312581807 blocks
/dev/sgb ATA ST3160023AS 3.18 312581807 blocks
/dev/sgc ATA ST3160023AS 3.18 312581807 blocks
/dev/sgd ATA ST3160023AS 3.05 312581807 blocks
If I can/could get Seagate to give me a 3.18 firmware update for that drive,
I'll find a way to get it on there :-)
DMA'd reads don't seem to be a problem. I'm 400G through the 10th loop
reading the entire array in 8M O_DIRECT chunks (128 16k stripes). However,
I will note, the internal configuration of the sil chips is laughable...
reading from more than one port at a time (doesn't really matter which two)
degrades performance. Reading from all four (hello raid) maxes out each port
to about 24MB/s. Individually, each port (alone) can read at 48-56MB/s.
The drives are capable of streaming 85MB/s (if you believe the specs.)
--Ricky
On Sat, 19 Jun 2004, R. J. Wysocki wrote:
>Anyway, it looks like a pattern is forming which smells bad to me.
The pattern formed a long time ago... there are several web pages dedicated
to the failure that is SATA. (1.5GHz signal on an unshielded cable? WTF
were they thinking?)
>Apparently, we have:
>1) A serious error condition that occurs on Seagate SATA drives connected to
>Silicon Image controllers.
Not all drives and not always Seagate. The Seagate drive that fails
most often (always?) is FW 3.05 while the other 3 are FW 3.18.
>2) As of today we can say that it only occurs on Seagate drives (Ricky, do I
>remember correctly that you see faulty behavior of such drives with a 3ware
>RAID?).
The 3ware card has 250G Maxtor drives. The drive that fails has different
firmware than the other three. It fails irrespective of which port it's
on. And powermax diag fails recalibration 50% of the time.
(FW YAR51EW0 works -- 3 on the 3ware, and 2 in Dells, YAR51BW0 fails.)
>Afterwards, the drive blocks its SATA bus in a "busy" mode and cannot be
>accessed by any means (ie. hardware reset is necessary).
Actually, I think it's the controller that's borked. There's no way to
"hardware reset" a SATA drive without powering down the system which I'm
not doing. And the same problems do not happen in windows using si's
driver. According to FreeBSD developers, si's hardware is, point blank,
broken.
>4) The most "reliable" way to trigger this condition is to copy a lot of data
>(eg. 2 GB) to the drive in one shot.
Any sustained burst write will eventually lock up the channel. It can take
seconds or hours. It is as if a packet is being dropped during the DMA
transfer or the DMA completing without an ack.
--Ricky
On Sat, Jun 19, 2004 at 04:19:08PM -0400, Jeff Garzik wrote:
> I wonder if it helps to add the Seagate drive to the sata_sil blacklist?
To confirm, I also see the problem despite adding the drives to
the blacklist on 2.6.7.
sata_sil version 0.54
ata1: dev 0 cfg 49:2f00 82:346b 83:7d01 84:4003 85:3469 86:3c01 87:4003 88:207f
ata1: dev 0 ATA, max UDMA/133, 234441648 sectors: lba48
ata1(0): applying Seagate errata fix
ata1: dev 0 configured for UDMA/100
scsi0 : sata_sil
ata2: dev 0 cfg 49:2f00 82:346b 83:7d01 84:4003 85:3469 86:3c01 87:4003 88:207f
ata2: dev 0 ATA, max UDMA/133, 234441648 sectors: lba48
ata2(0): applying Seagate errata fix
ata2: dev 0 configured for UDMA/100
scsi1 : sata_sil
Vendor: ATA Model: ST3120026AS Rev: 3.18
Vendor: ATA Model: ST3120026AS Rev: 3.18
On Sat, Jun 19, 2004 at 10:10:05PM +0200, R. J. Wysocki wrote:
>Apparently, we have:
>1) A serious error condition that occurs on Seagate SATA drives connected to
>Silicon Image controllers.
>2) As of today we can say that it only occurs on Seagate drives (Ricky, do I
>remember correctly that you see faulty behavior of such drives with a 3ware
>RAID?).
>3) The error is reported by the kernel like that:
>
>ata1: DMA timeout, stat 0x1
>ATA: abnormal status 0x58 on port 0xCF819087
>scsi0: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 00 03 ca 47 00 00 00 00
>Current sda: sense key Medium Error
>Additional sense: Unrecovered read error - auto reallocate failed
>end_request: I/O error, dev sda, sector 248391
>
>Afterwards, the drive blocks its SATA bus in a "busy" mode and cannot be
>accessed by any means (ie. hardware reset is necessary).
>4) The most "reliable" way to trigger this condition is to copy a lot of data
>(eg. 2 GB) to the drive in one shot.
I can't speak for other hardware. Sounds the same for me though,
except I'm not getting any error from console or /proc/kmsg...
http://marc.theaimsgroup.com/?l=linux-kernel&m=108792673031602&w=2
running hdparm -Tt in a loop doesn't cause any problem.
// George
--
George Georgalis, Architect and administrator, Linux services. IXOYE
http://galis.org/george/ cell:646-331-2027 mailto:[email protected]
Key fingerprint = 5415 2738 61CF 6AE1 E9A7 9EF0 0186 503B 9831 1631
On Sun, 20 Jun 2004 [email protected] wrote:
>To confirm, I also see the problem despite adding the drives to
>the blacklist on 2.6.7.
Yeah. I uncovered a small oops in there. libata-scsi.c resets max_sectors
if LBA48 is enabled. Mr. Garzik has been sent a patch (maybe not "the"
patch.) I'm still digging into this one as I don't like losing 50% of
my drives' speed. SI eats more than enough on its own.
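A toy model of the kind of ordering problem described here, where a later LBA48 setup step undoes an earlier errata clamp. Everything in it is hypothetical (the toy_* names, the flag values, the 15 and 2048 sector limits); it is neither the kernel code nor the patch that was sent.

```c
#define TOY_QUIRK_MOD15WRITE (1u << 0)  /* drive needs the write-size clamp */
#define TOY_FLAG_LBA48       (1u << 1)  /* drive supports LBA48 */

struct toy_dev {
    unsigned int flags;
    unsigned int max_sectors;
};

/* Step 1: the errata workaround clamps the request size. */
static void toy_apply_quirk(struct toy_dev *d)
{
    if (d->flags & TOY_QUIRK_MOD15WRITE)
        d->max_sectors = 15;            /* assumed clamp value */
}

/* Step 2 (buggy): LBA48 setup overwrites the limit unconditionally. */
static void toy_lba48_setup_buggy(struct toy_dev *d)
{
    if (d->flags & TOY_FLAG_LBA48)
        d->max_sectors = 2048;          /* assumed LBA48 limit */
}

/* Step 2 (guarded): raise the limit only when no clamp is in force. */
static void toy_lba48_setup_guarded(struct toy_dev *d)
{
    if ((d->flags & TOY_FLAG_LBA48) && !(d->flags & TOY_QUIRK_MOD15WRITE))
        d->max_sectors = 2048;
}
```

With both flags set, the buggy order leaves max_sectors at 2048 and the workaround never takes effect, while the guarded version keeps it at 15.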
--Ricky