LinuxLists.cc - failed command FLUSH CACHE EXT (was: Re: via 8237 sata errors)

2010-06-02 21:10:57

Subject: failed command FLUSH CACHE EXT (was: Re: via 8237 sata errors)

On May 30, 2010, Thomas Fjellstrom wrote:
> On May 30, 2010, Thomas Fjellstrom wrote:
> > On May 30, 2010, Robert Hancock wrote:
> > > On Sun, May 30, 2010 at 1:33 PM, Thomas Fjellstrom <[email protected]> wrote:
> > > > On May 30, 2010, Thomas Fjellstrom wrote:
> > > >> On May 30, 2010, Robert Hancock wrote:
> > > >> > On 05/29/2010 08:46 PM, Thomas Fjellstrom wrote:
> > > >> > > I'm getting a rather nasty set of messages from dmesg when
> > > >> > > trying to use a couple SATA II WD 2TB Green drives with an
> > > >> > > older system (via 8237 based). They seem to work fine on a
> > > >> > > newer p35 based system.
> > > >> >
> > > >> > ..
> > > >> >
> > > >> > > [ 283.308963] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0
> > > >> > > action 0x0 [ 283.309007] ata2.00: BMDMA stat 0x4
> > > >> > > [ 283.309045] ata2.00: failed command: READ DMA
> > > >> > > [ 283.309089] ata2.00: cmd
> > > >> > > c8/00:08:08:08:30/00:00:00:00:00/e0 tag 0 dma 4096 in [
> > > >> > > 283.309091] res
> > > >> > > 41/04:00:08:08:30/00:00:00:00:00/e0 Emask 0x1 (device error) [
> > > >> > > 283.309171] ata2.00: status: { DRDY ERR }
> > > >> > > [ 283.309207] ata2.00: error: { ABRT }
> > > >> > > [ 283.324886] ata2.00: configured for UDMA/133
> > > >> > > [ 283.324904] ata2: EH complete
> > > >> >
> > > >> > It's not really clear why the drive is returning command aborted
> > > >> > on a read, there's no other error bits to indicate an
> > > >> > uncorrectable error or a CRC error. Is it only the one drive
> > > >> > that's giving the errors?
> > > >>
> > > > >> I'm not entirely sure if it's the same drive each time. I can make
> > > > >> sure today. The fun part is it works fine in a different machine,
> > > > >> whereas it will start erroring out like that almost right away in
> > > > >> the via based machine. When it's doing that, it's also making some
> > > > >> fairly scary (for a hard drive) noises, but since it doesn't do
> > > > >> that in the p35 machine I'm really hoping it isn't the drive.
> > > >
> > > > > I've started up a dd on each drive, just for kicks, and reading
> > > > > from both of them at the same time seems to work fine on the via
> > > > > chipset. Given this little tidbit, it seems it's only once md-raid
> > > > > is set up on the drives that one of them freaks out.
> > >
> > > If the problem happens mostly when both drives are in use then it
> > > could be a power supply issue. Some drives are rather sensitive to
> > > power fluctuations. You could try and move the drives to separate
> > > power cables if they're on the same one, or maybe the power supply's
> > > just not up to snuff.
> >
> > I'll give that a shot too. But I have dd running on both drives right
> > now (dd if=/dev/sdX of=/dev/null), and it's running fine. I'm also
> > going to run some other tests to see if there's a RAM problem, since
> > that could be the cause.
>
> Interesting, after moving each WD drive to its own power cable, that
> particular problem seems to have gone away. However another problem seems
> to have popped up, which I'm not sure is related so I'll post another
> thread (unable to handle kernel paging request).

Ok, more testing, I've moved the drives over to the p35 machine semi-
permanently, and after a day or so of uptime I got some new errors:

ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata3.00: failed command: FLUSH CACHE EXT
ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata3.00: status: { DRDY }
ata3: hard resetting link
ata3: link is slow to respond, please be patient (ready=0)
ata3: SRST failed (errno=-16)
ata3: hard resetting link
ata3: link is slow to respond, please be patient (ready=0)
ata3: SRST failed (errno=-16)
ata3: hard resetting link
ata3: link is slow to respond, please be patient (ready=0)
ata3: SRST failed (errno=-16)
ata3: limiting SATA link speed to 1.5 Gbps
ata3: hard resetting link
ata3: SRST failed (errno=-16)
ata3: reset failed, giving up
ata3.00: disabled
ata3.00: device reported invalid CHS sector 0
ata3: EH complete
end_request: I/O error, dev sdc, sector 0
sd 2:0:0:0: [sdc] Unhandled error code
sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 00 00 07 a7 00 00 08 00
end_request: I/O error, dev sdc, sector 1959
Buffer I/O error on device dm-0, logical block 189
lost page write due to I/O error on dm-0
end_request: I/O error, dev sdc, sector 0
end_request: I/O error, dev sdc, sector 0
JBD2 unexpected failure: do_get_write_access: buffer_uptodate(jh2bh(jh));
Possible IO failure.

end_request: I/O error, dev sdc, sector 0
end_request: I/O error, dev sdc, sector 0
------------[ cut here ]------------
WARNING: at /home/damentz/src/zen/main/linux-liquorix-2.6-2.6.34/debian/build/source_amd64_none/fs/buffer.c:1199 mark_buffer_dirty+0x74/0x90()
Hardware name: P5K SE
Modules linked in: nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6
acpi_cpufreq cpufreq_ondemand freq_table cpufreq_conservative
cpufreq_userspace cpufreq_powersave af_packet ext3 jbd loop
snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss
snd_mixer_oss snd_pcm rtc_cmos rtc_core snd_timer tpm_tis nvidia(P) tpm
rtc_lib tpm_bios evdev snd intel_agp pcspkr asus_atk0110 soundcore i2c_i801
snd_page_alloc button i2c_core processor dm_mod raid10 raid456
async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx
raid1 raid0 multipath linear md_mod ext4 mbcache jbd2 crc16 usbhid sd_mod
ata_generic pata_acpi uhci_hcd ata_piix libata floppy scsi_mod thermal atl1
mii ehci_hcd [last unloaded: scsi_wait_scan]
Pid: 3283, comm: jbd2/dm-0-8 Tainted: P 2.6.34-0.dmz.8-liquorix-amd64 #1
Call Trace:
[<ffffffff8103bf23>] ? warn_slowpath_common+0x73/0xb0
[<ffffffff81101cd4>] ? mark_buffer_dirty+0x74/0x90
[<ffffffffa005a3c9>] ? __jbd2_journal_unfile_buffer+0x9/0x20 [jbd2]
[<ffffffffa005d8a3>] ? jbd2_journal_commit_transaction+0xba3/0x12d0 [jbd2]
[<ffffffff810542d0>] ? autoremove_wake_function+0x0/0x30
[<ffffffffa0061701>] ? kjournald2+0xb1/0x210 [jbd2]
[<ffffffff810542d0>] ? autoremove_wake_function+0x0/0x30
[<ffffffffa0061650>] ? kjournald2+0x0/0x210 [jbd2]
[<ffffffff81053e3e>] ? kthread+0x8e/0xa0
[<ffffffff81033e8d>] ? schedule_tail+0x4d/0xf0
[<ffffffff81003c94>] ? kernel_thread_helper+0x4/0x10
[<ffffffff81053db0>] ? kthread+0x0/0xa0
[<ffffffff81003c90>] ? kernel_thread_helper+0x0/0x10
---[ end trace c90e4c710c9ef513 ]---
end_request: I/O error, dev sdc, sector 0

(and plenty more dmesg lines from lvm and ext4/jbd2 screaming about the io
commands failing)

I take it that this means the drive is likely pooched? I'm going to try some
more tests, and make sure both of the WD drives are on their own power cable
first, but I'm betting now that the drive is just failing. This would make 2
out of 4 in the same batch that had issues. The first one would increase its
reallocated sector count by 4 every hour or so. Now this one fails a flush
cache command (and throws other spurious errors).

I guess it's time to break out the WD diagnostics disk.
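
For anyone reading the log above, the failed Write(10) CDB can be decoded by
hand to check that it matches the sector in the end_request line. The snippet
below is only a sketch: the function name decode_write10 is made up for this
example, and it assumes the standard 10-byte WRITE(10) layout (opcode 0x2a, a
4-byte big-endian LBA, a 2-byte transfer length).

```python
import struct

def decode_write10(cdb_hex):
    """Return (lba, block_count) from a WRITE(10) CDB given as hex bytes.

    Hypothetical helper for illustration; not part of any kernel tool.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    # Byte 0: opcode, byte 1: flags, bytes 2-5: LBA (big-endian),
    # byte 6: group number, bytes 7-8: transfer length, byte 9: control.
    _op, _flags, lba, _group, length, _ctrl = struct.unpack(">BBIBHB", cdb)
    return lba, length

# CDB copied from the "[sdc] CDB: Write(10)" line in the log.
lba, blocks = decode_write10("2a 00 00 00 07 a7 00 00 08 00")
print(lba, blocks)  # 1959 8
```

The LBA of 1959 matches "end_request: I/O error, dev sdc, sector 1959", and a
length of 8 blocks would be a single 4096-byte page if the drive reports
512-byte logical sectors (an assumption, the log does not say).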

--
Thomas Fjellstrom
[email protected]

2010-06-02 23:05:38

by Robert Hancock

Subject: Re: failed command FLUSH CACHE EXT (was: Re: via 8237 sata errors)

On Wed, Jun 2, 2010 at 3:10 PM, Thomas Fjellstrom
<[email protected]> wrote:
> Ok, more testing, I've moved the drives over to the p35 machine semi-
> permanently, and after a day or so of uptime I got some new errors:
>
> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> ata3.00: failed command: FLUSH CACHE EXT
> ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata3.00: status: { DRDY }
> ata3: hard resetting link
> ata3: link is slow to respond, please be patient (ready=0)
> ata3: SRST failed (errno=-16)
> ata3: hard resetting link
> ata3: link is slow to respond, please be patient (ready=0)
> ata3: SRST failed (errno=-16)
> ata3: hard resetting link
> ata3: link is slow to respond, please be patient (ready=0)
> ata3: SRST failed (errno=-16)
> ata3: limiting SATA link speed to 1.5 Gbps
> ata3: hard resetting link
> ata3: SRST failed (errno=-16)
> ata3: reset failed, giving up
> ata3.00: disabled
> ata3.00: device reported invalid CHS sector 0
> ata3: EH complete
> end_request: I/O error, dev sdc, sector 0
> sd 2:0:0:0: [sdc] Unhandled error code
> sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 00 00 07 a7 00 00 08 00
> end_request: I/O error, dev sdc, sector 1959
> Buffer I/O error on device dm-0, logical block 189
> lost page write due to I/O error on dm-0
> end_request: I/O error, dev sdc, sector 0
> end_request: I/O error, dev sdc, sector 0
> JBD2 unexpected failure: do_get_write_access: buffer_uptodate(jh2bh(jh));
> Possible IO failure.
>
> end_request: I/O error, dev sdc, sector 0
> end_request: I/O error, dev sdc, sector 0
> ------------[ cut here ]------------
> WARNING: at /home/damentz/src/zen/main/linux-liquorix-2.6-2.6.34/debian/build/source_amd64_none/fs/buffer.c:1199 mark_buffer_dirty+0x74/0x90()
> Hardware name: P5K SE
> Modules linked in: nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6
> acpi_cpufreq cpufreq_ondemand freq_table cpufreq_conservative
> cpufreq_userspace cpufreq_powersave af_packet ext3 jbd loop
> snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss
> snd_mixer_oss snd_pcm rtc_cmos rtc_core snd_timer tpm_tis nvidia(P) tpm
> rtc_lib tpm_bios evdev snd intel_agp pcspkr asus_atk0110 soundcore i2c_i801
> snd_page_alloc button i2c_core processor dm_mod raid10 raid456
> async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx
> raid1 raid0 multipath linear md_mod ext4 mbcache jbd2 crc16 usbhid sd_mod
> ata_generic pata_acpi uhci_hcd ata_piix libata floppy scsi_mod thermal atl1
> mii ehci_hcd [last unloaded: scsi_wait_scan]
> Pid: 3283, comm: jbd2/dm-0-8 Tainted: P 2.6.34-0.dmz.8-liquorix-amd64 #1
> Call Trace:
> [<ffffffff8103bf23>] ? warn_slowpath_common+0x73/0xb0
> [<ffffffff81101cd4>] ? mark_buffer_dirty+0x74/0x90
> [<ffffffffa005a3c9>] ? __jbd2_journal_unfile_buffer+0x9/0x20 [jbd2]
> [<ffffffffa005d8a3>] ? jbd2_journal_commit_transaction+0xba3/0x12d0 [jbd2]
> [<ffffffff810542d0>] ? autoremove_wake_function+0x0/0x30
> [<ffffffffa0061701>] ? kjournald2+0xb1/0x210 [jbd2]
> [<ffffffff810542d0>] ? autoremove_wake_function+0x0/0x30
> [<ffffffffa0061650>] ? kjournald2+0x0/0x210 [jbd2]
> [<ffffffff81053e3e>] ? kthread+0x8e/0xa0
> [<ffffffff81033e8d>] ? schedule_tail+0x4d/0xf0
> [<ffffffff81003c94>] ? kernel_thread_helper+0x4/0x10
> [<ffffffff81053db0>] ? kthread+0x0/0xa0
> [<ffffffff81003c90>] ? kernel_thread_helper+0x0/0x10
> ---[ end trace c90e4c710c9ef513 ]---
> end_request: I/O error, dev sdc, sector 0
>
> (and plenty more dmesg lines from lvm and ext4/jbd2 screaming about the io
> commands failing)
>
> I take it that this means the drive is likely pooched? I'm going to try some
> more tests, and make sure both of the WD drives are on their own power cable
> first. but I'm betting now that the drive is just failing. This would make 2
> out of 4 in the same batch that had issues. The first one would increase the
> sector reallocated count 4 every hour or so. Now this one fails a flush
> cache command (and other spurious errors).
>
> I guess its time to break out the WD diagnostics disk.

I think it's a fairly safe assumption there's something wrong with the
drive - it looks like the drive just pretty much stopped talking..
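
As a reading aid for the "res" lines in these logs: the first two hex bytes
are the ATA status and error registers. The sketch below decodes them with the
bit names libata prints in its "status:" and "error:" lines. The function
name decode_res and the two tables are written for this example only, and
only the bits libata names in such lines are included.

```python
# Bit names as libata prints them; other register bits are ignored here.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

def _names(value, table):
    """List the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]

def decode_res(res):
    """Decode the status/error bytes at the start of a libata 'res' field.

    Hypothetical helper: takes e.g. '41/04:00:08:08:30/00:00:00:00:00/e0'.
    """
    status_hex, error_hex = res.split(":", 1)[0].split("/")
    status, error = int(status_hex, 16), int(error_hex, 16)
    return _names(status, STATUS_BITS), _names(error, ERROR_BITS)

# The earlier READ DMA failure on the via machine: device reported an abort.
print(decode_res("41/04:00:08:08:30/00:00:00:00:00/e0"))
# The FLUSH CACHE EXT timeout: DRDY only, no error bits at all.
print(decode_res("40/00:00:00:00:00/00:00:00:00:00/00"))
```

The second result carries no error bits, which fits a command that timed out
(Emask 0x4) rather than one the drive actively rejected.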

2010-06-02 23:42:37

by Thomas Fjellstrom

Subject: Re: failed command FLUSH CACHE EXT (was: Re: via 8237 sata errors)

On June 2, 2010, Robert Hancock wrote:
> On Wed, Jun 2, 2010 at 3:10 PM, Thomas Fjellstrom <[email protected]> wrote:
> > Ok, more testing, I've moved the drives over to the p35 machine semi-
> > permanently, and after a day or so of uptime I got some new errors:
> >
> > ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> > ata3.00: failed command: FLUSH CACHE EXT
> > ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> > res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> > ata3.00: status: { DRDY }
> > ata3: hard resetting link
> > ata3: link is slow to respond, please be patient (ready=0)
> > ata3: SRST failed (errno=-16)
> > ata3: hard resetting link
> > ata3: link is slow to respond, please be patient (ready=0)
> > ata3: SRST failed (errno=-16)
> > ata3: hard resetting link
> > ata3: link is slow to respond, please be patient (ready=0)
> > ata3: SRST failed (errno=-16)
> > ata3: limiting SATA link speed to 1.5 Gbps
> > ata3: hard resetting link
> > ata3: SRST failed (errno=-16)
> > ata3: reset failed, giving up
> > ata3.00: disabled
> > ata3.00: device reported invalid CHS sector 0
> > ata3: EH complete
> > end_request: I/O error, dev sdc, sector 0
> > sd 2:0:0:0: [sdc] Unhandled error code
> > sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> > sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 00 00 07 a7 00 00 08 00
> > end_request: I/O error, dev sdc, sector 1959
> > Buffer I/O error on device dm-0, logical block 189
> > lost page write due to I/O error on dm-0
> > end_request: I/O error, dev sdc, sector 0
> > end_request: I/O error, dev sdc, sector 0
> > JBD2 unexpected failure: do_get_write_access: buffer_uptodate(jh2bh(jh));
> > Possible IO failure.
> >
> > end_request: I/O error, dev sdc, sector 0
> > end_request: I/O error, dev sdc, sector 0
> > ------------[ cut here ]------------
> > WARNING: at /home/damentz/src/zen/main/linux-liquorix-2.6-2.6.34/debian/build/source_amd64_none/fs/buffer.c:1199 mark_buffer_dirty+0x74/0x90()
> > Hardware name: P5K SE
> > Modules linked in: nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6
> > acpi_cpufreq cpufreq_ondemand freq_table cpufreq_conservative
> > cpufreq_userspace cpufreq_powersave af_packet ext3 jbd loop
> > snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss
> > snd_mixer_oss snd_pcm rtc_cmos rtc_core snd_timer tpm_tis nvidia(P) tpm
> > rtc_lib tpm_bios evdev snd intel_agp pcspkr asus_atk0110 soundcore
> > i2c_i801 snd_page_alloc button i2c_core processor dm_mod raid10
> > raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy
> > async_tx raid1 raid0 multipath linear md_mod ext4 mbcache jbd2 crc16
> > usbhid sd_mod ata_generic pata_acpi uhci_hcd ata_piix libata floppy
> > scsi_mod thermal atl1 mii ehci_hcd [last unloaded: scsi_wait_scan]
> > Pid: 3283, comm: jbd2/dm-0-8 Tainted: P 2.6.34-0.dmz.8-liquorix-amd64 #1
> > Call Trace:
> > [<ffffffff8103bf23>] ? warn_slowpath_common+0x73/0xb0
> > [<ffffffff81101cd4>] ? mark_buffer_dirty+0x74/0x90
> > [<ffffffffa005a3c9>] ? __jbd2_journal_unfile_buffer+0x9/0x20 [jbd2]
> > [<ffffffffa005d8a3>] ? jbd2_journal_commit_transaction+0xba3/0x12d0 [jbd2]
> > [<ffffffff810542d0>] ? autoremove_wake_function+0x0/0x30
> > [<ffffffffa0061701>] ? kjournald2+0xb1/0x210 [jbd2]
> > [<ffffffff810542d0>] ? autoremove_wake_function+0x0/0x30
> > [<ffffffffa0061650>] ? kjournald2+0x0/0x210 [jbd2]
> > [<ffffffff81053e3e>] ? kthread+0x8e/0xa0
> > [<ffffffff81033e8d>] ? schedule_tail+0x4d/0xf0
> > [<ffffffff81003c94>] ? kernel_thread_helper+0x4/0x10
> > [<ffffffff81053db0>] ? kthread+0x0/0xa0
> > [<ffffffff81003c90>] ? kernel_thread_helper+0x0/0x10
> > ---[ end trace c90e4c710c9ef513 ]---
> > end_request: I/O error, dev sdc, sector 0
> >
> > (and plenty more dmesg lines from lvm and ext4/jbd2 screaming about the
> > io commands failing)
> >
> > I take it that this means the drive is likely pooched? I'm going to try
> > some more tests, and make sure both of the WD drives are on their own
> > power cable first. but I'm betting now that the drive is just failing.
> > This would make 2 out of 4 in the same batch that had issues. The
> > first one would increase the sector reallocated count 4 every hour or
> > so. Now this one fails a flush cache command (and other spurious
> > errors).
> >
> > I guess its time to break out the WD diagnostics disk.
>
> I think it's a fairly safe assumption there's something wrong with the
> drive - it looks like the drive just pretty much stopped talking..

I've only managed to see that error once though. The last few times I've
booted that machine I get the same old DMA error messages I posted before.

Unfortunately I haven't been able to run the WD diagnostics tool on it so
far; it either takes an age to load, or hangs at the screen prior to the
license screen.

I seem to have rather bad luck with hard drives. Every time I buy more than
two, I tend to get one or two failures out of the batch. 25-50% failure rate
almost. Horrible. I at least average 1 dead hard drive a year since I got my
first computer.

--
Thomas Fjellstrom
[email protected]

2010-06-03 08:23:14

by Tejun Heo

Subject: Re: failed command FLUSH CACHE EXT

Hello,

On 06/03/2010 01:42 AM, Thomas Fjellstrom wrote:
> I seem to have rather bad luck with hard drives. Every time I buy more than
> two, I tend to get one or two failures out of the batch. 25-50% failure rate
> almost. Horrible. I at least average 1 dead hard drive a year since I got my
> first computer.

Hmmm... that sounds really high. Out of how many? Hard drives do
fail sometimes but not that easily even if you put them under relatively
heavy use 24/7. Maybe there is a common cause - say, unstable power,
vibration, impact or whatever? Or maybe the universe just doesn't
like you? :-)

Thanks.

--
tejun

2010-06-03 08:41:19

by Thomas Fjellstrom

Subject: Re: failed command FLUSH CACHE EXT

On June 3, 2010, Tejun Heo wrote:
> Hello,
>
> On 06/03/2010 01:42 AM, Thomas Fjellstrom wrote:
> > I seem to have rather bad luck with hard drives. Every time I buy more
> > than two, I tend to get one or two failures out of the batch. 25-50%
> > failure rate almost. Horrible. I at least average 1 dead hard drive a
> > year since I got my first computer.
>
> Hmmm... that sounds really high. Out of how many? Hard drives do
> fail sometimes but not that easily even if you put them under relatively
> heavy use 24/7. Maybe there is a common cause - say, unstable power,
> vibration, impact or whatever? Or maybe the universe just doesn't
> like you? :-)
>
> Thanks.

I think it's a combination of factors. But mainly, early on I was cheap. UBER
cheap, and I didn't know that a power supply is probably the /most/ important
component in a machine. If it helps, I've also had many motherboards, tons of
RAM, and some misc other parts fail over the years as well, but I've
probably gone through more hard drives than anything else.

Let me see if I can count all of them...

1x6G - failed
1x40G - failed
4x80G - 2 failed (PSU kill)
2x160G - 1 failed (used drives anyhow, don't really count)
4x320G - 2 failed (DOA, well after a week each)
4x640G - 1-2 failed (can't really remember the circumstances with these...)
6x1T - 1 failed (DOA)
4x2T - 2 failed (1 DOA, 1 after a year)

So about half.
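
The tally above can be summed up quickly. This is just a sketch of the
arithmetic using the numbers from the list; the "1-2 failed" entry for the
640G batch is kept as a lower and an upper bound, and the names BATCHES and
failure_rate_bounds are made up for the example.

```python
# (capacity, drives bought, fewest failures, most failures), from the list above.
BATCHES = [
    ("6G", 1, 1, 1),
    ("40G", 1, 1, 1),
    ("80G", 4, 2, 2),
    ("160G", 2, 1, 1),
    ("320G", 4, 2, 2),
    ("640G", 4, 1, 2),  # listed as "1-2 failed", so both bounds are kept
    ("1T", 6, 1, 1),
    ("2T", 4, 2, 2),
]

def failure_rate_bounds(batches):
    """Return (total drives, min failures, max failures, min rate, max rate)."""
    total = sum(bought for _, bought, _, _ in batches)
    low = sum(lo for _, _, lo, _ in batches)
    high = sum(hi for _, _, _, hi in batches)
    return total, low, high, low / total, high / total

total, low, high, low_rate, high_rate = failure_rate_bounds(BATCHES)
print(f"{low}-{high} failures out of {total} drives "
      f"({low_rate:.1%} to {high_rate:.1%})")
# 11-12 failures out of 26 drives (42.3% to 46.2%)
```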

I mostly blame my cheapness when I was younger. Bad PSUs can cause a lot of
problems. I've since stopped buying and using cheap power supplies, and it
has so far made a difference in the death count. If I skip the DOA drives, only
one 1TB, one 2TB and one 640GB failed. That's far better than a 50% mortality rate.

That list may not be entirely accurate, it spans over 10 years of hd use. I
can barely remember what I did the day before, let alone the details of
things that happened 10 years ago.

--
Thomas Fjellstrom
[email protected]