Can anyone make any sense of these SATA errors? They're killing my md RAID5
(at least the second error did).
Hard drives (ata1/sda, ata2/sdb, ata3/sdc): Seagate ST31500341AS 1.5TB SATA
Motherboard: Asus M3A78-EH with AMD 780G/SB700 chipset
SATA driver: ahci
00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode]
SMART reports no errors on the drives; short and long self-tests have been
run as well. The system is brand new.
I've read some reports about SATA 3.0 Gbps vs 1.5 Gbps problems and I'm
considering limiting the drives to 1.5 Gbps using jumpers. Would that be a
good idea?
19:24:26 ata2: exception Emask 0x50 SAct 0x0 SErr 0x90a02 action 0xe frozen
19:24:26 ata2: irq_stat 0x00400000, PHY RDY changed
19:24:26 ata2: SError: { RecovComm Persist HostInt PHYRdyChg 10B8B }
19:24:26 ata2: hard resetting link
19:24:27 ata2: SATA link down (SStatus 0 SControl 300)
19:24:30 ata2: hard resetting link
19:24:35 ata2: link is slow to respond, please be patient (ready=0)
19:24:38 ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
19:24:38 ata2.00: configured for UDMA/133
19:24:38 ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
19:24:38 ata2: irq_stat 0x00000040, connection status changed
19:24:38 ata2.00: configured for UDMA/133
19:24:38 ata2: EH complete
And then the day after:
09:07:49 ata3.00: exception Emask 0x40 SAct 0x0 SErr 0x800 action 0x6 frozen
09:07:49 ata3: SError: { HostInt }
09:07:49 ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
09:07:49 res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x44 (timeout)
09:07:49 ata3.00: status: { DRDY }
09:07:49 ata3: hard resetting link
09:07:49 ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
09:07:49 ata3.00: configured for UDMA/133
09:07:49 ata3: EH complete
09:07:49 sd 2:0:0:0: [sdc] 2930277168 512-byte hardware sectors (1500302 MB)
09:07:49 sd 2:0:0:0: [sdc] Write Protect is off
09:07:49 sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
09:07:49 sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
09:07:49 end_request: I/O error, dev sdc, sector 8
09:07:49 md: super_written gets error=-5, uptodate=0
09:07:49 raid5: Disk failure on sdc, disabling device.
09:07:49 raid5: Operation continuing on 1 devices.
For reference:
ata1: SATA max UDMA/133 abar m1024@0xf8fff800 port 0xf8fff900 irq 22
ata2: SATA max UDMA/133 abar m1024@0xf8fff800 port 0xf8fff980 irq 22
ata3: SATA max UDMA/133 abar m1024@0xf8fff800 port 0xf8fffa00 irq 22
ata4: SATA max UDMA/133 abar m1024@0xf8fff800 port 0xf8fffa80 irq 22
ata5: SATA max UDMA/133 abar m1024@0xf8fff800 port 0xf8fffb00 irq 22
ata6: SATA max UDMA/133 abar m1024@0xf8fff800 port 0xf8fffb80 irq 22
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: HPA detected: current 2930277168, native 18446744072344861488
ata1.00: ATA-8: ST31500341AS, SD17, max UDMA/133
ata1.00: 2930277168 sectors, multi 16: LBA48 NCQ (depth 31/32)
ata1.00: configured for UDMA/133
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: HPA detected: current 2930277168, native 18446744072344861488
ata2.00: ATA-8: ST31500341AS, SD17, max UDMA/133
ata2.00: 2930277168 sectors, multi 16: LBA48 NCQ (depth 31/32)
ata2.00: configured for UDMA/133
ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata3.00: HPA detected: current 2930277168, native 18446744072344861488
ata3.00: ATA-8: ST31500341AS, SD17, max UDMA/133
ata3.00: 2930277168 sectors, multi 16: LBA48 NCQ (depth 31/32)
ata3.00: configured for UDMA/133
Regards,
Oskar
On Tue, Oct 28, 2008 at 10:01 AM, Oskar Liljeblad <[email protected]> wrote:
> Can anyone make any sense of these SATA errors? They're killing my md RAID5
> (at least the second error did).
>
> Hard drives (ata1/sda, ata2/sdb, ata3/sdc): Seagate ST31500341AS 1.5TB SATA
> Motherboard: Asus M3A78-EH with AMD 780G/SB700 chipset
> SATA driver: ahci
> 00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode]
This seems to be a known issue with the 1.5TB drives and Linux (Mac users
are affected too):
http://forums.seagate.com/stx/board/message?board.id=ata_drives&thread.id=2390
http://ubuntuforums.org/showthread.php?t=933053
-Dave
Oskar Liljeblad wrote:
> Can anyone make any sense of these SATA errors? They're killing my md RAID5
> (at least the second error did).
>
> Hard drives (ata1/sda, ata2/sdb, ata3/sdc): Seagate ST31500341AS 1.5TB SATA
> Motherboard: Asus M3A78-EH with AMD 780G/SB700 chipset
> SATA driver: ahci
> 00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode]
>
> Smart reports no errors on the drives, short & long tests have been run as
> well. The system is brand new.
>
> I've read some reports about SATA 3.0 Gbps vs 1.5 Gbps problems and I'm
> considering limiting the drives to 1.5 Gbps using jumpers. Would that be a
> good idea?
>
> 19:24:26 ata2: exception Emask 0x50 SAct 0x0 SErr 0x90a02 action 0xe frozen
> 19:24:26 ata2: irq_stat 0x00400000, PHY RDY changed
> 19:24:26 ata2: SError: { RecovComm Persist HostInt PHYRdyChg 10B8B }
RecovComm: Communications between device and host temporarily lost, but
regained
Persist: Persistent communication or data integrity error
HostInt: Host bus adapter internal error
PHYRdyChg: PhyRdy signal changed state
10B8B: 10b to 8b decoding error occurred
Sounds like the drive and the controller are unhappy with each other, or
there's some kind of communications or hardware problem. Not likely a
kernel issue.
It's unclear whether limiting to 1.5 Gbps would help; you could try it and see.
Hi,
I've got this issue, and I'm involved in the thread on the Seagate
forums. I've been going through the libata code with a fine-tooth comb
to see if I can find the issue, and so far, not a lot of joy.
However, and this is more directed to Jeff Garzik, there is a minor
display bug with drives that have more than 2^31 sectors.
The message:
ata3.00: HPA detected: current 2930277168, native 18446744072344861488
is wrong: the two sector counts are calculated from different ATA
commands and are parsed differently:
- The current sector count is retrieved from the IDENTIFY result
  (words 100-103) and calculated with the ata_id_u64() macro.
- The native sector count (LBA48 max) is retrieved from the READ NATIVE
  MAX ADDRESS EXT command and calculated with the ata_tf_to_lba48()
  function.
ata_tf_to_lba48() seems to overflow when the total size is greater than
2^31 sectors, while ata_id_u64() does not.
I noticed an identical bug in the latest release of hdparm (8.9), which
even returns an identical native sector count, although hdparm gets its
information from the IDENTIFY result. I've been able to patch hdparm
to display correctly. I haven't yet tried to patch ata_tf_to_lba48()
because the data is stored differently and I haven't had time to
figure it out yet.
I have some code that shows the bug in action against the hdparm
implementation; it won't be hard to modify it to prove the bug against
the ata_tf_to_lba48() implementation, but I'm not at home at the moment
and can't send it through. I can also send through the appropriate
values for words 100-103.
All that said, this does NOT appear to be causing the issues that both
you and I are suffering from. I can't see anywhere in libata that
uses the ata_tf_to_lba48() function other than the HPA detection code,
and it seems purely display-related, although Jeff would
hopefully be able to comment further on this and on whether there could
be other code doing LBA48 calculations like this.
Cheers,
Phillip
On Wed, Oct 29, 2008 at 6:01 AM, Oskar Liljeblad <[email protected]> wrote:
>
> Can anyone make any sense of these SATA errors? They're killing my md RAID5
> (at least the second error did).
>
> Hard drives (ata1/sda, ata2/sdb, ata3/sdc): Seagate ST31500341AS 1.5TB SATA
> Motherboard: Asus M3A78-EH with AMD 780G/SB700 chipset
> SATA driver: ahci
> 00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode]
>
> Smart reports no errors on the drives, short & long tests have been run as
> well. The system is brand new.
>
> I've read some reports about SATA 3.0 Gbps vs 1.5 Gbps problems and I'm
> considering limiting the drives to 1.5 Gbps using jumpers. Would that be a
> good idea?
>
> 19:24:26 ata2: exception Emask 0x50 SAct 0x0 SErr 0x90a02 action 0xe frozen
> 19:24:26 ata2: irq_stat 0x00400000, PHY RDY changed
> 19:24:26 ata2: SError: { RecovComm Persist HostInt PHYRdyChg 10B8B }
> 19:24:26 ata2: hard resetting link
> 19:24:27 ata2: SATA link down (SStatus 0 SControl 300)
> 19:24:30 ata2: hard resetting link
> 19:24:35 ata2: link is slow to respond, please be patient (ready=0)
> 19:24:38 ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> 19:24:38 ata2.00: configured for UDMA/133
> 19:24:38 ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
> 19:24:38 ata2: irq_stat 0x00000040, connection status changed
> 19:24:38 ata2.00: configured for UDMA/133
> 19:24:38 ata2: EH complete
>
> And then the day after:
>
> 09:07:49 ata3.00: exception Emask 0x40 SAct 0x0 SErr 0x800 action 0x6 frozen
> 09:07:49 ata3: SError: { HostInt }
> 09:07:49 ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> 09:07:49 res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x44 (timeout)
> 09:07:49 ata3.00: status: { DRDY }
> 09:07:49 ata3: hard resetting link
> 09:07:49 ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> 09:07:49 ata3.00: configured for UDMA/133
> 09:07:49 ata3: EH complete
> 09:07:49 sd 2:0:0:0: [sdc] 2930277168 512-byte hardware sectors (1500302 MB)
> 09:07:49 sd 2:0:0:0: [sdc] Write Protect is off
> 09:07:49 sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
> 09:07:49 sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> 09:07:49 end_request: I/O error, dev sdc, sector 8
> 09:07:49 md: super_written gets error=-5, uptodate=0
> 09:07:49 raid5: Disk failure on sdc, disabling device.
> 09:07:49 raid5: Operation continuing on 1 devices.
>
> For reference:
>
> ata1: SATA max UDMA/133 abar m1024@0xf8fff800 port 0xf8fff900 irq 22
> ata2: SATA max UDMA/133 abar m1024@0xf8fff800 port 0xf8fff980 irq 22
> ata3: SATA max UDMA/133 abar m1024@0xf8fff800 port 0xf8fffa00 irq 22
> ata4: SATA max UDMA/133 abar m1024@0xf8fff800 port 0xf8fffa80 irq 22
> ata5: SATA max UDMA/133 abar m1024@0xf8fff800 port 0xf8fffb00 irq 22
> ata6: SATA max UDMA/133 abar m1024@0xf8fff800 port 0xf8fffb80 irq 22
> ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata1.00: HPA detected: current 2930277168, native 18446744072344861488
> ata1.00: ATA-8: ST31500341AS, SD17, max UDMA/133
> ata1.00: 2930277168 sectors, multi 16: LBA48 NCQ (depth 31/32)
> ata1.00: configured for UDMA/133
> ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata2.00: HPA detected: current 2930277168, native 18446744072344861488
> ata2.00: ATA-8: ST31500341AS, SD17, max UDMA/133
> ata2.00: 2930277168 sectors, multi 16: LBA48 NCQ (depth 31/32)
> ata2.00: configured for UDMA/133
> ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata3.00: HPA detected: current 2930277168, native 18446744072344861488
> ata3.00: ATA-8: ST31500341AS, SD17, max UDMA/133
> ata3.00: 2930277168 sectors, multi 16: LBA48 NCQ (depth 31/32)
> ata3.00: configured for UDMA/133
>
> Regards,
>
> Oskar
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
>
In ata_tf_to_lba48(), when evaluating
(tf->hob_lbal & 0xff) << 24
the expression is promoted to signed int (since int can hold all values
of u8). However, if hob_lbal is 128 or more, then it is treated as a
negative signed value and sign-extended when converted to u64 to OR into
sectors, which leads to the most significant 32 bits of sectors getting
set incorrectly.
For example, Phillip O'Donnell <[email protected]> reported
that a 1.5TB drive caused:
ata3.00: HPA detected: current 2930277168, native 18446744072344861488
where 2930277168 == 0xAEA87B30 and 18446744072344861488 == 0xffffffffaea87b30
which shows the problem when hob_lbal is 0xae.
Fix this by adding a cast to u64, just as is done for hob_lbah and
hob_lbam in the function.
Reported-by: Phillip O'Donnell <[email protected]>
Signed-off-by: Roland Dreier <[email protected]>
---
Phillip, this should fix at least your cosmetic issue; can you test it
and report back?
Thanks,
Roland
drivers/ata/libata-core.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index bbb3cae..10424ff 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -1268,7 +1268,7 @@ u64 ata_tf_to_lba48(const struct ata_taskfile *tf)
sectors |= ((u64)(tf->hob_lbah & 0xff)) << 40;
sectors |= ((u64)(tf->hob_lbam & 0xff)) << 32;
- sectors |= (tf->hob_lbal & 0xff) << 24;
+ sectors |= ((u64)(tf->hob_lbal & 0xff)) << 24;
sectors |= (tf->lbah & 0xff) << 16;
sectors |= (tf->lbam & 0xff) << 8;
sectors |= (tf->lbal & 0xff);
Hey Roland,
Sure thing - I'll give that a try tonight.
I just had a cursory glance over libata-core.c and noticed that
ata_tf_read_block() uses hob_lbal in the same uncast fashion for LBA48;
reckon that one needs patching too?
It only seems to be used in libata-scsi.c, within ata_gen_ata_sense().
Cheers,
Phillip
On Wed, Oct 29, 2008 at 12:52 PM, Roland Dreier <[email protected]> wrote:
> In ata_tf_to_lba48(), when evaluating
>
> (tf->hob_lbal & 0xff) << 24
>
> the expression is promoted to signed int (since int can hold all values
> of u8). However, if hob_lbal is 128 or more, then it is treated as a
> negative signed value and sign-extended when converted to u64 to OR into
> sectors, which leads to the most significant 32 bits of sectors getting
> set incorrectly.
>
> For example, Phillip O'Donnell <[email protected]> reported
> that a 1.5TB drive caused:
>
> ata3.00: HPA detected: current 2930277168, native 18446744072344861488
>
> where 2930277168 == 0xAEA87B30 and 18446744072344861488 == 0xffffffffaea87b30
> which shows the problem when hob_lbal is 0xae.
>
> Fix this by adding a cast to u64, just as is done for hob_lbah and
> hob_lbam in the function.
>
> Reported-by: Phillip O'Donnell <[email protected]>
> Signed-off-by: Roland Dreier <[email protected]>
> ---
> Phillip, this should fix at least your cosmetic issue; can you test it
> and report back?
>
> Thanks,
> Roland
>
> drivers/ata/libata-core.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index bbb3cae..10424ff 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -1268,7 +1268,7 @@ u64 ata_tf_to_lba48(const struct ata_taskfile *tf)
>
> sectors |= ((u64)(tf->hob_lbah & 0xff)) << 40;
> sectors |= ((u64)(tf->hob_lbam & 0xff)) << 32;
> - sectors |= (tf->hob_lbal & 0xff) << 24;
> + sectors |= ((u64)(tf->hob_lbal & 0xff)) << 24;
> sectors |= (tf->lbah & 0xff) << 16;
> sectors |= (tf->lbam & 0xff) << 8;
> sectors |= (tf->lbal & 0xff);
>
Confirmed - HPA is no longer detected on boot.
Cheers,
Phillip
On Wed, Oct 29, 2008 at 12:52 PM, Roland Dreier <[email protected]> wrote:
> In ata_tf_to_lba48(), when evaluating
>
> (tf->hob_lbal & 0xff) << 24
>
> the expression is promoted to signed int (since int can hold all values
> of u8). However, if hob_lbal is 128 or more, then it is treated as a
> negative signed value and sign-extended when converted to u64 to OR into
> sectors, which leads to the most significant 32 bits of sectors getting
> set incorrectly.
>
> For example, Phillip O'Donnell <[email protected]> reported
> that a 1.5TB drive caused:
>
> ata3.00: HPA detected: current 2930277168, native 18446744072344861488
>
> where 2930277168 == 0xAEA87B30 and 18446744072344861488 == 0xffffffffaea87b30
> which shows the problem when hob_lbal is 0xae.
>
> Fix this by adding a cast to u64, just as is done for hob_lbah and
> hob_lbam in the function.
>
> Reported-by: Phillip O'Donnell <[email protected]>
> Signed-off-by: Roland Dreier <[email protected]>
> ---
> Phillip, this should fix at least your cosmetic issue; can you test it
> and report back?
>
> Thanks,
> Roland
>
> drivers/ata/libata-core.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index bbb3cae..10424ff 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -1268,7 +1268,7 @@ u64 ata_tf_to_lba48(const struct ata_taskfile *tf)
>
> sectors |= ((u64)(tf->hob_lbah & 0xff)) << 40;
> sectors |= ((u64)(tf->hob_lbam & 0xff)) << 32;
> - sectors |= (tf->hob_lbal & 0xff) << 24;
> + sectors |= ((u64)(tf->hob_lbal & 0xff)) << 24;
> sectors |= (tf->lbah & 0xff) << 16;
> sectors |= (tf->lbam & 0xff) << 8;
> sectors |= (tf->lbal & 0xff);
>
On Tuesday, October 28, 2008 at 17:37, Robert Hancock wrote:
[..]
>> Hard drives (ata1/sda, ata2/sdb, ata3/sdc): Seagate ST31500341AS 1.5TB SATA
[..]
>> 19:24:26 ata2: exception Emask 0x50 SAct 0x0 SErr 0x90a02 action 0xe frozen
>> 19:24:26 ata2: irq_stat 0x00400000, PHY RDY changed
>> 19:24:26 ata2: SError: { RecovComm Persist HostInt PHYRdyChg 10B8B }
[..]
> Sounds like the drive and the controller are unhappy with each other, or
> there's some kind of communications or hardware problem. Not likely a
> kernel issue.
>
> It's unclear if limiting to 1.5 Gbps would help, you could try it and see..
It seems the solution is to disable write caching. So far no errors, and
other people confirm this solution.
Anyway, Seagate (unofficially) claims it's a driver issue in Linux - from
http://forums.seagate.com/stx/board/message?board.id=ata_drives&thread.id=2390&view=by_date_ascending&page=6
"We already know about the Linux issue, it is indeed a kernel error
causing the problem as it was explained to me by one of our developers."
Regards,
Oskar
> Anyway, Seagate (unofficially) claims it's a driver issue in Linux - from
> http://forums.seagate.com/stx/board/message?board.id=ata_drives&thread.id=2390&view=by_date_ascending&page=6
Well then, perhaps they would care to share the information with us 8)
The HPA size reporting one is certainly Linux; the flush cache one I don't
think is. The fact that Mac people report it and that Seagate suggests
workarounds of the form "don't use 33% of the disk" doesn't inspire confidence.
> "We already know about the Linux issue, it is indeed a kernel error
> causing the problem as it was explained to me by one of our developers."
Well, if they'd care to explain it to linux-ide, perhaps we can find a
workaround. I would be cautious about disabling write caching, as it will
harm both performance and probably drive lifetime.
Alan
Alan Cox wrote:
>> Anyway, Seagate (unofficially) claims it's a driver issue in Linux - from
>
>> http://forums.seagate.com/stx/board/message?board.id=ata_drives&thread.id=2390&view=by_date_ascending&page=6
>>
>
> Well then perhaps they would care to share the information with us 8)
>
> The HPA size reporting one is certainly Linux; the flush cache one I don't
> think is. The fact that Mac people report it and that Seagate suggests
> workarounds of the form "don't use 33% of the disk" doesn't inspire confidence.
>
I suspect that the drive is simply choking on the barrier-related cache
flushing that we do; that seemed to be the Mac OS error as well. The
Windows comment suggested that Windows had an HBA/driver bug (most
likely unrelated to this).
If you want to avoid the issue until they fix the drive, you can run
fast and dangerous (mount with barriers disabled) or slow and safe (disable
the write cache).
>
>> "We already know about the Linux issue, it is indeed a kernel error
>> causing the problem as it was explained to me by one of our developers."
>>
>
> Well if they'd care to explain it to linux-ide perhaps we can find a work
> around. I would be cautious about disabling the write caching as it will
> harm both performance and probably drive lifetime.
>
> Alan
>
This looks to me to be a drive firmware issue. I would wait until
someone can test with their announced firmware upgrade before looking at
the kernel :-)
ric
Hi Ric,
Not too sure about that - I run XFS, which announces that it
disables barriers on my devices (I use LVM on top of them), but I
still get the same issue... Unless I've misunderstood your comment?
Cheers,
Phillip
> I suspect that the drive is simply choking on the barrier related cache
> flushing that we do - that seemed to be the MacOS error as well. The windows
> comment suggested that windows had an hba/driver bug (most likely unrelated
> to this).
>
> If you want to avoid the issue until they fix the drive, you could run fast
> and dangerous (mount without barriers on) or slow and safe (disable the
> write cache).
Phillip O'Donnell wrote:
> Hi Ric,
>
> Not too sure about that - I run with XFS, which announces that it
> disables the barriers on my devices (I use LVM on top of them) but
> still get the same issue... Unless I've misunderstood your comment?
>
> Cheers,
> Phillip
>
XFS has a different issue with barriers that was recently fixed.
I am just going based on what I read at the Seagate customer site - it
looks like the hang was during processing of the ATA FLUSH CACHE EXT
command.
New drives are routinely buggy to some degree, especially ones that jump
up in capacity :-) Seagate has a well-earned reputation for quality, and
I will be surprised if they don't fix this issue soon.
Ric
>
>> I suspect that the drive is simply choking on the barrier related cache
>> flushing that we do - that seemed to be the MacOS error as well. The windows
>> comment suggested that windows had an hba/driver bug (most likely unrelated
>> to this).
>>
>> If you want to avoid the issue until they fix the drive, you could run fast
>> and dangerous (mount without barriers on) or slow and safe (disable the
>> write cache).
>>
Roland Dreier wrote:
> In ata_tf_to_lba48(), when evaluating
>
> (tf->hob_lbal & 0xff) << 24
>
> the expression is promoted to signed int (since int can hold all values
> of u8). However, if hob_lbal is 128 or more, then it is treated as a
> negative signed value and sign-extended when converted to u64 to OR into
> sectors, which leads to the most significant 32 bits of sectors getting
> set incorrectly.
>
> For example, Phillip O'Donnell <[email protected]> reported
> that a 1.5TB drive caused:
>
> ata3.00: HPA detected: current 2930277168, native 18446744072344861488
>
> where 2930277168 == 0xAEA87B30 and 18446744072344861488 == 0xffffffffaea87b30
> which shows the problem when hob_lbal is 0xae.
>
> Fix this by adding a cast to u64, just as is done for hob_lbah and
> hob_lbam in the function.
>
> Reported-by: Phillip O'Donnell <[email protected]>
> Signed-off-by: Roland Dreier <[email protected]>
This should be pushed to -stable as well once it's merged.
Roland Dreier wrote:
> In ata_tf_to_lba48(), when evaluating
>
> (tf->hob_lbal & 0xff) << 24
>
> the expression is promoted to signed int (since int can hold all values
> of u8). However, if hob_lbal is 128 or more, then it is treated as a
> negative signed value and sign-extended when converted to u64 to OR into
> sectors, which leads to the most significant 32 bits of sectors getting
> set incorrectly.
>
> For example, Phillip O'Donnell <[email protected]> reported
> that a 1.5TB drive caused:
>
> ata3.00: HPA detected: current 2930277168, native 18446744072344861488
>
> where 2930277168 == 0xAEA87B30 and 18446744072344861488 == 0xffffffffaea87b30
> which shows the problem when hob_lbal is 0xae.
>
> Fix this by adding a cast to u64, just as is done for hob_lbah and
> hob_lbam in the function.
>
> Reported-by: Phillip O'Donnell <[email protected]>
> Signed-off-by: Roland Dreier <[email protected]>
> ---
> Phillip, this should fix at least your cosmetic issue; can you test it
> and report back?
>
> Thanks,
> Roland
>
> drivers/ata/libata-core.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
applied
Phillip O'Donnell <[email protected]> pointed out that the same
sign extension bug that was fixed in commit ba14a9c2 ("libata: Avoid
overflow in ata_tf_to_lba48() when tf->hba_lbal > 127") also appears to
exist in ata_tf_read_block(). Fix this by adding a cast to u64.
Signed-off-by: Roland Dreier <[email protected]>
---
I don't have any way to test this -- I guess you would have to get an
error on a block above 2^31 (i.e. data above 1TB)? But it looks "obviously
correct" enough to add to -next, I guess.
drivers/ata/libata-core.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 622350d..a6ad862 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -612,7 +612,7 @@ u64 ata_tf_read_block(struct ata_taskfile *tf, struct ata_device *dev)
if (tf->flags & ATA_TFLAG_LBA48) {
block |= (u64)tf->hob_lbah << 40;
block |= (u64)tf->hob_lbam << 32;
- block |= tf->hob_lbal << 24;
+ block |= (u64)tf->hob_lbal << 24;
} else
block |= (tf->device & 0xf) << 24;
Thanks Roland,
Right now, my observations indicate that my original fault occurs when
a command (of any type, e.g. read or flush...) times out, which
should trigger the sense routines. The next thing I need to do is add some
debug code to identify whether this only occurs if the timeout happens
on a sector above 2^31.
I've identified a reasonably reliable testcase for my fault, so I'll
apply that patch and see if it occurs.
I'll let you know how it pans out.
Cheers,
Phillip
On Wed, Nov 5, 2008 at 7:34 AM, Roland Dreier <[email protected]> wrote:
>
> Phillip O'Donnell <[email protected]> pointed out that the same
> sign extension bug that was fixed in commit ba14a9c2 ("libata: Avoid
> overflow in ata_tf_to_lba48() when tf->hba_lbal > 127") also appears to
> exist in ata_tf_read_block(). Fix this by adding a cast to u64.
>
> Signed-off-by: Roland Dreier <[email protected]>
> ---
> I don't have any way to test this -- I guess you would have to get an
> error on a block above 2G (ie data above 1TB)? But it looks "obviously
> correct" enough to add to -next I guess.
>
> drivers/ata/libata-core.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index 622350d..a6ad862 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -612,7 +612,7 @@ u64 ata_tf_read_block(struct ata_taskfile *tf, struct ata_device *dev)
> if (tf->flags & ATA_TFLAG_LBA48) {
> block |= (u64)tf->hob_lbah << 40;
> block |= (u64)tf->hob_lbam << 32;
> - block |= tf->hob_lbal << 24;
> + block |= (u64)tf->hob_lbal << 24;
> } else
> block |= (tf->device & 0xf) << 24;
>
On Wed, 2008-10-29 at 18:37 -0400, Ric Wheeler wrote:
> Phillip O'Donnell wrote:
<snip>
> I am just going based on what I read at the Seagate customer site - it
> looks like the hang was during the processing of the ATA_CACHE_FLUSH_EXT
> command.
>
> New drives are routinely buggy to some degree, especially ones that jump
> up in capacity :-) Seagate has a well earned reputation for quality and
> I will be surprised if they don't fix this issue soon,
Is there any new information on this? So far the only thing I can find
is people reporting the issue, but no word from Seagate.
>
> Ric
>
> >
> >> I suspect that the drive is simply choking on the barrier related cache
> >> flushing that we do - that seemed to be the MacOS error as well. The windows
> >> comment suggested that windows had an hba/driver bug (most likely unrelated
> >> to this).
> >>
> >> If you want to avoid the issue until they fix the drive, you could run fast
> >> and dangerous (mount without barriers on) or slow and safe (disable the
> >> write cache).
> >>
>
Kasper Sandberg wrote:
> On Wed, 2008-10-29 at 18:37 -0400, Ric Wheeler wrote:
>
>> Phillip O'Donnell wrote:
>>
> <snip>
>
>> I am just going based on what I read at the Seagate customer site - it
>> looks like the hang was during the processing of the ATA_CACHE_FLUSH_EXT
>> command.
>>
>> New drives are routinely buggy to some degree, especially ones that jump
>> up in capacity :-) Seagate has a well earned reputation for quality and
>> I will be surprised if they don't fix this issue soon,
>>
>
> Is there any new information on this? so far the only thing i can find
> seems to be people reporting the issue, but no word from seagate..
>
I don't actually have one of these drives, so I don't have any updates,
sorry.
There was a recent patch to correctly calculate sector numbers for these
disks, but I am not sure that this was the same issue you saw...
ric
>
>> Ric
>>
>>
>>>
>>>
>>>> I suspect that the drive is simply choking on the barrier related cache
>>>> flushing that we do - that seemed to be the MacOS error as well. The windows
>>>> comment suggested that windows had an hba/driver bug (most likely unrelated
>>>> to this).
>>>>
>>>> If you want to avoid the issue until they fix the drive, you could run fast
>>>> and dangerous (mount without barriers on) or slow and safe (disable the
>>>> write cache).
>>>>
>>>>
>>
>
>
Roland Dreier wrote:
> Phillip O'Donnell <[email protected]> pointed out that the same
> sign extension bug that was fixed in commit ba14a9c2 ("libata: Avoid
> overflow in ata_tf_to_lba48() when tf->hba_lbal > 127") also appears to
> exist in ata_tf_read_block(). Fix this by adding a cast to u64.
>
> Signed-off-by: Roland Dreier <[email protected]>
> ---
> I don't have any way to test this -- I guess you would have to get an
> error on a block above 2G (ie data above 1TB)? But it looks "obviously
> correct" enough to add to -next I guess.
>
> drivers/ata/libata-core.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index 622350d..a6ad862 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -612,7 +612,7 @@ u64 ata_tf_read_block(struct ata_taskfile *tf, struct ata_device *dev)
> if (tf->flags & ATA_TFLAG_LBA48) {
> block |= (u64)tf->hob_lbah << 40;
> block |= (u64)tf->hob_lbam << 32;
> - block |= tf->hob_lbal << 24;
> + block |= (u64)tf->hob_lbal << 24;
> } else
> block |= (tf->device & 0xf) << 24;
applied