2007-08-13 20:03:24

by Andreas Radke

Subject: sata drive loosing connection/resetting port

Running ArchLinux kernel 2.6.22.2 on an Abit IP35 Pro motherboard with
the Intel P35 chipset (ICH9R + JMicron), I get these entries every few
minutes:

ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400101 action 0x2 frozen
ata1.00: (irq_stat 0x08000000, interface fatal error)
ata1.00: cmd ca/00:f8:18:d2:27/00:00:00:00:00/e6 tag 0 cdb 0x0 data 126976 out
         res 50/00:00:80:2b:3d/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
ata1: soft resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
sd 0:0:0:0: [sda] 145226112 512-byte hardware sectors (74356 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA


The hard disk is a Western Digital 10k Raptor connected to the ICH9R. A
Samsung hard disk on the second port of the southbridge works without
errors. Both controllers are running in AHCI mode.

lsmod:
sr_mod 16548 0
cdrom 38312 1 sr_mod
sd_mod 25088 6
pata_jmicron 5888 0
ahci 22404 4
libata 119440 2 pata_jmicron,ahci

I've tried changing the SATA cables and also the power cable to the
disk, without success. The disk passed the Hitachi Drive Fitness Test.
Until now it had worked without any trouble on an Intel P965 ICH8
based mainboard.

Any idea how to find the cause?

Andreas Radke
ArchLinux developer/maintainer


2007-08-13 20:11:59

by Michal Piotrowski

Subject: Re: sata drive loosing connection/resetting port

Hi Andreas,

[Adding Jeff and linux-ide to CC]

On 13/08/07, Andreas Radke <[email protected]> wrote:
> running ArchLinux kernel 2.6.22.2 on a Abit IP35 Pro motherboard with
> Pro Intel P35 chipset (ICH9R + Jmicron) i have these entries all few
> minutes:
>
> ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400101 action 0x2 frozen
> ata1.00: (irq_stat 0x08000000, interface fatal error)
> ata1.00: cmd ca/00:f8:18:d2:27/00:00:00:00:00/e6 tag 0 cdb 0x0 data
> 126976 out res 50/00:00:80:2b:3d/00:00:00:00:00/e0 Emask 0x10 (ATA bus
> error) ata1: soft resetting port
> ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> ata1.00: configured for UDMA/133
> ata1: EH complete
> sd 0:0:0:0: [sda] 145226112 512-byte hardware sectors (74356 MB)
> sd 0:0:0:0: [sda] Write Protect is off
> sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't
> support DPO or FUA
>
>
> hard disc is a Western Digital 10k Raptor connected to the ICH9R. on
> the second port of the southbridge is another Samsung hard disc drive
> connected without errors. both controllers are running ahci mode.
>
> lsmod:
> sr_mod 16548 0
> cdrom 38312 1 sr_mod
> sd_mod 25088 6
> pata_jmicron 5888 0
> ahci 22404 4
> libata 119440 2 pata_jmicron,ahci
>
> i've tried changing the SATA cables and also the power wire to the
> disc without success. i checked the disc successfully with Hitachi
> drive fitness test. it has worked without any trouble so far in a Intel
> P965 ICH8 based mainboard.
>
> any idea how to find the reason?
>
> Andreas Radke
> ArchLinux developer/maintainer

Please try to CC the appropriate maintainer(s).

Regards,
Michal

--
LOG
http://www.stardust.webpages.pl/log/

2007-08-14 10:20:48

by Tejun Heo

Subject: Re: sata drive loosing connection/resetting port

Michal Piotrowski wrote:
> Hi Andreas,
>
> [Adding Jeff and linux-ide to CC]
>
> On 13/08/07, Andreas Radke <[email protected]> wrote:
>> running ArchLinux kernel 2.6.22.2 on a Abit IP35 Pro motherboard with
>> Pro Intel P35 chipset (ICH9R + Jmicron) i have these entries all few
>> minutes:
>>
>> ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400101 action 0x2 frozen
>> ata1.00: (irq_stat 0x08000000, interface fatal error)
>> ata1.00: cmd ca/00:f8:18:d2:27/00:00:00:00:00/e6 tag 0 cdb 0x0 data
>> 126976 out res 50/00:00:80:2b:3d/00:00:00:00:00/e0 Emask 0x10 (ATA bus
>> error) ata1: soft resetting port
>> ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>> ata1.00: configured for UDMA/133
>> ata1: EH complete
>> sd 0:0:0:0: [sda] 145226112 512-byte hardware sectors (74356 MB)
>> sd 0:0:0:0: [sda] Write Protect is off
>> sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
>> sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't
>> support DPO or FUA
>>
>>
>> hard disc is a Western Digital 10k Raptor connected to the ICH9R. on
>> the second port of the southbridge is another Samsung hard disc drive
>> connected without errors. both controllers are running ahci mode.
>>
>> lsmod:
>> sr_mod 16548 0
>> cdrom 38312 1 sr_mod
>> sd_mod 25088 6
>> pata_jmicron 5888 0
>> ahci 22404 4
>> libata 119440 2 pata_jmicron,ahci
>>
>> i've tried changing the SATA cables and also the power wire to the
>> disc without success. i checked the disc successfully with Hitachi
>> drive fitness test. it has worked without any trouble so far in a Intel
>> P965 ICH8 based mainboard.
>>
>> any idea how to find the reason?

Most likely a hardware issue.

* Please try a different hard drive/cable and connect to a different port.

* Get a different power supply and see whether that changes anything.

--
tejun

2007-08-14 20:19:45

by Andreas Radke

Subject: Re: sata drive loosing connection/resetting port

On Tue, 14 Aug 2007 19:15:52 +0900
Tejun Heo <[email protected]> wrote:

> Most likely a hardware issue.
>
> * Please try a different harddrive/cable and connect to a different
> port.
>
> * Get a different power supply and see whether that changes anything.
>

I already tried different ports and several SATA cables on that
mainboard, and the PSU works well. The Samsung drive works well no
matter where it is connected. I never had any issues with the Raptor
drive when it was connected to the "old" Intel P965 board. What's the
difference from ICH8 to ICH9R? Any changes in how the driver works?

I confirmed that the issue persists in kernel 2.6.23rc3 and also when
using the controller in "ide" mode together with the ata_piix module.

But one thing is really weird: I booted our installer ISO with a 2.6.22
kernel, and even after several hours the error did not appear! Can you
explain that? So it only seems to happen when filesystems are mounted
(the / partition is also on that drive).

Any ideas?

Andreas

2007-08-15 06:19:48

by Tejun Heo

Subject: Re: sata drive loosing connection/resetting port

Andreas Radke wrote:
> already tried different ports, several SATA cables on that mainboard and
> psu works well. the Samsung drive works well no matter where
> connected. never had any issues with the Raptor drive when it was
> connected to the "old" Intel P965 board before. what's the difference
> form ICH8 to ICH9R? any changes in how the driver works?
>
> i confirmed, that the issues still resist in kernel 2.6.23rc3 and also
> when using controller in "ide" mode together with ata_piix modul.
>
> but on thing that is really weird: i used our installer iso to boot
> with 2.6.22 kernel and even after several hours the error did not
> appear! can you explain that? so it only happens when filesystems are
> mounted (also / partition is on that drive).
>
> any idea?

Can you post the full boot log and the output of 'lspci -nn'?

--
tejun

2007-08-15 13:49:28

by Andreas Radke

Subject: Re: sata drive loosing connection/resetting port

On Wed, 15 Aug 2007 15:19:20 +0900
Tejun Heo <[email protected]> wrote:

> Can you post full boot log and the result of 'lspci -nn'?

Attached are one log file for the ICH9R in AHCI mode and one for IDE mode.

[root@workstation64 andyrtr]# lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation Unknown device [8086:29c0] (rev 02)
00:01.0 PCI bridge [0604]: Intel Corporation Unknown device [8086:29c1] (rev 02)
00:1a.0 USB Controller [0c03]: Intel Corporation Unknown device [8086:2937] (rev 02)
00:1a.1 USB Controller [0c03]: Intel Corporation Unknown device [8086:2938] (rev 02)
00:1a.2 USB Controller [0c03]: Intel Corporation Unknown device [8086:2939] (rev 02)
00:1a.7 USB Controller [0c03]: Intel Corporation Unknown device [8086:293c] (rev 02)
00:1c.0 PCI bridge [0604]: Intel Corporation Unknown device [8086:2940] (rev 02)
00:1c.4 PCI bridge [0604]: Intel Corporation Unknown device [8086:2948] (rev 02)
00:1d.0 USB Controller [0c03]: Intel Corporation Unknown device [8086:2934] (rev 02)
00:1d.1 USB Controller [0c03]: Intel Corporation Unknown device [8086:2935] (rev 02)
00:1d.2 USB Controller [0c03]: Intel Corporation Unknown device [8086:2936] (rev 02)
00:1d.7 USB Controller [0c03]: Intel Corporation Unknown device [8086:293a] (rev 02)
00:1e.0 PCI bridge [0604]: Intel Corporation 82801 PCI Bridge [8086:244e] (rev 92)
00:1f.0 ISA bridge [0601]: Intel Corporation Unknown device [8086:2916] (rev 02)
00:1f.2 IDE interface [0101]: Intel Corporation Unknown device [8086:2920] (rev 02)
00:1f.3 SMBus [0c05]: Intel Corporation Unknown device [8086:2930] (rev 02)
00:1f.5 IDE interface [0101]: Intel Corporation Unknown device [8086:2926] (rev 02)
01:00.0 VGA compatible controller [0300]: nVidia Corporation NV44 [GeForce 6200 TurboCache(TM)] [10de:0161] (rev a1)
03:00.0 SATA controller [0106]: JMicron Technologies, Inc. JMicron 20360/20363 AHCI Controller [197b:2363] (rev 02)
03:00.1 IDE interface [0101]: JMicron Technologies, Inc. JMicron 20360/20363 AHCI Controller [197b:2363] (rev 02)
04:01.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL-8169SC Gigabit Ethernet [10ec:8167] (rev 10)
04:02.0 FireWire (IEEE 1394) [0c00]: Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link) [104c:8023]
04:03.0 Network controller [0280]: Techsan Electronics Co Ltd B2C2 FlexCopII DVB chip / Technisat SkyStar2 DVB card [13d0:2103] (rev 02)
04:05.0 Multimedia audio controller [0401]: Creative Labs SB Audigy [1102:0004] (rev 03)
04:05.1 Input device controller [0980]: Creative Labs SB Audigy Game Port [1102:7003] (rev 03)
04:05.2 FireWire (IEEE 1394) [0c00]: Creative Labs SB Audigy FireWire Port [1102:4001]


Anything else? I'm thinking about going back to the previous mainboard and SATA drive
combination, but fixing it would be the better way.

Andy


Attachments:
(No filename) (2.71 kB)
ahci.log (61.79 kB)
ide_ck.log (49.19 kB)

2007-08-16 10:53:54

by Tejun Heo

Subject: Re: sata drive loosing connection/resetting port

Andreas Radke wrote:
[on ahci]
> Aug 13 18:00:33 workstation64 ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400101 action 0x2 frozen
> Aug 13 18:00:33 workstation64 ata1.00: (irq_stat 0x08000000, interface fatal error)
> Aug 13 18:00:33 workstation64 ata1.00: cmd ca/00:08:a0:cb:07/00:00:00:00:00/e8 tag 0 cdb 0x0 data 4096 out
> Aug 13 18:00:33 workstation64 res 50/00:00:bf:32:07/00:00:00:00:00/e8 Emask 0x10 (ATA bus error)

Errors are very consistent. Data transfer from the host to the drive
fails and SError indicates Handshake Error, Non-recovered transient data
integrity error and often Recovered Data integrity error. For some
reason, your drive doesn't like what it's hearing from the controller
and replies with R_ERR.
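
For reference, the SErr value 0x400101 from the log can be unpacked bit by bit. The following is only a minimal sketch: the bit names are taken from the SError register layout in the Serial ATA specification as I understand it, and `decode_serror` is a hypothetical helper, not part of libata.

```python
# Bit positions of the SATA SError register (ERR field in bits 0-15,
# DIAG field in bits 16-31). Names follow the Serial ATA specification.
SERR_BITS = {
    0: "Recovered data integrity error",
    1: "Recovered communications error",
    8: "Non-recovered transient data integrity error",
    9: "Non-recovered persistent communication or data integrity error",
    10: "Protocol error",
    11: "Internal error",
    16: "PhyRdy change",
    17: "Phy internal error",
    18: "Comm wake",
    19: "10b to 8b decode error",
    20: "Disparity error",
    21: "CRC error",
    22: "Handshake error",
    23: "Link sequence error",
    24: "Transport state transition error",
    25: "Unknown FIS type",
    26: "Exchanged",
}


def decode_serror(value):
    """Return the names of all SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # SErr value reported in the ahci log above.
    for name in decode_serror(0x400101):
        print(name)
```

For 0x400101 this yields bits 0, 8 and 22, which are the three conditions named above.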

[on ata_piix]
> Aug 14 23:06:01 workstation64 ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
> Aug 14 23:06:01 workstation64 ata1.00: (BMDMA stat 0x26)
> Aug 14 23:06:01 workstation64 ata1.00: cmd ca/00:08:91:0d:45/00:00:00:00:00/e4 tag 0 cdb 0x0 data 4096 out
> Aug 14 23:06:01 workstation64 res 51/84:08:91:0d:45/00:00:00:00:00/e4 Emask 0x30 (host bus error)

Error reporting when using ata_piix is very limited, so it gets
reported as a host bus error, but I think it's basically the same problem.
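
The "cmd" and "res" fields in the ata_piix log can be unpacked to check this. The sketch below assumes the libata layout `command/feature:nsect:lbal:lbam:lbah/<hob bytes>/device` (with status and error in the first two positions of "res"), LBA28 addressing, and the traditional ATA bit names; `parse_taskfile` and `flags` are hypothetical helpers written for this illustration.

```python
# Traditional ATA status and error register bit names. Some of these bits
# are obsolete or command dependent in newer ATA standards.
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 4: "DSC",
               3: "DRQ", 2: "CORR", 1: "IDX", 0: "ERR"}
ERROR_BITS = {7: "ICRC", 6: "UNC", 5: "MC", 4: "IDNF",
              3: "MCR", 2: "ABRT", 1: "TK0NF", 0: "AMNF"}


def parse_taskfile(text):
    """Split a libata 'cmd' or 'res' string into its register values."""
    first, low, _hob, device = text.split("/")
    second, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    lba28 = ((int(device, 16) & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    return {
        "first": int(first, 16),  # command for "cmd", status for "res"
        "second": second,         # feature for "cmd", error for "res"
        "nsect": nsect,
        "lba28": lba28,
    }


def flags(value, names):
    """Return the names of the bits set in an 8-bit register, high bit first."""
    return [names[bit] for bit in range(7, -1, -1) if value & (1 << bit)]


if __name__ == "__main__":
    cmd = parse_taskfile("ca/00:08:91:0d:45/00:00:00:00:00/e4")
    res = parse_taskfile("51/84:08:91:0d:45/00:00:00:00:00/e4")
    sectors = cmd["nsect"] or 256  # a count of 0 means 256 sectors in LBA28
    print("opcode 0x%02x, %d bytes, LBA 0x%x"
          % (cmd["first"], sectors * 512, cmd["lba28"]))
    print("status:", flags(res["first"], STATUS_BITS))
    print("error: ", flags(res["second"], ERROR_BITS))
```

Opcode 0xca is WRITE DMA, and 8 sectors of 512 bytes match the "data 4096 out" in the log. The error byte 0x84 has ICRC and ABRT set, that is, an interface CRC error on a write, which fits the reading that both drivers are reporting the same transmission problem.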

I don't think this is a driver issue. I have the same controller and
have never seen anything similar happen, and I have plenty of drives and
test them often. The error reports also point to transmission problems.
Your drive just doesn't like what it's hearing. Does 'smartctl -a /dev/sda'
show anything special?

--
tejun

2007-08-16 16:01:44

by Andreas Radke

Subject: Re: sata drive loosing connection/resetting port

On Thu, 16 Aug 2007 19:53:35 +0900
Tejun Heo <[email protected]> wrote:

> I don't think this is driver issue. I have the same controller and
> I've never seen similar thing happening and I have plenty of drives
> and test them often. Error reports also point to transmission
> problems. Your drive just doesn't like what it's hearing. Does
> 'smartctl -a /dev/sda' tell anything special?


[root@workstation64 andyrtr]# smartctl -a /dev/sda
smartctl version 5.37 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Raptor family
Device Model: WDC WD740ADFD-00NLR1
Serial Number: WD-WMANS1463684
Firmware Version: 20.07P20
User Capacity: 74.355.769.344 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 published, ANSI INCITS 397-2005
Local Time is: Thu Aug 16 17:53:18 2007 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (2391) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  39) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   167   166   021    Pre-fail  Always       -       2650
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       149
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       5890
 10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       149
194 Temperature_Celsius     0x0022   111   104   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       2074
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%      5855         -
# 2  Short offline       Completed without error       00%      5854         -
# 3  Short offline       Completed without error       00%      5826         -
# 4  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
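
The one attribute in this output with a nonzero raw value that relates to the interface is UDMA_CRC_Error_Count (2074). As a rough sketch of how such attributes could be picked out of `smartctl -a` output automatically: the watch list below is my own assumption, not a smartmontools convention, and `parse_attributes` and `nonzero_watched` are hypothetical helpers. Three rows from the table above serve as sample input.

```python
# Three rows copied from the attribute table above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       2074
"""

# Attributes worth a look when the raw value is nonzero (an assumption
# made for this sketch, not an official list).
WATCH = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def parse_attributes(text):
    """Map attribute name to integer raw value for each attribute row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # skip raw values that are not plain integers
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCH if attrs.get(name, 0) > 0}


if __name__ == "__main__":
    print(nonzero_watched(parse_attributes(SAMPLE)))
```

Note that the raw CRC count is cumulative over the drive's lifetime, so on its own it does not show whether the errors started on this mainboard. Comparing the value before and after a run of the kernel errors would show whether it is still climbing.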


I have already filed a support request with Western Digital, but I doubt they will know about
this particular issue or have fixed firmware for me.

In AHCI mode this error makes the system freeze completely after a few hours, without further
log entries. In IDE mode the errors continue, but the system keeps running without freezing.

Anyway, I have no more time to risk a broken filesystem, so I'll change my setup. But if you
have an idea how to fix it or want me to test something, just drop me a mail. Thanks so far.

Andreas