2008-03-20 14:18:51

by Hans-Peter Jansen

[permalink] [raw]
Subject: 2.6.24.3: regular sata drive resets - worrisome?

Hi,

since I upgraded to 2.6.24.3 on one of my production systems, I see
regular device resets like these:

Mar 20 14:33:03 lisa5 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Mar 20 14:33:03 lisa5 kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Mar 20 14:33:03 lisa5 kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 20 14:33:03 lisa5 kernel: ata2.00: status: { DRDY }
Mar 20 14:33:03 lisa5 kernel: ata2: hard resetting link
Mar 20 14:33:05 lisa5 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
Mar 20 14:33:05 lisa5 kernel: ata2.00: configured for UDMA/100
Mar 20 14:33:05 lisa5 kernel: ata2: EH complete
Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] 488397168 512-byte hardware sectors (250059 MB)
Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Write Protect is off
Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 20 14:36:11 lisa5 kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Mar 20 14:36:11 lisa5 kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Mar 20 14:36:11 lisa5 kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 20 14:36:11 lisa5 kernel: ata3.00: status: { DRDY }
Mar 20 14:36:11 lisa5 kernel: ata3: hard resetting link
Mar 20 14:36:13 lisa5 kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
Mar 20 14:36:13 lisa5 kernel: ata3.00: configured for UDMA/100
Mar 20 14:36:13 lisa5 kernel: ata3: EH complete
Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] 488397168 512-byte hardware sectors (250059 MB)
Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Write Protect is off
Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Should I be worried? smartd doesn't show anything suspicious on those.

It's been 4 samsung drives at all hanging on a sata sil 3124:

29: PCI 202.0: 0104 RAID bus controller
[Created at pci.273]
Unique ID: LHB6.kPnnhej1LTC
Parent ID: bSAa.qlvrYsPk2G3
SysFS ID: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0
SysFS BusID: 0000:02:02.0
Hardware Class: storage
Model: "Silicon Image, Inc. (formerly CMD Technology Inc) SiI 3124 PCI-X Serial ATA Controller"
Vendor: pci 0x1095 "Silicon Image, Inc. (formerly CMD Technology Inc)"
Device: pci 0x3124 "SiI 3124 PCI-X Serial ATA Controller"
SubVendor: pci 0x1095 "Silicon Image, Inc. (formerly CMD Technology Inc)"
SubDevice: pci 0x7124
Revision: 0x01
Driver: "sata_sil24"
Memory Range: 0xfc9ffc00-0xfc9ffc7f (rw,non-prefetchable)
Memory Range: 0xfc9f0000-0xfc9f7fff (rw,non-prefetchable)
I/O Ports: 0xac00-0xac0f (rw)
Memory Range: 0xfc900000-0xfc97ffff (ro,prefetchable,disabled)
IRQ: 17 (20116954 events)
Driver Info #0:
Driver Status: sata_sil24 is active
Driver Activation Cmd: "modprobe sata_sil24"
Config Status: cfg=no, avail=yes, need=no, active=unknown
Attached to: #15 (PCI bridge)

27: IDE 100.0: 10600 Disk
[Created at block.188]
Unique ID: DQI9.0o8IZwvZth7
Parent ID: LHB6.kPnnhej1LTC
SysFS ID: /block/sdb
SysFS BusID: 1:0:0:0
SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/host1/target1:0:0/1:0:0:0
Hardware Class: disk
Model: "SAMSUNG SP2504C"
Vendor: "SAMSUNG"
Device: "SP2504C"
Revision: "VT10"
Serial ID: "S09QJ1GYA03006"
Driver: "sata_sil24", "sd"
Device File: /dev/sdb
Device
Files: /dev/sdb, /dev/disk/by-path/pci-0000:02:02.0-scsi-0:0:0:0, /dev/disk/by-id/1ATA_SAMSUNG_SP2504C_S09QJ1GYA03006
Device Number: block 8:16-8:31
Geometry (Logical): CHS 30401/255/63
Size: 488397168 sectors a 512 bytes
Config Status: cfg=no, avail=yes, need=no, active=unknown
Attached to: #20 (RAID bus controller)

28: IDE 200.0: 10600 Disk
[Created at block.188]
Unique ID: gbYD.jbwYSfXQ1s1
Parent ID: LHB6.kPnnhej1LTC
SysFS ID: /block/sdc
SysFS BusID: 2:0:0:0
SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/host2/target2:0:0/2:0:0:0
Hardware Class: disk
Model: "SAMSUNG SP2504C"
Vendor: "SAMSUNG"
Device: "SP2504C"
Revision: "VT10"
Serial ID: "S09QJ1GYA03003"
Driver: "sata_sil24", "sd"
Device File: /dev/sdc
Device
Files: /dev/sdc, /dev/disk/by-path/pci-0000:02:02.0-scsi-1:0:0:0, /dev/disk/by-id/1ATA_SAMSUNG_SP2504C_S09QJ1GYA03003
Device Number: block 8:32-8:47
Geometry (Logical): CHS 30401/255/63
Size: 488397168 sectors a 512 bytes
Config Status: cfg=no, avail=yes, need=no, active=unknown
Attached to: #20 (RAID bus controller)

29: IDE 300.0: 10600 Disk
[Created at block.188]
Unique ID: 8noH.C6Upa9vrTP5
Parent ID: LHB6.kPnnhej1LTC
SysFS ID: /block/sdd
SysFS BusID: 3:0:0:0
SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/host3/target3:0:0/3:0:0:0
Hardware Class: disk
Model: "SAMSUNG SP2504C"
Vendor: "SAMSUNG"
Device: "SP2504C"
Revision: "VT10"
Serial ID: "S09QJ1GYA02991"
Driver: "sata_sil24", "sd"
Device File: /dev/sdd
Device
Files: /dev/sdd, /dev/disk/by-path/pci-0000:02:02.0-scsi-2:0:0:0, /dev/disk/by-id/1ATA_SAMSUNG_SP2504C_S09QJ1GYA02991
Device Number: block 8:48-8:63
Geometry (Logical): CHS 30401/255/63
Size: 488397168 sectors a 512 bytes
Config Status: cfg=no, avail=yes, need=no, active=unknown
Attached to: #20 (RAID bus controller)

30: IDE 400.0: 10600 Disk
[Created at block.188]
Unique ID: by2M.RNPEDP68X7E
Parent ID: LHB6.kPnnhej1LTC
SysFS ID: /block/sde
SysFS BusID: 4:0:0:0
SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/host4/target4:0:0/4:0:0:0
Hardware Class: disk
Model: "SAMSUNG SP2504C"
Vendor: "SAMSUNG"
Device: "SP2504C"
Revision: "VT10"
Serial ID: "S09QJ1GYA02990"
Driver: "sata_sil24", "sd"
Device File: /dev/sde
Device
Files: /dev/sde, /dev/disk/by-path/pci-0000:02:02.0-scsi-3:0:0:0, /dev/disk/by-id/1ATA_SAMSUNG_SP2504C_S09QJ1GYA02990
Device Number: block 8:64-8:79
Geometry (Logical): CHS 30401/255/63
Size: 488397168 sectors a 512 bytes
Config Status: cfg=no, avail=yes, need=no, active=unknown
Attached to: #20 (RAID bus controller)


2008-03-21 04:48:52

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.24.3: regular sata drive resets - worrisome?


(cc linux-ide)
(regression?)

On Thu, 20 Mar 2008 15:18:31 +0100 Hans-Peter Jansen <[email protected]> wrote:

> Hi,
>
> since I upgraded to 2.6.24.3 on one of my production systems, I see
> regular device resets like these:
>
> Mar 20 14:33:03 lisa5 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> Mar 20 14:33:03 lisa5 kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> Mar 20 14:33:03 lisa5 kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> Mar 20 14:33:03 lisa5 kernel: ata2.00: status: { DRDY }
> Mar 20 14:33:03 lisa5 kernel: ata2: hard resetting link
> Mar 20 14:33:05 lisa5 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
> Mar 20 14:33:05 lisa5 kernel: ata2.00: configured for UDMA/100
> Mar 20 14:33:05 lisa5 kernel: ata2: EH complete
> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] 488397168 512-byte hardware sectors (250059 MB)
> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Write Protect is off
> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> Mar 20 14:36:11 lisa5 kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> Mar 20 14:36:11 lisa5 kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> Mar 20 14:36:11 lisa5 kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> Mar 20 14:36:11 lisa5 kernel: ata3.00: status: { DRDY }
> Mar 20 14:36:11 lisa5 kernel: ata3: hard resetting link
> Mar 20 14:36:13 lisa5 kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
> Mar 20 14:36:13 lisa5 kernel: ata3.00: configured for UDMA/100
> Mar 20 14:36:13 lisa5 kernel: ata3: EH complete
> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] 488397168 512-byte hardware sectors (250059 MB)
> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Write Protect is off
> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
> Should I be worried? smartd doesn't show anything suspicious on those.
>
> It's been 4 samsung drives at all hanging on a sata sil 3124:
>
> 29: PCI 202.0: 0104 RAID bus controller
> [Created at pci.273]
> Unique ID: LHB6.kPnnhej1LTC
> Parent ID: bSAa.qlvrYsPk2G3
> SysFS ID: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0
> SysFS BusID: 0000:02:02.0
> Hardware Class: storage
> Model: "Silicon Image, Inc. (formerly CMD Technology Inc) SiI 3124 PCI-X Serial ATA Controller"
> Vendor: pci 0x1095 "Silicon Image, Inc. (formerly CMD Technology Inc)"
> Device: pci 0x3124 "SiI 3124 PCI-X Serial ATA Controller"
> SubVendor: pci 0x1095 "Silicon Image, Inc. (formerly CMD Technology Inc)"
> SubDevice: pci 0x7124
> Revision: 0x01
> Driver: "sata_sil24"
> Memory Range: 0xfc9ffc00-0xfc9ffc7f (rw,non-prefetchable)
> Memory Range: 0xfc9f0000-0xfc9f7fff (rw,non-prefetchable)
> I/O Ports: 0xac00-0xac0f (rw)
> Memory Range: 0xfc900000-0xfc97ffff (ro,prefetchable,disabled)
> IRQ: 17 (20116954 events)
> Driver Info #0:
> Driver Status: sata_sil24 is active
> Driver Activation Cmd: "modprobe sata_sil24"
> Config Status: cfg=no, avail=yes, need=no, active=unknown
> Attached to: #15 (PCI bridge)
>
> 27: IDE 100.0: 10600 Disk
> [Created at block.188]
> Unique ID: DQI9.0o8IZwvZth7
> Parent ID: LHB6.kPnnhej1LTC
> SysFS ID: /block/sdb
> SysFS BusID: 1:0:0:0
> SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/host1/target1:0:0/1:0:0:0
> Hardware Class: disk
> Model: "SAMSUNG SP2504C"
> Vendor: "SAMSUNG"
> Device: "SP2504C"
> Revision: "VT10"
> Serial ID: "S09QJ1GYA03006"
> Driver: "sata_sil24", "sd"
> Device File: /dev/sdb
> Device
> Files: /dev/sdb, /dev/disk/by-path/pci-0000:02:02.0-scsi-0:0:0:0, /dev/disk/by-id/1ATA_SAMSUNG_SP2504C_S09QJ1GYA03006
> Device Number: block 8:16-8:31
> Geometry (Logical): CHS 30401/255/63
> Size: 488397168 sectors a 512 bytes
> Config Status: cfg=no, avail=yes, need=no, active=unknown
> Attached to: #20 (RAID bus controller)
>
> 28: IDE 200.0: 10600 Disk
> [Created at block.188]
> Unique ID: gbYD.jbwYSfXQ1s1
> Parent ID: LHB6.kPnnhej1LTC
> SysFS ID: /block/sdc
> SysFS BusID: 2:0:0:0
> SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/host2/target2:0:0/2:0:0:0
> Hardware Class: disk
> Model: "SAMSUNG SP2504C"
> Vendor: "SAMSUNG"
> Device: "SP2504C"
> Revision: "VT10"
> Serial ID: "S09QJ1GYA03003"
> Driver: "sata_sil24", "sd"
> Device File: /dev/sdc
> Device
> Files: /dev/sdc, /dev/disk/by-path/pci-0000:02:02.0-scsi-1:0:0:0, /dev/disk/by-id/1ATA_SAMSUNG_SP2504C_S09QJ1GYA03003
> Device Number: block 8:32-8:47
> Geometry (Logical): CHS 30401/255/63
> Size: 488397168 sectors a 512 bytes
> Config Status: cfg=no, avail=yes, need=no, active=unknown
> Attached to: #20 (RAID bus controller)
>
> 29: IDE 300.0: 10600 Disk
> [Created at block.188]
> Unique ID: 8noH.C6Upa9vrTP5
> Parent ID: LHB6.kPnnhej1LTC
> SysFS ID: /block/sdd
> SysFS BusID: 3:0:0:0
> SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/host3/target3:0:0/3:0:0:0
> Hardware Class: disk
> Model: "SAMSUNG SP2504C"
> Vendor: "SAMSUNG"
> Device: "SP2504C"
> Revision: "VT10"
> Serial ID: "S09QJ1GYA02991"
> Driver: "sata_sil24", "sd"
> Device File: /dev/sdd
> Device
> Files: /dev/sdd, /dev/disk/by-path/pci-0000:02:02.0-scsi-2:0:0:0, /dev/disk/by-id/1ATA_SAMSUNG_SP2504C_S09QJ1GYA02991
> Device Number: block 8:48-8:63
> Geometry (Logical): CHS 30401/255/63
> Size: 488397168 sectors a 512 bytes
> Config Status: cfg=no, avail=yes, need=no, active=unknown
> Attached to: #20 (RAID bus controller)
>
> 30: IDE 400.0: 10600 Disk
> [Created at block.188]
> Unique ID: by2M.RNPEDP68X7E
> Parent ID: LHB6.kPnnhej1LTC
> SysFS ID: /block/sde
> SysFS BusID: 4:0:0:0
> SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/host4/target4:0:0/4:0:0:0
> Hardware Class: disk
> Model: "SAMSUNG SP2504C"
> Vendor: "SAMSUNG"
> Device: "SP2504C"
> Revision: "VT10"
> Serial ID: "S09QJ1GYA02990"
> Driver: "sata_sil24", "sd"
> Device File: /dev/sde
> Device
> Files: /dev/sde, /dev/disk/by-path/pci-0000:02:02.0-scsi-3:0:0:0, /dev/disk/by-id/1ATA_SAMSUNG_SP2504C_S09QJ1GYA02990
> Device Number: block 8:64-8:79
> Geometry (Logical): CHS 30401/255/63
> Size: 488397168 sectors a 512 bytes
> Config Status: cfg=no, avail=yes, need=no, active=unknown
> Attached to: #20 (RAID bus controller)
>

2008-03-21 18:33:18

by Roger Heflin

[permalink] [raw]
Subject: Re: 2.6.24.3: regular sata drive resets - worrisome?

Andrew Morton wrote:
> (cc linux-ide)
> (regression?)
>
> On Thu, 20 Mar 2008 15:18:31 +0100 Hans-Peter Jansen <[email protected]> wrote:
>
>> Hi,
>>
>> since I upgraded to 2.6.24.3 on one of my production systems, I see
>> regular device resets like these:

Hans,

What kernel were you using before you updated to that kernel?

>>
>> Mar 20 14:33:03 lisa5 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> Mar 20 14:33:03 lisa5 kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>> Mar 20 14:33:03 lisa5 kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
>> Mar 20 14:33:03 lisa5 kernel: ata2.00: status: { DRDY }
>> Mar 20 14:33:03 lisa5 kernel: ata2: hard resetting link
>> Mar 20 14:33:05 lisa5 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
>> Mar 20 14:33:05 lisa5 kernel: ata2.00: configured for UDMA/100
>> Mar 20 14:33:05 lisa5 kernel: ata2: EH complete
>> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] 488397168 512-byte hardware sectors (250059 MB)
>> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Write Protect is off
>> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
>> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>> Mar 20 14:36:11 lisa5 kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> Mar 20 14:36:11 lisa5 kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>> Mar 20 14:36:11 lisa5 kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
>> Mar 20 14:36:11 lisa5 kernel: ata3.00: status: { DRDY }
>> Mar 20 14:36:11 lisa5 kernel: ata3: hard resetting link
>> Mar 20 14:36:13 lisa5 kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
>> Mar 20 14:36:13 lisa5 kernel: ata3.00: configured for UDMA/100
>> Mar 20 14:36:13 lisa5 kernel: ata3: EH complete
>> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] 488397168 512-byte hardware sectors (250059 MB)
>> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Write Protect is off
>> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
>> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>>
>> Should I be worried? smartd doesn't show anything suspicious on those.
>>

Andrew,

I don't think it is a recent regression, I have seen it happening for a while on
my machine, I don't think it is causing any crashes but I am getting unexplained
events about 1x per month that appear to deadlock a number of things (machine is
up, but top won't run and vmstat actually gets a FP exception on the second
sample, and a number of other things have issues until reboot).

I have 4 identical disks, 2 on a sata_sil and 2 on another controller, the ones
on the sil controller have this behavior, I have seen it in 2.6.23.1,
FC7-2.6.23.15-80 and FC7-2.6.22.9-91. My sil is a 4-port 3114 PCI card, and my
disks are 500GB Western Digital disks. I have a fairly long run with 20-30
events on the 2 disks on the sata_sil and no events on the identical non-sil
disks that had previously been getting resets (when on the sil controller), and
since they are under software raid5 all 4 disks should have very very similar IO
loads.

Roger

2008-03-21 23:18:10

by Hans-Peter Jansen

[permalink] [raw]
Subject: Re: 2.6.24.3: regular sata drive resets - worrisome?

Am Freitag, 21. M?rz 2008 schrieb Roger Heflin:
> Andrew Morton wrote:
> > (cc linux-ide)
> > (regression?)
> >
> > On Thu, 20 Mar 2008 15:18:31 +0100 Hans-Peter Jansen <[email protected]>
wrote:
> >> Hi,
> >>
> >> since I upgraded to 2.6.24.3 on one of my production systems, I see
> >> regular device resets like these:
>
> Hans,
>
> What kernel were you using before you updated to that kernel?

You don't what to know that ;-) (cough 2.6.11 cough)

> >> Mar 20 14:33:03 lisa5 kernel: ata2.00: exception Emask 0x0 SAct 0x0
> >> SErr 0x0 action 0x2 frozen Mar 20 14:33:03 lisa5 kernel: ata2.00: cmd
> >> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 Mar 20 14:33:03 lisa5
> >> kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4
> >> (timeout) Mar 20 14:33:03 lisa5 kernel: ata2.00: status: { DRDY }
> >> Mar 20 14:33:03 lisa5 kernel: ata2: hard resetting link
> >> Mar 20 14:33:05 lisa5 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123
> >> SControl 0) Mar 20 14:33:05 lisa5 kernel: ata2.00: configured for
> >> UDMA/100 Mar 20 14:33:05 lisa5 kernel: ata2: EH complete
> >> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] 488397168 512-byte
> >> hardware sectors (250059 MB) Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0:
> >> [sdc] Write Protect is off Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0:
> >> [sdc] Mode Sense: 00 3a 00 00 Mar 20 14:33:05 lisa5 kernel: sd
> >> 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't
> >> support DPO or FUA Mar 20 14:36:11 lisa5 kernel: ata3.00: exception
> >> Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen Mar 20 14:36:11 lisa5
> >> kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 Mar 20
> >> 14:36:11 lisa5 kernel: res
> >> 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) Mar 20
> >> 14:36:11 lisa5 kernel: ata3.00: status: { DRDY }
> >> Mar 20 14:36:11 lisa5 kernel: ata3: hard resetting link
> >> Mar 20 14:36:13 lisa5 kernel: ata3: SATA link up 3.0 Gbps (SStatus 123
> >> SControl 0) Mar 20 14:36:13 lisa5 kernel: ata3.00: configured for
> >> UDMA/100 Mar 20 14:36:13 lisa5 kernel: ata3: EH complete
> >> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] 488397168 512-byte
> >> hardware sectors (250059 MB) Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0:
> >> [sdd] Write Protect is off Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0:
> >> [sdd] Mode Sense: 00 3a 00 00 Mar 20 14:36:13 lisa5 kernel: sd
> >> 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't
> >> support DPO or FUA
> >>
> >> Should I be worried? smartd doesn't show anything suspicious on those.
>
> Andrew,
>
> I don't think it is a recent regression, I have seen it happening for a
> while on my machine, I don't think it is causing any crashes but I am
> getting unexplained events about 1x per month that appear to deadlock a
> number of things (machine is up, but top won't run and vmstat actually
> gets a FP exception on the second sample, and a number of other things
> have issues until reboot).

Well, that doesn't sound reassuring, does it?

> I have 4 identical disks, 2 on a sata_sil and 2 on another controller,
> the ones on the sil controller have this behavior, I have seen it in
> 2.6.23.1, FC7-2.6.23.15-80 and FC7-2.6.22.9-91. My sil is a 4-port 3114
> PCI card, and my disks are 500GB Western Digital disks. I have a fairly
> long run with 20-30 events on the 2 disks on the sata_sil and no events
> on the identical non-sil disks that had previously been getting resets
> (when on the sil controller), and since they are under software raid5 all
> 4 disks should have very very similar IO loads.

Okay, those resets may happen without further consequences, but they're
disturbing nevertheless.

BTW, I'm preparing a hardware reorg, which will eliminate this controller
during this weekend.. Well, to be correct, the other is the one, that
really nags me for some time now (3ware 9xxx-8). Swapping both with one
Areca 1130, which performes _much_ better.

Pete

2008-03-29 12:58:52

by Tejun Heo

[permalink] [raw]
Subject: Re: 2.6.24.3: regular sata drive resets - worrisome?

Hello, Hans.

Andrew Morton wrote:
>> since I upgraded to 2.6.24.3 on one of my production systems, I see
>> regular device resets like these:
>>
>> Mar 20 14:33:03 lisa5 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> Mar 20 14:33:03 lisa5 kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>> Mar 20 14:33:03 lisa5 kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)

Ouch, timeout on FLUSH_EXT. Are all errors on cmd ea?

>> Should I be worried? smartd doesn't show anything suspicious on those.

Can you please post the result of "smartctl -a /dev/sdX"?

>> It's been 4 samsung drives at all hanging on a sata sil 3124:

FLUSH_EXT timing out usually indicates that the drive is having problem
writing out what it has in its cache to the media. There was one case
where FLUSH_EXT timeout was caused by the driver failing to switch
controller back from NCQ mode before issuing FLUSH_EXT but that was on
sata_nv. There hasn't been any similar problem on sata_sil24.

Thanks.

--
tejun

2008-03-30 00:15:21

by Hans-Peter Jansen

[permalink] [raw]
Subject: Re: 2.6.24.3: regular sata drive resets - worrisome?

Hi Tejun,

thanks for picking this issue up.

Am Samstag, 29. M?rz 2008 schrieb Tejun Heo:
> Hello, Hans.
>
> Andrew Morton wrote:
> >> since I upgraded to 2.6.24.3 on one of my production systems, I see
> >> regular device resets like these:
> >>
> >> Mar 20 14:33:03 lisa5 kernel: ata2.00: exception Emask 0x0 SAct 0x0
> >> SErr 0x0 action 0x2 frozen Mar 20 14:33:03 lisa5 kernel: ata2.00: cmd
> >> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 Mar 20 14:33:03 lisa5
> >> kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4
> >> (timeout)
>
> Ouch, timeout on FLUSH_EXT. Are all errors on cmd ea?
>
> >> Should I be worried? smartd doesn't show anything suspicious on those.
>
> Can you please post the result of "smartctl -a /dev/sdX"?

Here's the last smart report from two of the offending drives. As noted
before, I did the hardware reorganization, replaced the dog slow 3ware
9500S-8 and the SiI 3124 with a single Areca 1130 and retired the drives
for now, but a nephew already showed interest. What do you think, can I
cede those drives with a clear conscience? The Hardware_ECC_Recovered
values are really worrisome, aren't they?

sdc:
smartctl version 5.38 [i686-suse-linux-gnu] Copyright (C) 2002-7 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint P120 series
Device Model: SAMSUNG SP2504C
Serial Number: S09QJ1GYA03006
Firmware Version: VT100-33
User Capacity: 250.059.350.016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 4a
Local Time is: Sun Mar 23 01:13:37 2008 CET

==> WARNING: May need -F samsung3 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (4866) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 81) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 100 100 051 Pre-fail Always - 82
3 Spin_Up_Time 0x0007 100 100 025 Pre-fail Always - 5952
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 23
5 Reallocated_Sector_Ct 0x0033 253 253 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 253 253 051 Pre-fail Always - 0
8 Seek_Time_Performance 0x0025 253 253 015 Pre-fail Offline - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 17647
10 Spin_Retry_Count 0x0033 253 253 051 Pre-fail Always - 0
11 Calibration_Retry_Count 0x0012 253 002 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 19
190 Airflow_Temperature_Cel 0x0022 124 124 000 Old_age Always - 38
194 Temperature_Celsius 0x0022 124 124 000 Old_age Always - 38
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 162956700
196 Reallocated_Event_Count 0x0032 253 253 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 253 253 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 253 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x000a 253 100 000 Old_age Always - 0
201 Soft_Read_Error_Rate 0x000a 100 100 000 Old_age Always - 0
202 TA_Increase_Count 0x0032 253 253 000 Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 17624 -
# 2 Short offline Completed without error 00% 17601 -
# 3 Short offline Completed without error 00% 17577 -
# 4 Short offline Completed without error 00% 17553 -
# 5 Short offline Completed without error 00% 17528 -
# 6 Short offline Completed without error 00% 17504 -
# 7 Extended offline Completed without error 00% 17489 -
# 8 Short offline Completed without error 00% 17480 -
# 9 Short offline Completed without error 00% 17456 -
#10 Short offline Completed without error 00% 17432 -
#11 Short offline Completed without error 00% 17408 -
#12 Short offline Completed without error 00% 17384 -
#13 Short offline Completed without error 00% 17360 -
#14 Short offline Completed without error 00% 17336 -
#15 Extended offline Completed without error 00% 17320 -
#16 Short offline Completed without error 00% 17311 -
#17 Short offline Completed without error 00% 17287 -
#18 Short offline Completed without error 00% 17263 -
#19 Short offline Completed without error 00% 17239 -

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

sdd:

smartctl version 5.38 [i686-suse-linux-gnu] Copyright (C) 2002-7 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint P120 series
Device Model: SAMSUNG SP2504C
Serial Number: S09QJ1GYA03003
Firmware Version: VT100-33
User Capacity: 250.059.350.016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 4a
Local Time is: Sun Mar 23 01:13:38 2008 CET

==> WARNING: May need -F samsung3 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (4836) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 80) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 100 100 051 Pre-fail Always - 79
3 Spin_Up_Time 0x0007 100 100 025 Pre-fail Always - 5952
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 23
5 Reallocated_Sector_Ct 0x0033 253 253 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 253 253 051 Pre-fail Always - 0
8 Seek_Time_Performance 0x0025 253 253 015 Pre-fail Offline - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 17648
10 Spin_Retry_Count 0x0033 253 253 051 Pre-fail Always - 0
11 Calibration_Retry_Count 0x0012 253 002 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 19
190 Airflow_Temperature_Cel 0x0022 118 118 000 Old_age Always - 40
194 Temperature_Celsius 0x0022 118 118 000 Old_age Always - 40
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 162520674
196 Reallocated_Event_Count 0x0032 253 253 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 253 253 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 253 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x000a 253 100 000 Old_age Always - 0
201 Soft_Read_Error_Rate 0x000a 100 100 000 Old_age Always - 0
202 TA_Increase_Count 0x0032 253 253 000 Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 17626 -
# 2 Short offline Completed without error 00% 17602 -
# 3 Short offline Completed without error 00% 17578 -
# 4 Short offline Completed without error 00% 17554 -
# 5 Short offline Completed without error 00% 17530 -
# 6 Short offline Completed without error 00% 17506 -
# 7 Extended offline Completed without error 00% 17490 -
# 8 Short offline Completed without error 00% 17482 -
# 9 Short offline Completed without error 00% 17457 -
#10 Short offline Completed without error 00% 17433 -
#11 Short offline Completed without error 00% 17409 -
#12 Short offline Completed without error 00% 17385 -
#13 Short offline Completed without error 00% 17361 -
#14 Short offline Completed without error 00% 17337 -
#15 Extended offline Completed without error 00% 17321 -
#16 Short offline Completed without error 00% 17313 -
#17 Short offline Completed without error 00% 17289 -
#18 Short offline Completed without error 00% 17264 -
#19 Short offline Completed without error 00% 17240 -

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

> >> It's been 4 samsung drives at all hanging on a sata sil 3124:
>
> FLUSH_EXT timing out usually indicates that the drive is having problem
> writing out what it has in its cache to the media. There was one case
> where FLUSH_EXT timeout was caused by the driver failing to switch
> controller back from NCQ mode before issuing FLUSH_EXT but that was on
> sata_nv. There hasn't been any similar problem on sata_sil24.

Hmm, I didn't noticed any data distortions, and if there where, they live
on as copies in their new home..

Thanks,
Pete

2008-03-30 00:54:39

by Tejun Heo

[permalink] [raw]
Subject: Re: 2.6.24.3: regular sata drive resets - worrisome?

Hello,

Hans-Peter Jansen wrote:
>>>> Should I be worried? smartd doesn't show anything suspicious on those.
>> Can you please post the result of "smartctl -a /dev/sdX"?
>
> Here's the last smart report from two of the offending drives. As noted
> before, I did the hardware reorganization, replaced the dog slow 3ware
> 9500S-8 and the SiI 3124 with a single Areca 1130 and retired the drives
> for now, but a nephew already showed interest. What do you think, can I
> cede those drives with a clear conscience? The Hardware_ECC_Recovered
> values are really worrisome, aren't they?

Different vendors use different scales for the raw values. The value is
still pegged at the highest so it could be those raw values are okay or
that the vendor just doesn't update value field accordingly. My P120
says 0 for the raw value and 904635 for hardware ECC recovered so there
is some difference. What do other non-failing drives say about those
values?

> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> 1 Raw_Read_Error_Rate 0x000f 100 100 051 Pre-fail Always - 82
> 3 Spin_Up_Time 0x0007 100 100 025 Pre-fail Always - 5952
> 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 23
> 5 Reallocated_Sector_Ct 0x0033 253 253 010 Pre-fail Always - 0
> 7 Seek_Error_Rate 0x000f 253 253 051 Pre-fail Always - 0
> 8 Seek_Time_Performance 0x0025 253 253 015 Pre-fail Offline - 0
> 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 17647
> 10 Spin_Retry_Count 0x0033 253 253 051 Pre-fail Always - 0
> 11 Calibration_Retry_Count 0x0012 253 002 000 Old_age Always - 0
> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 19
> 190 Airflow_Temperature_Cel 0x0022 124 124 000 Old_age Always - 38
> 194 Temperature_Celsius 0x0022 124 124 000 Old_age Always - 38
> 195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 162956700
> 196 Reallocated_Event_Count 0x0032 253 253 000 Old_age Always - 0
> 197 Current_Pending_Sector 0x0012 253 253 000 Old_age Always - 0
> 198 Offline_Uncorrectable 0x0030 253 253 000 Old_age Offline - 0
> 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
> 200 Multi_Zone_Error_Rate 0x000a 253 100 000 Old_age Always - 0
> 201 Soft_Read_Error_Rate 0x000a 100 100 000 Old_age Always - 0
> 202 TA_Increase_Count 0x0032 253 253 000 Old_age Always - 0

Hmmm... If the drive is failing FLUSHs, I would expect to see elevated
reallocation counters and maybe some pending counts. Aieee.. weird.

>>>> It's been 4 samsung drives at all hanging on a sata sil 3124:
>> FLUSH_EXT timing out usually indicates that the drive is having problem
>> writing out what it has in its cache to the media. There was one case
>> where FLUSH_EXT timeout was caused by the driver failing to switch
>> controller back from NCQ mode before issuing FLUSH_EXT but that was on
>> sata_nv. There hasn't been any similar problem on sata_sil24.
>
> Hmm, I didn't noticed any data distortions, and if there where, they live
> on as copies in their new home..

It should have appeared as read errors. Maybe the drive successfully
wrote those sectors after 30+ secs timeout.

--
tejun

2008-03-30 12:00:31

by Hans-Peter Jansen

[permalink] [raw]
Subject: Re: 2.6.24.3: regular sata drive resets - worrisome?

Am Sonntag, 30. M?rz 2008 schrieb Tejun Heo:
> Hello,
>
> Hans-Peter Jansen wrote:
> >>>> Should I be worried? smartd doesn't show anything suspicious on
> >>>> those.
> >>
> >> Can you please post the result of "smartctl -a /dev/sdX"?
> >
> > Here's the last smart report from two of the offending drives. As noted
> > before, I did the hardware reorganization, replaced the dog slow 3ware
> > 9500S-8 and the SiI 3124 with a single Areca 1130 and retired the
> > drives for now, but a nephew already showed interest. What do you
> > think, can I cede those drives with a clear conscience? The
> > Hardware_ECC_Recovered values are really worrisome, aren't they?
>
> Different vendors use different scales for the raw values. The value is
> still pegged at the highest so it could be those raw values are okay or
> that the vendor just doesn't update value field accordingly. My P120
> says 0 for the raw value and 904635 for hardware ECC recovered so there
> is some difference. What do other non-failing drives say about those
> values?

The only non-failing drive was sdf as it was running in standby mode in this
md raid 5 ensemble:

20080323-011337-sdc.log:195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 162956700
20080323-011337-sdc.log:196 Reallocated_Event_Count 0x0032 253 253 000 Old_age Always - 0
20080323-011337-sdc.log:197 Current_Pending_Sector 0x0012 253 253 000 Old_age Always - 0
20080323-011337-sdc.log:198 Offline_Uncorrectable 0x0030 253 253 000 Old_age Offline - 0
20080323-011337-sdc.log:199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
20080323-011338-sdd.log:195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 162520674
20080323-011338-sdd.log:196 Reallocated_Event_Count 0x0032 253 253 000 Old_age Always - 0
20080323-011338-sdd.log:197 Current_Pending_Sector 0x0012 253 253 000 Old_age Always - 0
20080323-011338-sdd.log:198 Offline_Uncorrectable 0x0030 253 253 000 Old_age Offline - 0
20080323-011338-sdd.log:199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
20080323-011338-sde.log:195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 148429049
20080323-011338-sde.log:196 Reallocated_Event_Count 0x0032 253 253 000 Old_age Always - 0
20080323-011338-sde.log:197 Current_Pending_Sector 0x0012 253 253 000 Old_age Always - 0
20080323-011338-sde.log:198 Offline_Uncorrectable 0x0030 253 253 000 Old_age Offline - 0
20080323-011338-sde.log:199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
20080323-011339-sdf.log:195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 1559
20080323-011339-sdf.log:196 Reallocated_Event_Count 0x0032 253 253 000 Old_age Always - 0
20080323-011339-sdf.log:197 Current_Pending_Sector 0x0012 253 253 000 Old_age Always - 0
20080323-011339-sdf.log:198 Offline_Uncorrectable 0x0030 253 253 000 Old_age Offline - 0
20080323-011339-sdf.log:199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0

> Hmmm... If the drive is failing FLUSHs, I would expect to see elevated
> reallocation counters and maybe some pending counts. Aieee.. weird.

But there are no reallocations nor any pending sectors on any of them.

> >>>> It's been 4 samsung drives at all hanging on a sata sil 3124:
> >>
> >> FLUSH_EXT timing out usually indicates that the drive is having
> >> problem writing out what it has in its cache to the media. There was
> >> one case where FLUSH_EXT timeout was caused by the driver failing to
> >> switch controller back from NCQ mode before issuing FLUSH_EXT but that
> >> was on sata_nv. There hasn't been any similar problem on sata_sil24.
> >
> > Hmm, I didn't noticed any data distortions, and if there where, they
> > live on as copies in their new home..
>
> It should have appeared as read errors. Maybe the drive successfully
^^^^
write (I guess)
> wrote those sectors after 30+ secs timeout.

That would point to some driver issue, wouldn't it? Roger Heflin also
experienced similar behavior with that controller, which wasn't
reproducible with another.

I can offer to you rebuilding that md in a test environment, and giving
you access to it, if you're interested.

Anyway, thanks for caring Tejun,
Pete

2008-03-30 12:41:52

by Roger Heflin

[permalink] [raw]
Subject: Re: 2.6.24.3: regular sata drive resets - worrisome?

Hans-Peter Jansen wrote:
> Am Sonntag, 30. M?rz 2008 schrieb Tejun Heo:
>> Hello,
>>
>> Hans-Peter Jansen wrote:
>>>>>> Should I be worried? smartd doesn't show anything suspicious on
>>>>>> those.
>>>> Can you please post the result of "smartctl -a /dev/sdX"?
>>> Here's the last smart report from two of the offending drives. As noted
>>> before, I did the hardware reorganization, replaced the dog slow 3ware
>>> 9500S-8 and the SiI 3124 with a single Areca 1130 and retired the
>>> drives for now, but a nephew already showed interest. What do you
>>> think, can I cede those drives with a clear conscience? The
>>> Hardware_ECC_Recovered values are really worrisome, aren't they?
>> Different vendors use different scales for the raw values. The value is
>> still pegged at the highest so it could be those raw values are okay or
>> that the vendor just doesn't update value field accordingly. My P120
>> says 0 for the raw value and 904635 for hardware ECC recovered so there
>> is some difference. What do other non-failing drives say about those
>> values?
>
> The only non-failing drive was sdf as it was running in standby mode in this
> md raid 5 ensemble:
>
> 20080323-011337-sdc.log:195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 162956700
> 20080323-011337-sdc.log:196 Reallocated_Event_Count 0x0032 253 253 000 Old_age Always - 0
> 20080323-011337-sdc.log:197 Current_Pending_Sector 0x0012 253 253 000 Old_age Always - 0
> 20080323-011337-sdc.log:198 Offline_Uncorrectable 0x0030 253 253 000 Old_age Offline - 0
> 20080323-011337-sdc.log:199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
> 20080323-011338-sdd.log:195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 162520674
> 20080323-011338-sdd.log:196 Reallocated_Event_Count 0x0032 253 253 000 Old_age Always - 0
> 20080323-011338-sdd.log:197 Current_Pending_Sector 0x0012 253 253 000 Old_age Always - 0
> 20080323-011338-sdd.log:198 Offline_Uncorrectable 0x0030 253 253 000 Old_age Offline - 0
> 20080323-011338-sdd.log:199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
> 20080323-011338-sde.log:195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 148429049
> 20080323-011338-sde.log:196 Reallocated_Event_Count 0x0032 253 253 000 Old_age Always - 0
> 20080323-011338-sde.log:197 Current_Pending_Sector 0x0012 253 253 000 Old_age Always - 0
> 20080323-011338-sde.log:198 Offline_Uncorrectable 0x0030 253 253 000 Old_age Offline - 0
> 20080323-011338-sde.log:199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
> 20080323-011339-sdf.log:195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 1559
> 20080323-011339-sdf.log:196 Reallocated_Event_Count 0x0032 253 253 000 Old_age Always - 0
> 20080323-011339-sdf.log:197 Current_Pending_Sector 0x0012 253 253 000 Old_age Always - 0
> 20080323-011339-sdf.log:198 Offline_Uncorrectable 0x0030 253 253 000 Old_age Offline - 0
> 20080323-011339-sdf.log:199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
>
>> Hmmm... If the drive is failing FLUSHs, I would expect to see elevated
>> reallocation counters and maybe some pending counts. Aieee.. weird.
>
> But there are no reallocations nor any pending sectors on any of them.
>
>>>>>> It's been 4 samsung drives at all hanging on a sata sil 3124:
>>>> FLUSH_EXT timing out usually indicates that the drive is having
>>>> problem writing out what it has in its cache to the media. There was
>>>> one case where FLUSH_EXT timeout was caused by the driver failing to
>>>> switch controller back from NCQ mode before issuing FLUSH_EXT but that
>>>> was on sata_nv. There hasn't been any similar problem on sata_sil24.
>>> Hmm, I didn't noticed any data distortions, and if there where, they
>>> live on as copies in their new home..
>> It should have appeared as read errors. Maybe the drive successfully
> ^^^^
> write (I guess)
>> wrote those sectors after 30+ secs timeout.
>
> That would point to some driver issue, wouldn't it? Roger Heflin also
> experienced similar behavior with that controller, which wasn't
> reproducible with another.
>
> I can offer to you rebuilding that md in a test environment, and giving
> you access to it, if you're interested.
>
> Anyway, thanks for caring Tejun,
> Pete
>

Here are the errors I get, though look at it closer, I am don't appear to be
getting the reset, just this error from time to time:

sd 9:0:0:0: [sde] 976773168 512-byte hardware sectors (500108 MB)
sd 9:0:0:0: [sde] Write Protect is off
sd 9:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 9:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x280000 action 0x0
ata8.00: BMDMA2 stat 0x687d8009
ata8.00: cmd 25/00:80:a7:00:1d/00:01:1d:00:00/e0 tag 0 cdb 0x0 data 196608 in
res 51/04:8f:98:01:1d/00:00:1d:00:00/f0 Emask 0x1 (device error)
ata8.00: configured for UDMA/100
ata8: EH complete
sd 7:0:0:0: [sdd] 976773168 512-byte hardware sectors (500108 MB)
sd 7:0:0:0: [sdd] Write Protect is off
sd 7:0:0:0: [sdd] Mode Sense: 00 3a 00 00
sd 7:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO
or FUA

I have 4 identical disks, with all 4 connected to the SIL controller all give
some errors, moving 2 of the disks to a promise controller makes the errors go
away on the 2 connected to the promise controller. All drives are part of a
software raid5 array.

Startup looks like this:
sata_sil 0000:00:09.0: version 2.3
ACPI: PCI Interrupt 0000:00:09.0[A] -> GSI 16 (level, low) -> IRQ 20
sata_sil 0000:00:09.0: Applying R_ERR on DMA activate FIS errata fix
scsi7 : sata_sil
scsi8 : sata_sil
scsi9 : sata_sil
scsi10 : sata_sil
ata8: SATA max UDMA/100 cmd 0xf8942080 ctl 0xf894208a bmdma 0xf8942000 irq 20
ata9: SATA max UDMA/100 cmd 0xf89420c0 ctl 0xf89420ca bmdma 0xf8942008 irq 20
ata10: SATA max UDMA/100 cmd 0xf8942280 ctl 0xf894228a bmdma 0xf8942200 irq 20
ata11: SATA max UDMA/100 cmd 0xf89422c0 ctl 0xf89422ca bmdma 0xf8942208 irq 20

Right now I am running 2.6.23.15-80.fc7, but have also got the errors under 2.6.23.1

Roger

2008-03-31 04:33:31

by Tejun Heo

[permalink] [raw]
Subject: Re: 2.6.24.3: regular sata drive resets - worrisome?

Roger Heflin wrote:
>> The only non-failing drive was sdf as it was running in standby mode
>> in this md raid 5 ensemble:
>>
>> 20080323-011337-sdc.log:195 Hardware_ECC_Recovered 0x001a 100
>> 100 000 Old_age Always - 162956700
>> 20080323-011338-sde.log:195 Hardware_ECC_Recovered 0x001a 100
>> 100 000 Old_age Always - 148429049

Hmmm... looks similar.

>>> Hmmm... If the drive is failing FLUSHs, I would expect to see elevated
>>> reallocation counters and maybe some pending counts. Aieee.. weird.
>>
>> But there are no reallocations nor any pending sectors on any of them.

Yeah, indded.

>>>>>>> It's been 4 samsung drives at all hanging on a sata sil 3124:
>>>>> FLUSH_EXT timing out usually indicates that the drive is having
>>>>> problem writing out what it has in its cache to the media. There was
>>>>> one case where FLUSH_EXT timeout was caused by the driver failing to
>>>>> switch controller back from NCQ mode before issuing FLUSH_EXT but that
>>>>> was on sata_nv. There hasn't been any similar problem on sata_sil24.
>>>> Hmm, I didn't noticed any data distortions, and if there where, they
>>>> live on as copies in their new home..
>>> It should have appeared as read errors. Maybe the drive successfully
>> ^^^^
>> write (I guess)

I actually meant read. For the corrupted data to get transferred to
other disks, it should have been read as wrong values but such things
should never happen as ECC checks would fail.

>>> wrote those sectors after 30+ secs timeout.
>>
>> That would point to some driver issue, wouldn't it? Roger Heflin also
>> experienced similar behavior with that controller, which wasn't
>> reproducible with another.

Roger's problem is different one. I'll talk about it below.

>> I can offer to you rebuilding that md in a test environment, and
>> giving you access to it, if you're interested.

Can you hook up those failed drives to a different controller? Say,
ahci or ata_piix and put them under write load (ext3 w/ barrier=1 and
copying lots of files into it should work) and see whether the problem
reproduces?

> Here are the errors I get, though look at it closer, I am don't appear
> to be getting the reset, just this error from time to time:
>
> sd 9:0:0:0: [sde] 976773168 512-byte hardware sectors (500108 MB)
> sd 9:0:0:0: [sde] Write Protect is off
> sd 9:0:0:0: [sde] Mode Sense: 00 3a 00 00
> sd 9:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't
> support DPO or FUA
> ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x280000 action 0x0
> ata8.00: BMDMA2 stat 0x687d8009
> ata8.00: cmd 25/00:80:a7:00:1d/00:01:1d:00:00/e0 tag 0 cdb 0x0 data
> 196608 in
> res 51/04:8f:98:01:1d/00:00:1d:00:00/f0 Emask 0x1 (device error)
> ata8.00: configured for UDMA/100

That's device abort error on read. The drive just can't read sector one
of the requested sectors and it's not sata_sil24. It's a bmdma one.

> I have 4 identical disks, with all 4 connected to the SIL controller all
> give some errors, moving 2 of the disks to a promise controller makes
> the errors go away on the 2 connected to the promise controller. All
> drives are part of a software raid5 array.

Ah.. okay, sata_sil. Roger, the moving and errors are not very likely
to have anything to do with each other. The only possibility is
transmission problems but the drive didn't report transport error (ICRC)
and it's more likely that the drive was experiencing temporary failures.
It's also possible that the drive set ABRT although there was some
problem with the transport tho.

If you move the drive back to the sata_sil, do those problems appear
again? Anyways, this doesn't really have anything to do with what Hans
is seeing.

Thanks.

--
tejun