LinuxLists.cc - SATA error while resume

2007-08-19 09:13:21

Subject: SATA error while resume

Kernel: 2.6.23-rc2 with patches [1], but older and stable versions are also
affected.

[1] http://www.ussg.iu.edu/hypermail/linux/kernel/0708.0/2655.html
+ipw3945 and truecrypt.

Sometimes (about one resume in ten, or less often) I get this error while the
system resumes from suspend to disk:

=================
swsusp: Marking nosave pages: 000000000009f000 - 0000000000100000
swsusp: Basic memory bitmaps created
Freezing user space processes ... (elapsed 0.00 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
Loading image data pages (117687 pages)
... 0% 1% 2% 3% 4% 5% 6% 7%
8% 9% 10% 11% 12% 13% 14% 15% 16% 17%
18% 19% 20%<3>ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0
action 0x0
ata1.00: irq_stat 0x40000001
ata1.00: cmd 25/00:00:10:0b:f3/00:04:05:00:00/e0 tag 0 cdb 0x0 data
524288 in
res 51/40:a4:6c:0b:f3/00:03:05:00:00/e0 Emask 0x9 (media error)
ata1.00: configured for UDMA/100
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000001
ata1.00: cmd 25/00:00:10:0b:f3/00:04:05:00:00/e0 tag 0 cdb 0x0 data
524288 in
res 51/40:a4:6c:0b:f3/00:03:05:00:00/e0 Emask 0x9 (media error)
ata1.00: configured for UDMA/100
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000001
ata1.00: cmd 25/00:00:10:0b:f3/00:04:05:00:00/e0 tag 0 cdb 0x0 data
524288 in
res 51/40:a4:6c:0b:f3/00:03:05:00:00/e0 Emask 0x9 (media error)
ata1.00: configured for UDMA/100
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000001
ata1.00: cmd 25/00:00:10:0b:f3/00:04:05:00:00/e0 tag 0 cdb 0x0 data
524288 in
res 51/40:a4:6c:0b:f3/00:03:05:00:00/e0 Emask 0x9 (media error)
ata1.00: configured for UDMA/100
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000001
ata1.00: cmd 25/00:00:10:0b:f3/00:04:05:00:00/e0 tag 0 cdb 0x0 data
524288 in
res 51/40:a4:6c:0b:f3/00:03:05:00:00/e0 Emask 0x9 (media error)
ata1.00: configured for UDMA/100
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000001
ata1.00: cmd 25/00:00:10:0b:f3/00:04:05:00:00/e0 tag 0 cdb 0x0 data
524288 in
res 51/40:a4:6c:0b:f3/00:03:05:00:00/e0 Emask 0x9 (media error)
ata1.00: configured for UDMA/100
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
05 f3 0b 6c
sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate
failed
end_request: I/O error, dev sda, sector 99814252
Read-error on swap-device (8:0:99814256)
Read-error on swap-device (8:0:99814264)
Read-error on swap-device (8:0:99814272)
...
Read-error on swap-device (8:0:99815184)
ata1: EH complete
sd 0:0:0:0: [sda] 156301488 512-byte hardware sectors (80026 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA
sd 0:0:0:0: [sda] 156301488 512-byte hardware sectors (80026 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA
Read 470748 kbytes in 30.97 seconds (15.20 MB/s)
PM: Restore failed, recovering.
Restarting tasks ... done.
swsusp: Basic memory bitmaps freed
=================

The system then continues booting without resuming.

I used smartctl, checked the disk twice, and ran fsck and mkswap -c, and
found no errors:

=================
rutek:/home/maciek/kernel.org/libata_error# smartctl -A /dev/sda
smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   046    Pre-fail  Always       -       28879
  2 Throughput_Performance  0x0005   100   100   030    Pre-fail  Offline      -       20381999
  3 Spin_Up_Time            0x0003   100   100   025    Pre-fail  Always       -       1
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1599
  5 Reallocated_Sector_Ct   0x0033   100   100   024    Pre-fail  Always       -       8589934592000
  7 Seek_Error_Rate         0x000f   100   100   047    Pre-fail  Always       -       3713
  8 Seek_Time_Performance   0x0005   100   100   019    Pre-fail  Offline      -       0
  9 Power_On_Seconds        0x0032   096   096   000    Old_age   Always       -       0h+41m+39s
 10 Spin_Retry_Count        0x0013   100   100   020    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1354
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       65
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       2776
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       33 (Lifetime Min/Max 15/46)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       344
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       444268544
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000f   100   100   060    Pre-fail  Always       -       22830
203 Run_Out_Cancel          0x0002   100   100   000    Old_age   Always       -       2632796799455
240 Head_Flying_Hours       0x003e   200   200   000    Old_age   Always       -       0

=================

Dmesg and config:
http://www.unixy.pl/maciek/download/kernel/libata_error/

Regards
--
Maciej Rutecki
http://www.maciek.unixy.pl

2007-08-19 10:47:00

by Tejun Heo

Subject: Re: SATA error while resume

Maciek Rutecki wrote:
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> ata1.00: irq_stat 0x40000001
> ata1.00: cmd 25/00:00:10:0b:f3/00:04:05:00:00/e0 tag 0 cdb 0x0 data
> 524288 in
> res 51/40:a4:6c:0b:f3/00:03:05:00:00/e0 Emask 0x9 (media error)
> ata1.00: configured for UDMA/100
>
> The system then continues booting without resuming.
>
> I used smartctl, checked the disk twice, and ran fsck and mkswap -c, and
> found no errors:
>
> =================
>   5 Reallocated_Sector_Ct   0x0033 100 100 024 Pre-fail Always - 8589934592000
> 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age  Always - 444268544

It very much looks like the disk is dying. Dunno why it doesn't show up
during SMART testing but you better back up and contact the hardware vendor.
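
For reference, a minimal sketch of how the attributes that usually point at failing media could be pulled out of `smartctl -A` output. It assumes the unwrapped, ten-column layout that smartctl prints (the rows quoted above were wrapped in the mail), the helper name `smart_media_flags` is made up for illustration, and the sample rows are copied from the report. Raw values are vendor-specific, so very large numbers are not necessarily literal counts.

```shell
#!/bin/sh
# Hypothetical helper: print media-related SMART attributes (IDs 5, 196,
# 197, 198) whose raw value is non-zero.
# Reads `smartctl -A` output on stdin; column 1 is the ID, column 2 the
# attribute name, column 10 the raw value.
smart_media_flags() {
    awk '$1 ~ /^(5|196|197|198)$/ && $10 + 0 != 0 { print $2 "=" $10 }'
}

# Sample rows taken from the report (one attribute per line).
smart_media_flags <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   024    Pre-fail  Always       -       8589934592000
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       444268544
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
EOF
```

With the sample rows this prints the two reallocation attributes and Current_Pending_Sector, and skips Offline_Uncorrectable because its raw value is 0. On a live system the input would come from `smartctl -A /dev/sda | smart_media_flags`.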

--
tejun

2007-08-19 14:26:21

by Mark Lord

Subject: Re: SATA error while resume

Maciek Rutecki wrote:
> Kernel: 2.6.23-rc2 with patches [1], but older and stable versions also
..
> Sometimes (about one resume in ten, or less often) I get this error while the
> system resumes from suspend to disk:
..
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> ata1.00: irq_stat 0x40000001
> ata1.00: cmd 25/00:00:10:0b:f3/00:04:05:00:00/e0 tag 0 cdb 0x0 data
> 524288 in
> res 51/40:a4:6c:0b:f3/00:03:05:00:00/e0 Emask 0x9 (media error)
> ata1.00: configured for UDMA/100
> sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
> Descriptor sense data with sense descriptors (in hex):
> 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
> 05 f3 0b 6c
> sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate
> failed
> end_request: I/O error, dev sda, sector 99814252
> Read-error on swap-device (8:0:99814256)
...

Looks like a bad sector in the swap partition.
You can probably repair it by using this sequence of commands:

swapoff /dev/sdX <--- replace sdX with actual swap partition dev name
sync
cat /dev/zero > /dev/sdX
mkswap /dev/sdX
swapon /dev/sdX

If it recurs after doing that, then it's time for a new drive.
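
A rough sketch of why the log points at one specific sector: the `res` line of the libata error carries the failing LBA. Assuming the usual libata field order (status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and an LBA48 command, the address can be reassembled as below. The function name `decode_res_lba` is illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: rebuild the 48-bit LBA from a libata "res" taskfile.
# Assumed field order after splitting on "/" and ":":
#   $1=status $2=error $3=nsect $4=lbal $5=lbam $6=lbah
#   $7=hob_feature $8=hob_nsect $9=hob_lbal ${10}=hob_lbam ${11}=hob_lbah
#   ${12}=device
# The body runs in a subshell so the IFS change does not leak out.
decode_res_lba() (
    set -f              # no pathname expansion while splitting
    IFS='/:'
    set -- $1
    # Most significant byte first: hob_lbah hob_lbam hob_lbal lbah lbam lbal
    printf '%d\n' "$(( 0x${11}${10}${9}${6}${5}${4} ))"
)

# The res string from the quoted error.
decode_res_lba "51/40:a4:6c:0b:f3/00:03:05:00:00/e0"
```

For the `res` string quoted above this gives 99814252, the same sector that the end_request line reports and the same bytes (05 f3 0b 6c) that appear in the sense data.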

-ml

2007-08-19 15:16:57

by Maciej Rutecki

Subject: Re: SATA error while resume

Mark Lord wrote:

>
> Looks like a bad sector in the swap partition.
> You can probably repair it by using this sequence of commands:
>
> swapoff /dev/sdX <--- replace sdX with actual swap partition dev name
> sync
> cat /dev/zero > /dev/sdX
> mkswap /dev/sdX
> swapon /dev/sdX
>
> If it recurs after doing that, then it's time for a new drive.
>
> -ml
> -

rutek:/home/maciek# swapoff /dev/sda6
rutek:/home/maciek# sync
rutek:/home/maciek# cat /dev/zero > /dev/sda6
cat: write error: Input/output error (write error after a few minutes,
probably because sda6 is full)

rutek:/home/maciek# dd if=/dev/zero of=/dev/sda6
dd: writing to `/dev/sda6': Input/output error
5992177+0 records in
5992176+0 records out
3067994112 bytes (3.1 GB) copied, 298.159 seconds, 10.3 MB/s
rutek:/home/maciek# mkswap /dev/sda6
Setting up swapspace version 1, size = 3067990 kB
no label, UUID=2061df6e-d385-4367-9a4c-c8431e57b73a
rutek:/home/maciek# swapon /dev/sda6

dmesg:
Adding 2996080k swap on /dev/sda6. Priority:-2 extents:1 across:2996080k

I also tried:
dd if=/dev/sda... of=/dev/null for all partitions.
I tested the disk with the BIOS utility and smartctl, and ran autotest (bash
shared mapping and disktest). No errors or warnings. The error appears only
(sometimes) while the system resumes from suspend to disk. The disk is 10
months old...

Regards
--
Maciej Rutecki
http://www.unixy.pl

2007-08-19 15:49:23

by Tejun Heo

Subject: Re: SATA error while resume

Maciek Rutecki wrote:
> rutek:/home/maciek# swapoff /dev/sda6
> rutek:/home/maciek# sync
> rutek:/home/maciek# cat /dev/zero > /dev/sda6
> cat: write error: Input/output error (write error after a few minutes,
> probably because sda6 is full)
>
> rutek:/home/maciek# dd if=/dev/zero of=/dev/sda6
> dd: writing to `/dev/sda6': Input/output error
> 5992177+0 records in
> 5992176+0 records out
> 3067994112 bytes (3.1 GB) copied, 298.159 seconds, 10.3 MB/s
> rutek:/home/maciek# mkswap /dev/sda6
> Setting up swapspace version 1, size = 3067990 kB
> no label, UUID=2061df6e-d385-4367-9a4c-c8431e57b73a
> rutek:/home/maciek# swapon /dev/sda6
>
>
> dmesg:
> Adding 2996080k swap on /dev/sda6. Priority:-2 extents:1 across:2996080k
>
> I also tried:
> dd if=/dev/sda... of=/dev/null for all partitions.
> I tested the disk with the BIOS utility and smartctl, and ran autotest (bash
> shared mapping and disktest). No errors or warnings. The error appears only
> (sometimes) while the system resumes from suspend to disk. The disk is 10
> months old...

Hmmmm... Does Power-Off_Retract_Count increase after suspend/resume cycle?

--
tejun

2007-08-19 16:25:32

by Maciej Rutecki

Subject: Re: SATA error while resume

Tejun Heo wrote:

> Hmmmm... Does Power-Off_Retract_Count increase after suspend/resume cycle?
>

No.

Before:

rutek:/home/maciek# smartctl -A -d ata /dev/sda | grep Power-Off_Retract_Count
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 66

Tested:

2.6.22.1: OK (this kernel has the double disk spin-down issue).

2.6.23-rc2 with patches [1] (which prevent the double spin-down during
suspend to disk), also tested earlier [2]: OK

[1] http://www.ussg.iu.edu/hypermail/linux/kernel/0708.0/2655.html
[2] http://www.ussg.iu.edu/hypermail/linux/kernel/0708.0/2784.html

After:
rutek:/home/maciek# smartctl -A -d ata /dev/sda | grep Power-Off_Retract_Count
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 66
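
A small sketch for making the same before/after comparison automatically, assuming the one-line-per-attribute layout of `smartctl -A`. The helper name `smart_raw` is hypothetical, and the sample row from this report stands in for both readings so the snippet runs without a disk.

```shell
#!/bin/sh
# Hypothetical helper: print the RAW_VALUE (10th column) of one SMART
# attribute. $1 = attribute name; `smartctl -A` output is read from stdin.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# On a live system (needs root), around one suspend/resume cycle:
#   before=$(smartctl -A -d ata /dev/sda | smart_raw Power-Off_Retract_Count)
#   ... suspend to disk, resume ...
#   after=$(smartctl -A -d ata /dev/sda | smart_raw Power-Off_Retract_Count)
# Here the row from this report is used for both readings.
row='192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 66'
before=$(printf '%s\n' "$row" | smart_raw Power-Off_Retract_Count)
after=$(printf '%s\n' "$row" | smart_raw Power-Off_Retract_Count)

if [ "$after" -gt "$before" ]; then
    echo "Power-Off_Retract_Count increased: $before -> $after"
else
    echo "Power-Off_Retract_Count unchanged: $before"
fi
```

With the sample row this reports the counter as unchanged at 66, matching the two readings above.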

--
Maciej Rutecki
http://www.maciek.unixy.pl