2009-03-25 05:19:28

by Niel Lambrechts

Subject: Re: 2.6.29 regression: ATA bus errors on resume (was: EXT4: __ext4_get_inode_loc errors after s2disk)

00:00.0 Host bridge: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub (rev 07)
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:03.0 Communication controller: Intel Corporation Mobile 4 Series Chipset MEI Controller (rev 07)
00:19.0 Ethernet controller: Intel Corporation 82567LM Gigabit Network Connection (rev 03)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03)
00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 03)
00:1c.1 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 2 (rev 03)
00:1c.3 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 4 (rev 03)
00:1c.4 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 5 (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 93)
00:1f.0 ISA bridge: Intel Corporation ICH9M-E LPC Interface Controller (rev 03)
00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI Controller (rev 03)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 03)
03:00.0 Network controller: Intel Corporation Wireless WiFi Link 5100
15:00.0 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev ba)
15:00.1 FireWire (IEEE 1394): Ricoh Co Ltd R5C832 IEEE 1394 Controller (rev 04)
15:00.2 SD Host controller: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host Adapter (rev 21)
15:00.3 System peripheral: Ricoh Co Ltd R5C843 MMC Host Controller (rev 11)
15:00.4 System peripheral: Ricoh Co Ltd R5C592 Memory Stick Bus Host Adapter (rev 11)
15:00.5 System peripheral: Ricoh Co Ltd xD-Picture Card Controller (rev 11)


Attachments:
messages.2.6.29.txt (39.19 kB)
messages.2.6.28.txt (12.33 kB)
dmidecode.txt (13.98 kB)
lspci.txt (2.50 kB)

2009-03-25 06:07:30

by Jeff Garzik

Subject: Re: 2.6.29 regression: ATA bus errors on resume

Niel Lambrechts wrote:
> On 03/25/2009 03:30 AM, Theodore Tso wrote:
>> On Tue, Mar 24, 2009 at 10:25:57PM +0200, Niel Lambrechts wrote:
>>> Hi,
>>>
>>> After upgrading to 2.6.29 I get the below errors after resuming from
>>> hibernating with s2disk. I ran fsck and tried doing the same thing again
>>> in 2.6.28.9-pae, but do not get any errors there.
>> The ext4 errors are interleaved with hardware errors, and the ext4
>> errors are about I/O errors.
>>
>> EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
>> EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
>>
>> This looks more like a hibernation problem than an ext4 problem.
>> Looks like the hard drive is being left in some inconsistent state
>> after resuming from hibernation.
>>
>> - Ted
>
> Thanks for the info Theodore, this definitely looks like some type of
> regression in 2.6.29, as the problem is not evident when I s2disk using
> 2.6.28.9, even after multiple suspend/resume cycles.
>
> I found some 'ATA bus errors' and 'SError' messages in
> /var/log/messages, so I've attached the messages from both 2.6.29 and
> 2.6.28 for comparison.

Well, here is the interpretation of messages:

> ata1.00: irq_stat 0x00400008, PHY RDY changed
> ata1: SError: { PHYRdyChg CommWake }

Your SATA hardware flags a connect-or-disconnect event ("PHY RDY"),
which requires us to abort a bunch of queued commands:

> ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
> res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
[...]
> ata1.00: cmd 60/30:68:07:b3:10/00:00:0c:00:00/40 tag 13 ncq 24576 in
> res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)

...through the 14th command (tag 13).
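
As a reading aid for the cmd lines quoted above, here is a minimal sketch that splits a libata taskfile dump into named bytes and decodes an NCQ read. It assumes the usual libata field order (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and the NCQ convention that the sector count travels in the feature bytes while the tag sits in the upper five bits of nsect. The helper names parse_taskfile and decode_ncq_read are made up for illustration.

```python
import re

# Assumed libata print order for a taskfile dump.
FIELDS = ("command", "feature", "nsect", "lbal", "lbam", "lbah",
          "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah",
          "device")


def parse_taskfile(text):
    """Split 'cc/ff:nn:ll:mm:hh/..../dd' into a dict of named byte values."""
    values = [int(byte, 16) for byte in re.split(r"[/:]", text.strip())]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d bytes, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


def decode_ncq_read(text, sector_size=512):
    """Decode tag, LBA and transfer length of an NCQ command (e.g. 0x60)."""
    tf = parse_taskfile(text)
    # 48-bit LBA: the hob_* bytes hold the upper 24 bits.
    lba = (tf["hob_lbah"] << 40 | tf["hob_lbam"] << 32 | tf["hob_lbal"] << 24
           | tf["lbah"] << 16 | tf["lbam"] << 8 | tf["lbal"])
    # For NCQ the sector count is carried in the feature bytes.
    sectors = tf["hob_feature"] << 8 | tf["feature"]
    return {
        "command": tf["command"],
        "tag": tf["nsect"] >> 3,      # tag lives in bits 7:3 of nsect
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * sector_size,
    }


first = decode_ncq_read("60/18:00:77:88:6f/00:00:0e:00:00/40")
last = decode_ncq_read("60/30:68:07:b3:10/00:00:0c:00:00/40")
print(first)
print(last)
```

Under those assumptions the first command decodes to tag 0, 12288 bytes at LBA 242190455, which is the sector named in the end_request line of the log, and the last one decodes to tag 13, 24576 bytes.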

> Mar 24 21:29:14 linux-7vph kernel: ata1: hard resetting link
> Mar 24 21:29:14 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133


SATA link is reset, and ACPI is re-run.
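
For anyone reading the "SStatus 113 SControl 300" part of the link-up line, the following sketch splits the two SATA status/control registers into their DET, SPD and IPM nibbles. The lookup tables follow my reading of the SATA register layout and cover only the common values; the function name split_scr is made up for illustration, and the log values are treated as hex, which is how libata prints them.

```python
# Meaning of the nibbles in SStatus (subset of common values).
DET_STATUS = {
    0: "no device detected",
    1: "device detected, no PHY communication",
    3: "device detected, PHY communication established",
    4: "PHY offline",
}
SPD_STATUS = {0: "no negotiated speed", 1: "Gen1 (1.5 Gbps)", 2: "Gen2 (3.0 Gbps)"}
IPM_STATUS = {0: "not present", 1: "active", 2: "partial", 6: "slumber"}

# In SControl the IPM nibble restricts power-state transitions instead.
IPM_CONTROL = {
    0: "no restrictions",
    1: "partial disabled",
    2: "slumber disabled",
    3: "partial and slumber disabled",
}


def split_scr(text):
    """Split a hex SStatus/SControl value into its three low nibbles."""
    value = int(text, 16)
    return {"det": value & 0xF, "spd": (value >> 4) & 0xF, "ipm": (value >> 8) & 0xF}


sstatus = split_scr("113")
scontrol = split_scr("300")
print("SStatus :", DET_STATUS[sstatus["det"]], "|",
      SPD_STATUS[sstatus["spd"]], "|", IPM_STATUS[sstatus["ipm"]])
print("SControl: IPM", IPM_CONTROL[scontrol["ipm"]])
```

Read this way, the link came back at Gen1 speed with the interface active, and the host has link power management transitions disabled.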

> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
> Mar 24 21:29:14 linux-7vph kernel: Descriptor sense data with sense descriptors (in hex):
> Mar 24 21:29:14 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
> Mar 24 21:29:14 linux-7vph kernel: 0c 10 b3 07
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
> Mar 24 21:29:14 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190455

The SCSI subsystem aborts each of the queued commands.

> Mar 24 21:29:14 linux-7vph kernel: ata1: EH complete

SATA error handling completes.

> Mar 24 21:29:14 linux-7vph kernel: EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2330178, block=9338883

ext4 pukes, because of SATA/SCSI errors

> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors: (200 GB/186 GiB)
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Write Protect is off
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors: (200 GB/186 GiB)
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Write Protect is off
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

SCSI finished re-initialization.

So from this, a few observations and open questions:

Why didn't these SCSI commands get retried?

Were they left over from prior to resume?

Did SCSI error out the commands too soon? The probing is not complete
until AFTER the sector errors, I note.

Did the system resume before ACPI resume and SCSI resume completed? It
sure looks that way, from the log.

Does the asynchronous probing play a role here?

Jeff


2009-03-25 21:40:35

by Niel Lambrechts

Subject: Re: 2.6.29 regression: ATA bus errors on resume

Mar 25 20:01:51 linux-7vph kernel: Syncing filesystems ... done.
Mar 25 20:01:51 linux-7vph kernel: Freezing user space processes ... (elapsed 0.00 seconds) done.
Mar 25 20:01:51 linux-7vph kernel: Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
Mar 25 20:01:51 linux-7vph kernel: PM: Shrinking memory... done (0 pages freed)
Mar 25 20:01:51 linux-7vph kernel: PM: Freed 0 kbytes in 0.14 seconds (0.00 MB/s)
Mar 25 20:01:51 linux-7vph kernel: Suspending console(s) (use no_console_suspend to debug)
Mar 25 20:01:51 linux-7vph kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Mar 25 20:01:51 linux-7vph kernel: ACPI handle has no context!
Mar 25 20:01:51 linux-7vph kernel: iwlagn 0000:03:00.0: PCI INT A disabled
Mar 25 20:01:51 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D disabled
Mar 25 20:01:51 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C disabled
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B disabled
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A disabled
Mar 25 20:01:51 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B disabled
Mar 25 20:01:51 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D disabled
Mar 25 20:01:51 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C disabled
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B disabled
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A disabled
Mar 25 20:01:51 linux-7vph kernel: e1000e 0000:00:19.0: PME# enabled
Mar 25 20:01:51 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability enabled by ACPI
Mar 25 20:01:51 linux-7vph kernel: e1000e 0000:00:19.0: PME# enabled
Mar 25 20:01:51 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability enabled by ACPI
Mar 25 20:01:51 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A disabled
Mar 25 20:01:51 linux-7vph kernel: ACPI: Preparing to enter system sleep state S4
Mar 25 20:01:51 linux-7vph kernel: PM: Saving platform NVS memory
Mar 25 20:01:51 linux-7vph kernel: Disabling non-boot CPUs ...
Mar 25 20:01:51 linux-7vph kernel: CPU 1 is now offline
Mar 25 20:01:51 linux-7vph kernel: SMP alternatives: switching to UP code
Mar 25 20:01:51 linux-7vph kernel: CPU0 attaching NULL sched-domain.
Mar 25 20:01:51 linux-7vph kernel: CPU1 attaching NULL sched-domain.
Mar 25 20:01:51 linux-7vph kernel: CPU0 attaching NULL sched-domain.
Mar 25 20:01:51 linux-7vph kernel: CPU1 is down
Mar 25 20:01:51 linux-7vph kernel: Extended CMOS year: 2000
Mar 25 20:01:51 linux-7vph kernel: PM: Creating hibernation image:
Mar 25 20:01:51 linux-7vph kernel: PM: Need to copy 37796 pages
Mar 25 20:01:51 linux-7vph kernel: x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
Mar 25 20:01:51 linux-7vph kernel: Intel machine check architecture supported.
Mar 25 20:01:51 linux-7vph kernel: Intel machine check reporting enabled on CPU#0.
Mar 25 20:01:51 linux-7vph kernel: PM: Restoring platform NVS memory
Mar 25 20:01:51 linux-7vph kernel: Extended CMOS year: 2000
Mar 25 20:01:51 linux-7vph kernel: HDA Intel 0000:00:1b.0: restoring config space at offset 0x1 (was 0x100106, writing 0x100102)
Mar 25 20:01:51 linux-7vph kernel: ahci 0000:00:1f.2: restoring config space at offset 0x1 (was 0x2b00403, writing 0x2b00407)
Mar 25 20:01:51 linux-7vph kernel: Enabling non-boot CPUs ...
Mar 25 20:01:51 linux-7vph kernel: SMP alternatives: switching to SMP code
Mar 25 20:01:51 linux-7vph kernel: Booting processor 1 APIC 0x1 ip 0x6000
Mar 25 20:01:51 linux-7vph kernel: Initializing CPU#1
Mar 25 20:01:51 linux-7vph kernel: Calibrating delay using timer specific routine.. 5054.04 BogoMIPS (lpj=10108082)
Mar 25 20:01:51 linux-7vph kernel: CPU: L1 I cache: 32K, L1 D cache: 32K
Mar 25 20:01:51 linux-7vph kernel: CPU: L2 cache: 6144K
Mar 25 20:01:51 linux-7vph kernel: [ds] using Core 2/Atom configuration
Mar 25 20:01:51 linux-7vph kernel: CPU: Physical Processor ID: 0
Mar 25 20:01:51 linux-7vph kernel: CPU: Processor Core ID: 1
Mar 25 20:01:51 linux-7vph kernel: Intel machine check architecture supported.
Mar 25 20:01:51 linux-7vph kernel: Intel machine check reporting enabled on CPU#1.
Mar 25 20:01:51 linux-7vph kernel: x86 PAT enabled: cpu 1, old 0x7040600070406, new 0x7010600070106
Mar 25 20:01:51 linux-7vph kernel: CPU1: Intel(R) Core(TM)2 Duo CPU T9400 @ 2.53GHz stepping 06
Mar 25 20:01:51 linux-7vph kernel: CPU0 attaching NULL sched-domain.
Mar 25 20:01:51 linux-7vph kernel: Switched to high resolution mode on CPU 1
Mar 25 20:01:51 linux-7vph kernel: CPU0 attaching sched-domain:
Mar 25 20:01:51 linux-7vph kernel: domain 0: span 0-1 level MC
Mar 25 20:01:51 linux-7vph kernel: groups: 0 1
Mar 25 20:01:51 linux-7vph kernel: CPU1 attaching sched-domain:
Mar 25 20:01:51 linux-7vph kernel: domain 0: span 0-1 level MC
Mar 25 20:01:51 linux-7vph kernel: groups: 1 0
Mar 25 20:01:51 linux-7vph kernel: CPU1 is up
Mar 25 20:01:51 linux-7vph kernel: ACPI: Waking up from system sleep state S4
Mar 25 20:01:51 linux-7vph kernel: ACPI: EC: non-query interrupt received, switching to interrupt mode
Mar 25 20:01:51 linux-7vph kernel: pci 0000:00:02.0: PME# disabled
Mar 25 20:01:51 linux-7vph kernel: pci 0000:00:02.1: PME# disabled
Mar 25 20:01:51 linux-7vph kernel: pci 0000:00:03.0: PME# disabled
Mar 25 20:01:51 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
Mar 25 20:01:51 linux-7vph kernel: e1000e 0000:00:19.0: setting latency timer to 64
Mar 25 20:01:51 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability disabled by ACPI
Mar 25 20:01:51 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
Mar 25 20:01:51 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability disabled by ACPI
Mar 25 20:01:51 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
Mar 25 20:01:51 linux-7vph kernel: e1000e 0000:00:19.0: irq 29 for MSI/MSI-X
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1a.0: power state changed by ACPI to D0
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1a.0: setting latency timer to 64
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B -> GSI 21 (level, low) -> IRQ 21
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1a.1: setting latency timer to 64
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1a.2: power state changed by ACPI to D0
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C -> GSI 22 (level, low) -> IRQ 22
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1a.2: setting latency timer to 64
Mar 25 20:01:51 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
Mar 25 20:01:51 linux-7vph kernel: ehci_hcd 0000:00:1a.7: power state changed by ACPI to D0
Mar 25 20:01:51 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D -> GSI 23 (level, low) -> IRQ 23
Mar 25 20:01:51 linux-7vph kernel: ehci_hcd 0000:00:1a.7: setting latency timer to 64
Mar 25 20:01:51 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
Mar 25 20:01:51 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B -> GSI 17 (level, low) -> IRQ 17
Mar 25 20:01:51 linux-7vph kernel: HDA Intel 0000:00:1b.0: setting latency timer to 64
Mar 25 20:01:51 linux-7vph kernel: pcieport-driver 0000:00:1c.0: setting latency timer to 64
Mar 25 20:01:51 linux-7vph kernel: pcieport-driver 0000:00:1c.1: setting latency timer to 64
Mar 25 20:01:51 linux-7vph kernel: pcieport-driver 0000:00:1c.3: setting latency timer to 64
Mar 25 20:01:51 linux-7vph kernel: pcieport-driver 0000:00:1c.4: setting latency timer to 64
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1d.0: power state changed by ACPI to D0
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1d.0: setting latency timer to 64
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1d.1: setting latency timer to 64
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
Mar 25 20:01:51 linux-7vph kernel: uhci_hcd 0000:00:1d.2: setting latency timer to 64
Mar 25 20:01:51 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
Mar 25 20:01:51 linux-7vph kernel: ehci_hcd 0000:00:1d.7: power state changed by ACPI to D0
Mar 25 20:01:51 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D -> GSI 19 (level, low) -> IRQ 19
Mar 25 20:01:51 linux-7vph kernel: ehci_hcd 0000:00:1d.7: setting latency timer to 64
Mar 25 20:01:51 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
Mar 25 20:01:51 linux-7vph kernel: pci 0000:00:1e.0: setting latency timer to 64
Mar 25 20:01:51 linux-7vph kernel: ahci 0000:00:1f.2: setting latency timer to 64
Mar 25 20:01:51 linux-7vph kernel: iwlagn 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
Mar 25 20:01:51 linux-7vph kernel: iwlagn 0000:03:00.0: irq 30 for MSI/MSI-X
Mar 25 20:01:51 linux-7vph kernel: pci 0000:15:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Mar 25 20:01:51 linux-7vph kernel: Registered led device: iwl-phy0:radio
Mar 25 20:01:51 linux-7vph kernel: Registered led device: iwl-phy0:assoc
Mar 25 20:01:51 linux-7vph kernel: Registered led device: iwl-phy0:RX
Mar 25 20:01:51 linux-7vph kernel: Registered led device: iwl-phy0:TX
Mar 25 20:01:51 linux-7vph kernel: ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[17] MMIO=[f4801000-f48017ff] Max Packet=[2048] IR/IT contexts=[4/4]
Mar 25 20:01:51 linux-7vph kernel: pci 0000:15:00.2: PME# disabled
Mar 25 20:01:51 linux-7vph kernel: pci 0000:15:00.3: PME# disabled
Mar 25 20:01:51 linux-7vph kernel: pci 0000:15:00.4: PME# disabled
Mar 25 20:01:51 linux-7vph kernel: pci 0000:15:00.5: PME# disabled
Mar 25 20:01:51 linux-7vph kernel: sd 0:0:0:0: [sda] Starting disk
Mar 25 20:01:51 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 25 20:01:51 linux-7vph kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 25 20:01:51 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Mar 25 20:01:51 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Mar 25 20:01:51 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Mar 25 20:01:51 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Mar 25 20:01:51 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Mar 25 20:01:51 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Mar 25 20:01:51 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Mar 25 20:01:51 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Mar 25 20:01:51 linux-7vph kernel: ata1.00: configured for UDMA/133
Mar 25 20:01:51 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
Mar 25 20:01:51 linux-7vph kernel: ata1: irq_stat 0x00400040, connection status changed
Mar 25 20:01:51 linux-7vph kernel: ata1.00: configured for UDMA/133
Mar 25 20:01:51 linux-7vph kernel: ata1: EH complete
Mar 25 20:01:51 linux-7vph kernel: sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors: (200 GB/186 GiB)
Mar 25 20:01:51 linux-7vph kernel: sd 0:0:0:0: [sda] Write Protect is off
Mar 25 20:01:51 linux-7vph kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Mar 25 20:01:51 linux-7vph kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 25 20:01:51 linux-7vph kernel: sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors: (200 GB/186 GiB)
Mar 25 20:01:51 linux-7vph kernel: sd 0:0:0:0: [sda] Write Protect is off
Mar 25 20:01:51 linux-7vph kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Mar 25 20:01:51 linux-7vph kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 25 20:01:51 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
Mar 25 20:01:51 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
Mar 25 20:01:51 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
Mar 25 20:01:51 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
Mar 25 20:01:51 linux-7vph kernel: ata2.00: configured for UDMA/133
Mar 25 20:01:51 linux-7vph kernel: Restarting tasks ... <3>ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
Mar 25 20:01:51 linux-7vph kernel: ata2: irq_stat 0x40000001
Mar 25 20:01:51 linux-7vph kernel: done.
Mar 25 20:01:51 linux-7vph kernel: ata2.00: configured for UDMA/133
Mar 25 20:01:51 linux-7vph kernel: ata2: EH complete
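
The irq_stat values in these logs can be read as a bitmask of AHCI port interrupt status bits. The sketch below names the bits as I understand them from the AHCI port interrupt status register layout; the table and the helper decode_irq_stat are illustrative and are not taken from the kernel source.

```python
# Bit position -> (abbreviation, description), per the AHCI PxIS layout.
PORT_IRQ_BITS = {
    0: ("DHRS", "device-to-host register FIS"),
    1: ("PSS", "PIO setup FIS"),
    2: ("DSS", "DMA setup FIS"),
    3: ("SDBS", "set device bits FIS"),
    4: ("UFS", "unknown FIS"),
    5: ("DPS", "descriptor processed"),
    6: ("PCS", "port connect change"),
    7: ("DMPS", "device mechanical presence"),
    22: ("PRCS", "PhyRdy change"),
    23: ("IPMS", "incorrect port multiplier"),
    24: ("OFS", "overflow"),
    26: ("INFS", "interface non-fatal error"),
    27: ("IFS", "interface fatal error"),
    28: ("HBDS", "host bus data error"),
    29: ("HBFS", "host bus fatal error"),
    30: ("TFES", "task file error"),
    31: ("CPDS", "cold port detect"),
}


def decode_irq_stat(value):
    """Return the abbreviations of all known bits set in value, low bit first."""
    return [abbr for bit, (abbr, _desc) in sorted(PORT_IRQ_BITS.items())
            if value & (1 << bit)]


for raw in (0x00400008, 0x00400040, 0x40000001):
    print("0x%08x -> %s" % (raw, ", ".join(decode_irq_stat(raw))))
```

With this table, 0x00400040 on ata1 is a PhyRdy change together with a port connect change, matching the "connection status changed" text, while 0x40000001 on ata2 is a task file error reported with a D2H register FIS.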


Attachments:
hibernate-x.txt (16.97 kB)
hibernate-console.txt (12.44 kB)

2009-03-25 22:16:55

by James Bottomley

Subject: Re: 2.6.29 regression: ATA bus errors on resume

On Wed, 2009-03-25 at 02:06 -0400, Jeff Garzik wrote:
> Niel Lambrechts wrote:
> > On 03/25/2009 03:30 AM, Theodore Tso wrote:
> >> On Tue, Mar 24, 2009 at 10:25:57PM +0200, Niel Lambrechts wrote:
> >>> Hi,
> >>>
> >>> After upgrading to 2.6.29 I get the below errors after resuming from
> >>> hibernating with s2disk. I ran fsck and tried doing the same thing again
> >>> in 2.6.28.9-pae, but do not get any errors there.
> >> The ext4 errors are interleaved with hardware errors, and the ext4
> >> errors are about I/O errors.
> >>
> >> EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
> >> EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
> >>
> >> This looks more like a hibernation problem than an ext4 problem.
> >> Looks like the hard drive is being left in some inconsistent state
> >> after resuming from hibernation.
> >>
> >> - Ted
> >
> > Thanks for the info Theodore, this definitely looks like some type of
> > regression in 2.6.29, as the problem is not evident when I s2disk using
> > 2.6.28.9, even after multiple suspend/resume cycles.
> >
> > I found some 'ATA bus errors' and 'SError' messages in
> > /var/log/messages, so I've attached the messages from both 2.6.29 and
> > 2.6.28 for comparison.
>
> Well, here is the interpretation of messages:
>
> > ata1.00: irq_stat 0x00400008, PHY RDY changed
> > ata1: SError: { PHYRdyChg CommWake }
>
> Your SATA hardware flags a connect-or-disconnect event ("PHY RDY"),
> which requires us to abort a bunch of queued commands:
>
> > ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
> > res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
> [...]
> > ata1.00: cmd 60/30:68:07:b3:10/00:00:0c:00:00/40 tag 13 ncq 24576 in
> > res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
>
> ...through the 14th command (tag 13).
>
> > Mar 24 21:29:14 linux-7vph kernel: ata1: hard resetting link
> > Mar 24 21:29:14 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133
>
>
> SATA link is reset, and ACPI is re-run.
>
> > Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> > Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
> > Mar 24 21:29:14 linux-7vph kernel: Descriptor sense data with sense descriptors (in hex):
> > Mar 24 21:29:14 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
> > Mar 24 21:29:14 linux-7vph kernel: 0c 10 b3 07
> > Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
> > Mar 24 21:29:14 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190455
>
> The SCSI subsystem aborts each of the queued commands.

No, this is the SCSI subsystem receiving an ABORTED COMMAND return in the
sense data for each of the outstanding I/Os.

The only place these are generated is in ata_to_sense_error(), which is
only reached if there's some type of ATA error.
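
To make the sense data in the quoted log easier to read, here is a minimal sketch that decodes descriptor-format SCSI sense bytes: the sense key, the ASC/ASCQ pair, and the information descriptor if one is present. It follows my understanding of the SPC descriptor sense layout; decode_descriptor_sense is a hypothetical helper and handles only the information descriptor (type 0x00).

```python
SENSE_KEYS = {0x0: "NO SENSE", 0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR",
              0x5: "ILLEGAL REQUEST", 0xB: "ABORTED COMMAND"}


def decode_descriptor_sense(hex_bytes):
    """Decode descriptor-format sense data given as a hex string."""
    data = bytes.fromhex(hex_bytes)
    if data[0] & 0x7F not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")
    result = {
        "sense_key": data[1] & 0x0F,
        "asc": data[2],
        "ascq": data[3],
        "information": None,
    }
    pos = 8
    end = min(len(data), 8 + data[7])   # byte 7: additional sense length
    while pos + 2 <= end:
        desc_type, desc_len = data[pos], data[pos + 1]
        body = data[pos + 2:pos + 2 + desc_len]
        if desc_type == 0x00 and len(body) >= 10:
            # body[0] carries the VALID bit, body[2:10] the 64-bit information.
            result["info_valid"] = bool(body[0] & 0x80)
            result["information"] = int.from_bytes(body[2:10], "big")
        pos += 2 + desc_len
    return result


sense = decode_descriptor_sense(
    "72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 "
    "0c 10 b3 07")
print(SENSE_KEYS.get(sense["sense_key"]), hex(sense["information"]))
```

For the bytes in the log this yields sense key 0x0b (ABORTED COMMAND) with no additional sense code, and an information field of 0x0c10b307, the same LBA that appears in the res taskfile of the failed NCQ commands.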

If I had to theorise, I'd say the system suspended with commands
outstanding to the device. On resume, the device gets reset and returns
some type of ATA error which gets translated to ABORTED COMMAND which
causes a failure.

In the mid layer, we translate ABORTED_COMMAND into a retry until the
command runs out of them ... could it be there's a race readying the
device and we run through the retries before it can accept the command?
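
The race suggested above can be pictured with a toy model. This is purely illustrative and not the mid-layer code: it assumes a device that rejects every command until some ready time, and a caller that reissues the command a fixed number of times at a fixed interval. All names and numbers (ready_at_ms, retries_allowed, retry_interval_ms) are hypothetical.

```python
def run_command(ready_at_ms, retries_allowed, retry_interval_ms):
    """Toy model: return (succeeded, attempts_made).

    Attempt number i (counting from 0) is issued at i * retry_interval_ms.
    An attempt succeeds only if the device is ready when it is issued.
    """
    attempts = 0
    for attempt in range(retries_allowed + 1):   # first try plus the retries
        attempts += 1
        issued_at = attempt * retry_interval_ms
        if issued_at >= ready_at_ms:
            return True, attempts
    return False, attempts


# Fast retries burn through the budget before the device is ready...
fast = run_command(ready_at_ms=500, retries_allowed=5, retry_interval_ms=10)
# ...while the same budget with slower retries outlasts the reset.
slow = run_command(ready_at_ms=500, retries_allowed=5, retry_interval_ms=200)
print("fast retries:", fast)
print("slow retries:", slow)
```

In this model the outcome depends only on whether the retry budget multiplied by the retry interval outlasts the time the device needs to become ready, which is the kind of timing dependence the question above is asking about.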

James