LinuxLists.cc - Re: 2.6.29 regression: ATA bus errors on resume

2009-06-01 18:58:51

Subject: Re: 2.6.29 regression: ATA bus errors on resume

Niel Lambrechts wrote:
> On 05/27/2009 02:07 AM, Tejun Heo wrote:
>> The above is the offending failure and all three failfast bits are
>> set. This corresponds to the following ATA exception.
<snip>
>> Can you please try the attached patch? It takes suspend/resume cycle
>> out of the equation and simply induces artificial failure to readahead
>> requests. It's currently set to fail every 40th readahead. Feel free
>> to adjust the frequency as you see fit. catting files into /dev/null
>> would trigger readahead to kick in. Can you reproduce filesystem
>> failure with this alone?
>>
>
> No corruption, I tried cat on an entire directory of mp3s, then also
> started X, but there the messages debug output got fairly drastic so I
> didn't care to continue for an entire day.
>
> It did trigger the "XXX ...failing readahead" message nearly 300
> times, and I also did a s2disk and resume cycle in there - so I hope
> this is enough for us to conclude that it is not the cause.

Hi Tejun,

Did you perhaps have any time to look into my feedback around the
readahead patch?

>From my side, I tried on Saturday to bisect this problem again, doing
5-8 hibernates per each bisect from 2.6.28. I stopped at 2.6.30-rc2 due
time (or fatigue), and did not manage to replicate the problem at all
which is strange since I was playing audio, doing finds and even doing
an entire dd of the root partition. I saved the bisect logs so perhaps I
can continue to see if the problem becomes more prevalent in later
versions - the first time it ever happened to me was somewhere in 2.6.29
originally.

The other interesting thing was to see that "hard resetting link"
messages seem to first start appearing at v2.6.29-rc7 or perhaps rc8. Is
it worth trying to track down the commit that lead to this?

Do you have any other debug patches to try, or should I try to delve
deeper into finding commits that can be reverted? I'm running out of
ideas, I even tried to find later firmware for my drive, but I seem to
be on the latest level.

Thanks for the help so far!

Niel

2009-06-03 03:14:19

by Tejun Heo

[permalink] [raw]

Subject: Re: 2.6.29 regression: ATA bus errors on resume

Hello, Niel.

Niel Lambrechts wrote:
> Did you perhaps have any time to look into my feedback around the
> readahead patch?

Yeah, I've been thinking about it and am a bit out of ideas, so no
immediate follow-up.

>>From my side, I tried on Saturday to bisect this problem again, doing
> 5-8 hibernates per each bisect from 2.6.28. I stopped at 2.6.30-rc2 due
> time (or fatigue), and did not manage to replicate the problem at all
> which is strange since I was playing audio, doing finds and even doing
> an entire dd of the root partition. I saved the bisect logs so perhaps I
> can continue to see if the problem becomes more prevalent in later
> versions - the first time it ever happened to me was somewhere in 2.6.29
> originally.
>
> The other interesting thing was to see that "hard resetting link"
> messages seem to first start appearing at v2.6.29-rc7 or perhaps rc8. Is
> it worth trying to track down the commit that lead to this?
>
> Do you have any other debug patches to try, or should I try to delve
> deeper into finding commits that can be reverted? I'm running out of
> ideas, I even tried to find later firmware for my drive, but I seem to
> be on the latest level.

Given the non-deterministic nature of the failure, I think bisection
would be quite difficult. I think the best way to diagnose the
problem is to track down the owner of the failed request. It has
FAILFAST set and yet the issuer fails to deal with an error condition
which is expected when FAILFAST is set. I'll prep a patch to track
request / bio failure.

Thanks.

--
tejun

2009-06-03 04:28:17

by Tejun Heo

[permalink] [raw]

Subject: Re: 2.6.29 regression: ATA bus errors on resume

Hello,

Can you please do the followings?

1. Apply the attached patch, build & boot

2. Trigger the problem and record dmesg

3. On failed IO, the kernel will print the address of bi_endio. Run
"nm -n" on the vmlinux in the kernel build root and look up which
function it is and post the dmesg and function name.

Thanks.

--
tejun

Attachments:

bio_endio-debug.patch (1.29 kB)

2009-06-06 07:06:24

by Niel Lambrechts

[permalink] [raw]

Subject: Re: 2.6.29 regression: ATA bus errors on resume

Jun 5 09:26:12 linux-7vph -- MARK --
Jun 5 09:26:12 linux-7vph syslog-ng[24615]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=3777', processed='center(received)=3142', processed='destination(newsnotice)=0', processed='destination(acpid)=13', processed='destination(firewall)=18', processed='destination(null)=208', processed='destination(mail)=0', processed='destination(mailinfo)=0', processed='destination(console)=200', processed='destination(newserr)=0', processed='destination(newscrit)=0', processed='destination(messages)=1909', processed='destination(mailwarn)=0', processed='destination(localmessages)=0', processed='destination(netmgm)=994', processed='destination(mailerr)=0', processed='destination(xconsole)=200', processed='destination(warn)=235', processed='source(src)=3142'
Jun 5 09:26:11 linux-7vph kernel: Syncing filesystems ... done.
Jun 5 09:26:12 linux-7vph kernel: Freezing user space processes ... (elapsed 0.00 seconds) done.
Jun 5 09:26:12 linux-7vph kernel: Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
Jun 5 09:26:12 linux-7vph kernel: PM: Shrinking memory... done (149507 pages freed)
Jun 5 09:26:12 linux-7vph kernel: PM: Freed 598028 kbytes in 6.49 seconds (92.14 MB/s)
Jun 5 09:26:12 linux-7vph kernel: Suspending console(s) (use no_console_suspend to debug)
Jun 5 09:26:12 linux-7vph kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Jun 5 09:26:12 linux-7vph kernel: ACPI handle has no context!
Jun 5 09:26:12 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D disabled
Jun 5 09:26:12 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C disabled
Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B disabled
Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A disabled
Jun 5 09:26:12 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B disabled
Jun 5 09:26:12 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D disabled
Jun 5 09:26:12 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C disabled
Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B disabled
Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A disabled
Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A disabled
Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: PME# enabled
Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability enabled by ACPI
Jun 5 09:26:13 linux-7vph kernel: ACPI: Preparing to enter system sleep state S4
Jun 5 09:26:13 linux-7vph kernel: PM: Saving platform NVS memory
Jun 5 09:26:13 linux-7vph kernel: Disabling non-boot CPUs ...
Jun 5 09:26:13 linux-7vph kernel: CPU 1 is now offline
Jun 5 09:26:13 linux-7vph kernel: SMP alternatives: switching to UP code
Jun 5 09:26:13 linux-7vph kernel: CPU0 attaching NULL sched-domain.
Jun 5 09:26:13 linux-7vph kernel: CPU1 attaching NULL sched-domain.
Jun 5 09:26:13 linux-7vph kernel: CPU0 attaching NULL sched-domain.
Jun 5 09:26:13 linux-7vph kernel: CPU1 is down
Jun 5 09:26:13 linux-7vph kernel: Extended CMOS year: 2000
Jun 5 09:26:13 linux-7vph kernel: PM: Creating hibernation image:
Jun 5 09:26:13 linux-7vph kernel: PM: Need to copy 223773 pages
Jun 5 09:26:13 linux-7vph kernel: x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
Jun 5 09:26:13 linux-7vph kernel: Intel machine check architecture supported.
Jun 5 09:26:13 linux-7vph kernel: Intel machine check reporting enabled on CPU#0.
Jun 5 09:26:13 linux-7vph kernel: PM: Restoring platform NVS memory
Jun 5 09:26:13 linux-7vph kernel: Extended CMOS year: 2000
Jun 5 09:26:13 linux-7vph kernel: Enabling non-boot CPUs ...
Jun 5 09:26:13 linux-7vph kernel: SMP alternatives: switching to SMP code
Jun 5 09:26:13 linux-7vph kernel: Booting processor 1 APIC 0x1 ip 0x6000
Jun 5 09:26:13 linux-7vph kernel: Initializing CPU#1
Jun 5 09:26:13 linux-7vph kernel: Calibrating delay using timer specific routine.. 5134.67 BogoMIPS (lpj=10269343)
Jun 5 09:26:13 linux-7vph kernel: CPU: L1 I cache: 32K, L1 D cache: 32K
Jun 5 09:26:13 linux-7vph kernel: CPU: L2 cache: 6144K
Jun 5 09:26:13 linux-7vph kernel: CPU: Physical Processor ID: 0
Jun 5 09:26:13 linux-7vph kernel: CPU: Processor Core ID: 1
Jun 5 09:26:13 linux-7vph kernel: Intel machine check architecture supported.
Jun 5 09:26:13 linux-7vph kernel: Intel machine check reporting enabled on CPU#1.
Jun 5 09:26:13 linux-7vph kernel: x86 PAT enabled: cpu 1, old 0x7040600070406, new 0x7010600070106
Jun 5 09:26:13 linux-7vph kernel: CPU1: Intel(R) Core(TM)2 Duo CPU T9400 @ 2.53GHz stepping 06
Jun 5 09:26:13 linux-7vph kernel: CPU0 attaching NULL sched-domain.
Jun 5 09:26:13 linux-7vph kernel: Switched to high resolution mode on CPU 1
Jun 5 09:26:13 linux-7vph kernel: CPU0 attaching sched-domain:
Jun 5 09:26:13 linux-7vph kernel: domain 0: span 0-1 level MC
Jun 5 09:26:13 linux-7vph kernel: groups: 0 1
Jun 5 09:26:13 linux-7vph kernel: domain 1: span 0-1 level CPU
Jun 5 09:26:13 linux-7vph kernel: groups: 0-1 (__cpu_power = 2048)
Jun 5 09:26:13 linux-7vph kernel: CPU1 attaching sched-domain:
Jun 5 09:26:13 linux-7vph kernel: domain 0: span 0-1 level MC
Jun 5 09:26:13 linux-7vph kernel: groups: 1 0
Jun 5 09:26:13 linux-7vph kernel: domain 1: span 0-1 level CPU
Jun 5 09:26:13 linux-7vph kernel: groups: 0-1 (__cpu_power = 2048)
Jun 5 09:26:13 linux-7vph kernel: microcode: failed to init CPU1
Jun 5 09:26:13 linux-7vph kernel: CPU1 is up
Jun 5 09:26:13 linux-7vph kernel: ACPI: Waking up from system sleep state S4
Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:02.0: restoring config space at offset 0x1 (was 0x900007, writing 0x900403)
Jun 5 09:26:13 linux-7vph kernel: uhci_hcd 0000:00:1a.0: power state changed by ACPI to D0
Jun 5 09:26:13 linux-7vph kernel: uhci_hcd 0000:00:1a.2: power state changed by ACPI to D0
Jun 5 09:26:13 linux-7vph kernel: HDA Intel 0000:00:1b.0: restoring config space at offset 0x1 (was 0x100106, writing 0x100102)
Jun 5 09:26:13 linux-7vph kernel: uhci_hcd 0000:00:1d.0: power state changed by ACPI to D0
Jun 5 09:26:13 linux-7vph kernel: ahci 0000:00:1f.2: restoring config space at offset 0x1 (was 0x2b00403, writing 0x2b00407)
Jun 5 09:26:13 linux-7vph kernel: iwlagn 0000:03:00.0: restoring config space at offset 0x1 (was 0x100106, writing 0x100506)
Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:02.0: PME# disabled
Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:02.0: setting latency timer to 64
Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:02.1: PME# disabled
Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:03.0: PME# disabled
Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: pci_enable_pcie_error_reporting failed 0xfffffffb
Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability disabled by ACPI
Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability disabled by ACPI
Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: irq 29 for MSI/MSI-X
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.0: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B -> GSI 21 (level, low) -> IRQ 21
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.1: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C -> GSI 22 (level, low) -> IRQ 22
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.2: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D -> GSI 23 (level, low) -> IRQ 23
Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1a.7: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B -> GSI 17 (level, low) -> IRQ 17
Jun 5 09:26:14 linux-7vph kernel: HDA Intel 0000:00:1b.0: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.0: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.1: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.2: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D -> GSI 19 (level, low) -> IRQ 19
Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1d.7: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: pci 0000:00:1e.0: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: ahci 0000:00:1f.2: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Jun 5 09:26:14 linux-7vph kernel: ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[17] MMIO=[f4801000-f48017ff] Max Packet=[2048] IR/IT contexts=[4/4]
Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.2: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.3: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.4: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.5: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: sd 0:0:0:0: [sda] Starting disk
Jun 5 09:26:14 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jun 5 09:26:14 linux-7vph kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Jun 5 09:26:14 linux-7vph kernel: ata1.00: configured for UDMA/133
Jun 5 09:26:14 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
Jun 5 09:26:14 linux-7vph kernel: ata1: irq_stat 0x00400040, connection status changed
Jun 5 09:26:14 linux-7vph kernel: ata1.00: configured for UDMA/133
Jun 5 09:26:14 linux-7vph kernel: ata1: EH complete
Jun 5 09:26:14 linux-7vph kernel: pci 0000:00:02.0: restoring config space at offset 0x1 (was 0x900407, writing 0x900403)
Jun 5 09:26:14 linux-7vph kernel: pci 0000:00:02.0: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
Jun 5 09:26:14 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
Jun 5 09:26:14 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
Jun 5 09:26:14 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
Jun 5 09:26:14 linux-7vph kernel: ata2.00: configured for UDMA/133
Jun 5 09:26:14 linux-7vph kernel: Restarting tasks ... <3>ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
Jun 5 09:26:14 linux-7vph kernel: ata2: irq_stat 0x40000001
Jun 5 09:26:14 linux-7vph kernel: ata1.00: exception Emask 0x10 SAct 0x1ff SErr 0x50000 action 0xe frozen
Jun 5 09:26:14 linux-7vph kernel: ata1.00: irq_stat 0x00400008, PHY RDY changed
Jun 5 09:26:14 linux-7vph kernel: ata1: SError: { PHYRdyChg CommWake }
Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:00:5f:06:44/00:00:0c:00:00/40 tag 0 ncq 4096 in
Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/58:08:6f:06:44/00:00:0c:00:00/40 tag 1 ncq 45056 in
Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:10:d7:06:44/00:00:0c:00:00/40 tag 2 ncq 4096 in
Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:18:07:07:44/00:00:0c:00:00/40 tag 3 ncq 4096 in
Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/18:20:17:07:44/00:00:0c:00:00/40 tag 4 ncq 12288 in
Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:28:67:88:70/00:00:0e:00:00/40 tag 5 ncq 4096 in
Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:30:17:70:14/00:00:0d:00:00/40 tag 6 ncq 4096 in
Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/d8:38:5f:89:6f/00:00:0e:00:00/40 tag 7 ncq 110592 in
Jun 5 09:26:15 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:15 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:15 linux-7vph kernel: ata1.00: cmd 60/28:40:3f:8a:6f/00:00:0e:00:00/40 tag 8 ncq 20480 in
Jun 5 09:26:15 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:15 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:15 linux-7vph kernel: ata1: hard resetting link
Jun 5 09:26:15 linux-7vph kernel: ata2.00: configured for UDMA/133
Jun 5 09:26:15 linux-7vph kernel: ata2: EH complete
Jun 5 09:26:15 linux-7vph kernel: done.
Jun 5 09:26:15 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Jun 5 09:26:16 linux-7vph kernel: ata1.00: configured for UDMA/133
Jun 5 09:26:16 linux-7vph kernel: ata1.00: configured for UDMA/133
Jun 5 09:26:16 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 5 09:26:16 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Jun 5 09:26:16 linux-7vph kernel: Descriptor sense data with sense descriptors (in hex):
Jun 5 09:26:16 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Jun 5 09:26:16 linux-7vph kernel: 0e 6f 8a 3f
Jun 5 09:26:16 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
Jun 5 09:26:16 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190687
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio e1223140 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=e1223140 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa0c0 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa0c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa140 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa140 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa1c0 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa1c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa240 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa240 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa2c0 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa2c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa340 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa340 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa3c0 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa3c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa440 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa440 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa4c0 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa4c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa540 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa540 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa5c0 with -5
Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa5c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa640 with -5
Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa640 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa6c0 with -5
Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa6c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa740 with -5
Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa740 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa7c0 with -5
Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa7c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa840 with -5
Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caa840 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caa8c0 with -5
Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caa8c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caa940 with -5
Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caa940 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caa9c0 with -5
Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caa9c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caaa40 with -5
Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caaa40 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caaac0 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caaac0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caab40 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caab40 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caabc0 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caabc0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caac40 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caac40 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caacc0 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caacc0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caad40 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caad40 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 5 09:26:22 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Jun 5 09:26:22 linux-7vph kernel: Descriptor sense data with sense descriptors (in hex):
Jun 5 09:26:22 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Jun 5 09:26:22 linux-7vph kernel: 0e 6f 8a 3f
Jun 5 09:26:22 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
Jun 5 09:26:22 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190911
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caadc0 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caadc0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caae40 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caae40 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caaec0 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caaec0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caaf40 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caaf40 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio e12230c0 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=e12230c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: ata1: EH complete
Jun 5 09:26:22 linux-7vph kernel: EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346994, block=9404448
Jun 5 09:26:22 linux-7vph kernel: EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure

Attachments:

messages.2.6.30-rc7-pae.bio-extract.txt (22.17 kB)

2009-06-19 15:05:00

by Pavel Machek

[permalink] [raw]

Subject: Re: 2.6.29 regression: ATA bus errors on resume

On Sat 2009-06-06 09:05:05, Niel Lambrechts wrote:
> Morning Tejun,
>
> Tejun Heo wrote:
> > Hello,
> >
> > Can you please do the followings?
> >
> > 1. Apply the attached patch, build & boot
> >
> I chose 2.6.30-rc7...
> > 2. Trigger the problem and record dmesg
> >
> It took 3 days and quite a few hibernate attempts ... :-)
>
> > 3. On failed IO, the kernel will print the address of bi_endio. Run
> > "nm -n" on the vmlinux in the kernel build root and look up which
> > function it is and post the dmesg and function name.
> I did not have that specific vmlinux.o file any more, but
> /boot/System.map-2.6.30-rc7-pae shows:
> c01a49fd t end_bio_bh_io_sync
>
> Hope this is sufficient to help you. Sorry if this is silly - being so
> inexperienced with the kernel - but I wondered if or why a dump_stack()
> in that debug patch would not be helpful?

I guess dump_stack would be very helpful :-).

> Regards,
> Niel
>
>

> Jun 5 09:26:12 linux-7vph -- MARK --
> Jun 5 09:26:12 linux-7vph syslog-ng[24615]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=3777', processed='center(received)=3142', processed='destination(newsnotice)=0', processed='destination(acpid)=13', processed='destination(firewall)=18', processed='destination(null)=208', processed='destination(mail)=0', processed='destination(mailinfo)=0', processed='destination(console)=200', processed='destination(newserr)=0', processed='destination(newscrit)=0', processed='destination(messages)=1909', processed='destination(mailwarn)=0', processed='destination(localmessages)=0', processed='destination(netmgm)=994', processed='destination(mailerr)=0', processed='destination(xconsole)=200', processed='destination(warn)=235', processed='source(src)=3142'
> Jun 5 09:26:11 linux-7vph kernel: Syncing filesystems ... done.
> Jun 5 09:26:12 linux-7vph kernel: Freezing user space processes ... (elapsed 0.00 seconds) done.
> Jun 5 09:26:12 linux-7vph kernel: Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
> Jun 5 09:26:12 linux-7vph kernel: PM: Shrinking memory... done (149507 pages freed)
> Jun 5 09:26:12 linux-7vph kernel: PM: Freed 598028 kbytes in 6.49 seconds (92.14 MB/s)
> Jun 5 09:26:12 linux-7vph kernel: Suspending console(s) (use no_console_suspend to debug)
> Jun 5 09:26:12 linux-7vph kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
> Jun 5 09:26:12 linux-7vph kernel: ACPI handle has no context!
> Jun 5 09:26:12 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D disabled
> Jun 5 09:26:12 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
> Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C disabled
> Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B disabled
> Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A disabled
> Jun 5 09:26:12 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B disabled
> Jun 5 09:26:12 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D disabled
> Jun 5 09:26:12 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
> Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C disabled
> Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B disabled
> Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A disabled
> Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A disabled
> Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: PME# enabled
> Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability enabled by ACPI
> Jun 5 09:26:13 linux-7vph kernel: ACPI: Preparing to enter system sleep state S4
> Jun 5 09:26:13 linux-7vph kernel: PM: Saving platform NVS memory
> Jun 5 09:26:13 linux-7vph kernel: Disabling non-boot CPUs ...
> Jun 5 09:26:13 linux-7vph kernel: CPU 1 is now offline
> Jun 5 09:26:13 linux-7vph kernel: SMP alternatives: switching to UP code
> Jun 5 09:26:13 linux-7vph kernel: CPU0 attaching NULL sched-domain.
> Jun 5 09:26:13 linux-7vph kernel: CPU1 attaching NULL sched-domain.
> Jun 5 09:26:13 linux-7vph kernel: CPU0 attaching NULL sched-domain.
> Jun 5 09:26:13 linux-7vph kernel: CPU1 is down
> Jun 5 09:26:13 linux-7vph kernel: Extended CMOS year: 2000
> Jun 5 09:26:13 linux-7vph kernel: PM: Creating hibernation image:
> Jun 5 09:26:13 linux-7vph kernel: PM: Need to copy 223773 pages
> Jun 5 09:26:13 linux-7vph kernel: x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
> Jun 5 09:26:13 linux-7vph kernel: Intel machine check architecture supported.
> Jun 5 09:26:13 linux-7vph kernel: Intel machine check reporting enabled on CPU#0.
> Jun 5 09:26:13 linux-7vph kernel: PM: Restoring platform NVS memory
> Jun 5 09:26:13 linux-7vph kernel: Extended CMOS year: 2000
> Jun 5 09:26:13 linux-7vph kernel: Enabling non-boot CPUs ...
> Jun 5 09:26:13 linux-7vph kernel: SMP alternatives: switching to SMP code
> Jun 5 09:26:13 linux-7vph kernel: Booting processor 1 APIC 0x1 ip 0x6000
> Jun 5 09:26:13 linux-7vph kernel: Initializing CPU#1
> Jun 5 09:26:13 linux-7vph kernel: Calibrating delay using timer specific routine.. 5134.67 BogoMIPS (lpj=10269343)
> Jun 5 09:26:13 linux-7vph kernel: CPU: L1 I cache: 32K, L1 D cache: 32K
> Jun 5 09:26:13 linux-7vph kernel: CPU: L2 cache: 6144K
> Jun 5 09:26:13 linux-7vph kernel: CPU: Physical Processor ID: 0
> Jun 5 09:26:13 linux-7vph kernel: CPU: Processor Core ID: 1
> Jun 5 09:26:13 linux-7vph kernel: Intel machine check architecture supported.
> Jun 5 09:26:13 linux-7vph kernel: Intel machine check reporting enabled on CPU#1.
> Jun 5 09:26:13 linux-7vph kernel: x86 PAT enabled: cpu 1, old 0x7040600070406, new 0x7010600070106
> Jun 5 09:26:13 linux-7vph kernel: CPU1: Intel(R) Core(TM)2 Duo CPU T9400 @ 2.53GHz stepping 06
> Jun 5 09:26:13 linux-7vph kernel: CPU0 attaching NULL sched-domain.
> Jun 5 09:26:13 linux-7vph kernel: Switched to high resolution mode on CPU 1
> Jun 5 09:26:13 linux-7vph kernel: CPU0 attaching sched-domain:
> Jun 5 09:26:13 linux-7vph kernel: domain 0: span 0-1 level MC
> Jun 5 09:26:13 linux-7vph kernel: groups: 0 1
> Jun 5 09:26:13 linux-7vph kernel: domain 1: span 0-1 level CPU
> Jun 5 09:26:13 linux-7vph kernel: groups: 0-1 (__cpu_power = 2048)
> Jun 5 09:26:13 linux-7vph kernel: CPU1 attaching sched-domain:
> Jun 5 09:26:13 linux-7vph kernel: domain 0: span 0-1 level MC
> Jun 5 09:26:13 linux-7vph kernel: groups: 1 0
> Jun 5 09:26:13 linux-7vph kernel: domain 1: span 0-1 level CPU
> Jun 5 09:26:13 linux-7vph kernel: groups: 0-1 (__cpu_power = 2048)
> Jun 5 09:26:13 linux-7vph kernel: microcode: failed to init CPU1
> Jun 5 09:26:13 linux-7vph kernel: CPU1 is up
> Jun 5 09:26:13 linux-7vph kernel: ACPI: Waking up from system sleep state S4
> Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:02.0: restoring config space at offset 0x1 (was 0x900007, writing 0x900403)
> Jun 5 09:26:13 linux-7vph kernel: uhci_hcd 0000:00:1a.0: power state changed by ACPI to D0
> Jun 5 09:26:13 linux-7vph kernel: uhci_hcd 0000:00:1a.2: power state changed by ACPI to D0
> Jun 5 09:26:13 linux-7vph kernel: HDA Intel 0000:00:1b.0: restoring config space at offset 0x1 (was 0x100106, writing 0x100102)
> Jun 5 09:26:13 linux-7vph kernel: uhci_hcd 0000:00:1d.0: power state changed by ACPI to D0
> Jun 5 09:26:13 linux-7vph kernel: ahci 0000:00:1f.2: restoring config space at offset 0x1 (was 0x2b00403, writing 0x2b00407)
> Jun 5 09:26:13 linux-7vph kernel: iwlagn 0000:03:00.0: restoring config space at offset 0x1 (was 0x100106, writing 0x100506)
> Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:02.0: PME# disabled
> Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:02.0: setting latency timer to 64
> Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:02.1: PME# disabled
> Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:03.0: PME# disabled
> Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
> Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: pci_enable_pcie_error_reporting failed 0xfffffffb
> Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability disabled by ACPI
> Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability disabled by ACPI
> Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: irq 29 for MSI/MSI-X
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.0: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B -> GSI 21 (level, low) -> IRQ 21
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.1: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C -> GSI 22 (level, low) -> IRQ 22
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.2: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D -> GSI 23 (level, low) -> IRQ 23
> Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1a.7: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B -> GSI 17 (level, low) -> IRQ 17
> Jun 5 09:26:14 linux-7vph kernel: HDA Intel 0000:00:1b.0: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.0: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.1: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.2: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D -> GSI 19 (level, low) -> IRQ 19
> Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1d.7: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: pci 0000:00:1e.0: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: ahci 0000:00:1f.2: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
> Jun 5 09:26:14 linux-7vph kernel: ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[17] MMIO=[f4801000-f48017ff] Max Packet=[2048] IR/IT contexts=[4/4]
> Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.2: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.3: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.4: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.5: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: sd 0:0:0:0: [sda] Starting disk
> Jun 5 09:26:14 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> Jun 5 09:26:14 linux-7vph kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: configured for UDMA/133
> Jun 5 09:26:14 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
> Jun 5 09:26:14 linux-7vph kernel: ata1: irq_stat 0x00400040, connection status changed
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: configured for UDMA/133
> Jun 5 09:26:14 linux-7vph kernel: ata1: EH complete
> Jun 5 09:26:14 linux-7vph kernel: pci 0000:00:02.0: restoring config space at offset 0x1 (was 0x900407, writing 0x900403)
> Jun 5 09:26:14 linux-7vph kernel: pci 0000:00:02.0: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
> Jun 5 09:26:14 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
> Jun 5 09:26:14 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
> Jun 5 09:26:14 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
> Jun 5 09:26:14 linux-7vph kernel: ata2.00: configured for UDMA/133
> Jun 5 09:26:14 linux-7vph kernel: Restarting tasks ... <3>ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
> Jun 5 09:26:14 linux-7vph kernel: ata2: irq_stat 0x40000001
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: exception Emask 0x10 SAct 0x1ff SErr 0x50000 action 0xe frozen
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: irq_stat 0x00400008, PHY RDY changed
> Jun 5 09:26:14 linux-7vph kernel: ata1: SError: { PHYRdyChg CommWake }
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:00:5f:06:44/00:00:0c:00:00/40 tag 0 ncq 4096 in
> Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/58:08:6f:06:44/00:00:0c:00:00/40 tag 1 ncq 45056 in
> Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:10:d7:06:44/00:00:0c:00:00/40 tag 2 ncq 4096 in
> Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:18:07:07:44/00:00:0c:00:00/40 tag 3 ncq 4096 in
> Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/18:20:17:07:44/00:00:0c:00:00/40 tag 4 ncq 12288 in
> Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:28:67:88:70/00:00:0e:00:00/40 tag 5 ncq 4096 in
> Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:30:17:70:14/00:00:0d:00:00/40 tag 6 ncq 4096 in
> Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/d8:38:5f:89:6f/00:00:0e:00:00/40 tag 7 ncq 110592 in
> Jun 5 09:26:15 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:15 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:15 linux-7vph kernel: ata1.00: cmd 60/28:40:3f:8a:6f/00:00:0e:00:00/40 tag 8 ncq 20480 in
> Jun 5 09:26:15 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:15 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:15 linux-7vph kernel: ata1: hard resetting link
> Jun 5 09:26:15 linux-7vph kernel: ata2.00: configured for UDMA/133
> Jun 5 09:26:15 linux-7vph kernel: ata2: EH complete
> Jun 5 09:26:15 linux-7vph kernel: done.
> Jun 5 09:26:15 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: configured for UDMA/133
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: configured for UDMA/133
> Jun 5 09:26:16 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Jun 5 09:26:16 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
> Jun 5 09:26:16 linux-7vph kernel: Descriptor sense data with sense descriptors (in hex):
> Jun 5 09:26:16 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
> Jun 5 09:26:16 linux-7vph kernel: 0e 6f 8a 3f
> Jun 5 09:26:16 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
> Jun 5 09:26:16 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190687
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio e1223140 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=e1223140 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa0c0 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa0c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa140 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa140 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa1c0 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa1c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa240 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa240 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa2c0 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa2c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa340 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa340 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa3c0 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa3c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa440 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa440 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa4c0 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa4c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa540 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa540 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa5c0 with -5
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa5c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa640 with -5
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa640 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa6c0 with -5
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa6c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa740 with -5
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa740 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa7c0 with -5
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa7c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa840 with -5
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caa840 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caa8c0 with -5
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caa8c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caa940 with -5
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caa940 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caa9c0 with -5
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caa9c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caaa40 with -5
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caaa40 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caaac0 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caaac0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caab40 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caab40 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caabc0 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caabc0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caac40 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caac40 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caacc0 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caacc0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caad40 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caad40 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Jun 5 09:26:22 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
> Jun 5 09:26:22 linux-7vph kernel: Descriptor sense data with sense descriptors (in hex):
> Jun 5 09:26:22 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
> Jun 5 09:26:22 linux-7vph kernel: 0e 6f 8a 3f
> Jun 5 09:26:22 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
> Jun 5 09:26:22 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190911
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caadc0 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caadc0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caae40 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caae40 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caaec0 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caaec0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caaf40 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caaf40 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio e12230c0 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=e12230c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: ata1: EH complete
> Jun 5 09:26:22 linux-7vph kernel: EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346994, block=9404448
> Jun 5 09:26:22 linux-7vph kernel: EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
>

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-06-25 12:57:54

by Tejun Heo

[permalink] [raw]

Subject: Re: 2.6.29 regression: ATA bus errors on resume

Sorry about the long delay.

Niel Lambrechts wrote:
> Morning Tejun,
>
> Tejun Heo wrote:
>> Hello,
>>
>> Can you please do the followings?
>>
>> 1. Apply the attached patch, build & boot
>>
> I chose 2.6.30-rc7...
>> 2. Trigger the problem and record dmesg
>>
> It took 3 days and quite a few hibernate attempts ... :-)
>
>> 3. On failed IO, the kernel will print the address of bi_endio. Run
>> "nm -n" on the vmlinux in the kernel build root and look up which
>> function it is and post the dmesg and function name.
> I did not have that specific vmlinux.o file any more, but
> /boot/System.map-2.6.30-rc7-pae shows:
> c01a49fd t end_bio_bh_io_sync

So, it's coming from submit_bh()

> Hope this is sufficient to help you. Sorry if this is silly - being so
> inexperienced with the kernel - but I wondered if or why a dump_stack()
> in that debug patch would not be helpful?

The result is perfectly good and yeah dump_stack() on the issue path
would help but the problem is that block IO requests are processed
asynchronously so by the time we find out which request fail, the
requester stack is long gone. We can either record the stack trace
with each request or trace it back one step at a time by chasing down
the completion callbacks. The first requires more coding, so... :-)

Looks like the request gotta be coming from __breadahead(). The only
place this is used in ext4 is in __ext4_get_inode_loc(). Ah.. it also
contains the matching error message. I still don't see how the READA
buffer reads can affect the synchronous path. They're doing proper
exclusion via buffer lock. Maybe they're getting merged? Yeap, looks
like block code is merging READAs and regular READs.

Can you please try the attached patch and reproduce the problem and
report the kernel log? Hopefully, this will be the last debug run.

Thanks.

--
tejun

Attachments:

bio_endio-debug2.patch (1.31 kB)

2009-06-25 15:27:47

by Niel Lambrechts

[permalink] [raw]

Subject: Re: 2.6.29 regression: ATA bus errors on resume

On 06/25/2009 02:57 PM, Tejun Heo wrote:
> Sorry about the long delay.
>
> The result is perfectly good and yeah dump_stack() on the issue path
> would help but the problem is that block IO requests are processed
> asynchronously so by the time we find out which request fail, the
> requester stack is long gone. We can either record the stack trace
> with each request or trace it back one step at a time by chasing down
> the completion callbacks. The first requires more coding, so... :-)
>
> Looks like the request gotta be coming from __breadahead(). The only
> place this is used in ext4 is in __ext4_get_inode_loc(). Ah.. it also
> contains the matching error message. I still don't see how the READA
> buffer reads can affect the synchronous path. They're doing proper
> exclusion via buffer lock. Maybe they're getting merged? Yeap, looks
> like block code is merging READAs and regular READs.
>
> Can you please try the attached patch and reproduce the problem and
> report the kernel log? Hopefully, this will be the last debug run.
>

Hi Tejun,

I've recently switched my root partition from OpenSUSE 11.1 to Fedora 11
and since then I've not again seen the issue. I'm still using vanilla
2.6.30 generated with the same .config and EXT4 as before, so I have no
idea why I cannot reproduce the issue. I still use hibernate + sleep
frequently, and I just checked - I have 5 days uptime with a mount count
of 20 and the file-system is still clean.

The one big difference is that my original partition was a EXT2 -> EXT3
-> EXT4 upgrade job over a long period of time, and some of the EXT4
parameters now used by Fedora 11 on the reformatted root partition are
different from what I had then. Here is a summary of the differences in
case it matters at all:

Current settings:
Default mount options: user_xattr acl
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Required extra isize: 28
Desired extra isize: 28
Default directory hash: half_md4

Previous settings:
Default mount options: (none)
Inodes per group: 8176
Inode blocks per group: 511
Default directory hash: tea

If I do notice any such errors again I'll apply the debug patch and let
you know, but it does seem as if the upgrade made this issue disappear...

Regards,
Niel

2009-06-26 00:46:42

by Tejun Heo

[permalink] [raw]

Subject: Re: 2.6.29 regression: ATA bus errors on resume

Hello, Niel.

Niel Lambrechts wrote:
> I've recently switched my root partition from OpenSUSE 11.1 to Fedora 11
> and since then I've not again seen the issue. I'm still using vanilla
> 2.6.30 generated with the same .config and EXT4 as before, so I have no
> idea why I cannot reproduce the issue. I still use hibernate + sleep
> frequently, and I just checked - I have 5 days uptime with a mount count
> of 20 and the file-system is still clean.
>
> The one big difference is that my original partition was a EXT2 -> EXT3
> -> EXT4 upgrade job over a long period of time, and some of the EXT4
> parameters now used by Fedora 11 on the reformatted root partition are
> different from what I had then. Here is a summary of the differences in
> case it matters at all:
>
> Current settings:
> Default mount options: user_xattr acl
> Inodes per group: 8192
> Inode blocks per group: 512
> Flex block group size: 16
> Required extra isize: 28
> Desired extra isize: 28
> Default directory hash: half_md4
>
> Previous settings:
> Default mount options: (none)
> Inodes per group: 8176
> Inode blocks per group: 511
> Default directory hash: tea
>
> If I do notice any such errors again I'll apply the debug patch and let
> you know, but it does seem as if the upgrade made this issue disappear...

Eh... if my analysis is correct the bug will be dependent on some
non-deterministic things including timing and layout of blocks of the
file which the filesystem touchs during suspend/resume. I'm fairly
sure the problem is still there. Oh well, it seems I'll have to
construct a synthetic case to test it. Anyways, please let me know if
you encounter the problem again. I'll keep you in loop with further
patches.

Thanks.

--
tejun

2009-06-26 06:33:20

by Niel Lambrechts

[permalink] [raw]

Subject: Re: 2.6.29 regression: ATA bus errors on resume

On 06/26/2009 02:46 AM, Tejun Heo wrote:
> Hello, Niel.
>
> Niel Lambrechts wrote:
>
>> I've recently switched my root partition from OpenSUSE 11.1 to Fedora 11
>> and since then I've not again seen the issue. I'm still using vanilla
>> 2.6.30 generated with the same .config and EXT4 as before, so I have no
>> idea why I cannot reproduce the issue. I still use hibernate + sleep
>> frequently, and I just checked - I have 5 days uptime with a mount count
>> of 20 and the file-system is still clean.
>>
>> The one big difference is that my original partition was a EXT2 -> EXT3
>> -> EXT4 upgrade job over a long period of time, and some of the EXT4
>> parameters now used by Fedora 11 on the reformatted root partition are
>> different from what I had then. Here is a summary of the differences in
>> case it matters at all:
>>
<snip>
> Eh... if my analysis is correct the bug will be dependent on some
> non-deterministic things including timing and layout of blocks of the
> file which the filesystem touchs during suspend/resume. I'm fairly
> sure the problem is still there. Oh well, it seems I'll have to
> construct a synthetic case to test it. Anyways, please let me know if
> you encounter the problem again. I'll keep you in loop with further
> patches.
>

Understood, what I mean to imply is that I might (at least temporarily)
be shielded from the bug because reformatting the root partition could
possibly have removed some fragmentation and improved the initial disk
layout. With the previous partition layout there would also have been a
mixture of extents and non-extents based files since that system was
upgraded from EXT2 to EXT3 to EXT4.

I'll try to boot from 2.6.29 again as well, and put it through some
suspend/resume cycles just in case that made any difference.

Regards,
Niel

2009-09-18 20:26:44

by Berthold Gunreben

[permalink] [raw]

Subject: Re: 2.6.29 regression: ATA bus errors on resume

Am Freitag 26 Juni 2009 schrieb Niel Lambrechts:
> On 06/26/2009 02:46 AM, Tejun Heo wrote:
> > Eh... if my analysis is correct the bug will be dependent on some
> > non-deterministic things including timing and layout of blocks of the
> > file which the filesystem touchs during suspend/resume. I'm fairly
> > sure the problem is still there. Oh well, it seems I'll have to
> > construct a synthetic case to test it. Anyways, please let me know if
> > you encounter the problem again. I'll keep you in loop with further
> > patches.

I am not quite sure if I am at the right place here, however, I get very
similar problems with a totally different setup. What I do is the following:

I added a fourth disk to a software raid5 array and did setup the raid
completely from scratch (the same disks have been running for about
1.5 years without any problems before, the difference is that previously
one of the disks was setup as hot spare).

After the software raid was in sync, I started to copy my data back to
the raid, and after a less than 5 minutes time, one of the disks failed
(from /var/log/warn):

Sep 18 22:04:02 Bacchus kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Sep 18 22:04:02 Bacchus kernel: ata3.00: irq_stat 0x00400000, PHY RDY changed
Sep 18 22:04:02 Bacchus kernel: ata3: SError: { PHYRdyChg }
Sep 18 22:04:02 Bacchus kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Sep 18 22:04:02 Bacchus kernel: res 40/00:0c:3f:6c:d5/00:00:15:00:00/40 Emask 0x10 (ATA bus error)
Sep 18 22:04:02 Bacchus kernel: ata3.00: status: { DRDY }
Sep 18 22:04:02 Bacchus kernel: ata3: hard resetting link
Sep 18 22:04:07 Bacchus kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 18 22:04:12 Bacchus kernel: ata3.00: qc timeout (cmd 0xec)
Sep 18 22:04:12 Bacchus kernel: ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Sep 18 22:04:12 Bacchus kernel: ata3.00: revalidation failed (errno=-5)
Sep 18 22:04:12 Bacchus kernel: ata3: hard resetting link
Sep 18 22:04:12 Bacchus kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 18 22:04:12 Bacchus kernel: ata3.00: configured for UDMA/133
Sep 18 22:04:12 Bacchus kernel: ata3: EH complete
Sep 18 22:04:12 Bacchus kernel: end_request: I/O error, dev sdc, sector 1953519853
Sep 18 22:04:12 Bacchus kernel: md: super_written gets error=-5, uptodate=0
Sep 18 22:04:12 Bacchus kernel: raid5: Disk failure on sdc1, disabling device.
Sep 18 22:04:12 Bacchus kernel: raid5: Operation continuing on 3 devices.
Sep 18 22:04:13 Bacchus kernel: RAID5 conf printout:
Sep 18 22:04:13 Bacchus kernel: --- rd:4 wd:3
Sep 18 22:04:13 Bacchus kernel: disk 0, o:1, dev:sda1
Sep 18 22:04:13 Bacchus kernel: disk 1, o:1, dev:sdb1
Sep 18 22:04:13 Bacchus kernel: disk 2, o:0, dev:sdc1
Sep 18 22:04:13 Bacchus kernel: disk 3, o:1, dev:sdd1
Sep 18 22:04:13 Bacchus kernel: RAID5 conf printout:
Sep 18 22:04:13 Bacchus kernel: --- rd:4 wd:3
Sep 18 22:04:13 Bacchus kernel: disk 0, o:1, dev:sda1
Sep 18 22:04:13 Bacchus kernel: disk 1, o:1, dev:sdb1
Sep 18 22:04:13 Bacchus kernel: disk 3, o:1, dev:sdd1

The filesystem on /dev/md0 was XFS and there was no suspend/resume applied.
The raid is built with four 1TB disks, attached to a Intel ICH7 Controller:
00:1f.2 SATA controller: Intel Corporation 82801GR/GH (ICH7 Family) SATA AHCI Controller (rev 01)

The system is a openSUSE11.1 with all patches applied. When I first
encountered this problem I had the original openSUSE kernel (2.6.27.29)
running, but I thought it was a good idea to go with a current upstream kernel
instead. The difference was, that with the openSUSE kernel, I could copy
several 100 GB before I lost the disk, while with the upstream kernel
(2.6.31 from ftp.kernel.org) I lost the disk after less than 4 GB. The result
however is the same.

smart does not tell me any problems with the disks, and I did not encounter
any problems during the build of the raid5 which took about 6 hours on
this machine. Thus I don't believe that there is a hardware problem.

Maybe you have an idea what I can do to provide more information.

Thanks in advance

Berthold

2009-09-25 04:11:43

by Tejun Heo

[permalink] [raw]

Subject: Re: 2.6.29 regression: ATA bus errors on resume

Hello, Berthold.

Berthold Gunreben wrote:
> I am not quite sure if I am at the right place here, however, I get very
> similar problems with a totally different setup. What I do is the following:
>
> I added a fourth disk to a software raid5 array and did setup the raid
> completely from scratch (the same disks have been running for about
> 1.5 years without any problems before, the difference is that previously
> one of the disks was setup as hot spare).
>
> After the software raid was in sync, I started to copy my data back to
> the raid, and after a less than 5 minutes time, one of the disks failed
> (from /var/log/warn):
>
> Sep 18 22:04:02 Bacchus kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
> Sep 18 22:04:02 Bacchus kernel: ata3.00: irq_stat 0x00400000, PHY RDY changed
> Sep 18 22:04:02 Bacchus kernel: ata3: SError: { PHYRdyChg }
> Sep 18 22:04:02 Bacchus kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> Sep 18 22:04:02 Bacchus kernel: res 40/00:0c:3f:6c:d5/00:00:15:00:00/40 Emask 0x10 (ATA bus error)
> Sep 18 22:04:02 Bacchus kernel: ata3.00: status: { DRDY }

The disk is most likely losing power briefly. After boot, run
"smartctl -a" on the device and record the output. After triggering
the problem, do it again. See if Start_Stop_Count, Power_Cycle_Count
or Power-Off_Retract_Count has increased. If so, take out your PSU,
bury it half-deep in your backyard, apply some gasoline, light it up
and enjoy the sight of perishing evil with a can of beer.

Thanks.

--
tejun

2009-09-30 09:58:46

by Berthold Gunreben

[permalink] [raw]

Subject: Re: 2.6.29 regression: ATA bus errors on resume

Hi Tejun,

thanks a lot for your reply.

Am Freitag 25 September 2009 schrieb Tejun Heo:
> Hello, Berthold.
>
> The disk is most likely losing power briefly. After boot, run
> "smartctl -a" on the device and record the output. After triggering
> the problem, do it again. See if Start_Stop_Count, Power_Cycle_Count
> or Power-Off_Retract_Count has increased. If so, take out your PSU,
> bury it half-deep in your backyard, apply some gasoline, light it up
> and enjoy the sight of perishing evil with a can of beer.

You might be right. However, I cannot reproduce the problem anymore, since I
switched to the totally unsupported JFS as filesystem.

In the meantime, I was able to copy 1.5TB of data back to the array, and the
system also survived artificially generated high load. If the problem is a
race (which I do not know), it might still be there. Obviously, it does not
show up as often again.

It could still be the power supply of course, but I don't understand why a new
kernel would trigger power outages so often (current kernels triggered the
problem latest after 5 minutes). Maybe it has something to do with the
chipset (ICH7R) which is capable of hot remove/add disks. Or it is related to
the hotswap harddisk slots in the case
(http://www.chenbro.eu/corporatesite/products_detail.php?sku=79 ). I have no
idea....

Thanks

Berthold

2009-09-30 10:32:35

by Tejun Heo

[permalink] [raw]

Subject: Re: 2.6.29 regression: ATA bus errors on resume

Hello,

Berthold Gunreben wrote:
> It could still be the power supply of course, but I don't understand
> why a new kernel would trigger power outages so often (current
> kernels triggered the problem latest after 5 minutes).

Probably because older kernels didn't handle barrier properly for
raid. Changing fs can definitely affect things but if I were you I
wouldn't be trusting any valuable data on that configuration.

--
tejun