2009-03-27 19:11:16

by Niel Lambrechts

Subject: Re: 2.6.29 regression: ATA bus errors on resume

Mar 27 20:18:01 linux-7vph -- MARK --
Mar 27 20:18:01 linux-7vph syslog-ng[32661]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=350', processed='center(received)=284', processed='destination(newsnotice)=0', processed='destination(acpid)=8', processed='destination(firewall)=0', processed='destination(null)=61', processed='destination(mail)=0', processed='destination(mailinfo)=0', processed='destination(console)=20', processed='destination(newserr)=0', processed='destination(newscrit)=0', processed='destination(messages)=129', processed='destination(mailwarn)=0', processed='destination(localmessages)=0', processed='destination(netmgm)=86', processed='destination(mailerr)=0', processed='destination(xconsole)=20', processed='destination(warn)=26', processed='source(src)=284'
Mar 27 20:18:00 linux-7vph kernel: Syncing filesystems ... done.
Mar 27 20:18:01 linux-7vph kernel: Freezing user space processes ... (elapsed 0.00 seconds) done.
Mar 27 20:18:01 linux-7vph kernel: Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
Mar 27 20:18:01 linux-7vph kernel: PM: Shrinking memory... done (57185 pages freed)
Mar 27 20:18:01 linux-7vph kernel: PM: Freed 228740 kbytes in 2.03 seconds (112.67 MB/s)
Mar 27 20:18:01 linux-7vph kernel: Suspending console(s) (use no_console_suspend to debug)
Mar 27 20:18:01 linux-7vph kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Mar 27 20:18:01 linux-7vph kernel: ACPI handle has no context!
Mar 27 20:18:01 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D disabled
Mar 27 20:18:01 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C disabled
Mar 27 20:18:01 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B disabled
Mar 27 20:18:01 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A disabled
Mar 27 20:18:01 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B disabled
Mar 27 20:18:01 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D disabled
Mar 27 20:18:01 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C disabled
Mar 27 20:18:01 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B disabled
Mar 27 20:18:01 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A disabled
Mar 27 20:18:01 linux-7vph kernel: e1000e 0000:00:19.0: PME# enabled
Mar 27 20:18:01 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability enabled by ACPI
Mar 27 20:18:01 linux-7vph kernel: e1000e 0000:00:19.0: PME# enabled
Mar 27 20:18:01 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability enabled by ACPI
Mar 27 20:18:06 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A disabled
Mar 27 20:18:06 linux-7vph kernel: ACPI: Preparing to enter system sleep state S4
Mar 27 20:18:06 linux-7vph kernel: Disabling non-boot CPUs ...
Mar 27 20:18:06 linux-7vph kernel: CPU 1 is now offline
Mar 27 20:18:06 linux-7vph kernel: SMP alternatives: switching to UP code
Mar 27 20:18:06 linux-7vph kernel: CPU0 attaching NULL sched-domain.
Mar 27 20:18:06 linux-7vph kernel: CPU1 attaching NULL sched-domain.
Mar 27 20:18:06 linux-7vph kernel: CPU0 attaching NULL sched-domain.
Mar 27 20:18:06 linux-7vph kernel: CPU1 is down
Mar 27 20:18:06 linux-7vph kernel: Extended CMOS year: 2000
Mar 27 20:18:06 linux-7vph kernel: PM: Creating hibernation image:
Mar 27 20:18:06 linux-7vph kernel: PM: Need to copy 125532 pages
Mar 27 20:18:06 linux-7vph kernel: x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
Mar 27 20:18:06 linux-7vph kernel: Intel machine check architecture supported.
Mar 27 20:18:06 linux-7vph kernel: Intel machine check reporting enabled on CPU#0.
Mar 27 20:18:06 linux-7vph kernel: Extended CMOS year: 2000
Mar 27 20:18:06 linux-7vph kernel: Enabling non-boot CPUs ...
Mar 27 20:18:06 linux-7vph kernel: SMP alternatives: switching to SMP code
Mar 27 20:18:06 linux-7vph kernel: Booting processor 1 APIC 0x1 ip 0x6000
Mar 27 20:18:06 linux-7vph kernel: Initializing CPU#1
Mar 27 20:18:06 linux-7vph kernel: Calibrating delay using timer specific routine.. 5054.03 BogoMIPS (lpj=10108064)
Mar 27 20:18:06 linux-7vph kernel: CPU: L1 I cache: 32K, L1 D cache: 32K
Mar 27 20:18:06 linux-7vph kernel: CPU: L2 cache: 6144K
Mar 27 20:18:06 linux-7vph kernel: CPU: Physical Processor ID: 0
Mar 27 20:18:06 linux-7vph kernel: CPU: Processor Core ID: 1
Mar 27 20:18:06 linux-7vph kernel: Intel machine check architecture supported.
Mar 27 20:18:06 linux-7vph kernel: Intel machine check reporting enabled on CPU#1.
Mar 27 20:18:06 linux-7vph kernel: x86 PAT enabled: cpu 1, old 0x7040600070406, new 0x7010600070106
Mar 27 20:18:06 linux-7vph kernel: CPU1: Intel(R) Core(TM)2 Duo CPU T9400 @ 2.53GHz stepping 06
Mar 27 20:18:06 linux-7vph kernel: CPU0 attaching NULL sched-domain.
Mar 27 20:18:06 linux-7vph kernel: CPU0 attaching sched-domain:
Mar 27 20:18:06 linux-7vph kernel: domain 0: span 0-1 level MC
Mar 27 20:18:06 linux-7vph kernel: groups: 0 1
Mar 27 20:18:06 linux-7vph kernel: domain 1: span 0-1 level CPU
Mar 27 20:18:06 linux-7vph kernel: groups: 0-1
Mar 27 20:18:06 linux-7vph kernel: CPU1 attaching sched-domain:
Mar 27 20:18:06 linux-7vph kernel: domain 0: span 0-1 level MC
Mar 27 20:18:06 linux-7vph kernel: groups: 1 0
Mar 27 20:18:06 linux-7vph kernel: domain 1: span 0-1 level CPU
Mar 27 20:18:06 linux-7vph kernel: groups: 0-1
Mar 27 20:18:06 linux-7vph kernel: Switched to high resolution mode on CPU 1
Mar 27 20:18:06 linux-7vph kernel: CPU1 is up
Mar 27 20:18:06 linux-7vph kernel: ACPI: Waking up from system sleep state S4
Mar 27 20:18:06 linux-7vph kernel: APIC error on CPU1: 00(40)
Mar 27 20:18:06 linux-7vph kernel: ACPI: EC: non-query interrupt received, switching to interrupt mode
Mar 27 20:18:06 linux-7vph kernel: pci 0000:00:02.0: restoring config space at offset 0x1 (was 0x900007, writing 0x900407)
Mar 27 20:18:06 linux-7vph kernel: pci 0000:00:02.0: power state changed by ACPI to D0
Mar 27 20:18:06 linux-7vph kernel: pci 0000:00:02.0: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
Mar 27 20:18:06 linux-7vph kernel: e1000e 0000:00:19.0: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability disabled by ACPI
Mar 27 20:18:06 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
Mar 27 20:18:06 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability disabled by ACPI
Mar 27 20:18:06 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
Mar 27 20:18:06 linux-7vph kernel: e1000e 0000:00:19.0: irq 2298 for MSI/MSI-X
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1a.0: power state changed by ACPI to D0
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1a.0: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B -> GSI 21 (level, low) -> IRQ 21
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1a.1: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1a.2: power state changed by ACPI to D0
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C -> GSI 22 (level, low) -> IRQ 22
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1a.2: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: ehci_hcd 0000:00:1a.7: power state changed by ACPI to D0
Mar 27 20:18:06 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D -> GSI 23 (level, low) -> IRQ 23
Mar 27 20:18:06 linux-7vph kernel: ehci_hcd 0000:00:1a.7: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: HDA Intel 0000:00:1b.0: restoring config space at offset 0x1 (was 0x100106, writing 0x100102)
Mar 27 20:18:06 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B -> GSI 17 (level, low) -> IRQ 17
Mar 27 20:18:06 linux-7vph kernel: HDA Intel 0000:00:1b.0: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: pcieport-driver 0000:00:1c.0: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: pcieport-driver 0000:00:1c.1: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: pcieport-driver 0000:00:1c.3: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: pcieport-driver 0000:00:1c.4: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1d.0: power state changed by ACPI to D0
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1d.0: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1d.1: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1d.2: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: ehci_hcd 0000:00:1d.7: power state changed by ACPI to D0
Mar 27 20:18:06 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D -> GSI 19 (level, low) -> IRQ 19
Mar 27 20:18:06 linux-7vph kernel: ehci_hcd 0000:00:1d.7: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: pci 0000:00:1e.0: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: ahci 0000:00:1f.2: restoring config space at offset 0x1 (was 0x2b00403, writing 0x2b00407)
Mar 27 20:18:06 linux-7vph kernel: ahci 0000:00:1f.2: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: pci 0000:15:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Mar 27 20:18:06 linux-7vph kernel: ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[17] MMIO=[f4801000-f48017ff] Max Packet=[2048] IR/IT contexts=[4/4]
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] Starting disk
Mar 27 20:18:06 linux-7vph kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 27 20:18:06 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 27 20:18:06 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Mar 27 20:18:06 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Mar 27 20:18:06 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Mar 27 20:18:06 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Mar 27 20:18:06 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Mar 27 20:18:06 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Mar 27 20:18:06 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Mar 27 20:18:06 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Mar 27 20:18:06 linux-7vph kernel: ata1.00: configured for UDMA/133
Mar 27 20:18:06 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
Mar 27 20:18:06 linux-7vph kernel: ata1: irq_stat 0x00400040, connection status changed
Mar 27 20:18:06 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
Mar 27 20:18:06 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
Mar 27 20:18:06 linux-7vph kernel: ata1.00: configured for UDMA/133
Mar 27 20:18:06 linux-7vph kernel: ata1: EH complete
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors: (200 GB/186 GiB)
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] Write Protect is off
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors: (200 GB/186 GiB)
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] Write Protect is off
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 27 20:18:06 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
Mar 27 20:18:06 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
Mar 27 20:18:06 linux-7vph kernel: ata2.00: configured for UDMA/133
Mar 27 20:18:06 linux-7vph kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
Mar 27 20:18:06 linux-7vph kernel: ata2: irq_stat 0x40000001
Mar 27 20:18:06 linux-7vph kernel: ata2.00: configured for UDMA/133
Mar 27 20:18:06 linux-7vph kernel: ata2: EH complete
Mar 27 20:18:06 linux-7vph kernel: pci 0000:00:02.0: power state changed by ACPI to D0
Mar 27 20:18:06 linux-7vph kernel: pci 0000:00:02.0: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: Restarting tasks ... done.


Attachments:
resume-2.6.28.txt (12.40 kB)

2009-03-27 22:29:42

by Arjan van de Ven

Subject: Re: 2.6.29 regression: ATA bus errors on resume

On Fri, 27 Mar 2009 21:10:52 +0200
Niel Lambrechts <[email protected]> wrote:

> I'm seeing some dubious looking ATA messages even on 2.6.28.9-pae,
> although with all the 2.6.28 variants I used s2disk/resume has always
> worked. I was wondering if these "errors" perhaps play more of a role
> in 2.6.29, perhaps due to the async. changes that was mentioned?

unless you actively enabled this via a kernel command line option there
are no async changes in 2.6.29 in terms of behavior.


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-03-28 10:22:37

by Niel Lambrechts

Subject: Re: 2.6.29 regression: ATA bus errors on resume

On 03/28/2009 12:30 AM, Arjan van de Ven wrote:
> On Fri, 27 Mar 2009 21:10:52 +0200
> Niel Lambrechts <[email protected]> wrote:
>
>
>> I'm seeing some dubious looking ATA messages even on 2.6.28.9-pae,
>> although with all the 2.6.28 variants I used s2disk/resume has always
>> worked. I was wondering if these "errors" perhaps play more of a role
>> in 2.6.29, perhaps due to the async. changes that was mentioned?
>>
>
> unless you actively enabled this via a kernel command line option there
> are no async changes in 2.6.29 in terms of behavior.
>
>
>
The only non-default option I had was 'modeset=1'. From Jeff's earlier
comment I understood the probing behaviour changed.

The fundamental difference is that in 2.6.29 everything initially seems
okay, but then there is a
ata1.00: exception Emask 0x10 SAct 0x3f SErr 0x50000 action0xe frozen
ata1.00: irq_stat 0x00400008, PHY RDY changed

There's nothing frozen in 2.6.28.
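
For reference, a rough sketch of how the hex fields in that exception line can be pulled apart. It assumes that Emask bit 4 (0x10) is the "ATA bus error" class, that SAct is the bitmap of outstanding NCQ tags, and that SErr bits 16 and 18 are the PHYRdyChg and CommWake flags libata prints; all other SErr bits are left undecoded. The function name and the returned keys are made up for this illustration.

```python
import re

# Assumed bit meanings; only the two SErr bits seen in this thread are named.
AC_ERR_ATA_BUS = 0x10                      # "Emask 0x10 (ATA bus error)"
SERR_NAMES = {16: "PHYRdyChg", 18: "CommWake"}

FIELDS = re.compile(
    r"Emask (0x[0-9a-f]+) SAct (0x[0-9a-f]+) SErr (0x[0-9a-f]+)"
)


def decode_exception(line):
    """Decode the Emask/SAct/SErr fields of a libata 'exception' log line."""
    match = FIELDS.search(line)
    if match is None:
        raise ValueError("not a libata exception line")
    emask, sact, serr = (int(value, 16) for value in match.groups())

    known = 0
    flags = []
    for bit, name in sorted(SERR_NAMES.items()):
        known |= 1 << bit
        if serr & (1 << bit):
            flags.append(name)

    return {
        "ata_bus_error": bool(emask & AC_ERR_ATA_BUS),
        # One bit per NCQ tag that was still outstanding.
        "ncq_tags": [tag for tag in range(32) if sact & (1 << tag)],
        "serr_flags": flags,
        # Bits this sketch does not know how to name.
        "serr_undecoded": serr & ~known,
    }


if __name__ == "__main__":
    line = "ata1.00: exception Emask 0x10 SAct 0x3f SErr 0x50000 action0xe frozen"
    print(decode_exception(line))
```

Read this way, the 2.6.29 line says six NCQ commands (tags 0 to 5) were in flight when the PHY-ready change arrived, whereas the 2.6.28 log shows SAct 0x0 and SErr 0x0 at the equivalent point.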

Should I log a kernel bug? What's the best way forward, and is there
anything more I can do to help?

cheers
Niel

2009-03-28 14:05:59

by Rafael J. Wysocki

Subject: Re: 2.6.29 regression: ATA bus errors on resume

On Saturday 28 March 2009, Niel Lambrechts wrote:
> On 03/28/2009 12:30 AM, Arjan van de Ven wrote:
> > On Fri, 27 Mar 2009 21:10:52 +0200
> > Niel Lambrechts <[email protected]> wrote:
> >
> >
> >> I'm seeing some dubious looking ATA messages even on 2.6.28.9-pae,
> >> although with all the 2.6.28 variants I used s2disk/resume has always
> >> worked. I was wondering if these "errors" perhaps play more of a role
> >> in 2.6.29, perhaps due to the async. changes that was mentioned?
> >>
> >
> > unless you actively enabled this via a kernel command line option there
> > are no async changes in 2.6.29 in terms of behavior.
> >
> >
> >
> The only non-default option I had was 'modeset=1'. From Jeff's earlier
> comment I understood the probing behaviour changed.
>
> The fundamental difference is that in 2.6.29 everything initially seems
> okay, but then there is a
> ata1.00: exception Emask 0x10 SAct 0x3f SErr 0x50000 action0xe frozen
> ata1.00: irq_stat 0x00400008, PHY RDY changed
>
> There's nothing frozen in 2.6.28.
>
> Should I log a kernel bug, what's the best way forward and is there
> anything more I can do to help?

Let Tejun have a look at this, perhaps?

Rafael

2009-03-30 08:43:43

by Tejun Heo

Subject: Re: 2.6.29 regression: ATA bus errors on resume

Hello,

Rafael J. Wysocki wrote:
> On Saturday 28 March 2009, Niel Lambrechts wrote:
>> On 03/28/2009 12:30 AM, Arjan van de Ven wrote:
>>> On Fri, 27 Mar 2009 21:10:52 +0200
>>> Niel Lambrechts <[email protected]> wrote:
>>>
>>>
>>>> I'm seeing some dubious looking ATA messages even on 2.6.28.9-pae,
>>>> although with all the 2.6.28 variants I used s2disk/resume has always
>>>> worked. I was wondering if these "errors" perhaps play more of a role
>>>> in 2.6.29, perhaps due to the async. changes that was mentioned?
>>>>
>>> unless you actively enabled this via a kernel command line option there
>>> are no async changes in 2.6.29 in terms of behavior.
>>>
>>>
>>>
>> The only non-default option I had was 'modeset=1'. From Jeff's earlier
>> comment I understood the probing behaviour changed.
>>
>> The fundamental difference is that in 2.6.29 everything initially seems
>> okay, but then there is a
>> ata1.00: exception Emask 0x10 SAct 0x3f SErr 0x50000 action0xe frozen
>> ata1.00: irq_stat 0x00400008, PHY RDY changed
>>
>> There's nothing frozen in 2.6.28.
>>
>> Should I log a kernel bug, what's the best way forward and is there
>> anything more I can do to help?
>
> Let Tejun have a look at this, perhaps?

What Niel is seeing is probably caused by libata EH somehow moving
forward too fast and receiving the second PHY changed event after the
initial reset is complete. That, or the thaw routine is broken and
doesn't clear the hotplug event properly. Actually, this double reset
seems to happen quite often, so it might be about time to drill down
into it and find out what's really going on. But, generally, it isn't a
serious problem; all that happens is EH doing another round. The
original one looks quite serious though. I'll reply separately.

Thanks.

--
tejun

2009-03-30 08:59:27

by Tejun Heo

Subject: Re: 2.6.29 regression: ATA bus errors on resume

Hello,

For some reason, I can't find the original thread, so replying here.

Niel Lambrechts wrote:
>>>>> The ext4 errors are interleaved with hardware errors, and the ext4
>>>>> errors are about I/O errors.
>>>>>
>>>>> EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
>>>>> EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
>>>>>
>>>>> This looks more like a hibernation problem than an ext4 problem.
>>>>> Looks like the hard drive is being left in some inconsistent state
>>>>> after resuming from hibernation.

Yeap, ext4 is just the victim here.

>>>> ata1.00: irq_stat 0x00400008, PHY RDY changed
>>>> ata1: SError: { PHYRdyChg CommWake }
>>> Your SATA hardware flags a connect-or-disconnect event ("PHY RDY"),
>>> which requires us to abort a bunch of queued commands:
>>>
>>>> ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
>>>> res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
>>> [...]
...
>>> The SCSI subsystem aborts each of the queued commands.
>> No .. this is the SCSI subsystem receives an ABORTED COMMAND return in
>> sense data for each of the outstanding I/Os
>>
>> The only place these are generated is in ata_sense_to_error() which only
>> occurs if there's some type of ata error.
>>
>> If I had to theorise, I'd say the system suspended with commands
>> outstanding to the device. On resume, the device gets reset and returns
>> some type of ATA error which gets translated to ABORTED COMMAND which
>> causes a failure.
>>
>> In the mid layer, we translate ABORTED_COMMAND into a retry until the
>> command runs out of them ... could it be there's a race readying the
>> device and we run through the retries before it can accept the command?
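
As a side note, the failed command quoted above ("cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in") can be decoded by hand. The sketch below assumes the field order command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, and the NCQ convention that the sector count travels in the feature registers while the tag sits in the upper bits of nsect. The function name is hypothetical, and a count of 0 (which means 65536 sectors) is not handled.

```python
SECTOR_SIZE = 512  # matches "512-byte hardware sectors" in the sd log lines


def decode_ncq_cmd(taskfile):
    """Decode the cmd field of a libata error line for an NCQ command."""
    command, low, high, device = taskfile.split("/")
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in high.split(":")
    )

    sectors = (hob_feature << 8) | feature
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "command": int(command, 16),   # 0x60 is READ FPDMA QUEUED
        "tag": nsect >> 3,             # NCQ tag lives in bits 7:3 of nsect
        "sectors": sectors,
        "bytes": sectors * SECTOR_SIZE,
        "lba": lba,
        "device": int(device, 16),
    }


if __name__ == "__main__":
    info = decode_ncq_cmd("60/18:00:77:88:6f/00:00:0e:00:00/40")
    print(info)
```

Under those assumptions the command is a 24-sector (12288-byte) NCQ read with tag 0, which agrees with the "tag 0 ncq 12288 in" suffix the kernel printed, i.e. an ordinary filesystem read.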

When libata-eh thinks that the problem isn't worth retrying, it sets
scmd->retries to scmd->allowed so that it gets aborted immediately.
The code is in ata_eh_qc_complete().

Whether a command is to be retried or not is determined by
ATA_QCFLAG_RETRY, which is set in ata_eh_link_autopsy() for each failed
command. The immediate-failure criteria are pretty strict: only driver
software errors (AC_ERR_INVALID), and PC or other special commands
which were aborted by the device, get the immediate pink slip. In this
case, the commands are from the FS and failed with AC_ERR_ATA_BUS, so
they definitely don't fit the criteria. Strange.
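
To make the rule concrete, here is a toy model of the decision as described above. It is not the kernel code: the function name and the is_fs_io parameter are invented for illustration, "aborted by the device" is modelled as an error mask of exactly AC_ERR_DEV, and the bit values other than AC_ERR_ATA_BUS (0x10, as shown in the log) are assumptions.

```python
# Assumed error-mask bits; only 0x10 is confirmed by the log in this thread.
AC_ERR_DEV = 1 << 0       # device reported an error (command aborted)
AC_ERR_ATA_BUS = 1 << 4   # "Emask 0x10 (ATA bus error)"
AC_ERR_INVALID = 1 << 7   # driver software error


def worth_retrying(err_mask, is_fs_io):
    """Return True if a failed command would be flagged for retry.

    err_mask -- OR of AC_ERR_* bits recorded for the failed command
    is_fs_io -- True for regular filesystem I/O, False for PC or other
                special commands
    """
    if err_mask & AC_ERR_INVALID:
        # Driver bug: retrying the same request cannot help.
        return False
    if not is_fs_io and err_mask == AC_ERR_DEV:
        # Special command that the device itself aborted.
        return False
    return True


if __name__ == "__main__":
    # The case from this thread: filesystem reads failing with a bus error.
    print(worth_retrying(AC_ERR_ATA_BUS, is_fs_io=True))
```

In this model the commands from the report come out as retryable, which is why their immediate failure is unexpected.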

How reproducible is the problem? Are you interested in trying out
some debug patches?

Thanks.

--
tejun