2010-06-03 01:54:14

by Marc Dionne

[permalink] [raw]
Subject: Hang during boot, bisected to commit 96d60303fd

Since somewhere early in the 2.6.35 merge window I've been seeing a
hang during boot. The hang last roughly 30 seconds and is followed by
this:

ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
ata4.00: failed command: READ FPDMA QUEUED
ata4.00: cmd 60/08:00:00:00:00/00:00:00:00:00/40 tag 0 ncq 4096 in
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata4.00: status: { DRDY }
ata4: hard resetting link
ahci_is_device_present: status is: 80
ahci_is_device_present: status(2) is: 3
ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata4.00: configured for UDMA/133
ata4.00: device reported invalid CHS sector 0
ata4: EH complete

The system works normally once booted. I had some time to do a bisect
today and it cleanly ended up pointing to this commit:
96d60303fd: ahci: Turn off DMA engines when there's no device attached

Reverting the commit makes the system boot cleanly as before. System
has an Intel ICH10 based board, in AHCI mode, disk is an Intel X25M
SSD.

Marc


2010-06-03 02:00:17

by Matthew Garrett

[permalink] [raw]
Subject: Re: Hang during boot, bisected to commit 96d60303fd

On Wed, Jun 02, 2010 at 09:49:27PM -0400, Marc Dionne wrote:
> Since somewhere early in the 2.6.35 merge window I've been seeing a hang
> during boot. The hang last roughly 30 seconds and is followed by this:
>
> ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
> ata4.00: failed command: READ FPDMA QUEUED
> ata4.00: cmd 60/08:00:00:00:00/00:00:00:00:00/40 tag 0 ncq 4096 in
> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata4.00: status: { DRDY }
> ata4: hard resetting link
> ahci_is_device_present: status is: 80
> ahci_is_device_present: status(2) is: 3
> ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata4.00: configured for UDMA/133
> ata4.00: device reported invalid CHS sector 0
> ata4: EH complete
>
> The system works normally once booted. I had some time to do a bisect
> today and it cleanly ended up pointing to this commit:
> 96d60303fd: ahci: Turn off DMA engines when there's no device attached
>
> Reverting the commit makes the system boot cleanly as before. System has
> an Intel ICH10 based board, in AHCI mode, disk is an Intel X25M SSD.

Hm, interesting. Jeff, any immediate ideas? I don't see this behaviour
on my ICH10. Is there actually a device on ata4?

--
Matthew Garrett | [email protected]

2010-06-03 02:11:13

by Marc Dionne

[permalink] [raw]
Subject: Re: Hang during boot, bisected to commit 96d60303fd

On Wed, Jun 2, 2010 at 10:00 PM, Matthew Garrett <[email protected]> wrote:
> On Wed, Jun 02, 2010 at 09:49:27PM -0400, Marc Dionne wrote:
>> ? ?Since somewhere early in the 2.6.35 merge window I've been seeing a hang
>> ? ?during boot. ?The hang last roughly 30 seconds and is followed by this:
>>
>> ? ?ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
>> ? ?ata4.00: failed command: READ FPDMA QUEUED
>> ? ?ata4.00: cmd 60/08:00:00:00:00/00:00:00:00:00/40 tag 0 ncq 4096 in
>> ? ? ? ? ? ? res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> ? ?ata4.00: status: { DRDY }
>> ? ?ata4: hard resetting link
>> ? ?ahci_is_device_present: status is: 80
>> ? ?ahci_is_device_present: status(2) is: 3
>> ? ?ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> ? ?ata4.00: configured for UDMA/133
>> ? ?ata4.00: device reported invalid CHS sector 0
>> ? ?ata4: EH complete
>>
>> ? ?The system works normally once booted. ?I had some time to do a bisect
>> ? ?today and it cleanly ended up pointing to this commit:
>> ? ? ? ?96d60303fd: ahci: Turn off DMA engines when there's no device attached
>>
>> ? ?Reverting the commit makes the system boot cleanly as before. ?System has
>> ? ?an Intel ICH10 based board, in AHCI mode, disk is an Intel X25M SSD.
>
> Hm, interesting. Jeff, any immediate ideas? I don't see this behaviour
> on my ICH10. Is there actually a device on ata4?
>
> --
> Matthew Garrett | [email protected]

Yes, ata4 is the Intel SSD:

ata4.00: ATA-7: INTEL SSDSA2MH080G1GC, 045C8610, max UDMA/133

Marc

2010-06-04 01:21:42

by Marc Dionne

[permalink] [raw]
Subject: Re: Hang during boot, bisected to commit 96d60303fd

On Wed, Jun 2, 2010 at 10:00 PM, Matthew Garrett <[email protected]> wrote:
> On Wed, Jun 02, 2010 at 09:49:27PM -0400, Marc Dionne wrote:
>> ? ?Since somewhere early in the 2.6.35 merge window I've been seeing a hang
>> ? ?during boot. ?The hang last roughly 30 seconds and is followed by this:
>>
>> ? ?ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
>> ? ?ata4.00: failed command: READ FPDMA QUEUED
>> ? ?ata4.00: cmd 60/08:00:00:00:00/00:00:00:00:00/40 tag 0 ncq 4096 in
>> ? ? ? ? ? ? res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> ? ?ata4.00: status: { DRDY }
>> ? ?ata4: hard resetting link
>> ? ?ahci_is_device_present: status is: 80
>> ? ?ahci_is_device_present: status(2) is: 3
>> ? ?ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> ? ?ata4.00: configured for UDMA/133
>> ? ?ata4.00: device reported invalid CHS sector 0
>> ? ?ata4: EH complete
>>
>> ? ?The system works normally once booted. ?I had some time to do a bisect
>> ? ?today and it cleanly ended up pointing to this commit:
>> ? ? ? ?96d60303fd: ahci: Turn off DMA engines when there's no device attached
>>
>> ? ?Reverting the commit makes the system boot cleanly as before. ?System has
>> ? ?an Intel ICH10 based board, in AHCI mode, disk is an Intel X25M SSD.
>
> Hm, interesting. Jeff, any immediate ideas? I don't see this behaviour
> on my ICH10. Is there actually a device on ata4?
>
> --
> Matthew Garrett | [email protected]

If it's helpful, attached are the dmesg output following a normal boot
and one with the hang.

The only difference in the kernel used for those two boots is that the
return statement in ahci_start_engine following the call to
ahci_is_device_present is commented out for the normal boot case.
When the return statement is active, the 30 second hang occurs.

Marc


Attachments:
bad_ahci_dmesg (49.10 kB)
normal_ahci_dmesg (48.66 kB)
Download all attachments

2010-06-04 03:42:10

by Robert Hancock

[permalink] [raw]
Subject: Re: Hang during boot, bisected to commit 96d60303fd

On 06/03/2010 07:21 PM, Marc Dionne wrote:
> On Wed, Jun 2, 2010 at 10:00 PM, Matthew Garrett<[email protected]> wrote:
>> On Wed, Jun 02, 2010 at 09:49:27PM -0400, Marc Dionne wrote:
>>> Since somewhere early in the 2.6.35 merge window I've been seeing a hang
>>> during boot. The hang last roughly 30 seconds and is followed by this:
>>>
>>> ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
>>> ata4.00: failed command: READ FPDMA QUEUED
>>> ata4.00: cmd 60/08:00:00:00:00/00:00:00:00:00/40 tag 0 ncq 4096 in
>>> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>> ata4.00: status: { DRDY }
>>> ata4: hard resetting link
>>> ahci_is_device_present: status is: 80
>>> ahci_is_device_present: status(2) is: 3
>>> ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>> ata4.00: configured for UDMA/133
>>> ata4.00: device reported invalid CHS sector 0
>>> ata4: EH complete
>>>
>>> The system works normally once booted. I had some time to do a bisect
>>> today and it cleanly ended up pointing to this commit:
>>> 96d60303fd: ahci: Turn off DMA engines when there's no device attached
>>>
>>> Reverting the commit makes the system boot cleanly as before. System has
>>> an Intel ICH10 based board, in AHCI mode, disk is an Intel X25M SSD.
>>
>> Hm, interesting. Jeff, any immediate ideas? I don't see this behaviour
>> on my ICH10. Is there actually a device on ata4?
>>
>> --
>> Matthew Garrett | [email protected]
>
> If it's helpful, attached are the dmesg output following a normal boot
> and one with the hang.
>
> The only difference in the kernel used for those two boots is that the
> return statement in ahci_start_engine following the call to
> ahci_is_device_present is commented out for the normal boot case.
> When the return statement is active, the 30 second hang occurs.

Can you try and put a printk in where it's returning out of the
ahci_start_engine method? I wonder if that's getting triggered somewhere
unexpected..

2010-06-05 01:23:26

by Marc Dionne

[permalink] [raw]
Subject: Re: Hang during boot, bisected to commit 96d60303fd

On Thu, Jun 3, 2010 at 11:42 PM, Robert Hancock <[email protected]> wrote:
> On 06/03/2010 07:21 PM, Marc Dionne wrote:
>>
>> On Wed, Jun 2, 2010 at 10:00 PM, Matthew Garrett<[email protected]>
>> ?wrote:
>>>
>>> On Wed, Jun 02, 2010 at 09:49:27PM -0400, Marc Dionne wrote:
>>>>
>>>> ? ?Since somewhere early in the 2.6.35 merge window I've been seeing a
>>>> hang
>>>> ? ?during boot. ?The hang last roughly 30 seconds and is followed by
>>>> this:
>>>>
>>>> ? ?ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
>>>> ? ?ata4.00: failed command: READ FPDMA QUEUED
>>>> ? ?ata4.00: cmd 60/08:00:00:00:00/00:00:00:00:00/40 tag 0 ncq 4096 in
>>>> ? ? ? ? ? ? res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>>> ? ?ata4.00: status: { DRDY }
>>>> ? ?ata4: hard resetting link
>>>> ? ?ahci_is_device_present: status is: 80
>>>> ? ?ahci_is_device_present: status(2) is: 3
>>>> ? ?ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>>> ? ?ata4.00: configured for UDMA/133
>>>> ? ?ata4.00: device reported invalid CHS sector 0
>>>> ? ?ata4: EH complete
>>>>
>>>> ? ?The system works normally once booted. ?I had some time to do a
>>>> bisect
>>>> ? ?today and it cleanly ended up pointing to this commit:
>>>> ? ? ? ?96d60303fd: ahci: Turn off DMA engines when there's no device
>>>> attached
>>>>
>>>> ? ?Reverting the commit makes the system boot cleanly as before. ?System
>>>> has
>>>> ? ?an Intel ICH10 based board, in AHCI mode, disk is an Intel X25M SSD.
>>>
>>> Hm, interesting. Jeff, any immediate ideas? I don't see this behaviour
>>> on my ICH10. Is there actually a device on ata4?
>>>
>>> --
>>> Matthew Garrett | [email protected]
>>
>> If it's helpful, attached are the dmesg output following a normal boot
>> and one with the hang.
>>
>> The only difference in the kernel used for those two boots is that the
>> return statement in ahci_start_engine following the call to
>> ahci_is_device_present is commented out for the normal boot case.
>> When the return statement is active, the 30 second hang occurs.
>
> Can you try and put a printk in where it's returning out of the
> ahci_start_engine method? I wonder if that's getting triggered somewhere
> unexpected..

Attached is a dmesg output with some printks added to show the status
when ahci_is_device_present causes ahci_start_engine to return early..

It looks to me like there's two passes through ahci_start_engine for
all ports, and during the first pass the port with the disk is showing
a status of 127, causing the function to return early and skipping the
start DMA part. This seems to cause the hang later on.

I noticed that during this first pass all the print_id fields for the
ports are still set to -1, so as a test I made ahci_start_engine only
do the check for ahci_is_device_present if print_id != -1, and this
makes the system boot normally.

Marc


Attachments:
dmesg_ahci (50.68 kB)

2010-06-05 01:37:31

by Robert Hancock

[permalink] [raw]
Subject: Re: Hang during boot, bisected to commit 96d60303fd

Fri, Jun 4, 2010 at 7:23 PM, Marc Dionne <[email protected]> wrote:
> On Thu, Jun 3, 2010 at 11:42 PM, Robert Hancock <[email protected]> wrote:
>> On 06/03/2010 07:21 PM, Marc Dionne wrote:
>>>
>>> On Wed, Jun 2, 2010 at 10:00 PM, Matthew Garrett<[email protected]>
>>> ?wrote:
>>>>
>>>> On Wed, Jun 02, 2010 at 09:49:27PM -0400, Marc Dionne wrote:
>>>>>
>>>>> ? ?Since somewhere early in the 2.6.35 merge window I've been seeing a
>>>>> hang
>>>>> ? ?during boot. ?The hang last roughly 30 seconds and is followed by
>>>>> this:
>>>>>
>>>>> ? ?ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
>>>>> ? ?ata4.00: failed command: READ FPDMA QUEUED
>>>>> ? ?ata4.00: cmd 60/08:00:00:00:00/00:00:00:00:00/40 tag 0 ncq 4096 in
>>>>> ? ? ? ? ? ? res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>>>> ? ?ata4.00: status: { DRDY }
>>>>> ? ?ata4: hard resetting link
>>>>> ? ?ahci_is_device_present: status is: 80
>>>>> ? ?ahci_is_device_present: status(2) is: 3
>>>>> ? ?ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>>>> ? ?ata4.00: configured for UDMA/133
>>>>> ? ?ata4.00: device reported invalid CHS sector 0
>>>>> ? ?ata4: EH complete
>>>>>
>>>>> ? ?The system works normally once booted. ?I had some time to do a
>>>>> bisect
>>>>> ? ?today and it cleanly ended up pointing to this commit:
>>>>> ? ? ? ?96d60303fd: ahci: Turn off DMA engines when there's no device
>>>>> attached
>>>>>
>>>>> ? ?Reverting the commit makes the system boot cleanly as before. ?System
>>>>> has
>>>>> ? ?an Intel ICH10 based board, in AHCI mode, disk is an Intel X25M SSD.
>>>>
>>>> Hm, interesting. Jeff, any immediate ideas? I don't see this behaviour
>>>> on my ICH10. Is there actually a device on ata4?
>>>>
>>>> --
>>>> Matthew Garrett | [email protected]
>>>
>>> If it's helpful, attached are the dmesg output following a normal boot
>>> and one with the hang.
>>>
>>> The only difference in the kernel used for those two boots is that the
>>> return statement in ahci_start_engine following the call to
>>> ahci_is_device_present is commented out for the normal boot case.
>>> When the return statement is active, the 30 second hang occurs.
>>
>> Can you try and put a printk in where it's returning out of the
>> ahci_start_engine method? I wonder if that's getting triggered somewhere
>> unexpected..
>
> Attached is a dmesg output with some printks added to show the status
> when ahci_is_device_present causes ahci_start_engine to return early..
>
> It looks to me like there's two passes through ahci_start_engine for
> all ports, and during the first pass the port with the disk is showing
> a status of 127, causing the function to return early and skipping the
> start DMA part. ?This seems to cause the hang later on.
>
> I noticed that during this first pass all the print_id fields for the
> ports are still set to -1, so as a test I made ahci_start_engine only
> do the check for ahci_is_device_present if print_id != -1, and this
> makes the system boot normally.

Is that the right dmesg? I didn't see where the extra output was. It
might be useful to put a WARN_ON(1); inside ahci_start_engine to see
the two code paths where it's being called..

2010-06-05 01:45:54

by Marc Dionne

[permalink] [raw]
Subject: Re: Hang during boot, bisected to commit 96d60303fd

On Fri, Jun 4, 2010 at 9:37 PM, Robert Hancock <[email protected]> wrote:
> ?Fri, Jun 4, 2010 at 7:23 PM, Marc Dionne <[email protected]> wrote:
>> On Thu, Jun 3, 2010 at 11:42 PM, Robert Hancock <[email protected]> wrote:
>>> On 06/03/2010 07:21 PM, Marc Dionne wrote:
>>>>
>>>> On Wed, Jun 2, 2010 at 10:00 PM, Matthew Garrett<[email protected]>
>>>> ?wrote:
>>>>>
>>>>> On Wed, Jun 02, 2010 at 09:49:27PM -0400, Marc Dionne wrote:
>>>>>>
>>>>>> ? ?Since somewhere early in the 2.6.35 merge window I've been seeing a
>>>>>> hang
>>>>>> ? ?during boot. ?The hang last roughly 30 seconds and is followed by
>>>>>> this:
>>>>>>
>>>>>> ? ?ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
>>>>>> ? ?ata4.00: failed command: READ FPDMA QUEUED
>>>>>> ? ?ata4.00: cmd 60/08:00:00:00:00/00:00:00:00:00/40 tag 0 ncq 4096 in
>>>>>> ? ? ? ? ? ? res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>>>>> ? ?ata4.00: status: { DRDY }
>>>>>> ? ?ata4: hard resetting link
>>>>>> ? ?ahci_is_device_present: status is: 80
>>>>>> ? ?ahci_is_device_present: status(2) is: 3
>>>>>> ? ?ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>>>>> ? ?ata4.00: configured for UDMA/133
>>>>>> ? ?ata4.00: device reported invalid CHS sector 0
>>>>>> ? ?ata4: EH complete
>>>>>>
>>>>>> ? ?The system works normally once booted. ?I had some time to do a
>>>>>> bisect
>>>>>> ? ?today and it cleanly ended up pointing to this commit:
>>>>>> ? ? ? ?96d60303fd: ahci: Turn off DMA engines when there's no device
>>>>>> attached
>>>>>>
>>>>>> ? ?Reverting the commit makes the system boot cleanly as before. ?System
>>>>>> has
>>>>>> ? ?an Intel ICH10 based board, in AHCI mode, disk is an Intel X25M SSD.
>>>>>
>>>>> Hm, interesting. Jeff, any immediate ideas? I don't see this behaviour
>>>>> on my ICH10. Is there actually a device on ata4?
>>>>>
>>>>> --
>>>>> Matthew Garrett | [email protected]
>>>>
>>>> If it's helpful, attached are the dmesg output following a normal boot
>>>> and one with the hang.
>>>>
>>>> The only difference in the kernel used for those two boots is that the
>>>> return statement in ahci_start_engine following the call to
>>>> ahci_is_device_present is commented out for the normal boot case.
>>>> When the return statement is active, the 30 second hang occurs.
>>>
>>> Can you try and put a printk in where it's returning out of the
>>> ahci_start_engine method? I wonder if that's getting triggered somewhere
>>> unexpected..
>>
>> Attached is a dmesg output with some printks added to show the status
>> when ahci_is_device_present causes ahci_start_engine to return early..
>>
>> It looks to me like there's two passes through ahci_start_engine for
>> all ports, and during the first pass the port with the disk is showing
>> a status of 127, causing the function to return early and skipping the
>> start DMA part. ?This seems to cause the hang later on.
>>
>> I noticed that during this first pass all the print_id fields for the
>> ports are still set to -1, so as a test I made ahci_start_engine only
>> do the check for ahci_is_device_present if print_id != -1, and this
>> makes the system boot normally.
>
> Is that the right dmesg? I didn't see where the extra output was. It
> might be useful to put a WARN_ON(1); inside ahci_start_engine to see
> the two code paths where it's being called..

It's the right dmesg, look for lines containing
"ahci_is_device_present" or "ahci_start_engine" - sorry, I should have
pointed it out.

I'll try adding a WARN_ON to get the code paths.

Marc

2010-06-05 02:02:01

by Marc Dionne

[permalink] [raw]
Subject: Re: Hang during boot, bisected to commit 96d60303fd

On Fri, Jun 4, 2010 at 9:37 PM, Robert Hancock <[email protected]> wrote:
> Is that the right dmesg? I didn't see where the extra output was. It
> might be useful to put a WARN_ON(1); inside ahci_start_engine to see
> the two code paths where it's being called..

The first series of calls to ahci_start_engine have a stack trace that
looks like this:

Call Trace:
[<ffffffff81077002>] warn_slowpath_common+0x85/0x9d
[<ffffffff81077034>] warn_slowpath_null+0x1a/0x1c
[<ffffffff813572dd>] ahci_start_engine+0x92/0xa4
[<ffffffff8135736a>] ahci_start_port+0x7b/0x16c
[<ffffffff81357792>] ahci_port_resume+0x52/0x73
[<ffffffff81357901>] ahci_port_start+0x14e/0x164
[<ffffffff81341dc5>] ata_host_start+0xdf/0x184
[<ffffffff81357b35>] ? ahci_interrupt+0x0/0x5f3
[<ffffffff8134266f>] ata_host_activate+0x2a/0xdc
[<ffffffff81355aab>] ahci_init_one+0x9dd/0x9fe
[<ffffffff81170a9c>] ? __sysfs_add_one+0x38/0x9a
[<ffffffff812519bd>] ? kobject_get+0x1a/0x22
[<ffffffff81269891>] local_pci_probe+0x17/0x1b
[<ffffffff8126a63f>] pci_device_probe+0xca/0xfa
[<ffffffff81308fbe>] ? driver_sysfs_add+0x4c/0x71
[<ffffffff81309110>] driver_probe_device+0xac/0x13b
[<ffffffff813091fc>] __driver_attach+0x5d/0x81
[<ffffffff8130919f>] ? __driver_attach+0x0/0x81
[<ffffffff81308714>] bus_for_each_dev+0x53/0x88
[<ffffffff81308f70>] driver_attach+0x1e/0x20
[<ffffffff81308bc3>] bus_add_driver+0xb6/0x203
[<ffffffff813094ed>] driver_register+0x9e/0x10f
[<ffffffff8126a869>] __pci_register_driver+0x58/0xc8
[<ffffffff81b6582c>] ? ahci_init+0x0/0x1b
[<ffffffff81b65845>] ahci_init+0x19/0x1b
[<ffffffff81002069>] do_one_initcall+0x5e/0x159
[<ffffffff81b33713>] kernel_init+0x183/0x211
[<ffffffff810319a4>] kernel_thread_helper+0x4/0x10
[<ffffffff81b33590>] ? kernel_init+0x0/0x211
[<ffffffff810319a0>] ? kernel_thread_helper+0x0/0x10

The later calls:

Call Trace:
[<ffffffff81077002>] warn_slowpath_common+0x85/0x9d
[<ffffffff81077034>] warn_slowpath_null+0x1a/0x1c
[<ffffffff813572dd>] ahci_start_engine+0x92/0xa4
[<ffffffff81357a0d>] ahci_hardreset+0xc6/0x124
[<ffffffff81357947>] ? ahci_hardreset+0x0/0x124
[<ffffffff81342e08>] ? ata_std_prereset+0x0/0xb7
[<ffffffff8134aeca>] ata_do_reset+0x59/0x62
[<ffffffff8134c3fa>] ata_eh_reset+0x512/0xc99
[<ffffffff81356baa>] ? ahci_postreset+0x0/0x5d
[<ffffffff81357947>] ? ahci_hardreset+0x0/0x124
[<ffffffff81358498>] ? ahci_softreset+0x0/0x44
[<ffffffff81357947>] ? ahci_hardreset+0x0/0x124
[<ffffffff8134d9a1>] ? ata_eh_schedule_probe+0xc5/0xf3
[<ffffffff8134dca7>] ata_eh_recover+0x2d8/0xd6d
[<ffffffff81356baa>] ? ahci_postreset+0x0/0x5d
[<ffffffff81357947>] ? ahci_hardreset+0x0/0x124
[<ffffffff81358498>] ? ahci_softreset+0x0/0x44
[<ffffffff81342e08>] ? ata_std_prereset+0x0/0xb7
[<ffffffff81352ec9>] sata_pmp_error_handler+0xd3/0xa0a
[<ffffffff81067f73>] ? dequeue_entity+0x4a/0x183
[<ffffffff8108c1ae>] ? __cancel_work_timer+0xad/0x193
[<ffffffff8108395e>] ? lock_timer_base+0x2c/0x52
[<ffffffff810839ff>] ? try_to_del_timer_sync+0x7b/0x89
[<ffffffff81357943>] ahci_error_handler+0x2c/0x30
[<ffffffff8134f2c7>] ata_scsi_error+0x33d/0x697
[<ffffffff8131bc10>] scsi_error_handler+0xc2/0x3c5
[<ffffffff8131bb4e>] ? scsi_error_handler+0x0/0x3c5
[<ffffffff8108f451>] kthread+0x7f/0x87
[<ffffffff810319a4>] kernel_thread_helper+0x4/0x10
[<ffffffff8108f3d2>] ? kthread+0x0/0x87
[<ffffffff810319a0>] ? kernel_thread_helper+0x0/0x10

Marc

2010-06-07 04:34:49

by Robert Hancock

[permalink] [raw]
Subject: Re: Hang during boot, bisected to commit 96d60303fd

On Fri, Jun 4, 2010 at 8:01 PM, Marc Dionne <[email protected]> wrote:
> On Fri, Jun 4, 2010 at 9:37 PM, Robert Hancock <[email protected]> wrote:
>> Is that the right dmesg? I didn't see where the extra output was. It
>> might be useful to put a WARN_ON(1); inside ahci_start_engine to see
>> the two code paths where it's being called..
>
> The first series of calls to ahci_start_engine have a stack trace that
> looks like this:
>
> Call Trace:
> ?[<ffffffff81077002>] warn_slowpath_common+0x85/0x9d
> ?[<ffffffff81077034>] warn_slowpath_null+0x1a/0x1c
> ?[<ffffffff813572dd>] ahci_start_engine+0x92/0xa4
> ?[<ffffffff8135736a>] ahci_start_port+0x7b/0x16c
> ?[<ffffffff81357792>] ahci_port_resume+0x52/0x73
> ?[<ffffffff81357901>] ahci_port_start+0x14e/0x164
> ?[<ffffffff81341dc5>] ata_host_start+0xdf/0x184
> ?[<ffffffff81357b35>] ? ahci_interrupt+0x0/0x5f3
> ?[<ffffffff8134266f>] ata_host_activate+0x2a/0xdc
> ?[<ffffffff81355aab>] ahci_init_one+0x9dd/0x9fe
> ?[<ffffffff81170a9c>] ? __sysfs_add_one+0x38/0x9a
> ?[<ffffffff812519bd>] ? kobject_get+0x1a/0x22
> ?[<ffffffff81269891>] local_pci_probe+0x17/0x1b
> ?[<ffffffff8126a63f>] pci_device_probe+0xca/0xfa
> ?[<ffffffff81308fbe>] ? driver_sysfs_add+0x4c/0x71
> ?[<ffffffff81309110>] driver_probe_device+0xac/0x13b
> ?[<ffffffff813091fc>] __driver_attach+0x5d/0x81
> ?[<ffffffff8130919f>] ? __driver_attach+0x0/0x81
> ?[<ffffffff81308714>] bus_for_each_dev+0x53/0x88
> ?[<ffffffff81308f70>] driver_attach+0x1e/0x20
> ?[<ffffffff81308bc3>] bus_add_driver+0xb6/0x203
> ?[<ffffffff813094ed>] driver_register+0x9e/0x10f
> ?[<ffffffff8126a869>] __pci_register_driver+0x58/0xc8
> ?[<ffffffff81b6582c>] ? ahci_init+0x0/0x1b
> ?[<ffffffff81b65845>] ahci_init+0x19/0x1b
> ?[<ffffffff81002069>] do_one_initcall+0x5e/0x159
> ?[<ffffffff81b33713>] kernel_init+0x183/0x211
> ?[<ffffffff810319a4>] kernel_thread_helper+0x4/0x10
> ?[<ffffffff81b33590>] ? kernel_init+0x0/0x211
> ?[<ffffffff810319a0>] ? kernel_thread_helper+0x0/0x10
>
> The later calls:
>
> Call Trace:
> ?[<ffffffff81077002>] warn_slowpath_common+0x85/0x9d
> ?[<ffffffff81077034>] warn_slowpath_null+0x1a/0x1c
> ?[<ffffffff813572dd>] ahci_start_engine+0x92/0xa4
> ?[<ffffffff81357a0d>] ahci_hardreset+0xc6/0x124
> ?[<ffffffff81357947>] ? ahci_hardreset+0x0/0x124
> ?[<ffffffff81342e08>] ? ata_std_prereset+0x0/0xb7
> ?[<ffffffff8134aeca>] ata_do_reset+0x59/0x62
> ?[<ffffffff8134c3fa>] ata_eh_reset+0x512/0xc99
> ?[<ffffffff81356baa>] ? ahci_postreset+0x0/0x5d
> ?[<ffffffff81357947>] ? ahci_hardreset+0x0/0x124
> ?[<ffffffff81358498>] ? ahci_softreset+0x0/0x44
> ?[<ffffffff81357947>] ? ahci_hardreset+0x0/0x124
> ?[<ffffffff8134d9a1>] ? ata_eh_schedule_probe+0xc5/0xf3
> ?[<ffffffff8134dca7>] ata_eh_recover+0x2d8/0xd6d
> ?[<ffffffff81356baa>] ? ahci_postreset+0x0/0x5d
> ?[<ffffffff81357947>] ? ahci_hardreset+0x0/0x124
> ?[<ffffffff81358498>] ? ahci_softreset+0x0/0x44
> ?[<ffffffff81342e08>] ? ata_std_prereset+0x0/0xb7
> ?[<ffffffff81352ec9>] sata_pmp_error_handler+0xd3/0xa0a
> ?[<ffffffff81067f73>] ? dequeue_entity+0x4a/0x183
> ?[<ffffffff8108c1ae>] ? __cancel_work_timer+0xad/0x193
> ?[<ffffffff8108395e>] ? lock_timer_base+0x2c/0x52
> ?[<ffffffff810839ff>] ? try_to_del_timer_sync+0x7b/0x89
> ?[<ffffffff81357943>] ahci_error_handler+0x2c/0x30
> ?[<ffffffff8134f2c7>] ata_scsi_error+0x33d/0x697
> ?[<ffffffff8131bc10>] scsi_error_handler+0xc2/0x3c5
> ?[<ffffffff8131bb4e>] ? scsi_error_handler+0x0/0x3c5
> ?[<ffffffff8108f451>] kthread+0x7f/0x87
> ?[<ffffffff810319a4>] kernel_thread_helper+0x4/0x10
> ?[<ffffffff8108f3d2>] ? kthread+0x0/0x87
> ?[<ffffffff810319a0>] ? kernel_thread_helper+0x0/0x10

I'm thinking that the hardreset and device classification makes more
sense as a place to do this anyways, since that's the point where we
really know if there's a device connected or not.. Tejun, any
thoughts?

2010-06-07 09:22:59

by Tejun Heo

[permalink] [raw]
Subject: Re: Hang during boot, bisected to commit 96d60303fd

Hello,

On 06/07/2010 06:34 AM, Robert Hancock wrote:
> I'm thinking that the hardreset and device classification makes more
> sense as a place to do this anyways, since that's the point where we
> really know if there's a device connected or not.. Tejun, any
> thoughts?

Heh, of course it doesn't work. Device presence can't be reliably
determined by sampling status once like that. We don't have all the
crap in libata-eh for nothing. :-)

Marc, can you please try the following patch?

Thanks.

diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
index 1984a6e..261f86d 100644
--- a/drivers/ata/libahci.c
+++ b/drivers/ata/libahci.c
@@ -541,29 +541,11 @@ static int ahci_scr_write(struct ata_link *link, unsigned int sc_reg, u32 val)
return -EINVAL;
}

-static int ahci_is_device_present(void __iomem *port_mmio)
-{
- u8 status = readl(port_mmio + PORT_TFDATA) & 0xff;
-
- /* Make sure PxTFD.STS.BSY and PxTFD.STS.DRQ are 0 */
- if (status & (ATA_BUSY | ATA_DRQ))
- return 0;
-
- /* Make sure PxSSTS.DET is 3h */
- status = readl(port_mmio + PORT_SCR_STAT) & 0xf;
- if (status != 3)
- return 0;
- return 1;
-}
-
void ahci_start_engine(struct ata_port *ap)
{
void __iomem *port_mmio = ahci_port_base(ap);
u32 tmp;

- if (!ahci_is_device_present(port_mmio))
- return;
-
/* start DMA */
tmp = readl(port_mmio + PORT_CMD);
tmp |= PORT_CMD_START;
@@ -1892,6 +1874,9 @@ static void ahci_error_handler(struct ata_port *ap)
}

sata_pmp_error_handler(ap);
+
+ if (!ata_dev_enabled(ap->link.device))
+ ahci_stop_engine(ap);
}

static void ahci_post_internal_cmd(struct ata_queued_cmd *qc)

--
tejun

2010-06-07 12:21:46

by Marc Dionne

[permalink] [raw]
Subject: Re: Hang during boot, bisected to commit 96d60303fd

On Mon, Jun 7, 2010 at 5:22 AM, Tejun Heo <[email protected]> wrote:
> Hello,
>
> On 06/07/2010 06:34 AM, Robert Hancock wrote:
>> I'm thinking that the hardreset and device classification makes more
>> sense as a place to do this anyways, since that's the point where we
>> really know if there's a device connected or not.. Tejun, any
>> thoughts?
>
> Heh, of course it doesn't work. ?Device presence can't be reliably
> determined by sampling status once like that. ?We don't have all the
> crap in libata-eh for nothing. ?:-)
>
> Marc, can you please try the following patch?
>
> Thanks.
>
> diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
> index 1984a6e..261f86d 100644
> --- a/drivers/ata/libahci.c
> +++ b/drivers/ata/libahci.c
> @@ -541,29 +541,11 @@ static int ahci_scr_write(struct ata_link *link, unsigned int sc_reg, u32 val)
> ? ? ? ?return -EINVAL;
> ?}
>
> -static int ahci_is_device_present(void __iomem *port_mmio)
> -{
> - ? ? ? u8 status = readl(port_mmio + PORT_TFDATA) & 0xff;
> -
> - ? ? ? /* Make sure PxTFD.STS.BSY and PxTFD.STS.DRQ are 0 */
> - ? ? ? if (status & (ATA_BUSY | ATA_DRQ))
> - ? ? ? ? ? ? ? return 0;
> -
> - ? ? ? /* Make sure PxSSTS.DET is 3h */
> - ? ? ? status = readl(port_mmio + PORT_SCR_STAT) & 0xf;
> - ? ? ? if (status != 3)
> - ? ? ? ? ? ? ? return 0;
> - ? ? ? return 1;
> -}
> -
> ?void ahci_start_engine(struct ata_port *ap)
> ?{
> ? ? ? ?void __iomem *port_mmio = ahci_port_base(ap);
> ? ? ? ?u32 tmp;
>
> - ? ? ? if (!ahci_is_device_present(port_mmio))
> - ? ? ? ? ? ? ? return;
> -
> ? ? ? ?/* start DMA */
> ? ? ? ?tmp = readl(port_mmio + PORT_CMD);
> ? ? ? ?tmp |= PORT_CMD_START;
> @@ -1892,6 +1874,9 @@ static void ahci_error_handler(struct ata_port *ap)
> ? ? ? ?}
>
> ? ? ? ?sata_pmp_error_handler(ap);
> +
> + ? ? ? if (!ata_dev_enabled(ap->link.device))
> + ? ? ? ? ? ? ? ahci_stop_engine(ap);
> ?}
>
> ?static void ahci_post_internal_cmd(struct ata_queued_cmd *qc)
>
> --
> tejun

Hi Tejun,

With that patch the system boots normally. Out of curiosity I also
instrumented ahci_error_handler a bit, and the stop_engine case
triggers once for the 6 unused ports, but doesn't trigger for the 2
ports with devices (SSD and DVD), as expected.

Feel free to add a Tested-by if you submit it.

Thanks,
Marc

2010-06-07 13:04:33

by Matthew Garrett

[permalink] [raw]
Subject: Re: Hang during boot, bisected to commit 96d60303fd

On Mon, Jun 07, 2010 at 08:21:42AM -0400, Marc Dionne wrote:

> With that patch the system boots normally. Out of curiosity I also
> instrumented ahci_error_handler a bit, and the stop_engine case
> triggers once for the 6 unused ports, but doesn't trigger for the 2
> ports with devices (SSD and DVD), as expected.
>
> Feel free to add a Tested-by if you submit it.

Well, that's what trusting the ahci spec gets me. Thanks for testing
that, Marc.

--
Matthew Garrett | [email protected]

2010-06-07 13:09:22

by Tejun Heo

[permalink] [raw]
Subject: Re: Hang during boot, bisected to commit 96d60303fd

Hello,

On 06/07/2010 03:04 PM, Matthew Garrett wrote:
> On Mon, Jun 07, 2010 at 08:21:42AM -0400, Marc Dionne wrote:
>
>> With that patch the system boots normally. Out of curiosity I also
>> instrumented ahci_error_handler a bit, and the stop_engine case
>> triggers once for the 6 unused ports, but doesn't trigger for the 2
>> ports with devices (SSD and DVD), as expected.
>>
>> Feel free to add a Tested-by if you submit it.
>
> Well, that's what trusting the ahci spec gets me. Thanks for testing
> that, Marc.

Heh heh, welcome to ATA. :-)

Will add Tested-by and send it to Jeff.

Thanks.

--
tejun