2019-03-13 16:52:35

by Jesse Hathaway

[permalink] [raw]
Subject: Regression causes a hang on boot with a Comtrol PCI card

Two regressions cause Linux to hang on boot when a Comtrol PCI card is present.

If I revert the following two commits, I can boot again and the card operates
without issue:

1302fcf0d03e (refs/bisect/bad) PCI: Configure *all* devices, not just
hot-added ones
1c3c5eab1715 sched/core: Enable might_sleep() and smp_processor_id()
checks early

; lspci -vs 82:00.0
82:00.0 Multiport serial controller: Comtrol Corporation Device 0061
Subsystem: Comtrol Corporation Device 0061
Flags: 66MHz, medium devsel, IRQ 35, NUMA node 1
Memory at c8004000 (32-bit, non-prefetchable) [size=4K]
Memory at c8000000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [40] Hot-plug capable
Capabilities: [48] Power Management version 2
Kernel driver in use: rp2
Kernel modules: rp2

Is it possible that the problem is that the card claims to support Hot-plug,
but does not?

I would love to help fix this issue, please let me know what other information
would be helpful to provide.

; awk -f scripts/ver_linux
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.

Linux tty01 5.0.1-amd64 #1 SMP Wed Mar 13 15:43:44 UTC 2019 x86_64 GNU/Linux

GNU C 6.3.0
GNU Make 4.1
Binutils 2.28
Util-linux 2.29.2
Mount 2.29.2
Linux C Library 2.24
Dynamic linker (ldd) 2.24
Procps 3.3.12
Sh-utils 8.26
Udev 232
Modules Loaded 8021q acpi_power_meter aesni_intel aes_x86_64
ahci autofs4 bonding button coretemp crc16 crc32c_generic crc32c_intel
crc32_pclmul crct10dif_pclmul cryptd crypto_simd dca dcdbas dm_mod drm
drm_kms_helper ehci_hcd ehci_pci evdev ext4 fscrypto garp
ghash_clmulni_intel glue_helper i2c_algo_bit igb intel_cstate
intel_powerclamp intel_rapl intel_rapl_perf intel_uncore ioatdma
ipmi_devintf ipmi_msghandler ipmi_si iptable_filter ip_tables
irqbypass iTCO_vendor_support iTCO_wdt ixgbe jbd2 kvm kvm_intel
libahci libata libcrc32c libphy llc lpc_ich mbcache mdio megaraid_sas
mei mei_me mgag200 mrp mxm_wmi nf_conntrack nf_defrag_ipv4
nf_defrag_ipv6 pcc_cpufreq pcspkr rp2 sb_edac scsi_mod sd_mod sg snd
snd_pcm snd_timer soundcore stp ttm usbcore wmi x86_pkg_temp_thermal
xfrm_algo x_tables xt_conntrack xt_tcpudp


2019-03-13 23:21:54

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

Hi Jesse,

On Wed, Mar 13, 2019 at 11:50:07AM -0500, Jesse Hathaway wrote:
> Two regressions cause Linux to hang on boot when a Comtrol PCI card
> is present.
>
> If I revert the following two commits, I can boot again and the card
> operates without issue:
>
> 1302fcf0d03e (refs/bisect/bad) PCI: Configure *all* devices, not just
> hot-added ones
> 1c3c5eab1715 sched/core: Enable might_sleep() and smp_processor_id()
> checks early

I'm very sorry about the regression, but thank you very much for
narrowing it down and reporting it!

How did you narrow it down to *two* commits, and do you have to revert
both of them to avoid the hang? Usually a bisection identifies a
single commit, and the two you mention aren't related.

> ; lspci -vs 82:00.0
> 82:00.0 Multiport serial controller: Comtrol Corporation Device 0061
> Subsystem: Comtrol Corporation Device 0061
> Flags: 66MHz, medium devsel, IRQ 35, NUMA node 1
> Memory at c8004000 (32-bit, non-prefetchable) [size=4K]
> Memory at c8000000 (32-bit, non-prefetchable) [size=16K]
> Capabilities: [40] Hot-plug capable
> Capabilities: [48] Power Management version 2
> Kernel driver in use: rp2
> Kernel modules: rp2
>
> Is it possible that the problem is that the card claims to support
> Hot-plug, but does not?
>
> I would love to help fix this issue, please let me know what other
> information would be helpful to provide.

Can you collect a complete dmesg log (with a working kernel) and
output of "sudo lspci -vvxxx"? You can open a bug report at
https://bugzilla.kernel.org, attach the logs there, and respond here
with the URL.

Where does the hang happen? Is it when we configure the Comtrol card?

Bjorn

2019-03-14 20:58:15

by Jesse Hathaway

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

> > 1302fcf0d03e (refs/bisect/bad) PCI: Configure *all* devices, not just
> > hot-added ones
> > 1c3c5eab1715 sched/core: Enable might_sleep() and smp_processor_id()
> > checks early
>
> How did you narrow it down to *two* commits, and do you have to revert
> both of them to avoid the hang? Usually a bisection identifies a
> single commit, and the two you mention aren't related.

Sorry I should have been more verbose in what the bisection process was, I
found the problem after attempting to upgrade from linux v3.16 to v4.9. When
v4.9 hung I tried the latest kernel, v5.0, which also hanged. I began a git
bisect, but found there was more than one bad commit. Here is my current
understanding:

- [x] v3.18 vanilla, 1302fcf0d03e committed, hangs
- [x] v3.18 with revert of 1302fcf0d03e, works
.
.
.
- [x] v4.12 vanilla, hangs
- [x] v4.12 with revert of 1302fcf0d03e, works

- [x] v4.13 vanilla, 1c3c5eab1715 committed, hangs
- [x] v4.13 with revert of 1302fcf0d03e, hangs
- [x] v4.13 with revert of 1c3c5eab1715, hangs
- [x] v4.13 with revert of 1302fcf0d03e & 1c3c5eab1715, works

- [x] v5.0 vanilla, hangs
- [x] v5.0 with revert of 1302fcf0d03e & 1c3c5eab1715, works

> Can you collect a complete dmesg log (with a working kernel) and
> output of "sudo lspci -vvxxx"? You can open a bug report at
> https://bugzilla.kernel.org, attach the logs there, and respond here
> with the URL.

Bug submitted along with the requested logs,
https://bugzilla.kernel.org/show_bug.cgi?id=202927

> Where does the hang happen? Is it when we configure the Comtrol card?

Hang occurs after PCI is initialized, snippet below, I have included the full
output in the bug report:

[ 10.561971] pci 0000:81:00.0: bridge window [mem 0xc8000000-0xc80fffff]
[ 10.569661] pci 0000:80:01.0: PCI bridge to [bus 81-82]
[ 10.575594] pci 0000:80:01.0: bridge window [mem 0xc8000000-0xc80fffff]
[ 10.583278] pci 0000:80:03.0: PCI bridge to [bus 83]
[ 10.589008] NET: Registered protocol family 2
[ 10.594254] tcp_listen_portaddr_hash hash table entries: 65536
(order: 8, 1048576 bytes)
[ 10.603671] TCP established hash table entries: 524288 (order: 10,
4194304 bytes)
[ 10.612729] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
[ 10.620446] TCP: Hash tables configured (established 524288 bind 65536)
[ 10.628124] UDP hash table entries: 65536 (order: 9, 2097152 bytes)
[ 10.635541] UDP-Lite hash table entries: 65536 (order: 9, 2097152 bytes)
[ 10.643669] NET: Registered protocol family 1

Please let me know if there is anything else I can provide, I am also happy to
test any patches, Jesse Hathaway

2019-03-21 23:24:28

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

On Thu, Mar 14, 2019 at 03:57:07PM -0500, Jesse Hathaway wrote:
> > > 1302fcf0d03e (refs/bisect/bad) PCI: Configure *all* devices, not just
> > > hot-added ones
> > > 1c3c5eab1715 sched/core: Enable might_sleep() and smp_processor_id()
> > > checks early
> >
> > How did you narrow it down to *two* commits, and do you have to revert
> > both of them to avoid the hang? Usually a bisection identifies a
> > single commit, and the two you mention aren't related.
>
> Sorry I should have been more verbose in what the bisection process was, I
> found the problem after attempting to upgrade from linux v3.16 to v4.9. When
> v4.9 hung I tried the latest kernel, v5.0, which also hanged. I began a git
> bisect, but found there was more than one bad commit. Here is my current
> understanding:
>
> - [x] v3.18 vanilla, 1302fcf0d03e committed, hangs
> - [x] v3.18 with revert of 1302fcf0d03e, works
> .
> .
> .
> - [x] v4.12 vanilla, hangs
> - [x] v4.12 with revert of 1302fcf0d03e, works
>
> - [x] v4.13 vanilla, 1c3c5eab1715 committed, hangs
> - [x] v4.13 with revert of 1302fcf0d03e, hangs
> - [x] v4.13 with revert of 1c3c5eab1715, hangs
> - [x] v4.13 with revert of 1302fcf0d03e & 1c3c5eab1715, works
>
> - [x] v5.0 vanilla, hangs
> - [x] v5.0 with revert of 1302fcf0d03e & 1c3c5eab1715, works

Thanks! I doubt either of those commits is the real problem, but
they're both related to system_state, so it's conceivable they're both
involved in exposing the problem.

> > Can you collect a complete dmesg log (with a working kernel) and
> > output of "sudo lspci -vvxxx"? You can open a bug report at
> > https://bugzilla.kernel.org, attach the logs there, and respond here
> > with the URL.
>
> Bug submitted along with the requested logs,
> https://bugzilla.kernel.org/show_bug.cgi?id=202927

Thanks for that.

> > Where does the hang happen? Is it when we configure the Comtrol card?
>
> Hang occurs after PCI is initialized, snippet below, I have included the full
> output in the bug report:
>
> [ 10.561971] pci 0000:81:00.0: bridge window [mem 0xc8000000-0xc80fffff]
> [ 10.569661] pci 0000:80:01.0: PCI bridge to [bus 81-82]
> [ 10.575594] pci 0000:80:01.0: bridge window [mem 0xc8000000-0xc80fffff]
> [ 10.583278] pci 0000:80:03.0: PCI bridge to [bus 83]
> [ 10.589008] NET: Registered protocol family 2
> [ 10.594254] tcp_listen_portaddr_hash hash table entries: 65536
> (order: 8, 1048576 bytes)
> [ 10.603671] TCP established hash table entries: 524288 (order: 10,
> 4194304 bytes)
> [ 10.612729] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
> [ 10.620446] TCP: Hash tables configured (established 524288 bind 65536)
> [ 10.628124] UDP hash table entries: 65536 (order: 9, 2097152 bytes)
> [ 10.635541] UDP-Lite hash table entries: 65536 (order: 9, 2097152 bytes)
> [ 10.643669] NET: Registered protocol family 1

The successful boot continues on with this:

[ 10.675996] pci 0000:00:1a.0: quirk_usb_early_handoff+0x0/0x6a0 took 22519 usecs
[ 10.684519] pci 0000:03:00.0: [Firmware Bug]: disabling VPD access (can't determine size of non-standard VPD for)
[ 10.696404] pci 0000:03:00.0: quirk_blacklist_vpd+0x0/0x30 took 11605 usecs
[ 10.704515] pci 0000:0b:00.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]

So apparently the hang happens while we're running the "final" PCI
fixups. This happens after all the rest of PCI is initialized.

Can you boot v5.0 vanilla with "initcall_debug"? Maybe we can narrow
it down to a specific quirk.

Bjorn

2019-03-22 20:03:39

by Jesse Hathaway

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

> So apparently the hang happens while we're running the "final" PCI
> fixups. This happens after all the rest of PCI is initialized.
>
> Can you boot v5.0 vanilla with "initcall_debug"? Maybe we can narrow
> it down to a specific quirk.

yup, added the "initcall_debug" output to the ticket:
https://bugzilla.kernel.org/show_bug.cgi?id=202927, here is the tail end

[ 14.896337] NET: Registered protocol family 1
[ 14.901314] initcall af_unix_init+0x0/0x4e returned 0 after 4866 usecs
[ 14.908694] calling ipv6_offload_init+0x0/0x7f @ 1
[ 14.914238] initcall ipv6_offload_init+0x0/0x7f returned 0 after 1 usecs
[ 14.921821] calling vlan_offload_init+0x0/0x20 @ 1
[ 14.927365] initcall vlan_offload_init+0x0/0x20 returned 0 after 0 usecs
[ 14.934948] calling pci_apply_final_quirks+0x0/0x126 @ 1
[ 14.941106] pci 0000:00:1a.0: calling quirk_usb_early_handoff+0x0/0x6a0 @ 1

thanks, Jesse Hathaway

2019-04-01 19:45:44

by Jesse Hathaway

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

On Fri, Mar 22, 2019 at 3:02 PM Jesse Hathaway <[email protected]> wrote:
> > Can you boot v5.0 vanilla with "initcall_debug"? Maybe we can narrow
> > it down to a specific quirk.
>
> yup, added the "initcall_debug" output to the ticket:
> https://bugzilla.kernel.org/show_bug.cgi?id=202927, here is the tail end
>
> [ 14.896337] NET: Registered protocol family 1
> [ 14.901314] initcall af_unix_init+0x0/0x4e returned 0 after 4866 usecs
> [ 14.908694] calling ipv6_offload_init+0x0/0x7f @ 1
> [ 14.914238] initcall ipv6_offload_init+0x0/0x7f returned 0 after 1 usecs
> [ 14.921821] calling vlan_offload_init+0x0/0x20 @ 1
> [ 14.927365] initcall vlan_offload_init+0x0/0x20 returned 0 after 0 usecs
> [ 14.934948] calling pci_apply_final_quirks+0x0/0x126 @ 1
> [ 14.941106] pci 0000:00:1a.0: calling quirk_usb_early_handoff+0x0/0x6a0 @ 1

Bjorn, did you get a chance to look at the initcall_debug output for anything
obvious to you on what might be the cause of the problem?

Thanks, Jesse Hathaway

2019-04-01 21:14:54

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

[+cc Mathias, Greg, Alan, linux-usb]

Beginning of thread: https://lore.kernel.org/linux-pci/CANSNSoWiKd98Dt1N2sSjP9Af8zk1NPV-=3P4VLtFs_cSQG4RUg@mail.gmail.com

Synopsis: v5.0 hangs at boot unless the following commits are reverted:

1302fcf0d03e ("PCI: Configure *all* devices, not just hot-added ones")
1c3c5eab1715 ("sched/core: Enable might_sleep() and smp_processor_id() checks early")

The hang appears to be in quirk_usb_early_handoff(). With
"initcall_debug", we see the call but not the completion:

On Fri, Mar 22, 2019 at 03:02:11PM -0500, Jesse Hathaway wrote:
> > So apparently the hang happens while we're running the "final" PCI
> > fixups. This happens after all the rest of PCI is initialized.
> >
> > Can you boot v5.0 vanilla with "initcall_debug"? Maybe we can narrow
> > it down to a specific quirk.
>
> yup, added the "initcall_debug" output to the ticket:
> https://bugzilla.kernel.org/show_bug.cgi?id=202927, here is the tail end
>
> [ 14.896337] NET: Registered protocol family 1
> [ 14.901314] initcall af_unix_init+0x0/0x4e returned 0 after 4866 usecs
> [ 14.908694] calling ipv6_offload_init+0x0/0x7f @ 1
> [ 14.914238] initcall ipv6_offload_init+0x0/0x7f returned 0 after 1 usecs
> [ 14.921821] calling vlan_offload_init+0x0/0x20 @ 1
> [ 14.927365] initcall vlan_offload_init+0x0/0x20 returned 0 after 0 usecs
> [ 14.934948] calling pci_apply_final_quirks+0x0/0x126 @ 1
> [ 14.941106] pci 0000:00:1a.0: calling quirk_usb_early_handoff+0x0/0x6a0 @ 1
>
> thanks, Jesse Hathaway

2019-04-02 15:49:38

by Alan Stern

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

On Mon, 1 Apr 2019, Bjorn Helgaas wrote:

> [+cc Mathias, Greg, Alan, linux-usb]
>
> Beginning of thread: https://lore.kernel.org/linux-pci/CANSNSoWiKd98Dt1N2sSjP9Af8zk1NPV-=3P4VLtFs_cSQG4RUg@mail.gmail.com
>
> Synopsis: v5.0 hangs at boot unless the following commits are reverted:
>
> 1302fcf0d03e ("PCI: Configure *all* devices, not just hot-added ones")
> 1c3c5eab1715 ("sched/core: Enable might_sleep() and smp_processor_id() checks early")
>
> The hang appears to be in quirk_usb_early_handoff(). With
> "initcall_debug", we see the call but not the completion:
>
> On Fri, Mar 22, 2019 at 03:02:11PM -0500, Jesse Hathaway wrote:
> > > So apparently the hang happens while we're running the "final" PCI
> > > fixups. This happens after all the rest of PCI is initialized.
> > >
> > > Can you boot v5.0 vanilla with "initcall_debug"? Maybe we can narrow
> > > it down to a specific quirk.
> >
> > yup, added the "initcall_debug" output to the ticket:
> > https://bugzilla.kernel.org/show_bug.cgi?id=202927, here is the tail end
> >
> > [ 14.896337] NET: Registered protocol family 1
> > [ 14.901314] initcall af_unix_init+0x0/0x4e returned 0 after 4866 usecs
> > [ 14.908694] calling ipv6_offload_init+0x0/0x7f @ 1
> > [ 14.914238] initcall ipv6_offload_init+0x0/0x7f returned 0 after 1 usecs
> > [ 14.921821] calling vlan_offload_init+0x0/0x20 @ 1
> > [ 14.927365] initcall vlan_offload_init+0x0/0x20 returned 0 after 0 usecs
> > [ 14.934948] calling pci_apply_final_quirks+0x0/0x126 @ 1
> > [ 14.941106] pci 0000:00:1a.0: calling quirk_usb_early_handoff+0x0/0x6a0 @ 1
> >
> > thanks, Jesse Hathaway

Most likely the problem occurs somewhere inside
quirk_usb_handoff_xhci(). Can Jesse add debugging statements to that
routine in order to pin down exactly where the problem lies?

Alan Stern

2019-04-02 15:52:40

by Mathias Nyman

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

On 2.4.2019 17.29, Alan Stern wrote:
> On Mon, 1 Apr 2019, Bjorn Helgaas wrote:
>
>> [+cc Mathias, Greg, Alan, linux-usb]
>>
>> Beginning of thread: https://lore.kernel.org/linux-pci/CANSNSoWiKd98Dt1N2sSjP9Af8zk1NPV-=3P4VLtFs_cSQG4RUg@mail.gmail.com
>>
>> Synopsis: v5.0 hangs at boot unless the following commits are reverted:
>>
>> 1302fcf0d03e ("PCI: Configure *all* devices, not just hot-added ones")
>> 1c3c5eab1715 ("sched/core: Enable might_sleep() and smp_processor_id() checks early")
>>
>> The hang appears to be in quirk_usb_early_handoff(). With
>> "initcall_debug", we see the call but not the completion:
>>
>> On Fri, Mar 22, 2019 at 03:02:11PM -0500, Jesse Hathaway wrote:
>>>> So apparently the hang happens while we're running the "final" PCI
>>>> fixups. This happens after all the rest of PCI is initialized.
>>>>
>>>> Can you boot v5.0 vanilla with "initcall_debug"? Maybe we can narrow
>>>> it down to a specific quirk.
>>>
>>> yup, added the "initcall_debug" output to the ticket:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=202927, here is the tail end
>>>
>>> [ 14.896337] NET: Registered protocol family 1
>>> [ 14.901314] initcall af_unix_init+0x0/0x4e returned 0 after 4866 usecs
>>> [ 14.908694] calling ipv6_offload_init+0x0/0x7f @ 1
>>> [ 14.914238] initcall ipv6_offload_init+0x0/0x7f returned 0 after 1 usecs
>>> [ 14.921821] calling vlan_offload_init+0x0/0x20 @ 1
>>> [ 14.927365] initcall vlan_offload_init+0x0/0x20 returned 0 after 0 usecs
>>> [ 14.934948] calling pci_apply_final_quirks+0x0/0x126 @ 1
>>> [ 14.941106] pci 0000:00:1a.0: calling quirk_usb_early_handoff+0x0/0x6a0 @ 1
>>>
>>> thanks, Jesse Hathaway
>
> Most likely the problem occurs somewhere inside
> quirk_usb_handoff_xhci(). Can Jesse add debugging statements to that
> routine in order to pin down exactly where the problem lies?
>
> Alan Stern
>

lspci log from bugzilla shows 0000:00:1a.0 is a EHCI device,
so probably woth adding debugging to quirk_usb_disable_ehci() as well

00:1a.0 USB controller: Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI])

-Mathias

2019-04-02 18:33:12

by Alan Stern

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

On Tue, 2 Apr 2019, Mathias Nyman wrote:

> On 2.4.2019 17.29, Alan Stern wrote:
> > On Mon, 1 Apr 2019, Bjorn Helgaas wrote:
> >
> >> [+cc Mathias, Greg, Alan, linux-usb]
> >>
> >> Beginning of thread: https://lore.kernel.org/linux-pci/CANSNSoWiKd98Dt1N2sSjP9Af8zk1NPV-=3P4VLtFs_cSQG4RUg@mail.gmail.com
> >>
> >> Synopsis: v5.0 hangs at boot unless the following commits are reverted:
> >>
> >> 1302fcf0d03e ("PCI: Configure *all* devices, not just hot-added ones")
> >> 1c3c5eab1715 ("sched/core: Enable might_sleep() and smp_processor_id() checks early")
> >>
> >> The hang appears to be in quirk_usb_early_handoff(). With
> >> "initcall_debug", we see the call but not the completion:
> >>
> >> On Fri, Mar 22, 2019 at 03:02:11PM -0500, Jesse Hathaway wrote:
> >>>> So apparently the hang happens while we're running the "final" PCI
> >>>> fixups. This happens after all the rest of PCI is initialized.
> >>>>
> >>>> Can you boot v5.0 vanilla with "initcall_debug"? Maybe we can narrow
> >>>> it down to a specific quirk.
> >>>
> >>> yup, added the "initcall_debug" output to the ticket:
> >>> https://bugzilla.kernel.org/show_bug.cgi?id=202927, here is the tail end
> >>>
> >>> [ 14.896337] NET: Registered protocol family 1
> >>> [ 14.901314] initcall af_unix_init+0x0/0x4e returned 0 after 4866 usecs
> >>> [ 14.908694] calling ipv6_offload_init+0x0/0x7f @ 1
> >>> [ 14.914238] initcall ipv6_offload_init+0x0/0x7f returned 0 after 1 usecs
> >>> [ 14.921821] calling vlan_offload_init+0x0/0x20 @ 1
> >>> [ 14.927365] initcall vlan_offload_init+0x0/0x20 returned 0 after 0 usecs
> >>> [ 14.934948] calling pci_apply_final_quirks+0x0/0x126 @ 1
> >>> [ 14.941106] pci 0000:00:1a.0: calling quirk_usb_early_handoff+0x0/0x6a0 @ 1
> >>>
> >>> thanks, Jesse Hathaway
> >
> > Most likely the problem occurs somewhere inside
> > quirk_usb_handoff_xhci(). Can Jesse add debugging statements to that
> > routine in order to pin down exactly where the problem lies?
> >
> > Alan Stern
> >
>
> lspci log from bugzilla shows 0000:00:1a.0 is a EHCI device,
> so probably woth adding debugging to quirk_usb_disable_ehci() as well
>
> 00:1a.0 USB controller: Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI])

Oops! Quite right, my mistake. After all these years, I ought to
remember that Intel puts their motherboards' EHCI and UHCI controllers
at PCI addresses 00:1a and 00:1d.

Alan Stern

2019-04-04 15:42:53

by Jesse Hathaway

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

On Tue, Apr 2, 2019 at 9:29 AM Alan Stern <[email protected]> wrote:
> Most likely the problem occurs somewhere inside
> quirk_usb_handoff_xhci(). Can Jesse add debugging statements to that
> routine in order to pin down exactly where the problem lies?

Alan,

I added debug statements to quirk_usb_early_handoff, quirk_usb_disable_ehci &
ehci_bios_handoff. The box hangs right before calling:

pci_write_config_byte(pdev, offset + 3, 1);

which is in ehci_bios_handoff:

[ 10.698240] DEBUG: Passed quirk_usb_early_handoff 1300
[ 10.704271] DEBUG: Passed quirk_usb_early_handoff 1308
[ 10.710206] DEBUG: Passed quirk_usb_disable_ehci 939
[ 10.715949] DEBUG: Passed quirk_usb_disable_ehci 945
[ 10.721685] DEBUG: Passed quirk_usb_disable_ehci 950
[ 10.727423] DEBUG: Passed quirk_usb_disable_ehci 958
[ 10.733160] DEBUG: Passed quirk_usb_disable_ehci 964
[ 10.738897] DEBUG: Passed quirk_usb_disable_ehci 968
[ 10.744633] DEBUG: Passed ehci_bios_handoff 849
[ 10.749884] DEBUG: Passed ehci_bios_handoff 884

I have attached the debug output, and my modified pci-quirks.c file
to the bug report, let me know what else I can do to help.

2019-04-04 17:17:01

by Alan Stern

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

On Thu, 4 Apr 2019, Jesse Hathaway wrote:

> On Tue, Apr 2, 2019 at 9:29 AM Alan Stern <[email protected]> wrote:
> > Most likely the problem occurs somewhere inside
> > quirk_usb_handoff_xhci(). Can Jesse add debugging statements to that
> > routine in order to pin down exactly where the problem lies?
>
> Alan,
>
> I added debug statements to quirk_usb_early_handoff, quirk_usb_disable_ehci &
> ehci_bios_handoff. The box hangs right before calling:
>
> pci_write_config_byte(pdev, offset + 3, 1);

Right _before_ that line? Not _after_ it?

That's surprising because the two preceding lines of code are the
condition of an "if" statement and a dev_dbg() call. I don't see how
either of them could cause a hang.

Maybe the hang is a delayed reaction to something happening somewhere
else. But on the assumption that it isn't, you could try commenting
out various parts of ehci_bios_handoff to see which ones make a
difference.

> which is in ehci_bios_handoff:
>
> [ 10.698240] DEBUG: Passed quirk_usb_early_handoff 1300
> [ 10.704271] DEBUG: Passed quirk_usb_early_handoff 1308
> [ 10.710206] DEBUG: Passed quirk_usb_disable_ehci 939
> [ 10.715949] DEBUG: Passed quirk_usb_disable_ehci 945
> [ 10.721685] DEBUG: Passed quirk_usb_disable_ehci 950
> [ 10.727423] DEBUG: Passed quirk_usb_disable_ehci 958
> [ 10.733160] DEBUG: Passed quirk_usb_disable_ehci 964
> [ 10.738897] DEBUG: Passed quirk_usb_disable_ehci 968
> [ 10.744633] DEBUG: Passed ehci_bios_handoff 849
> [ 10.749884] DEBUG: Passed ehci_bios_handoff 884
>
> I have attached the debug output, and my modified pci-quirks.c file
> to the bug report, let me know what else I can do to help.

Nothing was attached.

Alan Stern

2019-04-04 17:37:37

by Jesse Hathaway

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

On Thu, Apr 4, 2019 at 12:16 PM Alan Stern <[email protected]> wrote:
> > I added debug statements to quirk_usb_early_handoff, quirk_usb_disable_ehci &
> > ehci_bios_handoff. The box hangs right before calling:
> >
> > pci_write_config_byte(pdev, offset + 3, 1);
>
> Right _before_ that line? Not _after_ it?

Sorry I should have been more precise, it hangs executing the above
pci_write_config_function. I get the debug printk immediately preceding
that line.

> That's surprising because the two preceding lines of code are the
> condition of an "if" statement and a dev_dbg() call. I don't see how
> either of them could cause a hang.
>
> Maybe the hang is a delayed reaction to something happening somewhere
> else. But on the assumption that it isn't, you could try commenting
> out various parts of ehci_bios_handoff to see which ones make a
> difference.

will do

> > which is in ehci_bios_handoff:
> >
> > [ 10.698240] DEBUG: Passed quirk_usb_early_handoff 1300
> > [ 10.704271] DEBUG: Passed quirk_usb_early_handoff 1308
> > [ 10.710206] DEBUG: Passed quirk_usb_disable_ehci 939
> > [ 10.715949] DEBUG: Passed quirk_usb_disable_ehci 945
> > [ 10.721685] DEBUG: Passed quirk_usb_disable_ehci 950
> > [ 10.727423] DEBUG: Passed quirk_usb_disable_ehci 958
> > [ 10.733160] DEBUG: Passed quirk_usb_disable_ehci 964
> > [ 10.738897] DEBUG: Passed quirk_usb_disable_ehci 968
> > [ 10.744633] DEBUG: Passed ehci_bios_handoff 849
> > [ 10.749884] DEBUG: Passed ehci_bios_handoff 884
> >
> > I have attached the debug output, and my modified pci-quirks.c file
> > to the bug report, let me know what else I can do to help.

sorry I attached them to the bug report, but I have attached them to this
email as well.


Attachments:
initcall_debug_printk.log (58.25 kB)
pci-quirks.c (35.94 kB)
Download all attachments

2019-04-04 20:41:35

by Alan Stern

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

On Thu, 4 Apr 2019, Jesse Hathaway wrote:

> On Thu, Apr 4, 2019 at 12:16 PM Alan Stern <[email protected]> wrote:
> > > I added debug statements to quirk_usb_early_handoff, quirk_usb_disable_ehci &
> > > ehci_bios_handoff. The box hangs right before calling:
> > >
> > > pci_write_config_byte(pdev, offset + 3, 1);
> >
> > Right _before_ that line? Not _after_ it?
>
> Sorry I should have been more precise, it hangs executing the above
> pci_write_config_function. I get the debug printk immediately preceding
> that line.
>
> > That's surprising because the two preceding lines of code are the
> > condition of an "if" statement and a dev_dbg() call. I don't see how
> > either of them could cause a hang.
> >
> > Maybe the hang is a delayed reaction to something happening somewhere
> > else. But on the assumption that it isn't, you could try commenting
> > out various parts of ehci_bios_handoff to see which ones make a
> > difference.
>
> will do
>
> > > which is in ehci_bios_handoff:
> > >
> > > [ 10.698240] DEBUG: Passed quirk_usb_early_handoff 1300
> > > [ 10.704271] DEBUG: Passed quirk_usb_early_handoff 1308
> > > [ 10.710206] DEBUG: Passed quirk_usb_disable_ehci 939
> > > [ 10.715949] DEBUG: Passed quirk_usb_disable_ehci 945
> > > [ 10.721685] DEBUG: Passed quirk_usb_disable_ehci 950
> > > [ 10.727423] DEBUG: Passed quirk_usb_disable_ehci 958
> > > [ 10.733160] DEBUG: Passed quirk_usb_disable_ehci 964
> > > [ 10.738897] DEBUG: Passed quirk_usb_disable_ehci 968
> > > [ 10.744633] DEBUG: Passed ehci_bios_handoff 849
> > > [ 10.749884] DEBUG: Passed ehci_bios_handoff 884
> > >
> > > I have attached the debug output, and my modified pci-quirks.c file
> > > to the bug report, let me know what else I can do to help.
>
> sorry I attached them to the bug report, but I have attached them to this
> email as well.

Okay. You could try skipping that pci_write_config_byte() call. The
following loop would probably time out, and you might find that the
code crashes later on.

You could also try setting try_handoff to 0 near the start of the
routine. Your system plus the Comtrol PCI card could have the same
sort of bug as reported in Bugzilla #77021.

Alan Stern

2019-04-05 21:34:52

by Jesse Hathaway

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

On Thu, Apr 4, 2019 at 2:14 PM Alan Stern <[email protected]> wrote:
> Okay. You could try skipping that pci_write_config_byte() call. The
> following loop would probably time out, and you might find that the
> code crashes later on.
>
> You could also try setting try_handoff to 0 near the start of the
> routine. Your system plus the Comtrol PCI card could have the same
> sort of bug as reported in Bugzilla #77021.

I tried both of those options and the output was similar, here is the output
with try_handoff set to 0. In both cases the box then hangs on initializing
pstore. Excuse my ignorance on understanding how having the Comtrol
card present would effect the USB handoff, but would there be any
value in trying to boot this box via UEFI, rather than via the BIOS? Does
UEFI handle the handoff differently?

[ 11.391514] DEBUG: Passed quirk_usb_disable_ehci 945
[ 11.397250] DEBUG: Passed quirk_usb_disable_ehci 950
[ 11.402988] DEBUG: Passed quirk_usb_disable_ehci 958
[ 11.408724] DEBUG: Passed quirk_usb_disable_ehci 964
[ 11.414461] DEBUG: Passed quirk_usb_disable_ehci 968
[ 11.420197] DEBUG: Passed ehci_bios_handoff 849
[ 11.425448] DEBUG: Passed ehci_bios_handoff 889
[ 11.430699] DEBUG: Passed ehci_bios_handoff 902
[ 11.435952] DEBUG: Passed ehci_bios_handoff 918
[ 11.441203] DEBUG: Passed ehci_bios_handoff 926
[ 11.446455] DEBUG: Passed quirk_usb_disable_ehci 981
[ 11.452292] DEBUG: Passed quirk_usb_disable_ehci 1009
[ 11.458180] pci 0000:00:1d.0: quirk_usb_early_handoff+0x0/0x865
took 82350 usecs
[ 11.466568] pci 0000:01:00.0: calling quirk_e100_interrupt+0x0/0x1a0 @ 1
[ 11.474249] pci 0000:01:00.0: quirk_e100_interrupt+0x0/0x1a0 took 0 usecs
[ 11.481931] pci 0000:01:00.1: calling quirk_e100_interrupt+0x0/0x1a0 @ 1
[ 11.489612] pci 0000:01:00.1: quirk_e100_interrupt+0x0/0x1a0 took 0 usecs
[ 11.497296] pci 0000:03:00.0: calling quirk_blacklist_vpd+0x0/0x30 @ 1
[ 11.504783] pci 0000:03:00.0: [Firmware Bug]: disabling VPD access
(can't determine size of non-standard VPD format)
[ 11.516663] pci 0000:03:00.0: quirk_blacklist_vpd+0x0/0x30 took 11600 usecs
[ 11.524538] pci 0000:07:00.0: calling quirk_e100_interrupt+0x0/0x1a0 @ 1
[ 11.532219] pci 0000:07:00.0: quirk_e100_interrupt+0x0/0x1a0 took 0 usecs
[ 11.539901] pci 0000:07:00.1: calling quirk_e100_interrupt+0x0/0x1a0 @ 1
[ 11.547582] pci 0000:07:00.1: quirk_e100_interrupt+0x0/0x1a0 took 0 usecs
[ 11.555374] pci 0000:0b:00.0: calling pci_fixup_video+0x0/0x110 @ 1
[ 11.562797] pci 0000:0b:00.0: Video device with shadowed ROM at
[mem 0x000c0000-0x000dffff]
[ 11.572248] pci 0000:0b:00.0: pci_fixup_video+0x0/0x110 took 9453 usecs
[ 11.579991] Unpacking initramfs...
[ 11.923856] Freeing initrd memory: 22636K
[ 11.950050] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[ 11.957345] software IO TLB: mapped [mem 0x76289000-0x7a289000] (64MB)
[ 11.967721] Initialise system trusted keyrings
[ 11.972908] workingset: timestamp_bits=40 max_order=27 bucket_order=0
[ 11.981517] zbud: loaded
[ 12.138528] Key type asymmetric registered
[ 12.143201] Asymmetric key parser 'x509' registered
[ 12.148754] Block layer SCSI generic (bsg) driver version 0.4
loaded (major 247)
[ 12.157245] io scheduler mq-deadline registered
[ 12.163318] pcieport 0000:00:01.0: Signaling PME with IRQ 25
[ 12.169926] pcieport 0000:00:02.0: Signaling PME with IRQ 26
[ 12.176519] pcieport 0000:00:03.0: Signaling PME with IRQ 27
[ 12.183109] pcieport 0000:00:03.2: Signaling PME with IRQ 28
[ 12.189678] pcieport 0000:00:1c.0: Signaling PME with IRQ 29
[ 12.196248] pcieport 0000:00:1c.4: Signaling PME with IRQ 30
[ 12.202833] pcieport 0000:00:1c.7: Signaling PME with IRQ 31
[ 12.212006] pcieport 0000:80:01.0: Signaling PME with IRQ 33
[ 12.218609] pcieport 0000:80:03.0: Signaling PME with IRQ 34
[ 12.225404] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[ 12.238133] ERST: Error Record Serialization Table (ERST) support
is initialized.
[ 12.246614] pstore: Registered erst as persistent store backend

2019-04-06 15:33:25

by Alan Stern

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

On Fri, 5 Apr 2019, Jesse Hathaway wrote:

> On Thu, Apr 4, 2019 at 2:14 PM Alan Stern <[email protected]> wrote:
> > Okay. You could try skipping that pci_write_config_byte() call. The
> > following loop would probably time out, and you might find that the
> > code crashes later on.
> >
> > You could also try setting try_handoff to 0 near the start of the
> > routine. Your system plus the Comtrol PCI card could have the same
> > sort of bug as reported in Bugzilla #77021.
>
> I tried both of those options and the output was similar, here is the output
> with try_handoff set to 0. In both cases the box then hangs on initializing
> pstore.

Well, at least that's forward progress. I don't know what pstore is or
what connection it has to the USB subsystem. Does the machine hang
similarly if you boot without the Comtrol PCI card present?

For that matter, what happens if you remove EHCI from the kernel
configuration completely?

> Excuse my ignorance on understanding how having the Comtrol
> card present would effect the USB handoff, but would there be any
> value in trying to boot this box via UEFI, rather than via the BIOS? Does
> UEFI handle the handoff differently?

The UEFI and the legacy BIOS are two separate pieces of code with no
particular connection to one another. It's a good bet that they do
pretty much everything differently.

As for how the PCI card affects the USB handoff, it depends on how the
BIOS behaves. Normally the BIOS will take control of all the available
EHCI controllers during bootup (so that it can use them to communicate
with a USB keyboard or mouse), including controllers on add-on PCI
cards as well as those on the motherboard. When the kernel starts up,
it tries to take ownership of the controllers away from the BIOS
(that's the handoff) so that Linux can use them. However, if the BIOS
was never tested for handoff of USB controllers on add-on PCI cards, it
could easily have a bug that would crash the machine.

Alan Stern

2019-04-15 21:48:35

by Jesse Hathaway

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

On Sat, Apr 6, 2019 at 10:32 AM Alan Stern <[email protected]> wrote:
> Well, at least that's forward progress. I don't know what pstore is or
> what connection it has to the USB subsystem. Does the machine hang
> similarly if you boot without the Comtrol PCI card present?

Yes the box boots fine when the Comtrol PCI card is *not* present.

> For that matter, what happens if you remove EHCI from the kernel
> configuration completely?

If I remove USB support, the box still hangs after registering the pstore, but
if I remove pstore support and APEI support from the kernel then the box boots
without issue.

> As for how the PCI card affects the USB handoff, it depends on how the
> BIOS behaves. Normally the BIOS will take control of all the available
> EHCI controllers during bootup (so that it can use them to communicate
> with a USB keyboard or mouse), including controllers on add-on PCI
> cards as well as those on the motherboard. When the kernel starts up,
> it tries to take ownership of the controllers away from the BIOS
> (that's the handoff) so that Linux can use them. However, if the BIOS
> was never tested for handoff of USB controllers on add-on PCI cards, it
> could easily have a bug that would crash the machine.

The Comtrol card provides 32 serial ports, via a breakout box, but it has
no USB functionality, which was why I was surprised that its presence
somehow breaks the USB hand off.

2019-04-16 15:02:24

by Alan Stern

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

On Mon, 15 Apr 2019, Jesse Hathaway wrote:

> On Sat, Apr 6, 2019 at 10:32 AM Alan Stern <[email protected]> wrote:
> > Well, at least that's forward progress. I don't know what pstore is or
> > what connection it has to the USB subsystem. Does the machine hang
> > similarly if you boot without the Comtrol PCI card present?
>
> Yes the box boots fine when the Comtrol PCI card is *not* present.
>
> > For that matter, what happens if you remove EHCI from the kernel
> > configuration completely?
>
> If I remove USB support, the box still hangs after registering the pstore, but
> if I remove pstore support and APEI support from the kernel then the box boots
> without issue.
>
> > As for how the PCI card affects the USB handoff, it depends on how the
> > BIOS behaves. Normally the BIOS will take control of all the available
> > EHCI controllers during bootup (so that it can use them to communicate
> > with a USB keyboard or mouse), including controllers on add-on PCI
> > cards as well as those on the motherboard. When the kernel starts up,
> > it tries to take ownership of the controllers away from the BIOS
> > (that's the handoff) so that Linux can use them. However, if the BIOS
> > was never tested for handoff of USB controllers on add-on PCI cards, it
> > could easily have a bug that would crash the machine.
>
> The Comtrol card provides 32 serial ports, via a breakout box, but it has
> no USB functionality, which was why I was surprised that its presence
> somehow breaks the USB hand off.

Well, I am completely mystified. Nor do I understand how the commits
you identified could be related, although maybe the relationship is
very indirect.

Whatever the source of the problem, I don't think you're going to find
it by looking at the USB code. Perhaps the early initialization of the
functions that _are_ present on the Comtrol card somehow messes up
other parts of the system.

Alan Stern

2019-04-23 20:20:09

by Jesse Hathaway

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

On Tue, Apr 16, 2019 at 10:00 AM Alan Stern <[email protected]> wrote:
> Whatever the source of the problem, I don't think you're going to find
> it by looking at the USB code. Perhaps the early initialization of the
> functions that _are_ present on the Comtrol card somehow messes up
> other parts of the system.

Thanks for all you help Alan,

I think at this point my Linux kernel debugging knowledge has run
aground. I would
love to crack this nut, but I doubt that will be possible with my
current knowledge base.
For now I am going to run the patched kernel even if I don't
understand precisely why
they prevent the box from hanging on boot when the Comtrol card is present.

2019-04-24 19:44:11

by Alan Stern

[permalink] [raw]
Subject: Re: Regression causes a hang on boot with a Comtrol PCI card

On Tue, 23 Apr 2019, Jesse Hathaway wrote:

> On Tue, Apr 16, 2019 at 10:00 AM Alan Stern <[email protected]> wrote:
> > Whatever the source of the problem, I don't think you're going to find
> > it by looking at the USB code. Perhaps the early initialization of the
> > functions that _are_ present on the Comtrol card somehow messes up
> > other parts of the system.
>
> Thanks for all you help Alan,
>
> I think at this point my Linux kernel debugging knowledge has run
> aground. I would
> love to crack this nut, but I doubt that will be possible with my
> current knowledge base.
> For now I am going to run the patched kernel even if I don't
> understand precisely why
> they prevent the box from hanging on boot when the Comtrol card is present.

Okay, that's understandable. Sorry I couldn't do more to help.

Alan Stern