2020-02-05 00:57:04

by Joel Stanley

[permalink] [raw]
Subject: TI PCIe xHCI and kexec

I'm supporting a system that uses Linux-as-a-bootloader to load a
distro kernel via kexec, The systems have a TI TUSB73x0 PCIe
controller which goes out to lunch after a kexec. This is the distro
(post-kexec) kernel:

[ 0.235411] pci 0003:01:00.0: xHCI HW did not halt within 16000
usec status = 0x0
[ 1.037298] xhci_hcd 0003:01:00.0: xHCI Host Controller
[ 1.037367] xhci_hcd 0003:01:00.0: new USB bus registered, assigned
bus number 1
[ 1.053481] xhci_hcd 0003:01:00.0: Host halt failed, -110
[ 1.053523] xhci_hcd 0003:01:00.0: can't setup: -110
[ 1.053565] xhci_hcd 0003:01:00.0: USB bus 1 deregistered
[ 1.053629] xhci_hcd 0003:01:00.0: init 0003:01:00.0 fail, -110
[ 1.053703] xhci_hcd: probe of 0003:01:00.0 failed with error -110

There were some fixes made a few years back to improve the situation,
but we've still had to carry some form of the patch below in the
bootloader kernel. I would like to rework it so it can be merged.

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index dbac0fa9748d..eaa94456dd9d 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -789,6 +789,9 @@ void xhci_shutdown(struct usb_hcd *hcd)
xhci_dbg_trace(xhci, trace_xhci_dbg_init,
"xhci_shutdown completed - status = %x",
readl(&xhci->op_regs->status));
+
+ /* TI XHCI controllers do not come back after kexec without this hack */
+ pci_reset_function_locked(to_pci_dev(hcd->self.sysdev));
}
EXPORT_SYMBOL_GPL(xhci_shutdown);

I would like some advice on how to implement it in a way that is
acceptable. Would a quirk on the pci id in xhci_shutdown be ok?

0003:01:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB
3.0 xHCI Host Controller (rev 02)

The full debug log of the distro kernel booting is below.

Cheers,

Joel

---

[ 1.037638] xhci_hcd 0003:01:00.0: xHCI Host Controller
[ 1.037695] xhci_hcd 0003:01:00.0: new USB bus registered, assigned
bus number 1
[ 1.037789] xhci_hcd 0003:01:00.0: xHCI capability registers at
(ptrval):
[ 1.037792] xhci_hcd 0003:01:00.0: CAPLENGTH AND HCIVERSION 0x960020:
[ 1.037794] xhci_hcd 0003:01:00.0: CAPLENGTH: 0x20
[ 1.037796] xhci_hcd 0003:01:00.0: HCIVERSION: 0x96
[ 1.037798] xhci_hcd 0003:01:00.0: HCSPARAMS 1: 0x8000840
[ 1.037800] xhci_hcd 0003:01:00.0: Max device slots: 64
[ 1.037802] xhci_hcd 0003:01:00.0: Max interrupters: 8
[ 1.037804] xhci_hcd 0003:01:00.0: Max ports: 8
[ 1.037806] xhci_hcd 0003:01:00.0: HCSPARAMS 2: 0xc0000f1
[ 1.037808] xhci_hcd 0003:01:00.0: Isoc scheduling threshold: 1
[ 1.037810] xhci_hcd 0003:01:00.0: Maximum allowed segments in
event ring: 15
[ 1.037813] xhci_hcd 0003:01:00.0: HCSPARAMS 3 0x7ff000a:
[ 1.037815] xhci_hcd 0003:01:00.0: Worst case U1 device exit latency: 10
[ 1.037816] xhci_hcd 0003:01:00.0: Worst case U2 device exit latency: 2047
[ 1.037819] xhci_hcd 0003:01:00.0: HCC PARAMS 0x270f06d:
[ 1.037821] xhci_hcd 0003:01:00.0: HC generates 64 bit addresses
[ 1.037823] xhci_hcd 0003:01:00.0: HC hasn't Contiguous Frame ID Capability
[ 1.037825] xhci_hcd 0003:01:00.0: HC can't generate Stopped -
Short Package event
[ 1.037826] xhci_hcd 0003:01:00.0: FIXME: more HCCPARAMS debugging
[ 1.037829] xhci_hcd 0003:01:00.0: RTSOFF 0x4a0:
[ 1.037831] xhci_hcd 0003:01:00.0: xHCI operational registers at
(ptrval):
[ 1.037833] xhci_hcd 0003:01:00.0: USBCMD 0x0:
[ 1.037835] xhci_hcd 0003:01:00.0: HC is being stopped
[ 1.037837] xhci_hcd 0003:01:00.0: HC has finished hard reset
[ 1.037839] xhci_hcd 0003:01:00.0: Event Interrupts disabled
[ 1.037841] xhci_hcd 0003:01:00.0: Host System Error Interrupts disabled
[ 1.037843] xhci_hcd 0003:01:00.0: HC has finished light reset
[ 1.037846] xhci_hcd 0003:01:00.0: USBSTS 0x0:
[ 1.037847] xhci_hcd 0003:01:00.0: Event ring is empty
[ 1.037849] xhci_hcd 0003:01:00.0: No Host System Error
[ 1.037851] xhci_hcd 0003:01:00.0: HC is running
[ 1.037854] xhci_hcd 0003:01:00.0: (ptrval) port status reg = 0x2a0
[ 1.037857] xhci_hcd 0003:01:00.0: (ptrval) port power reg = 0x0
[ 1.037860] xhci_hcd 0003:01:00.0: (ptrval) port link reg = 0x0
[ 1.037862] xhci_hcd 0003:01:00.0: (ptrval) port reserved reg = 0x0
[ 1.037865] xhci_hcd 0003:01:00.0: (ptrval) port status reg = 0x2a0
[ 1.037868] xhci_hcd 0003:01:00.0: (ptrval) port power reg = 0x0
[ 1.037870] xhci_hcd 0003:01:00.0: (ptrval) port link reg = 0x0
[ 1.037873] xhci_hcd 0003:01:00.0: (ptrval) port reserved reg = 0x0
[ 1.037876] xhci_hcd 0003:01:00.0: (ptrval) port status reg = 0xe03
[ 1.037878] xhci_hcd 0003:01:00.0: (ptrval) port power reg = 0x0
[ 1.037881] xhci_hcd 0003:01:00.0: (ptrval) port link reg = 0x0
[ 1.037884] xhci_hcd 0003:01:00.0: (ptrval) port reserved reg = 0x0
[ 1.037886] xhci_hcd 0003:01:00.0: (ptrval) port status reg = 0xe03
[ 1.037889] xhci_hcd 0003:01:00.0: (ptrval) port power reg = 0x0
[ 1.037892] xhci_hcd 0003:01:00.0: (ptrval) port link reg = 0x0
[ 1.037894] xhci_hcd 0003:01:00.0: (ptrval) port reserved reg = 0x0
[ 1.037897] xhci_hcd 0003:01:00.0: (ptrval) port status reg = 0x2a0
[ 1.037900] xhci_hcd 0003:01:00.0: (ptrval) port power reg = 0x0
[ 1.037902] xhci_hcd 0003:01:00.0: (ptrval) port link reg = 0x0
[ 1.037905] xhci_hcd 0003:01:00.0: (ptrval) port reserved reg = 0x0
[ 1.037908] xhci_hcd 0003:01:00.0: (ptrval) port status reg = 0x2a0
[ 1.037910] xhci_hcd 0003:01:00.0: (ptrval) port power reg = 0x0
[ 1.037913] xhci_hcd 0003:01:00.0: (ptrval) port link reg = 0x0
[ 1.037916] xhci_hcd 0003:01:00.0: (ptrval) port reserved reg = 0x0
[ 1.037918] xhci_hcd 0003:01:00.0: (ptrval) port status reg = 0x2a0
[ 1.037921] xhci_hcd 0003:01:00.0: (ptrval) port power reg = 0x0
[ 1.037924] xhci_hcd 0003:01:00.0: (ptrval) port link reg = 0x0
[ 1.037926] xhci_hcd 0003:01:00.0: (ptrval) port reserved reg = 0x0
[ 1.037929] xhci_hcd 0003:01:00.0: (ptrval) port status reg = 0x2a0
[ 1.037931] xhci_hcd 0003:01:00.0: (ptrval) port power reg = 0x0
[ 1.037934] xhci_hcd 0003:01:00.0: (ptrval) port link reg = 0x0
[ 1.037937] xhci_hcd 0003:01:00.0: (ptrval) port reserved reg = 0x0
[ 1.037939] xhci_hcd 0003:01:00.0: // Halt the HC
[ 1.053945] xhci_hcd 0003:01:00.0: Host halt failed, -110
[ 1.053987] xhci_hcd 0003:01:00.0: can't setup: -110
[ 1.054032] xhci_hcd 0003:01:00.0: USB bus 1 deregistered
[ 1.054099] xhci_hcd 0003:01:00.0: init 0003:01:00.0 fail, -110
[ 1.054165] xhci_hcd: probe of 0003:01:00.0 failed with error -110


2020-02-05 09:37:56

by Mathias Nyman

[permalink] [raw]
Subject: Re: TI PCIe xHCI and kexec

On 5.2.2020 2.55, Joel Stanley wrote:
> I'm supporting a system that uses Linux-as-a-bootloader to load a
> distro kernel via kexec, The systems have a TI TUSB73x0 PCIe
> controller which goes out to lunch after a kexec. This is the distro
> (post-kexec) kernel:
>
> [ 0.235411] pci 0003:01:00.0: xHCI HW did not halt within 16000
> usec status = 0x0
> [ 1.037298] xhci_hcd 0003:01:00.0: xHCI Host Controller
> [ 1.037367] xhci_hcd 0003:01:00.0: new USB bus registered, assigned
> bus number 1
> [ 1.053481] xhci_hcd 0003:01:00.0: Host halt failed, -110
> [ 1.053523] xhci_hcd 0003:01:00.0: can't setup: -110
> [ 1.053565] xhci_hcd 0003:01:00.0: USB bus 1 deregistered
> [ 1.053629] xhci_hcd 0003:01:00.0: init 0003:01:00.0 fail, -110
> [ 1.053703] xhci_hcd: probe of 0003:01:00.0 failed with error -110
>
> There were some fixes made a few years back to improve the situation,
> but we've still had to carry some form of the patch below in the
> bootloader kernel. I would like to rework it so it can be merged.
>
> diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
> index dbac0fa9748d..eaa94456dd9d 100644
> --- a/drivers/usb/host/xhci.c
> +++ b/drivers/usb/host/xhci.c
> @@ -789,6 +789,9 @@ void xhci_shutdown(struct usb_hcd *hcd)
> xhci_dbg_trace(xhci, trace_xhci_dbg_init,
> "xhci_shutdown completed - status = %x",
> readl(&xhci->op_regs->status));
> +
> + /* TI XHCI controllers do not come back after kexec without this hack */
> + pci_reset_function_locked(to_pci_dev(hcd->self.sysdev));
> }
> EXPORT_SYMBOL_GPL(xhci_shutdown);
>
> I would like some advice on how to implement it in a way that is
> acceptable. Would a quirk on the pci id in xhci_shutdown be ok?

Yes, but as this is a pci specific workaround the quirk should go to
xhci-pci.c: xhci_pci_shutdown(), which was added in v5.5

Is the rootcause known?
Is the only possible solution to reset the pci function?.
Have you tried, or seen this issue on any other controller than this TUSB73x0?

>
> 0003:01:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB
> 3.0 xHCI Host Controller (rev 02)
>
> The full debug log of the distro kernel booting is below.
>
> [ 1.037833] xhci_hcd 0003:01:00.0: USBCMD 0x0:
> [ 1.037835] xhci_hcd 0003:01:00.0: HC is being stopped
> [ 1.037837] xhci_hcd 0003:01:00.0: HC has finished hard reset
> [ 1.037839] xhci_hcd 0003:01:00.0: Event Interrupts disabled
> [ 1.037841] xhci_hcd 0003:01:00.0: Host System Error Interrupts disabled
> [ 1.037843] xhci_hcd 0003:01:00.0: HC has finished light reset
> [ 1.037846] xhci_hcd 0003:01:00.0: USBSTS 0x0:
> [ 1.037847] xhci_hcd 0003:01:00.0: Event ring is empty
> [ 1.037849] xhci_hcd 0003:01:00.0: No Host System Error
> [ 1.037851] xhci_hcd 0003:01:00.0: HC is running

Hmm, all bits in both USBCMD and USBSTS are 0. This is a bit suspicious.
Normally at least USBCMD Run/Stop bit, and USBSTS HCHalted bit have
opposite values.

-Mathias

2020-02-06 03:05:59

by Michael Ellerman

[permalink] [raw]
Subject: Re: TI PCIe xHCI and kexec

Joel Stanley <[email protected]> writes:
> I'm supporting a system that uses Linux-as-a-bootloader to load a
> distro kernel via kexec, The systems have a TI TUSB73x0 PCIe
> controller which goes out to lunch after a kexec. This is the distro
> (post-kexec) kernel:
>
> [ 0.235411] pci 0003:01:00.0: xHCI HW did not halt within 16000
> usec status = 0x0
> [ 1.037298] xhci_hcd 0003:01:00.0: xHCI Host Controller
> [ 1.037367] xhci_hcd 0003:01:00.0: new USB bus registered, assigned
> bus number 1
> [ 1.053481] xhci_hcd 0003:01:00.0: Host halt failed, -110
> [ 1.053523] xhci_hcd 0003:01:00.0: can't setup: -110
> [ 1.053565] xhci_hcd 0003:01:00.0: USB bus 1 deregistered
> [ 1.053629] xhci_hcd 0003:01:00.0: init 0003:01:00.0 fail, -110
> [ 1.053703] xhci_hcd: probe of 0003:01:00.0 failed with error -110
>
> There were some fixes made a few years back to improve the situation,
> but we've still had to carry some form of the patch below in the
> bootloader kernel. I would like to rework it so it can be merged.
>
> diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
> index dbac0fa9748d..eaa94456dd9d 100644
> --- a/drivers/usb/host/xhci.c
> +++ b/drivers/usb/host/xhci.c
> @@ -789,6 +789,9 @@ void xhci_shutdown(struct usb_hcd *hcd)
> xhci_dbg_trace(xhci, trace_xhci_dbg_init,
> "xhci_shutdown completed - status = %x",
> readl(&xhci->op_regs->status));
> +
> + /* TI XHCI controllers do not come back after kexec without this hack */
> + pci_reset_function_locked(to_pci_dev(hcd->self.sysdev));
> }
> EXPORT_SYMBOL_GPL(xhci_shutdown);
>
> I would like some advice on how to implement it in a way that is
> acceptable. Would a quirk on the pci id in xhci_shutdown be ok?

Doing something on shutdown only helps the kexec case, where the first
kernel shuts down cleanly.

kdump kernels will still hit the issue, because there the first kernel
has crashed and done no (or very little) shutdown.

The ideal solution is to have the driver/bus code in the second kernel
detect that the device is fubar and do something to reset it then.

But I don't know enough about USB or this device to know if that's
really feasible.

I realise that doesn't actually solve the problem for the
linux-as-bootloader case, because there you want to make it work even if
the kernel you're loading hasn't been updated, so you have to do
something on shutdown.

cheers


> [ 1.037638] xhci_hcd 0003:01:00.0: xHCI Host Controller
> [ 1.037695] xhci_hcd 0003:01:00.0: new USB bus registered, assigned
> bus number 1
> [ 1.037789] xhci_hcd 0003:01:00.0: xHCI capability registers at
> (ptrval):
> [ 1.037792] xhci_hcd 0003:01:00.0: CAPLENGTH AND HCIVERSION 0x960020:
> [ 1.037794] xhci_hcd 0003:01:00.0: CAPLENGTH: 0x20
> [ 1.037796] xhci_hcd 0003:01:00.0: HCIVERSION: 0x96
> [ 1.037798] xhci_hcd 0003:01:00.0: HCSPARAMS 1: 0x8000840
> [ 1.037800] xhci_hcd 0003:01:00.0: Max device slots: 64
> [ 1.037802] xhci_hcd 0003:01:00.0: Max interrupters: 8
> [ 1.037804] xhci_hcd 0003:01:00.0: Max ports: 8
> [ 1.037806] xhci_hcd 0003:01:00.0: HCSPARAMS 2: 0xc0000f1
> [ 1.037808] xhci_hcd 0003:01:00.0: Isoc scheduling threshold: 1
> [ 1.037810] xhci_hcd 0003:01:00.0: Maximum allowed segments in
> event ring: 15
> [ 1.037813] xhci_hcd 0003:01:00.0: HCSPARAMS 3 0x7ff000a:
> [ 1.037815] xhci_hcd 0003:01:00.0: Worst case U1 device exit latency: 10
> [ 1.037816] xhci_hcd 0003:01:00.0: Worst case U2 device exit latency: 2047
> [ 1.037819] xhci_hcd 0003:01:00.0: HCC PARAMS 0x270f06d:
> [ 1.037821] xhci_hcd 0003:01:00.0: HC generates 64 bit addresses
> [ 1.037823] xhci_hcd 0003:01:00.0: HC hasn't Contiguous Frame ID Capability
> [ 1.037825] xhci_hcd 0003:01:00.0: HC can't generate Stopped -
> Short Package event
> [ 1.037826] xhci_hcd 0003:01:00.0: FIXME: more HCCPARAMS debugging
> [ 1.037829] xhci_hcd 0003:01:00.0: RTSOFF 0x4a0:
> [ 1.037831] xhci_hcd 0003:01:00.0: xHCI operational registers at
> (ptrval):
> [ 1.037833] xhci_hcd 0003:01:00.0: USBCMD 0x0:
> [ 1.037835] xhci_hcd 0003:01:00.0: HC is being stopped
> [ 1.037837] xhci_hcd 0003:01:00.0: HC has finished hard reset
> [ 1.037839] xhci_hcd 0003:01:00.0: Event Interrupts disabled
> [ 1.037841] xhci_hcd 0003:01:00.0: Host System Error Interrupts disabled
> [ 1.037843] xhci_hcd 0003:01:00.0: HC has finished light reset
> [ 1.037846] xhci_hcd 0003:01:00.0: USBSTS 0x0:
> [ 1.037847] xhci_hcd 0003:01:00.0: Event ring is empty
> [ 1.037849] xhci_hcd 0003:01:00.0: No Host System Error
> [ 1.037851] xhci_hcd 0003:01:00.0: HC is running
> [ 1.037854] xhci_hcd 0003:01:00.0: (ptrval) port status reg = 0x2a0
> [ 1.037857] xhci_hcd 0003:01:00.0: (ptrval) port power reg = 0x0
> [ 1.037860] xhci_hcd 0003:01:00.0: (ptrval) port link reg = 0x0
> [ 1.037862] xhci_hcd 0003:01:00.0: (ptrval) port reserved reg = 0x0
> [ 1.037865] xhci_hcd 0003:01:00.0: (ptrval) port status reg = 0x2a0
> [ 1.037868] xhci_hcd 0003:01:00.0: (ptrval) port power reg = 0x0
> [ 1.037870] xhci_hcd 0003:01:00.0: (ptrval) port link reg = 0x0
> [ 1.037873] xhci_hcd 0003:01:00.0: (ptrval) port reserved reg = 0x0
> [ 1.037876] xhci_hcd 0003:01:00.0: (ptrval) port status reg = 0xe03
> [ 1.037878] xhci_hcd 0003:01:00.0: (ptrval) port power reg = 0x0
> [ 1.037881] xhci_hcd 0003:01:00.0: (ptrval) port link reg = 0x0
> [ 1.037884] xhci_hcd 0003:01:00.0: (ptrval) port reserved reg = 0x0
> [ 1.037886] xhci_hcd 0003:01:00.0: (ptrval) port status reg = 0xe03
> [ 1.037889] xhci_hcd 0003:01:00.0: (ptrval) port power reg = 0x0
> [ 1.037892] xhci_hcd 0003:01:00.0: (ptrval) port link reg = 0x0
> [ 1.037894] xhci_hcd 0003:01:00.0: (ptrval) port reserved reg = 0x0
> [ 1.037897] xhci_hcd 0003:01:00.0: (ptrval) port status reg = 0x2a0
> [ 1.037900] xhci_hcd 0003:01:00.0: (ptrval) port power reg = 0x0
> [ 1.037902] xhci_hcd 0003:01:00.0: (ptrval) port link reg = 0x0
> [ 1.037905] xhci_hcd 0003:01:00.0: (ptrval) port reserved reg = 0x0
> [ 1.037908] xhci_hcd 0003:01:00.0: (ptrval) port status reg = 0x2a0
> [ 1.037910] xhci_hcd 0003:01:00.0: (ptrval) port power reg = 0x0
> [ 1.037913] xhci_hcd 0003:01:00.0: (ptrval) port link reg = 0x0
> [ 1.037916] xhci_hcd 0003:01:00.0: (ptrval) port reserved reg = 0x0
> [ 1.037918] xhci_hcd 0003:01:00.0: (ptrval) port status reg = 0x2a0
> [ 1.037921] xhci_hcd 0003:01:00.0: (ptrval) port power reg = 0x0
> [ 1.037924] xhci_hcd 0003:01:00.0: (ptrval) port link reg = 0x0
> [ 1.037926] xhci_hcd 0003:01:00.0: (ptrval) port reserved reg = 0x0
> [ 1.037929] xhci_hcd 0003:01:00.0: (ptrval) port status reg = 0x2a0
> [ 1.037931] xhci_hcd 0003:01:00.0: (ptrval) port power reg = 0x0
> [ 1.037934] xhci_hcd 0003:01:00.0: (ptrval) port link reg = 0x0
> [ 1.037937] xhci_hcd 0003:01:00.0: (ptrval) port reserved reg = 0x0
> [ 1.037939] xhci_hcd 0003:01:00.0: // Halt the HC
> [ 1.053945] xhci_hcd 0003:01:00.0: Host halt failed, -110
> [ 1.053987] xhci_hcd 0003:01:00.0: can't setup: -110
> [ 1.054032] xhci_hcd 0003:01:00.0: USB bus 1 deregistered
> [ 1.054099] xhci_hcd 0003:01:00.0: init 0003:01:00.0 fail, -110
> [ 1.054165] xhci_hcd: probe of 0003:01:00.0 failed with error -110

2020-02-06 03:43:01

by Joel Stanley

[permalink] [raw]
Subject: Re: TI PCIe xHCI and kexec

On Wed, 5 Feb 2020 at 09:35, Mathias Nyman
<[email protected]> wrote:
>
> On 5.2.2020 2.55, Joel Stanley wrote:
> > I'm supporting a system that uses Linux-as-a-bootloader to load a
> > distro kernel via kexec, The systems have a TI TUSB73x0 PCIe
> > controller which goes out to lunch after a kexec. This is the distro
> > (post-kexec) kernel:
> >
> > [ 0.235411] pci 0003:01:00.0: xHCI HW did not halt within 16000
> > usec status = 0x0
> > [ 1.037298] xhci_hcd 0003:01:00.0: xHCI Host Controller
> > [ 1.037367] xhci_hcd 0003:01:00.0: new USB bus registered, assigned
> > bus number 1
> > [ 1.053481] xhci_hcd 0003:01:00.0: Host halt failed, -110
> > [ 1.053523] xhci_hcd 0003:01:00.0: can't setup: -110
> > [ 1.053565] xhci_hcd 0003:01:00.0: USB bus 1 deregistered
> > [ 1.053629] xhci_hcd 0003:01:00.0: init 0003:01:00.0 fail, -110
> > [ 1.053703] xhci_hcd: probe of 0003:01:00.0 failed with error -110
> >
> > There were some fixes made a few years back to improve the situation,
> > but we've still had to carry some form of the patch below in the
> > bootloader kernel. I would like to rework it so it can be merged.
> >
> > diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
> > index dbac0fa9748d..eaa94456dd9d 100644
> > --- a/drivers/usb/host/xhci.c
> > +++ b/drivers/usb/host/xhci.c
> > @@ -789,6 +789,9 @@ void xhci_shutdown(struct usb_hcd *hcd)
> > xhci_dbg_trace(xhci, trace_xhci_dbg_init,
> > "xhci_shutdown completed - status = %x",
> > readl(&xhci->op_regs->status));
> > +
> > + /* TI XHCI controllers do not come back after kexec without this hack */
> > + pci_reset_function_locked(to_pci_dev(hcd->self.sysdev));
> > }
> > EXPORT_SYMBOL_GPL(xhci_shutdown);
> >
> > I would like some advice on how to implement it in a way that is
> > acceptable. Would a quirk on the pci id in xhci_shutdown be ok?
>
> Yes, but as this is a pci specific workaround the quirk should go to
> xhci-pci.c: xhci_pci_shutdown(), which was added in v5.5
>
> Is the rootcause known?
> Is the only possible solution to reset the pci function?.

I don't know the root cause. The people that helped debug it in the
first place have moved on.

> Have you tried, or seen this issue on any other controller than this TUSB73x0?

We don't have any systems with a different USB controller.

In general, the other PCie devices in the system are well (enough)
behaved to survive kexec. We don't have any other out of tree
workarounds.

>
> >
> > 0003:01:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB
> > 3.0 xHCI Host Controller (rev 02)
> >
> > The full debug log of the distro kernel booting is below.
> >
> > [ 1.037833] xhci_hcd 0003:01:00.0: USBCMD 0x0:
> > [ 1.037835] xhci_hcd 0003:01:00.0: HC is being stopped
> > [ 1.037837] xhci_hcd 0003:01:00.0: HC has finished hard reset
> > [ 1.037839] xhci_hcd 0003:01:00.0: Event Interrupts disabled
> > [ 1.037841] xhci_hcd 0003:01:00.0: Host System Error Interrupts disabled
> > [ 1.037843] xhci_hcd 0003:01:00.0: HC has finished light reset
> > [ 1.037846] xhci_hcd 0003:01:00.0: USBSTS 0x0:
> > [ 1.037847] xhci_hcd 0003:01:00.0: Event ring is empty
> > [ 1.037849] xhci_hcd 0003:01:00.0: No Host System Error
> > [ 1.037851] xhci_hcd 0003:01:00.0: HC is running
>
> Hmm, all bits in both USBCMD and USBSTS are 0. This is a bit suspicious.
> Normally at least USBCMD Run/Stop bit, and USBSTS HCHalted bit have
> opposite values.

Does this suggest the controller is not responding at all?

2020-02-06 14:25:43

by Mathias Nyman

[permalink] [raw]
Subject: Re: TI PCIe xHCI and kexec

On 6.2.2020 5.37, Joel Stanley wrote:
> On Wed, 5 Feb 2020 at 09:35, Mathias Nyman
> <[email protected]> wrote:
>>
>> On 5.2.2020 2.55, Joel Stanley wrote:
>>> I'm supporting a system that uses Linux-as-a-bootloader to load a
>>> distro kernel via kexec, The systems have a TI TUSB73x0 PCIe
>>> controller which goes out to lunch after a kexec. This is the distro
>>> (post-kexec) kernel:
>>>
>>> [ 0.235411] pci 0003:01:00.0: xHCI HW did not halt within 16000
>>> usec status = 0x0
>>> [ 1.037298] xhci_hcd 0003:01:00.0: xHCI Host Controller
>>> [ 1.037367] xhci_hcd 0003:01:00.0: new USB bus registered, assigned
>>> bus number 1
>>> [ 1.053481] xhci_hcd 0003:01:00.0: Host halt failed, -110
>>> [ 1.053523] xhci_hcd 0003:01:00.0: can't setup: -110
>>> [ 1.053565] xhci_hcd 0003:01:00.0: USB bus 1 deregistered
>>> [ 1.053629] xhci_hcd 0003:01:00.0: init 0003:01:00.0 fail, -110
>>> [ 1.053703] xhci_hcd: probe of 0003:01:00.0 failed with error -110
>>>

>>>
>>> 0003:01:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB
>>> 3.0 xHCI Host Controller (rev 02)
>>>
>>> The full debug log of the distro kernel booting is below.
>>>
>>> [ 1.037833] xhci_hcd 0003:01:00.0: USBCMD 0x0:
>>> [ 1.037835] xhci_hcd 0003:01:00.0: HC is being stopped
>>> [ 1.037837] xhci_hcd 0003:01:00.0: HC has finished hard reset
>>> [ 1.037839] xhci_hcd 0003:01:00.0: Event Interrupts disabled
>>> [ 1.037841] xhci_hcd 0003:01:00.0: Host System Error Interrupts disabled
>>> [ 1.037843] xhci_hcd 0003:01:00.0: HC has finished light reset
>>> [ 1.037846] xhci_hcd 0003:01:00.0: USBSTS 0x0:
>>> [ 1.037847] xhci_hcd 0003:01:00.0: Event ring is empty
>>> [ 1.037849] xhci_hcd 0003:01:00.0: No Host System Error
>>> [ 1.037851] xhci_hcd 0003:01:00.0: HC is running
>>
>> Hmm, all bits in both USBCMD and USBSTS are 0. This is a bit suspicious.
>> Normally at least USBCMD Run/Stop bit, and USBSTS HCHalted bit have
>> opposite values.
>
> Does this suggest the controller is not responding at all?
>

The Capability registers looks fine, so does port status registers.
It's just the operational USBSTS and USBCMD registers that return 0.

Current xhci implementation assumes host failed to halt because USBSTS
HCHalted bit is still 0, and bails out before reset.
Host is probably not running, register just returns all zero.

Can you try if the below code works, it checks if host is running from
an additional place, and continues with the host reset.

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index fe38275363e0..2dbfeaf88574 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -177,8 +177,16 @@ int xhci_reset(struct xhci_hcd *xhci)
}

if ((state & STS_HALT) == 0) {
- xhci_warn(xhci, "Host controller not halted, aborting reset.\n");
- return 0;
+ /*
+ * After a kexec TI TUSB73x0 might appear running as its USBSTS
+ * and USBCMD registers return all zeroes. Doublecheck if host
+ * is running from USBCMD RUN bit before bailing out.
+ */
+ command = readl(&xhci->op_regs->command);
+ if (command & CMD_RUN) {
+ xhci_warn(xhci, "Host controller not halted, aborting reset.\n");
+ return 0;
+ }
}

xhci_dbg_trace(xhci, trace_xhci_dbg_init, "// Reset the HC");
@@ -5217,7 +5225,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, xhci_get_quirks_t get_quirks)
/* Make sure the HC is halted. */
retval = xhci_halt(xhci);
if (retval)
- return retval;
+ xhci_warn(xhci, "Continue with reset even if host appears running\n");

xhci_zero_64b_regs(xhci);

2020-02-10 06:12:43

by Joel Stanley

[permalink] [raw]
Subject: Re: TI PCIe xHCI and kexec

On Thu, 6 Feb 2020 at 14:23, Mathias Nyman
<[email protected]> wrote:
>
> On 6.2.2020 5.37, Joel Stanley wrote:
> > On Wed, 5 Feb 2020 at 09:35, Mathias Nyman
> > <[email protected]> wrote:
> >>
> >> On 5.2.2020 2.55, Joel Stanley wrote:
> >>> I'm supporting a system that uses Linux-as-a-bootloader to load a
> >>> distro kernel via kexec, The systems have a TI TUSB73x0 PCIe
> >>> controller which goes out to lunch after a kexec. This is the distro
> >>> (post-kexec) kernel:
> >>>
> >>> [ 0.235411] pci 0003:01:00.0: xHCI HW did not halt within 16000
> >>> usec status = 0x0
> >>> [ 1.037298] xhci_hcd 0003:01:00.0: xHCI Host Controller
> >>> [ 1.037367] xhci_hcd 0003:01:00.0: new USB bus registered, assigned
> >>> bus number 1
> >>> [ 1.053481] xhci_hcd 0003:01:00.0: Host halt failed, -110
> >>> [ 1.053523] xhci_hcd 0003:01:00.0: can't setup: -110
> >>> [ 1.053565] xhci_hcd 0003:01:00.0: USB bus 1 deregistered
> >>> [ 1.053629] xhci_hcd 0003:01:00.0: init 0003:01:00.0 fail, -110
> >>> [ 1.053703] xhci_hcd: probe of 0003:01:00.0 failed with error -110
> >>>
>
> >>>
> >>> 0003:01:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB
> >>> 3.0 xHCI Host Controller (rev 02)
> >>>
> >>> The full debug log of the distro kernel booting is below.
> >>>
> >>> [ 1.037833] xhci_hcd 0003:01:00.0: USBCMD 0x0:
> >>> [ 1.037835] xhci_hcd 0003:01:00.0: HC is being stopped
> >>> [ 1.037837] xhci_hcd 0003:01:00.0: HC has finished hard reset
> >>> [ 1.037839] xhci_hcd 0003:01:00.0: Event Interrupts disabled
> >>> [ 1.037841] xhci_hcd 0003:01:00.0: Host System Error Interrupts disabled
> >>> [ 1.037843] xhci_hcd 0003:01:00.0: HC has finished light reset
> >>> [ 1.037846] xhci_hcd 0003:01:00.0: USBSTS 0x0:
> >>> [ 1.037847] xhci_hcd 0003:01:00.0: Event ring is empty
> >>> [ 1.037849] xhci_hcd 0003:01:00.0: No Host System Error
> >>> [ 1.037851] xhci_hcd 0003:01:00.0: HC is running
> >>
> >> Hmm, all bits in both USBCMD and USBSTS are 0. This is a bit suspicious.
> >> Normally at least USBCMD Run/Stop bit, and USBSTS HCHalted bit have
> >> opposite values.
> >
> > Does this suggest the controller is not responding at all?
> >
>
> The Capability registers looks fine, so does port status registers.
> It's just the operational USBSTS and USBCMD registers that return 0.
>
> Current xhci implementation assumes host failed to halt because USBSTS
> HCHalted bit is still 0, and bails out before reset.
> Host is probably not running, register just returns all zero.
>
> Can you try if the below code works, it checks if host is running from
> an additional place, and continues with the host reset.

Here's the patch applied to 5.6-rc1, and then kexec'd twice (once so
we're running a kernel without any workarounds on shutdown, and the
second time to test the recovery code).

It appears to have made it a bit further:

[ 1.532920] pci 0003:01:00.0: enabling device (0140 -> 0142)
[ 1.549081] pci 0003:01:00.0: xHCI HW did not halt within 16000
usec status = 0x10
[ 1.549119] pci 0003:01:00.0: quirk_usb_early_handoff+0x0/0x7c4
took 15820 usecs

[ 5.494595] xhci_hcd 0003:01:00.0: xHCI Host Controller
[ 5.494670] xhci_hcd 0003:01:00.0: new USB bus registered, assigned
bus number 1
[ 5.510774] xhci_hcd 0003:01:00.0: Host halt failed, -110
[ 5.510791] xhci_hcd 0003:01:00.0: Continue with reset even if host
appears running
[ 5.511271] xhci_hcd 0003:01:00.0: hcc params 0x0270f06d hci
version 0x96 quirks 0x0000000004000000
[ 5.522063] xhci_hcd 0003:01:00.0: xHCI Host Controller
[ 5.522115] xhci_hcd 0003:01:00.0: new USB bus registered, assigned
bus number 2
[ 5.522186] xhci_hcd 0003:01:00.0: Host supports USB 3.0 SuperSpeed

[ 19.003160] xhci_hcd 0003:01:00.0: Abort failed to stop command ring: -110
[ 19.019167] xhci_hcd 0003:01:00.0: Host halt failed, -110
[ 19.019168] xhci_hcd 0003:01:00.0: xHCI host controller not
responding, assume dead
[ 19.019172] xhci_hcd 0003:01:00.0: HC died; cleaning up
[ 19.019299] xhci_hcd 0003:01:00.0: Error while assigning device slot ID
[ 19.019302] xhci_hcd 0003:01:00.0: Max number of devices this xHCI
host supports is 64.


>
> diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
> index fe38275363e0..2dbfeaf88574 100644
> --- a/drivers/usb/host/xhci.c
> +++ b/drivers/usb/host/xhci.c
> @@ -177,8 +177,16 @@ int xhci_reset(struct xhci_hcd *xhci)
> }
>
> if ((state & STS_HALT) == 0) {
> - xhci_warn(xhci, "Host controller not halted, aborting reset.\n");
> - return 0;
> + /*
> + * After a kexec TI TUSB73x0 might appear running as its USBSTS
> + * and USBCMD registers return all zeroes. Doublecheck if host
> + * is running from USBCMD RUN bit before bailing out.
> + */
> + command = readl(&xhci->op_regs->command);
> + if (command & CMD_RUN) {
> + xhci_warn(xhci, "Host controller not halted, aborting reset.\n");
> + return 0;
> + }
> }
>
> xhci_dbg_trace(xhci, trace_xhci_dbg_init, "// Reset the HC");
> @@ -5217,7 +5225,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, xhci_get_quirks_t get_quirks)
> /* Make sure the HC is halted. */
> retval = xhci_halt(xhci);
> if (retval)
> - return retval;
> + xhci_warn(xhci, "Continue with reset even if host appears running\n");
>
> xhci_zero_64b_regs(xhci);