2013-07-30 17:08:56

by Julius Werner

[permalink] [raw]
Subject: [PATCH] usb: core: don't try to reset_device() a port that got just disconnected

The USB hub driver's event handler contains a check to catch SuperSpeed
devices that transitioned into the SS.Inactive state and tries to fix
them with a reset. It decides whether to do a plain hub port reset or
call the usb_reset_device() method based on whether there was a device
attached to the port.

However, there are device/hub combinations (found with a JetFlash
Transcend mass storage stick (8564:1000) on the root hub of an Intel
LynxPoint PCH) which can transition to the SS.Inactive state on
disconnect (and stay there long enough for the host to notice). In this
case, above-mentioned reset check will call usb_reset_device() on the
now disconnected device. This leads to a futile attempt of asking a
bunch of floating wires to disable device-initiated U1 and U2, each
stalling khubd for a 5 second timeout before finally accepting the
device's fate amongst a flurry of error messages.

This patch makes the choice of reset dependent on the port status that
has just been read from the hub, as opposed to any previous in-kernel
state, thus working around this problem.

Signed-off-by: Julius Werner <[email protected]>
---
drivers/usb/core/hub.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 4a8a1d6..ace0c52 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -4798,7 +4798,7 @@ static void hub_events(void)
hub->ports[i - 1]->child;

dev_dbg(hub_dev, "warm reset port %d\n", i);
- if (!udev) {
+ if (!(portstatus & USB_PORT_STAT_CONNECTION)) {
status = hub_port_reset(hub, i,
NULL, HUB_BH_RESET_TIME,
true);
@@ -4808,8 +4808,8 @@ static void hub_events(void)
usb_lock_device(udev);
status = usb_reset_device(udev);
usb_unlock_device(udev);
+ connect_change = 0;
}
- connect_change = 0;
}

if (connect_change)
--
1.7.12.4


2013-07-30 17:14:47

by Julius Werner

[permalink] [raw]
Subject: Re: [PATCH] usb: core: don't try to reset_device() a port that got just disconnected

-andiry.xu... he wrote that section originally but it seems that his
address is no longer available.

2013-07-30 17:30:25

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH] usb: core: don't try to reset_device() a port that got just disconnected

On Tue, 30 Jul 2013, Julius Werner wrote:

> The USB hub driver's event handler contains a check to catch SuperSpeed
> devices that transitioned into the SS.Inactive state and tries to fix
> them with a reset. It decides whether to do a plain hub port reset or
> call the usb_reset_device() method based on whether there was a device
> attached to the port.
>
> However, there are device/hub combinations (found with a JetFlash
> Transcend mass storage stick (8564:1000) on the root hub of an Intel
> LynxPoint PCH) which can transition to the SS.Inactive state on
> disconnect (and stay there long enough for the host to notice). In this
> case, above-mentioned reset check will call usb_reset_device() on the
> now disconnected device. This leads to a futile attempt of asking a
> bunch of floating wires to disable device-initiated U1 and U2, each
> stalling khubd for a 5 second timeout before finally accepting the
> device's fate amongst a flurry of error messages.

Wait a moment. Why does each of these attempts lead to a 5-second
timeout? Why don't they fail immediately?

> This patch makes the choice of reset dependent on the port status that
> has just been read from the hub, as opposed to any previous in-kernel
> state, thus working around this problem.
>
> Signed-off-by: Julius Werner <[email protected]>
> ---
> drivers/usb/core/hub.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
> index 4a8a1d6..ace0c52 100644
> --- a/drivers/usb/core/hub.c
> +++ b/drivers/usb/core/hub.c
> @@ -4798,7 +4798,7 @@ static void hub_events(void)
> hub->ports[i - 1]->child;
>
> dev_dbg(hub_dev, "warm reset port %d\n", i);
> - if (!udev) {
> + if (!(portstatus & USB_PORT_STAT_CONNECTION)) {
> status = hub_port_reset(hub, i,
> NULL, HUB_BH_RESET_TIME,
> true);
> @@ -4808,8 +4808,8 @@ static void hub_events(void)
> usb_lock_device(udev);
> status = usb_reset_device(udev);
> usb_unlock_device(udev);

What will happen here if udev is NULL? Maybe you should change the
test to (!udev || !(portstatus & ...)).

> + connect_change = 0;
> }
> - connect_change = 0;
> }

Alan Stern

2013-07-31 02:33:49

by Julius Werner

[permalink] [raw]
Subject: Re: [PATCH] usb: core: don't try to reset_device() a port that got just disconnected

> Wait a moment. Why does each of these attempts lead to a 5-second
> timeout? Why don't they fail immediately?

Now that you mention it, that's a very good question. The kernel
enqueues a control transfer to the now disconnected address because
it's internal bookkeeping is not yet updated, but I guess that should
usually fail very fast from an xHC-internal transaction timeout. I
have now tried to debug the cause of this: I see the transfer being
enqueued and the doorbell triggered, but never get a transfer event
back from it (until the software timeout calls usb_kill_urb() which
stops the endpoint). With the same setup on a PantherPoint system I
get the same U1/U2 disable control messages on unplugging, but they
fail within <5ms with a transaction error... so I guess this must be a
LynxPoint hardware bug.

Regardless, calling usb_reset_device() is still wrong and will at
least lead to pointless transfer attempts and error messages, so I
think my patch still has merit.

> What will happen here if udev is NULL? Maybe you should change the
> test to (!udev || !(portstatus & ...)).

Right... I'm not sure if that can happen in practice, but I'll change
it just in case.

2013-07-31 14:49:26

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH] usb: core: don't try to reset_device() a port that got just disconnected

On Tue, 30 Jul 2013, Julius Werner wrote:

> > Wait a moment. Why does each of these attempts lead to a 5-second
> > timeout? Why don't they fail immediately?
>
> Now that you mention it, that's a very good question.

I have brought this up with Sarah on more than one occasion, but we
never found a good answer. The effects are quite visible when somebody
unplugs a USB-3 disk drive in the middle of a data transfer.

> The kernel
> enqueues a control transfer to the now disconnected address because
> it's internal bookkeeping is not yet updated, but I guess that should
> usually fail very fast from an xHC-internal transaction timeout. I
> have now tried to debug the cause of this: I see the transfer being
> enqueued and the doorbell triggered, but never get a transfer event
> back from it (until the software timeout calls usb_kill_urb() which
> stops the endpoint). With the same setup on a PantherPoint system I
> get the same U1/U2 disable control messages on unplugging, but they
> fail within <5ms with a transaction error... so I guess this must be a
> LynxPoint hardware bug.

An odd sort of bug. You'd think that not getting a response back would
be one of the first types of error the hardware designers would check
for.

> Regardless, calling usb_reset_device() is still wrong and will at
> least lead to pointless transfer attempts and error messages, so I
> think my patch still has merit.
>
> > What will happen here if udev is NULL? Maybe you should change the
> > test to (!udev || !(portstatus & ...)).
>
> Right... I'm not sure if that can happen in practice, but I'll change
> it just in case.

Somebody said that in theory, ports can put themselves in the Disabled
state at any time, spontaneously. If this happened just after a device
was attached, you would end up with udev being NULL and the connect
status being set.

Alan Stern

2013-07-31 16:05:55

by Sarah Sharp

[permalink] [raw]
Subject: Re: [PATCH] usb: core: don't try to reset_device() a port that got just disconnected

On Tue, Jul 30, 2013 at 07:33:46PM -0700, Julius Werner wrote:
> > Wait a moment. Why does each of these attempts lead to a 5-second
> > timeout? Why don't they fail immediately?
>
> Now that you mention it, that's a very good question. The kernel
> enqueues a control transfer to the now disconnected address because
> it's internal bookkeeping is not yet updated, but I guess that should
> usually fail very fast from an xHC-internal transaction timeout. I
> have now tried to debug the cause of this: I see the transfer being
> enqueued and the doorbell triggered, but never get a transfer event
> back from it (until the software timeout calls usb_kill_urb() which
> stops the endpoint). With the same setup on a PantherPoint system I
> get the same U1/U2 disable control messages on unplugging, but they
> fail within <5ms with a transaction error... so I guess this must be a
> LynxPoint hardware bug.
>
> Regardless, calling usb_reset_device() is still wrong and will at
> least lead to pointless transfer attempts and error messages, so I
> think my patch still has merit.

Hmm, 5 seconds is the xHCI command timeout. Could the driver be
possibly issuing either a Reset Device or Address Device command that
simply times out?

Sarah Sharp

2013-07-31 18:49:19

by Julius Werner

[permalink] [raw]
Subject: Re: [PATCH] usb: core: don't try to reset_device() a port that got just disconnected

> An odd sort of bug. You'd think that not getting a response back would
> be one of the first types of error the hardware designers would check
> for.

You'd think that, right? But apparently they didn't. I've now also
tried just hacking a pointless SET_INTERFACE message into the
hub_port_connect_change() handler on disconnect... stalls every time
with both 2.0 and 3.0 devices (and fails fast as expected on
PantherPoint). I have a feeling that this might not be the last
fix/workaround we will need for this PCH...

> Hmm, 5 seconds is the xHCI command timeout. Could the driver be
> possibly issuing either a Reset Device or Address Device command that
> simply times out?

The timeout is USB_CTRL_SET_TIMEOUT from
usb_set_device_initiated_lpm() (ends up in usb_start_wait_urb()). I
have dumped all XHCI commands and events in my tests... there were no
commands enqueued or completed in that timeframe, and no transfer
events received for that endpoint. I've also checked the endpoint
context before and 200ms after enqueueing the transfer, and it was
still in the "Running" state. The xHC just seems to completely hang on
that transfer ring until the timeout hits and the kernel issues a Stop
Endpoint to abort the URB.

> Thanks Alan, I'll send this off to Greg shortly.

Thanks!