2018-02-08 03:56:42

by Jeffy Chen

[permalink] [raw]
Subject: [PATCH v6] usb: ohci: Proper handling of ed_rm_list to handle race condition between usb_kill_urb() and finish_unlinks()

From: AMAN DEEP <[email protected]>

There is a race condition between finish_unlinks->finish_urb() function
and usb_kill_urb() in ohci controller case. The finish_urb calls
spin_unlock(&ohci->lock) before usb_hcd_giveback_urb() function call,
then if during this time, usb_kill_urb is called for another endpoint,
then new ed will be added to ed_rm_list at beginning for unlink, and
ed_rm_list will point to newly added.

When finish_urb() is completed in finish_unlinks() and ed->td_list
becomes empty as in below code (in finish_unlinks() function):

if (list_empty(&ed->td_list)) {
*last = ed->ed_next;
ed->ed_next = NULL;
} else if (ohci->rh_state == OHCI_RH_RUNNING) {
*last = ed->ed_next;
ed->ed_next = NULL;
ed_schedule(ohci, ed);
}

The *last = ed->ed_next will make ed_rm_list to point to ed->ed_next
and previously added ed by usb_kill_urb will be left unreferenced by
ed_rm_list. This causes usb_kill_urb() hang forever waiting for
finish_unlink to remove added ed from ed_rm_list.

The main reason for hang in this race condtion is addition and removal
of ed from ed_rm_list in the beginning during usb_kill_urb and later
last* is modified in finish_unlinks().

As suggested by Alan Stern, the solution for proper handling of
ohci->ed_rm_list is to remove ed from the ed_rm_list before finishing
any URBs. Then at the end, we can add ed back to the list if necessary.

This properly handle the updated ohci->ed_rm_list in usb_kill_urb().

Fixes:977dcfdc6031("USB:OHCI:don't lose track of EDs when a controller dies")
Acked-by: Alan Stern <[email protected]>
CC: <[email protected]>
Signed-off-by: Aman Deep <[email protected]>
Signed-off-by: Jeffy Chen <[email protected]>
---

Changes in v6:
This is a resend of Aman Deep's v5 patch [0], which solved the hang we
hit [1]. (Thanks Aman :)

The v5 has some format issues, so i slightly adjust the commit message.

[0] https://www.spinics.net/lists/linux-usb/msg129010.html
[1] https://bugs.chromium.org/p/chromium/issues/detail?id=803749

drivers/usb/host/ohci-q.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/usb/host/ohci-q.c b/drivers/usb/host/ohci-q.c
index b2ec8c399363..4ccb85a67bb3 100644
--- a/drivers/usb/host/ohci-q.c
+++ b/drivers/usb/host/ohci-q.c
@@ -1019,6 +1019,8 @@ static void finish_unlinks(struct ohci_hcd *ohci)
* have modified this list. normally it's just prepending
* entries (which we'd ignore), but paranoia won't hurt.
*/
+ *last = ed->ed_next;
+ ed->ed_next = NULL;
modified = 0;

/* unlink urbs as requested, but rescan the list after
@@ -1077,21 +1079,22 @@ static void finish_unlinks(struct ohci_hcd *ohci)
goto rescan_this;

/*
- * If no TDs are queued, take ED off the ed_rm_list.
+ * If no TDs are queued, ED is now idle.
* Otherwise, if the HC is running, reschedule.
- * If not, leave it on the list for further dequeues.
+ * If the HC isn't running, add ED back to the
+ * start of the list for later processing.
*/
if (list_empty(&ed->td_list)) {
- *last = ed->ed_next;
- ed->ed_next = NULL;
ed->state = ED_IDLE;
list_del(&ed->in_use_list);
} else if (ohci->rh_state == OHCI_RH_RUNNING) {
- *last = ed->ed_next;
- ed->ed_next = NULL;
ed_schedule(ohci, ed);
} else {
- last = &ed->ed_next;
+ ed->ed_next = ohci->ed_rm_list;
+ ohci->ed_rm_list = ed;
+ /* Don't loop on the same ED */
+ if (last == &ohci->ed_rm_list)
+ last = &ed->ed_next;
}

if (modified)
--
2.11.0




2018-03-25 01:25:23

by Jonathan Liu

[permalink] [raw]
Subject: Re: [v6] usb: ohci: Proper handling of ed_rm_list to handle race condition between usb_kill_urb() and finish_unlinks()

Hi,

On 8 February 2018 at 14:55, Jeffy Chen <[email protected]> wrote:
> From: AMAN DEEP <[email protected]>
>
> There is a race condition between finish_unlinks->finish_urb() function
> and usb_kill_urb() in ohci controller case. The finish_urb calls
> spin_unlock(&ohci->lock) before usb_hcd_giveback_urb() function call,
> then if during this time, usb_kill_urb is called for another endpoint,
> then new ed will be added to ed_rm_list at beginning for unlink, and
> ed_rm_list will point to newly added.
>
> When finish_urb() is completed in finish_unlinks() and ed->td_list
> becomes empty as in below code (in finish_unlinks() function):
>
> if (list_empty(&ed->td_list)) {
> *last = ed->ed_next;
> ed->ed_next = NULL;
> } else if (ohci->rh_state == OHCI_RH_RUNNING) {
> *last = ed->ed_next;
> ed->ed_next = NULL;
> ed_schedule(ohci, ed);
> }
>
> The *last = ed->ed_next will make ed_rm_list to point to ed->ed_next
> and previously added ed by usb_kill_urb will be left unreferenced by
> ed_rm_list. This causes usb_kill_urb() hang forever waiting for
> finish_unlink to remove added ed from ed_rm_list.
>
> The main reason for hang in this race condtion is addition and removal
> of ed from ed_rm_list in the beginning during usb_kill_urb and later
> last* is modified in finish_unlinks().
>
> As suggested by Alan Stern, the solution for proper handling of
> ohci->ed_rm_list is to remove ed from the ed_rm_list before finishing
> any URBs. Then at the end, we can add ed back to the list if necessary.
>
> This properly handle the updated ohci->ed_rm_list in usb_kill_urb().
>
> Fixes:977dcfdc6031("USB:OHCI:don't lose track of EDs when a controller dies")
> Acked-by: Alan Stern <[email protected]>
> CC: <[email protected]>
> Signed-off-by: Aman Deep <[email protected]>
> Signed-off-by: Jeffy Chen <[email protected]>
> ---
>
> Changes in v6:
> This is a resend of Aman Deep's v5 patch [0], which solved the hang we
> hit [1]. (Thanks Aman :)
>
> The v5 has some format issues, so i slightly adjust the commit message.
>
> [0] https://www.spinics.net/lists/linux-usb/msg129010.html
> [1] https://bugs.chromium.org/p/chromium/issues/detail?id=803749
>
> drivers/usb/host/ohci-q.c | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/usb/host/ohci-q.c b/drivers/usb/host/ohci-q.c
> index b2ec8c399363..4ccb85a67bb3 100644
> --- a/drivers/usb/host/ohci-q.c
> +++ b/drivers/usb/host/ohci-q.c
> @@ -1019,6 +1019,8 @@ static void finish_unlinks(struct ohci_hcd *ohci)
> * have modified this list. normally it's just prepending
> * entries (which we'd ignore), but paranoia won't hurt.
> */
> + *last = ed->ed_next;
> + ed->ed_next = NULL;
> modified = 0;
>
> /* unlink urbs as requested, but rescan the list after
> @@ -1077,21 +1079,22 @@ static void finish_unlinks(struct ohci_hcd *ohci)
> goto rescan_this;
>
> /*
> - * If no TDs are queued, take ED off the ed_rm_list.
> + * If no TDs are queued, ED is now idle.
> * Otherwise, if the HC is running, reschedule.
> - * If not, leave it on the list for further dequeues.
> + * If the HC isn't running, add ED back to the
> + * start of the list for later processing.
> */
> if (list_empty(&ed->td_list)) {
> - *last = ed->ed_next;
> - ed->ed_next = NULL;
> ed->state = ED_IDLE;
> list_del(&ed->in_use_list);
> } else if (ohci->rh_state == OHCI_RH_RUNNING) {
> - *last = ed->ed_next;
> - ed->ed_next = NULL;
> ed_schedule(ohci, ed);
> } else {
> - last = &ed->ed_next;
> + ed->ed_next = ohci->ed_rm_list;
> + ohci->ed_rm_list = ed;
> + /* Don't loop on the same ED */
> + if (last == &ohci->ed_rm_list)
> + last = &ed->ed_next;
> }
>
> if (modified)

I am experiencing a USB function call hang from userspace with OCHI
(full speed USB device) after updating from Linux 4.14.15 to 4.14.24
and noticed this commit.

Here is the Linux 4.14.24 kernel stack trace (extracted from SysRq+w
and amended with addr2line):
[<c05cbd64>] (__schedule) from [<c05cc148>] (schedule+0x50/0xb4)
kernel/sched/core.c:2792
[<c05cc148>] (schedule) from [<c042789c>]
(usb_kill_urb.part.3+0x78/0xa8) include/asm-generic/preempt.h:59
[<c042789c>] (usb_kill_urb.part.3) from [<c0433a3c>]
(usbdev_ioctl+0x1288/0x1cf0) drivers/usb/core/urb.c:690
[<c0433a3c>] (usbdev_ioctl) from [<c0217db8>]
(do_vfs_ioctl+0x9c/0x8ec) drivers/usb/core/devio.c:1835
[<c0217db8>] (do_vfs_ioctl) from [<c021863c>] (SyS_ioctl+0x34/0x5c)
fs/ioctl.c:47
[<c021863c>] (SyS_ioctl) from [<c0107820>] (ret_fast_syscall+0x0/0x54)
include/linux/file.h:39

Afterwards the kernel is unresponsive to disconnect/connect of the
full speed USB device but I can connect/disconnect a high speed USB
device to the same port and communicate to it without problem since it
uses EHCI (OHCI is companion controller). If I try to connect the full
speed USB device again it is still unresponsive. The userspace
application is still hanging after all this.

Could this commit be causing the issue?

Thanks.

Regards,
Jonathan

2018-03-25 02:05:31

by Jonathan Liu

[permalink] [raw]
Subject: Re: [v6] usb: ohci: Proper handling of ed_rm_list to handle race condition between usb_kill_urb() and finish_unlinks()

Hi,

On 25 March 2018 at 12:21, Jonathan Liu <[email protected]> wrote:
> Hi,
>
> On 8 February 2018 at 14:55, Jeffy Chen <[email protected]> wrote:
>> From: AMAN DEEP <[email protected]>
>>
>> There is a race condition between finish_unlinks->finish_urb() function
>> and usb_kill_urb() in ohci controller case. The finish_urb calls
>> spin_unlock(&ohci->lock) before usb_hcd_giveback_urb() function call,
>> then if during this time, usb_kill_urb is called for another endpoint,
>> then new ed will be added to ed_rm_list at beginning for unlink, and
>> ed_rm_list will point to newly added.
>>
>> When finish_urb() is completed in finish_unlinks() and ed->td_list
>> becomes empty as in below code (in finish_unlinks() function):
>>
>> if (list_empty(&ed->td_list)) {
>> *last = ed->ed_next;
>> ed->ed_next = NULL;
>> } else if (ohci->rh_state == OHCI_RH_RUNNING) {
>> *last = ed->ed_next;
>> ed->ed_next = NULL;
>> ed_schedule(ohci, ed);
>> }
>>
>> The *last = ed->ed_next will make ed_rm_list to point to ed->ed_next
>> and previously added ed by usb_kill_urb will be left unreferenced by
>> ed_rm_list. This causes usb_kill_urb() hang forever waiting for
>> finish_unlink to remove added ed from ed_rm_list.
>>
>> The main reason for hang in this race condtion is addition and removal
>> of ed from ed_rm_list in the beginning during usb_kill_urb and later
>> last* is modified in finish_unlinks().
>>
>> As suggested by Alan Stern, the solution for proper handling of
>> ohci->ed_rm_list is to remove ed from the ed_rm_list before finishing
>> any URBs. Then at the end, we can add ed back to the list if necessary.
>>
>> This properly handle the updated ohci->ed_rm_list in usb_kill_urb().
>>
>> Fixes:977dcfdc6031("USB:OHCI:don't lose track of EDs when a controller dies")
>> Acked-by: Alan Stern <[email protected]>
>> CC: <[email protected]>
>> Signed-off-by: Aman Deep <[email protected]>
>> Signed-off-by: Jeffy Chen <[email protected]>
>> ---
>>
>> Changes in v6:
>> This is a resend of Aman Deep's v5 patch [0], which solved the hang we
>> hit [1]. (Thanks Aman :)
>>
>> The v5 has some format issues, so i slightly adjust the commit message.
>>
>> [0] https://www.spinics.net/lists/linux-usb/msg129010.html
>> [1] https://bugs.chromium.org/p/chromium/issues/detail?id=803749
>>
>> drivers/usb/host/ohci-q.c | 17 ++++++++++-------
>> 1 file changed, 10 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/usb/host/ohci-q.c b/drivers/usb/host/ohci-q.c
>> index b2ec8c399363..4ccb85a67bb3 100644
>> --- a/drivers/usb/host/ohci-q.c
>> +++ b/drivers/usb/host/ohci-q.c
>> @@ -1019,6 +1019,8 @@ static void finish_unlinks(struct ohci_hcd *ohci)
>> * have modified this list. normally it's just prepending
>> * entries (which we'd ignore), but paranoia won't hurt.
>> */
>> + *last = ed->ed_next;
>> + ed->ed_next = NULL;
>> modified = 0;
>>
>> /* unlink urbs as requested, but rescan the list after
>> @@ -1077,21 +1079,22 @@ static void finish_unlinks(struct ohci_hcd *ohci)
>> goto rescan_this;
>>
>> /*
>> - * If no TDs are queued, take ED off the ed_rm_list.
>> + * If no TDs are queued, ED is now idle.
>> * Otherwise, if the HC is running, reschedule.
>> - * If not, leave it on the list for further dequeues.
>> + * If the HC isn't running, add ED back to the
>> + * start of the list for later processing.
>> */
>> if (list_empty(&ed->td_list)) {
>> - *last = ed->ed_next;
>> - ed->ed_next = NULL;
>> ed->state = ED_IDLE;
>> list_del(&ed->in_use_list);
>> } else if (ohci->rh_state == OHCI_RH_RUNNING) {
>> - *last = ed->ed_next;
>> - ed->ed_next = NULL;
>> ed_schedule(ohci, ed);
>> } else {
>> - last = &ed->ed_next;
>> + ed->ed_next = ohci->ed_rm_list;
>> + ohci->ed_rm_list = ed;
>> + /* Don't loop on the same ED */
>> + if (last == &ohci->ed_rm_list)
>> + last = &ed->ed_next;
>> }
>>
>> if (modified)
>
> I am experiencing a USB function call hang from userspace with OCHI
> (full speed USB device) after updating from Linux 4.14.15 to 4.14.24
> and noticed this commit.
>
> Here is the Linux 4.14.24 kernel stack trace (extracted from SysRq+w
> and amended with addr2line):
> [<c05cbd64>] (__schedule) from [<c05cc148>] (schedule+0x50/0xb4)
> kernel/sched/core.c:2792
> [<c05cc148>] (schedule) from [<c042789c>]
> (usb_kill_urb.part.3+0x78/0xa8) include/asm-generic/preempt.h:59
> [<c042789c>] (usb_kill_urb.part.3) from [<c0433a3c>]
> (usbdev_ioctl+0x1288/0x1cf0) drivers/usb/core/urb.c:690
> [<c0433a3c>] (usbdev_ioctl) from [<c0217db8>]
> (do_vfs_ioctl+0x9c/0x8ec) drivers/usb/core/devio.c:1835
> [<c0217db8>] (do_vfs_ioctl) from [<c021863c>] (SyS_ioctl+0x34/0x5c)
> fs/ioctl.c:47
> [<c021863c>] (SyS_ioctl) from [<c0107820>] (ret_fast_syscall+0x0/0x54)
> include/linux/file.h:39
>
> Afterwards the kernel is unresponsive to disconnect/connect of the
> full speed USB device but I can connect/disconnect a high speed USB
> device to the same port and communicate to it without problem since it
> uses EHCI (OHCI is companion controller). If I try to connect the full
> speed USB device again it is still unresponsive. The userspace
> application is still hanging after all this.
>
> Could this commit be causing the issue?

Note that I have been running with OHCI_QUIRK_QEMU added to
ochi->flags in ohci_init of drivers/usb/host/ohci-hcd.c for both
4.14.15 and 4.14.24 kernels due to a race condition that was later
addressed by https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=v4.14.23&id=85c3d26bd754160f6be90d8b078d70a1b35820e7,
so io_watchdog_func of drivers/usb/host/ohci-hcd.c is not being run.

Thanks.

Regards,
Jonathan

2018-03-25 15:40:08

by Jonathan Liu

[permalink] [raw]
Subject: Re: [v6] usb: ohci: Proper handling of ed_rm_list to handle race condition between usb_kill_urb() and finish_unlinks()

Hi,

On 25 March 2018 at 12:21, Jonathan Liu <[email protected]> wrote:
> On 8 February 2018 at 14:55, Jeffy Chen <[email protected]> wrote:
>> From: AMAN DEEP <[email protected]>
>>
>> There is a race condition between finish_unlinks->finish_urb() function
>> and usb_kill_urb() in ohci controller case. The finish_urb calls
>> spin_unlock(&ohci->lock) before usb_hcd_giveback_urb() function call,
>> then if during this time, usb_kill_urb is called for another endpoint,
>> then new ed will be added to ed_rm_list at beginning for unlink, and
>> ed_rm_list will point to newly added.
>>
>> When finish_urb() is completed in finish_unlinks() and ed->td_list
>> becomes empty as in below code (in finish_unlinks() function):
>>
>> if (list_empty(&ed->td_list)) {
>> *last = ed->ed_next;
>> ed->ed_next = NULL;
>> } else if (ohci->rh_state == OHCI_RH_RUNNING) {
>> *last = ed->ed_next;
>> ed->ed_next = NULL;
>> ed_schedule(ohci, ed);
>> }
>>
>> The *last = ed->ed_next will make ed_rm_list to point to ed->ed_next
>> and previously added ed by usb_kill_urb will be left unreferenced by
>> ed_rm_list. This causes usb_kill_urb() hang forever waiting for
>> finish_unlink to remove added ed from ed_rm_list.
>>
>> The main reason for hang in this race condtion is addition and removal
>> of ed from ed_rm_list in the beginning during usb_kill_urb and later
>> last* is modified in finish_unlinks().
>>
>> As suggested by Alan Stern, the solution for proper handling of
>> ohci->ed_rm_list is to remove ed from the ed_rm_list before finishing
>> any URBs. Then at the end, we can add ed back to the list if necessary.
>>
>> This properly handle the updated ohci->ed_rm_list in usb_kill_urb().
>>
>> Fixes:977dcfdc6031("USB:OHCI:don't lose track of EDs when a controller dies")
>> Acked-by: Alan Stern <[email protected]>
>> CC: <[email protected]>
>> Signed-off-by: Aman Deep <[email protected]>
>> Signed-off-by: Jeffy Chen <[email protected]>
>> ---
>>
>> Changes in v6:
>> This is a resend of Aman Deep's v5 patch [0], which solved the hang we
>> hit [1]. (Thanks Aman :)
>>
>> The v5 has some format issues, so i slightly adjust the commit message.
>>
>> [0] https://www.spinics.net/lists/linux-usb/msg129010.html
>> [1] https://bugs.chromium.org/p/chromium/issues/detail?id=803749
>>
>> drivers/usb/host/ohci-q.c | 17 ++++++++++-------
>> 1 file changed, 10 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/usb/host/ohci-q.c b/drivers/usb/host/ohci-q.c
>> index b2ec8c399363..4ccb85a67bb3 100644
>> --- a/drivers/usb/host/ohci-q.c
>> +++ b/drivers/usb/host/ohci-q.c
>> @@ -1019,6 +1019,8 @@ static void finish_unlinks(struct ohci_hcd *ohci)
>> * have modified this list. normally it's just prepending
>> * entries (which we'd ignore), but paranoia won't hurt.
>> */
>> + *last = ed->ed_next;
>> + ed->ed_next = NULL;
>> modified = 0;
>>
>> /* unlink urbs as requested, but rescan the list after
>> @@ -1077,21 +1079,22 @@ static void finish_unlinks(struct ohci_hcd *ohci)
>> goto rescan_this;
>>
>> /*
>> - * If no TDs are queued, take ED off the ed_rm_list.
>> + * If no TDs are queued, ED is now idle.
>> * Otherwise, if the HC is running, reschedule.
>> - * If not, leave it on the list for further dequeues.
>> + * If the HC isn't running, add ED back to the
>> + * start of the list for later processing.
>> */
>> if (list_empty(&ed->td_list)) {
>> - *last = ed->ed_next;
>> - ed->ed_next = NULL;
>> ed->state = ED_IDLE;
>> list_del(&ed->in_use_list);
>> } else if (ohci->rh_state == OHCI_RH_RUNNING) {
>> - *last = ed->ed_next;
>> - ed->ed_next = NULL;
>> ed_schedule(ohci, ed);
>> } else {
>> - last = &ed->ed_next;
>> + ed->ed_next = ohci->ed_rm_list;
>> + ohci->ed_rm_list = ed;
>> + /* Don't loop on the same ED */
>> + if (last == &ohci->ed_rm_list)
>> + last = &ed->ed_next;
>> }
>>
>> if (modified)
>
> I am experiencing a USB function call hang from userspace with OCHI
> (full speed USB device) after updating from Linux 4.14.15 to 4.14.24
> and noticed this commit.
>
> Here is the Linux 4.14.24 kernel stack trace (extracted from SysRq+w
> and amended with addr2line):
> [<c05cbd64>] (__schedule) from [<c05cc148>] (schedule+0x50/0xb4)
> kernel/sched/core.c:2792
> [<c05cc148>] (schedule) from [<c042789c>]
> (usb_kill_urb.part.3+0x78/0xa8) include/asm-generic/preempt.h:59
> [<c042789c>] (usb_kill_urb.part.3) from [<c0433a3c>]
> (usbdev_ioctl+0x1288/0x1cf0) drivers/usb/core/urb.c:690
> [<c0433a3c>] (usbdev_ioctl) from [<c0217db8>]
> (do_vfs_ioctl+0x9c/0x8ec) drivers/usb/core/devio.c:1835
> [<c0217db8>] (do_vfs_ioctl) from [<c021863c>] (SyS_ioctl+0x34/0x5c)
> fs/ioctl.c:47
> [<c021863c>] (SyS_ioctl) from [<c0107820>] (ret_fast_syscall+0x0/0x54)
> include/linux/file.h:39
>
> Afterwards the kernel is unresponsive to disconnect/connect of the
> full speed USB device but I can connect/disconnect a high speed USB
> device to the same port and communicate to it without problem since it
> uses EHCI (OHCI is companion controller). If I try to connect the full
> speed USB device again it is still unresponsive. The userspace
> application is still hanging after all this.
>
> Could this commit be causing the issue?

There does seem to be something wrong with this commit. I did some
more testing with Linux 4.16.0-rc6 on sun7i-a20-olinuxino-lime.
I connected a sun7i-a20-pcduino3b running Linux 4.15.12 that has OTG
port with gadget zero kernel module (g_zero) loaded to the USB port
and did some testing with usbtest from tools/usb/testusb.c in kernel
repository.

I had to do the following change to avoid getting flooded with
"usbtest 1-1:3.0: unlink retry" kernel messages after test 11 of
testusb starts:
--- a/drivers/usb/misc/usbtest.c
+++ b/drivers/usb/misc/usbtest.c
@@ -1473,7 +1473,7 @@ static int unlink1(struct usbtest_dev *dev, int
pipe, int
size, int async)
* resubmission, but since we're testing unlink
* paths, we can't.
*/
- ERROR(dev, "unlink retry\n");
+ /*ERROR(dev, "unlink retry\n");*/
continue;
case 0:
case -EINPROGRESS:

Here are the results.

# ./testusb -a
/dev/bus/usb/001/002 test 0, 0.006822 secs
/dev/bus/usb/001/002 test 1, 0.140112 secs
/dev/bus/usb/001/002 test 2, 0.381125 secs
/dev/bus/usb/001/002 test 3, 0.133564 secs
/dev/bus/usb/001/002 test 4, 0.385498 secs
/dev/bus/usb/001/002 test 5, 2.138259 secs
/dev/bus/usb/001/002 test 6, 1.755144 secs
/dev/bus/usb/001/002 test 7, 2.137387 secs
/dev/bus/usb/001/002 test 8, 1.752393 secs
/dev/bus/usb/001/002 test 9, 1.132559 secs
/dev/bus/usb/001/002 test 10, 3.698934 secs
/dev/bus/usb/001/002 test 11, 40.061114 secs

dmesg (processed through scripts/decode_stacktrace.sh from kernel repository):
[ 21.573242] usbtest 1-1:3.0: TEST 0: NOP
[ 21.580315] usbtest 1-1:3.0: TEST 1: write 1024 bytes 1000 times
[ 21.732138] usbtest 1-1:3.0: TEST 2: read 1024 bytes 1000 times
[ 22.132133] usbtest 1-1:3.0: TEST 3: write/1024 0..1024 bytes 1000 times
[ 22.282137] usbtest 1-1:3.0: TEST 4: read/1024 0..1024 bytes 1000 times
[ 22.682131] usbtest 1-1:3.0: TEST 5: write 1000 sglists 32 entries
of 1024 bytes
[ 24.832138] usbtest 1-1:3.0: TEST 6: read 1000 sglists 32 entries
of 1024 bytes
[ 26.602135] usbtest 1-1:3.0: TEST 7: write/1024 1000 sglists 32
entries 0..1024 bytes
[ 28.752136] usbtest 1-1:3.0: TEST 8: read/1024 1000 sglists 32
entries 0..1024 bytes
[ 30.522139] usbtest 1-1:3.0: TEST 9: ch9 (subset) control tests, 1000 times
[ 31.654957] usbtest 1-1:3.0: TEST 10: queue 32 control calls, 1000 times
[ 35.354220] usbtest 1-1:3.0: TEST 11: unlink 1000 reads of 1024
[ 75.414099] usbtest 1-1:3.0: TEST 12: unlink 1000 writes of 1024
[ 128.026194] INFO: rcu_sched self-detected stall on CPU
[ 128.031367] 0-....: (1 GPs behind) idle=faa/1/1073741826
softirq=2614/2621 fqs=1048
[ 128.039182] (t=2100 jiffies g=227 c=226 q=1961)
[ 128.043887] NMI backtrace for cpu 0
[ 128.047377] CPU: 0 PID: 262 Comm: testusb Tainted: G O
4.16.0-rc6 #1
[ 128.054932] Hardware name: Allwinner sun7i (A20) Family
[ 128.060181] (unwind_backtrace) from show_stack (arch/arm/kernel/traps.c:249)
[ 128.067921] (show_stack) from dump_stack (lib/dump_stack.c:55)
[ 128.075142] (dump_stack) from nmi_cpu_backtrace (lib/nmi_backtrace.c:103)
[ 128.083141] (nmi_cpu_backtrace) from nmi_trigger_cpumask_backtrace
(lib/nmi_backtrace.c:62)
[ 128.092704] (nmi_trigger_cpumask_backtrace) from
rcu_dump_cpu_stacks (kernel/rcu/tree.c:1373 (discriminator 2))
[ 128.102351] (rcu_dump_cpu_stacks) from rcu_check_callbacks
(kernel/rcu/tree.c:1526 kernel/rcu/tree.c:1592 kernel/rcu/tree.c:3361
kernel/rcu/tree.c:3423 kernel/rcu/tree.c:2763)
[ 128.111304] (rcu_check_callbacks) from update_process_times
(arch/arm/include/asm/thread_info.h:91
include/asm-generic/preempt.h:11 kernel/time/timer.c:1638)
[ 128.120173] (update_process_times) from tick_sched_timer
(kernel/time/tick-sched.c:1199)
[ 128.128777] (tick_sched_timer) from __hrtimer_run_queues
(kernel/time/hrtimer.c:1349 kernel/time/hrtimer.c:1411)
[ 128.137553] (__hrtimer_run_queues) from hrtimer_interrupt
(kernel/time/hrtimer.c:1472)
[ 128.146331] (hrtimer_interrupt) from arch_timer_handler_phys
(drivers/clocksource/arm_arch_timer.c:589
drivers/clocksource/arm_arch_timer.c:606)
[ 128.155285] (arch_timer_handler_phys) from handle_percpu_devid_irq
(kernel/irq/chip.c:909)
[ 128.164843] (handle_percpu_devid_irq) from generic_handle_irq
(kernel/irq/irqdesc.c:607)
[ 128.173880] (generic_handle_irq) from __handle_domain_irq
(kernel/irq/irqdesc.c:646)
[ 128.182574] (__handle_domain_irq) from gic_handle_irq
(arch/arm/include/asm/io.h:118 drivers/irqchip/irq-gic.c:360)
[ 128.190920] (gic_handle_irq) from __irq_svc
(arch/arm/kernel/entry-armv.S:226)
[ 128.198382] Exception stack(0xcda1fc88 to 0xcda1fcd0)
[ 128.203431] fc80: c0a95bc8 a0060013 0000000a
0000aa75 cd83b480 a0060013
[ 128.211598] fca0: cf7ee000 ffffff98 00000001 00000000 00000400
cec2b800 00000009 cda1fcd8
[ 128.219763] fcc0: c045b328 c0622890 20060013 ffffffff
[ 128.224818] (__irq_svc) from _raw_spin_unlock_irqrestore
(kernel/locking/spinlock.c:185)
[ 128.233428] (_raw_spin_unlock_irqrestore) from usb_hcd_unlink_urb
(drivers/usb/core/hcd.c:1720)
[ 128.242816] (usb_hcd_unlink_urb) from unlink1
(drivers/usb/core/hcd.c:864 drivers/usb/core/hcd.c:1684)
[ 128.250555] (unlink1) from unlink_simple (drivers/usb/misc/usbtest.c:1510)
[ 128.257774] (unlink_simple) from usbtest_do_ioctl
(drivers/usb/misc/usbtest.c:2360)
[ 128.266033] (usbtest_do_ioctl) from usbtest_ioctl
(drivers/usb/misc/usbtest.c:2685)
[ 128.274206] (usbtest_ioctl) from usbdev_ioctl
(drivers/usb/core/devio.c:2175 drivers/usb/core/devio.c:2197
drivers/usb/core/devio.c:2511 drivers/usb/core/devio.c:2555)
[ 128.282200] (usbdev_ioctl) from do_vfs_ioctl (fs/ioctl.c:47 fs/ioctl.c:686)
[ 128.289854] (do_vfs_ioctl) from SyS_ioctl (include/linux/file.h:39
fs/ioctl.c:702 fs/ioctl.c:692)
[ 128.297159] (SyS_ioctl) from ret_fast_syscall
(arch/arm/kernel/entry-common.S:52)
[ 128.304716] Exception stack(0xcda1ffa8 to 0xcda1fff0)
[ 128.309763] ffa0: 000119c4 00000000 00000003
c00c5512 bebbfb34 bebbfb34
[ 128.317931] ffc0: 000119c4 00000000 000109b0 00000036 00000000
00000000 b6fb0fac bebbfb44
[ 128.326096] ffe0: 00022034 bebbfb1c 00011020 b6ee073c

If I revert the commit, I can run "./testusb -a" to completion without
any issues with Linux 4.16.0-rc6 on sun7i-a20-olinuxino-lime.
I don't get a kernel stall detected when testing on x86_64 but I do
still get at least one "unlink retry" error message from kernel in
dmesg which can be reproduced using "./testusb -a" or more frequently
with "./testusb -a -t 11 -c 5000".

Regards,
Jonathan

2018-03-27 17:09:48

by Alan Stern

[permalink] [raw]
Subject: Re: [v6] usb: ohci: Proper handling of ed_rm_list to handle race condition between usb_kill_urb() and finish_unlinks()

On Mon, 26 Mar 2018, Jonathan Liu wrote:

> > I am experiencing a USB function call hang from userspace with OCHI
> > (full speed USB device) after updating from Linux 4.14.15 to 4.14.24
> > and noticed this commit.
> >
> > Here is the Linux 4.14.24 kernel stack trace (extracted from SysRq+w
> > and amended with addr2line):
> > [<c05cbd64>] (__schedule) from [<c05cc148>] (schedule+0x50/0xb4)
> > kernel/sched/core.c:2792
> > [<c05cc148>] (schedule) from [<c042789c>]
> > (usb_kill_urb.part.3+0x78/0xa8) include/asm-generic/preempt.h:59
> > [<c042789c>] (usb_kill_urb.part.3) from [<c0433a3c>]
> > (usbdev_ioctl+0x1288/0x1cf0) drivers/usb/core/urb.c:690
> > [<c0433a3c>] (usbdev_ioctl) from [<c0217db8>]
> > (do_vfs_ioctl+0x9c/0x8ec) drivers/usb/core/devio.c:1835
> > [<c0217db8>] (do_vfs_ioctl) from [<c021863c>] (SyS_ioctl+0x34/0x5c)
> > fs/ioctl.c:47
> > [<c021863c>] (SyS_ioctl) from [<c0107820>] (ret_fast_syscall+0x0/0x54)
> > include/linux/file.h:39
> >
> > Afterwards the kernel is unresponsive to disconnect/connect of the
> > full speed USB device but I can connect/disconnect a high speed USB
> > device to the same port and communicate to it without problem since it
> > uses EHCI (OHCI is companion controller). If I try to connect the full
> > speed USB device again it is still unresponsive. The userspace
> > application is still hanging after all this.
> >
> > Could this commit be causing the issue?

It sounds like the URB gets cancelled but is never given back. But
although the commit you mentioned does affect this logic, I don't see
the problem could arise.

> There does seem to be something wrong with this commit. I did some
> more testing with Linux 4.16.0-rc6 on sun7i-a20-olinuxino-lime.
> I connected a sun7i-a20-pcduino3b running Linux 4.15.12 that has OTG
> port with gadget zero kernel module (g_zero) loaded to the USB port
> and did some testing with usbtest from tools/usb/testusb.c in kernel
> repository.
>
> I had to do the following change to avoid getting flooded with
> "usbtest 1-1:3.0: unlink retry" kernel messages after test 11 of
> testusb starts:
> --- a/drivers/usb/misc/usbtest.c
> +++ b/drivers/usb/misc/usbtest.c
> @@ -1473,7 +1473,7 @@ static int unlink1(struct usbtest_dev *dev, int
> pipe, int
> size, int async)
> * resubmission, but since we're testing unlink
> * paths, we can't.
> */
> - ERROR(dev, "unlink retry\n");
> + /*ERROR(dev, "unlink retry\n");*/

That's a bizarre message in any case. The condition under which it
gets printed is _not_ an error, just a race. I would expect it to show
up every now and then just based on random chance. But if it gets
printed over and over again then something is definitely wrong.

> If I revert the commit, I can run "./testusb -a" to completion without
> any issues with Linux 4.16.0-rc6 on sun7i-a20-olinuxino-lime.
> I don't get a kernel stall detected when testing on x86_64 but I do
> still get at least one "unlink retry" error message from kernel in
> dmesg which can be reproduced using "./testusb -a" or more frequently
> with "./testusb -a -t 11 -c 5000".

A few of those "unlink retry" messages during a test with 5000
iterations does not mean anything is wrong; don't worry about it.

It would be great if you could debug this farther. I'm not sure what
to suggest, however. Clearly we need to know what's going on in
ohci-q.c's finish_unlinks() routine; unfortunately this routine
probably gets called thousands of times before the problem shows up.

Anyway, here's a patch you can try out on the sun7i-a20-olinuxino-lime.
I have no idea whether it will help pinpoint the source of the problem.

Alan Stern



Index: usb-4.x/drivers/usb/host/ohci-hcd.c
===================================================================
--- usb-4.x.orig/drivers/usb/host/ohci-hcd.c
+++ usb-4.x/drivers/usb/host/ohci-hcd.c
@@ -327,14 +327,21 @@ static int ohci_urb_dequeue(struct usb_h
* some upcoming INTR_SF to call finish_unlinks()
*/
urb_priv = urb->hcpriv;
- if (urb_priv->ed->state == ED_OPER)
+ if (urb_priv->ed->state == ED_OPER) {
+ ohci_info(ohci, "start dequeue URB %p ed %p\n", urb,
+ urb_priv->ed);
start_ed_unlink(ohci, urb_priv->ed);
-
+ } else
+ ohci_info(ohci, "idle dequeue URB %p ed %p\n", urb,
+ urb_priv->ed);
if (ohci->rh_state != OHCI_RH_RUNNING) {
/* With HC dead, we can clean up right away */
ohci_work(ohci);
}
}
+ else
+ ohci_info(ohci, "failed dequeue URB %p\n", urb);
+
spin_unlock_irqrestore (&ohci->lock, flags);
return rc;
}
Index: usb-4.x/drivers/usb/host/ohci-q.c
===================================================================
--- usb-4.x.orig/drivers/usb/host/ohci-q.c
+++ usb-4.x/drivers/usb/host/ohci-q.c
@@ -1022,6 +1022,7 @@ skip_ed:
*last = ed->ed_next;
ed->ed_next = NULL;
modified = 0;
+ ohci_info(ohci, "scanning unlinked ed %p\n", ed);

/* unlink urbs as requested, but rescan the list after
* we call a completion since it might have unlinked
@@ -1072,6 +1073,8 @@ rescan_this:
/* if URB is done, clean up */
if (urb_priv->td_cnt >= urb_priv->length) {
modified = completed = 1;
+ ohci_info(ohci, "finish dequeue URB %p ed %p\n",
+ urb, ed);
finish_urb(ohci, urb, 0);
}
}