Received: by 10.213.65.68 with SMTP id h4csp623008imn; Sun, 25 Mar 2018 08:40:08 -0700 (PDT) X-Google-Smtp-Source: AG47ELvjvVAq32ImqJ7WNd0CKE1EJkjGsR9byUNR9kuTsMhmo8YcF0TMcWvOPrSsbPqkWPc9onn1 X-Received: by 10.98.227.16 with SMTP id g16mr21801049pfh.171.1521992408109; Sun, 25 Mar 2018 08:40:08 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1521992408; cv=none; d=google.com; s=arc-20160816; b=w3/dvKKkkKSmPedHOmp7rKGyrqyN4P4mSb8t8agRl0LsX+p58FfnL9wNU4e6by/6gh PyXB6RTbcs2GL6lImqf87jwERjeyEf7Ml8LZ/3+vfLS3XunSIaSk+JXOwfHdatFMqWGt Su0JuqpdHfvgpnxb1F7umffGEUvSBFuyildH/bgPHMg2vJNt2sQpasb5kEhm0jz2Mt5g dD+Sa+DquOyC0tMrPI7g/oXixKL6kvxOSawaXHkR+aOd4PltV9owxkO0aKGvcbPGvSzw I1O5sZWQwVEDOujZBxsG2aHBLuTybKiLUj9uW7cjc8mnDoJvwFLLp+/NcA/9ViIv+B1e MR2g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :references:in-reply-to:mime-version:dkim-signature :arc-authentication-results; bh=brgvCCetpXW1NFkkBk1rkYZFVVcLfWzSvAOLNC31UX4=; b=o3YHLfYdPZdbm5vVaeby+/bpM017IBG2tvAsqb06lCqZWcwpkz9j5fPacJqTgL8jGv TRaI7VhOwG2gK52xSFiSs0oIsoBuLpB26/4rX8HsneYgzCf8c9FlS9K8QKLgmHlCInB+ gpNXQPoJ9I4rn6pxkpozu+Ytiq6olhKjWIjOz3VMBQ3BS5B5eoWGBswSaJ++kxIjHqRV EPYVt0NV8OJrBQgA5tCy2BhbZq1KV6a9WKTwW9rHPH8dtgJfik9d7BljzprtxX6zZ8Ds kR351p9IyAEEp2HKk66SkafC7et8fNltrNc9lH9CAbe4SvHMmkRJrdhHcbd54Omp1VkV qLnw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=t3jR/mkE; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id s23-v6si12540532plr.109.2018.03.25.08.39.39; Sun, 25 Mar 2018 08:40:08 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=t3jR/mkE; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753444AbeCYP0F (ORCPT + 99 others); Sun, 25 Mar 2018 11:26:05 -0400 Received: from mail-lf0-f67.google.com ([209.85.215.67]:42917 "EHLO mail-lf0-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753324AbeCYP0D (ORCPT ); Sun, 25 Mar 2018 11:26:03 -0400 Received: by mail-lf0-f67.google.com with SMTP id a22-v6so24605677lfg.9; Sun, 25 Mar 2018 08:26:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=brgvCCetpXW1NFkkBk1rkYZFVVcLfWzSvAOLNC31UX4=; b=t3jR/mkEuv9XAbTMnFzVtHTc7sg0ZkoCEsfoDgtpUsPRE9iGlTvfdthLob/aryJQSD j0KwDQF6wP67dlxTA5jwFuN4gvvry8W3xkEmh7NqxCcyXuaobtFxeyotkWr1css/JRdr WMuXu8TH9uw3+v8r7RFyI0KIdbodznlrrdnnFx9n/stEFqnAI3U8jGS+73A/A5tXYkAo cSgZrZeou7zj5FpLU+A9geL5p7X+2sBVVPSPfgwOFq3i71Boy8fWVSUiJGnkjoN+M+ea A0jJDlw/8JxMfyBusyI+vLGTuDJFSKIWCuxcV9ug5hJaeb+KQvTwUTbfe5uGiLllo0oA pLCQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=brgvCCetpXW1NFkkBk1rkYZFVVcLfWzSvAOLNC31UX4=; b=GAOK4T/oCZsPZnWSxpqrLKxm+lWMGxViLJix12xJPFCsFHtSMkvDRjSOYmzyiqQ+pj xxLjm3QXCSAvKVs9IR1JV8VbgQ4DWS9ci42pe6OoxSWNSRyO9j9oKgnL8a/5B9bSy1Dt TwWcg96RTXTitl5pbg9YgmGNToYgXKWzSIDT9+gi2pIHKsHm3SXvFtEK3MKAKvG5otdE 4cB4FHMA4bSKWIzfFzcYz6IcveP6PpnTQR9eecmUT612vGuJsN4Zhw63HS19R6g7VWjZ eBpbtW/SEhEySC8Qh/OGbjPXR6T9AlnFhjC8HVxDOpZ/kMEQtQG+TYqmtXYHHiT0YR2v +rVg== X-Gm-Message-State: AElRT7G0y1yINl9h9avnWpL3WUwpL8aMmBHexdZ1BJmli9qK1x53ZKda KD61wvCB7F1oR6eNwByGoLIOotgkPbytFYqLzo4= X-Received: by 10.46.17.23 with SMTP id f23mr9360918lje.43.1521991560010; Sun, 25 Mar 2018 08:26:00 -0700 (PDT) MIME-Version: 1.0 Received: by 10.46.17.84 with HTTP; Sun, 25 Mar 2018 08:25:59 -0700 (PDT) In-Reply-To: References: <20180208035501.10711-1-jeffy.chen@rock-chips.com> From: Jonathan Liu Date: Mon, 26 Mar 2018 02:25:59 +1100 Message-ID: Subject: Re: [v6] usb: ohci: Proper handling of ed_rm_list to handle race condition between usb_kill_urb() and finish_unlinks() To: Jeffy Chen Cc: linux-kernel , Brian Norris , Alan Stern , mka@chromium.org, Douglas Anderson , AMAN DEEP , stable@vger.kernel.org, Greg Kroah-Hartman , linux-usb@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, On 25 March 2018 at 12:21, Jonathan Liu wrote: > On 8 February 2018 at 14:55, Jeffy Chen wrote: >> From: AMAN DEEP >> >> There is a race condition between finish_unlinks->finish_urb() function >> and usb_kill_urb() in ohci controller case. The finish_urb calls >> spin_unlock(&ohci->lock) before usb_hcd_giveback_urb() function call, >> then if during this time, usb_kill_urb is called for another endpoint, >> then new ed will be added to ed_rm_list at beginning for unlink, and >> ed_rm_list will point to newly added. >> >> When finish_urb() is completed in finish_unlinks() and ed->td_list >> becomes empty as in below code (in finish_unlinks() function): >> >> if (list_empty(&ed->td_list)) { >> *last = ed->ed_next; >> ed->ed_next = NULL; >> } else if (ohci->rh_state == OHCI_RH_RUNNING) { >> *last = ed->ed_next; >> ed->ed_next = NULL; >> ed_schedule(ohci, ed); >> } >> >> The *last = ed->ed_next will make ed_rm_list to point to ed->ed_next >> and previously added ed by usb_kill_urb will be left unreferenced by >> ed_rm_list. This causes usb_kill_urb() hang forever waiting for >> finish_unlink to remove added ed from ed_rm_list. >> >> The main reason for hang in this race condtion is addition and removal >> of ed from ed_rm_list in the beginning during usb_kill_urb and later >> last* is modified in finish_unlinks(). >> >> As suggested by Alan Stern, the solution for proper handling of >> ohci->ed_rm_list is to remove ed from the ed_rm_list before finishing >> any URBs. Then at the end, we can add ed back to the list if necessary. >> >> This properly handle the updated ohci->ed_rm_list in usb_kill_urb(). >> >> Fixes:977dcfdc6031("USB:OHCI:don't lose track of EDs when a controller dies") >> Acked-by: Alan Stern >> CC: >> Signed-off-by: Aman Deep >> Signed-off-by: Jeffy Chen >> --- >> >> Changes in v6: >> This is a resend of Aman Deep's v5 patch [0], which solved the hang we >> hit [1]. (Thanks Aman :) >> >> The v5 has some format issues, so i slightly adjust the commit message. >> >> [0] https://www.spinics.net/lists/linux-usb/msg129010.html >> [1] https://bugs.chromium.org/p/chromium/issues/detail?id=803749 >> >> drivers/usb/host/ohci-q.c | 17 ++++++++++------- >> 1 file changed, 10 insertions(+), 7 deletions(-) >> >> diff --git a/drivers/usb/host/ohci-q.c b/drivers/usb/host/ohci-q.c >> index b2ec8c399363..4ccb85a67bb3 100644 >> --- a/drivers/usb/host/ohci-q.c >> +++ b/drivers/usb/host/ohci-q.c >> @@ -1019,6 +1019,8 @@ static void finish_unlinks(struct ohci_hcd *ohci) >> * have modified this list. normally it's just prepending >> * entries (which we'd ignore), but paranoia won't hurt. >> */ >> + *last = ed->ed_next; >> + ed->ed_next = NULL; >> modified = 0; >> >> /* unlink urbs as requested, but rescan the list after >> @@ -1077,21 +1079,22 @@ static void finish_unlinks(struct ohci_hcd *ohci) >> goto rescan_this; >> >> /* >> - * If no TDs are queued, take ED off the ed_rm_list. >> + * If no TDs are queued, ED is now idle. >> * Otherwise, if the HC is running, reschedule. >> - * If not, leave it on the list for further dequeues. >> + * If the HC isn't running, add ED back to the >> + * start of the list for later processing. >> */ >> if (list_empty(&ed->td_list)) { >> - *last = ed->ed_next; >> - ed->ed_next = NULL; >> ed->state = ED_IDLE; >> list_del(&ed->in_use_list); >> } else if (ohci->rh_state == OHCI_RH_RUNNING) { >> - *last = ed->ed_next; >> - ed->ed_next = NULL; >> ed_schedule(ohci, ed); >> } else { >> - last = &ed->ed_next; >> + ed->ed_next = ohci->ed_rm_list; >> + ohci->ed_rm_list = ed; >> + /* Don't loop on the same ED */ >> + if (last == &ohci->ed_rm_list) >> + last = &ed->ed_next; >> } >> >> if (modified) > > I am experiencing a USB function call hang from userspace with OCHI > (full speed USB device) after updating from Linux 4.14.15 to 4.14.24 > and noticed this commit. > > Here is the Linux 4.14.24 kernel stack trace (extracted from SysRq+w > and amended with addr2line): > [] (__schedule) from [] (schedule+0x50/0xb4) > kernel/sched/core.c:2792 > [] (schedule) from [] > (usb_kill_urb.part.3+0x78/0xa8) include/asm-generic/preempt.h:59 > [] (usb_kill_urb.part.3) from [] > (usbdev_ioctl+0x1288/0x1cf0) drivers/usb/core/urb.c:690 > [] (usbdev_ioctl) from [] > (do_vfs_ioctl+0x9c/0x8ec) drivers/usb/core/devio.c:1835 > [] (do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c) > fs/ioctl.c:47 > [] (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x54) > include/linux/file.h:39 > > Afterwards the kernel is unresponsive to disconnect/connect of the > full speed USB device but I can connect/disconnect a high speed USB > device to the same port and communicate to it without problem since it > uses EHCI (OHCI is companion controller). If I try to connect the full > speed USB device again it is still unresponsive. The userspace > application is still hanging after all this. > > Could this commit be causing the issue? There does seem to be something wrong with this commit. I did some more testing with Linux 4.16.0-rc6 on sun7i-a20-olinuxino-lime. I connected a sun7i-a20-pcduino3b running Linux 4.15.12 that has OTG port with gadget zero kernel module (g_zero) loaded to the USB port and did some testing with usbtest from tools/usb/testusb.c in kernel repository. I had to do the following change to avoid getting flooded with "usbtest 1-1:3.0: unlink retry" kernel messages after test 11 of testusb starts: --- a/drivers/usb/misc/usbtest.c +++ b/drivers/usb/misc/usbtest.c @@ -1473,7 +1473,7 @@ static int unlink1(struct usbtest_dev *dev, int pipe, int size, int async) * resubmission, but since we're testing unlink * paths, we can't. */ - ERROR(dev, "unlink retry\n"); + /*ERROR(dev, "unlink retry\n");*/ continue; case 0: case -EINPROGRESS: Here are the results. # ./testusb -a /dev/bus/usb/001/002 test 0, 0.006822 secs /dev/bus/usb/001/002 test 1, 0.140112 secs /dev/bus/usb/001/002 test 2, 0.381125 secs /dev/bus/usb/001/002 test 3, 0.133564 secs /dev/bus/usb/001/002 test 4, 0.385498 secs /dev/bus/usb/001/002 test 5, 2.138259 secs /dev/bus/usb/001/002 test 6, 1.755144 secs /dev/bus/usb/001/002 test 7, 2.137387 secs /dev/bus/usb/001/002 test 8, 1.752393 secs /dev/bus/usb/001/002 test 9, 1.132559 secs /dev/bus/usb/001/002 test 10, 3.698934 secs /dev/bus/usb/001/002 test 11, 40.061114 secs dmesg (processed through scripts/decode_stacktrace.sh from kernel repository): [ 21.573242] usbtest 1-1:3.0: TEST 0: NOP [ 21.580315] usbtest 1-1:3.0: TEST 1: write 1024 bytes 1000 times [ 21.732138] usbtest 1-1:3.0: TEST 2: read 1024 bytes 1000 times [ 22.132133] usbtest 1-1:3.0: TEST 3: write/1024 0..1024 bytes 1000 times [ 22.282137] usbtest 1-1:3.0: TEST 4: read/1024 0..1024 bytes 1000 times [ 22.682131] usbtest 1-1:3.0: TEST 5: write 1000 sglists 32 entries of 1024 bytes [ 24.832138] usbtest 1-1:3.0: TEST 6: read 1000 sglists 32 entries of 1024 bytes [ 26.602135] usbtest 1-1:3.0: TEST 7: write/1024 1000 sglists 32 entries 0..1024 bytes [ 28.752136] usbtest 1-1:3.0: TEST 8: read/1024 1000 sglists 32 entries 0..1024 bytes [ 30.522139] usbtest 1-1:3.0: TEST 9: ch9 (subset) control tests, 1000 times [ 31.654957] usbtest 1-1:3.0: TEST 10: queue 32 control calls, 1000 times [ 35.354220] usbtest 1-1:3.0: TEST 11: unlink 1000 reads of 1024 [ 75.414099] usbtest 1-1:3.0: TEST 12: unlink 1000 writes of 1024 [ 128.026194] INFO: rcu_sched self-detected stall on CPU [ 128.031367] 0-....: (1 GPs behind) idle=faa/1/1073741826 softirq=2614/2621 fqs=1048 [ 128.039182] (t=2100 jiffies g=227 c=226 q=1961) [ 128.043887] NMI backtrace for cpu 0 [ 128.047377] CPU: 0 PID: 262 Comm: testusb Tainted: G O 4.16.0-rc6 #1 [ 128.054932] Hardware name: Allwinner sun7i (A20) Family [ 128.060181] (unwind_backtrace) from show_stack (arch/arm/kernel/traps.c:249) [ 128.067921] (show_stack) from dump_stack (lib/dump_stack.c:55) [ 128.075142] (dump_stack) from nmi_cpu_backtrace (lib/nmi_backtrace.c:103) [ 128.083141] (nmi_cpu_backtrace) from nmi_trigger_cpumask_backtrace (lib/nmi_backtrace.c:62) [ 128.092704] (nmi_trigger_cpumask_backtrace) from rcu_dump_cpu_stacks (kernel/rcu/tree.c:1373 (discriminator 2)) [ 128.102351] (rcu_dump_cpu_stacks) from rcu_check_callbacks (kernel/rcu/tree.c:1526 kernel/rcu/tree.c:1592 kernel/rcu/tree.c:3361 kernel/rcu/tree.c:3423 kernel/rcu/tree.c:2763) [ 128.111304] (rcu_check_callbacks) from update_process_times (arch/arm/include/asm/thread_info.h:91 include/asm-generic/preempt.h:11 kernel/time/timer.c:1638) [ 128.120173] (update_process_times) from tick_sched_timer (kernel/time/tick-sched.c:1199) [ 128.128777] (tick_sched_timer) from __hrtimer_run_queues (kernel/time/hrtimer.c:1349 kernel/time/hrtimer.c:1411) [ 128.137553] (__hrtimer_run_queues) from hrtimer_interrupt (kernel/time/hrtimer.c:1472) [ 128.146331] (hrtimer_interrupt) from arch_timer_handler_phys (drivers/clocksource/arm_arch_timer.c:589 drivers/clocksource/arm_arch_timer.c:606) [ 128.155285] (arch_timer_handler_phys) from handle_percpu_devid_irq (kernel/irq/chip.c:909) [ 128.164843] (handle_percpu_devid_irq) from generic_handle_irq (kernel/irq/irqdesc.c:607) [ 128.173880] (generic_handle_irq) from __handle_domain_irq (kernel/irq/irqdesc.c:646) [ 128.182574] (__handle_domain_irq) from gic_handle_irq (arch/arm/include/asm/io.h:118 drivers/irqchip/irq-gic.c:360) [ 128.190920] (gic_handle_irq) from __irq_svc (arch/arm/kernel/entry-armv.S:226) [ 128.198382] Exception stack(0xcda1fc88 to 0xcda1fcd0) [ 128.203431] fc80: c0a95bc8 a0060013 0000000a 0000aa75 cd83b480 a0060013 [ 128.211598] fca0: cf7ee000 ffffff98 00000001 00000000 00000400 cec2b800 00000009 cda1fcd8 [ 128.219763] fcc0: c045b328 c0622890 20060013 ffffffff [ 128.224818] (__irq_svc) from _raw_spin_unlock_irqrestore (kernel/locking/spinlock.c:185) [ 128.233428] (_raw_spin_unlock_irqrestore) from usb_hcd_unlink_urb (drivers/usb/core/hcd.c:1720) [ 128.242816] (usb_hcd_unlink_urb) from unlink1 (drivers/usb/core/hcd.c:864 drivers/usb/core/hcd.c:1684) [ 128.250555] (unlink1) from unlink_simple (drivers/usb/misc/usbtest.c:1510) [ 128.257774] (unlink_simple) from usbtest_do_ioctl (drivers/usb/misc/usbtest.c:2360) [ 128.266033] (usbtest_do_ioctl) from usbtest_ioctl (drivers/usb/misc/usbtest.c:2685) [ 128.274206] (usbtest_ioctl) from usbdev_ioctl (drivers/usb/core/devio.c:2175 drivers/usb/core/devio.c:2197 drivers/usb/core/devio.c:2511 drivers/usb/core/devio.c:2555) [ 128.282200] (usbdev_ioctl) from do_vfs_ioctl (fs/ioctl.c:47 fs/ioctl.c:686) [ 128.289854] (do_vfs_ioctl) from SyS_ioctl (include/linux/file.h:39 fs/ioctl.c:702 fs/ioctl.c:692) [ 128.297159] (SyS_ioctl) from ret_fast_syscall (arch/arm/kernel/entry-common.S:52) [ 128.304716] Exception stack(0xcda1ffa8 to 0xcda1fff0) [ 128.309763] ffa0: 000119c4 00000000 00000003 c00c5512 bebbfb34 bebbfb34 [ 128.317931] ffc0: 000119c4 00000000 000109b0 00000036 00000000 00000000 b6fb0fac bebbfb44 [ 128.326096] ffe0: 00022034 bebbfb1c 00011020 b6ee073c If I revert the commit, I can run "./testusb -a" to completion without any issues with Linux 4.16.0-rc6 on sun7i-a20-olinuxino-lime. I don't get a kernel stall detected when testing on x86_64 but I do still get at least one "unlink retry" error message from kernel in dmesg which can be reproduced using "./testusb -a" or more frequently with "./testusb -a -t 11 -c 5000". Regards, Jonathan