Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755480AbdCJXiD (ORCPT ); Fri, 10 Mar 2017 18:38:03 -0500 Received: from bh-25.webhostbox.net ([208.91.199.152]:37106 "EHLO bh-25.webhostbox.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754128AbdCJXhz (ORCPT ); Fri, 10 Mar 2017 18:37:55 -0500 Subject: Re: [PATCH] usb: hub: Fix error loop seen after hub communication errors To: Alan Stern References: Cc: Greg Kroah-Hartman , Douglas Anderson , Brian Norris , linux-usb@vger.kernel.org, linux-kernel@vger.kernel.org From: Guenter Roeck Message-ID: <4eaddfc6-92f0-d777-e58b-9db2db618823@roeck-us.net> Date: Fri, 10 Mar 2017 15:37:51 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.5.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Authenticated_sender: linux@roeck-us.net X-OutGoing-Spam-Status: No, score=-1.0 X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - bh-25.webhostbox.net X-AntiAbuse: Original Domain - vger.kernel.org X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - roeck-us.net X-Get-Message-Sender-Via: bh-25.webhostbox.net: authenticated_id: linux@roeck-us.net X-Authenticated-Sender: bh-25.webhostbox.net: linux@roeck-us.net X-Source: X-Source-Args: X-Source-Dir: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5642 Lines: 136 Hi Alan, On 03/10/2017 01:14 PM, Alan Stern wrote: > On Wed, 8 Mar 2017, Guenter Roeck wrote: > >> While stress testing a usb controller using a bind/unbind looop, the >> following error loop was observed. >> >> usb 7-1.2: new low-speed USB device number 3 using xhci-hcd >> usb 7-1.2: hub failed to enable device, error -108 >> usb 7-1-port2: cannot disable (err = -22) >> usb 7-1-port2: couldn't allocate usb_device >> usb 7-1-port2: cannot disable (err = -22) >> hub 7-1:1.0: hub_ext_port_status failed (err = -22) >> hub 7-1:1.0: hub_ext_port_status failed (err = -22) >> hub 7-1:1.0: activate --> -22 >> hub 7-1:1.0: hub_ext_port_status failed (err = -22) >> hub 7-1:1.0: hub_ext_port_status failed (err = -22) >> hub 7-1:1.0: activate --> -22 > ... > >> This continues forever. After adding tracebacks into the code, >> the call sequence leading to this is found to be as follows. >> >> [] hub_activate+0x368/0x7b8 >> [] hub_resume+0x2c/0x3c >> [] usb_resume_interface.isra.6+0x128/0x158 >> [] usb_suspend_both+0x1e8/0x288 >> [] usb_runtime_suspend+0x3c/0x98 >> [] __rpm_callback+0x48/0x7c >> [] rpm_callback+0xa8/0xd4 >> [] rpm_suspend+0x84/0x758 >> [] rpm_idle+0x2c8/0x498 >> [] __pm_runtime_idle+0x60/0xac >> [] usb_autopm_put_interface+0x6c/0x7c >> [] hub_event+0x10ac/0x12ac >> [] process_one_work+0x390/0x6b8 >> [] worker_thread+0x480/0x610 >> [] kthread+0x164/0x178 >> [] ret_from_fork+0x10/0x40 >> >> kick_hub_wq() is called from hub_activate() even after failures to >> communicate with the hub. This results in an endless sequence of >> hub event -> hub activate -> wq trigger -> hub event -> ... >> To break the loop, only trigger the hub event queue if communication >> with the hub is successful. >> >> Signed-off-by: Guenter Roeck >> --- >> drivers/usb/core/hub.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c >> index 5286bf67869a..f22ab428b310 100644 >> --- a/drivers/usb/core/hub.c >> +++ b/drivers/usb/core/hub.c >> @@ -1197,7 +1197,8 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type) >> &hub->leds, LED_CYCLE_PERIOD); >> >> /* Scan all ports that need attention */ >> - kick_hub_wq(hub); >> + if (!status) >> + kick_hub_wq(hub); >> >> if (type == HUB_INIT2 || type == HUB_INIT3) { >> /* Allow autosuspend if it was suppressed */ > > I'm not at all sure this is the best solution. status == 0 means that > an URB was successfully _submitted_; it does not mean that future > communication with the hub will be successful. > > The problem arises here because the system is unable to runtime-suspend > the hub (you didn't say why, but probably because of the same > communication problem). When a suspend fails, drivers are informed Yes, exactly. > that the device is back at full power, and of course the hub driver > tries to determine the hub's current state at that time. When this > fails, there's no reason to keep the hub at full power any longer, so > the system tries to do another runtime suspend. This suspend fails > just like the first one, and there's your loop. > If we don't cut the loop in hub_activate(), where else would be a good place to cut it ? > This same kind of thing could happen with any driver. In your case, > the problem could be solved by making sure that when the hub fails to > suspend, it is still at least minimally functional. We don't do this > now; when usb_port_suspend fails there is no checking to see whether > the device is still working. > > (It would be interesting to know exactly at which point > usb_port_suspend fails during your test. And why didn't your system > mark the hub as disconnected when the host controller was unbound?) > It does, but that happens a bit later. If I disable console logging (for example by using dev_err_ratelimited() instead of dev_err() for the observed errors), the hub is marked as disabled after a few hundred or a few thousand loops. Only that code never gets to run runs if console logging is enabled, and to restrict logging didn't seem to be a good solution for the problem. The test script I am running is: while true do for i in /sys/bus/platform/drivers/rockchip-dwc3/usb*; do basename $i > $(dirname $i)/unbind basename $i > $(dirname $i)/bind done sleep $(awk "BEGIN {printf \"%.2f\", $((RANDOM % 150)) / 100 + 0.5}") done where rockchip-dwc3 is a slightly more complex version of dwc3-of-simple.c, with pretty much the same functionality, only that it ties into extcon to be able to handle cable states. The hub disconnect is a result of a call to of_platform_depopulate(), similar to the same call in dwc3-of-simple.c. The sleep in the above script is essential; it results in all kinds of race conditions. The one I am trying to fix here is just one of them. The system under test has two Type-C ports; one is connected to an Ethernet interface and the other to an Apple Type-C adapter (it probably doesn't matter much what is connected, as long as there is a hub). > An even worse problem is what to do when the device is working okay but > runtime suspend still fails (for example, if remote wakeup cannot be > enabled). How do we prevent the system from going into a similar loop > then? > Good question. I am open to suggestions. Thanks, Guenter