Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935104AbcLVBoB (ORCPT ); Wed, 21 Dec 2016 20:44:01 -0500 Received: from mga09.intel.com ([134.134.136.24]:61484 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934181AbcLVBn7 (ORCPT ); Wed, 21 Dec 2016 20:43:59 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.33,386,1477983600"; d="scan'208";a="1085297939" Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command To: Mathias Nyman , Baolin Wang References: <0c39bfed1cf6f7b747e702aa841f82c9d2140f27.1480922249.git.baolin.wang@linaro.org> <584EC7B9.6040100@linux.intel.com> <5857B794.2070100@linux.intel.com> <5857CEE4.6040007@intel.com> <5858B3AE.90705@linux.intel.com> <5858D231.7040407@linux.intel.com> <5858DB52.8060707@linux.intel.com> <58594A9A.10507@linux.intel.com> <585A1E67.4020902@linux.intel.com> <585A7A0A.6090401@linux.intel.com> Cc: Mathias Nyman , Greg KH , USB , LKML , Mark Brown , "Lu, Baolu" , Alan Stern , OGAWA Hirofumi From: Lu Baolu Message-ID: <585B2FDC.1060502@linux.intel.com> Date: Thu, 22 Dec 2016 09:43:56 +0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.5.1 MIME-Version: 1.0 In-Reply-To: <585A7A0A.6090401@linux.intel.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4335 Lines: 107 Hi, On 12/21/2016 08:48 PM, Mathias Nyman wrote: > On 21.12.2016 08:17, Lu Baolu wrote: >> Hi Mathias, >> >> I have some comments for the implementation of xhci_abort_cmd_ring() below. >> >> On 12/20/2016 11:13 PM, Mathias Nyman wrote: >>> On 20.12.2016 09:30, Baolin Wang wrote: >>> ... >>> >>> Alright, I gathered all current work related to xhci races and timeouts >>> and put them into a branch: >>> >>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes >>> >>> Its based on 4.9 >>> It includes a few other patches just to avoid conflicts and make my life easier >>> >>> Interesting patches are: >>> >>> ee4eb91 xhci: remove unnecessary check for pending timer >>> 0cba67d xhci: detect stop endpoint race using pending timer instead of counter. >>> 4f2535f xhci: Handle command completion and timeout race >>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command >>> 529a5a0 usb: xhci: fix possible wild pointer >>> 4766555 xhci: Fix race related to abort operation >>> de834a3 xhci: Use delayed_work instead of timer for command timeout >>> 69973b8 Linux 4.9 >>> >>> The fixes for command queue races will go to usb-linus and stable, the >>> reworks for stop ep watchdog timer will go to usb-next. >>> >>> Still completely untested, (well it compiles) >>> >>> Felipe gave instructions how to modify dwc3 driver to timeout on address >>> devicecommands to test these, I'll try to set that up. >>> >>> All additional testing is welcome, especially if you can trigger timeouts >>> and races >>> >>> -Mathias >>> >>> >> >> Below is the latest code. I put my comments in line. >> >> 322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci) >> 323 { >> 324 u64 temp_64; >> 325 int ret; >> 326 >> 327 xhci_dbg(xhci, "Abort command ring\n"); >> 328 >> 329 reinit_completion(&xhci->cmd_ring_stop_completion); >> 330 >> 331 temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring); >> 332 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT, >> 333 &xhci->op_regs->cmd_ring); >> >> We should hold xhci->lock when we are modifying xhci registers >> at runtime. >> > > Makes sense, but we need to unlock it before sleeping or waiting for completion. > I need to look into that in more detail. > > But this was an issue already before these changes. > >> The retry of setting CMD_RING_ABORT is not necessary according to >> previous discussion. We have cleaned code for second try in >> xhci_handle_command_timeout(). Need to clean up here as well. >> > > Yes it can be cleaned up as well, but the two cases are a bit different. > The cleaned up one was about command ring not starting again after it was stopped. > > This second try is a workaround for what we thought was the command ring failing > to stop in the first place, but is most likely due to the race that OGAWA Hirofumi > fixed. It races if the stop command ring interrupt happens between writing the abort > bit and polling for the ring stopped bit. The interrupt hander may start the command > ring again, and we would believe we failed to stop it in the first place. > > This race could probably be fixed by just extending the lock (and preventing > interrupts) to cover both writing the abort bit and polling for the command ring > running bit, as you pointed out here previously. > > But then again I really like OGAWA Hiroumi's solution that separates the > command ring stopping from aborting commands and restarting the ring. > > The current way of always restarting the command ring as a response to > a stop command ring command really limits its usage. > > So, with this in mind most reasonable would be to > 1. fix the lock to cover abort+CRR check, and send it to usb-linus +stable > 2. rebase OGAWA Hirofumi's changes on top of that, and send to usb-linus only > 3. remove unnecessary second abort try as a separate patch, send to usb-next > 4. remove polling for the Command ring running (CRR), waiting for completion > is enough, if completion times out then we can check CRR. for usb-next > I'll fix the typos these patches would introduce. Fixing old typos can be done as separate > patches later. This is exactly the same as what I am thinking of. I will submit the patches later. Best regards, Lu Baolu