Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752581AbdHOMvW (ORCPT ); Tue, 15 Aug 2017 08:51:22 -0400 Received: from hs01.dk-develop.de ([213.136.71.231]:40625 "EHLO hs01.dk-develop.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752381AbdHOMvU (ORCPT ); Tue, 15 Aug 2017 08:51:20 -0400 MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed Content-Transfer-Encoding: 7bit Date: Tue, 15 Aug 2017 14:51:19 +0200 From: Danilo Krummrich To: Felipe Balbi Cc: linux-kernel@vger.kernel.org, linux-usb@vger.kernel.org, gregkh@linuxfoundation.org, k.opasiak@samsung.com Subject: Re: [PATCH] usb: gadget: udc: udc_stop before gadget unbind In-Reply-To: <878til6ntr.fsf@linux.intel.com> References: <20170814163652.29108-1-danilokrummrich@dk-develop.de> <87efsd6tcn.fsf@linux.intel.com> <0c918df8b95fba6ad6849cf9e67c59d6@dk-develop.de> <878til6ntr.fsf@linux.intel.com> Message-ID: User-Agent: Roundcube Webmail/1.3.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5583 Lines: 155 On 2017-08-15 14:02, Felipe Balbi wrote: > Hi, > > Danilo Krummrich writes: >> thanks for reviewing. > > np :-) > >> On 2017-08-15 12:03, Felipe Balbi wrote: >>> Hi, >>> >>> Danilo Krummrich writes: >>>> udc_stop needs to be called before gadget driver unbind. Otherwise >>>> it >>>> might happen that udc drivers still call into the gadget driver >>>> (e.g. >>>> to reset gadget after OTG event). If this happens it is likely to >>>> get >>>> panics from gadget driver dereferencing NULL ptr, as gadget's >>>> drvdata >>>> is set to NULL on unbind. >>> >>> seems like the problem here is with the OTG layer, not UDC core. >>> >> I mentioned this just as example, it can happen whenever a UDC driver >> calls >> the gadget driver (e.g. by calling usb_gadget_udc_reset() in ISR) >> after >> gadget >> drivers unbind() was called already (e.g. by gadget configfs). >> If this happens gadget drivers drvdata was already set to NULL by >> unbind() >> and reset() could result into a NULL ptr exception. >> Therefore my assumption was that it needs to be prevented that the >> gadget >> driver is getting called after unbind. > > We have a known problem in the design of the gadget API that causes > this > races but we couldn't come up with a solution yet :-) > > Inverting these two calls is not the correct way to go about this :-) > Now I see, thanks for explanation below. >>>> Signed-off-by: Danilo Krummrich >>>> --- >>>> Actually there could still be a race: >>>> (CPU1 code taken from dwc3 drivers dwc3_disconnect_gadget() as >>>> exsample) >>>> >>>> CPU0 CPU1 >>>> ---- ---- >>>> usb_gadget_disconnect(udc->gadget); >>>> udc->driver->disconnect(udc->gadget); >>>> if (dwc->gadget_driver && dwc->gadget_driver->disconnect) >>>> usb_gadget_udc_stop(udc); >>>> udc->driver->unbind(udc->gadget); >>>> dwc->gadget_driver->disconnect(&dwc->gadget); >>>> >>>> UDC drivers typically set their gadget driver pointer to NULL in >>>> udc_stop >>>> and check for it before calling into the gadget driver. To fix the >>>> issue >>>> above every udc driver could apply a lock around this. >>>> >>>> If you see the need for having this or another solutions I can >>>> provide >>>> further patches. This patch could also just serve as a base for >>>> discussion >>>> if someone knows a smarter solution. >>>> >>>> I saw this problem causing a panic on hikey960 board and provided a >>>> quick >>>> workaround for the same problem here: >>>> https://android-review.googlesource.com/#/c/kernel/common/+/457476/ >>>> (panic log in the commit message of the linked patch) >>>> --- >>>> drivers/usb/gadget/udc/core.c | 6 +++++- >>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/usb/gadget/udc/core.c >>>> b/drivers/usb/gadget/udc/core.c >>>> index efce68e9a8e0..8155468afc0d 100644 >>>> --- a/drivers/usb/gadget/udc/core.c >>>> +++ b/drivers/usb/gadget/udc/core.c >>>> @@ -1234,8 +1234,12 @@ static void usb_gadget_remove_driver(struct >>>> usb_udc *udc) >>>> >>>> usb_gadget_disconnect(udc->gadget); >>>> udc->driver->disconnect(udc->gadget); >>>> - udc->driver->unbind(udc->gadget); >>>> + /* udc_stop needs to be called before gadget driver unbind to >>>> prevent >>>> + * udc driver calls into gadget driver after unbind which could >>>> cause >>>> + * a nullptr exception. >>>> + */ >>>> usb_gadget_udc_stop(udc); >>>> + udc->driver->unbind(udc->gadget); >>> >>> This patch is incorrect, it will prevent us from giving back requests >>> to >>> gadget driver properly. ->unbind() has to happen before ->udc_stop(). >> >> Do you mean after udc_stop the udc driver can not call the gadget >> driver >> anymore? If not, I did not got your point, sorry for that. Can you >> please >> help me out? Would the changed order raise another issue I'm not aware >> of? > > right, ->udc_stop() is supposed to completely teardown the USB > controller, including disabling interrupts and all. The only thing it > _can_ do from ->udc_stop() would be giving back any pending requests > that were left (which would cause req->complete() to be called with an > error status). But even that is unlikely in the case you mention since > ->unbind() was already called. > Ok, got it. That's why req->context = cdev, to overcome being unbound already. Thanks for clarification. >> If I understood you correctly, without this patch udc driver can not >> call >> the gadget driver back as well, because this would result in a NULL >> ptr >> dereference, as unbind() sets drvdata to NULL. >> >> In any case the race described in my original message can still >> happen, >> regardless of the order of udc_stop and unbind. But with this patch >> the >> needed locking could easily done within the udc driver only. Without, >> the >> lock needs to be acquired before udc->driver->unbind(udc->gadget) and >> released after usb_gadget_udc_stop(). Otherwise an ISR of the udc >> driver >> trying to call into the gadget driver could do this after gadget >> driver >> already unbound. > > right Is someone working on this issue, already? If not, I would like to offer introducing the needed locking to overcome this issue. If you are about to refactor the whole thing already to solve this (as you state there's a design problem in the gadget API) would it make sense for you to have a workaround meanwhile (maybe not in mainline kernel, but somewhere else)? As e.g. on hikey960 board this causes panics very often. Thanks, Danilo