Return-path: Received: from mx0a-0016f401.pphosted.com ([67.231.148.174]:38224 "EHLO mx0b-0016f401.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751527AbeAJMan (ORCPT ); Wed, 10 Jan 2018 07:30:43 -0500 From: Xinming Hu To: Brian Norris , Kalle Valo CC: Kalle Valo , Linux Wireless , Dmitry Torokhov , "rajatja@google.com" , Zhiyuan Yang , Tim Song , Cathy Luo , James Cao , Ganapathi Bhat , Ellie Reeves , Christoph Hellwig Subject: Re: Re: Re: [PATCH] mwifiex: cancel pcie/sdio work in remove/shutdown handler Date: Wed, 10 Jan 2018 12:30:35 +0000 Message-ID: <1515587433253.94629@marvell.com> (sfid-20180110_133048_551411_6F4A4C3B) Content-Type: text/plain; charset="iso-2022-jp" MIME-Version: 1.0 Sender: linux-wireless-owner@vger.kernel.org List-ID: Hi Brian, > -----Original Message----- > From: Brian Norris [mailto:briannorris@chromium.org] > Sent: 2018年1月10日 6:01 > To: Xinming Hu ; Kalle Valo > Cc: Kalle Valo ; Linux Wireless > ; Dmitry Torokhov ; > rajatja@google.com; Zhiyuan Yang ; Tim Song > ; Cathy Luo ; James Cao > ; Ganapathi Bhat ; Ellie Reeves > ; Christoph Hellwig > Subject: [EXT] Re: Re: [PATCH] mwifiex: cancel pcie/sdio work in > remove/shutdown handler > > External Email > > ---------------------------------------------------------------------- > + Christopher > > Hi Simon and Kalle, > > On Tue, Jan 09, 2018 at 11:42:21AM +0000, Xinming Hu wrote: > > Hi, > > > > > -----Original Message----- > > > From: Kalle Valo [mailto:kvalo@codeaurora.org] > > > Sent: 2018年1月9日 15:40 > > > To: Brian Norris > > > Cc: Xinming Hu ; Linux Wireless > > > ; Dmitry Torokhov ; > > > rajatja@google.com; Zhiyuan Yang ; Tim Song > > > ; Cathy Luo ; James Cao > > > ; Ganapathi Bhat ; Ellie Reeves > > > > > > Subject: [EXT] Re: [PATCH] mwifiex: cancel pcie/sdio work in > > > remove/shutdown handler > > > > > > External Email > > > > > > -------------------------------------------------------------------- > > > -- Brian Norris writes: > > > > > > >> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c > > > >> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c > > > >> @@ -310,6 +310,8 @@ static void mwifiex_pcie_remove(struct > > > >> pci_dev > > > *pdev) > > > >> mwifiex_init_shutdown_fw(priv, MWIFIEX_FUNC_SHUTDOWN); > > > >> } > > > >> > > > >> + cancel_work_sync(&card->work); > > > >> + > > > > > > > > Just FYI, this "fix" is not a real fix. It will likely paper over > > > > some of your bugs (where, e.g., the FW shutdown command times out > > > > in the previous couple of lines), but this highlights the fact > > > > that there are other races that could trigger the same behavior. > > > > You're not fixing those. > > > > > > > > For example, what if somebody initiates a scan or other nl80211 > > > > command between the above line and mwifiex_remove_card()? That > > > command > > > > could potentially time out too. > > > > > > > > The hardware status have been reset before downloading the last > > command(FUNC SHUTDOWN), in this way, follow commands download will > be > > ignored and warned. > > Hmm, I suppose that's true. So the race I'm talking about probably can't > happen usually. What about in manufacturing mode > or !FIRMWARE_READY_PCIE though? Those cases don't shut down the > firmware. Can we still have outstanding timeouts in those cases? > > Anyway, I still think there's a problem though, and this patch is just going to > make things worse. See below. > > > > > The proper fix would be to institute some kind of mutual exclusion > > > > (locking) between mwifiex_shutdown_sw() and mwifiex_remove_card(), > > > > so that they can't occur at the same time. > > > > > > > > I am not sure whether there is any mutual exclusion protect between > pcie_reset and pcie_remove in pcie core. > > But it looks a different race. > > We still need this fix, right? > > Good point. Previously, there wasn't any such exclusion, and that's why races > like the above were even more likely. But as of 4.13, now there > *is* exclusion. See commit b014e96d1abb ("PCI: Protect > pci_error_handlers->reset_notify() usage with device_lock()"). That incidentally > means that you're creating a deadlock with this patch! [1] > > If we start a timeout/reset sequence in mwifiex_init_shutdown_fw() (called > from remove()), then you'll eventually have pci_reset_function() waiting on the > device lock, but mwifiex_pcie_remove() will be holding the device lock already, > and now (with your patch), remove() will be waiting on the worker thread to > finish pci_reset_function()...deadlock! > > I actually think that the above patch (adding device_lock()) resolves most of the > race but introduces a possible deadlock. I believe an easy solution is just to > switch to pci_try_reset_function() instead? That will just abort a reset attempt > if we're in the middle of removing the device. Problem solved? Diff appended, > but I'll send out a real version if that looks right. Can you test your original > problem with the above commit from Christopher, as well as the appended > diff? > Since I don't have the original customer platform, which is 4.1 based. Just run the stress test on my 4.14, with commands: while true; do rmmod mwifiex;insmod $mwd/mwifiex.ko; sleep 1; insmod $mwd/mwifiex_pcie.ko; sleep 1; rmmod mwifiex_pcie & echo "1" > /sys/kernel/debug/mwifiex/mlan0/reset;done & Till now, I didn't hit deadlock with/without below diff, though it looks finally will be reached, according to the above explanation. I guess, maybe rmmod operation is slowly then sysfs operation. > > Regards, > > Simon > > > > Unfortunately, I only paid attention to this after Kalle already > > > > applied this patch. Personally, I'd prefer this patch not get > > > > applied, since it's a bad solution to an obvious problem, which > > > > instead leaves a subtle problem that perhaps no one will bother fixing. > > > > > > I can revert it, that's not a problem. Can I use the text below as > > > explanation for the revert? > > > > > > -------------------------------------------------------------------- > > > -- Brian Norris says: > > > > > > Just FYI, this "fix" is not a real fix. It will likely paper over > > > some of your bugs (where, e.g., the FW shutdown command times out in > > > the previous couple of lines), but this highlights the fact that > > > there are other races that could trigger the same behavior. You're not fixing > those. > > > > > > For example, what if somebody initiates a scan or other nl80211 > > > command between the above line and mwifiex_remove_card()? That > > > command could potentially time out too. > > > > > > The proper fix would be to institute some kind of mutual exclusion > > > (locking) between mwifiex_shutdown_sw() and mwifiex_remove_card(), > > > so that they can't occur at the same time. > > > > > > -------------------------------------------------------------------- > > > -- > > Hi Kalle, thanks for the above. With the new information from above, I think > that there's a more accurate description, like the following: > > --- > Brian Norris says: > > The "fix" in question might not actually fix all related problems, and it also looks > like it can cause a deadlock. Since commit b014e96d1abb ("PCI: Protect > pci_error_handlers->reset_notify() usage with device_lock()"), the race in > question is actually resolved (PCIe reset cannot happen at the same time as > remove()). Instead, this "fix" > just introduces a deadlock where mwifiex_pcie_card_reset_work() is waiting on > device_lock, which is held by PCIe device remove(), which is waiting > on...mwifiex_pcie_card_reset_work(). > > The proper thing to do is just to fix the deadlock. Patch for this will come > separately. > --- > > Brian > > [1] Technically, the deadlock is already there, since > mwifiex_remove_card() eventually calls pcie.c's ->cleanup_if(), which also calls > cancel_work_sync(). But your patch doesn't help... > > Untested diff: > > diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c > b/drivers/net/wireless/marvell/mwifiex/pcie.c > index cd314946452c..5da3d6ccf5f2 100644 > --- a/drivers/net/wireless/marvell/mwifiex/pcie.c > +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c > @@ -2781,7 +2781,7 @@ static void mwifiex_pcie_card_reset_work(struct > mwifiex_adapter *adapter) { > struct pcie_service_card *card = adapter->card; > > - pci_reset_function(card->dev); > + pci_try_reset_function(card->dev); > } > > static void mwifiex_pcie_work(struct work_struct *work)