Hi Brian,
> -----Original Message-----
> From: Brian Norris [mailto:[email protected]]
> Sent: 2018年1月10日 6:01
> To: Xinming Hu <[email protected]>; Kalle Valo <[email protected]>
> Cc: Kalle Valo <[email protected]>; Linux Wireless
> <[email protected]>; Dmitry Torokhov <[email protected]>;
> [email protected]; Zhiyuan Yang <[email protected]>; Tim Song
> <[email protected]>; Cathy Luo <[email protected]>; James Cao
> <[email protected]>; Ganapathi Bhat <[email protected]>; Ellie Reeves
> <[email protected]>; Christoph Hellwig <[email protected]>
> Subject: [EXT] Re: Re: [PATCH] mwifiex: cancel pcie/sdio work in
> remove/shutdown handler
>
> External Email
>
> ----------------------------------------------------------------------
> + Christopher
>
> Hi Simon and Kalle,
>
> On Tue, Jan 09, 2018 at 11:42:21AM +0000, Xinming Hu wrote:
> > Hi,
> >
> > > -----Original Message-----
> > > From: Kalle Valo [mailto:[email protected]]
> > > Sent: 2018年1月9日 15:40
> > > To: Brian Norris <[email protected]>
> > > Cc: Xinming Hu <[email protected]>; Linux Wireless
> > > <[email protected]>; Dmitry Torokhov <[email protected]>;
> > > [email protected]; Zhiyuan Yang <[email protected]>; Tim Song
> > > <[email protected]>; Cathy Luo <[email protected]>; James Cao
> > > <[email protected]>; Ganapathi Bhat <[email protected]>; Ellie Reeves
> > > <[email protected]>
> > > Subject: [EXT] Re: [PATCH] mwifiex: cancel pcie/sdio work in
> > > remove/shutdown handler
> > >
> > > External Email
> > >
> > > --------------------------------------------------------------------
> > > -- Brian Norris <[email protected]> writes:
> > >
> > > >> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
> > > >> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
> > > >> @@ -310,6 +310,8 @@ static void mwifiex_pcie_remove(struct
> > > >> pci_dev
> > > *pdev)
> > > >> mwifiex_init_shutdown_fw(priv, MWIFIEX_FUNC_SHUTDOWN);
> > > >> }
> > > >>
> > > >> + cancel_work_sync(&card->work);
> > > >> +
> > > >
> > > > Just FYI, this "fix" is not a real fix. It will likely paper over
> > > > some of your bugs (where, e.g., the FW shutdown command times out
> > > > in the previous couple of lines), but this highlights the fact
> > > > that there are other races that could trigger the same behavior.
> > > > You're not fixing those.
> > > >
> > > > For example, what if somebody initiates a scan or other nl80211
> > > > command between the above line and mwifiex_remove_card()? That
> > > command
> > > > could potentially time out too.
> > > >
> >
> > The hardware status have been reset before downloading the last
> > command(FUNC SHUTDOWN), in this way, follow commands download will
> be
> > ignored and warned.
>
> Hmm, I suppose that's true. So the race I'm talking about probably can't
> happen usually. What about in manufacturing mode
> or !FIRMWARE_READY_PCIE though? Those cases don't shut down the
> firmware. Can we still have outstanding timeouts in those cases?
>
> Anyway, I still think there's a problem though, and this patch is just going to
> make things worse. See below.
>
> > > > The proper fix would be to institute some kind of mutual exclusion
> > > > (locking) between mwifiex_shutdown_sw() and mwifiex_remove_card(),
> > > > so that they can't occur at the same time.
> > > >
> >
> > I am not sure whether there is any mutual exclusion protect between
> pcie_reset and pcie_remove in pcie core.
> > But it looks a different race.
> > We still need this fix, right?
>
> Good point. Previously, there wasn't any such exclusion, and that's why races
> like the above were even more likely. But as of 4.13, now there
> *is* exclusion. See commit b014e96d1abb ("PCI: Protect
> pci_error_handlers->reset_notify() usage with device_lock()"). That incidentally
> means that you're creating a deadlock with this patch! [1]
>
> If we start a timeout/reset sequence in mwifiex_init_shutdown_fw() (called
> from remove()), then you'll eventually have pci_reset_function() waiting on the
> device lock, but mwifiex_pcie_remove() will be holding the device lock already,
> and now (with your patch), remove() will be waiting on the worker thread to
> finish pci_reset_function()...deadlock!
>
> I actually think that the above patch (adding device_lock()) resolves most of the
> race but introduces a possible deadlock. I believe an easy solution is just to
> switch to pci_try_reset_function() instead? That will just abort a reset attempt
> if we're in the middle of removing the device. Problem solved? Diff appended,
> but I'll send out a real version if that looks right. Can you test your original
> problem with the above commit from Christopher, as well as the appended
> diff?
>
Since I don't have the original customer platform, which is 4.1 based.
Just run the stress test on my 4.14, with commands:
while true; do rmmod mwifiex;insmod $mwd/mwifiex.ko; sleep 1; insmod $mwd/mwifiex_pcie.ko; sleep 1; rmmod mwifiex_pcie & echo "1" > /sys/kernel/debug/mwifiex/mlan0/reset;done &
Till now, I didn't hit deadlock with/without below diff, though it looks finally will be reached, according to the above explanation.
I guess, maybe rmmod operation is slowly then sysfs operation.
> > Regards,
> > Simon
> > > > Unfortunately, I only paid attention to this after Kalle already
> > > > applied this patch. Personally, I'd prefer this patch not get
> > > > applied, since it's a bad solution to an obvious problem, which
> > > > instead leaves a subtle problem that perhaps no one will bother fixing.
> > >
> > > I can revert it, that's not a problem. Can I use the text below as
> > > explanation for the revert?
> > >
> > > --------------------------------------------------------------------
> > > -- Brian Norris <[email protected]> says:
> > >
> > > Just FYI, this "fix" is not a real fix. It will likely paper over
> > > some of your bugs (where, e.g., the FW shutdown command times out in
> > > the previous couple of lines), but this highlights the fact that
> > > there are other races that could trigger the same behavior. You're not fixing
> those.
> > >
> > > For example, what if somebody initiates a scan or other nl80211
> > > command between the above line and mwifiex_remove_card()? That
> > > command could potentially time out too.
> > >
> > > The proper fix would be to institute some kind of mutual exclusion
> > > (locking) between mwifiex_shutdown_sw() and mwifiex_remove_card(),
> > > so that they can't occur at the same time.
> > >
> > > --------------------------------------------------------------------
> > > --
>
> Hi Kalle, thanks for the above. With the new information from above, I think
> that there's a more accurate description, like the following:
>
> ---
> Brian Norris <[email protected]> says:
>
> The "fix" in question might not actually fix all related problems, and it also looks
> like it can cause a deadlock. Since commit b014e96d1abb ("PCI: Protect
> pci_error_handlers->reset_notify() usage with device_lock()"), the race in
> question is actually resolved (PCIe reset cannot happen at the same time as
> remove()). Instead, this "fix"
> just introduces a deadlock where mwifiex_pcie_card_reset_work() is waiting on
> device_lock, which is held by PCIe device remove(), which is waiting
> on...mwifiex_pcie_card_reset_work().
>
> The proper thing to do is just to fix the deadlock. Patch for this will come
> separately.
> ---
>
> Brian
>
> [1] Technically, the deadlock is already there, since
> mwifiex_remove_card() eventually calls pcie.c's ->cleanup_if(), which also calls
> cancel_work_sync(). But your patch doesn't help...
>
> Untested diff:
>
> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c
> b/drivers/net/wireless/marvell/mwifiex/pcie.c
> index cd314946452c..5da3d6ccf5f2 100644
> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
> @@ -2781,7 +2781,7 @@ static void mwifiex_pcie_card_reset_work(struct
> mwifiex_adapter *adapter) {
> struct pcie_service_card *card = adapter->card;
>
> - pci_reset_function(card->dev);
> + pci_try_reset_function(card->dev);
> }
>
> static void mwifiex_pcie_work(struct work_struct *work)
On Thu, Jan 11, 2018 at 06:25:09PM -0800, Brian Norris wrote:
> Anyway, I'll do my own testing and then submit my patch properly.
OK, so I definitely confirmed: if your patch does anything, it
introduces a new deadlock possibility. Just trigger a Wifi timeout or
reset from within remove(), and you'll see the work event get stuck in
pci_reset_function(), while remove() gets stuck at cancel_work_sync().
I also confirmed that my patch resolves this problem.
I'll send the revert + my patch now.
Brian
Brian Norris <[email protected]> writes:
> On Thu, Jan 11, 2018 at 06:25:09PM -0800, Brian Norris wrote:
>> Anyway, I'll do my own testing and then submit my patch properly.
>
> OK, so I definitely confirmed: if your patch does anything, it
> introduces a new deadlock possibility. Just trigger a Wifi timeout or
> reset from within remove(), and you'll see the work event get stuck in
> pci_reset_function(), while remove() gets stuck at cancel_work_sync().
>
> I also confirmed that my patch resolves this problem.
>
> I'll send the revert + my patch now.
Great, thanks. I didn't had a chance to do the revert yet but I'll now
apply your revert instead.
--
Kalle Valo
Hi Simon,
On Wed, Jan 10, 2018 at 12:30:35PM +0000, Xinming Hu wrote:
> > -----Original Message-----
> > From: Brian Norris [mailto:[email protected]]
> > On Tue, Jan 09, 2018 at 11:42:21AM +0000, Xinming Hu wrote:
> > > I am not sure whether there is any mutual exclusion protect between
> > pcie_reset and pcie_remove in pcie core.
> > > But it looks a different race.
> > > We still need this fix, right?
> >
> > Good point. Previously, there wasn't any such exclusion, and that's why races
> > like the above were even more likely. But as of 4.13, now there
> > *is* exclusion. See commit b014e96d1abb ("PCI: Protect
> > pci_error_handlers->reset_notify() usage with device_lock()"). That incidentally
> > means that you're creating a deadlock with this patch! [1]
> >
> > If we start a timeout/reset sequence in mwifiex_init_shutdown_fw() (called
> > from remove()), then you'll eventually have pci_reset_function() waiting on the
> > device lock, but mwifiex_pcie_remove() will be holding the device lock already,
> > and now (with your patch), remove() will be waiting on the worker thread to
> > finish pci_reset_function()...deadlock!
> >
> > I actually think that the above patch (adding device_lock()) resolves most of the
> > race but introduces a possible deadlock. I believe an easy solution is just to
> > switch to pci_try_reset_function() instead? That will just abort a reset attempt
> > if we're in the middle of removing the device. Problem solved? Diff appended,
> > but I'll send out a real version if that looks right. Can you test your original
> > problem with the above commit from Christopher, as well as the appended
> > diff?
> >
>
> Since I don't have the original customer platform, which is 4.1 based.
You could suggest your customer try the aforementioned commit + my
patch. It will likely get them a more reliable result in the end.
> Just run the stress test on my 4.14, with commands:
> while true; do rmmod mwifiex;insmod $mwd/mwifiex.ko; sleep 1; insmod $mwd/mwifiex_pcie.ko; sleep 1; rmmod mwifiex_pcie & echo "1" > /sys/kernel/debug/mwifiex/mlan0/reset;done &
That's actually not much of a good test, since the mutual exclusion of
reset and remove will mean that 'rmmod' and 'echo 1 > ... /reset' won't
really be doing anything significant in parallel.
What you'd really need to do is fake a timeout from within remove().
e.g., you could do
adapter->if_ops.card_reset(adapter);
plus some sleep (to let the work event get started) followed by the
cancel_work_sync() of your patch.
>
> Till now, I didn't hit deadlock with/without below diff, though it
> looks finally will be reached, according to the above explanation. I
> guess, maybe rmmod operation is slowly then sysfs operation.
Per my above explanation, you won't hit the deadlock like that.
Anyway, I'll do my own testing and then submit my patch properly.
Brian