2018-02-28 01:22:53

by Ben Greear

[permalink] [raw]
Subject: [RFC] ath10k: Attempt to work around napi_synchronize hang.

From: Ben Greear <[email protected]>

Calling napi_disable twice in a row (w/out starting it and/or without
having NAPI active leads to deadlock because napi_disable sets
NAPI_STATE_SCHED and NAPI_STATE_NPSVC when it returns, as far as I
can tell. So, guard this call to napi_disable. I believe the
failure case is something like this:
rmmod ath10k_pci ath10k_core
Firmware crashes before hif_stop is called by the rmmod path
The crash handling logic calls hif_stop
Then rmmod gets around to calling hif_stop, but spins endlessly
in napi_synchronize.

I think one way this could happen is that ath10k_stop checks
for state != ATH10K_STATE_OFF, but STATE_RESTARTING is also
a possibility. That might be how we can have hif_stop called twice
without a hif_start in between. --Ben

Signed-off-by: Ben Greear <[email protected]>
---
drivers/net/wireless/ath/ath10k/core.h | 1 +
drivers/net/wireless/ath/ath10k/pci.c | 25 +++++++++++++++++++++++--
2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/core.h b/drivers/net/wireless/ath/ath10k/core.h
index 72b4495..c7ba49f 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -1205,6 +1205,7 @@ struct ath10k {
/* NAPI */
struct net_device napi_dev;
struct napi_struct napi;
+ bool napi_enabled;

struct work_struct stop_scan_work;

diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 398e413..9131e15 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -1956,6 +1956,7 @@ static int ath10k_pci_hif_start(struct ath10k *ar)
ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif start\n");

napi_enable(&ar->napi);
+ ar->napi_enabled = true;

ath10k_pci_irq_enable(ar);
ath10k_pci_rx_post(ar);
@@ -2086,8 +2087,28 @@ static void ath10k_pci_hif_stop(struct ath10k *ar)
ath10k_pci_irq_disable(ar);
ath10k_pci_irq_sync(ar);
ath10k_pci_flush(ar);
- napi_synchronize(&ar->napi);
- napi_disable(&ar->napi);
+
+ /* Calling napi_disable twice in a row (w/out starting it and/or without
+ * having NAPI active leads to deadlock because napi_disable sets
+ * NAPI_STATE_SCHED and NAPI_STATE_NPSVC when it returns, as far as I
+ * can tell. So, guard this call to napi_disable. I believe the
+ * failure case is something like this:
+ * rmmod ath10k_pci ath10k_core
+ * Firmware crashes before hif_stop is called by the rmmod path
+ * The crash handling logic calls hif_stop
+ * Then rmmod gets around to calling hif_stop, but spins endlessly
+ * in napi_synchronize.
+ *
+ * I think one way this could happen is that ath10k_stop checks
+ * for state != ATH10K_STATE_OFF, but STATE_RESTARTING is also
+ * a possibility. That might be how we can have hif_stop called twice
+ * without a hif_start in between. --Ben
+ */
+ if (ar->napi_enabled) {
+ napi_synchronize(&ar->napi);
+ napi_disable(&ar->napi);
+ ar->napi_enabled = false;
+ }

spin_lock_irqsave(&ar_pci->ps_lock, flags);
WARN_ON(ar_pci->ps_wake_refcount > 0);
--
2.4.11


2018-02-28 18:05:06

by Ben Greear

[permalink] [raw]
Subject: Re: [RFC] ath10k: Attempt to work around napi_synchronize hang.

On 02/28/2018 09:31 AM, Michał Kazior wrote:
> On 28 February 2018 at 02:22, <[email protected]> wrote:
> [...]
>> @@ -2086,8 +2087,28 @@ static void ath10k_pci_hif_stop(struct ath10k *ar)
>> ath10k_pci_irq_disable(ar);
>> ath10k_pci_irq_sync(ar);
>> ath10k_pci_flush(ar);
>> - napi_synchronize(&ar->napi);
>> - napi_disable(&ar->napi);
>> +
>> + /* Calling napi_disable twice in a row (w/out starting it and/or without
>> + * having NAPI active leads to deadlock because napi_disable sets
>> + * NAPI_STATE_SCHED and NAPI_STATE_NPSVC when it returns, as far as I
>> + * can tell. So, guard this call to napi_disable. I believe the
>> + * failure case is something like this:
>> + * rmmod ath10k_pci ath10k_core
>> + * Firmware crashes before hif_stop is called by the rmmod path
>> + * The crash handling logic calls hif_stop
>> + * Then rmmod gets around to calling hif_stop, but spins endlessly
>> + * in napi_synchronize.
>> + *
>> + * I think one way this could happen is that ath10k_stop checks
>> + * for state != ATH10K_STATE_OFF, but STATE_RESTARTING is also
>> + * a possibility. That might be how we can have hif_stop called twice
>> + * without a hif_start in between. --Ben
>> + */
>> + if (ar->napi_enabled) {
>> + napi_synchronize(&ar->napi);
>> + napi_disable(&ar->napi);
>> + ar->napi_enabled = false;
>> + }
>
> Looking at the code and your comment the described fail case seems
> legit. I would consider tuning ath10k_stop() so that it calls
> ath10k_halt() only if ar->state != OFF & ar->state != RESTARTING
> though. Or did you try already?

I did not try tuning ath10k_stop(). The code in this area is quite complex,
and in my opinion trying to keep the start/stop calls exactly matched without
individual 'has_started' flags seems ripe for bugs.

> While your approach will probably work it won't prevent other non-NAPI
> bad things from happening. Even if there's nothing today it might
> creep up in the future. And you'd need to update ahb.c too.

I'll update ahb.c to match.

Thanks,
Ben

>
>
> Michał
>


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2018-02-28 17:31:31

by Michal Kazior

[permalink] [raw]
Subject: Re: [RFC] ath10k: Attempt to work around napi_synchronize hang.

On 28 February 2018 at 02:22, <[email protected]> wrote:
[...]
> @@ -2086,8 +2087,28 @@ static void ath10k_pci_hif_stop(struct ath10k *ar)
> ath10k_pci_irq_disable(ar);
> ath10k_pci_irq_sync(ar);
> ath10k_pci_flush(ar);
> - napi_synchronize(&ar->napi);
> - napi_disable(&ar->napi);
> +
> + /* Calling napi_disable twice in a row (w/out starting it and/or =
without
> + * having NAPI active leads to deadlock because napi_disable sets
> + * NAPI_STATE_SCHED and NAPI_STATE_NPSVC when it returns, as far =
as I
> + * can tell. So, guard this call to napi_disable. I believe the
> + * failure case is something like this:
> + * rmmod ath10k_pci ath10k_core
> + * Firmware crashes before hif_stop is called by the rmmod path
> + * The crash handling logic calls hif_stop
> + * Then rmmod gets around to calling hif_stop, but spins endle=
ssly
> + * in napi_synchronize.
> + *
> + * I think one way this could happen is that ath10k_stop checks
> + * for state !=3D ATH10K_STATE_OFF, but STATE_RESTARTING is also
> + * a possibility. That might be how we can have hif_stop called=
twice
> + * without a hif_start in between. --Ben
> + */
> + if (ar->napi_enabled) {
> + napi_synchronize(&ar->napi);
> + napi_disable(&ar->napi);
> + ar->napi_enabled =3D false;
> + }

Looking at the code and your comment the described fail case seems
legit. I would consider tuning ath10k_stop() so that it calls
ath10k_halt() only if ar->state !=3D OFF & ar->state !=3D RESTARTING
though. Or did you try already?

While your approach will probably work it won't prevent other non-NAPI
bad things from happening. Even if there's nothing today it might
creep up in the future. And you'd need to update ahb.c too.


Micha=C5=82