2021-09-14 19:34:56

by Linus Lüssing

[permalink] [raw]
Subject: [PATCH 0/3] ath9k: interrupt fixes on queue reset

Hi,

The following are two patches for ath9k to fix a potential interrupt
storm (PATCH 2/3) and to fix potentially resetting the wifi chip while
its interrupts were accidentally reenabled (PATCH 3/3).

PATCH 1/3 adds the possibility to trigger the ath9k queue reset through
the ath9k reset file in debugfs. Which was helpful to reproduce and debug
this issue and might help for future debugging.

PATCH 2/3 and PATCH 3/3 should be applicable for stable.

Regards, Linus


2021-09-14 19:35:10

by Linus Lüssing

[permalink] [raw]
Subject: [PATCH 2/3] ath9k: Fix potential interrupt storm on queue reset

From: Linus Lüssing <[email protected]>

In tests with two Lima boards from 8devices (QCA4531 based) on OpenWrt
19.07 we could force a silent restart of a device with no serial
output when we were sending a high amount of UDP traffic (iperf3 at 80
MBit/s in both directions from external hosts, saturating the wifi and
causing a load of about 4.5 to 6) and were then triggering an
ath9k_queue_reset().

Further debugging showed that the restart was caused by the ath79
watchdog. With disabled watchdog we could observe that the device was
constantly going into ath_isr() interrupt handler and was returning
early after the ATH_OP_HW_RESET flag test, without clearing any
interrupts. Even though ath9k_queue_reset() calls
ath9k_hw_kill_interrupts().

With JTAG we could observe the following race condition:

1) ath9k_queue_reset()
...
-> ath9k_hw_kill_interrupts()
-> set_bit(ATH_OP_HW_RESET, &common->op_flags);
...
<- returns

2) ath9k_tasklet()
...
-> ath9k_hw_resume_interrupts()
...
<- returns

3) loops around:
...
handle_int()
-> ath_isr()
...
-> if (test_bit(ATH_OP_HW_RESET,
&common->op_flags))
return IRQ_HANDLED;

x) ath_reset_internal():
=> never reached <=

And in ath_isr() we would typically see the following interrupts /
interrupt causes:

* status: 0x00111030 or 0x00110030
* async_cause: 2 (AR_INTR_MAC_IPQ)
* sync_cause: 0

So the ath9k_tasklet() reenables the ath9k interrupts
through ath9k_hw_resume_interrupts() which ath9k_queue_reset() had just
disabled. And ath_isr() then keeps firing because it returns IRQ_HANDLED
without actually clearing the interrupt.

To fix this IRQ storm also clear/disable the interrupts again when we
are in reset state.

Cc: Sven Eckelmann <[email protected]>
Cc: Simon Wunderlich <[email protected]>
Cc: Linus Lüssing <[email protected]>
Fixes: 872b5d814f99 ("ath9k: do not access hardware on IRQs during reset")
Signed-off-by: Linus Lüssing <[email protected]>
---
drivers/net/wireless/ath/ath9k/main.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/ath/ath9k/main.c b/drivers/net/wireless/ath/ath9k/main.c
index 139831539da3..98090e40e1cf 100644
--- a/drivers/net/wireless/ath/ath9k/main.c
+++ b/drivers/net/wireless/ath/ath9k/main.c
@@ -533,8 +533,10 @@ irqreturn_t ath_isr(int irq, void *dev)
ath9k_debug_sync_cause(sc, sync_cause);
status &= ah->imask; /* discard unasked-for bits */

- if (test_bit(ATH_OP_HW_RESET, &common->op_flags))
+ if (test_bit(ATH_OP_HW_RESET, &common->op_flags)) {
+ ath9k_hw_kill_interrupts(sc->sc_ah);
return IRQ_HANDLED;
+ }

/*
* If there are no status bits set, then this interrupt was not
--
2.31.0

2021-09-14 19:54:02

by Toke Høiland-Jørgensen

[permalink] [raw]
Subject: Re: [PATCH 0/3] ath9k: interrupt fixes on queue reset

Linus Lüssing <[email protected]> writes:

> Hi,
>
> The following are two patches for ath9k to fix a potential interrupt
> storm (PATCH 2/3) and to fix potentially resetting the wifi chip while
> its interrupts were accidentally reenabled (PATCH 3/3).

Uhh, interesting - nice debugging work! What's the user-level symptom of
this? I.e., when this triggers does the device just appear to hang, or
does it cause reboots, or?

-Toke

2021-09-15 09:24:25

by Linus Lüssing

[permalink] [raw]
Subject: Re: [PATCH 0/3] ath9k: interrupt fixes on queue reset

Hi Toke,

On Tue, Sep 14, 2021 at 09:53:34PM +0200, Toke Høiland-Jørgensen wrote:
> Linus Lüssing <[email protected]> writes:
>
> > Hi,
> >
> > The following are two patches for ath9k to fix a potential interrupt
> > storm (PATCH 2/3) and to fix potentially resetting the wifi chip while
> > its interrupts were accidentally reenabled (PATCH 3/3).
>
> Uhh, interesting - nice debugging work! What's the user-level symptom of
> this? I.e., when this triggers does the device just appear to hang, or
> does it cause reboots, or?
>
> -Toke
>

For PATCH 2/3 the user-level symptom was that the system would
hang for a few seconds and would then silently reboot without any notice
on the serial console. And after disabling CONFIG_ATH79_WDT the
system would "hang" indefinitely without any notice on the console
without a reboot (while JTAG/gdb showed that it was entering ath_isr()
again and again and wasn't doing anything else).

For PATCH 3/3 I don't have a specific user-level symptom. But from
looking at the git history it seemed to me that the ath9k hw
interrupts (AR_IER, AR_INTR_ASYNC_ENABLE and AR_INTR_ASYNC_ENABLE off)
should be disabled during a reset:

4668cce527ac ath9k: disable the tasklet before taking the PCU lock
eaf04a691566 ath9k: Disable beacon tasklet during reset
872b5d814f99 ath9k: do not access hardware on IRQs during reset
e3f31175a3ee ath9k: fix race condition in irq processing during hardware reset

Maybe someone else on these lists might know what issues this can
cause exactly?

Regards, Linus

2021-10-05 14:27:15

by Kalle Valo

[permalink] [raw]
Subject: Re: [PATCH 0/3] ath9k: interrupt fixes on queue reset

Linus Lüssing <[email protected]> writes:

> On Tue, Sep 14, 2021 at 09:25:12PM +0200, Linus Lüssing wrote:
>> Hi,
>>
>> The following are two patches for ath9k to fix a potential interrupt
>> storm (PATCH 2/3) and to fix potentially resetting the wifi chip while
>> its interrupts were accidentally reenabled (PATCH 3/3).
>>
>> PATCH 1/3 adds the possibility to trigger the ath9k queue reset through
>> the ath9k reset file in debugfs. Which was helpful to reproduce and debug
>> this issue and might help for future debugging.
>>
>> PATCH 2/3 and PATCH 3/3 should be applicable for stable.
>>
>> Regards, Linus
>>
>
> I've marked PATCH 3/3 as "rejected" in Patchwork now due to
> Felix's legitimate remarks.

BTW I prefer to mark patches as rejected myself in patchwork so that I
know what's happening (patchwork is lacking in this respect as it
doesn't notify me if there are any changes in patches). But good that
you mentioned this via email so I didn't need to wonder what happened.

> For patches 1/3 and and 2/3 I'd still like to see them merged upstream
> if there is no objection to those.

Thanks, I was about to ask what I should do with this patchset.

--
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches