2012-05-30 16:10:31

by Mohammed Shafi Shajakhan

[permalink] [raw]
Subject: [RFC] ath9k: Fix softlockup in AR9485

From: Mohammed Shafi Shajakhan <[email protected]>

steps to recreate:
load latest ath9k driver with AR9485
stop the network-manager and wpa_supplicant
bring the interface up

Call Trace:
[<ffffffffa0517490>] ? ath_hw_check+0xe0/0xe0 [ath9k]
[<ffffffff812cd1e8>] __const_udelay+0x28/0x30
[<ffffffffa03bae7a>] ar9003_get_pll_sqsum_dvc+0x4a/0x80 [ath9k_hw]
[<ffffffffa05174eb>] ath_hw_pll_work+0x5b/0xe0 [ath9k]
[<ffffffff810744fe>] process_one_work+0x11e/0x470
[<ffffffff8107530f>] worker_thread+0x15f/0x360
[<ffffffff810751b0>] ? manage_workers+0x230/0x230
[<ffffffff81079af3>] kthread+0x93/0xa0
[<ffffffff815fd3a4>] kernel_thread_helper+0x4/0x10
[<ffffffff81079a60>] ? kthread_freezable_should_stop+0x70/0x70
[<ffffffff815fd3a0>] ? gs_change+0x13/0x13

before the STA is associated PLL4(0x1618c) always seem to dump
zero, which causes a softlockup as we keep on polling infinitely
PLL4's 8th bit. Once PLL4'S 8th bit is set indicates we can take
PLL3(0x16188) readings which tells us whether PLL is locked or not.
if the PLL is not locked we have a rx hang.
It does not looks safe to poll PLL4 before the STA is associated.
so it is safe to check this rx hang after association, where we might
have a possibility of rx hang under stress testing. digging into the
PLL stuff did not yield anything much better.

previously the register dumped 0xdeadbeef and the hw_pll_work
itself is cancelled by ath_radio_disable(obselete)

fixes https://bugzilla.redhat.com/show_bug.cgi?id=811142

Reported-by: Rolf Offermanns <[email protected]>
Cc: [email protected]
Signed-off-by: Mohammed Shafi Shajakhan <[email protected]>
---
drivers/net/wireless/ath/ath9k/main.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/net/wireless/ath/ath9k/main.c b/drivers/net/wireless/ath/ath9k/main.c
index dfa78e8..a03ad3e 100644
--- a/drivers/net/wireless/ath/ath9k/main.c
+++ b/drivers/net/wireless/ath/ath9k/main.c
@@ -970,6 +970,9 @@ void ath_hw_pll_work(struct work_struct *work)
hw_pll_work.work);
u32 pll_sqsum;

+ if (!(sc->sc_flags & SC_OP_PRIM_STA_VIF))
+ return;
+
if (AR_SREV_9485(sc->sc_ah)) {

ath9k_ps_wakeup(sc);
--
1.7.0.4



2012-05-31 05:34:38

by Mohammed Shafi Shajakhan

[permalink] [raw]
Subject: Re: [ath9k-devel] [RFC] ath9k: Fix softlockup in AR9485

Hi Sujith,

On Thursday 31 May 2012 08:14 AM, Sujith Manoharan wrote:
> Mohammed Shafi Shajakhan wrote:
>> before the STA is associated PLL4(0x1618c) always seem to dump
>> zero, which causes a softlockup as we keep on polling infinitely
>> PLL4's 8th bit. Once PLL4'S 8th bit is set indicates we can take
>> PLL3(0x16188) readings which tells us whether PLL is locked or not.
>> if the PLL is not locked we have a rx hang.
>> It does not looks safe to poll PLL4 before the STA is associated.
>> so it is safe to check this rx hang after association, where we might
>> have a possibility of rx hang under stress testing. digging into the
>> PLL stuff did not yield anything much better.
>>
>> previously the register dumped 0xdeadbeef and the hw_pll_work
>> itself is cancelled by ath_radio_disable(obselete)
>>
>> fixes https://bugzilla.redhat.com/show_bug.cgi?id=811142
>>
>> Reported-by: Rolf Offermanns<[email protected]>
>> Cc: [email protected]
>> Signed-off-by: Mohammed Shafi Shajakhan<[email protected]>
>> ---
>> drivers/net/wireless/ath/ath9k/main.c | 3 +++
>> 1 files changed, 3 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/net/wireless/ath/ath9k/main.c b/drivers/net/wireless/ath/ath9k/main.c
>> index dfa78e8..a03ad3e 100644
>> --- a/drivers/net/wireless/ath/ath9k/main.c
>> +++ b/drivers/net/wireless/ath/ath9k/main.c
>> @@ -970,6 +970,9 @@ void ath_hw_pll_work(struct work_struct *work)
>> hw_pll_work.work);
>> u32 pll_sqsum;
>>
>> + if (!(sc->sc_flags& SC_OP_PRIM_STA_VIF))
>> + return;
>> +
>> if (AR_SREV_9485(sc->sc_ah)) {
>>
>> ath9k_ps_wakeup(sc);
>
>
> All this is really racy, I think. In the various callsites where we issue a HW
> reset by queuing a work, there is no guarantee that the caller would not be
> invoked once again before the reset work has had a chance to kick in. And there is
> nothing that prevents the various poll routines from checking the HW while a
> reset in progress.

should we check if (work_pending(&sc->hw_reset_work)) like we do in
ath_beacon_tasklet in all the other work routine that access hardware
registers in case the hw_reset_work is not cancelled(possible in
ath_set_channel)


>
> And why are we using HZ for the pll_work ?

don't know, i put a printk to see the routine seems to get executed for
every 200ms(HZ/5). scan work in mac80211 seems to use HZ
ieee80211_queue_delayed_work(&local->hw, &local->scan_work, next_delay);


>
> Sujith


--
thanks,
shafi

2012-05-31 05:53:02

by Mohammed Shafi Shajakhan

[permalink] [raw]
Subject: Re: [ath9k-devel] [RFC] ath9k: Fix softlockup in AR9485

On Thursday 31 May 2012 11:11 AM, Sujith Manoharan wrote:
> Mohammed Shafi Shajakhan wrote:
>> should we check if (work_pending(&sc->hw_reset_work)) like we do in
>> ath_beacon_tasklet in all the other work routine that access hardware
>> registers in case the hw_reset_work is not cancelled(possible in
>> ath_set_channel)
>
> Probably, but this needs to be done in a clean manner.

ok.

>
>> don't know, i put a printk to see the routine seems to get executed for
>> every 200ms(HZ/5). scan work in mac80211 seems to use HZ
>> ieee80211_queue_delayed_work(&local->hw,&local->scan_work, next_delay);
>
> HZ can be configured to a user-defined value, so we need to make sure
> that the timing for the PLL work is what we want.

ok, then we will change it.

>
> Sujith


--
thanks,
shafi

2012-05-31 02:45:44

by Sujith Manoharan

[permalink] [raw]
Subject: [ath9k-devel] [RFC] ath9k: Fix softlockup in AR9485

Mohammed Shafi Shajakhan wrote:
> before the STA is associated PLL4(0x1618c) always seem to dump
> zero, which causes a softlockup as we keep on polling infinitely
> PLL4's 8th bit. Once PLL4'S 8th bit is set indicates we can take
> PLL3(0x16188) readings which tells us whether PLL is locked or not.
> if the PLL is not locked we have a rx hang.
> It does not looks safe to poll PLL4 before the STA is associated.
> so it is safe to check this rx hang after association, where we might
> have a possibility of rx hang under stress testing. digging into the
> PLL stuff did not yield anything much better.
>
> previously the register dumped 0xdeadbeef and the hw_pll_work
> itself is cancelled by ath_radio_disable(obselete)
>
> fixes https://bugzilla.redhat.com/show_bug.cgi?id=811142
>
> Reported-by: Rolf Offermanns <[email protected]>
> Cc: [email protected]
> Signed-off-by: Mohammed Shafi Shajakhan <[email protected]>
> ---
> drivers/net/wireless/ath/ath9k/main.c | 3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/ath9k/main.c b/drivers/net/wireless/ath/ath9k/main.c
> index dfa78e8..a03ad3e 100644
> --- a/drivers/net/wireless/ath/ath9k/main.c
> +++ b/drivers/net/wireless/ath/ath9k/main.c
> @@ -970,6 +970,9 @@ void ath_hw_pll_work(struct work_struct *work)
> hw_pll_work.work);
> u32 pll_sqsum;
>
> + if (!(sc->sc_flags & SC_OP_PRIM_STA_VIF))
> + return;
> +
> if (AR_SREV_9485(sc->sc_ah)) {
>
> ath9k_ps_wakeup(sc);


All this is really racy, I think. In the various callsites where we issue a HW
reset by queuing a work, there is no guarantee that the caller would not be
invoked once again before the reset work has had a chance to kick in. And there is
nothing that prevents the various poll routines from checking the HW while a
reset in progress.

And why are we using HZ for the pll_work ?

Sujith

2012-05-30 18:48:21

by Adrian Chadd

[permalink] [raw]
Subject: Re: [RFC] ath9k: Fix softlockup in AR9485

(Yes, this is a public response.)

Hi,

That has me a bit concerned. Waiting for the STA to be associated is
really just some kind of "settling" delay. You're still TX/RXing at
that point.

I think we need to speak to the baseband/clock teams and figure out
why this is actually occuring.

Can we please leave this out of the tree (but in the bug report) until
this gets root caused internally?

Thanks,



adrian

2012-05-31 04:49:08

by Mohammed Shafi Shajakhan

[permalink] [raw]
Subject: Re: [RFC] ath9k: Fix softlockup in AR9485

Hi Adrian,

On Thursday 31 May 2012 12:18 AM, Adrian Chadd wrote:
> (Yes, this is a public response.)
>
> Hi,
>
> That has me a bit concerned. Waiting for the STA to be associated is
> really just some kind of "settling" delay. You're still TX/RXing at
> that point.

yeah, but i am pretty sure in the STA mode we have this PLL registers
having sane values once the STA is associated. i checked if i am missing
hardware code/init_pll routine or if its actually overwritten by INI
values, but no. i also checked this issue with IBSS and monitor mode
even under shield room, still it did not show much improvements or more
clue.

>
> I think we need to speak to the baseband/clock teams and figure out
> why this is actually occuring.

oh yes and i had been already sending some mails internally, checking
with people who might have good knowledge about this.

>
> Can we please leave this out of the tree (but in the bug report) until
> this gets root caused internally?

one thing i observed was this rx hang detection is based on the 'delta
of rx unicast packets' for a particular time. if its lesser than 5 this
rx hang detection kicks off. another system expert told me this rx hang
itself can be specific to AP rather than STA.

finally i would check with one system expert and another Engineer to get
some thoughts, then send it as a PATCH.


>
> Thanks,
>
>
>
> adrian


--
thanks,
shafi

2012-05-31 05:43:10

by Sujith Manoharan

[permalink] [raw]
Subject: Re: [ath9k-devel] [RFC] ath9k: Fix softlockup in AR9485

Mohammed Shafi Shajakhan wrote:
> should we check if (work_pending(&sc->hw_reset_work)) like we do in
> ath_beacon_tasklet in all the other work routine that access hardware
> registers in case the hw_reset_work is not cancelled(possible in
> ath_set_channel)

Probably, but this needs to be done in a clean manner.

> don't know, i put a printk to see the routine seems to get executed for
> every 200ms(HZ/5). scan work in mac80211 seems to use HZ
> ieee80211_queue_delayed_work(&local->hw, &local->scan_work, next_delay);

HZ can be configured to a user-defined value, so we need to make sure
that the timing for the PLL work is what we want.

Sujith

2012-06-08 12:17:39

by Sujith Manoharan

[permalink] [raw]
Subject: Re: [ath9k-devel] [RFC] ath9k: Fix softlockup in AR9485

Mohammed Shafi Shajakhan wrote:
> > HZ can be configured to a user-defined value, so we need to make sure
> > that the timing for the PLL work is what we want.
>
> ok, then we will change it.

Shafi, will you be sending an updated version of this patch ?
AR9485 is unusable currently - just bringing up the interface is
enough to hang the machine...

Sujith

2012-06-11 05:03:41

by Mohammed Shafi Shajakhan

[permalink] [raw]
Subject: Re: [ath9k-devel] [RFC] ath9k: Fix softlockup in AR9485

Hi Sujith,

On Friday 08 June 2012 05:46 PM, Sujith Manoharan wrote:
> Mohammed Shafi Shajakhan wrote:
>>> HZ can be configured to a user-defined value, so we need to make sure
>>> that the timing for the PLL work is what we want.
>>
>> ok, then we will change it.
>
> Shafi, will you be sending an updated version of this patch ?
> AR9485 is unusable currently - just bringing up the interface is
> enough to hang the machine...
>

i will send an updated version based on the latest wireless-testing
tree. thanks.


--
thanks,
shafi