2016-06-09 07:46:57

by A. Benz

Subject: ath10k/QCA9980 - Issues introduced in wireless testing 2016-05

Dear All,

I am using LEDE on my IPQ806x (QCA9980) system (Archer C2600).
With compat-wireless-2016-05-12, I observed traces attached below.
The router is unstable and eventually reboots by itself (randomly).

Upon reverting to compat-wireless-2016-01, the issue disappears. Nothing
else is changed (software-wise or hardware).
This was confirmed with other users.

A new compile with the fixes below:
https://git.lede-project.org/?p=lede/nbd/staging.git;a=commit;h=858e26f3c0fc11231f25497cbb2ddca1e5f101e0

Did not solve the problem.

Please let me know if I need to provide any further information.

------------[ cut here ]------------
WARNING: CPU: 0 PID: 558 at
compat-wireless-2016-05-12/net/mac80211/rx.c:4068
ieee80211_rx_napi+0x8c/0x8a4 [mac80211]()
Modules linked in: pppoe ppp_async iptable_nat pppox ppp_generic
nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT
ipt_MASQUERADE xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark
xt_mac xt_limit xt_id xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT
xt_LOG xt_CT slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4
nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache
nf_conntrack iptable_raw iptable_mangle iptable_f
Wed May 25 21:21:57 2016 kern.warn kernel: [24187.498347] CPU: 0 PID: 558 Comm: hostapd Tainted: G W 4.4.11 #2
Hardware name: Qualcomm (Flattened Device Tree)
[<c021ff34>] (unwind_backtrace) from [<c021cb9c>] (show_stack+0x10/0x14)
[<c021cb9c>] (show_stack) from [<c03a2218>] (dump_stack+0x88/0x9c)
[<c03a2218>] (dump_stack) from [<c0227adc>] (warn_slowpath_common+0x94/0xb0)
[<c0227adc>] (warn_slowpath_common) from [<c0227b94>]
(warn_slowpath_null+0x1c/0x24)
[<c0227b94>] (warn_slowpath_null) from [<bf19fa44>]
(ieee80211_rx_napi+0x8c/0x8a4 [mac80211])
[<bf19fa44>] (ieee80211_rx_napi [mac80211]) from [<bf20e9d0>]
(ath10k_htt_t2h_msg_handler+0x92c/0x988 [ath10k_core])
[<bf20e9d0>] (ath10k_htt_t2h_msg_handler [ath10k_core]) from
[<bf20f3e8>] (ath10k_htt_txrx_compl_task+0x9bc/0x117c [ath10k_core])
[<bf20f3e8>] (ath10k_htt_txrx_compl_task [ath10k_core]) from
[<c022b158>] (tasklet_action+0xb8/0x144)
[<c022b158>] (tasklet_action) from [<c022b32c>] (__do_softirq+0xe0/0x21c)
[<c022b32c>] (__do_softirq) from [<c022b4d4>] (do_softirq.part.2+0x28/0x30)
[<c022b4d4>] (do_softirq.part.2) from [<c022b590>]
(__local_bh_enable_ip+0xb4/0x104)
[<c022b590>] (__local_bh_enable_ip) from [<c0596684>]
(packet_poll+0xc0/0x100)
[<c0596684>] (packet_poll) from [<c04b1368>] (sock_poll+0xec/0xf8)
[<c04b1368>] (sock_poll) from [<c02e7fc4>] (do_select+0x2f8/0x62c)
[<c02e7fc4>] (do_select) from [<c02e858c>] (core_sys_select+0x294/0x424)
[<c02e858c>] (core_sys_select) from [<c02e8820>] (SyS_select+0x104/0x130)
[<c02e8820>] (SyS_select) from [<c0209c00>] (ret_fast_syscall+0x0/0x3c)
---[ end trace e55b94e0d302fcd8 ]---

Regards,
A. Benz


2016-06-10 09:21:00

by Johannes Berg

Subject: Re: ath10k/QCA9980 - Issues introduced in wireless testing 2016-05

On Fri, 2016-06-10 at 11:10 +0200, Michal Kazior wrote:

> Hmm.. could it be related to ath10k not fulfilling (some) NAPI's
> locking requirements and thus ending up with, e.g. linked-list
> mayhem?
>

Shouldn't matter since ath10k doesn't actually use rx_napi()?

johannes

2016-06-10 08:55:41

by Felix Fietkau

Subject: Re: ath10k/QCA9980 - Issues introduced in wireless testing 2016-05

On 2016-06-10 10:50, Michal Kazior wrote:
> On 9 June 2016 at 09:46, A. Benz <[email protected]> wrote:
>> Dear All,
>>
>> I am using LEDE on my IPQ806x (QCA9980) system (Archer C2600).
>> With compat-wireless-2016-05-12, I observed traces attached below.
>> The router is unstable and eventually reboots by itself (randomly).
>>
>> Upon reverting to compat-wireless-2016-01, the issue disappears. Nothing
>> else is changed (software-wise or hardware).
>> This was confirmed with other users.
>>
>> A new compile with the fixes below:
>> https://git.lede-project.org/?p=lede/nbd/staging.git;a=commit;h=858e26f3c0fc11231f25497cbb2ddca1e5f101e0
>>
>> Did not solve the problem.
>>
>> Please let me know if I need to provide any further information.
>>
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 558 at
>> compat-wireless-2016-05-12/net/mac80211/rx.c:4068
>> ieee80211_rx_napi+0x8c/0x8a4 [mac80211]()
>
> Can you post what is at rx.c line 4068 (and +/- 3 lines), please?
It's early in ieee80211_rx_napi:

	sband = local->hw.wiphy->bands[status->band];
	if (WARN_ON(!sband))
		goto drop;

I could not easily find a scenario under which status->band would not be
set properly by the driver, so my guess is there is some nasty memory
corruption going on.

FWIW, I've received several reports like this from different people on
different devices. They're also confirming that reverting to the
snapshot from January makes things stable again.

- Felix
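
The check quoted above can be illustrated with a small standalone model. This is only a sketch under stated assumptions: the toy_* names and types are hypothetical stand-ins, not the mac80211 definitions, and the idea that a stale or corrupted band value selects a band the device never registered is one possible reading of the warning, not a confirmed cause.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the mac80211 types. */
enum toy_band { TOY_BAND_2GHZ, TOY_BAND_5GHZ, TOY_NUM_BANDS };

struct toy_sband {
	int n_channels;
};

struct toy_rx_status {
	int band;	/* filled in by the driver for each received frame */
};

/*
 * Mimics the shape of the quoted check: look up the band table entry
 * selected by the rx status and drop the frame if there is none.
 * Returns 0 if the frame would be processed, -1 if it would be dropped.
 */
static int toy_rx(struct toy_sband *const bands[TOY_NUM_BANDS],
		  const struct toy_rx_status *status)
{
	const struct toy_sband *sband;

	/* Range check added for the sketch so the lookup stays in bounds. */
	if (status->band < 0 || status->band >= TOY_NUM_BANDS)
		return -1;

	sband = bands[status->band];
	if (!sband)
		return -1;	/* corresponds to the WARN_ON(!sband) path */

	return 0;
}
```

For a device whose table holds only a 2.4 GHz entry, a status claiming TOY_BAND_5GHZ takes the drop path even though the index itself is in range, which is the situation the warning reports.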

2016-06-10 13:16:24

by Ben Greear

Subject: Re: ath10k/QCA9980 - Issues introduced in wireless testing 2016-05



On 06/10/2016 05:57 AM, Felix Fietkau wrote:
> On 2016-06-10 14:52, Ben Greear wrote:
>> This looks a lot like the problems I was having.
>>
>> Two of these 5 patches recently made it upstream (but may not be in LEDE yet),
>> but the other patches also were related to memory corruption.
>>
>> See my patches posted on 4/1/16:
>>
>> https://patchwork.kernel.org/project/ath10k/list/
>>
>> I don't know where the 5/5 patch ended up.
> I had already asked affected users to test with those patches (I have a
> commit that adds them in my staging tree), but it did not resolve the issue.

Ok, must be something else then.

If you can run on x86 under KASAN it may provide some clues. That is how I eventually
made progress on the issues I was seeing. My rebase onto 3.7 has been slow and painful,
but I should be ready to start testing that sometime soon, maybe I can reproduce something
there.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2016-06-10 12:22:49

by Kalle Valo

Subject: Re: ath10k/QCA9980 - Issues introduced in wireless testing 2016-05

Felix Fietkau <[email protected]> writes:

> On 2016-06-10 10:50, Michal Kazior wrote:
>> On 9 June 2016 at 09:46, A. Benz <[email protected]> wrote:
>>> Dear All,
>>>
>>> I am using LEDE on my IPQ806x (QCA9980) system (Archer C2600).
>>> With compat-wireless-2016-05-12, I observed traces attached below.
>>> The router is unstable and eventually reboots by itself (randomly).
>>>
>>> Upon reverting to compat-wireless-2016-01, the issue disappears. Nothing
>>> else is changed (software-wise or hardware).
>>> This was confirmed with other users.
>>>
>>> A new compile with the fixes below:
>>> https://git.lede-project.org/?p=lede/nbd/staging.git;a=commit;h=858e26f3c0fc11231f25497cbb2ddca1e5f101e0
>>>
>>> Did not solve the problem.
>>>
>>> Please let me know if I need to provide any further information.
>>>
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 0 PID: 558 at
>>> compat-wireless-2016-05-12/net/mac80211/rx.c:4068
>>> ieee80211_rx_napi+0x8c/0x8a4 [mac80211]()
>>
>> Can you post what is at rx.c line 4068 (and +/- 3 lines), please?
> It's early in ieee80211_rx_napi:
>
> 	sband = local->hw.wiphy->bands[status->band];
> 	if (WARN_ON(!sband))
> 		goto drop;
>
> I could not easily find a scenario under which status->band would not be
> set properly by the driver, so my guess is there is some nasty memory
> corruption going on.
>
> FWIW, I've received several reports like this from different people on
> different devices. They're also confirming that reverting to the
> snapshot from January makes things stable again.

Adding ath10k list to the loop.

--
Kalle Valo

2016-06-10 12:57:37

by Felix Fietkau

Subject: Re: ath10k/QCA9980 - Issues introduced in wireless testing 2016-05

On 2016-06-10 14:52, Ben Greear wrote:
> This looks a lot like the problems I was having.
>
> Two of these 5 patches recently made it upstream (but may not be in LEDE yet),
> but the other patches also were related to memory corruption.
>
> See my patches posted on 4/1/16:
>
> https://patchwork.kernel.org/project/ath10k/list/
>
> I don't know where the 5/5 patch ended up.
I had already asked affected users to test with those patches (I have a
commit that adds them in my staging tree), but it did not resolve the issue.

- Felix

2016-06-10 08:50:45

by Michal Kazior

Subject: Re: ath10k/QCA9980 - Issues introduced in wireless testing 2016-05

On 9 June 2016 at 09:46, A. Benz <[email protected]> wrote:
> Dear All,
>
> I am using LEDE on my IPQ806x (QCA9980) system (Archer C2600).
> With compat-wireless-2016-05-12, I observed traces attached below.
> The router is unstable and eventually reboots by itself (randomly).
>
> Upon reverting to compat-wireless-2016-01, the issue disappears. Nothing
> else is changed (software-wise or hardware).
> This was confirmed with other users.
>
> A new compile with the fixes below:
> https://git.lede-project.org/?p=lede/nbd/staging.git;a=commit;h=858e26f3c0fc11231f25497cbb2ddca1e5f101e0
>
> Did not solve the problem.
>
> Please let me know if I need to provide any further information.
>
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 558 at
> compat-wireless-2016-05-12/net/mac80211/rx.c:4068
> ieee80211_rx_napi+0x8c/0x8a4 [mac80211]()

Can you post what is at rx.c line 4068 (and +/- 3 lines), please?


Michał

2016-06-10 12:52:59

by Ben Greear

Subject: Re: ath10k/QCA9980 - Issues introduced in wireless testing 2016-05



On 06/10/2016 05:22 AM, Kalle Valo wrote:
> Felix Fietkau <[email protected]> writes:
>
>> On 2016-06-10 10:50, Michal Kazior wrote:
>>> On 9 June 2016 at 09:46, A. Benz <[email protected]> wrote:
>>>> Dear All,
>>>>
>>>> I am using LEDE on my IPQ806x (QCA9980) system (Archer C2600).
>>>> With compat-wireless-2016-05-12, I observed traces attached below.
>>>> The router is unstable and eventually reboots by itself (randomly).
>>>>
>>>> Upon reverting to compat-wireless-2016-01, the issue disappears. Nothing
>>>> else is changed (software-wise or hardware).
>>>> This was confirmed with other users.
>>>>
>>>> A new compile with the fixes below:
>>>> https://git.lede-project.org/?p=lede/nbd/staging.git;a=commit;h=858e26f3c0fc11231f25497cbb2ddca1e5f101e0
>>>>
>>>> Did not solve the problem.
>>>>
>>>> Please let me know if I need to provide any further information.
>>>>
>>>> ------------[ cut here ]------------
>>>> WARNING: CPU: 0 PID: 558 at
>>>> compat-wireless-2016-05-12/net/mac80211/rx.c:4068
>>>> ieee80211_rx_napi+0x8c/0x8a4 [mac80211]()
>>>
>>> Can you post what is at rx.c line 4068 (and +/- 3 lines), please?
>> It's early in ieee80211_rx_napi:
>>
>> 	sband = local->hw.wiphy->bands[status->band];
>> 	if (WARN_ON(!sband))
>> 		goto drop;
>>
>> I could not easily find a scenario under which status->band would not be
>> set properly by the driver, so my guess is there is some nasty memory
>> corruption going on.
>>
>> FWIW, I've received several reports like this from different people on
>> different devices. They're also confirming that reverting to the
>> snapshot from January makes things stable again.
>
> Adding ath10k list to the loop.

This looks a lot like the problems I was having.

Two of these 5 patches recently made it upstream (but may not be in LEDE yet),
but the other patches also were related to memory corruption.

See my patches posted on 4/1/16:

https://patchwork.kernel.org/project/ath10k/list/

I don't know where the 5/5 patch ended up.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2016-06-10 09:10:45

by Michal Kazior

Subject: Re: ath10k/QCA9980 - Issues introduced in wireless testing 2016-05

On 10 June 2016 at 10:55, Felix Fietkau <[email protected]> wrote:
> On 2016-06-10 10:50, Michal Kazior wrote:
>> On 9 June 2016 at 09:46, A. Benz <[email protected]> wrote:
>>> Dear All,
>>>
>>> I am using LEDE on my IPQ806x (QCA9980) system (Archer C2600).
>>> With compat-wireless-2016-05-12, I observed traces attached below.
>>> The router is unstable and eventually reboots by itself (randomly).
>>>
>>> Upon reverting to compat-wireless-2016-01, the issue disappears. Nothing
>>> else is changed (software-wise or hardware).
>>> This was confirmed with other users.
>>>
>>> A new compile with the fixes below:
>>> https://git.lede-project.org/?p=lede/nbd/staging.git;a=commit;h=858e26f3c0fc11231f25497cbb2ddca1e5f101e0
>>>
>>> Did not solve the problem.
>>>
>>> Please let me know if I need to provide any further information.
>>>
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 0 PID: 558 at
>>> compat-wireless-2016-05-12/net/mac80211/rx.c:4068
>>> ieee80211_rx_napi+0x8c/0x8a4 [mac80211]()
>>
>> Can you post what is at rx.c line 4068 (and +/- 3 lines), please?
> It's early in ieee80211_rx_napi:
>
> 	sband = local->hw.wiphy->bands[status->band];
> 	if (WARN_ON(!sband))
> 		goto drop;

Thanks.


> I could not easily find a scenario under which status->band would not be
> set properly by the driver, so my guess is there is some nasty memory
> corruption going on.

Hmm.. could it be related to ath10k not fulfilling (some) NAPI's
locking requirements and thus ending up with, e.g. linked-list mayhem?


Michał

2016-07-15 11:49:34

by Ashok Raj Nagarajan

Subject: Re: ath10k/QCA9980 - Issues introduced in wireless testing 2016-05

> On 06/10/2016 05:57 AM, Felix Fietkau wrote:
>> On 2016-06-10 14:52, Ben Greear wrote:
>>> This looks a lot like the problems I was having.
>>>
>>> Two of these 5 patches recently made it upstream (but may not be in LEDE yet),
>>> but the other patches also were related to memory corruption.
>>>
>>> See my patches posted on 4/1/16:
>>>
>>> https://patchwork.kernel.org/project/ath10k/list/
>>>
>>> I don't know where the 5/5 patch ended up.
>> I had already asked affected users to test with those patches (I have a
>> commit that adds them in my staging tree), but it did not resolve the issue.
>
> Ok, must be something else then.
>
> If you can run on x86 under KASAN it may provide some clues..that is how I eventually
> made progress on the issues I was seeing. My rebase onto 3.7 has been slow and painful,
> but I should be ready to start testing that sometime soon, maybe I can reproduce something
> there.

Hi Benz,

Could you please check whether the following diff solves your issue?

diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
index 6f19fca..c192a41 100644
--- a/drivers/net/wireless/ath/ath10k/htt_rx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
@@ -1528,7 +1528,7 @@ static void ath10k_htt_rx_h_filter(struct ath10k *ar,
 static int ath10k_htt_rx_handle_amsdu(struct ath10k_htt *htt)
 {
 	struct ath10k *ar = htt->ar;
-	static struct ieee80211_rx_status rx_status;
+	struct ieee80211_rx_status *rx_status = &htt->rx_status;
 	struct sk_buff_head amsdu;
 	int ret;

@@ -1553,11 +1553,11 @@ static int ath10k_htt_rx_handle_amsdu(struct ath10k_htt *htt)
 	}

 	ath10k_pktlog_rx(ar, &amsdu);
-	ath10k_htt_rx_h_ppdu(ar, &amsdu, &rx_status, 0xffff);
+	ath10k_htt_rx_h_ppdu(ar, &amsdu, rx_status, 0xffff);
 	ath10k_htt_rx_h_unchain(ar, &amsdu, ret > 0);
-	ath10k_htt_rx_h_filter(ar, &amsdu, &rx_status);
-	ath10k_htt_rx_h_mpdu(ar, &amsdu, &rx_status);
-	ath10k_htt_rx_h_deliver(ar, &amsdu, &rx_status);
+	ath10k_htt_rx_h_filter(ar, &amsdu, rx_status);
+	ath10k_htt_rx_h_mpdu(ar, &amsdu, rx_status);
+	ath10k_htt_rx_h_deliver(ar, &amsdu, rx_status);

Thanks,
Ashok
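
The change in the diff above, from a function-local static to a status stored per device, can be modelled in a few lines. This is a minimal sketch, not the ath10k code: every demo_* name is hypothetical, and the rule that a frame without band information reuses the previously stored value is an assumption made here only to show why shared storage could matter when more than one device calls the same function.

```c
/* Hypothetical stand-ins for illustration; not the ath10k types. */
enum demo_band { DEMO_BAND_2GHZ, DEMO_BAND_5GHZ };
#define DEMO_NO_BAND_INFO (-1)

struct demo_rx_status {
	int band;
};

struct demo_dev {
	struct demo_rx_status rx_status;	/* per-device storage, as in the "+" lines */
};

/* Before: one function-local static, shared by every device. */
static int demo_handle_shared(int reported_band)
{
	static struct demo_rx_status rx_status;

	/* Assumption: frames without band info keep the previous value. */
	if (reported_band != DEMO_NO_BAND_INFO)
		rx_status.band = reported_band;

	return rx_status.band;
}

/* After: each device carries its own status between calls. */
static int demo_handle_per_dev(struct demo_dev *dev, int reported_band)
{
	struct demo_rx_status *rx_status = &dev->rx_status;

	if (reported_band != DEMO_NO_BAND_INFO)
		rx_status->band = reported_band;

	return rx_status->band;
}
```

With the shared variant, a call that reports DEMO_BAND_5GHZ followed by a call carrying DEMO_NO_BAND_INFO returns DEMO_BAND_5GHZ no matter which device the second frame belongs to. With the per-device variant, a zero-initialised second device still returns DEMO_BAND_2GHZ.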

>Thanks,
>Ben