2018-07-30 09:54:54

by Martin Willi

[permalink] [raw]
Subject: ath10k SWBA overrun / tx credit starvation

Hi,

We are experiencing some issues when running ath10k in AP mode.
Unfortunately, I didn't manage to reproduce the issue in the lab, but
in the field we see it roughly once a day on one out of fifty devices.

The symptoms are the logged "SWBA overruns" followed by a kernel
WARNING when removing a station (see below), followed by many more
"SWBA overruns". It seems that the firmware and kernel get out of sync
about the associated stations. The module does not recover, but the
whole networking stack gets very sluggish, probably due to a lock held
for many seconds. Bringing down the affected network interface takes
some extra seconds, but then allows recovering from that issue.

We are running 4.14-stable, and tried many firmware versions, including
10.2.4.70-2, 10.2.4-1.0-00040, 10.2.4.70.61-2, 10.2.4.70.67 and
firmware-2-ct-full-community-20, but the issue remains. Hardware is
QCA9882 on a WLE600VX.

I stumbled over a some years old discussion at [1] about tx credit
starvation. Is this still the same issue we are seeing? Given that the
mentioned newer firmware versions did not help here, is there anything
else we can try?

Thanks!
Martin

[1] https://lists.infradead.org/pipermail/ath10k/2015-June/005340.html

---

15:27:39 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:39 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:39 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:39 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:40 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:40 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:41 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:41 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:40 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:40 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:44 ------------[ cut here ]------------
15:27:44 WARNING: CPU: 0 PID: 150 at net/mac80211/sta_info.c:976 __sta_info_destroy_part2+0x170/0x174
15:27:44 Modules linked in: xt_comment xt_cluster xt_u32 esp4 xfrm6_mode_tunnel xfrm4_mode_tunnel ebtable_filter ebtables bridge stp llc xt_policy xt_connmark xt_mark xt_set ip_set_hash_ipport ip_set_hash_netnet ip_set iptable_mangle nfnet
15:27:44 CPU: 0 PID: 150 Comm: hostapd Not tainted 4.14.55 #2
15:27:44 Hardware name: Marvell Armada 380/385 (Device Tree)
15:27:44 [<c010fd20>] (unwind_backtrace) from [<c010b9bc>] (show_stack+0x10/0x14)
15:27:44 [<c010b9bc>] (show_stack) from [<c0653fa8>] (dump_stack+0x88/0x9c)
15:27:44 [<c0653fa8>] (dump_stack) from [<c012f110>] (__warn+0xe8/0x100)
15:27:44 [<c012f110>] (__warn) from [<c012f1d8>] (warn_slowpath_null+0x20/0x28)
15:27:44 [<c012f1d8>] (warn_slowpath_null) from [<c060b6e0>] (__sta_info_destroy_part2+0x170/0x174)
15:27:44 [<c060b6e0>] (__sta_info_destroy_part2) from [<c060b704>] (__sta_info_destroy+0x20/0x28)
15:27:44 [<c060b704>] (__sta_info_destroy) from [<c060b77c>] (sta_info_destroy_addr_bss+0x2c/0x44)
15:27:44 [<c060b77c>] (sta_info_destroy_addr_bss) from [<c05ec1f4>] (nl80211_del_station+0xc8/0x100)
15:27:44 [<c05ec1f4>] (nl80211_del_station) from [<c050d050>] (genl_rcv_msg+0x2f8/0x3c8)
15:27:44 [<c050d050>] (genl_rcv_msg) from [<c050c190>] (netlink_rcv_skb+0xac/0x104)
15:27:44 [<c050c190>] (netlink_rcv_skb) from [<c050cd48>] (genl_rcv+0x24/0x34)
15:27:44 [<c050cd48>] (genl_rcv) from [<c050b954>] (netlink_unicast+0x184/0x21c)
15:27:44 [<c050b954>] (netlink_unicast) from [<c050bdec>] (netlink_sendmsg+0x334/0x374)
15:27:44 [<c050bdec>] (netlink_sendmsg) from [<c04bc128>] (sock_sendmsg+0x14/0x24)
15:27:44 [<c04bc128>] (sock_sendmsg) from [<c04bc660>] (___sys_sendmsg+0x214/0x228)
15:27:44 [<c04bc660>] (___sys_sendmsg) from [<c04bd4a0>] (__sys_sendmsg+0x40/0x6c)
15:27:44 [<c04bd4a0>] (__sys_sendmsg) from [<c0107ae0>] (ret_fast_syscall+0x0/0x54)
15:27:44 ---[ end trace 036b835c84274321 ]---
15:27:44 ath10k_warn: 41 callbacks suppressed
15:27:44 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:44 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:44 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:44 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
15:27:45 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon


2018-07-30 18:37:17

by Ben Greear

[permalink] [raw]
Subject: Re: ath10k SWBA overrun / tx credit starvation

On 07/30/2018 01:12 AM, Martin Willi wrote:
> Hi,
>
> We are experiencing some issues when running ath10k in AP mode.
> Unfortunately, I didn't manage to reproduce the issue in the lab, but
> in the field we see it roughly once a day on one out of fifty devices.
>
> The symptoms are the logged "SWBA overruns" followed by a kernel
> WARNING when removing a station (see below), followed by many more
> "SWBA overruns". It seems that the firmware and kernel get out of sync
> about the associated stations. The module does not recover, but the
> whole networking stack gets very sluggish, probably due to a lock held
> for many seconds. Bringing down the affected network interface takes
> some extra seconds, but then allows recovering from that issue.
>
> We are running 4.14-stable, and tried many firmware versions, including
> 10.2.4.70-2, 10.2.4-1.0-00040, 10.2.4.70.61-2, 10.2.4.70.67 and
> firmware-2-ct-full-community-20, but the issue remains. Hardware is
> QCA9882 on a WLE600VX.
>
> I stumbled over a some years old discussion at [1] about tx credit
> starvation. Is this still the same issue we are seeing? Given that the
> mentioned newer firmware versions did not help here, is there anything
> else we can try?

If you use the -ct firmware and the -ct driver, you can configure more than
2 tx-credits. Search for 'TARGET_HTC_MAX_TX_CREDITS_CT' in the -ct driver
and change it, maybe to 4 or 6.

The support was added in this patch and it has some comments that explain
it a bit:

https://github.com/greearb/linux-ct-4.16/commit/59acbd0481e8fc2028373ed01a0ec5212990b330

I don't know that it will fix your problem, but it might, and is something
to try.

The -ct driver also has several other patches that attempt to improve
tx credits issues, but I am not sure it resolves everything and a buggy
firmware would still cause issues no matter.

Thanks,
Ben

>
> Thanks!
> Martin
>
> [1] https://lists.infradead.org/pipermail/ath10k/2015-June/005340.html
>
> ---
>
> 15:27:39 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
> 15:27:39 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
> 15:27:39 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
> 15:27:39 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
> 15:27:40 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
> 15:27:40 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
> 15:27:41 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
> 15:27:41 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
> 15:27:40 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
> 15:27:40 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
> 15:27:44 ------------[ cut here ]------------
> 15:27:44 WARNING: CPU: 0 PID: 150 at net/mac80211/sta_info.c:976 __sta_info_destroy_part2+0x170/0x174
> 15:27:44 Modules linked in: xt_comment xt_cluster xt_u32 esp4 xfrm6_mode_tunnel xfrm4_mode_tunnel ebtable_filter ebtables bridge stp llc xt_policy xt_connmark xt_mark xt_set ip_set_hash_ipport ip_set_hash_netnet ip_set iptable_mangle nfnet
> 15:27:44 CPU: 0 PID: 150 Comm: hostapd Not tainted 4.14.55 #2
> 15:27:44 Hardware name: Marvell Armada 380/385 (Device Tree)
> 15:27:44 [<c010fd20>] (unwind_backtrace) from [<c010b9bc>] (show_stack+0x10/0x14)
> 15:27:44 [<c010b9bc>] (show_stack) from [<c0653fa8>] (dump_stack+0x88/0x9c)
> 15:27:44 [<c0653fa8>] (dump_stack) from [<c012f110>] (__warn+0xe8/0x100)
> 15:27:44 [<c012f110>] (__warn) from [<c012f1d8>] (warn_slowpath_null+0x20/0x28)
> 15:27:44 [<c012f1d8>] (warn_slowpath_null) from [<c060b6e0>] (__sta_info_destroy_part2+0x170/0x174)
> 15:27:44 [<c060b6e0>] (__sta_info_destroy_part2) from [<c060b704>] (__sta_info_destroy+0x20/0x28)
> 15:27:44 [<c060b704>] (__sta_info_destroy) from [<c060b77c>] (sta_info_destroy_addr_bss+0x2c/0x44)
> 15:27:44 [<c060b77c>] (sta_info_destroy_addr_bss) from [<c05ec1f4>] (nl80211_del_station+0xc8/0x100)
> 15:27:44 [<c05ec1f4>] (nl80211_del_station) from [<c050d050>] (genl_rcv_msg+0x2f8/0x3c8)
> 15:27:44 [<c050d050>] (genl_rcv_msg) from [<c050c190>] (netlink_rcv_skb+0xac/0x104)
> 15:27:44 [<c050c190>] (netlink_rcv_skb) from [<c050cd48>] (genl_rcv+0x24/0x34)
> 15:27:44 [<c050cd48>] (genl_rcv) from [<c050b954>] (netlink_unicast+0x184/0x21c)
> 15:27:44 [<c050b954>] (netlink_unicast) from [<c050bdec>] (netlink_sendmsg+0x334/0x374)
> 15:27:44 [<c050bdec>] (netlink_sendmsg) from [<c04bc128>] (sock_sendmsg+0x14/0x24)
> 15:27:44 [<c04bc128>] (sock_sendmsg) from [<c04bc660>] (___sys_sendmsg+0x214/0x228)
> 15:27:44 [<c04bc660>] (___sys_sendmsg) from [<c04bd4a0>] (__sys_sendmsg+0x40/0x6c)
> 15:27:44 [<c04bd4a0>] (__sys_sendmsg) from [<c0107ae0>] (ret_fast_syscall+0x0/0x54)
> 15:27:44 ---[ end trace 036b835c84274321 ]---
> 15:27:44 ath10k_warn: 41 callbacks suppressed
> 15:27:44 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
> 15:27:44 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
> 15:27:44 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
> 15:27:44 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
> 15:27:45 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
>


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2018-08-06 18:21:11

by Ben Greear

[permalink] [raw]
Subject: Re: ath10k SWBA overrun / tx credit starvation

On 08/06/2018 05:15 AM, Martin Willi wrote:
> Hi Ben,
>
> Thanks for your help.
>
>> If you use the -ct firmware and the -ct driver, you can configure
>> more than 2 tx-credits.
>
> Unfortunately, this didn't help, either. I hit these issues even sooner
> with any 10.1-based firmware (including CT), which implies that at
> least some of them have been addressed with 10.2/10.2.4.

Out of curiosity, how soon could you hit it with -ct firmware? We often
see these around once per day in some of our test cases, rarely more often
than that.

>> I am not sure it resolves everything and a buggy firmware would still
>> cause issues no matter.
>
> As a work-around, I'm experimenting with handling timeout conditions in
> ath10k_wmi_cmd_send() caused by missing credits. Given that we can't do
> any TX-flush or warm-restart over WMI under these conditions, I just
> issue a hardware restart (patch below).
>
> Some initial tests show that this in fact recovers the module from its
> bad state with just a small connectivity gap; certainly much better
> than that unpredictable behavior we've seen previously.

Yes, I have been using a similar patch for years with good results.

Thanks,
Ben

>
> I'll do some more testing with this approach before considering to
> upstream it.
>
> Regards
> Martin
>
> ---
>
> From fd9e90d0294450c093d243ee4f1eb1e07b1cd73a Mon Sep 17 00:00:00 2001
> From: Martin Willi <[email protected]>
> Date: Fri, 3 Aug 2018 14:23:30 +0200
> Subject: [PATCH] ath10k: Schedule hardware restart if WMI command times out
>
> If the TX queue gets stuck for some reason, we run out of tx credits and
> are unable to send any commands over WMI. To recover from this situation,
> issue a hard hardware restart. This implies a connectivity outage of about
> 1.4s in AP mode, but brings back the interface to a usable state.
>
> Signed-off-by: Martin Willi <[email protected]>
> ---
> drivers/net/wireless/ath/ath10k/wmi.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c
> index 38a97086708b..d39a983f4a1f 100644
> --- a/drivers/net/wireless/ath/ath10k/wmi.c
> +++ b/drivers/net/wireless/ath/ath10k/wmi.c
> @@ -1852,6 +1852,12 @@ int ath10k_wmi_cmd_send(struct ath10k *ar, struct sk_buff *skb, u32 cmd_id)
> if (ret)
> dev_kfree_skb_any(skb);
>
> + if (ret == -EAGAIN) {
> + ath10k_warn(ar, "wmi command %d timeout, restarting hardware\n",
> + cmd_id);
> + queue_work(ar->workqueue, &ar->restart_work);
> + }
> +
> return ret;
> }
>
>


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2018-08-06 14:24:32

by Martin Willi

[permalink] [raw]
Subject: Re: ath10k SWBA overrun / tx credit starvation

Hi Ben,

Thanks for your help.

> If you use the -ct firmware and the -ct driver, you can configure
> more than 2 tx-credits.

Unfortunately, this didn't help, either. I hit these issues even sooner
with any 10.1-based firmware (including CT), which implies that at
least some of them have been addressed with 10.2/10.2.4.

> I am not sure it resolves everything and a buggy firmware would still
> cause issues no matter.

As a work-around, I'm experimenting with handling timeout conditions in
ath10k_wmi_cmd_send() caused by missing credits. Given that we can't do
any TX-flush or warm-restart over WMI under these conditions, I just
issue a hardware restart (patch below).

Some initial tests show that this in fact recovers the module from its
bad state with just a small connectivity gap; certainly much better
than that unpredictable behavior we've seen previously.

I'll do some more testing with this approach before considering to
upstream it.

Regards
Martin

---

>From fd9e90d0294450c093d243ee4f1eb1e07b1cd73a Mon Sep 17 00:00:00 2001
From: Martin Willi <[email protected]>
Date: Fri, 3 Aug 2018 14:23:30 +0200
Subject: [PATCH] ath10k: Schedule hardware restart if WMI command times out

If the TX queue gets stuck for some reason, we run out of tx credits and
are unable to send any commands over WMI. To recover from this situation,
issue a hard hardware restart. This implies a connectivity outage of about
1.4s in AP mode, but brings back the interface to a usable state.

Signed-off-by: Martin Willi <[email protected]>
---
drivers/net/wireless/ath/ath10k/wmi.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c
index 38a97086708b..d39a983f4a1f 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -1852,6 +1852,12 @@ int ath10k_wmi_cmd_send(struct ath10k *ar, struct sk_buff *skb, u32 cmd_id)
if (ret)
dev_kfree_skb_any(skb);

+ if (ret == -EAGAIN) {
+ ath10k_warn(ar, "wmi command %d timeout, restarting hardware\n",
+ cmd_id);
+ queue_work(ar->workqueue, &ar->restart_work);
+ }
+
return ret;
}

--
2.17.1

2018-08-07 08:36:25

by Martin Willi

[permalink] [raw]
Subject: Re: ath10k SWBA overrun / tx credit starvation


> > [...], which implies that at least some of them have been addressed
> > with 10.2/10.2.4.
>
> Out of curiosity, how soon could you hit it with -ct firmware? We
> often see these around once per day in some of our test cases, rarely
> more often than that.

I have no hard numbers, as I have no reproducer. In the field this
depends on the client load, but once in 1-3 days is realistic.

With 10.2.4 is see this much less frequent (factor 10).

Martin