2014-04-04 11:44:11

by Michal Kazior

[permalink] [raw]
Subject: [RFT 0/4] ath10k: fix flushing and tx stalls

Hi,

After digging around I've found what seems to be
the problem with WMI Tx credit starvation and
inability to properly flush Tx in ath10k_flush().

Long story short: if a client that was asleep (as
per what firmware thinks) goes out of range (or
just stops responding) then Tx rots in FW/HW
queues for a few seconds before it's discarded.
For WMI Tx credits this means management frames
eat up Tx credits for a few seconds (causing other
WMI commands to timeout and return -EAGAIN/-11).
For HTT Tx this means NullFunc frames would get
stuck for a few seconds before completion was
received.

@Ben: Can you check if this helps you? I tested
this briefly and at least [1/4] seems fixes the
WMI Tx starvation. I'm hoping patches 2-4 help
with your ath10k_flush() failures which I haven't
been successfull in reproducing (but have observed
improvement with purging some frames out of FW/HW
queues).


Michal Kazior (4):
ath10k: fix wmi-htc tx credit starvation
ath10k: rework peer accounting
ath10k: wait for mgmt tx when flushing too
ath10k: improve tx flushing

drivers/net/wireless/ath/ath10k/core.h | 7 +-
drivers/net/wireless/ath/ath10k/mac.c | 254 +++++++++++++++++++++++++++++----
drivers/net/wireless/ath/ath10k/txrx.c | 29 ++--
drivers/net/wireless/ath/ath10k/wmi.c | 5 -
drivers/net/wireless/ath/ath10k/wmi.h | 4 +
5 files changed, 253 insertions(+), 46 deletions(-)

--
1.8.5.3



2014-04-09 10:55:30

by Michal Kazior

[permalink] [raw]
Subject: [RFTv2 1/5] ath10k: always request htc tx replenishment

This simplifies tx credit management and tries to
address a problem where driver runs out of credits
and is unable to recover when firmware stalls.

This also makes it a little more easy to track and
debug htc tx credit flow since there should be no
laziness involved.

Signed-off-by: Michal Kazior <[email protected]>
---
drivers/net/wireless/ath/ath10k/htc.c | 15 +--------------
1 file changed, 1 insertion(+), 14 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/htc.c b/drivers/net/wireless/ath/ath10k/htc.c
index 7f1bccd..c2a290a 100644
--- a/drivers/net/wireless/ath/ath10k/htc.c
+++ b/drivers/net/wireless/ath/ath10k/htc.c
@@ -86,19 +86,6 @@ static void ath10k_htc_notify_tx_completion(struct ath10k_htc_ep *ep,
ep->ep_ops.ep_tx_complete(ep->htc->ar, skb);
}

-/* assumes tx_lock is held */
-static bool ath10k_htc_ep_need_credit_update(struct ath10k_htc_ep *ep)
-{
- if (!ep->tx_credit_flow_enabled)
- return false;
- if (ep->tx_credits >= ep->tx_credits_per_max_message)
- return false;
-
- ath10k_dbg(ATH10K_DBG_HTC, "HTC: endpoint %d needs credit update\n",
- ep->eid);
- return true;
-}
-
static void ath10k_htc_prepare_tx_skb(struct ath10k_htc_ep *ep,
struct sk_buff *skb)
{
@@ -113,7 +100,7 @@ static void ath10k_htc_prepare_tx_skb(struct ath10k_htc_ep *ep,
spin_lock_bh(&ep->htc->tx_lock);
hdr->seq_no = ep->seq_no++;

- if (ath10k_htc_ep_need_credit_update(ep))
+ if (ep->tx_credit_flow_enabled)
hdr->flags |= ATH10K_HTC_FLAG_NEED_CREDIT_UPDATE;

spin_unlock_bh(&ep->htc->tx_lock);
--
1.8.5.3


2014-04-09 17:34:55

by Ben Greear

[permalink] [raw]
Subject: Re: [RFT 0/4] ath10k: fix flushing and tx stalls

On 04/08/2014 11:25 PM, Michal Kazior wrote:

> Additional kernel config:
>
> CONFIG_ATH10K_DEBUG=y
> CONFIG_ATH10K_DEBUGFS=y
> CONFIG_ATH10K_TRACING=y
> CONFIG_MAC80211_MESSAGE_TRACING=y
>
> This probably requires this to be set first:
>
> CONFIG_TRACING=y
> CONFIG_GENERIC_TRACER=y
> CONFIG_TRACING_SUPPORT=y
>
> (it's in Kernel Hacking)
>
> Once you have this you can use trace-cmd command to gathers logs:
>
> trace-cmd record -e mac80211 -e cfg80211 -e ath10k
>
> And then perform whatever tortures on ath10k you wish.
>
> This creates trace.dat in your current directory. The file can get
> pretty heavy and it might not fit into an email so we might need to
> transfer it "out of band" somehow.
>
> If you use wpa_supplicant/hostapd you can pass -T argument to those to
> get all their messages into tracing as markers. This way you're able
> to get unified logs in one file with correct chronological order.

My version of supplicant does not support -T. I have just sync'd down the
latest upstream hostapd code and will see if that adds the support.

Here are the trace output logs and /var/log/messages.

http://www.candelatech.com/tmp/ath10k-debug.tar.bz2

Test case is to create 32 stations, bring them up and get DHCP, start
10Mbps TCP traffic on each, and re-set them. When resetting them system generally craps
out and user-space app trying to reset the ports hangs.. Logs are full
of ath10k driver error messages.

In previous tests, I think I was running traffic while bringing up
additional interfaces, so that is probably why I could not even bring
them up properly with your patches.

This is with your RFTv2 patches on top of some of my patches and
recent wl-ath tree. This is using my modified firmware since this
test case cannot work at all on standard firmware.

Let me know if I can offer any additional logs or info.

Thanks,
Ben



--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2014-04-09 21:46:47

by Ben Greear

[permalink] [raw]
Subject: Re: [RFTv2 0/5] ath10k: ath10k: fix flushing and tx stalls

Here's another log snippet with these 5 patches (and lots more
mostly non ath10k patches of my own) applied:

No (significant) user traffic was running.

sta215: authenticated at: 1397076587.913186
sta215: associate with 04:f0:21:03:38:99 (try 1/3), at: 1397076587.915143
ath10k: mac flushing peer 04:f0:21:03:38:99 on vdev 16 mgmt tid for unicast mgmt (204 msecs)
sta215: RX AssocResp from 04:f0:21:03:38:99 (capab=0x1 status=0 aid=17) at: 1397076588.120218
ath10k: mac vdev 16 dtim_period 2
ath10k: mac vdev 16 slot_time 2
ath10k: mac ht peer 04:f0:21:03:38:99 mcs cnt 24 nss 3
ath10k: mac vht peer 04:f0:21:03:38:99 max_mpdu 65535 flags 0x6003001
ath10k: mac peer 04:f0:21:03:38:99 phymode 11ac-vht80
ath10k: mac vdev 16 up (associated) bssid 04:f0:21:03:38:99 aid 17
sta215: associated at: 1397076588.122484
IPv6: ADDRCONF(NETDEV_CHANGE): sta215: link becomes ready
sta216: authenticate with 04:f0:21:03:38:99 at: 1397076588.124830
ath10k: mac vdev 17 beacon_interval 100
ath10k: mac vdev 17 create peer 04:f0:21:03:38:99
ath10k: mac vdev 17 start 04:f0:21:03:38:99
ath10k: mac vdev 17 start center_freq 5180 phymode 11ac-vht80
sta216: send auth to 04:f0:21:03:38:99 (try 1/3) at: 1397076588.125811
ath10k: mac flushing peer 04:f0:21:03:38:99 on vdev 17 mgmt tid for unicast mgmt (204 msecs)
sta216: send auth to 04:f0:21:03:38:99 (try 2/3) at: 1397076588.431144
ath10k: failed to transmit management frame via WMI: -11
sta216: send auth to 04:f0:21:03:38:99 (try 3/3) at: 1397076591.431376
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 17 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
sta216: deauthenticating from 04:f0:21:03:38:99 by local choice (reason=3)
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 17 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: mac vdev 17 peer delete 04:f0:21:03:38:99 (sta gone)
ath10k: failed to request wmi peer 04:f0:21:03:38:99 on vdev 17 removal: -11
ath10k: removing peer 04:f0:21:03:38:99 on vdev 17 still being mapped in firmware
ath10k: mac vdev 17 stop (disassociated
ath10k: failed to stop WMI vdev 17: -11
ath10k: mac vdev 17 down
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 16 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 16 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
sta218: authenticate with 04:f0:21:03:38:99 at: 1397076616.431328
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 16 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: mac vdev 19 beacon_interval 100
ath10k: failed to set beacon interval for vdev 19: -11
ath10k: mac vdev 19 create peer 04:f0:21:03:38:99
ath10k: failed to create wmi peer 04:f0:21:03:38:99 on vdev 19: -11
ath10k: failed to add peer 04:f0:21:03:38:99 for vdev 19 when changing bssid: -11
ath10k: mac vdev 19 start 04:f0:21:03:38:99
ath10k: mac vdev 19 start center_freq 5180 phymode 11ac-vht80


Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2014-04-09 19:30:08

by Ben Greear

[permalink] [raw]
Subject: Re: [RFT 0/4] ath10k: fix flushing and tx stalls

On 04/09/2014 10:34 AM, Ben Greear wrote:

> This is with your RFTv2 patches on top of some of my patches and
> recent wl-ath tree. This is using my modified firmware since this
> test case cannot work at all on standard firmware.
>
> Let me know if I can offer any additional logs or info.

By the way, I reproduced the general problem with only your first two
patches (from previous RFT series) applied. It is at least a bit
harder to reproduce, but it can happen.

Possibly the causes of the failure to flush are different in that
case, so it may not really be quite the same issue.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2014-04-04 11:44:13

by Michal Kazior

[permalink] [raw]
Subject: [RFT 1/4] ath10k: fix wmi-htc tx credit starvation

Each WMI management Tx consumes 1 HTC Tx credit
and doesn't replenish it until the frame is
actually transmitted out of firmware's Tx queues.

If associated client was asleep and has gone out
of range then unicast frames won't be released for
FW/HW queues for a while (10 seconds per
observation). This means that if more management
frames are queued then HTC Tx credits are drained
to 0 and no other commands can be submitted
including beacons and peer removal.

This could in turn result in clients disconnecting
due to their beacon loss and may trigger spurious
sta kickouts because wmi peer removal command may
never reach firmware during disassociation.

This could happen, e.g. when disconnecting client
that has gone away while asleep.

As a workaround flush unicast management frames
that can possibly be buffered.

Signed-off-by: Michal Kazior <[email protected]>
---
drivers/net/wireless/ath/ath10k/mac.c | 80 +++++++++++++++++++++++++++++++++++
drivers/net/wireless/ath/ath10k/wmi.c | 5 ---
drivers/net/wireless/ath/ath10k/wmi.h | 3 ++
3 files changed, 83 insertions(+), 5 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index 58ec5a7..25362ef 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2076,9 +2076,56 @@ void ath10k_mgmt_over_wmi_tx_purge(struct ath10k *ar)
}
}

+static void ath10k_mgmt_tx_flush(struct ath10k *ar, struct sk_buff *skb)
+{
+ struct ieee80211_hdr *hdr = (struct ieee80211_hdr *)skb->data;
+ struct ath10k_vif *arvif = NULL, *arvif_iter;
+ u8 *da = ieee80211_get_DA(hdr);
+ u8 vdev_id = ATH10K_SKB_CB(skb)->vdev_id;
+ int ret;
+
+ if (!is_unicast_ether_addr(da))
+ return;
+
+ if (!ieee80211_is_disassoc(hdr->frame_control) &&
+ !ieee80211_is_deauth(hdr->frame_control) &&
+ !ieee80211_is_action(hdr->frame_control))
+ return;
+
+ mutex_lock(&ar->conf_mutex);
+ list_for_each_entry(arvif_iter, &ar->arvifs, list) {
+ if (arvif_iter->vdev_id == vdev_id) {
+ arvif = arvif_iter;
+ break;
+ }
+ }
+ mutex_unlock(&ar->conf_mutex);
+
+ if (!arvif)
+ return;
+
+ ath10k_dbg(ATH10K_DBG_MAC,
+ "mac flushing peer %pM on vdev %i mgmt tid for unicast mgmt\n",
+ da, vdev_id);
+
+ /*
+ * Wait 2 beacon intervals before flushing so stations that are
+ * asleep but are actually still in range have a chance to see
+ * updated PVB, wake up and fetch the frame.
+ */
+ msleep(2 * arvif->beacon_interval * 1024 / 1000);
+
+ ret = ath10k_wmi_peer_flush(ar, vdev_id, da,
+ WMI_PEER_TID_MGMT_MASK);
+ if (ret)
+ ath10k_warn("failed to flush peer %pM mgmt tid: %d\n",
+ da, ret);
+}
+
void ath10k_mgmt_over_wmi_tx_work(struct work_struct *work)
{
struct ath10k *ar = container_of(work, struct ath10k, wmi_mgmt_tx_work);
+ struct ieee80211_tx_info *info;
struct sk_buff *skb;
int ret;

@@ -2092,7 +2139,40 @@ void ath10k_mgmt_over_wmi_tx_work(struct work_struct *work)
ath10k_warn("failed to transmit management frame via WMI: %d\n",
ret);
ieee80211_free_txskb(ar->hw, skb);
+ continue;
}
+
+ /*
+ * Each WMI management Tx consumes 1 HTC Tx credit and doesn't
+ * replenish it until the frame is actually transmitted out of
+ * firmware's Tx queues.
+ *
+ * If associated client was asleep and has gone out of range
+ * then unicast frames won't be released for FW/HW queues for a
+ * while (10 seconds per observation). This means that if more
+ * management frames are queued then HTC Tx credits are drained
+ * to 0 and no other commands can be submitted including
+ * beacons and peer removal.
+ *
+ * This could in turn result in clients disconnecting due to
+ * their beacon loss and may trigger spurious sta kickouts
+ * because wmi peer removal command may never reach firmware
+ * during disassociation.
+ *
+ * This could happen, e.g. when disconnecting client that has
+ * gone away while asleep.
+ *
+ * As a workaround flush unicast management frames that can
+ * possibly be buffered.
+ *
+ * Note: This is a deficiency in design of WMI_MGMT_TX command.
+ */
+ ath10k_mgmt_tx_flush(ar, skb);
+
+ /* there's no way to get ACK so just assume it's acked */
+ info = IEEE80211_SKB_CB(skb);
+ info->flags |= IEEE80211_TX_STAT_ACK;
+ ieee80211_tx_status(ar->hw, skb);
}
}

diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c
index d4b48ef..3b270a4 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -637,7 +637,6 @@ int ath10k_wmi_mgmt_tx(struct ath10k *ar, struct sk_buff *skb)
struct wmi_mgmt_tx_cmd *cmd;
struct ieee80211_hdr *hdr;
struct sk_buff *wmi_skb;
- struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
int len;
u16 fc;

@@ -673,10 +672,6 @@ int ath10k_wmi_mgmt_tx(struct ath10k *ar, struct sk_buff *skb)
if (ret)
return ret;

- /* TODO: report tx status to mac80211 - temporary just ACK */
- info->flags |= IEEE80211_TX_STAT_ACK;
- ieee80211_tx_status_irqsafe(ar->hw, skb);
-
return ret;
}

diff --git a/drivers/net/wireless/ath/ath10k/wmi.h b/drivers/net/wireless/ath/ath10k/wmi.h
index ae83822..90fe2e9 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.h
+++ b/drivers/net/wireless/ath/ath10k/wmi.h
@@ -3844,6 +3844,9 @@ struct wmi_peer_delete_cmd {
struct wmi_mac_addr peer_macaddr;
} __packed;

+#define WMI_PEER_TID_MGMT 17
+#define WMI_PEER_TID_MGMT_MASK BIT(WMI_PEER_TID_MGMT)
+
struct wmi_peer_flush_tids_cmd {
__le32 vdev_id;
struct wmi_mac_addr peer_macaddr;
--
1.8.5.3


2014-04-10 06:50:16

by Kalle Valo

[permalink] [raw]
Subject: Re: [RFTv2 3/5] ath10k: rework peer accounting

Michal Kazior <[email protected]> writes:

> It was troublesome to iterate over peers and
> perform sleepable calls. This will be necessary
> for some upcomming changes to tx flushing.
>
> Make peer allocation and initial setup
> protected by both ar->conf_mutex and
> ar->data_lock. This way it's possible to iterate
> over peers with conf_mutex and call sleepable
> functions.
>
> Signed-off-by: Michal Kazior <[email protected]>

Sorry, but I'm not really getting your idea here of using both
conf_mutex and data_lock to protect ar->peers. Can you elaborate more,
please? Why can't we just use conf_mutex then? What am I missing?

--
Kalle Valo

2014-04-08 16:03:46

by Ben Greear

[permalink] [raw]
Subject: Re: [RFT 0/4] ath10k: fix flushing and tx stalls

On 04/07/2014 10:51 PM, Michal Kazior wrote:
> On 8 April 2014 04:31, Ben Greear <[email protected]> wrote:
>> On 04/07/2014 02:11 AM, Michal Kazior wrote:
>>
>>> These logs are not enough. I'd love to see traces for this to see what
>>> frames are actually submitted and when tx credits are replenished.
>>>
>>> I also wonder if this can be somehow related to your FW changes to
>>> allow connecting multiple client virtual interfaces to a single AP?
>>
>>
>> I think it is unlikely due to my firmware changes...little of that touched
>> the handling of management frames. It might very well be a basic problem in
>> either the firmware or driver when using multiple station VIFS. I think
>> that
>> aside from my testing that code has not been used much.
>>
>> Note my followup email that problems started with patch 3/4...not sure you
>> saw that one or not. I saw similar failure to associate & get DHCP (and
>> slow/hung user-space) without the kernel error logs.
>
> I did get that mail. It's just that it makes little sense. Patch 3/4
> simply extends ath10k_flush() to wait for management frames. It
> shouldn't change behaviour of tx credit starvation which appears to be
> the case in the logs you provided.

Maybe this causes unexpected waits (under lock) that in turn causes other
vifs trying to come up at the same time to time out?

It seems easily reproducible, so let me know what sort of logging/tracing
you would like to see (and how to enable it if it is not trivial)
and I'll reproduce it and send you the logs.

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2014-04-08 02:32:17

by Ben Greear

[permalink] [raw]
Subject: Re: [RFT 0/4] ath10k: fix flushing and tx stalls

On 04/07/2014 02:11 AM, Michal Kazior wrote:

> These logs are not enough. I'd love to see traces for this to see what
> frames are actually submitted and when tx credits are replenished.
>
> I also wonder if this can be somehow related to your FW changes to
> allow connecting multiple client virtual interfaces to a single AP?

I think it is unlikely due to my firmware changes...little of that touched
the handling of management frames. It might very well be a basic problem in
either the firmware or driver when using multiple station VIFS. I think that
aside from my testing that code has not been used much.

Note my followup email that problems started with patch 3/4...not sure you
saw that one or not. I saw similar failure to associate & get DHCP (and
slow/hung user-space) without the kernel error logs.

I've added the first two patches to my tree and will continue to run
with them since they do not appear to cause problems so far.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2014-04-10 07:11:15

by Michal Kazior

[permalink] [raw]
Subject: Re: [RFTv2 3/5] ath10k: rework peer accounting

On 10 April 2014 08:59, Kalle Valo <[email protected]> wrote:
> Michal Kazior <[email protected]> writes:
>
>> It was troublesome to iterate over peers and
>> perform sleepable calls. This will be necessary
>> for some upcomming changes to tx flushing.
>>
>> Make peer allocation and initial setup
>> protected by both ar->conf_mutex and
>> ar->data_lock. This way it's possible to iterate
>> over peers with conf_mutex and call sleepable
>> functions.
>>
>> Signed-off-by: Michal Kazior <[email protected]>
>
> First comments, but I need to read this much more carefully still:
>
>> --- a/drivers/net/wireless/ath/ath10k/core.h
>> +++ b/drivers/net/wireless/ath/ath10k/core.h
>> @@ -199,9 +199,14 @@ struct ath10k_dfs_stats {
>> #define ATH10K_MAX_NUM_PEER_IDS (1 << 11) /* htt rx_desc limit */
>>
>> struct ath10k_peer {
>> + /* protected by conf_mutex + data_lock */
>> struct list_head list;
>
> This really needs a lot more documentation. And besides, don't we
> actually want to protect struct ath10k::peers, not this?

Parts of the structure needs to be locked differently.

I suppose we can say we just protect everything with data_lock except
`struct list_head list` which needs to be write-protected by both
conf_mutex and data_lock. But then again, addr and vdev_id don't need
any locking as they are immutable (and set before adding to the list).


>
>> +/* hold conf_mutex for simple iteration, or conf_mutex+data_lock for
>> + * modifications */
>> struct ath10k_peer *ath10k_peer_find(struct ath10k *ar, int vdev_id,
>> const u8 *addr)
>> {
>> struct ath10k_peer *peer;
>>
>> - lockdep_assert_held(&ar->data_lock);
>> -
>> list_for_each_entry(peer, &ar->peers, list) {
>> if (peer->vdev_id != vdev_id)
>> continue;
>
> The comment here makes me suspicious. How can we safely iterate the list
> if we don't take data_lock? Doesn't it mean that the list can change
> while we have conf_mutex?

The idea is you need BOTH locks to modify the list structure, but you
need only one of them to iterate over the list safely and
consistently. This means writer will not alter the list structure
until there are no readers.


Michał

2014-04-07 09:11:11

by Michal Kazior

[permalink] [raw]
Subject: Re: [RFT 0/4] ath10k: fix flushing and tx stalls

On 7 April 2014 02:30, Ben Greear <[email protected]> wrote:
> On 04/04/2014 04:37 AM, Michal Kazior wrote:
>>
>> Hi,
>>
>> After digging around I've found what seems to be
>> the problem with WMI Tx credit starvation and
>> inability to properly flush Tx in ath10k_flush().
>
>
> I applied all four, rebooted AP and Station machine with the
> new kernel (3.14.0-wl-ath+). I had removed my own tx-hang detection
> hack.
>
> It is not working well...though perhaps it is not directly related
> to your patches...I will do some reboots and try reversing some patches.
> But, 3.14.0-rc8-wl-ath+ was working mostly OK for me....
>
> It seems my user-space app got hung, probably because rtnl was being
> held too long while doing admin-down on a bunch of stations.
>
> Out of 32 stations, maybe 5 of them managed to come up and get
> an IP address through DHCP before user-space app hung and log
> started filling with the errors at the end of the dmesg below.
>
>
> sta218: associated
> IPv6: ADDRCONF(NETDEV_CHANGE): sta218: link becomes ready
> sta219: authenticate with 04:f0:21:03:38:99
> ath10k: mac vdev 19 beacon_interval 100
> ath10k: mac vdev 19 create peer 04:f0:21:03:38:99
> ath10k: mac vdev 19 start 04:f0:21:03:38:99
> ath10k: mac vdev 19 start center_freq 5180 phymode 11ac-vht80
>
> sta219: send auth to 04:f0:21:03:38:99 (try 1/3)
> sta219: authenticated
> sta218: deauthenticating from 04:f0:21:03:38:99 by local choice (Reason:
> 3=DEAUTH_LEAVING)
> ath10k: mac flushing peer 04:f0:21:03:38:99 on vdev 18 mgmt tid for unicast
> mgmt
> ath10k: mac vdev 18 peer delete 04:f0:21:03:38:99 (sta gone)
> ath10k: mac vdev 18 stop (disassociated
> ath10k: mac vdev 18 down
> ath10k: mac vdev 18 cts_prot 0
> ath10k: mac vdev 18 slot_time 1
> ath10k: mac vdev 18 preamble 1n
> sta219: associate with 04:f0:21:03:38:99 (try 1/3)
> sta219: RX AssocResp from 04:f0:21:03:38:99 (capab=0x1 status=0 aid=22)
> ath10k: mac vdev 19 dtim_period 2
> ath10k: mac vdev 19 slot_time 2
> ath10k: mac ht peer 04:f0:21:03:38:99 mcs cnt 24 nss 3
> ath10k: mac vht peer 04:f0:21:03:38:99 max_mpdu 65535 flags 0x6003001
> ath10k: mac peer 04:f0:21:03:38:99 phymode 11ac-vht80
> ath10k: mac vdev 19 up (associated) bssid 04:f0:21:03:38:99 aid 22
> sta219: associated
> IPv6: ADDRCONF(NETDEV_CHANGE): sta219: link becomes ready
> sta220: authenticate with 04:f0:21:03:38:99
> ath10k: mac vdev 20 beacon_interval 100
> ath10k: mac vdev 20 create peer 04:f0:21:03:38:99
> ath10k: mac vdev 20 start 04:f0:21:03:38:99
> ath10k: mac vdev 20 start center_freq 5180 phymode 11ac-vht80
> sta220: send auth to 04:f0:21:03:38:99 (try 1/3)
> sta220: authenticated
> sta220: associate with 04:f0:21:03:38:99 (try 1/3)
> sta220: RX AssocResp from 04:f0:21:03:38:99 (capab=0x1 status=0 aid=23)
> ath10k: mac vdev 20 dtim_period 2
> ath10k: mac vdev 20 slot_time 2
> ath10k: mac ht peer 04:f0:21:03:38:99 mcs cnt 24 nss 3
> ath10k: mac vht peer 04:f0:21:03:38:99 max_mpdu 65535 flags 0x6003001
> ath10k: mac peer 04:f0:21:03:38:99 phymode 11ac-vht80
> ath10k: mac vdev 20 up (associated) bssid 04:f0:21:03:38:99 aid 23
> sta220: associated
> IPv6: ADDRCONF(NETDEV_CHANGE): sta220: link becomes ready
> ath10k: Creating vdev id: 21 map: 4292870144
> ath10k: mac vdev create 21 (add interface) type 2 subtype 0
> ath10k: mac vdev 21 cts_prot 0
> ath10k: mac vdev 21 slot_time 1
> ath10k: mac vdev 21 preamble 1n
> IPv6: ADDRCONF(NETDEV_UP): sta221: link is not ready
> ath10k: mac vdev 21 delete (remove interface)
> ath10k: peer-unmap-event: unknown peer id 23
> ath10k: Creating vdev id: 21 map: 4292870144
> ath10k: mac vdev create 21 (add interface) type 2 subtype 0
> ath10k: mac vdev 21 cts_prot 0
> ath10k: mac vdev 21 slot_time 1
> ath10k: mac vdev 21 preamble 1n
> IPv6: ADDRCONF(NETDEV_UP): sta221: link is not ready
> sta217: authenticate with 04:f0:21:03:38:99
> ath10k: mac vdev 17 beacon_interval 100
> ath10k: mac vdev 17 create peer 04:f0:21:03:38:99
> ath10k: mac vdev 17 start 04:f0:21:03:38:99
> ath10k: mac vdev 17 start center_freq 5180 phymode 11ac-vht80
>
> sta217: send auth to 04:f0:21:03:38:99 (try 1/3)
> sta217: authenticated
> sta217: associate with 04:f0:21:03:38:99 (try 1/3)
> sta217: RX AssocResp from 04:f0:21:03:38:99 (capab=0x1 status=0 aid=20)
> ath10k: mac vdev 17 dtim_period 2
> ath10k: mac vdev 17 slot_time 2
> ath10k: mac ht peer 04:f0:21:03:38:99 mcs cnt 24 nss 3
> ath10k: mac vht peer 04:f0:21:03:38:99 max_mpdu 65535 flags 0x6003001
> ath10k: mac peer 04:f0:21:03:38:99 phymode 11ac-vht80
> ath10k: mac vdev 17 up (associated) bssid 04:f0:21:03:38:99 aid 20
> sta217: associated
> IPv6: ADDRCONF(NETDEV_CHANGE): sta217: link becomes ready
> sta218: authenticate with 04:f0:21:03:38:99
> ath10k: mac vdev 18 beacon_interval 100
> ath10k: mac vdev 18 create peer 04:f0:21:03:38:99
> ath10k: mac vdev 18 start 04:f0:21:03:38:99
> ath10k: mac vdev 18 start center_freq 5180 phymode 11ac-vht80
> sta218: send auth to 04:f0:21:03:38:99 (try 1/3)
> ath10k: Creating vdev id: 22 map: 4290772992
> ath10k: mac vdev create 22 (add interface) type 2 subtype 0
> sta218: authenticated
> ath10k: mac vdev 22 cts_prot 0
> ath10k: mac vdev 22 slot_time 1
> ath10k: mac vdev 22 preamble 1n
> IPv6: ADDRCONF(NETDEV_UP): sta222: link is not ready
> sta218: associate with 04:f0:21:03:38:99 (try 1/3)
> sta218: RX AssocResp from 04:f0:21:03:38:99 (capab=0x1 status=0 aid=18)
> ath10k: mac vdev 18 dtim_period 2
> ath10k: mac vdev 18 slot_time 2
> ath10k: mac ht peer 04:f0:21:03:38:99 mcs cnt 24 nss 3
> ath10k: mac vht peer 04:f0:21:03:38:99 max_mpdu 65535 flags 0x6003001
> ath10k: mac peer 04:f0:21:03:38:99 phymode 11ac-vht80
> ath10k: mac vdev 18 up (associated) bssid 04:f0:21:03:38:99 aid 18
> sta218: associated
> sta222: authenticate with 04:f0:21:03:38:99
> ath10k: mac vdev 22 beacon_interval 100
> ath10k: mac vdev 22 create peer 04:f0:21:03:38:99
> ath10k: mac vdev 22 start 04:f0:21:03:38:99
> ath10k: mac vdev 22 start center_freq 5180 phymode 11ac-vht80
>
> sta222: send auth to 04:f0:21:03:38:99 (try 1/3)
> ath10k: mac vdev 22 peer delete 04:f0:21:03:38:99 (sta gone)
> ath10k: mac vdev 22 stop (disassociated
> ath10k: mac vdev 22 down
> ath10k: mac vdev 22 delete (remove interface)
> ath10k: peer-unmap-event: unknown peer id 24
> ath10k: Creating vdev id: 22 map: 4290772992
> ath10k: mac vdev create 22 (add interface) type 2 subtype 0
> ath10k: mac vdev 22 cts_prot 0
> ath10k: mac vdev 22 slot_time 1
> ath10k: mac vdev 22 preamble 1n
> IPv6: ADDRCONF(NETDEV_UP): sta222: link is not ready
> sta221: authenticate with 04:f0:21:03:38:99
> ath10k: mac vdev 21 beacon_interval 100
> ath10k: mac vdev 21 create peer 04:f0:21:03:38:99
> ath10k: mac vdev 21 start 04:f0:21:03:38:99
> ath10k: mac vdev 21 start center_freq 5180 phymode 11ac-vht80
>
> sta221: send auth to 04:f0:21:03:38:99 (try 1/3)
> sta221: authenticated
> sta221: associate with 04:f0:21:03:38:99 (try 1/3)
> sta221: RX AssocResp from 04:f0:21:03:38:99 (capab=0x1 status=0 aid=19)
> ath10k: mac vdev 21 dtim_period 2
> ath10k: mac vdev 21 slot_time 2
> ath10k: mac ht peer 04:f0:21:03:38:99 mcs cnt 24 nss 3
> ath10k: mac vht peer 04:f0:21:03:38:99 max_mpdu 65535 flags 0x6003001
> ath10k: mac peer 04:f0:21:03:38:99 phymode 11ac-vht80
> ath10k: mac vdev 21 up (associated) bssid 04:f0:21:03:38:99 aid 19
> sta221: associated
> IPv6: ADDRCONF(NETDEV_CHANGE): sta221: link becomes ready
> sta222: authenticate with 04:f0:21:03:38:99
> ath10k: mac vdev 22 beacon_interval 100
> ath10k: mac vdev 22 create peer 04:f0:21:03:38:99
> ath10k: mac vdev 22 start 04:f0:21:03:38:99
> ath10k: mac vdev 22 start center_freq 5180 phymode 11ac-vht80
> sta222: send auth to 04:f0:21:03:38:99 (try 1/3)
> sta222: authenticated
> sta222: associate with 04:f0:21:03:38:99 (try 1/3)
> sta222: RX AssocResp from 04:f0:21:03:38:99 (capab=0x1 status=0 aid=24)
> ath10k: mac vdev 22 dtim_period 2
> ath10k: mac vdev 22 slot_time 2
> ath10k: mac ht peer 04:f0:21:03:38:99 mcs cnt 24 nss 3
> ath10k: mac vht peer 04:f0:21:03:38:99 max_mpdu 65535 flags 0x6003001
> ath10k: mac peer 04:f0:21:03:38:99 phymode 11ac-vht80
> ath10k: mac vdev 22 up (associated) bssid 04:f0:21:03:38:99 aid 24
> sta222: associated
> IPv6: ADDRCONF(NETDEV_CHANGE): sta222: link becomes ready
>
> sta213: deauthenticating from 04:f0:21:03:38:99 by local choice (Reason:
> 3=DEAUTH_LEAVING)
> ath10k: mac flushing peer 04:f0:21:03:38:99 on vdev 13 mgmt tid for unicast
> mgmt
> ath10k: mac vdev 13 peer delete 04:f0:21:03:38:99 (sta gone)
> ath10k: mac vdev 13 stop (disassociated
> ath10k: mac vdev 13 down
> ath10k: mac vdev 13 cts_prot 0
> ath10k: mac vdev 13 slot_time 1
>
> ath10k: failed to set erp slot for vdev 13: -11
> ath10k: mac vdev 13 preamble 1n
> ath10k: failed to set preamble for vdev 13: -11
> ath10k: failed to set wmm params: -11
> ath10k: failed to set wmm params: -11

These logs are not enough. I'd love to see traces for this to see what
frames are actually submitted and when tx credits are replenished.

I also wonder if this can be somehow related to your FW changes to
allow connecting multiple client virtual interfaces to a single AP?


Michał

2014-04-11 06:32:02

by Kalle Valo

[permalink] [raw]
Subject: Re: [RFTv2 3/5] ath10k: rework peer accounting

Michal Kazior <[email protected]> writes:

> On 10 April 2014 08:59, Kalle Valo <[email protected]> wrote:
>> Michal Kazior <[email protected]> writes:
>>
>>> It was troublesome to iterate over peers and
>>> perform sleepable calls. This will be necessary
>>> for some upcomming changes to tx flushing.
>>>
>>> Make peer allocation and initial setup
>>> protected by both ar->conf_mutex and
>>> ar->data_lock. This way it's possible to iterate
>>> over peers with conf_mutex and call sleepable
>>> functions.
>>>
>>> Signed-off-by: Michal Kazior <[email protected]>
>>
>> First comments, but I need to read this much more carefully still:
>>
>>> --- a/drivers/net/wireless/ath/ath10k/core.h
>>> +++ b/drivers/net/wireless/ath/ath10k/core.h
>>> @@ -199,9 +199,14 @@ struct ath10k_dfs_stats {
>>> #define ATH10K_MAX_NUM_PEER_IDS (1 << 11) /* htt rx_desc limit */
>>>
>>> struct ath10k_peer {
>>> + /* protected by conf_mutex + data_lock */
>>> struct list_head list;
>>
>> This really needs a lot more documentation. And besides, don't we
>> actually want to protect struct ath10k::peers, not this?
>
> Parts of the structure needs to be locked differently.
>
> I suppose we can say we just protect everything with data_lock except
> `struct list_head list` which needs to be write-protected by both
> conf_mutex and data_lock. But then again, addr and vdev_id don't need
> any locking as they are immutable (and set before adding to the list).

I guess this depends from the point of view. In my opinion struct
ath10k::peers is the actual list and that's why the main documentation
needs to be above struct ath10k::peers field. And I would guess that's
there most people check there for documentation first.

But if we have special rules within struct ath10k_peer (excluding the
list field), we can document that within struct ath10k_peer.

--
Kalle Valo

2014-04-08 06:58:26

by Kalle Valo

[permalink] [raw]
Subject: Re: [RFT 4/4] ath10k: improve tx flushing

Michal Kazior <[email protected]> writes:

> Flushing was prone to timeouts when powersaving
> clients were involved that went missing while
> asleep. The common culprit was NullFunc frame
> being stuck in FW/HW queues.
>
> Introduce a two-pass flushing with WMI flush
> commands interleaved to force frame drops if
> necessary. This allows for a decreased flush
> timeout time and should get rid of some warnings.
>
> Signed-off-by: Michal Kazior <[email protected]>

With this patch I see a new checkpatch warning:

drivers/net/wireless/ath/ath10k/mac.c:3785: WARNING: line over 80 characters

--
Kalle Valo

2014-04-09 10:55:32

by Michal Kazior

[permalink] [raw]
Subject: [RFTv2 2/5] ath10k: fix wmi-htc tx credit starvation

Each WMI management Tx consumes 1 HTC Tx credit
and doesn't replenish it until the frame is
actually transmitted out of firmware's Tx queues.

If associated client was asleep and has gone out
of range then unicast frames won't be released for
FW/HW queues for a while (10 seconds per
observation). This means that if more management
frames are queued then HTC Tx credits are drained
to 0 and no other commands can be submitted
including beacons and peer removal.

This could in turn result in clients disconnecting
due to their beacon loss and may trigger spurious
sta kickouts because wmi peer removal command may
never reach firmware during disassociation.

This could happen, e.g. when disconnecting
client (hostapd inactivity) that has gone away
while asleep when acting as AP.

As a workaround flush frames that can potentialy
get stuck in FW Tx queues.

Signed-off-by: Michal Kazior <[email protected]>
---
drivers/net/wireless/ath/ath10k/mac.c | 96 +++++++++++++++++++++++++++++++++++
drivers/net/wireless/ath/ath10k/wmi.c | 5 --
drivers/net/wireless/ath/ath10k/wmi.h | 3 ++
3 files changed, 99 insertions(+), 5 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index 58ec5a7..5cfbf88 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2076,9 +2076,67 @@ void ath10k_mgmt_over_wmi_tx_purge(struct ath10k *ar)
}
}

+static void ath10k_mgmt_tx_flush(struct ath10k *ar, struct sk_buff *skb)
+{
+ struct ieee80211_hdr *hdr = (struct ieee80211_hdr *)skb->data;
+ struct ieee80211_sta *sta;
+ struct ath10k_vif *arvif;
+ u8 *da = ieee80211_get_DA(hdr);
+ u8 vdev_id = ATH10K_SKB_CB(skb)->vdev_id;
+ u32 bcn_intval = 0;
+ unsigned int msecs;
+ int ret;
+
+ lockdep_assert_held(&ar->conf_mutex);
+
+ if (!is_unicast_ether_addr(da))
+ return;
+
+ rcu_read_lock();
+ sta = ieee80211_find_sta_by_ifaddr(ar->hw, da, NULL);
+ rcu_read_unlock();
+
+ /*
+ * FW Tx queues can be paused only for associated peers. Since this is
+ * a workaround just assume worst case if station simply entry exists.
+ */
+ if (!sta)
+ return;
+
+ list_for_each_entry(arvif, &ar->arvifs, list) {
+ if (arvif->vdev_id == vdev_id) {
+ bcn_intval = arvif->beacon_interval;
+ break;
+ }
+ }
+
+ if (!bcn_intval)
+ return;
+
+ /*
+ * Wait 2 beacon intervals before flushing so stations that are
+ * asleep but are actually still in range have a chance to see
+ * updated PVB, wake up and fetch the frame. There's no other way of
+ * synchronizing other than sleeping because there's no tx completion
+ * indication event for WMI management tx.
+ */
+ msecs = 2 * arvif->beacon_interval * 1024 / 1000;
+ ath10k_dbg(ATH10K_DBG_MAC,
+ "mac flushing peer %pM on vdev %i mgmt tid for unicast mgmt (%d msecs)\n",
+ da, vdev_id, msecs);
+ msleep(msecs);
+
+ ret = ath10k_wmi_peer_flush(ar, vdev_id, da,
+ WMI_PEER_TID_MGMT_MASK);
+ if (ret)
+ ath10k_warn("failed to flush peer %pM mgmt tid: %d\n",
+ da, ret);
+}
+
void ath10k_mgmt_over_wmi_tx_work(struct work_struct *work)
{
struct ath10k *ar = container_of(work, struct ath10k, wmi_mgmt_tx_work);
+ struct ieee80211_tx_info *info;
struct sk_buff *skb;
int ret;

@@ -2087,12 +2145,50 @@ void ath10k_mgmt_over_wmi_tx_work(struct work_struct *work)
if (!skb)
break;

+ mutex_lock(&ar->conf_mutex);
+
ret = ath10k_wmi_mgmt_tx(ar, skb);
if (ret) {
ath10k_warn("failed to transmit management frame via WMI: %d\n",
ret);
+ mutex_unlock(&ar->conf_mutex);
ieee80211_free_txskb(ar->hw, skb);
+ continue;
}
+
+ /*
+ * Each WMI management Tx consumes 1 HTC Tx credit and doesn't
+ * replenish it until the frame is actually transmitted out of
+ * firmware's Tx queues.
+ *
+ * If associated client was asleep and has gone out of range
+ * then unicast frames won't be released for FW/HW queues for a
+ * while (10 seconds per observation). This means that if more
+ * management frames are queued then HTC Tx credits are drained
+ * to 0 and no other commands can be submitted including
+ * beacons and peer removal.
+ *
+ * This could in turn result in clients disconnecting due to
+ * their beacon loss and may trigger spurious sta kickouts
+ * because wmi peer removal command may never reach firmware
+ * during disassociation.
+ *
+ * This could happen, e.g. when disconnecting client that has
+ * gone away while asleep.
+ *
+ * As a workaround flush unicast management frames that can
+ * possibly be buffered.
+ *
+ * Note: This is a deficiency in design of WMI_MGMT_TX command.
+ */
+ ath10k_mgmt_tx_flush(ar, skb);
+
+ mutex_unlock(&ar->conf_mutex);
+
+ /* there's no way to get ACK so just assume it's acked */
+ info = IEEE80211_SKB_CB(skb);
+ info->flags |= IEEE80211_TX_STAT_ACK;
+ ieee80211_tx_status(ar->hw, skb);
}
}

diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c
index d4b48ef..3b270a4 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -637,7 +637,6 @@ int ath10k_wmi_mgmt_tx(struct ath10k *ar, struct sk_buff *skb)
struct wmi_mgmt_tx_cmd *cmd;
struct ieee80211_hdr *hdr;
struct sk_buff *wmi_skb;
- struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
int len;
u16 fc;

@@ -673,10 +672,6 @@ int ath10k_wmi_mgmt_tx(struct ath10k *ar, struct sk_buff *skb)
if (ret)
return ret;

- /* TODO: report tx status to mac80211 - temporary just ACK */
- info->flags |= IEEE80211_TX_STAT_ACK;
- ieee80211_tx_status_irqsafe(ar->hw, skb);
-
return ret;
}

diff --git a/drivers/net/wireless/ath/ath10k/wmi.h b/drivers/net/wireless/ath/ath10k/wmi.h
index ae83822..90fe2e9 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.h
+++ b/drivers/net/wireless/ath/ath10k/wmi.h
@@ -3844,6 +3844,9 @@ struct wmi_peer_delete_cmd {
struct wmi_mac_addr peer_macaddr;
} __packed;

+#define WMI_PEER_TID_MGMT 17
+#define WMI_PEER_TID_MGMT_MASK BIT(WMI_PEER_TID_MGMT)
+
struct wmi_peer_flush_tids_cmd {
__le32 vdev_id;
struct wmi_mac_addr peer_macaddr;
--
1.8.5.3


2014-04-09 10:55:36

by Michal Kazior

[permalink] [raw]
Subject: [RFTv2 5/5] ath10k: improve tx flushing

Flushing was prone to timeouts when powersaving
clients were involved that went missing while
asleep. The common culprit was NullFunc frame
being stuck in FW/HW queues.

Introduce a two-pass flushing with WMI flush
commands interleaved to force frame drops if
necessary. This allows for a decreased flush
timeout time and should get rid of some warnings.

Signed-off-by: Michal Kazior <[email protected]>
---
drivers/net/wireless/ath/ath10k/core.h | 2 +-
drivers/net/wireless/ath/ath10k/mac.c | 97 +++++++++++++++++++++++++++-------
drivers/net/wireless/ath/ath10k/wmi.h | 1 +
3 files changed, 81 insertions(+), 19 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/core.h b/drivers/net/wireless/ath/ath10k/core.h
index 61e325a..0d60e70 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -38,7 +38,7 @@

#define ATH10K_SCAN_ID 0
#define WMI_READY_TIMEOUT (5 * HZ)
-#define ATH10K_FLUSH_TIMEOUT_HZ (5*HZ)
+#define ATH10K_FLUSH_TIMEOUT_HZ (1*HZ)
#define ATH10K_NUM_CHANS 38

/* Antenna noise floor */
diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index 1d95500..dc176b5 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -3769,25 +3769,31 @@ static int ath10k_set_frag_threshold(struct ieee80211_hw *hw, u32 value)
return ret;
}

-static void ath10k_flush(struct ieee80211_hw *hw, u32 queues, bool drop)
+static int ath10k_flush_all_peer_tids(struct ath10k *ar)
{
- struct ath10k *ar = hw->priv;
- bool skip;
+ struct ath10k_peer *peer;
int ret;

- /* mac80211 doesn't care if we really xmit queued frames or not
- * we'll collect those frames either way if we stop/delete vdevs */
- if (drop)
- return;
-
- mutex_lock(&ar->conf_mutex);
+ lockdep_assert_held(&ar->conf_mutex);

- if (ar->state == ATH10K_STATE_WEDGED) {
- ret = -EBUSY;
- goto skip;
+ list_for_each_entry(peer, &ar->peers, list) {
+ ret = ath10k_wmi_peer_flush(ar, peer->vdev_id, peer->addr,
+ WMI_PEER_TID_ALL_MASK);
+ if (ret) {
+ ath10k_warn("failed to request peer %pM on vdev %i to flush %08x: %d\n",
+ peer->addr, peer->vdev_id,
+ WMI_PEER_TID_ALL_MASK, ret);
+ return ret;
+ }
}

- ret = wait_event_timeout(ar->htt.empty_tx_wq, ({
+ return 0;
+}
+
+static int ath10k_flush_wait(struct ath10k *ar)
+{
+ bool skip;
+ int ret = wait_event_timeout(ar->htt.empty_tx_wq, ({
bool htt_empty, wmi_empty;
unsigned long flags;

@@ -3805,16 +3811,71 @@ static void ath10k_flush(struct ieee80211_hw *hw, u32 queues, bool drop)
((htt_empty && wmi_empty) || skip);
}), ATH10K_FLUSH_TIMEOUT_HZ);

- if (ret <= 0 || skip)
- ath10k_warn("failed to flush transmit queue (skip %i ar-state %i): %i\n",
- skip, ar->state, ret);
+ if (ret == 0)
+ ret = -ETIMEDOUT;
+ else if (ret > 0)
+ ret = 0;
+
+ if (skip) {
+ ath10k_warn("ignoring flushing result because hardware is wedged\n");
+ ret = -EBUSY;
+ }
+
+ return ret;
+}
+
+static void ath10k_flush(struct ieee80211_hw *hw, u32 queues, bool drop)
+{
+ struct ath10k *ar = hw->priv;
+ int ret;
+
+ mutex_lock(&ar->conf_mutex);
+
+ if (drop) {
+ ath10k_mgmt_over_wmi_tx_purge(ar);
+
+ ret = ath10k_flush_all_peer_tids(ar);
+ if (ret) {
+ ath10k_warn("failed to flush all peer tids: %d\n", ret);
+ goto out;
+ }

-skip:
+ goto out;
+ }
+
+ if (ar->state == ATH10K_STATE_WEDGED) {
+ ath10k_warn("skipping flushing because hardware is wedged\n");
+ ret = -EBUSY;
+ goto out;
+ }
+
+ ret = ath10k_flush_wait(ar);
+ if (ret) {
+ ath10k_dbg(ATH10K_DBG_MAC,
+ "failed to wait for tx to flush: %d, forcing\n",
+ ret);
+
+ ath10k_mgmt_over_wmi_tx_purge(ar);
+
+ ret = ath10k_flush_all_peer_tids(ar);
+ if (ret) {
+ ath10k_warn("failed to flush all peer tids: %d\n", ret);
+ goto out;
+ }
+
+ ret = ath10k_flush_wait(ar);
+ if (ret) {
+ ath10k_warn("failed to flush tx: %d\n", ret);
+ goto out;
+ }
+ }
+
+out:
mutex_unlock(&ar->conf_mutex);

/* empty mgmt tx queue doesn't mean mgmt tx is flushed because the last
* frame still may be processed by a worker */
- if (ret > 0 && !skip)
+ if (ret == 0)
cancel_work_sync(&ar->wmi_mgmt_tx_work);
}

diff --git a/drivers/net/wireless/ath/ath10k/wmi.h b/drivers/net/wireless/ath/ath10k/wmi.h
index 90fe2e9..2b37c4a 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.h
+++ b/drivers/net/wireless/ath/ath10k/wmi.h
@@ -3846,6 +3846,7 @@ struct wmi_peer_delete_cmd {

#define WMI_PEER_TID_MGMT 17
#define WMI_PEER_TID_MGMT_MASK BIT(WMI_PEER_TID_MGMT)
+#define WMI_PEER_TID_ALL_MASK 0xFFFFFFFF

struct wmi_peer_flush_tids_cmd {
__le32 vdev_id;
--
1.8.5.3


2014-04-04 11:44:16

by Michal Kazior

[permalink] [raw]
Subject: [RFT 4/4] ath10k: improve tx flushing

Flushing was prone to timeouts when powersaving
clients were involved that went missing while
asleep. The common culprit was NullFunc frame
being stuck in FW/HW queues.

Introduce a two-pass flushing with WMI flush
commands interleaved to force frame drops if
necessary. This allows for a decreased flush
timeout time and should get rid of some warnings.

Signed-off-by: Michal Kazior <[email protected]>
---
drivers/net/wireless/ath/ath10k/core.h | 2 +-
drivers/net/wireless/ath/ath10k/mac.c | 92 +++++++++++++++++++++++++++-------
drivers/net/wireless/ath/ath10k/wmi.h | 1 +
3 files changed, 76 insertions(+), 19 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/core.h b/drivers/net/wireless/ath/ath10k/core.h
index 61e325a..0d60e70 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -38,7 +38,7 @@

#define ATH10K_SCAN_ID 0
#define WMI_READY_TIMEOUT (5 * HZ)
-#define ATH10K_FLUSH_TIMEOUT_HZ (5*HZ)
+#define ATH10K_FLUSH_TIMEOUT_HZ (1*HZ)
#define ATH10K_NUM_CHANS 38

/* Antenna noise floor */
diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index 47b3574..77e1370 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -3753,25 +3753,30 @@ static int ath10k_set_frag_threshold(struct ieee80211_hw *hw, u32 value)
return ret;
}

-static void ath10k_flush(struct ieee80211_hw *hw, u32 queues, bool drop)
+static int ath10k_flush_all_peer_tids(struct ath10k *ar)
{
- struct ath10k *ar = hw->priv;
- bool skip;
+ struct ath10k_peer *peer;
int ret;

- /* mac80211 doesn't care if we really xmit queued frames or not
- * we'll collect those frames either way if we stop/delete vdevs */
- if (drop)
- return;
-
- mutex_lock(&ar->conf_mutex);
+ lockdep_assert_held(&ar->conf_mutex);

- if (ar->state == ATH10K_STATE_WEDGED) {
- ret = -EBUSY;
- goto skip;
+ list_for_each_entry(peer, &ar->peers, list) {
+ ret = ath10k_wmi_peer_flush(ar, peer->vdev_id, peer->addr,
+ WMI_PEER_TID_ALL_MASK);
+ if (ret) {
+ ath10k_warn("failed to request peer %pM on vdev %i to flush %08x: %d\n",
+ peer->addr, peer->vdev_id, WMI_PEER_TID_ALL_MASK, ret);
+ return ret;
+ }
}

- ret = wait_event_timeout(ar->htt.empty_tx_wq, ({
+ return 0;
+}
+
+static int ath10k_flush_wait(struct ath10k *ar)
+{
+ bool skip;
+ int ret = wait_event_timeout(ar->htt.empty_tx_wq, ({
bool htt_empty, wmi_empty;
unsigned long flags;

@@ -3789,16 +3794,67 @@ static void ath10k_flush(struct ieee80211_hw *hw, u32 queues, bool drop)
((htt_empty && wmi_empty) || skip);
}), ATH10K_FLUSH_TIMEOUT_HZ);

- if (ret <= 0 || skip)
- ath10k_warn("failed to flush transmit queue (skip %i ar-state %i): %i\n",
- skip, ar->state, ret);
+ if (ret == 0)
+ ret = -ETIMEDOUT;
+ else if (ret > 0)
+ ret = 0;
+
+ if (skip) {
+ ath10k_warn("ignoring flushing result because hardware is wedged\n");
+ ret = -EBUSY;
+ }
+
+ return ret;
+}
+
+static void ath10k_flush(struct ieee80211_hw *hw, u32 queues, bool drop)
+{
+ struct ath10k *ar = hw->priv;
+ int ret;
+
+ mutex_lock(&ar->conf_mutex);
+
+ if (drop) {
+ ret = ath10k_flush_all_peer_tids(ar);
+ if (ret) {
+ ath10k_warn("failed to flush all peer tids: %d\n", ret);
+ goto out;
+ }
+
+ goto out;
+ }
+
+ if (ar->state == ATH10K_STATE_WEDGED) {
+ ath10k_warn("skipping flushing because hardware is wedged\n");
+ ret = -EBUSY;
+ goto out;
+ }
+
+ ret = ath10k_flush_wait(ar);
+ if (ret) {
+ ath10k_dbg(ATH10K_DBG_MAC,
+ "failed to wait for tx to flush: %d, forcing\n",
+ ret);
+
+ ret = ath10k_flush_all_peer_tids(ar);
+ if (ret) {
+ ath10k_warn("failed to flush all peer tids: %d\n", ret);
+ goto out;
+ }
+
+ ret = ath10k_flush_wait(ar);
+ if (ret) {
+ ath10k_warn("failed to flush tx: %d\n", ret);
+ goto out;
+ }
+ }

-skip:
+out:
mutex_unlock(&ar->conf_mutex);

/* empty mgmt tx queue doesn't mean mgmt tx is flushed because the last
* frame still may be processed by a worker */
- if (ret > 0)
+ if (ret == 0)
cancel_work_sync(&ar->wmi_mgmt_tx_work);
}

diff --git a/drivers/net/wireless/ath/ath10k/wmi.h b/drivers/net/wireless/ath/ath10k/wmi.h
index 90fe2e9..2b37c4a 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.h
+++ b/drivers/net/wireless/ath/ath10k/wmi.h
@@ -3846,6 +3846,7 @@ struct wmi_peer_delete_cmd {

#define WMI_PEER_TID_MGMT 17
#define WMI_PEER_TID_MGMT_MASK BIT(WMI_PEER_TID_MGMT)
+#define WMI_PEER_TID_ALL_MASK 0xFFFFFFFF

struct wmi_peer_flush_tids_cmd {
__le32 vdev_id;
--
1.8.5.3


2014-04-09 10:55:34

by Michal Kazior

[permalink] [raw]
Subject: [RFTv2 3/5] ath10k: rework peer accounting

It was troublesome to iterate over peers and
perform sleepable calls. This will be necessary
for some upcomming changes to tx flushing.

Make peer allocation and initial setup
protected by both ar->conf_mutex and
ar->data_lock. This way it's possible to iterate
over peers with conf_mutex and call sleepable
functions.

Signed-off-by: Michal Kazior <[email protected]>
---
drivers/net/wireless/ath/ath10k/core.h | 5 +++
drivers/net/wireless/ath/ath10k/mac.c | 69 +++++++++++++++++++++++++++++++---
drivers/net/wireless/ath/ath10k/txrx.c | 29 +++++++-------
3 files changed, 81 insertions(+), 22 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/core.h b/drivers/net/wireless/ath/ath10k/core.h
index 8edd6da..61e325a 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -199,9 +199,14 @@ struct ath10k_dfs_stats {
#define ATH10K_MAX_NUM_PEER_IDS (1 << 11) /* htt rx_desc limit */

struct ath10k_peer {
+ /* protected by conf_mutex + data_lock */
struct list_head list;
+
+ /* protected by conf_mutex */
int vdev_id;
u8 addr[ETH_ALEN];
+
+ /* protected by data_lock */
DECLARE_BITMAP(peer_ids, ATH10K_MAX_NUM_PEER_IDS);
struct ieee80211_key_conf *keys[WMI_MAX_KEY_INDEX + 1];
};
diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index 5cfbf88..adb5da1 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -318,28 +318,61 @@ static u8 ath10k_parse_mpdudensity(u8 mpdudensity)

static int ath10k_peer_create(struct ath10k *ar, u32 vdev_id, const u8 *addr)
{
+ struct ath10k_peer *peer;
int ret;

lockdep_assert_held(&ar->conf_mutex);

+ peer = ath10k_peer_find(ar, vdev_id, addr);
+ if (peer) {
+ ath10k_warn("peer %pM on vdev %i already exists\n",
+ addr, vdev_id);
+ return -EINVAL;
+ }
+
+ peer = kzalloc(sizeof(*peer), GFP_KERNEL);
+ if (!peer) {
+ ath10k_warn("failed to allocate peer %pM on vdev %i: not enough memory\n",
+ addr, vdev_id);
+ return -ENOMEM;
+ }
+
+ peer->vdev_id = vdev_id;
+ memcpy(peer->addr, addr, ETH_ALEN);
+
+ spin_lock_bh(&ar->data_lock);
+ list_add(&peer->list, &ar->peers);
+ spin_unlock_bh(&ar->data_lock);
+
ret = ath10k_wmi_peer_create(ar, vdev_id, addr);
if (ret) {
ath10k_warn("failed to create wmi peer %pM on vdev %i: %i\n",
addr, vdev_id, ret);
- return ret;
+ goto err_free;
}

ret = ath10k_wait_for_peer_created(ar, vdev_id, addr);
if (ret) {
ath10k_warn("failed to wait for created wmi peer %pM on vdev %i: %i\n",
addr, vdev_id, ret);
- return ret;
+ goto err_free;
}
spin_lock_bh(&ar->data_lock);
ar->num_peers++;
spin_unlock_bh(&ar->data_lock);

return 0;
+
+err_free:
+ spin_lock_bh(&ar->data_lock);
+ list_del(&peer->list);
+ /* very unlikely, but check anyway */
+ if (!bitmap_empty(peer->peer_ids, ATH10K_MAX_NUM_PEER_IDS))
+ ath10k_warn("removing peer %pM on vdev %i still being mapped in firmware\n",
+ addr, vdev_id);
+ spin_unlock_bh(&ar->data_lock);
+ kfree(peer);
+ return ret;
}

static int ath10k_mac_set_kickout(struct ath10k_vif *arvif)
@@ -416,22 +449,46 @@ static int ath10k_mac_set_frag(struct ath10k_vif *arvif, u32 value)

static int ath10k_peer_delete(struct ath10k *ar, u32 vdev_id, const u8 *addr)
{
+ struct ath10k_peer *peer;
int ret;

lockdep_assert_held(&ar->conf_mutex);

+ peer = ath10k_peer_find(ar, vdev_id, addr);
+ if (!peer) {
+ ath10k_warn("failed to lookup peer %pM on vdev %i\n",
+ addr, vdev_id);
+ return -ENOENT;
+ }
+
ret = ath10k_wmi_peer_delete(ar, vdev_id, addr);
- if (ret)
- return ret;
+ if (ret) {
+ ath10k_warn("failed to request wmi peer %pM on vdev %i removal: %d\n",
+ addr, vdev_id, ret);
+ goto out;
+ }

ret = ath10k_wait_for_peer_deleted(ar, vdev_id, addr);
- if (ret)
- return ret;
+ if (ret) {
+ ath10k_warn("failed to wait for wmi peer %pM on vdev %i removal: %d\n",
+ addr, vdev_id, ret);
+ goto out;
+ }

spin_lock_bh(&ar->data_lock);
ar->num_peers--;
spin_unlock_bh(&ar->data_lock);

+out:
+ spin_lock_bh(&ar->data_lock);
+ list_del(&peer->list);
+ if (!bitmap_empty(peer->peer_ids, ATH10K_MAX_NUM_PEER_IDS))
+ ath10k_warn("removing peer %pM on vdev %i still being mapped in firmware\n",
+ addr, vdev_id);
+ spin_unlock_bh(&ar->data_lock);
+
+ kfree(peer);
+
return 0;
}

diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
index 82669a7..d802569 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -100,13 +100,13 @@ exit:
wake_up(&htt->empty_tx_wq);
}

+/* hold conf_mutex for simple iteration, or conf_mutex+data_lock for
+ * modifications */
struct ath10k_peer *ath10k_peer_find(struct ath10k *ar, int vdev_id,
const u8 *addr)
{
struct ath10k_peer *peer;

- lockdep_assert_held(&ar->data_lock);
-
list_for_each_entry(peer, &ar->peers, list) {
if (peer->vdev_id != vdev_id)
continue;
@@ -139,10 +139,14 @@ static int ath10k_wait_for_peer_common(struct ath10k *ar, int vdev_id,
int ret;

ret = wait_event_timeout(ar->peer_mapping_wq, ({
- bool mapped;
+ struct ath10k_peer *peer;
+ bool mapped = false;

spin_lock_bh(&ar->data_lock);
- mapped = !!ath10k_peer_find(ar, vdev_id, addr);
+ peer = ath10k_peer_find(ar, vdev_id, addr);
+ if (peer)
+ mapped = !bitmap_empty(peer->peer_ids,
+ ATH10K_MAX_NUM_PEER_IDS);
spin_unlock_bh(&ar->data_lock);

mapped == expect_mapped;
@@ -173,20 +177,16 @@ void ath10k_peer_map_event(struct ath10k_htt *htt,
spin_lock_bh(&ar->data_lock);
peer = ath10k_peer_find(ar, ev->vdev_id, ev->addr);
if (!peer) {
- peer = kzalloc(sizeof(*peer), GFP_ATOMIC);
- if (!peer)
- goto exit;
-
- peer->vdev_id = ev->vdev_id;
- memcpy(peer->addr, ev->addr, ETH_ALEN);
- list_add(&peer->list, &ar->peers);
- wake_up(&ar->peer_mapping_wq);
+ ath10k_warn("failed to map peer %pM on vdev %i: no such entry\n",
+ ev->addr, ev->vdev_id);
+ goto exit;
}

ath10k_dbg(ATH10K_DBG_HTT, "htt peer map vdev %d peer %pM id %d\n",
ev->vdev_id, ev->addr, ev->peer_id);

set_bit(ev->peer_id, peer->peer_ids);
+ wake_up(&ar->peer_mapping_wq);
exit:
spin_unlock_bh(&ar->data_lock);
}
@@ -210,11 +210,8 @@ void ath10k_peer_unmap_event(struct ath10k_htt *htt,

clear_bit(ev->peer_id, peer->peer_ids);

- if (bitmap_empty(peer->peer_ids, ATH10K_MAX_NUM_PEER_IDS)) {
- list_del(&peer->list);
- kfree(peer);
+ if (bitmap_empty(peer->peer_ids, ATH10K_MAX_NUM_PEER_IDS))
wake_up(&ar->peer_mapping_wq);
- }

exit:
spin_unlock_bh(&ar->data_lock);
--
1.8.5.3


2014-04-04 18:31:10

by Dave Taht

[permalink] [raw]
Subject: Re: [RFT 0/4] ath10k: fix flushing and tx stalls

On Fri, Apr 4, 2014 at 7:49 AM, Ben Greear <[email protected]> wrote:
> On 04/04/2014 04:37 AM, Michal Kazior wrote:
>>
>> Hi,
>>
>> After digging around I've found what seems to be
>> the problem with WMI Tx credit starvation and
>> inability to properly flush Tx in ath10k_flush().
>>
>> Long story short: if a client that was asleep (as
>> per what firmware thinks) goes out of range (or
>> just stops responding) then Tx rots in FW/HW
>> queues for a few seconds before it's discarded.
>> For WMI Tx credits this means management frames
>> eat up Tx credits for a few seconds (causing other
>> WMI commands to timeout and return -EAGAIN/-11).
>> For HTT Tx this means NullFunc frames would get
>> stuck for a few seconds before completion was
>> received.
>>
>> @Ben: Can you check if this helps you? I tested
>> this briefly and at least [1/4] seems fixes the
>> WMI Tx starvation. I'm hoping patches 2-4 help
>> with your ath10k_flush() failures which I haven't
>> been successfull in reproducing (but have observed
>> improvement with purging some frames out of FW/HW
>> queues).
>
>
> I'm out of office for a bit, but will test this as
> soon as I'm back.
>
> Thanks for looking into this!
>
> In general, would it make more sense to have a few more tx credits
> available to mitigate the slow-to-be-processed buffers?

It would help more to organize the queueing so that there is no head
of line per sta blocking, smarter (interleaved) per station retries,
and sane packet drops and congestion notifications delivered sooner -
certainly dropping packets after 250ms of delay, and preferably much,
much less when a client is not in sleep mode....

To preach:

https://tools.ietf.org/html/draft-hoeiland-joergensen-aqm-fq-codel-00

https://datatracker.ietf.org/doc/draft-nichols-tsvwg-codel/

http://www.bufferbloat.net/projects/cerowrt/wiki/Fq_Codel_on_Wireless


>
> Thanks,
> Ben
>
>
> --
> Ben Greear <[email protected]>
> Candela Technologies Inc http://www.candelatech.com
>
>
>
> _______________________________________________
> ath10k mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/ath10k



--
Dave T?ht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html

2014-04-08 05:51:46

by Michal Kazior

[permalink] [raw]
Subject: Re: [RFT 0/4] ath10k: fix flushing and tx stalls

On 8 April 2014 04:31, Ben Greear <[email protected]> wrote:
> On 04/07/2014 02:11 AM, Michal Kazior wrote:
>
>> These logs are not enough. I'd love to see traces for this to see what
>> frames are actually submitted and when tx credits are replenished.
>>
>> I also wonder if this can be somehow related to your FW changes to
>> allow connecting multiple client virtual interfaces to a single AP?
>
>
> I think it is unlikely due to my firmware changes...little of that touched
> the handling of management frames. It might very well be a basic problem in
> either the firmware or driver when using multiple station VIFS. I think
> that
> aside from my testing that code has not been used much.
>
> Note my followup email that problems started with patch 3/4...not sure you
> saw that one or not. I saw similar failure to associate & get DHCP (and
> slow/hung user-space) without the kernel error logs.

I did get that mail. It's just that it makes little sense. Patch 3/4
simply extends ath10k_flush() to wait for management frames. It
shouldn't change behaviour of tx credit starvation which appears to be
the case in the logs you provided.


> I've added the first two patches to my tree and will continue to run
> with them since they do not appear to cause problems so far.

Patch 2/4 is probably meaningless in itself. It is just a preparation for 4/4.

Anyway, thanks for testing!


Michał

2014-04-10 06:59:14

by Kalle Valo

[permalink] [raw]
Subject: Re: [RFTv2 3/5] ath10k: rework peer accounting

Michal Kazior <[email protected]> writes:

> It was troublesome to iterate over peers and
> perform sleepable calls. This will be necessary
> for some upcomming changes to tx flushing.
>
> Make peer allocation and initial setup
> protected by both ar->conf_mutex and
> ar->data_lock. This way it's possible to iterate
> over peers with conf_mutex and call sleepable
> functions.
>
> Signed-off-by: Michal Kazior <[email protected]>

First comments, but I need to read this much more carefully still:

> --- a/drivers/net/wireless/ath/ath10k/core.h
> +++ b/drivers/net/wireless/ath/ath10k/core.h
> @@ -199,9 +199,14 @@ struct ath10k_dfs_stats {
> #define ATH10K_MAX_NUM_PEER_IDS (1 << 11) /* htt rx_desc limit */
>
> struct ath10k_peer {
> + /* protected by conf_mutex + data_lock */
> struct list_head list;

This really needs a lot more documentation. And besides, don't we
actually want to protect struct ath10k::peers, not this?

> +/* hold conf_mutex for simple iteration, or conf_mutex+data_lock for
> + * modifications */
> struct ath10k_peer *ath10k_peer_find(struct ath10k *ar, int vdev_id,
> const u8 *addr)
> {
> struct ath10k_peer *peer;
>
> - lockdep_assert_held(&ar->data_lock);
> -
> list_for_each_entry(peer, &ar->peers, list) {
> if (peer->vdev_id != vdev_id)
> continue;

The comment here makes me suspicious. How can we safely iterate the list
if we don't take data_lock? Doesn't it mean that the list can change
while we have conf_mutex?

--
Kalle Valo

2014-04-04 11:44:15

by Michal Kazior

[permalink] [raw]
Subject: [RFT 3/4] ath10k: wait for mgmt tx when flushing too

The 10.1 firmware uses WMI to Tx management frames
but ath10k_flush() didn't care. This could lead to
some frames not being sent.

Although there are no Tx completions for WMI
mangement frames it's still better to at least
wait for ath10k queue to be drained.

Signed-off-by: Michal Kazior <[email protected]>
---
drivers/net/wireless/ath/ath10k/mac.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index 9af3ed5..47b3574 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -3766,19 +3766,27 @@ static void ath10k_flush(struct ieee80211_hw *hw, u32 queues, bool drop)

mutex_lock(&ar->conf_mutex);

- if (ar->state == ATH10K_STATE_WEDGED)
+ if (ar->state == ATH10K_STATE_WEDGED) {
+ ret = -EBUSY;
goto skip;
+ }

ret = wait_event_timeout(ar->htt.empty_tx_wq, ({
- bool empty;
+ bool htt_empty, wmi_empty;
+ unsigned long flags;

spin_lock_bh(&ar->htt.tx_lock);
- empty = (ar->htt.num_pending_tx == 0);
+ htt_empty = (ar->htt.num_pending_tx == 0);
spin_unlock_bh(&ar->htt.tx_lock);

+ spin_lock_irqsave(&ar->wmi_mgmt_tx_queue.lock, flags);
+ wmi_empty = skb_queue_len(&ar->wmi_mgmt_tx_queue) == 0;
+ spin_unlock_irqrestore(&ar->wmi_mgmt_tx_queue.lock,
+ flags);
+
skip = (ar->state == ATH10K_STATE_WEDGED);

- (empty || skip);
+ ((htt_empty && wmi_empty) || skip);
}), ATH10K_FLUSH_TIMEOUT_HZ);

if (ret <= 0 || skip)
@@ -3787,6 +3795,11 @@ static void ath10k_flush(struct ieee80211_hw *hw, u32 queues, bool drop)

skip:
mutex_unlock(&ar->conf_mutex);
+
+ /* empty mgmt tx queue doesn't mean mgmt tx is flushed because the last
+ * frame still may be processed by a worker */
+ if (ret > 0)
+ cancel_work_sync(&ar->wmi_mgmt_tx_work);
}

/* TODO: Implement this function properly
--
1.8.5.3


2014-04-10 06:56:40

by Michal Kazior

[permalink] [raw]
Subject: Re: [RFTv2 3/5] ath10k: rework peer accounting

On 10 April 2014 08:50, Kalle Valo <[email protected]> wrote:
> Michal Kazior <[email protected]> writes:
>
>> It was troublesome to iterate over peers and
>> perform sleepable calls. This will be necessary
>> for some upcomming changes to tx flushing.
>>
>> Make peer allocation and initial setup
>> protected by both ar->conf_mutex and
>> ar->data_lock. This way it's possible to iterate
>> over peers with conf_mutex and call sleepable
>> functions.
>>
>> Signed-off-by: Michal Kazior <[email protected]>
>
> Sorry, but I'm not really getting your idea here of using both
> conf_mutex and data_lock to protect ar->peers. Can you elaborate more,
> please? Why can't we just use conf_mutex then? What am I missing?

Originally ar->peers was protected by data_lock. The reason for that
the list is accessed from a tasklet context for htt peer map/unmap
events. Originally those events allocated/freed peers. This however is
cumbersome to work with when you want to iterate over peers and issue
blocking/sleepable calls (e.g. flush tx via wmi).

Having conf_mutex-only protected conf_mutex would require htt peer
map/unmap event processing to be moved into sleepable context (e.g.
into a worker).

That's why I've made ar->peers being write-protected by both so that
it can be read while holding any of these locks.

I'll try to make the commit message clearer.


Michał

2014-04-10 07:43:14

by Michal Kazior

[permalink] [raw]
Subject: Re: [RFTv2 3/5] ath10k: rework peer accounting

On 10 April 2014 09:18, Kalle Valo <[email protected]> wrote:
> Michal Kazior <[email protected]> writes:
>
>>>> +/* hold conf_mutex for simple iteration, or conf_mutex+data_lock for
>>>> + * modifications */
>>>> struct ath10k_peer *ath10k_peer_find(struct ath10k *ar, int vdev_id,
>>>> const u8 *addr)
>>>> {
>>>> struct ath10k_peer *peer;
>>>>
>>>> - lockdep_assert_held(&ar->data_lock);
>>>> -
>>>> list_for_each_entry(peer, &ar->peers, list) {
>>>> if (peer->vdev_id != vdev_id)
>>>> continue;
>>>
>>> The comment here makes me suspicious. How can we safely iterate the list
>>> if we don't take data_lock? Doesn't it mean that the list can change
>>> while we have conf_mutex?
>>
>> The idea is you need BOTH locks to modify the list structure, but you
>> need only one of them to iterate over the list safely and
>> consistently. This means writer will not alter the list structure
>> until there are no readers.
>
> Still not understanding this. Why not then use conf_mutex always, why do
> we need data_lock at all?

htt peer map/unmap which is in tasklet context needs to iterate over
the list. It can't take conf_mutex.


> Or are you saying that one can iterate the list by either taking
> conf_mutex or by taking data_lock, depending on context?

Yes.


Michał

2014-04-10 05:10:36

by Michal Kazior

[permalink] [raw]
Subject: Re: [RFTv2 0/5] ath10k: ath10k: fix flushing and tx stalls

On 10 April 2014 01:58, Ben Greear <[email protected]> wrote:
> On 04/09/2014 02:46 PM, Ben Greear wrote:
>> Here's another log snippet with these 5 patches (and lots more
>> mostly non ath10k patches of my own) applied:
>
> And another one, this time with more debugging enabled.
> The 0x7110XXXX numbers indicate the command-id (the XXXX part
> is the cmd id).
>
> After this below, I see a debug-log message come from
> the firmware, and then nothing else. I had added a sort
> of keep-alive message in the firmware, and I do not see that
> in my logs, so probably firmware is wedged in such a way that
> it cannot or will not send packets to the host at this point.
>
> I had chased this sort of problem previously, and ended up
> with a hack to reset firmware when the flush failed twice.
> I backed that out when applying your patches, but I guess
> it is still needed.

Then this looks like a different issue from what I've been trying to
fix actually.

In my case when acting as AP it's possible to get WMI mgmt tx frames
stuck in FW queues when sleeping client stops responding for about 10
seconds. If you use up all tx credits (the multitude of 2 that there
are :-) beaconing stops and everything just fails.


> ath10k: ep 2 got 1 credits tot 2
> ath10k: mac vdev 20 start 04:f0:21:03:38:99
> ath10k: mac vdev 20 start center_freq 5180 phymode 11ac-vht80

> ath10k: ep 2 used 1 credits, remaining 1 dbg 1896910867 (0x71109013)

I suppose this print is located in ath10k_htc_send()?


> ath10k: ep 2 got 1 credits tot 2
> sta219: send auth to 04:f0:21:03:38:99 (try 1/3) at: 1397086238.721985
> ath10k: ep 2 used 1 credits, remaining 1 dbg 1896910888 (0x71109028)
> ath10k: mac flushing peer 04:f0:21:03:38:99 on vdev 20 mgmt tid for unicast mgmt (204 msecs)
> ath10k: ep 2 used 1 credits, remaining 0 dbg 1896910878 (0x7110901e)
> ath10k: Creating vdev id: 22 map: 12582912
> ath10k: mac vdev create 22 (add interface) type 2 subtype 0
> sta219: send auth to 04:f0:21:03:38:99 (try 2/3) at: 1397086239.28088
> [firmware logging msg]
> ath10k: failed to create WMI vdev 22: -11

Hmm.. If I read this correctly it means that MGMT_TX and
PEER_FLUSH_TIDS commands are both stuck in firmware. This most likely
means firmware stops processing everything altogether. Having HTC
debug prints from ath10k_htc_notify_tx_completion() could provide more
insight perhaps. I suspect MGMT_TX is the trigger in all cases.

I'm still suspicious of your firmware changes. You connect multiple
stations to the exact same AP. Is peer mapping working correctly? Are
tid queues mapped correctly in all cases? Perhaps there's some kind of
inconsistency that leads to this mess? I think firmware wasn't
originally designed to support your usecase. Or maybe firmware just
breaks when you try to run a hundred or so of vdevs :-D


Michał

2014-04-04 11:44:15

by Michal Kazior

[permalink] [raw]
Subject: [RFT 2/4] ath10k: rework peer accounting

It was troublesome to iterate over peers within a
sleepable context (e.g. while holding conf_mutex).
This will be necessary for some upcomming changes
to tx flushing.

Making peer allocation and initial setup
ar->conf_mutex bound instead ar->data_lock solves
the problem.

Signed-off-by: Michal Kazior <[email protected]>
---
drivers/net/wireless/ath/ath10k/core.h | 5 +++
drivers/net/wireless/ath/ath10k/mac.c | 69 +++++++++++++++++++++++++++++++---
drivers/net/wireless/ath/ath10k/txrx.c | 29 +++++++-------
3 files changed, 81 insertions(+), 22 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/core.h b/drivers/net/wireless/ath/ath10k/core.h
index 8edd6da..61e325a 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -199,9 +199,14 @@ struct ath10k_dfs_stats {
#define ATH10K_MAX_NUM_PEER_IDS (1 << 11) /* htt rx_desc limit */

struct ath10k_peer {
+ /* protected by conf_mutex + data_lock */
struct list_head list;
+
+ /* protected by conf_mutex */
int vdev_id;
u8 addr[ETH_ALEN];
+
+ /* protected by data_lock */
DECLARE_BITMAP(peer_ids, ATH10K_MAX_NUM_PEER_IDS);
struct ieee80211_key_conf *keys[WMI_MAX_KEY_INDEX + 1];
};
diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index 25362ef..9af3ed5 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -318,28 +318,61 @@ static u8 ath10k_parse_mpdudensity(u8 mpdudensity)

static int ath10k_peer_create(struct ath10k *ar, u32 vdev_id, const u8 *addr)
{
+ struct ath10k_peer *peer;
int ret;

lockdep_assert_held(&ar->conf_mutex);

+ peer = ath10k_peer_find(ar, vdev_id, addr);
+ if (peer) {
+ ath10k_warn("peer %pM on vdev %i already exists\n",
+ addr, vdev_id);
+ return -EINVAL;
+ }
+
+ peer = kzalloc(sizeof(*peer), GFP_KERNEL);
+ if (!peer) {
+ ath10k_warn("failed to allocate peer %pM on vdev %i: not enough memory\n",
+ addr, vdev_id);
+ return -ENOMEM;
+ }
+
+ peer->vdev_id = vdev_id;
+ memcpy(peer->addr, addr, ETH_ALEN);
+
+ spin_lock_bh(&ar->data_lock);
+ list_add(&peer->list, &ar->peers);
+ spin_unlock_bh(&ar->data_lock);
+
ret = ath10k_wmi_peer_create(ar, vdev_id, addr);
if (ret) {
ath10k_warn("failed to create wmi peer %pM on vdev %i: %i\n",
addr, vdev_id, ret);
- return ret;
+ goto err_free;
}

ret = ath10k_wait_for_peer_created(ar, vdev_id, addr);
if (ret) {
ath10k_warn("failed to wait for created wmi peer %pM on vdev %i: %i\n",
addr, vdev_id, ret);
- return ret;
+ goto err_free;
}
spin_lock_bh(&ar->data_lock);
ar->num_peers++;
spin_unlock_bh(&ar->data_lock);

return 0;
+
+err_free:
+ spin_lock_bh(&ar->data_lock);
+ list_del(&peer->list);
+ /* very unlikely, but check anyway */
+ if (!bitmap_empty(peer->peer_ids, ATH10K_MAX_NUM_PEER_IDS))
+ ath10k_warn("removing peer %pM on vdev %i still being mapped in firmware\n",
+ addr, vdev_id);
+ spin_unlock_bh(&ar->data_lock);
+ kfree(peer);
+ return ret;
}

static int ath10k_mac_set_kickout(struct ath10k_vif *arvif)
@@ -416,22 +449,46 @@ static int ath10k_mac_set_frag(struct ath10k_vif *arvif, u32 value)

static int ath10k_peer_delete(struct ath10k *ar, u32 vdev_id, const u8 *addr)
{
+ struct ath10k_peer *peer;
int ret;

lockdep_assert_held(&ar->conf_mutex);

+ peer = ath10k_peer_find(ar, vdev_id, addr);
+ if (!peer) {
+ ath10k_warn("failed to lookup peer %pM on vdev %i\n",
+ addr, vdev_id);
+ return -ENOENT;
+ }
+
ret = ath10k_wmi_peer_delete(ar, vdev_id, addr);
- if (ret)
- return ret;
+ if (ret) {
+ ath10k_warn("failed to request wmi peer %pM on vdev %i removal: %d\n",
+ addr, vdev_id, ret);
+ goto out;
+ }

ret = ath10k_wait_for_peer_deleted(ar, vdev_id, addr);
- if (ret)
- return ret;
+ if (ret) {
+ ath10k_warn("failed to wait for wmi peer %pM on vdev %i removal: %d\n",
+ addr, vdev_id, ret);
+ goto out;
+ }

spin_lock_bh(&ar->data_lock);
ar->num_peers--;
spin_unlock_bh(&ar->data_lock);

+out:
+ spin_lock_bh(&ar->data_lock);
+ list_del(&peer->list);
+ if (!bitmap_empty(peer->peer_ids, ATH10K_MAX_NUM_PEER_IDS))
+ ath10k_warn("removing peer %pM on vdev %i still being mapped in firmware\n",
+ addr, vdev_id);
+ spin_unlock_bh(&ar->data_lock);
+
+ kfree(peer);
+
return 0;
}

diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
index 82669a7..d802569 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -100,13 +100,13 @@ exit:
wake_up(&htt->empty_tx_wq);
}

+/* hold conf_mutex for simple iteration, or conf_mutex+data_lock for
+ * modifications */
struct ath10k_peer *ath10k_peer_find(struct ath10k *ar, int vdev_id,
const u8 *addr)
{
struct ath10k_peer *peer;

- lockdep_assert_held(&ar->data_lock);
-
list_for_each_entry(peer, &ar->peers, list) {
if (peer->vdev_id != vdev_id)
continue;
@@ -139,10 +139,14 @@ static int ath10k_wait_for_peer_common(struct ath10k *ar, int vdev_id,
int ret;

ret = wait_event_timeout(ar->peer_mapping_wq, ({
- bool mapped;
+ struct ath10k_peer *peer;
+ bool mapped = false;

spin_lock_bh(&ar->data_lock);
- mapped = !!ath10k_peer_find(ar, vdev_id, addr);
+ peer = ath10k_peer_find(ar, vdev_id, addr);
+ if (peer)
+ mapped = !bitmap_empty(peer->peer_ids,
+ ATH10K_MAX_NUM_PEER_IDS);
spin_unlock_bh(&ar->data_lock);

mapped == expect_mapped;
@@ -173,20 +177,16 @@ void ath10k_peer_map_event(struct ath10k_htt *htt,
spin_lock_bh(&ar->data_lock);
peer = ath10k_peer_find(ar, ev->vdev_id, ev->addr);
if (!peer) {
- peer = kzalloc(sizeof(*peer), GFP_ATOMIC);
- if (!peer)
- goto exit;
-
- peer->vdev_id = ev->vdev_id;
- memcpy(peer->addr, ev->addr, ETH_ALEN);
- list_add(&peer->list, &ar->peers);
- wake_up(&ar->peer_mapping_wq);
+ ath10k_warn("failed to map peer %pM on vdev %i: no such entry\n",
+ ev->addr, ev->vdev_id);
+ goto exit;
}

ath10k_dbg(ATH10K_DBG_HTT, "htt peer map vdev %d peer %pM id %d\n",
ev->vdev_id, ev->addr, ev->peer_id);

set_bit(ev->peer_id, peer->peer_ids);
+ wake_up(&ar->peer_mapping_wq);
exit:
spin_unlock_bh(&ar->data_lock);
}
@@ -210,11 +210,8 @@ void ath10k_peer_unmap_event(struct ath10k_htt *htt,

clear_bit(ev->peer_id, peer->peer_ids);

- if (bitmap_empty(peer->peer_ids, ATH10K_MAX_NUM_PEER_IDS)) {
- list_del(&peer->list);
- kfree(peer);
+ if (bitmap_empty(peer->peer_ids, ATH10K_MAX_NUM_PEER_IDS))
wake_up(&ar->peer_mapping_wq);
- }

exit:
spin_unlock_bh(&ar->data_lock);
--
1.8.5.3


2014-04-07 09:06:57

by Michal Kazior

[permalink] [raw]
Subject: Re: [RFT 0/4] ath10k: fix flushing and tx stalls

On 4 April 2014 16:49, Ben Greear <[email protected]> wrote:
> On 04/04/2014 04:37 AM, Michal Kazior wrote:
>>
>> Hi,
>>
>> After digging around I've found what seems to be
>> the problem with WMI Tx credit starvation and
>> inability to properly flush Tx in ath10k_flush().
>>
>> Long story short: if a client that was asleep (as
>> per what firmware thinks) goes out of range (or
>> just stops responding) then Tx rots in FW/HW
>> queues for a few seconds before it's discarded.
>> For WMI Tx credits this means management frames
>> eat up Tx credits for a few seconds (causing other
>> WMI commands to timeout and return -EAGAIN/-11).
>> For HTT Tx this means NullFunc frames would get
>> stuck for a few seconds before completion was
>> received.
>>
>> @Ben: Can you check if this helps you? I tested
>> this briefly and at least [1/4] seems fixes the
>> WMI Tx starvation. I'm hoping patches 2-4 help
>> with your ath10k_flush() failures which I haven't
>> been successfull in reproducing (but have observed
>> improvement with purging some frames out of FW/HW
>> queues).
>
>
> I'm out of office for a bit, but will test this as
> soon as I'm back.
>
> Thanks for looking into this!
>
> In general, would it make more sense to have a few more tx credits
> available to mitigate the slow-to-be-processed buffers?

Sure, we can disregard firmware telling us we have only 2 credits and
submit commands as we see fit. We can probably even get away with it
as long as we don't submit more than 2 WMI_MGMT_TX_CMDID commands
because it seems this is the only resource consuming command (requires
firmware to copy frame, keep it allocated and mapped for HW until
released on air). Beacons are already sent by reference. But I don't
think this is the solution.

The problem is there's no tx completion indication for
WMI_MGMT_TX_CMDID and you can't rely on tx credits replenishment for
that because (as per my observation) you have to submit an even number
of WMI_MGMT_TX_CMDID to get tx credits replenished. This means there's
no way of doing an educated timeout/flushing. This also means you can
get stuck with 2 credits and 2 frames being stuck in FW if destination
peer is unresponsive for up to 10 seconds.

Once you get stuck you can get a cascade of errors because WMI
commands time out after 3 seconds and if you're running AP you stop
beaconing because you can't even submit WMI_BCN_TX_CMDID.

Another way would be to prolong WMI command timeout to ~11 seconds...
at least for the "tx credits being stuck" problem.


Michał

2014-04-10 05:27:13

by Ben Greear

[permalink] [raw]
Subject: Re: [RFTv2 0/5] ath10k: ath10k: fix flushing and tx stalls

On 04/09/2014 10:10 PM, Michal Kazior wrote:
> On 10 April 2014 01:58, Ben Greear <[email protected]> wrote:
>> On 04/09/2014 02:46 PM, Ben Greear wrote:
>>> Here's another log snippet with these 5 patches (and lots more
>>> mostly non ath10k patches of my own) applied:
>>
>> And another one, this time with more debugging enabled.
>> The 0x7110XXXX numbers indicate the command-id (the XXXX part
>> is the cmd id).
>>
>> After this below, I see a debug-log message come from
>> the firmware, and then nothing else. I had added a sort
>> of keep-alive message in the firmware, and I do not see that
>> in my logs, so probably firmware is wedged in such a way that
>> it cannot or will not send packets to the host at this point.
>>
>> I had chased this sort of problem previously, and ended up
>> with a hack to reset firmware when the flush failed twice.
>> I backed that out when applying your patches, but I guess
>> it is still needed.
>
> Then this looks like a different issue from what I've been trying to
> fix actually.
>
> In my case when acting as AP it's possible to get WMI mgmt tx frames
> stuck in FW queues when sleeping client stops responding for about 10
> seconds. If you use up all tx credits (the multitude of 2 that there
> are :-) beaconing stops and everything just fails.

I see them stuck for a long while in STA mode too, but usually
it recovers. I tried finding where in firmware I could increase
the tx-credits to 4 or something, but had no luck...tis a twisty
maze.


>> ath10k: ep 2 got 1 credits tot 2
>> ath10k: mac vdev 20 start 04:f0:21:03:38:99
>> ath10k: mac vdev 20 start center_freq 5180 phymode 11ac-vht80
>
>> ath10k: ep 2 used 1 credits, remaining 1 dbg 1896910867 (0x71109013)
>
> I suppose this print is located in ath10k_htc_send()?

Yes, right after decrementing credits.

>> ath10k: ep 2 got 1 credits tot 2
>> sta219: send auth to 04:f0:21:03:38:99 (try 1/3) at: 1397086238.721985
>> ath10k: ep 2 used 1 credits, remaining 1 dbg 1896910888 (0x71109028)
>> ath10k: mac flushing peer 04:f0:21:03:38:99 on vdev 20 mgmt tid for unicast mgmt (204 msecs)
>> ath10k: ep 2 used 1 credits, remaining 0 dbg 1896910878 (0x7110901e)
>> ath10k: Creating vdev id: 22 map: 12582912
>> ath10k: mac vdev create 22 (add interface) type 2 subtype 0
>> sta219: send auth to 04:f0:21:03:38:99 (try 2/3) at: 1397086239.28088
>> [firmware logging msg]
>> ath10k: failed to create WMI vdev 22: -11
>
> Hmm.. If I read this correctly it means that MGMT_TX and
> PEER_FLUSH_TIDS commands are both stuck in firmware. This most likely
> means firmware stops processing everything altogether. Having HTC
> debug prints from ath10k_htc_notify_tx_completion() could provide more
> insight perhaps. I suspect MGMT_TX is the trigger in all cases.
>
> I'm still suspicious of your firmware changes. You connect multiple
> stations to the exact same AP. Is peer mapping working correctly? Are
> tid queues mapped correctly in all cases? Perhaps there's some kind of
> inconsistency that leads to this mess? I think firmware wasn't
> originally designed to support your usecase. Or maybe firmware just
> breaks when you try to run a hundred or so of vdevs :-D

I have at least attempted to rectify all of that, but indeed this
particular lockup seems like a firmware issue. I personally suspect
that I just find many bugs 32 times faster than simpler systems will :P

The firmware has it's own sort of tx-to-host-credits logic, so if it runs
out of space it might not be able to send any messages back to
the host. I've crawled through a lot of that code and didn't
see any obvious ways to leak buffers, but it's far from simple
code, so I could still be wrong.

Maybe I could add a small scratch area in firmware memory and place debug
info there and read it from host over the PCI bus like when we
dump the crash info... This time of night I really hate firmware :P

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2014-04-09 10:55:29

by Michal Kazior

[permalink] [raw]
Subject: [RFTv2 0/5] ath10k: ath10k: fix flushing and tx stalls

Hi,

After digging around I've found what seems to be
the problem with WMI Tx credit starvation and
inability to properly flush Tx in ath10k_flush().

Long story short: if a client that was asleep (as
per what firmware thinks) goes out of range (or
just stops responding) then Tx rots in FW/HW
queues for a few seconds before it's discarded.
For WMI Tx credits this means management frames
eat up Tx credits for a few seconds (causing other
WMI commands to timeout and return -EAGAIN/-11).
For HTT Tx this means NullFunc frames would get
stuck for a few seconds before completion was
received.

v2:
* added tx replenishment patch
* added conf_mutex locking
* change condition for flushing mgmt tx


Michal Kazior (5):
ath10k: always request htc tx replenishment
ath10k: fix wmi-htc tx credit starvation
ath10k: rework peer accounting
ath10k: wait for mgmt tx when flushing too
ath10k: improve tx flushing

drivers/net/wireless/ath/ath10k/core.h | 7 +-
drivers/net/wireless/ath/ath10k/htc.c | 15 +-
drivers/net/wireless/ath/ath10k/mac.c | 275 ++++++++++++++++++++++++++++++---
drivers/net/wireless/ath/ath10k/txrx.c | 29 ++--
drivers/net/wireless/ath/ath10k/wmi.c | 5 -
drivers/net/wireless/ath/ath10k/wmi.h | 4 +
6 files changed, 275 insertions(+), 60 deletions(-)

--
1.8.5.3


2014-04-09 23:58:41

by Ben Greear

[permalink] [raw]
Subject: Re: [RFTv2 0/5] ath10k: ath10k: fix flushing and tx stalls

On 04/09/2014 02:46 PM, Ben Greear wrote:
> Here's another log snippet with these 5 patches (and lots more
> mostly non ath10k patches of my own) applied:

And another one, this time with more debugging enabled.
The 0x7110XXXX numbers indicate the command-id (the XXXX part
is the cmd id).

After this below, I see a debug-log message come from
the firmware, and then nothing else. I had added a sort
of keep-alive message in the firmware, and I do not see that
in my logs, so probably firmware is wedged in such a way that
it cannot or will not send packets to the host at this point.

I had chased this sort of problem previously, and ended up
with a hack to reset firmware when the flush failed twice.
I backed that out when applying your patches, but I guess
it is still needed.


ath10k: ep 2 got 1 credits tot 2
ath10k: mac vdev 20 start 04:f0:21:03:38:99
ath10k: mac vdev 20 start center_freq 5180 phymode 11ac-vht80
ath10k: ep 2 used 1 credits, remaining 1 dbg 1896910867 (0x71109013)
ath10k: ep 2 got 1 credits tot 2
sta219: send auth to 04:f0:21:03:38:99 (try 1/3) at: 1397086238.721985
ath10k: ep 2 used 1 credits, remaining 1 dbg 1896910888 (0x71109028)
ath10k: mac flushing peer 04:f0:21:03:38:99 on vdev 20 mgmt tid for unicast mgmt (204 msecs)
ath10k: ep 2 used 1 credits, remaining 0 dbg 1896910878 (0x7110901e)
ath10k: Creating vdev id: 22 map: 12582912
ath10k: mac vdev create 22 (add interface) type 2 subtype 0
sta219: send auth to 04:f0:21:03:38:99 (try 2/3) at: 1397086239.28088
[firmware logging msg]
ath10k: failed to create WMI vdev 22: -11
ath10k: failed to transmit management frame via WMI: -11
ath10k: Creating vdev id: 22 map: 12582912
ath10k: mac vdev create 22 (add interface) type 2 subtype 0
ath10k: failed to create WMI vdev 22: -11
sta219: deauthenticating from 04:f0:21:03:38:99 by local choice (reason=3)
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 20 to flush ffffffff: -11
....




>
> No (significant) user traffic was running.
>
> sta215: authenticated at: 1397076587.913186
> sta215: associate with 04:f0:21:03:38:99 (try 1/3), at: 1397076587.915143
> ath10k: mac flushing peer 04:f0:21:03:38:99 on vdev 16 mgmt tid for unicast mgmt (204 msecs)
> sta215: RX AssocResp from 04:f0:21:03:38:99 (capab=0x1 status=0 aid=17) at: 1397076588.120218
> ath10k: mac vdev 16 dtim_period 2
> ath10k: mac vdev 16 slot_time 2
> ath10k: mac ht peer 04:f0:21:03:38:99 mcs cnt 24 nss 3
> ath10k: mac vht peer 04:f0:21:03:38:99 max_mpdu 65535 flags 0x6003001
> ath10k: mac peer 04:f0:21:03:38:99 phymode 11ac-vht80
> ath10k: mac vdev 16 up (associated) bssid 04:f0:21:03:38:99 aid 17
> sta215: associated at: 1397076588.122484
> IPv6: ADDRCONF(NETDEV_CHANGE): sta215: link becomes ready
> sta216: authenticate with 04:f0:21:03:38:99 at: 1397076588.124830
> ath10k: mac vdev 17 beacon_interval 100
> ath10k: mac vdev 17 create peer 04:f0:21:03:38:99
> ath10k: mac vdev 17 start 04:f0:21:03:38:99
> ath10k: mac vdev 17 start center_freq 5180 phymode 11ac-vht80
> sta216: send auth to 04:f0:21:03:38:99 (try 1/3) at: 1397076588.125811
> ath10k: mac flushing peer 04:f0:21:03:38:99 on vdev 17 mgmt tid for unicast mgmt (204 msecs)
> sta216: send auth to 04:f0:21:03:38:99 (try 2/3) at: 1397076588.431144
> ath10k: failed to transmit management frame via WMI: -11
> sta216: send auth to 04:f0:21:03:38:99 (try 3/3) at: 1397076591.431376
> ath10k: failed to wait for tx to flush: -110, forcing
> ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 17 to flush ffffffff: -11
> ath10k: failed to flush all peer tids: -11
> sta216: deauthenticating from 04:f0:21:03:38:99 by local choice (reason=3)
> ath10k: failed to wait for tx to flush: -110, forcing
> ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 17 to flush ffffffff: -11
> ath10k: failed to flush all peer tids: -11
> ath10k: mac vdev 17 peer delete 04:f0:21:03:38:99 (sta gone)
> ath10k: failed to request wmi peer 04:f0:21:03:38:99 on vdev 17 removal: -11
> ath10k: removing peer 04:f0:21:03:38:99 on vdev 17 still being mapped in firmware
> ath10k: mac vdev 17 stop (disassociated
> ath10k: failed to stop WMI vdev 17: -11
> ath10k: mac vdev 17 down
> ath10k: failed to wait for tx to flush: -110, forcing
> ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 16 to flush ffffffff: -11
> ath10k: failed to flush all peer tids: -11
> ath10k: failed to wait for tx to flush: -110, forcing
> ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 16 to flush ffffffff: -11
> ath10k: failed to flush all peer tids: -11
> sta218: authenticate with 04:f0:21:03:38:99 at: 1397076616.431328
> ath10k: failed to wait for tx to flush: -110, forcing
> ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 16 to flush ffffffff: -11
> ath10k: failed to flush all peer tids: -11
> ath10k: mac vdev 19 beacon_interval 100
> ath10k: failed to set beacon interval for vdev 19: -11
> ath10k: mac vdev 19 create peer 04:f0:21:03:38:99
> ath10k: failed to create wmi peer 04:f0:21:03:38:99 on vdev 19: -11
> ath10k: failed to add peer 04:f0:21:03:38:99 for vdev 19 when changing bssid: -11
> ath10k: mac vdev 19 start 04:f0:21:03:38:99
> ath10k: mac vdev 19 start center_freq 5180 phymode 11ac-vht80
>
>
> Thanks,
> Ben
>


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2014-04-11 06:23:15

by Kalle Valo

[permalink] [raw]
Subject: Re: [RFTv2 3/5] ath10k: rework peer accounting

Michal Kazior <[email protected]> writes:

> On 10 April 2014 09:18, Kalle Valo <[email protected]> wrote:
>> Michal Kazior <[email protected]> writes:
>>
>>> The idea is you need BOTH locks to modify the list structure, but you
>>> need only one of them to iterate over the list safely and
>>> consistently. This means writer will not alter the list structure
>>> until there are no readers.
>>
>> Still not understanding this. Why not then use conf_mutex always, why do
>> we need data_lock at all?
>
> htt peer map/unmap which is in tasklet context needs to iterate over
> the list. It can't take conf_mutex.
>
>> Or are you saying that one can iterate the list by either taking
>> conf_mutex or by taking data_lock, depending on context?
>
> Yes.

Thanks, finally I understand this. At least the comment below needs to
be changed to cover all cases:

--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -100,13 +100,13 @@ exit:
wake_up(&htt->empty_tx_wq);
}

+/* hold conf_mutex for simple iteration, or conf_mutex+data_lock for
+ * modifications */
struct ath10k_peer *ath10k_peer_find(struct ath10k *ar, int vdev_id,
const u8 *addr)
{
struct ath10k_peer *peer;

- lockdep_assert_held(&ar->data_lock);
-
list_for_each_entry(peer, &ar->peers, list) {
if (peer->vdev_id != vdev_id)
continue;

The other thing is that I really like the idea of documenting locking
requirements with lockdep_assert_held() and I hate to lose that in this
function. I started to think should we come up with our own macro for
checking if either of two locks is held? When I look at the original
macro I think it should be pretty easy:

#define lockdep_assert_held(l) do { \
WARN_ON(debug_locks && !lockdep_is_held(l)); \
} while (0)

--
Kalle Valo

2014-04-11 04:59:44

by Ben Greear

[permalink] [raw]
Subject: Re: [RFTv2 3/5] ath10k: rework peer accounting

On 04/09/2014 03:48 AM, Michal Kazior wrote:
> It was troublesome to iterate over peers and
> perform sleepable calls. This will be necessary
> for some upcomming changes to tx flushing.

Just FYI, I tried all yesterday to get my system stable with all
of your RFT patches applied, but it just has all sorts of issues.
I think much of it relates to the management flush logic, perhaps
all the time it spends under lock (including RTNL lock I think)
just makes my system fall apart.

Possibly related to my peculiar environment & firmware, for sure.

I backed out your last two, leaving this one and the two previous,
and it runs better. I still see the tx-credits hang (this time,
when I tried resetting all 32 stations while transmitting packets),
but at least the stations come up and can transmit packets most of
the time.

Even in the 'mostly working' state, resetting interfaces while
they and other vifs are under heavy packet load can effectively
lock up my user-space app, and things like 'ifconfig -a'
can take 10s of seconds to complete. So, I probably need
to mitigate more of the sleeps related to flushing somehow.

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2014-04-09 06:25:12

by Michal Kazior

[permalink] [raw]
Subject: Re: [RFT 0/4] ath10k: fix flushing and tx stalls

On 8 April 2014 18:02, Ben Greear <[email protected]> wrote:
> On 04/07/2014 10:51 PM, Michal Kazior wrote:
>> On 8 April 2014 04:31, Ben Greear <[email protected]> wrote:
>>> On 04/07/2014 02:11 AM, Michal Kazior wrote:
>>>
>>>> These logs are not enough. I'd love to see traces for this to see what
>>>> frames are actually submitted and when tx credits are replenished.
>>>>
>>>> I also wonder if this can be somehow related to your FW changes to
>>>> allow connecting multiple client virtual interfaces to a single AP?
>>>
>>>
>>> I think it is unlikely due to my firmware changes...little of that touched
>>> the handling of management frames. It might very well be a basic problem in
>>> either the firmware or driver when using multiple station VIFS. I think
>>> that
>>> aside from my testing that code has not been used much.
>>>
>>> Note my followup email that problems started with patch 3/4...not sure you
>>> saw that one or not. I saw similar failure to associate & get DHCP (and
>>> slow/hung user-space) without the kernel error logs.
>>
>> I did get that mail. It's just that it makes little sense. Patch 3/4
>> simply extends ath10k_flush() to wait for management frames. It
>> shouldn't change behaviour of tx credit starvation which appears to be
>> the case in the logs you provided.
>
> Maybe this causes unexpected waits (under lock) that in turn causes other
> vifs trying to come up at the same time to time out?

Hm... Most stuff is configured while holding conf_mutex and is
completed while it is held. The exception are beacons and mgmt tx.
Perhaps waiting for mgmt tx in ath10k_flush() doesn't interleave mgmt
tx with other commands and ends up starving tx credit sooner.

I'm planning to submit RFTv2 later today with a few fixes.


> It seems easily reproducible, so let me know what sort of logging/tracing
> you would like to see (and how to enable it if it is not trivial)
> and I'll reproduce it and send you the logs.

Additional kernel config:

CONFIG_ATH10K_DEBUG=y
CONFIG_ATH10K_DEBUGFS=y
CONFIG_ATH10K_TRACING=y
CONFIG_MAC80211_MESSAGE_TRACING=y

This probably requires this to be set first:

CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y

(it's in Kernel Hacking)

Once you have this you can use trace-cmd command to gathers logs:

trace-cmd record -e mac80211 -e cfg80211 -e ath10k

And then perform whatever tortures on ath10k you wish.

This creates trace.dat in your current directory. The file can get
pretty heavy and it might not fit into an email so we might need to
transfer it "out of band" somehow.

If you use wpa_supplicant/hostapd you can pass -T argument to those to
get all their messages into tracing as markers. This way you're able
to get unified logs in one file with correct chronological order.


Michał

2014-04-07 00:31:10

by Ben Greear

[permalink] [raw]
Subject: Re: [RFT 0/4] ath10k: fix flushing and tx stalls

On 04/04/2014 04:37 AM, Michal Kazior wrote:
> Hi,
>
> After digging around I've found what seems to be
> the problem with WMI Tx credit starvation and
> inability to properly flush Tx in ath10k_flush().

I applied all four, rebooted AP and Station machine with the
new kernel (3.14.0-wl-ath+). I had removed my own tx-hang detection
hack.

It is not working well...though perhaps it is not directly related
to your patches...I will do some reboots and try reversing some patches.
But, 3.14.0-rc8-wl-ath+ was working mostly OK for me....

It seems my user-space app got hung, probably because rtnl was being
held too long while doing admin-down on a bunch of stations.

Out of 32 stations, maybe 5 of them managed to come up and get
an IP address through DHCP before user-space app hung and log
started filling with the errors at the end of the dmesg below.


sta218: associated
IPv6: ADDRCONF(NETDEV_CHANGE): sta218: link becomes ready
sta219: authenticate with 04:f0:21:03:38:99
ath10k: mac vdev 19 beacon_interval 100
ath10k: mac vdev 19 create peer 04:f0:21:03:38:99
ath10k: mac vdev 19 start 04:f0:21:03:38:99
ath10k: mac vdev 19 start center_freq 5180 phymode 11ac-vht80

sta219: send auth to 04:f0:21:03:38:99 (try 1/3)
sta219: authenticated
sta218: deauthenticating from 04:f0:21:03:38:99 by local choice (Reason: 3=DEAUTH_LEAVING)
ath10k: mac flushing peer 04:f0:21:03:38:99 on vdev 18 mgmt tid for unicast mgmt
ath10k: mac vdev 18 peer delete 04:f0:21:03:38:99 (sta gone)
ath10k: mac vdev 18 stop (disassociated
ath10k: mac vdev 18 down
ath10k: mac vdev 18 cts_prot 0
ath10k: mac vdev 18 slot_time 1
ath10k: mac vdev 18 preamble 1n
sta219: associate with 04:f0:21:03:38:99 (try 1/3)
sta219: RX AssocResp from 04:f0:21:03:38:99 (capab=0x1 status=0 aid=22)
ath10k: mac vdev 19 dtim_period 2
ath10k: mac vdev 19 slot_time 2
ath10k: mac ht peer 04:f0:21:03:38:99 mcs cnt 24 nss 3
ath10k: mac vht peer 04:f0:21:03:38:99 max_mpdu 65535 flags 0x6003001
ath10k: mac peer 04:f0:21:03:38:99 phymode 11ac-vht80
ath10k: mac vdev 19 up (associated) bssid 04:f0:21:03:38:99 aid 22
sta219: associated
IPv6: ADDRCONF(NETDEV_CHANGE): sta219: link becomes ready
sta220: authenticate with 04:f0:21:03:38:99
ath10k: mac vdev 20 beacon_interval 100
ath10k: mac vdev 20 create peer 04:f0:21:03:38:99
ath10k: mac vdev 20 start 04:f0:21:03:38:99
ath10k: mac vdev 20 start center_freq 5180 phymode 11ac-vht80
sta220: send auth to 04:f0:21:03:38:99 (try 1/3)
sta220: authenticated
sta220: associate with 04:f0:21:03:38:99 (try 1/3)
sta220: RX AssocResp from 04:f0:21:03:38:99 (capab=0x1 status=0 aid=23)
ath10k: mac vdev 20 dtim_period 2
ath10k: mac vdev 20 slot_time 2
ath10k: mac ht peer 04:f0:21:03:38:99 mcs cnt 24 nss 3
ath10k: mac vht peer 04:f0:21:03:38:99 max_mpdu 65535 flags 0x6003001
ath10k: mac peer 04:f0:21:03:38:99 phymode 11ac-vht80
ath10k: mac vdev 20 up (associated) bssid 04:f0:21:03:38:99 aid 23
sta220: associated
IPv6: ADDRCONF(NETDEV_CHANGE): sta220: link becomes ready
ath10k: Creating vdev id: 21 map: 4292870144
ath10k: mac vdev create 21 (add interface) type 2 subtype 0
ath10k: mac vdev 21 cts_prot 0
ath10k: mac vdev 21 slot_time 1
ath10k: mac vdev 21 preamble 1n
IPv6: ADDRCONF(NETDEV_UP): sta221: link is not ready
ath10k: mac vdev 21 delete (remove interface)
ath10k: peer-unmap-event: unknown peer id 23
ath10k: Creating vdev id: 21 map: 4292870144
ath10k: mac vdev create 21 (add interface) type 2 subtype 0
ath10k: mac vdev 21 cts_prot 0
ath10k: mac vdev 21 slot_time 1
ath10k: mac vdev 21 preamble 1n
IPv6: ADDRCONF(NETDEV_UP): sta221: link is not ready
sta217: authenticate with 04:f0:21:03:38:99
ath10k: mac vdev 17 beacon_interval 100
ath10k: mac vdev 17 create peer 04:f0:21:03:38:99
ath10k: mac vdev 17 start 04:f0:21:03:38:99
ath10k: mac vdev 17 start center_freq 5180 phymode 11ac-vht80

sta217: send auth to 04:f0:21:03:38:99 (try 1/3)
sta217: authenticated
sta217: associate with 04:f0:21:03:38:99 (try 1/3)
sta217: RX AssocResp from 04:f0:21:03:38:99 (capab=0x1 status=0 aid=20)
ath10k: mac vdev 17 dtim_period 2
ath10k: mac vdev 17 slot_time 2
ath10k: mac ht peer 04:f0:21:03:38:99 mcs cnt 24 nss 3
ath10k: mac vht peer 04:f0:21:03:38:99 max_mpdu 65535 flags 0x6003001
ath10k: mac peer 04:f0:21:03:38:99 phymode 11ac-vht80
ath10k: mac vdev 17 up (associated) bssid 04:f0:21:03:38:99 aid 20
sta217: associated
IPv6: ADDRCONF(NETDEV_CHANGE): sta217: link becomes ready
sta218: authenticate with 04:f0:21:03:38:99
ath10k: mac vdev 18 beacon_interval 100
ath10k: mac vdev 18 create peer 04:f0:21:03:38:99
ath10k: mac vdev 18 start 04:f0:21:03:38:99
ath10k: mac vdev 18 start center_freq 5180 phymode 11ac-vht80
sta218: send auth to 04:f0:21:03:38:99 (try 1/3)
ath10k: Creating vdev id: 22 map: 4290772992
ath10k: mac vdev create 22 (add interface) type 2 subtype 0
sta218: authenticated
ath10k: mac vdev 22 cts_prot 0
ath10k: mac vdev 22 slot_time 1
ath10k: mac vdev 22 preamble 1n
IPv6: ADDRCONF(NETDEV_UP): sta222: link is not ready
sta218: associate with 04:f0:21:03:38:99 (try 1/3)
sta218: RX AssocResp from 04:f0:21:03:38:99 (capab=0x1 status=0 aid=18)
ath10k: mac vdev 18 dtim_period 2
ath10k: mac vdev 18 slot_time 2
ath10k: mac ht peer 04:f0:21:03:38:99 mcs cnt 24 nss 3
ath10k: mac vht peer 04:f0:21:03:38:99 max_mpdu 65535 flags 0x6003001
ath10k: mac peer 04:f0:21:03:38:99 phymode 11ac-vht80
ath10k: mac vdev 18 up (associated) bssid 04:f0:21:03:38:99 aid 18
sta218: associated
sta222: authenticate with 04:f0:21:03:38:99
ath10k: mac vdev 22 beacon_interval 100
ath10k: mac vdev 22 create peer 04:f0:21:03:38:99
ath10k: mac vdev 22 start 04:f0:21:03:38:99
ath10k: mac vdev 22 start center_freq 5180 phymode 11ac-vht80

sta222: send auth to 04:f0:21:03:38:99 (try 1/3)
ath10k: mac vdev 22 peer delete 04:f0:21:03:38:99 (sta gone)
ath10k: mac vdev 22 stop (disassociated
ath10k: mac vdev 22 down
ath10k: mac vdev 22 delete (remove interface)
ath10k: peer-unmap-event: unknown peer id 24
ath10k: Creating vdev id: 22 map: 4290772992
ath10k: mac vdev create 22 (add interface) type 2 subtype 0
ath10k: mac vdev 22 cts_prot 0
ath10k: mac vdev 22 slot_time 1
ath10k: mac vdev 22 preamble 1n
IPv6: ADDRCONF(NETDEV_UP): sta222: link is not ready
sta221: authenticate with 04:f0:21:03:38:99
ath10k: mac vdev 21 beacon_interval 100
ath10k: mac vdev 21 create peer 04:f0:21:03:38:99
ath10k: mac vdev 21 start 04:f0:21:03:38:99
ath10k: mac vdev 21 start center_freq 5180 phymode 11ac-vht80

sta221: send auth to 04:f0:21:03:38:99 (try 1/3)
sta221: authenticated
sta221: associate with 04:f0:21:03:38:99 (try 1/3)
sta221: RX AssocResp from 04:f0:21:03:38:99 (capab=0x1 status=0 aid=19)
ath10k: mac vdev 21 dtim_period 2
ath10k: mac vdev 21 slot_time 2
ath10k: mac ht peer 04:f0:21:03:38:99 mcs cnt 24 nss 3
ath10k: mac vht peer 04:f0:21:03:38:99 max_mpdu 65535 flags 0x6003001
ath10k: mac peer 04:f0:21:03:38:99 phymode 11ac-vht80
ath10k: mac vdev 21 up (associated) bssid 04:f0:21:03:38:99 aid 19
sta221: associated
IPv6: ADDRCONF(NETDEV_CHANGE): sta221: link becomes ready
sta222: authenticate with 04:f0:21:03:38:99
ath10k: mac vdev 22 beacon_interval 100
ath10k: mac vdev 22 create peer 04:f0:21:03:38:99
ath10k: mac vdev 22 start 04:f0:21:03:38:99
ath10k: mac vdev 22 start center_freq 5180 phymode 11ac-vht80
sta222: send auth to 04:f0:21:03:38:99 (try 1/3)
sta222: authenticated
sta222: associate with 04:f0:21:03:38:99 (try 1/3)
sta222: RX AssocResp from 04:f0:21:03:38:99 (capab=0x1 status=0 aid=24)
ath10k: mac vdev 22 dtim_period 2
ath10k: mac vdev 22 slot_time 2
ath10k: mac ht peer 04:f0:21:03:38:99 mcs cnt 24 nss 3
ath10k: mac vht peer 04:f0:21:03:38:99 max_mpdu 65535 flags 0x6003001
ath10k: mac peer 04:f0:21:03:38:99 phymode 11ac-vht80
ath10k: mac vdev 22 up (associated) bssid 04:f0:21:03:38:99 aid 24
sta222: associated
IPv6: ADDRCONF(NETDEV_CHANGE): sta222: link becomes ready

sta213: deauthenticating from 04:f0:21:03:38:99 by local choice (Reason: 3=DEAUTH_LEAVING)
ath10k: mac flushing peer 04:f0:21:03:38:99 on vdev 13 mgmt tid for unicast mgmt
ath10k: mac vdev 13 peer delete 04:f0:21:03:38:99 (sta gone)
ath10k: mac vdev 13 stop (disassociated
ath10k: mac vdev 13 down
ath10k: mac vdev 13 cts_prot 0
ath10k: mac vdev 13 slot_time 1

ath10k: failed to set erp slot for vdev 13: -11
ath10k: mac vdev 13 preamble 1n
ath10k: failed to set preamble for vdev 13: -11
ath10k: failed to set wmm params: -11
ath10k: failed to set wmm params: -11
ath10k: failed to set wmm params: -11
ath10k: failed to set wmm params: -11
ath10k: mac vdev 13 delete (remove interface)
ath10k: failed to delete WMI vdev 13: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
sta219: deauthenticating from 04:f0:21:03:38:99 by local choice (Reason: 3=DEAUTH_LEAVING)
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: mac vdev 19 peer delete 04:f0:21:03:38:99 (sta gone)
ath10k: failed to request wmi peer 04:f0:21:03:38:99 on vdev 19 removal: -11
ath10k: removing peer 04:f0:21:03:38:99 on vdev 19 still being mapped in firmware
ath10k: mac vdev 19 stop (disassociated
ath10k: failed to stop WMI vdev 19: -11
ath10k: mac vdev 19 down
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: mac vdev 19 cts_prot 0
ath10k: failed to recalculate rts/cts prot for vdev 19: -11
ath10k: mac vdev 19 slot_time 1
ath10k: failed to set erp slot for vdev 19: -11
ath10k: mac vdev 19 preamble 1n
ath10k: failed to set preamble for vdev 19: -11
ath10k: failed to set wmm params: -11
ath10k: failed to set wmm params: -11
ath10k: failed to set wmm params: -11
ath10k: failed to set wmm params: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: mac vdev 19 delete (remove interface)
ath10k: failed to delete WMI vdev 19: -11
ath10k: failed to wait for tx to flush: -110, forcing
ath10k: failed to request peer 04:f0:21:03:38:99 on vdev 22 to flush ffffffff: -11
ath10k: failed to flush all peer tids: -11
ath10k: mac vdev 0 peer delete 04:f0:21:03:38:99 (sta gone)
ath10k: failed to request wmi peer 04:f0:21:03:38:99 on vdev 0 removal: -11
ath10k: removing peer 04:f0:21:03:38:99 on vdev 0 still being mapped in firmware
ath10k: mac vdev 0 stop (disassociated
ath10k: failed to stop WMI vdev 0: -11
ath10k: mac vdev 0 down
ath10k: mac vdev 0 cts_prot 0
ath10k: failed to recalculate rts/cts prot for vdev 0: -11
ath10k: mac vdev 0 slot_time 1
ath10k: failed to set erp slot for vdev 0: -11
ath10k: mac vdev 0 preamble 1n
ath10k: failed to set preamble for vdev 0: -11
ath10k: failed to set wmm params: -11

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2014-04-10 08:50:47

by Michal Kazior

[permalink] [raw]
Subject: Re: [RFTv2 0/5] ath10k: ath10k: fix flushing and tx stalls

On 10 April 2014 07:26, Ben Greear <[email protected]> wrote:
> On 04/09/2014 10:10 PM, Michal Kazior wrote:
>> On 10 April 2014 01:58, Ben Greear <[email protected]> wrote:
>>> ath10k: ep 2 got 1 credits tot 2
>>> sta219: send auth to 04:f0:21:03:38:99 (try 1/3) at: 1397086238.721985
>>> ath10k: ep 2 used 1 credits, remaining 1 dbg 1896910888 (0x71109028)
>>> ath10k: mac flushing peer 04:f0:21:03:38:99 on vdev 20 mgmt tid for
>>> unicast mgmt (204 msecs)
>>> ath10k: ep 2 used 1 credits, remaining 0 dbg 1896910878 (0x7110901e)
>>> ath10k: Creating vdev id: 22 map: 12582912
>>> ath10k: mac vdev create 22 (add interface) type 2 subtype 0
>>> sta219: send auth to 04:f0:21:03:38:99 (try 2/3) at: 1397086239.28088
>>> [firmware logging msg]
>>> ath10k: failed to create WMI vdev 22: -11
>>
>>
>> Hmm.. If I read this correctly it means that MGMT_TX and
>> PEER_FLUSH_TIDS commands are both stuck in firmware. This most likely
>> means firmware stops processing everything altogether. Having HTC
>> debug prints from ath10k_htc_notify_tx_completion() could provide more
>> insight perhaps. I suspect MGMT_TX is the trigger in all cases.
>>
>> I'm still suspicious of your firmware changes. You connect multiple
>> stations to the exact same AP. Is peer mapping working correctly? Are
>> tid queues mapped correctly in all cases? Perhaps there's some kind of
>> inconsistency that leads to this mess? I think firmware wasn't
>> originally designed to support your usecase. Or maybe firmware just
>> breaks when you try to run a hundred or so of vdevs :-D
>
>
> I have at least attempted to rectify all of that, but indeed this
> particular lockup seems like a firmware issue. I personally suspect
> that I just find many bugs 32 times faster than simpler systems will :P
>
> The firmware has it's own sort of tx-to-host-credits logic, so if it runs
> out of space it might not be able to send any messages back to
> the host. I've crawled through a lot of that code and didn't
> see any obvious ways to leak buffers, but it's far from simple
> code, so I could still be wrong.

I don't think that's the problem here.

Firmware seems to generate traffic to host while tx credits aren't
replenished (I've looked in traces you've send in the other email)
including wmi mgmt rx. From the traces it also looks like htc tx
completion is done for flush command suggesting it has been probably
processed by hif layer and maybe htc layer but there are no tx credits
replenished. There's even a ton of htt tx completion indications
although it seems new htt tx commands are never completed (in the
traces). This could suggest htt service is dead as well as wmi or it
just queues frames on a paused queue.

Tx credits for wmi should be replenished right away for all wmi
commands except mgmt tx and bcn tx (as they cannot be immediately
done). If tx credits are not replenished for flush command (which is
the case) it might not have reached the target wmi service at all.
>From what I understand this could happen if endpoint is paused but
this probably shouldn't happen as this is for wmi data path
synchronization apparently which is a legacy thing and should be hit.
Maybe there's a different way for wmi service to stop responding or to
prevent it from receiving and processing commands?


> Maybe I could add a small scratch area in firmware memory and place debug
> info there and read it from host over the PCI bus like when we
> dump the crash info... This time of night I really hate firmware :P

Sounds reasonable to debug and pinpoint the cause of this problem.
Perhaps some counters to check if certain code paths are hit when you
expect them to be?


Michał

2014-04-09 10:55:35

by Michal Kazior

[permalink] [raw]
Subject: [RFTv2 4/5] ath10k: wait for mgmt tx when flushing too

The 10.1 firmware uses WMI to Tx management frames
but ath10k_flush() didn't care. This could lead to
some frames not being sent.

Although there are no Tx completions for WMI
mangement frames it's still better to at least
wait for ath10k queue to be drained.

Signed-off-by: Michal Kazior <[email protected]>
---
drivers/net/wireless/ath/ath10k/mac.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index adb5da1..1d95500 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -3782,19 +3782,27 @@ static void ath10k_flush(struct ieee80211_hw *hw, u32 queues, bool drop)

mutex_lock(&ar->conf_mutex);

- if (ar->state == ATH10K_STATE_WEDGED)
+ if (ar->state == ATH10K_STATE_WEDGED) {
+ ret = -EBUSY;
goto skip;
+ }

ret = wait_event_timeout(ar->htt.empty_tx_wq, ({
- bool empty;
+ bool htt_empty, wmi_empty;
+ unsigned long flags;

spin_lock_bh(&ar->htt.tx_lock);
- empty = (ar->htt.num_pending_tx == 0);
+ htt_empty = (ar->htt.num_pending_tx == 0);
spin_unlock_bh(&ar->htt.tx_lock);

+ spin_lock_irqsave(&ar->wmi_mgmt_tx_queue.lock, flags);
+ wmi_empty = skb_queue_len(&ar->wmi_mgmt_tx_queue) == 0;
+ spin_unlock_irqrestore(&ar->wmi_mgmt_tx_queue.lock,
+ flags);
+
skip = (ar->state == ATH10K_STATE_WEDGED);

- (empty || skip);
+ ((htt_empty && wmi_empty) || skip);
}), ATH10K_FLUSH_TIMEOUT_HZ);

if (ret <= 0 || skip)
@@ -3803,6 +3811,11 @@ static void ath10k_flush(struct ieee80211_hw *hw, u32 queues, bool drop)

skip:
mutex_unlock(&ar->conf_mutex);
+
+ /* empty mgmt tx queue doesn't mean mgmt tx is flushed because the last
+ * frame still may be processed by a worker */
+ if (ret > 0 && !skip)
+ cancel_work_sync(&ar->wmi_mgmt_tx_work);
}

/* TODO: Implement this function properly
--
1.8.5.3


2014-04-10 07:18:46

by Kalle Valo

[permalink] [raw]
Subject: Re: [RFTv2 3/5] ath10k: rework peer accounting

Michal Kazior <[email protected]> writes:

>>> +/* hold conf_mutex for simple iteration, or conf_mutex+data_lock for
>>> + * modifications */
>>> struct ath10k_peer *ath10k_peer_find(struct ath10k *ar, int vdev_id,
>>> const u8 *addr)
>>> {
>>> struct ath10k_peer *peer;
>>>
>>> - lockdep_assert_held(&ar->data_lock);
>>> -
>>> list_for_each_entry(peer, &ar->peers, list) {
>>> if (peer->vdev_id != vdev_id)
>>> continue;
>>
>> The comment here makes me suspicious. How can we safely iterate the list
>> if we don't take data_lock? Doesn't it mean that the list can change
>> while we have conf_mutex?
>
> The idea is you need BOTH locks to modify the list structure, but you
> need only one of them to iterate over the list safely and
> consistently. This means writer will not alter the list structure
> until there are no readers.

Still not understanding this. Why not then use conf_mutex always, why do
we need data_lock at all?

Or are you saying that one can iterate the list by either taking
conf_mutex or by taking data_lock, depending on context?

--
Kalle Valo

2014-04-07 01:05:48

by Ben Greear

[permalink] [raw]
Subject: Re: [RFT 0/4] ath10k: fix flushing and tx stalls

On 04/06/2014 05:30 PM, Ben Greear wrote:
> On 04/04/2014 04:37 AM, Michal Kazior wrote:
>> Hi,
>>
>> After digging around I've found what seems to be
>> the problem with WMI Tx credit starvation and
>> inability to properly flush Tx in ath10k_flush().
>
> I applied all four, rebooted AP and Station machine with the
> new kernel (3.14.0-wl-ath+). I had removed my own tx-hang detection
> hack.

Looks like the problems start with patch 3/4.

All tests were done with all 4 patches applied on the AP machine,
and it appears that works fine. The station machine worked fine with
the first two patches, but when third was applied, only a few stations
come up and I get behaviour similar to that described in the first email,
but I do not see anything obviously wrong in dmesg...

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2014-04-10 03:45:48

by Kalle Valo

[permalink] [raw]
Subject: Re: [RFT 0/4] ath10k: fix flushing and tx stalls

Ben Greear <[email protected]> writes:

> My version of supplicant does not support -T. I have just sync'd down the
> latest upstream hostapd code and will see if that adds the support.

You need to enable CONFIG_DEBUG_LINUX_TRACING in wpasupplicant.

http://wireless.kernel.org/en/developers/Documentation/mac80211/tracing

--
Kalle Valo

2014-04-04 14:50:25

by Ben Greear

[permalink] [raw]
Subject: Re: [RFT 0/4] ath10k: fix flushing and tx stalls

On 04/04/2014 04:37 AM, Michal Kazior wrote:
> Hi,
>
> After digging around I've found what seems to be
> the problem with WMI Tx credit starvation and
> inability to properly flush Tx in ath10k_flush().
>
> Long story short: if a client that was asleep (as
> per what firmware thinks) goes out of range (or
> just stops responding) then Tx rots in FW/HW
> queues for a few seconds before it's discarded.
> For WMI Tx credits this means management frames
> eat up Tx credits for a few seconds (causing other
> WMI commands to timeout and return -EAGAIN/-11).
> For HTT Tx this means NullFunc frames would get
> stuck for a few seconds before completion was
> received.
>
> @Ben: Can you check if this helps you? I tested
> this briefly and at least [1/4] seems fixes the
> WMI Tx starvation. I'm hoping patches 2-4 help
> with your ath10k_flush() failures which I haven't
> been successfull in reproducing (but have observed
> improvement with purging some frames out of FW/HW
> queues).

I'm out of office for a bit, but will test this as
soon as I'm back.

Thanks for looking into this!

In general, would it make more sense to have a few more tx credits
available to mitigate the slow-to-be-processed buffers?

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2015-01-29 01:32:27

by YanBo

[permalink] [raw]
Subject: Re: [RFTv2 2/5] ath10k: fix wmi-htc tx credit starvation

Hi Michal,

What the conclusion about this patch, it looks like this patch not be
merged into ath10K due to introduce some unstable issue, I'v got
another issue that when move the station enter hibernate mode. the AP
will continue report message like before
[ 3958.681293] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
00:03:7f:40:04:5b
[ 3959.681449] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
00:03:7f:40:04:5b
[ 3960.681696] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
00:03:7f:40:04:5b
[ 3961.681877] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
00:03:7f:40:04:5b
[ 3962.682080] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
00:03:7f:40:04:5b
[ 3963.682361] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
00:03:7f:40:04:5b
[ 3964.682550] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
00:03:7f:40:04:5b
[ 3965.682743] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
00:03:7f:40:04:5b
and there are also error message like this be happened at early time:


[ 1316.883053] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0

[ 1316.912357] ath10k_pci 0000:01:00.0: failed to transmit management
frame via WMI: -11

[ 1316.985476] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0

I suspect it is triggered as you mentioned because the HTC Tx credits
are drained
to 0 and no other commands can be submitted, if the answer is yes,
I'd hear your suggestion about whether this patch still worth to be
continue improve to solve such kinds of issue.

Thanks
BR /Yanbo

On Wed, Apr 9, 2014 at 3:48 AM, Michal Kazior <[email protected]> wrote:
> Each WMI management Tx consumes 1 HTC Tx credit
> and doesn't replenish it until the frame is
> actually transmitted out of firmware's Tx queues.
>
> If associated client was asleep and has gone out
> of range then unicast frames won't be released for
> FW/HW queues for a while (10 seconds per
> observation). This means that if more management
> frames are queued then HTC Tx credits are drained
> to 0 and no other commands can be submitted
> including beacons and peer removal.
>
> This could in turn result in clients disconnecting
> due to their beacon loss and may trigger spurious
> sta kickouts because wmi peer removal command may
> never reach firmware during disassociation.
>
> This could happen, e.g. when disconnecting
> client (hostapd inactivity) that has gone away
> while asleep when acting as AP.
>
> As a workaround flush frames that can potentialy
> get stuck in FW Tx queues.
>
> Signed-off-by: Michal Kazior <[email protected]>
> ---
> drivers/net/wireless/ath/ath10k/mac.c | 96 +++++++++++++++++++++++++++++++++++
> drivers/net/wireless/ath/ath10k/wmi.c | 5 --
> drivers/net/wireless/ath/ath10k/wmi.h | 3 ++
> 3 files changed, 99 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
> index 58ec5a7..5cfbf88 100644
> --- a/drivers/net/wireless/ath/ath10k/mac.c
> +++ b/drivers/net/wireless/ath/ath10k/mac.c
> @@ -2076,9 +2076,67 @@ void ath10k_mgmt_over_wmi_tx_purge(struct ath10k *ar)
> }
> }
>
> +static void ath10k_mgmt_tx_flush(struct ath10k *ar, struct sk_buff *skb)
> +{
> + struct ieee80211_hdr *hdr = (struct ieee80211_hdr *)skb->data;
> + struct ieee80211_sta *sta;
> + struct ath10k_vif *arvif;
> + u8 *da = ieee80211_get_DA(hdr);
> + u8 vdev_id = ATH10K_SKB_CB(skb)->vdev_id;
> + u32 bcn_intval = 0;
> + unsigned int msecs;
> + int ret;
> +
> + lockdep_assert_held(&ar->conf_mutex);
> +
> + if (!is_unicast_ether_addr(da))
> + return;
> +
> + rcu_read_lock();
> + sta = ieee80211_find_sta_by_ifaddr(ar->hw, da, NULL);
> + rcu_read_unlock();
> +
> + /*
> + * FW Tx queues can be paused only for associated peers. Since this is
> + * a workaround just assume worst case if station simply entry exists.
> + */
> + if (!sta)
> + return;
> +
> + list_for_each_entry(arvif, &ar->arvifs, list) {
> + if (arvif->vdev_id == vdev_id) {
> + bcn_intval = arvif->beacon_interval;
> + break;
> + }
> + }
> +
> + if (!bcn_intval)
> + return;
> +
> + /*
> + * Wait 2 beacon intervals before flushing so stations that are
> + * asleep but are actually still in range have a chance to see
> + * updated PVB, wake up and fetch the frame. There's no other way of
> + * synchronizing other than sleeping because there's no tx completion
> + * indication event for WMI management tx.
> + */
> + msecs = 2 * arvif->beacon_interval * 1024 / 1000;
> + ath10k_dbg(ATH10K_DBG_MAC,
> + "mac flushing peer %pM on vdev %i mgmt tid for unicast mgmt (%d msecs)\n",
> + da, vdev_id, msecs);
> + msleep(msecs);
> +
> + ret = ath10k_wmi_peer_flush(ar, vdev_id, da,
> + WMI_PEER_TID_MGMT_MASK);
> + if (ret)
> + ath10k_warn("failed to flush peer %pM mgmt tid: %d\n",
> + da, ret);
> +}
> +
> void ath10k_mgmt_over_wmi_tx_work(struct work_struct *work)
> {
> struct ath10k *ar = container_of(work, struct ath10k, wmi_mgmt_tx_work);
> + struct ieee80211_tx_info *info;
> struct sk_buff *skb;
> int ret;
>
> @@ -2087,12 +2145,50 @@ void ath10k_mgmt_over_wmi_tx_work(struct work_struct *work)
> if (!skb)
> break;
>
> + mutex_lock(&ar->conf_mutex);
> +
> ret = ath10k_wmi_mgmt_tx(ar, skb);
> if (ret) {
> ath10k_warn("failed to transmit management frame via WMI: %d\n",
> ret);
> + mutex_unlock(&ar->conf_mutex);
> ieee80211_free_txskb(ar->hw, skb);
> + continue;
> }
> +
> + /*
> + * Each WMI management Tx consumes 1 HTC Tx credit and doesn't
> + * replenish it until the frame is actually transmitted out of
> + * firmware's Tx queues.
> + *
> + * If associated client was asleep and has gone out of range
> + * then unicast frames won't be released for FW/HW queues for a
> + * while (10 seconds per observation). This means that if more
> + * management frames are queued then HTC Tx credits are drained
> + * to 0 and no other commands can be submitted including
> + * beacons and peer removal.
> + *
> + * This could in turn result in clients disconnecting due to
> + * their beacon loss and may trigger spurious sta kickouts
> + * because wmi peer removal command may never reach firmware
> + * during disassociation.
> + *
> + * This could happen, e.g. when disconnecting client that has
> + * gone away while asleep.
> + *
> + * As a workaround flush unicast management frames that can
> + * possibly be buffered.
> + *
> + * Note: This is a deficiency in design of WMI_MGMT_TX command.
> + */
> + ath10k_mgmt_tx_flush(ar, skb);
> +
> + mutex_unlock(&ar->conf_mutex);
> +
> + /* there's no way to get ACK so just assume it's acked */
> + info = IEEE80211_SKB_CB(skb);
> + info->flags |= IEEE80211_TX_STAT_ACK;
> + ieee80211_tx_status(ar->hw, skb);
> }
> }
>
> diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c
> index d4b48ef..3b270a4 100644
> --- a/drivers/net/wireless/ath/ath10k/wmi.c
> +++ b/drivers/net/wireless/ath/ath10k/wmi.c
> @@ -637,7 +637,6 @@ int ath10k_wmi_mgmt_tx(struct ath10k *ar, struct sk_buff *skb)
> struct wmi_mgmt_tx_cmd *cmd;
> struct ieee80211_hdr *hdr;
> struct sk_buff *wmi_skb;
> - struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
> int len;
> u16 fc;
>
> @@ -673,10 +672,6 @@ int ath10k_wmi_mgmt_tx(struct ath10k *ar, struct sk_buff *skb)
> if (ret)
> return ret;
>
> - /* TODO: report tx status to mac80211 - temporary just ACK */
> - info->flags |= IEEE80211_TX_STAT_ACK;
> - ieee80211_tx_status_irqsafe(ar->hw, skb);
> -
> return ret;
> }
>
> diff --git a/drivers/net/wireless/ath/ath10k/wmi.h b/drivers/net/wireless/ath/ath10k/wmi.h
> index ae83822..90fe2e9 100644
> --- a/drivers/net/wireless/ath/ath10k/wmi.h
> +++ b/drivers/net/wireless/ath/ath10k/wmi.h
> @@ -3844,6 +3844,9 @@ struct wmi_peer_delete_cmd {
> struct wmi_mac_addr peer_macaddr;
> } __packed;
>
> +#define WMI_PEER_TID_MGMT 17
> +#define WMI_PEER_TID_MGMT_MASK BIT(WMI_PEER_TID_MGMT)
> +
> struct wmi_peer_flush_tids_cmd {
> __le32 vdev_id;
> struct wmi_mac_addr peer_macaddr;
> --
> 1.8.5.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2015-01-29 16:50:44

by Ben Greear

[permalink] [raw]
Subject: Re: [RFTv2 2/5] ath10k: fix wmi-htc tx credit starvation

On 01/28/2015 11:57 PM, Michal Kazior wrote:
> On 29 January 2015 at 02:32, YanBo <[email protected]> wrote:
>> Hi Michal,
>>
>> What the conclusion about this patch, it looks like this patch not be
>> merged into ath10K due to introduce some unstable issue, I'v got
>> another issue that when move the station enter hibernate mode. the AP
>> will continue report message like before
>> [ 3958.681293] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
>> 00:03:7f:40:04:5b
>> [ 3959.681449] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
>> 00:03:7f:40:04:5b
>> [ 3960.681696] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
>> 00:03:7f:40:04:5b
>> [ 3961.681877] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
>> 00:03:7f:40:04:5b
>> [ 3962.682080] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
>> 00:03:7f:40:04:5b
>> [ 3963.682361] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
>> 00:03:7f:40:04:5b
>> [ 3964.682550] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
>> 00:03:7f:40:04:5b
>> [ 3965.682743] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
>> 00:03:7f:40:04:5b
>
> The spurious STA kickout alone is most likely an aftermath of HTX Tx
> credit starvation when client was detected as inactive by hostapd and
> was subsequently disassociated. However due to starvation
> wmi-peer-delete was never sent to firmware so fw thinks the peer is
> still there.
>
> I suppose fw should be restarted when ath10k is unable to submit a
> configuration command like wmi-peer-delete. It doesn't make sense to
> continue since fw-host state loses coherency and weird things can
> start to happen (spurious sta kickout is the best known example).

At least some of the tx-credits problem is in firmware, but
regardless of that:

Instead of restarting firmware in this case, maybe change the
'wait-for-3-seconds' timeout to 3 1-second timeouts, and
on second timeout force a flush, ignoring tx-credits if
required? That may not be pretty, but seems better than resetting
firmware if it works.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2015-01-29 07:57:15

by Michal Kazior

[permalink] [raw]
Subject: Re: [RFTv2 2/5] ath10k: fix wmi-htc tx credit starvation

On 29 January 2015 at 02:32, YanBo <[email protected]> wrote:
> Hi Michal,
>
> What the conclusion about this patch, it looks like this patch not be
> merged into ath10K due to introduce some unstable issue, I'v got
> another issue that when move the station enter hibernate mode. the AP
> will continue report message like before
> [ 3958.681293] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
> 00:03:7f:40:04:5b
> [ 3959.681449] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
> 00:03:7f:40:04:5b
> [ 3960.681696] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
> 00:03:7f:40:04:5b
> [ 3961.681877] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
> 00:03:7f:40:04:5b
> [ 3962.682080] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
> 00:03:7f:40:04:5b
> [ 3963.682361] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
> 00:03:7f:40:04:5b
> [ 3964.682550] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
> 00:03:7f:40:04:5b
> [ 3965.682743] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA
> 00:03:7f:40:04:5b

The spurious STA kickout alone is most likely an aftermath of HTX Tx
credit starvation when client was detected as inactive by hostapd and
was subsequently disassociated. However due to starvation
wmi-peer-delete was never sent to firmware so fw thinks the peer is
still there.

I suppose fw should be restarted when ath10k is unable to submit a
configuration command like wmi-peer-delete. It doesn't make sense to
continue since fw-host state loses coherency and weird things can
start to happen (spurious sta kickout is the best known example).


> and there are also error message like this be happened at early time:
>
>
> [ 1316.883053] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0
>
> [ 1316.912357] ath10k_pci 0000:01:00.0: failed to transmit management
> frame via WMI: -11
>
> [ 1316.985476] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0
>
> I suspect it is triggered as you mentioned because the HTC Tx credits
> are drained
> to 0 and no other commands can be submitted, if the answer is yes,
> I'd hear your suggestion about whether this patch still worth to be
> continue improve to solve such kinds of issue.

Yep, looks like the starvation issue.

The problem with the patch is it creates ugly latencies. This has been
reported by Avery[1] (he used/uses this patch internally for his
purposes).

Ideally mgmt frames should be sent via HTT. 10.2 is capable of sending
raw frames via HTT so it might be possible to utilize that and forgo
WMI mgmt tx for 10.2+. I did a proof-of-concept for raw tx on 10.2
some time ago [2] but I'm haven't tested how it interacts with
powersave buffering.


[1]: http://thread.gmane.org/gmane.linux.drivers.ath10k.devel/638
[2]: http://thread.gmane.org/gmane.linux.drivers.ath10k.devel/246


Michał