Failed to transmit wmi management frames:
[84977.840894] ath10k_snoc a000000.wifi: wmi mgmt tx queue is full
[84977.840913] ath10k_snoc a000000.wifi: failed to transmit packet, dropping: -28
[84977.840924] ath10k_snoc a000000.wifi: failed to submit frame: -28
[84977.840932] ath10k_snoc a000000.wifi: failed to transmit frame: -28
This issue is caused by a race condition between skb_dequeue and
__skb_queue_tail. The ‘wmi_mgmt_tx_queue’ queue is protected by two
different locks, ar->data_lock on the enqueue side and list->lock on
the dequeue side, so in effect there is no protection.
So when ath10k_mgmt_over_wmi_tx_work() and ath10k_mac_tx_wmi_mgmt()
run concurrently on different CPUs, there appears to be a rare corner
case when the queue length is 1:
CPUx (skb_dequeue)                        CPUy (__skb_queue_tail)
                                          next=list
                                          prev=list
struct sk_buff *skb = skb_peek(list);     WRITE_ONCE(newsk->next, next);
WRITE_ONCE(list->qlen, list->qlen - 1);   WRITE_ONCE(newsk->prev, prev);
next = skb->next;                         WRITE_ONCE(next->prev, newsk);
prev = skb->prev;                         WRITE_ONCE(prev->next, newsk);
skb->next = skb->prev = NULL;             list->qlen++;
WRITE_ONCE(next->prev, prev);
WRITE_ONCE(prev->next, next);
If the instruction ‘next = skb->next’ is executed before
‘WRITE_ONCE(prev->next, newsk)’, newsk will be lost, because CPUx gets
the old ‘next’ pointer, but the length is still incremented by one. The
final result is that the queue length reaches the maximum value while
the queue is actually empty.
So remove ar->data_lock, and use 'skb_queue_tail' instead of
'__skb_queue_tail' to prevent the potential race condition.
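
The interleaving described above can be replayed deterministically on a single thread. The following is only a userspace sketch, not kernel code: `struct node`, `struct queue`, `queue_tail_unlocked()` and `replay_lost_enqueue()` are hypothetical stand-ins for sk_buff, sk_buff_head and the kernel helpers, and the "CPUy" side takes `prev` from the list tail, as a tail insert does. It shows the new element being dropped while the length still ends up incremented.

```c
#include <stddef.h>

/* Simplified stand-in for sk_buff / sk_buff_head: a circular doubly
 * linked list whose head doubles as a sentinel node. */
struct node {
	struct node *next, *prev;
};

struct queue {
	struct node head;	/* sentinel; head.next is the first element */
	unsigned int qlen;
};

static void queue_init(struct queue *q)
{
	q->head.next = q->head.prev = &q->head;
	q->qlen = 0;
}

/* Tail insert without any locking, in the same step order as the
 * CPUy column of the trace. */
static void queue_tail_unlocked(struct queue *q, struct node *n)
{
	struct node *next = &q->head;
	struct node *prev = q->head.prev;

	n->next = next;
	n->prev = prev;
	next->prev = n;
	prev->next = n;
	q->qlen++;
}

/* Replays the bad interleaving step by step.
 * Returns 1 if the list ends up empty while qlen is still 1. */
static int replay_lost_enqueue(void)
{
	struct queue q;
	struct node a, b;

	queue_init(&q);
	queue_tail_unlocked(&q, &a);		/* queue length is now 1 */

	/* CPUy: first three steps of the tail insert of b */
	struct node *y_next = &q.head;
	struct node *y_prev = q.head.prev;	/* &a */
	b.next = y_next;
	b.prev = y_prev;
	y_next->prev = &b;

	/* CPUx: starts dequeuing a and samples the links too early */
	struct node *skb = q.head.next;		/* &a */
	q.qlen--;
	struct node *x_next = skb->next;	/* stale: still the list head */
	struct node *x_prev = skb->prev;

	/* CPUy: finishes the insert */
	y_prev->next = &b;
	q.qlen++;

	/* CPUx: finishes the unlink using the stale pointers */
	skb->next = skb->prev = NULL;
	x_next->prev = x_prev;
	x_prev->next = x_next;

	/* b is no longer reachable from the head, yet qlen counts it */
	return q.head.next == &q.head && q.qlen == 1;
}
```

Each replay leaks one count, so repeating it eventually pins the length at the limit with nothing left to dequeue, which matches the "queue is full" log.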
Tested-on: WCN3990 hw1.0 SNOC WLAN.HL.3.1.c2-00033-QCAHLSWMTPLZ-1
Signed-off-by: Miaoqing Pan <[email protected]>
---
drivers/net/wireless/ath/ath10k/mac.c | 13 +++----------
1 file changed, 3 insertions(+), 10 deletions(-)
diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index dc32c78..5da1504 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -3763,23 +3763,16 @@ bool ath10k_mac_tx_frm_has_freq(struct ath10k *ar)
 static int ath10k_mac_tx_wmi_mgmt(struct ath10k *ar, struct sk_buff *skb)
 {
 	struct sk_buff_head *q = &ar->wmi_mgmt_tx_queue;
-	int ret = 0;
-
-	spin_lock_bh(&ar->data_lock);

 	if (skb_queue_len(q) == ATH10K_MAX_NUM_MGMT_PENDING) {
 		ath10k_warn(ar, "wmi mgmt tx queue is full\n");
-		ret = -ENOSPC;
-		goto unlock;
+		return -ENOSPC;
 	}

-	__skb_queue_tail(q, skb);
+	skb_queue_tail(q, skb);
 	ieee80211_queue_work(ar->hw, &ar->wmi_mgmt_tx_work);

-unlock:
-	spin_unlock_bh(&ar->data_lock);
-
-	return ret;
+	return 0;
 }

 static enum ath10k_mac_tx_path
--
2.7.4
Hi,
On Sun, Dec 20, 2020 at 5:53 PM Miaoqing Pan <[email protected]> wrote:
>
> Failed to transmit wmi management frames:
>
> [84977.840894] ath10k_snoc a000000.wifi: wmi mgmt tx queue is full
> [84977.840913] ath10k_snoc a000000.wifi: failed to transmit packet, dropping: -28
> [84977.840924] ath10k_snoc a000000.wifi: failed to submit frame: -28
> [84977.840932] ath10k_snoc a000000.wifi: failed to transmit frame: -28
>
> This issue is caused by a race condition between skb_dequeue and
> __skb_queue_tail. The ‘wmi_mgmt_tx_queue’ queue is protected by two
> different locks, ar->data_lock on the enqueue side and list->lock on
> the dequeue side, so in effect there is no protection.
Nice catch!
> --- a/drivers/net/wireless/ath/ath10k/mac.c
> +++ b/drivers/net/wireless/ath/ath10k/mac.c
> @@ -3763,23 +3763,16 @@ bool ath10k_mac_tx_frm_has_freq(struct ath10k *ar)
> static int ath10k_mac_tx_wmi_mgmt(struct ath10k *ar, struct sk_buff *skb)
> {
> struct sk_buff_head *q = &ar->wmi_mgmt_tx_queue;
> - int ret = 0;
> -
> - spin_lock_bh(&ar->data_lock);
>
> if (skb_queue_len(q) == ATH10K_MAX_NUM_MGMT_PENDING) {
I believe you should be switching this to use skb_queue_len_lockless()
too. And this still probably leaves a TOCTOU race; maybe we should use
">=" here, in case we queue a few SKBs simultaneously? It doesn't seem
like we actually have a hard limit here, but it still seems like we
shouldn't leave this potential inconsistency.
Brian
> ath10k_warn(ar, "wmi mgmt tx queue is full\n");
> - ret = -ENOSPC;
> - goto unlock;
> + return -ENOSPC;
> }
>
> - __skb_queue_tail(q, skb);
> + skb_queue_tail(q, skb);
> ieee80211_queue_work(ar->hw, &ar->wmi_mgmt_tx_work);
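
The two suggestions above (an annotated lockless length read and a ">=" comparison) can be sketched in a userspace model. This is an illustration under assumptions, not the driver code: `struct bounded_queue`, `bq_len_lockless()`, `bq_enqueue()` and `MAX_PENDING` are hypothetical names, a C11 atomic_flag spinlock stands in for list->lock, and a relaxed atomic load stands in for skb_queue_len_lockless().

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_PENDING 4	/* hypothetical stand-in for ATH10K_MAX_NUM_MGMT_PENDING */

struct item {
	struct item *next;
};

struct bounded_queue {
	atomic_flag lock;	/* plays the role of list->lock */
	struct item *head, *tail;
	atomic_uint qlen;
};

static void bq_init(struct bounded_queue *q)
{
	atomic_flag_clear(&q->lock);
	q->head = q->tail = NULL;
	atomic_init(&q->qlen, 0);
}

/* Annotated read of the length without holding the lock. */
static unsigned int bq_len_lockless(struct bounded_queue *q)
{
	return atomic_load_explicit(&q->qlen, memory_order_relaxed);
}

/* Locked tail insert; returns 0 on success or -ENOSPC when full. */
static int bq_enqueue(struct bounded_queue *q, struct item *it)
{
	/* ">=" rather than "==": concurrent producers may overshoot the limit. */
	if (bq_len_lockless(q) >= MAX_PENDING)
		return -ENOSPC;

	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;	/* spin until the lock is free */

	it->next = NULL;
	if (q->tail)
		q->tail->next = it;
	else
		q->head = it;
	q->tail = it;
	atomic_fetch_add_explicit(&q->qlen, 1, memory_order_relaxed);

	atomic_flag_clear_explicit(&q->lock, memory_order_release);
	return 0;
}
```

The check and the insert are still two separate steps, so the limit stays soft: several producers can pass the check at once and push the length slightly past MAX_PENDING. With ">=" the queue then still reports full instead of missing the exact value.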
Brian Norris <[email protected]> writes:
> Hi,
>
> On Sun, Dec 20, 2020 at 5:53 PM Miaoqing Pan <[email protected]> wrote:
>>
>> Failed to transmit wmi management frames:
>>
>> [84977.840894] ath10k_snoc a000000.wifi: wmi mgmt tx queue is full
>> [84977.840913] ath10k_snoc a000000.wifi: failed to transmit packet, dropping: -28
>> [84977.840924] ath10k_snoc a000000.wifi: failed to submit frame: -28
>> [84977.840932] ath10k_snoc a000000.wifi: failed to transmit frame: -28
>>
>> This issue is caused by a race condition between skb_dequeue and
>> __skb_queue_tail. The ‘wmi_mgmt_tx_queue’ queue is protected by two
>> different locks, ar->data_lock on the enqueue side and list->lock on
>> the dequeue side, so in effect there is no protection.
>
> Nice catch!
>
>> --- a/drivers/net/wireless/ath/ath10k/mac.c
>> +++ b/drivers/net/wireless/ath/ath10k/mac.c
>> @@ -3763,23 +3763,16 @@ bool ath10k_mac_tx_frm_has_freq(struct ath10k *ar)
>> static int ath10k_mac_tx_wmi_mgmt(struct ath10k *ar, struct sk_buff *skb)
>> {
>> struct sk_buff_head *q = &ar->wmi_mgmt_tx_queue;
>> - int ret = 0;
>> -
>> - spin_lock_bh(&ar->data_lock);
>>
>> if (skb_queue_len(q) == ATH10K_MAX_NUM_MGMT_PENDING) {
>
> I believe you should be switching this to use skb_queue_len_lockless()
> too.
Why lockless? (reads documentation) Ah, is it due to memory
synchronisation now that we don't take the data_lock anymore?
--
https://patchwork.kernel.org/project/linux-wireless/list/
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
On Mon, Dec 21, 2020 at 11:53 AM Kalle Valo <[email protected]> wrote:
> Brian Norris <[email protected]> writes:
> > On Sun, Dec 20, 2020 at 5:53 PM Miaoqing Pan <[email protected]> wrote:
> >> if (skb_queue_len(q) == ATH10K_MAX_NUM_MGMT_PENDING) {
> >
> > I believe you should be switching this to use skb_queue_len_lockless()
> > too.
>
> Why lockless? (reads documentation) Ah, is it due to memory
> synchronisation now that we don't take the data_lock anymore?
As the original post notes, there was no valid locking in the first
place anyway, but now that we're fully relying on the queue lock, we
either need to grab that lock, or else otherwise use lock-free-denoted
methods.
One could say it's about data synchronization, but it's really about
the lack of memory model -- C didn't have a formal one until
relatively recently, and the kernel has always blazed its own way
anyway. You need to annotate *something* about a bare read, otherwise
it's not safe to do concurrently; either compiler or hardware can do
nasty things to you. (In practice, it's unlikely this particular case
will cause a problem for this reason; it's already a somewhat
imprecise check anyway.)
Brian
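
To make the "annotate a bare read" point concrete, here is a rough userspace approximation of what such an annotation does. The macros below are hypothetical sketches modelled on the kernel's READ_ONCE()/WRITE_ONCE(), not their real definitions, and they rely on the GCC/Clang `__typeof__` extension; `struct len_only_queue` and `queue_is_full()` are made up for illustration.

```c
/* Hypothetical approximations: a volatile access forces the compiler to
 * emit exactly one load or store and not to cache or refetch the value. */
#define READ_ONCE_SKETCH(x)	(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

struct len_only_queue {
	unsigned int qlen;	/* updated by other threads under a lock */
};

/* Lock-free fullness check: the read is annotated instead of bare. */
static int queue_is_full(struct len_only_queue *q, unsigned int limit)
{
	return READ_ONCE_SKETCH(q->qlen) >= limit;
}
```

The annotation only constrains the compiler for this one access. It does not make the check-then-enqueue sequence atomic, which is why the result is still treated as an approximate check.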
Miaoqing Pan <[email protected]> wrote:
> Failed to transmit wmi management frames:
>
> [84977.840894] ath10k_snoc a000000.wifi: wmi mgmt tx queue is full
> [84977.840913] ath10k_snoc a000000.wifi: failed to transmit packet, dropping: -28
> [84977.840924] ath10k_snoc a000000.wifi: failed to submit frame: -28
> [84977.840932] ath10k_snoc a000000.wifi: failed to transmit frame: -28
>
> This issue is caused by a race condition between skb_dequeue and
> __skb_queue_tail. The ‘wmi_mgmt_tx_queue’ queue is protected by two
> different locks, ar->data_lock on the enqueue side and list->lock on
> the dequeue side, so in effect there is no protection.
> So when ath10k_mgmt_over_wmi_tx_work() and ath10k_mac_tx_wmi_mgmt()
> run concurrently on different CPUs, there appears to be a rare corner
> case when the queue length is 1:
>
> CPUx (skb_dequeue)                        CPUy (__skb_queue_tail)
>                                           next=list
>                                           prev=list
> struct sk_buff *skb = skb_peek(list);     WRITE_ONCE(newsk->next, next);
> WRITE_ONCE(list->qlen, list->qlen - 1);   WRITE_ONCE(newsk->prev, prev);
> next = skb->next;                         WRITE_ONCE(next->prev, newsk);
> prev = skb->prev;                         WRITE_ONCE(prev->next, newsk);
> skb->next = skb->prev = NULL;             list->qlen++;
> WRITE_ONCE(next->prev, prev);
> WRITE_ONCE(prev->next, next);
>
> If the instruction ‘next = skb->next’ is executed before
> ‘WRITE_ONCE(prev->next, newsk)’, newsk will be lost, because CPUx gets
> the old ‘next’ pointer, but the length is still incremented by one. The
> final result is that the queue length reaches the maximum value while
> the queue is actually empty.
>
> So remove ar->data_lock, and use 'skb_queue_tail' instead of
> '__skb_queue_tail' to prevent the potential race condition.
>
> Tested-on: WCN3990 hw1.0 SNOC WLAN.HL.3.1.c2-00033-QCAHLSWMTPLZ-1
>
> Signed-off-by: Miaoqing Pan <[email protected]>
> Signed-off-by: Kalle Valo <[email protected]>
Please address Brian's comments and send v2.
Patch set to Changes Requested.
--
https://patchwork.kernel.org/project/linux-wireless/patch/[email protected]/
> Please address Brian's comments and send v2.
>
> Patch set to Changes Requested.
Updated.
https://patchwork.kernel.org/project/linux-wireless/patch/[email protected]/