2018-05-16 14:08:32

by Daniel Mack

[permalink] [raw]
Subject: [PATCH 00/10] Some more patches for wcn36xx

Here are some more patches for the wcn36xx driver that emerged from
reviews and my attempts to fix the connection timeout issues. Some of
them merely bring the driver in sync with what the "prima" downstream
driver is doing, and they seemingly make the driver more stable.
Others just apply some best practice patterns or do some cleanup.

The first patch in the series is one that I've sent earlier as RFC, and
Loic expressed his support in a follow-up. I still believe it does the
right thing.

However, the connection timeout problems are still there unfortunately,
but they seemingly appear to happen less often now.

To recap, when the driver gets stuck, not a single packet is being sent
out anymore. I have a 2nd machine tuned to the same channel to capture
all packets from the MAC of the device under test, and it shows a flat
line once the connection timeouts happen. Some more context is given in
this bug report:

https://bugs.96boards.org/show_bug.cgi?id=538

One interesting find is that setting the debug_mask module parameter to
0x4 or 0x400 (or both), the issue is much less likely to happen. All
this does is adding printk()s before each message is sent out, and that
affects the timings of course. I tried adding mdelay() there instead,
and they too make the effect less likely, but they don't make it go
away.

Hence I believe that some sort of firmware internal buffer is overrun if
too many SMD requests fly in in a short amount of time. The firmware
does, however, still ack all packets just fine on the SMD channels, and
also the DXE communication flows are all healthy. No errors are reported
anywhere, but nothing is being put on the ether anymore.

I'm running out of ideas at this point. I guess without access to the
firmware sources, it's virtually impossible to grok what's going on and
how to work around whatever is causing it.

If anyone has an idea on how to debug this frustrating issue any further,
please let me know. A reproducer is described in the bug reported linked
to above.


Thanks,
Daniel


Daniel Mack (10):
wcn36xx: fix buffer commit logic on TX path
wcn36xx: set DMA mask explicitly
wcn36xx: don't disable RX IRQ from handler
wcn36xx: clear all masks in RX interrupt
wcn36xx: only handle packets when ED or DONE bit is set
wcn36xx: consider CTRL_EOP bit when looking for valid descriptors
wcn36xx: set PREASSOC and IDLE stated when BSS info changes
wcn36xx: drain pending indicator messages on shutdown
wcn36xx: simplify wcn36xx_smd_open()
wcn36xx: improve debug and error messages for SMD

drivers/net/wireless/ath/wcn36xx/dxe.c | 176 ++++++++++++++++++++------------
drivers/net/wireless/ath/wcn36xx/main.c | 10 ++
drivers/net/wireless/ath/wcn36xx/smd.c | 32 +++---
3 files changed, 137 insertions(+), 81 deletions(-)

--
2.14.3


2018-05-24 08:44:15

by Kalle Valo

[permalink] [raw]
Subject: Re: [PATCH 00/10] Some more patches for wcn36xx

Daniel Mack <[email protected]> writes:

> On Friday, May 18, 2018 01:28 PM, Kalle Valo wrote:
>> Daniel Mack <[email protected]> writes:
>>
>>> On Wednesday, May 16, 2018 04:08 PM, Daniel Mack wrote:
>>>> Hence I believe that some sort of firmware internal buffer is overrun if
>>>> too many SMD requests fly in in a short amount of time. The firmware
>>>> does, however, still ack all packets just fine on the SMD channels, and
>>>> also the DXE communication flows are all healthy. No errors are reported
>>>> anywhere, but nothing is being put on the ether anymore.
>>>
>>> And FTR, there is a commit in the prima repository that caught my
>>> attention a while back:
>>>
>>> https://source.codeaurora.org/external/wlan/prima/commit/?id=93cd8f3c
>>>
>>> What this does (through an remarkable number of indirection layers) is
>>> sending the DUMP_COMMAND_REQ command with args = (274, 0, 0, 0, 0)
>>> when management frames get stuck, which smells pretty much like the
>>> issue I'm seeing. Doing the same with the mainline driver and the
>>> debugfs interface it exposes doesn't have any effect though.
>>>
>>> But even if it did work, I wouldn't see a way to detect the situation
>>> in which this is needed reliably.
>>
>> The firmware version might make a difference so I recommend always
>> mentioning the firmware version as well. For example, what if your
>> firmware does not support that command or parameter?
>
> Sure, that could be the case. FTR - the firmware I'm using is the one
> that came out of the Qualcomm r1034.2.1 BSP. It is recognized by the
> driver as 'WCN v2.0 RadioPhy vRhea_GF_1.12 with 19.2MHz XO'.

Ok, thanks. Please add that to the bug report.

>> Also I would recommend to file a bug to bugzilla.kernel.org so that all
>> the information is one place and it can be easily updated. Now it's
>> pretty difficult to get the big picture from various emails on the list.
>
> Yes, I agree it's a bit convoluted. However, there's already the bug
> report on 96board.org that Bjorn opened some time back, and I
> considered that sufficient. IMO, it has all the information needed,
> plus a link to a tool to reproduce the issue.
>
> https://bugs.96boards.org/show_bug.cgi?id=538

Yeah, bugs.96boards.org is fine. As long as there's one place which
collects all the information about the bug.

But IMHO the bug report is not telling much, all I get is that TX frames
get stuck but not even that is confirmed. After reading it I have at
least these questions:

* Is it really confirmed that the issue is that TX frames are stuck? For
example, using a wireless sniffer would confirm that.

* Are only management frames stuck or does it also involve data frames?

* Based on the bug report the TX stuck issue seems to happen during
authentication, but what happens before that? Does wcn36xx get
disconnected from AP or what?

* Any wcn36xx logs about the issue (with or without debug logs)? Also
matching wpasupplicant logs would help.

* Does this only happen with encryption or also in open mode?

* How long does it take with qconnman-stress to reproduce the issue?

* Does the radio environment make any difference on reproducibility? For
example, clear enviroment vs lots of traffic/interference?

--
Kalle Valo

2018-05-17 09:03:56

by Ramon Fried

[permalink] [raw]
Subject: Re: [PATCH 01/10] wcn36xx: fix buffer commit logic on TX path

On Wed, May 16, 2018 at 5:08 PM, Daniel Mack <[email protected]> wrote:
> When wcn36xx_dxe_tx_frame() is entered while the device is still processing
> the queue asyncronously, we are racing against the firmware code with
> updates to the buffer descriptors. Presumably, the firmware scans the ring
> buffer that holds the descriptors and scans for a valid control descriptor,
> and then assumes that the next descriptor contains the payload. If, however,
> the control descriptor is marked valid, but the payload descriptor isn't,
> the packet is not sent out.
>
> Another issue with the current code is that is lacks memory barriers before
> descriptors are marked valid. This is important because the CPU may reorder
> writes to memory, even if it is allocated as coherent DMA area, and hence
> the device may see incompletely written data.
>
> To fix this, the code in wcn36xx_dxe_tx_frame() was restructured a bit so
> that the payload descriptor is made valid before the control descriptor.
> Memory barriers are added to ensure coherency of shared memory areas.
>
> Signed-off-by: Daniel Mack <[email protected]>
> ---
> drivers/net/wireless/ath/wcn36xx/dxe.c | 75 +++++++++++++++++-----------------
> 1 file changed, 38 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/wcn36xx/dxe.c b/drivers/net/wireless/ath/wcn36xx/dxe.c
> index bd2b946a65c9..3d1cf7dd1f8e 100644
> --- a/drivers/net/wireless/ath/wcn36xx/dxe.c
> +++ b/drivers/net/wireless/ath/wcn36xx/dxe.c
> @@ -642,8 +642,8 @@ int wcn36xx_dxe_tx_frame(struct wcn36xx *wcn,
> struct sk_buff *skb,
> bool is_low)
> {
> - struct wcn36xx_dxe_ctl *ctl = NULL;
> - struct wcn36xx_dxe_desc *desc = NULL;
> + struct wcn36xx_dxe_desc *desc_bd, *desc_skb;
> + struct wcn36xx_dxe_ctl *ctl_bd, *ctl_skb;
> struct wcn36xx_dxe_ch *ch = NULL;
> unsigned long flags;
> int ret;
> @@ -651,74 +651,75 @@ int wcn36xx_dxe_tx_frame(struct wcn36xx *wcn,
> ch = is_low ? &wcn->dxe_tx_l_ch : &wcn->dxe_tx_h_ch;
>
> spin_lock_irqsave(&ch->lock, flags);
> - ctl = ch->head_blk_ctl;
> + ctl_bd = ch->head_blk_ctl;
> + ctl_skb = ctl_bd->next;
>
> /*
> * If skb is not null that means that we reached the tail of the ring
> * hence ring is full. Stop queues to let mac80211 back off until ring
> * has an empty slot again.
> */
> - if (NULL != ctl->next->skb) {
> + if (NULL != ctl_skb->skb) {
> ieee80211_stop_queues(wcn->hw);
> wcn->queues_stopped = true;
> spin_unlock_irqrestore(&ch->lock, flags);
> return -EBUSY;
> }
>
> - ctl->skb = NULL;
> - desc = ctl->desc;
> + if (unlikely(ctl_skb->bd_cpu_addr)) {
> + wcn36xx_err("bd_cpu_addr cannot be NULL for skb DXE\n");
> + ret = -EINVAL;
> + goto unlock;
> + }
> +
> + desc_bd = ctl_bd->desc;
> + desc_skb = ctl_skb->desc;
> +
> + ctl_bd->skb = NULL;
>
> /* write buffer descriptor */
> - memcpy(ctl->bd_cpu_addr, bd, sizeof(*bd));
> + memcpy(ctl_bd->bd_cpu_addr, bd, sizeof(*bd));
>
> /* Set source address of the BD we send */
> - desc->src_addr_l = ctl->bd_phy_addr;
> -
> - desc->dst_addr_l = ch->dxe_wq;
> - desc->fr_len = sizeof(struct wcn36xx_tx_bd);
> - desc->ctrl = ch->ctrl_bd;
> + desc_bd->src_addr_l = ctl_bd->bd_phy_addr;
> + desc_bd->dst_addr_l = ch->dxe_wq;
> + desc_bd->fr_len = sizeof(struct wcn36xx_tx_bd);
>
> wcn36xx_dbg(WCN36XX_DBG_DXE, "DXE TX\n");
>
> wcn36xx_dbg_dump(WCN36XX_DBG_DXE_DUMP, "DESC1 >>> ",
> - (char *)desc, sizeof(*desc));
> + (char *)desc_bd, sizeof(*desc_bd));
> wcn36xx_dbg_dump(WCN36XX_DBG_DXE_DUMP,
> - "BD >>> ", (char *)ctl->bd_cpu_addr,
> + "BD >>> ", (char *)ctl_bd->bd_cpu_addr,
> sizeof(struct wcn36xx_tx_bd));
>
> - /* Set source address of the SKB we send */
> - ctl = ctl->next;
> - desc = ctl->desc;
> - if (ctl->bd_cpu_addr) {
> - wcn36xx_err("bd_cpu_addr cannot be NULL for skb DXE\n");
> - ret = -EINVAL;
> - goto unlock;
> - }
> -
> - desc->src_addr_l = dma_map_single(wcn->dev,
> - skb->data,
> - skb->len,
> - DMA_TO_DEVICE);
> - if (dma_mapping_error(wcn->dev, desc->src_addr_l)) {
> + desc_skb->src_addr_l = dma_map_single(wcn->dev,
> + skb->data,
> + skb->len,
> + DMA_TO_DEVICE);
> + if (dma_mapping_error(wcn->dev, desc_skb->src_addr_l)) {
> dev_err(wcn->dev, "unable to DMA map src_addr_l\n");
> ret = -ENOMEM;
> goto unlock;
> }
>
> - ctl->skb = skb;
> - desc->dst_addr_l = ch->dxe_wq;
> - desc->fr_len = ctl->skb->len;
> -
> - /* set dxe descriptor to VALID */
> - desc->ctrl = ch->ctrl_skb;
> + ctl_skb->skb = skb;
> + desc_skb->dst_addr_l = ch->dxe_wq;
> + desc_skb->fr_len = ctl_skb->skb->len;
>
> wcn36xx_dbg_dump(WCN36XX_DBG_DXE_DUMP, "DESC2 >>> ",
> - (char *)desc, sizeof(*desc));
> + (char *)desc_skb, sizeof(*desc_skb));
> wcn36xx_dbg_dump(WCN36XX_DBG_DXE_DUMP, "SKB >>> ",
> - (char *)ctl->skb->data, ctl->skb->len);
> + (char *)ctl_skb->skb->data, ctl_skb->skb->len);
>
> /* Move the head of the ring to the next empty descriptor */
> - ch->head_blk_ctl = ctl->next;
> + ch->head_blk_ctl = ctl_skb->next;
> +
> + /* Commit all previous writes and set descriptors to VALID */
> + wmb();
> + desc_skb->ctrl = ch->ctrl_skb;
> + wmb();
> + desc_bd->ctrl = ch->ctrl_bd;
>
> /*
> * When connected and trying to send data frame chip can be in sleep
> --
> 2.14.3
>
>
> _______________________________________________
> wcn36xx mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/wcn36xx

Acked-by: Ramon Fried <[email protected]>

2018-05-16 14:08:33

by Daniel Mack

[permalink] [raw]
Subject: [PATCH 01/10] wcn36xx: fix buffer commit logic on TX path

When wcn36xx_dxe_tx_frame() is entered while the device is still processing
the queue asyncronously, we are racing against the firmware code with
updates to the buffer descriptors. Presumably, the firmware scans the ring
buffer that holds the descriptors and scans for a valid control descriptor,
and then assumes that the next descriptor contains the payload. If, however,
the control descriptor is marked valid, but the payload descriptor isn't,
the packet is not sent out.

Another issue with the current code is that is lacks memory barriers before
descriptors are marked valid. This is important because the CPU may reorder
writes to memory, even if it is allocated as coherent DMA area, and hence
the device may see incompletely written data.

To fix this, the code in wcn36xx_dxe_tx_frame() was restructured a bit so
that the payload descriptor is made valid before the control descriptor.
Memory barriers are added to ensure coherency of shared memory areas.

Signed-off-by: Daniel Mack <[email protected]>
---
drivers/net/wireless/ath/wcn36xx/dxe.c | 75 +++++++++++++++++-----------------
1 file changed, 38 insertions(+), 37 deletions(-)

diff --git a/drivers/net/wireless/ath/wcn36xx/dxe.c b/drivers/net/wireless/ath/wcn36xx/dxe.c
index bd2b946a65c9..3d1cf7dd1f8e 100644
--- a/drivers/net/wireless/ath/wcn36xx/dxe.c
+++ b/drivers/net/wireless/ath/wcn36xx/dxe.c
@@ -642,8 +642,8 @@ int wcn36xx_dxe_tx_frame(struct wcn36xx *wcn,
struct sk_buff *skb,
bool is_low)
{
- struct wcn36xx_dxe_ctl *ctl = NULL;
- struct wcn36xx_dxe_desc *desc = NULL;
+ struct wcn36xx_dxe_desc *desc_bd, *desc_skb;
+ struct wcn36xx_dxe_ctl *ctl_bd, *ctl_skb;
struct wcn36xx_dxe_ch *ch = NULL;
unsigned long flags;
int ret;
@@ -651,74 +651,75 @@ int wcn36xx_dxe_tx_frame(struct wcn36xx *wcn,
ch = is_low ? &wcn->dxe_tx_l_ch : &wcn->dxe_tx_h_ch;

spin_lock_irqsave(&ch->lock, flags);
- ctl = ch->head_blk_ctl;
+ ctl_bd = ch->head_blk_ctl;
+ ctl_skb = ctl_bd->next;

/*
* If skb is not null that means that we reached the tail of the ring
* hence ring is full. Stop queues to let mac80211 back off until ring
* has an empty slot again.
*/
- if (NULL != ctl->next->skb) {
+ if (NULL != ctl_skb->skb) {
ieee80211_stop_queues(wcn->hw);
wcn->queues_stopped = true;
spin_unlock_irqrestore(&ch->lock, flags);
return -EBUSY;
}

- ctl->skb = NULL;
- desc = ctl->desc;
+ if (unlikely(ctl_skb->bd_cpu_addr)) {
+ wcn36xx_err("bd_cpu_addr cannot be NULL for skb DXE\n");
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ desc_bd = ctl_bd->desc;
+ desc_skb = ctl_skb->desc;
+
+ ctl_bd->skb = NULL;

/* write buffer descriptor */
- memcpy(ctl->bd_cpu_addr, bd, sizeof(*bd));
+ memcpy(ctl_bd->bd_cpu_addr, bd, sizeof(*bd));

/* Set source address of the BD we send */
- desc->src_addr_l = ctl->bd_phy_addr;
-
- desc->dst_addr_l = ch->dxe_wq;
- desc->fr_len = sizeof(struct wcn36xx_tx_bd);
- desc->ctrl = ch->ctrl_bd;
+ desc_bd->src_addr_l = ctl_bd->bd_phy_addr;
+ desc_bd->dst_addr_l = ch->dxe_wq;
+ desc_bd->fr_len = sizeof(struct wcn36xx_tx_bd);

wcn36xx_dbg(WCN36XX_DBG_DXE, "DXE TX\n");

wcn36xx_dbg_dump(WCN36XX_DBG_DXE_DUMP, "DESC1 >>> ",
- (char *)desc, sizeof(*desc));
+ (char *)desc_bd, sizeof(*desc_bd));
wcn36xx_dbg_dump(WCN36XX_DBG_DXE_DUMP,
- "BD >>> ", (char *)ctl->bd_cpu_addr,
+ "BD >>> ", (char *)ctl_bd->bd_cpu_addr,
sizeof(struct wcn36xx_tx_bd));

- /* Set source address of the SKB we send */
- ctl = ctl->next;
- desc = ctl->desc;
- if (ctl->bd_cpu_addr) {
- wcn36xx_err("bd_cpu_addr cannot be NULL for skb DXE\n");
- ret = -EINVAL;
- goto unlock;
- }
-
- desc->src_addr_l = dma_map_single(wcn->dev,
- skb->data,
- skb->len,
- DMA_TO_DEVICE);
- if (dma_mapping_error(wcn->dev, desc->src_addr_l)) {
+ desc_skb->src_addr_l = dma_map_single(wcn->dev,
+ skb->data,
+ skb->len,
+ DMA_TO_DEVICE);
+ if (dma_mapping_error(wcn->dev, desc_skb->src_addr_l)) {
dev_err(wcn->dev, "unable to DMA map src_addr_l\n");
ret = -ENOMEM;
goto unlock;
}

- ctl->skb = skb;
- desc->dst_addr_l = ch->dxe_wq;
- desc->fr_len = ctl->skb->len;
-
- /* set dxe descriptor to VALID */
- desc->ctrl = ch->ctrl_skb;
+ ctl_skb->skb = skb;
+ desc_skb->dst_addr_l = ch->dxe_wq;
+ desc_skb->fr_len = ctl_skb->skb->len;

wcn36xx_dbg_dump(WCN36XX_DBG_DXE_DUMP, "DESC2 >>> ",
- (char *)desc, sizeof(*desc));
+ (char *)desc_skb, sizeof(*desc_skb));
wcn36xx_dbg_dump(WCN36XX_DBG_DXE_DUMP, "SKB >>> ",
- (char *)ctl->skb->data, ctl->skb->len);
+ (char *)ctl_skb->skb->data, ctl_skb->skb->len);

/* Move the head of the ring to the next empty descriptor */
- ch->head_blk_ctl = ctl->next;
+ ch->head_blk_ctl = ctl_skb->next;
+
+ /* Commit all previous writes and set descriptors to VALID */
+ wmb();
+ desc_skb->ctrl = ch->ctrl_skb;
+ wmb();
+ desc_bd->ctrl = ch->ctrl_bd;

/*
* When connected and trying to send data frame chip can be in sleep
--
2.14.3

2018-05-17 09:04:31

by Ramon Fried

[permalink] [raw]
Subject: Re: [PATCH 02/10] wcn36xx: set DMA mask explicitly

On Wed, May 16, 2018 at 5:08 PM, Daniel Mack <[email protected]> wrote:
> The device takes 32-bit addresses only, so inform the DMA API about it.
> This is the default on msm8016, so that doesn't change anything, but
> it's best practice to be explicit.
>
> Signed-off-by: Daniel Mack <[email protected]>
> ---
> drivers/net/wireless/ath/wcn36xx/main.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/drivers/net/wireless/ath/wcn36xx/main.c b/drivers/net/wireless/ath/wcn36xx/main.c
> index e3b91b3b38ef..e6330ad372b3 100644
> --- a/drivers/net/wireless/ath/wcn36xx/main.c
> +++ b/drivers/net/wireless/ath/wcn36xx/main.c
> @@ -1309,6 +1309,12 @@ static int wcn36xx_probe(struct platform_device *pdev)
> mutex_init(&wcn->hal_mutex);
> mutex_init(&wcn->scan_lock);
>
> + ret = dma_set_mask_and_coherent(wcn->dev, DMA_BIT_MASK(32));
> + if (ret < 0) {
> + wcn36xx_err("failed to set DMA mask: %d\n", ret);
> + goto out_wq;
> + }
> +
> INIT_WORK(&wcn->scan_work, wcn36xx_hw_scan_worker);
>
> wcn->smd_channel = qcom_wcnss_open_channel(wcnss, "WLAN_CTRL", wcn36xx_smd_rsp_process, hw);
> --
> 2.14.3
>
>
> _______________________________________________
> wcn36xx mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/wcn36xx
Acked-by: Ramon Fried <[email protected]>

2018-05-24 17:53:07

by Ramon Fried

[permalink] [raw]
Subject: Re: wcn36xx: bug #538: stuck tx management frames

On Thu, May 24, 2018 at 3:39 PM, Kalle Valo <[email protected]> wrote:
> Daniel Mack <[email protected]> writes:
>
>> On Thursday, May 24, 2018 01:48 PM, Kalle Valo wrote:
>>> Daniel Mack <[email protected]> writes:
>>>> On Thursday, May 24, 2018 10:44 AM, Kalle Valo wrote:
>>>>> Daniel Mack <[email protected]> writes:
>>
>>>> It seems that once a network is successfully joined, the network
>>>> stability is fine. I haven't seen any starvation of streams lately, at
>>>> least not with the the patches in this series which I'm running since
>>>> a while. That is, until a disconnect/reconnect attempt is made, and at
>>>> this point, only management frames are involved.
>>>
>>> Ah, maybe originally you were seeing different issues with similar
>>> symptoms? But now you have fixed the other bugsand now the stuck
>>> transmitted management frame issue is left? Just guessing...
>>
>> Yeah, I wish I had a clearer picture on all this myself :(
>>
>> My patches definitely address some of the issues I have seen before,
>> also the fixes for potential race conditions are likely to have a
>> positive effect. But as you guessed yourself, I'm afraid that there's
>> a multitude of possible sources for bugs, so it's hard to tell.
>
> Sure, but things seem to going be forward in steady state. Thanks for
> your hard work!
>
>>> It would be great to have wcn36xx logging via tracing, just like ath10k
>>> and iwlwifi does. This way logging shouldn't slow down the system too
>>> much and with wpasupplicant's -T switch you can even get wpasupplicant's
>>> debug messages to the same log with proper timestamps! And almost
>>> forgot, you can also include mac80211 tracing logs as well:
>>>
>>> https://wireless.wiki.kernel.org/en/developers/documentation/mac80211/tracing
>>>
>>> https://wireless.wiki.kernel.org/en/users/drivers/ath10k/debug#tracing
>>>
>>> See ath10k_dbg() and trace_ath10k_log_dbg() for ideas how to implement
>>> this, and you can also take a look at iwlwifi. Should be pretty easy.
>>> Patches more than welcome :)
>>
>> Okay, I'll see if I can find some time to look into this.
>>
>> The reason why I didn't focus the logging possibilities is that I
>> looked at the SMD messages that are flying around which result from
>> ieee80211 API calls into the driver, and I can't seem to find anything
>> that's wrong, especially not before the timeouts occur. Hence, I don't
>> actually expect any oddness on the ieee80211 layer.
>>
>> But I agree that in general, better logging is certainly helpful.
>
> Yeah, I'm not expecting tracing logs to solve this case either but maybe
> we could find some hints. And it just makes it so much easier to see
> what's really happening from tracing logs instead trying to guess from
> the bug description. "a tracing log is worth a thousand words" ;)
>
> But if you don't have time to implement tracing support to wcn36xx
> hopefully someone else has, all one needs is a DragonBoard 410c. A
> simple project for a student or someone who wants to contribute
> something to the kernel.
I have time to do it. If nobody else started yet...
Thanks.
>
>>>> It seems it does, yes. Tests at night seem to take a bit more time to
>>>> make the effect happen. But then again, it could also be unrelated. I
>>>> can't be certain at this point.
>>>
>>> Can you describe what kind of radio environment you have, is it a busy
>>> office complex? How many APs around etc?
>>
>> I've tried different environments. In the office with 15-20
>> laptops/mobiles in the room I see about 10-15 APs. At home, where I
>> did long-term nightly test, there's maybe a higher number of APs, but
>> fewer devices on the channel of the AP that I used for tests.
>>
>> I haven't used any more sophisticated environments like a sealed
>> reverberation chamber yet though.
>
> Ok, but please let me know if you see any differences caused by the
> environment.
>
> --
> Kalle Valo
>
> _______________________________________________
> wcn36xx mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/wcn36xx

2018-05-25 10:08:58

by Kalle Valo

[permalink] [raw]
Subject: Re: [01/10] wcn36xx: fix buffer commit logic on TX path

Daniel Mack <[email protected]> wrote:

> When wcn36xx_dxe_tx_frame() is entered while the device is still processing
> the queue asyncronously, we are racing against the firmware code with
> updates to the buffer descriptors. Presumably, the firmware scans the ring
> buffer that holds the descriptors and scans for a valid control descriptor,
> and then assumes that the next descriptor contains the payload. If, however,
> the control descriptor is marked valid, but the payload descriptor isn't,
> the packet is not sent out.
>
> Another issue with the current code is that is lacks memory barriers before
> descriptors are marked valid. This is important because the CPU may reorder
> writes to memory, even if it is allocated as coherent DMA area, and hence
> the device may see incompletely written data.
>
> To fix this, the code in wcn36xx_dxe_tx_frame() was restructured a bit so
> that the payload descriptor is made valid before the control descriptor.
> Memory barriers are added to ensure coherency of shared memory areas.
>
> Signed-off-by: Daniel Mack <[email protected]>
> Signed-off-by: Kalle Valo <[email protected]>

10 patches applied to ath-next branch of ath.git, thanks.

9a81cc23dfef wcn36xx: fix buffer commit logic on TX path
57e06e0e2a86 wcn36xx: set DMA mask explicitly
ba437e72378c wcn36xx: don't disable RX IRQ from handler
edd23ab403cf wcn36xx: clear all masks in RX interrupt
ce1d4be82b10 wcn36xx: only handle packets when ED or DONE bit is set
18c7ed138824 wcn36xx: consider CTRL_EOP bit when looking for valid descriptors
2a46c829a926 wcn36xx: set PREASSOC and IDLE stated when BSS info changes
773f9a28bcda wcn36xx: drain pending indicator messages on shutdown
a50c6c8412f7 wcn36xx: simplify wcn36xx_smd_open()
ffbc9197b472 wcn36xx: improve debug and error messages for SMD

--
https://patchwork.kernel.org/patch/10404031/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

2018-05-16 14:08:39

by Daniel Mack

[permalink] [raw]
Subject: [PATCH 05/10] wcn36xx: only handle packets when ED or DONE bit is set

On RX and TX interrupts, check for the WCN36XX_CH_STAT_INT_ED_MASK or
WCN36XX_CH_STAT_INT_DONE_MASK in the interrupt reason register, and
only handle packets when it is set. This way, reap_tx_dxes() is only
invoked when needed.

This brings the dequeing logic in line with what the prima downstream
driver is doing.

While at it, also log the interrupt reason.

Signed-off-by: Daniel Mack <[email protected]>
---
drivers/net/wireless/ath/wcn36xx/dxe.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireless/ath/wcn36xx/dxe.c b/drivers/net/wireless/ath/wcn36xx/dxe.c
index d11c9c536627..8c64e05ca3b7 100644
--- a/drivers/net/wireless/ath/wcn36xx/dxe.c
+++ b/drivers/net/wireless/ath/wcn36xx/dxe.c
@@ -430,8 +430,12 @@ static irqreturn_t wcn36xx_irq_tx_complete(int irq, void *dev)
WCN36XX_INT_MASK_CHAN_TX_H);
}

- wcn36xx_dbg(WCN36XX_DBG_DXE, "dxe tx ready high\n");
- reap_tx_dxes(wcn, &wcn->dxe_tx_h_ch);
+ wcn36xx_dbg(WCN36XX_DBG_DXE, "dxe tx ready high, reason %08x\n",
+ int_reason);
+
+ if (int_reason & (WCN36XX_CH_STAT_INT_DONE_MASK |
+ WCN36XX_CH_STAT_INT_ED_MASK))
+ reap_tx_dxes(wcn, &wcn->dxe_tx_h_ch);
}

if (int_src & WCN36XX_INT_MASK_CHAN_TX_L) {
@@ -465,8 +469,12 @@ static irqreturn_t wcn36xx_irq_tx_complete(int irq, void *dev)
WCN36XX_INT_MASK_CHAN_TX_L);
}

- wcn36xx_dbg(WCN36XX_DBG_DXE, "dxe tx ready low\n");
- reap_tx_dxes(wcn, &wcn->dxe_tx_l_ch);
+ wcn36xx_dbg(WCN36XX_DBG_DXE, "dxe tx ready low, reason %08x\n",
+ int_reason);
+
+ if (int_reason & (WCN36XX_CH_STAT_INT_DONE_MASK |
+ WCN36XX_CH_STAT_INT_ED_MASK))
+ reap_tx_dxes(wcn, &wcn->dxe_tx_l_ch);
}

return IRQ_HANDLED;
@@ -545,6 +553,10 @@ static int wcn36xx_rx_handle_packets(struct wcn36xx *wcn,
WCN36XX_DXE_0_INT_ED_CLR,
int_mask);

+ if (!(int_reason & (WCN36XX_CH_STAT_INT_DONE_MASK |
+ WCN36XX_CH_STAT_INT_ED_MASK)))
+ return 0;
+
spin_lock(&ch->lock);

ctl = ch->head_blk_ctl;
--
2.14.3

2018-05-16 14:08:49

by Daniel Mack

[permalink] [raw]
Subject: [PATCH 10/10] wcn36xx: improve debug and error messages for SMD

Add a missing newline in wcn36xx_smd_send_and_wait() and also log the
command request and response type that was processed.

Signed-off-by: Daniel Mack <[email protected]>
---
drivers/net/wireless/ath/wcn36xx/smd.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireless/ath/wcn36xx/smd.c b/drivers/net/wireless/ath/wcn36xx/smd.c
index 43c8aa79fad4..b855a58d5aac 100644
--- a/drivers/net/wireless/ath/wcn36xx/smd.c
+++ b/drivers/net/wireless/ath/wcn36xx/smd.c
@@ -252,23 +252,29 @@ static int wcn36xx_smd_send_and_wait(struct wcn36xx *wcn, size_t len)
{
int ret = 0;
unsigned long start;
+ struct wcn36xx_hal_msg_header *hdr =
+ (struct wcn36xx_hal_msg_header *)wcn->hal_buf;
+ u16 req_type = hdr->msg_type;
+
wcn36xx_dbg_dump(WCN36XX_DBG_SMD_DUMP, "HAL >>> ", wcn->hal_buf, len);

init_completion(&wcn->hal_rsp_compl);
start = jiffies;
ret = rpmsg_send(wcn->smd_channel, wcn->hal_buf, len);
if (ret) {
- wcn36xx_err("HAL TX failed\n");
+ wcn36xx_err("HAL TX failed for req %d\n", req_type);
goto out;
}
if (wait_for_completion_timeout(&wcn->hal_rsp_compl,
msecs_to_jiffies(HAL_MSG_TIMEOUT)) <= 0) {
- wcn36xx_err("Timeout! No SMD response in %dms\n",
- HAL_MSG_TIMEOUT);
+ wcn36xx_err("Timeout! No SMD response to req %d in %dms\n",
+ req_type, HAL_MSG_TIMEOUT);
ret = -ETIME;
goto out;
}
- wcn36xx_dbg(WCN36XX_DBG_SMD, "SMD command completed in %dms",
+ wcn36xx_dbg(WCN36XX_DBG_SMD,
+ "SMD command (req %d, rsp %d) completed in %dms\n",
+ req_type, hdr->msg_type,
jiffies_to_msecs(jiffies - start));
out:
return ret;
--
2.14.3

2018-05-16 14:08:46

by Daniel Mack

[permalink] [raw]
Subject: [PATCH 09/10] wcn36xx: simplify wcn36xx_smd_open()

Drop the extra warning about failed allocations, both the core and the
only caller of this function will warn loud enough in such cases.

Signed-off-by: Daniel Mack <[email protected]>
---
drivers/net/wireless/ath/wcn36xx/smd.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/net/wireless/ath/wcn36xx/smd.c b/drivers/net/wireless/ath/wcn36xx/smd.c
index 0a505b5e038b..43c8aa79fad4 100644
--- a/drivers/net/wireless/ath/wcn36xx/smd.c
+++ b/drivers/net/wireless/ath/wcn36xx/smd.c
@@ -2494,21 +2494,15 @@ static void wcn36xx_ind_smd_work(struct work_struct *work)
}
int wcn36xx_smd_open(struct wcn36xx *wcn)
{
- int ret = 0;
wcn->hal_ind_wq = create_freezable_workqueue("wcn36xx_smd_ind");
- if (!wcn->hal_ind_wq) {
- wcn36xx_err("failed to allocate wq\n");
- ret = -ENOMEM;
- goto out;
- }
+ if (!wcn->hal_ind_wq)
+ return -ENOMEM;
+
INIT_WORK(&wcn->hal_ind_work, wcn36xx_ind_smd_work);
INIT_LIST_HEAD(&wcn->hal_ind_queue);
spin_lock_init(&wcn->hal_ind_lock);

return 0;
-
-out:
- return ret;
}

void wcn36xx_smd_close(struct wcn36xx *wcn)
--
2.14.3

2018-05-16 14:08:35

by Daniel Mack

[permalink] [raw]
Subject: [PATCH 03/10] wcn36xx: don't disable RX IRQ from handler

There's no need to disable the IRQ from inside its handler.
Instead just grab the spinlock of the channel that is being processed.

Signed-off-by: Daniel Mack <[email protected]>
---
drivers/net/wireless/ath/wcn36xx/dxe.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireless/ath/wcn36xx/dxe.c b/drivers/net/wireless/ath/wcn36xx/dxe.c
index 3d1cf7dd1f8e..d6dd47b211ba 100644
--- a/drivers/net/wireless/ath/wcn36xx/dxe.c
+++ b/drivers/net/wireless/ath/wcn36xx/dxe.c
@@ -476,9 +476,8 @@ static irqreturn_t wcn36xx_irq_rx_ready(int irq, void *dev)
{
struct wcn36xx *wcn = (struct wcn36xx *)dev;

- disable_irq_nosync(wcn->rx_irq);
wcn36xx_dxe_rx_frame(wcn);
- enable_irq(wcn->rx_irq);
+
return IRQ_HANDLED;
}

@@ -514,8 +513,8 @@ static int wcn36xx_dxe_request_irqs(struct wcn36xx *wcn)
static int wcn36xx_rx_handle_packets(struct wcn36xx *wcn,
struct wcn36xx_dxe_ch *ch)
{
- struct wcn36xx_dxe_ctl *ctl = ch->head_blk_ctl;
- struct wcn36xx_dxe_desc *dxe = ctl->desc;
+ struct wcn36xx_dxe_desc *dxe;
+ struct wcn36xx_dxe_ctl *ctl;
dma_addr_t dma_addr;
struct sk_buff *skb;
int ret = 0, int_mask;
@@ -529,6 +528,11 @@ static int wcn36xx_rx_handle_packets(struct wcn36xx *wcn,
int_mask = WCN36XX_DXE_INT_CH3_MASK;
}

+ spin_lock(&ch->lock);
+
+ ctl = ch->head_blk_ctl;
+ dxe = ctl->desc;
+
while (!(READ_ONCE(dxe->ctrl) & WCN36xx_DXE_CTRL_VLD)) {
skb = ctl->skb;
dma_addr = dxe->dst_addr_l;
@@ -549,6 +553,9 @@ static int wcn36xx_rx_handle_packets(struct wcn36xx *wcn,
wcn36xx_dxe_write_register(wcn, WCN36XX_DXE_ENCH_ADDR, int_mask);

ch->head_blk_ctl = ctl;
+
+ spin_unlock(&ch->lock);
+
return 0;
}

--
2.14.3

2018-05-16 14:08:34

by Daniel Mack

[permalink] [raw]
Subject: [PATCH 02/10] wcn36xx: set DMA mask explicitly

The device takes 32-bit addresses only, so inform the DMA API about it.
This is the default on msm8016, so that doesn't change anything, but
it's best practice to be explicit.

Signed-off-by: Daniel Mack <[email protected]>
---
drivers/net/wireless/ath/wcn36xx/main.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/net/wireless/ath/wcn36xx/main.c b/drivers/net/wireless/ath/wcn36xx/main.c
index e3b91b3b38ef..e6330ad372b3 100644
--- a/drivers/net/wireless/ath/wcn36xx/main.c
+++ b/drivers/net/wireless/ath/wcn36xx/main.c
@@ -1309,6 +1309,12 @@ static int wcn36xx_probe(struct platform_device *pdev)
mutex_init(&wcn->hal_mutex);
mutex_init(&wcn->scan_lock);

+ ret = dma_set_mask_and_coherent(wcn->dev, DMA_BIT_MASK(32));
+ if (ret < 0) {
+ wcn36xx_err("failed to set DMA mask: %d\n", ret);
+ goto out_wq;
+ }
+
INIT_WORK(&wcn->scan_work, wcn36xx_hw_scan_worker);

wcn->smd_channel = qcom_wcnss_open_channel(wcnss, "WLAN_CTRL", wcn36xx_smd_rsp_process, hw);
--
2.14.3

2018-05-24 12:13:56

by Daniel Mack

[permalink] [raw]
Subject: Re: wcn36xx: bug #538: stuck tx management frames

On Thursday, May 24, 2018 01:48 PM, Kalle Valo wrote:
> Daniel Mack <[email protected]> writes:
>> On Thursday, May 24, 2018 10:44 AM, Kalle Valo wrote:
>>> Daniel Mack <[email protected]> writes:

>> It seems that once a network is successfully joined, the network
>> stability is fine. I haven't seen any starvation of streams lately, at
>> least not with the the patches in this series which I'm running since
>> a while. That is, until a disconnect/reconnect attempt is made, and at
>> this point, only management frames are involved.
>
> Ah, maybe originally you were seeing different issues with similar
> symptoms? But now you have fixed the other bugsand now the stuck
> transmitted management frame issue is left? Just guessing...

Yeah, I wish I had a clearer picture on all this myself :(

My patches definitely address some of the issues I have seen before,
also the fixes for potential race conditions are likely to have a
positive effect. But as you guessed yourself, I'm afraid that there's a
multitude of possible sources for bugs, so it's hard to tell.

> It would be great to have wcn36xx logging via tracing, just like ath10k
> and iwlwifi does. This way logging shouldn't slow down the system too
> much and with wpasupplicant's -T switch you can even get wpasupplicant's
> debug messages to the same log with proper timestamps! And almost
> forgot, you can also include mac80211 tracing logs as well:
>
> https://wireless.wiki.kernel.org/en/developers/documentation/mac80211/tracing
>
> https://wireless.wiki.kernel.org/en/users/drivers/ath10k/debug#tracing
>
> See ath10k_dbg() and trace_ath10k_log_dbg() for ideas how to implement
> this, and you can also take a look at iwlwifi. Should be pretty easy.
> Patches more than welcome :)

Okay, I'll see if I can find some time to look into this.

The reason why I didn't focus the logging possibilities is that I looked
at the SMD messages that are flying around which result from ieee80211
API calls into the driver, and I can't seem to find anything that's
wrong, especially not before the timeouts occur. Hence, I don't actually
expect any oddness on the ieee80211 layer.

But I agree that in general, better logging is certainly helpful.

>> It seems it does, yes. Tests at night seem to take a bit more time to
>> make the effect happen. But then again, it could also be unrelated. I
>> can't be certain at this point.
>
> Can you describe what kind of radio environment you have, is it a busy
> office complex? How many APs around etc?

I've tried different environments. In the office with 15-20
laptops/mobiles in the room I see about 10-15 APs. At home, where I did
long-term nightly test, there's maybe a higher number of APs, but fewer
devices on the channel of the AP that I used for tests.

I haven't used any more sophisticated environments like a sealed
reverberation chamber yet though.


Thanks,
Daniel

2018-05-18 11:28:11

by Kalle Valo

[permalink] [raw]
Subject: Re: [PATCH 00/10] Some more patches for wcn36xx

Daniel Mack <[email protected]> writes:

> On Wednesday, May 16, 2018 04:08 PM, Daniel Mack wrote:
>> Hence I believe that some sort of firmware internal buffer is overrun if
>> too many SMD requests fly in in a short amount of time. The firmware
>> does, however, still ack all packets just fine on the SMD channels, and
>> also the DXE communication flows are all healthy. No errors are reported
>> anywhere, but nothing is being put on the ether anymore.
>
> And FTR, there is a commit in the prima repository that caught my
> attention a while back:
>
> https://source.codeaurora.org/external/wlan/prima/commit/?id=93cd8f3c
>
> What this does (through an remarkable number of indirection layers) is
> sending the DUMP_COMMAND_REQ command with args = (274, 0, 0, 0, 0)
> when management frames get stuck, which smells pretty much like the
> issue I'm seeing. Doing the same with the mainline driver and the
> debugfs interface it exposes doesn't have any effect though.
>
> But even if it did work, I wouldn't see a way to detect the situation
> in which this is needed reliably.

The firmware version might make a difference so I recommend always
mentioning the firmware version as well. For example, what if your
firmware does not support that command or parameter?

Also I would recommend to file a bug to bugzilla.kernel.org so that all
the information is one place and it can be easily updated. Now it's
pretty difficult to get the big picture from various emails on the list.

--
Kalle Valo

2018-05-23 10:05:16

by Daniel Mack

[permalink] [raw]
Subject: Re: [PATCH 00/10] Some more patches for wcn36xx

Hi Kalle,

On Friday, May 18, 2018 01:28 PM, Kalle Valo wrote:
> Daniel Mack <[email protected]> writes:
>
>> On Wednesday, May 16, 2018 04:08 PM, Daniel Mack wrote:
>>> Hence I believe that some sort of firmware internal buffer is overrun if
>>> too many SMD requests fly in in a short amount of time. The firmware
>>> does, however, still ack all packets just fine on the SMD channels, and
>>> also the DXE communication flows are all healthy. No errors are reported
>>> anywhere, but nothing is being put on the ether anymore.
>>
>> And FTR, there is a commit in the prima repository that caught my
>> attention a while back:
>>
>> https://source.codeaurora.org/external/wlan/prima/commit/?id=93cd8f3c
>>
>> What this does (through an remarkable number of indirection layers) is
>> sending the DUMP_COMMAND_REQ command with args = (274, 0, 0, 0, 0)
>> when management frames get stuck, which smells pretty much like the
>> issue I'm seeing. Doing the same with the mainline driver and the
>> debugfs interface it exposes doesn't have any effect though.
>>
>> But even if it did work, I wouldn't see a way to detect the situation
>> in which this is needed reliably.
>
> The firmware version might make a difference so I recommend always
> mentioning the firmware version as well. For example, what if your
> firmware does not support that command or parameter?

Sure, that could be the case. FTR - the firmware I'm using is the one
that came out of the Qualcomm r1034.2.1 BSP. It is recognized by the
driver as 'WCN v2.0 RadioPhy vRhea_GF_1.12 with 19.2MHz XO'.

> Also I would recommend to file a bug to bugzilla.kernel.org so that all
> the information is one place and it can be easily updated. Now it's
> pretty difficult to get the big picture from various emails on the list.

Yes, I agree it's a bit convoluted. However, there's already the bug
report on 96board.org that Bjorn opened some time back, and I considered
that sufficient. IMO, it has all the information needed, plus a link to
a tool to reproduce the issue.

https://bugs.96boards.org/show_bug.cgi?id=538

The patches in this series can be considered as general cleanups that
are not necessarily related to this particular issue.


Thanks,
Daniel

2018-05-18 10:50:55

by Daniel Mack

[permalink] [raw]
Subject: Re: [PATCH 00/10] Some more patches for wcn36xx

On Wednesday, May 16, 2018 04:08 PM, Daniel Mack wrote:
> Hence I believe that some sort of firmware internal buffer is overrun if
> too many SMD requests fly in in a short amount of time. The firmware
> does, however, still ack all packets just fine on the SMD channels, and
> also the DXE communication flows are all healthy. No errors are reported
> anywhere, but nothing is being put on the ether anymore.

And FTR, there is a commit in the prima repository that caught my
attention a while back:

https://source.codeaurora.org/external/wlan/prima/commit/?id=93cd8f3c

What this does (through an remarkable number of indirection layers) is
sending the DUMP_COMMAND_REQ command with args = (274, 0, 0, 0, 0) when
management frames get stuck, which smells pretty much like the issue I'm
seeing. Doing the same with the mainline driver and the debugfs
interface it exposes doesn't have any effect though.

But even if it did work, I wouldn't see a way to detect the situation in
which this is needed reliably.


Daniel

2018-05-16 14:08:41

by Daniel Mack

[permalink] [raw]
Subject: [PATCH 06/10] wcn36xx: consider CTRL_EOP bit when looking for valid descriptors

In reap_tx_dxes(), when we iterate over the linked descriptors, only
consider such valid that have WCN36xx_DXE_CTRL_EOP set.

This is what the prima downstream driver is doing as well.

Signed-off-by: Daniel Mack <[email protected]>
---
drivers/net/wireless/ath/wcn36xx/dxe.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/ath/wcn36xx/dxe.c b/drivers/net/wireless/ath/wcn36xx/dxe.c
index 8c64e05ca3b7..06cfe8d311f3 100644
--- a/drivers/net/wireless/ath/wcn36xx/dxe.c
+++ b/drivers/net/wireless/ath/wcn36xx/dxe.c
@@ -370,7 +370,9 @@ static void reap_tx_dxes(struct wcn36xx *wcn, struct wcn36xx_dxe_ch *ch)
do {
if (READ_ONCE(ctl->desc->ctrl) & WCN36xx_DXE_CTRL_VLD)
break;
- if (ctl->skb) {
+
+ if (ctl->skb &&
+ READ_ONCE(ctl->desc->ctrl) & WCN36xx_DXE_CTRL_EOP) {
dma_unmap_single(wcn->dev, ctl->desc->src_addr_l,
ctl->skb->len, DMA_TO_DEVICE);
info = IEEE80211_SKB_CB(ctl->skb);
--
2.14.3

2018-05-16 14:08:44

by Daniel Mack

[permalink] [raw]
Subject: [PATCH 08/10] wcn36xx: drain pending indicator messages on shutdown

When the interface is shut down, wcn36xx_smd_close() potentially races
against the queue worker. Make sure to cancel the work, and then free all
the remnants in hal_ind_queue manually.

This is again just a theoretical issue, not something that was triggered in
the wild.

Signed-off-by: Daniel Mack <[email protected]>
---
drivers/net/wireless/ath/wcn36xx/smd.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/net/wireless/ath/wcn36xx/smd.c b/drivers/net/wireless/ath/wcn36xx/smd.c
index ea74f2b92df5..0a505b5e038b 100644
--- a/drivers/net/wireless/ath/wcn36xx/smd.c
+++ b/drivers/net/wireless/ath/wcn36xx/smd.c
@@ -2513,5 +2513,11 @@ int wcn36xx_smd_open(struct wcn36xx *wcn)

void wcn36xx_smd_close(struct wcn36xx *wcn)
{
+ struct wcn36xx_hal_ind_msg *msg, *tmp;
+
+ cancel_work_sync(&wcn->hal_ind_work);
destroy_workqueue(wcn->hal_ind_wq);
+
+ list_for_each_entry_safe(msg, tmp, &wcn->hal_ind_queue, list)
+ kfree(msg);
}
--
2.14.3

2018-05-24 12:39:11

by Kalle Valo

[permalink] [raw]
Subject: Re: wcn36xx: bug #538: stuck tx management frames

Daniel Mack <[email protected]> writes:

> On Thursday, May 24, 2018 01:48 PM, Kalle Valo wrote:
>> Daniel Mack <[email protected]> writes:
>>> On Thursday, May 24, 2018 10:44 AM, Kalle Valo wrote:
>>>> Daniel Mack <[email protected]> writes:
>
>>> It seems that once a network is successfully joined, the network
>>> stability is fine. I haven't seen any starvation of streams lately, at
>>> least not with the the patches in this series which I'm running since
>>> a while. That is, until a disconnect/reconnect attempt is made, and at
>>> this point, only management frames are involved.
>>
>> Ah, maybe originally you were seeing different issues with similar
>> symptoms? But now you have fixed the other bugsand now the stuck
>> transmitted management frame issue is left? Just guessing...
>
> Yeah, I wish I had a clearer picture on all this myself :(
>
> My patches definitely address some of the issues I have seen before,
> also the fixes for potential race conditions are likely to have a
> positive effect. But as you guessed yourself, I'm afraid that there's
> a multitude of possible sources for bugs, so it's hard to tell.

Sure, but things seem to going be forward in steady state. Thanks for
your hard work!

>> It would be great to have wcn36xx logging via tracing, just like ath10k
>> and iwlwifi does. This way logging shouldn't slow down the system too
>> much and with wpasupplicant's -T switch you can even get wpasupplicant's
>> debug messages to the same log with proper timestamps! And almost
>> forgot, you can also include mac80211 tracing logs as well:
>>
>> https://wireless.wiki.kernel.org/en/developers/documentation/mac80211/tracing
>>
>> https://wireless.wiki.kernel.org/en/users/drivers/ath10k/debug#tracing
>>
>> See ath10k_dbg() and trace_ath10k_log_dbg() for ideas how to implement
>> this, and you can also take a look at iwlwifi. Should be pretty easy.
>> Patches more than welcome :)
>
> Okay, I'll see if I can find some time to look into this.
>
> The reason why I didn't focus the logging possibilities is that I
> looked at the SMD messages that are flying around which result from
> ieee80211 API calls into the driver, and I can't seem to find anything
> that's wrong, especially not before the timeouts occur. Hence, I don't
> actually expect any oddness on the ieee80211 layer.
>
> But I agree that in general, better logging is certainly helpful.

Yeah, I'm not expecting tracing logs to solve this case either but maybe
we could find some hints. And it just makes it so much easier to see
what's really happening from tracing logs instead trying to guess from
the bug description. "a tracing log is worth a thousand words" ;)

But if you don't have time to implement tracing support to wcn36xx
hopefully someone else has, all one needs is a DragonBoard 410c. A
simple project for a student or someone who wants to contribute
something to the kernel.

>>> It seems it does, yes. Tests at night seem to take a bit more time to
>>> make the effect happen. But then again, it could also be unrelated. I
>>> can't be certain at this point.
>>
>> Can you describe what kind of radio environment you have, is it a busy
>> office complex? How many APs around etc?
>
> I've tried different environments. In the office with 15-20
> laptops/mobiles in the room I see about 10-15 APs. At home, where I
> did long-term nightly test, there's maybe a higher number of APs, but
> fewer devices on the channel of the AP that I used for tests.
>
> I haven't used any more sophisticated environments like a sealed
> reverberation chamber yet though.

Ok, but please let me know if you see any differences caused by the
environment.

--
Kalle Valo

2018-05-16 14:08:42

by Daniel Mack

[permalink] [raw]
Subject: [PATCH 07/10] wcn36xx: set PREASSOC and IDLE stated when BSS info changes

When a BSSID is joined, set the link status to 'preassoc', and set it to
'idle' when the BSS is deleted.

This is what the downstream driver is doing, and it seems to improve the
reliability during connect/disconnect stress tests.

Signed-off-by: Daniel Mack <[email protected]>
---
drivers/net/wireless/ath/wcn36xx/main.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/drivers/net/wireless/ath/wcn36xx/main.c b/drivers/net/wireless/ath/wcn36xx/main.c
index e6330ad372b3..662e50540b07 100644
--- a/drivers/net/wireless/ath/wcn36xx/main.c
+++ b/drivers/net/wireless/ath/wcn36xx/main.c
@@ -798,6 +798,8 @@ static void wcn36xx_bss_info_changed(struct ieee80211_hw *hw,
if (!is_zero_ether_addr(bss_conf->bssid)) {
vif_priv->is_joining = true;
vif_priv->bss_index = WCN36XX_HAL_BSS_INVALID_IDX;
+ wcn36xx_smd_set_link_st(wcn, bss_conf->bssid, vif->addr,
+ WCN36XX_HAL_LINK_PREASSOC_STATE);
wcn36xx_smd_join(wcn, bss_conf->bssid,
vif->addr, WCN36XX_HW_CHANNEL(wcn));
wcn36xx_smd_config_bss(wcn, vif, NULL,
@@ -805,6 +807,8 @@ static void wcn36xx_bss_info_changed(struct ieee80211_hw *hw,
} else {
vif_priv->is_joining = false;
wcn36xx_smd_delete_bss(wcn, vif);
+ wcn36xx_smd_set_link_st(wcn, bss_conf->bssid, vif->addr,
+ WCN36XX_HAL_LINK_IDLE_STATE);
vif_priv->encrypt_type = WCN36XX_HAL_ED_NONE;
}
}
--
2.14.3

2018-05-24 11:48:20

by Kalle Valo

[permalink] [raw]
Subject: wcn36xx: bug #538: stuck tx management frames

(I'll change the subject to better reflect what we are discussing.)

Daniel Mack <[email protected]> writes:
> On Thursday, May 24, 2018 10:44 AM, Kalle Valo wrote:
>> Daniel Mack <[email protected]> writes:
>>> On Friday, May 18, 2018 01:28 PM, Kalle Valo wrote:
>
>>>> Also I would recommend to file a bug to bugzilla.kernel.org so that all
>>>> the information is one place and it can be easily updated. Now it's
>>>> pretty difficult to get the big picture from various emails on the list.
>>>
>>> Yes, I agree it's a bit convoluted. However, there's already the bug
>>> report on 96board.org that Bjorn opened some time back, and I
>>> considered that sufficient. IMO, it has all the information needed,
>>> plus a link to a tool to reproduce the issue.
>>>
>>> https://bugs.96boards.org/show_bug.cgi?id=538
>>
>> Yeah, bugs.96boards.org is fine. As long as there's one place which
>> collects all the information about the bug.
>>
>> But IMHO the bug report is not telling much, all I get is that TX frames
>> get stuck but not even that is confirmed. After reading it I have at
>> least these questions:
>>
>> * Is it really confirmed that the issue is that TX frames are stuck? For
>> example, using a wireless sniffer would confirm that.
>
> Yes, that's confirmed. I have a 2nd machine tuned to the same channel
> than the network I use for testing, and once the timeouts happen, I
> cannot see any frame anymore from the MAC of the wcn36xx. No probe
> requests for scans, no authentication attempts, nothing.
>
> As my test constantly connects and disconnects, the last thing I see
> in wireshark is a deauthentication frame.

Thanks, this is good to know.

>> * Are only management frames stuck or does it also involve data frames?
>
> It seems that once a network is successfully joined, the network
> stability is fine. I haven't seen any starvation of streams lately, at
> least not with the the patches in this series which I'm running since
> a while. That is, until a disconnect/reconnect attempt is made, and at
> this point, only management frames are involved.

Ah, maybe originally you were seeing different issues with similar
symptoms? But now you have fixed the other bugsand now the stuck
transmitted management frame issue is left? Just guessing...

>> * Based on the bug report the TX stuck issue seems to happen during
>> authentication, but what happens before that? Does wcn36xx get
>> disconnected from AP or what?
>
> As I said, my test setup includes repeated disconnections to make the
> bug appear. It sometimes happens at the first attempt after a fresh
> boot, however, so the stress test only makes debugging a bit easier by
> increasing the likeliness.
>
>> * Any wcn36xx logs about the issue (with or without debug logs)? Also
>> matching wpasupplicant logs would help.
>
> The problem with this is that it's not exactly clear what kind of
> effect we're looking at. With all the debug flags of the driver
> enabled, it produces so much log output that wpa_supplicant gives up
> due to timeouts. The other weird issue is that with WCN36XX_DBG_MAC
> and/or WCN36XX_DBG_SMD enabled, the effect is _much_ harder to
> trigger.

It would be great to have wcn36xx logging via tracing, just like ath10k
and iwlwifi does. This way logging shouldn't slow down the system too
much and with wpasupplicant's -T switch you can even get wpasupplicant's
debug messages to the same log with proper timestamps! And almost
forgot, you can also include mac80211 tracing logs as well:

https://wireless.wiki.kernel.org/en/developers/documentation/mac80211/tracing

https://wireless.wiki.kernel.org/en/users/drivers/ath10k/debug#tracing

See ath10k_dbg() and trace_ath10k_log_dbg() for ideas how to implement
this, and you can also take a look at iwlwifi. Should be pretty easy.
Patches more than welcome :)

>> * Does this only happen with encryption or also in open mode?
>
> That's a good question. I'll go check with an open network.

Thanks.

>> * How long does it take with qconnman-stress to reproduce the issue?
>
> Usually less than 10 minutes.

That's really good, makes it so much easier to verify potential fixes.

>> * Does the radio environment make any difference on reproducibility? For
>> example, clear enviroment vs lots of traffic/interference?
>
> It seems it does, yes. Tests at night seem to take a bit more time to
> make the effect happen. But then again, it could also be unrelated. I
> can't be certain at this point.

Can you describe what kind of radio environment you have, is it a busy
office complex? How many APs around etc?

--
Kalle Valo

2018-05-16 14:08:37

by Daniel Mack

[permalink] [raw]
Subject: [PATCH 04/10] wcn36xx: clear all masks in RX interrupt

Like on the TX side, check for the interrupt reason when the RX interrupt
is latched and clear the ERR, DONE and ED masks.

This seems to help with connection timeouts and network stream
starvatations. And FWIW, the downstream driver does the same thing.

Note that in analogy to the TX side, WCN36XX_DXE_0_INT_CLR should be set to
WCN36XX_INT_MASK_CHAN_RX_{L,H} rather than WCN36XX_DXE_INT_CH{1,3}_MASK. It
did the right thing however, as the defines happen to have identical values.

Also, instead of determining register addresses and values inside
wcn36xx_rx_handle_packets(), pass them as arguments.

Signed-off-by: Daniel Mack <[email protected]>
---
drivers/net/wireless/ath/wcn36xx/dxe.c | 62 ++++++++++++++++++++++------------
1 file changed, 40 insertions(+), 22 deletions(-)

diff --git a/drivers/net/wireless/ath/wcn36xx/dxe.c b/drivers/net/wireless/ath/wcn36xx/dxe.c
index d6dd47b211ba..d11c9c536627 100644
--- a/drivers/net/wireless/ath/wcn36xx/dxe.c
+++ b/drivers/net/wireless/ath/wcn36xx/dxe.c
@@ -511,23 +511,40 @@ static int wcn36xx_dxe_request_irqs(struct wcn36xx *wcn)
}

static int wcn36xx_rx_handle_packets(struct wcn36xx *wcn,
- struct wcn36xx_dxe_ch *ch)
+ struct wcn36xx_dxe_ch *ch,
+ u32 ctrl,
+ u32 en_mask,
+ u32 int_mask,
+ u32 status_reg)
{
struct wcn36xx_dxe_desc *dxe;
struct wcn36xx_dxe_ctl *ctl;
dma_addr_t dma_addr;
struct sk_buff *skb;
- int ret = 0, int_mask;
- u32 value;
+ u32 int_reason;
+ int ret;

- if (ch->ch_type == WCN36XX_DXE_CH_RX_L) {
- value = WCN36XX_DXE_CTRL_RX_L;
- int_mask = WCN36XX_DXE_INT_CH1_MASK;
- } else {
- value = WCN36XX_DXE_CTRL_RX_H;
- int_mask = WCN36XX_DXE_INT_CH3_MASK;
+ wcn36xx_dxe_read_register(wcn, status_reg, &int_reason);
+ wcn36xx_dxe_write_register(wcn, WCN36XX_DXE_0_INT_CLR, int_mask);
+
+ if (int_reason & WCN36XX_CH_STAT_INT_ERR_MASK) {
+ wcn36xx_dxe_write_register(wcn,
+ WCN36XX_DXE_0_INT_ERR_CLR,
+ int_mask);
+
+ wcn36xx_err("DXE IRQ reported error on RX channel\n");
}

+ if (int_reason & WCN36XX_CH_STAT_INT_DONE_MASK)
+ wcn36xx_dxe_write_register(wcn,
+ WCN36XX_DXE_0_INT_DONE_CLR,
+ int_mask);
+
+ if (int_reason & WCN36XX_CH_STAT_INT_ED_MASK)
+ wcn36xx_dxe_write_register(wcn,
+ WCN36XX_DXE_0_INT_ED_CLR,
+ int_mask);
+
spin_lock(&ch->lock);

ctl = ch->head_blk_ctl;
@@ -546,11 +563,11 @@ static int wcn36xx_rx_handle_packets(struct wcn36xx *wcn,
wcn36xx_rx_skb(wcn, skb);
} /* else keep old skb not submitted and use it for rx DMA */

- dxe->ctrl = value;
+ dxe->ctrl = ctrl;
ctl = ctl->next;
dxe = ctl->desc;
}
- wcn36xx_dxe_write_register(wcn, WCN36XX_DXE_ENCH_ADDR, int_mask);
+ wcn36xx_dxe_write_register(wcn, WCN36XX_DXE_ENCH_ADDR, en_mask);

ch->head_blk_ctl = ctl;

@@ -566,19 +583,20 @@ void wcn36xx_dxe_rx_frame(struct wcn36xx *wcn)
wcn36xx_dxe_read_register(wcn, WCN36XX_DXE_INT_SRC_RAW_REG, &int_src);

/* RX_LOW_PRI */
- if (int_src & WCN36XX_DXE_INT_CH1_MASK) {
- wcn36xx_dxe_write_register(wcn, WCN36XX_DXE_0_INT_CLR,
- WCN36XX_DXE_INT_CH1_MASK);
- wcn36xx_rx_handle_packets(wcn, &(wcn->dxe_rx_l_ch));
- }
+ if (int_src & WCN36XX_DXE_INT_CH1_MASK)
+ wcn36xx_rx_handle_packets(wcn, &wcn->dxe_rx_l_ch,
+ WCN36XX_DXE_CTRL_RX_L,
+ WCN36XX_DXE_INT_CH1_MASK,
+ WCN36XX_INT_MASK_CHAN_RX_L,
+ WCN36XX_DXE_CH_STATUS_REG_ADDR_RX_L);

/* RX_HIGH_PRI */
- if (int_src & WCN36XX_DXE_INT_CH3_MASK) {
- /* Clean up all the INT within this channel */
- wcn36xx_dxe_write_register(wcn, WCN36XX_DXE_0_INT_CLR,
- WCN36XX_DXE_INT_CH3_MASK);
- wcn36xx_rx_handle_packets(wcn, &(wcn->dxe_rx_h_ch));
- }
+ if (int_src & WCN36XX_DXE_INT_CH3_MASK)
+ wcn36xx_rx_handle_packets(wcn, &wcn->dxe_rx_h_ch,
+ WCN36XX_DXE_CTRL_RX_H,
+ WCN36XX_DXE_INT_CH3_MASK,
+ WCN36XX_INT_MASK_CHAN_RX_H,
+ WCN36XX_DXE_CH_STATUS_REG_ADDR_RX_H);

if (!int_src)
wcn36xx_warn("No DXE interrupt pending\n");
--
2.14.3

2018-05-24 09:41:01

by Daniel Mack

[permalink] [raw]
Subject: Re: [PATCH 00/10] Some more patches for wcn36xx

Hi Kalle,

On Thursday, May 24, 2018 10:44 AM, Kalle Valo wrote:
> Daniel Mack <[email protected]> writes:
>> On Friday, May 18, 2018 01:28 PM, Kalle Valo wrote:

>>> Also I would recommend to file a bug to bugzilla.kernel.org so that all
>>> the information is one place and it can be easily updated. Now it's
>>> pretty difficult to get the big picture from various emails on the list.
>>
>> Yes, I agree it's a bit convoluted. However, there's already the bug
>> report on 96board.org that Bjorn opened some time back, and I
>> considered that sufficient. IMO, it has all the information needed,
>> plus a link to a tool to reproduce the issue.
>>
>> https://bugs.96boards.org/show_bug.cgi?id=538
>
> Yeah, bugs.96boards.org is fine. As long as there's one place which
> collects all the information about the bug.
>
> But IMHO the bug report is not telling much, all I get is that TX frames
> get stuck but not even that is confirmed. After reading it I have at
> least these questions:
>
> * Is it really confirmed that the issue is that TX frames are stuck? For
> example, using a wireless sniffer would confirm that.

Yes, that's confirmed. I have a 2nd machine tuned to the same channel
than the network I use for testing, and once the timeouts happen, I
cannot see any frame anymore from the MAC of the wcn36xx. No probe
requests for scans, no authentication attempts, nothing.

As my test constantly connects and disconnects, the last thing I see in
wireshark is a deauthentication frame.

> * Are only management frames stuck or does it also involve data frames?

It seems that once a network is successfully joined, the network
stability is fine. I haven't seen any starvation of streams lately, at
least not with the the patches in this series which I'm running since a
while. That is, until a disconnect/reconnect attempt is made, and at
this point, only management frames are involved.

> * Based on the bug report the TX stuck issue seems to happen during
> authentication, but what happens before that? Does wcn36xx get
> disconnected from AP or what?

As I said, my test setup includes repeated disconnections to make the
bug appear. It sometimes happens at the first attempt after a fresh
boot, however, so the stress test only makes debugging a bit easier by
increasing the likeliness.

> * Any wcn36xx logs about the issue (with or without debug logs)? Also
> matching wpasupplicant logs would help.

The problem with this is that it's not exactly clear what kind of effect
we're looking at. With all the debug flags of the driver enabled, it
produces so much log output that wpa_supplicant gives up due to
timeouts. The other weird issue is that with WCN36XX_DBG_MAC and/or
WCN36XX_DBG_SMD enabled, the effect is _much_ harder to trigger.

> * Does this only happen with encryption or also in open mode?

That's a good question. I'll go check with an open network.

> * How long does it take with qconnman-stress to reproduce the issue?

Usually less than 10 minutes.

> * Does the radio environment make any difference on reproducibility? For
> example, clear enviroment vs lots of traffic/interference?

It seems it does, yes. Tests at night seem to take a bit more time to
make the effect happen. But then again, it could also be unrelated. I
can't be certain at this point.

I'll submit some more information to the bug report. What would help me
is if other people tried to reproduce the effect using the stress tool
and confirm my findings. Chances are I've been staring at this for too
long :)


Thanks again,
Daniel