by Daniel Mack


Subject: Re: [PATCH 00/10] Some more patches for wcn36xx

Hi Kalle,

On Thursday, May 24, 2018 10:44 AM, Kalle Valo wrote:
> Daniel Mack <[email protected]> writes:
>> On Friday, May 18, 2018 01:28 PM, Kalle Valo wrote:

>>> Also I would recommend to file a bug to bugzilla.kernel.org so that all
>>> the information is in one place and it can be easily updated. Now it's
>>> pretty difficult to get the big picture from various emails on the list.
>>
>> Yes, I agree it's a bit convoluted. However, there's already the bug
>> report on 96boards.org that Bjorn opened some time back, and I
>> considered that sufficient. IMO, it has all the information needed,
>> plus a link to a tool to reproduce the issue.
>>
>> https://bugs.96boards.org/show_bug.cgi?id=538
>
> Yeah, bugs.96boards.org is fine. As long as there's one place which
> collects all the information about the bug.
>
> But IMHO the bug report is not telling much, all I get is that TX frames
> get stuck but not even that is confirmed. After reading it I have at
> least these questions:
>
> * Is it really confirmed that the issue is that TX frames are stuck? For
> example, using a wireless sniffer would confirm that.

Yes, that's confirmed. I have a 2nd machine tuned to the same channel as
the network I use for testing, and once the timeouts happen, I can no
longer see any frames from the MAC of the wcn36xx. No probe requests for
scans, no authentication attempts, nothing.

As my test constantly connects and disconnects, the last thing I see in
wireshark is a deauthentication frame.
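
For anyone who wants to repeat this check on their own capture, here is a
minimal sketch of it. It is not the tooling used above: it assumes the
sniffer capture has already been exported into simple records, and the
Frame type, both helper names and the subtype strings are hypothetical.

```python
from typing import Iterable, NamedTuple, Optional


class Frame(NamedTuple):
    """One captured frame (hypothetical record layout, not a Wireshark type)."""
    timestamp: float  # seconds since the start of the capture
    source: str       # transmitter MAC address
    subtype: str      # free-form label, e.g. "probe-req", "auth", "deauth"


def last_frame_from(frames: Iterable[Frame], mac: str) -> Optional[Frame]:
    """Return the latest frame transmitted by `mac`, or None if there is none."""
    wanted = mac.lower()
    last = None
    for frame in frames:
        if frame.source.lower() != wanted:
            continue
        if last is None or frame.timestamp >= last.timestamp:
            last = frame
    return last


def silence_after_last_frame(frames: Iterable[Frame], mac: str) -> Optional[float]:
    """Seconds between the last frame from `mac` and the end of the capture.

    Returns None if the capture is empty or `mac` never transmitted.
    """
    captured = list(frames)
    last = last_frame_from(captured, mac)
    if last is None:
        return None
    capture_end = max(frame.timestamp for frame in captured)
    return capture_end - last.timestamp
```

If the last frame reported for the wcn36xx MAC is a deauthentication and
the silence keeps growing while other stations are still visible in the
capture, that matches the behaviour described above.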

> * Are only management frames stuck or does it also involve data frames?

It seems that once a network is successfully joined, the network
stability is fine. I haven't seen any starvation of streams lately, at
least not with the patches in this series, which I've been running for a
while. That is, until a disconnect/reconnect attempt is made, and at
this point, only management frames are involved.

> * Based on the bug report the TX stuck issue seems to happen during
> authentication, but what happens before that? Does wcn36xx get
> disconnected from AP or what?

As I said, my test setup includes repeated disconnections to make the
bug appear. It sometimes happens at the first attempt after a fresh
boot, however, so the stress test only makes debugging a bit easier by
increasing the likelihood.

> * Any wcn36xx logs about the issue (with or without debug logs)? Also
> matching wpasupplicant logs would help.

The problem with this is that it's not exactly clear what kind of effect
we're looking at. With all the debug flags of the driver enabled, it
produces so much log output that wpa_supplicant gives up due to
timeouts. The other weird issue is that with WCN36XX_DBG_MAC and/or
WCN36XX_DBG_SMD enabled, the effect is _much_ harder to trigger.

> * Does this only happen with encryption or also in open mode?

That's a good question. I'll go check with an open network.

> * How long does it take with qconnman-stress to reproduce the issue?

Usually less than 10 minutes.

> * Does the radio environment make any difference on reproducibility? For
> example, clear environment vs lots of traffic/interference?

It seems it does, yes. Tests at night seem to take a bit more time to
make the effect happen. But then again, it could also be unrelated. I
can't be certain at this point.

I'll submit some more information to the bug report. What would help me
is if other people tried to reproduce the effect using the stress tool
and confirm my findings. Chances are I've been staring at this for too
long :)

Thanks again,
Daniel