2020-01-02 13:29:46

by Johannes Berg

[permalink] [raw]
Subject: Re: PROBLEM: Wireless networking goes down on Acer C720P Chromebook (bisected)

On Tue, 2019-12-31 at 19:49 -0500, Stephen Oberholtzer wrote:
> Wireless networking goes down on Acer C720P Chromebook (bisected)
>
> Culprit: 7a89233a ("mac80211: Use Airtime-based Queue Limits (AQL) on
> packet dequeue")
>
> I found that the newest kernel (5.4) displayed a curious issue on my
> Acer C720P Chromebook: shortly after bringing networking up, all
> connections would suddenly fail. I discovered that I could
> consistently reproduce the issue by ssh'ing into the machine and
> running 'dmesg' -- on a non-working kernel; I would get partial
> output, and then the connection would completely hang. This was so
> consistent, in fact, that I was able to leverage it to automate the
> process from 'git bisect run'.
>
> KEYWORDS: c720p, chromebook, wireless, networking, mac80211
>
> KERNEL: any kernel containing commit 7a89233a ("mac80211: Use
> Airtime-based Queue Limits (AQL) on packet dequeue")
>
> I find this bit in the offending commit's message suspicious:
>
> > This patch does *not* include any mechanism to wake a throttled TXQ again,
> > on the assumption that this will happen anyway as a side effect of whatever
> > freed the skb (most commonly a TX completion).
>
> Methinks this assumption is not a fully valid one.

I think I found at least one hole in this, but IIRC (it was before my
vacation, sorry) it was pretty unlikely to actually happen. Perhaps
there are more though.

https://lore.kernel.org/r/[email protected]


> I'll be happy to test any patches. If you need some printk calls, just
> tell me where to put 'em.

Do you get any output at all? Like a WARN_ON() for an underflow, or
something?

johannes


2020-01-02 23:15:02

by Stephen Oberholtzer

[permalink] [raw]
Subject: Re: PROBLEM: Wireless networking goes down on Acer C720P Chromebook (bisected)

On Thu, Jan 2, 2020 at 8:28 AM Johannes Berg <[email protected]> wrote:
>
> On Tue, 2019-12-31 at 19:49 -0500, Stephen Oberholtzer wrote:
> > Wireless networking goes down on Acer C720P Chromebook (bisected)
> >
> > Culprit: 7a89233a ("mac80211: Use Airtime-based Queue Limits (AQL) on
> > packet dequeue")
> >

<snip>

> I think I found at least one hole in this, but IIRC (it was before my
> vacation, sorry) it was pretty unlikely to actually happen. Perhaps
> there are more though.
>
> https://lore.kernel.org/r/[email protected]

<snippety-snip>

> Do you get any output at all? Like a WARN_ON() for an underflow, or
> something?
>
> johannes
>

Johannes,

To answer your immediate question, no, I don't get any dmesg output at
all. Nothing about underruns.
However, while pursuing other avenues -- specifically, enabling
mac80211 debugfs and log messages -- I realized that my 'master' was
out-of-date from linux-stable and did a git pull. Imagine my surprise
when the resulting kernel did not exhibit the problem!

Apparently, I had been a bit too pessimistic; since the problem
existed in 5.5-rc1 release, I'd assumed that the problem wouldn't get
rectified before 5.5.

However, I decided to bisect the fix, and ended up with: 911bde0f
("mac80211: Turn AQL into an NL80211_EXT_FEATURE"), which appears to
have "solved" the problem by just disabling the feature (this is
ath9k, by the way.)

This AQL stuff sounds pretty nifty, and I'd love to try my hand at
making it work for ath9k (also, since I put so much effort into an
automated build-and-test framework, it'd be a shame to just abandon
it.) However, the ath9k code is rather lacking for comments, so I
don't even know where I should start, except for (I suspect) a call to
`wiphy_ext_feature_set(whatever, NL80211_EXT_FEATURE_AQL);` from
inside ath9k_set_hw_capab()?

In the meantime, I went back to e548f749b096 -- the commit prior to
the one making AQL support opt-in -- and cranked up the debugging.

I'm not sure how to interpret any of this, but here's what I got:

dmesg output:

Last relevant mention is "moving STA <my AP's MAC> to state 4" which
happened during startup, before everything shut down.

/sys/kernel/debug/ieee80211/phy0

airtime_flags = 7

stations/<my AP's MAC>/airtime =

RX: 6583578 us
TX: 32719 us
Weight: 256
Deficit: VO: -1128 us VI: 11 us BE: -5098636 us BK: 256 us
Q depth: VO: 3868 us VI: 3636 us BE: 12284 us BK: 0 us
Q limit[low/high]: VO: 5000/12000 VI: 5000/12000 BE: 5000/12000 BK: 5000/12000

(I have no idea how to interpret this, but that '32719 us' seems odd,
I thought the airtime usage was in 4us units?)


Doing an 'echo 3 | tee airtime_flags' to clear the (old) AQL-enabled
bit seemed to *immediately* restore network connectivity.

I ran a ping, and saw this:

- pings coming back in <5ms
- re-enable AQL (echo 7 | tee airtime_flags)
- pings stop coming back immediately
- some seconds later, disable AQL again (echo 3 | tee airtime_flags)
- immediate *flood* of ping replies registered, with times 16000ms,
15000ms, 14000ms, .. down to 1000ms, 15ms, then stabilizing sub-5ms
- According to the icmp_seq values, all 28 requests were replied to,
and their replies were delivered in-order

This certainly looks like a missing TX queue restart to me?


--
-- Stevie-O
Real programmers use COPY CON PROGRAM.EXE