2015-11-04 05:20:32

by Avery Pennarun

Subject: ath9k(?): AP stops sending traffic to iPhone 4S until another 802.11n-capable STA joins

[fixed ath9k list address. sorry for the spam]

Hi all,

I have a pretty weird problem I've been chasing for a few weeks and
have narrowed it down, but not quite solved it. It may be caused by
bugs in aggregation-related code.

Steps:
- Set up an ath9k-based Linux AP on an ARM processor (currently using
this version of backports, though I've tried older and newer versions
with no change: "backported from Linux (next-20150525-0-gc201847)
using backports backports-20150525-0-g49969bd")
- Join my iPhone 4S (running iOS 7.1.2) to the network
- Use it for a while
- Eventually it will stay connected, but Internet access doesn't work
- Wireless packet captures show that packets are received *from* the
iPhone, that the ath9k returns ACKs for them, and that they are
correctly forwarded to the AP's br0 interface. Outgoing packets,
however, show up on br0 and wlan0 in tcpdump but never make it onto
the air.
- Putting the iPhone 4S into airplane mode and then letting it
reconnect will fix it for a few more seconds/minutes before it
stops again.

More details:
- It only seems to happen to my iPhone 4S client (never seen it with a
different client).
- It only seems to happen with my ath9k AP.
- It only seems to happen on my home network (another instance of the
same AP hardware on another network doesn't trigger the problem).
- It only seems to happen when no other 802.11n-capable devices are
connected to the same AP.
- The moment I join an 802.11n-capable device to the AP, traffic
instantly unblocks (see packet capture below).
- Joining an 802.11g-only device (no aggregation) does *not* unblock traffic.
- Disabling encryption and turning wmm_enable on and off have no effect.
- Disabling 802.11n support on the AP (so that everyone has to use
802.11g) makes the problem go away.
- 'ip -s link show dev wlan0' shows tx packet counters continuing to
increase during the outage, even though packets aren't flowing.
- I applied a patch from Tim Shepard to track the most recent tx
attempt, acked tx, and rx packet times inside mac80211. According to
this data, mac80211 thinks rx happened at most a couple of seconds ago
(as expected). The most recent tx was acked, but it was back around
the time the outage started. Note that this disagrees with 'ip -s
link' and tcpdump, which think they transmitted much more recently
than that. (The patch is here:
https://gfiber-review.googlesource.com/#/c/1250/ )
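
The 'ip -s link' check above can be scripted. What follows is only a minimal sketch, assuming the usual text layout of 'ip -s link show dev <iface>' (a "TX:" header line followed by a line of values); the function names are hypothetical and not part of any tool mentioned in this thread. Note that, as described above, a rising counter only shows that the stack handed packets to the driver, so the delta has to be compared against an over-the-air capture.

```python
def parse_tx_packets(ip_output: str) -> int:
    """Return the TX packet count from `ip -s link show dev <iface>` text.

    Assumes a header line starting with "TX:" whose column names line up
    with the values on the following line.
    """
    lines = ip_output.splitlines()
    for i, line in enumerate(lines[:-1]):
        fields = line.split()
        if fields and fields[0] == "TX:":
            columns = fields[1:]              # e.g. bytes, packets, errors, ...
            values = lines[i + 1].split()     # counters under those columns
            return int(values[columns.index("packets")])
    raise ValueError("no TX counter block found")


def tx_delta(before: str, after: str) -> int:
    """Number of packets the interface counted as sent between two samples."""
    return parse_tx_packets(after) - parse_tx_packets(before)
```

A nonzero tx_delta over an interval in which the capture shows no frames from the AP to the client is the mismatch described above.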

I captured a pcap of a new 802.11n-capable device joining the network
and unblocking the transmit. The action starts around frame 325:
http://apenwarr.ca/tmp/iPod4-fixing-iPhone4-trimmed.pcap.gz

In this pcap, the main players are:
ath9k AP: 88:dc:96:08:60:50
iPhone 4S with the problem: e4:25:e7:73:e6:31
New client fixing the problem (iPod 4): 18:e7:f4:7e:c1:42

Observations from the pcap:
- Upstream packets (iPhone->ath9k) are received and acked (see e.g. frame 154)
- Beacons from the ath9k show an empty TIM bitmap until the iPod
joins; after that it is nonempty and things unblock.
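
To make "empty TIM bitmap" concrete, here is a minimal sketch of decoding the body of a beacon's TIM element. It assumes the standard 802.11 layout (DTIM count, DTIM period, bitmap control, then the partial virtual bitmap) and that the element body has already been extracted from the beacon. The function names are hypothetical, and the association IDs in the usage note are illustrative, not taken from the pcap.

```python
def tim_is_empty(tim_body: bytes) -> bool:
    """True if the TIM advertises no buffered traffic for any station."""
    if len(tim_body) < 4:
        raise ValueError("TIM body needs at least 4 octets")
    multicast_buffered = tim_body[2] & 0x01   # bit 0 of bitmap control
    return not multicast_buffered and not any(tim_body[3:])


def tim_has_traffic(tim_body: bytes, aid: int) -> bool:
    """True if the TIM bit for association ID `aid` is set."""
    if len(tim_body) < 4:
        raise ValueError("TIM body needs at least 4 octets")
    bitmap_control = tim_body[2]
    partial_bitmap = tim_body[3:]
    # Bits 1-7 of bitmap control hold (first octet number) / 2, so masking
    # off bit 0 gives the index of the first octet that is present.
    first_octet = bitmap_control & 0xFE
    index = aid // 8 - first_octet
    if index < 0 or index >= len(partial_bitmap):
        return False                          # octet omitted, so all bits are zero
    return bool(partial_bitmap[index] & (1 << (aid % 8)))
```

For example, tim_has_traffic(bytes([0, 1, 0x00, 0x02]), 1) is True (the bit for AID 1 is set), while tim_is_empty(bytes([0, 1, 0x00, 0x00])) is True.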

Does anyone have any thoughts about what to look for here?

Have fun,

Avery


2016-02-17 06:23:53

by Krishna Chaitanya

Subject: Re: ath9k(?): AP stops sending traffic to iPhone 4S until another 802.11n-capable STA joins

On Wed, Feb 17, 2016 at 10:02 AM, Avery Pennarun <[email protected]> wrote:
>
> On Tue, Feb 16, 2016 at 5:05 PM, Johannes Berg
> <[email protected]> wrote:
> > On Tue, 2016-02-16 at 16:28 -0500, Avery Pennarun wrote:
> >> Changing default_agg_timeout to zero (as it is on most non-ath9k
> >> drivers) makes the problem pretty much go away. However, I think
> >> it's because I'm just dodging the code path that triggers a race
> >> condition.
> >
> > That does seem likely. Perhaps you could reproduce it while running
> > mac80211 tracing? There should be a fair amount of information about
> > aggregation and queue stops in there, though as you note queue stops
> > aren't really happening, only aggregation related things. Perhaps the
> > tracepoints for that aren't quite sufficient.
>
> So far that hasn't seemed to help, although maybe you can read traces
> better than I can. The big problem is that the actual queue doesn't
> seem to have stopped; it might be an ath9k bug.
>
> >> Notes:
> >>
> >> - I'm using exactly the same ath9k driver (currently 20150525, but
> >> we've tried newer ones with no difference) on two totally different
> >> platforms: a dual-core mindspeed c2k host CPU (ARMv7) with separate
> >> ath9k, and a single-core QCA9531 (MIPS) with on-chip ath9k.
> >>
> >> - I've been unable to trigger the problem on the QCA9531, but I have
> >> on MIPS.
> >
> > That's ... not what I would have expected, especially since the MIPS is
> > single core. That makes the races stranger than expected.
>
> Oops, typo. The QCA9531 *is* MIPS. The one where it triggers is the
> dual-core ARM.
>
> >> The aggregation code is... a little hairy. Does anyone have any
> >> guesses where I might look for the race condition? Or better still,
> >> a patch I can try?
> >
> > I'm not aware of any race conditions in the code right now :)
>
> Aw. That would have made it a lot easier!


From a quick glance at the symptoms, I think the patch below is worth a
try, even though I don't see that you are doing any background scans,
which is where it applies.

https://patchwork.kernel.org/patch/8015321/

2016-02-17 07:05:29

by Avery Pennarun

Subject: Re: ath9k(?): AP stops sending traffic to iPhone 4S until another 802.11n-capable STA joins

On Wed, Feb 17, 2016 at 1:23 AM, Krishna Chaitanya
<[email protected]> wrote:
> From a quick glance at the symptoms, I think the patch below is worth a
> try, even though I don't see that you are doing any background scans,
> which is where it applies.
>
> https://patchwork.kernel.org/patch/8015321/

Thanks, Krishna. We are in fact doing background scans occasionally.
However, none was in progress around the time of the glitch, and the
problem was still reproducible with background scans disabled. We
also aren't combining AP and STA on the same radio (in this particular
use case).

2016-02-16 22:05:32

by Johannes Berg

Subject: Re: ath9k(?): AP stops sending traffic to iPhone 4S until another 802.11n-capable STA joins

On Tue, 2016-02-16 at 16:28 -0500, Avery Pennarun wrote:

> Changing default_agg_timeout to zero (as it is on most non-ath9k
> drivers) makes the problem pretty much go away.  However, I think
> it's because I'm just dodging the code path that triggers a race
> condition.

That does seem likely. Perhaps you could reproduce it while running
mac80211 tracing? There should be a fair amount of information about
aggregation and queue stops in there, though as you note queue stops
aren't really happening, only aggregation related things. Perhaps the
tracepoints for that aren't quite sufficient.

> Notes:
>
> - I'm using exactly the same ath9k driver (currently 20150525, but
> we've tried newer ones with no difference) on two totally different
> platforms: a dual-core mindspeed c2k host CPU (ARMv7) with separate
> ath9k, and a single-core QCA9531 (MIPS) with on-chip ath9k.
>
> - I've been unable to trigger the problem on the QCA9531, but I have
> on MIPS.

That's ... not what I would have expected, especially since the MIPS is
single core. That makes the races stranger than expected.

> The aggregation code is... a little hairy.  Does anyone have any
> guesses where I might look for the race condition?  Or better still,
> a patch I can try?

I'm not aware of any race conditions in the code right now :)

johannes

2016-02-17 04:33:00

by Avery Pennarun

Subject: Re: ath9k(?): AP stops sending traffic to iPhone 4S until another 802.11n-capable STA joins

On Tue, Feb 16, 2016 at 5:05 PM, Johannes Berg
<[email protected]> wrote:
> On Tue, 2016-02-16 at 16:28 -0500, Avery Pennarun wrote:
>> Changing default_agg_timeout to zero (as it is on most non-ath9k
>> drivers) makes the problem pretty much go away. However, I think
>> it's because I'm just dodging the code path that triggers a race
>> condition.
>
> That does seem likely. Perhaps you could reproduce it while running
> mac80211 tracing? There should be a fair amount of information about
> aggregation and queue stops in there, though as you note queue stops
> aren't really happening, only aggregation related things. Perhaps the
> tracepoints for that aren't quite sufficient.

So far that hasn't seemed to help, although maybe you can read traces
better than I can. The big problem is that the actual queue doesn't
seem to have stopped; it might be an ath9k bug.

>> Notes:
>>
>> - I'm using exactly the same ath9k driver (currently 20150525, but
>> we've tried newer ones with no difference) on two totally different
>> platforms: a dual-core mindspeed c2k host CPU (ARMv7) with separate
>> ath9k, and a single-core QCA9531 (MIPS) with on-chip ath9k.
>>
>> - I've been unable to trigger the problem on the QCA9531, but I have
>> on MIPS.
>
> That's ... not what I would have expected, especially since the MIPS is
> single core. That makes the races stranger than expected.

Oops, typo. The QCA9531 *is* MIPS. The one where it triggers is the
dual-core ARM.

>> The aggregation code is... a little hairy. Does anyone have any
>> guesses where I might look for the race condition? Or better still,
>> a patch I can try?
>
> I'm not aware of any race conditions in the code right now :)

Aw. That would have made it a lot easier!