2015-11-04 05:04:19

by Avery Pennarun

[permalink] [raw]
Subject: ath9k(?): AP stops sending traffic to iPhone 4S until another 802.11n-capable STA joins

Hi all,

I have a pretty weird problem I've been chasing for a few weeks and
have narrowed it down, but not quite solved it. It may be caused by
bugs in aggregation-related code.

Steps:
- Set up an ath9k-based Linux AP on an ARM processor (currently using
this version of backports, though I've tried older and newer versions
with no change: "backported from Linux (next-20150525-0-gc201847)
using backports backports-20150525-0-g49969bd")
- Join my iPhone 4S (running iOS 7.1.2) to the network
- Use it for a while
- Eventually it will stay connected, but Internet access doesn't work
- Wireless packet captures show that packets are received *from* the
iPhone, and ACKs are returned for those packets from the ath9k, and
those packets are correctly forwarded to the AP's br0 interface. But
outgoing packets show up on br0 and wlan0 with tcpdump, but never make
it onto the air.
- Putting the iPhone 4S into airplane mode and then letting it
reconnecting will fix it for a few more seconds/minutes before it
stops again.

More details:
- It only seems to happen to my iPhone 4S client (never seen it with a
different client).
- It only seems to happen with my ath9k AP.
- It only seems to happen on my home network (another instance of the
same AP hardware on another network doesn't trigger the problem).
- It only seems to happen when no other 802.11n-capable devices are
connected to the same AP.
- The moment I join an 802.11n-capable device to the AP, traffic
instantly unblocks (see packet capture below).
- Joining an 802.11g-only device (no aggregation) does *not* unblock traffic.
- Disabling encryption and turning wmm_enable on and off have no effect.
- Disabling 802.11n support on the AP (so that everyone has to use
802.11g) makes the problem go away.
- 'ip -s link show dev wlan0' shows tx packet counters continuing to
increase during the outage, even though packets aren't flowing.
- I applied a patch from Tim Shepard to track the most recent tx
attempt, acked tx, and rx packet times inside mac80211. According to
this data, mac80211 thinks rx happened at most a couple of seconds ago
(as expected). The most recent tx was acked, but it was back around
the time the outage started. Note that this disagrees with 'ip -s
link' and tcpdump, which think they transmitted much more recently
than that. (The patch is here:
https://gfiber-review.googlesource.com/#/c/1250/ )

I captured a pcap of a new 802.11n-capable device joining the network
and unblocking the transmit. The action starts around frame 325:
http://apenwarr.ca/tmp/iPod4-fixing-iPhone4-trimmed.pcap.gz

In this pcap, the main players are:
ath9k AP: 88:dc:96:08:60:50
iPhone 4S with the problem: e4:25:e7:73:e6:31
New client fixing the problem (iPod 4): 18:e7:f4:7e:c1:42

Observations from the pcap:
- Upstream packets (iPhone->ath9k) are received and acked (see eg. frame 154)
- Beacons from the ath9k show an empty TIM bitmap until the iPod
joins, then it's nonempty and things unblock.

Does anyone have any thoughts about what to look for here?

Have fun,

Avery


2016-02-16 21:44:31

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

On Tue, 2016-02-16 at 16:28 -0500, Avery Pennarun wrote:
> Since around the beginning of time, ath9k aggregates have timed out
> after
> 5000 TU (around 5000ms) of inactivity, but nobody seems to be quite
> sure
> why, and this magic number seems to have migrated around from one
> place to
> another.  An openbsd mailing list recently had a patch to disable the
> timeout completely, which they say matches some commercial routers:
> https://www.mail-archive.com/[email protected]/msg29456.html
>
> Even in Linux, several non-ath9k drivers default to no timeout
> already.  I
> think changing it directly to zero would be safe, but to allow a more
> structured investigation, let's make it configurable for now.

I believe the original timeout came from some ancient Intel code.
Nobody gave it any thought at the time, and it was merged and preserved
to this day.

The reason for Intel specifically is that every aggregation session
(both RX and TX), even idle ones, takes up hardware resources, so
there's a limited need to drop completely idle sessions.

As a result, I believe that changing the default for every driver other
than iwlegacy, iwlwifi/dvm and iwlwifi/mvm would be perfectly fine.

johannes

2016-02-16 21:28:25

by Avery Pennarun

[permalink] [raw]
Subject: [PATCH] mac80211: debugfs var for the default aggregation timeout.

Since around the beginning of time, ath9k aggregates have timed out after
5000 TU (around 5000ms) of inactivity, but nobody seems to be quite sure
why, and this magic number seems to have migrated around from one place to
another. An openbsd mailing list recently had a patch to disable the
timeout completely, which they say matches some commercial routers:
https://www.mail-archive.com/[email protected]/msg29456.html

Even in Linux, several non-ath9k drivers default to no timeout already. I
think changing it directly to zero would be safe, but to allow a more
structured investigation, let's make it configurable for now.

Signed-off-by: Avery Pennarun <[email protected]>
---
net/mac80211/debugfs_netdev.c | 4 ++++
net/mac80211/rc80211_minstrel_ht.c | 4 +++-
2 files changed, 7 insertions(+), 1 deletion(-)


diff --git a/net/mac80211/debugfs_netdev.c b/net/mac80211/debugfs_netdev.c
index 37ea30e..5ae160b 100644
--- a/net/mac80211/debugfs_netdev.c
+++ b/net/mac80211/debugfs_netdev.c
@@ -715,6 +715,8 @@ static void add_mesh_config(struct ieee80211_sub_if_data *sdata)
}
#endif

+u32 default_agg_timeout = 5000;
+
static void add_files(struct ieee80211_sub_if_data *sdata)
{
if (!sdata->vif.debugfs_dir)
@@ -725,6 +727,8 @@ static void add_files(struct ieee80211_sub_if_data *sdata)
DEBUGFS_ADD(txpower);
DEBUGFS_ADD(user_power_level);
DEBUGFS_ADD(ap_power_level);
+ debugfs_create_u32("default_agg_timeout", 0600, sdata->vif.debugfs_dir,
+ &default_agg_timeout);

if (sdata->vif.type != NL80211_IFTYPE_MONITOR)
add_common_files(sdata);
diff --git a/net/mac80211/rc80211_minstrel_ht.c b/net/mac80211/rc80211_minstrel_ht.c
index 3928dbd..028d9d4 100644
--- a/net/mac80211/rc80211_minstrel_ht.c
+++ b/net/mac80211/rc80211_minstrel_ht.c
@@ -671,6 +671,8 @@ minstrel_downgrade_rate(struct minstrel_ht_sta *mi, u16 *idx, bool primary)
}
}

+extern u32 default_agg_timeout;
+
static void
minstrel_aggr_check(struct ieee80211_sta *pubsta, struct sk_buff *skb)
{
@@ -691,7 +693,7 @@ minstrel_aggr_check(struct ieee80211_sta *pubsta, struct sk_buff *skb)
if (likely(sta->ampdu_mlme.tid_tx[tid]))
return;

- ieee80211_start_tx_ba_session(pubsta, tid, 5000);
+ ieee80211_start_tx_ba_session(pubsta, tid, default_agg_timeout);
}

static void
--
2.7.0.rc3.207.g0ac5344


2016-02-23 10:14:05

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

On Tue, 2016-02-16 at 16:28 -0500, Avery Pennarun wrote:
> Since around the beginning of time, ath9k aggregates have timed out
> after
> 5000 TU (around 5000ms) of inactivity, but nobody seems to be quite
> sure
> why, and this magic number seems to have migrated around from one
> place to
> another.  An openbsd mailing list recently had a patch to disable the
> timeout completely, which they say matches some commercial routers:
> https://www.mail-archive.com/[email protected]/msg29456.html
>
> Even in Linux, several non-ath9k drivers default to no timeout
> already.  I
> think changing it directly to zero would be safe, but to allow a more
> structured investigation, let's make it configurable for now.
>
Since we just made it zero, perhaps we don't need this?

Although perhaps we still want it to be able to debug it?

Anyway - you shouldn't create a debugfs file and play with the extern
stuff etc., let minstrel create the debugfs file in minstrel_ht_alloc()

johannes

2016-02-17 02:30:45

by Sujith Manoharan

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

Avery Pennarun wrote:
> Since around the beginning of time, ath9k aggregates have timed out after
> 5000 TU (around 5000ms) of inactivity, but nobody seems to be quite sure
> why, and this magic number seems to have migrated around from one place to
> another. An openbsd mailing list recently had a patch to disable the
> timeout completely, which they say matches some commercial routers:
> https://www.mail-archive.com/[email protected]/msg29456.html

We actually did this for ath9k:
https://git.kernel.org/cgit/linux/kernel/git/wireless/wireless-testing.git/commit/?id=bd2ce6e43f65127bc723e7fcc044758cf8113260

Almost all APs set the timeout to zero, only mac80211 differs
and uses 5000 TUs. I guess ath9k lost this fix when the old
RC was removed and switched to minstrel.

Sujith

2016-02-23 20:05:52

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

On Tue, 2016-02-23 at 13:43 -0500, Avery Pennarun wrote:

> We're putting my version of the patch into our devices in order to be
> able to try different values and see how it changes the percentage of
> devices with nonzero 'pending' field in agg_status.  I'm hoping using
> zero here will result in total elimination of the pending problem,
> but we'll see.

:)
I for one would be interested in the result. And, if you find mac80211
is at fault, knowing what happens there.

> If people think this is important, I can respin the patch, otherwise
> feel free to discard.
>

Up to you. I've discarded it for now with "changes requested", if you
never resend I'll not worry about it. Clearly we haven't had a need for
it so far.

johannes

2016-02-16 21:28:23

by Avery Pennarun

[permalink] [raw]
Subject: Re: ath9k(?): AP stops sending traffic to iPhone 4S until another 802.11n-capable STA joins

Okay, I've made much more progress on this old thread. I haven't actually
fixed the bug, which I suspect is a race condition only on multicore
machines, but I at least have better reproduction steps and a workaround.

The bug seems to trigger when three things happen at once:
1) Background interference causes retries
2) AP wants to send data to the STA, which has been idle for a while
3) We want to negotiate a new BA session from AP to STA.

Sometimes, the background interference will cause the time between ADDBA
Request (from AP) and ADDBA Response (from STA) to be longer than usual. In
my tests, it's usually <1ms, but in high-interference situations I've seen
it be >3ms. Sometimes, when the delay is longer, I see the symptom that the
agg_status file for the station in question starts showing TID#0's "pending"
column increasing slowly, until it eventually reaches 64. A wifi capture on
a separate sniffer indicates that no data is being transmitted to that
station, although traffic to other stations (and broadcast/multicast)
continues unabated. I guess this means the device's queues are themselves
not stopped, but the station's per-TID aggregation queue is stuck.

Twiddling the agg_status of a different queue (in this case TID#1) unblocks
TID#0:
echo "tx start 1" >/sys/kernel/debug/ieee80211/phy0/.../agg_status

So does having another aggregation-capable device join the network. Having
an 802.11g-only device join the network does *not* unblock the queue.

However, trying to stop TID#0 doesn't help (and it also doesn't successfully
stop the aggregation):
echo "tx stop 0" >/sys/kernel/debug/ieee80211/phy0/.../agg_status

The following patch makes the problem easier to reproduce by letting you
turn the aggregation timeout way down. For myself, I used a
default_agg_timeout of 500ms and just pinged repeatedly once per second from
the AP to STA. This causes the aggregation sessions to be repeatedly
brought up and torn down, which triggers the problem for me within a few
minutes (when run on a channel with fairly high noise).

Changing default_agg_timeout to zero (as it is on most non-ath9k drivers)
makes the problem pretty much go away. However, I think it's because I'm
just dodging the code path that triggers a race condition.

Notes:

- I'm using exactly the same ath9k driver (currently 20150525, but we've
tried newer ones with no difference) on two totally different platforms: a
dual-core mindspeed c2k host CPU (ARMv7) with separate ath9k, and a
single-core QCA9531 (MIPS) with on-chip ath9k.

- I've been unable to trigger the problem on the QCA9531, but I have on
MIPS.

The aggregation code is... a little hairy. Does anyone have any guesses
where I might look for the race condition? Or better still, a patch I can
try?


Avery Pennarun (1):
mac80211: add a debugfs var for the default aggregation timeout.

net/mac80211/debugfs_netdev.c | 4 ++++
net/mac80211/rc80211_minstrel_ht.c | 4 +++-
2 files changed, 7 insertions(+), 1 deletion(-)

--
2.7.0.rc3.207.g0ac5344


2016-02-23 18:43:53

by Avery Pennarun

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

On Tue, Feb 23, 2016 at 5:14 AM, Johannes Berg
<[email protected]> wrote:
> On Tue, 2016-02-16 at 16:28 -0500, Avery Pennarun wrote:
>> Since around the beginning of time, ath9k aggregates have timed out
>> after
>> 5000 TU (around 5000ms) of inactivity, but nobody seems to be quite
>> sure
>> why, and this magic number seems to have migrated around from one
>> place to
>> another. An openbsd mailing list recently had a patch to disable the
>> timeout completely, which they say matches some commercial routers:
>> https://www.mail-archive.com/[email protected]/msg29456.html
>>
>> Even in Linux, several non-ath9k drivers default to no timeout
>> already. I
>> think changing it directly to zero would be safe, but to allow a more
>> structured investigation, let's make it configurable for now.
>>
> Since we just made it zero, perhaps we don't need this?
>
> Although perhaps we still want it to be able to debug it?

We're putting my version of the patch into our devices in order to be
able to try different values and see how it changes the percentage of
devices with nonzero 'pending' field in agg_status. I'm hoping using
zero here will result in total elimination of the pending problem, but
we'll see.

It probably makes sense not to apply this upstream if the default
value is zero now anyway.

> Anyway - you shouldn't create a debugfs file and play with the extern
> stuff etc., let minstrel create the debugfs file in minstrel_ht_alloc()

Good point. I had a feeling I was doing that in the wrong place :)

If people think this is important, I can respin the patch, otherwise
feel free to discard.

Have fun,

Avery

2016-04-08 01:33:17

by Avery Pennarun

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

On Wed, Apr 6, 2016 at 3:40 AM, Johannes Berg <[email protected]> wrote:
> On Tue, 2016-04-05 at 19:46 -0400, Avery Pennarun wrote:
>
>> This test was with backports-20150525 on ath9k. (We have newer
>> versions in the queue, but they haven't rolled out to our customers
>> yet. Anyway, earlier in this thread, I was able to trigger the race
>> condition on much newer backports. Unfortunately the current fix
>> makes my reproducible test case go away, but I don't know any reason
>> to assume the race condition is fixed.)
>
> Well, we know that the timeout is likely unrelated to the issue (other
> than not triggering the broken code path that frequently), so you can
> revert the timeout change for the test case.

Yes. And I can make it happen more often by making it timeout the
aggregation agreement much more frequently than usual.

>> While we're here, unfortunately it turns out that just observing the
>> agg_status file can cause crashes (though not very often... except
>> for a few unlucky customers), probably due to a different race
>> condition.
>> Any suggestions about this one? Stack trace attached below. (I
>> think the stack trace suggests a mac80211 problem?)
>
> That has to be a mac80211 problem, yeah.
> (Side note: I'm a bit surprised this is a 32-bit system?)

We're going for all of good, fast, and cheap here. That should end well :)

> Looks like we use RCU protection to get the data. Can I get the
> mac80211.ko binary (with debug data) corresponding to the crash below?

Yes. Here it is:
http://apenwarr.ca/tmp/mac80211-agg-status-crash.ko

Thanks for your help!

2016-04-08 08:32:06

by Avery Pennarun

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

On Fri, Apr 8, 2016 at 3:15 AM, Johannes Berg <[email protected]> wrote:
> On Fri, 2016-04-08 at 09:01 +0200, Johannes Berg wrote:
>> On Fri, 2016-04-08 at 08:56 +0200, Johannes Berg wrote:
>> > On Thu, 2016-04-07 at 21:32 -0400, Avery Pennarun wrote:
>> > > Yes. Here it is:
>> > > http://apenwarr.ca/tmp/mac80211-agg-status-crash.ko
>> > >
>> > Unfortunately there are no debug symbols in this file, so it
>> > doesn't
>> > help me much. I can't even seem to get objdump to disassemble it
>> > correctly: looks like the file is in thumb, going from things
>> > like R_ARM_THM_CALL relocations, but even -Mforce-thumb doesn't
>> > seem
>> > to DRT; sta_agg_status_read+0xeb isn't even a valid instruction
>> > offset in regular ARM mode.
>> >
>> It *seems* that it most likely crashes on the first access to tid_tx,
>> which is consistent with the story of disabling TX aggregation
>> timeouts
>> reducing the chances.
>>
>> So I guess we have to look for some TX aggregation teardown RCU
>> pointer problem?
>
> Can't find anything. The only other thing I saw now is that the TID
> appears to be 7 (in r7), might be worth looking for whether that's a
> common thing or not?

Just to be clear, this crash is only from *reading* the agg_status
files. I don't know if the crashiness reduces when disabling the
aggregation timeouts, since that's a separate bug (in which the queue
gets stuck and the 'pending' column of this file just keeps
increasing).

I'll try twiddling some options again tomorrow and see if I can get
one with proper debug symbols. For what it's worth, this platform is
"ARMv7 Processor rev 1 (v7l)" and the gcc build is made for Cortex A9.
You can find an x86 build of our toolchain in the git repo at
https://gfiber.googlesource.com/toolchains/mindspeed.

Thanks for looking into it :)

Avery

2016-04-05 23:47:15

by Avery Pennarun

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

On Tue, Feb 23, 2016 at 3:05 PM, Johannes Berg
<[email protected]> wrote:
> On Tue, 2016-02-23 at 13:43 -0500, Avery Pennarun wrote:
>> We're putting my version of the patch into our devices in order to be
>> able to try different values and see how it changes the percentage of
>> devices with nonzero 'pending' field in agg_status. I'm hoping using
>> zero here will result in total elimination of the pending problem,
>> but we'll see.
>
> :)
> I for one would be interested in the result. And, if you find mac80211
> is at fault, knowing what happens there.

Here's the promised update! The news is not as good as I had hoped.

Across the GFiber fleet, number of APs per day observing the problem
(ie. the pending field > 0 for more than a minute for any station),
with the original aggregation timeout, is about 41% (yikes). With the
aggregation timeout set to zero, the number of APs observing the
problem in a day drops to about 10%.

Obviously this is a huge improvement, but the problem isn't completely
eliminated. In retrospect that's not totally surprising, as there are
reasons other than an AP-side aggregation timeout that an aggregation
would need to be negotiated, and a race condition in aggregation queue
setup could happen at any of those times. I was just hoping that
those other cases would be much less frequent than they apparently
are.

This test was with backports-20150525 on ath9k. (We have newer
versions in the queue, but they haven't rolled out to our customers
yet. Anyway, earlier in this thread, I was able to trigger the race
condition on much newer backports. Unfortunately the current fix
makes my reproducible test case go away, but I don't know any reason
to assume the race condition is fixed.)

While we're here, unfortunately it turns out that just observing the
agg_status file can cause crashes (though not very often... except for
a few unlucky customers), probably due to a different race condition.
Any suggestions about this one? Stack trace attached below. (I think
the stack trace suggests a mac80211 problem?)

Thanks!

Avery


03/30,133400.674 Unable to handle kernel paging request at virtual
address 5b35da9e
03/30,133400.675 pgd = ac238000
03/30,133400.675 [5b35da9e] *pgd=00000000
03/30,133400.675 Internal error: Oops: 5 [#1] PREEMPT SMP
03/30,133400.680 Modules linked in: ccm nf_conntrack_netlink
auto_bridge(O) fci(O) nfnetlink pktgen ath9k_htc(O) mwifiex_usb(O)
mwifiex(O) ath10k_pci(O) ath10k_core(O) arc4 ath9k(O) mac80211(O)
ath9k_common(O) ath9k_hw(O) ath(O) cfg80211(O) compat(O) bmoca(O)
xt_connmark ip6table_mangle xt_CLASSIFY iptable_mangle xt_helper
nf_nat_sip nf_conntrack_sip ip6t_REJECT ip6t_LOG nf_conntrack_ipv6
nf_defrag_ipv6 ip6table_filter ip6_tables nf_nat_rtsp
nf_conntrack_rtsp nf_nat_h323 nf_conntrack_h323 nf_nat_irc
nf_conntrack_irc nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre
nf_nat_proto_gre nf_nat_tftp nf_conntrack_tftp nf_nat_ftp
nf_conntrack_ftp ipt_MASQUERADE ipt_REJECT ipt_LOG xt_limit xt_pkttype
xt_conntrack xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables pfe(O)
03/30,133400.753 CPU: 0 Tainted: G O (3.2.26 #1)
03/30,133400.758 PC is at sta_agg_status_read+0xeb/0x170 [mac80211]
03/30,133400.764 LR is at sta_agg_status_read+0xd8/0x170 [mac80211]
03/30,133400.770 pc : [<838b4d0c>] lr : [<838b4cf9>] psr: 20010033
03/30,133400.770 sp : ac0c3c58 ip : 0000000f fp : ac0c3c71
03/30,133400.782 r10: ac341800 r9 : af7f3b53 r8 : 00000001
03/30,133400.787 r7 : 00000007 r6 : 5b35da40 r5 : ac0c3f38 r4 : ac0c3d90
03/30,133400.794 r3 : ac0c3d8d r2 : 838c6958 r1 : 000001a8 r0 : ac0c3d90
03/30,133400.800 Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA Thumb
Segment user
03/30,133400.807 Control: 50c53c7d Table: 2c23804a DAC: 00000015
03/30,133400.813 Process psstat (pid: 25220, stack limit = 0xac0c22f0)
03/30,133400.819 Stack: (0xac0c3c58 to 0xac0c4000)
03/30,133400.824 3c40:
00000209 a6199050
03/30,133400.832 3c60: ac0c3d58 7e957143 00000001 ac0c3f88 78656e00
69642074 676f6c61 6b6f745f
03/30,133400.840 3c80: 203a6e65 0a317830 09444954 09585209 4e4b5444
4e535309 58540909 4b544409
03/30,133400.848 3ca0: 6570094e 6e69646e 30300a67 09300909 30307830
30783009 09093030 78300930
03/30,133400.857 3cc0: 30093030 300a3030 30090931 30783009 78300930
09303030 30093009 09303078
03/30,133400.865 3ce0: 0a303030 09093230 78300930 30093030 30303078
09300909 30307830 30303009
03/30,133400.873 3d00: 0933300a 30093009 09303078 30307830 30090930
30783009 30300930 34300a30
03/30,133400.881 3d20: 09300909 30307830 30783009 09093030 78300930
30093030 300a3030 30090935
03/30,133400.889 3d40: 30783009 78300930 09303030 30093009 09303078
0a303030 09093630 78300931
03/30,133400.898 3d60: 30096632 32323678 31090966 38783009 32310933
30343230 35383333 0937300a
03/30,133400.906 3d80: 30093109 09303578 31307830 31090961 38300a00
09300909 30307830 30783009
03/30,133400.914 3da0: 09093030 78300930 30093030 300a3030 30090939
30783009 78300930 09303030
03/30,133400.922 3dc0: 30093009 09303078 0a303030 09093031 78300930
30093030 30303078 09300909
03/30,133400.930 3de0: 30307830 30303009 0931310a 30093009 09303078
30307830 30090930 30783009
03/30,133400.939 3e00: 30300930 32310a30 09300909 30307830 30783009
09093030 78300930 30093030
03/30,133400.947 3e20: 310a3030 30090933 30783009 78300930 09303030
30093009 09303078 0a303030
03/30,133400.955 3e40: 09093431 78300930 30093030 30303078 09300909
30307830 30303009 0935310a
03/30,133400.963 3e60: 30093009 09303078 30307830 30090930 30783009
30300930 00000a30 bfa440c0
03/30,133400.971 3e80: 842caf64 842cadb4 ac0c3e90 8401ea65 00000000
c55f8337 c55f8337 000015e4
03/30,133400.980 3ea0: bb3f54b8 ac0c3eb8 c55f8337 84021ad3 bf82f060
00000001 80000000 00000000
03/30,133400.988 3ec0: 84008b15 00000000 00000002 84040045 ffffffff
00000000 00000002 84470aac
03/30,133400.996 3ee0: ac0c3f18 bb05f780 00000000 840401b5 00000000
84431160 ac0c2000 bb3f5480
03/30,133401.004 3f00: ac0c3f18 842c792d 84cb6160 00000005 ac0c3f18
00000001 bd3e9668 00000001
03/30,133401.012 3f20: 842caf64 842cadb4 8400caf5 00000000 00000000
00000000 8443e8b8 bd3e9660
03/30,133401.021 3f40: 838b4c21 7e957143 ac0c3f88 7e957143 ac0c2000
00000000 0002802c 840993bb
03/30,133401.029 3f60: 00000030 00000001 bd3e9660 00000001 000001fa
00000000 7e957143 84099637
03/30,133401.037 3f80: ac0c2000 00000000 000001fa 00000000 00000001
0049fc94 0049fcca 00000003
03/30,133401.045 3fa0: 8400cc44 8400ca81 00000001 0049fc94 00000000
7e957143 00000001 00000030
03/30,133401.054 3fc0: 00000001 0049fc94 0049fcca 00000003 ffffffff
00028028 00000000 0002802c
03/30,133401.062 3fe0: 00000000 7e957134 00014428 2ac417cc 60010010
00000000 00000000 00000000
03/30,133401.070 [<838b4d0c>] (sta_agg_status_read+0xeb/0x170
[mac80211]) from [<840993bb>] (vfs_read+0x5f/0xcc)
03/30,133401.080 [<840993bb>] (vfs_read+0x5f/0xcc) from [<84099637>]
(sys_read+0x27/0x48)
03/30,133401.088 [<84099637>] (sys_read+0x27/0x48) from [<8400ca81>]
(ret_fast_syscall+0x1/0x46)
03/30,133401.096 Code: f1b8 0f00 d036 4620 (f896) 305e
03/30,133401.104 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0,
skipped old beacon
03/30,133401.106 waveguide: ZONEED: APs=63
peer-APs=LOFHAL(-10),QIYQAW(-26) stations=
03/30,133401.106 waveguide: Connected station VVMKUW taxonomy:
BCM4339;iPhone 6/6+;802.11ac n:1,w:80
03/30,133401.106 waveguide: Connected station FYLWIQ taxonomy:
SHA:f0297d6b773948dcc4c86451a0207ba7d9e97e1cc864b3031001ae2105faa872;Unknown;802.11n
n:2,w:40
03/30,133401.143 ---[ end trace e62670ec7c09380f ]---
03/30,133401.148 Kernel panic - not syncing: Fatal exception
03/30,133401.153 [<840111e1>] (unwind_backtrace+0x1/0x8c) from
[<842c61fd>] (panic+0x5d/0x134)
03/30,133401.162 [<842c61fd>] (panic+0x5d/0x134) from [<8400f60b>]
(die+0x203/0x224)
03/30,133401.169 [<8400f60b>] (die+0x203/0x224) from [<842c5897>]
(__do_kernel_fault.part.5+0x4f/0x5c)
03/30,133401.178 [<842c5897>] (__do_kernel_fault.part.5+0x4f/0x5c)
from [<84013d23>] (do_page_fault+0x20b/0x268)
03/30,133401.188 [<84013d23>] (do_page_fault+0x20b/0x268) from
[<84008293>] (do_DataAbort+0x2f/0x70)
03/30,133401.197 [<84008293>] (do_DataAbort+0x2f/0x70) from
[<8400c4f5>] (__dabt_svc+0x35/0x60)
03/30,133401.206 Exception stack(0xac0c3c10 to 0xac0c3c58)
03/30,133401.211 3c00: ac0c3d90
000001a8 838c6958 ac0c3d8d
03/30,133401.220 3c20: ac0c3d90 ac0c3f38 5b35da40 00000007 00000001
af7f3b53 ac341800 ac0c3c71
03/30,133401.228 3c40: 0000000f ac0c3c58 838b4cf9 838b4d0c 20010033 ffffffff
03/30,133401.235 [<8400c4f5>] (__dabt_svc+0x35/0x60) from [<838b4d0c>]
(sta_agg_status_read+0xeb/0x170 [mac80211])
03/30,133401.245 [<838b4d0c>] (sta_agg_status_read+0xeb/0x170
[mac80211]) from [<840993bb>] (vfs_read+0x5f/0xcc)
03/30,133401.255 [<840993bb>] (vfs_read+0x5f/0xcc) from [<84099637>]
(sys_read+0x27/0x48)
03/30,133401.263 [<84099637>] (sys_read+0x27/0x48) from [<8400ca81>]
(ret_fast_syscall+0x1/0x46)
03/30,133401.272 CPU1: stopping
03/30,133401.275 [<840111e1>] (unwind_backtrace+0x1/0x8c) from
[<8401088d>] (handle_IPI+0xcd/0x104)
03/30,133401.283 [<8401088d>] (handle_IPI+0xcd/0x104) from
[<8400c7c1>] (__irq_usr+0x41/0xa0)
03/30,133401.292 Rebooting in 3 seconds..

2016-04-10 00:31:38

by Adrian Chadd

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

On 8 April 2016 at 21:56, Johannes Berg <[email protected]> wrote:
> On Fri, 2016-04-08 at 21:27 -0400, Avery Pennarun wrote:
>
>> > Just to be clear, this crash is only from *reading* the agg_status
>> > files. I don't know if the crashiness reduces when disabling the
>> > aggregation timeouts, since that's a separate bug (in which the
>> > queue gets stuck and the 'pending' column of this file just keeps
>> > increasing).
>
> Oh, right, I was confusing the two. The reading one is even stranger
> though, in a way. I have no explanation for it (yet). We could suspect
> memory corruption, but why would it specifically hit issues here? Not
> very plausible.
>
>> Updated .ko file that definitely has debug symbols this time:
>> http://apenwarr.ca/tmp/mac80211-agg-status-crash-debugsyms.ko
>>
>
> Ok, that confirms what I did manually in my previous email - that it
> crashed on this:
>
> 141 p += scnprintf(p, sizeof(buf) + buf - p, "\t%#.2x",
> 142 tid_tx ? tid_tx->dialog_token : 0);
>
> (and by hand I'd already checked that it crashed dereferencing the
> tid_tx->dialog_token, since tid_tx was the value 0x5b35da40.
>
> If any people more familiar with ARM are reading this - does the value
> 0x5b35da40 ring a bell? Is that a userspace area? Or an area where the
> stack would be? All other points around here seem to look like
> 0xac0c3c58, or maybe 0x838c6958, but not 0x5b35...., how could we end
> up with that?

.. that looks very userland-y to me. Is it just some pointer garbage perhaps?

Do you have a full crashdump? what's sta->ampdu_mlme look like?



-a

2016-04-06 07:40:42

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

On Tue, 2016-04-05 at 19:46 -0400, Avery Pennarun wrote:

> This test was with backports-20150525 on ath9k.  (We have newer
> versions in the queue, but they haven't rolled out to our customers
> yet.  Anyway, earlier in this thread, I was able to trigger the race
> condition on much newer backports.  Unfortunately the current fix
> makes my reproducible test case go away, but I don't know any reason
> to assume the race condition is fixed.)

Well, we know that the timeout is likely unrelated to the issue (other
than not triggering the broken code path that frequently), so you can
revert the timeout change for the test case.

> While we're here, unfortunately it turns out that just observing the
> agg_status file can cause crashes (though not very often... except
> for a few unlucky customers), probably due to a different race
> condition.
> Any suggestions about this one?  Stack trace attached below.  (I
> think the stack trace suggests a mac80211 problem?)
>

That has to be a mac80211 problem, yeah.
(Side note: I'm a bit surprised this is a 32-bit system?)

Looks like we use RCU protection to get the data. Can I get the
mac80211.ko binary (with debug data) corresponding to the crash below?

johannes

2016-04-09 04:57:03

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

On Fri, 2016-04-08 at 21:27 -0400, Avery Pennarun wrote:

> > Just to be clear, this crash is only from *reading* the agg_status
> > files.  I don't know if the crashiness reduces when disabling the
> > aggregation timeouts, since that's a separate bug (in which the
> > queue gets stuck and the 'pending' column of this file just keeps
> > increasing).

Oh, right, I was confusing the two. The reading one is even stranger
though, in a way. I have no explanation for it (yet). We could suspect
memory corruption, but why would it specifically hit issues here? Not
very plausible.

> Updated .ko file that definitely has debug symbols this time:
> http://apenwarr.ca/tmp/mac80211-agg-status-crash-debugsyms.ko
>

Ok, that confirms what I did manually in my previous email - that it
crashed on this:

141 p += scnprintf(p, sizeof(buf) + buf - p, "\t%#.2x",
142 tid_tx ? tid_tx->dialog_token : 0);

(and by hand I'd already checked that it crashed dereferencing the
tid_tx->dialog_token, since tid_tx was the value 0x5b35da40.

If any people more familiar with ARM are reading this - does the value
0x5b35da40 ring a bell? Is that a userspace area? Or an area where the
stack would be? All other points around here seem to look like
0xac0c3c58, or maybe 0x838c6958, but not 0x5b35...., how could we end
up with that?

johannes

2016-04-10 01:59:02

by bruce m beach

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

> If any people more familiar with ARM are reading this - does the value
> 0x5b35da40 ring a bell?

It could be anything. It depends on the chip implementation. For instance on an
exynos that is an undefined region. What device are we talking about?
Bruce

2016-04-08 06:57:00

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

On Thu, 2016-04-07 at 21:32 -0400, Avery Pennarun wrote:

> Yes.  Here it is:
> http://apenwarr.ca/tmp/mac80211-agg-status-crash.ko
>

Unfortunately there are no debug symbols in this file, so it doesn't
help me much. I can't even seem to get objdump to disassemble it
correctly: looks like the file is in thumb, going from things
like R_ARM_THM_CALL relocations, but even -Mforce-thumb doesn't seem to
DRT; sta_agg_status_read+0xeb isn't even a valid instruction offset in
regular ARM mode.

johannes

2016-04-08 07:01:52

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

On Fri, 2016-04-08 at 08:56 +0200, Johannes Berg wrote:
> On Thu, 2016-04-07 at 21:32 -0400, Avery Pennarun wrote:
>
> >
> > Yes.  Here it is:
> > http://apenwarr.ca/tmp/mac80211-agg-status-crash.ko
> >
> Unfortunately there are no debug symbols in this file, so it doesn't
> help me much. I can't even seem to get objdump to disassemble it
> correctly: looks like the file is in thumb, going from things
> like R_ARM_THM_CALL relocations, but even -Mforce-thumb doesn't seem
> to DRT; sta_agg_status_read+0xeb isn't even a valid instruction
> offset in regular ARM mode.
>

It *seems* that it most likely crashes on the first access to tid_tx,
which is consistent with the story of disabling TX aggregation timeouts
reducing the chances.

So I guess we have to look for some TX aggregation teardown RCU pointer
problem?

johannes

2016-04-09 01:27:47

by Avery Pennarun

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

On Fri, Apr 8, 2016 at 4:31 AM, Avery Pennarun <[email protected]> wrote:
> On Fri, Apr 8, 2016 at 3:15 AM, Johannes Berg <[email protected]> wrote:
>> On Fri, 2016-04-08 at 09:01 +0200, Johannes Berg wrote:
>>> On Fri, 2016-04-08 at 08:56 +0200, Johannes Berg wrote:
>>> > On Thu, 2016-04-07 at 21:32 -0400, Avery Pennarun wrote:
>>> > > Yes. Here it is:
>>> > > http://apenwarr.ca/tmp/mac80211-agg-status-crash.ko
>>> > >
>>> > Unfortunately there are no debug symbols in this file, so it
>>> > doesn't
>>> > help me much. I can't even seem to get objdump to disassemble it
>>> > correctly: looks like the file is in thumb, going from things
>>> > like R_ARM_THM_CALL relocations, but even -Mforce-thumb doesn't
>>> > seem
>>> > to DRT; sta_agg_status_read+0xeb isn't even a valid instruction
>>> > offset in regular ARM mode.
>>> >
>>> It *seems* that it most likely crashes on the first access to tid_tx,
>>> which is consistent with the story of disabling TX aggregation
>>> timeouts
>>> reducing the chances.
>>>
>>> So I guess we have to look for some TX aggregation teardown RCU
>>> pointer problem?
>>
>> Can't find anything. The only other thing I saw now is that the TID
>> appears to be 7 (in r7), might be worth looking for whether that's a
>> common thing or not?
>
> Just to be clear, this crash is only from *reading* the agg_status
> files. I don't know if the crashiness reduces when disabling the
> aggregation timeouts, since that's a separate bug (in which the queue
> gets stuck and the 'pending' column of this file just keeps
> increasing).
>
> I'll try twiddling some options again tomorrow and see if I can get
> one with proper debug symbols. For what it's worth, this platform is
> "ARMv7 Processor rev 1 (v7l)" and the gcc build is made for Cortex A9.
> You can find an x86 build of our toolchain in the git repo at
> https://gfiber.googlesource.com/toolchains/mindspeed.

Updated .ko file that definitely has debug symbols this time:
http://apenwarr.ca/tmp/mac80211-agg-status-crash-debugsyms.ko
a gdb compiled for x86-64 that can definitely read the above .ko file:
http://apenwarr.ca/tmp/arm-gdb

Thanks!

2016-04-19 01:30:03

by Avery Pennarun

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

On Sat, Apr 9, 2016 at 9:59 PM, bruce m beach <[email protected]> wrote:
>> If any people more familiar with ARM are reading this - does the value
>> 0x5b35da40 ring a bell?
>
> It could be anything. It depends on the chip implementation. For instance on an
> exynos that is an undefined region. What device are we talking about?

This is a mindspeed c2k CPU, in case that helps at all. But I'm
guessing it really is just some pointer garbage. The only way to
trigger the crash is to do (something) with an attached station at the
same moment as reading the agg_status file. Some station types
trigger it more than others, but I'm not sure which.

2016-04-08 07:15:50

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] mac80211: debugfs var for the default aggregation timeout.

On Fri, 2016-04-08 at 09:01 +0200, Johannes Berg wrote:
> On Fri, 2016-04-08 at 08:56 +0200, Johannes Berg wrote:
> >
> > On Thu, 2016-04-07 at 21:32 -0400, Avery Pennarun wrote:
> >
> > >
> > >
> > > Yes.  Here it is:
> > > http://apenwarr.ca/tmp/mac80211-agg-status-crash.ko
> > >
> > Unfortunately there are no debug symbols in this file, so it
> > doesn't
> > help me much. I can't even seem to get objdump to disassemble it
> > correctly: looks like the file is in thumb, going from things
> > like R_ARM_THM_CALL relocations, but even -Mforce-thumb doesn't
> > seem
> > to DRT; sta_agg_status_read+0xeb isn't even a valid instruction
> > offset in regular ARM mode.
> >
> It *seems* that it most likely crashes on the first access to tid_tx,
> which is consistent with the story of disabling TX aggregation
> timeouts
> reducing the chances.
>
> So I guess we have to look for some TX aggregation teardown RCU
> pointer problem?
>

Can't find anything. The only other thing I saw now is that the TID
appears to be 7 (in r7), might be worth looking for whether that's a
common thing or not?

johannes