We are seeing reports of the following panic with ath10k, running
backports 3.19-rc1 code:
[196113.641] ------------[ cut here ]------------
[196113.641] kernel BUG at kernel/workqueue.c:1040!
[196113.641] Internal error: Oops - undefined instruction: 0 [#1] PREEMPT SMP
[196113.656] Modules linked in: ctr ccm nf_conntrack_netlink
auto_bridge(O) fci(O) nfnetlink pktgen ath9k_htc(O) mwifiex_usb(O)
mwifiex(O) ath10k_pci(O) ath10k_core(O) arc4 ath9k(O) mac80211(O)
ath9k_common(O) ath9k_hw(O) ath(O) cfg80211(O) compat(O) bmoc
a(O) xt_connmark ip6table_mangle xt_CLASSIFY iptable_mangle xt_helper
ip6t_REJECT ip6t_LOG nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
ip6_tables nf_nat_rtsp nf_conntrack_rtsp nf_nat_h323 nf_conntrack_h323
nf_nat_irc nf_conntrack_irc nf_nat_pptp nf_c
onntrack_pptp nf_conntrack_proto_gre nf_nat_proto_gre nf_nat_sip
nf_conntrack_sip nf_nat_tftp nf_conntrack_tftp nf_nat_ftp
nf_conntrack_ftp ipt_MASQUERADE ipt_REJECT ipt_LOG xt_limit xt_pkttype
xt_conntrack xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 n
f_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables pfe(O)
[196113.734] CPU: 0 Tainted: G O (3.2.26 #1)
[196113.734] PC is at __queue_work+0x20e/0x310
[196113.750] LR is at __queue_work+0x1c7/0x310
[196113.750] pc : [<840383ea>] lr : [<840383a3>] psr: 000101b3
[196113.750] sp : b8cd5730 ip : 00000000 fp : 00010113
[196113.766] r10: 89de13cc r9 : 8446b6e4 r8 : 84472e40
[196113.766] r7 : 00000000 r6 : bc797a00 r5 : bc79bd20 r4 : 89de1424
[196113.766] r3 : 89de1428 r2 : 00000000 r1 : bc79bd20 r0 : bc797a00
[196113.781] Flags: nzcv IRQs off FIQs on Mode SVC_32 ISA Thumb
Segment user
[196113.781] Control: 50c53c7d Table: 3577404a DAC: 00000015
[196113.797] Process hostapd (pid: 3817, stack limit = 0xb8cd42f0)
[196113.797] Stack: (0xb8cd5730 to 0xb8cd6000)
...
[196114.391] [<840383ea>] (__queue_work+0x20e/0x310) from [<84038509>]
(queue_work_on+0x1d/0x24)
[196114.406] [<84038509>] (queue_work_on+0x1d/0x24) from [<8403852d>]
(queue_work+0x1d/0x34)
[196114.406] [<8403852d>] (queue_work+0x1d/0x34) from [<83883a05>]
(ieee80211_queue_work+0x14/0x18 [mac80211])
[196114.422] [<83883a05>] (ieee80211_queue_work+0x14/0x18 [mac80211])
from [<838fa15f>] (ath10k_sta_rc_update+0xbe/0x1b4 [ath10k_core])
[196114.438] [<838fa15f>] (ath10k_sta_rc_update+0xbe/0x1b4
[ath10k_core]) from [<8387ddd3>]
(ieee80211_sta_ps_transition+0x1456/0x25a4 [mac80211])
[196114.438] [<8387ddd3>] (ieee80211_sta_ps_transition+0x1456/0x25a4
[mac80211]) from [<8387ee81>]
(ieee80211_sta_ps_transition+0x2504/0x25a4 [mac80211])
[196114.453] [<8387ee81>] (ieee80211_sta_ps_transition+0x2504/0x25a4
[mac80211]) from [<8387f407>] (ieee80211_rx+0x4e6/0x580 [mac80211])
[196114.469] [<8387f407>] (ieee80211_rx+0x4e6/0x580 [mac80211]) from
[<8390659f>] (ath10k_wmi_htc_tx_complete+0xf72/0x160c [ath10k_core])
[196114.484] [<8390659f>] (ath10k_wmi_htc_tx_complete+0xf72/0x160c
[ath10k_core]) from [<83908f29>] (ath10k_wmi_process_rx+0x3e0/0x5c0
[ath10k_core])
[196114.500] [<83908f29>] (ath10k_wmi_process_rx+0x3e0/0x5c0
[ath10k_core]) from [<8390261f>]
(ath10k_htc_rx_completion_handler+0x2be/0x304 [ath10k_core])
[196114.516] [<8390261f>]
(ath10k_htc_rx_completion_handler+0x2be/0x304 [ath10k_core]) from
[<83920f51>] (ath10k_pci_ce_recv_data+0x120/0x190 [ath10k_pci])
[196114.516] FW:IN=wan0.2 OUT=br0
MAC=<removed>
SRC=<removed>
DST=<removed> LEN=60 TC=0 HOPLIMIT=52
FLOWLBL=0 PROTO=TCP SPT=5228 DPT=61497 WINDOW=0 RES=0x00 RST URGP=0
[196114.547] [<83920f51>] (ath10k_pci_ce_recv_data+0x120/0x190
[ath10k_pci]) from [<8392257b>]
(ath10k_ce_per_engine_service+0x4a/0x74 [ath10k_pci])
[196114.562] [<8392257b>] (ath10k_ce_per_engine_service+0x4a/0x74
[ath10k_pci]) from [<839225d1>]
(ath10k_ce_per_engine_service_any+0x2c/0x48 [ath10k_pci])
[196114.578] [<839225d1>] (ath10k_ce_per_engine_service_any+0x2c/0x48
[ath10k_pci]) from [<8392209f>] (ath10k_pci_tasklet+0x26/0x40
[ath10k_pci])
[196114.578] [<8392209f>] (ath10k_pci_tasklet+0x26/0x40 [ath10k_pci])
from [<8402b96f>] (tasklet_action+0x6f/0x104)
[196114.594] [<8402b96f>] (tasklet_action+0x6f/0x104) from
[<8402bc9d>] (__do_softirq+0xc5/0x190)
[196114.594] [<8402bc9d>] (__do_softirq+0xc5/0x190) from [<8402c067>]
(irq_exit+0x37/0x6c)
[196114.609] [<8402c067>] (irq_exit+0x37/0x6c) from [<8400d3d7>]
(handle_IRQ+0xdf/0xf4)
[196114.609] [<8400d3d7>] (handle_IRQ+0xdf/0xf4) from [<8400c553>]
(__irq_svc+0x33/0xb8)
[196114.625] [<8400c553>] (__irq_svc+0x33/0xb8) from [<838695a4>]
(sta_info_recalc_tim+0x573/0x9a4 [mac80211])
[196114.641] [<838695a4>] (sta_info_recalc_tim+0x573/0x9a4 [mac80211])
from [<83869a4d>] (sta_info_insert_rcu+0x78/0x84 [mac80211])
[196114.641] [<83869a4d>] (sta_info_insert_rcu+0x78/0x84 [mac80211])
from [<8387afe5>] (ieee80211_add_station+0x184/0x1e8 [mac80211])
[196114.656] [<8387afe5>] (ieee80211_add_station+0x184/0x1e8
[mac80211]) from [<837a4721>] (nl80211_new_station+0x214/0x2d8
[cfg80211])
[196114.672] [<837a4721>] (nl80211_new_station+0x214/0x2d8 [cfg80211])
from [<842436a1>] (genl_rcv_msg+0x149/0x17c)
[196114.672] [<842436a1>] (genl_rcv_msg+0x149/0x17c) from [<84242fed>]
(netlink_rcv_skb+0x31/0x6c)
[196114.688] [<84242fed>] (netlink_rcv_skb+0x31/0x6c) from
[<8424354d>] (genl_rcv+0x21/0x2c)
[196114.703] [<8424354d>] (genl_rcv+0x21/0x2c) from [<84242a77>]
(netlink_unicast+0x1bb/0x2b8)
[196114.703] [<84242a77>] (netlink_unicast+0x1bb/0x2b8) from
[<84242d97>] (netlink_sendmsg+0x183/0x21c)
[196114.734] [<84242d97>] (netlink_sendmsg+0x183/0x21c) from
[<84221649>] (sock_sendmsg+0x61/0x74)
[196114.734] [<84221649>] (sock_sendmsg+0x61/0x74) from [<84222047>]
(__sys_sendmsg+0x1c3/0x1e0)
[196114.750] [<84222047>] (__sys_sendmsg+0x1c3/0x1e0) from
[<84222dd7>] (sys_sendmsg+0x23/0x38)
[196114.750] [<84222dd7>] (sys_sendmsg+0x23/0x38) from [<8400ca81>]
(ret_fast_syscall+0x1/0x46)
[196114.766] Code: 6862 1d23 429a d000 (de02) 68b7
[196114.766] ---[ end trace 4ff223894257ab1a ]---
[196114.766] Kernel panic - not syncing: Fatal exception in interrupt
[196114.781] [<840111e1>] (unwind_backtrace+0x1/0x8c) from
[<842c28fd>] (panic+0x5d/0x134)
[196114.781] [<842c28fd>] (panic+0x5d/0x134) from [<8400f5f3>] (die+0x1eb/0x224)
[196114.797] [<8400f5f3>] (die+0x1eb/0x224) from [<84008249>]
(do_undefinstr+0xc5/0xdc)
[196114.797] [<84008249>] (do_undefinstr+0xc5/0xdc) from [<8400c661>]
(__und_svc_finish+0x1/0x40)
I believe the following race occurs:
1. mac80211 is processing a new station that just joined.
2. Before it's completed initializing the station, we take an interrupt,
with a packet from the STA.
3. That interrupt is a management frame from the station which causes
rate_control_rate_update to be called, which calls back into the ath10k
with the not-fully initialized station.
4. It reaches ath10k_sta_rc_update before the interrupted thread had
reached ath10k_sta_state()
Therefore it has not yet initialized its workqueue. Still NULL.
5. When this NULL workqueue gets passed to queue_work, it passes
the first check that its not in use because of the NULL struct, but fails the
next check in __queue_work because a NULL structure makes the test
for an empty list fail.
Proposed patch is to not add the new station in the hash until after the
driver initializes it. But I'm not clear what implications this has. Could this
cause other problems?
diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
index 7566f81..f262362 100644
--- a/net/mac80211/sta_info.c
+++ b/net/mac80211/sta_info.c
@@ -482,7 +482,7 @@ static int sta_info_insert_finish(struct sta_info
*sta) __acquires(RCU)
/* check if STA exists already */
if (sta_info_get_bss(sdata, sta->sta.addr)) {
err = -EEXIST;
- goto out_err;
+ goto out_err1;
}
local->num_sta++;
@@ -492,15 +492,17 @@ static int sta_info_insert_finish(struct
sta_info *sta) __acquires(RCU)
/* simplify things and don't accept BA sessions yet */
set_sta_flag(sta, WLAN_STA_BLOCK_BA);
+ /* notify driver */
+ err = sta_info_insert_drv_state(local, sdata, sta);
+
+ if (err)
+ goto out_err2;
+
/* make the station visible */
sta_info_hash_add(local, sta);
list_add_tail_rcu(&sta->list, &local->sta_list);
- /* notify driver */
- err = sta_info_insert_drv_state(local, sdata, sta);
- if (err)
- goto out_remove;
set_sta_flag(sta, WLAN_STA_INSERTED);
/* accept BA sessions now */
@@ -525,13 +527,11 @@ static int sta_info_insert_finish(struct
sta_info *sta) __acquires(RCU)
mesh_accept_plinks_update(sdata);
return 0;
- out_remove:
- sta_info_hash_del(local, sta);
- list_del_rcu(&sta->list);
+ out_err2:
local->num_sta--;
synchronize_net();
__cleanup_single_sta(sta);
- out_err:
+ out_err1:
mutex_unlock(&local->sta_mtx);
rcu_read_lock();
return err;
On 20 August 2015 at 23:39, Marty Faltesek <[email protected]> wrote:
> We are seeing reports of the following panic with ath10k, running
> backports 3.19-rc1 code:
>
> [196113.641] ------------[ cut here ]------------
> [196113.641] kernel BUG at kernel/workqueue.c:1040!
> [196113.641] Internal error: Oops - undefined instruction: 0 [#1] PREEMPT SMP
> [196113.656] Modules linked in: ctr ccm nf_conntrack_netlink
> auto_bridge(O) fci(O) nfnetlink pktgen ath9k_htc(O) mwifiex_usb(O)
> mwifiex(O) ath10k_pci(O) ath10k_core(O) arc4 ath9k(O) mac80211(O)
> ath9k_common(O) ath9k_hw(O) ath(O) cfg80211(O) compat(O) bmoc
> a(O) xt_connmark ip6table_mangle xt_CLASSIFY iptable_mangle xt_helper
> ip6t_REJECT ip6t_LOG nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
> ip6_tables nf_nat_rtsp nf_conntrack_rtsp nf_nat_h323 nf_conntrack_h323
> nf_nat_irc nf_conntrack_irc nf_nat_pptp nf_c
> onntrack_pptp nf_conntrack_proto_gre nf_nat_proto_gre nf_nat_sip
> nf_conntrack_sip nf_nat_tftp nf_conntrack_tftp nf_nat_ftp
> nf_conntrack_ftp ipt_MASQUERADE ipt_REJECT ipt_LOG xt_limit xt_pkttype
> xt_conntrack xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 n
> f_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables pfe(O)
> [196113.734] CPU: 0 Tainted: G O (3.2.26 #1)
> [196113.734] PC is at __queue_work+0x20e/0x310
> [196113.750] LR is at __queue_work+0x1c7/0x310
> [196113.750] pc : [<840383ea>] lr : [<840383a3>] psr: 000101b3
> [196113.750] sp : b8cd5730 ip : 00000000 fp : 00010113
> [196113.766] r10: 89de13cc r9 : 8446b6e4 r8 : 84472e40
> [196113.766] r7 : 00000000 r6 : bc797a00 r5 : bc79bd20 r4 : 89de1424
> [196113.766] r3 : 89de1428 r2 : 00000000 r1 : bc79bd20 r0 : bc797a00
> [196113.781] Flags: nzcv IRQs off FIQs on Mode SVC_32 ISA Thumb
> Segment user
> [196113.781] Control: 50c53c7d Table: 3577404a DAC: 00000015
> [196113.797] Process hostapd (pid: 3817, stack limit = 0xb8cd42f0)
> [196113.797] Stack: (0xb8cd5730 to 0xb8cd6000)
> ...
> [196114.391] [<840383ea>] (__queue_work+0x20e/0x310) from [<84038509>]
> (queue_work_on+0x1d/0x24)
> [196114.406] [<84038509>] (queue_work_on+0x1d/0x24) from [<8403852d>]
> (queue_work+0x1d/0x34)
> [196114.406] [<8403852d>] (queue_work+0x1d/0x34) from [<83883a05>]
> (ieee80211_queue_work+0x14/0x18 [mac80211])
> [196114.422] [<83883a05>] (ieee80211_queue_work+0x14/0x18 [mac80211])
> from [<838fa15f>] (ath10k_sta_rc_update+0xbe/0x1b4 [ath10k_core])
> [196114.438] [<838fa15f>] (ath10k_sta_rc_update+0xbe/0x1b4
> [ath10k_core]) from [<8387ddd3>]
> (ieee80211_sta_ps_transition+0x1456/0x25a4 [mac80211])
> [196114.438] [<8387ddd3>] (ieee80211_sta_ps_transition+0x1456/0x25a4
> [mac80211]) from [<8387ee81>]
> (ieee80211_sta_ps_transition+0x2504/0x25a4 [mac80211])
> [196114.453] [<8387ee81>] (ieee80211_sta_ps_transition+0x2504/0x25a4
> [mac80211]) from [<8387f407>] (ieee80211_rx+0x4e6/0x580 [mac80211])
> [196114.469] [<8387f407>] (ieee80211_rx+0x4e6/0x580 [mac80211]) from
> [<8390659f>] (ath10k_wmi_htc_tx_complete+0xf72/0x160c [ath10k_core])
> [196114.484] [<8390659f>] (ath10k_wmi_htc_tx_complete+0xf72/0x160c
> [ath10k_core]) from [<83908f29>] (ath10k_wmi_process_rx+0x3e0/0x5c0
> [ath10k_core])
> [196114.500] [<83908f29>] (ath10k_wmi_process_rx+0x3e0/0x5c0
> [ath10k_core]) from [<8390261f>]
> (ath10k_htc_rx_completion_handler+0x2be/0x304 [ath10k_core])
> [196114.516] [<8390261f>]
> (ath10k_htc_rx_completion_handler+0x2be/0x304 [ath10k_core]) from
> [<83920f51>] (ath10k_pci_ce_recv_data+0x120/0x190 [ath10k_pci])
> [196114.516] FW:IN=wan0.2 OUT=br0
> MAC=<removed>
> SRC=<removed>
> DST=<removed> LEN=60 TC=0 HOPLIMIT=52
> FLOWLBL=0 PROTO=TCP SPT=5228 DPT=61497 WINDOW=0 RES=0x00 RST URGP=0
> [196114.547] [<83920f51>] (ath10k_pci_ce_recv_data+0x120/0x190
> [ath10k_pci]) from [<8392257b>]
> (ath10k_ce_per_engine_service+0x4a/0x74 [ath10k_pci])
> [196114.562] [<8392257b>] (ath10k_ce_per_engine_service+0x4a/0x74
> [ath10k_pci]) from [<839225d1>]
> (ath10k_ce_per_engine_service_any+0x2c/0x48 [ath10k_pci])
> [196114.578] [<839225d1>] (ath10k_ce_per_engine_service_any+0x2c/0x48
> [ath10k_pci]) from [<8392209f>] (ath10k_pci_tasklet+0x26/0x40
> [ath10k_pci])
> [196114.578] [<8392209f>] (ath10k_pci_tasklet+0x26/0x40 [ath10k_pci])
> from [<8402b96f>] (tasklet_action+0x6f/0x104)
> [196114.594] [<8402b96f>] (tasklet_action+0x6f/0x104) from
> [<8402bc9d>] (__do_softirq+0xc5/0x190)
> [196114.594] [<8402bc9d>] (__do_softirq+0xc5/0x190) from [<8402c067>]
> (irq_exit+0x37/0x6c)
> [196114.609] [<8402c067>] (irq_exit+0x37/0x6c) from [<8400d3d7>]
> (handle_IRQ+0xdf/0xf4)
> [196114.609] [<8400d3d7>] (handle_IRQ+0xdf/0xf4) from [<8400c553>]
> (__irq_svc+0x33/0xb8)
> [196114.625] [<8400c553>] (__irq_svc+0x33/0xb8) from [<838695a4>]
> (sta_info_recalc_tim+0x573/0x9a4 [mac80211])
> [196114.641] [<838695a4>] (sta_info_recalc_tim+0x573/0x9a4 [mac80211])
> from [<83869a4d>] (sta_info_insert_rcu+0x78/0x84 [mac80211])
> [196114.641] [<83869a4d>] (sta_info_insert_rcu+0x78/0x84 [mac80211])
> from [<8387afe5>] (ieee80211_add_station+0x184/0x1e8 [mac80211])
> [196114.656] [<8387afe5>] (ieee80211_add_station+0x184/0x1e8
> [mac80211]) from [<837a4721>] (nl80211_new_station+0x214/0x2d8
> [cfg80211])
> [196114.672] [<837a4721>] (nl80211_new_station+0x214/0x2d8 [cfg80211])
> from [<842436a1>] (genl_rcv_msg+0x149/0x17c)
> [196114.672] [<842436a1>] (genl_rcv_msg+0x149/0x17c) from [<84242fed>]
> (netlink_rcv_skb+0x31/0x6c)
> [196114.688] [<84242fed>] (netlink_rcv_skb+0x31/0x6c) from
> [<8424354d>] (genl_rcv+0x21/0x2c)
> [196114.703] [<8424354d>] (genl_rcv+0x21/0x2c) from [<84242a77>]
> (netlink_unicast+0x1bb/0x2b8)
> [196114.703] [<84242a77>] (netlink_unicast+0x1bb/0x2b8) from
> [<84242d97>] (netlink_sendmsg+0x183/0x21c)
> [196114.734] [<84242d97>] (netlink_sendmsg+0x183/0x21c) from
> [<84221649>] (sock_sendmsg+0x61/0x74)
> [196114.734] [<84221649>] (sock_sendmsg+0x61/0x74) from [<84222047>]
> (__sys_sendmsg+0x1c3/0x1e0)
> [196114.750] [<84222047>] (__sys_sendmsg+0x1c3/0x1e0) from
> [<84222dd7>] (sys_sendmsg+0x23/0x38)
> [196114.750] [<84222dd7>] (sys_sendmsg+0x23/0x38) from [<8400ca81>]
> (ret_fast_syscall+0x1/0x46)
> [196114.766] Code: 6862 1d23 429a d000 (de02) 68b7
> [196114.766] ---[ end trace 4ff223894257ab1a ]---
> [196114.766] Kernel panic - not syncing: Fatal exception in interrupt
> [196114.781] [<840111e1>] (unwind_backtrace+0x1/0x8c) from
> [<842c28fd>] (panic+0x5d/0x134)
> [196114.781] [<842c28fd>] (panic+0x5d/0x134) from [<8400f5f3>] (die+0x1eb/0x224)
> [196114.797] [<8400f5f3>] (die+0x1eb/0x224) from [<84008249>]
> (do_undefinstr+0xc5/0xdc)
> [196114.797] [<84008249>] (do_undefinstr+0xc5/0xdc) from [<8400c661>]
> (__und_svc_finish+0x1/0x40)
>
>
> I believe the following race occurs:
>
> 1. mac80211 is processing a new station that just joined.
> 2. Before it's completed initializing the station, we take an interrupt,
> with a packet from the STA.
> 3. That interrupt is a management frame from the station which causes
> rate_control_rate_update to be called, which calls back into the ath10k
> with the not-fully initialized station.
> 4. It reaches ath10k_sta_rc_update before the interrupted thread had
> reached ath10k_sta_state()
> Therefore it has not yet initialized its workqueue. Still NULL.
> 5. When this NULL workqueue gets passed to queue_work, it passes
> the first check that its not in use because of the NULL struct, but fails the
> next check in __queue_work because a NULL structure makes the test
> for an empty list fail.
>
>
> Proposed patch is to not add the new station in the hash until after the
> driver initializes it. But I'm not clear what implications this has. Could this
> cause other problems?
Yes. This can cause problems with drivers using iterate_stations()
from within sta_state() (actually ath10k does that).
Moreover it seems your patch will re-introduce another problem if you
swap the inserting sequence:
commit 5108ca828017120981880eeec8a9ec369334a899
Author: Johannes Berg <[email protected]>
Date: Mon Feb 17 20:49:03 2014 +0100
mac80211: insert stations before adding to driver
There's a race condition in mac80211 because we add stations
to the internal lists after adding them to the driver, which
means that (for example) the following can happen:
1. a station connects and is added
2. first, it is added to the driver
3. then, it is added to the mac80211 lists
If the station goes to sleep between steps 2 and 3, and the
firmware/hardware records it as being asleep, mac80211 will
never instruct the driver to wake it up again as it never
realized it went to sleep since the RX path discarded the
frame as a "spurious class 3 frame", no station entry was
present yet.
Fix this by adding the station in software first, and only
then adding it to the driver. That way, any state that the
driver changes will be reflected properly in mac80211's
station state. The problematic part is the roll-back if the
driver fails to add the station, in that case a bit more is
needed. To not make that overly complex prevent starting BA
sessions in the meantime.
Signed-off-by: Johannes Berg <[email protected]>
I guess it'd be better to try to prevent mac80211 from calling
sta_rc_update() before sta_state()..
Michał
On Thu, 2015-08-20 at 17:39 -0400, Marty Faltesek wrote:
>
> I believe the following race occurs:
>
> 1. mac80211 is processing a new station that just joined.
> 2. Before it's completed initializing the station, we take an
> interrupt,
> with a packet from the STA.
> 3. That interrupt is a management frame from the station which causes
> rate_control_rate_update to be called, which calls back into the
> ath10k
> with the not-fully initialized station.
I'm a bit confused by the stack trace - is this IBSS and the stack
trace is just bad? I don't really see how we get from sta_ps_transition
to sta_rc_update?
> 4. It reaches ath10k_sta_rc_update before the interrupted thread had
> reached ath10k_sta_state()
> Therefore it has not yet initialized its workqueue. Still NULL.
> 5. When this NULL workqueue gets passed to queue_work, it passes
> the first check that its not in use because of the NULL struct,
> but fails the
> next check in __queue_work because a NULL structure makes the test
> for an empty list fail.
>
>
> Proposed patch is to not add the new station in the hash until after
> the
> driver initializes it. But I'm not clear what implications this has.
> Could this
> cause other problems?
>
I think the issue Michał pointed out is valid - but we can probably
check sta->uploaded in the relevant places?
johannes
On Fri, 2015-08-21 at 09:57 +0200, Johannes Berg wrote:
> On Thu, 2015-08-20 at 17:39 -0400, Marty Faltesek wrote:
> >
> > I believe the following race occurs:
> >
> > 1. mac80211 is processing a new station that just joined.
> > 2. Before it's completed initializing the station, we take an
> > interrupt,
> > with a packet from the STA.
> > 3. That interrupt is a management frame from the station which
> > causes
> > rate_control_rate_update to be called, which calls back into the
> >
> > ath10k
> > with the not-fully initialized station.
>
> I'm a bit confused by the stack trace - is this IBSS and the stack
> trace is just bad? I don't really see how we get from
> sta_ps_transition
> to sta_rc_update?
>
> > 4. It reaches ath10k_sta_rc_update before the interrupted thread
> > had
> > reached ath10k_sta_state()
> > Therefore it has not yet initialized its workqueue. Still NULL.
> > 5. When this NULL workqueue gets passed to queue_work, it passes
> > the first check that its not in use because of the NULL struct,
> > but fails the
> > next check in __queue_work because a NULL structure makes the
> > test
> > for an empty list fail.
> >
> >
> > Proposed patch is to not add the new station in the hash until
> > after
> > the
> > driver initializes it. But I'm not clear what implications this
> > has.
> > Could this
> > cause other problems?
> >
>
> I think the issue Michał pointed out is valid - but we can probably
> check sta->uploaded in the relevant places?
>
Come to think of it though - we might cache & delay the call so drivers
actually get the updated values.
johannes
On Mon, 2015-08-24 at 12:12 -0400, Marty Faltesek wrote:
> Yes, that would address the issue in a simple way. Would this be
> applicable to any of the other ops vectors? I will create a patch.
>
Possible, we'll have to check which ones are possible to get at from
the RX (and possibly TX?) paths.
Another option could be to prevent looking up the station until it's
actually uploaded? But that might also cause the other races to happen
again, I suppose.
johannes
On Fri, Aug 21, 2015 at 3:58 AM, Johannes Berg
<[email protected]> wrote:
> On Fri, 2015-08-21 at 09:57 +0200, Johannes Berg wrote:
>> On Thu, 2015-08-20 at 17:39 -0400, Marty Faltesek wrote:
>> >
>> > I believe the following race occurs:
>> >
>> > 1. mac80211 is processing a new station that just joined.
>> > 2. Before it's completed initializing the station, we take an
>> > interrupt,
>> > with a packet from the STA.
>> > 3. That interrupt is a management frame from the station which
>> > causes
>> > rate_control_rate_update to be called, which calls back into the
>> >
>> > ath10k
>> > with the not-fully initialized station.
>>
>> I'm a bit confused by the stack trace - is this IBSS and the stack
>> trace is just bad? I don't really see how we get from
>> sta_ps_transition
>> to sta_rc_update?
>>
>> > 4. It reaches ath10k_sta_rc_update before the interrupted thread
>> > had
>> > reached ath10k_sta_state()
>> > Therefore it has not yet initialized its workqueue. Still NULL.
>> > 5. When this NULL workqueue gets passed to queue_work, it passes
>> > the first check that its not in use because of the NULL struct,
>> > but fails the
>> > next check in __queue_work because a NULL structure makes the
>> > test
>> > for an empty list fail.
>> >
>> >
>> > Proposed patch is to not add the new station in the hash until
>> > after
>> > the
>> > driver initializes it. But I'm not clear what implications this
>> > has.
>> > Could this
>> > cause other problems?
>> >
>>
>> I think the issue Michał pointed out is valid - but we can probably
>> check sta->uploaded in the relevant places?
>>
>
> Come to think of it though - we might cache & delay the call so drivers
> actually get the updated values.
>
> johannes
It's not IBSS and the stack is messed up, I guess because of statics.
It's probably more likely this manually crafted stack:
__queue_work (fails BUG_ON(!list_empty(&work->entry));) NULL work
causes this check fail.
queue_work_on WORK_STRUCT_PENDING_BIT check passes cause NULL work.
queue_work
ieee80211_queue_work
ath10k_sta_rc_update
drv_sta_rc_update
rate_control_rate_update
ieee80211_rx_h_action
ieee80211_rx_handlers
ieee80211_invoke_rx_handlers
ieee80211_prepare_and_rx_handle
__ieee80211_rx_handle_packet
ieee80211_rx
On Fri, 2015-08-21 at 13:00 -0400, Marty Faltesek wrote:
> It's not IBSS and the stack is messed up, I guess because of statics.
> It's probably more likely this manually crafted stack:
>
> __queue_work (fails BUG_ON(!list_empty(&work->entry));) NULL work
> causes this check fail.
> queue_work_on WORK_STRUCT_PENDING_BIT check passes cause NULL work.
> queue_work
> ieee80211_queue_work
> ath10k_sta_rc_update
> drv_sta_rc_update
> rate_control_rate_update
> ieee80211_rx_h_action
> ieee80211_rx_handlers
> ieee80211_invoke_rx_handlers
> ieee80211_prepare_and_rx_handle
> __ieee80211_rx_handle_packet
> ieee80211_rx
Ok. Still, I think the best solution would be to check sta->uploaded?
johannes
Yes, that would address the issue in a simple way. Would this be
applicable to any of the other ops vectors? I will create a patch.
On Mon, Aug 24, 2015 at 8:38 AM, Johannes Berg
<[email protected]> wrote:
> On Fri, 2015-08-21 at 13:00 -0400, Marty Faltesek wrote:
>> It's not IBSS and the stack is messed up, I guess because of statics.
>> It's probably more likely this manually crafted stack:
>>
>> __queue_work (fails BUG_ON(!list_empty(&work->entry));) NULL work
>> causes this check fail.
>> queue_work_on WORK_STRUCT_PENDING_BIT check passes cause NULL work.
>> queue_work
>> ieee80211_queue_work
>> ath10k_sta_rc_update
>> drv_sta_rc_update
>> rate_control_rate_update
>> ieee80211_rx_h_action
>> ieee80211_rx_handlers
>> ieee80211_invoke_rx_handlers
>> ieee80211_prepare_and_rx_handle
>> __ieee80211_rx_handle_packet
>> ieee80211_rx
>
> Ok. Still, I think the best solution would be to check sta->uploaded?
>
> johannes