2019-09-11 10:07:18

by Linus Torvalds

[permalink] [raw]
Subject: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())

So I'm at LCA, reading email, using my laptop more than I normally do,
and with different networking than I normally do.

And I just had a 802.11 WARN_ON() trigger, followed by essentially a
dead machine due to some lock held (maybe rtnl_lock).

It's possible that the lock held thing happened before, and is the
_reason_ for the delay, I don't know. I had to reboot the machine, but
I gathered as much information as made sense and was obvious before I
did so. That's appended.

Linus

---
Selected dmesg from the WARN_ON_ONCE() and just before, followed by
info from "sysrq-w":

Previous resume looks normal:

PM: suspend exit
ath10k_pci 0000:02:00.0: UART prints enabled
ath10k_pci 0000:02:00.0: unsupported HTC service id: 1536
wlp2s0: authenticate with 54:ec:2f:05:70:2c
wlp2s0: send auth to 54:ec:2f:05:70:2c (try 1/3)
wlp2s0: send auth to 54:ec:2f:05:70:2c (try 2/3)
wlp2s0: send auth to 54:ec:2f:05:70:2c (try 3/3)
wlp2s0: authentication with 54:ec:2f:05:70:2c timed out
wlp2s0: authenticate with 54:ec:2f:05:70:28
wlp2s0: send auth to 54:ec:2f:05:70:28 (try 1/3)
wlp2s0: authenticated
wlp2s0: associate with 54:ec:2f:05:70:28 (try 1/3)
wlp2s0: RX AssocResp from 54:ec:2f:05:70:28 (capab=0x1431 status=0 aid=1)
wlp2s0: associated
wlp2s0: Limiting TX power to 20 (20 - 0) dBm as advertised by
54:ec:2f:05:70:28
ath: EEPROM regdomain: 0x826c
ath: EEPROM indicates we should expect a country code
ath: doing EEPROM country->regdmn map search
ath: country maps to regdmn code: 0x37
ath: Country alpha2 being used: PT
ath: Regpair used: 0x37
ath: regdomain 0x826c dynamically updated by country element
IPv6: ADDRCONF(NETDEV_CHANGE): wlp2s0: link becomes ready
wlp2s0: disconnect from AP 54:ec:2f:05:70:28 for new auth to
54:ec:2f:05:70:2c
wlp2s0: authenticate with 54:ec:2f:05:70:2c
wlp2s0: send auth to 54:ec:2f:05:70:2c (try 1/3)
wlp2s0: authenticated
wlp2s0: associate with 54:ec:2f:05:70:2c (try 1/3)
wlp2s0: RX ReassocResp from 54:ec:2f:05:70:2c (capab=0x1011 status=0 aid=2)
wlp2s0: associated
ath: EEPROM regdomain: 0x826c
ath: EEPROM indicates we should expect a country code
ath: doing EEPROM country->regdmn map search
ath: country maps to regdmn code: 0x37
ath: Country alpha2 being used: PT
ath: Regpair used: 0x37
ath: regdomain 0x826c dynamically updated by country element
wlp2s0: Limiting TX power to 23 (23 - 0) dBm as advertised by
54:ec:2f:05:70:2c

Another _almost_ successful suspend/resume:

PM: suspend entry (s2idle)
Filesystems sync: 0.028 seconds
Freezing user space processes ... (elapsed 0.126 seconds) done.
OOM killer disabled.
Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
printk: Suspending console(s) (use no_console_suspend to debug)
wlp2s0: deauthenticating from 54:ec:2f:05:70:2c by local choice
(Reason: 3=DEAUTH_LEAVING)
ath10k_pci 0000:02:00.0: UART prints enabled
ath10k_pci 0000:02:00.0: unsupported HTC service id: 1536
OOM killer enabled.
Restarting tasks ... done.
PM: suspend exit
ath10k_pci 0000:02:00.0: UART prints enabled
ath10k_pci 0000:02:00.0: unsupported HTC service id: 1536
wlp2s0: authenticate with 54:ec:2f:05:70:2c
wlp2s0: send auth to 54:ec:2f:05:70:2c (try 1/3)
wlp2s0: authenticated
wlp2s0: associate with 54:ec:2f:05:70:2c (try 1/3)
wlp2s0: RX AssocResp from 54:ec:2f:05:70:2c (capab=0x1011 status=0 aid=2)
wlp2s0: associated
wlp2s0: Limiting TX power to 23 (23 - 0) dBm as advertised by
54:ec:2f:05:70:2c
ath: EEPROM regdomain: 0x826c
ath: EEPROM indicates we should expect a country code
ath: doing EEPROM country->regdmn map search
ath: country maps to regdmn code: 0x37
ath: Country alpha2 being used: PT
ath: Regpair used: 0x37
ath: regdomain 0x826c dynamically updated by country element
IPv6: ADDRCONF(NETDEV_CHANGE): wlp2s0: link becomes ready

I say _almost_, because I don't see the "No TX power limit" for the
country lookup (yes, Portugal) this time?

But wait!

... then 10+ minutes later:

ath10k_pci 0000:02:00.0: wmi command 16387 timeout, restarting hardware
ath10k_pci 0000:02:00.0: failed to set 5g txpower 23: -11
ath10k_pci 0000:02:00.0: failed to setup tx power 23: -11
ath10k_pci 0000:02:00.0: failed to recalc tx power: -11
ath10k_pci 0000:02:00.0: failed to set inactivity time for vdev 0: -108
ath10k_pci 0000:02:00.0: failed to setup powersave: -108

That certainly looks like something did try to set a power limit, but
eventually failed.

Immediately after that:

wlp2s0: deauthenticating from 54:ec:2f:05:70:2c by local choice
(Reason: 3=DEAUTH_LEAVING)
ath10k_pci 0000:02:00.0: failed to read hi_board_data address: -16
ath10k_pci 0000:02:00.0: failed to receive initialized event from
target: 00000000
ath10k_pci 0000:02:00.0: failed to receive initialized event from
target: 00000000
ath10k_pci 0000:02:00.0: failed to wait for target init: -110
ieee80211 phy0: Hardware restart was requested
ath10k_pci 0000:02:00.0: failed to set inactivity time for vdev 0: -108
ath10k_pci 0000:02:00.0: failed to setup powersave: -108
ath10k_pci 0000:02:00.0: failed to set PS Mode 0 for vdev 0: -108
ath10k_pci 0000:02:00.0: failed to setup powersave: -108
ath10k_pci 0000:02:00.0: failed to setup ps on vdev 0: -108
ath10k_pci 0000:02:00.0: failed to flush transmit queue (skip 1
ar-state 2): 5000
ath10k_pci 0000:02:00.0: failed to delete peer 54:ec:2f:05:70:2c
for vdev 0: -108

and this then results in:

WARNING: CPU: 4 PID: 1246 at net/mac80211/sta_info.c:1057
__sta_info_destroy_part2+0x147/0x150 [mac80211]
Modules linked in: ccm fuse rfcomm xt_CHECKSUM xt_MASQUERADE tun
bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT
ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat
ip6table_nat ip6table_mangle ip6table_raw ip6table_security
iptable_nat nf_nat iptable_mangle iptable_raw iptable_security
nf_conntrack nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ip_set nfnetlink
ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep vfat fat
hid_multitouch iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi
snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core snd_hda_codec_realtek
snd_soc_core snd_hda_codec_generic snd_soc_acpi_intel_match
snd_soc_acpi snd_soc_skl_ipc snd_soc_sst_ipc snd_soc_sst_dsp
dell_laptop snd_hda_intel ledtrig_audio dell_wmi dell_smbios
x86_pkg_temp_thermal snd_hda_codec ath10k_pci intel_powerclamp
wmi_bmof dell_wmi_descriptor intel_wmi_thunderbolt snd_hwdep coretemp
ath10k_core snd_hda_core kvm_intel snd_seq ath snd_seq_device kvm
btusb btrtl
intel_rapl_msr btbcm btintel mac80211 irqbypass snd_pcm bluetooth
intel_cstate dcdbas intel_uncore intel_rapl_perf joydev snd_timer snd
ecdh_generic i2c_i801 cfg80211 ecc idma64 soundcore rtsx_pci_ms
processor_thermal_device mei_me memstick rfkill libarc4 intel_lpss_pci
ucsi_acpi intel_soc_dts_iosf mei typec_ucsi intel_lpss
intel_pch_thermal intel_rapl_common typec wmi int3403_thermal
int340x_thermal_zone intel_hid sparse_keymap int3400_thermal acpi_pad
acpi_thermal_rel dm_crypt i915 crct10dif_pclmul crc32_pclmul
crc32c_intel rtsx_pci_sdmmc mmc_core i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt nvme fb_sys_fops ghash_clmulni_intel
drm serio_raw nvme_core rtsx_pci i2c_hid pinctrl_cannonlake video
pinctrl_intel
CPU: 4 PID: 1246 Comm: NetworkManager Not tainted
5.3.0-rc8-00004-g56037cadf604-dirty #5
Hardware name: Dell Inc. XPS 13 9380/0KTW76, BIOS 1.5.0 06/03/2019
RIP: 0010:__sta_info_destroy_part2+0x147/0x150 [mac80211]
Code: bd 0c 01 00 00 00 0f 84 4e ff ff ff 45 31 c0 b9 01 00 00 00
48 89 ea 48 89 de 4c 89 e7 e8 e1 aa ff ff 85 c0 0f 84 30 ff ff ff <0f>
0b e9 29 ff ff ff 66 90 0f 1f 44 00 00 41 54 55 48 89 fd e8 40
RSP: 0018:ffff98fc00daba90 EFLAGS: 00010282
RAX: 00000000ffffff94 RBX: ffff8e1355ad2880 RCX: 0000000000000000
RDX: ffff8e1355901ec0 RSI: 0000000000000000 RDI: ffff8e1357902e38
RBP: ffff8e135bb08000 R08: 0000000000000000 R09: 0000000000000574
R10: 000000000001cc5c R11: 0000000000000000 R12: ffff8e13579007a0
R13: ffff8e1355ad2880 R14: 0000000000000001 R15: ffff8e1357900d58
FS: 00007fbd99372bc0(0000) GS:ffff8e135e500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00001cdbd7408000 CR3: 00000004708a2001 CR4: 00000000003606e0
Call Trace:
__sta_info_flush+0x125/0x170 [mac80211]
ieee80211_set_disassoc+0xc2/0x590 [mac80211]
ieee80211_mgd_deauth.cold+0x4a/0x1b8 [mac80211]
cfg80211_mlme_deauth+0xb3/0x1d0 [cfg80211]
? startup_64+0x3/0x30
cfg80211_mlme_down+0x66/0x90 [cfg80211]
cfg80211_disconnect+0x129/0x1e0 [cfg80211]
cfg80211_leave+0x27/0x40 [cfg80211]
cfg80211_netdev_notifier_call+0x1a7/0x4e0 [cfg80211]
? ieee80211_iter_chan_contexts_atomic+0x3e/0x50 [mac80211]
? ath10k_monitor_recalc+0x38f/0x5d0 [ath10k_core]
? __schedule+0x143/0x660
? ath10k_config_ps+0x4e/0x70 [ath10k_core]
? inetdev_event+0x46/0x560
notifier_call_chain+0x4c/0x70
__dev_close_many+0x57/0x100
dev_close_many+0x8d/0x140
dev_close.part.0+0x44/0x70
cfg80211_shutdown_all_interfaces+0x71/0xd0 [cfg80211]
cfg80211_rfkill_set_block+0x22/0x30 [cfg80211]
rfkill_set_block+0x92/0x140 [rfkill]
rfkill_fop_write+0x11f/0x1c0 [rfkill]
vfs_write+0xb6/0x1a0
ksys_write+0xa7/0xe0
do_syscall_64+0x59/0x1c0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fbd9a36c92f
Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 19 fd ff ff 48 8b
54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48>
3d 00 f0 ff ff 77 2d 44 89 c7 48 89 44 24 08 e8 4c fd ff ff 48
RSP: 002b:00007fff9a1ee220 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000055f18ef5eaad RCX: 00007fbd9a36c92f
RDX: 0000000000000008 RSI: 00007fff9a1ee258 RDI: 0000000000000019
RBP: 0000000000000019 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000293 R12: 000055f190502090
R13: 0000000000000000 R14: 0000000000000000 R15: 000055f1904ffbb0
---[ end trace 4d7c87877e20badd ]---
ath10k_pci 0000:02:00.0: failed to recalculate rts/cts prot for vdev 0: -108
ath10k_pci 0000:02:00.0: failed to set cts protection for vdev 0: -108

and it looks like it leaves some lock held.

Because after this, nothing network-related stops working, and the
machine eventually dies because workqueues don't make any progress (due
to rtnl_lock, I suspect):

Workqueue: events_freezable ieee80211_restart_work [mac80211]
Call Trace:
schedule+0x39/0xa0
schedule_preempt_disabled+0xa/0x10
__mutex_lock.isra.0+0x263/0x4b0
ieee80211_restart_work+0x54/0xe0 [mac80211]
process_one_work+0x1cf/0x370
worker_thread+0x4a/0x3c0
kthread+0xfb/0x130
ret_from_fork+0x35/0x40

Workqueue: events disconnect_work [cfg80211]
Call Trace:
schedule+0x39/0xa0
schedule_preempt_disabled+0xa/0x10
__mutex_lock.isra.0+0x263/0x4b0
disconnect_work+0x12/0xd0 [cfg80211]
process_one_work+0x1cf/0x370
worker_thread+0x4a/0x3c0
kthread+0xfb/0x130
ret_from_fork+0x35/0x40
kworker/3:0 D 0 6313 2 0x80004000


Workqueue: events linkwatch_event
Call Trace:
schedule+0x39/0xa0
schedule_preempt_disabled+0xa/0x10
__mutex_lock.isra.0+0x263/0x4b0
linkwatch_event+0xa/0x30
process_one_work+0x1cf/0x370
worker_thread+0x4a/0x3c0
kthread+0xfb/0x130
ret_from_fork+0x35/0x40

ping D 0 6619 6568 0x80000002
Call Trace:
schedule+0x39/0xa0
schedule_preempt_disabled+0xa/0x10
__mutex_lock.isra.0+0x263/0x4b0
ip6mr_sk_done+0x2e/0xd0
rawv6_close+0x19/0x30
inet_release+0x34/0x60
__sock_release+0x3d/0xa0
sock_close+0x11/0x20
__fput+0xc1/0x250
task_work_run+0x81/0xa0
do_exit+0x2e5/0xaa0
do_group_exit+0x3a/0x90
__x64_sys_exit_group+0x14/0x20
do_syscall_64+0x59/0x1c0
entry_SYSCALL_64_after_hwframe+0x44/0xa9

gdbus D 0 6629 1949 0x80004000
Call Trace:
schedule+0x39/0xa0
do_exit+0x1da/0xaa0
do_group_exit+0x3a/0x90
get_signal+0x14f/0x830
do_signal+0x36/0x620
exit_to_usermode_loop+0x6f/0xf0
do_syscall_64+0x17d/0x1c0
entry_SYSCALL_64_after_hwframe+0x44/0xa9


2019-09-11 10:29:07

by Johannes Berg

[permalink] [raw]
Subject: Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())

Hi,

> So I'm at LCA

When did LCA move to Portugal? ;-))

> , reading email, using my laptop more than I normally do,
> and with different networking than I normally do.
>
> And I just had a 802.11 WARN_ON() trigger, followed by essentially a
> dead machine due to some lock held (maybe rtnl_lock).

yes, it's definitely stuck on the RTNL, all the lot of the workqueues
are trying to acquire it (linkwatch, ipv6, but oddly enough even the
mac80211 restart work).

> It's possible that the lock held thing happened before, and is the
> _reason_ for the delay, I don't know.

No, we do have the lock in the WARN_ON(), somewhere around
dev_close_many() it is acquired.

> Previous resume looks normal:
> [snip]
> wlp2s0: Limiting TX power to 23 (23 - 0) dBm as advertised by
> 54:ec:2f:05:70:2c

Is that the message you meant?

> Another _almost_ successful suspend/resume:
> [snip
> wlp2s0: RX AssocResp from 54:ec:2f:05:70:2c (capab=0x1011 status=0 aid=2)
> wlp2s0: associated
> wlp2s0: Limiting TX power to 23 (23 - 0) dBm as advertised by
> 54:ec:2f:05:70:2c
> ath: EEPROM regdomain: 0x826c
> ath: EEPROM indicates we should expect a country code
> ath: doing EEPROM country->regdmn map search
> ath: country maps to regdmn code: 0x37
> ath: Country alpha2 being used: PT
> ath: Regpair used: 0x37
> ath: regdomain 0x826c dynamically updated by country element
> IPv6: ADDRCONF(NETDEV_CHANGE): wlp2s0: link becomes ready
>
> I say _almost_, because I don't see the "No TX power limit" for the
> country lookup (yes, Portugal) this time?

because here you had it too, just a bit earlier. It usually comes when a
beacon is received the first time, which depends on the AP timing.

> ath10k_pci 0000:02:00.0: wmi command 16387 timeout, restarting hardware
> ath10k_pci 0000:02:00.0: failed to set 5g txpower 23: -11
> ath10k_pci 0000:02:00.0: failed to setup tx power 23: -11
> ath10k_pci 0000:02:00.0: failed to recalc tx power: -11
> ath10k_pci 0000:02:00.0: failed to set inactivity time for vdev 0: -108
> ath10k_pci 0000:02:00.0: failed to setup powersave: -108
>
> That certainly looks like something did try to set a power limit, but
> eventually failed.

Yeah, that does seem a bit fishy. Kalle would have to comment for
ath10k.

> Immediately after that:
>
> wlp2s0: deauthenticating from 54:ec:2f:05:70:2c by local choice
> (Reason: 3=DEAUTH_LEAVING)

I don't _think_ any of the above would be a reason to disconnect, but it
clearly looks like the device got stuck at this point, since everything
just fails afterwards.

> ath10k_pci 0000:02:00.0: failed to delete peer 54:ec:2f:05:70:2c
> for vdev 0: -108
>
> and this then results in:
>
> WARNING: CPU: 4 PID: 1246 at net/mac80211/sta_info.c:1057
> __sta_info_destroy_part2+0x147/0x150 [mac80211]

Not really a surprise. Perhaps we shouldn't even WARN_ON() this, if the
driver is stuck completely and returning errors to everything that
doesn't help so much.

Then again, the stack trace was helpful this time:

> ieee80211_set_disassoc+0xc2/0x590 [mac80211]
> ieee80211_mgd_deauth.cold+0x4a/0x1b8 [mac80211]
> cfg80211_mlme_deauth+0xb3/0x1d0 [cfg80211]
> cfg80211_mlme_down+0x66/0x90 [cfg80211]
> cfg80211_disconnect+0x129/0x1e0 [cfg80211]
> cfg80211_leave+0x27/0x40 [cfg80211]
> cfg80211_netdev_notifier_call+0x1a7/0x4e0 [cfg80211]
> notifier_call_chain+0x4c/0x70
> __dev_close_many+0x57/0x100
> dev_close_many+0x8d/0x140
> dev_close.part.0+0x44/0x70
> cfg80211_shutdown_all_interfaces+0x71/0xd0 [cfg80211]
> cfg80211_rfkill_set_block+0x22/0x30 [cfg80211]
> rfkill_set_block+0x92/0x140 [rfkill]
> rfkill_fop_write+0x11f/0x1c0 [rfkill]
> vfs_write+0xb6/0x1a0


Since we see that something actually did an rfkill operation. Did you
push a button there?

Like I said above, the fact that we get into notifier_call_chain() from
rfkill_set_block() means that we acquired the RTNL here somewhere
between these.

> ath10k_pci 0000:02:00.0: failed to recalculate rts/cts prot for vdev 0: -108
> ath10k_pci 0000:02:00.0: failed to set cts protection for vdev 0: -108
>
> and it looks like it leaves some lock held.

Yeah, the RTNL.

You don't happen to have timing information on these logs, perhaps
recorded in the logfile/journal?

It seems odd to me, since the RTNL is acquired by
cfg80211_rfkill_set_block() and that doesn't even have an error path, it
just does
rtnl_lock();
cfg80211_shutdown_all_interfaces(&rdev->wiphy);
rtnl_unlock();

The only explanation I therefore have is that something is just taking
*forever* in that code path, hence my question about timing information
on the logs.

Looks like indeed the driver gives the device at least *3 seconds* for
every command, see ath10k_wmi_cmd_send(), so most likely this would
eventually have finished, but who knows how many firmware commands it
would still have attempted to send...

Perhaps the driver should mark the device as dead and fail quickly once
it timed out once, or so, but I'll let Kalle comment on that.

johannes

2019-09-11 11:59:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())

On Wed, Sep 11, 2019 at 11:26 AM Johannes Berg
<[email protected]> wrote:
>
> Hi,
>
> > So I'm at LCA
>
> When did LCA move to Portugal? ;-))

Heh. I may be jetlagged and not thinking straight. LCA/LPC ;)

> > Previous resume looks normal:
> > [snip]
> > wlp2s0: Limiting TX power to 23 (23 - 0) dBm as advertised by
> > 54:ec:2f:05:70:2c
>
> Is that the message you meant?
>
> > I say _almost_, because I don't see the "No TX power limit" for the
> > country lookup (yes, Portugal) this time?
>
> because here you had it too, just a bit earlier. It usually comes when a
> beacon is received the first time, which depends on the AP timing.

Duh. I'm blind and didn't notice, because I was expecting it in the same order.

And I didn't think about it or double-check, because the errors that
then followed later _looked_ like that TX power failing that I thought
hadn't happened.

> I don't _think_ any of the above would be a reason to disconnect, but it
> clearly looks like the device got stuck at this point, since everything
> just fails afterwards.

Yeah, maybe the power stuff was just another effect of things getting
stuck, rather than the reason for it getting stuck.

So I probably mis-attributed the cause.

> > WARNING: CPU: 4 PID: 1246 at net/mac80211/sta_info.c:1057
> > __sta_info_destroy_part2+0x147/0x150 [mac80211]
>
> Not really a surprise. Perhaps we shouldn't even WARN_ON() this, if the
> driver is stuck completely and returning errors to everything that
> doesn't help so much.
>
> Then again, the stack trace was helpful this time:
>
> > ieee80211_set_disassoc+0xc2/0x590 [mac80211]
> > ieee80211_mgd_deauth.cold+0x4a/0x1b8 [mac80211]
> > cfg80211_mlme_deauth+0xb3/0x1d0 [cfg80211]
> > cfg80211_mlme_down+0x66/0x90 [cfg80211]
> > cfg80211_disconnect+0x129/0x1e0 [cfg80211]
> > cfg80211_leave+0x27/0x40 [cfg80211]
> > cfg80211_netdev_notifier_call+0x1a7/0x4e0 [cfg80211]
> > notifier_call_chain+0x4c/0x70
> > __dev_close_many+0x57/0x100
> > dev_close_many+0x8d/0x140
> > dev_close.part.0+0x44/0x70
> > cfg80211_shutdown_all_interfaces+0x71/0xd0 [cfg80211]
> > cfg80211_rfkill_set_block+0x22/0x30 [cfg80211]
> > rfkill_set_block+0x92/0x140 [rfkill]
> > rfkill_fop_write+0x11f/0x1c0 [rfkill]
> > vfs_write+0xb6/0x1a0
>
>
> Since we see that something actually did an rfkill operation. Did you
> push a button there?

No, I tried to turn off and turn on Wifi manually (no button, just the
settings panel).

I didn't notice the WARN_ON(), I just noticed that there was no
networking, and "turn it off and on again" is obviously the first
thing to try ;)

> You don't happen to have timing information on these logs, perhaps
> recorded in the logfile/journal?

Sure. I cleaned up the logs to not spam people with lots of illegible
data, but it's all in the journal log.

Rough timeline:

Sep 11 03:40:00 xps13 kernel: PM: suspend entry (s2idle)
Sep 11 03:40:00 xps13 kernel: Filesystems sync: 0.028 seconds
...
Sep 11 10:13:14 xps13 kernel: Restarting tasks ... done.
Sep 11 10:13:14 xps13 kernel: PM: suspend exit
Sep 11 10:13:14 xps13 kernel: ath10k_pci 0000:02:00.0: UART prints enabled
Sep 11 10:13:14 xps13 kernel: ath10k_pci 0000:02:00.0: unsupported HTC
service id: 1536
Sep 11 10:13:23 xps13 kernel: wlp2s0: authenticate with 54:ec:2f:05:70:2c
Sep 11 10:13:23 xps13 kernel: wlp2s0: send auth to 54:ec:2f:05:70:2c (try 1/3)
Sep 11 10:13:23 xps13 kernel: wlp2s0: authenticated
Sep 11 10:13:23 xps13 kernel: wlp2s0: Limiting TX power to 23 (23 - 0)
dBm as advertised by 54:ec:2f:05:70:2c
...
Sep 11 10:13:23 xps13 kernel: ath: regdomain 0x826c dynamically
updated by country element
Sep 11 10:13:24 xps13 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): wlp2s0:
link becomes ready
Sep 11 10:27:07 xps13 kernel: ath10k_pci 0000:02:00.0: wmi command
16387 timeout, restarting hardware
...
Sep 11 10:27:07 xps13 kernel: ath10k_pci 0000:02:00.0: failed to read
hi_board_data address: -16
Sep 11 10:27:10 xps13 kernel: ath10k_pci 0000:02:00.0: failed to
receive initialized event from target: 00000000
Sep 11 10:27:13 xps13 kernel: ath10k_pci 0000:02:00.0: failed to
receive initialized event from target: 00000000
...
Sep 11 10:27:13 xps13 kernel: WARNING: CPU: 4 PID: 1246 at
net/mac80211/sta_info.c:1057 __sta_info_destroy_part2+0x147/0x150
[mac80211]

but if you want full logs I can send them in private to you.

I do suspect it's atheros and suspend/resume or something. The
wireless clearly worked for a while after the resume, but then at some
point it stopped.

> It seems odd to me, since the RTNL is acquired by
> cfg80211_rfkill_set_block() and that doesn't even have an error path, it
> just does
> rtnl_lock();
> cfg80211_shutdown_all_interfaces(&rdev->wiphy);
> rtnl_unlock();
>
> The only explanation I therefore have is that something is just taking
> *forever* in that code path, hence my question about timing information
> on the logs.

Yeah, maybe it would time out everything eventually. But not for a
long time. It hadn't cleared up by

Sep 11 10:36:21 xps13 gnome-session-f[6837]: gnome-session-failed:
Fatal IO error 0 (Success) on X server :0.

which is when I shut down the machine (and had to then force a hard
power-off because the shutdown wanted things that needed the rtnl_lock
to go away)

Linus

2019-09-11 12:08:09

by Johannes Berg

[permalink] [raw]
Subject: Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())

On Wed, 2019-09-11 at 12:58 +0100, Linus Torvalds wrote:
>
> And I didn't think about it or double-check, because the errors that
> then followed later _looked_ like that TX power failing that I thought
> hadn't happened.

Yeah, it could be something already got stuck there, hard to say.

> > Since we see that something actually did an rfkill operation. Did you
> > push a button there?
>
> No, I tried to turn off and turn on Wifi manually (no button, just the
> settings panel).

That does usually also cause rfkill, so that explains how we got down
this particular code path.

> I didn't notice the WARN_ON(), I just noticed that there was no
> networking, and "turn it off and on again" is obviously the first
> thing to try ;)

:-)

> Sep 11 10:27:13 xps13 kernel: WARNING: CPU: 4 PID: 1246 at
> net/mac80211/sta_info.c:1057 __sta_info_destroy_part2+0x147/0x150
> [mac80211]
>
> but if you want full logs I can send them in private to you.

No, it's fine, though maybe Kalle does - he was stepping out for a while
but said he'd look later.

This is the interesting time - 10:27:13 we get one of the first
failures. Really the first one was this:

> Sep 11 10:27:07 xps13 kernel: ath10k_pci 0000:02:00.0: wmi command 16387 timeout, restarting hardware


> I do suspect it's atheros and suspend/resume or something. The
> wireless clearly worked for a while after the resume, but then at some
> point it stopped.

I'm not really sure it's related to suspend/resume at all, the firmware
seems to just have gotten stuck, and the device and firmware most likely
got reset over the suspend/resume anyway.

> > The only explanation I therefore have is that something is just taking
> > *forever* in that code path, hence my question about timing information
> > on the logs.
>
> Yeah, maybe it would time out everything eventually. But not for a
> long time. It hadn't cleared up by
>
> Sep 11 10:36:21 xps13 gnome-session-f[6837]: gnome-session-failed:
> Fatal IO error 0 (Success) on X server :0.

Ok, that's way longer than I would have guessed even! That's over 9
minutes, that'd be close to 200 commands having to be issued and timing
out ...

I don't know. What I wrote before is basically all I can say, I think
the driver gets stuck somewhere waiting for the device "forever", and
the stack just doesn't get to release the lock, causing all the follow-
up problems.

johannes

2019-09-11 13:05:14

by Ben Greear

[permalink] [raw]
Subject: Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())



On 09/11/2019 05:04 AM, Johannes Berg wrote:
> On Wed, 2019-09-11 at 12:58 +0100, Linus Torvalds wrote:
>>
>> And I didn't think about it or double-check, because the errors that
>> then followed later _looked_ like that TX power failing that I thought
>> hadn't happened.
>
> Yeah, it could be something already got stuck there, hard to say.
>
>>> Since we see that something actually did an rfkill operation. Did you
>>> push a button there?
>>
>> No, I tried to turn off and turn on Wifi manually (no button, just the
>> settings panel).
>
> That does usually also cause rfkill, so that explains how we got down
> this particular code path.
>
>> I didn't notice the WARN_ON(), I just noticed that there was no
>> networking, and "turn it off and on again" is obviously the first
>> thing to try ;)
>
> :-)
>
>> Sep 11 10:27:13 xps13 kernel: WARNING: CPU: 4 PID: 1246 at
>> net/mac80211/sta_info.c:1057 __sta_info_destroy_part2+0x147/0x150
>> [mac80211]
>>
>> but if you want full logs I can send them in private to you.
>
> No, it's fine, though maybe Kalle does - he was stepping out for a while
> but said he'd look later.
>
> This is the interesting time - 10:27:13 we get one of the first
> failures. Really the first one was this:
>
>> Sep 11 10:27:07 xps13 kernel: ath10k_pci 0000:02:00.0: wmi command 16387 timeout, restarting hardware
>
>
>> I do suspect it's atheros and suspend/resume or something. The
>> wireless clearly worked for a while after the resume, but then at some
>> point it stopped.
>
> I'm not really sure it's related to suspend/resume at all, the firmware
> seems to just have gotten stuck, and the device and firmware most likely
> got reset over the suspend/resume anyway.
>
>>> The only explanation I therefore have is that something is just taking
>>> *forever* in that code path, hence my question about timing information
>>> on the logs.
>>
>> Yeah, maybe it would time out everything eventually. But not for a
>> long time. It hadn't cleared up by
>>
>> Sep 11 10:36:21 xps13 gnome-session-f[6837]: gnome-session-failed:
>> Fatal IO error 0 (Success) on X server :0.
>
> Ok, that's way longer than I would have guessed even! That's over 9
> minutes, that'd be close to 200 commands having to be issued and timing
> out ...
>
> I don't know. What I wrote before is basically all I can say, I think
> the driver gets stuck somewhere waiting for the device "forever", and
> the stack just doesn't get to release the lock, causing all the follow-
> up problems.

It looks to me like the ath10k firmware is not responding to commands and/or
is out of its WMI tx credits. The code often takes a lock and then blocks for up to 3
or so seconds waiting for a response from the firmware, and the mac80211 calling
code is often already holding rtnl. Pretty much every mac80211 call will cause a
WMI message and thus potentially hit this timeout.

This can easily cause rtnl to be held for 3 seconds, but after that, I believe
upstream ath10k will now time out and kill the firmware and restart. (I run
a significantly modified ath10k driver, and that is how mine works, at least.)

In this case, it looks like restarting the firmware/NIC failed, and I guess
that must get it in a state where it is still blocking and trying to talk
to the firmware? Or maybe deadlock down inside ath10k driver.

For what it's worth, we see that WARN_ON often when ath10k firmware crashes, but it
seems to not be a big deal and the system normally recovers fine.

Out of curiosity, I'm interested to know what ath10k NIC chipset this is from.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2019-09-11 13:22:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())

On Wed, Sep 11, 2019 at 2:03 PM Ben Greear <[email protected]> wrote:
>
> Out of curiosity, I'm interested to know what ath10k NIC chipset this is from.

It's a Dell XPS 13 9380, with

02:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac
Wireless Network Adapter (rev 32)
Subsystem: Bigfoot Networks, Inc. Killer 1435 Wireless-AC

(numeric PCI ID 168c:003e, subsystem 1a56:143a).

The ath10k driver says

qca6174 hw3.2 target 0x05030000 chip_id 0x00340aff sub 1a56:143a
firmware ver WLAN.RM.4.4.1-00140-QCARMSWPZ-1 api 6 features
wowlan,ignore-otp,mfp crc32 29eb8ca1
board_file api 2 bmi_id N/A crc32 4ed3569e

if that tells you anything more.

Linus

2019-09-11 13:33:17

by Ben Greear

[permalink] [raw]
Subject: Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())



On 09/11/2019 06:21 AM, Linus Torvalds wrote:
> On Wed, Sep 11, 2019 at 2:03 PM Ben Greear <[email protected]> wrote:
>>
>> Out of curiosity, I'm interested to know what ath10k NIC chipset this is from.
>
> It's a Dell XPS 13 9380, with
>
> 02:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac
> Wireless Network Adapter (rev 32)
> Subsystem: Bigfoot Networks, Inc. Killer 1435 Wireless-AC
>
> (numeric PCI ID 168c:003e, subsystem 1a56:143a).
>
> The ath10k driver says
>
> qca6174 hw3.2 target 0x05030000 chip_id 0x00340aff sub 1a56:143a
> firmware ver WLAN.RM.4.4.1-00140-QCARMSWPZ-1 api 6 features
> wowlan,ignore-otp,mfp crc32 29eb8ca1
> board_file api 2 bmi_id N/A crc32 4ed3569e
>
> if that tells you anything more.

That means it is something I have never used nor have firmware for, but
the WMI logic should be similar to what I described and have experienced
with other chips.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2019-09-11 13:35:28

by Kalle Valo

[permalink] [raw]
Subject: Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())

Johannes Berg <[email protected]> writes:

>> Sep 11 10:27:13 xps13 kernel: WARNING: CPU: 4 PID: 1246 at
>> net/mac80211/sta_info.c:1057 __sta_info_destroy_part2+0x147/0x150
>> [mac80211]
>>
>> but if you want full logs I can send them in private to you.
>
> No, it's fine, though maybe Kalle does - he was stepping out for a while
> but said he'd look later.

Linus, it would help if you could send me full logs with timestamps.
Also if you can, please grep your logs to see if these wmi timeouts have
happened before.

--
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

2019-09-11 18:21:18

by Kalle Valo

[permalink] [raw]
Subject: Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())

Johannes Berg <[email protected]> writes:

>> ath10k_pci 0000:02:00.0: wmi command 16387 timeout, restarting hardware
>> ath10k_pci 0000:02:00.0: failed to set 5g txpower 23: -11
>> ath10k_pci 0000:02:00.0: failed to setup tx power 23: -11
>> ath10k_pci 0000:02:00.0: failed to recalc tx power: -11
>> ath10k_pci 0000:02:00.0: failed to set inactivity time for vdev 0: -108
>> ath10k_pci 0000:02:00.0: failed to setup powersave: -108
>>
>> That certainly looks like something did try to set a power limit, but
>> eventually failed.
>
> Yeah, that does seem a bit fishy. Kalle would have to comment for
> ath10k.
>
>> Immediately after that:
>>
>> wlp2s0: deauthenticating from 54:ec:2f:05:70:2c by local choice
>> (Reason: 3=DEAUTH_LEAVING)
>
> I don't _think_ any of the above would be a reason to disconnect, but it
> clearly looks like the device got stuck at this point, since everything
> just fails afterwards.

Yeah, to me it looks anything ath10k tries to do with the devie fails,
even resetting the device.

> Looks like indeed the driver gives the device at least *3 seconds* for
> every command, see ath10k_wmi_cmd_send(), so most likely this would
> eventually have finished, but who knows how many firmware commands it
> would still have attempted to send...

3 seconds is a bit short but in normal cases it should be enough. Of
course we could increase the delay but I'm skeptic it would help here.

> Perhaps the driver should mark the device as dead and fail quickly once
> it timed out once, or so, but I'll let Kalle comment on that.

Actually we do try to restart the device when a timeout happens in
ath10k_wmi_cmd_send():

if (ret == -EAGAIN) {
ath10k_warn(ar, "wmi command %d timeout, restarting hardware\n",
cmd_id);
queue_work(ar->workqueue, &ar->restart_work);
}


--
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

2019-09-11 18:50:06

by Kalle Valo

[permalink] [raw]
Subject: Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())

Johannes Berg <[email protected]> writes:

> On Wed, 2019-09-11 at 21:19 +0300, Kalle Valo wrote:
>> > Looks like indeed the driver gives the device at least *3 seconds* for
>> > every command, see ath10k_wmi_cmd_send(), so most likely this would
>> > eventually have finished, but who knows how many firmware commands it
>> > would still have attempted to send...
>>
>> 3 seconds is a bit short but in normal cases it should be enough. Of
>> course we could increase the delay but I'm skeptic it would help here.
>
> I was thinking 3 seconds is way too long :-)

Heh :)

>> > Perhaps the driver should mark the device as dead and fail quickly once
>> > it timed out once, or so, but I'll let Kalle comment on that.
>>
>> Actually we do try to restart the device when a timeout happens in
>> ath10k_wmi_cmd_send():
>>
>> if (ret == -EAGAIN) {
>> ath10k_warn(ar, "wmi command %d timeout, restarting hardware\n",
>> cmd_id);
>> queue_work(ar->workqueue, &ar->restart_work);
>> }
>
> Yeah, and this is the problem, in a sense, I'd think. It seems to me
> that at this point the code needs to tag the device as "dead" and
> immediately return from any further commands submitted to it with an
> error (e.g. -EIO).

Yeah, ath10k_core_restart() is supposed change to state
ATH10K_STATE_RESTARTING but very few mac80211 ops in ath10k_ops are
checking for it, and to me it looks like quite late even. I think a
proper fix for ops which can sleep is to check ar->state is
ATH10K_STATE_ON and for ops which cannot sleep check
ATH10K_FLAG_CRASH_FLUSH.

But of course this just fixes the symptoms, the root cause for timeouts
needs to be found as well.

--
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches