2009-11-18 18:27:17

by Maxim Levitsky

[permalink] [raw]
Subject: mac80211 breaks suspend to disk

Hi,

I finally managed to track down a rare hang on resume from disk I see
for some prolonged time.
Symptoms are that sometimes on resume, system almost hangs.
This is I can switch VTs and execute SysRQ sequences.
This is typical when oops happens.

I tracked this to BUG_ON in add_timer.
I replaced this with WARN_ON, and here is the backtrace:


331.334935] WARNING: at /home/maxim/software/kernel/linux-2.6/kernel/timer.c:791 add_timer+0x36/0x40()
[ 331.347374] Hardware name: Aspire 5720
[ 331.359725] Modules linked in: nvidia(P) af_packet usb_storage usb_libusual cpufreq_powersave iwl3945 nfsd snd_hda_codec_realtek cpufreq_conservative exportfs iwlcore cpufreq_userspace uvcvideo snd_hda_intel acpi_cpufreq nfs videodev mac80211 coretemp lockd snd_hda_codec joydev v4l1_compat tg3 nfs_acl ohci1394 v4l2_compat_ioctl32 snd_hwdep uhci_hcd sbp2 cfg80211 psmouse video ehci_hcd auth_rpcgss libphy lirc_ene0100 ieee1394 output snd_pcm usbcore serio_raw rfkill evdev snd_page_alloc sunrpc fuse lzo lzo_decompress lzo_compress
[ 331.428619] Pid: 4394, comm: pm-hibernate Tainted: P 2.6.32-rc7-wl #183
[ 331.442747] Call Trace:
[ 331.456760] [<ffffffff81043a98>] warn_slowpath_common+0x78/0xb0
[ 331.471119] [<ffffffff81043adf>] warn_slowpath_null+0xf/0x20
[ 331.485475] [<ffffffff8104f996>] add_timer+0x36/0x40
[ 331.499814] [<ffffffffa01fb34c>] ieee80211_sta_restart+0x4c/0x50 [mac80211]
[ 331.514314] [<ffffffffa020c86c>] ieee80211_reconfig+0x36c/0x420 [mac80211]
[ 331.528678] [<ffffffffa020c65c>] ? ieee80211_reconfig+0x15c/0x420 [mac80211]
[ 331.543094] [<ffffffffa0202fa5>] ieee80211_resume+0x15/0x20 [mac80211]
[ 331.557525] [<ffffffffa01527a5>] wiphy_resume+0x75/0x90 [cfg80211]
[ 331.572001] [<ffffffff81273fee>] dpm_resume_end+0x47e/0x4b0
[ 331.586325] [<ffffffff8107d5e9>] hibernation_snapshot+0xc9/0x280
[ 331.600377] [<ffffffff8107d88d>] hibernate+0xed/0x1f0
[ 331.613980] [<ffffffff8107c00c>] state_store+0xec/0x100
[ 331.627577] [<ffffffff811e05f7>] kobj_attr_store+0x17/0x20
[ 331.641242] [<ffffffff811345a4>] sysfs_write_file+0xd4/0x150
[ 331.655056] [<ffffffff810d02f8>] vfs_write+0xb8/0x1a0
[ 331.668718] [<ffffffff810d04bc>] sys_write+0x4c/0x80
[ 331.682153] [<ffffffff8100beeb>] system_call_fastpath+0x16/0x1b
[ 331.695613] ---[ end trace 85c754c80d7debe8 ]---
[ 331.709557] PM: Image restored successfully.

Now, how the ifmgd->timer could still be pending I have no idea yet, it seems to be explictly
del_timer_sync'ed and timer routine also checks this condition, and doesn't restart itself....

Also note that this was seen happening both on resume, and when the suspend image is written (sort of resume too).

Best regards,
Maxim Levitsky



2009-11-18 21:19:29

by Maxim Levitsky

[permalink] [raw]
Subject: Re: mac80211 breaks suspend to disk

On Wed, 2009-11-18 at 22:03 +0100, Johannes Berg wrote:
> On Wed, 2009-11-18 at 22:39 +0200, Maxim Levitsky wrote:
>
> > I forgot to pay attention to few lines before the oops message:
> >
> >
> > [ 331.307139] No probe response from AP 00:1b:9e:d8:77:02 after 500ms, try 1
> > [ 331.320121] ------------[ cut here ]------------
> > [ 331.334935] WARNING: at /home/maxim/software/kernel/linux-2.6/kernel/timer.c:791 add_timer+0x36/0x40()
> > [ 331.347374] Hardware name: Aspire 5720
> > .....
>
> Thanks.
>
> > This reveals that somehow the ieee80211_sta_work manages to run before
> > ieee80211_sta_restart and it sets the timer....
>
> Indeed.
>
> > looking at ieee80211_reconfig it seems suspicious that
> > 'local->suspended = false;' is set so early.
> > In fact it is set again to false just prior to
> >
> > list_for_each_entry(sdata, &local->interfaces, list) {....
> >
> > So, I would suspect some lines in this function trigger the work
> > queue.
>
> No, I doubt it, it's most likely just receiving a frame to trigger the
> function -- ieee80211_sta_rx_mgmt queues the work. If we were still
> suspended then we'd run into an error situation, like the comment in
> ieee80211_reconfig explains ...
>
> Catch-22.
>
> Then again, we drop RX frames while local->suspended is true, and I
> guess we don't really care about missing a few more frames when we're
> just waking up, so it should be OK to keep suspended == true while doing
> the startup.
>
> EXCEPT drivers are going to start using ieee80211_queue_work() already
> then from the callbacks, which we must allow.
>
> So ... maybe this?


This will likely fix this problem.
I will run my suspend/resume loop for a while and let you know if it
works.

Best regards,
Maxim Levitsky




2009-11-18 21:03:45

by Johannes Berg

[permalink] [raw]
Subject: Re: mac80211 breaks suspend to disk

On Wed, 2009-11-18 at 22:39 +0200, Maxim Levitsky wrote:

> I forgot to pay attention to few lines before the oops message:
>
>
> [ 331.307139] No probe response from AP 00:1b:9e:d8:77:02 after 500ms, try 1
> [ 331.320121] ------------[ cut here ]------------
> [ 331.334935] WARNING: at /home/maxim/software/kernel/linux-2.6/kernel/timer.c:791 add_timer+0x36/0x40()
> [ 331.347374] Hardware name: Aspire 5720
> .....

Thanks.

> This reveals that somehow the ieee80211_sta_work manages to run before
> ieee80211_sta_restart and it sets the timer....

Indeed.

> looking at ieee80211_reconfig it seems suspicious that
> 'local->suspended = false;' is set so early.
> In fact it is set again to false just prior to
>
> list_for_each_entry(sdata, &local->interfaces, list) {....
>
> So, I would suspect some lines in this function trigger the work
> queue.

No, I doubt it, it's most likely just receiving a frame to trigger the
function -- ieee80211_sta_rx_mgmt queues the work. If we were still
suspended then we'd run into an error situation, like the comment in
ieee80211_reconfig explains ...

Catch-22.

Then again, we drop RX frames while local->suspended is true, and I
guess we don't really care about missing a few more frames when we're
just waking up, so it should be OK to keep suspended == true while doing
the startup.

EXCEPT drivers are going to start using ieee80211_queue_work() already
then from the callbacks, which we must allow.

So ... maybe this?

johannes

---
net/mac80211/ieee80211_i.h | 7 +++++++
net/mac80211/util.c | 19 +++++++++----------
2 files changed, 16 insertions(+), 10 deletions(-)

--- wireless-testing.orig/net/mac80211/util.c 2009-11-18 22:02:35.000000000 +0100
+++ wireless-testing/net/mac80211/util.c 2009-11-18 22:02:51.000000000 +0100
@@ -520,9 +520,9 @@ EXPORT_SYMBOL_GPL(ieee80211_iterate_acti
*/
static bool ieee80211_can_queue_work(struct ieee80211_local *local)
{
- if (WARN(local->suspended, "queueing ieee80211 work while "
- "going to suspend\n"))
- return false;
+ if (WARN(local->suspended && !local->resuming,
+ "queueing ieee80211 work while going to suspend\n"))
+ return false;

return true;
}
@@ -1033,13 +1033,9 @@ int ieee80211_reconfig(struct ieee80211_
struct sta_info *sta;
unsigned long flags;
int res;
- bool from_suspend = local->suspended;

- /*
- * We're going to start the hardware, at that point
- * we are no longer suspended and can RX frames.
- */
- local->suspended = false;
+ if (local->suspended)
+ local->resuming = true;

/* restart hardware */
if (local->open_count) {
@@ -1137,11 +1133,14 @@ int ieee80211_reconfig(struct ieee80211_
* If this is for hw restart things are still running.
* We may want to change that later, however.
*/
- if (!from_suspend)
+ if (!local->suspended)
return 0;

#ifdef CONFIG_PM
+ /* first set suspended false, then resuming */
local->suspended = false;
+ mb();
+ local->resuming = false;

list_for_each_entry(sdata, &local->interfaces, list) {
switch(sdata->vif.type) {
--- wireless-testing.orig/net/mac80211/ieee80211_i.h 2009-11-18 22:02:35.000000000 +0100
+++ wireless-testing/net/mac80211/ieee80211_i.h 2009-11-18 22:02:51.000000000 +0100
@@ -601,6 +601,13 @@ struct ieee80211_local {
bool suspended;

/*
+ * Resuming is true while suspended, but when we're reprogramming the
+ * hardware -- at that time it's allowed to use ieee80211_queue_work()
+ * again even though some other parts of the stack are still suspended.
+ */
+ bool resuming;
+
+ /*
* quiescing is true during the suspend process _only_ to
* ease timer cancelling etc.
*/



2009-11-18 21:21:32

by Johannes Berg

[permalink] [raw]
Subject: Re: mac80211 breaks suspend to disk

On Wed, 2009-11-18 at 23:19 +0200, Maxim Levitsky wrote:

> > Then again, we drop RX frames while local->suspended is true, and I
> > guess we don't really care about missing a few more frames when we're
> > just waking up, so it should be OK to keep suspended == true while doing
> > the startup.
> >
> > EXCEPT drivers are going to start using ieee80211_queue_work() already
> > then from the callbacks, which we must allow.
> >
> > So ... maybe this?
>
>
> This will likely fix this problem.
> I will run my suspend/resume loop for a while and let you know if it
> works.

Thanks. I'll take another look at the code to see if there's another
code path that could now end in warnings, but I don't think so.

I suppose you could reduce your AP's beacon interval to make it more
likely to trigger.

johannes


Attachments:
signature.asc (801.00 B)
This is a digitally signed message part

2009-11-23 08:13:27

by Maxim Levitsky

[permalink] [raw]
Subject: Re: mac80211 breaks suspend to disk

On Thu, 2009-11-19 at 15:29 +0100, Johannes Berg wrote:
> On Thu, 2009-11-19 at 15:43 +0200, Maxim Levitsky wrote:
>
> > This problem definitely is fixed.
>
> Thanks for checking.
>
> > One more finding from the hibernate loop though:
> >
> > Nov 19 12:25:49 maxim-laptop NetworkManager: <info> (wlan0): deactivating device (reason: 37).
> > Nov 19 12:25:49 maxim-laptop NetworkManager: <info> (wlan0): canceled DHCP transaction, dhcp client pid 6130
> > Nov 19 12:25:49 maxim-laptop kernel: [ 239.925841] wlan0: disassociating from 00:1b:9e:d8:77:02 by local choice (reason=3)
> > Nov 19 12:25:49 maxim-laptop kernel: [ 239.940401] phy0: Removed STA 00:1b:9e:d8:77:02
> > Nov 19 12:25:49 maxim-laptop kernel: [ 239.982746] phy0: Destroyed STA 00:1b:9e:d8:77:02
> > Nov 19 12:25:49 maxim-laptop NetworkManager: <WARN> check_one_route(): (wlan0) error -34 returned from rtnl_route_del(): Sucess#012
> > Nov 19 12:25:49 maxim-laptop NetworkManager: <info> (wlan0): cleaning up...
> > Nov 19 12:25:49 maxim-laptop NetworkManager: <info> (wlan0): taking down device.
> > Nov 19 12:25:49 maxim-laptop avahi-daemon[1626]: Withdrawing address record for 10.1.0.2 on wlan0.
> > Nov 19 12:25:49 maxim-laptop avahi-daemon[1626]: Leaving mDNS multicast group on interface wlan0.IPv4 with address 10.1.0.2.
> > Nov 19 12:25:49 maxim-laptop kernel: [ 239.998665] wlan0: deauthenticating from 00:1b:9e:d8:77:02 by local choice (reason=3)
> > Nov 19 12:25:49 maxim-laptop kernel: [ 240.072664] phy0: device now idle
> > Nov 19 12:25:49 maxim-laptop kernel: [ 240.105115] ------------[ cut here ]------------
> > Nov 19 12:25:49 maxim-laptop avahi-daemon[1626]: Interface wlan0.IPv4 no longer relevant for mDNS.
> > Nov 19 12:25:49 maxim-laptop kernel: [ 240.120021] WARNING: at /home/maxim/software/kernel/linux-2.6/net/wireless/core.c:618 wdev_cleanup_work+0xe9/0x110 [cfg80211]()
>
> You're using 3945 it seems, this kind of warning happens when it doesn't
> do the scan notifications properly. I'd look at 3945 for this one.
>
> > Other that that, I got few freezes, this time with no feedback whatsoever, blank screen, thus very likely something different.
> > Might be nvidia proprietary driver fault too.
>
> Fun
And after removing acpi global lock that protects access to EC on my
system, but somehow ether isn't released, or I don't know what, my
system just survived 280 s2disk cycles, with everything enabled.

http://bugzilla.kernel.org/show_bug.cgi?id=14668
(not related to wireless stack in any way)




2009-11-19 13:43:04

by Maxim Levitsky

[permalink] [raw]
Subject: Re: mac80211 breaks suspend to disk

On Wed, 2009-11-18 at 22:21 +0100, Johannes Berg wrote:
> On Wed, 2009-11-18 at 23:19 +0200, Maxim Levitsky wrote:
>
> > > Then again, we drop RX frames while local->suspended is true, and I
> > > guess we don't really care about missing a few more frames when we're
> > > just waking up, so it should be OK to keep suspended == true while doing
> > > the startup.
> > >
> > > EXCEPT drivers are going to start using ieee80211_queue_work() already
> > > then from the callbacks, which we must allow.
> > >
> > > So ... maybe this?
> >
> >
> > This will likely fix this problem.
> > I will run my suspend/resume loop for a while and let you know if it
> > works.
>
> Thanks. I'll take another look at the code to see if there's another
> code path that could now end in warnings, but I don't think so.
>
> I suppose you could reduce your AP's beacon interval to make it more
> likely to trigger.
>
> johannes


This problem definitely is fixed.

One more finding from the hibernate loop though:

Nov 19 12:25:49 maxim-laptop NetworkManager: <info> (wlan0): deactivating device (reason: 37).
Nov 19 12:25:49 maxim-laptop NetworkManager: <info> (wlan0): canceled DHCP transaction, dhcp client pid 6130
Nov 19 12:25:49 maxim-laptop kernel: [ 239.925841] wlan0: disassociating from 00:1b:9e:d8:77:02 by local choice (reason=3)
Nov 19 12:25:49 maxim-laptop kernel: [ 239.940401] phy0: Removed STA 00:1b:9e:d8:77:02
Nov 19 12:25:49 maxim-laptop kernel: [ 239.982746] phy0: Destroyed STA 00:1b:9e:d8:77:02
Nov 19 12:25:49 maxim-laptop NetworkManager: <WARN> check_one_route(): (wlan0) error -34 returned from rtnl_route_del(): Sucess#012
Nov 19 12:25:49 maxim-laptop NetworkManager: <info> (wlan0): cleaning up...
Nov 19 12:25:49 maxim-laptop NetworkManager: <info> (wlan0): taking down device.
Nov 19 12:25:49 maxim-laptop avahi-daemon[1626]: Withdrawing address record for 10.1.0.2 on wlan0.
Nov 19 12:25:49 maxim-laptop avahi-daemon[1626]: Leaving mDNS multicast group on interface wlan0.IPv4 with address 10.1.0.2.
Nov 19 12:25:49 maxim-laptop kernel: [ 239.998665] wlan0: deauthenticating from 00:1b:9e:d8:77:02 by local choice (reason=3)
Nov 19 12:25:49 maxim-laptop kernel: [ 240.072664] phy0: device now idle
Nov 19 12:25:49 maxim-laptop kernel: [ 240.105115] ------------[ cut here ]------------
Nov 19 12:25:49 maxim-laptop avahi-daemon[1626]: Interface wlan0.IPv4 no longer relevant for mDNS.
Nov 19 12:25:49 maxim-laptop kernel: [ 240.120021] WARNING: at /home/maxim/software/kernel/linux-2.6/net/wireless/core.c:618 wdev_cleanup_work+0xe9/0x110 [cfg80211]()
Nov 19 12:25:49 maxim-laptop kernel: [ 240.145103] hpet1: lost 1 rtc interrupts
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150333] Hardware name: Aspire 5720
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150337] Modules linked in: af_packet usb_storage nfsd exportfs iwl3945 usb_libusual snd_hda_codec_
realtek nfs cpufreq_powersave uvcvideo snd_hda_intel iwlcore videodev lockd cpufreq_conservative snd_hda_codec mac80211 v4l1_compat nfs_acl cpufreq_userspace acpi_cpufreq v4
l2_compat_ioctl32 snd_hwdep coretemp auth_rpcgss uhci_hcd joydev tg3 cfg80211 ohci1394 ehci_hcd snd_pcm sbp2 sunrpc usbcore psmouse libphy evdev rfkill snd_page_alloc video
nvidia(P) ieee1394 output lirc_ene0100 serio_raw fuse lzo lzo_decompress lzo_compress
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150385] Pid: 9, comm: events/0 Tainted: P 2.6.32-rc7-wl #185
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150388] Call Trace:
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150401] [<ffffffff81043a98>] warn_slowpath_common+0x78/0xb0
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150405] [<ffffffff81043adf>] warn_slowpath_null+0xf/0x20
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150416] [<ffffffffa0b866a9>] wdev_cleanup_work+0xe9/0x110 [cfg80211]
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150424] [<ffffffffa0b865c0>] ? wdev_cleanup_work+0x0/0x110 [cfg80211]
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150428] [<ffffffff81056eb6>] worker_thread+0x1f6/0x360
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150431] [<ffffffff81056e64>] ? worker_thread+0x1a4/0x360
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150436] [<ffffffff8105b9f0>] ? autoremove_wake_function+0x0/0x40
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150441] [<ffffffff8106d0ad>] ? trace_hardirqs_on+0xd/0x10
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150445] [<ffffffff81056cc0>] ? worker_thread+0x0/0x360
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150449] [<ffffffff8105b73e>] kthread+0x8e/0xa0
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150453] [<ffffffff8100cf5a>] child_rip+0xa/0x20
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150457] [<ffffffff8100c8c0>] ? restore_args+0x0/0x30
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150461] [<ffffffff8105b6b0>] ? kthread+0x0/0xa0
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150464] [<ffffffff8100cf50>] ? child_rip+0x0/0x20
Nov 19 12:25:49 maxim-laptop kernel: [ 240.150466] ---[ end trace ea0fcbca6d09945c ]---
Nov 19 12:25:50 maxim-laptop kernel: [ 240.833019] PM: Marking nosave pages: 0000000000001000 - 0000000000006000
Nov 19 12:25:50 maxim-laptop kernel: [ 240.833402] PM: Marking nosave pages: 000000000009f000 - 0000000000100000
Nov 19 12:25:50 maxim-laptop kernel: [ 240.833782] PM: Marking nosave pages: 000000007fcfd000 - 000000007fd08000
Nov 19 12:25:50 maxim-laptop kernel: [ 240.834159] PM: Marking nosave pages: 000000007fd52000 - 000000007fd55000
Nov 19 12:25:50 maxim-laptop kernel: [ 240.834531] PM: Marking nosave pages: 000000007fdbb000 - 000000007fdbf000
Nov 19 12:25:50 maxim-laptop kernel: [ 240.834904] PM: Basic memory bitmaps created
Nov 19 12:26:57 maxim-laptop kernel: [ 240.835140] PM: Syncing filesystems ... done.
Nov 19 12:26:57 maxim-laptop kernel: [ 241.022693] Freezing user space processes ... (elapsed 0.00 seconds) done.


Other that that, I got few freezes, this time with no feedback whatsoever, blank screen, thus very likely something different.
Might be nvidia proprietary driver fault too.


Best regards,
Maxim Levitsky




2009-11-18 20:39:01

by Maxim Levitsky

[permalink] [raw]
Subject: Re: mac80211 breaks suspend to disk

On Wed, 2009-11-18 at 15:28 -0500, John W. Linville wrote:
> On Wed, Nov 18, 2009 at 08:27:16PM +0200, Maxim Levitsky wrote:
> > Hi,
> >
> > I finally managed to track down a rare hang on resume from disk I see
> > for some prolonged time.
> > Symptoms are that sometimes on resume, system almost hangs.
> > This is I can switch VTs and execute SysRQ sequences.
> > This is typical when oops happens.
> >
> > I tracked this to BUG_ON in add_timer.
> > I replaced this with WARN_ON, and here is the backtrace:
> >
> >
> > 331.334935] WARNING: at /home/maxim/software/kernel/linux-2.6/kernel/timer.c:791 add_timer+0x36/0x40()
> > [ 331.347374] Hardware name: Aspire 5720
> > [ 331.359725] Modules linked in: nvidia(P) af_packet usb_storage usb_libusual cpufreq_powersave iwl3945 nfsd snd_hda_codec_realtek cpufreq_conservative exportfs iwlcore cpufreq_userspace uvcvideo snd_hda_intel acpi_cpufreq nfs videodev mac80211 coretemp lockd snd_hda_codec joydev v4l1_compat tg3 nfs_acl ohci1394 v4l2_compat_ioctl32 snd_hwdep uhci_hcd sbp2 cfg80211 psmouse video ehci_hcd auth_rpcgss libphy lirc_ene0100 ieee1394 output snd_pcm usbcore serio_raw rfkill evdev snd_page_alloc sunrpc fuse lzo lzo_decompress lzo_compress
> > [ 331.428619] Pid: 4394, comm: pm-hibernate Tainted: P 2.6.32-rc7-wl #183
> > [ 331.442747] Call Trace:
> > [ 331.456760] [<ffffffff81043a98>] warn_slowpath_common+0x78/0xb0
> > [ 331.471119] [<ffffffff81043adf>] warn_slowpath_null+0xf/0x20
> > [ 331.485475] [<ffffffff8104f996>] add_timer+0x36/0x40
> > [ 331.499814] [<ffffffffa01fb34c>] ieee80211_sta_restart+0x4c/0x50 [mac80211]
> > [ 331.514314] [<ffffffffa020c86c>] ieee80211_reconfig+0x36c/0x420 [mac80211]
> > [ 331.528678] [<ffffffffa020c65c>] ? ieee80211_reconfig+0x15c/0x420 [mac80211]
> > [ 331.543094] [<ffffffffa0202fa5>] ieee80211_resume+0x15/0x20 [mac80211]
> > [ 331.557525] [<ffffffffa01527a5>] wiphy_resume+0x75/0x90 [cfg80211]
> > [ 331.572001] [<ffffffff81273fee>] dpm_resume_end+0x47e/0x4b0
> > [ 331.586325] [<ffffffff8107d5e9>] hibernation_snapshot+0xc9/0x280
> > [ 331.600377] [<ffffffff8107d88d>] hibernate+0xed/0x1f0
> > [ 331.613980] [<ffffffff8107c00c>] state_store+0xec/0x100
> > [ 331.627577] [<ffffffff811e05f7>] kobj_attr_store+0x17/0x20
> > [ 331.641242] [<ffffffff811345a4>] sysfs_write_file+0xd4/0x150
> > [ 331.655056] [<ffffffff810d02f8>] vfs_write+0xb8/0x1a0
> > [ 331.668718] [<ffffffff810d04bc>] sys_write+0x4c/0x80
> > [ 331.682153] [<ffffffff8100beeb>] system_call_fastpath+0x16/0x1b
> > [ 331.695613] ---[ end trace 85c754c80d7debe8 ]---
> > [ 331.709557] PM: Image restored successfully.
> >
> > Now, how the ifmgd->timer could still be pending I have no idea yet, it seems to be explictly
> > del_timer_sync'ed and timer routine also checks this condition, and doesn't restart itself....
> >
> > Also note that this was seen happening both on resume, and when the suspend image is written (sort of resume too).
>
> Quick glance suggest some hole in the management of the
> TMR_RUNNING_CHANSW bit?

I guess no, I have futher information.

I forgot to pay attention to few lines before the oops message:


[ 331.307139] No probe response from AP 00:1b:9e:d8:77:02 after 500ms, try 1
[ 331.320121] ------------[ cut here ]------------
[ 331.334935] WARNING: at /home/maxim/software/kernel/linux-2.6/kernel/timer.c:791 add_timer+0x36/0x40()
[ 331.347374] Hardware name: Aspire 5720
.....

This reveals that somehow the ieee80211_sta_work manages to run before ieee80211_sta_restart and it sets the timer....

looking at ieee80211_reconfig it seems suspicious that 'local->suspended = false;' is set so early.
In fact it is set again to false just prior to

list_for_each_entry(sdata, &local->interfaces, list) {....

So, I would suspect some lines in this function trigger the work queue.

Seems reasonable?

Best regards,
Maxim Levitsky



2009-11-18 20:30:26

by John W. Linville

[permalink] [raw]
Subject: Re: mac80211 breaks suspend to disk

On Wed, Nov 18, 2009 at 08:27:16PM +0200, Maxim Levitsky wrote:
> Hi,
>
> I finally managed to track down a rare hang on resume from disk I see
> for some prolonged time.
> Symptoms are that sometimes on resume, system almost hangs.
> This is I can switch VTs and execute SysRQ sequences.
> This is typical when oops happens.
>
> I tracked this to BUG_ON in add_timer.
> I replaced this with WARN_ON, and here is the backtrace:
>
>
> 331.334935] WARNING: at /home/maxim/software/kernel/linux-2.6/kernel/timer.c:791 add_timer+0x36/0x40()
> [ 331.347374] Hardware name: Aspire 5720
> [ 331.359725] Modules linked in: nvidia(P) af_packet usb_storage usb_libusual cpufreq_powersave iwl3945 nfsd snd_hda_codec_realtek cpufreq_conservative exportfs iwlcore cpufreq_userspace uvcvideo snd_hda_intel acpi_cpufreq nfs videodev mac80211 coretemp lockd snd_hda_codec joydev v4l1_compat tg3 nfs_acl ohci1394 v4l2_compat_ioctl32 snd_hwdep uhci_hcd sbp2 cfg80211 psmouse video ehci_hcd auth_rpcgss libphy lirc_ene0100 ieee1394 output snd_pcm usbcore serio_raw rfkill evdev snd_page_alloc sunrpc fuse lzo lzo_decompress lzo_compress
> [ 331.428619] Pid: 4394, comm: pm-hibernate Tainted: P 2.6.32-rc7-wl #183
> [ 331.442747] Call Trace:
> [ 331.456760] [<ffffffff81043a98>] warn_slowpath_common+0x78/0xb0
> [ 331.471119] [<ffffffff81043adf>] warn_slowpath_null+0xf/0x20
> [ 331.485475] [<ffffffff8104f996>] add_timer+0x36/0x40
> [ 331.499814] [<ffffffffa01fb34c>] ieee80211_sta_restart+0x4c/0x50 [mac80211]
> [ 331.514314] [<ffffffffa020c86c>] ieee80211_reconfig+0x36c/0x420 [mac80211]
> [ 331.528678] [<ffffffffa020c65c>] ? ieee80211_reconfig+0x15c/0x420 [mac80211]
> [ 331.543094] [<ffffffffa0202fa5>] ieee80211_resume+0x15/0x20 [mac80211]
> [ 331.557525] [<ffffffffa01527a5>] wiphy_resume+0x75/0x90 [cfg80211]
> [ 331.572001] [<ffffffff81273fee>] dpm_resume_end+0x47e/0x4b0
> [ 331.586325] [<ffffffff8107d5e9>] hibernation_snapshot+0xc9/0x280
> [ 331.600377] [<ffffffff8107d88d>] hibernate+0xed/0x1f0
> [ 331.613980] [<ffffffff8107c00c>] state_store+0xec/0x100
> [ 331.627577] [<ffffffff811e05f7>] kobj_attr_store+0x17/0x20
> [ 331.641242] [<ffffffff811345a4>] sysfs_write_file+0xd4/0x150
> [ 331.655056] [<ffffffff810d02f8>] vfs_write+0xb8/0x1a0
> [ 331.668718] [<ffffffff810d04bc>] sys_write+0x4c/0x80
> [ 331.682153] [<ffffffff8100beeb>] system_call_fastpath+0x16/0x1b
> [ 331.695613] ---[ end trace 85c754c80d7debe8 ]---
> [ 331.709557] PM: Image restored successfully.
>
> Now, how the ifmgd->timer could still be pending I have no idea yet, it seems to be explictly
> del_timer_sync'ed and timer routine also checks this condition, and doesn't restart itself....
>
> Also note that this was seen happening both on resume, and when the suspend image is written (sort of resume too).

Quick glance suggest some hole in the management of the
TMR_RUNNING_CHANSW bit?

Johannes, ideas?

John
--
John W. Linville Someday the world will need a hero, and you
[email protected] might be all we have. Be ready.

2009-11-19 14:29:56

by Johannes Berg

[permalink] [raw]
Subject: Re: mac80211 breaks suspend to disk

On Thu, 2009-11-19 at 15:43 +0200, Maxim Levitsky wrote:

> This problem definitely is fixed.

Thanks for checking.

> One more finding from the hibernate loop though:
>
> Nov 19 12:25:49 maxim-laptop NetworkManager: <info> (wlan0): deactivating device (reason: 37).
> Nov 19 12:25:49 maxim-laptop NetworkManager: <info> (wlan0): canceled DHCP transaction, dhcp client pid 6130
> Nov 19 12:25:49 maxim-laptop kernel: [ 239.925841] wlan0: disassociating from 00:1b:9e:d8:77:02 by local choice (reason=3)
> Nov 19 12:25:49 maxim-laptop kernel: [ 239.940401] phy0: Removed STA 00:1b:9e:d8:77:02
> Nov 19 12:25:49 maxim-laptop kernel: [ 239.982746] phy0: Destroyed STA 00:1b:9e:d8:77:02
> Nov 19 12:25:49 maxim-laptop NetworkManager: <WARN> check_one_route(): (wlan0) error -34 returned from rtnl_route_del(): Sucess#012
> Nov 19 12:25:49 maxim-laptop NetworkManager: <info> (wlan0): cleaning up...
> Nov 19 12:25:49 maxim-laptop NetworkManager: <info> (wlan0): taking down device.
> Nov 19 12:25:49 maxim-laptop avahi-daemon[1626]: Withdrawing address record for 10.1.0.2 on wlan0.
> Nov 19 12:25:49 maxim-laptop avahi-daemon[1626]: Leaving mDNS multicast group on interface wlan0.IPv4 with address 10.1.0.2.
> Nov 19 12:25:49 maxim-laptop kernel: [ 239.998665] wlan0: deauthenticating from 00:1b:9e:d8:77:02 by local choice (reason=3)
> Nov 19 12:25:49 maxim-laptop kernel: [ 240.072664] phy0: device now idle
> Nov 19 12:25:49 maxim-laptop kernel: [ 240.105115] ------------[ cut here ]------------
> Nov 19 12:25:49 maxim-laptop avahi-daemon[1626]: Interface wlan0.IPv4 no longer relevant for mDNS.
> Nov 19 12:25:49 maxim-laptop kernel: [ 240.120021] WARNING: at /home/maxim/software/kernel/linux-2.6/net/wireless/core.c:618 wdev_cleanup_work+0xe9/0x110 [cfg80211]()

You're using 3945 it seems, this kind of warning happens when it doesn't
do the scan notifications properly. I'd look at 3945 for this one.

> Other that that, I got few freezes, this time with no feedback whatsoever, blank screen, thus very likely something different.
> Might be nvidia proprietary driver fault too.

Fun.

johannes


Attachments:
signature.asc (801.00 B)
This is a digitally signed message part