Return-path: Received: from fg-out-1718.google.com ([72.14.220.155]:51096 "EHLO fg-out-1718.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932186AbZKRUjB (ORCPT ); Wed, 18 Nov 2009 15:39:01 -0500 Received: by fg-out-1718.google.com with SMTP id d23so2425790fga.1 for ; Wed, 18 Nov 2009 12:39:06 -0800 (PST) Subject: Re: mac80211 breaks suspend to disk From: Maxim Levitsky To: "John W. Linville" Cc: linux-wireless In-Reply-To: <20091118202802.GE2911@tuxdriver.com> References: <1258568836.4512.13.camel@maxim-laptop> <20091118202802.GE2911@tuxdriver.com> Content-Type: text/plain; charset="UTF-8" Date: Wed, 18 Nov 2009 22:39:09 +0200 Message-ID: <1258576749.8018.7.camel@maxim-laptop> Mime-Version: 1.0 Sender: linux-wireless-owner@vger.kernel.org List-ID: On Wed, 2009-11-18 at 15:28 -0500, John W. Linville wrote: > On Wed, Nov 18, 2009 at 08:27:16PM +0200, Maxim Levitsky wrote: > > Hi, > > > > I finally managed to track down a rare hang on resume from disk I see > > for some prolonged time. > > Symptoms are that sometimes on resume, system almost hangs. > > This is I can switch VTs and execute SysRQ sequences. > > This is typical when oops happens. > > > > I tracked this to BUG_ON in add_timer. > > I replaced this with WARN_ON, and here is the backtrace: > > > > > > 331.334935] WARNING: at /home/maxim/software/kernel/linux-2.6/kernel/timer.c:791 add_timer+0x36/0x40() > > [ 331.347374] Hardware name: Aspire 5720 > > [ 331.359725] Modules linked in: nvidia(P) af_packet usb_storage usb_libusual cpufreq_powersave iwl3945 nfsd snd_hda_codec_realtek cpufreq_conservative exportfs iwlcore cpufreq_userspace uvcvideo snd_hda_intel acpi_cpufreq nfs videodev mac80211 coretemp lockd snd_hda_codec joydev v4l1_compat tg3 nfs_acl ohci1394 v4l2_compat_ioctl32 snd_hwdep uhci_hcd sbp2 cfg80211 psmouse video ehci_hcd auth_rpcgss libphy lirc_ene0100 ieee1394 output snd_pcm usbcore serio_raw rfkill evdev snd_page_alloc sunrpc fuse lzo lzo_decompress lzo_compress > > [ 331.428619] Pid: 4394, comm: pm-hibernate Tainted: P 2.6.32-rc7-wl #183 > > [ 331.442747] Call Trace: > > [ 331.456760] [] warn_slowpath_common+0x78/0xb0 > > [ 331.471119] [] warn_slowpath_null+0xf/0x20 > > [ 331.485475] [] add_timer+0x36/0x40 > > [ 331.499814] [] ieee80211_sta_restart+0x4c/0x50 [mac80211] > > [ 331.514314] [] ieee80211_reconfig+0x36c/0x420 [mac80211] > > [ 331.528678] [] ? ieee80211_reconfig+0x15c/0x420 [mac80211] > > [ 331.543094] [] ieee80211_resume+0x15/0x20 [mac80211] > > [ 331.557525] [] wiphy_resume+0x75/0x90 [cfg80211] > > [ 331.572001] [] dpm_resume_end+0x47e/0x4b0 > > [ 331.586325] [] hibernation_snapshot+0xc9/0x280 > > [ 331.600377] [] hibernate+0xed/0x1f0 > > [ 331.613980] [] state_store+0xec/0x100 > > [ 331.627577] [] kobj_attr_store+0x17/0x20 > > [ 331.641242] [] sysfs_write_file+0xd4/0x150 > > [ 331.655056] [] vfs_write+0xb8/0x1a0 > > [ 331.668718] [] sys_write+0x4c/0x80 > > [ 331.682153] [] system_call_fastpath+0x16/0x1b > > [ 331.695613] ---[ end trace 85c754c80d7debe8 ]--- > > [ 331.709557] PM: Image restored successfully. > > > > Now, how the ifmgd->timer could still be pending I have no idea yet, it seems to be explictly > > del_timer_sync'ed and timer routine also checks this condition, and doesn't restart itself.... > > > > Also note that this was seen happening both on resume, and when the suspend image is written (sort of resume too). > > Quick glance suggest some hole in the management of the > TMR_RUNNING_CHANSW bit? I guess no, I have futher information. I forgot to pay attention to few lines before the oops message: [ 331.307139] No probe response from AP 00:1b:9e:d8:77:02 after 500ms, try 1 [ 331.320121] ------------[ cut here ]------------ [ 331.334935] WARNING: at /home/maxim/software/kernel/linux-2.6/kernel/timer.c:791 add_timer+0x36/0x40() [ 331.347374] Hardware name: Aspire 5720 ..... This reveals that somehow the ieee80211_sta_work manages to run before ieee80211_sta_restart and it sets the timer.... looking at ieee80211_reconfig it seems suspicious that 'local->suspended = false;' is set so early. In fact it is set again to false just prior to list_for_each_entry(sdata, &local->interfaces, list) {.... So, I would suspect some lines in this function trigger the work queue. Seems reasonable? Best regards, Maxim Levitsky