Return-path:
Received: from www.tglx.de ([62.245.132.106]:50784 "EHLO www.tglx.de"
    rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
    id S1755485AbYJOTxP (ORCPT ); Wed, 15 Oct 2008 15:53:15 -0400
Date: Wed, 15 Oct 2008 21:53:06 +0200 (CEST)
From: Thomas Gleixner
To: Elias Oltmanns
cc: Jiri Slaby, linux-wireless@vger.kernel.org
Subject: Re: ath5k: kernel timing screwed - due to unserialised register access?
In-Reply-To: <87tzbdzump.fsf@denkblock.local>
Message-ID: (sfid-20081015_215329_951967_08FF1C9E)
References: <87k5cm3ee2.fsf@denkblock.local> <87d4id3jmr.fsf@denkblock.local>
    <87skr8h1de.fsf@denkblock.local> <87hc7ot804.fsf@denkblock.local>
    <87myhfnwne.fsf@denkblock.local> <87k5cgg87j.fsf@denkblock.local>
    <87abdck6sn.fsf@denkblock.local> <87k5ceeuxy.fsf@denkblock.local>
    <87skqyj0ps.fsf@denkblock.local> <87tzbdzump.fsf@denkblock.local>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-wireless-owner@vger.kernel.org
List-ID:

On Wed, 15 Oct 2008, Elias Oltmanns wrote:
> Thomas Gleixner wrote:
> > On Wed, 15 Oct 2008, Elias Oltmanns wrote:
> >> Thomas Gleixner wrote:
> >
> >> > On Sat, 11 Oct 2008, Elias Oltmanns wrote:
> >> >> > Compile the acpi_processor module in to the kernel
> >> >
> >> >> > (CONFIG_ACPI_PROCESSOR=y) and add processor.max_cstate=1 to the kernel
> >> >> > command line. If I analysed the problem correctly this will make the
> >> >> > jiffies problem go away. I'm working on a fix.
> >> >>
> >> >> Spot on, it does go away regardless whether NO_HZ or HIGH_RES are
> >> >> enabled or disabled. Looking forward to testing your fix ;-).
> >> >
> >> > Here you go.
> >>
> >> Bad luck, I'm afraid. Your patch seems to fix the issue for NO_HZ=n +
> >> HIGH_RES=y. As soon as NO_HZ=y, however, the problem reappears. See the
> >
> > Hmm. You should have seen the same problem w/o that patch in the
> > "CONFIG_ACPI_PROCESSOR=y / processor.max_cstate=1" test with NOHZ=y.
>
> Right you are, I've just retested that case.
> The reason why I didn't hit this the last time is that it seems to be
> triggered far less often than without the processor.max_cstate=1
> option.
>
> >
> > Anyway, I can see what the NOHZ problem is. Updated patch below.
>
> Nice, it seems to have done the trick. I'll keep an eye on my logs to
> make sure it doesn't pop up again.

Thanks. I'll queue it for .28 and tag it for stable as well.

> >
> >> output below. (Still testing on 2.6.27, mind. Should I test something
> >> more up-to-date?)
> >
> > .27 is fine.
> >
> > Did you make any progress finding out why the ath5k softirq runs for
> > >20ms ? We need to fix this madness as well :)
>
> Well, it wasn't obvious to me so far, whether the logs really indicated
> that 20 msecs had been spent in the callback or whether all this was due
> to the bug in the timing code. With your patch applied, I have
> eventually made further investigations into the matter. The problem is
> the following snippet from
> drivers/net/wireless/ath5k/phy.c:ath5k_hw_noise_floor_calibration():
>
>         /*
>          * Enable noise floor calibration
>          */
>         AR5K_REG_ENABLE_BITS(ah, AR5K_PHY_AGCCTL,
>                                 AR5K_PHY_AGCCTL_NF);
>
>         ret = ath5k_hw_register_timeout(ah, AR5K_PHY_AGCCTL,
>                         AR5K_PHY_AGCCTL_NF, 0, false);
>
> The first call sets a bit in the AR5K_PHY_AGCCTL register and the second
> waits for that bit to be cleared by the hardware again. Apparently, it
> takes roughly 20 ms to clear that bit.

That happens in softirq context? To verify this, it's easy to just
measure the time with ktime_get() across both calls.

> In order to execute ath5k_hw_noise_floor_calibration() in process
> context, I'd suggest introducing a single threaded workqueue for the
> ath5k driver and scheduling calibration from the calib_timer callback.
> Additionally, it would be necessary to schedule resets in a similar
> manner instead of using the ath5k_tasklet_reset() tasklet.
> This requires some serialisation but in my opinion there are various
> serialisation issues in ath5k as it is that need fixing. Unfortunately,
> none of the concerns I have raised wrt the ath5k driver seem to have
> resulted in a commit that I'm aware of even though patches have been
> supplied. Perhaps it's the merge window that consumes people's
> resources. Maybe I'll post a patch addressing this particular issue in
> the next few days.

Sounds like a candidate for a threaded interrupt handler :)

Anyway, measuring the time of the softirq and pointing it out when it
takes more than a couple of cycles is the right thing to do.

Thanks,

        tglx
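
The measurement suggested above (read the clock before the bit is set,
read it again once the poll returns) can be sketched in userspace C.
This is only an illustration under stated assumptions: struct fake_reg,
fake_reg_read(), fake_register_timeout() and timed_calibration() are
hypothetical stand-ins and not ath5k functions, and
clock_gettime(CLOCK_MONOTONIC) takes the place of the kernel's
ktime_get().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define FAKE_AGCCTL_NF 0x2u	/* hypothetical flag bit, not the real value */

/* Hypothetical stand-in for a device register. */
struct fake_reg {
	uint32_t value;
	int polls_until_clear;	/* simulated latency in polls; 0 = never clears */
};

/* Read the fake register; the "hardware" clears the flag after N reads. */
static uint32_t fake_reg_read(struct fake_reg *r)
{
	if (r->polls_until_clear > 0 && --r->polls_until_clear == 0)
		r->value &= ~FAKE_AGCCTL_NF;
	return r->value;
}

/*
 * Poll until (reg & flag) == 0, sleeping 100 us between polls.
 * Returns 0 on success, -1 if the flag is still set after max_polls.
 */
static int fake_register_timeout(struct fake_reg *r, uint32_t flag,
				 int max_polls)
{
	const struct timespec delay = { 0, 100000 };
	int i;

	for (i = 0; i < max_polls; i++) {
		if ((fake_reg_read(r) & flag) == 0)
			return 0;
		nanosleep(&delay, NULL);
	}
	return -1;
}

/* Monotonic clock in nanoseconds (userspace analogue of ktime_get()). */
static int64_t monotonic_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Set the flag, wait for it to clear, and report the time across both steps. */
static int timed_calibration(struct fake_reg *r, int max_polls,
			     int64_t *elapsed_ns)
{
	int64_t start = monotonic_ns();
	int ret;

	r->value |= FAKE_AGCCTL_NF;
	ret = fake_register_timeout(r, FAKE_AGCCTL_NF, max_polls);
	*elapsed_ns = monotonic_ns() - start;
	return ret;
}
```

In the driver itself the same bracket would go around the
AR5K_REG_ENABLE_BITS() and ath5k_hw_register_timeout() calls quoted
above, with the elapsed time logged when it exceeds some chosen
threshold.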