I'm encountering a very odd issue which only seems to crop up while I
sleep at night. In the morning, I go back to my Linux machine and
there's a ridiculously bad latency I immediately notice in Firefox's
redraw which I cannot explain. glxgears shows the typical 5K FPS, so
it seems like graphics are normal, but there's something causing
horrendous redraw times for Firefox. If I exit X11, the problem then
extends to my virtual terminals. If I switch a VT, it takes about a
second to redraw the whole screen top-to-bottom[1]. I can also get the
same issue if I do page-up/page-down in less, or pretty much anything
that causes the screen to redraw.
I'm using 2.6.27-rc6-tip right now, but this problem occurs on
pristine rc5 and rc6 as well. uptime is only 7 hours, and this problem
seems to get progressively worse over time. Strangely enough,
unloading the 'ath9k' driver causes the latency to outright -vanish-
instantaneously (I only discovered this -after- I made the YouTube
video cited below, so the video doesn't demonstrate this tidbit,
unfortunately). I wish I knew what these symptoms are pointing to,
because this makes no sense to me at all. Is there some hardware
interrupt going bonkers? The only thing I can think of is that I've
got 'irqpoll' on the command line. I have it because I had to reload
the ath9k driver at one point, and this was the result (caused ath9k
to become a useless driver until I rebooted, which re-enabled IRQ 17):
[53997.493777] ath9k: driver unloaded
[54001.098532] ath9k: 0.1
[54001.098833] ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
[54001.099013] ath9k 0000:03:00.0: setting latency timer to 64
[54001.233825] phy1: Selected rate control algorithm 'ath9k_rate_control'
[54001.236980] phy1: Atheros 5416: mem=0xffffc20000060000, irq=17
[54002.136218] irq 17: nobody cared (try booting with the "irqpoll" option)
[54002.136229] Pid: 0, comm: swapper Tainted: P 2.6.27-rc6-00036-ga551b98 #2
[54002.136234]
[54002.136235] Call Trace:
[54002.136241] <IRQ> [<ffffffff80265231>] __report_bad_irq+0x3d/0x81
[54002.136258] [<ffffffff80265477>] note_interrupt+0x202/0x267
[54002.136265] [<ffffffff80265bfc>] handle_fasteoi_irq+0xba/0xe4
[54002.136273] [<ffffffff8020e7ff>] do_IRQ+0x115/0x191
[54002.136279] [<ffffffff8020bc01>] ret_from_intr+0x0/0xa
[54002.136283] <EOI> [<ffffffff802541bb>] ?
tick_nohz_stop_sched_tick+0x2c5/0x2d4
[54002.136296] [<ffffffff8020a69f>] ? cpu_idle+0x2d/0xed
[54002.136304] [<ffffffff8060d741>] ? rest_init+0x75/0x77
[54002.136309]
[54002.136312] handlers:
[54002.136315] [<ffffffffa0209016>] (ath_isr+0x0/0x160 [ath9k])
[54002.136340] Disabling IRQ #17
latencytop did not give me anything useful while the enormous latency
was showing up, except during one case where 'less' was taking forever
and a half to redraw:
Cause Maximum Percentage
down acquire_console_sem con_flush_chars write_cha1140.9 msec 70.0 %
Scheduler: waiting for cpu 79.6 msec 21.0 %
do_sys_poll sys_poll system_call_fastpath 4.8 msec 4.1 %
down acquire_console_sem vt_waitactive vt_ioctl tt 0.9 msec 4.8 %
down acquire_console_sem do_con_write con_write wr 0.8 msec 0.1 %
Process less (5550) Total: 1145.3 msec
down acquire_console_sem con_flush_chars write_cha1140.9 msec
99.6 %write system_call_fastpath
Scheduler: waiting for cpu 1.3 msec 0.3 %
down acquire_console_sem do_con_write con_write wr 0.8 msec
0.1 %ite sys_write system_call_fastpath
latencytop only showed the redraw delay once, couldn't pin Firefox or
VT switch latency on anything. 'top' also doesn't show anything taking
a ridiculous amount of CPU. I can't figure out what precisely the
explanation is for ath9k's horrendous slowdown of the system.
This problem isn't easily bisectable either, because the only way I
can reproduce this problem is by leaving Linux running with ath9k
loaded overnight, and then finding it running like a tortoise in the
morning. I also don't know what kernels this -does not- happen on,
because I just recently put Linux on this particular machine, and I've
only used 2.6.27-rc5 through rc6 and above.
- Steven
[1] http://www.youtube.com/watch?v=4GeCx1gZMpA -- took a quick video
of the problem in progress. be prepared to be absolutely horrified.
On Thu, Sep 18, 2008 at 12:00 PM, Steven Noonan <[email protected]> wrote:
> On Thu, Sep 18, 2008 at 11:42 AM, Luis R. Rodriguez
> <[email protected]> wrote:
>> On Thu, Sep 18, 2008 at 11:34 AM, Luis R. Rodriguez
>>> irqpoll is a monster of evil and that should make your system crawl to
>>> its knees. I would advise instead we work with you fixing the the
>>> missed interrupts issue upon rmmod.
>>
>> Also, please provide the output of
>>
>> cat /proc/interrupts
>
> Note that the problem necessitating use of irqpoll in the first place
> seems to only happen under certain conditions. I am unsure what these
> conditions are. Before 'ath9k: connectivity is lost after Group
> rekeying is done', I had used rmmod/modprobe as my solution to the
> issue, which triggered the IRQ issue.
I wonder if this is related:
alcarin ~ # dmesg | grep GPE
[ 0.166806] ACPI: EC: GPE = 0x17, I/O: command/status = 0x66, data = 0x62
[ 685.409077] ACPI: EC: GPE storm detected, disabling EC GPE
On Fri, Sep 19, 2008 at 1:17 AM, Ingo Molnar <[email protected]> wrote:
> in tip/master there's an ftrace_printk() facility. You can just replace
> the verbose printk()s with ftrace_printk() and see the result in
> /debug/tracing/trace.
>
> This sort of tracing has very low overhead and can be used in an easy
> ad-hoc manner with no extra infrastructure. Here's a few quick-start
> links about how to enable the scheduler tracer:
>
> http://redhat.com/~mingo/sched-devel.git/readme-tracer.txt
> http://redhat.com/~mingo/sched-devel.git/howto-trace-latencies.txt
>
> any ftrace_printk() you add in the kernel will show up in that trace.
>
> (If the scheduling events are uninteresting and clutter the output then
> you might want to remove the scheduler tracing entries from kernel/*.c
> by removing the trace_sched_*() calls or use a less noisy tracer.)
>
> to get enough of a trace history you might want to increase the number
> of trace entries in /debug/tracing/trace_entries from 16K to 128K or so.
>
Thanks for the suggestion, I'll do that. :)
On Thu, Sep 18, 2008 at 11:42 AM, Luis R. Rodriguez
<[email protected]> wrote:
> On Thu, Sep 18, 2008 at 11:34 AM, Luis R. Rodriguez
>> irqpoll is a monster of evil and that should make your system crawl to
>> its knees. I would advise instead we work with you fixing the the
>> missed interrupts issue upon rmmod.
>
> Also, please provide the output of
>
> cat /proc/interrupts
Note that the problem necessitating use of irqpoll in the first place
seems to only happen under certain conditions. I am unsure what these
conditions are. Before 'ath9k: connectivity is lost after Group
rekeying is done', I had used rmmod/modprobe as my solution to the
issue, which triggered the IRQ issue.
alcarin steven # cat /proc/interrupts
CPU0 CPU1
0x0: 63227 0 IO-APIC-edge hpet
0x8: 1 0 IO-APIC-edge rtc0
0x9: 13080 0 IO-APIC-fasteoi acpi
0xe: 8195 0 IO-APIC-edge ide0
0xf: 0 0 IO-APIC-edge ide1
0x10: 36 0 IO-APIC-fasteoi uhci_hcd:usb5
0x11: 10645 0 IO-APIC-fasteoi ath
0x12: 42 0 IO-APIC-fasteoi uhci_hcd:usb4
0x17: 919 0 IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb2
0x13: 32885 0 IO-APIC-fasteoi uhci_hcd:usb3,
ata_piix, ohci1394
0x200100: 1 0 PCI-MSI-edge eth0
0x16: 223 0 IO-APIC-fasteoi HDA Intel
NMI: 0 0 Non-maskable interrupts
LOC: 78087 95718 Local timer interrupts
RES: 11576 16384 Rescheduling interrupts
CAL: 6862 8889 Function call interrupts
TLB: 54 41 TLB shootdowns
TRM: 0 0 Thermal event interrupts
THR: 0 0 Threshold APIC interrupts
SPU: 0 0 Spurious interrupts
ERR: 0
>
> and also please do not cross post to all these lists, just use
> linux-wireless or ath9k.
>
Sorry, but in the past I've posted to linux-wireless, ath9k-devel, and
all the maintainers of ath9k and didn't get a single response (except
a 'me too' from a fellow ath9k user). I didn't just want to hear
crickets this time.
- Steven
On Thu, Sep 18, 2008 at 4:04 PM, Steven Noonan
> I always get "1 megabit/sec". It actually measures (at best) to be a
> throughput of 8mbits/sec, though the average (due to strange random
> packet loss) is about 4mbits/sec. --
Yeah, iwconfig reports a wrong rate, should be easy to fix but I
haven't yet looked at it. If you can fix it it'd be appreciated.
> And that was with aggregation
> re-enabled via your patch.
That patch just fixes aggregation for mac80211 after the TX MQ
changes, its seems ath9k aggregation is broken. Looking into that now.
> Without, it never exceeds 1mbit/sec
> (actual, not just as indicated by iwconfig.
Right.
Luis
On Fri, Sep 19, 2008 at 8:29 PM, Steven Noonan <[email protected]> wrote:
> Anyway, there definitely is an interrupt storm here. Let me demonstrate...
Worth noting that this change (modified version of the patch by
Senthil from earlier) made it so it didn't give an 'IRQ 17: nobody
cared' when I reloaded ath9k during the IRQ storm...
---
ath9k: flush pending IRQs before freeing IRQ
Signed-off-by: Steven Noonan <[email protected]>
diff --git a/drivers/net/wireless/ath9k/main.c
b/drivers/net/wireless/ath9k/main.c
index 0997196..591eec1 100644
--- a/drivers/net/wireless/ath9k/main.c
+++ b/drivers/net/wireless/ath9k/main.c
@@ -1411,10 +1411,17 @@ static void ath_pci_remove(struct pci_dev *pdev)
{
struct ieee80211_hw *hw = pci_get_drvdata(pdev);
struct ath_softc *sc = hw->priv;
+ enum ath9k_int status;
if (pdev->irq)
+ {
+ ath9k_hw_set_interrupts(sc->sc_ah, 0);
+ ath9k_hw_getisr(sc->sc_ah, &status); /* NB: clears ISR too */
+ sc->sc_invalid = 1;
free_irq(pdev->irq, sc);
+ }
ath_detach(sc);
+
pci_iounmap(pdev, sc->mem);
pci_release_region(pdev, 0);
pci_disable_device(pdev);
On Fri, Sep 19, 2008 at 08:40:04PM -0700, Steven Noonan wrote:
> On Fri, Sep 19, 2008 at 8:29 PM, Steven Noonan <[email protected]> wrote:
> > Anyway, there definitely is an interrupt storm here. Let me demonstrate...
>
> Worth noting that this change (modified version of the patch by
> Senthil from earlier) made it so it didn't give an 'IRQ 17: nobody
> cared' when I reloaded ath9k during the IRQ storm...
Thanks, for letting us know -- John, seems that IRQ patch by Senthil is
27 material as it fixes an IRQ issue (nobody cared complaint, and it
disables the IRQ otherwise) upon rmmod during an IRQ storm.
Luis
On Fri, Sep 19, 2008 at 04:31:20PM -0700, Steven Noonan wrote:
>
> On Fri, 19 Sep 2008, Luis R. Rodriguez wrote:
> > Thanks for the patch Senthil. He is on Linus' v2.6.27-rc, you are on
> > wireless-testing so I've ported it (no changes, it just applies cleanly
> > now). This is also untested by me:
>
> I'm seeing the same latency issue I reported originally with the latest
> -tip kernel and WITHOUT irqpoll.
Damn it.
> Apparently there's something else to this...
Steven,
Can you reproduce this? If so can you tell us how? If you can reproduce
can you try to test this with wireless-testing as well? That is still
based on 2.6.26-rc6 but its just has all the new wireless development
stuff queued up for 2.6.28 so far. Everything else is the same.
Luis
On Thu, Sep 18, 2008 at 12:00 PM, Steven Noonan <[email protected]> wrote:
> On Thu, Sep 18, 2008 at 11:42 AM, Luis R. Rodriguez
> <[email protected]> wrote:
>> On Thu, Sep 18, 2008 at 11:34 AM, Luis R. Rodriguez
>>> irqpoll is a monster of evil and that should make your system crawl to
>>> its knees. I would advise instead we work with you fixing the the
>>> missed interrupts issue upon rmmod.
>>
>> Also, please provide the output of
>>
>> cat /proc/interrupts
>
> Note that the problem necessitating use of irqpoll in the first place
> seems to only happen under certain conditions. I am unsure what these
> conditions are. Before 'ath9k: connectivity is lost after Group
> rekeying is done',
You mean this patch:
[PATCH] ath9k: connectivity is lost after Group rekeying is done
http://marc.info/?l=linux-wireless&m=122163541519736&w=2
So let me get this straight -- you applied this new patch, and haven't
tried disabling irqpoll now?
> I had used rmmod/modprobe as my solution to the
> issue, which triggered the IRQ issue.
Understood, but I also have used this before with ath9k and I got
exactly the same results you did -- I just refused to use it again and
just try to fix the issues present.
ath9k issues tons of interrupts, not sure why irqpoll option would
cause latency so bad as the interrupts *are* handled. Not sure
*exactly* how irqpoll works but its description mentions using it
forces each interrupt handler on the IRQ line to check the interrupt
is for it. You have to keep in mind that not only are ath9k interrupts
then being sent to the devices on its line but it would seem that all
other devices on each line would suffer from the interrupts of the
other guys. Why ath9k would be the *only* culprit of causing latency
when using irqpoll if the irq line it son is clean? Beats me.
> alcarin steven # cat /proc/interrupts
> CPU0 CPU1
> 0x0: 63227 0 IO-APIC-edge hpet
> 0x8: 1 0 IO-APIC-edge rtc0
> 0x9: 13080 0 IO-APIC-fasteoi acpi
> 0xe: 8195 0 IO-APIC-edge ide0
> 0xf: 0 0 IO-APIC-edge ide1
> 0x10: 36 0 IO-APIC-fasteoi uhci_hcd:usb5
> 0x11: 10645 0 IO-APIC-fasteoi ath
In this case your 11n Atheros device is on a clean line.
> 0x12: 42 0 IO-APIC-fasteoi uhci_hcd:usb4
> 0x17: 919 0 IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb2
But it was this interrupt line which had an interrupt not handled.
I'm not sure why this would happen. Can't we rule out ath9k then since
its on a different interrupt line?
> 0x13: 32885 0 IO-APIC-fasteoi uhci_hcd:usb3,
> ata_piix, ohci1394
> 0x200100: 1 0 PCI-MSI-edge eth0
> 0x16: 223 0 IO-APIC-fasteoi HDA Intel
> NMI: 0 0 Non-maskable interrupts
> LOC: 78087 95718 Local timer interrupts
> RES: 11576 16384 Rescheduling interrupts
> CAL: 6862 8889 Function call interrupts
> TLB: 54 41 TLB shootdowns
> TRM: 0 0 Thermal event interrupts
> THR: 0 0 Threshold APIC interrupts
> SPU: 0 0 Spurious interrupts
> ERR: 0
Can you try to reproduce the irq not handled again?
>>
>> and also please do not cross post to all these lists, just use
>> linux-wireless or ath9k.
>>
>
> Sorry, but in the past I've posted to linux-wireless, ath9k-devel, and
> all the maintainers of ath9k and didn't get a single response (except
> a 'me too' from a fellow ath9k user). I didn't just want to hear
> crickets this time.
Patches speak more than words, but yeah sorry, we should have
addressed this there. I've personally have just been busy with
tackling aggregation.
Luis
On Thu, 2008-09-18 at 16:10 -0700, Luis R. Rodriguez wrote:
> On Thu, Sep 18, 2008 at 4:04 PM, Steven Noonan
> > I always get "1 megabit/sec". It actually measures (at best) to be a
> > throughput of 8mbits/sec, though the average (due to strange random
> > packet loss) is about 4mbits/sec. --
>
> Yeah, iwconfig reports a wrong rate, should be easy to fix but I
> haven't yet looked at it. If you can fix it it'd be appreciated.
the rate patch I posted today will fix it. not for .27 though
johannes
On Sat, Sep 20, 2008 at 2:03 PM, Luis R. Rodriguez
<[email protected]> wrote:
>
> OK so yeah you have an interrupt storm, this is evident.
Indeedy.
>
> Do you think there is a way you can try to trigger the interrupt storm
> so it happens < than after 7 hours or however long its taking you? If
> so then it'd be good to get your debug out using the interrupt debug.
> If that's not useful then we can enable all debug stuff (file will be
> huge).
It's apparently 7 idle hours and ~1-2 hours when I'm actively browsing
the web. I don't know what in particular triggers it, so I guess I'd
just need to enable debug logging and then start running through all
my webcomics, or perhaps perusing Wikipedia.
> Try turning on the microwave, turn on all your bluetooth devices, use
> your 2.4 GHz wireless phone, use your xbox 360 controllers and ps3
> controllers all at once :D
Heh. I don't have an Xbox 360 or PS3 (sad, I know). Do have several
2.4GHz phones. ;)
On Sat, Sep 20, 2008 at 2:18 PM, Steven Noonan <[email protected]> wrote:
> On Sat, Sep 20, 2008 at 1:57 PM, Luis R. Rodriguez
> <[email protected]> wrote:
>> So here's what we need to do -- try to zero on in what causes this IRQ
>> storm.
>
> How is the best way to approach this? I assume ATH_DBG_INTERRUPT and
> pray like hell my /var partition doesn't fill up first? Maybe I'll
> just symlink /var/log/messages to a file on a substantially larger
> pattition...
Yeah see the other e-mail I sent.
>> Most likely its some strange event that triggers it, perhaps
>> some sort of noise which makes the device try to adapt and hits a path
>> we are not handling correctly yet. Can you provide the initial log of
>> when ath9k gets loaded? This tells us what card it is, like AR5416 or
>> AR9280, etc.
>
> [10238.347956] ath9k: 0.1
> [10238.348028] vendor=8086 device=27d2
> [10238.348033] ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> [10238.348050] ath9k 0000:03:00.0: setting latency timer to 64
> [10238.480781] phy1: Selected rate control algorithm 'ath9k_rate_control'
> [10238.483320] phy1: Atheros 5416: mem=0xffffc20000060000, irq=17
>
> It claims to be a 5416, but I was under the impression that it was the
> 5418 that came with the MacBook Pro.
lspci output is outdated, we have submitted some new updates to it.
The output of the driver should be the one you can rely on for chipset
information for now. The problem is besides the PCI device ID there is
also EEPROM information which further identifies the chipset like
devices can be dual band (2 GHz and 5 GHz capable) and some other
single band -- using the same PCI device:vendor ID. So for now we've
tried to update the PCI id stuff (not sure when it'll be merged) to
tell your "family chipset" and in parens the MAC or known model. It
can still use some cleanup but yeah, hope this helps understand this a
little better. You can see more info on this here:
http://wireless.kernel.org/en/users/Drivers/ath9k#supportedchips
Luis
On Fri, Sep 19, 2008 at 12:59:29PM +0530, Steven Noonan wrote:
> On Thu, Sep 18, 2008 at 8:01 PM, Luis R. Rodriguez
> <[email protected]> wrote:
> > Thanks for testing, and glad to see this is resolved.
> >
>
> Damn. It's back. I was using wireless normally this evening. Browsing
> the web, that kind of thing, and then the wireless inexplicably
> dropped (even with the group rekeying patch), so I unloaded/reloaded
> the module. This popped up in dmesg:
>
> [ 3834.375658] vendor=8086 device=27d2
> [ 3834.375666] ath9k 0000:03:00.0: PCI INT A disabled
> [ 3834.375716] ath9k: driver unloaded
> [ 3838.552419] ath9k: 0.1
> [ 3838.552502] vendor=8086 device=27d2
> [ 3838.552511] ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> [ 3838.552532] ath9k 0000:03:00.0: setting latency timer to 64
> [ 3838.688924] phy1: Selected rate control algorithm 'ath9k_rate_control'
> [ 3838.693652] phy1: Atheros 5416: mem=0xffffc20000060000, irq=17
> [ 3839.427125] irq 17: nobody cared (try booting with the "irqpoll" option)
> [ 3839.427136] Pid: 0, comm: swapper Tainted: P
> 2.6.27-rc6-tip-00478-g74f1a36 #1
> [ 3839.427141] Call Trace:
> [ 3839.427145] <IRQ> [<ffffffff802219c5>] ? read_hpet+0x9/0x1c
> [ 3839.427165] [<ffffffff8026af71>] __report_bad_irq+0x3d/0x8c
> [ 3839.427172] [<ffffffff8026b0c6>] note_interrupt+0x106/0x160
> [ 3839.427180] [<ffffffff8026b801>] handle_fasteoi_irq+0xad/0xda
> [ 3839.427188] [<ffffffff8020f7b0>] do_IRQ+0x10c/0x190
> [ 3839.427194] [<ffffffff8020c551>] ret_from_intr+0x0/0xa
> [ 3839.427198] <EOI> [<ffffffff8026c6f0>] rcu_pending+0x62/0x6e
> [ 3839.427211] [<ffffffff8025bc11>] ? tick_nohz_stop_sched_tick+0x2e4/0x2f3
> [ 3839.427218] [<ffffffff8020ad94>] cpu_idle+0x7b/0xdb
> [ 3839.427226] [<ffffffff8060c921>] rest_init+0x75/0x77
> [ 3839.427231] handlers:
> [ 3839.427234] [<ffffffffa0240238>] (ath_isr+0x0/0x170 [ath9k])
> [ 3839.427263] Disabling IRQ #17
> [ 3842.263699] ADDRCONF(NETDEV_UP): wlan0: link is not ready
> [ 3848.035003] ADDRCONF(NETDEV_UP): wlan0: link is not ready
> [ 3848.432701] ADDRCONF(NETDEV_UP): wlan0: link is not ready
> [ 3850.216947] wlan0: authenticate with AP 00:1e:52:79:4d:01
> [ 3850.217027] wlan0: authenticate with AP 00:1e:52:79:4d:01
> [ 3850.228326] wlan0: authenticated
> [ 3850.228336] wlan0: associate with AP 00:1e:52:79:4d:01
> [ 3850.428140] wlan0: associate with AP 00:1e:52:79:4d:01
> [ 3850.628151] wlan0: associate with AP 00:1e:52:79:4d:01
> [ 3850.728305] wlan0: RX AssocResp from 00:1e:52:79:4d:01 (capab=0x431
> status=0 aid=1)
> [ 3850.728314] wlan0: associated
> [ 3850.728655] wlan0 (WE) : Wireless Event too big (320)
> [ 3850.743377] ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
> [ 3860.855104] wlan0: no IPv6 routers present
>
> I rebuilt the module with DBG_ATH_INTERRUPT, but it somehow stumbled
> itself back into working order while I was compiling. I can't keep the
> interrupt debugging on all the time because it's just -too verbose-,
> and when I pop a debug version of the module in, then it's too late to
> track the issue....
I am able to reproduce this IRQ nobody cared issue in my setup and the
following patch seems to be fixing the issue. Please try it out and let
me know if it solves your issue in your setup.
**********
IRQs should be disabled before calling free_irq. Also clear pending IRQs.
Signed-off-by: Senthil Balasubramanian <[email protected]>
---
drivers/net/wireless/ath9k/core.c | 2 ++
drivers/net/wireless/ath9k/main.c | 8 +++++++-
2 files changed, 9 insertions(+), 1 deletions(-)
diff --git a/drivers/net/wireless/ath9k/core.c b/drivers/net/wireless/ath9k/core.c
index c262ef2..c007dd2 100644
--- a/drivers/net/wireless/ath9k/core.c
+++ b/drivers/net/wireless/ath9k/core.c
@@ -1183,6 +1183,8 @@ void ath_deinit(struct ath_softc *sc)
DPRINTF(sc, ATH_DBG_CONFIG, "%s\n", __func__);
+ tasklet_kill(&sc->intr_tq);
+ tasklet_kill(&sc->bcon_tasklet);
ath_stop(sc);
if (!(sc->sc_flags & SC_OP_INVALID))
ath9k_hw_setpower(sc->sc_ah, ATH9K_PM_AWAKE);
diff --git a/drivers/net/wireless/ath9k/main.c b/drivers/net/wireless/ath9k/main.c
index aca893a..559e0e8 100644
--- a/drivers/net/wireless/ath9k/main.c
+++ b/drivers/net/wireless/ath9k/main.c
@@ -1781,10 +1781,16 @@ static void ath_pci_remove(struct pci_dev *pdev)
{
struct ieee80211_hw *hw = pci_get_drvdata(pdev);
struct ath_softc *sc = hw->priv;
+ enum ath9k_int status;
- if (pdev->irq)
+ if (pdev->irq) {
+ ath9k_hw_set_interrupts(sc->sc_ah, 0);
+ ath9k_hw_getisr(sc->sc_ah, &status); /* NB: clears ISR too */
+ sc->sc_flags |= SC_OP_INVALID;
free_irq(pdev->irq, sc);
+ }
ath_detach(sc);
+
pci_iounmap(pdev, sc->mem);
pci_release_region(pdev, 0);
pci_disable_device(pdev);
--
1.5.5
On Fri, Sep 19, 2008 at 1:22 AM, Steven Noonan <[email protected]> wrote:
> On Fri, Sep 19, 2008 at 1:17 AM, Ingo Molnar <[email protected]> wrote:
>> in tip/master there's an ftrace_printk() facility. You can just replace
>> the verbose printk()s with ftrace_printk() and see the result in
>> /debug/tracing/trace.
>>
>> This sort of tracing has very low overhead and can be used in an easy
>> ad-hoc manner with no extra infrastructure. Here's a few quick-start
>> links about how to enable the scheduler tracer:
>>
>> http://redhat.com/~mingo/sched-devel.git/readme-tracer.txt
>> http://redhat.com/~mingo/sched-devel.git/howto-trace-latencies.txt
>>
>> any ftrace_printk() you add in the kernel will show up in that trace.
>>
>> (If the scheduling events are uninteresting and clutter the output then
>> you might want to remove the scheduler tracing entries from kernel/*.c
>> by removing the trace_sched_*() calls or use a less noisy tracer.)
>>
>> to get enough of a trace history you might want to increase the number
>> of trace entries in /debug/tracing/trace_entries from 16K to 128K or so.
>>
>
> Thanks for the suggestion, I'll do that. :)
>
It'd be extremely convenient to have a tracer that did absolutely
nothing but display ftrace_printk()s.
I hate trips where my email box becomes filled. I skip over a lot of
emails that I should not have.
On Fri, 19 Sep 2008, Ingo Molnar wrote:
>
> What is needed is to set kernel/trace/trace.c's tracing_disabled
> variable to 0. I've Cc:-ed Steve Rostedt - Steve, why is there no easy
> ad-hoc way to trigger a stop-tracing event without elaborate changes?
I think I added a way to do this. Hmm, I think it was in -rt that I added
it, and have not gotten around to sending you the updates.
Yes, we need an easy way to disable the tracer. I think we have
ftrace_disable() function. Perhaps that only disables tracing that checks
the tracing_disabled variable.
Anyway, I made sure the new ring_buffer code has an easy way to disable
it.
-- Steve
On Fri, Sep 19, 2008 at 6:13 PM, Luis R. Rodriguez
<[email protected]> wrote:
> On Fri, Sep 19, 2008 at 5:01 PM, Steven Noonan <[email protected]> wrote:
>> I'll pull wireless-testing and run that kernel for another 8 hours.
>> What's the URL for the repo? Surely you don't mean the
>> wireless-testing on git.kernel.org that hasn't been touched for almost
>> a week?
>
> Yes, that's the guy. This week was the Kernel Summit and the Linux
> Plumbers Conference too so there is understandably some lag in pulls
> as developers are getting drunk.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-testing.git
>
>> I kind of wish there was a better debugging facility in ath9k.
>
> Patches are welcomed ;)
>
> Also, we have a few interrupt levels you can use. In this case we are
> debugging interrupts.
>
>> With
>> ATH_DBG_INTERRUPTS, I see several hundred messages per second, which
>> fills log files up like crazy (I switched to ftrace_printk as Ingo
>> suggested, which has alleviated this somewhat). Is there a way to make
>> it less verbose (perhaps so that it prints data at a rate of no more
>> than 2 or 3 messages per second)? Or maybe the rate at which it prints
>> this stuff is indicative something's wrong?
>
> No, interrupts do happen quite often so that is what you get, perhaps
> we can enhance this by not printing on some interrupts but the point
> is to debug interrupts after all. To help us debug in userspace in a
> nicer way we can use sadc and sar. So you can try this:
>
> On a window leave this running:
>
> /usr/lib/sysstat/sadc -I 1 irqlog.sar
>
> And when you are seeing your latency issues, run this:
>
> sar -I 17 -f irqlog.sar | less
Thanks for informing me about this useful utility. I wasn't aware of
its existence.
However, this profiler seems broken. When I check individual or 'ALL'
statistics, they're all 0.00 IRQs/sec, but SUM shows the actual number
of total IRQs per second. I suspect it's only broken because of some
kernel config option not being enabled.
Anyway, there definitely is an interrupt storm here. Let me demonstrate...
This is when the system is operating normally:
[...]
19:14:39 sum 1271.00
19:14:40 sum 1286.14
19:14:41 sum 1373.00
19:14:42 sum 1691.00
19:14:43 sum 1380.00
19:14:44 sum 902.00
19:14:45 sum 772.00
19:14:46 sum 1155.00
19:14:47 sum 804.00
19:14:48 sum 809.00
19:14:49 sum 965.00
19:14:50 sum 1077.23
Average: sum 615.69
Here's while the latency issue is occurring, which can now safely be
called an interrupt storm:
[...]
20:21:07 sum 18864.08
20:21:08 sum 21152.73
20:21:09 sum 19221.35
20:21:10 sum 20786.14
20:21:11 sum 19891.92
20:21:12 sum 19816.16
20:21:13 sum 19430.69
20:21:14 sum 20432.00
20:21:15 sum 19657.84
20:21:16 sum 19386.00
20:21:17 sum 19386.79
20:21:18 sum 20152.94
20:21:19 sum 19077.88
20:21:20 sum 19852.00
Average: sum 19958.91
And here's the change as I unload and reload ath9k:
[...]
20:21:38 sum 20138.83
20:21:40 sum 21576.09
20:21:41 sum 17678.38
20:21:42 sum 19834.34
20:21:43 sum 19845.10
20:21:44 sum 19470.00
20:21:45 sum 9273.00 <--- ath9k unloaded/reloaded
20:21:46 sum 187.00
20:21:47 sum 1066.00
20:21:48 sum 610.00
20:21:49 sum 106.93
20:21:50 sum 670.00
20:21:51 sum 742.00
20:21:52 sum 781.00
Average: sum 19562.56
- Steven
2008/9/23 Steven Noonan <[email protected]>:
> Okay. I'm in Seattle now.
>
> Yes, it solved the issue, but surely an #if 0 isn't a proper solution.
For 2.6.27 it will be unless people are willing to accept a patch to
enable ANI support for 2.6.27.
> What's the actual bug here? I'm afraid the meaning of what's going on
> here isn't exactly intuitive.
Check linux-wireless for my patch on MIB event comments. But I'll
summarize here for you. We get MIB interrupts to update statistics on
a number of PHY related elements. We then use this information to
process ANI -- some automatic noise immunity work. If you don't
process the information for ANI it seems you get the hardware treating
this as untreated and hence you get an interrupt storm. So since
2.6.27 doesn't have ANI implemented the MIB interrupts are not useful
except to collect internal statistics.
Since you are so eager to test this with a "proper" solution I will
send you a patch to test ANI for 2.6.27. We'll leave it up to the
maintainers and distributions to decide whether or not to pick this
up. But for that let's leave that for another thread.
Luis
On Fri, 19 Sep 2008, Senthil Balasubramanian wrote:
> On Fri, Sep 19, 2008 at 12:59:29PM +0530, Steven Noonan wrote:
> > On Thu, Sep 18, 2008 at 8:01 PM, Luis R. Rodriguez
> > <[email protected]> wrote:
> > > Thanks for testing, and glad to see this is resolved.
> > >
> >
> > Damn. It's back. I was using wireless normally this evening. Browsing
> > the web, that kind of thing, and then the wireless inexplicably
> > dropped (even with the group rekeying patch), so I unloaded/reloaded
> > the module. This popped up in dmesg:
> >
> > [ 3834.375658] vendor=8086 device=27d2
> > [ 3834.375666] ath9k 0000:03:00.0: PCI INT A disabled
> > [ 3834.375716] ath9k: driver unloaded
> > [ 3838.552419] ath9k: 0.1
> > [ 3838.552502] vendor=8086 device=27d2
> > [ 3838.552511] ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> > [ 3838.552532] ath9k 0000:03:00.0: setting latency timer to 64
> > [ 3838.688924] phy1: Selected rate control algorithm 'ath9k_rate_control'
> > [ 3838.693652] phy1: Atheros 5416: mem=0xffffc20000060000, irq=17
> > [ 3839.427125] irq 17: nobody cared (try booting with the "irqpoll" option)
> > [ 3839.427136] Pid: 0, comm: swapper Tainted: P
> > 2.6.27-rc6-tip-00478-g74f1a36 #1
> > [ 3839.427141] Call Trace:
> > [ 3839.427145] <IRQ> [<ffffffff802219c5>] ? read_hpet+0x9/0x1c
> > [ 3839.427165] [<ffffffff8026af71>] __report_bad_irq+0x3d/0x8c
> > [ 3839.427172] [<ffffffff8026b0c6>] note_interrupt+0x106/0x160
> > [ 3839.427180] [<ffffffff8026b801>] handle_fasteoi_irq+0xad/0xda
> > [ 3839.427188] [<ffffffff8020f7b0>] do_IRQ+0x10c/0x190
> > [ 3839.427194] [<ffffffff8020c551>] ret_from_intr+0x0/0xa
> > [ 3839.427198] <EOI> [<ffffffff8026c6f0>] rcu_pending+0x62/0x6e
> > [ 3839.427211] [<ffffffff8025bc11>] ? tick_nohz_stop_sched_tick+0x2e4/0x2f3
> > [ 3839.427218] [<ffffffff8020ad94>] cpu_idle+0x7b/0xdb
> > [ 3839.427226] [<ffffffff8060c921>] rest_init+0x75/0x77
> > [ 3839.427231] handlers:
> > [ 3839.427234] [<ffffffffa0240238>] (ath_isr+0x0/0x170 [ath9k])
> > [ 3839.427263] Disabling IRQ #17
> > [ 3842.263699] ADDRCONF(NETDEV_UP): wlan0: link is not ready
> > [ 3848.035003] ADDRCONF(NETDEV_UP): wlan0: link is not ready
> > [ 3848.432701] ADDRCONF(NETDEV_UP): wlan0: link is not ready
> > [ 3850.216947] wlan0: authenticate with AP 00:1e:52:79:4d:01
> > [ 3850.217027] wlan0: authenticate with AP 00:1e:52:79:4d:01
> > [ 3850.228326] wlan0: authenticated
> > [ 3850.228336] wlan0: associate with AP 00:1e:52:79:4d:01
> > [ 3850.428140] wlan0: associate with AP 00:1e:52:79:4d:01
> > [ 3850.628151] wlan0: associate with AP 00:1e:52:79:4d:01
> > [ 3850.728305] wlan0: RX AssocResp from 00:1e:52:79:4d:01 (capab=0x431
> > status=0 aid=1)
> > [ 3850.728314] wlan0: associated
> > [ 3850.728655] wlan0 (WE) : Wireless Event too big (320)
> > [ 3850.743377] ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
> > [ 3860.855104] wlan0: no IPv6 routers present
> >
> > I rebuilt the module with DBG_ATH_INTERRUPT, but it somehow stumbled
> > itself back into working order while I was compiling. I can't keep the
> > interrupt debugging on all the time because it's just -too verbose-,
> > and when I pop a debug version of the module in, then it's too late to
> > track the issue....
>
> I am able to reproduce this IRQ nobody cared issue in my setup and the
> following patch seems to be fixing the issue. Please try it out and let
> me know if it solves your issue in your setup.
The patch you prvide doesn't want to apply. What code did you base this
on?
The first change listed doesn't work because there is no tasklet_kill() in
core.c, and the line immediately after ath_stop there has
"!sc->sc_invalid" instead of the "!(sc->sc_flags & SC_OP_INVALID)".
The second fails because SC_OP_INVALID isn't defined.
However, if your patch did apply to my code, I bet it'd solve the issue,
based on what it says it does.
On Thu, Sep 18, 2008 at 1:44 PM, Luis R. Rodriguez
<[email protected]> wrote:
> On Thu, Sep 18, 2008 at 12:00 PM, Steven Noonan <[email protected]> wrote:
>> On Thu, Sep 18, 2008 at 11:42 AM, Luis R. Rodriguez
>> <[email protected]> wrote:
>>> On Thu, Sep 18, 2008 at 11:34 AM, Luis R. Rodriguez
>>>> irqpoll is a monster of evil and that should make your system crawl to
>>>> its knees. I would advise instead we work with you fixing the the
>>>> missed interrupts issue upon rmmod.
>>>
>>> Also, please provide the output of
>>>
>>> cat /proc/interrupts
>>
>> Note that the problem necessitating use of irqpoll in the first place
>> seems to only happen under certain conditions. I am unsure what these
>> conditions are. Before 'ath9k: connectivity is lost after Group
>> rekeying is done',
>
> You mean this patch:
>
> [PATCH] ath9k: connectivity is lost after Group rekeying is done
> http://marc.info/?l=linux-wireless&m=122163541519736&w=2
>
> So let me get this straight -- you applied this new patch, and haven't
> tried disabling irqpoll now?
>
>> I had used rmmod/modprobe as my solution to the
>> issue, which triggered the IRQ issue.
>
> Understood, but I also have used this before with ath9k and I got
> exactly the same results you did -- I just refused to use it again and
> just try to fix the issues present.
>
> ath9k issues tons of interrupts, not sure why irqpoll option would
> cause latency so bad as the interrupts *are* handled. Not sure
> *exactly* how irqpoll works but its description mentions using it
> forces each interrupt handler on the IRQ line to check the interrupt
> is for it. You have to keep in mind that not only are ath9k interrupts
> then being sent to the devices on its line but it would seem that all
> other devices on each line would suffer from the interrupts of the
> other guys. Why ath9k would be the *only* culprit of causing latency
> when using irqpoll if the irq line it son is clean? Beats me.
>
>> alcarin steven # cat /proc/interrupts
>> CPU0 CPU1
>> 0x0: 63227 0 IO-APIC-edge hpet
>> 0x8: 1 0 IO-APIC-edge rtc0
>> 0x9: 13080 0 IO-APIC-fasteoi acpi
>> 0xe: 8195 0 IO-APIC-edge ide0
>> 0xf: 0 0 IO-APIC-edge ide1
>> 0x10: 36 0 IO-APIC-fasteoi uhci_hcd:usb5
>> 0x11: 10645 0 IO-APIC-fasteoi ath
>
> In this case your 11n Atheros device is on a clean line.
>
>> 0x12: 42 0 IO-APIC-fasteoi uhci_hcd:usb4
>> 0x17: 919 0 IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb2
>
> But it was this interrupt line which had an interrupt not handled.
>
> I'm not sure why this would happen. Can't we rule out ath9k then since
> its on a different interrupt line?
>
>> 0x13: 32885 0 IO-APIC-fasteoi uhci_hcd:usb3,
>> ata_piix, ohci1394
>> 0x200100: 1 0 PCI-MSI-edge eth0
>> 0x16: 223 0 IO-APIC-fasteoi HDA Intel
>> NMI: 0 0 Non-maskable interrupts
>> LOC: 78087 95718 Local timer interrupts
>> RES: 11576 16384 Rescheduling interrupts
>> CAL: 6862 8889 Function call interrupts
>> TLB: 54 41 TLB shootdowns
>> TRM: 0 0 Thermal event interrupts
>> THR: 0 0 Threshold APIC interrupts
>> SPU: 0 0 Spurious interrupts
>> ERR: 0
>
> Can you try to reproduce the irq not handled again?
>
>>>
>>> and also please do not cross post to all these lists, just use
>>> linux-wireless or ath9k.
>>>
>>
>> Sorry, but in the past I've posted to linux-wireless, ath9k-devel, and
>> all the maintainers of ath9k and didn't get a single response (except
>> a 'me too' from a fellow ath9k user). I didn't just want to hear
>> crickets this time.
>
> Patches speak more than words, but yeah sorry, we should have
> addressed this there. I've personally have just been busy with
> tackling aggregation.
>
> Luis
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
Appologize about that, I'm wondering
what is or was the main commit that might be causing
all of these disable irq #'s in usb, wireless, etc..
just doing a quick search one cant help but to think
maybe this is something with tick.(but could be wrong);
--
Justin P. Mattock
MjAwOC85LzIzIFN0ZXZlbiBOb29uYW4gPHN0ZXZlbkB1cGxpbmtsYWJzLm5ldD46Cj4gMjAwOC85
LzIzIFN0ZXZlbiBOb29uYW4gPHN0ZXZlbkB1cGxpbmtsYWJzLm5ldD46Cj4+IE9uIE1vbiwgU2Vw
IDIyLCAyMDA4IGF0IDk6MDEgQU0sIEx1aXMgUi4gUm9kcmlndWV6Cj4+IDxscm9kcmlndWV6QGF0
aGVyb3MuY29tPiB3cm90ZToKPj4+IE9uIE1vbiwgU2VwIDIyLCAyMDA4IGF0IDg6MTEgQU0sIFN0
ZXZlbiBOb29uYW4gPHN0ZXZlbkB1cGxpbmtsYWJzLm5ldD4gd3JvdGU6Cj4+Pj4gT24gTW9uLCBT
ZXAgMjIsIDIwMDggYXQgMTowNiBBTSwgTHVpcyBSLiBSb2RyaWd1ZXoKPj4+PiA8bHJvZHJpZ3Vl
ekBhdGhlcm9zLmNvbT4gd3JvdGU6Cj4+Pj4+IFBsZWFzZSB0ZXN0IHRoZSBmb2xsb3dpbmcgcGF0
Y2guCj4+Pj4+Cj4+Pj4+IGRpZmYgLS1naXQgYS9kcml2ZXJzL25ldC93aXJlbGVzcy9hdGg5ay9j
b3JlLmMgYi9kcml2ZXJzL25ldC93aXJlbGVzcy9hdGg5ay9jb3JlLmMKPj4+Pj4gaW5kZXggYzI2
MmVmMi4uOWE1MTczOSAxMDA2NDQKPj4+Pj4gLS0tIGEvZHJpdmVycy9uZXQvd2lyZWxlc3MvYXRo
OWsvY29yZS5jCj4+Pj4+ICsrKyBiL2RyaXZlcnMvbmV0L3dpcmVsZXNzL2F0aDlrL2NvcmUuYwo+
Pj4+PiBAQCAtNjgxLDEwICs2ODEsMTIgQEAgaW50IGF0aF9vcGVuKHN0cnVjdCBhdGhfc29mdGMg
KnNjLCBzdHJ1Y3QgYXRoOWtfY2hhbm5lbCAqaW5pdGlhbF9jaGFuKQo+Pj4+PiAgICAgICAgICog
RW5hYmxlIE1JQiBpbnRlcnJ1cHRzIHdoZW4gdGhlcmUgYXJlIGhhcmR3YXJlIHBoeSBjb3VudGVy
cy4KPj4+Pj4gICAgICAgICAqIE5vdGUgd2Ugb25seSBkbyB0aGlzIChhdCB0aGUgbW9tZW50KSBm
b3Igc3RhdGlvbiBtb2RlLgo+Pj4+PiAgICAgICAgICovCj4+Pj4+ICsjaWYgMAo+Pj4+PiAgICAg
ICAgaWYgKGF0aDlrX2h3X3BoeWNvdW50ZXJzKGFoKSAmJgo+Pj4+PiAgICAgICAgICAgICgoc2Mt
PnNjX2FoLT5haF9vcG1vZGUgPT0gQVRIOUtfTV9TVEEpIHx8Cj4+Pj4+ICAgICAgICAgICAgIChz
Yy0+c2NfYWgtPmFoX29wbW9kZSA9PSBBVEg5S19NX0lCU1MpKSkKPj4+Pj4gICAgICAgICAgICAg
ICAgc2MtPnNjX2ltYXNrIHw9IEFUSDlLX0lOVF9NSUI7Cj4+Pj4+ICsjZW5kaWYKPj4+Pj4gICAg
ICAgIC8qCj4+Pj4+ICAgICAgICAgKiBTb21lIGhhcmR3YXJlIHByb2Nlc3NlcyB0aGUgVElNIElF
IGFuZCBmaXJlcyBhbgo+Pj4+PiAgICAgICAgICogaW50ZXJydXB0IHdoZW4gdGhlIFRJTSBiaXQg
aXMgc2V0LiAgRm9yIGhhcmR3YXJlCj4+Pj4+Cj4+Pj4KPj4+PiBJdCBkaWRuJ3QgYXBwbHkgdG8g
LXJjNywgYnV0IEkgbWFuYWdlZCB0byBhcHBseSBpdCBtYW51YWxseQo+Pj4+IChhcHBhcmVudGx5
IHlvdSBtYWRlIHRoaXMgZm9yIHdpcmVsZXNzLXRlc3Rpbmc/KSBhbmQgYWRkZWQgdGhlICNpZiAw
Lgo+Pj4+IEFmdGVyIDcgaG91cnMgdXB0aW1lIHdpdGggdGhlIGRyaXZlciwgbm8gYXBwYXJlbnQg
aW50ZXJydXB0IHN0b3JtLgo+Pj4+Cj4+Pj4gSSdsbCBkbyBtb3JlIHRob3JvdWdoIHRlc3Rpbmcg
bGF0ZXIsIGJ1dCBJIGhhdmUgYSAzIGhvdXIgZHJpdmUgdG9kYXksCj4+Pj4gYW5kIHNvbWUgcGFj
a2luZyB0byBkbywgc28gaXQnbGwgbmVlZCB0byB3YWl0IGEgYml0Lgo+Pj4KPj4+IFllYWggSSBk
aWQgaXQgZm9yIHdpcmVsZXNzLXRlc3RpbmcuIEdvb2QgdG8gaGVhciB0aGlzIHNvIGZhciBoYXMK
Pj4+IHNvbHZlZCB0aGUgaXNzdWUuIEknbGwgcG9ydCBpdCB0byAyNyBhbmQgcG9zdCBpdCBvbmNl
IEkgZ2V0IHlvdXIKPj4+IGJsZXNzaW5ncyB0aGF0IHRoaXMgZml4ZWQgaXQgYnkgYSBUZXN0ZWQt
YnkuCj4+Pgo+Pgo+PiBPa2F5LiBJJ20gaW4gU2VhdHRsZSBub3cuCj4+Cj4+IFllcywgaXQgc29s
dmVkIHRoZSBpc3N1ZSwgYnV0IHN1cmVseSBhbiAjaWYgMCBpc24ndCBhIHByb3BlciBzb2x1dGlv
bi4KPj4gV2hhdCdzIHRoZSBhY3R1YWwgYnVnIGhlcmU/IEknbSBhZnJhaWQgdGhlIG1lYW5pbmcg
b2Ygd2hhdCdzIGdvaW5nIG9uCj4+IGhlcmUgaXNuJ3QgZXhhY3RseSBpbnR1aXRpdmUuCj4+Cj4+
IEFsc28sIEknbSBkb2luZyBhbiA4IGhvdXIgcnVuIHdpdGggdGhpcyBwYXRjaCB0b25pZ2h0LiBN
YXliZSBtb3JlLAo+PiBkZXBlbmRzIG9uIHdoZW4gSSB3YWtlIHVwLiA7KQo+Pgo+PiBJIHRoaW5r
IGlmIHRoaXMgcmVzb2x2ZXMgdGhlIGlzc3VlIG9uIHR3byBzZXBhcmF0ZSCXOCBob3VyIHJ1bnMs
IGl0Cj4+IHNob3VsZCBiZSBjb25zaWRlcmVkIHRoZSBzb2x1dGlvbiB0byB0aGUgaXNzdWUsIHVu
bGVzcyBpdCByZXN1cmZhY2VzLgo+Pgo+Cj4gSSBhbSBvbmx5IDI1IG1pbnV0ZXMgaW50byB0aGUg
cnVuLCBidXQgSSBmZWVsIHRoaXMgaXMgd29ydGggcmVwb3J0aW5nCj4gYmVmb3JlIEkgZm9yZ2V0
LiBObyBJUlEgc3Rvcm0gdGh1cyBmYXIsIGJ1dCB0aGUgY29ubmVjdGlvbiBrZWVwcwo+IGRyb3Bw
aW5nLiBPciByYXRoZXIsIGl0IGRvZXNuJ3QgZGlzYXNzb2NpYXRlLCBidXQgd2hpbGUgSSB3YXMK
PiB0cmFuc2ZlcnJpbmcgYSBsYXJnZSBmaWxlLCBpdCBoYWQgcG9pbnRzIG9mIHRpbWUgd2hlcmUg
aXQgc3VkZGVubHkKPiBsb3N0IGFuZCByZWFjcXVpcmVkIHRoZSBjb25uZWN0aW9uIChzaWduYWwg
c3RyZW5ndGggZmx1Y3R1YXRlcyB3aWxkbHkKPiBiZXR3ZWVuIDcwLTkwJSBhbmQgdGhlbiAwJSBm
b3IgYSBzZWNvbmQsIGFuZCB0aGVuIGJhY2sgdG8KPiBmbHVjdHVhdGluZykuIFBlcmhhcHMgdGhp
cyBpcyByZWxhdGVkIHRvIHRoZSBsYWNrIG9mIGFnZ3JlZ2F0aW9uPwo+CgpUaGlzIDggaG91ciB0
ZXN0IHBhc3Nlcy4KCi0gU3RldmVuCg==
On Thu, 2008-09-18 at 11:23 -0700, Steven Noonan wrote:
> I'm encountering a very odd issue which only seems to crop up while I
> sleep at night.
Are you still sleeping well? ;)
> In the morning, I go back to my Linux machine and
> there's a ridiculously bad latency I immediately notice in Firefox's
> redraw which I cannot explain.
It looks as though something is just eating CPU time. Are you sure top
isn't showing like 99% hi somewhere in the top? Like there:
Cpu(s): 5.2%us, 3.6%sy, 0.0%ni, 90.8%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
^^^^^^^
johannes
On Sat, Sep 20, 2008 at 2:26 PM, Luis R. Rodriguez
<[email protected]> wrote:
> On Sat, Sep 20, 2008 at 2:22 PM, Steven Noonan <[email protected]> wrote:
>
>> Heh. I don't have an Xbox 360 or PS3 (sad, I know). Do have several
>> 2.4GHz phones. ;)
>
> Microwaves are the best to make the signal wet poop BTW, if you have
> that option available. Not sure if we can trigger the device that way
> but I'm just brainstorming here.
>
Well, I've got the thing set to ATH_DBG_ANY (I want this damn bug
found!), which will certainly be *interesting*. In 5 minutes, the log
has jumped 9 megabytes.
I've also got sadc monitoring IRQs so I know for certain when the
interrupt storm starts.
I'll let you know how this goes.
- Steven
On Fri, Sep 19, 2008 at 10:12:04PM +0530, Steven Noonan wrote:
>
>
> On Fri, 19 Sep 2008, Senthil Balasubramanian wrote:
>
> > On Fri, Sep 19, 2008 at 12:59:29PM +0530, Steven Noonan wrote:
> > > On Thu, Sep 18, 2008 at 8:01 PM, Luis R. Rodriguez
> > > <[email protected]> wrote:
> > > > Thanks for testing, and glad to see this is resolved.
> > > >
> > >
> > > Damn. It's back. I was using wireless normally this evening. Browsing
> > > the web, that kind of thing, and then the wireless inexplicably
> > > dropped (even with the group rekeying patch), so I unloaded/reloaded
> > > the module. This popped up in dmesg:
> > >
> > > [ 3834.375658] vendor=8086 device=27d2
> > > [ 3834.375666] ath9k 0000:03:00.0: PCI INT A disabled
> > > [ 3834.375716] ath9k: driver unloaded
> > > [ 3838.552419] ath9k: 0.1
> > > [ 3838.552502] vendor=8086 device=27d2
> > > [ 3838.552511] ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> > > [ 3838.552532] ath9k 0000:03:00.0: setting latency timer to 64
> > > [ 3838.688924] phy1: Selected rate control algorithm 'ath9k_rate_control'
> > > [ 3838.693652] phy1: Atheros 5416: mem=0xffffc20000060000, irq=17
> > > [ 3839.427125] irq 17: nobody cared (try booting with the "irqpoll" option)
> > > [ 3839.427136] Pid: 0, comm: swapper Tainted: P
> > > 2.6.27-rc6-tip-00478-g74f1a36 #1
> > > [ 3839.427141] Call Trace:
> > > [ 3839.427145] <IRQ> [<ffffffff802219c5>] ? read_hpet+0x9/0x1c
> > > [ 3839.427165] [<ffffffff8026af71>] __report_bad_irq+0x3d/0x8c
> > > [ 3839.427172] [<ffffffff8026b0c6>] note_interrupt+0x106/0x160
> > > [ 3839.427180] [<ffffffff8026b801>] handle_fasteoi_irq+0xad/0xda
> > > [ 3839.427188] [<ffffffff8020f7b0>] do_IRQ+0x10c/0x190
> > > [ 3839.427194] [<ffffffff8020c551>] ret_from_intr+0x0/0xa
> > > [ 3839.427198] <EOI> [<ffffffff8026c6f0>] rcu_pending+0x62/0x6e
> > > [ 3839.427211] [<ffffffff8025bc11>] ? tick_nohz_stop_sched_tick+0x2e4/0x2f3
> > > [ 3839.427218] [<ffffffff8020ad94>] cpu_idle+0x7b/0xdb
> > > [ 3839.427226] [<ffffffff8060c921>] rest_init+0x75/0x77
> > > [ 3839.427231] handlers:
> > > [ 3839.427234] [<ffffffffa0240238>] (ath_isr+0x0/0x170 [ath9k])
> > > [ 3839.427263] Disabling IRQ #17
> > > [ 3842.263699] ADDRCONF(NETDEV_UP): wlan0: link is not ready
> > > [ 3848.035003] ADDRCONF(NETDEV_UP): wlan0: link is not ready
> > > [ 3848.432701] ADDRCONF(NETDEV_UP): wlan0: link is not ready
> > > [ 3850.216947] wlan0: authenticate with AP 00:1e:52:79:4d:01
> > > [ 3850.217027] wlan0: authenticate with AP 00:1e:52:79:4d:01
> > > [ 3850.228326] wlan0: authenticated
> > > [ 3850.228336] wlan0: associate with AP 00:1e:52:79:4d:01
> > > [ 3850.428140] wlan0: associate with AP 00:1e:52:79:4d:01
> > > [ 3850.628151] wlan0: associate with AP 00:1e:52:79:4d:01
> > > [ 3850.728305] wlan0: RX AssocResp from 00:1e:52:79:4d:01 (capab=0x431
> > > status=0 aid=1)
> > > [ 3850.728314] wlan0: associated
> > > [ 3850.728655] wlan0 (WE) : Wireless Event too big (320)
> > > [ 3850.743377] ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
> > > [ 3860.855104] wlan0: no IPv6 routers present
> > >
> > > I rebuilt the module with DBG_ATH_INTERRUPT, but it somehow stumbled
> > > itself back into working order while I was compiling. I can't keep the
> > > interrupt debugging on all the time because it's just -too verbose-,
> > > and when I pop a debug version of the module in, then it's too late to
> > > track the issue....
> >
> > I am able to reproduce this IRQ nobody cared issue in my setup and the
> > following patch seems to be fixing the issue. Please try it out and let
> > me know if it solves your issue in your setup.
>
> The patch you prvide doesn't want to apply. What code did you base this
> on?
>
> The first change listed doesn't work because there is no tasklet_kill() in
> core.c, and the line immediately after ath_stop there has
> "!sc->sc_invalid" instead of the "!(sc->sc_flags & SC_OP_INVALID)".
>
> The second fails because SC_OP_INVALID isn't defined.
>
> However, if your patch did apply to my code, I bet it'd solve the issue,
> based on what it says it does.
I am on 2.6.27-rc6 and this patch is on top of my earlier patch
titled "[PATCH] ath9k: connectivity is lost after Group rekeying is
done". However this patch can be applied on top of latest wireless testing
too. I could apply this patch succesfully on top of wireless testing
git tree.
My git-describe says v2.6.27-rc6-1378-g34e512f.
There is no sc_invalid flag in "struct ath_softc" today. Where did
you get this variable from? It was removed in the following commit
-----------------------------------------------
commit f2c9705a05ecbc0d94216a3b042d5641e8bf70b1
Author: Sujith <[email protected]>
Date: Mon Aug 11 14:05:08 2008 +0530
ath9k: Use bitfields for sc operations
Signed-off-by: Sujith Manoharan <[email protected]>
Signed-off-by: John W. Linville <[email protected]>
-----------------------------------------------
Which code base are you using?
On Sat, Sep 20, 2008 at 2:22 PM, Steven Noonan <[email protected]> wrote:
> Heh. I don't have an Xbox 360 or PS3 (sad, I know). Do have several
> 2.4GHz phones. ;)
Microwaves are the best to make the signal wet poop BTW, if you have
that option available. Not sure if we can trigger the device that way
but I'm just brainstorming here.
Luis
On Fri, Sep 19, 2008 at 12:29 AM, Steven Noonan <[email protected]> wrote:
> On Thu, Sep 18, 2008 at 8:01 PM, Luis R. Rodriguez
> <[email protected]> wrote:
>> Thanks for testing, and glad to see this is resolved.
>>
>
> Damn. It's back. I was using wireless normally this evening. Browsing
> the web, that kind of thing, and then the wireless inexplicably
> dropped (even with the group rekeying patch), so I unloaded/reloaded
> the module. This popped up in dmesg:
>
> [ 3834.375658] vendor=8086 device=27d2
> [ 3834.375666] ath9k 0000:03:00.0: PCI INT A disabled
> [ 3834.375716] ath9k: driver unloaded
> [ 3838.552419] ath9k: 0.1
> [ 3838.552502] vendor=8086 device=27d2
> [ 3838.552511] ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> [ 3838.552532] ath9k 0000:03:00.0: setting latency timer to 64
> [ 3838.688924] phy1: Selected rate control algorithm 'ath9k_rate_control'
> [ 3838.693652] phy1: Atheros 5416: mem=0xffffc20000060000, irq=17
> [ 3839.427125] irq 17: nobody cared (try booting with the "irqpoll" option)
> [ 3839.427136] Pid: 0, comm: swapper Tainted: P
> 2.6.27-rc6-tip-00478-g74f1a36 #1
> [ 3839.427141] Call Trace:
> [ 3839.427145] <IRQ> [<ffffffff802219c5>] ? read_hpet+0x9/0x1c
> [ 3839.427165] [<ffffffff8026af71>] __report_bad_irq+0x3d/0x8c
> [ 3839.427172] [<ffffffff8026b0c6>] note_interrupt+0x106/0x160
> [ 3839.427180] [<ffffffff8026b801>] handle_fasteoi_irq+0xad/0xda
> [ 3839.427188] [<ffffffff8020f7b0>] do_IRQ+0x10c/0x190
> [ 3839.427194] [<ffffffff8020c551>] ret_from_intr+0x0/0xa
> [ 3839.427198] <EOI> [<ffffffff8026c6f0>] rcu_pending+0x62/0x6e
> [ 3839.427211] [<ffffffff8025bc11>] ? tick_nohz_stop_sched_tick+0x2e4/0x2f3
> [ 3839.427218] [<ffffffff8020ad94>] cpu_idle+0x7b/0xdb
> [ 3839.427226] [<ffffffff8060c921>] rest_init+0x75/0x77
> [ 3839.427231] handlers:
> [ 3839.427234] [<ffffffffa0240238>] (ath_isr+0x0/0x170 [ath9k])
> [ 3839.427263] Disabling IRQ #17
> [ 3842.263699] ADDRCONF(NETDEV_UP): wlan0: link is not ready
> [ 3848.035003] ADDRCONF(NETDEV_UP): wlan0: link is not ready
> [ 3848.432701] ADDRCONF(NETDEV_UP): wlan0: link is not ready
> [ 3850.216947] wlan0: authenticate with AP 00:1e:52:79:4d:01
> [ 3850.217027] wlan0: authenticate with AP 00:1e:52:79:4d:01
> [ 3850.228326] wlan0: authenticated
> [ 3850.228336] wlan0: associate with AP 00:1e:52:79:4d:01
> [ 3850.428140] wlan0: associate with AP 00:1e:52:79:4d:01
> [ 3850.628151] wlan0: associate with AP 00:1e:52:79:4d:01
> [ 3850.728305] wlan0: RX AssocResp from 00:1e:52:79:4d:01 (capab=0x431
> status=0 aid=1)
> [ 3850.728314] wlan0: associated
> [ 3850.728655] wlan0 (WE) : Wireless Event too big (320)
> [ 3850.743377] ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
> [ 3860.855104] wlan0: no IPv6 routers present
>
> I rebuilt the module with DBG_ATH_INTERRUPT, but it somehow stumbled
> itself back into working order while I was compiling. I can't keep the
> interrupt debugging on all the time because it's just -too verbose-,
> and when I pop a debug version of the module in, then it's too late to
> track the issue....
>
Do pardon me if this is incredibly stupid, but I have some thoughts
about ath9k interrupt handling. Let me know if there's any validity
here. And if there's any invalid points, please help me understand
what's going on. :)
First of all, I know for certain the ath9k interrupt handler is
registered before the driver prints the line saying the name of the
device ("phy1: Atheros 5416"). So any IRQs to IRQ 17 should be seen by
the ath9k handler (ath_isr) after that point, right? Shortly after
that point is where we're seeing the "nobody cared". This seems to
suggest that it registered the handler successfully (because we didn't
get an error message), but the handler returned IRQ_NONE.
Second of all, I'm looking at the ath9k interrupt handler right now,
and there are a few cases where it returns IRQ_NONE. And here's where
I'm a bit fuzzy. Since there could be any number of things using IRQ
17 (though in my case, ath9k is on its own dedicated IRQ), it seems
odd that you check the value of sc->sc_invalid, since the cookie
passed to the handler might not actually be ath9k's cookie if multiple
drivers have registered IRQ handlers for that particular IRQ. Who
knows if what you're reading is even valid? Heck, what if some driver
uses a NULL for their cookie (however unlikely)? You'd get a
segmentation fault on the second line of the interrupt handler. Of
course, I could be completely wrong: does parent interrupt handler
check to see which device driver owns the particular device signaling
an IRQ and then call the appropriate handler?
- Steven
On Fri, Sep 19, 2008 at 4:53 PM, Luis R. Rodriguez
<[email protected]> wrote:
> On Fri, Sep 19, 2008 at 04:31:20PM -0700, Steven Noonan wrote:
>>
>> I'm seeing the same latency issue I reported originally with the latest
>> -tip kernel and WITHOUT irqpoll.
>
> Damn it.
>
>> Apparently there's something else to this...
>
> Steven,
>
> Can you reproduce this? If so can you tell us how? If you can reproduce
> can you try to test this with wireless-testing as well? That is still
> based on 2.6.26-rc6 but its just has all the new wireless development
> stuff queued up for 2.6.28 so far. Everything else is the same.
It's a weird problem, because I'm able to reproduce it by running a
kernel with ath9k loaded for a few hours. In this particular case, I
only ran this kernel for about 3.5 hours , but the delay when VT
switching was getting noticeable. Unload/reload of ath9k and the
latency disappeared again. It's definitely ath9k's doing, but I don't
know how. I'm finding the delay is most noticeable after 7 hours
uptime with ath9k loaded.
I'll pull wireless-testing and run that kernel for another 8 hours.
What's the URL for the repo? Surely you don't mean the
wireless-testing on git.kernel.org that hasn't been touched for almost
a week?
I kind of wish there was a better debugging facility in ath9k. With
ATH_DBG_INTERRUPTS, I see several hundred messages per second, which
fills log files up like crazy (I switched to ftrace_printk as Ingo
suggested, which has alleviated this somewhat). Is there a way to make
it less verbose (perhaps so that it prints data at a rate of no more
than 2 or 3 messages per second)? Or maybe the rate at which it prints
this stuff is indicative something's wrong?
- Steven
MjAwOC85LzIzIFN0ZXZlbiBOb29uYW4gPHN0ZXZlbkB1cGxpbmtsYWJzLm5ldD46Cj4gT24gTW9u
LCBTZXAgMjIsIDIwMDggYXQgOTowMSBBTSwgTHVpcyBSLiBSb2RyaWd1ZXoKPiA8bHJvZHJpZ3Vl
ekBhdGhlcm9zLmNvbT4gd3JvdGU6Cj4+IE9uIE1vbiwgU2VwIDIyLCAyMDA4IGF0IDg6MTEgQU0s
IFN0ZXZlbiBOb29uYW4gPHN0ZXZlbkB1cGxpbmtsYWJzLm5ldD4gd3JvdGU6Cj4+PiBPbiBNb24s
IFNlcCAyMiwgMjAwOCBhdCAxOjA2IEFNLCBMdWlzIFIuIFJvZHJpZ3Vlego+Pj4gPGxyb2RyaWd1
ZXpAYXRoZXJvcy5jb20+IHdyb3RlOgo+Pj4+IFBsZWFzZSB0ZXN0IHRoZSBmb2xsb3dpbmcgcGF0
Y2guCj4+Pj4KPj4+PiBkaWZmIC0tZ2l0IGEvZHJpdmVycy9uZXQvd2lyZWxlc3MvYXRoOWsvY29y
ZS5jIGIvZHJpdmVycy9uZXQvd2lyZWxlc3MvYXRoOWsvY29yZS5jCj4+Pj4gaW5kZXggYzI2MmVm
Mi4uOWE1MTczOSAxMDA2NDQKPj4+PiAtLS0gYS9kcml2ZXJzL25ldC93aXJlbGVzcy9hdGg5ay9j
b3JlLmMKPj4+PiArKysgYi9kcml2ZXJzL25ldC93aXJlbGVzcy9hdGg5ay9jb3JlLmMKPj4+PiBA
QCAtNjgxLDEwICs2ODEsMTIgQEAgaW50IGF0aF9vcGVuKHN0cnVjdCBhdGhfc29mdGMgKnNjLCBz
dHJ1Y3QgYXRoOWtfY2hhbm5lbCAqaW5pdGlhbF9jaGFuKQo+Pj4+ICAgICAgICAgKiBFbmFibGUg
TUlCIGludGVycnVwdHMgd2hlbiB0aGVyZSBhcmUgaGFyZHdhcmUgcGh5IGNvdW50ZXJzLgo+Pj4+
ICAgICAgICAgKiBOb3RlIHdlIG9ubHkgZG8gdGhpcyAoYXQgdGhlIG1vbWVudCkgZm9yIHN0YXRp
b24gbW9kZS4KPj4+PiAgICAgICAgICovCj4+Pj4gKyNpZiAwCj4+Pj4gICAgICAgIGlmIChhdGg5
a19od19waHljb3VudGVycyhhaCkgJiYKPj4+PiAgICAgICAgICAgICgoc2MtPnNjX2FoLT5haF9v
cG1vZGUgPT0gQVRIOUtfTV9TVEEpIHx8Cj4+Pj4gICAgICAgICAgICAgKHNjLT5zY19haC0+YWhf
b3Btb2RlID09IEFUSDlLX01fSUJTUykpKQo+Pj4+ICAgICAgICAgICAgICAgIHNjLT5zY19pbWFz
ayB8PSBBVEg5S19JTlRfTUlCOwo+Pj4+ICsjZW5kaWYKPj4+PiAgICAgICAgLyoKPj4+PiAgICAg
ICAgICogU29tZSBoYXJkd2FyZSBwcm9jZXNzZXMgdGhlIFRJTSBJRSBhbmQgZmlyZXMgYW4KPj4+
PiAgICAgICAgICogaW50ZXJydXB0IHdoZW4gdGhlIFRJTSBiaXQgaXMgc2V0LiAgRm9yIGhhcmR3
YXJlCj4+Pj4KPj4+Cj4+PiBJdCBkaWRuJ3QgYXBwbHkgdG8gLXJjNywgYnV0IEkgbWFuYWdlZCB0
byBhcHBseSBpdCBtYW51YWxseQo+Pj4gKGFwcGFyZW50bHkgeW91IG1hZGUgdGhpcyBmb3Igd2ly
ZWxlc3MtdGVzdGluZz8pIGFuZCBhZGRlZCB0aGUgI2lmIDAuCj4+PiBBZnRlciA3IGhvdXJzIHVw
dGltZSB3aXRoIHRoZSBkcml2ZXIsIG5vIGFwcGFyZW50IGludGVycnVwdCBzdG9ybS4KPj4+Cj4+
PiBJJ2xsIGRvIG1vcmUgdGhvcm91Z2ggdGVzdGluZyBsYXRlciwgYnV0IEkgaGF2ZSBhIDMgaG91
ciBkcml2ZSB0b2RheSwKPj4+IGFuZCBzb21lIHBhY2tpbmcgdG8gZG8sIHNvIGl0J2xsIG5lZWQg
dG8gd2FpdCBhIGJpdC4KPj4KPj4gWWVhaCBJIGRpZCBpdCBmb3Igd2lyZWxlc3MtdGVzdGluZy4g
R29vZCB0byBoZWFyIHRoaXMgc28gZmFyIGhhcwo+PiBzb2x2ZWQgdGhlIGlzc3VlLiBJJ2xsIHBv
cnQgaXQgdG8gMjcgYW5kIHBvc3QgaXQgb25jZSBJIGdldCB5b3VyCj4+IGJsZXNzaW5ncyB0aGF0
IHRoaXMgZml4ZWQgaXQgYnkgYSBUZXN0ZWQtYnkuCj4+Cj4KPiBPa2F5LiBJJ20gaW4gU2VhdHRs
ZSBub3cuCj4KPiBZZXMsIGl0IHNvbHZlZCB0aGUgaXNzdWUsIGJ1dCBzdXJlbHkgYW4gI2lmIDAg
aXNuJ3QgYSBwcm9wZXIgc29sdXRpb24uCj4gV2hhdCdzIHRoZSBhY3R1YWwgYnVnIGhlcmU/IEkn
bSBhZnJhaWQgdGhlIG1lYW5pbmcgb2Ygd2hhdCdzIGdvaW5nIG9uCj4gaGVyZSBpc24ndCBleGFj
dGx5IGludHVpdGl2ZS4KPgo+IEFsc28sIEknbSBkb2luZyBhbiA4IGhvdXIgcnVuIHdpdGggdGhp
cyBwYXRjaCB0b25pZ2h0LiBNYXliZSBtb3JlLAo+IGRlcGVuZHMgb24gd2hlbiBJIHdha2UgdXAu
IDspCj4KPiBJIHRoaW5rIGlmIHRoaXMgcmVzb2x2ZXMgdGhlIGlzc3VlIG9uIHR3byBzZXBhcmF0
ZSCXOCBob3VyIHJ1bnMsIGl0Cj4gc2hvdWxkIGJlIGNvbnNpZGVyZWQgdGhlIHNvbHV0aW9uIHRv
IHRoZSBpc3N1ZSwgdW5sZXNzIGl0IHJlc3VyZmFjZXMuCj4KCkkgYW0gb25seSAyNSBtaW51dGVz
IGludG8gdGhlIHJ1biwgYnV0IEkgZmVlbCB0aGlzIGlzIHdvcnRoIHJlcG9ydGluZwpiZWZvcmUg
SSBmb3JnZXQuIE5vIElSUSBzdG9ybSB0aHVzIGZhciwgYnV0IHRoZSBjb25uZWN0aW9uIGtlZXBz
CmRyb3BwaW5nLiBPciByYXRoZXIsIGl0IGRvZXNuJ3QgZGlzYXNzb2NpYXRlLCBidXQgd2hpbGUg
SSB3YXMKdHJhbnNmZXJyaW5nIGEgbGFyZ2UgZmlsZSwgaXQgaGFkIHBvaW50cyBvZiB0aW1lIHdo
ZXJlIGl0IHN1ZGRlbmx5Cmxvc3QgYW5kIHJlYWNxdWlyZWQgdGhlIGNvbm5lY3Rpb24gKHNpZ25h
bCBzdHJlbmd0aCBmbHVjdHVhdGVzIHdpbGRseQpiZXR3ZWVuIDcwLTkwJSBhbmQgdGhlbiAwJSBm
b3IgYSBzZWNvbmQsIGFuZCB0aGVuIGJhY2sgdG8KZmx1Y3R1YXRpbmcpLiBQZXJoYXBzIHRoaXMg
aXMgcmVsYXRlZCB0byB0aGUgbGFjayBvZiBhZ2dyZWdhdGlvbj8KCi0gU3RldmVuCg==
On Sat, 27 Sep 2008, Ingo Molnar wrote:
>
> * Steven Rostedt <[email protected]> wrote:
>
> >
> > I hate trips where my email box becomes filled. I skip over a lot of
> > emails that I should not have.
> >
> > On Fri, 19 Sep 2008, Ingo Molnar wrote:
> > >
> > > What is needed is to set kernel/trace/trace.c's tracing_disabled
> > > variable to 0. I've Cc:-ed Steve Rostedt - Steve, why is there no easy
> > > ad-hoc way to trigger a stop-tracing event without elaborate changes?
> >
> > I think I added a way to do this. Hmm, I think it was in -rt that I added
> > it, and have not gotten around to sending you the updates.
> >
> > Yes, we need an easy way to disable the tracer. I think we have
> > ftrace_disable() function. Perhaps that only disables tracing that
> > checks the tracing_disabled variable.
>
> we have kill_ftrace(), but that will permanently zap it.
OK, on Monday, as I work on getting ftrace to use the ring buffer, I'll
also look to see which features are only in -rt that need to go to tip.
-- Steve
On Thu, Sep 18, 2008 at 07:52:16PM -0700, Steven Noonan wrote:
> This IRQ debugging is lots of fun. My /var/log/messages jumped in size
> to 241MB. :)
>
> My test case was a file transfer that basically copies 50MB from a
> remote machine's /dev/urandom and writes it to /dev/null locally. That
> should be a pretty nice large RX. And jeez, I'm seeing tremendous
> download speeds now and I have no idea what is different about my
> configuration. I'm getting up to 24 megabit per second transfer rates
> (actual), which is incredibly unusual for ath9k.
>
> I wasn't able to reproduce the IRQ problems on -tip or Linus' current
> tree (both with the group rekeying fix applied, which may or may not
> be a good idea for reproducing the issue). Even tried doing
> rmmod/modprobe in fast succession multiple times. Couldn't get it to
> throw up.
>
> I wonder... My rmmod/modprobe originally was used after the improper
> group rekeying before in order to get it back to normal. Perhaps the
> encryption has to go wonky before this IRQ issue surfaces? Another
> possibility is the earlier mentioned "ACPI: EC: GPE storm detected,
> disabling EC GPE". I don't know if it's related or not, but I haven't
> waited for it to show up before testing.
>
> Oh well, if the IRQ problem resurfaces (or if I get a ridiculous
> latency tomorrow morning with irqpoll off), I'll try to get some
> useful data to help debug it.
Thanks for testing, and glad to see this is resolved.
Luis
On Thu, Sep 18, 2008 at 12:23 PM, Steven Noonan <[email protected]> wrote:
> On Thu, Sep 18, 2008 at 12:00 PM, Steven Noonan <[email protected]> wrote:
>> On Thu, Sep 18, 2008 at 11:42 AM, Luis R. Rodriguez
>> <[email protected]> wrote:
>>> On Thu, Sep 18, 2008 at 11:34 AM, Luis R. Rodriguez
>>>> irqpoll is a monster of evil and that should make your system crawl to
>>>> its knees. I would advise instead we work with you fixing the the
>>>> missed interrupts issue upon rmmod.
>>>
>>> Also, please provide the output of
>>>
>>> cat /proc/interrupts
>>
>> Note that the problem necessitating use of irqpoll in the first place
>> seems to only happen under certain conditions. I am unsure what these
>> conditions are. Before 'ath9k: connectivity is lost after Group
>> rekeying is done', I had used rmmod/modprobe as my solution to the
>> issue, which triggered the IRQ issue.
>
> I wonder if this is related:
>
> alcarin ~ # dmesg | grep GPE
> [ 0.166806] ACPI: EC: GPE = 0x17, I/O: command/status = 0x66, data = 0x62
> [ 685.409077] ACPI: EC: GPE storm detected, disabling EC GPE
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
FWIW, you might want to try this pach
in this thread:
http://www.gossamer-threads.com/lists/linux/kernel/966076
I was receiving a disable irq #23 this patch"knock on wood",
seems to help with this issue.
--
Justin P. Mattock
On Sun, Sep 21, 2008 at 11:54:53PM -0700, Luis Rodriguez wrote:
> On Sat, Sep 20, 2008 at 06:48:05PM -0700, Steven Noonan wrote:
> >
> > OK, here's the time index at which it b0rked:
> >
> > 18:28:42 sum 1155.00
> > 18:28:43 sum 1339.00
> > 18:28:44 sum 18355.00
> > 18:28:45 sum 17845.45
> >
> > And the entire log of the minute 18:28 is here:
> > http://www.uplinklabs.net/~tycho/linux/18.28.00-59.log.gz
> >
> > It looks pretty crazy.
>
> Thanks, I'll take a look when I get a chance (remember aggretion
> is in the queue too) unless someone beats me to it. Asking in advance
> -- if you can also provide a log of minutes 43, 44 and 45 that would
> be great, but this may just be enough.
Oh wait, those are seconds, nevermind :)
Luis
On Sun, Sep 21, 2008 at 11:54 PM, Luis R. Rodriguez
<[email protected]> wrote:
> On Sat, Sep 20, 2008 at 06:48:05PM -0700, Steven Noonan wrote:
>>
>> OK, here's the time index at which it b0rked:
>>
>> 18:28:42 sum 1155.00
>> 18:28:43 sum 1339.00
>> 18:28:44 sum 18355.00
>> 18:28:45 sum 17845.45
>>
>> And the entire log of the minute 18:28 is here:
>> http://www.uplinklabs.net/~tycho/linux/18.28.00-59.log.gz
>>
>> It looks pretty crazy.
>
> Thanks, I'll take a look when I get a chance (remember aggretion
> is in the queue too) unless someone beats me to it.
>
You may want to take a quick look at it now. I certainly don't mean
you should push aggregation aside, but the log looks kind of screwy at
18:28:43 (look at line 4972 to see what I mean). It looks as though it
did printk()s from multiple threads with no locking mechanism. It's
possible I will need to re-capture the error, and I'd like to know if
this log will help or not.
- Steven
2008/9/20 Johannes Berg <[email protected]>:
> On Sat, 2008-09-20 at 13:57 -0700, Luis R. Rodriguez wrote:
>
>> mcgrof@tesla ~/tmp/irq $ sar -f irqlog-storm.sar -A
>> Invalid system activity file: irqlog-storm.sar
>>
>> Perhaps we its a different format than what my sar can read...
>
> It changes like every release... Not a very useful thing, that.
Ah -- oh well.
OK so yeah you have an interrupt storm, this is evident.
Do you think there is a way you can try to trigger the interrupt storm
so it happens < than after 7 hours or however long its taking you? If
so then it'd be good to get your debug out using the interrupt debug.
If that's not useful then we can enable all debug stuff (file will be
huge).
Try turning on the microwave, turn on all your bluetooth devices, use
your 2.4 GHz wireless phone, use your xbox 360 controllers and ps3
controllers all at once :D
Luis
On Sat, Sep 20, 2008 at 1:57 PM, Luis R. Rodriguez
<[email protected]> wrote:
> So here's what we need to do -- try to zero on in what causes this IRQ
> storm.
How is the best way to approach this? I assume ATH_DBG_INTERRUPT and
pray like hell my /var partition doesn't fill up first? Maybe I'll
just symlink /var/log/messages to a file on a substantially larger
pattition...
> Most likely its some strange event that triggers it, perhaps
> some sort of noise which makes the device try to adapt and hits a path
> we are not handling correctly yet. Can you provide the initial log of
> when ath9k gets loaded? This tells us what card it is, like AR5416 or
> AR9280, etc.
[10238.347956] ath9k: 0.1
[10238.348028] vendor=8086 device=27d2
[10238.348033] ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
[10238.348050] ath9k 0000:03:00.0: setting latency timer to 64
[10238.480781] phy1: Selected rate control algorithm 'ath9k_rate_control'
[10238.483320] phy1: Atheros 5416: mem=0xffffc20000060000, irq=17
It claims to be a 5416, but I was under the impression that it was the
5418 that came with the MacBook Pro.
On Sat, Sep 20, 2008 at 8:21 AM, Steven Noonan <[email protected]> wrote:
> On Fri, Sep 19, 2008 at 8:29 PM, Steven Noonan <[email protected]> wrote:
>> And here's the change as I unload and reload ath9k:
>>
>> [...]
>> 20:21:38 sum 20138.83
>> 20:21:40 sum 21576.09
>> 20:21:41 sum 17678.38
>> 20:21:42 sum 19834.34
>> 20:21:43 sum 19845.10
>> 20:21:44 sum 19470.00
>> 20:21:45 sum 9273.00 <--- ath9k unloaded/reloaded
>> 20:21:46 sum 187.00
>> 20:21:47 sum 1066.00
>> 20:21:48 sum 610.00
>> 20:21:49 sum 106.93
>> 20:21:50 sum 670.00
>> 20:21:51 sum 742.00
>> 20:21:52 sum 781.00
>> Average: sum 19562.56
>>
>
> I think it might help if you can see the .sar file yourself. I've
> uploaded a gzipped copy (great compression ratio in this case for
> merely using gzip!).
>
> http://www.uplinklabs.net/~tycho/linux/irqlog-storm.sar.gz
mcgrof@tesla ~/tmp/irq $ sar -f irqlog-storm.sar -A
Invalid system activity file: irqlog-storm.sar
Perhaps we its a different format than what my sar can read...
So here's what we need to do -- try to zero on in what causes this IRQ
storm. Most likely its some strange event that triggers it, perhaps
some sort of noise which makes the device try to adapt and hits a path
we are not handling correctly yet. Can you provide the initial log of
when ath9k gets loaded? This tells us what card it is, like AR5416 or
AR9280, etc.
Luis
On Thu, Sep 18, 2008 at 1:44 PM, Luis R. Rodriguez
<[email protected]> wrote:
> On Thu, Sep 18, 2008 at 12:00 PM, Steven Noonan <[email protected]> wrote:
>> On Thu, Sep 18, 2008 at 11:42 AM, Luis R. Rodriguez
>> <[email protected]> wrote:
>>> On Thu, Sep 18, 2008 at 11:34 AM, Luis R. Rodriguez
>>>> irqpoll is a monster of evil and that should make your system crawl to
>>>> its knees. I would advise instead we work with you fixing the the
>>>> missed interrupts issue upon rmmod.
>>>
>>> Also, please provide the output of
>>>
>>> cat /proc/interrupts
>>
>> Note that the problem necessitating use of irqpoll in the first place
>> seems to only happen under certain conditions. I am unsure what these
>> conditions are. Before 'ath9k: connectivity is lost after Group
>> rekeying is done',
>
> You mean this patch:
>
> [PATCH] ath9k: connectivity is lost after Group rekeying is done
> http://marc.info/?l=linux-wireless&m=122163541519736&w=2
>
> So let me get this straight -- you applied this new patch, and haven't
> tried disabling irqpoll now?
I hadn't at the time of that writing, no. I saw it as a fix for future
ignored IRQs, and hadn't noticed any difference with it on or off. So
it seemed like since there was no consequence having it enabled, why
not leave it enabled all the time?
Now that I'm aware it's the spawn of satan, I'm trying with it off. So
far, so good. But I haven't had need to reload ath9k so frequently,
and even when I do, I can't reproduce the specific conditions which
caused the problem in the first place.
>> I had used rmmod/modprobe as my solution to the
>> issue, which triggered the IRQ issue.
>
> Understood, but I also have used this before with ath9k and I got
> exactly the same results you did -- I just refused to use it again and
> just try to fix the issues present.
>
> ath9k issues tons of interrupts, not sure why irqpoll option would
> cause latency so bad as the interrupts *are* handled. Not sure
> *exactly* how irqpoll works but its description mentions using it
> forces each interrupt handler on the IRQ line to check the interrupt
> is for it. You have to keep in mind that not only are ath9k interrupts
> then being sent to the devices on its line but it would seem that all
> other devices on each line would suffer from the interrupts of the
> other guys. Why ath9k would be the *only* culprit of causing latency
> when using irqpoll if the irq line it son is clean? Beats me.
I'm guessing there's at least one interrupt that wasn't accounted for somehow.
>
>> alcarin steven # cat /proc/interrupts
>> CPU0 CPU1
>> 0x0: 63227 0 IO-APIC-edge hpet
>> 0x8: 1 0 IO-APIC-edge rtc0
>> 0x9: 13080 0 IO-APIC-fasteoi acpi
>> 0xe: 8195 0 IO-APIC-edge ide0
>> 0xf: 0 0 IO-APIC-edge ide1
>> 0x10: 36 0 IO-APIC-fasteoi uhci_hcd:usb5
>> 0x11: 10645 0 IO-APIC-fasteoi ath
>
> In this case your 11n Atheros device is on a clean line.
>
>> 0x12: 42 0 IO-APIC-fasteoi uhci_hcd:usb4
>> 0x17: 919 0 IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb2
>
> But it was this interrupt line which had an interrupt not handled.
No, it's in hex. 0x17 = 23, 0x11 = 17. IRQ 17 is the one that pooped
in my case, which is my wireless chipset.
>> 0x13: 32885 0 IO-APIC-fasteoi uhci_hcd:usb3,
>> ata_piix, ohci1394
>> 0x200100: 1 0 PCI-MSI-edge eth0
>> 0x16: 223 0 IO-APIC-fasteoi HDA Intel
>> NMI: 0 0 Non-maskable interrupts
>> LOC: 78087 95718 Local timer interrupts
>> RES: 11576 16384 Rescheduling interrupts
>> CAL: 6862 8889 Function call interrupts
>> TLB: 54 41 TLB shootdowns
>> TRM: 0 0 Thermal event interrupts
>> THR: 0 0 Threshold APIC interrupts
>> SPU: 0 0 Spurious interrupts
>> ERR: 0
>
> Can you try to reproduce the irq not handled again?
If I do, I'll need to know what precisely to do about it. What debug
info should I collect before rebooting?
>
>>>
>>> and also please do not cross post to all these lists, just use
>>> linux-wireless or ath9k.
>>>
>>
>> Sorry, but in the past I've posted to linux-wireless, ath9k-devel, and
>> all the maintainers of ath9k and didn't get a single response (except
>> a 'me too' from a fellow ath9k user). I didn't just want to hear
>> crickets this time.
>
> Patches speak more than words, but yeah sorry, we should have
> addressed this there. I've personally have just been busy with
> tackling aggregation.
>
Which is far more important, I agree. It's annoying to get speeds
<802.11b on my pre-802.11n capable chipset and network. ;)
- Steven
On Fri, Sep 19, 2008 at 6:13 PM, Luis R. Rodriguez
<[email protected]> wrote:
> That is capital "i", this will show you the monitor results of number
> of interrupts on IRQ 17. You want to see if the rate of change is huge
> when you have latency issues. If not then its most likely
I meant most likely *NOT* an interrupt storm.
> an interrupt
> storm by ath9k (which is my suspect right now) but something else.
>
> Luis
>
This IRQ debugging is lots of fun. My /var/log/messages jumped in size
to 241MB. :)
My test case was a file transfer that basically copies 50MB from a
remote machine's /dev/urandom and writes it to /dev/null locally. That
should be a pretty nice large RX. And jeez, I'm seeing tremendous
download speeds now and I have no idea what is different about my
configuration. I'm getting up to 24 megabit per second transfer rates
(actual), which is incredibly unusual for ath9k.
I wasn't able to reproduce the IRQ problems on -tip or Linus' current
tree (both with the group rekeying fix applied, which may or may not
be a good idea for reproducing the issue). Even tried doing
rmmod/modprobe in fast succession multiple times. Couldn't get it to
throw up.
I wonder... My rmmod/modprobe originally was used after the improper
group rekeying before in order to get it back to normal. Perhaps the
encryption has to go wonky before this IRQ issue surfaces? Another
possibility is the earlier mentioned "ACPI: EC: GPE storm detected,
disabling EC GPE". I don't know if it's related or not, but I haven't
waited for it to show up before testing.
Oh well, if the IRQ problem resurfaces (or if I get a ridiculous
latency tomorrow morning with irqpoll off), I'll try to get some
useful data to help debug it.
On Thu, Sep 18, 2008 at 02:49:45PM -0700, Steven Noonan wrote:
>
> No, it's in hex. 0x17 = 23, 0x11 = 17. IRQ 17 is the one that pooped
> in my case, which is my wireless chipset.
Ah - boo.
> If I do, I'll need to know what precisely to do about it. What debug
> info should I collect before rebooting?
Sure, OK so in ath9k's core.h we have a macro used to indicate what
type of debugging stuff gets print out. By default we only leave
enabled the fatal stuff.
#define DBG_DEFAULT (ATH_DBG_FATAL )
You want to add to this ATH_DBG_INTERRUPT as follows:
#define DBG_DEFAULT (ATH_DBG_FATAL | ATH_DBG_INTERRUPT)
Then since the issue seems to come up when you rmmod, simply try to
download do some large RXing or TXing or both and then rmmod. Capture
the output and send along.
> > Patches speak more than words, but yeah sorry, we should have
> > addressed this there. I've personally have just been busy with
> > tackling aggregation.
> >
>
> Which is far more important, I agree. It's annoying to get speeds
> <802.11b on my pre-802.11n capable chipset and network. ;)
Yeah exactly, I hope the work we do will get in for 27. We'll see.
Luis
On Fri, 19 Sep 2008, Luis R. Rodriguez wrote:
> Thanks for the patch Senthil. He is on Linus' v2.6.27-rc, you are on
> wireless-testing so I've ported it (no changes, it just applies cleanly
> now). This is also untested by me:
I'm seeing the same latency issue I reported originally with the latest
-tip kernel and WITHOUT irqpoll.
Apparently there's something else to this...
- Steven
On Mon, Sep 22, 2008 at 1:06 AM, Luis R. Rodriguez
<[email protected]> wrote:
> Please test the following patch.
>
> diff --git a/drivers/net/wireless/ath9k/core.c b/drivers/net/wireless/ath9k/core.c
> index c262ef2..9a51739 100644
> --- a/drivers/net/wireless/ath9k/core.c
> +++ b/drivers/net/wireless/ath9k/core.c
> @@ -681,10 +681,12 @@ int ath_open(struct ath_softc *sc, struct ath9k_channel *initial_chan)
> * Enable MIB interrupts when there are hardware phy counters.
> * Note we only do this (at the moment) for station mode.
> */
> +#if 0
> if (ath9k_hw_phycounters(ah) &&
> ((sc->sc_ah->ah_opmode == ATH9K_M_STA) ||
> (sc->sc_ah->ah_opmode == ATH9K_M_IBSS)))
> sc->sc_imask |= ATH9K_INT_MIB;
> +#endif
> /*
> * Some hardware processes the TIM IE and fires an
> * interrupt when the TIM bit is set. For hardware
>
It didn't apply to -rc7, but I managed to apply it manually
(apparently you made this for wireless-testing?) and added the #if 0.
After 7 hours uptime with the driver, no apparent interrupt storm.
I'll do more thorough testing later, but I have a 3 hour drive today,
and some packing to do, so it'll need to wait a bit.
- Steven
On Sat, Sep 20, 2008 at 2:48 PM, Steven Noonan <[email protected]> wrote:
> On Sat, Sep 20, 2008 at 2:26 PM, Luis R. Rodriguez
> <[email protected]> wrote:
>> On Sat, Sep 20, 2008 at 2:22 PM, Steven Noonan <[email protected]> wrote:
>>
>>> Heh. I don't have an Xbox 360 or PS3 (sad, I know). Do have several
>>> 2.4GHz phones. ;)
>>
>> Microwaves are the best to make the signal wet poop BTW, if you have
>> that option available. Not sure if we can trigger the device that way
>> but I'm just brainstorming here.
>>
>
> Well, I've got the thing set to ATH_DBG_ANY (I want this damn bug
> found!), which will certainly be *interesting*. In 5 minutes, the log
> has jumped 9 megabytes.
>
> I've also got sadc monitoring IRQs so I know for certain when the
> interrupt storm starts.
>
> I'll let you know how this goes.
OK, here's the time index at which it b0rked:
18:28:38 sum 1106.00
18:28:39 sum 1037.62
18:28:40 sum 1069.00
18:28:41 sum 1167.00
18:28:42 sum 1155.00
18:28:43 sum 1339.00
18:28:44 sum 18355.00
18:28:45 sum 17845.45
18:28:46 sum 15285.00
18:28:47 sum 17511.00
18:28:48 sum 17568.69
18:28:49 sum 17704.04
18:28:50 sum 18566.67
18:28:51 sum 18913.13
And the entire log of the minute 18:28 is here:
http://www.uplinklabs.net/~tycho/linux/18.28.00-59.log.gz
It looks pretty crazy.
- Steven
On Thu, Sep 18, 2008 at 11:34 AM, Luis R. Rodriguez
> irqpoll is a monster of evil and that should make your system crawl to
> its knees. I would advise instead we work with you fixing the the
> missed interrupts issue upon rmmod.
Also, please provide the output of
cat /proc/interrupts
and also please do not cross post to all these lists, just use
linux-wireless or ath9k.
Luis
On Thu, Sep 18, 2008 at 3:08 PM, Luis R. Rodriguez
<[email protected]> wrote:
> On Thu, Sep 18, 2008 at 3:01 PM, Luis R. Rodriguez
>
>>> Which is far more important, I agree. It's annoying to get speeds
>>> <802.11b on my pre-802.11n capable chipset and network. ;)
>
> Oh and you should at least get some 11n rates, just not aggregation :)
>
> Luis
>
I always get "1 megabit/sec". It actually measures (at best) to be a
throughput of 8mbits/sec, though the average (due to strange random
packet loss) is about 4mbits/sec. -- And that was with aggregation
re-enabled via your patch. Without, it never exceeds 1mbit/sec
(actual, not just as indicated by iwconfig.
- Steven
On Thu, Sep 18, 2008 at 4:08 PM, Steven Noonan
> Will do! And actually, to be precise, the issue occurred not on the
> rmmod, but on the following modprobe for ath9k.
Ah -- it makes it even more interesting now..
>>> Which is far more important, I agree. It's annoying to get speeds
>>> <802.11b on my pre-802.11n capable chipset and network. ;)
>>
>> Yeah exactly, I hope the work we do will get in for 27. We'll see.
>
> Linus has been fairly strict lately, so I am as uncertain as you are.
> It really should get merged, because having 2.6.27 with [partially]
> broken wireless networking is kind of stupid. In the worst case, isn't
> it possible to do a revert to get the old aggregation crap back in?
Well let us finish the patch first in a way that we are happy with it
:) But yeah if its not merged distributions will probably just pick it
up anyway.
Luis
On Fri, Sep 19, 2008 at 5:01 PM, Steven Noonan <[email protected]> wrote:
> I'll pull wireless-testing and run that kernel for another 8 hours.
> What's the URL for the repo? Surely you don't mean the
> wireless-testing on git.kernel.org that hasn't been touched for almost
> a week?
Yes, that's the guy. This week was the Kernel Summit and the Linux
Plumbers Conference too so there is understandably some lag in pulls
as developers are getting drunk.
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-testing.git
> I kind of wish there was a better debugging facility in ath9k.
Patches are welcomed ;)
Also, we have a few interrupt levels you can use. In this case we are
debugging interrupts.
> With
> ATH_DBG_INTERRUPTS, I see several hundred messages per second, which
> fills log files up like crazy (I switched to ftrace_printk as Ingo
> suggested, which has alleviated this somewhat). Is there a way to make
> it less verbose (perhaps so that it prints data at a rate of no more
> than 2 or 3 messages per second)? Or maybe the rate at which it prints
> this stuff is indicative something's wrong?
No, interrupts do happen quite often so that is what you get, perhaps
we can enhance this by not printing on some interrupts but the point
is to debug interrupts after all. To help us debug in userspace in a
nicer way we can use sadc and sar. So you can try this:
On a window leave this running:
/usr/lib/sysstat/sadc -I 1 irqlog.sar
And when you are seeing your latency issues, run this:
sar -I 17 -f irqlog.sar | less
That is capital "i", this will show you the monitor results of number
of interrupts on IRQ 17. You want to see if the rate of change is huge
when you have latency issues. If not then its most likely an interrupt
storm by ath9k (which is my suspect right now) but something else.
Luis
On Fri, Sep 19, 2008 at 10:58:24AM -0700, Senthilkumar Balasubramanian wrote:
>
> I am on 2.6.27-rc6 and this patch is on top of my earlier patch
> titled "[PATCH] ath9k: connectivity is lost after Group rekeying is
> done". However this patch can be applied on top of latest wireless testing
> too. I could apply this patch succesfully on top of wireless testing
> git tree.
>
> My git-describe says v2.6.27-rc6-1378-g34e512f.
>
> There is no sc_invalid flag in "struct ath_softc" today. Where did
> you get this variable from? It was removed in the following commit
>
> -----------------------------------------------
> commit f2c9705a05ecbc0d94216a3b042d5641e8bf70b1
> Author: Sujith <[email protected]>
> Date: Mon Aug 11 14:05:08 2008 +0530
>
> ath9k: Use bitfields for sc operations
>
> Signed-off-by: Sujith Manoharan <[email protected]>
> Signed-off-by: John W. Linville <[email protected]>
> -----------------------------------------------
>
> Which code base are you using?
Thanks for the patch Senthil. He is on Linus' v2.6.27-rc, you are on
wireless-testing so I've ported it (no changes, it just applies cleanly
now). This is also untested by me:
diff --git a/drivers/net/wireless/ath9k/core.c b/drivers/net/wireless/ath9k/core.c
index f6c4528..9a432e5 100644
--- a/drivers/net/wireless/ath9k/core.c
+++ b/drivers/net/wireless/ath9k/core.c
@@ -1336,6 +1336,8 @@ void ath_deinit(struct ath_softc *sc)
DPRINTF(sc, ATH_DBG_CONFIG, "%s\n", __func__);
+ tasklet_kill(&sc->intr_tq);
+ tasklet_kill(&sc->bcon_tasklet);
ath_stop(sc);
if (!sc->sc_invalid)
ath9k_hw_setpower(sc->sc_ah, ATH9K_PM_AWAKE);
diff --git a/drivers/net/wireless/ath9k/main.c b/drivers/net/wireless/ath9k/main.c
index c5107f2..75acf54 100644
--- a/drivers/net/wireless/ath9k/main.c
+++ b/drivers/net/wireless/ath9k/main.c
@@ -1400,10 +1400,16 @@ static void ath_pci_remove(struct pci_dev *pdev)
{
struct ieee80211_hw *hw = pci_get_drvdata(pdev);
struct ath_softc *sc = hw->priv;
+ enum ath9k_int status;
- if (pdev->irq)
+ if (pdev->irq) {
+ ath9k_hw_set_interrupts(sc->sc_ah, 0);
+ ath9k_hw_getisr(sc->sc_ah, &status); /* NB: clears ISR too */
+ sc->sc_flags |= SC_OP_INVALID;
free_irq(pdev->irq, sc);
+ }
ath_detach(sc);
+
pci_iounmap(pdev, sc->mem);
pci_release_region(pdev, 0);
pci_disable_device(pdev);
On Thu, Sep 18, 2008 at 3:01 PM, Luis R. Rodriguez
<[email protected]> wrote:
>> If I do, I'll need to know what precisely to do about it. What debug
>> info should I collect before rebooting?
>
> Sure, OK so in ath9k's core.h we have a macro used to indicate what
> type of debugging stuff gets print out. By default we only leave
> enabled the fatal stuff.
>
> #define DBG_DEFAULT (ATH_DBG_FATAL )
>
> You want to add to this ATH_DBG_INTERRUPT as follows:
>
> #define DBG_DEFAULT (ATH_DBG_FATAL | ATH_DBG_INTERRUPT)
>
> Then since the issue seems to come up when you rmmod, simply try to
> download do some large RXing or TXing or both and then rmmod. Capture
> the output and send along.
Will do! And actually, to be precise, the issue occurred not on the
rmmod, but on the following modprobe for ath9k.
>> Which is far more important, I agree. It's annoying to get speeds
>> <802.11b on my pre-802.11n capable chipset and network. ;)
>
> Yeah exactly, I hope the work we do will get in for 27. We'll see.
Linus has been fairly strict lately, so I am as uncertain as you are.
It really should get merged, because having 2.6.27 with [partially]
broken wireless networking is kind of stupid. In the worst case, isn't
it possible to do a revert to get the old aggregation crap back in?
On Thu, Sep 18, 2008 at 8:01 PM, Luis R. Rodriguez
<[email protected]> wrote:
> Thanks for testing, and glad to see this is resolved.
>
Damn. It's back. I was using wireless normally this evening. Browsing
the web, that kind of thing, and then the wireless inexplicably
dropped (even with the group rekeying patch), so I unloaded/reloaded
the module. This popped up in dmesg:
[ 3834.375658] vendor=8086 device=27d2
[ 3834.375666] ath9k 0000:03:00.0: PCI INT A disabled
[ 3834.375716] ath9k: driver unloaded
[ 3838.552419] ath9k: 0.1
[ 3838.552502] vendor=8086 device=27d2
[ 3838.552511] ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
[ 3838.552532] ath9k 0000:03:00.0: setting latency timer to 64
[ 3838.688924] phy1: Selected rate control algorithm 'ath9k_rate_control'
[ 3838.693652] phy1: Atheros 5416: mem=0xffffc20000060000, irq=17
[ 3839.427125] irq 17: nobody cared (try booting with the "irqpoll" option)
[ 3839.427136] Pid: 0, comm: swapper Tainted: P
2.6.27-rc6-tip-00478-g74f1a36 #1
[ 3839.427141] Call Trace:
[ 3839.427145] <IRQ> [<ffffffff802219c5>] ? read_hpet+0x9/0x1c
[ 3839.427165] [<ffffffff8026af71>] __report_bad_irq+0x3d/0x8c
[ 3839.427172] [<ffffffff8026b0c6>] note_interrupt+0x106/0x160
[ 3839.427180] [<ffffffff8026b801>] handle_fasteoi_irq+0xad/0xda
[ 3839.427188] [<ffffffff8020f7b0>] do_IRQ+0x10c/0x190
[ 3839.427194] [<ffffffff8020c551>] ret_from_intr+0x0/0xa
[ 3839.427198] <EOI> [<ffffffff8026c6f0>] rcu_pending+0x62/0x6e
[ 3839.427211] [<ffffffff8025bc11>] ? tick_nohz_stop_sched_tick+0x2e4/0x2f3
[ 3839.427218] [<ffffffff8020ad94>] cpu_idle+0x7b/0xdb
[ 3839.427226] [<ffffffff8060c921>] rest_init+0x75/0x77
[ 3839.427231] handlers:
[ 3839.427234] [<ffffffffa0240238>] (ath_isr+0x0/0x170 [ath9k])
[ 3839.427263] Disabling IRQ #17
[ 3842.263699] ADDRCONF(NETDEV_UP): wlan0: link is not ready
[ 3848.035003] ADDRCONF(NETDEV_UP): wlan0: link is not ready
[ 3848.432701] ADDRCONF(NETDEV_UP): wlan0: link is not ready
[ 3850.216947] wlan0: authenticate with AP 00:1e:52:79:4d:01
[ 3850.217027] wlan0: authenticate with AP 00:1e:52:79:4d:01
[ 3850.228326] wlan0: authenticated
[ 3850.228336] wlan0: associate with AP 00:1e:52:79:4d:01
[ 3850.428140] wlan0: associate with AP 00:1e:52:79:4d:01
[ 3850.628151] wlan0: associate with AP 00:1e:52:79:4d:01
[ 3850.728305] wlan0: RX AssocResp from 00:1e:52:79:4d:01 (capab=0x431
status=0 aid=1)
[ 3850.728314] wlan0: associated
[ 3850.728655] wlan0 (WE) : Wireless Event too big (320)
[ 3850.743377] ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[ 3860.855104] wlan0: no IPv6 routers present
I rebuilt the module with DBG_ATH_INTERRUPT, but it somehow stumbled
itself back into working order while I was compiling. I can't keep the
interrupt debugging on all the time because it's just -too verbose-,
and when I pop a debug version of the module in, then it's too late to
track the issue....
- Steven
* Steven Noonan <[email protected]> wrote:
> On Fri, Sep 19, 2008 at 1:22 AM, Steven Noonan <[email protected]> wrote:
> > On Fri, Sep 19, 2008 at 1:17 AM, Ingo Molnar <[email protected]> wrote:
> >> in tip/master there's an ftrace_printk() facility. You can just replace
> >> the verbose printk()s with ftrace_printk() and see the result in
> >> /debug/tracing/trace.
> >>
> >> This sort of tracing has very low overhead and can be used in an easy
> >> ad-hoc manner with no extra infrastructure. Here's a few quick-start
> >> links about how to enable the scheduler tracer:
> >>
> >> http://redhat.com/~mingo/sched-devel.git/readme-tracer.txt
> >> http://redhat.com/~mingo/sched-devel.git/howto-trace-latencies.txt
> >>
> >> any ftrace_printk() you add in the kernel will show up in that trace.
> >>
> >> (If the scheduling events are uninteresting and clutter the output then
> >> you might want to remove the scheduler tracing entries from kernel/*.c
> >> by removing the trace_sched_*() calls or use a less noisy tracer.)
> >>
> >> to get enough of a trace history you might want to increase the number
> >> of trace entries in /debug/tracing/trace_entries from 16K to 128K or so.
> >>
> >
> > Thanks for the suggestion, I'll do that. :)
> >
>
> It'd be extremely convenient to have a tracer that did absolutely
> nothing but display ftrace_printk()s.
yeah, had the same thought.
Could you please contribute one? :) I'd suggest you take the smallest
tracer as a template: trace_sched_switch.c and turn that into
trace_nop.c - and change all the function names and ASCII and kconfig
glue to 'nop'/'NOP'?
That can then be used as a basis for additional ftrace plugins, or for
ad-hoc temporary tracing.
Ingo
On Thu, Sep 18, 2008 at 1:18 PM, Justin Mattock <[email protected]> wrote:
> On Thu, Sep 18, 2008 at 12:23 PM, Steven Noonan <[email protected]> wrote:
>> On Thu, Sep 18, 2008 at 12:00 PM, Steven Noonan <[email protected]> wrote:
>>> On Thu, Sep 18, 2008 at 11:42 AM, Luis R. Rodriguez
>>> <[email protected]> wrote:
>>>> On Thu, Sep 18, 2008 at 11:34 AM, Luis R. Rodriguez
>>>>> irqpoll is a monster of evil and that should make your system crawl to
>>>>> its knees. I would advise instead we work with you fixing the the
>>>>> missed interrupts issue upon rmmod.
>>>>
>>>> Also, please provide the output of
>>>>
>>>> cat /proc/interrupts
>>>
>>> Note that the problem necessitating use of irqpoll in the first place
>>> seems to only happen under certain conditions. I am unsure what these
>>> conditions are. Before 'ath9k: connectivity is lost after Group
>>> rekeying is done', I had used rmmod/modprobe as my solution to the
>>> issue, which triggered the IRQ issue.
>>
>> I wonder if this is related:
>>
>> alcarin ~ # dmesg | grep GPE
>> [ 0.166806] ACPI: EC: GPE = 0x17, I/O: command/status = 0x66, data = 0x62
>> [ 685.409077] ACPI: EC: GPE storm detected, disabling EC GPE
>
> FWIW, you might want to try this pach
> in this thread:
> http://www.gossamer-threads.com/lists/linux/kernel/966076
> I was receiving a disable irq #23 this patch"knock on wood",
> seems to help with this issue.
>
Both the patches there affect EHCI, though. If I was having a USB
controller fink out with "IRQ X: nobody cared", then that patch would
probably indeed be applicable.
* Steven Rostedt <[email protected]> wrote:
>
> I hate trips where my email box becomes filled. I skip over a lot of
> emails that I should not have.
>
> On Fri, 19 Sep 2008, Ingo Molnar wrote:
> >
> > What is needed is to set kernel/trace/trace.c's tracing_disabled
> > variable to 0. I've Cc:-ed Steve Rostedt - Steve, why is there no easy
> > ad-hoc way to trigger a stop-tracing event without elaborate changes?
>
> I think I added a way to do this. Hmm, I think it was in -rt that I added
> it, and have not gotten around to sending you the updates.
>
> Yes, we need an easy way to disable the tracer. I think we have
> ftrace_disable() function. Perhaps that only disables tracing that
> checks the tracing_disabled variable.
we have kill_ftrace(), but that will permanently zap it.
Ingo
* Steven Noonan <[email protected]> wrote:
> On Fri, Sep 19, 2008 at 1:17 AM, Ingo Molnar <[email protected]> wrote:
> > in tip/master there's an ftrace_printk() facility. You can just replace
> > the verbose printk()s with ftrace_printk() and see the result in
> > /debug/tracing/trace.
> >
> > This sort of tracing has very low overhead and can be used in an easy
> > ad-hoc manner with no extra infrastructure. Here's a few quick-start
> > links about how to enable the scheduler tracer:
> >
> > http://redhat.com/~mingo/sched-devel.git/readme-tracer.txt
> > http://redhat.com/~mingo/sched-devel.git/howto-trace-latencies.txt
> >
> > any ftrace_printk() you add in the kernel will show up in that trace.
> >
> > (If the scheduling events are uninteresting and clutter the output then
> > you might want to remove the scheduler tracing entries from kernel/*.c
> > by removing the trace_sched_*() calls or use a less noisy tracer.)
> >
> > to get enough of a trace history you might want to increase the number
> > of trace entries in /debug/tracing/trace_entries from 16K to 128K or so.
> >
>
> Thanks for the suggestion, I'll do that. :)
one more suggestion: you should first check how large the 'scope' of
your trace is - i.e. how long of a time span it can cover, in the
workload that you are going to trace.
if it's too short - say just a few hundred milliseconds then you need a
larger trace buffer or fewer trace entries.
another thing: you can auto-stop tracing on the failure condition. That
way you dont have to worry about whether you notice the badness.
Unfortunately there's no easy callback to just anonymously stop all
tracing that goes on. There's ftrace_kill_atomic() but i'm not sure it
properly stops the scheduler tracer.
What is needed is to set kernel/trace/trace.c's tracing_disabled
variable to 0. I've Cc:-ed Steve Rostedt - Steve, why is there no easy
ad-hoc way to trigger a stop-tracing event without elaborate changes?
Ingo
On Mon, Sep 22, 2008 at 8:11 AM, Steven Noonan <[email protected]> wrote:
> On Mon, Sep 22, 2008 at 1:06 AM, Luis R. Rodriguez
> <[email protected]> wrote:
>> Please test the following patch.
>>
>> diff --git a/drivers/net/wireless/ath9k/core.c b/drivers/net/wireless/ath9k/core.c
>> index c262ef2..9a51739 100644
>> --- a/drivers/net/wireless/ath9k/core.c
>> +++ b/drivers/net/wireless/ath9k/core.c
>> @@ -681,10 +681,12 @@ int ath_open(struct ath_softc *sc, struct ath9k_channel *initial_chan)
>> * Enable MIB interrupts when there are hardware phy counters.
>> * Note we only do this (at the moment) for station mode.
>> */
>> +#if 0
>> if (ath9k_hw_phycounters(ah) &&
>> ((sc->sc_ah->ah_opmode == ATH9K_M_STA) ||
>> (sc->sc_ah->ah_opmode == ATH9K_M_IBSS)))
>> sc->sc_imask |= ATH9K_INT_MIB;
>> +#endif
>> /*
>> * Some hardware processes the TIM IE and fires an
>> * interrupt when the TIM bit is set. For hardware
>>
>
> It didn't apply to -rc7, but I managed to apply it manually
> (apparently you made this for wireless-testing?) and added the #if 0.
> After 7 hours uptime with the driver, no apparent interrupt storm.
>
> I'll do more thorough testing later, but I have a 3 hour drive today,
> and some packing to do, so it'll need to wait a bit.
Yeah I did it for wireless-testing. Good to hear this so far has
solved the issue. I'll port it to 27 and post it once I get your
blessings that this fixed it by a Tested-by.
Luis
T24gTW9uLCBTZXAgMjIsIDIwMDggYXQgOTowMSBBTSwgTHVpcyBSLiBSb2RyaWd1ZXoKPGxyb2Ry
aWd1ZXpAYXRoZXJvcy5jb20+IHdyb3RlOgo+IE9uIE1vbiwgU2VwIDIyLCAyMDA4IGF0IDg6MTEg
QU0sIFN0ZXZlbiBOb29uYW4gPHN0ZXZlbkB1cGxpbmtsYWJzLm5ldD4gd3JvdGU6Cj4+IE9uIE1v
biwgU2VwIDIyLCAyMDA4IGF0IDE6MDYgQU0sIEx1aXMgUi4gUm9kcmlndWV6Cj4+IDxscm9kcmln
dWV6QGF0aGVyb3MuY29tPiB3cm90ZToKPj4+IFBsZWFzZSB0ZXN0IHRoZSBmb2xsb3dpbmcgcGF0
Y2guCj4+Pgo+Pj4gZGlmZiAtLWdpdCBhL2RyaXZlcnMvbmV0L3dpcmVsZXNzL2F0aDlrL2NvcmUu
YyBiL2RyaXZlcnMvbmV0L3dpcmVsZXNzL2F0aDlrL2NvcmUuYwo+Pj4gaW5kZXggYzI2MmVmMi4u
OWE1MTczOSAxMDA2NDQKPj4+IC0tLSBhL2RyaXZlcnMvbmV0L3dpcmVsZXNzL2F0aDlrL2NvcmUu
Ywo+Pj4gKysrIGIvZHJpdmVycy9uZXQvd2lyZWxlc3MvYXRoOWsvY29yZS5jCj4+PiBAQCAtNjgx
LDEwICs2ODEsMTIgQEAgaW50IGF0aF9vcGVuKHN0cnVjdCBhdGhfc29mdGMgKnNjLCBzdHJ1Y3Qg
YXRoOWtfY2hhbm5lbCAqaW5pdGlhbF9jaGFuKQo+Pj4gICAgICAgICAqIEVuYWJsZSBNSUIgaW50
ZXJydXB0cyB3aGVuIHRoZXJlIGFyZSBoYXJkd2FyZSBwaHkgY291bnRlcnMuCj4+PiAgICAgICAg
ICogTm90ZSB3ZSBvbmx5IGRvIHRoaXMgKGF0IHRoZSBtb21lbnQpIGZvciBzdGF0aW9uIG1vZGUu
Cj4+PiAgICAgICAgICovCj4+PiArI2lmIDAKPj4+ICAgICAgICBpZiAoYXRoOWtfaHdfcGh5Y291
bnRlcnMoYWgpICYmCj4+PiAgICAgICAgICAgICgoc2MtPnNjX2FoLT5haF9vcG1vZGUgPT0gQVRI
OUtfTV9TVEEpIHx8Cj4+PiAgICAgICAgICAgICAoc2MtPnNjX2FoLT5haF9vcG1vZGUgPT0gQVRI
OUtfTV9JQlNTKSkpCj4+PiAgICAgICAgICAgICAgICBzYy0+c2NfaW1hc2sgfD0gQVRIOUtfSU5U
X01JQjsKPj4+ICsjZW5kaWYKPj4+ICAgICAgICAvKgo+Pj4gICAgICAgICAqIFNvbWUgaGFyZHdh
cmUgcHJvY2Vzc2VzIHRoZSBUSU0gSUUgYW5kIGZpcmVzIGFuCj4+PiAgICAgICAgICogaW50ZXJy
dXB0IHdoZW4gdGhlIFRJTSBiaXQgaXMgc2V0LiAgRm9yIGhhcmR3YXJlCj4+Pgo+Pgo+PiBJdCBk
aWRuJ3QgYXBwbHkgdG8gLXJjNywgYnV0IEkgbWFuYWdlZCB0byBhcHBseSBpdCBtYW51YWxseQo+
PiAoYXBwYXJlbnRseSB5b3UgbWFkZSB0aGlzIGZvciB3aXJlbGVzcy10ZXN0aW5nPykgYW5kIGFk
ZGVkIHRoZSAjaWYgMC4KPj4gQWZ0ZXIgNyBob3VycyB1cHRpbWUgd2l0aCB0aGUgZHJpdmVyLCBu
byBhcHBhcmVudCBpbnRlcnJ1cHQgc3Rvcm0uCj4+Cj4+IEknbGwgZG8gbW9yZSB0aG9yb3VnaCB0
ZXN0aW5nIGxhdGVyLCBidXQgSSBoYXZlIGEgMyBob3VyIGRyaXZlIHRvZGF5LAo+PiBhbmQgc29t
ZSBwYWNraW5nIHRvIGRvLCBzbyBpdCdsbCBuZWVkIHRvIHdhaXQgYSBiaXQuCj4KPiBZZWFoIEkg
ZGlkIGl0IGZvciB3aXJlbGVzcy10ZXN0aW5nLiBHb29kIHRvIGhlYXIgdGhpcyBzbyBmYXIgaGFz
Cj4gc29sdmVkIHRoZSBpc3N1ZS4gSSdsbCBwb3J0IGl0IHRvIDI3IGFuZCBwb3N0IGl0IG9uY2Ug
SSBnZXQgeW91cgo+IGJsZXNzaW5ncyB0aGF0IHRoaXMgZml4ZWQgaXQgYnkgYSBUZXN0ZWQtYnku
Cj4KCk9rYXkuIEknbSBpbiBTZWF0dGxlIG5vdy4KClllcywgaXQgc29sdmVkIHRoZSBpc3N1ZSwg
YnV0IHN1cmVseSBhbiAjaWYgMCBpc24ndCBhIHByb3BlciBzb2x1dGlvbi4KV2hhdCdzIHRoZSBh
Y3R1YWwgYnVnIGhlcmU/IEknbSBhZnJhaWQgdGhlIG1lYW5pbmcgb2Ygd2hhdCdzIGdvaW5nIG9u
CmhlcmUgaXNuJ3QgZXhhY3RseSBpbnR1aXRpdmUuCgpBbHNvLCBJJ20gZG9pbmcgYW4gOCBob3Vy
IHJ1biB3aXRoIHRoaXMgcGF0Y2ggdG9uaWdodC4gTWF5YmUgbW9yZSwKZGVwZW5kcyBvbiB3aGVu
IEkgd2FrZSB1cC4gOykKCkkgdGhpbmsgaWYgdGhpcyByZXNvbHZlcyB0aGUgaXNzdWUgb24gdHdv
IHNlcGFyYXRlIJc4IGhvdXIgcnVucywgaXQKc2hvdWxkIGJlIGNvbnNpZGVyZWQgdGhlIHNvbHV0
aW9uIHRvIHRoZSBpc3N1ZSwgdW5sZXNzIGl0IHJlc3VyZmFjZXMuCgotIFN0ZXZlbgo=
* Steven Noonan <[email protected]> wrote:
> On Thu, Sep 18, 2008 at 8:01 PM, Luis R. Rodriguez
> <[email protected]> wrote:
> > Thanks for testing, and glad to see this is resolved.
> >
>
> Damn. It's back. I was using wireless normally this evening. Browsing
> the web, that kind of thing, and then the wireless inexplicably
> dropped (even with the group rekeying patch), so I unloaded/reloaded
> the module. This popped up in dmesg:
>
> [ 3834.375658] vendor=8086 device=27d2
> [ 3834.375666] ath9k 0000:03:00.0: PCI INT A disabled
> [ 3834.375716] ath9k: driver unloaded
> [ 3838.552419] ath9k: 0.1
> [ 3838.552502] vendor=8086 device=27d2
> [ 3838.552511] ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> [ 3838.552532] ath9k 0000:03:00.0: setting latency timer to 64
> [ 3838.688924] phy1: Selected rate control algorithm 'ath9k_rate_control'
> [ 3838.693652] phy1: Atheros 5416: mem=0xffffc20000060000, irq=17
> [ 3839.427125] irq 17: nobody cared (try booting with the "irqpoll" option)
> [ 3839.427136] Pid: 0, comm: swapper Tainted: P
> 2.6.27-rc6-tip-00478-g74f1a36 #1
> [ 3839.427141] Call Trace:
> [ 3839.427145] <IRQ> [<ffffffff802219c5>] ? read_hpet+0x9/0x1c
> [ 3839.427165] [<ffffffff8026af71>] __report_bad_irq+0x3d/0x8c
> [ 3839.427172] [<ffffffff8026b0c6>] note_interrupt+0x106/0x160
> [ 3839.427180] [<ffffffff8026b801>] handle_fasteoi_irq+0xad/0xda
> [ 3839.427188] [<ffffffff8020f7b0>] do_IRQ+0x10c/0x190
> [ 3839.427194] [<ffffffff8020c551>] ret_from_intr+0x0/0xa
> [ 3839.427198] <EOI> [<ffffffff8026c6f0>] rcu_pending+0x62/0x6e
> [ 3839.427211] [<ffffffff8025bc11>] ? tick_nohz_stop_sched_tick+0x2e4/0x2f3
> [ 3839.427218] [<ffffffff8020ad94>] cpu_idle+0x7b/0xdb
> [ 3839.427226] [<ffffffff8060c921>] rest_init+0x75/0x77
> [ 3839.427231] handlers:
> [ 3839.427234] [<ffffffffa0240238>] (ath_isr+0x0/0x170 [ath9k])
> [ 3839.427263] Disabling IRQ #17
> [ 3842.263699] ADDRCONF(NETDEV_UP): wlan0: link is not ready
> [ 3848.035003] ADDRCONF(NETDEV_UP): wlan0: link is not ready
> [ 3848.432701] ADDRCONF(NETDEV_UP): wlan0: link is not ready
> [ 3850.216947] wlan0: authenticate with AP 00:1e:52:79:4d:01
> [ 3850.217027] wlan0: authenticate with AP 00:1e:52:79:4d:01
> [ 3850.228326] wlan0: authenticated
> [ 3850.228336] wlan0: associate with AP 00:1e:52:79:4d:01
> [ 3850.428140] wlan0: associate with AP 00:1e:52:79:4d:01
> [ 3850.628151] wlan0: associate with AP 00:1e:52:79:4d:01
> [ 3850.728305] wlan0: RX AssocResp from 00:1e:52:79:4d:01 (capab=0x431
> status=0 aid=1)
> [ 3850.728314] wlan0: associated
> [ 3850.728655] wlan0 (WE) : Wireless Event too big (320)
> [ 3850.743377] ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
> [ 3860.855104] wlan0: no IPv6 routers present
>
> I rebuilt the module with DBG_ATH_INTERRUPT, but it somehow stumbled
> itself back into working order while I was compiling. I can't keep the
> interrupt debugging on all the time because it's just -too verbose-,
> and when I pop a debug version of the module in, then it's too late to
> track the issue....
in tip/master there's an ftrace_printk() facility. You can just replace
the verbose printk()s with ftrace_printk() and see the result in
/debug/tracing/trace.
This sort of tracing has very low overhead and can be used in an easy
ad-hoc manner with no extra infrastructure. Here's a few quick-start
links about how to enable the scheduler tracer:
http://redhat.com/~mingo/sched-devel.git/readme-tracer.txt
http://redhat.com/~mingo/sched-devel.git/howto-trace-latencies.txt
any ftrace_printk() you add in the kernel will show up in that trace.
(If the scheduling events are uninteresting and clutter the output then
you might want to remove the scheduler tracing entries from kernel/*.c
by removing the trace_sched_*() calls or use a less noisy tracer.)
to get enough of a trace history you might want to increase the number
of trace entries in /debug/tracing/trace_entries from 16K to 128K or so.
Ingo
On Sat, 2008-09-20 at 13:57 -0700, Luis R. Rodriguez wrote:
> mcgrof@tesla ~/tmp/irq $ sar -f irqlog-storm.sar -A
> Invalid system activity file: irqlog-storm.sar
>
> Perhaps we its a different format than what my sar can read...
It changes like every release... Not a very useful thing, that.
johannes
On Fri, Sep 19, 2008 at 8:29 PM, Steven Noonan <[email protected]> wrote:
> And here's the change as I unload and reload ath9k:
>
> [...]
> 20:21:38 sum 20138.83
> 20:21:40 sum 21576.09
> 20:21:41 sum 17678.38
> 20:21:42 sum 19834.34
> 20:21:43 sum 19845.10
> 20:21:44 sum 19470.00
> 20:21:45 sum 9273.00 <--- ath9k unloaded/reloaded
> 20:21:46 sum 187.00
> 20:21:47 sum 1066.00
> 20:21:48 sum 610.00
> 20:21:49 sum 106.93
> 20:21:50 sum 670.00
> 20:21:51 sum 742.00
> 20:21:52 sum 781.00
> Average: sum 19562.56
>
I think it might help if you can see the .sar file yourself. I've
uploaded a gzipped copy (great compression ratio in this case for
merely using gzip!).
http://www.uplinklabs.net/~tycho/linux/irqlog-storm.sar.gz
- Steven
On Thu, Sep 18, 2008 at 11:23 AM, Steven Noonan <[email protected]> wrote:
> The only thing I can think of is that I've
> got 'irqpoll' on the command line. I have it because I had to reload
> the ath9k driver at one point, and this was the result (caused ath9k
> to become a useless driver until I rebooted, which re-enabled IRQ 17):
>
> [53997.493777] ath9k: driver unloaded
> [54001.098532] ath9k: 0.1
> [54001.098833] ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> [54001.099013] ath9k 0000:03:00.0: setting latency timer to 64
> [54001.233825] phy1: Selected rate control algorithm 'ath9k_rate_control'
> [54001.236980] phy1: Atheros 5416: mem=0xffffc20000060000, irq=17
> [54002.136218] irq 17: nobody cared (try booting with the "irqpoll" option)
> [54002.136229] Pid: 0, comm: swapper Tainted: P 2.6.27-rc6-00036-ga551b98 #2
> [54002.136234]
> [54002.136235] Call Trace:
> [54002.136241] <IRQ> [<ffffffff80265231>] __report_bad_irq+0x3d/0x81
> [54002.136258] [<ffffffff80265477>] note_interrupt+0x202/0x267
> [54002.136265] [<ffffffff80265bfc>] handle_fasteoi_irq+0xba/0xe4
> [54002.136273] [<ffffffff8020e7ff>] do_IRQ+0x115/0x191
> [54002.136279] [<ffffffff8020bc01>] ret_from_intr+0x0/0xa
> [54002.136283] <EOI> [<ffffffff802541bb>] ?
> tick_nohz_stop_sched_tick+0x2c5/0x2d4
> [54002.136296] [<ffffffff8020a69f>] ? cpu_idle+0x2d/0xed
> [54002.136304] [<ffffffff8060d741>] ? rest_init+0x75/0x77
> [54002.136309]
> [54002.136312] handlers:
> [54002.136315] [<ffffffffa0209016>] (ath_isr+0x0/0x160 [ath9k])
> [54002.136340] Disabling IRQ #17
irqpoll is a monster of evil and that should make your system crawl to
its knees. I would advise instead we work with you fixing the the
missed interrupts issue upon rmmod.
Luis
On Mon, Sep 22, 2008 at 12:26:02AM -0700, Steven Noonan wrote:
> On Sun, Sep 21, 2008 at 11:54 PM, Luis R. Rodriguez
> <[email protected]> wrote:
> > On Sat, Sep 20, 2008 at 06:48:05PM -0700, Steven Noonan wrote:
> >>
> >> OK, here's the time index at which it b0rked:
> >>
> >> 18:28:42 sum 1155.00
> >> 18:28:43 sum 1339.00
> >> 18:28:44 sum 18355.00
> >> 18:28:45 sum 17845.45
> >>
> >> And the entire log of the minute 18:28 is here:
> >> http://www.uplinklabs.net/~tycho/linux/18.28.00-59.log.gz
> >>
> >> It looks pretty crazy.
> >
> > Thanks, I'll take a look when I get a chance (remember aggretion
> > is in the queue too) unless someone beats me to it.
> >
>
> You may want to take a quick look at it now.
Its Sunday night and I am looking at it :)
> I certainly don't mean
> you should push aggregation aside,
Trust me I want to get this resolved too ;)
> but the log looks kind of screwy at
> 18:28:43 (look at line 4972 to see what I mean). It looks as though it
> did printk()s from multiple threads with no locking mechanism. It's
> possible I will need to re-capture the error, and I'd like to know if
> this log will help or not.
At first glance I see tons of MIB interrupts being spewed, besides the
ackward prints you see too my main concern so far is the MIB interrupts
as we don't handle ANI. My initial thought is because don't hanld ANI
yet we are letting MIB interrupts to continue to spew out.
Luis
Please test the following patch.
diff --git a/drivers/net/wireless/ath9k/core.c b/drivers/net/wireless/ath9k/core.c
index c262ef2..9a51739 100644
--- a/drivers/net/wireless/ath9k/core.c
+++ b/drivers/net/wireless/ath9k/core.c
@@ -681,10 +681,12 @@ int ath_open(struct ath_softc *sc, struct ath9k_channel *initial_chan)
* Enable MIB interrupts when there are hardware phy counters.
* Note we only do this (at the moment) for station mode.
*/
+#if 0
if (ath9k_hw_phycounters(ah) &&
((sc->sc_ah->ah_opmode == ATH9K_M_STA) ||
(sc->sc_ah->ah_opmode == ATH9K_M_IBSS)))
sc->sc_imask |= ATH9K_INT_MIB;
+#endif
/*
* Some hardware processes the TIM IE and fires an
* interrupt when the TIM bit is set. For hardware
On Thu, Sep 18, 2008 at 3:01 PM, Luis R. Rodriguez
>> Which is far more important, I agree. It's annoying to get speeds
>> <802.11b on my pre-802.11n capable chipset and network. ;)
Oh and you should at least get some 11n rates, just not aggregation :)
Luis
On Sat, Sep 20, 2008 at 06:48:05PM -0700, Steven Noonan wrote:
>
> OK, here's the time index at which it b0rked:
>
> 18:28:42 sum 1155.00
> 18:28:43 sum 1339.00
> 18:28:44 sum 18355.00
> 18:28:45 sum 17845.45
>
> And the entire log of the minute 18:28 is here:
> http://www.uplinklabs.net/~tycho/linux/18.28.00-59.log.gz
>
> It looks pretty crazy.
Thanks, I'll take a look when I get a chance (remember aggretion
is in the queue too) unless someone beats me to it. Asking in advance
-- if you can also provide a log of minutes 43, 44 and 45 that would
be great, but this may just be enough.
Luis