2011-01-07 21:55:52

by Peter Stuge

[permalink] [raw]
Subject: ath9k tx lockup, ath: received PCI FATAL interrupt

First of all, sorry for dropping linux-wireless from my previous
message, it was mostly about my understanding of ath9k processes.

Crossposting is a bit unnatural for me, I usually strictly use
list-reply. Let's try with this one! :)


Peter Stuge wrote:
> So there was a problem while writing this message. Will follow up
> in separate message. I can only describe the symptom, but as I wrote
> after subscribing this list over a year ago I'm comfortable and more
> than happy to help with debugging on register level, and below, but I
> still know neither driver nor hardware and if possible I would like
> to not have to learn all of the driver, so I would need to be told
> fairly specifically what to instrument.
>
> I think someone with good knowledge of driver and hardware could give
> guidance very quickly.

I booted wireless-testing master-2011-01-05 with an AR9280 Mini PCI
card in my ThinkPad X40 today.

I normally never use wpa_supplicant. I don't accept that it would be
required for a working connection and so far noone has really
explained why that would be the case. Only ath9k (ie. neither ipw2200
nor ath5k) has ever made a difference between wpa_supplicant running
or not.

I configure manually. Module is loaded by kernel on boot. After
logging in, I run:

ip l s dev eth1 up
iw dev eth1 connect Stuge
ip a a 192.168.5.4/24 dev eth1
ip r a default via 192.168.5.1

..and that is my internet connection.


I used the connection without issues for about an hour, working
remotely and listening to a music stream, then I was disassociated,
with the PCI FATAL interrupt in the kernel log.

dmesg output is log1_ath_pci_fatal_interrupt.txt


Trying to revive the connection it seemed like no packets were being
transmitted because I could scan, but not associate again.

I then tried to start wpa_supplicant, which made four attempts at
authenticating with the (unencrypted) AP, without success.

dmesg output is log2_after_wpa_supplicant_4x_trying_to_auth.txt


I tried to rmmod and modprobe the driver, followed by

iw dev eth1 connect Stuge

but that also did not work.

dmesg output is log3_after_rmmod_modprobe.txt


I power cycled and did the above ip l s + iw connect, which made the
card sort-of associate, but not function correctly.

dmesg output is log4_after_power_cycle_and_iw_connect.txt

Running iw dev eth1 link at this time showed me *not* being
successfully connected.

I started wpa_supplicant 0.7.2 without debug messages.
Output is in log5_wpa_supplicant.txt

I ran iw dev eth1 connect Stuge

wpa_supplicant output is log6_wpa_supplicant_after_iw_connect.txt

dmesg output is log7_after_wpa_supplicant_and_iw_connect.txt


This is the connection I am currently using.


The above identifies at least two issues:

1. ath: received PCI FATAL interrupt

How can I find out more about the reason for this?


2. Unworking association without wpa_supplicant after power cycle

How can I find out more about the reason for this?


The log files are attached and available at http://stuge.se/ath9k/wl0105/
All dmesg output is incremental. boot1.txt and boot2.txt are initial
messages for each boot, but not included as attachments.

boot2.txt (after power cycle) had different timing, keyboard detected
earlier, no tty60 hash and different btrfs transaction id. I added
the Time: stamp in log4_after_power_cycle_and_iw_connect.txt.

Please help me provide further neccessary information to debug this.
Thank you!


//Peter


Attachments:
(No filename) (3.38 kB)
log1_ath_pci_fatal_interrupt.txt (9.65 kB)
log2_after_wpa_supplicant_4x_trying_to_auth.txt (1.81 kB)
log3_after_rmmod_modprobe.txt (1.12 kB)
log4_after_power_cycle_and_iw_connect.txt (8.59 kB)
log5_wpa_supplicant.txt (254.00 B)
log6_wpa_supplicant_after_iw_connect.txt (268.00 B)
log7_after_wpa_supplicant_and_iw_connect.txt (2.63 kB)
Download all attachments

2011-01-07 23:43:01

by Peter Stuge

[permalink] [raw]
Subject: Re: [ath9k-devel] ath9k tx lockup, ath: received PCI FATAL interrupt

Jouni Malinen wrote:
> > Only ath9k (ie. neither ipw2200 nor ath5k) has ever made a
> > difference between wpa_supplicant running or not.
>
> You won't need wpa_supplicant for an open network that has only a
> single AP and if you never get disconnected (which could happen,
> e.g., if you leave the range for a moment).

Thanks for confirming this!


> mac80211 will not be reconnecting to the network on its own
> automatically, so you need something like wpa_supplicant to do it
> for you.

Right, I had understood that to also be the case. I may not agree but
in any case that's a different topic.

However, since neither laptop nor AP is moving around much I also do
not expect to get disconnected. Let's see how life with master-2011-01-05
turns out!


> > 1. ath: received PCI FATAL interrupt
> >
> > How can I find out more about the reason for this?
>
> Have you looked at enabling debugging in ath9k?

Yes, and I dug around adding some code last year when I first got
started with ath9k. Unfortunately I forgot to pass debug when
loading the driver today.


> For the PCI fatal interrupt, it would be useful to have at least
> interrupt and fatal levels included.

Hold on.. OK, reloaded now with debug=0x410.


> If the output is still readable (i.e., you can still find the PCI
> fatal message), enabling more of these could end up providing more
> details, too.

ATH_DBG_INTERRUPT is spamming the kernel log with a disable/enable
IER at least every 100ms but usually much more often. A snippet:

[10654.729417] ath: disable IER
[10654.729430] ath: AR_IMR 0x918414b0 IER 0x1
[10654.729452] ath: enable IER
[10654.729462] ath: AR_IMR 0x918414b0 IER 0x1
[10654.729581] ath: disable IER
[10654.729601] ath: enable IER
[10654.729619] ath: disable IER
[10654.729633] ath: AR_IMR 0x918414b0 IER 0x1
[10654.729665] ath: enable IER
[10654.729676] ath: AR_IMR 0x918414b0 IER 0x1
[10654.729729] ath: disable IER
[10654.729749] ath: enable IER
[10654.729760] ath: AR_IMR 0x918414b0 IER 0x1
[10654.735625] ath: disable IER
[10654.735662] ath: enable IER
[10654.735674] ath: AR_IMR 0x918414b0 IER 0x1
[10654.829084] ath: disable IER
[10654.829103] ath: enable IER
[10654.829115] ath: AR_IMR 0x918414b0 IER 0x1
[10654.829176] ath: disable IER
[10654.829203] ath: enable IER
[10654.829214] ath: AR_IMR 0x918414b0 IER 0x1
[10654.829237] ath: 0xf4041171 => 0xf4041171
[10654.829244] ath: new IMR 0x918414b0
[10654.829252] ath: enable IER
[10654.829263] ath: AR_IMR 0x918414b0 IER 0x1
[10654.837973] ath: disable IER
[10654.838010] ath: enable IER
[10654.838021] ath: AR_IMR 0x918414b0 IER 0x1
[10654.940330] ath: disable IER
[10654.940374] ath: enable IER
[10654.940386] ath: AR_IMR 0x918414b0 IER 0x1
[10655.042682] ath: disable IER
[10655.042724] ath: enable IER
[10655.042737] ath: AR_IMR 0x918414b0 IER 0x1

So now my dmesg is only some 6000 lines of this, going back 30
seconds or so. Would it make sense to bump the kernel log size
significantly to have maybe several minutes?


> Have you happened to test that WLAN card in any other system or
> with any other driver?

No.


> It could be useful to make sure it is known to be working reliably
> before spending huge amount of work trying to figure out why it
> doesn't work properly with ath9k..

True. The card was brand new in ESD bag when I installed it. It comes
out of a batch of cards that ship as part of a commercial product
running in AP mode with good performance.

Your point is solid, but I have no easy way to test something else.

Would FreeBSD be a useful data point, or is some Windows the only
reliable reference in this case?


> > 2. Unworking association without wpa_supplicant after power cycle
> >
> > How can I find out more about the reason for this?
>
> The part where the dmesg output had "direct probe to <BSSID> timed out"
> could potentially be caused by card RX not working properly.

Hm? This message is in log2_after_wpa_supplicant_4x_trying_to_auth.txt
which is dmesg following the PCI FATAL interrupt and then running
wpa_supplicant, but before the power cycle. The issue above is rather
what can be seen in log4..log5, and going away in log6..log7 when I
try iw connect again *with* wpa_supplicant running.


> I've seen that happen in some cases (with rmmod + modprobe ath9k
> being one way of recovering from that state). Looking at the RX
> interrupt counters in ath9k debugfs and checking whether they
> increment could be of use here. In general, verifying RX registers
> (filter, descriptors) would likely follow in getting more details.

Thanks for the pointers! When I first had problems about a year ago I
did look at and post (only to ath9k-devel) interrupt counter values,
but no obvious problem showed. (Rather, there were no comments.)

I'll remember to also instrument debugfs if/when next problem hits.


//Peter

2011-01-07 22:09:40

by Brian Prodoehl

[permalink] [raw]
Subject: Re: [ath9k-devel] ath9k tx lockup, ath: received PCI FATAL interrupt

On Fri, Jan 7, 2011 at 4:55 PM, Peter Stuge <[email protected]> wrote:
> First of all, sorry for dropping linux-wireless from my previous
> message, it was mostly about my understanding of ath9k processes.
>
> Crossposting is a bit unnatural for me, I usually strictly use
> list-reply. Let's try with this one! :)
>
>
> Peter Stuge wrote:
>> So there was a problem while writing this message. Will follow up
>> in separate message. I can only describe the symptom, but as I wrote
>> after subscribing this list over a year ago I'm comfortable and more
>> than happy to help with debugging on register level, and below, but I
>> still know neither driver nor hardware and if possible I would like
>> to not have to learn all of the driver, so I would need to be told
>> fairly specifically what to instrument.
>>
>> I think someone with good knowledge of driver and hardware could give
>> guidance very quickly.
>
> I booted wireless-testing master-2011-01-05 with an AR9280 Mini PCI
> card in my ThinkPad X40 today.
>
> I normally never use wpa_supplicant. I don't accept that it would be
> required for a working connection and so far noone has really
> explained why that would be the case. Only ath9k (ie. neither ipw2200
> nor ath5k) has ever made a difference between wpa_supplicant running
> or not.
>
> I configure manually. Module is loaded by kernel on boot. After
> logging in, I run:
>
> ip l s dev eth1 up
> iw dev eth1 connect Stuge
> ip a a 192.168.5.4/24 dev eth1
> ip r a default via 192.168.5.1
>
> ..and that is my internet connection.
>
>
> I used the connection without issues for about an hour, working
> remotely and listening to a music stream, then I was disassociated,
> with the PCI FATAL interrupt in the kernel log.
>
> dmesg output is log1_ath_pci_fatal_interrupt.txt
>
>
> Trying to revive the connection it seemed like no packets were being
> transmitted because I could scan, but not associate again.
>
> I then tried to start wpa_supplicant, which made four attempts at
> authenticating with the (unencrypted) AP, without success.
>
> dmesg output is log2_after_wpa_supplicant_4x_trying_to_auth.txt
>
>
> I tried to rmmod and modprobe the driver, followed by
>
> iw dev eth1 connect Stuge
>
> but that also did not work.
>
> dmesg output is log3_after_rmmod_modprobe.txt
>
>
> I power cycled and did the above ip l s + iw connect, which made the
> card sort-of associate, but not function correctly.
>
> dmesg output is log4_after_power_cycle_and_iw_connect.txt
>
> Running iw dev eth1 link at this time showed me *not* being
> successfully connected.
>
> I started wpa_supplicant 0.7.2 without debug messages.
> Output is in log5_wpa_supplicant.txt
>
> I ran iw dev eth1 connect Stuge
>
> wpa_supplicant output is log6_wpa_supplicant_after_iw_connect.txt
>
> dmesg output is log7_after_wpa_supplicant_and_iw_connect.txt
>
>
> This is the connection I am currently using.
>
>
> The above identifies at least two issues:
>
> 1. ath: received PCI FATAL interrupt
>
> How can I find out more about the reason for this?
>
>
> 2. Unworking association without wpa_supplicant after power cycle
>
> How can I find out more about the reason for this?
>
>
> The log files are attached and available at http://stuge.se/ath9k/wl0105/
> All dmesg output is incremental. boot1.txt and boot2.txt are initial
> messages for each boot, but not included as attachments.
>
> boot2.txt (after power cycle) had different timing, keyboard detected
> earlier, no tty60 hash and different btrfs transaction id. I added
> the Time: stamp in log4_after_power_cycle_and_iw_connect.txt.
>
> Please help me provide further neccessary information to debug this.
> Thank you!
>
>
> //Peter


Out of curiosity, what's your AP? I've never seen those "detected
beacon loss from AP" messages. If you bring up a monitor-mode
interface, do you see beacons at a sane, regular interval?

-Brian

2011-01-07 22:47:25

by Peter Stuge

[permalink] [raw]
Subject: Re: [ath9k-devel] ath9k tx lockup, ath: received PCI FATAL interrupt

Brian Prodoehl wrote:
> Out of curiosity, what's your AP?

July 2006 Linux 2.6.17.6 madwifi-ng Mini PCI AR5212.

wlan: 0.8.4.2 (svn 1531)
ath_hal: module license 'Proprietary' taints kernel.
ath_hal: 0.9.16.16 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413)
ath_rate_sample: 1.2 (svn 1531)
ath_pci: 0.9.4.5 (svn 1531)
..
wifi0: mac 5.9 phy 4.3 radio 3.6
..
wifi0: Atheros 5212: mem=0xa0010000, irq=11


> I've never seen those "detected beacon loss from AP" messages.

I fully expect that this AP is not behaving perfectly, but that is of
course irrelevant in the scope of ath9k when ipw2200 ath5k orinoco_cs
can deal. If anything, the AP is valuable for debugging ath9k, which
is also why I prefer not to change the AP until it seems like all
issues have been solved.

(But I've had plenty of problems with ath9k also on various other
APs, just no logs from anything else today.)


> If you bring up a monitor-mode interface,

Then I got some new suspect messages in the kernel log. But internet
and monitor both work.

dmesg output in log8_after_iw_dev_eth1_interface_add_wmon0.txt


> do you see beacons at a sane, regular interval?

100ms, with some missing ones. (3 missing in 327 received, second
test 13 missing in 819 received over 93 seconds. Seems to match the
beacon loss messages.)


//Peter


Attachments:
(No filename) (1.28 kB)
log8_after_iw_dev_eth1_interface_add_wmon0.txt (3.59 kB)
Download all attachments

2011-01-07 23:08:39

by Jouni Malinen

[permalink] [raw]
Subject: Re: [ath9k-devel] ath9k tx lockup, ath: received PCI FATAL interrupt

On Fri, 2011-01-07 at 13:55 -0800, Peter Stuge wrote:
> I normally never use wpa_supplicant. I don't accept that it would be
> required for a working connection and so far noone has really
> explained why that would be the case. Only ath9k (ie. neither ipw2200
> nor ath5k) has ever made a difference between wpa_supplicant running
> or not.

You won't need wpa_supplicant for an open network that has only a single
AP and if you never get disconnected (which could happen, e.g., if you
leave the range for a moment). mac80211 will not be reconnecting to the
network on its own automatically, so you need something like
wpa_supplicant to do it for you.

> The above identifies at least two issues:
>
> 1. ath: received PCI FATAL interrupt
>
> How can I find out more about the reason for this?

Have you looked at enabling debugging in ath9k? This is described at
http://wireless.kernel.org/en/users/Drivers/ath9k/debug
For the PCI fatal interrupt, it would be useful to have at least
interrupt and fatal levels included. If the output is still readable
(i.e., you can still find the PCI fatal message), enabling more of these
could end up providing more details, too.

Have you happened to test that WLAN card in any other system or with any
other driver? It could be useful to make sure it is known to be working
reliably before spending huge amount of work trying to figure out why it
doesn't work properly with ath9k..

> 2. Unworking association without wpa_supplicant after power cycle
>
> How can I find out more about the reason for this?

The part where the dmesg output had "direct probe to <BSSID> timed out"
could potentially be caused by card RX not working properly. I've seen
that happen in some cases (with rmmod + modprobe ath9k being one way of
recovering from that state). Looking at the RX interrupt counters in
ath9k debugfs and checking whether they increment could be of use here.
In general, verifying RX registers (filter, descriptors) would likely
follow in getting more details.

- Jouni