2010-11-30 00:30:02

by Ben Greear

Subject: Script to crash ath9k with DMA errors.

Here is a script that reliably crashes my ath9k box.
A second box with completely different hardware (except
for ath9k) experiences similar problems.

I am using today's wireless-testing kernel with a few
patches of my own.

You will also need the very latest hostap tree as it has the
optimizations for allowing STAs to share scans. Without
this optimization, I did not see this problem.

A few notes about the script:

* I cannot remove any interfaces; there seems to be a ref-count leak somewhere.
I haven't debugged this issue.

* Without the background ping, it is very hard to reproduce this problem,
but with it, it happens almost every time.

* You'll need to set up your paths at the top of the script.


#!/usr/bin/perl

use strict;

my $iw = "./local/sbin/iw";
my $ip = "./local/sbin/ip";
my $wpa_s = "./local/bin/wpa_supplicant";
my $ssid = "candela-n";
my $key = "wpadmz123";

my $phy = "wiphy0";
my $max = 32;
my $i;
my $bmac = "00:01:02:03:04:";
my $cmd;

# Cleanup previous stuff
runCmd("killall wpa_supplicant");
runCmd("killall ping");

for ($i = 0; $i<$max; $i++) {
    # Work around ref-counting bugs in kernel
    runCmd("$ip link set sta$i down");
    runCmd("$ip addr flush dev sta$i");
    runCmd("$ip route flush dev sta$i");
    runCmd("$ip -6 addr flush dev sta$i");
    runCmd("$ip -6 route flush dev sta$i");

    # Bugger, cannot get the ref-count problem to go away.
    # runCmd("$iw dev sta$i del");
}

#exit(0);

open(FD, ">pingbg") || die("Couldn't open pingbg.");
print FD "#!/bin/bash\n\n";
print FD "ping \$* > /dev/null 2>&1 &\n";
print FD "echo continuing....\n";
close(FD);
runCmd("chmod a+x pingbg");

# Create stations
for ($i = 0; $i<$max; $i++) {
    runCmd("$iw phy $phy interface add sta$i type station");
    my $mc5 = $i + 1;
    if (length($mc5) == 1) {
        $mc5 = "0$mc5"; # pad mac octet
    }
    my $mac = "$bmac$mc5";
    runCmd("$ip link set sta$i address $mac");

    runCmd("$iw dev sta$i set power_save off");
    runCmd("$ip addr add 9.99.1.$mc5/24 dev sta$i");
    runCmd("./pingbg -I sta$i 9.99.1.1");
}

# Bring them up with WPA
for ($i = 0; $i<$max; $i++) {
    open(FD, ">sta$i" . "_wpa.conf") || die("Couldn't open file: $!\n");
    print FD "
ctrl_interface=/var/run/wpa_supplicant
fast_reauth=1
#can_scan_one=1
network={
ssid=\"$ssid\"
proto=WPA
key_mgmt=WPA-PSK
psk=\"$key\"
pairwise=TKIP CCMP
group=TKIP CCMP
}
";
    # Close explicitly so the last config file is flushed before wpa_supplicant starts.
    close(FD);
    #runCmd("$wpa_s -B -i sta$i -c sta$i" . "_wpa.conf -P sta$i" . "_wpa.pid -t -f sta$i" . "_wpa.log");
}

# Build command to start one wpa_supplicant for all interfaces.
# ($cmd is already declared near the top of the script.)
$cmd = "$wpa_s -B -g /var/run/wpa_supplicant_if -P /tmp/wpa_supplicant-all.pid -t -f /tmp/wpa_supplicant_log_all.txt -i sta0 -c sta0_wpa.conf";
for ($i = 1; $i<$max; $i++) {
    $cmd = "$cmd -N -i sta$i -c sta$i" . "_wpa.conf";
}
runCmd($cmd);

sub runCmd {
    my $cmd = shift;
    print "$cmd\n";
    `$cmd`;
}
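
The station loop in the script derives the last MAC octet and the last IPv4 octet from the same zero-padded decimal string. A minimal sketch of the same per-station addressing in C follows, with the two fields formatted separately. The helper names are hypothetical, and the sketch deliberately differs from the script in one respect: it writes the MAC octet in hexadecimal and the IPv4 octet without zero padding, whereas the script reuses the padded decimal digits for both.

```c
#include <stdio.h>

/* Prefixes taken from the script; the helper names below are hypothetical. */
#define MAC_PREFIX "00:01:02:03:04:"
#define IP_PREFIX  "9.99.1."

/* Write "00:01:02:03:04:XX" for 0-based station index i, XX in hex. */
static int format_sta_mac(char *buf, size_t len, unsigned int i)
{
    return snprintf(buf, len, MAC_PREFIX "%02x", (i + 1) & 0xffu);
}

/* Write "9.99.1.N" for 0-based station index i, N in plain decimal. */
static int format_sta_ip(char *buf, size_t len, unsigned int i)
{
    return snprintf(buf, len, IP_PREFIX "%u", (i + 1) & 0xffu);
}
```

Keeping the two formats separate avoids relying on how a given tool parses a leading zero in a dotted-quad address.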


Example kernel crash output:

ADDRCONF(NETDEV_CHANGE): sta6: link becomes ready
ADDRCONF(NETDEV_CHANGE): sta5: link becomes ready
ADDRCONF(NETDEV_CHANGE): sta4: link becomes ready
ADDRCONF(NETDEV_CHANGE): sta3: link becomes ready
ADDRCONF(NETDEV_CHANGE): sta1: link becomes ready
ADDRCONF(NETDEV_CHANGE): sta0: link becomes ready
padlock: VIA PadLock not detected.

[root@ath9k-dev1 ~]# ADDRCONF(NETDEV_CHANGE): sta30: link becomes ready
ADDRCONF(NETDEV_CHANGE): sta29: link becomes ready
------------[ cut here ]------------
WARNING: at /home/greearb/git/linux.wireless-testing/drivers/net/wireless/ath/ath9k/recv.c:532 ath_stoprecv+0x90/0x9a [ath9k]()
Hardware name: PDSBM
Could not stop RX, we could be confusing the DMA engine when we start RX up
Modules linked in: aes_i586 aes_generic fuse nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput arc4 ecb ath9k mac80211 ath9k_common ath9k_hw mi]
Pid: 3505, comm: wpa_supplicant Not tainted 2.6.37-rc3-wl+ #53
Call Trace:
[<78436fe9>] warn_slowpath_common+0x77/0x8c
[<f933019e>] ? ath_stoprecv+0x90/0x9a [ath9k]
[<f933019e>] ? ath_stoprecv+0x90/0x9a [ath9k]
[<7843707a>] warn_slowpath_fmt+0x2e/0x30
[<f933019e>] ath_stoprecv+0x90/0x9a [ath9k]
[<f932f13c>] ath_set_channel+0x94/0x1e8 [ath9k]
[<7845a425>] ? mark_held_locks+0x47/0x5f
[<7878e5bb>] ? _raw_spin_unlock_irqrestore+0x3c/0x48
[<f932f5d4>] ath9k_config+0x344/0x423 [ath9k]
[<f919aaaa>] ieee80211_hw_config+0x11b/0x125 [mac80211]
[<f91aa25a>] ieee80211_set_channel+0x74/0x9e [mac80211]
[<f8d37d36>] cfg80211_set_freq+0xf3/0x12d [cfg80211]
[<f91aa1e6>] ? ieee80211_set_channel+0x0/0x9e [mac80211]
[<f8d3a24c>] cfg80211_mgd_wext_siwfreq+0x108/0x148 [cfg80211]
[<f8d395c9>] cfg80211_wext_siwfreq+0x42/0xbf [cfg80211]
[<7876e14f>] ioctl_standard_call+0x52/0x28e
[<786f2db3>] ? dev_name_hash+0x16/0x48
[<786f67cc>] ? __dev_get_by_name+0x32/0x3d
[<7876e418>] wext_handle_ioctl+0x8d/0x18d
[<f8d39587>] ? cfg80211_wext_siwfreq+0x0/0xbf [cfg80211]
[<786f78f9>] dev_ioctl+0x520/0x53f
[<786e5f7f>] ? sock_ioctl+0x0/0x202
[<786e6175>] sock_ioctl+0x1f6/0x202
[<7878e576>] ? _raw_spin_unlock_irq+0x22/0x2b
[<786e5f7f>] ? sock_ioctl+0x0/0x202
[<784cc151>] do_vfs_ioctl+0x4b1/0x4f6
[<7878e576>] ? _raw_spin_unlock_irq+0x22/0x2b
[<784303cd>] ? finish_task_switch+0x72/0xd4
[<784c14a9>] ? fcheck_files+0x9b/0xca
[<784c1505>] ? fget_light+0x2d/0xb0
[<784cc1d9>] sys_ioctl+0x43/0x62
[<784030dc>] sysenter_do_call+0x12/0x38
---[ end trace 34d8f42d696b7763 ]---
------------[ cut here ]------------
WARNING: at /home/greearb/git/linux.wireless-testing/net/wireless/mlme.c:285 __cfg80211_auth_remove+0x98/0x9e [cfg80211]()
Hardware name: PDSBM
Modules linked in: aes_i586 aes_generic fuse nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput arc4 ecb ath9k mac80211 ath9k_common ath9k_hw mi]
Pid: 38, comm: kworker/u:1 Tainted: G W 2.6.37-rc3-wl+ #53
Call Trace:
[<78436fe9>] warn_slowpath_common+0x77/0x8c
[<f8d34888>] ? __cfg80211_auth_remove+0x98/0x9e [cfg80211]
[<f8d34888>] ? __cfg80211_auth_remove+0x98/0x9e [cfg80211]
[<7843701b>] warn_slowpath_null+0x1d/0x1f
[<f8d34888>] __cfg80211_auth_remove+0x98/0x9e [cfg80211]
[<f8d34fc2>] cfg80211_send_auth_timeout+0x90/0xa0 [cfg80211]
[<7845a681>] ? trace_hardirqs_on_caller+0x104/0x125
[<7845a6ad>] ? trace_hardirqs_on+0xb/0xd
[<f91a434b>] ieee80211_probe_auth_done+0x1e/0x7b [mac80211]
[<f91a6861>] ieee80211_work_work+0xd51/0xd8f [mac80211]
[<7845a681>] ? trace_hardirqs_on_caller+0x104/0x125
[<7845a602>] ? trace_hardirqs_on_caller+0x85/0x125
[<78447000>] process_one_work+0x1af/0x2bf
[<78446f8f>] ? process_one_work+0x13e/0x2bf
[<f91a5b10>] ? ieee80211_work_work+0x0/0xd8f [mac80211]
[<7844874e>] worker_thread+0xf9/0x1bf
[<78448655>] ? worker_thread+0x0/0x1bf
[<7844b27e>] kthread+0x62/0x67
[<7844b21c>] ? kthread+0x0/0x67
[<784036c6>] kernel_thread_helper+0x6/0x1a
---[ end trace 34d8f42d696b7764 ]---
e1000e 0000:06:00.0: eth0: Detected Hardware Unit Hang:
TDH <f1>
TDT <f4>
next_to_use <f4>
next_to_clean <f1>
buffer_info[next_to_clean]:
time_stamp <bcc5>
next_to_watch <f1>
jiffies <c73c>
next_to_watch.status <0>
MAC Status <80080f83>
PHY Status <796d>
PHY 1000BASE-T Status <7c00>
PHY Extended Status <3000>
PCI Status <4010>
e1000e 0000:06:00.0: eth0: Detected Hardware Unit Hang:
TDH <f1>
TDT <f4>
next_to_use <f4>
next_to_clean <f1>
buffer_info[next_to_clean]:
time_stamp <bcc5>
next_to_watch <f1>
jiffies <cf0c>
next_to_watch.status <0>
MAC Status <80080f83>
PHY Status <796d>
PHY 1000BASE-T Status <7c00>
PHY Extended Status <3000>
PCI Status <4010>
BUG: unable to handle kernel NULL pointer dereference at 00000040
IP: [<f933470a>] ath_tx_start+0x461/0x5ef [ath9k]
*pde = 00000000
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:08:01.0/irq
Modules linked in: aes_i586 aes_generic fuse nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput arc4 ecb ath9k mac80211 ath9k_common ath9k_hw mi]

Pid: 38, comm: kworker/u:1 Tainted: G W 2.6.37-rc3-wl+ #53 PDSBM/PDSBM
EIP: 0060:[<f933470a>] EFLAGS: 00010246 CPU: 1
EIP is at ath_tx_start+0x461/0x5ef [ath9k]

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com



2010-11-30 00:52:37

by Ben Greear

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On 11/29/2010 04:44 PM, Luis R. Rodriguez wrote:
> On Mon, Nov 29, 2010 at 04:28:51PM -0800, Ben Greear wrote:
>> Here is a script that reliably crashes my ath9k box.
>> A second box with completely different hardware (except
>> for ath9k) experiences similar problems.

>> BUG: unable to handle kernel NULL pointer dereference at 00000040
>> IP: [<f933470a>] ath_tx_start+0x461/0x5ef [ath9k]
>> *pde = 00000000
>> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>> last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:08:01.0/irq
>> Modules linked in: aes_i586 aes_generic fuse nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput arc4 ecb ath9k mac80211 ath9k_common ath9k_hw mi]
>>
>> Pid: 38, comm: kworker/u:1 Tainted: G W 2.6.37-rc3-wl+ #53 PDSBM/PDSBM
>> EIP: 0060:[<f933470a>] EFLAGS: 00010246 CPU: 1
>> EIP is at ath_tx_start+0x461/0x5ef [ath9k]
>
> Please use
>
> gdb drivers/net/wireless/ath/ath9k/
> l *(ath_tx_start+0x461)

Usually the machine locks up pretty hard, with irq errors reported from wired NICs
and/or the hard drive. I'm not sure that the ath_tx_start
issue is real; it may just be an unlucky side effect of earlier
bugs in this trace.

Reading symbols from /home/greearb/kernel/2.6/wireless-testing-dbg.p4s/drivers/net/wireless/ath/ath9k/ath9k.ko...done.
(gdb) l *(ath_tx_start+0x461)
0x972e is in ath_tx_start (/home/greearb/git/linux.wireless-testing/drivers/net/wireless/ath/ath9k/xmit.c:1691).
1686 if ((tx_info->flags & IEEE80211_TX_CTL_AMPDU) && txctl->an) {
1687 tidno = ieee80211_get_qos_ctl(hdr)[0] &
1688 IEEE80211_QOS_CTL_TID_MASK;
1689 tid = ATH_AN_2_TID(txctl->an, tidno);
1690
1691 WARN_ON(tid->ac->txq != txctl->txq);
1692 /*
1693 * Try aggregation if it's a unicast data frame
1694 * and the destination is HT capable.
1695 */
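
A fault address of 00000040 is a small offset from zero, which is the usual signature of reading a structure member through a NULL (or near-NULL) pointer. The sketch below shows only that arithmetic, in user space. The structure layout is hypothetical and chosen so that the member lands at offset 0x40; it is not the real ath9k layout, which depends on the kernel build.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; NOT the real ath9k structs. */
struct demo_obj {
    char  other_members[0x40]; /* stand-in for whatever precedes the member */
    void *member;              /* member read by the failing expression */
};

/*
 * Address the CPU would touch when evaluating obj->member for a given
 * pointer value 'obj'. Pure integer arithmetic; nothing is dereferenced.
 */
static uintptr_t member_fault_address(uintptr_t obj)
{
    return obj + offsetof(struct demo_obj, member);
}
```

With this assumed layout, a pointer value of 0 gives a fault address of 0x40. Comparing the reported address against the real member offsets (for example with gdb on the module) would be the way to tell which pointer in the expression was bad.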


Thanks,
Ben

>
> Luis


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2010-11-30 00:44:29

by Luis R. Rodriguez

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On Mon, Nov 29, 2010 at 04:28:51PM -0800, Ben Greear wrote:
> Here is a script that reliably crashes my ath9k box.
> A second box with completely different hardware (except
> for ath9k) experiences similar problems.
>
> BUG: unable to handle kernel NULL pointer dereference at 00000040
> IP: [<f933470a>] ath_tx_start+0x461/0x5ef [ath9k]
> *pde = 00000000
> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:08:01.0/irq
> Modules linked in: aes_i586 aes_generic fuse nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput arc4 ecb ath9k mac80211 ath9k_common ath9k_hw mi]
>
> Pid: 38, comm: kworker/u:1 Tainted: G W 2.6.37-rc3-wl+ #53 PDSBM/PDSBM
> EIP: 0060:[<f933470a>] EFLAGS: 00010246 CPU: 1
> EIP is at ath_tx_start+0x461/0x5ef [ath9k]

Please use

gdb drivers/net/wireless/ath/ath9k/
l *(ath_tx_start+0x461)

Luis

2010-12-06 19:53:52

by Luis R. Rodriguez

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On Mon, Dec 06, 2010 at 11:53:13AM -0800, Luis Rodriguez wrote:
> On Mon, Dec 06, 2010 at 11:47:47AM -0800, Ben Greear wrote:
> > On 12/06/2010 11:36 AM, Luis R. Rodriguez wrote:
> >
> > > Can you clarify the status of this issue? It remains unclear to me from
> > > your above description how things are going. As I read it some things
> > > look OK now but you still get a warning.
> >
> > Ok, since you asked :)
> >
> > I worked on this over the weekend and this morning. I had all sorts of
> > issues until I realized that I had one STA with non-configured SSID.
> > It sometimes connected to one /a AP and the other STAs attempted to connect
> > to another /n (on entirely different band) AP. I basically got zero stations associated for any length
> > of time due to constant channel switching. No crashes, but lots of
> > warnings about DMA failing to stop.
> >
> > Now..I've fixed this configuration issue (and am adding steps to help prevent this mis-configuration
> > again).
> >
> > With 16 properly configured non-encrypted stations, running with wpa-supplicant
> > with netlink driver & sharing scan results, the interfaces quickly associate.
> >
> > However, I do continue to see DMA warnings such as these (I had picked up my
> > portable phone, and it knocked all the interfaces offline ..here
> > they are coming back up after I hung up the phone).
> >
> > Please note that I ported Felix's 2.6.37 patch he posted this morning
> > to wireless-testing and have applied it.
> >
> > I'm highly tempted to just make that a WARN_ON_ONCE so at least my logs
> > aren't spammed so heavily with the recv.c:531 DMA warning.
>
> You can send this change upstream as well.

Also, feel free to limit the number of STAs you can have up
physically by setting this to a number you bless yourself.

Luis
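
The WARN_ON_ONCE idea quoted above amounts to reporting a condition the first time it is seen and staying quiet afterwards. A minimal user-space sketch of that pattern follows. The type and function names are hypothetical, and this is not the kernel macro: the kernel's WARN_ON_ONCE evaluates to the condition itself, while this helper returns whether a message was printed and also keeps a hit counter.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical user-space analogue of a warn-once check; not kernel code. */
struct warn_once {
    bool fired;          /* set after the first report */
    unsigned long hits;  /* every occurrence is still counted */
};

/* Prints 'msg' the first time 'cond' is true; returns true only then. */
static bool warn_once_check(struct warn_once *w, bool cond, const char *msg)
{
    if (!cond)
        return false;
    w->hits++;
    if (w->fired)
        return false;
    w->fired = true;
    fprintf(stderr, "WARNING (reported once): %s\n", msg);
    return true;
}
```

Counting the suppressed occurrences keeps some of the information that a plain warn-once would drop from the logs.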

2010-12-06 19:36:06

by Luis R. Rodriguez

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On Sat, Dec 04, 2010 at 09:18:50PM -0800, Ben Greear wrote:
> On 12/04/2010 06:41 PM, Felix Fietkau wrote:
> > On 2010-12-03 9:14 AM, Ben Greear wrote:
> >> On 12/01/2010 03:22 PM, Ben Greear wrote:
> >>> On 11/29/2010 04:44 PM, Luis R. Rodriguez wrote:
> >>>> On Mon, Nov 29, 2010 at 04:28:51PM -0800, Ben Greear wrote:
> >>>
> >>>>> BUG: unable to handle kernel NULL pointer dereference at 00000040
> >>>>> IP: [<f933470a>] ath_tx_start+0x461/0x5ef [ath9k]
> >>>>> *pde = 00000000
> >>>>> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> >>>>> last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:08:01.0/irq
> >>>>> Modules linked in: aes_i586 aes_generic fuse nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput arc4 ecb ath9k mac80211 ath9k_common ath9k_hw mi]
> >>>>>
> >>>>> Pid: 38, comm: kworker/u:1 Tainted: G W 2.6.37-rc3-wl+ #53 PDSBM/PDSBM
> >>>>> EIP: 0060:[<f933470a>] EFLAGS: 00010246 CPU: 1
> >>>>> EIP is at ath_tx_start+0x461/0x5ef [ath9k]
> >>>>
> >>>> Please use
> >>>>
> >>>> gdb drivers/net/wireless/ath/ath9k/
> >>>> l *(ath_tx_start+0x461)
> >>>>
> >>>> Luis
> >>>
> >>> I managed to hit that ath_tx_start crash again, and this time there were no obvious
> >>> DMA or irq errors immediately preceding it. So, it might be a real bug
> >>> after all. I'll add some extra checks to see if tid->ac is NULL.
> >>
> >> I've made some small progress on this general issue.
> >>
> >> First, I added all sorts of debugging to try to figure out ath_tx_start crash.
> >> As best as I can tell, 'tid' is not NULL, but also is not a valid pointer,
> >> and probably something close to 0x0. I've added yet more debugging, but haven't
> >> hit the problem again.
> >>
> >> I also tried stopping DMA in a loop up to 5 times if it failed to stop
> >> previously in the loop. This did not appear to help at all.
> >>
> >> I also managed to make both the ath_tx_start crash and the DMA errors very hard to reproduce
> >> (I dare not say fixed, yet).
> >>
> >> It appears that this small patch (and possibly, the fact that I set debugging to 0x600
> >> instead of 0x400) makes the problems go away. This makes me wonder if a root cause is
> >> something to do with repeatedly resetting the hardware too fast, as setting channels rapidly
> >> would tend to do that, and channels are set on association by supplicant, it appears.
> > Please try this patch while leaving the unnecessary resets in place.
> > I found that when ath_drain_all_txq finds tx dma not stopped, it will
> > issue a reset at a point in time where it is both useless (since it's
> > right before a reset anyway) and dangerous (since the rx dma engine
> > isn't even disabled yet), so IMHO the right thing to do is to drop
> > this extra reset.
> >
> > --- a/drivers/net/wireless/ath/ath9k/xmit.c
> > +++ b/drivers/net/wireless/ath/ath9k/xmit.c
> > @@ -1194,18 +1194,8 @@ void ath_drain_all_txq(struct ath_softc
> > }
> > }
> >
> > - if (npend) {
> > - int r;
> > -
> > - ath_print(common, ATH_DBG_FATAL,
> > - "Failed to stop TX DMA. Resetting hardware!\n");
> > -
> > - r = ath9k_hw_reset(ah, sc->sc_ah->curchan, ah->caldata, false);
> > - if (r)
> > - ath_print(common, ATH_DBG_FATAL,
> > - "Unable to reset hardware; reset status %d\n",
> > - r);
> > - }
> > + if (npend)
> > + ath_print(common, ATH_DBG_FATAL, "Failed to stop TX DMA!\n");
> >
> > for (i = 0; i< ATH9K_NUM_TX_QUEUES; i++) {
> > if (ATH_TXQ_SETUP(sc, i))
>
>
> I applied this on top of all my patches, and on top of the 4 that Luis recently
> posted.
>
> I'm trying this on a different system than normal..happens to be configured
> with 115 stations. It was getting this fail-to-stop-RX warning even with my
> channel-change mitigation patch, so I left it in. I can still test w/it removed
> if you want.
>
> None of my interfaces are using WPA (or supplicant)..just un-encrypted
> association to an AP 3 feet away.
>
> The recent success I had on Friday was on a different system entirely,
> with only 84 STAs, and using wpa-supplicant with 30 or so stations
> using WPA and the other 55 on a different AP un-encrypted (still using
> wpa_supplicant for all of these).
>
> So, can't compare my previous reports directly with this one.
>
> I'm going to re-configure this one to have smaller numbers of
> stations and use wpa_supplicant..will see how that goes.
>
> Even with all these warnings in the logs..system is basically stable and
> a few interfaces are able to associate, at least for a short time.
>
>
> WARNING: at /home/greearb/git/linux.wireless-testing/drivers/net/wireless/ath/ath9k/recv.c:538 ath_stoprecv+0xcd/0xd7 [ath9k]()
> Hardware name: 945GM
> Could not stop RX, we could be confusing the DMA engine when we start RX up
> Modules linked in: 8021q garp stp llc michael_mic macvlan pktgen iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss
> sunrpc p4_clockmod ipv6 uinput arc4 ecb ath9k mac80211 snd_intel8x0 snd_ac97_codec ath9k_common ac97_bus snd_seq snd_seq_device ath9k_hw ath snd_pcm pcspkr
> i2c_i801 serio_raw cfg80211 iTCO_wdt iTCO_vendor_support microcode snd_timer snd soundcore e1000e snd_page_alloc yenta_socket floppy i915 drm_kms_helper drm
> i2c_algo_bit i2c_core video output [last unloaded: ipt_addrtype]
> Pid: 5, comm: kworker/u:0 Tainted: G W 2.6.37-rc4-wl+ #16
> Call Trace:
> [<78436fbd>] warn_slowpath_common+0x77/0x8c
> [<f946028f>] ? ath_stoprecv+0xcd/0xd7 [ath9k]
> [<f946028f>] ? ath_stoprecv+0xcd/0xd7 [ath9k]
> [<7843704e>] warn_slowpath_fmt+0x2e/0x30
> [<f946028f>] ath_stoprecv+0xcd/0xd7 [ath9k]
> [<f945e4bb>] ath_reset+0x55/0x163 [ath9k]
> [<7845a68d>] ? trace_hardirqs_on+0xb/0xd
> [<f9462830>] ath_tx_complete_poll_work+0x90/0xdf [ath9k]
> [<78446fd4>] process_one_work+0x1af/0x2bf
> [<78446f63>] ? process_one_work+0x13e/0x2bf
> [<f94627a0>] ? ath_tx_complete_poll_work+0x0/0xdf [ath9k]
> [<78448722>] worker_thread+0xf9/0x1bf
> [<78448629>] ? worker_thread+0x0/0x1bf
> [<7844b252>] kthread+0x62/0x67
> [<7844b1f0>] ? kthread+0x0/0x67
> [<784036c6>] kernel_thread_helper+0x6/0x1a

Can you clarify the status of this issue? It remains unclear to me from
your above description how things are going. As I read it some things
look OK now but you still get a warning.

Luis

2010-12-06 21:16:27

by Luis R. Rodriguez

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On Mon, Dec 06, 2010 at 01:00:05PM -0800, Ben Greear wrote:
> On 12/06/2010 12:42 PM, Luis R. Rodriguez wrote:
> > On Mon, Dec 06, 2010 at 12:22:26PM -0800, Ben Greear wrote:
> >> On 12/06/2010 12:11 PM, Björn Smedman wrote:
> >>> On Mon, Dec 6, 2010 at 8:47 PM, Ben Greear<[email protected]> wrote:
> >>>> With 16 properly configured non-encrypted stations, running with
> >>>> wpa-supplicant
> >>>> with netlink driver & sharing scan results, the interfaces quickly
> >>>> associate.
> >>>>
> >>>> However, I do continue to see DMA warnings such as these (I had picked up my
> >>>> portable phone, and it knocked all the interfaces offline ..here
> >>>> they are coming back up after I hung up the phone).
> >>>
> >>> Is there some theory as to why using multiple interfaces cause so many
> >>> problems with DMA?
> >>
> >> Seems pretty directly related to channel changes and/or resets, and exacerbated
> >> by other interfaces sending data while another is scanning, for instance.
> >>
> >> Other issues we've found in the past have been various races that you wouldn't
> >> normally see with a single VIF.
> >
> > Right, there might be some other hot path we need to lock around.
> > Not sure what it could be, though; we should already be holding the lock
> > while stopping RX across resets. These should all be atomic, in fact
> > starting TX too IIRC, hence the lock was renamed to be
> > specific to the PCU. There may be other PCU changes
> > we need to contend with.
>
> Maybe the hardware/firmware guys could give us some clues as to what
> types of things can cause stopping RX DMA to fail? Maybe that could
> point us to what might be racing with the attempts to stop RX DMA?

We have no firmware, but yeah understanding how the hardware
blocks would be key here. Good point.

Luis

2010-12-05 05:18:59

by Ben Greear

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On 12/04/2010 06:41 PM, Felix Fietkau wrote:
> On 2010-12-03 9:14 AM, Ben Greear wrote:
>> On 12/01/2010 03:22 PM, Ben Greear wrote:
>>> On 11/29/2010 04:44 PM, Luis R. Rodriguez wrote:
>>>> On Mon, Nov 29, 2010 at 04:28:51PM -0800, Ben Greear wrote:
>>>
>>>>> BUG: unable to handle kernel NULL pointer dereference at 00000040
>>>>> IP: [<f933470a>] ath_tx_start+0x461/0x5ef [ath9k]
>>>>> *pde = 00000000
>>>>> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>>>>> last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:08:01.0/irq
>>>>> Modules linked in: aes_i586 aes_generic fuse nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput arc4 ecb ath9k mac80211 ath9k_common ath9k_hw mi]
>>>>>
>>>>> Pid: 38, comm: kworker/u:1 Tainted: G W 2.6.37-rc3-wl+ #53 PDSBM/PDSBM
>>>>> EIP: 0060:[<f933470a>] EFLAGS: 00010246 CPU: 1
>>>>> EIP is at ath_tx_start+0x461/0x5ef [ath9k]
>>>>
>>>> Please use
>>>>
>>>> gdb drivers/net/wireless/ath/ath9k/
>>>> l *(ath_tx_start+0x461)
>>>>
>>>> Luis
>>>
>>> I managed to hit that ath_tx_start crash again, and this time there were no obvious
>>> DMA or irq errors immediately preceding it. So, it might be a real bug
>>> after all. I'll add some extra checks to see if tid->ac is NULL.
>>
>> I've made some small progress on this general issue.
>>
>> First, I added all sorts of debugging to try to figure out ath_tx_start crash.
>> As best as I can tell, 'tid' is not NULL, but also is not a valid pointer,
>> and probably something close to 0x0. I've added yet more debugging, but haven't
>> hit the problem again.
>>
>> I also tried stopping DMA in a loop up to 5 times if it failed to stop
>> previously in the loop. This did not appear to help at all.
>>
>> I also managed to make both the ath_tx_start crash and the DMA errors very hard to reproduce
>> (I dare not say fixed, yet).
>>
>> It appears that this small patch (and possibly, the fact that I set debugging to 0x600
>> instead of 0x400) makes the problems go away. This makes me wonder if a root cause is
>> something to do with repeatedly resetting the hardware too fast, as setting channels rapidly
>> would tend to do that, and channels are set on association by supplicant, it appears.
> Please try this patch while leaving the unnecessary resets in place.
> I found that when ath_drain_all_txq finds tx dma not stopped, it will
> issue a reset at a point in time where it is both useless (since it's
> right before a reset anyway) and dangerous (since the rx dma engine
> isn't even disabled yet), so IMHO the right thing to do is to drop
> this extra reset.
>
> --- a/drivers/net/wireless/ath/ath9k/xmit.c
> +++ b/drivers/net/wireless/ath/ath9k/xmit.c
> @@ -1194,18 +1194,8 @@ void ath_drain_all_txq(struct ath_softc
>  		}
>  	}
>
> -	if (npend) {
> -		int r;
> -
> -		ath_print(common, ATH_DBG_FATAL,
> -			  "Failed to stop TX DMA. Resetting hardware!\n");
> -
> -		r = ath9k_hw_reset(ah, sc->sc_ah->curchan, ah->caldata, false);
> -		if (r)
> -			ath_print(common, ATH_DBG_FATAL,
> -				  "Unable to reset hardware; reset status %d\n",
> -				  r);
> -	}
> +	if (npend)
> +		ath_print(common, ATH_DBG_FATAL, "Failed to stop TX DMA!\n");
>
>  	for (i = 0; i < ATH9K_NUM_TX_QUEUES; i++) {
>  		if (ATH_TXQ_SETUP(sc, i))


I applied this on top of all my patches, and on top of the 4 that Luis recently
posted.

I'm trying this on a different system than normal; it happens to be configured
with 115 stations. It was getting this fail-to-stop-RX warning even with my
channel-change mitigation patch, so I left that patch in. I can still test with it removed
if you want.

None of my interfaces are using WPA (or supplicant)..just un-encrypted
association to an AP 3 feet away.

The recent success I had on Friday was on a different system entirely,
with only 84 STAs, and using wpa-supplicant with 30 or so stations
using WPA and the other 55 on a different AP un-encrypted (still using
wpa_supplicant for all of these).

So, can't compare my previous reports directly with this one.

I'm going to re-configure this one to have smaller numbers of
stations and use wpa_supplicant..will see how that goes.

Even with all these warnings in the logs, the system is basically stable and
a few interfaces are able to associate, at least for a short time.

WARNING: at /home/greearb/git/linux.wireless-testing/drivers/net/wireless/ath/ath9k/recv.c:538 ath_stoprecv+0xcd/0xd7 [ath9k]()
Hardware name: 945GM
Could not stop RX, we could be confusing the DMA engine when we start RX up
Modules linked in: 8021q garp stp llc michael_mic macvlan pktgen iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss
sunrpc p4_clockmod ipv6 uinput arc4 ecb ath9k mac80211 snd_intel8x0 snd_ac97_codec ath9k_common ac97_bus snd_seq snd_seq_device ath9k_hw ath snd_pcm pcspkr
i2c_i801 serio_raw cfg80211 iTCO_wdt iTCO_vendor_support microcode snd_timer snd soundcore e1000e snd_page_alloc yenta_socket floppy i915 drm_kms_helper drm
i2c_algo_bit i2c_core video output [last unloaded: ipt_addrtype]
Pid: 5, comm: kworker/u:0 Tainted: G W 2.6.37-rc4-wl+ #16
Call Trace:
[<78436fbd>] warn_slowpath_common+0x77/0x8c
[<f946028f>] ? ath_stoprecv+0xcd/0xd7 [ath9k]
[<f946028f>] ? ath_stoprecv+0xcd/0xd7 [ath9k]
[<7843704e>] warn_slowpath_fmt+0x2e/0x30
[<f946028f>] ath_stoprecv+0xcd/0xd7 [ath9k]
[<f945e4bb>] ath_reset+0x55/0x163 [ath9k]
[<7845a68d>] ? trace_hardirqs_on+0xb/0xd
[<f9462830>] ath_tx_complete_poll_work+0x90/0xdf [ath9k]
[<78446fd4>] process_one_work+0x1af/0x2bf
[<78446f63>] ? process_one_work+0x13e/0x2bf
[<f94627a0>] ? ath_tx_complete_poll_work+0x0/0xdf [ath9k]
[<78448722>] worker_thread+0xf9/0x1bf
[<78448629>] ? worker_thread+0x0/0x1bf
[<7844b252>] kthread+0x62/0x67
[<7844b1f0>] ? kthread+0x0/0x67
[<784036c6>] kernel_thread_helper+0x6/0x1a



--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2010-12-06 19:47:59

by Ben Greear

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On 12/06/2010 11:36 AM, Luis R. Rodriguez wrote:

> Can you clarify the status of this issue. It remains unclear to me from
> your above description how things are going. As I read it some things
> look OK now but you still get a warning.

Ok, since you asked :)

I worked on this over the weekend and this morning. I had all sorts of
issues until I realized that I had one STA with non-configured SSID.
It sometimes connected to one /a AP and the other STAs attempted to connect
to another /n (on entirely different band) AP. I basically got zero stations associated for any length
of time due to constant channel switching. No crashes, but lots of
warnings about DMA failing to stop.

Now I've fixed this configuration issue (and am adding steps to help prevent this mis-configuration
again).

With 16 properly configured non-encrypted stations, running with wpa-supplicant
with netlink driver & sharing scan results, the interfaces quickly associate.

However, I do continue to see DMA warnings such as these (I had picked up my
portable phone, and it knocked all the interfaces offline ..here
they are coming back up after I hung up the phone).

Please note that I ported Felix's 2.6.37 patch he posted this morning
to wireless-testing and have applied it.

I'm highly tempted to just make that a WARN_ON_ONCE so at least my logs
aren't spammed so heavily with the recv.c:531 DMA warning.
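
The warn-once idea can be sketched in plain userspace C. This is only an illustration of the pattern, not the kernel's WARN_ON_ONCE implementation; the function name, the counter, and the message text are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical counter: how many times the warning was actually printed. */
static unsigned int rx_stop_warnings;

/*
 * Returns the condition unchanged, so callers can still branch on it,
 * but prints the warning only the first time the condition is true.
 */
static bool warn_rx_stop_once(bool failed)
{
	static bool warned;

	if (failed && !warned) {
		warned = true;
		rx_stop_warnings++;
		fprintf(stderr, "Could not stop RX (further warnings suppressed)\n");
	}
	return failed;
}
```

The trade-off is the one implied above: the first occurrence is still recorded, but later occurrences no longer show up in the log, so the log stops indicating how often the failure happens.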

Dec 6 11:32:15 atom kernel: sta2: direct probe to 00:18:e7:cb:ad:6e timed out
Dec 6 11:32:15 atom kernel: sta14: direct probe to 00:18:e7:cb:ad:6e timed out
Dec 6 11:32:15 atom kernel: ieee80211 wiphy0: device now idle
Dec 6 11:32:15 atom kernel: ieee80211 wiphy0: device no longer idle - scanning
Dec 6 11:32:15 atom kernel: start_sw_scan: running-other-vifs: 0 running-station-vifs: 16, associated-stations: 0 scanning all channels.
Dec 6 11:32:17 atom kernel: ieee80211 wiphy0: device now idle
Dec 6 11:32:22 atom kernel: ieee80211 wiphy0: device no longer idle - scanning
Dec 6 11:32:22 atom kernel: start_sw_scan: running-other-vifs: 0 running-station-vifs: 16, associated-stations: 0 scanning all channels.
Dec 6 11:32:24 atom kernel: ieee80211 wiphy0: device now idle
Dec 6 11:32:29 atom kernel: ieee80211 wiphy0: device no longer idle - scanning
Dec 6 11:32:29 atom kernel: start_sw_scan: running-other-vifs: 0 running-station-vifs: 16, associated-stations: 0 scanning all channels.
Dec 6 11:32:29 atom kernel: ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020
Dec 6 11:32:29 atom kernel: ------------[ cut here ]------------
Dec 6 11:32:29 atom kernel: WARNING: at /home/greearb/git/linux.wireless-testing/drivers/net/wireless/ath/ath9k/recv.c:531 ath_stoprecv+0x90/0x9a [ath9)
Dec 6 11:32:29 atom kernel: Hardware name: 945GM
Dec 6 11:32:29 atom kernel: Could not stop RX, we could be confusing the DMA engine when we start RX up
Dec 6 11:32:29 atom kernel: Modules linked in: michael_mic ath9k mac80211 ath9k_common ath9k_hw ath cfg80211 arc4 8021q garp stp llc macvlan pktgen isc]
Dec 6 11:32:29 atom kernel: Pid: 2732, comm: kworker/u:2 Tainted: G W 2.6.37-rc4-wl+ #17
Dec 6 11:32:29 atom kernel: Call Trace:
Dec 6 11:32:29 atom kernel: [<78436fbd>] warn_slowpath_common+0x77/0x8c
Dec 6 11:32:29 atom kernel: [<fb7a125e>] ? ath_stoprecv+0x90/0x9a [ath9k]
Dec 6 11:32:29 atom kernel: [<fb7a125e>] ? ath_stoprecv+0x90/0x9a [ath9k]
Dec 6 11:32:29 atom kernel: [<7843704e>] warn_slowpath_fmt+0x2e/0x30
Dec 6 11:32:29 atom kernel: [<fb7a125e>] ath_stoprecv+0x90/0x9a [ath9k]
Dec 6 11:32:29 atom kernel: [<fb7a0182>] ath_set_channel+0x94/0x1f2 [ath9k]
Dec 6 11:32:29 atom kernel: [<7845a405>] ? mark_held_locks+0x47/0x5f
Dec 6 11:32:29 atom kernel: [<7878e7cb>] ? _raw_spin_unlock_irqrestore+0x3c/0x48
Dec 6 11:32:29 atom kernel: [<fb7a067a>] ath9k_config+0x39a/0x479 [ath9k]
Dec 6 11:32:29 atom kernel: [<fb6caaaa>] ieee80211_hw_config+0x11b/0x125 [mac80211]
Dec 6 11:32:29 atom kernel: [<fb6cef1b>] ieee80211_scan_work+0x29e/0x3f7 [mac80211]
Dec 6 11:32:29 atom kernel: [<78446f63>] ? process_one_work+0x13e/0x2bf
Dec 6 11:32:29 atom kernel: [<78446fd4>] process_one_work+0x1af/0x2bf
Dec 6 11:32:29 atom kernel: [<78446f63>] ? process_one_work+0x13e/0x2bf
Dec 6 11:32:29 atom kernel: [<fb6cec7d>] ? ieee80211_scan_work+0x0/0x3f7 [mac80211]
Dec 6 11:32:29 atom kernel: [<78448722>] worker_thread+0xf9/0x1bf
Dec 6 11:32:29 atom kernel: [<78448629>] ? worker_thread+0x0/0x1bf
Dec 6 11:32:29 atom kernel: [<7844b252>] kthread+0x62/0x67
Dec 6 11:32:29 atom kernel: [<7844b1f0>] ? kthread+0x0/0x67
Dec 6 11:32:29 atom kernel: [<784036c6>] kernel_thread_helper+0x6/0x1a
Dec 6 11:32:29 atom kernel: ---[ end trace 617a0f44fc30537b ]---
Dec 6 11:32:29 atom kernel: ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020
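
When comparing many of these "DMA failed to stop" lines, it can help to pull the two register values out of the log text. The following is a minimal userspace sketch that assumes the line format shown in the log above; the function name is hypothetical and the sketch does not try to interpret the register bits.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Extracts the register values from a line such as
 *   "ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020"
 * Returns true only if both values were found and parsed.
 */
static bool parse_dma_stop_line(const char *line, uint32_t *ar_cr,
				uint32_t *ar_diag_sw)
{
	const char *p = strstr(line, "AR_CR=");
	unsigned int cr, diag;

	if (!p)
		return false;
	/* %x accepts an optional 0x prefix. */
	if (sscanf(p, "AR_CR=%x AR_DIAG_SW=%x", &cr, &diag) != 2)
		return false;

	*ar_cr = cr;
	*ar_diag_sw = diag;
	return true;
}
```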


On module unload, I sometimes see many more scary-looking DMA warnings,
but again, the system seems stable aside from the noise
in the logs. I will capture these and post them next time I get a clean
set of them (the previous ones were from the mis-configured STA scenario; maybe
they only happen when you unload while the driver is scanning or something like that).


Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2010-12-06 20:42:33

by Luis R. Rodriguez

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On Mon, Dec 06, 2010 at 12:22:26PM -0800, Ben Greear wrote:
> On 12/06/2010 12:11 PM, Björn Smedman wrote:
> > On Mon, Dec 6, 2010 at 8:47 PM, Ben Greear <[email protected]> wrote:
> >> With 16 properly configured non-encrypted stations, running with
> >> wpa-supplicant
> >> with netlink driver & sharing scan results, the interfaces quickly
> >> associate.
> >>
> >> However, I do continue to see DMA warnings such as these (I had picked up my
> >> portable phone, and it knocked all the interfaces offline ..here
> >> they are coming back up after I hung up the phone).
> >
> > Is there some theory as to why using multiple interfaces cause so many
> > problems with DMA?
>
> Seems pretty directly related to channel changes and/or resets, and exacerbated
> by other interfaces sending data while another is scanning, for instance.
>
> Other issues we've found in the past have been various races that you wouldn't
> normally see with a single VIF.

Right, there might be some other hot path we need to lock around.
I'm not sure what it could be, though; we should already be holding the
lock while stopping RX across resets. These should all be atomic, and
starting TX should be too IIRC, which is why the lock was renamed to
cover the PCU as a whole. There may be other PCU changes
we need to contend with.

Luis

2010-12-03 08:14:28

by Ben Greear

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On 12/01/2010 03:22 PM, Ben Greear wrote:
> On 11/29/2010 04:44 PM, Luis R. Rodriguez wrote:
>> On Mon, Nov 29, 2010 at 04:28:51PM -0800, Ben Greear wrote:
>
>>> BUG: unable to handle kernel NULL pointer dereference at 00000040
>>> IP: [<f933470a>] ath_tx_start+0x461/0x5ef [ath9k]
>>> *pde = 00000000
>>> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>>> last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:08:01.0/irq
>>> Modules linked in: aes_i586 aes_generic fuse nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput arc4 ecb ath9k mac80211 ath9k_common ath9k_hw mi]
>>>
>>> Pid: 38, comm: kworker/u:1 Tainted: G W 2.6.37-rc3-wl+ #53 PDSBM/PDSBM
>>> EIP: 0060:[<f933470a>] EFLAGS: 00010246 CPU: 1
>>> EIP is at ath_tx_start+0x461/0x5ef [ath9k]
>>
>> Please use
>>
>> gdb drivers/net/wireless/ath/ath9k/
>> l *(ath_tx_start+0x461)
>>
>> Luis
>
> I managed to hit that ath_tx_start crash again, and this time there were no obvious
> DMA or irq errors immediately preceding it. So, it might be a real bug
> after all. I'll add some extra checks to see if tid->ac is NULL.

I've made some small progress on this general issue.

First, I added all sorts of debugging to try to figure out ath_tx_start crash.
As best as I can tell, 'tid' is not NULL, but also is not a valid pointer,
and probably something close to 0x0. I've added yet more debugging, but haven't
hit the problem again.
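
One way to read the quoted oops ("NULL pointer dereference at 00000040") alongside this observation: the faulting address is the base pointer plus the offset of the member being read, so a base at or near zero with a member at offset 0x40 would fault at 0x40. The sketch below only illustrates that arithmetic. The struct layout is hypothetical and is not the real ath9k tid structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: some fields occupying 0x40 bytes, then a pointer. */
struct fake_tid {
	unsigned char other_fields[0x40];
	void *ac;
};

/* Address that would be touched when reading a member through 'base'. */
static uintptr_t fault_address(uintptr_t base, size_t member_offset)
{
	return base + member_offset;
}
```

With this made-up layout, reading `ac` through a zero base touches address 0x40. A real check would compare the offset reported by gdb against the actual structure definition in the tree being tested.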

I also tried stopping DMA in a loop up to 5 times if it failed to stop
previously in the loop. This did not appear to help at all.
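
The retry experiment described above amounts to a bounded loop around the stop operation. The following is a userspace sketch of that shape only; the callback type, `stop_dma_with_retries`, and `fake_stop` are hypothetical names, and the fake hardware simply succeeds after a fixed number of failures.

```c
#include <stdbool.h>

/* Hypothetical callback: returns true once DMA has stopped. */
typedef bool (*stop_dma_fn)(void *ctx);

/*
 * Calls stop() up to max_tries times. Returns the number of attempts
 * used, or 0 if DMA never stopped within max_tries attempts.
 */
static int stop_dma_with_retries(stop_dma_fn stop, void *ctx, int max_tries)
{
	for (int attempt = 1; attempt <= max_tries; attempt++) {
		if (stop(ctx))
			return attempt;
	}
	return 0;
}

/* Fake hardware for demonstration: fails until *remaining reaches zero. */
static bool fake_stop(void *ctx)
{
	int *remaining = ctx;

	if (*remaining == 0)
		return true;
	(*remaining)--;
	return false;
}
```

A loop like this only helps when the failure is transient. The result reported above (no improvement after up to 5 tries) suggests the engine was not merely slow to stop.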

I also managed to make both the ath_tx_start crash and the DMA errors very hard to reproduce
(I dare not say fixed, yet).

It appears that this small patch (and possibly, the fact that I set debugging to 0x600
instead of 0x400) makes the problems go away. This makes me wonder if a root cause is
something to do with repeatedly resetting the hardware too fast, as setting channels rapidly
would tend to do that, and channels are set on association by supplicant, it appears.

diff --git a/drivers/net/wireless/ath/ath9k/main.c b/drivers/net/wireless/ath/ath9k/main.c
index f026a03..46b1791 100644
--- a/drivers/net/wireless/ath/ath9k/main.c
+++ b/drivers/net/wireless/ath/ath9k/main.c
@@ -1605,6 +1605,16 @@ static int ath9k_config(struct ieee80211_hw *hw, u32 changed)
 		else
 			sc->sc_flags &= ~SC_OP_OFFCHANNEL;
 
+		/* If channels & HT are the same, then don't actually do anything.
+		 */
+		if ((sc->sc_ah->curchan == &sc->sc_ah->channels[pos]) &&
+		    (aphy->chan_is_ht == conf_is_ht(conf))) {
+			ath_print(common, ATH_DBG_CONFIG,
+				  "Skip Set channel: %d MHz, already there.\n",
+				  curchan->center_freq);
+			goto skip_chan_change;
+		}
+
 		if (aphy->state == ATH_WIPHY_SCAN ||
 		    aphy->state == ATH_WIPHY_ACTIVE)
 			ath9k_wiphy_pause_all_forced(sc, aphy);

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2010-12-06 19:53:15

by Luis R. Rodriguez

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On Mon, Dec 06, 2010 at 11:47:47AM -0800, Ben Greear wrote:
> On 12/06/2010 11:36 AM, Luis R. Rodriguez wrote:
>
> > Can you clarify the status of this issue. It remains unclear to me from
> > your above description how things are going. As I read it some things
> > look OK now but you still get a warning.
>
> Ok, since you asked :)
>
> I worked on this over the weekend and this morning. I had all sorts of
> issues until I realized that I had one STA with non-configured SSID.
> It sometimes connected to one /a AP and the other STAs attempted to connect
> to another /n (on entirely different band) AP. I basically got zero stations associated for any length
> of time due to constant channel switching. No crashes, but lots of
> warnings about DMA failing to stop.
>
> Now..I've fixed this configuration issue (and adding steps to help prevent this mis-configuration
> again).
>
> With 16 properly configured non-encrypted stations, running with wpa-supplicant
> with netlink driver & sharing scan results, the interfaces quickly associate.
>
> However, I do continue to see DMA warnings such as these (I had picked up my
> portable phone, and it knocked all the interfaces offline ..here
> they are coming back up after I hung up the phone).
>
> Please note that I ported Felix's 2.6.37 patch he posted this morning
> to wireless-testing and have applied it.
>
> I'm highly tempted to just make that a WARN_ON_ONCE so at least my logs
> aren't spammed so heavily with the recv.c:531 DMA warning.

You can send this change upstream as well.

Luis

2010-12-06 20:38:53

by Felix Fietkau

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On 2010-12-06 9:28 PM, Ben Greear wrote:
> On 12/06/2010 11:53 AM, Luis R. Rodriguez wrote:
>> On Mon, Dec 06, 2010 at 11:53:13AM -0800, Luis Rodriguez wrote:
>>> On Mon, Dec 06, 2010 at 11:47:47AM -0800, Ben Greear wrote:
>>>> On 12/06/2010 11:36 AM, Luis R. Rodriguez wrote:
>>>>
>>>>> Can you clarify the status of this issue. It remains unclear to me from
>>>>> your above description how things are going. As I read it some things
>>>>> look OK now but you still get a warning.
>>>>
>>>> Ok, since you asked :)
>>>>
>>>> I worked on this over the weekend and this morning. I had all sorts of
>>>> issues until I realized that I had one STA with non-configured SSID.
>>>> It sometimes connected to one /a AP and the other STAs attempted to connect
>>>> to another /n (on entirely different band) AP. I basically got zero stations associated for any length
>>>> of time due to constant channel switching. No crashes, but lots of
>>>> warnings about DMA failing to stop.
>>>>
>>>> Now..I've fixed this configuration issue (and adding steps to help prevent this mis-configuration
>>>> again).
>>>>
>>>> With 16 properly configured non-encrypted stations, running with wpa-supplicant
>>>> with netlink driver & sharing scan results, the interfaces quickly associate.
>>>>
>>>> However, I do continue to see DMA warnings such as these (I had picked up my
>>>> portable phone, and it knocked all the interfaces offline ..here
>>>> they are coming back up after I hung up the phone).
>>>>
>>>> Please note that I ported Felix's 2.6.37 patch he posted this morning
>>>> to wireless-testing and have applied it.
>>>>
>>>> I'm highly tempted to just make that a WARN_ON_ONCE so at least my logs
>>>> aren't spammed so heavily with the recv.c:531 DMA warning.
>>>
>>> You can send this change upstream as well.
>>
>> Also, feel free to limit the number of STAs you can have up
>> physically by setting this to a number you bless yourself.
>
> I have a feeling there is no hard limit..but if I do find one,
> I'll cook up a patch. Probably not many of us ever going to push
> anywhere near what I'm trying, and folks like me can limit in
> user-space if wanted...
>
> I'll do up the warn-on-once patch shortly.
>
> By the way, would you consider this channel-change suppression
> patch, or something similar?
>
>
> -------------------- drivers/net/wireless/ath/ath9k/main.c --------------------
> index f026a03..6c1c43b 100644
> @@ -1605,6 +1605,16 @@ static int ath9k_config(struct ieee80211_hw *hw, u32 changed)
>  		else
>  			sc->sc_flags &= ~SC_OP_OFFCHANNEL;
>
> +		/* If channels & HT are the same, then don't actually do anything.
> +		 */
> +		if ((sc->sc_ah->curchan == &sc->sc_ah->channels[pos]) &&
> +		    (aphy->chan_is_ht == conf_is_ht(conf))) {
> +			ath_print(common, ATH_DBG_CONFIG,
> +				  "Skip Set channel: %d MHz, already there.\n",
> +				  curchan->center_freq);
> +			goto skip_chan_change;
> +		}
> +
I think this needs to check the offchannel flag as well, at least in one
direction. Skipping on-channel -> off-channel is fine, but the other way
around might break calibration.

- Felix
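
The condition suggested here can be sketched as a small predicate in userspace C. This is an illustration under assumptions, not a driver patch: `struct chan_request`, its fields, and `can_skip_chan_change` are hypothetical names, and the sketch assumes the previous off-channel state is still available when the decision is made.

```c
#include <stdbool.h>

/* Hypothetical summary of a channel configuration. */
struct chan_request {
	int center_freq_mhz;
	bool is_ht;
	bool offchannel;
};

/*
 * A channel change may be skipped only when frequency and HT mode match
 * and the transition is not off-channel -> on-channel, since that
 * direction is the one flagged as possibly breaking calibration.
 */
static bool can_skip_chan_change(const struct chan_request *cur,
				 const struct chan_request *next)
{
	if (cur->center_freq_mhz != next->center_freq_mhz ||
	    cur->is_ht != next->is_ht)
		return false;

	if (cur->offchannel && !next->offchannel)
		return false;

	return true;
}
```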

2010-12-06 20:28:41

by Ben Greear

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On 12/06/2010 11:53 AM, Luis R. Rodriguez wrote:
> On Mon, Dec 06, 2010 at 11:53:13AM -0800, Luis Rodriguez wrote:
>> On Mon, Dec 06, 2010 at 11:47:47AM -0800, Ben Greear wrote:
>>> On 12/06/2010 11:36 AM, Luis R. Rodriguez wrote:
>>>
>>>> Can you clarify the status of this issue. It remains unclear to me from
>>>> your above description how things are going. As I read it some things
>>>> look OK now but you still get a warning.
>>>
>>> Ok, since you asked :)
>>>
>>> I worked on this over the weekend and this morning. I had all sorts of
>>> issues until I realized that I had one STA with non-configured SSID.
>>> It sometimes connected to one /a AP and the other STAs attempted to connect
>>> to another /n (on entirely different band) AP. I basically got zero stations associated for any length
>>> of time due to constant channel switching. No crashes, but lots of
>>> warnings about DMA failing to stop.
>>>
>>> Now..I've fixed this configuration issue (and adding steps to help prevent this mis-configuration
>>> again).
>>>
>>> With 16 properly configured non-encrypted stations, running with wpa-supplicant
>>> with netlink driver & sharing scan results, the interfaces quickly associate.
>>>
>>> However, I do continue to see DMA warnings such as these (I had picked up my
>>> portable phone, and it knocked all the interfaces offline ..here
>>> they are coming back up after I hung up the phone).
>>>
>>> Please note that I ported Felix's 2.6.37 patch he posted this morning
>>> to wireless-testing and have applied it.
>>>
>>> I'm highly tempted to just make that a WARN_ON_ONCE so at least my logs
>>> aren't spammed so heavily with the recv.c:531 DMA warning.
>>
>> You can send this change upstream as well.
>
> Also, feel free to limit the number of STAs you can have up
> physically by setting this to a number you bless yourself.

I have a feeling there is no hard limit, but if I do find one,
I'll cook up a patch. Probably not many of us are ever going to push
anywhere near what I'm trying, and folks like me can limit in
user-space if wanted...

I'll do up the warn-on-once patch shortly.

By the way, would you consider this channel-change suppression
patch, or something similar?


-------------------- drivers/net/wireless/ath/ath9k/main.c --------------------
index f026a03..6c1c43b 100644
@@ -1605,6 +1605,16 @@ static int ath9k_config(struct ieee80211_hw *hw, u32 changed)
 		else
 			sc->sc_flags &= ~SC_OP_OFFCHANNEL;
 
+		/* If channels & HT are the same, then don't actually do anything.
+		 */
+		if ((sc->sc_ah->curchan == &sc->sc_ah->channels[pos]) &&
+		    (aphy->chan_is_ht == conf_is_ht(conf))) {
+			ath_print(common, ATH_DBG_CONFIG,
+				  "Skip Set channel: %d MHz, already there.\n",
+				  curchan->center_freq);
+			goto skip_chan_change;
+		}
+
 		if (aphy->state == ATH_WIPHY_SCAN ||
 		    aphy->state == ATH_WIPHY_ACTIVE)
 			ath9k_wiphy_pause_all_forced(sc, aphy);

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2010-12-05 03:30:25

by Ben Greear

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On 12/04/2010 06:41 PM, Felix Fietkau wrote:
> On 2010-12-03 9:14 AM, Ben Greear wrote:
>> On 12/01/2010 03:22 PM, Ben Greear wrote:
>>> On 11/29/2010 04:44 PM, Luis R. Rodriguez wrote:
>>>> On Mon, Nov 29, 2010 at 04:28:51PM -0800, Ben Greear wrote:
>>>
>>>>> BUG: unable to handle kernel NULL pointer dereference at 00000040
>>>>> IP: [<f933470a>] ath_tx_start+0x461/0x5ef [ath9k]
>>>>> *pde = 00000000
>>>>> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>>>>> last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:08:01.0/irq
>>>>> Modules linked in: aes_i586 aes_generic fuse nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput arc4 ecb ath9k mac80211 ath9k_common ath9k_hw mi]
>>>>>
>>>>> Pid: 38, comm: kworker/u:1 Tainted: G W 2.6.37-rc3-wl+ #53 PDSBM/PDSBM
>>>>> EIP: 0060:[<f933470a>] EFLAGS: 00010246 CPU: 1
>>>>> EIP is at ath_tx_start+0x461/0x5ef [ath9k]
>>>>
>>>> Please use
>>>>
>>>> gdb drivers/net/wireless/ath/ath9k/
>>>> l *(ath_tx_start+0x461)
>>>>
>>>> Luis
>>>
>>> I managed to hit that ath_tx_start crash again, and this time there were no obvious
>>> DMA or irq errors immediately preceding it. So, it might be a real bug
>>> after all. I'll add some extra checks to see if tid->ac is NULL.
>>
>> I've made some small progress on this general issue.
>>
>> First, I added all sorts of debugging to try to figure out ath_tx_start crash.
>> As best as I can tell, 'tid' is not NULL, but also is not a valid pointer,
>> and probably something close to 0x0. I've added yet more debugging, but haven't
>> hit the problem again.
>>
>> I also tried stopping DMA in a loop up to 5 times if it failed to stop
>> previously in the loop. This did not appear to help at all.
>>
>> I also managed to make both the ath_tx_start crash and the DMA errors very hard to reproduce
>> (I dare not say fixed, yet).
>>
>> It appears that this small patch (and possibly, the fact that I set debugging to 0x600
>> instead of 0x400) makes the problems go away. This makes me wonder if a root cause is
>> something to do with repeatedly resetting the hardware too fast, as setting channels rapidly
>> would tend to do that, and channels are set on association by supplicant, it appears.
> Please try this patch while leaving the unnecessary resets in place.
> I found that when ath_drain_all_txq finds tx dma not stopped, it will
> issue a reset at a point in time where it is both useless (since it's
> right before a reset anyway) and dangerous (since the rx dma engine
> isn't even disabled yet), so IMHO the right thing to do is to drop
> this extra reset.

I'll give this a try, not sure if I'll get to it before Monday though...

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2010-12-01 23:22:53

by Ben Greear

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On 11/29/2010 04:44 PM, Luis R. Rodriguez wrote:
> On Mon, Nov 29, 2010 at 04:28:51PM -0800, Ben Greear wrote:

>> BUG: unable to handle kernel NULL pointer dereference at 00000040
>> IP: [<f933470a>] ath_tx_start+0x461/0x5ef [ath9k]
>> *pde = 00000000
>> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>> last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:08:01.0/irq
>> Modules linked in: aes_i586 aes_generic fuse nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput arc4 ecb ath9k mac80211 ath9k_common ath9k_hw mi]
>>
>> Pid: 38, comm: kworker/u:1 Tainted: G W 2.6.37-rc3-wl+ #53 PDSBM/PDSBM
>> EIP: 0060:[<f933470a>] EFLAGS: 00010246 CPU: 1
>> EIP is at ath_tx_start+0x461/0x5ef [ath9k]
>
> Please use
>
> gdb drivers/net/wireless/ath/ath9k/
> l *(ath_tx_start+0x461)
>
> Luis

I managed to hit that ath_tx_start crash again, and this time there were no obvious
DMA or irq errors immediately preceding it. So, it might be a real bug
after all. I'll add some extra checks to see if tid->ac is NULL.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2010-12-05 02:41:45

by Felix Fietkau

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On 2010-12-03 9:14 AM, Ben Greear wrote:
> On 12/01/2010 03:22 PM, Ben Greear wrote:
>> On 11/29/2010 04:44 PM, Luis R. Rodriguez wrote:
>>> On Mon, Nov 29, 2010 at 04:28:51PM -0800, Ben Greear wrote:
>>
>>>> BUG: unable to handle kernel NULL pointer dereference at 00000040
>>>> IP: [<f933470a>] ath_tx_start+0x461/0x5ef [ath9k]
>>>> *pde = 00000000
>>>> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>>>> last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:08:01.0/irq
>>>> Modules linked in: aes_i586 aes_generic fuse nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput arc4 ecb ath9k mac80211 ath9k_common ath9k_hw mi]
>>>>
>>>> Pid: 38, comm: kworker/u:1 Tainted: G W 2.6.37-rc3-wl+ #53 PDSBM/PDSBM
>>>> EIP: 0060:[<f933470a>] EFLAGS: 00010246 CPU: 1
>>>> EIP is at ath_tx_start+0x461/0x5ef [ath9k]
>>>
>>> Please use
>>>
>>> gdb drivers/net/wireless/ath/ath9k/
>>> l *(ath_tx_start+0x461)
>>>
>>> Luis
>>
>> I managed to hit that ath_tx_start crash again, and this time there were no obvious
>> DMA or irq errors immediately preceding it. So, it might be a real bug
>> after all. I'll add some extra checks to see if tid->ac is NULL.
>
> I've made some small progress on this general issue.
>
> First, I added all sorts of debugging to try to figure out ath_tx_start crash.
> As best as I can tell, 'tid' is not NULL, but also is not a valid pointer,
> and probably something close to 0x0. I've added yet more debugging, but haven't
> hit the problem again.
>
> I also tried stopping DMA in a loop up to 5 times if it failed to stop
> previously in the loop. This did not appear to help at all.
>
> I also managed to make both the ath_tx_start crash and the DMA errors very hard to reproduce
> (I dare not say fixed, yet).
>
> It appears that this small patch (and possibly, the fact that I set debugging to 0x600
> instead of 0x400) makes the problems go away. This makes me wonder if a root cause is
> something to do with repeatedly resetting the hardware too fast, as setting channels rapidly
> would tend to do that, and channels are set on association by supplicant, it appears.
Please try this patch while leaving the unnecessary resets in place.
I found that when ath_drain_all_txq finds tx dma not stopped, it will
issue a reset at a point in time where it is both useless (since it's
right before a reset anyway) and dangerous (since the rx dma engine
isn't even disabled yet), so IMHO the right thing to do is to drop
this extra reset.

--- a/drivers/net/wireless/ath/ath9k/xmit.c
+++ b/drivers/net/wireless/ath/ath9k/xmit.c
@@ -1194,18 +1194,8 @@ void ath_drain_all_txq(struct ath_softc
 		}
 	}
 
-	if (npend) {
-		int r;
-
-		ath_print(common, ATH_DBG_FATAL,
-			  "Failed to stop TX DMA. Resetting hardware!\n");
-
-		r = ath9k_hw_reset(ah, sc->sc_ah->curchan, ah->caldata, false);
-		if (r)
-			ath_print(common, ATH_DBG_FATAL,
-				  "Unable to reset hardware; reset status %d\n",
-				  r);
-	}
+	if (npend)
+		ath_print(common, ATH_DBG_FATAL, "Failed to stop TX DMA!\n");
 
 	for (i = 0; i < ATH9K_NUM_TX_QUEUES; i++) {
 		if (ATH_TXQ_SETUP(sc, i))
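
The behaviour after this patch (report pending TX frames, but leave the reset to the caller) can be mimicked in a userspace sketch. `struct fake_txq`, `count_pending_tx`, and `drain_finished` are hypothetical stand-ins for illustration and do not correspond to driver functions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical view of one TX queue. */
struct fake_txq {
	bool setup;           /* queue is in use */
	unsigned int pending; /* frames the hardware still holds */
};

/* Sums pending frames over the queues that are set up. */
static unsigned int count_pending_tx(const struct fake_txq *q, size_t n)
{
	unsigned int npend = 0;

	for (size_t i = 0; i < n; i++) {
		if (q[i].setup)
			npend += q[i].pending;
	}
	return npend;
}

/*
 * Logs when frames are still pending and reports whether the drain
 * finished cleanly. No reset is issued here; the caller, which is
 * about to reset anyway, decides what happens next.
 */
static bool drain_finished(const struct fake_txq *q, size_t n)
{
	unsigned int npend = count_pending_tx(q, n);

	if (npend)
		fprintf(stderr, "Failed to stop TX DMA!\n");
	return npend == 0;
}
```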

2010-12-06 20:22:33

by Ben Greear

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On 12/06/2010 12:11 PM, Björn Smedman wrote:
> On Mon, Dec 6, 2010 at 8:47 PM, Ben Greear <[email protected]> wrote:
>> With 16 properly configured non-encrypted stations, running with
>> wpa-supplicant
>> with netlink driver & sharing scan results, the interfaces quickly
>> associate.
>>
>> However, I do continue to see DMA warnings such as these (I had picked up my
>> portable phone, and it knocked all the interfaces offline ..here
>> they are coming back up after I hung up the phone).
>
> Is there some theory as to why using multiple interfaces cause so many
> problems with DMA?

Seems pretty directly related to channel changes and/or resets, and exacerbated
by other interfaces sending data while another is scanning, for instance.

Other issues we've found in the past have been various races that you wouldn't
normally see with a single VIF.

Thanks,
Ben

>
> /Björn


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2010-12-06 20:11:38

by Björn Smedman

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On Mon, Dec 6, 2010 at 8:47 PM, Ben Greear <[email protected]> wrote:
> With 16 properly configured non-encrypted stations, running with
> wpa-supplicant
> with netlink driver & sharing scan results, the interfaces quickly
> associate.
>
> However, I do continue to see DMA warnings such as these (I had picked up my
> portable phone, and it knocked all the interfaces offline ..here
> they are coming back up after I hung up the phone).

Is there some theory as to why using multiple interfaces cause so many
problems with DMA?

/Björn

2010-12-06 21:00:15

by Ben Greear

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On 12/06/2010 12:42 PM, Luis R. Rodriguez wrote:
> On Mon, Dec 06, 2010 at 12:22:26PM -0800, Ben Greear wrote:
>> On 12/06/2010 12:11 PM, Björn Smedman wrote:
>>> On Mon, Dec 6, 2010 at 8:47 PM, Ben Greear <[email protected]> wrote:
>>>> With 16 properly configured non-encrypted stations, running with
>>>> wpa-supplicant
>>>> with netlink driver & sharing scan results, the interfaces quickly
>>>> associate.
>>>>
>>>> However, I do continue to see DMA warnings such as these (I had picked up my
>>>> portable phone, and it knocked all the interfaces offline ..here
>>>> they are coming back up after I hung up the phone).
>>>
>>> Is there some theory as to why using multiple interfaces cause so many
>>> problems with DMA?
>>
>> Seems pretty directly related to channel changes and/or resets, and exacerbated
>> by other interfaces sending data while another is scanning, for instance.
>>
>> Other issues we've found in the past have been various races that you wouldn't
>> normally see with a single VIF.
>
> Right, there might be some other hot path we need to lock around.
> I'm not sure what it could be, though; we should already be holding the
> lock while stopping RX across resets. These should all be atomic, and
> starting TX should be too IIRC, which is why the lock was renamed to
> cover the PCU as a whole. There may be other PCU changes
> we need to contend with.

Maybe the hardware/firmware guys could give us some clues as to what
types of things can cause stopping DMA to fail? Maybe that could
point us to what might be racing with the attempts to stop DMA?

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com