2008-03-28 19:15:07

by TK

[permalink] [raw]
Subject: 2.6.24 panic in rt2x00lib_txdone / ieee80211_tx_status_irqsafe

I get this panic with rt2x00 2.0.10 (2.6.24-12-server kernel
shipped by ubuntu). It can crash quite regularly given the
right amount of wireless activity and system load.

The instruction that crashes seems to be the assignment to
skb->dev when skb is null (in ieee80211_tx_status_irqsafe);
skb can be set to null in the caller, rt2x00lib_txdone.

I'm not really sure what is the right fix, but I can try a patch.

Logs:

[ 655.701275] BUG: unable to handle kernel NULL pointer dereference at virtual address 00000014
[ 655.701369] printing eip: d0b96c72 *pdpt = 000000000ed22001 *pde = 0000000000000000
[ 655.701444] Oops: 0002 [#1] SMP
[ 655.701485] Modules linked in: aes_i586 geode_aes aes_generic nls_iso8859_1 nls_cp437 vfat fat loop netconsole configfs ip6table_filter iptable_raw ipt_ULOG ipt_TTL ipt_ttl ipt_TOS ipt_tos ipt_SAME ipt_REJECT ipt_REDIRECT ipt_recent ipt_owner ipt_NETMAP ipt_MASQUERADE ipt_LOG ipt_iprange ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah ipt_addrtype nf_nat_tftp nf_nat_snmp_basic nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp nf_conntrack_amanda nf_conntrack_tftp nf_conntrack_sip nf_conntrack_proto_sctp nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp xt_tcpmss xt_pkttype xt_physdev xt_NFQUEUE xt_NFLOG xt_multiport xt_MARK xt_mark xt_mac xt_limit xt_length xt_helper xt_hashlimit ip6_tables xt_dccp xt_conntrack xt_CONNMARK xt_connmark xt_CLASSIFY xt_tcpudp xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter ip_tables x_tables ext3 jbd mbcache ac lp af_packet ipv6 evdev snd_via82xx snd_cmipci snd_ac97_codec gameport ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_opl3_lib snd_hwdep snd_page_alloc snd_mpu401_uart arc4 ecb blkcipher snd_seq_dummy psmouse serio_raw rt2x00lib snd_rawmidi snd_seq_midi_event snd_timer snd_seq_device shpchp snd via_agp agpgart pci_hotplug via686a parport_pc parport soundcore reiserfs sg r8169 raid10 raid456 md_mod tileblit bitblit softcursor fuse
[ 655.703375]
[ 655.703400] Pid: 0, comm: swapper Not tainted (2.6.24-12-server #1)
[ 655.703436] EIP: 0060:[<d0b96c72>] EFLAGS: 00010086 CPU: 0
[ 655.703545] EIP is at ieee80211_tx_status_irqsafe+0x12/0x120 [mac80211]
[ 655.703581] EAX: ce945000 EBX: cf543a58 ECX: cf543a58 EDX: 00000000
[ 655.703616] ESI: cd485e80 EDI: 00000000 EBP: cd485180 ESP: c0439e00
[ 655.703651] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 655.703684] Process swapper (pid: 0, ti=c0438000 task=c04043a0 task.ti=c0438000)
[ 655.703719] Stack: c1207240 c12071b8 00000000 ce088280 cf543a4c cd485e80 00000000 00000801
[ 655.703812] d0b843f4 ce8cb438 cf543aa0 006c000a d0b8e41f 00000000 00000000 cd485e80
[ 655.703902] 00000000 ced10200 ccccc4a0 00000000 00000000 0000000a c016da50 c042d380
[ 655.703994] Call Trace:
[ 655.704082] [<d0b843f4>] rt2x00lib_txdone+0x84/0xb0 [rt2x00lib]
[ 655.704164] [<d0b8e41f>] rt61pci_interrupt+0x13f/0x220 [rt61pci]
[ 655.704349] [<c016da50>] handle_IRQ_event+0x30/0x60
[ 655.704414] [<c016f3ec>] handle_level_irq+0x7c/0xf0
[ 655.704473] [<c010a93b>] do_IRQ+0x3b/0x70
[ 655.704545] [<c0108dff>] common_interrupt+0x23/0x28
[ 655.704648] [<c033041d>] _spin_unlock_irqrestore+0xd/0x20
[ 655.704706] [<d0958f18>] ata_interrupt+0xe8/0x200 [libata]
[ 655.704902] [<c016da50>] handle_IRQ_event+0x30/0x60
[ 655.704959] [<c016f3ec>] handle_level_irq+0x7c/0xf0
[ 655.705016] [<c010a93b>] do_IRQ+0x3b/0x70
[ 655.705086] [<c0108dff>] common_interrupt+0x23/0x28
[ 655.705184] [<d08595b0>] acpi_idle_enter_simple+0x159/0x1c5 [processor]
[ 655.705291] [<c02a4afc>] cpuidle_idle_call+0x7c/0xb0
[ 655.705351] [<c01066c3>] cpu_idle+0x73/0xd0
[ 655.705404] [<c043ea8f>] start_kernel+0x31f/0x3b0
[ 655.705451] [<c043e150>] unknown_bootoption+0x0/0x1f0
[ 655.705539] =======================
[ 655.705566] Code: fe 0f 0b eb fe 8d 74 26 00 0f 0b eb fe 8d b6 00 00 00 00 8d bf 00 00 00 00 55 89 c5 57 56 53 89 cb 83 ec 10 89 54 24 08 8b 40 58 <89> 42

objdump disassembly:
00000c60 <ieee80211_tx_status_irqsafe>:
c60: 55 push %ebp
c61: 89 c5 mov %eax,%ebp
c63: 57 push %edi
c64: 56 push %esi
c65: 53 push %ebx
c66: 89 cb mov %ecx,%ebx
c68: 83 ec 10 sub $0x10,%esp
call convention boilerplate
c6b: 89 54 24 08 mov %edx,0x8(%esp)
struct ieee80211_local *local = hw_to_local(hw);
c6f: 8b 40 58 mov 0x58(%eax),%eax
x = local->mdev;
c72: 89 42 14 mov %eax,0x14(%edx)
skb->dev = x;

Source:
void ieee80211_tx_status_irqsafe(struct ieee80211_hw *hw,
struct sk_buff *skb,
struct ieee80211_tx_status *status)
{
struct ieee80211_local *local = hw_to_local(hw);
struct ieee80211_tx_status *saved;
int tmp;

skb->dev = local->mdev;
saved = kmalloc(sizeof(struct ieee80211_tx_status), GFP_ATOMIC);




2008-03-28 19:35:49

by Ivo Van Doorn

[permalink] [raw]
Subject: Re: 2.6.24 panic in rt2x00lib_txdone / ieee80211_tx_status_irqsafe

Hi,

> I get this panic with rt2x00 2.0.10 (2.6.24-12-server kernel
> shipped by ubuntu). It can crash quite regularly given the
> right amount of wireless activity and system load.
>
> The instruction that crashes seems to be the assignment to
> skb->dev when skb is null (in ieee80211_tx_status_irqsafe);
> skb can be set to null in the caller, rt2x00lib_txdone.
>
> I'm not really sure what is the right fix, but I can try a patch.

I am not sure what is causing this problem either, the skb pointer
which rt2x00 sends to mac80211 is being used by rt2x00 itself
just prior before sending it to mac80211.

Which means there could be a race condition somewhere,
although that would still be odd since that means txdone is running
for the same entry simultenously.

Could you try rt2x00 2.0.14 (2.6.25-rc*) or a wireless-testing kernel
to see if the problems persist?
The wireless-testing would be the most interesting since several
race conditions in the queues where fixed in there.

Ivo

2008-03-30 20:45:41

by TK

[permalink] [raw]
Subject: Re: 2.6.24 panic in rt2x00lib_txdone / ieee80211_tx_status_irqsafe

On Sun, 30 Mar 2008 20:26:44 +0000, TK said:

> On Fri, 28 Mar 2008 20:34:55 +0100, Ivo van Doorn said:
>
>> Hi,
>>
>> Could you try rt2x00 2.0.14 (2.6.25-rc*) or a wireless-testing kernel
>> to see if the problems persist?
>> The wireless-testing would be the most interesting since several race
>> conditions in the queues where fixed in there.
>
> I've had no luck with either:
> rt2x00 2.0.14 - 93d2334f3773c0d4e8dfafdb394f73bb1bdf0cc9 or a recent
> wireless-testing - 78f8e644105878c3e13ca281529c3c9e661bf59f

I see there is a fix for the second, I'll try it sometime.



2008-03-31 18:32:43

by TK

[permalink] [raw]
Subject: Re: 2.6.24 panic in rt2x00lib_txdone / ieee80211_tx_status_irqsafe

On Sun, 30 Mar 2008 20:45:31 +0000, TK said:

> On Sun, 30 Mar 2008 20:26:44 +0000, TK said:
>
>> On Fri, 28 Mar 2008 20:34:55 +0100, Ivo van Doorn said:
>>
>>> The wireless-testing would be the most interesting since several race
>>> conditions in the queues where fixed in there.

Great, I wasn't able to trigger the crash again.
I expect there is no chance of a backport though - 2.6.25 doesn't play
nice with my distro, yet.


2008-03-30 20:27:00

by TK

[permalink] [raw]
Subject: Re: 2.6.24 panic in rt2x00lib_txdone / ieee80211_tx_status_irqsafe

On Fri, 28 Mar 2008 20:34:55 +0100, Ivo van Doorn said:

> Hi,
>
> Could you try rt2x00 2.0.14 (2.6.25-rc*) or a wireless-testing kernel to
> see if the problems persist?
> The wireless-testing would be the most interesting since several race
> conditions in the queues where fixed in there.

I've had no luck with either:
rt2x00 2.0.14 - 93d2334f3773c0d4e8dfafdb394f73bb1bdf0cc9
or a recent wireless-testing - 78f8e644105878c3e13ca281529c3c9e661bf59f

The first one does not manage to associate,
[ 1335.437438] wlan0: authenticate with AP ea:2c:38:1b:3b:6c
[ 1335.439767] wlan0: RX authentication from ea:2c:38:1b:3b:6c (alg=0 transaction=2 status=0)
[ 1335.439879] wlan0: authenticated
[ 1335.439947] wlan0: associate with AP ea:2c:38:1b:3b:6c
[ 1335.631087] wlan0: associate with AP ea:2c:38:1b:3b:6c
[ 1335.830689] wlan0: associate with AP ea:2c:38:1b:3b:6c
[ 1336.030006] wlan0: association with AP ea:2c:38:1b:3b:6c timed out

And the second deadlocks,
[ 557.872813] ACPI: PCI Interrupt 0000:00:0b.0[A] -> Link [LNKD] -> GSI 10 (level, low) -> IRQ 10
[ 557.893592] phy0: Selected rate control algorithm 'pid'
[ 622.770360] BUG: soft lockup - CPU#0 stuck for 61s! [rt61pci:6275]
[ 622.770360]
[ 622.770360] Pid: 6275, comm: rt61pci Not tainted (2.6.25-rc7-wl #2)
[ 622.770360] EIP: 0060:[<c013c623>] EFLAGS: 00000283 CPU: 0
[ 622.770360] EIP is at run_workqueue+0xa3/0x150
[ 622.770360] EAX: c99fde04 EBX: c6ecd220 ECX: c6ecd224 EDX: c99fde04
[ 622.770360] ESI: c99fde00 EDI: d0d3f040 EBP: c99fde00 ESP: c6f55f88
[ 622.770360] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 622.770360] CR0: 8005003b CR2: bfa98e5e CR3: 098ef000 CR4: 000002b0
[ 622.770360] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 622.770360] DR6: ffff0ff0 DR7: 00000400
[ 622.770360] [<c013cee0>] ? worker_thread+0x0/0xe0
[ 622.770360] [<c013cf60>] ? worker_thread+0x80/0xe0
[ 622.770360] [<c013fa70>] ? autoremove_wake_function+0x0/0x40
[ 622.770360] [<c013cee0>] ? worker_thread+0x0/0xe0
[ 622.770360] [<c013f782>] ? kthread+0x42/0x70
[ 622.770360] [<c013f740>] ? kthread+0x0/0x70
[ 622.770360] [<c0109a93>] ? kernel_thread_helper+0x7/0x10
[ 622.770360] =======================
[ 688.270361] BUG: soft lockup - CPU#0 stuck for 61s! [rt61pci:6275]
[ 688.270361]
[ 688.270361] Pid: 6275, comm: rt61pci Not tainted (2.6.25-rc7-wl #2)
[ 688.270361] EIP: 0060:[<c032c637>] EFLAGS: 00000296 CPU: 0
[ 688.270361] EIP is at _spin_unlock_irqrestore+0x7/0x10
[ 688.270361] EAX: 00000296 EBX: c6ecd220 ECX: 00000001 EDX: 00000296
[ 688.270361] ESI: c99fde00 EDI: d0d3f040 EBP: c99fde00 ESP: c6f55f68
[ 688.270361] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 688.270361] CR0: 8005003b CR2: bfa98e5e CR3: 098ef000 CR4: 000002b0
[ 688.270361] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 688.270361] DR6: ffff0ff0 DR7: 00000400
[ 688.270361] [<c013cb83>] ? queue_work+0x33/0x40
[ 688.270361] [<d0d3f07c>] ? rt2x00lib_packetfilter_scheduled+0x3c/0x50 [rt2x00lib]
[ 688.270361] [<c013c638>] ? run_workqueue+0xb8/0x150
[ 688.270361] [<c013cee0>] ? worker_thread+0x0/0xe0
[ 688.270361] [<c013cf60>] ? worker_thread+0x80/0xe0
[ 688.270361] [<c013fa70>] ? autoremove_wake_function+0x0/0x40
[ 688.270361] [<c013cee0>] ? worker_thread+0x0/0xe0
[ 688.270361] [<c013f782>] ? kthread+0x42/0x70
[ 688.270361] [<c013f740>] ? kthread+0x0/0x70
[ 688.270361] [<c0109a93>] ? kernel_thread_helper+0x7/0x10
[ 688.270361] =======================
[ 709.331164] rt61pci: page allocation failure. order:0, mode:0x20
[ 709.331214] rt61pci: page allocation failure. order:0, mode:0x20
[ 709.331225] Pid: 6275, comm: rt61pci Not tainted 2.6.25-rc7-wl #2
[ 709.331288] [<c01729cd>] __alloc_pages+0x31d/0x370
[ 709.331344] [<c02c6416>] __ip_route_output_key+0x466/0x930
[ 709.331381] [<c0191e68>] __slab_alloc+0x138/0x4f0
[ 709.331425] [<c01924fd>] kmem_cache_alloc+0xad/0xc0
[ 709.331434] [<c02a3066>] __alloc_skb+0x36/0x120
[ 709.331472] [<c02a3066>] __alloc_skb+0x36/0x120
[ 709.331497] [<c02b597c>] find_skb+0x3c/0x80
[ 709.331529] [<c02b636d>] netpoll_send_udp+0x2d/0x260
[ 709.331537] [<c032ed47>] atomic_notifier_call_chain+0x17/0x20
[ 709.331593] [<d0a6e15f>] write_msg+0x8f/0xd0 [netconsole]
[ 709.331632] [<d0a6e0d0>] write_msg+0x0/0xd0 [netconsole]
[ 709.331650] [<c012c643>] __call_console_drivers+0x53/0x60
[ 709.331685] [<c012cae8>] release_console_sem+0xd8/0x1e0
[ 709.331714] [<c012cf77>] vprintk+0x1e7/0x370
[ 709.331813] [<c012d11b>] printk+0x1b/0x20
[ 709.331831] [<c01729c8>] __alloc_pages+0x318/0x370
[ 709.331888] [<c0191e68>] __slab_alloc+0x138/0x4f0
[ 709.331929] [<c01924fd>] kmem_cache_alloc+0xad/0xc0
[ 709.331937] [<c02a3066>] __alloc_skb+0x36/0x120
[ 709.331966] [<c02a3066>] __alloc_skb+0x36/0x120
[ 709.331991] [<c02a3dc2>] __netdev_alloc_skb+0x22/0x50
[ 709.332014] [<d092159e>] rtl8169_rx_interrupt+0x1ce/0x580 [r8169]
[ 709.332091] [<d0922097>] rtl8169_interrupt+0x317/0x3f0 [r8169]
[ 709.332115] [<d09f69b5>] ata_interrupt+0xc5/0x1d0 [libata]
[ 709.332267] [<c01675d0>] handle_IRQ_event+0x30/0x60
[ 709.332291] [<c0168e78>] handle_level_irq+0x78/0xf0
[ 709.332315] [<c010b4ab>] do_IRQ+0x3b/0x70
[ 709.332344] [<d0d3f040>] rt2x00lib_packetfilter_scheduled+0x0/0x50 [rt2x00lib]
[ 709.332375] [<c0109887>] common_interrupt+0x23/0x28

Which does not help with the problem I wanted to solve in the first place;
sorry.



2008-03-31 21:24:08

by Ivo Van Doorn

[permalink] [raw]
Subject: Re: 2.6.24 panic in rt2x00lib_txdone / ieee80211_tx_status_irqsafe

On Monday 31 March 2008, TK wrote:
> On Sun, 30 Mar 2008 20:45:31 +0000, TK said:
>
> > On Sun, 30 Mar 2008 20:26:44 +0000, TK said:
> >
> >> On Fri, 28 Mar 2008 20:34:55 +0100, Ivo van Doorn said:
> >>
> >>> The wireless-testing would be the most interesting since several race
> >>> conditions in the queues where fixed in there.
>
> Great, I wasn't able to trigger the crash again.
> I expect there is no chance of a backport though - 2.6.25 doesn't play
> nice with my distro, yet.

Sorry, but backporting isn't that easy.
rt2x00 has undergone some major changes after the 2.6.25 codebase,
some code was completely restructured and several new features were added.
Between the rt2x00 version in 2.6.25 and the latest wireless-testing code are
over 80 patches, those patches fixed a lot of issues, and backporting those is
troublesome due to the code restructuring (where the code restructuring itself
most likely fixed your bug).

Ivo