2009-11-25 16:49:38

by Maxim Levitsky

[permalink] [raw]
Subject: Panic in iwl3945 driver

Just captured a panic in iwl3945 driver.
Will investigate.

<1>[ 7290.414172] IP: [<ffffffffa0dd53a1>] iwl3945_rx_reply_tx+0xc1/0x450 [iwl3945]
<4>[ 7290.414205] PGD 0
<1>[ 7290.414214] Thread overran stack, or stack corrupted
<0>[ 7290.414229] Oops: 0002 [#1] PREEMPT SMP
<0>[ 7290.414246] last sysfs file: /sys/devices/platform/coretemp.1/temp1_input
<4>[ 7290.414265] CPU 0
<4>[ 7290.414274] Modules linked in: af_packet nfsd usb_storage usb_libusual cpufreq_powersave exportfs cpufreq_conservative iwl3945 nfs cpufreq_userspace snd_hda_codec_realtek acpi_cpufreq uvcvideo lockd iwlcore snd_hda_intel joydev coretemp nfs_acl videodev snd_hda_codec mac80211 v4l1_compat snd_hwdep sbp2 v4l2_compat_ioctl32 uhci_hcd psmouse auth_rpcgss ohci1394 cfg80211 ehci_hcd video ieee1394 snd_pcm serio_raw battery ac nvidia(P) usbcore output sunrpc evdev lirc_ene0100 snd_page_alloc rfkill tg3 libphy fuse lzo lzo_decompress lzo_compress
<6>[ 7290.414486] Pid: 0, comm: swapper Tainted: P 2.6.32-rc8-wl #213 Aspire 5720
<6>[ 7290.414507] RIP: 0010:[<ffffffffa0dd53a1>] [<ffffffffa0dd53a1>] iwl3945_rx_reply_tx+0xc1/0x450 [iwl3945]
<6>[ 7290.414541] RSP: 0018:ffff880002203d60 EFLAGS: 00010246
<6>[ 7290.414557] RAX: 000000000000004f RBX: ffff880064c11600 RCX: 0000000000000013
<6>[ 7290.414576] RDX: ffffffffa0ddcf20 RSI: ffff8800512b7008 RDI: 0000000000000038
<6>[ 7290.414596] RBP: ffff880002203dd0 R08: 0000000000000000 R09: 0000000000000100
<6>[ 7290.414616] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000000000a0
<6>[ 7290.414635] R13: 0000000000000002 R14: 0000000000000013 R15: 0000000000020201
<6>[ 7290.414655] FS: 0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
<6>[ 7290.414677] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<6>[ 7290.414693] CR2: 0000000000000041 CR3: 0000000001001000 CR4: 00000000000006f0
<6>[ 7290.414712] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<6>[ 7290.414732] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[ 7290.414752] Process swapper (pid: 0, threadinfo ffffffff81524000, task ffffffff81528b60)
<0>[ 7290.414772] Stack:
<4>[ 7290.414780] ffff880002203da0 0000000000000046 0000000000000000 0000000000000046
<4>[ 7290.414804] <0> 0000000000000282 0000000000000282 0000000000000282 ffff880064c12010
<4>[ 7290.414830] <0> ffff880002203db0 ffff880064c11600 ffff880064c12e50 ffff8800512b7000
<0>[ 7290.414858] Call Trace:
<0>[ 7290.414867] <IRQ>
<4>[ 7290.414884] [<ffffffffa0dc8c47>] iwl3945_irq_tasklet+0x657/0x1740 [iwl3945]
<4>[ 7290.414910] [<ffffffff8138fc60>] ? _spin_unlock+0x30/0x60
<4>[ 7290.414931] [<ffffffff81049a21>] tasklet_action+0x101/0x110
<4>[ 7290.414950] [<ffffffff8104a3d0>] __do_softirq+0xc0/0x160
<4>[ 7290.414968] [<ffffffff8100d01c>] call_softirq+0x1c/0x30
<4>[ 7290.414986] [<ffffffff8100eff5>] do_softirq+0x75/0xb0
<4>[ 7290.415003] [<ffffffff81049ee5>] irq_exit+0x95/0xa0
<4>[ 7290.415020] [<ffffffff8100e547>] do_IRQ+0x77/0xf0
<4>[ 7290.415038] [<ffffffff8100c7d3>] ret_from_intr+0x0/0xf
<0>[ 7290.415052] <EOI>
<4>[ 7290.415067] [<ffffffff81234efa>] ? acpi_idle_enter_bm+0x270/0x2a5
<4>[ 7290.415087] [<ffffffff81234f04>] ? acpi_idle_enter_bm+0x27a/0x2a5
<4>[ 7290.415107] [<ffffffff81234efa>] ? acpi_idle_enter_bm+0x270/0x2a5
<4>[ 7290.415130] [<ffffffff812c11f3>] ? cpuidle_idle_call+0x93/0xf0
<4>[ 7290.415149] [<ffffffff8100b0d7>] ? cpu_idle+0xa7/0x110
<4>[ 7290.415168] [<ffffffff8137b3d5>] ? rest_init+0x75/0x80
<4>[ 7290.415187] [<ffffffff8158cd0a>] ? start_kernel+0x3a7/0x3b3
<4>[ 7290.415206] [<ffffffff8158c315>] ? x86_64_start_reservations+0x125/0x129
<4>[ 7290.415227] [<ffffffff8158c3fd>] ? x86_64_start_kernel+0xe4/0xeb
<0>[ 7290.415243] Code: 00 41 39 ce 0f 8d e8 01 00 00 48 8b 47 40 48 63 d2 48 69 d2 98 00 00 00 4c 8b 04 02 48 c7 c2 20 cf dd a0 49 8d 78 38 49 8d 40 4f <c6> 47 09 00 c6 47 0c 00 c6 47 0f 00 c6 47 12 00 c6 47 15 00 49
<1>[ 7290.415382] RIP [<ffffffffa0dd53a1>] iwl3945_rx_reply_tx+0xc1/0x450 [iwl3945]
<4>[ 7290.415410] RSP <ffff880002203d60>
<0>[ 7290.415421] CR2: 0000000000000041
<4>[ 7290.415436] ---[ end trace ec46807277caa515 ]---
<0>[ 7290.415450] Kernel panic - not syncing: Fatal exception in interrupt
<4>[ 7290.415468] Pid: 0, comm: swapper Tainted: P D 2.6.32-rc8-wl #213
<4>[ 7290.415486] Call Trace:
<4>[ 7290.415495] <IRQ> [<ffffffff8138c040>] panic+0x7d/0x13a
<4>[ 7290.415519] [<ffffffff8101071a>] oops_end+0xda/0xe0
<4>[ 7290.415538] [<ffffffff8102e1ea>] no_context+0xea/0x250
<4>[ 7290.415557] [<ffffffff81038991>] ? select_task_rq_fair+0x511/0x780
<4>[ 7290.415578] [<ffffffff8102e475>] __bad_area_nosemaphore+0x125/0x1e0
<4>[ 7290.415597] [<ffffffff81038d0c>] ? __enqueue_entity+0x7c/0x80
<4>[ 7290.415616] [<ffffffff81039201>] ? enqueue_task_fair+0x111/0x150
<4>[ 7290.415636] [<ffffffff8102e53e>] bad_area_nosemaphore+0xe/0x10
<4>[ 7290.415656] [<ffffffff8102e8fa>] do_page_fault+0x26a/0x320
<4>[ 7290.415674] [<ffffffff813905df>] page_fault+0x1f/0x30
<4>[ 7290.415697] [<ffffffffa0dd53a1>] ? iwl3945_rx_reply_tx+0xc1/0x450 [iwl3945]
<4>[ 7290.415723] [<ffffffffa0dc8c47>] iwl3945_irq_tasklet+0x657/0x1740 [iwl3945]
<4>[ 7290.415746] [<ffffffff8138fc60>] ? _spin_unlock+0x30/0x60
<4>[ 7290.415764] [<ffffffff81049a21>] tasklet_action+0x101/0x110
<4>[ 7290.415783] [<ffffffff8104a3d0>] __do_softirq+0xc0/0x160
<4>[ 7290.415801] [<ffffffff8100d01c>] call_softirq+0x1c/0x30
<4>[ 7290.415818] [<ffffffff8100eff5>] do_softirq+0x75/0xb0
<4>[ 7290.415835] [<ffffffff81049ee5>] irq_exit+0x95/0xa0
<4>[ 7290.415852] [<ffffffff8100e547>] do_IRQ+0x77/0xf0
<4>[ 7290.415869] [<ffffffff8100c7d3>] ret_from_intr+0x0/0xf
<4>[ 7290.415883] <EOI> [<ffffffff81234efa>] ? acpi_idle_enter_bm+0x270/0x2a5
<4>[ 7290.415911] [<ffffffff81234f04>] ? acpi_idle_enter_bm+0x27a/0x2a5
<4>[ 7290.415931] [<ffffffff81234efa>] ? acpi_idle_enter_bm+0x270/0x2a5
<4>[ 7290.415952] [<ffffffff812c11f3>] ? cpuidle_idle_call+0x93/0xf0
<4>[ 7290.415971] [<ffffffff8100b0d7>] ? cpu_idle+0xa7/0x110
<4>[ 7290.415989] [<ffffffff8137b3d5>] ? rest_init+0x75/0x80
<4>[ 7290.416007] [<ffffffff8158cd0a>] ? start_kernel+0x3a7/0x3b3
<4>[ 7290.416026] [<ffffffff8158c315>] ? x86_64_start_reservations+0x125/0x129
<4>[ 7290.416047] [<ffffffff8158c3fd>] ? x86_64_start_kernel+0xe4/0xeb




2009-11-27 00:00:21

by Maxim Levitsky

[permalink] [raw]
Subject: Re: Panic in iwl3945 driver

On Wed, 2009-11-25 at 18:49 +0200, Maxim Levitsky wrote:
> Just captured a panic in iwl3945 driver.
> Will investigate.
>
> <1>[ 7290.414172] IP: [<ffffffffa0dd53a1>] iwl3945_rx_reply_tx+0xc1/0x450 [iwl3945]
> <4>[ 7290.414205] PGD 0
> <1>[ 7290.414214] Thread overran stack, or stack corrupted
> <0>[ 7290.414229] Oops: 0002 [#1] PREEMPT SMP
> <0>[ 7290.414246] last sysfs file: /sys/devices/platform/coretemp.1/temp1_input
> <4>[ 7290.414265] CPU 0
> <4>[ 7290.414274] Modules linked in: af_packet nfsd usb_storage usb_libusual cpufreq_powersave exportfs cpufreq_conservative iwl3945 nfs cpufreq_userspace snd_hda_codec_realtek acpi_cpufreq uvcvideo lockd iwlcore snd_hda_intel joydev coretemp nfs_acl videodev snd_hda_codec mac80211 v4l1_compat snd_hwdep sbp2 v4l2_compat_ioctl32 uhci_hcd psmouse auth_rpcgss ohci1394 cfg80211 ehci_hcd video ieee1394 snd_pcm serio_raw battery ac nvidia(P) usbcore output sunrpc evdev lirc_ene0100 snd_page_alloc rfkill tg3 libphy fuse lzo lzo_decompress lzo_compress
> <6>[ 7290.414486] Pid: 0, comm: swapper Tainted: P 2.6.32-rc8-wl #213 Aspire 5720
> <6>[ 7290.414507] RIP: 0010:[<ffffffffa0dd53a1>] [<ffffffffa0dd53a1>] iwl3945_rx_reply_tx+0xc1/0x450 [iwl3945]
> <6>[ 7290.414541] RSP: 0018:ffff880002203d60 EFLAGS: 00010246
> <6>[ 7290.414557] RAX: 000000000000004f RBX: ffff880064c11600 RCX: 0000000000000013
> <6>[ 7290.414576] RDX: ffffffffa0ddcf20 RSI: ffff8800512b7008 RDI: 0000000000000038
> <6>[ 7290.414596] RBP: ffff880002203dd0 R08: 0000000000000000 R09: 0000000000000100
> <6>[ 7290.414616] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000000000a0
> <6>[ 7290.414635] R13: 0000000000000002 R14: 0000000000000013 R15: 0000000000020201
> <6>[ 7290.414655] FS: 0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
> <6>[ 7290.414677] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> <6>[ 7290.414693] CR2: 0000000000000041 CR3: 0000000001001000 CR4: 00000000000006f0
> <6>[ 7290.414712] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> <6>[ 7290.414732] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> <4>[ 7290.414752] Process swapper (pid: 0, threadinfo ffffffff81524000, task ffffffff81528b60)
> <0>[ 7290.414772] Stack:
> <4>[ 7290.414780] ffff880002203da0 0000000000000046 0000000000000000 0000000000000046
> <4>[ 7290.414804] <0> 0000000000000282 0000000000000282 0000000000000282 ffff880064c12010
> <4>[ 7290.414830] <0> ffff880002203db0 ffff880064c11600 ffff880064c12e50 ffff8800512b7000
> <0>[ 7290.414858] Call Trace:
> <0>[ 7290.414867] <IRQ>
> <4>[ 7290.414884] [<ffffffffa0dc8c47>] iwl3945_irq_tasklet+0x657/0x1740 [iwl3945]
> <4>[ 7290.414910] [<ffffffff8138fc60>] ? _spin_unlock+0x30/0x60
> <4>[ 7290.414931] [<ffffffff81049a21>] tasklet_action+0x101/0x110
> <4>[ 7290.414950] [<ffffffff8104a3d0>] __do_softirq+0xc0/0x160
> <4>[ 7290.414968] [<ffffffff8100d01c>] call_softirq+0x1c/0x30
> <4>[ 7290.414986] [<ffffffff8100eff5>] do_softirq+0x75/0xb0
> <4>[ 7290.415003] [<ffffffff81049ee5>] irq_exit+0x95/0xa0
> <4>[ 7290.415020] [<ffffffff8100e547>] do_IRQ+0x77/0xf0
> <4>[ 7290.415038] [<ffffffff8100c7d3>] ret_from_intr+0x0/0xf
> <0>[ 7290.415052] <EOI>
> <4>[ 7290.415067] [<ffffffff81234efa>] ? acpi_idle_enter_bm+0x270/0x2a5
> <4>[ 7290.415087] [<ffffffff81234f04>] ? acpi_idle_enter_bm+0x27a/0x2a5
> <4>[ 7290.415107] [<ffffffff81234efa>] ? acpi_idle_enter_bm+0x270/0x2a5
> <4>[ 7290.415130] [<ffffffff812c11f3>] ? cpuidle_idle_call+0x93/0xf0
> <4>[ 7290.415149] [<ffffffff8100b0d7>] ? cpu_idle+0xa7/0x110
> <4>[ 7290.415168] [<ffffffff8137b3d5>] ? rest_init+0x75/0x80
> <4>[ 7290.415187] [<ffffffff8158cd0a>] ? start_kernel+0x3a7/0x3b3
> <4>[ 7290.415206] [<ffffffff8158c315>] ? x86_64_start_reservations+0x125/0x129
> <4>[ 7290.415227] [<ffffffff8158c3fd>] ? x86_64_start_kernel+0xe4/0xeb
> <0>[ 7290.415243] Code: 00 41 39 ce 0f 8d e8 01 00 00 48 8b 47 40 48 63 d2 48 69 d2 98 00 00 00 4c 8b 04 02 48 c7 c2 20 cf dd a0 49 8d 78 38 49 8d 40 4f <c6> 47 09 00 c6 47 0c 00 c6 47 0f 00 c6 47 12 00 c6 47 15 00 49
> <1>[ 7290.415382] RIP [<ffffffffa0dd53a1>] iwl3945_rx_reply_tx+0xc1/0x450 [iwl3945]
> <4>[ 7290.415410] RSP <ffff880002203d60>
> <0>[ 7290.415421] CR2: 0000000000000041
> <4>[ 7290.415436] ---[ end trace ec46807277caa515 ]---
> <0>[ 7290.415450] Kernel panic - not syncing: Fatal exception in interrupt
> <4>[ 7290.415468] Pid: 0, comm: swapper Tainted: P D 2.6.32-rc8-wl #213
> <4>[ 7290.415486] Call Trace:
> <4>[ 7290.415495] <IRQ> [<ffffffff8138c040>] panic+0x7d/0x13a
> <4>[ 7290.415519] [<ffffffff8101071a>] oops_end+0xda/0xe0
> <4>[ 7290.415538] [<ffffffff8102e1ea>] no_context+0xea/0x250
> <4>[ 7290.415557] [<ffffffff81038991>] ? select_task_rq_fair+0x511/0x780
> <4>[ 7290.415578] [<ffffffff8102e475>] __bad_area_nosemaphore+0x125/0x1e0
> <4>[ 7290.415597] [<ffffffff81038d0c>] ? __enqueue_entity+0x7c/0x80
> <4>[ 7290.415616] [<ffffffff81039201>] ? enqueue_task_fair+0x111/0x150
> <4>[ 7290.415636] [<ffffffff8102e53e>] bad_area_nosemaphore+0xe/0x10
> <4>[ 7290.415656] [<ffffffff8102e8fa>] do_page_fault+0x26a/0x320
> <4>[ 7290.415674] [<ffffffff813905df>] page_fault+0x1f/0x30
> <4>[ 7290.415697] [<ffffffffa0dd53a1>] ? iwl3945_rx_reply_tx+0xc1/0x450 [iwl3945]
> <4>[ 7290.415723] [<ffffffffa0dc8c47>] iwl3945_irq_tasklet+0x657/0x1740 [iwl3945]
> <4>[ 7290.415746] [<ffffffff8138fc60>] ? _spin_unlock+0x30/0x60
> <4>[ 7290.415764] [<ffffffff81049a21>] tasklet_action+0x101/0x110
> <4>[ 7290.415783] [<ffffffff8104a3d0>] __do_softirq+0xc0/0x160
> <4>[ 7290.415801] [<ffffffff8100d01c>] call_softirq+0x1c/0x30
> <4>[ 7290.415818] [<ffffffff8100eff5>] do_softirq+0x75/0xb0
> <4>[ 7290.415835] [<ffffffff81049ee5>] irq_exit+0x95/0xa0
> <4>[ 7290.415852] [<ffffffff8100e547>] do_IRQ+0x77/0xf0
> <4>[ 7290.415869] [<ffffffff8100c7d3>] ret_from_intr+0x0/0xf
> <4>[ 7290.415883] <EOI> [<ffffffff81234efa>] ? acpi_idle_enter_bm+0x270/0x2a5
> <4>[ 7290.415911] [<ffffffff81234f04>] ? acpi_idle_enter_bm+0x27a/0x2a5
> <4>[ 7290.415931] [<ffffffff81234efa>] ? acpi_idle_enter_bm+0x270/0x2a5
> <4>[ 7290.415952] [<ffffffff812c11f3>] ? cpuidle_idle_call+0x93/0xf0
> <4>[ 7290.415971] [<ffffffff8100b0d7>] ? cpu_idle+0xa7/0x110
> <4>[ 7290.415989] [<ffffffff8137b3d5>] ? rest_init+0x75/0x80
> <4>[ 7290.416007] [<ffffffff8158cd0a>] ? start_kernel+0x3a7/0x3b3
> <4>[ 7290.416026] [<ffffffff8158c315>] ? x86_64_start_reservations+0x125/0x129
> <4>[ 7290.416047] [<ffffffff8158c3fd>] ? x86_64_start_kernel+0xe4/0xeb


This is some very unpleasant problem.
The thing is that this happens very rarely, and while you use X.
I had recently few such embarrassing kernel panics (I never had any
random and rare kernel panics) and I strongly suspect them to be of same
origin.

This one is first I captured, due to some code that I wrote recently
that saves printk buffer in predefined location in system ram that isn't
cleared on reboot in my notebook.

I had put some NULL checks in iwl3945_rx_reply_tx, none did trigger yet,
nor I had another kernel panic.

Best regards,
Maxim Levitsky



2009-11-30 21:42:08

by Reinette Chatre

[permalink] [raw]
Subject: Re: Panic in iwl3945 driver

On Mon, 2009-11-30 at 07:55 -0800, Maxim Levitsky wrote:
> > This is some very unpleasant problem.
> > The thing is that this happens very rarely, and while you use X.
> > I had recently few such embarrassing kernel panics (I never had any
> > random and rare kernel panics) and I strongly suspect them to be of same
> > origin.
> >
> > This one is first I captured, due to some code that I wrote recently
> > that saves printk buffer in predefined location in system ram that isn't
> > cleared on reboot in my notebook.
> >
> > I had put some NULL checks in iwl3945_rx_reply_tx, none did trigger yet,
> > nor I had another kernel panic.

Did this problem happen with your NULL checks in place? Can you perhaps
help here with which line the problem occurred? Any idea how to trigger
this?

Thank you

Reinette



2009-11-30 22:35:25

by Maxim Levitsky

[permalink] [raw]
Subject: Re: Panic in iwl3945 driver

On Mon, 2009-11-30 at 13:42 -0800, reinette chatre wrote:
> On Mon, 2009-11-30 at 07:55 -0800, Maxim Levitsky wrote:
> > > This is some very unpleasant problem.
> > > The thing is that this happens very rarely, and while you use X.
> > > I had recently few such embarrassing kernel panics (I never had any
> > > random and rare kernel panics) and I strongly suspect them to be of same
> > > origin.
> > >
> > > This one is first I captured, due to some code that I wrote recently
> > > that saves printk buffer in predefined location in system ram that isn't
> > > cleared on reboot in my notebook.
> > >
> > > I had put some NULL checks in iwl3945_rx_reply_tx, none did trigger yet,
> > > nor I had another kernel panic.
>
> Did this problem happen with your NULL checks in place? Can you perhaps
> help here with which line the problem occurred? Any idea how to trigger
> this?

Second one is with the NULL checks.
It is very rare, it seems to be related to playing music, and using
wireless at same time, although both use MSI here.


Now did some disassembly, and this is the results:


......
0x0000000000016652 <iwl3945_rx_reply_tx+242>: callq 0x16657 <iwl3945_rx_reply_tx+247>
0x0000000000016657 <iwl3945_rx_reply_tx+247>: nopw 0x0(%rax,%rax,1)
0x0000000000016660 <iwl3945_rx_reply_tx+256>: add $0x48,%rsp
0x0000000000016664 <iwl3945_rx_reply_tx+260>: pop %rbx
0x0000000000016665 <iwl3945_rx_reply_tx+261>: pop %r12
0x0000000000016667 <iwl3945_rx_reply_tx+263>: pop %r13
0x0000000000016669 <iwl3945_rx_reply_tx+265>: pop %r14
0x000000000001666b <iwl3945_rx_reply_tx+267>: pop %r15
0x000000000001666d <iwl3945_rx_reply_tx+269>: leaveq
0x000000000001666e <iwl3945_rx_reply_tx+270>: retq
0x000000000001666f <iwl3945_rx_reply_tx+271>: nop

0x0000000000016670 <iwl3945_rx_reply_tx+272>: cmp %edx,%r14d
0x0000000000016673 <iwl3945_rx_reply_tx+275>: jl 0x16603 <iwl3945_rx_reply_tx+163>
0x0000000000016675 <iwl3945_rx_reply_tx+277>: mov 0x40(%rdi),%rax
0x0000000000016679 <iwl3945_rx_reply_tx+281>: movslq %edx,%rdx



/home/maxim/software/kernel/linux-2.6/include/net/mac80211.h: 487
memset(&info->status.ampdu_ack_len, 0,
sizeof(struct ieee80211_tx_info) -
offsetof(struct ieee80211_tx_info, status.ampdu_ack_len));

0x000000000001667c <iwl3945_rx_reply_tx+284>: imul $0x98,%rdx,%rdx
0x0000000000016683 <iwl3945_rx_reply_tx+291>: mov (%rdx,%rax,1),%r8
0x0000000000016687 <iwl3945_rx_reply_tx+295>: mov $0x0,%rdx
0x000000000001668e <iwl3945_rx_reply_tx+302>: lea 0x38(%r8),%rdi
0x0000000000016692 <iwl3945_rx_reply_tx+306>: lea 0x4f(%r8),%rax



rate_idx = iwl3945_hwrate_to_plcp_idx(tx_resp->rate);

0x0000000000016696 <iwl3945_rx_reply_tx+310>: movb $0x0,0x9(%rdi) <---------- RIP
0x000000000001669a <iwl3945_rx_reply_tx+314>: movb $0x0,0xc(%rdi)
0x000000000001669e <iwl3945_rx_reply_tx+318>: movb $0x0,0xf(%rdi)
0x00000000000166a2 <iwl3945_rx_reply_tx+322>: movb $0x0,0x12(%rdi)
0x00000000000166a6 <iwl3945_rx_reply_tx+326>: movb $0x0,0x15(%rdi)
0x00000000000166aa <iwl3945_rx_reply_tx+330>: movq $0x0,0x4f(%r8)
0x00000000000166b2 <iwl3945_rx_reply_tx+338>: movq $0x0,0x8(%rax)
0x00000000000166ba <iwl3945_rx_reply_tx+346>: movq $0x0,0x10(%rax)
0x00000000000166c2 <iwl3945_rx_reply_tx+354>: movb $0x0,0x18(%rax)
0x00000000000166c6 <iwl3945_rx_reply_tx+358>: xor %eax,%eax
0x00000000000166c8 <iwl3945_rx_reply_tx+360>: movzbl 0x3(%rsi),%ecx
0x00000000000166cc <iwl3945_rx_reply_tx+364>: nopl 0x0(%rax)
0x00000000000166d0 <iwl3945_rx_reply_tx+368>: cmp (%rdx),%cl
0x00000000000166d2 <iwl3945_rx_reply_tx+370>: je 0x166e4 <iwl3945_rx_reply_tx+388>
0x00000000000166d4 <iwl3945_rx_reply_tx+372>: inc %eax


Even though addr2line seems to think that faulty instruction is outside the inlined function,
thus looks awfully similar to memset.


So the reason seem that, info somehow is null.

info = IEEE80211_SKB_CB(txq->txb[txq->q.read_ptr].skb[0]);
ieee80211_tx_info_clear_status(info);


I will add checks for this now.


Best regards,
Maxim Levitsky




2009-11-30 15:55:51

by Maxim Levitsky

[permalink] [raw]
Subject: Re: Panic in iwl3945 driver

On Fri, 2009-11-27 at 02:00 +0200, Maxim Levitsky wrote:
> On Wed, 2009-11-25 at 18:49 +0200, Maxim Levitsky wrote:
> > Just captured a panic in iwl3945 driver.
> > Will investigate.
> >
> > <1>[ 7290.414172] IP: [<ffffffffa0dd53a1>] iwl3945_rx_reply_tx+0xc1/0x450 [iwl3945]
> > <4>[ 7290.414205] PGD 0
> > <1>[ 7290.414214] Thread overran stack, or stack corrupted
> > <0>[ 7290.414229] Oops: 0002 [#1] PREEMPT SMP
> > <0>[ 7290.414246] last sysfs file: /sys/devices/platform/coretemp.1/temp1_input
> > <4>[ 7290.414265] CPU 0
> > <4>[ 7290.414274] Modules linked in: af_packet nfsd usb_storage usb_libusual cpufreq_powersave exportfs cpufreq_conservative iwl3945 nfs cpufreq_userspace snd_hda_codec_realtek acpi_cpufreq uvcvideo lockd iwlcore snd_hda_intel joydev coretemp nfs_acl videodev snd_hda_codec mac80211 v4l1_compat snd_hwdep sbp2 v4l2_compat_ioctl32 uhci_hcd psmouse auth_rpcgss ohci1394 cfg80211 ehci_hcd video ieee1394 snd_pcm serio_raw battery ac nvidia(P) usbcore output sunrpc evdev lirc_ene0100 snd_page_alloc rfkill tg3 libphy fuse lzo lzo_decompress lzo_compress
> > <6>[ 7290.414486] Pid: 0, comm: swapper Tainted: P 2.6.32-rc8-wl #213 Aspire 5720
> > <6>[ 7290.414507] RIP: 0010:[<ffffffffa0dd53a1>] [<ffffffffa0dd53a1>] iwl3945_rx_reply_tx+0xc1/0x450 [iwl3945]
> > <6>[ 7290.414541] RSP: 0018:ffff880002203d60 EFLAGS: 00010246
> > <6>[ 7290.414557] RAX: 000000000000004f RBX: ffff880064c11600 RCX: 0000000000000013
> > <6>[ 7290.414576] RDX: ffffffffa0ddcf20 RSI: ffff8800512b7008 RDI: 0000000000000038
> > <6>[ 7290.414596] RBP: ffff880002203dd0 R08: 0000000000000000 R09: 0000000000000100
> > <6>[ 7290.414616] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000000000a0
> > <6>[ 7290.414635] R13: 0000000000000002 R14: 0000000000000013 R15: 0000000000020201
> > <6>[ 7290.414655] FS: 0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
> > <6>[ 7290.414677] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > <6>[ 7290.414693] CR2: 0000000000000041 CR3: 0000000001001000 CR4: 00000000000006f0
> > <6>[ 7290.414712] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > <6>[ 7290.414732] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > <4>[ 7290.414752] Process swapper (pid: 0, threadinfo ffffffff81524000, task ffffffff81528b60)
> > <0>[ 7290.414772] Stack:
> > <4>[ 7290.414780] ffff880002203da0 0000000000000046 0000000000000000 0000000000000046
> > <4>[ 7290.414804] <0> 0000000000000282 0000000000000282 0000000000000282 ffff880064c12010
> > <4>[ 7290.414830] <0> ffff880002203db0 ffff880064c11600 ffff880064c12e50 ffff8800512b7000
> > <0>[ 7290.414858] Call Trace:
> > <0>[ 7290.414867] <IRQ>
> > <4>[ 7290.414884] [<ffffffffa0dc8c47>] iwl3945_irq_tasklet+0x657/0x1740 [iwl3945]
> > <4>[ 7290.414910] [<ffffffff8138fc60>] ? _spin_unlock+0x30/0x60
> > <4>[ 7290.414931] [<ffffffff81049a21>] tasklet_action+0x101/0x110
> > <4>[ 7290.414950] [<ffffffff8104a3d0>] __do_softirq+0xc0/0x160
> > <4>[ 7290.414968] [<ffffffff8100d01c>] call_softirq+0x1c/0x30
> > <4>[ 7290.414986] [<ffffffff8100eff5>] do_softirq+0x75/0xb0
> > <4>[ 7290.415003] [<ffffffff81049ee5>] irq_exit+0x95/0xa0
> > <4>[ 7290.415020] [<ffffffff8100e547>] do_IRQ+0x77/0xf0
> > <4>[ 7290.415038] [<ffffffff8100c7d3>] ret_from_intr+0x0/0xf
> > <0>[ 7290.415052] <EOI>
> > <4>[ 7290.415067] [<ffffffff81234efa>] ? acpi_idle_enter_bm+0x270/0x2a5
> > <4>[ 7290.415087] [<ffffffff81234f04>] ? acpi_idle_enter_bm+0x27a/0x2a5
> > <4>[ 7290.415107] [<ffffffff81234efa>] ? acpi_idle_enter_bm+0x270/0x2a5
> > <4>[ 7290.415130] [<ffffffff812c11f3>] ? cpuidle_idle_call+0x93/0xf0
> > <4>[ 7290.415149] [<ffffffff8100b0d7>] ? cpu_idle+0xa7/0x110
> > <4>[ 7290.415168] [<ffffffff8137b3d5>] ? rest_init+0x75/0x80
> > <4>[ 7290.415187] [<ffffffff8158cd0a>] ? start_kernel+0x3a7/0x3b3
> > <4>[ 7290.415206] [<ffffffff8158c315>] ? x86_64_start_reservations+0x125/0x129
> > <4>[ 7290.415227] [<ffffffff8158c3fd>] ? x86_64_start_kernel+0xe4/0xeb
> > <0>[ 7290.415243] Code: 00 41 39 ce 0f 8d e8 01 00 00 48 8b 47 40 48 63 d2 48 69 d2 98 00 00 00 4c 8b 04 02 48 c7 c2 20 cf dd a0 49 8d 78 38 49 8d 40 4f <c6> 47 09 00 c6 47 0c 00 c6 47 0f 00 c6 47 12 00 c6 47 15 00 49
> > <1>[ 7290.415382] RIP [<ffffffffa0dd53a1>] iwl3945_rx_reply_tx+0xc1/0x450 [iwl3945]
> > <4>[ 7290.415410] RSP <ffff880002203d60>
> > <0>[ 7290.415421] CR2: 0000000000000041
> > <4>[ 7290.415436] ---[ end trace ec46807277caa515 ]---
> > <0>[ 7290.415450] Kernel panic - not syncing: Fatal exception in interrupt
> > <4>[ 7290.415468] Pid: 0, comm: swapper Tainted: P D 2.6.32-rc8-wl #213
> > <4>[ 7290.415486] Call Trace:
> > <4>[ 7290.415495] <IRQ> [<ffffffff8138c040>] panic+0x7d/0x13a
> > <4>[ 7290.415519] [<ffffffff8101071a>] oops_end+0xda/0xe0
> > <4>[ 7290.415538] [<ffffffff8102e1ea>] no_context+0xea/0x250
> > <4>[ 7290.415557] [<ffffffff81038991>] ? select_task_rq_fair+0x511/0x780
> > <4>[ 7290.415578] [<ffffffff8102e475>] __bad_area_nosemaphore+0x125/0x1e0
> > <4>[ 7290.415597] [<ffffffff81038d0c>] ? __enqueue_entity+0x7c/0x80
> > <4>[ 7290.415616] [<ffffffff81039201>] ? enqueue_task_fair+0x111/0x150
> > <4>[ 7290.415636] [<ffffffff8102e53e>] bad_area_nosemaphore+0xe/0x10
> > <4>[ 7290.415656] [<ffffffff8102e8fa>] do_page_fault+0x26a/0x320
> > <4>[ 7290.415674] [<ffffffff813905df>] page_fault+0x1f/0x30
> > <4>[ 7290.415697] [<ffffffffa0dd53a1>] ? iwl3945_rx_reply_tx+0xc1/0x450 [iwl3945]
> > <4>[ 7290.415723] [<ffffffffa0dc8c47>] iwl3945_irq_tasklet+0x657/0x1740 [iwl3945]
> > <4>[ 7290.415746] [<ffffffff8138fc60>] ? _spin_unlock+0x30/0x60
> > <4>[ 7290.415764] [<ffffffff81049a21>] tasklet_action+0x101/0x110
> > <4>[ 7290.415783] [<ffffffff8104a3d0>] __do_softirq+0xc0/0x160
> > <4>[ 7290.415801] [<ffffffff8100d01c>] call_softirq+0x1c/0x30
> > <4>[ 7290.415818] [<ffffffff8100eff5>] do_softirq+0x75/0xb0
> > <4>[ 7290.415835] [<ffffffff81049ee5>] irq_exit+0x95/0xa0
> > <4>[ 7290.415852] [<ffffffff8100e547>] do_IRQ+0x77/0xf0
> > <4>[ 7290.415869] [<ffffffff8100c7d3>] ret_from_intr+0x0/0xf
> > <4>[ 7290.415883] <EOI> [<ffffffff81234efa>] ? acpi_idle_enter_bm+0x270/0x2a5
> > <4>[ 7290.415911] [<ffffffff81234f04>] ? acpi_idle_enter_bm+0x27a/0x2a5
> > <4>[ 7290.415931] [<ffffffff81234efa>] ? acpi_idle_enter_bm+0x270/0x2a5
> > <4>[ 7290.415952] [<ffffffff812c11f3>] ? cpuidle_idle_call+0x93/0xf0
> > <4>[ 7290.415971] [<ffffffff8100b0d7>] ? cpu_idle+0xa7/0x110
> > <4>[ 7290.415989] [<ffffffff8137b3d5>] ? rest_init+0x75/0x80
> > <4>[ 7290.416007] [<ffffffff8158cd0a>] ? start_kernel+0x3a7/0x3b3
> > <4>[ 7290.416026] [<ffffffff8158c315>] ? x86_64_start_reservations+0x125/0x129
> > <4>[ 7290.416047] [<ffffffff8158c3fd>] ? x86_64_start_kernel+0xe4/0xeb
>
>
> This is some very unpleasant problem.
> The thing is that this happens very rarely, and while you use X.
> I had recently few such embarrassing kernel panics (I never had any
> random and rare kernel panics) and I strongly suspect them to be of same
> origin.
>
> This one is first I captured, due to some code that I wrote recently
> that saves printk buffer in predefined location in system ram that isn't
> cleared on reboot in my notebook.
>
> I had put some NULL checks in iwl3945_rx_reply_tx, none did trigger yet,
> nor I had another kernel panic.
>
> Best regards,
> Maxim Levitsky
>
>

Happened again:


<7>[39142.650078] No probe response from AP 00:1b:9e:d8:77:02 after 500ms, try 2
<1>[39329.299446] BUG: unable to handle kernel NULL pointer dereference at 0000000000000041
<1>[39329.299478] IP: [<ffffffffa0c7e416>] iwl3945_rx_reply_tx+0x136/0x460 [iwl3945]
<4>[39329.299513] PGD 49be6067 PUD 48f6b067 PMD 0
<0>[39329.299533] Oops: 0002 [#1] PREEMPT SMP
<0>[39329.299551] last sysfs file: /sys/devices/platform/coretemp.1/temp1_input
<4>[39329.299571] CPU 0
<4>[39329.299581] Modules linked in: tg3 libphy lirc_ene0100 usbhid af_packet vmnet vmblock vsock vmci vmmon nfsd exportfs nfs lockd nfs_acl auth_rpcgss sunrpc usb_storage usb_libusual cpufreq_powersave cpufreq_conservative uvcvideo cpufreq_userspace snd_hda_codec_realtek acpi_cpufreq videodev iwl3945 joydev coretemp snd_hda_intel v4l1_compat iwlcore snd_hda_codec v4l2_compat_ioctl32 sbp2 uhci_hcd snd_hwdep psmouse ohci1394 mac80211 video ehci_hcd snd_pcm serio_raw ieee1394 nvidia(P) evdev output usbcore cfg80211 snd_page_alloc ac battery rfkill fuse lzo lzo_decompress lzo_compress [last unloaded: libphy]
<6>[39329.299809] Pid: 323, comm: firefox Tainted: P 2.6.32-rc8-wl #216 Aspire 5720
<6>[39329.299833] RIP: 0010:[<ffffffffa0c7e416>] [<ffffffffa0c7e416>] iwl3945_rx_reply_tx+0x136/0x460 [iwl3945]
<6>[39329.299868] RSP: 0000:ffff880002203d60 EFLAGS: 00010246
<6>[39329.299884] RAX: 000000000000004f RBX: ffff880065351600 RCX: 0000000000000057
<6>[39329.299905] RDX: ffffffffa0c85f40 RSI: ffff88002fd90008 RDI: 0000000000000038
<6>[39329.299925] RBP: ffff880002203dd0 R08: 0000000000000000 R09: 0000000000000100
<6>[39329.299946] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000000000a0
<6>[39329.299966] R13: 0000000000000002 R14: 0000000000000057 R15: 0000000000020201
<6>[39329.299987] FS: 00007f89d8872710(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
<6>[39329.300010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<6>[39329.300027] CR2: 0000000000000041 CR3: 0000000047d6f000 CR4: 00000000000006f0
<6>[39329.300048] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<6>[39329.300068] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[39329.300090] Process firefox (pid: 323, threadinfo ffff880047ce0000, task ffff88003bc35e20)
<0>[39329.300112] Stack:
<4>[39329.300119] ffff880002203da0 0000000000000046 0000000000000000 0000000000000046
<4>[39329.300145] <0> 0000000000000282 0000000000000282 0000000000000282 ffff880065352010
<4>[39329.300172] <0> ffff880002203db0 ffff880065351600 ffff880065354110 ffff88002fd90000
<0>[39329.300201] Call Trace:
<0>[39329.300210] <IRQ>
<4>[39329.300228] [<ffffffffa0c71c47>] iwl3945_irq_tasklet+0x657/0x1740 [iwl3945]
<4>[39329.300256] [<ffffffff8138b880>] ? _spin_unlock+0x30/0x60
<4>[39329.300277] [<ffffffff81049a21>] tasklet_action+0x101/0x110
<4>[39329.300297] [<ffffffff8104a3d0>] __do_softirq+0xc0/0x160
<4>[39329.300317] [<ffffffff8100d01c>] call_softirq+0x1c/0x30
<4>[39329.300336] [<ffffffff8100eff5>] do_softirq+0x75/0xb0
<4>[39329.300353] [<ffffffff81049ee5>] irq_exit+0x95/0xa0
<4>[39329.300370] [<ffffffff8100e547>] do_IRQ+0x77/0xf0
<4>[39329.300389] [<ffffffff8100c7d3>] ret_from_intr+0x0/0xf
<0>[39329.300405] <EOI>
<0>[39329.300413] Code: 00 00 00 00 00 41 39 d6 7c 89 48 8b 47 40 48 63 d2 48 69 d2 98 00 00 00 4c 8b 04 02 48 c7 c2 40 5f c8 a0 49 8d 78 38 49 8d 40 4f <c6> 47 09 00 c6 47 0c 00 c6 47 0f 00 c6 47 12 00 c6 47 15 00 49
<1>[39329.300556] RIP [<ffffffffa0c7e416>] iwl3945_rx_reply_tx+0x136/0x460 [iwl3945]
<4>[39329.300585] RSP <ffff880002203d60>
<0>[39329.300597] CR2: 0000000000000041
<4>[39329.300643] ---[ end trace 7b44b083a7088d66 ]---
<0>[39329.300658] Kernel panic - not syncing: Fatal exception in interrupt
<4>[39329.300678] Pid: 323, comm: firefox Tainted: P D 2.6.32-rc8-wl #216
<4>[39329.300697] Call Trace:
<4>[39329.300706] <IRQ> [<ffffffff81387c60>] panic+0x7d/0x13a
<4>[39329.300732] [<ffffffff8101071a>] oops_end+0xda/0xe0
<4>[39329.300753] [<ffffffff8102e1ea>] no_context+0xea/0x250
<4>[39329.300773] [<ffffffff8102e475>] __bad_area_nosemaphore+0x125/0x1e0
<4>[39329.300795] [<ffffffff8102e53e>] bad_area_nosemaphore+0xe/0x10
<4>[39329.300815] [<ffffffff8102e8fa>] do_page_fault+0x26a/0x320
<4>[39329.300835] [<ffffffff8138c1ff>] page_fault+0x1f/0x30
<4>[39329.300860] [<ffffffffa0c7e416>] ? iwl3945_rx_reply_tx+0x136/0x460 [iwl3945]
<4>[39329.300888] [<ffffffffa0c71c47>] iwl3945_irq_tasklet+0x657/0x1740 [iwl3945]
<4>[39329.300912] [<ffffffff8138b880>] ? _spin_unlock+0x30/0x60
<4>[39329.300931] [<ffffffff81049a21>] tasklet_action+0x101/0x110
<4>[39329.300950] [<ffffffff8104a3d0>] __do_softirq+0xc0/0x160
<4>[39329.300969] [<ffffffff8100d01c>] call_softirq+0x1c/0x30
<4>[39329.300987] [<ffffffff8100eff5>] do_softirq+0x75/0xb0
<4>[39329.301005] [<ffffffff81049ee5>] irq_exit+0x95/0xa0
<4>[39329.301022] [<ffffffff8100e547>] do_IRQ+0x77/0xf0
<4>[39329.301040] [<ffffffff8100c7d3>] ret_from_intr+0x0/0xf
<4>[39329.301055] <EOI>
<0>[39329.301089] Rebooting in 60 seconds..maxim@maxim-laptop:~$



2009-12-21 16:41:04

by Maxim Levitsky

[permalink] [raw]
Subject: Re: Panic in iwl3945 driver

On Wed, 2009-12-02 at 19:17 +0200, Maxim Levitsky wrote:
> On Wed, 2009-12-02 at 13:42 +0800, Zhu Yi wrote:
> > On Tue, 2009-12-01 at 17:28 +0800, Zhu Yi wrote:
> > > On Tue, 2009-12-01 at 06:35 +0800, Maxim Levitsky wrote:
> > > > 0x000000000001668e <iwl3945_rx_reply_tx+302>: lea 0x38(%r8),%rdi
> > > > 0x0000000000016692 <iwl3945_rx_reply_tx+306>: lea 0x4f(%r8),%rax
> > >
> > > When this happened, from your previous post, r8 is 0x0 and rdi is 0x38.
> > > Since "info" is %rdi (see below), this means
> > > txq->txb[txq->q.read_ptr].skb[0], aka. r8 is 0.
> > >
> > > > rate_idx = iwl3945_hwrate_to_plcp_idx(tx_resp->rate);
> > > >
> > > > 0x0000000000016696 <iwl3945_rx_reply_tx+310>: movb $0x0,0x9(%rdi) <---------- RIP
> > > > 0x000000000001669a <iwl3945_rx_reply_tx+314>: movb $0x0,0xc(%rdi)
> > > > 0x000000000001669e <iwl3945_rx_reply_tx+318>: movb $0x0,0xf(%rdi)
> > > > 0x00000000000166a2 <iwl3945_rx_reply_tx+322>: movb $0x0,0x12(%rdi)
> > > > 0x00000000000166a6 <iwl3945_rx_reply_tx+326>: movb $0x0,0x15(%rdi)
> > >
> > > This equals to below code in ieee80211_tx_info_clear_status(). "info" is
> > > %rdi, which is 0x38. That matches NULL pointer dereference at 0x41 in
> > > your oops header.
> > >
> > > for (i = 0; i < IEEE80211_TX_MAX_RATES; i++)
> > > info->status.rates[i].count = 0;
> > >
> > > I guess there is a race for txq->q.read_ptr somewhere. Haven't checked
> > > though.
> >
> > OK. 3945 updated write_ptr without regard to read_ptr on the Tx path.
> > This messes up our TFD on high load. The patch should fix your problem.
> >
> > Signed-off-by: Zhu Yi <[email protected]>
> >
> > diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c b/drivers/net/wireless/iwlwifi/iwl3945-base.c
> > index 994db4a..b31b34c 100644
> > --- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
> > +++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
> > @@ -548,6 +548,9 @@ static int iwl3945_tx_skb(struct iwl_priv *priv, struct sk_buff *skb)
> > txq = &priv->txq[txq_id];
> > q = &txq->q;
> >
> > + if ((iwl_queue_space(q) < q->high_mark))
> > + goto drop;
> > +
> > spin_lock_irqsave(&priv->lock, flags);
> >
> > idx = get_cmd_index(q, q->write_ptr, 0);
> >
> >
Few days ago, I had an idea to reply here that I am sure that this
problem disappeared with this patch.


Today I got same kernel panic _with_ the patch applied....



<1>[ 3075.773505] BUG: unable to handle kernel NULL pointer dereference at 0000000000000041
<1>[ 3075.773540] IP: [<ffffffffa0c175f6>] iwl3945_rx_reply_tx+0xc6/0x450 [iwl3945]
<4>[ 3075.773564] PGD 0
<1>[ 3075.773570] Thread overran stack, or stack corrupted
<0>[ 3075.773579] Oops: 0002 [#1] PREEMPT SMP
<0>[ 3075.773591] last sysfs file: /sys/devices/platform/coretemp.1/temp1_input
<4>[ 3075.773604] CPU 0
<4>[ 3075.773611] Modules linked in: af_packet nfsd exportfs nfs lockd nfs_acl auth_rpcgss sunrpc snd_hda_codec_realtek uvcvideo videodev iwl3945 cpufreq_powersave usb_storage v4l1_compat snd_hda_intel iwlcore cpufreq_conservative joydev usb_libusual v4l2_compat_ioctl32 snd_hda_codec cpufreq_userspace mac80211 acpi_cpufreq uhci_hcd coretemp snd_hwdep cfg80211 psmouse tg3 ohci1394 video ehci_hcd snd_pcm sbp2 ac battery output usbcore libphy snd_page_alloc lirc_ene0100 ieee1394 nvidia(P) serio_raw evdev rfkill fuse lzo lzo_decompress lzo_compress
<6>[ 3075.773757] Pid: 0, comm: swapper Tainted: P 2.6.32-wl #225 Aspire 5720
<6>[ 3075.773772] RIP: 0010:[<ffffffffa0c175f6>] [<ffffffffa0c175f6>] iwl3945_rx_reply_tx+0xc6/0x450 [iwl3945]
<6>[ 3075.773795] RSP: 0018:ffff880002203d20 EFLAGS: 00010246
<6>[ 3075.773806] RAX: 000000000000004f RBX: ffff880065121600 RCX: 00000000000000b0
<6>[ 3075.773820] RDX: ffffffffa0c1f220 RSI: ffff88004f42e008 RDI: 0000000000000038
<6>[ 3075.773834] RBP: ffff880002203d90 R08: 0000000000000000 R09: 0000000000000100
<6>[ 3075.773847] R10: 0000000000000001 R11: 0000000000000046 R12: 00000000000000a0
<6>[ 3075.773861] R13: 0000000000000002 R14: 00000000000000b0 R15: 0000000000020201
<6>[ 3075.773875] FS: 0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
<6>[ 3075.773891] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<6>[ 3075.773903] CR2: 0000000000000041 CR3: 0000000001001000 CR4: 00000000000006f0
<6>[ 3075.773917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<6>[ 3075.773930] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[ 3075.773945] Process swapper (pid: 0, threadinfo ffffffff815b8000, task ffffffff815bcb60)
<0>[ 3075.773959] Stack:
<4>[ 3075.773964] ffff880065122070 0000000000000046 ffff880002203d40 0000000000000046
<4>[ 3075.773981] <0> 0000000000000282 0000000000000282 0000000000000282 ffff880065122010
<4>[ 3075.774000] <0> ffff880002203d70 ffff880065121600 ffff880065122eb0 ffff88004f42e000
<0>[ 3075.774020] Call Trace:
<0>[ 3075.774026] <IRQ>
<4>[ 3075.774037] [<ffffffffa0c0acf7>] iwl3945_irq_tasklet+0x657/0x1740 [iwl3945]
<4>[ 3075.774060] [<ffffffffa09b883e>] ? iwl_isr_legacy+0x3e/0x600 [iwlcore]
<4>[ 3075.774077] [<ffffffff81053f86>] tasklet_action+0x106/0x110
<4>[ 3075.774090] [<ffffffff810556a4>] __do_softirq+0x114/0x2c0
<4>[ 3075.774103] [<ffffffff8109eb68>] ? handle_edge_irq+0x78/0x160
<4>[ 3075.774117] [<ffffffff8100d25c>] call_softirq+0x1c/0x30
<4>[ 3075.774128] [<ffffffff8100f1fd>] do_softirq+0x7d/0xb0
<4>[ 3075.774140] [<ffffffff81055155>] irq_exit+0x95/0xa0
<4>[ 3075.774152] [<ffffffff814031fc>] do_IRQ+0x7c/0xf0
<4>[ 3075.774165] [<ffffffff8100ca13>] ret_from_intr+0x0/0xf
<0>[ 3075.774175] <EOI>
<4>[ 3075.774183] [<ffffffff8129996a>] ? acpi_idle_enter_bm+0x275/0x2aa
<4>[ 3075.774196] [<ffffffff81299974>] ? acpi_idle_enter_bm+0x27f/0x2aa
<4>[ 3075.774209] [<ffffffff8129996a>] ? acpi_idle_enter_bm+0x275/0x2aa
<4>[ 3075.774224] [<ffffffff8132996f>] ? cpuidle_idle_call+0x9f/0x160
<4>[ 3075.774237] [<ffffffff8100b1df>] ? cpu_idle+0xaf/0x110
<4>[ 3075.774250] [<ffffffff813eca3a>] ? rest_init+0x7a/0x80
<4>[ 3075.774264] [<ffffffff8162dd01>] ? start_kernel+0x3ac/0x3b8
<4>[ 3075.774277] [<ffffffff8162d315>] ? x86_64_start_reservations+0x125/0x129
<4>[ 3075.774291] [<ffffffff8162d3fd>] ? x86_64_start_kernel+0xe4/0xeb
<0>[ 3075.774302] Code: 00 41 39 ce 0f 8d e3 01 00 00 48 8b 47 40 48 63 d2 48 69 d2 98 00 00 00 4c 8b 04 02 48 c7 c2 20 f2 c1 a0 49 8d 78 38 49 8d 40 4f <c6> 47 09 00 c6 47 0c 00 c6 47 0f 00 c6 47 12 00 c6 47 15 00 49
<1>[ 3075.774394] RIP [<ffffffffa0c175f6>] iwl3945_rx_reply_tx+0xc6/0x450 [iwl3945]
<4>[ 3075.774412] RSP <ffff880002203d20>
<0>[ 3075.774420] CR2: 0000000000000041
<4>[ 3075.774429] ---[ end trace 7ea524291c193896 ]---
<0>[ 3075.774439] Kernel panic - not syncing: Fatal exception in interrupt
<4>[ 3075.774451] Pid: 0, comm: swapper Tainted: P D 2.6.32-wl #225
<4>[ 3075.774463] Call Trace:
<4>[ 3075.774469] <IRQ> [<ffffffff813fe515>] panic+0x82/0x13f
<4>[ 3075.774486] [<ffffffff81010a52>] oops_end+0xe2/0xf0
<4>[ 3075.774497] [<ffffffff81031382>] no_context+0xf2/0x260
<4>[ 3075.774509] [<ffffffff81031615>] __bad_area_nosemaphore+0x125/0x1e0
<4>[ 3075.774523] [<ffffffff810316e3>] bad_area_nosemaphore+0x13/0x20
<4>[ 3075.774536] [<ffffffff81031aca>] do_page_fault+0x26a/0x320
<4>[ 3075.774549] [<ffffffff81402dcf>] page_fault+0x1f/0x30
<4>[ 3075.774564] [<ffffffffa0c175f6>] ? iwl3945_rx_reply_tx+0xc6/0x450 [iwl3945]
<4>[ 3075.774581] [<ffffffffa0c0acf7>] iwl3945_irq_tasklet+0x657/0x1740 [iwl3945]
<4>[ 3075.774602] [<ffffffffa09b883e>] ? iwl_isr_legacy+0x3e/0x600 [iwlcore]
<4>[ 3075.774616] [<ffffffff81053f86>] tasklet_action+0x106/0x110
<4>[ 3075.774628] [<ffffffff810556a4>] __do_softirq+0x114/0x2c0
<4>[ 3075.774640] [<ffffffff8109eb68>] ? handle_edge_irq+0x78/0x160
<4>[ 3075.774653] [<ffffffff8100d25c>] call_softirq+0x1c/0x30
<4>[ 3075.774665] [<ffffffff8100f1fd>] do_softirq+0x7d/0xb0
<4>[ 3075.774676] [<ffffffff81055155>] irq_exit+0x95/0xa0
<4>[ 3075.774687] [<ffffffff814031fc>] do_IRQ+0x7c/0xf0
<4>[ 3075.774698] [<ffffffff8100ca13>] ret_from_intr+0x0/0xf
<4>[ 3075.774708] <EOI> [<ffffffff8129996a>] ? acpi_idle_enter_bm+0x275/0x2aa
<4>[ 3075.774726] [<ffffffff81299974>] ? acpi_idle_enter_bm+0x27f/0x2aa
<4>[ 3075.774739] [<ffffffff8129996a>] ? acpi_idle_enter_bm+0x275/0x2aa
<4>[ 3075.774752] [<ffffffff8132996f>] ? cpuidle_idle_call+0x9f/0x160
<4>[ 3075.774765] [<ffffffff8100b1df>] ? cpu_idle+0xaf/0x110
<4>[ 3075.774777] [<ffffffff813eca3a>] ? rest_init+0x7a/0x80
<4>[ 3075.774789] [<ffffffff8162dd01>] ? start_kernel+0x3ac/0x3b8
<4>[ 3075.774802] [<ffffffff8162d315>] ? x86_64_start_reservations+0x125/0x129
<4>[ 3075.774816] [<ffffffff8162d3fd>] ? x86_64_start_kernel+0xe4/0xeb
<0>[ 3075.774845] Rebooting in 10 seconds..



(gdb) info line *iwl3945_rx_reply_tx+0xc6
Line 483 of "/home/maxim/software/kernel/linux-2.6/include/net/mac80211.h"
starts at address 0x16626 <iwl3945_rx_reply_tx+198> and ends at 0x1663a <iwl3945_rx_reply_tx+218>.


(gdb) list * 0x16626
0x16626 is in iwl3945_rx_reply_tx (/home/maxim/software/kernel/linux-2.6/include/net/mac80211.h:483).
478 BUILD_BUG_ON(offsetof(struct ieee80211_tx_info, status.rates) !=
479 offsetof(struct ieee80211_tx_info, driver_rates));
480 BUILD_BUG_ON(offsetof(struct ieee80211_tx_info, status.rates) != 8);
481 /* clear the rate counts */
482 for (i = 0; i < IEEE80211_TX_MAX_RATES; i++)
483 info->status.rates[i].count = 0;
484
485 BUILD_BUG_ON(
486 offsetof(struct ieee80211_tx_info, status.ampdu_ack_len) != 23);
487 memset(&info->status.ampdu_ack_len, 0,



Kernel is the dbb6e436ef8e1713258bf1218d09e927d8de3590
(wireless: update old static regulatory domain rules)

Plus few patches that only one that touches wireless is:


diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c b/drivers/net/wireless/iwlwifi/iwl3945-base.c
index 2a28a1f..a36de73 100644
--- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
+++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
@@ -548,6 +548,9 @@ static int iwl3945_tx_skb(struct iwl_priv *priv, struct sk_buff *skb)
txq = &priv->txq[txq_id];
q = &txq->q;

+ if ((iwl_queue_space(q) < q->high_mark))
+ goto drop;
+
spin_lock_irqsave(&priv->lock, flags);

idx = get_cmd_index(q, q->write_ptr, 0);



Best regards,
Maxim Levitsky



2009-12-22 08:57:33

by Zhu Yi

[permalink] [raw]
Subject: Re: Panic in iwl3945 driver

On Tue, 2009-12-22 at 00:40 +0800, Maxim Levitsky wrote:
> On Wed, 2009-12-02 at 19:17 +0200, Maxim Levitsky wrote:
> > On Wed, 2009-12-02 at 13:42 +0800, Zhu Yi wrote:
> > > On Tue, 2009-12-01 at 17:28 +0800, Zhu Yi wrote:
> > > > On Tue, 2009-12-01 at 06:35 +0800, Maxim Levitsky wrote:
> > > > > 0x000000000001668e <iwl3945_rx_reply_tx+302>: lea 0x38(%r8),%rdi
> > > > > 0x0000000000016692 <iwl3945_rx_reply_tx+306>: lea 0x4f(%r8),%rax
> > > >
> > > > When this happened, from your previous post, r8 is 0x0 and rdi is 0x38.
> > > > Since "info" is %rdi (see below), this means
> > > > txq->txb[txq->q.read_ptr].skb[0], aka. r8 is 0.
> > > >
> > > > > rate_idx = iwl3945_hwrate_to_plcp_idx(tx_resp->rate);
> > > > >
> > > > > 0x0000000000016696 <iwl3945_rx_reply_tx+310>: movb $0x0,0x9(%rdi) <---------- RIP
> > > > > 0x000000000001669a <iwl3945_rx_reply_tx+314>: movb $0x0,0xc(%rdi)
> > > > > 0x000000000001669e <iwl3945_rx_reply_tx+318>: movb $0x0,0xf(%rdi)
> > > > > 0x00000000000166a2 <iwl3945_rx_reply_tx+322>: movb $0x0,0x12(%rdi)
> > > > > 0x00000000000166a6 <iwl3945_rx_reply_tx+326>: movb $0x0,0x15(%rdi)
> > > >
> > > > This equals to below code in ieee80211_tx_info_clear_status(). "info" is
> > > > %rdi, which is 0x38. That matches NULL pointer dereference at 0x41 in
> > > > your oops header.
> > > >
> > > > for (i = 0; i < IEEE80211_TX_MAX_RATES; i++)
> > > > info->status.rates[i].count = 0;
> > > >
> > > > I guess there is a race for txq->q.read_ptr somewhere. Haven't checked
> > > > though.
> > >
> > > OK. 3945 updated write_ptr without regard to read_ptr on the Tx path.
> > > This messes up our TFD on high load. The patch should fix your problem.
> > >
> > > Signed-off-by: Zhu Yi <[email protected]>
> > >
> > > diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c b/drivers/net/wireless/iwlwifi/iwl3945-base.c
> > > index 994db4a..b31b34c 100644
> > > --- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
> > > +++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
> > > @@ -548,6 +548,9 @@ static int iwl3945_tx_skb(struct iwl_priv *priv, struct sk_buff *skb)
> > > txq = &priv->txq[txq_id];
> > > q = &txq->q;
> > >
> > > + if ((iwl_queue_space(q) < q->high_mark))
> > > + goto drop;
> > > +
> > > spin_lock_irqsave(&priv->lock, flags);
> > >
> > > idx = get_cmd_index(q, q->write_ptr, 0);
> > >
> > >
> Few days ago, I had an idea to reply here that I am sure that this
> problem disappeared with this patch.
>
>
> Today I got same kernel panic _with_ the patch applied....

Looks like (all of) the root causes are still not found yet. The symptom
is exactly the same as the previous one.

One thing I found today is when txq read_ptr catches up to write_ptr
(read_ptr == write_ptr), iwl_queue_used() will _always_ return TRUE!
This will be a problem if the firmware sends us a wrong index
(sequence), then we will fail the check in this condition. I'm not sure
if firmware can really send us a wrong sequence. Can you please try this
patch? Apply it on top of the previous one. If you do see the "FIRMWARE
BUG" in dmesg, then I think we find the root cause.


diff --git a/drivers/net/wireless/iwlwifi/iwl-dev.h b/drivers/net/wireless/iwlwifi/iwl-dev.h
index 2673e9a..02070cc 100644
--- a/drivers/net/wireless/iwlwifi/iwl-dev.h
+++ b/drivers/net/wireless/iwlwifi/iwl-dev.h
@@ -711,7 +711,11 @@ extern void iwl_txq_ctx_stop(struct iwl_priv *priv);
extern int iwl_queue_space(const struct iwl_queue *q);
static inline int iwl_queue_used(const struct iwl_queue *q, int i)
{
- return q->write_ptr > q->read_ptr ?
+ if (q->write_ptr == q->read_ptr)
+ printk("FIRMWARE BUG: index %d is given while read_ptr is %d\n",
+ i, q->read_ptr);
+
+ return q->write_ptr >= q->read_ptr ?
(i >= q->read_ptr && i < q->write_ptr) :
!(i < q->read_ptr && i >= q->write_ptr);
}


Thanks,
-yi

> <1>[ 3075.773505] BUG: unable to handle kernel NULL pointer dereference at 0000000000000041
> <1>[ 3075.773540] IP: [<ffffffffa0c175f6>] iwl3945_rx_reply_tx+0xc6/0x450 [iwl3945]
> <4>[ 3075.773564] PGD 0
> <1>[ 3075.773570] Thread overran stack, or stack corrupted
> <0>[ 3075.773579] Oops: 0002 [#1] PREEMPT SMP
> <0>[ 3075.773591] last sysfs file: /sys/devices/platform/coretemp.1/temp1_input
> <4>[ 3075.773604] CPU 0
> <4>[ 3075.773611] Modules linked in: af_packet nfsd exportfs nfs lockd nfs_acl auth_rpcgss sunrpc snd_hda_codec_realtek uvcvideo videodev iwl3945 cpufreq_powersave usb_storage v4l1_compat snd_hda_intel iwlcore cpufreq_conservative joydev usb_libusual v4l2_compat_ioctl32 snd_hda_codec cpufreq_userspace mac80211 acpi_cpufreq uhci_hcd coretemp snd_hwdep cfg80211 psmouse tg3 ohci1394 video ehci_hcd snd_pcm sbp2 ac battery output usbcore libphy snd_page_alloc lirc_ene0100 ieee1394 nvidia(P) serio_raw evdev rfkill fuse lzo lzo_decompress lzo_compress
> <6>[ 3075.773757] Pid: 0, comm: swapper Tainted: P 2.6.32-wl #225 Aspire 5720
> <6>[ 3075.773772] RIP: 0010:[<ffffffffa0c175f6>] [<ffffffffa0c175f6>] iwl3945_rx_reply_tx+0xc6/0x450 [iwl3945]
> <6>[ 3075.773795] RSP: 0018:ffff880002203d20 EFLAGS: 00010246
> <6>[ 3075.773806] RAX: 000000000000004f RBX: ffff880065121600 RCX: 00000000000000b0
> <6>[ 3075.773820] RDX: ffffffffa0c1f220 RSI: ffff88004f42e008 RDI: 0000000000000038
> <6>[ 3075.773834] RBP: ffff880002203d90 R08: 0000000000000000 R09: 0000000000000100
> <6>[ 3075.773847] R10: 0000000000000001 R11: 0000000000000046 R12: 00000000000000a0
> <6>[ 3075.773861] R13: 0000000000000002 R14: 00000000000000b0 R15: 0000000000020201
> <6>[ 3075.773875] FS: 0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
> <6>[ 3075.773891] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> <6>[ 3075.773903] CR2: 0000000000000041 CR3: 0000000001001000 CR4: 00000000000006f0
> <6>[ 3075.773917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> <6>[ 3075.773930] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> <4>[ 3075.773945] Process swapper (pid: 0, threadinfo ffffffff815b8000, task ffffffff815bcb60)
> <0>[ 3075.773959] Stack:
> <4>[ 3075.773964] ffff880065122070 0000000000000046 ffff880002203d40 0000000000000046
> <4>[ 3075.773981] <0> 0000000000000282 0000000000000282 0000000000000282 ffff880065122010
> <4>[ 3075.774000] <0> ffff880002203d70 ffff880065121600 ffff880065122eb0 ffff88004f42e000
> <0>[ 3075.774020] Call Trace:
> <0>[ 3075.774026] <IRQ>
> <4>[ 3075.774037] [<ffffffffa0c0acf7>] iwl3945_irq_tasklet+0x657/0x1740 [iwl3945]
> <4>[ 3075.774060] [<ffffffffa09b883e>] ? iwl_isr_legacy+0x3e/0x600 [iwlcore]
> <4>[ 3075.774077] [<ffffffff81053f86>] tasklet_action+0x106/0x110
> <4>[ 3075.774090] [<ffffffff810556a4>] __do_softirq+0x114/0x2c0
> <4>[ 3075.774103] [<ffffffff8109eb68>] ? handle_edge_irq+0x78/0x160
> <4>[ 3075.774117] [<ffffffff8100d25c>] call_softirq+0x1c/0x30
> <4>[ 3075.774128] [<ffffffff8100f1fd>] do_softirq+0x7d/0xb0
> <4>[ 3075.774140] [<ffffffff81055155>] irq_exit+0x95/0xa0
> <4>[ 3075.774152] [<ffffffff814031fc>] do_IRQ+0x7c/0xf0
> <4>[ 3075.774165] [<ffffffff8100ca13>] ret_from_intr+0x0/0xf
> <0>[ 3075.774175] <EOI>
> <4>[ 3075.774183] [<ffffffff8129996a>] ? acpi_idle_enter_bm+0x275/0x2aa
> <4>[ 3075.774196] [<ffffffff81299974>] ? acpi_idle_enter_bm+0x27f/0x2aa
> <4>[ 3075.774209] [<ffffffff8129996a>] ? acpi_idle_enter_bm+0x275/0x2aa
> <4>[ 3075.774224] [<ffffffff8132996f>] ? cpuidle_idle_call+0x9f/0x160
> <4>[ 3075.774237] [<ffffffff8100b1df>] ? cpu_idle+0xaf/0x110
> <4>[ 3075.774250] [<ffffffff813eca3a>] ? rest_init+0x7a/0x80
> <4>[ 3075.774264] [<ffffffff8162dd01>] ? start_kernel+0x3ac/0x3b8
> <4>[ 3075.774277] [<ffffffff8162d315>] ? x86_64_start_reservations+0x125/0x129
> <4>[ 3075.774291] [<ffffffff8162d3fd>] ? x86_64_start_kernel+0xe4/0xeb
> <0>[ 3075.774302] Code: 00 41 39 ce 0f 8d e3 01 00 00 48 8b 47 40 48 63 d2 48 69 d2 98 00 00 00 4c 8b 04 02 48 c7 c2 20 f2 c1 a0 49 8d 78 38 49 8d 40 4f <c6> 47 09 00 c6 47 0c 00 c6 47 0f 00 c6 47 12 00 c6 47 15 00 49
> <1>[ 3075.774394] RIP [<ffffffffa0c175f6>] iwl3945_rx_reply_tx+0xc6/0x450 [iwl3945]
> <4>[ 3075.774412] RSP <ffff880002203d20>
> <0>[ 3075.774420] CR2: 0000000000000041
> <4>[ 3075.774429] ---[ end trace 7ea524291c193896 ]---
> <0>[ 3075.774439] Kernel panic - not syncing: Fatal exception in interrupt
> <4>[ 3075.774451] Pid: 0, comm: swapper Tainted: P D 2.6.32-wl #225
> <4>[ 3075.774463] Call Trace:
> <4>[ 3075.774469] <IRQ> [<ffffffff813fe515>] panic+0x82/0x13f
> <4>[ 3075.774486] [<ffffffff81010a52>] oops_end+0xe2/0xf0
> <4>[ 3075.774497] [<ffffffff81031382>] no_context+0xf2/0x260
> <4>[ 3075.774509] [<ffffffff81031615>] __bad_area_nosemaphore+0x125/0x1e0
> <4>[ 3075.774523] [<ffffffff810316e3>] bad_area_nosemaphore+0x13/0x20
> <4>[ 3075.774536] [<ffffffff81031aca>] do_page_fault+0x26a/0x320
> <4>[ 3075.774549] [<ffffffff81402dcf>] page_fault+0x1f/0x30
> <4>[ 3075.774564] [<ffffffffa0c175f6>] ? iwl3945_rx_reply_tx+0xc6/0x450 [iwl3945]
> <4>[ 3075.774581] [<ffffffffa0c0acf7>] iwl3945_irq_tasklet+0x657/0x1740 [iwl3945]
> <4>[ 3075.774602] [<ffffffffa09b883e>] ? iwl_isr_legacy+0x3e/0x600 [iwlcore]
> <4>[ 3075.774616] [<ffffffff81053f86>] tasklet_action+0x106/0x110
> <4>[ 3075.774628] [<ffffffff810556a4>] __do_softirq+0x114/0x2c0
> <4>[ 3075.774640] [<ffffffff8109eb68>] ? handle_edge_irq+0x78/0x160
> <4>[ 3075.774653] [<ffffffff8100d25c>] call_softirq+0x1c/0x30
> <4>[ 3075.774665] [<ffffffff8100f1fd>] do_softirq+0x7d/0xb0
> <4>[ 3075.774676] [<ffffffff81055155>] irq_exit+0x95/0xa0
> <4>[ 3075.774687] [<ffffffff814031fc>] do_IRQ+0x7c/0xf0
> <4>[ 3075.774698] [<ffffffff8100ca13>] ret_from_intr+0x0/0xf
> <4>[ 3075.774708] <EOI> [<ffffffff8129996a>] ? acpi_idle_enter_bm+0x275/0x2aa
> <4>[ 3075.774726] [<ffffffff81299974>] ? acpi_idle_enter_bm+0x27f/0x2aa
> <4>[ 3075.774739] [<ffffffff8129996a>] ? acpi_idle_enter_bm+0x275/0x2aa
> <4>[ 3075.774752] [<ffffffff8132996f>] ? cpuidle_idle_call+0x9f/0x160
> <4>[ 3075.774765] [<ffffffff8100b1df>] ? cpu_idle+0xaf/0x110
> <4>[ 3075.774777] [<ffffffff813eca3a>] ? rest_init+0x7a/0x80
> <4>[ 3075.774789] [<ffffffff8162dd01>] ? start_kernel+0x3ac/0x3b8
> <4>[ 3075.774802] [<ffffffff8162d315>] ? x86_64_start_reservations+0x125/0x129
> <4>[ 3075.774816] [<ffffffff8162d3fd>] ? x86_64_start_kernel+0xe4/0xeb
> <0>[ 3075.774845] Rebooting in 10 seconds..
>
>
>
> (gdb) info line *iwl3945_rx_reply_tx+0xc6
> Line 483 of "/home/maxim/software/kernel/linux-2.6/include/net/mac80211.h"
> starts at address 0x16626 <iwl3945_rx_reply_tx+198> and ends at 0x1663a <iwl3945_rx_reply_tx+218>.
>
>
> (gdb) list * 0x16626
> 0x16626 is in iwl3945_rx_reply_tx (/home/maxim/software/kernel/linux-2.6/include/net/mac80211.h:483).
> 478 BUILD_BUG_ON(offsetof(struct ieee80211_tx_info, status.rates) !=
> 479 offsetof(struct ieee80211_tx_info, driver_rates));
> 480 BUILD_BUG_ON(offsetof(struct ieee80211_tx_info, status.rates) != 8);
> 481 /* clear the rate counts */
> 482 for (i = 0; i < IEEE80211_TX_MAX_RATES; i++)
> 483 info->status.rates[i].count = 0;
> 484
> 485 BUILD_BUG_ON(
> 486 offsetof(struct ieee80211_tx_info, status.ampdu_ack_len) != 23);
> 487 memset(&info->status.ampdu_ack_len, 0,
>
>
>
> Kernel is the dbb6e436ef8e1713258bf1218d09e927d8de3590
> (wireless: update old static regulatory domain rules)
>
> Plus few patches that only one that touches wireless is:
>
>
> diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c b/drivers/net/wireless/iwlwifi/iwl3945-base.c
> index 2a28a1f..a36de73 100644
> --- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
> +++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
> @@ -548,6 +548,9 @@ static int iwl3945_tx_skb(struct iwl_priv *priv, struct sk_buff *skb)
> txq = &priv->txq[txq_id];
> q = &txq->q;
>
> + if ((iwl_queue_space(q) < q->high_mark))
> + goto drop;
> +
> spin_lock_irqsave(&priv->lock, flags);
>
> idx = get_cmd_index(q, q->write_ptr, 0);
>
>
>
> Best regards,
> Maxim Levitsky
>
>



2009-12-02 02:06:08

by Zhu Yi

[permalink] [raw]
Subject: Re: Panic in iwl3945 driver

On Wed, 2009-12-02 at 02:52 +0800, Chatre, Reinette wrote:
> I agree with your conclusion here that r8 is 0 at this point. I am
> very
> confused how this can be since the two lines above the ones you pasted
> here are:
>
> 0x0000000000016683 <iwl3945_rx_reply_tx+291>:»­­mov (%rdx,%rax,1),%
> r8
> 0x0000000000016687 <iwl3945_rx_reply_tx+295>:»­­mov $0x0,%rdx
>
> From the information in the post you refer to we have rdx as
> 0xffffffffa0c85f40 and rax as 0x4f. Since r8 is initialized above from
> these two registers ... how could it be that r8 ends up as zero?

Is it related? They are memory address references.

Thanks,
-yi


2009-12-01 09:28:40

by Zhu Yi

[permalink] [raw]
Subject: Re: Panic in iwl3945 driver

On Tue, 2009-12-01 at 06:35 +0800, Maxim Levitsky wrote:
> 0x000000000001668e <iwl3945_rx_reply_tx+302>: lea 0x38(%r8),%rdi
> 0x0000000000016692 <iwl3945_rx_reply_tx+306>: lea 0x4f(%r8),%rax

When this happened, from your previous post, r8 is 0x0 and rdi is 0x38.
Since "info" is %rdi (see below), this means
txq->txb[txq->q.read_ptr].skb[0], aka. r8 is 0.

> rate_idx = iwl3945_hwrate_to_plcp_idx(tx_resp->rate);
>
> 0x0000000000016696 <iwl3945_rx_reply_tx+310>: movb $0x0,0x9(%rdi) <---------- RIP
> 0x000000000001669a <iwl3945_rx_reply_tx+314>: movb $0x0,0xc(%rdi)
> 0x000000000001669e <iwl3945_rx_reply_tx+318>: movb $0x0,0xf(%rdi)
> 0x00000000000166a2 <iwl3945_rx_reply_tx+322>: movb $0x0,0x12(%rdi)
> 0x00000000000166a6 <iwl3945_rx_reply_tx+326>: movb $0x0,0x15(%rdi)

This equals to below code in ieee80211_tx_info_clear_status(). "info" is
%rdi, which is 0x38. That matches NULL pointer dereference at 0x41 in
your oops header.

for (i = 0; i < IEEE80211_TX_MAX_RATES; i++)
info->status.rates[i].count = 0;

I guess there is a race for txq->q.read_ptr somewhere. Haven't checked
though.

Thanks,
-yi


2009-12-02 17:17:10

by Maxim Levitsky

[permalink] [raw]
Subject: Re: Panic in iwl3945 driver

On Wed, 2009-12-02 at 13:42 +0800, Zhu Yi wrote:
> On Tue, 2009-12-01 at 17:28 +0800, Zhu Yi wrote:
> > On Tue, 2009-12-01 at 06:35 +0800, Maxim Levitsky wrote:
> > > 0x000000000001668e <iwl3945_rx_reply_tx+302>: lea 0x38(%r8),%rdi
> > > 0x0000000000016692 <iwl3945_rx_reply_tx+306>: lea 0x4f(%r8),%rax
> >
> > When this happened, from your previous post, r8 is 0x0 and rdi is 0x38.
> > Since "info" is %rdi (see below), this means
> > txq->txb[txq->q.read_ptr].skb[0], aka. r8 is 0.
> >
> > > rate_idx = iwl3945_hwrate_to_plcp_idx(tx_resp->rate);
> > >
> > > 0x0000000000016696 <iwl3945_rx_reply_tx+310>: movb $0x0,0x9(%rdi) <---------- RIP
> > > 0x000000000001669a <iwl3945_rx_reply_tx+314>: movb $0x0,0xc(%rdi)
> > > 0x000000000001669e <iwl3945_rx_reply_tx+318>: movb $0x0,0xf(%rdi)
> > > 0x00000000000166a2 <iwl3945_rx_reply_tx+322>: movb $0x0,0x12(%rdi)
> > > 0x00000000000166a6 <iwl3945_rx_reply_tx+326>: movb $0x0,0x15(%rdi)
> >
> > This equals to below code in ieee80211_tx_info_clear_status(). "info" is
> > %rdi, which is 0x38. That matches NULL pointer dereference at 0x41 in
> > your oops header.
> >
> > for (i = 0; i < IEEE80211_TX_MAX_RATES; i++)
> > info->status.rates[i].count = 0;
> >
> > I guess there is a race for txq->q.read_ptr somewhere. Haven't checked
> > though.
>
> OK. 3945 updated write_ptr without regard to read_ptr on the Tx path.
> This messes up our TFD on high load. The patch should fix your problem.
>
> Signed-off-by: Zhu Yi <[email protected]>
>
> diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c b/drivers/net/wireless/iwlwifi/iwl3945-base.c
> index 994db4a..b31b34c 100644
> --- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
> +++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
> @@ -548,6 +548,9 @@ static int iwl3945_tx_skb(struct iwl_priv *priv, struct sk_buff *skb)
> txq = &priv->txq[txq_id];
> q = &txq->q;
>
> + if ((iwl_queue_space(q) < q->high_mark))
> + goto drop;
> +
> spin_lock_irqsave(&priv->lock, flags);
>
> idx = get_cmd_index(q, q->write_ptr, 0);
>
>
I applied that patch, everything works.
I let you know if I see another kernel panic
(I can capture any panic on that system, I set up everything for that)

Best regards,
Maxim Levitsky



2009-12-01 18:52:19

by Reinette Chatre

[permalink] [raw]
Subject: Re: Panic in iwl3945 driver

On Tue, 2009-12-01 at 01:28 -0800, Zhu, Yi wrote:
> On Tue, 2009-12-01 at 06:35 +0800, Maxim Levitsky wrote:
> > 0x000000000001668e <iwl3945_rx_reply_tx+302>: lea 0x38(%r8),%rdi
> > 0x0000000000016692 <iwl3945_rx_reply_tx+306>: lea 0x4f(%r8),%rax
>
> When this happened, from your previous post, r8 is 0x0 and rdi is 0x38.
> Since "info" is %rdi (see below), this means
> txq->txb[txq->q.read_ptr].skb[0], aka. r8 is 0.

I agree with your conclusion here that r8 is 0 at this point. I am very
confused how this can be since the two lines above the ones you pasted
here are:

0x0000000000016683 <iwl3945_rx_reply_tx+291>:»­­mov (%rdx,%rax,1),%r8
0x0000000000016687 <iwl3945_rx_reply_tx+295>:»­­mov $0x0,%rdx

>From the information in the post you refer to we have rdx as
0xffffffffa0c85f40 and rax as 0x4f. Since r8 is initialized above from
these two registers ... how could it be that r8 ends up as zero?

Reinette



2009-12-02 19:18:12

by Cahill, Ben M

[permalink] [raw]
Subject: RE: [ipw3945-devel] Panic in iwl3945 driver



>-----Original Message-----
>From: Zhu Yi [mailto:[email protected]]
>Sent: Wednesday, December 02, 2009 12:43 AM
>To: Maxim Levitsky
>Cc: Chatre, Reinette; linux-wireless; iwlwifi maling list
>Subject: Re: [ipw3945-devel] Panic in iwl3945 driver
>
>On Tue, 2009-12-01 at 17:28 +0800, Zhu Yi wrote:
>> On Tue, 2009-12-01 at 06:35 +0800, Maxim Levitsky wrote:
>> > 0x000000000001668e <iwl3945_rx_reply_tx+302>: lea
>0x38(%r8),%rdi
>> > 0x0000000000016692 <iwl3945_rx_reply_tx+306>: lea
>0x4f(%r8),%rax
>>
>> When this happened, from your previous post, r8 is 0x0 and
>rdi is 0x38.
>> Since "info" is %rdi (see below), this means
>> txq->txb[txq->q.read_ptr].skb[0], aka. r8 is 0.
>>
>> > rate_idx = iwl3945_hwrate_to_plcp_idx(tx_resp->rate);
>> >
>> > 0x0000000000016696 <iwl3945_rx_reply_tx+310>: movb
>$0x0,0x9(%rdi) <---------- RIP
>> > 0x000000000001669a <iwl3945_rx_reply_tx+314>: movb
>$0x0,0xc(%rdi)
>> > 0x000000000001669e <iwl3945_rx_reply_tx+318>: movb
>$0x0,0xf(%rdi)
>> > 0x00000000000166a2 <iwl3945_rx_reply_tx+322>: movb
>$0x0,0x12(%rdi)
>> > 0x00000000000166a6 <iwl3945_rx_reply_tx+326>: movb
>$0x0,0x15(%rdi)
>>
>> This equals to below code in
>ieee80211_tx_info_clear_status(). "info" is
>> %rdi, which is 0x38. That matches NULL pointer dereference at 0x41 in
>> your oops header.
>>
>> for (i = 0; i < IEEE80211_TX_MAX_RATES; i++)
>> info->status.rates[i].count = 0;
>>
>> I guess there is a race for txq->q.read_ptr somewhere.
>Haven't checked
>> though.
>
>OK. 3945 updated write_ptr without regard to read_ptr on the Tx path.
>This messes up our TFD on high load. The patch should fix your problem.

Wow. Could this be happening with other devices as well?

-- Ben --


>
>Signed-off-by: Zhu Yi <[email protected]>
>
>diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c
>b/drivers/net/wireless/iwlwifi/iwl3945-base.c
>index 994db4a..b31b34c 100644
>--- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
>+++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
>@@ -548,6 +548,9 @@ static int iwl3945_tx_skb(struct iwl_priv
>*priv, struct sk_buff *skb)
> txq = &priv->txq[txq_id];
> q = &txq->q;
>
>+ if ((iwl_queue_space(q) < q->high_mark))
>+ goto drop;
>+
> spin_lock_irqsave(&priv->lock, flags);
>
> idx = get_cmd_index(q, q->write_ptr, 0);
>
>
>
>---------------------------------------------------------------
>---------------
>Join us December 9, 2009 for the Red Hat Virtual Experience,
>a free event focused on virtualization and cloud computing.
>Attend in-depth sessions from your desk. Your couch. Anywhere.
>http://p.sf.net/sfu/redhat-sfdev2dev
>_______________________________________________
>Ipw3945-devel mailing list
>[email protected]
>https://lists.sourceforge.net/lists/listinfo/ipw3945-devel
>

2009-12-02 05:42:25

by Zhu Yi

[permalink] [raw]
Subject: Re: Panic in iwl3945 driver

On Tue, 2009-12-01 at 17:28 +0800, Zhu Yi wrote:
> On Tue, 2009-12-01 at 06:35 +0800, Maxim Levitsky wrote:
> > 0x000000000001668e <iwl3945_rx_reply_tx+302>: lea 0x38(%r8),%rdi
> > 0x0000000000016692 <iwl3945_rx_reply_tx+306>: lea 0x4f(%r8),%rax
>
> When this happened, from your previous post, r8 is 0x0 and rdi is 0x38.
> Since "info" is %rdi (see below), this means
> txq->txb[txq->q.read_ptr].skb[0], aka. r8 is 0.
>
> > rate_idx = iwl3945_hwrate_to_plcp_idx(tx_resp->rate);
> >
> > 0x0000000000016696 <iwl3945_rx_reply_tx+310>: movb $0x0,0x9(%rdi) <---------- RIP
> > 0x000000000001669a <iwl3945_rx_reply_tx+314>: movb $0x0,0xc(%rdi)
> > 0x000000000001669e <iwl3945_rx_reply_tx+318>: movb $0x0,0xf(%rdi)
> > 0x00000000000166a2 <iwl3945_rx_reply_tx+322>: movb $0x0,0x12(%rdi)
> > 0x00000000000166a6 <iwl3945_rx_reply_tx+326>: movb $0x0,0x15(%rdi)
>
> This equals to below code in ieee80211_tx_info_clear_status(). "info" is
> %rdi, which is 0x38. That matches NULL pointer dereference at 0x41 in
> your oops header.
>
> for (i = 0; i < IEEE80211_TX_MAX_RATES; i++)
> info->status.rates[i].count = 0;
>
> I guess there is a race for txq->q.read_ptr somewhere. Haven't checked
> though.

OK. 3945 updated write_ptr without regard to read_ptr on the Tx path.
This messes up our TFD on high load. The patch should fix your problem.

Signed-off-by: Zhu Yi <[email protected]>

diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c b/drivers/net/wireless/iwlwifi/iwl3945-base.c
index 994db4a..b31b34c 100644
--- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
+++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
@@ -548,6 +548,9 @@ static int iwl3945_tx_skb(struct iwl_priv *priv, struct sk_buff *skb)
txq = &priv->txq[txq_id];
q = &txq->q;

+ if ((iwl_queue_space(q) < q->high_mark))
+ goto drop;
+
spin_lock_irqsave(&priv->lock, flags);

idx = get_cmd_index(q, q->write_ptr, 0);



2009-12-02 19:41:10

by Abhijeet Kolekar

[permalink] [raw]
Subject: RE: [ipw3945-devel] Panic in iwl3945 driver

On Wed, 2009-12-02 at 11:18 -0800, Cahill, Ben M wrote:
>
> >-----Original Message-----
> >From: Zhu Yi [mailto:[email protected]]
> >Sent: Wednesday, December 02, 2009 12:43 AM
> >To: Maxim Levitsky
> >Cc: Chatre, Reinette; linux-wireless; iwlwifi maling list
> >Subject: Re: [ipw3945-devel] Panic in iwl3945 driver
> >
> >On Tue, 2009-12-01 at 17:28 +0800, Zhu Yi wrote:
> >> On Tue, 2009-12-01 at 06:35 +0800, Maxim Levitsky wrote:
> >> > 0x000000000001668e <iwl3945_rx_reply_tx+302>: lea
> >0x38(%r8),%rdi
> >> > 0x0000000000016692 <iwl3945_rx_reply_tx+306>: lea
> >0x4f(%r8),%rax
> >>
> >> When this happened, from your previous post, r8 is 0x0 and
> >rdi is 0x38.
> >> Since "info" is %rdi (see below), this means
> >> txq->txb[txq->q.read_ptr].skb[0], aka. r8 is 0.
> >>
> >> > rate_idx = iwl3945_hwrate_to_plcp_idx(tx_resp->rate);
> >> >
> >> > 0x0000000000016696 <iwl3945_rx_reply_tx+310>: movb
> >$0x0,0x9(%rdi) <---------- RIP
> >> > 0x000000000001669a <iwl3945_rx_reply_tx+314>: movb
> >$0x0,0xc(%rdi)
> >> > 0x000000000001669e <iwl3945_rx_reply_tx+318>: movb
> >$0x0,0xf(%rdi)
> >> > 0x00000000000166a2 <iwl3945_rx_reply_tx+322>: movb
> >$0x0,0x12(%rdi)
> >> > 0x00000000000166a6 <iwl3945_rx_reply_tx+326>: movb
> >$0x0,0x15(%rdi)
> >>
> >> This equals to below code in
> >ieee80211_tx_info_clear_status(). "info" is
> >> %rdi, which is 0x38. That matches NULL pointer dereference at 0x41 in
> >> your oops header.
> >>
> >> for (i = 0; i < IEEE80211_TX_MAX_RATES; i++)
> >> info->status.rates[i].count = 0;
> >>
> >> I guess there is a race for txq->q.read_ptr somewhere.
> >Haven't checked
> >> though.
> >
> >OK. 3945 updated write_ptr without regard to read_ptr on the Tx path.
> >This messes up our TFD on high load. The patch should fix your problem.
>
> Wow. Could this be happening with other devices as well?
>

We do check this on other devices.

Abhijeet

> -- Ben --
>
>
> >
> >Signed-off-by: Zhu Yi <[email protected]>
> >
> >diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c
> >b/drivers/net/wireless/iwlwifi/iwl3945-base.c
> >index 994db4a..b31b34c 100644
> >--- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
> >+++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
> >@@ -548,6 +548,9 @@ static int iwl3945_tx_skb(struct iwl_priv
> >*priv, struct sk_buff *skb)
> > txq = &priv->txq[txq_id];
> > q = &txq->q;
> >
> >+ if ((iwl_queue_space(q) < q->high_mark))
> >+ goto drop;
> >+
> > spin_lock_irqsave(&priv->lock, flags);
> >
> > idx = get_cmd_index(q, q->write_ptr, 0);
> >
> >
> >
> >---------------------------------------------------------------
> >---------------
> >Join us December 9, 2009 for the Red Hat Virtual Experience,
> >a free event focused on virtualization and cloud computing.
> >Attend in-depth sessions from your desk. Your couch. Anywhere.
> >http://p.sf.net/sfu/redhat-sfdev2dev
> >_______________________________________________
> >Ipw3945-devel mailing list
> >[email protected]
> >https://lists.sourceforge.net/lists/listinfo/ipw3945-devel
> >--
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2010-01-05 22:26:23

by Cahill, Ben M

[permalink] [raw]
Subject: RE: [ipw3945-devel] Panic in iwl3945 driver



>-----Original Message-----
>From: Cahill, Ben M [mailto:[email protected]]
>Sent: Tuesday, January 05, 2010 5:16 PM
>To: Maxim Levitsky; Zhu, Yi
>Cc: Chatre, Reinette; linux-wireless; iwlwifi maling list
>Subject: Re: [ipw3945-devel] Panic in iwl3945 driver
>
>uCode is receiving bad command from command queue ... See
>below (search "Bad").
>
>-- Ben --
>
>>-----Original Message-----
>>From: Maxim Levitsky [mailto:[email protected]]
>>Sent: Tuesday, January 05, 2010 11:56 AM
>>To: Zhu, Yi
>>Cc: Chatre, Reinette; linux-wireless; iwlwifi maling list
>>Subject: Re: [ipw3945-devel] Panic in iwl3945 driver
>>
>>On Tue, 2009-12-22 at 16:57 +0800, Zhu Yi wrote:
>>> On Tue, 2009-12-22 at 00:40 +0800, Maxim Levitsky wrote:
>>> > On Wed, 2009-12-02 at 19:17 +0200, Maxim Levitsky wrote:
>>> > > On Wed, 2009-12-02 at 13:42 +0800, Zhu Yi wrote:
>>> > > > On Tue, 2009-12-01 at 17:28 +0800, Zhu Yi wrote:
>>> > > > > On Tue, 2009-12-01 at 06:35 +0800, Maxim Levitsky wrote:
>>> > > > > > 0x000000000001668e <iwl3945_rx_reply_tx+302>:
>>lea 0x38(%r8),%rdi
>>> > > > > > 0x0000000000016692 <iwl3945_rx_reply_tx+306>:
>>lea 0x4f(%r8),%rax
>>> > > > >
>>> > > > > When this happened, from your previous post, r8 is
>>0x0 and rdi is 0x38.
>>> > > > > Since "info" is %rdi (see below), this means
>>> > > > > txq->txb[txq->q.read_ptr].skb[0], aka. r8 is 0.
>>> > > > >
>>> > > > > > rate_idx = iwl3945_hwrate_to_plcp_idx(tx_resp->rate);
>>> > > > > >
>>> > > > > > 0x0000000000016696 <iwl3945_rx_reply_tx+310>:
>>movb $0x0,0x9(%rdi) <---------- RIP
>>> > > > > > 0x000000000001669a <iwl3945_rx_reply_tx+314>:
>>movb $0x0,0xc(%rdi)
>>> > > > > > 0x000000000001669e <iwl3945_rx_reply_tx+318>:
>>movb $0x0,0xf(%rdi)
>>> > > > > > 0x00000000000166a2 <iwl3945_rx_reply_tx+322>:
>>movb $0x0,0x12(%rdi)
>>> > > > > > 0x00000000000166a6 <iwl3945_rx_reply_tx+326>:
>>movb $0x0,0x15(%rdi)
>>> > > > >
>>> > > > > This equals to below code in
>>ieee80211_tx_info_clear_status(). "info" is
>>> > > > > %rdi, which is 0x38. That matches NULL pointer
>>dereference at 0x41 in
>>> > > > > your oops header.
>>> > > > >
>>> > > > > for (i = 0; i < IEEE80211_TX_MAX_RATES; i++)
>>> > > > > info->status.rates[i].count = 0;
>>> > > > >
>>> > > > > I guess there is a race for txq->q.read_ptr
>>somewhere. Haven't checked
>>> > > > > though.
>>> > > >
>>> > > > OK. 3945 updated write_ptr without regard to read_ptr
>>on the Tx path.
>>> > > > This messes up our TFD on high load. The patch should
>>fix your problem.
>>> > > >
>>> > > > Signed-off-by: Zhu Yi <[email protected]>
>>> > > >
>>> > > > diff --git
>>a/drivers/net/wireless/iwlwifi/iwl3945-base.c
>>b/drivers/net/wireless/iwlwifi/iwl3945-base.c
>>> > > > index 994db4a..b31b34c 100644
>>> > > > --- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
>>> > > > +++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
>>> > > > @@ -548,6 +548,9 @@ static int iwl3945_tx_skb(struct
>>iwl_priv *priv, struct sk_buff *skb)
>>> > > > txq = &priv->txq[txq_id];
>>> > > > q = &txq->q;
>>> > > >
>>> > > > + if ((iwl_queue_space(q) < q->high_mark))
>>> > > > + goto drop;
>>> > > > +
>>> > > > spin_lock_irqsave(&priv->lock, flags);
>>> > > >
>>> > > > idx = get_cmd_index(q, q->write_ptr, 0);
>>> > > >
>>> > > >
>>> > Few days ago, I had an idea to reply here that I am sure that this
>>> > problem disappeared with this patch.
>>> >
>>> >
>>> > Today I got same kernel panic _with_ the patch applied....
>>>
>>> Looks like (all of) the root causes are still not found yet.
>>The symptom
>>> is exactly the same as the previous one.
>>>
>>> One thing I found today is when txq read_ptr catches up to write_ptr
>>> (read_ptr == write_ptr), iwl_queue_used() will _always_ return TRUE!
>>> This will be a problem if the firmware sends us a wrong index
>>> (sequence), then we will fail the check in this condition.
>>I'm not sure
>>> if firmware can really send us a wrong sequence. Can you
>>please try this
>>> patch? Apply it on top of the previous one. If you do see
>>the "FIRMWARE
>>> BUG" in dmesg, then I think we find the root cause.
>>
>>
>>
>>Finally exploded....
>>
>>
>><3>[ 330.761347] iwl3945 0000:06:00.0: Microcode SW error
>>detected. Restarting 0x82000008.
>><3>[ 330.761403] iwl3945 0000:06:00.0: Start IWL Error Log Dump:
>><3>[ 330.761421] iwl3945 0000:06:00.0: Status: 0x000302E6, count: 1
>><3>[ 330.761438] iwl3945 0000:06:00.0: Desc Time
>>asrtPC blink2 ilink1 nmiPC Line
>><3>[ 330.761683] iwl3945 0000:06:00.0: SYSASSERT (#5)
>>1035963230 0x008B6 0x017B8 0x0031C 0x00000 65
>
>"Bad Command" sysassert ... See below (search "obviously").
>
>><3>[ 330.761692]
>><3>[ 330.761837] iwl3945 0000:06:00.0: Start IWL Event Log
>>Dump: display last 20 count
>><3>[ 330.761895] iwl3945 0000:06:00.0: 1035963151
>>0x00000000 1106
>><3>[ 330.761931] iwl3945 0000:06:00.0: 1035963155
>>0x00000000 1106
>><3>[ 330.761967] iwl3945 0000:06:00.0: 1035963160
>>0x00000000 1106
>><3>[ 330.762003] iwl3945 0000:06:00.0: 1035963164
>>0x00000000 1106
>><3>[ 330.762039] iwl3945 0000:06:00.0: 1035963168
>>0x00000000 1106
>><3>[ 330.762075] iwl3945 0000:06:00.0: 1035963173
>>0x00000000 1106
>><3>[ 330.762112] iwl3945 0000:06:00.0: 1035963177
>>0x00000000 1106
>><3>[ 330.762148] iwl3945 0000:06:00.0: 1035963181
>>0x00000000 1106
>><3>[ 330.762184] iwl3945 0000:06:00.0: 1035963185
>>0x00000000 1106
>><3>[ 330.762220] iwl3945 0000:06:00.0: 1035963190
>>0x00000001 1106
>><3>[ 330.762256] iwl3945 0000:06:00.0: 1035963190
>>0x0000000f 1106
>><3>[ 330.762292] iwl3945 0000:06:00.0: 1035963192
>>0x00000417 1105
>><3>[ 330.762329] iwl3945 0000:06:00.0: 1035963195
>>0x000005a5 1115
>><3>[ 330.762365] iwl3945 0000:06:00.0: 1035963199
>>0x00000001 0463
>><3>[ 330.762401] iwl3945 0000:06:00.0: 1035963200
>>0x00000001 0462
>><3>[ 330.762437] iwl3945 0000:06:00.0: 1035963212
>>0x000002b1 0603
>><3>[ 330.762474] iwl3945 0000:06:00.0: 1035963217
>>0x04fb0077 0401
>><3>[ 330.762510] iwl3945 0000:06:00.0: 1035963222
>>0x04dc0018 0401
>
>Last good command 0x18, with (driver provided) header
>indicating sent via queue 4, queue index 0xdc
>
>><3>[ 330.762547] iwl3945 0000:06:00.0: 1035963229
>>0x00000000 0401
>
>This shows command "0" (non-existant/unsupported/bad command)
>with (driver provided) header indicating sent via queue "0",
>index "0"; obviously bad data in command queue.

Or index ("pointer") is getting out of sync. (Reminder to self; never use the word "obviously").

-- Ben --
>
>-- Ben --
>
>><3>[ 330.762583] iwl3945 0000:06:00.0: 1035963232
>>0x00000100 0125
>><4>[ 330.762623] FIRMWARE BUG: index 220 is given while
>>read_ptr is 252
>><0>[ 330.762674] ------------[ cut here ]------------
>><2>[ 330.762689] kernel BUG at
>>/home/maxim/software/kernel/linux-2.6/drivers/net/wireless/iwlw
>>ifi/iwl-dev.h:717!
>><0>[ 330.762715] invalid opcode: 0000 [#1] PREEMPT SMP
>><0>[ 330.762735] last sysfs file:
>>/sys/devices/platform/coretemp.1/temp1_input
>><4>[ 330.762755] CPU 0
>><4>[ 330.762765] Modules linked in: nvidia(P) af_packet nfsd
>>exportfs nfs lockd nfs_acl auth_rpcgss sunrpc usb_storage
>>usb_libusual cpufreq_powersave cpufreq_conservative
>>snd_hda_codec_realtek cpufreq_userspace snd_hda_intel
>>acpi_cpufreq uvcvideo iwl3945 snd_hda_codec coretemp videodev
>>iwlcore v4l1_compat snd_hwdep joydev ohci1394
>>v4l2_compat_ioctl32 tg3 snd_pcm sbp2 mac80211 uhci_hcd psmouse
>>video snd_page_alloc ieee1394 libphy ehci_hcd output ac
>>battery lirc_ene0100 serio_raw evdev cfg80211 rfkill usbcore
>>fuse lzo lzo_decompress lzo_compress
>><6>[ 330.762987] Pid: 0, comm: swapper Tainted: P
>>2.6.32-wl #243 Aspire 5720
>><6>[ 330.763011] RIP: 0010:[<ffffffffa01c3fe7>]
>>[<ffffffffa01c3fe7>] iwl_tx_cmd_complete+0x397/0x3a0 [iwlcore]
>><6>[ 330.763057] RSP: 0018:ffff880002203d20 EFLAGS: 00010286
>><6>[ 330.763075] RAX: 0000000000000049 RBX: ffff88006fc09b40
>>RCX: 0000000000000102
>><6>[ 330.763096] RDX: 0000000000000000 RSI: 0000000000000000
>>RDI: 0000000000000001
>><6>[ 330.763118] RBP: ffff880002203d90 R08: 0000000000000002
>>R09: 0000000000000000
>><6>[ 330.763139] R10: 0000000000000000 R11: 0000000000000001
>>R12: 00000000000000fc
>><6>[ 330.763160] R13: 00000000000000dc R14: ffff88006fd6d460
>>R15: ffff88006a2a1600
>><6>[ 330.763182] FS: 0000000000000000(0000)
>>GS:ffff880002200000(0000) knlGS:0000000000000000
>><6>[ 330.763206] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>><6>[ 330.763224] CR2: 00007f7ffbf95000 CR3: 0000000001001000
>>CR4: 00000000000006f0
>><6>[ 330.763246] DR0: 0000000000000000 DR1: 0000000000000000
>>DR2: 0000000000000000
>><6>[ 330.763267] DR3: 0000000000000000 DR6: 00000000ffff0ff0
>>DR7: 0000000000000400
>><4>[ 330.763289] Process swapper (pid: 0, threadinfo
>>ffffffff815ca000, task ffffffff815ceb60)
>><0>[ 330.763310] Stack:
>><4>[ 330.763318] ffff88006a2a9690 0000000000000046
>>ffff880002203d40 0000000000000046
>><4>[ 330.763346] <0> 0000000000000282 0000000000000282
>>0000000000000282 ffff8800699eba00
>><4>[ 330.763375] <0> ffff880002203d70 ffff88006a2a1600
>>ffff88006a2a2f10 ffff88006a673000
>><0>[ 330.763406] Call Trace:
>><0>[ 330.763416] <IRQ>
>><4>[ 330.763436] [<ffffffffa0230d13>]
>>iwl3945_irq_tasklet+0x673/0x1740 [iwl3945]
>><4>[ 330.763469] [<ffffffffa01b683e>] ?
>>iwl_isr_legacy+0x3e/0x600 [iwlcore]
>><4>[ 330.763496] [<ffffffff81053f56>] tasklet_action+0x106/0x110
>><4>[ 330.763516] [<ffffffff81055674>] __do_softirq+0x114/0x2c0
>><4>[ 330.763536] [<ffffffff8109eb38>] ? handle_edge_irq+0x78/0x160
>><4>[ 330.763558] [<ffffffff8100d25c>] call_softirq+0x1c/0x30
>><4>[ 330.763577] [<ffffffff8100f1fd>] do_softirq+0x7d/0xb0
>><4>[ 330.763595] [<ffffffff81055125>] irq_exit+0x95/0xa0
>><4>[ 330.763615] [<ffffffff8140bddc>] do_IRQ+0x7c/0xf0
>><4>[ 330.763634] [<ffffffff8100ca13>] ret_from_intr+0x0/0xf
>><0>[ 330.763650] <EOI>
>><4>[ 330.763664] [<ffffffff812a1d5b>] ?
>>acpi_idle_enter_bm+0x275/0x2aa
>><4>[ 330.763684] [<ffffffff812a1d65>] ?
>>acpi_idle_enter_bm+0x27f/0x2aa
>><4>[ 330.763705] [<ffffffff812a1d5b>] ?
>>acpi_idle_enter_bm+0x275/0x2aa
>><4>[ 330.763729] [<ffffffff8133227f>] ? cpuidle_idle_call+0x9f/0x160
>><4>[ 330.763749] [<ffffffff8100b1df>] ? cpu_idle+0xaf/0x110
>><4>[ 330.763769] [<ffffffff813f534a>] ? rest_init+0x7a/0x80
>><4>[ 330.763789] [<ffffffff81640d01>] ? start_kernel+0x3ac/0x3b8
>><4>[ 330.763808] [<ffffffff81640315>] ?
>>x86_64_start_reservations+0x125/0x129
>><4>[ 330.763831] [<ffffffff816403fd>] ?
>x86_64_start_kernel+0xe4/0xeb
>><0>[ 330.763848] Code: c7 c6 48 86 1d a0 48 c7 c7 53 86 1d a0
>>e8 a2 b3 07 e1 e9 68 fe ff ff 44 89 e2 44 89 ee 48 c7 c7 80 5a
>>1d a0 31 c0 e8 c7 31 24 e1 <0f> 0b eb fe 0f 1f 44 00 00 55 48
>>89 e5 41 57 41 56 41 55 41 54
>><1>[ 330.764008] RIP [<ffffffffa01c3fe7>]
>>iwl_tx_cmd_complete+0x397/0x3a0 [iwlcore]
>><4>[ 330.764042] RSP <ffff880002203d20>
>><4>[ 330.764059] ---[ end trace e386a588d450d76f ]---
>><0>[ 330.764073] Kernel panic - not syncing: Fatal exception
>>in interrupt
>><4>[ 330.764093] Pid: 0, comm: swapper Tainted: P D
>>2.6.32-wl #243
>><4>[ 330.764111] Call Trace:
>><4>[ 330.764120] <IRQ> [<ffffffff814070f1>] panic+0x82/0x13f
>><4>[ 330.764148] [<ffffffff81010a52>] oops_end+0xe2/0xf0
>><4>[ 330.764166] [<ffffffff81010c7b>] die+0x5b/0x90
>><4>[ 330.764183] [<ffffffff8100db24>] do_trap+0xc4/0x170
>><4>[ 330.764201] [<ffffffff8100e025>] do_invalid_op+0x95/0xb0
>><4>[ 330.764230] [<ffffffffa01c3fe7>] ?
>>iwl_tx_cmd_complete+0x397/0x3a0 [iwlcore]
>><4>[ 330.764254] [<ffffffff8140a94d>] ?
>>trace_hardirqs_off_thunk+0x3a/0x3c
>><4>[ 330.764276] [<ffffffff8100cef5>] invalid_op+0x15/0x20
>><4>[ 330.764304] [<ffffffffa01c3fe7>] ?
>>iwl_tx_cmd_complete+0x397/0x3a0 [iwlcore]
>><4>[ 330.764333] [<ffffffffa0230d13>]
>>iwl3945_irq_tasklet+0x673/0x1740 [iwl3945]
>><4>[ 330.764364] [<ffffffffa01b683e>] ?
>>iwl_isr_legacy+0x3e/0x600 [iwlcore]
>><4>[ 330.764388] [<ffffffff81053f56>] tasklet_action+0x106/0x110
>><4>[ 330.764408] [<ffffffff81055674>] __do_softirq+0x114/0x2c0
>><4>[ 330.764427] [<ffffffff8109eb38>] ? handle_edge_irq+0x78/0x160
>><4>[ 330.764447] [<ffffffff8100d25c>] call_softirq+0x1c/0x30
>><4>[ 330.764465] [<ffffffff8100f1fd>] do_softirq+0x7d/0xb0
>><4>[ 330.764483] [<ffffffff81055125>] irq_exit+0x95/0xa0
>><4>[ 330.764500] [<ffffffff8140bddc>] do_IRQ+0x7c/0xf0
>><4>[ 330.764519] [<ffffffff8100ca13>] ret_from_intr+0x0/0xf
>><4>[ 330.764534] <EOI> [<ffffffff812a1d5b>] ?
>>acpi_idle_enter_bm+0x275/0x2aa
>><4>[ 330.764562] [<ffffffff812a1d65>] ?
>>acpi_idle_enter_bm+0x27f/0x2aa
>><4>[ 330.764583] [<ffffffff812a1d5b>] ?
>>acpi_idle_enter_bm+0x275/0x2aa
>><4>[ 330.764605] [<ffffffff8133227f>] ? cpuidle_idle_call+0x9f/0x160
>><4>[ 330.764625] [<ffffffff8100b1df>] ? cpu_idle+0xaf/0x110
>><4>[ 330.764643] [<ffffffff813f534a>] ? rest_init+0x7a/0x80
>><4>[ 330.764662] [<ffffffff81640d01>] ? start_kernel+0x3ac/0x3b8
>><4>[ 330.764682] [<ffffffff81640315>] ?
>>x86_64_start_reservations+0x125/0x129
>><4>[ 330.764703] [<ffffffff816403fd>] ?
>x86_64_start_kernel+0xe4/0xeb
>>
>>
>>
>>Note that these days I see the 'Microcode SW error detected'
>>somewhat more often that I used to.
>>
>>
>>This is the combined patch I used:
>>
>>
>>diff --git a/drivers/net/wireless/iwlwifi/iwl-dev.h
>>b/drivers/net/wireless/iwlwifi/iwl-dev.h
>>index 2673e9a..8efc469 100644
>>--- a/drivers/net/wireless/iwlwifi/iwl-dev.h
>>+++ b/drivers/net/wireless/iwlwifi/iwl-dev.h
>>@@ -711,7 +711,13 @@ extern void iwl_txq_ctx_stop(struct
>>iwl_priv *priv);
>> extern int iwl_queue_space(const struct iwl_queue *q);
>> static inline int iwl_queue_used(const struct iwl_queue *q, int i)
>> {
>>- return q->write_ptr > q->read_ptr ?
>>+ if (q->write_ptr == q->read_ptr) {
>>+ printk("FIRMWARE BUG: index %d is given while
>>read_ptr is %d\n",
>>+ i, q->read_ptr);
>>+ BUG();
>>+ }
>>+
>>+ return q->write_ptr >= q->read_ptr ?
>> (i >= q->read_ptr && i < q->write_ptr) :
>> !(i < q->read_ptr && i >= q->write_ptr);
>> }
>>diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c
>>b/drivers/net/wireless/iwlwifi/iwl3945-base.c
>>index 2a28a1f..a36de73 100644
>>--- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
>>+++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
>>@@ -548,6 +548,9 @@ static int iwl3945_tx_skb(struct iwl_priv
>>*priv, struct sk_buff *skb)
>> txq = &priv->txq[txq_id];
>> q = &txq->q;
>>
>>+ if ((iwl_queue_space(q) < q->high_mark))
>>+ goto drop;
>>+
>> spin_lock_irqsave(&priv->lock, flags);
>>
>> idx = get_cmd_index(q, q->write_ptr, 0);
>>
>>
>>Best regards,
>> Maxim Levitsky
>>
>>
>>---------------------------------------------------------------
>>---------------
>>This SF.Net email is sponsored by the Verizon Developer Community
>>Take advantage of Verizon's best-in-class app development support
>>A streamlined, 14 day to market process makes app distribution
>>fast and easy
>>Join now and get one step closer to millions of Verizon customers
>>http://p.sf.net/sfu/verizon-dev2dev
>>_______________________________________________
>>Ipw3945-devel mailing list
>>[email protected]
>>https://lists.sourceforge.net/lists/listinfo/ipw3945-devel
>>
>
>---------------------------------------------------------------
>---------------
>This SF.Net email is sponsored by the Verizon Developer Community
>Take advantage of Verizon's best-in-class app development support
>A streamlined, 14 day to market process makes app distribution
>fast and easy
>Join now and get one step closer to millions of Verizon customers
>http://p.sf.net/sfu/verizon-dev2dev
>_______________________________________________
>Ipw3945-devel mailing list
>[email protected]
>https://lists.sourceforge.net/lists/listinfo/ipw3945-devel
>

2010-01-05 22:16:42

by Cahill, Ben M

[permalink] [raw]
Subject: RE: [ipw3945-devel] Panic in iwl3945 driver

uCode is receiving bad command from command queue ... See below (search "Bad").

-- Ben --

>-----Original Message-----
>From: Maxim Levitsky [mailto:[email protected]]
>Sent: Tuesday, January 05, 2010 11:56 AM
>To: Zhu, Yi
>Cc: Chatre, Reinette; linux-wireless; iwlwifi maling list
>Subject: Re: [ipw3945-devel] Panic in iwl3945 driver
>
>On Tue, 2009-12-22 at 16:57 +0800, Zhu Yi wrote:
>> On Tue, 2009-12-22 at 00:40 +0800, Maxim Levitsky wrote:
>> > On Wed, 2009-12-02 at 19:17 +0200, Maxim Levitsky wrote:
>> > > On Wed, 2009-12-02 at 13:42 +0800, Zhu Yi wrote:
>> > > > On Tue, 2009-12-01 at 17:28 +0800, Zhu Yi wrote:
>> > > > > On Tue, 2009-12-01 at 06:35 +0800, Maxim Levitsky wrote:
>> > > > > > 0x000000000001668e <iwl3945_rx_reply_tx+302>:
>lea 0x38(%r8),%rdi
>> > > > > > 0x0000000000016692 <iwl3945_rx_reply_tx+306>:
>lea 0x4f(%r8),%rax
>> > > > >
>> > > > > When this happened, from your previous post, r8 is
>0x0 and rdi is 0x38.
>> > > > > Since "info" is %rdi (see below), this means
>> > > > > txq->txb[txq->q.read_ptr].skb[0], aka. r8 is 0.
>> > > > >
>> > > > > > rate_idx = iwl3945_hwrate_to_plcp_idx(tx_resp->rate);
>> > > > > >
>> > > > > > 0x0000000000016696 <iwl3945_rx_reply_tx+310>:
>movb $0x0,0x9(%rdi) <---------- RIP
>> > > > > > 0x000000000001669a <iwl3945_rx_reply_tx+314>:
>movb $0x0,0xc(%rdi)
>> > > > > > 0x000000000001669e <iwl3945_rx_reply_tx+318>:
>movb $0x0,0xf(%rdi)
>> > > > > > 0x00000000000166a2 <iwl3945_rx_reply_tx+322>:
>movb $0x0,0x12(%rdi)
>> > > > > > 0x00000000000166a6 <iwl3945_rx_reply_tx+326>:
>movb $0x0,0x15(%rdi)
>> > > > >
>> > > > > This equals to below code in
>ieee80211_tx_info_clear_status(). "info" is
>> > > > > %rdi, which is 0x38. That matches NULL pointer
>dereference at 0x41 in
>> > > > > your oops header.
>> > > > >
>> > > > > for (i = 0; i < IEEE80211_TX_MAX_RATES; i++)
>> > > > > info->status.rates[i].count = 0;
>> > > > >
>> > > > > I guess there is a race for txq->q.read_ptr
>somewhere. Haven't checked
>> > > > > though.
>> > > >
>> > > > OK. 3945 updated write_ptr without regard to read_ptr
>on the Tx path.
>> > > > This messes up our TFD on high load. The patch should
>fix your problem.
>> > > >
>> > > > Signed-off-by: Zhu Yi <[email protected]>
>> > > >
>> > > > diff --git
>a/drivers/net/wireless/iwlwifi/iwl3945-base.c
>b/drivers/net/wireless/iwlwifi/iwl3945-base.c
>> > > > index 994db4a..b31b34c 100644
>> > > > --- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
>> > > > +++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
>> > > > @@ -548,6 +548,9 @@ static int iwl3945_tx_skb(struct
>iwl_priv *priv, struct sk_buff *skb)
>> > > > txq = &priv->txq[txq_id];
>> > > > q = &txq->q;
>> > > >
>> > > > + if ((iwl_queue_space(q) < q->high_mark))
>> > > > + goto drop;
>> > > > +
>> > > > spin_lock_irqsave(&priv->lock, flags);
>> > > >
>> > > > idx = get_cmd_index(q, q->write_ptr, 0);
>> > > >
>> > > >
>> > Few days ago, I had an idea to reply here that I am sure that this
>> > problem disappeared with this patch.
>> >
>> >
>> > Today I got same kernel panic _with_ the patch applied....
>>
>> Looks like (all of) the root causes are still not found yet.
>The symptom
>> is exactly the same as the previous one.
>>
>> One thing I found today is when txq read_ptr catches up to write_ptr
>> (read_ptr == write_ptr), iwl_queue_used() will _always_ return TRUE!
>> This will be a problem if the firmware sends us a wrong index
>> (sequence), then we will fail the check in this condition.
>I'm not sure
>> if firmware can really send us a wrong sequence. Can you
>please try this
>> patch? Apply it on top of the previous one. If you do see
>the "FIRMWARE
>> BUG" in dmesg, then I think we find the root cause.
>
>
>
>Finally exploded....
>
>
><3>[ 330.761347] iwl3945 0000:06:00.0: Microcode SW error
>detected. Restarting 0x82000008.
><3>[ 330.761403] iwl3945 0000:06:00.0: Start IWL Error Log Dump:
><3>[ 330.761421] iwl3945 0000:06:00.0: Status: 0x000302E6, count: 1
><3>[ 330.761438] iwl3945 0000:06:00.0: Desc Time
>asrtPC blink2 ilink1 nmiPC Line
><3>[ 330.761683] iwl3945 0000:06:00.0: SYSASSERT (#5)
>1035963230 0x008B6 0x017B8 0x0031C 0x00000 65

"Bad Command" sysassert ... See below (search "obviously").

><3>[ 330.761692]
><3>[ 330.761837] iwl3945 0000:06:00.0: Start IWL Event Log
>Dump: display last 20 count
><3>[ 330.761895] iwl3945 0000:06:00.0: 1035963151
>0x00000000 1106
><3>[ 330.761931] iwl3945 0000:06:00.0: 1035963155
>0x00000000 1106
><3>[ 330.761967] iwl3945 0000:06:00.0: 1035963160
>0x00000000 1106
><3>[ 330.762003] iwl3945 0000:06:00.0: 1035963164
>0x00000000 1106
><3>[ 330.762039] iwl3945 0000:06:00.0: 1035963168
>0x00000000 1106
><3>[ 330.762075] iwl3945 0000:06:00.0: 1035963173
>0x00000000 1106
><3>[ 330.762112] iwl3945 0000:06:00.0: 1035963177
>0x00000000 1106
><3>[ 330.762148] iwl3945 0000:06:00.0: 1035963181
>0x00000000 1106
><3>[ 330.762184] iwl3945 0000:06:00.0: 1035963185
>0x00000000 1106
><3>[ 330.762220] iwl3945 0000:06:00.0: 1035963190
>0x00000001 1106
><3>[ 330.762256] iwl3945 0000:06:00.0: 1035963190
>0x0000000f 1106
><3>[ 330.762292] iwl3945 0000:06:00.0: 1035963192
>0x00000417 1105
><3>[ 330.762329] iwl3945 0000:06:00.0: 1035963195
>0x000005a5 1115
><3>[ 330.762365] iwl3945 0000:06:00.0: 1035963199
>0x00000001 0463
><3>[ 330.762401] iwl3945 0000:06:00.0: 1035963200
>0x00000001 0462
><3>[ 330.762437] iwl3945 0000:06:00.0: 1035963212
>0x000002b1 0603
><3>[ 330.762474] iwl3945 0000:06:00.0: 1035963217
>0x04fb0077 0401
><3>[ 330.762510] iwl3945 0000:06:00.0: 1035963222
>0x04dc0018 0401

Last good command 0x18, with (driver provided) header indicating sent via queue 4, queue index 0xdc

><3>[ 330.762547] iwl3945 0000:06:00.0: 1035963229
>0x00000000 0401

This shows command "0" (non-existant/unsupported/bad command) with (driver provided) header indicating sent via queue "0", index "0"; obviously bad data in command queue.

-- Ben --

><3>[ 330.762583] iwl3945 0000:06:00.0: 1035963232
>0x00000100 0125
><4>[ 330.762623] FIRMWARE BUG: index 220 is given while
>read_ptr is 252
><0>[ 330.762674] ------------[ cut here ]------------
><2>[ 330.762689] kernel BUG at
>/home/maxim/software/kernel/linux-2.6/drivers/net/wireless/iwlw
>ifi/iwl-dev.h:717!
><0>[ 330.762715] invalid opcode: 0000 [#1] PREEMPT SMP
><0>[ 330.762735] last sysfs file:
>/sys/devices/platform/coretemp.1/temp1_input
><4>[ 330.762755] CPU 0
><4>[ 330.762765] Modules linked in: nvidia(P) af_packet nfsd
>exportfs nfs lockd nfs_acl auth_rpcgss sunrpc usb_storage
>usb_libusual cpufreq_powersave cpufreq_conservative
>snd_hda_codec_realtek cpufreq_userspace snd_hda_intel
>acpi_cpufreq uvcvideo iwl3945 snd_hda_codec coretemp videodev
>iwlcore v4l1_compat snd_hwdep joydev ohci1394
>v4l2_compat_ioctl32 tg3 snd_pcm sbp2 mac80211 uhci_hcd psmouse
>video snd_page_alloc ieee1394 libphy ehci_hcd output ac
>battery lirc_ene0100 serio_raw evdev cfg80211 rfkill usbcore
>fuse lzo lzo_decompress lzo_compress
><6>[ 330.762987] Pid: 0, comm: swapper Tainted: P
>2.6.32-wl #243 Aspire 5720
><6>[ 330.763011] RIP: 0010:[<ffffffffa01c3fe7>]
>[<ffffffffa01c3fe7>] iwl_tx_cmd_complete+0x397/0x3a0 [iwlcore]
><6>[ 330.763057] RSP: 0018:ffff880002203d20 EFLAGS: 00010286
><6>[ 330.763075] RAX: 0000000000000049 RBX: ffff88006fc09b40
>RCX: 0000000000000102
><6>[ 330.763096] RDX: 0000000000000000 RSI: 0000000000000000
>RDI: 0000000000000001
><6>[ 330.763118] RBP: ffff880002203d90 R08: 0000000000000002
>R09: 0000000000000000
><6>[ 330.763139] R10: 0000000000000000 R11: 0000000000000001
>R12: 00000000000000fc
><6>[ 330.763160] R13: 00000000000000dc R14: ffff88006fd6d460
>R15: ffff88006a2a1600
><6>[ 330.763182] FS: 0000000000000000(0000)
>GS:ffff880002200000(0000) knlGS:0000000000000000
><6>[ 330.763206] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
><6>[ 330.763224] CR2: 00007f7ffbf95000 CR3: 0000000001001000
>CR4: 00000000000006f0
><6>[ 330.763246] DR0: 0000000000000000 DR1: 0000000000000000
>DR2: 0000000000000000
><6>[ 330.763267] DR3: 0000000000000000 DR6: 00000000ffff0ff0
>DR7: 0000000000000400
><4>[ 330.763289] Process swapper (pid: 0, threadinfo
>ffffffff815ca000, task ffffffff815ceb60)
><0>[ 330.763310] Stack:
><4>[ 330.763318] ffff88006a2a9690 0000000000000046
>ffff880002203d40 0000000000000046
><4>[ 330.763346] <0> 0000000000000282 0000000000000282
>0000000000000282 ffff8800699eba00
><4>[ 330.763375] <0> ffff880002203d70 ffff88006a2a1600
>ffff88006a2a2f10 ffff88006a673000
><0>[ 330.763406] Call Trace:
><0>[ 330.763416] <IRQ>
><4>[ 330.763436] [<ffffffffa0230d13>]
>iwl3945_irq_tasklet+0x673/0x1740 [iwl3945]
><4>[ 330.763469] [<ffffffffa01b683e>] ?
>iwl_isr_legacy+0x3e/0x600 [iwlcore]
><4>[ 330.763496] [<ffffffff81053f56>] tasklet_action+0x106/0x110
><4>[ 330.763516] [<ffffffff81055674>] __do_softirq+0x114/0x2c0
><4>[ 330.763536] [<ffffffff8109eb38>] ? handle_edge_irq+0x78/0x160
><4>[ 330.763558] [<ffffffff8100d25c>] call_softirq+0x1c/0x30
><4>[ 330.763577] [<ffffffff8100f1fd>] do_softirq+0x7d/0xb0
><4>[ 330.763595] [<ffffffff81055125>] irq_exit+0x95/0xa0
><4>[ 330.763615] [<ffffffff8140bddc>] do_IRQ+0x7c/0xf0
><4>[ 330.763634] [<ffffffff8100ca13>] ret_from_intr+0x0/0xf
><0>[ 330.763650] <EOI>
><4>[ 330.763664] [<ffffffff812a1d5b>] ?
>acpi_idle_enter_bm+0x275/0x2aa
><4>[ 330.763684] [<ffffffff812a1d65>] ?
>acpi_idle_enter_bm+0x27f/0x2aa
><4>[ 330.763705] [<ffffffff812a1d5b>] ?
>acpi_idle_enter_bm+0x275/0x2aa
><4>[ 330.763729] [<ffffffff8133227f>] ? cpuidle_idle_call+0x9f/0x160
><4>[ 330.763749] [<ffffffff8100b1df>] ? cpu_idle+0xaf/0x110
><4>[ 330.763769] [<ffffffff813f534a>] ? rest_init+0x7a/0x80
><4>[ 330.763789] [<ffffffff81640d01>] ? start_kernel+0x3ac/0x3b8
><4>[ 330.763808] [<ffffffff81640315>] ?
>x86_64_start_reservations+0x125/0x129
><4>[ 330.763831] [<ffffffff816403fd>] ? x86_64_start_kernel+0xe4/0xeb
><0>[ 330.763848] Code: c7 c6 48 86 1d a0 48 c7 c7 53 86 1d a0
>e8 a2 b3 07 e1 e9 68 fe ff ff 44 89 e2 44 89 ee 48 c7 c7 80 5a
>1d a0 31 c0 e8 c7 31 24 e1 <0f> 0b eb fe 0f 1f 44 00 00 55 48
>89 e5 41 57 41 56 41 55 41 54
><1>[ 330.764008] RIP [<ffffffffa01c3fe7>]
>iwl_tx_cmd_complete+0x397/0x3a0 [iwlcore]
><4>[ 330.764042] RSP <ffff880002203d20>
><4>[ 330.764059] ---[ end trace e386a588d450d76f ]---
><0>[ 330.764073] Kernel panic - not syncing: Fatal exception
>in interrupt
><4>[ 330.764093] Pid: 0, comm: swapper Tainted: P D
>2.6.32-wl #243
><4>[ 330.764111] Call Trace:
><4>[ 330.764120] <IRQ> [<ffffffff814070f1>] panic+0x82/0x13f
><4>[ 330.764148] [<ffffffff81010a52>] oops_end+0xe2/0xf0
><4>[ 330.764166] [<ffffffff81010c7b>] die+0x5b/0x90
><4>[ 330.764183] [<ffffffff8100db24>] do_trap+0xc4/0x170
><4>[ 330.764201] [<ffffffff8100e025>] do_invalid_op+0x95/0xb0
><4>[ 330.764230] [<ffffffffa01c3fe7>] ?
>iwl_tx_cmd_complete+0x397/0x3a0 [iwlcore]
><4>[ 330.764254] [<ffffffff8140a94d>] ?
>trace_hardirqs_off_thunk+0x3a/0x3c
><4>[ 330.764276] [<ffffffff8100cef5>] invalid_op+0x15/0x20
><4>[ 330.764304] [<ffffffffa01c3fe7>] ?
>iwl_tx_cmd_complete+0x397/0x3a0 [iwlcore]
><4>[ 330.764333] [<ffffffffa0230d13>]
>iwl3945_irq_tasklet+0x673/0x1740 [iwl3945]
><4>[ 330.764364] [<ffffffffa01b683e>] ?
>iwl_isr_legacy+0x3e/0x600 [iwlcore]
><4>[ 330.764388] [<ffffffff81053f56>] tasklet_action+0x106/0x110
><4>[ 330.764408] [<ffffffff81055674>] __do_softirq+0x114/0x2c0
><4>[ 330.764427] [<ffffffff8109eb38>] ? handle_edge_irq+0x78/0x160
><4>[ 330.764447] [<ffffffff8100d25c>] call_softirq+0x1c/0x30
><4>[ 330.764465] [<ffffffff8100f1fd>] do_softirq+0x7d/0xb0
><4>[ 330.764483] [<ffffffff81055125>] irq_exit+0x95/0xa0
><4>[ 330.764500] [<ffffffff8140bddc>] do_IRQ+0x7c/0xf0
><4>[ 330.764519] [<ffffffff8100ca13>] ret_from_intr+0x0/0xf
><4>[ 330.764534] <EOI> [<ffffffff812a1d5b>] ?
>acpi_idle_enter_bm+0x275/0x2aa
><4>[ 330.764562] [<ffffffff812a1d65>] ?
>acpi_idle_enter_bm+0x27f/0x2aa
><4>[ 330.764583] [<ffffffff812a1d5b>] ?
>acpi_idle_enter_bm+0x275/0x2aa
><4>[ 330.764605] [<ffffffff8133227f>] ? cpuidle_idle_call+0x9f/0x160
><4>[ 330.764625] [<ffffffff8100b1df>] ? cpu_idle+0xaf/0x110
><4>[ 330.764643] [<ffffffff813f534a>] ? rest_init+0x7a/0x80
><4>[ 330.764662] [<ffffffff81640d01>] ? start_kernel+0x3ac/0x3b8
><4>[ 330.764682] [<ffffffff81640315>] ?
>x86_64_start_reservations+0x125/0x129
><4>[ 330.764703] [<ffffffff816403fd>] ? x86_64_start_kernel+0xe4/0xeb
>
>
>
>Note that these days I see the 'Microcode SW error detected'
>somewhat more often that I used to.
>
>
>This is the combined patch I used:
>
>
>diff --git a/drivers/net/wireless/iwlwifi/iwl-dev.h
>b/drivers/net/wireless/iwlwifi/iwl-dev.h
>index 2673e9a..8efc469 100644
>--- a/drivers/net/wireless/iwlwifi/iwl-dev.h
>+++ b/drivers/net/wireless/iwlwifi/iwl-dev.h
>@@ -711,7 +711,13 @@ extern void iwl_txq_ctx_stop(struct
>iwl_priv *priv);
> extern int iwl_queue_space(const struct iwl_queue *q);
> static inline int iwl_queue_used(const struct iwl_queue *q, int i)
> {
>- return q->write_ptr > q->read_ptr ?
>+ if (q->write_ptr == q->read_ptr) {
>+ printk("FIRMWARE BUG: index %d is given while
>read_ptr is %d\n",
>+ i, q->read_ptr);
>+ BUG();
>+ }
>+
>+ return q->write_ptr >= q->read_ptr ?
> (i >= q->read_ptr && i < q->write_ptr) :
> !(i < q->read_ptr && i >= q->write_ptr);
> }
>diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c
>b/drivers/net/wireless/iwlwifi/iwl3945-base.c
>index 2a28a1f..a36de73 100644
>--- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
>+++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
>@@ -548,6 +548,9 @@ static int iwl3945_tx_skb(struct iwl_priv
>*priv, struct sk_buff *skb)
> txq = &priv->txq[txq_id];
> q = &txq->q;
>
>+ if ((iwl_queue_space(q) < q->high_mark))
>+ goto drop;
>+
> spin_lock_irqsave(&priv->lock, flags);
>
> idx = get_cmd_index(q, q->write_ptr, 0);
>
>
>Best regards,
> Maxim Levitsky
>
>
>---------------------------------------------------------------
>---------------
>This SF.Net email is sponsored by the Verizon Developer Community
>Take advantage of Verizon's best-in-class app development support
>A streamlined, 14 day to market process makes app distribution
>fast and easy
>Join now and get one step closer to millions of Verizon customers
>http://p.sf.net/sfu/verizon-dev2dev
>_______________________________________________
>Ipw3945-devel mailing list
>[email protected]
>https://lists.sourceforge.net/lists/listinfo/ipw3945-devel
>

2010-01-06 03:55:29

by Zhu Yi

[permalink] [raw]
Subject: Re: Panic in iwl3945 driver

On Wed, 2010-01-06 at 00:56 +0800, Maxim Levitsky wrote:
> > One thing I found today is when txq read_ptr catches up to write_ptr
> > (read_ptr == write_ptr), iwl_queue_used() will _always_ return TRUE!
> > This will be a problem if the firmware sends us a wrong index
> > (sequence), then we will fail the check in this condition. I'm not sure
> > if firmware can really send us a wrong sequence. Can you please try this
> > patch? Apply it on top of the previous one. If you do see the "FIRMWARE
> > BUG" in dmesg, then I think we find the root cause.
>
>
>
> Finally exploded....
>
>
> <3>[ 330.761347] iwl3945 0000:06:00.0: Microcode SW error detected. Restarting 0x82000008.
> <3>[ 330.761403] iwl3945 0000:06:00.0: Start IWL Error Log Dump:
> <3>[ 330.761421] iwl3945 0000:06:00.0: Status: 0x000302E6, count: 1
> <3>[ 330.761438] iwl3945 0000:06:00.0: Desc Time asrtPC blink2 ilink1 nmiPC Line
> <3>[ 330.761683] iwl3945 0000:06:00.0: SYSASSERT (#5) 1035963230 0x008B6 0x017B8 0x0031C 0x00000 65
> <3>[ 330.761692]
> <3>[ 330.761837] iwl3945 0000:06:00.0: Start IWL Event Log Dump: display last 20 count
> <3>[ 330.761895] iwl3945 0000:06:00.0: 1035963151 0x00000000 1106
> <3>[ 330.761931] iwl3945 0000:06:00.0: 1035963155 0x00000000 1106
> <3>[ 330.761967] iwl3945 0000:06:00.0: 1035963160 0x00000000 1106
> <3>[ 330.762003] iwl3945 0000:06:00.0: 1035963164 0x00000000 1106
> <3>[ 330.762039] iwl3945 0000:06:00.0: 1035963168 0x00000000 1106
> <3>[ 330.762075] iwl3945 0000:06:00.0: 1035963173 0x00000000 1106
> <3>[ 330.762112] iwl3945 0000:06:00.0: 1035963177 0x00000000 1106
> <3>[ 330.762148] iwl3945 0000:06:00.0: 1035963181 0x00000000 1106
> <3>[ 330.762184] iwl3945 0000:06:00.0: 1035963185 0x00000000 1106
> <3>[ 330.762220] iwl3945 0000:06:00.0: 1035963190 0x00000001 1106
> <3>[ 330.762256] iwl3945 0000:06:00.0: 1035963190 0x0000000f 1106
> <3>[ 330.762292] iwl3945 0000:06:00.0: 1035963192 0x00000417 1105
> <3>[ 330.762329] iwl3945 0000:06:00.0: 1035963195 0x000005a5 1115
> <3>[ 330.762365] iwl3945 0000:06:00.0: 1035963199 0x00000001 0463
> <3>[ 330.762401] iwl3945 0000:06:00.0: 1035963200 0x00000001 0462
> <3>[ 330.762437] iwl3945 0000:06:00.0: 1035963212 0x000002b1 0603
> <3>[ 330.762474] iwl3945 0000:06:00.0: 1035963217 0x04fb0077 0401
> <3>[ 330.762510] iwl3945 0000:06:00.0: 1035963222 0x04dc0018 0401
> <3>[ 330.762547] iwl3945 0000:06:00.0: 1035963229 0x00000000 0401
> <3>[ 330.762583] iwl3945 0000:06:00.0: 1035963232 0x00000100 0125
> <4>[ 330.762623] FIRMWARE BUG: index 220 is given while read_ptr is 252

OK. This confirms my supposition. The firmware does give us an invalid
read pointer in some situations. In the above case, the firmware also
detected a "bad command" error itself.

So we can use the second patch to prevent the panic (will do a formal
submission) and start to track the firmware issue on the other hand.
Thanks for your testing!

Thanks,
-yi

> <0>[ 330.762674] ------------[ cut here ]------------
> <2>[ 330.762689] kernel BUG at /home/maxim/software/kernel/linux-2.6/drivers/net/wireless/iwlwifi/iwl-dev.h:717!
> <0>[ 330.762715] invalid opcode: 0000 [#1] PREEMPT SMP
> <0>[ 330.762735] last sysfs file: /sys/devices/platform/coretemp.1/temp1_input
> <4>[ 330.762755] CPU 0
> <4>[ 330.762765] Modules linked in: nvidia(P) af_packet nfsd exportfs nfs lockd nfs_acl auth_rpcgss sunrpc usb_storage usb_libusual cpufreq_powersave cpufreq_conservative snd_hda_codec_realtek cpufreq_userspace snd_hda_intel acpi_cpufreq uvcvideo iwl3945 snd_hda_codec coretemp videodev iwlcore v4l1_compat snd_hwdep joydev ohci1394 v4l2_compat_ioctl32 tg3 snd_pcm sbp2 mac80211 uhci_hcd psmouse video snd_page_alloc ieee1394 libphy ehci_hcd output ac battery lirc_ene0100 serio_raw evdev cfg80211 rfkill usbcore fuse lzo lzo_decompress lzo_compress
> <6>[ 330.762987] Pid: 0, comm: swapper Tainted: P 2.6.32-wl #243 Aspire 5720
> <6>[ 330.763011] RIP: 0010:[<ffffffffa01c3fe7>] [<ffffffffa01c3fe7>] iwl_tx_cmd_complete+0x397/0x3a0 [iwlcore]
> <6>[ 330.763057] RSP: 0018:ffff880002203d20 EFLAGS: 00010286
> <6>[ 330.763075] RAX: 0000000000000049 RBX: ffff88006fc09b40 RCX: 0000000000000102
> <6>[ 330.763096] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
> <6>[ 330.763118] RBP: ffff880002203d90 R08: 0000000000000002 R09: 0000000000000000
> <6>[ 330.763139] R10: 0000000000000000 R11: 0000000000000001 R12: 00000000000000fc
> <6>[ 330.763160] R13: 00000000000000dc R14: ffff88006fd6d460 R15: ffff88006a2a1600
> <6>[ 330.763182] FS: 0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
> <6>[ 330.763206] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> <6>[ 330.763224] CR2: 00007f7ffbf95000 CR3: 0000000001001000 CR4: 00000000000006f0
> <6>[ 330.763246] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> <6>[ 330.763267] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> <4>[ 330.763289] Process swapper (pid: 0, threadinfo ffffffff815ca000, task ffffffff815ceb60)
> <0>[ 330.763310] Stack:
> <4>[ 330.763318] ffff88006a2a9690 0000000000000046 ffff880002203d40 0000000000000046
> <4>[ 330.763346] <0> 0000000000000282 0000000000000282 0000000000000282 ffff8800699eba00
> <4>[ 330.763375] <0> ffff880002203d70 ffff88006a2a1600 ffff88006a2a2f10 ffff88006a673000
> <0>[ 330.763406] Call Trace:
> <0>[ 330.763416] <IRQ>
> <4>[ 330.763436] [<ffffffffa0230d13>] iwl3945_irq_tasklet+0x673/0x1740 [iwl3945]
> <4>[ 330.763469] [<ffffffffa01b683e>] ? iwl_isr_legacy+0x3e/0x600 [iwlcore]
> <4>[ 330.763496] [<ffffffff81053f56>] tasklet_action+0x106/0x110
> <4>[ 330.763516] [<ffffffff81055674>] __do_softirq+0x114/0x2c0
> <4>[ 330.763536] [<ffffffff8109eb38>] ? handle_edge_irq+0x78/0x160
> <4>[ 330.763558] [<ffffffff8100d25c>] call_softirq+0x1c/0x30
> <4>[ 330.763577] [<ffffffff8100f1fd>] do_softirq+0x7d/0xb0
> <4>[ 330.763595] [<ffffffff81055125>] irq_exit+0x95/0xa0
> <4>[ 330.763615] [<ffffffff8140bddc>] do_IRQ+0x7c/0xf0
> <4>[ 330.763634] [<ffffffff8100ca13>] ret_from_intr+0x0/0xf
> <0>[ 330.763650] <EOI>
> <4>[ 330.763664] [<ffffffff812a1d5b>] ? acpi_idle_enter_bm+0x275/0x2aa
> <4>[ 330.763684] [<ffffffff812a1d65>] ? acpi_idle_enter_bm+0x27f/0x2aa
> <4>[ 330.763705] [<ffffffff812a1d5b>] ? acpi_idle_enter_bm+0x275/0x2aa
> <4>[ 330.763729] [<ffffffff8133227f>] ? cpuidle_idle_call+0x9f/0x160
> <4>[ 330.763749] [<ffffffff8100b1df>] ? cpu_idle+0xaf/0x110
> <4>[ 330.763769] [<ffffffff813f534a>] ? rest_init+0x7a/0x80
> <4>[ 330.763789] [<ffffffff81640d01>] ? start_kernel+0x3ac/0x3b8
> <4>[ 330.763808] [<ffffffff81640315>] ? x86_64_start_reservations+0x125/0x129
> <4>[ 330.763831] [<ffffffff816403fd>] ? x86_64_start_kernel+0xe4/0xeb
> <0>[ 330.763848] Code: c7 c6 48 86 1d a0 48 c7 c7 53 86 1d a0 e8 a2 b3 07 e1 e9 68 fe ff ff 44 89 e2 44 89 ee 48 c7 c7 80 5a 1d a0 31 c0 e8 c7 31 24 e1 <0f> 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54
> <1>[ 330.764008] RIP [<ffffffffa01c3fe7>] iwl_tx_cmd_complete+0x397/0x3a0 [iwlcore]
> <4>[ 330.764042] RSP <ffff880002203d20>
> <4>[ 330.764059] ---[ end trace e386a588d450d76f ]---
> <0>[ 330.764073] Kernel panic - not syncing: Fatal exception in interrupt
> <4>[ 330.764093] Pid: 0, comm: swapper Tainted: P D 2.6.32-wl #243
> <4>[ 330.764111] Call Trace:
> <4>[ 330.764120] <IRQ> [<ffffffff814070f1>] panic+0x82/0x13f
> <4>[ 330.764148] [<ffffffff81010a52>] oops_end+0xe2/0xf0
> <4>[ 330.764166] [<ffffffff81010c7b>] die+0x5b/0x90
> <4>[ 330.764183] [<ffffffff8100db24>] do_trap+0xc4/0x170
> <4>[ 330.764201] [<ffffffff8100e025>] do_invalid_op+0x95/0xb0
> <4>[ 330.764230] [<ffffffffa01c3fe7>] ? iwl_tx_cmd_complete+0x397/0x3a0 [iwlcore]
> <4>[ 330.764254] [<ffffffff8140a94d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> <4>[ 330.764276] [<ffffffff8100cef5>] invalid_op+0x15/0x20
> <4>[ 330.764304] [<ffffffffa01c3fe7>] ? iwl_tx_cmd_complete+0x397/0x3a0 [iwlcore]
> <4>[ 330.764333] [<ffffffffa0230d13>] iwl3945_irq_tasklet+0x673/0x1740 [iwl3945]
> <4>[ 330.764364] [<ffffffffa01b683e>] ? iwl_isr_legacy+0x3e/0x600 [iwlcore]
> <4>[ 330.764388] [<ffffffff81053f56>] tasklet_action+0x106/0x110
> <4>[ 330.764408] [<ffffffff81055674>] __do_softirq+0x114/0x2c0
> <4>[ 330.764427] [<ffffffff8109eb38>] ? handle_edge_irq+0x78/0x160
> <4>[ 330.764447] [<ffffffff8100d25c>] call_softirq+0x1c/0x30
> <4>[ 330.764465] [<ffffffff8100f1fd>] do_softirq+0x7d/0xb0
> <4>[ 330.764483] [<ffffffff81055125>] irq_exit+0x95/0xa0
> <4>[ 330.764500] [<ffffffff8140bddc>] do_IRQ+0x7c/0xf0
> <4>[ 330.764519] [<ffffffff8100ca13>] ret_from_intr+0x0/0xf
> <4>[ 330.764534] <EOI> [<ffffffff812a1d5b>] ? acpi_idle_enter_bm+0x275/0x2aa
> <4>[ 330.764562] [<ffffffff812a1d65>] ? acpi_idle_enter_bm+0x27f/0x2aa
> <4>[ 330.764583] [<ffffffff812a1d5b>] ? acpi_idle_enter_bm+0x275/0x2aa
> <4>[ 330.764605] [<ffffffff8133227f>] ? cpuidle_idle_call+0x9f/0x160
> <4>[ 330.764625] [<ffffffff8100b1df>] ? cpu_idle+0xaf/0x110
> <4>[ 330.764643] [<ffffffff813f534a>] ? rest_init+0x7a/0x80
> <4>[ 330.764662] [<ffffffff81640d01>] ? start_kernel+0x3ac/0x3b8
> <4>[ 330.764682] [<ffffffff81640315>] ? x86_64_start_reservations+0x125/0x129
> <4>[ 330.764703] [<ffffffff816403fd>] ? x86_64_start_kernel+0xe4/0xeb

Thanks,
-yi


2010-01-06 03:57:07

by Zhu Yi

[permalink] [raw]
Subject: RE: [ipw3945-devel] Panic in iwl3945 driver

On Wed, 2010-01-06 at 06:26 +0800, Cahill, Ben M wrote:
> >><3>[ 330.762510] iwl3945 0000:06:00.0: 1035963222
> >>0x04dc0018 0401
> >
> >Last good command 0x18, with (driver provided) header
> >indicating sent via queue 4, queue index 0xdc
> >
> >><3>[ 330.762547] iwl3945 0000:06:00.0: 1035963229
> >>0x00000000 0401
> >
> >This shows command "0" (non-existant/unsupported/bad command)
> >with (driver provided) header indicating sent via queue "0",
> >index "0"; obviously bad data in command queue.
>
> Or index ("pointer") is getting out of sync. (Reminder to self; never
> use the word "obviously").

Ben, does it smell like driver issue or firmware bug?

Thanks,
-yi


2010-01-05 16:58:36

by Maxim Levitsky

[permalink] [raw]
Subject: Re: Panic in iwl3945 driver

On Tue, 2009-12-22 at 16:57 +0800, Zhu Yi wrote:
> On Tue, 2009-12-22 at 00:40 +0800, Maxim Levitsky wrote:
> > On Wed, 2009-12-02 at 19:17 +0200, Maxim Levitsky wrote:
> > > On Wed, 2009-12-02 at 13:42 +0800, Zhu Yi wrote:
> > > > On Tue, 2009-12-01 at 17:28 +0800, Zhu Yi wrote:
> > > > > On Tue, 2009-12-01 at 06:35 +0800, Maxim Levitsky wrote:
> > > > > > 0x000000000001668e <iwl3945_rx_reply_tx+302>: lea 0x38(%r8),%rdi
> > > > > > 0x0000000000016692 <iwl3945_rx_reply_tx+306>: lea 0x4f(%r8),%rax
> > > > >
> > > > > When this happened, from your previous post, r8 is 0x0 and rdi is 0x38.
> > > > > Since "info" is %rdi (see below), this means
> > > > > txq->txb[txq->q.read_ptr].skb[0], aka. r8 is 0.
> > > > >
> > > > > > rate_idx = iwl3945_hwrate_to_plcp_idx(tx_resp->rate);
> > > > > >
> > > > > > 0x0000000000016696 <iwl3945_rx_reply_tx+310>: movb $0x0,0x9(%rdi) <---------- RIP
> > > > > > 0x000000000001669a <iwl3945_rx_reply_tx+314>: movb $0x0,0xc(%rdi)
> > > > > > 0x000000000001669e <iwl3945_rx_reply_tx+318>: movb $0x0,0xf(%rdi)
> > > > > > 0x00000000000166a2 <iwl3945_rx_reply_tx+322>: movb $0x0,0x12(%rdi)
> > > > > > 0x00000000000166a6 <iwl3945_rx_reply_tx+326>: movb $0x0,0x15(%rdi)
> > > > >
> > > > > This equals to below code in ieee80211_tx_info_clear_status(). "info" is
> > > > > %rdi, which is 0x38. That matches NULL pointer dereference at 0x41 in
> > > > > your oops header.
> > > > >
> > > > > for (i = 0; i < IEEE80211_TX_MAX_RATES; i++)
> > > > > info->status.rates[i].count = 0;
> > > > >
> > > > > I guess there is a race for txq->q.read_ptr somewhere. Haven't checked
> > > > > though.
> > > >
> > > > OK. 3945 updated write_ptr without regard to read_ptr on the Tx path.
> > > > This messes up our TFD on high load. The patch should fix your problem.
> > > >
> > > > Signed-off-by: Zhu Yi <[email protected]>
> > > >
> > > > diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c b/drivers/net/wireless/iwlwifi/iwl3945-base.c
> > > > index 994db4a..b31b34c 100644
> > > > --- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
> > > > +++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
> > > > @@ -548,6 +548,9 @@ static int iwl3945_tx_skb(struct iwl_priv *priv, struct sk_buff *skb)
> > > > txq = &priv->txq[txq_id];
> > > > q = &txq->q;
> > > >
> > > > + if ((iwl_queue_space(q) < q->high_mark))
> > > > + goto drop;
> > > > +
> > > > spin_lock_irqsave(&priv->lock, flags);
> > > >
> > > > idx = get_cmd_index(q, q->write_ptr, 0);
> > > >
> > > >
> > Few days ago, I had an idea to reply here that I am sure that this
> > problem disappeared with this patch.
> >
> >
> > Today I got same kernel panic _with_ the patch applied....
>
> Looks like (all of) the root causes are still not found yet. The symptom
> is exactly the same as the previous one.
>
> One thing I found today is when txq read_ptr catches up to write_ptr
> (read_ptr == write_ptr), iwl_queue_used() will _always_ return TRUE!
> This will be a problem if the firmware sends us a wrong index
> (sequence), then we will fail the check in this condition. I'm not sure
> if firmware can really send us a wrong sequence. Can you please try this
> patch? Apply it on top of the previous one. If you do see the "FIRMWARE
> BUG" in dmesg, then I think we find the root cause.



Finally exploded....


<3>[ 330.761347] iwl3945 0000:06:00.0: Microcode SW error detected. Restarting 0x82000008.
<3>[ 330.761403] iwl3945 0000:06:00.0: Start IWL Error Log Dump:
<3>[ 330.761421] iwl3945 0000:06:00.0: Status: 0x000302E6, count: 1
<3>[ 330.761438] iwl3945 0000:06:00.0: Desc Time asrtPC blink2 ilink1 nmiPC Line
<3>[ 330.761683] iwl3945 0000:06:00.0: SYSASSERT (#5) 1035963230 0x008B6 0x017B8 0x0031C 0x00000 65
<3>[ 330.761692]
<3>[ 330.761837] iwl3945 0000:06:00.0: Start IWL Event Log Dump: display last 20 count
<3>[ 330.761895] iwl3945 0000:06:00.0: 1035963151 0x00000000 1106
<3>[ 330.761931] iwl3945 0000:06:00.0: 1035963155 0x00000000 1106
<3>[ 330.761967] iwl3945 0000:06:00.0: 1035963160 0x00000000 1106
<3>[ 330.762003] iwl3945 0000:06:00.0: 1035963164 0x00000000 1106
<3>[ 330.762039] iwl3945 0000:06:00.0: 1035963168 0x00000000 1106
<3>[ 330.762075] iwl3945 0000:06:00.0: 1035963173 0x00000000 1106
<3>[ 330.762112] iwl3945 0000:06:00.0: 1035963177 0x00000000 1106
<3>[ 330.762148] iwl3945 0000:06:00.0: 1035963181 0x00000000 1106
<3>[ 330.762184] iwl3945 0000:06:00.0: 1035963185 0x00000000 1106
<3>[ 330.762220] iwl3945 0000:06:00.0: 1035963190 0x00000001 1106
<3>[ 330.762256] iwl3945 0000:06:00.0: 1035963190 0x0000000f 1106
<3>[ 330.762292] iwl3945 0000:06:00.0: 1035963192 0x00000417 1105
<3>[ 330.762329] iwl3945 0000:06:00.0: 1035963195 0x000005a5 1115
<3>[ 330.762365] iwl3945 0000:06:00.0: 1035963199 0x00000001 0463
<3>[ 330.762401] iwl3945 0000:06:00.0: 1035963200 0x00000001 0462
<3>[ 330.762437] iwl3945 0000:06:00.0: 1035963212 0x000002b1 0603
<3>[ 330.762474] iwl3945 0000:06:00.0: 1035963217 0x04fb0077 0401
<3>[ 330.762510] iwl3945 0000:06:00.0: 1035963222 0x04dc0018 0401
<3>[ 330.762547] iwl3945 0000:06:00.0: 1035963229 0x00000000 0401
<3>[ 330.762583] iwl3945 0000:06:00.0: 1035963232 0x00000100 0125
<4>[ 330.762623] FIRMWARE BUG: index 220 is given while read_ptr is 252
<0>[ 330.762674] ------------[ cut here ]------------
<2>[ 330.762689] kernel BUG at /home/maxim/software/kernel/linux-2.6/drivers/net/wireless/iwlwifi/iwl-dev.h:717!
<0>[ 330.762715] invalid opcode: 0000 [#1] PREEMPT SMP
<0>[ 330.762735] last sysfs file: /sys/devices/platform/coretemp.1/temp1_input
<4>[ 330.762755] CPU 0
<4>[ 330.762765] Modules linked in: nvidia(P) af_packet nfsd exportfs nfs lockd nfs_acl auth_rpcgss sunrpc usb_storage usb_libusual cpufreq_powersave cpufreq_conservative snd_hda_codec_realtek cpufreq_userspace snd_hda_intel acpi_cpufreq uvcvideo iwl3945 snd_hda_codec coretemp videodev iwlcore v4l1_compat snd_hwdep joydev ohci1394 v4l2_compat_ioctl32 tg3 snd_pcm sbp2 mac80211 uhci_hcd psmouse video snd_page_alloc ieee1394 libphy ehci_hcd output ac battery lirc_ene0100 serio_raw evdev cfg80211 rfkill usbcore fuse lzo lzo_decompress lzo_compress
<6>[ 330.762987] Pid: 0, comm: swapper Tainted: P 2.6.32-wl #243 Aspire 5720
<6>[ 330.763011] RIP: 0010:[<ffffffffa01c3fe7>] [<ffffffffa01c3fe7>] iwl_tx_cmd_complete+0x397/0x3a0 [iwlcore]
<6>[ 330.763057] RSP: 0018:ffff880002203d20 EFLAGS: 00010286
<6>[ 330.763075] RAX: 0000000000000049 RBX: ffff88006fc09b40 RCX: 0000000000000102
<6>[ 330.763096] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
<6>[ 330.763118] RBP: ffff880002203d90 R08: 0000000000000002 R09: 0000000000000000
<6>[ 330.763139] R10: 0000000000000000 R11: 0000000000000001 R12: 00000000000000fc
<6>[ 330.763160] R13: 00000000000000dc R14: ffff88006fd6d460 R15: ffff88006a2a1600
<6>[ 330.763182] FS: 0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
<6>[ 330.763206] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<6>[ 330.763224] CR2: 00007f7ffbf95000 CR3: 0000000001001000 CR4: 00000000000006f0
<6>[ 330.763246] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<6>[ 330.763267] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[ 330.763289] Process swapper (pid: 0, threadinfo ffffffff815ca000, task ffffffff815ceb60)
<0>[ 330.763310] Stack:
<4>[ 330.763318] ffff88006a2a9690 0000000000000046 ffff880002203d40 0000000000000046
<4>[ 330.763346] <0> 0000000000000282 0000000000000282 0000000000000282 ffff8800699eba00
<4>[ 330.763375] <0> ffff880002203d70 ffff88006a2a1600 ffff88006a2a2f10 ffff88006a673000
<0>[ 330.763406] Call Trace:
<0>[ 330.763416] <IRQ>
<4>[ 330.763436] [<ffffffffa0230d13>] iwl3945_irq_tasklet+0x673/0x1740 [iwl3945]
<4>[ 330.763469] [<ffffffffa01b683e>] ? iwl_isr_legacy+0x3e/0x600 [iwlcore]
<4>[ 330.763496] [<ffffffff81053f56>] tasklet_action+0x106/0x110
<4>[ 330.763516] [<ffffffff81055674>] __do_softirq+0x114/0x2c0
<4>[ 330.763536] [<ffffffff8109eb38>] ? handle_edge_irq+0x78/0x160
<4>[ 330.763558] [<ffffffff8100d25c>] call_softirq+0x1c/0x30
<4>[ 330.763577] [<ffffffff8100f1fd>] do_softirq+0x7d/0xb0
<4>[ 330.763595] [<ffffffff81055125>] irq_exit+0x95/0xa0
<4>[ 330.763615] [<ffffffff8140bddc>] do_IRQ+0x7c/0xf0
<4>[ 330.763634] [<ffffffff8100ca13>] ret_from_intr+0x0/0xf
<0>[ 330.763650] <EOI>
<4>[ 330.763664] [<ffffffff812a1d5b>] ? acpi_idle_enter_bm+0x275/0x2aa
<4>[ 330.763684] [<ffffffff812a1d65>] ? acpi_idle_enter_bm+0x27f/0x2aa
<4>[ 330.763705] [<ffffffff812a1d5b>] ? acpi_idle_enter_bm+0x275/0x2aa
<4>[ 330.763729] [<ffffffff8133227f>] ? cpuidle_idle_call+0x9f/0x160
<4>[ 330.763749] [<ffffffff8100b1df>] ? cpu_idle+0xaf/0x110
<4>[ 330.763769] [<ffffffff813f534a>] ? rest_init+0x7a/0x80
<4>[ 330.763789] [<ffffffff81640d01>] ? start_kernel+0x3ac/0x3b8
<4>[ 330.763808] [<ffffffff81640315>] ? x86_64_start_reservations+0x125/0x129
<4>[ 330.763831] [<ffffffff816403fd>] ? x86_64_start_kernel+0xe4/0xeb
<0>[ 330.763848] Code: c7 c6 48 86 1d a0 48 c7 c7 53 86 1d a0 e8 a2 b3 07 e1 e9 68 fe ff ff 44 89 e2 44 89 ee 48 c7 c7 80 5a 1d a0 31 c0 e8 c7 31 24 e1 <0f> 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54
<1>[ 330.764008] RIP [<ffffffffa01c3fe7>] iwl_tx_cmd_complete+0x397/0x3a0 [iwlcore]
<4>[ 330.764042] RSP <ffff880002203d20>
<4>[ 330.764059] ---[ end trace e386a588d450d76f ]---
<0>[ 330.764073] Kernel panic - not syncing: Fatal exception in interrupt
<4>[ 330.764093] Pid: 0, comm: swapper Tainted: P D 2.6.32-wl #243
<4>[ 330.764111] Call Trace:
<4>[ 330.764120] <IRQ> [<ffffffff814070f1>] panic+0x82/0x13f
<4>[ 330.764148] [<ffffffff81010a52>] oops_end+0xe2/0xf0
<4>[ 330.764166] [<ffffffff81010c7b>] die+0x5b/0x90
<4>[ 330.764183] [<ffffffff8100db24>] do_trap+0xc4/0x170
<4>[ 330.764201] [<ffffffff8100e025>] do_invalid_op+0x95/0xb0
<4>[ 330.764230] [<ffffffffa01c3fe7>] ? iwl_tx_cmd_complete+0x397/0x3a0 [iwlcore]
<4>[ 330.764254] [<ffffffff8140a94d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
<4>[ 330.764276] [<ffffffff8100cef5>] invalid_op+0x15/0x20
<4>[ 330.764304] [<ffffffffa01c3fe7>] ? iwl_tx_cmd_complete+0x397/0x3a0 [iwlcore]
<4>[ 330.764333] [<ffffffffa0230d13>] iwl3945_irq_tasklet+0x673/0x1740 [iwl3945]
<4>[ 330.764364] [<ffffffffa01b683e>] ? iwl_isr_legacy+0x3e/0x600 [iwlcore]
<4>[ 330.764388] [<ffffffff81053f56>] tasklet_action+0x106/0x110
<4>[ 330.764408] [<ffffffff81055674>] __do_softirq+0x114/0x2c0
<4>[ 330.764427] [<ffffffff8109eb38>] ? handle_edge_irq+0x78/0x160
<4>[ 330.764447] [<ffffffff8100d25c>] call_softirq+0x1c/0x30
<4>[ 330.764465] [<ffffffff8100f1fd>] do_softirq+0x7d/0xb0
<4>[ 330.764483] [<ffffffff81055125>] irq_exit+0x95/0xa0
<4>[ 330.764500] [<ffffffff8140bddc>] do_IRQ+0x7c/0xf0
<4>[ 330.764519] [<ffffffff8100ca13>] ret_from_intr+0x0/0xf
<4>[ 330.764534] <EOI> [<ffffffff812a1d5b>] ? acpi_idle_enter_bm+0x275/0x2aa
<4>[ 330.764562] [<ffffffff812a1d65>] ? acpi_idle_enter_bm+0x27f/0x2aa
<4>[ 330.764583] [<ffffffff812a1d5b>] ? acpi_idle_enter_bm+0x275/0x2aa
<4>[ 330.764605] [<ffffffff8133227f>] ? cpuidle_idle_call+0x9f/0x160
<4>[ 330.764625] [<ffffffff8100b1df>] ? cpu_idle+0xaf/0x110
<4>[ 330.764643] [<ffffffff813f534a>] ? rest_init+0x7a/0x80
<4>[ 330.764662] [<ffffffff81640d01>] ? start_kernel+0x3ac/0x3b8
<4>[ 330.764682] [<ffffffff81640315>] ? x86_64_start_reservations+0x125/0x129
<4>[ 330.764703] [<ffffffff816403fd>] ? x86_64_start_kernel+0xe4/0xeb



Note that these days I see the 'Microcode SW error detected' somewhat more often that I used to.


This is the combined patch I used:


diff --git a/drivers/net/wireless/iwlwifi/iwl-dev.h b/drivers/net/wireless/iwlwifi/iwl-dev.h
index 2673e9a..8efc469 100644
--- a/drivers/net/wireless/iwlwifi/iwl-dev.h
+++ b/drivers/net/wireless/iwlwifi/iwl-dev.h
@@ -711,7 +711,13 @@ extern void iwl_txq_ctx_stop(struct iwl_priv *priv);
extern int iwl_queue_space(const struct iwl_queue *q);
static inline int iwl_queue_used(const struct iwl_queue *q, int i)
{
- return q->write_ptr > q->read_ptr ?
+ if (q->write_ptr == q->read_ptr) {
+ printk("FIRMWARE BUG: index %d is given while read_ptr is %d\n",
+ i, q->read_ptr);
+ BUG();
+ }
+
+ return q->write_ptr >= q->read_ptr ?
(i >= q->read_ptr && i < q->write_ptr) :
!(i < q->read_ptr && i >= q->write_ptr);
}
diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c b/drivers/net/wireless/iwlwifi/iwl3945-base.c
index 2a28a1f..a36de73 100644
--- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
+++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
@@ -548,6 +548,9 @@ static int iwl3945_tx_skb(struct iwl_priv *priv, struct sk_buff *skb)
txq = &priv->txq[txq_id];
q = &txq->q;

+ if ((iwl_queue_space(q) < q->high_mark))
+ goto drop;
+
spin_lock_irqsave(&priv->lock, flags);

idx = get_cmd_index(q, q->write_ptr, 0);


Best regards,
Maxim Levitsky


2010-01-06 05:23:47

by Cahill, Ben M

[permalink] [raw]
Subject: RE: [ipw3945-devel] Panic in iwl3945 driver



>-----Original Message-----
>From: Zhu, Yi
>Sent: Tuesday, January 05, 2010 10:57 PM
>To: Cahill, Ben M
>Cc: Maxim Levitsky; Chatre, Reinette; linux-wireless; iwlwifi
>maling list
>Subject: RE: [ipw3945-devel] Panic in iwl3945 driver
>
>On Wed, 2010-01-06 at 06:26 +0800, Cahill, Ben M wrote:
>> >><3>[ 330.762510] iwl3945 0000:06:00.0: 1035963222
>> >>0x04dc0018 0401
>> >
>> >Last good command 0x18, with (driver provided) header
>> >indicating sent via queue 4, queue index 0xdc
>> >
>> >><3>[ 330.762547] iwl3945 0000:06:00.0: 1035963229
>> >>0x00000000 0401
>> >
>> >This shows command "0" (non-existant/unsupported/bad command)
>> >with (driver provided) header indicating sent via queue "0",
>> >index "0"; obviously bad data in command queue.
>>
>> Or index ("pointer") is getting out of sync. (Reminder to
>self; never
>> use the word "obviously").
>
>Ben, does it smell like driver issue or firmware bug?

I would tend to suspect driver or platform issue, but that's just a guess of course. Something seems to be telling the device to read DRAM that contains "0". Do we reset unused command buffers to 0? Just wondering if the index/pointers are getting out of sync somehow (double-increment or who-knows-what?).

I haven't followed this thread closely (plus I've been on vacation!); can you give me a summary of what you're trying to fix?

Also, not sure if it's relevant, but a community member discovered that MSI should be enabled in kernel config. See bugzilla 1944.

-- Ben --

>
>Thanks,
>-yi
>
>