Hi,
I think I'm hitting a rare bug in rt2x00lib_txdone(). Usually I can't
capture it since a second or third bug hits immediately after and
everything wraps off the screen. I'm fairly certain the same bug is
hitting on rc8, but I only got the oops in my logs with an rc7 kernel
including the latest net tree merge. Reproducing the bug is hit or miss
and I don't know a good way to trigger it.
I have an rt73usb device on an x86_64 machine, lsusb shows:
Bus 001 Device 004: ID 050d:705a Belkin Components F5D7050 Wireless G
Adapter v3000 [Ralink RT2573]
This is all of the oops that I could recover.
[ 9085.714105] BUG: unable to handle kernel NULL pointer dereference at
00000000000000a4
[ 9085.714816] IP: [<ffffffffa0025458>] rt2x00lib_txdone+0x36/0x249
[rt2x00lib]
[ 9085.715017] PGD 215fd067 PUD 292f4067 PMD 0
[ 9085.715017] Oops: 0000 [#1] SMP
[ 9085.715017] last sysfs file:
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
[ 9085.715017] CPU 1
[ 9085.715017] Modules linked in: usb_storage thermal snd_seq_oss
snd_seq_midi snd_seq_dummy snd_pcm_oss snd_mixer_oss snd_hrtimer
snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_event
snd_seq_midi_emul snd_seq scsi_wait_scan powernow_k8 mperf i2c_i801 fuse
fan snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_pcm
snd_seq_device snd_timer snd_page_alloc snd_util_mem rt73usb crc_itu_t
rt2x00usb snd_hwdep snd processor r8169 via82cxxx rt2x00lib soundcore
mii button k8temp
[ 9085.715017]
[ 9085.715017] Pid: 11513, comm: kworker/1:0 Not tainted 2.6.37-rc7+ #27
MS-7094/MS-7094
[ 9085.715017] RIP: 0010:[<ffffffffa0025458>] [<ffffffffa0025458>]
rt2x00lib_txdone+0x36/0x249 [rt2x00lib]
[ 9085.715017] RSP: 0000:ffff880000057ca0 EFLAGS: 00010286
[ 9085.715017] RAX: 0000000000000030 RBX: ffff88003b64e3c0 RCX:
ffff880000057ca0
[ 9085.715017] RDX: 0000000000000006 RSI: ffff880000057d00 RDI:
0000000000000000
[ 9085.715017] RBP: ffff880000057cf0 R08: ffff88003c7c7110 R09:
0000000000000001
[ 9085.715017] R10: ffffffff81df3c10 R11: 0000000000000282 R12:
ffff88003c586280
[ 9085.715017] R13: 0000000000000000 R14: 0000000000000028 R15:
ffff880000057d00
[ 9085.715017] FS: 00002b41e037b160(0000) GS:ffff88003f
I think the entry or skb in the entry is NULL, but I'm not sure how
that's possible. Here's an objdump of the erroring code if that helps.
0000000000000422 <rt2x00lib_txdone>:
422: 55 push %rbp
423: 48 89 e5 mov %rsp,%rbp
426: 41 57 push %r15
428: 41 56 push %r14
42a: 41 55 push %r13
42c: 41 54 push %r12
42e: 53 push %rbx
42f: 48 83 ec 28 sub $0x28,%rsp
433: e8 00 00 00 00 callq 438 <rt2x00lib_txdone+0x16>
438: 4c 8b 6f 10 mov 0x10(%rdi),%r13
43c: 48 8b 47 08 mov 0x8(%rdi),%rax
440: 49 89 fc mov %rdi,%r12
443: 49 89 f7 mov %rsi,%r15
446: 48 8b 18 mov (%rax),%rbx
449: 49 8d 45 30 lea 0x30(%r13),%rax
44d: 4c 89 ef mov %r13,%rdi
450: 4d 8d 75 28 lea 0x28(%r13),%r14
454: 48 89 45 c8 mov %rax,-0x38(%rbp)
458: 41 8b 95 a4 00 00 00 mov 0xa4(%r13),%edx <--- here
45f: 66 89 55 c2 mov %dx,-0x3e(%rbp)
463: e8 00 00 00 00 callq 468 <rt2x00lib_txdone+0x46>
468: 89 45 c4 mov %eax,-0x3c(%rbp)
46b: 41 8a 45 30 mov 0x30(%r13),%al
Am Samstag, 15. Januar 2011 schrieb Ivo Van Doorn:
> Hi,
>
> > Just a shot in the dark but since the stack trace shows the newly added
> > watchdog this might be the result of a race between a regular txdone work
> > (mac80211 workqueue) vs the watchdog work (global workqueue).
> >
> > I guess the following situation could happen:
> > A regular tx done work calls rt2x00lib_txdone which first sets entry->skb to
> > NULL, calls the driver specific clear_entry and afterwards increases
> > Q_INDEX_DONE. If the watchdog work calls rt2x00lib_txdone on a different CPU
> > inbetween the skb might be NULL and cause the above oops.
>
> This could be, would be interesting to know if compat-wireless also shows
> this problem. Because the queue refactoring code which should have solved
> these race conditions was added after 2.6.37.
I also guess that this issue would be fixed in compat-wireless due to the queue
refactoring. But I guess that is way too big for a stable kernel :(
Helmut
On 01.01.2011, Stephen Boyd wrote:
>On 01/01/11 02:28, Heinz Diehl wrote:
>> On 31.12.2010, Stephen Boyd wrote:
>>
>>> [ 9085.714105] BUG: unable to handle kernel NULL pointer dereference at
>>> 00000000000000a4
>>> [ 9085.714816] IP: [<ffffffffa0025458>] rt2x00lib_txdone+0x36/0x249
>>> [rt2x00lib]
>>> [ 9085.715017] PGD 215fd067 PUD 292f4067 PMD 0
>>> [ 9085.715017] Oops: 0000 [#1] SMP
>>> [ 9085.715017] last sysfs file:
>>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
>>> [ 9085.715017] CPU 1
>>> [ 9085.715017] Modules linked in: usb_storage thermal snd_seq_oss
>>> snd_seq_midi snd_seq_dummy snd_pcm_oss snd_mixer_oss snd_hrtimer
>>> snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_event
>>> snd_seq_midi_emul snd_seq scsi_wait_scan powernow_k8 mperf i2c_i801 fuse
>>> fan snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_pcm
>>> snd_seq_device snd_timer snd_page_alloc snd_util_mem rt73usb crc_itu_t
>>> rt2x00usb snd_hwdep snd processor r8169 via82cxxx rt2x00lib soundcore
>>> mii button k8temp
>>> [ 9085.715017]
>>> [ 9085.715017] Pid: 11513, comm: kworker/1:0 Not tainted 2.6.37-rc7+ #27
>>> MS-7094/MS-7094
>> [....]
>>
>> I'm not quite shure on this, but it reminds me on this bug here:
>> https://bugzilla.kernel.org/show_bug.cgi?id=24892
>>
>
>I believe I had the latest net tree merged into 2.6.37-rc7 in which case
>the patch mentioned in that bugzilla would already be applied.
I can confirm this bug is present in v2.6.37, and not in v2.6.36.
It seems to trigger quite randomly, I think in less than 2-3 hours after
the boot (sometimes in half an hour), and it leaves no trace in my log
files.
As Stephen said, most of the times the screen shows later oopses triggered
by this one, so it is not easy to identify it either.
On 01/01/11 02:28, Heinz Diehl wrote:
> On 31.12.2010, Stephen Boyd wrote:
>
>> [ 9085.714105] BUG: unable to handle kernel NULL pointer dereference at
>> 00000000000000a4
>> [ 9085.714816] IP: [<ffffffffa0025458>] rt2x00lib_txdone+0x36/0x249
>> [rt2x00lib]
>> [ 9085.715017] PGD 215fd067 PUD 292f4067 PMD 0
>> [ 9085.715017] Oops: 0000 [#1] SMP
>> [ 9085.715017] last sysfs file:
>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
>> [ 9085.715017] CPU 1
>> [ 9085.715017] Modules linked in: usb_storage thermal snd_seq_oss
>> snd_seq_midi snd_seq_dummy snd_pcm_oss snd_mixer_oss snd_hrtimer
>> snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_event
>> snd_seq_midi_emul snd_seq scsi_wait_scan powernow_k8 mperf i2c_i801 fuse
>> fan snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_pcm
>> snd_seq_device snd_timer snd_page_alloc snd_util_mem rt73usb crc_itu_t
>> rt2x00usb snd_hwdep snd processor r8169 via82cxxx rt2x00lib soundcore
>> mii button k8temp
>> [ 9085.715017]
>> [ 9085.715017] Pid: 11513, comm: kworker/1:0 Not tainted 2.6.37-rc7+ #27
>> MS-7094/MS-7094
> [....]
>
> I'm not quite shure on this, but it reminds me on this bug here:
> https://bugzilla.kernel.org/show_bug.cgi?id=24892
>
I believe I had the latest net tree merged into 2.6.37-rc7 in which case
the patch mentioned in that bugzilla would already be applied.
Hi,
> Just a shot in the dark but since the stack trace shows the newly added
> watchdog this might be the result of a race between a regular txdone work
> (mac80211 workqueue) vs the watchdog work (global workqueue).
>
> I guess the following situation could happen:
> A regular tx done work calls rt2x00lib_txdone which first sets entry->skb to
> NULL, calls the driver specific clear_entry and afterwards increases
> Q_INDEX_DONE. If the watchdog work calls rt2x00lib_txdone on a different CPU
> inbetween the skb might be NULL and cause the above oops.
This could be, would be interesting to know if compat-wireless also shows
this problem. Because the queue refactoring code which should have solved
these race conditions was added after 2.6.37.
Ivo
Ivo Van Doorn <[email protected]> writes:
> This could be, would be interesting to know if compat-wireless also shows
> this problem. Because the queue refactoring code which should have solved
> these race conditions was added after 2.6.37.
I really would like to give it a try, but compat-wireless-2011-01-15
crashes right on module loading. Is there a version known to work?
Hi,
Am Donnerstag, 13. Januar 2011 schrieb Ingo Brunberg:
> I also suffer from this bug with 2.6.37. The first time the following
> trace made it into my logs. Hopefully it might help.
Thanks for the trace!
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
> IP: [<ffffffffa00983e4>] rt2x00lib_txdone+0x31/0x259 [rt2x00lib]
> PGD a7011067 PUD ab9b2067 PMD 0
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/devices/pci0000:00/0000:00:13.2/usb2/2-3/2-3.4/2-3.4:1.0/firmware/2-3.4:1.0/loading
> CPU 3
> Modules linked in: aes_generic af_packet w83627ehf hwmon_vid ipv6 fbcon font bitblit softcursor dm_mod arc4 ecb crypto_blkcipher cryptomgr aead crypto_algapi rt73usb rt2x00usb rt2x00lib mac80211 cfg80211 usbhid hid radeon snd_hda_codec_realtek ttm r8169 drm_kms_helper sr_mod drm cdrom firewire_ohci snd_hda_intel i2c_piix4 bitrev 8250_pnp processor snd_hda_codec ohci_hcd thermal_sys ehci_hcd usbcore crc32 8250 i2c_algo_bit firewire_core i2c_core sg pata_atiixp crc_itu_t rtc button k10temp evdev hwmon snd_pcm snd_timer cfbcopyarea cfbimgblt snd floppy cfbfillrect serial_core mii nls_base soundcore snd_page_alloc
>
> Pid: 3069, comm: kworker/3:0 Not tainted 2.6.37 #1 M3A785GXH/128M/To Be Filled By O.E.M.
> RIP: 0010:[<ffffffffa00983e4>] [<ffffffffa00983e4>] rt2x00lib_txdone+0x31/0x259 [rt2x00lib]
> RSP: 0018:ffff880094ad3d30 EFLAGS: 00010286
> RAX: 0000000000000030 RBX: ffff88011df79980 RCX: 0000000000000014
> RDX: 0000000000000101 RSI: ffff880094ad3d90 RDI: 0000000000000000
> RBP: ffff88011ec37af8 R08: 0000000000000002 R09: ffffffff00000002
> R10: 0000000000000286 R11: 0000000000000000 R12: 0000000000000000
> R13: 0000000000000028 R14: ffff880094ad3d90 R15: ffff88011df79c10
> FS: 00007fc5bad23710(0000) GS:ffff8800cfd80000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000090 CR3: 00000000ab985000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process kworker/3:0 (pid: 3069, threadinfo ffff880094ad2000, task ffff88011ff08b20)
> Stack:
> ffff88011fc7e420 0000000000011000 0000000000000030 0000000000004000
> ffff88011ec37af8 ffff88011dcb3af0 ffff88011df79980 ffff88011dcb3b40
> ffff88011dcb3b40 0000000000000003 ffff88011df79c10 ffffffffa009862e
> Call Trace:
> [<ffffffffa009862e>] ? rt2x00lib_txdone_noinfo+0x22/0x27 [rt2x00lib]
> [<ffffffffa0016316>] ? rt2x00usb_work_txdone+0x3e/0x6d [rt2x00usb]
> [<ffffffffa0016a0d>] ? rt2x00usb_watchdog+0x69/0xe0 [rt2x00usb]
> [<ffffffffa009aed9>] ? rt2x00link_watchdog+0x0/0x4a [rt2x00lib]
> [<ffffffffa009af00>] ? rt2x00link_watchdog+0x27/0x4a [rt2x00lib]
> [<ffffffff8104256e>] ? process_one_work+0x20e/0x34e
> [<ffffffff81042a45>] ? worker_thread+0x1c9/0x340
> [<ffffffff8102612e>] ? __wake_up_common+0x41/0x78
> [<ffffffff8104287c>] ? worker_thread+0x0/0x340
> [<ffffffff8104287c>] ? worker_thread+0x0/0x340
> [<ffffffff810455a9>] ? kthread+0x7a/0x82
> [<ffffffff81002cd4>] ? kernel_thread_helper+0x4/0x10
> [<ffffffff8104552f>] ? kthread+0x0/0x82
> [<ffffffff81002cd0>] ? kernel_thread_helper+0x0/0x10
> Code: f6 41 55 41 54 55 48 89 fd 53 48 83 ec 28 4c 8b 67 10 48 8b 47 08 48 8b 18 49 8d 44 24 30 4c 89 e7 4d 8d 6c 24 28 48 89 44 24 10 <41> 8b 94 24 90 00 00 00 66 89 54 24 1e e8 1b 16 14 00 48 89 ef
> RIP [<ffffffffa00983e4>] rt2x00lib_txdone+0x31/0x259 [rt2x00lib]
> RSP <ffff880094ad3d30>
> CR2: 0000000000000090
> ---[ end trace 2c6843a38ee68ff0 ]---
Just a shot in the dark but since the stack trace shows the newly added
watchdog this might be the result of a race between a regular txdone work
(mac80211 workqueue) vs the watchdog work (global workqueue).
I guess the following situation could happen:
A regular tx done work calls rt2x00lib_txdone which first sets entry->skb to
NULL, calls the driver specific clear_entry and afterwards increases
Q_INDEX_DONE. If the watchdog work calls rt2x00lib_txdone on a different CPU
inbetween the skb might be NULL and cause the above oops.
Ivo, does that sound reasonable?
Helmut
Hi,
On Sun, Jan 16, 2011 at 3:58 AM, Ingo Brunberg <[email protected]> wrote:
> Ivo Van Doorn <[email protected]> writes:
>
>> This could be, would be interesting to know if compat-wireless also shows
>> this problem. Because the queue refactoring code which should have solved
>> these race conditions was added after 2.6.37.
>
> I really would like to give it a try, but compat-wireless-2011-01-15
> crashes right on module loading. Is there a version known to work?
You could try the rt2x00-special package:
http://kernel.org/pub/linux/kernel/people/ivd/compat-rt2x00.tar.bz2
this is compat-wireless + rt2x00 patches from rt2x00.git.
For me it is working without any crashes..
Ivo
On 31.12.2010, Stephen Boyd wrote:
> [ 9085.714105] BUG: unable to handle kernel NULL pointer dereference at
> 00000000000000a4
> [ 9085.714816] IP: [<ffffffffa0025458>] rt2x00lib_txdone+0x36/0x249
> [rt2x00lib]
> [ 9085.715017] PGD 215fd067 PUD 292f4067 PMD 0
> [ 9085.715017] Oops: 0000 [#1] SMP
> [ 9085.715017] last sysfs file:
> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
> [ 9085.715017] CPU 1
> [ 9085.715017] Modules linked in: usb_storage thermal snd_seq_oss
> snd_seq_midi snd_seq_dummy snd_pcm_oss snd_mixer_oss snd_hrtimer
> snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_event
> snd_seq_midi_emul snd_seq scsi_wait_scan powernow_k8 mperf i2c_i801 fuse
> fan snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_pcm
> snd_seq_device snd_timer snd_page_alloc snd_util_mem rt73usb crc_itu_t
> rt2x00usb snd_hwdep snd processor r8169 via82cxxx rt2x00lib soundcore
> mii button k8temp
> [ 9085.715017]
> [ 9085.715017] Pid: 11513, comm: kworker/1:0 Not tainted 2.6.37-rc7+ #27
> MS-7094/MS-7094
[....]
I'm not quite shure on this, but it reminds me on this bug here:
https://bugzilla.kernel.org/show_bug.cgi?id=24892