2008-07-29 05:56:26

by Dave Young

[permalink] [raw]
Subject: [BUG] wireless : cpu stuck for 61s

While set my wlan0 to ad-hoc mode and then set the essid to something,
this bug triggered.

I suspect it's related to debugfs or mutex? But I'm not sure.

Jul 29 13:01:46 darkstar kernel: [ 8033.736493] BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] Modules linked in: radeonfb fb_ddc i2c_algo_bit snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss rfcomm l2cap bluetooth vfat fuse rtc_cmos rtc_core psmouse intel_agp pl2303 rtc_lib ath5k e100 i2c_i801 serio_raw usbserial agpgart 3c59x mii snd_hda_intel snd_pcm mac80211 snd_timer snd_page_alloc cfg80211 snd_hwdep snd soundcore thermal processor button thermal_sys evdev sg
Jul 29 13:01:46 darkstar kernel: [ 8033.736493]
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP is at mutex_lock_nested+0x1b3/0x280
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR6: ffff0ff0 DR7: 00000400
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02375a6>] ? debugfs_create_file+0x46/0x210
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02375a6>] ? debugfs_create_file+0x46/0x210
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02375a6>] debugfs_create_file+0x46/0x210
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02377f1>] debugfs_create_dir+0x21/0x30
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c01394c6>] run_workqueue+0x156/0x1d0
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c0139472>] ? run_workqueue+0x102/0x1d0
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<f88eba00>] ? sta_info_debugfs_add_work+0x0/0x130 [mac80211]
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c0139f18>] worker_thread+0x88/0xe0
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c013cd20>] ? autoremove_wake_function+0x0/0x40
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c0139e90>] ? worker_thread+0x0/0xe0
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c013ca32>] kthread+0x42/0x70
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c013c9f0>] ? kthread+0x0/0x70
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c0104993>] kernel_thread_helper+0x7/0x14
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] =======================
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] Modules linked in: radeonfb fb_ddc i2c_algo_bit snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss rfcomm l2cap bluetooth vfat fuse rtc_cmos rtc_core psmouse intel_agp pl2303 rtc_lib ath5k e100 i2c_i801 serio_raw usbserial agpgart 3c59x mii snd_hda_intel snd_pcm mac80211 snd_timer snd_page_alloc cfg80211 snd_hwdep snd soundcore thermal processor button thermal_sys evdev sg
Jul 29 13:02:51 darkstar kernel: [ 8099.234898]
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] EIP is at mutex_lock_nested+0x1b3/0x280
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] DR6: ffff0ff0 DR7: 00000400
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c02375a6>] ? debugfs_create_file+0x46/0x210
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c02375a6>] ? debugfs_create_file+0x46/0x210
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c02375a6>] debugfs_create_file+0x46/0x210
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c02377f1>] debugfs_create_dir+0x21/0x30
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c01394c6>] run_workqueue+0x156/0x1d0
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c0139472>] ? run_workqueue+0x102/0x1d0
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<f88eba00>] ? sta_info_debugfs_add_work+0x0/0x130 [mac80211]
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c0139f18>] worker_thread+0x88/0xe0
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c013cd20>] ? autoremove_wake_function+0x0/0x40
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c0139e90>] ? worker_thread+0x0/0xe0
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c013ca32>] kthread+0x42/0x70
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c013c9f0>] ? kthread+0x0/0x70
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c0104993>] kernel_thread_helper+0x7/0x14
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] =======================


2008-07-31 02:06:35

by Dave Young

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Wed, Jul 30, 2008 at 6:10 PM, Andrew Morton
<[email protected]> wrote:
> On Wed, 30 Jul 2008 18:02:03 +0800 "Dave Young" <[email protected]> wrote:
>
>> On Wed, Jul 30, 2008 at 5:08 PM, Andrew Morton
>> <[email protected]> wrote:
>> > On Tue, 29 Jul 2008 14:32:04 +0200 Johannes Berg <[email protected]> wrote:
>> >
>> >> On Tue, 2008-07-29 at 13:57 +0800, Dave Young wrote:
>> >> > While set my wlan0 to ad-hoc mode and then set the essid to something,
>> >> > this bug triggered.
>> >> >
>> >> > I suspect it's related to debugfs or mutex? But I'm not sure.
>> >> >
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493]
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP is at mutex_lock_nested+0x1b3/0x280
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR6: ffff0ff0 DR7: 00000400
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02375a6>] ? debugfs_create_file+0x46/0x210
>> >>
>> >> It looks like it's getting stuck on
>> >> mutex_lock(&parent->d_inode->i_mutex);
>> >>
>> >> inside debugfs_create_by_name, but I have no idea why that would happen.
>> >>
>> >
>> > Dave, please try sysrq-W or sysrq-T during that 60 seconds, see if we
>> > can find another task which is holding onto that lock.
>> >
>> >
>>
>> Andrew, the keyboard doesn't response after the bug triggered.
>>
>> So I just keep sysrq-W / sysrq-T every several seconds after I set the
>> wlan0 interface.
>> Attached please see the long log file.
>
> argh. Look:
>
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] f703fbf8 00000082 00000002 f703fbe4 f703fbd8 00000000 f704ac40 f703fbcc
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] 00000046 c0669060 c066c580 c066c580 c066c580 f703fbe8 f704ad94 c1eaf580
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] 00000001 f704ac40 00000000 f702ca00 c0452697 ffff55ba 00000000 c014b06d
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] Call Trace:
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0452697>] ? _spin_unlock_irqrestore+0x47/0x60
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014b06d>] ? trace_hardirqs_on+0xbd/0x140
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0450265>] schedule_timeout+0x75/0xc0
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c01942c0>] ? __pollwait+0x60/0xd0
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<f888b02e>] ? evdev_poll+0x2e/0x60 [evdev]
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0193564>] do_sys_poll+0x264/0x3a0
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0194260>] ? __pollwait+0x0/0xd0
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0122160>] ? default_wake_function+0x0/0x10
> Jul 30 17:34:22 darkstar last message repeated 3 times
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c01220c5>] ? try_to_wake_up+0x75/0x110
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014aed0>] ? mark_held_locks+0x40/0x80
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014b77b>] ? __lock_acquire+0x24b/0x1040
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014b77b>] ? __lock_acquire+0x24b/0x1040
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c03da639>] ? sock_def_readable+0x59/0x80
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014b77b>] ? __lock_acquire+0x24b/0x1040
> Jul 30 17:34:22 darkstar last message repeated 5 times
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0109d0f>] ? native_sched_clock+0x7f/0xb0
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0243b4f>] ? security_file_permission+0xf/0x20
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0186387>] ? rw_verify_area+0x57/0xc0
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0451f5d>] ? _spin_unlock+0x1d/0x20
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0186b94>] ? vfs_read+0x94/0x130
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<f888be60>] ? evdev_read+0x0/0x1b0 [evdev]
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0103dbf>] ? restore_nocheck+0x12/0x15
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c01936cd>] sys_poll+0x2d/0x70
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0103d5e>] syscall_call+0x7/0xb
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] =======================
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] f63efd94 00000092 00000002 00000000 00000000 f5064028 f7049d80 00000001
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] 00000000 c0669060 c066c580 c066c580 c066c580 00000000 f7049ed4 c1e5d580
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] 00000000 00000000 00000135 f702e080 f7049d80 00000003 00000000 f7049d80
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] Call Trace:
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0450265>] schedule_timeout+0x75/0xc0
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0436b98>] unix_stream_recvmsg+0x2a8/0x4e0
>
>
> We're missing part of the sysrq traces. I noticed this a while back.
> We need this info to find the D-state processes.
>
> It's _possible_ that the kernel is actually trying to print this info,
> but it's doing it at a different prink facility level from the stack
> output, so we're only seeing part of the printing.
>
> Please try changing your logging level (dmesg -n 0, or dmesg -n 7 - I
> always forget which). Then retry the sysrq-T. If that successfully
> causes _all_ the output to be presented, then please rerun the above
> test.
>
> You might want to compress the file too - I don't know whether the vger
> servers will accept a 730kB mail.
>
>

Ok here it is.
BTW, I run "klogd -c 7" after boot

This time I get a kmalloc poison overwritten:

Jul 31 09:51:11 darkstar kernel: [ 2024.960036] BUG kmalloc-4096:
Poison overwritten
Jul 31 09:51:11 darkstar kernel: [ 2024.960036]
-----------------------------------------------------------------------------
Jul 31 09:51:11 darkstar kernel: [ 2024.960036]
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] INFO:
0xf6f3a080-0xf6f3a0ef. First byte 0x80 instead of 0x6b
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] INFO: Allocated in
dev_alloc_skb+0x1c/0x30 age=3642 cpu=0 pid=0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] INFO: Freed in
skb_release_data+0x57/0x80 age=3146 cpu=0 pid=2398
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] INFO: Slab 0xc1c05440
objects=7 used=3 fp=0xf6f3a060 flags=0x400020c3
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] INFO: Object
0xf6f3a060 @offset=8288 fp=0xf6f39030
Jul 31 09:51:11 darkstar kernel: [ 2024.960036]
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Bytes b4 0xf6f3a050:
5e 09 00 00 57 c9 05 00 5a 5a 5a 5a 5a 5a 5a 5a ^...WÉ..ZZZZZZZZ
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Object 0xf6f3a060:
6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Object 0xf6f3a070:
6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Object 0xf6f3a080:
80 00 00 00 ff ff ff ff ff ff 00 17 7b 00 46 40 ....ÿÿÿÿÿÿ..{.F@
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Object 0xf6f3a090:
00 17 7b 00 46 40 30 09 81 21 08 7a 21 00 00 00 ..{.F@0..!.z!...
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Object 0xf6f3a0a0:
64 00 21 04 00 07 00 00 00 00 00 00 00 01 08 82 d.!.............
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Object 0xf6f3a0b0:
84 8b 0c 12 96 18 24 03 01 01 05 04 00 02 00 00 ......$.........
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Object 0xf6f3a0c0:
07 06 43 4e 20 01 0d 14 2a 01 00 32 04 30 48 60 ..CN....*..2.0H`
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Object 0xf6f3a0d0:
6c dd 18 00 17 7b 01 04 00 00 00 01 00 00 00 10 lÝ...{..........
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Redzone 0xf6f3b060:
bb bb bb bb »»»»
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Padding 0xf6f3b088:
5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Pid: 0, comm: swapper
Tainted: G W 2.6.26-smp #2
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c0180f5d>]
print_trailer+0xad/0xf0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c018103b>]
check_bytes_and_report+0x9b/0xc0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c018145e>]
check_object+0x19e/0x1e0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c01821a4>]
__slab_alloc+0x454/0x4f0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c01834d6>]
__kmalloc_track_caller+0xe6/0xf0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c03dd1ec>] ?
dev_alloc_skb+0x1c/0x30
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c03dd1ec>] ?
dev_alloc_skb+0x1c/0x30
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c03dce79>]
__alloc_skb+0x49/0x100
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c03dd1ec>]
dev_alloc_skb+0x1c/0x30
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<f8a58599>]
ath5k_rxbuf_setup+0x39/0x200 [ath5k]
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<f8a5a697>]
ath5k_tasklet_rx+0x127/0x5c0 [ath5k]
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c014969a>] ?
print_lock_contention_bug+0x1a/0xe0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c012eafc>]
tasklet_action+0x4c/0xc0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c012e463>]
__do_softirq+0x93/0x120
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c012e547>]
do_softirq+0x57/0x60
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c012ea29>] irq_exit+0x69/0x80
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c0106b55>] do_IRQ+0x45/0x80
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c010a5d0>] ?
mwait_idle+0x0/0x50
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c0104752>]
common_interrupt+0x2e/0x34
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c010a5d0>] ?
mwait_idle+0x0/0x50
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c010a609>] ?
mwait_idle+0x39/0x50
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c01026e0>] cpu_idle+0x60/0xd0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c043c8ce>]
rest_init+0x4e/0x60
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] =======================


The syslog detail please see the attachment.

--
Regards
dave


Attachments:
(No filename) (10.86 kB)
dmesg.gz (17.52 kB)
Download all attachments

2008-07-30 09:10:20

by Andrew Morton

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Tue, 29 Jul 2008 14:32:04 +0200 Johannes Berg <[email protected]> wrote:

> On Tue, 2008-07-29 at 13:57 +0800, Dave Young wrote:
> > While set my wlan0 to ad-hoc mode and then set the essid to something,
> > this bug triggered.
> >
> > I suspect it's related to debugfs or mutex? But I'm not sure.
> >
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493]
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP is at mutex_lock_nested+0x1b3/0x280
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR6: ffff0ff0 DR7: 00000400
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02375a6>] ? debugfs_create_file+0x46/0x210
>
> It looks like it's getting stuck on
> mutex_lock(&parent->d_inode->i_mutex);
>
> inside debugfs_create_by_name, but I have no idea why that would happen.
>

Dave, please try sysrq-W or sysrq-T during that 60 seconds, see if we
can find another task which is holding onto that lock.


2008-07-31 02:57:21

by Andrew Morton

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Thu, 31 Jul 2008 10:06:31 +0800 "Dave Young" <hidave.darkstar@gmail.=
com> wrote:

>=20
> Ok here it is.
> BTW, I run "klogd -c 7" after boot

The sysrq output is still missing lots of stuff. I guess we broke it.

>=20
> This time I get a kmalloc poison overwritten:
>

<fixes wordwrapping, cleans stuff up>

=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D
BUG kmalloc-4096: Poison overwritten
-----------------------------------------------------------------------=
------

INFO: 0xf6f3a080-0xf6f3a0ef. First byte 0x80 instead of 0x6b
INFO: Allocated in dev_alloc_skb+0x1c/0x30 age=3D3642 cpu=3D0 pid=3D0
INFO: Freed in skb_release_data+0x57/0x80 age=3D3146 cpu=3D0 pid=3D2398
INFO: Slab 0xc1c05440 objects=3D7 used=3D3 fp=3D0xf6f3a060 flags=3D0x40=
0020c3
INFO: Object 0xf6f3a060 @offset=3D8288 fp=3D0xf6f39030

Bytes b4 0xf6f3a050: 5e 09 00 00 57 c9 05 00 5a 5a 5a 5a 5a 5a 5a 5a ^=
=2E..W=C9..ZZZZZZZZ
Object 0xf6f3a060: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkk=
kkkkkkkkkkkkk
Object 0xf6f3a070: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkk=
kkkkkkkkkkkkk
Object 0xf6f3a080: 80 00 00 00 ff ff ff ff ff ff 00 17 7b 00 46 40 ...=
=2E=FF=FF=FF=FF=FF=FF..{.F@
Object 0xf6f3a090: 00 17 7b 00 46 40 30 09 81 21 08 7a 21 00 00 00 ..{=
=2EF@0..!.z!...
Object 0xf6f3a0a0: 64 00 21 04 00 07 00 00 00 00 00 00 00 01 08 82 d.!=
=2E............
Object 0xf6f3a0b0: 84 8b 0c 12 96 18 24 03 01 01 05 04 00 02 00 00 ...=
=2E..$.........
Object 0xf6f3a0c0: 07 06 43 4e 20 01 0d 14 2a 01 00 32 04 30 48 60 ..C=
N....*..2.0H`
Object 0xf6f3a0d0: 6c dd 18 00 17 7b 01 04 00 00 00 01 00 00 00 10 l=DD=
=2E..{..........
Redzone 0xf6f3b060: bb bb bb bb =BB=
=BB=BB=BB =20
Padding 0xf6f3b088: 5a 5a 5a 5a 5a 5a 5a 5a ZZ=
ZZZZZZ =20
Pid: 0, comm: swapper Tainted: G W 2.6.26-smp #2
[<c0180f5d>] print_trailer+0xad/0xf0
[<c018103b>] check_bytes_and_report+0x9b/0xc0
[<c018145e>] check_object+0x19e/0x1e0
[<c01821a4>] __slab_alloc+0x454/0x4f0
[<c01834d6>] __kmalloc_track_caller+0xe6/0xf0
[<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30
[<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30
[<c03dce79>] __alloc_skb+0x49/0x100
[<c03dd1ec>] dev_alloc_skb+0x1c/0x30
[<f8a58599>] ath5k_rxbuf_setup+0x39/0x200 [ath5k]
[<f8a5a697>] ath5k_tasklet_rx+0x127/0x5c0 [ath5k]
[<c014969a>] ? print_lock_contention_bug+0x1a/0xe0
[<c012eafc>] tasklet_action+0x4c/0xc0
[<c012e463>] __do_softirq+0x93/0x120
[<c012e547>] do_softirq+0x57/0x60
[<c012ea29>] irq_exit+0x69/0x80
[<c0106b55>] do_IRQ+0x45/0x80
[<c010a5d0>] ? mwait_idle+0x0/0x50
[<c0104752>] common_interrupt+0x2e/0x34
[<c010a5d0>] ? mwait_idle+0x0/0x50
[<c010a609>] ? mwait_idle+0x39/0x50
[<c01026e0>] cpu_idle+0x60/0xd0
[<c043c8ce>] rest_init+0x4e/0x60
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
=46IX kmalloc-4096: Restoring 0xf6f3a080-0xf6f3a0ef=3D0x6b

=46IX kmalloc-4096: Marking all objects used
[<c0243b4f>] ? security_file_permission+0xf/0x20
[<c019436f>] sys_select+0x3f/0x190
[<c01878e9>] ? fput+0x19/0x20
[<c0103dbf>] ? restore_nocheck+0x12/0x15
[<c014b06d>] ? trace_hardirqs_on+0xbd/0x140
[<c0103d5e>] syscall_call+0x7/0xb
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D

argh, that stuff hurts my brain. None of the numbers seem to make any
sense for a 4k allocation :( Pekka, do you have time to decrypt this?

Dave, could you please remind us which net driver was in use here?

2008-07-31 10:29:13

by Tomas Winkler

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Thu, Jul 31, 2008 at 12:53 PM, Pekka Enberg <[email protected]> wrote:
> On Thu, 2008-07-31 at 12:50 +0300, Tomas Winkler wrote:
>> > There's ath5k in the stack trace but that, of course, doesn't
>> > automatically mean it's at fault here. It could have been just the poor
>> > bastard who was the next to allocate 4 KB with kmalloc() noticing the
>> > corruption.
>> >
>> > Hope this helps!
>>
>> I've seen something similar with fragmentation code in iwl4965 but I
>> can reproduce it only when using SLAB. With SLUB it didn't shown up.
>> Does anyone know what is difference between SLAB and SLUB in this context.?
>
> Do you have CONFIG_SLUB_DEBUG_ON enabled? (Note: if you only have
> CONFIG_SLUB_DEBUG enabled, then you need to enable debugging at run-time
> by passing slub_debug to kernel command line.)

Yes sure I've done that with slub_debug=FZPU
Tomas

2008-07-30 10:02:09

by Dave Young

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Wed, Jul 30, 2008 at 5:08 PM, Andrew Morton
<[email protected]> wrote:
> On Tue, 29 Jul 2008 14:32:04 +0200 Johannes Berg <[email protected]> wrote:
>
>> On Tue, 2008-07-29 at 13:57 +0800, Dave Young wrote:
>> > While set my wlan0 to ad-hoc mode and then set the essid to something,
>> > this bug triggered.
>> >
>> > I suspect it's related to debugfs or mutex? But I'm not sure.
>> >
>> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
>> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493]
>> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
>> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
>> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP is at mutex_lock_nested+0x1b3/0x280
>> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
>> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
>> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
>> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
>> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
>> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR6: ffff0ff0 DR7: 00000400
>> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02375a6>] ? debugfs_create_file+0x46/0x210
>>
>> It looks like it's getting stuck on
>> mutex_lock(&parent->d_inode->i_mutex);
>>
>> inside debugfs_create_by_name, but I have no idea why that would happen.
>>
>
> Dave, please try sysrq-W or sysrq-T during that 60 seconds, see if we
> can find another task which is holding onto that lock.
>
>

Andrew, the keyboard doesn't response after the bug triggered.

So I just keep sysrq-W / sysrq-T every several seconds after I set the
wlan0 interface.
Attached please see the long log file.

--
Regards
dave


Attachments:
(No filename) (2.05 kB)
syslog.txt (530.48 kB)
Download all attachments

2008-07-31 09:15:32

by Pekka Enberg

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

Hi Andrew,

On Wed, 30 Jul 2008, Andrew Morton wrote:
> > Ok here it is.
> > BTW, I run "klogd -c 7" after boot
>
> The sysrq output is still missing lots of stuff. I guess we broke it.
>
> >
> > This time I get a kmalloc poison overwritten:
> >
>
> argh, that stuff hurts my brain. None of the numbers seem to make any
> sense for a 4k allocation :( Pekka, do you have time to decrypt this?

Sure. Here goes:

On Wed, 30 Jul 2008, Andrew Morton wrote:
> <fixes wordwrapping, cleans stuff up>
>
> =============================================================================
> BUG kmalloc-4096: Poison overwritten
> -----------------------------------------------------------------------------
>
> INFO: 0xf6f3a080-0xf6f3a0ef. First byte 0x80 instead of 0x6b

That's POISON_FREE ("0x6b") overwritten which means use-after-free for
the range of 0xf6f3a080 - 0xf6f3a0ef (112 bytes). The rest of the
object is okay but the SLUB debugging code only dumps the first 128 bytes
of the object which is why we don't see the full corruption.

2.6.27-rc1 should dump the full object so I'm assuming this is pre -rc1?

> INFO: Allocated in dev_alloc_skb+0x1c/0x30 age=3642 cpu=0 pid=0
> INFO: Freed in skb_release_data+0x57/0x80 age=3146 cpu=0 pid=2398

So the corrupted object was free'd by skb_release_data() so we need to
look for a driver or the networking stack calling that function too early.

> INFO: Slab 0xc1c05440 objects=7 used=3 fp=0xf6f3a060 flags=0x400020c3
> INFO: Object 0xf6f3a060 @offset=8288 fp=0xf6f39030
>
> Bytes b4 0xf6f3a050: 5e 09 00 00 57 c9 05 00 5a 5a 5a 5a 5a 5a 5a 5a ^...W?..ZZZZZZZZ

The object starts here (the poison values for first 32 bytes are okay):

> Object 0xf6f3a060: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
> Object 0xf6f3a070: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk

And here are the first 96 bytes of the corruption:

> Object 0xf6f3a080: 80 00 00 00 ff ff ff ff ff ff 00 17 7b 00 46 40 ....??????..{.F@
> Object 0xf6f3a090: 00 17 7b 00 46 40 30 09 81 21 08 7a 21 00 00 00 ..{.F@0..!.z!...
> Object 0xf6f3a0a0: 64 00 21 04 00 07 00 00 00 00 00 00 00 01 08 82 d.!.............
> Object 0xf6f3a0b0: 84 8b 0c 12 96 18 24 03 01 01 05 04 00 02 00 00 ......$.........
> Object 0xf6f3a0c0: 07 06 43 4e 20 01 0d 14 2a 01 00 32 04 30 48 60 ..CN....*..2.0H`
> Object 0xf6f3a0d0: 6c dd 18 00 17 7b 01 04 00 00 00 01 00 00 00 10 l?...{..........

But I think that's just the payload of a SKB?

> Redzone 0xf6f3b060: bb bb bb bb ????

The red-zone has SLUB_RED_INACTIVE ("0xbb") which reinforces
use-after-free.

> Padding 0xf6f3b088: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
> Pid: 0, comm: swapper Tainted: G W 2.6.26-smp #2
> [<c0180f5d>] print_trailer+0xad/0xf0
> [<c018103b>] check_bytes_and_report+0x9b/0xc0
> [<c018145e>] check_object+0x19e/0x1e0
> [<c01821a4>] __slab_alloc+0x454/0x4f0
> [<c01834d6>] __kmalloc_track_caller+0xe6/0xf0
> [<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30
> [<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30
> [<c03dce79>] __alloc_skb+0x49/0x100
> [<c03dd1ec>] dev_alloc_skb+0x1c/0x30
> [<f8a58599>] ath5k_rxbuf_setup+0x39/0x200 [ath5k]
> [<f8a5a697>] ath5k_tasklet_rx+0x127/0x5c0 [ath5k]
> [<c014969a>] ? print_lock_contention_bug+0x1a/0xe0
> [<c012eafc>] tasklet_action+0x4c/0xc0
> [<c012e463>] __do_softirq+0x93/0x120
> [<c012e547>] do_softirq+0x57/0x60
> [<c012ea29>] irq_exit+0x69/0x80
> [<c0106b55>] do_IRQ+0x45/0x80
> [<c010a5d0>] ? mwait_idle+0x0/0x50
> [<c0104752>] common_interrupt+0x2e/0x34
> [<c010a5d0>] ? mwait_idle+0x0/0x50
> [<c010a609>] ? mwait_idle+0x39/0x50
> [<c01026e0>] cpu_idle+0x60/0xd0
> [<c043c8ce>] rest_init+0x4e/0x60
> =======================
> FIX kmalloc-4096: Restoring 0xf6f3a080-0xf6f3a0ef=0x6b
>
> FIX kmalloc-4096: Marking all objects used
> [<c0243b4f>] ? security_file_permission+0xf/0x20
> [<c019436f>] sys_select+0x3f/0x190
> [<c01878e9>] ? fput+0x19/0x20
> [<c0103dbf>] ? restore_nocheck+0x12/0x15
> [<c014b06d>] ? trace_hardirqs_on+0xbd/0x140
> [<c0103d5e>] syscall_call+0x7/0xb
> =======================
>
> Dave, could you please remind us which net driver was in use here?

There's ath5k in the stack trace but that, of course, doesn't
automatically mean it's at fault here. It could have been just the poor
bastard who was the next to allocate 4 KB with kmalloc() noticing the
corruption.

Hope this helps!

Pekka

2008-07-31 03:01:07

by Dave Young

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

T24gVGh1LCBKdWwgMzEsIDIwMDggYXQgMTA6NTYgQU0sIEFuZHJldyBNb3J0b24KPGFrcG1AbGlu
dXgtZm91bmRhdGlvbi5vcmc+IHdyb3RlOgo+IE9uIFRodSwgMzEgSnVsIDIwMDggMTA6MDY6MzEg
KzA4MDAgIkRhdmUgWW91bmciIDxoaWRhdmUuZGFya3N0YXJAZ21haWwuY29tPiB3cm90ZToKPgo+
Pgo+PiBPayBoZXJlIGl0IGlzLgo+PiBCVFcsIEkgcnVuICJrbG9nZCAtYyA3IiBhZnRlciBib290
Cj4KPiBUaGUgc3lzcnEgb3V0cHV0IGlzIHN0aWxsIG1pc3NpbmcgbG90cyBvZiBzdHVmZi4gIEkg
Z3Vlc3Mgd2UgYnJva2UgaXQuCj4KPj4KPj4gVGhpcyB0aW1lIEkgZ2V0IGEga21hbGxvYyBwb2lz
b24gb3ZlcndyaXR0ZW46Cj4+Cj4KPiA8Zml4ZXMgd29yZHdyYXBwaW5nLCBjbGVhbnMgc3R1ZmYg
dXA+CgpUaGFua3MsIHNvcnJ5IGZvciB0aGF0IGNhdXNlZCBieSB3ZWJtYWlsLgoKPgo+ID09PT09
PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09
PT09PT09PT09PT09PT09Cj4gQlVHIGttYWxsb2MtNDA5NjogUG9pc29uIG92ZXJ3cml0dGVuCj4g
LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t
LS0tLS0tLS0tLS0tLS0tLS0tLS0KPgo+IElORk86IDB4ZjZmM2EwODAtMHhmNmYzYTBlZi4gRmly
c3QgYnl0ZSAweDgwIGluc3RlYWQgb2YgMHg2Ygo+IElORk86IEFsbG9jYXRlZCBpbiBkZXZfYWxs
b2Nfc2tiKzB4MWMvMHgzMCBhZ2U9MzY0MiBjcHU9MCBwaWQ9MAo+IElORk86IEZyZWVkIGluIHNr
Yl9yZWxlYXNlX2RhdGErMHg1Ny8weDgwIGFnZT0zMTQ2IGNwdT0wIHBpZD0yMzk4Cj4gSU5GTzog
U2xhYiAweGMxYzA1NDQwIG9iamVjdHM9NyB1c2VkPTMgZnA9MHhmNmYzYTA2MCBmbGFncz0weDQw
MDAyMGMzCj4gSU5GTzogT2JqZWN0IDB4ZjZmM2EwNjAgQG9mZnNldD04Mjg4IGZwPTB4ZjZmMzkw
MzAKPgo+IEJ5dGVzIGI0IDB4ZjZmM2EwNTA6ICA1ZSAwOSAwMCAwMCA1NyBjOSAwNSAwMCA1YSA1
YSA1YSA1YSA1YSA1YSA1YSA1YSBeLi4uV8OJLi5aWlpaWlpaWgo+IE9iamVjdCAweGY2ZjNhMDYw
OiAgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIga2tra2tr
a2tra2tra2trawo+IE9iamVjdCAweGY2ZjNhMDcwOiAgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIg
NmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIga2tra2tra2tra2tra2trawo+IE9iamVjdCAweGY2ZjNh
MDgwOiAgODAgMDAgMDAgMDAgZmYgZmYgZmYgZmYgZmYgZmYgMDAgMTcgN2IgMDAgNDYgNDAgLi4u
LsO/w7/Dv8O/w7/Dvy4uey5GQAo+IE9iamVjdCAweGY2ZjNhMDkwOiAgMDAgMTcgN2IgMDAgNDYg
NDAgMzAgMDkgODEgMjEgMDggN2EgMjEgMDAgMDAgMDAgLi57LkZAMC4uIS56IS4uLgo+IE9iamVj
dCAweGY2ZjNhMGEwOiAgNjQgMDAgMjEgMDQgMDAgMDcgMDAgMDAgMDAgMDAgMDAgMDAgMDAgMDEg
MDggODIgZC4hLi4uLi4uLi4uLi4uLgo+IE9iamVjdCAweGY2ZjNhMGIwOiAgODQgOGIgMGMgMTIg
OTYgMTggMjQgMDMgMDEgMDEgMDUgMDQgMDAgMDIgMDAgMDAgLi4uLi4uJC4uLi4uLi4uLgo+IE9i
amVjdCAweGY2ZjNhMGMwOiAgMDcgMDYgNDMgNGUgMjAgMDEgMGQgMTQgMmEgMDEgMDAgMzIgMDQg
MzAgNDggNjAgLi5DTi4uLi4qLi4yLjBIYAo+IE9iamVjdCAweGY2ZjNhMGQwOiAgNmMgZGQgMTgg
MDAgMTcgN2IgMDEgMDQgMDAgMDAgMDAgMDEgMDAgMDAgMDAgMTAgbMOdLi4uey4uLi4uLi4uLi4K
PiBSZWR6b25lIDB4ZjZmM2IwNjA6ICBiYiBiYiBiYiBiYiAgICAgICAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICDCu8K7wrvCuwo+IFBhZGRpbmcgMHhmNmYzYjA4ODogIDVhIDVhIDVhIDVh
IDVhIDVhIDVhIDVhICAgICAgICAgICAgICAgICAgICAgICAgIFpaWlpaWlpaCj4gUGlkOiAwLCBj
b21tOiBzd2FwcGVyIFRhaW50ZWQ6IEcgICAgICAgIFcgMi42LjI2LXNtcCAjMgo+IFs8YzAxODBm
NWQ+XSBwcmludF90cmFpbGVyKzB4YWQvMHhmMAo+IFs8YzAxODEwM2I+XSBjaGVja19ieXRlc19h
bmRfcmVwb3J0KzB4OWIvMHhjMAo+IFs8YzAxODE0NWU+XSBjaGVja19vYmplY3QrMHgxOWUvMHgx
ZTAKPiBbPGMwMTgyMWE0Pl0gX19zbGFiX2FsbG9jKzB4NDU0LzB4NGYwCj4gWzxjMDE4MzRkNj5d
IF9fa21hbGxvY190cmFja19jYWxsZXIrMHhlNi8weGYwCj4gWzxjMDNkZDFlYz5dID8gZGV2X2Fs
bG9jX3NrYisweDFjLzB4MzAKPiBbPGMwM2RkMWVjPl0gPyBkZXZfYWxsb2Nfc2tiKzB4MWMvMHgz
MAo+IFs8YzAzZGNlNzk+XSBfX2FsbG9jX3NrYisweDQ5LzB4MTAwCj4gWzxjMDNkZDFlYz5dIGRl
dl9hbGxvY19za2IrMHgxYy8weDMwCj4gWzxmOGE1ODU5OT5dIGF0aDVrX3J4YnVmX3NldHVwKzB4
MzkvMHgyMDAgW2F0aDVrXQo+IFs8ZjhhNWE2OTc+XSBhdGg1a190YXNrbGV0X3J4KzB4MTI3LzB4
NWMwIFthdGg1a10KPiBbPGMwMTQ5NjlhPl0gPyBwcmludF9sb2NrX2NvbnRlbnRpb25fYnVnKzB4
MWEvMHhlMAo+IFs8YzAxMmVhZmM+XSB0YXNrbGV0X2FjdGlvbisweDRjLzB4YzAKPiBbPGMwMTJl
NDYzPl0gX19kb19zb2Z0aXJxKzB4OTMvMHgxMjAKPiBbPGMwMTJlNTQ3Pl0gZG9fc29mdGlycSsw
eDU3LzB4NjAKPiBbPGMwMTJlYTI5Pl0gaXJxX2V4aXQrMHg2OS8weDgwCj4gWzxjMDEwNmI1NT5d
IGRvX0lSUSsweDQ1LzB4ODAKPiBbPGMwMTBhNWQwPl0gPyBtd2FpdF9pZGxlKzB4MC8weDUwCj4g
WzxjMDEwNDc1Mj5dIGNvbW1vbl9pbnRlcnJ1cHQrMHgyZS8weDM0Cj4gWzxjMDEwYTVkMD5dID8g
bXdhaXRfaWRsZSsweDAvMHg1MAo+IFs8YzAxMGE2MDk+XSA/IG13YWl0X2lkbGUrMHgzOS8weDUw
Cj4gWzxjMDEwMjZlMD5dIGNwdV9pZGxlKzB4NjAvMHhkMAo+IFs8YzA0M2M4Y2U+XSByZXN0X2lu
aXQrMHg0ZS8weDYwCj4gPT09PT09PT09PT09PT09PT09PT09PT0KPiBGSVgga21hbGxvYy00MDk2
OiBSZXN0b3JpbmcgMHhmNmYzYTA4MC0weGY2ZjNhMGVmPTB4NmIKPgo+IEZJWCBrbWFsbG9jLTQw
OTY6IE1hcmtpbmcgYWxsIG9iamVjdHMgdXNlZAo+IFs8YzAyNDNiNGY+XSA/IHNlY3VyaXR5X2Zp
bGVfcGVybWlzc2lvbisweGYvMHgyMAo+IFs8YzAxOTQzNmY+XSBzeXNfc2VsZWN0KzB4M2YvMHgx
OTAKPiBbPGMwMTg3OGU5Pl0gPyBmcHV0KzB4MTkvMHgyMAo+IFs8YzAxMDNkYmY+XSA/IHJlc3Rv
cmVfbm9jaGVjaysweDEyLzB4MTUKPiBbPGMwMTRiMDZkPl0gPyB0cmFjZV9oYXJkaXJxc19vbisw
eGJkLzB4MTQwCj4gWzxjMDEwM2Q1ZT5dIHN5c2NhbGxfY2FsbCsweDcvMHhiCj4gPT09PT09PT09
PT09PT09PT09PT09PT0KPgo+IGFyZ2gsIHRoYXQgc3R1ZmYgaHVydHMgbXkgYnJhaW4uICBOb25l
IG9mIHRoZSBudW1iZXJzIHNlZW0gdG8gbWFrZSBhbnkKPiBzZW5zZSBmb3IgYSA0ayBhbGxvY2F0
aW9uIDooIFBla2thLCBkbyB5b3UgaGF2ZSB0aW1lIHRvIGRlY3J5cHQgdGhpcz8KPgo+IERhdmUs
IGNvdWxkIHlvdSBwbGVhc2UgcmVtaW5kIHVzIHdoaWNoIG5ldCBkcml2ZXIgd2FzIGluIHVzZSBo
ZXJlPwo+Cj4KCkl0J3MgdGhlIGF0aDVrCgotLSAKUmVnYXJkcwpkYXZlCg==

2008-07-29 12:32:07

by Johannes Berg

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Tue, 2008-07-29 at 13:57 +0800, Dave Young wrote:
> While set my wlan0 to ad-hoc mode and then set the essid to something,
> this bug triggered.
>
> I suspect it's related to debugfs or mutex? But I'm not sure.
>
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493]
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP is at mutex_lock_nested+0x1b3/0x280
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR6: ffff0ff0 DR7: 00000400
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02375a6>] ? debugfs_create_file+0x46/0x210

It looks like it's getting stuck on
mutex_lock(&parent->d_inode->i_mutex);

inside debugfs_create_by_name, but I have no idea why that would happen.

johannes


Attachments:
signature.asc (836.00 B)
This is a digitally signed message part

2008-07-31 09:53:09

by Pekka Enberg

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Thu, 2008-07-31 at 12:50 +0300, Tomas Winkler wrote:
> > There's ath5k in the stack trace but that, of course, doesn't
> > automatically mean it's at fault here. It could have been just the poor
> > bastard who was the next to allocate 4 KB with kmalloc() noticing the
> > corruption.
> >
> > Hope this helps!
>
> I've seen something similar with fragmentation code in iwl4965 but I
> can reproduce it only when using SLAB. With SLUB it didn't shown up.
> Does anyone know what is difference between SLAB and SLUB in this context.?

Do you have CONFIG_SLUB_DEBUG_ON enabled? (Note: if you only have
CONFIG_SLUB_DEBUG enabled, then you need to enable debugging at run-time
by passing slub_debug to kernel command line.)


2008-07-30 10:11:29

by Andrew Morton

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Wed, 30 Jul 2008 18:02:03 +0800 "Dave Young" <[email protected]> wrote:

> On Wed, Jul 30, 2008 at 5:08 PM, Andrew Morton
> <[email protected]> wrote:
> > On Tue, 29 Jul 2008 14:32:04 +0200 Johannes Berg <[email protected]> wrote:
> >
> >> On Tue, 2008-07-29 at 13:57 +0800, Dave Young wrote:
> >> > While set my wlan0 to ad-hoc mode and then set the essid to something,
> >> > this bug triggered.
> >> >
> >> > I suspect it's related to debugfs or mutex? But I'm not sure.
> >> >
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493]
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP is at mutex_lock_nested+0x1b3/0x280
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR6: ffff0ff0 DR7: 00000400
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02375a6>] ? debugfs_create_file+0x46/0x210
> >>
> >> It looks like it's getting stuck on
> >> mutex_lock(&parent->d_inode->i_mutex);
> >>
> >> inside debugfs_create_by_name, but I have no idea why that would happen.
> >>
> >
> > Dave, please try sysrq-W or sysrq-T during that 60 seconds, see if we
> > can find another task which is holding onto that lock.
> >
> >
>
> Andrew, the keyboard doesn't response after the bug triggered.
>
> So I just keep sysrq-W / sysrq-T every several seconds after I set the
> wlan0 interface.
> Attached please see the long log file.

argh. Look:

Jul 30 17:34:22 darkstar kernel: [ 145.111510] f703fbf8 00000082 00000002 f703fbe4 f703fbd8 00000000 f704ac40 f703fbcc
Jul 30 17:34:22 darkstar kernel: [ 145.111510] 00000046 c0669060 c066c580 c066c580 c066c580 f703fbe8 f704ad94 c1eaf580
Jul 30 17:34:22 darkstar kernel: [ 145.111510] 00000001 f704ac40 00000000 f702ca00 c0452697 ffff55ba 00000000 c014b06d
Jul 30 17:34:22 darkstar kernel: [ 145.111510] Call Trace:
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0452697>] ? _spin_unlock_irqrestore+0x47/0x60
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014b06d>] ? trace_hardirqs_on+0xbd/0x140
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0450265>] schedule_timeout+0x75/0xc0
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c01942c0>] ? __pollwait+0x60/0xd0
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<f888b02e>] ? evdev_poll+0x2e/0x60 [evdev]
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0193564>] do_sys_poll+0x264/0x3a0
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0194260>] ? __pollwait+0x0/0xd0
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0122160>] ? default_wake_function+0x0/0x10
Jul 30 17:34:22 darkstar last message repeated 3 times
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c01220c5>] ? try_to_wake_up+0x75/0x110
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014aed0>] ? mark_held_locks+0x40/0x80
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014b77b>] ? __lock_acquire+0x24b/0x1040
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014b77b>] ? __lock_acquire+0x24b/0x1040
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c03da639>] ? sock_def_readable+0x59/0x80
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014b77b>] ? __lock_acquire+0x24b/0x1040
Jul 30 17:34:22 darkstar last message repeated 5 times
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0109d0f>] ? native_sched_clock+0x7f/0xb0
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0243b4f>] ? security_file_permission+0xf/0x20
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0186387>] ? rw_verify_area+0x57/0xc0
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0451f5d>] ? _spin_unlock+0x1d/0x20
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0186b94>] ? vfs_read+0x94/0x130
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<f888be60>] ? evdev_read+0x0/0x1b0 [evdev]
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0103dbf>] ? restore_nocheck+0x12/0x15
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c01936cd>] sys_poll+0x2d/0x70
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0103d5e>] syscall_call+0x7/0xb
Jul 30 17:34:22 darkstar kernel: [ 145.111510] =======================
Jul 30 17:34:22 darkstar kernel: [ 145.111510] f63efd94 00000092 00000002 00000000 00000000 f5064028 f7049d80 00000001
Jul 30 17:34:22 darkstar kernel: [ 145.111510] 00000000 c0669060 c066c580 c066c580 c066c580 00000000 f7049ed4 c1e5d580
Jul 30 17:34:22 darkstar kernel: [ 145.111510] 00000000 00000000 00000135 f702e080 f7049d80 00000003 00000000 f7049d80
Jul 30 17:34:22 darkstar kernel: [ 145.111510] Call Trace:
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0450265>] schedule_timeout+0x75/0xc0
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0436b98>] unix_stream_recvmsg+0x2a8/0x4e0


We're missing part of the sysrq traces. I noticed this a while back.
We need this info to find the D-state processes.

It's _possible_ that the kernel is actually trying to print this info,
but it's doing it at a different prink facility level from the stack
output, so we're only seeing part of the printing.

Please try changing your logging level (dmesg -n 0, or dmesg -n 7 - I
always forget which). Then retry the sysrq-T. If that successfully
causes _all_ the output to be presented, then please rerun the above
test.

You might want to compress the file too - I don't know whether the vger
servers will accept a 730kB mail.


2008-07-31 09:50:26

by Tomas Winkler

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Thu, Jul 31, 2008 at 12:15 PM, Pekka J Enberg <[email protected]=
i> wrote:
> Hi Andrew,
>
> On Wed, 30 Jul 2008, Andrew Morton wrote:
>> > Ok here it is.
>> > BTW, I run "klogd -c 7" after boot
>>
>> The sysrq output is still missing lots of stuff. I guess we broke i=
t.
>>
>> >
>> > This time I get a kmalloc poison overwritten:
>> >
>>
>> argh, that stuff hurts my brain. None of the numbers seem to make a=
ny
>> sense for a 4k allocation :( Pekka, do you have time to decrypt this=
?
>
> Sure. Here goes:
>
> On Wed, 30 Jul 2008, Andrew Morton wrote:
>> <fixes wordwrapping, cleans stuff up>
>>
>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D
>> BUG kmalloc-4096: Poison overwritten
>> --------------------------------------------------------------------=
---------
>>
>> INFO: 0xf6f3a080-0xf6f3a0ef. First byte 0x80 instead of 0x6b
>
> That's POISON_FREE ("0x6b") overwritten which means use-after-free fo=
r
> the range of 0xf6f3a080 - 0xf6f3a0ef (112 bytes). The rest of the
> object is okay but the SLUB debugging code only dumps the first 128 b=
ytes
> of the object which is why we don't see the full corruption.
>
> 2.6.27-rc1 should dump the full object so I'm assuming this is pre -r=
c1?
>
>> INFO: Allocated in dev_alloc_skb+0x1c/0x30 age=3D3642 cpu=3D0 pid=3D=
0
>> INFO: Freed in skb_release_data+0x57/0x80 age=3D3146 cpu=3D0 pid=3D2=
398
>
> So the corrupted object was free'd by skb_release_data() so we need t=
o
> look for a driver or the networking stack calling that function too e=
arly.
>
>> INFO: Slab 0xc1c05440 objects=3D7 used=3D3 fp=3D0xf6f3a060 flags=3D0=
x400020c3
>> INFO: Object 0xf6f3a060 @offset=3D8288 fp=3D0xf6f39030
>>
>> Bytes b4 0xf6f3a050: 5e 09 00 00 57 c9 05 00 5a 5a 5a 5a 5a 5a 5a 5=
a ^...W=C9..ZZZZZZZZ
>
> The object starts here (the poison values for first 32 bytes are okay=
):
>
>> Object 0xf6f3a060: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b =
kkkkkkkkkkkkkkkk
>> Object 0xf6f3a070: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b =
kkkkkkkkkkkkkkkk
>
> And here are the first 96 bytes of the corruption:
>
>> Object 0xf6f3a080: 80 00 00 00 ff ff ff ff ff ff 00 17 7b 00 46 40 =
=2E...=FF=FF=FF=FF=FF=FF..{.F@
>> Object 0xf6f3a090: 00 17 7b 00 46 40 30 09 81 21 08 7a 21 00 00 00 =
=2E.{.F@0..!.z!...
>> Object 0xf6f3a0a0: 64 00 21 04 00 07 00 00 00 00 00 00 00 01 08 82 =
d.!.............
>> Object 0xf6f3a0b0: 84 8b 0c 12 96 18 24 03 01 01 05 04 00 02 00 00 =
=2E.....$.........
>> Object 0xf6f3a0c0: 07 06 43 4e 20 01 0d 14 2a 01 00 32 04 30 48 60 =
=2E.CN....*..2.0H`
>> Object 0xf6f3a0d0: 6c dd 18 00 17 7b 01 04 00 00 00 01 00 00 00 10 =
l=DD...{..........
>
> But I think that's just the payload of a SKB?
>
>> Redzone 0xf6f3b060: bb bb bb bb =
=BB=BB=BB=BB
>
> The red-zone has SLUB_RED_INACTIVE ("0xbb") which reinforces
> use-after-free.
>
>> Padding 0xf6f3b088: 5a 5a 5a 5a 5a 5a 5a 5a =
ZZZZZZZZ
>> Pid: 0, comm: swapper Tainted: G W 2.6.26-smp #2
>> [<c0180f5d>] print_trailer+0xad/0xf0
>> [<c018103b>] check_bytes_and_report+0x9b/0xc0
>> [<c018145e>] check_object+0x19e/0x1e0
>> [<c01821a4>] __slab_alloc+0x454/0x4f0
>> [<c01834d6>] __kmalloc_track_caller+0xe6/0xf0
>> [<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30
>> [<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30
>> [<c03dce79>] __alloc_skb+0x49/0x100
>> [<c03dd1ec>] dev_alloc_skb+0x1c/0x30
>> [<f8a58599>] ath5k_rxbuf_setup+0x39/0x200 [ath5k]
>> [<f8a5a697>] ath5k_tasklet_rx+0x127/0x5c0 [ath5k]
>> [<c014969a>] ? print_lock_contention_bug+0x1a/0xe0
>> [<c012eafc>] tasklet_action+0x4c/0xc0
>> [<c012e463>] __do_softirq+0x93/0x120
>> [<c012e547>] do_softirq+0x57/0x60
>> [<c012ea29>] irq_exit+0x69/0x80
>> [<c0106b55>] do_IRQ+0x45/0x80
>> [<c010a5d0>] ? mwait_idle+0x0/0x50
>> [<c0104752>] common_interrupt+0x2e/0x34
>> [<c010a5d0>] ? mwait_idle+0x0/0x50
>> [<c010a609>] ? mwait_idle+0x39/0x50
>> [<c01026e0>] cpu_idle+0x60/0xd0
>> [<c043c8ce>] rest_init+0x4e/0x60
>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
>> FIX kmalloc-4096: Restoring 0xf6f3a080-0xf6f3a0ef=3D0x6b
>>
>> FIX kmalloc-4096: Marking all objects used
>> [<c0243b4f>] ? security_file_permission+0xf/0x20
>> [<c019436f>] sys_select+0x3f/0x190
>> [<c01878e9>] ? fput+0x19/0x20
>> [<c0103dbf>] ? restore_nocheck+0x12/0x15
>> [<c014b06d>] ? trace_hardirqs_on+0xbd/0x140
>> [<c0103d5e>] syscall_call+0x7/0xb
>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
>>
>> Dave, could you please remind us which net driver was in use here?
>
> There's ath5k in the stack trace but that, of course, doesn't
> automatically mean it's at fault here. It could have been just the po=
or
> bastard who was the next to allocate 4 KB with kmalloc() noticing the
> corruption.
>
> Hope this helps!


I've seen something similar with fragmentation code in iwl4965 but I
can reproduce it only when using SLAB. With SLUB it didn't shown up.
Does anyone know what is difference between SLAB and SLUB in this conte=
xt.?
Tomas

2008-08-05 01:29:28

by Dave Young

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Mon, Aug 4, 2008 at 6:00 PM, Jiri Slaby <[email protected]> wrote:
> Dave Young napsal(a):
>>
>> Lockdep helped me, this morning I get a lockdep warning about this,
>>
>> [ 171.432140] [ INFO: possible recursive locking detected ]
>> [ 171.433113] 2.6.27-rc1-smp #4
>> [ 171.434079] ---------------------------------------------
>> [ 171.435039] ath5k_pci/2447 is trying to acquire lock:
>> [ 171.435990] (&sc->lock){--..}, at: [<f89ee9b5>]
>> ath5k_config_interface+0xd5/0x340 [ath5k]
>> [ 171.437046]
>> [ 171.437048] but task is already holding lock:
>> [ 171.438903] (&sc->lock){--..}, at: [<f89ee91d>]
>> ath5k_config_interface+0x3d/0x340 [ath5k]
>> [ 171.439953]
>> [ 171.439954] other info that might help us debug this:
>> [ 171.441795] 3 locks held by ath5k_pci/2447:
>> [ 171.442729] #0: ((name)){--..}, at: [<c013a122>]
>> run_workqueue+0x102/0x1d0
>> [ 171.443800] #1: (&(&local->scan_work)->work){--..}, at:
>> [<c013a122>] run_workqueue+0x102/0x1d0
>> [ 171.444859] #2: (&sc->lock){--..}, at: [<f89ee91d>]
>> ath5k_config_interface+0x3d/0x340 [ath5k]
>>
>> Deadlock happen here, I remove the lock in the sub routine, tested and
>> fixed the problem for me.
>> I will send the patch after a while.
>
> That's some kind different. 2.6.26 can't suffer from the thing above
> (opposing to 27-rc1).

Yes, it's got with 27-rc1

>
> Pasting the original issue below. Maybe it's been fixed too? I suppose you
> don't encounter the problem anymore in 27-rc1 with your/Bob's/mine (fix for
> the same thing in the same manner) patch applied?

With the patch I cann't reproduce the bug with 27-rc1 now.

>
> BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
> Modules linked in: radeonfb fb_ddc i2c_algo_bit snd_seq_dummy snd_seq_oss
> snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss rfcomm
> l2cap bluetooth vfat fuse rtc_cmos rtc_core psmouse intel_agp pl2303 rtc_lib
> ath5k e100 i2c_i801 serio_raw usbserial agpgart 3c59x mii snd_hda_intel
> snd_pcm mac80211 snd_timer snd_page_alloc cfg80211 snd_hwdep snd soundcore
> thermal processor button thermal_sys evdev sg
>
> Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
> EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
> EIP is at mutex_lock_nested+0x1b3/0x280
> EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
> ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
> DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> DR6: ffff0ff0 DR7: 00000400
> [<c02375a6>] ? debugfs_create_file+0x46/0x210
> [<c02375a6>] ? debugfs_create_file+0x46/0x210
> [<c02375a6>] debugfs_create_file+0x46/0x210
> [<c02377f1>] debugfs_create_dir+0x21/0x30
> [<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
> [<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
> [<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
> [<c01394c6>] run_workqueue+0x156/0x1d0
> [<c0139472>] ? run_workqueue+0x102/0x1d0
> [<f88eba00>] ? sta_info_debugfs_add_work+0x0/0x130 [mac80211]
> [<c0139f18>] worker_thread+0x88/0xe0
> [<c013cd20>] ? autoremove_wake_function+0x0/0x40
> [<c0139e90>] ? worker_thread+0x0/0xe0
> [<c013ca32>] kthread+0x42/0x70
> [<c013c9f0>] ? kthread+0x0/0x70
> [<c0104993>] kernel_thread_helper+0x7/0x14
> =======================
> BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
> Modules linked in: radeonfb fb_ddc i2c_algo_bit snd_seq_dummy snd_seq_oss
> snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss rfcomm
> l2cap bluetooth vfat fuse rtc_cmos rtc_core psmouse intel_agp pl2303 rtc_lib
> ath5k e100 i2c_i801 serio_raw usbserial agpgart 3c59x mii snd_hda_intel
> snd_pcm mac80211 snd_timer snd_page_alloc cfg80211 snd_hwdep snd soundcore
> thermal processor button thermal_sys evdev sg
>
> Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
> EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
> EIP is at mutex_lock_nested+0x1b3/0x280
> EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
> ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
> DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> DR6: ffff0ff0 DR7: 00000400
> [<c02375a6>] ? debugfs_create_file+0x46/0x210
> [<c02375a6>] ? debugfs_create_file+0x46/0x210
> [<c02375a6>] debugfs_create_file+0x46/0x210
> [<c02377f1>] debugfs_create_dir+0x21/0x30
> [<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
> [<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
> [<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
> [<c01394c6>] run_workqueue+0x156/0x1d0
> [<c0139472>] ? run_workqueue+0x102/0x1d0
> [<f88eba00>] ? sta_info_debugfs_add_work+0x0/0x130 [mac80211]
> [<c0139f18>] worker_thread+0x88/0xe0
> [<c013cd20>] ? autoremove_wake_function+0x0/0x40
> [<c0139e90>] ? worker_thread+0x0/0xe0
> [<c013ca32>] kthread+0x42/0x70
> [<c013c9f0>] ? kthread+0x0/0x70
> [<c0104993>] kernel_thread_helper+0x7/0x14
> =======================
>



--
Regards
dave

2008-08-12 04:19:33

by Dave Young

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Wed, Aug 6, 2008 at 9:51 AM, Dave Young <[email protected]> wrote:
> On Tue, Aug 5, 2008 at 8:24 PM, Bob Copeland <[email protected]> wrote:
>> On Tue, Aug 05, 2008 at 09:29:26AM +0800, Dave Young wrote:
>>> With the patch I cann't reproduce the bug with 27-rc1 now.
>>>
>>> > [<c02375a6>] ? debugfs_create_file+0x46/0x210
>>> > [<c02375a6>] ? debugfs_create_file+0x46/0x210
>>> > [<c02375a6>] debugfs_create_file+0x46/0x210
>>> > [<c02377f1>] debugfs_create_dir+0x21/0x30
>>> > [<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
>>> > [<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
>>> > [<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
>>
>> I wonder if there were two separate problems here. I looked into
>> this with some detail yesterday and agree with Johannes that the above
>> trace is on locking the parent directory's i_mutex, but I too couldn't
>> see any problems with sta_info_debugfs_add_work. Other stuff could also
>> modify the directory with or without rtnl_lock, but not in a way that
>> to my untrained eyes would lead to deadlock.
>
> Yes,. I think so. It's the original bug for me, while testing I found
> the mutex deadlock problem.
>
> But this week I will have no time to trace it. so if I have time I
> will keep tracing the problem

Unfortunately I have no more time to do this these days.
As I can't reproduce it with the latest git kernel I think that It
might have been fixed already.

>
>>
>> Or is the trace just wrong?
>>
>> --
>> Bob Copeland %% http://www.bobcopeland.com
>>
>>
>
>
>
> --
> Regards
> dave
>



--
Regards
dave

2008-08-06 01:53:23

by Dave Young

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Wed, Aug 6, 2008 at 9:51 AM, Dave Young <[email protected]> wrote:
> On Tue, Aug 5, 2008 at 8:24 PM, Bob Copeland <[email protected]> wrote:
>> On Tue, Aug 05, 2008 at 09:29:26AM +0800, Dave Young wrote:
>>> With the patch I cann't reproduce the bug with 27-rc1 now.
>>>
>>> > [<c02375a6>] ? debugfs_create_file+0x46/0x210
>>> > [<c02375a6>] ? debugfs_create_file+0x46/0x210
>>> > [<c02375a6>] debugfs_create_file+0x46/0x210
>>> > [<c02377f1>] debugfs_create_dir+0x21/0x30
>>> > [<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
>>> > [<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
>>> > [<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
>>
>> I wonder if there were two separate problems here. I looked into
>> this with some detail yesterday and agree with Johannes that the above
>> trace is on locking the parent directory's i_mutex, but I too couldn't
>> see any problems with sta_info_debugfs_add_work. Other stuff could also
>> modify the directory with or without rtnl_lock, but not in a way that
>> to my untrained eyes would lead to deadlock.
>
> Yes,. I think so. It's the original bug for me, while testing I found
> the mutex deadlock problem.
>
> But this week I will have no time to trace it. so if I have time I
> will keep tracing the problem

Additional info,

With the mutex fix patch, in 2.6.27-rc1 I seems can not reproduce the
debugfs_add bug, (maybe need more test)

But with 2.6.26, the bug can be reproduced. (The mutex fix patch need
not to be applied because there's no such deadlock bug)

>
>>
>> Or is the trace just wrong?
>>
>> --
>> Bob Copeland %% http://www.bobcopeland.com
>>
>>
>
>
>
> --
> Regards
> dave
>



--
Regards
dave

2008-08-06 01:51:16

by Dave Young

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Tue, Aug 5, 2008 at 8:24 PM, Bob Copeland <[email protected]> wrote:
> On Tue, Aug 05, 2008 at 09:29:26AM +0800, Dave Young wrote:
>> With the patch I cann't reproduce the bug with 27-rc1 now.
>>
>> > [<c02375a6>] ? debugfs_create_file+0x46/0x210
>> > [<c02375a6>] ? debugfs_create_file+0x46/0x210
>> > [<c02375a6>] debugfs_create_file+0x46/0x210
>> > [<c02377f1>] debugfs_create_dir+0x21/0x30
>> > [<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
>> > [<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
>> > [<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
>
> I wonder if there were two separate problems here. I looked into
> this with some detail yesterday and agree with Johannes that the above
> trace is on locking the parent directory's i_mutex, but I too couldn't
> see any problems with sta_info_debugfs_add_work. Other stuff could also
> modify the directory with or without rtnl_lock, but not in a way that
> to my untrained eyes would lead to deadlock.

Yes,. I think so. It's the original bug for me, while testing I found
the mutex deadlock problem.

But this week I will have no time to trace it. so if I have time I
will keep tracing the problem

>
> Or is the trace just wrong?
>
> --
> Bob Copeland %% http://www.bobcopeland.com
>
>



--
Regards
dave

2008-08-04 10:01:17

by Jiri Slaby

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

Dave Young napsal(a):
> Lockdep helped me, this morning I get a lockdep warning about this,
>
> [ 171.432140] [ INFO: possible recursive locking detected ]
> [ 171.433113] 2.6.27-rc1-smp #4
> [ 171.434079] ---------------------------------------------
> [ 171.435039] ath5k_pci/2447 is trying to acquire lock:
> [ 171.435990] (&sc->lock){--..}, at: [<f89ee9b5>]
> ath5k_config_interface+0xd5/0x340 [ath5k]
> [ 171.437046]
> [ 171.437048] but task is already holding lock:
> [ 171.438903] (&sc->lock){--..}, at: [<f89ee91d>]
> ath5k_config_interface+0x3d/0x340 [ath5k]
> [ 171.439953]
> [ 171.439954] other info that might help us debug this:
> [ 171.441795] 3 locks held by ath5k_pci/2447:
> [ 171.442729] #0: ((name)){--..}, at: [<c013a122>] run_workqueue+0x102/0x1d0
> [ 171.443800] #1: (&(&local->scan_work)->work){--..}, at:
> [<c013a122>] run_workqueue+0x102/0x1d0
> [ 171.444859] #2: (&sc->lock){--..}, at: [<f89ee91d>]
> ath5k_config_interface+0x3d/0x340 [ath5k]
>
> Deadlock happen here, I remove the lock in the sub routine, tested and
> fixed the problem for me.
> I will send the patch after a while.

That's some kind different. 2.6.26 can't suffer from the thing above
(opposing to 27-rc1).

Pasting the original issue below. Maybe it's been fixed too? I suppose you
don't encounter the problem anymore in 27-rc1 with your/Bob's/mine (fix for
the same thing in the same manner) patch applied?

BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
Modules linked in: radeonfb fb_ddc i2c_algo_bit snd_seq_dummy snd_seq_oss
snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss rfcomm
l2cap bluetooth vfat fuse rtc_cmos rtc_core psmouse intel_agp pl2303 rtc_lib
ath5k e100 i2c_i801 serio_raw usbserial agpgart 3c59x mii snd_hda_intel
snd_pcm mac80211 snd_timer snd_page_alloc cfg80211 snd_hwdep snd soundcore
thermal processor button thermal_sys evdev sg

Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
EIP is at mutex_lock_nested+0x1b3/0x280
EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<c02375a6>] ? debugfs_create_file+0x46/0x210
[<c02375a6>] ? debugfs_create_file+0x46/0x210
[<c02375a6>] debugfs_create_file+0x46/0x210
[<c02377f1>] debugfs_create_dir+0x21/0x30
[<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
[<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
[<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
[<c01394c6>] run_workqueue+0x156/0x1d0
[<c0139472>] ? run_workqueue+0x102/0x1d0
[<f88eba00>] ? sta_info_debugfs_add_work+0x0/0x130 [mac80211]
[<c0139f18>] worker_thread+0x88/0xe0
[<c013cd20>] ? autoremove_wake_function+0x0/0x40
[<c0139e90>] ? worker_thread+0x0/0xe0
[<c013ca32>] kthread+0x42/0x70
[<c013c9f0>] ? kthread+0x0/0x70
[<c0104993>] kernel_thread_helper+0x7/0x14
=======================
BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
Modules linked in: radeonfb fb_ddc i2c_algo_bit snd_seq_dummy snd_seq_oss
snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss rfcomm
l2cap bluetooth vfat fuse rtc_cmos rtc_core psmouse intel_agp pl2303 rtc_lib
ath5k e100 i2c_i801 serio_raw usbserial agpgart 3c59x mii snd_hda_intel
snd_pcm mac80211 snd_timer snd_page_alloc cfg80211 snd_hwdep snd soundcore
thermal processor button thermal_sys evdev sg

Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
EIP is at mutex_lock_nested+0x1b3/0x280
EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<c02375a6>] ? debugfs_create_file+0x46/0x210
[<c02375a6>] ? debugfs_create_file+0x46/0x210
[<c02375a6>] debugfs_create_file+0x46/0x210
[<c02377f1>] debugfs_create_dir+0x21/0x30
[<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
[<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
[<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
[<c01394c6>] run_workqueue+0x156/0x1d0
[<c0139472>] ? run_workqueue+0x102/0x1d0
[<f88eba00>] ? sta_info_debugfs_add_work+0x0/0x130 [mac80211]
[<c0139f18>] worker_thread+0x88/0xe0
[<c013cd20>] ? autoremove_wake_function+0x0/0x40
[<c0139e90>] ? worker_thread+0x0/0xe0
[<c013ca32>] kthread+0x42/0x70
[<c013c9f0>] ? kthread+0x0/0x70
[<c0104993>] kernel_thread_helper+0x7/0x14
=======================

2008-08-01 07:32:14

by Dave Young

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

T24gVGh1LCBKdWwgMzEsIDIwMDggYXQgNToxNSBQTSwgUGVra2EgSiBFbmJlcmcgPHBlbmJlcmdA
Y3MuaGVsc2lua2kuZmk+IHdyb3RlOgo+IEhpIEFuZHJldywKPgo+IE9uIFdlZCwgMzAgSnVsIDIw
MDgsIEFuZHJldyBNb3J0b24gd3JvdGU6Cj4+ID4gT2sgaGVyZSBpdCBpcy4KPj4gPiBCVFcsIEkg
cnVuICJrbG9nZCAtYyA3IiBhZnRlciBib290Cj4+Cj4+IFRoZSBzeXNycSBvdXRwdXQgaXMgc3Rp
bGwgbWlzc2luZyBsb3RzIG9mIHN0dWZmLiAgSSBndWVzcyB3ZSBicm9rZSBpdC4KPj4KPj4gPgo+
PiA+IFRoaXMgdGltZSBJIGdldCBhIGttYWxsb2MgcG9pc29uIG92ZXJ3cml0dGVuOgo+PiA+Cj4+
Cj4+IGFyZ2gsIHRoYXQgc3R1ZmYgaHVydHMgbXkgYnJhaW4uICBOb25lIG9mIHRoZSBudW1iZXJz
IHNlZW0gdG8gbWFrZSBhbnkKPj4gc2Vuc2UgZm9yIGEgNGsgYWxsb2NhdGlvbiA6KCBQZWtrYSwg
ZG8geW91IGhhdmUgdGltZSB0byBkZWNyeXB0IHRoaXM/Cj4KPiBTdXJlLiBIZXJlIGdvZXM6Cj4K
PiBPbiBXZWQsIDMwIEp1bCAyMDA4LCBBbmRyZXcgTW9ydG9uIHdyb3RlOgo+PiA8Zml4ZXMgd29y
ZHdyYXBwaW5nLCBjbGVhbnMgc3R1ZmYgdXA+Cj4+Cj4+ID09PT09PT09PT09PT09PT09PT09PT09
PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09Cj4+
IEJVRyBrbWFsbG9jLTQwOTY6IFBvaXNvbiBvdmVyd3JpdHRlbgo+PiAtLS0tLS0tLS0tLS0tLS0t
LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t
LS0tLQo+Pgo+PiBJTkZPOiAweGY2ZjNhMDgwLTB4ZjZmM2EwZWYuIEZpcnN0IGJ5dGUgMHg4MCBp
bnN0ZWFkIG9mIDB4NmIKPgo+IFRoYXQncyBQT0lTT05fRlJFRSAoIjB4NmIiKSBvdmVyd3JpdHRl
biB3aGljaCBtZWFucyB1c2UtYWZ0ZXItZnJlZSBmb3IKPiB0aGUgcmFuZ2Ugb2YgMHhmNmYzYTA4
MCAtIDB4ZjZmM2EwZWYgKDExMiBieXRlcykuIFRoZSByZXN0IG9mIHRoZQo+IG9iamVjdCBpcyBv
a2F5IGJ1dCB0aGUgU0xVQiBkZWJ1Z2dpbmcgY29kZSBvbmx5IGR1bXBzIHRoZSBmaXJzdCAxMjgg
Ynl0ZXMKPiBvZiB0aGUgb2JqZWN0IHdoaWNoIGlzIHdoeSB3ZSBkb24ndCBzZWUgdGhlIGZ1bGwg
Y29ycnVwdGlvbi4KPgo+IDIuNi4yNy1yYzEgc2hvdWxkIGR1bXAgdGhlIGZ1bGwgb2JqZWN0IHNv
IEknbSBhc3N1bWluZyB0aGlzIGlzIHByZSAtcmMxPwoKWWVzLCBpdCdzIDIuNi4yNgoKPgo+PiBJ
TkZPOiBBbGxvY2F0ZWQgaW4gZGV2X2FsbG9jX3NrYisweDFjLzB4MzAgYWdlPTM2NDIgY3B1PTAg
cGlkPTAKPj4gSU5GTzogRnJlZWQgaW4gc2tiX3JlbGVhc2VfZGF0YSsweDU3LzB4ODAgYWdlPTMx
NDYgY3B1PTAgcGlkPTIzOTgKPgo+IFNvIHRoZSBjb3JydXB0ZWQgb2JqZWN0IHdhcyBmcmVlJ2Qg
Ynkgc2tiX3JlbGVhc2VfZGF0YSgpIHNvIHdlIG5lZWQgdG8KPiBsb29rIGZvciBhIGRyaXZlciBv
ciB0aGUgbmV0d29ya2luZyBzdGFjayBjYWxsaW5nIHRoYXQgZnVuY3Rpb24gdG9vIGVhcmx5Lgo+
Cj4+IElORk86IFNsYWIgMHhjMWMwNTQ0MCBvYmplY3RzPTcgdXNlZD0zIGZwPTB4ZjZmM2EwNjAg
ZmxhZ3M9MHg0MDAwMjBjMwo+PiBJTkZPOiBPYmplY3QgMHhmNmYzYTA2MCBAb2Zmc2V0PTgyODgg
ZnA9MHhmNmYzOTAzMAo+Pgo+PiBCeXRlcyBiNCAweGY2ZjNhMDUwOiAgNWUgMDkgMDAgMDAgNTcg
YzkgMDUgMDAgNWEgNWEgNWEgNWEgNWEgNWEgNWEgNWEgXi4uLlfDiS4uWlpaWlpaWloKPgo+IFRo
ZSBvYmplY3Qgc3RhcnRzIGhlcmUgKHRoZSBwb2lzb24gdmFsdWVzIGZvciBmaXJzdCAzMiBieXRl
cyBhcmUgb2theSk6Cj4KPj4gT2JqZWN0IDB4ZjZmM2EwNjA6ICA2YiA2YiA2YiA2YiA2YiA2YiA2
YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiBra2tra2tra2tra2tra2trCj4+IE9iamVjdCAw
eGY2ZjNhMDcwOiAgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIg
NmIga2tra2tra2tra2tra2trawo+Cj4gQW5kIGhlcmUgYXJlIHRoZSBmaXJzdCA5NiBieXRlcyBv
ZiB0aGUgY29ycnVwdGlvbjoKPgo+PiBPYmplY3QgMHhmNmYzYTA4MDogIDgwIDAwIDAwIDAwIGZm
IGZmIGZmIGZmIGZmIGZmIDAwIDE3IDdiIDAwIDQ2IDQwIC4uLi7Dv8O/w7/Dv8O/w78uLnsuRkAK
Pj4gT2JqZWN0IDB4ZjZmM2EwOTA6ICAwMCAxNyA3YiAwMCA0NiA0MCAzMCAwOSA4MSAyMSAwOCA3
YSAyMSAwMCAwMCAwMCAuLnsuRkAwLi4hLnohLi4uCj4+IE9iamVjdCAweGY2ZjNhMGEwOiAgNjQg
MDAgMjEgMDQgMDAgMDcgMDAgMDAgMDAgMDAgMDAgMDAgMDAgMDEgMDggODIgZC4hLi4uLi4uLi4u
Li4uLgo+PiBPYmplY3QgMHhmNmYzYTBiMDogIDg0IDhiIDBjIDEyIDk2IDE4IDI0IDAzIDAxIDAx
IDA1IDA0IDAwIDAyIDAwIDAwIC4uLi4uLiQuLi4uLi4uLi4KPj4gT2JqZWN0IDB4ZjZmM2EwYzA6
ICAwNyAwNiA0MyA0ZSAyMCAwMSAwZCAxNCAyYSAwMSAwMCAzMiAwNCAzMCA0OCA2MCAuLkNOLi4u
LiouLjIuMEhgCj4+IE9iamVjdCAweGY2ZjNhMGQwOiAgNmMgZGQgMTggMDAgMTcgN2IgMDEgMDQg
MDAgMDAgMDAgMDEgMDAgMDAgMDAgMTAgbMOdLi4uey4uLi4uLi4uLi4KPgo+IEJ1dCBJIHRoaW5r
IHRoYXQncyBqdXN0IHRoZSBwYXlsb2FkIG9mIGEgU0tCPwo+Cj4+IFJlZHpvbmUgMHhmNmYzYjA2
MDogIGJiIGJiIGJiIGJiICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIMK7wrvC
u8K7Cj4KPiBUaGUgcmVkLXpvbmUgaGFzIFNMVUJfUkVEX0lOQUNUSVZFICgiMHhiYiIpIHdoaWNo
IHJlaW5mb3JjZXMKPiB1c2UtYWZ0ZXItZnJlZS4KPgo+PiBQYWRkaW5nIDB4ZjZmM2IwODg6ICA1
YSA1YSA1YSA1YSA1YSA1YSA1YSA1YSAgICAgICAgICAgICAgICAgICAgICAgICBaWlpaWlpaWgo+
PiBQaWQ6IDAsIGNvbW06IHN3YXBwZXIgVGFpbnRlZDogRyAgICAgICAgVyAyLjYuMjYtc21wICMy
Cj4+IFs8YzAxODBmNWQ+XSBwcmludF90cmFpbGVyKzB4YWQvMHhmMAo+PiBbPGMwMTgxMDNiPl0g
Y2hlY2tfYnl0ZXNfYW5kX3JlcG9ydCsweDliLzB4YzAKPj4gWzxjMDE4MTQ1ZT5dIGNoZWNrX29i
amVjdCsweDE5ZS8weDFlMAo+PiBbPGMwMTgyMWE0Pl0gX19zbGFiX2FsbG9jKzB4NDU0LzB4NGYw
Cj4+IFs8YzAxODM0ZDY+XSBfX2ttYWxsb2NfdHJhY2tfY2FsbGVyKzB4ZTYvMHhmMAo+PiBbPGMw
M2RkMWVjPl0gPyBkZXZfYWxsb2Nfc2tiKzB4MWMvMHgzMAo+PiBbPGMwM2RkMWVjPl0gPyBkZXZf
YWxsb2Nfc2tiKzB4MWMvMHgzMAo+PiBbPGMwM2RjZTc5Pl0gX19hbGxvY19za2IrMHg0OS8weDEw
MAo+PiBbPGMwM2RkMWVjPl0gZGV2X2FsbG9jX3NrYisweDFjLzB4MzAKPj4gWzxmOGE1ODU5OT5d
IGF0aDVrX3J4YnVmX3NldHVwKzB4MzkvMHgyMDAgW2F0aDVrXQo+PiBbPGY4YTVhNjk3Pl0gYXRo
NWtfdGFza2xldF9yeCsweDEyNy8weDVjMCBbYXRoNWtdCj4+IFs8YzAxNDk2OWE+XSA/IHByaW50
X2xvY2tfY29udGVudGlvbl9idWcrMHgxYS8weGUwCj4+IFs8YzAxMmVhZmM+XSB0YXNrbGV0X2Fj
dGlvbisweDRjLzB4YzAKPj4gWzxjMDEyZTQ2Mz5dIF9fZG9fc29mdGlycSsweDkzLzB4MTIwCj4+
IFs8YzAxMmU1NDc+XSBkb19zb2Z0aXJxKzB4NTcvMHg2MAo+PiBbPGMwMTJlYTI5Pl0gaXJxX2V4
aXQrMHg2OS8weDgwCj4+IFs8YzAxMDZiNTU+XSBkb19JUlErMHg0NS8weDgwCj4+IFs8YzAxMGE1
ZDA+XSA/IG13YWl0X2lkbGUrMHgwLzB4NTAKPj4gWzxjMDEwNDc1Mj5dIGNvbW1vbl9pbnRlcnJ1
cHQrMHgyZS8weDM0Cj4+IFs8YzAxMGE1ZDA+XSA/IG13YWl0X2lkbGUrMHgwLzB4NTAKPj4gWzxj
MDEwYTYwOT5dID8gbXdhaXRfaWRsZSsweDM5LzB4NTAKPj4gWzxjMDEwMjZlMD5dIGNwdV9pZGxl
KzB4NjAvMHhkMAo+PiBbPGMwNDNjOGNlPl0gcmVzdF9pbml0KzB4NGUvMHg2MAo+PiA9PT09PT09
PT09PT09PT09PT09PT09PQo+PiBGSVgga21hbGxvYy00MDk2OiBSZXN0b3JpbmcgMHhmNmYzYTA4
MC0weGY2ZjNhMGVmPTB4NmIKPj4KPj4gRklYIGttYWxsb2MtNDA5NjogTWFya2luZyBhbGwgb2Jq
ZWN0cyB1c2VkCj4+IFs8YzAyNDNiNGY+XSA/IHNlY3VyaXR5X2ZpbGVfcGVybWlzc2lvbisweGYv
MHgyMAo+PiBbPGMwMTk0MzZmPl0gc3lzX3NlbGVjdCsweDNmLzB4MTkwCj4+IFs8YzAxODc4ZTk+
XSA/IGZwdXQrMHgxOS8weDIwCj4+IFs8YzAxMDNkYmY+XSA/IHJlc3RvcmVfbm9jaGVjaysweDEy
LzB4MTUKPj4gWzxjMDE0YjA2ZD5dID8gdHJhY2VfaGFyZGlycXNfb24rMHhiZC8weDE0MAo+PiBb
PGMwMTAzZDVlPl0gc3lzY2FsbF9jYWxsKzB4Ny8weGIKPj4gPT09PT09PT09PT09PT09PT09PT09
PT0KPj4KPj4gRGF2ZSwgY291bGQgeW91IHBsZWFzZSByZW1pbmQgdXMgd2hpY2ggbmV0IGRyaXZl
ciB3YXMgaW4gdXNlIGhlcmU/Cj4KPiBUaGVyZSdzIGF0aDVrIGluIHRoZSBzdGFjayB0cmFjZSBi
dXQgdGhhdCwgb2YgY291cnNlLCBkb2Vzbid0Cj4gYXV0b21hdGljYWxseSBtZWFuIGl0J3MgYXQg
ZmF1bHQgaGVyZS4gSXQgY291bGQgaGF2ZSBiZWVuIGp1c3QgdGhlIHBvb3IKPiBiYXN0YXJkIHdo
byB3YXMgdGhlIG5leHQgdG8gYWxsb2NhdGUgNCBLQiB3aXRoIGttYWxsb2MoKSBub3RpY2luZyB0
aGUKPiBjb3JydXB0aW9uLgo+Cj4gSG9wZSB0aGlzIGhlbHBzIQo+Cj4gICAgICAgICAgICAgICAg
ICAgICAgICBQZWtrYQoKTG9ja2RlcCBoZWxwZWQgbWUsIHRoaXMgbW9ybmluZyBJIGdldCBhIGxv
Y2tkZXAgd2FybmluZyBhYm91dCB0aGlzLAoKWyAgMTcxLjQzMjE0MF0gWyBJTkZPOiBwb3NzaWJs
ZSByZWN1cnNpdmUgbG9ja2luZyBkZXRlY3RlZCBdClsgIDE3MS40MzMxMTNdIDIuNi4yNy1yYzEt
c21wICM0ClsgIDE3MS40MzQwNzldIC0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t
LS0tLS0tLS0tLQpbICAxNzEuNDM1MDM5XSBhdGg1a19wY2kvMjQ0NyBpcyB0cnlpbmcgdG8gYWNx
dWlyZSBsb2NrOgpbICAxNzEuNDM1OTkwXSAgKCZzYy0+bG9jayl7LS0uLn0sIGF0OiBbPGY4OWVl
OWI1Pl0KYXRoNWtfY29uZmlnX2ludGVyZmFjZSsweGQ1LzB4MzQwIFthdGg1a10KWyAgMTcxLjQz
NzA0Nl0KWyAgMTcxLjQzNzA0OF0gYnV0IHRhc2sgaXMgYWxyZWFkeSBob2xkaW5nIGxvY2s6Clsg
IDE3MS40Mzg5MDNdICAoJnNjLT5sb2NrKXstLS4ufSwgYXQ6IFs8Zjg5ZWU5MWQ+XQphdGg1a19j
b25maWdfaW50ZXJmYWNlKzB4M2QvMHgzNDAgW2F0aDVrXQpbICAxNzEuNDM5OTUzXQpbICAxNzEu
NDM5OTU0XSBvdGhlciBpbmZvIHRoYXQgbWlnaHQgaGVscCB1cyBkZWJ1ZyB0aGlzOgpbICAxNzEu
NDQxNzk1XSAzIGxvY2tzIGhlbGQgYnkgYXRoNWtfcGNpLzI0NDc6ClsgIDE3MS40NDI3MjldICAj
MDogICgobmFtZSkpey0tLi59LCBhdDogWzxjMDEzYTEyMj5dIHJ1bl93b3JrcXVldWUrMHgxMDIv
MHgxZDAKWyAgMTcxLjQ0MzgwMF0gICMxOiAgKCYoJmxvY2FsLT5zY2FuX3dvcmspLT53b3JrKXst
LS4ufSwgYXQ6Cls8YzAxM2ExMjI+XSBydW5fd29ya3F1ZXVlKzB4MTAyLzB4MWQwClsgIDE3MS40
NDQ4NTldICAjMjogICgmc2MtPmxvY2spey0tLi59LCBhdDogWzxmODllZTkxZD5dCmF0aDVrX2Nv
bmZpZ19pbnRlcmZhY2UrMHgzZC8weDM0MCBbYXRoNWtdCgpEZWFkbG9jayBoYXBwZW4gaGVyZSwg
SSByZW1vdmUgdGhlIGxvY2sgaW4gdGhlIHN1YiByb3V0aW5lLCB0ZXN0ZWQgYW5kCmZpeGVkIHRo
ZSBwcm9ibGVtIGZvciBtZS4KSSB3aWxsIHNlbmQgdGhlIHBhdGNoIGFmdGVyIGEgd2hpbGUuCgpC
dXQgSSBzdGlsbCBoYXZlIG5vIGlkZWEgd2l0aCB0aGUgcG9pc29uIG92ZXJ3cml0dGVuLgoKLS0g
ClJlZ2FyZHMKZGF2ZQo=

2008-08-04 09:22:13

by Jiri Slaby

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

Dave Young napsal(a):
> On Thu, Jul 31, 2008 at 5:15 PM, Pekka J Enberg <[email protected].=
fi> wrote:
>> On Wed, 30 Jul 2008, Andrew Morton wrote:
>>> INFO: Allocated in dev_alloc_skb+0x1c/0x30 age=3D3642 cpu=3D0 pid=3D=
0
>>> INFO: Freed in skb_release_data+0x57/0x80 age=3D3146 cpu=3D0 pid=3D=
2398
>> So the corrupted object was free'd by skb_release_data() so we need =
to
>> look for a driver or the networking stack calling that function too =
early.
>>
>>> INFO: Slab 0xc1c05440 objects=3D7 used=3D3 fp=3D0xf6f3a060 flags=3D=
0x400020c3
>>> INFO: Object 0xf6f3a060 @offset=3D8288 fp=3D0xf6f39030
>>>
>>> Bytes b4 0xf6f3a050: 5e 09 00 00 57 c9 05 00 5a 5a 5a 5a 5a 5a 5a =
5a ^...W=C3=89..ZZZZZZZZ
>> The object starts here (the poison values for first 32 bytes are oka=
y):
>>
>>> Object 0xf6f3a060: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b=
kkkkkkkkkkkkkkkk
>>> Object 0xf6f3a070: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b=
kkkkkkkkkkkkkkkk
>> And here are the first 96 bytes of the corruption:
>>
>>> Object 0xf6f3a080: 80 00 00 00 ff ff ff ff ff ff 00 17 7b 00 46 40=
....=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF..{.F@
>>> Object 0xf6f3a090: 00 17 7b 00 46 40 30 09 81 21 08 7a 21 00 00 00=
..{.F@0..!.z!...
>>> Object 0xf6f3a0a0: 64 00 21 04 00 07 00 00 00 00 00 00 00 01 08 82=
d.!.............
>>> Object 0xf6f3a0b0: 84 8b 0c 12 96 18 24 03 01 01 05 04 00 02 00 00=
......$.........
>>> Object 0xf6f3a0c0: 07 06 43 4e 20 01 0d 14 2a 01 00 32 04 30 48 60=
..CN....*..2.0H`
>>> Object 0xf6f3a0d0: 6c dd 18 00 17 7b 01 04 00 00 00 01 00 00 00 10=
l=C3=9D...{..........
>> But I think that's just the payload of a SKB?

It's a receive frame from ath5k, I suppose. 00:17:7b:00:46:40 is your A=
P?

>>> Redzone 0xf6f3b060: bb bb bb bb =
=C2=BB=C2=BB=C2=BB=C2=BB
>> The red-zone has SLUB_RED_INACTIVE ("0xbb") which reinforces
>> use-after-free.
>>
>>> Padding 0xf6f3b088: 5a 5a 5a 5a 5a 5a 5a 5a =
ZZZZZZZZ
>>> Pid: 0, comm: swapper Tainted: G W 2.6.26-smp #2
>>> [<c0180f5d>] print_trailer+0xad/0xf0
>>> [<c018103b>] check_bytes_and_report+0x9b/0xc0
>>> [<c018145e>] check_object+0x19e/0x1e0
>>> [<c01821a4>] __slab_alloc+0x454/0x4f0
>>> [<c01834d6>] __kmalloc_track_caller+0xe6/0xf0
>>> [<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30
>>> [<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30
>>> [<c03dce79>] __alloc_skb+0x49/0x100
>>> [<c03dd1ec>] dev_alloc_skb+0x1c/0x30
>>> [<f8a58599>] ath5k_rxbuf_setup+0x39/0x200 [ath5k]
>>> [<f8a5a697>] ath5k_tasklet_rx+0x127/0x5c0 [ath5k]
>>> [<c014969a>] ? print_lock_contention_bug+0x1a/0xe0
>>> [<c012eafc>] tasklet_action+0x4c/0xc0
[...]
>>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
>>> FIX kmalloc-4096: Restoring 0xf6f3a080-0xf6f3a0ef=3D0x6b

>>> Dave, could you please remind us which net driver was in use here?
>> There's ath5k in the stack trace but that, of course, doesn't
>> automatically mean it's at fault here. It could have been just the p=
oor
>> bastard who was the next to allocate 4 KB with kmalloc() noticing th=
e
>> corruption.

No, unfortunately ath5k *is* likely the culprit. Next time please Cc=20
[email protected] even if it is only a suspicion.

> But I still have no idea with the poison overwritten.

Could you try patch from
http://lkml.org/lkml/2008/7/15/276
? (I have no idea how reproducible is this by you, it often happens on =
noisy=20
channels and/or by lowering RX buffers, i.e. ATH_RXBUF).

[It hit mainline few days ago, I'm going to fwd it to stable.]

2008-08-05 12:25:14

by Bob Copeland

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Tue, Aug 05, 2008 at 09:29:26AM +0800, Dave Young wrote:
> With the patch I cann't reproduce the bug with 27-rc1 now.
>
> > [<c02375a6>] ? debugfs_create_file+0x46/0x210
> > [<c02375a6>] ? debugfs_create_file+0x46/0x210
> > [<c02375a6>] debugfs_create_file+0x46/0x210
> > [<c02377f1>] debugfs_create_dir+0x21/0x30
> > [<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
> > [<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
> > [<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]

I wonder if there were two separate problems here. I looked into
this with some detail yesterday and agree with Johannes that the above
trace is on locking the parent directory's i_mutex, but I too couldn't
see any problems with sta_info_debugfs_add_work. Other stuff could also
modify the directory with or without rtnl_lock, but not in a way that
to my untrained eyes would lead to deadlock.

Or is the trace just wrong?

--
Bob Copeland %% http://www.bobcopeland.com