2008-07-29 05:56:37

by Dave Young

[permalink] [raw]
Subject: [BUG] wireless : cpu stuck for 61s

While set my wlan0 to ad-hoc mode and then set the essid to something,
this bug triggered.

I suspect it's related to debugfs or mutex? But I'm not sure.

Jul 29 13:01:46 darkstar kernel: [ 8033.736493] BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] Modules linked in: radeonfb fb_ddc i2c_algo_bit snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss rfcomm l2cap bluetooth vfat fuse rtc_cmos rtc_core psmouse intel_agp pl2303 rtc_lib ath5k e100 i2c_i801 serio_raw usbserial agpgart 3c59x mii snd_hda_intel snd_pcm mac80211 snd_timer snd_page_alloc cfg80211 snd_hwdep snd soundcore thermal processor button thermal_sys evdev sg
Jul 29 13:01:46 darkstar kernel: [ 8033.736493]
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP is at mutex_lock_nested+0x1b3/0x280
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR6: ffff0ff0 DR7: 00000400
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02375a6>] ? debugfs_create_file+0x46/0x210
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02375a6>] ? debugfs_create_file+0x46/0x210
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02375a6>] debugfs_create_file+0x46/0x210
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02377f1>] debugfs_create_dir+0x21/0x30
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c01394c6>] run_workqueue+0x156/0x1d0
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c0139472>] ? run_workqueue+0x102/0x1d0
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<f88eba00>] ? sta_info_debugfs_add_work+0x0/0x130 [mac80211]
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c0139f18>] worker_thread+0x88/0xe0
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c013cd20>] ? autoremove_wake_function+0x0/0x40
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c0139e90>] ? worker_thread+0x0/0xe0
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c013ca32>] kthread+0x42/0x70
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c013c9f0>] ? kthread+0x0/0x70
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c0104993>] kernel_thread_helper+0x7/0x14
Jul 29 13:01:46 darkstar kernel: [ 8033.736493] =======================
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] Modules linked in: radeonfb fb_ddc i2c_algo_bit snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss rfcomm l2cap bluetooth vfat fuse rtc_cmos rtc_core psmouse intel_agp pl2303 rtc_lib ath5k e100 i2c_i801 serio_raw usbserial agpgart 3c59x mii snd_hda_intel snd_pcm mac80211 snd_timer snd_page_alloc cfg80211 snd_hwdep snd soundcore thermal processor button thermal_sys evdev sg
Jul 29 13:02:51 darkstar kernel: [ 8099.234898]
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] EIP is at mutex_lock_nested+0x1b3/0x280
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] DR6: ffff0ff0 DR7: 00000400
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c02375a6>] ? debugfs_create_file+0x46/0x210
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c02375a6>] ? debugfs_create_file+0x46/0x210
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c02375a6>] debugfs_create_file+0x46/0x210
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c02377f1>] debugfs_create_dir+0x21/0x30
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c01394c6>] run_workqueue+0x156/0x1d0
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c0139472>] ? run_workqueue+0x102/0x1d0
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<f88eba00>] ? sta_info_debugfs_add_work+0x0/0x130 [mac80211]
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c0139f18>] worker_thread+0x88/0xe0
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c013cd20>] ? autoremove_wake_function+0x0/0x40
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c0139e90>] ? worker_thread+0x0/0xe0
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c013ca32>] kthread+0x42/0x70
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c013c9f0>] ? kthread+0x0/0x70
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] [<c0104993>] kernel_thread_helper+0x7/0x14
Jul 29 13:02:51 darkstar kernel: [ 8099.234898] =======================


2008-07-29 12:32:19

by Johannes Berg

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Tue, 2008-07-29 at 13:57 +0800, Dave Young wrote:
> While set my wlan0 to ad-hoc mode and then set the essid to something,
> this bug triggered.
>
> I suspect it's related to debugfs or mutex? But I'm not sure.
>
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493]
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP is at mutex_lock_nested+0x1b3/0x280
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR6: ffff0ff0 DR7: 00000400
> Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02375a6>] ? debugfs_create_file+0x46/0x210

It looks like it's getting stuck on
mutex_lock(&parent->d_inode->i_mutex);

inside debugfs_create_by_name, but I have no idea why that would happen.

johannes


Attachments:
signature.asc (836.00 B)
This is a digitally signed message part

2008-07-30 09:10:33

by Andrew Morton

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Tue, 29 Jul 2008 14:32:04 +0200 Johannes Berg <[email protected]> wrote:

> On Tue, 2008-07-29 at 13:57 +0800, Dave Young wrote:
> > While set my wlan0 to ad-hoc mode and then set the essid to something,
> > this bug triggered.
> >
> > I suspect it's related to debugfs or mutex? But I'm not sure.
> >
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493]
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP is at mutex_lock_nested+0x1b3/0x280
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR6: ffff0ff0 DR7: 00000400
> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02375a6>] ? debugfs_create_file+0x46/0x210
>
> It looks like it's getting stuck on
> mutex_lock(&parent->d_inode->i_mutex);
>
> inside debugfs_create_by_name, but I have no idea why that would happen.
>

Dave, please try sysrq-W or sysrq-T during that 60 seconds, see if we
can find another task which is holding onto that lock.

2008-07-30 10:11:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Wed, 30 Jul 2008 18:02:03 +0800 "Dave Young" <[email protected]> wrote:

> On Wed, Jul 30, 2008 at 5:08 PM, Andrew Morton
> <[email protected]> wrote:
> > On Tue, 29 Jul 2008 14:32:04 +0200 Johannes Berg <[email protected]> wrote:
> >
> >> On Tue, 2008-07-29 at 13:57 +0800, Dave Young wrote:
> >> > While set my wlan0 to ad-hoc mode and then set the essid to something,
> >> > this bug triggered.
> >> >
> >> > I suspect it's related to debugfs or mutex? But I'm not sure.
> >> >
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493]
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP is at mutex_lock_nested+0x1b3/0x280
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR6: ffff0ff0 DR7: 00000400
> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02375a6>] ? debugfs_create_file+0x46/0x210
> >>
> >> It looks like it's getting stuck on
> >> mutex_lock(&parent->d_inode->i_mutex);
> >>
> >> inside debugfs_create_by_name, but I have no idea why that would happen.
> >>
> >
> > Dave, please try sysrq-W or sysrq-T during that 60 seconds, see if we
> > can find another task which is holding onto that lock.
> >
> >
>
> Andrew, the keyboard doesn't response after the bug triggered.
>
> So I just keep sysrq-W / sysrq-T every several seconds after I set the
> wlan0 interface.
> Attached please see the long log file.

argh. Look:

Jul 30 17:34:22 darkstar kernel: [ 145.111510] f703fbf8 00000082 00000002 f703fbe4 f703fbd8 00000000 f704ac40 f703fbcc
Jul 30 17:34:22 darkstar kernel: [ 145.111510] 00000046 c0669060 c066c580 c066c580 c066c580 f703fbe8 f704ad94 c1eaf580
Jul 30 17:34:22 darkstar kernel: [ 145.111510] 00000001 f704ac40 00000000 f702ca00 c0452697 ffff55ba 00000000 c014b06d
Jul 30 17:34:22 darkstar kernel: [ 145.111510] Call Trace:
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0452697>] ? _spin_unlock_irqrestore+0x47/0x60
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014b06d>] ? trace_hardirqs_on+0xbd/0x140
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0450265>] schedule_timeout+0x75/0xc0
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c01942c0>] ? __pollwait+0x60/0xd0
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<f888b02e>] ? evdev_poll+0x2e/0x60 [evdev]
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0193564>] do_sys_poll+0x264/0x3a0
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0194260>] ? __pollwait+0x0/0xd0
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0122160>] ? default_wake_function+0x0/0x10
Jul 30 17:34:22 darkstar last message repeated 3 times
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c01220c5>] ? try_to_wake_up+0x75/0x110
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014aed0>] ? mark_held_locks+0x40/0x80
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014b77b>] ? __lock_acquire+0x24b/0x1040
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014b77b>] ? __lock_acquire+0x24b/0x1040
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c03da639>] ? sock_def_readable+0x59/0x80
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014b77b>] ? __lock_acquire+0x24b/0x1040
Jul 30 17:34:22 darkstar last message repeated 5 times
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0109d0f>] ? native_sched_clock+0x7f/0xb0
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0243b4f>] ? security_file_permission+0xf/0x20
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0186387>] ? rw_verify_area+0x57/0xc0
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0451f5d>] ? _spin_unlock+0x1d/0x20
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0186b94>] ? vfs_read+0x94/0x130
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<f888be60>] ? evdev_read+0x0/0x1b0 [evdev]
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0103dbf>] ? restore_nocheck+0x12/0x15
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c01936cd>] sys_poll+0x2d/0x70
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0103d5e>] syscall_call+0x7/0xb
Jul 30 17:34:22 darkstar kernel: [ 145.111510] =======================
Jul 30 17:34:22 darkstar kernel: [ 145.111510] f63efd94 00000092 00000002 00000000 00000000 f5064028 f7049d80 00000001
Jul 30 17:34:22 darkstar kernel: [ 145.111510] 00000000 c0669060 c066c580 c066c580 c066c580 00000000 f7049ed4 c1e5d580
Jul 30 17:34:22 darkstar kernel: [ 145.111510] 00000000 00000000 00000135 f702e080 f7049d80 00000003 00000000 f7049d80
Jul 30 17:34:22 darkstar kernel: [ 145.111510] Call Trace:
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0450265>] schedule_timeout+0x75/0xc0
Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0436b98>] unix_stream_recvmsg+0x2a8/0x4e0


We're missing part of the sysrq traces. I noticed this a while back.
We need this info to find the D-state processes.

It's _possible_ that the kernel is actually trying to print this info,
but it's doing it at a different prink facility level from the stack
output, so we're only seeing part of the printing.

Please try changing your logging level (dmesg -n 0, or dmesg -n 7 - I
always forget which). Then retry the sysrq-T. If that successfully
causes _all_ the output to be presented, then please rerun the above
test.

You might want to compress the file too - I don't know whether the vger
servers will accept a 730kB mail.

2008-07-31 02:06:52

by Dave Young

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Wed, Jul 30, 2008 at 6:10 PM, Andrew Morton
<[email protected]> wrote:
> On Wed, 30 Jul 2008 18:02:03 +0800 "Dave Young" <[email protected]> wrote:
>
>> On Wed, Jul 30, 2008 at 5:08 PM, Andrew Morton
>> <[email protected]> wrote:
>> > On Tue, 29 Jul 2008 14:32:04 +0200 Johannes Berg <[email protected]> wrote:
>> >
>> >> On Tue, 2008-07-29 at 13:57 +0800, Dave Young wrote:
>> >> > While set my wlan0 to ad-hoc mode and then set the essid to something,
>> >> > this bug triggered.
>> >> >
>> >> > I suspect it's related to debugfs or mutex? But I'm not sure.
>> >> >
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493]
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EIP is at mutex_lock_nested+0x1b3/0x280
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] DR6: ffff0ff0 DR7: 00000400
>> >> > Jul 29 13:01:46 darkstar kernel: [ 8033.736493] [<c02375a6>] ? debugfs_create_file+0x46/0x210
>> >>
>> >> It looks like it's getting stuck on
>> >> mutex_lock(&parent->d_inode->i_mutex);
>> >>
>> >> inside debugfs_create_by_name, but I have no idea why that would happen.
>> >>
>> >
>> > Dave, please try sysrq-W or sysrq-T during that 60 seconds, see if we
>> > can find another task which is holding onto that lock.
>> >
>> >
>>
>> Andrew, the keyboard doesn't response after the bug triggered.
>>
>> So I just keep sysrq-W / sysrq-T every several seconds after I set the
>> wlan0 interface.
>> Attached please see the long log file.
>
> argh. Look:
>
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] f703fbf8 00000082 00000002 f703fbe4 f703fbd8 00000000 f704ac40 f703fbcc
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] 00000046 c0669060 c066c580 c066c580 c066c580 f703fbe8 f704ad94 c1eaf580
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] 00000001 f704ac40 00000000 f702ca00 c0452697 ffff55ba 00000000 c014b06d
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] Call Trace:
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0452697>] ? _spin_unlock_irqrestore+0x47/0x60
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014b06d>] ? trace_hardirqs_on+0xbd/0x140
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0450265>] schedule_timeout+0x75/0xc0
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c01942c0>] ? __pollwait+0x60/0xd0
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<f888b02e>] ? evdev_poll+0x2e/0x60 [evdev]
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0193564>] do_sys_poll+0x264/0x3a0
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0194260>] ? __pollwait+0x0/0xd0
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0122160>] ? default_wake_function+0x0/0x10
> Jul 30 17:34:22 darkstar last message repeated 3 times
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c01220c5>] ? try_to_wake_up+0x75/0x110
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014aed0>] ? mark_held_locks+0x40/0x80
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014b77b>] ? __lock_acquire+0x24b/0x1040
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014b77b>] ? __lock_acquire+0x24b/0x1040
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c03da639>] ? sock_def_readable+0x59/0x80
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c014b77b>] ? __lock_acquire+0x24b/0x1040
> Jul 30 17:34:22 darkstar last message repeated 5 times
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0109d0f>] ? native_sched_clock+0x7f/0xb0
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0243b4f>] ? security_file_permission+0xf/0x20
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0186387>] ? rw_verify_area+0x57/0xc0
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0451f5d>] ? _spin_unlock+0x1d/0x20
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0186b94>] ? vfs_read+0x94/0x130
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<f888be60>] ? evdev_read+0x0/0x1b0 [evdev]
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0103dbf>] ? restore_nocheck+0x12/0x15
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c01936cd>] sys_poll+0x2d/0x70
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0103d5e>] syscall_call+0x7/0xb
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] =======================
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] f63efd94 00000092 00000002 00000000 00000000 f5064028 f7049d80 00000001
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] 00000000 c0669060 c066c580 c066c580 c066c580 00000000 f7049ed4 c1e5d580
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] 00000000 00000000 00000135 f702e080 f7049d80 00000003 00000000 f7049d80
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] Call Trace:
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0450265>] schedule_timeout+0x75/0xc0
> Jul 30 17:34:22 darkstar kernel: [ 145.111510] [<c0436b98>] unix_stream_recvmsg+0x2a8/0x4e0
>
>
> We're missing part of the sysrq traces. I noticed this a while back.
> We need this info to find the D-state processes.
>
> It's _possible_ that the kernel is actually trying to print this info,
> but it's doing it at a different prink facility level from the stack
> output, so we're only seeing part of the printing.
>
> Please try changing your logging level (dmesg -n 0, or dmesg -n 7 - I
> always forget which). Then retry the sysrq-T. If that successfully
> causes _all_ the output to be presented, then please rerun the above
> test.
>
> You might want to compress the file too - I don't know whether the vger
> servers will accept a 730kB mail.
>
>

Ok here it is.
BTW, I run "klogd -c 7" after boot

This time I get a kmalloc poison overwritten:

Jul 31 09:51:11 darkstar kernel: [ 2024.960036] BUG kmalloc-4096:
Poison overwritten
Jul 31 09:51:11 darkstar kernel: [ 2024.960036]
-----------------------------------------------------------------------------
Jul 31 09:51:11 darkstar kernel: [ 2024.960036]
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] INFO:
0xf6f3a080-0xf6f3a0ef. First byte 0x80 instead of 0x6b
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] INFO: Allocated in
dev_alloc_skb+0x1c/0x30 age=3642 cpu=0 pid=0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] INFO: Freed in
skb_release_data+0x57/0x80 age=3146 cpu=0 pid=2398
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] INFO: Slab 0xc1c05440
objects=7 used=3 fp=0xf6f3a060 flags=0x400020c3
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] INFO: Object
0xf6f3a060 @offset=8288 fp=0xf6f39030
Jul 31 09:51:11 darkstar kernel: [ 2024.960036]
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Bytes b4 0xf6f3a050:
5e 09 00 00 57 c9 05 00 5a 5a 5a 5a 5a 5a 5a 5a ^...WÉ..ZZZZZZZZ
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Object 0xf6f3a060:
6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Object 0xf6f3a070:
6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Object 0xf6f3a080:
80 00 00 00 ff ff ff ff ff ff 00 17 7b 00 46 40 ....ÿÿÿÿÿÿ..{.F@
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Object 0xf6f3a090:
00 17 7b 00 46 40 30 09 81 21 08 7a 21 00 00 00 ..{.F@0..!.z!...
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Object 0xf6f3a0a0:
64 00 21 04 00 07 00 00 00 00 00 00 00 01 08 82 d.!.............
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Object 0xf6f3a0b0:
84 8b 0c 12 96 18 24 03 01 01 05 04 00 02 00 00 ......$.........
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Object 0xf6f3a0c0:
07 06 43 4e 20 01 0d 14 2a 01 00 32 04 30 48 60 ..CN....*..2.0H`
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Object 0xf6f3a0d0:
6c dd 18 00 17 7b 01 04 00 00 00 01 00 00 00 10 lÝ...{..........
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Redzone 0xf6f3b060:
bb bb bb bb »»»»
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Padding 0xf6f3b088:
5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] Pid: 0, comm: swapper
Tainted: G W 2.6.26-smp #2
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c0180f5d>]
print_trailer+0xad/0xf0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c018103b>]
check_bytes_and_report+0x9b/0xc0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c018145e>]
check_object+0x19e/0x1e0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c01821a4>]
__slab_alloc+0x454/0x4f0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c01834d6>]
__kmalloc_track_caller+0xe6/0xf0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c03dd1ec>] ?
dev_alloc_skb+0x1c/0x30
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c03dd1ec>] ?
dev_alloc_skb+0x1c/0x30
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c03dce79>]
__alloc_skb+0x49/0x100
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c03dd1ec>]
dev_alloc_skb+0x1c/0x30
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<f8a58599>]
ath5k_rxbuf_setup+0x39/0x200 [ath5k]
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<f8a5a697>]
ath5k_tasklet_rx+0x127/0x5c0 [ath5k]
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c014969a>] ?
print_lock_contention_bug+0x1a/0xe0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c012eafc>]
tasklet_action+0x4c/0xc0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c012e463>]
__do_softirq+0x93/0x120
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c012e547>]
do_softirq+0x57/0x60
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c012ea29>] irq_exit+0x69/0x80
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c0106b55>] do_IRQ+0x45/0x80
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c010a5d0>] ?
mwait_idle+0x0/0x50
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c0104752>]
common_interrupt+0x2e/0x34
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c010a5d0>] ?
mwait_idle+0x0/0x50
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c010a609>] ?
mwait_idle+0x39/0x50
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c01026e0>] cpu_idle+0x60/0xd0
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] [<c043c8ce>]
rest_init+0x4e/0x60
Jul 31 09:51:11 darkstar kernel: [ 2024.960036] =======================


The syslog detail please see the attachment.

--
Regards
dave


Attachments:
(No filename) (10.86 kB)
dmesg.gz (17.52 kB)
Download all attachments

2008-07-31 03:01:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Thu, 31 Jul 2008 10:06:31 +0800 "Dave Young" <[email protected]> wrote:

>
> Ok here it is.
> BTW, I run "klogd -c 7" after boot

The sysrq output is still missing lots of stuff. I guess we broke it.

>
> This time I get a kmalloc poison overwritten:
>

<fixes wordwrapping, cleans stuff up>

=============================================================================
BUG kmalloc-4096: Poison overwritten
-----------------------------------------------------------------------------

INFO: 0xf6f3a080-0xf6f3a0ef. First byte 0x80 instead of 0x6b
INFO: Allocated in dev_alloc_skb+0x1c/0x30 age=3642 cpu=0 pid=0
INFO: Freed in skb_release_data+0x57/0x80 age=3146 cpu=0 pid=2398
INFO: Slab 0xc1c05440 objects=7 used=3 fp=0xf6f3a060 flags=0x400020c3
INFO: Object 0xf6f3a060 @offset=8288 fp=0xf6f39030

Bytes b4 0xf6f3a050: 5e 09 00 00 57 c9 05 00 5a 5a 5a 5a 5a 5a 5a 5a ^...W?..ZZZZZZZZ
Object 0xf6f3a060: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf6f3a070: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf6f3a080: 80 00 00 00 ff ff ff ff ff ff 00 17 7b 00 46 40 ....??????..{.F@
Object 0xf6f3a090: 00 17 7b 00 46 40 30 09 81 21 08 7a 21 00 00 00 ..{.F@0..!.z!...
Object 0xf6f3a0a0: 64 00 21 04 00 07 00 00 00 00 00 00 00 01 08 82 d.!.............
Object 0xf6f3a0b0: 84 8b 0c 12 96 18 24 03 01 01 05 04 00 02 00 00 ......$.........
Object 0xf6f3a0c0: 07 06 43 4e 20 01 0d 14 2a 01 00 32 04 30 48 60 ..CN....*..2.0H`
Object 0xf6f3a0d0: 6c dd 18 00 17 7b 01 04 00 00 00 01 00 00 00 10 l?...{..........
Redzone 0xf6f3b060: bb bb bb bb ????
Padding 0xf6f3b088: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
Pid: 0, comm: swapper Tainted: G W 2.6.26-smp #2
[<c0180f5d>] print_trailer+0xad/0xf0
[<c018103b>] check_bytes_and_report+0x9b/0xc0
[<c018145e>] check_object+0x19e/0x1e0
[<c01821a4>] __slab_alloc+0x454/0x4f0
[<c01834d6>] __kmalloc_track_caller+0xe6/0xf0
[<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30
[<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30
[<c03dce79>] __alloc_skb+0x49/0x100
[<c03dd1ec>] dev_alloc_skb+0x1c/0x30
[<f8a58599>] ath5k_rxbuf_setup+0x39/0x200 [ath5k]
[<f8a5a697>] ath5k_tasklet_rx+0x127/0x5c0 [ath5k]
[<c014969a>] ? print_lock_contention_bug+0x1a/0xe0
[<c012eafc>] tasklet_action+0x4c/0xc0
[<c012e463>] __do_softirq+0x93/0x120
[<c012e547>] do_softirq+0x57/0x60
[<c012ea29>] irq_exit+0x69/0x80
[<c0106b55>] do_IRQ+0x45/0x80
[<c010a5d0>] ? mwait_idle+0x0/0x50
[<c0104752>] common_interrupt+0x2e/0x34
[<c010a5d0>] ? mwait_idle+0x0/0x50
[<c010a609>] ? mwait_idle+0x39/0x50
[<c01026e0>] cpu_idle+0x60/0xd0
[<c043c8ce>] rest_init+0x4e/0x60
=======================
FIX kmalloc-4096: Restoring 0xf6f3a080-0xf6f3a0ef=0x6b

FIX kmalloc-4096: Marking all objects used
[<c0243b4f>] ? security_file_permission+0xf/0x20
[<c019436f>] sys_select+0x3f/0x190
[<c01878e9>] ? fput+0x19/0x20
[<c0103dbf>] ? restore_nocheck+0x12/0x15
[<c014b06d>] ? trace_hardirqs_on+0xbd/0x140
[<c0103d5e>] syscall_call+0x7/0xb
=======================

argh, that stuff hurts my brain. None of the numbers seem to make any
sense for a 4k allocation :( Pekka, do you have time to decrypt this?

Dave, could you please remind us which net driver was in use here?

2008-07-31 03:02:54

by Dave Young

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Thu, Jul 31, 2008 at 10:56 AM, Andrew Morton<[email protected]> wrote:> On Thu, 31 Jul 2008 10:06:31 +0800 "Dave Young" <[email protected]> wrote:>>>>> Ok here it is.>> BTW, I run "klogd -c 7" after boot>> The sysrq output is still missing lots of stuff. I guess we broke it.>>>>> This time I get a kmalloc poison overwritten:>>>> <fixes wordwrapping, cleans stuff up>
Thanks, sorry for that caused by webmail.
>> =============================================================================> BUG kmalloc-4096: Poison overwritten> ----------------------------------------------------------------------------->> INFO: 0xf6f3a080-0xf6f3a0ef. First byte 0x80 instead of 0x6b> INFO: Allocated in dev_alloc_skb+0x1c/0x30 age=3642 cpu=0 pid=0> INFO: Freed in skb_release_data+0x57/0x80 age=3146 cpu=0 pid=2398> INFO: Slab 0xc1c05440 objects=7 used=3 fp=0xf6f3a060 flags=0x400020c3> INFO: Object 0xf6f3a060 @offset=8288 fp=0xf6f39030>> Bytes b4 0xf6f3a050: 5e 09 00 00 57 c9 05 00 5a 5a 5a 5a 5a 5a 5a 5a ^...WÉ..ZZZZZZZZ> Object 0xf6f3a060: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk> Object 0xf6f3a070: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk> Object 0xf6f3a080: 80 00 00 00 ff ff ff ff ff ff 00 17 7b 00 46 40 ....ÿÿÿÿÿÿ..{.F@> Object 0xf6f3a090: 00 17 7b 00 46 40 30 09 81 21 08 7a 21 00 00 00 ..{.F@0..!.z!...> Object 0xf6f3a0a0: 64 00 21 04 00 07 00 00 00 00 00 00 00 01 08 82 d.!.............> Object 0xf6f3a0b0: 84 8b 0c 12 96 18 24 03 01 01 05 04 00 02 00 00 ......$.........> Object 0xf6f3a0c0: 07 06 43 4e 20 01 0d 14 2a 01 00 32 04 30 48 60 ..CN....*..2.0H`> Object 0xf6f3a0d0: 6c dd 18 00 17 7b 01 04 00 00 00 01 00 00 00 10 lÝ...{..........> Redzone 0xf6f3b060: bb bb bb bb »»»»> Padding 0xf6f3b088: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ> Pid: 0, comm: swapper Tainted: G W 2.6.26-smp #2> [<c0180f5d>] print_trailer+0xad/0xf0> [<c018103b>] check_bytes_and_report+0x9b/0xc0> [<c018145e>] check_object+0x19e/0x1e0> [<c01821a4>] __slab_alloc+0x454/0x4f0> [<c01834d6>] __kmalloc_track_caller+0xe6/0xf0> [<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30> [<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30> [<c03dce79>] __alloc_skb+0x49/0x100> [<c03dd1ec>] dev_alloc_skb+0x1c/0x30> [<f8a58599>] ath5k_rxbuf_setup+0x39/0x200 [ath5k]> [<f8a5a697>] ath5k_tasklet_rx+0x127/0x5c0 [ath5k]> [<c014969a>] ? print_lock_contention_bug+0x1a/0xe0> [<c012eafc>] tasklet_action+0x4c/0xc0> [<c012e463>] __do_softirq+0x93/0x120> [<c012e547>] do_softirq+0x57/0x60> [<c012ea29>] irq_exit+0x69/0x80> [<c0106b55>] do_IRQ+0x45/0x80> [<c010a5d0>] ? mwait_idle+0x0/0x50> [<c0104752>] common_interrupt+0x2e/0x34> [<c010a5d0>] ? mwait_idle+0x0/0x50> [<c010a609>] ? mwait_idle+0x39/0x50> [<c01026e0>] cpu_idle+0x60/0xd0> [<c043c8ce>] rest_init+0x4e/0x60> =======================> FIX kmalloc-4096: Restoring 0xf6f3a080-0xf6f3a0ef=0x6b>> FIX kmalloc-4096: Marking all objects used> [<c0243b4f>] ? security_file_permission+0xf/0x20> [<c019436f>] sys_select+0x3f/0x190> [<c01878e9>] ? fput+0x19/0x20> [<c0103dbf>] ? restore_nocheck+0x12/0x15> [<c014b06d>] ? trace_hardirqs_on+0xbd/0x140> [<c0103d5e>] syscall_call+0x7/0xb> =======================>> argh, that stuff hurts my brain. None of the numbers seem to make any> sense for a 4k allocation :( Pekka, do you have time to decrypt this?>> Dave, could you please remind us which net driver was in use here?>>
It's the ath5k
-- Regardsdave????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2008-07-31 09:15:44

by Pekka Enberg

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

Hi Andrew,

On Wed, 30 Jul 2008, Andrew Morton wrote:
> > Ok here it is.
> > BTW, I run "klogd -c 7" after boot
>
> The sysrq output is still missing lots of stuff. I guess we broke it.
>
> >
> > This time I get a kmalloc poison overwritten:
> >
>
> argh, that stuff hurts my brain. None of the numbers seem to make any
> sense for a 4k allocation :( Pekka, do you have time to decrypt this?

Sure. Here goes:

On Wed, 30 Jul 2008, Andrew Morton wrote:
> <fixes wordwrapping, cleans stuff up>
>
> =============================================================================
> BUG kmalloc-4096: Poison overwritten
> -----------------------------------------------------------------------------
>
> INFO: 0xf6f3a080-0xf6f3a0ef. First byte 0x80 instead of 0x6b

That's POISON_FREE ("0x6b") overwritten which means use-after-free for
the range of 0xf6f3a080 - 0xf6f3a0ef (112 bytes). The rest of the
object is okay but the SLUB debugging code only dumps the first 128 bytes
of the object which is why we don't see the full corruption.

2.6.27-rc1 should dump the full object so I'm assuming this is pre -rc1?

> INFO: Allocated in dev_alloc_skb+0x1c/0x30 age=3642 cpu=0 pid=0
> INFO: Freed in skb_release_data+0x57/0x80 age=3146 cpu=0 pid=2398

So the corrupted object was free'd by skb_release_data() so we need to
look for a driver or the networking stack calling that function too early.

> INFO: Slab 0xc1c05440 objects=7 used=3 fp=0xf6f3a060 flags=0x400020c3
> INFO: Object 0xf6f3a060 @offset=8288 fp=0xf6f39030
>
> Bytes b4 0xf6f3a050: 5e 09 00 00 57 c9 05 00 5a 5a 5a 5a 5a 5a 5a 5a ^...W?..ZZZZZZZZ

The object starts here (the poison values for first 32 bytes are okay):

> Object 0xf6f3a060: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
> Object 0xf6f3a070: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk

And here are the first 96 bytes of the corruption:

> Object 0xf6f3a080: 80 00 00 00 ff ff ff ff ff ff 00 17 7b 00 46 40 ....??????..{.F@
> Object 0xf6f3a090: 00 17 7b 00 46 40 30 09 81 21 08 7a 21 00 00 00 ..{.F@0..!.z!...
> Object 0xf6f3a0a0: 64 00 21 04 00 07 00 00 00 00 00 00 00 01 08 82 d.!.............
> Object 0xf6f3a0b0: 84 8b 0c 12 96 18 24 03 01 01 05 04 00 02 00 00 ......$.........
> Object 0xf6f3a0c0: 07 06 43 4e 20 01 0d 14 2a 01 00 32 04 30 48 60 ..CN....*..2.0H`
> Object 0xf6f3a0d0: 6c dd 18 00 17 7b 01 04 00 00 00 01 00 00 00 10 l?...{..........

But I think that's just the payload of a SKB?

> Redzone 0xf6f3b060: bb bb bb bb ????

The red-zone has SLUB_RED_INACTIVE ("0xbb") which reinforces
use-after-free.

> Padding 0xf6f3b088: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
> Pid: 0, comm: swapper Tainted: G W 2.6.26-smp #2
> [<c0180f5d>] print_trailer+0xad/0xf0
> [<c018103b>] check_bytes_and_report+0x9b/0xc0
> [<c018145e>] check_object+0x19e/0x1e0
> [<c01821a4>] __slab_alloc+0x454/0x4f0
> [<c01834d6>] __kmalloc_track_caller+0xe6/0xf0
> [<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30
> [<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30
> [<c03dce79>] __alloc_skb+0x49/0x100
> [<c03dd1ec>] dev_alloc_skb+0x1c/0x30
> [<f8a58599>] ath5k_rxbuf_setup+0x39/0x200 [ath5k]
> [<f8a5a697>] ath5k_tasklet_rx+0x127/0x5c0 [ath5k]
> [<c014969a>] ? print_lock_contention_bug+0x1a/0xe0
> [<c012eafc>] tasklet_action+0x4c/0xc0
> [<c012e463>] __do_softirq+0x93/0x120
> [<c012e547>] do_softirq+0x57/0x60
> [<c012ea29>] irq_exit+0x69/0x80
> [<c0106b55>] do_IRQ+0x45/0x80
> [<c010a5d0>] ? mwait_idle+0x0/0x50
> [<c0104752>] common_interrupt+0x2e/0x34
> [<c010a5d0>] ? mwait_idle+0x0/0x50
> [<c010a609>] ? mwait_idle+0x39/0x50
> [<c01026e0>] cpu_idle+0x60/0xd0
> [<c043c8ce>] rest_init+0x4e/0x60
> =======================
> FIX kmalloc-4096: Restoring 0xf6f3a080-0xf6f3a0ef=0x6b
>
> FIX kmalloc-4096: Marking all objects used
> [<c0243b4f>] ? security_file_permission+0xf/0x20
> [<c019436f>] sys_select+0x3f/0x190
> [<c01878e9>] ? fput+0x19/0x20
> [<c0103dbf>] ? restore_nocheck+0x12/0x15
> [<c014b06d>] ? trace_hardirqs_on+0xbd/0x140
> [<c0103d5e>] syscall_call+0x7/0xb
> =======================
>
> Dave, could you please remind us which net driver was in use here?

There's ath5k in the stack trace but that, of course, doesn't
automatically mean it's at fault here. It could have been just the poor
bastard who was the next to allocate 4 KB with kmalloc() noticing the
corruption.

Hope this helps!

Pekka

2008-07-31 09:50:38

by Tomas Winkler

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Thu, Jul 31, 2008 at 12:15 PM, Pekka J Enberg <[email protected]> wrote:
> Hi Andrew,
>
> On Wed, 30 Jul 2008, Andrew Morton wrote:
>> > Ok here it is.
>> > BTW, I run "klogd -c 7" after boot
>>
>> The sysrq output is still missing lots of stuff. I guess we broke it.
>>
>> >
>> > This time I get a kmalloc poison overwritten:
>> >
>>
>> argh, that stuff hurts my brain. None of the numbers seem to make any
>> sense for a 4k allocation :( Pekka, do you have time to decrypt this?
>
> Sure. Here goes:
>
> On Wed, 30 Jul 2008, Andrew Morton wrote:
>> <fixes wordwrapping, cleans stuff up>
>>
>> =============================================================================
>> BUG kmalloc-4096: Poison overwritten
>> -----------------------------------------------------------------------------
>>
>> INFO: 0xf6f3a080-0xf6f3a0ef. First byte 0x80 instead of 0x6b
>
> That's POISON_FREE ("0x6b") overwritten which means use-after-free for
> the range of 0xf6f3a080 - 0xf6f3a0ef (112 bytes). The rest of the
> object is okay but the SLUB debugging code only dumps the first 128 bytes
> of the object which is why we don't see the full corruption.
>
> 2.6.27-rc1 should dump the full object so I'm assuming this is pre -rc1?
>
>> INFO: Allocated in dev_alloc_skb+0x1c/0x30 age=3642 cpu=0 pid=0
>> INFO: Freed in skb_release_data+0x57/0x80 age=3146 cpu=0 pid=2398
>
> So the corrupted object was free'd by skb_release_data() so we need to
> look for a driver or the networking stack calling that function too early.
>
>> INFO: Slab 0xc1c05440 objects=7 used=3 fp=0xf6f3a060 flags=0x400020c3
>> INFO: Object 0xf6f3a060 @offset=8288 fp=0xf6f39030
>>
>> Bytes b4 0xf6f3a050: 5e 09 00 00 57 c9 05 00 5a 5a 5a 5a 5a 5a 5a 5a ^...W?..ZZZZZZZZ
>
> The object starts here (the poison values for first 32 bytes are okay):
>
>> Object 0xf6f3a060: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
>> Object 0xf6f3a070: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
>
> And here are the first 96 bytes of the corruption:
>
>> Object 0xf6f3a080: 80 00 00 00 ff ff ff ff ff ff 00 17 7b 00 46 40 ....??????..{.F@
>> Object 0xf6f3a090: 00 17 7b 00 46 40 30 09 81 21 08 7a 21 00 00 00 ..{.F@0..!.z!...
>> Object 0xf6f3a0a0: 64 00 21 04 00 07 00 00 00 00 00 00 00 01 08 82 d.!.............
>> Object 0xf6f3a0b0: 84 8b 0c 12 96 18 24 03 01 01 05 04 00 02 00 00 ......$.........
>> Object 0xf6f3a0c0: 07 06 43 4e 20 01 0d 14 2a 01 00 32 04 30 48 60 ..CN....*..2.0H`
>> Object 0xf6f3a0d0: 6c dd 18 00 17 7b 01 04 00 00 00 01 00 00 00 10 l?...{..........
>
> But I think that's just the payload of a SKB?
>
>> Redzone 0xf6f3b060: bb bb bb bb ????
>
> The red-zone has SLUB_RED_INACTIVE ("0xbb") which reinforces
> use-after-free.
>
>> Padding 0xf6f3b088: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
>> Pid: 0, comm: swapper Tainted: G W 2.6.26-smp #2
>> [<c0180f5d>] print_trailer+0xad/0xf0
>> [<c018103b>] check_bytes_and_report+0x9b/0xc0
>> [<c018145e>] check_object+0x19e/0x1e0
>> [<c01821a4>] __slab_alloc+0x454/0x4f0
>> [<c01834d6>] __kmalloc_track_caller+0xe6/0xf0
>> [<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30
>> [<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30
>> [<c03dce79>] __alloc_skb+0x49/0x100
>> [<c03dd1ec>] dev_alloc_skb+0x1c/0x30
>> [<f8a58599>] ath5k_rxbuf_setup+0x39/0x200 [ath5k]
>> [<f8a5a697>] ath5k_tasklet_rx+0x127/0x5c0 [ath5k]
>> [<c014969a>] ? print_lock_contention_bug+0x1a/0xe0
>> [<c012eafc>] tasklet_action+0x4c/0xc0
>> [<c012e463>] __do_softirq+0x93/0x120
>> [<c012e547>] do_softirq+0x57/0x60
>> [<c012ea29>] irq_exit+0x69/0x80
>> [<c0106b55>] do_IRQ+0x45/0x80
>> [<c010a5d0>] ? mwait_idle+0x0/0x50
>> [<c0104752>] common_interrupt+0x2e/0x34
>> [<c010a5d0>] ? mwait_idle+0x0/0x50
>> [<c010a609>] ? mwait_idle+0x39/0x50
>> [<c01026e0>] cpu_idle+0x60/0xd0
>> [<c043c8ce>] rest_init+0x4e/0x60
>> =======================
>> FIX kmalloc-4096: Restoring 0xf6f3a080-0xf6f3a0ef=0x6b
>>
>> FIX kmalloc-4096: Marking all objects used
>> [<c0243b4f>] ? security_file_permission+0xf/0x20
>> [<c019436f>] sys_select+0x3f/0x190
>> [<c01878e9>] ? fput+0x19/0x20
>> [<c0103dbf>] ? restore_nocheck+0x12/0x15
>> [<c014b06d>] ? trace_hardirqs_on+0xbd/0x140
>> [<c0103d5e>] syscall_call+0x7/0xb
>> =======================
>>
>> Dave, could you please remind us which net driver was in use here?
>
> There's ath5k in the stack trace but that, of course, doesn't
> automatically mean it's at fault here. It could have been just the poor
> bastard who was the next to allocate 4 KB with kmalloc() noticing the
> corruption.
>
> Hope this helps!


I've seen something similar with fragmentation code in iwl4965 but I
can reproduce it only when using SLAB. With SLUB it didn't shown up.
Does anyone know what is difference between SLAB and SLUB in this context.?
Tomas

2008-07-31 09:53:20

by Pekka Enberg

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Thu, 2008-07-31 at 12:50 +0300, Tomas Winkler wrote:
> > There's ath5k in the stack trace but that, of course, doesn't
> > automatically mean it's at fault here. It could have been just the poor
> > bastard who was the next to allocate 4 KB with kmalloc() noticing the
> > corruption.
> >
> > Hope this helps!
>
> I've seen something similar with fragmentation code in iwl4965 but I
> can reproduce it only when using SLAB. With SLUB it didn't shown up.
> Does anyone know what is difference between SLAB and SLUB in this context.?

Do you have CONFIG_SLUB_DEBUG_ON enabled? (Note: if you only have
CONFIG_SLUB_DEBUG enabled, then you need to enable debugging at run-time
by passing slub_debug to kernel command line.)

2008-07-31 10:29:24

by Tomas Winkler

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Thu, Jul 31, 2008 at 12:53 PM, Pekka Enberg <[email protected]> wrote:
> On Thu, 2008-07-31 at 12:50 +0300, Tomas Winkler wrote:
>> > There's ath5k in the stack trace but that, of course, doesn't
>> > automatically mean it's at fault here. It could have been just the poor
>> > bastard who was the next to allocate 4 KB with kmalloc() noticing the
>> > corruption.
>> >
>> > Hope this helps!
>>
>> I've seen something similar with fragmentation code in iwl4965 but I
>> can reproduce it only when using SLAB. With SLUB it didn't shown up.
>> Does anyone know what is difference between SLAB and SLUB in this context.?
>
> Do you have CONFIG_SLUB_DEBUG_ON enabled? (Note: if you only have
> CONFIG_SLUB_DEBUG enabled, then you need to enable debugging at run-time
> by passing slub_debug to kernel command line.)

Yes sure I've done that with slub_debug=FZPU
Tomas

2008-08-01 07:32:24

by Dave Young

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Thu, Jul 31, 2008 at 5:15 PM, Pekka J Enberg <[email protected]> wrote:> Hi Andrew,>> On Wed, 30 Jul 2008, Andrew Morton wrote:>> > Ok here it is.>> > BTW, I run "klogd -c 7" after boot>>>> The sysrq output is still missing lots of stuff. I guess we broke it.>>>> >>> > This time I get a kmalloc poison overwritten:>> >>>>> argh, that stuff hurts my brain. None of the numbers seem to make any>> sense for a 4k allocation :( Pekka, do you have time to decrypt this?>> Sure. Here goes:>> On Wed, 30 Jul 2008, Andrew Morton wrote:>> <fixes wordwrapping, cleans stuff up>>>>> =============================================================================>> BUG kmalloc-4096: Poison overwritten>> ----------------------------------------------------------------------------->>>> INFO: 0xf6f3a080-0xf6f3a0ef. First byte 0x80 instead of 0x6b>> That's POISON_FREE ("0x6b") overwritten which means use-after-free for> the range of 0xf6f3a080 - 0xf6f3a0ef (112 bytes). The rest of the> object is okay but the SLUB debugging code only dumps the first 128 bytes> of the object which is why we don't see the full corruption.>> 2.6.27-rc1 should dump the full object so I'm assuming this is pre -rc1?
Yes, it's 2.6.26
>>> INFO: Allocated in dev_alloc_skb+0x1c/0x30 age=3642 cpu=0 pid=0>> INFO: Freed in skb_release_data+0x57/0x80 age=3146 cpu=0 pid=2398>> So the corrupted object was free'd by skb_release_data() so we need to> look for a driver or the networking stack calling that function too early.>>> INFO: Slab 0xc1c05440 objects=7 used=3 fp=0xf6f3a060 flags=0x400020c3>> INFO: Object 0xf6f3a060 @offset=8288 fp=0xf6f39030>>>> Bytes b4 0xf6f3a050: 5e 09 00 00 57 c9 05 00 5a 5a 5a 5a 5a 5a 5a 5a ^...WÉ..ZZZZZZZZ>> The object starts here (the poison values for first 32 bytes are okay):>>> Object 0xf6f3a060: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk>> Object 0xf6f3a070: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk>> And here are the first 96 bytes of the corruption:>>> Object 0xf6f3a080: 80 00 00 00 ff ff ff ff ff ff 00 17 7b 00 46 40 ....ÿÿÿÿÿÿ..{.F@>> Object 0xf6f3a090: 00 17 7b 00 46 40 30 09 81 21 08 7a 21 00 00 00 ..{.F@0..!.z!...>> Object 0xf6f3a0a0: 64 00 21 04 00 07 00 00 00 00 00 00 00 01 08 82 d.!.............>> Object 0xf6f3a0b0: 84 8b 0c 12 96 18 24 03 01 01 05 04 00 02 00 00 ......$.........>> Object 0xf6f3a0c0: 07 06 43 4e 20 01 0d 14 2a 01 00 32 04 30 48 60 ..CN....*..2.0H`>> Object 0xf6f3a0d0: 6c dd 18 00 17 7b 01 04 00 00 00 01 00 00 00 10 lÝ...{..........>> But I think that's just the payload of a SKB?>>> Redzone 0xf6f3b060: bb bb bb bb »»»»>> The red-zone has SLUB_RED_INACTIVE ("0xbb") which reinforces> use-after-free.>>> Padding 0xf6f3b088: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ>> Pid: 0, comm: swapper Tainted: G W 2.6.26-smp #2>> [<c0180f5d>] print_trailer+0xad/0xf0>> [<c018103b>] check_bytes_and_report+0x9b/0xc0>> [<c018145e>] check_object+0x19e/0x1e0>> [<c01821a4>] __slab_alloc+0x454/0x4f0>> [<c01834d6>] __kmalloc_track_caller+0xe6/0xf0>> [<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30>> [<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30>> [<c03dce79>] __alloc_skb+0x49/0x100>> [<c03dd1ec>] dev_alloc_skb+0x1c/0x30>> [<f8a58599>] ath5k_rxbuf_setup+0x39/0x200 [ath5k]>> [<f8a5a697>] ath5k_tasklet_rx+0x127/0x5c0 [ath5k]>> [<c014969a>] ? print_lock_contention_bug+0x1a/0xe0>> [<c012eafc>] tasklet_action+0x4c/0xc0>> [<c012e463>] __do_softirq+0x93/0x120>> [<c012e547>] do_softirq+0x57/0x60>> [<c012ea29>] irq_exit+0x69/0x80>> [<c0106b55>] do_IRQ+0x45/0x80>> [<c010a5d0>] ? mwait_idle+0x0/0x50>> [<c0104752>] common_interrupt+0x2e/0x34>> [<c010a5d0>] ? mwait_idle+0x0/0x50>> [<c010a609>] ? mwait_idle+0x39/0x50>> [<c01026e0>] cpu_idle+0x60/0xd0>> [<c043c8ce>] rest_init+0x4e/0x60>> =======================>> FIX kmalloc-4096: Restoring 0xf6f3a080-0xf6f3a0ef=0x6b>>>> FIX kmalloc-4096: Marking all objects used>> [<c0243b4f>] ? security_file_permission+0xf/0x20>> [<c019436f>] sys_select+0x3f/0x190>> [<c01878e9>] ? fput+0x19/0x20>> [<c0103dbf>] ? restore_nocheck+0x12/0x15>> [<c014b06d>] ? trace_hardirqs_on+0xbd/0x140>> [<c0103d5e>] syscall_call+0x7/0xb>> =======================>>>> Dave, could you please remind us which net driver was in use here?>> There's ath5k in the stack trace but that, of course, doesn't> automatically mean it's at fault here. It could have been just the poor> bastard who was the next to allocate 4 KB with kmalloc() noticing the> corruption.>> Hope this helps!>> Pekka
Lockdep helped me, this morning I get a lockdep warning about this,
[ 171.432140] [ INFO: possible recursive locking detected ][ 171.433113] 2.6.27-rc1-smp #4[ 171.434079] ---------------------------------------------[ 171.435039] ath5k_pci/2447 is trying to acquire lock:[ 171.435990] (&sc->lock){--..}, at: [<f89ee9b5>]ath5k_config_interface+0xd5/0x340 [ath5k][ 171.437046][ 171.437048] but task is already holding lock:[ 171.438903] (&sc->lock){--..}, at: [<f89ee91d>]ath5k_config_interface+0x3d/0x340 [ath5k][ 171.439953][ 171.439954] other info that might help us debug this:[ 171.441795] 3 locks held by ath5k_pci/2447:[ 171.442729] #0: ((name)){--..}, at: [<c013a122>] run_workqueue+0x102/0x1d0[ 171.443800] #1: (&(&local->scan_work)->work){--..}, at:[<c013a122>] run_workqueue+0x102/0x1d0[ 171.444859] #2: (&sc->lock){--..}, at: [<f89ee91d>]ath5k_config_interface+0x3d/0x340 [ath5k]
Deadlock happen here, I remove the lock in the sub routine, tested andfixed the problem for me.I will send the patch after a while.
But I still have no idea with the poison overwritten.
-- Regardsdave????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2008-08-04 09:22:27

by Jiri Slaby

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

Dave Young napsal(a):
> On Thu, Jul 31, 2008 at 5:15 PM, Pekka J Enberg <[email protected]> wrote:
>> On Wed, 30 Jul 2008, Andrew Morton wrote:
>>> INFO: Allocated in dev_alloc_skb+0x1c/0x30 age=3642 cpu=0 pid=0
>>> INFO: Freed in skb_release_data+0x57/0x80 age=3146 cpu=0 pid=2398
>> So the corrupted object was free'd by skb_release_data() so we need to
>> look for a driver or the networking stack calling that function too early.
>>
>>> INFO: Slab 0xc1c05440 objects=7 used=3 fp=0xf6f3a060 flags=0x400020c3
>>> INFO: Object 0xf6f3a060 @offset=8288 fp=0xf6f39030
>>>
>>> Bytes b4 0xf6f3a050: 5e 09 00 00 57 c9 05 00 5a 5a 5a 5a 5a 5a 5a 5a ^...WÉ..ZZZZZZZZ
>> The object starts here (the poison values for first 32 bytes are okay):
>>
>>> Object 0xf6f3a060: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
>>> Object 0xf6f3a070: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
>> And here are the first 96 bytes of the corruption:
>>
>>> Object 0xf6f3a080: 80 00 00 00 ff ff ff ff ff ff 00 17 7b 00 46 40 ....ÿÿÿÿÿÿ..{.F@
>>> Object 0xf6f3a090: 00 17 7b 00 46 40 30 09 81 21 08 7a 21 00 00 00 ..{.F@0..!.z!...
>>> Object 0xf6f3a0a0: 64 00 21 04 00 07 00 00 00 00 00 00 00 01 08 82 d.!.............
>>> Object 0xf6f3a0b0: 84 8b 0c 12 96 18 24 03 01 01 05 04 00 02 00 00 ......$.........
>>> Object 0xf6f3a0c0: 07 06 43 4e 20 01 0d 14 2a 01 00 32 04 30 48 60 ..CN....*..2.0H`
>>> Object 0xf6f3a0d0: 6c dd 18 00 17 7b 01 04 00 00 00 01 00 00 00 10 lÝ...{..........
>> But I think that's just the payload of a SKB?

It's a receive frame from ath5k, I suppose. 00:17:7b:00:46:40 is your AP?

>>> Redzone 0xf6f3b060: bb bb bb bb »»»»
>> The red-zone has SLUB_RED_INACTIVE ("0xbb") which reinforces
>> use-after-free.
>>
>>> Padding 0xf6f3b088: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
>>> Pid: 0, comm: swapper Tainted: G W 2.6.26-smp #2
>>> [<c0180f5d>] print_trailer+0xad/0xf0
>>> [<c018103b>] check_bytes_and_report+0x9b/0xc0
>>> [<c018145e>] check_object+0x19e/0x1e0
>>> [<c01821a4>] __slab_alloc+0x454/0x4f0
>>> [<c01834d6>] __kmalloc_track_caller+0xe6/0xf0
>>> [<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30
>>> [<c03dd1ec>] ? dev_alloc_skb+0x1c/0x30
>>> [<c03dce79>] __alloc_skb+0x49/0x100
>>> [<c03dd1ec>] dev_alloc_skb+0x1c/0x30
>>> [<f8a58599>] ath5k_rxbuf_setup+0x39/0x200 [ath5k]
>>> [<f8a5a697>] ath5k_tasklet_rx+0x127/0x5c0 [ath5k]
>>> [<c014969a>] ? print_lock_contention_bug+0x1a/0xe0
>>> [<c012eafc>] tasklet_action+0x4c/0xc0
[...]
>>> =======================
>>> FIX kmalloc-4096: Restoring 0xf6f3a080-0xf6f3a0ef=0x6b

>>> Dave, could you please remind us which net driver was in use here?
>> There's ath5k in the stack trace but that, of course, doesn't
>> automatically mean it's at fault here. It could have been just the poor
>> bastard who was the next to allocate 4 KB with kmalloc() noticing the
>> corruption.

No, unfortunately ath5k *is* likely the culprit. Next time please Cc
[email protected] even if it is only a suspicion.

> But I still have no idea with the poison overwritten.

Could you try patch from
http://lkml.org/lkml/2008/7/15/276
? (I have no idea how reproducible is this by you, it often happens on noisy
channels and/or by lowering RX buffers, i.e. ATH_RXBUF).

[It hit mainline few days ago, I'm going to fwd it to stable.]

2008-08-04 10:01:34

by Jiri Slaby

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

Dave Young napsal(a):
> Lockdep helped me, this morning I get a lockdep warning about this,
>
> [ 171.432140] [ INFO: possible recursive locking detected ]
> [ 171.433113] 2.6.27-rc1-smp #4
> [ 171.434079] ---------------------------------------------
> [ 171.435039] ath5k_pci/2447 is trying to acquire lock:
> [ 171.435990] (&sc->lock){--..}, at: [<f89ee9b5>]
> ath5k_config_interface+0xd5/0x340 [ath5k]
> [ 171.437046]
> [ 171.437048] but task is already holding lock:
> [ 171.438903] (&sc->lock){--..}, at: [<f89ee91d>]
> ath5k_config_interface+0x3d/0x340 [ath5k]
> [ 171.439953]
> [ 171.439954] other info that might help us debug this:
> [ 171.441795] 3 locks held by ath5k_pci/2447:
> [ 171.442729] #0: ((name)){--..}, at: [<c013a122>] run_workqueue+0x102/0x1d0
> [ 171.443800] #1: (&(&local->scan_work)->work){--..}, at:
> [<c013a122>] run_workqueue+0x102/0x1d0
> [ 171.444859] #2: (&sc->lock){--..}, at: [<f89ee91d>]
> ath5k_config_interface+0x3d/0x340 [ath5k]
>
> Deadlock happen here, I remove the lock in the sub routine, tested and
> fixed the problem for me.
> I will send the patch after a while.

That's some kind different. 2.6.26 can't suffer from the thing above
(opposing to 27-rc1).

Pasting the original issue below. Maybe it's been fixed too? I suppose you
don't encounter the problem anymore in 27-rc1 with your/Bob's/mine (fix for
the same thing in the same manner) patch applied?

BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
Modules linked in: radeonfb fb_ddc i2c_algo_bit snd_seq_dummy snd_seq_oss
snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss rfcomm
l2cap bluetooth vfat fuse rtc_cmos rtc_core psmouse intel_agp pl2303 rtc_lib
ath5k e100 i2c_i801 serio_raw usbserial agpgart 3c59x mii snd_hda_intel
snd_pcm mac80211 snd_timer snd_page_alloc cfg80211 snd_hwdep snd soundcore
thermal processor button thermal_sys evdev sg

Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
EIP is at mutex_lock_nested+0x1b3/0x280
EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<c02375a6>] ? debugfs_create_file+0x46/0x210
[<c02375a6>] ? debugfs_create_file+0x46/0x210
[<c02375a6>] debugfs_create_file+0x46/0x210
[<c02377f1>] debugfs_create_dir+0x21/0x30
[<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
[<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
[<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
[<c01394c6>] run_workqueue+0x156/0x1d0
[<c0139472>] ? run_workqueue+0x102/0x1d0
[<f88eba00>] ? sta_info_debugfs_add_work+0x0/0x130 [mac80211]
[<c0139f18>] worker_thread+0x88/0xe0
[<c013cd20>] ? autoremove_wake_function+0x0/0x40
[<c0139e90>] ? worker_thread+0x0/0xe0
[<c013ca32>] kthread+0x42/0x70
[<c013c9f0>] ? kthread+0x0/0x70
[<c0104993>] kernel_thread_helper+0x7/0x14
=======================
BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
Modules linked in: radeonfb fb_ddc i2c_algo_bit snd_seq_dummy snd_seq_oss
snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss rfcomm
l2cap bluetooth vfat fuse rtc_cmos rtc_core psmouse intel_agp pl2303 rtc_lib
ath5k e100 i2c_i801 serio_raw usbserial agpgart 3c59x mii snd_hda_intel
snd_pcm mac80211 snd_timer snd_page_alloc cfg80211 snd_hwdep snd soundcore
thermal processor button thermal_sys evdev sg

Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
EIP is at mutex_lock_nested+0x1b3/0x280
EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<c02375a6>] ? debugfs_create_file+0x46/0x210
[<c02375a6>] ? debugfs_create_file+0x46/0x210
[<c02375a6>] debugfs_create_file+0x46/0x210
[<c02377f1>] debugfs_create_dir+0x21/0x30
[<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
[<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
[<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
[<c01394c6>] run_workqueue+0x156/0x1d0
[<c0139472>] ? run_workqueue+0x102/0x1d0
[<f88eba00>] ? sta_info_debugfs_add_work+0x0/0x130 [mac80211]
[<c0139f18>] worker_thread+0x88/0xe0
[<c013cd20>] ? autoremove_wake_function+0x0/0x40
[<c0139e90>] ? worker_thread+0x0/0xe0
[<c013ca32>] kthread+0x42/0x70
[<c013c9f0>] ? kthread+0x0/0x70
[<c0104993>] kernel_thread_helper+0x7/0x14
=======================

2008-08-05 01:30:00

by Dave Young

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Mon, Aug 4, 2008 at 6:00 PM, Jiri Slaby <[email protected]> wrote:
> Dave Young napsal(a):
>>
>> Lockdep helped me, this morning I get a lockdep warning about this,
>>
>> [ 171.432140] [ INFO: possible recursive locking detected ]
>> [ 171.433113] 2.6.27-rc1-smp #4
>> [ 171.434079] ---------------------------------------------
>> [ 171.435039] ath5k_pci/2447 is trying to acquire lock:
>> [ 171.435990] (&sc->lock){--..}, at: [<f89ee9b5>]
>> ath5k_config_interface+0xd5/0x340 [ath5k]
>> [ 171.437046]
>> [ 171.437048] but task is already holding lock:
>> [ 171.438903] (&sc->lock){--..}, at: [<f89ee91d>]
>> ath5k_config_interface+0x3d/0x340 [ath5k]
>> [ 171.439953]
>> [ 171.439954] other info that might help us debug this:
>> [ 171.441795] 3 locks held by ath5k_pci/2447:
>> [ 171.442729] #0: ((name)){--..}, at: [<c013a122>]
>> run_workqueue+0x102/0x1d0
>> [ 171.443800] #1: (&(&local->scan_work)->work){--..}, at:
>> [<c013a122>] run_workqueue+0x102/0x1d0
>> [ 171.444859] #2: (&sc->lock){--..}, at: [<f89ee91d>]
>> ath5k_config_interface+0x3d/0x340 [ath5k]
>>
>> Deadlock happen here, I remove the lock in the sub routine, tested and
>> fixed the problem for me.
>> I will send the patch after a while.
>
> That's some kind different. 2.6.26 can't suffer from the thing above
> (opposing to 27-rc1).

Yes, it's got with 27-rc1

>
> Pasting the original issue below. Maybe it's been fixed too? I suppose you
> don't encounter the problem anymore in 27-rc1 with your/Bob's/mine (fix for
> the same thing in the same manner) patch applied?

With the patch I cann't reproduce the bug with 27-rc1 now.

>
> BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
> Modules linked in: radeonfb fb_ddc i2c_algo_bit snd_seq_dummy snd_seq_oss
> snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss rfcomm
> l2cap bluetooth vfat fuse rtc_cmos rtc_core psmouse intel_agp pl2303 rtc_lib
> ath5k e100 i2c_i801 serio_raw usbserial agpgart 3c59x mii snd_hda_intel
> snd_pcm mac80211 snd_timer snd_page_alloc cfg80211 snd_hwdep snd soundcore
> thermal processor button thermal_sys evdev sg
>
> Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
> EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
> EIP is at mutex_lock_nested+0x1b3/0x280
> EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
> ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
> DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> DR6: ffff0ff0 DR7: 00000400
> [<c02375a6>] ? debugfs_create_file+0x46/0x210
> [<c02375a6>] ? debugfs_create_file+0x46/0x210
> [<c02375a6>] debugfs_create_file+0x46/0x210
> [<c02377f1>] debugfs_create_dir+0x21/0x30
> [<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
> [<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
> [<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
> [<c01394c6>] run_workqueue+0x156/0x1d0
> [<c0139472>] ? run_workqueue+0x102/0x1d0
> [<f88eba00>] ? sta_info_debugfs_add_work+0x0/0x130 [mac80211]
> [<c0139f18>] worker_thread+0x88/0xe0
> [<c013cd20>] ? autoremove_wake_function+0x0/0x40
> [<c0139e90>] ? worker_thread+0x0/0xe0
> [<c013ca32>] kthread+0x42/0x70
> [<c013c9f0>] ? kthread+0x0/0x70
> [<c0104993>] kernel_thread_helper+0x7/0x14
> =======================
> BUG: soft lockup - CPU#0 stuck for 61s! [events/0:9]
> Modules linked in: radeonfb fb_ddc i2c_algo_bit snd_seq_dummy snd_seq_oss
> snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss rfcomm
> l2cap bluetooth vfat fuse rtc_cmos rtc_core psmouse intel_agp pl2303 rtc_lib
> ath5k e100 i2c_i801 serio_raw usbserial agpgart 3c59x mii snd_hda_intel
> snd_pcm mac80211 snd_timer snd_page_alloc cfg80211 snd_hwdep snd soundcore
> thermal processor button thermal_sys evdev sg
>
> Pid: 9, comm: events/0 Tainted: G W (2.6.26-smp #3)
> EIP: 0060:[<c0450bc3>] EFLAGS: 00000246 CPU: 0
> EIP is at mutex_lock_nested+0x1b3/0x280
> EAX: 00000000 EBX: f7bf5284 ECX: f7c7e000 EDX: f7bf52a8
> ESI: f7bf5288 EDI: 00000246 EBP: f7c7fedc ESP: f7c7fe9c
> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> CR0: 8005003b CR2: b800c5ac CR3: 36e71000 CR4: 000006d0
> DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> DR6: ffff0ff0 DR7: 00000400
> [<c02375a6>] ? debugfs_create_file+0x46/0x210
> [<c02375a6>] ? debugfs_create_file+0x46/0x210
> [<c02375a6>] debugfs_create_file+0x46/0x210
> [<c02377f1>] debugfs_create_dir+0x21/0x30
> [<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
> [<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
> [<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
> [<c01394c6>] run_workqueue+0x156/0x1d0
> [<c0139472>] ? run_workqueue+0x102/0x1d0
> [<f88eba00>] ? sta_info_debugfs_add_work+0x0/0x130 [mac80211]
> [<c0139f18>] worker_thread+0x88/0xe0
> [<c013cd20>] ? autoremove_wake_function+0x0/0x40
> [<c0139e90>] ? worker_thread+0x0/0xe0
> [<c013ca32>] kthread+0x42/0x70
> [<c013c9f0>] ? kthread+0x0/0x70
> [<c0104993>] kernel_thread_helper+0x7/0x14
> =======================
>



--
Regards
dave

2008-08-05 12:26:59

by Bob Copeland

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Tue, Aug 05, 2008 at 09:29:26AM +0800, Dave Young wrote:
> With the patch I cann't reproduce the bug with 27-rc1 now.
>
> > [<c02375a6>] ? debugfs_create_file+0x46/0x210
> > [<c02375a6>] ? debugfs_create_file+0x46/0x210
> > [<c02375a6>] debugfs_create_file+0x46/0x210
> > [<c02377f1>] debugfs_create_dir+0x21/0x30
> > [<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
> > [<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
> > [<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]

I wonder if there were two separate problems here. I looked into
this with some detail yesterday and agree with Johannes that the above
trace is on locking the parent directory's i_mutex, but I too couldn't
see any problems with sta_info_debugfs_add_work. Other stuff could also
modify the directory with or without rtnl_lock, but not in a way that
to my untrained eyes would lead to deadlock.

Or is the trace just wrong?

--
Bob Copeland %% http://www.bobcopeland.com

2008-08-06 01:52:25

by Dave Young

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Tue, Aug 5, 2008 at 8:24 PM, Bob Copeland <[email protected]> wrote:
> On Tue, Aug 05, 2008 at 09:29:26AM +0800, Dave Young wrote:
>> With the patch I cann't reproduce the bug with 27-rc1 now.
>>
>> > [<c02375a6>] ? debugfs_create_file+0x46/0x210
>> > [<c02375a6>] ? debugfs_create_file+0x46/0x210
>> > [<c02375a6>] debugfs_create_file+0x46/0x210
>> > [<c02377f1>] debugfs_create_dir+0x21/0x30
>> > [<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
>> > [<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
>> > [<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
>
> I wonder if there were two separate problems here. I looked into
> this with some detail yesterday and agree with Johannes that the above
> trace is on locking the parent directory's i_mutex, but I too couldn't
> see any problems with sta_info_debugfs_add_work. Other stuff could also
> modify the directory with or without rtnl_lock, but not in a way that
> to my untrained eyes would lead to deadlock.

Yes,. I think so. It's the original bug for me, while testing I found
the mutex deadlock problem.

But this week I will have no time to trace it. so if I have time I
will keep tracing the problem

>
> Or is the trace just wrong?
>
> --
> Bob Copeland %% http://www.bobcopeland.com
>
>



--
Regards
dave

2008-08-06 01:53:35

by Dave Young

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Wed, Aug 6, 2008 at 9:51 AM, Dave Young <[email protected]> wrote:
> On Tue, Aug 5, 2008 at 8:24 PM, Bob Copeland <[email protected]> wrote:
>> On Tue, Aug 05, 2008 at 09:29:26AM +0800, Dave Young wrote:
>>> With the patch I cann't reproduce the bug with 27-rc1 now.
>>>
>>> > [<c02375a6>] ? debugfs_create_file+0x46/0x210
>>> > [<c02375a6>] ? debugfs_create_file+0x46/0x210
>>> > [<c02375a6>] debugfs_create_file+0x46/0x210
>>> > [<c02377f1>] debugfs_create_dir+0x21/0x30
>>> > [<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
>>> > [<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
>>> > [<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
>>
>> I wonder if there were two separate problems here. I looked into
>> this with some detail yesterday and agree with Johannes that the above
>> trace is on locking the parent directory's i_mutex, but I too couldn't
>> see any problems with sta_info_debugfs_add_work. Other stuff could also
>> modify the directory with or without rtnl_lock, but not in a way that
>> to my untrained eyes would lead to deadlock.
>
> Yes,. I think so. It's the original bug for me, while testing I found
> the mutex deadlock problem.
>
> But this week I will have no time to trace it. so if I have time I
> will keep tracing the problem

Additional info,

With the mutex fix patch, in 2.6.27-rc1 I seems can not reproduce the
debugfs_add bug, (maybe need more test)

But with 2.6.26, the bug can be reproduced. (The mutex fix patch need
not to be applied because there's no such deadlock bug)

>
>>
>> Or is the trace just wrong?
>>
>> --
>> Bob Copeland %% http://www.bobcopeland.com
>>
>>
>
>
>
> --
> Regards
> dave
>



--
Regards
dave

2008-08-12 04:19:43

by Dave Young

[permalink] [raw]
Subject: Re: [BUG] wireless : cpu stuck for 61s

On Wed, Aug 6, 2008 at 9:51 AM, Dave Young <[email protected]> wrote:
> On Tue, Aug 5, 2008 at 8:24 PM, Bob Copeland <[email protected]> wrote:
>> On Tue, Aug 05, 2008 at 09:29:26AM +0800, Dave Young wrote:
>>> With the patch I cann't reproduce the bug with 27-rc1 now.
>>>
>>> > [<c02375a6>] ? debugfs_create_file+0x46/0x210
>>> > [<c02375a6>] ? debugfs_create_file+0x46/0x210
>>> > [<c02375a6>] debugfs_create_file+0x46/0x210
>>> > [<c02377f1>] debugfs_create_dir+0x21/0x30
>>> > [<f8901f6d>] ieee80211_sta_debugfs_add+0x2d/0x150 [mac80211]
>>> > [<f88eba89>] sta_info_debugfs_add_work+0x89/0x130 [mac80211]
>>> > [<f890a170>] ? rate_control_pid_add_sta_debugfs+0x0/0x30 [mac80211]
>>
>> I wonder if there were two separate problems here. I looked into
>> this with some detail yesterday and agree with Johannes that the above
>> trace is on locking the parent directory's i_mutex, but I too couldn't
>> see any problems with sta_info_debugfs_add_work. Other stuff could also
>> modify the directory with or without rtnl_lock, but not in a way that
>> to my untrained eyes would lead to deadlock.
>
> Yes,. I think so. It's the original bug for me, while testing I found
> the mutex deadlock problem.
>
> But this week I will have no time to trace it. so if I have time I
> will keep tracing the problem

Unfortunately I have no more time to do this these days.
As I can't reproduce it with the latest git kernel I think that It
might have been fixed already.

>
>>
>> Or is the trace just wrong?
>>
>> --
>> Bob Copeland %% http://www.bobcopeland.com
>>
>>
>
>
>
> --
> Regards
> dave
>



--
Regards
dave