Still tracking memory corruption bugs found by the perf_fuzzer: I have
about 10 different log splats that I think might all be related to the
same underlying problem.
Anyway, I managed to trigger this using the perf_fuzzer:
[ 221.065278] Slab corruption (Not tainted): kmalloc-2048 start=ffff8800cd15e800, len=2048
[ 221.074062] 040: 6b 6b 6b 6b 6b 6b 6b 6b 98 72 57 cd 00 88 ff ff kkkkkkkk.rW.....
[ 221.082321] Prev obj: start=ffff8800cd15e000, len=2048
[ 221.087933] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 221.096224] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
And luckily I had ftrace running at the time.
This block was allocated by perf_event_alloc():
perf_fuzzer-2520 [001] 182.980563: kmalloc: (perf_event_alloc+0x55) call_site=ffffffff811399b5 ptr=0xffff8800cd15e800 bytes_req=1272 bytes_alloc=2048 gfp_flags=GFP_KERNEL|GFP_ZERO
perf_fuzzer-2520 [000] 183.628515: kmalloc: (perf_event_alloc+0x55) call_site=ffffffff811399b5 ptr=0xffff8800cd15e800 bytes_req=1272 bytes_alloc=2048 gfp_flags=GFP_KERNEL|GFP_ZERO
perf_fuzzer-2520 [000] 183.628521: kfree: (perf_event_alloc+0x2f7) call_site=ffffffff81139c57 ptr=0xffff8800cd15e800
perf_fuzzer-2520 [000] 183.628844: kmalloc: (perf_event_alloc+0x55) call_site=ffffffff811399b5 ptr=0xffff8800cd15e800 bytes_req=1272 bytes_alloc=2048 gfp_flags=GFP_KERNEL|GFP_ZERO
... (kmalloc/kfree of this address repeated thousands of times)
Is it worth wading through this mess to try to track down what happened?
Vince
On Tue, 15 Apr 2014, Vince Weaver wrote:
> [...]
>
> Is it worth wading through this mess to try to track down what happened?
Definitely worth a try. Can you upload the trace file and provide the
URL, or send it off-list in private mail if you cannot provide a public URL?
Thanks,
tglx
On Tue, 15 Apr 2014, Thomas Gleixner wrote:
> On Tue, 15 Apr 2014, Vince Weaver wrote:
> > [...]
> >
> > Is it worth wading through this mess to try to track down what happened?
>
> Definitely worth a try. Can you upload the trace file and provide the
> URL or send it offlist in private mail if you cannot provide a public URL.
I've poked around the trace a bit.
It looks like a struct perf_event is possibly being used after it has
been freed, specifically the event->migrate_entry->prev value? I could
be completely wrong about that.
One thing to know about these fuzzer runs, the ones that cause memory
corruption involve forking (with events active). I haven't seen the
corruptions when forking is disabled.
It's very simple forking: only one child is ever active at a time, and
the child itself doesn't do anything but busy-wait until it is killed.
The trace shows the problem allocations happening before a fork and
the poison message after. The traces I have don't include the children
though so I don't have records of what happened there.
I'll send a private link to the file downloads as they're a little large
and the local sysadmins would probably appreciate it if I limited access
to them.
Vince
On Tue, 15 Apr 2014, Vince Weaver wrote:
> It looks like a struct perf_event is possibly being used after it has
> been freed, specifically the event->migrate_entry->prev value? I could
> be completely wrong about that.
and actually I'm mixing up hex and decimal. It looks like the actual
value being written to the freed area is at 0x48, which I think maps to
event->hlist_entry->pprev.
But really, if it's late enough that I'm mixing up hex and decimal, I
should probably stop staring at trace dumps and get some sleep.
Vince
On Tue, Apr 15, 2014 at 05:37:01PM -0400, Vince Weaver wrote:
> [slab corruption report and ftrace output trimmed]
Does the below make any difference? I've only run it through some light
testing to make sure it didn't insta-explode on running.
(perf stat make -j64 -s, in fact)
The patch changes the exit path (identified by tglx as the most likely
fuckup source). If I read it right, the if (child_event->parent)
condition in __perf_event_exit_task() is complete bullshit.
We should always detach from groups, inherited event or not.
Failing to detach from the group, in turn, can cause the
__perf_event_exit_task() loops in perf_event_exit_task_context() to not
actually do what the "goto again" comment says: because we do not detach
from the group, group siblings will not be placed back on the list as
singleton events.
This then allows us to 'exit' while still having events linked. Then
when we close the event fd, we'll free the event, _while_still_linked_.
The patch deals with this by iterating the event_list instead of the
pinned/flexible group lists. Making that retry superfluous.
Now I haven't gone through all details yet, so I might be talking crap.
I've pretty much fried my brain by now, so I'll go see if I can
reproduce some of this slab corruption.
---
kernel/events/core.c | 19 +++++++------------
1 file changed, 7 insertions(+), 12 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f83a71a3e46d..c3c745c1d623 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7367,11 +7367,9 @@ __perf_event_exit_task(struct perf_event *child_event,
struct perf_event_context *child_ctx,
struct task_struct *child)
{
- if (child_event->parent) {
- raw_spin_lock_irq(&child_ctx->lock);
- perf_group_detach(child_event);
- raw_spin_unlock_irq(&child_ctx->lock);
- }
+ raw_spin_lock_irq(&child_ctx->lock);
+ perf_group_detach(child_event);
+ raw_spin_unlock_irq(&child_ctx->lock);
perf_remove_from_context(child_event);
@@ -7443,12 +7441,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
mutex_lock(&child_ctx->mutex);
again:
- list_for_each_entry_safe(child_event, tmp, &child_ctx->pinned_groups,
- group_entry)
- __perf_event_exit_task(child_event, child_ctx, child);
-
- list_for_each_entry_safe(child_event, tmp, &child_ctx->flexible_groups,
- group_entry)
+ list_for_each_entry_rcu(child_event, &child_ctx->event_list, event_entry)
__perf_event_exit_task(child_event, child_ctx, child);
/*
@@ -7457,8 +7450,10 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
* will still point to the list head terminating the iteration.
*/
if (!list_empty(&child_ctx->pinned_groups) ||
- !list_empty(&child_ctx->flexible_groups))
+ !list_empty(&child_ctx->flexible_groups)) {
+ WARN_ON_ONCE(1);
goto again;
+ }
mutex_unlock(&child_ctx->mutex);
On Wed, 16 Apr 2014, Peter Zijlstra wrote:
> Does the below make any difference? I've only run it through some light
> testing to make sure it didn't insta-explode on running.
>
> (perf stat make -j64 -s in fact)
I'm running with your patch now and so far so good.
Unfortunately the problem isn't repeatable, but it usually shows up within
an hour or so of fuzzing (although there's possibly a 2nd unrelated bug
that also shows up sometimes).
If you want to try running the fuzzer on your machine too just do:
git clone https://github.com/deater/perf_event_tests.git
cd fuzzer
make
and then try running the "./fast_repro98.sh" script, as that's the forking
workload I've been using when tracking this issue.
Vince
On Wed, 16 Apr 2014, Vince Weaver wrote:
> [...]
>
> I'm running with your patch now and so far so good.
spoke too soon, just got this with your patch applied
(I wasn't running ftrace, so no trace with this one):
[ 1555.756490] Slab corruption (Not tainted): kmalloc-2048 start=ffff88011879a000, len=2048
[ 1555.765699] 040: 6b 6b 6b 6b 6b 6b 6b 6b 88 a7 97 ce 00 88 ff ff kkkkkkkk........
[ 1555.774684] Next obj: start=ffff88011879a800, len=2048
[ 1555.780396] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 1555.789150] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 1556.048915] Slab corruption (Not tainted): kmalloc-2048 start=ffff88011879a000, len=2048
[ 1556.057655] 040: 6b 6b 6b 6b 6b 6b 6b 6b 40 30 04 18 01 88 ff ff kkkkkkkk@0......
[ 1556.065946] Next obj: start=ffff88011879a800, len=2048
[ 1556.071544] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 1556.079770] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 1556.150121] general protection fault: 0000 [#1] SMP
[ 1556.155467] Dumping ftrace buffer:
[ 1556.159051] (ftrace buffer empty)
[ 1556.162848] Modules linked in: fuse x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi coretemp kvm i915 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic aesni_intel aes_x86_64 drm_kms_helper lrw snd_hda_intel snd_hda_controller snd_hda_codec drm snd_hwdep gf128mul tpm_tis mei_me snd_pcm glue_helper tpm evdev mei parport_pc snd_seq ablk_helper iTCO_wdt i2c_algo_bit psmouse iTCO_vendor_support parport snd_timer cryptd serio_raw pcspkr lpc_ich i2c_i801 mfd_core battery button processor video wmi i2c_core snd_seq_device snd soundcore sg sd_mod sr_mod crc_t10dif cdrom crct10dif_common ahci libahci ehci_pci e1000e libata ptp ehci_hcd xhci_hcd crc32c_intel usbcore scsi_mod pps_core usb_common fan thermal thermal_sys
[ 1556.236213] CPU: 4 PID: 28 Comm: ksoftirqd/4 Not tainted 3.15.0-rc1+ #62
[ 1556.243169] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[ 1556.251114] task: ffff8801188f8890 ti: ffff8801188fa000 task.ti: ffff8801188fa000
[ 1556.259065] RIP: 0010:[<ffffffff8113884d>] [<ffffffff8113884d>] perf_tp_event+0x9d/0x210
[ 1556.267821] RSP: 0000:ffff8801188fba30 EFLAGS: 00010006
[ 1556.273479] RAX: ffff88011879a040 RBX: 6b6b6b6b6b6b6b2b RCX: 000000000000002c
[ 1556.281000] RDX: ffffe8ffffd01878 RSI: 0000000000000001 RDI: 0000000000000000
[ 1556.288543] RBP: ffff8801188fbb08 R08: ffff8801188fbb30 R09: ffffe8ffffd03098
[ 1556.296068] R10: 0000000000000001 R11: 0000000225c17d03 R12: ffff8800cebde4d0
[ 1556.303619] R13: 0000000000000001 R14: ffff8801188fbb30 R15: ffffe8ffffd01878
[ 1556.311197] FS: 0000000000000000(0000) GS:ffff88011eb00000(0000) knlGS:0000000000000000
[ 1556.320659] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1556.327681] CR2: 0000000000618b50 CR3: 0000000001c11000 CR4: 00000000001407e0
[ 1556.336092] DR0: 0000000000a9e000 DR1: 0000000000000000 DR2: 0000000000a9e000
[ 1556.344624] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[ 1556.353012] Stack:
[ 1556.356024] ffff8801188f8890 ffffffff81c48380 ffffffff0000002c ffffe8ffffd01878
[ 1556.364798] ffff8801188fba88 0000000000000046 0000000000000000 0000000000000004
[ 1556.373689] 0000000000000000 ffff8801188fbb78 ffff88011eb10420 ffff8801188fbb68
[ 1556.382627] Call Trace:
[ 1556.386190] [<ffffffff81093607>] perf_trace_sched_wakeup_template+0xe7/0x100
[ 1556.394778] [<ffffffff810953f2>] ? ttwu_do_wakeup+0xb2/0xc0
[ 1556.401703] [<ffffffff810953f2>] ttwu_do_wakeup+0xb2/0xc0
[ 1556.408468] [<ffffffff810954ed>] ttwu_do_activate.constprop.95+0x5d/0x70
[ 1556.416659] [<ffffffff810982c0>] try_to_wake_up+0x200/0x300
[ 1556.423711] [<ffffffff81098432>] default_wake_function+0x12/0x20
[ 1556.431114] [<ffffffff810a95f8>] __wake_up_common+0x58/0x90
[ 1556.438103] [<ffffffff810c90c0>] ? ftrace_raw_output_rcu_utilization+0x50/0x50
[ 1556.446860] [<ffffffff810a9643>] __wake_up_locked+0x13/0x20
[ 1556.453756] [<ffffffff810a9e07>] complete+0x37/0x50
[ 1556.459995] [<ffffffff810c90d2>] wakeme_after_rcu+0x12/0x20
[ 1556.466903] [<ffffffff810cc6ad>] rcu_process_callbacks+0x29d/0x620
[ 1556.474468] [<ffffffff810cc646>] ? rcu_process_callbacks+0x236/0x620
[ 1556.482232] [<ffffffff81069995>] __do_softirq+0xf5/0x290
[ 1556.488837] [<ffffffff81069b60>] run_ksoftirqd+0x30/0x50
[ 1556.495385] [<ffffffff8108f6ff>] smpboot_thread_fn+0xff/0x1b0
[ 1556.502441] [<ffffffff8108f600>] ? SyS_setgroups+0x1a0/0x1a0
[ 1556.509398] [<ffffffff8108822d>] kthread+0xed/0x110
[ 1556.515486] [<ffffffff81088140>] ? kthread_create_on_node+0x200/0x200
[ 1556.523324] [<ffffffff8165a4bc>] ret_from_fork+0x7c/0xb0
[ 1556.529858] [<ffffffff81088140>] ? kthread_create_on_node+0x200/0x200
[ 1556.537642] Code: 48 c7 45 c8 00 00 00 00 48 c7 45 90 00 00 00 00 48 c7 45 d0 00 00 00 00 75 11 eb 52 66 90 48 8b 5b 40 48 85 db 74 47 48 83 eb 40 <f6> 83 90 01 00 00 01 75 ea f6 83 e8 00 00 00 20 75 e1 48 8d b5
[ 1556.561008] RIP [<ffffffff8113884d>] perf_tp_event+0x9d/0x210
[ 1556.568036] RSP <ffff8801188fba30>
[ 1556.572833] general protection fault: 0000 [#2] SMP
[ 1556.578955] Dumping ftrace buffer:
[ 1556.583342] (ftrace buffer empty)
[ 1556.587897] Modules linked in: fuse x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi coretemp kvm i915 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic aesni_intel aes_x86_64 drm_kms_helper lrw snd_hda_intel snd_hda_controller snd_hda_codec drm snd_hwdep gf128mul tpm_tis mei_me snd_pcm glue_helper tpm evdev mei parport_pc snd_seq ablk_helper iTCO_wdt i2c_algo_bit psmouse iTCO_vendor_support parport snd_timer cryptd serio_raw pcspkr lpc_ich i2c_i801 mfd_core battery button processor video wmi i2c_core snd_seq_device snd soundcore sg sd_mod sr_mod crc_t10dif cdrom crct10dif_common ahci libahci ehci_pci e1000e libata ptp ehci_hcd xhci_hcd crc32c_intel usbcore scsi_mod pps_core usb_common fan thermal thermal_sys
[ 1556.667183] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 3.15.0-rc1+ #62
[ 1556.674820] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[ 1556.683519] task: ffff880118f5e350 ti: ffff880118f60000 task.ti: ffff880118f60000
[ 1556.692431] RIP: 0010:[<ffffffff8113884d>] [<ffffffff8113884d>] perf_tp_event+0x9d/0x210
[ 1556.702028] RSP: 0000:ffff88011eb03af8 EFLAGS: 00010006
[ 1556.708553] RAX: ffff88011879a040 RBX: 6b6b6b6b6b6b6b2b RCX: 000000000000002c
[ 1556.716851] RDX: ffffe8ffffd02078 RSI: 0000000000000001 RDI: 0000000000000000
[ 1556.725272] RBP: ffff88011eb03bd0 R08: ffff88011eb03bf8 R09: ffffe8ffffd03098
[ 1556.733740] R10: 000000000000000f R11: 000000000000b717 R12: ffff8800cfb16750
[ 1556.742209] R13: 0000000000000001 R14: ffff88011eb03bf8 R15: ffffe8ffffd02078
[ 1556.750617] FS: 0000000000000000(0000) GS:ffff88011eb00000(0000) knlGS:0000000000000000
[ 1556.760111] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1556.767040] CR2: 0000000000618b50 CR3: 0000000001c11000 CR4: 00000000001407e0
[ 1556.775451] DR0: 0000000000a9e000 DR1: 0000000000000000 DR2: 0000000000a9e000
[ 1556.783797] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[ 1556.792254] Stack:
[ 1556.795279] 0000000000000046 ffffffff81138fab 000000000000002c ffffe8ffffd02078
[ 1556.804056] 0000000000000046 0000000000000046 0000000000000000 0000000000000008
[ 1556.812825] 0000000000000000 ffff88011eb03c40 ffff88011eb10420 ffff88011eb03c30
[ 1556.821642] Call Trace:
[ 1556.825070] <IRQ>
[ 1556.827143] [<ffffffff81138fab>] ? __perf_sw_event+0x6b/0x230
[ 1556.835329] [<ffffffff81093607>] perf_trace_sched_wakeup_template+0xe7/0x100
[ 1556.843753] [<ffffffff810953f2>] ? ttwu_do_wakeup+0xb2/0xc0
[ 1556.850681] [<ffffffff810953f2>] ttwu_do_wakeup+0xb2/0xc0
[ 1556.857334] [<ffffffff810954ed>] ttwu_do_activate.constprop.95+0x5d/0x70
[ 1556.865438] [<ffffffff810982c0>] try_to_wake_up+0x200/0x300
[ 1556.872334] [<ffffffff81098432>] default_wake_function+0x12/0x20
[ 1556.879685] [<ffffffff810a9d28>] autoremove_wake_function+0x18/0x40
[ 1556.887333] [<ffffffff810a95f8>] __wake_up_common+0x58/0x90
[ 1556.894223] [<ffffffff810a9869>] __wake_up+0x39/0x50
[ 1556.900578] [<ffffffff810c0f92>] wake_up_klogd_work_func+0x42/0x70
[ 1556.908173] [<ffffffff8112ff9f>] __irq_work_run+0x6f/0x90
[ 1556.914815] [<ffffffff81130028>] irq_work_run+0x18/0x30
[ 1556.921277] [<ffffffff8107272b>] update_process_times+0x5b/0x70
[ 1556.928572] [<ffffffff810d8665>] tick_sched_handle.isra.20+0x25/0x60
[ 1556.936233] [<ffffffff810d8d41>] tick_sched_timer+0x41/0x60
[ 1556.943021] [<ffffffff8108b596>] __run_hrtimer+0x86/0x1e0
[ 1556.949681] [<ffffffff810d8d00>] ? tick_sched_do_timer+0x40/0x40
[ 1556.956981] [<ffffffff8108bd87>] hrtimer_interrupt+0xf7/0x240
[ 1556.964068] [<ffffffff81044637>] local_apic_timer_interrupt+0x37/0x60
[ 1556.971800] [<ffffffff8165c996>] smp_trace_apic_timer_interrupt+0x46/0xb9
[ 1556.979903] [<ffffffff8165b39d>] trace_apic_timer_interrupt+0x6d/0x80
[ 1556.987606] <EOI>
[ 1556.989710] [<ffffffff8165103e>] ? _raw_spin_unlock_irq+0x2e/0x40
[ 1556.998038] [<ffffffff81651037>] ? _raw_spin_unlock_irq+0x27/0x40
[ 1557.005313] [<ffffffff81090c4d>] finish_task_switch+0x7d/0x120
[ 1557.012301] [<ffffffff81090c0f>] ? finish_task_switch+0x3f/0x120
[ 1557.019455] [<ffffffff8164c6b0>] __schedule+0x2c0/0x740
[ 1557.025852] [<ffffffff8164d009>] schedule_preempt_disabled+0x29/0x70
[ 1557.033380] [<ffffffff810aa0e3>] cpu_startup_entry+0x133/0x3d0
[ 1557.040257] [<ffffffff81042a43>] start_secondary+0x193/0x200
[ 1557.047061] Code: 48 c7 45 c8 00 00 00 00 48 c7 45 90 00 00 00 00 48 c7 45 d0 00 00 00 00 75 11 eb 52 66 90 48 8b 5b 40 48 85 db 74 47 48 83 eb 40 <f6> 83 90 01 00 00 01 75 ea f6 83 e8 00 00 00 20 75 e1 48 8d b5
[ 1557.070211] RIP [<ffffffff8113884d>] perf_tp_event+0x9d/0x210
[ 1557.077114] RSP <ffff88011eb03af8>
[ 1557.081529] ---[ end trace de66fd3e04dbf8d0 ]---
[ 1557.087043] Kernel panic - not syncing: Fatal exception in interrupt
[ 1558.139376] Shutting down cpus with NMI
[ 1558.144092] Dumping ftrace buffer:
[ 1558.148310] (ftrace buffer empty)
[ 1558.152807] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
On Wed, Apr 16, 2014 at 01:43:49PM -0400, Vince Weaver wrote:
> [...]
>
> spoke too soon, just got this with your patch applied
> (I wasn't running ftrace so no trace with this one):
>
> [slab corruption report trimmed]
Awww, bugger. OK more staring tomorrow. Thanks for trying.
* Vince Weaver <[email protected]> wrote:
> [...]
>
> > I'm running with your patch now and so far so good.
>
> spoke too soon, just got this with your patch applied
> (I wasn't running ftrace so no trace with this one):
Is there some place where I can pick up the latest version of your fuzzer?
PeterZ has trouble reproducing the corruption locally - I'd like to
run it too; maybe I have hardware that triggers it more readily.
Plus it would be nice to get the latest .config that you are using to
generate this crash. (You can send it in private mail as well
if you wish.)
Thanks,
Ingo
On Thu, Apr 17, 2014 at 11:48:15AM +0200, Ingo Molnar wrote:
> Is there some place where I can pick up the latest version of your fuzzer?
>
> PeterZ has trouble reproducing the corruption locally - I'd like to
> run it too, maybe I have hardware that triggers it more readily.
From a few emails up:
"If you want to try running the fuzzer on your machine too just do:
git clone https://github.com/deater/perf_event_tests.git
cd fuzzer
make
and then try running the "./fast_repro98.sh" script, as that's the forking
workload I've been using when tracking this issue."
* Peter Zijlstra <[email protected]> wrote:
> > Is there some place where I can pick up the latest version of your fuzzer?
>
> From a few emails up:
>
> [fuzzer build instructions trimmed]
Cool, thanks!
Ingo
On Thu, 17 Apr 2014, Ingo Molnar wrote:
> [...]
I have to admit the slab corruption message is a new development with
3.15-rc1.
I've been trying to track a possibly-the-same (but harder to debug)
memory corruption issue for a while now.
I can trigger this type of error easily on both a core2 and haswell
machine.
I have a few more test machines I can run, and just got some extra
serial cables so maybe I can hook up a few more serial consoles and do
some additional testing.
Vince
On Thu, Apr 17, 2014 at 10:42:47AM -0400, Vince Weaver wrote:
> [...]
>
> I have to admit the slab corruption message is a new development with
> 3.15-rc1.
Meh.. my machine keeps locking up with 3.15-rc1 and your fuzzer. It looks
to get stuck at finish_task_switch() from a preemption while waiting for
a perf IPI.
Which is complete crack, because we have preemption disabled over
issuing and waiting for the IPI :/
I tried reverting some of the IPI related patches, but no joy so far,
I'm about to go try a git-bisect on this.
On Thu, 17 Apr 2014, Peter Zijlstra wrote:
>
> Meh.. my machine keeps locking up with 3.15-rc1 and your fuzzer. It looks
> to get stuck at finish_task_switch() from a preemption while waiting for
> a perf IPI.
>
> Which is complete crack, because we have preemption disabled over
> issuing and waiting for the IPI :/
>
> I tried reverting some of the IPI related patches, but no joy so far,
> I'm about to go try a git-bisect on this.
Bisecting can be a pain, as if you go too far back you start running into
other bugs that have been fixed due to the fuzzer.
Over the past year there have been at least 10 perf_fuzzer-related crash
fixes that have gotten into the kernel. It used to be that I could crash
things in seconds; it's now up to minutes or hours, but the bugs are that
much harder to isolate :(
And I still am not fuzzing with signal overflow enabled, which
causes even more pain.
Vince
OK, since the slab corruption was happening to event->hlist_entry->pprev,
I added a WARN() call to every modifier of pprev under
include/linux/*list*.h to see what was stomping over freed memory.
This is what came up:
Apr 18 10:36:11 haswell kernel: [ 998.316177] ------------[ cut here ]------------
Apr 18 10:36:11 haswell kernel: [ 998.321188] WARNING: CPU: 3 PID: 20717 at include/linux/rculist.h:410 perf_trace_add+0xc1/0x100()
Apr 18 10:36:11 haswell kernel: [ 998.330681] Modules linked in: fuse x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi kvm snd_hda_intel iTCO_wdt snd_hda_controller i915 snd_hda_codec evdev crct10dif_pclmul drm_kms_helper iTCO_vendor_support snd_hwdep snd_pcm drm crc32_pclmul snd_seq mei_me parport_pc parport lpc_ich mfd_core psmouse ghash_clmulni_intel snd_timer snd_seq_device mei snd aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd soundcore pcspkr serio_raw i2c_i801 processor video i2c_algo_bit i2c_core wmi button battery tpm_tis tpm sg sd_mod sr_mod crc_t10dif crct10dif_common cdrom ahci ehci_pci libahci xhci_hcd e1000e libata ehci_hcd ptp scsi_mod crc32c_intel usbcore pps_core usb_common fan thermal thermal_sys
Apr 18 10:36:11 haswell kernel: [ 998.405736] CPU: 3 PID: 20717 Comm: perf_fuzzer Not tainted 3.15.0-rc1+ #63
Apr 18 10:36:11 haswell kernel: [ 998.413162] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
Apr 18 10:36:11 haswell kernel: [ 998.420987] 0000000000000009 ffff880117923c70 ffffffff8164f753 0000000000000000
Apr 18 10:36:11 haswell kernel: [ 998.429016] ffff880117923ca8 ffffffff810647cd ffffe8ffffcc30d8 ffff8800ce29f040
Apr 18 10:36:11 haswell kernel: [ 998.437121] ffffffff81c1ba40 ffff8800cd3d9040 000000da35b18e50 ffff880117923cb8
Apr 18 10:36:11 haswell kernel: [ 998.445246] Call Trace:
Apr 18 10:36:11 haswell kernel: [ 998.447910] [<ffffffff8164f753>] dump_stack+0x45/0x56
Apr 18 10:36:11 haswell kernel: [ 998.453451] [<ffffffff810647cd>] warn_slowpath_common+0x7d/0xa0
Apr 18 10:36:11 haswell kernel: [ 998.459871] [<ffffffff810648aa>] warn_slowpath_null+0x1a/0x20
Apr 18 10:36:11 haswell kernel: [ 998.466143] [<ffffffff81125a01>] perf_trace_add+0xc1/0x100
Apr 18 10:36:11 haswell kernel: [ 998.472160] [<ffffffff81136640>] event_sched_in.isra.76+0x90/0x1e0
Apr 18 10:36:11 haswell kernel: [ 998.478849] [<ffffffff811367f9>] group_sched_in+0x69/0x1e0
Apr 18 10:36:11 haswell kernel: [ 998.484812] [<ffffffff81136e45>] __perf_event_enable+0x255/0x260
Apr 18 10:36:11 haswell kernel: [ 998.491370] [<ffffffff81132340>] remote_function+0x40/0x50
Apr 18 10:36:11 haswell kernel: [ 998.497311] [<ffffffff810de116>] generic_exec_single+0x126/0x170
Apr 18 10:36:11 haswell kernel: [ 998.503764] [<ffffffff81132300>] ? task_clock_event_add+0x40/0x40
Apr 18 10:36:11 haswell kernel: [ 998.510432] [<ffffffff810de1c7>] smp_call_function_single+0x67/0xa0
Apr 18 10:36:11 haswell kernel: [ 998.517299] [<ffffffff811312b4>] task_function_call+0x44/0x50
Apr 18 10:36:11 haswell kernel: [ 998.523539] [<ffffffff81136bf0>] ? perf_event_sched_in+0x90/0x90
Apr 18 10:36:11 haswell kernel: [ 998.530085] [<ffffffff81131350>] perf_event_enable+0x90/0xf0
Apr 18 10:36:11 haswell kernel: [ 998.536308] [<ffffffff811312c0>] ? task_function_call+0x50/0x50
Apr 18 10:36:11 haswell kernel: [ 998.542761] [<ffffffff8113142a>] perf_event_for_each_child+0x3a/0xa0
Apr 18 10:36:11 haswell kernel: [ 998.551512] [<ffffffff811379af>] perf_event_task_enable+0x4f/0x80
Apr 18 10:36:11 haswell kernel: [ 998.560080] [<ffffffff8107c015>] SyS_prctl+0x255/0x4b0
Apr 18 10:36:11 haswell kernel: [ 998.567605] [<ffffffff813c1406>] ? lockdep_sys_exit_thunk+0x35/0x67
Apr 18 10:36:11 haswell kernel: [ 998.576333] [<ffffffff816609ed>] system_call_fastpath+0x1a/0x1f
Apr 18 10:36:11 haswell kernel: [ 998.584698] ---[ end trace b175966afd57a174 ]---
Apr 18 10:36:12 haswell kernel: [ 998.910691] ------------[ cut here ]------------
On Fri, 18 Apr 2014, Vince Weaver wrote:
>
> OK, since the slab corruption was happening to event->hlist_entry->pprev
> I added a WARN() call to every modifier of pprev under
> include/linux/*list*.h to see what was stomping over freed memory.
>
> This is what came up:
actually there were 3 hits on the WARNING before the machine locked up.
Apr 18 10:36:11 haswell kernel: [ 998.316177] ------------[ cut here ]------------
Apr 18 10:36:11 haswell kernel: [ 998.321188] WARNING: CPU: 3 PID: 20717 at include/linux/rculist.h:410 perf_trace_add+0xc1/0x100()
Apr 18 10:36:11 haswell kernel: [ 998.330681] Modules linked in: fuse x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi kvm snd_hda_intel iTCO_wdt snd_hda_controller i915 snd_hda_codec evdev crct10dif_pclmul drm_kms_helper iTCO_vendor_support snd_hwdep snd_pcm drm crc32_pclmul snd_seq mei_me parport_pc parport lpc_ich mfd_core psmouse ghash_clmulni_intel snd_timer snd_seq_device mei snd aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd soundcore pcspkr serio_raw i2c_i801 processor video i2c_algo_bit i2c_core wmi button battery tpm_tis tpm sg sd_mod sr_mod crc_t10dif crct10dif_common cdrom ahci ehci_pci libahci xhci_hcd e1000e libata ehci_hcd ptp scsi_mod crc32c_intel usbcore pps_core usb_common fan thermal thermal_sys
Apr 18 10:36:11 haswell kernel: [ 998.405736] CPU: 3 PID: 20717 Comm: perf_fuzzer Not tainted 3.15.0-rc1+ #63
Apr 18 10:36:11 haswell kernel: [ 998.413162] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
Apr 18 10:36:11 haswell kernel: [ 998.420987] 0000000000000009 ffff880117923c70 ffffffff8164f753 0000000000000000
Apr 18 10:36:11 haswell kernel: [ 998.429016] ffff880117923ca8 ffffffff810647cd ffffe8ffffcc30d8 ffff8800ce29f040
Apr 18 10:36:11 haswell kernel: [ 998.437121] ffffffff81c1ba40 ffff8800cd3d9040 000000da35b18e50 ffff880117923cb8
Apr 18 10:36:11 haswell kernel: [ 998.445246] Call Trace:
Apr 18 10:36:11 haswell kernel: [ 998.447910] [<ffffffff8164f753>] dump_stack+0x45/0x56
Apr 18 10:36:11 haswell kernel: [ 998.453451] [<ffffffff810647cd>] warn_slowpath_common+0x7d/0xa0
Apr 18 10:36:11 haswell kernel: [ 998.459871] [<ffffffff810648aa>] warn_slowpath_null+0x1a/0x20
Apr 18 10:36:11 haswell kernel: [ 998.466143] [<ffffffff81125a01>] perf_trace_add+0xc1/0x100
Apr 18 10:36:11 haswell kernel: [ 998.472160] [<ffffffff81136640>] event_sched_in.isra.76+0x90/0x1e0
Apr 18 10:36:11 haswell kernel: [ 998.478849] [<ffffffff811367f9>] group_sched_in+0x69/0x1e0
Apr 18 10:36:11 haswell kernel: [ 998.484812] [<ffffffff81136e45>] __perf_event_enable+0x255/0x260
Apr 18 10:36:11 haswell kernel: [ 998.491370] [<ffffffff81132340>] remote_function+0x40/0x50
Apr 18 10:36:11 haswell kernel: [ 998.497311] [<ffffffff810de116>] generic_exec_single+0x126/0x170
Apr 18 10:36:11 haswell kernel: [ 998.503764] [<ffffffff81132300>] ? task_clock_event_add+0x40/0x40
Apr 18 10:36:11 haswell kernel: [ 998.510432] [<ffffffff810de1c7>] smp_call_function_single+0x67/0xa0
Apr 18 10:36:11 haswell kernel: [ 998.517299] [<ffffffff811312b4>] task_function_call+0x44/0x50
Apr 18 10:36:11 haswell kernel: [ 998.523539] [<ffffffff81136bf0>] ? perf_event_sched_in+0x90/0x90
Apr 18 10:36:11 haswell kernel: [ 998.530085] [<ffffffff81131350>] perf_event_enable+0x90/0xf0
Apr 18 10:36:11 haswell kernel: [ 998.536308] [<ffffffff811312c0>] ? task_function_call+0x50/0x50
Apr 18 10:36:11 haswell kernel: [ 998.542761] [<ffffffff8113142a>] perf_event_for_each_child+0x3a/0xa0
Apr 18 10:36:11 haswell kernel: [ 998.551512] [<ffffffff811379af>] perf_event_task_enable+0x4f/0x80
Apr 18 10:36:11 haswell kernel: [ 998.560080] [<ffffffff8107c015>] SyS_prctl+0x255/0x4b0
Apr 18 10:36:11 haswell kernel: [ 998.567605] [<ffffffff813c1406>] ? lockdep_sys_exit_thunk+0x35/0x67
Apr 18 10:36:11 haswell kernel: [ 998.576333] [<ffffffff816609ed>] system_call_fastpath+0x1a/0x1f
Apr 18 10:36:11 haswell kernel: [ 998.584698] ---[ end trace b175966afd57a174 ]---
Apr 18 10:36:12 haswell kernel: [ 998.910691] ------------[ cut here ]------------
Apr 18 10:36:12 haswell kernel: [ 998.917828] WARNING: CPU: 4 PID: 22741 at include/linux/rculist.h:410 perf_swevent_add+0x14c/0x190()
Apr 18 10:36:12 haswell kernel: [ 998.930319] Modules linked in: fuse x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi kvm snd_hda_intel iTCO_wdt snd_hda_controller i915 snd_hda_codec evdev crct10dif_pclmul drm_kms_helper iTCO_vendor_support snd_hwdep snd_pcm drm crc32_pclmul snd_seq mei_me parport_pc parport lpc_ich mfd_core psmouse ghash_clmulni_intel snd_timer snd_seq_device mei snd aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd soundcore pcspkr serio_raw i2c_i801 processor video i2c_algo_bit i2c_core wmi button battery tpm_tis tpm sg sd_mod sr_mod crc_t10dif crct10dif_common cdrom ahci ehci_pci libahci xhci_hcd e1000e libata ehci_hcd ptp scsi_mod crc32c_intel usbcore pps_core usb_common fan thermal thermal_sys
Apr 18 10:36:12 haswell kernel: [ 999.024710] CPU: 4 PID: 22741 Comm: perf_fuzzer Tainted: G W 3.15.0-rc1+ #63
Apr 18 10:36:12 haswell kernel: [ 999.036194] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
Apr 18 10:36:12 haswell kernel: [ 999.046972] 0000000000000009 ffff8800ce12dce0 ffffffff8164f753 0000000000000000
Apr 18 10:36:12 haswell kernel: [ 999.057871] ffff8800ce12dd18 ffffffff810647cd ffff8800c9ef6000 ffff8800ce29f040
Apr 18 10:36:12 haswell kernel: [ 999.068652] ffff8800ca4ff3c0 ffff8800c9ef6040 0000000000000000 ffff8800ce12dd28
Apr 18 10:36:12 haswell kernel: [ 999.079416] Call Trace:
Apr 18 10:36:12 haswell kernel: [ 999.084718] [<ffffffff8164f753>] dump_stack+0x45/0x56
Apr 18 10:36:12 haswell kernel: [ 999.092858] [<ffffffff810647cd>] warn_slowpath_common+0x7d/0xa0
Apr 18 10:36:12 haswell kernel: [ 999.101848] [<ffffffff810648aa>] warn_slowpath_null+0x1a/0x20
Apr 18 10:36:12 haswell kernel: [ 999.110692] [<ffffffff811320dc>] perf_swevent_add+0x14c/0x190
Apr 18 10:36:12 haswell kernel: [ 999.119529] [<ffffffff81136640>] event_sched_in.isra.76+0x90/0x1e0
Apr 18 10:36:12 haswell kernel: [ 999.128801] [<ffffffff811367f9>] group_sched_in+0x69/0x1e0
Apr 18 10:36:12 haswell kernel: [ 999.137202] [<ffffffff81136ad7>] ctx_sched_in+0x167/0x1f0
Apr 18 10:36:12 haswell kernel: [ 999.145553] [<ffffffff81136b9a>] perf_event_sched_in+0x3a/0x90
Apr 18 10:36:12 haswell kernel: [ 999.154439] [<ffffffff811370db>] perf_event_context_sched_in+0x7b/0xc0
Apr 18 10:36:12 haswell kernel: [ 999.164065] [<ffffffff8113785d>] __perf_event_task_sched_in+0x1dd/0x1f0
Apr 18 10:36:12 haswell kernel: [ 999.173769] [<ffffffff81091398>] finish_task_switch+0xd8/0x120
Apr 18 10:36:12 haswell kernel: [ 999.182555] [<ffffffff81096a07>] schedule_tail+0x27/0xb0
Apr 18 10:36:12 haswell kernel: [ 999.190843] [<ffffffff816608cf>] ret_from_fork+0xf/0xb0
Apr 18 10:36:12 haswell kernel: [ 999.199012] ---[ end trace b175966afd57a175 ]---
Apr 18 10:36:13 haswell kernel: [ 999.778173] ------------[ cut here ]------------
Apr 18 10:36:13 haswell kernel: [ 999.785775] WARNING: CPU: 4 PID: 22755 at include/linux/rculist.h:410 perf_swevent_add+0x14c/0x190()
Apr 18 10:36:13 haswell kernel: [ 999.797857] Modules linked in: fuse x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi kvm snd_hda_intel iTCO_wdt snd_hda_controller i915 snd_hda_codec evdev crct10dif_pclmul drm_kms_helper iTCO_vendor_support snd_hwdep snd_pcm drm crc32_pclmul snd_seq mei_me parport_pc parport lpc_ich mfd_core psmouse ghash_clmulni_intel snd_timer snd_seq_device mei snd aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd soundcore pcspkr serio_raw i2c_i801 processor video i2c_algo_bit i2c_core wmi button battery tpm_tis tpm sg sd_mod sr_mod crc_t10dif crct10dif_common cdrom ahci ehci_pci libahci xhci_hcd e1000e libata ehci_hcd ptp scsi_mod crc32c_intel usbcore pps_core usb_common fan thermal thermal_sys
Apr 18 10:36:13 haswell kernel: [ 999.884365] CPU: 4 PID: 22755 Comm: perf_fuzzer Tainted: G W 3.15.0-rc1+ #63
Apr 18 10:36:13 haswell kernel: [ 999.894321] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
Apr 18 10:36:13 haswell kernel: [ 999.903501] 0000000000000009 ffff88011eb03df8 ffffffff8164f753 0000000000000000
Apr 18 10:36:13 haswell kernel: [ 999.912817] ffff88011eb03e30 ffffffff810647cd ffff8800ca0bc800 ffff8800ce29f040
Apr 18 10:36:13 haswell kernel: [ 999.922105] ffff8800ca4ff3c0 ffff8800ca0bc840 0000000000004a30 ffff88011eb03e40
Apr 18 10:36:13 haswell kernel: [ 999.931378] Call Trace:
Apr 18 10:36:13 haswell kernel: [ 999.935215] <IRQ> [<ffffffff8164f753>] dump_stack+0x45/0x56
Apr 18 10:36:13 haswell kernel: [ 999.942436] [<ffffffff810647cd>] warn_slowpath_common+0x7d/0xa0
Apr 18 10:36:13 haswell kernel: [ 999.949915] [<ffffffff810648aa>] warn_slowpath_null+0x1a/0x20
Apr 18 10:36:13 haswell kernel: [ 999.957098] [<ffffffff811320dc>] perf_swevent_add+0x14c/0x190
Apr 18 10:36:13 haswell kernel: [ 999.964273] [<ffffffff81136640>] event_sched_in.isra.76+0x90/0x1e0
Apr 18 10:36:13 haswell kernel: [ 999.971863] [<ffffffff811367f9>] group_sched_in+0x69/0x1e0
Apr 18 10:36:13 haswell kernel: [ 999.978608] [<ffffffff81136e45>] __perf_event_enable+0x255/0x260
Apr 18 10:36:13 haswell kernel: [ 999.985998] [<ffffffff81132340>] remote_function+0x40/0x50
Apr 18 10:36:13 haswell kernel: [ 999.992805] [<ffffffff810de8fd>] generic_smp_call_function_single_interrupt+0x5d/0x100
Apr 18 10:36:13 haswell kernel: [ 1000.002312] [<ffffffff810421dd>] smp_trace_call_function_single_interrupt+0x2d/0xb0
Apr 18 10:36:13 haswell kernel: [ 1000.011260] [<ffffffff81661c9d>] trace_call_function_single_interrupt+0x6d/0x80
Apr 18 10:36:13 haswell kernel: [ 1000.020035] <EOI> [<ffffffff810b1c26>] ? lock_acquire+0xb6/0x120
Apr 18 10:36:13 haswell kernel: [ 1000.027489] [<ffffffff81139b4b>] ? __perf_sw_event+0x6b/0x230
Apr 18 10:36:13 haswell kernel: [ 1000.034488] [<ffffffff81139bcf>] __perf_sw_event+0xef/0x230
Apr 18 10:36:13 haswell kernel: [ 1000.041203] [<ffffffff81139b4b>] ? __perf_sw_event+0x6b/0x230
Apr 18 10:36:13 haswell kernel: [ 1000.048190] [<ffffffff810b139d>] ? __lock_acquire.isra.29+0x3bd/0xb90
Apr 18 10:36:13 haswell kernel: [ 1000.055836] [<ffffffff8165b2ce>] __do_page_fault+0x21e/0x530
Apr 18 10:36:13 haswell kernel: [ 1000.062586] [<ffffffff813cdd6f>] ? debug_check_no_obj_freed+0x19f/0x360
Apr 18 10:36:13 haswell kernel: [ 1000.070501] [<ffffffff81149ca2>] ? free_pages_prepare+0x1b2/0x230
Apr 18 10:36:13 haswell kernel: [ 1000.077811] [<ffffffff8165b647>] trace_do_page_fault+0x37/0xb0
Apr 18 10:36:13 haswell kernel: [ 1000.084762] [<ffffffff81657cd8>] trace_page_fault+0x28/0x30
Apr 18 10:36:13 haswell kernel: [ 1000.091480] [<ffffffff813c12a0>] ? __put_user_4+0x20/0x30
Apr 18 10:36:13 haswell kernel: [ 1000.098045] [<ffffffff81096a3c>] ? schedule_tail+0x5c/0xb0
Apr 18 10:36:13 haswell kernel: [ 1000.104653] [<ffffffff816608cf>] ret_from_fork+0xf/0xb0
Apr 18 10:36:13 haswell kernel: [ 1000.111017] ---[ end trace b175966afd57a176 ]---
On Fri, Apr 18, 2014 at 10:45:47AM -0400, Vince Weaver wrote:
>
> OK, since the slab corruption was happening to event->hlist_entry->pprev
> I added a WARN() call to every modifier of pprev under
> include/linux/*list*.h to see what was stomping over freed memory.
>
> This is what came up:
>
> Apr 18 10:36:11 haswell kernel: [ 998.316177] ------------[ cut here ]------------
> Apr 18 10:36:11 haswell kernel: [ 998.321188] WARNING: CPU: 3 PID: 20717 at include/linux/rculist.h:410 perf_trace_add+0xc1/0x100()
> Apr 18 10:36:11 haswell kernel: [ 998.330681] Modules linked in: fuse x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi kvm snd_hda_intel iTCO_wdt snd_hda_controller i915 snd_hda_codec evdev crct10dif_pclmul drm_kms_helper iTCO_vendor_support snd_hwdep snd_pcm drm crc32_pclmul snd_seq mei_me parport_pc parport lpc_ich mfd_core psmouse ghash_clmulni_intel snd_timer snd_seq_device mei snd aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd soundcore pcspkr serio_raw i2c_i801 processor video i2c_algo_bit i2c_core wmi button battery tpm_tis tpm sg sd_mod sr_mod crc_t10dif crct10dif_common cdrom ahci ehci_pci libahci xhci_hcd e1000e libata ehci_hcd ptp scsi_mod crc32c_intel usbcore pps_core usb_common fan thermal thermal_sys
> Apr 18 10:36:11 haswell kernel: [ 998.405736] CPU: 3 PID: 20717 Comm: perf_fuzzer Not tainted 3.15.0-rc1+ #63
> Apr 18 10:36:11 haswell kernel: [ 998.413162] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
> Apr 18 10:36:11 haswell kernel: [ 998.420987] 0000000000000009 ffff880117923c70 ffffffff8164f753 0000000000000000
> Apr 18 10:36:11 haswell kernel: [ 998.429016] ffff880117923ca8 ffffffff810647cd ffffe8ffffcc30d8 ffff8800ce29f040
> Apr 18 10:36:11 haswell kernel: [ 998.437121] ffffffff81c1ba40 ffff8800cd3d9040 000000da35b18e50 ffff880117923cb8
> Apr 18 10:36:11 haswell kernel: [ 998.445246] Call Trace:
> Apr 18 10:36:11 haswell kernel: [ 998.447910] [<ffffffff8164f753>] dump_stack+0x45/0x56
> Apr 18 10:36:11 haswell kernel: [ 998.453451] [<ffffffff810647cd>] warn_slowpath_common+0x7d/0xa0
> Apr 18 10:36:11 haswell kernel: [ 998.459871] [<ffffffff810648aa>] warn_slowpath_null+0x1a/0x20
> Apr 18 10:36:11 haswell kernel: [ 998.466143] [<ffffffff81125a01>] perf_trace_add+0xc1/0x100
> Apr 18 10:36:11 haswell kernel: [ 998.472160] [<ffffffff81136640>] event_sched_in.isra.76+0x90/0x1e0
> Apr 18 10:36:11 haswell kernel: [ 998.478849] [<ffffffff811367f9>] group_sched_in+0x69/0x1e0
> Apr 18 10:36:11 haswell kernel: [ 998.484812] [<ffffffff81136e45>] __perf_event_enable+0x255/0x260
> Apr 18 10:36:11 haswell kernel: [ 998.491370] [<ffffffff81132340>] remote_function+0x40/0x50
> Apr 18 10:36:11 haswell kernel: [ 998.497311] [<ffffffff810de116>] generic_exec_single+0x126/0x170
> Apr 18 10:36:11 haswell kernel: [ 998.503764] [<ffffffff81132300>] ? task_clock_event_add+0x40/0x40
> Apr 18 10:36:11 haswell kernel: [ 998.510432] [<ffffffff810de1c7>] smp_call_function_single+0x67/0xa0
> Apr 18 10:36:11 haswell kernel: [ 998.517299] [<ffffffff811312b4>] task_function_call+0x44/0x50
> Apr 18 10:36:11 haswell kernel: [ 998.523539] [<ffffffff81136bf0>] ? perf_event_sched_in+0x90/0x90
> Apr 18 10:36:11 haswell kernel: [ 998.530085] [<ffffffff81131350>] perf_event_enable+0x90/0xf0
> Apr 18 10:36:11 haswell kernel: [ 998.536308] [<ffffffff811312c0>] ? task_function_call+0x50/0x50
> Apr 18 10:36:11 haswell kernel: [ 998.542761] [<ffffffff8113142a>] perf_event_for_each_child+0x3a/0xa0
> Apr 18 10:36:11 haswell kernel: [ 998.551512] [<ffffffff811379af>] perf_event_task_enable+0x4f/0x80
> Apr 18 10:36:11 haswell kernel: [ 998.560080] [<ffffffff8107c015>] SyS_prctl+0x255/0x4b0
> Apr 18 10:36:11 haswell kernel: [ 998.567605] [<ffffffff813c1406>] ? lockdep_sys_exit_thunk+0x35/0x67
> Apr 18 10:36:11 haswell kernel: [ 998.576333] [<ffffffff816609ed>] system_call_fastpath+0x1a/0x1f
> Apr 18 10:36:11 haswell kernel: [ 998.584698] ---[ end trace b175966afd57a174 ]---
> Apr 18 10:36:12 haswell kernel: [ 998.910691] ------------[ cut here ]------------
OK, that's a good clue. That looks like we're freeing events that still
are on the owner list, which would indicate we're freeing events that
have a refcount.
I added a WARN in free_event() to check the refcount; along with a
number of false positives (through the perf_event_open() fail path), I do
appear to be getting actual fails here.
At least I can 'reproduce' this. Earlier attempts, even based on your
.config, only got me very mysterious lockups -- I suspect the corruption
happens in a slightly different spot or so and completely messes up the
machine.
On Fri, Apr 18, 2014 at 05:23:14PM +0200, Peter Zijlstra wrote:
> OK, that's a good clue. That looks like we're freeing events that still
> are on the owner list, which would indicate we're freeing events that
> have a refcount.
>
> I added a WARN in free_event() to check the refcount, along with a
> number of false positives (through the perf_event_open() fail path) I do
> appear to be getting actual fails here.
>
> At least I can 'reproduce' this. Earlier attempts, even based on your
> .config only got me very mysterious lockups -- I suspect the corruption
> happens on a slightly different spot or so and completely messes up the
> machine.
The below should have only made the false positives go away, but my
machine has magically stopped going all funny on me. Could you give it a
go?
---
kernel/events/core.c | 73 +++++++++++++++++++++++++++++++---------------------
1 file changed, 43 insertions(+), 30 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f83a71a3e46d..f9d7b859395e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3231,8 +3231,11 @@ static void __free_event(struct perf_event *event)
call_rcu(&event->rcu_head, free_event_rcu);
}
-static void free_event(struct perf_event *event)
+
+static void _free_event(struct perf_event *event)
{
+ WARN_ON(atomic_long_read(&event->refcount));
+
irq_work_sync(&event->pending);
unaccount_event(event);
@@ -3259,45 +3262,28 @@ static void free_event(struct perf_event *event)
if (is_cgroup_event(event))
perf_detach_cgroup(event);
-
__free_event(event);
}
-int perf_event_release_kernel(struct perf_event *event)
-{
- struct perf_event_context *ctx = event->ctx;
-
- WARN_ON_ONCE(ctx->parent_ctx);
- /*
- * There are two ways this annotation is useful:
- *
- * 1) there is a lock recursion from perf_event_exit_task
- * see the comment there.
- *
- * 2) there is a lock-inversion with mmap_sem through
- * perf_event_read_group(), which takes faults while
- * holding ctx->mutex, however this is called after
- * the last filedesc died, so there is no possibility
- * to trigger the AB-BA case.
- */
- mutex_lock_nested(&ctx->mutex, SINGLE_DEPTH_NESTING);
- raw_spin_lock_irq(&ctx->lock);
- perf_group_detach(event);
- raw_spin_unlock_irq(&ctx->lock);
- perf_remove_from_context(event);
- mutex_unlock(&ctx->mutex);
-
- free_event(event);
+static void put_event(struct perf_event *event);
- return 0;
+static void free_event(struct perf_event *event)
+{
+ if (unlikely(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1)) {
+ WARN(1, "unexpected event refcount: %ld\n",
+ atomic_long_read(&event->refcount));
+ put_event(event);
+ return;
+ }
+ _free_event(event);
}
-EXPORT_SYMBOL_GPL(perf_event_release_kernel);
/*
* Called when the last reference to the file is gone.
*/
static void put_event(struct perf_event *event)
{
+ struct perf_event_context *ctx = event->ctx;
struct task_struct *owner;
if (!atomic_long_dec_and_test(&event->refcount))
@@ -3336,9 +3322,36 @@ static void put_event(struct perf_event *event)
put_task_struct(owner);
}
- perf_event_release_kernel(event);
+ WARN_ON_ONCE(ctx->parent_ctx);
+ /*
+ * There are two ways this annotation is useful:
+ *
+ * 1) there is a lock recursion from perf_event_exit_task
+ * see the comment there.
+ *
+ * 2) there is a lock-inversion with mmap_sem through
+ * perf_event_read_group(), which takes faults while
+ * holding ctx->mutex, however this is called after
+ * the last filedesc died, so there is no possibility
+ * to trigger the AB-BA case.
+ */
+ mutex_lock_nested(&ctx->mutex, SINGLE_DEPTH_NESTING);
+ raw_spin_lock_irq(&ctx->lock);
+ perf_group_detach(event);
+ raw_spin_unlock_irq(&ctx->lock);
+ perf_remove_from_context(event);
+ mutex_unlock(&ctx->mutex);
+
+ _free_event(event);
}
+int perf_event_release_kernel(struct perf_event *event)
+{
+ put_event(event);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(perf_event_release_kernel);
+
static int perf_release(struct inode *inode, struct file *file)
{
put_event(file->private_data);
On Fri, Apr 18, 2014 at 06:59:58PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 18, 2014 at 05:23:14PM +0200, Peter Zijlstra wrote:
> > OK, that's a good clue. That looks like we're freeing events that still
> > are on the owner list, which would indicate we're freeing events that
> > have a refcount.
> >
> > I added a WARN in free_event() to check the refcount, along with a
> > number of false positives (through the perf_event_open() fail path) I do
> > appear to be getting actual fails here.
> >
> > At least I can 'reproduce' this. Earlier attempts, even based on your
> > .config only got me very mysterious lockups -- I suspect the corruption
> > happens on a slightly different spot or so and completely messes up the
> > machine.
>
> The below should have only made the false positives go away, but my
> machine has magically stopped going all funny on me. Could you give it a
> go?
>
Hmm, the fuzzer task seems stuck in kernel space, can't kill it anymore.
So it's likely it just didn't get around to doing enough to wreck the
system or so.
/me goes stab it in the eye.
On Fri, 18 Apr 2014, Peter Zijlstra wrote:
> Hmm, the fuzzer task seems stuck in kernel space, can't kill it anymore.
>
> So it's likely it just didn't get around to doing enough to wreck the
> system or so.
>
> /me goes stab it in the eye.
OK, I managed to get a trace while this bug was happening.
From my (non-expert) analysis, this is what happens.
[CPU0] 1422.741358 -- perf_event_open() opens event 17 (0x11)
which kmalloc()'d event struct address 0xffff8800cf213000
[CPU1] 1422.814014 -- clone() is called, spawning process 31443 on CPU7
event 17 is inherited across the clone
[CPU1] 1422.816957 -- in parent thread, event 17 is closed
[CPU1] 1422.820013 -- parent thread kills child process 31443,
last known user of closed event 17
....
[CPU7] 1422.856881 -- grace period expires, kfree of 0xffff8800cf213000
from CPU of child
....
[CPU1] 1423.154079 -- a prctl call to activate events calls
perf_swevent_add() which calls
hlist_add_head_rcu() which finds the first
element in the CPU1 swevent_htable hash list to
be our already freed (and poisoned)
0xffff8800cf213000
In any case, when we close the event, are we somehow not removing
it from all of the swevent_htables (one per CPU)?
A link to the trace can be found here:
web.eece.maine.edu/~vweaver/junk/interesting.trace.bz2
And the log splat here:
[ 1423.159052] WARNING: CPU: 1 PID: 30135 at include/linux/rculist.h:411 perf_swevent_add+0x16f/0x190()
[ 1423.168825] Modules linked in: fuse snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp kvm snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep crct10dif_pclmul i915 snd_pcm crc32_pclmul iTCO_wdt ghash_clmulni_intel aesni_intel snd_seq evdev iTCO_vendor_support drm_kms_helper snd_timer aes_x86_64 lrw gf128mul drm snd_seq_device glue_helper psmouse snd processor mei_me soundcore ablk_helper cryptd mei pcspkr video battery serio_raw i2c_i801 i2c_algo_bit lpc_ich mfd_core tpm_tis tpm parport_pc parport i2c_core wmi button sg sd_mod sr_mod crc_t10dif crct10dif_common cdrom ahci ehci_pci libahci e1000e xhci_hcd ehci_hcd libata ptp crc32c_intel scsi_mod usbcore usb_common pps_core fan thermal thermal_sys
[ 1423.242637] CPU: 1 PID: 30135 Comm: perf_fuzzer Not tainted 3.15.0-rc1+ #86
[ 1423.250125] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[ 1423.258049] 0000000000000009 ffff8800c30e5c78 ffffffff8164f7a3 0000000000000000
[ 1423.266087] ffff8800c30e5cb0 ffffffff810647cd ffff880118383000 ffff8800cf213040
[ 1423.274159] ffff8800b9e36788 ffff880118383040 00000145269017e9 ffff8800c30e5cc0
[ 1423.282173] Call Trace:
[ 1423.284791] [<ffffffff8164f7a3>] dump_stack+0x45/0x56
[ 1423.290352] [<ffffffff810647cd>] warn_slowpath_common+0x7d/0xa0
[ 1423.296775] [<ffffffff810648aa>] warn_slowpath_null+0x1a/0x20
[ 1423.303064] [<ffffffff8113211f>] perf_swevent_add+0x16f/0x190
[ 1423.309348] [<ffffffff811367a0>] event_sched_in.isra.76+0x90/0x1e0
[ 1423.316084] [<ffffffff81136959>] group_sched_in+0x69/0x1e0
[ 1423.322076] [<ffffffff81136fa5>] __perf_event_enable+0x255/0x260
[ 1423.328580] [<ffffffff81132360>] remote_function+0x40/0x50
[ 1423.334599] [<ffffffff810de126>] generic_exec_single+0x126/0x170
[ 1423.341136] [<ffffffff81132320>] ? task_clock_event_add+0x40/0x40
[ 1423.347809] [<ffffffff810de1d7>] smp_call_function_single+0x67/0xa0
[ 1423.354642] [<ffffffff811312d4>] task_function_call+0x44/0x50
[ 1423.360901] [<ffffffff81136d50>] ? perf_event_sched_in+0x90/0x90
[ 1423.367441] [<ffffffff81131370>] perf_event_enable+0x90/0xf0
[ 1423.373612] [<ffffffff811312e0>] ? task_function_call+0x50/0x50
[ 1423.380089] [<ffffffff8113144a>] perf_event_for_each_child+0x3a/0xa0
[ 1423.386949] [<ffffffff81137b0f>] perf_event_task_enable+0x4f/0x80
[ 1423.393609] [<ffffffff8107c015>] SyS_prctl+0x255/0x4b0
[ 1423.399208] [<ffffffff81660c84>] tracesys+0xe1/0xe6
[ 1423.404539] ---[ end trace c9ab81bd2a5a1d1d ]---
[ 1423.506804] Slab corruption (Tainted: G W ): kmalloc-2048 start=ffff8800cf213000, len=2048
[ 1423.516610] 040: 6b 6b 6b 6b 6b 6b 6b 6b 88 67 e3 b9 00 88 ff ff kkkkkkkk.g......
[ 1423.524775] Next obj: start=ffff8800cf213800, len=2048
[ 1423.530314] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 1423.538465] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
I got the bug to trigger again; this time it finally managed to hit a
debug_objects WARNING, if that's any additional help.
The bug followed the same pattern, software event
(PERF_TYPE_SOFTWARE / PERF_COUNT_SW_TASK_CLOCK) created, fork happens,
event closes in parent, child killed, rcu grace period expires and kfree
but event still active.
here's the kernel message followed by excerpts from the trace, I can
provide full trace if anyone cares.
Vince
[ 2226.252441] ------------[ cut here ]------------
[ 2226.257503] WARNING: CPU: 4 PID: 0 at lib/debugobjects.c:260 debug_print_object+0x83/0xa0()
[ 2226.266545] ODEBUG: free active (active state 0) object type: hrtimer hint: perf_swevent_hrtimer+0x0/0x140
[ 2226.276952] Modules linked in: fuse x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic coretemp snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep kvm i915 crct10dif_pclmul crc32_pclmul iTCO_wdt snd_pcm drm_kms_helper ghash_clmulni_intel iTCO_vendor_support snd_seq snd_timer snd_seq_device aesni_intel snd lpc_ich drm evdev i2c_i801 aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse soundcore serio_raw pcspkr i2c_algo_bit parport_pc parport mei_me mei mfd_core i2c_core wmi button processor video battery tpm_tis tpm sg sd_mod sr_mod crc_t10dif crct10dif_common cdrom ehci_pci ahci e1000e xhci_hcd ehci_hcd libahci libata ptp crc32c_intel usbcore scsi_mod usb_common pps_core fan thermal thermal_sys
[ 2226.350769] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 3.15.0-rc1+ #87
[ 2226.357730] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[ 2226.365658] 0000000000000009 ffff88011eb03cd8 ffffffff8164f7b3 ffff88011eb03d20
[ 2226.373728] ffff88011eb03d10 ffffffff810647cd ffff8800ce03c888 ffffffff81c433e0
[ 2226.381835] ffffffff81a19730 ffff8800cf4e0000 ffff8800ce03c888 ffff88011eb03d70
[ 2226.389820] Call Trace:
[ 2226.392428] <IRQ> [<ffffffff8164f7b3>] dump_stack+0x45/0x56
[ 2226.398595] [<ffffffff810647cd>] warn_slowpath_common+0x7d/0xa0
[ 2226.405059] [<ffffffff8106483c>] warn_slowpath_fmt+0x4c/0x50
[ 2226.411240] [<ffffffff813cc9e3>] debug_print_object+0x83/0xa0
[ 2226.417535] [<ffffffff81139200>] ? __perf_event_overflow+0x270/0x270
[ 2226.424463] [<ffffffff813cde73>] debug_check_no_obj_freed+0x263/0x360
[ 2226.431500] [<ffffffff811316aa>] ? free_event_rcu+0x2a/0x30
[ 2226.437579] [<ffffffff81196fd0>] kfree+0xb0/0x560
[ 2226.442740] [<ffffffff810ccd46>] ? rcu_process_callbacks+0x236/0x620
[ 2226.449658] [<ffffffff81131680>] ? pmu_dev_release+0x10/0x10
[ 2226.455811] [<ffffffff811316aa>] free_event_rcu+0x2a/0x30
[ 2226.461727] [<ffffffff810ccdad>] rcu_process_callbacks+0x29d/0x620
[ 2226.468440] [<ffffffff810ccd46>] ? rcu_process_callbacks+0x236/0x620
[ 2226.475384] [<ffffffff81069ab5>] __do_softirq+0xf5/0x290
[ 2226.481210] [<ffffffff81069e9d>] irq_exit+0xad/0xc0
[ 2226.486540] [<ffffffff81662e35>] smp_apic_timer_interrupt+0x45/0x60
[ 2226.493350] [<ffffffff8166181d>] apic_timer_interrupt+0x6d/0x80
[ 2226.499798] <EOI> [<ffffffff810d958e>] ? tick_nohz_idle_exit+0x12e/0x1b0
[ 2226.507192] [<ffffffff810aa7de>] cpu_startup_entry+0x12e/0x3d0
[ 2226.513542] [<ffffffff81042a43>] start_secondary+0x193/0x200
[ 2226.519706] ---[ end trace ec55e71b02ef43b3 ]---
Event Created:
<...>-13590 [000] 2225.706150: sys_enter: NR 298 (699a70, 0, ffffffff, ffffffff, 8, 8)
<...>-13590 [000] 2225.706160: kmalloc: (perf_event_alloc+0x55) call_site=ffffffff8113a565 ptr=0xffff8800cfa02000 bytes_req=1272 bytes_alloc=2048 gfp_flags=GFP_KERNEL|GFP_ZERO
<...>-13590 [000] 2225.706180: bprint: SYSC_perf_event_open: Opened: 1 1 0 (PERF_TYPE_SOFTWARE,PERF_COUNT_SW_TASK_CLOCK)
<...>-13590 [000] 2225.706180: sys_exit: NR 298 = 14 (0xe)
Fork:
<...>-13590 [003] 2226.204981: sys_enter: NR 56 (1200011, 0, 0, 7f6fab28b9d0, 0, 3516)
Close in parent:
<...>-13590 [003] 2226.216548: sys_enter: NR 3 (e, 11000, 11000, 22, 7f6fab0780b4, 7f6fab078120)
<...>-14467 [004] 2226.216548: mm_page_free: page=0xffffea0002d567e0 pfn=47540192 order=0
<...>-13590 [003] 2226.216549: sys_exit: NR 3 = 0
Kill of child:
<...>-13590 [002] 2226.245087: sys_enter: NR 62 (3884, 9, 7, 1, 7f6fab0780fc, 7f6fab078120)
Grace period expire/kfree:
<idle>-0 [004] 2226.252428: kfree: (free_event_rcu+0x2a) call_site=ffffffff811316aa ptr=0xffff8800cfa02000
On Thu, 24 Apr 2014, Vince Weaver wrote:
> [ 2226.257503] WARNING: CPU: 4 PID: 0 at lib/debugobjects.c:260 debug_print_object+0x83/0xa0()
> [ 2226.266545] ODEBUG: free active (active state 0) object type: hrtimer hint: perf_swevent_hrtimer+0x0/0x140
> [ 2226.389820] Call Trace:
> [ 2226.392428] <IRQ> [<ffffffff8164f7b3>] dump_stack+0x45/0x56
> [ 2226.398595] [<ffffffff810647cd>] warn_slowpath_common+0x7d/0xa0
> [ 2226.405059] [<ffffffff8106483c>] warn_slowpath_fmt+0x4c/0x50
> [ 2226.411240] [<ffffffff813cc9e3>] debug_print_object+0x83/0xa0
> [ 2226.417535] [<ffffffff81139200>] ? __perf_event_overflow+0x270/0x270
> [ 2226.424463] [<ffffffff813cde73>] debug_check_no_obj_freed+0x263/0x360
> [ 2226.431500] [<ffffffff811316aa>] ? free_event_rcu+0x2a/0x30
> [ 2226.437579] [<ffffffff81196fd0>] kfree+0xb0/0x560
> [ 2226.442740] [<ffffffff810ccd46>] ? rcu_process_callbacks+0x236/0x620
> [ 2226.449658] [<ffffffff81131680>] ? pmu_dev_release+0x10/0x10
> [ 2226.455811] [<ffffffff811316aa>] free_event_rcu+0x2a/0x30
> [ 2226.461727] [<ffffffff810ccdad>] rcu_process_callbacks+0x29d/0x620
> [ 2226.468440] [<ffffffff810ccd46>] ? rcu_process_callbacks+0x236/0x620
> [ 2226.475384] [<ffffffff81069ab5>] __do_softirq+0xf5/0x290
> [ 2226.481210] [<ffffffff81069e9d>] irq_exit+0xad/0xc0
> [ 2226.486540] [<ffffffff81662e35>] smp_apic_timer_interrupt+0x45/0x60
> [ 2226.493350] [<ffffffff8166181d>] apic_timer_interrupt+0x6d/0x80
> [ 2226.499798] <EOI> [<ffffffff810d958e>] ? tick_nohz_idle_exit+0x12e/0x1b0
> [ 2226.507192] [<ffffffff810aa7de>] cpu_startup_entry+0x12e/0x3d0
> [ 2226.513542] [<ffffffff81042a43>] start_secondary+0x193/0x200
> [ 2226.519706] ---[ end trace ec55e71b02ef43b3 ]---
so it's looking more and more like this issue is with a
PERF_COUNT_SW_TASK_CLOCK event.
It's being deallocated in a different process than the one it was
started in (due to fork).
And it really looks like the problem is that even though the event is
free'd, there's still an active hrtimer associated with it somehow.
I can't seem to find *why* there's an associated hrtimer though, as the
event as far as I can tell was created with sample_period=0 and the
various perf_swevent_init_hrtimer() calls seem to guard with
is_sampling().
This is made all the more confusing because the PERF_COUNT_SW_TASK_CLOCK
events are handled by their own PMU even though it's faked up so they look
like regular software events. Is there a reason for that?
Vince
On Thu, 24 Apr 2014, Vince Weaver wrote:
> I got the bug to trigger again, this time it finally managed to hit a
> debug_objects WARNING if that's any additional help.
>
> The bug followed the same pattern, software event
> (PERF_TYPE_SOFTWARE / PERF_COUNT_SW_TASK_CLOCK) created, fork happens,
> event closes in parent, child killed, rcu grace period expires and kfree
> but event still active.
>
> here's the kernel message followed by excerpts from the trace, I can
> provide full trace if anyone cares.
I care :)
Thanks,
tglx
OK, this is my current theory as to what's going on. I'd appreciate any
comments.
We have an event, let's call it #16.
Event #16 is a SW event created and running in the parent on CPU0.
CPU0 (parent): calls fork()
CPU6 (child): SW Event #16 is still running on CPU0 but is visible
on CPU6 because the fd passed through with fork
CPU0 (parent) close #16. Event not deallocated because
still visible in child
CPU0 (parent) kill child
CPU6 (child) shutting down.
last user of event #16
perf_release() called on event
which eventually calls event_sched_out()
which calls pmu->del which removes event from swevent_htable
*but only on CPU6*
**** some sort of race happens with CPU0 (possibly with
event_sched_in() and event->state==PERF_EVENT_STATE_INACTIVE)
That has event #16 in the cpu0 swevent_htable but not
freed the next time ctx_sched_out() happens ****
CPU6 (idle) grace period expires, kfree happens
the CPU0 hlist still has in the list the now freed (and poisoned)
event which causes problems, especially as new events added to
the list over-write bytes starting at 0x48 with pprev values.
Vince
On Mon, Apr 28, 2014 at 10:21:34AM -0400, Vince Weaver wrote:
> so it's looking more and more like this issue is with a
> PERF_COUNT_SW_TASK_CLOCK
> event.
But they don't actually use the hlist thing..
> It's being deallocated in a different process than it was started (due to
> fork).
>
> And it really looks like the problem is even though the event is free'd,
> there's still an active hrtimer associated with it somehow.
So this is a different problem from the hlist corruption?
> I can't seem to find *why* there's an associated hrtimer though, as the
> event as far as I can tell was created with sample_period=0 and the
> various
> perf_swevent_init_hrtimer()
> calls seem to guard with is_sampling()
That is indeed, decidedly odd.
> This is made all the more confusing because the PERF_COUNT_SW_TASK_CLOCK
> events are handled by their own PMU even though it's faked up so they look
> like regular software events. Is there a reason for that?
This was the easiest route when we introduced the multiple pmu thing;
it's been on the todo list for a cleanup ever since :-/
On Mon, Apr 28, 2014 at 03:38:38PM -0400, Vince Weaver wrote:
>
> OK, this is my current theory as to what's going on. I'd appreciate any
> comments.
>
>
> We have an event, let's call it #16.
>
> Event #16 is a SW event created and running in the parent on CPU0.
A regular software one, right? Not a timer one.
> CPU0 (parent): calls fork()
>
> CPU6 (child): SW Event #16 is still running on CPU0 but is visible
> on CPU6 because the fd passed through with fork
>
> CPU0 (parent) close #16. Event not deallocated because
> still visible in child
>
> CPU0 (parent) kill child
OK so far..
> CPU6 (child) shutting down.
> last user of event #16
> perf_release() called on event
> which eventually calls event_sched_out()
> which calls pmu->del which removes event from swevent_htable
> *but only on CPU6*
So on fork() we'll also clone the counter; after which there's two. One
will run on each task.
Because of a context switch optimization they can actually flip around
(the below patch disables that).
So there's (possibly) two events being killed here:
1) the event that is attached to the child process, it will be detached
and freed. do_exit() -> perf_event_exit_task() ->
perf_event_exit_task_context() -> __perf_event_exit_task()
If this is the cloned event, it will put the original event and be
freed here.
If the child ran the cloned event, then:
2) the closing of the fd, coupled with the put of 1) will drop the
refcount of the original event to 0 and it too will be removed and
freed.
IF however the original and cloned events were flipped at the time, the
child exit will detach the original event, but since the parent will
still have a cloned event attached, the clone will keep the event alive.
In this case no events will be freed until the parent too exits; at
which point the cloned event will get detached and freed. That will put
the last reference on the actual event, and that too will go.
Now, seeing that you actually see an event getting freed, we'll have to
assume the former situation, where the original event is on the parent
process and the cloned event is on the child process.
> **** some sort of race happens with CPU0 (possibly with
> event_sched_in() and event->state==PERF_EVENT_STATE_INACTIVE)
> That has event #16 in the cpu0 swevent_htable but not
> freed the next time ctx_sched_out() happens ****
So on do_exit(), exit_files() happens first, it will drop the refcount
of the original event to 1.
After that, perf_event_exit_task() happens, it will (as per the
callchain above) end up in __perf_event_exit_task().
__perf_event_exit_task() will call perf_group_detach, however no groups
involved here afaik, so that's quickly done.
It will then call perf_remove_from_context() which will try to
deschedule (which is likely already done by
perf_event_exit_task_context() which de-schedules the entire context in
one go), and then remove the event from the context.
Since it is the cloned event, it will then call sync_child_event(),
which will push whatever counts it has gathered into the original
(parent) event, and detach itself from the parent.
This will have done put_event(parent_event), which will drop the
refcount of the original event to 0. put_event() will in turn call the
same things: perf_group_detach() -- no groups, done quickly.
perf_remove_from_context(), this will IPI from CPU6 to CPU0, and
deschedule the original event, calling ->del() on CPU0, and as per the
above continue doing list_del_event() detaching itself from the context.
After the IPI put_event() will continue with _free_event() and we'll
call ->destroy() and call_rcu and the event will be no more.
After all that, the child continues calling free_event() which will also
call ->destroy() (but on the child event) and do the same call_rcu(),
also freeing this event.
Nothing left.
> CPU6 (idle) grace period expires, kfree happens
>
> the CPU0 hlist still has in the list the now freed (and poisoned)
> event which causes problems, especially as new events added to
> the list over-write bytes starting at 0x48 with pprev values.
Right, so clearly something's gone funny.. above you speculate on a race
against event_sched_in(), but both paths serialize on event->ctx->lock.
__perf_remove_from_context() takes ctx->lock, as do the sched_in/out
paths.
quite the puzzle this one
On Tue, 29 Apr 2014, Peter Zijlstra wrote:
> On Mon, Apr 28, 2014 at 10:21:34AM -0400, Vince Weaver wrote:
> > so it's looking more and more like this issue is with a
> > PERF_COUNT_SW_TASK_CLOCK
> > event.
>
> But they don't actually use the hlist thing..
yes.
This turns out to be another issue that I think is just use-after-free
memory corruption exhibiting itself in a different way.
I've documented at least 8 different types of error message that I think
are all due to this issue.
> So this is a different problem from the hlist corruption?
Who knows. That's why I'm trying to get this issue fixed so I can figure
out which of the 10+ other bugs I'm tracking are the same or different.
> > This is made all the more confusing because the PERF_COUNT_SW_TASK_CLOCK
> > events are handled by their own PMU even though it's faked up so they look
> > like regular software events. Is there a reason for that?
>
> This was the easiest route when we introduced the multiple pmu thing;
> it's been on the todo list for a cleanup ever since :-/
It was very confusing and poorly documented, as is much of the perf_event
files. And yes, I know, I should do something about it rather than
complain.
I've actually given up on source code inspection to figure out what's
going on in kernel/events/core.c. What I do now is write simple test
cases and do an ftrace function trace. The results are often surprising.
Vince
On Tue, 29 Apr 2014, Peter Zijlstra wrote:
> > Event #16 is a SW event created and running in the parent on CPU0.
>
> A regular software one, right? Not a timer one.
Maybe. From the traces I have it looks like it's a regular one (i.e. it
calls perf_swevent_add()) but who knows at this point.
When I actually got a trace with perf_event_open() instrumented to print
some attr values it looked like things were being caused by
PERF_COUNT_SW_TASK_CLOCK which makes no sense.
> > CPU6 (child) shutting down.
> > last user of event #16
> > perf_release() called on event
> > which eventually calls event_sched_out()
> > which calls pmu->del which removes event from swevent_htable
> > *but only on CPU6*
>
> So on fork() we'll also clone the counter; after which there's two. One
> will run on each task.
even if inherit isn't set?
> Because of a context switch optimization they can actually flip around
> (the below patch disables that).
ENOPATCH?
> quite the puzzle this one
yes.
I'm tediously working on trying to get a good trace of this happening.
I have a random seed that will trigger the bug in the fuzzer around 1 time
in 10.
Unfortunately many of the times it crashes so hard/quickly there's no
chance of getting the trace data (dump trace on oops never holds enough
state, and often the fuzzing triggers its own random trace events that
clutter those logs).
Also trace-cmd is a pain to use. Any suggested events I should trace
beyond the obvious?
Part of the problem is that, despite what the documentation says, it
doesn't look like you can combine the "-P pid" and "-c" (children)
options, which makes a forking problem like this a lot harder to trace.
It's sort of possible to get around that with a really complicated -F ""
command line that does sudo back to me (don't want to fuzz as root) and
such, but still awkward.
Vince
On Tue, Apr 29, 2014 at 02:21:56PM -0400, Vince Weaver wrote:
> On Tue, 29 Apr 2014, Peter Zijlstra wrote:
>
> > > Event #16 is a SW event created and running in the parent on CPU0.
> >
> > A regular software one, right? Not a timer one.
>
> Maybe. From traces I have it looks like it's a regular one (i.e. calls
> perf_swevent_add() ) but who knows at this point.
>
> When I actually got a trace with perf_event_open() instrumented to print
> some attr values it looked like things were being caused by
> PERF_COUNT_SW_TASK_CLOCK which makes no sense.
>
> > > CPU6 (child) shutting down.
> > > last user of event #16
> > > perf_release() called on event
> > > which eventually calls event_sched_out()
> > > which calls pmu->del which removes event from swevent_htable
> > > *but only on CPU6*
> >
> > So on fork() we'll also clone the counter; after which there's two. One
> > will run on each task.
>
> even if inherit isn't set?
Fair point, nope not in that case. If you can trigger this without ever
using .inherit=1 this would exclude a lot of funny code.
> > Because of a context switch optimization they can actually flip around
> > (the below patch disables that).
>
> ENOPATCH?
urgh.. fail.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5129b1201050..0d6a58950a3b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2293,6 +2291,7 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
if (!cpuctx->task_ctx)
return;
+#if 0
rcu_read_lock();
next_ctx = next->perf_event_ctxp[ctxn];
if (!next_ctx)
@@ -2335,6 +2334,7 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
}
unlock:
rcu_read_unlock();
+#endif
if (do_switch) {
raw_spin_lock(&ctx->lock);
> > quite the puzzle this one
>
> yes.
>
> I'm tediously working on trying to get a good trace of this happening.
>
> I have a random seed that will trigger the bug in the fuzzer around 1 time
> in 10.
>
> Unfortunately many of the times it crashes so hard/quickly there's no
> chance of getting the trace data (dump trace on oops never holds enough
> state, and often the fuzzing triggers its own random trace events that
> clutter those logs).
>
> Also trace-cmd is a pain to use. Any suggested events I should trace
> beyond the obvious?
I've never used trace-cmd :/ What I do in the crashing hard case is try
and make dump_ftrace_on_oops work, although capturing a full trace
buffer over serial is exceedingly painful -- maxcpus= might work if you
have too many CPUs, I forgot.
Anyway, I can make the fuzzer do weird shit, but it doesn't look like
the thing you're seeing, but who knows.
On Tue, 29 Apr 2014 14:11:09 -0400 (EDT)
Vince Weaver <[email protected]> wrote:
> I've actually given up on source code inspection to figure out what's
> going on in kernel/events/core.c. What I do now is write simple test
> cases and do an ftrace function trace. The results are often surprising.
You might want to remove this from the kernel/events/Makefile:
ifdef CONFIG_FUNCTION_TRACER
CFLAGS_REMOVE_core.o = -pg
endif
It will let you function trace more of what is happening.
-- Steve
On Tue, 29 Apr 2014 14:21:56 -0400 (EDT)
Vince Weaver <[email protected]> wrote:
> Also trace-cmd is a pain to use. Any suggested events I should trace
> beyond the obvious?
>
> Part of the problem is that despite what the documentation says it doesn't
> look like you can combine the "-P pid" and "-c" children option, which
> makes debugging a forking problem like this a lot harder to trace.
Yeah, I need kernel assistance to fix some of that.
>
> It's sort of possible to get around that with a really complicated -F ""
> command line that does sudo back to me (don't want to fuzz as root) and
> such, but still awkward.
I'll try to write up a patch that lets you use -P with -c. But due to
the (crappy) implementation with ptrace, trace-cmd needs to be a parent
of the task.
In the mean time, you could run this as root:
trace-cmd record -p function -F -c su non-root-user fuzz
-- Steve
On Tue, 29 Apr 2014, Peter Zijlstra wrote:
> Fair point, nope not in that case. If you can trigger this without ever
> using .inherit=1 this would exclude a lot of funny code.
I don't think inherit is being set, but I'm not actually sure.
I will have to add that to the trace_printk() and recompile/re-run
In the meantime I had a lucky crash and managed to catch a trace.
Unfortunately there's a lot of active events so it's not clear which is
which. I think this is going to need another round of trace generation :(
This trace can be found here:
http://web.eece.maine.edu/~vweaver/junk/bug.out.bz2
A summary:
The troublesome memory address is allocated as part of a perf_event_open
perf_fuzzer-4387 [001] 1802.628663: kmalloc: (perf_event_alloc+0x5a) call_site=ffffffff8113a8fa ptr=0xffff8800a3122800 bytes_req=1272 bytes_alloc=2048 gfp_flags=GFP_KERNEL|GFP_ZERO
The event opened successfully, fd=41, it looks like it is
PERF_COUNT_SW_EMULATION_FAULTS with attr.period=0
perf_fuzzer-4387 [001] 1802.628677: bprint: SYSC_perf_event_open: Opened: 1 8 0
perf_fuzzer-4387 [001] 1802.628677: sys_exit: NR 298 = 41
The parent forks:
perf_fuzzer-4387 [002] 1803.571239: sys_exit: NR 56 = 5504
The event is closed in the parent:
perf_fuzzer-4387 [002] 1803.582345: sys_enter: NR 3 (29, 3000, 3000, 7f7524d760a4, 7f7524d76108, 7f7524d76120)
perf_fuzzer-4387 [002] 1803.582345: sys_exit: NR 3 = 0
The parent kills the child:
perf_fuzzer-4387 [003] 1803.590145: sys_enter: NR 62 (1580, 9, 7, 7f7524d760b8, 7f7524d760b8, 7f7524d76120)
Presumably one of the many perf_swevent_del() calls in the child is us.
perf_fuzzer-5504 [004] 1803.590277: function: perf_swevent_del
*** The parent somehow fails to call perf_swevent_del() on CPU3? ***
The grace period expires and the memory is freed:
ksoftirqd/4-28 [004] 1803.609802: kfree: (free_event_rcu+0x2f) call_site=ffffffff8113177f ptr=0xffff8800a3122800
An event is deleted from swevent_hlist, but ->pprev was our free'd address:
perf_fuzzer-4387 [003] 1803.610555: function: perf_swevent_del
Slab corruption:
[ 1803.610555] ------------[ cut here ]------------
[ 1803.615419] WARNING: CPU: 3 PID: 4387 at include/linux/list.h:620 perf_swevent_del+0x6e/0x90()
[ 1803.948487] Slab corruption (Tainted: G W ): kmalloc-2048 start=ffff8800a3122800, len=2048
[ 1803.958294] 040: 6b 6b 6b 6b 6b 6b 6b 6b 88 f7 92 17 01 88 ff ff kkkkkkkk........
Vince, could you add the below to whatever tracing muck you already
have?
After staring at your traces all day with Thomas, we have doubts about
the refcount integrity.
---
kernel/events/core.c | 146 +++++++++++++++++++++++++++++----------------------
1 file changed, 82 insertions(+), 64 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5129b1201050..dac01e099f13 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1664,6 +1664,9 @@ event_sched_in(struct perf_event *event,
u64 tstamp = perf_event_time(event);
int ret = 0;
+ WARN_ON(event->ctx != ctx);
+ lockdep_assert_held(&ctx->lock);
+
if (event->state <= PERF_EVENT_STATE_OFF)
return 0;
@@ -2029,14 +2032,9 @@ static int __perf_event_enable(void *info)
if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE)
goto unlock;
- if (!group_can_go_on(event, cpuctx, 1)) {
- err = -EEXIST;
- } else {
- if (event == leader)
- err = group_sched_in(event, cpuctx, ctx);
- else
- err = event_sched_in(event, cpuctx, ctx);
- }
+ err = -EEXIST;
+ if (group_can_go_on(event, cpuctx, 1))
+ err = group_sched_in(event, cpuctx, ctx);
if (err) {
/*
@@ -2293,6 +2291,7 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
if (!cpuctx->task_ctx)
return;
+#if 0
rcu_read_lock();
next_ctx = next->perf_event_ctxp[ctxn];
if (!next_ctx)
@@ -2335,6 +2334,7 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
}
unlock:
rcu_read_unlock();
+#endif
if (do_switch) {
raw_spin_lock(&ctx->lock);
@@ -3233,10 +3233,18 @@ static void __free_event(struct perf_event *event)
if (event->pmu)
module_put(event->pmu->module);
+ WARN_ON(event->hlist_entry.pprev && event->hlist_entry.pprev != LIST_POISON2);
+
call_rcu(&event->rcu_head, free_event_rcu);
}
-static void free_event(struct perf_event *event)
+
+static void _free_event(struct perf_event *event)
{
+ long refs = atomic_long_read(&event->refcount);
+
+ WARN(refs, "freeing event with %ld refs left; ptr=0x%p\n", refs, event);
+ trace_printk("freeing with %ld refs; ptr=0x%p\n", refs, event);
+
irq_work_sync(&event->pending);
unaccount_event(event);
@@ -3263,48 +3271,32 @@ static void free_event(struct perf_event *event)
if (is_cgroup_event(event))
perf_detach_cgroup(event);
-
__free_event(event);
}
-int perf_event_release_kernel(struct perf_event *event)
+static void free_event(struct perf_event *event)
{
- struct perf_event_context *ctx = event->ctx;
-
- WARN_ON_ONCE(ctx->parent_ctx);
- /*
- * There are two ways this annotation is useful:
- *
- * 1) there is a lock recursion from perf_event_exit_task
- * see the comment there.
- *
- * 2) there is a lock-inversion with mmap_sem through
- * perf_event_read_group(), which takes faults while
- * holding ctx->mutex, however this is called after
- * the last filedesc died, so there is no possibility
- * to trigger the AB-BA case.
- */
- mutex_lock_nested(&ctx->mutex, SINGLE_DEPTH_NESTING);
- raw_spin_lock_irq(&ctx->lock);
- perf_group_detach(event);
- raw_spin_unlock_irq(&ctx->lock);
- perf_remove_from_context(event);
- mutex_unlock(&ctx->mutex);
-
- free_event(event);
-
- return 0;
+ if (unlikely(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1)) {
+ WARN(1, "unexpected event refcount: %ld; ptr=0x%p\n",
+ atomic_long_read(&event->refcount),
+ event);
+ return;
+ }
+ _free_event(event);
}
-EXPORT_SYMBOL_GPL(perf_event_release_kernel);
/*
* Called when the last reference to the file is gone.
*/
-static void put_event(struct perf_event *event)
+static void put_event(struct perf_event *event, struct perf_event *other)
{
+ struct perf_event_context *ctx = event->ctx;
struct task_struct *owner;
- if (!atomic_long_dec_and_test(&event->refcount))
+ long refs = atomic_long_sub_return(1, &event->refcount);
+ trace_printk("put ref: %ld; ptr=0x%p other=0x%p\n", refs, event, other);
+
+ if (refs)
return;
rcu_read_lock();
@@ -3316,14 +3308,8 @@ static void put_event(struct perf_event *event)
* owner->perf_event_mutex.
*/
smp_read_barrier_depends();
- if (owner) {
- /*
- * Since delayed_put_task_struct() also drops the last
- * task reference we can safely take a new reference
- * while holding the rcu_read_lock().
- */
- get_task_struct(owner);
- }
+ if (owner && !atomic_inc_not_zero(&owner->usage))
+ owner = NULL;
rcu_read_unlock();
if (owner) {
@@ -3340,12 +3326,39 @@ static void put_event(struct perf_event *event)
put_task_struct(owner);
}
- perf_event_release_kernel(event);
+ WARN_ON_ONCE(ctx->parent_ctx);
+ /*
+ * There are two ways this annotation is useful:
+ *
+ * 1) there is a lock recursion from perf_event_exit_task
+ * see the comment there.
+ *
+ * 2) there is a lock-inversion with mmap_sem through
+ * perf_event_read_group(), which takes faults while
+ * holding ctx->mutex, however this is called after
+ * the last filedesc died, so there is no possibility
+ * to trigger the AB-BA case.
+ */
+ mutex_lock_nested(&ctx->mutex, SINGLE_DEPTH_NESTING);
+ raw_spin_lock_irq(&ctx->lock);
+ perf_group_detach(event);
+ raw_spin_unlock_irq(&ctx->lock);
+ perf_remove_from_context(event);
+ mutex_unlock(&ctx->mutex);
+
+ _free_event(event);
+}
+
+int perf_event_release_kernel(struct perf_event *event)
+{
+ put_event(event, NULL);
+ return 0;
}
+EXPORT_SYMBOL_GPL(perf_event_release_kernel);
static int perf_release(struct inode *inode, struct file *file)
{
- put_event(file->private_data);
+ put_event(file->private_data, NULL);
return 0;
}
@@ -3969,6 +3982,11 @@ static void perf_mmap_close(struct vm_area_struct *vma)
*/
continue;
}
+
+ trace_printk("inc ref: %ld; ptr=0x%p\n",
+ atomic_long_read(&event->refcount),
+ event);
+
rcu_read_unlock();
mutex_lock(&event->mmap_mutex);
@@ -3988,7 +4006,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
ring_buffer_put(rb); /* can't be last, we still have one */
}
mutex_unlock(&event->mmap_mutex);
- put_event(event);
+ put_event(event, NULL);
/*
* Restart the iteration; either we're on the wrong list or
@@ -7374,7 +7392,7 @@ static void sync_child_event(struct perf_event *child_event,
* Release the parent event, if this was the last
* reference to it.
*/
- put_event(parent_event);
+ put_event(parent_event, child_event);
}
static void
@@ -7382,11 +7400,9 @@ __perf_event_exit_task(struct perf_event *child_event,
struct perf_event_context *child_ctx,
struct task_struct *child)
{
- if (child_event->parent) {
- raw_spin_lock_irq(&child_ctx->lock);
- perf_group_detach(child_event);
- raw_spin_unlock_irq(&child_ctx->lock);
- }
+ raw_spin_lock_irq(&child_ctx->lock);
+ perf_group_detach(child_event);
+ raw_spin_unlock_irq(&child_ctx->lock);
perf_remove_from_context(child_event);
@@ -7458,12 +7474,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
mutex_lock(&child_ctx->mutex);
again:
- list_for_each_entry_safe(child_event, tmp, &child_ctx->pinned_groups,
- group_entry)
- __perf_event_exit_task(child_event, child_ctx, child);
-
- list_for_each_entry_safe(child_event, tmp, &child_ctx->flexible_groups,
- group_entry)
+ list_for_each_entry_rcu(child_event, &child_ctx->event_list, event_entry)
__perf_event_exit_task(child_event, child_ctx, child);
/*
@@ -7472,8 +7483,10 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
* will still point to the list head terminating the iteration.
*/
if (!list_empty(&child_ctx->pinned_groups) ||
- !list_empty(&child_ctx->flexible_groups))
+ !list_empty(&child_ctx->flexible_groups)) {
+ WARN_ON_ONCE(1);
goto again;
+ }
mutex_unlock(&child_ctx->mutex);
@@ -7519,7 +7532,7 @@ static void perf_free_event(struct perf_event *event,
list_del_init(&event->child_list);
mutex_unlock(&parent->child_mutex);
- put_event(parent);
+ put_event(parent, event);
perf_group_detach(event);
list_del_event(event, ctx);
@@ -7605,6 +7618,11 @@ inherit_event(struct perf_event *parent_event,
return NULL;
}
+ trace_printk("inherit inc ref: %ld; ptr=0x%p other=0x%p\n",
+ atomic_long_read(&parent_event->refcount),
+ parent_event,
+ child_event);
+
get_ctx(child_ctx);
/*
On Wed, 30 Apr 2014, Peter Zijlstra wrote:
>
> Vince, could you add the below to whatever tracing muck you already
> have?
>
> After staring at your traces all day with Thomas, we have doubts about
> the refcount integrity.
I've been staring at traces all day too. Will try your patch tomorrow.
From my staring, what looks like is happening in the trace is:
task_sched_in in parent adds our freed (but alive in child) event:
perf_fuzzer-2517 [001] 215.228165: bprint: perf_swevent_add: VMW add_rcu: 0xffff880036cbb000
This adds the event to the swevent_hlist
the child is in the process of exiting, eventually frees the event
perf_fuzzer-3634 [006] 215.228250: function: perf_release
perf_fuzzer-3634 [006] 215.228250: function: perf_event_release_kernel
perf_fuzzer-3634 [006] 215.228251: function: perf_group_detach
perf_fuzzer-3634 [006] 215.228251: function: perf_event__header_size
perf_fuzzer-3634 [006] 215.228251: function: perf_remove_from_context
Which then does
list_del_event()
event->state=PERF_EVENT_STATE_OFF;
Soon after the parent does task_sched_out
which gets to event_sched_out()
which hits
if (event->state != PERF_EVENT_STATE_ACTIVE)
return;
So it never hits the
event->pmu->del(event, 0);
We need to get the value off the hlist.
This analysis is probably wrong, though, because if it's as simple as
that, I'm not sure why it isn't easier to hit the bug.
Vince
On Wed, 30 Apr 2014, Vince Weaver wrote:
> On Wed, 30 Apr 2014, Peter Zijlstra wrote:
>
> >
> > Vince, could you add the below to whatever tracing muck you already
> > have?
> >
> > After staring at your traces all day with Thomas, we have doubts about
> > the refcount integrity.
>
> I've been staring at traces all day too. Will try your patch tomorrow.
>
> From my staring, what looks like is happening in the trace is:
>
> task_sched_in in parent adds our freed (but alive in child) event:
And that's the issue which puzzles us. Let's look at what we expect:
event create:
parent_evt->refcount = 1
parent_fd = fd(file)
file->refcnt = 1
mmap:
file->refcnt = 2
clone:
child_fd = dup(parent_fd)
file->refcnt = 3
child_evt = dup(parent_evt)
parent_evt->refcount = 2
unmap:
file->refcnt = 2
close(parent_fd)
file->refcnt = 1
kill child
child exits:
exit_files()
close(child_fd)
file->refcnt = 0
perf_release()
parent_evt->refcount = 1
perf_event_exit_task()
kill child_evt
parent_evt->refcount = 0
---> destroy parent
Now the trace shows a different story:
perf_fuzzer-4387 [001] 1802.628659: sys_enter: NR 298 (69bb58, 0, ffffffff, 12, 0, 0)
perf_fuzzer-4387 [001] 1802.628659: function: SyS_perf_event_open
perf_fuzzer-4387 [001] 1802.628660: function: perf_event_alloc
perf_fuzzer-4387 [001] 1802.628663: kmalloc: (perf_event_alloc+0x5a) call_site=ffffffff8113a8fa ptr=0xffff8800a3122800 bytes_req=1272 bytes_alloc=2048 gfp_flags=GFP_KERNEL|GFP_ZERO
perf_fuzzer-4387 [001] 1802.628663: function: perf_init_event
perf_fuzzer-4387 [001] 1802.628664: function: perf_tp_event_init
perf_fuzzer-4387 [001] 1802.628664: function: perf_swevent_init
perf_fuzzer-4387 [001] 1802.628667: function: perf_lock_task_context
perf_fuzzer-4387 [001] 1802.628669: kmem_cache_alloc: (__d_alloc+0x25) call_site=ffffffff811cfa75 ptr=0xffff8800b37ddeb8 bytes_req=288 bytes_alloc=312 gfp_flags=GFP_KERNEL
perf_fuzzer-4387 [001] 1802.628671: kmem_cache_alloc: (get_empty_filp+0x5c) call_site=ffffffff811b7b7c ptr=0xffff880117502ac0 bytes_req=448 bytes_alloc=448 gfp_flags=GFP_KERNEL|GFP_ZERO
perf_fuzzer-4387 [001] 1802.628672: function: perf_install_in_context
perf_fuzzer-4387 [001] 1802.628672: function: __perf_install_in_context
perf_fuzzer-4387 [001] 1802.628672: function: perf_pmu_disable
perf_fuzzer-4387 [001] 1802.628673: function: perf_pmu_nop_void
perf_fuzzer-4387 [001] 1802.628673: function: perf_event__header_size
perf_fuzzer-4387 [001] 1802.628673: function: perf_event__header_size
perf_fuzzer-4387 [001] 1802.628673: function: perf_event_sched_in
perf_fuzzer-4387 [001] 1802.628674: function: perf_pmu_enable
perf_fuzzer-4387 [001] 1802.628674: function: perf_pmu_nop_void
perf_fuzzer-4387 [001] 1802.628675: function: perf_unpin_context
perf_fuzzer-4387 [001] 1802.628676: function: perf_event__header_size
perf_fuzzer-4387 [001] 1802.628676: function: perf_event__id_header_size.isra.19
perf_fuzzer-4387 [001] 1802.628677: bprint: SYSC_perf_event_open: Opened: 1 8 0
perf_fuzzer-4387 [001] 1802.628677: sys_exit: NR 298 = 41
parent_evt->refcount = 1
file->refcount = 1
Now mmap:
perf_fuzzer-4387 [001] 1802.628678: sys_enter: NR 9 (0, 3000, 3, 1, 29, 0)
perf_fuzzer-4387 [001] 1802.628680: kmem_cache_alloc: (mmap_region+0x348) call_site=ffffffff811771c8 ptr=0xffff8800b54255d0 bytes_req=184 bytes_alloc=208 gfp_flags=GFP_KERNEL|GFP_ZERO
perf_fuzzer-4387 [001] 1802.628680: function: perf_mmap
perf_fuzzer-4387 [001] 1802.628682: kmalloc: (rb_alloc+0x38) call_site=ffffffff8113c858 ptr=0xffff88003766bb00 bytes_req=208 bytes_alloc=256 gfp_flags=GFP_KERNEL|GFP_ZERO
perf_fuzzer-4387 [001] 1802.628682: function: perf_mmap_alloc_page
perf_fuzzer-4387 [001] 1802.628683: mm_page_alloc: page=0xffffea00027abf60 pfn=41598816 order=0 migratetype=0 gfp_flags=GFP_KERNEL|GFP_ZERO
perf_fuzzer-4387 [001] 1802.628683: function: perf_mmap_alloc_page
perf_fuzzer-4387 [001] 1802.628684: mm_page_alloc: page=0xffffea00027abf28 pfn=41598760 order=0 migratetype=0 gfp_flags=GFP_KERNEL|GFP_ZERO
perf_fuzzer-4387 [001] 1802.628684: function: perf_mmap_alloc_page
perf_fuzzer-4387 [001] 1802.628685: mm_page_alloc: page=0xffffea00027abef0 pfn=41598704 order=0 migratetype=0 gfp_flags=GFP_KERNEL|GFP_ZERO
perf_fuzzer-4387 [001] 1802.628686: function: perf_event_update_userpage
perf_fuzzer-4387 [001] 1802.628692: function: perf_event_mmap
perf_fuzzer-4387 [001] 1802.628698: kmalloc: (perf_event_mmap+0x90) call_site=ffffffff81139d40 ptr=0xffff880083d0d000 bytes_req=4096 bytes_alloc=4096 gfp_flags=GFP_KERNEL
perf_fuzzer-4387 [001] 1802.628699: function: perf_event_aux
perf_fuzzer-4387 [001] 1802.628699: function: perf_event_aux_ctx
perf_fuzzer-4387 [001] 1802.628699: function: perf_event_aux_ctx
perf_fuzzer-4387 [001] 1802.628699: function: perf_event_aux_ctx
perf_fuzzer-4387 [001] 1802.628699: function: perf_event_mmap_output
perf_fuzzer-4387 [001] 1802.628700: function: perf_event_aux_ctx
perf_fuzzer-4387 [001] 1802.628700: function: perf_event_aux_ctx
perf_fuzzer-4387 [001] 1802.628700: function: perf_event_aux_ctx
perf_fuzzer-4387 [001] 1802.628701: kfree: (perf_event_mmap+0x175) call_site=ffffffff81139e25 ptr=0xffff880083d0d000
perf_fuzzer-4387 [001] 1802.628702: sys_exit: NR 9 = 140141108113408 (0x7f7524f7b000)
file->refcount = 2
And here the clone:
perf_fuzzer-4387 [002] 1803.570829: sys_enter: NR 56 (1200011, 0, 0, 7f7524f899d0, 0, 1123)
perf_fuzzer-4387 [002] 1803.570855: function: perf_event_init_task
Somewhere in this maze we copy the parent event so in theory:
parent_evt->refcount = 2
Now we copy the file descriptors:
perf_fuzzer-4387 [002] 1803.571065: kmem_cache_alloc: (dup_fd+0x33) call_site=ffffffff811d4a23 ptr=0xffff8800ce29e100 bytes_req=704 bytes_alloc=704 gfp_flags=GFP_KERNEL
file->refcount = 3
perf_fuzzer-4387 [002] 1803.571239: sys_exit: NR 56 = 5504
Fork is done and we unmap:
perf_fuzzer-4387 [002] 1803.582330: sys_enter: NR 11 (7f7524f7b000, 3000, 3000, 7f7524d760a4, 7f7524d76108, 7f7524d76120)
perf_fuzzer-4387 [002] 1803.582335: function: perf_mmap_close
perf_fuzzer-4387 [002] 1803.582339: kmem_cache_free: (remove_vma+0x63) call_site=ffffffff811742f3 ptr=0xffff8800b54255d0
perf_fuzzer-4387 [002] 1803.582344: sys_exit: NR 11 = 0
file->refcount = 2
Now we close the file descriptor:
perf_fuzzer-4387 [002] 1803.582345: sys_enter: NR 3 (29, 3000, 3000, 7f7524d760a4, 7f7524d76108, 7f7524d76120)
perf_fuzzer-4387 [002] 1803.582345: sys_exit: NR 3 = 0
file->refcount = 1
Now the child gets killed:
perf_fuzzer-4387 [003] 1803.590145: sys_enter: NR 62 (1580, 9, 7, 7f7524d760b8, 7f7524d760b8, 7f7524d76120)
perf_fuzzer-4387 [003] 1803.590148: kmem_cache_alloc: (__sigqueue_alloc+0x9c) call_site=ffffffff81073c0c ptr=0xffff8800897825e0 bytes_req=160 bytes_alloc=184 gfp_flags=GFP_ATOMIC|GFP_NOTRACK
perf_fuzzer-4387 [003] 1803.590149: sys_exit: NR 62 = 0
The child then exits and calls exit_files(), which decrements
file->refcount to 0, so perf_release() is called:
perf_fuzzer-5504 [004] 1803.590311: function: perf_release
After that, the dentry associated with the parent event is
released:
perf_fuzzer-5504 [004] 1803.590353: kmem_cache_free: (__d_free+0x42) call_site=ffffffff811cc1f2 ptr=0xffff8800b37ddeb8
That's fine, but the issue is that perf_release() also calls:
perf_fuzzer-5504 [004] 1803.590312: function: perf_event_release_kernel
And that means the parent_evt->refcount was not 2, as one would
expect. It was 1; otherwise put_event(), which is called from
perf_release(), would have returned right away due to:

	if (!atomic_long_dec_and_test(&event->refcount))
		return;
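A small userspace sketch of the semantics that check relies on (a toy illustration using C11 atomics, not the kernel's implementation): atomic_long_dec_and_test() returns true only when the decrement hits zero, so if put_event() gets past that check, the refcount must have been exactly 1 beforehand.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Models the kernel's atomic_long_dec_and_test(): decrement and report
 * whether the new value is zero. atomic_fetch_sub returns the value
 * held before the subtraction. */
static bool dec_and_test(atomic_long *v)
{
	return atomic_fetch_sub(v, 1) == 1;
}

/* Returns true when a put with this initial refcount would proceed to
 * free the event, mirroring:
 *   if (!atomic_long_dec_and_test(&event->refcount))
 *       return;
 */
bool would_free(long initial_refcount)
{
	atomic_long ref;
	atomic_init(&ref, initial_refcount);
	return dec_and_test(&ref);
}
```

So would_free(2) returns early (the child's reference keeps the event alive), while would_free(1) proceeds to tear the event down, which is what the trace shows happening here.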
So after exit_files() is done, perf_event_exit_task() is invoked:
perf_fuzzer-5504 [004] 1803.590365: function: perf_event_exit_task
That cleans up the child events. So the child still has a reference to
the parent event, which was destroyed before. It's not actually
kfree'd yet, as that happens via RCU way later:
ksoftirqd/4-28 [004] 1803.609802: kfree: (free_event_rcu+0x2f) call_site=ffffffff8113177f ptr=0xffff8800a3122800
But of course the child cleanup pokes into the by-now-invalid parent
and leaves references to it at some other place, which is later
detected as slab corruption or some other kind of wreckage.
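Putting the whole sequence together, here is a toy reconstruction of the buggy lifecycle (hypothetical userspace names throughout; a 'freed' flag stands in for the real use-after-free, and free-after-RCU is collapsed into the put):

```c
#include <stdbool.h>

/* 'model_event' is a made-up stand-in for the perf event; this sketches
 * the sequence the trace shows, not the kernel code itself. */
struct model_event {
	long refcount;
	bool freed;                 /* stands in for free_event_rcu() */
	struct model_event *parent; /* child's back-pointer */
};

static void model_put(struct model_event *e)
{
	if (--e->refcount)
		return;             /* someone else still holds a reference */
	e->freed = true;            /* event is torn down */
}

/* Returns true when the child's later cleanup would touch a parent
 * that has already been destroyed. */
bool child_pokes_freed_parent(void)
{
	struct model_event parent = { .refcount = 1 };  /* file's ref only */
	struct model_event child  = { .refcount = 1, .parent = &parent };
	/* The bug under discussion: no parent.refcount++ at inherit time. */

	model_put(&parent);         /* exit_files() -> perf_release() */

	/* perf_event_exit_task() now walks the child events... */
	return child.parent->freed; /* ...and the parent is already gone */
}
```

With the missing reference taken at inherit time, model_put() would have returned early and the child cleanup would have found the parent intact.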
Thanks,
tglx