2015-05-08 04:35:34

by Vince Weaver

Subject: perf: another perf_fuzzer generated lockup


This is a new one I think, I hit it on the haswell machine running
4.1-rc2.

The backtrace is complex enough I'm not really sure what's going on here.

The fuzzer has been having weird issues where it gets
overflow signals for invalid fds. This seems to happen
when an overflow signal interrupts the fuzzer mid-fork.
The fuzzer code doesn't handle this well: it attempts to call exit()
and/or kill the child from the signal handler that interrupted the
fork(), and that doesn't always go well. I'm not sure if this is related,
just that some of those actions seem to appear in the backtrace.
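
For reference, the problematic pattern looks roughly like this (a
simplified sketch with illustrative names, not the actual perf_fuzzer
source):

	#define _GNU_SOURCE
	#include <stdlib.h>
	#include <string.h>
	#include <signal.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/perf_event.h>

	static pid_t child_pid;		/* set around fork() elsewhere */

	/* this can run while fork() is still copying the parent */
	static void overflow_handler(int signum, siginfo_t *info, void *uc)
	{
		int fd = info->si_fd;	/* occasionally an fd we never opened */

		if (ioctl(fd, PERF_EVENT_IOC_DISABLE, 0) < 0) {
			/* bogus fd: try to kill the child and/or exit()
			   right here, inside the handler that interrupted
			   the fork() -- which is where things go wrong */
			if (child_pid > 0)
				kill(child_pid, SIGKILL);
			exit(1);
		}
	}

	/* overflow signal delivery is requested on every perf event fd */
	static void setup_overflow_signal(int perf_fd)
	{
		struct sigaction sa;

		memset(&sa, 0, sizeof(sa));
		sa.sa_sigaction = overflow_handler;
		sa.sa_flags = SA_SIGINFO;
		sigaction(SIGRTMIN + 2, &sa, NULL);

		fcntl(perf_fd, F_SETFL, O_RDWR | O_NONBLOCK | O_ASYNC);
		fcntl(perf_fd, F_SETSIG, SIGRTMIN + 2);
		fcntl(perf_fd, F_SETOWN, getpid());
	}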


[33864.529861] ------------[ cut here ]------------
[33864.534824] WARNING: CPU: 1 PID: 9852 at kernel/watchdog.c:302 watchdog_overflow_callback+0x92/0xc0()
[33864.544682] Watchdog detected hard LOCKUP on cpu 1
[33864.549635] Modules linked in:
[33864.552943] fuse x86_pkg_temp_thermal intel_powerclamp intel_rapl iosf_mbi coretemp snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic kvm snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core crct10dif_pclmul snd_hwdep crc32_pclmul ghash_clmulni_intel snd_pcm aesni_intel aes_x86_64 lrw gf128mul evdev i915 iTCO_wdt iTCO_vendor_support glue_helper snd_timer ppdev ablk_helper psmouse drm_kms_helper cryptd snd drm pcspkr serio_raw lpc_ich soundcore parport_pc xhci_pci battery video processor i2c_i801 mei_me mei wmi i2c_algo_bit tpm_tis mfd_core xhci_hcd tpm parport button sg sr_mod sd_mod cdrom ehci_pci ahci ehci_hcd libahci libata e1000e ptp usbcore crc32c_intel scsi_mod fan usb_common pps_core thermal thermal_sys
[33864.622413] CPU: 1 PID: 9852 Comm: perf_fuzzer Not tainted 4.1.0-rc2+ #142
[33864.629776] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[33864.637685] ffffffff81a209b5 ffff88011ea45aa0 ffffffff816d51d3 0000000000000000
[33864.645709] ffff88011ea45af0 ffff88011ea45ae0 ffffffff81072dfa ffff88011ea45ac0
[33864.653731] ffff880119b8f800 0000000000000000 ffff88011ea45c40 ffff88011ea45ef8
[33864.661783] Call Trace:
[33864.664409] <NMI> [<ffffffff816d51d3>] dump_stack+0x45/0x57
[33864.670618] [<ffffffff81072dfa>] warn_slowpath_common+0x8a/0xc0
[33864.677071] [<ffffffff81072e76>] warn_slowpath_fmt+0x46/0x50
[33864.683202] [<ffffffff8104db64>] ? native_apic_wait_icr_idle+0x24/0x30
[33864.690280] [<ffffffff81124f22>] watchdog_overflow_callback+0x92/0xc0
[33864.697294] [<ffffffff811673b1>] __perf_event_overflow+0x91/0x270
[33864.703916] [<ffffffff811673f9>] ? __perf_event_overflow+0xd9/0x270
[33864.710696] [<ffffffff8102b60a>] ? x86_perf_event_set_period+0xda/0x180
[33864.717842] [<ffffffff81167f89>] perf_event_overflow+0x19/0x20
[33864.724195] [<ffffffff810338f2>] intel_pmu_handle_irq+0x1e2/0x450
[33864.730840] [<ffffffff8102a41b>] perf_event_nmi_handler+0x2b/0x50
[33864.737436] [<ffffffff810185d0>] nmi_handle+0xa0/0x150
[33864.743025] [<ffffffff81018535>] ? nmi_handle+0x5/0x150
[33864.748733] [<ffffffff8101887a>] default_do_nmi+0x4a/0x140
[33864.754705] [<ffffffff81018a08>] do_nmi+0x98/0xe0
[33864.759858] [<ffffffff816df42f>] end_repeat_nmi+0x1e/0x2e
[33864.765746] [<ffffffff810bc1bb>] ? check_chain_key+0xdb/0x1e0
[33864.772004] [<ffffffff810bc1bb>] ? check_chain_key+0xdb/0x1e0
[33864.778253] [<ffffffff810bc1bb>] ? check_chain_key+0xdb/0x1e0
[33864.784498] <<EOE>> <IRQ> [<ffffffff810bdd09>] __lock_acquire.isra.31+0x3b9/0x1000
[33864.792950] [<ffffffff810bdd09>] ? __lock_acquire.isra.31+0x3b9/0x1000
[33864.800045] [<ffffffff810bf2e5>] lock_acquire+0xa5/0x130
[33864.805817] [<ffffffff81080afe>] ? __lock_task_sighand+0x6e/0x110
[33864.812468] [<ffffffff81080aaa>] ? __lock_task_sighand+0x1a/0x110
[33864.819084] [<ffffffff816dc721>] _raw_spin_lock+0x31/0x40
[33864.824979] [<ffffffff81080afe>] ? __lock_task_sighand+0x6e/0x110
[33864.831623] [<ffffffff81080afe>] __lock_task_sighand+0x6e/0x110
[33864.838096] [<ffffffff81080aaa>] ? __lock_task_sighand+0x1a/0x110
[33864.845314] [<ffffffff81080bcc>] do_send_sig_info+0x2c/0x80
[33864.851949] [<ffffffff81167807>] ? perf_swevent_event+0x67/0x90
[33864.858980] [<ffffffff8120092f>] send_sigio_to_task+0x12f/0x1a0
[33864.866005] [<ffffffff81200805>] ? send_sigio_to_task+0x5/0x1a0
[33864.873047] [<ffffffff812010c6>] ? send_sigio+0x56/0x100
[33864.879411] [<ffffffff8120111e>] send_sigio+0xae/0x100
[33864.885564] [<ffffffff81201207>] kill_fasync+0x97/0xf0
[33864.891713] [<ffffffff8120117f>] ? kill_fasync+0xf/0xf0
[33864.897983] [<ffffffff811667d4>] perf_event_wakeup+0xd4/0xf0
[33864.904662] [<ffffffff81166705>] ? perf_event_wakeup+0x5/0xf0
[33864.911490] [<ffffffff811668d0>] ? perf_pending_event+0xe0/0x110
[33864.918580] [<ffffffff811668d0>] perf_pending_event+0xe0/0x110
[33864.925494] [<ffffffff81156eac>] irq_work_run_list+0x4c/0x80
[33864.932197] [<ffffffff81156ef8>] irq_work_run+0x18/0x40
[33864.938469] [<ffffffff8101953f>] smp_trace_irq_work_interrupt+0x3f/0xc0
[33864.946263] [<ffffffff816dea3e>] trace_irq_work_interrupt+0x6e/0x80
[33864.953646] <EOI> [<ffffffff811a1b47>] ? copy_page_range+0x527/0x9a0
[33864.961287] [<ffffffff811a1b22>] ? copy_page_range+0x502/0x9a0
[33864.968265] [<ffffffff810710d2>] copy_process.part.23+0xc92/0x1b80
[33864.975589] [<ffffffff810813be>] ? SYSC_kill+0x8e/0x230
[33864.981879] [<ffffffff81072178>] do_fork+0xd8/0x420
[33864.987807] [<ffffffff812007e3>] ? f_setown+0x83/0xa0
[33864.993953] [<ffffffff81200d30>] ? SyS_fcntl+0x310/0x650
[33865.000348] [<ffffffff81417a23>] ? lockdep_sys_exit_thunk+0x12/0x14
[33865.007781] [<ffffffff81072546>] SyS_clone+0x16/0x20
[33865.013830] [<ffffffff816dd1b2>] system_call_fastpath+0x16/0x7a
[33865.020843] ---[ end trace d3bd7d73656f3cba ]---
[33865.026418] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 496.487 msecs
[33865.035874] perf interrupt took too long (3879951 > 5000), lowering kernel.perf_event_max_sample_rate to 25000


2015-05-09 00:04:32

by Stephane Eranian

Subject: Re: perf: another perf_fuzzer generated lockup

Vince,

On Thu, May 7, 2015 at 9:40 PM, Vince Weaver <[email protected]> wrote:
>
>
> This is a new one I think, I hit it on the haswell machine running
> 4.1-rc2.
>
> The backtrace is complex enough I'm not really sure what's going on here.
>
> The fuzzer has been having weird issues where it gets
> overflow signals for invalid fds. This seems to happen
> when an overflow signal interrupts the fuzzer mid-fork.
> The fuzzer code doesn't handle this well: it attempts to call exit()
> and/or kill the child from the signal handler that interrupted the
> fork(), and that doesn't always go well. I'm not sure if this is related,
> just that some of those actions seem to appear in the backtrace.
>
>
Is there a way to figure out how the fuzzer had programmed the PMU
to get there (besides adding a PMU state dump to the kernel crashdump)?


>
> [33864.529861] ------------[ cut here ]------------
> [33864.534824] WARNING: CPU: 1 PID: 9852 at kernel/watchdog.c:302 watchdog_overflow_callback+0x92/0xc0()
> [33864.544682] Watchdog detected hard LOCKUP on cpu 1
> [33864.549635] Modules linked in:
> [33864.552943] fuse x86_pkg_temp_thermal intel_powerclamp intel_rapl iosf_mbi coretemp snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic kvm snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core crct10dif_pclmul snd_hwdep crc32_pclmul ghash_clmulni_intel snd_pcm aesni_intel aes_x86_64 lrw gf128mul evdev i915 iTCO_wdt iTCO_vendor_support glue_helper snd_timer ppdev ablk_helper psmouse drm_kms_helper cryptd snd drm pcspkr serio_raw lpc_ich soundcore parport_pc xhci_pci battery video processor i2c_i801 mei_me mei wmi i2c_algo_bit tpm_tis mfd_core xhci_hcd tpm parport button sg sr_mod sd_mod cdrom ehci_pci ahci ehci_hcd libahci libata e1000e ptp usbcore crc32c_intel scsi_mod fan usb_common pps_core thermal thermal_sys
> [33864.622413] CPU: 1 PID: 9852 Comm: perf_fuzzer Not tainted 4.1.0-rc2+ #142
> [33864.629776] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
> [33864.637685] ffffffff81a209b5 ffff88011ea45aa0 ffffffff816d51d3 0000000000000000
> [33864.645709] ffff88011ea45af0 ffff88011ea45ae0 ffffffff81072dfa ffff88011ea45ac0
> [33864.653731] ffff880119b8f800 0000000000000000 ffff88011ea45c40 ffff88011ea45ef8
> [33864.661783] Call Trace:
> [33864.664409] <NMI> [<ffffffff816d51d3>] dump_stack+0x45/0x57
> [33864.670618] [<ffffffff81072dfa>] warn_slowpath_common+0x8a/0xc0
> [33864.677071] [<ffffffff81072e76>] warn_slowpath_fmt+0x46/0x50
> [33864.683202] [<ffffffff8104db64>] ? native_apic_wait_icr_idle+0x24/0x30
> [33864.690280] [<ffffffff81124f22>] watchdog_overflow_callback+0x92/0xc0
> [33864.697294] [<ffffffff811673b1>] __perf_event_overflow+0x91/0x270
> [33864.703916] [<ffffffff811673f9>] ? __perf_event_overflow+0xd9/0x270
> [33864.710696] [<ffffffff8102b60a>] ? x86_perf_event_set_period+0xda/0x180
> [33864.717842] [<ffffffff81167f89>] perf_event_overflow+0x19/0x20
> [33864.724195] [<ffffffff810338f2>] intel_pmu_handle_irq+0x1e2/0x450
> [33864.730840] [<ffffffff8102a41b>] perf_event_nmi_handler+0x2b/0x50
> [33864.737436] [<ffffffff810185d0>] nmi_handle+0xa0/0x150
> [33864.743025] [<ffffffff81018535>] ? nmi_handle+0x5/0x150
> [33864.748733] [<ffffffff8101887a>] default_do_nmi+0x4a/0x140
> [33864.754705] [<ffffffff81018a08>] do_nmi+0x98/0xe0
> [33864.759858] [<ffffffff816df42f>] end_repeat_nmi+0x1e/0x2e
> [33864.765746] [<ffffffff810bc1bb>] ? check_chain_key+0xdb/0x1e0
> [33864.772004] [<ffffffff810bc1bb>] ? check_chain_key+0xdb/0x1e0
> [33864.778253] [<ffffffff810bc1bb>] ? check_chain_key+0xdb/0x1e0
> [33864.784498] <<EOE>> <IRQ> [<ffffffff810bdd09>] __lock_acquire.isra.31+0x3b9/0x1000
> [33864.792950] [<ffffffff810bdd09>] ? __lock_acquire.isra.31+0x3b9/0x1000
> [33864.800045] [<ffffffff810bf2e5>] lock_acquire+0xa5/0x130
> [33864.805817] [<ffffffff81080afe>] ? __lock_task_sighand+0x6e/0x110
> [33864.812468] [<ffffffff81080aaa>] ? __lock_task_sighand+0x1a/0x110
> [33864.819084] [<ffffffff816dc721>] _raw_spin_lock+0x31/0x40
> [33864.824979] [<ffffffff81080afe>] ? __lock_task_sighand+0x6e/0x110
> [33864.831623] [<ffffffff81080afe>] __lock_task_sighand+0x6e/0x110
> [33864.838096] [<ffffffff81080aaa>] ? __lock_task_sighand+0x1a/0x110
> [33864.845314] [<ffffffff81080bcc>] do_send_sig_info+0x2c/0x80
> [33864.851949] [<ffffffff81167807>] ? perf_swevent_event+0x67/0x90
> [33864.858980] [<ffffffff8120092f>] send_sigio_to_task+0x12f/0x1a0
> [33864.866005] [<ffffffff81200805>] ? send_sigio_to_task+0x5/0x1a0
> [33864.873047] [<ffffffff812010c6>] ? send_sigio+0x56/0x100
> [33864.879411] [<ffffffff8120111e>] send_sigio+0xae/0x100
> [33864.885564] [<ffffffff81201207>] kill_fasync+0x97/0xf0
> [33864.891713] [<ffffffff8120117f>] ? kill_fasync+0xf/0xf0
> [33864.897983] [<ffffffff811667d4>] perf_event_wakeup+0xd4/0xf0
> [33864.904662] [<ffffffff81166705>] ? perf_event_wakeup+0x5/0xf0
> [33864.911490] [<ffffffff811668d0>] ? perf_pending_event+0xe0/0x110
> [33864.918580] [<ffffffff811668d0>] perf_pending_event+0xe0/0x110
> [33864.925494] [<ffffffff81156eac>] irq_work_run_list+0x4c/0x80
> [33864.932197] [<ffffffff81156ef8>] irq_work_run+0x18/0x40
> [33864.938469] [<ffffffff8101953f>] smp_trace_irq_work_interrupt+0x3f/0xc0
> [33864.946263] [<ffffffff816dea3e>] trace_irq_work_interrupt+0x6e/0x80
> [33864.953646] <EOI> [<ffffffff811a1b47>] ? copy_page_range+0x527/0x9a0
> [33864.961287] [<ffffffff811a1b22>] ? copy_page_range+0x502/0x9a0
> [33864.968265] [<ffffffff810710d2>] copy_process.part.23+0xc92/0x1b80
> [33864.975589] [<ffffffff810813be>] ? SYSC_kill+0x8e/0x230
> [33864.981879] [<ffffffff81072178>] do_fork+0xd8/0x420
> [33864.987807] [<ffffffff812007e3>] ? f_setown+0x83/0xa0
> [33864.993953] [<ffffffff81200d30>] ? SyS_fcntl+0x310/0x650
> [33865.000348] [<ffffffff81417a23>] ? lockdep_sys_exit_thunk+0x12/0x14
> [33865.007781] [<ffffffff81072546>] SyS_clone+0x16/0x20
> [33865.013830] [<ffffffff816dd1b2>] system_call_fastpath+0x16/0x7a
> [33865.020843] ---[ end trace d3bd7d73656f3cba ]---
> [33865.026418] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 496.487 msecs
> [33865.035874] perf interrupt took too long (3879951 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
>

2015-05-11 21:14:19

by Vince Weaver

Subject: Re: perf: another perf_fuzzer generated lockup

On Fri, 8 May 2015, Stephane Eranian wrote:

> Vince,
>
> On Thu, May 7, 2015 at 9:40 PM, Vince Weaver <[email protected]> wrote:
> >
> >
> > This is a new one I think, I hit it on the haswell machine running
> > 4.1-rc2.
> >
> > The backtrace is complex enough I'm not really sure what's going on here.
> >
> > The fuzzer has been having weird issues where it gets
> > overflow signals for invalid fds. This seems to happen
> > when an overflow signal interrupts the fuzzer mid-fork.
> > The fuzzer code doesn't handle this well: it attempts to call exit()
> > and/or kill the child from the signal handler that interrupted the
> > fork(), and that doesn't always go well. I'm not sure if this is related,
> > just that some of those actions seem to appear in the backtrace.
> >
> >
> Is there a way to figure out how the fuzzer had programmed the PMU
> to get there (besides adding a PMU state dump to the kernel crashdump)?

Not easily. In theory the fuzzer can regenerate its state from the random
seed, but some of these bugs seem to be timing-related or race conditions,
so they don't always replicate.

I can also make the fuzzer dump its state, but often it has 100+ events
active and there's no way of knowing which ones are currently scheduled
onto the CPU.
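
A fuzzer-side dump would just walk the table of open events and print
each attr, something like this rough sketch (illustrative names, not the
actual fuzzer code):

	#include <stdio.h>
	#include <linux/perf_event.h>

	#define NUM_EVENTS 1024		/* illustrative */

	struct fuzzer_event {		/* illustrative, not the real struct */
		int active;
		int fd;
		struct perf_event_attr attr;
	};

	extern struct fuzzer_event event_table[NUM_EVENTS];

	/* print what was requested for each open event */
	static void dump_open_events(void)
	{
		int i;

		for (i = 0; i < NUM_EVENTS; i++) {
			if (!event_table[i].active)
				continue;
			printf("fd %d: type %u config %#llx period %llu "
			       "precise_ip %u\n",
			       event_table[i].fd,
			       event_table[i].attr.type,
			       (unsigned long long)event_table[i].attr.config,
			       (unsigned long long)event_table[i].attr.sample_period,
			       (unsigned int)event_table[i].attr.precise_ip);
		}
	}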

Dumping the PMU state might help, but at the same time there are all the
other events going on, such as software and tracepoint events, and they
might all be contributing.

This particular bug almost replicates; the system definitely pauses for a
bit even if not long enough to trigger the watchdog.

I've been meaning to work on it more, but we just finished with finals, so
I'm stuck doing more or less nothing but grading for the next week. After
that I should have some time to work on this issue, plus a couple of other
warnings the fuzzer has been showing.

Vince