2015-07-02 20:56:04

by Vince Weaver

[permalink] [raw]
Subject: perf: fuzzer triggered CPU lockup


The fuzzer just turned this up. This is the first part; a huge amount of
CPU lockup messages, rcu stalls, and spinlock lockup suspected messages
followed beyond what I've posted here but eventually the machine recovered (?)

[20601.976374] WARNING: CPU: 1 PID: 28966 at kernel/watchdog.c:311 watchdog_overflow_callback+0x84/0xa0()
[20601.986660] Watchdog detected hard LOCKUP on cpu 1
[20601.991597] Modules linked in:
[20601.995219] fuse x86_pkg_temp_thermal intel_powerclamp intel_rapl iosf_mbi coretemp kvm_intel kvm snd_hda_codec_hdmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel hmac drbg i915 ansi_cprng snd_hda_codec_realtek snd_hda_codec_generic iTCO_wdt ppdev snd_hda_intel snd_hda_codec aesni_intel aes_x86_64 snd_hda_core lrw evdev snd_hwdep drm_kms_helper drm psmouse iTCO_vendor_support snd_pcm gf128mul glue_helper ablk_helper serio_raw pcspkr cryptd snd_timer i2c_i801 processor battery video snd mei_me parport_pc i2c_algo_bit button soundcore lpc_ich mfd_core tpm_tis mei parport tpm wmi sg sr_mod sd_mod cdrom ehci_pci ehci_hcd ahci libahci libata xhci_pci xhci_hcd e1000e ptp usbcore crc32c_intel fan scsi_mod pps_core usb_common thermal thermal_sys
[20602.068491] CPU: 1 PID: 28966 Comm: perf_fuzzer Tainted: G W 4.1.0+ #160
[20602.077305] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[20602.085645] ffffffff81a27b82 ffff88011ea45af0 ffffffff8169e763 0000000000000000
[20602.094061] ffff88011ea45b40 ffff88011ea45b30 ffffffff8106ecba ffff88011ea45c40
[20602.102496] ffff8801197c7800 0000000000000000 ffff88011ea45c40 ffff88011ea45ef8
[20602.110909] Call Trace:
[20602.113982] <NMI> [<ffffffff8169e763>] dump_stack+0x45/0x57
[20602.120616] [<ffffffff8106ecba>] warn_slowpath_common+0x8a/0xc0
[20602.127520] [<ffffffff8106ed36>] warn_slowpath_fmt+0x46/0x50
[20602.134103] [<ffffffff8102f676>] ? intel_pmu_drain_pebs_nhm+0x176/0x2e0
[20602.141753] [<ffffffff8111b694>] watchdog_overflow_callback+0x84/0xa0
[20602.149189] [<ffffffff8115ae7c>] __perf_event_overflow+0x8c/0x1b0
[20602.156300] [<ffffffff8115b9d4>] perf_event_overflow+0x14/0x20
[20602.163092] [<ffffffff81032174>] intel_pmu_handle_irq+0x1d4/0x440
[20602.170206] [<ffffffff81195ca5>] ? copy_page_range+0x515/0x960
[20602.177014] [<ffffffff8105182b>] ? native_apic_msr_write+0x2b/0x30
[20602.184197] [<ffffffff8105169d>] ? x2apic_send_IPI_self+0x1d/0x20
[20602.191284] [<ffffffff81019155>] ? arch_irq_work_raise+0x35/0x40
[20602.198282] [<ffffffff8114b0b6>] ? irq_work_queue+0x66/0x80
[20602.204830] [<ffffffff81028e76>] perf_event_nmi_handler+0x26/0x40
[20602.211928] [<ffffffff810181ad>] nmi_handle+0x9d/0x140
[20602.217999] [<ffffffff81018115>] ? nmi_handle+0x5/0x140
[20602.224206] [<ffffffff8101843a>] default_do_nmi+0x4a/0x120
[20602.230651] [<ffffffff8101859d>] do_nmi+0x8d/0xc0
[20602.236280] [<ffffffff816a86df>] end_repeat_nmi+0x1e/0x2e
[20602.242650] [<ffffffff8101cd16>] ? native_read_tsc+0x6/0x20
[20602.249194] [<ffffffff8101cd16>] ? native_read_tsc+0x6/0x20
[20602.255733] [<ffffffff8101cd16>] ? native_read_tsc+0x6/0x20
[20602.262282] <<EOE>> [<ffffffff813f7e9d>] delay_tsc+0x2d/0x70
[20602.269008] [<ffffffff813f7daf>] __delay+0xf/0x20
[20602.274617] [<ffffffff810bc2ac>] do_raw_spin_lock+0x8c/0x120
[20602.281273] [<ffffffff816a62bc>] _raw_spin_lock_irqsave+0x4c/0x60
[20602.288346] [<ffffffff81153aca>] ? perf_event_read+0x6a/0x130
[20602.295064] [<ffffffff81153aca>] perf_event_read+0x6a/0x130
[20602.301605] [<ffffffff81153c18>] perf_event_read_value+0x88/0xd0
[20602.308563] [<ffffffff81154125>] ? perf_event_ctx_lock_nested+0x15/0xf0
[20602.316160] [<ffffffff8115453f>] perf_read+0x1cf/0x300
[20602.322203] [<ffffffff811973cf>] ? handle_mm_fault+0xdbf/0x1750
[20602.329079] [<ffffffff811dd2a8>] __vfs_read+0x28/0xd0
[20602.335008] [<ffffffff8136b073>] ? security_file_permission+0xa3/0xc0
[20602.342452] [<ffffffff811dd7f3>] ? rw_verify_area+0x53/0xf0
[20602.348934] [<ffffffff811dd91a>] vfs_read+0x8a/0x130
[20602.354769] [<ffffffff811de739>] SyS_read+0x49/0xa0
[20602.360484] [<ffffffff816a64b2>] entry_SYSCALL_64_fastpath+0x16/0x7a
[20602.367734] ---[ end trace e3d880444bf5705f ]---
[20609.247053] ------------[ cut here ]------------
[20609.252343] WARNING: CPU: 2 PID: 29626 at kernel/watchdog.c:311 watchdog_overflow_callback+0x84/0xa0()
[20609.262667] Watchdog detected hard LOCKUP on cpu 2
[20609.387068] Call Trace:
[20609.390155] <NMI> [<ffffffff8169e763>] dump_stack+0x45/0x57
[20609.396774] [<ffffffff8106ecba>] warn_slowpath_common+0x8a/0xc0
[20609.403676] [<ffffffff8106ed36>] warn_slowpath_fmt+0x46/0x50
[20609.410245] [<ffffffff8102f676>] ? intel_pmu_drain_pebs_nhm+0x176/0x2e0
[20609.417849] [<ffffffff8111b694>] watchdog_overflow_callback+0x84/0xa0
[20609.425247] [<ffffffff8115ae7c>] __perf_event_overflow+0x8c/0x1b0
[20609.432324] [<ffffffff8115b9d4>] perf_event_overflow+0x14/0x20
[20609.439100] [<ffffffff81032174>] intel_pmu_handle_irq+0x1d4/0x440
[20609.446153] [<ffffffff8108f4f7>] ? __atomic_notifier_call_chain+0x77/0x90
[20609.453984] [<ffffffff8108f485>] ? __atomic_notifier_call_chain+0x5/0x90
[20609.461717] [<ffffffff810b8d4d>] ? lock_acquire+0xbd/0x120
[20609.468152] [<ffffffff8105182b>] ? native_apic_msr_write+0x2b/0x30
[20609.475296] [<ffffffff8105169d>] ? x2apic_send_IPI_self+0x1d/0x20
[20609.482383] [<ffffffff81019155>] ? arch_irq_work_raise+0x35/0x40
[20609.489362] [<ffffffff8114b0b6>] ? irq_work_queue+0x66/0x80
[20609.495880] [<ffffffff81028e76>] perf_event_nmi_handler+0x26/0x40
[20609.502970] [<ffffffff810181ad>] nmi_handle+0x9d/0x140
[20609.509066] [<ffffffff81018115>] ? nmi_handle+0x5/0x140
[20609.515184] [<ffffffff810184b9>] default_do_nmi+0xc9/0x120
[20609.521617] [<ffffffff8101859d>] do_nmi+0x8d/0xc0
[20609.527217] [<ffffffff816a86df>] end_repeat_nmi+0x1e/0x2e
[20609.533566] [<ffffffff81058bdd>] ? native_write_msr_safe+0xd/0x10
[20609.540658] [<ffffffff81058bdd>] ? native_write_msr_safe+0xd/0x10
[20609.547728] [<ffffffff81058bdd>] ? native_write_msr_safe+0xd/0x10
[20609.554770] <<EOE>> [<ffffffff8103097a>] ? __intel_pmu_enable_all+0x5a/0xc0
[20609.562888] [<ffffffff810309f0>] intel_pmu_enable_all+0x10/0x20
[20609.569768] [<ffffffff8102a95c>] x86_pmu_enable+0x25c/0x2e0
[20609.576271] [<ffffffff81156112>] perf_pmu_enable+0x22/0x30
[20609.582693] [<ffffffff811578d4>] perf_event_context_sched_in+0x94/0x150
[20609.590305] [<ffffffff81157e07>] __perf_event_task_sched_in+0x97/0xa0
[20609.597710] [<ffffffff81096517>] finish_task_switch+0x197/0x1f0
[20609.604580] [<ffffffff816a0860>] ? __schedule+0x280/0xa50
[20609.610866] [<ffffffff816a08c8>] __schedule+0x2e8/0xa50
[20609.616966] [<ffffffff811590f2>] ? put_event+0x92/0x150
[20609.623071] [<ffffffff816a1067>] schedule+0x37/0x80
[20609.628788] [<ffffffff816a135e>] schedule_preempt_disabled+0xe/0x10
[20609.636002] [<ffffffff816a2ce0>] mutex_lock_nested+0x120/0x320
[20609.642717] [<ffffffff811590f2>] ? put_event+0x92/0x150
[20609.648789] [<ffffffff811590f2>] put_event+0x92/0x150
[20609.654651] [<ffffffff8115908b>] ? put_event+0x2b/0x150
[20609.660680] [<ffffffff811591d0>] perf_release+0x10/0x20
[20609.666685] [<ffffffff811df34c>] __fput+0xdc/0x1e0
[20609.672243] [<ffffffff811df49e>] ____fput+0xe/0x10
[20609.677768] [<ffffffff8108c295>] task_work_run+0x85/0xb0
[20609.683882] [<ffffffff81071645>] do_exit+0x3a5/0xa90
[20609.689605] [<ffffffff8107d1b2>] ? get_signal+0xe2/0x720
[20609.695712] [<ffffffff81071dd4>] do_group_exit+0x54/0xe0
[20609.701802] [<ffffffff8107d350>] get_signal+0x280/0x720
[20609.707790] [<ffffffff81013478>] do_signal+0x28/0xb50
[20609.713579] [<ffffffff810b7806>] ? __lock_acquire.isra.31+0x3a6/0xf90
[20609.720874] [<ffffffff816a0653>] ? __schedule+0x73/0xa50
[20609.726927] [<ffffffff816a0bcc>] ? __schedule+0x5ec/0xa50
[20609.733093] [<ffffffff816a0bcc>] ? __schedule+0x5ec/0xa50
[20609.739262] [<ffffffff810815d5>] ? sys_getppid+0x5/0x80
[20609.745244] [<ffffffff81014002>] do_notify_resume+0x62/0x70
[20609.751546] [<ffffffff816a719e>] retint_signal+0x44/0x86
[20609.757629] ---[ end trace e3d880444bf57060 ]---


2015-07-03 03:25:15

by Vince Weaver

[permalink] [raw]
Subject: Re: perf: fuzzer triggered CPU lockup

On Thu, 2 Jul 2015, Vince Weaver wrote:

> The fuzzer just turned this up. This is the first part; a huge amount of
> CPU lockup messages, rcu stalls, and spinlock lockup suspected messages
> followed beyond what I've posted here but eventually the machine recovered (?)

> [20602.134103] [<ffffffff8102f676>] ? intel_pmu_drain_pebs_nhm+0x176/0x2e0

after a reboot I am still getting these, in addition to the other
intel_pmu_drain_pebs_nhm related warning. Do I need to try to bisect this
down?

[18356.573650] ------------[ cut here ]------------
[18356.578896] WARNING: CPU: 1 PID: 27699 at kernel/watchdog.c:311 watchdog_overflow_callback+0x84/0xa0()
[18356.590213] Watchdog detected hard LOCKUP on cpu 1
[18356.706920] CPU: 1 PID: 27699 Comm: perf_fuzzer Tainted: G W 4.1.0+ #162
[18356.716672] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[18356.725951] ffffffff81a27b92 ffff88011ea45af0 ffffffff8169e763 0000000000000000
[18356.738175] ffff88011ea45b40 ffff88011ea45b30 ffffffff8106ecba ffff88011ea45c40
[18356.750305] ffff8801197c7800 0000000000000001 ffff88011ea45c40 ffff88011ea45ef8
[18356.761131] Call Trace:
[18356.765163] <NMI> [<ffffffff8169e763>] dump_stack+0x45/0x57
[18356.772761] [<ffffffff8106ecba>] warn_slowpath_common+0x8a/0xc0
[18356.780580] [<ffffffff8106ed36>] warn_slowpath_fmt+0x46/0x50
[18356.788080] [<ffffffff8102f676>] ? intel_pmu_drain_pebs_nhm+0x176/0x2e0
[18356.796685] [<ffffffff8111b694>] watchdog_overflow_callback+0x84/0xa0
[18356.805063] [<ffffffff8115ae7c>] __perf_event_overflow+0x8c/0x1b0
[18356.813081] [<ffffffff8115b9d4>] perf_event_overflow+0x14/0x20
[18356.820813] [<ffffffff81032194>] intel_pmu_handle_irq+0x1d4/0x440
[18356.828795] [<ffffffff81028e76>] perf_event_nmi_handler+0x26/0x40
[18356.836761] [<ffffffff810181ad>] nmi_handle+0x9d/0x140
[18356.843764] [<ffffffff81018115>] ? nmi_handle+0x5/0x140
[18356.850868] [<ffffffff810184b9>] default_do_nmi+0xc9/0x120
[18356.858228] [<ffffffff8101859d>] do_nmi+0x8d/0xc0
[18356.864793] [<ffffffff816a86df>] end_repeat_nmi+0x1e/0x2e
[18356.872062] <<EOE>> <UNK>
[18356.875986] ---[ end trace 1b52d9dc94e8716a ]---

Followed by a lot of these:

[18359.393390] INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 6, t=5253 jiffies, g=2341684, c=2341683, q=71)
[18359.405806] Task dump for CPU 1:
[18359.409688] perf_fuzzer R running task 0 27699 2386 0x00000008
[18359.417712] ffff8800cfb7c680 00007f889f330fff 0000000000000096 0000000000000002
[18359.426147] ffff8800cfb7c680 ffffffff82a107e0 0000000000000001 ffff8800c4413c28
[18359.434553] 0000000000000096 ffffffff814090e0 00007f8800000000 000000019f400000
[18359.443023] Call Trace:
[18359.446107] [<ffffffff814090e0>] ? debug_check_no_obj_freed+0x90/0x200
[18359.453610] [<ffffffff81409162>] ? debug_check_no_obj_freed+0x112/0x200
[18359.461226] [<ffffffff81409162>] ? debug_check_no_obj_freed+0x112/0x200
[18359.468840] [<ffffffff811bc096>] ? kfree_debugcheck+0x16/0x40
[18359.475521] [<ffffffff8119e14e>] ? mmap_region+0x49e/0x5d0
[18359.481915] [<ffffffff8118468b>] ? vm_mmap_pgoff+0x9b/0xc0
[18359.488362] [<ffffffff810b5023>] ? up_write+0x23/0x50
[18359.494362] [<ffffffff8118468b>] ? vm_mmap_pgoff+0x9b/0xc0
[18359.500807] [<ffffffff811fbec5>] ? __fget+0x5/0xd0
[18359.506498] [<ffffffff8119c978>] ? SyS_mmap_pgoff+0xe8/0x270
[18359.513100] [<ffffffff81003012>] ? lockdep_sys_exit_thunk+0x12/0x14
[18359.520351] [<ffffffff816a64bc>] ? entry_SYSCALL_64_fastpath+0x20/0x7a

2015-07-03 10:08:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf: fuzzer triggered CPU lockup

On Thu, Jul 02, 2015 at 11:31:15PM -0400, Vince Weaver wrote:
> On Thu, 2 Jul 2015, Vince Weaver wrote:
>
> > The fuzzer just turned this up. This is the first part; a huge amount of
> > CPU lockup messages, rcu stalls, and spinlock lockup suspected messages
> > followed beyond what I've posted here but eventually the machine recovered (?)
>
> > [20602.134103] [<ffffffff8102f676>] ? intel_pmu_drain_pebs_nhm+0x176/0x2e0
>
> after a reboot I am still getting these, in addition to the other
> intel_pmu_drain_pebs_nhm related warning. Do I need to try to bisect this
> down?

I think I have a fairly good idea which patches to blame for that;
there's been a rework of the pebs code there. Let me go have a look.