2015-02-19 16:51:28

by Vince Weaver

[permalink] [raw]
Subject: perf: fuzzer gets CPU stuck in perf_callchain()


Another bug found by the perf_fuzzer(). I think this one is different
than the one I sent the other day, it looks like something is going
very wrong in perf_callchain().

This one is reasonably reproducible, if there's any extra debugging that I
can add. This is on a Haswell machine with git as of a few days ago.

There are two traces below; they were generated by the same random seed
across different reboots and the trigger is slightly different (although
both look like they hit the same bug in the callchain code).

[ 7938.495396] ------------[ cut here ]------------
[ 7938.500373] WARNING: CPU: 2 PID: 5692 at kernel/watchdog.c:290 watchdog_overflow_callback+0x92/0xc0()
[ 7938.510337] Watchdog detected hard LOCKUP on cpu 2
[ 7938.515301] Modules linked in: fuse x86_pkg_temp_thermal intel_powerclamp intel_rapl iosf_mbi coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_controller snd_hda_codec i915 aes_x86_64 lrw gf128mul ppdev iTCO_wdt xhci_pci snd_hwdep snd_pcm tpm_tis evdev glue_helper iTCO_vendor_support xhci_hcd snd_timer mei_me drm_kms_helper ablk_helper snd tpm mei drm psmouse serio_raw pcspkr parport_pc cryptd soundcore lpc_ich parport wmi mfd_core battery i2c_i801 i2c_algo_bit button processor video sg sr_mod sd_mod cdrom ahci libahci ehci_pci libata ehci_hcd e1000e ptp crc32c_intel scsi_mod usbcore usb_common pps_core fan thermal thermal_sys
[ 7938.586970] CPU: 2 PID: 5692 Comm: perf_fuzzer Not tainted 3.19.0+ #127
[ 7938.594053] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[ 7938.601990] ffffffff81a41afe ffff88011ea85aa0 ffffffff816bfed5 0000000000000000
[ 7938.610056] ffff88011ea85af0 ffff88011ea85ae0 ffffffff8106ddfa 0000000000000000
[ 7938.618102] ffff8801195a1800 0000000000000000 ffff88011ea85c40 ffff88011ea85ef8
[ 7938.626105] Call Trace:
[ 7938.628735] <NMI> [<ffffffff816bfed5>] dump_stack+0x45/0x57
[ 7938.634940] [<ffffffff8106ddfa>] warn_slowpath_common+0x8a/0xc0
[ 7938.641394] [<ffffffff8106de76>] warn_slowpath_fmt+0x46/0x50
[ 7938.647566] [<ffffffff8111cde2>] watchdog_overflow_callback+0x92/0xc0
[ 7938.654617] [<ffffffff8115dbe1>] __perf_event_overflow+0x91/0x270
[ 7938.661233] [<ffffffff8102b56a>] ? x86_perf_event_set_period+0xca/0x170
[ 7938.668419] [<ffffffff8115e6c9>] perf_event_overflow+0x19/0x20
[ 7938.674764] [<ffffffff81032a7a>] intel_pmu_handle_irq+0x1ba/0x3a0
[ 7938.681432] [<ffffffff8102a45b>] perf_event_nmi_handler+0x2b/0x50
[ 7938.688055] [<ffffffff81018760>] nmi_handle+0xa0/0x150
[ 7938.693681] [<ffffffff810186c5>] ? nmi_handle+0x5/0x150
[ 7938.699405] [<ffffffff81018a0a>] default_do_nmi+0x4a/0x140
[ 7938.705429] [<ffffffff81018b98>] do_nmi+0x98/0xe0
[ 7938.710573] [<ffffffff816ca3d1>] end_repeat_nmi+0x1e/0x2e
[ 7938.716470] [<ffffffff8140b300>] ? copy_from_user_nmi+0x20/0x70
[ 7938.722929] [<ffffffff8140b300>] ? copy_from_user_nmi+0x20/0x70
[ 7938.729386] [<ffffffff8140b300>] ? copy_from_user_nmi+0x20/0x70
[ 7938.735862] <<EOE>> <IRQ> [<ffffffff8102c210>] ? perf_callchain_user+0xd0/0x240
[ 7938.744067] [<ffffffff81161508>] perf_callchain+0x1f8/0x220
[ 7938.750152] [<ffffffff8115da21>] perf_prepare_sample+0x291/0x3c0
[ 7938.756700] [<ffffffff8115dc5e>] __perf_event_overflow+0x10e/0x270
[ 7938.763412] [<ffffffff8115dc29>] ? __perf_event_overflow+0xd9/0x270
[ 7938.770254] [<ffffffff8115e114>] ? perf_tp_event+0xc4/0x210
[ 7938.776302] [<ffffffff8115da7c>] ? perf_prepare_sample+0x2ec/0x3c0
[ 7938.783021] [<ffffffff816ca018>] ? trace_page_fault+0x28/0x30
[ 7938.789319] [<ffffffff8115df89>] perf_swevent_overflow+0xa9/0xe0
[ 7938.795845] [<ffffffff8115e027>] perf_swevent_event+0x67/0x90
[ 7938.802139] [<ffffffff8115e114>] perf_tp_event+0xc4/0x210
[ 7938.808025] [<ffffffff810b7f10>] ? lock_release+0x130/0x260
[ 7938.814132] [<ffffffff810b643f>] ? __lock_acquire.isra.31+0x3af/0xfe0
[ 7938.821124] [<ffffffff811406d8>] ? perf_trace_buf_prepare+0x98/0xf0
[ 7938.827926] [<ffffffff810b4f8a>] ? perf_trace_lock+0x12a/0x160
[ 7938.834291] [<ffffffff81079c05>] ? __sigqueue_alloc+0x5/0x150
[ 7938.840541] [<ffffffff81079d34>] ? __sigqueue_alloc+0x134/0x150
[ 7938.847009] [<ffffffff811406d8>] ? perf_trace_buf_prepare+0x98/0xf0
[ 7938.853835] [<ffffffff8107ba4e>] ? __lock_task_sighand+0x6e/0x110
[ 7938.861174] [<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160
[ 7938.867972] [<ffffffff8107ba58>] ? __lock_task_sighand+0x78/0x110
[ 7938.875256] [<ffffffff8107bb4d>] ? do_send_sig_info+0x5d/0x80
[ 7938.882197] [<ffffffff810b7f10>] lock_release+0x130/0x260
[ 7938.888754] [<ffffffff816c7474>] _raw_spin_unlock_irqrestore+0x24/0x40
[ 7938.896510] [<ffffffff8107bb4d>] do_send_sig_info+0x5d/0x80
[ 7938.903240] [<ffffffff811f69df>] send_sigio_to_task+0x12f/0x1a0
[ 7938.910365] [<ffffffff811f68b5>] ? send_sigio_to_task+0x5/0x1a0
[ 7938.917455] [<ffffffff811f7176>] ? send_sigio+0x56/0x100
[ 7938.923894] [<ffffffff811f71ce>] send_sigio+0xae/0x100
[ 7938.930132] [<ffffffff811f72b7>] kill_fasync+0x97/0xf0
[ 7938.936359] [<ffffffff811f722f>] ? kill_fasync+0xf/0xf0
[ 7938.942707] [<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0
[ 7938.949522] [<ffffffff8115cfe5>] ? perf_event_wakeup+0x5/0xf0
[ 7938.956367] [<ffffffff8115d103>] perf_pending_event+0x33/0x60
[ 7938.963199] [<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80
[ 7938.969932] [<ffffffff8114e448>] irq_work_run+0x18/0x40
[ 7938.976212] [<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0
[ 7938.983957] [<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80
[ 7938.991362] <EOI> [<ffffffff816c743b>] ? _raw_spin_unlock+0x2b/0x40
[ 7938.998883] [<ffffffff810f0993>] ? generic_exec_single+0x133/0x190
[ 7939.006127] [<ffffffff81156100>] ? task_clock_event_add+0x40/0x40
[ 7939.013297] [<ffffffff810f0a57>] smp_call_function_single+0x67/0xa0
[ 7939.020568] [<ffffffff81154d69>] task_function_call+0x49/0x60
[ 7939.027277] [<ffffffff8115b400>] ? perf_event_sched_in+0xa0/0xa0
[ 7939.034270] [<ffffffff81154e10>] _perf_event_enable+0x90/0xf0
[ 7939.040963] [<ffffffff81154d80>] ? task_function_call+0x60/0x60
[ 7939.047863] [<ffffffff81154baf>] perf_event_for_each_child+0x3f/0xb0
[ 7939.055259] [<ffffffff8115c4ac>] perf_event_task_enable+0x7c/0xd0
[ 7939.062321] [<ffffffff81082d4d>] SyS_prctl+0x26d/0x4d0
[ 7939.068359] [<ffffffff816c7ead>] system_call_fastpath+0x16/0x1b
[ 7939.075261] ---[ end trace 8e72f55d0f5cbae4 ]---
[ 7939.080732] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 585.271 msecs
[ 7939.090016] perf interrupt took too long (4572449 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[ 7940.065563] INFO: rcu_sched detected stalls on CPUs/tasks: { 2} (detected by 3, t=5252 jiffies, g=10488, c=10487, q=196)
[ 7940.077820] Task dump for CPU 2:
[ 7940.081779] perf_fuzzer R running task 0 5692 2549 0x0000000c
[ 7940.089901] 00007ffff86e99b8 000000000000002b 0000000000000003 ffff8800a4103cd8
[ 7940.098442] ffffffff810b643f ffff88011897a4d0 0000000000000003 ffff88011897a4d0
[ 7940.106955] ffff8800ced65020 ffff8800a4103c98 ffffffff810b52b8 ffff8800a4103cb8
[ 7940.115473] Call Trace:
[ 7940.118593] [<ffffffff810b643f>] ? __lock_acquire.isra.31+0x3af/0xfe0
[ 7940.126120] [<ffffffff810b52b8>] ? __lock_is_held+0x58/0x80
[ 7940.132660] [<ffffffff811406d8>] ? perf_trace_buf_prepare+0x98/0xf0
[ 7940.139989] [<ffffffff810b4f8a>] ? perf_trace_lock+0x12a/0x160
[ 7940.146848] [<ffffffff8115b485>] ? __perf_event_enable+0x85/0x2a0
[ 7940.153971] [<ffffffff816c743b>] ? _raw_spin_unlock+0x2b/0x40
[ 7940.160738] [<ffffffff810f0993>] ? generic_exec_single+0x133/0x190
[ 7940.167964] [<ffffffff81156100>] ? task_clock_event_add+0x40/0x40
[ 7940.175029] [<ffffffff810f0a57>] ? smp_call_function_single+0x67/0xa0
[ 7940.182518] [<ffffffff81154d69>] ? task_function_call+0x49/0x60
[ 7940.189447] [<ffffffff8115b400>] ? perf_event_sched_in+0xa0/0xa0
[ 7940.196485] [<ffffffff81154e10>] ? _perf_event_enable+0x90/0xf0
[ 7940.203430] [<ffffffff81154d80>] ? task_function_call+0x60/0x60
[ 7940.210386] [<ffffffff81154baf>] ? perf_event_for_each_child+0x3f/0xb0
[ 7940.217945] [<ffffffff8115c4ac>] ? perf_event_task_enable+0x7c/0xd0
[ 7940.225255] [<ffffffff81082d4d>] ? SyS_prctl+0x26d/0x4d0
[ 7940.231557] [<ffffffff816c7ead>] ? system_call_fastpath+0x16/0x1b
[ 7947.852948] perf interrupt took too long (4536733 > 10000), lowering kernel.perf_event_max_sample_rate to 12500
[ 7956.620503] perf interrupt took too long (4501296 > 20000), lowering kernel.perf_event_max_sample_rate to 6250
[ 7965.387462] perf interrupt took too long (4466134 > 38461), lowering kernel.perf_event_max_sample_rate to 3250


[145965.719358] Call Trace:
[145965.722845] <NMI> [<ffffffff816bfed5>] dump_stack+0x45/0x57
[145965.729907] [<ffffffff8106ddfa>] warn_slowpath_common+0x8a/0xc0
[145965.737170] [<ffffffff8106de76>] warn_slowpath_fmt+0x46/0x50
[145965.744103] [<ffffffff8101db0a>] ? native_sched_clock+0x2a/0x90
[145965.751356] [<ffffffff8111cde2>] watchdog_overflow_callback+0x92/0xc0
[145965.759125] [<ffffffff8115dbe1>] __perf_event_overflow+0x91/0x270
[145965.766466] [<ffffffff8102b56a>] ? x86_perf_event_set_period+0xca/0x170
[145965.774393] [<ffffffff8115e6c9>] perf_event_overflow+0x19/0x20
[145965.781478] [<ffffffff81032a7a>] intel_pmu_handle_irq+0x1ba/0x3a0
[145965.788793] [<ffffffff816ca020>] ? trace_page_fault+0x30/0x30
[145965.795777] [<ffffffff8102a45b>] perf_event_nmi_handler+0x2b/0x50
[145965.803049] [<ffffffff81018760>] nmi_handle+0xa0/0x150
[145965.809335] [<ffffffff810186c5>] ? nmi_handle+0x5/0x150
[145965.815730] [<ffffffff81018a95>] default_do_nmi+0xd5/0x140
[145965.822428] [<ffffffff81018b98>] do_nmi+0x98/0xe0
[145965.828254] [<ffffffff816ca3d1>] end_repeat_nmi+0x1e/0x2e
[145965.834803] [<ffffffff816c9ff0>] ? general_protection+0x30/0x30
[145965.841912] [<ffffffff816c9ff0>] ? general_protection+0x30/0x30
[145965.849073] [<ffffffff816c9ff0>] ? general_protection+0x30/0x30
[145965.856187] <<EOE>> <IRQ> [<ffffffff8140b46b>] ? copy_user_handle_tail+0x3b/0x80
[145965.865154] [<ffffffff8140b343>] copy_from_user_nmi+0x63/0x70
[145965.872114] [<ffffffff8102c210>] perf_callchain_user+0xd0/0x240
[145965.879227] [<ffffffff81161508>] perf_callchain+0x1f8/0x220
[145965.885923] [<ffffffff8115da21>] perf_prepare_sample+0x291/0x3c0
[145965.893151] [<ffffffff8115dc5e>] __perf_event_overflow+0x10e/0x270
[145965.900562] [<ffffffff8115dc29>] ? __perf_event_overflow+0xd9/0x270
[145965.908078] [<ffffffff810b7f10>] ? lock_release+0x130/0x260
[145965.914842] [<ffffffff8115df89>] perf_swevent_overflow+0xa9/0xe0
[145965.922072] [<ffffffff8115e027>] perf_swevent_event+0x67/0x90
[145965.929025] [<ffffffff8115e114>] perf_tp_event+0xc4/0x210
[145965.935615] [<ffffffff810b7f10>] ? lock_release+0x130/0x260
[145965.942395] [<ffffffff8107ba4e>] ? __lock_task_sighand+0x6e/0x110
[145965.949648] [<ffffffff810b4f8a>] ? perf_trace_lock+0x12a/0x160
[145965.956671] [<ffffffff8107ba58>] ? __lock_task_sighand+0x78/0x110
[145965.963919] [<ffffffff8107bb4d>] ? do_send_sig_info+0x5d/0x80
[145965.970858] [<ffffffff8107bb4d>] ? do_send_sig_info+0x5d/0x80
[145965.977849] [<ffffffff811406d8>] ? perf_trace_buf_prepare+0x98/0xf0
[145965.985330] [<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160
[145965.992195] [<ffffffff811f7176>] ? send_sigio+0x56/0x100
[145965.998659] [<ffffffff811f71f0>] ? send_sigio+0xd0/0x100
[145966.005169] [<ffffffff810b7f10>] lock_release+0x130/0x260
[145966.011716] [<ffffffff816c76e3>] _raw_read_unlock+0x23/0x40
[145966.018442] [<ffffffff811f71f0>] send_sigio+0xd0/0x100
[145966.024700] [<ffffffff811f72b7>] kill_fasync+0x97/0xf0
[145966.030968] [<ffffffff811f722f>] ? kill_fasync+0xf/0xf0
[145966.037365] [<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0
[145966.044210] [<ffffffff8115cfe5>] ? perf_event_wakeup+0x5/0xf0
[145966.051060] [<ffffffff8115d103>] perf_pending_event+0x33/0x60
[145966.057950] [<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80
[145966.064798] [<ffffffff8114e448>] irq_work_run+0x18/0x40
[145966.071127] [<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0
[145966.078929] [<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80
[145966.086359] <EOI> [<ffffffff816c74c2>] ? _raw_spin_unlock_irq+0x32/0x40
[145966.094337] [<ffffffff816c74bb>] ? _raw_spin_unlock_irq+0x2b/0x40
[145966.101634] [<ffffffff81092f7c>] finish_task_switch+0x8c/0x140
[145966.108630] [<ffffffff81092f3d>] ? finish_task_switch+0x4d/0x140
[145966.115793] [<ffffffff816c1c06>] __schedule+0x2d6/0x9f0
[145966.122115] [<ffffffff810b4e89>] ? perf_trace_lock+0x29/0x160
[145966.128909] [<ffffffff816c2357>] schedule+0x37/0x90
[145966.134802] [<ffffffff816c67ac>] schedule_timeout+0x1bc/0x250
[145966.141578] [<ffffffff816c37f8>] ? wait_for_completion+0x28/0x100
[145966.148660] [<ffffffff816c3894>] ? wait_for_completion+0xc4/0x100
[145966.155795] [<ffffffff816c389c>] wait_for_completion+0xcc/0x100
[145966.162731] [<ffffffff8109ae80>] ? wake_up_state+0x20/0x20
[145966.169205] [<ffffffff810d4830>] ? call_rcu_bh+0x20/0x20
[145966.175464] [<ffffffff810d12fd>] wait_rcu_gp+0x5d/0x80
[145966.181586] [<ffffffff810d1240>] ? ftrace_raw_output_rcu_utilization+0x70/0x70
[145966.189890] [<ffffffff810d352a>] synchronize_sched+0x3a/0x50
[145966.196573] [<ffffffff8114083b>] perf_trace_event_unreg.isra.1+0x3b/0x90
[145966.204345] [<ffffffff81140bac>] perf_trace_destroy+0x3c/0x50
[145966.211075] [<ffffffff81155aee>] tp_perf_event_destroy+0xe/0x10
[145966.217976] [<ffffffff8115721c>] __free_event+0x2c/0x80
[145966.224102] [<ffffffff8115f5f9>] SYSC_perf_event_open+0x629/0xc80
[145966.231246] [<ffffffff8116009e>] SyS_perf_event_open+0xe/0x10
[145966.237982] [<ffffffff816c7ead>] system_call_fastpath+0x16/0x1b
[145966.244932] ---[ end trace 7088940f25646d60 ]---


2015-02-19 16:57:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf: fuzzer gets CPU stuck in perf_callchain()

On Thu, Feb 19, 2015 at 11:54:40AM -0500, Vince Weaver wrote:
>
> Another bug found by the perf_fuzzer(). I think this one is different
> than the one I sent the other day, it looks like something is going
> very wrong in perf_callchain().
>
> This one is reasonably reproducible, if there's any extra debugging that I
> can add. This is on a Haswell machine with git as of a few days ago.

Aw man, and I had such good hopes with you saying it took 4 days to find
problems :/

I'll try and get some time to stare at both.

2015-02-19 17:03:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf: fuzzer gets CPU stuck in perf_callchain()

On Thu, Feb 19, 2015 at 11:54:40AM -0500, Vince Weaver wrote:

> [ 7938.802139] [<ffffffff8115e114>] perf_tp_event+0xc4/0x210
> [ 7938.861174] [<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160
> [ 7938.882197] [<ffffffff810b7f10>] lock_release+0x130/0x260
> [ 7938.888754] [<ffffffff816c7474>] _raw_spin_unlock_irqrestore+0x24/0x40
> [ 7938.896510] [<ffffffff8107bb4d>] do_send_sig_info+0x5d/0x80
> [ 7938.903240] [<ffffffff811f69df>] send_sigio_to_task+0x12f/0x1a0
> [ 7938.923894] [<ffffffff811f71ce>] send_sigio+0xae/0x100
> [ 7938.930132] [<ffffffff811f72b7>] kill_fasync+0x97/0xf0
> [ 7938.942707] [<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0
> [ 7938.956367] [<ffffffff8115d103>] perf_pending_event+0x33/0x60
> [ 7938.963199] [<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80
> [ 7938.969932] [<ffffffff8114e448>] irq_work_run+0x18/0x40
> [ 7938.976212] [<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0
> [ 7938.983957] [<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80

Ah, it looks like the never ending irq_work..

Could you try this?

---
kernel/events/core.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f04daabfd1cf..088e81ae43cb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4574,6 +4574,13 @@ static void perf_pending_event(struct irq_work *entry)
{
struct perf_event *event = container_of(entry,
struct perf_event, pending);
+ int rctx;
+
+ rctx = perf_swevent_get_recursion_context();
+ /*
+ * If we 'fail' here, that's OK, it means recursion is already disabled
+ * and we won't recurse 'further'.
+ */

if (event->pending_disable) {
event->pending_disable = 0;
@@ -4584,6 +4591,9 @@ static void perf_pending_event(struct irq_work *entry)
event->pending_wakeup = 0;
perf_event_wakeup(event);
}
+
+ if (rctx >= 0)
+ perf_swevent_put_recursion_context(rctx);
}

/*

2015-02-19 17:03:09

by Vince Weaver

[permalink] [raw]
Subject: Re: perf: fuzzer gets CPU stuck in perf_callchain()

On Thu, 19 Feb 2015, Peter Zijlstra wrote:

> On Thu, Feb 19, 2015 at 11:54:40AM -0500, Vince Weaver wrote:
> >
> > Another bug found by the perf_fuzzer(). I think this one is different
> > than the one I sent the other day, it looks like something is going
> > very wrong in perf_callchain().
> >
> > This one is reasonably reproducible, if there's any extra debugging that I
> > can add. This is on a Haswell machine with git as of a few days ago.
>
> Aw man, and I had such good hopes with you saying it took 4 days to find
> problems :/
>
> I'll try and get some time to stare at both.

well if it's any comfort it is definitely on the order of days now before
issues crop up (rather than minutes). And the Haswell machine is the
problem-causing one. I also have a core2 and a Cortex-A9 machine fuzzing
away and they've been at it a week now without turning up anything.

Vince

2015-02-19 17:21:29

by Vince Weaver

[permalink] [raw]
Subject: Re: perf: fuzzer gets CPU stuck in perf_callchain()

On Thu, 19 Feb 2015, Vince Weaver wrote:

> I also have a core2 and a Cortex-A9 machine fuzzing
> away and they've been at it a week now without turning up anything.

I have to take that back, it turns out the Cortex-A9 machine was wedged
for over a day, I just hadn't noticed because it hadn't dumped any kind
of message about the problem. Hmm.

Vince

2015-02-19 17:29:21

by Vince Weaver

[permalink] [raw]
Subject: Re: perf: fuzzer gets CPU stuck in perf_callchain()

On Thu, 19 Feb 2015, Vince Weaver wrote:

> I have to take that back, it turns out the Cortex-A9 machine was wedged
> for over a day, I just hadn't noticed because it hadn't dumped any kind
> of message about the problem. Hmm.

This turns out to be the find_get_context() bug that has been fixed in
current git, the kernel I was running wasn't new enough to have the fix in
it. So false alarm.

Vince

2015-02-24 03:29:46

by Vince Weaver

[permalink] [raw]
Subject: Re: perf: fuzzer gets CPU stuck in perf_callchain()

On Thu, 19 Feb 2015, Peter Zijlstra wrote:

> On Thu, Feb 19, 2015 at 11:54:40AM -0500, Vince Weaver wrote:
>
> > [ 7938.802139] [<ffffffff8115e114>] perf_tp_event+0xc4/0x210
> > [ 7938.861174] [<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160
> > [ 7938.882197] [<ffffffff810b7f10>] lock_release+0x130/0x260
> > [ 7938.888754] [<ffffffff816c7474>] _raw_spin_unlock_irqrestore+0x24/0x40
> > [ 7938.896510] [<ffffffff8107bb4d>] do_send_sig_info+0x5d/0x80
> > [ 7938.903240] [<ffffffff811f69df>] send_sigio_to_task+0x12f/0x1a0
> > [ 7938.923894] [<ffffffff811f71ce>] send_sigio+0xae/0x100
> > [ 7938.930132] [<ffffffff811f72b7>] kill_fasync+0x97/0xf0
> > [ 7938.942707] [<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0
> > [ 7938.956367] [<ffffffff8115d103>] perf_pending_event+0x33/0x60
> > [ 7938.963199] [<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80
> > [ 7938.969932] [<ffffffff8114e448>] irq_work_run+0x18/0x40
> > [ 7938.976212] [<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0
> > [ 7938.983957] [<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80
>
> Ah, it looks like the never ending irq_work..
>
> Could you try this?

sorry for the delay. I managed to reproduce this problem again on 4.0-rc1,
verified it was reproducible, then applied the patch below and I could no
longer trigger the issue.

Tested-by: Vince Weaver <[email protected]>



>
> ---
> kernel/events/core.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f04daabfd1cf..088e81ae43cb 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4574,6 +4574,13 @@ static void perf_pending_event(struct irq_work *entry)
> {
> struct perf_event *event = container_of(entry,
> struct perf_event, pending);
> + int rctx;
> +
> + rctx = perf_swevent_get_recursion_context();
> + /*
> + * If we 'fail' here, that's OK, it means recursion is already disabled
> + * and we won't recurse 'further'.
> + */
>
> if (event->pending_disable) {
> event->pending_disable = 0;
> @@ -4584,6 +4591,9 @@ static void perf_pending_event(struct irq_work *entry)
> event->pending_wakeup = 0;
> perf_event_wakeup(event);
> }
> +
> + if (rctx >= 0)
> + perf_swevent_put_recursion_context(rctx);
> }
>
> /*