2013-09-11 15:28:29

by Vince Weaver

[permalink] [raw]
Subject: perf-related lockup on 3.11


I got this hard lockup running my perf_fuzzer on 3.11 on a core2.
I guess I should try to reproduce it on linus-git but was hoping
I could wait until -rc1 before trying that.

I'm not really sure where to report lockups like this because it's sort
of unclear what exactly the problem is. perf? ftrace? nmi? rcu?

Vince


[ 4022.356007] ------------[ cut here ]------------
[ 4022.356007] WARNING: CPU: 0 PID: 3393 at kernel/watchdog.c:245 watchdog_overf
low_callback+0x9b/0xa6()
[ 4022.356007] Watchdog detected hard LOCKUP on cpu 0
[ 4022.356007] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powers
ave cpufreq_conservative f71882fg coretemp acpi_cpufreq mperf mcs7830 usbnet psm
ouse pcspkr serio_raw evdev video wmi ohci_pci processor button thermal_sys ohci
_hcd i2c_nforce2 sg ehci_pci ehci_hcd sd_mod usbcore usb_common
[ 4022.356007] CPU: 0 PID: 3393 Comm: perf_fuzzer Not tainted 3.11.0 #2
[ 4022.356007] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIO
S 080015 10/19/2012
[ 4022.356007] 00000000000000f5 ffff88011fc07c48 ffffffff81508364 ffff88011fc07
c58
[ 4022.356007] ffffffff8103cc81 ffffffff817d132a ffffffff8109b2e9 ffff88011b30e
400
[ 4022.356007] ffff88011fc07ef8 ffff88011fc07d58 0000000000000000 ffff88011fc07
ca8
[ 4022.356007] Call Trace:
[ 4022.356007] <NMI> [<ffffffff81508364>] ? dump_stack+0x41/0x55
[ 4022.356007] [<ffffffff8103cc81>] ? warn_slowpath_common+0x81/0x9c
[ 4022.356007] [<ffffffff8109b2e9>] ? watchdog_overflow_callback+0x9b/0xa6
[ 4022.356007] [<ffffffff8103cd3e>] ? warn_slowpath_fmt+0x46/0x48
[ 4022.356007] [<ffffffff8109b257>] ? watchdog_overflow_callback+0x9/0xa6
[ 4022.356007] [<ffffffff8109b2e9>] ? watchdog_overflow_callback+0x9b/0xa6
[ 4022.356007] [<ffffffff810c90d8>] ? __perf_event_overflow+0x12e/0x1ab
[ 4022.356007] [<ffffffff810c50ea>] ? perf_event_update_userpage+0x16/0xce
[ 4022.356007] [<ffffffff81017dc8>] ? intel_pmu_save_and_restart+0xe/0x22
[ 4022.356007] [<ffffffff81018a2f>] ? intel_pmu_handle_irq+0x2c4/0x35b
[ 4022.356007] [<ffffffff815122fc>] ? ftrace_call+0x5/0x2f
[ 4022.356007] [<ffffffff8150b2ad>] ? _raw_spin_unlock_irqrestore+0xf/0x11
[ 4022.356007] [<ffffffff8150ccab>] ? perf_event_nmi_handler+0x24/0x3d
[ 4022.356007] [<ffffffff8150c4dd>] ? nmi_handle+0x5e/0x13a
[ 4022.356007] [<ffffffff810a0d70>] ? rcu_nmi_enter+0x9/0x5b
[ 4022.356007] [<ffffffff8150c660>] ? do_nmi+0xa7/0x2bd
[ 4022.356007] [<ffffffff8150bc77>] ? end_repeat_nmi+0x1e/0x2e
[ 4022.356007] [<ffffffff8150b2ad>] ? _raw_spin_unlock_irqrestore+0xf/0x11
[ 4022.356007] [<ffffffff8150b2ad>] ? _raw_spin_unlock_irqrestore+0xf/0x11
[ 4022.356007] [<ffffffff8150b2ad>] ? _raw_spin_unlock_irqrestore+0xf/0x11
[ 4022.356007] <<EOE>> [<ffffffff810bdf89>] ? perf_trace_event_unreg+0x3a/0xbd
[ 4022.356007] [<ffffffff810be03a>] ? perf_trace_destroy+0x2e/0x3b
[ 4022.356007] [<ffffffff810c5f5d>] ? free_event+0xf6/0x11e
[ 4022.356007] [<ffffffff810cab5c>] ? SYSC_perf_event_open+0x6e4/0x771
[ 4022.356007] [<ffffffff81055666>] ? put_pid+0xe/0x4f
[ 4022.356007] [<ffffffff815126c6>] ? system_call_fastpath+0x1a/0x1f
[ 4022.356007] ---[ end trace c426f99910c1c076 ]---


2013-09-12 17:31:09

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf-related lockup on 3.11

On Wed, Sep 11, 2013 at 11:35:24AM -0400, Vince Weaver wrote:
>
> I got this hard lockup running my perf_fuzzer on 3.11 on a core2.
> I guess I should try to reproduce it on linus-git but was hoping
> I could wait until -rc1 before trying that.
>
> I'm not really sure where to report lockups like this because it's sort
> of unclear what exactly the problem is. perf? ftrace? nmi? rcu?

RCU is an innocent bystander on this one. (Honest, officer!)

You see rcu_nmi_enter() because it is called on NMI.

Thanx, Paul

> Vince
>
>
> [ 4022.356007] ------------[ cut here ]------------
> [ 4022.356007] WARNING: CPU: 0 PID: 3393 at kernel/watchdog.c:245 watchdog_overf
> low_callback+0x9b/0xa6()
> [ 4022.356007] Watchdog detected hard LOCKUP on cpu 0
> [ 4022.356007] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powers
> ave cpufreq_conservative f71882fg coretemp acpi_cpufreq mperf mcs7830 usbnet psm
> ouse pcspkr serio_raw evdev video wmi ohci_pci processor button thermal_sys ohci
> _hcd i2c_nforce2 sg ehci_pci ehci_hcd sd_mod usbcore usb_common
> [ 4022.356007] CPU: 0 PID: 3393 Comm: perf_fuzzer Not tainted 3.11.0 #2
> [ 4022.356007] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIO
> S 080015 10/19/2012
> [ 4022.356007] 00000000000000f5 ffff88011fc07c48 ffffffff81508364 ffff88011fc07
> c58
> [ 4022.356007] ffffffff8103cc81 ffffffff817d132a ffffffff8109b2e9 ffff88011b30e
> 400
> [ 4022.356007] ffff88011fc07ef8 ffff88011fc07d58 0000000000000000 ffff88011fc07
> ca8
> [ 4022.356007] Call Trace:
> [ 4022.356007] <NMI> [<ffffffff81508364>] ? dump_stack+0x41/0x55
> [ 4022.356007] [<ffffffff8103cc81>] ? warn_slowpath_common+0x81/0x9c
> [ 4022.356007] [<ffffffff8109b2e9>] ? watchdog_overflow_callback+0x9b/0xa6
> [ 4022.356007] [<ffffffff8103cd3e>] ? warn_slowpath_fmt+0x46/0x48
> [ 4022.356007] [<ffffffff8109b257>] ? watchdog_overflow_callback+0x9/0xa6
> [ 4022.356007] [<ffffffff8109b2e9>] ? watchdog_overflow_callback+0x9b/0xa6
> [ 4022.356007] [<ffffffff810c90d8>] ? __perf_event_overflow+0x12e/0x1ab
> [ 4022.356007] [<ffffffff810c50ea>] ? perf_event_update_userpage+0x16/0xce
> [ 4022.356007] [<ffffffff81017dc8>] ? intel_pmu_save_and_restart+0xe/0x22
> [ 4022.356007] [<ffffffff81018a2f>] ? intel_pmu_handle_irq+0x2c4/0x35b
> [ 4022.356007] [<ffffffff815122fc>] ? ftrace_call+0x5/0x2f
> [ 4022.356007] [<ffffffff8150b2ad>] ? _raw_spin_unlock_irqrestore+0xf/0x11
> [ 4022.356007] [<ffffffff8150ccab>] ? perf_event_nmi_handler+0x24/0x3d
> [ 4022.356007] [<ffffffff8150c4dd>] ? nmi_handle+0x5e/0x13a
> [ 4022.356007] [<ffffffff810a0d70>] ? rcu_nmi_enter+0x9/0x5b
> [ 4022.356007] [<ffffffff8150c660>] ? do_nmi+0xa7/0x2bd
> [ 4022.356007] [<ffffffff8150bc77>] ? end_repeat_nmi+0x1e/0x2e
> [ 4022.356007] [<ffffffff8150b2ad>] ? _raw_spin_unlock_irqrestore+0xf/0x11
> [ 4022.356007] [<ffffffff8150b2ad>] ? _raw_spin_unlock_irqrestore+0xf/0x11
> [ 4022.356007] [<ffffffff8150b2ad>] ? _raw_spin_unlock_irqrestore+0xf/0x11
> [ 4022.356007] <<EOE>> [<ffffffff810bdf89>] ? perf_trace_event_unreg+0x3a/0xbd
> [ 4022.356007] [<ffffffff810be03a>] ? perf_trace_destroy+0x2e/0x3b
> [ 4022.356007] [<ffffffff810c5f5d>] ? free_event+0xf6/0x11e
> [ 4022.356007] [<ffffffff810cab5c>] ? SYSC_perf_event_open+0x6e4/0x771
> [ 4022.356007] [<ffffffff81055666>] ? put_pid+0xe/0x4f
> [ 4022.356007] [<ffffffff815126c6>] ? system_call_fastpath+0x1a/0x1f
> [ 4022.356007] ---[ end trace c426f99910c1c076 ]---
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2013-09-17 21:08:17

by Vince Weaver

[permalink] [raw]
Subject: Re: perf-related lockup on 3.11

On Wed, 11 Sep 2013, Vince Weaver wrote:

>
> I got this hard lockup running my perf_fuzzer on 3.11 on a core2.
> I guess I should try to reproduce it on linus-git but was hoping
> I could wait until -rc1 before trying that.

OK, I can also lock up 3.12-rc1 using the perf_fuzzer. Last message
logged is below.

The system won't respond, even to sysrq (well, the serial port equiv)
and then after a while alarming file system/disk warnings start popping
up presumably because interrupts aren't working anymore.

[ 1173.553545] perf samples too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[ 1173.556002] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 2.464 msecs
[ 1314.102904] hrtimer: interrupt took 3787 ns
[ 1351.896018] ------------[ cut here ]------------
[ 1351.896018] WARNING: CPU: 0 PID: 3718 at kernel/watchdog.c:245 watchdog_overflow_callback+0x9b/0xa6()
[ 1351.896018] Watchdog detected hard LOCKUP on cpu 0
[ 1351.896018] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative f71882fg evdev pcspkr ohci_pci psmouse mcs7830 serio_raw usbnet ohci_hcd coretemp i2c_nforce2 video wmi acpi_cpufreq button processor thermal_sys sg ehci_pci sd_mod ehci_hcd usbcore usb_common
[ 1351.896018] CPU: 0 PID: 3718 Comm: perf_fuzzer Not tainted 3.12.0-rc1 #1
[ 1351.896018] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012
[ 1351.896018] 00000000000000f5 ffff88011fc07c48 ffffffff8151a026 ffff88011fc07c58
[ 1351.896018] ffffffff8103cd58 ffffffff817fa8e9 ffffffff810a0e6c ffff88011b313400
[ 1351.896018] ffff88011fc07ef8 ffff88011fc07d58 0000000000000000 ffff88011fc07ca8
[ 1351.896018] Call Trace:
[ 1351.896018] <NMI> [<ffffffff8151a026>] ? dump_stack+0x41/0x53
[ 1351.896018] [<ffffffff8103cd58>] ? warn_slowpath_common+0x81/0x9c
[ 1351.896018] [<ffffffff810a0e6c>] ? watchdog_overflow_callback+0x9b/0xa6
[ 1351.896018] [<ffffffff8103ce15>] ? warn_slowpath_fmt+0x46/0x48
[ 1351.896018] [<ffffffff810a0dda>] ? watchdog_overflow_callback+0x9/0xa6
[ 1351.896018] [<ffffffff810a0e6c>] ? watchdog_overflow_callback+0x9b/0xa6
[ 1351.896018] [<ffffffff810ca0eb>] ? __perf_event_overflow+0x12e/0x1ab
[ 1351.896018] [<ffffffff8101806f>] ? intel_pmu_save_and_restart+0xe/0x22
[ 1351.896018] [<ffffffff81018cbd>] ? intel_pmu_handle_irq+0x2b8/0x34d
[ 1351.896018] [<ffffffff81523e7c>] ? ftrace_call+0x5/0x2f
[ 1351.896018] [<ffffffff810a7d02>] ? ftrace_ops_test+0x40/0x78
[ 1351.896018] [<ffffffff8151e9b3>] ? perf_event_nmi_handler+0x24/0x3d
[ 1351.896018] [<ffffffff8151e2f2>] ? nmi_handle+0x5e/0x13a
[ 1351.896018] [<ffffffff810a1610>] ? rcu_nmi_enter+0x9/0x5b
[ 1351.896018] [<ffffffff8151e475>] ? do_nmi+0xa7/0x2bd
[ 1351.896018] [<ffffffff8111f1d7>] ? send_sigio_to_task+0xc6/0xf5
[ 1351.896018] [<ffffffff8151da77>] ? end_repeat_nmi+0x1e/0x2e
[ 1351.896018] [<ffffffff8111f1d7>] ? send_sigio_to_task+0xc6/0xf5
[ 1351.896018] [<ffffffff8104b023>] ? do_send_sig_info+0xf/0x75
[ 1351.896018] [<ffffffff8104b023>] ? do_send_sig_info+0xf/0x75
[ 1351.896018] [<ffffffff810a7d02>] ? ftrace_ops_test+0x40/0x78
[ 1351.896018] [<ffffffff810a7d02>] ? ftrace_ops_test+0x40/0x78
[ 1351.896018] [<ffffffff810a7d02>] ? ftrace_ops_test+0x40/0x78
[ 1351.896018] <<EOE>> <IRQ> [<ffffffff8104b023>] ? do_send_sig_info+0xf/0x75
[ 1351.896018] [<ffffffff810a8d1c>] ? ftrace_ops_control_func+0x88/0xc8
[ 1351.896018] [<ffffffff810a8d2e>] ? ftrace_ops_control_func+0x9a/0xc8
[ 1351.896018] [<ffffffff81523e7c>] ? ftrace_call+0x5/0x2f
[ 1351.896018] [<ffffffff8111f11d>] ? send_sigio_to_task+0xc/0xf5
[ 1351.896018] [<ffffffff810a8d2e>] ? ftrace_ops_control_func+0x9a/0xc8
[ 1351.896018] [<ffffffff8104b028>] ? do_send_sig_info+0x14/0x75
[ 1351.896018] [<ffffffff810a2904>] ? rcu_irq_enter+0xd/0x75
[ 1351.896018] [<ffffffff8104b028>] ? do_send_sig_info+0x14/0x75
[ 1351.896018] [<ffffffff8111f1d7>] ? send_sigio_to_task+0xc6/0xf5
[ 1351.896018] [<ffffffff8151d164>] ? _raw_spin_lock_irqsave+0x4/0x2d
[ 1351.896018] [<ffffffff810a2904>] ? rcu_irq_enter+0xd/0x75
[ 1351.896018] [<ffffffff8151cf82>] ? _raw_read_lock+0x9/0x15
[ 1351.896018] [<ffffffff8111f21c>] ? send_sigio+0x16/0xdd
[ 1351.896018] [<ffffffff8111f2b0>] ? send_sigio+0xaa/0xdd
[ 1351.896018] [<ffffffff8151d169>] ? _raw_spin_lock_irqsave+0x9/0x2d
[ 1351.896018] [<ffffffff8111f34c>] ? kill_fasync+0x69/0x86
[ 1351.896018] [<ffffffff810ca850>] ? perf_event_wakeup+0x8c/0xa1
[ 1351.896018] [<ffffffff810c472b>] ? __irq_work_run+0x5f/0x7b
[ 1351.896018] [<ffffffff81005f82>] ? smp_trace_irq_work_interrupt+0x4b/0x7e
[ 1351.896018] [<ffffffff8152578a>] ? trace_irq_work_interrupt+0x6a/0x70
[ 1351.896018] [<ffffffff810a1930>] ? __note_gp_changes+0xc/0x54
[ 1351.896018] [<ffffffff8151cf57>] ? _raw_spin_unlock_irqrestore+0xf/0x11
[ 1351.896018] [<ffffffff810a347b>] ? rcu_process_callbacks+0xb7/0x4b8
[ 1351.896018] [<ffffffff810408f2>] ? perf_trace_softirq+0x38/0x96
[ 1351.896018] [<ffffffff81040f76>] ? __do_softirq+0x102/0x208
[ 1351.896018] [<ffffffff810410f5>] ? irq_exit+0x45/0xa1
[ 1351.896018] [<ffffffff81525fd4>] ? smp_trace_apic_timer_interrupt+0x83/0x91
[ 1351.896018] [<ffffffff810c7c58>] ? perf_event_task_output+0x117/0x117
[ 1351.896018] [<ffffffff81524f3a>] ? trace_apic_timer_interrupt+0x6a/0x70
[ 1351.896018] <EOI> [<ffffffff8108c2b6>] ? smp_call_function_single+0x8c/0x101
[ 1351.896018] [<ffffffff810a8d2e>] ? ftrace_ops_control_func+0x9a/0xc8
[ 1351.896018] [<ffffffff810c7cd3>] ? task_function_call+0x42/0x4c
[ 1351.896018] [<ffffffff810c77f1>] ? __perf_event_task_sched_in+0x108/0x108
[ 1351.896018] [<ffffffff810c81cb>] ? perf_event_enable+0xf7/0x12e
[ 1351.896018] [<ffffffff810c5a4a>] ? perf_event_task_enable+0x57/0x6c
[ 1351.896018] [<ffffffff8151bfea>] ? _cond_resched+0x9/0x37
[ 1351.896018] [<ffffffff8151a511>] ? mutex_lock+0x16/0x37
[ 1351.896018] [<ffffffff810c80d4>] ? perf_install_in_context+0xb8/0xb8
[ 1351.896018] [<ffffffff810c5955>] ? perf_event_for_each_child+0x56/0x88
[ 1351.896018] [<ffffffff810c5a4a>] ? perf_event_task_enable+0x57/0x6c
[ 1351.896018] [<ffffffff810c5a32>] ? perf_event_task_enable+0x3f/0x6c
[ 1351.896018] [<ffffffff8104eb20>] ? SyS_prctl+0x151/0x341
[ 1351.896018] [<ffffffff8100cefe>] ? syscall_trace_leave+0x52/0xba
[ 1351.896018] [<ffffffff815243c1>] ? tracesys+0xd4/0xd9
[ 1351.896018] ---[ end trace dea2e0afc6629577 ]---

2013-09-19 08:24:12

by Andi Kleen

[permalink] [raw]
Subject: Re: perf-related lockup on 3.11

Vince Weaver <[email protected]> writes:

> On Wed, 11 Sep 2013, Vince Weaver wrote:
>
>>
>> I got this hard lockup running my perf_fuzzer on 3.11 on a core2.
>> I guess I should try to reproduce it on linus-git but was hoping
>> I could wait until -rc1 before trying that.
>
> OK, I can also lock up 3.12-rc1 using the perf_fuzzer. Last message
> logged is below.

If it's easily reproducible and a regression, could you perhaps bisect it?

-Andi

--
[email protected] -- Speaking for myself only

2013-09-19 13:02:56

by Vince Weaver

[permalink] [raw]
Subject: Re: perf-related lockup on 3.11

On Thu, 19 Sep 2013, Andi Kleen wrote:

> Vince Weaver <[email protected]> writes:
>
> > On Wed, 11 Sep 2013, Vince Weaver wrote:
> >
> >>
> >> I got this hard lockup running my perf_fuzzer on 3.11 on a core2.
> >> I guess I should try to reproduce it on linus-git but was hoping
> >> I could wait until -rc1 before trying that.
> >
> > OK, I can also lock up 3.12-rc1 using the perf_fuzzer. Last message
> > logged is below.
>
> If it's easily reproducible and a regression, could you perhaps bisect it?

It's not necessarily easy to reproduce. It shows up eventually, but it
varies a bit.

It's also unclear whether it is a regression or not. It only started
appearing because I already got fixed the 5 or so other bugs that the
fuzzer would trigger first. So bisecting will be a pain because older
kernels would have to be patched and the patches don't necessarily apply
cleanly further back.

I'm working on seeing if I can get a reproducible trace that I can work
with, but there's a lot of other stuff going on right now so it's semi low
priority until I can spare a machine from doing 3.12-rc1 validation tests.

Vince