Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030370AbaLLOjo (ORCPT ); Fri, 12 Dec 2014 09:39:44 -0500 Received: from mx1.redhat.com ([209.132.183.28]:33878 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S967919AbaLLOj2 (ORCPT ); Fri, 12 Dec 2014 09:39:28 -0500 Date: Fri, 12 Dec 2014 09:38:46 -0500 From: Dave Jones To: Linus Torvalds , Chris Mason , Mike Galbraith , Ingo Molnar , Peter Zijlstra , =?iso-8859-1?Q?D=E2niel?= Fraga , Sasha Levin , "Paul E. McKenney" , Linux Kernel Mailing List Subject: Re: frequent lockups in 3.18rc4 Message-ID: <20141212143846.GA27914@redhat.com> Mail-Followup-To: Dave Jones , Linus Torvalds , Chris Mason , Mike Galbraith , Ingo Molnar , Peter Zijlstra , =?iso-8859-1?Q?D=E2niel?= Fraga , Sasha Levin , "Paul E. McKenney" , Linux Kernel Mailing List References: <20141203184111.GA32005@redhat.com> <20141205171501.GA1320@redhat.com> <1417806247.4845.1@mail.thefacebook.com> <20141211145408.GB16800@redhat.com> <20141212030343.GA7945@redhat.com> <20141212044509.GA21145@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20141212044509.GA21145@redhat.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Dec 11, 2014 at 11:45:09PM -0500, Dave Jones wrote: > I've seen those messages a fair bit on other machines do, and they drive me nuts > because there's no 'perf' being run. I think it means "NMI watchdog", but is > worded badly. Still, it's curious that they appeared during what should have > been idle time. > > I'll reboot the box and give it another shot, and see what falls out in the morning. Same deal again. It happened pretty quick after I'd gone to bed (typical). [ 2754.509747] Clocksource tsc unstable (delta = -243594587656 ns) [ 2754.519197] Switched to clocksource hpet [ 2754.782940] INFO: rcu_preempt self-detected stall on CPU [ 2754.782972] 0: (1 GPs behind) idle=6ef/140000000000001/0 softirq=247160/247161 [ 2754.782999] (t=24343 jiffies g=104086 c=104085 q=0) [ 2754.783022] Task dump for CPU 0: [ 2754.783037] trinity-c387 R running task 14016 13658 12780 0x00000008 [ 2754.783070] ffff880222365bc0 000000005ed04994 ffff880244003d68 ffffffff810a8d46 [ 2754.783104] ffffffff810a8cb2 0000000000000000 0000000000000001 0000000000000000 [ 2754.783138] ffffffff81c51e40 0000000000000092 ffff880244003d88 ffffffff810acf4d [ 2754.783171] Call Trace: [ 2754.783184] [] sched_show_task+0x116/0x180 [ 2754.783215] [] ? sched_show_task+0x82/0x180 [ 2754.783239] [] dump_cpu_task+0x3d/0x50 [ 2754.783261] [] rcu_dump_cpu_stacks+0x90/0xd0 [ 2754.783286] [] rcu_check_callbacks+0x573/0x850 [ 2754.783311] [] ? __this_cpu_preempt_check+0x13/0x20 [ 2754.783337] [] ? hrtimer_run_queues+0x43/0x130 [ 2754.783361] [] update_process_times+0x4b/0x80 [ 2754.783386] [] tick_sched_timer+0x4c/0x1b0 [ 2754.783409] [] ? __run_hrtimer+0xbb/0x2e0 [ 2754.783432] [] __run_hrtimer+0xbb/0x2e0 [ 2754.783454] [] ? hrtimer_interrupt+0x94/0x260 [ 2754.783478] [] ? tick_init_highres+0x20/0x20 [ 2754.783501] [] hrtimer_interrupt+0x107/0x260 [ 2754.783526] [] local_apic_timer_interrupt+0x38/0x70 [ 2754.783552] [] smp_apic_timer_interrupt+0x45/0x60 [ 2754.783578] [] apic_timer_interrupt+0x6f/0x80 [ 2754.783600] [] ? lock_release_holdtime.part.24+0x9d/0x160 [ 2754.783634] [] ? put_lock_stats.isra.23+0xe/0x30 [ 2754.783659] [] ? __alloc_pages_nodemask+0x1ac/0xb60 [ 2754.783684] [] ? alloc_pages_vma+0xee/0x1b0 [ 2754.783708] [] ? local_clock+0x25/0x30 [ 2754.783731] [] ? __lock_acquire.isra.31+0x22c/0x9f0 [ 2754.783756] [] ? get_parent_ip+0x11/0x50 [ 2754.783779] [] ? lock_release_holdtime.part.24+0x9d/0x160 [ 2754.784613] [] alloc_pages_vma+0xee/0x1b0 [ 2754.785452] [] ? do_wp_page+0xca/0x7d0 [ 2754.786305] [] do_wp_page+0xca/0x7d0 [ 2754.787140] [] handle_mm_fault+0x6cb/0xe90 [ 2754.787948] [] ? __do_page_fault+0x140/0x600 [ 2754.788748] [] __do_page_fault+0x1a4/0x600 [ 2754.789562] [] ? put_lock_stats.isra.23+0xe/0x30 [ 2754.790340] [] ? lock_release_holdtime.part.24+0x9d/0x160 [ 2754.791113] [] ? get_parent_ip+0x11/0x50 [ 2754.791879] [] ? preempt_count_sub+0x7b/0x100 [ 2754.792646] [] ? trace_hardirqs_off_thunk+0x3a/0x3f [ 2754.793413] [] do_page_fault+0xc/0x10 [ 2754.794176] [] page_fault+0x22/0x30 [ 2754.794938] INFO: rcu_preempt detected stalls on CPUs/tasks: [ 2754.795736] Tasks blocked on level-0 rcu_node (CPUs 0-7): [ 2754.796528] 0: (1 GPs behind) idle=6ef/140000000000001/0 softirq=247160/247161 [ 2754.797334] Tasks blocked on level-0 rcu_node (CPUs 0-7): [ 2754.798153] (detected by 3, t=24343 jiffies, g=104086, c=104085, q=0) [ 2754.798981] Task dump for CPU 0: [ 2754.799797] trinity-c387 R running task 14016 13658 12780 0x00000008 [ 2754.800630] ffff880222365bc0 0000000000000246 0000000127e77e08 8000000000000865 [ 2754.801461] ffff8802000000a9 800000008d044865 0000000000000000 ffff8802256d3c70 [ 2754.802288] ffff880227e77e28 00000000000000a9 0000000000d8eff8 ffff880227e77f58 [ 2754.803109] Call Trace: [ 2754.803929] [] ? __do_page_fault+0x1a4/0x600 [ 2754.804765] [] ? put_lock_stats.isra.23+0xe/0x30 [ 2754.805596] [] ? lock_release_holdtime.part.24+0x9d/0x160 [ 2754.806444] [] ? get_parent_ip+0x11/0x50 [ 2754.807267] [] ? preempt_count_sub+0x7b/0x100 [ 2754.808088] [] ? trace_hardirqs_off_thunk+0x3a/0x3f [ 2754.808920] [] ? do_page_fault+0xc/0x10 [ 2754.809742] [] ? page_fault+0x22/0x30 [ 2771.561356] ------------[ cut here ]------------ [ 2771.562079] WARNING: CPU: 0 PID: 13696 at kernel/watchdog.c:317 watchdog_overflow_callback+0xdd/0x130() [ 2771.562879] Watchdog detected hard LOCKUP on cpu 0 [ 2771.562895] CPU: 0 PID: 13696 Comm: trinity-c425 Not tainted 3.18.0+ #101 [ 2771.564490] ffffffff81a66315 00000000fce35109 ffff880244005b88 ffffffff817d317e [ 2771.565315] 0000000000110004 ffff880244005be0 ffff880244005bc8 ffffffff81078a01 [ 2771.566136] 0000000000000000 0000000000000000 0000000000000000 ffff880244005d30 [ 2771.566954] Call Trace: [ 2771.567759] [] dump_stack+0x4f/0x7c [ 2771.568584] [] warn_slowpath_common+0x81/0xa0 [ 2771.569405] [] warn_slowpath_fmt+0x55/0x70 [ 2771.570253] [] ? restart_watchdog_hrtimer+0x60/0x60 [ 2771.571074] [] watchdog_overflow_callback+0xdd/0x130 [ 2771.571894] [] __perf_event_overflow+0xac/0x2a0 [ 2771.572721] [] ? x86_perf_event_set_period+0xe2/0x150 [ 2771.573551] [] perf_event_overflow+0x14/0x20 [ 2771.574378] [] intel_pmu_handle_irq+0x209/0x410 [ 2771.575210] [] perf_event_nmi_handler+0x2b/0x50 [ 2771.576040] [] nmi_handle+0xa4/0x1e0 [ 2771.576868] [] ? nmi_handle+0x5/0x1e0 [ 2771.577698] [] ? print_context_stack+0xe1/0x100 [ 2771.578526] [] default_do_nmi+0x7a/0x1d0 [ 2771.579354] [] do_nmi+0xb8/0xf0 [ 2771.580206] [] end_repeat_nmi+0x1e/0x2e [ 2771.581023] [] ? free_debug_processing+0x157/0x22a [ 2771.581836] [] ? free_debug_processing+0x157/0x22a [ 2771.582644] [] ? print_context_stack+0xe1/0x100 [ 2771.583452] [] ? print_context_stack+0xe1/0x100 [ 2771.584253] [] ? print_context_stack+0xe1/0x100 [ 2771.585042] <> [] dump_trace+0x140/0x370 [ 2771.585841] [] ? final_putname+0x26/0x50 [ 2771.586636] [] save_stack_trace+0x2f/0x50 [ 2771.587430] [] set_track+0x70/0x140 [ 2771.588217] [] free_debug_processing+0x157/0x22a [ 2771.589015] [] ? put_lock_stats.isra.23+0xe/0x30 [ 2771.589815] [] __slab_free+0x55/0x320 [ 2771.590636] [] ? debug_check_no_obj_freed+0x156/0x250 [ 2771.591442] [] ? mntput+0x24/0x40 [ 2771.592242] [] kmem_cache_free+0x262/0x280 [ 2771.593036] [] ? final_putname+0x26/0x50 [ 2771.593831] [] final_putname+0x26/0x50 [ 2771.594622] [] putname+0x29/0x40 [ 2771.595411] [] user_path_at_empty+0x6e/0xc0 [ 2771.596199] [] ? mntput_no_expire+0x67/0x140 [ 2771.596986] [] ? mntput_no_expire+0x5/0x140 [ 2771.597766] [] ? dput+0x56/0x190 [ 2771.598542] [] user_path_at+0x11/0x20 [ 2771.599311] [] path_setxattr+0x4c/0xe0 [ 2771.600097] [] SyS_lsetxattr+0x11/0x20 [ 2771.600848] [] system_call_fastpath+0x12/0x17 [ 2771.601598] ---[ end trace 7b78126c55dcb717 ]--- [ 2771.602404] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 40.989 msecs [ 2771.603175] perf interrupt took too long (322423 > 2500), lowering kernel.perf_event_max_sample_rate to 50000 [ 3471.463812] perf interrupt took too long (319933 > 5000), lowering kernel.perf_event_max_sample_rate to 25000 [ 4563.539619] perf interrupt took too long (317460 > 10000), lowering kernel.perf_event_max_sample_rate to 12500 [ 5676.723413] perf interrupt took too long (315015 > 19841), lowering kernel.perf_event_max_sample_rate to 6300 [ 6800.751151] perf interrupt took too long (312583 > 39062), lowering kernel.perf_event_max_sample_rate to 3200 [ 8056.882309] perf interrupt took too long (310176 > 78125), lowering kernel.perf_event_max_sample_rate to 1600 [ 9233.809073] perf interrupt took too long (307790 > 156250), lowering kernel.perf_event_max_sample_rate to 800 again, the box survived. Next run I'll try undoing Chris' idea of no serial, and see if it wedges after the spew. After that, I'll do a no preempt run. Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/