Date: Fri, 12 Dec 2014 10:24:24 -0800
From: "Paul E. McKenney"
Reply-To: paulmck@linux.vnet.ibm.com
To: Dave Jones, Linus Torvalds, Chris Mason, Mike Galbraith, Ingo Molnar,
	Peter Zijlstra, Dâniel Fraga, Sasha Levin, Linux Kernel Mailing List
Subject: Re: frequent lockups in 3.18rc4
Message-ID: <20141212182424.GY25340@linux.vnet.ibm.com>
References: <20141205171501.GA1320@redhat.com>
 <1417806247.4845.1@mail.thefacebook.com>
 <20141211145408.GB16800@redhat.com>
 <20141212030343.GA7945@redhat.com>
 <20141212044509.GA21145@redhat.com>
 <20141212143846.GA27914@redhat.com>
In-Reply-To: <20141212143846.GA27914@redhat.com>

On Fri, Dec 12, 2014 at 09:38:46AM -0500, Dave Jones wrote:
> On Thu, Dec 11, 2014 at 11:45:09PM -0500, Dave Jones wrote:
> 
>  > I've seen those messages a fair bit on other machines too, and they drive me nuts
>  > because there's no 'perf' being run. I think it means "NMI watchdog", but is
>  > worded badly. Still, it's curious that they appeared during what should have
>  > been idle time.
>  > 
>  > I'll reboot the box and give it another shot, and see what falls out in the morning.
> 
> Same deal again. It happened pretty quick after I'd gone to bed (typical).
> 
> [ 2754.509747] Clocksource tsc unstable (delta = -243594587656 ns)
> [ 2754.519197] Switched to clocksource hpet
> [ 2754.782940] INFO: rcu_preempt self-detected stall on CPU
> [ 2754.782972] 	0: (1 GPs behind) idle=6ef/140000000000001/0 softirq=247160/247161

In this one, the CPU is at least non-idle.  ;-)
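In case that "idle=" field is not obvious, here is a throwaway userspace
decoder.  This is only a sketch of my reading of the 3.18-era stall-report
layout, where the field prints the per-CPU dynticks counter, dynticks_nesting,
and dynticks_nmi_nesting, and an odd dynticks value means the CPU was not in
dyntick-idle:

	/*
	 * Not kernel code -- just a decoder for the "idle=" field above,
	 * assuming the 3.18-era layout: dynticks counter / dynticks_nesting /
	 * dynticks_nmi_nesting, with an odd dynticks value meaning the CPU
	 * was not in dyntick-idle at the time of the report.
	 */
	#include <stdio.h>

	int main(void)
	{
		unsigned long long dynticks = 0x6ef;			/* "idle=6ef/..." */
		unsigned long long nesting = 0x140000000000001ULL;	/* ".../140000000000001/0" */

		printf("dynticks=%#llx -> CPU was %sin dyntick-idle\n",
		       dynticks, (dynticks & 1) ? "NOT " : "");
		printf("dynticks_nesting=%#llx -> non-zero, so task-level non-idle\n",
		       nesting);
		return 0;
	}

Both fields point at a non-idle CPU here, which is consistent with the
trace that follows.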
> [ 2754.782999] (t=24343 jiffies g=104086 c=104085 q=0)
> [ 2754.783022] Task dump for CPU 0:
> [ 2754.783037] trinity-c387 R running task 14016 13658 12780 0x00000008
> [ 2754.783070] ffff880222365bc0 000000005ed04994 ffff880244003d68 ffffffff810a8d46
> [ 2754.783104] ffffffff810a8cb2 0000000000000000 0000000000000001 0000000000000000
> [ 2754.783138] ffffffff81c51e40 0000000000000092 ffff880244003d88 ffffffff810acf4d
> [ 2754.783171] Call Trace:
> [ 2754.783184] [] sched_show_task+0x116/0x180
> [ 2754.783215] [] ? sched_show_task+0x82/0x180
> [ 2754.783239] [] dump_cpu_task+0x3d/0x50
> [ 2754.783261] [] rcu_dump_cpu_stacks+0x90/0xd0
> [ 2754.783286] [] rcu_check_callbacks+0x573/0x850
> [ 2754.783311] [] ? __this_cpu_preempt_check+0x13/0x20
> [ 2754.783337] [] ? hrtimer_run_queues+0x43/0x130
> [ 2754.783361] [] update_process_times+0x4b/0x80
> [ 2754.783386] [] tick_sched_timer+0x4c/0x1b0
> [ 2754.783409] [] ? __run_hrtimer+0xbb/0x2e0
> [ 2754.783432] [] __run_hrtimer+0xbb/0x2e0
> [ 2754.783454] [] ? hrtimer_interrupt+0x94/0x260
> [ 2754.783478] [] ? tick_init_highres+0x20/0x20
> [ 2754.783501] [] hrtimer_interrupt+0x107/0x260
> [ 2754.783526] [] local_apic_timer_interrupt+0x38/0x70
> [ 2754.783552] [] smp_apic_timer_interrupt+0x45/0x60
> [ 2754.783578] [] apic_timer_interrupt+0x6f/0x80

Looks like standard scheduling-clock interrupt above this point.

> [ 2754.783600] [] ? lock_release_holdtime.part.24+0x9d/0x160

If this was an acquisition rather than a release, I would suspect high
lock contention.  Could just be luck of the draw, I suppose.  Or am I
missing something subtle here?
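To make the hold-time-versus-wait-time distinction concrete, here is a
userspace analogy.  This is not the kernel's lockstat code; the thread
names and the 10ms critical section are made up for illustration.  Hold
time is only known at release, while contention shows up as time spent
waiting at acquisition:

	/*
	 * Userspace analogy only, not lockstat: "hold time" is what you
	 * measure at unlock (how long the lock was held), while contention
	 * is what you measure at lock (how long you waited to get it).
	 */
	#include <pthread.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

	static double now_ms(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
	}

	static void *worker(void *name)
	{
		double t0 = now_ms();

		pthread_mutex_lock(&lock);	/* wait ends here: the contention signal */
		double waited = now_ms() - t0;

		double h0 = now_ms();
		usleep(10 * 1000);		/* pretend 10ms of work under the lock */
		double held = now_ms() - h0;	/* the hold-time signal, known at release */
		pthread_mutex_unlock(&lock);

		printf("%s: waited %.2f ms, held %.2f ms\n",
		       (const char *)name, waited, held);
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b;

		pthread_create(&a, NULL, worker, (void *)"A");
		pthread_create(&b, NULL, worker, (void *)"B");
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		return 0;
	}

(Build with -pthread; whichever thread loses the race reports roughly
the other thread's hold time as its own wait time.)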
							Thanx, Paul

> [ 2754.783634] [] ? put_lock_stats.isra.23+0xe/0x30
> [ 2754.783659] [] ? __alloc_pages_nodemask+0x1ac/0xb60
> [ 2754.783684] [] ? alloc_pages_vma+0xee/0x1b0
> [ 2754.783708] [] ? local_clock+0x25/0x30
> [ 2754.783731] [] ? __lock_acquire.isra.31+0x22c/0x9f0
> [ 2754.783756] [] ? get_parent_ip+0x11/0x50
> [ 2754.783779] [] ? lock_release_holdtime.part.24+0x9d/0x160
> [ 2754.784613] [] alloc_pages_vma+0xee/0x1b0
> [ 2754.785452] [] ? do_wp_page+0xca/0x7d0
> [ 2754.786305] [] do_wp_page+0xca/0x7d0
> [ 2754.787140] [] handle_mm_fault+0x6cb/0xe90
> [ 2754.787948] [] ? __do_page_fault+0x140/0x600
> [ 2754.788748] [] __do_page_fault+0x1a4/0x600
> [ 2754.789562] [] ? put_lock_stats.isra.23+0xe/0x30
> [ 2754.790340] [] ? lock_release_holdtime.part.24+0x9d/0x160
> [ 2754.791113] [] ? get_parent_ip+0x11/0x50
> [ 2754.791879] [] ? preempt_count_sub+0x7b/0x100
> [ 2754.792646] [] ? trace_hardirqs_off_thunk+0x3a/0x3f
> [ 2754.793413] [] do_page_fault+0xc/0x10
> [ 2754.794176] [] page_fault+0x22/0x30
> [ 2754.794938] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 2754.795736] 	Tasks blocked on level-0 rcu_node (CPUs 0-7):
> [ 2754.796528] 	0: (1 GPs behind) idle=6ef/140000000000001/0 softirq=247160/247161
> [ 2754.797334] 	Tasks blocked on level-0 rcu_node (CPUs 0-7):
> [ 2754.798153] 	(detected by 3, t=24343 jiffies, g=104086, c=104085, q=0)
> [ 2754.798981] Task dump for CPU 0:
> [ 2754.799797] trinity-c387 R running task 14016 13658 12780 0x00000008
> [ 2754.800630] ffff880222365bc0 0000000000000246 0000000127e77e08 8000000000000865
> [ 2754.801461] ffff8802000000a9 800000008d044865 0000000000000000 ffff8802256d3c70
> [ 2754.802288] ffff880227e77e28 00000000000000a9 0000000000d8eff8 ffff880227e77f58
> [ 2754.803109] Call Trace:
> [ 2754.803929] [] ? __do_page_fault+0x1a4/0x600
> [ 2754.804765] [] ? put_lock_stats.isra.23+0xe/0x30
> [ 2754.805596] [] ? lock_release_holdtime.part.24+0x9d/0x160
> [ 2754.806444] [] ? get_parent_ip+0x11/0x50
> [ 2754.807267] [] ? preempt_count_sub+0x7b/0x100
> [ 2754.808088] [] ? trace_hardirqs_off_thunk+0x3a/0x3f
> [ 2754.808920] [] ? do_page_fault+0xc/0x10
> [ 2754.809742] [] ? page_fault+0x22/0x30
> [ 2771.561356] ------------[ cut here ]------------
> [ 2771.562079] WARNING: CPU: 0 PID: 13696 at kernel/watchdog.c:317 watchdog_overflow_callback+0xdd/0x130()
> [ 2771.562879] Watchdog detected hard LOCKUP on cpu 0
> [ 2771.562895] CPU: 0 PID: 13696 Comm: trinity-c425 Not tainted 3.18.0+ #101
> [ 2771.564490] ffffffff81a66315 00000000fce35109 ffff880244005b88 ffffffff817d317e
> [ 2771.565315] 0000000000110004 ffff880244005be0 ffff880244005bc8 ffffffff81078a01
> [ 2771.566136] 0000000000000000 0000000000000000 0000000000000000 ffff880244005d30
> [ 2771.566954] Call Trace:
> [ 2771.567759] [] dump_stack+0x4f/0x7c
> [ 2771.568584] [] warn_slowpath_common+0x81/0xa0
> [ 2771.569405] [] warn_slowpath_fmt+0x55/0x70
> [ 2771.570253] [] ? restart_watchdog_hrtimer+0x60/0x60
> [ 2771.571074] [] watchdog_overflow_callback+0xdd/0x130
> [ 2771.571894] [] __perf_event_overflow+0xac/0x2a0
> [ 2771.572721] [] ? x86_perf_event_set_period+0xe2/0x150
> [ 2771.573551] [] perf_event_overflow+0x14/0x20
> [ 2771.574378] [] intel_pmu_handle_irq+0x209/0x410
> [ 2771.575210] [] perf_event_nmi_handler+0x2b/0x50
> [ 2771.576040] [] nmi_handle+0xa4/0x1e0
> [ 2771.576868] [] ? nmi_handle+0x5/0x1e0
> [ 2771.577698] [] ? print_context_stack+0xe1/0x100
> [ 2771.578526] [] default_do_nmi+0x7a/0x1d0
> [ 2771.579354] [] do_nmi+0xb8/0xf0
> [ 2771.580206] [] end_repeat_nmi+0x1e/0x2e
> [ 2771.581023] [] ? free_debug_processing+0x157/0x22a
> [ 2771.581836] [] ? free_debug_processing+0x157/0x22a
> [ 2771.582644] [] ? print_context_stack+0xe1/0x100
> [ 2771.583452] [] ? print_context_stack+0xe1/0x100
> [ 2771.584253] [] ? print_context_stack+0xe1/0x100
> [ 2771.585042] <> [] dump_trace+0x140/0x370
> [ 2771.585841] [] ? final_putname+0x26/0x50
> [ 2771.586636] [] save_stack_trace+0x2f/0x50
> [ 2771.587430] [] set_track+0x70/0x140
> [ 2771.588217] [] free_debug_processing+0x157/0x22a
> [ 2771.589015] [] ? put_lock_stats.isra.23+0xe/0x30
> [ 2771.589815] [] __slab_free+0x55/0x320
> [ 2771.590636] [] ? debug_check_no_obj_freed+0x156/0x250
> [ 2771.591442] [] ? mntput+0x24/0x40
> [ 2771.592242] [] kmem_cache_free+0x262/0x280
> [ 2771.593036] [] ? final_putname+0x26/0x50
> [ 2771.593831] [] final_putname+0x26/0x50
> [ 2771.594622] [] putname+0x29/0x40
> [ 2771.595411] [] user_path_at_empty+0x6e/0xc0
> [ 2771.596199] [] ? mntput_no_expire+0x67/0x140
> [ 2771.596986] [] ? mntput_no_expire+0x5/0x140
> [ 2771.597766] [] ? dput+0x56/0x190
> [ 2771.598542] [] user_path_at+0x11/0x20
> [ 2771.599311] [] path_setxattr+0x4c/0xe0
> [ 2771.600097] [] SyS_lsetxattr+0x11/0x20
> [ 2771.600848] [] system_call_fastpath+0x12/0x17
> [ 2771.601598] ---[ end trace 7b78126c55dcb717 ]---
> [ 2771.602404] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 40.989 msecs
> [ 2771.603175] perf interrupt took too long (322423 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
> [ 3471.463812] perf interrupt took too long (319933 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
> [ 4563.539619] perf interrupt took too long (317460 > 10000), lowering kernel.perf_event_max_sample_rate to 12500
> [ 5676.723413] perf interrupt took too long (315015 > 19841), lowering kernel.perf_event_max_sample_rate to 6300
> [ 6800.751151] perf interrupt took too long (312583 > 39062), lowering kernel.perf_event_max_sample_rate to 3200
> [ 8056.882309] perf interrupt took too long (310176 > 78125), lowering kernel.perf_event_max_sample_rate to 1600
> [ 9233.809073] perf interrupt took too long (307790 > 156250), lowering kernel.perf_event_max_sample_rate to 800
> 
> again, the box survived. Next run I'll try undoing Chris' idea of no serial,
> and see if it wedges after the spew. After that, I'll do a no preempt run.
> 
> 	Dave