Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754393AbbHFEP1 (ORCPT ); Thu, 6 Aug 2015 00:15:27 -0400 Received: from arcturus.aphlor.org ([188.246.204.175]:49903 "EHLO arcturus.aphlor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752130AbbHFEP0 (ORCPT ); Thu, 6 Aug 2015 00:15:26 -0400 Date: Thu, 6 Aug 2015 00:15:08 -0400 From: Dave Jones To: Frederic Weisbecker Cc: Sasha Levin , paulmck@linux.vnet.ibm.com, Linux Kernel , Josh Triplett , Peter Zijlstra Subject: Re: 4.2-rc5 rcu stalls. Message-ID: <20150806041504.GA14220@codemonkey.org.uk> Mail-Followup-To: Dave Jones , Frederic Weisbecker , Sasha Levin , paulmck@linux.vnet.ibm.com, Linux Kernel , Josh Triplett , Peter Zijlstra References: <20150803210835.GA4467@codemonkey.org.uk> <20150803213723.GN27280@linux.vnet.ibm.com> <20150803215535.GA13717@codemonkey.org.uk> <20150803220355.GO27280@linux.vnet.ibm.com> <55C0458B.6080003@oracle.com> <20150805001250.GA22259@codemonkey.org.uk> <20150805123757.GA7051@lerouge> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150805123757.GA7051@lerouge> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Score: -2.9 (--) X-Spam-Report: Spam report generated by SpamAssassin on "arcturus.aphlor.org" Content analysis details: (-2.9 points, 5.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- -1.0 ALL_TRUSTED Passed through trusted hosts only via SMTP -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1% [score: 0.0000] X-Authenticated-User: davej@codemonkey.org.uk Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7721 Lines: 122 On Wed, Aug 05, 2015 at 02:37:59PM +0200, Frederic Weisbecker wrote: > On Tue, Aug 04, 2015 at 08:12:50PM -0400, Dave Jones wrote: > > On Tue, Aug 04, 2015 at 12:54:35AM -0400, Sasha Levin wrote: > > > On 08/03/2015 06:03 PM, Paul E. McKenney wrote: > > > >> > Ugh, that doesn't revert cleanly. Got something handy ? > > > > I do not, but perhaps either Sasha or Frederic do. > > > > > > I've attached a revert courtesy of Peter. > > > > Thanks. At first I thought this was doing the trick, but then I hit this again. > > > > > > [23643.545873] INFO: rcu_preempt detected stalls on CPUs/tasks: > > [23643.546031] Tasks blocked on level-0 rcu_node (CPUs 0-3): P31722 > > [23643.546173] (detected by 3, t=65002 jiffies, g=2256887, c=2256886, q=0) > > [23643.546326] trinity-watchdo R running task 14336 31722 31721 0x00080000 > > [23643.546488] ffff8804fcfe7cc8 000000000000ded0 0000000000000002 ffff8804f58bb680 > > [23643.546661] ffff8800ce4951c0 ffff8804fcfe7cb8 ffff8804fcfe8000 ffff8804f6552608 > > [23643.546830] 0000000000000009 ffff8804fcfe7e88 0000000000000009 ffff8804fcfe7ce8 > > [23643.547001] Call Trace: > > [23643.547058] [] preempt_schedule_common+0x22/0x40 > > [23643.547201] [] preempt_schedule+0x1f/0x30 > > [23643.547329] [] ___preempt_schedule+0x12/0x14 > > [23643.547465] [] ? do_send_sig_info+0x5d/0x80 > > [23643.547599] [] ? _raw_spin_unlock_irqrestore+0x42/0x70 > > [23643.547753] [] ? _raw_spin_unlock_irqrestore+0x60/0x70 > > [23643.547910] [] do_send_sig_info+0x5d/0x80 > > [23643.548039] [] group_send_sig_info+0xb2/0x120 > > [23643.548175] [] ? group_send_sig_info+0x5/0x120 > > [23643.548314] [] ? rcu_read_lock_held+0x4f/0x60 > > [23643.548451] [] kill_pid_info+0x7f/0x150 > > [23643.548576] [] ? kill_pid_info+0x20/0x150 > > [23643.548705] [] SYSC_kill+0xf4/0x2b0 > > [23643.548821] [] ? SYSC_kill+0x9d/0x2b0 > > [23643.548942] [] ? trace_hardirqs_on_caller+0x14b/0x1e0 > > [23643.549097] [] ? trace_hardirqs_on+0xd/0x10 > > [23643.549231] [] ? context_tracking_user_exit+0x13/0x20 > > [23643.549387] [] ? syscall_trace_enter_phase1+0xf7/0x150 > > [23643.549540] [] ? trace_hardirqs_on_thunk+0x17/0x19 > > [23643.549687] [] SyS_kill+0xe/0x10 > > [23643.549799] [] entry_SYSCALL_64_fastpath+0x12/0x6f > > If it still happens after Sasha's revert, which basically revert all the offending > patches related to preempt lately, then the reason might be elsewhere. > > How hard was it to reproduce? I see 23000 secs in your dmesg logs which is around 6 hours. Interestingly, it happened again, but only after 7 hours. I've yet to trigger it in a shorter timeframe. Frustrating. [28190.798758] INFO: rcu_preempt detected stalls on CPUs/tasks: [28190.798914] Tasks blocked on level-0 rcu_node (CPUs 0-3): P32189 [28190.799054] (detected by 1, t=65002 jiffies, g=2137396, c=2137395, q=0) [28190.799203] trinity-c224 R running task 13856 32189 31964 0x00080000 [28190.799362] ffff8804f2323da8 ffffffffa67fa4d1 ffff8804fe170000 ffff8804b66db680 [28190.799531] ffff8804fe170000 ffff8804f2323d98 0000000000000000 ffff8804f2324000 [28190.799699] 0000000000000002 0000000000000000 0000000000000000 ffff8804f2323dc8 [28190.799866] Call Trace: [28190.799921] [] ? preempt_schedule_irq+0x41/0xa0 [28190.800058] [] preempt_schedule_irq+0x47/0xa0 [28190.800191] [] retint_kernel+0x1b/0x2d [28190.800312] [] ? lock_acquire+0xd9/0x260 [28190.800438] [] ? __task_pid_nr_ns+0x5/0x190 [28190.800568] [] ? retint_kernel+0x2d/0x2d [28190.800691] [] __task_pid_nr_ns+0x42/0x190 [28190.800820] [] ? __task_pid_nr_ns+0x5/0x190 [28190.800950] [] sys_gettid+0x1b/0x20 [28190.801064] [] entry_SYSCALL_64_fastpath+0x12/0x6f [28190.801208] trinity-c224 R running task 13856 32189 31964 0x00080000 [28190.801365] ffff8804f2323da8 ffffffffa67fa4d1 ffff8804fe170000 ffff8804b66db680 [28190.801533] ffff8804fe170000 ffff8804f2323d98 0000000000000000 ffff8804f2324000 [28190.801702] 0000000000000002 0000000000000000 0000000000000000 ffff8804f2323dc8 [28190.801870] Call Trace: [28190.801923] [] ? preempt_schedule_irq+0x41/0xa0 [28190.802060] [] preempt_schedule_irq+0x47/0xa0 [28190.802193] [] retint_kernel+0x1b/0x2d [28190.802313] [] ? lock_acquire+0xd9/0x260 [28190.802436] [] ? __task_pid_nr_ns+0x5/0x190 [28190.802565] [] ? retint_kernel+0x2d/0x2d [28190.802688] [] __task_pid_nr_ns+0x42/0x190 [28190.802815] [] ? __task_pid_nr_ns+0x5/0x190 [28190.802945] [] sys_gettid+0x1b/0x20 [28190.803058] [] entry_SYSCALL_64_fastpath+0x12/0x6f [29929.492752] INFO: rcu_preempt detected stalls on CPUs/tasks: [29929.492906] Tasks blocked on level-0 rcu_node (CPUs 0-3): P289 [29929.493039] (detected by 0, t=65002 jiffies, g=2141006, c=2141005, q=0) [29929.493188] systemd-journal R running task 12464 289 1 0x00080000 [29929.493347] ffff8804ff2bbae8 ffffffffa67fa4d1 ffff880501f81b40 ffff880503d43680 [29929.493515] ffff880501f81b40 ffff8804ff2bbad8 0000000000000000 ffff8804ff2bc000 [29929.493683] ffff8800d3e9f118 ffff8800d3e9eb40 0000000000000056 ffff8804ff2bbb08 [29929.493853] Call Trace: [29929.493909] [] ? preempt_schedule_irq+0x41/0xa0 [29929.494046] [] preempt_schedule_irq+0x47/0xa0 [29929.494181] [] retint_kernel+0x1b/0x2d [29929.494304] [] ? __schedule+0x439/0xb20 [29929.494430] [] ? ___preempt_schedule+0x12/0x14 [29929.494568] [] ? ___preempt_schedule+0x12/0x14 [29929.494709] [] ? sock_def_readable+0x161/0x190 [29929.501118] [] ? rcu_is_watching+0x38/0x60 [29929.507566] [] ? rcu_is_watching+0x51/0x60 [29929.513987] [] sock_def_readable+0x161/0x190 [29929.520344] [] ? sock_def_readable+0x5/0x190 [29929.526678] [] ? _raw_spin_unlock+0x35/0x60 [29929.532988] [] unix_dgram_sendmsg+0x4f9/0x570 [29929.539184] [] ___sys_sendmsg+0x30b/0x320 [29929.545270] [] ? put_lock_stats.isra.29+0xe/0x30 [29929.551331] [] ? debug_smp_processor_id+0x17/0x20 [29929.557285] [] ? put_lock_stats.isra.29+0xe/0x30 [29929.563204] [] ? get_parent_ip+0x11/0x50 [29929.569047] [] ? preempt_count_sub+0xa3/0xf0 [29929.574796] [] ? __fget_light+0x66/0x90 [29929.580555] [] ? context_tracking_exit+0x43/0x240 [29929.586253] [] __sys_sendmsg+0x42/0x80 [29929.591843] [] SyS_sendmsg+0x12/0x20 [29929.597385] [] entry_SYSCALL_64_fastpath+0x12/0x6f -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/