Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756777AbcCUQWi (ORCPT ); Mon, 21 Mar 2016 12:22:38 -0400
Received: from mga14.intel.com ([192.55.52.115]:27951 "EHLO mga14.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1756536AbcCUQWf (ORCPT ); Mon, 21 Mar 2016 12:22:35 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.24,372,1455004800"; d="scan'208";a="938480894"
Date: Mon, 21 Mar 2016 09:22:30 -0700
From: Jacob Pan
To: paulmck@linux.vnet.ibm.com
Cc: Josh Triplett, Ross Green, Mathieu Desnoyers, John Stultz,
	Thomas Gleixner, Peter Zijlstra, lkml, Ingo Molnar, Lai Jiangshan,
	dipankar@in.ibm.com, Andrew Morton, rostedt, David Howells,
	Eric Dumazet, Darren Hart, Frédéric Weisbecker, Oleg Nesterov,
	pranith kumar, jacob.jun.pan@linux.intel.com, "Chatre, Reinette"
Subject: Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17
Message-ID: <20160321092230.75f23fa9@yairi>
In-Reply-To: <20160318235641.GH4287@linux.vnet.ibm.com>
References: <20160220063248.GE3522@linux.vnet.ibm.com>
	<686568926.5862.1456259651418.JavaMail.zimbra@efficios.com>
	<20160223205522.GT3522@linux.vnet.ibm.com>
	<20160226005638.GV3522@linux.vnet.ibm.com>
	<20160318210011.GA571@cloud>
	<20160318235641.GH4287@linux.vnet.ibm.com>
Organization: OTC
X-Mailer: Claws Mail 3.9.3 (GTK+ 2.24.23; x86_64-pc-linux-gnu)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 3782
Lines: 99

On Fri, 18 Mar 2016 16:56:41 -0700
"Paul E. McKenney" wrote:

> On Fri, Mar 18, 2016 at 02:00:11PM -0700, Josh Triplett wrote:
> > On Thu, Feb 25, 2016 at 04:56:38PM -0800, Paul E. McKenney wrote:
> > > On Thu, Feb 25, 2016 at 04:13:11PM +1100, Ross Green wrote:
> > > > On Wed, Feb 24, 2016 at 8:28 AM, Ross Green
> > > > wrote:
> > > > > On Wed, Feb 24, 2016 at 7:55 AM, Paul E. McKenney
> > > > > wrote:
> > >
> > > [ . . . ]
> > >
> > > > >> Still working on getting decent traces...
> > >
> > > And I might have succeeded, see below.
> > >
> > > > >>
> > > > >> Thanx,
> > > > >> Paul
> > > > >>
> > > > >
> > > > > G'day all,
> > > > >
> > > > > Here is another dmesg output for 4.5-rc5 showing another
> > > > > rcu_preempt stall. This one appeared after only a day of
> > > > > running. CONFIG_DEBUG_TIMING is turned on, but can't see any
> > > > > output that shows from this.
> > > > >
> > > > > Again testing as before,
> > > > >
> > > > > Boot, run a series of small benchmarks, then just let the
> > > > > system be and idle away.
> > > > >
> > > > > I notice in the stack trace there is mention of
> > > > > hrtimer_run_queues and hrtimer_interrupt.
> > > > >
> > > > > Anyway, leave this for a few more eyes to look at.
> > > > >
> > > > > Open to any other suggestions of things to test.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Ross Green
> > > >
> > > >
> > > > G'day Paul,
> > > >
> > > > I left the pandaboard running and captured another stall.
> > > >
> > > > the attachment is the dmesg output.
> > > >
> > > > Again there is no apparent output from any CONFIG_DEBUG_TIMING
> > > > so I assume there is nothing happening there.
> > >
> > > I agree, looks like this is not due to time skew.
> > >
> > > > I just saw the updates for 4.6 RCU code.
> > > > Is the patch in [PATCH tip/core/rcu 04/13] valid here?
> > >
> > > I doubt that it will help, but you never know.
> > >
> > > > do you want me try the new patch set with this configuration?
> > >
> > > Even better would be to try Daniel Wagner's swait patchset. I
> > > have attached them in UNIX mbox format, or you can get them from
> > > the -tip tree.
> > >
> > > And I -finally- got some tracing that -might- be useful. The
> > > dmesg, all 67MB of it, is here:
> > >
> > > http://www.rdrop.com/~paulmck/submission/console.2016.02.23a.log
> > >
> > > This failure mode is less likely to happen, and looks a bit
> > > different than the ones that I was seeing before enabling
> > > tracing. Then, an additional wakeup would actually wake the task
> > > up. In contrast, with tracing enabled, the RCU grace-period
> > > kthread goes into "teenager mode", refusing to wake up despite
> > > repeated attempts. However, this might be a side-effect of the
> > > ftrace dump.
> > >
> > > On line 525,132, we see that the rcu_preempt grace-period kthread
> > > has been starved for 1,188,154 jiffies, or about 20 minutes.
> > > This seems unlikely... The kthread is waiting for no more than a
> > > three-jiffy timeout ("RCU_GP_WAIT_FQS(3)") and is in
> > > TASK_INTERRUPTIBLE state ("0x1").
> >
> > We're seeing a similar stall (~60 seconds) on an x86 development
> > system here. Any luck tracking down the cause of this? If not, any
> > suggestions for traces that might be helpful?
>
> The dmesg containing the stall, the kernel version, and the .config
> would be helpful! Working on a torture test specific to this bug...
>
> 							Thanx, Paul
>

+Reinette, she has the system that can reproduce the issue. I believe
she is having some other problems with it at the moment. But the .config
should be available. Version is v4.5.