Date: Fri, 18 Mar 2016 14:00:11 -0700
From: Josh Triplett
To: "Paul E. McKenney"
Cc: Ross Green, Mathieu Desnoyers, John Stultz, Thomas Gleixner,
	Peter Zijlstra, lkml, Ingo Molnar, Lai Jiangshan,
	dipankar@in.ibm.com, Andrew Morton, rostedt, David Howells,
	Eric Dumazet, Darren Hart, Frédéric Weisbecker, Oleg Nesterov,
	pranith kumar, jacob.jun.pan@linux.intel.com
Subject: Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17
Message-ID: <20160318210011.GA571@cloud>
In-Reply-To: <20160226005638.GV3522@linux.vnet.ibm.com>
References: <20160219173343.GB3522@linux.vnet.ibm.com>
	<20160220063248.GE3522@linux.vnet.ibm.com>
	<686568926.5862.1456259651418.JavaMail.zimbra@efficios.com>
	<20160223205522.GT3522@linux.vnet.ibm.com>
	<20160226005638.GV3522@linux.vnet.ibm.com>

On Thu, Feb 25, 2016 at 04:56:38PM -0800, Paul E. McKenney wrote:
> On Thu, Feb 25, 2016 at 04:13:11PM +1100, Ross Green wrote:
> > On Wed, Feb 24, 2016 at 8:28 AM, Ross Green wrote:
> > > On Wed, Feb 24, 2016 at 7:55 AM, Paul E. McKenney wrote:
>
> [ . . . ]
>
> > > >> Still working on getting decent traces...
>
> And I might have succeeded, see below.
>
> > > G'day all,
> > >
> > > Here is another dmesg output for 4.5-rc5 showing another rcu_preempt
> > > stall. This one appeared after only a day of running.
> > > CONFIG_DEBUG_TIMING is turned on, but I can't see any output from it.
> > >
> > > Again, testing as before: boot, run a series of small benchmarks,
> > > then just let the system sit and idle away.
> > >
> > > I notice in the stack trace there is mention of hrtimer_run_queues
> > > and hrtimer_interrupt.
> > >
> > > Anyway, I'll leave this for a few more eyes to look at.
> > >
> > > Open to any other suggestions of things to test.
> > >
> > > Regards,
> > >
> > > Ross Green
> >
> > G'day Paul,
> >
> > I left the pandaboard running and captured another stall.
> >
> > The attachment is the dmesg output.
> >
> > Again there is no apparent output from CONFIG_DEBUG_TIMING, so I
> > assume nothing is happening there.
>
> I agree, it looks like this is not due to time skew.
>
> > I just saw the updates for the 4.6 RCU code.
> > Is the patch in [PATCH tip/core/rcu 04/13] relevant here?
>
> I doubt that it will help, but you never know.
>
> > Do you want me to try the new patch set with this configuration?
>
> Even better would be to try Daniel Wagner's swait patchset. I have
> attached it in UNIX mbox format, or you can get it from the -tip tree.
>
> And I -finally- got some tracing that -might- be useful. The dmesg,
> all 67MB of it, is here:
>
>	http://www.rdrop.com/~paulmck/submission/console.2016.02.23a.log
>
> This failure mode is less likely to happen, and looks a bit different
> from the ones I was seeing before enabling tracing. Back then, an
> additional wakeup would actually wake the task up. In contrast, with
> tracing enabled, the RCU grace-period kthread goes into "teenager
> mode", refusing to wake up despite repeated attempts. However, this
> might be a side effect of the ftrace dump.
>
> On line 525,132, we see that the rcu_preempt grace-period kthread has
> been starved for 1,188,154 jiffies, or about 20 minutes. This seems
> unlikely... The kthread is waiting with no more than a three-jiffy
> timeout ("RCU_GP_WAIT_FQS(3)") and is in TASK_INTERRUPTIBLE state
> ("0x1").
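For anyone trying to follow along: if I'm reading the v4.5-era
kernel/rcu/tree.c correctly, the wait in question looks roughly like the
sketch below. This is simplified and partly from memory, so treat the
names and details as approximate rather than authoritative:

	/*
	 * Rough sketch of the grace-period kthread's force-quiescent-state
	 * wait, modeled on v4.5-era kernel/rcu/tree.c (the real loop has
	 * more states and bookkeeping around it).
	 */
	rsp->gp_state = RCU_GP_WAIT_FQS;   /* the "RCU_GP_WAIT_FQS(3)" above */
	ret = wait_event_interruptible_timeout(rsp->gp_wq,
					       rcu_gp_fqs_check_wake(rsp, &gf),
					       j);  /* j is at most 3 jiffies */

If that is right, the kthread sleeps in TASK_INTERRUPTIBLE (the "0x1"
state above), and even if every wakeup were lost, the timeout should fire
within a few jiffies. Assuming HZ=1000 on the test box, the reported
1,188,154 jiffies works out to about 1,188 seconds, matching the "about
20 minutes" figure, so either the timer never fired or the kthread never
got a CPU after being made runnable.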
We're seeing a similar stall (~60 seconds) on an x86 development system
here. Any luck tracking down the cause of this? If not, any suggestions
for traces that might be helpful?

- Josh Triplett