Date: Fri, 18 Mar 2016 16:56:41 -0700
From: "Paul E. McKenney"
To: Josh Triplett
Cc: Ross Green, Mathieu Desnoyers, John Stultz, Thomas Gleixner,
    Peter Zijlstra, lkml, Ingo Molnar, Lai Jiangshan, dipankar@in.ibm.com,
    Andrew Morton, rostedt, David Howells, Eric Dumazet, Darren Hart,
    Frédéric Weisbecker, Oleg Nesterov, pranith kumar,
    jacob.jun.pan@linux.intel.com
Subject: Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17
Message-ID: <20160318235641.GH4287@linux.vnet.ibm.com>
In-Reply-To: <20160318210011.GA571@cloud>

On Fri, Mar 18, 2016 at 02:00:11PM -0700, Josh Triplett wrote:
> On Thu, Feb 25, 2016 at 04:56:38PM -0800, Paul E. McKenney wrote:
> > On Thu, Feb 25, 2016 at 04:13:11PM +1100, Ross Green wrote:
> > > On Wed, Feb 24, 2016 at 8:28 AM, Ross Green wrote:
> > > > On Wed, Feb 24, 2016 at 7:55 AM, Paul E. McKenney
> > > > wrote:
> >
> > [ . . . ]
> >
> > > >> Still working on getting decent traces...
> >
> > And I might have succeeded, see below.
> >
> > > >>
> > > >> Thanx, Paul
> > > >>
> > > >
> > > > G'day all,
> > > >
> > > > Here is another dmesg output for 4.5-rc5 showing another rcu_preempt stall.
> > > > This one appeared after only a day of running. CONFIG_DEBUG_TIMING is
> > > > turned on, but can't see any output that shows from this.
> > > >
> > > > Again testing as before,
> > > >
> > > > Boot, run a series of small benchmarks, then just let the system be
> > > > and idle away.
> > > >
> > > > I notice in the stack trace there is mention of hrtimer_run_queues and
> > > > hrtimer_interrupt.
> > > >
> > > > Anyway, leave this for a few more eyes to look at.
> > > >
> > > > Open to any other suggestions of things to test.
> > > >
> > > > Regards,
> > > >
> > > > Ross Green
> > >
> > >
> > > G'day Paul,
> > >
> > > I left the pandaboard running and captured another stall.
> > >
> > > the attachment is the dmesg output.
> > >
> > > Again there is no apparent output from any CONFIG_DEBUG_TIMING so I
> > > assume there is nothing happening there.
> >
> > I agree, looks like this is not due to time skew.
> >
> > > I just saw the updates for 4.6 RCU code.
> > > Is the patch in [PATCH tip/core/rcu 04/13] valid here?
> >
> > I doubt that it will help, but you never know.
> >
> > > do you want me try the new patch set with this configuration?
> >
> > Even better would be to try Daniel Wagner's swait patchset. I have
> > attached them in UNIX mbox format, or you can get them from the
> > -tip tree.
> >
> > And I -finally- got some tracing that -might- be useful. The dmesg, all
> > 67MB of it, is here:
> >
> > http://www.rdrop.com/~paulmck/submission/console.2016.02.23a.log
> >
> > This failure mode is less likely to happen, and looks a bit different
> > than the ones that I was seeing before enabling tracing. Then, an
> > additional wakeup would actually wake the task up. In contrast, with
> > tracing enabled, the RCU grace-period kthread goes into "teenager mode",
> > refusing to wake up despite repeated attempts. However, this might
> > be a side-effect of the ftrace dump.
> >
> > On line 525,132, we see that the rcu_preempt grace-period kthread has
> > been starved for 1,188,154 jiffies, or about 20 minutes. This seems
> > unlikely... The kthread is waiting for no more than a three-jiffy
> > timeout ("RCU_GP_WAIT_FQS(3)") and is in TASK_INTERRUPTIBLE state
> > ("0x1").
>
> We're seeing a similar stall (~60 seconds) on an x86 development system
> here. Any luck tracking down the cause of this? If not, any
> suggestions for traces that might be helpful?

The dmesg containing the stall, the kernel version, and the .config would
be helpful! Working on a torture test specific to this bug...

Thanx, Paul