Date: Sun, 3 Apr 2016 01:18:53 -0700
From: "Paul E. McKenney"
To: Peter Zijlstra
Cc: Mathieu Desnoyers, "Chatre, Reinette", Jacob Pan, Josh Triplett,
	Ross Green, John Stultz, Thomas Gleixner, lkml, Ingo Molnar,
	Lai Jiangshan, dipankar@in.ibm.com, Andrew Morton, rostedt,
	David Howells, Eric Dumazet, Darren Hart, Frédéric Weisbecker,
	Oleg Nesterov, pranith kumar
Subject: Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17
Message-ID: <20160403081853.GA32220@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <20160327154018.GA4287@linux.vnet.ibm.com>
	<20160327204559.GV6356@twins.programming.kicks-ass.net>
	<20160327210641.GB4287@linux.vnet.ibm.com>
	<20160328062547.GD6344@twins.programming.kicks-ass.net>
	<20160328130841.GE4287@linux.vnet.ibm.com>
	<20160329002518.GA13058@linux.vnet.ibm.com>
	<20160329002814.GB13058@linux.vnet.ibm.com>
	<20160329134908.GA27588@linux.vnet.ibm.com>
	<20160330145547.GA3929@linux.vnet.ibm.com>
	<20160331154255.GA22915@linux.vnet.ibm.com>
In-Reply-To: <20160331154255.GA22915@linux.vnet.ibm.com>

On Thu, Mar 31, 2016 at 08:42:55AM -0700, Paul E.
McKenney wrote:
> On Wed, Mar 30, 2016 at 07:55:47AM -0700, Paul E. McKenney wrote:
> > On Tue, Mar 29, 2016 at 06:49:08AM -0700, Paul E. McKenney wrote:
> > > On Mon, Mar 28, 2016 at 05:28:14PM -0700, Paul E. McKenney wrote:
> > > > On Mon, Mar 28, 2016 at 05:25:18PM -0700, Paul E. McKenney wrote:
> > > > > On Mon, Mar 28, 2016 at 06:08:41AM -0700, Paul E. McKenney wrote:
> > > > > > On Mon, Mar 28, 2016 at 08:25:47AM +0200, Peter Zijlstra wrote:
> > > > > > > On Sun, Mar 27, 2016 at 02:06:41PM -0700, Paul E. McKenney wrote:
> > > > >
> > > > > [ . . . ]
> > > > >
> > > > > > > > OK, so I should instrument migration_call() if I get the repro rate up?
> > > > > > >
> > > > > > > Can do, maybe try the below first. (yes I know how long it all takes :/)
> > > > > >
> > > > > > OK, will run this today, then run calibration for last night's run this
> > > > > > evening.
> > >
> > > And of 18 two-hour runs, there were five failures, or about 28%.
> > > That said, I don't have even one significant digit on the failure rate,
> > > as 5 of 18 is within the 95% confidence limits for a failure probability
> > > as low as 12.5% and as high as 47%.
> >
> > And after last night's run, this is narrowed down to between 23% and 38%,
> > which is close enough. Average is 30%, 18 failures in 60 runs.
> >
> > Next step is to test Peter's patch some more. Might take a couple of
> > night's worth of runs to get statistical significance. After which
> > it will be time to rebase to 4.6-rc1.
>
> And the first night was not so good: 6 failures out of 24 runs. Adding
> this to the 1-of-10 earlier gets 7 failures of 34. Here are how things
> stack up given the range of base failure estimates:
>
>     Low 95% bound of 23%:       84% confidence.
>
>     Actual measurement of 30%:  92% confidence.
>
>     High 95% bound of 38%:      98% confidence.
>
> So there is still some chance that Peter's patch is helping. I will
> run for one more evening, after which it will be time to move forward
> to 4.6-rc1.
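
For anyone wanting to redo this sort of arithmetic, here is a minimal sketch. The quoted messages do not say which statistical method produced their bounds and confidence figures, so this assumes a Wilson score interval for the failure rate and an exact binomial tail for "how likely are 7 or fewer failures in 34 runs at a given base rate". The function names are hypothetical, and the output need not match the figures quoted above.

```python
from math import comb, sqrt


def wilson_interval(failures, runs, z=1.96):
    """Wilson score interval for a failure rate (z=1.96 is roughly 95%).

    Assumption: this is a standard interval, not necessarily the one
    used to produce the bounds in the quoted messages.
    """
    p_hat = failures / runs
    denom = 1 + z * z / runs
    center = (p_hat + z * z / (2 * runs)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / runs
                    + z * z / (4 * runs * runs)) / denom
    return center - half, center + half


def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


if __name__ == "__main__":
    # Base failure rate: 5 of 18 runs, then 18 of 60 runs.
    for failures, runs in [(5, 18), (18, 60)]:
        low, high = wilson_interval(failures, runs)
        print(f"{failures}/{runs}: {low:.3f} .. {high:.3f}")

    # Patched kernel: 7 failures in 34 runs, against each base-rate estimate.
    for base in (0.23, 0.30, 0.38):
        tail = binom_cdf(7, 34, base)
        print(f"base {base:.2f}: P(<=7 failures in 34 runs) = {tail:.3f}")
```

A small tail probability means that 7-of-34 would be unlikely if the patch changed nothing. The exact sum is used instead of a normal approximation because the run counts are small.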

And no luck reducing bounds. However, moving to 4.6-rc1 did get some of
the trace_printk() calls to print. The ftrace_dump()s resulted in RCU CPU
stall warnings, and the dumps were truncated due to test timeouts in my
scripting. (I need to make my scripts more patient when they see an
ftrace dump in progress, I guess.)

Here are the results:

http://www2.rdrop.com/users/paulmck/submission/TREE03.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.1.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.2.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.3.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.4.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.5.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.6.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.7.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.8.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.9.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.11.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.12.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.13.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.14.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.15.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.16.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.17.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.18.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.19.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.20.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.21.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.22.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.23.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.24.console.log.tgz

The config is here:

http://www2.rdrop.com/users/paulmck/submission/config.tgz

More runs to measure 4.6-rc1 base error rate...

Thanx, Paul