From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Date: Mon, 28 Mar 2016 06:29:32 -0700
To: Peter Zijlstra
Cc: Jacob Pan, Josh Triplett, Ross Green, Mathieu Desnoyers, John Stultz,
	Thomas Gleixner, lkml, Ingo Molnar, Lai Jiangshan, dipankar@in.ibm.com,
	Andrew Morton, rostedt, David Howells, Eric Dumazet, Darren Hart,
	Frédéric Weisbecker, Oleg Nesterov, pranith kumar, "Chatre, Reinette"
Subject: Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17
Message-ID: <20160328132932.GF4287@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
In-Reply-To: <20160328062851.GE6344@twins.programming.kicks-ass.net>

On Mon, Mar 28, 2016 at 08:28:51AM +0200, Peter Zijlstra wrote:
> On Sun, Mar 27, 2016 at 02:09:14PM -0700, Paul E. McKenney wrote:
> > > Does that system have MONITOR/MWAIT errata?
> >
> > On the off-chance that this question was also directed at me,
>
> Hehe, it wasn't, however, since we're here..
>
> > here is
> > what I am running on. I am running in a qemu/KVM virtual machine, in
> > case that matters.
>
> Have you actually tried on real proper hardware? Does it still reproduce
> there?

Ross has, but I have not, given that I have a shared system on the one
hand and a single-socket (four core, eight hardware thread) laptop on
the other that has even longer reproduction times. The repeat-by is as
follows:

o	Build a kernel with the following Kconfigs:

	CONFIG_SMP=y
	CONFIG_NR_CPUS=16
	CONFIG_PREEMPT_NONE=n
	CONFIG_PREEMPT_VOLUNTARY=n
	CONFIG_PREEMPT=y	# This should result in CONFIG_PREEMPT_RCU=y
	CONFIG_HZ_PERIODIC=y
	CONFIG_NO_HZ_IDLE=n
	CONFIG_NO_HZ_FULL=n
	CONFIG_RCU_TRACE=y
	CONFIG_HOTPLUG_CPU=y
	CONFIG_RCU_FANOUT=2
	CONFIG_RCU_FANOUT_LEAF=2
	CONFIG_RCU_NOCB_CPU=n
	CONFIG_DEBUG_LOCK_ALLOC=n
	CONFIG_RCU_BOOST=y
	CONFIG_RCU_KTHREAD_PRIO=2
	CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
	CONFIG_RCU_EXPERT=y
	CONFIG_RCU_TORTURE_TEST=y
	CONFIG_PRINTK_TIME=y
	CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP=y
	CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y
	CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT=y

	If desired, you can instead build with CONFIG_RCU_TORTURE_TEST=m
	and modprobe/insmod the module manually.

o	Find a two-socket x86 system or larger, with at least 16 CPUs.

o	Boot the kernel with the following kernel boot parameters:

	rcutorture.onoff_interval=1 rcutorture.onoff_holdoff=30

	The onoff_holdoff is only needed for CONFIG_RCU_TORTURE_TEST=y.
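	(If you go the CONFIG_RCU_TORTURE_TEST=m route, the rough
	equivalent, untested on my end and assuming the usual
	module-parameter syntax, would be something like this:

		modprobe rcutorture onoff_interval=1

	with no onoff_holdoff required in that case.)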
	When manually setting up the module, you get the holdoff for
	free, courtesy of human timescales.

In the absence of instrumentation, I get failures usually within a
couple of hours, though sometimes much longer. With instrumentation,
the sky appears to be the limit. :-/

Ross is running on bare metal with no CPU hotplug, so perhaps his setup
is of more immediate interest. He is seeing the same symptoms that I
am, namely a task being repeatedly awakened without actually coming out
of TASK_INTERRUPTIBLE state, let alone running.

As you pointed out earlier, he cannot be seeing the same bug that my
crude patch suppresses, but given that I still see a few failures with
that crude patch, it is quite possible that there is still a common bug.

							Thanx, Paul
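P.S.  For reference, the handshake that appears to be misbehaving is the
      usual set_current_state()/schedule()/wake_up_process() pattern.
      A minimal sketch follows; example_kthread(), work_available(),
      do_work(), and example_task are all made up purely for
      illustration and are not the actual rcutorture code:

	#include <linux/kthread.h>
	#include <linux/sched.h>

	/* Waiter side, e.g. one of the per-test kthreads. */
	static int example_kthread(void *arg)
	{
		while (!kthread_should_stop()) {
			set_current_state(TASK_INTERRUPTIBLE);
			if (!work_available())	/* Hypothetical condition. */
				schedule();	/* Sleep until woken. */
			__set_current_state(TASK_RUNNING);
			do_work();		/* Hypothetical work. */
		}
		return 0;
	}

	/* Waker side. */
	wake_up_process(example_task);	/* Should move the task to TASK_RUNNING. */

      The failure mode is that the waker invokes wake_up_process()
      repeatedly, yet the sleeping task remains in TASK_INTERRUPTIBLE
      and never actually runs.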