Date: Mon, 28 Mar 2016 08:56:35 -0700
From: "Paul E. McKenney"
To: Mathieu Desnoyers
Cc: Thomas Gleixner, Peter Zijlstra, Jacob Pan, Josh Triplett,
    Ross Green, John Stultz, lkml, Ingo Molnar, Lai Jiangshan,
    dipankar@in.ibm.com, Andrew Morton, rostedt, David Howells,
    Eric Dumazet, Darren Hart, Frédéric Weisbecker, Oleg Nesterov,
    pranith kumar, "Chatre, Reinette"
Subject: Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17
Message-ID: <20160328155635.GJ4287@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <20160223205522.GT3522@linux.vnet.ibm.com>
    <20160318210011.GA571@cloud>
    <20160318235641.GH4287@linux.vnet.ibm.com>
    <20160321092230.75f23fa9@yairi>
    <20160327205439.GY6356@twins.programming.kicks-ass.net>
    <20160327210914.GD4287@linux.vnet.ibm.com>
    <20160328062851.GE6344@twins.programming.kicks-ass.net>
    <20160328132932.GF4287@linux.vnet.ibm.com>
    <24778627.37842.1459177656260.JavaMail.zimbra@efficios.com>
In-Reply-To: <24778627.37842.1459177656260.JavaMail.zimbra@efficios.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Mar 28, 2016 at 03:07:36PM +0000, Mathieu Desnoyers wrote:
> ----- On Mar 28, 2016, at 9:29 AM, Paul E.
McKenney paulmck@linux.vnet.ibm.com wrote:
>
> > On Mon, Mar 28, 2016 at 08:28:51AM +0200, Peter Zijlstra wrote:
> >> On Sun, Mar 27, 2016 at 02:09:14PM -0700, Paul E. McKenney wrote:
> >>
> >> > > Does that system have MONITOR/MWAIT errata?
> >>
> >> > On the off-chance that this question was also directed at me,
> >>
> >> Hehe, it wasn't, however, since we're here..
> >>
> >> > here is
> >> > what I am running on. I am running in a qemu/KVM virtual machine, in
> >> > case that matters.
> >>
> >> Have you actually tried on real proper hardware? Does it still reproduce
> >> there?
> >
> > Ross has, but I have not, given that I have a shared system on the one
> > hand and a single-socket (four core, eight hardware thread) laptop on
> > the other that has even longer reproduction times. The repeat-by is
> > as follows:
> >
> > o  Build a kernel with the following Kconfigs:
> >
> >    CONFIG_SMP=y
> >    CONFIG_NR_CPUS=16
> >    CONFIG_PREEMPT_NONE=n
> >    CONFIG_PREEMPT_VOLUNTARY=n
> >    CONFIG_PREEMPT=y
> >    # This should result in CONFIG_PREEMPT_RCU=y
> >    CONFIG_HZ_PERIODIC=y
> >    CONFIG_NO_HZ_IDLE=n
> >    CONFIG_NO_HZ_FULL=n
> >    CONFIG_RCU_TRACE=y
> >    CONFIG_HOTPLUG_CPU=y
> >    CONFIG_RCU_FANOUT=2
> >    CONFIG_RCU_FANOUT_LEAF=2
> >    CONFIG_RCU_NOCB_CPU=n
> >    CONFIG_DEBUG_LOCK_ALLOC=n
> >    CONFIG_RCU_BOOST=y
> >    CONFIG_RCU_KTHREAD_PRIO=2
> >    CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> >    CONFIG_RCU_EXPERT=y
> >    CONFIG_RCU_TORTURE_TEST=y
> >    CONFIG_PRINTK_TIME=y
> >    CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP=y
> >    CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y
> >    CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT=y
> >
> >    If desired, you can instead build with CONFIG_RCU_TORTURE_TEST=m
> >    and modprobe/insmod the module manually.
> >
> > o  Find a two-socket x86 system or larger, with at least 16 CPUs.
> >
> > o  Boot the kernel with the following kernel boot parameters:
> >
> >    rcutorture.onoff_interval=1 rcutorture.onoff_holdoff=30
> >
> >    The onoff_holdoff is only needed for CONFIG_RCU_TORTURE_TEST=y.
> >    When manually setting up the module, you get the holdoff for
> >    free, courtesy of human timescales.
> >
> > In the absence of instrumentation, I get failures usually within a
> > couple of hours, though sometimes much longer. With instrumentation,
> > the sky appears to be the limit. :-/
> >
> > Ross is running on bare metal with no CPU hotplug, so perhaps his setup
> > is of more immediate interest. He is seeing the same symptoms that I am,
> > namely a task being repeatedly awakened without actually coming out of
> > TASK_INTERRUPTIBLE state, let alone running. As you pointed out earlier,
> > he cannot be seeing the same bug that my crude patch suppresses, but
> > given that I still see a few failures with that crude patch, it is quite
> > possible that there is still a common bug.
>
> With respect to bare metal vs KVM guest, I've reported an issue with
> inaccurate detection of TSC as being an unreliable time source on a
> KVM guest. The basic setup is to overcommit the CPU use across the
> entire host, thus leading to preemption of the guest. The guest TSC
> watchdog then falsely assume that TSC is unreliable, because it gets
> preempted for a long time (e.g. 0.5 second) between reading the HPET
> and the TSC.
>
> Ref. http://lkml.iu.edu/hypermail/linux/kernel/1509.1/00379.html
>
> I'm wondering if what Paul is observing in the KVM setup might be
> caused by long preemption by the host. One way to stress test this
> is to run parallel kernel builds on the host (or in another guest)
> while the guest is running, thus over-committing the CPU use.
>
> Thoughts ?

If I run NO_HZ_FULL, I do get warnings about unstable timesources.
And certainly guest VCPUs can be preempted. However, if they were
preempted for the lengths of time I am seeing, I should also see
softlockup warnings on the host, which I do not see. That said,
perhaps I should cobble together something to force short repeated
preemptions at the host level.
Maybe that would get the reproduction rate sufficiently high to enable
less-dainty debugging.

							Thanx, Paul
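
A minimal sketch of what forcing short repeated preemptions at the host
level could look like, assuming the guest runs as a single qemu process
whose PID is known. Nothing below comes from the thread: the function name
force_preemptions, its parameters, and the use of SIGSTOP/SIGCONT as a
stand-in for scheduler preemption are all assumptions. Stopping the whole
qemu process freezes every VCPU thread at once, which is cruder than the
host scheduler preempting individual VCPUs.

```python
import os
import signal
import time


def force_preemptions(pid, stop_s=0.05, run_s=0.2, count=100,
                      kill=os.kill, sleep=time.sleep):
    """Freeze and resume process `pid` `count` times; return cycles completed.

    Hypothetical helper. `kill` and `sleep` are injectable so the loop
    can be exercised without touching a real process.
    """
    done = 0
    try:
        for _ in range(count):
            kill(pid, signal.SIGSTOP)   # every thread of pid stops running
            sleep(stop_s)               # length of the simulated preemption
            kill(pid, signal.SIGCONT)   # let the guest run again
            sleep(run_s)                # guest runs normally between stalls
            done += 1
    finally:
        # Safety net (an assumption about desired behavior): never leave
        # the guest frozen after an error or Ctrl-C. Ignore the case
        # where the process has already exited.
        try:
            kill(pid, signal.SIGCONT)
        except ProcessLookupError:
            pass
    return done
```

Called as force_preemptions(qemu_pid, stop_s=0.05, run_s=0.2, count=1000)
by a user allowed to signal the qemu process, this would stall the guest
for roughly 50 ms out of every 250 ms. Whether that raises the
reproduction rate is untested.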