Date: Sun, 27 Mar 2016 08:40:18 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Chatre, Reinette" <reinette.chatre@intel.com>,
        Jacob Pan <jacob.jun.pan@linux.intel.com>,
        Josh Triplett <josh@joshtriplett.org>, Ross Green <rgkernel@gmail.com>,
        John Stultz <john.stultz@linaro.org>,
        Thomas Gleixner <tglx@linutronix.de>,
        Peter Zijlstra <peterz@infradead.org>,
        lkml <linux-kernel@vger.kernel.org>, Ingo Molnar <mingo@kernel.org>,
        Lai Jiangshan <jiangshanlai@gmail.com>, dipankar@in.ibm.com,
        Andrew Morton <akpm@linux-foundation.org>,
        rostedt <rostedt@goodmis.org>, David Howells <dhowells@redhat.com>,
        Eric Dumazet <edumazet@google.com>,
        Darren Hart <dvhart@linux.intel.com>,
        =?iso-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com>,
        Oleg Nesterov <oleg@redhat.com>, pranith kumar <bobby.prani@gmail.com>
Subject: Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17
Message-ID: <20160327154018.GA4287@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <20160318235641.GH4287@linux.vnet.ibm.com>
 <0D818C7A2259ED42912C1E04120FDE26712E676E@ORSMSX111.amr.corp.intel.com>
 <20160325214623.GR4287@linux.vnet.ibm.com>
 <1370753660.36931.1458995371427.JavaMail.zimbra@efficios.com>
 <20160326152816.GW4287@linux.vnet.ibm.com>
 <20160326184940.GA23851@linux.vnet.ibm.com>
 <706246733.37102.1459030977316.JavaMail.zimbra@efficios.com>
 <20160327013456.GX4287@linux.vnet.ibm.com>
 <702204510.37291.1459086535844.JavaMail.zimbra@efficios.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <702204510.37291.1459086535844.JavaMail.zimbra@efficios.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 7670
Lines: 147

On Sun, Mar 27, 2016 at 01:48:55PM +0000, Mathieu Desnoyers wrote:
> ----- On Mar 26, 2016, at 9:34 PM, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote:
> 
> > On Sat, Mar 26, 2016 at 10:22:57PM +0000, Mathieu Desnoyers wrote:
> >> ----- On Mar 26, 2016, at 2:49 PM, Paul E. McKenney paulmck@linux.vnet.ibm.com
> >> wrote:
> >> 
> >> > On Sat, Mar 26, 2016 at 08:28:16AM -0700, Paul E. McKenney wrote:
> >> >> On Sat, Mar 26, 2016 at 12:29:31PM +0000, Mathieu Desnoyers wrote:
> >> >> > ----- On Mar 25, 2016, at 5:46 PM, Paul E. McKenney paulmck@linux.vnet.ibm.com
> >> >> > wrote:
> >> >> > 
> >> >> > > On Fri, Mar 25, 2016 at 09:24:14PM +0000, Chatre, Reinette wrote:
> >> >> > >> Hi  Paul,
> >> >> > >> 
> >> >> > >> On 2016-03-23, Paul E. McKenney wrote:
> >> >> > >> > Please boot with the following parameters:
> >> >> > >> > 
> >> >> > >> > 	rcu_tree.rcu_kick_kthreads ftrace
> >> >> > >> > trace_event=sched_waking,sched_wakeup,sched_wake_idle_without_ipi
> >> >> > >> 
> >> >> > >> With these parameters I expected more details to show up in the kernel logs but
> >> >> > >> cannot find any. Even so, today I left the machine running again and when this
> >> >> > >> happened I think I was able to capture the trace data for the event. Please
> >> >> > >> find attached the trace information for the kernel message below. Since the
> >> >> > >> complete trace file is very big I trimmed it to show the time around this event
> >> >> > >> - hopefully this will contain the information you need. I would also like to
> >> >> > >> provide some additional information. The system on which I see these events had
> >> >> > >> a time that was _very_ wrong. I noticed that this issue occurs when
> >> >> > >> system-timesynd was one of the tasks calling the functions of interest to your
> >> >> > >> tracing and am wondering if a very out of sync time in process of being
> >> >> > >> corrected could be the cause of this issue? As an experiment I ensured the
> >> >> > >> system time was accurate before leaving the system idle overnight and I did not
> >> >> > >> see the issue the next morning.
> >> >> > > 
> >> >> > > Ah!  Yes, a sudden jump in time or a disagreement about the time among
> >> >> > > different components of the system can definitely cause these symptoms.
> >> >> > > We have sometimes seen these problems occur when a pair of CPUs have
> >> >> > > wildly different ideas about what time it is, for example.  Please let
> >> >> > > me know how it goes.
> >> >> > > 
> >> >> > > Also, in your trace, there are no sched_waking events for the rcu_preempt
> >> >> > > process that are not immediately followed by sched_wakeup, so your trace
> >> >> > > isn't showing the problem that I am seeing.
> >> >> > 
> >> >> > This is interesting.
> >> >> > 
> >> >> > Perhaps we could try with those commits reverted ?
> >> >> > 
> >> >> > commit e3baac47f0e82c4be632f4f97215bb93bf16b342
> >> >> > Author: Peter Zijlstra <peterz@infradead.org>
> >> >> > Date:   Wed Jun 4 10:31:18 2014 -0700
> >> >> > 
> >> >> >     sched/idle: Optimize try-to-wake-up IPI
> >> >> > 
> >> >> > commit fd99f91aa007ba255aac44fe6cf21c1db398243a
> >> >> > Author: Peter Zijlstra <peterz@infradead.org>
> >> >> > Date:   Wed Apr 9 15:35:08 2014 +0200
> >> >> > 
> >> >> >     sched/idle: Avoid spurious wakeup IPIs
> >> >> > 
> >> >> > They appeared in 3.16.
> >> >> 
> >> >> At this point, I am up for trying pretty much anything.  ;-)
> >> >> 
> >> >> Will give it a go.
> >> > 
> >> > And those certainly don't revert cleanly!  Would patching the kernel
> >> > to remove the definition of TIF_POLLING_NRFLAG be useful?  Or, more
> >> > to the point, is there some other course of action that would be more
> >> > useful?  At this point, the test times are measured in weeks...
> >> 
> >> Indeed, patching the kernel to remove the TIF_POLLING_NRFLAG
> >> definition would have an effect similar to reverting those two
> >> commits.
> >> 
> >> Since testing takes a while, we could take a more aggressive
> >> approach towards reproducing a possible race condition: we
> >> could re-implement the _TIF_POLLING_NRFLAG vs _TIF_NEED_RESCHED
> >> dance, along with the ttwu pending lock-list queue, within
> >> a dummy test module, with custom data structures, and
> >> stress-test the invariants. We could also create a Promela
> >> model of these ipi-skip optimisations trying to validate
> >> progress: whenever a wakeup is requested, there should
> >> always be a scheduling performed, even if no further wakeup
> >> is encountered.
> >> 
> >> Each of the two approaches proposed above might be a significant
> >> endeavor, and would only validate my specific hunch. So it might
> >> be a good idea to just let a test run for a few weeks with
> >> TIF_POLLING_NRFLAG disabled meanwhile.
> > 
> > This makes a lot of sense.  I did some short runs, and nothing broke
> > too badly.  However, I left some diagnostic stuff in that obscured
> > the outcome.  I disabled the diagnostic stuff and am running overnight.
> > I might need to go further and revert some of my diagnostic patches,
> > but let's see where it is in the morning.
> 
> Here is another idea that might help us reproduce this issue faster.
> If you can afford it, you might want to just throw more similar hardware
> at the problem. Assuming the problem shows up randomly, but its odds
> of showing up make it happen only once per week, if we have 100 machines
> idling in the same way in parallel, we should be able to reproduce it
> within about 1-2 hours.
> 
> Of course, if the problem really need each machine to "degrade" for
> a week (e.g. memory fragmentation), that would not help. It's only for
> races that appear to be showing up randomly.

Certain rcutorture tests sometimes hit it within an hour (TREE03).
Last night's TREE03 ran six hours without incident, which is unusual
given that I didn't enable any tracepoints, but does not any significant
level of statitstical confidence.  The set will finish in a few hours,
at which point I will start parallel batches of TREE03 to see what
comes up.

Feel free to take a look at kernel/rcu/waketorture.c for my (feeble
thus far) attempt to speed things up.  I am thinking that I need to
push sleeping tasks onto idle CPUs to make it happen more often.
My current approach to this is to run with CPU utilizations of about
40% and using hrtimer with a prime number of microseconds to avoid
synchronization.  That should in theory get me a 40% chance of hitting
an idle CPU with a wakeup, and a reasonable chance of racing with a
CPU-hotplug operation.  But maybe the wakeup needs to be remote or
some such, in which case waketorture also needs to move stuff around.

Oh, and the patch I am running with is below.  I am running x86, and so
some other architectures would of course need the corresponding patch
on that architecture.

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index c7b5510..062ae53 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -102,7 +102,7 @@ struct thread_info {
 #define TIF_FORK		18	/* ret_from_fork */
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
-#define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
+/* #define TIF_POLLING_NRFLAG	21	 idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
 #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */