Date: Thu, 10 Aug 2017 14:39:39 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Pratyush Anand <panand@redhat.com>,
        =?utf-8?B?6rmA64+Z7ZiE?= <austinkernel.kim@gmail.com>,
        john.stultz@linaro.org, Steven Rostedt <rostedt@goodmis.org>,
        linux-kernel@vger.kernel.org
Subject: Re: RCU stall when using function_graph
Reply-To: paulmck@linux.vnet.ibm.com
References: <CAOoBcBVKpQkAVXji5qQu8r8GErqxpy9Ae9N97NhGpOQPgXudZg@mail.gmail.com>
 <CAOoBcBU00VRXmrNNEOjJHgXf9BimxKYOorJC0d3766mNdda=Bg@mail.gmail.com>
 <20170806170220.GQ3730@linux.vnet.ibm.com>
 <db4dc3c5-8a3d-9752-802e-ab509201e251@redhat.com>
 <20170809125804.GT3730@linux.vnet.ibm.com>
 <bf4f38d6-57b7-2281-db24-368d047956aa@linaro.org>
 <20170809144033.GU3730@linux.vnet.ibm.com>
 <208e981d-40ec-54fa-6293-5b8e6fe10a84@linaro.org>
 <20170809172236.GX3730@linux.vnet.ibm.com>
 <81dd7e5e-89be-2ff9-525e-7095e934baa5@linaro.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <81dd7e5e-89be-2ff9-525e-7095e934baa5@linaro.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Message-Id: <20170810213939.GV3730@linux.vnet.ibm.com>
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 7226
Lines: 158

On Thu, Aug 10, 2017 at 11:45:09AM +0200, Daniel Lezcano wrote:
> On 09/08/2017 19:22, Paul E. McKenney wrote:
> > On Wed, Aug 09, 2017 at 05:51:33PM +0200, Daniel Lezcano wrote:
> >> On 09/08/2017 16:40, Paul E. McKenney wrote:
> >>> On Wed, Aug 09, 2017 at 03:28:05PM +0200, Daniel Lezcano wrote:
> >>>> On 09/08/2017 14:58, Paul E. McKenney wrote:
> >>>>> On Wed, Aug 09, 2017 at 02:43:49PM +0530, Pratyush Anand wrote:
> >>>>>>
> >>>>>>
> >>>>>> On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:
> >>>>>>> On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
> >>>>>>>> Dear All
> >>>>>>>>
> >>>>>>>> As for me, after configuring function_graph as below, crash disappears.
> >>>>>>>> "echo 0 > d/tracing/tracing_on"
> >>>>>>>> "sleep 1"
> >>>>>>>>
> >>>>>>>> "echo function_graph > d/tracing/current_tracer"
> >>>>>>>> "sleep 1"
> >>>>>>>>
> >>>>>>>> "echo smp_call_function_single > d/tracing/set_ftrace_filter"
> >>>>>>
> >>>>>> It will limit trace output to only for the filtered function
> >>>>>> (smp_call_function_single).
> >>>>>>
> >>>>>>>> adb shell "sleep 1"
> >>>>>>>>
> >>>>>>>> "echo 1 > d/tracing/tracing_on"
> >>>>>>>> adb shell "sleep 1"
> >>>>>>>>
> >>>>>>>> Right after function_graph is enabled, too many logs are traced upon IRQ
> >>>>>>>> transaction which many times eventually causes stall.
> >>>>>>>
> >>>>>>> That would do it!
> >>>>>>>
> >>>>>>> Hmmm...
> >>>>>>>
> >>>>>>> Steven, would it be helpful if RCU were to inform tracing (say) halfway
> >>>>>>> through the RCU CPU stall interval, allowing the tracer to do something
> >>>>>>> like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
> >>>>>>> wouldn't work, for example, if all the tracing was with irqs disabled
> >>>>>>> or some such, but figured I should ask.
> >>>>>>>
> >>>>>>> Does Guillermo's approach work for others?
> >>>>>>
> >>>>>> Limited output with a couple of filtered function will definitely
> >>>>>> not cause RCU schedule stall. But the question is whether we should
> >>>>>> expect a full function graph trace working on every platform or not
> >>>>>> (specially the one which generates high interrupts)?
> >>>>>
> >>>>> It might well be that the user must disable RCU CPU stall warnings via
> >>>>> the rcu_cpu_stall_suppress sysfs entry (or increase their timeout via th
> >>>>> rcu_cpu_stall_timeout sysfs entry) before doing something that greatly
> >>>>> increases overhead.  Like enabling large quantities of tracing.  ;-)
> >>>>>
> >>>>> It -might- be possible to do this automatically, but reliable
> >>>>> automation would require that tracing understand how often each
> >>>>> function was called, which sounds to me to be a bit of a stretch.
> >>>>>
> >>>>> Thoughts?
> >>>>
> >>>> A random thought:
> >>>>
> >>>> Is it possible to have a mid-timeout happening and store some
> >>>> information like the instruction pointer, so when the timeout happen we
> >>>> can compare if there was some progress, if yes, very likely, system
> >>>> performance collapsed and we are not fast enough.
> >>>
> >>> RCU already does take various actions for an impending stall, so something
> >>> could be done.  But in most slowdowns, the instruction pointer will be
> >>> changing rapidly, just not as rapidly as it would normally.  So exactly
> >>> how would the forward-progress comparison be carried out?
> >>>
> >>> It would be easy to set up a notifier, so that if any notifier in the
> >>> chain returned an error, stall warnings would be suppressed.  It would
> >>> be harder to figure out when to re-enable them, though I suppose that
> >>> they could be suppressed only for the duration of the current grace
> >>> period or some such.
> >>>
> >>> But what exactly would you use such a notifier for?
> >>>
> >>> Or am I misunderstanding your suggestion?
> >>
> >> Well, may be the instruction pointer thing is not a good idea.
> >>
> >> I learnt from this experience, an overloaded kernel with a lot of
> >> interrupts can hang the console and issue RCU stall.
> >>
> >> However, someone else can face the same situation. Even if he reads the
> >> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
> >>
> >> A message telling the grace period can't be reached because we are too
> >> busy processing interrupts would have helped but I understand it is not
> >> easy to implement.
> >>
> >> Perhaps, adding a new bullet in the documentation can help:
> >>
> >> "If the interrupt processing time is longer than the interval between
> >> each interrupt, the CPU will keep processing the interrupts without
> >> allowing the RCU's grace period kthread. This situation can happen if
> >> there is a highly rated number of interrupts and the function_graph
> >> tracer is enabled".
> > 
> > How about this?
> 
> Yes, it is clear. Thanks for reformulating it.
> 
> > Any other debugging options that should be called out?  I bet that
> > the function_graph tracer isn't the only way to make this happen.
> 
> Nothing coming in mind but may be worth to mention the slowness of the
> CPU is the aggravating factor. In particular I was able to reproduce the
> issue by setting to the min CPU frequency. With the ondemand governor,
> we can have the frequency high (hence enough CPU power) at the moment we
> set the function_graph because another CPU is loaded (and both CPUs are
> sharing the same clock line). The system became stuck at the moment the
> other CPU went idle with the lowest frequency. That introduced
> randomness in the issue and made hard to figure out why the RCU stall
> was happening.

Adding this, then?

							Thanx, Paul

------------------------------------------------------------------------

commit f7d9ce95064f76be583c775fac32076fa59f1617
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Thu Aug 10 14:33:17 2017 -0700

    documentation: Slow systems can stall RCU grace periods
    
    If a fast system has a worst-case grace-period duration of (say) ten
    seconds, then running the same workload on a system ten times as slow
    will get you an RCU CPU stall warning given default stall-warning
    timeout settings.  This commit therefore adds this possibility to
    stallwarn.txt.
    
    Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
index 21b8913acbdf..238acbd94917 100644
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -70,6 +70,12 @@ o	A periodic interrupt whose handler takes longer than the time
 	considerably longer than normal, which can in turn result in
 	RCU CPU stall warnings.
 
+o	Testing a workload on a fast system, tuning the stall-warning
+	timeout down to just barely avoid RCU CPU stall warnings, and then
+	running the same workload with the same stall-warning timeout on a
+	slow system.  Note that thermal throttling and on-demand governors
+	can cause a single system to be sometimes fast and sometimes slow!
+
 o	A hardware or software issue shuts off the scheduler-clock
 	interrupt on a CPU that is not in dyntick-idle mode.  This
 	problem really has happened, and seems to be most likely to