Date: Thu, 3 Aug 2017 05:44:21 -0700
From: "Paul E. McKenney"
To: Daniel Lezcano
Cc: Steven Rostedt, john.stultz@linaro.org, linux-kernel@vger.kernel.org, Pratyush Anand
Subject: Re: RCU stall when using function_graph
Reply-To: paulmck@linux.vnet.ibm.com
Message-Id: <20170803124421.GP3730@linux.vnet.ibm.com>
In-Reply-To: <11d179df-d8a9-5d3e-3bc4-080df464e85d@linaro.org>

On Thu, Aug 03, 2017 at 01:41:11PM +0200, Daniel Lezcano wrote:
> On 02/08/2017 15:07, Steven Rostedt wrote:
> > On Wed, 2 Aug 2017 14:42:39 +0200
> > Daniel Lezcano wrote:
> >
> >> On Tue, Aug 01, 2017 at 08:12:14PM -0400, Steven Rostedt wrote:
> >>> On Wed, 2 Aug 2017 00:15:44 +0200
> >>> Daniel Lezcano wrote:
> >>>
> >>>> On 02/08/2017 00:04, Paul E. McKenney wrote:
> >>>>>> Hi Paul,
> >>>>>>
> >>>>>> I have been trying to set the function_graph tracer for ftrace and each time I
> >>>>>> get a CPU stall.
> >>>>>>
> >>>>>> How to reproduce:
> >>>>>> -----------------
> >>>>>>
> >>>>>> echo function_graph > /sys/kernel/debug/tracing/current_tracer
> >>>>>>
> >>>>>> This error appears with v4.13-rc3 and v4.12-rc6.
> >>>
> >>> Can you bisect this? It may be due to this commit:
> >>>
> >>> 0598e4f08 ("ftrace: Add use of synchronize_rcu_tasks() with dynamic trampolines")
> >>
> >> Hi Steve,
> >>
> >> I git bisected but each time the issue occured. I went through the different
> >> version down to v4.4 where the board was not fully supported and it ended up to
> >> have the same issue.
> >>
> >> Finally, I had the intuition it could be related to the wall time (there is no
> >> RTC clock with battery on the board and the wall time is Jan 1st, 1970).
> >>
> >> Setting up the with ntpdate solved the problem.
>
> Actually, it did not solve the problem. The function_graph trace is set,
> I can use the system but after awhile (no tracing enabled at anytime),
> the stall appears.
>
> >> Even if it is rarely the case to have the time not set, is it normal to have a
> >> RCU cpu stall ?
> >>
> >>
> >
> > BTW, function_graph tracer is the most invasive of the tracers. It's 4x
> > slower than function tracer. I'm wondering if the tracer isn't the
> > cause, but just slows things down enough to cause a some other race
> > condition that triggers the bug.
>
> Yes, that could be true.
>
> I tried the following scenario:
>
>  - cpufreq governor => userspace + max_freq (1.2GHz)
>  - function_graph set ==> OK
>
>  - cpufreq governor => userspace + min_freq (200MHz)
>  - function_graph set ==> RCU stall
>
> Beside that, I realize the board is constantly processing SOF interrupts
> every 124us, so that adds more overhead.
>
> Removing the USB support, thus the associated processing for the SOF
> interrupts, I don't see anymore the RCU stall.

Looks like Steve called this one!  ;-)

> Is it the expected behavior to have the system hang after a RCU stall
> raises ?

No, but if NMI stack traces are enabled and there are any NMI problems,
bad things can happen.  In addition, the bulk of output can cause problems
if you have a slow console connection.

							Thanx, Paul
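
As a rough illustration of the frequency scenario quoted above, the
following sketch works out how many CPU cycles fit between two SOF
interrupts at each of the two clock rates mentioned in the thread.  Only
the 124us period and the 1.2GHz / 200MHz frequencies come from the
messages; the function names and the per-interrupt handler cost are
hypothetical and were not measured on the board.

```python
# Back-of-the-envelope arithmetic only; nothing here was measured.

SOF_PERIOD_US = 124            # SOF interrupt interval reported in the thread
MAX_FREQ_HZ = 1_200_000_000    # cpufreq max_freq from the thread (1.2GHz)
MIN_FREQ_HZ = 200_000_000      # cpufreq min_freq from the thread (200MHz)


def cycles_per_sof_period(cpu_hz, period_us=SOF_PERIOD_US):
    """CPU cycles available between two consecutive SOF interrupts."""
    return cpu_hz * period_us // 1_000_000


def sof_interrupts_per_second(period_us=SOF_PERIOD_US):
    """Interrupt rate implied by the reported period."""
    return 1_000_000 / period_us


def overhead_fraction(handler_cycles, cpu_hz, period_us=SOF_PERIOD_US):
    """Share of each period spent in the handler.

    handler_cycles is a hypothetical cost (handler plus tracing overhead);
    the thread gives no such figure.
    """
    return handler_cycles / cycles_per_sof_period(cpu_hz, period_us)


if __name__ == "__main__":
    assumed_handler_cycles = 10_000  # hypothetical, for illustration only
    for label, hz in (("max_freq", MAX_FREQ_HZ), ("min_freq", MIN_FREQ_HZ)):
        budget = cycles_per_sof_period(hz)
        share = overhead_fraction(assumed_handler_cycles, hz)
        print(f"{label}: {budget} cycles per period, "
              f"{share:.1%} spent in an assumed {assumed_handler_cycles}-cycle handler")
```

The cycle budget per period shrinks by a factor of six at the lower
frequency, so any fixed per-interrupt cost takes a six times larger share
of the CPU.  That is consistent with the stall appearing only at 200MHz,
although it does not by itself show where the time goes.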