Date: Wed, 2 Aug 2017 09:51:13 -0700
From: "Paul E. McKenney"
To: Daniel Lezcano
Cc: Steven Rostedt, john.stultz@linaro.org, linux-kernel@vger.kernel.org
Subject: Re: RCU stall when using function_graph
Reply-To: paulmck@linux.vnet.ibm.com
References: <20170801220405.GL3730@linux.vnet.ibm.com> <20170801201214.1e9c7d8e@gandalf.local.home> <20170802124239.GD1919@mai>
In-Reply-To: <20170802124239.GD1919@mai>
Message-Id: <20170802165113.GZ3730@linux.vnet.ibm.com>
List-ID: linux-kernel@vger.kernel.org

On
Wed, Aug 02, 2017 at 02:42:39PM +0200, Daniel Lezcano wrote:
> On Tue, Aug 01, 2017 at 08:12:14PM -0400, Steven Rostedt wrote:
> > On Wed, 2 Aug 2017 00:15:44 +0200
> > Daniel Lezcano wrote:
> >
> > > On 02/08/2017 00:04, Paul E. McKenney wrote:
> > > >> Hi Paul,
> > > >>
> > > >> I have been trying to set the function_graph tracer for ftrace and each time I
> > > >> get a CPU stall.
> > > >>
> > > >> How to reproduce:
> > > >> -----------------
> > > >>
> > > >> echo function_graph > /sys/kernel/debug/tracing/current_tracer
> > > >>
> > > >> This error appears with v4.13-rc3 and v4.12-rc6.
> >
> > Can you bisect this? It may be due to this commit:
> >
> > 0598e4f08 ("ftrace: Add use of synchronize_rcu_tasks() with dynamic trampolines")
>
> Hi Steve,
>
> I git bisected but each time the issue occurred. I went through the different
> versions down to v4.4, where the board was not fully supported, and it ended up
> having the same issue.
>
> Finally, I had the intuition it could be related to the wall time (there is no
> RTC clock with battery on the board and the wall time is Jan 1st, 1970).
>
> Setting up the time with ntpdate solved the problem.
>
> Even if it is rarely the case to have the time not set, is it normal to have a
> RCU cpu stall ?

If the system is sufficiently confused about the time, you can indeed
get RCU CPU stall warnings.  In one memorable case, a pair of CPUs had
a multi-minute disagreement as to the current time, which meant that
when one of them started an RCU grace period, the other would immediately
issue an RCU CPU stall warning.

							Thanx, Paul

> > > >>
> > > >> Is it something already reported ?
> > > >
> > > > I have seen this sort of thing, but only when actually dumping the trace
> > > > out, and I thought those got fixed. You are seeing this just accumulating
> > > > the trace?
> > >
> > > No, just by changing the tracer. It is the first operation I do after
> > > rebooting and it is reproducible each time. That happens on an ARM64
> > > platform.
> > >
> > > > These RCU CPU stall warnings usually occur when something grabs hold of
> > > > a CPU for too long, as in 21 seconds or so. One way that they can happen
> > > > is excessive lock contention, another is having the kernel run through
> > > > too much data at one shot.
> > > >
> > > > Adding Steven Rostedt on CC for his thoughts.
> > > >
> > >
> > --
> > Linaro.org │ Open source software for ARM SoCs
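
For illustration only, here is a toy model of the time-disagreement case
described in the reply above. It is not the kernel's stall-detection code:
the function name, the use of seconds rather than jiffies, and the example
clock values are all assumptions made for this sketch. The 21-second
timeout mirrors the "21 seconds or so" figure quoted in the thread.

```python
# Toy model, NOT kernel code: each CPU decides whether a grace period
# looks stalled by comparing its own idea of "now" with the recorded
# grace-period start time.

STALL_TIMEOUT_SECS = 21  # assumed value, from "21 seconds or so" above


def stall_suspected(gp_start_secs, now_secs, timeout_secs=STALL_TIMEOUT_SECS):
    """Return True if, by this CPU's clock, the grace period ran too long."""
    return now_secs - gp_start_secs > timeout_secs


# Hypothetical clocks: CPU 1 believes it is three minutes later than CPU 0.
cpu0_now = 1000.0
cpu1_now = cpu0_now + 180.0

# CPU 0 starts a grace period and stamps it with its own clock.
gp_start = cpu0_now

print("cpu0 suspects stall:", stall_suspected(gp_start, cpu0_now))
print("cpu1 suspects stall:", stall_suspected(gp_start, cpu1_now))
```

In this model CPU 0 sees an elapsed time of zero, while CPU 1 sees 180
seconds against a 21-second limit, so it would complain immediately even
though nothing has actually held a CPU for that long.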