Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752767AbdHKJiP (ORCPT );
        Fri, 11 Aug 2017 05:38:15 -0400
Received: from mail-wr0-f173.google.com ([209.85.128.173]:38688 "EHLO
        mail-wr0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751521AbdHKJiN (ORCPT );
        Fri, 11 Aug 2017 05:38:13 -0400
Subject: Re: RCU stall when using function_graph
To: paulmck@linux.vnet.ibm.com
Cc: Pratyush Anand, 김동현, john.stultz@linaro.org, Steven Rostedt,
        linux-kernel@vger.kernel.org
References: <20170806170220.GQ3730@linux.vnet.ibm.com>
        <20170809125804.GT3730@linux.vnet.ibm.com>
        <20170809144033.GU3730@linux.vnet.ibm.com>
        <208e981d-40ec-54fa-6293-5b8e6fe10a84@linaro.org>
        <20170809172236.GX3730@linux.vnet.ibm.com>
        <81dd7e5e-89be-2ff9-525e-7095e934baa5@linaro.org>
        <20170810213939.GV3730@linux.vnet.ibm.com>
From: Daniel Lezcano
Message-ID: <03ff85d7-ccee-6aa1-8652-1b416571bfbb@linaro.org>
Date: Fri, 11 Aug 2017 11:38:09 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
        Thunderbird/52.2.1
MIME-Version: 1.0
In-Reply-To: <20170810213939.GV3730@linux.vnet.ibm.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 2712
Lines: 66

On 10/08/2017 23:39, Paul E. McKenney wrote:
> On Thu, Aug 10, 2017 at 11:45:09AM +0200, Daniel Lezcano wrote:

[ ... ]

>> Nothing coming in mind but may be worth to mention the slowness of the
>> CPU is the aggravating factor. In particular I was able to reproduce the
>> issue by setting to the min CPU frequency. With the ondemand governor,
>> we can have the frequency high (hence enough CPU power) at the moment we
>> set the function_graph because another CPU is loaded (and both CPUs are
>> sharing the same clock line). The system became stuck at the moment the
>> other CPU went idle with the lowest frequency. That introduced
>> randomness in the issue and made hard to figure out why the RCU stall
>> was happening.
>
> Adding this, then?

Yes, sure.

Thanks Paul.

  -- Daniel

> ------------------------------------------------------------------------
>
> commit f7d9ce95064f76be583c775fac32076fa59f1617
> Author: Paul E. McKenney
> Date:   Thu Aug 10 14:33:17 2017 -0700
>
>     documentation: Slow systems can stall RCU grace periods
>
>     If a fast system has a worst-case grace-period duration of (say) ten
>     seconds, then running the same workload on a system ten times as slow
>     will get you an RCU CPU stall warning given default stall-warning
>     timeout settings.  This commit therefore adds this possibility to
>     stallwarn.txt.
>
>     Reported-by: Daniel Lezcano
>     Signed-off-by: Paul E. McKenney
>
> diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
> index 21b8913acbdf..238acbd94917 100644
> --- a/Documentation/RCU/stallwarn.txt
> +++ b/Documentation/RCU/stallwarn.txt
> @@ -70,6 +70,12 @@ o	A periodic interrupt whose handler takes longer than the time
>  	considerably longer than normal, which can in turn result in
>  	RCU CPU stall warnings.
>
> +o	Testing a workload on a fast system, tuning the stall-warning
> +	timeout down to just barely avoid RCU CPU stall warnings, and then
> +	running the same workload with the same stall-warning timeout on a
> +	slow system.  Note that thermal throttling and on-demand governors
> +	can cause a single system to be sometimes fast and sometimes slow!
> +
>  o	A hardware or software issue shuts off the scheduler-clock
>  	interrupt on a CPU that is not in dyntick-idle mode.  This
>  	problem really has happened, and seems to be most likely to
>

-- 
Linaro.org │ Open source software for ARM SoCs

Follow Linaro: Facebook | Twitter | Blog
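
For illustration only, the arithmetic in the commit message quoted above can
be sketched in a few lines of Python. This is a rough model, not kernel code:
the function names are hypothetical, the 21-second figure is the usual
CONFIG_RCU_CPU_STALL_TIMEOUT default, and the assumption that grace-period
duration scales linearly with the CPU slowdown is a simplification made for
the example.

```python
# Rough model of the reasoning in the commit message. All names here are
# hypothetical and exist only for this example.

# Usual default of CONFIG_RCU_CPU_STALL_TIMEOUT, in seconds (assumed here).
DEFAULT_STALL_TIMEOUT_S = 21


def scaled_grace_period(fast_gp_s, slowdown):
    """Worst-case grace-period duration on a system `slowdown` times slower.

    Assumes the duration grows linearly with the slowdown factor, which is
    a simplification.
    """
    return fast_gp_s * slowdown


def would_stall(fast_gp_s, slowdown, timeout_s=DEFAULT_STALL_TIMEOUT_S):
    """Return True if the scaled grace period exceeds the stall timeout."""
    return scaled_grace_period(fast_gp_s, slowdown) > timeout_s


if __name__ == "__main__":
    # Ten-second worst case on the fast system: under the timeout.
    print(would_stall(10, 1))
    # Same workload on a system ten times as slow: over the timeout.
    print(would_stall(10, 10))
```

The same model covers the single-system case described in the thread: a
governor dropping to the lowest frequency acts like a larger slowdown factor
applied to an otherwise unchanged workload.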