Date: Wed, 16 Aug 2017 10:04:21 -0400
From: Steven Rostedt
To: Daniel Lezcano
Cc: paulmck@linux.vnet.ibm.com, Pratyush Anand, 김동현, john.stultz@linaro.org, linux-kernel@vger.kernel.org
Subject: Re: RCU stall when using function_graph
Message-ID: <20170816100421.318deae2@gandalf.local.home>

On Wed, 16 Aug 2017 10:42:15 +0200
Daniel Lezcano wrote:

> Hi Steven,
>
>
> On 15/08/2017 15:29, Steven Rostedt wrote:
> >
> > [ I'm back from vacation! ]
>
> Did you get the tapes? :)

Yes, but nothing in them would cause the reputation of the POTUS to
become any worse than it already is.

>
> > On Wed, 9 Aug 2017 17:51:33 +0200
> > Daniel Lezcano wrote:
> >
> >> Well, may be the instruction pointer thing is not a good idea.
> >>
> >> I learnt from this experience, an overloaded kernel with a lot of
> >> interrupts can hang the console and issue RCU stall.
> >>
> >> However, someone else can face the same situation. Even if he reads the
> >> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
> >>
> >> A message telling the grace period can't be reached because we are too
> >> busy processing interrupts would have helped but I understand it is not
> >> easy to implement.
> >
> > What if the stall code triggered an irqwork first? The irqwork would
> > trigger as soon as interrupts were enabled again (or at the next tick,
> > depending on the arch), and then it would know that RCU stalled due to
> > an irq storm if the irqwork is being hit.
>
> Is that condition enough to tell the CPU is over utilized by the
> interrupts handling?
>
> And I'm wondering if it wouldn't make sense to have this detection in
> the irq code. With or without the RCU stall warning kernel option set,
> the irq framework will be warning about this situation. If the RCU stall
> option is set, that will issue a second message. It will be easy to do
> the connection between the first message and the second one, no ?

The thing is, the RCU code keeps track of the state of progress; I
don't believe the interrupt code does. It just worries about handling
interrupts. I'm not excited about adding infrastructure to the
interrupt code to do accounting of IRQ storms. On the other hand, the
RCU code already does this. If it notices a stall, it can trigger an
irq_work and wait a little more. If the irq_work doesn't fire, then it
can do the normal RCU stall message. But if the irq_work does fire, and
the RCU progress still hasn't moved forward, then it would be able to
say this is due to an IRQ storm and produce a better error message.

-- Steve
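
The decision described above can be sketched in plain userspace C. This is only a model of the logic, not kernel code and not a proposed patch: the names stall_verdict, stall_probe, stall_probe_arm(), stall_probe_fire() and stall_probe_classify() are hypothetical, the grace-period counter is a bare unsigned long, and the "progress was made" case (which the mail does not spell out) is assumed to mean there is nothing to report. In the kernel, the probe would presumably be an irq_work queued from the stall-detection path, with the callback doing what stall_probe_fire() does here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of the extended stall check; not a kernel type. */
enum stall_verdict {
    STALL_NONE,      /* assumption: RCU progressed while we waited */
    STALL_GENERIC,   /* probe never ran: report the normal RCU stall */
    STALL_IRQ_STORM  /* probe ran, yet RCU made no progress */
};

/* Hypothetical state kept between noticing a stall and re-checking it. */
struct stall_probe {
    bool fired;             /* set once the probe callback has run */
    unsigned long gp_start; /* progress counter when the stall was noticed */
};

/* Called when a stall is first noticed: remember progress, arm the probe. */
static void stall_probe_arm(struct stall_probe *p, unsigned long gp_now)
{
    assert(p != NULL);
    p->fired = false;
    p->gp_start = gp_now;
}

/* Stand-in for the irq_work callback: it only records that it ran. */
static void stall_probe_fire(struct stall_probe *p)
{
    assert(p != NULL);
    p->fired = true;
}

/* Called after "waiting a little more": decide which message applies. */
static enum stall_verdict stall_probe_classify(const struct stall_probe *p,
                                               unsigned long gp_now)
{
    assert(p != NULL);
    if (gp_now != p->gp_start)
        return STALL_NONE;
    return p->fired ? STALL_IRQ_STORM : STALL_GENERIC;
}
```

The point of the split is that the probe callback does no accounting at all: it sets one flag, and the RCU side, which already tracks progress, combines that flag with its own counter to choose the message.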