Date: Thu, 3 May 2018 11:22:13 -0700
From: "Paul E. McKenney"
To: linux-kernel@vger.kernel.org
Cc: mingo@kernel.org, jiangshanlai@gmail.com, dipankar@in.ibm.com,
	akpm@linux-foundation.org, mathieu.desnoyers@efficios.com,
	josh@joshtriplett.org, tglx@linutronix.de, peterz@infradead.org,
	rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com,
	fweisbec@gmail.com, oleg@redhat.com, joel.opensrc@gmail.com
Subject: Re: [PATCH tip/core/rcu 13/22] rcu: Exclude near-simultaneous RCU CPU stall warnings
Reply-To: paulmck@linux.vnet.ibm.com
References: <20180423023150.GA21533@linux.vnet.ibm.com>
	<1524450747-22778-13-git-send-email-paulmck@linux.vnet.ibm.com>
In-Reply-To: <1524450747-22778-13-git-send-email-paulmck@linux.vnet.ibm.com>
Message-Id:
 <20180503182213.GA1981@linux.vnet.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Apr 22, 2018 at 07:32:18PM -0700, Paul E. McKenney wrote:
> There is a two-jiffy delay between the time that a CPU will self-report
> an RCU CPU stall warning and the time that some other CPU will report a
> warning on behalf of the first CPU. This has worked well in the past,
> but on busy systems, it is possible for the two warnings to overlap,
> which makes interpreting them extremely difficult.
>
> This commit therefore uses a cmpxchg-based timing decision that
> allows only one report in a given one-minute period (assuming default
> stall-warning Kconfig parameters). This approach will of course fail
> if you are seeing minute-long vCPU preemption, but in that case the
> overlapping RCU CPU stall warnings are the least of your worries.
>
> Reported-by: Dmitry Vyukov
> Signed-off-by: Paul E. McKenney

And later testing showed that this commit had the unfortunate side-effect
of completely suppressing other-CPU reporting of RCU CPU stalls. The
patch below includes a fix, and this patch has been kicked out of the
queue for the next merge window in favor of the one following.

							Thanx, Paul

------------------------------------------------------------------------

commit ed569311d8d655a72f93310dbf479ca84daa736f
Author: Paul E. McKenney
Date:   Mon Apr 9 11:04:46 2018 -0700

    rcu: Exclude near-simultaneous RCU CPU stall warnings

    There is a two-jiffy delay between the time that a CPU will self-report
    an RCU CPU stall warning and the time that some other CPU will report a
    warning on behalf of the first CPU. This has worked well in the past,
    but on busy systems, it is possible for the two warnings to overlap,
    which makes interpreting them extremely difficult.

    This commit therefore uses a cmpxchg-based timing decision that
    allows only one report in a given one-minute period (assuming default
    stall-warning Kconfig parameters). This approach will of course fail
    if you are seeing minute-long vCPU preemption, but in that case the
    overlapping RCU CPU stall warnings are the least of your worries.

    Reported-by: Dmitry Vyukov
    Signed-off-by: Paul E. McKenney

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 35efe85c35b4..f066269d5b91 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1368,7 +1368,6 @@ static inline void panic_on_rcu_stall(void)
 static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gpnum)
 {
 	int cpu;
-	long delta;
 	unsigned long flags;
 	unsigned long gpa;
 	unsigned long j;
@@ -1381,18 +1380,6 @@ static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gpnum)
 	if (rcu_cpu_stall_suppress)
 		return;
 
-	/* Only let one CPU complain about others per time interval. */
-
-	raw_spin_lock_irqsave_rcu_node(rnp, flags);
-	delta = jiffies - READ_ONCE(rsp->jiffies_stall);
-	if (delta < RCU_STALL_RAT_DELAY || !rcu_gp_in_progress(rsp)) {
-		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
-		return;
-	}
-	WRITE_ONCE(rsp->jiffies_stall,
-		   jiffies + 3 * rcu_jiffies_till_stall_check() + 3);
-	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
-
 	/*
 	 * OK, time to rat on our buddy...
 	 * See Documentation/RCU/stallwarn.txt for info on how to debug
@@ -1441,6 +1428,10 @@ static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gpnum)
 			sched_show_task(current);
 		}
 	}
+	/* Rewrite if needed in case of slow consoles. */
+	if (ULONG_CMP_GE(jiffies, READ_ONCE(rsp->jiffies_stall)))
+		WRITE_ONCE(rsp->jiffies_stall,
+			   jiffies + 3 * rcu_jiffies_till_stall_check() + 3);
 
 	rcu_check_gp_kthread_starvation(rsp);
 
@@ -1485,6 +1476,7 @@ static void print_cpu_stall(struct rcu_state *rsp)
 	rcu_dump_cpu_stacks(rsp);
 
 	raw_spin_lock_irqsave_rcu_node(rnp, flags);
+	/* Rewrite if needed in case of slow consoles. */
 	if (ULONG_CMP_GE(jiffies, READ_ONCE(rsp->jiffies_stall)))
 		WRITE_ONCE(rsp->jiffies_stall,
 			   jiffies + 3 * rcu_jiffies_till_stall_check() + 3);
@@ -1508,6 +1500,7 @@ static void check_cpu_stall(struct rcu_state *rsp, struct rcu_data *rdp)
 	unsigned long gpnum;
 	unsigned long gps;
 	unsigned long j;
+	unsigned long jn;
 	unsigned long js;
 	struct rcu_node *rnp;
 
@@ -1546,14 +1539,17 @@ static void check_cpu_stall(struct rcu_state *rsp, struct rcu_data *rdp)
 	    ULONG_CMP_GE(gps, js))
 		return; /* No stall or GP completed since entering function. */
 	rnp = rdp->mynode;
+	jn = jiffies + 3 * rcu_jiffies_till_stall_check() + 3;
 	if (rcu_gp_in_progress(rsp) &&
-	    (READ_ONCE(rnp->qsmask) & rdp->grpmask)) {
+	    (READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
+	    cmpxchg(&rsp->jiffies_stall, js, jn) == js) {
 
 		/* We haven't checked in, so go dump stack. */
 		print_cpu_stall(rsp);
 
 	} else if (rcu_gp_in_progress(rsp) &&
-		   ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY)) {
+		   ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
+		   cmpxchg(&rsp->jiffies_stall, js, jn) == js) {
 
 		/* They had a few time units to dump stack, so complain. */
 		print_other_cpu_stall(rsp, gpnum);
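
For anyone who wants to poke at the idea outside the kernel, here is a
minimal user-space sketch of the claim step that the check_cpu_stall()
hunk adds. It is not the kernel code. It assumes C11 atomics as a stand-in
for cmpxchg(), and try_claim_report() and the file-scope jiffies_stall
variable are hypothetical names made up for illustration. Each caller
passes in the deadline value js that it sampled earlier, and only the
caller whose compare-and-swap still finds that value gets to report.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Wraparound-safe "a >= b" for free-running counters such as jiffies. */
#define ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b))

/* Hypothetical stand-in for rsp->jiffies_stall: next permitted report time. */
static _Atomic unsigned long jiffies_stall;

/*
 * Hypothetical helper, not a kernel function.
 *
 * js     - deadline this caller sampled earlier from jiffies_stall
 * now    - current time in the same units
 * period - assumed stall-check interval, standing in for the value
 *          returned by rcu_jiffies_till_stall_check()
 *
 * Returns true for at most one of the callers that sampled the same js.
 */
static bool try_claim_report(unsigned long js, unsigned long now,
			     unsigned long period)
{
	/* New deadline, computed the same way as jn in the patch. */
	unsigned long jn = now + 3 * period + 3;

	/* Deadline not yet reached, so there is nothing to report. */
	if (!ULONG_CMP_GE(now, js))
		return false;

	/*
	 * Succeeds only if jiffies_stall still equals js, in which case it
	 * is advanced to jn. A caller holding a stale js fails here.
	 */
	return atomic_compare_exchange_strong(&jiffies_stall, &js, jn);
}
```

If two CPUs both sampled the same expired js, the first call returns true
and pushes the deadline out to jn, and the second returns false because
jiffies_stall no longer equals js. That is the property the patch relies
on to keep the self-report and the other-CPU report from overlapping.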