Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp10973021imu; Thu, 6 Dec 2018 09:29:29 -0800 (PST) X-Google-Smtp-Source: AFSGD/WzN802mcY/6KfaNl7X8DogNwyUmm2CQz6PrYmWXD8xaTmbCrsKOWNIMn+gBbAuN2QnaP4g X-Received: by 2002:a63:2744:: with SMTP id n65mr24685209pgn.65.1544117369490; Thu, 06 Dec 2018 09:29:29 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1544117369; cv=none; d=google.com; s=arc-20160816; b=yeOo6jYBfix6VTmAYGVQSXvk0Hm+6vaCM2mSeJSO6jxOjRqTqxyBERlV5iahyleysN WdiyR9HqQd7dIDs0a4lPt2hT/7H3YMp7Wj82ovSc5qHGQoxpNQhCKSleqojl5yV2f3Zz TUgsLXI0SwQjXn6M0MaKN+50lYAfR8zbanawYfPlXvyHnO0oyCGlRNr89TVrM+pP0tYS 8a2a9dbThgw4qAfjk/3v5q/FP39okoTZkzUtrBfPS3nKf1jMn0xpR9CNmFH26h7R9ZWT Mmiouf7EKVwLkwdzOzP37/oSqrB/viwusc3HJsGpOiVkDWlHDNtIA2BbyMuF+3mHAehi VxLQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:message-id:mime-version:date:in-reply-to :subject:cc:to:from:user-agent:references; bh=vyC5cA39OSI76JQSwZuDd3Y6IH4FQLIZbr4xBpl45qw=; b=TVDKAHt56KLQw+TbIGtu8FON6KC+BMPafLFQBnEvi7/AQXwG7PhfFwBgoHL3w/QStQ DWkJP3F6Rrn2/+Ib+67J3betv405rbshKrjBdoE+L3PYQP7YXzYB8L3foDUQaz7if7UT tz+EWgcuAsRWscj+qHxYBCoOIS2izxtuhxk+fID5n++A4MVVa+Hz8tpqrBSBra75FpFN 7B5c+eabd4p/9EF+CT6BNADBOY6WIIisVdSMwZmn1sUv+cQ/Ah07OjFg6z5CDLIvQB6y DuuttzkKkrHw19YR0anOduPTGbDKyja6KkXor+9z5/mPDepuDPw0ZWpowVgYxuuloSEl 77mQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=ibm.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id y20si655333plr.106.2018.12.06.09.29.10; Thu, 06 Dec 2018 09:29:29 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=ibm.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1725930AbeLFR2d (ORCPT + 99 others); Thu, 6 Dec 2018 12:28:33 -0500 Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:41292 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725871AbeLFR2d (ORCPT ); Thu, 6 Dec 2018 12:28:33 -0500 Received: from pps.filterd (m0098409.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.22/8.16.0.22) with SMTP id wB6HOBJK080697 for ; Thu, 6 Dec 2018 12:28:32 -0500 Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by mx0a-001b2d01.pphosted.com with ESMTP id 2p76wg4q6v-1 (version=TLSv1.2 cipher=AES256-GCM-SHA384 bits=256 verify=NOT) for ; Thu, 06 Dec 2018 12:28:31 -0500 Received: from localhost by e32.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Thu, 6 Dec 2018 17:28:30 -0000 Received: from b03cxnp08027.gho.boulder.ibm.com (9.17.130.19) by e32.co.us.ibm.com (192.168.1.132) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted; (version=TLSv1/SSLv3 cipher=AES256-GCM-SHA384 bits=256/256) Thu, 6 Dec 2018 17:28:26 -0000 Received: from b03ledav005.gho.boulder.ibm.com (b03ledav005.gho.boulder.ibm.com [9.17.130.236]) by b03cxnp08027.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id wB6HSPWw28639384 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Thu, 6 Dec 2018 17:28:25 GMT Received: from b03ledav005.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 8F1AEBE054; Thu, 6 Dec 2018 17:28:25 +0000 (GMT) Received: from b03ledav005.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 0804BBE051; Thu, 6 Dec 2018 17:28:22 +0000 (GMT) Received: from morokweng.localdomain (unknown [9.85.137.244]) by b03ledav005.gho.boulder.ibm.com (Postfix) with ESMTPS; Thu, 6 Dec 2018 17:28:22 +0000 (GMT) References: <1544095908-2414-1-git-send-email-ego@linux.vnet.ibm.com> User-agent: mu4e 1.0; emacs 26.1 From: Thiago Jung Bauermann To: "Gautham R. Shenoy" Cc: Michael Ellerman , Nicholas Piggin , Tyrel Datwyler , Benjamin Herrenschmidt , Vaidyanathan Srinivasan , Michael Bringmann , linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop In-reply-to: <1544095908-2414-1-git-send-email-ego@linux.vnet.ibm.com> Date: Thu, 06 Dec 2018 15:28:17 -0200 MIME-Version: 1.0 Content-Type: text/plain X-TM-AS-GCONF: 00 x-cbid: 18120617-0004-0000-0000-000014BE463A X-IBM-SpamModules-Scores: X-IBM-SpamModules-Versions: BY=3.00010182; HX=3.00000242; KW=3.00000007; PH=3.00000004; SC=3.00000270; SDB=6.01127899; UDB=6.00585885; IPR=6.00908030; MB=3.00024502; MTD=3.00000008; XFM=3.00000015; UTC=2018-12-06 17:28:28 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 18120617-0005-0000-0000-000089C2E0BD Message-Id: <87a7li5zv2.fsf@morokweng.localdomain> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:,, definitions=2018-12-06_06:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1812060148 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello Gautham, Gautham R. Shenoy writes: > From: "Gautham R. Shenoy" > > Currently running DLPAR offline/online operations in a loop on a > POWER9 system with SMT=off results in the following crash: > > [ 223.321032] cpu 112 (hwid 112) Ready to die... > [ 223.355963] Querying DEAD? cpu 113 (113) shows 2 > [ 223.356233] cpu 114 (hwid 114) Ready to die... > [ 223.356236] cpu 113 (hwid 113) Ready to die... > [ 223.356241] Bad kernel stack pointer 1faf6ca0 at 1faf6d50 > [ 223.356243] Oops: Bad kernel stack pointer, sig: 6 [#1] > [ 223.356244] LE SMP NR_CPUS=2048 NUMA pSeries > [ 223.356247] Modules linked in: > [ 223.356255] CPU: 114 PID: 0 Comm: swapper/114 Not tainted 4.20.0-rc3 #39 > [ 223.356258] NIP: 000000001faf6d50 LR: 000000001faf6d50 CTR: 000000001ec6d06c > [ 223.356259] REGS: c00000001e5cbd30 TRAP: 0700 Not tainted (4.20.0-rc3) > [ 223.356260] MSR: 8000000000081000 CR: 28000004 XER: 00000020 > [ 223.356263] CFAR: 000000001ec98590 IRQMASK: 8000000000009033 > GPR00: 000000001faf6d50 000000001faf6ca0 000000001ed1c448 00000267e6a273c3 > GPR04: 0000000000000000 00000000000000e0 000000000000dfe8 000000001faf6d30 > GPR08: 000000001faf6d28 00000267e6a273c3 000000001ec1b108 0000000000000000 > GPR12: 0000000001b6d998 c00000001eb55080 c0000003a1b8bf90 000000001eea3f40 > GPR16: 0000000000000000 c0000006fda85100 c00000000004c8b0 c0000000013d5300 > GPR20: c0000006fda85300 0000000000000008 c0000000019d2cf8 c0000000013d6888 > GPR24: 0000000000000072 c0000000013d688c 0000000000000002 c0000000013d688c > GPR28: c0000000019cecf0 0000000000000390 0000000000000000 000000001faf6ca0 > [ 223.356281] NIP [000000001faf6d50] 0x1faf6d50 > [ 223.356281] LR [000000001faf6d50] 0x1faf6d50 > [ 223.356282] Call Trace: > [ 223.356283] Instruction dump: > [ 223.356285] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX > [ 223.356286] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX > [ 223.356290] ---[ end trace f46a4e046b564d1f ]--- > > This is due to multiple offlined CPUs (CPU 113 and CPU 114 above) > concurrently (within 3us) trying to make the rtas-call with the > "stop-self" token, something that is prohibited by the PAPR. > > The concurrent calls can happen as follows. > > o In dlpar_offline_cpu() we prod an offline CPU X (offline due to > SMT=off) and loop for 25 tries in pseries_cpu_die() querying if > the target CPU X has been stopped in RTAS. After 25 tries, we > prints the message "Querying DEAD? cpu X (X) shows 2" and return > to dlpar_offline_cpu(). Note that at this point CPU X has not yet > called rtas with the "stop-self" token, but can do so anytime now. > > o Back in dlpar_offline_cpu(), we prod the next offline CPU Y. CPU Y > promptly wakes up and calls RTAS with the "stop-self" token. > > o Before RTAS can stop CPU Y, CPU X also calls RTAS with the > "stop-self" token. > > The problem occurs due to the short number of tries (25) in > pseries_cpu_die() which covers 90% of the cases. For the remaining 10% > of the cases, it was observed that we would need to loop for a few > hundred iterations before the target CPU calls rtas with "stop-self" > token.Experiments show that each try takes roughly ~25us. > > In this patch we fix the problem by increasing the number of tries > from 25 to 4000 (which roughly corresponds to 100ms) before bailing > out and declaring that we have failed to observe the target CPU call > rtas-stop-self. This provides sufficient serialization to ensure that > there no concurrent rtas calls with "stop-self" token. > > > > Reported-by: Michael Bringmann > Signed-off-by: Gautham R. Shenoy > --- > arch/powerpc/platforms/pseries/hotplug-cpu.c | 13 ++++++++++++- > 1 file changed, 12 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c > index 2f8e621..c913c44 100644 > --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c > +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c > @@ -214,14 +214,25 @@ static void pseries_cpu_die(unsigned int cpu) > msleep(1); > } > } else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) { > + int max_tries = 4000; /* Roughly corresponds to 100ms */ > + u64 tb_before = mftb(); > > - for (tries = 0; tries < 25; tries++) { > + for (tries = 0; tries < max_tries; tries++) { > cpu_status = smp_query_cpu_stopped(pcpu); > if (cpu_status == QCSS_STOPPED || > cpu_status == QCSS_HARDWARE_ERROR) > break; > cpu_relax(); > } > + > + if (tries == max_tries) { > + u64 time_elapsed_us = div_u64(mftb() - tb_before, > + tb_ticks_per_usec); > + > + pr_warn("Offlined CPU %d isn't stopped by RTAS after %llu us\n", > + cpu, time_elapsed_us); > + WARN_ON(1); > + } > } > > if (cpu_status != 0) { I posted a similar patch last year, but I wasn't able to arrive at a root cause analysis like you did: https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-February/153734.html One thing I realized after I posted the patch was that in my case, the CPU was crashing inside RTAS. From the NIP and LR in the trace above it looks like it's crashing in RTAS in your case as well. Michael Ellerman had two comments on my patch: 1. Regardless of the underlying bug, the kernel shouldn't crash so we need a patch making it more resilient to this failure. 2. The wait loop should use udelay() so that the loop will actually take a set amount of wall time, rather than just cycles. Regarding 1. if the problem is that the kernel is causing RTAS to crash because it calls it in a way that's unsupported, then I don't see how we can make the kernel more resilient. We have to make the kernel respect RTAS' restrictions (or alternatively, poke RTAS devs to make RTAS fail gracefuly in these conditions). Regarding 2. I implemented a new version of my patch (posted below) but I was never able to test it because I couldn't access a system where the problem was reproducible anymore. There's also a race between the CPU driving the unplug and the CPU being unplugged which I think is not easy for the CPU being unplugged to win, which makes the busy loop in pseries_cpu_die() a bit fragile. I describe the race in the patch description. My solution to make the race less tight is to make the CPU driving the unplug to only start the busy loop only after the CPU being unplugged is in the CPU_STATE_OFFLINE state. At that point, we know that it either is about to call RTAS or it already has. Do you think this makes sense? If you do, would you mind testing my patch? You can change the timeouts and delays if you want. To be honest they're just guesses on my part... From 4174cd0939fadeda93e49ed2e94ba84131c922d0 Mon Sep 17 00:00:00 2001 From: Thiago Jung Bauermann Date: Thu, 6 Dec 2018 15:16:34 -0200 Subject: [PATCH] powerpc/pseries: Only wait for dying CPU after call to rtas_stop_self() When testing DLPAR CPU add/remove on a system under stress, pseries_cpu_die() doesn't wait long enough for a CPU to die and the kernel ends up crashing: [ 446.143648] cpu 152 (hwid 152) Ready to die... [ 446.464057] cpu 153 (hwid 153) Ready to die... [ 446.473525] cpu 154 (hwid 154) Ready to die... [ 446.474077] cpu 155 (hwid 155) Ready to die... [ 446.483529] cpu 156 (hwid 156) Ready to die... [ 446.493532] cpu 157 (hwid 157) Ready to die... [ 446.494078] cpu 158 (hwid 158) Ready to die... [ 446.503527] cpu 159 (hwid 159) Ready to die... [ 446.664534] cpu 144 (hwid 144) Ready to die... [ 446.964113] cpu 145 (hwid 145) Ready to die... [ 446.973525] cpu 146 (hwid 146) Ready to die... [ 446.974094] cpu 147 (hwid 147) Ready to die... [ 446.983944] cpu 148 (hwid 148) Ready to die... [ 446.984062] cpu 149 (hwid 149) Ready to die... [ 446.993518] cpu 150 (hwid 150) Ready to die... [ 446.993543] Querying DEAD? cpu 150 (150) shows 2 [ 446.994098] cpu 151 (hwid 151) Ready to die... [ 447.133726] cpu 136 (hwid 136) Ready to die... [ 447.403532] cpu 137 (hwid 137) Ready to die... [ 447.403772] cpu 138 (hwid 138) Ready to die... [ 447.403839] cpu 139 (hwid 139) Ready to die... [ 447.403887] cpu 140 (hwid 140) Ready to die... [ 447.403937] cpu 141 (hwid 141) Ready to die... [ 447.403979] cpu 142 (hwid 142) Ready to die... [ 447.404038] cpu 143 (hwid 143) Ready to die... [ 447.513546] cpu 128 (hwid 128) Ready to die... [ 447.693533] cpu 129 (hwid 129) Ready to die... [ 447.693999] cpu 130 (hwid 130) Ready to die... [ 447.703530] cpu 131 (hwid 131) Ready to die... [ 447.704087] Querying DEAD? cpu 132 (132) shows 2 [ 447.704102] cpu 132 (hwid 132) Ready to die... [ 447.713534] cpu 133 (hwid 133) Ready to die... [ 447.714064] Querying DEAD? cpu 134 (134) shows 2 cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40] pc: 000000001ec3072c lr: 000000001ec2fee0 sp: 1faf6bd0 msr: 8000000102801000 dar: 212d6c1a2a20c dsisr: 42000000 current = 0xc000000474c6d600 paca = 0xc000000007b6b600 softe: 0 irq_happened: 0x01 pid = 0, comm = swapper/134 Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11) WARNING: exception is not recoverable, can't continue This was reproduced in v4.12-rc3 as well, but I don't have a crash log handy for that version right now. Sorry. This is a race between one CPU stopping and another one calling pseries_cpu_die() to wait for it to stop. That function does a short busy loop calling RTAS query-cpu-stopped-state on the stopping CPU to verify that it is stopped. As can be seen in the dmesg right before or after the "Querying DEAD?" messages, if pseries_cpu_die() waited a little longer it would have seen the CPU in the stopped state. I see two cases that can be causing this race: 1. It's possible that CPU 134 was inactive at the time it was unplugged. In that case, dlpar_offline_cpu() calls H_PROD on the CPU and immediately calls pseries_cpu_die(). Meanwhile, the prodded CPU activates and start the process of stopping itself. It's possible that the busy loop is not long enough to allow for the CPU to wake up and complete the stopping process. 2. If CPU 134 was online at the time it was unplugged, it would have gone through the new CPU hotplug state machine in kernel/cpu.c that was introduced in v4.6 to get itself stopped. It's possible that the busy loop in pseries_cpu_die() was long enough for the older hotplug code but not for the new hotplug state machine. Either way, the solution is to only start querying if the CPU is stopped after it had a chance to call rtas_stop_self(). Since pseries_mach_cpu_die() sets the CPU current state to offline almost immediately before the RTAS call, we use that as a signal that it is either already stopped or very close to that point, and we can start the busy loop. Signed-off-by: Thiago Jung Bauermann --- arch/powerpc/platforms/pseries/hotplug-cpu.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index 2f8e62163602..fad3ceb74623 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -214,13 +214,21 @@ static void pseries_cpu_die(unsigned int cpu) msleep(1); } } else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) { + /* + * If the current state is not offline yet, it means that the + * dying CPU (which is in pseries_mach_cpu_die) didn't have a + * chance to call rtas_stop_self yet and therefore it's too + * early to query if the CPU is stopped. + */ + spin_event_timeout(get_cpu_current_state(cpu) == CPU_STATE_OFFLINE, + 100000, 100); for (tries = 0; tries < 25; tries++) { cpu_status = smp_query_cpu_stopped(pcpu); if (cpu_status == QCSS_STOPPED || cpu_status == QCSS_HARDWARE_ERROR) break; - cpu_relax(); + udelay(100); } }